Search String Pattern in Large Text Files in C#

I have been trying to search for string patterns in a large text file. I am reading it line by line and checking each line, which is taking a lot of time. I did try HashSet with ReadAllLines:
HashSet<string> strings = new HashSet<string>(File.ReadAllLines(@"D:\Doc\Tst.txt"));
Now when I try to search for a string it doesn't match, because the HashSet only matches an entire row. I just want to check whether the string appears anywhere in the row.
I also tried this:
using (System.IO.StreamReader file = new System.IO.StreamReader(@"D:\Doc\Tst.txt"))
{
    while ((CurrentLine = file.ReadLine()) != null)
    {
        vals = chk_log(CurrentLine, date_Format, (range.Cells[i][counter]).Value2, vals);
        if (vals == true)
            break;
    }
}
bool chk_log(string LineText, string date_to_chk, string publisher, bool tvals)
{
    // true only when the line contains both the date and the publisher
    tvals = LineText.Contains(date_to_chk) && LineText.Contains(publisher);
    return tvals;
}
But this is consuming too much time. Any help on this would be good.

Reading into a HashSet doesn't make sense to me (unless there are a lot of duplicated lines) since you aren't testing for membership of the set.
Taking a really naive approach, you could just do this:
var isItThere = File.ReadAllLines(@"d:\docs\st.txt")
    .Any(x => x.Contains(date_to_chk) && x.Contains(publisher));
65K lines at (say) 1K a line isn't a lot of memory to worry about, and I personally wouldn't bother with Parallel since it sounds like it would be superfast to do anyway.
You could replace Any with First to find the first matching line, or Where to get an IEnumerable<string> containing all matching lines.
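For illustration, a quick sketch of those variants (same assumed path and variables as above):
// First: returns the first matching line (throws if there is none; FirstOrDefault returns null instead)
var firstHit = File.ReadAllLines(@"d:\docs\st.txt")
    .First(x => x.Contains(date_to_chk) && x.Contains(publisher));

// Where: returns all matching lines as a deferred IEnumerable<string>
var allHits = File.ReadAllLines(@"d:\docs\st.txt")
    .Where(x => x.Contains(date_to_chk) && x.Contains(publisher));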

You can use a compiled regular expression instead of String.Contains (compile once before looping over the lines). This typically gives better performance.
var regex = new Regex($"{date}|{publisher}", RegexOptions.Compiled);
foreach (string line in File.ReadLines(@"D:\Doc\Tst.txt"))
{
    if (regex.IsMatch(line)) break;
}
This also shows a convenient standard library function for reading a file line by line.
Or, depending on what you want to do...
var isItThere = File.ReadLines(@"D:\Doc\Tst.txt").Any(regex.IsMatch);
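One caveat: {date}|{publisher} matches lines containing either term, while the question asks for lines containing both. If both are required, a possible variant (my sketch, using lookaheads, with Regex.Escape in case the values contain regex metacharacters):
var both = new Regex($"(?=.*{Regex.Escape(date)})(?=.*{Regex.Escape(publisher)})",
                     RegexOptions.Compiled);
var isItThere = File.ReadLines(@"D:\Doc\Tst.txt").Any(both.IsMatch);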

how to find text in a string in c#

I am learning .NET C# on my own.
How do I find whether a given text exists in a string, and if it exists, how do I count how many times the word is repeated in that string? Also, if the word is misspelled, how can I find it and print that the word is misspelled?
We can do this with collections or LINQ in C#, but here I used the string class and its Contains method, and I am stuck after that.
If we can do this with LINQ, how? Because LINQ works with collections, right? You need a list in order to play with LINQ, but here we are playing with a string (a paragraph). How can LINQ be used to find a word in a paragraph?
Kindly help. Here is what I have tried so far.
string str = "Education is a ray of light in the darkness. It certainly is a hope for a good life. Eudcation is a basic right of every Human on this Planet. To deny this right is evil. Uneducated youth is the worst thing for Humanity. Above all, the governments of all countries must ensure to spread Education";
if (str.Contains("Education"))
{
    Console.WriteLine("found");
}
else
{
    Console.WriteLine("not found");
}
You can make a string a string[] by splitting it by a character/string. Then you can use LINQ:
if (str.Split().Contains("makes"))
{
    // note that the default Split without arguments also splits on tabs and new-lines
}
If you don't care whether it is a word or just a sub-string, you can use str.Contains("makes") directly.
If you want to compare in a case-insensitive way, use the overload of Contains:
if (str.Split().Contains("makes", StringComparer.InvariantCultureIgnoreCase)) { }
string str = "money makes many makes things";
var strArray = str.Split(' ');
var count = strArray.Count(x => x == "makes");
The simplest way is to use the Split method to split the string into an array of words. Here is an example:
var words = str.Split(' ');
if (words.Length > 0)
{
    foreach (var word in words)
    {
        if (word.IndexOf("makes", StringComparison.InvariantCultureIgnoreCase) != -1)
        {
            Console.WriteLine("found");
        }
        else
        {
            Console.WriteLine("not found");
        }
    }
}
Now, since you just want the count of word occurrences, you can use LINQ to do that in a single line like this:
var totalOccurrences = str.Split(' ').Count(x => x.IndexOf("makes", StringComparison.InvariantCultureIgnoreCase) != -1);
Note that StringComparison.InvariantCultureIgnoreCase is required if you want a case-insensitive comparison.
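None of the above addresses the misspelling part of the question. One common approach (a sketch of my own, not from the answers) is an edit-distance check: a word within a small Levenshtein distance of the target is flagged as a probable misspelling, which would catch the "Eudcation" in the sample string:
static int Levenshtein(string a, string b)
{
    // classic dynamic-programming edit distance
    var d = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
    for (int j = 0; j <= b.Length; j++) d[0, j] = j;
    for (int i = 1; i <= a.Length; i++)
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1,   // deletion
                                        d[i, j - 1] + 1),  // insertion
                               d[i - 1, j - 1] + cost);    // substitution
        }
    return d[a.Length, b.Length];
}

// Words at distance 1 or 2 from "Education" are likely misspellings.
var suspects = str.Split(' ')
    .Where(w => { int dist = Levenshtein(w, "Education"); return dist > 0 && dist <= 2; });
foreach (var w in suspects)
    Console.WriteLine($"'{w}' looks like a misspelling of 'Education'");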

How to search string in large text file?

I want to get the line containing a certain unique word, such as a profile ID, without looping over every line separately, because if the word I am looking for is in the last line of the text file it will take a long time to reach it; and if the search is for more than one word, extracting the lines that contain them will take even longer.
Example line from the text file:
name,id,image,age,place,link
string word = "13215646";
string output = string.Empty;
using (var fileStream = File.OpenRead(FileName))
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8))
{
    string line;
    while ((line = streamReader.ReadLine()) != null)
    {
        string[] strList = line.Split(',');
        if (word == strList[1]) // check if word == id
        {
            output = line;
            break;
        }
    }
}
You can use this to search the file:
var output = File.ReadLines(FileName)
    .Where(line => line.Split(',')[1] == word)
    .FirstOrDefault();
But it won't solve this:
if the word I am looking for is in the last line of the text file, this will take a lot of time to get it, and if the search process is for more than one word and extract the line that contains it, I think it will take a lot of time.
There's not a practical way to avoid this for a basic file.
The only ways around actually reading through the file are either maintaining an index, which requires absolute control over everything that might write into the file, or guaranteeing the file is already sorted by the columns that matter, in which case you can do something like a binary search.
But neither is likely for a random csv file. This is one of the reasons people use databases.
However, we also need to stop and check whether this is really a problem for you. I'd expect the code above to handle files up to a couple hundred MB in around 1 to 2 seconds on modern hardware, even if you need to look through the whole file.
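To make the index idea above concrete, here is a minimal sketch of my own (the class name and details are assumptions, and it assumes UTF-8 without a BOM and '\n' line endings): scan the file once, remember the byte offset of each id, and then answer later lookups with a single seek instead of a full scan.
using System.Collections.Generic;
using System.IO;
using System.Text;

class CsvIdIndex
{
    private readonly Dictionary<string, long> _offsets = new Dictionary<string, long>();
    private readonly string _path;

    public CsvIdIndex(string path)
    {
        _path = path;
        long offset = 0;
        foreach (var line in File.ReadLines(path))
        {
            var parts = line.Split(',');
            if (parts.Length > 1 && !_offsets.ContainsKey(parts[1]))
                _offsets[parts[1]] = offset;
            // assumes '\n' endings and no BOM; use +2 per line for "\r\n" files
            offset += Encoding.UTF8.GetByteCount(line) + 1;
        }
    }

    public string Find(string id)
    {
        long offset;
        if (!_offsets.TryGetValue(id, out offset)) return null;
        using (var fs = File.OpenRead(_path))
        {
            fs.Seek(offset, SeekOrigin.Begin);
            using (var reader = new StreamReader(fs, Encoding.UTF8))
                return reader.ReadLine();
        }
    }
}
You pay the full scan once in the constructor; every later Find("13215646") is one seek plus one ReadLine.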
You can optimise the code. Here are a few ideas:
var ids = new[] { "13215646", "113" };
foreach (var line in File.ReadLines(FileName))
{
    var id = line.Split(',', count: 3)[1]; // Optimization 1: split at most 3 times
    if (ids.Contains(id)) // Optimization 2: search for multiple ids in one pass
    {
        // Do what you need with the line
    }
}
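If the list of ids grows large, a HashSet would make the membership test O(1) instead of a linear scan (a small tweak of my own, not from the answer):
var ids = new HashSet<string> { "13215646", "113" };
// ids.Contains(id) is now a hash lookup rather than an array scan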

string.contains and string.replace in one single line of code

I'm currently writing some software where I have to load a lot of column names from an external file. Usually I would do this with some JSON, but for reasons of user-friendliness I can't do that right now. I need to use a text file which is readable to the users and includes a lot of comments.
So I have created my own file to hold all of these values.
When I import these values into my software, I essentially run through my config file line by line and check for every line whether it matches a parameter, which I then parse. But this way I end up with a big block of very repetitive code, and I was wondering whether I could simplify it so that every check is done in just one line.
Here is the code I'm currently using:
if (line.Contains("[myValue]"))
{
    myParameter = line.Replace("[myValue]", string.Empty).Trim();
}
I know that using LINQ you can simplify things and put them in one single line; I'm just not sure whether it would work in this case.
Thanks for your help!
Kenneth
Why not just create a method if this piece of code is often repeated:
void SetParameter(string line, string name, ref string parameter)
{
    if (line.Contains(name))
    {
        parameter = line.Replace(name, string.Empty).Trim();
    }
}

SetParameter(line, "[myValue]", ref myParameter);
If you want to avoid calling both Replace and Contains, which is probably a good idea, you could also just call Replace:
void SetParameter(string line, string name, ref string parameter)
{
    var replaced = line.Replace(name, string.Empty);
    if (line != replaced)
    {
        parameter = replaced.Trim();
    }
}
Try this way (a ternary expression):
myParameter = line.Contains("[myValue]") ? line.Replace("[myValue]", string.Empty).Trim() : myParameter;
Actually, line.IndexOf should be faster.
From your code, it looks like you are replacing with just empty text, so why not take the entire string (consisting of many lines) and replace it in one shot, instead of checking one line at a time?
You could use Regex. This might relieve you of some repetitive code:
string line = "[myvalue1] some string [someotherstring] [myvalue2]";
// All your keys stored in a single place
string[] keylist = new string[] { @"\[myvalue1]", @"\[myvalue2]" };
var newString = Regex.Replace(line, string.Join("|", keylist), string.Empty);
Hope it helps.
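A possible extension of the regex idea (my own sketch; the pattern assumes each key sits in square brackets at the start of the line): capture the key and the remaining value in one pass, so matching and extraction happen together rather than via separate Contains and Replace calls.
var match = Regex.Match(line, @"^\[(?<key>[^\]]+)\]\s*(?<value>.*)$");
if (match.Success)
{
    string key = match.Groups["key"].Value;            // e.g. "myValue"
    string value = match.Groups["value"].Value.Trim(); // the parameter text
}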

Best way to search for many guids at once in a string?

I have before me the problem of searching for 90,000 GUIDs in a string. I need to get all instances of each GUID. Technically, I need to replace them too, but that's another story.
Currently, I'm searching for each one individually using a regex. But it occurred to me that I could get better performance by searching for them all together. I've read about tries in the past and have never used one, but it occurred to me that I could construct a trie of all 90,000 GUIDs and use that to search.
Alternatively, perhaps there is an existing library in .NET that can do this. It crossed my mind that there is no reason why I shouldn't be able to get good performance with just a giant regex, but this appears not to work.
Beyond that, perhaps there is some clever trick I could use relating to GUID structure to get better results.
This isn't really a crucial problem for me, but I thought I might be able to learn something.
Have a look at the Rabin-Karp string search algorithm. It's well suited for multi-pattern searching in a string:
http://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_string_search_algorithm#Rabin.E2.80.93Karp_and_multiple_pattern_search
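To sketch why it fits here (my own illustration, not a tuned implementation): every GUID string in "D" format is exactly 36 characters, so a single rolling hash over the text can be compared against a set of pattern hashes in one pass. Candidate hits should still be verified against the real GUID set, since hashes can collide.
using System;
using System.Collections.Generic;
using System.Linq;

static class RabinKarpSketch
{
    const int GuidLen = 36;  // canonical GUID string length ("D" format)
    const ulong B = 257;     // hash base; ulong overflow wraps, which is fine for hashing

    static ulong Hash(string s, int start, int len)
    {
        ulong h = 0;
        for (int i = start; i < start + len; i++) h = h * B + s[i];
        return h;
    }

    public static List<int> FindAll(string text, IEnumerable<string> guids)
    {
        var patternHashes = new HashSet<ulong>(guids.Select(g => Hash(g, 0, GuidLen)));
        var hits = new List<int>();
        if (text.Length < GuidLen) return hits;

        ulong pow = 1;                                   // B^(GuidLen-1), used to
        for (int i = 0; i < GuidLen - 1; i++) pow *= B;  // drop the outgoing char

        ulong h = Hash(text, 0, GuidLen);
        for (int i = 0; ; i++)
        {
            if (patternHashes.Contains(h)) hits.Add(i); // candidate: verify before replacing
            if (i + GuidLen >= text.Length) break;
            h = (h - text[i] * pow) * B + text[i + GuidLen]; // roll the window one char right
        }
        return hits;
    }
}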
You will not get good performance with regex because its performance is inherently poor. Additionally, if all the GUIDs share the same format you should only need one regex, and regex.Replace(input, replacement) would do it.
If you already have the list of GUIDs in memory, performance would be better by looping over that list and calling String.Replace, like so:
foreach (string guid in guids)
    inputString = inputString.Replace(guid, replacement); // Replace returns a new string, so reassign
I developed a method for replacing a large number of strings a while back, that may be useful:
A better way to replace many strings - obfuscation in C#
Another option would be to use a regular expression to find all GUIDs in the string, then loop through them and check if each is part of your set of GUIDs.
Basic example, using a Dictionary for fast lookup of the GUIDs:
Dictionary<string, string> guids = new Dictionary<string, string>();
guids.Add("3f74a071-54fc-10de-0476-a6b991f0be76", "(replacement)");

string text = "asdf 3f74a071-54fc-10de-0476-a6b991f0be76 lkaq2hlqwer";
text = Regex.Replace(text, @"[\da-f]{8}-[\da-f]{4}-[\da-f]{4}-[\da-f]{4}-[\da-f]{12}", m => {
    string replacement;
    if (guids.TryGetValue(m.Value, out replacement)) {
        return replacement;
    } else {
        return m.Value;
    }
});
Console.WriteLine(text);
Output:
asdf (replacement) lkaq2hlqwer
OK, this looks good. So to be clear here is the original code, which took 65s to run on the example string:
var unusedGuids = new HashSet<Guid>();
foreach (var guid in oldToNewGuid) {
    var regex = guid.Key.ToString();
    if (!Regex.IsMatch(xml, regex))
        unusedGuids.Add(guid.Key);
    else
        xml = Regex.Replace(xml, regex, guid.Value.ToString());
}
The new code is as follows and takes 6.7s:
var unusedGuids = new HashSet<Guid>(oldToNewGuid.Keys);
// bucket the GUIDs by the hash of their string form, so each window of text
// only has to be compared against the GUIDs sharing its hash
var guidHashes = new MultiValueDictionary<int, Guid>();
foreach (var guid in oldToNewGuid.Keys) {
    guidHashes.Add(guid.ToString().GetHashCode(), guid);
}

var indices = new List<Tuple<int, Guid>>();
const int guidLength = 36;
// slide a 36-char window over the text; <= so a GUID ending at the very end is not missed
for (int i = 0; i <= xml.Length - guidLength; i++) {
    var substring = xml.Substring(i, guidLength);
    foreach (var value in guidHashes.GetValues(substring.GetHashCode())) {
        if (value.ToString() == substring) {
            unusedGuids.Remove(value);
            indices.Add(new Tuple<int, Guid>(i, value));
            break;
        }
    }
}

// rebuild the string once, splicing in the replacements
var builder = new StringBuilder();
int start = 0;
for (int i = 0; i < indices.Count; i++) {
    var tuple = indices[i];
    var substring = xml.Substring(start, tuple.Item1 - start);
    builder.Append(substring);
    builder.Append(oldToNewGuid[tuple.Item2].ToString());
    start = tuple.Item1 + guidLength;
}
builder.Append(xml.Substring(start, xml.Length - start));
xml = builder.ToString();

Reading a file line by line in C#

I am trying to read some text files, where each line needs to be processed. At the moment I am just using a StreamReader, and then reading each line individually.
I am wondering whether there is a more efficient way (in terms of LoC and readability) to do this using LINQ without compromising operational efficiency. The examples I have seen involve loading the whole file into memory, and then processing it. In this case however I don't believe that would be very efficient. In the first example the files can get up to about 50k, and in the second example, not all lines of the file need to be read (sizes are typically < 10k).
You could argue that nowadays it doesn't really matter for these small files; however, I believe that sort of approach leads to inefficient code.
First example:
// Open file
using (var file = System.IO.File.OpenText(_LstFilename))
{
    // Read file
    while (!file.EndOfStream)
    {
        String line = file.ReadLine();

        // Ignore empty lines
        if (line.Length > 0)
        {
            // Create addon
            T addon = new T();
            addon.Load(line, _BaseDir);

            // Add to collection
            collection.Add(addon);
        }
    }
}
Second example:
// Open file
using (var file = System.IO.File.OpenText(datFile))
{
    // Compile regexs
    Regex nameRegex = new Regex("IDENTIFY (.*)");

    while (!file.EndOfStream)
    {
        String line = file.ReadLine();

        // Check name
        Match m = nameRegex.Match(line);
        if (m.Success)
        {
            _Name = m.Groups[1].Value;

            // Remove me when other values are read
            break;
        }
    }
}
You can write a LINQ-based line reader pretty easily using an iterator block:
static IEnumerable<SomeType> ReadFrom(string file) {
    string line;
    using(var reader = File.OpenText(file)) {
        while((line = reader.ReadLine()) != null) {
            SomeType newRecord = /* parse line */
            yield return newRecord;
        }
    }
}
or to make Jon happy:
static IEnumerable<string> ReadFrom(string file) {
    string line;
    using(var reader = File.OpenText(file)) {
        while((line = reader.ReadLine()) != null) {
            yield return line;
        }
    }
}
...
var typedSequence = from line in ReadFrom(path)
                    let record = ParseLine(line)
                    where record.Active // for example
                    select record.Key;
then you have ReadFrom(...) as a lazily evaluated sequence without buffering, perfect for Where etc.
Note that if you use OrderBy or the standard GroupBy, it will have to buffer the data in memory; if you need grouping and aggregation, "PushLINQ" has some fancy code to allow you to perform aggregations on the data but discard it (no buffering). Jon's explanation is here.
It's simpler to read a line and check whether or not it's null than to check for EndOfStream all the time.
However, I also have a LineReader class in MiscUtil which makes all of this a lot simpler - basically it exposes a file (or a Func<TextReader>) as an IEnumerable<string>, which lets you do LINQ stuff over it. So you can do things like:
var query = from file in Directory.GetFiles("*.log")
            from line in new LineReader(file)
            where line.Length > 0
            select new AddOn(line); // or whatever
The heart of LineReader is this implementation of IEnumerable<string>.GetEnumerator:
public IEnumerator<string> GetEnumerator()
{
    using (TextReader reader = dataSource())
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}
Almost all the rest of the source is just giving flexible ways of setting up dataSource (which is a Func<TextReader>).
Since .NET 4.0, the File.ReadLines() method is available.
int count = File.ReadLines(filepath).Count(line => line.StartsWith(">"));
NOTE: You need to watch out for the IEnumerable<T> solution, as it will result in the file being open for the duration of processing.
For example, with Marc Gravell's response:
foreach(var record in ReadFrom("myfile.csv")) {
    DoLongProcessOn(record);
}
the file will remain open for the whole of the processing.
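If that matters, one simple mitigation (my sketch, not from the answer) is to materialize the lines first so the file handle is released before the slow processing starts, at the cost of holding the whole file in memory:
// Buffer everything up front; the file is closed once ToList() returns.
List<string> lines = File.ReadLines("myfile.csv").ToList();
foreach (var line in lines)
{
    DoLongProcessOn(line);
}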
Thanks all for your answers! I decided to go with a mixture, mainly focusing on Marc's though, as I will only need to read lines from a file. I guess you could argue separation is needed everywhere, but heh, life is too short!
Regarding the keeping the file open, that isn't going to be an issue in this case, as the code is part of a desktop application.
Lastly I noticed you all used lowercase string. I know in Java there is a difference between capitalised and non capitalised string, but I thought in C# lowercase string was just a reference to capitalised String?
public void Load(AddonCollection<T> collection)
{
    // read from file
    var query = from line in LineReader(_LstFilename)
                where line.Length > 0
                select CreateAddon(line);

    // add results to collection
    collection.AddRange(query);
}

protected T CreateAddon(String line)
{
    // create addon
    T addon = new T();
    addon.Load(line, _BaseDir);
    return addon;
}

protected static IEnumerable<String> LineReader(String fileName)
{
    String line;
    using (var file = System.IO.File.OpenText(fileName))
    {
        // read each line, ensuring not null (EOF)
        while ((line = file.ReadLine()) != null)
        {
            // return trimmed line
            yield return line.Trim();
        }
    }
}
