How to make XML to CSV parsing/conversion faster? - C#

I'm currently using the snippet below to convert XML data (not well-formed) to CSV format after doing some processing in between. It only converts those elements in the XML data that contain an integer from the list testList (List<int> testList), and it only converts and writes to the file once that match has been made. I need to use this algorithm for files that are several gigabytes in size. Currently it processes a 1 GB file in ~7.5 minutes. Can someone suggest any changes that I could make to improve performance? I've fixed everything I could, but it won't get any faster. Any help will be appreciated!
Note: Message.TryParse is an external parsing method that I have to use and can't exclude or change.
Note: StreamElements is just a customized XmlReader that improves performance.
foreach (var element in StreamElements(p, "XML"))
{
    string joined = string.Concat(element.ToString().Split().Take(3)) +
                    string.Join(" ", element.ToString().Split().Skip(3));
    List<string> listX = new List<string>();
    listX.Add(joined.ToString());
    Message msg = null;
    if (Message.TryParse(joined.ToString(), out msg))
    {
        var values = element.DescendantNodes().OfType<XText>()
            .Select(v => Regex.Replace(v.Value, "\\s+", " "));
        foreach (var val in values)
        {
            for (int i = 0; i < testList.Count; i++)
            {
                if (val.ToString().Contains("," + testList[i].ToString() + ","))
                {
                    var line = string.Join(",", values);
                    sss.WriteLine(line);
                }
            }
        }
    }
}

I'm seeing some things you could probably improve:
You're calling .ToString() on joined a couple of times, when joined is already a string.
You may be able to speed up your regex replace by compiling your regex first, outside of the loop.
You're iterating over values multiple times, and each time it has to re-evaluate the LINQ that makes up the definition for values. Try using .ToList() before saving the result of that LINQ statement into values.
But before focusing on stuff like this, you really need to identify what's taking the time in your code. My guess is that it's almost all spent in these two places:
Reading from the XML stream
Writing to sss
If I'm right, then anything else you focus on is going to be premature optimization. Spend some time testing what happens if you comment out various parts of your for loop, to see where all the time is being spent.
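Putting the first three suggestions together might look something like the sketch below. This is only an illustration: StreamElements, Message.TryParse, p, testList, and sss are taken from the question and assumed to exist as shown there. Note it also writes each element at most once, which the original nested loops did not guarantee.

```csharp
// Compile the whitespace regex once, outside the loop.
var whitespace = new Regex(@"\s+", RegexOptions.Compiled);

foreach (var element in StreamElements(p, "XML"))
{
    // Split the element text once and reuse the result.
    var parts = element.ToString().Split();
    string joined = string.Concat(parts.Take(3)) + string.Join(" ", parts.Skip(3));

    Message msg;
    if (Message.TryParse(joined, out msg)) // joined is already a string
    {
        // Materialize the query once instead of re-evaluating it per iteration.
        var values = element.DescendantNodes().OfType<XText>()
            .Select(v => whitespace.Replace(v.Value, " "))
            .ToList();

        // Write the element's line at most once, then move on.
        if (values.Any(val => testList.Any(id => val.Contains("," + id + ","))))
        {
            sss.WriteLine(string.Join(",", values));
        }
    }
}
```

Whether this helps noticeably still depends on where the time actually goes, so profile first as described above.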

Related

How to search string in large text file?

I want to get the line containing a certain word that cannot be repeated, like a profile ID, without looping over each line separately. If the word I am looking for is in the last line of the text file, it takes a lot of time to reach it, and if the search is for more than one word, extracting the lines that contain them takes even longer.
Example line from the text file:
name,id,image,age,place,link
string word = "13215646";
string output = string.Empty;
using (var fileStream = File.OpenRead(FileName))
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8))
{
    String line;
    while ((line = streamReader.ReadLine()) != null)
    {
        string[] strList = line.Split(',');
        if (word == strList[1]) // check if word = id
        {
            output = line;
            break;
        }
    }
}
You can use this to search the file:
var output = File.ReadLines(FileName)
    .Where(line => line.Split(',')[1] == word)
    .FirstOrDefault();
But it won't solve this:
if the word I am looking for is in the last line of the text file, this will take a lot of time to get it, and if the search process is for more than one word and extract the line that contains it, I think it will take a lot of time.
There's not a practical way to avoid this for a basic file.
The only ways around actually reading through the file is either maintaining an index, which requires absolute control over everything that might write into the file, or if you can guarantee the file is already sorted by the columns that matter, in which case you can do something like a binary search.
But neither is likely for a random csv file. This is one of the reasons people use databases.
However, we also need to stop and check whether this is really a problem for you. I'd expect the code above to handle files up to a couple hundred MB in around 1 to 2 seconds on modern hardware, even if you need to look through the whole file.
You can optimise the code. Here are a few ideas:
var ids = new[] { "13215646", "113" };
foreach (var line in File.ReadLines(FileName))
{
    var id = line.Split(',', count: 3)[1]; // Optimization 1: stop splitting after 3 fields via `count: 3`
    if (ids.Contains(id)) // Optimization 2: search for multiple ids in one pass
    {
        // Do what you need with the line
    }
}
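If the id list grows beyond a handful of entries, the array Contains above becomes a linear scan per line; a HashSet<string> makes each lookup effectively constant-time. A sketch, reusing the question's FileName (note the Split(char, int) overload requires .NET Core 2.0 or later):

```csharp
var ids = new HashSet<string> { "13215646", "113" };
foreach (var line in File.ReadLines(FileName))
{
    var id = line.Split(',', count: 3)[1]; // still stop splitting after 3 fields
    if (ids.Contains(id)) // O(1) lookup instead of scanning the id array
    {
        // Do what you need with the line
    }
}
```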

How to replace various string occurrences in multiple strings the fastest way

I have a common problem for which I haven't found a proper solution. I have multiple XML strings with a specific tag (e.g. MIME_SOURCE), and I don't know which XML string contains which value. But I have to replace all occurrences.
On the other hand I have a dictionary containing all possible values of the XML as a key and the value to replace with as value. As I said, I don't know what to replace in which XML.
E.g.
Part of first XML
<MIME>
<MIME_SOURCE>\Web\Bilder Groß\1509_131_021_01.jpg</MIME_SOURCE>
</MIME>
<MIME>
<MIME_SOURCE>\Web\Bilder Groß\1509_131_021_01_MitWasserzeichen.jpg</MIME_SOURCE>
</MIME>
<MIME>
<MIME_SOURCE>\Web\Bilder Groß\icon_top.jpg</MIME_SOURCE>
</MIME>
Part of second XML:
<MIME>
<MIME_SOURCE>\Web\Bilder klein\5478.jpg</MIME_SOURCE>
</MIME>
Dictionary looks like:
{"\Web\Bilder Groß\1509_131_021_01.jpg", "/Web/Bilder Groß/1509_131_021_01.jpg"}
{"\Web\Bilder Groß\1509_131_021_01_MitWasserzeichen.jpg", "/Web/Bilder Groß/1509_131_021_01_MitWasserzeichen.jpg"}
{"\Web\Bilder Groß\icon_top.jpg", "icon_top.jpg"}
{"\Web\Bilder klein\5478.jpg", "5478.jpg"}
My main problem is, if I iterate through the dictionary for each XML string the effort will be count of XML strings multiplied with count of entries in the dictionary (n*m). This is really bad in my case as there can be around a million XML strings and at least thousands of entries in the dictionary.
Currently I'm using string.Replace for each key of the dictionary for each XML.
Do you have a good idea how to speed up this process?
Edit:
I've changed code to the following one:
var regex = new Regex(@"<MIME_SOURCE>[\s\S]*?<\/MIME_SOURCE>");
foreach (Match match in regex.Matches(stringForXml))
{
    // DoReplacements...
}
This fits the requirements for now, as the replacement will only be done for each MIME_SOURCE in the XML. But I will have a look at the mentioned algorithm as well.
The most correct way is to properly parse your XML. Then you can go through it in a single pass:
var xml = @"<root>
    <MIME>
        <MIME_SOURCE>\Web\Bilder Groß\1509_131_021_01.jpg</MIME_SOURCE>
    </MIME>
    <MIME>
        <MIME_SOURCE>\Web\Bilder Groß\1509_131_021_01_MitWasserzeichen.jpg</MIME_SOURCE>
    </MIME>
    <MIME>
        <MIME_SOURCE>\Web\Bilder Groß\icon_top.jpg</MIME_SOURCE>
    </MIME>
</root>";
var replacements = new Dictionary<string, string>()
{
    { @"\Web\Bilder Groß\1509_131_021_01.jpg", "/Web/Bilder Groß/1509_131_021_01.jpg" },
    { @"\Web\Bilder Groß\1509_131_021_01_MitWasserzeichen.jpg", "/Web/Bilder Groß/1509_131_021_01_MitWasserzeichen.jpg" },
    { @"\Web\Bilder Groß\icon_top.jpg", "icon_top.jpg" },
    { @"\Web\Bilder klein\5478.jpg", "5478.jpg" }
};
var doc = XDocument.Parse(xml);
foreach (var source in doc.Root.Descendants("MIME_SOURCE"))
{
    if (replacements.TryGetValue(source.Value, out var replacement))
    {
        source.Value = replacement;
    }
}
var result = doc.ToString();
If you can make some assumptions about how your XML is structured (e.g. no whitespace between the <MIME_SOURCE> tags, no attributes, etc.), then you can use some regex, allowing you to again make a single pass:
var result = Regex.Replace(xml, @"<MIME_SOURCE>([^<]+)</MIME_SOURCE>", match =>
{
    if (replacements.TryGetValue(match.Groups[1].Value, out var replacement))
    {
        return $"<MIME_SOURCE>{replacement}</MIME_SOURCE>";
    }
    return match.Value;
});
You'll have to benchmark different approaches yourself on your own data. Use BenchmarkDotNet.
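A minimal BenchmarkDotNet harness for that comparison could be shaped like this. It is only a skeleton: ReplaceWithXDocument and ReplaceWithRegex are hypothetical helper names assumed to wrap the two snippets above, and the test inputs need to be filled in with real data.

```csharp
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class ReplaceBenchmarks
{
    // Hypothetical test inputs; load representative XML and replacements here.
    private readonly string xml = "<root>...</root>";
    private readonly Dictionary<string, string> replacements = new Dictionary<string, string>();

    [Benchmark(Baseline = true)]
    public string ViaXDocument() => Replacer.ReplaceWithXDocument(xml, replacements);

    [Benchmark]
    public string ViaRegex() => Replacer.ReplaceWithRegex(xml, replacements);
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<ReplaceBenchmarks>();
}
```

BenchmarkDotNet then reports mean times and allocations for each method, which is far more reliable than one-off Stopwatch timings.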
As I already mentioned in a comment above, I used to have a similar problem (see: c# Fastest string search in all files).
Using the Aho–Corasick algorithm that has been suggested to me in the accepted answer I was able to conduct a string search in fast enough time for my problem (going from a minutes execution time to merely seconds).
An implementation of said algorithm can be found here.
Here is a little sample of how to use the implementation linked above (looking for some needles in a haystack):
static bool anyViaAhoCorasick(string[] needles, string haystack)
{
    var trie = new Trie();
    trie.Add(needles);
    trie.Build();
    return trie.Find(haystack).Any();
}

Search String Pattern in Large Text Files C#

I have been trying to search for string patterns in a large text file. I am reading line by line and checking each line, which takes a lot of time. I did try HashSet and ReadAllLines:
HashSet<string> strings = new HashSet<string>(File.ReadAllLines(@"D:\Doc\Tst.txt"));
Now when I am trying to search for the string, it doesn't match, because it looks for a match against the entire row. I just want to check if the string appears anywhere in the row.
I had tried using this:
using (System.IO.StreamReader file = new System.IO.StreamReader(@"D:\Doc\Tst.txt"))
{
    while ((CurrentLine = file.ReadLine()) != null)
    {
        vals = chk_log(CurrentLine, date_Format, (range.Cells[i][counter]).Value2, vals);
        if (vals == true)
            break;
    }
}

bool chk_log(string LineText, string date_to_chk, string publisher, bool tvals)
{
    if (LineText.Contains(date_to_chk))
    {
        if (LineText.Contains(publisher))
            tvals = true;
        else
            tvals = false;
    }
    else
        tvals = false;
    return tvals;
}
But this is consuming too much time. Any help on this would be good.
Reading into a HashSet doesn't make sense to me (unless there are a lot of duplicated lines) since you aren't testing for membership of the set.
Taking a really naive approach you could just do this.
var isItThere = File.ReadAllLines(@"d:\docs\st.txt")
    .Any(x => x.Contains(date_to_chk) && x.Contains(publisher));
65K lines at (say) 1K a line isn't a lot of memory to worry about, and I personally wouldn't bother with Parallel since it sounds like it would be superfast to do anyway.
You could replace Any with First to find the first result, or Where to get an IEnumerable<string> containing all results.
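For example (a sketch reusing the question's date_to_chk and publisher variables and the same hypothetical file path):

```csharp
// First matching line, or null if there is none:
var firstMatch = File.ReadAllLines(@"d:\docs\st.txt")
    .FirstOrDefault(x => x.Contains(date_to_chk) && x.Contains(publisher));

// All matching lines:
var allMatches = File.ReadAllLines(@"d:\docs\st.txt")
    .Where(x => x.Contains(date_to_chk) && x.Contains(publisher));
```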
You can use a compiled regular expression instead of String.Contains (compile once before looping over the lines). This typically gives better performance.
var regex = new Regex($"{date}|{publisher}", RegexOptions.Compiled);
foreach (string line in File.ReadLines(@"D:\Doc\Tst.txt"))
{
    if (regex.IsMatch(line)) break;
}
This also shows a convenient standard library function for reading a file line by line.
Or, depending on what you want to do...
var isItThere = File.ReadLines(@"D:\Doc\Tst.txt").Any(regex.IsMatch);

C# Dictionary allowing seemingly identical keys

I have created a dictionary, and created code to read a txt file, and input each word from the file into the dictionary.
//Set up OpenFileDialog box, and prompt user to select file to open
DialogResult openFileResult;
OpenFileDialog file = new OpenFileDialog();
file.Filter = "txt files (*.txt)|*.txt";
openFileResult = file.ShowDialog();
if (openFileResult == DialogResult.OK)
{
    //If user selected file successfully opened
    //Reset form
    this.Controls.Clear();
    this.InitializeComponent();
    //Read from file, split into array of words
    Stream fs = file.OpenFile();
    StreamReader reader = new StreamReader(fs);
    string line = reader.ReadToEnd();
    string[] words = line.Split(' ', '\n');
    //Add each word and frequency to dictionary
    foreach (string s in words)
    {
        AddToDictionary(s);
    }
    //Reset variables, and set up chart
    ResetVariables();
    ChartInitialize();
    foreach (string s in wordDictionary.Keys)
    {
        //Calculate statistics from dictionary
        ComputeStatistics(s);
        if (dpCount < 50)
        {
            AddToGraph(s);
        }
    }
    //Print statistics
    PrintStatistics();
}
And the AddToDictionary(s) function is:
public void AddToDictionary(string s)
{
    //Function to add string to dictionary
    string wordLower = s.ToLower();
    if (wordDictionary.ContainsKey(wordLower))
    {
        wordDictionary[wordLower] = wordDictionary[wordLower] + 1;
    }
    else
    {
        wordDictionary.Add(wordLower, 1);
        txtUnique.Text += wordLower + ", ";
    }
}
The text file being read by this program is:
To be or not to be that is the question
Whether tis nobler in the mind to suffer
The slings and arrows of outrageous fortune
Or to take arms against a sea of troubles
And by opposing end them To die to sleep
No more and by a sleep to say we end
The heartache and the thousand natural shocks
That flesh is heir to Tis a consummation
Devoutly to be wished To die to sleep
To sleep perchance to dream ay theres the rub
For in that sleep of death what dreams may come
When we **have** shuffled off this mortal coil
Must give us pause Theres the respect
That makes calamity of so long life
For who would bear the whips and scorns of time
The oppressors wrong the proud mans contumely
The pangs of despised love the laws delay
The insolence of office and the spurns
That patient merit of th unworthy takes
When he himself might his quietus make
With a bare bodkin Who would fardels bear
To grunt and sweat under a weary life
But that the dread of something after death
The undiscovered country from whose bourn
No traveller returns puzzles the will
And makes us rather bear those ills we **have**
Than fly to others that we know not of
Thus conscience does make cowards of us all
And thus the native hue of resolution
Is sicklied oer with the pale cast of thought
And enterprise of great pitch and moment
With this regard their currents turn awry
And lose the name of action Soft you now
The fair Ophelia Nymph in thy orisons
Be all my sins remembered
The problem I am encountering is that the word "have" is appearing twice in the dictionary. I know this doesn't happen with dictionaries, but for some reason it is appearing twice. Does anyone know why this would happen?
If you run:
var sb = new StringBuilder();
sb.AppendLine("test which");
sb.AppendLine("is a test");
var words = sb.ToString().Split(' ', '\n').Distinct();
Inspecting words in the debugger shows that some instances of "test" have acquired a \r, due to the two-byte CRLF line terminator, which isn't handled by the split.
To fix, change your split to:
Split(new[] {" ", Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)
Splitting text into words is generally a hard problem if you want to support multiple languages. Regular expressions are generally better at this kind of parsing than basic String.Split.
I.e. in your case you are picking up variations of "new line" as part of a word; you could also be picking up things like non-breaking spaces, etc.
The following code will pick out words better than your current .Split; for more info see: How do I split a phrase into words using Regex in C#
var words = Regex.Split(line, @"\W+").ToList();
Additionally, you should make sure your dictionary is case-insensitive, like the following (pick a comparer based on your needs; there are culture-aware ones too):
var dictionary = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
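With an ordinal-ignore-case comparer in place, differently cased spellings of a word all address the same entry, as this small illustration shows:

```csharp
using System;
using System.Collections.Generic;

var dictionary = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
dictionary["To"] = 1;
dictionary["TO"] = dictionary["to"] + 1; // all three spellings hit one entry
// dictionary.Count is 1 and dictionary["To"] is 2
```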
I would be inclined to change the following code:
//Read from file, split into array of words
Stream fs = file.OpenFile();
StreamReader reader = new StreamReader(fs);
string line = reader.ReadToEnd();
string[] words = line.Split(' ', '\n');
//Add each word and frequency to dictionary
foreach (string s in words)
{
    AddToDictionary(s);
}
to this:
wordDictionary = File
    .ReadAllLines(file.FileName)
    .SelectMany(x => x.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
    .Select(x => x.ToLower())
    .GroupBy(x => x)
    .ToDictionary(x => x.Key, x => x.Count());
This completely avoids the issues with line endings and also has the added advantage that it doesn't leave any undisposed streams lying around.

C# implementation of Dictionary to count occurrences of words returns duplicate words in output

I recently made a little application to read in a text file of lyrics, then use a Dictionary to count how many times each word occurs. However, for some reason I'm finding instances in the output where the same word occurs multiple times with a tally of 1, instead of being added to the original tally for that word. The code I'm using is as follows:
StreamReader input = new StreamReader(path);
String[] contents = input.ReadToEnd()
    .ToLower()
    .Replace(",", "")
    .Replace("(", "")
    .Replace(")", "")
    .Replace(".", "")
    .Split(' ');
input.Close();
var dict = new Dictionary<string, int>();
foreach (String word in contents)
{
    if (dict.ContainsKey(word))
    {
        dict[word]++;
    }
    else
    {
        dict[word] = 1;
    }
}
var ordered = from k in dict.Keys
              orderby dict[k] descending
              select k;
using (StreamWriter output = new StreamWriter("output.txt"))
{
    foreach (String k in ordered)
    {
        output.WriteLine(String.Format("{0}: {1}", k, dict[k]));
    }
    output.Close();
    timer.Stop();
}
The text file I'm inputting is here: http://pastebin.com/xZBHkjGt (it's the lyrics of the top 15 rap songs, if you're curious)
The output can be found here: http://pastebin.com/DftANNkE
A quick Ctrl-F shows that "girl" occurs at least 13 different times in the output. As far as I can tell, it is the exact same word, unless there's some sort of difference in ASCII values. Yes, there are some instances with odd characters in place of an apostrophe, but I'll worry about those later. My priority is figuring out why the exact same word is being counted 13 times as different words. Why is this happening, and how do I fix it? Any help is much appreciated!
Another way is to split on non words.
var lyrics = "I fly with the stars in the skies I am no longer tryin' to survive I believe that life is a prize But to live doesn't mean your alive Don't worry bout me and who I fire I get what I desire, It's my empire And yes I call the shots".ToLower();
var contents = Regex.Split(lyrics, @"[^\w'+]");
Also here's an alternative (and probably more obscure) loop
int value;
foreach (var word in contents)
{
    dict[word] = dict.TryGetValue(word, out value) ? ++value : 1;
}
dict.Remove("");
If you notice, the repeat occurrences appear on a line following a word which apparently doesn't have a count.
You're not stripping out newlines, so em\r\ngirl is being treated as a different word.
String[] contents = input.ReadToEnd()
    .ToLower()
    .Replace(",", "")
    .Replace("(", "")
    .Replace(")", "")
    .Replace(".", "")
    .Split("\r\n ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
Works better.
Or add Trim to each word:
foreach (String word in contents.Select(w => w.Trim()))
