LINQ conditional aggregation based on next elements' values - c#

What's a good LINQ equivalent of this pseudo-code: "given a list of strings, for each string that doesn't contain a tab character, concatenate it (with a pipe delimiter) to the end of the previous string, and return the resulting sequence"?
More Info:
I have a List<string> representing lines in a tab-delimited text file. The last field in each line is always a multiline text field, and the file was generated by a buggy system that mishandles fields with embedded newlines. So I end up with a list like this:
1235 \t This is Record 1
7897 \t This is Record 2
8977 \t This is Record 3
continued on the next line
and still continued more
8375 \t This is Record 4
I'd like to coalesce this list by concatenating all the orphan lines (lines with no tab characters) to the end of the previous line. Like this:
1235 \t This is Record 1
7897 \t This is Record 2
8977 \t This is Record 3|continued on the next line|and still continued more
8375 \t This is Record 4
Solving this with a for() loop would be easy, but I'm trying to improve my LINQ skills and I was wondering if there is a reasonably efficient LINQ solution to this problem. Is there?

This is not a problem that is best solved with LINQ. LINQ operators are designed to process each element of a sequence independently, whereas here each result depends on the lines that follow it.
Most LINQ operators give an element no knowledge of its neighbours, and that knowledge is exactly what this problem needs. Use a for loop so you can cleanly go through the strings one by one, in order.

I tried this out of curiosity.
var originalList = new List<string>
{
"1235 \t This is Record 1",
"7897 \t This is Record 2",
"8977 \t This is Record 3",
"continued on the next line",
"and still continued more",
"8375 \t This is Record 4"
};
var resultList = new List<string>();
resultList.Add(originalList.Aggregate((workingSentence, next)
=> {
if (next.Contains("\t"))
{
resultList.Add(workingSentence);
return next;
}
else
{
workingSentence += "|" + next;
return workingSentence;
}
}));
The resultList should contain what you want.
Please note that this is not an optimal solution. The line workingSentence += "|" + next; may create many temporary string objects, depending on your data.
An optimal solution might keep several index variables to look ahead in the list and join a whole run of strings at once when the next string contains a tab character, instead of concatenating them one by one as shown above. However, it would be more complex than the code above because of the boundary checks and the extra index variables :).
Update: The following solution will not create temporary string objects for concatenation.
var resultList = new List<string>();
var tempList = new List<string>();
tempList.Add(originalList.Aggregate((cur, next)
=> {
tempList.Add(cur);
if (next.Contains("\t"))
{
resultList.Add(string.Join("|", tempList));
tempList.Clear();
}
return next;
}));
resultList.Add(string.Join("|", tempList));
The following is a solution using a plain loop. It also handles an empty list and a list with a single line.
var resultList = new List<string>();
var temp = new List<string>();
foreach (var line in originalList)
{
// A line with a tab starts a new record, so flush the previous record first.
if (line.Contains("\t") && temp.Count > 0)
{
resultList.Add(string.Join("|", temp));
temp.Clear();
}
temp.Add(line);
}
// Flush the last record, if there is one.
if (temp.Count > 0)
{
resultList.Add(string.Join("|", temp));
}

After trying a for() solution, I tried a LINQ solution and came up with the one below. For my reasonably small (10K lines) file it was fast enough that I didn't care about the efficiency, and I found it much more readable than the equivalent for() solution.
var lines = new List<string>
{
"1235 \t This is Record 1",
"7897 \t This is Record 2",
"8977 \t This is Record 3",
"continued on the next line",
"and still continued more",
"8375 \t This is Record 4"
};
var fixedLines = lines
.Select((s, i) => new
{
Line = s,
Orphans = lines.Skip(i + 1).TakeWhile(s2 => !s2.Contains('\t'))
})
.Where(s => s.Line.Contains('\t'))
.Select(s => string.Join("|", (new string[] { s.Line }).Concat(s.Orphans).ToArray()));

You could do something like this:
string result = records.Aggregate("", (current, s) => current + (s.Contains("\t") ? "\n" + s : "|" + s));
I cheated and got Resharper to generate this for me. This is close -- it leaves a blank line at the top though.
However, as you can see, this is not very readable. I realize you're looking for a learning exercise but I'd take a nice readable foreach loop over this any day.
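For comparison, the readable loop mentioned above could be wrapped in a reusable iterator method. This is only a sketch under stated assumptions: the names LineCoalescer and CoalesceOrphans are hypothetical, and an orphan line that appears before any tabbed line is assumed to start its own record.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class LineCoalescer
{
    // Hypothetical helper: joins every line without the marker character
    // onto the preceding record, separated by the delimiter.
    public static IEnumerable<string> CoalesceOrphans(
        IEnumerable<string> lines, char marker = '\t', string delimiter = "|")
    {
        StringBuilder current = null;
        foreach (var line in lines)
        {
            if (current == null || line.IndexOf(marker) >= 0)
            {
                // A new record starts here; emit the finished one first.
                if (current != null)
                {
                    yield return current.ToString();
                }
                current = new StringBuilder(line);
            }
            else
            {
                // Orphan line: append it to the record being built.
                current.Append(delimiter).Append(line);
            }
        }
        // Emit the last record, if the input was not empty.
        if (current != null)
        {
            yield return current.ToString();
        }
    }
}
```

It would be called as lines.CoalesceOrphans-style code such as LineCoalescer.CoalesceOrphans(lines), and because it uses yield return it reads the input only once and builds each record in a single StringBuilder.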

Related

Selecting text from Dictionary - single words vs phrases - e.g. 'rice' vs 'rice wine'

Problem
I am writing a recipe parser in C#. I am selecting text inside a Rich Text Box where recipe ingredients are matched with Dictionary entries. I'm not sure how to deal with (or describe) the case where single words are matched (and double counted) inside a phrase that is also in the Dictionary
Example
In my Dictionary I have entries for 'rice' and 'rice wine'. I want to make sure that 'rice' is not matched in phrases that are already in the Dictionary, like 'rice wine'. That is, the 'rice' part of 'rice wine' is not matched with the single 'rice' entry.
Terminology
I'd imagine this is a pretty usual case for text retrieval but I don't know what domain terminology would be.
Code
Currently I'm loading the Dictionary from an SQL query
tagList.Add(new KeyValuePair<string, string>(reader[0].ToString(), "0"));
And then searching the RichTextBox by looping over the Dictionary and then looping through the RTB.
foreach (KeyValuePair<string, string> word in tagList)
{
int startindex = 0;
while (startindex < richTextBox1.TextLength)
{
int wordstartIndex = richTextBox1.Find(word.Key, startindex, RichTextBoxFinds.WholeWord);
if (wordstartIndex != -1)
{
Console.WriteLine("found: " + word.Key);
richTextBox1.SelectionStart = wordstartIndex;
richTextBox1.SelectionLength = word.Key.Length;
if (word.Value.ToString() == "0")
{
richTextBox1.SelectionBackColor = Color.Yellow;
}
}
else
break;
startindex = wordstartIndex + word.Key.Length; // continue searching right after this match
}
}
Use a SortedList instead of a Dictionary, so that "rice" will be right before "rice wine" and any other matching multiple words. When you find a match for "rice", enter a second loop where you peek the next elements from the list and look for matches with multiple words.
I refactored my lookup database table and made 4 columns for tags with one, two, three and four words - e.g. 'rice', 'rice wine', 'rice wine vinegar' and 'sour rice wine noodles'.
I used 4 dictionaries and loaded each dictionary with the corresponding column from the database lookup table.
I looked at my target string with the four word dictionary first, then the three word dictionary, then two then the one word dictionary.
I used Regex's word boundary pattern "\b" + word.Key + "\b" to tokenise the word.
Slow but it does the job for now.
foreach (KeyValuePair<string, string> word in tagTwo)
{
string ingredientString = richTextBox1.Text.ToLower();
if (ingredientString.Contains(word.Key))
{
string input = ingredientString;
string pattern = @"\b" + word.Key + "\\b";
if (Regex.IsMatch(input, pattern) == true)
{
Console.WriteLine(pattern);
string replace = "[[token]]";
string output = Regex.Replace(input, pattern, replace);
richTextBox1.Text = output;
insertStringLine = "INSERT INTO ingredientCount (ingredientTag, tagCount) VALUES ('" + word.Key + "',1);" + Environment.NewLine;
SQLiteCommand createSQL = new SQLiteCommand(insertStringLine.Replace(",)", ")"), conn);
createSQL.ExecuteNonQuery();
}
}
}
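The longest-phrase-first idea in the answers above can also be sketched with a single regular expression. This is a different technique from both the SortedList peek and the four dictionaries: it relies on .NET trying alternatives in order, so placing longer tags first means 'rice wine' wins over 'rice'. The names PhraseMatcher and CountTags are hypothetical, and the sketch works on a plain string rather than on the RichTextBox.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PhraseMatcher
{
    // Hypothetical helper: counts whole-word occurrences of each tag,
    // preferring the longest tag when several could match at one position.
    public static Dictionary<string, int> CountTags(string text, IEnumerable<string> tags)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

        // Longest tags first, escaped so they are treated as literal text.
        var alternatives = tags
            .Where(t => !string.IsNullOrEmpty(t))
            .OrderByDescending(t => t.Length)
            .Select(t => Regex.Escape(t))
            .ToList();
        if (alternatives.Count == 0)
        {
            return counts;
        }

        var pattern = @"\b(?:" + string.Join("|", alternatives) + @")\b";
        foreach (Match m in Regex.Matches(text, pattern, RegexOptions.IgnoreCase))
        {
            int n;
            counts.TryGetValue(m.Value, out n);
            counts[m.Value] = n + 1;
        }
        return counts;
    }
}
```

Each Match also exposes Index and Length, which could be used to set SelectionStart and SelectionLength when highlighting.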

Splitting an element of an array

In my C# program (I'm new to C# so I hope that I'm doing things correctly), I'm trying to read in all of the lines from a text file, which will look something along the lines of this, but with more entries (these are fictional people so don't worry about privacy):
Logan Babbleton ID #: 0000011 108 Crest Circle Mr. Logan M. Babbleton
Pittsburgh PA 15668 SSN: XXX-XX-XXXX
Current Program(s): Bachelor of Science in Cybersecurity
Mr. Carter J. Bairn ID #: 0000012 21340 North Drive Mr. Carter Joseph Bairn
Pittsburgh PA 15668 SSN: XXX-XX-XXXX
Current Program(s): Bachelor of Science in Computer Science
I have these lines read into an array, concentrationArray and want to find the lines that contain the word "Current", split them at the "(s): " in "Program(s): " and print the words that follow. I've done this earlier in my program, but splitting at an ID instead, like this:
nameLine = nameIDLine.Split(new string[] { "ID" }, StringSplitOptions.None)[1];
However, whenever I attempt to do this, I get an error that my index is out of the bounds of my split array (not my concentrationArray). Here's what I currently have:
for (int i = 0; i < concentrationArray.Length; i++)
{
if (concentrationArray[i].Contains("Current"))
{
lstTest.Items.Add(concentrationArray[i].Split(new string[] { "(s): " }, StringSplitOptions.None)[1]);
}
}
Where I'm confused is that if I change the index to 0 instead of 1, it will print everything out perfectly, but it will print out the first half, instead of the second half, which is what I want. What am I doing wrong? Any feedback is greatly appreciated since I'm fairly new at C# and would love to learn what I can. Thanks!
Edit - The only thing that I could think of was that maybe sometimes there wasn't anything after the string that I used to separate each element, but when I checked my text file, I found that was not the case and there is always something following the string used to separate.
You should check the result of Split before trying to read index 1.
If a line doesn't contain "(s): " (including the trailing space), your code will crash with the exception you describe.
for (int i = 0; i < concentrationArray.Length; i++)
{
if (concentrationArray[i].Contains("Current"))
{
string[] result = concentrationArray[i].Split(new string[] { "(s): " }, StringSplitOptions.None);
if(result.Length > 1)
lstTest.Items.Add(result[1]);
else
Console.WriteLine($"Line {i} has no (s): followed by a space");
}
}
To complete the answer: if you always use index 0 there is no error, because when the separator is not present in the input string the output is an array with a single element containing the whole, unsplit string.
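The behaviour described above can be sketched as a small helper that returns null instead of throwing when the separator is missing. The names SplitDemo and TextAfter are hypothetical.

```csharp
using System;

public static class SplitDemo
{
    // Hypothetical helper: returns the text after the first occurrence of
    // the separator, or null when the separator does not occur in the line.
    public static string TextAfter(string line, string separator)
    {
        // Limit the result to two parts so later separators stay in the tail.
        string[] parts = line.Split(new[] { separator }, 2, StringSplitOptions.None);

        // Without the separator, Split returns one element: the whole line.
        return parts.Length > 1 ? parts[1] : null;
    }
}
```

A line such as "Current Program(s):" with nothing after the colon (and so no trailing space) does not contain "(s): ", which is one way the original code could hit the out-of-range index.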
If the line always starts with
Current Program(s):
then why don't you just replace it with empty string like this:
concentrationArray[i].Replace("Current Program(s): ", "")
It is perhaps a little easier to understand and more reusable if you separate the concerns. It will also be easier to test. An example might be...
var allLines = File.ReadLines(@"C:\your\file\path\data.txt");
// Materialise once so the file is not read twice; Items.AddRange needs an array.
var currentPrograms = ExtractCurrentPrograms(allLines).ToArray();
if (currentPrograms.Length > 0)
{
lstTest.Items.AddRange(currentPrograms);
}
...
private static IEnumerable<string> ExtractCurrentPrograms(IEnumerable<string> lines)
{
const string targetPhrase = "Current Program(s):";
foreach (var line in lines.Where(l => !string.IsNullOrWhiteSpace(l)))
{
var index = line.IndexOf(targetPhrase);
if (index >= 0)
{
var programIndex = index + targetPhrase.Length;
var text = line.Substring(programIndex).Trim();
if (!string.IsNullOrWhiteSpace(text))
{
yield return text;
}
}
}
}
Here is a slightly different approach:
List<string> test = new List<string>();
string pattern = "Current Program(s):";
string[] allLines = File.ReadAllLines(@"C:\Users\xyz\Source\demo.txt");
foreach (var line in allLines)
{
if (line.Contains(pattern))
{
test.Add(line.Substring(line.IndexOf(pattern) + pattern.Length));
}
}
or
string pattern = "Current Program(s):";
lstTest.Items.AddRange(File.ReadLines(@"C:\Users\ODuritsyn\Source\demo.xml")
.Where(line => line.Contains(pattern))
.Select(line => line.Substring(line.IndexOf(pattern) + pattern.Length))
.ToArray()); // Items.AddRange takes an array, not an IEnumerable

Reading into list and picking specific numbers

A request came in for an app written a long time ago: they want another column on an existing DataGridView showing the current drilling times. These times come from a file that the machines read and run. Below is an example of a file.
The numbers they're looking for are the ones marked with an S. The very first number on each line is the line number, which needs to be grabbed too because the times have to match up with the line. So I have to read the file into a list, then grab the line number and the associated "S" time. I will be adding these "S times" to an existing table that already has the line numbers, in a new column next to the corresponding line number. I'm a bit stuck here.
Any help is appreciated.
I know I can read everything in like this
string f = ("//dnc/WJ/MTI-WJ/" + cboPartProgram.Text);
List<string> lines = new List<string>();
using (StreamReader r = new StreamReader(f))
{
// Use while != null pattern for loop
string line;
while ((line = r.ReadLine()) != null)
{
lines.Add(line);
}
}
But once in a list I need to ignore lines that start with an apostrophe " ' ", then take only the first line number "1, 2, 3" etc for however many there are and then the "s" number of that line.
So that I'm left with:
1 2.75
2 2.5
3 2
...etc
Now, in the database there is a program table with these columns:
Program Line Time
Program1 1
Program1 2
Program1 3
Program1 4
etc
So the "s" times will be added to the time column of the existing table. But the "s" times taken from line 1,2,3,4 and so on will coincide with the line column of the table.
So, the final end result in this example would be
Program Line Time
Program1 1 2.75
Program1 2 2.5
Program1 3 2
UPDATE
I'm not sure yet whether the edited code that covers the lines without an sTime works, because I recently ran into another issue. There is a DataGridView whose columns are populated from a list that is filled by a SQL statement. I've added a column for the sTime (it turns out I may not need the line number, so you'll notice I removed it), but I need to merge my sTime loop into the loop that walks the list populating the dgv. I've tried to combine them in a couple of different ways. Either the very last sTime is used for all 80 entries of the dgv (as in the example below), or, if I move the closing braces around and nest the loops, the first sTime is used to create 80 entries, then the second sTime creates another 80, and so on. So instead of 80 entries in the dgv, each with its own sTime next to it, I end up with 6,400 entries (80x80). I'm not sure how to combine these two loops so that they work.
var sLines = File.ReadAllLines("//dnc/WJ/MTI-WJ/" + cboPartProgram.Text)
.Where(s => !s.StartsWith("'"))
.Select(s => new
{
SValue = Regex.Match(s, "(?<=S)[\\d.]*").Value
})
.ToArray();
string lastSValue = "";
foreach (var line in sLines)
{
string val = line.SValue == string.Empty ? lastSValue : line.SValue;
lastSValue = val;
}
foreach (WjDwellOffsets offset in dwellTimes)
{
origionalDwellOffsets.Add(offset.dwellOffset);
dgvDwellTimes.Rows.Add(new object[] { offset.positionId, lastSValue, offset.dwellOffset, 0, offset.dwellOffset, "Update", (offset.dateTime > new DateTime(1900, 1, 1)) ? offset.dateTime.ToString() : "" });
DataGridViewDisableButtonCell btnCell = ((DataGridViewDisableButtonCell)dgvDwellTimes.Rows[dgvDwellTimes.Rows.Count - 1].Cells[5]);
btnCell.Enabled = false;
}
If your only condition is to keep the lines that have an "S" in them, and if that data sample is indicative of the overall line pattern, then this is quite simple (and you can use Regex to further filter your results):
var sLines = File.ReadAllLines("filepath.txt")
.Where(s => !s.StartsWith("'") && s.Contains("S"))
.Select(s => new
{
LineNumber = Regex.Match(s, "^\\d*").Value,
SValue = Regex.Match(s, "(?<=S)[\\d.]*").Value
})
.ToArray();
// Use like this
foreach (var line in sLines)
{
string num = line.LineNumber;
string val = line.SValue;
}
To keep all lines and carry relevant information only on some of them, this method can be tweaked a little, but it also requires some processing outside the query.
var sLines = File.ReadAllLines("filepath.txt")
.Where(s => !s.StartsWith("'"))
.Select(s => new
{
LineNumber = Regex.Match(s, "^\\d*").Value,
SValue = Regex.Match(s, "(?<=S)[\\d.]*").Value
})
.ToArray();
// Use like this
string lastSValue = "";
foreach (var line in sLines)
{
string num = line.LineNumber;
string val = line.SValue == string.Empty ? lastSValue : line.SValue;
// Do the stuff
lastSValue = val;
}
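The carry-forward loop above could be folded into one method that returns the line number and S time for every non-comment line. This is a sketch only: the real file format is not shown in the question, so the sample lines in the usage note are invented, and the names STimeParser and Parse are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class STimeParser
{
    // Hypothetical helper: returns (line number, S time) pairs, reusing the
    // previous S time for lines that do not carry one of their own.
    public static List<KeyValuePair<string, string>> Parse(IEnumerable<string> lines)
    {
        var result = new List<KeyValuePair<string, string>>();
        string lastSValue = "";
        foreach (var line in lines)
        {
            // Skip comment lines, which start with an apostrophe.
            if (line.StartsWith("'", StringComparison.Ordinal))
            {
                continue;
            }

            string number = Regex.Match(line, @"^\d+").Value;
            string sValue = Regex.Match(line, @"(?<=S)[\d.]+").Value;
            if (sValue.Length == 0)
            {
                sValue = lastSValue; // no S on this line: carry the last one forward
            }

            result.Add(new KeyValuePair<string, string>(number, sValue));
            lastSValue = sValue;
        }
        return result;
    }
}
```

Given invented input such as "1 G01 X1.0 S2.75", "2 G01 X2.0" and "3 G01 S2", the pairs would be (1, 2.75), (2, 2.75) and (3, 2). Because the result is an indexed list, a single loop over the grid rows could read the entry at the same index rather than nesting one loop inside the other, assuming the rows and the file lines are in the same order.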

Best way to take the intersection of more than two HashSets in C# when we do not know beforehand how many HashSets there are

I am making a boolean retrieval system for a large number of documents. I have built a dictionary of hash sets: the keys of the dictionary are the terms, and each hash set contains the ids of the documents in which the term was found.
When I want to search for a single word, I simply look up the entered word in the dictionary and print out the corresponding hash set.
But I also want to search for sentences. In that case I split the query into individual words and look up each word in the dictionary, which returns one hash set per word in the query. I then want to take the intersection of these hash sets so that I can return the ids of the documents that contain all the words in the query.
My question is: what is the best way to take the intersection of these hash sets?
Currently I put the hash sets into a list and intersect them two at a time: first the first two, then the result with the third one, and so on.
This is the code
Dictionary<string, HashSet<string>> dt = new Dictionary<string, HashSet<string>>();//assume it is filled with data...
while (true)
{
Console.WriteLine("\n\n\nEnter the query you want to search");
string inp = Console.ReadLine();
string[] words = inp.Split(new Char[] { ' ', ',', '.', ':', '?', '!', '\t' });
List<HashSet<string>> outparr = new List<HashSet<string>>();
foreach(string w in words)
{
HashSet<string> outp = new HashSet<string>();
if (dt.TryGetValue(w, out outp))
{
outparr.Add(outp);
Console.WriteLine("Found {0} documents.", outp.Count);
foreach (string s in outp)
{
Console.WriteLine(s);
}
}
}
HashSet<string> temp = outparr.First();
foreach(HashSet<string> hs in outparr)
{
temp = new HashSet<string>(temp.Intersect(hs));
}
Console.WriteLine("Output After Intersection:");
Console.WriteLine("Found {0} documents: ", temp.Count);
foreach(string s in temp)
{
Console.WriteLine(s);
}
}
IntersectWith is a good approach. Like this:
HashSet<string> res = null;
HashSet<string> outdictinary = null;
foreach(string w in words)
{
if (dt.TryGetValue(w, out outdictinary))
{
if( res==null)
res = new HashSet<string>(outdictinary, outdictinary.Comparer);
else
{
if (res.Count==0)
break;
res.IntersectWith(outdictinary);
}
}
}
if (res==null) res = new HashSet<string>();
Console.WriteLine("Output After Intersection:");
Console.WriteLine("Found {0} documents: ", res.Count);
foreach(string s in res)
{
Console.WriteLine(s);
}
The principle that you are using is sound, but you can tweak it a bit.
By sorting the hash sets on size, you can start with the smallest one, that way you can minimise the number of comparisons.
Instead of using the IEnumerable<>.Intersect method you can do the same thing in a loop, but using the fact that you already have a hash set. Checking if a value exists in a hash set is very fast, so you can just loop through the items in the smallest set and look for matching values in the next set, and put them in a new set.
In the loop you can skip the first item as you start with that. You don't need to intersect it with itself.
outparr = outparr.OrderBy(o => o.Count).ToList();
HashSet<string> combined = outparr[0];
foreach(HashSet<string> hs in outparr.Skip(1)) {
HashSet<string> temp = new HashSet<string>();
foreach (string s in combined) {
if (hs.Contains(s)) {
temp.Add(s);
}
}
combined = temp;
}
To answer your question, it's possible that at some point you'll find a set of documents that contains words a, b and c and another set that contains only other words in your query, so the intersection can become empty after a few iterations. You can check for this and break out of the foreach.
That said, IMHO it doesn't make much sense to do that intersection, because a search result should usually contain multiple files ordered by relevance.
That will also be much easier, because you already have a list of files containing each word. From the hash sets obtained for each word, count the occurrences of each file id and return a limited number of ids, ordered descending by the number of occurrences.
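The counting approach described above can be sketched with LINQ grouping. The names Ranking and RankByMatchedTerms are hypothetical, and breaking ties by document id is an assumption added only to make the order deterministic.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Ranking
{
    // Hypothetical helper: ranks document ids by how many of the query's
    // term sets contain them, and returns at most 'limit' ids.
    public static List<string> RankByMatchedTerms(
        IEnumerable<HashSet<string>> postings, int limit)
    {
        return postings
            .SelectMany(set => set)                     // every (term, document) hit
            .GroupBy(id => id)                          // one group per document id
            .OrderByDescending(g => g.Count())          // most matched terms first
            .ThenBy(g => g.Key, StringComparer.Ordinal) // assumed tie-break
            .Take(limit)
            .Select(g => g.Key)
            .ToList();
    }
}
```

In the question's code this could be called with the outparr list that already collects one hash set per query word.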

C# implementation of Dictionary to count occurrences of words returns duplicate words in output

I recently made a little application to read in a text file of lyrics, then use a Dictionary to calculate how many times each word occurs. However, for some reason I'm finding instances in the output where the same word occurs multiple times with a tally of 1, instead of being added onto the original tally of the word. The code I'm using is as follows:
StreamReader input = new StreamReader(path);
String[] contents = input.ReadToEnd()
.ToLower()
.Replace(",","")
.Replace("(","")
.Replace(")", "")
.Replace(".","")
.Split(' ');
input.Close();
var dict = new Dictionary<string, int>();
foreach (String word in contents)
{
if (dict.ContainsKey(word))
{
dict[word]++;
}else{
dict[word] = 1;
}
}
var ordered = from k in dict.Keys
orderby dict[k] descending
select k;
using (StreamWriter output = new StreamWriter("output.txt"))
{
foreach (String k in ordered)
{
output.WriteLine(String.Format("{0}: {1}", k, dict[k]));
}
output.Close();
timer.Stop();
}
The text file I'm inputting is here: http://pastebin.com/xZBHkjGt (it's the lyrics of the top 15 rap songs, if you're curious)
The output can be found here: http://pastebin.com/DftANNkE
A quick Ctrl-F shows that "girl" occurs at least 13 different times in the output. As far as I can tell, it is the exact same word, unless there's some sort of difference in ASCII values. Yes, there are some instances in there with odd characters in place of an apostrophe, but I'll worry about those later. My priority is figuring out why the exact same word is being counted 13 different times as different words. Why is this happening, and how do I fix it? Any help is much appreciated!
Another way is to split on non words.
var lyrics = "I fly with the stars in the skies I am no longer tryin' to survive I believe that life is a prize But to live doesn't mean your alive Don't worry bout me and who I fire I get what I desire, It's my empire And yes I call the shots".ToLower();
var contents = Regex.Split(lyrics, @"[^\w'+]");
Also here's an alternative (and probably more obscure) loop
int value;
foreach (var word in contents)
{
dict[word] = dict.TryGetValue(word, out value) ? ++value : 1;
}
dict.Remove("");
If you notice, the repeat occurrences appear on a line following a word which apparently doesn't have a count.
You're not stripping out newlines, so em\r\ngirl is being treated as a different word.
String[] contents = input.ReadToEnd()
.ToLower()
.Replace(",", "")
.Replace("(", "")
.Replace(")", "")
.Replace(".", "")
.Split("\r\n ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
Works better.
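The same fix can be sketched as a self-contained counter that splits on any whitespace rather than on a fixed list of characters. Passing a null separator array to Split makes it treat all white-space characters (spaces, tabs, \r, \n) as delimiters. The names WordCounter and Count are hypothetical, and punctuation stripping is left out to keep the sketch short.

```csharp
using System;
using System.Collections.Generic;

public static class WordCounter
{
    // Hypothetical helper: counts lower-cased words separated by any whitespace.
    public static Dictionary<string, int> Count(string text)
    {
        var counts = new Dictionary<string, int>();

        // A null separator array means "split on all white-space characters".
        string[] words = text.ToLowerInvariant()
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        foreach (string word in words)
        {
            int n;
            counts.TryGetValue(word, out n); // n stays 0 for a new word
            counts[word] = n + 1;
        }
        return counts;
    }
}
```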
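Add Trim to each word (note that Trim only removes leading and trailing whitespace, so it will not separate two words joined by a line break):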
foreach (String word in contents.Select(w => w.Trim()))
