C# Dictionary allowing seemingly identical keys

I have created a dictionary, and created code to read a txt file, and input each word from the file into the dictionary.
//Set up OpenFileDialog box, and prompt user to select file to open
DialogResult openFileResult;
OpenFileDialog file = new OpenFileDialog() ;
file.Filter = "txt files (*.txt)|*.txt";
openFileResult = file.ShowDialog();
if (openFileResult == DialogResult.OK)
{
//If user selected file successfully opened
//Reset form
this.Controls.Clear();
this.InitializeComponent();
//Read from file, split into array of words
Stream fs = file.OpenFile();
StreamReader reader;
reader = new StreamReader(fs);
string line = reader.ReadToEnd();
string[] words = line.Split(' ', '\n');
//Add each word and frequency to dictionary
foreach (string s in words)
{
AddToDictionary(s);
}
//Reset variables, and set-up chart
ResetVariables();
ChartInitialize();
foreach (string s in wordDictionary.Keys)
{
//Calculate statistics from dictionary
ComputeStatistics(s);
if (dpCount < 50)
{
AddToGraph(s);
}
}
//Print statistics
PrintStatistics();
}
And the AddToDictionary(s) function is:
public void AddToDictionary(string s)
{
//Function to add string to dictionary
string wordLower = s.ToLower();
if (wordDictionary.ContainsKey(wordLower))
{
int wordCount = wordDictionary[wordLower];
wordDictionary[wordLower] = wordDictionary[wordLower] + 1;
}
else
{
wordDictionary.Add(wordLower, 1);
txtUnique.Text += wordLower + ", ";
}
}
The text file being read by this program is:
To be or not to be that is the question
Whether tis nobler in the mind to suffer
The slings and arrows of outrageous fortune
Or to take arms against a sea of troubles
And by opposing end them To die to sleep
No more and by a sleep to say we end
The heartache and the thousand natural shocks
That flesh is heir to Tis a consummation
Devoutly to be wished To die to sleep
To sleep perchance to dream ay theres the rub
For in that sleep of death what dreams may come
When we **have** shuffled off this mortal coil
Must give us pause Theres the respect
That makes calamity of so long life
For who would bear the whips and scorns of time
The oppressors wrong the proud mans contumely
The pangs of despised love the laws delay
The insolence of office and the spurns
That patient merit of th unworthy takes
When he himself might his quietus make
With a bare bodkin Who would fardels bear
To grunt and sweat under a weary life
But that the dread of something after death
The undiscovered country from whose bourn
No traveller returns puzzles the will
And makes us rather bear those ills we **have**
Than fly to others that we know not of
Thus conscience does make cowards of us all
And thus the native hue of resolution
Is sicklied oer with the pale cast of thought
And enterprise of great pitch and moment
With this regard their currents turn awry
And lose the name of action Soft you now
The fair Ophelia Nymph in thy orisons
Be all my sins remembered
The problem I am encountering is that the word "have" appears twice in the dictionary. I know a dictionary cannot hold duplicate keys, but for some reason the word is listed twice. Does anyone know why this would happen?

If you run:
var sb = new StringBuilder();
sb.AppendLine("test which");
sb.AppendLine("is a test");
var words = sb.ToString().Split(' ', '\n').Distinct();
Inspecting words in the debugger shows that some instances of "test" have acquired a trailing \r. AppendLine ends each line with Environment.NewLine, which on Windows is the two-character sequence \r\n, and Split(' ', '\n') removes only the \n.
To fix, change your split to:
Split(new[] {" ", Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)
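To see the stray carriage return concretely, here is a minimal sketch. The names SplitDemo, NaiveSplit and RobustSplit are made up for illustration. It splits on explicit '\r' and '\n' characters rather than Environment.NewLine, which is an assumption on my part: it keeps the behaviour the same even when the file's line endings do not match the platform the program runs on.

```csharp
using System;

public static class SplitDemo
{
    // Splits only on space and '\n', as in the question.
    // A "\r\n" line ending therefore leaves '\r' attached to the preceding word.
    public static string[] NaiveSplit(string text) => text.Split(' ', '\n');

    // Treats space, '\r' and '\n' as separators and drops the empty
    // entries that appear between a '\r' and the '\n' that follows it.
    public static string[] RobustSplit(string text) =>
        text.Split(new[] { ' ', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
}
```

With the input "have\r\nhave", NaiveSplit yields two different strings ("have\r" and "have"), which is why the dictionary ends up with two keys that look identical when printed.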

Splitting text into words is generally a hard problem if you want to support multiple languages. Regular expressions are generally better at this kind of parsing than a basic String.Split.
For example, in your case you are picking up variations of "new line" as part of a word; you could also be picking up things like non-breaking spaces.
The following code will pick out words better than your current .Split. For more info, see: How do I split a phrase into words using Regex in C#
var words = Regex.Split(line, @"\W+").ToList();
Additionally, you should make sure your dictionary is case-insensitive, like the following (pick a comparer based on your needs; there are culture-aware ones too):
var dictionary = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
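Putting the two suggestions together might look like the sketch below. WordCounter and Count are hypothetical names, and skipping empty tokens is my own addition, since Regex.Split returns an empty string when the text starts or ends with a separator.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordCounter
{
    // Counts words, ignoring case, using \W+ (runs of non-word characters) as the separator.
    public static Dictionary<string, int> Count(string text)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (var word in Regex.Split(text, @"\W+"))
        {
            if (word.Length == 0) continue;      // leading/trailing separators produce ""
            counts.TryGetValue(word, out var n); // n stays 0 when the key is missing
            counts[word] = n + 1;
        }
        return counts;
    }
}
```

Because the comparer is case-insensitive, no ToLower call is needed; the key keeps the casing of the first occurrence.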

I would be inclined to change the following code:
//Read from file, split into array of words
Stream fs = file.OpenFile();
StreamReader reader;
reader = new StreamReader(fs);
string line = reader.ReadToEnd();
string[] words = line.Split(' ', '\n');
//Add each word and frequency to dictionary
foreach (string s in words)
{
AddToDictionary(s);
}
to this:
wordDictionary =
File
.ReadAllLines(file.FileName)
.SelectMany(x => x.Split(new [] { ' ', }, StringSplitOptions.RemoveEmptyEntries))
.Select(x => x.ToLower())
.GroupBy(x => x)
.ToDictionary(x => x.Key, x => x.Count());
This completely avoids the issues with line endings and also has the added advantage that it doesn't leave any undisposed streams lying around.

Related

How to search string in large text file?

I want to get the line containing a particular word that is unique within the file, such as a profile ID, without looping over every line. If the word is on the last line of the file, a line-by-line scan takes a long time, and I expect it to be even slower when searching for several words and extracting each matching line.
Example for line text file
name,id,image,age,place,link
string word = "13215646";
string output = string.Empty;
using (var fileStream = File.OpenRead(FileName))
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8))
{
String line;
while ((line = streamReader.ReadLine()) != null)
{
string[] strList = line.Split(',');
if (word == strList[1]) // check if word = id
{
output = line;
break;
}
}
}

You can use this to search the file:
var output = File.ReadLines(FileName).
Where(line => line.Split(',')[1] == word).
FirstOrDefault();
But it won't solve this:
if the word I am looking for is in the last line of the text file, this will take a lot of time to get it, and if the search process is for more than one word and extract the line that contains it, I think it will take a lot of time.
There's not a practical way to avoid this for a basic file.
The only ways to avoid actually reading through the file are either maintaining an index, which requires absolute control over everything that might write to the file, or guaranteeing that the file is already sorted by the columns that matter, in which case you can do something like a binary search.
But neither is likely for a random csv file. This is one of the reasons people use databases.
However, we also need to stop and check whether this is really a problem for you. I'd expect the code above to handle files up to a couple hundred MB in around 1 to 2 seconds on modern hardware, even if you need to look through the whole file.
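If the same file is searched many times, the "index" idea can be as simple as an in-memory dictionary built in one pass. This is only a sketch under assumptions: IdIndex and Build are hypothetical names, the id is assumed to be the second comma-separated field as in the question, and the index goes stale as soon as anything else writes to the file.

```csharp
using System;
using System.Collections.Generic;

public static class IdIndex
{
    // Reads every line once and maps the id (second field) to its full line.
    public static Dictionary<string, string> Build(IEnumerable<string> lines)
    {
        var index = new Dictionary<string, string>(StringComparer.Ordinal);
        foreach (var line in lines)
        {
            var fields = line.Split(',');
            if (fields.Length < 2) continue; // skip malformed lines
            index[fields[1]] = line;         // on duplicate ids the last line wins
        }
        return index;
    }
}
```

Usage would be along the lines of `var index = IdIndex.Build(File.ReadLines(FileName));` followed by `index.TryGetValue("13215646", out var line)`. The first pass still reads the whole file, but every later lookup avoids a scan.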

You can optimise the code. Here are a few ideas:
var ids = new[] { "13215646", "113" };
foreach(var line in File.ReadLines(FileName))
{
var id = line.Split(',', count: 3)[1]; // Optimization 1: use `count: 3` to stop splitting after the id field
if(ids.Contains(id)) // Optimization 2: search for multiple ids in one pass
{
//Do what you need with the line
}
}

How to concat whole text from file into the string avoiding empty lines between strings

How can I get the whole text of a document concatenated into a single string? I'm trying to split the text on periods: string[] words = s.Split('.'); and I want to take this text from a text document. But if my text document contains line breaks or empty lines between strings, for example:
pat said, “i’ll keep this ring.”
she displayed the silver and jade wedding ring which, in another time track,
she and joe had picked out; this
much of the alternate world she had elected to retain. he wondered what - if any - legal basis she had kept in addition. none, he hoped; wisely, however, he said nothing. better not even to ask.
result looks like this:
1. pat said ill keep this ring
2. she displayed the silver and jade wedding ring which in another time track
3. she and joe had picked out this
4. much of the alternate world she had elected to retain
5. he wondered what if any legal basis she had kept in addition
6. none he hoped wisely however he said nothing
7. better not even to ask
but desired correct output should be like this:
1. pat said ill keep this ring
2. she displayed the silver and jade wedding ring which in another time track she and joe had picked out this much of the alternate world she had elected to retain
3. he wondered what if any legal basis she had kept in addition
4. none he hoped wisely however he said nothing
5. better not even to ask
So to do this first I need to process text file content to get whole text as single string, like this:
pat said, “i’ll keep this ring.” she displayed the silver and jade wedding ring which, in another time track, she and joe had picked out; this much of the alternate world she had elected to retain. he wondered what - if any - legal basis she had kept in addition. none, he hoped; wisely, however, he said nothing. better not even to ask.
I can't do this the same way as I would with list content, for example: string concat = String.Join(" ", text.ToArray());
I'm not sure how to concatenate the text of a text document into a string.

I think this is what you want:
var fileLocation = @"c:\myfile.txt";
var stringFromFile = File.ReadAllText(fileLocation);
//replace Environment.NewLine with whichever newline sequence your file uses; a space keeps words on adjacent lines apart
var withoutNewLines = stringFromFile.Replace(Environment.NewLine, " ");
//modify to remove any unwanted character
var withoutUglyCharacters = Regex.Replace(withoutNewLines, "[“’”,;-]", "");
var withoutTwoSpaces = withoutUglyCharacters.Replace("  ", " ");
var result = withoutTwoSpaces.Split('.').Where(i => i != "").Select(i => i.TrimStart()).ToList();
So first you read all the text from your file, then you remove all unwanted characters, and then you split on '.' and return the non-empty items.
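A variant that does not depend on which newline sequence the file uses is to collapse every run of whitespace in one step. The sketch below is an assumption-laden illustration rather than the answer's own code: SentenceSplitter is a made-up name, and it leaves punctuation other than the period untouched.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SentenceSplitter
{
    public static string[] Split(string text)
    {
        // Collapse each run of whitespace (spaces, tabs, \r, \n, blank lines) into one space.
        var flat = Regex.Replace(text, @"\s+", " ");

        // Split on periods, trim each piece, and drop the empty ones.
        return flat.Split('.')
                   .Select(s => s.Trim())
                   .Where(s => s.Length > 0)
                   .ToArray();
    }
}
```

Text read with File.ReadAllText can be passed straight in; a sentence that spans a line break or a blank line comes back as a single entry.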

Have you tried replacing double new-lines before splitting using a period?
static string[] GetSentences(string filePath) {
if (!File.Exists(filePath))
throw new FileNotFoundException($"Could not find file { filePath }!");
var lines = string.Join(" ", File.ReadLines(filePath).Where(line => !string.IsNullOrWhiteSpace(line)));
var sentences = Regex.Split(lines, @"\.[\s]{1,}?");
return sentences;
}
I haven't tested this, but it should work.
Explanation:
if (!File.Exists(filePath))
throw new FileNotFoundException($"Could not find file { filePath }!");
Throws an exception if the file could not be found. It is advisable to surround the method call with a try/catch.
var lines = string.Join(" ", File.ReadLines(filePath).Where(line => !string.IsNullOrWhiteSpace(line)));
Creates a single string from the file, joining the lines with a space and ignoring any lines that are empty or purely whitespace.
var sentences = Regex.Split(lines, @"\.[\s]{1,}?");
Creates a string array by splitting the string at every period that is followed by whitespace.
E.g:
The string "I came. I saw. I conquered" would become
I came
I saw
I conquered
Update:
Here's the method as a one-liner, if that's your style:
static string[] SplitSentences(string filePath) => File.Exists(filePath) ? Regex.Split(string.Join(" ", File.ReadLines(filePath).Where(line => !string.IsNullOrWhiteSpace(line))), @"\.[\s]{1,}?") : null;

I would suggest iterating through all the characters and checking whether each one is in the range 'a' <= char <= 'z' or is a space (char == ' '). If it matches the condition, add it to the newly created string; otherwise check whether it is a '.' character and, if it is, end the current line and start another one:
List<string> lines = new List<string>();
string line = string.Empty;
foreach(char c in str)
{
if((char.ToLower(c) >= 'a' && char.ToLower(c) <= 'z') || c == 0x20)
line += c;
else if(c == '.')
{
lines.Add(line.Trim());
line = string.Empty;
}
}
Or if you prefer "one-liner"s :
IEnumerable<string> lines = new string(str.Select(c => (char)(((char.ToLower(c) >= 'a' && char.ToLower(c) <= 'z') || c == 0x20) ? c : c == '.' ? '\n' : '\0')).ToArray()).Split('\n').Select(s => s.Trim());

I may be wrong about this, but you may not want to alter the string when you split it. For example, part of the string contains curly quotes (“), and removing them may not be desired. That raises a related question: reading a text file that contains curly single/double quotes (as your example data shows) like below:
var stringFromFile = File.ReadAllText(fileLocation);
may not display those characters properly in a text box or the console, because ReadAllText decodes the file as UTF-8 by default. If the file was saved in an ANSI code page, the quotes are shown as replacement characters (diamonds) in a text box on a form and as question marks (?) in the console. To keep the quotes and have them display properly, pass the OS's current ANSI encoding as a second argument to ReadAllText, like below (note that Encoding.Default is the ANSI code page on .NET Framework but UTF-8 on .NET Core and later):
string stringFromFile = File.ReadAllText(fileLocation, Encoding.Default);
Below is code using a simple Split call to split the string on periods (.). Hope this helps.
private void button1_Click(object sender, EventArgs e) {
string fileLocation = @"C:\YourPath\YourFile.txt";
string stringFromFile = File.ReadAllText(fileLocation, Encoding.Default);
string bigString = stringFromFile.Replace(Environment.NewLine, " ");
string[] result = bigString.Split('.');
int count = 1;
foreach (string s in result) {
if (s != "") {
textBox1.Text += count + ". " + s.Trim() + Environment.NewLine;
Console.WriteLine(count + ". " + s.Trim());
count++;
}
else {
// period at the end of the string
}
}
}

Working on huge text file, C#. Modifying the file

Please, help me resolve this issue.
I have a huge input.txt. Now it's 465 Mb, but later it will be 1Gb at least.
User enters a term (not a whole word). Using that term I need to find a word that contains it, put it between <strong> tags and save the contents to the output.txt. The term-search should be case insensitive.
This is what I have so far. It works on small texts, but doesn't on bigger ones.
Regex regex = new Regex(" ");
string text = File.ReadAllText("input.txt");
Console.WriteLine("Please, enter a term to search for");
string term = Console.ReadLine();
string[] w = regex.Split(text);
for (int i = 0; i < w.Length; i++)
{
if (Processor.Contains(w[i], term, StringComparison.OrdinalIgnoreCase))
{
w[i] = @"<strong>" + w[i] + @"</strong>";
}
}
string result = null;
result = string.Join(" ", w);
File.WriteAllText("output.txt", result);

Trying to read the entire file in one go is causing your memory exception. Look into reading the file in stages. The FileStream and BufferedStream classes provide ways of doing this:
https://msdn.microsoft.com/en-us/library/system.io.filestream(v=vs.110).aspx
https://msdn.microsoft.com/en-us/library/system.io.bufferedstream.read(v=vs.110).aspx

Try not to load the entire file into memory; avoid huge GB-size arrays, strings, etc. (you may simply not have enough RAM). Can you process the file line by line (that is, you don't have multi-line terms, do you)? If so, then
...
var source = File
.ReadLines("input.txt") // Notice absence of "All", not ReadAllLines
.Select(line => line.Split(' ')) // You don't need Regex here, just Split
.Select(items => items
.Select(item => item.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0 // the word contains the term
? @"<strong>" + item + @"</strong>"
: item))
.Select(items => String.Join(" ", items));
File.WriteAllLines("output.txt", source);

Read the file line by line (or buffer more lines). A bit slower but should work.
Also there can be a problem if all the lines match your term. Consider writing results in a temporary file when you find them and then just rename/move the file to the destination folder.
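The line-by-line approach with a temporary file could be sketched as follows. Highlighter, HighlightLine and HighlightFile are hypothetical names, the ".tmp" suffix is an arbitrary choice, and the term is assumed to be non-empty (an empty term would match every word). Note that a file consisting of one enormous line would still be read into memory in a single ReadLine call.

```csharp
using System;
using System.IO;
using System.Linq;

public static class Highlighter
{
    // Wraps every space-separated word that contains `term` (ignoring case) in <strong> tags.
    public static string HighlightLine(string line, string term) =>
        string.Join(" ", line.Split(' ').Select(word =>
            word.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0
                ? "<strong>" + word + "</strong>"
                : word));

    // Streams the input one line at a time into a temporary file,
    // then moves that file into place once everything has been written.
    public static void HighlightFile(string inputPath, string outputPath, string term)
    {
        var tempPath = outputPath + ".tmp"; // hypothetical temp-file name
        using (var reader = new StreamReader(inputPath))
        using (var writer = new StreamWriter(tempPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                writer.WriteLine(HighlightLine(line, term));
        }
        if (File.Exists(outputPath)) File.Delete(outputPath);
        File.Move(tempPath, outputPath);
    }
}
```

Only one line is held in memory at a time, so the size of the input file no longer matters much, and a half-written output.txt is never left behind if the run fails midway.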

Streamreader isn't returning the correct values from my text file, can't figure out how to properly read my text files C#

I'm running three counters, one to return the total amount of chars, one to return the number of '|' chars in my .txt file (total). And one to read how many separate lines are in my text file. I'm assuming my counters are wrong, I'm not sure. In my text file there are some extra '|' chars, but that is a bug I need to fix later...
The Message Boxes show
"Lines = 8"
"Entries = 8"
"Total Chars = 0"
Not sure if it helps but the .txt file is compiled using a streamwriter, and I have a datagridview saved to a string to create the output. Everything seems okay with those functions.
Here is a copy of the text file I'm reading
Matthew|Walker|MXW320|114282353|True|True|True
Audrey|Walker|AXW420|114282354|True|True|True
John|Doe|JXD020|111222333|True|True|False
||||||
And here is the code.
private void btnLoadList_Click(object sender, EventArgs e)
{
var loadDialog = new OpenFileDialog
{
InitialDirectory = Convert.ToString(Environment.SpecialFolder.MyDocuments),
Filter = "Text (*.txt)|*.txt",
FilterIndex = 1
};
if (loadDialog.ShowDialog() != DialogResult.OK) return;
using (new StreamReader(loadDialog.FileName))
{
var lines = File.ReadAllLines(loadDialog.FileName);//Array of all the lines in the text file
foreach (var assocStringer in lines)//For each assocStringer in lines (Runs 1 cycle for each line in the text file loaded)
{
var entries = assocStringer.Split('|'); // split the line into pieces (e.g. an array of "Matthew", "Walker", etc.)
var obj = (Associate) _bindingSource.AddNew();
if (obj == null) continue;
obj.FirstName = entries[0];
obj.LastName = entries[1];
obj.AssocId = entries[2];
obj.AssocRfid = entries[3];
obj.CanDoDiverts = entries[4];
obj.CanDoMhe = entries[5];
obj.CanDoLoading = entries[6];
}
}
}
Hope you guys find the bug(s) here. Sorry if the formatting is sloppy I'm self-taught, no classes. Any extra advice is welcomed, be as honest and harsh as need be, no feelings will be hurt.
In summary
Why is this program not reading the correct values from the text file I'm using?

Not totally sure I get exactly what you're trying to do, so correct me if I'm off, but if you're just trying to get the line count, pipe (|) count and character count for the file the following should get you that.
var lines = File.ReadAllLines(loadDialog.FileName);
int lineCount = lines.Length;
int totalChars = 0;
int totalPipes = 0; // number of "|" chars
foreach (var s in lines)
{
var entries = s.Split('|'); // split the line into pieces (e.g. an array of "Matthew", "Walker", etc.)
totalChars += s.Length; // add the number of chars on this line to the total
totalPipes = totalPipes + entries.Count() - 1; // there is always one more entry than pipes
}
All the Split() is doing is breaking the full line into an array of the individual fields in the string. Since you only seem to care about the number of pipes and not the fields, I'm not doing much with it other than determining the number of pipes by taking the number of fields and subtracting one (since you don't have a trailing pipe on each line).
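If the fields themselves are not needed, the pipes can also be counted directly rather than inferred from the number of fields. A small sketch, with FileStats and Measure as made-up names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FileStats
{
    // Returns the number of lines, '|' characters and total characters in the given lines.
    public static (int Lines, int Pipes, int Chars) Measure(IEnumerable<string> lines)
    {
        int lineCount = 0, pipes = 0, chars = 0;
        foreach (var line in lines)
        {
            lineCount++;
            chars += line.Length;               // line terminators are not included
            pipes += line.Count(c => c == '|'); // count the delimiter itself
        }
        return (lineCount, pipes, chars);
    }
}
```

It could be called as `FileStats.Measure(File.ReadLines(loadDialog.FileName))`, which also avoids holding the whole file in an array.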

C# implementation of Dictionary to count occurrences of words returns duplicate words in output

I recently made a little application to read in a text file of lyrics, then use a Dictionary to calculate how many times each word occurs. However, for some reason I'm finding instances in the output where the same word occurs multiple times with a tally of 1, instead of being added onto the original tally of the word. The code I'm using is as follows:
StreamReader input = new StreamReader(path);
String[] contents = input.ReadToEnd()
.ToLower()
.Replace(",","")
.Replace("(","")
.Replace(")", "")
.Replace(".","")
.Split(' ');
input.Close();
var dict = new Dictionary<string, int>();
foreach (String word in contents)
{
if (dict.ContainsKey(word))
{
dict[word]++;
}else{
dict[word] = 1;
}
}
var ordered = from k in dict.Keys
orderby dict[k] descending
select k;
using (StreamWriter output = new StreamWriter("output.txt"))
{
foreach (String k in ordered)
{
output.WriteLine(String.Format("{0}: {1}", k, dict[k]));
}
output.Close();
timer.Stop();
}
The text file I'm inputting is here: http://pastebin.com/xZBHkjGt (it's the lyrics of the top 15 rap songs, if you're curious)
The output can be found here: http://pastebin.com/DftANNkE
A quick ctrl-F shows that "girl" occurs at least 13 different times in the output. As far as I can tell, it is the exact same word, unless there's some sort of difference in ASCII values. Yes, there are some instances on there with odd characters in place of a apostrophe, but I'll worry about those later. My priority is figuring out why the exact same word is being counted 13 different times as different words. Why is this happening, and how do I fix it? Any help is much appreciated!

Another way is to split on non words.
var lyrics = "I fly with the stars in the skies I am no longer tryin' to survive I believe that life is a prize But to live doesn't mean your alive Don't worry bout me and who I fire I get what I desire, It's my empire And yes I call the shots".ToLower();
var contents = Regex.Split(lyrics, @"[^\w'+]");
Also here's an alternative (and probably more obscure) loop
int value;
foreach (var word in contents)
{
dict[word] = dict.TryGetValue(word, out value) ? ++value : 1;
}
dict.Remove("");

If you notice, the repeat occurrences appear on a line following a word which apparently doesn't have a count.
You're not stripping out newlines, so em\r\ngirl is being treated as a different word.

String[] contents = input.ReadToEnd()
.ToLower()
.Replace(",", "")
.Replace("(", "")
.Replace(")", "")
.Replace(".", "")
.Split("\r\n ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
Works better.

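Add Trim to each word (note that Trim removes only leading and trailing whitespace, so a token with a line break in the middle, such as em\r\ngirl, still has to be split on the line break as well):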
foreach (String word in contents.Select(w => w.Trim()))
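One way to handle spaces, tabs and every kind of line break at once is to let Split use its default white-space delimiters. This is a sketch with hypothetical names (LyricsSplit, Words); passing a null separator array is documented to make Split treat white-space characters as the delimiters.

```csharp
using System;

public static class LyricsSplit
{
    // A null separator array makes Split break on any white-space character,
    // so "em\r\ngirl" is separated into "em" and "girl".
    public static string[] Words(string text) =>
        text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
}
```

With this split in place, no Trim call is needed, because no token can contain white space any more.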
