Working on huge text file, C#. Modifying the file

Working on huge text file, C#. Modifying the file - c#

Please, help me resolve this issue.
I have a huge input.txt. Now it's 465 Mb, but later it will be 1Gb at least.
User enters a term (not a whole word). Using that term I need to find a word that contains it, put it between <strong> tags and save the contents to the output.txt. The term-search should be case insensitive.
This is what I have so far. It works on small texts, but doesn't on bigger ones.
Regex regex = new Regex(" ");
string text = File.ReadAllText("input.txt");
Console.WriteLine("Please, enter a term to search for");
string term = Console.ReadLine();
string[] w = regex.Split(text);
for (int i = 0; i < w.Length; i++)
{
if (Processor.Contains(w[i], term, StringComparison.OrdinalIgnoreCase))
{
w[i] = #"<strong>" + w[i] + #"</string>";
}
}
string result = null;
result = string.Join(" ", w);
File.WriteAllText("output.txt", result);

Trying to read the entire file in one go is causing your memory exception. Look into reading the file in stages. The FileStream and BufferedStream classes provide ways of doing this:
https://msdn.microsoft.com/en-us/library/system.io.filestream(v=vs.110).aspx
https://msdn.microsoft.com/en-us/library/system.io.bufferedstream.read(v=vs.110).aspx

Try not to load the entire file into memory, avoid huge GB-size arrays, Strings etc. (you may just not have enough RAM). Can you process the file line by line (i.e. you don't have multiline terms, do you?)? If it's your case then
...
var source = File
.ReadLines("input.txt") // Notice absence of "All", not ReadAllLines
.Select(line => line.Split(' ')) // You don't need Regex here, just Split
.Select(items => items
.Select(item => String.Equals(item, term, StringComparison.OrdinalIgnoreCase)
? #"<strong>" + term + #"</strong>"
: item))
.Select(items => String.Join(" ", items));
File.WriteAllLines("output.txt", source);

Read the file line by line (or buffer more lines). A bit slower but should work.
Also there can be a problem if all the lines match your term. Consider writing results in a temporary file when you find them and then just rename/move the file to the destination folder.

Related

How to contact whole text from file into the string avoiding empty lines beetwen strings

How to get whole text from document contacted into the string. I'm trying to split text by dot: string[] words = s.Split('.'); I want take this text from text document. But if my text document contains empty lines between strings, for example:
pat said, “i’ll keep this ring.”
she displayed the silver and jade wedding ring which, in another time track,
she and joe had picked out; this
much of the alternate world she had elected to retain. he wondered what - if any - legal basis she had kept in addition. none, he hoped; wisely, however, he said nothing. better not even to ask.
result looks like this:
1. pat said ill keep this ring
2. she displayed the silver and jade wedding ring which in another time track
3. she and joe had picked out this
4. much of the alternate world she had elected to retain
5. he wondered what if any legal basis she had kept in addition
6. none he hoped wisely however he said nothing
7. better not even to ask
but desired correct output should be like this:
1. pat said ill keep this ring
2. she displayed the silver and jade wedding ring which in another time track she and joe had picked out this much of the alternate world she had elected to retain
3. he wondered what if any legal basis she had kept in addition
4. none he hoped wisely however he said nothing
5. better not even to ask
So to do this first I need to process text file content to get whole text as single string, like this:
pat said, “i’ll keep this ring.” she displayed the silver and jade wedding ring which, in another time track, she and joe had picked out; this much of the alternate world she had elected to retain. he wondered what - if any - legal basis she had kept in addition. none, he hoped; wisely, however, he said nothing. better not even to ask.
I can't to do this same way as it would be with list content for example: string concat = String.Join(" ", text.ToArray());,
I'm not sure how to contact text into string from text document

I think this is what you want:
var fileLocation = #"c:\\myfile.txt";
var stringFromFile = File.ReadAllText(fileLocation);
//replace Environment.NewLine with any new line character your file uses
var withoutNewLines = stringFromFile.Replace(Environment.NewLine, "");
//modify to remove any unwanted character
var withoutUglyCharacters = Regex.Replace(withoutNewLines, "[“’”,;-]", "");
var withoutTwoSpaces = withoutUglyCharacters.Replace(" ", " ");
var result = withoutTwoSpaces.Split('.').Where(i => i != "").Select(i => i.TrimStart()).ToList();
So first you read all text from your file, then you remove all unwanted characters and then split by . and return non empty items

Have you tried replacing double new-lines before splitting using a period?
static string[] GetSentences(string filePath) {
if (!File.Exists(filePath))
throw new FileNotFoundException($"Could not find file { filePath }!");
var lines = string.Join("", File.ReadLines(filePath).Where(line => !string.IsNullOrEmpty(line) && !string.IsNullOrWhiteSpace(line)));
var sentences = Regex.Split(lines, #"\.[\s]{1,}?");
return sentences;
}
I haven't tested this, but it should work.
Explanation:
if (!File.Exists(filePath))
throw new FileNotFoundException($"Could not find file { filePath }!");
Throws an exception if the file could not be found. It is advisory you surround the method call with a try/catch.
var lines = string.Join("", File.ReadLines(filePath).Where(line => !string.IsNullOrEmpty(line) && !string.IsNullOrWhiteSpace(line)));
Creates a string, and ignores any lines which are purely whitespace or empty.
var sentences = Regex.Split(lines, #".[\s]{1,}?");
Creates a string array, where the string is split at every period and whitespace following the period.
E.g:
The string "I came. I saw. I conquered" would become
I came
I saw
I conquered
Update:
Here's the method as a one-liner, if that's your style?
static string[] SplitSentences(string filePath) => File.Exists(filePath) ? Regex.Split(string.Join("", File.ReadLines(filePath).Where(line => !string.IsNullOrEmpty(line) && !string.IsNullOrWhiteSpace(line))), #"") : null;

I would suggest you to iterate through all characters and just check if they are in range of 'a' >= char <= 'z' or if char == ' '. If it matches the condition then add it to the newly created string else check if it is '.' character and if it is then end your line and add another one :
List<string> lines = new List<string>();
string line = string.Empty;
foreach(char c in str)
{
if((char.ToLower(c) >= 'a' && char.ToLower(c) <= 'z') || c == 0x20)
line += c;
else if(c == '.')
{
lines.Add(line.Trim());
line = string.Empty;
}
}
Working online example
Or if you prefer "one-liner"s :
IEnumerable<string> lines = new string(str.Select(c => (char)(((char.ToLower(c) >= 'a' && char.ToLower(c) <= 'z') || c == 0x20) ? c : c == '.' ? '\n' : '\0')).ToArray()).Split('\n').Select(s => s.Trim());

I may be wrong about this. I would think that you may not want to alter the string if you are splitting it. Example, there are double/single quote(s) (“) in part of the string. Removing them may not be desired which brings up the possibly of a question, reading a text file that contains single/double quotes (as your example data text shows) like below:
var stringFromFile = File.ReadAllText(fileLocation);
will not display those characters properly in a text box or the console because the default encoding using the ReadAllText method is UTF8. Example the single/double quotes will display (replacement characters) as diamonds in a text box on a form and will be displayed as a question mark (?) when displayed to the console. To keep the single/double quotes and have them display properly you can get the encoding for the OS’s current ANSI encoding by adding a parameter to the ReadAllText method like below:
string stringFromFile = File.ReadAllText(fileLocation, ASCIIEncoding.Default);
Below is code using a simple split method to .split the string on periods (.) Hope this helps.
private void button1_Click(object sender, EventArgs e) {
string fileLocation = #"C:\YourPath\YourFile.txt";
string stringFromFile = File.ReadAllText(fileLocation, ASCIIEncoding.Default);
string bigString = stringFromFile.Replace(Environment.NewLine, "");
string[] result = bigString.Split('.');
int count = 1;
foreach (string s in result) {
if (s != "") {
textBox1.Text += count + ". " + s.Trim() + Environment.NewLine;
Console.WriteLine(count + ". " + s.Trim());
count++;
}
else {
// period at the end of the string
}
}
}

C# Dictionary allowing seemingly identical keys

I have created a dictionary, and created code to read a txt file, and input each word from the file into the dictionary.
//Set up OpenFileDialog box, and prompt user to select file to open
DialogResult openFileResult;
OpenFileDialog file = new OpenFileDialog() ;
file.Filter = "txt files (*.txt)|*.txt";
openFileResult = file.ShowDialog();
if (openFileResult == DialogResult.OK)
{
//If user selected file successfully opened
//Reset form
this.Controls.Clear();
this.InitializeComponent();
//Read from file, split into array of words
Stream fs = file.OpenFile();
StreamReader reader;
reader = new StreamReader(fs);
string line = reader.ReadToEnd();
string[] words = line.Split(' ', '\n');
//Add each word and frequency to dictionary
foreach (string s in words)
{
AddToDictionary(s);
}
//Reset variables, and set-up chart
ResetVariables();
ChartInitialize();
foreach (string s in wordDictionary.Keys)
{
//Calculate statistics from dictionary
ComputeStatistics(s);
if (dpCount < 50)
{
AddToGraph(s);
}
}
//Print statistics
PrintStatistics();
}
And the AddToDictionary(s) function is:
public void AddToDictionary(string s)
{
//Function to add string to dictionary
string wordLower = s.ToLower();
if (wordDictionary.ContainsKey(wordLower))
{
int wordCount = wordDictionary[wordLower];
wordDictionary[wordLower] = wordDictionary[wordLower] + 1;
}
else
{
wordDictionary.Add(wordLower, 1);
txtUnique.Text += wordLower + ", ";
}
}
The text file being read by this program is:
To be or not to be that is the question
Whether tis nobler in the mind to suffer
The slings and arrows of outrageous fortune
Or to take arms against a sea of troubles
And by opposing end them To die to sleep
No more and by a sleep to say we end
The heartache and the thousand natural shocks
That flesh is heir to Tis a consummation
Devoutly to be wished To die to sleep
To sleep perchance to dream ay theres the rub
For in that sleep of death what dreams may come
When we **have** shuffled off this mortal coil
Must give us pause Theres the respect
That makes calamity of so long life
For who would bear the whips and scorns of time
The oppressors wrong the proud mans contumely
The pangs of despised love the laws delay
The insolence of office and the spurns
That patient merit of th unworthy takes
When he himself might his quietus make
With a bare bodkin Who would fardels bear
To grunt and sweat under a weary life
But that the dread of something after death
The undiscovered country from whose bourn
No traveller returns puzzles the will
And makes us rather bear those ills we **have**
Than fly to others that we know not of
Thus conscience does make cowards of us all
And thus the native hue of resolution
Is sicklied oer with the pale cast of thought
And enterprise of great pitch and moment
With this regard their currents turn awry
And lose the name of action Soft you now
The fair Ophelia Nymph in thy orisons
Be all my sins remembered
The problem I am encountering is that the word "have" is appearing twice in the dictionary. I know this doesn't happen with dictionaries, but for some reason it is appearing twice. Does anyone know why this would happen?

If you run:
var sb = new StringBuilder();
sb.AppendLine("test which");
sb.AppendLine("is a test");
var words = sb.ToString().Split(' ', '\n').Distinct();
Inspecting words in the debugger shows that some instances of "test" have acquired a \r due to the two byte CRLF line terminator - which isn't treated by the split.
To fix, change your split to:
Split(new[] {" ", Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)

Splitting text into words is generally hard solution if you want to support multiple languages. Regular expressions are generally better in dealing with parsing than basic String.Split.
I.e. in your case you are picking up variations of "new line" as part of a word, you also could be picking up things like non-breaking space,...
Following code will pick words better than your current .Split, for more info - How do I split a phrase into words using Regex in C#
var words = Regex.Split(line, #"\W+").ToList();
Additionally you should make sure your dictionary is case insensitive like following (pick comparer based on your needs, there are culture-aware once too):
var dictionary = new Dictionary(StringComparer.OrdinalIgnoreCase);

I would be inclined to change the following code:
//Read from file, split into array of words
Stream fs = file.OpenFile();
StreamReader reader;
reader = new StreamReader(fs);
string line = reader.ReadToEnd();
string[] words = line.Split(' ', '\n');
//Add each word and frequency to dictionary
foreach (string s in words)
{
AddToDictionary(s);
}
to this:
wordDictionary =
File
.ReadAllLines(file)
.SelectMany(x => x.Split(new [] { ' ', }, StringSplitOptions.RemoveEmptyEntries))
.Select(x => x.ToLower())
.GroupBy(x => x)
.ToDictionary(x => x.Key, x => x.Count());
This completely avoids the issues with line endings and also has the added advantage that it doesn't leave any undisposed streams lying around.

Reading file names, respectively?

I have 1000 files in a folder, I want to find the name of the file, but when I do it
The file names are not sorted.
For example: These are my filenames
1-CustomerApp.txt
2-CustomerApp.txt
3-CustomerApp.txt
...
var adddress = #"data\";
str = "";
DirectoryInfo d = new DirectoryInfo(adddress);//Assuming Test is your Folder
FileInfo[] Files = d.GetFiles("*.txt"); //Getting Text files
foreach (FileInfo file in Files)
{
str = str + ", " + file.Name;
}

You could use this to keep a "numerical" order
FileInfo[] files = d.GetFiles("*.txt")
.OrderBy(m => m.Name.PadLeft(200, '0')).ToArray();
200 is quite arbitrary, of course.
This will add "as many 0 as needed" so that the file name + n 0 are a string with a length of 200.
to make it a little bit less arbitrary and "brittle", you could do
var f = d.GetFiles("*.txt");
var maxLength = f.Select(l => l.Name.Length).Max() + 1;
FileInfo[] files = f.OrderBy(m => m.Name.PadLeft(maxLength, '0')).ToArray();
How does it work (bad explanation, someone could do better) and why is it far from perfect :
Ordering of characters in c# puts numeric characters before alpha characters.
So 0 (or 9) will be before a.
But 10a, will be before 2a as 1 comes before 2.
see char after char
1 / 2 => 10a first
But if we change
10a and 2a, with PadLeft, to
010a and 002a, we see, character after character
0 / 0 => equal
1 / 0 => 002a first
This "works" in your case, but really depends on your "file naming" logic.
CAUTION : This solution won't work with other file naming logic
For example, the numeric part is not at the start.
f-2-a and f-10-a
Because
00-f-2-a would still be before 0-f-10-a
or the "non-numeric part" is not of the same length.
1-abc and 2-z
Because
01-abc will come after 0002-z

They are sorted alphabetically. I don't know how do you see what they are not sorted (with what you compare or where have you seen different result), if you are using file manager, then perhaps it apply own sorting. In windows explorer result will be the same (try to sort by name column).
If you know template of how file name is created, then you can do a trick, to apply own sorting. In case of your example, extract number at start (until minus) and pad it with zeroes (until maximum possible number size), and then sort what you get.
Or you can pad with zeroes when files are created:
0001-CustomerApp.txt
0002-CustomerApp.txt
...
9999-CustomerApp.txt

This should give you more or less what you want; it skips any characters until it hits a number, then includes all directly following numerical characters to get the number. That number is then used as a key for sorting. Somewhat messy perhaps, but should work (and be less brittle than some other options, since this searches explicitly for the first "whole" number within the filename):
var filename1 = "10-file.txt";
var filename2 = "2-file.txt";
var filenames = new []{filename1, filename2};
var sorted =
filenames.Select(fn => new {
nr = int.Parse(new string(
fn.SkipWhile(c => !Char.IsNumber(c))
.TakeWhile(c => Char.IsNumber(c))
.ToArray())),
name = fn
})
.OrderBy(file => file.nr)
.Select(file => file.name);

Try the alphanumeric-sorting and sort the files' names according to it.
FileInfo[] files = d.GetFiles("*.txt");
files.OrderBy(file=>file.Name, new AlphanumComparatorFast())

IndexOf does not correctly identify if a line starts with a value

How can I remove a whole line from a text file if the first word matches to a variable I have?
What I'm currently trying is:
List<string> lineList = File.ReadAllLines(dir + "textFile.txt").ToList();
lineList = lineList.Where(x => x.IndexOf(user) <= 0).ToList();
File.WriteAllLines(dir + "textFile.txt", lineList.ToArray());
But I can't get it to remove.

The only mistake that you have is you are checking <= 0 with indexOf, instead of = 0.
-1 is returned when the string does not contain the searched for string.
<= 0 means either starts with or does not contain
=0 means starts with <- This is what you want

This method will read the file line-by-line instead of all at once. Also note that this implementation is case-sensitive.
It also assumes you aren't subjected to leading spaces.
using (var writer = new StreamWriter("temp.file"))
{
//here I only write back what doesn't match
foreach(var line in File.ReadLines("file").Where(x => !x.StartsWith(user)))
writer.WriteLine(line); // not sure if this will cause a double-space ?
}
File.Move("temp.file", "file");

You were pretty close, String.StartsWith handles that nicely:
// nb: if you are case SENSITIVE remove the second argument to ll.StartsWith
File.WriteAllLines(
path,
File.ReadAllLines(path)
.Where(ll => ll.StartsWith(user, StringComparison.OrdinalIgnoreCase)));
For really large files that may not be well performing, instead:
// Write our new data to a temp file and read the old file On The Fly
var temp = Path.GetTempFileName();
try
{
File.WriteAllLines(
temp,
File.ReadLines(path)
.Where(
ll => ll.StartsWith(user, StringComparison.OrdinalIgnoreCase)));
File.Copy(temp, path, true);
}
finally
{
File.Delete(temp);
}
Another issue noted was that both IndexOf and StartsWith will treat ABC and ABCDEF as matches if the user is ABC:
var matcher = new Regex(
#"^" + Regex.Escape(user) + #"\b", // <-- matches the first "word"
RegexOptions.CaseInsensitive);
File.WriteAllLines(
path,
File.ReadAllLines(path)
.Where(ll => matcher.IsMatch(ll)));

Use `= 0` instead of `<= 0`.

C# implementation of Dictionary to count occurrences of words returns duplicate words in output

I recently made a little application to read in a text file of lyrics, then use a Dictionary to calculate how many times each word occurs. However, for some reason I'm finding instances in the output where the same word occurs multiple times with a tally of 1, instead of being added onto the original tally of the word. The code I'm using is as follows:
StreamReader input = new StreamReader(path);
String[] contents = input.ReadToEnd()
.ToLower()
.Replace(",","")
.Replace("(","")
.Replace(")", "")
.Replace(".","")
.Split(' ');
input.Close();
var dict = new Dictionary<string, int>();
foreach (String word in contents)
{
if (dict.ContainsKey(word))
{
dict[word]++;
}else{
dict[word] = 1;
}
}
var ordered = from k in dict.Keys
orderby dict[k] descending
select k;
using (StreamWriter output = new StreamWriter("output.txt"))
{
foreach (String k in ordered)
{
output.WriteLine(String.Format("{0}: {1}", k, dict[k]));
}
output.Close();
timer.Stop();
}
The text file I'm inputting is here: http://pastebin.com/xZBHkjGt (it's the lyrics of the top 15 rap songs, if you're curious)
The output can be found here: http://pastebin.com/DftANNkE
A quick ctrl-F shows that "girl" occurs at least 13 different times in the output. As far as I can tell, it is the exact same word, unless there's some sort of difference in ASCII values. Yes, there are some instances on there with odd characters in place of a apostrophe, but I'll worry about those later. My priority is figuring out why the exact same word is being counted 13 different times as different words. Why is this happening, and how do I fix it? Any help is much appreciated!

Another way is to split on non words.
var lyrics = "I fly with the stars in the skies I am no longer tryin' to survive I believe that life is a prize But to live doesn't mean your alive Don't worry bout me and who I fire I get what I desire, It's my empire And yes I call the shots".ToLower();
var contents = Regex.Split(lyrics, #"[^\w'+]");
Also here's an alternative (and probably more obscure) loop
int value;
foreach (var word in contents)
{
dict[word] = dict.TryGetValue(word, out value) ? ++value : 1;
}
dict.Remove("");

If you notice, the repeat occurrences appear on a line following a word which apparently doesn't have a count.
You're not stripping out newlines, so em\r\ngirl is being treated as a different word.

String[] contents = input.ReadToEnd()
.ToLower()
.Replace(",", "")
.Replace("(", "")
.Replace(")", "")
.Replace(".", "")
.Split("\r\n ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
Works better.

Add Trim to each word:
foreach (String word in contents.Select(w => w.Trim()))

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.