Problem: I need to iterate through multiple .txt files in a folder and read them. While reading, I need to note which words occurred in each file.
For example:
File 1 text: "John is my friend friend" -> words: John, is, my, friend
File 2 text: "John is Mark" -> words: John, is, Mark
Currently I read the files and merge them into one big file, but that approach does not work for this, so I have to read them separately. Old idea:
string[] filesZ = { "1.txt", "2.txt" };
var allLinesZ = filesZ.SelectMany(i => System.IO.File.ReadAllLines(i));
System.IO.File.WriteAllLines("n.txt", allLinesZ.ToArray());
var logFileZ = File.ReadAllLines("n.txt");
So this is the first question: how do I iterate through all of the files and read each of them without creating one big file?
The second question is how to make a word counter for separate files. Currently, for one big file, I am using:
var logFileZ = File.ReadAllLines("n.txt");
List<string> LogListZ = new List<string>(logFileZ);
var fi = new Dictionary<string, int>();
LogListZ.ForEach(str => AddToDictionary(fi, str));
foreach (var entry in fi)
{
    Console.WriteLine(entry.Key + ": " + entry.Value);
}
This is AddToDictionary:
static void AddToDictionary(Dictionary<string, int> dictionary, string input)
{
    input.Split(new[] { ' ', ',', '.', '?', '!' }, StringSplitOptions.RemoveEmptyEntries).ToList().ForEach(n =>
    {
        if (dictionary.ContainsKey(n))
            dictionary[n]++;
        else
            dictionary.Add(n, 1);
    });
}
I was thinking about making a loop through all the files (is it possible?) and, inside it, a counter that counts in how many files a word (for example, John) appeared. I don't need a specific file number, just the number of files a word occurs in, without counting a word twice within one file (like friend in example file 1).
You don't have to do much for part one of your question: remove the WriteAllLines call, remove the ReadAllLines call for "n.txt", rename the allLinesZ variable to logFileZ, and add a ToList or ToArray call:
var logFileZ = filesZ
.SelectMany(i => System.IO.File.ReadAllLines(i))
.ToList();
You can make the counter in one go as well: read each file, split its lines into words with SelectMany, take the Distinct words per file, group the flattened result with GroupBy, and convert to a dictionary using Count() as the value:
var counts = filesZ
    .SelectMany(file => System.IO.File.ReadAllLines(file)
        .SelectMany(line => line.Split(new[] { ' ', ',', '.', '?', '!' }, StringSplitOptions.RemoveEmptyEntries))
        .Distinct())
    .GroupBy(word => word)
    .ToDictionary(g => g.Key, g => g.Count());
The Distinct() call ensures that the same word will not be counted twice when it appears more than once in a single file.
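Putting both parts together, here is a minimal self-contained sketch (it creates the two sample files from the question so it can run as-is):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class WordDocumentCounts
{
    static void Main()
    {
        // Sample files from the question, created here so the sketch is self-contained.
        File.WriteAllText("1.txt", "John is my friend friend");
        File.WriteAllText("2.txt", "John is Mark");
        string[] filesZ = { "1.txt", "2.txt" };

        var separators = new[] { ' ', ',', '.', '?', '!' };

        // Per file: read it, split into words, keep each word once (Distinct),
        // then count across files how many files each word appeared in.
        var counts = filesZ
            .SelectMany(file => File.ReadAllLines(file)
                .SelectMany(line => line.Split(separators, StringSplitOptions.RemoveEmptyEntries))
                .Distinct())
            .GroupBy(word => word)
            .ToDictionary(g => g.Key, g => g.Count());

        foreach (var entry in counts)
            Console.WriteLine(entry.Key + ": " + entry.Value);
        // e.g. John: 2, is: 2, my: 1, friend: 1, Mark: 1
    }
}
```

Note that "friend" is 1 even though it appears twice in file 1, which is exactly the Distinct() effect described above.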
Related
I have a .txt file which I would like to split using the split method. My current code is:
string[] alltext = File.ReadAllText(fullPath).Split(new[] { ',' }, 3);
The problem I now have is that I want it to loop through the whole file in a way that it always splits the text into three pieces that belong together. If I have a text with:
testing, testing,
buenooo diasssss
testing, testing,
buenooo diasssss
testing, testing,
buenooo diasssss
(the format here is hard to display, but I want to show that they are on different lines, so reading line by line will most likely not be possible)
I want "testing", "testing", "buenooo diasssss" to be displayed on my console although they are on different lines.
If it were line-based I would simply loop through each line, but that does not work in this case.
You can first remove "\r\n" (the newline) from the text, then split and take the first three items.
var alltext = File.ReadAllText(fullPath).Replace("\r\n", "").Split(',').Take(3);
foreach (var item in alltext)
    Console.WriteLine(item);
Edit
If you want all three items to be displayed in one line in the console:
int lineNumber = 0;
var alltext = File.ReadAllText(fullPath).Split(new string[] { "\r\n", "," }, StringSplitOptions.None).ToList();
alltext.RemoveAll(item => item == "");
while (lineNumber * 3 < alltext.Count)
{
    var tempList = alltext.Skip(lineNumber * 3).Take(3).ToList();
    lineNumber++;
    Console.WriteLine("line {0} => {1}, {2}, {3}", lineNumber, tempList[0], tempList[1], tempList[2]);
}
Try this:
var data =
    File.ReadLines(fullpath)
        .Select((x, n) => (line: x, group: n / 3))
        .GroupBy(x => x.group, x => x.line)
        .Select(g =>
            String
                .Concat(g)
                .Split(',', StringSplitOptions.RemoveEmptyEntries)
                .Select(s => s.Trim()));
I’m just so close, but my program is still not working properly. I am trying to count how many times a set of words appear in a text file, list those words and their individual count and then give a sum of all the found matched words.
If there are 3 instances of “lorem”, 2 instances of “ipsum”, then the total should be 5.
My sample text file is simply a paragraph of “Lorem ipsum” repeated a few times in a text file.
My problem is that this code I have so far, only counts the first occurrence of each word, even though each word is repeated several times throughout the text file.
I am using a “pay for” parser called “GroupDocs.Parser” that I added through the NuGet package manager. I would prefer not to use a paid for version if possible.
Is there an easier way to do this in C#?
Here’s a screen shot of my desired results.
Here is the full code that I have so far.
using GroupDocs.Parser;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace ConsoleApp5
{
    class Program
    {
        static void Main(string[] args)
        {
            using (Parser parser = new Parser(@"E:\testdata\loremIpsum.txt"))
            {
                // Extract a text into the reader
                using (TextReader reader = parser.GetText())
                {
                    // Define the search terms.
                    string[] wordsToMatch = { "Lorem", "ipsum", "amet" };
                    Dictionary<string, int> stats = new Dictionary<string, int>();
                    string text = reader.ReadToEnd();
                    char[] chars = { ' ', '.', ',', ';', ':', '?', '\n', '\r' };
                    // split words
                    string[] words = text.Split(chars);
                    int minWordLength = 2; // to count words having more than 2 characters
                    // iterate over the word collection to count occurrences
                    foreach (string word in wordsToMatch)
                    {
                        string w = word.Trim().ToLower();
                        if (w.Length > minWordLength)
                        {
                            if (!stats.ContainsKey(w))
                            {
                                // add new word to collection
                                stats.Add(w, 1);
                            }
                            else
                            {
                                // update word occurrence count
                                stats[w] += 1;
                            }
                        }
                    }
                    // order the collection by word count
                    var orderedStats = stats.OrderByDescending(x => x.Value);
                    // print occurrence of each word
                    foreach (var pair in orderedStats)
                    {
                        Console.WriteLine("Total occurrences of {0}: {1}", pair.Key, pair.Value);
                    }
                    // print total word count
                    Console.WriteLine("Total word count: {0}", stats.Count);
                    Console.ReadKey();
                }
            }
        }
    }
}
Any suggestions on what I'm doing wrong?
Thanks in advance.
Splitting the entire content of the text file to get a string array of the words is not a good idea because doing so will create a new string object in memory for each word. You can imagine the cost when you deal with big files.
An alternative approach is:
Use the Parallel.ForEach method to read the lines from the text file in parallel.
Use the thread-safe ConcurrentDictionary<TKey,TValue> collection to be accessed by the paralleled threads.
Increment the values of each word (key) by the count of the Regex.Matches Method.
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.IO;
using System.Threading.Tasks;
using System.Text.RegularExpressions;

static void Main(string[] args)
{
    var file = @"loremIpsum.txt";
    var obj = new object();
    var wordsToMatch = new ConcurrentDictionary<string, int>();
    wordsToMatch.TryAdd("Lorem", 0);
    wordsToMatch.TryAdd("ipsum", 0);
    wordsToMatch.TryAdd("amet", 0);
    Console.WriteLine("Press a key to continue...");
    Console.ReadKey();
    Parallel.ForEach(File.ReadLines(file),
        (line) =>
        {
            foreach (var word in wordsToMatch.Keys)
                lock (obj)
                    wordsToMatch[word] += Regex.Matches(line, word,
                        RegexOptions.IgnoreCase).Count;
        });
    foreach (var kv in wordsToMatch.OrderByDescending(x => x.Value))
        Console.WriteLine($"Total occurrences of {kv.Key}: {kv.Value}");
    Console.WriteLine($"Total word count: {wordsToMatch.Values.Sum()}");
    Console.ReadKey();
}
stats is a dictionary, so stats.Count will only tell you how many distinct words there are. You need to add up all the values in it. Something like stats.Values.Sum().
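To illustrate the difference with a small hand-built dictionary (using the counts from the example above):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class StatsSum
{
    static void Main()
    {
        // 3 instances of "lorem", 2 instances of "ipsum", as in the question.
        var stats = new Dictionary<string, int> { ["lorem"] = 3, ["ipsum"] = 2 };

        Console.WriteLine(stats.Count);        // 2 -> number of distinct words
        Console.WriteLine(stats.Values.Sum()); // 5 -> total occurrences
    }
}
```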
You can replace this code with a LINQ query that uses case-insensitive grouping. Eg:
char[] chars = { ' ', '.', ',', ';', ':', '?', '\n', '\r' };
var text = File.ReadAllText(somePath);
var query = text.Split(chars)
    .GroupBy(w => w, StringComparer.OrdinalIgnoreCase)
    .Select(g => new { word = g.Key, count = g.Count() })
    .Where(stat => stat.count > 2)
    .OrderByDescending(stat => stat.count);
At that point you can iterate over the query or copy the results to an array or dictionary with ToArray(), ToList() or ToDictionary().
This isn't the most efficient code - for one thing, the entire file is loaded in memory. One could use File.ReadLines to load and iterate over the lines one by one. LINQ could be used to iterate over the lines as well:
var lines = File.ReadLines(somePath);
var query = lines.SelectMany(line => line.Split(chars))
    .GroupBy(w => w, StringComparer.OrdinalIgnoreCase)
    .Select(g => new { word = g.Key, count = g.Count() })
    .Where(stat => stat.count > 2)
    .OrderByDescending(stat => stat.count);
Please, help me resolve this issue.
I have a huge input.txt. Now it's 465 Mb, but later it will be 1Gb at least.
User enters a term (not a whole word). Using that term I need to find a word that contains it, put it between <strong> tags and save the contents to the output.txt. The term-search should be case insensitive.
This is what I have so far. It works on small texts, but doesn't on bigger ones.
Regex regex = new Regex(" ");
string text = File.ReadAllText("input.txt");
Console.WriteLine("Please, enter a term to search for");
string term = Console.ReadLine();
string[] w = regex.Split(text);
for (int i = 0; i < w.Length; i++)
{
    if (Processor.Contains(w[i], term, StringComparison.OrdinalIgnoreCase))
    {
        w[i] = @"<strong>" + w[i] + @"</strong>";
    }
}
string result = null;
result = string.Join(" ", w);
File.WriteAllText("output.txt", result);
Trying to read the entire file in one go is causing your memory exception. Look into reading the file in stages. The FileStream and BufferedStream classes provide ways of doing this:
https://msdn.microsoft.com/en-us/library/system.io.filestream(v=vs.110).aspx
https://msdn.microsoft.com/en-us/library/system.io.bufferedstream.read(v=vs.110).aspx
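A minimal sketch of staged reading with those classes (the file name, buffer sizes, and UTF-8 assumption are mine):

```csharp
using System;
using System.IO;
using System.Text;

class StagedRead
{
    static void Main()
    {
        // Sample file so the sketch runs; in the real program this is the huge input.txt.
        File.WriteAllText("input.txt", "some sample text to read in stages");

        using (var fs = new FileStream("input.txt", FileMode.Open, FileAccess.Read))
        using (var bs = new BufferedStream(fs, 4096)) // 4 KB buffer
        {
            var buffer = new byte[1024];
            int read;
            while ((read = bs.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Process each chunk here instead of holding the whole file in memory.
                string chunk = Encoding.UTF8.GetString(buffer, 0, read);
                Console.Write(chunk);
            }
        }
        Console.WriteLine();
    }
}
```

One caveat: a chunk boundary can split a word (or a multi-byte character), so real code would carry leftover bytes from one chunk into the next before searching.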
Try not to load the entire file into memory; avoid huge GB-sized arrays, strings, etc. (you may simply not have enough RAM). Can you process the file line by line (i.e. you don't have multiline terms, do you?)? If that's your case, then:
...
var source = File
    .ReadLines("input.txt") // Notice absence of "All", not ReadAllLines
    .Select(line => line.Split(' ')) // You don't need Regex here, just Split
    .Select(items => items
        .Select(item => item.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0
            ? @"<strong>" + item + @"</strong>"
            : item))
    .Select(items => String.Join(" ", items));
File.WriteAllLines("output.txt", source);
Read the file line by line (or buffer a number of lines). It will be a bit slower, but it should work.
Also, there can be a problem if all the lines match your term. Consider writing the results to a temporary file as you find them, and then just rename/move that file to the destination folder.
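A minimal sketch of that line-by-line approach (the sample input, the temp-file name, and the IndexOf-based "contains" check are my assumptions):

```csharp
using System;
using System.IO;

class HighlightTerm
{
    static void Main()
    {
        // Sample input so the sketch runs; the real program reads the huge input.txt.
        File.WriteAllText("input.txt", "Lorem ipsum dolor\nIPSUM again");
        string term = "ipsum"; // in the question this comes from Console.ReadLine()
        string tempPath = "output.tmp";

        using (var reader = new StreamReader("input.txt"))
        using (var writer = new StreamWriter(tempPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string[] words = line.Split(' ');
                for (int i = 0; i < words.Length; i++)
                {
                    // Case-insensitive "contains" check, one line in memory at a time.
                    if (words[i].IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
                        words[i] = "<strong>" + words[i] + "</strong>";
                }
                writer.WriteLine(string.Join(" ", words));
            }
        }

        // Only rename/move once the whole file has been processed.
        File.Delete("output.txt");
        File.Move(tempPath, "output.txt");
    }
}
```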
I have a bunch of text files that has a custom format, looking like this:
App Name
Export Layout
Produced at 24/07/2011 09:53:21
Field Name Length
NAME 100
FULLNAME1 150
ADDR1 80
ADDR2 80
Any whitespaces may be tabs or spaces. The file may contain any number of field names and lengths.
I want to get all the field names and their corresponding field lengths and perhaps store them in a dictionary. This information will be used to process a corresponding fixed width data file having the mentioned field names and field lengths.
I know how to skip lines using ReadLine(). What I don't know is how to say: "When you reach the line that starts with 'Field Name', skip one more line, then starting from the next line, grab all the words on the left column and the numbers on the right column."
I have tried String.Trim() but that doesn't remove the whitespaces in between.
Thanks in advance.
You can use SkipWhile(l => !l.TrimStart().StartsWith("Field Name")).Skip(1):
Dictionary<string, string> allFieldLengths = File.ReadLines("path")
    .SkipWhile(l => !l.TrimStart().StartsWith("Field Name")) // skip lines until the one that starts with "Field Name"
    .Skip(1) // go to next line
    .SkipWhile(l => string.IsNullOrWhiteSpace(l)) // skip following empty line(s)
    .Select(l =>
    { // anonymous method to use "real code"
        var line = l.Trim(); // remove spaces or tabs from start and end of line
        string[] token = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        return new { line, token }; // return an anonymous type
    })
    .Where(x => x.token.Length == 2) // keep only lines with exactly two fields (skip invalid data)
    .Select(x => new { FieldName = x.token[0], Length = x.token[1] })
    .GroupBy(x => x.FieldName) // group lines by FieldName; every group contains its Key + all anonymous types belonging to it
    .ToDictionary(xg => xg.Key, xg => string.Join(",", xg.Select(x => x.Length)));
line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries) splits by spaces and tabs and discards all empty entries. GroupBy ensures that all keys are unique in the dictionary; in the case of duplicate field names, the lengths are joined with a comma.
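A small demonstration of that GroupBy behavior, with a hypothetical duplicate field name:

```csharp
using System;
using System.Linq;

class GroupJoinDemo
{
    static void Main()
    {
        // Parsed field-name/length pairs; the second ADDR1 is a hypothetical duplicate.
        var pairs = new[]
        {
            new { FieldName = "NAME", Length = "100" },
            new { FieldName = "ADDR1", Length = "80" },
            new { FieldName = "ADDR1", Length = "90" },
        };

        // Duplicate keys collapse into one entry; their lengths are comma-joined.
        var dict = pairs
            .GroupBy(x => x.FieldName)
            .ToDictionary(g => g.Key, g => string.Join(",", g.Select(x => x.Length)));

        Console.WriteLine(dict["ADDR1"]); // 80,90
    }
}
```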
Edit: since you have requested a non-LINQ version, here it is:
Dictionary<string, string> allFieldLengths = new Dictionary<string, string>();
bool headerFound = false;
bool dataFound = false;
foreach (string l in File.ReadLines("path"))
{
    string line = l.Trim();
    if (!headerFound && line.StartsWith("Field Name"))
    {
        headerFound = true;
        // skip this line:
        continue;
    }
    if (!headerFound)
        continue;
    if (!dataFound && line.Length > 0)
        dataFound = true;
    if (!dataFound)
        continue;
    string[] token = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
    if (token.Length != 2)
        continue;
    string fieldName = token[0];
    string length = token[1];
    string lengthInDict;
    if (allFieldLengths.TryGetValue(fieldName, out lengthInDict))
        // append this length
        allFieldLengths[fieldName] = lengthInDict + "," + length;
    else
        allFieldLengths.Add(fieldName, length);
}
I like the LINQ version more because it's much more readable and maintainable (imo).
Based on the assumption that the position of the header line is fixed, we may consider actual key-value pairs to start from the 9th line. Then, using the ReadAllLines method to return a String array from the file, we just start processing from index 8 onwards:
string[] lines = File.ReadAllLines(filepath);
Dictionary<string, int> pairs = new Dictionary<string, int>();
for (int i = 8; i < lines.Length; i++)
{
    string[] pair = Regex.Replace(lines[i], "(\\s)+", ";").Split(';');
    pairs.Add(pair[0], int.Parse(pair[1]));
}
This is a skeleton, not accounting for exception handling, but I guess it should get you started.
You can use String.StartsWith() to detect "FieldName". Then String.Split() with a parameter of null to split by whitespace. This will get you your fieldname and length strings.
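For example, passing null as the separator array splits on any whitespace, which handles mixed tabs and spaces (the sample line is an assumption):

```csharp
using System;

class SplitWhitespace
{
    static void Main()
    {
        string line = "NAME\t  100"; // a field line with a tab and spaces

        // A null separator means "split on any whitespace";
        // RemoveEmptyEntries drops the empty strings between consecutive separators.
        string[] parts = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        Console.WriteLine(parts[0] + " / " + parts[1]); // NAME / 100
    }
}
```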
Given a data file delimited by space,
10 10 10 10 222 331
2 3 3 4 45
4 2 2 4
How do I read this file and load it into an array?
Thank you
var fileContent = File.ReadAllText(fileName);
var array = fileContent.Split((string[])null, StringSplitOptions.RemoveEmptyEntries);
if you have numbers only and need a list of int as a result, you can do this:
var numbers = array.Select(arg => int.Parse(arg)).ToList();
It depends on the kind of array you want. If you want to flatten everything into a single-dimensional array, go with Alex Aza's answer, otherwise, if you want a 2-dimensional array that maps to the lines and elements within the text file:
var array = File.ReadAllLines(filename)
.Select(line => line.Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries))
.Where(line => !string.IsNullOrWhiteSpace(line)) // Use this to filter blank lines.
.Select(int.Parse) // Assuming you want an int array.
.ToArray();
Be aware that there is no error handling, so if parsing fails, the above code will throw an exception.
You will be interested in StreamReader.ReadLine() and String.Split()
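A minimal sketch combining those two methods (the file name is an assumption; the sample data is created so the sketch runs):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class ReadArray
{
    static void Main()
    {
        // Sample data from the question, written here so the sketch is self-contained.
        File.WriteAllText("data.txt", "10 10 10 10 222 331\n2 3 3 4 45\n4 2 2 4");

        var rows = new List<int[]>();
        using (var reader = new StreamReader("data.txt"))
        {
            string line;
            while ((line = reader.ReadLine()) != null) // one line at a time
            {
                string[] parts = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
                if (parts.Length == 0) continue; // skip blank lines
                rows.Add(Array.ConvertAll(parts, int.Parse));
            }
        }

        Console.WriteLine(rows.Count + " rows read"); // 3 rows read
    }
}
```

The result is a jagged int[][]-style structure, one inner array per line, since the lines have different lengths.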
I couldn't get Quick Joe Smith's answer to work, so I modified it. I put the modified code into a static method within a "FileReader" class:
public static double[][] readWhitespaceDelimitedDoubles(string[] input)
{
    double[][] array = input.Where(line => !String.IsNullOrWhiteSpace(line)) // Use this to filter blank lines.
        .Select(line => line.Split((string[])null, StringSplitOptions.RemoveEmptyEntries))
        .Select(line => line.Select(element => double.Parse(element)))
        .Select(line => line.ToArray())
        .ToArray();
    return array;
}
For my application, I was parsing for double as opposed to int. To call the code, try using something like this:
string[] fileContents = System.IO.File.ReadAllLines(openFileDialog1.FileName);
double[][] fileContentsArray = FileReader.readWhitespaceDelimitedDoubles(fileContents);
Console.WriteLine("Number of Rows: {0,3}", fileContentsArray.Length);
Console.WriteLine("Number of Cols: {0,3}", fileContentsArray[0].Length);