Best way to read multiple very large files - c#

I need help figuring out the fastest way to read through about 80 files with over 500,000 lines in each file, and write to one master file with each input file's line as a column in the master. The master file must be a plain text file that can be opened in an editor like Notepad rather than a Microsoft Office product, because Office can't handle that many lines.
For example, the master file should look something like this:
File1_Row1,File2_Row1,File3_Row1,...
File1_Row2,File2_Row2,File3_Row2,...
File1_Row3,File2_Row3,File3_Row3,...
etc.
I've tried 2 solutions so far:
Create a jagged array to hold each file's contents, and once all lines in all files have been read, write the master file. The issue with this solution is that Windows throws an error that too much virtual memory is being used.
Dynamically create a reader thread for each of the 80 files that reads a specific line number, and once all threads finish reading a line, combine those values, write them to the file, and repeat for every line in all files. The issue with this solution is that it is very slow.
Does anybody have a better solution for reading so many large files in a fast way?

The best approach is to open each input file with a StreamReader and the output file with a StreamWriter. Then loop over the readers, read a single line from each, and write it to the master file. This way only one line per file is loaded at a time, so memory pressure stays minimal. I was able to copy 80 files of ~500,000 lines each in 37 seconds. An example:
using System;
using System.Collections.Generic;
using System.IO;
using System.Diagnostics;
using System.Linq;
class MainClass
{
static string[] fileNames = Enumerable.Range(1, 80).Select(i => string.Format("file{0}.txt", i)).ToArray();
public static void Main(string[] args)
{
var stopwatch = Stopwatch.StartNew();
List<StreamReader> readers = fileNames.Select(f => new StreamReader(f)).ToList();
try
{
using (StreamWriter writer = new StreamWriter("master.txt"))
{
string line = null;
do
{
for(int i = 0; i < readers.Count; i++)
{
if ((line = readers[i].ReadLine()) != null)
{
writer.Write(line);
}
if (i < readers.Count - 1)
writer.Write(",");
}
writer.WriteLine();
} while (line != null);
}
}
finally
{
foreach(var reader in readers)
{
reader.Close();
}
}
Console.WriteLine("Elapsed {0} ms", stopwatch.ElapsedMilliseconds);
}
}
I've assumed that all the input files have the same number of lines, but you should add logic to keep reading while at least one file still has data.
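For example, a sketch of that adjustment, reusing the readers list and writer from the example above, might look like this:
// Sketch: keep producing rows while at least one reader still returns data;
// exhausted readers simply contribute empty columns.
bool anyRead;
do
{
    anyRead = false;
    for (int i = 0; i < readers.Count; i++)
    {
        string current = readers[i].ReadLine();
        if (current != null)
        {
            anyRead = true;
            writer.Write(current);
        }
        if (i < readers.Count - 1)
            writer.Write(",");
    }
    if (anyRead)
        writer.WriteLine();
} while (anyRead);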

Memory-mapped files seem like a good fit for you: they avoid putting pressure on your app's memory while maintaining good performance in I/O operations.
Here is the complete documentation: Memory-Mapped Files
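For illustration, a minimal sketch (the file name is a placeholder) that maps one input file and reads it line by line without loading it all into memory:
using System.IO;
using System.IO.MemoryMappedFiles;

// Sketch: map an input file and stream it line by line.
// "file1.txt" is a placeholder path.
using (var mmf = MemoryMappedFile.CreateFromFile("file1.txt", FileMode.Open))
using (var stream = mmf.CreateViewStream())
using (var reader = new StreamReader(stream))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // process the line, e.g. append it as a column of the master file
    }
}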

If you have enough memory on the computer, I would use the Parallel.Invoke construct and read each file into a pre-allocated array such as:
string[] file1lines = new string[some value];
string[] file2lines = new string[some value];
string[] file3lines = new string[some value];
Parallel.Invoke(
() =>
{
ReadMyFile(file1,file1lines);
},
() =>
{
ReadMyFile(file2,file2lines);
},
() =>
{
ReadMyFile(file3,file3lines);
}
);
The ReadMyFile method should just use the following sample code (shown here with parameters matching the calls above), which, according to these benchmarks, is the fastest way to read a text file:
static void ReadMyFile(string fileName, string[] lines)
{
int x = 0;
using (StreamReader sr = File.OpenText(fileName))
{
while ((lines[x] = sr.ReadLine()) != null)
{
x += 1;
}
}
}
If you need to manipulate the data from each file before writing your final output, read this article on the fastest way to do that.
Then you just need one method to write the contents of each string[] to the output as you desire.
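As a sketch of that final step (assuming the arrays were all sized to the same number of lines), you could write one comma-separated row per index:
// Sketch: combine the pre-loaded arrays into the master file, one row per index.
using (StreamWriter writer = new StreamWriter("master.txt"))
{
    for (int row = 0; row < file1lines.Length; row++)
    {
        // null entries (from shorter files) become empty columns
        writer.WriteLine(string.Join(",", file1lines[row], file2lines[row], file3lines[row]));
    }
}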

Have an array of open file handles. Loop through this array and read a line from each file into a string array, then combine that array into one row of the master file and append a newline at the end.
This differs from your second approach in that it is single-threaded and doesn't seek to a specific line but always reads the next one.
Of course, you need to handle files that have fewer lines than the others.
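A rough sketch of this approach, using StreamReader instances as the file handles (fileNames is assumed to hold the 80 input paths):
// Sketch: single-threaded pass over an array of open readers, one master row at a time.
StreamReader[] readers = fileNames.Select(f => new StreamReader(f)).ToArray();
using (var writer = new StreamWriter("master.txt"))
{
    string[] row = new string[readers.Length];
    bool anyData;
    do
    {
        anyData = false;
        for (int i = 0; i < readers.Length; i++)
        {
            string line = readers[i].ReadLine();
            if (line != null)
                anyData = true;
            row[i] = line ?? "";   // files with fewer lines contribute empty columns
        }
        if (anyData)
            writer.WriteLine(string.Join(",", row));
    } while (anyData);
}
foreach (var r in readers)
    r.Dispose();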

Related

c# - splitting a large list into smaller sublists

Fairly new to C# - sitting here practicing. I have a single file with 10 million passwords that I downloaded to practice with.
I want to break the file down into lists of 99: stop at 99, do something, then continue where it left off and repeat with the next 99 until it reaches the last item in the file.
I can do the counting part fine; it's the stopping at 99 and continuing where I left off that I'm having trouble with. Anything I find online is not close to what I am trying to do, and anything I add to this code on my own does not work.
I am more than happy to share more information if I am not clear. Just ask and I will respond; however, I might not be able to respond until tomorrow depending on what time it is.
Here is the code I have started:
using System;
using System.IO;
namespace lists01
{
class Program
{
static void Main(string[] args)
{
int count = 0;
var f1 = @"c:\tmp\10-million-password-list-top-1000000.txt";
{
var content = File.ReadAllLines(f1);
foreach (var v2 in content)
{
count++;
Console.WriteLine(v2 + "\t" + count);
}
}
}
}
}
My end goal is to do this with any list of items from files I have. I am only using this password list because it was sizable and thought it would be good for this exercise.
Thank you
Keith
Here are a couple of different ways to approach this. Normally, I would suggest the ReadAllLines function that you already have in your code. The trade-off is that you load the entire file into memory at once, then operate on it.
Using ReadAllLines in concert with LINQ's Skip() and Take() methods, you can chop the lines up into groups like this:
var lines = File.ReadAllLines(fileName);
int linesAtATime = 99;
for (int i = 0; i < lines.Length; i = i + linesAtATime)
{
List<string> currentLinesGroup = lines.Skip(i).Take(linesAtATime).ToList();
DoSomethingWithLines(currentLinesGroup);
}
But, if you are working with a really large file, it might not be practical to load the entire file into memory. Plus, you might not want to leave the file open while you are working on the lines. This option gives you more control over how you move through the file. It just loads the part it needs into memory, and closes the file while you are working on the current set of lines.
List<string> lines = new List<string>();
int maxLines = 99;
long seekPosition = 0;
bool fileLoaded = false;
string line;
while (!fileLoaded)
{
using (Stream stream = File.Open(fileName, FileMode.Open))
{
//Jump back to the previous position
stream.Seek(seekPosition, SeekOrigin.Begin);
using (StreamReader reader = new StreamReader(stream))
{
while (!reader.EndOfStream && lines.Count < maxLines)
{
line = reader.ReadLine();
seekPosition += (line.Length + 2); //Tracks how much data has been read.
lines.Add(line);
}
fileLoaded = reader.EndOfStream;
}
}
DoSomethingWithLines(lines);
lines.Clear();
}
In this case, I used a Stream because it has the ability to seek to a specific position in the file, but then I used a StreamReader because it has the ReadLine() method.

Dispose array of string in a loop

I have the following loop inside a function:
for (int i = 0; i < 46; i++){
String[] arrStr = File.ReadAllLines(path + "File_" + i + ".txt");
List<String> output = new List<String>();
for (int j = 0; j < arrStr.Length; j++){
//Do Something
output.Add(someString);
}
File.WriteAllLines(path + "output_File_" + i + ".txt", output.ToArray());
output.Clear();
}
Each txt file has about 20k lines. The function opens 46 of them, and I need to run the function more than 1k times, so I'm planning to leave the program running overnight. So far I haven't found any errors, but since a 20k-entry String array is referenced in each iteration of the loop, I'm afraid there might be some issue with garbage memory accumulating from the arrays of past iterations. If there is such a risk, which method is best to dispose of the old array in this case?
Also, is it memory safe to run 3 programs like this at the same time?
Use streams inside using blocks; this will handle the memory management for you:
for (int i = 0; i < 46; i++)
{
using (StreamReader reader = new StreamReader(path + "File_" + i + ".txt"))
{
using (StreamWriter writer = new StreamWriter(path + "output_File_" + i + ".txt"))
{
while(!reader.EndOfStream)
{
string line = reader.ReadLine();
// do something with line
writer.WriteLine(line);
}
}
}
}
The Dispose methods of StreamReader and StreamWriter are automatically called when exiting the using block, freeing up any memory used. Using streams also ensures your entire file isn't in memory at once.
More info on MSDN - File Stream and I/O
Sounds like you came from the C world :-)
C# garbage collection is fine, you will not have any problems with that.
I would be more worried about file-system errors.

How do I read from a file?

I'm trying to get my program to read code from a .txt and then read it back to me, but for some reason, it crashes the program when I compile. Could someone let me know what I'm doing wrong? Thanks! :)
using System;
using System.IO;
public class Hello1
{
public static void Main()
{
string winDir=System.Environment.GetEnvironmentVariable("windir");
StreamReader reader=new StreamReader(winDir + "\\Name.txt");
try {
do {
Console.WriteLine(reader.ReadLine());
}
while(reader.Peek() != -1);
}
catch
{
Console.WriteLine("File is empty");
}
finally
{
reader.Close();
}
Console.ReadLine();
}
}
I don't like your solution for two simple reasons:
1) I don't like "gotta catch 'em all" (try/catch). To avoid it, check whether the file exists using System.IO.File.Exists("YourPath").
2) With this code you never dispose the StreamReader. To avoid that, it is better to use a using block, like this: using (StreamReader sr = new StreamReader(path)) { //Your code }
Usage example:
string path="filePath";
if (System.IO.File.Exists(path))
using (System.IO.StreamReader sr = new System.IO.StreamReader(path))
{
while (sr.Peek() > -1)
Console.WriteLine(sr.ReadLine());
}
else
Console.WriteLine("The file not exist!");
If your file is located in the same folder as the .exe, all you need to do is StreamReader reader = new StreamReader("File.txt");
Otherwise, where File.txt is, put the full path to the file. Personally, I think it's easier if they are in the same location.
From there, it's as simple as Console.WriteLine(reader.ReadLine());
If you want to read all lines and display all at once, you could do a for loop:
for (int i = 0; i < lineAmount; i++)
{
Console.WriteLine(reader.ReadLine());
}
Use the code below if you want the result as a string instead of an array.
File.ReadAllText(Path.Combine(winDir, "Name.txt"));
Why not use System.IO.File.ReadAllLines(winDir + "\\Name.txt")?
If all you're trying to do is display this as output in the console, you could do that pretty compactly:
private static string winDir = Environment.GetEnvironmentVariable("windir");
static void Main(string[] args)
{
Console.Write(File.ReadAllText(Path.Combine(winDir, "Name.txt")));
Console.Read();
}
using(var fs = new FileStream(winDir + "\\Name.txt", FileMode.Open, FileAccess.Read))
{
using(var reader = new StreamReader(fs))
{
// your code
}
}
The .NET framework has a variety of ways to read a text file. Each has pros and cons... let's go through two.
The first, is one that many of the other answers are recommending:
String allTxt = File.ReadAllText(Path.Combine(winDir, "Name.txt"));
This will read the entire file into a single String. It will be quick and painless. It comes with a risk, though... if the file is large enough, you may run out of memory. Even if you can store the entire thing in memory, it may be large enough to cause paging, which will make your software run quite slowly. The next option addresses this.
The second solution allows you to work with one line at a time and not load the entire file into memory:
foreach(String line in File.ReadLines(Path.Combine(winDir, "Name.txt")))
// Do Work with the single line.
Console.WriteLine(line);
This solution may take a little longer for files because it's going to do work MORE OFTEN with the contents of the file... however, it will prevent awkward memory errors.
I tend to go with the second solution, but only because I'm paranoid about loading huge Strings into memory.

Issues in with line end when writing multiple files into one file with C#

I'm trying to write 4 sets of 15 txt files into 4 large txt files in order to make it easier to import into another app.
Here's my code:
using System;
using System.IO;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace AggregateMultipleFiles
{
class AggMultiFilestoOneFile
{/*This program can reduce multiple input files and grouping results into one file for easier app loading.*/
static void Main(string[] args)
{
TextWriter writer = new StreamWriter("G:/user/data/yr2009/fy09_filtered.txt");
int linelen =495;
char[] buf = new char[linelen];
int line_num = 1;
for (int i = 1; i <= 15; i++)
{
TextReader reader = File.OpenText("G:/user/data/yr2009/fy09_filtered"+i+".txt");
while (true)
{
int nin = reader.Read(buf, 0, buf.Length);
if (nin == 0 )
{
Console.WriteLine("File ended");
break;
}
writer.Write(new String(buf));
line_num++;
}
reader.Close();
}
Console.WriteLine("done");
Console.WriteLine(DateTime.Now);
Console.ReadLine();
writer.Close();
}
}
}
My problem is somewhere in detecting the end of the file. It doesn't finish writing the last line of a file, and then it starts writing the first line of the next file halfway through the last line of the previous one.
This is throwing off all of my columns and data in the app it imports into.
Someone suggested that perhaps I need to pad the end of each line of each of the 15 files with a carriage return and line feed, \r\n.
Why doesn't what I have work?
Would padding work instead? How would I write that?
Thank you!
I strongly suspect this is the problem:
writer.Write(new String(buf));
You're always creating a string from all of buf, rather than just the first nin characters. If any of your files are short, you may end up with "null" Unicode characters (i.e. U+0000) which may be seen as string terminators in some apps.
There's no need even to create a string - just use:
writer.Write(buf, 0, nin);
(I would also strongly suggest using using statements instead of manually calling Close, by the way.)
It's also worth noting that there's nothing to guarantee that you're really reading a line at a time. You might as well increase your buffer size to something like 32K in order to read the files in potentially fewer chunks.
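Putting those points together, a sketch of the copy loop (keeping the same file-naming pattern as in the question) might look like this:
// Sketch: copy each input in 32K chunks, writing only the characters actually read.
char[] buf = new char[32 * 1024];
using (TextWriter writer = new StreamWriter("G:/user/data/yr2009/fy09_filtered.txt"))
{
    for (int i = 1; i <= 15; i++)
    {
        using (TextReader reader = File.OpenText("G:/user/data/yr2009/fy09_filtered" + i + ".txt"))
        {
            int nin;
            while ((nin = reader.Read(buf, 0, buf.Length)) > 0)
            {
                writer.Write(buf, 0, nin);
            }
        }
    }
}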
Additionally, if the files are small enough, you could read each one into memory completely, which would make your code simpler:
using (var writer = File.CreateText("G:/user/data/yr2009/fy09_filtered.txt"))
{
for (int i = 1; i <= 15; i++)
{
string inputName = "G:/user/data/yr2009/fy09_filtered" + i + ".txt";
writer.Write(File.ReadAllText(inputName));
}
}

How do I locate a particular word in a text file using .NET

I am sending mails (in ASP.NET, C#), using a template in a text file (.txt) like the one below:
User Name :<User Name>
Address : <Address>.
I replace the words within the angle brackets in the text file using the code below:
StreamReader sr;
sr = File.OpenText(HttpContext.Current.Server.MapPath(txt));
copy = sr.ReadToEnd();
sr.Close(); //close the reader
copy = copy.Replace(word.ToUpper(),"#" + word.ToUpper()); //remove the word specified UC
//save new copy into existing text file
FileInfo newText = new FileInfo(HttpContext.Current.Server.MapPath(txt));
StreamWriter newCopy = newText.CreateText();
newCopy.WriteLine(copy);
newCopy.Write(newCopy.NewLine);
newCopy.Close();
Now I have a new problem:
the user will be adding new words within angle brackets, for example <Salary>.
In that case I have to read through and find the word <Salary>.
In other words, I have to find all the words that are located within angle brackets (<>).
How do I do that?
Having a stream for your file, you can build something similar to a typical tokenizer.
In general terms, this works as a finite state machine: you need an enumeration for the states (in this case it could be simplified down to a boolean, but I'll give you the general approach so you can reuse it for similar tasks) and a function implementing the logic. C#'s iterators are quite a good fit for this problem, so I'll be using them in the snippet below. Your function will take the reader as an argument, will use an enumerated value and a StringBuilder buffer internally, and will yield the strings one by one. You'll need this near the start of your code file:
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
And then, inside your class, something like this:
enum States {
OUT,
IN,
}
IEnumerable<string> GetStrings(TextReader reader) {
States state=States.OUT;
StringBuilder buffer = null;
int ch;
while((ch=reader.Read())>=0) {
switch(state) {
case States.OUT:
if(ch=='<') {
state=States.IN;
buffer=new StringBuilder();
}
break;
case States.IN:
if(ch=='>') {
state=States.OUT;
yield return buffer.ToString();
} else {
buffer.Append(Char.ConvertFromUtf32(ch));
}
break;
}
}
}
The finite-state machine model always has the same layout: while(READ_INPUT) { switch(STATE) {...}}: inside each case of the switch, you may be producing output and/or altering the state. Beyond that, the algorithm is defined in terms of states and state changes: for any given state and input combination, there is an exact new state and output combination (the output can be "nothing" on those states that trigger no output; and the state may be the same old state if no state change is triggered).
Hope this helps.
EDIT: forgot to mention a couple of things:
1) You get a TextReader to pass to the function by creating a StreamReader for a file, or a StringReader if you already have the file contents in a string.
2) The memory and time costs of this approach are O(n), with n being the length of the file. They seem quite reasonable for this kind of task.
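For example, a usage sketch (the template path is a placeholder) could look like this:
// Sketch: feed the tokenizer a StreamReader over the template file and list the
// words found between angle brackets.
using (var reader = new StreamReader("template.txt"))   // placeholder path
{
    foreach (string word in GetStrings(reader))
    {
        Console.WriteLine(word);   // e.g. "User Name", "Address", "Salary"
    }
}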
Using a regex:
var matches = Regex.Matches(text, "<(.*?)>");
List<string> words = new List<string>();
for (int i = 0; i < matches.Count; i++)
{
words.Add(matches[i].Groups[1].Value);
}
Of course, this assumes you already have the file's text in a variable. Since you have to read the entire file to achieve that, you could look for the words as you are reading the stream, but I don't know what the performance trade-off would be.
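For instance, a sketch that scans the file line by line instead of reading all of it first (the template path is a placeholder, and it assumes a placeholder never spans two lines):
// Sketch: collect the bracketed words while streaming the template file.
var words = new List<string>();
foreach (string line in File.ReadLines("template.txt"))   // placeholder path
{
    foreach (Match m in Regex.Matches(line, "<(.*?)>"))
    {
        words.Add(m.Groups[1].Value);
    }
}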
This is not an answer, but comments can't do this:
You should place some of your objects into using blocks. Something like this:
using(StreamReader sr = File.OpenText(HttpContext.Current.Server.MapPath(txt)))
{
copy = sr.ReadToEnd();
} // reader is closed by the end of the using block
//remove the word specified UC
copy = copy.Replace(word.ToUpper(), "#" + word.ToUpper());
//save new copy into existing text file
FileInfo newText = new FileInfo(HttpContext.Current.Server.MapPath(txt));
using(var newCopy = newText.CreateText())
{
newCopy.WriteLine(copy);
newCopy.Write(newCopy.NewLine);
}
The using block ensures that resources are cleaned up even if an exception is thrown.
