I have the following loop inside a function:
for (int i = 0; i < 46; i++) {
    String[] arrStr = File.ReadAllLines(path + "File_" + i + ".txt");
    List<String> output = new List<String>();
    for (int j = 0; j < arrStr.Length; j++) {
        //Do Something
        output.Add(someString);
    }
    File.WriteAllLines(path + "output_File_" + i + ".txt", output.ToArray());
    output.Clear();
}
Each txt file has about 20k lines. The function opens 46 of them, and I need to run the function more than 1k times, so I'm planning to leave the program running overnight. So far I haven't found any errors, but since a 20k-element String array is referenced on each iteration of the loop, I'm afraid there might be an issue with garbage memory accumulating from the arrays of past iterations. If there is such a risk, which method is best for disposing of the old array in this case?
Also, is it memory safe to run 3 programs like this at the same time?
Use streams with using blocks; this handles the cleanup for you:
for (int i = 0; i < 46; i++)
{
    // Build the per-file names the same way the question does.
    using (StreamReader reader = new StreamReader(path + "File_" + i + ".txt"))
    {
        using (StreamWriter writer = new StreamWriter(path + "output_File_" + i + ".txt"))
        {
            while (!reader.EndOfStream)
            {
                string line = reader.ReadLine();
                // do something with line
                writer.WriteLine(line);
            }
        }
    }
}
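The Dispose methods of StreamReader and StreamWriter are called automatically when execution leaves the using block, flushing buffered output and releasing the file handles. The line strings themselves are still reclaimed by the garbage collector. Using streams also ensures your entire file isn't in memory at once.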
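The same streaming idea can also be sketched with File.ReadLines, which reads lazily. This is only a sketch: the LineFilter class, the TransformFile name, and the transform delegate are hypothetical and stand in for the "do something" step.

```csharp
using System;
using System.IO;
using System.Linq;

static class LineFilter
{
    // Streams inputPath to outputPath, applying transform to each line.
    // Assumes inputPath and outputPath are different files.
    public static void TransformFile(string inputPath, string outputPath, Func<string, string> transform)
    {
        // File.ReadLines is lazy, so lines are read as WriteAllLines consumes them;
        // the whole input file is never held in memory at once.
        File.WriteAllLines(outputPath, File.ReadLines(inputPath).Select(transform));
    }
}
```

Inside the question's loop this would be called once per file, for example with path + "File_" + i + ".txt" as the input and path + "output_File_" + i + ".txt" as the output.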
More info on MSDN - File Stream and I/O
Sounds like you came from the C world :-)
C# garbage collection is fine, you will not have any problems with that.
I would be more worried about file-system errors.
Fairly new to C# - Sitting here practicing. I have a file with 10 million passwords listed in a single file that I downloaded to practice with.
I want to break the file down to lists of 99. Stop at 99 then do something. Then start where it left off and repeat the do something with the next 99 until it reaches the last item in the file.
I can do the count part well, it is the stop at 99 and continue where I left off is where I am having trouble. Anything I find online is not close to what I am trying to do and anything I add to this code on my own does not work.
I am more than happy to share more information if I am not clear. Just ask and will respond however, I might not be able to respond until tomorrow depending on what time it is.
Here is the code I have started:
using System;
using System.IO;
namespace lists01
{
class Program
{
static void Main(string[] args)
{
int count = 0;
var f1 = @"c:\tmp\10-million-password-list-top-1000000.txt";
{
var content = File.ReadAllLines(f1);
foreach (var v2 in content)
{
count++;
Console.WriteLine(v2 + "\t" + count);
}
}
}
}
}
My end goal is to do this with any list of items from files I have. I am only using this password list because it was sizable and thought it would be good for this exercise.
Thank you
Keith
Here are a couple of different ways to approach this. Normally, I would suggest the ReadAllLines function that you have in your code. The trade-off is that you load the entire file into memory at once and then operate on it.
Using ReadAllLines in concert with LINQ's Skip() and Take() methods, you can chop the lines up into groups like this:
var lines = File.ReadAllLines(fileName);
int linesAtATime = 99;
for (int i = 0; i < lines.Length; i = i + linesAtATime)
{
List<string> currentLinesGroup = lines.Skip(i).Take(linesAtATime).ToList();
DoSomethingWithLines(currentLinesGroup);
}
But, if you are working with a really large file, it might not be practical to load the entire file into memory. Plus, you might not want to leave the file open while you are working on the lines. This option gives you more control over how you move through the file. It just loads the part it needs into memory, and closes the file while you are working on the current set of lines.
List<string> lines = new List<string>();
int maxLines = 99;
long seekPosition = 0;
bool fileLoaded = false;
string line;
while (!fileLoaded)
{
using (Stream stream = File.Open(fileName, FileMode.Open))
{
//Jump back to the previous position
stream.Seek(seekPosition, SeekOrigin.Begin);
using (StreamReader reader = new StreamReader(stream))
{
while (!reader.EndOfStream && lines.Count < maxLines)
{
line = reader.ReadLine();
seekPosition += (line.Length + 2); //Tracks how much data has been read. Assumes a single-byte encoding and "\r\n" line endings.
lines.Add(line);
}
fileLoaded = reader.EndOfStream;
}
}
DoSomethingWithLines(lines);
lines.Clear();
}
In this case, I used Stream because it has the ability to seek to a specific position in the file. But then I used StreamReader because it has the ReadLine() method.
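If the file can stay open while each group is processed, a third option is to batch a lazy line sequence, which avoids tracking byte positions entirely. This is a minimal sketch, not part of the approaches above; the LineBatcher class and its Batch method are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

static class LineBatcher
{
    // Yields successive groups of up to batchSize lines from any line sequence.
    public static IEnumerable<List<string>> Batch(IEnumerable<string> lines, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        var batch = new List<string>(batchSize);
        foreach (string line in lines)
        {
            batch.Add(line);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<string>(batchSize); // start a fresh group
            }
        }
        if (batch.Count > 0) yield return batch; // final, possibly shorter, group
    }
}
```

It could be used as foreach (var group in LineBatcher.Batch(File.ReadLines(fileName), 99)) DoSomethingWithLines(group); so that only one group of lines is in memory at a time.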
I want to read a text file line by line. I wanted to know if I'm doing it as efficiently as possible within the .NET C# scope of things.
This is what I'm trying so far:
var filestream = new System.IO.FileStream(textFilePath,
System.IO.FileMode.Open,
System.IO.FileAccess.Read,
System.IO.FileShare.ReadWrite);
var file = new System.IO.StreamReader(filestream, System.Text.Encoding.UTF8, true, 128);
string lineOfText;
while ((lineOfText = file.ReadLine()) != null)
{
//Do something with the lineOfText
}
To find the fastest way to read a file line by line you will have to do some benchmarking. I have done some small tests on my computer but you cannot expect that my results apply to your environment.
Using StreamReader.ReadLine
This is basically your method. For some reason you set the buffer size to the smallest possible value (128). Increasing this will in general increase performance. The default size is 1,024 and other good choices are 512 (the sector size in Windows) or 4,096 (the cluster size in NTFS). You will have to run a benchmark to determine an optimal buffer size. A bigger buffer is - if not faster - at least not slower than a smaller buffer.
const Int32 BufferSize = 128;
using (var fileStream = File.OpenRead(fileName))
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, BufferSize)) {
String line;
while ((line = streamReader.ReadLine()) != null)
{
// Process line
}
}
The FileStream constructor allows you to specify FileOptions. For example, if you are reading a large file sequentially from beginning to end, you may benefit from FileOptions.SequentialScan. Again, benchmarking is the best thing you can do.
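As a rough sketch of that option, the FileStream can be constructed explicitly with FileOptions.SequentialScan and a larger buffer. The SequentialReader class, the CountLines method, and the 4,096 default are illustrative choices, not measured recommendations.

```csharp
using System.IO;
using System.Text;

static class SequentialReader
{
    // Counts lines while hinting to the OS that the file is read front to back.
    public static int CountLines(string fileName, int bufferSize = 4096)
    {
        int count = 0;
        using (var fileStream = new FileStream(fileName, FileMode.Open, FileAccess.Read,
                                               FileShare.Read, bufferSize, FileOptions.SequentialScan))
        using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, bufferSize))
        {
            while (streamReader.ReadLine() != null)
            {
                count++; // replace with real per-line processing
            }
        }
        return count;
    }
}
```

Whether the hint helps depends on the file size, the disk, and the OS cache, so it is worth timing both variants on your own data.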
Using File.ReadLines
This is very much like your own solution except that it is implemented using a StreamReader with a fixed buffer size of 1,024. On my computer this results in slightly better performance compared to your code with the buffer size of 128. However, you can get the same performance increase by using a larger buffer size. This method is implemented using an iterator block and does not consume memory for all lines.
var lines = File.ReadLines(fileName);
foreach (var line in lines)
// Process line
Using File.ReadAllLines
This is very much like the previous method except that this method grows a list of strings used to create the returned array of lines so the memory requirements are higher. However, it returns String[] and not an IEnumerable<String> allowing you to randomly access the lines.
var lines = File.ReadAllLines(fileName);
for (var i = 0; i < lines.Length; i += 1) {
var line = lines[i];
// Process line
}
Using String.Split
This method is considerably slower, at least on big files (tested on a 511 KB file), probably due to how String.Split is implemented. It also allocates an array for all the lines increasing the memory required compared to your solution.
using (var streamReader = File.OpenText(fileName)) {
var lines = streamReader.ReadToEnd().Split("\r\n".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
foreach (var line in lines)
// Process line
}
My suggestion is to use File.ReadLines because it is clean and efficient. If you require special sharing options (for example you use FileShare.ReadWrite), you can use your own code but you should increase the buffer size.
If you're using .NET 4, simply use File.ReadLines which does it all for you. I suspect it's much the same as yours, except it may also use FileOptions.SequentialScan and a larger buffer (128 seems very small).
While File.ReadAllLines() is one of the simplest ways to read a file, it is also one of the slowest.
If you're just wanting to read lines in a file without doing much, according to these benchmarks, the fastest way to read a file is the age old method of:
using (StreamReader sr = File.OpenText(fileName))
{
string s = String.Empty;
while ((s = sr.ReadLine()) != null)
{
//do minimal amount of work here
}
}
However, if you have to do a lot with each line, then this article concludes that the best way is the following (and it's faster to pre-allocate a string[] if you know how many lines you're going to read) :
AllLines = new string[MAX]; //only allocate memory here
using (StreamReader sr = File.OpenText(fileName))
{
int x = 0;
while (!sr.EndOfStream)
{
AllLines[x] = sr.ReadLine();
x += 1;
}
} //Finished. Close the file
//Now parallel process each line in the file
Parallel.For(0, AllLines.Length, x =>
{
DoYourStuff(AllLines[x]); //do your work here
});
Use the following code:
foreach (string line in File.ReadAllLines(fileName))
This was a HUGE difference in reading performance.
It comes at the cost of memory consumption, but totally worth it!
If the file size is not big, then it is faster to read the entire file and split it afterwards
var filestreams = sr.ReadToEnd().Split(Environment.NewLine,
StringSplitOptions.RemoveEmptyEntries);
There's a good topic about this in Stack Overflow question Is 'yield return' slower than "old school" return?.
It says:
ReadAllLines loads all of the lines into memory and returns a
string[]. All well and good if the file is small. If the file is
larger than will fit in memory, you'll run out of memory.
ReadLines, on the other hand, uses yield return to return one line at
a time. With it, you can read any size file. It doesn't load the whole
file into memory.
Say you wanted to find the first line that contains the word "foo",
and then exit. Using ReadAllLines, you'd have to read the entire file
into memory, even if "foo" occurs on the first line. With ReadLines,
you only read one line. Which one would be faster?
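The "foo" scenario from the quote can be sketched as follows. The FirstMatch class and FindFirst method are hypothetical names used only for illustration.

```csharp
using System;
using System.IO;
using System.Linq;

static class FirstMatch
{
    // Returns the first line containing term, or null if no line matches.
    public static string FindFirst(string fileName, string term)
    {
        // FirstOrDefault stops enumerating at the first match,
        // so ReadLines never reads past that line's buffer.
        return File.ReadLines(fileName).FirstOrDefault(line => line.Contains(term));
    }
}
```

Swapping File.ReadLines for File.ReadAllLines here would give the same result but would load every line before the search starts.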
If you have enough memory, I've found some performance gains by reading the entire file into a memory stream, and then opening a stream reader on that to read the lines. As long as you actually plan on reading the whole file anyway, this can yield some improvements.
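A minimal sketch of that memory-stream approach, assuming the file fits comfortably in memory; the InMemoryReader class and ReadLinesViaMemory method are hypothetical names.

```csharp
using System.Collections.Generic;
using System.IO;

static class InMemoryReader
{
    // Loads the file with one bulk read, then parses lines from memory.
    public static List<string> ReadLinesViaMemory(string fileName)
    {
        var lines = new List<string>();
        byte[] bytes = File.ReadAllBytes(fileName); // single large read from disk
        using (var memory = new MemoryStream(bytes, writable: false))
        using (var reader = new StreamReader(memory))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);
            }
        }
        return lines;
    }
}
```

Any gain comes from replacing many small disk reads with one large one, so it should be benchmarked against a plain StreamReader with a bigger buffer.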
You can't get any faster if you want to use an existing API to read the lines. But reading larger chunks and manually finding each newline in the read buffer would probably be faster.
When you need to efficiently read and process a HUGE text file, ReadLines() and ReadAllLines() are likely to throw an OutOfMemoryException; this was my case. On the other hand, reading each line separately would take ages. The solution was to read the file in blocks, as shown below.
The class:
//can return empty lines sometimes
class LinePortionTextReader
{
private const int BUFFER_SIZE = 100000000; //100M characters
StreamReader sr = null;
string remainder = "";
public LinePortionTextReader(string filePath)
{
if (File.Exists(filePath))
{
sr = new StreamReader(filePath);
remainder = "";
}
}
~LinePortionTextReader()
{
if(null != sr) { sr.Close(); }
}
public string[] ReadBlock()
{
if(null==sr) { return new string[] { }; }
char[] buffer = new char[BUFFER_SIZE];
int charactersRead = sr.Read(buffer, 0, BUFFER_SIZE);
if (charactersRead < 1) { return new string[] { }; }
bool lastPart = (charactersRead < BUFFER_SIZE);
if (lastPart)
{
char[] buffer2 = buffer.Take<char>(charactersRead).ToArray();
buffer = buffer2;
}
string s = new string(buffer);
string[] sresult = s.Split(new string[] { "\r\n" }, StringSplitOptions.None);
sresult[0] = remainder + sresult[0];
if (!lastPart)
{
remainder = sresult[sresult.Length - 1];
sresult[sresult.Length - 1] = "";
}
return sresult;
}
public bool EOS
{
get
{
return (null == sr) ? true: sr.EndOfStream;
}
}
}
Example of use:
class Program
{
static void Main(string[] args)
{
if (args.Length < 3)
{
Console.WriteLine("multifind.exe <where to search> <what to look for, one value per line> <where to put the result>");
return;
}
if (!File.Exists(args[0]))
{
Console.WriteLine("source file not found");
return;
}
if (!File.Exists(args[1]))
{
Console.WriteLine("reference file not found");
return;
}
TextWriter tw = new StreamWriter(args[2], false);
string[] refLines = File.ReadAllLines(args[1]);
LinePortionTextReader lptr = new LinePortionTextReader(args[0]);
int blockCounter = 0;
while (!lptr.EOS)
{
string[] srcLines = lptr.ReadBlock();
for (int i = 0; i < srcLines.Length; i += 1)
{
string theLine = srcLines[i];
if (!string.IsNullOrEmpty(theLine)) //can return empty lines sometimes
{
for (int j = 0; j < refLines.Length; j += 1)
{
if (theLine.Contains(refLines[j]))
{
tw.WriteLine(theLine);
break;
}
}
}
}
blockCounter += 1;
Console.WriteLine(String.Format("100 Mb blocks processed: {0}", blockCounter));
}
tw.Close();
}
}
I believe splitting strings and array handling can be significantly improved,
yet the goal here was to minimize number of disk reads.
I've got a list of 369 different names and I want to print these names into a csv file. All goes well until I look at the resulting csv file, which has only 251 rows. I've tried writing to a .txt instead, and it still outputs only 251 rows. I've stepped through with the debugger, and it is calling writer.WriteLine() 369 times.
Is there some sort of writing restriction in place? If so, why 251? How do I write all 369 names?
Here's my code just in case:
List<String> names = new List<String>();
//Retrieve names from a separate source.
var writer = new StreamWriter(File.OpenWrite(@"C:names.txt"));
for (int i = 0; i < names.Count; i++ )
{
System.Console.WriteLine(names[i].ToString());
writer.WriteLine(names[i].ToString());
}
System.Console.Write(names.Count);
The output on the console shows all 369 names and the names.Count prints 369.
You need to close your StreamWriter, the best way is to use a using block like so:
using (StreamWriter writer = new StreamWriter(File.OpenWrite("C:\\names.txt"))) {
// code here
}
The using block will always call the .Dispose method of StreamWriter which has the effect of flushing the stream. Presently you have buffered-but-unwritten data in your StreamWriter instance.
You do not show anywhere that you properly close writer. If your program terminates abnormally, the writer would never be flushed to disk.
Try making use of a using block.
// NOTE: The path should be C:\names.txt. The posted code is missing a \
using (var writer = new StreamWriter(File.OpenWrite(@"C:names.txt")))
{
// Your code here
}
You have to flush the buffer after the last write. Put the writer inside a using statement.
The writer's Dispose method flushes the buffer. You can also call writer.Flush(), but since you still have to make sure that the writer is disposed, just put it in a using statement as others have suggested.
List<String> names = new List<String>();
var sb = new StringBuilder();
//Retrieve names from a separate source.
for (int i = 0; i < names.Count; i++)
{
    System.Console.WriteLine(names[i].ToString());
    sb.AppendLine(names[i].ToString());
}
using (var writer = new StreamWriter(File.OpenWrite(@"C:\names.txt")))
{
    // Write, not WriteLine: sb already ends with a line break.
    writer.Write(sb.ToString());
}
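A compact variant of the same fix is sketched below. The NameWriter class and WriteNames method are hypothetical names. One related detail worth knowing: File.OpenWrite does not truncate an existing file, so this sketch opens the file with the StreamWriter path constructor and append set to false instead.

```csharp
using System.Collections.Generic;
using System.IO;

static class NameWriter
{
    // Writes one name per line; the using block flushes and closes the file.
    public static void WriteNames(string path, IEnumerable<string> names)
    {
        // append: false replaces any previous contents of the file.
        using (var writer = new StreamWriter(path, append: false))
        {
            foreach (string name in names)
            {
                writer.WriteLine(name);
            }
        } // Dispose runs here, so no buffered lines are lost
    }
}
```

When the names are already in a list, File.WriteAllLines(path, names) does the same job in a single call.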
I need help figuring out the fastest way to read through about 80 files with over 500,000 lines in each file, and write to one master file with each input file's line as a column in the master. The master file must be written to a text editor like notepad and not a Microsoft product because they can't handle the number of lines.
For example, the master file should look something like this:
File1_Row1,File2_Row1,File3_Row1,...
File1_Row2,File2_Row2,File3_Row2,...
File1_Row3,File2_Row3,File3_Row3,...
etc.
I've tried 2 solutions so far:
Create a jagged array that holds each file's contents, and once all lines in all files have been read, write the master file. The issue with this solution is that Windows throws an error that too much virtual memory is being used.
Dynamically create a reader thread for each of the 80 files that reads a specific line number, and once all threads finish reading a line, combine those values and write to file, and repeat for each line in all files. The issue with this solution is that it is very very slow.
Does anybody have a better solution for reading so many large files in a fast way?
The best way is going to be to open the input files with a StreamReader for each one and a StreamWriter for the output file. Then you loop through each reader and read a single line and write it to the master file. This way you are only loading one line at a time so there should be minimal memory pressure. I was able to copy 80 ~500,000 line files in 37 seconds. An example:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq; // needed for Enumerable.Range, Select, ToArray and ToList
class MainClass
{
static string[] fileNames = Enumerable.Range(1, 80).Select(i => string.Format("file{0}.txt", i)).ToArray();
public static void Main(string[] args)
{
var stopwatch = Stopwatch.StartNew();
List<StreamReader> readers = fileNames.Select(f => new StreamReader(f)).ToList();
try
{
using (StreamWriter writer = new StreamWriter("master.txt"))
{
string line = null;
do
{
for(int i = 0; i < readers.Count; i++)
{
if ((line = readers[i].ReadLine()) != null)
{
writer.Write(line);
}
if (i < readers.Count - 1)
writer.Write(",");
}
writer.WriteLine();
} while (line != null);
}
}
finally
{
foreach(var reader in readers)
{
reader.Close();
}
}
Console.WriteLine("Elapsed {0} ms", stopwatch.ElapsedMilliseconds);
}
}
I've assumed that all the input files have the same number of lines, but you should add the logic to keep reading as long as at least one file has given you data.
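One way that extra logic might look is sketched below. The ColumnMerger class and Merge method are hypothetical names, and filling missing cells with empty strings is an assumption about the desired output.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class ColumnMerger
{
    // Writes one output row per input line number, one column per input file.
    // Files that run out early contribute empty cells.
    public static void Merge(IReadOnlyList<string> inputPaths, string outputPath)
    {
        var readers = new List<StreamReader>();
        try
        {
            foreach (string p in inputPaths) readers.Add(new StreamReader(p));

            using (var writer = new StreamWriter(outputPath))
            {
                var fields = new string[readers.Count];
                while (true)
                {
                    bool anyData = false;
                    for (int i = 0; i < readers.Count; i++)
                    {
                        string line = readers[i].ReadLine(); // null once a file is exhausted
                        fields[i] = line ?? string.Empty;
                        if (line != null) anyData = true;
                    }
                    if (!anyData) break; // every file is exhausted; emit no trailing row
                    writer.WriteLine(string.Join(",", fields));
                }
            }
        }
        finally
        {
            foreach (var reader in readers) reader.Dispose();
        }
    }
}
```

Checking anyData before writing also avoids the final row of bare commas that a do/while loop would emit after the last real line.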
Memory-mapped files seem suitable for your case. They avoid putting pressure on your application's memory while maintaining good I/O performance.
Here is the complete documentation: Memory-Mapped Files
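A minimal sketch of reading lines through a memory-mapped view, assuming a text file that StreamReader can decode; the MappedReader class and its ReadLines method are hypothetical names.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;

static class MappedReader
{
    // Reads all lines of a file through a read-only memory-mapped view.
    public static List<string> ReadLines(string fileName)
    {
        var lines = new List<string>();
        long length = new FileInfo(fileName).Length;
        if (length == 0) return lines; // an empty file cannot be mapped

        using (var mmf = MemoryMappedFile.CreateFromFile(
                   fileName, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        // Size the view to the exact file length so no page padding is read as data.
        using (var view = mmf.CreateViewStream(0, length, MemoryMappedFileAccess.Read))
        using (var reader = new StreamReader(view))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);
            }
        }
        return lines;
    }
}
```

For the merge described in the question, one view per input file would replace the StreamReader over a plain FileStream; whether that is faster would have to be measured.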
If you have enough memory on the computer, I would use the Parallel.Invoke construct and read each file into a pre-allocated array such as:
string[] file1lines = new string[some value];
string[] file2lines = new string[some value];
string[] file3lines = new string[some value];
Parallel.Invoke(
() =>
{
ReadMyFile(file1,file1lines);
},
() =>
{
ReadMyFile(file2,file2lines);
},
() =>
{
ReadMyFile(file3,file3lines);
}
);
Each ReadMyFile method should just use the following sample code which, according to these benchmarks, is the fastest way to read a text file:
int x = 0;
using (StreamReader sr = File.OpenText(fileName))
{
while ((file1lines[x] = sr.ReadLine()) != null)
{
x += 1;
}
}
If you need to manipulate the data from each file before writing your final output, read this article on the fastest way to do that.
Then you just need one method to write the contents of each string[] to the output as you desire.
Have an array of open file handles. Loop through this array and read a line from each file into a string array. Then combine this array into the master file, append a newline at the end.
This differs from your second approach in that it is single-threaded and doesn't read a specific line but always the next one.
Of course, you need to handle the case where some files have fewer lines than others.
I am keeping several text log files that I want to keep from growing too large. I searched and found a lot of people asking the same thing, and I found a couple of solutions whose efficiency looked questionable, so I tried rolling my own function. I did the same thing previously in VB6 and ended up using the function in all my apps, so I know I will be using it frequently in my C# programs. This should probably be CW, but since marking a question as CW is disabled I am posting it here. My question is: since I will be using this a lot, is it efficient, and if not, what should I change to improve it? Currently I am limiting the log files to 1MB, and these are the largest logs I have kept, so I don't anticipate them getting much, if any, larger.
private static void ShrinkFile(string file)
{
StreamReader sr = new StreamReader(file);
for (int i = 0; i < 10; i++) // throw away the first 10 lines
{
sr.ReadLine();
}
string remainingContents = sr.ReadToEnd();
sr.Close();
File.WriteAllText(file, remainingContents);
}
Besides suggesting that you use a proper logging framework like Log4Net or NLog (or any other), to improve your code you can at minimum make sure you always close the stream with a using:
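private static void ShrinkFile(string file)
{
    string remainingContents;
    using (var sr = new StreamReader(file))
    {
        for (int i = 0; i < 10; i++) // throw away the first 10 lines
        {
            sr.ReadLine();
        }
        remainingContents = sr.ReadToEnd();
    }
    // The reader is closed here, so the file can be reopened for writing.
    // false here means to overwrite existing file.
    using (StreamWriter sw = new StreamWriter(file, false))
    {
        sw.Write(remainingContents);
    }
}
Note that the reader has to be closed before the same file is opened for writing; otherwise opening the StreamWriter fails because the file is still in use. That is why the remaining contents are read into a string first.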
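For logs capped at about 1MB, the same trimming can be sketched in terms of lines rather than raw text. The LogTrimmer class and DropFirstLines method are hypothetical names, and rewriting the file with File.WriteAllLines is an assumption about what is acceptable for these logs.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class LogTrimmer
{
    // Removes the first linesToDrop lines from the file, keeping the rest.
    public static void DropFirstLines(string file, int linesToDrop)
    {
        // ToList() finishes reading (and closes the file) before it is rewritten.
        List<string> remaining = File.ReadLines(file).Skip(linesToDrop).ToList();
        File.WriteAllLines(file, remaining);
    }
}
```

A call such as LogTrimmer.DropFirstLines(logPath, 10) would match the behavior described in the question, with logPath standing in for the real log file name.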