c# - splitting a large list into smaller sublists

c# - splitting a large list into smaller sublists - c#

Fairly new to C# - Sitting here practicing. I have a file with 10 million passwords listed in a single file that I downloaded to practice with.
I want to break the file down to lists of 99. Stop at 99 then do something. Then start where it left off and repeat the do something with the next 99 until it reaches the last item in the file.
I can do the count part well, it is the stop at 99 and continue where I left off is where I am having trouble. Anything I find online is not close to what I am trying to do and anything I add to this code on my own does not work.
I am more than happy to share more information if I am not clear. Just ask and will respond however, I might not be able to respond until tomorrow depending on what time it is.
Here is the code I have started:
using System;
using System.IO;
namespace lists01
{
class Program
{
static void Main(string[] args)
{
int count = 0;
var f1 = #"c:\tmp\10-million-password-list-top-1000000.txt";
{
var content = File.ReadAllLines(f1);
foreach (var v2 in content)
{
count++;
Console.WriteLine(v2 + "\t" + count);
}
}
}
}
}
My end goal is to do this with any list of items from files I have. I am only using this password list because it was sizable and thought it would be good for this exercise.
Thank you
Keith

Here is a couple of different ways to approach this. Normally, I would suggest the ReadAllLines function that you have in your code. The trade off is that you are loading the entire file into memory at once, then you operate on it.
Using read all lines in concert with Linq's Skip() and Take() methods, you can chop the lines up into groups like this:
var lines = File.ReadAllLines(fileName);
int linesAtATime = 99;
for (int i = 0; i < lines.Length; i = i + linesAtATime)
{
List<string> currentLinesGroup = lines.Skip(i).Take(linesAtATime).ToList();
DoSomethingWithLines(currentLinesGroup);
}
But, if you are working with a really large file, it might not be practical to load the entire file into memory. Plus, you might not want to leave the file open while you are working on the lines. This option gives you more control over how you move through the file. It just loads the part it needs into memory, and closes the file while you are working on the current set of lines.
List<string> lines = new List<string>();
int maxLines = 99;
long seekPosition = 0;
bool fileLoaded = false;
string line;
while (!fileLoaded)
{
using (Stream stream = File.Open(fileName, FileMode.Open))
{
//Jump back to the previous position
stream.Seek(seekPosition, SeekOrigin.Begin);
using (StreamReader reader = new StreamReader(stream))
{
while (!reader.EndOfStream && lines.Count < maxLines)
{
line = reader.ReadLine();
seekPosition += (line.Length + 2); //Tracks how much data has been read.
lines.Add(line);
}
fileLoaded = reader.EndOfStream;
}
}
DoSomethingWithLines(lines);
lines.Clear();
}
In this case, I used Stream because it has the ability to seek to a specific position in the file. But then I used StreaReader because it has the ReadLine() methods.

Related

How to optimize is List have an element check? C# [duplicate]

I want to read a text file line by line. I wanted to know if I'm doing it as efficiently as possible within the .NET C# scope of things.
This is what I'm trying so far:
var filestream = new System.IO.FileStream(textFilePath,
System.IO.FileMode.Open,
System.IO.FileAccess.Read,
System.IO.FileShare.ReadWrite);
var file = new System.IO.StreamReader(filestream, System.Text.Encoding.UTF8, true, 128);
while ((lineOfText = file.ReadLine()) != null)
{
//Do something with the lineOfText
}

To find the fastest way to read a file line by line you will have to do some benchmarking. I have done some small tests on my computer but you cannot expect that my results apply to your environment.
Using StreamReader.ReadLine
This is basically your method. For some reason you set the buffer size to the smallest possible value (128). Increasing this will in general increase performance. The default size is 1,024 and other good choices are 512 (the sector size in Windows) or 4,096 (the cluster size in NTFS). You will have to run a benchmark to determine an optimal buffer size. A bigger buffer is - if not faster - at least not slower than a smaller buffer.
const Int32 BufferSize = 128;
using (var fileStream = File.OpenRead(fileName))
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, BufferSize)) {
String line;
while ((line = streamReader.ReadLine()) != null)
{
// Process line
}
}
The FileStream constructor allows you to specify FileOptions. For example, if you are reading a large file sequentially from beginning to end, you may benefit from FileOptions.SequentialScan. Again, benchmarking is the best thing you can do.
Using File.ReadLines
This is very much like your own solution except that it is implemented using a StreamReader with a fixed buffer size of 1,024. On my computer this results in slightly better performance compared to your code with the buffer size of 128. However, you can get the same performance increase by using a larger buffer size. This method is implemented using an iterator block and does not consume memory for all lines.
var lines = File.ReadLines(fileName);
foreach (var line in lines)
// Process line
Using File.ReadAllLines
This is very much like the previous method except that this method grows a list of strings used to create the returned array of lines so the memory requirements are higher. However, it returns String[] and not an IEnumerable<String> allowing you to randomly access the lines.
var lines = File.ReadAllLines(fileName);
for (var i = 0; i < lines.Length; i += 1) {
var line = lines[i];
// Process line
}
Using String.Split
This method is considerably slower, at least on big files (tested on a 511 KB file), probably due to how String.Split is implemented. It also allocates an array for all the lines increasing the memory required compared to your solution.
using (var streamReader = File.OpenText(fileName)) {
var lines = streamReader.ReadToEnd().Split("\r\n".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
foreach (var line in lines)
// Process line
}
My suggestion is to use File.ReadLines because it is clean and efficient. If you require special sharing options (for example you use FileShare.ReadWrite), you can use your own code but you should increase the buffer size.

If you're using .NET 4, simply use File.ReadLines which does it all for you. I suspect it's much the same as yours, except it may also use FileOptions.SequentialScan and a larger buffer (128 seems very small).

While File.ReadAllLines() is one of the simplest ways to read a file, it is also one of the slowest.
If you're just wanting to read lines in a file without doing much, according to these benchmarks, the fastest way to read a file is the age old method of:
using (StreamReader sr = File.OpenText(fileName))
{
string s = String.Empty;
while ((s = sr.ReadLine()) != null)
{
//do minimal amount of work here
}
}
However, if you have to do a lot with each line, then this article concludes that the best way is the following (and it's faster to pre-allocate a string[] if you know how many lines you're going to read) :
AllLines = new string[MAX]; //only allocate memory here
using (StreamReader sr = File.OpenText(fileName))
{
int x = 0;
while (!sr.EndOfStream)
{
AllLines[x] = sr.ReadLine();
x += 1;
}
} //Finished. Close the file
//Now parallel process each line in the file
Parallel.For(0, AllLines.Length, x =>
{
DoYourStuff(AllLines[x]); //do your work here
});

Use the following code:
foreach (string line in File.ReadAllLines(fileName))
This was a HUGE difference in reading performance.
It comes at the cost of memory consumption, but totally worth it!

If the file size is not big, then it is faster to read the entire file and split it afterwards
var filestreams = sr.ReadToEnd().Split(Environment.NewLine,
StringSplitOptions.RemoveEmptyEntries);

There's a good topic about this in Stack Overflow question Is 'yield return' slower than "old school" return?.
It says:
ReadAllLines loads all of the lines into memory and returns a
string[]. All well and good if the file is small. If the file is
larger than will fit in memory, you'll run out of memory.
ReadLines, on the other hand, uses yield return to return one line at
a time. With it, you can read any size file. It doesn't load the whole
file into memory.
Say you wanted to find the first line that contains the word "foo",
and then exit. Using ReadAllLines, you'd have to read the entire file
into memory, even if "foo" occurs on the first line. With ReadLines,
you only read one line. Which one would be faster?

If you have enough memory, I've found some performance gains by reading the entire file into a memory stream, and then opening a stream reader on that to read the lines. As long as you actually plan on reading the whole file anyway, this can yield some improvements.

You can't get any faster if you want to use an existing API to read the lines. But reading larger chunks and manually find each new line in the read buffer would probably be faster.

When you need to efficiently read and process a HUGE text file, ReadLines() and ReadAllLines() are likely to throw Out of Memory exception, this was my case. On the other hand, reading each line separately would take ages. The solution was to read the file in blocks, like below.
The class:
//can return empty lines sometimes
class LinePortionTextReader
{
private const int BUFFER_SIZE = 100000000; //100M characters
StreamReader sr = null;
string remainder = "";
public LinePortionTextReader(string filePath)
{
if (File.Exists(filePath))
{
sr = new StreamReader(filePath);
remainder = "";
}
}
~LinePortionTextReader()
{
if(null != sr) { sr.Close(); }
}
public string[] ReadBlock()
{
if(null==sr) { return new string[] { }; }
char[] buffer = new char[BUFFER_SIZE];
int charactersRead = sr.Read(buffer, 0, BUFFER_SIZE);
if (charactersRead < 1) { return new string[] { }; }
bool lastPart = (charactersRead < BUFFER_SIZE);
if (lastPart)
{
char[] buffer2 = buffer.Take<char>(charactersRead).ToArray();
buffer = buffer2;
}
string s = new string(buffer);
string[] sresult = s.Split(new string[] { "\r\n" }, StringSplitOptions.None);
sresult[0] = remainder + sresult[0];
if (!lastPart)
{
remainder = sresult[sresult.Length - 1];
sresult[sresult.Length - 1] = "";
}
return sresult;
}
public bool EOS
{
get
{
return (null == sr) ? true: sr.EndOfStream;
}
}
}
Example of use:
class Program
{
static void Main(string[] args)
{
if (args.Length < 3)
{
Console.WriteLine("multifind.exe <where to search> <what to look for, one value per line> <where to put the result>");
return;
}
if (!File.Exists(args[0]))
{
Console.WriteLine("source file not found");
return;
}
if (!File.Exists(args[1]))
{
Console.WriteLine("reference file not found");
return;
}
TextWriter tw = new StreamWriter(args[2], false);
string[] refLines = File.ReadAllLines(args[1]);
LinePortionTextReader lptr = new LinePortionTextReader(args[0]);
int blockCounter = 0;
while (!lptr.EOS)
{
string[] srcLines = lptr.ReadBlock();
for (int i = 0; i < srcLines.Length; i += 1)
{
string theLine = srcLines[i];
if (!string.IsNullOrEmpty(theLine)) //can return empty lines sometimes
{
for (int j = 0; j < refLines.Length; j += 1)
{
if (theLine.Contains(refLines[j]))
{
tw.WriteLine(theLine);
break;
}
}
}
}
blockCounter += 1;
Console.WriteLine(String.Format("100 Mb blocks processed: {0}", blockCounter));
}
tw.Close();
}
}
I believe splitting strings and array handling can be significantly improved,
yet the goal here was to minimize number of disk reads.

How to close file that has been read

So im trying to close a file (transactions.txt) that has been open that i've used to read into a textbox and now I want to save back to the file but the problem debug says that the file is in use so I need to find a way to close it. Can anyone help me with this? Thanks!
SearchID = textBox1.Text;
string ID = SearchID.ToString();
bool idFound = false;
int count = 0;
foreach (var line in File.ReadLines("transactions.txt"))
{
//listView1.Items.Add(line);
if (line.Contains(ID))
{
idFound = true;
}
//Displays Transactions if the variable SearchID is found.
if (idFound && count < 8)
{
textBox2.Text += line + "\r\n";
count++;
}
}
}
private void SaveEditedTransaction()
{
SearchID = textBox1.Text;
string ID = SearchID.ToString();
bool idFound = false;
int count = 0;
foreach (var lines in File.ReadLines("transactions.txt"))
{
//listView1.Items.Add(line);
if (lines.Contains(ID))
{
idFound = true;
}
if (idFound)
{
string edited = File.ReadAllText("transactions.txt");
edited = edited.Replace(lines, textBox2.Text);
File.WriteAllText("Transactions.txt", edited);
}

The problem here is that File.ReadLines keeps the file open while you read it, since you've put the call to write new text to it inside the loop, the file is still open.
Instead I would simply break out of the loop when you find the id, and then put the if-statement that writes to the file outside the loop.
This, however, means that you will also need to maintain which line to replace in.
So actually, instead I would switch to using File.ReadAllLines. This reads the entire file into memory, and closes it, before the loop starts.
Now, pragmatic minds might argue that if you have a lot of text in that text file, File.ReadLines (that you're currently using) will use a lot less memory than File.ReadAllLines (that I am suggesting you should use), but if that's the case then you should switch to a database, which would be much more suited to your purpose anyway. It is, however, a bit of an overkill for a toy project with 5 lines in that file.

Use StreamReader directly with the using statement, for example:
var lines = new List<string>();
using (StreamReader reader = new StreamReader(#"C:\test.txt")) {
var line = reader.ReadLine();
while (line != null) {
lines.Add(line);
line = reader.ReadLine();
}
}
By using the using statement the StreamReader instance will automatically be disposed of after it's done with it.

You can try with this:
File.WriteAllLines(
"transactions.txt",
File.ReadAllLines("transactions.txt")
.Select(x => x.Contains(ID) ? textBox2.Text : x));
It works fine, but if the file is big you have to find other solutions.

You can use the StreamReader class instead of the methods of the File class. In this way you can use, Stream.Close() and Stream.Dispose().

Best way to read multiple very large files

I need help figuring out the fastest way to read through about 80 files with over 500,000 lines in each file, and write to one master file with each input file's line as a column in the master. The master file must be written to a text editor like notepad and not a Microsoft product because they can't handle the number of lines.
For example, the master file should look something like this:
File1_Row1,File2_Row1,File3_Row1,...
File1_Row2,File2_Row2,File3_Row2,...
File1_Row3,File2_Row3,File3_Row3,...
etc.
I've tried 2 solutions so far:
Create a jagged array to hold each files' contents into an array and then once reading all lines in all files, write the master file. The issue with this solution is that Windows OS memory throws an error that too much virtual memory is being used.
Dynamically create a reader thread for each of the 80 files that reads a specific line number, and once all threads finish reading a line, combine those values and write to file, and repeat for each line in all files. The issue with this solution is that it is very very slow.
Does anybody have a better solution for reading so many large files in a fast way?

The best way is going to be to open the input files with a StreamReader for each one and a StreamWriter for the output file. Then you loop through each reader and read a single line and write it to the master file. This way you are only loading one line at a time so there should be minimal memory pressure. I was able to copy 80 ~500,000 line files in 37 seconds. An example:
using System;
using System.Collections.Generic;
using System.IO;
using System.Diagnostics;
class MainClass
{
static string[] fileNames = Enumerable.Range(1, 80).Select(i => string.Format("file{0}.txt", i)).ToArray();
public static void Main(string[] args)
{
var stopwatch = Stopwatch.StartNew();
List<StreamReader> readers = fileNames.Select(f => new StreamReader(f)).ToList();
try
{
using (StreamWriter writer = new StreamWriter("master.txt"))
{
string line = null;
do
{
for(int i = 0; i < readers.Count; i++)
{
if ((line = readers[i].ReadLine()) != null)
{
writer.Write(line);
}
if (i < readers.Count - 1)
writer.Write(",");
}
writer.WriteLine();
} while (line != null);
}
}
finally
{
foreach(var reader in readers)
{
reader.Close();
}
}
Console.WriteLine("Elapsed {0} ms", stopwatch.ElapsedMilliseconds);
}
}
I've assume that all the input files have the same number of lines, but you should be add the logic to keep reading when at least one file has given you data.

Use Memory Mapped files seems what is suitable to you. Something that does not execute pressure on memory of your app contemporary maintaining good performance in IO operations.
Here complete documentation: Memory-Mapped Files

If you have enough memory on the computer, I would use the Parallel.Invoke construct and read each file into a pre-allocated array such as:
string[] file1lines = new string[some value];
string[] file2lines = new string[some value];
string[] file3lines = new string[some value];
Parallel.Invoke(
() =>
{
ReadMyFile(file1,file1lines);
},
() =>
{
ReadMyFile(file2,file2lines);
},
() =>
{
ReadMyFile(file3,file3lines);
}
);
Each ReadMyFile method should just use the following sample code which, according to these benchmarks, is the fastest way to read a text file:
int x = 0;
using (StreamReader sr = File.OpenText(fileName))
{
while ((file1lines[x] = sr.ReadLine()) != null)
{
x += 1;
}
}
If you need to manipulate the data from each file before writing your final output, read this article on the fastest way to do that.
Then you just need one method to write the contents to each string[] to the output as you desire.

Have an array of open file handles. Loop through this array and read a line from each file into a string array. Then combine this array into the master file, append a newline at the end.
This differs from your second approach that it is single threaded and doesn't read a specific line but always the next one.
Of course you need to be error proof if there are files with less lines than others.

Issues in with line end when writing multiple files into one file with C#

I'm trying to write 4 sets of 15 txt files into 4 large txt files in order to make it easier to import into another app.
Here's my code:
using System;
using System.IO;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace AggregateMultipleFiles
{
class AggMultiFilestoOneFile
{/*This program can reduce multiple input files and grouping results into one file for easier app loading.*/
static void Main(string[] args)
{
TextWriter writer = new StreamWriter("G:/user/data/yr2009/fy09_filtered.txt");
int linelen =495;
char[] buf = new char[linelen];
int line_num = 1;
for (int i = 1; i <= 15; i++)
{
TextReader reader = File.OpenText("G:/user/data/yr2009/fy09_filtered"+i+".txt");
while (true)
{
int nin = reader.Read(buf, 0, buf.Length);
if (nin == 0 )
{
Console.WriteLine("File ended");
break;
}
writer.Write(new String(buf));
line_num++;
}
reader.Close();
}
Console.WriteLine("done");
Console.WriteLine(DateTime.Now);
Console.ReadLine();
writer.Close();
}
}
}
My problem is somewhere in calling the end of the file. It doesn't finishing writing the last line of a file, and then, proceeds to start writing the first line of the next file half way through the middle of the last line of the previous file.
This is throwing off all of my columns and data in the app it imports into.
Someone suggested that perhaps I need to pad the end of each line of each of the 15 files with carriage and line return, \r\n.
Why doesn't what I have work?
Would padding work instead? How would I write that?
Thank you!

I strongly suspect this is the problem:
writer.Write(new String(buf));
You're always creating a string from all of buf, rather than just the first nin characters. If any of your files are short, you may end up with "null" Unicode characters (i.e. U+0000) which may be seen as string terminators in some apps.
There's no need even to create a string - just use:
writer.Write(buf, 0, nin);
(I would also strongly suggest using using statements instead of manually calling Close, by the way.)
It's also worth noting that there's nothing to guarantee that you're really reading a line at a time. You might as well increase your buffer size to something like 32K in order to read the files in potentially fewer chunks.
Additionally, if the files are small enough, you could read each one into memory completely, which would make your code simpler:
using (var writer = File.CreateText("G:/user/data/yr2009/fy09_filtered.txt"))
{
for (int i = 1; i <= 15; i++)
{
string inputName = "G:/user/data/yr2009/fy09_filtered" + i + ".txt";
writer.Write(File.ReadAllText(inputName));
}
}

Keeping a log file under a certain size

I am keeping several text log files that I want to keep from growing too large. I searched for and found a lot of people asking the same thing and I found couple of solutions that looked like the efficiency was questionable so I tried rolling my own function. I did the same thing previously in VB6 and ended up using the function in all my apps so I know I will be using it frequently now in my C# programs. This should probably be CW but since marking a question as CW is disabled I am posting it here. My question is, since I will be using this a lot is it efficient, and if not what should I change to improve it? Currently I am limiting the log files to 1MB and these are the largest logs I have kept so I don't anticipate them getting much if any larger.
private static void ShrinkFile(string file)
{
StreamReader sr = new StreamReader(file);
for (int i = 0; i < 9; i++) // throw away the first 10 lines
{
sr.ReadLine();
}
string remainingContents = sr.ReadToEnd();
sr.Close();
File.WriteAllText(file, remainingContents);
}

beside suggesting you to use a proper logging framework like Log4Net or NLog (or any other), to improve your code you can at minimum make sure you always close the stream with a using:
private static void ShrinkFile(string file)
{
using(var sr = new StreamReader(file))
{
for (int i = 0; i < 9; i++) // throw away the first 10 lines
{
sr.ReadLine();
}
// false here means to overwrite existing file.
using (StreamWriter sw = new StreamWriter(file, false))
{
sw.Write(sr.ReadToEnd());
}
}
}
also I have avoided to do the ReadToEnd into a string because you can directly write into the StreamWriter.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

c# - splitting a large list into smaller sublists - c#

Related

How to optimize is List have an element check? C# [duplicate]

How to close file that has been read

Best way to read multiple very large files

Issues in with line end when writing multiple files into one file with C#

Keeping a log file under a certain size

Categories

Resources