C# Memory Mapped File - Better read or write sequentially? - c#

I have two files of about 50 GB each: an input file and an output file.
I am using Memory Mapped File to manage these two files.
The input file contains 3 million Web pages, and after choosing a permutation π of them, I have to write the Web pages into the output file in the new order.
So I can either read the input file sequentially and write the Web pages to different locations in the output file according to the permutation π,
or do the opposite: read the input file randomly according to the permutation π and write sequentially into the output file.
Which option is faster? Why?

TL;DR: Due to caching, all file-append operations are sequential. Even writes to the middle of files will be elevator sorted and performed at block size, etc.
Random writing tends to be faster than random reading for several reasons:
When a file grows, the filesystem can choose where to put the new block.
Writes don't have to be performed immediately; the write buffer can ensure that an entire block is written at once, so data won't have to be appended to an existing block that already has a location on disk.
Your processing can't take place until reads complete. And reading relies on a predictive cache. The OS is good at pre-caching sequential reads, horrible for random reads. If your reads are less than block sized, things are even worse -- the actual amount of data read from the disk will be greater than the size of the file.
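For illustration, here is a minimal sketch of the direction the answer favors (sequential read, permuted writes). It assumes fixed-size pages and a precomputed permutation array; PermutePages, pageSize and the file paths are all invented names, not part of the original question.

using System;
using System.IO;
using System.IO.MemoryMappedFiles;

static void PermutePages(string inputPath, string outputPath, long[] permutation, int pageSize)
{
    long capacity = (long)permutation.Length * pageSize;

    using var input = MemoryMappedFile.CreateFromFile(inputPath, FileMode.Open);
    using var output = MemoryMappedFile.CreateFromFile(outputPath, FileMode.Create, null, capacity);

    using var reader = input.CreateViewStream();      // sequential read of the input
    using var writer = output.CreateViewAccessor();   // random-offset writes to the output

    var buffer = new byte[pageSize];
    for (long i = 0; i < permutation.Length; i++)
    {
        int read = reader.Read(buffer, 0, pageSize);                       // page i, in original order
        writer.WriteArray(permutation[i] * (long)pageSize, buffer, 0, read); // slot π(i) in the output
    }
}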


C# Reuse StreamWriter or FileStream but change destination file

A little background...
Everything I'm about to describe, up to my implementation of the StreamWriter, is a business process I cannot change.
Every month I pull around 200 different tables of data into individual files.
Each file contains roughly 400,000 lines of business logic details for upwards of 5,000-6,000 different business units.
To effectively use that data with the tools on hand, I have to break down those files into individual files for each business unit...
200 files x 5000 business units per file = 100,000 different files.
The way I've BEEN doing it is the typical StreamWriter loop...
foreach (string SplitFile in BusinessFiles)    // BusinessFiles: List<string> of output file paths
{
    using (StreamWriter SW = new StreamWriter(SplitFile))
    {
        foreach (var BL in g)                  // g: the lines belonging to this business unit
        {
            string[] Split1 = BL.Split(',');
            SW.WriteLine("{0,-8}{1,-8}{2,-8}{3,-8}{4,-8}{5,-8}{6,-8}{7,-8}{8,-8}{9,-16}{10,-1}",
                Split1[0], Split1[1], Split1[2], Split1[3], Split1[4], Split1[5], Split1[6], Split1[7], Split1[8],
                Convert.ToDateTime(Split1[9]).ToString("dd-MMM-yyyy"), Split1[10]);
        }
    }
}
The issue with this is that it takes an excessive amount of time: it can sometimes take 20 minutes to process all the files.
Profiling my code shows that 98% of the time is spent disposing of the StreamWriter after the program leaves the inner loop.
So my question is: is there a way to keep the underlying Stream open and reuse it to write a different file?
I know I can Flush() the Stream, but I can't figure out how to get it to start writing to another file altogether. I can't seem to find a way to change the destination filename without having to create another StreamWriter.
Edit:
A picture of what it shows when I profile the code
Edit 2:
So after poking around a bit more I started looking at it a different way.
First, I already had the reading of the one file and the writing of the massive number of smaller files in a nested parallel loop, so I was essentially maxing out my I/O as it was.
I'm also writing to an SSD, so all those were good points.
Turns out I'm reading the 1 massive file and writing ~5600 smaller ones every 90 seconds or so.
That's 60 files a second. I guess I can't really ask for much more than that.
This sounds about right. 100,000 files in 20 minutes is more than 83 files every second. Disk I/O is pretty much the slowest thing you can do within a single computer. All that time in the Dispose() method is waiting for the buffer to flush out to disk while closing the file... it's the actual time writing the data to your persistent storage, and a separate using block for each file is the right way to make sure this is done safely.
To speed this up it's tempting to look at asynchronous processing (async/await), but I don't think you'll find any gains there; ultimately this is an I/O-bound task, so optimizing for your CPU scheduling might even make things worse. Better gains could be available if you can change the output to write into a single (indexed) file, so the operating system's disk buffering mechanism can be more efficient.
I would agree with Joel that the time is mostly due to writing the data out to disk. I would however be a little bit more optimistic about doing parallel IO, since SSDs are better able to handle higher loads than regular HDDs. So I would try a few things:
1. Doing stuff in parallel
Change your outer loop to a parallel one
Parallel.ForEach(
    myBusinessFiles,
    new ParallelOptions { MaxDegreeOfParallelism = 2 },
    SplitFile =>
    {
        // Loop body: write one output file, as in the original loop
    });
Try changing the degree of parallelism to see if performance improves or not. This assumes the data is thread safe.
2. Try writing to a high-speed local SSD
I'm assuming you are writing to a network folder; this adds some extra latency, so you might try writing to a local disk instead. If you are already doing that, consider getting a faster disk. If you need to move all the files to a network drive afterwards you will likely not gain anything overall, but it can give you an idea of the penalty the network imposes.
3. Try writing to a Zip Archive
A zip archive can contain multiple files while still allowing fairly easy access to an individual file. This could help improve performance in a few ways:
Compression. I would assume your data is fairly easy to compress, so you would write less data overall.
Less file system operations. Since you are only writing to a single file you would avoid some overhead with the file system.
Reduced overhead due to cluster size. Files have a minimum on-disk size (the cluster size), which can waste a fair bit of space for small files. Using an archive avoids this.
You could also try saving each file in an individual zip-archive, but then you would mostly benefit from the compression.
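As a rough sketch of writing everything into a single archive with System.IO.Compression (businessFiles and the entry names are hypothetical, not the poster's code):

using System.IO;
using System.IO.Compression;

using var archive = ZipFile.Open("split-output.zip", ZipArchiveMode.Create);
foreach (string splitFile in businessFiles)           // hypothetical List<string> of entry names
{
    ZipArchiveEntry entry = archive.CreateEntry(splitFile, CompressionLevel.Fastest);
    using var writer = new StreamWriter(entry.Open());
    // ... write the formatted lines for this business unit, as in the original loop
    writer.WriteLine("example line");
}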
Responding to your question: you do have an option (a flag on the constructor), but it is strongly tied to how the stream gets cleaned up, and in a multi-threaded environment it could be a mess. That said, this is the overloaded constructor:
StreamWriter(Stream, Encoding, Int32, Boolean)
Initializes a new instance of the StreamWriter class for the specified stream by using the specified encoding and buffer size, and optionally leaves the stream open.
public StreamWriter (System.IO.Stream stream, System.Text.Encoding? encoding = default, int bufferSize = -1, bool leaveOpen = false);
Source
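For example (a minimal sketch, not the poster's code), leaveOpen lets a writer be disposed while the underlying stream stays usable:

using System.IO;
using System.Text;

var destination = new FileStream("output.txt", FileMode.Create, FileAccess.Write);

using (var writer = new StreamWriter(destination, Encoding.UTF8, bufferSize: 65536, leaveOpen: true))
{
    writer.WriteLine("first batch");
}   // the writer is flushed and disposed, but destination is still open

using (var writer = new StreamWriter(destination, Encoding.UTF8, bufferSize: 65536, leaveOpen: true))
{
    writer.WriteLine("second batch");
}
destination.Dispose();

Note that leaveOpen only keeps the same stream alive; a FileStream is still bound to one path, so redirecting output to a different file still requires opening a new stream.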

ReadInnerXml Out Of Memory

While parsing large data in an XML file (about 1 GB), the idea was to write the various elements to separate temporary files. Each of those files could then be processed later, while the overall content of the file can be advertised to the end user up front. The user may select only a few of the elements for further processing. This is accomplished by calling the reader's ReadInnerXml() method, which returns a string that is in turn easily handed to a StreamWriter. However, when an exception such as an out-of-memory error results, I am uncertain whether (1) I can skip past it and read other (hopefully smaller) elements, and (2) I can avoid a string completely and somehow write the XML element to a temporary file without consuming so much physical memory.
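A minimal sketch of the approach described above (element and file names are invented for illustration); the string returned by ReadInnerXml() is the allocation that grows with the element size:

using System.IO;
using System.Xml;

using var reader = XmlReader.Create("large-input.xml");
int index = 0;
while (reader.ReadToFollowing("item"))                 // hypothetical element name
{
    string inner = reader.ReadInnerXml();              // whole element body held in memory
    using var writer = new StreamWriter($"element-{index++}.xml");
    writer.Write(inner);
}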

Remove All Duplicates In A Large Text File

I am really stumped by this problem, and as a result I have stopped working on it for a while. I work with really large pieces of data: I get approximately 200 GB of .txt data every week, and the data can run to 500 million lines. A lot of these are duplicates; I would guess only 20 GB is unique. I have had several custom programs made, including a hash-based duplicate remover and an external duplicate remover, but none seem to work. The latest one used a temp database but took several days to remove the data.
The problem with all the programs is that they crash after a certain point, and after spending a large amount of money on them I thought I would come online and see if anyone can help. I understand this has been answered on here before, and I have spent the last 3 hours reading about 50 threads here, but none seem to have the same problem as me, i.e. huge datasets.
Can anyone recommend anything for me? It needs to be super accurate and fast, and preferably not memory-based, as I only have 32 GB of RAM to work with.
The standard way to remove duplicates is to sort the file and then do a sequential pass to remove duplicates. Sorting 500 million lines isn't trivial, but it's certainly doable. A few years ago I had a daily process that would sort 50 to 100 gigabytes on a 16 gb machine.
By the way, you might be able to do this with an off-the-shelf program. Certainly the GNU sort utility can sort a file larger than memory. I've never tried it on a 500 GB file, but you might give it a shot. You can download it along with the rest of the GNU Core Utilities. That utility has a --unique option, so you should be able to just sort --unique input-file > output-file. It uses a technique similar to the one I describe below. I'd suggest trying it on a 100 megabyte file first, then slowly working up to larger files.
With GNU sort and the technique I describe below, it will perform a lot better if the input and temporary directories are on separate physical disks. Put the output either on a third physical disk, or on the same physical disk as the input. You want to reduce I/O contention as much as possible.
There might also be a commercial (i.e. pay) program that will do the sorting. Developing a program that will sort a huge text file efficiently is a non-trivial task. If you can buy something for a few hundreds of dollars, you're probably money ahead if your time is worth anything.
If you can't use a ready made program, then . . .
If your text is in multiple smaller files, the problem is easier to solve. You start by sorting each file, removing duplicates from those files, and writing the sorted temporary files that have the duplicates removed. Then run a simple n-way merge to merge the files into a single output file that has the duplicates removed.
If you have a single file, you start by reading as many lines as you can into memory, sorting those, removing duplicates, and writing a temporary file. You keep doing that for the entire large file. When you're done, you have some number of sorted temporary files that you can then merge.
In pseudocode, it looks something like this:
fileNumber = 0
while not end-of-input
    load as many lines as you can into a list
    sort the list
    filename = "file" + fileNumber
    write the sorted list to filename, optionally removing duplicates
    fileNumber = fileNumber + 1
You don't really have to remove the duplicates from the temporary files, but if your unique data is really only 10% of the total, you'll save a huge amount of time by not outputting duplicates to the temporary files.
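A rough C# version of that chunking pass (the file names and chunk size are arbitrary choices, not from the answer) might look like this:

using System;
using System.Collections.Generic;
using System.IO;

const int linesPerChunk = 20_000_000;          // tune to available memory
var chunkFiles = new List<string>();
var chunk = new List<string>();

using (var input = new StreamReader("input.txt"))
{
    string line;
    while ((line = input.ReadLine()) != null)
    {
        chunk.Add(line);
        if (chunk.Count == linesPerChunk)
        {
            chunkFiles.Add(WriteSortedChunk(chunk, chunkFiles.Count));
            chunk.Clear();
        }
    }
}
if (chunk.Count > 0)
    chunkFiles.Add(WriteSortedChunk(chunk, chunkFiles.Count));

static string WriteSortedChunk(List<string> lines, int fileNumber)
{
    lines.Sort(StringComparer.Ordinal);
    string name = "chunk" + fileNumber + ".txt";
    using var writer = new StreamWriter(name);
    string previous = null;
    foreach (string l in lines)
    {
        if (l != previous) writer.WriteLine(l);   // drop duplicates while writing
        previous = l;
    }
    return name;
}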
Once all of your temporary files are written, you need to merge them. From your description, I figure each chunk that you read from the file will contain somewhere around 20 million lines. So you'll have maybe 25 temporary files to work with.
You now need to do a k-way merge. That's done by creating a priority queue. You open each file, read the first line from each file and put it into the queue along with a reference to the file that it came from. Then, you take the smallest item from the queue and write it to the output file. To remove duplicates, you keep track of the previous line that you output, and you don't output the new line if it's identical to the previous one.
Once you've output the line, you read the next line from the file that the one you just output came from, and add that line to the priority queue. You continue this way until you've emptied all of the files.
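With .NET 6's PriorityQueue, the k-way merge described above might be sketched roughly like this (the chunk file pattern and output path are assumptions carried over from the previous sketch):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

var chunkFiles = Directory.GetFiles(".", "chunk*.txt");        // sorted temp files from the first pass
var readers = chunkFiles.Select(f => new StreamReader(f)).ToList();
var queue = new PriorityQueue<StreamReader, string>(StringComparer.Ordinal);

foreach (var r in readers)
{
    string first = r.ReadLine();
    if (first != null) queue.Enqueue(r, first);                // seed the queue with one line per file
}

using (var output = new StreamWriter("deduplicated.txt"))
{
    string previous = null;
    while (queue.TryDequeue(out var reader, out var line))
    {
        if (line != previous) output.WriteLine(line);          // skip duplicates of the previous line
        previous = line;

        string next = reader.ReadLine();
        if (next != null) queue.Enqueue(reader, next);         // refill from the file we just consumed
    }
}
foreach (var r in readers) r.Dispose();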
I published a series of articles some time back about sorting a very large text file. It uses the technique I described above. The only thing it doesn't do is remove duplicates, but that's a simple modification to the methods that output the temporary files and the final output method. Even without optimizations, the program performs quite well. It won't set any speed records, but it should be able to sort and remove duplicates from 500 million lines in less than 12 hours. Probably much less, considering that the second pass is only working with a small percentage of the total data (because you removed duplicates from the temporary files).
One thing you can do to speed the program is operate on smaller chunks and be sorting one chunk in a background thread while you're loading the next chunk into memory. You end up having to deal with more temporary files, but that's really not a problem. The heap operations are slightly slower, but that extra time is more than recaptured by overlapping the input and output with the sorting. You end up getting the I/O essentially for free. At typical hard drive speeds, loading 500 gigabytes will take somewhere in the neighborhood of two and a half to three hours.
Take a look at the article series. It's many different, mostly small, articles that take you through the entire process that I describe, and it presents working code. I'm happy to answer any questions you might have about it.
I am no specialist in such algorithms, but if it is textual data (or numbers, it doesn't matter), you can try to read your big file and write it into several files keyed by the first two or three characters: all lines starting with "aaa" go to aaa.txt, all lines starting with "aab" go to aab.txt, and so on. You'll get lots of files within which the data are in an equivalence relation: any duplicate of a line is in the same file as the line itself. Now just de-duplicate each file in memory and you're done.
Again, not sure that it will work, but I'd try this approach first...
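A rough sketch of that bucketing idea (the prefix length and paths are assumptions, and the prefixes are assumed to be safe to use in file names):

using System.Collections.Generic;
using System.IO;

var writers = new Dictionary<string, StreamWriter>();
using (var input = new StreamReader("input.txt"))
{
    string line;
    while ((line = input.ReadLine()) != null)
    {
        string key = line.Length >= 3 ? line.Substring(0, 3) : line;   // bucket by first 3 characters
        if (!writers.TryGetValue(key, out var w))
            writers[key] = w = new StreamWriter("bucket_" + key + ".txt");
        w.WriteLine(line);
    }
}
foreach (var w in writers.Values) w.Dispose();

// Every copy of a given line lands in the same bucket, so each bucket can be
// de-duplicated independently in memory, e.g. File.ReadLines(path).Distinct().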

Is my binary files caching method stupid?

In my C# application I have to read a huge number of binary files, but on the first run, reading those files using FileStream and BinaryReader takes a lot of time. The second time you run the app, reading the files is 4 times faster.
After reading the post "Slow reading hundreds of files", I decided to precache the binary files.
After reading this other post, "How can I check if a program is running for the first time?", my app can now detect whether it is running for the first time, and if so I precache the files using the simple technique from "Caching a binary file in C#".
Is there another way of precaching a huge number of binary files?
Edit:
This is how I read and parse the files
var f_strm = new FileStream(location, FileMode.Open, FileAccess.Read);   // location: path of the binary file
var readBinary = new BinaryReader(f_strm);
Parse(readBinary);
The Parse() function just contains a switch statement that I use to parse the data.
I don't do anything more complicated. As an example, I tried to read and parse 10,000 binary files of 601 KB each: it took 39 seconds and about 589,000 cycles to read and parse the files.
When I run the app again, it takes only about 45,000 cycles and 1.5 seconds to read and parse.
Edit:
By "huge" amount of files I mean millions of files. It's not always the case, but most of the time I have to deal with at least 10.000 files. The size of those files can be between 600Ko and 700MB.
Just read them once and discard the results. That puts them into the OS cache and makes future reads very fast. Using the OS cache is automatic and very safe.
Or, make yourself a Dictionary<string, byte[]> where you store the file contents keyed by the file path. Be careful not to exhaust available memory or your app will fail or become very slow due to paging.
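A small sketch of both suggestions (the folder path and file pattern are made up):

using System.Collections.Generic;
using System.IO;

string dataFolder = @"C:\data";                       // hypothetical location of the binary files

// Option 1: warm the OS file cache by reading each file once and discarding the bytes.
foreach (string path in Directory.EnumerateFiles(dataFolder, "*.bin"))
    File.ReadAllBytes(path);

// Option 2: keep the contents in memory, keyed by path (watch total memory use).
var cache = new Dictionary<string, byte[]>();
foreach (string path in Directory.EnumerateFiles(dataFolder, "*.bin"))
    cache[path] = File.ReadAllBytes(path);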

optimizing streaming in of many small files

I have hundreds of thousands of small text files, between 0 and 8 KB each, on a LAN network share. I can use some interop calls with kernel32.dll and FindFirstFileEx to recursively pull a list of the fully qualified UNC path of each file and store the paths in memory in a collection class such as List<string>. Using this approach I was able to populate the List<string> fairly quickly (about 30 seconds per 50k file names, compared to 3 minutes with Directory.GetFiles).
However, once I've crawled the directories and stored the file paths in the List<string>, I want to make a pass over every path stored in my list, read the contents of the small text file, and perform some action based on the values read in.
As a test bed I iterated over each file path in a List<string> that stored 42,945 file paths to this LAN network share and performed the following lines on each FileFullPath:
StreamReader file = new StreamReader(FileFullPath);
file.ReadToEnd();
file.Close();
With just these lines, it takes 13-15 minutes of runtime for all 42,945 file paths stored in my list.
Is there a more optimal way to load in many small text files via C#? Is there some interop I should consider? Or is this pretty much the best I can expect? It just seems like an awfully long time.
I would consider using Directory.EnumerateFiles, and then processing your files as you read them.
This would prevent the need to actually store the list of 42,945 files at once, as well as open up the potential of doing some of the processing in parallel using PLINQ (depending on the processing requirements of the files).
If the processing has a reasonably large CPU portion of the total time (and it's not purely I/O bound), this could potentially provide a large benefit in terms of complete time required.
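A rough sketch of that combination (the share path and ProcessContents are placeholders for the real path and whatever per-file processing is needed):

using System.IO;
using System.Linq;

var results = Directory
    .EnumerateFiles(@"\\server\share\files", "*.txt", SearchOption.AllDirectories)
    .AsParallel()
    .WithDegreeOfParallelism(4)                        // tune for the share and the workload
    .Select(path => ProcessContents(path, File.ReadAllText(path)))
    .ToList();

// ProcessContents is a stand-in for the action performed on each file's text.
static string ProcessContents(string path, string text) => $"{path}: {text.Length} chars";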
