In my C# application I have to read a huge number of binary files, but on the first run, reading those files with FileStream and BinaryReader takes a long time. The second time you run the app, reading the files is about 4 times faster.
After reading this post "Slow reading hundreds of files" I decided to precache the binary files.
After reading this other post "How can I check if a program is running for the first time?", my app can now detect whether it is running for the first time, and if so I precache the files using this simple technique: "Caching a binary file in C#".
Is there another way of precaching a huge number of binary files?
Edit:
This is how I read and parse the files:
using (var f_strm = new FileStream(location, FileMode.Open, FileAccess.Read))   // "location" = file path (was "#location")
using (var readBinary = new BinaryReader(f_strm))
{
    Parse(readBinary);
}
The Parse() function just contains a switch statement that I use to parse the data.
I don't do anything more complicated. As an example, I tried to read and parse 10,000 binary files of 601 KB each; it took 39 seconds and about 589,000 cycles to read and parse them.
When I run the app again, it takes only about 45,000 cycles and 1.5 seconds to read and parse.
Edit:
By "huge" amount of files I mean millions of files. It's not always the case, but most of the time I have to deal with at least 10.000 files. The size of those files can be between 600Ko and 700MB.
Just read them once and discard the results. That puts them into the OS cache and makes future reads very fast. Using the OS cache is automatic and very safe.
Or, make yourself a Dictionary<string, byte[]> where you store the file contents keyed by the file path. Be careful not to exhaust available memory or your app will fail or become very slow due to paging.
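If the working set fits in memory, a minimal sketch of that second option could look like the following (the folder path and file pattern are placeholders, and Parse() is the method from the question):
using System.Collections.Generic;
using System.IO;

// Minimal sketch: pre-cache file contents keyed by path. The folder and pattern
// are placeholders; be careful that the total size fits in RAM.
var cache = new Dictionary<string, byte[]>();
foreach (string path in Directory.EnumerateFiles(@"C:\data", "*.bin"))
{
    // Reading the bytes also warms the OS file cache, so even if the dictionary
    // is later discarded, re-reads of the same files stay fast.
    cache[path] = File.ReadAllBytes(path);
}

// Later, parse from memory instead of the disk (Parse() is the method from the question):
// using (var reader = new BinaryReader(new MemoryStream(cache[path])))
//     Parse(reader);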
A little background...
Everything I'm about to describe, up to my implementation of the StreamWriter, is part of a business process which I cannot change.
Every month I pull around 200 different tables of data into individual files.
Each file contains roughly 400,000 lines of business logic details for upwards of 5,000-6,000 different business units.
To effectively use that data with the tools on hand, I have to break down those files into individual files for each business unit...
200 files x 5000 business units per file = 100,000 different files.
The way I've BEEN doing it is the typical StreamWriter loop...
// BusinessFiles is a List<string> of output file paths; g holds the lines belonging to that file
foreach (string SplitFile in BusinessFiles)
{
    using (StreamWriter SW = new StreamWriter(SplitFile))
    {
        foreach (var BL in g)
        {
            string[] Split1 = BL.Split(',');
            SW.WriteLine("{0,-8}{1,-8}{2,-8}{3,-8}{4,-8}{5,-8}{6,-8}{7,-8}{8,-8}{9,-16}{10,-1}",
                Split1[0], Split1[1], Split1[2], Split1[3], Split1[4], Split1[5], Split1[6], Split1[7],
                Split1[8], Convert.ToDateTime(Split1[9]).ToString("dd-MMM-yyyy"), Split1[10]);
        }
    }
}
The issue with this is that it takes an excessive amount of time.
It can sometimes take 20 minutes to process all the files.
Profiling my code shows me that 98% of the time spent is on the system disposing of the StreamWriter after the program leaves the loop.
So my question is:
Is there a way to keep the underlying Stream open and reuse it to write a different file?
I know I can Flush() the Stream but I can't figure out how to get it to start writing to another file altogether. I can't seem to find a way to change the destination filename without creating another StreamWriter.
Edit:
A picture of what it shows when I profile the code
Edit 2:
So after poking around a bit more I started looking at it a different way.
First thing: I already had the reading of the one file and the writing of the large number of smaller files in a nested parallel loop, so I was essentially maxing out my I/O as is.
I'm also writing to an SSD, so all those were good points.
Turns out I'm reading the 1 massive file and writing ~5600 smaller ones every 90 seconds or so.
That's 60 files a second. I guess I can't really ask for much more than that.
This sounds about right. 100,000 files in 20 minutes is more than 83 files every second. Disk I/O is pretty much the slowest thing you can do within a single computer. All that time in the Dispose() method is waiting for the buffer to flush out to disk while closing the file... it's the actual time writing the data to your persistent storage, and a separate using block for each file is the right way to make sure this is done safely.
To speed this up it's tempting to look at asynchronous processing (async/await), but I don't think you'll find any gains there; ultimately this is an I/O-bound task, so optimizing for your CPU scheduling might even make things worse. Better gains could be available if you can change the output to write into a single (indexed) file, so the operating system's disk buffering mechanism can be more efficient.
I would agree with Joel that the time is mostly due to writing the data out to disk. I would however be a little bit more optimistic about doing parallel IO, since SSDs are better able to handle higher loads than regular HDDs. So I would try a few things:
1. Doing stuff in parallel
Change your outer loop to a parallel one
Parallel.ForEach(
    myBusinessFiles,
    new ParallelOptions() { MaxDegreeOfParallelism = 2 },
    SplitFile =>
    {
        // Loop body
    });
Try changing the degree of parallelism to see if performance improves or not. This assumes the data is thread safe.
2. Try writing to a high-speed local SSD
I'm assuming you are writing to a network folder; this adds some additional latency, so you might try writing to a local disk instead. If you are already doing that, consider getting a faster disk. If you need to move all the files to a network drive afterwards you will likely not gain anything, but it can give you an idea of the penalty the network adds.
3. Try writing to a Zip Archive
A zip archive can contain multiple files while still allowing fairly easy access to an individual file. This could help improve performance in a few ways:
Compression. I would assume your data is fairly easy to compress, so you would write less data overall.
Fewer file system operations. Since you are only writing to a single file, you avoid some of the per-file overhead in the file system.
Reduced overhead due to cluster size. Files have a minimum on-disk size (one cluster), which can waste a fair bit of space for small files. Using an archive avoids this.
You could also try saving each file in an individual zip archive, but then you would mostly benefit from the compression.
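A rough sketch of the single-archive idea using System.IO.Compression; the archive name is a placeholder, and businessUnits/BuildBusinessUnits stand in for however the split data is actually produced:
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

// Rough sketch: write every business-unit "file" as an entry of one zip archive.
// businessUnits is an assumed Dictionary<string, List<string>> mapping an entry
// name to its already-formatted lines; BuildBusinessUnits() is a placeholder.
Dictionary<string, List<string>> businessUnits = BuildBusinessUnits();

using (var zipStream = new FileStream("split-output.zip", FileMode.Create))
using (var archive = new ZipArchive(zipStream, ZipArchiveMode.Create))
{
    foreach (var unit in businessUnits)
    {
        ZipArchiveEntry entry = archive.CreateEntry(unit.Key, CompressionLevel.Fastest);
        using (var writer = new StreamWriter(entry.Open()))
        {
            foreach (string line in unit.Value)
                writer.WriteLine(line);    // same formatting as the original loop
        }
    }
}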
Responding to your question: you do have an option (a flag on the constructor), but it is strongly tied to the garbage collector, and in a multi-threaded environment it could become a mess. That said, this is the overloaded constructor:
StreamWriter(Stream, Encoding, Int32, Boolean)
Initializes a new instance of the StreamWriter class for the specified stream by using the specified encoding and buffer size, and optionally leaves the stream open.
public StreamWriter (System.IO.Stream stream, System.Text.Encoding? encoding = default, int bufferSize = -1, bool leaveOpen = false);
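For illustration, a minimal sketch of that overload in use (the file name, encoding, and buffer size are just placeholders; leaveOpen: true is what keeps the underlying stream alive after the writer is disposed):
using System.IO;
using System.Text;

var stream = new FileStream("report.txt", FileMode.Create);

// leaveOpen: true -> disposing the writer flushes it but does NOT close the stream.
using (var writer = new StreamWriter(stream, Encoding.UTF8, bufferSize: 4096, leaveOpen: true))
{
    writer.WriteLine("first pass");
}

// The FileStream is still open and usable here, but note that it still points at
// the same file; leaveOpen does not by itself redirect output elsewhere, and the
// stream must eventually be disposed explicitly.
stream.Dispose();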
I have two files of about 50 GB each: an input file and an output file.
I am using memory-mapped files to manage these two files.
The input file contains 3 million Web pages, and after I have chosen a permutation π of them, I have to write the Web pages into the output file in the new order.
So I can read the input file sequentially and write the Web pages to different locations in the output file, according to the permutation π.
Or I can do the opposite: read the input file randomly according to the permutation π and write sequentially into the output file.
Which option is faster? Why?
TL;DR: Due to caching, all file-append operations are sequential. Even writes to the middle of files will be elevator sorted and performed at block size, etc.
Random writing tends to be faster than random reading for several reasons:
When a file grows, the filesystem can choose where to put the new block.
Writes don't have to be performed immediately; the write buffer can ensure that an entire block is written at once, meaning data won't have to be added to an existing block that already has a location.
Your processing can't take place until reads complete, and reading relies on a predictive cache. The OS is good at pre-caching sequential reads but terrible at predicting random reads. If your reads are smaller than the block size, things are even worse: the actual amount of data read from the disk will be greater than the size of the file.
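To make the comparison concrete, here is a minimal sketch of the sequential-read/random-write variant, assuming fixed-size records for simplicity (variable-size pages would additionally need an offset table computed for the output file; names are illustrative):
using System.IO;

// Minimal sketch: sequential read, permuted (random) write, assuming fixed-size
// records. pi[i] gives the output slot of input record i.
void WritePermuted(string inputPath, string outputPath, long[] pi, int recordSize)
{
    var buffer = new byte[recordSize];
    using (var input = new FileStream(inputPath, FileMode.Open, FileAccess.Read))
    using (var output = new FileStream(outputPath, FileMode.Create, FileAccess.Write))
    {
        output.SetLength(input.Length);                       // pre-size the output file
        for (long i = 0; i < pi.LongLength; i++)
        {
            input.Read(buffer, 0, recordSize);                // sequential read (robust code would loop until full)
            output.Seek(pi[i] * recordSize, SeekOrigin.Begin);
            output.Write(buffer, 0, recordSize);              // random write, absorbed by the OS write cache
        }
    }
}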
I have hundreds of thousands of small text files, between 0 and 8 KB each, on a LAN network share. I can use some interop calls with kernel32.dll and FindFileEx to recursively pull a list of the fully qualified UNC path of each file and store the paths in memory in a collection class such as List<string>. Using this approach I was able to populate the List<string> fairly quickly (about 30 seconds per 50k file names, compared to 3 minutes with Directory.GetFiles).
Once I've crawled the directories and stored the file paths in the List<string>, I want to make a pass over every path stored in my list, read the contents of the small text file, and perform some action based on the values read in.
As a test bed I iterated over each file path in a List<string> that stored 42,945 file paths to this LAN network share and performed the following lines on each FileFullPath:
using (StreamReader file = new StreamReader(FileFullPath))
{
    file.ReadToEnd();   // contents are discarded in this test; normally they would be processed
}
So with just these lines, it takes 13-15 minutes of runtime for all 42,945 file paths stored in my list.
Is there a more optimal way to load in many small text files via C#? Is there some interop I should consider? Or is this pretty much the best I can expect? It just seems like an awfully long time.
I would consider using Directory.EnumerateFiles, and then processing your files as you read them.
This would prevent the need to actually store the list of 42,945 files at once, as well as open up the potential of doing some of the processing in parallel using PLINQ (depending on the processing requirements of the files).
If the processing has a reasonably large CPU portion of the total time (and it's not purely I/O bound), this could potentially provide a large benefit in terms of complete time required.
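A hedged sketch of that suggestion; ProcessContents is a placeholder for whatever per-file work is needed, and the share path and degree of parallelism are assumptions to tune:
using System.IO;
using System.Linq;

// Sketch: enumerate lazily and process with PLINQ instead of materializing 42k paths first.
var results = Directory
    .EnumerateFiles(@"\\server\share", "*.txt", SearchOption.AllDirectories)
    .AsParallel()
    .WithDegreeOfParallelism(4)          // tune; a few reads in flight can help over the network
    .Select(path => ProcessContents(path, File.ReadAllText(path)))   // ProcessContents is hypothetical
    .ToList();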
I am working on a project which loads data from a CSV file, processes it, and then saves it to disk. For fast reading of the CSV data I am using the Lumenworks CSV reader (http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader). This works fine up to a point, but when the CSV is 1 GB or more in size it takes a long time. Is there any other way to read CSV faster?
Not a lot of info provided... so on the assumption that this is an IO limitation your options are:
Get Faster Storage [e.g. SSD, RAID].
Try compression - sometimes the time spent compressing [e.g. Zip] saves multiples in IO (see the sketch below).
Try threading - particularly useful if doing computationally hard calculations - but probably a bad fit in this scenario.
Change the problem - do you need to read/write a 1GB file? Maybe you can change the data format [156 stored as a number is a lot smaller than the text "156,"], maybe you only need to deal with smaller blocks, maybe the time taken honestly doesn't matter, etc.
Any others?
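As an illustration of the compression point above, a minimal sketch that streams a gzip-compressed CSV line by line without holding the whole file in memory (the file name is a placeholder):
using System.IO;
using System.IO.Compression;

// Sketch: read a gzipped CSV line by line; less data crosses the disk at the cost of some CPU.
using (var file = File.OpenRead("data.csv.gz"))
using (var gzip = new GZipStream(file, CompressionMode.Decompress))
using (var reader = new StreamReader(gzip))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // parse/process the line here
    }
}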
Hmm, you could try caching the output. I've experimented with MemoryMappedFiles and RAM drives... you could do it with some simple threading. While this can potentially return sooner, it carries significant risks and complexity.
I have a large file of roughly 400 GB of size. Generated daily by an external closed system. It is a binary file with the following format:
byte[8]byte[4]byte[n]
Where n is equal to the int32 value of byte[4].
This file has no delimiters; to read the whole file you just repeat until EOF, with each "item" represented as byte[8]byte[4]byte[n].
The file looks like
byte[8]byte[4]byte[n]byte[8]byte[4]byte[n]...EOF
byte[8] is a 64-bit number representing a period of time represented by .NET Ticks. I need to sort this file but can't seem to figure out the quickest way to do so.
Presently, I load the Ticks into a struct along with the start and end positions of byte[n], and read to the end of the file. After this, I sort the List in memory by the Ticks property, then open a BinaryReader and seek to each position in Ticks order, read the byte[n] value, and write it to an external file.
At the end of the process I end up with a sorted binary file, but it takes FOREVER. I am using C# .NET and a pretty beefy server, but disk IO seems to be an issue.
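For reference, the indexing pass described above might look roughly like this (a simplified sketch; file names are placeholders):
using System.Collections.Generic;
using System.IO;

// Each entry records where a byte[8]byte[4]byte[n] item starts and how long its payload is.
var index = new List<(long Ticks, long Offset, int Length)>();

using (var reader = new BinaryReader(File.OpenRead("input.bin")))
{
    while (reader.BaseStream.Position < reader.BaseStream.Length)
    {
        long offset = reader.BaseStream.Position;
        long ticks = reader.ReadInt64();                      // byte[8] timestamp
        int length = reader.ReadInt32();                      // byte[4] payload length
        reader.BaseStream.Seek(length, SeekOrigin.Current);   // skip byte[n]
        index.Add((ticks, offset, length));
    }
}

// Sort by timestamp; the slow part is then seeking back to each Offset in this
// order and copying Length bytes to the output file.
index.Sort((a, b) => a.Ticks.CompareTo(b.Ticks));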
Server Specs:
2x 2.6 GHz Intel Xeon (Hex-Core with HT) (24-threads)
32GB RAM
500GB RAID 1+0
2TB RAID 5
I've looked all over the internet and can only find examples where a huge file is 1GB (makes me chuckle).
Does anyone have any advice?
A great way to speed up this kind of file access is to memory-map the entire file into address space and let the OS take care of reading whatever bits of the file it needs. So do the same thing as you're doing right now, except read from memory instead of using a BinaryReader/seek/read.
You've got lots of main memory, so this should provide pretty good performance (as long as you're using a 64-bit OS).
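A hedged sketch of that memory-mapped approach, walking the byte[8]byte[4]byte[n] records through a view accessor (the file name is a placeholder):
using System.IO;
using System.IO.MemoryMappedFiles;

// Sketch: map the whole file and let the OS page the data in as it is touched.
using (var mmf = MemoryMappedFile.CreateFromFile("input.bin", FileMode.Open))
using (var accessor = mmf.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read))
{
    long fileLength = new FileInfo("input.bin").Length;
    long pos = 0;
    while (pos < fileLength)
    {
        long ticks = accessor.ReadInt64(pos);        // byte[8] timestamp
        int length = accessor.ReadInt32(pos + 8);    // byte[4] payload length
        // payload, if needed: accessor.ReadArray<byte>(pos + 12, buffer, 0, length);
        pos += 12 + length;
    }
}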
Use merge sort.
It's online and parallelizes well.
http://en.wikipedia.org/wiki/Merge_sort
If you can learn Erlang or Go, they could be very powerful and scale extremely well, since you have 24 threads. Use async I/O and merge sort.
And since you have 32 GB of RAM, try to load as much as you can into RAM, sort it there, then write back to disk.
I would do this in several passes. On the first pass, I would create a list of ticks, then distribute them evenly into many (hundreds?) buckets. If you know ahead of time that the ticks are evenly distributed, you can skip this initial pass. On a second pass, I would split the records into these few hundred separate files of about same size (these much smaller files represent groups of ticks in the order that you want). Then I would sort each file separately in memory. Then concatenate the files.
It is somewhat similar to the hashsort (I think).
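A rough sketch of the bucketing pass, assuming the tick range (minTicks/maxTicks) was estimated in the first pass and that file names and the bucket count are placeholders:
using System;
using System.IO;

// Sketch: distribute records into N bucket files by tick range so each bucket fits
// in memory, can be sorted on its own, and the sorted buckets can be concatenated.
const int bucketCount = 256;
long minTicks = 0;                 // placeholder: smallest tick seen in pass 1
long maxTicks = long.MaxValue;     // placeholder: largest tick seen in pass 1
double range = Math.Max(1L, maxTicks - minTicks);

var buckets = new BinaryWriter[bucketCount];
for (int i = 0; i < bucketCount; i++)
    buckets[i] = new BinaryWriter(File.Create($"bucket_{i}.bin"));

using (var reader = new BinaryReader(File.OpenRead("input.bin")))
{
    while (reader.BaseStream.Position < reader.BaseStream.Length)
    {
        long ticks = reader.ReadInt64();              // byte[8]
        int length = reader.ReadInt32();              // byte[4]
        byte[] payload = reader.ReadBytes(length);    // byte[n]

        int b = (int)((ticks - minTicks) / range * (bucketCount - 1));
        buckets[b].Write(ticks);
        buckets[b].Write(length);
        buckets[b].Write(payload);
    }
}

foreach (var w in buckets) w.Dispose();
// Each bucket_i.bin is now small enough to load, sort by ticks in memory, and
// append to the final output in bucket order.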