Read whole file in memory VS read in chunks - c#

I'm relatively new to C# and programming, so please bear with me. I'm working on an application where I need to read some files and process them in chunks (for example, the data is processed in chunks of 48 bytes).
I would like to know what is better, performance-wise: to read the whole file into memory at once and then process it, to read the file in chunks and process them directly, or to read the data in larger blocks (several processing chunks at a time) and work through those.
How I understand things so far:
Read whole file in memory
pros:
-It's fast, because the most expensive operation is seeking; once the head is in place, the data can be read quite quickly
cons:
-It consumes a lot of memory
-It consumes a lot of memory in a very short time (this is what I am mainly afraid of, because I do not want it to noticeably impact overall system performance)
Read file in chunks
pros:
-It's easier (more intuitive) to implement
while (numberOfBytes2Read > 0)
    read n bytes
    process read data
-It consumes very little memory
cons:
-It could take much more time if the disk has to seek to the file again and move the head to the appropriate position, which on average costs around 12 ms.
I know that the answer depends on file size (and hardware). I assume it is better to read the whole file at once, but up to what file size is that true? What is the maximum recommended size to read into memory at once (in bytes, or relative to the hardware, for example a percentage of RAM)?
Thank you for your answers and time.

It is recommended to read files in buffers of 4K or 8K.
You should really never read a file all at once if you then want to write it back to another stream. Just read into a buffer and write the buffer back out. This is especially true for web programming.
If you have to load the whole file because your operation (text processing, etc.) needs the entire content of the file, buffering does not really help, so I believe it is preferable to use File.ReadAllText or File.ReadAllBytes.
Why 4KB or 8KB?
This is closer to the underlying Windows operating system buffers. Files in NTFS are normally stored on the disk in 4KB or 8KB chunks, although you can choose 32KB chunks.
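As a minimal sketch of the read-a-buffer-and-write-it-back pattern (the file names and the 8 KB buffer size are placeholder assumptions, not requirements):

using System.IO;

// Copy from one stream to another through a fixed-size buffer; memory use
// stays at the buffer size no matter how large the file is.
using (FileStream source = File.OpenRead("input.dat"))
using (FileStream destination = File.Create("output.dat"))
{
    byte[] buffer = new byte[8192];
    int bytesRead;
    while ((bytesRead = source.Read(buffer, 0, buffer.Length)) > 0)
    {
        // process and/or forward only the bytes that were actually read
        destination.Write(buffer, 0, bytesRead);
    }
}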

Your chunk just needs to be large enough; 48 bytes is of course too small, 4K is reasonable.

Related

Compressing and decompressing very large files using System.IO.Compression.GZipStream

My problem can be described with the following statements:
I would like my program to be able to compress and decompress selected files
I have very large files (20 GB+). It is safe to assume that the size will never fit into memory
Even after compression the compressed file might still not fit into the memory
I would like to use System.IO.Compression.GzipStream from .NET Framework
I would like my application to be parallel
As I am a newbie to compression/decompression, I had the following idea on how to do it:
I could split the file into chunks and compress each of them separately, then merge them back into one whole compressed file.
Question 1 about this approach - Is compressing multiple chunks and then merging them back together going to give me the proper result i.e. if I were to reverse the process (starting from compressed file, back to decompressed) will I receive the same original input?
Question 2 about this approach - Does this approach make sense to you? Perhaps you could point me towards some good reading on the topic? Unfortunately I could not find anything myself.
You do not need to chunk the compression just to limit memory usage. gzip is designed to be a streaming format, and requires on the order of 256KB of RAM to compress. The size of the data does not matter. The input could be one byte, 20 GB, or 100 PB -- the compression will still only need 256KB of RAM. You just read uncompressed data in, and write compressed data out until done.
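As a minimal streaming sketch (file names are placeholder assumptions), memory use stays constant no matter how large the input is:

using System.IO;
using System.IO.Compression;

// Stream the input through GZipStream; only a small internal buffer is in
// memory at any time, regardless of how large "input.bin" is.
using (FileStream source = File.OpenRead("input.bin"))
using (FileStream target = File.Create("output.gz"))
using (GZipStream gzip = new GZipStream(target, CompressionMode.Compress))
{
    source.CopyTo(gzip);
}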
The only reason to chunk the input as you diagram is to make use of multiple cores for compression. Which is a perfectly good reason for your amount of data. Then you can do exactly what you describe. So long as you combine the output in the correct order, the decompression will then reproduce the original input. You can always concatenate valid gzip streams to make a valid gzip stream. I would recommend that you make the chunks relatively large, e.g. megabytes, so that the compression is not noticeably impacted by the chunking.
Decompression cannot be chunked in this way, but it is much faster so there would be little to no benefit even if you could. The decompression is usually i/o bound.
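A rough sketch of the chunk-compress-and-concatenate idea described above. The chunk size, file paths and the unbounded task list are assumptions; a real implementation would cap the number of in-flight chunks so compressed output does not pile up in memory, and you should verify that the GZipStream in your target framework decompresses multi-member gzip files.

using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;

class ChunkedGzip
{
    const int ChunkSize = 16 * 1024 * 1024; // 16 MB per chunk (assumed value)

    // Compress one chunk into a complete, standalone gzip member.
    static byte[] CompressChunk(byte[] chunk)
    {
        using (var ms = new MemoryStream())
        {
            using (var gzip = new GZipStream(ms, CompressionMode.Compress))
            {
                gzip.Write(chunk, 0, chunk.Length);
            }
            return ms.ToArray();
        }
    }

    static void CompressFile(string inputPath, string outputPath)
    {
        var tasks = new List<Task<byte[]>>();

        using (var input = File.OpenRead(inputPath))
        {
            var buffer = new byte[ChunkSize];
            int read;
            while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Copy the chunk so the next Read does not overwrite it,
                // then compress it on the thread pool.
                var chunk = new byte[read];
                Array.Copy(buffer, chunk, read);
                tasks.Add(Task.Run(() => CompressChunk(chunk)));
            }
        }

        // Writing the compressed members back in their original order
        // produces a single valid gzip file.
        using (var output = File.Create(outputPath))
        {
            foreach (var task in tasks)
            {
                byte[] member = task.Result;
                output.Write(member, 0, member.Length);
            }
        }
    }
}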

Combining FileStream and MemoryStream to avoid disk accesses/paging while receiving gigabytes of data?

I'm receiving a file as a stream of byte[] data packets (total size isn't known in advance) that I need to store somewhere before processing it immediately after it's been received (I can't do the processing on the fly). Total received file size can vary from as small as 10 KB to over 4 GB.
One option for storing the received data is to use a MemoryStream, i.e. a sequence of MemoryStream.Write(bufferReceived, 0, count) calls to store the received packets. This is very simple, but will obviously result in an out-of-memory exception for large files.
An alternative option is to use a FileStream, i.e. FileStream.Write(bufferReceived, 0, count). This way, no out of memory exceptions will occur, but what I'm unsure about is bad performance due to disk writes (which I don't want to occur as long as plenty of memory is still available) - I'd like to avoid disk access as much as possible, but I don't know of a way to control this.
I did some testing, and most of the time there seems to be little performance difference between, say, 10,000 consecutive calls of MemoryStream.Write() vs FileStream.Write(), but a lot seems to depend on the buffer size and the total amount of data in question (i.e. the number of writes). Obviously, MemoryStream size reallocation is also a factor.
Does it make sense to use a combination of MemoryStream and FileStream, i.e. write to memory stream by default, but once the total amount of data received is over e.g. 500 MB, write it to FileStream; then, read in chunks from both streams for processing the received data (first process 500 MB from the MemoryStream, dispose it, then read from FileStream)?
Another solution is to use a custom memory stream implementation that doesn't require continuous address space for internal array allocation (i.e. a linked list of memory streams); this way, at least on 64-bit environments, out of memory exceptions should no longer be an issue. Con: extra work, more room for mistakes.
So how do FileStream vs MemoryStream reads/writes behave in terms of disk access and memory caching, i.e. the data size/performance balance? I would expect that as long as enough RAM is available, FileStream would internally read/write from memory (cache) anyway, and virtual memory would take care of the rest. But I don't know how often FileStream will explicitly access the disk when being written to.
Any help would be appreciated.
No, trying to optimize this doesn't make any sense. Windows already caches file writes; they are buffered by the file system cache. So your test is about right: both MemoryStream.Write() and FileStream.Write() actually write to RAM and have no significant performance difference. The file system driver lazily writes the data to disk in the background.
The RAM used for the file system cache is what's left over after processes have claimed their RAM needs. By using a MemoryStream, you reduce the effectiveness of the file system cache. In other words, you trade one for the other without benefit. You're in fact worse off: you use double the amount of RAM.
Don't try to help; this is already heavily optimized inside the operating system.
Since recent versions of Windows enable write caching by default, I'd say you could simply use FileStream and let Windows manage when or if anything actually is written to the physical hard drive.
If these files don't stick around after you've received them, you should probably write the files to a temp directory and delete them when you're done with them.
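A small sketch of that temp-file idea (the 4 KB buffer size is an assumption; FileOptions.DeleteOnClose removes the file automatically when the stream is disposed):

using System.IO;

// Write received data to a temp file that cleans itself up when closed.
string tempPath = Path.GetTempFileName();
using (var temp = new FileStream(tempPath, FileMode.Create, FileAccess.ReadWrite,
    FileShare.None, 4096, FileOptions.DeleteOnClose))
{
    // write the received packets here ...
    // then rewind and read them back for processing:
    temp.Position = 0;
}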
Use a FileStream constructor that allows you to define the buffer size. For example:
using (var outputFile = new FileStream("filename",
    FileMode.Create, FileAccess.Write, FileShare.None, 65536))
{
    // write the received data to outputFile here
}
The default buffer size is 4K. Using a 64K buffer reduces the number of calls to the file system. A larger buffer will reduce the number of writes further, but each write starts to take longer. Empirical data (many years of working with this stuff) indicates that 64K is a very good choice.
As somebody else pointed out, the file system will likely do further caching, and do the actual disk write in the background. It's highly unlikely that you'll receive data faster than you can write it to a FileStream.

How ReadLine works in .NET

Let's say I have a 1 GB text file and I want to read it. If I try to open this file all at once, I would get an out-of-memory error. I know the usual answer is "use the StreamReader.ReadLine() method", but I am wondering how this works. If the program that uses the ReadLine method wants to get a line, it will have to open the entire text file sooner or later. As far as I know, files are stored on disk and can be opened in memory on an "all or nothing" basis. If only one line of my 1 GB text file is held in memory at a time by using the ReadLine() method, this means we have to do disk I/O for every line of my 1 GB text file while reading it. Isn't that a terrible thing to do for performance?
I'm so confused and I want some details about this.
this means we have to do disk I/O for every line of my 1 GB text file
No, there are lots of layers between your ReadLine() call and the physical disk, designed to not make this a problem. The ones that matter most:
FileStream, the underlying class that does the job for StreamReader, uses a buffer to reduce the number of ReadFile() calls. The default size is 4096 bytes.
ReadFile() reads file data from the file system cache, not the disk. That may result in a call to the disk driver, but that's not so common. The operating system is smart enough to guess that you are likely to read more data from the file and pre-reads it from the disk as long as that is cheap to do and RAM isn't being used for anything else. It typically slurps an entire disk cylinder worth of data.
The disk drive itself has a cache as well, usually several megabytes.
The file system cache is by far the most important one. It is also a tricky one, because it stops you from accurately profiling your program. When you run your test over and over again, your program in fact never reads from the disk, only from the cache, which makes it unrealistically fast. Albeit that a 1 GB file might not quite fit; it depends on how much RAM you have in the machine.
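To make the layering concrete, a minimal sketch (the file name and the 64 KB buffer size are assumed values); only the current line plus the reader's internal buffer is held in memory at any time:

using System.IO;
using System.Text;

using (var reader = new StreamReader("huge.txt", Encoding.UTF8,
    detectEncodingFromByteOrderMarks: true, bufferSize: 64 * 1024))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // process one line at a time; FileStream refills the internal
        // buffer from the file system cache as needed
    }
}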
Usually behind the scenes a FileStream object is opened which reads a large block of your file from disk and pulls it into memory. This block acts as a cache for ReadLine() to read from, so you don't have to worry about each ReadLine() causing a disk access.
Terrible thing for the performance of what?
Obviously it should be faster, given you have the memory available to hold the whole file in memory.
Finding and allocating a contiguous block is a cost though.
A gig is a significant block of ram, if your process has it, what's hurting?
Swapping could easily hurt more than streaming.
Do you need all of the file at once? Do you need it all the time?
If you needed to both read and write, what would that do to you?
What if the file went to 2 gig?
You can optimise for one factor. Before you do, you've got to make sure it's the right one, and above all you have to remember this is a real machine. You have a finite amount of resources, so optimisation is always robbing Peter to pay Paul. Peter might get upset...

How to improve read and write speed or performance for a large number of smaller files

Yesterday I asked a question here: how to disable the disk cache in C# by invoking the Win32 CreateFile API with FILE_FLAG_NO_BUFFERING.
In my performance tests (write and read, 1000 files, about 220 MB total), FILE_FLAG_NO_BUFFERING did not improve performance and was slower than the .NET default disk cache; when I changed FILE_FLAG_NO_BUFFERING to FILE_FLAG_SEQUENTIAL_SCAN it reached the .NET default disk cache performance and was slightly faster.
Before that, I tried using MongoDB's GridFS feature to replace the Windows file system; it was not good (and I do not need the distributed feature, I was just trying it out).
In my product, the server receives a lot of smaller files (60-100 KB) per second over TCP/IP and then needs to save them to disk, and a third service reads these files once (just reads once and processes them). Would asynchronous I/O help me here, and can it give me the best speed with the lowest CPU usage? Can someone give me a suggestion, or can I still use the FileStream class?
Update 1
Could a memory-mapped file meet my requirement, i.e. writing all the files into one big file (or several) and reading them back from it?
If your PC is taking 5-10 seconds to write a 100kB file to disk, then you either have the world's oldest, slowest PC, or your code is doing something very inefficient.
Turning off disk caching will probably make things worse rather than better. With a disk cache in place, your writes will be fast, and Windows will do the slow part of flushing the data to disk later. Indeed, increasing I/O buffering usually results in significantly improved I/O in general.
You definitely want to use asynchronous writes - that means your server starts the data writing, and then goes back to responding to its clients while the OS deals with writing the data to disk in the background.
There shouldn't be any need to queue the writes (as the OS will already be doing that if disk caching is enabled), but that is something you could try if all else fails - it could potentially help by writing only one file at a time to minimise the need for disk seeks.
Generally for I/O, using larger buffers helps to increase your throughput. For example instead of writing each individual byte to the file in a loop, write a buffer-ful of data (ideally the entire file, for the sizes you mentioned) in one Write operation. This will minimise the overhead (instead of calling a write function for every byte, you call a function once for the entire file). I suspect you may be doing something like this, as it's the only way I know to reduce performance to the levels you've suggested you are getting.
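As a sketch of combining those two suggestions - one write per file, issued asynchronously so the server can keep serving clients (the path, buffer size and method name are placeholder assumptions):

using System.IO;
using System.Threading.Tasks;

// Write each received file in a single asynchronous call.
static async Task SaveFileAsync(string path, byte[] fileContents)
{
    using (var fs = new FileStream(path, FileMode.Create, FileAccess.Write,
        FileShare.None, bufferSize: 64 * 1024, useAsync: true))
    {
        // One WriteAsync for the whole 60-100 KB file instead of many small writes.
        await fs.WriteAsync(fileContents, 0, fileContents.Length);
    }
}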
Memory-mapped files will not help you. They're really best for accessing the contents of huge files.
One of the biggest and most significant improvements in your case, imo, would be to process the files without saving them to disk first; then, if you really need to store them, push them onto a queue and have another thread save them to disk. That way you immediately get the processed data you need, without waiting for the data to be written to disk, but you still end up with the files on disk afterwards, without taking computational power away from your file processor.

How to measure characteristics of file (hard-disk) I/O?

How to measure characteristics of file (hard-disk) I/O? For example, on a machine with a hard disk (with speed X), an i7 CPU (or whatever number of cores) and Y amount of RAM (with a Z Hz BIOS), what would be (on Windows):
Optimum number of files that can be written to the HD in parallel?
Optimum number of files that can be read from HD in parallel?
Facilities of the file system that help make writes faster. (For example: is there a feature or tool that lets you write batches of binary data to different sectors (or disks) and then bind them together as a file? I do not know much about the underlying file I/O in the OS, but it would be reasonable for such tools to exist!)
If there are such tools as in the previous point, are they available in .NET too?
I want to write large files (streamed over the web or from another source) as fast (and as parallel) as possible! I am coding this in C#, and it acts like a download manager, so if streaming gets interrupted, it can carry on later.
The answer (as so often) depends on your usage. The whole operating system is one big tradeoff between different usage scenarios. For the NTFS file system one could mention the block size being set to 4 KB, NTFS storing files smaller than the block size in the MFT, the size of files, the number of files, fragmentation, etc.
If you are planning to write large files then a block size of 64k may be good. That is if you plan to read large amounts of data. If you read smaller amounts of data then smaller sizes are good. The OS works in 4k pages, so 4k is good. Compression (and encryption?) as well as SQL and Exchange only work on 4k pages (iirc).
If you write small files (<4 KB) they will be stored inside the MFT itself, so you don't have to make an extra jump. This is especially useful in write operations (reads may have the MFT cached). The MFT stores files in sequences (i.e. blocks 1000-1010, 2000-2010), so fragmentation will make the MFT bigger. Writing files to disk in parallel is one of the main causes of fragmentation; the other is deleting files. You may pre-allocate the required size for a file, and Windows will try to find a suitable place on the disk to counter fragmentation. There are also real-time defragmentation programs like O&O Defrag.
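A small sketch of pre-allocating a file's final size before writing (the path and the 500 MB figure are assumed examples):

using System.IO;

using (var fs = new FileStream("download.part", FileMode.Create, FileAccess.Write))
{
    // Reserve the full size up front so NTFS can try to pick a contiguous region.
    fs.SetLength(500L * 1024 * 1024);
}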
Windows maps a binary stream pretty much directly to the physical location on the disk, so using different read/write methods will not yield as much of a performance boost as other factors. For maximum speed, programs use techniques such as mapping files directly into memory. See http://en.wikipedia.org/wiki/Memory-mapped_file
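In .NET this is exposed through System.IO.MemoryMappedFiles (available since .NET 4.0); a minimal sketch, with the file name and view size as assumptions:

using System.IO;
using System.IO.MemoryMappedFiles;

using (var mmf = MemoryMappedFile.CreateFromFile("large.bin", FileMode.Open))
using (var view = mmf.CreateViewAccessor(0, 4096)) // map the first 4 KB of the file
{
    byte first = view.ReadByte(0);  // reads come straight from the mapped pages
    view.Write(1, (byte)0xFF);      // writes go back through the same pages
}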
There is an option in Windows (under Device Manager, hard disks) to increase caching on the disk. This is dangerous, as it could damage the file system if the computer bluescreens or loses power, but it gives a big performance boost when writing smaller files (and on all writes). If the disk is busy this is especially valuable, as the seek time will decrease. Windows uses what is called the elevator algorithm, which basically means it moves the hard disk heads back and forth over the surface, serving any application in the direction it is moving.
Hope this helps. :)
