FileStream.SetLength(long length) too slow when length is in gigabytes

I need to write a small tool that eats up a disk's free space (leaving just a few kilobytes) to test some "low disk space" use cases. The code:
new FileStream(filename, FileMode.Create).SetLength(remainingFreeBytes - 1024); // leave 1 KB of free space
But FileStream.SetLength(long length) is too slow when the length is in gigabytes; it is as slow as copying big HD movies to the disk. (Edit: Sorry, I just realized that I experienced this only when writing to removable flash drives; when writing to other local drives, the speed is fast enough.)
So I wonder, is there a faster way to write blank files (that is, files filled with zeros)? Code in C/C++ is also welcome.
Or is there another trick that lets me test the "low disk space" cases without having to write blank files?

You can create large files without writing to them via the Windows API call SetFileValidData.
However, note that it will NOT fill the file with zeros (which is why it is faster). Also, READ CAREFULLY the documentation for that function, since there are security implications: the caller needs the SE_MANAGE_VOLUME_NAME privilege, and the file can expose whatever data was previously stored in those disk blocks.
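A minimal sketch of how the two calls might be combined, assuming Windows and a process that already has the SE_MANAGE_VOLUME_NAME privilege enabled (enabling it is not shown here). The class and method names are hypothetical; only SetFileValidData itself comes from the answer above.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

public static class DiskFiller
{
    // Native kernel32 function; only available on Windows.
    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool SetFileValidData(SafeFileHandle hFile, long validDataLength);

    // Creates a file of the given length. Returns true if the valid data
    // length could also be extended, so that the file system does not have
    // to zero-fill the file; returns false if that step was skipped or failed.
    public static bool CreateFiller(string path, long length)
    {
        using (var fs = new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.None))
        {
            fs.SetLength(length); // reserves the space

            if (!RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
                return false;

            // Fails (returns false) without the SE_MANAGE_VOLUME_NAME privilege.
            return SetFileValidData(fs.SafeFileHandle, length);
        }
    }
}
```

A call such as DiskFiller.CreateFiller(@"E:\filler.bin", new DriveInfo("E").AvailableFreeSpace - 1024) would then leave roughly 1 KB free on drive E; the drive letter and file name are placeholders.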

After a google search I was directed to this SO question:
Creating big file on Windows
That answers my question.

Related

How ReadLine works in .NET

Let's say I have a 1 GB text file and I want to read it. If I try to open this file, I get a "Memory Overflow" error. I know, the usual answer is "Use the StreamReader.ReadLine() method". But I am wondering how this works. If the program that uses ReadLine wants to get a line, it will have to open the entire text file sooner or later. As far as I know, files are stored on the disk and can only be opened in memory on an "all or nothing" basis. If only one line of my 1 GB text file is held in memory at a time by ReadLine(), this means that we have to do disk I/O for every line of my 1 GB text file while reading it. Isn't this a terrible thing to do for performance?
I'm so confused and I want some details about this.

this means that we have to do disk I/O for every line of my 1 GB text file
No, there are lots of layers between your ReadLine() call and the physical disk, designed to keep this from being a problem. The ones that matter most:
-FileStream, the underlying class that does the job for StreamReader, uses a buffer to reduce the number of ReadFile() calls. The default size is 4096 bytes.
-ReadFile() reads file data from the file system cache, not the disk. That may result in a call to the disk driver, but that's not so common. The operating system is smart enough to guess that you are likely to read more data from the file and pre-reads it from the disk as long as that is cheap to do and RAM isn't being used for anything else. It typically slurps an entire disk cylinder's worth of data.
-The disk drive itself has a cache as well, usually several megabytes.
The file system cache is by far the most important one. It is also a tricky one, because it stops you from accurately profiling your program. When you run your test over and over again, your program in fact never reads from the disk, only from the cache, which makes it unrealistically fast. That said, a 1 GB file might not quite fit in the cache; it depends on how much RAM you have in the machine.
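A minimal sketch of reading a large text file line by line, assuming UTF-8 text. The class and method names are hypothetical, and the 4096-byte buffer is passed explicitly only to make the first layer described above visible.

```csharp
using System.IO;
using System.Text;

public static class LineCounter
{
    // Counts lines without ever holding more than one line
    // (plus the stream buffers) in memory.
    public static long CountLines(string path)
    {
        long count = 0;
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read,
                                       bufferSize: 4096, options: FileOptions.SequentialScan))
        using (var reader = new StreamReader(fs, Encoding.UTF8))
        {
            // Most ReadLine() calls are served from the in-memory buffer;
            // the stream goes back to the OS only when the buffer runs dry.
            while (reader.ReadLine() != null)
                count++;
        }
        return count;
    }
}
```

FileOptions.SequentialScan is only a hint to the operating system that the file will be read front to back.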

Usually behind the scenes a FileStream object is opened which reads a large block of your file from disk and pulls it into memory. This block acts as a cache for ReadLine() to read from, so you don't have to worry about each ReadLine() causing a disk access.

Terrible thing for the performance of what?
Obviously it should be faster, provided you have the memory available to hold the whole file.
Finding and allocating a contiguous block has a cost, though.
A gig is a significant block of RAM; if your process has it, what's hurting?
Swapping could easily hurt more than streaming.
Do you need all of the file at once? Do you need it all the time?
If you went to read/write, what would that do to you?
What if the file went to 2 gig?
You can optimise for one factor. Before you do, you've got to make sure it's the right one, and above all you have to remember this is a real machine. You have a finite amount of resources, so optimisation is always robbing Peter to pay Paul. Peter might get upset...

Efficient log backup program in C#

I am writing a log backup program in C#. The main objective is to take logs from multiple servers, copy and compress the files, and then move them to a central data storage server. I will have to move about 270 GB of data every 24 hours. I have a dedicated server to run this job and a 1 Gbps LAN. Currently I am reading lines from a (text) file, copying them into a buffer stream, and writing them to the destination.
My last test copied about 2.5 GB of data in 28 minutes. This will not do. I will probably thread the program for efficiency, but I am looking for a better method to copy the files.
I was also playing with the idea of compressing everything first and then using a stream buffer a bit to copy. Really, I am just looking for a little advice from someone with more experience than me.
Any help is appreciated, thanks.

You first need to profile, as Umair said, so that you can figure out how much of the 28 minutes is spent compressing vs. transmitting. Also measure the compression rate (bytes/sec) with different compression libraries, and compare your transfer rate against other programs such as FileZilla to see if you're close to your system's maximum bandwidth.
One good library to consider is DotNetZip, which allows you to zip to a stream, which can be handy for large files.
Once you get it fine-tuned for one thread, experiment with several threads and watch your processor utilization to see where the sweet spot is.

One solution is what you mentioned: compress the files into one zip file and then transfer it over the network. This will be much faster because you are transferring a single file, and one of the principal bottlenecks during file transfers is often the security checks at the destination.
So if you use one zip file, there should be only one check.
In short:
Compress
Transfer
Decompress (if you need to)
This alone should bring you big benefits in terms of performance.

Compress the logs at source and use TransmitFile (that's a native API - not sure if there's a framework equivalent, or how easy it is to P/Invoke this) to send them to the destination. (Possibly HttpResponse.TransmitFile does the same in .Net?)
In any event, do not read your files linewise - read the files in blocks (loop doing FileStream.Read for 4K - say - bytes until read count == 0) and send that direct to the network pipe.
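A minimal sketch of the block-wise loop described above, with compression added on the way out. It uses GZipStream from the framework rather than DotNetZip or TransmitFile; the class and method names and the 64 KB buffer size are assumptions, and the destination stream stands in for whatever network or file stream is actually used.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class LogCopy
{
    // Copies source to destination in fixed-size blocks, gzip-compressing
    // on the fly. Returns the number of uncompressed bytes read.
    public static long CopyCompressed(Stream source, Stream destination, int bufferSize = 64 * 1024)
    {
        long total = 0;
        byte[] buffer = new byte[bufferSize];

        // leaveOpen: true so the caller keeps control of the destination stream.
        using (var gzip = new GZipStream(destination, CompressionLevel.Fastest, leaveOpen: true))
        {
            int read;
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                gzip.Write(buffer, 0, read);
                total += read;
            }
        }
        return total;
    }
}
```

For scale: 270 GB in 24 hours is an average of about 3.1 MB/s, the reported 2.5 GB in 28 minutes is about 1.5 MB/s, and a 1 Gbps link tops out at 125 MB/s in theory, so the network itself is unlikely to be the limiting factor.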

Try profiling your program... the bottleneck is often where you least expect it to be. As some clever guy said, "Premature optimisation is the root of all evil".
Once, in a similar scenario at work, I was given the task of optimising the process. After profiling, the bottleneck was found to be a call to a sleep function (which was used for synchronisation between threads!).

Read whole file in memory VS read in chunks

I'm relatively new to C# and programming, so please bear with me. I'm working on an application where I need to read some files and process those files in chunks (for example, data is processed in chunks of 48 bytes).
I would like to know what is better, performance-wise, to read the whole file at once in memory and then process it or to read file in chunks and process them directly or to read data in larger chunks (multiple chunks of data which are then processed).
How I understand things so far:
Read whole file in memory
pros:
-It's fast, because the most time expensive operation is seeking, once the head is in place it can read quite fast
cons:
-It consumes a lot of memory
-It consumes a lot of memory in a very short time (this is what I am mainly afraid of, because I do not want it to noticeably impact overall system performance)
Read file in chunks
pros:
-It's easier (more intuitive) to implement
    while (numberOfBytes2Read > 0)
        read n bytes
        process read data
-It consumes very little memory
cons:
-It could take much more time if the disk has to seek the file again and move the head to the appropriate position, which on average costs around 12 ms.
I know that the answer depends on file size (and hardware). I assume it is better to read the whole file at once, but up to what file size is this true? What is the maximum recommended size to read into memory at once (in bytes or relative to the hardware, for example as a percentage of RAM)?
Thank you for your answers and time.

It is recommended to read files in buffers of 4 KB or 8 KB.
You should really never read a file all at once if you want to write it back to another stream. Just read into a buffer and write the buffer back. This is especially true for web programming.
If you have to load the whole file because your operation (text processing, etc.) needs the whole content of the file, buffering does not really help, so I believe it is preferable to use File.ReadAllText or File.ReadAllBytes.
Why 4 KB or 8 KB?
This is closer to the underlying Windows operating system buffers. Files in NTFS are normally stored in 4 KB or 8 KB chunks (clusters) on the disk, although you can choose larger chunks such as 32 KB.

Your chunk needs to be just large enough; 48 bytes is of course too small, 4 KB is reasonable.
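A minimal sketch that combines the two ideas: read from disk in roughly 4 KB blocks, but hand the data to the processing code in 48-byte records. The class name, method name, and callback shape are hypothetical; the buffer is sized as a multiple of the record size (85 x 48 = 4080 bytes) so that records do not normally straddle two reads.

```csharp
using System;
using System.IO;

public static class ChunkReader
{
    public const int RecordSize = 48;

    // Reads the stream in blocks of recordsPerRead records and calls
    // handleRecord(buffer, offset) once per complete 48-byte record.
    // A trailing partial record at the end of the stream is ignored.
    public static long ProcessRecords(Stream input, Action<byte[], int> handleRecord, int recordsPerRead = 85)
    {
        byte[] buffer = new byte[RecordSize * recordsPerRead];
        long records = 0;
        int filled = 0;
        int read;

        while ((read = input.Read(buffer, filled, buffer.Length - filled)) > 0)
        {
            filled += read;

            // Process every complete record currently in the buffer.
            int complete = filled / RecordSize * RecordSize;
            for (int offset = 0; offset < complete; offset += RecordSize)
            {
                handleRecord(buffer, offset);
                records++;
            }

            // Read may return fewer bytes than requested, so keep any
            // partial record and move it to the front of the buffer.
            filled -= complete;
            if (filled > 0)
                Array.Copy(buffer, complete, buffer, 0, filled);
        }
        return records;
    }
}
```

Memory use stays constant at one buffer regardless of file size, and the number of read calls drops by a factor of about 85 compared with reading 48 bytes at a time.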

How does disk de-fragmenting work?

I'd like to have a go at writing something which shows the state of a hard drive in terms of how fragmented it is. Maybe even has a go at de-fragmenting it.
But I've realised that I don't fully understand how this works.
Can anyone explain this to me and perhaps offer some suggestions of where I might start?
I mainly use C#. Would this be a suitable language for putting something together?
Thanks in advance

Please begin with the Wikipedia Article on Disk Fragmentation
Then after that, it depends on how low-level you want to go.
First for the official howto see Defragmenting Files on MSDN.
From the article:
1. Use the FSCTL_GET_VOLUME_BITMAP control code to find a place on the volume that is large enough to accept an entire file.
   Note: If necessary, move other files to make a place that is large enough. Ideally, there is enough unallocated clusters after the first extent of the file that you can move subsequent extents into the space after the first extent.
2. Use the FSCTL_GET_RETRIEVAL_POINTERS control code to get a map of the current layout of the file on the disk.
3. Walk the RETRIEVAL_POINTERS_BUFFER structure returned by FSCTL_GET_RETRIEVAL_POINTERS.
4. Use the FSCTL_MOVE_FILE control code to move each cluster as you walk the structure.
   Note: You may need to renew either the bitmap or the retrieval structure, or both at various times as other processes write to the disk.
For a C# wrapper of the above, check out this blog post.
Finally, depending on your situation, you can use the WMI Defrag method on the Win32_Volume class.
Hope this helps.

To show the fragmentation state of a filesystem, you would have to find out which blocks of the disk belong to which files. All files that do not solely consist of consecutive blocks are fragmented; they contain holes and/or the blocks are scattered over the disk.
To defragment a filesystem you would have to move around the blocks so that all files are consecutive and rewrite the metadata to have the filesystem in a consistent state in the end.
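A minimal sketch of the "consecutive blocks" test described above, assuming the file's layout has already been obtained as a list of extents (start cluster and length in clusters, in file order). The class, method, and tuple field names are hypothetical, and obtaining the extents from the file system is not shown.

```csharp
using System;
using System.Collections.Generic;

public static class FragmentCounter
{
    // Counts the fragments of one file. Two neighbouring extents belong to
    // the same fragment when the second starts exactly where the first ends.
    public static int CountFragments(IReadOnlyList<(long Start, long Length)> extents)
    {
        if (extents.Count == 0)
            return 0;

        int fragments = 1;
        for (int i = 1; i < extents.Count; i++)
        {
            long expectedStart = extents[i - 1].Start + extents[i - 1].Length;
            if (extents[i].Start != expectedStart)
                fragments++; // gap or jump on disk: a new fragment begins
        }
        return fragments;
    }
}
```

A file is unfragmented when this returns 1; a per-volume fragmentation figure could then be derived by running the check over every file.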

When a file is saved, its bytes are put into allocated blocks. If the file grows and the next consecutive block is not available, the OS starts writing to the next available block, splitting the file into two fragments.
Defragmentation collects files into consecutive blocks by moving other blocks out of the way (into free space) so that the file being defragmented can occupy consecutive blocks. For non-solid-state hard drives this improves performance (as there is no seek time when reading consecutive blocks).
Some defragmenters move more commonly read files to the outside of the disk (since the outer tracks pass under the head at a higher linear speed than the tracks near the spindle).

Larger File Streams using C#

There are some text files (records) which I need to access using C# .NET. The problem is that those files are larger than 1 GB (the minimum size is 1 GB).
What should I do?
What factors do I need to concentrate on?
Can someone give me an idea of how to overcome this situation?
EDIT:
Thanks for the fast responses. Yes, they are fixed-length records. These text files come from a local company (their transaction records for last month).
Is it possible to access these files like normal text files (using a normal file stream)?
And what about memory management?

Expanding on CasperOne's answer
Simply put, there is no way to reliably put a 100 GB file into memory at one time. On a 32-bit machine there is simply not enough address space. On a 64-bit machine there is enough address space, but in the time it would take to actually get the file into memory, your user will have killed your process out of frustration.
The trick is to process the file incrementally. The base System.IO.Stream class is designed to process a variable (and possibly infinite) stream in distinct quantities. It has several Read methods that will only progress down a stream a specific number of bytes. You will need to use these methods in order to divide up the stream.
I can't give more information because your scenario is not specific enough. Can you give us more details on your record delimiters or some sample lines from the file?
Update
If they are fixed length records then System.IO.Stream will work just fine. You can even use File.Open() to get access to the underlying Stream object. Stream.Read has an overload that requests the number of bytes to be read from the file. Since they are fixed length records this should work well for your scenario.
As long as you don't call ReadAllText() and instead use the Stream.Read() methods which take explicit byte arrays, memory won't be an issue. The underlying Stream class will take care not to put the entire file into memory (that is of course, unless you ask it to :) ).
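A minimal sketch of reading one fixed-length record by its position, assuming every record is exactly recordLength bytes with no header in front of the first one. The class and method names are hypothetical.

```csharp
using System;
using System.IO;

public static class FixedRecordFile
{
    // Reads record number `index` (0-based). Returns null if the stream
    // ends before a complete record could be read.
    public static byte[] ReadRecord(Stream stream, long index, int recordLength)
    {
        // Fixed-length records make the byte offset a simple multiplication.
        stream.Seek(index * recordLength, SeekOrigin.Begin);

        byte[] record = new byte[recordLength];
        int filled = 0;
        while (filled < recordLength)
        {
            int read = stream.Read(record, filled, recordLength - filled);
            if (read == 0)
                return null; // end of file before a full record
            filled += read;
        }
        return record;
    }
}
```

Only one record's worth of bytes is allocated per call, so the size of the file does not affect memory use. The stream passed in could be the one returned by File.Open().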

You aren't specifically listing the problems you need to overcome. A file can be 100GB and you can have no problems processing it.
If you have to process the file as a whole then that is going to require some creative coding, but if you can simply process sections of the file at a time, then it is relatively easy to move to the location in the file you need to start from, process the data you need to process in chunks, and then close the file.
More information here would certainly be helpful.

What are the main problems you are having at the moment? The big thing to remember is to think in terms of streams - i.e. keep the minimum amount of data in memory that you can. LINQ is excellent at working with sequences (although there are some buffering operations you need to avoid, such as OrderBy).
For example, here's a way of handling simple records from a large file efficiently (note the iterator block).
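A minimal sketch of such an iterator block, assuming one record per line. The class and method names are hypothetical, and this is not necessarily the snippet the answer originally referred to.

```csharp
using System.Collections.Generic;
using System.IO;

public static class LineSource
{
    // Iterator block: lines are read lazily, one at a time, as the caller
    // enumerates, so the whole file is never held in memory.
    public static IEnumerable<string> ReadLines(string path)
    {
        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                yield return line;
        }
    }
}
```

With System.Linq imported, a query such as LineSource.ReadLines(path).Where(l => l.StartsWith("2009")).Count() streams through the file without buffering it. The built-in File.ReadLines method (.NET 4 and later) behaves similarly.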
For performing multiple aggregates/analysis over large data from files, consider Push LINQ in MiscUtil.
Can you add more context to the problems you are thinking of?

Expanding on JaredPar's answer.
If the file is a binary file (i.e. ints stored as 4 bytes, fixed-length strings, etc.) you can use the BinaryReader class. This is easier than pulling out n bytes and then trying to interpret them.
Also note that the Read method on System.IO.Stream is not guaranteed to return as many bytes as you ask for. If you ask for 100 bytes it may return fewer than that without having reached the end of the file.
The BinaryReader.ReadBytes method will block until it has read the requested number of bytes or reached the end of the file, whichever comes first.
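A minimal sketch of reading a binary record with BinaryReader.ReadBytes. The record layout (a 4-byte int followed by an 8-byte, space-padded ASCII name) and all names are made up for illustration, and BitConverter is assumed to run on a little-endian machine.

```csharp
using System;
using System.IO;
using System.Text;

public static class BinaryRecords
{
    private const int RecordSize = 12; // 4-byte id + 8-byte name (assumed layout)

    // Reads one record. Returns false when the stream ends before a
    // complete record is available.
    public static bool TryReadRecord(BinaryReader reader, out int id, out string name)
    {
        id = 0;
        name = null;

        // ReadBytes returns fewer bytes than requested only at end of stream.
        byte[] raw = reader.ReadBytes(RecordSize);
        if (raw.Length < RecordSize)
            return false;

        id = BitConverter.ToInt32(raw, 0);
        name = Encoding.ASCII.GetString(raw, 4, 8).TrimEnd(' ');
        return true;
    }
}
```

Calling TryReadRecord in a loop until it returns false walks the whole file one record at a time.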
Nice collaboration lads :)

Hey Guys, I realize that this post hasn't been touched in a while, but I just wanted to post a site that has the solution to your problem.
http://thedeveloperpage.wordpress.com/c-articles/using-file-streams-to-write-any-size-file-introduction/
Hope it helps!
-CJ
