Writing at the end of file - c#

I'm working on a system that requires high file I/O performance (with C#).
Basically, I'm filling up large files (~100MB) from the start of the file until the end of the file.
Every ~5 seconds I'm adding ~5MB to the file (sequentially from the start of the file), and I flush the stream after every bulk.
Every few minutes I need to update a structure which I write at the end of the file (some kind of metadata).
When flushing each one of the bulks I have no performance issue.
However, when updating the metadata at the end of the file I get really low performance.
My guess is that when the file is created (which also has to be extra fast), the full 100MB isn't really allocated on disk, so when I flush the metadata the file system must allocate all the space up to the end of the file.
Guys/Girls, any idea how I can overcome this problem?
Thanks a lot!
From comment:
Generally speaking, the code is as follows. First the file is opened:
m_Stream = new FileStream(filename,
    FileMode.CreateNew,
    FileAccess.Write,
    FileShare.Write, 8192, false);
m_Stream.SetLength(100*1024*1024);
Every few seconds I'm writing ~5MB:
m_Stream.Seek(m_LastPosition, SeekOrigin.Begin);
m_Stream.Write(buffer, 0, buffer.Length);
m_Stream.Flush();
m_LastPosition += buffer.Length; // HH: guessed the +=
Every few minutes I'm updating the metadata at the end of the file:
m_Stream.Seek(-m_MetaDataSize, SeekOrigin.End); // negative offset: seek back from the end
m_Stream.Write(metadata, 0, metadata.Length);
m_Stream.Flush(); // Takes too long on the first time(~1 sec).

As suggested above, would it not make sense (assuming you must have the metadata at the end of the file) to write that first?
That would do 2 things (assuming a non sparse file)...
1. allocate the total space for the entire file
2. make any following write operations a little faster as the space is ready and waiting.
Can you not do this asynchronously?
At least the application can then move on to other things.
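A minimal sketch of the "write the metadata first" idea, assuming a non-sparse file as the answer does. `PreallocatedFile` and `Create` are hypothetical names used only for illustration. Note that this does not remove the allocation cost; it only moves it to file creation time, which may or may not be acceptable given that creation also has to be fast.

```csharp
using System;
using System.IO;

static class PreallocatedFile
{
    // Creates the file at its final size and immediately writes the metadata
    // block at the very end, so the space up to the end of the file is
    // touched once, up front, instead of on the first metadata update.
    public static FileStream Create(string path, long totalSize, byte[] metadata)
    {
        var stream = new FileStream(path, FileMode.CreateNew, FileAccess.Write,
                                    FileShare.Write, 8192, false);
        stream.SetLength(totalSize);

        // Negative offset: position the stream metadata.Length bytes before the end.
        stream.Seek(-metadata.Length, SeekOrigin.End);
        stream.Write(metadata, 0, metadata.Length);
        stream.Flush();

        // Rewind so the sequential bulk writes start at the beginning.
        stream.Seek(0, SeekOrigin.Begin);
        return stream;
    }
}
```

Later metadata updates would seek to the same position and overwrite the block in place.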

Have you tried the AppendAllText method?

Your question isn't totally clear, but my guess is you create a file, write 5MB, then seek to 100MB and write the metadata, then seek back to 5MB and write another 5MB and so on.
If that is the case, this is a filesystem problem. When you extend the file, NTFS has to fill the gap in with something. As you say, the file is not allocated until you write to it. The first time you write the metadata the file is only 5MB long, so when you write the metadata NTFS has to allocate and write out 95MB of zeros before it writes the metadata. Upsettingly I think it also does this synchronously, so you don't even win using overlapped IO.

How about using the BufferedStream?
http://msdn.microsoft.com/en-us/library/system.io.bufferedstream(v=VS.100).aspx

Related

Limit file size on Dokan FileSystem

I'm trying to make a virtual file system using the C# version of Dokan.
What I want to do right now is to set the max limit of a file for my filesystem, for example, the filesystem can't have a file with more than 2GB.
At the moment I'm doing this in the SetEndOfFile operation, but I can only give a DiskFull error. I want to return something like NtStatus.FileTooLarge, but when I do that the filesystem simply ignores that return value.
Are there any options to do what I want?
You can check the offset in the WriteFile function: if the offset is bigger than 2GB, you return NtStatus.FileTooLarge.
You should also check buffer.Length, because when you edit a file on the disk the write is not necessarily chopped into smaller buffered chunks; the entire file may be put into one buffer. In this scenario, you should find a way not to destroy the file, because CleanUp will be called immediately after returning NtStatus.FileTooLarge from the WriteFile method, with info.DeleteOnClose set to true.
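The combined offset and buffer-length check described above can be sketched as a small helper. `FileSizeLimit` and `WouldExceed` are hypothetical names, not part of Dokan; the idea is that a WriteFile implementation would call something like this with its `offset` and `buffer.Length` arguments before deciding which status to return.

```csharp
static class FileSizeLimit
{
    // The 2GB limit from the question.
    public const long MaxFileSize = 2L * 1024 * 1024 * 1024;

    // True when writing bufferLength bytes starting at offset would end
    // past maxSize. Checking only the offset misses the case where a single
    // large buffer starts below the limit but extends beyond it.
    public static bool WouldExceed(long offset, int bufferLength, long maxSize = MaxFileSize)
    {
        return offset + bufferLength > maxSize;
    }
}
```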

FileStream.Read() - bytes read

FileStream.Read() returns the amount of bytes read, but... is there any situation other than having reached the end of file, that it will read less bytes than the number of bytes requested and not throw an exception?
the documentation says:
The Read method returns zero only after reaching the end of the stream. Otherwise, Read always reads at least one byte from the stream before returning. If no data is available from the stream upon a call to Read, the method will block until at least one byte of data can be returned. An implementation is free to return fewer bytes than requested even if the end of the stream has not been reached.
But this doesn't quite explain in what situations data would be unavailable and cause the method to block until it can read again. I mean, shouldn't most situations where data is unavailable force an exception?
What are real situations where comparing the number of bytes read against the number of expected bytes could differ (assuming that we're already checking for end of file when we mention number of bytes expected)?
EDIT: A bit more information, reason why I'm asking this is because I've come across a bit of code where the developer pretty much did something like this:
bytesExpected = (remainingBytesInFile > 94208 ? 94208 : remainingBytesInFile);
while (bytesRead < bytesExpected)
{
    bytesRead += fileStream.Read(buffer, bytesRead, bytesExpected - bytesRead);
}
Now, I can't see any advantage to having this while at all, I'd expect it to throw an exception if it can't read the number of bytes expected (bearing in mind it's already taking into account that there are those many bytes left to read)
What would the reason one could possibly have for something like this? I'm sure I'm missing something
The documentation is for Stream.Read, from which FileStream is derived. Since FileStream is a stream, it should obey the stream contract. Not all streams do, but unless you have a very good reason, you should stick to that.
In a typical file stream, you'll only get a return value smaller than count when you reach the end of file (and it's a pretty simple way of checking for the end of file).
However, in a NetworkStream, for example, you keep reading in a loop until the method returns zero - signalling the end of stream. The same works for file streams - you know you're at the end of the file when Read returns zero.
Most importantly, FileStream isn't just for what you'd consider files - it's also for pseudo-files like standard input/output pipes and COM ports, for example (try opening a file stream on PRN, for example). In that case, you're not reading a file with a fixed length, and the behaviour is the same as with NetworkStream.
Finally, don't forget that FileStream isn't sealed. It's perfectly fine for you to implement a virtualized file system, for example - and it's perfectly fine if your virtualized file system doesn't support seeking, or checking the length of file.
EDIT:
To address your edit, this is exactly how you're supposed to read any stream. Nothing wrong with it. If there's nothing else to read in a stream, the Read method will simply return 0, and you know the stream is over. The only thing is, it seems that he tries to fill his buffer to full, one buffer at a time - this only makes sense if you explicitly need to partition the file by 94208 bytes, and pass that byte[] for further processing somewhere.
If that's not the case, you don't really need to fill the full buffer - you just keep reading (and probably writing on some other side) until Read returns 0. And indeed, by default, FileStream will always fill the whole buffer unless it's built around a pipe handle - but since that's a possibility, you shouldn't rely on the "real file" behaviour, so as long as you need those byte[] for something non-stream (e.g. parsing messages), this is entirely fine. If you're only using the stream as an actual stream, and you're streaming the data somewhere else, it doesn't have a point, really - you only need one while to read the file.
Your expectations would only apply to the case when the stream is reading data off of a no-latency source. Other I/O sources can be slow, which is why the Read method will not always be able to return immediately. That doesn't mean that there is an error (so no exception), just that it has to wait for data to arrive.
Examples: network stream, file stream on slow disk, etc.
(UPDATE, HDD example) To give an example specific to files (since your case is FileStream, although Read is defined on Stream and so all implementations should fulfill the requirements): mechanical hard drives go to "sleep" when not active (especially on battery-powered devices, read: laptops). Spinning up can take a second or so. That is not an IOException, but your read would have to wait for a second before any data is read.
Simple answer is that on a FileStream it probably never happens.
However, keep in mind that the Read method is inherited from Stream, which serves as the base for many other streams like NetworkStream, and in that case you may not be able to read as many bytes as you requested simply because they haven't been received from the network yet.
So like the documentation says it all depends on the implementation of the specific type of stream - FileStream, NetworkStream, etc.
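The read loop discussed in this thread can be sketched as a reusable helper. `StreamHelpers.ReadFully` is a hypothetical name. Unlike the loop quoted in the question, this version stops when Read returns 0, so it cannot spin forever if the stream ends earlier than expected.

```csharp
using System.IO;

static class StreamHelpers
{
    // Keeps calling Read until `count` bytes have been read or the stream
    // ends. Returns the number of bytes actually read, which is less than
    // `count` only when the end of the stream was reached first.
    public static int ReadFully(Stream stream, byte[] buffer, int offset, int count)
    {
        int total = 0;
        while (total < count)
        {
            int n = stream.Read(buffer, offset + total, count - total);
            if (n == 0)
            {
                break; // end of stream
            }
            total += n;
        }
        return total;
    }
}
```

On .NET 7 and later, the built-in Stream.ReadExactly and Stream.ReadAtLeast methods cover the same need.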

C# - remove blocks of bytes in large binary files

I want a fast way in C# to remove blocks of bytes at different places in a binary file of between 500MB and 1GB. The start and the length of each block to be removed are saved in arrays:
int[] rdiDataOffset= {511,15423,21047};
int[] rdiDataSize={102400,7168,512};
EDIT:
this is a piece of my code and it will not work correctly unless i put buffer size to 1:
while (true)
{
    if (rdiDataOffset.Contains((int)fsr.Position))
    {
        int idxval = Array.IndexOf(rdiDataOffset, (int)fsr.Position, 0, rdiDataOffset.Length);
        int oldRFSRPosition = (int)fsr.Position;
        size = rdiDataSize[idxval];
        fsr.Seek(size, SeekOrigin.Current);
    }
    int bufferSize = size == 0 ? 2048 : size;
    if ((size > 0) && (bufferSize > (size))) bufferSize = (size);
    if (bufferSize > (fsr.Length - fsr.Position)) bufferSize = (int)(fsr.Length - fsr.Position);
    byte[] buffer = new byte[bufferSize];
    int nofbytes = fsr.Read(buffer, 0, buffer.Length);
    fsr.Flush();
    if (nofbytes < 1)
    {
        break;
    }
}
No common file system provides an efficient way to remove chunks from the middle of an existing file (only truncate from the end). You'll have to copy all the data after the removal back to the appropriate new location.
A simple algorithm for doing this using a temp file (it could be done in-place as well but you have a riskier situation in case things go wrong).
1. Create a new file and call SetLength to set the stream size (if this is too slow you can Interop to SetFileValidData). This ensures that you have room for your temp file while you are doing the copy.
2. Sort your removal list in ascending order.
3. Read from the current location (starting at 0) to the first removal point. The source file should be opened without granting Write share permissions (you don't want someone mucking with it while you are editing it).
4. Write that content to the new file (you will likely need to do this in chunks).
5. Skip over the data not being copied.
6. Repeat from #3 until done.
You now have two files - the old one and the new one ... replace as necessary. If this is really critical data you might want to look at a transactional approach (either one you implement or using something like NTFS transactions).
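The copy-and-skip steps above can be sketched as follows. `BlockRemover` and `CopySkipping` are hypothetical names, and the sketch assumes the offsets are absolute positions in the original file; overlapping ranges are merged rather than rejected. It works on any seekable Stream, so the destination can be a FileStream over the temp file.

```csharp
using System;
using System.IO;

static class BlockRemover
{
    // Copies `source` to `destination`, leaving out the byte ranges described
    // by offsets[i] (start) and sizes[i] (length).
    public static void CopySkipping(Stream source, Stream destination, long[] offsets, long[] sizes)
    {
        // Sort the removal list by start offset, keeping sizes aligned.
        long[] starts = (long[])offsets.Clone();
        long[] lengths = (long[])sizes.Clone();
        Array.Sort(starts, lengths);

        byte[] buffer = new byte[81920];
        long position = 0;
        source.Seek(0, SeekOrigin.Begin);

        for (int i = 0; i < starts.Length; i++)
        {
            long start = starts[i];
            long end = start + lengths[i];
            if (start > position)
            {
                // Copy the bytes between the previous removal and this one.
                CopyBytes(source, destination, start - position, buffer);
            }
            if (end > position)
            {
                // Skip over the removed block.
                position = end;
                source.Seek(position, SeekOrigin.Begin);
            }
        }
        // Copy whatever follows the last removed block.
        CopyBytes(source, destination, source.Length - position, buffer);
    }

    static void CopyBytes(Stream source, Stream destination, long count, byte[] buffer)
    {
        while (count > 0)
        {
            int n = source.Read(buffer, 0, (int)Math.Min(buffer.Length, count));
            if (n == 0) break; // source ended early
            destination.Write(buffer, 0, n);
            count -= n;
        }
    }
}
```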
Consider a new design. If this is something you need to do frequently then it might make more sense to have an index in the file (or near the file) which contains a list of inactive blocks - then when necessary you can compress the file by actually removing blocks ... or maybe this IS that process.
If you're on the NTFS file system (most Windows deployments are) and you don't mind doing p/invoke methods, then there is a way, way faster way of deleting chunks from a file. You can make the file sparse. With sparse files, you can eliminate a large chunk of the file with a single call.
When you do this, the file is not rewritten. Instead, NTFS updates metadata about the extents of zeroed-out data. The beauty of sparse files is that consumers of your file don't have to be aware of the file's sparseness. That is, when you read from a FileStream over a sparse file, the zeroed-out extents simply read back as zeros without occupying disk space; the file's logical length and the offsets of the remaining data do not change.
NTFS uses such files for its own bookkeeping. The USN journal, for example, is a very large sparse memory-mapped file.
The way you make a file sparse and zero out sections of that file is to use the DeviceIoControl Windows API. It is arcane and requires p/invoke, but if you go this route, you'll surely hide the ugliness behind nice pretty function calls.
There are some issues to be aware of. For example, if the file is moved to a non-NTFS volume and then back, the sparseness of the file can disappear - so you should program defensively.
Also, a sparse file can appear to be larger than it really is - complicating tasks involving disk provisioning. A 5GB sparse file that has been completely zeroed out still counts 5GB towards a user's disk quota.
If a sparse file accumulates a lot of holes, you might want to occasionally rewrite the file in a maintenance window. I haven't seen any real performance troubles occur, but I can at least imagine that the metadata for a swiss-cheesy sparse file might accrue some performance degradation.
Here's a link to some doc if you're into the idea.

FileStream.Seek vs. Buffered Reading

Motivated by this answer I was wondering what's going on under the curtain if one uses lots of FileStream.Seek(-1).
For clarity I'll repost the answer:
using (var fs = File.OpenRead(filePath))
{
    fs.Seek(0, SeekOrigin.End);
    int newLines = 0;
    while (newLines < 3)
    {
        fs.Seek(-1, SeekOrigin.Current);
        newLines += fs.ReadByte() == 13 ? 1 : 0; // look for \r
        fs.Seek(-1, SeekOrigin.Current);
    }
    byte[] data = new byte[fs.Length - fs.Position];
    fs.Read(data, 0, data.Length);
}
Personally I would have read like 2048 bytes into a buffer and searched that buffer for the char.
Using Reflector I found out that internally the method is using SetFilePointer.
Is there any documentation about windows caching and reading a file backwards? Does Windows buffer "backwards" and consult the buffer when using consecutive Seek(-1) or will it read ahead starting from the current position?
It's interesting that on the one hand most people agree with Windows doing good caching, but on the other hand every answer to "reading file backwards" involves reading chunks of bytes and operating on that chunk.
Going forward vs backward doesn't usually make much difference. The file data is read into the file system cache after the first read, you get a memory-to-memory copy on ReadByte(). That copy isn't sensitive to the file pointer value as long as the data is in the cache. The caching algorithm does however work from the assumption that you'd normally read sequentially. It tries to read ahead, as long as the file sectors are still on the same track. They usually are, unless the disk is heavily fragmented.
But yes, it is inefficient. You'll get hit with two pinvoke and API calls for each individual byte. There's a fair amount of overhead in that, those same two calls could also read, say, 65 kilobytes with the same amount of overhead. As usual, fix this only when you find it to be a perf bottleneck.
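For comparison, a minimal sketch of the chunked alternative the question mentions: read a block at a time from the end and scan it in memory, so each Seek/Read pair covers thousands of bytes instead of one. `TailReader` and `FindTailStart` are hypothetical names, and this version looks for `\n` rather than `\r`.

```csharp
using System;
using System.IO;

static class TailReader
{
    // Scans backwards from the end of a seekable stream, one chunk at a time,
    // and returns the offset just after the lineCount-th '\n' byte counted
    // from the end, or 0 if the stream contains fewer newlines than that.
    public static long FindTailStart(Stream stream, int lineCount, int chunkSize = 4096)
    {
        byte[] chunk = new byte[chunkSize];
        long position = stream.Length;
        int newLines = 0;

        while (position > 0)
        {
            int toRead = (int)Math.Min(chunkSize, position);
            position -= toRead;
            stream.Seek(position, SeekOrigin.Begin);

            // Fill the chunk; Read may return fewer bytes than requested.
            int read = 0;
            while (read < toRead)
            {
                int n = stream.Read(chunk, read, toRead - read);
                if (n == 0) throw new EndOfStreamException();
                read += n;
            }

            // Scan the chunk in memory from its last byte to its first.
            for (int i = toRead - 1; i >= 0; i--)
            {
                if (chunk[i] == (byte)'\n' && ++newLines == lineCount)
                {
                    return position + i + 1;
                }
            }
        }
        return 0;
    }
}
```

The caller can then seek to the returned offset and read to the end of the stream in one go.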
Here is a pointer on File Caching in Windows
The behavior may also depend on where the file physically resides (hard disk, network, etc.) as well as on local configuration/optimization.
Another important source of information is the CreateFile API documentation: CreateFile Function
There is a good section named "Caching Behavior" that tells us how you can influence file caching, at least in the unmanaged world.

Is there a way to make this faster? MemoryStream vs FileStream

I am working with iTextSharp, and need to generate hundreds of thousands of RTF documents - the resulting files are between 5KB and 500KB.
I am listing 2 approaches below. The original approach wasn't necessarily slow, but I figured, why write to and read back from a file just to get the output string I need? I saw this other approach using MemoryStream, but it actually slowed things down. I essentially just need the outputted RTF content, so that I can run some filters on that RTF to clean up unnecessary formatting. The queries bringing back the data are very quick, seemingly instant. Generating 1000 files (actually 2000 files are created in the process) takes about 15 minutes with the original approach; the same with the second approach takes about 25-30 minutes. The resulting files that I've run are averaging around 80KB.
Is there something wrong with the second approach? Seems like it should be faster than the first one, not slower.
Original approach:
RtfWriter2.GetInstance(doc, new FileStream(RTFFilePathName, FileMode.Create));
doc.Open();
//Add Tables and stuff here
doc.Close(); //It saves a file here to (RTFPathFileName)
StreamReader srRTF = new StreamReader(RTFFilePathName);
string rtfText = srRTF.ReadToEnd();
srRTF.Close();
//Do additional things with rtfText before writing to my final file
New approach, trying to speed it up but this is actually half as fast:
MemoryStream stream = new MemoryStream();
RtfWriter2.GetInstance(doc, stream);
doc.Open();
//Add Tables and stuff here
doc.Close();
string rtfText =
ASCIIEncoding.ASCII.GetString(stream.GetBuffer());
stream.Close();
//Do additional things with rtfText before writing to my final file
The second approach I am trying I found here:
iTextSharp - How to generate a RTF document in the ClipBoard instead of a file
How big is your resulting stream? MemoryStream performs a lot of memory copy operations while growing, so for large results writing data in small chunks may take significantly longer than with FileStream.
To verify whether this is the problem, set the initial size of the MemoryStream to some large value around the resulting size and re-run the code.
To fix it you can pre-grow the memory stream initially (if you know the approximate output size) or write your own stream that uses a different scheme when growing. Using a temporary file might also be good enough for your purposes as is.
Like Alexei said, it's probably caused by the fact that you are creating a MemoryStream every time, and every time it continuously reallocates memory as it grows. Try creating only one stream and resetting it to the beginning before every write.
Also, I think stream.GetBuffer() again returns new memory, so try using the same StreamReader with your MemoryStream.
And it seems your code can be easily parallelised, so you can try running it using Parallel Extensions or the ThreadPool.
And it seems a little weird: you are writing your text as bytes to a stream, then reading this stream as bytes and converting to text. Wouldn't it be possible to save your document directly as text?
A MemoryStream is not associated with a file, and has no concept of a filename. Basically, you can't do that.
You certainly can't cast between them; you can only cast upwards and downwards - not sideways; to visualise:
          Stream
            |
     ---------------
     |             |
FileStream    MemoryStream
You can cast a MemoryStream to a Stream trivially, and a Stream to a MemoryStream via a type-check; but never a FileStream to a MemoryStream. That is like saying a dog is an animal, and an elephant is an animal, so we can cast a dog to an elephant.
You could subclass MemoryStream and add a Name property (that you supply a value for), but there would still be no commonality between a FileStream and a YourCustomMemoryStream, and FileStream doesn't implement a pre-existing interface to get a Name; so the caller would have to explicitly handle both separately, or use duck-typing (maybe via dynamic or reflection).
Another option (perhaps easier) might be: write your data to a temporary file; use a FileStream from there; then (later) delete the file.
I know this is old but there is a lot of misinformation in this thread.
It's all about buffer size. The internal buffers are significantly smaller with a memory stream vs a file stream. Smaller buffers cause more reads/writes.
Just initialize your memory stream with either a file stream or a byte array with a size of around 80k. Close the doc, set the stream position to 0 and read the contents to the end.
On a side note, get buffer will return the whole allocated buffer. So if you only wrote 1 byte and the buffer is 4k, you will have a lot of garbage in your string.
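A minimal sketch combining the two points above: give the MemoryStream an initial capacity near the expected output size, and decode only the bytes that were actually written. `MemoryStreamText` and `WriteAndDecode` are hypothetical names, and a plain byte array stands in for the RTF writer's output, since the iTextSharp calls are outside the scope of this sketch.

```csharp
using System.IO;
using System.Text;

static class MemoryStreamText
{
    // Writes `payload` into a MemoryStream created with `expectedSize` bytes
    // of capacity (avoiding repeated regrowth), then decodes only the written
    // portion of the underlying buffer.
    public static string WriteAndDecode(byte[] payload, int expectedSize)
    {
        using (var stream = new MemoryStream(expectedSize))
        {
            stream.Write(payload, 0, payload.Length);

            // GetBuffer() exposes the whole allocated buffer, so the length
            // must be limited to stream.Length; otherwise the unused capacity
            // ends up in the string as trailing '\0' characters.
            return Encoding.ASCII.GetString(stream.GetBuffer(), 0, (int)stream.Length);
        }
    }
}
```

Calling stream.ToArray() instead also yields only the written bytes, at the cost of one extra copy.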
