Crash safe on-the-fly compression with GZipStream

Crash safe on-the-fly compression with GZipStream - c#

I'm compressing a log file as data is written to it, something like:
using (var fs = new FileStream("Test.gz", FileMode.Create, FileAccess.Write, FileShare.None))
{
using (var compress = new GZipStream(fs, CompressionMode.Compress))
{
for (int i = 0; i < 1000000; i++)
{
// Clearly this isn't what is happening in production, just
// a simply example
byte[] message = RandomBytes();
compress.Write(message, 0, message.Length);
// Flush to disk (in production we will do this every x lines,
// or x milliseconds, whichever comes first)
if (i % 20 == 0)
{
compress.Flush();
}
}
}
}
What I want to ensure is that if the process crashes or is killed, the archive is still valid and readable. I had hoped that anything since the last flush would be safe, but instead I am just ending up with a corrupt archive.
Is there any way to ensure I end up with a readable archive after each flush?
Note: it isn't essential that we use GZipStream, if something else will give us the desired result.

An option is to let Windows handle the compression. Just enable compression on the folder where you're storing your log files. There are some performance considerations you should be aware of when copying the compressed files, and I don't know how well NT compression performs in comparision to GZipStream or other compression options. You'll probably want to compare compression ratios and CPU load.
There's also the option of opening a compressed file, if you don't want to enable compression on the entire folder. I haven't tried this, but you might want to look into it: http://social.msdn.microsoft.com/forums/en-US/netfxbcl/thread/1b63b4a4-b197-4286-8f3f-af2498e3afe5

Good news: GZip is a streaming format. Therefore corruption at the end of the stream cannot affect the beginning which was already written.
So even if your streaming writes are interrupted at an arbitrary point, most of the stream is still good. You can write yourself a little tool that reads from it and just stops at the first exception it sees.
If you want an error-free solution I'd recommend splitting the log into one file every x seconds (maybe x = 1 or 10?). Write into a file with extensions ".gz.tmp" and rename to ".gz" after the file was completely written and closed.

Yes, but it's more involved than just flushing. Take a look at gzlog.h and gzlog.c in the zlib distribution. It does exactly what you want, efficiently adding short log entries to a gzip file, and always leaving a valid gzip file behind. It also has protection against crashes or shutdowns during the process, still leaving a valid gzip file behind and not losing any log entries.
I recommend not using GZIPStream. It is buggy and does not provide the necessary functionality. Use DotNetZip instead as your interface to zlib.

Related

C# I/O async (copyAsync): how to avoid file fragmentation?

Within a tool copying big files between disks, I replaced the
System.IO.FileInfo.CopyTo method by System.IO.Stream.CopyToAsync.
This allow a faster copy and a better control during the copy, e.g. I can stop the copy.
But this create even more fragmentation of the copied files. It is especially annoying when I copy file of many hundreds megabytes.
How can I avoid disk fragmentation during copy?
With the xcopy command, the /j switch copies files without buffering. And it is recommended for very large file in TechNet
It seems indeed to avoid file fragmentation (while a simple file copy within windows 10 explorer DOES fragment my file!)
A copy without buffering seems to be the opposite way than this async copy. Or it there any way to do async copy without buffering?
Here it my current code for aync copy. I let the default buffersize of 81920 bytes, i.e. 10*1024*size(int64).
I am working with NTFS file systems, thus 4096 bytes clusters.
EDIT: I updated the code with SetLength as suggested, added the FileOptions Async while creating the destinationStream and fix setting the attributes AFTER setting the time (otherwise, exception is thrown for ReadOnly files)
int bufferSize = 81920;
try
{
using (FileStream sourceStream = source.OpenRead())
{
// Remove existing file first
if (File.Exists(destinationFullPath))
File.Delete(destinationFullPath);
using (FileStream destinationStream = File.Create(destinationFullPath, bufferSize, FileOptions.Asynchronous))
{
try
{
destinationStream.SetLength(sourceStream.Length); // avoid file fragmentation!
await sourceStream.CopyToAsync(destinationStream, bufferSize, cancellationToken);
}
catch (OperationCanceledException)
{
operationCanceled = true;
}
} // properly disposed after the catch
}
}
catch (IOException e)
{
actionOnException(e, "error copying " + source.FullName);
}
if (operationCanceled)
{
// Remove the partially written file
if (File.Exists(destinationFullPath))
File.Delete(destinationFullPath);
}
else
{
// Copy meta data (attributes and time) from source once the copy is finished
File.SetCreationTimeUtc(destinationFullPath, source.CreationTimeUtc);
File.SetLastWriteTimeUtc(destinationFullPath, source.LastWriteTimeUtc);
File.SetAttributes(destinationFullPath, source.Attributes); // after set time if ReadOnly!
}
I fear also that the File.SetAttributes and Time at the end on my code could increase file fragmentation.
Is there a proper way to create a 1:1 asynchronous file copy without any file fragmentation, i.e. asking the HDD that the file steam get only contiguous sectors?
Other topics regarding file fragmentation like How can I limit file fragmentation while working with .NET suggests incrementing the file size in larger chunks, but it does not seem to be a direct answer to my question.

but the SetLength method does the job
It does not do the job. It only updates the file size in the directory entry, it does not allocate any clusters. The easiest way to see this for yourself is by doing this on a very large file, say 100 gigabytes. Note how the call completes instantly. Only way it can be instant is when the file system does not also do the job of allocating and writing the clusters. Reading from the file is actually possible, even though the file contains no actual data, the file system simply returns binary zeros.
This will also mislead any utility that reports fragmentation. Since the file has no clusters, there can be no fragmentation. So it only looks like you solved your problem.
The only thing you can do to force the clusters to be allocated is to actually write to the file. It is in fact possible to allocate 100 gigabytes worth of clusters with a single write. You must use Seek() to position to Length-1, then write a single byte with Write(). This will take a while on a very large file, it is in effect no longer async.
The odds that it will reduce fragmentation are not great. You merely reduced the risk somewhat that the writes will be interleaved by writes from other processes. Somewhat, actual writing is done lazily by the file system cache. Core issue is that the volume was fragmented before you began writing, it will never be less fragmented after you're done.
Best thing to do is to just not fret about it. Defragging is automatic on Windows these days, has been since Vista. Maybe you want to play with the scheduling, maybe you want to ask more about it at superuser.com

I think, FileStream.SetLength is what you need.

Considering Hans Passant answer,
in my code above, an alternative to
destinationStream.SetLength(sourceStream.Length);
would be, if I understood it properly:
byte[] writeOneZero = {0};
destinationStream.Seek(sourceStream.Length - 1, SeekOrigin.Begin);
destinationStream.Write(writeOneZero, 0, 1);
destinationStream.Seek(0, SeekOrigin.Begin);
It seems indeed to consolidate the copy.
But a look at the source code of FileStream.SetLengthCore seems it does almost the same, seeking at the end but without writing one byte:
private void SetLengthCore(long value)
{
Contract.Assert(value >= 0, "value >= 0");
long origPos = _pos;
if (_exposedHandle)
VerifyOSHandlePosition();
if (_pos != value)
SeekCore(value, SeekOrigin.Begin);
if (!Win32Native.SetEndOfFile(_handle)) {
int hr = Marshal.GetLastWin32Error();
if (hr==__Error.ERROR_INVALID_PARAMETER)
throw new ArgumentOutOfRangeException("value", Environment.GetResourceString("ArgumentOutOfRange_FileLengthTooBig"));
__Error.WinIOError(hr, String.Empty);
}
// Return file pointer to where it was before setting length
if (origPos != value) {
if (origPos < value)
SeekCore(origPos, SeekOrigin.Begin);
else
SeekCore(0, SeekOrigin.End);
}
}
Anyway, not sure that theses method guarantee no fragmentation, but at least avoid it for most of the cases. Thus the auto defragment tool will finish the job at a low performance expense.
My initial code without this Seek calls created hundred of thousands of fragments for 1 GB file, slowing down my machine when the defragment tool went active.

Writing to file, memory used steadily increasing

I have an application where I need to write binary to a file constantly. The bits of data are small, about 1K each. The computers this is running on aren't great and are running XP. I've run into the problem that when I turn on the logging the computers just get totally hosed and I watch the Task Manager and just see the memory usage going up and up until it crashes.
A coworker suggested that I just keep the packets in memory until a certain amount of time has passed and then write it all at once instead of writing each one separately - tried that, same issue.
This is the code (loggingBuffer is the List<byte[]> I'm storing the packets in while the interval passes):
if ((DateTime.Now - lastStoreTime).TotalSeconds > 10)
{
string fileName = #"C:\Storage\file";
FileMode fm = File.Exists(fileName) ? FileMode.Append : FileMode.Create;
using (BinaryWriter w = new BinaryWriter(File.Open(fileName, fm), Encoding.ASCII))
{
foreach (byte[] packetData in loggingBuffer)
{
w.Write(packetData);
}
}
loggingBuffer.Clear();
lastStoreTime= DateTime.Now;
}
Is there anything different I should be doing to accomplish this?

Seems to me that, while you're writing each 10 seconds, you could close the file in between. And cleanup all related file-writing things. Perhaps that would solved your problem.
Secondly, I'd suggest creating the BinaryWriter outside the function where you actually write the data. It'll keep things clearer. In your current code you're checking each time wether to append data or to create a new file and the write to it. If you'll do this outside the function and call it just once perhaps this will save memory too. All untested by me, that is :)

C# Is opening and reading from a Stream slow?

I have 22k text (rtf) files which I must append to one final one.
The code looks something like this:
using (TextWriter mainWriter = new StreamWriter(mainFileName))
{
foreach (string currentFile in filesToAppend)
{
using (TextReader currentFileRader = new StreamReader(currentFile))
{
string fileContent = currentFileRader.ReadToEnd();
mainWriter.Write(fileContent);
}
}
}
Clearly, this opens 22k times a stream to read from the files.
My questions are :
1) in general, is opening a stream a slow operation? Is reading from a stream a slow operation ?
2) is there any difference if I read the file as byte[] and append it as byte[] than using the file text?
3) any better ideas to merge 22k files ?
Thanks.

1) in general, is opening a stream a slow operation?
No, not at all. Opening a stream is blazing fast, it's only a matter of reserving a handle from the underlying Operating System.
2) is there any difference if I read the file as byte[] and append it
as byte[] than using the file text?
Sure, it might be a bit faster, rather than converting the bytes into strings using some encoding, but the improvement would be negligible (especially if you are dealing with really huge files) compared to what I suggest you in the next point.
3) any ways to achieve this better ? ( merge 22k files )
Yes, don't load the contents of every single file in memory, just read it in chunks and spit it to the output stream:
using (var output = File.OpenWrite(mainFileName))
{
foreach (string currentFile in filesToAppend)
{
using (var input = File.OpenRead(currentFile))
{
input.CopyTo(output);
}
}
}
The Stream.CopyTo method from the BCL will take care of the heavy lifting in my example.

Probably the best way to speed this up is to make sure that the output file is on a different physical disk drive than the input files.
Also, you can get some increase in speed by creating the output file with a large buffer. For example:
using (var fs = new FileStream(filename, FileMode.Create, FileAccess.Write, FileShare.None, BufferSize))
{
using (var mainWriter = new StreamWriter(fs))
{
// do your file copies here
}
}
That said, your primary bottleneck will be opening the files. That's especially true if those 22,000 files are all in the same directory. NTFS has some problems with large directories. You're better off splitting that one large directory into, say, 22 directories with 1,000 files each. Opening a file from a directory that contains tens of thousands of files is much slower than opening a file in a directory that has only a few hundred files.

What's slow about reading data from a file is the fact that you aren't moving around electrons which can propagate a signal at speeds that are...really fast. To read information in files you have to actually spin these metal disks around and use magnets to read data off of them. These disks are spinning at far slower than electrons can propagate signals through wires. Regardless of what mechanism you use in code to tell these disks to spin around, you're still going to have to wait for them to go a spinin' and that's going to take time.
Whether you treat the data as bytes or text isn't particularly relevant no.

Reusing a filestream

In the past I've always used a FileStream object to write or rewrite an entire file after which I would immediately close the stream. However, now I'm working on a program in which I want to keep a FileStream open in order to allow the user to retain access to the file while they are working in between saves. ( See my previous question).
I'm using XmlSerializer to serialize my classes to a from and XML file. But now I'm keeping the FileStream open to be used to save (reserialized) my class instance later. Are there any special considerations I need to make if I'm reusing the same File Stream over and over again, versus using a new file stream? Do I need to reset the stream to the beginning between saves? If a later save is smaller in size than the previous save will the FileStream leave the remainder bytes from the old file, and thus create a corrupted file? Do I need to do something to clear the file so it will behave as if I'm writing an entirely new file each time?

Your suspicion is correct - if you reset the position of an open file stream and write content that's smaller than what's already in the file, it will leave trailing data and result in a corrupt file (depending on your definition of "corrupt", of course).
If you want to overwrite the file, you really should close the stream when you're finished with it and create a new stream when you're ready to re-save.
I notice from your linked question that you are holding the file open in order to prevent other users from writing to it at the same time. This probably wouldn't be my choice, but if you are going to do that, then I think you can "clear" the file by invoking stream.SetLength(0) between successive saves.

There are various ways to do this; if you are re-opening the file, perhaps set it to truncate:
using(var file = new FileStream(path, FileMode.Truncate)) {
// write
}
If you are overwriting the file while already open, then just trim it after writing:
file.SetLength(file.Position); // assumes we're at the new end
I would try to avoid delete/recreate, since this loses any ACLs etc.

Another option might be to use SetLength(0) to truncate the file before you start rewriting it.

Recently ran into the same requirement. In fact, previously, I used to create a new FileStream within a using statement and overwrite the previous file. Seems like the simple and effective thing to do.
using (var stream = new FileStream(path, FileMode.Create, FileAccess.Write)
{
ProtoBuf.Serializer.Serialize(stream , value);
}
However, I ran into locking issues where some other process is locking the target file. In my attempt to thwart this I retried the write several times before pushing the error up the stack.
int attempt = 0;
while (true)
{
try
{
using (var stream = new FileStream(path, FileMode.Create, FileAccess.Write)
{
ProtoBuf.Serializer.Serialize(stream , value);
}
break;
}
catch (IOException)
{
// could be locked by another process
// make up to X attempts to write the file
attempt++;
if (attempt >= X)
{
throw;
}
Thread.Sleep(100);
}
}
That seemed to work for almost everyone. Then that problem machine came along and forced me down the path of maintaining a lock on the file the entire time. So in lieu of retrying to write the file in the case it's already locked, I'm now making sure I get and hold the stream open so there are no locking issues with later writes.
int attempt = 0;
while (true)
{
try
{
_stream = new FileStream(path, FileMode.Open, FileAccess.ReadWrite, FileShare.Read);
break;
}
catch (IOException)
{
// could be locked by another process
// make up to X attempts to open the file
attempt++;
if (attempt >= X)
{
throw;
}
Thread.Sleep(100);
}
}
Now when I write the file the FileStream position must be reset to zero, as Aaronaught said. I opted to "clear" the file by calling _stream.SetLength(0). Seemed like the simplest choice. Then using our serializer of choice, Marc Gravell's protobuf-net, serialize the value to the stream.
_stream.SetLength(0);
ProtoBuf.Serializer.Serialize(_stream, value);
This works just fine most of the time and the file is completely written to the disk. However, on a few occasions I've observed the file not being immediately written to the disk. To ensure the stream is flushed and the file is completely written to disk I also needed to call _stream.Flush(true).
_stream.SetLength(0);
ProtoBuf.Serializer.Serialize(_stream, value);
_stream.Flush(true);

Based on your question I think you'd be better served closing/re-opening the underlying file. You don't seem to be doing anything other than writing the whole file. The value you can add by re-writing Open/Close/Flush/Seek will be next to 0. Concentrate on your business problem.

Power Loss after StreamWriter.Close() produces blank file, why?

Ok, so to explain; I am developing for a system that can suffer a power failure at any point in time, one point that I am testing is directly after I have written a file out using a StreamWriter. The code below:
// Write the updated file back out to the Shell directory.
using (StreamWriter shellConfigWriter =
new StreamWriter(#"D:\xxx\Shell\Config\Game.cfg.bak"))
{
for (int i = 0; i < configContents.Count; i++)
{
shellConfigWriter.WriteLine(configContents[i]);
}
shellConfigWriter.Close();
}
FileInfo gameCfgBackup = new FileInfo(#"D:\xxx\Shell\Config\Game.cfg.bak");
gameCfgBackup.CopyTo(#"D:\xxx\Shell\Config\Game.cfg", true);
Writes the contents of shellConfigWriter (a List of strings) out to a file used as a temporary store, then it is copied over the original. Now after this code has finished executing the power is lost, upon starting back up again the file Game.cfg exists and is the correct size, but is completely blank. At first I thought that this was due to Write-Caching being enabled on the hard drive, but even with it off it still occurs (albeit less often).
Any ideas would be very welcome!
Update: Ok, so after removing the .Close() statements and calling .Flush() after every write operation the files still end up blank. I could go one step further and create a backup of the original file first, before creating the new one, and then I have enough backups to do a integrity check, but I don't think it'll help to solve the underlying issue (that when I tell it to write to, flush and close a file... It doesn't!).

Keep the OS from buffering the output using the FileOptions parameter of the FileStream object's constructor:
using (Stream fs = new FileStream(#"D:\xxx\Shell\Config\Game.cfg.bak", FileMode.Create, FileAccess.Write, FileShare.None, 0x1000, FileOptions.WriteThrough))
using (StreamWriter shellConfigWriter = new StreamWriter(fs))
{
for (int i = 0; i < configContents.Count; i++)
{
shellConfigWriter.WriteLine(configContents[i]);
}
shellConfigWriter.Flush();
shellConfigWriter.BaseStream.Flush();
}

First of all, you don't have to call shellConfigWriter.Close() there. The using statement will take care of it. What you might want to do instead to guard against power failure is call shellConfigWriter.Flush().
Update
Something else you might want to consider is that if a power failure can really happen at any time, it could happen in the middle of a write, such that only some of the bytes make it to a file. There's really no way to stop that.
To protect against these scenarios, a common procedure is to use state/condition flag files. You use the existence or non-existence on the file system of a zero-byte file with a particular name to tell your program where to pick up again when it resumes. Then you don't create or destroy the files that trigger a particular state until you are sure you've reached that state and completed the previous.
The downside here is that it might mean throwing a lot of work away now and then. But the benefit is that it means the functional part of your code looks like normal: there's very little extra work to do to make the system sufficiently robust.

You want to set AutoFlush = true;

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.