C# I/O async (copyAsync): how to avoid file fragmentation?

Within a tool that copies big files between disks, I replaced the System.IO.FileInfo.CopyTo method with System.IO.Stream.CopyToAsync.
This allows a faster copy and better control during the copy, e.g. I can stop the copy.
But it creates even more fragmentation of the copied files, which is especially annoying when I copy files of many hundreds of megabytes.
How can I avoid disk fragmentation during the copy?
With the xcopy command, the /j switch copies files without buffering, and it is recommended for very large files on TechNet.
It does indeed seem to avoid file fragmentation (while a simple file copy in the Windows 10 Explorer DOES fragment my file!)
A copy without buffering seems to be the opposite of this async copy. Is there any way to do an async copy without buffering?
Here is my current code for the async copy. I left the default buffer size of 81920 bytes, i.e. 10 * 1024 * sizeof(Int64).
I am working with NTFS file systems, thus 4096-byte clusters.
EDIT: I updated the code with SetLength as suggested, added FileOptions.Asynchronous when creating the destinationStream, and fixed setting the attributes AFTER setting the times (otherwise an exception is thrown for ReadOnly files):
int bufferSize = 81920;
try
{
    using (FileStream sourceStream = source.OpenRead())
    {
        // Remove existing file first
        if (File.Exists(destinationFullPath))
            File.Delete(destinationFullPath);

        using (FileStream destinationStream = File.Create(destinationFullPath, bufferSize, FileOptions.Asynchronous))
        {
            try
            {
                destinationStream.SetLength(sourceStream.Length); // avoid file fragmentation!
                await sourceStream.CopyToAsync(destinationStream, bufferSize, cancellationToken);
            }
            catch (OperationCanceledException)
            {
                operationCanceled = true;
            }
        } // properly disposed after the catch
    }
}
catch (IOException e)
{
    actionOnException(e, "error copying " + source.FullName);
}

if (operationCanceled)
{
    // Remove the partially written file
    if (File.Exists(destinationFullPath))
        File.Delete(destinationFullPath);
}
else
{
    // Copy metadata (attributes and times) from the source once the copy is finished
    File.SetCreationTimeUtc(destinationFullPath, source.CreationTimeUtc);
    File.SetLastWriteTimeUtc(destinationFullPath, source.LastWriteTimeUtc);
    File.SetAttributes(destinationFullPath, source.Attributes); // after setting the times if ReadOnly!
}
I also fear that calling File.SetAttributes and setting the times at the end of my code could increase file fragmentation.
Is there a proper way to create a 1:1 asynchronous file copy without any file fragmentation, i.e. asking the HDD to give the file stream only contiguous sectors?
Other topics regarding file fragmentation, like How can I limit file fragmentation while working with .NET, suggest incrementing the file size in larger chunks, but that does not seem to be a direct answer to my question.

but the SetLength method does the job
It does not do the job. It only updates the file size in the directory entry; it does not allocate any clusters. The easiest way to see this for yourself is to do it on a very large file, say 100 gigabytes. Note how the call completes instantly. The only way it can be instant is if the file system does not also do the job of allocating and writing the clusters. Reading from the file is actually possible, even though the file contains no actual data; the file system simply returns binary zeros.
This will also mislead any utility that reports fragmentation. Since the file has no clusters, there can be no fragmentation. So it only looks like you solved your problem.
The only thing you can do to force the clusters to be allocated is to actually write to the file. It is in fact possible to allocate 100 gigabytes worth of clusters with a single write. You must use Seek() to position to Length-1, then write a single byte with Write(). This will take a while on a very large file; it is in effect no longer async.
The odds that it will reduce fragmentation are not great. You merely reduced the risk somewhat that the writes will be interleaved by writes from other processes. Only somewhat, since the actual writing is done lazily by the file system cache. The core issue is that the volume was fragmented before you began writing, and it will never be less fragmented after you're done.
The best thing to do is to just not fret about it. Defragging is automatic on Windows these days; it has been since Vista. Maybe you want to play with the scheduling, or maybe you want to ask more about it on superuser.com.

I think FileStream.SetLength is what you need.

Considering Hans Passant's answer, an alternative in my code above to
destinationStream.SetLength(sourceStream.Length);
would be, if I understood it properly:
byte[] writeOneZero = { 0 };
destinationStream.Seek(sourceStream.Length - 1, SeekOrigin.Begin);
destinationStream.Write(writeOneZero, 0, 1);
destinationStream.Seek(0, SeekOrigin.Begin);
It does indeed seem to consolidate the copy.
But a look at the source code of FileStream.SetLengthCore shows that it does almost the same thing, seeking to the end but without writing a single byte:
private void SetLengthCore(long value)
{
    Contract.Assert(value >= 0, "value >= 0");
    long origPos = _pos;

    if (_exposedHandle)
        VerifyOSHandlePosition();
    if (_pos != value)
        SeekCore(value, SeekOrigin.Begin);
    if (!Win32Native.SetEndOfFile(_handle)) {
        int hr = Marshal.GetLastWin32Error();
        if (hr == __Error.ERROR_INVALID_PARAMETER)
            throw new ArgumentOutOfRangeException("value", Environment.GetResourceString("ArgumentOutOfRange_FileLengthTooBig"));
        __Error.WinIOError(hr, String.Empty);
    }
    // Return file pointer to where it was before setting length
    if (origPos != value) {
        if (origPos < value)
            SeekCore(origPos, SeekOrigin.Begin);
        else
            SeekCore(0, SeekOrigin.End);
    }
}
Anyway, I am not sure that these methods guarantee no fragmentation, but at least they avoid it in most cases. The automatic defragmentation tool can then finish the job at a low performance cost.
My initial code, without these Seek calls, created hundreds of thousands of fragments for a 1 GB file, slowing down my machine when the defragmentation tool went active.

Related

C# System.OutOfMemoryException when Creating New Array

I am reading files into an array; here is the relevant code. A new DiskReader is created for each file, and path is determined using an OpenFileDialog.
class DiskReader{
    // from variables section:
    long MAX_STREAM_SIZE = 300 * 1024 * 1024; // 300 MB
    FileStream fs;
    public Byte[] fileData;
    ...

    // Get file size, check it is within the allowed size (MAX_STREAM_SIZE), start process including progress bar.
    using (fs = File.OpenRead(path))
    {
        if (fs.Length < MAX_STREAM_SIZE)
        {
            long NumBytes = (fs.Length < MAX_STREAM_SIZE ? fs.Length : MAX_STREAM_SIZE);
            updateValues[0] = (NumBytes / 1024 / 1024).ToString("#,###.0");
            result = LoadData(NumBytes);
        }
        else
        {
            // Need for something to handle big files
        }
        if (result)
        {
            mainForm.ShowProgress(true);
            bw.RunWorkerAsync();
        }
    }
    ...

    bool LoadData(long NumBytes)
    {
        try
        {
            fileData = new Byte[NumBytes];
            fs.Read(fileData, 0, fileData.Length);
            return true;
        }
        catch (Exception e)
        {
            return false;
        }
    }
The first time I run this, it works fine. The second time I run it, sometimes it works fine, but most times it throws a System.OutOfMemoryException at
[Edit:
"first time I run this" was a bad choice of words, I meant when I start the programme and open a file is fine, I get the problem when I try to open a different file without exiting the programme. When I open the second file, I am setting the DiskReader to a new instance which means the fileData array is also a new instance. I hope that makes it clearer.]
fileData = new Byte[NumBytes];
There is no obvious pattern to when it runs fine and when it throws an exception.
I don't think it's relevant, but although the maximum file size is set to 300 MB, the files I am using to test this are between 49 and 64 MB.
Any suggestions on what is going wrong here and how I can correct it?
If the exception is being thrown at that line only, then my guess is that you've got a problem somewhere else in your code, as the comments suggest. Reading the documentation of that exception here, I'd bet you call this function one too many times somewhere and simply go over the limit on object length in memory, since there don't seem to be any problem spots in the code that you posted.
The fs.Length property requires the whole stream to be evaluated, hence the file gets read anyway. Try doing something like:
byte[] result;
if (new FileInfo(path).Length < MAX_STREAM_SIZE)
{
    result = File.ReadAllBytes(path);
}
Also, depending on your needs, you might avoid using a byte array and read the data directly from the file stream. This should have a much lower memory footprint.
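For example, a minimal sketch of processing the file in fixed-size chunks straight from the FileStream (processChunk is a hypothetical callback standing in for whatever you do with the data):
using System;
using System.IO;

static void ReadInChunks(string path, Action<byte[], int> processChunk)
{
    // Reusable 80 KB buffer instead of one array the size of the whole file.
    byte[] buffer = new byte[81920];
    using (FileStream fs = File.OpenRead(path))
    {
        int bytesRead;
        while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Only the first bytesRead bytes of the buffer are valid here.
            processChunk(buffer, bytesRead);
        }
    }
}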
If I understand correctly what you want to do, I have this proposal: the best option is to allocate one static array of a defined MAX size at the beginning, then keep that array and only fill it with new data from another file. This way your memory should be absolutely fine. You just need to store the file size in a separate variable, because the array will always have the same MAX size.
This is a common approach in systems with automatic memory management: it makes the program faster when you allocate a constant amount of memory at the start and then never allocate anything during the computation, because the garbage collector is not run many times.
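A minimal sketch of that idea, reusing the names from the question (fileData is allocated once at MAX_STREAM_SIZE and refilled for each file; the returned count plays the role of the separate size variable):
using System.IO;

class DiskReader
{
    const int MAX_STREAM_SIZE = 300 * 1024 * 1024;                // 300 MB
    static readonly byte[] fileData = new byte[MAX_STREAM_SIZE];  // allocated once, reused

    // Fills the shared buffer and returns how many bytes of it are valid.
    public static int LoadData(string path)
    {
        using (FileStream fs = File.OpenRead(path))
        {
            if (fs.Length > MAX_STREAM_SIZE)
                throw new IOException("File larger than the preallocated buffer");

            int total = 0, read;
            while ((read = fs.Read(fileData, total, (int)fs.Length - total)) > 0)
                total += read;
            return total;
        }
    }
}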

Writing to file, memory used steadily increasing

I have an application where I need to write binary to a file constantly. The bits of data are small, about 1K each. The computers this is running on aren't great and are running XP. I've run into the problem that when I turn on the logging the computers just get totally hosed and I watch the Task Manager and just see the memory usage going up and up until it crashes.
A coworker suggested that I just keep the packets in memory until a certain amount of time has passed and then write it all at once instead of writing each one separately - tried that, same issue.
This is the code (loggingBuffer is the List<byte[]> I'm storing the packets in while the interval passes):
if ((DateTime.Now - lastStoreTime).TotalSeconds > 10)
{
    string fileName = @"C:\Storage\file";
    FileMode fm = File.Exists(fileName) ? FileMode.Append : FileMode.Create;
    using (BinaryWriter w = new BinaryWriter(File.Open(fileName, fm), Encoding.ASCII))
    {
        foreach (byte[] packetData in loggingBuffer)
        {
            w.Write(packetData);
        }
    }
    loggingBuffer.Clear();
    lastStoreTime = DateTime.Now;
}
Is there anything different I should be doing to accomplish this?
It seems to me that, while you're writing every 10 seconds, you could close the file in between and clean up all related file-writing objects. Perhaps that would solve your problem.
Secondly, I'd suggest creating the BinaryWriter outside the function where you actually write the data. It'll keep things clearer. In your current code you check each time whether to append data or to create a new file and then write to it. If you do this outside the function and call it just once, perhaps this will save memory too. All untested by me, that is :)
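Untested as well, but a minimal sketch of that second suggestion: one long-lived BinaryWriter created once (the class and field names here are made up), with the 10-second interval only deciding when to Flush:
using System;
using System.IO;
using System.Text;

class PacketLogger : IDisposable
{
    readonly BinaryWriter writer;
    DateTime lastFlush = DateTime.Now;

    public PacketLogger(string fileName)
    {
        // Open (or create) the log file once, in append mode.
        writer = new BinaryWriter(
            new FileStream(fileName, FileMode.Append, FileAccess.Write, FileShare.Read),
            Encoding.ASCII);
    }

    public void Log(byte[] packetData)
    {
        writer.Write(packetData);
        if ((DateTime.Now - lastFlush).TotalSeconds > 10)
        {
            writer.Flush();              // hand buffered bytes to the OS
            lastFlush = DateTime.Now;
        }
    }

    public void Dispose()
    {
        writer.Dispose();                // flushes and closes the file
    }
}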

Crash safe on-the-fly compression with GZipStream

I'm compressing a log file as data is written to it, something like:
using (var fs = new FileStream("Test.gz", FileMode.Create, FileAccess.Write, FileShare.None))
{
    using (var compress = new GZipStream(fs, CompressionMode.Compress))
    {
        for (int i = 0; i < 1000000; i++)
        {
            // Clearly this isn't what is happening in production, just
            // a simple example
            byte[] message = RandomBytes();
            compress.Write(message, 0, message.Length);

            // Flush to disk (in production we will do this every x lines,
            // or x milliseconds, whichever comes first)
            if (i % 20 == 0)
            {
                compress.Flush();
            }
        }
    }
}
What I want to ensure is that if the process crashes or is killed, the archive is still valid and readable. I had hoped that anything since the last flush would be safe, but instead I am just ending up with a corrupt archive.
Is there any way to ensure I end up with a readable archive after each flush?
Note: it isn't essential that we use GZipStream, if something else will give us the desired result.
An option is to let Windows handle the compression. Just enable compression on the folder where you're storing your log files. There are some performance considerations to be aware of when copying the compressed files, and I don't know how well NT compression performs in comparison to GZipStream or other compression options. You'll probably want to compare compression ratios and CPU load.
There's also the option of opening a compressed file, if you don't want to enable compression on the entire folder. I haven't tried this, but you might want to look into it: http://social.msdn.microsoft.com/forums/en-US/netfxbcl/thread/1b63b4a4-b197-4286-8f3f-af2498e3afe5
Good news: GZip is a streaming format. Therefore corruption at the end of the stream cannot affect the beginning which was already written.
So even if your streaming writes are interrupted at an arbitrary point, most of the stream is still good. You can write yourself a little tool that reads from it and just stops at the first exception it sees.
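A minimal sketch of such a recovery tool (names are illustrative): it decompresses a possibly truncated .gz file and simply keeps whatever could be read before the first error.
using System.IO;
using System.IO.Compression;

static void RecoverGzip(string gzPath, string outputPath)
{
    using (FileStream input = File.OpenRead(gzPath))
    using (var gzip = new GZipStream(input, CompressionMode.Decompress))
    using (FileStream output = File.Create(outputPath))
    {
        byte[] buffer = new byte[81920];
        try
        {
            int read;
            while ((read = gzip.Read(buffer, 0, buffer.Length)) > 0)
                output.Write(buffer, 0, read);
        }
        catch (InvalidDataException)
        {
            // Truncated or corrupt tail: keep everything decoded so far.
            // (Depending on the framework version the truncation may surface
            // as a different exception type.)
        }
    }
}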
If you want an error-free solution, I'd recommend splitting the log into one file every x seconds (maybe x = 1 or 10?). Write into a file with the extension ".gz.tmp" and rename it to ".gz" after the file has been completely written and closed.
Yes, but it's more involved than just flushing. Take a look at gzlog.h and gzlog.c in the zlib distribution. It does exactly what you want, efficiently adding short log entries to a gzip file, and always leaving a valid gzip file behind. It also has protection against crashes or shutdowns during the process, still leaving a valid gzip file behind and not losing any log entries.
I recommend not using GZipStream. It is buggy and does not provide the necessary functionality. Use DotNetZip instead as your interface to zlib.

how to copy one Stream object values to second Stream Object in asp.net

In my project a user can upload a file of up to 1 GB. I want to copy that uploaded file's stream data to a second stream.
If I use something like this
int i;
while ((i = fuVideo.FileContent.ReadByte()) != -1)
{
    strm.WriteByte((byte)i);
}
then it takes a very long time.
If I try to do this with a byte array, I would need to pass the array size as a long, which is not valid.
If someone has a better idea of how to do this, please let me know.
--
Hi Khepri, thanks for your response. I tried Stream.CopyTo but it takes a very long time to copy one stream object to the second.
I tried with an 8.02 MB file and it took 3 to 4 minutes.
The code I have added is:
Stream fs = fuVideo.FileContent; //fileInf.OpenRead();
Stream strm = ftp.GetRequestStream();
fs.CopyTo(strm);
If I am doing something wrong, please let me know.
Is this .NET 4.0?
If so Stream.CopyTo is probably your best bet.
If not, and to give credit where credit is due, see the answer in this SO thread. If you're not on .NET 4.0, make sure to read the comments in that thread, as there are some alternative solutions (async stream reading/writing) that may be worth investigating if performance is at an absolute premium, which may be your case.
EDIT:
Based on the update, are you trying to copy the file to another remote destination? (Just guessing based on GetRequestStream().) The time is going to be the actual transfer of the file content to the destination. So in this case, when you do fs.CopyTo(strm), it has to move those bytes from the source stream to the remote server. That's where the time is coming from. You're literally doing a file upload of a huge file. CopyTo will block your processing until it completes.
I'd recommend looking at spinning this kind of processing off to another task, or at the least look at the asynchronous option I listed. You can't really avoid this taking a long period of time; you're constrained by the file size and the available upload bandwidth.
I verified that when working locally CopyTo is sub-second. I tested with a half-gig file and a quick Stopwatch class returned a processing time of 800 milliseconds.
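For the "spin it off to another task" suggestion, a minimal sketch (assuming .NET 4.5 or later, where CopyToAsync is available; fs and strm are the streams from the question):
using System.IO;
using System.Threading.Tasks;

static Task UploadAsync(Stream fs, Stream strm)
{
    // The upload still takes as long as the bandwidth allows, but the
    // calling thread is no longer blocked while the bytes are moved.
    return fs.CopyToAsync(strm);
}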
If you are not on .NET 4.0, use this:
static void CopyTo(Stream fromStream, Stream destination, int bufferSize)
{
    int num;
    byte[] buffer = new byte[bufferSize];
    while ((num = fromStream.Read(buffer, 0, buffer.Length)) != 0)
    {
        destination.Write(buffer, 0, num);
    }
}

Copy a file without using the windows file cache

Does anybody know of a way to copy a file from path A to path B while suppressing the Windows file system cache?
Typical use is copying a large file from a USB drive, or server to your local machine. Windows seems to swap everything out if the file is really big, e.g. 2GiB.
Prefer example in C#, but I'm guessing this would be a Win32 call of some sort if possible.
In C# I have found something like this to work; it can be changed to copy directly to a destination file:
public static byte[] ReadAllBytesUnbuffered(string filePath)
{
    const FileOptions FileFlagNoBuffering = (FileOptions)0x20000000;
    var fileInfo = new FileInfo(filePath);
    long fileLength = fileInfo.Length;
    int bufferSize = (int)Math.Min(fileLength, int.MaxValue / 2);
    bufferSize += ((bufferSize + 1023) & ~1023) - bufferSize;
    using (var stream = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.None,
                                       bufferSize, FileFlagNoBuffering | FileOptions.SequentialScan))
    {
        long length = stream.Length;
        if (length > 0x7fffffffL)
        {
            throw new IOException("File too long over 2GB");
        }
        int offset = 0;
        int count = (int)length;
        var buffer = new byte[count];
        while (count > 0)
        {
            int bytesRead = stream.Read(buffer, offset, count);
            if (bytesRead == 0)
            {
                throw new EndOfStreamException("Read beyond end of file EOF");
            }
            offset += bytesRead;
            count -= bytesRead;
        }
        return buffer;
    }
}
Even more important, there are FILE_FLAG_WRITE_THROUGH and FILE_FLAG_NO_BUFFERING.
MSDN has a nice article on them both: http://support.microsoft.com/kb/99794
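In C# there is no named FileOptions member for FILE_FLAG_NO_BUFFERING, but the raw value can be cast in, the same trick used in the answer above; write-through does have a named member. A minimal sketch for the destination side (path and sizes are illustrative):
using System.IO;

static FileStream OpenUnbufferedWriteThrough(string path)
{
    // FILE_FLAG_NO_BUFFERING has no named FileOptions member; cast the raw value.
    const FileOptions FileFlagNoBuffering = (FileOptions)0x20000000;

    // FileOptions.WriteThrough corresponds to FILE_FLAG_WRITE_THROUGH.
    return new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.None,
                          64 * 1024, FileOptions.WriteThrough | FileFlagNoBuffering);
    // Caveat: with NO_BUFFERING, buffers and I/O sizes must be aligned to the
    // volume's sector size, or the writes will fail.
}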
I am not sure if this helps, but take a look at Increased Performance Using FILE_FLAG_SEQUENTIAL_SCAN.
SUMMARY: There is a flag for CreateFile() called FILE_FLAG_SEQUENTIAL_SCAN which will direct the Cache Manager to access the file sequentially. Anyone reading potentially large files with sequential access can specify this flag for increased performance. This flag is useful if you are reading files that are "mostly" sequential, but you occasionally skip over small ranges of bytes.
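In .NET this hint is exposed as FileOptions.SequentialScan (it is already passed in the unbuffered-read sample above); a minimal usage sketch with an illustrative path:
using System.IO;

using (var fs = new FileStream(@"C:\temp\big.bin", FileMode.Open, FileAccess.Read,
                               FileShare.Read, 64 * 1024, FileOptions.SequentialScan))
{
    // Read the file front to back; the flag tells the cache manager the
    // access is sequential so it can read ahead aggressively.
}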
If you don't mind using a tool, ESEUTIL worked great for me.
You can check out this blog entry comparing Buffered and NonBuffered IO functions and from where to get ESEUTIL.
copying some text from the technet blog:
So looking at the definition of buffered I/O above, we can see where the perceived performance problems lie - in the file system cache overhead. Unbuffered I/O (or a raw file copy) is preferred when attempting to copy a large file from one location to another when we do not intend to access the source file after the copy is complete. This will avoid the file system cache overhead and prevent the file system cache from being effectively flushed by the large file data. Many applications accomplish this by calling CreateFile() to create an empty destination file, then using the ReadFile() and WriteFile() functions to transfer the data.
CreateFile() - The CreateFile function creates or opens a file, file stream, directory, physical disk, volume, console buffer, tape drive, communications resource, mailslot, or named pipe. The function returns a handle that can be used to access an object.
ReadFile() - The ReadFile function reads data from a file, and starts at the position that the file pointer indicates. You can use this function for both synchronous and asynchronous operations.
WriteFile() - The WriteFile function writes data to a file at the position specified by the file pointer. This function is designed for both synchronous and asynchronous operation.
For copying files around the network that are very large, my copy utility of choice is ESEUTIL which is one of the database utilities provided with Exchange.
Eseutil is a correct answer. Also, since Win7 / 2008 R2, you can use the /j switch in Xcopy, which has the same effect.
I understand this question was asked 11 years ago; nowadays there is robocopy, which is more or less a replacement for xcopy.
You need to check the /J option:
/J :: copy using unbuffered I/O (recommended for large files)
