Anybody know of a way to copy a file from path A to path B and suppressing the Windows file system cache?
Typical use is copying a large file from a USB drive, or server to your local machine. Windows seems to swap everything out if the file is really big, e.g. 2GiB.
Prefer example in C#, but I'm guessing this would be a Win32 call of some sort if possible.
In C# I have found something like this to work, this can be changed to copy directly to destination file:
public static byte[] ReadAllBytesUnbuffered(string filePath)
{
const FileOptions FileFlagNoBuffering = (FileOptions)0x20000000;
var fileInfo = new FileInfo(filePath);
long fileLength = fileInfo.Length;
int bufferSize = (int)Math.Min(fileLength, int.MaxValue / 2);
bufferSize += ((bufferSize + 1023) & ~1023) - bufferSize;
using (var stream = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.None,
bufferSize, FileFlagNoBuffering | FileOptions.SequentialScan))
{
long length = stream.Length;
if (length > 0x7fffffffL)
{
throw new IOException("File too long over 2GB");
}
int offset = 0;
int count = (int)length;
var buffer = new byte[count];
while (count > 0)
{
int bytesRead = stream.Read(buffer, offset, count);
if (bytesRead == 0)
{
throw new EndOfStreamException("Read beyond end of file EOF");
}
offset += bytesRead;
count -= bytesRead;
}
return buffer;
}
}
Even more important, there are FILE_FLAG_WRITE_THROUGH and FILE_FLAG_NO_BUFFERING.
MSDN has a nice article on them both: http://support.microsoft.com/kb/99794
I am not sure if this helps, but take a look at Increased Performance Using FILE_FLAG_SEQUENTIAL_SCAN.
SUMMARY
There is a flag for CreateFile()
called FILE_FLAG_SEQUENTIAL_SCAN which
will direct the Cache Manager to
access the file sequentially.
Anyone reading potentially large files
with sequential access can specify
this flag for increased performance.
This flag is useful if you are reading
files that are "mostly" sequential,
but you occasionally skip over small
ranges of bytes.
If you dont mind using a tool, ESEUTIL worked great for me.
You can check out this blog entry comparing Buffered and NonBuffered IO functions and from where to get ESEUTIL.
copying some text from the technet blog:
So looking at the definition of buffered I/O above, we can see where the perceived performance problems lie - in the file system cache overhead. Unbuffered I/O (or a raw file copy) is preferred when attempting to copy a large file from one location to another when we do not intend to access the source file after the copy is complete. This will avoid the file system cache overhead and prevent the file system cache from being effectively flushed by the large file data. Many applications accomplish this by calling CreateFile() to create an empty destination file, then using the ReadFile() and WriteFile() functions to transfer the data.
CreateFile() - The CreateFile function creates or opens a file, file stream, directory, physical disk, volume, console buffer, tape drive, communications resource, mailslot, or named pipe. The function returns a handle that can be used to access an object.
ReadFile() - The ReadFile function reads data from a file, and starts at the position that the file pointer indicates. You can use this function for both synchronous and asynchronous operations.
WriteFile() - The WriteFile function writes data to a file at the position specified by the file pointer. This function is designed for both synchronous and asynchronous operation.
For copying files around the network that are very large, my copy utility of choice is ESEUTIL which is one of the database utilities provided with Exchange.
Eseutil is a correct answer, also since Win7 / 2008 R2, you can use the /j switch in Xcopy, which has the same effect.
I understand this question was 11 years ago, nowadays there is robocopy which is kind of replacement for xcopy.
you need to check /J option
/J :: copy using unbuffered I/O (recommended for large files)
Related
Here's what I'm dealing with...
Some process (out of our control) will occasionally drop a zip file into a directory in Azure File Storage. That directory name is InBound. So let's say a file called bigbook.zip is dropped into the InBound folder.
I need to create an Azure Function App that runs every 5 minutes and looks for zip files in the InBound directory. If any exists, then one-by-one, we create a new directory by the same name as the zip file in another directory (called InProcess). So in our example, I would create InProcess/bigbook.
Now inside InProcess/bigbook, I need to unzip bigbook.zip. So by the time the process is done running InProcess/bigbook will contain all the contents of bigbook.zip.
Please note: This function I am creating is a Console App that will run as an Azure Function App. So there will be no file system access (at least, as far as I'm aware, anyway.) There is no option to download the zip file, unzip it, and then move the contents.
I am having a devil of a time figuring out how to do this in memory only. No matter what I try, I keep running into an Out Of Memory exception. For now, I am just doing this on my localhost running in debug in Visual Studio 2017, .NET 4.7. In that setting, I am not able to convert the test zip file, which is 515,069KB.
This was my first attempt:
private async Task<MemoryStream> GetMemoryStreamAsync(CloudFile inBoundfile)
{
MemoryStream memstream = new MemoryStream();
await inBoundfile.DownloadToStreamAsync(memstream).ConfigureAwait(false);
return memstream;
}
And this (with high hopes) was my second attempt, thinking that DownloadRangeToStream would work better than just DownloadToStream.
private MemoryStream GetMemoryStreamByRange(CloudFile inBoundfile)
{
MemoryStream outPutStream = new MemoryStream();
inBoundfile.FetchAttributes();
int bufferLength = 1 * 1024 * 1024;//1 MB chunk
long blobRemainingLength = inBoundfile.Properties.Length;
long offset = 0;
while (blobRemainingLength > 0)
{
long chunkLength = (long)Math.Min(bufferLength, blobRemainingLength);
using (var ms = new MemoryStream())
{
inBoundfile.DownloadRangeToStream(ms, offset, chunkLength);
lock (outPutStream)
{
outPutStream.Position = offset;
var bytes = ms.ToArray();
outPutStream.Write(bytes, 0, bytes.Length);
}
}
offset += chunkLength;
blobRemainingLength -= chunkLength;
}
return outPutStream;
}
But either way, I am running into memory issues. I presume it's because the MemoryStream I am trying to create gets too large?
How else can I tackle this? And again, downloading the zip file is not an option, as the app will ultimately be an Azure Function App. I'm also pretty sure that using a FileStream isn't an option either, as that requires a local file path, which I don't have. (I only have a remote Azure URL)
Could I somehow create a temp file in the same Azure Storage account that the zip file is in, and stream the zip file to that temp file instead of to a memory stream? (Thinking out loud.)
The goal is to get the stream into a ZipArchive using:
ZipArchive archive = new ZipArchive(stream)
And from there I can extract all the contents. But getting to that point w/o memory errors is proving a real bugger.
Any ideas?
Using Azure Storage File Share this is the only way it worked for me without loading the entire ZIP into Memory. I tested with a 3GB ZIP File (with thousands of files or with a big file inside) and Memory/CPU was low and stable. I hope it helps!
var zipFiles = _directory.ListFilesAndDirectories()
.OfType<CloudFile>()
.Where(x => x.Name.ToLower().Contains(".zip"))
.ToList();
foreach (var zipFile in zipFiles)
{
using (var zipArchive = new ZipArchive(zipFile.OpenRead()))
{
foreach (var entry in zipArchive.Entries)
{
if (entry.Length > 0)
{
CloudFile extractedFile = _directory.GetFileReference(entry.Name);
using (var entryStream = entry.Open())
{
byte[] buffer = new byte[16 * 1024];
using (var ms = extractedFile.OpenWrite(entry.Length))
{
int read;
while ((read = entryStream.Read(buffer, 0, buffer.Length)) > 0)
{
ms.Write(buffer, 0, read);
}
}
}
}
}
}
}
I would suggest you use memory snapshots to see why you are running out of memory within Visual Studio. You can use the tutorial in this article to find the culprit. Doing local development with a smaller file may help you continue to work if your machine is simply running out of memory.
When it comes to doing this within Azure, a node in the Consumption plan is limited to 1.5GB of total memory. If you expect to receive files larger than that then you should look at one of the other App Service plans that give you more memory to work with.
It is possible to store files within the function's local directory, so that is an option. You can't guaruntee that you will be using the same local directory between executions, but this should work as long as you are using the file you downloaded within the same execution.
Within a tool copying big files between disks, I replaced the
System.IO.FileInfo.CopyTo method by System.IO.Stream.CopyToAsync.
This allow a faster copy and a better control during the copy, e.g. I can stop the copy.
But this create even more fragmentation of the copied files. It is especially annoying when I copy file of many hundreds megabytes.
How can I avoid disk fragmentation during copy?
With the xcopy command, the /j switch copies files without buffering. And it is recommended for very large file in TechNet
It seems indeed to avoid file fragmentation (while a simple file copy within windows 10 explorer DOES fragment my file!)
A copy without buffering seems to be the opposite way than this async copy. Or it there any way to do async copy without buffering?
Here it my current code for aync copy. I let the default buffersize of 81920 bytes, i.e. 10*1024*size(int64).
I am working with NTFS file systems, thus 4096 bytes clusters.
EDIT: I updated the code with SetLength as suggested, added the FileOptions Async while creating the destinationStream and fix setting the attributes AFTER setting the time (otherwise, exception is thrown for ReadOnly files)
int bufferSize = 81920;
try
{
using (FileStream sourceStream = source.OpenRead())
{
// Remove existing file first
if (File.Exists(destinationFullPath))
File.Delete(destinationFullPath);
using (FileStream destinationStream = File.Create(destinationFullPath, bufferSize, FileOptions.Asynchronous))
{
try
{
destinationStream.SetLength(sourceStream.Length); // avoid file fragmentation!
await sourceStream.CopyToAsync(destinationStream, bufferSize, cancellationToken);
}
catch (OperationCanceledException)
{
operationCanceled = true;
}
} // properly disposed after the catch
}
}
catch (IOException e)
{
actionOnException(e, "error copying " + source.FullName);
}
if (operationCanceled)
{
// Remove the partially written file
if (File.Exists(destinationFullPath))
File.Delete(destinationFullPath);
}
else
{
// Copy meta data (attributes and time) from source once the copy is finished
File.SetCreationTimeUtc(destinationFullPath, source.CreationTimeUtc);
File.SetLastWriteTimeUtc(destinationFullPath, source.LastWriteTimeUtc);
File.SetAttributes(destinationFullPath, source.Attributes); // after set time if ReadOnly!
}
I fear also that the File.SetAttributes and Time at the end on my code could increase file fragmentation.
Is there a proper way to create a 1:1 asynchronous file copy without any file fragmentation, i.e. asking the HDD that the file steam get only contiguous sectors?
Other topics regarding file fragmentation like How can I limit file fragmentation while working with .NET suggests incrementing the file size in larger chunks, but it does not seem to be a direct answer to my question.
but the SetLength method does the job
It does not do the job. It only updates the file size in the directory entry, it does not allocate any clusters. The easiest way to see this for yourself is by doing this on a very large file, say 100 gigabytes. Note how the call completes instantly. Only way it can be instant is when the file system does not also do the job of allocating and writing the clusters. Reading from the file is actually possible, even though the file contains no actual data, the file system simply returns binary zeros.
This will also mislead any utility that reports fragmentation. Since the file has no clusters, there can be no fragmentation. So it only looks like you solved your problem.
The only thing you can do to force the clusters to be allocated is to actually write to the file. It is in fact possible to allocate 100 gigabytes worth of clusters with a single write. You must use Seek() to position to Length-1, then write a single byte with Write(). This will take a while on a very large file, it is in effect no longer async.
The odds that it will reduce fragmentation are not great. You merely reduced the risk somewhat that the writes will be interleaved by writes from other processes. Somewhat, actual writing is done lazily by the file system cache. Core issue is that the volume was fragmented before you began writing, it will never be less fragmented after you're done.
Best thing to do is to just not fret about it. Defragging is automatic on Windows these days, has been since Vista. Maybe you want to play with the scheduling, maybe you want to ask more about it at superuser.com
I think, FileStream.SetLength is what you need.
Considering Hans Passant answer,
in my code above, an alternative to
destinationStream.SetLength(sourceStream.Length);
would be, if I understood it properly:
byte[] writeOneZero = {0};
destinationStream.Seek(sourceStream.Length - 1, SeekOrigin.Begin);
destinationStream.Write(writeOneZero, 0, 1);
destinationStream.Seek(0, SeekOrigin.Begin);
It seems indeed to consolidate the copy.
But a look at the source code of FileStream.SetLengthCore seems it does almost the same, seeking at the end but without writing one byte:
private void SetLengthCore(long value)
{
Contract.Assert(value >= 0, "value >= 0");
long origPos = _pos;
if (_exposedHandle)
VerifyOSHandlePosition();
if (_pos != value)
SeekCore(value, SeekOrigin.Begin);
if (!Win32Native.SetEndOfFile(_handle)) {
int hr = Marshal.GetLastWin32Error();
if (hr==__Error.ERROR_INVALID_PARAMETER)
throw new ArgumentOutOfRangeException("value", Environment.GetResourceString("ArgumentOutOfRange_FileLengthTooBig"));
__Error.WinIOError(hr, String.Empty);
}
// Return file pointer to where it was before setting length
if (origPos != value) {
if (origPos < value)
SeekCore(origPos, SeekOrigin.Begin);
else
SeekCore(0, SeekOrigin.End);
}
}
Anyway, not sure that theses method guarantee no fragmentation, but at least avoid it for most of the cases. Thus the auto defragment tool will finish the job at a low performance expense.
My initial code without this Seek calls created hundred of thousands of fragments for 1 GB file, slowing down my machine when the defragment tool went active.
Using C# (.NET 4.5) I want to copy a set of files to multiple locations (e.g. the contents of a folder to 2 USB drives attached to the computer).
Is there a more efficient way of doing that then just using foreach loops and File.Copy?
Working towards a (possible) solution.
My first thought was some kind of multi-threaded approach. After some reading and research I discovered that just blindly setting up some kind of parallel and/or async process is not a good idea when it comes to IO (as per Why is Parallel.ForEach much faster then AsParallel().ForAll() even though MSDN suggests otherwise?).
The bottleneck is the disk, especially if it's a traditional drive, as it can only read/write synchronously. That got me thinking, what if I read it once then output it in multiple locations? After all, in my USB drive scenario I'm dealing with multiple (output) disks.
I'm having trouble figuring out how to do that though. One I idea I saw (Copy same file from multiple threads to multiple destinations) was to just read all the bytes of each file into memory then loop through the destinations and write out the bytes to each location before moving onto the next file. It seems that's a bad idea if the files might be large. Some of the files I'll be copying will be videos and could be 1 GB (or more). I can't imagine it's a good idea to load a 1 GB file into memory just to copy it to another disk?
So, allowing flexibility for larger files, the closest I've gotten is below (based off How to copy one file to many locations simultaneously). The problem with this code is that I've still not got a single read and multi-write happening. It's currently multi-read and multi-write. Is there a way to further optimise this code? Could I read chunks into memory then write that chunk to each destination before moving onto the next chunk (like the idea above but chunked files instead of whole)?
files.ForEach(fileDetail =>
Parallel.ForEach(fileDetail.DestinationPaths, new ParallelOptions(),
destinationPath =>
{
using (var source = new FileStream(fileDetail.SourcePath, FileMode.Open, FileAccess.Read, FileShare.Read))
using (var destination = new FileStream(destinationPath, FileMode.Create))
{
var buffer = new byte[1024];
int read;
while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
{
destination.Write(buffer, 0, read);
}
}
}));
I thought I'd post my current solution for anyone else who comes across this question.
If anyone discovers a more efficient/quicker way to do this then please let me know!
My code seems to copy files a bit quicker than just running the copy synchronously but it's still not as fast as I'd like (nor as fast as I've seen some other programs do it). I should note that performance may vary depending on .NET version and your system (I'm using Win 10 with .NET 4.5.2 on a 13" MBP with 2.9GHz i5 (5287U - 2 core / 4 thread) + 16GB RAM). I've not even figured out the best combination of method (e.g. FileStream.Write, FileStream.WriteAsync, BinaryWriter.Write) and buffer size yet.
foreach (var fileDetail in files)
{
foreach (var destinationPath in fileDetail.DestinationPaths)
Directory.CreateDirectory(Path.GetDirectoryName(destinationPath));
// Set up progress
FileCopyEntryProgress progress = new FileCopyEntryProgress(fileDetail);
// Set up the source and outputs
using (var source = new FileStream(fileDetail.SourcePath, FileMode.Open, FileAccess.Read, FileShare.Read, bufferSize, FileOptions.SequentialScan))
using (var outputs = new CompositeDisposable(fileDetail.DestinationPaths.Select(p => new FileStream(p, FileMode.Create, FileAccess.Write, FileShare.None, bufferSize))))
{
// Set up the copy operation
var buffer = new byte[bufferSize];
int read;
// Read the file
while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
{
// Copy to each drive
await Task.WhenAll(outputs.Select(async destination => await ((FileStream)destination).WriteAsync(buffer, 0, read)));
// Report progress
if (onDriveCopyFile != null)
{
progress.BytesCopied = read;
progress.TotalBytesCopied += read;
onDriveCopyFile.Report(progress);
}
}
}
if (ct.IsCancellationRequested)
break;
}
I'm using CompositeDisposable from Reactive Extensions (https://github.com/Reactive-Extensions/Rx.NET).
IO operations in general should be considered as asynchronous as there is some hardware operations which are run outside your code, so you can try to introduce some async/await constructs for read/write operations, so you can continue the execution during hardware operations.
while ((read = await source.ReadAsync(buffer, 0, buffer.Length)) > 0)
{
await destination.WriteAsync(buffer, 0, read);
}
You also must to mark your lambda delegate as async to make this work:
async destinationPath =>
...
And you should await the resulting tasks all the way. You may find more information here :
Parallel foreach with asynchronous lambda
Nesting await in Parallel.ForEach
In my project user can upload file up to 1GB. I want to copy that uploaded file stream data to second stream.
If I use like this
int i;
while ( ( i = fuVideo.FileContent.ReadByte() ) != -1 )
{
strm.WriteByte((byte)i);
}
then it is taking so much time.
If i try to do this by byte array then I need to add array size in long which is not valid.
If someone has better idea to do this then please let me know.
--
Hi Khepri thanks for your response. I tried Stream.Copy but it is taking so much time to copy one stream object to second.
I tried with 8.02Mb file and it took 3 to 4 minutes.
The code i have added is
Stream fs = fuVideo.FileContent; //fileInf.OpenRead();
Stream strm = ftp.GetRequestStream();
fs.CopyTo(strm);
If i am doing something wrong then please let me know.
Is this .NET 4.0?
If so Stream.CopyTo is probably your best bet.
If not, and to give credit where credit is due, see the answer in this SO thread. If you're not .NET 4.0 make sure to read the comments in that thread as there are some alternative solutions (Async stream reading/writing) that may be worth investigating if performance is at an absolute premium which may be your case.
EDIT:
Based off the update, are you trying to copy the file to another remote destination? (Just guessing based on GetRequestStream() [GetRequestStream()]. The time is going to be the actual transfer of the file content to the destination. So in this case when you do fs.CopyTo(strm) it has to move those bytes from the source stream to the remote server. That's where the time is coming from. You're literally doing a file upload of a huge file. CopyTo will block your processing until it completes.
I'd recommend looking at spinning this kind of processing off to another task or at the least look at the asynchronous option I listed. You can't really avoid this taking a large period of time. You're constrained by file size and available upload bandwidth.
I verified that when working locally CopyTo is sub-second. I tested with a half gig file and a quick Stopwatch class returned a processing time of 800 millisecondss.
If you are not .NET 4.0 use this
static void CopyTo(Stream fromStream, Stream destination, int bufferSize)
{
int num;
byte[] buffer = new byte[bufferSize];
while ((num = fromStream.Read(buffer, 0, buffer.Length)) != 0)
{
destination.Write(buffer, 0, num);
}
}
I am trying to join a number of binary files that were split during download. The requirement stemmed from the project http://asproxy.sourceforge.net/. In this project author allows you to download files by providing a url.
The problem comes through where my server does not have enough memory to keep a file that is larger than 20 meg in memory.So to solve this problem i modified the code to not download files larger than 10 meg's , if the file is larger it would then allow the user to download the first 10 megs. The user must then continue the download and hopefully get the second 10 megs. Now i have got all this working , except when the user needs to join the files they downloaded i end up with corrupt files , as far as i can tell something is either being added or removed via the download.
I am currently join the files together by reading all the files then writing them to one file.This should work since i am reading and writing in bytes. The code i used to join the files is listed here http://www.geekpedia.com/tutorial201_Splitting-and-joining-files-using-C.html
I do not have the exact code with me atm , as soon as i am home i will post the exact code if anyone is willing to help out.
Please let me know if i am missing out anything or if there is a better way to do this , i.e what could i use as an alternative to a memory stream. The source code for the original project which i made changes to can be found here http://asproxy.sourceforge.net/download.html , it should be noted i am using version 5.0. The file i modified is called WebDataCore.cs and i modified line 606 to only too till 10 megs of data had been loaded the continue execution.
Let me know if there is anything i missed.
Thanks
You shouldn't split for memory reasons... the reason to split is usually to avoid having to re-download everything in case of failure. If memory is an issue, you are doing it wrong... you shouldn't be buffering in memory, for example.
The easiest way to download a file is simply:
using(WebClient client = new WebClient()) {
client.DownloadFile(remoteUrl, localPath);
}
Re your split/join code - again, the problem is that you are buffering everything in memory; File.ReadAllBytes is a bad thing unless you know you have small files. What you should have is something like:
byte[] buffer = new byte[8192]; // why not...
int read;
while((read = inStream.Read(buffer, 0, buffer.Length)) > 0)
{
outStream.Write(buffer, 0, read);
}
This uses a moderate buffer to pump data between the two as a stream. A lot more efficient. The loop says:
try to read some data (at most, the buffer-size)
(this will read at least 1 byte, or we have reached the end of the stream)
if we read something, write this many bytes from the buffer to the output
In the end i have found that by using a FTP request i was able to get arround the memory issue and the file is saved correctly.
Thanks for all the help
That example is loading each entire chunk into memory, instead you could do something like this:
int bufSize = 1024 * 32;
byte[] buffer = new byte[bufSize];
using (FileStream outputFile = new FileStream(OutputFileName, FileMode.OpenOrCreate,
FileAccess.Write, FileShare.None, bufSize))
{
foreach (string inputFileName in inputFiles)
{
using (FileStream inputFile = new FileStream(inputFileName, FileMode.Append,
FileAccess.Write, FileShare.None, buffer.Length))
{
int bytesRead = 0;
while ((bytesRead = inputFile.Read(buffer, 0, buffer.Length)) != 0)
{
outputFile.Write(buffer, 0, bytesRead);
}
}