Faster file move method other than File.Move - C#

I have a console application that is going to take about 625 days to complete, unless there is a way to make it faster.
First off, I am working in a directory that has around 4,000,000 files in it, if not more. I'm also working with a database that has a row for each file, and then some.
Working with the SQL is relatively fast; the bottleneck is File.Move(), where each move takes 18 seconds to complete.
Is there a faster way than File.Move()?
This is the bottleneck:
File.Move(Path.Combine(location, fileName), Path.Combine(rootDir, fileYear, fileMonth, fileName));
All of the other code runs pretty fast. All I need to do is move one file to a new location and then update the database location field.
I can show other code if needed, but really the above is the only current bottleneck.

It turns out switching from File.Move to setting up a FileInfo and using .MoveTo increased the speed significantly.
It will run in about 35 days now as opposed to 625 days.
FileInfo fileinfo = new FileInfo(Path.Combine(location, fileName));
fileinfo.MoveTo(Path.Combine(rootDir, fileYear, fileMonth, fileName));

18 seconds isn't really unusual. NTFS does not perform well when you have a lot of files in a single directory. When you ask for a file, it has to do a linear search of its directory data structure. With 1,000 files, that doesn't take too long. With 10,000 files you notice it. With 4 million files... yeah, it takes a while.
You can probably do this even faster if you pre-load all of the directory entries into memory. Then, rather than calling the FileInfo constructor for each file, you just look it up in your dictionary.
Something like:
var dirInfo = new DirectoryInfo(path);
// Get the list of all files in the directory.
var files = dirInfo.GetFileSystemInfos();
var cache = new Dictionary<string, FileSystemInfo>();
foreach (var f in files)
{
    cache.Add(f.FullName, f);
}
Now when you get a name from the database, you can just look it up in the dictionary. That might very well be faster than trying to get it from the disk each time.
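For instance, a sketch of the lookup-then-move (fileName, location, rootDir and so on are the variables from the question, and the database is assumed to hand back the same full paths used as dictionary keys):

string sourcePath = Path.Combine(location, fileName);
if (cache.TryGetValue(sourcePath, out FileSystemInfo info) && info is FileInfo fi)
{
    // Move using the cached FileInfo instead of touching the directory again.
    fi.MoveTo(Path.Combine(rootDir, fileYear, fileMonth, fileName));
}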

You can move the files in parallel, and Directory.EnumerateFiles also gives you a lazily loaded list of files (of course, I have not tested this with 4,000,000 files):
var numberOfConcurrentMoves = 2;
var moves = new List<Task>();
var sourceDirectory = "source-directory";
var destinationDirectory = "destination-directory";

foreach (var filePath in Directory.EnumerateFiles(sourceDirectory))
{
    var move = new Task(() =>
    {
        File.Move(filePath, Path.Combine(destinationDirectory, Path.GetFileName(filePath)));
        //UPDATE DB
    }, TaskCreationOptions.PreferFairness);

    move.Start();
    moves.Add(move);

    if (moves.Count >= numberOfConcurrentMoves)
    {
        Task.WaitAll(moves.ToArray());
        moves.Clear();
    }
}

Task.WaitAll(moves.ToArray());
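As an aside, a sketch of the same batching idea using Parallel.ForEach (my variation, not tested at this scale):

// MaxDegreeOfParallelism is a tuning knob, not a recommendation.
Parallel.ForEach(
    Directory.EnumerateFiles(sourceDirectory),
    new ParallelOptions { MaxDegreeOfParallelism = numberOfConcurrentMoves },
    filePath =>
    {
        File.Move(filePath, Path.Combine(destinationDirectory, Path.GetFileName(filePath)));
        //UPDATE DB
    });

Unlike the manual batches above, this keeps both workers busy continuously instead of waiting for the slower of each pair.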

Related

Check inside loop if *txt file has been created

My code is checking inside a loop whether a *.txt file has been created.
If the file has not been created after x time, I throw an exception.
Here is my code:
var AnswerFile = @"C:\myFile.txt";
for (int i = 0; i <= 30; i++)
{
    if (File.Exists(AnswerFile))
        break;
    await Task.Delay(100);
}

if (File.Exists(AnswerFile))
{
    // file was created in time
}
else
{
    // file never appeared
}
After the loop, I check whether the file has been created or not. The loop expires after 3 seconds (100 ms × 30 iterations).
My code works; I am just asking about its performance and quality. Is there any better approach? For example, should I use the FileInfo class instead, like this?
var fi1 = new FileInfo(AnswerFile);
if (fi1.Exists)
{
    // ...
}
Or should I use the FileSystemWatcher class?
You should perhaps use a FileSystemWatcher for this and decouple the process of creating the file from the process of reacting to its presence. If the file must be generated within a certain time because it has some expiry, you could make the expiry datetime part of the file name, so that if it appears after that time you know it's expired. A note of caution with the FileSystemWatcher: it can sometimes miss something (the fine manual says that events can be missed if large numbers are generated in a short time).
In the past I've used this for watching for files being uploaded via FTP. As soon as the notification of the file's creation appears, I put the file into a list and check it periodically to see if it is still growing. You can either watch for the FileSystemWatcher's LastWriteTime changes or directly compare the size of the file now vs. some time ago; in either approach it's probably easiest to use a dictionary to track each file and its previous size / most recent last-write event.
After a minute of no growth, I consider the file uploaded completely and I process it. It might be wise for you to implement a similar delay if you use a FileSystemWatcher and the files arrive by some slow generating method.
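A minimal sketch of the watcher approach (path, filter, and timeout are illustrative, and you'd still add the size-stability check described above for slowly arriving files):

using System;
using System.IO;
using System.Threading;

var created = new ManualResetEventSlim(false);
using (var watcher = new FileSystemWatcher(@"C:\", "myFile.txt"))
{
    watcher.Created += (s, e) => created.Set();
    watcher.EnableRaisingEvents = true;

    // Check up front too: the watcher only reports changes that happen
    // after it starts, so a file created earlier would be missed.
    if (File.Exists(@"C:\myFile.txt") || created.Wait(TimeSpan.FromSeconds(3)))
    {
        // File is present; optionally wait until its size stops growing.
    }
    else
    {
        // Timed out - throw or report as appropriate.
    }
}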
Why don't you retrieve a list of file names, then search in that list? You can use Directory.GetFiles to get the file list inside a directory, then search in it.
This would be more flexible for you, since you create the list once and reuse it across the application, instead of calling File.Exists for each file.
Example :
var path = @"C:\folder\"; // set the folder path, which contains all answer files
var ext = "*.txt";        // set the file extension

// Get the file name list (bare names) and make them all lowercase.
var files = Directory.GetFiles(path, ext)
    .Select(x => x.Substring(path.Length, (x.Length - path.Length) - ext.Length + 1).Trim().ToLower())
    .ToList();

// Search for this file name
var search = "myFile";

// Check
if (files.Contains(search.ToLower()))
{
    Console.WriteLine($"File : {search} already exists.");
}
else
{
    Console.WriteLine($"File : {search} was not found.");
}

SharpCompress & LZMA2 7z archive - very slow extraction of specific file. Why? Alternatives?

I have a 7zip archive created with LZMA2 compression (compression level: ultra).
The archive contains 1,749 files, which together originally had a size of 661 MB.
The zipped file is 39 MB in size.
Now I'm trying to use C# to extract a single tiny (~200 KB) file from this archive.
I'm getting the corresponding IArchiveEntry from the IArchive (which works relatively fast),
but then calling IArchiveEntry.WriteToFile(targetPath) takes around 33 seconds! And it takes similarly long if I write to a memory stream instead. (Edit: when I run this on a 7z LZMA2 archive with compression level = normal, it still takes 9 seconds.)
When I open the same archive in the actual 7zip application and extract the same file from there, it takes only around 2-3 seconds.
I suspected it's some sort of multicore (7zip) vs. single-core (SharpCompress, probably?) thing, but I don't notice any CPU usage spike during decompression with 7zip... maybe it's too fast to be noticeable, though.
Does anyone know what could cause such slow speeds with SharpCompress? Am I maybe missing some setting, or using the wrong factory (ArchiveFactory)?
If not: is there any C# library out there that might be significantly faster at decompressing this?
For reference, here's a sketch of how I'm using SharpCompress to extract:
private void Extract()
{
    using (var archive = GetArchive())
    {
        var entryPath = /* ... path to entry ... */;
        var entry = TryGetEntry(archive, entryPath);
        entry.WriteToFile(some_target_path);
    }
}

private IArchive GetArchive()
{
    string path = /* ... path to my .7z file ... */;
    return ArchiveFactory.Open(path);
}

private IArchiveEntry TryGetEntry(IArchive archive, string path)
{
    path = path.Replace("\\", "/");
    foreach (var entry in archive.Entries)
    {
        if (!entry.IsDirectory && entry.Key == path)
            return entry;
    }
    return null;
}
Update: as a temporary solution, I'm now including the 7zr.exe from the 7-Zip SDK in my application and running it in a new process to extract a single file, reading the process's output into a binary stream.
This runs in around ~3 seconds, compared to the ~33 seconds with SharpCompress. It works for now, but it's kind of ugly, so I'm still curious why SharpCompress seems to be so slow here.
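For reference, a sketch of that workaround: 7-Zip's 'e' command with the -so switch streams the extracted entry to stdout (archivePath, entryPath, and targetPath are illustrative):

using System.Diagnostics;
using System.IO;

string archivePath = "archive.7z";     // illustrative
string entryPath = "folder/file.bin";  // entry key inside the archive
string targetPath = "file.bin";

var psi = new ProcessStartInfo
{
    FileName = "7zr.exe",
    Arguments = $"e \"{archivePath}\" -so \"{entryPath}\"",
    RedirectStandardOutput = true,
    UseShellExecute = false,
    CreateNoWindow = true
};
using (var process = Process.Start(psi))
using (var output = File.Create(targetPath))
{
    // Copy the decompressed bytes straight from the child process's stdout.
    process.StandardOutput.BaseStream.CopyTo(output);
    process.WaitForExit();
}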
This line is the problem:
foreach (var entry in archive.Entries)
The problem is described here (i.e., if there are 100 files, it decompresses the 1st file 100 times, the 2nd file 99 times, and so on).
You need to use a reader (forward-only). See the API.
But the sample code there doesn't support 7z.
For 7z you can use archive.ExtractAllEntries(), e.g.:
var reader = archive.ExtractAllEntries();
while (reader.MoveToNextEntry())
{
    if (!reader.Entry.IsDirectory)
        reader.WriteEntryToDirectory(extractDir,
            new ExtractionOptions() { ExtractFullPath = false, Overwrite = true });
}
It will be much faster.
If you need all the files you could also do:
using var reader = archive.ExtractAllEntries();
reader.WriteAllToDirectory(targetPath, new ExtractionOptions() { ExtractFullPath = true, Overwrite = true });

Add Files Into Existing Zip - performance issue

I have a WCF web service that saves files to a folder (about 200,000 small files).
After that, I need to move them to another server.
The solution I found was to zip them and then move them.
When I adopted this solution, I tested it with 20,000 files; zipping them took only about 2 minutes, and moving the zip is really fast.
But in production, zipping 200,000 files takes more than 2 hours.
Here is my code to zip the folder :
using (ZipFile zipFile = new ZipFile())
{
    zipFile.UseZip64WhenSaving = Zip64Option.Always;
    zipFile.CompressionLevel = CompressionLevel.None;
    zipFile.AddDirectory(this.SourceDirectory.FullName, string.Empty);
    zipFile.Save(DestinationCurrentFileInfo.FullName);
}
I want to modify the WCF webservice, so that instead of saving to a folder, it saves to the zip.
I use the following code to test:
var listAes = Directory.EnumerateFiles(myFolder, "*.*", SearchOption.AllDirectories)
    .Where(s => s.EndsWith(".aes"))
    .Select(f => new FileInfo(f));

foreach (var additionFile in listAes)
{
    using (var zip = ZipFile.Read(nameOfExistingZip))
    {
        zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
        zip.AddFile(additionFile.FullName);
        zip.Save();
    }
    file.WriteLine("Delay for adding a file : " + sw.Elapsed.TotalMilliseconds);
    sw.Restart();
}
The first file added to the zip takes only 5 ms, but the 10,000th file takes 800 ms.
Is there a way to optimize this ? Or if you have other suggestions ?
EDIT
The example shown above is only for testing; in the WCF web service, I'll have different requests sending files that I need to add to the zip file.
As WCF is stateless, I will have a new instance of my class with each call, so how can I keep the zip file open to add more files?
I've looked at your code and immediately spotted problems. The problem with a lot of software developers nowadays is that they don't understand how stuff works, which makes it impossible to reason about it. In this particular case you don't seem to know how ZIP files work; therefore I would suggest you first read up on how they work and attempt to break down what happens under the hood.
Reasoning
Now that we're all on the same page on how ZIP files work, let's start the reasoning by breaking down how this works, using your source code; we'll continue from there:
var listAes = Directory.EnumerateFiles(myFolder, "*.*", SearchOption.AllDirectories)
    .Where(s => s.EndsWith(".aes"))
    .Select(f => new FileInfo(f));

foreach (var additionFile in listAes)
{
    // (1)
    using (var zip = ZipFile.Read(nameOfExistingZip))
    {
        zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
        // (2)
        zip.AddFile(additionFile.FullName);
        // (3)
        zip.Save();
    }
    file.WriteLine("Delay for adding a file : " + sw.Elapsed.TotalMilliseconds);
    sw.Restart();
}
(1) opens the ZIP file. You're doing this for every file you attempt to add.
(2) adds a single file to the ZIP file.
(3) saves the complete ZIP file.
On my computer this takes about an hour.
Now, not all of the file format details are relevant. We're looking for the part that gets increasingly worse in your program.
Skimming over the file format specification, you'll notice that compression is based on Deflate, which doesn't require information about the other files being compressed. Moving on, we'll notice how the 'file table' is stored in the ZIP file:
There's a 'central directory' at the end of the archive which stores all the files in the ZIP file; it's basically stored as a list. Using this information, we can reason about the trivial way to update it when implementing steps (1)-(3) in this order:
Open the zip file, read the central directory
Append data for the (new) compressed file, store the pointer along with the filename in the new central directory.
Re-write the central directory.
Think about it for a moment: for file #1 you need 1 write operation; for file #2, you need to read (1 item), append (in memory) and write (2 items); for file #3, you need to read (2 items), append (in memory) and write (3 items), and so on. The central directory traffic adds up to 1 + 2 + ... + N = N(N+1)/2 entries, which is quadratic in the number of files, so your performance will go down the drain as you add more files. You've already observed this; now you know why.
A possible solution
One simple solution is to add all files at once (see 'The easiest solution is the most practical' below), but that might not work in your use case. Another solution is to implement a merge that basically merges 2 files together every time. This is more convenient if you don't have all files available when you start the compression process.
Basically the algorithm then becomes:
Add a few (say, 16) files. You can toy with this number. Store the result in, say, 'file16.zip'.
Add more files. When you hit 16 files again, you have to merge the two files of 16 items into a single file of 32 items.
Merge files until you cannot merge anymore. Basically, every time you have two files of N items, you create a new file of 2*N items.
Goto (2).
Again, we can reason about it. The first 16 files aren't a problem; we've already established that.
We can also reason about what will happen in our program. Because we're merging 2 files into 1 file, we don't have to do as many reads and writes. In fact, if you reason about it, you'll see that you have a file of 32 entries after 2 merges, 64 after 4 merges, 128 after 8 merges, 256 after 16 merges... hey, wait, we know this sequence: it's 2^N. Again, reasoning about it, we'll find that we need approximately 500 merges, which is much better than the 200,000 operations that we started with.
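A sketch of the merge step (DotNetZip, as in the question; MergeInto is a hypothetical helper and error handling is elided):

using System.IO;
using System.Linq;
using Ionic.Zip;

// Merge every entry of sourceZip into targetZip, then save once.
// With CompressionLevel.None this is mostly byte copying, not recompression.
static void MergeInto(string sourceZip, string targetZip)
{
    using (var source = ZipFile.Read(sourceZip))
    using (var target = ZipFile.Read(targetZip))
    {
        target.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
        foreach (var entry in source.Entries.Where(e => !e.IsDirectory))
        {
            using (var buffer = new MemoryStream())
            {
                entry.Extract(buffer); // read the entry's bytes
                target.AddEntry(entry.FileName, buffer.ToArray());
            }
        }
        target.Save(); // one central-directory rewrite per merge, not per file
    }
}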
Hacking in the ZIP file
Yet another solution that might come to mind is to overallocate the central directory, creating slack space for future entries. However, this probably requires you to hack into the ZIP code and create your own ZIP file writer. The idea is that you basically overallocate the central directory to 200K entries before you get started, so that you can simply append in place.
Again, we can reason about it: adding a file now means adding the file's data and updating some headers. It won't be as fast as the all-at-once solution because you'll need random disk IO, but it'll probably work fast enough.
I haven't worked this out, but it doesn't seem overly complicated to me.
The easiest solution is the most practical
What we haven't discussed so far is the easiest possible solution: simply add all the files at once, which we can again reason about.
Implementation is quite easy, because now we don't have to do any fancy things; we can simply use the ZIP handler (I use Ionic) as-is:
static void Main()
{
    try { File.Delete(@"c:\tmp\test.zip"); }
    catch { }

    var sw = Stopwatch.StartNew();

    using (var zip = new ZipFile(@"c:\tmp\test.zip"))
    {
        zip.UseZip64WhenSaving = Zip64Option.Always;
        for (int i = 0; i < 200000; ++i)
        {
            string filename = "foo" + i.ToString() + ".txt";
            byte[] contents = Encoding.UTF8.GetBytes("Hello world!");
            zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
            zip.AddEntry(filename, contents);
        }
        zip.Save();
    }

    Console.WriteLine("Elapsed: {0:0.00}s", sw.Elapsed.TotalSeconds);
    Console.ReadLine();
}
Whop; that finishes in 4.5 seconds. Much better.
I can see that you just want to group the 200,000 files into one big single file, without compression (like a tar archive).
Two ideas to explore:
Experiment with file formats other than ZIP, as ZIP may not be the fastest. Tar (tape archive) comes to mind, with natural speed advantages due to its simplicity; it even has an append mode, which is exactly what you are after to ensure O(1) operations. SharpCompress is a library that will allow you to work with this format (and others).
If you have control over your remote server, you could implement your own file format. The simplest I can think of would be to zip each new file separately (to store the file metadata such as name, date, etc. in the file content itself), and then append each such zipped file to a single raw-bytes file. You would just need to store the byte offsets (e.g., comma-separated in another txt file) to allow the remote server to split the huge file back into the 200,000 zipped files, and then unzip each of them to get the metadata. I guess this is also roughly what tar does behind the scenes :).
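A minimal sketch of that second idea (names and the 'offset,length,name' index format are illustrative):

using System.IO;

// Append one file's bytes to the big data file and record where they went,
// so the remote server can split the blob back into individual files.
static void AppendBlob(string dataFile, string indexFile, string sourceFile)
{
    using (var data = new FileStream(dataFile, FileMode.Append, FileAccess.Write))
    {
        long offset = data.Position; // where this blob starts
        byte[] bytes = File.ReadAllBytes(sourceFile);
        data.Write(bytes, 0, bytes.Length);
        File.AppendAllText(indexFile,
            offset + "," + bytes.Length + "," + Path.GetFileName(sourceFile) + "\n");
    }
}

Each call is a single append, so the cost per file stays constant no matter how many files precede it.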
Have you tried zipping to a MemoryStream rather than to a file, flushing to a file only when you are done for the day? Of course, for back-up purposes your WCF service would have to keep a copy of the received individual files until you are sure they have been "committed" to the remote server.
If you do need compression, 7-Zip (and fiddling with the options) is well worth a try.
You are opening the file repeatedly; why not loop through and add them all to one zip, then save it?
var listAes = Directory.EnumerateFiles(myFolder, "*.*", SearchOption.AllDirectories)
.Where(s => s.EndsWith(".aes"))
.Select(f => new FileInfo(f));
using (var zip = ZipFile.Read(nameOfExistingZip))
{
    zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
    foreach (var additionFile in listAes)
    {
        zip.AddFile(additionFile.FullName);
    }
    zip.Save();
}
If the files aren't all available right away, you could at least batch them together: if you're expecting 200k files but have only received 10 so far, don't open the zip, add one, and close it; wait for a few more to come in and add them in batches.
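A sketch of that batching idea (batchSize, pendingFiles, and OnFileReceived are illustrative; nameOfExistingZip is from the question):

var pendingFiles = new List<string>();
const int batchSize = 64; // tune this

void OnFileReceived(string path)
{
    pendingFiles.Add(path);
    if (pendingFiles.Count < batchSize)
        return;

    // One open/save per batch instead of per file.
    using (var zip = ZipFile.Read(nameOfExistingZip))
    {
        zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
        foreach (var f in pendingFiles)
            zip.AddFile(f);
        zip.Save();
    }
    pendingFiles.Clear();
}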
If you are OK with the performance of 100 × 20,000 files, can't you simply partition your large ZIP into 100 "small" ZIP files? For simplicity, create a new ZIP file every minute and put a timestamp in the name.
You can zip all the files using the .NET TPL (Task Parallel Library) like this:
// Note: this is an excerpt; CompressStreamP and the variables used here
// are defined in the article linked below.
while (0 != (read = sourceStream.Read(bufferRead, 0, sliceBytes)))
{
    tasks[taskCounter] = Task.Factory.StartNew(() =>
        CompressStreamP(bufferRead, read, taskCounter, ref listOfMemStream, eventSignal)); // Line 1
    eventSignal.WaitOne(-1);           // Line 2
    taskCounter++;                     // Line 3
    bufferRead = new byte[sliceBytes]; // Line 4
}
Task.WaitAll(tasks); // Line 6
There is a compiled library and source code here:
http://www.codeproject.com/Articles/49264/Parallel-fast-compression-unleashing-the-power-of

Get attributes of all files under a directory while accessing the directory only

I'm trying to write a function in C# that gets a directory path as a parameter and returns a dictionary where the keys are the files directly under that directory and the values are their last modification times.
This is easy to do with Directory.GetFiles() and then File.GetLastWriteTime(). However, this means that every file must be accessed, which is too slow for my needs.
Is there a way to do this while accessing just the directory? Does the file system even support this kind of requirement?
Edit, after reading some answers:
Thank you guys, you are all saying pretty much the same thing - use a FileInfo object. Still, it is just as slow to use Directory.GetFiles() (or Directory.EnumerateFiles()) to get those objects, and I suspect that getting them requires access to every file. If the file system keeps the last modification time of its files in the files themselves only, there can't be a way to extract that info without file access. Is this the case here? Do GetFiles() and EnumerateFiles() of DirectoryInfo access every file, or do they get their info from the directory entry? I know that if I wanted just the file names, I could get them with the Directory class without accessing every file. But getting attributes seems trickier...
Edit, following Henk's response:
It seems that it really is faster to use the FileInfo object. I created the following test:
static void Main(string[] args)
{
    Console.WriteLine(DateTime.Now);

    foreach (string file in Directory.GetFiles(@"\\169.254.78.161\dir"))
    {
        DateTime x = File.GetLastWriteTime(file);
    }

    Console.WriteLine(DateTime.Now);

    DirectoryInfo dirInfo2 = new DirectoryInfo(@"\\169.254.78.161\dir");
    var files2 = from f in dirInfo2.EnumerateFiles()
                 select f;
    foreach (FileInfo file in files2)
    {
        DateTime x = file.LastWriteTime;
    }

    Console.WriteLine(DateTime.Now);
}
For about 800 files, I usually get something like:
31/08/2011 17:14:48
31/08/2011 17:14:51
31/08/2011 17:14:52
I didn't do any timings, but your best bet is:
DirectoryInfo di = new DirectoryInfo(myPath);
FileInfo[] files = di.GetFiles();
I think all the FileInfo attributes are available in the directory's file records, so this should (could) require the minimum amount of I/O.
The only other thing I can think of is using the FileInfo class. As far as I can see, this might help you, or it might read the file as well (read permissions are required).

C# Directory listing massive directory

Here is the scenario:
I have a directory with 2+ million files. The code I have below writes out all the files in about 90 minutes. Does anybody have a way to speed it up or make this code more efficient? I'd also like to write out only the file names in the listing.
string lines = (listBox1.Items.ToString());
string sourcefolder1 = textBox1.Text;
string destinationfolder = (@"C:\anfiles");

using (StreamWriter output = new StreamWriter(destinationfolder + "\\" + "MasterANN.txt"))
{
    string[] files = Directory.GetFiles(textBox1.Text, "*.txt");
    foreach (string file in files)
    {
        FileInfo file_info = new FileInfo(file);
        output.WriteLine(file_info.Name);
    }
}
The slowdown is that it writes out one line at a time.
It takes about 13-15 minutes to get all the files it needs to write out.
The following 75 minutes are spent creating the file.
It could help if you don't create a FileInfo instance for every file; use Path.GetFileName instead:
string lines = (listBox1.Items.ToString());
string sourcefolder1 = textBox1.Text;
string destinationfolder = (@"C:\anfiles");

using (StreamWriter output = new StreamWriter(Path.Combine(destinationfolder, "MasterANN.txt")))
{
    string[] files = Directory.GetFiles(textBox1.Text, "*.txt");
    foreach (string file in files)
    {
        output.WriteLine(Path.GetFileName(file));
    }
}
You're reading 2+ million file names into memory. Depending on how much memory you have, you may well be swapping. Try breaking it up into smaller chunks by filtering on the file name.
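For example, a sketch that slices the listing by file-name prefix (reusing output and sourcefolder1 from the question, and assuming names are spread reasonably evenly across these characters):

// No single GetFiles call materializes millions of entries at once.
foreach (char prefix in "abcdefghijklmnopqrstuvwxyz0123456789")
{
    foreach (string file in Directory.GetFiles(sourcefolder1, prefix + "*.txt"))
    {
        output.WriteLine(Path.GetFileName(file));
    }
}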
The first thing I would need to know is: where's the slowdown? Is it taking 89 minutes for Directory.GetFiles() to execute, or is the delay spread out over the calls to FileInfo file_info = new FileInfo(file)?
If the delay is from the latter, you can probably speed things up by getting the file name from the path instead of creating a FileInfo instance to get the filename.
System.IO.Path.GetFileName(file);
From my experience, it's Directory.GetFiles that's slowing you down (aside from console output). To overcome this, P/Invoke FindFirstFile/FindNextFile to avoid all the memory consumption and general lagginess.
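A minimal sketch of that approach, assuming the usual P/Invoke declarations for kernel32's find APIs (error handling elided):

using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

static class NativeDirectoryLister
{
    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
    struct WIN32_FIND_DATA
    {
        public uint dwFileAttributes;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
        public uint nFileSizeHigh;
        public uint nFileSizeLow;
        public uint dwReserved0;
        public uint dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
        public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
        public string cAlternateFileName;
    }

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode)]
    static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA data);

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode)]
    static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA data);

    [DllImport("kernel32.dll")]
    static extern bool FindClose(IntPtr hFindFile);

    static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);

    // Yields bare file names one at a time; pattern is e.g. @"C:\huge\*.txt".
    public static IEnumerable<string> EnumerateNames(string pattern)
    {
        IntPtr handle = FindFirstFile(pattern, out WIN32_FIND_DATA data);
        if (handle == INVALID_HANDLE_VALUE)
            yield break;
        try
        {
            do
            {
                yield return data.cFileName;
            }
            while (FindNextFile(handle, out data));
        }
        finally
        {
            FindClose(handle);
        }
    }
}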
Directory.EnumerateFiles does not need to load all the file names into memory first. Check this out: C# directory.getfiles memory help
In your case, the code could be:
using (StreamWriter output = new StreamWriter(destinationfolder + "\\" + "MasterANN.txt"))
{
    foreach (var file in Directory.EnumerateFiles(sourcefolder, "*.txt"))
    {
        output.WriteLine(Path.GetFileName(file));
    }
}
From this doc:
The EnumerateFiles and GetFiles methods differ as follows: When you use EnumerateFiles, you can start enumerating the collection of names before the whole collection is returned; when you use GetFiles, you must wait for the whole array of names to be returned before you can access the array. Therefore, when you are working with many files and directories, EnumerateFiles can be more efficient.
So if you have sufficient memory, Directory.GetFiles is OK. But Directory.EnumerateFiles is much better when a folder contains millions of files.
