C# directory listing of a massive directory

Here is the scenario:
I have a directory with 2+ million files. The code I have below writes out all the files in about 90 minutes. Does anybody have a way to speed it up or make this code more efficient? I'd also like to write out only the file names in the listing.
string lines = (listBox1.Items.ToString());
string sourcefolder1 = textBox1.Text;
string destinationfolder = (@"C:\anfiles");
using (StreamWriter output = new StreamWriter(destinationfolder + "\\" + "MasterANN.txt"))
{
    string[] files = Directory.GetFiles(textBox1.Text, "*.txt");
    foreach (string file in files)
    {
        FileInfo file_info = new FileInfo(file);
        output.WriteLine(file_info.Name);
    }
}
The slowdown is that it writes out one line at a time.
It takes about 13-15 minutes to get all the files it needs to write out.
The remaining 75 minutes are spent creating the file.

It could help if you don't create a FileInfo instance for every file; use Path.GetFileName instead:
string lines = (listBox1.Items.ToString());
string sourcefolder1 = textBox1.Text;
string destinationfolder = (@"C:\anfiles");
using (StreamWriter output = new StreamWriter(Path.Combine(destinationfolder, "MasterANN.txt")))
{
    string[] files = Directory.GetFiles(textBox1.Text, "*.txt");
    foreach (string file in files)
    {
        output.WriteLine(Path.GetFileName(file));
    }
}

You're reading 2+ million file paths into memory. Depending on how much memory you have, you may well be swapping. Try breaking the job up into smaller chunks by filtering on the file name.
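For example, a rough sketch of that chunking idea (the prefix set below is only an illustration; adjust it to match how your file names are actually distributed):
// Process the directory in chunks by first character of the file name (illustrative only).
using (StreamWriter output = new StreamWriter(Path.Combine(@"C:\anfiles", "MasterANN.txt")))
{
    foreach (char prefix in "abcdefghijklmnopqrstuvwxyz0123456789")
    {
        foreach (string file in Directory.GetFiles(textBox1.Text, prefix + "*.txt"))
        {
            output.WriteLine(Path.GetFileName(file));
        }
    }
}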

The first thing I would need to know is: where is the slowdown? Is it taking 89 minutes for Directory.GetFiles() to execute, or is the delay spread out over the calls to new FileInfo(file)?
If the delay is from the latter, you can probably speed things up by getting the file name from the path instead of creating a FileInfo instance to get it:
System.IO.Path.GetFileName(file);

From my experience, it's Directory.GetFiles that's slowing you down (aside from console output). To overcome this, P/Invoke into FindFirstFile/FindNextFile to avoid all the memory consumption and general lagginess.
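A rough sketch of that approach (untested, Windows-only; requires using System.Runtime.InteropServices; the constants and struct layout follow the Win32 documentation for FindFirstFile):
[StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
struct WIN32_FIND_DATA
{
    public uint dwFileAttributes;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
    public uint nFileSizeHigh;
    public uint nFileSizeLow;
    public uint dwReserved0;
    public uint dwReserved1;
    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
    public string cFileName;
    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
    public string cAlternateFileName;
}

[DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA lpFindFileData);

[DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);

[DllImport("kernel32.dll", SetLastError = true)]
static extern bool FindClose(IntPtr hFindFile);

// Streams file names to the writer without ever building a 2-million-element array.
static void WriteFileNames(string folder, StreamWriter output)
{
    IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);
    WIN32_FIND_DATA findData;
    IntPtr handle = FindFirstFile(Path.Combine(folder, "*.txt"), out findData);
    if (handle == INVALID_HANDLE_VALUE)
        return;
    try
    {
        do
        {
            // 0x10 is FILE_ATTRIBUTE_DIRECTORY: skip subdirectories, write only file names.
            if ((findData.dwFileAttributes & 0x10) == 0)
                output.WriteLine(findData.cFileName);
        }
        while (FindNextFile(handle, out findData));
    }
    finally
    {
        FindClose(handle);
    }
}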

Directory.EnumerateFiles does not need to load all the file names into memory first. Check this out: C# directory.getfiles memory help
In your case, the code could be:
using (StreamWriter output = new StreamWriter(destinationfolder + "\\" + "MasterANN.txt"))
{
    foreach (var file in Directory.EnumerateFiles(sourcefolder1, "*.txt"))
    {
        output.WriteLine(Path.GetFileName(file));
    }
}
From this doc:
The EnumerateFiles and GetFiles methods differ as follows: When you use EnumerateFiles, you can start enumerating the collection of names before the whole collection is returned; when you use GetFiles, you must wait for the whole array of names to be returned before you can access the array. Therefore, when you are working with many files and directories, EnumerateFiles can be more efficient.
So if you have sufficient memory, Directory.GetFiles is ok. But Directory.EnumerateFiles is much better when a folder contains millions of files.

Related

Out Of Memory Exception in Foreach

I am trying to create a function that will retrieve all the uploaded files (which are now saved as bytes in the database) and download them in a single zip file. I currently have 6000 files to download (and the number could grow).
The functionality is already working (from retrieval to download) if I limit the number of files being downloaded, otherwise, I get an OutOfMemoryException on the ForEach loop.
Here's pseudocode (the files variable is a list of byte arrays and file names):
var files = getAllFilesFromDb();
foreach (var file in files)
{
    var tempFilePath = Path.Combine(path, file.filename);
    using (FileStream stream = new FileStream(tempFilePath, FileMode.Create, FileAccess.ReadWrite))
    {
        stream.Write(file.fileData, 0, file.fileData.Length);
    }
}

private readonly IEntityRepository<File> fileRepository;

IEnumerable<FileModel> getAllFilesFromDb()
{
    return fileRepository.Select(f => new FileModel() { fileData = f.byteArray, filename = f.fileName });
}
My question is: is there another way to do this that avoids such errors?
To avoid this problem, you could avoid loading the contents of all the files in one go. Most likely you will need to split your database call into two calls:
1. Retrieve a list of all the files without their contents but with some identifier, like the PK of the table.
2. A method which retrieves the contents of an individual file.
Then your (pseudo)code becomes
get list of all files
for each file
get the file contents
write the file to disk
Another possibility is to alter the way your query works so that it uses deferred execution: it will not actually load all the files at once, but stream them one at a time from the database. However, without seeing more code from your repository implementation, I cannot/will not guess the right solution for you.
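A rough sketch of that two-call pattern (GetFileContents and the Id property are hypothetical placeholders; adapt them to whatever your repository actually exposes):
// Only one file's bytes are in memory at a time. "Id" and GetFileContents(id) are
// hypothetical placeholders for a PK column and a per-file database query.
var fileList = fileRepository.Select(f => new { f.Id, f.fileName }).ToList();   // names and PKs only, no contents
foreach (var item in fileList)
{
    byte[] fileData = GetFileContents(item.Id);                                  // loads a single file's bytes
    var tempFilePath = Path.Combine(path, item.fileName);
    using (FileStream stream = new FileStream(tempFilePath, FileMode.Create, FileAccess.ReadWrite))
    {
        stream.Write(fileData, 0, fileData.Length);
    }
}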

Out of memory exception while updating zip

I am getting an OutOfMemoryException while trying to add files to a .zip file. I am using a 32-bit architecture for building and running the application.
string[] filePaths = Directory.GetFiles(Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData) + "\\capture\\capture");
System.IO.Compression.ZipArchive zip = ZipFile.Open(filePaths1[c], ZipArchiveMode.Update);
foreach (String filePath in filePaths)
{
    string nm = Path.GetFileName(filePath);
    zip.CreateEntryFromFile(filePath, "capture/" + nm, CompressionLevel.Optimal);
}
zip.Dispose();
zip = null;
I am unable to understand the reason behind it.
The exact reason depends on a variety of factors, but most likely you are simply adding too much to the archive. Try using the ZipArchiveMode.Create option instead, which writes the archive directly to disk without caching it in memory.
If you are really trying to update an existing archive, you can still use ZipArchiveMode.Create. But it will require opening the existing archive, copying all of its contents to a new archive (using Create), and then adding the new content.
Without a good, minimal, complete code example, it would not be possible to say for sure where the exception is coming from, never mind how to fix it.
EDIT:
Here is what I mean by "…opening the existing archive, copying all of its contents to a new archive (using Create), and then adding the new content":
string[] filePaths = Directory.GetFiles(Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData) + "\\capture\\capture");
using (ZipArchive zipFrom = ZipFile.Open(filePaths1[c], ZipArchiveMode.Read))
using (ZipArchive zipTo = ZipFile.Open(filePaths1[c] + ".tmp", ZipArchiveMode.Create))
{
    foreach (ZipArchiveEntry entryFrom in zipFrom.Entries)
    {
        ZipArchiveEntry entryTo = zipTo.CreateEntry(entryFrom.FullName);
        using (Stream streamFrom = entryFrom.Open())
        using (Stream streamTo = entryTo.Open())
        {
            streamFrom.CopyTo(streamTo);
        }
    }
    foreach (String filePath in filePaths)
    {
        string nm = Path.GetFileName(filePath);
        zipTo.CreateEntryFromFile(filePath, "capture/" + nm, CompressionLevel.Optimal);
    }
}
File.Delete(filePaths1[c]);
File.Move(filePaths1[c] + ".tmp", filePaths1[c]);
Or something like that. Lacking a good, minimal, complete code example, I just wrote the above in my browser. I didn't try to compile it, never mind test it. And you may want to adjust some specifics (e.g. the handling of the temp file). But hopefully you get the idea.
The reason is simple: OutOfMemoryException means there is not enough memory for the execution.
Compression consumes a lot of memory. There is no guarantee that a change of logic will solve the problem, but you can consider different methods to alleviate it.
1.
Since your main program must be 32-bit, you can consider starting another 64-bit process to do the compression (use System.Diagnostics.Process.Start). After the 64-bit process finishes its job and exits, your 32-bit main program can continue. You can simply use a tool already installed on the system, or write a simple program yourself.
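For example, something along these lines (Zipper64.exe and its argument format are hypothetical placeholders for a 64-bit helper or an existing command-line tool):
// Launch a hypothetical 64-bit helper process to do the compression, then wait for it.
string sourceFolder = Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData) + "\\capture\\capture";
string zipPath = filePaths1[c];                                  // the archive being built, as in the question
var psi = new System.Diagnostics.ProcessStartInfo
{
    FileName = @"C:\tools\Zipper64.exe",                         // hypothetical 64-bit helper/tool
    Arguments = "\"" + sourceFolder + "\" \"" + zipPath + "\"",  // made-up argument format
    UseShellExecute = false
};
using (var process = System.Diagnostics.Process.Start(psi))
{
    process.WaitForExit();
}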
2.
Another method is to dispose each time you add an entry.
ZipArchive.Dispose saves the file. After each iteration, memory allocated for the ZipArchive can be freed.
foreach (String filePath in filePaths)
{
    System.IO.Compression.ZipArchive zip = ZipFile.Open(filePaths1[c], ZipArchiveMode.Update);
    string nm = Path.GetFileName(filePath);
    zip.CreateEntryFromFile(filePath, "capture/" + nm, CompressionLevel.Optimal);
    zip.Dispose();
}
This approach is not straightforward, and it might not be as effective as the first approach.

Faster file move method other than File.Move

I have a console application that is going to take about 625 days to complete, unless there is a way to make it faster.
First off, I am working in a directory that has around 4,000,000 files in it, if not more. I'm working with a database that has a row for each file, and then some.
Working with the SQL is relatively fast; the bottleneck is when I use File.Move(): each move takes 18 seconds to complete.
Is there a faster way than File.Move()?
This is the bottleneck:
File.Move(Path.Combine(location, fileName), Path.Combine(rootDir, fileYear, fileMonth, fileName));
All of the other code runs pretty fast. All I need to do is move one file to a new location and then update the database location field.
I can show other code if needed, but really the above is the only current bottleneck.
It turns out switching from File.Move to setting up a FileInfo and using .MoveTo increased the speed significantly.
It will run in about 35 days now as opposed to 625 days.
FileInfo fileinfo = new FileInfo(Path.Combine(location, fileName));
fileinfo.MoveTo(Path.Combine(rootDir, fileYear, fileMonth, fileName));
18 seconds isn't really unusual. NTFS does not perform well when you have a lot of files in a single directory. When you ask for a file, it has to do a linear search of its directory data structure. With 1,000 files, that doesn't take too long. With 10,000 files you notice it. With 4 million files . . . yeah, it takes a while.
You can probably do this even faster if you pre-load all of the directory entries into memory. Then rather than calling the FileInfo constructor for each file, you just look it up in your dictionary.
Something like:
var dirInfo = new DirectoryInfo(path);
// get list of all files
var files = dirInfo.GetFileSystemInfos();
var cache = new Dictionary<string, FileSystemInfo>();
foreach (var f in files)
{
    cache.Add(f.FullName, f);
}
Now when you get a name from the database, you can just look it up in the dictionary. That might very well be faster than trying to get it from the disk each time.
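For example (a sketch; location, rootDir, fileYear, fileMonth and fileName are the same values used in the question's code, and the lookup assumes location matches the cached FullName path):
// Look the file up in the pre-loaded cache instead of hitting the disk again.
FileSystemInfo cached;
if (cache.TryGetValue(Path.Combine(location, fileName), out cached))
{
    var fileInfo = cached as FileInfo;   // GetFileSystemInfos can also return directories
    if (fileInfo != null)
        fileInfo.MoveTo(Path.Combine(rootDir, fileYear, fileMonth, fileName));
}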
You can move files in parallel, and Directory.EnumerateFiles gives you a lazily loaded list of files (of course, I have not tested it with 4,000,000 files):
var numberOfConcurrentMoves = 2;
var moves = new List<Task>();
var sourceDirectory = "source-directory";
var destinationDirectory = "destination-directory";
foreach (var filePath in Directory.EnumerateFiles(sourceDirectory))
{
    var move = new Task(() =>
    {
        File.Move(filePath, Path.Combine(destinationDirectory, Path.GetFileName(filePath)));
        // UPDATE DB
    }, TaskCreationOptions.PreferFairness);
    move.Start();
    moves.Add(move);
    if (moves.Count >= numberOfConcurrentMoves)
    {
        Task.WaitAll(moves.ToArray());
        moves.Clear();
    }
}
Task.WaitAll(moves.ToArray());

Read all text files in a folder with StreamReader

I am trying to read all .txt files in a folder using a StreamReader. I have this now, and it works fine for one file, but I need to read all files in the folder. This is what I have so far. Any suggestions would be greatly appreciated.
using (var reader = new StreamReader(File.OpenRead(@"C:\ftp\inbox\test.txt")))
You can use the Directory.EnumerateFiles() method instead:
Returns an enumerable collection of file names that match a search pattern in a specified path.
var txtFiles = Directory.EnumerateFiles(sourceDirectory, "*.txt");
foreach (string currentFile in txtFiles)
{
    ...
}
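Putting it together, a sketch that opens each matching file with a StreamReader and reads it line by line (using the folder from the question):
// Read every .txt file in the folder, one line at a time.
foreach (string currentFile in Directory.EnumerateFiles(@"C:\ftp\inbox", "*.txt"))
{
    using (var reader = new StreamReader(File.OpenRead(currentFile)))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // process the line here
        }
    }
}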
You can call Directory.EnumerateFiles() to find all files in a folder.
You can retrieve the files of a directory:
string[] filePaths = Directory.GetFiles(@"c:\MyDir\");
Then you can iterate over each file and do whatever you want with it, e.g. read all of its lines.
You can also pass a file mask as the second argument to the GetFiles method.
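For example, with a *.txt mask, reading all lines of each file:
// GetFiles with a search mask, then read every line of each file.
foreach (string filePath in Directory.GetFiles(@"c:\MyDir\", "*.txt"))
{
    string[] allLines = File.ReadAllLines(filePath);
    // work with allLines here
}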
Edit:
This post explains the difference between EnumerateFiles and GetFiles:
What is the difference between Directory.EnumerateFiles vs Directory.GetFiles?

Get attributes of all files under a directory while accessing the directory only

I'm trying to write a function in C# that gets a directory path as parameter and returns a dictionary where the keys are the files directly under that directory and the values are their last modification time.
This is easy to do with Directory.GetFiles() and then File.GetLastWriteTime(). However, this means that every file must be accessed, which is too slow for my needs.
Is there a way to do this while accessing just the directory? Does the file system even support this kind of requirement?
Edit, after reading some answers:
Thank you guys, you are all saying pretty much the same thing: use a FileInfo object. Still, it is just as slow to use Directory.GetFiles() (or Directory.EnumerateFiles()) to get those objects, and I suspect that getting them requires access to every file. If the file system keeps the last modification time of its files in the files themselves only, there can't be a way to extract that info without file access. Is this the case here? Do GetFiles() and EnumerateFiles() of DirectoryInfo access every file, or do they get their info from the directory entry? I know that if I wanted to get just the file names, I could do this with the Directory class without accessing every file. But getting attributes seems trickier...
Edit, following Henk's response:
It seems that it really is faster to use a FileInfo object. I created the following test:
static void Main(string[] args)
{
    Console.WriteLine(DateTime.Now);
    foreach (string file in Directory.GetFiles(@"\\169.254.78.161\dir"))
    {
        DateTime x = File.GetLastWriteTime(file);
    }
    Console.WriteLine(DateTime.Now);

    DirectoryInfo dirInfo2 = new DirectoryInfo(@"\\169.254.78.161\dir");
    var files2 = from f in dirInfo2.EnumerateFiles()
                 select f;
    foreach (FileInfo file in files2)
    {
        DateTime x = file.LastWriteTime;
    }
    Console.WriteLine(DateTime.Now);
}
For about 800 files, I usually get something like:
31/08/2011 17:14:48
31/08/2011 17:14:51
31/08/2011 17:14:52
I didn't do any timings but your best bet is:
DirectoryInfo di = new DirectoryInfo(myPath);
FileInfo[] files = di.GetFiles();
I think all the FileInfo attributes are available in the directory file records, so this should (could) require the minimum I/O.
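A sketch of the dictionary the question asks for, built from that array:
// Build the requested file-name -> last-write-time map from the FileInfo array.
var lastWriteTimes = new Dictionary<string, DateTime>();
foreach (FileInfo fi in files)
{
    lastWriteTimes[fi.Name] = fi.LastWriteTime;
}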
The only other thing I can think of is using the FileInfo class. As far as I can see, this might help you, or it might read the file as well (read permissions are required).
