I am wondering whether I should be using Parallel.ForEach() in my case. For a bit of context: I am developing a small music player using the NAudio library. I want to use Parallel.ForEach() in a factory method to quickly access .mp3 files and create TrackModel objects to represent them (about 400 files). The code looks like this:
public static List<TrackModel> CreateTracks(string[] files)
{
    // Guard Clause
    if (files == null || files.Length == 0) throw new ArgumentException();

    var output = new List<TrackModel>();

    Parallel.ForEach(files, file =>
    {
        // Declared inside the lambda so each iteration gets its own variable
        // instead of every thread sharing (and overwriting) a single captured one.
        TrackModel track;

        using (MusicPlayer musicPlayer = new MusicPlayer(file, 0f))
        {
            track = new TrackModel()
            {
                FilePath = file,
                Title = File.Create(file).Tag.Title,
                Artist = File.Create(file).Tag.FirstPerformer,
                TrackLength = musicPlayer.GetLengthInSeconds(),
            };
        }

        lock (output)
        {
            output.Add(track);
        }
    });

    return output;
}
Note: I use lock to prevent multiple threads from adding elements to the list at the same time.
My question is the following: Should I be using Parallel.ForEach() in this situation or am I better off writing a normal foreach loop? Is this the right approach to achieve better performance and should I be using multithreading in combination with file access in the first place?
You're better off avoiding both a foreach and Parallel.ForEach. In this case AsParallel() is your friend.
Try this:
public static List<TrackModel> CreateTracks(string[] files)
{
    if (files == null || files.Length == 0) throw new ArgumentException();

    return
        files
            .AsParallel()
            .AsOrdered()
            .WithDegreeOfParallelism(2)
            .Select(file =>
            {
                using (MusicPlayer musicPlayer = new MusicPlayer(file, 0f))
                {
                    return new TrackModel()
                    {
                        FilePath = file,
                        Title = File.Create(file).Tag.Title,
                        Artist = File.Create(file).Tag.FirstPerformer,
                        TrackLength = musicPlayer.GetLengthInSeconds(),
                    };
                }
            })
            .ToList();
}
This handles all the parallel logic and the locking behind the scenes for you.
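As a usage sketch (the containing class name isn't shown in the thread, so TrackFactory below is a placeholder, as is the music folder path):

// Hypothetical call site: gather the .mp3 paths, then build all the TrackModel objects in parallel.
string[] files = Directory.GetFiles(@"C:\Music", "*.mp3", SearchOption.AllDirectories);
List<TrackModel> tracks = TrackFactory.CreateTracks(files);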
Combining the suggestions from the comments and answers and adapting them to my code, I was able to solve my issue with the following code:
public List<TrackModel> CreateTracks(string[] files)
{
    var output = files
        .AsParallel()
        .Select(file =>
        {
            using MusicPlayer musicPlayer = new MusicPlayer(file, 0f);
            using File musicFile = File.Create(file);

            return new TrackModel()
            {
                FilePath = file,
                Title = musicFile.Tag.Title,
                Artist = musicFile.Tag.FirstPerformer,
                Length = musicPlayer.GetLengthInSeconds(),
            };
        })
        .ToList();

    return output;
}
Using AsParallel() helped significantly decrease the loading time, which is what I was looking for. I will mark Enigmativity's answer as correct because of the clever idea. Initially, it threw a weird AggregateException, but I was able to solve it by saving the output in a variable and then returning it.
Credit to marsze as well, whose suggestion helped me fix a memory leak in the application and shave off 16MB of memory (!).
I have code that runs Tesseract in a single instance. How can I parallelize the code so that it can run on quad-core or 8-core processor systems? Here is my code block. Thanks in advance.
using (TesseractEngine engine = new TesseractEngine(@"./tessdata", "tel+tel1", EngineMode.Default))
{
    foreach (string ab in files)
    {
        using (var pages = Pix.LoadFromFile(ab))
        {
            using (Tesseract.Page page = engine.Process(pages, Tesseract.PageSegMode.SingleBlock))
            {
                string text = page.GetText();
                OCRedText.Append(text);
            }
        }
    }
}
This has worked for me:
static IEnumerable<string> Ocr(string directory, string sep)
    => Directory.GetFiles(directory, sep)
        .AsParallel()
        .Select(x =>
        {
            using var engine = new TesseractEngine(tessdata, "eng", EngineMode.Default);
            using var img = Pix.LoadFromFile(x);
            using var page = engine.Process(img);
            return page.GetText();
        }).ToList();
I am no expert on parallelization, but this function OCRs 8 TIFFs in 12 seconds.
However, it creates an engine for every TIFF; I have not been able to call engine.Process concurrently on a single shared engine.
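If the per-image engine turns out to be the bottleneck, one middle ground is a ThreadLocal<TesseractEngine>, so each PLINQ worker thread creates one engine and reuses it for every file it processes, while still never sharing an engine between threads. This is only a sketch under the assumption that an engine can safely be reused serially on the thread that created it; tessdata stands for the same data-path value as above, and the method and parameter names are made up for the example:

// Needs System.Threading (ThreadLocal<T>) on top of the usings the snippet above already requires.
static IEnumerable<string> OcrWithThreadLocalEngines(string directory, string searchPattern, string tessdata)
{
    // One engine per worker thread instead of one per file; trackAllValues lets us dispose them later.
    using var engines = new ThreadLocal<TesseractEngine>(
        () => new TesseractEngine(tessdata, "eng", EngineMode.Default), trackAllValues: true);

    try
    {
        // ToList() forces the parallel query to finish before the engines are disposed below.
        return Directory.GetFiles(directory, searchPattern)
            .AsParallel()
            .Select(file =>
            {
                using var img = Pix.LoadFromFile(file);
                using var page = engines.Value.Process(img);
                return page.GetText();
            })
            .ToList();
    }
    finally
    {
        // ThreadLocal<T>.Dispose() does not dispose the values it created, so do that explicitly.
        foreach (var engine in engines.Values)
            engine.Dispose();
    }
}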
The simplest way to run this code in parallel is PLINQ. Calling AsParallel() on the enumeration will automatically run the query that follows it (the .Select(...)) simultaneously on all available CPU cores.
It is crucial that only thread-safe code runs in parallel. Assuming TesseractEngine is thread-safe (as you suggest in the comments; I didn't verify it myself), as well as Pix.LoadFromFile(), the only problematic part is OCRedText.Append(). It is not clear from the code what OCRedText is, so I assume it is a StringBuilder or a List, which is not thread-safe. I therefore removed that part from the code that runs in parallel and process it afterwards on a single thread; since .Append() is likely to be fast, this shouldn't have a significant adverse effect on overall performance.
using (TesseractEngine engine = new TesseractEngine(@"./tessdata", "tel+tel1", EngineMode.Default))
{
    var texts = files.AsParallel().Select(ab =>
    {
        using (var pages = Pix.LoadFromFile(ab))
        {
            using (Tesseract.Page page = engine.Process(pages, Tesseract.PageSegMode.SingleBlock))
            {
                return page.GetText();
            }
        }
    });

    foreach (string text in texts)
    {
        OCRedText.Append(text);
    }
}
I have a method like this:
public ConcurrentBag<FileModel> GetExceptionFiles(List<string> foldersPath, List<string> someList)
{
    for (var i = 0; i < foldersPath.Count; i++)
    {
        var index = i;
        new Thread(delegate()
        {
            foreach (var file in BrowseFiles(foldersPath[index]))
            {
                if (file.Name.Contains(someList[0]) || file.Name.Contains(someList[1]))
                {
                    using (var fileStream = File.Open(file.Path, FileMode.Open))
                    using (var bufferedStream = new BufferedStream(fileStream))
                    using (var streamReader = new StreamReader(bufferedStream))
                    ...
To give you more details:
This method starts n threads (n = foldersPath.Count), and each thread reads all the files whose names contain the strings listed in someList.
Right now my list contains only 2 strings (conditions), which is why I'm doing:
file.Name.Contains(someList[0]) || file.Name.Contains(someList[1])
What I want to do now is to replace this line with something that checks all the elements in the list someList.
How can I do that?
Edit
Now that I have replaced that line with if (someList.Any(item => file.Name.Contains(item))),
the next question is how I can optimize the performance of this code, knowing that each item in foldersPath is a separate hard drive on my network (and there are never more than 5 hard drives).
You could use something like if (someList.Any(item => file.Name.Contains(item)))
This will iterate over each item in someList and check whether any of them is contained in the file name, returning a boolean value indicating whether a match was found.
Firstly.
There is an old saying in computer science: "There are two hard problems in CS: naming, cache invalidation, and off-by-one errors."
Don't use for loops unless you absolutely have to; the tiny perf gain you get isn't worth the debug time (assuming there is any perf gain in this version of .NET).
Secondly
new Thread: don't do that. Creating a thread is extremely slow and takes up a lot of resources, especially for a short-lived job like this. On top of that, there is overhead in passing data between threads. Use ThreadPool.QueueUserWorkItem(WaitCallback) instead if you MUST run short-lived work on other threads.
However, as I previously alluded to, threads are an abstraction for CPU resources. I honestly doubt you are CPU bound, and threading is going to cost you more than you think, so stick to a single thread. You ARE I/O bound, however, so make full use of asynchronous I/O.
// Requires C# 8 / .NET Core 3.0+ (async iterators and string.Contains with a StringComparison).
public async IAsyncEnumerable<FileModel> GetExceptionFiles(List<string> foldersPath, List<string> someList)
{
    foreach (var folderPath in foldersPath)
        foreach (var file in BrowseFiles(folderPath))
        {
            if (!someList.Any(x => file.Name.Contains(x, StringComparison.InvariantCultureIgnoreCase)))
                continue;

            // useAsync: true opens the file for true asynchronous I/O.
            using (var fileStream = new FileStream(file.Path, FileMode.Open, FileAccess.Read, FileShare.Read, 4096, useAsync: true))
            using (var bufferedStream = new BufferedStream(fileStream))
            using (var streamReader = new StreamReader(bufferedStream))
                ...

            yield return new FileModel();
        }
}
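Consuming an async stream like the sketch above is then a straightforward await foreach (again C# 8 / .NET Core 3.0+); a hypothetical call site:

await foreach (FileModel model in GetExceptionFiles(foldersPath, someList))
{
    // Handle each matching file as soon as it has been read.
}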
I'm curious as to how IEnumerable differs from IObservable under the hood. I understand the pull and push patterns respectively, but how does C#, in terms of memory etc., notify subscribers (for IObservable) that they should receive the next bit of data in memory to process? How does the observed instance know it has had a change in data to push to the subscribers?
My question comes from a test I was performing reading in lines from a file. The file was about 6Mb in total.
Standard Time Taken: 4.7s, lines: 36587
Rx Time Taken: 0.68s, lines: 36587
How is Rx able to massively improve a normal iteration over each of the lines in the file?
private static void ReadStandardFile()
{
var timer = Stopwatch.StartNew();
var linesProcessed = 0;
foreach (var l in ReadLines(new FileStream(_filePath, FileMode.Open)))
{
var s = l.Split(',');
linesProcessed++;
}
timer.Stop();
_log.DebugFormat("Standard Time Taken: {0}s, lines: {1}",
timer.Elapsed.ToString(), linesProcessed);
}
private static void ReadRxFile()
{
var timer = Stopwatch.StartNew();
var linesProcessed = 0;
var query = ReadLines(new FileStream(_filePath, FileMode.Open)).ToObservable();
using (query.Subscribe((line) =>
{
var s = line.Split(',');
linesProcessed++;
}));
timer.Stop();
_log.DebugFormat("Rx Time Taken: {0}s, lines: {1}",
timer.Elapsed.ToString(), linesProcessed);
}
private static IEnumerable<string> ReadLines(Stream stream)
{
using (StreamReader reader = new StreamReader(stream))
{
while (!reader.EndOfStream)
yield return reader.ReadLine();
}
}
My hunch is that the behavior you're seeing reflects the OS caching the file. I would imagine that if you reversed the order of the calls you would see a similar difference in speeds, just swapped.
You could improve this benchmark by performing a few warm-up runs or by copying the input file to a temp file using File.Copy prior to testing each one. This way the file would not be "hot" and you would get a fair comparison.
I'd suspect that you're seeing some kind of internal optimization of the CLR. It probably caches the content of the file in memory between the two calls so that ToObservable can pull the content much faster...
Edit: Oh, the good colleague with the crazy nickname eeh ... #sixlettervariables was faster, and he's probably right: it's the OS doing the optimizing rather than the CLR.
I'm writing a program that needs to search a directory and all its sub directories for files that have a certain extension. This is going to be used both on a local, and a network drive, so performance is a bit of an issue.
Here's the recursive method I'm using now:
private void GetFileList(string fileSearchPattern, string rootFolderPath, List<FileInfo> files)
{
DirectoryInfo di = new DirectoryInfo(rootFolderPath);
FileInfo[] fiArr = di.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
files.AddRange(fiArr);
DirectoryInfo[] diArr = di.GetDirectories();
foreach (DirectoryInfo info in diArr)
{
GetFileList(fileSearchPattern, info.FullName, files);
}
}
I could set the SearchOption to AllDirectories and not use a recursive method, but in the future I'll want to insert some code to notify the user what folder is currently being scanned.
While I'm creating a list of FileInfo objects now, all I really care about is the paths to the files. I'll have an existing list of files, which I want to compare to the new list of files to see what files were added or deleted. Is there any faster way to generate this list of file paths? Is there anything that I can do to optimize this file search around querying for the files on a shared network drive?
Update 1
I tried creating a non-recursive method that does the same thing by first finding all the sub directories and then iteratively scanning each directory for files. Here's the method:
public static List<FileInfo> GetFileList(string fileSearchPattern, string rootFolderPath)
{
DirectoryInfo rootDir = new DirectoryInfo(rootFolderPath);
List<DirectoryInfo> dirList = new List<DirectoryInfo>(rootDir.GetDirectories("*", SearchOption.AllDirectories));
dirList.Add(rootDir);
List<FileInfo> fileList = new List<FileInfo>();
foreach (DirectoryInfo dir in dirList)
{
fileList.AddRange(dir.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly));
}
return fileList;
}
Update 2
Alright so I've run some tests on a local and a remote folder both of which have a lot of files (~1200). Here are the methods I've run the tests on. The results are below.
GetFileListA(): Non-recursive solution in the update above. I think it's equivalent to Jay's solution.
GetFileListB(): Recursive method from the original question
GetFileListC(): Gets all the directories with static Directory.GetDirectories() method. Then gets all the file paths with the static Directory.GetFiles() method. Populates and returns a List
GetFileListD(): Marc Gravell's solution using a queue; it returns an IEnumerable. I populated a List with the resulting IEnumerable.
DirectoryInfo.GetFiles: No additional method created. Instantiated a DirectoryInfo from the root folder path. Called GetFiles using SearchOption.AllDirectories
Directory.GetFiles: No additional method created. Called the static GetFiles method of Directory using SearchOption.AllDirectories.
Method Local Folder Remote Folder
GetFileListA() 00:00.0781235 05:22.9000502
GetFileListB() 00:00.0624988 03:43.5425829
GetFileListC() 00:00.0624988 05:19.7282361
GetFileListD() 00:00.0468741 03:38.1208120
DirectoryInfo.GetFiles 00:00.0468741 03:45.4644210
Directory.GetFiles 00:00.0312494 03:48.0737459
...so it looks like Marc's is the fastest.
Try this iterator block version that avoids recursion and the Info objects:
public static IEnumerable<string> GetFileList(string fileSearchPattern, string rootFolderPath)
{
Queue<string> pending = new Queue<string>();
pending.Enqueue(rootFolderPath);
string[] tmp;
while (pending.Count > 0)
{
rootFolderPath = pending.Dequeue();
try
{
tmp = Directory.GetFiles(rootFolderPath, fileSearchPattern);
}
catch (UnauthorizedAccessException)
{
continue;
}
for (int i = 0; i < tmp.Length; i++)
{
yield return tmp[i];
}
tmp = Directory.GetDirectories(rootFolderPath);
for (int i = 0; i < tmp.Length; i++)
{
pending.Enqueue(tmp[i]);
}
}
}
Note also that .NET 4.0 has inbuilt iterator block versions (EnumerateFiles, EnumerateFileSystemEntries) that may be faster (more direct access to the file system; fewer arrays).
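For comparison, the whole walk can be expressed lazily in a single call to those methods, at the cost of losing the per-directory UnauthorizedAccessException handling in the iterator above (one inaccessible folder aborts the entire enumeration); rootFolderPath and fileSearchPattern are the same parameters as above:

// .NET 4.0+: streaming enumeration; note that an inaccessible subfolder will abort the whole walk.
IEnumerable<string> files = Directory.EnumerateFiles(rootFolderPath, fileSearchPattern, SearchOption.AllDirectories);
foreach (string file in files)
    Console.WriteLine(file);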
Cool question.
I played around a little, and by leveraging iterator blocks and LINQ I appear to have improved your revised implementation by about 40%.
I would be interested to have you test it out using your timing methods and on your network to see what the difference looks like.
Here is the meat of it
private static IEnumerable<FileInfo> GetFileList(string searchPattern, string rootFolderPath)
{
var rootDir = new DirectoryInfo(rootFolderPath);
var dirList = rootDir.GetDirectories("*", SearchOption.AllDirectories);
return from directoriesWithFiles in ReturnFiles(dirList, searchPattern).SelectMany(files => files)
select directoriesWithFiles;
}
private static IEnumerable<FileInfo[]> ReturnFiles(DirectoryInfo[] dirList, string fileSearchPattern)
{
foreach (DirectoryInfo dir in dirList)
{
yield return dir.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
}
}
The short answer on how to improve the performance of that code is: you can't.
The real performance hit you're experiencing is the latency of the disk or network, so no matter which way you flip it, you have to check and iterate through each file item and retrieve directory and file listings. (That is of course excluding hardware or driver modifications to reduce or improve disk latency, but a lot of people are already paid a lot of money to solve those problems, so we'll ignore that side of it for now.)
Given the original constraints, there are several solutions already posted that more or less elegantly wrap the iteration process, reduce the number of objects created, and so on. However, since I assume we're reading from a single hard drive, parallelism will NOT help traverse the directory tree more quickly; it may even increase the time, since you now have two or more threads fighting for data on different parts of the drive as it seeks back and forth. Still, if we evaluate how the function will be consumed by the end developer, there are some optimizations and generalizations we can come up with.
First, we can delay execution by returning an IEnumerable; yield return accomplishes this by compiling a state-machine enumerator inside an anonymous class that implements IEnumerable and is returned when the method executes. Most LINQ methods are written to delay execution until the iteration is performed, so the code in a Select or SelectMany will not run until the IEnumerable is iterated. The payoff of delayed execution is only felt if you take a subset of the data at a later time: for instance, if you only need the first 10 results, a query that could return several thousand results won't be iterated past those ten until you ask for more.
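As a small illustration of deferred execution, using (say) the queue-based GetFileList iterator posted earlier in the thread, with a made-up path and pattern:

// Nothing is enumerated (no disk access) here; the iterator is merely constructed.
IEnumerable<string> query = GetFileList("*.log", @"C:\Temp");

// Work happens here, and it stops as soon as the first ten file names have been produced.
foreach (string file in query.Take(10))
    Console.WriteLine(file);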
Now, given that you want to do a subfolder search, I can also infer that it may be useful to specify the depth; doing so generalizes the problem, but it also necessitates a recursive solution. Later, when someone decides it needs to search two directories deep because we increased the number of files and added another layer of categorization, you can simply make a slight modification instead of rewriting the function.
In light of all that, here is the solution I came up with; it is more general than some of the others above:
public static IEnumerable<FileInfo> BetterFileList(string fileSearchPattern, string rootFolderPath)
{
return BetterFileList(fileSearchPattern, new DirectoryInfo(rootFolderPath), 1);
}
public static IEnumerable<FileInfo> BetterFileList(string fileSearchPattern, DirectoryInfo directory, int depth)
{
return depth == 0
? directory.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly)
: directory.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly).Concat(
directory.GetDirectories().SelectMany(x => BetterFileList(fileSearchPattern, x, depth - 1)));
}
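Usage then reads naturally, and the depth can grow later without touching the implementation; for example (hypothetical paths and patterns):

// The folder itself plus one level of subfolders (depth 1 is what the string overload passes).
var topLevel = BetterFileList("*.mp3", @"C:\Music");

// Later requirement: two levels of categorization deep.
var twoDeep = BetterFileList("*.mp3", new DirectoryInfo(@"C:\Music"), 2);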
On a side note, something else that hasn't been mentioned by anyone so far is file permissions and security. Currently, there's no checking, handling, or permissions requests, and the code will throw file permission exceptions if it encounters a directory it doesn't have access to iterate through.
I recently (2020) discovered this post because of a need to count files and directories across slow connections, and this was the fastest implementation I could come up with. The .NET enumeration methods (GetFiles(), GetDirectories()) perform a lot of under-the-hood work that slows them down tremendously by comparison.
This solution does not return FileInfo objects, but could be modified to do so—or potentially only return the relevant data needed from a custom FileInfo object.
This solution utilizes the Win32 API and .NET's Parallel.ForEach() to leverage the threadpool to maximize performance.
P/Invoke:
/// <summary>
/// https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-findfirstfilew
/// </summary>
[DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
public static extern IntPtr FindFirstFile(
    string lpFileName,
    ref WIN32_FIND_DATA lpFindFileData
);

/// <summary>
/// https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-findnextfilew
/// </summary>
[DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
public static extern bool FindNextFile(
    IntPtr hFindFile,
    ref WIN32_FIND_DATA lpFindFileData
);

/// <summary>
/// https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-findclose
/// </summary>
[DllImport("kernel32.dll", SetLastError = true)]
public static extern bool FindClose(
    IntPtr hFindFile
);
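The WIN32_FIND_DATA declaration the code relies on is not shown in the post; a standard declaration (assumed here, matching the Unicode entry points that the imports above bind via CharSet = CharSet.Unicode) would be:

/// <summary>
/// https://learn.microsoft.com/en-us/windows/win32/api/minwinbase/ns-minwinbase-win32_find_dataw
/// </summary>
[StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
public struct WIN32_FIND_DATA
{
    public FileAttributes dwFileAttributes;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
    public uint nFileSizeHigh;
    public uint nFileSizeLow;
    public uint dwReserved0;
    public uint dwReserved1;
    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
    public string cFileName;
    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
    public string cAlternateFileName;
}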
Method:
public static Tuple<long, long> CountFilesDirectories(
string path,
CancellationToken token
)
{
if (String.IsNullOrWhiteSpace(path))
throw new ArgumentNullException("path", "The provided path is NULL or empty.");
// If the provided path doesn't end in a backslash, append one.
if (path.Last() != '\\')
path += '\\';
IntPtr hFile = IntPtr.Zero;
Win32.Kernel32.WIN32_FIND_DATA fd = new Win32.Kernel32.WIN32_FIND_DATA();
long files = 0;
long dirs = 0;
try
{
hFile = Win32.Kernel32.FindFirstFile(
path + "*", // Discover all files/folders by ending a directory with "*", e.g. "X:\*".
ref fd
);
// If we encounter an error, or there are no files/directories, we return no entries.
if (hFile.ToInt64() == -1)
return Tuple.Create<long, long>(0, 0);
//
// Find (and count) each file/directory, then iterate through each directory in parallel to maximize performance.
//
List<string> directories = new List<string>();
do
{
// If a directory (and not a Reparse Point), and the name is not "." or ".." which exist as concepts in the file system,
// count the directory and add it to a list so we can iterate over it in parallel later on to maximize performance.
if ((fd.dwFileAttributes & FileAttributes.Directory) != 0 &&
(fd.dwFileAttributes & FileAttributes.ReparsePoint) == 0 &&
fd.cFileName != "." && fd.cFileName != "..")
{
directories.Add(System.IO.Path.Combine(path, fd.cFileName));
dirs++;
}
// Otherwise, if this is a file ("archive"), increment the file count.
else if ((fd.dwFileAttributes & FileAttributes.Archive) != 0)
{
files++;
}
}
while (Win32.Kernel32.FindNextFile(hFile, ref fd));
// Iterate over each discovered directory in parallel to maximize file/directory counting performance,
// calling itself recursively to traverse each directory completely.
Parallel.ForEach(
directories,
new ParallelOptions()
{
CancellationToken = token
},
directory =>
{
var count = CountFilesDirectories(
directory,
token
);
lock (directories)
{
files += count.Item1;
dirs += count.Item2;
}
});
}
catch (Exception)
{
// Handle as desired.
}
finally
{
// FindFirstFile returns INVALID_HANDLE_VALUE (-1) on failure, so only close a valid handle.
if (hFile != IntPtr.Zero && hFile.ToInt64() != -1)
Win32.Kernel32.FindClose(hFile);
}
return Tuple.Create<long, long>(files, dirs);
}
On my local system, the performance of GetFiles()/GetDirectories() can be close to this, but across slower connections (VPNs, etc.) I found that this is tremendously faster—45 minutes vs. 90 seconds to access a remote directory of ~40k files, ~40 GB in size.
This can also fairly easily be modified to include other data, like the total file size of all files counted, or rapidly recursing through and deleting empty directories, starting at the furthest branch.
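For example, assuming a WIN32_FIND_DATA declaration that exposes the standard nFileSizeHigh/nFileSizeLow fields (as in the sketch above), a running byte total is a one-line addition inside the branch that increments files, with totalBytes being a long declared alongside files and dirs:

// Combine the two 32-bit halves of the file size into a 64-bit value and accumulate it.
totalBytes += ((long)fd.nFileSizeHigh << 32) | fd.nFileSizeLow;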
This takes 30 seconds to get 2 million file names that meet the filter. The reason it is so fast is that I am only performing one enumeration; each additional enumeration affects performance. The variable length is open to your interpretation and not necessarily related to the enumeration example.
if (Directory.Exists(path))
{
files = Directory.EnumerateFiles(path, "*.*", SearchOption.AllDirectories)
.Where(s => s.EndsWith(".xml") || s.EndsWith(".csv"))
.Select(s => s.Remove(0, length)).ToList(); // Remove the Dir info.
}
The BCL methods are portable, so to speak. If staying 100% managed, I believe the best you can do is call GetDirectories/Folders while checking access rights (or possibly not checking the rights and having another thread ready to go when the first one takes a little too long, a sign that it's about to throw an UnauthorizedAccessException; this might be avoidable with exception filters using VB or, as of today, unreleased C#).
If you want to go faster than GetDirectories, you have to call Win32 (FindFirstFileEx etc.), which provides specific flags that allow you to skip possibly unnecessary I/O while traversing the MFT structures. Also, if the drive is a network share, a similar approach can give a great speedup, this time by also avoiding excessive network round trips.
Now, if you have admin rights, use NTFS, and are in a real hurry with millions of files to go through, the absolute fastest way through them (assuming spinning rust, where disk latency kills) is to use both the MFT and the journal in combination, essentially replacing the indexing service with one targeted at your specific need. If you only need to find filenames and not sizes (or sizes too, but then you must cache them and use the journal to notice changes), this approach could allow practically instant search of tens of millions of files and folders if implemented ideally. There may be one or two paywares that have bothered with this. There are samples of both MFT reading (DiscUtils) and journal reading (Google for them) in C# around. I only have about 5 million files, and just using NTFSSearch is good enough for that amount, as it takes about 10-20 seconds to search them. With journal reading added, it would go down to under 3 seconds for that amount.
DirectoryInfo seems to give much more information than you need; try piping a dir command and parsing the info from that.
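A rough sketch of that idea, shelling out to cmd /c dir /s /b so dir does the recursive walk and we only parse the bare output (the root path and the *.xml filter are placeholders):

// Hypothetical: let dir produce the recursive, bare (/b) listing and read it line by line.
var psi = new ProcessStartInfo("cmd.exe", @"/c dir /s /b C:\SomeFolder\*.xml")
{
    RedirectStandardOutput = true,
    UseShellExecute = false,
    CreateNoWindow = true
};

var paths = new List<string>();
using (Process proc = Process.Start(psi))
{
    string line;
    while ((line = proc.StandardOutput.ReadLine()) != null)
        paths.Add(line);
    proc.WaitForExit();
}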
I needed to get all the files from my C: partition, so I combined Marc's and Jaider's answers and got a function with no recursion and with parallel programming, with the result of about 370k files processed in 30 seconds. Maybe this will help someone:
ConcurrentBag<string> DirSearch(string path)
{
    ConcurrentQueue<string> pendingQueue = new ConcurrentQueue<string>();
    pendingQueue.Enqueue(path);

    ConcurrentBag<string> filesNames = new ConcurrentBag<string>();
    while (pendingQueue.Count > 0)
    {
        try
        {
            pendingQueue.TryDequeue(out path);

            var files = Directory.GetFiles(path);
            Parallel.ForEach(files, x => filesNames.Add(x));

            var directories = Directory.GetDirectories(path);
            Parallel.ForEach(directories, (x) => pendingQueue.Enqueue(x));
        }
        catch (Exception)
        {
            continue;
        }
    }

    // Return the collected names so the caller can actually use the result.
    return filesNames;
}
Consider splitting the updated method into two iterators:
private static IEnumerable<DirectoryInfo> GetDirs(string rootFolderPath)
{
    DirectoryInfo rootDir = new DirectoryInfo(rootFolderPath);
    yield return rootDir;

    foreach (DirectoryInfo di in rootDir.GetDirectories("*", SearchOption.AllDirectories))
    {
        yield return di;
    }
    yield break;
}

public static IEnumerable<FileInfo> GetFileList(string fileSearchPattern, string rootFolderPath)
{
    var allDirs = GetDirs(rootFolderPath);
    foreach (DirectoryInfo di in allDirs)
    {
        var files = di.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
        foreach (FileInfo fi in files)
        {
            yield return fi;
        }
    }
    yield break;
}
Also, further to the network-specific scenario, if you were able to install a small service on that server that you could call into from a client machine, you'd get much closer to your "local folder" results, because the search could execute on the server and just return the results to you. This would be your biggest speed boost in the network folder scenario, but may not be available in your situation. I've been using a file synchronization program that includes this option -- once I installed the service on my server the program became WAY faster at identifying the files that were new, deleted, and out-of-sync.
Try Parallel programming:
private string _fileSearchPattern;
private List<string> _files;
private object lockThis = new object();

public List<string> GetFileList(string fileSearchPattern, string rootFolderPath)
{
    _fileSearchPattern = fileSearchPattern;
    _files = new List<string>(); // initialize the result list before the recursive, parallel walk
    AddFileList(rootFolderPath);
    return _files;
}

private void AddFileList(string rootFolderPath)
{
    var files = Directory.GetFiles(rootFolderPath, _fileSearchPattern);
    lock (lockThis)
    {
        _files.AddRange(files);
    }

    var directories = Directory.GetDirectories(rootFolderPath);
    Parallel.ForEach(directories, AddFileList); // same as Parallel.ForEach(directories, directory => AddFileList(directory));
}
I had the same problem. Here is my attempt, which is a lot faster than calling Directory.EnumerateFiles, Directory.EnumerateDirectories or Directory.EnumerateFileSystemEntries recursively:
public static IEnumerable<string> EnumerateDirectoriesRecursive(string directoryPath)
{
return EnumerateFileSystemEntries(directoryPath).Where(e => e.isDirectory).Select(e => e.EntryPath);
}
public static IEnumerable<string> EnumerateFilesRecursive(string directoryPath)
{
return EnumerateFileSystemEntries(directoryPath).Where(e => !e.isDirectory).Select(e => e.EntryPath);
}
public static IEnumerable<(string EntryPath, bool isDirectory)> EnumerateFileSystemEntries(string directoryPath)
{
Stack<string> directoryStack = new Stack<string>(new[] { directoryPath });
while (directoryStack.Any())
{
foreach (string fileSystemEntry in Directory.EnumerateFileSystemEntries(directoryStack.Pop()))
{
bool isDirectory = (File.GetAttributes(fileSystemEntry) & (FileAttributes.Directory | FileAttributes.ReparsePoint)) == FileAttributes.Directory;
yield return (fileSystemEntry, isDirectory);
if (isDirectory)
directoryStack.Push(fileSystemEntry);
}
}
}
You can modify the code to search for specific files or directories easily.
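For example, restricting the result to particular extensions is just a Where over the lazily produced paths (a hypothetical call, with D:\Data as a placeholder root):

// Stream all *.xml and *.csv paths under the root without building the full list first.
var reports = EnumerateFilesRecursive(@"D:\Data")
    .Where(p => p.EndsWith(".xml", StringComparison.OrdinalIgnoreCase)
             || p.EndsWith(".csv", StringComparison.OrdinalIgnoreCase));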
You can use Parallel.ForEach (.NET 4.0) or you can try the Poor Man's Parallel.ForEach Iterator for .NET 3.5. That can speed up your search.
It is horrible, and the reason file searching is horrible on Windows platforms is that MS made a mistake that they seem unwilling to put right. You should be able to use
SearchOption.AllDirectories
and we would all get the speed that we want. But you cannot do that, because GetDirectories needs a callback so that you can decide what to do about the directories you do not have access to. MS forgot, or did not think, to test the class on their own computers.
So, we are all left with the nonsense recursive loops.
Within C#/Managed C++ you have very few options; these are also the options that MS take, because their coders haven't worked out how to get around it either.
The main thing is that with display items, such as TreeViews and FileViews, you should only search and show what users can see. There are plenty of helpers on the controls, including triggers, that tell you when you need to fill in some data.
In trees, starting from collapsed mode, search a directory as and when the user opens it in the tree; that is much faster than waiting for the whole tree to be filled.
The same goes for FileViews; I tend towards a 10% rule: however many items fit in the display area, have another 10% ready in case the user scrolls, and it stays nicely responsive.
MS does the pre-search and directory watch: a little database of directories and files, which means that when you open your trees etc. you have a good, fast starting point; it falls down a bit on refresh.
But mix the two ideas: take your directories and files from the database, but do a refresh search as a tree node is expanded (just that tree node) and as a different directory is selected in the tree.
But the better solution is to add your file search system as a service. MS already has this, but as far as I know we do not get access to it; I suspect that is because it is immune to 'failed access to directory' errors.
Just as with the MS one, if you have a service running at Admin level, you need to be careful that you are not giving away your security just for the sake of a little extra speed.
I'd be inclined to return an IEnumerable<> in this case -- depending on how you are consuming the results, it could be an improvement, plus you reduce your parameter footprint by 1/3 and avoid passing around that List incessantly.
private IEnumerable<FileInfo> GetFileList(string fileSearchPattern, string rootFolderPath)
{
    DirectoryInfo di = new DirectoryInfo(rootFolderPath);

    var fiArr = di.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
    foreach (FileInfo fi in fiArr)
    {
        yield return fi;
    }

    var diArr = di.GetDirectories();
    foreach (DirectoryInfo subDir in diArr) // renamed to avoid shadowing the outer 'di'
    {
        var nextRound = GetFileList(fileSearchPattern, subDir.FullName);
        foreach (FileInfo fi in nextRound)
        {
            yield return fi;
        }
    }
    yield break;
}
Another idea would be to spin off BackgroundWorker objects to troll through directories. You wouldn't want a new thread for every directory, but you might create them on the top level (first pass through GetFileList()), so if you execute on your C:\ drive, with 12 directories, each of those directories will be searched by a different thread, which will then recurse through subdirectories. You'll have one thread going through C:\Windows while another goes through C:\Program Files. There are a lot of variables as to how this is going to affect performance -- you'd have to test it to see.
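A rough sketch of that fan-out, with one BackgroundWorker per top-level directory, each recursing on its own (error handling and completion signalling omitted; the drive root and the *.dll pattern are placeholders):

// Needs System.ComponentModel, System.Collections.Concurrent and System.IO.
var results = new ConcurrentBag<FileInfo>();
foreach (DirectoryInfo topLevel in new DirectoryInfo(@"C:\").GetDirectories())
{
    var worker = new BackgroundWorker();
    worker.DoWork += (sender, e) =>
    {
        var dir = (DirectoryInfo)e.Argument;
        foreach (FileInfo fi in dir.GetFiles("*.dll", SearchOption.AllDirectories))
            results.Add(fi);
    };
    worker.RunWorkerAsync(topLevel);
}
// A RunWorkerCompleted handler or a CountdownEvent is still needed to know when all workers are done.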
For file and directory searching I would like to offer a multithreaded .NET library that provides a wide range of search capabilities.
All information about library you can find on GitHub: https://github.com/VladPVS/FastSearchLibrary
If you want to download it you can do it here: https://github.com/VladPVS/FastSearchLibrary/releases
Works really fast. Check it yourself!
If you have any questions please ask them.
Here is one demonstrative example of how you can use it:
class Searcher
{
    private static object locker = new object();

    private FileSearcher searcher;

    List<FileInfo> files;

    public Searcher()
    {
        files = new List<FileInfo>(); // create the list that will contain the search results
    }

    public void Startsearch()
    {
        CancellationTokenSource tokenSource = new CancellationTokenSource();
        // create tokenSource so the search process can be stopped

        searcher = new FileSearcher(@"C:\", (f) =>
        {
            return Regex.IsMatch(f.Name, @".*[Dd]ragon.*.jpg$");
        }, tokenSource); // pass tokenSource to the constructor

        searcher.FilesFound += (sender, arg) => // subscribe to the FilesFound event
        {
            lock (locker) // using a lock is obligatory
            {
                arg.Files.ForEach((f) =>
                {
                    files.Add(f); // add the next batch of received files to the results list
                    Console.WriteLine($"File location: {f.FullName}, \nCreation.Time: {f.CreationTime}");
                });

                if (files.Count >= 10) // one can choose any stopping condition
                    searcher.StopSearch();
            }
        };

        searcher.SearchCompleted += (sender, arg) => // subscribe to the SearchCompleted event
        {
            if (arg.IsCanceled) // check whether StopSearch() was called
                Console.WriteLine("Search stopped.");
            else
                Console.WriteLine("Search completed.");

            Console.WriteLine($"Quantity of files: {files.Count}"); // show the number of files found
        };

        searcher.StartSearchAsync();
        // start the search process as an asynchronous operation that doesn't block the calling thread
    }
}
Here is another example:
***
List<string> folders = new List<string>
{
    @"C:\Users\Public",
    @"C:\Windows\System32",
    @"D:\Program Files",
    @"D:\Program Files (x86)"
}; // list of search directories

List<string> keywords = new List<string> { "word1", "word2", "word3" }; // list of search keywords

FileSearcherMultiple multipleSearcher = new FileSearcherMultiple(folders, (f) =>
{
    if (f.CreationTime >= new DateTime(2015, 3, 15) &&
        (f.Extension == ".cs" || f.Extension == ".sln"))
        foreach (var keyword in keywords)
            if (f.Name.Contains(keyword))
                return true;

    return false;
}, tokenSource, ExecuteHandlers.InCurrentTask, true);
***
In .NET Core you can do something like the code below. It can search all subdirectories recursively with good performance while ignoring paths without access.
I also tried other methods found in
https://www.codeproject.com/Articles/1383832/System-IO-Directory-Alternative-using-WinAPI
public static IEnumerable<string> ListFiles(string baseDir)
{
EnumerationOptions opt = new EnumerationOptions();
opt.RecurseSubdirectories = true;
opt.ReturnSpecialDirectories = false;
//opt.AttributesToSkip = FileAttributes.Hidden | FileAttributes.System;
opt.AttributesToSkip = 0;
opt.IgnoreInaccessible = true;
var tmp = Directory.EnumerateFileSystemEntries(baseDir, "*", opt);
return tmp;
}