I'm scanning a directory for items. I've just read the "Multithreaded Directory Looping in C#" question, but I still want to make it multithreaded. Even though everyone says the drive will be the bottleneck, I have a few points:
Drives may mostly be "single threaded" today, but how do you know what the future will bring?
How do you know the different sub-paths you are scanning are on the same physical drive?
I'm using an abstraction layer (even two) over System.IO so that I can later reuse the code in different scenarios.
So, my first idea was to use Task, and my first dummy implementation was this:
public async Task Scan(bool recursive = false) {
var t = new Task(() => {
foreach (var p in path.scan) Add(p);
if (!recursive) return;
var tks = new Task[subs.Count]; var i = 0;
foreach (var s in subs) tks[i++] = s.Scan(true);
Task.WaitAll(tks);
}); t.Start();
await t;
}
I don't like the idea of creating a Task for each item, and generally this doesn't seem ideal, but this was just a test, as Tasks are advertised to manage threads automatically...
This method works but it's very slow. It takes over 5 s to complete, while the single-threaded version below takes around 0.5 s to complete the whole program on the same data set:
public void Scan2(bool recursive = false) {
foreach (var p in path.scan) Add(p);
if (!recursive) return;
foreach (var s in subs) s.Scan2(true);
}
I wonder what really goes wrong with the first method. The machine is not under load, CPU usage is insignificant, the drive is fine... I've tried profiling it with NProfiler, but it doesn't tell me much besides that the program sits on Task.WaitAll(tks) all the time.
I also wrote a thread-locked counting mechanism that is invoked when each item is added. Maybe that is the problem?
#region SubCouting
public Dictionary<Type, int> counters = new Dictionary<Type, int>();
private object cLock = new object();
private int _sc = 0;
public int subCount => _sc;
private void inCounter(Type t) {
lock (cLock) {
if (!counters.ContainsKey(t)) counters.Add(t, 1);
counters[t]++;
_sc++;
}
if (parent) parent.inCounter(t);
}
#endregion
But even if threads are waiting here, wouldn't the execution time be similar to the single-threaded version rather than 10x slower?
I'm not sure how to approach this. If I don't want to use tasks, do I need to manage threads manually or is there already some library that would fit nicely for the job?
I think you almost got it. Task.WaitAll(tks) is the problem. It is a synchronous operation, so each call blocks one thread. You quickly run out of threads: all of them are waiting for tasks that have no thread left to run on. You can solve this with async by replacing the wait with await Task.WhenAll(...), which frees the thread while waiting. With some CPU workload the multithreaded version is significantly faster. When the work is purely IO bound, the two are roughly equal.
ConcurrentBag<string> result = new ConcurrentBag<string>();
List<string> result2 = new List<string>();
public async Task Scan(string path)
{
await Task.Run(async () =>
{
var subs = Directory.GetDirectories(path);
await Task.WhenAll(subs.Select(s => Scan(s)));
result.Add(Enumerable.Range(0, 1000000).Sum(i => path[i % path.Length]).ToString());
});
}
public void Scan2(string path)
{
result2.Add(Enumerable.Range(0, 1000000).Sum(i => path[i % path.Length]).ToString());
var subs = Directory.GetDirectories(path);
foreach (var s in subs) Scan2(s);
}
private async void button4_Click(object sender, EventArgs e)
{
string dir = @"d:\tmp";
System.Diagnostics.Stopwatch st = new System.Diagnostics.Stopwatch();
st.Start();
await Scan(dir);
st.Stop();
MessageBox.Show(st.ElapsedMilliseconds.ToString());
st = new System.Diagnostics.Stopwatch();
st.Start();
Scan2(dir);
st.Stop();
MessageBox.Show(st.ElapsedMilliseconds.ToString());
MessageBox.Show(result.OrderBy(x => x).SequenceEqual(result2.OrderBy(x => x)) ? "OK" : "ERROR");
}
Related
I am writing a program which scours the entire filesystem of a computer in order to destroy any files which fall within certain parameters. I want the program to run as fast as possible and utilize as many resources as necessary to achieve this (it's worth noting that the user is expected not to be completing any other work while this process is taking place). To that end, I've written a method which takes a target directory, searches all the files in it, then queues up a new task for each child directory. This is currently done by passing the directories' paths into a queue which the main thread monitors and uses to actually initialize the new tasks, as so:
static class DriveHandler
{
internal static readonly List<string> fixedDrives = GetFixedDrives();
private static readonly ConcurrentQueue<string> _targetPathQueue = new ConcurrentQueue<string>();
private static int _threadCounter = 0;
internal static void WipeDrives()
{
foreach (string driveLetter in fixedDrives)
{
Interlocked.Increment(ref _threadCounter);
Task.Run(() => WalkDrive(driveLetter));
}
while (Volatile.Read(ref _threadCounter) > 0 || !_targetPathQueue.IsEmpty)
{
if (_targetPathQueue.TryDequeue(out string path))
{
Interlocked.Increment(ref _threadCounter);
Task.Run(() => WalkDrive(path));
}
}
}
private static void WalkDrive(string directory)
{
foreach (string file in Directory.GetFiles(directory))
{
//If file meets conditions, delete
}
string[] subDirectories = Directory.GetDirectories(directory);
if (subDirectories.Length != 0)
{
foreach (string subDirectory in subDirectories)
{
_targetPathQueue.Enqueue(subDirectory);
}
}
else { } //do other stuff;
Interlocked.Decrement(ref _threadCounter);
}
}
My question is, is it safe/worth it to just initialize the new tasks from within the already running tasks to avoid wasting processor time monitoring the queue? Something that looks like this:
static class DriveHandler
{
internal static readonly List<string> fixedDrives = GetFixedDrives();
private static int _threadCounter = 0;
internal static void WipeDrives()
{
foreach (string driveLetter in fixedDrives)
{
Interlocked.Increment(ref _threadCounter);
Task.Run(() => WalkDrive(driveLetter));
}
while (Volatile.Read(ref _threadCounter) > 0)
{
Thread.Sleep(5000);
}
}
private static void WalkDrive(string directory)
{
foreach (string file in Directory.GetFiles(directory))
{
//If file meets conditions, delete
}
string[] subDirectories = Directory.GetDirectories(directory);
if (subDirectories.Length != 0)
{
foreach (string subDirectory in subDirectories)
{
Interlocked.Increment(ref _threadCounter);
Task.Run(() => WalkDrive(subDirectory));
}
}
else { } //do other stuff;
Interlocked.Decrement(ref _threadCounter);
}
}
Of course, I need every task to die once it's done. Will doing things this way make the old tasks parents of the new ones and keep them alive until all their children have finished?
Many thanks!
First problem:
Task.Run(() => WalkDrive(path));
This is fire and forget, which is not a good fit in this context. Why? Because chances are you have far more files and paths on the hard disk than the machine has CPU and memory capacity for (tasks consume memory as well, not just CPU). With fire and forget, as the name suggests, you keep spawning tasks without ever awaiting them.
My question is, is it safe/worth it to just initialize the new tasks from within the already running tasks to avoid wasting processor time monitoring the queue?
It's valid, and nothing prevents you from doing that, but you are already wasting resources. Why spawn a new task each time? You already have one running, so make it a long-running background task and keep it running. That leaves just two threads: one (I assume) UI/user-facing thread and one doing the work.
All this locking and task spawning is going to hurt your performance and waste resources (CPU plus memory allocations).
If you want to speed things up with parallel execution, you could add the paths to the concurrent queue and run only 10-100 concurrent tasks at most, or whatever limit suits you. That way you at least have an upper bound and control how much the code does in parallel.
While the concurrent queue is not empty and nobody has requested cancellation:
Start from the base path
Get all sub-paths and enqueue them in the concurrent queue
Process the files inside that path
Take the next item available in the queue as the new base path
Start all over again.
You just start the maximum number of concurrent tasks and that's it.
Your main loop is something like:
private async Task StartAsync(CancellationToken cancellationToken)
{
var tasks = new List<Task>();
for (int i = 0; i < MaxConcurrentTasks; i++)
{
tasks.Add(Task.Run(() => ProcessPath(initialPathHere, cancellationToken), cancellationToken));
}
await Task.WhenAll(tasks);
}
And then something along these lines:
private static readonly ConcurrentQueue<string> concurrentQueue = new ConcurrentQueue<string>();
private static void ProcessPath(string path, CancellationToken cancellationToken)
{
while (!cancellationToken.IsCancellationRequested)
{
foreach (var subPath in System.IO.Directory.EnumerateDirectories(path))
{
// Enqueue the subPath into the concurrent queue
concurrentQueue.Enqueue(subPath);
}
// Once finished, process the files in the current path
foreach (var file in System.IO.Directory.EnumerateFiles(path))
{
}
// Take the next path; stop when the queue is empty
if (!concurrentQueue.TryDequeue(out path)) break;
}
}
I haven't checked the syntax, but that's how a good algorithm would do it in my opinion. Also, please keep in mind that when a task finishes its current job, the queue might be empty at this line even though other tasks are still about to enqueue more paths, so modify that code accordingly:
if (!concurrentQueue.TryDequeue(out path)) break;
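One way to handle the "queue is momentarily empty" case is to track how many directories are still pending. The following is only a minimal sketch under stated assumptions: the names BoundedWalker, CountFilesAsync and pending are hypothetical, and it counts files instead of deleting them so that it is safe to run.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedWalker
{
    // Hypothetical helper: walks a directory tree with a fixed number of workers.
    public static async Task<int> CountFilesAsync(
        string root, int maxWorkers, CancellationToken ct = default(CancellationToken))
    {
        var queue = new ConcurrentQueue<string>();
        queue.Enqueue(root);
        int pending = 1;   // directories enqueued but not yet fully processed
        int fileCount = 0;

        var workers = Enumerable.Range(0, maxWorkers).Select(_ => Task.Run(() =>
        {
            var spin = new SpinWait();
            // Workers stop only when no directory is queued or in progress.
            while (Volatile.Read(ref pending) > 0 && !ct.IsCancellationRequested)
            {
                string dir;
                if (!queue.TryDequeue(out dir))
                {
                    spin.SpinOnce(); // queue empty, but another worker may still enqueue
                    continue;
                }
                try
                {
                    foreach (var sub in Directory.EnumerateDirectories(dir))
                    {
                        // Count the child before it becomes visible in the queue.
                        Interlocked.Increment(ref pending);
                        queue.Enqueue(sub);
                    }
                    foreach (var file in Directory.EnumerateFiles(dir))
                        Interlocked.Increment(ref fileCount); // placeholder for real work
                }
                catch (UnauthorizedAccessException) { /* skip inaccessible directories */ }
                catch (IOException) { /* skip directories that vanished or failed */ }
                finally
                {
                    Interlocked.Decrement(ref pending);
                }
            }
        }, ct)).ToArray();

        await Task.WhenAll(workers);
        return fileCount;
    }
}
```

A call such as await BoundedWalker.CountFilesAsync(somePath, 8) keeps the number of workers fixed regardless of how many directories exist. Spinning while idle is a simplification; a blocking collection would avoid the busy wait.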
Final notes:
Consider the trade-offs between tasks and Parallel.Invoke
Consider using BackgroundService, which is intended for long-running work; whether it fits depends on your code and requirements
To work towards a performance gain, remember the golden rule: measure early. Start by measuring and keep some metrics at hand, so that later, when you want to speed things up, you know your baseline. Then you refactor, measure again and compare, and you will know whether you are getting closer to or further from your target.
Make sure you get concurrency/parallel processing right, otherwise it will work against you rather than for you.
I'm new to tasks and have a question regarding the usage. Does the Task.Factory fire for all items in the foreach loop or block at the 'await' basically making the program single threaded? If I am thinking about this correctly, the foreach loop starts all the tasks and the .GetAwaiter().GetResult(); is blocking the main thread until the last task is complete.
Also, I'm just wanting some anonymous tasks to load the data. Would this be a correct implementation? I'm not referring to exception handling as this is simply an example.
To provide clarity, I am loading data into a database from an outside API. This one is using the FRED database. (https://fred.stlouisfed.org/), but I have several I will hit to complete the entire transfer (maybe 200k data points). Once they are done I update the tables, refresh market calculations, etc. Some of it is real time and some of it is End-of-day. I would also like to say, I currently have everything working in docker, but have been working to update the code using tasks to improve execution.
class Program
{
private async Task SQLBulkLoader()
{
foreach (var fileListObj in indicators.file_list)
{
await Task.Factory.StartNew( () =>
{
string json = this.GET(/* API call */);
SeriesObject obj = JsonConvert.DeserializeObject<SeriesObject>(json);
DataTable dataTableConversion = ConvertToDataTable(obj.observations);
dataTableConversion.TableName = fileListObj.series_id;
using (SqlConnection dbConnection = new SqlConnection("SQL Connection"))
{
dbConnection.Open();
using (SqlBulkCopy s = new SqlBulkCopy(dbConnection))
{
s.DestinationTableName = dataTableConversion.TableName;
foreach (var column in dataTableConversion.Columns)
s.ColumnMappings.Add(column.ToString(), column.ToString());
s.WriteToServer(dataTableConversion);
}
Console.WriteLine("File: {0} Complete", fileListObj.series_id);
}
});
}
}
static void Main(string[] args)
{
Program worker = new Program();
worker.SQLBulkLoader().GetAwaiter().GetResult();
}
}
Awaiting the task returned from Task.Factory.StartNew inside the loop does make it effectively single threaded. You can see a simple demonstration of this with this short LINQPad example:
for (var i = 0; i < 3; i++)
{
var index = i;
$"{index} inline".Dump();
await Task.Run(() =>
{
Thread.Sleep((3 - index) * 1000);
$"{index} in thread".Dump();
});
}
Here we wait less as we progress through the loop. The output is:
0 inline
0 in thread
1 inline
1 in thread
2 inline
2 in thread
If you remove the await in front of StartNew you'll see it runs in parallel. As others have mentioned, you can certainly use Parallel.ForEach, but for a demonstration of doing it a bit more manually, you can consider a solution like this:
var tasks = new List<Task>();
for (var i = 0; i < 3; i++)
{
var index = i;
$"{index} inline".Dump();
tasks.Add(Task.Factory.StartNew(() =>
{
Thread.Sleep((3 - index) * 1000);
$"{index} in thread".Dump();
}));
}
Task.WaitAll(tasks.ToArray());
Notice now how the result is:
0 inline
1 inline
2 inline
2 in thread
1 in thread
0 in thread
This is a typical problem that C# 8.0 Async Streams are going to solve very soon.
Until C# 8.0 is released, you can use the AsyncEnumerator library:
using System.Collections.Async;
class Program
{
private async Task SQLBulkLoader() {
await indicators.file_list.ParallelForEachAsync(async fileListObj =>
{
...
await s.WriteToServerAsync(dataTableConversion);
...
},
maxDegreeOfParalellism: 3,
cancellationToken: default);
}
static void Main(string[] args)
{
Program worker = new Program();
worker.SQLBulkLoader().GetAwaiter().GetResult();
}
}
I do not recommend using Parallel.ForEach and Task.WhenAll as those functions are not designed for asynchronous streaming.
You'll want to add each task to a collection and then use Task.WhenAll to await all of the tasks in that collection:
private async Task SQLBulkLoader()
{
var tasks = new List<Task>();
foreach (var fileListObj in indicators.file_list)
{
tasks.Add(Task.Factory.StartNew(() => { /* Doing stuff */ }));
}
await Task.WhenAll(tasks.ToArray());
}
My take on this: the most time-consuming operations will be getting the data with a GET operation and the actual call to WriteToServer using SqlBulkCopy. If you take a look at that class you will see that there is a native async method, WriteToServerAsync (docs here). Always use those before creating Tasks yourself using Task.Run.
The same applies to the http GET call. You can use the native HttpClient.GetAsync (docs here) for that.
Doing that you can rewrite your code to this:
private async Task ProcessFileAsync(string series_id)
{
string json = await GetAsync();
SeriesObject obj = JsonConvert.DeserializeObject<SeriesObject>(json);
DataTable dataTableConversion = ConvertToDataTable(obj.observations);
dataTableConversion.TableName = series_id;
using (SqlConnection dbConnection = new SqlConnection("SQL Connection"))
{
dbConnection.Open();
using (SqlBulkCopy s = new SqlBulkCopy(dbConnection))
{
s.DestinationTableName = dataTableConversion.TableName;
foreach (var column in dataTableConversion.Columns)
s.ColumnMappings.Add(column.ToString(), column.ToString());
await s.WriteToServerAsync(dataTableConversion);
}
Console.WriteLine("File: {0} Complete", series_id);
}
}
private async Task SQLBulkLoaderAsync()
{
var tasks = indicators.file_list.Select(f => ProcessFileAsync(f.series_id));
await Task.WhenAll(tasks);
}
Both operations (the HTTP call and the SQL Server call) are I/O calls. With the native async/await pattern there won't even be a thread created or used; see this question for a more in-depth explanation. That is why for IO-bound operations you should never have to use Task.Run (or Task.Factory.StartNew, but do mind that Task.Run is the recommended approach of the two).
Sidenote: if you are using HttpClient in a loop, please read this about how to correctly use it.
If you need to limit the number of parallel actions you could also use TPL Dataflow, as it plays very nicely with Task-based IO-bound operations. SQLBulkLoaderAsync should then be modified as follows (leaving the ProcessFileAsync method from earlier in this answer intact):
private async Task SQLBulkLoaderAsync()
{
var ab = new ActionBlock<string>(ProcessFileAsync, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 });
foreach (var file in indicators.file_list)
{
ab.Post(file.series_id);
}
ab.Complete();
await ab.Completion;
}
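If pulling in TPL Dataflow is not an option, a similar limit can be approximated with SemaphoreSlim from the base class library. This is a different technique from the ActionBlock above, shown only as a hedged sketch; Throttle and ForEachAsync are hypothetical names, and body stands in for something like ProcessFileAsync.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class Throttle
{
    // Hypothetical helper: runs body for every item, with at most
    // maxConcurrency invocations in flight at any time.
    public static async Task ForEachAsync<T>(
        IEnumerable<T> items, int maxConcurrency, Func<T, Task> body)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = items.Select(async item =>
            {
                await gate.WaitAsync();   // wait asynchronously for a free slot
                try
                {
                    await body(item);
                }
                finally
                {
                    gate.Release();       // free the slot even if body throws
                }
            }).ToList();                  // ToList starts all the wrappers now

            await Task.WhenAll(tasks);
        }
    }
}
```

With the names from this answer, the call might look like await Throttle.ForEachAsync(indicators.file_list, 5, f => ProcessFileAsync(f.series_id)). No thread is blocked while waiting for a slot, which is the same property the ActionBlock version has.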
Use a Parallel.ForEach loop to enable data parallelism over any System.Collections.Generic.IEnumerable<T> source.
// Method signature: Parallel.ForEach(IEnumerable<TSource> source, Action<TSource> body)
Parallel.ForEach(fileList, (currentFile) =>
{
//Doing Stuff
Console.WriteLine("Processing {0} on thread {1}", currentFile, Thread.CurrentThread.ManagedThreadId);
});
Why didn't you try this? :) This program will not start parallel tasks (in the foreach). It will block, but the logic in the task will run on a separate thread from the thread pool (only one at a time, while the main thread is blocked).
The proper approach in your situation is to use Parallel.ForEach.
How can I convert this foreach code to Parallel.ForEach?
var finalList = new List<string>();
var list = new List<int> {1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ................. 999999};
var init = 0;
var limitPerThread = 5;
var countDownEvent = new CountdownEvent(list.Count);
for (var i = 0; i < list.Count; i++)
{
var listToFilter = list.Skip(init).Take(limitPerThread).ToList();
new Thread(delegate()
{
Foo(listToFilter);
countDownEvent.Signal();
}).Start();
init += limitPerThread;
}
//wait all to finish
countDownEvent.Wait();
private static void Foo(List<int> listToFilter)
{
var listDone = Boo(listToFilter);
lock (Object)
{
finalList.AddRange(listDone);
}
}
This doesn't:
var taskList = new List<Task>();
for (var i = 0; i < list.Count; i++)
{
var listToFilter = list.Skip(init).Take(limitPerThread).ToList();
var task = Task.Factory.StartNew(() => Foo(listToFilter));
taskList.Add(task);
init += limitPerThread;
}
//wait all to finish
Task.WaitAll(taskList.ToArray());
This process must create at least 700 threads in the end. When I run it using Thread, it works and creates all of them. But with Task it doesn't. It seems like it's not starting multiple Tasks async.
I really want to know why... any ideas?
EDIT
Another version with PLINQ (as suggested).
var taskList = new List<Task>(list.Count);
Parallel.ForEach(taskList, t =>
{
var listToFilter = list.Skip(init).Take(limitPerThread).ToList();
Foo(listToFilter);
init += limitPerThread;
t.Start();
});
Task.WaitAll(taskList.ToArray());
EDIT2:
public static List<Communication> Foo(List<Dispositive> listToPing)
{
var listResult = new List<Communication>();
foreach (var item in listToPing)
{
var listIps = item.listIps;
var communication = new Communication
{
IdDispositive = item.Id
};
try
{
for (var i = 0; i < listIps.Count(); i++)
{
var oPing = new Ping().Send(listIps.ElementAt(i).IpAddress, 10000);
if (oPing != null)
{
if (oPing.Status.Equals(IPStatus.TimedOut) && listIps.Count() > i+1)
continue;
if (oPing.Status.Equals(IPStatus.TimedOut))
{
communication.Result = "NOK";
break;
}
communication.Result = oPing.Status.Equals(IPStatus.Success) ? "OK" : "NOK";
break;
}
if (listIps.Count() > i+1)
continue;
communication.Result = "NOK";
break;
}
}
catch
{
communication.Result = "NOK";
}
finally
{
listResult.Add(communication);
}
}
return listResult;
}
Tasks are NOT multithreading. They can be used for that, but mostly they're actually used for the opposite - multiplexing on a single thread.
To use tasks for multithreading, I suggest using Parallel LINQ. It has many optimizations built in already, such as intelligent partitioning of your lists and spawning only as many threads as there are CPU cores.
To understand Task and async, think of it this way - a typical workload often includes IO that needs to be waited upon. Maybe you read a file, or query a webservice, or access a database, or whatever. The point is - your thread gets to wait a loooong time (in CPU cycles at least) until you get a response from some faraway destination.
In the Olden Days™ that meant that your thread was getting locked down (suspended) until that response came. If you wanted to do something else in the meantime, you needed to spawn a new thread. That's doable, but not too efficient. Each OS thread carries a significant overhead (memory, kernel resources) with it. And you could end up with several threads actively burning the CPU, which means that the OS needs to switch between them so that each gets a bit of CPU time and these "context switches" are pretty expensive.
async changes that workflow. Now you can have multiple workloads executing on the same thread. While one piece of work is awaiting the result from a faraway source, another can step in and use that thread to do something else useful. When that second workload gets to its own await, the first can awaken and continue.
After all, it doesn't make sense to spawn more threads than there are CPU cores. You're not going to get more work done that way. Just the opposite - more time will be spent on switching the threads and less time will be available for useful work.
That is what Task/async/await was originally designed for. However, Parallel LINQ has also taken advantage of it and reused it for multithreading. In that case you can look at it this way: the other threads are the "faraway destination" that your main thread is waiting on.
Tasks are executed on the Thread Pool. This means that a handful of threads will serve a large number of tasks. You have multi-threading, but not a thread for every task spawned.
You should use tasks. You should aim to use as many threads as your CPU has cores. Generally, the thread pool does this for you.
How did you measure the performance? Do you think that 700 threads will work faster than 700 tasks executed by 4 threads? No, they will not.
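To see that many tasks share a few pool threads, one can record which managed thread each continuation runs on. This is an illustrative sketch only; TaskVsThreadDemo and CountThreadsUsedAsync are hypothetical names, and the exact number returned depends on the machine and on the thread pool's behaviour at the time.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

public static class TaskVsThreadDemo
{
    // Starts taskCount tasks that each wait asynchronously, then reports
    // how many distinct threads ran their continuations.
    public static async Task<int> CountThreadsUsedAsync(int taskCount)
    {
        var threadIds = new ConcurrentDictionary<int, byte>();

        var tasks = Enumerable.Range(0, taskCount).Select(async _ =>
        {
            await Task.Delay(100); // simulated IO wait; no thread is held here
            threadIds.TryAdd(Environment.CurrentManagedThreadId, 0);
        }).ToArray();

        await Task.WhenAll(tasks);
        return threadIds.Count;
    }
}
```

In a console application, calling CountThreadsUsedAsync(700) typically reports far fewer than 700 threads, because the pool reuses its threads across tasks. In an application with a synchronization context (for example a UI thread), continuations may all run on that one thread.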
It seems like it's not starting multiple Tasks async
How did you come to this conclusion? As others suggested in the comments and in other answers, you probably need to remove the thread creation: after creating 700 threads you'll degrade your system performance, because the threads will fight each other for processor time without getting any work done faster.
So, you need to add async/await for your IO operations in the Foo method, using the SendPingAsync version. Also, your method can be simplified, as the many checks of the listIps.Count() > i + 1 condition are useless - the for loop condition already does that:
public static async Task<List<Communication>> Foo(List<Dispositive> listToPing)
{
var listResult = new List<Communication>();
foreach (var item in listToPing)
{
var listIps = item.listIps;
var communication = new Communication
{
IdDispositive = item.Id
};
try
{
var ping = new Ping();
communication.Result = "NOK";
for (var i = 0; i < listIps.Count(); i++)
{
var oPing = await ping.SendPingAsync(listIps.ElementAt(i).IpAddress, 10000);
if (oPing != null)
{
if (oPing.Status.Equals(IPStatus.Success))
{
communication.Result = "OK";
break;
}
}
}
}
catch
{
communication.Result = "NOK";
}
finally
{
listResult.Add(communication);
}
}
return listResult;
}
Another problem with your code is that the PLINQ version isn't thread-safe:
init += limitPerThread;
This can fail when executed in parallel. You may introduce a helper method, like in this answer:
// Assumes the addresses are strings (host names or IPs); adjust the element type if yours differs.
private static async Task<List<PingReply>> PingAsync(IEnumerable<string> addresses)
{
// A Ping instance supports only one asynchronous call at a time, so use one per request.
var tasks = addresses.Select(address => new Ping().SendPingAsync(address, 10000));
var results = await Task.WhenAll(tasks);
return results.ToList();
}
And do this kind of check (try/catch logic removed for simplicity):
public static async Task<List<Communication>> Foo(List<Dispositive> listToPing)
{
var listResult = new List<Communication>();
foreach (var item in listToPing)
{
var listIps = item.listIps;
var communication = new Communication
{
IdDispositive = item.Id
};
var check = await PingAsync(listIps.Select(ip => ip.IpAddress));
communication.Result = check.Any(p => p.Status.Equals(IPStatus.Success)) ? "OK" : "NOK";
listResult.Add(communication);
}
return listResult;
}
And you should probably use Task.Run instead of Task.Factory.StartNew to be sure that you aren't blocking the UI thread.
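The non-thread-safe init += limitPerThread noted in this answer can be avoided by computing every chunk up front, so the parallel body shares no mutable counter. A minimal sketch, assuming hypothetical names Chunker, Split and ProcessInParallel, where work plays the role of the question's Foo:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class Chunker
{
    // Splits source into consecutive chunks of at most chunkSize items.
    public static List<List<T>> Split<T>(IReadOnlyList<T> source, int chunkSize)
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));
        var chunks = new List<List<T>>();
        for (int start = 0; start < source.Count; start += chunkSize)
        {
            int end = Math.Min(start + chunkSize, source.Count);
            var chunk = new List<T>(end - start);
            for (int i = start; i < end; i++) chunk.Add(source[i]);
            chunks.Add(chunk);
        }
        return chunks;
    }

    // Processes each chunk in parallel and gathers results in a thread-safe bag.
    public static List<TResult> ProcessInParallel<T, TResult>(
        IReadOnlyList<T> source, int chunkSize, Func<List<T>, IEnumerable<TResult>> work)
    {
        var results = new ConcurrentBag<TResult>();
        Parallel.ForEach(Split(source, chunkSize), chunk =>
        {
            foreach (var r in work(chunk)) results.Add(r);
        });
        return results.ToList();
    }
}
```

Because the chunks are built before the parallel loop starts, there is exactly one chunk per limitPerThread items and nothing to lock except the result collection, which ConcurrentBag handles. The order of the results is not guaranteed.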
The app parses files within some directory while new files are being added to the directory. I use a ConcurrentQueue and tried to split the work across the number of cores. So if there are files to process, it should process up to 4 (cores) files concurrently.
Yet the app runs out of memory within seconds, after processing 10-30 files. I see the memory consumption grow to ~1.5 GB quickly, then the OOM error appears.
I'm new to the task scheduler, so I'm probably doing something wrong.
File parsing is done by running some .exe on the file, which uses <5 MB of RAM.
The task scheduler runs every time the timer elapses. But it runs out of memory even before the timer elapses for the 2nd time.
private void OnTimedEvent(object source, ElapsedEventArgs e)
{
DirectoryInfo info = new DirectoryInfo(AssemblyDirectory);
FileInfo[] allSrcFiles = info.GetFiles("*.dat").OrderBy(p => p.CreationTime).ToArray();
var validSrcFiles = allSrcFiles.Where(p => (DateTime.Now - p.CreationTime) > TimeSpan.FromSeconds(60));
var newFilesToParse = validSrcFiles.Where(f => !ProcessedFiles.Contains(f.Name));
if (newFilesToParse.Any()) Console.WriteLine("Adding " + newFilesToParse.Count() + " files to the Queue");
foreach (var file in newFilesToParse)
{
FilesToParseQueue.Enqueue(file);
ProcessedFiles.Add(file.Name);
}
if (!busy)
{
if (FilesToParseQueue.Any())
{
busy = true;
Console.WriteLine("");
Console.WriteLine("There are " + FilesToParseQueue.Count + " files in queue. Processing...");
}
var scheduler = new LimitedConcurrencyLevelTaskScheduler(coresCount); //4
TaskFactory factory = new TaskFactory(scheduler);
while (FilesToParseQueue.Any())
{
factory.StartNew(() =>
{
FileInfo file;
if (FilesToParseQueue.TryDequeue(out file))
{
//Dequeue();
ParseFile(file);
}
});
}
if (!FilesToParseQueue.Any())
{
busy = false;
Console.WriteLine("Finished processing Files in the Queue. Waiting for new files...");
}
}
}
Your code keeps on creating new Tasks as long as there are files to process, and it does so much faster than the files can be processed. It has no other limit (like the number of files in the directory), which is why it quickly runs out of memory.
A simple fix would be to move the dequeuing out of the task and into the loop:
while (true)
{
FileInfo file;
if (FilesToParseQueue.TryDequeue(out file))
{
factory.StartNew(() => ParseFile(file));
}
else
{
break;
}
}
You would get even better performance if you created just one Task per core and processed the files using a loop inside those Tasks.
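The "one Task per core, each looping over the queue" idea could look roughly like this. It is a sketch under assumptions: QueueDrainer and DrainAsync are hypothetical names, and process stands in for ParseFile.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

public static class QueueDrainer
{
    // Starts workerCount tasks; each keeps dequeuing until the queue is empty.
    public static Task DrainAsync<T>(
        ConcurrentQueue<T> queue, int workerCount, Action<T> process)
    {
        var workers = Enumerable.Range(0, workerCount).Select(_ => Task.Run(() =>
        {
            T item;
            while (queue.TryDequeue(out item))
            {
                process(item);
            }
        }));

        // WhenAll enumerates the sequence immediately, which starts the workers.
        return Task.WhenAll(workers);
    }
}
```

A call such as QueueDrainer.DrainAsync(FilesToParseQueue, Environment.ProcessorCount, ParseFile) creates a fixed number of tasks no matter how many files are queued. Note that a worker exits as soon as it sees an empty queue, so items enqueued later need another call.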
This kind of problem (where you queue multiple units of work and want them processed in parallel) is a perfect fit for TPL Dataflow:
private async void OnTimedEvent(object source, ElapsedEventArgs e)
{
DirectoryInfo info = new DirectoryInfo(AssemblyDirectory);
FileInfo[] allSrcFiles = info.GetFiles("*.dat").OrderBy(p => p.CreationTime).ToArray();
var validSrcFiles = allSrcFiles.Where(p => (DateTime.Now - p.CreationTime) > TimeSpan.FromSeconds(60));
var newFilesToParse = validSrcFiles.Where(f => !ProcessedFiles.Contains(f.Name));
if (newFilesToParse.Any()) Console.WriteLine("Adding " + newFilesToParse.Count() + " files to the Queue");
var blockOptions = new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = coresCount,
};
var block = new ActionBlock<FileInfo>(ParseFile, blockOptions);
var filesToParseCount = 0;
foreach (var file in newFilesToParse)
{
block.Post(file);
ProcessedFiles.Add(file.Name);
++filesToParseCount;
}
Console.WriteLine("There are " + filesToParseCount + " files in queue. Processing...");
block.Complete();
await block.Completion;
Console.WriteLine("Finished processing Files in the Queue. Waiting for new files...");
}
Basic solution
You can actually fix your code by stripping it down to the bare essentials like so:
// This is technically a misnomer. It should be
// called "FileNamesQueuedForProcessing" or similar.
// Non-thread-safe. Assuming timer callback access only.
private readonly HashSet<string> ProcessedFiles = new HashSet<string>();
private readonly LimitedConcurrencyLevelTaskScheduler LimitedConcurrencyScheduler = new LimitedConcurrencyLevelTaskScheduler(Environment.ProcessorCount);
private void OnTimedEvent(object source, ElapsedEventArgs e)
{
DirectoryInfo info = new DirectoryInfo(AssemblyDirectory);
// Slightly rewritten to cut down on allocations.
FileInfo[] newFilesToParse = info
.GetFiles("*.dat")
.Where(f =>
(DateTime.Now - f.CreationTime) > TimeSpan.FromSeconds(60) && // I'd consider removing this filter.
!ProcessedFiles.Contains(f.Name))
.OrderBy(p => p.CreationTime)
.ToArray();
if (newFilesToParse.Length != 0) Console.WriteLine("Adding " + newFilesToParse.Count() + " files to the Queue");
foreach (FileInfo file in newFilesToParse)
{
// Fire and forget.
// You can add the resulting task to a shared thread-safe collection
// if you want to observe completion/exceptions/cancellations.
Task.Factory.StartNew(
() => ParseFile(file)
, CancellationToken.None
, TaskCreationOptions.DenyChildAttach
, LimitedConcurrencyScheduler
);
ProcessedFiles.Add(file.Name);
}
}
Note how I am not doing any kind of load balancing on my own, instead relying on LimitedConcurrencyLevelTaskScheduler to perform as advertised - that is, accept all work items immediately on Task.Factory.StartNew, queue them internally and process them at some point in the future on up to [N = max degree of parallelism] thread pool threads.
P.S. I'm assuming that OnTimedEvent will always fire on the same thread. If not, a small change will be necessary to ensure thread safety:
private void OnTimedEvent(object source, ElapsedEventArgs e)
{
lock (ProcessedFiles)
{
// As above.
}
}
Alternative solution
Now, here's a slightly more novel approach: how about we get rid of the timer and LimitedConcurrencyLevelTaskScheduler and encapsulate all of the processing in a single, modular pipeline? There will be a lot of blocking code (unless you break out TPL Dataflow - but I'll stick with Base Class Library types here), but the messaging between stages is so easy it makes for a really appealing design (in my opinion of course).
private async Task PipelineAsync()
{
const int MAX_FILES_TO_BE_QUEUED = 16;
using (BlockingCollection<FileInfo> queue = new BlockingCollection<FileInfo>(boundedCapacity: MAX_FILES_TO_BE_QUEUED))
{
Task producer = Task.Run(async () =>
{
// Declared outside the loop so that it remembers files queued on earlier passes.
HashSet<string> namesOfFilesQeueuedForProcessing = new HashSet<string>();
try
{
while (true)
{
DirectoryInfo info = new DirectoryInfo(AssemblyDirectory);
FileInfo[] newFilesToParse = info
.GetFiles("*.dat")
.Where(f =>
(DateTime.Now - f.CreationTime) > TimeSpan.FromSeconds(60) &&
!namesOfFilesQeueuedForProcessing.Contains(f.Name))
.OrderBy(p => p.CreationTime) // Processing order is not guaranteed.
.ToArray();
foreach (FileInfo file in newFilesToParse)
{
// This will block if we reach bounded capacity thereby throttling
// the producer (meaning we'll never overflow the handover collection).
queue.Add(file);
namesOfFilesQeueuedForProcessing.Add(file.Name);
}
await Task.Delay(TimeSpan.FromSeconds(60)).ConfigureAwait(false);
}
}
finally
{
// Exception? Cancellation? We'll let the
// consumer know that it can wind down.
queue.CompleteAdding();
}
});
Task consumer = Task.Run(() =>
{
ParallelOptions options = new ParallelOptions {
MaxDegreeOfParallelism = Environment.ProcessorCount
};
Parallel.ForEach(queue.GetConsumingEnumerable(), options, file => ParseFile(file));
});
await Task.WhenAll(producer, consumer).ConfigureAwait(false);
}
}
This pattern in its general form is described in Stephen Toub's "Patterns of Parallel Programming", page 55. I highly recommend having a look.
The trade-off here is the amount of blocking that you'll be doing due to using BlockingCollection<T> and Parallel.ForEach. The benefits of the pipeline as a concept are numerous though: new stages (Task instances) are easy to add, completion and cancellation easy to wire in, both producer and consumer exceptions are observed, and all the mutable state is delightfully local.
This class is designed to take a list of URLs, scan them, then return a list of those which do not work. It uses multiple threads to avoid taking forever on long lists.
My problem is that even if I replace the actual scanning of URLs with a test function which returns failure for all URLs, the class returns a varying number of failures.
I'm assuming my problem lies with either ConcurrentStack.TryPop() or .Push(), but I can't for the life of me figure out why. They are supposedly thread safe, and I've tried locking as well, with no effect.
Is anyone able to explain what I am doing wrong? I don't have a lot of experience with multiple threads.
public class UrlValidator
{
private const int MAX_THREADS = 10;
private List<Thread> threads = new List<Thread>();
private ConcurrentStack<string> errors = new ConcurrentStack<string>();
private ConcurrentStack<string> queue = new ConcurrentStack<string>();
public UrlValidator(List<string> urls)
{
queue.PushRange(urls.ToArray<string>());
}
public List<string> Start()
{
threads = new List<Thread>();
while (threads.Count < MAX_THREADS && queue.Count > 0)
{
var t = new Thread(new ThreadStart(UrlWorker));
threads.Add(t);
t.Start();
}
while (queue.Count > 0) Thread.Sleep(1000);
int runningThreads = 0;
while (runningThreads > 0)
{
runningThreads = 0;
foreach (Thread t in threads) if (t.ThreadState == ThreadState.Running) runningThreads++;
Thread.Sleep(100);
}
return errors.ToList<string>();
}
private void UrlWorker()
{
while (queue.Count > 0)
{
try
{
string url = "";
if (!queue.TryPop(out url)) continue;
if (TestFunc(url) != 200) errors.Push(url);
}
catch
{
break;
}
}
}
private int TestFunc(string url)
{
Thread.Sleep(new Random().Next(100));
return -1;
}
}
This is something that the Task Parallel Library and PLINQ (Parallel LINQ) would be really good at. Check out an example of how much easier things will be if you let .NET do its thing:
public IEnumerable<string> ProcessURLs(IEnumerable<string> URLs)
{
return URLs.AsParallel()
.WithDegreeOfParallelism(10)
.Where(url => testURL(url));
}
private bool testURL(string URL)
{
// some logic to determine true/false
return false;
}
Whenever possible, you should let the libraries .NET provides do any thread management needed. The TPL is great for this in general, but since you're simply transforming a single collection of items, PLINQ is well suited for this. You can modify the degree of parallelism (I would recommend setting it lower than your maximum number of concurrent TCP connections), and you can add multiple conditions just as LINQ allows. It runs in parallel automatically and requires no thread management from you.
Your problem has nothing to do with ConcurrentStack, but rather with the loop where you are checking for running threads:
int runningThreads = 0;
while (runningThreads > 0)
{
...
}
The condition is immediately false, so you never actually wait for threads. In turn, this means that errors will contain errors from whichever threads have run so far.
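If you do keep manual threads, the wait can be made reliable with Thread.Join instead of polling ThreadState. The following is a hedged sketch, not the asker's class: JoinDemo and RunWorkers are hypothetical names, test plays the role of TestFunc, and a ConcurrentQueue replaces the ConcurrentStack purely for illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

public static class JoinDemo
{
    // Tests every URL on up to maxThreads threads and returns the failures.
    public static List<string> RunWorkers(
        IEnumerable<string> urls, int maxThreads, Func<string, int> test)
    {
        var queue = new ConcurrentQueue<string>(urls);
        var errors = new ConcurrentBag<string>();
        var threads = new List<Thread>();

        for (int i = 0; i < maxThreads; i++)
        {
            var t = new Thread(() =>
            {
                string url;
                // TryDequeue is the only check needed; no separate Count test.
                while (queue.TryDequeue(out url))
                {
                    if (test(url) != 200) errors.Add(url);
                }
            });
            threads.Add(t);
            t.Start();
        }

        // Join blocks until each worker has really finished.
        foreach (var t in threads) t.Join();
        return errors.ToList();
    }
}
```

Because every thread is joined before the result is read, the returned list contains the failures from all workers rather than only those that happened to finish early.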
Your code has other issues as well, but creating threads manually is probably the greatest one. Since you are using .NET 4.0, you should use tasks or PLINQ for asynchronous processing. Using PLINQ, your validation can be implemented as:
public IEnumerable<string> Validate(IEnumerable<string> urls)
{
return urls.AsParallel().Where(url => TestFunc(url) != 200);
}