Task.WhenAll recursion inconsistent results compared to for loop - c#

I have a method where I need recursion to get a hierarchy of files and folders from an API (graph). When I do my recursion inside of a for loop it works as expected and returns a hierarchy with 665 files. This takes about a minute though because it only fetches one folder at a time, whereas doing it with Task.WhenAll only takes 10 seconds.
When using Task.WhenAll I get inconsistent results, though: it returns the hierarchy with anywhere from 661 to 665 files depending on the run, with the exact same code. I'm using the variable totalFileCount as an indication of how many files it has found.
Obviously I'm doing something wrong, but I can't quite figure out what. Any help is greatly appreciated!
For loop
for (int i = 0; i < folders.Count; i++)
{
var folder = folders[i];
await GetSharePointHierarchy(folder, folderItem, $"{outPath}{folder.Name}\\");
}
Task.WhenAll
var tasks = new List<Task>();
for (int i = 0; i < folders.Count; i++)
{
var folder = folders[i];
var task = GetSharePointHierarchy(folder, folderItem, $"{outPath}{folder.Name}\\");
tasks.Add(task);
}
await Task.WhenAll(tasks);
Full method
public async Task<GraphFolderItem> GetSharePointHierarchy(DriveItem currentDrive, GraphFolderItem parentFolderItem, string outPath = "")
{
IEnumerable<DriveItem> children = await graphHandler.GetFolderChildren(sourceSharepointId, currentDrive.Id);
var folders = new List<DriveItem>();
var files = new List<DriveItem>();
var graphFolderItems = new List<GraphFolderItem>();
foreach (var item in children)
{
if (item.Folder != null)
{
System.IO.Directory.CreateDirectory(outPath + item.Name);
//Console.WriteLine(outPath + item.Name);
folders.Add(item);
}
else
{
totalFileCount++;
files.Add(item);
}
}
var folderItem = new GraphFolderItem
{
SourceFolder = currentDrive,
ItemChildren = files,
FolderChildren = graphFolderItems,
DownloadPath = outPath
};
parentFolderItem.FolderChildren.Add(folderItem);
for (int i = 0; i < folders.Count; i++)
{
var folder = folders[i];
await GetSharePointHierarchy(folder, folderItem, $"{outPath}{folder.Name}\\");
}
return parentFolderItem;
}

This is a race condition. Code that runs in parallel should not update ordinary shared variables without synchronization. Use a thread-safe mechanism that fits your requirement, such as thread-safe data types and collections, lock, Monitor, or Interlocked.
In this case, Interlocked.Increment is a good approach. Replace the line that increments totalFileCount with:
Interlocked.Increment(ref totalFileCount);
For more background, see the following links:
Thread Safe concept in details or Thread-safety
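To illustrate the lost-update problem described above, here is a minimal sketch that is independent of the Graph code. The class and method names (CounterDemo, CountWithInterlocked, CountWithPlainIncrement) are made up for this example. It compares a plain ++ on a shared counter with Interlocked.Increment under several concurrent tasks.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class CounterDemo
{
    // Each worker increments the shared counter atomically, so no update is lost.
    public static int CountWithInterlocked(int workers, int perWorker)
    {
        int total = 0;
        var tasks = new Task[workers];
        for (int w = 0; w < workers; w++)
        {
            tasks[w] = Task.Run(() =>
            {
                for (int i = 0; i < perWorker; i++)
                    Interlocked.Increment(ref total);
            });
        }
        Task.WaitAll(tasks);
        return total;
    }

    // Same work with a plain ++, which is a read-modify-write and can lose updates
    // when two threads interleave. The result may be lower than workers * perWorker.
    public static int CountWithPlainIncrement(int workers, int perWorker)
    {
        int total = 0;
        var tasks = new Task[workers];
        for (int w = 0; w < workers; w++)
        {
            tasks[w] = Task.Run(() =>
            {
                for (int i = 0; i < perWorker; i++)
                    total++;
            });
        }
        Task.WaitAll(tasks);
        return total;
    }

    public static void Main()
    {
        Console.WriteLine("Interlocked: " + CountWithInterlocked(8, 100000));
        Console.WriteLine("Plain ++:    " + CountWithPlainIncrement(8, 100000));
    }
}
```

The Interlocked version always reaches workers * perWorker. The plain version can never exceed that value and will often fall short on a multi-core machine, which matches the "661 to 665" symptom in the question.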

It seems the problem is that with Task.WhenAll the code runs in parallel, whereas when you await each call to the async function in turn, the code does not actually run in parallel.
That is exactly your problem: the code inside the async function accesses a shared object in memory, totalFileCount.
This causes multiple threads to access the same object at the same time.
To fix it and still execute the code in parallel, surround the access to totalFileCount with a lock statement, which allows only one thread at a time to execute the block of code:
lock(lockRefObject)
{
totalFileCount++;
}
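An alternative to locking a shared counter is to avoid shared state altogether: each recursive call returns the count for its own subtree, and the caller sums the results from Task.WhenAll. The sketch below is not the asker's Graph code. FolderNode and CountFilesAsync are hypothetical names, and Task.Delay stands in for the API call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical in-memory stand-in for a folder returned by the API.
public sealed class FolderNode
{
    public int FileCount;                                     // files directly in this folder
    public List<FolderNode> Children = new List<FolderNode>(); // subfolders
}

public static class HierarchyDemo
{
    // Counts files in a folder and all of its subfolders.
    // Subfolders are processed concurrently; no variable is shared between calls.
    public static async Task<int> CountFilesAsync(FolderNode folder)
    {
        await Task.Delay(1); // simulated network round trip

        int[] childCounts = await Task.WhenAll(
            folder.Children.Select(child => CountFilesAsync(child)));

        return folder.FileCount + childCounts.Sum();
    }

    public static void Main()
    {
        var root = new FolderNode { FileCount = 2 };
        root.Children.Add(new FolderNode { FileCount = 3 });
        Console.WriteLine(CountFilesAsync(root).GetAwaiter().GetResult());
    }
}
```

The same idea applies to building the hierarchy itself: List<T>.Add is not thread-safe either, so having each call return its own subtree and letting the parent attach the results after Task.WhenAll avoids concurrent writes to a shared list.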

Related

Running groups of groups of Tasks in a For loop

I have a set of 100 Tasks that need to run, in any order. Putting them all into a Task.WhenAll() tends to overload the back end, which I do not control.
I'd like to run n-number tasks at a time, after each completes, then run the next set. I wrote this code, but the "Console(Running..." is printed to the screen all after the tasks are run making me think all the Tasks are being run.
How can I force the system to really "wait" for each group of Tasks?
//Run some X at a time
int howManytoRunAtATimeSoWeDontOverload = 4;
for(int i = 0; i < tasks.Count; i++)
{
var startIndex = howManytoRunAtATimeSoWeDontOverload * i;
Console.WriteLine($"Running {startIndex} to {startIndex+ howManytoRunAtATimeSoWeDontOverload}");
var toDo = tasks.Skip(startIndex).Take(howManytoRunAtATimeSoWeDontOverload).ToArray();
if (toDo.Length == 0) break;
await Task.WhenAll(toDo);
}
Screen Output:
There are a lot of ways to do this, but I would probably use a library or framework that provides a higher-level abstraction, like TPL Dataflow: https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/dataflow-task-parallel-library (if you're using .NET Core there's a newer library).
This makes it a lot easier than building your own buffering mechanisms. Below is a very simple example but you can configure it differently and do a lot more with this library. In the example below I don't batch them but I make sure no more than 10 tasks are processed at the same time.
var buffer = new ActionBlock<Task>(async t =>
{
await t;
}, new ExecutionDataflowBlockOptions { BoundedCapacity = 10, MaxDegreeOfParallelism = 1 });
foreach (var t in tasks)
{
await buffer.SendAsync(DummyFunctionAsync(t));
}
buffer.Complete();
await buffer.Completion;
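If pulling in a library is not an option, a SemaphoreSlim can serve as a simple throttle. The sketch below is an assumption-laden alternative, not the approach from the answer above; Throttle and RunThrottledAsync are made-up names. Note that it takes delegates (Func<Task>) rather than Task objects: a Task returned by an async method is normally already running, so awaiting such tasks in groups would not delay their start, which may be what the question observed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class Throttle
{
    // Starts every work item, but lets at most maxConcurrency of them run at the same time.
    public static async Task RunThrottledAsync(IEnumerable<Func<Task>> work, int maxConcurrency)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var running = work.Select(async item =>
            {
                await gate.WaitAsync();      // wait for a free slot
                try
                {
                    await item();            // the work only starts here
                }
                finally
                {
                    gate.Release();          // free the slot for the next item
                }
            }).ToList();

            await Task.WhenAll(running);
        }
    }

    public static void Main()
    {
        var work = Enumerable.Range(0, 10).Select(n => (Func<Task>)(async () =>
        {
            Console.WriteLine("Running " + n);
            await Task.Delay(50);
        }));
        RunThrottledAsync(work, 4).GetAwaiter().GetResult();
    }
}
```

Unlike fixed batches, this starts the next item as soon as any slot frees up, so one slow call does not hold back a whole group.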

C# Scan tree recursively with multiple threads

I'm scanning a directory for items. I've just read the question Multithreaded Directory Looping in C#, but I still want to make it multithreaded. Even though everyone says the drive will be the bottleneck, I have some points:
The drives may mostly be "single threaded", but how do you know what they are going to bring in the future?
How do you know the different sub-paths you are scanning are on the same physical drive?
I'm using an abstraction layer (even two) over System.IO so that I can later reuse the code in different scenarios.
So, my first idea was to use Task and first dummy implementation was this:
public async Task Scan(bool recursive = false) {
var t = new Task(() => {
foreach (var p in path.scan) Add(p);
if (!recursive) return;
var tks = new Task[subs.Count]; var i = 0;
foreach (var s in subs) tks[i++] = s.Scan(true);
Task.WaitAll(tks);
}); t.Start();
await t;
}
I don't like the idea of creating a Task for each item and generally this doesn't seem ideal, but this was just for a test as Tasks are advertised to automatically manage the threads...
This method works, but it's very slow. It takes over 5s to complete, while the single-threaded version below takes around 0.5s to complete the whole program on the same data set:
public void Scan2(bool recursive = false) {
foreach (var p in path.scan) Add(p);
if (!recursive) return;
foreach (var s in subs) s.Scan2(true);
}
I wonder what really goes wrong with the first method. The machine is not under load, CPU usage is insignificant, the drive is fine... I've tried profiling it with NProfiler, but it doesn't tell me much besides that the program sits on Task.WaitAll(tks) all the time.
I also wrote a thread-locked counting mechanism that is invoked during addition of each item. Maybe it's the problem with it?
#region SubCouting
public Dictionary<Type, int> counters = new Dictionary<Type, int>();
private object cLock = new object();
private int _sc = 0;
public int subCount => _sc;
private void inCounter(Type t) {
lock (cLock) {
if (!counters.ContainsKey(t)) counters.Add(t, 1);
counters[t]++;
_sc++;
}
if (parent) parent.inCounter(t);
}
#endregion
But even if threads are waiting here, wouldn't the execution time be similar to single threaded version as opposed to 10x slower?
I'm not sure how to approach this. If I don't want to use tasks, do I need to manage threads manually or is there already some library that would fit nicely for the job?
I think you almost got it. Task.WaitAll(tks) is the problem. It is a synchronous operation, so each call blocks a thread. You soon run out of threads: every thread is just waiting for tasks that have no thread left to run on. You can solve this with async by replacing the wait with await Task.WhenAll(...), which frees the thread while waiting. With some CPU workload, the multithreaded version is significantly faster. When the work is purely IO bound, the two are roughly equal.
ConcurrentBag<string> result = new ConcurrentBag<string>();
List<string> result2 = new List<string>();
public async Task Scan(string path)
{
await Task.Run(async () =>
{
var subs = Directory.GetDirectories(path);
await Task.WhenAll(subs.Select(s => Scan(s)));
result.Add(Enumerable.Range(0, 1000000).Sum(i => path[i % path.Length]).ToString());
});
}
public void Scan2(string path)
{
result2.Add(Enumerable.Range(0, 1000000).Sum(i => path[i % path.Length]).ToString());
var subs = Directory.GetDirectories(path);
foreach (var s in subs) Scan2(s);
}
private async void button4_Click(object sender, EventArgs e)
{
string dir = @"d:\tmp";
System.Diagnostics.Stopwatch st = new System.Diagnostics.Stopwatch();
st.Start();
await Scan(dir);
st.Stop();
MessageBox.Show(st.ElapsedMilliseconds.ToString());
st = new System.Diagnostics.Stopwatch();
st.Start();
Scan2(dir);
st.Stop();
MessageBox.Show(st.ElapsedMilliseconds.ToString());
MessageBox.Show(result.OrderBy(x => x).SequenceEqual(result2.OrderBy(x => x)) ? "OK" : "ERROR");
}

Why simple multi task doesn't work when multi thread does?

var finalList = new List<string>();
var list = new List<int> {1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ................. 999999};
var init = 0;
var limitPerThread = 5;
var countDownEvent = new CountdownEvent(list.Count);
for (var i = 0; i < list.Count; i++)
{
var listToFilter = list.Skip(init).Take(limitPerThread).ToList();
new Thread(delegate()
{
Foo(listToFilter);
countDownEvent.Signal();
}).Start();
init += limitPerThread;
}
//wait all to finish
countDownEvent.Wait();
private static void Foo(List<int> listToFilter)
{
var listDone = Boo(listToFilter);
lock (Object)
{
finalList.AddRange(listDone);
}
}
This doesn't:
var taskList = new List<Task>();
for (var i = 0; i < list.Count; i++)
{
var listToFilter = list.Skip(init).Take(limitPerThread).ToList();
var task = Task.Factory.StartNew(() => Foo(listToFilter));
taskList.Add(task);
init += limitPerThread;
}
//wait all to finish
Task.WaitAll(taskList.ToArray());
This process must create at least 700 threads in the end. When I run using Thread, it works and creates all of them. But with Task it doesn't.. It seems like its not starting multiples Tasks async.
I really wanna know why.... any ideas?
EDIT
Another version with PLINQ (as suggested).
var taskList = new List<Task>(list.Count);
Parallel.ForEach(taskList, t =>
{
var listToFilter = list.Skip(init).Take(limitPerThread).ToList();
Foo(listToFilter);
init += limitPerThread;
t.Start();
});
Task.WaitAll(taskList.ToArray());
EDIT2:
public static List<Communication> Foo(List<Dispositive> listToPing)
{
var listResult = new List<Communication>();
foreach (var item in listToPing)
{
var listIps = item.listIps;
var communication = new Communication
{
IdDispositive = item.Id
};
try
{
for (var i = 0; i < listIps.Count(); i++)
{
var oPing = new Ping().Send(listIps.ElementAt(i).IpAddress, 10000);
if (oPing != null)
{
if (oPing.Status.Equals(IPStatus.TimedOut) && listIps.Count() > i+1)
continue;
if (oPing.Status.Equals(IPStatus.TimedOut))
{
communication.Result = "NOK";
break;
}
communication.Result = oPing.Status.Equals(IPStatus.Success) ? "OK" : "NOK";
break;
}
if (listIps.Count() > i+1)
continue;
communication.Result = "NOK";
break;
}
}
catch
{
communication.Result = "NOK";
}
finally
{
listResult.Add(communication);
}
}
return listResult;
}
Tasks are NOT multithreading. They can be used for that, but mostly they're actually used for the opposite - multiplexing on a single thread.
To use tasks for multithreading, I suggest using Parallel LINQ. It has many optimizations in it already, such as intelligent partitioning of your lists and only spawning as many threads as there are CPU cores, etc.
To understand Task and async, think of it this way - a typical workload often includes IO that needs to be waited upon. Maybe you read a file, or query a webservice, or access a database, or whatever. The point is - your thread gets to wait a loooong time (in CPU cycles at least) until you get a response from some faraway destination.
In the Olden Days™ that meant that your thread was getting locked down (suspended) until that response came. If you wanted to do something else in the meantime, you needed to spawn a new thread. That's doable, but not too efficient. Each OS thread carries a significant overhead (memory, kernel resources) with it. And you could end up with several threads actively burning the CPU, which means that the OS needs to switch between them so that each gets a bit of CPU time and these "context switches" are pretty expensive.
async changes that workflow. Now you can have multiple workloads executing on the same thread. While one piece of work is awaiting the result from a faraway source, another can step in and use that thread to do something else useful. When that second workload gets to its own await, the first can awaken and continue.
After all, it doesn't make sense to spawn more threads than there are CPU cores. You're not going to get more work done that way. Just the opposite - more time will be spent on switching the threads and less time will be available for useful work.
That is what Task/async/await was originally designed for. However, Parallel LINQ has also taken advantage of it and reused it for multithreading. In this case you can look at it this way: the other threads are the "faraway destination" that your main thread is waiting on.
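The multiplexing idea described above can be sketched with simulated waits. In this illustration, Task.Delay stands in for an IO operation, and MultiplexDemo and RunWaitsAsync are made-up names. Many waits are started at once and awaited together, so the total time is close to one wait rather than the sum of all of them, and no thread is blocked while they are pending.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

public static class MultiplexDemo
{
    // Starts `count` simulated IO waits of `delayMs` each and
    // returns how long it took for all of them to finish.
    public static async Task<long> RunWaitsAsync(int count, int delayMs)
    {
        var stopwatch = Stopwatch.StartNew();

        // Task.Delay completes via a timer; no thread sits blocked during the wait.
        await Task.WhenAll(Enumerable.Range(0, count).Select(n => Task.Delay(delayMs)));

        return stopwatch.ElapsedMilliseconds;
    }

    public static void Main()
    {
        long elapsed = RunWaitsAsync(200, 100).GetAwaiter().GetResult();
        Console.WriteLine("200 waits of 100 ms took " + elapsed + " ms in total");
    }
}
```

Run sequentially, the same 200 waits would take about 20 seconds; awaited together they should finish in roughly the time of a single wait, although the exact figure depends on the machine.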
Tasks are executed on the Thread Pool. This means that a handful of threads will serve a large number of tasks. You have multi-threading, but not a thread for every task spawned.
You should use tasks. You should aim to use as many threads as your CPU has cores. Generally, the thread pool does this for you.
How did you measure up the performance? Do you think that the 700 threads will work faster than 700 tasks executing by 4 threads? No, they would not.
It seems like its not starting multiples Tasks async
How did you come up with this? As others suggested in the comments and in other answers, you probably need to remove the thread creation: after creating 700 threads you'll degrade your system's performance, because the threads compete with each other for processor time without getting any work done faster.
So, you need to add async/await for your IO operations in the Foo method, using the SendPingAsync version. Your method could also be simplified, as the many checks of the listIps.Count() > i + 1 condition are unnecessary; the for loop condition already does that:
public static async Task<List<Communication>> Foo(List<Dispositive> listToPing)
{
var listResult = new List<Communication>();
foreach (var item in listToPing)
{
var listIps = item.listIps;
var communication = new Communication
{
IdDispositive = item.Id
};
try
{
var ping = new Ping();
communication.Result = "NOK";
for (var i = 0; i < listIps.Count(); i++)
{
var oPing = await ping.SendPingAsync(listIps.ElementAt(i).IpAddress, 10000);
if (oPing != null)
{
if (oPing.Status.Equals(IPStatus.Success))
{
communication.Result = "OK";
break;
}
}
}
}
catch
{
communication.Result = "NOK";
}
finally
{
listResult.Add(communication);
}
}
return listResult;
}
Other problem with your code is that PLINQ version isn't threadsafe:
init += limitPerThread;
This can fail while executing in parallel. You may introduce some helper method, like in this answer:
private async Task<List<PingReply>> PingAsync(List<Communication> theListOfIPs)
{
Ping pingSender = new Ping();
var tasks = theListOfIPs.Select(ip => pingSender.SendPingAsync(ip, 10000));
var results = await Task.WhenAll(tasks);
return results.ToList();
}
And do this kind of check (try/catch logic removed for simplicity):
public static async Task<List<Communication>> Foo(List<Dispositive> listToPing)
{
var listResult = new List<Communication>();
foreach (var item in listToPing)
{
var listIps = item.listIps;
var communication = new Communication
{
IdDispositive = item.Id
};
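var check = await PingAsync(listIps);
communication.Result = check.Any(p => p.Status.Equals(IPStatus.Success)) ? "OK" : "NOK";
// collect the result for this item, as in the earlier versions of Foo
listResult.Add(communication);
}
return listResult;
}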
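And you should probably use Task.Run instead of Task.Factory.StartNew to be sure that you aren't blocking the UI thread: Task.Run always schedules the work on the thread pool, whereas Task.Factory.StartNew uses the current task scheduler.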

Task.WaitAll gets stuck

I have a piece of code that looks like this:
var taskList = new Task<string>[masterResult.D.Count];
for (int i = 0; i < masterResult.D.Count; i++) //Go through all the lists we need to pull (based on master list) and create a task-list
{
using (var client = new WebClient())
{
Task<string> getDownloadsTask = client.DownloadStringTaskAsync(new Uri(agilityApiUrl + masterResult.D[i].ReferenceIdOfCollection + "?$format=json"));
taskList[i] = getDownloadsTask;
}
}
Task.WaitAll(taskList.Cast<Task>().ToArray()); //Wait for all results to come back
The code freezes after Task.WaitAll... I have an idea why, it's because client is already disposed at the time of calling, is it possible to delay its disposal until later? Can you recommend another approach?
You need to create and dispose the WebClient within your task. I don't have a way to test this, but see if it points you in the right direction:
var taskList = new Task<string>[masterResult.D.Count];
for (int i = 0; i < masterResult.D.Count; i++) //Go through all the lists we need to pull (based on master list) and create a task-list
{
var index = i; // copy the loop variable so each lambda captures its own value
taskList[i] = Task.Run(async () =>
{
using (var client = new WebClient())
{
// await inside the using block so the client is disposed only after the download completes
return await client.DownloadStringTaskAsync(new Uri(agilityApiUrl + masterResult.D[index].ReferenceIdOfCollection + "?$format=json"));
}
});
}
Task.WaitAll(taskList.Cast<Task>().ToArray());
I don't see how that code would ever work, since you dispose the WebClient before the task was run.
You want to do something like this:
var taskList = new Task<string>[masterResult.D.Count];
for (int i = 0; i < masterResult.D.Count; i++) //Go through all the lists we need to pull (based on master list) and create a task-list
{
var client = new WebClient();
Task<string> task = client.DownloadStringTaskAsync(new Uri(agilityApiUrl + masterResult.D[i].ReferenceIdOfCollection + "?$format=json"));
task.ContinueWith(x => client.Dispose());
taskList[i] = task;
}
Task.WaitAll(taskList.Cast<Task>().ToArray()); //Wait for all results to come back
i.e. if you dispose the WebClient inside the first loop, it is already disposed while its download is still pending when you call Task.WaitAll. The ContinueWith callback is invoked once the task completes and can therefore be used to dispose each WebClient instance.
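With async/await, the same lifetime rule can be expressed without ContinueWith: awaiting inside the using block delays Dispose until the download has finished. The sketch below uses a hypothetical FakeClient in place of WebClient so that it runs without network access. FakeClient, DownloadDemo, DownloadOneAsync and DownloadAllAsync are illustrative names only.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for WebClient: fails if it is disposed mid-download.
public sealed class FakeClient : IDisposable
{
    public bool Disposed { get; private set; }

    public async Task<string> DownloadAsync(string url)
    {
        await Task.Delay(10); // simulated network latency
        if (Disposed) throw new ObjectDisposedException(nameof(FakeClient));
        return "body of " + url;
    }

    public void Dispose()
    {
        Disposed = true;
    }
}

public static class DownloadDemo
{
    // The await sits inside the using block, so Dispose runs only
    // after the download task has completed.
    public static async Task<string> DownloadOneAsync(string url)
    {
        using (var client = new FakeClient())
        {
            return await client.DownloadAsync(url);
        }
    }

    // Starts all downloads concurrently and returns the results in input order.
    public static Task<string[]> DownloadAllAsync(string[] urls)
    {
        return Task.WhenAll(urls.Select(url => DownloadOneAsync(url)));
    }

    public static void Main()
    {
        string[] results = DownloadAllAsync(new[] { "a", "b", "c" }).GetAwaiter().GetResult();
        Console.WriteLine(string.Join(", ", results));
    }
}
```

Returning the task from inside the using block without awaiting it would reproduce the original problem: the client would be disposed as soon as the method returned, while the download was still pending.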
However, to get the code to execute concurrent requests to a single host you need to configure the service point. Read this question: Trying to run multiple HTTP requests in parallel, but being limited by Windows (registry)

How to know that your application is not responding?

I have such particular code:
for (int i = 0; i < SingleR_mustBeWorkedUp._number_of_Requestes; i++)
{
Random myRnd = new Random(SingleR_mustBeWorkedUp._num_path);
while (true)
{
int k = myRnd.Next(start, end);
if (CanRequestBePutted(timeLineR, k, SingleR_mustBeWorkedUp._time_service, start + end) == true)
{
SingleR_mustBeWorkedUp.placement[i] = k;
break;
}
}
}
I use an infinite loop here which will end only if CanRequestBePutted returns true. So how can I know that the app isn't responding?
One solution is to measure how long each loop runs, but that doesn't seem very good, and I can't predict what is going to happen in every case.
Any solutions?
If you're concerned that this operation could potentially take long enough for the application's user to notice, you should be running it in a non-UI thread. Then you can be sure that it will not make your application unresponsive. You should only run it in the UI thread if you're sure it will always complete very quickly. When in doubt, go to a non-UI thread.
Don't try to figure out dynamically whether the operation will take a long time or not. If it taking a while is a possibility, do the work in another thread.
Why not use a task or threadpool so you're not blocking and put a timer on it?
The task could look something like this:
//put a class level variable
static object _padlock = new object();
var tasks = new List<Task>();
for (int i = 0; i < SingleR_mustBeWorkedUp._number_of_Requestes; i++)
{
int index = i; // copy the loop variable so each task captures its own value
var task = new Task(() =>
{
Random myRnd = new Random(SingleR_mustBeWorkedUp._num_path);
while (true)
{
int k = myRnd.Next(start, end);
if (CanRequestBePutted(timeLineR, k, SingleR_mustBeWorkedUp._time_service, start + end) == true)
{
lock(_padlock)
{
SingleR_mustBeWorkedUp.placement[index] = k;
}
break;
}
}
});
task.Start();
tasks.Add(task);
}
Task.WaitAll(tasks.ToArray());
However, I would also try to figure out a way to take out your while(true), which is a bit dangerous. Also, Task requires .NET 4.0 or above and I'm not sure what framework you're targeting.
If you need something older you can use ThreadPool.
You might also want to put locks around shared resources like SingleR_mustBeWorkedUp.placement, or anywhere else that might be changing a variable. I put one around SingleR_mustBeWorkedUp.placement as an example.
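One way to remove the unbounded while(true) is to run the search on a thread-pool thread and give it a CancellationToken, so the caller decides how long to wait. This is a sketch under assumptions: PlacementDemo and FindPlacementAsync are hypothetical names, and the accept delegate stands in for CanRequestBePutted.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class PlacementDemo
{
    // Tries random candidates in [start, end) on a thread-pool thread until
    // `accept` returns true or the token is cancelled.
    // Returns the accepted value, or null if the search was cancelled first.
    public static Task<int?> FindPlacementAsync(
        Func<int, bool> accept, int start, int end, int seed, CancellationToken token)
    {
        return Task.Run(() =>
        {
            var rnd = new Random(seed);
            while (!token.IsCancellationRequested)
            {
                int k = rnd.Next(start, end);
                if (accept(k)) return (int?)k;
            }
            return (int?)null; // gave up: no placement found in time
        });
    }

    public static void Main()
    {
        // Give up after 200 ms instead of looping forever.
        using (var cts = new CancellationTokenSource(TimeSpan.FromMilliseconds(200)))
        {
            int? placement = FindPlacementAsync(k => k % 5 == 0, 0, 100, 1, cts.Token)
                .GetAwaiter().GetResult();
            Console.WriteLine(placement.HasValue ? "Placed at " + placement.Value : "No placement found");
        }
    }
}
```

In a UI application the caller would await the returned task rather than block on it, so the UI thread stays responsive while the search runs.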
