How do I implement a multithreaded pool for a repetitive process - C#

I need to do something like this. Test() is just calculations; I only want to put all CPU cores to work to find the solution faster.
for (int i= 0; i<1000000; i++)
{
result[i] = Test(i);
if (result[i] == 0)
{
break;
}
}
I have worked with BackgroundWorker before, and I could create an array of N BackgroundWorkers and handle a queue myself, but that looks like too much trouble.
So I found Task.Factory, which seems similar to what I want, but I still don't know how to handle each separate task, wait for the result, and stop everything once the answer is found.
Task<string> task = Task.Factory.StartNew<string>
(() => DownloadString("http://www.google.com/"));
string result = task.Result;
Or maybe there is another solution to my problem.

Parallel.For(0, 1000000, (i, loopState) =>
{
result[i] = Test(i);
if (result[i] == 0)
{
loopState.Stop();
return;
}
});
How to: Write a Simple Parallel.For Loop
How to: Stop or Break from a Parallel.For Loop

You actually don't need to put multiple cores to use, since you're doing I/O-bound work.
You could take advantage of naturally async APIs for HTTP requests, such as those exposed by HttpClient. Then, you could asynchronously check each task as it completes to see whether it contains the correct result:
public async Task<string> FindResultAsync(string[] urlList)
{
var downloadTasks = urlList.Select(url => DownloadStringAsync(url)).ToList();
while (downloadTasks.Count > 0)
{
var finishedTask = await Task.WhenAny(downloadTasks);
if (finishedTask.Result == someValue)
{
return finishedTask.Result;
}
downloadTasks.Remove(finishedTask);
}
return null; // none of the downloads produced the expected result
}
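The snippet above assumes a DownloadStringAsync helper that isn't shown. A minimal sketch of one possible implementation using HttpClient.GetStringAsync (the helper name and the shared field are assumptions, not part of the original answer):

// Shared HttpClient instance (reuse is generally recommended to avoid socket exhaustion).
private static readonly HttpClient httpClient = new HttpClient();

// Hypothetical helper assumed by the snippet above: downloads the response body for one URL.
private static Task<string> DownloadStringAsync(string url)
{
    // GetStringAsync is already asynchronous, so no extra threads are needed for this I/O.
    return httpClient.GetStringAsync(url);
}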


Storing each async result in its own array element

Let's say I want to download 1000 recipes from a website. The website accepts at most 10 concurrent connections. Each recipe should be stored in an array, at its corresponding index. (I don't want to send the array to the DownloadRecipe method.)
Technically, I've already solved the problem, but I would like to know if there is an even cleaner way to use async/await or something else to achieve it?
static async Task MainAsync()
{
int recipeCount = 1000;
int connectionCount = 10;
string[] recipes = new string[recipeCount];
Task<string>[] tasks = new Task<string>[connectionCount];
int r = 0;
while (r < recipeCount)
{
for (int t = 0; t < tasks.Length; t++)
{
int index = r; // capture the current value; the lambda must not close over the shared counter
tasks[t] = Task.Run(async () => recipes[index] = await DownloadRecipe(index));
r++;
}
await Task.WhenAll(tasks);
}
}
static async Task<string> DownloadRecipe(int index)
{
// ... await calls to download recipe
}
Also, this solution isn't optimal, since it doesn't start any new downloads until all 10 running downloads are finished. Is there something we can improve without bloating the code too much? A thread pool limited to 10 threads?
There are many, many ways you could do this. One way is to use an ActionBlock, which gives you access to MaxDegreeOfParallelism fairly easily and works well with async methods:
static async Task MainAsync()
{
var recipeCount = 1000;
var connectionCount = 10;
var recipes = new string[recipeCount];
async Task Action(int i) => recipes[i] = await DownloadRecipe(i);
var processor = new ActionBlock<int>(Action, new ExecutionDataflowBlockOptions()
{
MaxDegreeOfParallelism = connectionCount,
SingleProducerConstrained = true
});
for (var i = 0; i < recipeCount; i++)
await processor.SendAsync(i);
processor.Complete();
await processor.Completion;
}
static async Task<string> DownloadRecipe(int index)
{
...
}
Another way might be to use a SemaphoreSlim
var slim = new SemaphoreSlim(connectionCount, connectionCount);
var tasks = Enumerable
.Range(0, recipeCount)
.Select(Selector);
async Task<string> Selector(int i)
{
await slim.WaitAsync();
try
{
return await DownloadRecipe(i);
}
finally
{
slim.Release();
}
}
var recipes = await Task.WhenAll(tasks);
Another set of approaches is to use Reactive Extensions (Rx). Once again there are many ways to do this; this is just one awaitable approach (and it could likely be improved, all things considered):
var results = await Enumerable
.Range(0, recipeCount)
.ToObservable()
.Select(i => Observable.FromAsync(() => DownloadRecipe(i)))
.Merge(connectionCount)
.ToArray()
.ToTask();
An alternative approach is to have 10 "pools" which load data "simultaneously".
You don't need to wrap I/O operations in separate threads. Using a separate thread for an I/O operation is just a waste of resources.
Notice that a thread which downloads data does nothing but wait for a response. This is where the async-await approach comes in very handy - we can send multiple requests without waiting for them to complete and without wasting threads.
static async Task MainAsync()
{
var requests = Enumerable.Range(0, 1000).ToArray();
var maxConnections = 10;
var pools = requests
.GroupBy(i => i % maxConnections)
.Select(group => DownloadRecipesFor(group.ToArray()))
.ToArray();
await Task.WhenAll(pools);
var recipes = pools.SelectMany(pool => pool.Result).ToArray();
}
static async Task<IEnumerable<string>> DownloadRecipesFor(params int[] requests)
{
var recipes = new List<string>();
foreach (var request in requests)
{
var recipe = await DownloadRecipe(request);
recipes.Add(recipe);
}
return recipes;
}
Because inside each pool (the DownloadRecipesFor method) we download results one by one, we make sure that we have no more than 10 active requests at any time.
This is a little more efficient than the original, because we don't wait for 10 tasks to complete before starting the next batch.
It is still not ideal, because if one "pool" finishes earlier than the others, it isn't able to pick up the next request to handle.
The final result will have corresponding indexes, because we process the "pools", and the requests inside them, in the same order as we created them.
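If that limitation matters, one way to let an early-finishing worker keep helping is to have all workers pull indexes from a single shared queue. This is only a sketch under that assumption; it reuses DownloadRecipe from the question:

static async Task MainAsync()
{
    var recipeCount = 1000;
    var maxConnections = 10;
    var recipes = new string[recipeCount];

    // One shared queue of request indexes; any idle worker dequeues the next one.
    var pending = new ConcurrentQueue<int>(Enumerable.Range(0, recipeCount));

    var workers = Enumerable.Range(0, maxConnections).Select(async _ =>
    {
        while (pending.TryDequeue(out var index))
        {
            recipes[index] = await DownloadRecipe(index); // each result lands at its own index
        }
    });

    await Task.WhenAll(workers); // at most maxConnections downloads run at a time
}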

How should I use Task.Run in my code for proper scalability and performance?

I started to have HUGE doubts regarding my code and I need some advice from more experienced programmers.
In my application, on a button click, the application runs a command that calls a ScrapJockeys method:
if (UpdateJockeysPl) await ScrapJockeys(JPlFrom, JPlTo + 1, "jockeysPl"); //1 - 1049
ScrapJockeys triggers a for loop, repeating a code block between 20K and 150K times (depending on the case). Inside the loop, I need to call a service method whose execution takes a lot of time. I also want the ability to cancel the loop and everything that is going on inside the loop/method.
Right now I have a method with a list of tasks, and inside the loop a Task.Run is triggered. Inside each task, I am calling an awaited service method, which reduces the overall execution time to 1/4 compared to synchronous code. Also, each task has a cancellation token assigned, like in the example (GitHub link):
public async Task ScrapJockeys(int startIndex, int stopIndex, string dataType)
{
//init values and controls in here
List<Task> tasks = new List<Task>();
for (int i = startIndex; i < stopIndex; i++)
{
int j = i;
Task task = Task.Run(async () =>
{
LoadedJockey jockey = new LoadedJockey();
CancellationToken.ThrowIfCancellationRequested();
if (dataType == "jockeysPl") jockey = await _scrapServices.ScrapSingleJockeyPlAsync(j);
if (dataType == "jockeysCz") jockey = await _scrapServices.ScrapSingleJockeyCzAsync(j);
//doing some stuff with results in here
}, TokenSource.Token);
tasks.Add(task);
}
try
{
await Task.WhenAll(tasks);
}
catch (OperationCanceledException)
{
//
}
finally
{
await _dataServices.SaveAllJockeysAsync(Jockeys.ToList()); //saves everything to JSON file
//doing some stuff with UI props in here
}
}
So, about my question: is everything fine with my code? According to this article:
Many async newbies start off by trying to treat asynchronous tasks the
same as parallel (TPL) tasks and this is a major misstep.
What should I use then?
And according to this article:
On a busy server, this kind of implementation can kill scalability.
So how am I supposed to do it?
Please note that the service interface method signature is Task<LoadedJockey> ScrapSingleJockeyPlAsync(int index);
Also, I am not 100% sure that I am using Task.Run correctly within my service class. The methods inside wrap the code in await Task.Run(() =>, like in the example (GitHub link):
public async Task<LoadedJockey> ScrapSingleJockeyPlAsync(int index)
{
LoadedJockey jockey = new LoadedJockey();
await Task.Run(() =>
{
//do some time consuming things
});
return jockey;
}
As far as I understand from the articles, this is a kind of anti-pattern. But I am a bit confused. Based on this SO reply, it should be fine...? If not, how should I replace it?
On the UI side, you should be using Task.Run when you have CPU-bound code that is long enough that you need to move it off the UI thread. This is completely different than the server side, where using Task.Run at all is an anti-pattern.
In your case, all your code seems to be I/O-based, so I don't see a need for Task.Run at all.
There is a statement in your question that conflicts with the provided code:
I am calling an awaited service method
public async Task<LoadedJockey> ScrapSingleJockeyPlAsync(int index)
{
await Task.Run(() =>
{
//do some time consuming things
});
}
The lambda passed to Task.Run is not async, so the service method cannot possibly be awaited. And indeed it is not.
A better solution would be to load the HTML asynchronously (e.g., using HttpClient.GetStringAsync), and then call HtmlDocument.LoadHtml, something like this:
public async Task<LoadedJockey> ScrapSingleJockeyPlAsync(int index)
{
LoadedJockey jockey = new LoadedJockey();
...
string link = sb.ToString();
var html = await httpClient.GetStringAsync(link).ConfigureAwait(false);
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
if (jockey.Name == null)
...
return jockey;
}
And also remove the Task.Run from your for loop:
private async Task ScrapJockey(int j, string dataType)
{
LoadedJockey jockey = new LoadedJockey();
CancellationToken.ThrowIfCancellationRequested();
if (dataType == "jockeysPl") jockey = await _scrapServices.ScrapSingleJockeyPlAsync(j).ConfigureAwait(false);
if (dataType == "jockeysCz") jockey = await _scrapServices.ScrapSingleJockeyCzAsync(j).ConfigureAwait(false);
//doing some stuff with results in here
}
public async Task ScrapJockeys(int startIndex, int stopIndex, string dataType)
{
//init values and controls in here
List<Task> tasks = new List<Task>();
for (int i = startIndex; i < stopIndex; i++)
{
tasks.Add(ScrapJockey(i, dataType));
}
try
{
await Task.WhenAll(tasks);
}
catch (OperationCanceledException)
{
//
}
finally
{
await _dataServices.SaveAllJockeysAsync(Jockeys.ToList()); //saves everything to JSON file
//doing some stuff with UI props in here
}
}
As far as I understand from the articles, this is a kind of anti-pattern.
It is an anti-pattern. But if you can't modify the service implementation, you should at least be able to execute the tasks in parallel. Something like this:
public async Task ScrapJockeys(int startIndex, int stopIndex, string dataType)
{
ConcurrentBag<Task> tasks = new ConcurrentBag<Task>();
ParallelOptions parallelLoopOptions = new ParallelOptions() { CancellationToken = CancellationToken };
Parallel.For(startIndex, stopIndex, parallelLoopOptions, i =>
{
int j = i;
switch (dataType)
{
case "jockeysPl":
tasks.Add(_scrapServices.ScrapSingleJockeyPlAsync(j));
break;
case "jockeysCz":
tasks.Add(_scrapServices.ScrapSingleJockeyCzAsync(j));
break;
}
});
try
{
await Task.WhenAll(tasks);
}
catch (OperationCanceledException)
{
//
}
finally
{
await _dataServices.SaveAllJockeysAsync(Jockeys.ToList()); //saves everything to JSON file
//doing some stuff with UI props in here
}
}

Why doesn't a simple multi-task approach work when multi-threading does?

This works:
var finalList = new List<string>();
var list = new List<int> {1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ................. 999999};
var init = 0;
var limitPerThread = 5;
var countDownEvent = new CountdownEvent(list.Count);
for (var i = 0; i < list.Count; i++)
{
var listToFilter = list.Skip(init).Take(limitPerThread).ToList();
new Thread(delegate()
{
Foo(listToFilter);
countDownEvent.Signal();
}).Start();
init += limitPerThread;
}
//wait all to finish
countDownEvent.Wait();
private static void Foo(List<int> listToFilter)
{
var listDone = Boo(listToFilter);
lock (Object)
{
finalList.AddRange(listDone);
}
}
This doesn't:
var taskList = new List<Task>();
for (var i = 0; i < list.Count; i++)
{
var listToFilter = list.Skip(init).Take(limitPerThread).ToList();
var task = Task.Factory.StartNew(() => Foo(listToFilter));
taskList.Add(task);
init += limitPerThread;
}
//wait all to finish
Task.WaitAll(taskList.ToArray());
This process must create at least 700 threads in the end. When I run it using Thread, it works and creates all of them. But with Task it doesn't. It seems like it's not starting multiple Tasks asynchronously.
I really want to know why... any ideas?
EDIT
Another version with PLINQ (as suggested).
var taskList = new List<Task>(list.Count);
Parallel.ForEach(taskList, t =>
{
var listToFilter = list.Skip(init).Take(limitPerThread).ToList();
Foo(listToFilter);
init += limitPerThread;
t.Start();
});
Task.WaitAll(taskList.ToArray());
EDIT2:
public static List<Communication> Foo(List<Dispositive> listToPing)
{
var listResult = new List<Communication>();
foreach (var item in listToPing)
{
var listIps = item.listIps;
var communication = new Communication
{
IdDispositive = item.Id
};
try
{
for (var i = 0; i < listIps.Count(); i++)
{
var oPing = new Ping().Send(listIps.ElementAt(i).IpAddress, 10000);
if (oPing != null)
{
if (oPing.Status.Equals(IPStatus.TimedOut) && listIps.Count() > i+1)
continue;
if (oPing.Status.Equals(IPStatus.TimedOut))
{
communication.Result = "NOK";
break;
}
communication.Result = oPing.Status.Equals(IPStatus.Success) ? "OK" : "NOK";
break;
}
if (listIps.Count() > i+1)
continue;
communication.Result = "NOK";
break;
}
}
catch
{
communication.Result = "NOK";
}
finally
{
listResult.Add(communication);
}
}
return listResult;
}
Tasks are NOT multithreading. They can be used for that, but mostly they're actually used for the opposite - multiplexing on a single thread.
To use tasks for multithreading, I suggest using Parallel LINQ. It has many optimizations in it already, such as intelligent partitioning of your lists and only spawning as many threads as there are CPU cores, etc.
To understand Task and async, think of it this way - a typical workload often includes IO that needs to be waited upon. Maybe you read a file, or query a webservice, or access a database, or whatever. The point is - your thread gets to wait a loooong time (in CPU cycles at least) until you get a response from some faraway destination.
In the Olden Days™ that meant that your thread was getting locked down (suspended) until that response came. If you wanted to do something else in the meantime, you needed to spawn a new thread. That's doable, but not too efficient. Each OS thread carries a significant overhead (memory, kernel resources) with it. And you could end up with several threads actively burning the CPU, which means that the OS needs to switch between them so that each gets a bit of CPU time and these "context switches" are pretty expensive.
async changes that workflow. Now you can have multiple workloads executing on the same thread. While one piece of work is awaiting the result from a faraway source, another can step in and use that thread to do something else useful. When that second workload gets to its own await, the first can awaken and continue.
After all, it doesn't make sense to spawn more threads than there are CPU cores. You're not going to get more work done that way. Just the opposite - more time will be spent on switching the threads and less time will be available for useful work.
That is what Task/async/await was originally designed for. However, Parallel LINQ has also taken advantage of it and reused it for multithreading. In this case you can look at it this way - the other threads are the "faraway destination" that your main thread is waiting on.
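To make that suggestion concrete, here is a minimal PLINQ sketch. It assumes the Foo overload from EDIT2 (which takes the items and returns its results), that list holds those items, and the limitPerThread chunking from the question; it is only an illustration, not the original answer's code:

// Split the list into chunks of limitPerThread items and filter them in parallel;
// PLINQ decides how many worker threads to use (roughly one per CPU core).
var finalList = list
    .Select((item, index) => new { item, index })
    .GroupBy(x => x.index / limitPerThread, x => x.item)   // chunks of limitPerThread items
    .AsParallel()
    .SelectMany(chunk => Foo(chunk.ToList()))
    .ToList();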
Tasks are executed on the Thread Pool. This means that a handful of threads will serve a large number of tasks. You have multi-threading, but not a thread for every task spawned.
You should use tasks. You should aim to use about as many threads as your CPU has cores. Generally, the thread pool does this for you.
How did you measure the performance? Do you think that 700 threads will work faster than 700 tasks executed by 4 threads? No, they would not.
It seems like its not starting multiples Tasks async
How did you come up with this? As others suggested in the comments and in other answers, you probably need to remove the thread creation: after creating 700 threads you'll degrade your system's performance, as your threads will fight each other for processor time without getting any work done faster.
So, you need to add async/await for your I/O operations in the Foo method, using the SendPingAsync version. Also, your method can be simplified, as many of the listIps.Count() > i + 1 checks are useless - you already do that in the for condition:
public static async Task<List<Communication>> Foo(List<Dispositive> listToPing)
{
var listResult = new List<Communication>();
foreach (var item in listToPing)
{
var listIps = item.listIps;
var communication = new Communication
{
IdDispositive = item.Id
};
try
{
var ping = new Ping();
communication.Result = "NOK";
for (var i = 0; i < listIps.Count(); i++)
{
var oPing = await ping.SendPingAsync(listIps.ElementAt(i).IpAddress, 10000);
if (oPing != null)
{
if (oPing.Status.Equals(IPStatus.Success))
{
communication.Result = "OK";
break;
}
}
}
}
catch
{
communication.Result = "NOK";
}
finally
{
listResult.Add(communication);
}
}
return listResult;
}
Another problem with your code is that the PLINQ version isn't thread-safe:
init += limitPerThread;
This can fail while executing in parallel. You may introduce some helper method, like in this answer:
private async Task<List<PingReply>> PingAsync(List<string> theListOfIPs)
{
// use one Ping instance per request - a single Ping cannot run several async operations at once
var tasks = theListOfIPs.Select(ip => new Ping().SendPingAsync(ip, 10000));
var results = await Task.WhenAll(tasks);
return results.ToList();
}
And do this kind of check (try/catch logic removed for simplicity):
public static async Task<List<Communication>> Foo(List<Dispositive> listToPing)
{
var listResult = new List<Communication>();
foreach (var item in listToPing)
{
var listIps = item.listIps;
var communication = new Communication
{
IdDispositive = item.Id
};
var check = await PingAsync(listIps.Select(ip => ip.IpAddress).ToList());
communication.Result = check.Any(p => p.Status.Equals(IPStatus.Success)) ? "OK" : "NOK";
listResult.Add(communication);
}
return listResult;
}
And you should probably use Task.Run instead of Task.Factory.StartNew to be sure that you aren't blocking the UI thread.
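To illustrate why that matters with async delegates (a general sketch, with FooAsync standing in for any async method; it is not code from the question): Task.Factory.StartNew does not unwrap async delegates, while Task.Run does.

// StartNew returns Task<Task>: awaiting the outer task does NOT wait for FooAsync to finish.
Task<Task> outer = Task.Factory.StartNew(async () => await FooAsync());
await outer; // completes as soon as the inner task has been started

// Task.Run unwraps the inner task, so this really waits for FooAsync to complete.
await Task.Run(async () => await FooAsync());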

C# Re-use or re-create tasks?

I have a while loop in which I create and start tasks, as follows:
while (!stopped)
{
List<Task> tasks = new List<Task>();
for (int i = 0; i < 10; i++)
tasks.Add(Task.Factory.StartNew(() => DoSomething(i)));
Task.WaitAll(tasks.ToArray());
}
Would I get better performance if the tasks were created once before the while loop and restarted every time (because the data being passed to the function never changes)?
You can't restart a task:
http://msdn.microsoft.com/en-us/library/dd270682.aspx
By the way
Task.Factory.StartNew(() => {....})
is faster than
Task task = new Task(() => {...});
task.Start();
because there is no locking on the Start method.
In your case, use async I/O to get a performance boost.
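A sketch of what that could look like inside an async method, assuming DoSomething can be replaced by a hypothetical asynchronous DoSomethingAsync (the question does not show what DoSomething does, so this is only an assumption):

while (!stopped)
{
    var tasks = new List<Task>();
    for (int i = 0; i < 10; i++)
    {
        int capture = i;                      // avoid capturing the loop variable
        tasks.Add(DoSomethingAsync(capture)); // hypothetical async version of DoSomething
    }
    await Task.WhenAll(tasks);                // no thread is blocked while waiting
}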
There is nothing fundamentally wrong with your code. This is a perfectly acceptable approach. You do not need to worry about the performance or expense of creating tasks, because the TPL was specifically designed to be used exactly like you have done.
However, there is one major problem with your code. You are closing over the loop variable. Remember, closures capture the variable, not the value. The way your code is written, the DoSomething method will not be using the value of i that you think it should. Your code needs to be rewritten like this:
while (!stopped)
{
List<Task> tasks = new List<Task>();
for (int i = 0; i < 10; i++)
{
int capture = i;
tasks.Add(Task.Factory.StartNew(() => DoSomething(capture)));
}
Task.WaitAll(tasks.ToArray());
}
As a side note, you could use the Parallel.For method as an alternative. It is definitely a more compact solution, if nothing else:
while (!stopped)
{
Parallel.For(0, 10, i => DoSomething(i));
}

creating a .net async wrapper to a sync request

I have the following situation (or a basic misunderstanding of the async/await mechanism).
Assume you have a set of 1-20 web request calls that take a long time: findItemsByProduct().
I want to wrap them in an async request that would abstract all these calls into one async call, but I can't seem to do it without using more threads.
If I'm doing:
int total = result.paginationOutput.totalPages;
for (int i = 2; i < total + 1; i++)
{
await Task.Factory.StartNew(() =>
{
result = client.findItemsByProduct(i);
});
newList.AddRange(result.searchResult.item);
}
return newList;
}
The problem here is that the calls don't run together; rather, they wait one by one.
I would like all the calls to run together and then harvest the results.
As pseudocode, I would like the code to run like this:
forEach item {
result = item.makeWebRequest();
}
foreach item {
List.addRange(item.harvestResults);
}
I have no idea how to make the code do that, though.
Ideally, you should add a findItemsByProductAsync that returns a Task<Item[]>. That way, you don't have to create unnecessary tasks using StartNew or Task.Run.
Then your code can look like this:
int total = result.paginationOutput.totalPages;
// Start all downloads; each download is represented by a task.
Task<Item[]>[] tasks = Enumerable.Range(2, total - 1)
.Select(i => client.findItemsByProductAsync(i)).ToArray();
// Wait for all downloads to complete.
Item[][] results = await Task.WhenAll(tasks);
// Flatten the results into a single collection.
return results.SelectMany(x => x).ToArray();
Given your requirements which I see as:
Process n number of non-blocking tasks
Process results after all queries have returned
I would use the CountdownEvent for this e.g.
var results = new ConcurrentBag<ItemType>();
using (var e = new CountdownEvent(result.pagination.totalPages - 1))
{
for (int i = 2; i <= result.pagination.totalPages; i++)
{
int page = i; // capture the loop variable for the closure
Task.Factory.StartNew(() => client.findItemsByProduct(page))
.ContinueWith(t => {
foreach (var item in t.Result.searchResult.item)
results.Add(item); // ConcurrentBag has no AddRange
e.Signal(); // signal task is done
});
}
// Wait for all requests to complete
e.Wait();
}
// Process results
foreach (var item in results)
{
...
}
This particular problem is solved easily enough without even using await. Simply create each of the tasks, put all of the tasks into a list, and then use WhenAll on that list to get a task that represents the completion of all of those tasks:
public static Task<Item[]> Foo()
{
int total = result.paginationOutput.totalPages;
var tasks = new List<Task<Item>>();
for (int i = 2; i < total + 1; i++)
{
int page = i; // capture the loop variable for the closure
tasks.Add(Task.Factory.StartNew(() => client.findItemsByProduct(page)));
}
return Task.WhenAll(tasks);
}
Also note you have a major problem in how you use result in your code. You're having each of the different tasks all using the same variable, so there are race conditions as to whether or not it works properly. You could end up adding the same call twice and having one skipped entirely. Instead you should have the call to findItemsByProduct be the result of the task, and use that task's Result.
If you want to use async-await properly you have to declare your functions async, and the functions that call you also have to be async. This continues until you have one synchronous function that starts the async process.
Your function would look like this:
By the way, you didn't describe what's in the list. I assume the items are objects of type T; in that case result.searchResult.item returns IEnumerable<T>.
private async Task<List<T>> FindItems(...)
{
int total = result.paginationOutput.totalPages;
var newList = new List<T>();
for (int i = 2; i < total + 1; i++)
{
var response = await Task.Factory.StartNew(() =>
{
return client.findItemsByProduct(i);
});
newList.AddRange(response.searchResult.item);
}
return newList;
}
If you do it this way, your function will be asynchronous, but the findItemsByProduct calls will be executed one after another. If you want to execute them simultaneously, you should not await the result but start the next task before the previous one is finished. Once all tasks are started, wait until they are all finished. Like this:
private async Task<List<T>> FindItems(...)
{
int total = result.paginationOutput.totalPages;
var tasks= new List<Task<IEnumerable<T>>>();
// start all tasks. don't wait for the result yet
for (int i = 2; i < total + 1; i++)
{
int page = i; // capture the loop variable; all tasks run concurrently
Task<IEnumerable<T>> task = Task.Factory.StartNew(() =>
{
return client.findItemsByProduct(page).searchResult.item;
});
tasks.Add(task);
}
// now that all tasks are started, wait until all are finished
await Task.WhenAll(tasks);
// the result of each task is now in task.Result
// the type of result is IEnumerable<T>
// put them all into one big list using some LINQ:
return tasks.SelectMany(task => task.Result)
.ToList();
// if you're not familiar with LINQ yet, use a foreach:
var newList = new List<T>();
foreach (var task in tasks)
{
newList.AddRange(task.Result);
}
return newList;
}
