Storing each async result in its own array element - c#

Let's say I want to download 1000 recipes from a website. The websites accepts at most 10 concurrent connections. Each recipe should be stored in an array, at its corresponding index. (I don't want to send the array to the DownloadRecipe method.)
Technically, I've already solved the problem, but I would like to know if there is an even cleaner way to use async/await or something else to achieve it?
static async Task MainAsync()
{
int recipeCount = 1000;
int connectionCount = 10;
string[] recipes = new string[recipeCount];
Task<string>[] tasks = new Task<string>[connectionCount];
int r = 0;
while (r < recipeCount)
{
for (int t = 0; t < tasks.Length; t++)
{
tasks[t] = Task.Run(async () => recipes[r] = await DownloadRecipe(r));
r++;
}
await Task.WhenAll(tasks);
}
}
static async Task<string> DownloadRecipe(int index)
{
// ... await calls to download recipe
}
Also, this solution it's not optimal, since it doesn't bother starting a new download until all the 10 running downloads are finished. Is there something we can improve there without bloating the code too much? A thread pool limited to 10 threads?

There are many many ways you could do this. One way is to use an ActionBlock which give you access to MaxDegreeOfParallelism fairly easily and will work well with async methods
static async Task MainAsync()
{
var recipeCount = 1000;
var connectionCount = 10;
var recipes = new string[recipeCount];
async Task Action(int i) => recipes[i] = await DownloadRecipe(i);
var processor = new ActionBlock<int>(Action, new ExecutionDataflowBlockOptions()
{
MaxDegreeOfParallelism = connectionCount,
SingleProducerConstrained = true
});
for (var i = 0; i < recipeCount; i++)
await processor.SendAsync(i);
processor.Complete();
await processor.Completion;
}
static async Task<string> DownloadRecipe(int index)
{
...
}
Another way might be to use a SemaphoreSlim
var slim = new SemaphoreSlim(connectionCount, connectionCount);
var tasks = Enumerable
.Range(0, recipeCount)
.Select(Selector);
async Task<string> Selector(int i)
{
await slim.WaitAsync()
try
{
return await DownloadRecipe(i)
}
finally
{
slim.Release();
}
}
var recipes = await Task.WhenAll(tasks);
Another set of approaches is to use Reactive Extensions (Rx)... Once again there are many ways to do this, this is just an awaitable approach (and likely could be better all things considered)
var results = await Enumerable
.Range(0, recipeCount)
.ToObservable()
.Select(i => Observable.FromAsync(() => DownloadRecipe(i)))
.Merge(connectionCount)
.ToArray()
.ToTask();

Alternative approach to have 10 "pools" which will load data "simultaneously".
You don't need to wrap IO operations with the separate thread. Using separate thread for IO operations is just a waste of resources.
Notice that thread which downloads data will do nothing, but just waiting for a response. This is where async-await approach come very handy - we can send multiple requests without waiting them to complete and without wasting threads.
static async Task MainAsync()
{
var requests = Enumerable.Range(0, 1000).ToArray();
var maxConnections = 10;
var pools = requests
.GroupBy(i => i % maxConnections)
.Select(group => DownloadRecipesFor(group.ToArray()))
.ToArray();
await Task.WhenAll(pools);
var recipes = pools.SelectMany(pool => pool.Result).ToArray();
}
static async Task<IEnumerable<string>> DownLoadRecipesFor(params int[] requests)
{
var recipes = new List<string>();
foreach (var request in requests)
{
var recipe = await DownloadRecipe(request);
recipes.Add(recipe);
}
return recipes;
}
Because inside the pool (DownloadRecipesFor method) we download results one by one - we make sure that we have no more than 10 active requests all the time.
This is little bit more effective than originals, because we don't wait for 10 tasks to complete before starting next "bunch".
This is not ideal, because if last "pool" finishes early then others it aren't able to pickup next request to handle.
Final result will have corresponding indexes, because we will process "pools" and requests inside in same order as we created them.

Related

What is the fastest and most efficient ways to make a large number of C# WebRequests?

I have a list of URLs (thousands), I want to asynchronously get page data from each URL as fast as possible without putting extreme load on the CPU.
I have tried using threading but it still feels quite slow:
public static ConcurrentQueue<string> List = new ConcurrentQueue<string>(); //URL List (assume I added them already)
public static void Threading()
{
for(int i=0;i<100;i++) //100 threads
{
Thread thread = new Thread(new ThreadStart(Task));
thread.Start();
}
}
public static void Task()
{
while(!(List.isEmpty))
{
List.TryDequeue(out string URL);
//GET REQUEST HERE
}
}
Is there any better way to do this? I want to do this asynchronously but I can't figure out how to do it, and I don't want to sacrifice speed or CPU efficiency to do so.
Thanks :)
You should use Microsoft's Reactive Framework (aka Rx) - NuGet System.Reactive and add using System.Reactive.Linq; - then you can do this:
public static IObservable<(string url, string content)> GetAllUrls(List<string> urls) =>
Observable
.Using(
() => new HttpClient(),
hc =>
from url in urls.ToObservable()
from response in Observable.FromAsync(() => hc.GetAsync(url))
from content in Observable.FromAsync(() => response.Content.ReadAsStringAsync())
select (url, content));
That allows you to consume the results in a couple of ways.
You can process them as they get produced:
IDisposable subscription =
GetAllUrls(urlsx).Subscribe(x => Console.WriteLine(x.content));
Or you can get all of them produced and then await the full results:
(string url, string content)[] results = await GetAllUrls(urlsx).ToArray();
You are best off using HttpClient which allows async Task requests.
Just store each task in a list, and await the whole list. To prevent too many requests at once, wait for any single one to complete if there are too many, and remove the completed one from the list.
const int maxDegreeOfParallelism = 100;
static HttpClient _client = new HttpClient();
public static async Task GetAllUrls(List<string> urls)
{
var tasks = new List<Task>(urls.Count);
foreach (var url in urls)
{
if (tasks.Count == maxDegreeOfParallelism) // this prevents too many requests at once
tasks.Remove(await Task.WhenAny(tasks));
tasks.Add(GetUrl(url));
}
await Task.WhenAll(tasks);
}
private static async Task GetUrl(string url)
{
using var response = await _client.GetAsync(url);
// handle response here
var responseStr = await response.Content.ReadAsStringAsync(); // whatever
// do stuff etc
}

Generate List<Task> with using index in for like parameter

I want to generate List<Task> and then call it all parallel. I try to do this, but not understood how pass index in lambda expression. In my realisation always send last index of for. But i want to use all indexes in parallel computing...
List<Task> tasks = new List<Task>();
var valueSize = Values.Count;
for (int i = 1; i <= valueSize; i++)
{
tasks.Add(Task.Factory.StartNew(
() => {
this.Values[i] = this._tcpRepository.GetValueAsync(i).ToString().GetInt32();
}));
}
Task.WaitAll(tasks.ToArray());
Additional information:
public Dictionary<int, int> Values { get; set; }
In TcpRepository
public async Task<string> GetValueAsync(int digit)
{
var netStream = this.TcpClient.GetStream();
if (netStream.CanWrite)
{
Byte[] sendBytes = Encoding.UTF8.GetBytes(digit.ToString());
await netStream.WriteAsync(sendBytes, 0, sendBytes.Length);
}
if (netStream.CanRead)
{
byte[] bytes = new byte[this.Client.ReceiveBufferSize];
await netStream.ReadAsync(bytes, 0, (int)this.Client.ReceiveBufferSize);
return bytes.ToString();
}
return null;
}
GetInt32() it's my custom extension for string public static int GetInt32(this string value). Strings can come with garbage as characters that are not numbers.
There are a few issues here.
To directly answer your question, you need to create a closure over i to prevent it being updated before it is accessed in your async code.
You can do this by swapping your for loop with Enumerable.Range.
Another issue is that you are running GetValueAsync on the ThreadPool but not awaiting it. Subsequently, your Task.WaitAll will be waiting for the outer Tasks only, not the Tasks returned by GetValueAsync.
Here is an example of what you could do:
var tasks = Enumerable.Range(0, valueSize)
.Select(async i =>
{
string val = await _tcpRepository.GetValueAsync(i);
Values[i] = val.GetInt32();
});
Your final issue is the use of Task.WaitAll; this introduces the sync-over-async antipattern. You should allow async to grow through your code base, and instead use Task.WhenAll, which returns a Task that completes when all the provided Tasks are complete:
await Task.WhenAll(tasks);

Task.WhenAny blocking

I'm not sure if I'm misunderstanding the usage of Task.WhenAny but in the following code only "0" gets printed when it should print "1" and "2" and then "mx.Name" every time a task finishes:
public async void PopulateMxRecords(List<string> nsRecords, int threads)
{
ThreadPool.SetMinThreads(threads, threads);
var resolver = new DnsStubResolver();
var tasks = nsRecords.Select(ns => resolver.ResolveAsync<MxRecord>(ns, RecordType.Mx));
Console.WriteLine("0");
var finished = Task.WhenAny(tasks);
Console.WriteLine("1");
while (mxNsRecords.Count < nsRecords.Count)
{
Console.WriteLine("2");
var task = await finished;
var mxRecords = await task;
foreach(var mx in mxRecords)
Console.WriteLine(mx.Name);
}
}
The DnsStubResolver is part of ARSoft.Tools.Net.Dns. The nsRecords list contains up to 2 million strings.
I'm not sure if I'm misunderstanding the usage of Task.WhenAny
You might be. The pattern you seem to be looking for is interleaving. In the following example, notice the important changes that I have made:
use ToList() to materialize the LINQ query results,
move WhenAny() into the loop,
use Remove(task) as each task completes, and
run the while loop as long as tasks.Count() > 0.
Those are the important changes. The other changes are there to make your listing into a runnable demo of interleaving, the full listing of which is here: https://dotnetfiddle.net/nr1gQ7
public static async Task PopulateMxRecords(List<string> nsRecords)
{
var tasks = nsRecords.Select(ns => ResolveAsync(ns)).ToList();
while (tasks.Count() > 0)
{
var task = await Task.WhenAny(tasks);
tasks.Remove(task);
var mxRecords = await task;
Console.WriteLine(mxRecords);
}
}

C# Running many async tasks the same time

I'm kinda new to async tasks.
I've a function that takes student ID and scrapes data from specific university website with the required ID.
private static HttpClient client = new HttpClient();
public static async Task<Student> ParseAsync(string departmentLink, int id, CancellationToken ct)
{
string website = string.Format(departmentLink, id);
try
{
string data;
var stream = await client.GetAsync(website, ct);
using (var reader = new StreamReader(await stream.Content.ReadAsStreamAsync(), Encoding.GetEncoding("windows-1256")))
data = reader.ReadToEnd();
//Parse data here and return Student.
} catch (Exception ex)
{
Console.WriteLine(ex.Message);
}
}
And it works correctly. Sometimes though I need to run this function for a lot of students so I use the following
for(int i = ids.first; i <= ids.last; i++)
{
tasks[i - ids.first] = ParseStudentData.ParseAsync(entity.Link, i, cts.Token).ContinueWith(t =>
{
Dispatcher.Invoke(() =>
{
listview_students.Items.Add(t.Result);
//Students.Add(t.Result);
//lbl_count.Content = $"{listview_students.Items.Count}/{testerino.Length}";
});
});
}
I'm storing tasks in an array to wait for them later.
This also works finely as long as the students count is between (0, ~600?) it's kinda random.
And then for every other student that still hasn't been parsed throws A task was cancelled.
Keep in mind that, I never use the cancellation token at all.
I need to run this function on so many students it can reach ~9000 async task altogether. So what's happening?
You are basically creating a denial of service attack on the website when you are queuing up 9000 requests in such a short time frame. Not only is this causing you errors, but it could take down the website. It would be best to limit the number of concurrent requests to a more reasonable value (say 30). While there are probably several ways to do this, one that comes to mind is the following:
private async Task Test()
{
var tasks = new List<Task>();
for (int i = ids.first; i <= ids.last; i++)
{
tasks.Add(/* Do stuff */);
await WaitList(tasks, 30);
}
}
private async Task WaitList(IList<Task> tasks, int maxSize)
{
while (tasks.Count > maxSize)
{
var completed = await Task.WhenAny(tasks).ConfigureAwait(false);
tasks.Remove(completed);
}
}
Other approaches might leverage the producer/consumer pattern using .Net classes such as a BlockingCollection
This is what I ended up with based on #erdomke code:
public static async Task ForEachParallel<T>(
this IEnumerable<T> list,
Func<T, Task> action,
int dop)
{
var tasks = new List<Task>(dop);
foreach (var item in list)
{
tasks.Add(action(item));
while (tasks.Count >= dop)
{
var completed = await Task.WhenAny(tasks).ConfigureAwait(false);
tasks.Remove(completed);
}
}
// Wait for all remaining tasks.
await Task.WhenAll(tasks).ConfigureAwait(false);
}
// usage
await Enumerable
.Range(1, 500)
.ForEachParallel(i => ProcessItem(i), Environment.ProcessorCount);

creating a .net async wrapper to a sync request

I have the following situation (or a basic misunderstanding with the async await mechanism).
Assume you have a set of 1-20 web request call that takes a long time: findItemsByProduct().
you want to wrap it around in an async request, that would be able to abstract all these calls into one async call, but I can't seem to be able to do it without using more threads.
If I'm doing:
int total = result.paginationOutput.totalPages;
for (int i = 2; i < total + 1; i++)
{
await Task.Factory.StartNew(() =>
{
result = client.findItemsByProduct(i);
});
newList.AddRange(result.searchResult.item);
}
}
return newList;
problem here, that the calls don't run together, rather they are waiting one by one.
I would like all the calls to run together and than harvest the results.
as pseudo code, I would like the code to run like this:
forEach item {
result = item.makeWebRequest();
}
foreach item {
List.addRange(item.harvestResults);
}
I have no idea how to make the code to do that though..
Ideally, you should add a findItemsByProductAsync that returns a Task<Item[]>. That way, you don't have to create unnecessary tasks using StartNew or Task.Run.
Then your code can look like this:
int total = result.paginationOutput.totalPages;
// Start all downloads; each download is represented by a task.
Task<Item[]>[] tasks = Enumerable.Range(2, total - 1)
.Select(i => client.findItemsByProductAsync(i)).ToArray();
// Wait for all downloads to complete.
Item[][] results = await Task.WhenAll(tasks);
// Flatten the results into a single collection.
return results.SelectMany(x => x).ToArray();
Given your requirements which I see as:
Process n number of non-blocking tasks
Process results after all queries have returned
I would use the CountdownEvent for this e.g.
var results = new ConcurrentBag<ItemType>(result.pagination.totalPages);
using (var e = new CountdownEvent(result.pagination.totalPages))
{
for (int i = 2; i <= result.pagination.totalPages+1; i++)
{
Task.Factory.StartNew(() => return client.findItemsByProduct(i))
.ContinueWith(items => {
results.AddRange(items);
e.Signal(); // signal task is done
});
}
// Wait for all requests to complete
e.Wait();
}
// Process results
foreach (var item in results)
{
...
}
This particular problem is solved easily enough without even using await. Simply create each of the tasks, put all of the tasks into a list, and then use WhenAll on that list to get a task that represents the completion of all of those tasks:
public static Task<Item[]> Foo()
{
int total = result.paginationOutput.totalPages;
var tasks = new List<Task<Item>>();
for (int i = 2; i < total + 1; i++)
{
tasks.Add(Task.Factory.StartNew(() => client.findItemsByProduct(i)));
}
return Task.WhenAll(tasks);
}
Also note you have a major problem in how you use result in your code. You're having each of the different tasks all using the same variable, so there are race conditions as to whether or not it works properly. You could end up adding the same call twice and having one skipped entirely. Instead you should have the call to findItemsByProduct be the result of the task, and use that task's Result.
If you want to use async-await properly you have to declare your functions async, and the functions that call you also have to be async. This continues until you have once synchronous function that starts the async process.
Your function would look like this:
by the way you didn't describe what's in the list. I assume they are
object of type T. in that case result.SearchResult.Item returns
IEnumerable
private async Task<List<T>> FindItems(...)
{
int total = result.paginationOutput.totalPages;
var newList = new List<T>();
for (int i = 2; i < total + 1; i++)
{
IEnumerable<T> result = await Task.Factory.StartNew(() =>
{
return client.findItemsByProduct(i);
});
newList.AddRange(result.searchResult.item);
}
return newList;
}
If you do it this way, your function will be asynchronous, but the findItemsByProduct will be executed one after another. If you want to execute them simultaneously you should not await for the result, but start the next task before the previous one is finished. Once all tasks are started wait until all are finished. Like this:
private async Task<List<T>> FindItems(...)
{
int total = result.paginationOutput.totalPages;
var tasks= new List<Task<IEnumerable<T>>>();
// start all tasks. don't wait for the result yet
for (int i = 2; i < total + 1; i++)
{
Task<IEnumerable<T>> task = Task.Factory.StartNew(() =>
{
return client.findItemsByProduct(i);
});
tasks.Add(task);
}
// now that all tasks are started, wait until all are finished
await Task.WhenAll(tasks);
// the result of each task is now in task.Result
// the type of result is IEnumerable<T>
// put all into one big list using some linq:
return tasks.SelectMany ( task => task.Result.SearchResult.Item)
.ToList();
// if you're not familiar to linq yet, use a foreach:
var newList = new List<T>();
foreach (var task in tasks)
{
newList.AddRange(task.Result.searchResult.item);
}
return newList;
}

Categories