I'm building a solution that needs to find a desired value from an API call inside a for loop.
I basically need to pass the index of a for loop to an API; one of those indexes will return a desired output, and at that moment I want the for loop to break. But I need to make this process more efficient. I thought of making the calls asynchronous without awaiting each one, so that as soon as some API call returns the desired output the loop breaks.
Each API call takes around 10 seconds, so if I make this asynchronous or multithreaded I would reduce the execution time considerably.
I haven't found any good guidance on how to make async HTTP requests without awaiting each one.
Any suggestions?
for (int i = 0; i < 60000; i += 256)
{
Console.WriteLine("Incrementing_value: " + i);
string response = await client.GetStringAsync(
"http://localhost:7075/api/Function1?index=" + i.ToString());
Console.WriteLine(response);
if (response != "null")
{
//found the desired output
break;
}
}
You can run requests in parallel, and cancel them once you have found your desired output:
public class Program {
public static async Task Main() {
var cts = new CancellationTokenSource();
var client = new HttpClient();
var tasks = new List<Task>();
// This is the number of requests that you want to run in parallel.
const int batchSize = 10;
int requestId = 0;
int batchRequestCount = 0;
while (requestId < 60000) {
if (batchRequestCount == batchSize) {
// Batch size reached, wait for current requests to finish.
await Task.WhenAll(tasks);
tasks.Clear();
batchRequestCount = 0;
}
tasks.Add(MakeRequestAsync(client, requestId, cts));
requestId += 256;
batchRequestCount++;
}
if (tasks.Count > 0) {
// Await any remaining tasks
await Task.WhenAll(tasks);
}
}
private static async Task MakeRequestAsync(HttpClient client, int index, CancellationTokenSource cts) {
if (cts.IsCancellationRequested) {
// The desired output was already found, no need for any more requests.
return;
}
string response;
try {
response = await client.GetStringAsync(
"http://localhost:7075/api/Function1?index=" + index.ToString(), cts.Token);
}
catch (TaskCanceledException) {
// Operation was cancelled.
return;
}
if (response != "null") {
// Cancel all current connections
cts.Cancel();
// Do something with the output ...
}
}
}
Note that this solution uses a simple mechanism to limit the number of concurrent requests; a more advanced solution would make use of semaphores (as mentioned in some of the comments), as sketched below.
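For example, a semaphore-throttled variant could look like the following sketch (a sketch only, assuming .NET 5+ where GetStringAsync accepts a CancellationToken; the method name RunThrottledAsync is illustrative, the URL and step size are taken from the question):
public static async Task RunThrottledAsync()
{
    using var cts = new CancellationTokenSource();
    using var client = new HttpClient();
    // At most 10 requests in flight at any time.
    using var throttler = new SemaphoreSlim(10);
    var tasks = new List<Task>();
    for (int i = 0; i < 60000; i += 256)
    {
        await throttler.WaitAsync(); // wait for a free slot
        if (cts.IsCancellationRequested) { throttler.Release(); break; }
        int index = i; // capture the loop variable for the task body
        tasks.Add(Task.Run(async () =>
        {
            try
            {
                string response = await client.GetStringAsync(
                    "http://localhost:7075/api/Function1?index=" + index, cts.Token);
                if (response != "null") cts.Cancel(); // found the desired output
            }
            catch (OperationCanceledException) { /* expected once cancelled */ }
            finally { throttler.Release(); } // free the slot for the next request
        }));
    }
    await Task.WhenAll(tasks);
}
Unlike the fixed batches above, the semaphore keeps exactly 10 requests in flight and starts a new one as soon as any finishes.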
There are multiple ways to solve this problem. My personal favorite is to use an ActionBlock<T> from the TPL Dataflow library as a processing engine. This component invokes a provided Action<T> delegate for every data element received, and can also be provided with an asynchronous delegate (Func<T, Task>). It has many useful features, including a configurable degree of parallelism/concurrency and cancellation via a CancellationToken. Here is an implementation that takes advantage of those features:
async Task<string> GetStuffAsync()
{
var client = new HttpClient();
var cts = new CancellationTokenSource();
string output = null;
// Define the dataflow block
var block = new ActionBlock<string>(async url =>
{
string response = await client.GetStringAsync(url, cts.Token);
Console.WriteLine($"{url} => {response}");
if (response != "null")
{
// Found the desired output
output = response;
cts.Cancel();
}
}, new ExecutionDataflowBlockOptions()
{
CancellationToken = cts.Token,
MaxDegreeOfParallelism = 10 // Configure this to a desirable value
});
// Feed the block with URLs
for (int i = 0; i < 60000; i += 256)
{
block.Post("http://localhost:7075/api/Function1?index=" + i.ToString());
}
block.Complete();
// Wait for the completion of the block
try { await block.Completion; }
catch (OperationCanceledException) { } // Ignore cancellation errors
return output;
}
The TPL Dataflow library is built into .NET Core / .NET 5, and it is available as a NuGet package (System.Threading.Tasks.Dataflow) for .NET Framework.
The upcoming .NET 6 will feature a new API Parallel.ForEachAsync, that could also be used to solve this problem in a similar fashion.
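For reference, a sketch of how that might look (an assumption based on the API shape in the .NET 6 previews; the URL and step size are from the question, and this would live inside an async method):
var client = new HttpClient();
using var cts = new CancellationTokenSource();
var indexes = new List<int>();
for (int i = 0; i < 60000; i += 256) indexes.Add(i);
string output = null;
var options = new ParallelOptions
{
    MaxDegreeOfParallelism = 10,
    CancellationToken = cts.Token
};
try
{
    await Parallel.ForEachAsync(indexes, options, async (index, token) =>
    {
        string response = await client.GetStringAsync(
            "http://localhost:7075/api/Function1?index=" + index, token);
        if (response != "null")
        {
            output = response; // found the desired output
            cts.Cancel();      // stop scheduling further iterations
        }
    });
}
catch (OperationCanceledException) { } // expected once the output is found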
Related
I wrote a web crawler and I want to know if my approach is correct. The only issue I'm facing is that it stops after some hours of crawling. No exception, it just stops.
1 - the private members and the constructor:
private const int CONCURRENT_CONNECTIONS = 5;
private readonly HttpClient _client;
private readonly string[] _services = new string[2] {
"https://example.com/items?id=ID_HERE",
"https://another_example.com/items?id=ID_HERE"
};
private readonly List<SemaphoreSlim> _semaphores;
public Crawler() {
ServicePointManager.DefaultConnectionLimit = CONCURRENT_CONNECTIONS;
_client = new HttpClient();
_semaphores = new List<SemaphoreSlim>();
foreach (var _ in _services) {
_semaphores.Add(new SemaphoreSlim(CONCURRENT_CONNECTIONS));
}
}
Single HttpClient instance.
The _services field is just a string array that contains the URLs; they are not on the same domain.
I'm using semaphores (one per domain) since I read that it's not a good idea to rely on the built-in network request queue (I don't remember what it's called).
2 - The Run method, which is the one I will call to start crawling.
public async Task Run(List<int> ids) {
const int BATCH_COUNT = 1000;
var svcIndex = 0;
var tasks = new List<Task<string>>(BATCH_COUNT);
foreach (var itemId in ids) {
tasks.Add(DownloadItem(svcIndex, _services[svcIndex].Replace("ID_HERE", $"{itemId}")));
if (++svcIndex >= _services.Length) {
svcIndex = 0;
}
if (tasks.Count >= BATCH_COUNT) {
var results = await Task.WhenAll(tasks);
await SaveDownloadedData(results);
tasks.Clear();
}
}
if (tasks.Count > 0) {
var results = await Task.WhenAll(tasks);
await SaveDownloadedData(results);
tasks.Clear();
}
}
DownloadItem is an async function that actually makes the GET request; note that I'm not awaiting it here.
If the number of tasks reaches BATCH_COUNT, I await them all and save the results to a file.
3 - The DownloadItem function.
private async Task<string> DownloadItem(int serviceIndex, string link) {
var needReleaseSemaphore = true;
var result = string.Empty;
try {
await _semaphores[serviceIndex].WaitAsync();
var r = await _client.GetStringAsync(link);
_semaphores[serviceIndex].Release();
needReleaseSemaphore = false;
// DUE TO JSON SIZE, I NEED TO REMOVE A VALUE (IT'S USELESS FOR ME)
var obj = JObject.Parse(r);
if (obj.ContainsKey("blah")) {
obj.Remove("blah");
}
result = obj.ToString(Formatting.None);
} catch {
result = string.Empty;
// SINCE I GOT AN EXCEPTION, I WILL 'LOCK' THIS SERVICE FOR 1 MINUTE.
// IF I RELEASED THIS SEMAPHORE, I WILL LOCK IT AGAIN FIRST.
if (!needReleaseSemaphore) {
await _semaphores[serviceIndex].WaitAsync();
needReleaseSemaphore = true;
}
await Task.Delay(60_000);
} finally {
// RELEASE THE SEMAPHORE, IF NEEDED.
if (needReleaseSemaphore) {
_semaphores[serviceIndex].Release();
}
}
return result;
}
4- The function that saves the result.
private async Task SaveDownloadedData(List<string> myData) {
await using var fs = new FileStream("./output.dat", FileMode.Append);
foreach (var res in myData) {
var blob = Encoding.UTF8.GetBytes(res);
await fs.WriteAsync(BitConverter.GetBytes((uint)blob.Length));
await fs.WriteAsync(blob);
}
}
5- Finally, the Main function.
static async Task Main(string[] args) {
var crawler = new Crawler();
var items = LoadItemIds();
await crawler.Run(items);
}
After all this, is my approach correct? I need to make millions of requests, which will take some weeks/months to gather all the data I need (due to the connection limit).
After 12-14 hours it just stops, and I need to restart the app manually (memory usage is fine; my VPS has 1 GB and it never uses more than 60%).
I would like to process a list of 50,000 URLs through a web service. The provider of this service allows 5 connections per second.
I need to process these URLs in parallel while adhering to the provider's rules.
This is my current code:
static void Main(string[] args)
{
process_urls().GetAwaiter().GetResult();
}
public static async Task process_urls()
{
// let's say there is a list of 50,000+ URLs
var urls = System.IO.File.ReadAllLines("urls.txt");
var allTasks = new List<Task>();
var throttler = new SemaphoreSlim(initialCount: 5);
foreach (var url in urls)
{
await throttler.WaitAsync();
allTasks.Add(
Task.Run(async () =>
{
try
{
Console.WriteLine(String.Format("Starting {0}", url));
var client = new HttpClient();
var xml = await client.GetStringAsync(url);
//do some processing on xml output
client.Dispose();
}
finally
{
throttler.Release();
}
}));
}
await Task.WhenAll(allTasks);
}
Instead of var client = new HttpClient(); I will create a new object for the target web service, but this is just to keep the code generic.
Is this the correct approach to handle and process a huge list of connections? And is there any way I can limit the number of established connections per second to 5? The current implementation does not consider any timeframe.
Thanks
Reading values from a web service is an I/O operation, which can be done asynchronously without multithreading.
The threads would do nothing but wait for a response in this case, so using parallel threads is just a waste of resources.
public static async Task process_urls()
{
var urls = System.IO.File.ReadAllLines("urls.txt");
var allTasks = new List<Task>();
foreach (var urlGroup in SplitToGroupsOfFive(urls))
{
var tasks = new List<Task>();
foreach(var url in urlGroup)
{
var task = ProcessUrl(url);
tasks.Add(task);
}
// This delay ensures that the next 5 URLs are processed only after 1 second has passed
tasks.Add(Task.Delay(1000));
await Task.WhenAll(tasks.ToArray());
}
}
private static async Task ProcessUrl(string url)
{
using (var client = new HttpClient())
{
var xml = await client.GetStringAsync(url);
//do some processing on xml output
}
}
private static IEnumerable<IEnumerable<string>> SplitToGroupsOfFive(IEnumerable<string> urls)
{
const int GROUP_SIZE = 5;
string[] group = null;
int count = 0;
foreach (var url in urls)
{
if (group == null)
group = new string[GROUP_SIZE];
group[count] = url;
count++;
if (count < GROUP_SIZE)
continue;
yield return group;
group = null;
count = 0;
}
if (group != null && count > 0)
{
// Return only the URLs actually filled into the final partial group.
yield return group.Take(count);
}
}
Because you mention that "processing" of the response is also an I/O operation, the async/await approach is the most efficient: it uses only one thread and processes other tasks while previous tasks are waiting for a response from the web service or for file-writing I/O operations to finish.
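If you need a stricter rolling limit (at most 5 requests started in any one-second window, rather than fixed batches), one possible sketch uses a semaphore whose slots are released one second after each request starts (this is an illustration, not part of the original answer; ProcessUrl is the method defined above):
var throttler = new SemaphoreSlim(5);
var tasks = new List<Task>();
foreach (var url in urls)
{
    await throttler.WaitAsync();
    // Free this slot one second after the request starts, so at most
    // 5 requests can begin within any one-second window.
    _ = Task.Delay(1000).ContinueWith(t => throttler.Release());
    tasks.Add(ProcessUrl(url));
}
await Task.WhenAll(tasks);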
I would like to fire several tasks while setting a timeout on them. The idea is to gather the results from the tasks that beat the clock, and cancel (or even just ignore) the other tasks.
I tried using the WithCancellation extension method as explained here; however, a thrown exception caused WhenAll to return and supply no results.
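For context, the WithCancellation extension referred to above is commonly implemented along these lines (a sketch of the usual pattern; the linked version may differ in details):
public static async Task<T> WithCancellation<T>(
    this Task<T> task, CancellationToken cancellationToken)
{
    var tcs = new TaskCompletionSource<bool>();
    using (cancellationToken.Register(
        s => ((TaskCompletionSource<bool>)s).TrySetResult(true), tcs))
    {
        // Whichever finishes first wins: the real task or the cancellation signal.
        if (task != await Task.WhenAny(task, tcs.Task))
            throw new OperationCanceledException(cancellationToken);
    }
    return await task;
}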
Here's what I tried, but I'm open to other directions as well (note, however, that I need to use await rather than Task.Run since I need the httpContext in the tasks):
var cts = new CancellationTokenSource(TimeSpan.FromSeconds(3));
IEnumerable<Task<MyResults>> tasks =
from url in urls
select taskAsync(url).WithCancellation(cts.Token);
Task<MyResults>[] excutedTasks = null;
MyResults[] res = null;
try
{
// Execute the query and start the searches:
excutedTasks = tasks.ToArray();
res = await Task.WhenAll(excutedTasks);
}
catch (Exception exc)
{
if (excutedTasks != null)
{
foreach (Task<MyResults> faulted in excutedTasks.Where(t => t.IsFaulted))
{
// work with faulted and faulted.Exception
}
}
}
// work with res
EDIT:
Following #Servy's answer below, this is the implementation I went with:
var cts = new CancellationTokenSource(TimeSpan.FromSeconds(3));
IEnumerable<Task<MyResults>> tasks =
from url in urls
select taskAsync(url).WithCancellation(cts.Token);
// Execute the query and start the searches:
Task<MyResults>[] excutedTasks = tasks.ToArray();
try
{
await Task.WhenAll(excutedTasks);
}
catch (OperationCanceledException)
{
// Do nothing - we expect this if a timeout has occurred
}
IEnumerable<Task<MyResults>> completedTasks = excutedTasks.Where(t => t.Status == TaskStatus.RanToCompletion);
var results = completedTasks.Select(t => t.Result).ToList(); // safe: these tasks ran to completion
If any of the tasks fail to complete, you are correct that WhenAll doesn't return the results of those that did; it just wraps an aggregate exception of all the failures. Fortunately, you have the original collection of tasks, so you can get the results of the ones that completed successfully from there.
var completedTasks = excutedTasks.Where(t => t.Status == TaskStatus.RanToCompletion);
Just use that instead of res.
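For example, to materialize the successful results (a minimal sketch; reading Result is safe here because these tasks have already run to completion):
MyResults[] results = excutedTasks
    .Where(t => t.Status == TaskStatus.RanToCompletion)
    .Select(t => t.Result)
    .ToArray();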
I tried your code and it worked just fine, except that the cancelled tasks are not in a Faulted state, but rather in a Canceled one. So if you want to process the cancelled tasks, use t.IsCanceled instead. The non-cancelled tasks ran to completion. Here is the code I used:
public static async Task MainAsync()
{
var urls = new List<string> {"url1", "url2", "url3", "url4", "url5", "url6"};
var cts = new CancellationTokenSource(TimeSpan.FromSeconds(3));
IEnumerable<Task<MyResults>> tasks =
from url in urls
select taskAsync(url).WithCancellation(cts.Token);
Task<MyResults>[] excutedTasks = null;
MyResults[] res = null;
try
{
// Execute the query and start the searches:
excutedTasks = tasks.ToArray();
res = await Task.WhenAll(excutedTasks);
}
catch (Exception exc)
{
if (excutedTasks != null)
{
foreach (Task<MyResults> faulted in excutedTasks.Where(t => t.IsFaulted))
{
// work with faulted and faulted.Exception
}
}
}
}
public static async Task<MyResults> taskAsync(string url)
{
Console.WriteLine("Start " + url);
var random = new Random();
var delay = random.Next(10);
await Task.Delay(TimeSpan.FromSeconds(delay));
Console.WriteLine("End " + url);
return new MyResults();
}
private static void Main(string[] args)
{
MainAsync().Wait();
}
I have written this code. It recursively creates folders in the web system by making REST Calls.
So basically, it creates a folder for the root node, then gets all the child nodes and parallely and recursively calls itself. (for each child)
The only problem with the code is that if a node has too many children, or if the hierarchy is too deep, I start getting "TaskCancellation" errors.
I have already tried increasing the timeout to 10 minutes, but that does not solve the problem.
So my question is: how can I start, say, 50 tasks, then wait for one to finish and proceed only when there is an open slot among the 50?
Currently I think my code keeps creating tasks without any limit as it flows through the hierarchy.
public async Task CreateSPFolder(Node node, HttpClient client, string docLib, string currentPath = null)
{
string nodeName = Uri.EscapeDataString(node.Name); // assuming Node exposes the folder name via a Name property
var request = new { __metadata = new { type = "SP.Folder" }, ServerRelativeUrl = nodeName };
string jsonRequest = JsonConvert.SerializeObject(request);
StringContent strContent = new StringContent(jsonRequest);
strContent.Headers.ContentType = MediaTypeHeaderValue.Parse("application/json;odata=verbose");
HttpResponseMessage resp = await client.PostAsync(cmd, strContent);
if (resp.IsSuccessStatusCode)
{
currentPath = (currentPath == null) ? nodeName : currentPath + "/" + nodeName;
}
else
{
string content = await resp.Content.ReadAsStringAsync();
Console.WriteLine(content);
throw new Exception("Failed to create folder " + content);
}
List<Task> taskList = new List<Task>();
node.Children.ToList().ForEach(c => taskList.Add(CreateSPFolder(c, client, docLib, currentPath)));
Task.WaitAll(taskList.ToArray());
}
You can use a SemaphoreSlim to control the number of concurrent tasks. You initialize the semaphore to the maximum number of tasks you want to have and then each time you execute a task you acquire the semaphore and then release it when you are finished with the task.
This is a somewhat simplified version of your code that runs forever using random numbers and executes a maximum of 2 tasks at the same time.
class Program
{
private static SemaphoreSlim semaphore = new SemaphoreSlim(2, 2);
public static async Task CreateSPFolder(int folder)
{
try
{
await semaphore.WaitAsync();
Console.WriteLine("Executing " + folder);
Console.WriteLine("WaitAsync - CurrentCount " + semaphore.CurrentCount);
await Task.Delay(2000);
}
finally
{
Console.WriteLine("Finished Executing " + folder);
semaphore.Release();
Console.WriteLine("Release - CurrentCount " + semaphore.CurrentCount);
}
var rand = new Random();
var next = rand.Next(10);
var children = Enumerable.Range(1, next).ToList();
Task.WaitAll(children.Select(CreateSPFolder).ToArray());
}
static void Main(string[] args)
{
CreateSPFolder(1).Wait();
Console.ReadKey();
}
}
First of all, I think your problem is not the number of tasks but the number of blocked threads waiting at Task.WaitAll(taskList.ToArray());. It's better to wait asynchronously in such cases (i.e. await Task.WhenAll(taskList);).
Secondly, you can use TPL Dataflow's ActionBlock with a MaxDegreeOfParallelism set to 50 and just post to it for every folder to be created. That way you have a flat queue of work to be executed and when it's empty, you're done.
Pseudo code:
ActionBlock<FolderInfo> block = null;
block = new ActionBlock<FolderInfo>(
    async folderInfo => {
        await CreateFolderAsync(folderInfo);
        foreach (var subFolder in GetSubFolders(folderInfo))
        {
            block.Post(subFolder);
        }
    },
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 50 });
block.Post(rootFolderInfo);
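One detail the pseudo code leaves out is completion: because the block posts to itself, you cannot simply call Complete() right after posting the root. A possible sketch (still an illustration, with the hypothetical CreateFolderAsync/GetSubFolders helpers from above) tracks outstanding folders with a counter:
int pending = 1; // the root folder
ActionBlock<FolderInfo> block = null;
block = new ActionBlock<FolderInfo>(
    async folderInfo =>
    {
        await CreateFolderAsync(folderInfo);
        foreach (var subFolder in GetSubFolders(folderInfo))
        {
            Interlocked.Increment(ref pending);
            block.Post(subFolder);
        }
        // When the last outstanding folder is done, complete the block.
        if (Interlocked.Decrement(ref pending) == 0)
            block.Complete();
    },
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 50 });
block.Post(rootFolderInfo);
await block.Completion;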
I'm using the following code to post an image to a server.
var image = Image.FromFile(@"C:\Image.jpg");
Task<string> upload = Upload(image);
upload.Wait();
public static async Task<string> Upload(Image image)
{
var uriBuilder = new UriBuilder
{
Host = "somewhere.net",
Path = "/path/",
Port = 443,
Scheme = "https",
Query = "process=false"
};
using (var client = new HttpClient())
{
client.DefaultRequestHeaders.Add("locale", "en_US");
client.DefaultRequestHeaders.Add("country", "US");
var content = ConvertToHttpContent(image);
content.Headers.ContentType = MediaTypeHeaderValue.Parse("image/jpeg");
using (var mpcontent = new MultipartFormDataContent("--myFakeDividerText--")
{
{content, "fakeImage", "myFakeImageName.jpg"}
}
)
{
using (
var message = await client.PostAsync(uriBuilder.Uri, mpcontent))
{
var input = await message.Content.ReadAsStringAsync();
return "nothing for now";
}
}
}
}
I'd like to modify this code to run multiple threads. I've used "ThreadPool.QueueUserWorkItem" before and started to modify the code to leverage it.
private void UseThreadPool()
{
int minWorker, minIOC;
ThreadPool.GetMinThreads(out minWorker, out minIOC);
ThreadPool.SetMinThreads(1, minIOC);
int maxWorker, maxIOC;
ThreadPool.GetMaxThreads(out maxWorker, out maxIOC);
ThreadPool.SetMinThreads(4, maxIOC);
var events = new List<ManualResetEvent>();
foreach (var image in ImageCollection)
{
var resetEvent = new ManualResetEvent(false);
ThreadPool.QueueUserWorkItem(
arg =>
{
var img = Image.FromFile(image.getPath());
Task<string> upload = Upload(img);
upload.Wait();
resetEvent.Set();
});
events.Add(resetEvent);
if (events.Count <= 0) continue;
foreach (ManualResetEvent e in events) e.WaitOne();
}
}
The problem is that only one thread executes at a time due to the call to "upload.Wait()", so I'm still executing each upload in sequence. It's not clear to me how I can use PostAsync with a thread pool.
How can I post images to a server using multiple threads by tweaking the code above? Is HttpClient PostAsync the best way to do this?
I'd like to modify this code to run multiple threads.
Why? The thread pool should only be used for CPU-bound work (and I/O completions, of course).
You can do concurrency just fine with async:
var tasks = ImageCollection.Select(image =>
{
var img = Image.FromFile(image.getPath());
return Upload(img);
});
await Task.WhenAll(tasks);
Note that I removed your Wait. You should avoid using Wait or Result with async tasks; use await instead. Yes, this will cause async to grow through your code, and you should use async "all the way".
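As a minimal sketch of what "async all the way" means here (assuming you can change the calling method's signature; UploadAllAsync is a hypothetical name, while ImageCollection, getPath, and Upload are from the code above):
private async Task UploadAllAsync()
{
    var tasks = ImageCollection.Select(image =>
    {
        var img = Image.FromFile(image.getPath());
        return Upload(img);
    });
    await Task.WhenAll(tasks);
}
The caller then awaits UploadAllAsync() instead of blocking, and so on up the call chain.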