I have a list of URLs of pages I want to download concurrently using HttpClient. The list of URLs can be large (100 or more!)
I currently have this code:
var urls = new List<string>
{
#"http:\\www.amazon.com",
#"http:\\www.bing.com",
#"http:\\www.facebook.com",
#"http:\\www.twitter.com",
#"http:\\www.google.com"
};
var client = new HttpClient();
var contents = urls
.ToObservable()
.SelectMany(uri => client.GetStringAsync(new Uri(uri, UriKind.Absolute)));
contents.Subscribe(Console.WriteLine);
The problem: due to the usage of SelectMany, a big bunch of Tasks are created almost at the same time. It seems that if the list of URLs is big enough, a lot of Tasks time out (I'm getting "A Task was cancelled" exceptions).
So, I thought there should be a way, maybe using some kind of Scheduler, to limit the number of concurrent Tasks, not allowing more than 5 or 6 at a given time.
This way I could get concurrent downloads without launching too many tasks that may stall, like they do right now.
How can I do that so I don't end up with lots of timed-out Tasks?
Remember SelectMany() is actually Select().Merge(). While SelectMany does not have a maxConcurrent parameter, Merge() does, so you can use that.
From your example, you can do this:
var urls = new List<string>
{
#"http:\\www.amazon.com",
#"http:\\www.bing.com",
#"http:\\www.facebook.com",
#"http:\\www.twitter.com",
#"http:\\www.google.com"
};
var client = new HttpClient();
var contents = urls
.ToObservable()
.Select(uri => Observable.FromAsync(() => client.GetStringAsync(uri)))
.Merge(2); // 2 maximum concurrent requests!
contents.Subscribe(Console.WriteLine);
Here is an example of how you can do it with the DataFlow API:
private static Task DoIt()
{
var urls = new List<string>
{
#"http:\\www.amazon.com",
#"http:\\www.bing.com",
#"http:\\www.facebook.com",
#"http:\\www.twitter.com",
#"http:\\www.google.com"
};
var client = new HttpClient();
//Create a block that takes a URL as input
//and produces the download result as output
TransformBlock<string,string> downloadBlock =
new TransformBlock<string, string>(
uri => client.GetStringAsync(new Uri(uri, UriKind.Absolute)),
new ExecutionDataflowBlockOptions
{
//At most 2 download operation execute at the same time
MaxDegreeOfParallelism = 2
});
//Create a block that prints out the result
ActionBlock<string> doneBlock =
new ActionBlock<string>(x => Console.WriteLine(x));
//Link the output of the first block to the input of the second one
downloadBlock.LinkTo(
doneBlock,
new DataflowLinkOptions { PropagateCompletion = true});
//input the urls into the first block
foreach (var url in urls)
{
downloadBlock.Post(url);
}
downloadBlock.Complete(); //Mark completion of input
//Allows consumer to wait for the whole operation to complete
return doneBlock.Completion;
}
static void Main(string[] args)
{
DoIt().Wait();
Console.WriteLine("Done");
Console.ReadLine();
}
Can you see if this helps?
var urls = new List<string>
{
#"http:\\www.amazon.com",
#"http:\\www.bing.com",
#"http:\\www.google.com",
#"http:\\www.twitter.com",
#"http:\\www.google.com"
};
var contents =
urls
.ToObservable()
.SelectMany(uri =>
Observable
.Using(
() => new System.Net.Http.HttpClient(),
client =>
client
.GetStringAsync(new Uri(uri, UriKind.Absolute))
.ToObservable()));
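Note that nothing runs until you subscribe; a minimal way to consume the results (just printing them here, which is an assumption about what you want to do with them) would be:
contents.Subscribe(
    content => Console.WriteLine(content),
    ex => Console.WriteLine("Download failed: " + ex.Message));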
I have a list of URLs (thousands), and I want to asynchronously get the page data from each URL as fast as possible without putting extreme load on the CPU.
I have tried using threading but it still feels quite slow:
public static ConcurrentQueue<string> List = new ConcurrentQueue<string>(); //URL List (assume I added them already)
public static void Threading()
{
for(int i=0;i<100;i++) //100 threads
{
Thread thread = new Thread(new ThreadStart(Task));
thread.Start();
}
}
public static void Task()
{
while (!List.IsEmpty)
{
List.TryDequeue(out string URL);
//GET REQUEST HERE
}
}
Is there any better way to do this? I want to do this asynchronously but I can't figure out how to do it, and I don't want to sacrifice speed or CPU efficiency to do so.
Thanks :)
You should use Microsoft's Reactive Framework (aka Rx) - NuGet System.Reactive and add using System.Reactive.Linq; - then you can do this:
public static IObservable<(string url, string content)> GetAllUrls(List<string> urls) =>
Observable
.Using(
() => new HttpClient(),
hc =>
from url in urls.ToObservable()
from response in Observable.FromAsync(() => hc.GetAsync(url))
from content in Observable.FromAsync(() => response.Content.ReadAsStringAsync())
select (url, content));
That allows you to consume the results in a couple of ways.
You can process them as they get produced:
IDisposable subscription =
GetAllUrls(urls).Subscribe(x => Console.WriteLine(x.content));
Or you can get all of them produced and then await the full results:
(string url, string content)[] results = await GetAllUrls(urls).ToArray();
You are best off using HttpClient which allows async Task requests.
Just store each task in a list, and await the whole list. To prevent too many requests at once, wait for any single one to complete if there are too many, and remove the completed one from the list.
const int maxDegreeOfParallelism = 100;
static HttpClient _client = new HttpClient();
public static async Task GetAllUrls(List<string> urls)
{
var tasks = new List<Task>(urls.Count);
foreach (var url in urls)
{
if (tasks.Count == maxDegreeOfParallelism) // this prevents too many requests at once
tasks.Remove(await Task.WhenAny(tasks));
tasks.Add(GetUrl(url));
}
await Task.WhenAll(tasks);
}
private static async Task GetUrl(string url)
{
using var response = await _client.GetAsync(url);
// handle response here
var responseStr = await response.Content.ReadAsStringAsync(); // whatever
// do stuff etc
}
In my scenario, I need to process a list of items (pseudo code is below), the number of which could be hundreds or thousands. So, what is an efficient way to handle this? Are there some patterns/best practices for this kind of scenario?
Some specific questions are:
I think I should change the sync call on QueryResultAsync to async first, but the Microsoft docs don't recommend using async/await in a tight loop. So, any workaround?
Should I consider using multiple tasks running concurrently to reduce latency? E.g., say there are 100 items to process, and I create 10 tasks (one per item) running at the same time, WaitAll() on them, and repeat, so the 100 items are finished in 10 rounds. Is this better?
Should I consider producer/consumer pattern, where 3 producers for web requests and one consumer to handle the results?
Please let me know if more info about my scenario is needed.
public List<string> Process(List<string> items)
{
List<string> resultItems = new List<string>(items.Count);
foreach (string item in items)
{
string result = QueryResultAsync(item).GetAwaiter().GetResult(); // need to send http request for each item with different urls
resultItems.Add(ProcessResult(result));
}
return resultItems;
}
private static string ProcessResult(string item){
// some plain processing logic without I/O
return result; // a string value
}
Since these are IO-bound workloads, you can simply use the async/await pattern with Task.WhenAll and let the task scheduler deal with the details:
public async Task<List<string>> Process(List<string> items)
{
var tasks = items.Select(x => QueryResultAsync(x));
var results = await Task.WhenAll(tasks);
return results.Select(x => ProcessResult(x)).ToList();
}
If you are interested in multiple producers, you could use a TPL Dataflow pipeline, which can better partition and deal with max concurrent requests, and then pipe your results into the processor.
A nonsensical example
// create some blocks
var queryBlock = new TransformBlock<string, string>(
QueryResultAsync,
new ExecutionDataflowBlockOptions()
{
EnsureOrdered = false,
MaxDegreeOfParallelism = 50 // ??
});
var processBlock = new TransformBlock<string, string>(
ProcessResult,
new ExecutionDataflowBlockOptions()
{
MaxDegreeOfParallelism = 5, // ??
});
var someOtherAction = new ActionBlock<string>(x => Console.WriteLine(x));
//link them together
queryBlock.LinkTo(processBlock, new DataflowLinkOptions() {PropagateCompletion = true});
processBlock.LinkTo(someOtherAction, new DataflowLinkOptions() {PropagateCompletion = true});
// produce some junk
for (var i = 0; i < 10; i++)
await queryBlock.SendAsync(i.ToString());
// wait for it all to finish
queryBlock.Complete();
await someOtherAction.Completion;
Output
0
8
7
1
2
5
6
3
4
9
There are many ways you can configure a pipeline and the blocks have many options; this is just an example.
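For instance, a sketch of an options object combining a few of them (the numbers are purely illustrative, and cancellationToken is assumed to come from the caller):
var options = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 10,   // how many items the block processes concurrently
    BoundedCapacity = 100,         // SendAsync waits when the block's buffer is full
    EnsureOrdered = false,         // outputs may be emitted out of order
    CancellationToken = cancellationToken
};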
So I had to create dozens of API requests, get the JSON from each, turn it into an object, and put it in a list.
I also wanted the requests to be parallel because I do not care about the order in which the objects enter the list.
public ConcurrentBag<myModel> GetlistOfDstAsync()
{
var req = new RequestGenerator();
var InitializedObjects = req.GetInitializedObjects();
var myList = new ConcurrentBag<myModel>();
Parallel.ForEach(InitializedObjects, async item =>
{
RestRequest request = new RestRequest("resource",Method.GET);
request.AddQueryParameter("key", item.valueOne);
request.AddQueryParameter("key", item.valueTwo);
var results = await GetAsync<myModel>(request);
myList.Add(results);
});
return myList;
}
This creates a new problem: I do not understand how to get the results into the list, and it seems I am not correctly using ConcurrentBag, which looks like it exists exactly for this.
Is my assumption correct and I implement it wrong or should I use another solution?
I also wanted the requests to be parallel
What you actually want is concurrent requests, not parallelism. Parallel.ForEach does not work as expected with async lambdas: the delegates are treated as async void, so the loop considers each iteration finished at the first await and returns before the requests complete, which is why your bag can come back empty.
To do asynchronous concurrency, you start each request but do not await the tasks yet. Then, you can await all the tasks together and get the responses using Task.WhenAll:
public async Task<myModel[]> GetlistOfDstAsync()
{
var req = new RequestGenerator();
var InitializedObjects = req.GetInitializedObjects();
var tasks = InitializedObjects.Select(async item =>
{
RestRequest request = new RestRequest("resource",Method.GET);
request.AddQueryParameter("key", item.valueOne);
request.AddQueryParameter("key", item.valueTwo);
return await GetAsync<myModel>(request);
}).ToList();
var results = await Task.WhenAll(tasks);
return results;
}
I am describing my problem with a simple example first, and then with an example closer to my real scenario.
Imagine we have n items [i1, i2, i3, i4, ..., in] in box1, and a box2 that can work on m items at a time (m is usually much less than n). The time required for each item is different. I want to always have m items being worked on until all items have been processed.
The closer scenario: you have a list1 of n strings (URL addresses) of files, and you want a system that downloads m files concurrently (for example via httpclient.GetAsync()). Whenever one of the m downloads finishes, another remaining item from list1 must take its place as soon as possible, and this must continue until all items of list1 have been processed.
(n and m are specified by user input at runtime)
How can this be done?
Here is a generic method you can use.
When you call this, TIn will be string (the URL address) and asyncProcessor will be your async method that takes the URL address as input and returns a Task.
The SemaphoreSlim used by this method allows only n concurrent async I/O requests at a time; as soon as one completes, the next request starts. Something like a sliding window pattern.
public static Task ForEachAsync<TIn>(
IEnumerable<TIn> inputEnumerable,
Func<TIn, Task> asyncProcessor,
int? maxDegreeOfParallelism = null)
{
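// DefaultMaxDegreeOfParallelism is assumed to be a class-level constant defined elsewhere (e.g. 4).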
int maxAsyncThreadCount = maxDegreeOfParallelism ?? DefaultMaxDegreeOfParallelism;
SemaphoreSlim throttler = new SemaphoreSlim(maxAsyncThreadCount, maxAsyncThreadCount);
IEnumerable<Task> tasks = inputEnumerable.Select(async input =>
{
await throttler.WaitAsync().ConfigureAwait(false);
try
{
await asyncProcessor(input).ConfigureAwait(false);
}
finally
{
throttler.Release();
}
});
return Task.WhenAll(tasks);
}
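For the download scenario a call could look roughly like this (a sketch; urls is assumed to be your List<string> of addresses, and what you do with each response is up to you):
var client = new HttpClient();
await ForEachAsync(urls, async url =>
{
    string content = await client.GetStringAsync(url);
    // do something with the downloaded content here
}, maxDegreeOfParallelism: 4); // at most 4 downloads in flight at once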
You should look into TPL Dataflow: add the System.Threading.Tasks.Dataflow NuGet package to your project, and then what you want is as simple as
private static HttpClient _client = new HttpClient();
public async Task<List<MyClass>> ProcessDownloads(IEnumerable<string> uris,
int concurrentDownloads)
{
var result = new List<MyClass>();
var downloadData = new TransformBlock<string, string>(async uri =>
{
return await _client.GetStringAsync(uri); //GetStringAsync is a thread safe method.
}, new ExecutionDataflowBlockOptions{MaxDegreeOfParallelism = concurrentDownloads});
var processData = new TransformBlock<string, MyClass>(
json => JsonConvert.DeserializeObject<MyClass>(json),
new ExecutionDataflowBlockOptions {MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded});
var collectData = new ActionBlock<MyClass>(
data => result.Add(data)); //When you don't specify options, dataflow processes items one at a time.
//Set up the chain of blocks, have it call `.Complete()` on the next block when the current block finishes processing its last item.
downloadData.LinkTo(processData, new DataflowLinkOptions {PropagateCompletion = true});
processData.LinkTo(collectData, new DataflowLinkOptions {PropagateCompletion = true});
//Load the data in to the first transform block to start off the process.
foreach (var uri in uris)
{
await downloadData.SendAsync(uri).ConfigureAwait(false);
}
downloadData.Complete(); //Signal you are done adding data.
//Wait for the last object to be added to the list.
await collectData.Completion.ConfigureAwait(false);
return result;
}
In the above code only concurrentDownloads downloads will be in flight at any given time, an unbounded number of thread-pool tasks will be processing the received strings and turning them into objects, and a single thread will take those objects and add them to a list.
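Consuming it is then just a matter of awaiting the returned task (assuming you are inside an async method in the same class):
List<MyClass> results = await ProcessDownloads(uris, concurrentDownloads: 4);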
UPDATE: here is a simplified example that only does what you asked for in the question
private static HttpClient _client = new HttpClient();
public void ProcessDownloads(IEnumerable<string> uris, int concurrentDownloads)
{
var downloadData = new ActionBlock<string>(async uri =>
{
var response = await _client.GetAsync(uri); //GetAsync is a thread safe method.
//do something with response here.
}, new ExecutionDataflowBlockOptions{MaxDegreeOfParallelism = concurrentDownloads});
foreach (var uri in uris)
{
downloadData.Post(uri);
}
downloadData.Complete();
downloadData.Completion.Wait();
}
A simple solution for throttling is a SemaphoreSlim.
EDIT
After a slight alteration the code now creates the tasks when they are needed
var client = new HttpClient();
SemaphoreSlim semaphore = new SemaphoreSlim(m, m); //set the max here
var tasks = new List<Task>();
foreach(var url in urls)
{
// moving the wait here throttles the foreach loop
await semaphore.WaitAsync();
tasks.Add(((Func<Task>)(async () =>
{
    //await semaphore.WaitAsync();
    try
    {
        var response = await client.GetAsync(url); // possibly ConfigureAwait(false) here
        // do something with response
    }
    finally
    {
        semaphore.Release(); // release even if the request throws
    }
}))());
}
await Task.WhenAll(tasks);
This is another way to do it
var client = new HttpClient();
var tasks = new HashSet<Task>();
foreach(var url in urls)
{
if(tasks.Count == m)
{
tasks.Remove(await Task.WhenAny(tasks));
}
tasks.Add(((Func<Task>)(async () =>
{
var response = await client.GetAsync(url); // possibly ConfigureAwait(false) here
// do something with response
}))());
}
await Task.WhenAll(tasks);
Process items in parallel, limiting the number of simultaneous jobs:
string[] strings = GetStrings(); // Items to process.
const int m = 2; // Max simultaneous jobs.
Parallel.ForEach(strings, new ParallelOptions {MaxDegreeOfParallelism = m}, s =>
{
DoWork(s);
});
I have a program where I download files from the Internet and process them. Following is the function that I have written to download the file using threads.
Task<File> re = Task.Factory.StartNew(() => { /*Download the File*/ });
re.ContinueWith((x) => { /*Do another function*/ });
I now want it to use only 10 threads for downloading. I have looked into the ParallelOptions.MaxDegreeOfParallelism property, but I can't understand how to use it when the task returns a result.
One good way to do that is to use the DataFlow API. To use it, you have to install the Microsoft.Tpl.Dataflow NuGet package.
Assuming that you have the following methods for downloading and processing data:
public async Task<DownloadResult> DownloadFile(string url)
{
//Asynchronously download the file and return the result of the download.
//You don't need a thread to download the file if you use asynchronous API.
}
public ProcessingResult ProcessDownloadResult(DownloadResult download_result)
{
//Synchronously process the download result and produce a ProcessingResult.
}
And assuming that you have a list of URLs that you want to download:
List<string> urls = new List<string>();
Then you can do the following with the DataFlow API:
TransformBlock<string,DownloadResult> download_block =
new TransformBlock<string, DownloadResult>(
url => DownloadFile(url),
new ExecutionDataflowBlockOptions
{
//Only 10 asynchronous download operations
//can happen at any point in time.
MaxDegreeOfParallelism = 10
});
TransformBlock<DownloadResult, ProcessingResult> process_block =
new TransformBlock<DownloadResult, ProcessingResult>(
dr => ProcessDownloadResult(dr),
new ExecutionDataflowBlockOptions
{
//We limit the number of CPU intensive operation
//to the number of processors in the system.
MaxDegreeOfParallelism = Environment.ProcessorCount
});
download_block.LinkTo(process_block);
foreach(var url in urls)
{
download_block.Post(url);
}
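The snippet above only posts the work and never waits for it to finish, and nothing consumes the ProcessingResult outputs. One possible way to collect the results and wait for the pipeline to drain (a sketch, assuming this runs inside an async method):
var results = new List<ProcessingResult>();
var collect_block = new ActionBlock<ProcessingResult>(r => results.Add(r));
process_block.LinkTo(
    collect_block,
    new DataflowLinkOptions { PropagateCompletion = true });

download_block.Complete();        // no more URLs will be posted
await download_block.Completion;  // all downloads finished and handed to process_block
process_block.Complete();
await collect_block.Completion;   // all results processed and collected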
You can use something like:
Func<File> work = () => {
// Do something
File file = ...
return file;
};
var maxNoOfWorkers = 10;
IEnumerable<Task> tasks = Enumerable.Range(0, maxNoOfWorkers)
.Select(s =>
{
var task = Task.Factory.StartNew<File>(work);
return task.ContinueWith(ant => { /* do something else */ });
});
This way TPL decides how many threads to get from the threadpool if however you really want to create a dedicated (non-threadpool) thread you can then do so using:
IEnumerable<Task> tasks = Enumerable.Range(0, maxNoOfWorkers)
.Select(s =>
{
var task = Task.Factory.StartNew<File>(
work,
CancellationToken.None,
TaskCreationOptions.LongRunning,
TaskScheduler.Default);
return task.ContinueWith(ant => { /* do something else */ });
});
Your other options would be to use PLINQ or Parallel.For/ForEach, which you can use MaxDegreeOfParallelism with.
A PLINQ example can be:
Func<File> work = () => {
// Do something
File file = ...
return file;
};
var maxNoOfWorkers = 10;
ParallelEnumerable.Range(0, maxNoOfWorkers)
.WithDegreeOfParallelism(maxNoOfWorkers)
.ForAll(x => {
var file = work();
// Do something with file
});
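And a rough Parallel.ForEach equivalent using MaxDegreeOfParallelism (reusing the same hypothetical work delegate):
Parallel.ForEach(
    Enumerable.Range(0, maxNoOfWorkers),
    new ParallelOptions { MaxDegreeOfParallelism = maxNoOfWorkers },
    x =>
    {
        var file = work();
        // Do something with file
    });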
Of course I don't know the context of your example so you may need to adapt it to your requirement.