How can I execute this code in parallel? I tried running the requests on separate threads, but they are still executed sequentially. I am new to parallel programming, so any help is much appreciated.
public async Task<IList<AdModel>> LocalBitcoins_buy(int page_number)
{
    IList<AdModel> Buy_ads = new List<AdModel>();
    string next_page_url;
    string url = "https://localbitcoins.net/buy-bitcoins-online/.json?page=" + page_number;
    WebRequest request = WebRequest.Create(url);
    request.Method = "GET";
    using (WebResponse response = await request.GetResponseAsync())
    {
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            JObject json = JObject.Parse(await reader.ReadToEndAsync());
            next_page_url = (string) json["pagination"]["next"];
            int counter = (int) json["data"]["ad_count"];
            for (int ad_list_index = 0; ad_list_index < counter; ad_list_index++)
            {
                AdModel save = new AdModel();
                save.Seller = (string) json["data"]["ad_list"][ad_list_index]["data"]["profile"]["username"];
                save.Give = (string) json["data"]["ad_list"][ad_list_index]["data"]["currency"];
                save.Get = "BTC";
                save.Limits = (string) json["data"]["ad_list"][ad_list_index]["data"]["first_time_limit_btc"];
                save.Deals = (string) json["data"]["ad_list"][ad_list_index]["data"]["profile"]["trade_count"];
                save.Reviews = (string) json["data"]["ad_list"][ad_list_index]["data"]["profile"]["feedback_score"];
                save.PaymentWindow = (string) json["data"]["ad_list"][ad_list_index]["data"]["payment_window_minutes"];
                Buy_ads.Add(save);
            }
        }
    }
    Console.WriteLine(page_number);
    return Buy_ads;
}
I googled and found these links: 1, 2. It seems that WebRequest cannot execute requests in parallel. I also tried to send multiple requests in parallel using WebRequest, and for some reason it didn't make the requests in parallel.
But when I used the HttpClient class, the requests did run in parallel. Try using HttpClient instead of WebRequest, as Microsoft recommends.
So, first, you should use HttpClient to make the web requests.
Then you can use the following approach to download the pages in parallel:
public static IList<AdModel> DownloadAllPages()
{
    int[] pageNumbers = getPageNumbers();

    // Array of tasks that download data from the pages.
    Task<IList<AdModel>>[] tasks = new Task<IList<AdModel>>[pageNumbers.Length];

    // This loop launches the download tasks in parallel.
    for (int i = 0; i < pageNumbers.Length; i++)
    {
        // Launch a download task without waiting for its completion.
        tasks[i] = LocalBitcoins_buy(pageNumbers[i]);
    }

    // Wait for all tasks to complete.
    Task.WaitAll(tasks);

    // Combine the results from all tasks into a common list.
    return tasks.SelectMany(t => t.Result).ToList();
}
Of course, you should add error handling into this method.
I suggest you study this post carefully: https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/async/walkthrough-accessing-the-web-by-using-async-and-await
It has a tutorial that is doing exactly what you want to do: Downloading multiple pages simultaneously.
WebRequest's GetResponse deliberately does not work asynchronously. There is a second method for that: GetResponseAsync().
To get your toes wet with threading, you can also use the drop-in replacement for the foreach and for loops: https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/how-to-write-a-simple-parallel-foreach-loop
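As a quick, self-contained illustration of that drop-in replacement (with a trivial CPU-bound body standing in for real work; for I/O-bound requests the async approaches above remain the better fit):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// A trivial CPU-bound body standing in for real work.
int[] inputs = { 1, 2, 3, 4, 5 };
var results = new ConcurrentBag<int>();

Parallel.ForEach(inputs, n =>
{
    results.Add(n * n); // ConcurrentBag is safe to call from multiple threads
});

Console.WriteLine(results.Count); // 5
```

Note that the collection being written to must be thread-safe (here a ConcurrentBag), since the loop body runs on multiple threads at once.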
Related
I am building a C# Winforms application and I have many REST calls to process. Each call takes about 10 sec till I receive an answer, so in the end, my application is running quite a while. Mostly spending time waiting for the REST service to answer.
I am not making progress because no matter what I try (ConfigureAwait, WaitAll, or WhenAll), the application either hangs, or when I try to access each task's result it goes back to the Main method or hangs. Here is what I currently have:
I am building up a list of tasks to fill my objects :
List<Task> days = new List<Task>();
for (DateTime d = dtStart; d <= dtEnd; d = d.AddDays(1))
{
    if (UseProduct)
    {
        Task _t = AsyncBuildDay(d, Project, Product, fixVersion);
        var t = _t as Task<Day>;
        days.Add(t);
    }
    else
    {
        Task _t = AsyncBuildDay(d, Project, fixVersion);
        var t = _t as Task<Day>;
        days.Add(t);
    }
}
Then I am starting and waiting until every task is finished and the objects are built:
Task.WaitAll(days.ToArray());
When I try this, then the tasks are waiting for activation:
var tks = Task.WhenAll(days.ToArray());
What is running asynchronously inside the tasks (AsyncBuildDay) is a query to JIRA:
private async Task<string> GetResponse(string url)
{
    WebRequest request = WebRequest.Create(url);
    request.Method = "GET";
    request.Headers["Authorization"] = "Basic " + Convert.ToBase64String(Encoding.Default.GetBytes(JIRAUser + ":" + JIRAPassword));
    request.Credentials = new NetworkCredential(JIRAUser, JIRAPassword);
    WebResponse response = await request.GetResponseAsync().ConfigureAwait(false);
    // Get the stream containing all content returned by the requested server.
    Stream dataStream = response.GetResponseStream();
    // Open the stream using a StreamReader for easy access.
    StreamReader reader = new StreamReader(dataStream);
    // Read the content fully up to the end.
    string json = reader.ReadToEnd();
    return json;
}
And now I would like to access all my objects with .Result, but then the whole code freezes again.
foreach (Task<Day> t in days)
{
dc.colDays.Add(t.Result);
}
I can't find a way to get to my objects and I'm really going nuts with this stuff. Any ideas are much appreciated!
You're overcomplicating this.
Task.WhenAll is the way to go; it returns a new Task that completes when the provided tasks have all completed.
It's also non-blocking.
By awaiting the Task returned by Task.WhenAll, you unwrap its results into an array:
List<Task<Day>> dayTasks = new();
// ...
Day[] days = await Task.WhenAll(dayTasks);
You can then add this to dc.colDays:
dc.colDays.AddRange(days);
Or if dc.colDays doesn't have an AddRange method:
foreach (var day in days) dc.colDays.Add(day);
It might be better to await any completion, and remove the completed task from the list.
while (days.Count > 0)
{
    Task completedTask = await Task.WhenAny(days);
    // Do something with the result.
    days.Remove(completedTask);
}
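To make that pattern concrete, here is a self-contained sketch with simulated tasks standing in for AsyncBuildDay (the IDs and delays are made up); Task.WhenAny hands you the first task to finish, so results can be consumed in completion order:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Simulated work standing in for AsyncBuildDay.
static async Task<int> BuildAsync(int id, int delayMs)
{
    await Task.Delay(delayMs);
    return id;
}

var tasks = new List<Task<int>> { BuildAsync(1, 60), BuildAsync(2, 10), BuildAsync(3, 30) };
var completionOrder = new List<int>();

while (tasks.Count > 0)
{
    Task<int> completed = await Task.WhenAny(tasks); // first task to finish
    completionOrder.Add(await completed);            // already done; await just unwraps the result
    tasks.Remove(completed);
}

Console.WriteLine(string.Join(",", completionOrder)); // completion order, typically 2,3,1
```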
I need to fetch content from some 3000 urls. I'm using HttpClient, creating a Task for each url, adding the tasks to a list, and then awaiting Task.WhenAll. Something like this:
var tasks = new List<Task<string>>();
foreach (var url in urls) {
    var task = Task.Run(() => httpClient.GetStringAsync(url));
    tasks.Add(task);
}
var t = Task.WhenAll(tasks);
However, many tasks end up in the Faulted or Canceled state. I thought it might be a problem with the specific urls, but no: I can fetch those urls without any problem with curl in parallel.
I tried HttpClientHandler, WinHttpHandler with various timeouts etc. Always several hundred urls end with an error.
Then I tried to fetch those urls in batches of 10 and that works. No errors, but very slow. Curl will fetch 3000 urls in parallel very fast.
Then I tried to get httpbin.org 3000 times to verify that the issue is not with my particular urls:
var handler = new HttpClientHandler() { MaxConnectionsPerServer = 5000 };
var httpClient = new HttpClient(handler);
var tasks = new List<Task<HttpResponseMessage>>();
foreach (var _ in Enumerable.Range(1, 3000)) {
    var task = Task.Run(() => httpClient.GetAsync("http://httpbin.org"));
    tasks.Add(task);
}
var t = Task.WhenAll(tasks);
try { await t.ConfigureAwait(false); } catch { }

int ok = 0, faulted = 0, cancelled = 0;
foreach (var task in tasks) {
    switch (task.Status) {
        case TaskStatus.RanToCompletion: ok++; break;
        case TaskStatus.Faulted: faulted++; break;
        case TaskStatus.Canceled: cancelled++; break;
    }
}
Console.WriteLine($"RanToCompletion: {ok} Faulted: {faulted} Canceled: {cancelled}");
Again, always several hundred Tasks end in error.
So, what is the issue here? Why can't I fetch those urls asynchronously?
I'm using .NET Core and therefore the suggestion to use ServicePointManager (Trying to run multiple HTTP requests in parallel, but being limited by Windows (registry)) is not applicable.
Also, the urls I need to fetch point to different hosts. The code with httpbin is just a test, to show that the problem was not with my urls being invalid.
As Fildor said in the comments, httpClient.GetStringAsync already returns a Task, so you don't need to wrap it in Task.Run.
I ran this code in a console app. It took 50 seconds to complete. In your comment, you wrote that curl performs 3000 queries in less than a minute - about the same.
var httpClient = new HttpClient();
var tasks = new List<Task<string>>();
var sw = Stopwatch.StartNew();
for (int i = 0; i < 3000; i++)
{
    var task = httpClient.GetStringAsync("http://httpbin.org");
    tasks.Add(task);
}
Task.WaitAll(tasks.ToArray());
sw.Stop();
Console.WriteLine(sw.Elapsed);
Console.WriteLine(tasks.All(t => t.IsCompleted));
Also, all requests were completed successfully.
In your code, you are waiting for the tasks started via Task.Run. But what you need to await is the completion of the tasks returned by calling httpClient.Get... directly.
so I have this code:
This is the main function, a parallel for loop that iterates through all the data that needs to be posted and calls a function:
ParallelOptions pOpt = new ParallelOptions();
pOpt.MaxDegreeOfParallelism = 30;
Parallel.For(0, maxsize, pOpt, (index, loopstate) => {
    // Calls the function where all the web requests are made.
    CallRequests(data1, data2);
    if (isAborted)
        loopstate.Stop();
});
This function is called inside the parallel loop
public static void CallRequests(string data1, string data2)
{
    var cookie = new CookieContainer();
    var postData = Parameters[23] + data1 +
                   Parameters[24] + data2;
    HttpWebRequest getRequest = (HttpWebRequest)WebRequest.Create(Parameters[25]);
    getRequest.Accept = Parameters[26];
    getRequest.KeepAlive = true;
    getRequest.Referer = Parameters[27];
    getRequest.CookieContainer = cookie;
    getRequest.UserAgent = Parameters[28];
    getRequest.Method = WebRequestMethods.Http.Post;
    getRequest.AllowWriteStreamBuffering = true;
    getRequest.ProtocolVersion = HttpVersion.Version10;
    getRequest.AllowAutoRedirect = false;
    getRequest.ContentType = Parameters[29];
    getRequest.ReadWriteTimeout = 5000;
    getRequest.Timeout = 5000;
    getRequest.Proxy = null;
    byte[] byteArray = Encoding.ASCII.GetBytes(postData);
    getRequest.ContentLength = byteArray.Length;
    Stream newStream = getRequest.GetRequestStream(); // Open the connection.
    newStream.Write(byteArray, 0, byteArray.Length);  // Send the data.
    newStream.Close();
    HttpWebResponse getResponse = (HttpWebResponse)getRequest.GetResponse();
    if (getResponse.Headers["Location"] == Parameters[30])
    {
        // These are simple GET requests to retrieve the source code, using the same format as above.
        // I need to preserve the cookie.
        GetRequets(data1, data2, Parameters[31], Parameters[13], cookie);
        GetRequets(data1, data2, Parameters[32], Parameters[15], cookie);
    }
}
From what I have seen and been told, I understand that making these requests async is a better idea than using a parallel loop. My method is also heavy on the processor. I wonder how I can make these requests async but also preserve the multithreaded aspect. I also need to keep the cookie after the POST request finishes.
Converting the CallRequests method to an async is really just a case of switching the sync method calls for async ones with the await keyword and changing the method signature to return Task.
Something like this:
public static async Task CallRequestsAsync(string data1, string data2)
{
    var cookie = new CookieContainer();
    var postData = Parameters[23] + data1 +
                   Parameters[24] + data2;
    HttpWebRequest getRequest = (HttpWebRequest)WebRequest.Create(Parameters[25]);
    getRequest.Accept = Parameters[26];
    getRequest.KeepAlive = true;
    getRequest.Referer = Parameters[27];
    getRequest.CookieContainer = cookie;
    getRequest.UserAgent = Parameters[28];
    getRequest.Method = WebRequestMethods.Http.Post;
    getRequest.AllowWriteStreamBuffering = true;
    getRequest.ProtocolVersion = HttpVersion.Version10;
    getRequest.AllowAutoRedirect = false;
    getRequest.ContentType = Parameters[29];
    getRequest.ReadWriteTimeout = 5000;
    getRequest.Timeout = 5000;
    getRequest.Proxy = null;
    byte[] byteArray = Encoding.ASCII.GetBytes(postData);
    getRequest.ContentLength = byteArray.Length;
    Stream newStream = await getRequest.GetRequestStreamAsync(); // Open the connection.
    await newStream.WriteAsync(byteArray, 0, byteArray.Length);  // Send the data.
    newStream.Close();
    HttpWebResponse getResponse = (HttpWebResponse)await getRequest.GetResponseAsync();
    if (getResponse.Headers["Location"] == Parameters[30])
    {
        // These are simple GET requests to retrieve the source code, using the same format as above.
        // I need to preserve the cookie.
        GetRequets(data1, data2, Parameters[31], Parameters[13], cookie);
        GetRequets(data1, data2, Parameters[32], Parameters[15], cookie);
    }
}
However this, in itself, doesn't really get you anywhere because you still need to await the returned tasks in your main method. A very straightforward (if somewhat blunt) way of doing so would be to simply call Task.WaitAll() (or await Task.WhenAll() if the calling method itself is to become async). Something like this:
var tasks = Enumerable.Range(0, maxsize).Select(index => CallRequestsAsync(data1, data2));
Task.WaitAll(tasks.ToArray());
However, this is really pretty blunt and loses control over how many iterations are running in parallel, etc. I MUCH prefer use of the TPL dataflow library for this sort of thing. This library provides a way of chaining async (or sync for that matter) operations in parallel and passing them from one "processing block" to the next. It has a myriad of options for tweaking degrees of parallelism, buffer sizes, etc.
A detailed exposé is beyond the scope of this answer, so I'd encourage you to read up on it, but one possible approach would be to simply push this to an action block - something like this:
var actionBlock = new ActionBlock<int>(async index =>
{
    await CallRequestsAsync(data1, data2);
}, new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 30,
    BoundedCapacity = 100,
});

for (int i = 0; i < maxsize; i++)
{
    // NB: with BoundedCapacity set, Post returns false when the buffer is full;
    // prefer await actionBlock.SendAsync(i) if the calling method is also async.
    actionBlock.Post(i);
}

actionBlock.Complete();
actionBlock.Completion.Wait(); // or await actionBlock.Completion if the calling method is also async
Couple of additional points that are outside the scope of my answer that I should mention in passing:
it looks like your CallRequests method is updating some external variable with its results. Where possible it's best to avoid this pattern and have the method return the results for collation later (which the TPL Dataflow library handles through TransformBlock<>). If updating external state is unavoidable then make sure you have thought about the multithreaded implications (deadlocks, race conditions, etc.) which are outside the scope of my answer.
I am assuming there is some useful property of index which has been lost when you created a minimal description for your question? Does it index into a parameter list or something similar? If so, you can always just iterate over these directly and change the ActionBlock<int> to an ActionBlock<{--whatever the type of your parameter is--}>
Make sure you understand the difference between multi-threaded/parallel execution and asynchronous. There are some similarities/overlaps for sure but just making something async doesn't make it multithreaded nor is the converse true.
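As a rough sketch of the TransformBlock<> idea from the first point (the names and numbers here are illustrative, not from the original code, and the System.Threading.Tasks.Dataflow NuGet package is required): the transform block returns its results, which flow into a downstream collecting block, so no external variable is mutated from multiple threads:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

var results = new List<int>();

// Transform each input in parallel; the delay stands in for an async web request.
var transform = new TransformBlock<int, int>(async n =>
{
    await Task.Delay(10);
    return n * 2;
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

// Collect results downstream; ActionBlock runs single-threaded by default, so no lock is needed.
var collect = new ActionBlock<int>(r => results.Add(r));

transform.LinkTo(collect, new DataflowLinkOptions { PropagateCompletion = true });

for (int i = 1; i <= 10; i++) transform.Post(i);
transform.Complete();
await collect.Completion;

Console.WriteLine(results.Count); // 10
```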
I have a piece of code that looks like this:
var taskList = new Task<string>[masterResult.D.Count];
for (int i = 0; i < masterResult.D.Count; i++) //Go through all the lists we need to pull (based on master list) and create a task-list
{
using (var client = new WebClient())
{
Task<string> getDownloadsTask = client.DownloadStringTaskAsync(new Uri(agilityApiUrl + masterResult.D[i].ReferenceIdOfCollection + "?$format=json"));
taskList[i] = getDownloadsTask;
}
}
Task.WaitAll(taskList.Cast<Task>().ToArray()); //Wait for all results to come back
The code freezes after Task.WaitAll... I have an idea why, it's because client is already disposed at the time of calling, is it possible to delay its disposal until later? Can you recommend another approach?
You need to create and dispose the WebClient within your task. I don't have a way to test this, but see if it points you in the right direction:
var taskList = new Task<string>[masterResult.D.Count];
for (int i = 0; i < masterResult.D.Count; i++) // Go through all the lists we need to pull (based on the master list) and create a task list.
{
    int index = i; // Capture the loop variable so each closure gets its own copy.
    taskList[i] = Task.Run(async () =>
    {
        using (var client = new WebClient())
        {
            // Await inside the using block so the client is not disposed until the download completes.
            return await client.DownloadStringTaskAsync(new Uri(agilityApiUrl + masterResult.D[index].ReferenceIdOfCollection + "?$format=json"));
        }
    });
}
Task.WaitAll(taskList.Cast<Task>().ToArray());
I don't see how that code would ever work, since you dispose the WebClient before the task was run.
You want to do something like this:
var taskList = new Task<string>[masterResult.D.Count];
for (int i = 0; i < masterResult.D.Count; i++) // Go through all the lists we need to pull (based on the master list) and create a task list.
{
    var client = new WebClient();
    Task<string> task = client.DownloadStringTaskAsync(new Uri(agilityApiUrl + masterResult.D[i].ReferenceIdOfCollection + "?$format=json"));
    task.ContinueWith(x => client.Dispose());
    taskList[i] = task;
}
Task.WaitAll(taskList.Cast<Task>().ToArray()); // Wait for all results to come back.
i.e. if you dispose the WebClient in the loop, it has already been disposed by the time the downloads run. The ContinueWith call is invoked once each task completes and can therefore be used to dispose each WebClient instance.
However, to get the code to execute concurrent requests to a single host you need to configure the service point. Read this question: Trying to run multiple HTTP requests in parallel, but being limited by Windows (registry)
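The dispose-in-a-continuation trick generalizes to any IDisposable. A minimal self-contained sketch, with a made-up FakeClient standing in for WebClient so it runs without network access:

```csharp
using System;
using System.Threading.Tasks;

var client = new FakeClient();
Task<string> task = client.DownloadAsync();

// Dispose only after the download task has finished, not before.
Task cleanup = task.ContinueWith(_ => client.Dispose());

cleanup.Wait(); // in real code: Task.WaitAll over all the cleanup continuations

Console.WriteLine($"{task.Result}, disposed: {client.Disposed}");

// Made-up stand-in for WebClient.
class FakeClient : IDisposable
{
    public bool Disposed { get; private set; }
    public Task<string> DownloadAsync() => Task.Delay(20).ContinueWith(_ => "page source");
    public void Dispose() => Disposed = true;
}
```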
My application requires that I download a large amount of webpages into memory for further parsing and processing. What is the fastest way to do it? My current method (shown below) seems to be too slow and occasionally results in timeouts.
for (int i = 1; i <= pages; i++)
{
    string page_specific_link = baseurl + "&page=" + i.ToString();

    try
    {
        WebClient client = new WebClient();
        var pagesource = client.DownloadString(page_specific_link);
        client.Dispose();
        sourcelist.Add(pagesource);
    }
    catch (Exception)
    {
    }
}
The way you approach this problem is going to depend very much on how many pages you want to download, and how many sites you're referencing.
I'll use a good round number like 1,000. If you want to download that many pages from a single site, it's going to take a lot longer than if you want to download 1,000 pages that are spread out across dozens or hundreds of sites. The reason is that if you hit a single site with a whole bunch of concurrent requests, you'll probably end up getting blocked.
So you have to implement a type of "politeness policy," that issues a delay between multiple requests on a single site. The length of that delay depends on a number of things. If the site's robots.txt file has a crawl-delay entry, you should respect that. If they don't want you accessing more than one page per minute, then that's as fast as you should crawl. If there's no crawl-delay, you should base your delay on how long it takes a site to respond. For example, if you can download a page from the site in 500 milliseconds, you set your delay to X. If it takes a full second, set your delay to 2X. You can probably cap your delay to 60 seconds (unless crawl-delay is longer), and I would recommend that you set a minimum delay of 5 to 10 seconds.
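As one possible reading of that adaptive-delay rule, here is a small helper; the 2x multiplier, 5-second floor, and 60-second cap are assumptions drawn from the numbers above, not a standard:

```csharp
using System;

// Compute a per-site politeness delay from the observed response time.
// crawlDelaySeconds is the site's robots.txt Crawl-delay, or null if absent.
static TimeSpan PolitenessDelay(TimeSpan responseTime, int? crawlDelaySeconds)
{
    if (crawlDelaySeconds.HasValue)
        return TimeSpan.FromSeconds(crawlDelaySeconds.Value); // always respect robots.txt

    double seconds = responseTime.TotalSeconds * 2;  // scale the delay with responsiveness
    seconds = Math.Max(seconds, 5);                  // minimum delay of 5 seconds
    seconds = Math.Min(seconds, 60);                 // cap at 60 seconds
    return TimeSpan.FromSeconds(seconds);
}

Console.WriteLine(PolitenessDelay(TimeSpan.FromMilliseconds(500), null)); // 00:00:05 (floor applies)
Console.WriteLine(PolitenessDelay(TimeSpan.FromSeconds(10), null));       // 00:00:20
Console.WriteLine(PolitenessDelay(TimeSpan.FromSeconds(1), 90));          // 00:01:30 (robots.txt wins)
```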
I wouldn't recommend using Parallel.ForEach for this. My testing has shown that it doesn't do a good job. Sometimes it over-taxes the connection and often it doesn't allow enough concurrent connections. I would instead create a queue of WebClient instances and then write something like:
// Create a queue of WebClient instances.
BlockingCollection<WebClient> ClientQueue = new BlockingCollection<WebClient>();
// Initialize the queue with some number of WebClient instances.

// Now process the urls.
foreach (var url in urls_to_download)
{
    var worker = ClientQueue.Take();
    worker.DownloadStringAsync(url, ...);
}
When you initialize the WebClient instances that go into the queue, set their DownloadStringCompleted event handlers to point to a completed event handler. That handler should save the string to a file (or perhaps you should just use DownloadFileAsync), and then add the client back to the ClientQueue.
In my testing, I've been able to support 10 to 15 concurrent connections with this method. Any more than that and I run into problems with DNS resolution (DownloadStringAsync doesn't do the DNS resolution asynchronously). You can get more connections, but doing so is a lot of work.
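A rough, self-contained shape of that client queue, with the actual download stubbed out by a delay (in the real version the completed event handler would return the client to the queue, as described above):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// A pool of simulated "clients": taking from the queue blocks until one
// is free, so at most poolSize downloads run at once.
var clientQueue = new BlockingCollection<int>();
for (int id = 0; id < 3; id++) clientQueue.Add(id); // pool of 3 clients

var urls = new[] { "u1", "u2", "u3", "u4", "u5", "u6" };
var done = new ConcurrentBag<string>();
var tasks = new List<Task>();

foreach (var url in urls)
{
    int worker = clientQueue.Take(); // blocks until a client is available
    tasks.Add(Task.Run(async () =>
    {
        await Task.Delay(10);        // stand-in for the actual download
        done.Add(url);
        clientQueue.Add(worker);     // hand the client back to the pool
    }));
}

Task.WaitAll(tasks.ToArray());
Console.WriteLine($"{done.Count} downloads, {clientQueue.Count} clients back in pool");
```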
That's the approach I've taken in the past, and it's worked very well for downloading thousands of pages quickly. It's definitely not the approach I took with my high performance Web crawler, though.
I should also note that there is a huge difference in resource usage between these two blocks of code:
WebClient MyWebClient = new WebClient();
foreach (var url in urls_to_download)
{
    MyWebClient.DownloadString(url);
}

---------------

foreach (var url in urls_to_download)
{
    WebClient MyWebClient = new WebClient();
    MyWebClient.DownloadString(url);
}
The first allocates a single WebClient instance that is used for all requests. The second allocates one WebClient for each request. The difference is huge. WebClient uses a lot of system resources, and allocating thousands of them in a relatively short time is going to impact performance. Believe me ... I've run into this. You're better off allocating just 10 or 20 WebClients (as many as you need for concurrent processing), rather than allocating one per request.
Why not just use a web crawling framework? It can handle all of that for you (multithreading, HTTP requests, parsing links, scheduling, politeness, etc.).
Abot (https://code.google.com/p/abot/) handles all that stuff for you and is written in C#.
In addition to @David's perfectly valid answer, I want to add a slightly cleaner "version" of his approach.
var pages = new List<string> { "http://bing.com", "http://stackoverflow.com" };
var sources = new BlockingCollection<string>();

Parallel.ForEach(pages, x =>
{
    using (var client = new WebClient())
    {
        var pagesource = client.DownloadString(x);
        sources.Add(pagesource);
    }
});
Yet another approach that uses the event-based async API; note that the WebClient must not be disposed until its DownloadStringCompleted handler has run, so disposal happens inside the handler rather than in a using block:
static IEnumerable<string> GetSources(List<string> pages)
{
    var sources = new BlockingCollection<string>();
    var latch = new CountdownEvent(pages.Count);
    foreach (var p in pages)
    {
        var wc = new WebClient();
        wc.DownloadStringCompleted += (x, e) =>
        {
            sources.Add(e.Result);
            latch.Signal();
            wc.Dispose(); // dispose only once the download has completed
        };
        wc.DownloadStringAsync(new Uri(p));
    }
    latch.Wait();
    return sources;
}
You should use parallel programming for this purpose.
There are a lot of ways to achieve what you want; the easiest would be something like this:
var pageList = new List<string>();
for (int i = 1; i <= pages; i++)
{
    pageList.Add(baseurl + "&page=" + i.ToString());
}

// pageList is a list of urls.
Parallel.ForEach<string>(pageList, (page) =>
{
    try
    {
        WebClient client = new WebClient();
        var pagesource = client.DownloadString(page);
        client.Dispose();
        lock (sourcelist)
            sourcelist.Add(pagesource);
    }
    catch (Exception) {}
});
I had a similar case, and here is how I solved it:
using System;
using System.Threading;
using System.Collections.Generic;
using System.Net;
using System.IO;

namespace WebClientApp
{
    class MainClassApp
    {
        private static int requests = 0;
        private static object requests_lock = new object();

        public static void Main()
        {
            List<string> urls = new List<string> { "http://www.google.com", "http://www.slashdot.org" };
            foreach (var url in urls)
            {
                ThreadPool.QueueUserWorkItem(GetUrl, url);
            }

            int cur_req = 0;
            while (cur_req < urls.Count)
            {
                lock (requests_lock)
                {
                    cur_req = requests;
                }
                Thread.Sleep(1000);
            }
            Console.WriteLine("Done");
        }

        private static void GetUrl(Object the_url)
        {
            string url = (string)the_url;
            WebClient client = new WebClient();
            Stream data = client.OpenRead(url);
            StreamReader reader = new StreamReader(data);
            string html = reader.ReadToEnd();
            // Do something with the html.
            Console.WriteLine(html);
            lock (requests_lock)
            {
                // Maybe you could add the HTML to SourceList here.
                requests++;
            }
        }
    }
}
You should think about using parallelism because the slow speed comes from your software waiting for I/O; while one thread is waiting on I/O, another one can get started.
While the other answers are perfectly valid, all of them (at the time of this writing) are neglecting something very important: calls to the web are IO-bound, and having a thread block while waiting on such an operation strains the system and hurts scalability.
What you really want to do is take advantage of the async methods on the WebClient class (as some have pointed out) as well as the Task Parallel Library's ability to handle the Event-Based Asynchronous Pattern.
First, you would get the urls that you want to download:
IEnumerable<Uri> urls = pages.Select(i => new Uri(baseurl +
"&page=" + i.ToString(CultureInfo.InvariantCulture)));
Then, you would create a new WebClient instance for each url, using the TaskCompletionSource<T> class to handle the calls asynchronously (this won't burn a thread):
IEnumerable<Task<Tuple<Uri, string>>> tasks = urls.Select(url => {
    // Create the task completion source.
    var tcs = new TaskCompletionSource<Tuple<Uri, string>>();

    // The web client.
    var wc = new WebClient();

    // Attach to the DownloadStringCompleted event.
    wc.DownloadStringCompleted += (s, e) => {
        // Dispose of the client when done.
        using (wc)
        {
            // If there is an error, set it.
            if (e.Error != null)
            {
                tcs.SetException(e.Error);
            }
            // Otherwise, set cancelled if cancelled.
            else if (e.Cancelled)
            {
                tcs.SetCanceled();
            }
            else
            {
                // Set the result.
                tcs.SetResult(new Tuple<Uri, string>(url, e.Result));
            }
        }
    };

    // Start the process asynchronously, don't burn a thread.
    wc.DownloadStringAsync(url);

    // Return the task.
    return tcs.Task;
});
Now you have an IEnumerable<T> which you can convert to an array and wait on all of the results using Task.WaitAll:
// Materialize the tasks.
Task<Tuple<Uri, string>>[] materializedTasks = tasks.ToArray();
// Wait for all to complete.
Task.WaitAll(materializedTasks);
Then, you can just use the Result property on the Task<T> instances to get the pair of the url and the content:
// Cycle through each of the results.
foreach (Tuple<Uri, string> pair in materializedTasks.Select(t => t.Result))
{
    // pair.Item1 will contain the Uri.
    // pair.Item2 will contain the content.
}
Note that the above code has the caveat of not having any error handling.
If you wanted to get even more throughput, instead of waiting for the entire list to be finished, you could process the content of a single page after it's done downloading; Task<T> is meant to be used like a pipeline, when you've completed your unit of work, have it continue to the next one instead of waiting for all of the items to be done (if they can be done in an asynchronous manner).
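One way to get that pipelining with the same task array is to attach a per-task continuation instead of waiting for the whole batch; a sketch with simulated downloads standing in for the WebClient calls:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

// Process each result as soon as its task completes, instead of
// waiting for the whole batch before doing any work.
var processed = new ConcurrentBag<string>();

Task<string>[] downloads = Enumerable.Range(1, 5)
    .Select(i => Task.Delay(i * 10).ContinueWith(_ => $"content-{i}")) // simulated downloads
    .ToArray();

Task[] pipeline = downloads
    .Select(t => t.ContinueWith(d => processed.Add(d.Result.ToUpper()))) // per-item processing step
    .ToArray();

Task.WaitAll(pipeline); // wait only for the processing continuations
Console.WriteLine(processed.Count); // 5
```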
I am using an active thread count and an arbitrary limit:
private static volatile int activeThreads = 0;

public static void RecordData()
{
    var groupSize = 10;
    var source = db.ListOfUrls; // Thousands of urls.
    var iterations = source.Length / groupSize;
    for (int i = 0; i < iterations; i++)
    {
        var subList = source.Skip(groupSize * i).Take(groupSize);
        Parallel.ForEach(subList, (item) => RecordUri(item));
        // Wait here before processing further data, to avoid overload.
        while (activeThreads > 30) Thread.Sleep(100);
    }
}

private static async Task RecordUri(Uri uri)
{
    using (WebClient wc = new WebClient())
    {
        Interlocked.Increment(ref activeThreads);
        wc.DownloadStringCompleted += (sender, e) => Interlocked.Decrement(ref activeThreads);
        var jsonData = await wc.DownloadStringTaskAsync(uri);
        var root = JsonConvert.DeserializeObject<RootObject>(jsonData);
        RecordData(root);
    }
}
}
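As a side note, the activeThreads counter plus Thread.Sleep polling can be replaced with SemaphoreSlim, which throttles concurrency without blocking threads. A self-contained sketch, with a delay standing in for the download:

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Throttle concurrent async operations to a fixed limit.
var throttle = new SemaphoreSlim(10); // at most 10 requests in flight
int completed = 0;

async Task RecordAsync(int id)
{
    await throttle.WaitAsync(); // waits without blocking a thread
    try
    {
        await Task.Delay(10);   // stand-in for DownloadStringTaskAsync
        Interlocked.Increment(ref completed);
    }
    finally
    {
        throttle.Release();     // free a slot even if the request throws
    }
}

await Task.WhenAll(Enumerable.Range(0, 100).Select(RecordAsync));
Console.WriteLine(completed); // 100
```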