Parallel requests to scrape multiple pages of a website - C#

I want to scrape a website with plenty of pages of interesting data, but as the source is very large I want to multithread while limiting the load on the server.
I use a Parallel.ForEach to start each chunk of 10 tasks, and in the main for loop I wait until the number of active threads drops below a threshold. For that I keep a counter of active threads, incremented when starting a new thread with a WebClient and decremented when the WebClient's DownloadStringCompleted event fires.
Originally the question was how to use DownloadStringTaskAsync instead of DownloadString and wait until each of the threads started in the Parallel.ForEach had completed. This has been solved with a workaround:
a counter (activeThreads) and a Thread.Sleep in the main for loop.
Is using await DownloadStringTaskAsync instead of DownloadString supposed to improve the speed at all, by freeing a thread while waiting for the DownloadString data to arrive?
And to get back to the original question: is there a way to do this more elegantly using the TPL, without the workaround of a counter?
private static volatile int activeThreads = 0;

public static void RecordData()
{
    var groupSize = 10;
    var source = db.ListOfUrls; // Thousands of urls
    var iterations = source.Length / groupSize;
    for (int i = 0; i < iterations; i++)
    {
        var subList = source.Skip(groupSize * i).Take(groupSize);
        Parallel.ForEach(subList, (item) => RecordUri(item));
        // I want to wait here before processing further data, to avoid overload
        while (activeThreads > 30) Thread.Sleep(100);
    }
}

private static async Task RecordUri(Uri uri)
{
    using (WebClient wc = new WebClient())
    {
        Interlocked.Increment(ref activeThreads);
        wc.DownloadStringCompleted += (sender, e) => Interlocked.Decrement(ref activeThreads);
        var jsonData = await wc.DownloadStringTaskAsync(uri);
        var root = JsonConvert.DeserializeObject<RootObject>(jsonData);
        RecordData(root);
    }
}

If you want an elegant solution you should use Microsoft's Reactive Framework. It's dead simple:
var source = db.ListOfUrls; // Thousands of urls

var query =
    from uri in source.ToObservable()
    from jsonData in Observable.Using(
        () => new WebClient(),
        wc => Observable.FromAsync(() => wc.DownloadStringTaskAsync(uri)))
    select new { uri, json = JsonConvert.DeserializeObject<RootObject>(jsonData) };

IDisposable subscription =
    query.Subscribe(x =>
    {
        /* Do something with x.uri && x.json */
    });
That's the entire code. It's nicely multi-threaded and it's kept under control.
Just NuGet "System.Reactive" to get the bits.
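If you also want to cap how many downloads run at once, a small variation (a sketch, not from the original answer) builds a stream of inner observables and merges them with a maximum concurrency:

var query =
    source.ToObservable()
          .Select(uri => Observable.Using(
              () => new WebClient(),
              wc => Observable.FromAsync(() => wc.DownloadStringTaskAsync(uri))
                              .Select(jsonData => new { uri, json = JsonConvert.DeserializeObject<RootObject>(jsonData) })))
          .Merge(15); // at most 15 downloads in flight at any time

Subscribe to it exactly as above.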

Parallel.ForEach
will create roughly ProcessorCount tasks to execute the function for each item in the source enumerable. It takes care that there are not too many tasks running at once, and it waits for all items and tasks to be executed.
Task.WhenAll
only awaits the given tasks; it does not execute them. It is up to you to start them in a proper way and not too many at once.
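A common TPL pattern that avoids the hand-rolled counter entirely is to throttle with a SemaphoreSlim. This is only a sketch, assuming the RecordUri method from the question (the activeThreads counter and the DownloadStringCompleted handler could then be dropped):

private static async Task RecordAllAsync(IEnumerable<Uri> source)
{
    using (var throttler = new SemaphoreSlim(10)) // at most 10 downloads in flight
    {
        var tasks = source.Select(async uri =>
        {
            await throttler.WaitAsync();
            try { await RecordUri(uri); }
            finally { throttler.Release(); }
        });
        await Task.WhenAll(tasks); // starts all wrappers; the semaphore limits the real concurrency
    }
}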
But there is a fault in your code: the function RecordUri returns a task that has to be awaited, otherwise the ForEach will just create more and more tasks, because it never knows when the current one has completed. Also problematic is that you create a task inside a task, and the outer one does nothing else than wait for the inner one.
You might also want to take a look at this overload of Parallel.ForEach
https://msdn.microsoft.com/en-us/library/dd782934(v=vs.110).aspx
Edit
Is using await DownloadStringTaskAsync instead of DownloadString supposed to improve at all the speed by freeing a thread while waiting for the DownloadString data to arrive ?
No. When a task is awaiting an external resource it enters a suspended state (the Windows API underneath does not busy-wait in some old/dirty polling loop). So there is not much difference.
What differs is the overhead the compiler generates when compiling your async code. DownloadStringTaskAsync creates a task that contains the long-running operation. If you await it, you attach yourself to that task (via ContinueWith). So you just create one Task to await another. This is the overhead I was talking about above.
My approach would be: use the synchronous method inside your Parallel.ForEach. The threading will be done by the TPL and you are free to go on.
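For example (a sketch, assuming the RootObject, RecordData and JsonConvert pieces from the question):

Parallel.ForEach(
    source,
    new ParallelOptions { MaxDegreeOfParallelism = 10 }, // caps the concurrency
    uri =>
    {
        using (var wc = new WebClient())
        {
            var jsonData = wc.DownloadString(uri); // synchronous call; the TPL supplies the threads
            RecordData(JsonConvert.DeserializeObject<RootObject>(jsonData));
        }
    });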
Remember "KISS"

Related

Is parallel asynchronous execution where a thread sleeps using multiple threads?

This is the code that I wrote to better understand asynchronous methods. I knew that an asynchronous method is not the same as multithreading, but it does not seem so in this particular scenario:
class Program
{
    static void Main(string[] args)
    {
        Thread.CurrentThread.CurrentCulture = new System.Globalization.CultureInfo("en-US");
        // the line above just makes sure that the console output uses . to represent doubles instead of ,
        ExecuteAsync();
        Console.ReadLine();
    }

    private static async Task ParallelAsyncMethod() // this is the method where async parallel execution takes place
    {
        List<Task<string>> tasks = new List<Task<string>>();
        for (int i = 0; i < 5; i++)
        {
            tasks.Add(Task.Run(() => DownloadWebsite()));
        }
        var strings = await Task.WhenAll(tasks);
        foreach (var str in strings)
        {
            Console.WriteLine(str);
        }
    }

    private static string DownloadWebsite() // Imitating a website download
    {
        Thread.Sleep(1500); // making the thread sleep for 1500 milliseconds before returning
        return "Download finished";
    }

    private static async void ExecuteAsync()
    {
        var watch = Stopwatch.StartNew();
        await ParallelAsyncMethod();
        watch.Stop();
        Console.WriteLine($"It took the machine {watch.ElapsedMilliseconds} milliseconds" +
            $" or {Convert.ToDouble(watch.ElapsedMilliseconds) / 1000} seconds to complete this task");
        Console.ReadLine();
    }
}

// OUTPUT:
/*
Download finished
Download finished
Download finished
Download finished
Download finished
It took the machine 1537 milliseconds or 1.537 seconds to complete this task
*/
As you can see, the DownloadWebsite method waits for 1.5 seconds and then returns a string. The method called ParallelAsyncMethod adds five such tasks to the "tasks" list and then starts the parallel asynchronous execution. As you can see, I also tracked the amount of time it takes for the ExecuteAsync method to execute. The result is always somewhere around 1540 milliseconds. Here is my question: if the DownloadWebsite method required a thread to sleep 5 times for 1500 milliseconds, does that mean the parallel execution of these methods required 5 different threads? If not, then how come it only took the program 1540 milliseconds to execute and not ~7500 ms?
I knew that an asynchronous method is not the same as multi-threading
That is correct; an asynchronous method releases the current thread whilst I/O occurs, and schedules a continuation after its completion.
Async and threads are completely unrelated concepts.
but it does not seem so in this particular scenario
That is because you explicitly run DownloadWebsite on the ThreadPool using Task.Run, which imitates asynchronous code by returning a Task after instructing the provided delegate to run.
Because you are not waiting for each Task to complete before starting the next, multiple threads can be used simultaneously.
Currently each thread is being blocked, as you have used Thread.Sleep in the implementation of DownloadWebsite, meaning you are actually running 5 synchronous methods on the ThreadPool.
In production code your DownloadWebsite method should be written asynchronously, maybe using HttpClient.GetAsync:
private static async Task<string> DownloadWebsiteAsync()
{
    // assumes a shared HttpClient instance; the original answer elided these details
    var response = await httpClient.GetAsync("http://example.com/");
    return await response.Content.ReadAsStringAsync();
}
In that case, GetAsync returns a Task, and releases the current thread whilst waiting for the HTTP response.
You can still run multiple async methods concurrently, but as the thread is released each time, this may well use less than 5 separate threads and may even use a single thread.
Ensure that you don't use Task.Run with an asynchronous method; this simply adds unnecessary overhead:
var tasks = new List<Task<string>>();
for (int i = 0; i < 5; i++)
{
    tasks.Add(DownloadWebsiteAsync()); // No need for Task.Run
}
var strings = await Task.WhenAll(tasks);
As an aside, if you want to imitate an async operation, use Task.Delay instead of Thread.Sleep as the former is non-blocking:
private static async Task<string> DownloadWebsite() // Imitating a website download
{
    await Task.Delay(1500); // Release the thread for ~1500ms before continuing
    return "Download finished";
}

use Task.Run() inside of Select LINQ method

Suppose I have the following code (just for learning purposes):
static async Task Main(string[] args)
{
    var results = new ConcurrentDictionary<string, int>();
    var tasks = Enumerable.Range(0, 100).Select(async index =>
    {
        var res = await DoAsyncJob(index);
        results.TryAdd(index.ToString(), res);
    });
    await Task.WhenAll(tasks);
    Console.WriteLine($"Items in dictionary {results.Count}");
}

static async Task<int> DoAsyncJob(int i)
{
    // simulate some I/O bound operation
    await Task.Delay(100);
    return i * 10;
}
I want to know what will be the difference if I make it as follows:
var tasks = Enumerable.Range(0, 100)
    .Select(index => Task.Run(async () =>
    {
        var res = await DoAsyncJob(index);
        results.TryAdd(index.ToString(), res);
    }));
I get the same results in both cases. But does the code execute similarly?
Task.Run is there to execute CPU-bound, synchronous operations on a thread pool thread. The operation you're running is already asynchronous, so using Task.Run means you're scheduling work on a thread pool thread whose only job is to start an asynchronous operation; that start completes almost immediately, and the operation goes off to do its asynchronous work without blocking the thread pool thread. So with Task.Run you pay the cost of scheduling work on the thread pool without actually doing any meaningful work there. You're better off just starting the asynchronous operation on the current thread.
The only exception would be if DoAsyncJob were implemented improperly and, contrary to its name and signature, wasn't actually asynchronous but did a lot of synchronous work before returning. But if it is doing that, you should just fix that buggy method rather than using Task.Run to call it.
On a side note, there's no reason to have a ConcurrentDictionary to collect the results here. Task.WhenAll returns a collection of the results of all of the tasks you've executed. Just use that. Now you don't even need a method to wrap your asynchronous method and process the result in any special way, simplifying the code further:
var tasks = Enumerable.Range(0, 100).Select(DoAsyncJob);
var results = await Task.WhenAll(tasks);
Console.WriteLine($"Items in results {results.Length}");
Yes, both cases execute similarly. Actually, they execute in exactly the same way.

Does async/await inside a loop create a bottleneck?

Let's say I have the following code, for example:
private async Task ManageClients()
{
    for (int i = 0; i < listClients.Count; i++)
    {
        if (listClients[i] == 0)
            await DoSomethingWithClientAsync();
        else
            await DoOtherThingAsync();
    }
    DoOtherWork();
}
My questions are:
1. Will the for() continue and process the other clients on the list, or will it await until one of the tasks finishes?
2. Is it even good practice to use async/await inside a loop?
3. Can it be done in a better way?
I know it was a really simple example, but I'm trying to imagine what would happen if that code was a server with thousands of clients.
In your code example, the code will "block" when the loop reaches await, meaning the other clients will not be processed until the first one is complete. That is because, while the code uses asynchronous calls, it was written using a synchronous logic mindset.
An asynchronous approach should look more like this:
private async Task ManageClients()
{
    var tasks = listClients.Select(client => DoSomethingWithClientAsync());
    await Task.WhenAll(tasks);
    DoOtherWork();
}
Notice there is only one await, which simultaneously awaits all of the clients, and allows them to complete in any order. This avoids the situation where the loop is blocked waiting for the first client.
When a thread executing an async function calls another async function, the called function runs synchronously until it reaches an await; the same is true if that function in turn calls a third async function.
At the first await on an incomplete task, instead of really doing nothing, the thread goes up the call stack to see whether the caller was awaiting the result of the called function. If not, the thread continues with the statements in the caller until it sees an await there, then goes up the call stack again to see if it can continue further up.
This can be seen in the following code:
var taskDoSomething = DoSomethingAsync(...);
// because we are not awaiting, the following is done as soon as DoSomethingAsync has to await:
DoSomethingElse();
// from here we need the result from DoSomethingAsync. await for it:
var someResult = await taskDoSomething;
You can even call several sub-procedures without awaiting:
var taskDoSomething = DoSomethingAsync(...);
var taskDoSomethingElse = DoSomethingElseAsync(...);
// at this point both tasks are running
DoSomethingElse();
Once you need the results of the tasks, it depends on what you want to do with them. Can you continue processing if one task is completed but the other is not?
var someResult = await taskDoSomething;
ProcessResult(someResult);
var someOtherResult = await taskDoSomethingelse;
ProcessBothResults(someResult, someOtherResult);
If you need the result of all tasks before you can continue, use Task.WhenAll:
Task[] allTasks = new Task[] { taskDoSomething, taskDoSomethingElse };
await Task.WhenAll(allTasks);
var someResult = taskDoSomething.Result;
var someOtherResult = taskDoSomethingElse.Result;
ProcessBothResults(someResult, someOtherResult);
Back to your question
If you have a sequence of items for which you need to start awaitable tasks, it depends on whether the tasks need each other's results. In other words: can task[2] start if task[1] has not completed yet? Do task[1] and task[2] interfere with each other if they both run at the same time?
If they are independent, start all tasks without awaiting, then use Task.WhenAll to wait until all are finished. The task scheduler will take care that not too many tasks are started at the same time. Be aware, though, that starting several tasks could lead to deadlocks; check carefully whether you need critical sections.
var clientTasks = new List<Task>();
for (int i = 0; i < listClients.Count; i++)
{
    if (listClients[i] == 0)
        clientTasks.Add(DoSomethingWithClientAsync());
    else
        clientTasks.Add(DoOtherThingAsync());
}
// if here: all tasks are started. If desired you can do other things:
AndNowForSomethingCompletelyDifferent();

// later we need the other tasks to be finished:
var taskWaitAll = Task.WhenAll(clientTasks);
// did you notice we still did not await yet? We are still in business:
MontyPython();

// okay, done with frolicking, we need the results:
await taskWaitAll;
DoOtherWork();
This was the scenario where all tasks were independent: no task needed another to complete before it could start. However, if you need task[2] to be completed before you can start task[3], you should await:
for (int i = 0; i < listClients.Count; i++)
{
    if (listClients[i] == 0)
        await DoSomethingWithClientAsync();
    else
        await DoOtherThingAsync();
}

Advice on processing a giant text file and processing URLs

I'm currently trying to loop through a text file that is about 1.5 GB in size, and then use the URLs grabbed from it to pull down the HTML from each site.
For speed I'm trying to process all the HTTP requests on new threads, but since C# is not my strongest language (just a requirement for what I'm doing) I'm a bit confused about good threading practice.
This is how I'm processing the list
private static void Main()
{
    const Int32 BufferSize = 128;
    using (var fileStream = File.OpenRead("dump.txt"))
    using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, BufferSize))
    {
        String line;
        var progress = 0;
        while ((line = streamReader.ReadLine()) != null)
        {
            var stuff = line.Split('|');
            getHTML(stuff[3]);
            progress += 1;
            Console.WriteLine(progress);
        }
    }
}
And I'm pulling down the HTML like so:
private static void getHTML(String url)
{
    new Thread(() =>
    {
        var client = new DecompressGzipResponse(); // presumably a WebClient subclass that handles gzip
        var html = client.DownloadString(url);
    }).Start();
}
Though it's fast initially, after about 20 thousand requests it slows down, and after about 32 thousand the application hangs and crashes. I was under the impression C# threads terminated when the function completed?
Can anyone give any examples/suggestions on how to do this better?
One very reliable way to do this is by using the producer-consumer pattern. You create a thread-safe queue of URLs (for example, BlockingCollection<Uri>). Your main thread is the producer, which adds items to the queue. You then have multiple consumer threads, each of which reads Urls from the queue and does the HTTP requests. See BlockingCollection.
Setting it up isn't terribly difficult:
BlockingCollection<Uri> UrlQueue = new BlockingCollection<Uri>();

// Main thread starts the consumer tasks
Task t1 = Task.Factory.StartNew(() => ProcessUrls(), TaskCreationOptions.LongRunning);
Task t2 = Task.Factory.StartNew(() => ProcessUrls(), TaskCreationOptions.LongRunning);
// create more tasks if you think necessary.

// Now read your file
foreach (var line in File.ReadLines(inputFileName))
{
    var theUri = ExtractUriFromLine(line);
    UrlQueue.Add(theUri);
}

// when done adding lines to the queue, mark the queue as complete
UrlQueue.CompleteAdding();

// now wait for the tasks to complete.
t1.Wait();
t2.Wait();
// You could also use Task.WaitAll if you have an array of tasks
The individual threads process the urls with this method:
void ProcessUrls()
{
    foreach (var uri in UrlQueue.GetConsumingEnumerable())
    {
        // code here to do a web request on that url
    }
}
That's a simple and reliable way to do things, but it's not especially quick. You can do much better by using a second queue of WebClient objects that make asynchronous requests. For example, say you want to have 15 asynchronous requests in flight. You start the same way with a BlockingCollection, but you only have one persistent consumer thread.
const int MaxRequests = 15;
BlockingCollection<WebClient> Clients = new BlockingCollection<WebClient>();

// start a single consumer thread
var ProcessingThread = Task.Factory.StartNew(() => ProcessUrls(), TaskCreationOptions.LongRunning);

// Create the WebClient objects and add them to the queue
for (var i = 0; i < MaxRequests; ++i)
{
    var client = new WebClient();

    // Add an event handler for the DownloadDataCompleted event
    client.DownloadDataCompleted += DownloadDataCompletedHandler;

    // And add this client to the queue
    Clients.Add(client);
}

// add the code from above that reads the file and populates the queue
Your processing function is somewhat different:
void ProcessUrls()
{
    foreach (var uri in UrlQueue.GetConsumingEnumerable())
    {
        // Wait for an available client
        var client = Clients.Take();

        // and make an asynchronous request
        client.DownloadDataAsync(uri, client);
    }

    // When the queue is empty, you need to wait for all of the
    // clients to complete their requests.
    // You know they're all done when you dequeue all of them.
    for (int i = 0; i < MaxRequests; ++i)
    {
        var client = Clients.Take();
        client.Dispose();
    }
}
Your DownloadDataCompleted event handler does something with the data that was downloaded, and then adds the WebClient instance back to the queue of clients.
void DownloadDataCompletedHandler(Object sender, DownloadDataCompletedEventArgs e)
{
    // The data downloaded is in e.Result
    // be sure to check the e.Error and e.Cancelled values to determine if an error occurred

    // do something with the data

    // And then add the client back to the queue
    WebClient client = (WebClient)e.UserState;
    Clients.Add(client);
}
This should keep you going with 15 concurrent requests, which is about all you can do without getting a bit more complicated. Your system can likely handle many more concurrent requests, but the way that WebClient starts asynchronous requests requires some synchronous work up front, and that overhead makes 15 about the maximum number you can handle.
You might be able to have multiple threads initiating the asynchronous requests. In that case, you could potentially have as many threads as you have processor cores. So on a quad core machine, you could have the main thread and three consumer threads. With three consumer threads this technique could give you 45 concurrent requests. I'm not certain that it scales that well, but it might be worth a try.
There are ways to have hundreds of concurrent requests, but they're quite a bit more complicated to implement.
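For comparison, here is a sketch of how the same throttling looks on .NET 4.5+ with HttpClient and SemaphoreSlim; because the waiting is asynchronous, hundreds of concurrent requests don't need hundreds of threads (the names below are placeholders, not from the question):

private static readonly HttpClient httpClient = new HttpClient();
private static readonly SemaphoreSlim throttler = new SemaphoreSlim(100); // ~100 requests in flight

private static async Task<string> GetHtmlAsync(string url)
{
    await throttler.WaitAsync(); // asynchronous wait; no thread is blocked here
    try
    {
        return await httpClient.GetStringAsync(url);
    }
    finally
    {
        throttler.Release();
    }
}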
You need thread management.
My advice is to use Tasks instead of creating your own Threads.
By using the Task Parallel Library, you let the runtime deal with the thread management. By default, it will allocate your tasks on threads from the ThreadPool, and will allow a level of concurrency which is contingent on the number of CPU cores you have. It will also reuse existing Threads when they become available instead of wasting time creating new ones.
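As a sketch of what that could look like for your loop (assuming getHTML is rewritten to download synchronously instead of starting a new Thread):

Parallel.ForEach(
    File.ReadLines("dump.txt"),                     // streams the file; doesn't load 1.5 GB into memory
    new ParallelOptions { MaxDegreeOfParallelism = 8 },
    line =>
    {
        var stuff = line.Split('|');
        getHTML(stuff[3]);                          // synchronous download on a pooled thread
    });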
If you want to get more advanced, you can create your own task scheduler to manage the scheduling aspect yourself.
See also What is difference between Task and Thread?

Loop through list and create multiple threads

I want to loop through a list of URLs and check each URL if the website is down or not using multiple threads.
My approach:
while (_lURLs.Count > 0)
{
    while (_iRunningThreads < _iNumThreads)
    {
        Thread t = new Thread(new ParameterizedThreadStart(CheckWebsite));
        string strUrl = GetNextURL();
        if (!string.IsNullOrEmpty(strUrl))
        {
            t.Start(strUrl);
            _iRunningThreads++;
        }
        else
        {
            break;
        }
    }
}
private string GetNextURL()
{
    lock (_lURLs)
    {
        if (_lURLs.Count > 0)
        {
            string strRetVal = _lURLs[0];
            _lURLs.RemoveAt(0);
            return strRetVal;
        }
        else
        {
            return string.Empty;
        }
    }
}
When a thread is finished the _iRunningThreads property gets decremented.
My problem is that the outer while loop, "while (_lURLs.Count > 0)", blocks everything.
Adding an Application.DoEvents() in the outer while loop helps, but I want to use the code in a C# library where Application.DoEvents() is not available.
Thank you for your help.
Instead of managing the threads yourself, you can use the TPL.
Also, if you're using .Net Framework 4.5 you can even add async/await and the WhenAll method to prevent blocking...
Here is a small example:
private async Task CheckUrl()
{
    List<Task> tasks = new List<Task>();
    string url = GetNextUrl();
    while (!String.IsNullOrEmpty(url))
    {
        // copy the loop variable so the lambda doesn't capture a variable that keeps changing
        string currentUrl = url;
        tasks.Add(Task.Run(() => CheckWebSite(currentUrl)));
        url = GetNextUrl();
    }
    await Task.WhenAll(tasks);
    // All tasks have finished...
}
I think using the .NET ThreadPool would be a good idea in this case, if the tasks take quite a short time to complete.
Check out: http://msdn.microsoft.com/en-us/library/4yd16hza.aspx
This allows you to simplify your code a bit as the ThreadPool automatically manages the count of the worker threads. You just have to call ThreadPool.QueueUserWorkItem for each URL you have and increment a running task counter. Queuing items into the ThreadPool won't block the UI thread.
Have the ThreadPool tasks decrement the counter (as you have now), and when the counter gets to zero (all tasks have been run) call a callback function so that your main code knows when all the URLs have been processed. You can update the UI or whatever else you want to do from that callback.
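A minimal sketch of that counter-plus-callback idea, reusing GetNextURL and CheckWebsite from the question (ProcessAllUrls and onAllDone are placeholder names):

private static int _iRunningThreads = 1; // start at 1 so the callback can't fire while we're still queueing

private static void ProcessAllUrls(Action onAllDone)
{
    string strUrl = GetNextURL();
    while (!string.IsNullOrEmpty(strUrl))
    {
        Interlocked.Increment(ref _iRunningThreads);
        ThreadPool.QueueUserWorkItem(state =>
        {
            try { CheckWebsite(state); }
            finally
            {
                // the last work item to finish fires the callback
                if (Interlocked.Decrement(ref _iRunningThreads) == 0)
                    onAllDone();
            }
        }, strUrl);
        strUrl = GetNextURL();
    }

    // release the initial count; if all work already finished, fire the callback here
    if (Interlocked.Decrement(ref _iRunningThreads) == 0)
        onAllDone();
}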
