Query on Queues and Thread Safety - c#

Thread safety is not something I have had to worry about much, as the simple apps and libraries I have written usually run only on the main thread, or do not directly modify properties or fields in any classes where it would matter.
However, I have started working on a personal project in which I am using a WebClient to download data asynchronously from a remote server. A Queue<Uri> holds a pre-built queue of URIs to download.
So consider the following snippet (this is not my real code, but something I hope illustrates my question):
private WebClient webClient = new WebClient();
private Queue<Uri> requestQueue = new Queue<Uri>();

public Boolean DownloadNextAsync()
{
    if (webClient.IsBusy)
        return false;

    if (requestQueue.Count == 0)
        return false;

    var uri = requestQueue.Dequeue();
    webClient.DownloadDataAsync(uri);
    return true;
}
If I understand correctly, this method is not thread-safe (assuming this specific instance of the object is known to multiple threads). My reasoning: webClient could become busy between the IsBusy check and the DownloadDataAsync() call, and requestQueue could become empty between the Count check and the Dequeue() call.
My question is what is the best way to handle this type of situation to make it thread-safe?
This is more of an abstract question; I realize that for this specific method the timing would have to be exceptionally inconvenient for a problem to actually occur, and to cover that case I could just wrap the method in an appropriate try-catch, since both races would throw an exception. But is there another option? Would a lock statement be applicable here?

If you're targeting .NET 4.0, you could use the Task Parallel Library for help:
var queue = new BlockingCollection<Uri>();
var maxClients = 4;

// Optionally provide another producer/consumer collection for the data
// var data = new BlockingCollection<Tuple<Uri, byte[]>>();
// Optionally implement CancellationTokenSource

var clients = from id in Enumerable.Range(0, maxClients)
              select Task.Factory.StartNew(
                  () =>
                  {
                      var client = new WebClient();
                      while (!queue.IsCompleted)
                      {
                          Uri uri;
                          if (queue.TryTake(out uri))
                          {
                              // DownloadData blocks, but we're already on a worker task
                              byte[] datum = client.DownloadData(uri);
                              // Optionally pass datum along to the other collection
                              // or work on it here
                          }
                          else Thread.SpinWait(100);
                      }
                  });

// Add URI's to search
// queue.Add(...);

// Notify our clients that we've added all the URI's
queue.CompleteAdding();

// Wait for all of our clients to finish
Task.WaitAll(clients.ToArray());
To use this approach with progress indication, you can use TaskCompletionSource<TResult> to wrap the event-based asynchronous pattern:
public static Task<byte[]> DownloadAsync(Uri uri, Action<double> progress)
{
    var source = new TaskCompletionSource<byte[]>();

    Task.Factory.StartNew(
        () =>
        {
            var client = new WebClient();
            client.DownloadProgressChanged
                += (sender, e) => progress(e.ProgressPercentage);
            client.DownloadDataCompleted
                += (sender, e) =>
                {
                    if (!e.Cancelled)
                    {
                        if (e.Error == null)
                        {
                            source.SetResult(e.Result);
                        }
                        else
                        {
                            source.SetException(e.Error);
                        }
                    }
                    else
                    {
                        source.SetCanceled();
                    }
                };

            // Start the download; without this call the returned task
            // would never complete.
            client.DownloadDataAsync(uri);
        });

    return source.Task;
}
Used like so:
// var urls = new List<Uri>(...);
// var progressBar = new ProgressBar();

Task.Factory.StartNew(
    () =>
    {
        foreach (var uri in urls)
        {
            var task = DownloadAsync(
                uri,
                p => progressBar.Invoke(
                    new MethodInvoker(
                        // p is already a percentage (0-100)
                        delegate { progressBar.Value = (int)p; }))
                );

            // Will block!
            // data = task.Result;
        }
    });

I highly recommend reading "Threading In C#" by Joseph Albahari. I have taken a look through it in preparation for my first (mis)adventure into threading and it's pretty comprehensive.
You can read it here: http://www.albahari.com/threading/

Both of the thread-safety concerns you raise are valid. Furthermore, both WebClient and Queue are documented as not being thread-safe (at the bottom of their MSDN docs). For instance, if two threads were dequeuing simultaneously, they might leave the queue internally inconsistent or return nonsensical values. For example, if the implementation of Dequeue() were something like:
1. var valueToDequeue = this._internalList[this._startPointer];
2. this._startPointer = (this._startPointer + 1) % this._internalList.Count;
3. return valueToDequeue;
and two threads each executed line 1 before either continued to line 2, then both would return the same value (and there are other potential issues here as well). This would not necessarily throw an exception, so you should use a lock statement to guarantee that only one thread can be inside the method at a time:
private readonly object _lock = new object();
...
lock (this._lock)
{
    // body of method
}
You could also lock on the WebClient or the Queue themselves if you know that no one else will be synchronizing on them.
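Applied to the method from the question, a minimal sketch might look like this (same names as the original snippet):

private readonly object _lock = new object();
private WebClient webClient = new WebClient();
private Queue<Uri> requestQueue = new Queue<Uri>();

public Boolean DownloadNextAsync()
{
    // Only one thread at a time can run the check-then-act sequences,
    // so IsBusy/DownloadDataAsync and Count/Dequeue can no longer interleave.
    lock (this._lock)
    {
        if (webClient.IsBusy)
            return false;

        if (requestQueue.Count == 0)
            return false;

        var uri = requestQueue.Dequeue();
        webClient.DownloadDataAsync(uri);
        return true;
    }
}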

Related

non reentrant observable in c#

Given the following method:
If I leave the hack in place, my unit test completes immediately with "observable has no data".
If I take the hack out, there are multiple threads all attempting to log in at the same time.
The host service does not allow this.
How do I ensure that only one thread is producing observables at any given point in time?
private static object obj = new object();
private static bool here = true;

public IObservable<Party> LoadAllParties(CancellationToken token)
{
    var parties = Observable.Create<Party>(
        async (observer, cancel) =>
        {
            // this is just a hack to test behavior
            lock (obj)
            {
                if (!here)
                    return;
                here = false;
            }
            // end of hack.
            try
            {
                if (!await this.RequestLogin(observer, cancel))
                    return;

                // request list.
                await this._request.GetAsync(this._configuration.Url.RequestList);
                if (this.IsCancelled(observer, cancel))
                    return;

                while (!cancel.IsCancellationRequested)
                {
                    var entities = await this._request.GetAsync(this._configuration.Url.ProcessList);
                    if (this.IsCancelled(observer, cancel))
                        return;

                    var tranche = this.ExtractParties(entities);

                    // break out if it's the last page.
                    if (!tranche.Any())
                        break;

                    Array.ForEach(tranche, observer.OnNext);

                    await this._request.GetAsync(this._configuration.Url.ProceedList);
                    if (this.IsCancelled(observer, cancel))
                        return;
                }

                observer.OnCompleted();
            }
            catch (Exception ex)
            {
                observer.OnError(ex);
            }
        });
    return parties;
}
My Unit Test:
var sut = container.Resolve<SyncDataManager>();
var count = 0;
var token = new CancellationTokenSource();

var observable = sut.LoadAllParties(token.Token);
observable.Subscribe(party => count++);
await observable.ToTask(token.Token);

count.Should().BeGreaterThan(0);
I do think your question is suffering from the XY Problem: the code calls several methods not shown, which may have important side effects, and I feel that advice based only on the information available won't be the best advice.
That said, I suspect you did not intend to subscribe to the observable twice: once with the explicit Subscribe call, and once with the ToTask() call. That would certainly explain the concurrent login calls, which are occurring in two different subscriptions.
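As a sketch (assuming the Rx and FluentAssertions usings from the question, and Rx 2.0's LastOrDefaultAsync), the test could count and await through a single subscription:

var count = 0;
await sut.LoadAllParties(token.Token)
    .Do(party => count++)      // side effect on the one shared subscription
    .LastOrDefaultAsync()      // don't throw if the sequence is empty
    .ToTask(token.Token);
count.Should().BeGreaterThan(0);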
EDIT:
How about asserting on the length instead (tweak the timeout to suit):
var length = await observable.Count().Timeout(TimeSpan.FromSeconds(3));
Better would be to look into Rx-Testing and mock your dependencies. That's a big topic, but this long blog post from the Rx team explains it very well and this answer regarding TPL-Rx interplay may help: Executing TPL code in a reactive pipeline and controlling execution via test scheduler

To call method asynchronously from loop in c#

I have a requirement where I have to pull data from the Sailthru API. The problem is that it will take forever if I make the calls synchronously, as the response time depends on the data. I am very much new to threading and tried something out, but it didn't seem to work as expected. Could anyone please guide me?
Below is my sample code:
public void GetJobId()
{
    Hashtable BlastIDs = getBlastIDs();
    foreach (DictionaryEntry entry in BlastIDs)
    {
        Hashtable blastStats = new Hashtable();
        blastStats.Add("stat", "blast");
        blastStats.Add("blast_id", entry.Value.ToString());

        //Function call 1
        //Thread newThread = new Thread(() =>
        //{
            GetBlastDetails(entry.Value.ToString());
        //});
        //newThread.Start();
    }
}

public void GetBlastDetails(string blast_id)
{
    Hashtable tbData = new Hashtable();
    tbData.Add("job", "blast_query");
    tbData.Add("blast_id", blast_id);

    response = client.ApiPost("job", tbData);
    object data = response.RawResponse;

    JObject jtry = new JObject();
    jtry = JObject.Parse(response.RawResponse.ToString());

    if (jtry.SelectToken("job_id") != null)
    {
        //Function call 2
        Thread newThread = new Thread(() =>
        {
            GetJobwiseDetail(jtry.SelectToken("job_id").ToString(), client, blast_id);
        });
        newThread.Start();
    }
}

public void GetJobwiseDetail(string job_id, SailthruClient client, string blast_id)
{
    Hashtable tbData = new Hashtable();
    tbData.Add("job_id", job_id);

    SailthruResponse response;
    response = client.ApiGet("job", tbData);

    JObject jtry = new JObject();
    jtry = JObject.Parse(response.RawResponse.ToString());

    string status = jtry.SelectToken("status").ToString();
    if (status != "completed")
    {
        //Function call 3
        Thread.Sleep(3000);
        Thread newThread = new Thread(() =>
        {
            GetJobwiseDetail(job_id, client, blast_id);
        });
        newThread.Start();
        string str = "test sleeping thread";
    }
    else
    {
        string export_url = jtry.SelectToken("export_url").ToString();
        TraceService(export_url);
        SaveCSVDataToDB(export_url, blast_id);
    }
}
I want Function call 1 to start asynchronously (or maybe after a gap of 3 seconds, to avoid load on the processor). In Function call 3 I am calling the same function again, with a delay of 3 seconds, if the status is not completed, to give the job time to finish.
Also correct me if my question sounds stupid.
You should never use Thread.Sleep like that, because among other things you don't know whether 3000 ms will be enough. And instead of the Thread class you should use the Task class, which is a much better option because it provides additional features and thread pool management. I don't have access to an IDE right now, but roughly: use Task.Factory.StartNew to invoke your request asynchronously, have GetJobwiseDetail return the value you want to save to the database, and then use .ContinueWith with a delegate that saves the function result to the database. If you are able to use .NET 4.5, you should try async/await.
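A rough sketch of that suggestion (it assumes GetJobwiseDetail is changed to return the fetched detail instead of void; SaveDetailToDb is a hypothetical database-write method):

// Run the API call on the thread pool, then chain the database save onto it.
Task.Factory.StartNew(() => GetJobwiseDetail(job_id, client, blast_id))
    .ContinueWith(t => SaveDetailToDb(t.Result));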
An easier solution: Parallel.ForEach.
You can find plenty of detail about it online, and you need to know almost nothing about threading: it will invoke each loop iteration on another thread, as sketched below.
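A minimal sketch applied to the loop from the question (Hashtable is non-generic, so the entries are cast first; requires usings for System.Linq and System.Threading.Tasks):

// Run GetBlastDetails for each blast ID in parallel.
Parallel.ForEach(getBlastIDs().Cast<DictionaryEntry>(),
    entry => GetBlastDetails(entry.Value.ToString()));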
First of all, avoid at all costs starting new Threads in your code; those threads will suck up your memory like a collapsing star, because each one gets ~1 MB of memory allocated to it.
Now for the code - depending on the framework you can choose from the following:
ThreadPool.QueueUserWorkItem for older versions of .NET Framework
Parallel.ForEach for .NET Framework 4
async and await for .NET Framework 4.5
TPL Dataflow also for .NET Framework 4.5
The code you show fits quite well with Dataflow. I also wouldn't suggest using async/await here, because it would transform your example into a sort of fire-and-forget mechanism, which is against the recommendations for using async/await.
To use Dataflow you'll need, in broad lines:
one TransformBlock which will take a string as input and return the response from the API
one BroadcastBlock that will broadcast the response to
two ActionBlocks: one to store data in the database and the other to call TraceService
The code should look like this:
var downloader = new TransformBlock<string, SailthruResponse>(jobId =>
{
    var data = new Hashtable();
    data.Add("job_id", jobId);
    return client.ApiGet("job", data);
});

var broadcaster = new BroadcastBlock<SailthruResponse>(response => response);

var databaseWriter = new ActionBlock<SailthruResponse>(response =>
{
    // save to database...
});

var tracer = new ActionBlock<SailthruResponse>(response =>
{
    // TraceService() call
});

var options = new DataflowLinkOptions { PropagateCompletion = true };

// link blocks
downloader.LinkTo(broadcaster, options);
broadcaster.LinkTo(databaseWriter, options);
broadcaster.LinkTo(tracer, options);

// process values
foreach (var value in getBlastIds())
{
    downloader.Post(value);
}
downloader.Complete();
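Because the links were created with PropagateCompletion = true, completion flows downstream from the downloader; a sketch of waiting for the whole pipeline to drain:

// Both terminal blocks complete once every posted value has been processed.
Task.WaitAll(databaseWriter.Completion, tracer.Completion);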
Try it with async, as below:
async Task Task_MethodAsync()
{
    // . . .
    // The method has no return statement.
}

WebClient.DownloadDataCompleted not firing

I have a very strange problem. My WebClient.DownloadDataCompleted doesn't fire most of the time.
I am using this class:
public class ParallelFilesDownloader
{
    public Task DownloadFilesAsync(IEnumerable<Tuple<Uri, Stream>> files, CancellationToken cancellationToken)
    {
        var localFiles = files.ToArray();

        var tcs = new TaskCompletionSource<object>();

        var clients = new List<WebClient>();
        cancellationToken.Register(
            () =>
            {
                // Break point here
                foreach (var wc in clients.Where(x => x != null))
                    wc.CancelAsync();
            });

        var syncRoot = new object();
        var count = 0;
        foreach (var file in localFiles)
        {
            var client = new WebClient();
            client.DownloadDataCompleted += (s, args) =>
            {
                // Break point here
                if (args.Cancelled)
                    tcs.TrySetCanceled();
                else if (args.Error != null)
                    tcs.TrySetException(args.Error);
                else
                {
                    var stream = (Stream)args.UserState;
                    stream.Write(args.Result, 0, args.Result.Length);

                    lock (syncRoot)
                    {
                        count++;
                        if (count == localFiles.Length)
                            tcs.TrySetResult(null);
                    }
                }
            };
            clients.Add(client);
            client.DownloadDataAsync(file.Item1, file.Item2);
        }

        return tcs.Task;
    }
}
When I call DownloadFilesAsync in LINQPad in isolation, DownloadDataCompleted is called after half a second or so, as expected.
However, in my real application, it simply doesn't fire, and the code that waits for it to finish is simply stuck. I have two break points, as indicated by the comments; neither of them is hit.
Ah, but sometimes, it does fire. Same URL, same code, just a new debug session. No pattern at all.
I checked the available threads from the thread pool: workerThreads > 30k, completionPortThreads = 999.
I added a sleep of 10 seconds before the return and checked after the sleep that my web clients haven't been garbage collected and that my event handler is still attached.
Now I have run out of ideas for troubleshooting this.
What else could cause this strange behaviour?
From the comments:
Somewhere later, there is a Task.WaitAll which waits for this and other tasks. However, (1) I fail to see why this would affect the async download - please elaborate - and (2) the problem doesn't disappear when I add the sleep, and as such Task.WaitAll will not be called.
It seems that you have a deadlock caused by Task.WaitAll. I can explain it thoroughly here:
When you await an async method that returns a Task or a Task<T>, there is an implicit capture of the SynchronizationContext by the TaskAwaitable being generated by the Task.GetAwaiter method.
Once that sync context is in place and the async method call completes, the TaskAwaitable attempts to marshal the continuation (which is basically the rest of the method after the first await keyword) onto the previously captured SynchronizationContext (using SynchronizationContext.Post). If the calling thread is blocked, waiting on that same method to finish, you have a deadlock.
When you call Task.WaitAll, you block until all the Tasks are complete; this makes marshaling back to the original context impossible, and you basically deadlock.
Instead of using Task.WaitAll, use await Task.WhenAll.
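For example (a sketch; downloadTask stands in for the task returned by DownloadFilesAsync):

// Blocking wait: can deadlock on a UI or ASP.NET SynchronizationContext.
// Task.WaitAll(downloadTask);

// Asynchronous wait: the continuation is scheduled instead of blocking.
await Task.WhenAll(downloadTask);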
Based on the comments, not an ideal answer but you can temporarily change the sync context before and after the foreach:
var syncContext = SynchronizationContext.Current;
SynchronizationContext.SetSynchronizationContext(null);

foreach (var file in localFiles)
{
    ...
}

SynchronizationContext.SetSynchronizationContext(syncContext);

Parallel.Invoke - Dynamically creating more 'threads'

I am educating myself on Parallel.Invoke, and on parallel processing in general, for use in a current project. I need a push in the right direction to understand how you can dynamically/intelligently allocate more parallel 'threads' as required.
As an example, say you are parsing large log files. This involves reading from the file, some sort of parsing of the returned lines, and finally writing to a database.
So to me this is a typical problem that can benefit from parallel processing.
As a simple first pass the following code implements this.
Parallel.Invoke(
    () => readFileLinesToBuffer(),
    () => parseFileLinesFromBuffer(),
    () => updateResultsToDatabase()
);
Behind the scenes:
readFileLinesToBuffer() reads each line and stores it in a buffer.
parseFileLinesFromBuffer() comes along, consumes lines from that buffer, and, let's say, puts them on another buffer so that updateResultsToDatabase() can come along and consume that one.
So the code shown assumes that each of the three steps uses the same amount of time/resources, but let's say parseFileLinesFromBuffer() is a long-running process, so instead of running just one instance of it you want to run two in parallel.
How can you have the code intelligently decide to do this based on any bottlenecks it might perceive?
Conceptually I can see how some approach of monitoring the buffer sizes might work, spawning a new 'thread' to consume a buffer at an increased rate, for example... but I figure this type of issue has been considered in putting together the TPL library.
Some sample code would be great but I really just need a clue as to what concepts I should investigate next. It looks like maybe the System.Threading.Tasks.TaskScheduler holds the key?
Have you tried the Reactive Extensions?
http://msdn.microsoft.com/en-us/data/gg577609.aspx
Rx is a new technology from Microsoft; its focus, as stated on the official site:

The Reactive Extensions (Rx) ... is a library to compose asynchronous and event-based programs using observable collections and LINQ-style query operators.
You can download it as a Nuget package
https://nuget.org/packages/Rx-Main/1.0.11226
Since I am currently learning Rx, I wanted to take this example and just write code for it. The code I ended up with is not actually executed in parallel, but it is completely asynchronous and guarantees the source lines are processed in order.
Perhaps this is not the best implementation, but, like I said, I am learning Rx (making it thread-safe would be a good improvement).
This is a DTO that I am using to return data from the background threads
class MyItem
{
    public string Line { get; set; }
    public int CurrentThread { get; set; }
}
These are the basic methods doing the real work. I am simulating the work with a simple Thread.Sleep, and I am returning the thread used to execute each method via Thread.CurrentThread.ManagedThreadId. Note the timing of ProcessLine: at 4 seconds, it is the most time-consuming operation.
private IEnumerable<MyItem> ReadLinesFromFile(string fileName)
{
    var source = from e in Enumerable.Range(1, 10)
                 let v = e.ToString()
                 select v;

    foreach (var item in source)
    {
        Thread.Sleep(1000);
        yield return new MyItem { CurrentThread = Thread.CurrentThread.ManagedThreadId, Line = item };
    }
}

private MyItem UpdateResultToDatabase(string processedLine)
{
    Thread.Sleep(700);
    return new MyItem { Line = "s" + processedLine, CurrentThread = Thread.CurrentThread.ManagedThreadId };
}

private MyItem ProcessLine(string line)
{
    Thread.Sleep(4000);
    return new MyItem { Line = "p" + line, CurrentThread = Thread.CurrentThread.ManagedThreadId };
}
The following method is used just to update the UI:
private void DisplayResults(MyItem myItem, Color color, string message)
{
    this.listView1.Items.Add(
        new ListViewItem(
            new[]
            {
                message,
                myItem.Line,
                myItem.CurrentThread.ToString(),
                Thread.CurrentThread.ManagedThreadId.ToString()
            })
        {
            ForeColor = color
        });
}
And finally this is the method that calls the Rx API
private void PlayWithRx()
{
    // we init the observable with the lines read from the file
    var source = this.ReadLinesFromFile("some file").ToObservable(Scheduler.TaskPool);

    source.ObserveOn(this).Subscribe(x =>
    {
        // for each line read, we update the UI
        this.DisplayResults(x, Color.Red, "Read");

        // for each line read, we subscribe the line to the ProcessLine method
        var process = Observable.Start(() => this.ProcessLine(x.Line), Scheduler.TaskPool)
            .ObserveOn(this).Subscribe(c =>
            {
                // for each line processed, we update the UI
                this.DisplayResults(c, Color.Blue, "Processed");

                // for each line processed we subscribe to the final step, the UpdateResultToDatabase method,
                // and we update the UI when the processed line has been saved to the database
                var persist = Observable.Start(() => this.UpdateResultToDatabase(c.Line), Scheduler.TaskPool)
                    .ObserveOn(this).Subscribe(z => this.DisplayResults(z, Color.Black, "Saved"));
            });
    });
}
This process runs totally in the background.
In an async/await world, you'd have something like:
public async Task ProcessFileAsync(string filename)
{
    var lines = await ReadLinesFromFileAsync(filename);
    var parsed = await ParseLinesAsync(lines);
    await UpdateDatabaseAsync(parsed);
}
then a caller could just do var tasks = filenames.Select(ProcessFileAsync).ToArray(); and whatever fits the context (WaitAll, WhenAll, etc.).
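For instance, a sketch of such a caller (assuming it is itself async):

var tasks = filenames.Select(ProcessFileAsync).ToArray();
await Task.WhenAll(tasks);   // or Task.WaitAll(tasks) in synchronous code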
Use a couple of BlockingCollection instances. Here is an example.
The idea is that you create a producer that puts data into the collection
while (true)
{
    var data = ReadData();
    blockingCollection1.Add(data);
}
Then you create any number of consumers that reads from the collection
while (true)
{
    var data = blockingCollection1.Take();
    var processedData = ProcessData(data);
    blockingCollection2.Add(processedData);
}
and so on
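One detail these infinite loops gloss over: for the pipeline to shut down, the producer can call CompleteAdding(), and consumers can then iterate GetConsumingEnumerable(), which ends once the collection is drained (a sketch):

// Producer signals that no more items are coming.
blockingCollection1.CompleteAdding();

// Consumer loop that terminates when the collection is drained and completed.
foreach (var data in blockingCollection1.GetConsumingEnumerable())
{
    var processedData = ProcessData(data);
    blockingCollection2.Add(processedData);
}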
You can also let the TPL handle the number of consumers by using Parallel.ForEach:
Parallel.ForEach(blockingCollection1.GetConsumingPartitioner(),
    data =>
    {
        var processedData = ProcessData(data);
        blockingCollection2.Add(processedData);
    });
(Note that you need to use GetConsumingPartitioner, not GetConsumingEnumerable.)

Mass Downloading of Webpages C#

My application requires that I download a large amount of webpages into memory for further parsing and processing. What is the fastest way to do it? My current method (shown below) seems to be too slow and occasionally results in timeouts.
for (int i = 1; i <= pages; i++)
{
    string page_specific_link = baseurl + "&page=" + i.ToString();

    try
    {
        WebClient client = new WebClient();
        var pagesource = client.DownloadString(page_specific_link);
        client.Dispose();
        sourcelist.Add(pagesource);
    }
    catch (Exception)
    {
    }
}
The way you approach this problem is going to depend very much on how many pages you want to download, and how many sites you're referencing.
I'll use a good round number like 1,000. If you want to download that many pages from a single site, it's going to take a lot longer than if you want to download 1,000 pages that are spread out across dozens or hundreds of sites. The reason is that if you hit a single site with a whole bunch of concurrent requests, you'll probably end up getting blocked.
So you have to implement a type of "politeness policy" that issues a delay between multiple requests to a single site. The length of that delay depends on a number of things. If the site's robots.txt file has a crawl-delay entry, you should respect that. If they don't want you accessing more than one page per minute, then that's as fast as you should crawl. If there's no crawl-delay, you should base your delay on how long it takes the site to respond. For example, if you can download a page from the site in 500 milliseconds, you set your delay to X. If it takes a full second, set your delay to 2X. You can probably cap your delay at 60 seconds (unless crawl-delay is longer), and I would recommend that you set a minimum delay of 5 to 10 seconds.
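A sketch of that heuristic (the multiplier and the exact bounds are assumptions; tune them to taste):

// Delay scales with the server's observed response time,
// clamped to [5s, 60s], but a robots.txt crawl-delay always wins.
static TimeSpan ComputeDelay(TimeSpan lastResponseTime, TimeSpan? crawlDelay)
{
    var delay = TimeSpan.FromTicks(lastResponseTime.Ticks * 2);
    if (delay < TimeSpan.FromSeconds(5)) delay = TimeSpan.FromSeconds(5);
    if (delay > TimeSpan.FromSeconds(60)) delay = TimeSpan.FromSeconds(60);
    if (crawlDelay.HasValue && crawlDelay.Value > delay) delay = crawlDelay.Value;
    return delay;
}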
I wouldn't recommend using Parallel.ForEach for this. My testing has shown that it doesn't do a good job. Sometimes it over-taxes the connection and often it doesn't allow enough concurrent connections. I would instead create a queue of WebClient instances and then write something like:
// Create queue of WebClient instances
BlockingCollection<WebClient> ClientQueue = new BlockingCollection<WebClient>();
// Initialize queue with some number of WebClient instances

// now process urls
foreach (var url in urls_to_download)
{
    var worker = ClientQueue.Take();
    worker.DownloadStringAsync(url, ...);
}
When you initialize the WebClient instances that go into the queue, set their DownloadStringCompleted event handlers to point to a completed event handler. That handler should save the string to a file (or perhaps you should just use DownloadFileAsync), and then the client adds itself back to the ClientQueue.
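A sketch of that handler wiring (SavePage is a hypothetical persistence method):

// Runs once per WebClient as it is added to the pool.
void InitializeClient(WebClient client)
{
    client.DownloadStringCompleted += (s, e) =>
    {
        if (e.Error == null)
            SavePage(e.Result);          // hypothetical: persist the page
        ClientQueue.Add((WebClient)s);   // return the client to the pool
    };
}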
In my testing, I've been able to support 10 to 15 concurrent connections with this method. Any more than that and I run into problems with DNS resolution (DownloadStringAsync doesn't do the DNS resolution asynchronously). You can get more connections, but doing so is a lot of work.
That's the approach I've taken in the past, and it's worked very well for downloading thousands of pages quickly. It's definitely not the approach I took with my high performance Web crawler, though.
I should also note that there is a huge difference in resource usage between these two blocks of code:
WebClient MyWebClient = new WebClient();
foreach (var url in urls_to_download)
{
    MyWebClient.DownloadString(url);
}

---------------

foreach (var url in urls_to_download)
{
    WebClient MyWebClient = new WebClient();
    MyWebClient.DownloadString(url);
}
The first allocates a single WebClient instance that is used for all requests. The second allocates one WebClient for each request. The difference is huge. WebClient uses a lot of system resources, and allocating thousands of them in a relatively short time is going to impact performance. Believe me ... I've run into this. You're better off allocating just 10 or 20 WebClients (as many as you need for concurrent processing), rather than allocating one per request.
Why not just use a web crawling framework? It can handle all the stuff for you (multithreading, HTTP requests, parsing links, scheduling, politeness, etc.).
Abot (https://code.google.com/p/abot/) handles all that stuff for you and is written in C#.
In addition to @David's perfectly valid answer, I want to add a slightly cleaner "version" of his approach.
var pages = new List<string> { "http://bing.com", "http://stackoverflow.com" };
var sources = new BlockingCollection<string>();

Parallel.ForEach(pages, x =>
{
    using (var client = new WebClient())
    {
        var pagesource = client.DownloadString(x);
        sources.Add(pagesource);
    }
});
Yet another approach that uses async (note: the client must not be disposed while its download is still in flight, so disposal happens in the completed handler):
static IEnumerable<string> GetSources(List<string> pages)
{
    var sources = new BlockingCollection<string>();
    var latch = new CountdownEvent(pages.Count);

    foreach (var p in pages)
    {
        var wc = new WebClient();
        wc.DownloadStringCompleted += (x, e) =>
        {
            sources.Add(e.Result);
            latch.Signal();
            // Dispose only after the download has completed; a using block
            // around DownloadStringAsync would tear the request down early.
            ((WebClient)x).Dispose();
        };
        wc.DownloadStringAsync(new Uri(p));
    }

    latch.Wait();
    return sources;
}
You should use parallel programming for this purpose.
There are a lot of ways to achieve what you want; the easiest would be something like this:
var pageList = new List<string>();
for (int i = 1; i <= pages; i++)
{
    pageList.Add(baseurl + "&page=" + i.ToString());
}

// pageList is a list of urls
Parallel.ForEach<string>(pageList, (page) =>
{
    try
    {
        WebClient client = new WebClient();
        var pagesource = client.DownloadString(page);
        client.Dispose();
        lock (sourcelist)
            sourcelist.Add(pagesource);
    }
    catch (Exception) { }
});
I had a similar case, and this is how I solved it:
using System;
using System.Threading;
using System.Collections.Generic;
using System.Net;
using System.IO;

namespace WebClientApp
{
    class MainClassApp
    {
        private static int requests = 0;
        private static object requests_lock = new object();

        public static void Main()
        {
            List<string> urls = new List<string> { "http://www.google.com", "http://www.slashdot.org" };
            foreach (var url in urls)
            {
                ThreadPool.QueueUserWorkItem(GetUrl, url);
            }

            int cur_req = 0;
            while (cur_req < urls.Count)
            {
                lock (requests_lock)
                {
                    cur_req = requests;
                }
                Thread.Sleep(1000);
            }
            Console.WriteLine("Done");
        }

        private static void GetUrl(Object the_url)
        {
            string url = (string)the_url;
            WebClient client = new WebClient();
            Stream data = client.OpenRead(url);

            StreamReader reader = new StreamReader(data);
            string html = reader.ReadToEnd();

            // Do something with html
            Console.WriteLine(html);

            lock (requests_lock)
            {
                // Maybe you could add the HTML to SourceList here
                requests++;
            }
        }
    }
}
You should think about using parallelism here, because the slow speed comes from your software waiting on I/O; while one thread is waiting on I/O, another one can get started.
While the other answers are perfectly valid, all of them (at the time of this writing) are neglecting something very important: calls to the web are IO bound, and having a thread block on an operation like this is going to strain system resources and impact your application's performance.
What you really want to do is take advantage of the async methods on the WebClient class (as some have pointed out) as well as the Task Parallel Library's ability to handle the Event-Based Asynchronous Pattern.
First, you would get the urls that you want to download:
IEnumerable<Uri> urls = pages.Select(i => new Uri(baseurl +
"&page=" + i.ToString(CultureInfo.InvariantCulture)));
Then, you would create a new WebClient instance for each url, using the TaskCompletionSource<T> class to handle the calls asynchronously (this won't burn a thread):
IEnumerable<Task<Tuple<Uri, string>>> tasks = urls.Select(url =>
{
    // Create the task completion source.
    var tcs = new TaskCompletionSource<Tuple<Uri, string>>();

    // The web client.
    var wc = new WebClient();

    // Attach to the DownloadStringCompleted event.
    wc.DownloadStringCompleted += (s, e) =>
    {
        // Dispose of the client when done.
        using (wc)
        {
            // If there is an error, set it.
            if (e.Error != null)
            {
                tcs.SetException(e.Error);
            }
            // Otherwise, set cancelled if cancelled.
            else if (e.Cancelled)
            {
                tcs.SetCanceled();
            }
            else
            {
                // Set the result.
                tcs.SetResult(new Tuple<Uri, string>(url, e.Result));
            }
        }
    };

    // Start the process asynchronously, don't burn a thread.
    wc.DownloadStringAsync(url);

    // Return the task.
    return tcs.Task;
});
Now you have an IEnumerable<T> which you can convert to an array and wait on all of the results using Task.WaitAll:
// Materialize the tasks.
Task<Tuple<Uri, string>>[] materializedTasks = tasks.ToArray();

// Wait for all to complete.
Task.WaitAll(materializedTasks);
Then, you can just use the Result property on the Task<T> instances to get the pair of the url and the content:
// Cycle through each of the results.
foreach (Tuple<Uri, string> pair in materializedTasks.Select(t => t.Result))
{
    // pair.Item1 will contain the Uri.
    // pair.Item2 will contain the content.
}
Note that the above code has the caveat of not having any error handling.
If you wanted to get even more throughput, instead of waiting for the entire list to finish you could process the content of a single page after it's done downloading; Task<T> is meant to be used like a pipeline: when you've completed your unit of work, have it continue to the next one instead of waiting for all of the items to be done (if they can be done in an asynchronous manner).
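A sketch of that pipelining with ContinueWith (ProcessPage is a hypothetical per-page processing method):

// Chain processing onto each download instead of blocking on all of them.
var processed = tasks
    .Select(task => task.ContinueWith(
        t => ProcessPage(t.Result.Item1, t.Result.Item2)))
    .ToArray();

Task.WaitAll(processed);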
I am using an active-thread count and an arbitrary limit:
private static volatile int activeThreads = 0;

public static void RecordData()
{
    var groupSize = 10; // arbitrary batch size
    var source = db.ListOfUrls; // Thousands of urls
    var iterations = source.Length / groupSize;

    for (int i = 0; i < iterations; i++)
    {
        var subList = source.Skip(groupSize * i).Take(groupSize);
        Parallel.ForEach(subList, (item) => RecordUri(item));

        // Wait here before processing further data, to avoid overload
        while (activeThreads > 30) Thread.Sleep(100);
    }
}

private static async Task RecordUri(Uri uri)
{
    using (WebClient wc = new WebClient())
    {
        Interlocked.Increment(ref activeThreads);
        wc.DownloadStringCompleted += (sender, e) => Interlocked.Decrement(ref activeThreads);

        var jsonData = await wc.DownloadStringTaskAsync(uri);
        var root = JsonConvert.DeserializeObject<RootObject>(jsonData);
        RecordData(root); // overload that saves the parsed data (not shown)
    }
}
