Getting HTML response fails repeatedly after first failure - C#

I have a program which gets the HTML code for ~500 webpages every 5 minutes.
It runs correctly until the first failure (unable to download the source within 6 seconds).
After that, all threads fail, and if I restart the program it again runs correctly until the next failure.
Where am I going wrong, and what should I do to handle this better?
This function runs every 5 minutes:
foreach (Company company in companies)
{
    string link = company.GetLink();
    Thread t = new Thread(() => F(company, link));
    t.Start();
    if (!t.Join(TimeSpan.FromSeconds(6)))
    {
        Debug.WriteLine(company.Name + " Fails");
        t.Abort();
    }
}
and this function downloads the HTML code:
private void F(Company company, string link)
{
    try
    {
        string htmlCode = GetInformationFromWeb.GetHtmlRequest(link);
        company.HtmlCode = htmlCode;
    }
    catch (Exception ex)
    {
        // exception is swallowed here
    }
}
and this class:
public class GetInformationFromWeb
{
    public static string GetHtmlRequest(string url)
    {
        using (MyWebClient client = new MyWebClient())
        {
            client.Encoding = Encoding.UTF8;
            string htmlCode = client.DownloadString(url);
            return htmlCode;
        }
    }
}
and the web client class:
public class MyWebClient : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        HttpWebRequest request = base.GetWebRequest(address) as HttpWebRequest;
        request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
        return request;
    }
}

If your foreach loops over 500 companies and creates a new thread for each one, your internet connection can become a bottleneck, requests will exceed the 6-second timeout, and you will fail very often.
I suggest you try parallelism instead. Note MaxDegreeOfParallelism, which sets the maximum number of parallel executions; you can tune it to suit your needs.
Parallel.ForEach(companies, new ParallelOptions { MaxDegreeOfParallelism = 10 }, (company) =>
{
    try
    {
        string htmlCode = GetInformationFromWeb.GetHtmlRequest(company.GetLink());
        company.HtmlCode = htmlCode;
    }
    catch (Exception ex)
    {
        // ignore or process the exception
    }
});

I have four basic suggestions:
Use HttpClient instead of the obsolete WebClient. HttpClient deals with asynchronous operations natively and has far more flexibility to take advantage of. You can even read the downloaded content into strings/streams on a different thread, since you can configure await not to schedule the continuation back, and you can configure the client to abort after 6 seconds and raise a TaskCanceledException if that timeout is exceeded (a minimal sketch follows after this list).
Avoid swallowing exceptions (as you do in your F function): it breaks debugging and obscures the real cause of problems. A correctly written program will never raise an exception during normal operation.
You are using threads in a way that gains you nothing: they do not even overlap, because you join each thread right after starting it, so the calling loop blocks on every iteration. In .NET it is better to do this kind of multitasking with Tasks (for example Task.Run(async delegate() { await yourTask(); }), or AsyncContext.Run(...) if you need UI access), which does not block anything.
The whole GetInformationFromWeb class is pointless as it stands, and spawning a client object per request is pointless too, since one HTTP client object can handle multiple requests. With HttpClient you just instantiate it once as a static field with all the necessary configuration and then call it from anywhere with as little code as client.GetStringAsync(uri).
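To make the first and fourth points concrete, here is a minimal sketch (my own illustration, not the asker's code) of a single static HttpClient with a 6-second timeout; Company, GetLink() and HtmlCode are the members from the question, while DownloadAsync is a made-up helper name:
private static readonly HttpClient Client = new HttpClient
{
    Timeout = TimeSpan.FromSeconds(6) // raises TaskCanceledException when a download takes longer
};

private static async Task DownloadAsync(Company company)
{
    try
    {
        company.HtmlCode = await Client.GetStringAsync(company.GetLink());
    }
    catch (TaskCanceledException)
    {
        Debug.WriteLine(company.Name + " timed out");
    }
}

// every 5 minutes: start all downloads at once and wait for them together
// await Task.WhenAll(companies.Select(DownloadAsync));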
OT: Is this some kind of academic project?

Related

Getting an error “A task was canceled.” when processing a large record set in Parallel.Invoke

I am using Parallel.Invoke to call a large array of Actions on a 4-core machine.
Each action calls an external web API to retrieve a JSON package of info. That JSON package is then deserialized into a series of objects, and each of those objects is inserted into several tables via Entity Framework 6.
This will process around 2 thousand distinct IDs, so I am trying to use the Parallel library to get as much throughput as possible.
My main:
private static void Main(string[] args)
{
    var apiKey = "myKey";
    var tok = new CancellationTokenSource(); // not shown in the original snippet, but referenced below
    List<string> caseIDs = new List<string>();
    //read list of ids from DB
    using (var db = new StagingContext())
    {
        caseIDs = db.BatchList.Where(b => b.CaseID != null).Select(a => a.CaseID).Distinct().Take(5000).ToList();
    }
    List<Action> actions = new List<Action>();
    foreach (var id in caseIDs)
    {
        var UniqueID = Guid.NewGuid();
        actions.Add(() => GetRecords(id, "https://myAPIURL/{0}?api={1}&case={2}", apiKey, UniqueID));
    }
    ParallelOptions op = new ParallelOptions
    {
        CancellationToken = tok.Token,
        MaxDegreeOfParallelism = 10
    };
    Parallel.Invoke(op, actions.ToArray());
    Console.WriteLine("Done");
    Console.ReadKey();
}
My action:
private static void GetRecords(string CaseID, string url, string apiKey, Guid UniqueID)
{
    using (HttpClient client = new HttpClient())
    {
        var tmpUrl = string.Format(url, apiKey, CaseID);
        client.DefaultRequestHeaders.Accept.Add(new MediaTypeWithQualityHeaderValue("application/json"));
        var result = client.GetAsync(tmpUrl).Result;
        var jsonString = result.Content.ReadAsStringAsync();
        jsonString.Wait();
        var myObjectList = new List<MyObject>();
        if (!jsonString.Result.Contains("error"))
        {
            myObjectList.AddRange(JsonConvert.DeserializeObject<List<MyObject>>(jsonString.Result));
            foreach (var item in myObjectList)
            {
                item.UniqueID = UniqueID;
            }
        }
        //Write this out to DB
        using (var db = new StagingContext())
        {
            var myMappedObjectList = myObjectList.Adapt<List<MyObject>>();
            db.CaseAttributeHistories.AddRange(myMappedObjectList);
            using (var scope = new TransactionScope(TransactionScopeOption.Required, new TransactionOptions { IsolationLevel = IsolationLevel.ReadUncommitted }))
            {
                db.SaveChanges();
                scope.Complete();
            }
        }
    }
}
When I process a smaller set of data, ~1000 records, it works pretty well. When I process a larger data set, >1400, I often get an
“A task was canceled.”
error.
I am new to the Parallel library and multi-threading.
Is this a valid approach?
Is there a good way to track down what is causing the cancellation?
How would I handle/ignore the error and continue with the rest of the records?
Is there a better or faster pattern to use in this situation?
First, check for exceptions. Swallowing an exception is a deadly sin of exception handling, and unfortunately multithreading does it fully automatically: normally you have to write code to swallow an exception, but with multithreading you have to write code to avoid it (a small sketch of surfacing the exceptions from Parallel.Invoke follows below). I would advise reading these two articles on exception handling before you try your hand at multithreading:
http://blogs.msdn.com/b/ericlippert/archive/2008/09/10/vexing-exceptions.aspx
http://www.codeproject.com/Articles/9538/Exception-Handling-Best-Practices-in-NET
Secondly, making many single-record calls to a web API is generally a bad idea. Please verify that you do not have a way to retrieve the data in bulk rather than piecemeal; piecemeal retrieval often incurs more overhead than data.
Third, are you even allowed to automate at that scale? If the API provider offers no bulk retrieval, they may not want automation on that scale either. If so, they might notice the sudden increase in load and apply some load throttling later, which could kill your program.
Fourth, multithreading an API call will probably not speed things up. The web API and the network will very probably be the bottleneck. Multithreading only helps with CPU-bound operations; with network, disk, DB and similar operations there is often zero performance increase, or even a performance decrease, as the parallel operations get in each other's way.
A bit of multitasking (even just a single alternate thread) is mandatory with network, disk and similar long-running operations, but actual multithreading rarely if ever helps.
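As a minimal illustration of the first point (my own sketch, reusing op and actions from the question): Parallel.Invoke collects every exception thrown by the actions into one AggregateException, so catching and logging that is the bare minimum before deciding to ignore anything.
try
{
    Parallel.Invoke(op, actions.ToArray());
}
catch (AggregateException ae)
{
    // Each failed action contributes one inner exception; log them instead of losing them.
    foreach (var inner in ae.Flatten().InnerExceptions)
    {
        Console.WriteLine(inner.GetType().Name + ": " + inner.Message);
    }
}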
I bet the exception is being thrown from client.GetAsync?
HttpClient will throw TaskCanceledException when the HTTP call times out. (i.e. the web service is not responding)
Annoying, I know.
It's possible that, because you're hitting it so hard, it can't keep up. You can try raising the Timeout property of your HttpClient, but the default is already 100 seconds.
If you want to just ignore those errors, then wrap the client.GetAsync(tmpUrl) call in a try/catch block and just return (and maybe log it somewhere).
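A rough sketch of that, dropped into the GetRecords method from the question (the timeout value and the logging are illustrative only):
client.Timeout = TimeSpan.FromSeconds(200); // default is 100 seconds

HttpResponseMessage result;
try
{
    result = client.GetAsync(tmpUrl).Result;
}
catch (AggregateException ex)
{
    if (ex.InnerException is TaskCanceledException)
    {
        // The HTTP call timed out; log the case and move on to the next record.
        Console.WriteLine("Timed out: " + CaseID);
        return;
    }
    throw;
}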

.NET HttpClient.PostAsync() slow after 3 requests

I am using the .NET 4.5 HttpClient class to make a POST request to a server a number of times. The first 3 calls run quickly, but the fourth time a call to await client.PostAsync(...) is made, it hangs for several seconds before returning the expected response.
using (HttpClient client = new HttpClient())
{
    // Prepare query
    StringBuilder queryBuilder = new StringBuilder();
    queryBuilder.Append("?arg=value");
    // Send query
    using (var result = await client.PostAsync(BaseUrl + queryBuilder.ToString(),
        new StreamContent(streamData)))
    {
        Stream stream = await result.Content.ReadAsStreamAsync();
        return new MyResult(stream);
    }
}
The server code is shown below:
HttpListener listener;

void Run()
{
    listener.Start();
    ThreadPool.QueueUserWorkItem((o) =>
    {
        while (listener.IsListening)
        {
            ThreadPool.QueueUserWorkItem((c) =>
            {
                var context = c as HttpListenerContext;
                try
                {
                    // Handle request
                }
                finally
                {
                    // Always close the stream
                    context.Response.OutputStream.Close();
                }
            }, listener.GetContext());
        }
    });
}
Inserting a debug statement at // Handle request shows that the server code doesn't seem to receive the request as soon as it is sent.
I have already investigated whether it could be a problem with the client not closing the response, meaning that the number of connections the ServicePoint provider allows could be reached. However, I have tried increasing ServicePointManager.MaxServicePoints but this has no effect at all.
I also found this similar question:
.NET HttpClient hangs after several requests (unless Fiddler is active)
I don't believe this is the problem with my code - even changing my code to exactly what is given there didn't fix the problem.
The problem was that there were too many Task instances scheduled to run.
Changing some of the Task.Factory.StartNew calls in my program, for tasks that ran for a long time, to use the TaskCreationOptions.LongRunning option fixed this. It appears that the task scheduler was waiting for other tasks to finish before it scheduled the request to the server.
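For reference, a minimal sketch of that hint; ProcessLargeBatch is a placeholder for whatever long-running work was being started:
// LongRunning hints the scheduler to give this work its own thread instead of a
// thread-pool worker, so it no longer delays short tasks such as the HTTP requests.
Task longRunner = Task.Factory.StartNew(
    () => ProcessLargeBatch(),          // placeholder for the long-running work
    CancellationToken.None,
    TaskCreationOptions.LongRunning,
    TaskScheduler.Default);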

Async Loop There is no longer an HttpContext available

I have a requirement to process X number of files; we usually receive around 100 files each day. Each is a zip file, so I have to open it, create a stream, and send it to a WebApi service, which is a workflow that in turn calls two more WebApi steps.
I implemented a console application that loops through the files and then calls a wrapper which makes a REST call using HttpWebRequest.GetResponse().
I stress tested the solution with 11K files; the synchronous version takes around 17 minutes to process them all, but I would like to create an async version of it and use await HttpWebRequest.GetResponseAsync().
Here is the Async version:
private async Task<KeyValuePair<HttpStatusCode, string>> REST_CallAsync(
    string httpMethod,
    string url,
    string contentType,
    object bodyMessage = null,
    Dictionary<string, object> headerParameters = null,
    object[] queryStringParamaters = null,
    string requestData = "")
{
    try
    {
        HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create("some url");
        req.Method = "POST";
        req.ContentType = contentType;
        //Adding zip stream to body
        var reqBodyBytes = ReadFully((Stream)bodyMessage);
        req.ContentLength = reqBodyBytes.Length;
        Stream reqStream = req.GetRequestStream();
        reqStream.Write(reqBodyBytes, 0, reqBodyBytes.Length);
        reqStream.Close();
        //Async call
        var resp = await req.GetResponseAsync();
        var httpResponse = (HttpWebResponse)resp;
        var responseData = new StreamReader(resp.GetResponseStream()).ReadToEnd();
        return new KeyValuePair<HttpStatusCode, string>(httpResponse.StatusCode, responseData);
    }
    catch (WebException webEx)
    {
        //something
    }
    catch (Exception ex)
    {
        //something
    }
}
In my console application I have a loop that opens each file and calls the async method (CallServiceAsync under the covers calls the method above):
foreach (var zipFile in Directory.EnumerateFiles(directory))
{
    using (var zipStream = System.IO.File.OpenRead(zipFile))
    {
        await _restFulService.CallServiceAsync<WorkflowResponse>(
            zipStream,
            headerParameters,
            null,
            true);
    }
    processId++;
}
What ended up happening was that only 2K of the 11K files got processed, without any exception being thrown, so I was clueless. I then changed the way I call the async method to:
foreach (var zipFile in Directory.EnumerateFiles(directory))
{
    using (var zipStream = System.IO.File.OpenRead(zipFile))
    {
        tasks.Add(_restFulService.CallServiceAsync<WorkflowResponse>(
            zipStream,
            headerParameters,
            null,
            true));
    }
}
And another loop to await the tasks:
foreach (var task in await System.Threading.Tasks.Task.WhenAll(tasks))
{
    if (task.Value != null)
    {
        Console.WriteLine("Ending Process");
    }
}
And now I am facing a different error: when I process three files, the third one receives:
The client is disconnected because the underlying request has been completed. There is no longer an HttpContext available.
My question is, what am I doing wrong here? I use SimpleInjector as my IoC container; could that be the problem?
Also, when you use WhenAll, does it wait for each task to run? Doesn't that make it synchronous, so that it waits for one task to finish before executing the next one? I am new to this async world, so any help would be much appreciated.
Well, for those who downvoted my question and, instead of providing some kind of solution, just suggested something meaningless, here is the answer, and the reason why specifying as much detail as possible is useful.
The first problem: since I'm using IIS Express, if I'm not running my solution (F5) the web applications are not available; that happened to me sometimes, though not always.
The second problem, and the one giving me a huge headache, is that not all the files got processed. I should have recognized the cause of this issue earlier: the use of async/await in a console application. I forced my console app to work with async by doing:
static void Main(string[] args)
{
    System.Threading.Tasks.Task.Run(() => MainAsync(args)).Wait();
}

static async void MainAsync(string[] args)
{
    //rest of code
}
Then, if you note, in my foreach I had the await keyword, and what was happening is that, conceptually, await sends control flow back to the caller; in this case the OS is the one calling the console app (that is why it doesn't make too much sense to use async/await in a console app this way; I did it because I mistakenly used await when calling an async method).
So the result was that my process only processed some X number of files. What I ended up doing is the following:
Add a list of tasks, the same way I did above:
tasks.Add(_restFulService.CallServiceAsync<WorkflowResponse>(....
And the way to run the threads is (in my console app):
ExecuteAsync(tasks);
Finally my method:
static void ExecuteAsync(List<System.Threading.Tasks.Task<KeyValuePair<HttpStatusCode, WorkflowResponse>>> tasks)
{
    System.Threading.Tasks.Task.WhenAll(tasks).Wait();
}
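For comparison, a hedged sketch of the more conventional shape of this pattern (blocking only once at the entry point and awaiting WhenAll inside an async Task method; directory, headerParameters and _restFulService are the names from the question):
static void Main(string[] args)
{
    MainAsync(args).GetAwaiter().GetResult(); // block exactly once, at the entry point
}

static async Task MainAsync(string[] args)
{
    var tasks = new List<Task<KeyValuePair<HttpStatusCode, WorkflowResponse>>>();
    foreach (var zipFile in Directory.EnumerateFiles(directory))
    {
        // note: each stream should only be disposed after its task has completed
        var zipStream = System.IO.File.OpenRead(zipFile);
        tasks.Add(_restFulService.CallServiceAsync<WorkflowResponse>(zipStream, headerParameters, null, true));
    }
    await Task.WhenAll(tasks); // all calls run concurrently; any exception surfaces here
}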
UPDATE: Based on Scott's feedback, I changed the way I execute my threads.
And now I'm able to process all my files. I tested it: processing 1000 files with my synchronous process took around 160+ seconds to run the whole thing (I have a workflow of three steps to process each file), and when I put my async process in place it took 80+ seconds, so almost half the time. On my production server with IIS I believe the execution time will be even lower.
Hope this helps anyone facing this type of issue.

Multi-threaded async web service call in c# .net 3.5

I have 2 ASP.NET 3.5 ASMX web services, ws2 and ws3. They contain operations op21 and op31 respectively. op21 sleeps for 2 seconds and op31 sleeps for 3 seconds. I want to call both op21 and op31 asynchronously from op11 in a web service, ws1, such that when I call op11 from a client synchronously, the total time taken will be 3 seconds. I currently get 5 seconds with this code:
WS2SoapClient ws2 = new WS2SoapClient();
WS3SoapClient ws3 = new WS3SoapClient();

//capture time
DateTime now = DateTime.Now;

//make calls
IAsyncResult result1 = ws3.BeginOP31(null, null);
IAsyncResult result2 = ws2.BeginOP21(null, null);
WaitHandle[] handles = { result1.AsyncWaitHandle, result2.AsyncWaitHandle };
WaitHandle.WaitAll(handles);

//calculate time difference
TimeSpan ts = DateTime.Now.Subtract(now);
return "Asynchronous Execution Time (h:m:s:ms): " + String.Format("{0}:{1}:{2}:{3}",
    ts.Hours,
    ts.Minutes,
    ts.Seconds,
    ts.Milliseconds);
The expected result is that the total time for both requests should be equal to the time it takes for the slower request to execute.
Note that this works as expected when I debug it with Visual Studio; however, when running on IIS, the time is 5 seconds, which seems to show that the requests are not processed concurrently.
My question is: is there a specific configuration for IIS and ASMX web services that needs to be set up for this to work as expected?
Original Answer:
I tried this with google.com and bing.com and am getting the same thing: linear execution. The problem is that you are starting the BeginOP() calls on the same thread, and the AsyncResult (for whatever reason) is not returned until the call is completed. Kind of useless.
My pre-TPL multi-threading is a bit rusty, but I tested the code at the end of this answer and it executes asynchronously. This is a .NET 3.5 console app; note that I obviously abstracted away some of your code but kept the classes looking the same.
Update:
I started second-guessing myself because my execution times were so close to each other, which was confusing. So I rewrote the test a little to include both your original code and my suggested code using Thread.Start(). Additionally, I added Thread.Sleep(N) in the WebRequest methods so that they simulate vastly different execution times for the requests.
The test results show that the code you posted was executed sequentially, as I stated above in my original answer.
Note the total time is much longer in both cases than the actual web request time because of the Thread.Sleep(). I also added the Thread.Sleep() to offset the fact that the first web request to any site takes a long time to spin up (around 9 seconds). Either way you slice it, it's clear that the times are sequential in the "old" case and truly "asynchronous" in the new case.
The updated program for testing this out:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text;
using System.Threading;
namespace MultiThreadedTest
{
class Program
{
static void Main(string[] args)
{
// Test both ways of executing IAsyncResult web calls
ExecuteUsingWaitHandles();
Console.WriteLine();
ExecuteUsingThreadStart();
Console.ReadKey();
}
private static void ExecuteUsingWaitHandles()
{
Console.WriteLine("Starting to execute using wait handles (old way) ");
WS2SoapClient ws2 = new WS2SoapClient();
WS3SoapClient ws3 = new WS3SoapClient();
IAsyncResult result1 = null;
IAsyncResult result2 = null;
// Time the threads
var stopWatchBoth = System.Diagnostics.Stopwatch.StartNew();
result1 = ws3.BeginOP31();
result2 = ws2.BeginOP21();
WaitHandle[] handles = { result1.AsyncWaitHandle, result2.AsyncWaitHandle };
WaitHandle.WaitAll(handles);
stopWatchBoth.Stop();
// Display execution time of individual calls
Console.WriteLine((result1.AsyncState as StateObject));
Console.WriteLine((result2.AsyncState as StateObject));
// Display time for both calls together
Console.WriteLine("Asynchronous Execution Time for both is {0}", stopWatchBoth.Elapsed.TotalSeconds);
}
private static void ExecuteUsingThreadStart()
{
Console.WriteLine("Starting to execute using thread start (new way) ");
WS2SoapClient ws2 = new WS2SoapClient();
WS3SoapClient ws3 = new WS3SoapClient();
IAsyncResult result1 = null;
IAsyncResult result2 = null;
// Create threads to execute the methods asynchronously
Thread startOp3 = new Thread( () => result1 = ws3.BeginOP31() );
Thread startOp2 = new Thread( () => result2 = ws2.BeginOP21() );
// Time the threads
var stopWatchBoth = System.Diagnostics.Stopwatch.StartNew();
// Start the threads
startOp2.Start();
startOp3.Start();
// Make this thread wait until both of those threads are complete
startOp2.Join();
startOp3.Join();
stopWatchBoth.Stop();
// Display execution time of individual calls
Console.WriteLine((result1.AsyncState as StateObject));
Console.WriteLine((result2.AsyncState as StateObject));
// Display time for both calls together
Console.WriteLine("Asynchronous Execution Time for both is {0}", stopWatchBoth.Elapsed.TotalSeconds);
}
}
// Class representing your WS2 client
internal class WS2SoapClient : TestWebRequestAsyncBase
{
public WS2SoapClient() : base("http://www.msn.com/") { }
public IAsyncResult BeginOP21()
{
Thread.Sleep(TimeSpan.FromSeconds(10D));
return BeginWebRequest();
}
}
// Class representing your WS3 client
internal class WS3SoapClient : TestWebRequestAsyncBase
{
public WS3SoapClient() : base("http://www.google.com/") { }
public IAsyncResult BeginOP31()
{
// Added sleep here to simulate a much longer request, which should make it obvious if the times are overlapping or sequential
Thread.Sleep(TimeSpan.FromSeconds(20D));
return BeginWebRequest();
}
}
// Base class that makes the web request
internal abstract class TestWebRequestAsyncBase
{
public StateObject AsyncStateObject;
protected string UriToCall;
public TestWebRequestAsyncBase(string uri)
{
AsyncStateObject = new StateObject()
{
UriToCall = uri
};
this.UriToCall = uri;
}
protected IAsyncResult BeginWebRequest()
{
WebRequest request =
WebRequest.Create(this.UriToCall);
AsyncCallback callBack = new AsyncCallback(onCompleted);
AsyncStateObject.WebRequest = request;
AsyncStateObject.Stopwatch = System.Diagnostics.Stopwatch.StartNew();
return request.BeginGetResponse(callBack, AsyncStateObject);
}
void onCompleted(IAsyncResult result)
{
this.AsyncStateObject = (StateObject)result.AsyncState;
this.AsyncStateObject.Stopwatch.Stop();
var webResponse = this.AsyncStateObject.WebRequest.EndGetResponse(result);
Console.WriteLine(webResponse.ContentType, webResponse.ResponseUri);
}
}
// Keep stopwatch on state object for illustration of individual execution time
internal class StateObject
{
public System.Diagnostics.Stopwatch Stopwatch { get; set; }
public WebRequest WebRequest { get; set; }
public string UriToCall;
public override string ToString()
{
return string.Format("Request to {0} executed in {1} seconds", this.UriToCall, Stopwatch.Elapsed.TotalSeconds);
}
}
}
There is some throttling in your system. Probably the service is configured for only one concurrent caller, which is a common reason (WCF ConcurrencyMode). There might also be HTTP-level connection limits (ServicePointManager.DefaultConnectionLimit) or WCF throttling settings on the server.
Use Fiddler to determine whether both requests are being sent simultaneously, and use the debugger to break on the server and see whether both calls are running at the same time.
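If one of those limits turns out to be the cause, here is a minimal sketch of the two knobs mentioned above (the value 10 and the type names WS1/IWS1 are illustrative only):
// Client side: allow more than the default two concurrent HTTP connections per host.
ServicePointManager.DefaultConnectionLimit = 10;

// Server side (WCF): allow the service instance to process calls concurrently.
[ServiceBehavior(ConcurrencyMode = ConcurrencyMode.Multiple)]
public class WS1 : IWS1
{
    // ...
}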

Mass Downloading of Webpages C#

My application requires that I download a large amount of webpages into memory for further parsing and processing. What is the fastest way to do it? My current method (shown below) seems to be too slow and occasionally results in timeouts.
for (int i = 1; i <= pages; i++)
{
    string page_specific_link = baseurl + "&page=" + i.ToString();
    try
    {
        WebClient client = new WebClient();
        var pagesource = client.DownloadString(page_specific_link);
        client.Dispose();
        sourcelist.Add(pagesource);
    }
    catch (Exception)
    {
    }
}
The way you approach this problem is going to depend very much on how many pages you want to download, and how many sites you're referencing.
I'll use a good round number like 1,000. If you want to download that many pages from a single site, it's going to take a lot longer than if you want to download 1,000 pages that are spread out across dozens or hundreds of sites. The reason is that if you hit a single site with a whole bunch of concurrent requests, you'll probably end up getting blocked.
So you have to implement a type of "politeness policy," that issues a delay between multiple requests on a single site. The length of that delay depends on a number of things. If the site's robots.txt file has a crawl-delay entry, you should respect that. If they don't want you accessing more than one page per minute, then that's as fast as you should crawl. If there's no crawl-delay, you should base your delay on how long it takes a site to respond. For example, if you can download a page from the site in 500 milliseconds, you set your delay to X. If it takes a full second, set your delay to 2X. You can probably cap your delay to 60 seconds (unless crawl-delay is longer), and I would recommend that you set a minimum delay of 5 to 10 seconds.
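A small sketch of that delay rule as I read it (the 2x multiplier, the 5-second floor and the 60-second cap are the figures suggested above; treat them as tunable):
// Pick a per-site delay from the robots.txt crawl-delay (if present) and the observed response time.
static TimeSpan GetPolitenessDelay(TimeSpan? crawlDelay, TimeSpan lastResponseTime)
{
    if (crawlDelay.HasValue)
        return crawlDelay.Value;                          // robots.txt always wins

    double seconds = lastResponseTime.TotalSeconds * 2;   // roughly 2x the site's response time
    seconds = Math.Max(seconds, 5);                       // never hit a site faster than every 5 seconds
    seconds = Math.Min(seconds, 60);                      // and cap the wait at 60 seconds
    return TimeSpan.FromSeconds(seconds);
}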
I wouldn't recommend using Parallel.ForEach for this. My testing has shown that it doesn't do a good job. Sometimes it over-taxes the connection and often it doesn't allow enough concurrent connections. I would instead create a queue of WebClient instances and then write something like:
// Create queue of WebClient instances
BlockingCollection<WebClient> ClientQueue = new BlockingCollection<WebClient>();
// Initialize queue with some number of WebClient instances

// now process urls
foreach (var url in urls_to_download)
{
    var worker = ClientQueue.Take();
    worker.DownloadStringAsync(url, ...);
}
When you initialize the WebClient instances that go into the queue, set their DownloadStringCompleted event handlers to point to a completed event handler. That handler should save the string to a file (or perhaps you should just use DownloadFileAsync), and then add the client back to the ClientQueue (a sketch follows below).
In my testing, I've been able to support 10 to 15 concurrent connections with this method. Any more than that and I run into problems with DNS resolution (DownloadStringAsync doesn't do the DNS resolution asynchronously). You can get more connections, but doing so is a lot of work.
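A hedged sketch of that completed handler (ClientQueue is the collection from the snippet above; MakeFileNameFor is a made-up placeholder for your own naming scheme):
// Wired up once per WebClient when the queue is initialized:
// client.DownloadStringCompleted += OnDownloadCompleted;
void OnDownloadCompleted(object sender, DownloadStringCompletedEventArgs e)
{
    var client = (WebClient)sender;
    if (e.Error == null)
    {
        // e.UserState is whatever was passed as the second argument to DownloadStringAsync(url, token).
        var url = (Uri)e.UserState;
        File.WriteAllText(MakeFileNameFor(url), e.Result); // MakeFileNameFor is a placeholder helper
    }
    ClientQueue.Add(client); // hand the client back so the dispatch loop can reuse it
}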
That's the approach I've taken in the past, and it's worked very well for downloading thousands of pages quickly. It's definitely not the approach I took with my high performance Web crawler, though.
I should also note that there is a huge difference in resource usage between these two blocks of code:
WebClient MyWebClient = new WebClient();
foreach (var url in urls_to_download)
{
MyWebClient.DownloadString(url);
}
---------------
foreach (var url in urls_to_download)
{
WebClient MyWebClient = new WebClient();
MyWebClient.DownloadString(url);
}
The first allocates a single WebClient instance that is used for all requests. The second allocates one WebClient for each request. The difference is huge. WebClient uses a lot of system resources, and allocating thousands of them in a relatively short time is going to impact performance. Believe me ... I've run into this. You're better off allocating just 10 or 20 WebClients (as many as you need for concurrent processing), rather than allocating one per request.
Why not just use a web crawling framework? It can handle all of this for you (multithreading, HTTP requests, parsing links, scheduling, politeness, etc.).
Abot (https://code.google.com/p/abot/) handles all of that for you and is written in C#.
In addition to @David's perfectly valid answer, I want to add a slightly cleaner "version" of his approach.
var pages = new List<string> { "http://bing.com", "http://stackoverflow.com" };
var sources = new BlockingCollection<string>();

Parallel.ForEach(pages, x =>
{
    using (var client = new WebClient())
    {
        var pagesource = client.DownloadString(x);
        sources.Add(pagesource);
    }
});
Yet another approach, which uses the asynchronous (event-based) API:
static IEnumerable<string> GetSources(List<string> pages)
{
    var sources = new BlockingCollection<string>();
    var latch = new CountdownEvent(pages.Count);
    foreach (var p in pages)
    {
        var wc = new WebClient();
        wc.DownloadStringCompleted += (x, e) =>
        {
            sources.Add(e.Result);
            latch.Signal();
            ((WebClient)x).Dispose(); // dispose here, once the download has actually finished
        };
        wc.DownloadStringAsync(new Uri(p));
    }
    latch.Wait();
    return sources;
}
You should use parallel programming for this purpose.
There are a lot of ways to achieve what you want; the easiest would be something like this:
var pageList = new List<string>();
for (int i = 1; i <= pages; i++)
{
    pageList.Add(baseurl + "&page=" + i.ToString());
}

// pageList is a list of urls
Parallel.ForEach<string>(pageList, (page) =>
{
    try
    {
        WebClient client = new WebClient();
        var pagesource = client.DownloadString(page);
        client.Dispose();
        lock (sourcelist)
            sourcelist.Add(pagesource);
    }
    catch (Exception) { }
});
I had a similar case, and this is how I solved it:
using System;
using System.Threading;
using System.Collections.Generic;
using System.Net;
using System.IO;

namespace WebClientApp
{
    class MainClassApp
    {
        private static int requests = 0;
        private static object requests_lock = new object();

        public static void Main()
        {
            List<string> urls = new List<string> { "http://www.google.com", "http://www.slashdot.org" };
            foreach (var url in urls)
            {
                ThreadPool.QueueUserWorkItem(GetUrl, url);
            }

            int cur_req = 0;
            while (cur_req < urls.Count)
            {
                lock (requests_lock)
                {
                    cur_req = requests;
                }
                Thread.Sleep(1000);
            }
            Console.WriteLine("Done");
        }

        private static void GetUrl(Object the_url)
        {
            string url = (string)the_url;
            WebClient client = new WebClient();
            Stream data = client.OpenRead(url);
            StreamReader reader = new StreamReader(data);
            string html = reader.ReadToEnd();
            /// Do something with html
            Console.WriteLine(html);
            lock (requests_lock)
            {
                //Maybe you could add the HTML to SourceList here
                requests++;
            }
        }
    }
}
You should think about using parallelism here, because the slow speed comes from your software waiting for I/O; while one thread is waiting for I/O, another one can get started.
While the other answers are perfectly valid, all of them (at the time of this writing) are neglecting something very important: calls to the web are I/O bound, and having a thread block while waiting on such an operation strains system resources and hurts performance.
What you really want to do is take advantage of the async methods on the WebClient class (as some have pointed out) as well as the Task Parallel Library's ability to handle the Event-based Asynchronous Pattern.
First, you would get the urls that you want to download:
IEnumerable<Uri> urls = pages.Select(i => new Uri(baseurl +
"&page=" + i.ToString(CultureInfo.InvariantCulture)));
Then, you would create a new WebClient instance for each url, using the TaskCompletionSource<T> class to handle the calls asynchronously (this won't burn a thread):
IEnumerable<Task<Tuple<Uri, string>>> tasks = urls.Select(url => {
    // Create the task completion source.
    var tcs = new TaskCompletionSource<Tuple<Uri, string>>();

    // The web client.
    var wc = new WebClient();

    // Attach to the DownloadStringCompleted event.
    wc.DownloadStringCompleted += (s, e) => {
        // Dispose of the client when done.
        using (wc)
        {
            // If there is an error, set it.
            if (e.Error != null)
            {
                tcs.SetException(e.Error);
            }
            // Otherwise, set cancelled if cancelled.
            else if (e.Cancelled)
            {
                tcs.SetCanceled();
            }
            else
            {
                // Set the result.
                tcs.SetResult(new Tuple<Uri, string>(url, e.Result));
            }
        }
    };

    // Start the process asynchronously, don't burn a thread.
    wc.DownloadStringAsync(url);

    // Return the task.
    return tcs.Task;
});
Now you have an IEnumerable<T> which you can convert to an array and wait on all of the results using Task.WaitAll:
// Materialize the tasks.
Task<Tuple<Uri, string>>[] materializedTasks = tasks.ToArray();

// Wait for all to complete.
Task.WaitAll(materializedTasks);
Then, you can just use Result property on the Task<T> instances to get the pair of the url and the content:
// Cycle through each of the results.
foreach (Tuple<Uri, string> pair in materializedTasks.Select(t => t.Result))
{
    // pair.Item1 will contain the Uri.
    // pair.Item2 will contain the content.
}
Note that the above code has the caveat of not having any error handling.
If you wanted to get even more throughput, instead of waiting for the entire list to be finished, you could process the content of a single page after it's done downloading; Task<T> is meant to be used like a pipeline: when you've completed your unit of work, have it continue to the next one instead of waiting for all of the items to be done (if they can be done in an asynchronous manner).
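A minimal sketch of that pipelining idea, building on the tasks sequence from above (ProcessPage is a placeholder for whatever parsing or processing you do per page):
// Attach a continuation to each download so processing starts as soon as that page arrives,
// instead of waiting for the whole batch to finish.
Task[] pipeline = tasks
    .Select(t => t.ContinueWith(
        downloaded => ProcessPage(downloaded.Result.Item1, downloaded.Result.Item2),
        TaskContinuationOptions.OnlyOnRanToCompletion))
    .ToArray();

// Note: continuations whose download faulted end up cancelled, which WaitAll will surface.
Task.WaitAll(pipeline);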
I am using an active thread count and an arbitrary limit:
private static volatile int activeThreads = 0;

public static void RecordData()
{
    var nbThreads = 10;
    var source = db.ListOfUrls; // Thousands of urls
    var iterations = source.Length / groupSize; // groupSize is assumed to be defined elsewhere
    for (int i = 0; i < iterations; i++)
    {
        var subList = source.Skip(groupSize * i).Take(groupSize);
        Parallel.ForEach(subList, (item) => RecordUri(item));
        //I want to wait here before processing further data, to avoid overload
        while (activeThreads > 30) Thread.Sleep(100);
    }
}

private static async Task RecordUri(Uri uri)
{
    using (WebClient wc = new WebClient())
    {
        Interlocked.Increment(ref activeThreads);
        wc.DownloadStringCompleted += (sender, e) => Interlocked.Decrement(ref activeThreads);
        var jsonData = await wc.DownloadStringTaskAsync(uri);
        var root = JsonConvert.DeserializeObject<RootObject>(jsonData);
        RecordData(root);
    }
}