Mass Downloading of Webpages in C#

My application requires that I download a large amount of webpages into memory for further parsing and processing. What is the fastest way to do it? My current method (shown below) seems to be too slow and occasionally results in timeouts.
for (int i = 1; i <= pages; i++)
{
    string page_specific_link = baseurl + "&page=" + i.ToString();

    try
    {
        WebClient client = new WebClient();
        var pagesource = client.DownloadString(page_specific_link);
        client.Dispose();
        sourcelist.Add(pagesource);
    }
    catch (Exception)
    {
    }
}

The way you approach this problem is going to depend very much on how many pages you want to download, and how many sites you're referencing.
I'll use a good round number like 1,000. If you want to download that many pages from a single site, it's going to take a lot longer than if you want to download 1,000 pages that are spread out across dozens or hundreds of sites. The reason is that if you hit a single site with a whole bunch of concurrent requests, you'll probably end up getting blocked.
So you have to implement a type of "politeness policy," that issues a delay between multiple requests on a single site. The length of that delay depends on a number of things. If the site's robots.txt file has a crawl-delay entry, you should respect that. If they don't want you accessing more than one page per minute, then that's as fast as you should crawl. If there's no crawl-delay, you should base your delay on how long it takes a site to respond. For example, if you can download a page from the site in 500 milliseconds, you set your delay to X. If it takes a full second, set your delay to 2X. You can probably cap your delay to 60 seconds (unless crawl-delay is longer), and I would recommend that you set a minimum delay of 5 to 10 seconds.
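As a rough sketch of that delay calculation (the 10x multiplier is an assumption on my part, since the text only says the delay should grow with the response time; the 5-second floor and 60-second cap are the values suggested above, and reading crawl-delay out of robots.txt is left as a separate step):
// Sketch only: computes the per-site delay from the last observed response time,
// honoring a robots.txt crawl-delay when one exists.
static TimeSpan ComputeDelay(TimeSpan lastResponseTime, TimeSpan? crawlDelay)
{
    if (crawlDelay.HasValue)
        return crawlDelay.Value;                                             // robots.txt always wins

    // Assumed scaling factor: 10x the observed response time.
    var delay = TimeSpan.FromMilliseconds(lastResponseTime.TotalMilliseconds * 10);

    if (delay < TimeSpan.FromSeconds(5))  delay = TimeSpan.FromSeconds(5);   // suggested minimum
    if (delay > TimeSpan.FromSeconds(60)) delay = TimeSpan.FromSeconds(60);  // suggested cap
    return delay;
}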
I wouldn't recommend using Parallel.ForEach for this. My testing has shown that it doesn't do a good job. Sometimes it over-taxes the connection and often it doesn't allow enough concurrent connections. I would instead create a queue of WebClient instances and then write something like:
// Create a queue of WebClient instances
BlockingCollection<WebClient> ClientQueue = new BlockingCollection<WebClient>();

// Initialize the queue with some number of WebClient instances

// Now process the urls
foreach (var url in urls_to_download)
{
    var worker = ClientQueue.Take();
    worker.DownloadStringAsync(new Uri(url), url);   // pass the url along as the user token
}
When you initialize the WebClient instances that go into the queue, set their DownloadStringCompleted event handlers to point to a completed event handler. That handler should save the string to a file (or perhaps you should just use DownloadFileAsync), and then the client adds itself back to the ClientQueue.
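A rough sketch of such a handler might look like this; the file-naming helper is hypothetical, and the URL comes back through the user token passed to DownloadStringAsync above:
void OnDownloadStringCompleted(object sender, DownloadStringCompletedEventArgs e)
{
    var client = (WebClient)sender;
    if (e.Error == null && !e.Cancelled)
    {
        var url = (string)e.UserState;                      // the token passed to DownloadStringAsync
        File.WriteAllText(MakeFileNameFor(url), e.Result);  // MakeFileNameFor is a hypothetical helper
    }
    ClientQueue.Add(client);                                // make this client available for the next URL
}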
In my testing, I've been able to support 10 to 15 concurrent connections with this method. Any more than that and I run into problems with DNS resolution (`DownloadStringAsync` doesn't do the DNS resolution asynchronously). You can get more connections, but doing so is a lot of work.
That's the approach I've taken in the past, and it's worked very well for downloading thousands of pages quickly. It's definitely not the approach I took with my high performance Web crawler, though.
I should also note that there is a huge difference in resource usage between these two blocks of code:
WebClient MyWebClient = new WebClient();
foreach (var url in urls_to_download)
{
    MyWebClient.DownloadString(url);
}

---------------

foreach (var url in urls_to_download)
{
    WebClient MyWebClient = new WebClient();
    MyWebClient.DownloadString(url);
}
The first allocates a single WebClient instance that is used for all requests. The second allocates one WebClient for each request. The difference is huge. WebClient uses a lot of system resources, and allocating thousands of them in a relatively short time is going to impact performance. Believe me ... I've run into this. You're better off allocating just 10 or 20 WebClients (as many as you need for concurrent processing), rather than allocating one per request.
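To tie this back to the queue shown earlier, pre-filling ClientQueue might look something like this (10 is just a value in the 10-to-20 range mentioned above, and OnDownloadStringCompleted is the handler sketched earlier):
// Allocate a small, fixed pool of clients up front and reuse them for every request.
for (int i = 0; i < 10; i++)
{
    var client = new WebClient();
    client.DownloadStringCompleted += OnDownloadStringCompleted;
    ClientQueue.Add(client);
}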

Why not just use a web crawling framework? It can handle all of that for you (multithreading, HTTP requests, parsing links, scheduling, politeness, etc.).
Abot (https://code.google.com/p/abot/) handles all that stuff for you and is written in c#.

In addition to @David's perfectly valid answer, I want to add a slightly cleaner "version" of his approach.
var pages = new List<string> { "http://bing.com", "http://stackoverflow.com" };
var sources = new BlockingCollection<string>();

Parallel.ForEach(pages, x =>
{
    using (var client = new WebClient())
    {
        var pagesource = client.DownloadString(x);
        sources.Add(pagesource);
    }
});
Yet another approach, that uses async:
static IEnumerable<string> GetSources(List<string> pages)
{
    var sources = new BlockingCollection<string>();
    var latch = new CountdownEvent(pages.Count);

    foreach (var p in pages)
    {
        var wc = new WebClient();
        wc.DownloadStringCompleted += (x, e) =>
        {
            // Dispose of the client only after its download has completed.
            using (wc)
            {
                sources.Add(e.Result);
                latch.Signal();
            }
        };
        wc.DownloadStringAsync(new Uri(p));
    }

    latch.Wait();
    return sources;
}

You should use parallel programming for this purpose.
There are a lot of ways to achieve what you want; the easiest would be something like this:
var pageList = new List<string>();

for (int i = 1; i <= pages; i++)
{
    pageList.Add(baseurl + "&page=" + i.ToString());
}

// pageList is a list of urls
Parallel.ForEach<string>(pageList, (page) =>
{
    try
    {
        WebClient client = new WebClient();
        var pagesource = client.DownloadString(page);
        client.Dispose();
        lock (sourcelist)
            sourcelist.Add(pagesource);
    }
    catch (Exception) { }
});

I had a similar case, and this is how I solved it:
using System;
using System.Threading;
using System.Collections.Generic;
using System.Net;
using System.IO;

namespace WebClientApp
{
    class MainClassApp
    {
        private static int requests = 0;
        private static object requests_lock = new object();

        public static void Main()
        {
            List<string> urls = new List<string> { "http://www.google.com", "http://www.slashdot.org" };
            foreach (var url in urls)
            {
                ThreadPool.QueueUserWorkItem(GetUrl, url);
            }

            int cur_req = 0;
            while (cur_req < urls.Count)
            {
                lock (requests_lock)
                {
                    cur_req = requests;
                }
                Thread.Sleep(1000);
            }
            Console.WriteLine("Done");
        }

        private static void GetUrl(Object the_url)
        {
            string url = (string)the_url;
            WebClient client = new WebClient();
            Stream data = client.OpenRead(url);
            StreamReader reader = new StreamReader(data);
            string html = reader.ReadToEnd();

            // Do something with html
            Console.WriteLine(html);

            lock (requests_lock)
            {
                // Maybe you could add the HTML to SourceList here
                requests++;
            }
        }
    }
}
You should think about using parallelism, because the slow speed comes from your software waiting for I/O; while one thread is waiting for I/O, another one can get started.

While the other answers are perfectly valid, all of them (at the time of this writing) are neglecting something very important: calls to the web are IO bound, and having a thread block while waiting on an operation like this is going to strain system resources and have an impact on performance.
What you really want to do is take advantage of the async methods on the WebClient class (as some have pointed out) as well as the Task Parallel Library's ability to handle the Event-Based Asynchronous Pattern.
First, you would get the urls that you want to download:
IEnumerable<Uri> urls = Enumerable.Range(1, pages).Select(i => new Uri(baseurl +
    "&page=" + i.ToString(CultureInfo.InvariantCulture)));
Then, you would create a new WebClient instance for each url, using the TaskCompletionSource<T> class to handle the calls asynchronously (this won't burn a thread):
IEnumerable<Task<Tuple<Uri, string>>> tasks = urls.Select(url => {
    // Create the task completion source.
    var tcs = new TaskCompletionSource<Tuple<Uri, string>>();

    // The web client.
    var wc = new WebClient();

    // Attach to the DownloadStringCompleted event.
    wc.DownloadStringCompleted += (s, e) => {
        // Dispose of the client when done.
        using (wc)
        {
            // If there is an error, set it.
            if (e.Error != null)
            {
                tcs.SetException(e.Error);
            }
            // Otherwise, set cancelled if cancelled.
            else if (e.Cancelled)
            {
                tcs.SetCanceled();
            }
            else
            {
                // Set the result.
                tcs.SetResult(new Tuple<Uri, string>(url, e.Result));
            }
        }
    };

    // Start the process asynchronously, don't burn a thread.
    wc.DownloadStringAsync(url);

    // Return the task.
    return tcs.Task;
});
Now you have an IEnumerable<T> which you can convert to an array and wait on all of the results using Task.WaitAll:
// Materialize the tasks.
Task<Tuple<Uri, string>>[] materializedTasks = tasks.ToArray();

// Wait for all to complete.
Task.WaitAll(materializedTasks);

Then, you can just use the Result property on the Task<T> instances to get the pair of the url and the content:

// Cycle through each of the results.
foreach (Tuple<Uri, string> pair in materializedTasks.Select(t => t.Result))
{
    // pair.Item1 will contain the Uri.
    // pair.Item2 will contain the content.
}
Note that the above code has the caveat of not having any error handling.
If you wanted to get even more throughput, instead of waiting for the entire list to be finished, you could process the content of a single page after it's done downloading; Task<T> is meant to be used like a pipeline, when you've completed your unit of work, have it continue to the next one instead of waiting for all of the items to be done (if they can be done in an asynchronous manner).
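As a rough sketch of that pipeline idea against the code above (ParsePage is a hypothetical placeholder for whatever per-page processing you need):
// Attach a continuation to each download task so a page is processed as soon as it
// finishes, instead of waiting for the whole batch to complete.
Task[] pipeline = materializedTasks
    .Select(t => t.ContinueWith(
        completed => ParsePage(completed.Result.Item1, completed.Result.Item2),
        TaskContinuationOptions.OnlyOnRanToCompletion))
    .ToArray();

Task.WaitAll(pipeline);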

I am using an active thread count and an arbitrary limit:
private static volatile int activeThreads = 0;

public static void RecordData()
{
    var groupSize = 10;
    var source = db.ListOfUrls; // Thousands of urls
    var iterations = source.Length / groupSize;

    for (int i = 0; i < iterations; i++)
    {
        var subList = source.Skip(groupSize * i).Take(groupSize);
        Parallel.ForEach(subList, (item) => RecordUri(item));

        // Wait here before processing further data, to avoid overload
        while (activeThreads > 30) Thread.Sleep(100);
    }
}

private static async Task RecordUri(Uri uri)
{
    using (WebClient wc = new WebClient())
    {
        Interlocked.Increment(ref activeThreads);
        wc.DownloadStringCompleted += (sender, e) => Interlocked.Decrement(ref activeThreads);

        var jsonData = await wc.DownloadStringTaskAsync(uri);
        var root = JsonConvert.DeserializeObject<RootObject>(jsonData);
        RecordData(root);
    }
}

Related

Images take a very long time to load in C#

The problem I have is:
I tried to download 1000+ images -> it works, but it takes a very long time for a downloaded image to finish loading, and the program continues and downloads the next image, etc... until, let's say, the 100th, while the 8th image has still not finished downloading.
So I would like to understand why I encounter such a problem here and/or how to fix it.
Hope someone can see the issue.
private string DownloadSourceCode(string url)
{
string sourceCode = "";
try
{
using (WebClient WC = new WebClient())
{
WC.Encoding = Encoding.UTF8;
WC.Headers.Add("Accept", "image / webp, */*");
WC.Headers.Add("Accept-Language", "fr, fr - FR");
WC.Headers.Add("Cache-Control", "max-age=1");
WC.Headers.Add("DNT", "1");
WC.Headers.Add("Origin", url);
WC.Headers.Add("TE", "Trailers");
WC.Headers.Add("user-agent", Fichier.LoadUserAgent());
sourceCode = WC.DownloadString(url);
}
}
catch (WebException e)
{
if (e.Status == WebExceptionStatus.ProtocolError)
{
string status = string.Format("{0}", ((HttpWebResponse)e.Response).StatusCode);
LabelID.TextInvoke(string.Format("{0} {1} {2} ", status,
((HttpWebResponse)e.Response).StatusDescription,
((HttpWebResponse)e.Response).Server));
}
}
catch (NotSupportedException a)
{
MessageBox.Show(a.Message);
}
return sourceCode;
}
private void DownloadImage(string URL, string filePath)
{
try
{
using (WebClient WC = new WebClient())
{
WC.Encoding = Encoding.UTF8;
WC.Headers.Add("Accept", "image / webp, */*");
WC.Headers.Add("Accept-Language", "fr, fr - FR");
WC.Headers.Add("Cache-Control", "max-age=1");
WC.Headers.Add("DNT", "1");
WC.Headers.Add("Origin", "https://myprivatesite.fr//" + STARTNBR.ToString());
WC.Headers.Add("user-agent", Fichier.LoadUserAgent());
WC.DownloadFile(URL, filePath);
NBRIMAGESDWLD++;
}
STARTNBR = CheckBoxBack.Checked ? --STARTNBR : ++STARTNBR;
}
catch (IOException)
{
LabelID.TextInvoke("Accès non autorisé au fichier");
}
catch (WebException e)
{
if (e.Status == WebExceptionStatus.ProtocolError)
{
LabelID.TextInvoke(string.Format("{0} / {1} / {2} ", ((HttpWebResponse)e.Response).StatusCode,
((HttpWebResponse)e.Response).StatusDescription,
((HttpWebResponse)e.Response).Server));
}
}
catch (NotSupportedException a)
{
MessageBox.Show(a.Message);
}
}
private void DownloadImages()
{
const string URL = "https://myprivatesite.fr/";
string imageIDURL = string.Concat(URL, STARTNBR);
string sourceCode = DownloadSourceCode(imageIDURL);
if (sourceCode != string.Empty)
{
string imageNameURL = Fichier.GetURLImage(sourceCode);
if (imageNameURL != string.Empty)
{
string imagePath = PATHIMAGES + STARTNBR + ".png";
LabelID.TextInvoke(STARTNBR.ToString());
LabelImageURL.TextInvoke(imageNameURL + "\r");
DownloadImage(imageNameURL, imagePath);
Extension.SaveOptions(STARTNBR, CheckBoxBack.Checked);
}
}
STARTNBR = CheckBoxBack.Checked ? --STARTNBR : ++STARTNBR;
}
// END FUNCTIONS
private void BoutonStartPause_Click(object sender, EventArgs e)
{
if (Fichier.RGBIMAGES != null)
{
if (boutonStartPause.Text == "Start")
{
boutonStartPause.ForeColor = Color.DarkRed;
boutonStartPause.Text = "Pause";
if (myTimer == null)
myTimer = new System.Threading.Timer(_ => new Task(DownloadImages).Start(), null, 0, Trackbar.Value);
}
else if (boutonStartPause.Text == "Pause")
EndTimer();
Extension.SaveOptions(STARTNBR, CheckBoxBack.Checked);
}
}
So I would like to understand why I encounter such a problem here and / or how to fix this problem.
There are probably two reasons I can think of.
Connection/Port Exhaustion
Thread Pool Exhaustion
Connection/Port Exhaustion
This happens when you're attempting to create too many connections at once, or when the connections you made previously have not yet been released. When you use a WebClient the resources it uses sometimes don't get released immediately. This causes a delay between when that object is disposed and the actual time that the next WebClient attempting to use the same port/connection actually gets access to that port.
An example of something that would most likely cause Connection/Port Exhaustion:
int i = 1_000;
while (i --> 0)
{
    using var Client = new WebClient();
    // do some webclient stuff
}
When you create a lot of web clients, which is sometimes necessary due to the inherent lack of concurrency in WebClient, there's a possibility that by the time the next WebClient is instantiated, the port that the last one was using may not be available yet, causing either a delay (while it waits for the port) or, worse, the next WebClient opening another port/connection. This can cause a never-ending list of connections to open, causing things to grind to a halt!
Thread Pool Exhaustion
This is caused by trying to create too many Task or Thread objects at once that block their own execution (via Thread.Sleep or a long-running operation).
Normally this isn't an issue since the built in TaskScheduler does a really good job of keeping track of a lot of tasks and makes sure that they all get turns to execute their code.
Where this becomes a problem is that the TaskScheduler has no context for which tasks are important, or which tasks are going to need more time than others to complete. Therefore, when many tasks are processing long-running operations, blocking, or throwing exceptions, the TaskScheduler has to wait for those tasks to finish before it can start new ones. If you are particularly unlucky, the TaskScheduler can start a bunch of tasks that are all blocking, and no tasks can start, even if all the other tasks waiting are small and would complete instantly.
You should generally use as few tasks as possible to increase reliability and avoid thread pool exhaustion.
What you can do
You have a few options to help improve the reliability and performance of this code.
Consider using HttpClient instead. I understand you may be required to use WebClient so I have provided answers using WebClient exclusively.
Consider Requesting multiple downloads/strings within the same task to avoid Thread Pool Exhaustion
Consider using a WebClient helper class that limits the available webclients that can be active at once, and has the ability to keep webclients open if you're going to be accessing the same website multiple times.
WebClient Helper Class
I created a very simple helper class to get you started. This will allow you to create WebClient requests asynchronously without having to worry about creating too many clients at once. The default limit is the number of cores in the client's processor (this was chosen arbitrarily).
public class ConcurrentWebClient
{
// limits the number of maximum clients able to be opened at once
public static int MaxConcurrentDownloads => Environment.ProcessorCount;
// holds any clients that should be kept open
private static readonly ConcurrentDictionary<string, WebClient> Clients;
// prevents more than the alloted webclients to be open at once
public static readonly SemaphoreSlim Locker;
// allows cancellation of clients
private static CancellationTokenSource TokenSource = new();
static ConcurrentWebClient()
{
Clients = new ConcurrentDictionary<string, WebClient>();
Locker ??= new SemaphoreSlim(MaxConcurrentDownloads, MaxConcurrentDownloads);
}
// creates new clients, or if a name is provided retrieves it from the dictionary so we don't need to create more than we need
private async Task<WebClient> CreateClient(string Name, bool persistent, CancellationToken token)
{
// try to retrieve it from the dictionary before creating a new one
if (Clients.ContainsKey(Name))
{
return Clients[Name];
}
WebClient newClient = new();
if (persistent)
{
// try to add the client to the dict so we can reference it later
while (Clients.TryAdd(Name, newClient) is false)
{
token.ThrowIfCancellationRequested();
// allow other tasks to do work while we wait to add the new client
await Task.Delay(1, token);
}
}
return newClient;
}
// allows sending basic dynamic requests without having to create webclients outside of this class
public async Task<T> NewRequest<T>(Func<WebClient, T> Expression, int? MaxTimeout = null, string Id = null)
{
// make sure we don't have more than the maximum clients open at one time
// 100s was chosen because WebClient has a default timeout of 100s
await Locker.WaitAsync(MaxTimeout ?? 100_000, TokenSource.Token);
bool persistent = true;
if (Id is null)
{
persistent = false;
Id = string.Empty;
}
try
{
WebClient client = await CreateClient(Id, persistent, TokenSource.Token);
// run the expression to get the result
T result = await Task.Run<T>(() => Expression(client), TokenSource.Token);
if (persistent is false)
{
// just in case the user disposes of the client or sets it to null in the expression, we should not assume it's non-null at this point
client?.Dispose();
}
return result;
}
finally
{
// make sure even if we encounter an error we still
// release the lock
Locker.Release();
}
}
// allows assigning the headers without having to do it for every webclient manually
public static void AssignDefaultHeaders(WebClient client)
{
client.Encoding = System.Text.Encoding.UTF8;
client.Headers.Add("Accept", "image / webp, */*");
client.Headers.Add("Accept-Language", "fr, fr - FR");
client.Headers.Add("Cache-Control", "max-age=1");
client.Headers.Add("DNT", "1");
// I have no clue what Fichier is, so this was not tested
client.Headers.Add("user-agent", Fichier.LoadUserAgent());
}
// cancels a webclient by name, whether its being used or not
public async Task Cancel(string Name)
{
// look to see if we can find the client
if (Clients.ContainsKey(Name))
{
// get a token in case we have to emergency cancel
CancellationToken token = TokenSource.Token;
// try to get the client from the dictionary
WebClient foundClient = null;
while (Clients.TryGetValue(Name, out foundClient) is false)
{
token.ThrowIfCancellationRequested();
// allow other tasks to perform work while we wait to get the value from the dictionary
await Task.Delay(1, token);
}
// if we found the client we should cancel and dispose of it so its resources get freed
if (foundClient != null)
{
foundClient?.CancelAsync();
foundClient?.Dispose();
}
}
}
// the emergency stop button
public void ForceCancelAll()
{
// this will throw lots of OperationCancelledException, be prepared to catch them, they're fast.
TokenSource?.Cancel();
TokenSource?.Dispose();
TokenSource = new();
foreach (var item in Clients)
{
item.Value?.CancelAsync();
item.Value?.Dispose();
}
Clients.Clear();
}
}
Request Multiple Things at Once
Here all I did was switch to using the helper class and make it so you can request multiple things using the same connection.
public async Task<string[]> DownloadSourceCode(string[] urls)
{
var downloader = new ConcurrentWebClient();
return await downloader.NewRequest<string[]>((WebClient client) =>
{
ConcurrentWebClient.AssignDefaultHeaders(client);
client.Headers.Add("TE", "Trailers");
string[] result = new string[urls.Length];
for (int i = 0; i < urls.Length; i++)
{
string url = urls[i];
client.Headers.Remove("Origin");
client.Headers.Add("Origin", url);
result[i] = client.DownloadString(url);
}
return result;
});
}
private async Task<bool> DownloadImage(string[] URLs, string[] filePaths)
{
var downloader = new ConcurrentWebClient();
bool downloadsSucessful = await downloader.NewRequest<bool>((WebClient client) =>
{
ConcurrentWebClient.AssignDefaultHeaders(client);
int len = Math.Min(URLs.Length, filePaths.Length);
for (int i = 0; i < len; i++)
{
// side-note, this is assuming the websites you're visiting aren't mutating the headers
client.Headers.Remove("Origin");
client.Headers.Add("Origin", "https://myprivatesite.fr//" + STARTNBR.ToString());
client.DownloadFile(URLs[i], filePaths[i]);
NBRIMAGESDWLD++;
STARTNBR = CheckBoxBack.Checked ? --STARTNBR : ++STARTNBR;
}
return true;
});
return downloadsSucessful;
}

Getting HTML response fails respectively after first fail

I have a program which gets the HTML code for ~500 webpages every 5 minutes.
It runs correctly until the first failure (unable to download the source within 6 seconds).
After that, all threads will fail, and if I restart the program, it again runs correctly until ...
Where am I going wrong? What should I do to make it better?
This function runs every 5 minutes:
foreach (Company company in companies)
{
string link = company.GetLink();
Thread t = new Thread(() => F(company, link));
t.Start();
if (!t.Join(TimeSpan.FromSeconds(6)))
{
Debug.WriteLine( company.Name + " Fails");
t.Abort();
}
}
and this function downloads the html code:
private void F(Company company, string link)
{
try
{
string htmlCode = GetInformationFromWeb.GetHtmlRequest(link);
company.HtmlCode = htmlCode;
}
catch (Exception ex)
{
}
}
and this class:
public class GetInformationFromWeb
{
public static string GetHtmlRequest(string url)
{
using (MyWebClient client = new MyWebClient())
{
client.Encoding = Encoding.UTF8;
string htmlCode = client.DownloadString(url);
return htmlCode;
}
}
}
and the web client class:
public class MyWebClient : WebClient
{
protected override WebRequest GetWebRequest(Uri address)
{
HttpWebRequest request = base.GetWebRequest(address) as HttpWebRequest;
request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
return request;
}
}
If your foreach is looping over 500 companies, and each is creating a new thread, your internet speed could become a bottleneck and you will receive timeouts over 6 seconds and fail very often.
I suggest you try parallelism. Note MaxDegreeOfParallelism, which sets the maximum number of parallel executions. You can tune this to suit your needs.
Parallel.ForEach(companies, new ParallelOptions { MaxDegreeOfParallelism = 10 }, (company) =>
{
try
{
string htmlCode = GetInformationFromWeb.GetHtmlRequest(company.link);
company.HtmlCode = htmlCode;
}
catch(Exception ex)
{
//ignore or process exception
}
});
I have four basic suggestions:
Use HttpClient instead of the obsolete WebClient. HttpClient can deal with asynchronous operations natively and has far more flexibility to take advantage of. You can even read downloaded contents to strings/streams on a different thread, since you can configure await not to schedule your continuations back to the calling context. You can also configure the client to give up after 6 seconds and raise a TaskCanceledException if that is exceeded (see the sketch after this list).
Avoid swallowing exceptions (as you do in your F function), as it breaks debugging and obfuscates the real cause of problems. A correctly written program will never raise an exception during normal operation.
You are using threads in a useless way, in which they are not even overlapping; they are just waiting for each other to start, because you are blocking the calling loop after each thread's start. In .NET it would be better to do multitasking using Tasks (for example, by calling them as Task.Run(async delegate() { await yourTask(); }), or AsyncContext.Run(...) if you need UI access), and it won't block anything.
The whole GetInformationFromWeb class is pointless at the moment, and you are also spawning multiple client objects pointlessly, since one HTTP client object can handle multiple requests. If you used HttpClient, even without additional bloat, you would just instantiate it once as a static global variable with all the necessary configuration and then call it from any place using as little code as client.GetStringAsync(Uri uri).
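A minimal sketch of the first suggestion, assuming the Company/GetLink shape from the question (the 6-second timeout mirrors the Join timeout above; the method and field names are illustrative):
// One shared HttpClient for the whole application, configured once.
private static readonly HttpClient client = new HttpClient { Timeout = TimeSpan.FromSeconds(6) };

private static async Task FetchAllAsync(IEnumerable<Company> companies)
{
    var tasks = companies.Select(async company =>
    {
        try
        {
            company.HtmlCode = await client.GetStringAsync(company.GetLink());
        }
        catch (TaskCanceledException)   // raised when the 6-second timeout is exceeded
        {
            Debug.WriteLine(company.Name + " timed out");
        }
    });
    await Task.WhenAll(tasks);
}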
OT: Is it some kind of an academic project?

Where do these 1k threads come from

I am trying to create an application that downloads images from a website on multiple threads, as an introduction to threading (I have never used threading properly before).
But currently it seems to create 1000+ threads and I am not sure where they are coming from.
I first queue a work item into the thread pool; for starters, I only have 1 job in the Jobs array:
foreach (Job j in Jobs)
{
ThreadPool.QueueUserWorkItem(Download, j);
}
Which starts void Download(object obj) on a new thread, where it loops through a certain number of pages (images needed / 42 images per page):
for (var i = 0; i < pages; i++)
{
var downloadLink = new System.Uri("http://www." + j.Provider.ToString() + "/index.php?page=post&s=list&tags=" + j.Tags + "&pid=" + i * 42);
using (var wc = new WebClient())
{
try
{
wc.DownloadStringAsync(downloadLink);
wc.DownloadStringCompleted += (sender, e) =>
{
response = e.Result;
ProcessPage(response, false, j);
};
}
catch (System.Exception e)
{
// Unity editor equivalent of console.writeline
Debug.Log(e);
}
}
}
Correct me if I am wrong, but the next method gets called on the same thread:
void ProcessPage(string response, bool secondPass, Job j)
{
var wc = new WebClient();
LinkItem[] linkResponse = LinkFinder.Find(response).ToArray();
foreach (LinkItem i in linkResponse)
{
if (secondPass)
{
if (string.IsNullOrEmpty(i.Href))
continue;
else if (i.Href.Contains("http://loreipsum."))
{
if (DownloadImage(i.Href, ID(i.Href)))
j.Downloaded++;
}
}
else
{
if (i.Href.Contains(";id="))
{
var alterResponse = wc.DownloadString("http://www." + j.Provider.ToString() + "/index.php?page=post&s=view&id=" + ID(i.Href));
ProcessPage(alterResponse, true, j);
}
}
}
}
And finally passes on to the last function and downloads the actual image
bool DownloadImage(string target, int id)
{
var url = new System.Uri(target);
var fi = new System.IO.FileInfo(url.AbsolutePath);
var ext = fi.Extension;
if (!string.IsNullOrEmpty(ext))
{
using (var wc = new WebClient())
{
try
{
wc.DownloadFileAsync(url, id + ext);
return true;
}
catch(System.Exception e)
{
if (DEBUG) Debug.Log(e);
}
}
}
else
{
Debug.Log("Returned Without a extension: " + url + " || " + fi.FullName);
return false;
}
return true;
}
I am not sure how I am starting this many threads, but would love to know.
Edit
The goal of this program is to download the different jobs in Jobs at the same time (max of 5), each downloading a maximum of 42 images at a time,
so a maximum of 210 images can/should be downloading at any time.
First of all, how did you measure the thread count? Why do you think that you have thousands of them in your application? You are using the ThreadPool, so you don't create them yourself, and the ThreadPool wouldn't create such a great amount of them for its own needs.
Second, you are mixing synchronous and asynchronous operations in your code. As you can't use TPL and async/await, let's go through your code and count the units of work you are creating, so you can minimize them. After you do this, the number of queued items in the ThreadPool will decrease and your application will gain the performance you need.
You don't call the SetMaxThreads method in your application, so, according to MSDN:
Maximum Number of Thread Pool Threads
The number of operations that can be queued to the thread pool is limited only by available memory;
however, the thread pool limits the number of threads that can be
active in the process simultaneously. By default, the limit is 25
worker threads per CPU and 1,000 I/O completion threads.
So you should set the maximum to 5.
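For example (note that SetMaxThreads returns false and changes nothing if you ask for fewer worker threads than the machine has processors, so a value of 5 may be rejected and you would then need another throttling mechanism, such as a semaphore):
// Cap both the worker threads and the I/O completion threads.
bool accepted = ThreadPool.SetMaxThreads(5, 5);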
I can't find a place in your code where you check for the 42 images per Job; you are only incrementing the value in the ProcessPage method.
Check the ManagedThreadId in the handler of WebClient.DownloadStringCompleted to see whether it executes on a different thread or not.
You are already adding a new item to the ThreadPool queue, so why are you using the asynchronous operation for downloading? Use the synchronous overload, like this:
ProcessPage(wc.DownloadString(downloadLink), false, j);
This will not create another item in the ThreadPool queue, and you won't have a synchronization context switch here.
In ProcessPage your wc variable isn't being disposed, so you aren't freeing all your resources here. Add a using statement:
void ProcessPage(string response, bool secondPass, Job j)
{
using (var wc = new WebClient())
{
LinkItem[] linkResponse = LinkFinder.Find(response).ToArray();
foreach (LinkItem i in linkResponse)
{
if (secondPass)
{
if (string.IsNullOrEmpty(i.Href))
continue;
else if (i.Href.Contains("http://loreipsum."))
{
if (DownloadImage(i.Href, ID(i.Href)))
j.Downloaded++;
}
}
else
{
if (i.Href.Contains(";id="))
{
var alterResponse = wc.DownloadString("http://www." + j.Provider.ToString() + "/index.php?page=post&s=view&id=" + ID(i.Href));
ProcessPage(alterResponse, true, j);
}
}
}
}
}
In the DownloadImage method you also use the asynchronous load. This also adds an item to the ThreadPool queue, and I think that you can avoid this and use the synchronous overload too:
wc.DownloadFile(url, id + ext);
return true;
So, in general, avoid the context-switching operations and dispose your resources properly.
Your wc WebClient will go out of scope and be randomly garbage collected before the async callback. Also, on all async calls you have to allow for both the immediate return and the actual delegated function return, so ProcessPage will have to be handled in two places. Also, the j in the original loop may be going out of scope, depending on where Download in the original loop is declared.
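A minimal sketch of that point, reusing the names from the question: subscribe the handler before starting the download, keep the client reachable until its own callback has run, and copy anything the callback needs into locals.
for (var i = 0; i < pages; i++)
{
    var job = j;     // local copy so the callback always sees the Job this iteration worked with
    var downloadLink = new System.Uri("http://www." + job.Provider.ToString() + "/index.php?page=post&s=list&tags=" + job.Tags + "&pid=" + i * 42);

    var wc = new WebClient();            // no using block: dispose in the callback instead
    wc.DownloadStringCompleted += (sender, e) =>
    {
        ProcessPage(e.Result, false, job);
        wc.Dispose();                    // the client stays alive until its own callback has run
    };
    wc.DownloadStringAsync(downloadLink);
}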

Query on Queues and Thread Safety

Thread-safety is not an aspect that I have worried about much, as the simple apps and libraries I have written usually only run on the main thread, or do not directly modify properties or fields in any classes that I needed to worry about before.
However, I have started working on a personal project in which I am using a WebClient to download data asynchronously from a remote server. There is a Queue<Uri> that contains a pre-built queue of URIs to download data from.
So consider the following snippet (this is not my real code, but something I hope illustrates my question):
private WebClient webClient = new WebClient();
private Queue<Uri> requestQueue = new Queue<Uri>();

public Boolean DownloadNextAsync()
{
    if (webClient.IsBusy)
        return false;

    if (requestQueue.Count == 0)
        return false;

    var uri = requestQueue.Dequeue();
    webClient.DownloadDataAsync(uri);
    return true;
}
If I am understanding correctly, this method is not thread-safe (assuming this specific instance of the object is known to multiple threads). My reasoning is that webClient could become busy during the time between the IsBusy check and the DownloadDataAsync() method call. And also, requestQueue could become empty between the Count check and when the next item is dequeued.
My question is what is the best way to handle this type of situation to make it thread-safe?
This is more of an abstract question as I realize for this specific method that there would have to be an exceptionally inconvenient timing for this to actually cause a problem, and to cover that case I could just wrap the method in an appropriate try-catch since both pieces would throw an exception. But is there another option? Would a lock statement be applicable here?
If you're targeting .Net 4.0, you could use the Task Parallel Library for help:
var queue = new BlockingCollection<Uri>();
var maxClients = 4;
// Optionally provide another producer/consumer collection for the data
// var data = new BlockingCollection<Tuple<Uri,byte[]>>();
// Optionally implement CancellationTokenSource
var clients = from id in Enumerable.Range(0, maxClients)
select Task.Factory.StartNew(
() =>
{
var client = new WebClient();
while (!queue.IsCompleted)
{
Uri uri;
if (queue.TryTake(out uri))
{
byte[] datum = client.DownloadData(uri); // already "async"
// Optionally pass datum along to the other collection
// or work on it here
}
else Thread.SpinWait(100);
}
});
// Add URI's to search
// queue.Add(...);
// Notify our clients that we've added all the URI's
queue.CompleteAdding();
// Wait for all of our clients to finish
Task.WaitAll(clients.ToArray());
To use this approach for progress indication you can use TaskCompletionSource<TResult> to manage the Event based parallelism:
public static Task<byte[]> DownloadAsync(Uri uri, Action<double> progress)
{
var source = new TaskCompletionSource<byte[]>();
Task.Factory.StartNew(
() =>
{
var client = new WebClient();
client.DownloadProgressChanged
+= (sender, e) => progress(e.ProgressPercentage);
client.DownloadDataCompleted
+= (sender, e) =>
{
if (!e.Cancelled)
{
if (e.Error == null)
{
source.SetResult((byte[])e.Result);
}
else
{
source.SetException(e.Error);
}
}
else
{
source.SetCanceled();
}
};
// Kick off the download; the handlers above complete the TaskCompletionSource.
client.DownloadDataAsync(uri);
});
return source.Task;
}
Used like so:
// var urls = new List<Uri>(...);
// var progressBar = new ProgressBar();
Task.Factory.StartNew(
() =>
{
foreach (var uri in urls)
{
var task = DownloadAsync(
uri,
p =>
progressBar.Invoke(
new MethodInvoker(
delegate { progressBar.Value = (int)(100 * p); }))
);
// Will Block!
// data = task.Result;
}
});
I highly recommend reading "Threading In C#" by Joseph Albahari. I have taken a look through it in preparation for my first (mis)adventure into threading and it's pretty comprehensive.
You can read it here: http://www.albahari.com/threading/
Both of the thread-safety concerns you raised are valid. Furthermore, both WebClient and Queue are documented as not being thread-safe (at the bottom of the MSDN docs). For example, if two threads were dequeuing simultaneously, they might actually cause the queue to become internally inconsistent or could lead to nonsensical return values. For example, if the implementation of Dequeue() was something like:
1. var valueToDequeue = this._internalList[this._startPointer];
2. this._startPointer = (this._startPointer + 1) % this._internalList.Count;
3. return valueToDequeue;
and two threads each executed line 1 before either continued to line 2, then both would return the same value (there are other potential issues here as well). This would not necessarily throw an exception, so you should use a lock statement to guarantee that only one thread can be inside the method at a time:
private readonly object _lock = new object();
...
lock (this._lock) {
// body of method
}
You could also lock on the WebClient or the Queue if you know that no-one else will be synchronizing on them.
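Applied to the snippet from the question, that might look like the following sketch; the whole body runs under the lock so the busy check, the count check, and the dequeue/start sequence behave as one atomic unit:
private readonly object _lock = new object();
private WebClient webClient = new WebClient();
private Queue<Uri> requestQueue = new Queue<Uri>();

public Boolean DownloadNextAsync()
{
    lock (_lock)
    {
        if (webClient.IsBusy)
            return false;

        if (requestQueue.Count == 0)
            return false;

        var uri = requestQueue.Dequeue();
        webClient.DownloadDataAsync(uri);
        return true;
    }
}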

Task Parallel Library WaitAny design

I've just begun to explore the TPL and have a design question.
My Scenario:
I have a list of URLs that each refer to an image. I want each image to be downloaded in parallel. As soon as at least one image is downloaded, I want to execute a method that does something with the downloaded image. That method should NOT be parallelized -- it should be serial.
I think the following will work but I'm not sure if this is the right way to do it. Because I have separate classes for collecting the images and for doing "something" with the collected images, I end up passing around an array of Tasks which seems wrong since it exposes the inner workings of how images are retrieved. But I don't know a way around it. In reality there is more to both of these methods but that's not important for this. Just know that they really shouldn't be lumped into one large method that both retrieves and does something with the image.
//From the Director class
Task<Image>[] downloadTasks = collector.RetrieveImages(listOfURLs);

for (int i = 0; i < listOfURLs.Count; i++)
{
    //Wait for any of the remaining downloads to complete
    int completedIndex = Task<Image>.WaitAny(downloadTasks);
    Image completedImage = downloadTasks[completedIndex].Result;

    //Now do something with the image (this "something" must happen serially)
    //Uses the "Formatter" class to accomplish this let's say
}
///////////////////////////////////////////////////
//From the Collector class
public Task<Image>[] RetrieveImages(List<string> urls)
{
    Task<Image>[] tasks = new Task<Image>[urls.Count];
    int index = 0;

    foreach (string url in urls)
    {
        string lambdaVar = url;   //Required... Bleh
        int lambdaIndex = index;  //Capture the current index as well

        tasks[index] = Task<Image>.Factory.StartNew(() =>
        {
            //TODO: Replace with live image locations
            string fileName = String.Format("{0}.png", lambdaIndex);
            string filePath = Path.Combine(Application.StartupPath, fileName);
            using (WebClient client = new WebClient())
            {
                client.DownloadFile(lambdaVar, filePath);
            }
            return Image.FromFile(filePath);
        },
        TaskCreationOptions.LongRunning | TaskCreationOptions.AttachedToParent);

        index++;
    }
    return tasks;
}
Typically you use WaitAny to wait for one task when you don't care about the results of any of the others. For example if you just cared about the first image that happened to get returned.
How about this instead.
This creates two tasks, one which loads images and adds them to a blocking collection. The second task waits on the collection and processes any images added to the queue. When all the images are loaded the first task closes the queue down so the second task can shut down.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Drawing;
using System.IO;
using System.Net;
using System.Threading.Tasks;
namespace ClassLibrary1
{
public class Class1
{
readonly string _path = Directory.GetCurrentDirectory();
public void Demo()
{
IList<string> listOfUrls = new List<string>();
listOfUrls.Add("http://i3.codeplex.com/Images/v16821/editicon.gif");
listOfUrls.Add("http://i3.codeplex.com/Images/v16821/favorite-star-on.gif");
listOfUrls.Add("http://i3.codeplex.com/Images/v16821/arrow_dsc_green.gif");
listOfUrls.Add("http://i3.codeplex.com/Images/v16821/editicon.gif");
listOfUrls.Add("http://i3.codeplex.com/Images/v16821/favorite-star-on.gif");
listOfUrls.Add("http://i3.codeplex.com/Images/v16821/arrow_dsc_green.gif");
listOfUrls.Add("http://i3.codeplex.com/Images/v16821/editicon.gif");
listOfUrls.Add("http://i3.codeplex.com/Images/v16821/favorite-star-on.gif");
listOfUrls.Add("http://i3.codeplex.com/Images/v16821/arrow_dsc_green.gif");
BlockingCollection<Image> images = new BlockingCollection<Image>();
Parallel.Invoke(
() => // Task 1: load the images
{
Parallel.For(0, listOfUrls.Count, (i) =>
{
Image img = RetrieveImages(listOfUrls[i], i);
img.Tag = i;
images.Add(img); // Add each image to the queue
});
images.CompleteAdding(); // Done with images.
},
() => // Task 2: Process images serially
{
foreach (var img in images.GetConsumingEnumerable())
{
string newPath = Path.Combine(_path, String.Format("{0}_rot.png", img.Tag));
Console.WriteLine("Rotating image {0}", img.Tag);
img.RotateFlip(RotateFlipType.RotateNoneFlipXY);
img.Save(newPath);
}
});
}
public Image RetrieveImages(string url, int i)
{
using (WebClient client = new WebClient())
{
string fileName = Path.Combine(_path, String.Format("{0}.png", i));
Console.WriteLine("Downloading {0}...", url);
client.DownloadFile(url, Path.Combine(_path, fileName));
Console.WriteLine("Saving {0} as {1}.", url, fileName);
return Image.FromFile(Path.Combine(_path, fileName));
}
}
}
}
WARNING: The code doesn't have any error checking or cancellation. It's late and you need something to do, right? :)
This is an example of the pipeline pattern. It assumes that getting an image is pretty slow and that the cost of locking inside the blocking collection isn't going to cause a problem because it happens relatively infrequently compared to the time spent downloading images.
Our book... You can read more about this and other patterns for parallel programming at http://parallelpatterns.codeplex.com/
Chapter 7 covers pipelines and the accompanying examples show pipelines with error handling and cancellation.
TPL already provides the ContinueWith function to execute one task when another finishes. Task chaining is one of the main patterns used in TPL for asynchronous operations.
The following method downloads a set of images and continues by renaming each of the files
static void DownloadInParallel(string[] urls)
{
var tempFolder = Path.GetTempPath();
var downloads = from url in urls
select Task.Factory.StartNew<string>(() =>{
using (var client = new WebClient())
{
var uri = new Uri(url);
string file = Path.Combine(tempFolder,uri.Segments.Last());
client.DownloadFile(uri, file);
return file;
}
},TaskCreationOptions.LongRunning|TaskCreationOptions.AttachedToParent)
.ContinueWith(t=>{
var filePath = t.Result;
File.Move(filePath, filePath + ".test");
},TaskContinuationOptions.ExecuteSynchronously);
var results = downloads.ToArray();
Task.WaitAll(results);
}
You should also check the WebClient Async Tasks from the ParallelExtensionsExtras samples. The DownloadXXXTask extension methods handle both the creation of tasks and the asynchronous downloading of files.
The following method uses the DownloadDataTask extension to get the image's data and rotate it before saving it to disk
static void DownloadInParallel2(string[] urls)
{
var tempFolder = Path.GetTempPath();
var downloads = from url in urls
let uri=new Uri(url)
let filePath=Path.Combine(tempFolder,uri.Segments.Last())
select new WebClient().DownloadDataTask(uri)
.ContinueWith(t=>{
var img = Image.FromStream(new MemoryStream(t.Result));
img.RotateFlip(RotateFlipType.RotateNoneFlipY);
img.Save(filePath);
},TaskContinuationOptions.ExecuteSynchronously);
var results = downloads.ToArray();
Task.WaitAll(results);
}
The best way to do this would probably be by implementing the Observer pattern: have your RetrieveImages function implement IObservable, put your "completed image action" into an IObserver object's OnNext method, and subscribe it to RetrieveImages.
I haven't tried this myself yet (still have to play more with the task library) but I think this is the "right" way to do it.
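For illustration, a rough sketch of that idea might look like this; the class and member names are made up for the example, and a fuller version would also report failures through OnError and completion through OnCompleted:
// The collector pushes each Image to its observers as soon as it has been downloaded.
public class ImageCollector : IObservable<Image>
{
    private readonly List<IObserver<Image>> observers = new List<IObserver<Image>>();

    public IDisposable Subscribe(IObserver<Image> observer)
    {
        observers.Add(observer);
        return new Unsubscriber(observers, observer);
    }

    public void RetrieveImages(IEnumerable<string> urls)
    {
        foreach (var url in urls)
        {
            string lambdaVar = url;
            Task.Factory.StartNew(() =>
            {
                using (var client = new WebClient())
                {
                    var img = Image.FromStream(new MemoryStream(client.DownloadData(lambdaVar)));
                    lock (observers)                      // keeps the "do something" step serial
                    {
                        foreach (var o in observers) o.OnNext(img);
                    }
                }
            });
        }
    }

    private sealed class Unsubscriber : IDisposable
    {
        private readonly List<IObserver<Image>> observers;
        private readonly IObserver<Image> observer;
        public Unsubscriber(List<IObserver<Image>> observers, IObserver<Image> observer)
        {
            this.observers = observers;
            this.observer = observer;
        }
        public void Dispose() { observers.Remove(observer); }
    }
}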
//download all images
private async void GetAllImages()
{
    var downloadTasks = listOfURLs.Where(url => !string.IsNullOrEmpty(url)).Select(async url =>
    {
        var ret = await RetrieveImage(url);
        return ret;
    }).ToArray();

    var images = await Task.WhenAll(downloadTasks);
}

//From the Collector class
public async Task<Image> RetrieveImage(string url)
{
    //TODO: Replace with live image locations
    var fileName = String.Format("{0}.png", Path.GetFileNameWithoutExtension(new Uri(url).LocalPath));
    var filePath = Path.Combine(Application.StartupPath, fileName);

    using (WebClient client = new WebClient())
    {
        await client.DownloadFileTaskAsync(new Uri(url), filePath);
    }

    return Image.FromFile(filePath);
}
