Where do these 1k threads come from - c#

I am trying to create an application that downloads images from a website on multiple threads, as an introduction to threading (I have never used threading properly before).
But currently it seems to create 1000+ threads, and I am not sure where they are coming from.
I first queue a work item on the thread pool. For starters I only have 1 job in the jobs array:
foreach (Job j in Jobs)
{
ThreadPool.QueueUserWorkItem(Download, j);
}
This starts void Download(object obj) on a pool thread, where it loops through a certain number of pages (images needed / 42 images per page):
for (var i = 0; i < pages; i++)
{
var downloadLink = new System.Uri("http://www." + j.Provider.ToString() + "/index.php?page=post&s=list&tags=" + j.Tags + "&pid=" + i * 42);
using (var wc = new WebClient())
{
try
{
wc.DownloadStringAsync(downloadLink);
wc.DownloadStringCompleted += (sender, e) =>
{
response = e.Result;
ProcessPage(response, false, j);
};
}
catch (System.Exception e)
{
// Unity editor equivalent of console.writeline
Debug.Log(e);
}
}
}
Correct me if I am wrong, but the next method gets called on the same thread:
void ProcessPage(string response, bool secondPass, Job j)
{
var wc = new WebClient();
LinkItem[] linkResponse = LinkFinder.Find(response).ToArray();
foreach (LinkItem i in linkResponse)
{
if (secondPass)
{
if (string.IsNullOrEmpty(i.Href))
continue;
else if (i.Href.Contains("http://loreipsum."))
{
if (DownloadImage(i.Href, ID(i.Href)))
j.Downloaded++;
}
}
else
{
if (i.Href.Contains(";id="))
{
var alterResponse = wc.DownloadString("http://www." + j.Provider.ToString() + "/index.php?page=post&s=view&id=" + ID(i.Href));
ProcessPage(alterResponse, true, j);
}
}
}
}
And finally it passes on to the last function, which downloads the actual image:
bool DownloadImage(string target, int id)
{
var url = new System.Uri(target);
var fi = new System.IO.FileInfo(url.AbsolutePath);
var ext = fi.Extension;
if (!string.IsNullOrEmpty(ext))
{
using (var wc = new WebClient())
{
try
{
wc.DownloadFileAsync(url, id + ext);
return true;
}
catch(System.Exception e)
{
if (DEBUG) Debug.Log(e);
}
}
}
else
{
Debug.Log("Returned Without a extension: " + url + " || " + fi.FullName);
return false;
}
return true;
}
I am not sure how I am starting this many threads, but would love to know.
Edit
The goal of this program is to download the different jobs in Jobs at the same time (a maximum of 5), each downloading a maximum of 42 images at a time.
So at most 210 images should be downloading at any given moment.

First of all, how did you measure the thread count? Why do you think that you have thousands of them in your application? You are using the ThreadPool, so you don't create them yourself, and the ThreadPool wouldn't create such a great number of them for its needs.
Second, you are mixing synchronous and asynchronous operations in your code. As you can't use TPL and async/await, let's go through your code and count the units of work you are creating, so you can minimize them. After you do this, the number of queued items in the ThreadPool will decrease and your application will gain the performance you need.
You don't call the SetMaxThreads method in your application, so, according to MSDN:
Maximum Number of Thread Pool Threads
The number of operations that can be queued to the thread pool is limited only by available memory;
however, the thread pool limits the number of threads that can be
active in the process simultaneously. By default, the limit is 25
worker threads per CPU and 1,000 I/O completion threads.
So you must set the maximum to 5.
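A minimal sketch of capping concurrency at 5, under a few assumptions. `ThreadPool.SetMaxThreads` returns false and changes nothing when the requested value is below the number of processors, so the sketch checks its return value and also gates the work with a semaphore. The `Throttle` class, `RunJob` and the `Thread.Sleep` stand-in for the download are hypothetical names for illustration, not part of the original code.

```csharp
using System;
using System.Threading;

public static class Throttle
{
    // Hypothetical gate: at most 5 work items may be inside RunJob at once.
    private static readonly Semaphore Gate = new Semaphore(5, 5);
    private static readonly object Sync = new object();
    private static int running;

    public static int PeakConcurrency { get; private set; }

    public static void RunJob(object state)
    {
        Gate.WaitOne();                      // blocks while 5 jobs are already running
        try
        {
            lock (Sync)
            {
                running++;
                if (running > PeakConcurrency) PeakConcurrency = running;
            }
            Thread.Sleep(50);                // stand-in for the actual download work
        }
        finally
        {
            lock (Sync) { running--; }
            Gate.Release();
        }
    }

    public static void Main()
    {
        // Returns false (and changes nothing) if 5 is below the processor count.
        bool accepted = ThreadPool.SetMaxThreads(5, 5);
        Console.WriteLine("SetMaxThreads(5, 5) accepted: " + accepted);

        const int jobs = 20;
        using (var done = new CountdownEvent(jobs))
        {
            for (int i = 0; i < jobs; i++)
            {
                ThreadPool.QueueUserWorkItem(state =>
                {
                    try { RunJob(state); }
                    finally { done.Signal(); }
                });
            }
            done.Wait();                     // wait until all 20 jobs have finished
        }
        Console.WriteLine("Peak concurrent jobs: " + PeakConcurrency);
    }
}
```

The semaphore keeps pool threads blocked while they wait for a slot, which is tolerable for a handful of jobs but would be wasteful for thousands of queued items.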
I can't find a place in your code where you check the 42 images per Job; you are only incrementing the value in the ProcessPage method.
Check the ManagedThreadId in the handler of WebClient.DownloadStringCompleted to see whether it executes on a different thread or not.
You are already adding a new item to the ThreadPool queue, so why are you using the asynchronous operation for downloading? Use a synchronous overload, like this:
ProcessPage(wc.DownloadString(downloadLink), false, j);
This will not add another item to the ThreadPool queue, and you won't have a synchronization context switch here.
In ProcessPage your wc variable is never disposed, so you aren't freeing its resources. Add a using statement here:
void ProcessPage(string response, bool secondPass, Job j)
{
using (var wc = new WebClient())
{
LinkItem[] linkResponse = LinkFinder.Find(response).ToArray();
foreach (LinkItem i in linkResponse)
{
if (secondPass)
{
if (string.IsNullOrEmpty(i.Href))
continue;
else if (i.Href.Contains("http://loreipsum."))
{
if (DownloadImage(i.Href, ID(i.Href)))
j.Downloaded++;
}
}
else
{
if (i.Href.Contains(";id="))
{
var alterResponse = wc.DownloadString("http://www." + j.Provider.ToString() + "/index.php?page=post&s=view&id=" + ID(i.Href));
ProcessPage(alterResponse, true, j);
}
}
}
}
}
In the DownloadImage method you also use the asynchronous download. This also adds an item to the ThreadPool queue; I think you can avoid this and use the synchronous overload too:
wc.DownloadFile(url, id + ext);
return true;
So, in general, avoid the context-switching operations and dispose your resources properly.

Your wc WebClient will go out of scope and may be garbage collected at an arbitrary time before the async callback runs. Also, with any async call you have to allow for both the immediate return and the later return of the delegated callback, so ProcessPage will have to be handled in two places. Also, the j in the original loop may be going out of scope, depending on where Download in the original loop is declared.

Related

Some tags are missing from my app using ThreadPool

We have a desktop app whose principal goal is to capture all the tags read by RFID readers of different brands. We connect the readers and queue a ThreadPool work item for each connected reader. Everything works pretty well when there are 4 or 5 readers connected. The problem starts when there are almost 30 readers connected: a lot of tags are missed. These tags are present in every vehicle. The buses enter a land terminal, and we have readers in every zone so that we can calculate the bills. On the other hand, 8 or 10 buses enter the terminal every minute and pass the majority of the readers. My question is: is ThreadPool efficient in this case, or should I use another technique?
Here is a snippet of my code:
Client = new TcpClient(pIPReader, pPuerto);
if (Client.GetStream().CanRead)
{
RX = new StreamReader(Client.GetStream());
ThreadPool.QueueUserWorkItem(SocketNedapupPass, new object[] {
pIPReader,row.Cells[3].Value.ToString().ToUpper(), RX });
}
private void SocketNedapupPass(object obj)
{
object[] array = obj as object[];
bool lecturaNueva = false;
HttpClient client = new HttpClient();
HttpResponseMessage response = new HttpResponseMessage();
StreamReader SR = (StreamReader)array[2];
string pIPReader = array[0].ToString();
if (SR.BaseStream.CanRead)
{
try
{
while(SR.BaseStream.CanRead == true)
{
string RawData = SR.ReadLine();
if ((RawData.Length - 1) > InicioNedap)
{
string checkRawData = RawData.Substring(InicioNedap);
if (checkRawData.Length >= LongitudNedap)
{
RawData = checkRawData.Substring(0, LongitudNedap);
lecturaNueva = validacionLectura(RawData, 1, pIPReader);
}
else
{
oLog.WriteSuceso("La placa " + RawData + " no cumple con las longitudes especificadas en el archivo de configuración");
}
}
if (lecturaNueva)
{
if (callAccion && spAccion.Length >= 4)
{
int codAntena = 1;
if (listaReaderPrincipal.Any(x =>x.serie_punto_control
== pIPReader && x.es_principal == "2"))
{
//PROCESAR SALIDA PRINCIPAL
ProcesoSalidaPrincipalVehiculo(RawData, pIPReader,
codAntena, SR);//RX);
}
else if (listaReaderPrincipal.Any(x =>
x.serie_punto_control == pIPReader && x.es_principal
== "1"))
{
//PROCESAR INGRESO PRINCIPAL
ProcesoIngresoPrincipalVehiculo(RawData,
pIPReader, codAntena, SR);//RX);
}
else
{
int res = cnAccion.EjecutarSP(spAccion, RawData,
codigoAlterno, pIPReader, codAntena);
}
}
catch (Exception ex)
{
oLog.WriteError(ex);
}
}
}
}
}
catch (Exception ex)
{
SR.Close();
oLog.WriteError(ex);
}
}
}
The ThreadPool is intended for large numbers of short-lived tasks, in order to amortize the cost of Thread creation and destruction. If your program needs a specific number of long-running threads, it's better to create these threads explicitly using the Thread constructor, instead of relying on the ThreadPool. The problem with using the ThreadPool for long-running tasks is that it might become saturated (it runs out of worker threads), in which case new requests for work are not satisfied immediately but are instead scheduled for later. Each scheduled work item will have to wait until one of the currently running tasks completes. If none of the running tasks completes soon enough, the ThreadPool injects new threads into the pool, at a frequency of around one new thread per second (as of .NET 5). The injection algorithm is an undocumented implementation detail, and it might change in later .NET releases. The only control that you currently have over this algorithm is the SetMinThreads method. With this method you can configure how many threads will be created instantly on demand, before the ThreadPool switches to the slow, conservative algorithm. You can set this threshold as high as you want at the start of the program, for example:
ThreadPool.SetMinThreads(1000, 1000);
...but in this case the purpose of the ThreadPool will have largely been defeated, and the ThreadPool could be hardly described as a pool any more.
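For the alternative mentioned above, creating the long-running threads explicitly, a minimal sketch could look like the following. The `ReaderThreads` class and the `ReadLoop` body are hypothetical stand-ins for the real per-reader socket loop; the count of 30 just mirrors the number of readers in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class ReaderThreads
{
    public static int Processed;

    // Hypothetical stand-in for a long-running loop that reads tags from one reader.
    private static void ReadLoop(object state)
    {
        for (int i = 0; i < 3; i++)
        {
            Thread.Sleep(10);                       // simulate waiting for the next tag
            Interlocked.Increment(ref Processed);   // simulate handling one tag
        }
    }

    public static void Main()
    {
        Processed = 0;
        var threads = new List<Thread>();
        for (int i = 0; i < 30; i++)
        {
            // One dedicated thread per reader; none of them occupy ThreadPool workers.
            var t = new Thread(ReadLoop) { IsBackground = true, Name = "reader-" + i };
            t.Start("reader-" + i);
            threads.Add(t);
        }
        foreach (var t in threads) t.Join();
        Console.WriteLine("Tags processed: " + Processed);
    }
}
```

Marking the threads as background threads means they will not keep the process alive on shutdown; whether that is desirable depends on how the readers should be closed.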

Running multiple background tasks in a WPF application - UI update stalling

I have tried so many variants of this code. I'm receiving the same issue no matter what. The UI updating starts fine and then stalls until the entire process is complete. Can someone point me in the right direction?
The scenario
In a WPF application we will be calling the same API thousands of times with different parameters passed. We need to collect all the responses and do something.
Sample code
List<Task> tasks = new List<Task>();
for (int i = 1; i <= iterations; i++)
{
Task t = SampleTask(new SampleTaskParameterCollection { TaskId = i, Locker = locker, MinSleep = minSleep, MaxSleep = maxSleep });
tasks.Add(t);
}
Task.WhenAll(tasks);
private void SampleTask(SampleTaskParameterCollection parameters)
{
int sleepTime = rnd.Next(parameters.MinSleep, parameters.MaxSleep);
Thread.Sleep(sleepTime);
Application.Current.Dispatcher.BeginInvoke(new Action(() =>
{
lock (parameters.Locker)
{
ProgressBar1.Value = ProgressBar1.Value + 1;
LogTextbox.Text = LogTextbox.Text + Environment.NewLine + "Task " + parameters.TaskId + " slept for " + sleepTime + "ms and has now completed.";
}
LogTextbox.ScrollToEnd();
if (ProgressBar1.Maximum == ProgressBar1.Value)
{
RunSlowButton.IsEnabled = true;
RunFastButton.IsEnabled = true;
ProgressBar1.Value = 0;
}
}), System.Windows.Threading.DispatcherPriority.Send);
}
The current repo is located on GitHub. Look at the SimpleWindow.
Do not create thousands of tasks - this will cause immense performance problems.
Instead, use something like Parallel.For() to limit the number of tasks that run simultaneously; for example:
Parallel.For(1,
iterations + 1,
(index) =>
{
SampleTask(new SampleTaskParameterCollection { TaskId = index, Locker = locker, MinSleep = minSleep, MaxSleep = maxSleep });
});
Also if the UI updates take longer than the interval between the calls to BeginInvoke() then the invokes will begin to be queued up and things will get nasty.
To solve that, you could use a counter in your SampleTask() to only actually update the UI once every N calls (with a suitable value for N).
However, note that to avoid threading issues you'd have to use Interlocked.Increment() (or some other lock) when incrementing and checking the value of the counter. You'd also have to ensure that you updated the UI one last time when all the work is done.
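The counter idea above could be sketched roughly like this. It is a console-only illustration: the `report` callback stands in for the `Dispatcher.BeginInvoke` call in the WPF code, and `ThrottledProgress`, `Run` and the `every` parameter are hypothetical names.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledProgress
{
    // Runs `iterations` work items in parallel and calls `report` only once per
    // `every` completions, plus one final time after all the work is done.
    // Returns the total number of times `report` was called.
    public static int Run(int iterations, int every, Action<int> report)
    {
        int completed = 0;
        int reports = 0;

        Parallel.For(0, iterations, i =>
        {
            // ... the real work for item i would go here ...

            // Interlocked.Increment hands every caller a unique count, so each
            // multiple of `every` is seen by exactly one thread.
            int done = Interlocked.Increment(ref completed);
            if (done % every == 0 && done != iterations)
            {
                Interlocked.Increment(ref reports);
                report(done);
            }
        });

        report(iterations);   // guaranteed last update once everything has finished
        return reports + 1;
    }

    public static void Main()
    {
        int calls = Run(1000, 100, done => Console.WriteLine("Completed so far: " + done));
        Console.WriteLine("UI updates issued: " + calls);
    }
}
```

In the WPF version the intermediate reports may still arrive out of order on the UI thread, so the handler should set the progress value from the reported count rather than adding 1 to it.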

Why simple multi task doesn't work when multi thread does?

var finalList = new List<string>();
var list = new List<int> {1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ................. 999999};
var init = 0;
var limitPerThread = 5;
var countDownEvent = new CountdownEvent(list.Count);
for (var i = 0; i < list.Count; i++)
{
var listToFilter = list.Skip(init).Take(limitPerThread).ToList();
new Thread(delegate()
{
Foo(listToFilter);
countDownEvent.Signal();
}).Start();
init += limitPerThread;
}
//wait all to finish
countDownEvent.Wait();
private static void Foo(List<int> listToFilter)
{
var listDone = Boo(listToFilter);
lock (Object)
{
finalList.AddRange(listDone);
}
}
This doesn't:
var taskList = new List<Task>();
for (var i = 0; i < list.Count; i++)
{
var listToFilter = list.Skip(init).Take(limitPerThread).ToList();
var task = Task.Factory.StartNew(() => Foo(listToFilter));
taskList.Add(task);
init += limitPerThread;
}
//wait all to finish
Task.WaitAll(taskList.ToArray());
This process must create at least 700 threads in the end. When I run it using Thread, it works and creates all of them. But with Task it doesn't. It seems like it's not starting multiple Tasks async.
I really want to know why. Any ideas?
EDIT
Another version with PLINQ (as suggested).
var taskList = new List<Task>(list.Count);
Parallel.ForEach(taskList, t =>
{
var listToFilter = list.Skip(init).Take(limitPerThread).ToList();
Foo(listToFilter);
init += limitPerThread;
t.Start();
});
Task.WaitAll(taskList.ToArray());
EDIT2:
public static List<Communication> Foo(List<Dispositive> listToPing)
{
var listResult = new List<Communication>();
foreach (var item in listToPing)
{
var listIps = item.listIps;
var communication = new Communication
{
IdDispositive = item.Id
};
try
{
for (var i = 0; i < listIps.Count(); i++)
{
var oPing = new Ping().Send(listIps.ElementAt(i).IpAddress, 10000);
if (oPing != null)
{
if (oPing.Status.Equals(IPStatus.TimedOut) && listIps.Count() > i+1)
continue;
if (oPing.Status.Equals(IPStatus.TimedOut))
{
communication.Result = "NOK";
break;
}
communication.Result = oPing.Status.Equals(IPStatus.Success) ? "OK" : "NOK";
break;
}
if (listIps.Count() > i+1)
continue;
communication.Result = "NOK";
break;
}
}
catch
{
communication.Result = "NOK";
}
finally
{
listResult.Add(communication);
}
}
return listResult;
}
Tasks are NOT multithreading. They can be used for that, but mostly they're actually used for the opposite - multiplexing on a single thread.
To use tasks for multithreading, I suggest using Parallel LINQ. It already has many optimizations in it, such as intelligent partitioning of your lists and only spawning as many threads as there are CPU cores.
To understand Task and async, think of it this way - a typical workload often includes IO that needs to be waited upon. Maybe you read a file, or query a webservice, or access a database, or whatever. The point is - your thread gets to wait a loooong time (in CPU cycles at least) until you get a response from some faraway destination.
In the Olden Days™ that meant that your thread was getting locked down (suspended) until that response came. If you wanted to do something else in the meantime, you needed to spawn a new thread. That's doable, but not too efficient. Each OS thread carries a significant overhead (memory, kernel resources) with it. And you could end up with several threads actively burning the CPU, which means that the OS needs to switch between them so that each gets a bit of CPU time and these "context switches" are pretty expensive.
async changes that workflow. Now you can have multiple workloads executing on the same thread. While one piece of work is awaiting the result from a faraway source, another can step in and use that thread to do something else useful. When that second workload gets to its own await, the first can awaken and continue.
After all, it doesn't make sense to spawn more threads than there are CPU cores. You're not going to get more work done that way. Just the opposite - more time will be spent on switching the threads and less time will be available for useful work.
That is what Task/async/await was originally designed for. However, Parallel LINQ has also taken advantage of it and reused it for multithreading. In this case you can look at it this way: the other threads are the "faraway destination" that your main thread is waiting on.
Tasks are executed on the Thread Pool. This means that a handful of threads will serve a large number of tasks. You have multi-threading, but not a thread for every task spawned.
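One way to see this for yourself is to record which managed thread each task ran on. This is only a sketch; `TasksVsThreads` and `CountThreadsUsed` are hypothetical names, and the exact number of threads will vary by machine and by run.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class TasksVsThreads
{
    // Starts `taskCount` tasks and returns how many distinct threads executed them.
    public static int CountThreadsUsed(int taskCount)
    {
        var threadIds = new ConcurrentDictionary<int, bool>();

        var tasks = Enumerable.Range(0, taskCount)
            .Select(_ => Task.Run(() =>
            {
                threadIds.TryAdd(Environment.CurrentManagedThreadId, true);
                Thread.Sleep(5);   // simulate a small amount of blocking work
            }))
            .ToArray();

        Task.WaitAll(tasks);
        return threadIds.Count;
    }

    public static void Main()
    {
        int used = CountThreadsUsed(700);
        Console.WriteLine("700 tasks ran on " + used + " distinct threads");
    }
}
```

On a typical machine the reported number should be far below 700, which is the pool reusing a small set of threads for all the tasks.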
You should use tasks. You should aim to use as many threads as your CPU has cores. Generally, the thread pool does this for you.
How did you measure the performance? Do you think that 700 threads will work faster than 700 tasks executed by 4 threads? No, they will not.
It seems like it's not starting multiple Tasks async
How did you come up with this? As others suggested in the comments and in other answers, you probably need to remove the thread creation: after creating 700 threads you'll degrade your system's performance, as your threads will fight each other for processor time without getting any work done faster.
So, you need to add async/await for your IO operations in the Foo method, using the SendPingAsync version. Also, your method could be simplified, as the many checks of the listIps.Count() > i + 1 condition are useless; the for loop condition already does that:
public static async Task<List<Communication>> Foo(List<Dispositive> listToPing)
{
var listResult = new List<Communication>();
foreach (var item in listToPing)
{
var listIps = item.listIps;
var communication = new Communication
{
IdDispositive = item.Id
};
try
{
var ping = new Ping();
communication.Result = "NOK";
for (var i = 0; i < listIps.Count(); i++)
{
var oPing = await ping.SendPingAsync(listIps.ElementAt(i).IpAddress, 10000);
if (oPing != null)
{
if (oPing.Status.Equals(IPStatus.Success))
{
communication.Result = "OK";
break;
}
}
}
}
catch
{
communication.Result = "NOK";
}
finally
{
listResult.Add(communication);
}
}
return listResult;
}
Another problem with your code is that the PLINQ version isn't thread-safe:
init += limitPerThread;
This can fail while executing in parallel. You may introduce some helper method, like in this answer:
private static async Task<List<PingReply>> PingAsync(List<Communication> theListOfIPs)
{
// Use one Ping per request: a single Ping instance cannot run overlapping SendPingAsync calls.
var tasks = theListOfIPs.Select(ip => new Ping().SendPingAsync(ip, 10000));
var results = await Task.WhenAll(tasks);
return results.ToList();
}
And do this kind of check (try/catch logic removed for simplicity):
public static async Task<List<Communication>> Foo(List<Dispositive> listToPing)
{
var listResult = new List<Communication>();
foreach (var item in listToPing)
{
var listIps = item.listIps;
var communication = new Communication
{
IdDispositive = item.Id
};
var check = await PingAsync(listIps);
communication.Result = check.Any(p => p.Status.Equals(IPStatus.Success)) ? "OK" : "NOK";
listResult.Add(communication);
}
return listResult;
}
And you probably should use Task.Run instead of Task.Factory.StartNew, to be sure that you aren't blocking the UI thread.
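As for the shared `init += limitPerThread` counter, one way to sidestep it entirely is to compute the chunks before going parallel, so no thread ever touches a shared offset. A rough sketch, where `SafePartitioning`, `Chunk` and `ProcessAll` are hypothetical names and the string conversion stands in for the real per-item work:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class SafePartitioning
{
    // Splits the list into consecutive chunks of at most `size` items, up front
    // and on a single thread, so there is no shared offset to race on.
    public static List<List<int>> Chunk(List<int> source, int size)
    {
        var chunks = new List<List<int>>();
        for (int start = 0; start < source.Count; start += size)
        {
            chunks.Add(source.GetRange(start, Math.Min(size, source.Count - start)));
        }
        return chunks;
    }

    public static List<string> ProcessAll(List<int> source, int size)
    {
        // ConcurrentBag is safe to add to from several threads at once.
        var results = new ConcurrentBag<string>();
        Parallel.ForEach(Chunk(source, size), chunk =>
        {
            foreach (var item in chunk)
            {
                results.Add("item " + item);   // stand-in for the real work
            }
        });
        return results.ToList();
    }

    public static void Main()
    {
        var list = Enumerable.Range(1, 23).ToList();
        Console.WriteLine("Chunks: " + Chunk(list, 5).Count);
        Console.WriteLine("Results: " + ProcessAll(list, 5).Count);
    }
}
```

The order of the results is not preserved by `ConcurrentBag`; if order matters, the chunk index would have to be carried along with each result.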

Handle multiple threads, one out one in, in a timed loop

I need to process a large number of files overnight, with a defined start and end time to avoid disrupting users. I've been investigating but there are so many ways of handling threading now that I'm not sure which way to go. The files come into an Exchange inbox as attachments.
My current attempt, based on some examples from here and a bit of experimentation, is:
while (DateTime.Now < dtEndTime.Value)
{
var finished = new CountdownEvent(1);
for (int i = 0; i < numThreads; i++)
{
object state = offset;
finished.AddCount();
ThreadPool.QueueUserWorkItem(delegate
{
try
{
StartProcessing(state);
}
finally
{
finished.Signal();
}
});
offset += numberOfFilesPerPoll;
}
finished.Signal();
finished.Wait();
}
It's running in a winforms app at the moment for ease, but the core processing is in a dll so I can spawn the class I need from a windows service, from a console running under a scheduler, however is easiest. I do have a Windows Service set up with a Timer object that kicks off the processing at a time set in the config file.
So my question is - in the above code, I initialise a bunch of threads (currently 10), then wait for them all to process. My ideal would be a static number of threads, where as one finishes I fire off another, and then when I get to the end time I just wait for all threads to complete.
The reason for this is that the files I'm processing are variable sizes - some might take seconds to process and some might take hours, so I don't want the whole application to wait while one thread completes if I can have it ticking along in the background.
(edit)As it stands, each thread instantiates a class and passes it an offset. The class then gets the next x emails from the inbox, starting at the offset (using the Exchange Web Services paging functionality). As each file is processed, it's moved to a separate folder. From some of the replies so far, I'm wondering if actually I should grab the e-mails in the outer loop, and spawn threads as needed.
To cloud the issue, I currently have a backlog of e-mails that I'm trying to process through. Once the backlog has been cleared, it's likely that the nightly run will have a significantly lower load.
On average there are around 1000 files to process each night.
Update
I've rewritten large chunks of my code so that I can use Parallel.ForEach, and I've come up against an issue with thread safety. The calling code now looks like this:
public bool StartProcessing()
{
FindItemsResults<Item> emails = GetEmails();
var source = new CancellationTokenSource(TimeSpan.FromHours(10));
// Process files in parallel, with a maximum thread count.
var opts = new ParallelOptions { MaxDegreeOfParallelism = 8, CancellationToken = source.Token };
try
{
Parallel.ForEach(emails, opts, processAttachment);
}
catch (OperationCanceledException)
{
Console.WriteLine("Loop was cancelled.");
}
catch (Exception err)
{
WriteToLogFile(err.Message + "\r\n");
WriteToLogFile(err.StackTrace + "\r\n");
}
return true;
}
So far so good (excuse the temporary error handling). I now have a new issue: the properties of the "Item" object, which is an email, are not thread-safe. For example, when I start processing an e-mail, I move it to a "processing" folder so that another process can't grab it, but it turns out that several of the threads might be trying to process the same e-mail at the same time. How do I guarantee that this doesn't happen? I know I need to add a lock; can I add this in the ForEach, or should it be in the processAttachment method?
Use the TPL:
Parallel.ForEach( EnumerateFiles(),
new ParallelOptions { MaxDegreeOfParallelism = 10 },
file => ProcessFile( file ) );
Make EnumerateFiles stop enumerating when your end time is reached, trivially like this:
IEnumerable<string> EnumerateFiles()
{
foreach (var file in Directory.EnumerateFiles( ".", "*.txt" ))
if (DateTime.Now < _endTime)
yield return file;
else
yield break;
}
You can use a combination of Parallel.ForEach() along with a cancellation token source which will cancel the operation after a set time:
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
namespace Demo
{
static class Program
{
static Random rng = new Random();
static void Main()
{
// Simulate having a list of files.
var fileList = Enumerable.Range(1, 100000).Select(i => i.ToString());
// For demo purposes, cancel after a few seconds.
var source = new CancellationTokenSource(TimeSpan.FromSeconds(10));
// Process files in parallel, with a maximum thread count.
var opts = new ParallelOptions {MaxDegreeOfParallelism = 8, CancellationToken = source.Token};
try
{
Parallel.ForEach(fileList, opts, processFile);
}
catch (OperationCanceledException)
{
Console.WriteLine("Loop was cancelled.");
}
}
static void processFile(string file)
{
Console.WriteLine("Processing file: " + file);
// Simulate taking a varying amount of time per file.
int delay;
lock (rng)
{
delay = rng.Next(200, 2000);
}
Thread.Sleep(delay);
Console.WriteLine("Processed file: " + file);
}
}
}
As an alternative to using a cancellation token, you can write a method that returns IEnumerable<string> which returns the list of filenames, and stop returning them when time is up, for example:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
namespace Demo
{
static class Program
{
static Random rng = new Random();
static void Main()
{
// Process files in parallel, with a maximum thread count.
var opts = new ParallelOptions {MaxDegreeOfParallelism = 8};
Parallel.ForEach(fileList(), opts, processFile);
}
static IEnumerable<string> fileList()
{
// Simulate having a list of files.
var fileList = Enumerable.Range(1, 100000).Select(x => x.ToString()).ToArray();
// Simulate finishing after a few seconds.
DateTime endTime = DateTime.Now + TimeSpan.FromSeconds(10);
int i = 0;
// Stop when time is up or when the list is exhausted.
while (i < fileList.Length && DateTime.Now <= endTime)
yield return fileList[i++];
}
static void processFile(string file)
{
Console.WriteLine("Processing file: " + file);
// Simulate taking a varying amount of time per file.
int delay;
lock (rng)
{
delay = rng.Next(200, 2000);
}
Thread.Sleep(delay);
Console.WriteLine("Processed file: " + file);
}
}
}
Note that you don't need the try/catch with this approach.
You should consider using Microsoft's Reactive Framework. It lets you use LINQ queries to process multithreaded asynchronous processing in a very simple way.
Something like this:
var query =
from file in filesToProcess.ToObservable()
where DateTime.Now < stopTime
from result in Observable.Start(() => StartProcessing(file))
select new { file, result };
var subscription =
query.Subscribe(x =>
{
/* handle result */
});
Truly, that's all the code you need if StartProcessing is already defined.
Just NuGet "Rx-Main".
Oh, and to stop processing at any time just call subscription.Dispose().
This was a truly fascinating task, and it took me a while to get the code to a level that I was happy with it.
I ended up with a combination of the above.
The first thing worth noting is that I added the following lines to my web service call. The operation timeout I was experiencing, which I thought was because I'd exceeded some limit set on the endpoint, was actually due to a limit set by Microsoft way back in .NET 2.0:
ServicePointManager.DefaultConnectionLimit = int.MaxValue;
ServicePointManager.Expect100Continue = false;
See here for more information:
What to set ServicePointManager.DefaultConnectionLimit to
As soon as I added those lines of code, my processing increased from 10/minute to around 100/minute.
But I still wasn't happy with the looping, and partitioning etc. My service moved onto a physical server to minimise CPU contention, and I wanted to allow the operating system to dictate how fast it ran, rather than my code throttling it.
After some research, this is what I ended up with - arguably not the most elegant code I've written, but it's extremely fast and reliable.
List<XElement> elements = new List<XElement>();
while (XMLDoc.ReadToFollowing("ElementName"))
{
using (XmlReader r = XMLDoc.ReadSubtree())
{
r.Read();
XElement node = XElement.Load(r);
//do some processing of the node here...
elements.Add(node);
}
}
//And now pass the list of elements through PLinQ to the actual web service call, allowing the OS/framework to handle the parallelism
int failCount=0; //the method call below sets this per request; we log and continue
failCount = elements.AsParallel()
.Sum(element => IntegrationClass.DoRequest(element.ToString()));
It ended up fiendishly simple and lightning fast.
I hope this helps someone else trying to do the same thing!

Mass Downloading of Webpages C#

My application requires that I download a large amount of webpages into memory for further parsing and processing. What is the fastest way to do it? My current method (shown below) seems to be too slow and occasionally results in timeouts.
for (int i = 1; i<=pages; i++)
{
string page_specific_link = baseurl + "&page=" + i.ToString();
try
{
WebClient client = new WebClient();
var pagesource = client.DownloadString(page_specific_link);
client.Dispose();
sourcelist.Add(pagesource);
}
catch (Exception)
{
}
}
The way you approach this problem is going to depend very much on how many pages you want to download, and how many sites you're referencing.
I'll use a good round number like 1,000. If you want to download that many pages from a single site, it's going to take a lot longer than if you want to download 1,000 pages that are spread out across dozens or hundreds of sites. The reason is that if you hit a single site with a whole bunch of concurrent requests, you'll probably end up getting blocked.
So you have to implement a type of "politeness policy," that issues a delay between multiple requests on a single site. The length of that delay depends on a number of things. If the site's robots.txt file has a crawl-delay entry, you should respect that. If they don't want you accessing more than one page per minute, then that's as fast as you should crawl. If there's no crawl-delay, you should base your delay on how long it takes a site to respond. For example, if you can download a page from the site in 500 milliseconds, you set your delay to X. If it takes a full second, set your delay to 2X. You can probably cap your delay to 60 seconds (unless crawl-delay is longer), and I would recommend that you set a minimum delay of 5 to 10 seconds.
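As a rough illustration of that policy, here is a sketch of a delay calculation. The original leaves "X" unspecified, so the multiplier of 10 (which turns a 500 ms response into a 5 second delay) is purely an assumption, as are the `Politeness` and `ComputeDelay` names and the choice to use a crawl-delay value as-is whenever one is present.

```csharp
using System;

public static class Politeness
{
    private static readonly TimeSpan MinDelay = TimeSpan.FromSeconds(5);
    private static readonly TimeSpan MaxDelay = TimeSpan.FromSeconds(60);

    // responseTime: how long the site took to serve the last page.
    // crawlDelay:   the robots.txt crawl-delay, or null if the site has none.
    // multiplier:   assumed scaling factor; the right value depends on the site.
    public static TimeSpan ComputeDelay(TimeSpan responseTime, TimeSpan? crawlDelay, double multiplier = 10.0)
    {
        // An explicit crawl-delay is respected, even when it exceeds the 60 s cap.
        if (crawlDelay.HasValue) return crawlDelay.Value;

        // Otherwise scale with the response time and clamp to [5 s, 60 s].
        var scaled = TimeSpan.FromMilliseconds(responseTime.TotalMilliseconds * multiplier);
        if (scaled < MinDelay) return MinDelay;
        if (scaled > MaxDelay) return MaxDelay;
        return scaled;
    }

    public static void Main()
    {
        Console.WriteLine(ComputeDelay(TimeSpan.FromMilliseconds(500), null).TotalSeconds);
        Console.WriteLine(ComputeDelay(TimeSpan.FromSeconds(1), null).TotalSeconds);
        Console.WriteLine(ComputeDelay(TimeSpan.FromSeconds(10), null).TotalSeconds);
        Console.WriteLine(ComputeDelay(TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(120)).TotalSeconds);
    }
}
```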
I wouldn't recommend using Parallel.ForEach for this. My testing has shown that it doesn't do a good job. Sometimes it over-taxes the connection and often it doesn't allow enough concurrent connections. I would instead create a queue of WebClient instances and then write something like:
// Create queue of WebClient instances
BlockingCollection<WebClient> ClientQueue = new BlockingCollection<WebClient>();
// Initialize queue with some number of WebClient instances
// now process urls
foreach (var url in urls_to_download)
{
var worker = ClientQueue.Take();
worker.DownloadStringAsync(url, ...);
}
When you initialize the WebClient instances that go into the queue, point their DownloadStringCompleted events at a completed event handler. That handler should save the string to a file (or perhaps you should just use DownloadFileAsync), and then the client adds itself back to the ClientQueue.
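Filling in the elided parts, the queue setup and completed handler might look roughly like this. It is a sketch under assumptions: `ClientPool`, `Create`, `DownloadAll` and the `onCompleted` callback are hypothetical names, the URLs come from the command line, and error handling is reduced to a console message. Note that WebClient is marked obsolete in recent .NET versions, so this will compile with a warning there.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Net;

public static class ClientPool
{
    // Builds a queue of `size` reusable WebClient instances. Each client puts
    // itself back on the queue once its download has completed.
    public static BlockingCollection<WebClient> Create(
        int size, Action<Uri, DownloadStringCompletedEventArgs> onCompleted)
    {
        var queue = new BlockingCollection<WebClient>();
        for (int i = 0; i < size; i++)
        {
            var client = new WebClient();
            client.DownloadStringCompleted += (sender, e) =>
            {
                try { onCompleted((Uri)e.UserState, e); }
                finally { queue.Add(client); }       // make the client available again
            };
            queue.Add(client);
        }
        return queue;
    }

    public static void DownloadAll(IEnumerable<Uri> urls, BlockingCollection<WebClient> queue, int size)
    {
        foreach (var url in urls)
        {
            var worker = queue.Take();               // blocks until a client is free
            worker.DownloadStringAsync(url, url);    // 2nd argument surfaces as e.UserState
        }

        // Once every client has been taken back, all downloads have finished.
        for (int i = 0; i < size; i++) queue.Take().Dispose();
    }

    public static void Main(string[] args)
    {
        const int size = 10;
        var pages = new ConcurrentBag<string>();
        var queue = Create(size, (url, e) =>
        {
            // e.Result throws if the download failed, so check for errors first.
            if (e.Cancelled || e.Error != null) Console.WriteLine("Failed: " + url);
            else pages.Add(e.Result);
        });

        var urls = new List<Uri>();
        foreach (var arg in args) urls.Add(new Uri(arg));

        DownloadAll(urls, queue, size);
        Console.WriteLine("Downloaded " + pages.Count + " pages");
    }
}
```

The queue size doubles as the concurrency limit, since `Take()` blocks whenever all clients are busy.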
In my testing, I've been able to support 10 to 15 concurrent connections with this method. Any more than that and I run into problems with DNS resolution (DownloadStringAsync doesn't do the DNS resolution asynchronously). You can get more connections, but doing so is a lot of work.
That's the approach I've taken in the past, and it's worked very well for downloading thousands of pages quickly. It's definitely not the approach I took with my high performance Web crawler, though.
I should also note that there is a huge difference in resource usage between these two blocks of code:
WebClient MyWebClient = new WebClient();
foreach (var url in urls_to_download)
{
MyWebClient.DownloadString(url);
}
---------------
foreach (var url in urls_to_download)
{
WebClient MyWebClient = new WebClient();
MyWebClient.DownloadString(url);
}
The first allocates a single WebClient instance that is used for all requests. The second allocates one WebClient for each request. The difference is huge. WebClient uses a lot of system resources, and allocating thousands of them in a relatively short time is going to impact performance. Believe me ... I've run into this. You're better off allocating just 10 or 20 WebClients (as many as you need for concurrent processing), rather than allocating one per request.
Why not just use a web crawling framework? It can handle all of this for you (multithreading, HTTP requests, parsing links, scheduling, politeness, etc.).
Abot (https://code.google.com/p/abot/) handles all that stuff for you and is written in C#.
In addition to @David's perfectly valid answer, I want to add a slightly cleaner "version" of his approach.
var pages = new List<string> { "http://bing.com", "http://stackoverflow.com" };
var sources = new BlockingCollection<string>();
Parallel.ForEach(pages, x =>
{
using(var client = new WebClient())
{
var pagesource = client.DownloadString(x);
sources.Add(pagesource);
}
});
Yet another approach, that uses async:
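static IEnumerable<string> GetSources(List<string> pages)
{
var sources = new BlockingCollection<string>();
var latch = new CountdownEvent(pages.Count);
foreach (var p in pages)
{
// Dispose the client in the handler rather than in a using block,
// so it is not disposed while the download is still in flight.
var wc = new WebClient();
wc.DownloadStringCompleted += (x, e) =>
{
try
{
// e.Result throws if the download failed, so check first.
if (e.Error == null && !e.Cancelled)
sources.Add(e.Result);
}
finally
{
wc.Dispose();
// Always signal, otherwise latch.Wait() never returns after a failure.
latch.Signal();
}
};
wc.DownloadStringAsync(new Uri(p));
}
latch.Wait();
return sources;
}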
You should use parallel programming for this purpose.
There are a lot of ways to achieve what you want; the easiest would be something like this:
var pageList = new List<string>();
for (int i = 1; i <= pages; i++)
{
pageList.Add(baseurl + "&page=" + i.ToString());
}
// pageList is a list of urls
Parallel.ForEach<string>(pageList, (page) =>
{
try
{
// The using block disposes the client even if the download throws.
using (WebClient client = new WebClient())
{
var pagesource = client.DownloadString(page);
lock (sourcelist)
sourcelist.Add(pagesource);
}
}
catch (Exception) {}
});
I had a similar case, and this is how I solved it:
using System;
using System.Threading;
using System.Collections.Generic;
using System.Net;
using System.IO;
namespace WebClientApp
{
class MainClassApp
{
private static int requests = 0;
private static object requests_lock = new object();
public static void Main() {
List<string> urls = new List<string> { "http://www.google.com", "http://www.slashdot.org"};
foreach(var url in urls) {
ThreadPool.QueueUserWorkItem(GetUrl, url);
}
int cur_req = 0;
while(cur_req<urls.Count) {
lock(requests_lock) {
cur_req = requests;
}
Thread.Sleep(1000);
}
Console.WriteLine("Done");
}
private static void GetUrl(Object the_url) {
string url = (string)the_url;
try {
using (WebClient client = new WebClient())
using (Stream data = client.OpenRead(url))
using (StreamReader reader = new StreamReader(data)) {
string html = reader.ReadToEnd();
// Do something with html
Console.WriteLine(html);
}
}
catch (WebException ex) {
Console.WriteLine(ex.Message);
}
finally {
lock(requests_lock) {
// Maybe you could add the HTML to SourceList here.
// Count the request even on failure so Main does not wait forever.
requests++;
}
}
}
}
}
You should consider using parallelism, because the slowness comes from your software waiting on I/O: while one thread is waiting for I/O, another one can get started.
While the other answers are perfectly valid, all of them (at the time of this writing) are neglecting something very important: calls to the web are I/O bound, and having a thread wait on an operation like this is going to strain system resources.
What you really want to do is take advantage of the async methods on the WebClient class (as some have pointed out) as well as the Task Parallel Library's ability to handle the Event-Based Asynchronous Pattern.
First, you would get the urls that you want to download:
IEnumerable<Uri> urls = pages.Select(i => new Uri(baseurl +
"&page=" + i.ToString(CultureInfo.InvariantCulture)));
Then, you would create a new WebClient instance for each url, using the TaskCompletionSource<T> class to handle the calls asynchronously (this won't burn a thread):
IEnumerable<Task<Tuple<Uri, string>>> tasks = urls.Select(url => {
// Create the task completion source.
var tcs = new TaskCompletionSource<Tuple<Uri, string>>();
// The web client.
var wc = new WebClient();
// Attach to the DownloadStringCompleted event.
wc.DownloadStringCompleted += (s, e) => {
// Dispose of the client when done.
using (wc)
{
// If there is an error, set it.
if (e.Error != null)
{
tcs.SetException(e.Error);
}
// Otherwise, set cancelled if cancelled.
else if (e.Cancelled)
{
tcs.SetCanceled();
}
else
{
// Set the result.
tcs.SetResult(new Tuple<Uri, string>(url, e.Result));
}
}
};
// Start the process asynchronously, don't burn a thread.
wc.DownloadStringAsync(url);
// Return the task.
return tcs.Task;
});
Now you have an IEnumerable<T> which you can convert to an array and wait on all of the results using Task.WaitAll:
// Materialize the tasks.
Task<Tuple<Uri, string>>[] materializedTasks = tasks.ToArray();
// Wait for all to complete.
Task.WaitAll(materializedTasks);
Then, you can just use the Result property on the Task<T> instances to get the pair of the url and the content:
// Cycle through each of the results.
foreach (Tuple<Uri, string> pair in materializedTasks.Select(t => t.Result))
{
// pair.Item1 will contain the Uri.
// pair.Item2 will contain the content.
}
Note that the above code has the caveat of not having any error handling.
If you wanted to get even more throughput, instead of waiting for the entire list to be finished, you could process the content of a single page after it's done downloading; Task<T> is meant to be used like a pipeline, when you've completed your unit of work, have it continue to the next one instead of waiting for all of the items to be done (if they can be done in an asynchronous manner).
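One way to sketch that pipeline idea is to attach a continuation to each download task, so a page is processed as soon as it arrives. The helper below is only an illustration: the names PagePipeline and ProcessEachWhenDone and the process delegate are hypothetical, and it assumes .NET 4.5 or later for Task.WhenAll.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

static class PagePipeline
{
    // Hypothetical helper: processes each page as soon as its download task
    // completes, and returns a task that completes once every page is processed.
    public static Task<TResult[]> ProcessEachWhenDone<TResult>(
        IEnumerable<Task<Tuple<Uri, string>>> downloads,
        Func<Uri, string, TResult> process)
    {
        var processed = downloads
            // If a download failed, reading t.Result throws inside the
            // continuation, so the failure surfaces on the combined task.
            .Select(d => d.ContinueWith(t => process(t.Result.Item1, t.Result.Item2)))
            .ToArray();

        // Results come back in the same order as the input tasks.
        return Task.WhenAll(processed);
    }
}
```

With the tasks sequence from above, a call might look like PagePipeline.ProcessEachWhenDone(tasks, (url, content) => content.Length), which yields the length of each page without waiting for the slowest download before starting any processing.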
I am using an active thread count and an arbitrary limit:
// Updated only through Interlocked, so it does not need to be volatile.
private static int activeThreads = 0;
public static void RecordData()
{
var groupSize = 10; // size of each batch (declared as nbThreads in the original)
var source = db.ListOfUrls; // Thousands of urls
// Round up so the final partial group is not skipped.
var iterations = (source.Length + groupSize - 1) / groupSize;
for (int i = 0; i < iterations; i++)
{
var subList = source.Skip(groupSize * i).Take(groupSize);
// RecordUri returns at its first await, so this only starts the downloads.
Parallel.ForEach(subList, (item) => RecordUri(item));
// Wait here before processing further data, to avoid overload.
while (Volatile.Read(ref activeThreads) > 30) Thread.Sleep(100);
}
}
private static async Task RecordUri(Uri uri)
{
Interlocked.Increment(ref activeThreads);
try
{
using (WebClient wc = new WebClient())
{
var jsonData = await wc.DownloadStringTaskAsync(uri);
var root = JsonConvert.DeserializeObject<RootObject>(jsonData);
// Calls a RecordData overload taking a RootObject, which is not shown here.
RecordData(root);
}
}
finally
{
// Decrement the same counter that was incremented, even if the download fails.
Interlocked.Decrement(ref activeThreads);
}
}
