Most efficient way to download thousands of webpages - c#

I have a few thousand items. For each item I need to download a webpage and process it. The processing itself is not processor-intensive.
Right now I'm doing it synchronously using the WebClient class, but it takes too long. I'm sure it can easily be parallelized or made asynchronous, but I am looking for the most resource-efficient way to do it. There may be limits on the number of active web requests, so I don't like the idea of creating thousands of WebClients and starting an asynchronous operation on each one, unless that is not actually a problem.
Is it possible to use Parallel Extensions and Task class in C# 4?
Edit: Thanks for the answers. I was hoping for something using asynchronous operations, because running synchronous operations in parallel will only block those threads.

You want to use a structure called a producer/consumer queue. You queue up all your urls for processing, and assign consumer threads to dequeue each url (with appropriate locking) and then download and process it.
This allows you to control and tune the number of consumers for what works best in your situation. In most cases, you'll find the optimum throughput for network operations is achieved with between 5 and 20 active connections. More and you start worrying about congestion issues on the wire or context switching issues among your threads. Of course, it varies depending on your circumstances: a server with a lot of cores and fat pipe might be able to push this number much higher, but an old P4 on dialup might find it does best with just a couple going at a time. That's why the tuning ability is so important.
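A minimal sketch of that producer/consumer queue, assuming BlockingCollection<T> from .NET 4 does the locking for you. The class name DownloadQueue and the downloadAndProcess delegate are made up for illustration; the consumer count is the knob to tune.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class DownloadQueue
{
    // Runs `consumerCount` workers that pull URLs from one shared queue.
    // `downloadAndProcess` is a caller-supplied (hypothetical) delegate.
    public static IDictionary<string, string> Run(
        IEnumerable<string> urls,
        int consumerCount,
        Func<string, string> downloadAndProcess)
    {
        var results = new ConcurrentDictionary<string, string>();

        using (var queue = new BlockingCollection<string>())
        {
            // Producer: enqueue everything, then signal that nothing more is coming.
            foreach (var url in urls) queue.Add(url);
            queue.CompleteAdding();

            // Consumers: each one dequeues until the queue is drained.
            var consumers = Enumerable.Range(0, consumerCount)
                .Select(_ => Task.Factory.StartNew(() =>
                {
                    foreach (var url in queue.GetConsumingEnumerable())
                        results[url] = downloadAndProcess(url);
                }, TaskCreationOptions.LongRunning))
                .ToArray();

            Task.WaitAll(consumers);
        }

        return results;
    }
}
```

With WebClient, the delegate could be something like url => { using (var wc = new WebClient()) return wc.DownloadString(url); }. On .NET Framework, ServicePointManager.DefaultConnectionLimit normally allows only 2 connections per host by default, so it probably needs raising before extra consumers make any difference against a single site.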

Try using Parallel.ForEach([list of items], x => YourDownloadFunction(x))
It will handle concurrency automatically and efficiently, using thread pools and the whole lot.

Use threads directly. Parallel.ForEach limits its threads based on the number of cores/CPUs you have. Fetching websites doesn't keep a thread busy throughout the operation; there will be delays between requests (images, static content, etc.). So use threads to maximize speed: start with 50 threads, then go up from there to see how much your computer can handle.
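For the asynchronous approach the question's edit asks about, here is a rough sketch that caps the number of requests in flight without dedicating a thread to each one. It assumes async/await, which needs C# 5 / .NET 4.5 or later rather than the C# 4 mentioned in the question. ThrottledDownloader and the downloadAsync delegate are names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledDownloader
{
    // Starts one asynchronous operation per URL, but lets at most
    // `maxConcurrent` of them run at the same time.
    public static async Task<string[]> RunAsync(
        IEnumerable<string> urls,
        int maxConcurrent,
        Func<string, Task<string>> downloadAsync)
    {
        using (var gate = new SemaphoreSlim(maxConcurrent))
        {
            var tasks = urls.Select(async url =>
            {
                await gate.WaitAsync();          // wait for a free slot without blocking a thread
                try { return await downloadAsync(url); }
                finally { gate.Release(); }      // free the slot even if the download fails
            }).ToList();

            // Results come back in the same order as the input URLs.
            return await Task.WhenAll(tasks);
        }
    }
}
```

A typical delegate would be url => client.GetStringAsync(url) with one shared HttpClient instance; the value of maxConcurrent is something to measure rather than guess.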


How to use multithreading or any other .NET technology to scale a program performing network, disk and processor intensive jobs?

The Problem:
1. Download a batch of PDF files from pickup.fileserver (SFTP or Windows share) to the local hard drive. (Polling is involved here to check whether files are available to download.)
2. Process the PDF files (resize, apply barcodes etc.), create some metadata files, update the database etc.
3. Upload this batch to dropoff.fileserver (SFTP).
4. Await a response from dropoff.fileserver (again, polling is the only option). Once the batch response is available, download it to the local HD.
5. Parse the batch response, update the database and finally upload a report to pickup.fileserver.
6. Archive all batch files to a SAN location and go back to step 1.
The Current Solution
We are expecting many such batches so we have created a windows service which can keep polling at certain time intervals and perform the steps mentioned above. It takes care of one batch at a time.
The Concern
The current solution works fine; however, I'm concerned that it is NOT making the best use of available resources, and there is certainly a lot of room for improvement. I have very little idea about how I can scale this Windows service to process as many batches simultaneously as it can, and then, if required, how to involve multiple instances of this Windows service hosted on different servers to scale further.
I have read some MSDN articles and some SO answers on similar topics. There are suggestions about using producer-consumer patterns (BlockingCollection<T> etc.). Some say that it wouldn't make sense to create a multi-threaded app for IO-intensive tasks. What we have here is a mixture of disk + network + processor intensive tasks. I need to understand how best to use threading or any other technology to make best use of available resources on one server and go beyond one server (if required) to scale further.
Typical Batch Size
We regularly get batches of ~200 files, ~300 MB in total size. The number of batches could grow to about 50 to 100 in the next year or two. A couple of times a year, we get batches of 5k to 10k files.
As you say, what you have is a mixture of tasks, and it's probably going to be hard to implement a single pipeline that optimizes all your resources. I would look at breaking this down into 6 services (one per step) that can then be tuned, multiplied or multi-threaded to provide the throughput you need.
Your sources are probably correct that you're not going to improve performance of your network tasks much by multithreading them. By breaking your application into several services, your resizing and barcoding service can start processing a file as soon as it's done downloading, while the download service moves on to downloading the next file.
The current solution works fine
Then keep it. That's my $0.02. Who cares if it's not terribly efficient? As long as it is efficient enough, then why change it?
That said...
I need to understand how best to use threading or any other technology to make best use of available resources on one server
If you want a new toy, I'd recommend using TPL Dataflow. It is designed specifically for wiring up pipelines that contain a mixture of I/O-bound and CPU-bound steps. Each step can be independently parallelized, and TPL Dataflow blocks understand asynchronous code, so they also work well with I/O.
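To give an idea of what that looks like, here is a small sketch of a two-stage TPL Dataflow pipeline: an asynchronous I/O-bound step feeding a CPU-bound step, each with its own degree of parallelism. The class name, the two delegates and the specific parallelism/capacity numbers are placeholders, not recommendations. On .NET Framework the blocks come from the System.Threading.Tasks.Dataflow NuGet package.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class BatchPipeline
{
    public static async Task<string[]> RunAsync(
        string[] files,
        Func<string, Task<string>> downloadAsync,   // I/O-bound step (hypothetical)
        Func<string, string> process)               // CPU-bound step (hypothetical)
    {
        var done = new ConcurrentBag<string>();

        // I/O stage: a few concurrent downloads, bounded so it cannot run far ahead.
        var download = new TransformBlock<string, string>(
            downloadAsync,
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4, BoundedCapacity = 16 });

        // CPU stage: roughly one worker per core.
        var transform = new TransformBlock<string, string>(
            process,
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount });

        var collect = new ActionBlock<string>(item => done.Add(item));

        // Propagate completion so finishing the first block shuts down the chain.
        var link = new DataflowLinkOptions { PropagateCompletion = true };
        download.LinkTo(transform, link);
        transform.LinkTo(collect, link);

        foreach (var f in files) await download.SendAsync(f);
        download.Complete();
        await collect.Completion;

        return done.ToArray();   // order is not guaranteed
    }
}
```

Each of the steps in the question could become one block in a chain like this, with the parallelism of every block tuned separately.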
and go beyond one server (if required) to scale further.
That's a totally different question. You'd need to use reliable queues and break the different steps into different processes, which can then run anywhere. This is a good place to start.
According to this article you may implement background worker jobs (Hangfire preferably) in your application layer and reduce code and deployment management of multiple windows services and achieve the same result possibly.
Also, you won't need to bother about handling multiple windows services.
Additionally it can restore in case of failure at application level or restart events.
There is no magic technology that will solve your problem, you need to analyse each part of it step by step.
You will need to profile the application and determine what areas are slow performing and refactor the code to resolve the problem.
This might mean increasing the demand on one resource to decrease demand on another, for example: You might find that you are doing a database lookup 10 times for each file you process. But caching the data before starting processing files is quicker, but maybe only if you have a batch larger than xx files.
You might find that to increase the processing speed of the whole batch that this is maybe not the optimal method for a single file.
As your program has multiple steps then you can look at each of these in turn, and as a whole.
My guess would be that the ftp download and upload would take the most time. So, you can look at running this in parallel. Whether this means running xx threads at once each processing a file, or having a separate task/thread for each stage in your process you can only determine with testing.
A good design is critical for performance. But there are limits and sometimes it just takes time to do some tasks.
Don't forget that you must weigh this up against the time and effort needed to implement it and the benefit. If the service runs overnight and takes 6 hours, is it really a benefit if it takes 4 hours instead, when the people who need to work on the result will not be in the office until much later anyway?
For this kind of problem: do you download any specific file types from the SFTP server? I have a similar problem downloading large files, although in my case it is not a Windows service but an EXE that runs on System.Timers.
Try creating threads for each file type that is large in size, e.g. PDFs.
You can check for these file types while reading the SFTP file path and assign them to a thread to download.
You will need to do the same in reverse for uploading the files.
In my case, all I was able to do was tweak the existing code and create a separate thread for the large file types. That solved my problem, as flat files and large PDF files are now downloaded on parallel threads.

Ideal number of Tasks

I am currently working on an application that has an "embarrassingly parallel" scenario. Is there any guideline/algorithm to determine the ideal number of tasks to maximize CPU utilization?
If you could maintain a number of threads equal to the number of cores (or double if you have Hyperthreading enabled) the CPU should be utilized in the optimal way.
Also, the related post might be helpful: Optimal number of threads per core.
I think the best approach is to first let the framework deal with that and do something more complicated only when that isn't good enough.
In your case, it would probably mean using Parallel.ForEach() to process some collection, instead of manually using n Tasks.
When you find out that Parallel.ForEach() with default settings doesn't parallelize the work in the way you would want, then you try fiddling with it, by setting MaxDegreeOfParallelism or using a custom partitioner.
And only when that still isn't good enough, then you should consider using Tasks.
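As a sketch of the "fiddling" step, this is roughly what capping Parallel.ForEach with MaxDegreeOfParallelism looks like. The squaring is only a stand-in for real CPU-bound work, and ParallelSketch is a made-up name.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelSketch
{
    public static long SumOfSquares(IEnumerable<int> items, int maxDegree)
    {
        long total = 0;

        // Upper bound on how many iterations run concurrently.
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxDegree };

        Parallel.ForEach(items, options, item =>
        {
            long square = (long)item * item;      // stand-in for the real work
            Interlocked.Add(ref total, square);   // thread-safe accumulation
        });

        return total;
    }
}
```

Calling it with different maxDegree values (for example Environment.ProcessorCount and half of that) and timing the runs is usually the quickest way to see whether the default was already good enough.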
This depends on your task. If you only process and don't wait for I/O, you should have as many as you have cores.
If you are sending queries to many different servers, waiting 20 to 40 ms for a response, reading some I/O from a disk drive or tape recorder, and then processing for only a single millisecond, every core can serve 30 threads or more.
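The "30 threads or more" figure can be reproduced with a common rule of thumb: threads per core is roughly (wait time + compute time) / compute time. The rule itself is an assumption on my part, not something stated in the answer, and the names below are invented.

```csharp
using System;

public static class ThreadEstimate
{
    // Rule-of-thumb estimate: while one thread waits on I/O,
    // other threads can use the core for their compute slices.
    public static int ThreadsPerCore(double waitMs, double computeMs)
    {
        return (int)Math.Round((waitMs + computeMs) / computeMs);
    }
}
```

With a 30 ms wait (the middle of the 20 to 40 ms range) and 1 ms of processing, ThreadsPerCore(30, 1) gives 31. Treat the result as a starting point for measurement, not as an exact answer.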

Balance between Number of Threads and Web Requests

I have a program that executes multiple threads. Each thread simply executes an HttpWebRequest and then screen scrapes the page looking for some text. I am in a race against other users to find this text. I could execute 1000000 threads, all looking for the same thing.
My thought is that this would put a lot of work on my processor and would actually cause the requests to execute more slowly. How can I find a balance between the number of threads to execute and the performance of the web requests? Basically, I want to find the optimal number of threads to spawn so that the amount of data they pull down is greatest.
The application is using .NET4 and written in C#.
You are right to assume that 1000000 threads will put undue pressure on your CPU. The work that your CPU would have to do to manage and switch between that many threads would probably cause your system to be very slow indeed.
Obviously you are not serious about 1000000 threads, but it demonstrates that you cannot simply throw more threads at the problem. You don't really want to write your own load balancer; that will not be easy and will not perform as well as the classes that come with the base class library. Have a look at using ThreadPool threads, which the CLR will manage for you. You can also look at the Task Parallel Library that is new in .NET 4.0 (since you mention that is what you are using).
Also check out this great article about multi-threading:
C# has a ThreadPool. Submit your web-scraping tasks to the pool. You can tweak the number of threads in the pool to tune your app - you will probably need to increase it well above the default for best performance with such a requirement as yours.
Huge numbers of threads are wasteful, as posted by @M Babcock.
I'm not sure if the number of threads in a C# ThreadPool can be changed at run-time, (I see no reason why not, but M$...). If it is tweakable during the run, tuning will be even easier!
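For what it's worth, the pool limits can be read and changed while the program is running via ThreadPool.GetMinThreads, GetMaxThreads, SetMinThreads and SetMaxThreads. A small sketch follows; the helper name RaiseMinWorkerThreads is made up, and whether raising the minimum actually helps is something to measure.

```csharp
using System;
using System.Threading;

public static class PoolTuning
{
    // Raises the minimum number of worker threads so the pool does not
    // have to ramp up gradually under a burst of blocking work items.
    public static bool RaiseMinWorkerThreads(int desiredWorkers)
    {
        int minWorker, minIo;
        ThreadPool.GetMinThreads(out minWorker, out minIo);

        int maxWorker, maxIo;
        ThreadPool.GetMaxThreads(out maxWorker, out maxIo);

        // Never lower the current minimum and never exceed the maximum.
        int target = Math.Min(Math.Max(minWorker, desiredWorkers), maxWorker);

        // Returns false if the runtime rejects the values.
        return ThreadPool.SetMinThreads(target, minIo);
    }
}
```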
You need to use Parallel.ForEach to manage your threads properly...
You are asking a performance question without providing any estimates of your actual requirements... so let me try doing it for you.
How much data can you pull in? Assuming an awesome network and a regular network card: 100Mb/s at most, probably less than 10Mb/sec. This gives fewer than about 10000 requests per second (assuming ~10K request/response pairs).
Can one thread handle that much data? Searching through 100Mb a second should not be a problem even for a single thread. Super easy to prototype/measure.
How many threads do I need to read data? Likely 1: starting an asynchronous request is fast, and reading a response OR posting the response to a queue for processing is fast enough for 10000 items a second.
So my estimate: 1 thread for simple code, or (1 + one thread per core) if you have more cores and are willing to run processing in parallel.

Maintaining a large number of open sockets in a c# application

I'm writing an application in C# where I want to maintain a large number of open socket-connections (tcp), and be able to respond to data that comes in from all of them. Now, the normal thing to do (afaik) would be to call BeginRead on all of them, but as far as I know BeginRead spawns a thread and sits and waits for data in that thread, and I've read that threads don't tend to scale very well when using a lot of them. Though I might be wrong about this, if that's the case please just tell me.
This will probably sound idiotic, but what I want is to be able to maintain as many open socket-connections as possible (as my machine/os allows), while still being able to react to data coming in from any of them as fast as possible, and using as little system-resources as possible doing this.
What I've thought about is to put all of them in some kind of list, and then run a single thread in loop over them checking for new data and acting on that new data (if there is any), though I'd think that a loop like that will end up frying my cpu (cause there is nothing in the loop that blocks). Is there any way (simple or not, I don't mind getting into some more complex algorithms to solve this) that I can achieve this? Any help would be appreciated.
No. The asynchronous methods do not spawn threads, they use IO Completion Ports.
The BeginXXX methods belong to the older APM pattern and are no longer recommended for new code. Use the XxxAsync methods instead.
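As a rough illustration of the XxxAsync style, here is a read loop built on Stream.ReadAsync, assuming C# 5 or later for async/await. You would start one such loop per connection, passing TcpClient.GetStream(). AsyncReader and the onData callback are names invented for the sketch.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class AsyncReader
{
    // Reads until the other side closes the stream, handing each chunk
    // to onData. For a NetworkStream, a pending read does not occupy a
    // thread; the continuation runs when data actually arrives.
    public static async Task<long> PumpAsync(Stream stream, Action<byte[], int> onData)
    {
        var buffer = new byte[4096];
        long total = 0;
        int read;

        // ReadAsync returns 0 when the stream has ended.
        while ((read = await stream.ReadAsync(buffer, 0, buffer.Length)) > 0)
        {
            onData(buffer, read);   // only the first `read` bytes are valid
            total += read;
        }

        return total;
    }
}
```

Because idle connections cost a buffer and a bit of state rather than a thread, this shape tends to scale to many more sockets than one thread per connection.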

C# - Moving files - to queue or multi-thread

I have an app that moves a project and its files from preview to production using a Flex front-end and a .NET web service. Currently, the process takes about 5-10 minutes per project. Aside from latency concerns, it really shouldn't take that long. I'm wondering whether or not this is a good use case for multi-threading. Also, considering the user may want to push multiple projects at once or one right after another, is there a way to queue the jobs?
Any suggestions and examples are greatly appreciated.
Something that does heavy disk IO typically isn't a good candidate for multithreading since the disks can really only do one thing at a time. However, if you're pushing to multiple servers or the servers have particularly good disk subsystems some light threading may be beneficial.
As a note - regardless of whether or not you decide to queue the jobs, you will use multi-threading. Queueing is just one way of handling what is ultimately solved using multi-threading.
And yes, I'd recommend you build a queue to push out each project.
You should compare the speed of your code against just copying in Windows (i.e., Explorer or the command line) and against copying with something advanced like TeraCopy. If your code is significantly slower than Windows, then look for parts of your code to optimize using a profiler. If your code is about as fast as Windows but slower than TeraCopy, then multithreading could help.
Multithreading is not generally helpful when the operation is I/O bound, but copying files involves reading from the disk AND writing over the network. These are two I/O operations, so if you separate them onto different threads, it could increase performance. For something like this you need a producer/consumer setup where you have a circular queue with one thread reading from disk and writing to the queue, and another thread reading from the queue and writing to the network. It is important to keep in mind that the two threads will not run at the same speed, so if the queue gets full, wait before writing more data, and if it's empty, wait before reading. Also, the locking strategy could have a big impact on performance here and could cause performance to degrade to slower than a single-threaded implementation.
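A sketch of that reader/writer split, using a bounded BlockingCollection<byte[]> in place of a hand-written circular queue (so this is a swapped-in technique, not exactly what is described above). The bounded collection does the "wait when full / wait when empty" part and the locking. Names and default sizes are arbitrary.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public static class PipelinedCopy
{
    public static void Copy(Stream source, Stream destination,
                            int chunkSize = 81920, int maxQueuedChunks = 8)
    {
        using (var queue = new BlockingCollection<byte[]>(maxQueuedChunks))
        {
            // Reader thread: source -> queue.
            var reader = Task.Run(() =>
            {
                try
                {
                    while (true)
                    {
                        var buffer = new byte[chunkSize];
                        int read = source.Read(buffer, 0, buffer.Length);
                        if (read == 0) break;                       // end of source
                        if (read < buffer.Length) Array.Resize(ref buffer, read);
                        queue.Add(buffer);                          // blocks while the queue is full
                    }
                }
                finally { queue.CompleteAdding(); }                 // lets the writer loop finish
            });

            // Writer (calling thread): queue -> destination. Blocks while the queue is empty.
            foreach (var chunk in queue.GetConsumingEnumerable())
                destination.Write(chunk, 0, chunk.Length);

            reader.Wait();   // surfaces any exception thrown while reading
        }
    }
}
```

Allocating a new buffer per chunk keeps the sketch short; a real implementation would probably recycle buffers, and should be timed against a plain Stream.CopyTo before being adopted.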
If you're moving things between just two computers, the network is going to be the bottleneck, so you may want to queue these operations.
Likewise, on the same machine, the I/O is going to be the bottleneck, so you'd want to queue there, too.
You should try using the ThreadPool.
ThreadPool.QueueUserWorkItem(MoveProject, project);
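Note that the callback passed to QueueUserWorkItem receives its state as object, so MoveProject needs the signature void MoveProject(object state). A sketch of queuing several projects and waiting for all of them, with the actual move passed in as a hypothetical delegate:

```csharp
using System;
using System.Threading;

public static class ProjectMover
{
    // Queues one pool work item per project and blocks until all are done.
    // Returns the number of projects moved without an exception.
    public static int MoveAll(string[] projects, Action<string> moveProject)
    {
        int moved = 0;

        using (var pending = new CountdownEvent(projects.Length))
        {
            foreach (var project in projects)
            {
                ThreadPool.QueueUserWorkItem(state =>
                {
                    try
                    {
                        moveProject((string)state);        // state arrives as object
                        Interlocked.Increment(ref moved);
                    }
                    catch (Exception)
                    {
                        // An unhandled exception on a pool thread would end the
                        // process; a real app would log the failure here.
                    }
                    finally { pending.Signal(); }
                }, project);
            }

            pending.Wait();
        }

        return moved;
    }
}
```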
Agreed with everyone over the limited performance of running the tasks in parallel.
If you have full control over your deployment environment, you could use Rhino Queues:
This will allow you to produce a queue of jobs asynchronously (say from a WCF service being called from your Silverlight/Flex app) and consume them synchronously.
Alternatively you could use WCF and MSMQ, but the learning curve is greater.
When dealing with multiple files, using multiple threads usually IS a good idea as far as performance is concerned. The main reason is that most disks nowadays support native command queuing.
I wrote an article recently about reading/writing files with multiple files on
Also see related question
Will using multiple threads with a RandomAccessFile help performance?
In particular, my experience is that when dealing with very many files it IS a good idea to use a number of threads. On the contrary, using many threads in many cases does not slow applications down as much as commonly expected.
Having said that, I'd say there is no other way to find out than trying the different approaches. It depends on very many conditions: hardware, OS, drivers etc.
The very first thing you should do is point any kind of profiling tool towards your software. If you can't do that (like, if you haven't got such a tool), insert logging code.
The very first thing you need to do is figure out what is taking a long time to complete, and then why is it taking a long time to complete. That your "copy" operation as a whole takes a long time to complete isn't good enough, you need to pinpoint the reason for this down to a method or a set of methods.
Until you do that, all the other things you can do to your code will likely be guesswork. My experience has taught me that when it comes to performance, 9 out of 10 reasons for things running slowly come as surprises to the guy(s) that wrote the code.
So measure first, then change.
For instance, you might discover that you're in fact reporting progress of copying the file on a byte-per-byte basis, to a GUI, using a synchronous call to the UI, in which case it wouldn't matter how fast the actual copying can run, you'll still be bound by message handling speed.
But that's just conjecture until you know, so measure first, then change.
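If no profiler is at hand, the "insert logging code" route can be as simple as wrapping each suspect step in a Stopwatch. A minimal sketch, with the helper name Timing.Measure invented for the example:

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Runs one step, logs how long it took, and returns the elapsed time.
    public static TimeSpan Measure(string label, Action step)
    {
        var sw = Stopwatch.StartNew();
        step();
        sw.Stop();
        Console.WriteLine("{0}: {1} ms", label, sw.ElapsedMilliseconds);
        return sw.Elapsed;
    }
}
```

Wrapping the read, the transfer and the progress reporting separately, for example Timing.Measure("copy files", () => CopyFiles(project)) where CopyFiles is your own method, shows quickly which part the 5-10 minutes actually go to.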
