I am writing a heavy web scraper in C#. I want it to be fast and reliable.
Parallel.ForEach and Parallel.For are way too slow for this.
For the input I am using a list of URLs. I want to have up to 300 threads working at the exact same time (my CPU and net connection can handle this). What would be the best way to do this? Would using tasks work better for this?
Sometimes the threads end for no apparent reason and some of the results don't get saved. I want a more reliable way of doing this. Any ideas?
I want a more solid, queue-based way of scraping.
What I came up with (not all code but the important parts):
List<string> words = ...; // read from the text file
int total = words.Count;
int maxThreads = 300;

while (true)
{
    if (activeThreads < maxThreads)
    {
        current++;
        Thread thread = new Thread(() => CrawlWebsite(words[current]));
        thread.Start();
    }
}

public static void CrawlWebsite(string word)
{
    activeThreads++;
    // scraping part
    activeThreads--;
}
Consider using System.Threading.ThreadPool. It could be a little faster for your scenario with many threads, and you don't need to manage activeThreads yourself. Instead you can use ThreadPool.SetMaxThreads() and SetMinThreads(), and the ThreadPool manages the number of parallel threads for you.
BTW, there is missing synchronization of the shared variables in your example (activeThreads and current). One way to synchronize access is the "lock" statement - see http://msdn.microsoft.com/en-us/library/c5kehkcz.aspx
Also, your thread-run method CrawlWebsite() should handle ThreadAbortException - see http://msdn.microsoft.com/en-us/library/system.threading.threadabortexception.aspx.
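A minimal sketch of that shape, assuming the same CrawlWebsite method; the CountdownEvent and the completed counter are illustrative additions, not part of the original code:

using System;
using System.Collections.Generic;
using System.Threading;

class Crawler
{
    static readonly object sync = new object();   // guards the shared counter
    static int completed = 0;

    static void Main()
    {
        List<string> words = new List<string>();  // filled from the text file

        // Let the pool manage concurrency instead of counting threads by hand.
        ThreadPool.SetMinThreads(50, 50);
        ThreadPool.SetMaxThreads(300, 300);

        using (var done = new CountdownEvent(words.Count))
        {
            foreach (string word in words)
            {
                ThreadPool.QueueUserWorkItem(state =>
                {
                    try
                    {
                        CrawlWebsite((string)state);
                    }
                    finally
                    {
                        lock (sync) { completed++; }   // synchronized access to shared state
                        done.Signal();
                    }
                }, word);
            }
            done.Wait();   // block until every queued item has finished
        }
    }

    static void CrawlWebsite(string word)
    {
        // scraping part
    }
}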
I was recently working on a very similar problem and I don't think that using any high number of threads will make it faster. The slowest part is usually downloading the data. Having a huge number of threads does not make that faster, because mostly they are waiting for network connections, data transfer, etc. So I ended up having two queues. One is handled by a small number of threads that just send async download requests (10-15 requests at a time). The responses are stored in another queue that goes to another thread pool that takes care of parsing and data processing (the number of threads here depends on your CPU and processing algorithm).
I also save all downloaded data to a database. Anytime I want to implement parsing of some new information from the web, I don't need to re-download the content, but only parse the cached pages from the DB (this saves a lot of time).
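A rough sketch of that two-queue layout, assuming HttpClient for the downloads; the URLs, capacities and parser body are placeholders:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class TwoQueueScraper
{
    static async Task Main()
    {
        // Placeholder input; the real list comes from the text file.
        var urls = new List<string> { "http://example.com/1", "http://example.com/2" };

        // Second queue: downloaded pages waiting to be parsed. Bounded so the
        // downloaders slow down if parsing falls behind.
        var responses = new BlockingCollection<string>(boundedCapacity: 100);

        var http = new HttpClient();
        var gate = new SemaphoreSlim(15);   // roughly 10-15 requests in flight at once

        // Download stage: async requests, no thread blocked while waiting on the network.
        var downloader = Task.Run(async () =>
        {
            var requests = urls.Select(async url =>
            {
                await gate.WaitAsync();
                try { responses.Add(await http.GetStringAsync(url)); }
                catch (HttpRequestException) { /* log and skip failed downloads */ }
                finally { gate.Release(); }
            });
            await Task.WhenAll(requests);
            responses.CompleteAdding();   // signal the parsers that no more pages will arrive
        });

        // Processing stage: a few CPU-bound workers parsing the downloaded pages.
        var parsers = Enumerable.Range(0, Environment.ProcessorCount)
            .Select(_ => Task.Run(() =>
            {
                foreach (var html in responses.GetConsumingEnumerable())
                {
                    // parse, and persist the raw page so it can be re-parsed later without re-downloading
                }
            }))
            .ToArray();

        await downloader;
        await Task.WhenAll(parsers);
    }
}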
How can I run the following code in the fastest way? What is the best practice?
public ActionResult ExampleAction()
{
    // 200K items
    var results = dbContext.Results.ToList();

    foreach (var result in results)
    {
        // 10 - 40 items
        result.Kazanim = JsonConvert.SerializeObject(
            dbContext.SubTables // 2.5M items
                .Where(x => x.FooId == result.FooId)
                .Select(select => new
                {
                    BarId = select.BarId,
                    State = select.State,
                }).ToList());

        dbContext.Entry(result).State = EntityState.Modified;
        dbContext.SaveChanges();
    }

    return Json(true, JsonRequestBehavior.AllowGet);
}
This process takes an average of 500 ms per iteration when run synchronously. I have about 2M records, and the process runs 200K times.
How should I code this asynchronously?
How can I make it faster and easier with an async method?
Here are two suggestions that can improve the performance multiple orders of magnitude:
Do work in batches:
Make the client send a page of data to process; and/or
In the web server code add items to a queue and process them separately.
Use SQL instead of EF:
Write efficient SQL; and/or
Use a stored procedure to do the work inside the DB rather than moving data between the DB and the code.
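As a hedged sketch of the second suggestion, the whole update could be pushed into the database in one statement (SQL Server 2016+ FOR JSON; the table and column names simply mirror the EF code and may need adjusting to the real schema):

public ActionResult ExampleAction()
{
    // One raw SQL statement instead of 200K round trips through EF.
    dbContext.Database.ExecuteSqlCommand(@"
        UPDATE r
        SET r.Kazanim =
            (SELECT s.BarId, s.State
             FROM SubTables AS s
             WHERE s.FooId = r.FooId
             FOR JSON PATH)
        FROM Results AS r;");

    return Json(true, JsonRequestBehavior.AllowGet);
}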
There's nothing you can do asynchronously with that code that will improve its performance. But there's something that can certainly make it faster.
If you call dbContext.SaveChanges() inside the loop, EF will write back the changes to the database for every single entity as a separate transaction.
Move your dbContext.SaveChanges() after the loop. This way EF will write back all your changes at once, in one single transaction.
Always try to have as few calls to .SaveChanges() as possible. One call with 50 changes is much better, faster and more efficient than 50 calls of 1 change each.
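In sketch form (with the serialization body elided):

foreach (var result in results)
{
    result.Kazanim = JsonConvert.SerializeObject(/* ... same query as before ... */);
    dbContext.Entry(result).State = EntityState.Modified;
}

// One round trip and one transaction for all modified entities
dbContext.SaveChanges();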
Hi, and welcome.
There's quite a lot I see incorrect in terms of asynchronicity, but I guess it only matters if there are concurrent users calling your server. This has to do with scalability and the thread pool in charge of spinning up threads to take care of your incoming HTTP requests.
You see, if you occupy a thread pool thread for a long time, that thread will not contribute to dequeueing incoming HTTP requests. This pretty much puts you in a position where you can spin up a maximum of around 2 new thread pool threads per second. If your incoming HTTP request rate is faster than the pool's ability to produce threads, all of your HTTP requests will start seeing increased response times (slowness).
So as a general rule, when doing I/O intensive work, always go async. There are asynchronous versions of most (or all) of the materializing methods like .ToList(): ToListAsync(), CountAsync(), AnyAsync(), etc. There is also a SaveChangesAsync(). First thing I would do is use these under normal circumstances. Yours don't seem to be, so I mentioned this for completeness only.
I think that you must, at the very least, run this heavy process outside the thread pool. Use Task.Factory.StartNew() with TaskCreationOptions.LongRunning, but run synchronous code inside it so you don't fall into the trap of awaiting the returned task in vain.
Now, all that just to have a "proper" skeleton. We haven't really talked about how to make this run faster. Let's do that.
Personally, I think you need some benchmarking between different methods. It looks like you have already benchmarked this code. Now listen to @tymtam and see if a stored procedure version runs faster. My hunch, just like @tymtam's, is that it definitely will be.
If for whatever reason you insist on running this in C#, I would parallelize the work. The problem with this is Entity Framework. As per usual, our very popular yet unfriendly ORM gives us a big "but": EF's DB context works with a single connection and disallows multiple simultaneous queries, so you cannot parallelize this with EF. I would then move to my good, amazing friend, Dapper. Using Dapper, you could divide the workload across threads; each thread would open an independent DB connection and, through that connection, take care of a portion of the 200K result set you obtain at the beginning.
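A rough sketch of what that could look like with Dapper; the connection string, SQL text and DTO shapes below are placeholders for the real schema, not a drop-in implementation:

using System.Collections.Generic;
using System.Data.SqlClient;
using System.Linq;
using System.Threading.Tasks;
using Dapper;
using Newtonsoft.Json;

public static class KazanimUpdater
{
    // Splits the 200K results into chunks; each worker opens its own connection.
    public static void UpdateInParallel(IReadOnlyList<ResultDto> results, string connectionString)
    {
        const int workers = 8;   // tune against the DB server
        var chunks = results
            .Select((r, i) => new { r, i })
            .GroupBy(x => x.i % workers, x => x.r)
            .ToList();

        Parallel.ForEach(chunks, chunk =>
        {
            using (var conn = new SqlConnection(connectionString))   // one independent connection per thread
            {
                conn.Open();
                foreach (var result in chunk)
                {
                    var subs = conn.Query<SubDto>(
                        "SELECT BarId, State FROM SubTables WHERE FooId = @FooId",
                        new { result.FooId });

                    conn.Execute(
                        "UPDATE Results SET Kazanim = @Json WHERE FooId = @FooId",
                        new { Json = JsonConvert.SerializeObject(subs), result.FooId });
                }
            }
        });
    }

    public class ResultDto { public int FooId { get; set; } }
    public class SubDto { public int BarId { get; set; } public int State { get; set; } }
}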
Thanks for the valuable information you provided.
I decided to use Hangfire in line with your suggestions.
I used it with Hangfire's in-memory storage. I prepared a function that enqueues each item onto the Hangfire queue inside the foreach. After gathering the relevant values before the foreach starts, I pass the function the parameters it needs to calculate and save to the database. I won't drag this out.
A job that took 30 minutes on average fell to 3 minutes with Hangfire. Maybe it's still not ideal, but it works for me for now. Instead of making the user wait, I can show the action as currently in progress, and once the last job finishes I notify the user that the work completed successfully.
I haven't used Dapper here for now, but I have used it on another project. It really has tremendous performance compared to Entity Framework.
Thanks again.
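A rough sketch of the shape of that change; ProcessResult is a stand-in for the real job method, and the Hangfire server and in-memory storage are assumed to be configured at application startup:

public ActionResult ExampleAction()
{
    var ids = dbContext.Results.Select(r => r.FooId).ToList();

    foreach (var id in ids)
    {
        // Each item becomes an independent background job handled by Hangfire's workers.
        BackgroundJob.Enqueue(() => ProcessResult(id));
    }

    // Return immediately and show the user that the work is in progress.
    return Json(new { status = "queued" }, JsonRequestBehavior.AllowGet);
}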
Hi, I am trying to mimic multi-threading with a Parallel.ForEach loop. Below is my function:
public void PollOnServiceStart()
{
    constants = new ConstantsUtil();
    constants.InitializeConfiguration();

    HashSet<string> newFiles = new HashSet<string>();

    //string serviceName = MetadataDbContext.GetServiceName();

    var dequeuedItems = MetadataDbContext
        .UpdateOdfsServiceEntriesForProcessingOnStart();
    var handlers = Producer.GetParserHandlers(dequeuedItems);

    while (handlers.Any())
    {
        Parallel.ForEach(handlers,
            new ParallelOptions { MaxDegreeOfParallelism = 4 },
            handler =>
            {
                Logger.Info($"Started processing a file remaining in Parallel ForEach");
                handler.Execute();
                Logger.Info($"Enqueuing one file for next process");
                dequeuedItems = MetadataDbContext
                    .UpdateOdfsServiceEntriesForProcessingOnPollInterval(1);
                handlers = Producer.GetParserHandlers(dequeuedItems);
            });

        int filesRemovedCount = Producer.RemoveTransferredFiles();
        Logger.Info($"{filesRemovedCount} files removed from {Constants.OUTPUT_FOLDER}");
    }
}
So to explain what's going on: the function UpdateOdfsServiceEntriesForProcessingOnStart() gets 4 file names (4 because of the parallel count) and adds them to a thread-safe object called ParserHandler. These objects are then put into the list var handlers.
My idea here is to loop through this handler list and call handler.Execute().
Handler.Execute() copies files from the network location onto a local drive, parses the file and creates multiple output files, then sends those files to a network location and updates a DB table.
What I expect in this Parallel.ForEach loop is that after the Handler.Execute() call, the UpdateOdfsServiceEntriesForProcessingOnPollInterval(1) function will add a new file name from the DB table it reads to the dequeued-items container, which will then be passed as one item to the recreated handler list. In this way, after one file is done executing, a new file will take its place in each parallel loop.
However, what happens is that while I do get a new file added, it doesn't get executed by the next available thread. Instead, the Parallel.ForEach has to finish executing the first 4 files, and only then does it pick up the very next file. Meaning, after the first 4 run in parallel, only 1 file runs at a time, thereby nullifying the whole point of the parallel loop. The files added before all 4 of the initial files finish their Execute() call are never executed by that loop.
I.e.:
(Start1, Start2, Start3, Start4) all at once. What should happen is something like (End2, Start5), and then (End3, Start6). But what is happening is (End2, End3, End1, End4), then Start5, End5, Start6, End6.
Why is this happening?
Because we want to deploy multiple instances of this service app on a machine, it is not beneficial to have a giant list waiting in a queue. This is wasteful, as the other app instances won't be able to process anything.
I am writing what should be a long comment as an answer, although it's an awful answer because it doesn't answer the question.
Be aware that parallelizing filesystem operations is unlikely to make them faster, especially if the storage is a classic hard disk. The head of the disk cannot be in N places at the same moment, and if you tell it to do so, it will just waste most of its time traveling instead of reading or writing.
The best way to overcome the bottleneck imposed by accessing the filesystem is to make sure that there is work for the disk to do at all moments. Don't stop the disk's work to make a computation or to fetch/save data from/to the database. To make this happen you must have multiple workflows running concurrently. One workflow does entirely I/O with the disk, another workflow talks continuously with the database, a third workflow utilizes the CPU by doing one calculation after the other, etc. This approach is called task parallelism (doing heterogeneous work in parallel), as opposed to data parallelism (doing homogeneous work in parallel, the speciality of Parallel.ForEach). It is also called pipelining, because in order to make all workflows run concurrently you must place intermediate buffers between them, so you create a pipeline with the data flowing from buffer to buffer. Another term used for this kind of operation is the producer-consumer pattern, which describes a short pipeline consisting of only two building blocks, with the first being the producer and the second the consumer.
The most powerful tool currently available¹ to create pipelines is the TPL Dataflow library. It offers a variety of "blocks" (pipeline segments) that can be linked with each other and can cover most scenarios. What you do is instantiate the blocks that will compose your pipeline, configure them, tell each one what work it should do, link them together, feed the first block with the initial raw data that should be processed, and finally await the Completion of the last block. You can look at an example of using the TPL Dataflow library here.
¹ Available as a built-in library in the .NET platform. Powerful third-party tools also exist, like Akka.NET for example.
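For a flavour of such a pipeline, here is a minimal sketch; the input folder, ParseFile and SaveToDb are placeholders for the real handler logic:

using System;
using System.IO;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;   // NuGet: System.Threading.Tasks.Dataflow

class PipelineSketch
{
    static async Task Main()
    {
        var read = new TransformBlock<string, string>(
            path => File.ReadAllText(path),              // disk-bound stage, keep it serial
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1, BoundedCapacity = 8 });

        var parse = new TransformBlock<string, ParsedFile>(
            text => ParseFile(text),                     // CPU-bound stage
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount, BoundedCapacity = 8 });

        var save = new ActionBlock<ParsedFile>(
            parsed => SaveToDb(parsed),                  // database-bound stage
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1, BoundedCapacity = 8 });

        var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
        read.LinkTo(parse, linkOptions);
        parse.LinkTo(save, linkOptions);

        foreach (var path in Directory.EnumerateFiles(@"C:\input"))
            await read.SendAsync(path);                  // respects BoundedCapacity back-pressure

        read.Complete();
        await save.Completion;                           // wait for the whole pipeline to drain
    }

    class ParsedFile { }
    static ParsedFile ParseFile(string text) => new ParsedFile();
    static void SaveToDb(ParsedFile parsed) { }
}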
I have a Windows Service that has code similar to the following:
List<Buyer> buyers = GetBuyers();
var results = new List<Result>();

Parallel.ForEach(buyers, buyer =>
{
    // do some prep work, log some data, etc.
    // call out to an external service that can take up to 15 seconds each to return
    results.Add(Bid(buyer));
});

// Parallel foreach must have completed by the time this code executes
foreach (var result in results)
{
    // do some work
}
This is all fine and good and it works, but I think we're suffering from a scalability issue. We average 20-30 inbound connections per minute and each of those connections fire this code. The "buyers" collection for each of those inbound connections can have from 1-15 buyers in it. Occasionally our inbound connection count sees a spike to 100+ connections per minute and our server grinds to a halt.
CPU usage is only around 50% on each server (two load balanced 8 core servers) but the thread count continues to rise (spiking up to 350 threads on the process) and our response time for each inbound connection goes from 3-4 seconds to 1.5-2 minutes.
I suspect the above code is responsible for our scalability problems. Given this usage scenario (parallelism for I/O operations) on a Windows Service (no UI), is Parallel.ForEach the best approach? I don't have a lot of experience with async programming and am looking forward to using this opportunity to learn more about it, figured I'd start here to get some community advice to supplement what I've been able to find on Google.
Parallel.ForEach has a terrible design flaw: it is prone to consuming all available thread-pool resources over time. The number of threads that it will spawn is effectively unbounded. You can get up to 2 new ones per second, driven by heuristics that nobody understands. The CoreCLR has a hill-climbing algorithm built into it that just doesn't work here.
call out to an external service
Probably, you should find out what the right degree of parallelism is for calling that service. You need to find that out by testing different amounts.
Then, you need to restrict Parallel.ForEach to spawn only as many threads as you want at a maximum. You can do that using a fixed-concurrency TaskScheduler.
Or, you can change this to use async IO with SemaphoreSlim.WaitAsync. That way no threads are blocked; the pool exhaustion is solved by that, and so is the overloading of the external service.
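A rough sketch of that second option, assuming the question's Buyer and Result types and a hypothetical async counterpart of Bid (BidAsync here):

static async Task<List<Result>> BidOnAllAsync(List<Buyer> buyers)
{
    var throttle = new SemaphoreSlim(10);   // tune to what the external service tolerates
    var tasks = buyers.Select(async buyer =>
    {
        await throttle.WaitAsync();
        try
        {
            // do prep work, log data, etc.
            return await BidAsync(buyer);   // 15-second calls no longer block a thread each
        }
        finally
        {
            throttle.Release();
        }
    });
    return (await Task.WhenAll(tasks)).ToList();
}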
This is my first post here, so apologies if this isn't structured well.
We have been tasked to design a tool that will:
Read a file (of account IDs), CSV format
Download the account data file from the web for each account (by Id) (REST API)
Pass the file to a converter that will produce a report (financial predictions etc) [~20ms]
If the prediction threshold is within limits, run a parser to analyse the data [400ms]
Generate a report for the analysis above [80ms]
Upload all files generated to the web (REST API)
Now all those individual points are relatively easy to do. I'm interested in finding out how best to architect something to handle this, and to do it fast and efficiently on our hardware.
We have to process roughly 2 million accounts. The square brackets give an idea of how long each step takes on average. I'd like to use the maximum resources available on the machine - a 24-core Xeon. It's not a memory-intensive process.
Would using the TPL and creating each of these steps as a task be a good idea? Each step has to happen sequentially for a given account, but many accounts can be processed at once. Unfortunately the parsers are not multi-threading aware and we don't have the source (it's essentially a black box for us).
My thoughts were something like this - assuming we're using the TPL (a rough sketch in code follows the list):
Load account data (essentially a CSV import or SQL SELECT)
For each Account (Id):
Download the data file for each account
ContinueWith using the data file, send to the converter
ContinueWith check threshold, send to parser
ContinueWith Generate Report
ContinueWith Upload outputs
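Roughly, in code, something like the following; the helper methods and the WithinThreshold property are placeholders for the steps above, written with async/await rather than explicit ContinueWith:

var throttle = new SemaphoreSlim(24);   // cap concurrent accounts; tune to the hardware
var tasks = accountIds.Select(async id =>
{
    await throttle.WaitAsync();
    try
    {
        var dataFile = await DownloadAccountFileAsync(id);   // REST download
        var prediction = RunConverter(dataFile);             // ~20ms converter
        if (prediction.WithinThreshold)                      // threshold check
        {
            var analysis = RunParser(dataFile);              // ~400ms black-box parser
            var report = GenerateReport(analysis);           // ~80ms
            await UploadAsync(report);                       // REST upload
        }
    }
    finally
    {
        throttle.Release();
    }
});
await Task.WhenAll(tasks);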
Does that sound feasible or am I not understanding it correctly? Would it be better to break down the steps a different way?
I'm a bit unsure on how to handle issues with the parser throwing exceptions (it's very picky) or when we get failures uploading.
All this is going to be in a scheduled job that will run after-hours as a console application.
I would think about using some kind of message bus. That way you can separate the steps, and if one doesn't work (for example because the REST service isn't accessible for some time) you can store the message and process it later on.
Depending on what you use as a message bus, you can introduce threads with it.
In my opinion you can design workflows, handle exceptional states and so on much better if you have a higher-level abstraction like a service bus.
Also, because the parts can run independently, they don't block each other.
One easy way could be to use ServiceStack messaging with the Redis ServiceBus.
Some advantages quoted from there:
Message-based design allows for easier parallelization and introspection of computations
DLQ messages can be introspected, fixed and later replayed after server updates and rejoin normal message workflow
I think the easy way to start with multiple threads in your case would be to put the entire operation for each account ID on a thread (or better, on the ThreadPool). With the approach proposed below, I think you will not need to coordinate between threads.
Something like this to put the work on the thread pool queue:
var accountIds = new List<int>();
foreach (var accountId in accountIds)
{
    ThreadPool.QueueUserWorkItem(ProcessAccount, accountId);
}
And this is the function you will process each account:
public static void ProcessAccount(object accountId)
{
    // Download the data file for this account
    // ContinueWith using the data file, send to the converter
    // ContinueWith check threshold, send to parser
    // ContinueWith Generate Report
    // ContinueWith Upload outputs
}
I have a program that needs to get some data from an Atom feed. I tried two approaches and neither of them worked well.
I've used WebClient to synchronously download all the posts I need, but as there are a few thousand and the service is slow it takes many hours.
I've tried (for the first time) async/await, the new HttpClient and Task.WhenAll. Unfortunately that results in thousands of requests hitting the service and bringing it down.
How can I run say 100 requests in parallel?
You could use Parallel.ForEach with ParallelOptions.MaxDegreeOfParallelism
ParallelOptions.MaxDegreeOfParallelism Property
Or a BlockingCollection with a bounded collection size
BlockingCollection Overview
I would recommend the BlockingCollection
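A minimal sketch of the first option, keeping the synchronous WebClient approach from the question; the DownloadAll name and the Save call are placeholders:

static void DownloadAll(IEnumerable<string> postUrls)
{
    Parallel.ForEach(
        postUrls,
        new ParallelOptions { MaxDegreeOfParallelism = 100 },
        url =>
        {
            using (var client = new WebClient())
            {
                var xml = client.DownloadString(url);   // blocking download, at most 100 in flight
                Save(url, xml);                          // placeholder for whatever processing is needed
            }
        });
}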
Sounds like you have a solution already, in that you can get a lot done at once. I'd suggest just adding another layer on top of that which loops through all of the posts but only processes them 100 at a time.
Right now you might have: DownloadAll(List<Post> ListofPosts). Inside of DownloadAll you probably have a wait-all at the end.
instead:
For loop from 1 to ( ListofPosts Count / 100)
DownloadAll(ListofPosts.Skip(xxx).Take(100));
Obviously not real code there, but then you can do chunks of 100 with little change to your main function.
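A sketch of that chunking, reusing the existing DownloadAll and its internal wait-all; the batch size of 100 is the only new knob:

for (int i = 0; i < ListofPosts.Count; i += 100)
{
    var batch = ListofPosts.Skip(i).Take(100).ToList();
    DownloadAll(batch);   // the wait-all inside finishes this batch before the next one starts
}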