Parallel.ForEach and Parallel Execution in C#

I have a task of scanning, in parallel, all folders whose names start with "xyz". I mean that while one folder is being scanned, the others should be scanned at the same time; I don't want them scanned one by one.
For that I used Parallel.ForEach.
My questions:
Is this correct or not? And how can I tell that it is actually running in parallel (e.g., by printing a message somewhere)?
Parallel.ForEach(path, currentPath =>
{
    var output = programReader.GetData(currentPath, durReader.dirPattern);
    foreach (var item in output)
    {
        foreach (var fileName in item.Name)
            Console.WriteLine(item.serverName + " " + item.serverNumber + " " + fileName);
    }
});
EDIT:
Does Parallel.ForEach only work on multicore systems, or can it also run on a single-core system and still show parallelism?

First - if you ask a question, make it a question. "!!" is a good indication that it is not a question.
Second, your approach makes little sense. Parallel will go VERY parallel. Good. Bad: you still have only one disk. Result: tasks waiting. It makes no sense to parallelize I/O operations beyond the degree supported by the hardware.
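If you do keep Parallel.ForEach, a minimal sketch of capping the degree of parallelism looks like this (path, programReader and durReader are the asker's names from the question; the limit of 2 is a guess to tune by measuring):
// Cap the number of concurrent folder scans so the disk isn't flooded
// with more requests than it can usefully serve at once.
var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };
Parallel.ForEach(path, options, currentPath =>
{
    var output = programReader.GetData(currentPath, durReader.dirPattern);
    // process output here
});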

The Parallel extensions split the load per core - although I'm not sure if they take into account hyperthreaded cores.
But, to add another answer to steer you in the right direction:
Don't try and do this in parallel. The disk can only serve one request at a time, so you're likely to just slow everything down as the disk queue is just likely to get bigger.
The only scenario where you might be able to get increased performance is if the location you're scanning is actually a SAN where the storage is distributed and highly replicated.

You can print the Thread.CurrentThread.ManagedThreadId value to see which threads are being used.
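For example, a minimal sketch based on the asker's loop (requires using System.Threading;):
Parallel.ForEach(path, currentPath =>
{
    // Seeing several distinct thread IDs interleaved in the output
    // confirms the iterations really run in parallel.
    Console.WriteLine("Scanning " + currentPath + " on thread " +
        Thread.CurrentThread.ManagedThreadId);
});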

Parallel.ForEach and Parallel.Invoke use the cores of the processor, so if your processor has more than one core, they will all be used to run the work.

Related

How ASP.NET MVC 5 can do fast processing in C# Async

How can I run the following code in the fastest way? What is the best practice?
public ActionResult ExampleAction()
{
    // 200K items
    var results = dbContext.Results.ToList();
    foreach (var result in results)
    {
        // 10 - 40 items
        result.Kazanim = JsonConvert.SerializeObject(
            dbContext.SubTables // 2.5M items
                .Where(x => x.FooId == result.FooId)
                .Select(select => new
                {
                    BarId = select.BarId,
                    State = select.State,
                }).ToList());
        dbContext.Entry(result).State = EntityState.Modified;
        dbContext.SaveChanges();
    }
    return Json(true, JsonRequestBehavior.AllowGet);
}
This process takes an average of 500 ms when run synchronously. I have about 2M records, and the process runs 200K times.
How should I code this asynchronously?
How can I make it faster and easier with an async method?
Here are two suggestions that can improve the performance by multiple orders of magnitude:
Do the work in batches:
Make the client send a page of data to process; and/or
In the web server code, add items to a queue and process them separately.
Use SQL instead of EF (a sketch follows below):
Write efficient SQL directly; and/or
Use a stored procedure to do the work inside the db rather than moving data between the db and the code.
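As a sketch of the set-based idea, SQL Server 2016+ can build the JSON and update every row in one statement, so no data travels to the application at all (table and column names are guesses based on the posted code):
// One set-based UPDATE instead of 200K round trips; FOR JSON PATH
// produces the same array-of-objects shape as the LINQ projection.
dbContext.Database.ExecuteSqlCommand(@"
    UPDATE r
    SET r.Kazanim = (SELECT s.BarId, s.State
                     FROM SubTables s
                     WHERE s.FooId = r.FooId
                     FOR JSON PATH)
    FROM Results r");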
There's nothing you can do asynchronously with that code to improve its performance. But there's something that can certainly make it faster.
If you call dbContext.SaveChanges() inside the loop, EF will write back the changes to the database for every single entity as a separate transaction.
Move your dbContext.SaveChanges() after the loop. This way EF will write back all your changes at once, in one single transaction.
Always try to have as few calls to .SaveChanges() as possible. One call with 50 changes is much better, faster and more efficient than 50 calls for 1 change each.
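In the posted code that would look roughly like this (a sketch; BuildKazanimJson is a hypothetical helper holding the same JsonConvert.SerializeObject projection as in the question; for 200K entities, flushing in batches of a few thousand may work even better):
foreach (var result in results)
{
    // Same projection as before, moved into a hypothetical helper.
    result.Kazanim = BuildKazanimJson(result);
    dbContext.Entry(result).State = EntityState.Modified;
}
dbContext.SaveChanges(); // one call, one transaction for all tracked changes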
Hi, and welcome.
There's quite a lot I see incorrect in terms of asynchronicity, but I guess it only matters if there are concurrent users calling your server. This has to do with scalability and the thread pool in charge of spinning up threads to take care of your incoming HTTP requests.
You see, if you occupy a thread pool thread for a long time, that thread will not contribute to dequeueing incoming HTTP requests. This pretty much puts you in a position where you can spin up a maximum of around 2 new thread pool threads per second. If your incoming HTTP request rate is faster than the pool's ability to produce threads, all of your HTTP requests will start seeing increased response times (slowness).
So as a general rule, when doing I/O intensive work, always go async. There are asynchronous versions of most (or all) of the materializing methods like .ToList(): ToListAsync(), CountAsync(), AnyAsync(), etc. There is also a SaveChangesAsync(). First thing I would do is use these under normal circumstances. Yours don't seem to be, so I mentioned this for completeness only.
I think that you must, at the very least, run this heavy process outside the thread pool. Use Task.Factory.StartNew() with TaskCreationOptions.LongRunning, but run synchronous code inside so you don't fall into the trap of awaiting the returned task in vain.
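A skeleton of that (ProcessAllResults is a hypothetical method containing the work from the question; requires System.Threading and System.Threading.Tasks):
// Run the heavy, synchronous work on a dedicated thread so thread pool
// threads stay free to serve incoming HTTP requests.
var task = Task.Factory.StartNew(
    () => ProcessAllResults(),
    CancellationToken.None,
    TaskCreationOptions.LongRunning, // hint: use a dedicated (non-pool) thread
    TaskScheduler.Default);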
Now, all that just to have a "proper" skeleton. We haven't really talked about how to make this run faster. Let's do that.
Personally, I think you need some benchmarking between different methods. It looks like you have benchmarked this code. Now listen to @tymtam and see if a stored procedure version runs faster. My hunch, just like @tymtam's, is that it will definitely be faster.
If for whatever reason you insist on running this with C#, I would parallelize the work. The problem with this is Entity Framework. As per usual, my very popular, yet unfriendly ORM is giving us a big "but". EF's DB context works with a single connection and disallows multiple simultaneous queries, so you cannot parallelize this with EF. I would then move to my good, amazing friend, Dapper. Using Dapper, you could divide the workload across threads; each thread would open an independent DB connection and, through that connection, take care of a portion of the 200K result set you obtain at the beginning.
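A rough sketch of that idea (BarState is a hypothetical DTO and connectionString a placeholder; requires the Dapper package and System.Collections.Concurrent for Partitioner):
// Partition the 200K results into ranges; each range gets its own worker
// and its own SqlConnection, which EF's single DbContext cannot offer.
Parallel.ForEach(Partitioner.Create(0, results.Count), range =>
{
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();
        for (int i = range.Item1; i < range.Item2; i++)
        {
            var rows = conn.Query<BarState>(
                "SELECT BarId, State FROM SubTables WHERE FooId = @FooId",
                new { results[i].FooId });
            results[i].Kazanim = JsonConvert.SerializeObject(rows);
        }
    }
});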
Thanks for the valuable information you provided.
I decided to use Hangfire in line with your suggestions.
I used it with Hangfire's in-memory storage. I prepared a function that enqueues a Hangfire job for each item in the foreach. Before starting the foreach I fetch the relevant values, and I pass my function the parameters it needs to calculate and save to the database. I won't drag it out.
A job that took 30 minutes on average fell to 3 minutes with Hangfire. Maybe it's still not ideal, but it works for me for now. Instead of making the user wait, I can show the action as currently in progress, and I finish by notifying the user that the job completed successfully when the last worker ends.
I haven't used Dapper here for now, but I have used it on another project. It really has tremendous performance compared to Entity Framework.
Thanks again.

Unexpected Parallel.ForEach loop behavior

Hi, I am trying to mimic multithreading with a Parallel.ForEach loop. Below is my function:
public void PollOnServiceStart()
{
    constants = new ConstantsUtil();
    constants.InitializeConfiguration();
    HashSet<string> newFiles = new HashSet<string>();
    //string serviceName = MetadataDbContext.GetServiceName();
    var dequeuedItems = MetadataDbContext
        .UpdateOdfsServiceEntriesForProcessingOnStart();
    var handlers = Producer.GetParserHandlers(dequeuedItems);
    while (handlers.Any())
    {
        Parallel.ForEach(handlers,
            new ParallelOptions { MaxDegreeOfParallelism = 4 },
            handler =>
            {
                Logger.Info($"Started processing a file remaining in Parallel ForEach");
                handler.Execute();
                Logger.Info($"Enqueing one file for next process");
                dequeuedItems = MetadataDbContext
                    .UpdateOdfsServiceEntriesForProcessingOnPollInterval(1);
                handlers = Producer.GetParserHandlers(dequeuedItems);
            });
        int filesRemovedCount = Producer.RemoveTransferredFiles();
        Logger.Info($"{filesRemovedCount} files removed from {Constants.OUTPUT_FOLDER}");
    }
}
So to explain what's going on: the function UpdateOdfsServiceEntriesForProcessingOnStart() gets 4 file names (4 because of the parallel count) and adds them to a thread-safe object called ParserHandler. These objects are then put into the list var handlers.
My idea here is to loop through this handler list and call the handler.Execute().
Handler.Execute() copies files from the network location onto a local drive, parses through the file and creates multiple output files, then sends said files to a network location and updates a DB table.
What I expect in this Parallel.ForEach loop is that after the handler.Execute() call, the UpdateOdfsServiceEntriesForProcessingOnPollInterval(1) function will add a new file name (from the DB table it reads) to the dequeued-items container, which will then be passed as one item to the recreated handler list. This way, after one file is done executing, a new file takes its place in each parallel loop.
However, what happens is that while I do get a new file added, it doesn't get executed by the next available thread. Instead, the Parallel.ForEach has to finish executing the first 4 files, and only then does it pick up the very next file. Meaning, after the first 4 are run in parallel, only 1 file is run at a time, thereby nullifying the whole point of the parallel looping. The new files added before all 4 initial files finish the Execute() call are never executed by that pass of the loop.
I.e.:
(Start1, Start2, Start3, Start4) all at once. What should happen is something like (End2, Start5), then (End3, Start6). But what is happening is (End2, End3, End1, End4), then Start5, End5, Start6, End6.
Why is this happening?
Because we want to deploy multiple instances of this service app on one machine, it is not beneficial to have a giant list waiting in a queue. That is wasteful, as the other app instances won't be able to process anything.
I am writing what should be a long comment as an answer, although it's an awful answer because it doesn't answer the question.
Be aware that parallelizing filesystem operations is unlikely to make them faster, especially if the storage is a classic hard disk. The head of the disk cannot be in N places at the same moment, and if you tell it to do so, it will just waste most of its time traveling instead of reading or writing.
The best way to overcome the bottleneck imposed by accessing the filesystem is to make sure that there is work for the disk to do at all moments. Don't stop the disk's work to make a computation or to fetch/save data from/to the database. To make this happen you must have multiple workflows running concurrently: one workflow does all the I/O with the disk, another talks continuously with the database, a third utilizes the CPU by doing one calculation after the other, etc.
This approach is called task parallelism (doing heterogeneous work in parallel), as opposed to data parallelism (doing homogeneous work in parallel, the specialty of Parallel.ForEach). It is also called pipelining, because in order to make all workflows run concurrently you must place intermediate buffers between them, so you create a pipeline with the data flowing from buffer to buffer. Yet another term for this kind of operation is the producer-consumer pattern, which describes a short pipeline consisting of only two building blocks, the first being the producer and the second the consumer.
The most powerful tool currently available¹ for creating pipelines is the TPL Dataflow library. It offers a variety of "blocks" (pipeline segments) that can be linked with each other and can cover most scenarios. You instantiate the blocks that will compose your pipeline, configure them, tell each one what work it should do, link them together, feed the first block with the initial raw data to be processed, and finally await the Completion of the last block. You can look at an example of using the TPL Dataflow library here.
¹ Available as built-in library in the .NET platform. Powerful third-party tools also exist, like the Akka.NET for example.
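To make the shape of such a pipeline concrete, here is a minimal two-block sketch (Parse, SaveToDatabase, ParsedFile and inputFolder are placeholders, not names from the question; requires the System.Threading.Tasks.Dataflow package):
// Block 1 parses files on up to 4 threads; block 2 writes results to the
// database. The buffer between the blocks keeps both sides busy.
var parseBlock = new TransformBlock<string, ParsedFile>(
    path => Parse(path),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });
var saveBlock = new ActionBlock<ParsedFile>(parsed => SaveToDatabase(parsed));
parseBlock.LinkTo(saveBlock, new DataflowLinkOptions { PropagateCompletion = true });
foreach (var file in Directory.EnumerateFiles(inputFolder))
    parseBlock.Post(file);
parseBlock.Complete();
await saveBlock.Completion; // the pipeline is done when its last block is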

Parallel programming for Windows Service

I have a Windows Service that has code similar to the following:
List<Buyer> buyers = GetBuyers();
var results = new List<Result>();
Parallel.ForEach(buyers, buyer =>
{
    // do some prep work, log some data, etc.
    // call out to an external service that can take up to 15 seconds each to return
    // (note: List<T>.Add is not thread-safe; a ConcurrentBag<Result> would be safer here)
    results.Add(Bid(buyer));
});
// Parallel.ForEach must have completed by the time this code executes
foreach (var result in results)
{
    // do some work
}
This is all fine and good and it works, but I think we're suffering from a scalability issue. We average 20-30 inbound connections per minute, and each of those connections fires this code. The "buyers" collection for each of those inbound connections can have from 1-15 buyers in it. Occasionally our inbound connection count spikes to 100+ connections per minute and our server grinds to a halt.
CPU usage is only around 50% on each server (two load balanced 8 core servers) but the thread count continues to rise (spiking up to 350 threads on the process) and our response time for each inbound connection goes from 3-4 seconds to 1.5-2 minutes.
I suspect the above code is responsible for our scalability problems. Given this usage scenario (parallelism for I/O operations) on a Windows Service (no UI), is Parallel.ForEach the best approach? I don't have a lot of experience with async programming and am looking forward to using this opportunity to learn more about it, figured I'd start here to get some community advice to supplement what I've been able to find on Google.
Parallel.ForEach has a terrible design flaw. It is prone to consume all available thread-pool resources over time. The number of threads that it will spawn is literally unlimited. You can get up to 2 new ones per second, driven by heuristics that nobody understands. The CoreCLR has a hill climbing algorithm built into it that just doesn't work.
call out to an external service
Probably, you should find out the right degree of parallelism for calling that service. You need to find it by testing different amounts.
Then, you need to restrict Parallel.ForEach to spawn only as many threads as you want at a maximum. You can do that using a fixed-concurrency TaskScheduler.
Or, you change this to use async IO and SemaphoreSlim.WaitAsync. That way no threads are blocked; the pool exhaustion is solved, and so is the overloading of the external service.
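A sketch of the async variant (BidAsync is an assumed asynchronous counterpart of Bid, and the limit of 8 is a placeholder to tune by testing):
// Throttle concurrent calls to the external service without blocking
// any thread while waiting on I/O.
var throttle = new SemaphoreSlim(8);
var tasks = buyers.Select(async buyer =>
{
    await throttle.WaitAsync();
    try { return await BidAsync(buyer); }
    finally { throttle.Release(); }
});
Result[] results = await Task.WhenAll(tasks);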

c# Webcrawler optimization [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
I have a web crawler written in C#, and it uses multithreading. Right now it can download and parse about 1000 links/min, but when I run e.g. 3 instances at the same time, each instance can reach 1000 links/min, so I get 3000 links/min total. One instance uses up to 2% CPU, 100MB RAM, and 1% network bandwidth. I wonder whether a single instance can reach 3000 links/min or more, given that I have resources available (CPU, RAM, network)?
Structure of my code:
ThreadSafeFileBuffer<string> successWriter = new ThreadSafeFileBuffer<string>("ok.txt");
IEnumerable<string> lines = File.ReadLines("urls.txt");
var options = new ParallelOptions
{
    CancellationToken = _cts.Token,
    MaxDegreeOfParallelism = 500
};
Parallel.ForEach(lines, options, (line, loopState, idx) =>
{
    var crawler = new Crawler(line);
    var result = crawler.Go(); // download, parse
    successWriter.AddResult(result);
});
I have Windows 7, CPU i7, 16GB RAM, and an SSD disk.
The problem with using Parallel.ForEach on a list of URLs is that those lists often contain many URLs from the same site and you end up with multiple concurrent requests to the same site. Some sites frown on that and will block you or insert artificial delays.
1,000 requests per minute works out to 16 or 17 requests per second, which is pretty much the limit of what you can do without resorting to extraordinary measures. A large part of the problem is DNS resolution, which can take a surprisingly long time. In addition, the default .NET ServicePointManager limits you to 2 concurrent requests on any given site. If you want to support more than that, you need to change the ServicePointManager.DefaultConnectionLimit property.
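For example, as a one-time setup at application start (the 20 here is an arbitrary starting point to tune):
// Raise the default limit of 2 concurrent HTTP connections per host.
System.Net.ServicePointManager.DefaultConnectionLimit = 20;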
You definitely don't want to add hundreds of threads. I did that once. It's painful. What you need is a handful of threads that can make asynchronous requests very quickly. My testing shows that a single thread can't sustain more than 15 requests per second because HttpWebRequest.BeginGetResponse does a lot of synchronous work before going asynchronous. As the documentation states:
The BeginGetResponse method requires some synchronous setup tasks to complete (DNS resolution, proxy detection, and TCP socket connection, for example) before this method becomes asynchronous.
You can speed that up somewhat by increasing the size of your DNS client cache and by having a local DNS cache on a separate machine, but there's a limit to what you can achieve there.
I don't know how much crawling you're doing. If you're doing a lot, then you need to implement a politeness policy that takes into account the robots.txt file, limits how often it hits a particular site, limits the types of URLs it downloads (no use downloading an MP3 or .doc file if you can't do anything with it, for example), etc. To prevent it from being blocked, your crawler becomes, at its core, a politeness policy enforcer that just happens to download web pages.
I started writing about this some time back, but then didn't finish (other projects took precedence). See http://blog.mischel.com/2011/12/13/writing-a-web-crawler-introduction/ for the first post and links to the other posts in the thread. Also see http://blog.mischel.com/2011/12/26/writing-a-web-crawler-queue-management-part-1/. It's something I've been wanting to get back to, but after almost two years I still haven't managed it.
You'll also run into proxy problems, URL filtering problems (here and here), weird redirects, and asynchronous calls that aren't completely asynchronous.
You do not need more threads, as those threads all spend their time waiting. You need an asynchronous program that doesn't block threads waiting for web replies.
The problem with threads is that they are a rather expensive resource, because of the memory required for their stack and the work they create for the OS thread scheduler. In your program, this scheduler keeps switching threads so that they can all take turns waiting. But they're not doing anything useful.
In a web crawler you're going to be spending most of your time waiting for web requests, so if you have blocking I/O your program is not going to be processing at full speed; async IO also won't help if the program sits idle waiting for a callback. It sounds like you just need to add more threads to your main app and process in parallel.
But it's hard to tell, since you've not posted any code.
Yes, it is. Find out where your bottleneck is and improve on the performance.
Edit:
If you are using Parallel.ForEach, you can try an overload using the ParallelOptions parameter. Setting the MaxDegreeOfParallelism property might help.
Actually, the number of links/min is directly proportional to the number of crawler threads running at the same time.
In your first case you have 3 processes with n threads each (3n threads in total).
Try running 3n threads in one process.
Actually this also depends on your operating system and CPU, because old versions of Windows (like XP) don't support parallel multithreading across different CPU cores.
Parallelism with the TPL is a bad design for a web crawler. A Parallel.ForEach() loop starts only a bunch of requests (5-50), because it is designed to perform time-consuming computations in parallel, not to perform thousands of requests in parallel that individually do almost nothing. To get the data you want, you must be able to perform a really large number (10000+) of requests in parallel. Async operations are the key to that.
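As a sketch of that async approach (using HttpClient rather than the Crawler class from the question; in practice you would still throttle per host to stay polite):
// A handful of threads can keep thousands of requests in flight when the
// waiting happens in async I/O instead of blocked threads.
var http = new HttpClient();
string[] pages = await Task.WhenAll(urls.Select(async url =>
{
    try { return await http.GetStringAsync(url); }
    catch (HttpRequestException) { return null; } // skip failed downloads
}));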
I have developed the crawler engine of the Crawler-Lib Framework. It is a workflow-enabled crawler which can easily be extended to do any kind of requests, or even any processing, you want to have.
It is designed to give you high throughput out of the box.
Here is the engine: http://www.crawler-lib.net/crawler-lib-engine
Here are some YouTube videos showing how the Crawler-Lib engine works: http://www.youtube.com/user/CrawlerLib
I know this project is not open source, but there is a free version.

A few questions about parallelization in c#

I am writing a heavy web scraper in C#. I want it to be fast and reliable.
Parallel.ForEach and Parallel.For are way too slow for this.
For the input I am using a list of URLs. I want to have up to 300 threads working at the exact same time (my CPU and net connection can handle this). What would be the best way to do this? Would using tasks work better for this?
Sometimes the threads end for no apparent reason and some of the results don't get saved. I want a more reliable way of doing this. Any ideas?
I want a more solid, queue-based kind of scraping.
What I came up with (not all code but the important parts):
List<string> input = File.ReadLines("input.txt").ToList(); // read text file (path is a placeholder)
int total = input.Count;
int maxThreads = 300;
while (true)
{
    if (activeThreads < maxThreads)
    {
        current++;
        Thread thread = new Thread(() => CrawlWebsite(input[current]));
        thread.Start();
    }
}

public static void CrawlWebsite(string word)
{
    activeThreads++;
    // scraping part
    activeThreads--;
}
Consider using System.Threading.ThreadPool. It could be a little faster for your scenario with many threads, and you don't need to manage activeThreads yourself. Instead you can use ThreadPool.SetMaxThreads() and SetMinThreads(), and the ThreadPool manages the number of parallel threads for you.
BTW, your example is missing synchronization of the shared variables. One way to synchronize access is using "lock" - see http://msdn.microsoft.com/en-us/library/c5kehkcz.aspx
Also, your thread-run method CrawlWebsite() should handle ThreadAbortException - see http://msdn.microsoft.com/en-us/library/system.threading.threadabortexception.aspx.
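For the shared activeThreads counter in the question, Interlocked is the lightest fix (a sketch, assuming activeThreads is a static int field; requires System.Threading):
public static void CrawlWebsite(string word)
{
    // Atomic increment/decrement instead of racy ++/-- on a shared field.
    Interlocked.Increment(ref activeThreads);
    try
    {
        // scraping part
    }
    finally
    {
        Interlocked.Decrement(ref activeThreads); // runs even if scraping throws
    }
}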
I was recently working on a very similar problem and I don't think that using any high number of threads will make it faster. The slowest thing is usually downloading the data. Having a huge number of threads does not make it faster, because they mostly end up waiting for network connections, data transfer, etc. So I ended up with two queues. One is handled by a small number of threads that just send async download requests (10-15 requests at a time). The responses are stored into another queue, which goes into another thread pool that takes care of parsing and data processing (the number of threads here depends on your CPU and processing algorithm).
I also save all downloaded data to a database. Any time I want to implement parsing of some new information from the web, I don't need to re-download the content; I only parse the cached pages from the DB (this saves a lot of time).
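That two-queue layout maps naturally onto BlockingCollection (a sketch; Download and Parse are placeholders for the actual download and processing code; requires System.Collections.Concurrent):
var urls = new BlockingCollection<string>();
var responses = new BlockingCollection<string>();
// A small pool of downloader threads feeds the response queue...
var downloaders = Enumerable.Range(0, 10).Select(_ => Task.Run(() =>
{
    foreach (var url in urls.GetConsumingEnumerable())
        responses.Add(Download(url));
})).ToArray();
// ...and CPU-bound parsers drain it independently.
var parsers = Enumerable.Range(0, Environment.ProcessorCount).Select(_ => Task.Run(() =>
{
    foreach (var html in responses.GetConsumingEnumerable())
        Parse(html);
})).ToArray();
// Producer side: call urls.Add(...) per URL, then urls.CompleteAdding();
// once all downloaders finish, call responses.CompleteAdding().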
