How can I run the code below as fast as possible? What is the best practice?
public ActionResult ExampleAction()
{
    // 200K items
    var results = dbContext.Results.ToList();
    foreach (var result in results)
    {
        // 10 - 40 items
        result.Kazanim = JsonConvert.SerializeObject(
            dbContext.SubTables // 2,5M items
                .Where(x => x.FooId == result.FooId)
                .Select(select => new
                {
                    BarId = select.BarId,
                    State = select.State,
                }).ToList());
        dbContext.Entry(result).State = EntityState.Modified;
        dbContext.SaveChanges();
    }
    return Json(true, JsonRequestBehavior.AllowGet);
}
This process takes an average of 500 ms as sync. I have about 2M records. The process is done 200K times.
How should I write this asynchronously?
Can an async method make it faster and simpler?
Here are two suggestions that can improve the performance by multiple orders of magnitude:
1) Do work in batches:
   a) Make the client send a page of data to process; and/or
   b) In the web server code, add items to a queue and process them separately.
2) Use SQL instead of EF:
   a) Write efficient SQL; and/or
   b) Use a stored procedure to do the work inside the database rather than moving data between the database and the code.
There's nothing you can do with that code asynchronously for improving its performance. But there's something that can certainly make it faster.
If you call dbContext.SaveChanges() inside the loop, EF will write back the changes to the database for every single entity as a separate transaction.
Move your dbContext.SaveChanges() call after the loop. This way EF will write back all your changes at once, in one single transaction.
Always try to have as few calls to .SaveChanges() as possible. One call with 50 changes is much better, faster and more efficient than 50 calls for 1 change each.
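To make the "fewer SaveChanges() calls" point concrete, here is a minimal in-memory sketch. FakeStore and BatchedSave are hypothetical names, FakeStore.SaveChanges() only stands in for dbContext.SaveChanges(), and flushing once per batch (rather than once after the whole loop) is my own assumption for very large result sets. It counts round trips only; it does not measure real database cost.

```csharp
using System;

// Hypothetical stand-in for a DbContext: it only counts how often changes are flushed.
public sealed class FakeStore
{
    private int _pending;

    public int SaveCalls { get; private set; }
    public int SavedItems { get; private set; }

    // Stands in for marking one entity as modified.
    public void MarkModified() => _pending++;

    // Stands in for dbContext.SaveChanges(): one call flushes everything pending.
    public void SaveChanges()
    {
        if (_pending == 0) return; // nothing to flush
        SaveCalls++;
        SavedItems += _pending;
        _pending = 0;
    }
}

public static class BatchedSave
{
    // Marks every item as modified, but flushes once per batch
    // instead of once per item.
    public static void Run(FakeStore store, int itemCount, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        for (int i = 0; i < itemCount; i++)
        {
            store.MarkModified();
            if ((i + 1) % batchSize == 0)
                store.SaveChanges();
        }

        store.SaveChanges(); // flush the final, possibly partial batch
    }
}
```

With a batch size at least as large as the item count this degenerates to the single SaveChanges() after the loop described above.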
Hello, and welcome.
There's quite a lot I see incorrect in terms of asynchronicity, but I guess it only matters if there are concurrent users calling your server. This has to do with scalability and the thread pool in charge of spinning up threads to take care of your incoming HTTP requests.
You see, if you occupy a thread pool thread for a long time, that thread will not contribute to dequeueing incoming HTTP requests. This pretty much puts you in a position where you can spin up a maximum of around 2 new thread pool threads per second. If your incoming HTTP request rate is faster than the pool's ability to produce threads, all of your HTTP requests will start seeing increased response times (slowness).
So as a general rule, when doing I/O intensive work, always go async. There are asynchronous versions of most (or all) of the materializing methods like .ToList(): ToListAsync(), CountAsync(), AnyAsync(), etc. There is also a SaveChangesAsync(). First thing I would do is use these under normal circumstances. Yours don't seem to be, so I mentioned this for completeness only.
I think that you must, at the very least, run this heavy process outside the thread pool. Use Task.Factory.StartNew() with the TaskCreationOptions.LongRunning but run synchronous code so you don't fall in the trap of awaiting the returned task in vain.
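A rough sketch of that skeleton, using only the BCL. LongRunningJob is a hypothetical helper name; the returned flag is there purely so you can check where the work ran. On current runtimes the default scheduler gives a LongRunning task its own dedicated thread, but treat that as an implementation detail rather than a guarantee.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class LongRunningJob
{
    // Starts synchronous work with the LongRunning hint so it does not tie up
    // a thread pool thread. The delegate is deliberately synchronous: an async
    // lambda would return at its first await and the hint would be wasted.
    public static Task<bool> Start(Action work)
    {
        return Task.Factory.StartNew(
            () =>
            {
                work();
                // Report whether the work ran on a thread pool thread.
                return Thread.CurrentThread.IsThreadPoolThread;
            },
            CancellationToken.None,
            TaskCreationOptions.LongRunning,
            TaskScheduler.Default);
    }
}
```

In the controller you would start the job and return immediately (for example with a "job started" response) instead of blocking the request on the returned task.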
Now, all that just to have a "proper" skeleton. We haven't really talked about how to make this run faster. Let's do that.
Personally, I think you need some benchmarking between different methods. It looks like you have benchmarked this code. Now listen to @tymtam and see if a stored procedure version runs faster. My hunch, just like @tymtam's, is that it will definitely be faster.
If for whatever reason you insist on running this in C#, I would parallelize the work. The problem with this is Entity Framework. As usual, my very popular yet unfriendly ORM is giving us a big "but": EF's DB context works with a single connection and disallows multiple simultaneous queries, so you cannot parallelize this with a single shared context. I would then move to my good, amazing friend, Dapper. Using Dapper, you could divide the workload among threads; each thread would open its own independent DB connection and, through that connection, take care of a portion of the 200K result set you obtain at the beginning.
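Here is a minimal sketch of that partitioning idea, with no Dapper dependency. PartitionedWork is a hypothetical helper; openConnection and process are placeholders you would fill with something like opening a SqlConnection and running the Dapper calls. The only point it makes is that each worker gets its own connection and its own slice of the list.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class PartitionedWork
{
    // Splits 'items' into up to 'workerCount' contiguous slices and processes
    // each slice on its own task. Every worker opens its own connection through
    // 'openConnection', so no connection is shared between threads.
    public static void Run<TItem, TConnection>(
        IReadOnlyList<TItem> items,
        int workerCount,
        Func<TConnection> openConnection,
        Action<TConnection, TItem> process)
        where TConnection : IDisposable
    {
        if (workerCount <= 0) throw new ArgumentOutOfRangeException(nameof(workerCount));

        int sliceSize = (items.Count + workerCount - 1) / workerCount; // ceiling division
        var workers = new List<Task>();

        for (int start = 0; start < items.Count; start += sliceSize)
        {
            int from = start; // copy for the closure
            int to = Math.Min(start + sliceSize, items.Count);

            workers.Add(Task.Run(() =>
            {
                using (TConnection connection = openConnection())
                {
                    for (int i = from; i < to; i++)
                        process(connection, items[i]);
                }
            }));
        }

        Task.WaitAll(workers.ToArray());
    }
}
```

How many workers actually help is something to measure; past some point the database becomes the bottleneck and more connections only add contention.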
Thanks for the valuable information you provided.
I decided to use hangfire in line with your suggestions.
I used Hangfire with its in-memory storage. Inside the foreach I enqueue a function as a Hangfire job. Before the foreach starts I fetch the values I need, and the function takes them as parameters, does the calculation, and saves the result to the database. I won't go into more detail.
A job that took 30 minutes on average dropped to 3 minutes with Hangfire. Maybe it's still not ideal, but it works for me for now. Instead of making the user wait, I can show the action as currently in progress. When the last thread finishes, I end the process with a notification that the job has completed successfully.
I haven't used Dapper here yet, but I have used it for another task. It really performs tremendously well compared to Entity Framework.
Thanks again.
Related
We have an old ASP.NET asmx web service in our application that sometimes receives bulk requests. The service takes less than 5 seconds for a single request, but more than a minute when it receives 20 or more concurrent requests. It is implemented as follows:
1) It receives a request with input data from external clients.
2) After validation, it gets 20 possibilities from the database for the request, based on the input data.
3) It then iterates over all 20 possibilities with foreach and gets solutions either from another external service or from the database, depending on the possibility data. In the old implementation we used Parallel.ForEach to perform all 20 calls (service calls or DB calls) in parallel to improve performance.
4) After that, the service sends all 20 solutions back to the client.
This old approach works fine for a few (1 or 2) requests, and the response time of the asmx service is very fast (less than 5 seconds), considering that the external service calls take 2-3 seconds. But it takes more than 60 seconds when there are more than 20 concurrent requests. According to expert analysis, the concurrent requests push CPU utilization to 100% and cause thread pool starvation, so requests queue up waiting for threads.
So we got a recommendation to replace the parallel extensions and make the whole service async/await end to end. I have implemented async/await end to end and also replaced Parallel.ForEach with Task.WhenAll from the TPL. But the response time has increased a lot after this change: 20 seconds for a single request, and more than 2 minutes for bulk requests.
I also tried an async foreach in place of Parallel.ForEach, as described in the answer below, but performance is still really bad.
https://stackoverflow.com/questions/14673728/run-async-method-8-times-in-parallel/14674239#14674239
According to the logs, the basic issue is with the external service calls/DB calls inside the foreach, in both the old parallel and the new async/await implementation. But these services respond very fast for a single request. The async implementation takes longer to complete the service calls than the parallel extensions implementation.
I think the service should not take more than 20 seconds for a bulk request if it takes less than 5 seconds for a single request.
Can anyone please tell me what the way forward should be to improve the performance?
Thanks in advance.
Regards,
Raghu.
Looks like a lot of things are happening here at the same time. I believe you have one underlying issue that causes many side effects.
I will make the assumption that your server is sufficient in terms of CPU and memory to handle the concurrent connections (though the CPU 100% makes me wonder).
It seems to me that your problem is that the parallel tasks (or threads) compete for the same resources. That would explain why multiple requests take much more time and why the async paradigm takes even more.
Let me explain:
The problem in practice
Parallel implementation: 1 or 2 requests need minimal synchronization, so even if they compete for the same resources, it should be fine.
When 20 threads try to access the same resources, a lot is happening and you end up in a situation known as livelock.
When you switch to async, no request waits for a thread (they are waiting on the I/O threads), so even more work reaches the contested resources at once and you make the problem even worse.
(I suspect that the problem is on your database. If your database server is the same machine, it would also explain the utilization).
The solution
Instead of trying to up the parallelism, find the contested resources and identify the problem.
If it's in your database (most probable scenario), then you need to identify the queries causing the trouble and fix them (indexes, statistics, query plans and whatnot). DB profilers showing locks and query execution plans are your friends for this.
If the problem is in your code, try to minimize the race conditions and improve your algorithms.
To get a hint of where to look for, use the Visual Studio profiling tools: https://learn.microsoft.com/en-us/visualstudio/profiling/profiling-feature-tour?view=vs-2019 or any external .net profiling software.
I have a Windows Service that has code similar to the following:
List<Buyer> buyers = GetBuyers();
var results = new List<Result>();
Parallel.ForEach(buyers, buyer =>
{
    // do some prep work, log some data, etc.
    // call out to an external service that can take up to 15 seconds each to return
    var bid = Bid(buyer);
    // List<T>.Add is not thread-safe, so guard the shared list
    lock (results)
    {
        results.Add(bid);
    }
});
// Parallel.ForEach must have completed by the time this code executes
foreach (var result in results)
{
    // do some work
}
This is all fine and good and it works, but I think we're suffering from a scalability issue. We average 20-30 inbound connections per minute and each of those connections fire this code. The "buyers" collection for each of those inbound connections can have from 1-15 buyers in it. Occasionally our inbound connection count sees a spike to 100+ connections per minute and our server grinds to a halt.
CPU usage is only around 50% on each server (two load balanced 8 core servers) but the thread count continues to rise (spiking up to 350 threads on the process) and our response time for each inbound connection goes from 3-4 seconds to 1.5-2 minutes.
I suspect the above code is responsible for our scalability problems. Given this usage scenario (parallelism for I/O operations) on a Windows Service (no UI), is Parallel.ForEach the best approach? I don't have a lot of experience with async programming and am looking forward to using this opportunity to learn more about it, figured I'd start here to get some community advice to supplement what I've been able to find on Google.
Parallel.ForEach has a terrible design flaw. It is prone to consume all available thread-pool resources over time. The number of threads that it will spawn is literally unlimited. You can get up to 2 new ones per second, driven by heuristics that nobody understands. The CoreCLR has a hill climbing algorithm built into it that just doesn't work.
"call out to an external service"
Probably, you should find out what the right degree of parallelism is for calling that service. You need to find out by testing different values.
Then, you need to restrict Parallel.ForEach to spawn at most that many threads. You can do that using a fixed-concurrency TaskScheduler.
Or, you change this to use async IO and use SemaphoreSlim.WaitAsync. That way no threads are blocked. The pool exhaustion is solved by that and the overloading of the external service as well.
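A minimal sketch of the SemaphoreSlim.WaitAsync variant. ThrottledCalls is a hypothetical helper, callAsync stands in for the real asynchronous call to the external service (for example an async version of Bid), and the right maxConcurrency is something you have to find by testing, as said above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledCalls
{
    // Runs 'callAsync' for every input, but lets at most 'maxConcurrency' calls
    // be in flight at once. Callers waiting for a slot do not block a thread.
    public static async Task<TResult[]> RunAsync<TInput, TResult>(
        IEnumerable<TInput> inputs,
        int maxConcurrency,
        Func<TInput, Task<TResult>> callAsync)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = inputs.Select(async input =>
            {
                await gate.WaitAsync();
                try
                {
                    return await callAsync(input);
                }
                finally
                {
                    gate.Release();
                }
            }).ToList(); // materialize so every call is started exactly once

            // Results come back in the same order as the inputs.
            return await Task.WhenAll(tasks);
        }
    }
}
```

To throttle across all inbound connections rather than per request, you would share one SemaphoreSlim for the whole process instead of creating it per call as this sketch does.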
I have code that carries out data retrieval - basically executes anything from 3 to 12 SQL (oracle) read statements to retrieve data about an object.
Unfortunately it's running slowly (no SQL statement in particular; it's just the fact that I have so many of them, and they take around 0.2 seconds per statement, which can mean over 2 seconds for the code to complete).
I am looking into ways of improving the performance. One way is to merge some of the tables into a single query (which can reduce the combined time by 0.5 seconds). However, it doesn't make sense to merge the rest, since there will only be data there under certain circumstances, and trying to determine when there is data to marshal could get tricky.
I am considering introducing threading into my program, so after the initial query, I would spawn a thread for each of the other queries, so they are executed at the same time. However, I have never used threading and am wary of introducing deadlocks or other pitfalls.
Currently the other queries marshal the results into different sections of the SAME object. Would this cause any issues (i.e. since we are accessing/updating the same object in different threads though different sections/fields within the object?). Would it be better to return the results and marshal into the object after all the threads have finished?
I know these types of questions are hard to answer since it's more general advice, but I would appreciate hearing whether anyone thinks this is a good idea or has other suggestions.
If you are only reading (SELECT ... FROM), don't worry about deadlocks. Reads in Oracle do not block (mostly). The biggest problem with threading queries to Oracle is how to deal with connections. Creating a connection, running one query, and closing the connection is very, very bad. Connections are expensive. They are also limited, so you don't want to create a million connections to execute your logic.
As a result, you would use some sort of connection pool and put your queries in a queue.
Also, I hope you are using bind variables and not string concatenation to pass queries to Oracle.
In general, I would collect all the data (preferably in one query) and only then update the object. You could also consider breaking your object into its sections.
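To illustrate the "collect first, update the object afterwards" idea, here is a small sketch. ObjectDetails, DetailsLoader, and the three loader delegates are all hypothetical; in real code each delegate would run one of the independent queries on its own pooled connection.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical aggregate that several independent queries fill in.
public sealed class ObjectDetails
{
    public string Header { get; set; } = string.Empty;
    public string[] Parts { get; set; } = Array.Empty<string>();
    public int StockLevel { get; set; }
}

public static class DetailsLoader
{
    // Starts all three queries at once, waits for all of them, and only then
    // copies the results into the object, so no two threads ever write to it.
    public static async Task<ObjectDetails> LoadAsync(
        Func<Task<string>> loadHeader,
        Func<Task<string[]>> loadParts,
        Func<Task<int>> loadStockLevel)
    {
        Task<string> header = loadHeader();
        Task<string[]> parts = loadParts();
        Task<int> stock = loadStockLevel();

        await Task.WhenAll(header, parts, stock);

        return new ObjectDetails
        {
            Header = header.Result,     // safe: the task has already completed
            Parts = parts.Result,
            StockLevel = stock.Result
        };
    }
}
```

With queries of about 0.2 seconds each, the total time then approaches the slowest single query rather than the sum, provided the database and the connection pool can serve them concurrently.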
Threading works perfectly. Two years ago I did a project that used a multi-stage, multi-threaded approach to push data into an Oracle database (and pull some data out of it for updates).
I basically used a staged approach (a request would go through multiple stages, get consumed there, and new data would be pushed to the next stage), and every stage used a configurable thread pool, which would take a message, process it, and post the new messages.
At that time we used, I think, close to 200 threads to process about a million SQL statements per minute (hitting an Oracle Exadata that was really getting some work out of that).
So, multithreading "just works", obviously provided you know how to do it and you get your architecture and SQL statements nice and non-blocking. Databases in general are perfectly capable of handling multiple threads.
Now, for details: THAT DEPENDS.
Example:
Currently the other queries marshal the results into different
sections of the SAME object. Would this cause any issues (i.e. since
we are accessing/updating the same object in different threads though
different sections/fields within the object?)
Absolutely no problem as long as:
1) You make sure all updates are finished before moving the object to the next phase, and
2) The updates do not overlap or depend on each other's order (1 must finish for 2 to have the required data).
These are implementation details, and it is really hard (totally impossible, in fact) to give a generic answer for those. Especially as this is multithreading 101 and has nothing to do with database access.
In general, you will also have to tune the number of threads. .NET cannot do that itself, as it will see the CPU not being busy and spawn more threads, even if the database server is the bottleneck. This is why we went with multiple stages: we could tune the number of threads depending on what they do. The last stage used bulk inserts to load the aggregated data into temporary staging tables with a small number of threads, moving a lot of data in every statement; this requires some tuning options so you do not totally overload the database side.
In my client-server architecture I have a few API functions whose usage needs to be limited.
The server is written in C# (.NET) and runs on IIS.
Until now I didn't need to perform any synchronization. The code was written in such a way that even if a client sent the same request multiple times (e.g. a "create something" request), one call would succeed and all the others would fail with an error (because of the server code and the DB structure).
What is the best way to enforce such limits? For example, I want no more than 1 call of the API method foo() per user per minute.
I thought about a SynchronizationTable with just one unique column, unique_text. Before executing foo() I would write something like foo{userId}{date}{HH:mm} to this table. If the insert succeeds, I know that there wasn't a foo call from that user in the current minute.
I think there is a much better way, probably in server code, without using the DB. Of course, there could be thousands of users calling foo.
To clarify what I need: I think it could be some light DictionaryMutex.
For example:
private static DictionaryMutex FooLock = new DictionaryMutex();
FooLock.lock(User.GUID);
try
{
...
}
finally
{
FooLock.unlock(User.GUID);
}
EDIT:
A solution in which one user cannot call foo twice at the same time is also sufficient for me. By "at the same time" I mean that the server started to handle the second call before returning the result of the first call.
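One possible reading of the DictionaryMutex idea, as a minimal sketch only. KeyedGate is a hypothetical name, and unlike a real mutex it does not make the second caller wait: a concurrent call for the same user is simply rejected, which matches the "one call succeeds, the others get an error" behaviour described above. The state lives in the memory of one worker process.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical in-memory gate: at most one call in flight per user.
public sealed class KeyedGate
{
    private readonly ConcurrentDictionary<Guid, byte> _busy =
        new ConcurrentDictionary<Guid, byte>();

    // Returns true if the caller now owns the key; false if another call holds it.
    public bool TryEnter(Guid key) => _busy.TryAdd(key, 0);

    // Releases the key so the next call for the same user can proceed.
    public void Exit(Guid key) => _busy.TryRemove(key, out _);
}
```

Usage would follow the try/finally shape above: call TryEnter(User.GUID), return an error if it yields false, otherwise do the work and call Exit(User.GUID) in the finally block.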
Note that keeping this state in memory in an IIS worker process opens the possibility of losing all this data at any instant in time. Worker processes can restart for any number of reasons.
Also, you probably want to have two web servers for high availability. Keeping the state inside of worker processes makes the application no longer clustering-ready. This is often a no-go.
Web apps really should be stateless. Many reasons for that. If you can help it, don't manage your own data structures like suggested in the question and comments.
Depending on how big the call volume is, I'd consider these options:
SQL Server. Your queries are extremely simple and easy to optimize for. Expect 1000s of such queries per second per CPU core. This can bear a lot of load. You can use SQL Server Express for free.
A specialized store like Redis. Stack Overflow is using Redis as a persistent, clustering-enabled cache. A good idea.
A distributed cache, like Microsoft Velocity. Or others.
This storage problem is rather easy because it fits a key/value store model well. And the data is nearly worthless, so you don't even need to back it up.
I think you're overestimating how costly this rate limitation will be. Your web-service is probably doing a lot more costly things than a single UPDATE by primary key to a simple table.
This may not be suitable here, please feel free to move, shout or abuse if so.
We currently have a console application that gets started by another application and is passed the ID of a 'job'; this job has multiple records that need to be processed. A simple explanation of the flow would be:
1) Start 50 threads.
2) Get records to be processed.
3) If records > 0, see which threads are not busy and send them some information.
4) If records = 0, update something else and exit.
5) Get more records.
6) Loop.
Now, I am looking to convert this into a 'polling' service that is continually running and when new records are available, process them. To take what I have and convert this is fairly simple, but the threads stuff is old and probably outdated.
I was looking to refactor most if not all of it and use the Task Parallel Library to process the items. However, I am struggling to find a suitable framework for polling and then processing the items, and was looking for suggestions on how to achieve this.
Pretty vague I know, but hopefully enough to give some kind of input.
Many thanks
From my experience and this msdn quote:
More efficient and more scalable use of system resources.
Behind the scenes, tasks are queued to the ThreadPool, which has been
enhanced with algorithms (like hill-climbing) that determine and
adjust to the number of threads that maximizes throughput. This makes
tasks relatively lightweight, and you can create many of them to
enable fine-grained parallelism. To complement this, widely-known
work-stealing algorithms are employed to provide load-balancing.
You simply shouldn't care about how many tasks is a good number, or how to create a system where you load balance the threading involved.
Simply use:
Task.Factory.StartNew(() => DoSomeWork());
Every time you want to run something asynchronously, it does all the smart work behind the scenes.
Now since you're likely to create tasks in a loop, please be extra-careful not to introduce a closure bug many people had (including me), which you can look up here.
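For reference, the closure bug usually looks like capturing the loop variable of a for loop directly in the lambda. A minimal sketch of the safe pattern (ClosureDemo is a hypothetical name):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class ClosureDemo
{
    // Starts one task per index and returns, sorted, the values the tasks saw.
    public static int[] RunWithLocalCopy(int count)
    {
        var seen = new ConcurrentBag<int>();
        var tasks = new Task[count];

        for (int i = 0; i < count; i++)
        {
            // Copy the loop variable. Capturing 'i' directly would share one
            // variable across all tasks, so they could observe later values.
            int index = i;
            tasks[i] = Task.Factory.StartNew(() => seen.Add(index));
        }

        Task.WaitAll(tasks);

        int[] result = seen.ToArray();
        Array.Sort(result);
        return result;
    }
}
```

Each task sees its own copy, so every index is recorded exactly once; with the loop variable captured directly, several tasks may record the same (often the final) value.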
I have a windows service that runs from 1 to 500 Tasks, and never had trouble.
Hope this helps,
Bab.
If you are polling for new records in a DB table, a better approach would be to install an INSERT trigger (and possibly also UPDATE and DELETE triggers) on this table and to send a message to your service when a new record is inserted.
See Posting Message to MSMQ from SQL Server on MSDN.
The "polling service" sounds like a nice case for an observable collection. There's Rx, a nice way to handle them (http://rxwiki.wikidot.com/101samples), which I think uses the TPL.