I am using EF 6.0 as an interface to my SQL Server database, and I keep running into a recurring issue that I can't find a solution for, other than setting up an auto-restart for the program.
I have a loop that creates a new DbContext each iteration and performs operations with it. After the loop has run for a few cycles, the database operations slow down to a crawl: iterations that take seconds right after launch start taking minutes.
I have tried things like disabling automatic change tracking and manually triggering garbage collection.
Here is what I am doing:
while (true)
{
    var ctx = new DBContext();
    // ... Do operations
    Thread.Sleep(10000);
}
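For reference, the change-tracking tweak I tried looks roughly like this (a sketch using the EF 6 configuration flag):

var ctx = new DBContext();
// Turn off EF's automatic change detection for this context instance.
ctx.Configuration.AutoDetectChangesEnabled = false;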
I've been searching for an answer to this problem for a while, but have been unable to find one.
Related
How can I run the code below in the fastest way, and what is the best practice?
public ActionResult ExampleAction()
{
    // 200K items
    var results = dbContext.Results.ToList();

    foreach (var result in results)
    {
        // 10 - 40 items
        result.Kazanim = JsonConvert.SerializeObject(
            dbContext.SubTables // 2.5M items
                .Where(x => x.FooId == result.FooId)
                .Select(s => new
                {
                    BarId = s.BarId,
                    State = s.State,
                })
                .ToList());

        dbContext.Entry(result).State = EntityState.Modified;
        dbContext.SaveChanges();
    }

    return Json(true, JsonRequestBehavior.AllowGet);
}
This process takes an average of 500 ms per iteration when run synchronously. I have about 2M records, and the loop body runs 200K times.
How should I write this asynchronously? How can I make it faster and simpler with an async method?
Here are two suggestions that can improve performance by multiple orders of magnitude:
Do work in batches:
Make the client send a page of data to process; and/or
In the web server code, add items to a queue and process them separately.
Use SQL instead of EF:
Write efficient SQL; and/or
Use a stored procedure to do the work inside the database rather than moving data between the database and the code (sketched below).
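For illustration only, a set-based version of the work from the question could look roughly like this. It assumes SQL Server 2016+ (for FOR JSON PATH) and the table/column names from the question, and uses EF 6's ExecuteSqlCommand to run the statement; treat it as a sketch rather than a drop-in replacement:

// Sketch: one set-based UPDATE instead of 200K round trips.
// Table/column names are taken from the question; FOR JSON PATH needs SQL Server 2016+.
var sql = @"
    UPDATE r
    SET r.Kazanim = (SELECT s.BarId, s.State
                     FROM SubTables s
                     WHERE s.FooId = r.FooId
                     FOR JSON PATH)
    FROM Results r;";

dbContext.Database.ExecuteSqlCommand(sql);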
There's nothing about running this code asynchronously that will improve its performance, but there is something that can certainly make it faster.
If you call dbContext.SaveChanges() inside the loop, EF will write back the changes to the database for every single entity as a separate transaction.
Move your dbContext.SaveChanges() call after the loop. That way EF will write back all of your changes at once, in one single transaction.
Always try to have as few calls to .SaveChanges() as possible. One call with 50 changes is much better, faster and more efficient than 50 calls for 1 change each.
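A rough sketch of the reshaped loop, reusing the entity and property names from the question (untested):

// Accumulate the changes and flush them once at the end.
var results = dbContext.Results.ToList(); // 200K items

foreach (var result in results)
{
    result.Kazanim = JsonConvert.SerializeObject(
        dbContext.SubTables
            .Where(x => x.FooId == result.FooId)
            .Select(s => new { s.BarId, s.State })
            .ToList());
    // No per-item SaveChanges here; EF keeps tracking the modified entities.
}

dbContext.SaveChanges(); // one write-back instead of 200K

If a single 200K-row transaction turns out to be too heavy, calling SaveChanges() every few thousand items is a reasonable middle ground.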
Hi, and welcome.
There's quite a lot I see incorrect in terms of asynchronicity, but I guess it only matters if there are concurrent users calling your server. This has to do with scalability and the thread pool in charge of spinning up threads to take care of your incoming HTTP requests.
You see, if you occupy a thread pool thread for a long time, that thread will not contribute to dequeueing incoming HTTP requests. This pretty much puts you in a position where you can spin up a maximum of around 2 new thread pool threads per second. If your incoming HTTP request rate is faster than the pool's ability to produce threads, all of your HTTP requests will start seeing increased response times (slowness).
So as a general rule, when doing I/O intensive work, always go async. There are asynchronous versions of most (or all) of the materializing methods like .ToList(): ToListAsync(), CountAsync(), AnyAsync(), etc. There is also a SaveChangesAsync(). First thing I would do is use these under normal circumstances. Yours don't seem to be, so I mentioned this for completeness only.
I think that you must, at the very least, run this heavy process outside the thread pool. Use Task.Factory.StartNew() with TaskCreationOptions.LongRunning, but run synchronous code inside it so you don't fall into the trap of awaiting the returned task in vain.
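A minimal sketch of that, where HeavyUpdateJob is a hypothetical stand-in for the loop in the question:

// Requires System.Threading and System.Threading.Tasks.
// LongRunning hints the scheduler to use a dedicated thread instead of a pool thread.
Task.Factory.StartNew(
    () => HeavyUpdateJob(),          // plain synchronous work inside
    CancellationToken.None,
    TaskCreationOptions.LongRunning,
    TaskScheduler.Default);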
Now, all that just to have a "proper" skeleton. We haven't really talked about how to make this run faster. Let's do that.
Personally, I think you need some benchmarking between different methods. It looks like you have benchmarked this code. Now listen to @tymtam and see if a stored procedure version runs faster. My hunch, just like @tymtam's, is that it will definitely be faster.
If for whatever reason you insist on running this in C#, I would parallelize the work. The problem with this is Entity Framework. As usual, this very popular yet unfriendly ORM gives us a big "but": EF's DB context works with a single connection and disallows multiple simultaneous queries, so you cannot parallelize this with EF. I would then move to my good, amazing friend, Dapper. Using Dapper, you could divide the workload across threads; each thread would open an independent DB connection and, through that connection, take care of a portion of the 200K result set you obtain at the beginning.
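A sketch of that idea, not a drop-in solution: the table and column names come from the question, while the connection string, the batch size, the degree of parallelism, and the assumption that FooId identifies a Results row are all mine:

// Requires Dapper, System.Data.SqlClient, Newtonsoft.Json.
var fooIds = new List<int>();
using (var conn = new SqlConnection(connectionString))
{
    fooIds = conn.Query<int>("SELECT FooId FROM Results").ToList(); // ~200K ids
}

// Split the ids into batches of 1000 and give each batch its own connection.
var batches = fooIds
    .Select((id, i) => new { id, i })
    .GroupBy(x => x.i / 1000)
    .Select(g => g.Select(x => x.id).ToList())
    .ToList();

Parallel.ForEach(batches, new ParallelOptions { MaxDegreeOfParallelism = 4 }, batch =>
{
    using (var conn = new SqlConnection(connectionString)) // one connection per thread
    {
        foreach (var fooId in batch)
        {
            var rows = conn.Query(
                "SELECT BarId, State FROM SubTables WHERE FooId = @FooId",
                new { FooId = fooId }).ToList();

            conn.Execute(
                "UPDATE Results SET Kazanim = @Kazanim WHERE FooId = @FooId",
                new { Kazanim = JsonConvert.SerializeObject(rows), FooId = fooId });
        }
    }
});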
Thanks for the valuable information you provided.
I decided to use Hangfire in line with your suggestions.
I used it with Hangfire's in-memory storage. I prepared a function that enqueues a Hangfire job inside the foreach. After fetching the relevant values before the foreach starts, I pass that function the parameters it needs to calculate and save to the database. I won't drag this out.
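Roughly, the enqueueing looked like this (names simplified; ProcessResult stands in for my actual job method):

foreach (var result in results)
{
    // Hangfire serializes the call and a background worker executes it.
    BackgroundJob.Enqueue(() => ProcessResult(result.FooId));
}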
A job that took 30 minutes on average dropped to 3 minutes with Hangfire. Maybe it's still not ideal, but it works for me for now. Instead of making the user wait, I can show the action as currently in progress, and once the last background job finishes I report that the work completed successfully.
I haven't used Dapper here for now, but I have used it for something else, and it really does have tremendous performance compared to Entity Framework.
Thanks again.
I'm running a Windows service using TopShelf (based on a C# .NET 4.6.1 console app) and I'm using AutoMapper 9.0.0. Every 10 seconds I run a task that processes about 1,000 rows in an MS SQL database (using Entity Framework). It seems like AutoMapper is taking up a lot of memory, and the memory grows each time the task runs (in Task Manager I can see the service using over 3,000 MB of RAM and climbing).
I am new to AutoMapper and don't know if there is anything I need to do in code to release the memory manually. Somewhere I saw a huge number of handlers and was wondering whether AutoMapper generates these handlers and how I can clean them up.
I tried putting a GC.Collect() at the end of each task, but I don't seem to see a difference.
Here is a code extract of my task:
private void _LiveDataTimer_Elapsed(object sender, ElapsedEventArgs e)
{
    // setting up Ninject data injection
    var kernel = new StandardKernel();
    kernel.Load(Assembly.GetExecutingAssembly());

    //var stationExtesions = kernel.Get<IStationExtensionRepository>();
    //var ops = kernel.Get<IOPRepository>();
    //var opExtensions = kernel.Get<IOPExtensionRepository>();
    //var periods = kernel.Get<IPeriodRepository>();
    //var periodExtensions = kernel.Get<IPeriodExtensionRepository>();

    // create the LiveDataTasks object
    //var liveDataTasks = new LiveDataTasks(stationExtesions, ops, opExtensions, periods, periodExtensions);

    // sync the station live data
    //liveDataTasks.SyncLiveStationData();

    // force garbage collection to prevent memory leaks
    //GC.Collect();

    Console.WriteLine("LiveDataTimer: Total available memory before collection: {0:N0}", System.GC.GetTotalMemory(false));
    System.GC.Collect();
    Console.WriteLine("LiveDataTimer: Total available memory after collection: {0:N0}", System.GC.GetTotalMemory(true));
}
MODIFICATIONS: I added some console output at the end of the code displaying the total memory used. I removed GC.Collect() because it doesn't change anything, and I commented out most of the code that accesses the database. Now I realize that kernel.Load(Assembly.GetExecutingAssembly()); already makes the memory grow very fast. See the following console capture:
Now if I comment out kernel.Load(Assembly.GetExecutingAssembly()); I get a stable memory situation again. How can I dispose of or unload the kernel?
Well, first of all, you should not be doing database work in a service like this. Moving any big operation on DB data out of the DB only means moving the data over the network twice - once to the client program, once back to the DB - while also risking race conditions and a lot of other issues. My standing advice is: keep DB work in the DB at all times.
As for the memory footprint, this might just be a misreading of the memory usage:
.NET uses the garbage-collection approach to memory management. One effect of this is that while the GC for any given application does its collecting, all other threads have to be paused, so the GC is pretty lazy about running. The ideal case is that it runs only once, at application shutdown, so it tries to avoid running unnecessarily before then. It will still run as much as it can before it ever throws an OutOfMemoryException at you, but beyond that it is perfectly happy to just keep allocating more and more objects without cleaning up.
You can test whether this is the case by calling GC.Collect(), although such a call should generally never be in production code. An alternate GC strategy (in particular the server GC used for web servers) might be better.
I finally figured out what was happening: the kernel.Load(...) call used to set up Ninject data injection was increasing my memory:
var kernel = new StandardKernel();
kernel.Load(Assembly.GetExecutingAssembly());
So I moved this code from the function executed every x seconds to the constructor of the parent class where it is only executed once on initialisation.
This solved the problem.
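In outline, the change looked like this (class and field names are simplified):

public class LiveDataService
{
    private readonly StandardKernel _kernel;

    public LiveDataService()
    {
        // Created once at start-up instead of on every timer tick.
        _kernel = new StandardKernel();
        _kernel.Load(Assembly.GetExecutingAssembly());
    }

    private void _LiveDataTimer_Elapsed(object sender, ElapsedEventArgs e)
    {
        // ... resolve repositories from _kernel and do the work as before ...
    }
}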
Thanks, guys, for your inspiring help and comments!
Firstly, I am not much of an expert in multi-threading and parallel programming.
I am trying to optimize the performance of a legacy application (.Net 4, NHibernate 2.1).
So far, upgrading NHibernate is not a priority, but it is in the pipeline.
Over time, performance has become a nightmare with the growth of data. One item I have seen is a Parallel.ForEach statement that calls a method that fetches and updates a complex entity (with multiple relationships - properties and collections).
The piece of code has the following form (simplified for clarity):
void SomeMethod(ICollection<TheClass> itemsToProcess)
{
    Parallel.ForEach(itemsToProcess, item => ProcessItem(item));
}

TheClass ProcessItem(TheClass i)
{
    var temp = NHibernateRepository.SomeFetchMethod(i);
    var result = NHibernateRepository.Update(temp);
    return result;
}
SQL Server intermittently reports database lock errors with the following error:
Transaction (Process ID 20) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction
I suspect it is due to a race condition leading up to a deadlock, even though the ISessions are separate.
The ICollection<TheClass> can have up to 1,000 items, each with properties and sub-collections that are processed, generating many SELECT and UPDATE statements (confirmed using NHibernate Profiler).
Is there a better way to handle this in a parallel way, or shall I refactor the code to a traditional loop?
I do know that I can alternatively implement my code using:
A foreach loop in the same ISession context
With a stateless session (sketched below)
With Environment.BatchSize set to a reasonable value
OR
Using SQL bulk copy (SqlBulkCopy)
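A rough sketch of the stateless-session alternative, using plain NHibernate APIs (sessionFactory is assumed to be the application's ISessionFactory, and the batch size would come from adonet.batch_size in the configuration):

using (IStatelessSession session = sessionFactory.OpenStatelessSession())
using (ITransaction tx = session.BeginTransaction())
{
    foreach (var item in itemsToProcess)
    {
        // ... re-apply the changes that ProcessItem makes today ...
        session.Update(item); // stateless: no first-level cache, no dirty tracking
    }
    tx.Commit();
}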
I have also read quite a bit of good info about SQL Server deadlocks and Parallel.ForEach being an easy pitfall:
SQL Transaction was deadlocked
Using SQL Bulk Copy as an alternative
Potential Pitfalls in Data and Task Parallelism
Multi threading C# application with SQL Server database calls
This is a very complicated topic. There's one strategy that is guaranteed to be safe and probably will result in a speedup:
Retry in case of deadlock.
Since a deadlock rolls back the transaction you can safely retry the entire transaction. If the deadlock rate is low the parallelism speedup will be high.
The nice thing about retry is that you can make a simple code change in a central place.
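For illustration, a central retry helper could look something like this. Error number 1205 is SQL Server's deadlock-victim code; the retry count and the exception unwrapping are assumptions, and the real transaction boundary must live inside the delegate:

// Requires System; SqlException comes from System.Data.SqlClient.
T ExecuteWithDeadlockRetry<T>(Func<T> work, int maxAttempts = 3)
{
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            return work(); // the whole unit of work / transaction runs in here
        }
        catch (Exception ex)
        {
            var sqlEx = ex.InnerException as SqlException ?? ex as SqlException;
            bool isDeadlock = sqlEx != null && sqlEx.Number == 1205; // deadlock victim
            if (!isDeadlock || attempt >= maxAttempts)
                throw;
            // Deadlock: the transaction was rolled back, so it is safe to retry.
        }
    }
}

// Usage with the code from the question:
Parallel.ForEach(itemsToProcess,
    item => ExecuteWithDeadlockRetry(() => ProcessItem(item)));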
Since it's not apparent from the code posted: make sure that threads do not share the session or the entities. Neither of them is thread-safe.
I have a Windows service that polls a database. I am using EF6 and LINQ to do my queries, updates, etc.
The polling needs to happen as often as possible, probably every 2 seconds or so.
My gut tells me to have one connection and keep it open while my service is running; however, something else tells me to open and close the connection every time. I feel that the latter will slow things down (will it really slow things down that much?).
What are the best practices when it comes to polling a database within a windows service? Should I really be polling my database so often?
I think you should dispose of the context frequently and create a new one every time you poll the database.
The main reason is that unless you disable object tracking (really only suitable for read-only operations), the context gets bigger and bigger over time, with each successive polling operation loading more data into the context's cache. As well as the increase in memory this causes, SaveChanges() gets slower as the ObjectContext then looks for changes in the objects that are attached to it.
If the connection is lost for any reason, you'll also have a hard time associating a new connection with the context. Regardless, based on my own experience, it won't slow anything down; it's quick to construct EF context objects after the first one, because the model is cached on first load.
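A minimal sketch of that pattern, with made-up context and entity names:

// A short-lived context per polling cycle.
private void Poll()
{
    using (var db = new MyDbContext())
    {
        var pending = db.Items
            .Where(i => !i.Processed)
            .ToList();

        foreach (var item in pending)
        {
            // ... handle the item ...
            item.Processed = true;
        }

        db.SaveChanges();
    } // the context and its cached entities are released here
}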
I wouldn't worry about polling every 2 seconds. That seems totally reasonable to me.
As an aside, if you're using SQL Server, you can use SqlDependency to fire an event when data changes, but polling is the most reliable option.
http://msdn.microsoft.com/en-us/library/62xk7953(v=vs.110).aspx
Alternatively, if you're dead set against polling, you could look at using a Message Broker system like RabbitMQ and updating your apps to use it, but be prepared to lose a couple of weeks implementing the infrastructure.
I have various large data modification operations in a project built on C# and Fluent NHibernate.
The DB is SQLite (on disk rather than in memory, as I'm interested in performance).
I wanted to check performance of these so I created some tests to feed in large amounts of data and let the processes do their thing. The results from 2 of these processes have got me pretty confused.
The first is a fairly simple case of taking data supplied in an XML file doing some light processing and importing it. The XML contains around 172,000 rows and the process takes a total of around 60 seconds to run with the actual inserts taking around 40 seconds.
In the next process, I do some processing on the same set of data. So I have a DB with approx 172,000 rows in one table. The process then works through this data, doing some heavier processing and generating a whole bunch of DB updates (inserts and updates to the same table).
In total, this results in around 50,000 rows inserted and 80,000 updated.
In this case, the processing takes around 30 seconds, which is fine, but saving the changes to the DB takes over 30 minutes, and it crashes before it finishes with an SQLite 'disk or I/O error'.
So the question is: why are the inserts/updates in the second process so much slower? They are working on the same table of the same database with the same connection. In both cases, IStatelessSession is used and ado.batch_size is set to 1000.
In both cases, the code that does the update looks like this:
BulkDataInsert((IStatelessSession session) =>
{
    foreach (Transaction t in transToInsert) { session.Insert(t); }
    foreach (Transaction t in transToUpdate) { session.Update(t); }
});
(Although the first process has no 'transToUpdate' line, as it only does inserts - removing the update line and just doing the inserts still takes almost 10 minutes.)
The transTo* variables are List<Transaction> collections holding the objects to be updated/inserted.
BulkDataInsert creates the session and handles the DB transaction.
I didn't understand your second process. However, here are some things to consider:
Are there any clustered or non-clustered indexes on the table?
How many disk drives do you have?
How many threads are writing to the DB in the second test?
It seems that you are experiencing IO bottlenecks that can be resolved by having more disks, more threads, indexes, etc.
So, assuming a lot of things, here is what I "think" is happening:
In the first test your table probably has no indexes, and since you are just inserting data, it is a sequential insert in a single thread which can be pretty fast - especially if you are writing to one disk.
Now, in the second test, you are reading data and then updating data. Your SQL instance has to find the record that it needs to update. If you do not have any indexes this "find" action is basically a table scan, which will happen for each one of those 80,000 row updates. This will make your application really really slow.
The simplest thing you could probably do is add a clustered index on the table over a unique key, and the best option is to use the columns that appear in the WHERE clause of those UPDATE statements.
Hope this helps.
DISCLAIMER: I made quite a few assumptions
The problem was due to my test setup.
As is pretty common with NHibernate-based projects, I had been using in-memory SQLite databases for unit testing. These work great, but one downside is that if you close the session, the database is destroyed.
Consequently, my unit of work implementation contains a 'PreserveSession' property to keep the session alive and just create new transactions when needed.
My new performance tests are using on-disk databases but they still use the common code for setting up test databases and so have PreserveSession set to true.
It seems that having several sessions all left open (even though they're not doing anything) starts to cause problems after a while, including the performance drop-off and the disk I/O error.
I re-ran the second test with PreserveSession set to false, and I'm immediately down from over 30 minutes to under 2 minutes, which is more where I'd expect it to be.