I am working on a batch process which dumps ~800,000 records from a slow legacy database (1.4-2ms per record fetch time...it adds up) into MySQL, which can perform a little faster. To optimize this, I have been loading all of the MySQL records into memory, which puts usage at about 200MB. Then I start dumping from the legacy database and updating the records.
Originally, when the updates were complete I would call SaveContext, which made my memory usage jump from ~500MB-800MB to 1.5GB. Very soon I would get out-of-memory exceptions (the virtual machine this runs on has 2GB of RAM), and even if I gave it more RAM, 1.5-2GB is still a little excessive; that would just be putting a band-aid on the problem. To remedy this, I started calling SaveContext every 10,000 records, which helped things along a bit. Since I was using delegates to fetch the data from the legacy database and update it in MySQL, I didn't take too bad a performance hit: after the 5-second or so wait while it saved, it would run through the in-memory updates for the 3,000 or so records that had backed up. However, the memory usage still keeps going up.
Here are my potential issues:
The data comes out of the legacy database in any order, so I can't chunk the updates and periodically release the ObjectContext.
If I don't grab all of the data out of MySQL beforehand and instead look it up record by record during the update process, it is incredibly slow. Instead, I grab it all beforehand, convert it to a dictionary indexed by the primary key, and remove records from the dictionary as I update them.
One possible solution I thought of is to somehow free the memory being used by entities that I know I will never touch again since they have already been updated (like clearing the cache, but only for a specific item), but I don't know if that is even possible with Entity Framework.
Does anyone have any thoughts?
You can call the Detach method on the context passing it the object you no longer need:
http://msdn.microsoft.com/en-us/library/system.data.objects.objectcontext.detach%28v=vs.90%29.aspx
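For example, here is a minimal sketch (assuming an ObjectContext-based model; the context, entity, and collection names are placeholders) that detaches a batch of entities after they have been saved, so the context stops tracking them and the garbage collector can reclaim them:

// After flushing a batch, detach the entities you are finished with so the
// ObjectContext stops tracking them. Names here are placeholders.
context.SaveChanges();                 // persist the pending updates first
foreach (var entity in savedBatch)     // savedBatch: the entities just written
{
    context.Detach(entity);            // remove it from the context's cache
}
savedBatch.Clear();

Detaching after SaveChanges matters: detaching an entity that still has pending changes would lose those changes.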
I'm wondering if your best bet isn't another tool, as previously suggested, or simply forgoing the use of Entity Framework. If you instead write the code without an ORM, you can:
Tune the SQL statements to improve performance
Easily control, and change, the scope of transactions to get the best performance.
Batch the updates so that you call the server once for a group of updates instead of performing them one at a time.
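As a rough sketch of the last two points (assuming MySql.Data.MySqlClient; the table, columns, and the updates collection are hypothetical), you can reuse a single parameterized UPDATE inside one transaction rather than issuing an ad-hoc statement and commit per row:

// One transaction and one parameterized UPDATE reused for the whole batch.
// `updates`, `records`, and the column names are placeholders.
using (var conn = new MySqlConnection(connectionString))
{
    conn.Open();
    using (var tx = conn.BeginTransaction())
    using (var cmd = conn.CreateCommand())
    {
        cmd.Transaction = tx;
        cmd.CommandText = "UPDATE records SET value = @value WHERE id = @id";
        cmd.Parameters.Add("@value", MySqlDbType.VarChar);
        cmd.Parameters.Add("@id", MySqlDbType.Int32);

        foreach (var item in updates)
        {
            cmd.Parameters["@value"].Value = item.Value;
            cmd.Parameters["@id"].Value = item.Id;
            cmd.ExecuteNonQuery();
        }

        tx.Commit();    // one commit for the whole batch
    }
}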
Related
I'm paging 150k rows in pages of 10,000: I fetch 10k rows from the database, iterate over them, then fetch the next 10k, and so on until there are no more rows. The problem is that as I fetch more rows, I see the memory graph climbing in the Performance tab of Task Manager, and with each iteration the query that fetches the 10k rows takes longer and longer, until it finally throws an OutOfMemoryException.
The query is a join of 6 tables. I load the results in a list using EF 4.
At the end of each iteration I clear the list, set it to null, and call GC.Collect(), but this has no effect.
What can I do to free the memory used by the rows I have already processed?
I faced a very similar issue when attempting to return large datasets from a database.
Executing the query on a BackgroundWorker took the load off the UI thread, which simplified things and eliminated this issue for me.
Before using this technique I saw the application go from a few hundred MB to top out at about 1.2GB before throwing the OutOfMemoryException. After implementing the BackgroundWorker, the increase was only about 10-15MB, and memory dropped back down once the work completed. The query also seemed to execute slightly faster.
Yes, I realise this doesn't sound like it should make a difference, but it actually did, and I would recommend it.
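A minimal sketch of the idea (the context, entity, and ProcessRows names are placeholders for your own code):

// Run the query on a BackgroundWorker and hand the materialized list back
// to the UI thread when it completes.
var worker = new BackgroundWorker();
worker.DoWork += (s, e) =>
{
    using (var context = new MyEntities())
    {
        e.Result = context.MyRows
                          .Where(r => !r.Processed)
                          .ToList();        // materialize off the UI thread
    }
};
worker.RunWorkerCompleted += (s, e) =>
{
    var rows = (List<MyRow>)e.Result;       // back on the UI thread here
    ProcessRows(rows);
};
worker.RunWorkerAsync();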
Additionally, if you are running on a 64-bit OS (and targeting a 64-bit architecture) you can raise the memory limit from about 1.2GB to 4GB. The option can be found in the Project Properties under the Build tab...
Depending on the DB, you may be able to shift some of the work onto it by creating a view and querying that to gather your result set.
I'm pretty sure the problem is caused by holding onto the same context instance while fetching all the pages. Even though you no longer need the previously fetched entities, the context still keeps track of all of them.
You should create a new DbContext instance before fetching each new batch of records.
You could also try using AsNoTracking() method.
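Roughly, assuming the DbContext API (EF 4.1+) and placeholder entity names, the combination would look like this:

// Fresh context per page plus AsNoTracking(), so nothing accumulates in
// the change tracker. Entity/set names are placeholders.
const int pageSize = 10000;
for (int page = 0; ; page++)
{
    List<MyRow> batch;
    using (var context = new MyContext())
    {
        batch = context.MyRows
                       .AsNoTracking()
                       .OrderBy(r => r.Id)      // Skip/Take require an ordering
                       .Skip(page * pageSize)
                       .Take(pageSize)
                       .ToList();
    }                                           // context and its cache released here

    if (batch.Count == 0)
        break;

    ProcessBatch(batch);                        // your per-page work
}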
I am reading a text file into a database through EF4. This file has over 600,000 rows in it and therefore speed is important.
If I call SaveChanges after creating each new entity object, then this process takes about 15 mins.
If I call SaveChanges after creating 1024 objects, then it is down to 4 mins.
1024 was an arbitrary number I picked, it has no reference point.
However, I wondered if there WAS an optimum number of objects to load into my Entity Set before calling SaveChanges?
And if so...how do you work it out (other than trial and error) ?
This is actually a really interesting issue: EF becomes much slower as the context gets very large. You can combat this and make drastic performance improvements by disabling AutoDetectChanges for the duration of your batch insert. In general, however, the more items you can include in a single SQL transaction, the better.
Take a look at my post on EF performance here http://blog.staticvoid.co.nz/2012/03/entity-framework-comparative.html, and my post on how disabling AutoDetectChanges improves this here http://blog.staticvoid.co.nz/2012/05/entityframework-performance-and.html; these will also give you a good idea of how batch size affects performance.
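A sketch of the pattern, assuming the DbContext API (EF 4.1+); the context, set, ParseLine, and path names are placeholders for your own code:

// Disable automatic change detection for the bulk insert, save in batches,
// and recreate the context so the tracked set never gets huge.
const int batchSize = 1000;                    // a starting point; tune it
var context = new MyContext();
context.Configuration.AutoDetectChangesEnabled = false;
try
{
    int count = 0;
    foreach (var line in File.ReadLines(path)) // path: your input file
    {
        context.MyRows.Add(ParseLine(line));   // ParseLine: your own mapping code

        if (++count % batchSize == 0)
        {
            context.SaveChanges();
            context.Dispose();                 // start over with an empty tracker
            context = new MyContext();
            context.Configuration.AutoDetectChangesEnabled = false;
        }
    }
    context.SaveChanges();
}
finally
{
    context.Dispose();
}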
Profile the application and see what's taking the time for the two batch sizes you've chosen. That should give you some good numbers to extrapolate from.
Personally, I have to query why you are using EF to load a text file - it seems like gross overkill for something that should be easy to fire into a DB using BCP, or straight SqlCommands.
We're working on an online system right now, and I'm confused about when to use in-memory search and when to use database search. Can someone please help me figure out the factors to be considered when it comes to searching records?
One factor is that if you need to go through the same results over and over, be sure to cache them in memory. This becomes an issue when you're using LINQ to SQL or Entity Framework, ORMs that support deferred execution.
So if you have an IQueryable<SomeType> that you need to go through multiple times, make sure you materialize it with a ToList() before firing up multiple foreach loops.
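For instance (the query itself is just a placeholder):

// Without ToList(), each foreach over the IQueryable re-executes the query
// against the database; materializing it caches the results in memory.
IQueryable<SomeType> query = context.SomeTypes.Where(x => x.IsActive);

List<SomeType> cached = query.ToList();     // hits the database once

foreach (var item in cached) { /* first pass  */ }
foreach (var item in cached) { /* second pass */ }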
It depends on the situation, though I generally prefer in-memory search when possible.
However, it does depend on the context: for example, if records can be updated between one search and the next and you need the most up-to-date record at the time of the search, you obviously need a database search.
If the recordset (data table) you would need to hold in memory is huge, it may be better to search directly against the database instead.
That said, keep in mind that if you can do it and performance matters, loading the data into a DataTable and then searching and filtering it with LINQ, for example, can improve the performance of the search itself.
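A small sketch of that approach (requires a reference to System.Data.DataSetExtensions; the table, columns, and LoadCustomers are made-up names):

// Load once into a DataTable, then filter in memory with LINQ to DataSet.
DataTable customers = LoadCustomers();      // filled once from the database

var matches = customers.AsEnumerable()
                       .Where(row => row.Field<string>("Country") == "US"
                                  && row.Field<decimal>("Balance") > 1000m)
                       .ToList();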
Another thing to keep in mind is the relative performance of the database server and the application server: if the database server is fast enough on the search query, you may not need to cache in memory on the application side at all, and you can skip that step. Keep in mind that caching for in-memory search moves computational load from the database server to the application server...
There is no absolute answer to your question; it depends on your context...
It depends on the number of records. If the number of records is small, then it's better to keep them in memory, i.e. cache the records. Also, if the records are queried frequently, go for the in-memory option.
But if the number of records or the record size is too large, then it's better to go with the database search option.
Basically it depends on how much memory you have on your server...
I am trying to use sqlite in my application as a sort of cache. I say sort of because items never expire from my cache and I am not storing anything. I simply need to use the cache to store all ids I processed before. I don't want to process anything twice.
I am entering items into the cache at 10,000 messages/sec for a total of 150 million messages. My table is pretty simple: it only has one text column, which stores the ids. I was doing this all in memory using a dictionary; however, I am processing millions of messages and, although it is fast that way, I ran out of memory after some time.
I have researched sqlite and performance and I understand that configuration is key, however, I am still getting horrible performance on inserts (I haven't tried selects yet). I am not able to keep up with even 5000 inserts/sec. Maybe this is as good as it gets.
My connection string is as below:
Data Source=filename;Version=3;Count Changes=off;Journal Mode=off;
Pooling=true;Cache Size=10000;Page Size=4096;Synchronous=off
Thanks for any help you can provide!
If you are doing lots of inserts or updates at once, put them in a transaction.
Also, if you are executing essentially the same SQL each time, use a parameterized statement.
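Something along these lines (assuming System.Data.SQLite; the table name, file name, and idsToInsert collection are hypothetical):

// One transaction around the whole batch and a single parameterized INSERT
// reused for every row, instead of one implicit transaction per insert.
using (var conn = new SQLiteConnection("Data Source=cache.db;Version=3;"))
{
    conn.Open();
    using (var tx = conn.BeginTransaction())
    using (var cmd = conn.CreateCommand())
    {
        cmd.Transaction = tx;
        cmd.CommandText = "INSERT INTO processed_ids (id) VALUES (@id)";
        var idParam = cmd.Parameters.Add("@id", DbType.String);

        foreach (string id in idsToInsert)      // your batch of message ids
        {
            idParam.Value = id;
            cmd.ExecuteNonQuery();
        }

        tx.Commit();                            // one commit for the whole batch
    }
}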
Have you looked at the SQLite Optimization FAQ? (It's a bit old.)
SQLite performance tuning and optimization on embedded systems
If you have many threads writing to the same database, then you're going to run into concurrency problems with that many transactions per second. SQLite always locks the whole database for writes so only one write transaction can be processed at a time.
An alternative is Oracle Berkeley DB with SQLite. The latest version of Berkeley DB includes a SQLite front end that has a page-level locking mechanism instead of database-level locking. This provides a much higher number of transactions per second when there is a high concurrency requirement.
http://www.oracle.com/technetwork/database/berkeleydb/overview/index.html
It includes the same SQLite.NET provider and is supposed to be a drop-in replacement.
Since your requirements are so specific, you may be better off with something more dedicated, like memcached. This will provide a very high-throughput caching implementation that will be a lot more memory efficient than a simple hashtable.
Is there a port of memcache to .Net?
I'm an experienced programmer in a legacy (yet object oriented) development tool and making the switch to C#/.Net. I'm writing a small single user app using SQL server CE 3.5. I've read the conceptual DataSet and related doc and my code works.
Now I want to make sure that I'm doing it "right", get some feedback from experienced .Net/SQL Server coders, the kind you don't get from reading the doc.
I've noticed that I have code like this in a few places:
var myTableDataTable = new MyDataSet.MyTableDataTable();
myTableTableAdapter.Fill(myTableDataTable);
... // other code
In a single-user app, would you typically just do this once when the app starts, instantiate a DataTable object for each table, and then keep a reference to each one so you only ever use that single, already-filled object? That way you would read the data from the db only once instead of potentially multiple times. Or is the overhead so small that it just doesn't matter (and could this even be counterproductive with large tables)?
For CE, it's probably a non issue. If you were pushing this app to thousands of users and they were all hitting a centralized DB, you might want to spend some time on optimization. In a single-user instance DB like CE, unless you've got data that says you need to optimize, I wouldn't spend any time worrying about it. Premature optimization, etc.
The way to decide mainly comes down to two things:
1. Is the data going to be accessed constantly?
2. Is there a lot of data?
If you are constantly using the data in the tables, then load them on first use.
If you only occasionally use the data, fill the table when you need it and then discard it.
For example, if you have 10 GUI screens and only use myTableDataTable on one of them, read it in only on that screen.
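One simple way to express that, reusing the names from the question (treat this as a sketch):

// Fill the table the first time a screen actually needs it, then reuse it.
private MyDataSet.MyTableDataTable myTableDataTable;

private MyDataSet.MyTableDataTable GetMyTable()
{
    if (myTableDataTable == null)
    {
        myTableDataTable = new MyDataSet.MyTableDataTable();
        myTableTableAdapter.Fill(myTableDataTable);   // hits the database only once
    }
    return myTableDataTable;
}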
The choice really doesn't depend on C# itself. It comes down to a balance between:
How often do you use the data in your code?
Does the data ever change (and do you care if it does)?
What's the relative (time) cost of getting the data again, compared to everything else your code does?
How much value do you put on performance, versus developer effort/time (for this particular application)?
As a general rule: for production applications, where the data doesn't change often, I would probably create the DataTable once and then hold onto the reference as you mention. I would also consider putting the data in a typed collection/list/dictionary, instead of the generic DataTable class, if nothing else because it's easier to let the compiler catch my typing mistakes.
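For example, a sketch that projects the filled table into a typed list (the Customer class and the column names are invented for illustration):

// A small DTO somewhere in your project:
public class Customer
{
    public int Id { get; set; }
    public string Name { get; set; }
}

// Inside the method that loads the data (needs System.Data.DataSetExtensions):
List<Customer> customers = myTableDataTable.AsEnumerable()
    .Select(row => new Customer
    {
        Id = row.Field<int>("Id"),
        Name = row.Field<string>("Name")
    })
    .ToList();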
For a simple utility you run for yourself that "starts, does its thing and ends", it's probably not worth the effort.
You are asking about Windows CE. In that particular case, I would most likely do the query only once and hold onto the results. Mobile OSs have extra constraints on battery and space that desktop software doesn't have; basically, a mobile OS makes bullet #4 much more important.
Every time you add another retrieval call to SQL, you call into external libraries more often, which means you are probably running longer, allocating and releasing memory more often (which adds fragmentation), and possibly causing the database to be re-read from flash memory. It's most likely a lot better to hold onto the data once you have it, assuming that you can (see bullet #2).
It's easier to figure out the answer to this question when you think about datasets as being a "session" of data. You fill the datasets; you work with them; and then you put the data back or discard it when you're done. So you need to ask questions like this:
How current does the data need to be? Do you always need to have the very very latest, or will the database not change that frequently?
What are you using the data for? If you're just using it for reports, then you can easily fill a dataset, run your report, then throw the dataset away, and next time just make a new one. That'll give you more current data anyway.
Just how much data are we talking about? You've said you're working with a relatively small dataset, so there's not a major memory impact if you load it all in memory and hold it there forever.
Since you say it's a single-user app without a lot of data, I think you're safe loading everything in at the beginning, using it in your datasets, and then updating on close.
The main thing you need to be concerned with in this scenario is: What if the app exits abnormally, due to a crash, power outage, etc.? Will the user lose all his work? But as it happens, datasets are extremely easy to serialize, so you can fairly easily implement a "save every so often" procedure to serialize the dataset contents to disk so the user won't lose a lot of work.
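For example, a minimal autosave sketch using the DataSet's built-in XML support (the myDataSet instance and the file name are placeholders):

// Save periodically (e.g. on a timer or after N edits):
myDataSet.WriteXml("autosave.xml", XmlWriteMode.WriteSchema);

// Restore on startup, if an autosave exists:
if (File.Exists("autosave.xml"))
{
    myDataSet.ReadXml("autosave.xml", XmlReadMode.ReadSchema);
}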