Why would reusing a DataContext have a negative performance impact? - c#

After a fair amount of research and some errors, I modified my code so that it creates a new DataContext each time the database is queried or data is inserted. And the database is queried frequently - for each of 250k transactions that are processed, the database is queried to obtain a customer id, department id, and category before the transaction is inserted.
So now I'm trying to optimize the code as it was only processing around 15 transactions a second. I removed some extraneous queries and added some indexes and got it up to 30/sec. I then figured that even though everyone says a DataContext is lightweight, it's got to cost something to create a new one 4 times per transaction, so I tried reusing the DataContext. I found, much to my surprise, that reusing the context caused performance to degrade to 10 transactions a second!
Why would this be the case? Is it because the DataContext caches the entities in memory and first searches through its in-memory list before querying the database? So that if, for example, I'm looking for the customer id (primary key) for the customer with name 'MCS' and the customer name column has a clustered index on it so that the database query is fast, the in-memory lookup will be slower?
And is it true that creating/disposing so many db connections could slow things down, or is this just another premature optimization? And if it is true, is there a way to reuse a DataContext but have it perform an actual database query for each linq-to-sql query?

Here's why re-using a DataContext is not a best practice, from the MSDN DataContext documentation:
The DataContext is the source of all entities mapped over a database connection. It tracks changes that you made to all retrieved entities and maintains an "identity cache" that guarantees that entities retrieved more than one time are represented by using the same object instance.
In general, a DataContext instance is designed to last for one "unit of work" however your application defines that term. A DataContext is lightweight and is not expensive to create. A typical LINQ to SQL application creates DataContext instances at method scope or as a member of short-lived classes that represent a logical set of related database operations.
If you're re-using a DataContext for a large number of queries, your performance will degrade for a couple of possible reasons:
If the DataContext's in-memory identity cache becomes so large that it has to start writing to the pagefile, your performance will be bound to the disk's read speed, and at that point there is effectively no reason to use a cache at all.
The more identity objects there are in memory, the longer each save operation takes.
Essentially what you're doing is violating the UoW principle for the DataContext class.
Opening a database connection does have some overhead, but keeping a connection open for a long period of time (which often also means locking a table) is less preferable than opening and closing connections quickly.
Another link which may or may not help you from MSDN:
How to: Reuse a Connection Between an ADO.NET Command and a DataContext (LINQ to SQL)
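To make the "unit of work" scoping concrete, here is a minimal sketch of creating the context at method scope, as the documentation above suggests. TransactionsDataContext, Customers, Name, and CustomerId are hypothetical placeholders for whatever your LINQ to SQL designer generated.
static int GetCustomerId(string customerName)
{
    // Requires System.Linq and the designer-generated TransactionsDataContext.
    // New context per lookup; disposed as soon as this unit of work completes.
    using (var db = new TransactionsDataContext())
    {
        return db.Customers
                 .Where(c => c.Name == customerName)
                 .Select(c => c.CustomerId)
                 .Single();
    }
}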

Even with a clustered index, an in-memory lookup will always be faster than a database query (except in edge cases, like a 386 vs. a Cray), even if you factor out network-related delays.
I would guess the degradation has to do with the DataContext's handling of entities that it tracks: reusing a context will continually increase the number of tracked entities, and the call to SubmitChanges may end up requiring more time.
Again, that's a guess--but it's where I'd start looking.
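As a side note to the question's last paragraph (this is not something the answers above mention): if you do want to share one DataContext across many read-only lookups, LINQ to SQL lets you switch off change tracking, and with it the identity cache, via the ObjectTrackingEnabled property. The context then becomes read-only, so it cannot submit changes, and the flag must be set before the first query. TransactionsDataContext is again a hypothetical generated context.
using (var db = new TransactionsDataContext())
{
    db.ObjectTrackingEnabled = false;   // must be set before any query is executed

    // Repeated lookups on the same context now hit the database without
    // accumulating tracked entities; submitting changes is unavailable in this mode.
    var customerId = db.Customers
                       .Where(c => c.Name == "MCS")
                       .Select(c => c.CustomerId)
                       .Single();
}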

Not exactly on point here, but have you considered some sort of application-level cache to look up the customer id, department id, and category? It's not clear from your post how many of these entities exist in your system, or what is involved in querying to obtain them.
However, as an example, if you have one million categories in your system and you need to look up their Id by category name, keeping a name/Id dictionary in memory at all times will save you a trip to the database for every transaction you process. This could massively improve performance (it assumes a few things, such as new categories not being added regularly). As a general rule, round trips to the database are expensive relative to in-memory operations.
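As a rough sketch of that idea, assuming a hypothetical LINQ to SQL context (TransactionsDataContext) with a Categories table exposing Name and CategoryId, the lookup table can be built once before the 250k-transaction loop starts:
using System.Collections.Generic;
using System.Linq;

static Dictionary<string, int> LoadCategoryLookup()
{
    using (var db = new TransactionsDataContext())
    {
        // One round trip; every lookup afterwards is an in-memory dictionary hit.
        return db.Categories.ToDictionary(c => c.Name, c => c.CategoryId);
    }
}

// Inside the processing loop:
//   int categoryId = categoryLookup[transaction.CategoryName];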

You would have to profile everything end-to-end and see where your time is really being spent.
A clustered index is not necessarily the fastest if a row is wide. The fastest would probably be a covering non-clustered index, but that's really beside the point.
I would expect that to get more performance, you're probably going to have to jettison some of the framework, if you aren't really using the capabilities. If you are using the capabilities - well, that's what you are paying for...

Related

What is more advisable for an e-commerce website when it comes to displaying a specific product?

To query the database every time, using a WHERE clause?
SELECT * FROM tblProduct WHERE productID = #productID
OR
To filter a product list that has been put into the cache?
DataTable dtProducts = HttpContext.Current.Cache["CachedProductList"] as DataTable;
DataView dvProduct = dtProducts.DefaultView;
dvProduct.RowFilter = String.Format("[productID] = {0}", iProductID);
Please share your opinion. Thanks in advance.
Performance is very dependent on your data and how you use it. The only way to know for sure what works is to benchmark.
Decide to cache only when database performance does not meet your requirements.
When you cache data, you add a lot of overhead in making sure it stays up to date.
SQL Server does not read from disk every time you fire a query; it keeps frequently accessed data in memory. Before you decide to cache, know the caching mechanisms your database already uses. Using a stored procedure also encourages query plan reuse.
Caching data, especially through an in-memory mechanism like HttpContext.Current.Cache, is (almost) always going to be faster than going back to the database. Going to the database requires establishing network connections, then the database has to do I/O, and so on, whereas the cache just hands you objects already in memory. That said, there are a number of things you have to take into account:
The ASP.NET runtime cache is not distributed. If you will be running this code on multiple nodes, you have to decide whether you're okay with different nodes potentially having different versions of the cached data.
Caches can be told to hold onto data for as long as you want them to, as short as just a few minutes and as long as forever. You have to take into consideration how long the data is going to remain unchanged when deciding how long to cache it. Product data probably doesn't change more often than once a day, so it's a very viable candidate for caching.
Be aware though that the cache time limits you set are not absolutes; objects can be evicted from the cache because of memory limits or when a process/app pool recycles.
As pointed out above, DataTable is not a good object to cache; it's very bulky and expensive to serialize. A list of custom classes is a much better choice from a performance standpoint.
I would say as a general rule of thumb, if you need a set of data more frequently than a few times an hour and it changes less frequently than every few hours, it would be better to pull the list from the database, cache it for a reasonable amount of time, and retrieve it by a filter in code. But that's a general rule; this is the kind of thing that's worth experimenting with in your particular environment.
200,000 objects is a lot of data to put into a cache, but it's also a lot of work for the database if you have to retrieve it frequently. Perhaps there's some subset of it that would be better to cache, and a different, less frequently used subset that could be retrieved every time it's needed. As I said, experiment!
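To make the "cache a list of custom classes and filter in code" suggestion concrete, here is a sketch using the ASP.NET runtime cache. Product, ProductId, and GetProductsFromDatabase are hypothetical placeholders, and the one-hour expiration is an arbitrary example value.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.Caching;

static Product GetProduct(int productId)
{
    var products = HttpContext.Current.Cache["ProductList"] as List<Product>;
    if (products == null)
    {
        products = GetProductsFromDatabase();          // one trip to the database
        HttpContext.Current.Cache.Insert(
            "ProductList", products, null,
            DateTime.UtcNow.AddHours(1),               // absolute expiration (example value)
            Cache.NoSlidingExpiration);
    }
    // Pure in-memory filter; no database involved on a cache hit.
    return products.FirstOrDefault(p => p.ProductId == productId);
}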
I would prefer the first method. Having 20000 rows in cache does not sound good to me.

Entity Framework: remove object from context, but not from database

I am working on a batch process which dumps ~800,000 records from a slow legacy database (1.4-2 ms per record fetch time... it adds up) into MySQL, which can perform a little faster. To optimize this, I have been loading all of the MySQL records into memory, which puts usage at about 200 MB. Then I start dumping from the legacy database and updating the records.
Originally, when this finished updating the records I would call SaveContext, which would make my memory usage jump from ~500-800 MB to 1.5 GB. Very soon I would get out-of-memory exceptions (the virtual machine this runs on has 2 GB of RAM), and even if I were to give it more RAM, 1.5-2 GB is still a little excessive; that would just be putting a band-aid on the problem. To remedy this, I started calling SaveContext every 10,000 records, which helped things along a bit. Since I was using delegates to fetch the data from the legacy database and update it in MySQL, I didn't take too horrible a performance hit: after the 5-second or so wait while it was saving, it would run through the in-memory updates for the 3,000 or so records that had backed up. However, the memory usage still keeps going up.
Here are my potential issues:
The data comes out of the legacy database in any order, so I can't chunk the updates and periodically release the ObjectContext.
If I don't grab all of the data out of MySQL beforehand and instead look it up record by record during the update process, it is incredibly slow. Instead I grab it all beforehand, load it into a dictionary indexed by the primary key, and remove records from the dictionary as I update the data.
One possible solution I thought of is to somehow free the memory being used by entities that I know I will never touch again since they have already been updated (like clearing the cache, but only for a specific item), but I don't know if that is even possible with Entity Framework.
Does anyone have any thoughts?
You can call the Detach method on the context passing it the object you no longer need:
http://msdn.microsoft.com/en-us/library/system.data.objects.objectcontext.detach%28v=vs.90%29.aspx
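A rough sketch of how that fits the 10,000-record batching already described in the question, assuming an ObjectContext-based model; the entity type, the lookup dictionary, and the update helper are made-up names:
var processed = new List<LegacyRecordEntity>();    // hypothetical entity type

foreach (var legacyRecord in FetchFromLegacyDatabase())
{
    var entity = mysqlRecords[legacyRecord.Key];    // the pre-loaded dictionary from the question
    ApplyUpdate(entity, legacyRecord);              // hypothetical update helper
    processed.Add(entity);

    if (processed.Count == 10000)
    {
        context.SaveChanges();
        foreach (var e in processed)
            context.Detach(e);                      // stop tracking entities we won't touch again
        processed.Clear();
    }
}
Any remainder left in processed after the loop would need one final save-and-detach pass.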
I'm wondering if your best bet isn't another tool as previously suggested or just forgoing the use of Entity Framework. If you instead do the code without an ORM, you can:
Tune the SQL statements to improve performance
Easily control, and change, the scope of transactions to get the best performance.
Batch the updates so that a single call to the server performs multiple updates instead of them being executed one at a time (see the sketch below).
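A sketch of the second and third points, going straight to ADO.NET with the MySQL provider: one open connection, one explicit transaction, and a single reused parameterized command. Table, column, and variable names are made up, and whether the provider can also collapse the statements into fewer round trips is provider-specific.
using MySql.Data.MySqlClient;   // assumes MySQL Connector/NET

using (var conn = new MySqlConnection(connectionString))
{
    conn.Open();
    using (var tx = conn.BeginTransaction())
    using (var cmd = new MySqlCommand(
        "UPDATE records SET value = @value WHERE id = @id", conn, tx))
    {
        cmd.Parameters.Add("@value", MySqlDbType.VarChar);
        cmd.Parameters.Add("@id", MySqlDbType.Int32);

        foreach (var record in updates)
        {
            cmd.Parameters["@value"].Value = record.Value;
            cmd.Parameters["@id"].Value = record.Id;
            cmd.ExecuteNonQuery();
        }

        tx.Commit();   // one commit for the whole batch instead of one per row
    }
}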

C# data handling design pattern: store objects in the DB via an ORM, or work directly with the database?

Consider an application that stores inventory for a store, which consists of hundreds of item types and quantities of each item. I currently have a database mapped out which can handle this, and it was natural to design given my background in procedural programming. However, I am now working in an OOP language (C#) and am considering storing the inventory and other entities (branch offices, employees, suppliers) as CLR objects, grouped in ObservableCollections, and then persisted to the database via an ORM such as NHibernate.
My concern is that having ObservableCollections with hundreds or thousands of items in memory at all times will be a resource and performance barrier. I am also worried about potential data loss in the event of equipment failure or a power outage. As the system will be recording financial transactions (sales), the reliability of the database is rather important. Specifically, having all changes/sales in the database at the time of the transaction, as opposed to whenever the ORM persists back, is important to me.
Should I work directly with the database, or should I work with objects and let the ORM handle the storage?
My concern is that having ObservableCollections with hundreds or thousands of items in memory at all times will be a resource and performance barrier
Depends how you work with them.
I have a service keeping about a quarter million items in memory by string key. I do up to around 50,000 updates on them per second, then stream the updates out to the database - not as updates but as new rows with a timestamp (tracking change over time).
It really depends.
In general this is not an easy question - MOST of the time it is easier to do a SQL query, but an in-memory cache can really BOOST performance. Yes, it uses memory. WHO CARES - workstations can have 64 GB of memory these days. The question is whether it makes sense from a performance standpoint.
Specifically, having all changes / sales in the database at the time of transaction, as opposed to whenever the ORM persists back, is important to me.
LOGICAL ERROR. An ORM will persist them as part of the ORM transaction. Naturally the cache would not be "one ORM transaction" but would be updated independently of transactions. If an ORM gets in your way here, it is either the world's worst ORM or your application architecture is broken.
All updates to financial data should happen in a database-level transaction, and every ORM I know of supports that.
Should I work directly with the database, or should I work with objects and let the ORM handle the storage?
Depends on requirements. I love ORMs, but lately I have gone to a service-oriented architecture where I update in-memory representations and then stream the transactions out to the database. But then I deal with incoming streaming data and no decision logic (the data comes from an external source and HAS to be processed, no error possible, no loss possible - if something blows up on my end, the next start of the app gets the same data again, until I say I have processed it).
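A loose sketch of that "update in memory, stream the changes out" shape, under the assumption of a dedicated writer thread; Item and WriteChangeRow are invented names:
using System;
using System.Collections.Concurrent;

var items = new ConcurrentDictionary<string, Item>();                        // in-memory working set
var changeLog = new BlockingCollection<(string Key, Item Value, DateTime Stamp)>();

// Producer: apply the update in memory and queue it for persistence.
void Apply(string key, Item update)
{
    items[key] = update;
    changeLog.Add((key, update, DateTime.UtcNow));
}

// Consumer (own thread): append timestamped rows rather than updating in place.
void WriterLoop()
{
    foreach (var change in changeLog.GetConsumingEnumerable())
        WriteChangeRow(change.Key, change.Value, change.Stamp);
}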

which is better... in-memory search or database access?

We're working on an online system right now, and I'm confused about when to use in-memory search and when to use database search. Can someone please help me figure out the factors to be considered when it comes to searching records?
One factor is that if you need to go through the same results over and over, be sure to cache them in memory. This becomes an issue when you're using LINQ to SQL or Entity Framework, ORMs that support deferred execution.
So if you have an IQueryable<SomeType> that you need to go through multiple times, make sure you materialize it with a ToList() before firing up multiple foreach loops.
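A quick sketch of the difference, with db.Orders, Order, and Status standing in for any LINQ to SQL or Entity Framework query and model:
// Deferred execution: each foreach over the IQueryable re-runs the query.
IQueryable<Order> pending = db.Orders.Where(o => o.Status == "Pending");

// Materialize once; subsequent passes are purely in-memory.
List<Order> pendingList = pending.ToList();

foreach (var order in pendingList) { /* first pass */ }
foreach (var order in pendingList) { /* second pass - no extra query */ }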
It depends on the situation, though I generally prefer in-memory search when possible.
However, it depends on the context: for example, if records can get updated between one search and another, and you need the most up-to-date record at the time of the search, you obviously need a database search.
If the size of the recordset (data table) that you need to store in memory is huge, it may be better to search directly on the database.
However, keep in mind that if you can, and if performance is important, loading the data into a DataTable and then searching and filtering it with LINQ, for example, can improve the performance of the search itself.
Another thing to keep in mind is the performance of the database server versus the application server: if the database server is fast enough on the search query, maybe you don't need to cache in memory in the application, and you can avoid one step. Keep in mind that caching for in-memory search moves computational load from the database to the application server...
An absolute answer to your question is not possible; it depends on your context...
It depends on the number of records. If the number of records is small, then it's better to keep them in memory, i.e. cache the records. Also, if the records get queried frequently, then go for the memory option.
But if the record count or record size is too large, then it's better to go for the database search option.
Basically it depends on how much memory you have on your server...

Best-practice caching: monolithic vs. fine-grained cache data

In a distributed caching scenario, is it generally advised to use or avoid monolithic objects stored in cache?
I'm working with a service backed by an EAV schema, so we're putting caching in place to minimize the perceived performance deficit imposed by EAV when retrieving all primary records and respective attribute collections from the database. We will prime the cache on service startup.
We don't have particularly frequent calls for all products -- clients call for differentials after they first populate their local cache with the object map. In order to perform that differential, the distributed cache will need to reflect changes to individual records in the database that are performed on an arbitrary basis, and to be processed for changes as differentials are called for by clients.
First thought was to use a List or Dictionary to store the records in the distributed cache -- get the whole collection, manipulate or search it in-memory locally, put the whole collection back into the cache. Later thinking however led to the idea of populating the cache with individual records, each keyed in a way to make them individually retrievable from/updatable to the cache. This led to wondering which method would be more performant when it comes to updating all data.
We're using Windows Server AppFabric, so we have a BulkGet operation available to us. I don't believe there's any notion of a bulk update however.
Is there prevailing thinking as to distributed cache object size? If we had more requests for all items, I would have concerns about network bandwidth, but, for now at least, demand for all items should be fairly minimal.
And yes, we're going to test and profile each method, but I'm wondering if there's anything outside the current scope of thinking to consider here.
So in our scenario, it appears that monolithic cache objects are going to be preferred. With big fat pipes in the datacenter, it takes virtually no perceptible time for ~30 MB of serialized product data to cross the wire. Using a Dictionary<TKey, TValue> we are able to quickly find products in the collection in order to return, or update, the individual item.
With thousands of individual entities, all well under 1 MB, in the cache, bulk operations simply take too long. Too much overhead, latency in the network operations.
Edit: we're now considering maintaining both the entities and the monolithic collection of entities, because with the monolith, it appears that retrieving individual entities becomes a fairly expensive process with a production dataset.
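For reference, the monolithic pattern from the original answer boils down to roughly this, where cache is assumed to be the AppFabric DataCache client and Product, productId, and newPrice are placeholders:
// Pull the whole product map as a single cache entry.
var products = (Dictionary<int, Product>)cache.Get("AllProducts");

// Find or update an individual item locally via the dictionary key...
products[productId].Price = newPrice;

// ...then push the entire collection back as one cache operation.
cache.Put("AllProducts", products);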
