Best-practice caching: monolithic vs. fine-grained cache data - c#

In a distributed caching scenario, is it generally advised to use or avoid monolithic objects stored in cache?
I'm working with a service backed by an EAV schema, so we're putting caching in place to minimize the perceived performance deficit imposed by EAV when retrieving all primary records and respective attribute collections from the database. We will prime the cache on service startup.
We don't have particularly frequent calls for all products -- clients call for differentials after they first populate their local cache with the object map. In order to perform that differential, the distributed cache will need to reflect changes made to individual records in the database on an arbitrary basis, and those changes need to be picked up when clients request differentials.
First thought was to use a List or Dictionary to store the records in the distributed cache -- get the whole collection, manipulate or search it in-memory locally, put the whole collection back into the cache. Later thinking however led to the idea of populating the cache with individual records, each keyed in a way to make them individually retrievable from/updatable to the cache. This led to wondering which method would be more performant when it comes to updating all data.
We're using Windows Server AppFabric, so we have a BulkGet operation available to us. I don't believe there's any notion of a bulk update however.
Is there prevailing thinking as to distributed cache object size? If we had more requests for all items, I would have concerns about network bandwidth, but, for now at least, demand for all items should be fairly minimal.
And yes, we're going to test and profile each method, but I'm wondering if there's anything outside the current scope of thinking to consider here.

So in our scenario, it appears that monolithic cache objects are going to be preferred. With big fat pipes in the datacenter, it takes virtually no perceptible time for ~30 MB of serialized product data to cross the wire. Using a Dictionary<TKey, TValue> we are able to quickly find products in the collection in order to return, or update, the individual item.
With thousands of individual entities in the cache, all well under 1 MB, bulk operations simply take too long. There is too much overhead and latency in the network operations.
Edit: we're now considering maintaining both the entities and the monolithic collection of entities, because with the monolith, it appears that retrieving individual entities becomes a fairly expensive process with a production dataset.
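For illustration, here is a minimal sketch of the two access patterns against the AppFabric client API. The cache name "ProductCache", the Product type, and the key scheme are assumptions for the example, not our actual code.

// Assumes the Microsoft.ApplicationServer.Caching client assemblies and a configured cache cluster.
DataCacheFactory factory = new DataCacheFactory();
DataCache cache = factory.GetCache("ProductCache");

// Monolithic approach: one big dictionary keyed by product id.
var allProducts = (Dictionary<int, Product>)cache.Get("AllProducts");
Product product = allProducts[productId];        // in-memory lookup after a single network round trip
allProducts[productId] = updatedProduct;
cache.Put("AllProducts", allProducts);           // the whole collection crosses the wire again

// Fine-grained approach: one cache entry per product, individually retrievable and updatable.
cache.Put("Product:" + updatedProduct.Id, updatedProduct);
Product single = (Product)cache.Get("Product:" + productId);
var several = cache.BulkGet(new[] { "Product:1", "Product:2", "Product:3" });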


A dictionary that can save its less frequently accessed elements to disk

In my application I use a dictionary (supporting adding, removing, updating and lookup) where both keys and values are or can be made serializable (values can possibly be quite large object graphs). I came to a point when the dictionary became so large that holding it completely in memory started to occasionally trigger OutOfMemoryException (sometimes in the dictionary methods, and sometimes in other parts of code).
After an attempt to completely replace the dictionary with a database, performance dropped down to an unacceptable level.
Analysis of the dictionary usage patterns showed that usually a smaller part of values are "hot" (are accessed quite often), and the rest (a larger part) are "cold" (accessed rarely or never). It is difficult to say when a new value is added if it will be hot or cold, moreover, some values may migrate back and forth between hot and cold parts over time.
I think that I need an implementation of a dictionary that is able to flush its cold values to a disk on a low memory event, and then reload some of them on demand and keep them in memory until the next low memory event occurs when their hot/cold status will be re-assessed. Ideally, the implementation should neatly adjust the sizes of its hot and cold parts and the flush interval depending on the memory usage profile in the application to maximize overall performance. Because several instances of a dictionary exist in the application (with different key/value types), I think, they might need to coordinate their workflows.
Could you please suggest how to implement such a dictionary?
Compile for 64 bit, deploy on 64 bit, add memory. Keep it in memory.
Before you grow your own you may alternatively look at WeakReference http://msdn.microsoft.com/en-us/library/ms404247.aspx. It would of course require you to rebuild those objects that were reclaimed, but one should hope that those which are reclaimed are not used much. It comes with the caveat that its own guidelines state to avoid using weak references as an automatic solution to memory management problems and instead to develop an effective caching policy for handling your application's objects.
Of course you can ignore that guideline and effectively work your code to account for it.
You can implement a caching policy so that, upon expiry, items are saved to the database, and on fetch they are retrieved from the database and cached again. Use a sliding expiry, of course, since you are concerned with keeping the most-used items.
Do remember, however, that most-used vs. heaviest is a trade-off. Losing an object 10 times a day that takes 5 minutes to restore would annoy users much more than losing an object 10,000 times which took just 5 ms to restore.
And someone above mentioned the web cache. It does automatic memory management with callbacks, as noted; it depends on whether you want to lug that one around in your apps.
And...last but not least, look at a distributed cache. With sharding you can split that big dictionary across a few machines.
Just an idea - I've never done this and have never used System.Runtime.Caching:
Implement a wrapper around MemoryCache which will:
Add items with an eviction callback specified. The callback will place evicted items into the database.
Fetch an item from the database and put it back into MemoryCache if the item is absent from MemoryCache during retrieval.
If you expect a lot of requests for items missing from both the database and memory, you'll probably also need to implement either a Bloom filter or a cache of the keys of present/missing items.
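A rough sketch of that wrapper idea follows. SaveToDatabase and LoadFromDatabase are placeholders for whatever persistence you use; this is an illustration under those assumptions, not tested code.

using System;
using System.Runtime.Caching;

public class DiskBackedDictionary<TValue> where TValue : class
{
    private readonly MemoryCache _cache = MemoryCache.Default;

    public void Add(string key, TValue value)
    {
        var policy = new CacheItemPolicy
        {
            SlidingExpiration = TimeSpan.FromMinutes(10),
            // When an item is evicted (memory pressure, expiry, or replacement), persist it.
            RemovedCallback = args => SaveToDatabase(args.CacheItem.Key, (TValue)args.CacheItem.Value)
        };
        _cache.Set(key, value, policy);
    }

    public TValue Get(string key)
    {
        var value = (TValue)_cache.Get(key);
        if (value == null)
        {
            // Cache miss: fall back to the database and re-cache the item.
            value = LoadFromDatabase(key);
            if (value != null) Add(key, value);
        }
        return value;
    }

    private void SaveToDatabase(string key, TValue value) { /* persistence is up to you */ }
    private TValue LoadFromDatabase(string key) { return null; /* placeholder */ }
}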
I had a similar problem in the past.
The concept you are looking for is a read-through cache with an LRU (Least Recently Used) eviction queue.
Is there any LRU implementation of IDictionary?
As you add things to your dictionary, keep track of which ones were used least recently; remove those from memory and persist them to disk.
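A bare-bones sketch of the LRU bookkeeping, using only BCL collections. Eviction here just raises an event that you could wire up to your disk or database persistence; thread safety and tuning are left out.

using System;
using System.Collections.Generic;

public class LruDictionary<TKey, TValue>
{
    private readonly int _capacity;
    private readonly Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>> _map =
        new Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>>();
    private readonly LinkedList<KeyValuePair<TKey, TValue>> _recency =
        new LinkedList<KeyValuePair<TKey, TValue>>();

    public event Action<TKey, TValue> Evicted;    // hook this up to persist cold items to disk

    public LruDictionary(int capacity) { _capacity = capacity; }

    public void Set(TKey key, TValue value)
    {
        LinkedListNode<KeyValuePair<TKey, TValue>> node;
        if (_map.TryGetValue(key, out node)) _recency.Remove(node);
        node = new LinkedListNode<KeyValuePair<TKey, TValue>>(new KeyValuePair<TKey, TValue>(key, value));
        _recency.AddFirst(node);                  // most recently used sits at the front
        _map[key] = node;

        if (_map.Count > _capacity)
        {
            var coldest = _recency.Last;          // least recently used sits at the back
            _recency.RemoveLast();
            _map.Remove(coldest.Value.Key);
            if (Evicted != null) Evicted(coldest.Value.Key, coldest.Value.Value);
        }
    }

    public bool TryGet(TKey key, out TValue value)
    {
        LinkedListNode<KeyValuePair<TKey, TValue>> node;
        if (_map.TryGetValue(key, out node))
        {
            _recency.Remove(node);                // touching an item moves it back to the front
            _recency.AddFirst(node);
            value = node.Value.Value;
            return true;
        }
        value = default(TValue);
        return false;
    }
}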

What is more advisable for an e-commerce website when it comes to displaying a specific product?

To query the database every time, using a WHERE clause?
SELECT * FROM tblProduct WHERE productID = #productID
OR
To filter a product list that has been put into the cache?
// Retrieve the cached product list and filter it in memory.
DataTable dtProducts = HttpContext.Current.Cache["CachedProductList"] as DataTable;
DataView dvProduct = dtProducts.DefaultView;
dvProduct.RowFilter = String.Format("[productID] = {0}", iProductID);
Please share your opinion. Thanks in advance.
Performance is very dependent on your data and how you use it. The only way to know for sure what works is to benchmark.
Decide to cache only when your db performance does not meet the performance you require.
When you cache data, you add a lot of overhead in making sure it is up-to-date.
SQL Server does not read from disk every time you fire a query; it caches the data used by frequent queries. Before you decide to cache, know the caching mechanisms used by your database. Using a stored procedure would allow you to cache the query plan too.
Caching data, especially through an in-memory mechanism like HttpContext.Current.Cache is (almost) always going to be faster than going back to the database. Going to the database requires establishing network connections, then the database has to do I/O, etc., whereas using the cache you just use objects in memory. That said, there are a number of things you have to take into account:
The ASP.NET runtime cache is not distributed. If you will be running this code on multiple nodes, you have to decide if you're okay with different nodes potentially having different versions of the cached data.
Caches can be told to hold onto data for as long as you want them to, as short as just a few minutes and as long as forever. You have to take into consideration how long the data is going to remain unchanged when deciding how long to cache it. Product data probably doesn't change more often than once a day, so it's a very viable candidate for caching.
Be aware though that the cache time limits you set are not absolutes; objects can be evicted from the cache because of memory limits or when a process/app pool recycles.
As pointed out above, DataTable is not a good object to cache; it's very bulky and expensive to serialize. A list of custom classes is a much better choice from a performance standpoint.
I would say as a general rule of thumb, if you need a set of data more frequently than a few times an hour and it changes less frequently than every few hours, it would be better to pull the list from the database, cache it for a reasonable amount of time, and retrieve it by a filter in code. But that's a general rule; this is the kind of thing that's worth experimenting with in your particular environment.
200,000 objects is a lot of data to put into a cache, but it's also a lot of work for the database if you have to retrieve it frequently. Perhaps there's some subset of it that would be better to cache, and a different, less frequently used subset that could be retrieved every time it's needed. As I said, experiment!
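To illustrate the "list of custom classes" point above, here is a hedged sketch; the Product class, the cache key, and LoadProductsFromDatabase are assumptions for the example.

// Requires System.Linq and System.Web.Caching; Product is a simple POCO (e.g. ProductID, Name, Price).
List<Product> products = HttpContext.Current.Cache["CachedProductList"] as List<Product>;
if (products == null)
{
    products = LoadProductsFromDatabase();   // placeholder for your data-access call
    HttpContext.Current.Cache.Insert("CachedProductList", products, null,
        DateTime.Now.AddHours(1), System.Web.Caching.Cache.NoSlidingExpiration);
}
Product product = products.FirstOrDefault(p => p.ProductID == iProductID);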
I would prefer the first method. Having 20000 rows in cache does not sound good to me.

C# data handling design pattern: store objects in the DB via an ORM, or work directly with the database?

Consider an application that stores inventory for a store, which consists of hundreds of item types and quantities of each item. I currently have a database mapped out which handles this, and which was natural to design given my background in procedural programming. However, I am now working in an OOP language (C#) and I am considering storing the inventory and other entities (branch offices, employees, suppliers) as CLR objects, grouped in ObservableCollections, and then persisted to the database via an ORM such as NHibernate.
My concern is that having ObservableCollections with hundreds or thousands of items in memory at all times will be a resource and performance barrier. I am also worried about potential data loss in the event of equipment failure or a power outage. As the system will be recording financial transactions (sales), the reliability of a database is rather important. Specifically, having all changes/sales in the database at the time of the transaction, as opposed to whenever the ORM persists them back, is important to me.
Should I work directly with the database, or should I work with objects and let the ORM handle the storage?
My concern is that having ObservableCollections with hundreds or thousands of items in memory at all times will be a resource and performance barrier
Depends how you work with them.
I have a service keeping about a quarter of a million items in memory by string key. I do up to around 50,000 updates on them per second, then stream the updates out to the database - not as updates but as new rows with a timestamp (tracking change over time).
It really depends.
In general this is not an easy question - MOST of the time it is easier to do a SQL query, but an in-memory cache can really BOOST performance. Yes, it uses memory. WHO CARES - workstations can have 64 GB of memory these days. The question is whether it makes sense from a performance standpoint.
Specifically, having all changes/sales in the database at the time of the transaction, as opposed to whenever the ORM persists them back, is important to me.
LOGICAL ERROR. An ORM will persist them as part of the ORM transaction. Naturally the cache would not be "one ORM transaction" but would be updated independently of transactions. If an ORM gets in your way here, it is either the world's worst ORM or your application architecture is broken.
All updates to financial data should happen in a database-level transaction, and every ORM I know of supports that.
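For example, with NHibernate a sale can be written inside an explicit database transaction. A short sketch; the sessionFactory instance and the sale entity are assumed to exist in your application.

// One session, one transaction: the sale is durable in the database at commit time.
using (ISession session = sessionFactory.OpenSession())
using (ITransaction tx = session.BeginTransaction())
{
    session.Save(sale);   // written as part of this transaction, not "whenever the ORM flushes later"
    tx.Commit();
}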
Should I work directly with the database, or should I work with objects and let the ORM handle the storage?
Depends on requirements. I love ORMs, but lately I have gone to a service-oriented architecture where I update in-memory representations and then stream the transactions out to the database. But then I deal with incoming streaming data and no decision logic (data comes from an external source and HAS to be processed, no errors possible, no loss possible - if something blows up on my end, the next start of the app gets the same data again, until I say I have processed it).

When is it recommended to use second level cache in NHibernate

I read this question and answer explaining that using the second-level cache on 50,000 rows isn't efficient.
So with what amount of data is the NHibernate second-level cache helpful, and when is it not helpful, or even detrimental to performance?
For example: if I have 3,500 Employees (which I still don't...), would it be a good idea to use the second-level cache?
You should mainly use it for 'static' data. An example is a website that sells flight tickets via a shopping site. The shopping bag, orders and order lines are volatile data; those are not cached.
But location data like airports, airline data and all the connected names in different languages are 'static'. Those can be cached for a long time and will then not cause round trips to the database every time your app needs them.
So, distinguish between your static and volatile data.
What exactly to cache, what not, and for how long always depends on how your application is used, of course. Use different cache regions with different expiry times when needed.
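As a hedged illustration, assuming Fluent NHibernate mappings and hypothetical Airport and Order entities (with hbm.xml the equivalent is a <cache usage="read-only" /> element inside the class mapping):

public class AirportMap : ClassMap<Airport>
{
    public AirportMap()
    {
        Id(x => x.Id);
        Map(x => x.Code);
        Map(x => x.Name);
        Cache.ReadOnly();        // static reference data: safe to keep in the second-level cache
    }
}

public class OrderMap : ClassMap<Order>
{
    public OrderMap()
    {
        Id(x => x.Id);
        References(x => x.Customer);
        HasMany(x => x.OrderLines);
        // no Cache call: volatile data goes to the database every time
    }
}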
Unfortunately, the answer to that kind of question is not trivial.
Caches will almost always improve your performance when data is read more than it's written, but the only way to see if it helps in your particular case is profiling.
Also, it's never an all-or-nothing proposition. You will likely benefit from caching some entities and some queries. With different lifetimes, usages, etc.

Why would reusing a DataContext have a negative performance impact?

After a fair amount of research and some errors, I modified my code so that it creates a new DataContext each time the database is queried or data is inserted. And the database is queried frequently - for each of 250k transactions that are processed, the database is queried to obtain a customer id, department id, and category before the transaction is inserted.
So now I'm trying to optimize the code as it was only processing around 15 transactions a second. I removed some extraneous queries and added some indexes and got it up to 30/sec. I then figured that even though everyone says a DataContext is lightweight, it's got to cost something to create a new one 4 times per transaction, so I tried reusing the DataContext. I found, much to my surprise, that reusing the context caused performance to degrade to 10 transactions a second!
Why would this be the case? Is it because the DataContext caches the entities in memory and first searches through its in-memory list before querying the database? So that if, for example, I'm looking for the customer id (primary key) for the customer with name 'MCS' and the customer name column has a clustered index on it so that the database query is fast, the in-memory lookup will be slower?
And is it true that creating/disposing so many db connections could slow things down, or is this just another premature optimization? And if it is true, is there a way to reuse a DataContext but have it perform an actual database query for each linq-to-sql query?
Here's why re-using a DataContext is not a best practice, from the MSDN DataContext documentation:
The DataContext is the source of all entities mapped over a database connection. It tracks changes that you made to all retrieved entities and maintains an "identity cache" that guarantees that entities retrieved more than one time are represented by using the same object instance.
In general, a DataContext instance is designed to last for one "unit of work" however your application defines that term. A DataContext is lightweight and is not expensive to create. A typical LINQ to SQL application creates DataContext instances at method scope or as a member of short-lived classes that represent a logical set of related database operations.
If you're re-using a DataContext for a large number of queries, your performance will degrade for a couple of possible reasons:
If the DataContext's in-memory identity cache becomes so large that it has to start writing to the page file, then your performance will be bound by disk speed and effectively there won't be any reason to use a cache at all.
The more identity objects there are in memory, the longer each save operation takes.
Essentially what you're doing is violating the UoW principle for the DataContext class.
Opening database connections does have some overhead associated with it, but keeping a connection open for a long period of time (which often also means locking a table) is less preferable than opening and closing them quickly.
Another link which may or may not help you from MSDN:
How to: Reuse a Connection Between an ADO.NET Command and a DataContext (LINQ to SQL)
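In practice the unit-of-work pattern the documentation describes looks roughly like this. MyDataContext, Customers, Transactions and newTransaction are placeholders for your own model, not a prescribed API.

// One DataContext per unit of work, disposed as soon as the work is done. Requires System.Linq.
using (var db = new MyDataContext(connectionString))
{
    int customerId = db.Customers
                       .Where(c => c.Name == "MCS")
                       .Select(c => c.CustomerID)
                       .Single();

    db.Transactions.InsertOnSubmit(newTransaction);
    db.SubmitChanges();
}   // the connection returns to the pool; the identity cache dies with the context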
Even with a clustered index, in-memory lookup will always be faster than a database query--except in edge cases, like a 386 vs. a Cray--even if you factor out network-related delays.
I would guess the degradation has to do with the DataContext's handling of entities that it tracks: reusing a context will continually increase the number of tracked entities, and the call to SaveChanges may end up requiring more time.
Again, that's a guess--but it's where I'd start looking.
Not exactly on point here, but have you considered some sort of application-level cache to look up the customer id, department id, and category? It's not clear from your post how many of these entities exist in your system, or what is involved in querying to obtain them.
However, as an example, if you have one million categories in your system, and you need to look up their Id by category name, keeping a name/Id dictionary in memory for lookups at all times will save you a trip to the database for every transaction you process. This could massively improve performance (this assumes a few things, like new categories not being added regularly). As a general rule, round trips to the database are expensive relative to in-memory operations.
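A sketch of that kind of lookup cache; the names here (MyDataContext, Categories, CategoryID) are hypothetical.

// Load once (on first use) and reuse for all 250k transactions. Requires System.Linq.
private static Dictionary<string, int> _categoryIdsByName;

private static int GetCategoryId(MyDataContext db, string categoryName)
{
    if (_categoryIdsByName == null)
    {
        _categoryIdsByName = db.Categories.ToDictionary(c => c.Name, c => c.CategoryID);
    }
    return _categoryIdsByName[categoryName];   // in-memory lookup, no database round trip
}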
You would have to profile everything end-to-end and see where your time is really being spent.
A clustered index is not necessarily the fastest if a row is wide. The fastest would probably be a covering non-clustered index, but that's really beside the point.
I would expect that to get more performance, you're probably going to have to jettison some of the framework, if you aren't really using the capabilities. If you are using the capabilities - well, that's what you are paying for...
