This question has been asked before, but not in the context of a multithreaded application.
So given a single-threaded application, a dictionary will find the value for a key in O(1) time. Given a database table with a primary key (clustered index) on column 'X', a search for the associated tuple by a key on column 'X' will find the tuple in O(log n) time. Add the in-memory benefit of the dictionary, and the dictionary wins hands down.
Given a highly parallel application (e.g. an async socket server) that relies on a common data structure (i.e., Dictionary vs. Database) to maintain application-wide state information (e.g. connected users), where say 50% of the accesses are reads and about 50% are updates/deletes/inserts, it does not seem so obvious to me that the Dictionary is better, for the following reasons:
1) To make the dictionary thread-safe for concurrent access, a locking mechanism must be used. lock() locks the entire dictionary, effectively allowing only one thread to access the data structure at a time. Even using ReaderWriterLockSlim will lock the entire dictionary when elevated to a write lock (a sketch of both approaches follows below).
2) Databases give the benefit of row-level locking when updating/deleting/inserting on a primary key.
3) Dictionaries are in-memory (faster), while database connections use sockets (slower).
So the question is: do the inherent row-level locking features of a relational database outweigh the benefits of in-memory Dictionary access?
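To make point 1 concrete, here is a rough sketch of the two locking approaches over a shared connected-users dictionary; the type and member names are only illustrative:

using System.Collections.Generic;
using System.Threading;

class ConnectedUsers
{
    private readonly Dictionary<int, string> _users = new Dictionary<int, string>();
    private readonly object _gate = new object();
    private readonly ReaderWriterLockSlim _rwLock = new ReaderWriterLockSlim();

    // Coarse-grained lock(): every reader and writer serializes on the same lock.
    public void AddWithLock(int id, string name)
    {
        lock (_gate) { _users[id] = name; }
    }

    public bool TryGetWithLock(int id, out string name)
    {
        lock (_gate) { return _users.TryGetValue(id, out name); }
    }

    // ReaderWriterLockSlim: readers can run concurrently, but a writer still blocks the whole dictionary.
    public void AddWithRwLock(int id, string name)
    {
        _rwLock.EnterWriteLock();
        try { _users[id] = name; }
        finally { _rwLock.ExitWriteLock(); }
    }

    public bool TryGetWithRwLock(int id, out string name)
    {
        _rwLock.EnterReadLock();
        try { return _users.TryGetValue(id, out name); }
        finally { _rwLock.ExitReadLock(); }
    }
}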
This is a good question, Paul. The row-level locking feature outweighs the benefits of in-memory dictionary access. The many additional capabilities of a relational database account for its widespread use in comparison to an in-memory dictionary object. If you wanted to create an in-memory table for faster data access with MySQL, you could use the MEMORY storage engine.
Searching for, say, just date-based information from an indexed date column can be easy and fast. Adding to that, relational databases - depending on which you use - allow for security by roles and users, business intelligence, etc. On a dictionary object you would have to build all the features readily available in many popular databases.
A well-tuned database can serve thousands of concurrent requests, which multi-threaded and disparate applications can benefit from. So before rolling your own, I'd recommend using a relational database engine. As always, your mileage may vary based on the complexity of the problem you hope to solve.
I need to cache information about user roles in ASP.NET Web API. I have decided to use the System.Web.Helpers.WebCache class. A role is a plain string about 40 characters long. Each user may have between 1 and 10 roles.
I am thinking of two ways to do this:
Use WebCache.Set(UserID, List<String>). Use the user id as the key and store the list of roles (strings) as the value. It's easy to retrieve.
Use a dictionary, where I use the userId as the key and the list of roles as the value, and then cache the dictionary itself. This way I am caching under only one key. When I retrieve this information, I first retrieve the dictionary and then use the user id to get the role information.
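To make it concrete, approach one would look roughly like this (GetRolesFromDb is just a placeholder for my real lookup, and the 20-minute sliding expiration is arbitrary):

using System.Collections.Generic;
using System.Web.Helpers;

public static class RoleCache
{
    // Approach one: one cache entry per user, keyed by the user id.
    public static List<string> GetRoles(string userId)
    {
        object cached = WebCache.Get(userId);          // returns the cached value or null
        var roles = cached as List<string>;
        if (roles == null)
        {
            roles = GetRolesFromDb(userId);            // placeholder for the real role lookup
            WebCache.Set(userId, roles, 20, slidingExpiration: true);
        }
        return roles;
    }

    // Hypothetical data-access helper, stubbed out here.
    private static List<string> GetRolesFromDb(string userId)
    {
        return new List<string> { "Reader" };
    }
}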
Questions:
Which approach is better? I like approach one as it's easy to use. Does it have any downsides?
The way I calculated the memory use of keeping these keys in the cache was to put the same amount of data (10 roles stored as strings) into a Notepad file and then check the file's size (using UTF-8 encoding). The size was about 500 bytes, and the size on disk was 4 KB. Then, if I have 200 users, I multiply 200 * 500 bytes to estimate the memory usage. Is this the right way to calculate it (I am OK with being approximately close)?
I prefer the approach of saving individual keys instead of saving the roles of all users as a single cache object.
Following are the reasons:
1) Creation is simple: when the user logs in, or at another appropriate moment, the cache is checked and, if empty, created for that user. There is no need to iterate through a dictionary object (or use LINQ) to get to that key's item.
2) When the user logs off, or at another appropriate moment, that user's cache object is destroyed completely, instead of only removing that particular key from a shared cached dictionary.
3) There is also no need to lock the object when multiple users try to access it at the same time, and that scenario will happen. Since an object is created per user, there is no contention on a shared object and no need for synchronization or a mutex. (A sketch of this per-user approach follows below.)
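A minimal sketch of the per-user approach, assuming a string userId and leaving the role-loading step to the caller:

using System.Collections.Generic;
using System.Web.Helpers;

public static class PerUserRoleCache
{
    // On login (or first access): create the entry for this user only if it is missing.
    public static void EnsureCached(string userId, List<string> roles)
    {
        if (WebCache.Get(userId) == null)
        {
            WebCache.Set(userId, roles);
        }
    }

    // On logoff: drop just this user's entry; nobody else's cache is touched.
    public static void Evict(string userId)
    {
        WebCache.Remove(userId);
    }
}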
1. Solution one is preferable. It is straightforward and appears to offer only advantages.
2. Your calculation makes sense for option 1 but not for option 2. A C# dictionary uses hashing and takes up more memory; for primitive, short data like this, the space taken by the hashes may be a significant relative increase.
The byte-level memory usage for this type of application is typically a secondary concern compared to maintainability and functionality. User roles are often core functionality with significant security implications, and as the project grows it becomes very important that the code is maintainable and secure.
Caching should be used exclusively as an optimization, and because this involves small amounts of data for a relatively small user base (~200 people), it would be much better to make your caching of these roles granular and easy to refetch.
According to the official documentation for this library:
Microsoft system.web.helpers.webcache
In general, you should never count on an item that you have cached to be in the cache
And because I'll assume that user roles define some fairly important functionality, it would be better to query for these roles as part of your Web API requests instead of storing them locally.
However, if you are dead set on using this cache and refetching whenever an item disappears, then, given your question, option one would be the preferable choice.
This is because a list takes less memory and in this case appears to be more straightforward, and I see no benefit to using a dictionary.
Dictionaries shine when you have large datasets and need speed, but for this scenario where all data is already being stored in memory and the data set is relatively small, a dictionary introduces complexity and higher memory requirements and not much else. Though the memory usage sounds negligible in either scenario on most modern devices and servers.
While a dictionary may sound compelling given your need to look up roles by user, the WebCache class already offers that functionality, and thus an additional dictionary loses its appeal.
Q1: Without knowing the actual usage of the cache items, it is difficult to draw a conclusion. Nonetheless, I think it all comes down to the design of the life span of those items. If you want to retire them all at once after a certain period and then query a new set of data, storing a ConcurrentDictionary that houses users and roles in WebCache is the easier solution to manage.
Otherwise, if you want to retire each entry individually according to some event, approach one seems a rather straightforward answer. Just be mindful: if you choose approach two, use ConcurrentDictionary instead of Dictionary, because the latter is not thread-safe.
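For illustration, a minimal sketch of approach two with a ConcurrentDictionary stored under a single cache key (the key name is a placeholder and the expiration policy is left at the defaults):

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Web.Helpers;

public static class RoleMapCache
{
    private const string CacheKey = "UserRoleMap";   // single key for the whole map (illustrative name)

    // Fetch the shared map, creating and caching it on first use.
    private static ConcurrentDictionary<string, List<string>> GetMap()
    {
        object cached = WebCache.Get(CacheKey);
        var map = cached as ConcurrentDictionary<string, List<string>>;
        if (map == null)
        {
            map = new ConcurrentDictionary<string, List<string>>();
            WebCache.Set(CacheKey, map);
        }
        return map;
    }

    // Thread-safe read/insert of a single user's roles.
    public static List<string> GetOrAddRoles(string userId, List<string> roles)
    {
        return GetMap().GetOrAdd(userId, roles);
    }

    // Retiring the whole set at once is a single cache removal.
    public static void RetireAll()
    {
        WebCache.Remove(CacheKey);
    }
}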
Q2: WebCache is fundamentally an IEnumerable<KeyValuePair<string, object>>, so it stores the key strings and the memory locations of each value, apart from the metadata of the objects. On the other hand, ConcurrentDictionary/Dictionary stores the hash codes of the key strings and the memory locations of each value. While each key's byte[] length is very small, its hash code could be slightly bigger than the string itself. Otherwise, the sizes of the hash codes are very predictable and reasonably slim (around 10 bytes in my test). Every time you add an entry, the size of the whole collection increases by about 30 bytes. Of course, this figure does not include the actual size of the value, as that is irrelevant to the collection.
You can calculate the size of the string by using:
System.Text.Encoding.UTF8.GetByteCount(key);
You might also find it useful to write code to measure the approximate size of an object:
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

// Serializes the object (which must be [Serializable]) and returns its serialized byte count.
static long GetSizeOfObject(object obj)
{
    using (var stream = new MemoryStream())
    {
        var formatter = new BinaryFormatter();
        formatter.Serialize(stream, obj);
        return stream.Length;
    }
}
First of all, make sure there is an abstraction layer so you can easily change the implementation in the future.
I can't see any significant difference between these two approaches; both of them use a hash table for the search.
But the second one searches twice, I suppose: once when it looks up the dictionary in the cache, and again when it looks up the user in the dictionary.
In addition, I would recommend: if there is a huge number of users, store role IDs rather than role strings; if there are only 1000-10000 users, there is no sense in doing that.
Do not forget to clear the cache record when a user's roles are updated.
You don't need option 2; option 1 should suffice, as all you need is a key and a List<string>.
A few points to consider in general before using caching:
What is the amount of data being cached?
What mode of caching are you using: in-memory or distributed?
How are you going to manage the cache?
If the data being cached grows beyond a threshold, what is the failover mechanism?
Caching has its pros and cons. In your scenario you have already done the payload analysis, so I don't see any issue with option 1.
In a current project of mine I need to manage and store a moderate number (from 10-100 to 5000+) of users (ID, username, and some other data).
This means I have to be able to find users quickly at runtime, and I have to be able to save and restore the database to continue statistics after a restart of the program. I will also need to register every connect/disconnect/login/logout of a user for the statistics. (And some other data as well, but you get the idea).
In the past, I saved settings and other stuff in encoded text files, or serialized the needed objects and wrote them to disk. But these methods require me to rewrite the whole database on each change, and that's increasingly slowing it down (especially with a growing number of users/entries), isn't it?
Now the question is: What is the best way to do this kind of thing in C#?
Unfortunately, I don't have any experience in SQL or other query languages (except for a bit of LINQ), but that's not posing any problem for me, as I have the time and motivation to learn one (or more if required) for this task.
"Most effective" is highly subjective and depends on who you ask, even after narrowing this question down to specific needs. If you are storing non-relational data, Mongo or some other NoSQL database such as RavenDB would be effective. If your data has a relational shape, then an RDBMS such as MySQL, SQL Server, or Oracle would be effective. Relational databases are ideal if you are going to have heavy reporting requirements, as they make it easier for non-developers to write simple SQL queries against the data.

Also keep in mind the disk-cache persistence that databases provide: commonly accessed data is kept in memory to save round trips to the disk (with hybrid drives, I suppose accessing some files directly accomplishes the same thing; however, SSDs are still not as fast as RAM access).

So you really need to ask yourself some questions to identify the best solution for you: What is the shape of your data (flat, relational, etc.)? Do you have reporting requirements where less technical team members need to be able to query the data repository? And what are your performance metrics?
My scenario is that I prefer to stay with a relational database storage system like SQL Server, because I need to work with complex queries.
At the same time, some calculations would be better done over time, with just the results stored in something like Redis or maybe a more traditional NoSQL solution.
That's the point where I thought: what about NHibernate's second-level cache?
I did a little research and found that there's a Redis second-level cache provider, and now I'm "confused".
I mean, if I use NHibernate's second-level cache, most object access should be very fast, as there would be no database round trip; the most frequently accessed objects would be retrieved from the in-memory Redis store.
Why am I considering this instead of just using Redis directly? Because I need actual atomic transactions within my solution's domains.
OK, the question:
Is relying on NHibernate's second-level cache Redis provider a good idea in order to get the best of relational and schema-less worlds?
What's your advice?
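For reference, this is roughly how I would expect to wire the provider up. It is only a sketch: the provider class string is my assumption based on the NHibernate.Caches.Redis package and should be checked against the package's documentation.

using NHibernate;
using NHibernate.Cfg;

static class NhConfig
{
    // Enables the second-level cache and query cache, then points NHibernate at a cache provider.
    // The provider class string below is an assumption; take the exact value from the Redis provider package.
    public static ISessionFactory Build()
    {
        var cfg = new Configuration().Configure();   // reads hibernate.cfg.xml
        cfg.SetProperty(NHibernate.Cfg.Environment.UseSecondLevelCache, "true");
        cfg.SetProperty(NHibernate.Cfg.Environment.UseQueryCache, "true");
        cfg.SetProperty(NHibernate.Cfg.Environment.CacheProvider,
            "NHibernate.Caches.Redis.RedisCacheProvider, NHibernate.Caches.Redis");
        return cfg.BuildSessionFactory();
    }
}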
To summarize your view, I see two different options:
Use Redis as a second-level cache (SLC) on top of NHibernate. This makes perfect sense, as the SLC stores the separated fields of objects and Redis is a key/value store. As I remember, the SLC contains the results of scalar queries or mapped and fetched objects, but what's important is that the data is taken (cached) from the executed queries.
IMHO, if you use Redis this way, all cached values must result from NHibernate queries. This brings you some kind of transactional atomicity, as you already described, but as far as I know, we found a couple of bugs where the SLC returned stale data or data from uncommitted transactions.
Note that with this approach, someone (NHibernate) still needs to somehow guarantee the business transaction between the RDBMS and Redis, which is not simple and is prone to bugs.
Also note that the SLC itself is not an incredibly fast pattern. As the SLC contains the fields of an object and not the object itself, every hit results in new object creation. So what happens is that the data is fetched from Redis instead of from the result set of an executed SQL query. Since prepared statements are used and the RDBMS typically does caching for you anyway, you may find that this does not bring a very large performance improvement.
Redis as a separate business store. You manage the data completely on your own, and you can do the computation in native (C#) code (as opposed to a SQL query or a mapped object). You need to guarantee fresh data and some transactional approach yourself.
What would I choose? The separate Redis store. Why?
The second-level cache, along with the mapping, imposes a contract on you, as the content results from queries or mapped objects. You can't manage or use Redis on your own. In particular, your cached data is coupled/tied to those queries, not to some API (interfaces) and some kind of service (as I would design it).
You can do the computation of your data in your own code.
The SLC approach seems buggy to me, and it was often very hard to find those bugs.
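As a minimal sketch of the separate-Redis option, here is how it could look with the StackExchange.Redis client; your question doesn't name a client, so treat the client choice, connection string, key scheme, and expiry as illustrative:

using StackExchange.Redis;

class ResultStore
{
    private static readonly ConnectionMultiplexer Redis =
        ConnectionMultiplexer.Connect("localhost:6379");   // placeholder connection string

    // Store a precomputed result under your own key scheme, with an explicit expiry.
    public static void SaveResult(string key, string json)
    {
        IDatabase db = Redis.GetDatabase();
        db.StringSet(key, json, System.TimeSpan.FromMinutes(30));
    }

    public static string LoadResult(string key)
    {
        IDatabase db = Redis.GetDatabase();
        return db.StringGet(key);   // converts to null when the key is absent
    }
}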
I have a requirement where a huge amount of data needs to be cached on disk.
Whenever there is a change in the database, the data is retrieved from the database and cached on disk. I will have a background process that keeps checking my cached data against the database and updates it as and when required.
I would like to know what would be the best way to organize the cached data on my disk, so that writing and reading from the cache can be faster.
Another thread would be used to fetch new data from the DB and cache it on disk. I also need to take care of synchronization between the two threads (one will be updating the existing cache data, and the other will be writing newly fetched data into the cache).
Please suggest a strategy for organizing the data in the cache and for synchronization between the threads.
SQL Server has something called XML tables. Those tables are based on physical XML files located on disk. You can map/link XML data on disk to a table in SQL Server. For users it is seamless; in other words, they see those tables as regular tables.
Setting aside the technical/philosophical discussion about caching huge amounts of data on disk, this is just an idea...
Do you care about the consistency of the data on power failures?
Memory-mapped files along with occasional flushes will probably get you what you want (a sketch follows below).
Do you need indexed access to the data?
Then you probably need to design something around a B-tree or B+-tree implementation, which gives efficient retrieval of indexed data and better block-level locking.
http://code.google.com/p/high-concurrency-btree/
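As a minimal sketch of the memory-mapped-file suggestion above (the file name, map name, capacity, and offsets are arbitrary placeholders):

using System.IO.MemoryMappedFiles;

class DiskCache
{
    // A fixed-size memory-mapped file; reads and writes go through the view accessor,
    // and an occasional Flush() pushes dirty pages to disk.
    public static void Example()
    {
        using (var mmf = MemoryMappedFile.CreateFromFile("cache.bin", System.IO.FileMode.OpenOrCreate,
                                                         "cacheMap", 1024 * 1024))
        using (var accessor = mmf.CreateViewAccessor())
        {
            accessor.Write(0, 12345L);         // write a 64-bit value at offset 0
            long value = accessor.ReadInt64(0); // read it back
            accessor.Flush();                   // occasional flush for durability on power failure
        }
    }
}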
As an alternative answer, my own B+Tree implementation will neatly address this as a completely managed code (C#) implementation of an IDictionary<TKey, TValue>. It is a single-file key/value store that is thread-safe and optimized for concurrency. It was built from the ground up expressly for this purpose and for providing write-through caches.
Discussion - http://csharptest.net/projects/bplustree/
Online Help - http://help.csharptest.net/
Source Code - http://code.google.com/p/csharptest-net/
Downloads - http://code.google.com/p/csharptest-net/downloads
NuGet Package - http://nuget.org/List/Packages/CSharpTest.Net.BPlusTree
We (my team) are about to start development of a mission-critical project; one of the sub-systems of this project is a Windows service. This service will be the backbone of the entire system and has to respond to mission-critical standards.
This service will contain many lists to minimize database interaction and to gain performance. I am estimating the average size of a list to be 250,00 under normal circumstances.
Is it a good idea to use LINQ to query data from these queues, or should I follow my original plan of creating an indexed list?
The indexed list is a custom implementation of IDictionary, which will work like an index and will have many performance-oriented features such as MRU tracking, index rebuilding, etc.
Before rolling your own solution, you might want to look at i4o, an indexed LINQ to Objects provider.
Otherwise, it sounds like applying your own indexing may well be worthwhile - but do performance tests first. "Mission critical" is often about reliability more than performance - if LINQ to Objects performs well enough, why not use it?
Even if you do end up writing your own collections, you should consider making them "LINQ queryable" in some fashion (which will depend on the exact nature of the collections)... it's nice to be able to write LINQ queries, even if it's not against vanilla LINQ to Objects.
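To make the trade-off concrete, here is a small sketch of the difference between a LINQ to Objects scan and a hand-built index over the same list; the Message type, list size, and lookup key are purely illustrative:

using System.Collections.Generic;
using System.Linq;

class Message { public int Id; public string Payload; }

class QueueIndexDemo
{
    static void Main()
    {
        List<Message> items = Enumerable.Range(0, 250000)
            .Select(i => new Message { Id = i, Payload = "data" + i })
            .ToList();

        // LINQ to Objects: an O(n) scan per query.
        Message viaLinq = items.FirstOrDefault(m => m.Id == 123456);

        // A hand-built index: O(1) per lookup after the one-off build cost.
        Dictionary<int, Message> index = items.ToDictionary(m => m.Id);
        Message viaIndex = index[123456];
    }
}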