How can I use a very large dictionary in C#?

I want to use a lookup map or dictionary in a C# application, but it is expected to store 1-2 GB of data.
Can someone please tell me whether I can still use the Dictionary class, or whether I need some other class?
EDIT: We have an existing application that queries an Oracle database to look up object details. It is too slow, however, because the same objects are queried repeatedly. I felt a lookup map might be ideal for this scenario to improve response time, but I am worried that its size will be a problem.

Short Answer
Yes, if your machine has enough memory for the structure (on top of the overhead of the rest of the program and the system, including the operating system).
Long Answer
Are you sure you want to? Without knowing more about your application, it's difficult to know what to suggest.
Where is the data coming from? A file? Files? A database? Services?
Is this a caching mechanism? If so, can you expire items out of the cache once they haven't been accessed for a while? This way, you don't have to hold everything in memory all the time.
As others have suggested, if you're just trying to store lots of data, can you just use a database? That way you don't have to have all of the information in memory at once. With indexing, most databases are excellent at performing fast retrieves. You could combine this approach with a cache.
Is the data that will be in memory read only, or will it have to be persisted back to some storage when something changes?
Scalability - do you expect the amount of data stored in this dictionary to increase as time goes on? If so, you're going to reach a point where it's very expensive to buy machines that can handle this amount of data. You might want to look at a distributed caching system if this is the case (AppFabric comes to mind) so you can scale out horizontally (more machines) instead of vertically (one really big, expensive point of failure).
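The "expire items once they haven't been accessed for a while" idea from the caching question above can be sketched as follows. This is a minimal illustration, not a production cache: the ExpiringCache class, its idleLimit parameter and the injectable clock are all hypothetical names introduced here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: a dictionary-backed cache that forgets entries
// that have not been read for longer than the idle limit.
public sealed class ExpiringCache<TKey, TValue>
{
    private sealed class Entry
    {
        public TValue Value;
        public DateTime LastAccessUtc;
    }

    private readonly Dictionary<TKey, Entry> _entries = new Dictionary<TKey, Entry>();
    private readonly TimeSpan _idleLimit;
    private readonly Func<DateTime> _clock;
    private readonly object _gate = new object();

    // The clock is injectable so the behaviour can be tested without waiting.
    public ExpiringCache(TimeSpan idleLimit, Func<DateTime> clock = null)
    {
        _idleLimit = idleLimit;
        _clock = clock ?? (() => DateTime.UtcNow);
    }

    public int Count { get { lock (_gate) return _entries.Count; } }

    public void Set(TKey key, TValue value)
    {
        lock (_gate)
            _entries[key] = new Entry { Value = value, LastAccessUtc = _clock() };
    }

    // Returns true on a fresh hit and renews the entry's idle timer.
    public bool TryGet(TKey key, out TValue value)
    {
        lock (_gate)
        {
            Entry entry;
            if (_entries.TryGetValue(key, out entry))
            {
                DateTime now = _clock();
                if (now - entry.LastAccessUtc <= _idleLimit)
                {
                    entry.LastAccessUtc = now;
                    value = entry.Value;
                    return true;
                }
                _entries.Remove(key); // idle for too long: drop it
            }
            value = default(TValue);
            return false;
        }
    }

    // Removes every entry idle for longer than the limit; call periodically.
    public int Sweep()
    {
        lock (_gate)
        {
            DateTime now = _clock();
            var stale = new List<TKey>();
            foreach (var pair in _entries)
                if (now - pair.Value.LastAccessUtc > _idleLimit)
                    stale.Add(pair.Key);
            foreach (TKey key in stale)
                _entries.Remove(key);
            return stale.Count;
        }
    }
}
```

On .NET 4.0 and later, the built-in System.Runtime.Caching.MemoryCache with a CacheItemPolicy.SlidingExpiration offers similar behaviour and is probably the better starting point than a hand-rolled class.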
UPDATE
In light of the poster's edit, it sounds like caching would go a long way here. There are many ways to do this:
Simple dictionary caching - just cache stuff as it's requested.
Memcached
Caching Application Block - I'm not a huge fan of this implementation, but others have had success.
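The first option, caching items as they are requested, might look something like the sketch below. The LookupCache class and its loader delegate are hypothetical; in the poster's scenario the loader would be whatever code currently queries the Oracle database.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical wrapper: caches whatever the loader returns, keyed by id.
public sealed class LookupCache<TKey, TValue>
{
    private readonly ConcurrentDictionary<TKey, TValue> _cache =
        new ConcurrentDictionary<TKey, TValue>();
    private readonly Func<TKey, TValue> _loader;

    public LookupCache(Func<TKey, TValue> loader)
    {
        if (loader == null) throw new ArgumentNullException("loader");
        _loader = loader;
    }

    // Returns the cached value, calling the loader only on a miss.
    // Under contention the loader can run more than once for the same key;
    // only one result is kept.
    public TValue Get(TKey key)
    {
        return _cache.GetOrAdd(key, _loader);
    }

    // Drops an entry, for example after the underlying row has been updated.
    public bool Invalidate(TKey key)
    {
        TValue ignored;
        return _cache.TryRemove(key, out ignored);
    }

    public int Count { get { return _cache.Count; } }
}
```

Because nothing is ever evicted here, this only suits data sets that fit comfortably in memory; otherwise some expiry policy is needed on top.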

As long as you're on a 64GB machine, yes, you should be able to use a dictionary that large. However, if you have THAT much data, a database may be more appropriate (Cassandra is really nothing but a gigantic dictionary, and there's always MySQL).

When you say 1-2GB of data, I assume that you mean the items are complex objects that cumulatively contain 1-2GB.
Unless they're structs (and they shouldn't be), the dictionary doesn't care how big the items are.
As long as you have fewer than about 2^24 items (I pulled that number out of a hat), you can store as much as you can fit in memory.
However, as everyone else has suggested, you should probably use a database instead.
You may want to use an embedded, in-process database such as SQL CE.

You can, but for a dictionary as large as that you are better off using a database.

Use a database.
Make sure you have a good DB model, add the correct indexes, and off you go.

You can use subdictionaries.
Dictionary<KeyA, Dictionary<KeyB ....
Where KeyA is some common part of KeyB.
For example, if you have a string-keyed dictionary, you can use the first letter as KeyA.
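A minimal sketch of the first-letter example, assuming string keys. The PartitionedDictionary class name and its members are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical two-level map: the outer key is the first character of the string key.
public sealed class PartitionedDictionary<TValue>
{
    private readonly Dictionary<char, Dictionary<string, TValue>> _buckets =
        new Dictionary<char, Dictionary<string, TValue>>();

    public void Set(string key, TValue value)
    {
        if (string.IsNullOrEmpty(key))
            throw new ArgumentException("Key must be non-empty.", "key");

        Dictionary<string, TValue> bucket;
        if (!_buckets.TryGetValue(key[0], out bucket))
        {
            // First key with this initial letter: create its sub-dictionary.
            bucket = new Dictionary<string, TValue>();
            _buckets[key[0]] = bucket;
        }
        bucket[key] = value;
    }

    public bool TryGet(string key, out TValue value)
    {
        Dictionary<string, TValue> bucket;
        if (!string.IsNullOrEmpty(key) && _buckets.TryGetValue(key[0], out bucket))
            return bucket.TryGetValue(key, out value);

        value = default(TValue);
        return false;
    }

    public int BucketCount { get { return _buckets.Count; } }
}
```

Splitting this way keeps each inner dictionary's backing arrays smaller, which can matter because a single .NET object is limited to 2 GB by default. It does not reduce the total memory the data needs.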

Related

Reduce SQL Server overhead caching query results

I have software that does heavy processing based on some files.
I have to query some tables in SQL Server during the process, and this is killing both the DB and the application's performance (other applications use the same tables).
After optimizing queries and code I am getting better results, but not good enough. After some research I arrived at a solution: caching some query results. My idea is to cache the rows of one specific table (identified as the source of the overhead) that the file being processed needs.
I was thinking of using AppFabric Caching (I'm on the MS stack). I ran some tests, and it has a large memory footprint for small objects (the cache service uses ~350 MB of RAM with no objects stored). But I also need to run some queries against the cached rows (such as searching by lastname, ssn, birthdate, etc.).
My second option is MongoDB as a cache store. I've researched this, and most of what I read recommends memcached or Redis, but I'm using Windows servers and they're not officially supported.
Is using Mongo as a cache store a good approach in this case? Or is AppFabric Caching + tag search better?

It is hard to tell which is better because we don't know enough about your bottlenecks. A lot depends on the nature of the data you're discussing. If the data is very static and is not requested constantly, but compiling the data set is time-consuming, a good solution might be a materialized view. If the data is requested frequently, then you are better off caching it on some server (e.g. AppFabric).
There are many techniques and possibilities, but you really need to think about network traffic, demand, size, and so on. It is hard to answer this here without knowing all the details.
It looks like you are on the right track, but maybe all you need is a parameterized query. Hard to tell. I would add a materialized view to the list of options you posted, though. Maybe all you need is to build this view from the data you need and just access its contents.

My question to you would be: what are your long-term goals or estimates for your application? If this is the highest load you are going to experience, then tuning the DB or using MVL would be an answer. But the long-term solution to this is distributed caching, and you are already thinking along those lines. Your data is what we'd call "reference data" or "lookup data", and once you are executing many lookups against limited DB resources, there will be performance issues and your DB will become the bottleneck.
So the solution, which you are already considering, is to keep this "reference" data in a cache so there is no need to go to the database, while at the same time keeping the cache synchronized with the database.
I wouldn't be too sure about AppFabric, as it will have the same support issues that you mention. What is your budget like? Could you consider spending on a caching solution like NCache?

ASP.NET Session - big object vs many small objects

I have a scenario to optimise how my web app is storing data in the session and retrieving it. I should point out that I'm using SQL Server as my session store.
My scenario is I need to store a list of unique IDs mapped to string values in the user's session for later use. The current code I've inherited is using a List<T> with a custom object but I can already see some kind of dictionary is far better for performance.
I've tested two ideas for alternatives:
Storing a Dictionary<int, string> in the session. When I need to get the strings back, I get the dictionary from the session once and can test each ID on the dictionary object.
Since the session is basically like a dictionary itself, store the string directly in the session using a unique session key, e.g. Session["MyString_<id>"] = stringValue. Getting the value out of the session would basically be the inverse operation.
My test results show the following based on the operation I need to do and using 100 strings:
Dictionary - 4552 bytes, 0.1071 seconds to do operation
Session Direct - 4441 bytes, 0.0845 seconds to do operation
From these results I see that I save some space in the session (probably because I've not got the overhead of serialising a dictionary object) and it seems to be faster when getting the values back from the session, maybe because strings are faster to deserialise than objects.
So my question is, is it better for performance to store lots of smaller objects in session rather than one big one? Is there some disadvantage for storing lots of smaller objects vs. one bigger object that I haven't seen?

There are penalties for serializing and searching large objects (they take up more space and processor time because a more complex structure has to be represented).
And why do two searches when one will do?
Also, documentation on caching/storage solutions generally notes that it is much more efficient to serialize a single value looked up by a computed key than to store the whole dictionary, retrieve it, and search in it.

I think you have almost answered your own question by showing that yes, there is an overhead with deserializing objects, but I think the real deciding factor should be manageability and maintainability.
The difference in storage size is going to be minimal when you are talking about 100 objects, but as you scale up to thousands of objects the difference will grow too, especially if you are using complex custom objects. If you have an application with many users, each with thousands of session entries, you can imagine how this is just not scalable.
Also, by having many session objects you are undoubtedly going to have to write more code to handle each varying object. This may not be vastly more, but certainly more. It would also potentially make it harder for a developer picking up your code to understand your reasoning and therefore to extend your code.
If you can handle the session in a single bare-bones format like an IEnumerable or IDictionary, then in my opinion this is preferable even if there is a slight overhead involved.

C# Design decision for time series access with a database

I'm looking for a "best practice" way to handle incoming time series data.
One data point consists for example of time, height, width etc. for every "tick". Is it a good idea to save n data points in-memory with a collection class and later "flush" the points to a database after reaching the limits of the collection?
Or should the data points be directly written to the database in the first place, so that my object can run queries against it?
I know that this is little information about my requirements, so the question is how fast data access to a database is compared to a hybrid in-memory and database solution.
Say there are at most 500 data points per second to handle, and some calculation has to be done on every incoming point. With a pure database solution, one has to run a store query for every incoming point. I guess this is not efficient, but I don't know if such a database is able to "listen" and do this fast.
A nice feature for the database would be to send the points to subscribers. Is this possible with SQL Server?
Thanks, Juergen

Putting the "sending to subscribers" requirement aside, don't get into the trap of premature optimization.
I would try the simplest solution first, which is probably just writing the data into the database as it arrives. Then run stress tests. If the performance isn't up to scratch, find the bottlenecks and optimize them out.
Turning to the "sending to subscribers" requirement, this isn't really something that relational database platforms are typically designed for (they are more about storing data and exposing it for on-demand retrieval). A pub-sub type requirement is usually best solved using some kind of message bus. Perhaps take a look at something like NServiceBus.

If it is not multi-user, then keeping data points in memory with a collection class is definitely the winner.
If it is multi-user, then I would go for some sort of shared in-memory data structure on the server side that persists to the DB from time to time.

I would say the bigger question is how you plan on storing this in SQL. I would queue the data points in memory for a period of time (1 second?) and then write a single row to the database with a blob or nvarchar field containing all the data for that second, as this will let the database scale further. The row could also contain some summary information about what happened in that second, which you could use when querying the data to reduce load on selects. Of course, this wouldn't be feasible if you want to perform direct queries on the individual points.
It all depends what you plan to do with the data...
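The queue-then-write idea can be sketched roughly as below. The DataPoint struct (with fields named after the question's example), the BatchingBuffer class and the flush delegate are all hypothetical; the delegate stands in for whatever database write is chosen, and this version flushes by count rather than by elapsed time.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical data point; the field names follow the question's example.
public struct DataPoint
{
    public DateTime Time;
    public double Height;
    public double Width;
}

// Collects points in memory and hands them to the flush delegate in batches.
public sealed class BatchingBuffer
{
    private readonly List<DataPoint> _pending = new List<DataPoint>();
    private readonly int _batchSize;
    private readonly Action<IList<DataPoint>> _flush;
    private readonly object _gate = new object();

    public BatchingBuffer(int batchSize, Action<IList<DataPoint>> flush)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException("batchSize");
        if (flush == null) throw new ArgumentNullException("flush");
        _batchSize = batchSize;
        _flush = flush;
    }

    public void Add(DataPoint point)
    {
        List<DataPoint> batch = null;
        lock (_gate)
        {
            _pending.Add(point);
            if (_pending.Count >= _batchSize)
                batch = Drain();
        }
        if (batch != null) _flush(batch); // do the slow write outside the lock
    }

    // Call on a timer or at shutdown so a partial batch is not lost.
    public void Flush()
    {
        List<DataPoint> batch;
        lock (_gate) batch = Drain();
        if (batch.Count > 0) _flush(batch);
    }

    // Copies the pending points into a new list and empties the buffer.
    private List<DataPoint> Drain()
    {
        var batch = new List<DataPoint>(_pending);
        _pending.Clear();
        return batch;
    }
}
```

At 500 points per second, a batch size of a few hundred would mean roughly one database round trip per second instead of 500, at the cost of losing the unflushed points if the process crashes.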

Best way to save/load large amout of data in a .Net application?

What is the best way to save a large amount of data for a .NET 4.0 application?
Right now I am using Lists and serializing to a file in the "User Data" folder, and it's working OK, but I want to know if there is a better/faster way of saving/loading a large amount of data.
The data that I am saving contains only lots of words, like documents.
The size of the data is almost 1 MB.

That really depends on the type of your application. I wouldn't use a SQL database of any sort just to load and save data that I do not need to query or transform. The time it takes to map your object graph to a relational model is just not worth it.
Also, I don't believe it will ever be faster than simple serialization, due to the overhead associated with databases (connection management and mapping).
My recent experience was with BinaryFormatter, which gave excellent results (files of ~15 MB). If worst comes to worst, you can always write your own formatter.

It kind of depends on your data and how you have it stored in your app.
But any of the NoSQL storage systems is a possibility, as is just writing plain binary data to a file.

When you say "large amout [sic] of data", what exactly do you mean by that? A megabyte? a terabyte?
And what exactly is the data?
If it's a set of account records, it might well belong in a database of some sort; if it's a set of images or word processing documents, perhaps not.

If you want fast access, one approach would be to serialize to a hashtable and cache it in between reads and writes.
The problems here are, of course, versioning, changes of namespace (after which you won't be able to deserialize easily), deadlocks, concurrency, etc.
It is better to save the file as XML/JSON and, when you read it into memory, load it into a hashtable for fast access.

Datasets and XML in place of proper db: Not a good idea?

In continuation of: Storing DataRelation in xml?
Thanks to everybody for the answers to my earlier thread. However, could I ask why nobody supports this XML-based approach? What exactly will the problems be? I can apply constraints to a dataset, and I can, I guess, also use transactions.
I am new to this. So if you could point me to some link, where I can find some sort of comparisons, that would be really helpful.
According to FAQ, discussions are not very encouraged, but I guess this is quite specific. I hope, not to be fired for this... :)
Thanks for reading,
Saurabh.

Database management systems are specifically designed to store data and retrieve it quickly, to preserve the integrity of the data and to leverage concurrent access to the data.
XML, on the other hand, was originally designed for documents, separating the content from the presentation. It became a handy way to store simple data because the file structure is so well defined, and then it got out of hand, with people trying to store entire databases in an unsuitable structure.
XML doesn't guarantee atomicity, concurrency, integrity, fast access or anything like that. Not inherently, anyway. .NET's DataSet libraries do help in that regard, but just because you can serialize DataSet objects to XML doesn't make it a good place to store data for multiple users.
When you're faced with two tools, one which was designed to do exactly what you need to do (in this case a DBMS) and one that was designed to do something else but has been kludged to do what you want, sorta (in this case XML), you should probably go with the first option.

Concurrency will be the main issue, where multiple users want to access the same "database" file. Performance is the other, because the whole file has to be loaded into memory. If the filesize grows it'll get unmanageable. Also, performance on queries is hit, because it won't be as efficient as getting something as tuned and honed as an RDBMS to do it for you.

Though I would need to research to back this up, I suspect that there are some performance implications to not using an actual database.
If there is a chance that another application built on a different platform (not ADO.NET) might need to access your data in the future, having to work with a giant XML file will very likely make life more difficult. A relational DB is the standard approach to this sort of problem.
