Caching big data

Caching big data - c#

I have an application that monitors various systems in realtime. I got different reports with different fields depending on the monitored application. We are gathering data in 3 minute intervals. And these 3 minute intervals can be 120mb as raw json and 2-3mb as zipped or gzipped json. We are zipping then caching to the disk to avoid database requests by requesting those caches from disk, unzipping them and loading the json data to application. We are holding these caches for like 3 days to 30 days depending on the report type.
For years we have used disk caching. Zipping the 3 minute interval data and then saving it to disk. This led me to use a lot of locks and mutexes.
I know I'm not the only one with this kind of problem. My cache is big. My question is; Is there a better way to save this data and obtain it? Memory caching is not a solution for me because 30 days of data can't be on memory and I am not able to add memory to the server for this application. I need something else. Something better than disk and without the usage of locks.
P.S. : Application is also multi-threaded.

I would consider a NoSQL storage engine. I am thinking at Redis in particular. Redis is a in-memory, fast, key-value store with persistence, which should be a good fit for this kind of scenario. You can then defer most of the lock/consistency hassle to it.
A problem with Redis is if you are really bound to a Windows env. There is an "unofficial" port of redis; the port is done by Microsoft itself.. but I admit that I would not be extremely confident in using it in production.
As for a C# client/library, there is Booksleeve. This site (SO) uses it :) so I bet it is pretty stable!
Of course you will need to tailor Redis to your needs. Redis does offer persistence, and the persistence is configurable (see http://redis.io/topics/persistence). Also, it offers expiration of objects (http://redis.io/commands/expire), very handy for a cache-like mechanism, and the ability to build more complex, atomic commands starting from simpler ones.
I would use Redis to handle the in-memory cache, keeping all the (primary) keys in memory, with data both on disk and in-memory. The in-memory data associated with an volatile key. The primary key points to the in-memory key and to a file name; if the key it points at is invalid, you can re-load data and access it.
This is a complex solution, but it has two advantages:
it should be vary fast
it offloads some of the locks/etc burden to Redis
should be easy to migrate from your solution to this one
Alternatively, Redis also offers a VM solution
http://oldblog.antirez.com/post/redis-virtual-memory-story.html, but I do not know how stable it is, nor have I ever tried it.
Another alternative is to explore other NoSQL solutions; since you mentioned JSON data, I will look at MongoDB.
Finally, a crazy idea... are you on a 64-bit machine?
Have you considered "letting the OS handle it", with a really big page file and page-file-backed memory mapped files (or a standard file too)? Mind you, it might be a very BAD idea...! But it is something that maybe you could try out/research about?

Related

Database recommendations needed -> Columnar, Embedded (if possible)

EDIT: As result of the answers so far I like to add more focus in what I like to zero in on: A database that allows writing in-memory (could be simple C# code) with persistence to storage options in order to access the data from within R. Redis so far looks the most promising. I also consider to actually use something similar to Lockfree++ or ZeroMQ, in order to avoid writing data concurrently to the database, but rather sending all to be persisted data over a message bus/other implementation and to have one "actor" handle all write operations to an in-memory db or other solution. Any more ideas aside Redis (some mentioned SQLite and I will need to still test its performance). Any other suggestions?
I am searching for the ideal database structure/solution that meets most of my below requirements but so far I utterly failed. Can you please help?
My tasks: I run a process in .Net 4.5 (C#) and generate (generally) value types that I want to use for further analysis in other applications and therefore like to either preserve in-memory or persist on disk. More below. The data is generated within different tasks/threads and thus a row based data format does not lend itself well to match this situation (because the data generated in different threads is generated at different times and is thus not aligned). Thus I thought a columnar data structure may be suitable but please correct me if I am wrong.
Example:
Tasks/Thread #1 generates the following data at given time stamps
datetime.ticks / value of output data
1000000001 233.23
1000000002 233.34
1000000006 234.23
...
Taks/Thread #2 generates the following data at given time stamps
datetime.ticks / value of output data
1000000002 33.32
1000000005 34.34
1000000015 54.32
...
I do not need to align the time stamps at the .Net run-time, I am first and foremost after preserving the data and to process the data within R or Python at a later point.
My requirements:
Fast writes, fast writes, fast writes: It can happen that I generate 100,000- 1,000,000 data points per second and need to persist (worst case) or retain in memory the data. Its ok to run the writes on its own thread so this process can lag the data generation process but limitation is 16gb RAM (64bit code), more below.
Preference is for columnar db format as it lends itself well to how I want to query the data later but I am open to any other structure if it makes sense in regards to the examples above (document/key-value also ok if all other requirements are met, especially in terms of write speed).
API that can be referenced from within .Net. Example: HDF5 may be considered capable by some but I find their .Net port horrible.Something that supports .Net a little better would be a plus but if all other requirements are met then I can deal with something similar to the HDF5 .Net port.
Concurrent writes if possible: As described earlier I like to write data concurrently from different tasks/threads.
I am constrained by 16gb memory (run .Net process in 64bit) and thus I probably look for something that is not purely in-memory as I may sometimes generate more data than that. Something in-memory which persists at times or a pure persistence model is probably preferable.
Preference for embedded but if a server in a client/server solution can run as a windows service then no issue.
In terms of data access I have strong preference for a db solution for which interfaces from R and Python already exist because I like to use the Panda library within Python for time series alignments and other analysis and run analyses within R.
If the API/library supports in addition SQL/SQL-like/Linq/ like queries that would be terrific but generally I just need the absolute bare bones such as load columnar data in between start and end date (given the "key"/index is in such format) because I analyze and run queries within R/Python.
If it comes with a management console or data visualizer that would be a plus but not a must.
Should be open source or priced within "reach" (no, KDB does not qualify in that regards ;-)
OK, here is what I have so far, and again its all I got because most db solution simply fail already on the write performance requirement:
Infobright and Db4o. I like what I read so far but I admit I have not checked into any performance stats
Something done myself. I can easily store value types in binary format and index the data by datetime.ticks , I just would need to somehow write scripts to load/deserialize the data in Python/R. But it would be a massive tasks if I wanted to add concurrency, a query engine, and other goodies. Thus I look for something already out there.

I can't comment -- low rep (I'm new here) -- so you get a full answer instead...
First, are you sure you need a database at all? If fast write speed and portability to R is your biggest concern then have you just considered a flat file mechanism? According to your comments you're willing to batch writes out but you need persistence; if those were my requirements I'd write a straight-to-disck buffering system that was lightning fast then build a separate task that periodically took the disk files and moved them into a data store for R, and that's only if R reading the flat files wasn't sufficient in the first place.
If you can do alignment after-the-fact, then you could write the threads to separate files in your main parallel loop, cutting each file off every so often, and leave the alignment and database loading to the subprocess.
So (in crappy pseudo_code), build a thread process that you'd call with backgroundworker or some such and include a threadname string uniquely identifying each worker and thus each filestream (task/thread):
file_name = threadname + '0001.csv' // or something
open(file_name for writing)
while(generating_data) {
generate_data()
while (buffer_not_full and very_busy) {
write_data_to_buffer
generate_data()
}
flush_buffer_to_disk(file_name)
if(file is big enough or enough time has passed or we're not too busy) {
close(file_name)
move(file_name to bob's folder)
increment file_name
open(file_name for writing)
}
)
Efficient and speedy file I/O and buffering is a straightforward and common problem. Nothing is going to be faster than this. Then you can just write another process to do the database loads and not sweat the performance there:
while(file_name in list of files in bob's folder sorted by date for good measure)
{
read bob's file
load bob's file to database
align dates, make pretty
}
And I wouldn't write that part in C#, I'd batch script it and use the database's native loader which is going to be as fast as anything you can build from scratch.
You'll have to make sure the two loops don't interfere much if you're running on the same hardware. That is, run the task threads at a higher priority, or build in some mutex or performance limiters so that the database load doesn't hog resources while the threads are running. I'd definitely segregate the database server and hardware so that file I/O to the flat files isn't compromised.
FIFO queues would work if you're on Unix, but you're not. :-)
Also, hardware is going to have more of a performance impact for you than the database engine, I'd imagine. If you're on a budget I'm guessing you're on COTS hardware, so springing for a solid state drive may up performance fairly cheaply. As I said, separating the DB storage from the flat file storage would help, and the CPU/RAM for R, the Database, and your Threads should all be segregated ideally.
What I'm saying is that choice of DB vendor probably isn't your biggest issue, unless you have a lot of money to spend. You'll be hardware bound most of the time otherwise. Database tuning is an art, and while you can eek out minor performance gains at the top end, having a good database administrator will keep most databases in the same ballpark for performance. I'd look at what R and Python support well and that you're comfortable with. If you think in columnar fashion then look at R and C#'s support for Cassandra (my vote), Hana, Lucid, HBase, Infobright, Vertica and others and pick one based on price and support. For traditional databases on a single commodity machine, I haven't seen anything that MySQL can't handle.

This is not to answer my own question but to keep track of all data bases which I tested so far and why they have not met my requirements (yet): each time I attempted to write 1 million single objects (1 long, 2 floats) to the database. For ooDBs, I stuck the objects into a collection and wrote the collection itself, similar story for key/value such as Redis but also attempted to write simple ints (1mil) to columnar dbs such as InfoBright.
Db4o, awefully slow writes: 1mil objects within a collection took about 45 seconds. I later optimized the collection structure and also wrote each object individually, not much love here.
InfoBright: Same thing, very slow in terms of write speed, which surprised me quite a bit as it organizes data in columnar format but I think the "knowledge tree" only kicks in when querying data rather than when saving flat data structures/tables-like structures.
Redis (through BookSleeve): Great API for .Net: Full Redis functionality (though couple drawbacks to run the server on Windows machines vs. a Linux or Unix box). Performance was very fast...North of 1 million items per second. I serialized all objects using Protocol Buffers (protobuf-net, both written by Marc Gravell), still need to play a lot more with the library but R and Python both have full access to the Redis DB, which is a big plus. Love it so far. The Async framework that Marc wrote around the Redis base functions is awesome, really neat and it works so far. I wanna spend a little more time to experiment with the Redis Lists/Collection types as well, as I so far only serialized to byte arrays.
SqLite: I ran purely in-memory and managed to write 1 million value type elements in around 3 seconds. Not bad for a pure RDBMS, obviously the in-memory option really speeds things up. I only created one connection, one transaction, created one command, one parameter, and simply adjusted the value of the parameter within a loop and ran the ExecuteNonQuery on each iteration. The transaction commit was then run outside the loop.
HDF5: Though there is a .Net port and there also exists a library to somehow work with HDF5 files out of R, I strongly discourage anyone to do so. Its a pure nightmare. The .Net port is very badly written, heck, the whole HDF5 concept is more than questionable. Its a very old and in my opinion outgrown solution to store vectorized/columnar data. This is 2012 not 1995. If one cannot completely delete datasets and vectors out of the file in which they were stored before then I do not call that an annoyance but a major design flaw. The API in general (not just .Net) is very badly designed and written imho, there are tons of class objects that nobody, without having spent hours and hours of studying the file structure, understands how to use. I think that is somewhat evidenced by the very sparse amount of documentation and example code that is out there. Furthermore, the h5r R library is a drama, an absolute nightmare. Its badly written as well (often the file upon writing is not correctly close due to a faulty flush and it corrupts files), the library has issues to even be properly installed on 32 bit OSs...and it goes on and on. I write the most about HDF5 because I spent the most of my time on this piece of .... and ended up with the most frustration. The idea to have a fast columnar file storage system, accessible from R and .Net was enticing but it just does not deliver what it promised in terms of API integration and usability or lack thereof.
Update: I ditched testing velocityDB simply because there does not seem any adapter to access the db from within R available. I currently contemplate writing my own GUI with charting library which would access the generated data either from a written binary file or have it sent over a broker-less message bus (zeroMQ) or sent through LockFree++ to an "actor" (my gui). I could then call R from within C# and have results returned to my GUI. That would possibly allow me the most flexibility and freedom, but would obviously also be the most tedious to code. I am running into more and more limitations during my tests that with each db test I befriend this idea more and more.
RESULT: Thanks for the participation. In the end I awarded the bounty points to Chipmonkey because he suggested partly what I considered important points to the solution to my problem (though I chose my own, different solution in the end).
I ended up with a hybrid between Redis in memory storage and direct calls out of .Net to the R.dll. Redis allows access to its data stored in memory by different processes. This makes it a convenient solution to quickly store the data as key/value in Redis and to then access the same data out of R. Additionally I directly send data and invoke functions in R through its .dll and the excellent R.Net library. Passing a collection of 1 million value types to R takes about 2.3 seconds on my machine which is fast enough given that I get the convenience to just pass in the data, invoke computational functions within R out of the .Net environment and getting the results back sync or async.

Just a note: I once had a similar problem posted by a fellow in a delphi forum. I could help him with a simple ID-key-value database backend I wrote at that time (kind of a NoSQL engine). Basically, it uses a B-Tree to store triplets (32bit ObjectID, 32bit PropertyKey, 64bit Value). I could manage to save about 500k/sec Values in real time (about 5 years ago). Of course, the data was indexed on all three values (ID, property-ID and value). You could optimize this by ignoring the value index.
The source I still have is in Delphi, but I would think about implementing something like that using C#. I cannot tell you whether it will meet your needs for performance, but if all else fails, give it a try. Using a buffered write should also drastically improve performance.

I would go with way combining persistence storage (I personally prefer db4o, but you can use files as well as mentioned above) and storing objects into memory this way:
use BlockingCollection<T> to store objects in memory (I believe you will achieve better performance then 1000000/s to store objects in memory), and than have one or more processing threads which will consume the objects and store them into persistent database
// Producing thread
for (int i=0; i<1000000; i++)
blockingCollection.Add(myObject);
// Consuming threads
while (true)
{
var myObject = blockingCollection.Take();
db4oSession.Store(myObject); // or write it to the files or whathever
}
BlockingCollection pretty much solves Producer-Consumer workflow, and in case you will use multiple instance of them and use AddToAny/TakeFromAny you can reach any kind of multithreaded performance
each consuming thread could have different db4o session (file) to reach desired performance (db4o is singlethreaded).

Since you want to use ZeroMQ why not use memcache over Redis?
ZeroMQ offers no persistence as far as I know. Memcache also offers no persistence and is a bit faster than Redis.
Or perhaps the other way, if you use Redis why not use beanstalk MQ?
If you want to use Redis (for the persistence) you might want to switch from ZeroMQ to beanstalk MQ (also a fast in memory queue, but also has persistence via logging). Beanstalk also has C# libs.

Query from database or from memory? Which is faster?

I am trying to improve the performance of a Windows Service, developed in C# and .NET 2.0, that processes a great amount of files. I want to process more files per second.
In its process, for each file, the service does a database query to retrieve some parameters of the system.
Those parameters change annually, and I am thinking that I would gain some performance, if a loaded those parameters as a singleton and refreshed this singleton periodically. Instead of make a database query for each file being processed, I would get the parameters from memory.
To complete the scenario : I am using Windows Server 2008 R2 64 Bits, SQL Server 2008 is the database, C# and .NET 2.0 as already mentioned.
I am right in my approach? What would you do?
Thanks!

Those parameters change anually
Yes, do cache them in memory. Especially if they are large or complex.
You should take care to invalidate them at the right time once a year, depending how accurate that has to be.
Simply caching them for an hour or even for a few minutes might be a good compromise.

RAM memory data access is definitely faster that any other data access, except than cpu memories like registries and CPU cache
Chaching would be faster even if you change it every minute, so yes, caching that query is very faster

Crossing a network or going to disk is always orders of magnitude slower than in memory access.
Databases can cache data in memory so if you can achieve that and you're not crossing a network, the database might be faster since their data access patterns/indexes etc... may be faster than you're code. But, that's best case - if you need it faster, in memory caches help.
But, be aware that in memory caches can add complexity and bugs. You have to determine the lifetime of the cached data, how to refresh and the more complex it is, the more wierd edge case state bugs you will have. Even though they change annually, you have to handle that cusp.

50GB HttpRuntime.Cache Persistence Possible?

We have an ASP.NET 4.0 application that draws from a database a complex data structure that takes over 12 hours to push into an in memory data structure (that is later stored in HttpRuntime.Cache). The size of the data structure is quickly increasing and we can't continue waiting 12+ hours to get it into memory if the application restarts. This is a major issue if you want to change the web.config or any code in the web application that causes a restart - it means a long wait before the application can be used, and hinders development or updating the deployment.
The data structure MUST be in memory to work at a speed that makes the website usable. In memory databases such as memcache or Redis are slow in comparison to HttpRuntime.Cache, and would not work in our situation (in memory db's have to serialize put/get, plus they can't reference each other, they use keys which are lookups - degrading performance, plus with a large amount of keys the performance goes down quickly). Performance is a must here.
What we would like to do is quickly dump the HttpRuntime.Cache to disk before the application ends (on a restart), and be able to load it back immediately when the application starts again (hopefully within minutes instead of 12+ hours or days).
The in-memory structure is around 50GB.
Is there a solution to this?

In memory databases such as memcache or Redis are slow in comparison to HttpRuntime.Cache
Yes, but they are very fast compared to a 12+ hour spin-up. Personally, I think you're taking the wrong approach here in forcing load of a 50 GB structure. Just a suggestion, but we use HttpRuntime.Cache as part of a multi-tier caching strategy:
local cache is checked etc first
otherwise redis is used as the next tier of cache (which is faster than the underlying data, persistent, and supports a number of app servers) (then local cache is updated)
otherwise, the underlying database is hit (and then both redis and local cache are updated)
The point being, at load we don't require anything in memory - it is filled as it is needed, and from then on it is fast. We also use pub/sub (again courtesy of redis) to ensure cache invalidation is prompt. The net result: it is fast enough when cold, and very fast when warm.
Basically, I would look at anything that avoids needing the 50GB data before you can do anything.
If this data isn't really cache, but is your data, I would look at serialization on a proper object model. I would suggest protobuf-net (I'm biased as the author) as a strong candidate here - very fast and very small output.

Memory Management - How and when to write large objects to disk

I am working on an application which has potential for a large memory load (>5gb) but has a requirement to run on 32bit and .NET 2 based desktops due to the customer deployment environment. My solution so far has been to use an app-wide data store for these large volume objects, when an object is assigned to the store, the store checks for the total memory usage by the app and if it is getting close to the limit it will start serialising some of the older objects in the store to the user's temp folder, retrieving them back into memory as and when they are needed. This is proving to be decidedly unreliable, as if other objects within the app start using memory, the store has no prompt to clear up and make space. I did look at using weak pointers to hold the in-memory data objects, with them being serialised to disk when they were released, however the objects seemed to be getting released almost immediately, especially in debug, causing a massive performance hit as the app was serialising everything.
Are there any useful patterns/paradigms I should be using to handle this? I have googled extensively but as yet haven't found anything useful.

I thought virtual memory was supposed to have you covered in this situation?
Anyways, it seems suspect that you really need all 5gb of data in memory at any given moment - you can't possibly be processing all that data at any given time - at least not on what sounds like a consumer PC. You didn't go into detail about your data, but something to me smells like the object itself is poorly designed in the sense that you need the entire set to be in memory to work with it. Have you thought about trying to fragment out your data into more sensible units - and then do some preemptive loading of the data from disk, just before it needs to be processed? You'd essentially be paying a more constant performance trade-off this way, but you'd reduce your current thrashing issue.

Maybe you go with Managing Memory-Mapped Files and look here. In .NET 2.0 you have to use PInvoke to that functions. Since .NET 4.0 you have efficient built-in functionality with MemoryMappedFile.
Also take a look at:
http://msdn.microsoft.com/en-us/library/dd997372.aspx
You can't store 5GB data in-memory efficiently. You have 2 GB limit per process in 32-bit OS and 4 GB limit per 32-bit process in 64-bit Windows-on-Windows
So you have choice:
Go in Google's Chrome way (and FireFox 4) and maintain potions of data between processes. It may be applicable if your application started under 64-bit OS and you have some reasons to keep your app 32-bit. But this is not so easy way. If you don't have 64-bit OS I wonder where you get >5GB RAM?
If you have 32-bit OS when any solution will be file-based. When you try to keep data in memory (thru I wonder how you address them in memory under 32-bit and 2 GB per process limit) OS just continuously swap portions of data (memory pages) to disk and restores them again and again when you access it. You incur great performance penalty and you already noticed it (I guessed from description of your problem). The main problem OS can't predict when you need one data and when you want another. So it just trying to do best by reading and writing memory pages on/from disk.
So you already use disk storage indirecltly in inefficient way, MMFs just give you same solution in efficient and controlled manner.
You can rearchitecture your application to use MMFs and OS will help you in efficient caching. Do the quick test by yourself MMF maybe good enough for your needs.
Anyway I don't see any other solution to work with dataset greater than available RAM other than file-based. And usually better to have direct control on data manipulation especially when such amount of data came and needs to be processed.

When you have to store huge loads of data and mantain accessibility, sometimes the most useful solution is to use data store and management system like database. Database (MySQL for example) can store a lots of typical data types and of course binary data too. Maybe you can store your object to database (directly or by programming business object model) and get it when you need to. This solution sometimes can solve many problems with data managing (moving, backup, searching, updating...), and storage (data layer) - and it's location independent - mayby this point of view can help you.

Disadvantages of In-Memory Caching of Transformed XML?

I'm writing a web application that constantly retrieves XML "components" from a database and then transforms them into XHTML using XSLT. Some of these transformations happen frequently (e.g. a "sidebar navigation" component goes out for the same XML and performs the same XSL transformation on every page that features that sidebar), so I have started implementing some caching to speed things up.
In my current solution, before each component attempts to perform a transformation, the component checks with a static CacheManager object to see if a cached version of the transformed XML exists. If so, the component outputs this. If not, the component performs the transformation and then stores the transformed XML with the CacheManager object.
The CacheManager object keeps an in-memory store of the cached transformed XML (in a Dictionary, to be exact). In my local, development environment this is working beautifully, but I'm assuming this may not be a very scalable solution.
What are the potential downfalls of storing these data in-memory? Do I need to put a cap on the amount of data I can store in an in-memory data structure like this? Should I be using a different data store for this type of caching altogether?

The obvious disadvantage will be, as you suspect, potentially high memory usage by your cache. You may want to implement a system where rarely-used items "expire" out of the cache when memory pressure goes up. Microsoft's Caching Application Block implements pretty much everything you need, right out of the box.
Another potential (though unlikely) problem you could run into is the cost of digging through the cache to find what you need. At some point it could be faster to just go ahead and generate what you need instead of looking through the cache. We've run into this in at least one precise scenario related to very large caches and very cheap operations. It's unlikely, but it can happen.

You can do the caching but only keep reference values (uri) in the dictionary and store the actual transformed XML on disk. This is probably going to be faster than retrieving values from the database and doing the transformation again, slower than storing this all in memory but gets around the problem of having all of this cached data in memory in the first place. This also could allow the transformed XML documents to survive through a recycle/reset, but you'd have to rebuild your dictionary and you'd also have to think about "expiring" documents that need to be purged from the disk cache.
Just an idea..

You should definitly look for a generic chache implementation. I dont do C#, so I dont know of any solution in your language. In Java, I would have recommended EHCache.
Caching is much harder that it seems at first. That's why it is probably a good idea to rely on work done by others. Some of the problem you will run into at some point are concurrency, cache invalidation, time to live, blocking caches, cache management (statistics, clearing of caches, ...), overflowing to disk, distributed caching, ...
Monitoring a cache is a must. You need to see if the caching strategy you put in place is actually doing anything good (cache hits / cache misses, percentage of cache used, ...) You should probably divide your cache in multiple regions to be able to monitor better the cache usage.
As a side note, as long as you cache your XML after transformation, you should probably store its String representation (and not an object tree). That's one less transformation to do after the cache asyou are probably outputing it as String anyway. And there is a good chance that the String representation will take less space (but as always, measure that, dont take my word for it).

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.