I have a requirement where a huge amount of data needs to be cached on disk.
Whenever there is a change in the database, the data is retrieved from the database and cached on disk. I will have a background process that keeps checking my cached data against the database and updates it as and when required.
I would like to know what would be the best way to organize the cached data on disk, so that writing to and reading from the cache are as fast as possible.
Another thread will be used to fetch new data from the DB and cache it on disk. I also need to take care of synchronization between the two threads (one will be updating the existing cached data, and the other will be writing newly fetched data into the cache).
Please suggest a strategy for organizing the data in the cache, and for synchronization between the threads.
SQL Server has something called XML tables. Those tables are based on physical XML files located on disk. You can map/link XML data on disk to a table in SQL Server. For users it is seamless; in other words, they see those tables as regular tables.
Setting aside the technical/philosophical discussion about caching huge amounts of data on disk, this is just an idea...
Do you care about the consistency of the data on power failures?
Memory-mapped files along with occasional flushes will probably get you what you want.
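A minimal sketch of that idea in C# (the file name, map name and 1 MB capacity below are illustrative, not part of any real design):

    using System;
    using System.IO;
    using System.IO.MemoryMappedFiles;

    class MmfCacheSketch
    {
        static void Main()
        {
            // "cache.bin", "diskCache" and the 1 MB capacity are placeholder values.
            using (var mmf = MemoryMappedFile.CreateFromFile(
                "cache.bin", FileMode.OpenOrCreate, "diskCache", 1024 * 1024))
            using (var accessor = mmf.CreateViewAccessor())
            {
                accessor.Write(0, 42L);   // write a value at offset 0
                accessor.Flush();         // an occasional flush pushes dirty pages to disk

                long value = accessor.ReadInt64(0);
                Console.WriteLine(value);
            }
        }
    }

The OS keeps the mapped pages in memory for fast access, and Flush() is the point at which you decide how much you are willing to lose on a power failure.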
Do you need indexed access to the data?
You probably need to design something around a B-tree or B+tree implementation, which gives efficient retrieval of indexed data and better block-level locking.
http://code.google.com/p/high-concurrency-btree/
As an alternative answer, my own B+Tree implementation will neatly address this as a completely managed code (C#) implementation of an IDictionary<TKey, TValue>. It is a single-file key/value store that is thread-safe and optimized for concurrency. It was built from the ground up expressly for this purpose and for providing write-through caches.
Discussion – http://csharptest.net/projects/bplustree/
Online Help – http://help.csharptest.net/
Source Code – http://code.google.com/p/csharptest-net/
Downloads – http://code.google.com/p/csharptest-net/downloads
NuGet Package – http://nuget.org/List/Packages/CSharpTest.Net.BPlusTree
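For a rough idea of usage, the store behaves like a persistent IDictionary once constructed. The sketch below follows my recollection of the library's options pattern; the exact option and serializer names should be verified against the online help above, and the file name is made up:

    using System;
    using CSharpTest.Net.Collections;
    using CSharpTest.Net.Serialization;

    class BPlusTreeSketch
    {
        static void Main()
        {
            // Option/serializer names are approximate; check the project documentation.
            var options = new BPlusTree<int, string>.OptionsV2(
                PrimitiveSerializer.Int32, PrimitiveSerializer.String);
            options.CreateFile = CreatePolicy.IfNeeded;
            options.FileName = "cache.tree";

            using (var tree = new BPlusTree<int, string>(options))
            {
                tree[1] = "first item";   // standard IDictionary<TKey,TValue> usage
                string value;
                if (tree.TryGetValue(1, out value))
                    Console.WriteLine(value);
            }
        }
    }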
Related
When working with ASP.NET MVC and SQL Server, we are wondering if caching to XML is still something to consider, or are there other possibilities for this?
For instance, we have a table called Customers. If you hit this DB table every time someone clicks on Customers, or sorts or filters in the app, why not store this info in an XML file?
Then you work only with the XML file and not the DB, and you update the XML after committing changes to the Customers table.
It is an absolutely brilliant idea.
If:
You only have 1 client
Or you have multiple clients but they don't mind seeing old data
You have a database system that doesn't provide caching possibilities
You do not use database access frameworks that can handle caching for you
In short, no, it actually is almost never a good idea.
Databases are made to be used. Most of them can handle a much higher load than programmers think they can, as long as you treat them well. If necessary, a lot of them provide perfectly fine caching possibilities to improve performance if needed.
Any useful type of caching in your application should involve refreshing that cache when anything changes. Implementing that by yourself is usually not a good idea. If you do want a very simple cache of data that was just on the screen before the user clicked away, memory would be the place for it, not the file system. Unless you need a centralised session cache, but that goes way beyond "let's write some XML".
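For that simple in-memory case, something along these lines is usually enough (the cache key, the 5-minute timeout and the loader delegate are arbitrary examples):

    using System;
    using System.Runtime.Caching;

    class CustomerCacheSketch
    {
        static readonly ObjectCache Cache = MemoryCache.Default;

        // loadCustomersFromDb stands in for whatever data access you already have.
        public static object GetCustomers(Func<object> loadCustomersFromDb)
        {
            var customers = Cache.Get("customers");
            if (customers == null)
            {
                customers = loadCustomersFromDb();
                // Cache for 5 minutes (arbitrary); any write to the Customers table
                // should evict or refresh this entry.
                Cache.Set("customers", customers, DateTimeOffset.Now.AddMinutes(5));
            }
            return customers;
        }
    }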
Caching to an XML file is a bad choice. A database system can handle a load of 100 users in 5 seconds even with 50,000 records in your table. If you want more speed than that, try using an in-memory SQL engine, which stores data in RAM for fast access, but for that you need plenty of RAM on the server.
I have a variety of rich data structures (primarily trees) that I would like to persist to disk, meaning I not only want to write them to disk but I want a guarantee that the data has been fully written and will survive a power-down.
Others seem to design ways to encode rich data structures in flat database tables as lookup tables from parent to child nodes. This facilitates running SQL queries against the data but I have no need for that: I just want to save and load my trees.
The obvious solution is to store everything as a blob in the database: a single entry, perhaps containing a long string. Is that an abuse of the database or a recommended practice? Another solution might be to use an XML database. Are there any alternatives to databases that I should be considering?
Finally, I'm doing this from F# so a turnkey solution for persisting data from .NET would be ideal...
EDIT: Please note that formatting (e.g. serialization) is irrelevant, as I can trivially convert between formats with F#. This is about getting an acknowledgement that a write has been completed all the way down to the non-volatile store (i.e. the disk platter) and that no part of the written data is still held in a volatile store (e.g. a RAM cache), so that I can continue safe in that knowledge (e.g. by deleting the old version of the data from disk).
Some of the constructors for .NET's FileStream class take a parameter of type FileOptions. One of the values for FileOptions is WriteThrough, which "Indicates that the system should write through any intermediate cache and go directly to disk."
This should ensure that by the time your write operation (to a new file) returns, the data is committed to disk and you can safely delete the old file.
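Something like this, where the path is of course a placeholder:

    using System.IO;

    class WriteThroughSketch
    {
        public static void SaveDurably(string path, byte[] data)
        {
            // FileOptions.WriteThrough asks the OS to skip its intermediate cache
            // and go directly to disk; flush the stream as well before disposing.
            using (var fs = new FileStream(path, FileMode.Create, FileAccess.Write,
                                           FileShare.None, 4096, FileOptions.WriteThrough))
            {
                fs.Write(data, 0, data.Length);
                fs.Flush();
            }
        }
    }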
This can be done via Serialization.
The .NET Framework includes many built-in options for serializing your data to disk, including using binary or XML-based formats. Detailed How-To articles are provided in the MSDN Documentation.
In order to do this, you will require a resource which will allow you to engage in a transaction (more often than not, you would use a TransactionScope).
Most databases will participate in a transaction if an ambient one is present. Disk operations can also be managed by a transaction (Transactional NTFS), but you would have to do some specific interop work in order to use it from .NET.
Also, note that transactional file operations are only available on Windows Vista and later.
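A minimal sketch of the database side of that; the connection string and the Trees table are invented for the example, and the transactional file system part (which needs the extra interop work) is not shown:

    using System.Data.SqlClient;
    using System.Transactions;

    class TransactionSketch
    {
        public static void SaveTree(string connectionString, byte[] serializedTree)
        {
            using (var scope = new TransactionScope())
            using (var connection = new SqlConnection(connectionString))
            {
                connection.Open();   // enlists in the ambient transaction automatically
                using (var command = new SqlCommand(
                    "UPDATE Trees SET Data = @data WHERE Id = 1", connection))
                {
                    command.Parameters.AddWithValue("@data", serializedTree);
                    command.ExecuteNonQuery();
                }
                scope.Complete();    // nothing is committed if this line is never reached
            }
        }
    }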
If you go the database route, then you could store the serialized contents of your trees in a blob (or text, depending on the serialization mechanism).
Note, you can also use the FILESTREAM functionality in SQL Server (2008 and up, I believe) to store your files on the filesystem and gain the benefits of transactions in SQL Server.
I haven't used db4o from F# before, but it's all about persisting CLR object graphs to disk in a transactional manner. If it works with records and discriminated unions, it might suit you.
Edit: I just tested db4o 8.0 (.NET 4 version) and it seems to handle both record types and discriminated union hierarchies perfectly well.
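Roughly what that looks like from C#; the file name and the Node type are made up, and the calls follow db4o's documented embedded API, so double-check them against the version you use:

    using System;
    using Db4objects.Db4o;

    class Db4oSketch
    {
        // A made-up record type standing in for an F# record / tree node.
        public class Node { public string Name; public Node Left; public Node Right; }

        static void Main()
        {
            IObjectContainer db = Db4oEmbedded.OpenFile("trees.db4o");
            try
            {
                db.Store(new Node { Name = "root" });   // persist the object graph
                db.Commit();                            // transactional commit

                foreach (Node n in db.Query<Node>())    // load everything of that type back
                    Console.WriteLine(n.Name);
            }
            finally
            {
                db.Close();
            }
        }
    }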
Try using XmlSerializer (System.Xml.Serialization).
http://msdn.microsoft.com/en-us/library/system.xml.serialization.xmlserializer.aspx
It can automatically persist complex data structures based on their properties, and you can use attributes to control the output, if you wish:
http://msdn.microsoft.com/en-us/library/83y7df3e.aspx
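A small example of what that looks like; the Document type here is just a stand-in for your own classes:

    using System.IO;
    using System.Xml.Serialization;

    public class Document   // stand-in type; any public type with public members works
    {
        public string Title { get; set; }
        public string[] Words { get; set; }
    }

    class XmlSerializerSketch
    {
        public static void Save(string path, Document doc)
        {
            var serializer = new XmlSerializer(typeof(Document));
            using (var stream = File.Create(path))
                serializer.Serialize(stream, doc);
        }

        public static Document Load(string path)
        {
            var serializer = new XmlSerializer(typeof(Document));
            using (var stream = File.OpenRead(path))
                return (Document)serializer.Deserialize(stream);
        }
    }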
Slightly OT as the OP didn't want XML, but seeing others mentioned the XML formatter...
If you want textual persistence, the SoapFormatter handles cases (cycles/object graphs) that the default XML serializer does not - its XML is not as readable as XmlSerializer's, but it's more readable than binary :)
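For example (the GraphNode type is a stand-in; note that SoapFormatter lives in the System.Runtime.Serialization.Formatters.Soap assembly and requires [Serializable] types):

    using System;
    using System.IO;
    using System.Runtime.Serialization.Formatters.Soap;

    [Serializable]
    public class GraphNode   // stand-in type; cycles between nodes are handled
    {
        public string Name;
        public GraphNode Next;
    }

    class SoapFormatterSketch
    {
        public static void Save(string path, GraphNode root)
        {
            var formatter = new SoapFormatter();
            using (var stream = File.Create(path))
                formatter.Serialize(stream, root);   // whole object graph, cycles included
        }
    }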
What is the best way to save a large amount of data for a .NET 4.0 application?
Right now I am using Lists and serializing to a file in the "User Data" folder, and it's working OK, but I want to know if there is a better/faster way of saving/loading a large amount of data.
The data I am saving contains only lots of words, like documents.
The size of the data is almost 1 MB.
That really depends on the type of your application. I wouldn't use an SQL database of any sort just to load and save data that I do not need to query or transform; the time it takes to map your object graph to a relational model is just not worth it.
Also, I don't believe it will ever be faster than simple serialization, due to the overhead associated with databases (connection management and mapping).
My recent experience was with BinaryFormatter, which had excellent results (files ~15 MB). Worst comes to worst, you can always write your own formatter.
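For reference, the BinaryFormatter usage is only a few lines; the list-of-strings payload mirrors the question:

    using System.Collections.Generic;
    using System.IO;
    using System.Runtime.Serialization.Formatters.Binary;

    class BinaryFormatterSketch
    {
        public static void Save(string path, List<string> words)
        {
            var formatter = new BinaryFormatter();
            using (var stream = File.Create(path))
                formatter.Serialize(stream, words);
        }

        public static List<string> Load(string path)
        {
            var formatter = new BinaryFormatter();
            using (var stream = File.OpenRead(path))
                return (List<string>)formatter.Deserialize(stream);
        }
    }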
It kind of depends on your data and how you have it stored in your app.
But any of the NoSQL storage systems are a possibility, or just plain binary data written to a file.
When you say "large amount of data", what exactly do you mean by that? A megabyte? A terabyte?
And what exactly is the data?
If it's a set of account records, it might well belong in a database of some sort; if it's a set of images or word processing documents, perhaps not.
If you want fast access, one approach would be to deserialize into a hashtable and cache it in between reads and writes.
The problems here are, of course, versioning, changes of namespaces (then you won't be able to deserialize easily), deadlocks, concurrency, etc.
It is better to save the file as XML/JSON, and when you read it into memory, load it into a hashtable for fast access.
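A rough sketch of that last suggestion; the XML shape (Items/Item, Id attribute) is made up purely for illustration:

    using System.Collections.Generic;
    using System.Linq;
    using System.Xml.Linq;

    class XmlToHashtableSketch
    {
        // Assumes a file shaped like <Items><Item Id="1">some text</Item>...</Items>;
        // the element and attribute names are invented for the example.
        public static Dictionary<string, string> LoadIntoMemory(string path)
        {
            return XDocument.Load(path)
                .Descendants("Item")
                .ToDictionary(e => (string)e.Attribute("Id"),
                              e => e.Value);
        }
    }

After this, all reads are dictionary lookups in memory, and you only touch the file again when you write changes back.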
We have an application (a rules engine) that holds a lot of tables in memory to evaluate certain business rules. This engine is also used for writing back to the database when needed.
The DB structure is denormalized, and we have 5 transactional tables, that also sometimes need to be queried for reporting.
The issue here is, we want to cache the data inside the app, so it loads on App startup, and then only changes if the DB changed.
Any recommendations?
We are leaning towards creating a DB service, that will handle all Inserts, Updates and Deletes, and queue them to decrease load on the DB server (the transactional tables have loads of indexes also). Also, we are thinking of enabling the DB service to sit on top and serve all reports / other apps that need direct DB access.
The aim here, of course, is to decrease DB hits for SELECT queries per request and to prioritize transactions, and also to ensure that people accessing the apps don't bring the DB server down.
Rules Engine is a C# desktop app, reporting and other apps are web based.
What would be the best way to go about this? I did also think of removing all indexes from my transactional table, and having a trigger insert into a new table which would be a copy, but indexed for report retrieval.
You should perhaps look at distributed caching solutions (from both a performance and a scalability point of view). In short, I am talking about scalable DB services backed by a distributed cache (so that multiple DB services get served by the same cache).
Here's the article that discusses distributed caching, including various approaches for database synchronization. And here is the blog post that lists a few options in .NET for distributed caching.
I've done something similar with an obscenely complex rules engine. Ultimately, I set it up so that the data was serialized centrally (with a process to release new changes, causing a new copy to be serialized and the blob stored somewhere accessible). During load, each app server would check whether it had the up-to-date version of the blob, and if not fetch it (and store it locally).
Then all it has to do is deserialize the data into memory. No DB hit, except for occasionally grabbing the new blob. It also means the app server can work while the DB server is offline (as long as it has a cached copy of the blob). It also polled periodically for new updates while running, of course - but only the "is there a new blob" check (it still didn't need to hit the main tables).
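In outline, the startup check looked something like this; the RulesBlob table, its columns and the local path are invented here just to show the shape of it:

    using System.Data.SqlClient;
    using System.IO;

    class BlobCacheSketch
    {
        // Hypothetical schema: RulesBlob(Version int, Data varbinary(max)).
        public static byte[] GetCurrentBlob(string connectionString, string localPath,
                                            ref int localVersion)
        {
            using (var connection = new SqlConnection(connectionString))
            {
                connection.Open();
                var versionCmd = new SqlCommand(
                    "SELECT MAX(Version) FROM RulesBlob", connection);
                int dbVersion = (int)versionCmd.ExecuteScalar();

                if (dbVersion == localVersion && File.Exists(localPath))
                    return File.ReadAllBytes(localPath);      // cached copy is current

                var dataCmd = new SqlCommand(
                    "SELECT Data FROM RulesBlob WHERE Version = @v", connection);
                dataCmd.Parameters.AddWithValue("@v", dbVersion);
                byte[] blob = (byte[])dataCmd.ExecuteScalar();

                File.WriteAllBytes(localPath, blob);          // refresh the local copy
                localVersion = dbVersion;
                return blob;                                  // deserialize this into memory
            }
        }
    }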
You may be interested in this article. It uses XML to store a read-only copy of the database (in memory), and XPath to query it. Nowadays you'd prefer to query with LINQ, of course.
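The LINQ version of such a query is short; the element and attribute names below are invented for the example:

    using System;
    using System.Linq;
    using System.Xml.Linq;

    class XmlQuerySketch
    {
        static void Main()
        {
            // In-memory read-only copy of a table; names are invented for the example.
            var doc = XDocument.Parse(
                "<Customers><Customer Id='1' Name='Alice'/><Customer Id='2' Name='Bob'/></Customers>");

            var names = from c in doc.Descendants("Customer")
                        where (int)c.Attribute("Id") > 1
                        select (string)c.Attribute("Name");

            foreach (var name in names)
                Console.WriteLine(name);
        }
    }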
Background:
I have one Access database (.mdb) file, with half a dozen tables in it. This file is ~300 MB, so not huge, but big enough that I want to be efficient. In it, there is one major table, a client table. The other tables store data like consultations made, a few extra many-to-one fields, that sort of thing.
Task:
I have to write a program to convert this Access database to a set of XML files, one per client. This is a database conversion application.
Options:
(As I see it)
Load the entire Access database into memory in the form of Lists of immutable objects, then use LINQ to do lookups in these lists for the associated data I need.
Benefits:
Easily parallelised: start up a ThreadPool thread for each client. Because all the objects are immutable, they can be freely shared between the threads, which means all threads have access to all data at all times, and it is all loaded exactly once.
(Possible) Cons:
May use extra memory, loading orphaned items, items that aren't needed anymore, etc.
Use Jet to run queries on the database to extract data as needed.
Benefits:
Potentially lighter weight. Only loads data that is needed, and as it is needed.
(Possible) Cons:
Potentially heavier! May load items more than once and hence use more memory.
Possibly hard to parallelise, unless Jet/OleDb supports concurrent queries (can someone confirm/deny this?)
Some other idea?
What are Stack Overflow's thoughts on the best way to approach this problem?
Generate XML parts from SQL. Store each fetched record in the file as you fetch it.
Sample:
SELECT '<NODE><Column1>' + Column1 + '</Column1><Column2>' + Column2 + '</Column2></NODE>' FROM MyTable
If your objective is to convert your database to xml files, you can then:
connect to your database through an ADO/OLEDB connection
successively open each of your tables as ADO recordsets
Save each of your recordsets as an XML file:
myRecordset.save myXMLFile, adPersistXML
If you are working from the Access file, use the currentProject.accessConnection as your ADO connection
From the sounds of this, it would be a one-time operation. I strongly discourage loading the entire database into memory; that just does not seem like an efficient way of doing this at all.
Also, depending on your needs, you might be able to extract directly from Access -> XML if that is your true end game.
Regardless, with a database that small, doing them one at a time with a few specifically written queries would, in my opinion, be easier to manage, faster to write, and less error-prone.
I would lean towards Jet, since you can be more specific about what data you want to pull.
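A sketch of that per-client query approach; the connection string path, the Consultations table and its column names are guesses at the schema described above, so treat them as placeholders:

    using System;
    using System.Data.OleDb;

    class JetQuerySketch
    {
        static void Main()
        {
            // Path, table and column names are illustrative only.
            var connectionString =
                @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\data\clients.mdb";

            using (var connection = new OleDbConnection(connectionString))
            {
                connection.Open();
                using (var command = new OleDbCommand(
                    "SELECT * FROM Consultations WHERE ClientId = ?", connection))
                {
                    command.Parameters.AddWithValue("ClientId", 42);   // one client at a time
                    using (var reader = command.ExecuteReader())
                    {
                        while (reader.Read())
                            Console.WriteLine(reader["ConsultationDate"]);
                    }
                }
            }
        }
    }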
Also, I noticed the large file size; this is a problem I have recently come across at work. Is this an Access 95 or 97 DB? If so, converting the DB to 2000 or 2003 and then back to 97 will reduce the size; it seems to be a bug in some cases. The DB I was dealing with claimed to be 70 MB; after I converted it to 2000 and back again, it was 8 MB.