I have a variety of rich data structures (primarily trees) that I would like to persist to disk, meaning I not only want to write them to disk but I want a guarantee that the data has been fully written and will survive a power-down.
Others seem to design ways to encode rich data structures in flat database tables as lookup tables from parent to child nodes. This facilitates running SQL queries against the data but I have no need for that: I just want to save and load my trees.
The obvious solution is to store everything as a blob in the database: a single entry perhaps containing a long string. Is that an abuse of the database or a recommended practice? Another solution might be to use an XML database. Are there any alternatives to databases that I should be considering?
Finally, I'm doing this from F# so a turnkey solution for persisting data from .NET would be ideal...
EDIT: Please note that formatting (e.g. serialization) is irrelevant as I can trivially convert between formats with F#. This is about getting an acknowledgement that a write has been completed all the way down to the non-volatile store (i.e. the disk platter) and no part of the written data is still being held in volatile store (e.g. a RAM cache) so that I can continue safe in that knowledge (e.g. by deleting the old version of the data from disk).
Some of the constructors for .NET's FileStream class take a parameter of type FileOptions. One of the values for FileOptions is WriteThrough, which "Indicates that the system should write through any intermediate cache and go directly to disk."
This should ensure that by the time your write operation (to a new file) returns, the data is committed to disk and you can safely delete the old file.
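The write-new-then-delete-old pattern described above can be sketched in a language-neutral way. The following is a minimal Python sketch, not .NET code: the function name durable_replace and the ".tmp" suffix are made up for illustration. In .NET the counterparts would be FileOptions.WriteThrough (as quoted above) or FileStream.Flush(true). Whether the drive's own cache honours the flush request is hardware-dependent, so treat this as a sketch under that assumption.

```python
import os
import tempfile


def durable_replace(path, data):
    """Write data to a new file, force it to disk, then swap it in for path.

    Hypothetical helper: the old version is only replaced after the new
    bytes have been flushed past the OS cache.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so the rename stays
    # on the same volume.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # push Python's buffer to the OS
            os.fsync(f.fileno())  # ask the OS to push its cache to the device
        os.replace(tmp_path, path)  # swap the new version in over the old one
    except BaseException:
        # On failure, leave the old version untouched and clean up.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Calling durable_replace("tree.bin", payload) either leaves the previous tree.bin in place or replaces it with the complete new contents; there is no window in which only a partially written file exists under the final name.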
This can be done via Serialization.
The .NET Framework includes many built-in options for serializing your data to disk, including using binary or XML-based formats. Detailed How-To articles are provided in the MSDN Documentation.
In order to do this, you will require a resource which will allow you to engage in a Transaction (more often than not, you would use a TransactionScope).
Most databases will participate in a Transaction if one is present. Disk operations can also be managed by a Transaction, but you would have to do some specific work in order to utilize it in .NET.
Also, note that this is only available on Windows Vista and later.
If you go the database route, then you could store the serialized contents of your trees in a blob (or text, depending on the serialization mechanism).
Note, you can also use the FILESTREAM functionality in SQL Server (2008 and up, I believe) to store your files on the filesystem and gain the benefits of transactions in SQL Server.
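The blob approach mentioned above can be sketched as follows. This is a minimal Python illustration using SQLite as a stand-in database and JSON as a stand-in serialization format; the table name trees, the column names, and the helper functions are all assumptions for the example, not anything prescribed by SQL Server or .NET.

```python
import json
import sqlite3


def save_tree(conn, name, tree):
    """Serialize the whole tree and store it as one blob, inside a transaction."""
    blob = json.dumps(tree).encode("utf-8")
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT OR REPLACE INTO trees (name, body) VALUES (?, ?)",
            (name, blob),
        )


def load_tree(conn, name):
    """Fetch the blob back and deserialize it; returns None if not found."""
    row = conn.execute(
        "SELECT body FROM trees WHERE name = ?", (name,)
    ).fetchone()
    return None if row is None else json.loads(row[0].decode("utf-8"))


# In-memory database for the demonstration; a real one would live on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trees (name TEXT PRIMARY KEY, body BLOB NOT NULL)")

tree = {"value": 1, "children": [{"value": 2, "children": []}]}
save_tree(conn, "example", tree)
restored = load_tree(conn, "example")
```

The database never looks inside the blob, so the tree cannot be queried with SQL, but saving and loading stay a single statement each and the write is covered by the database's transaction.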
I haven't used db4o from F# before, but it's all about persisting CLR object graphs to disk in a transactional manner. If it works with records and discriminated unions, it might suit you.
Edit: I just tested db4o 8.0 (.NET 4 version) and it seems to handle both record types and discriminated union hierarchies perfectly well.
Try using XmlSerializer (System.Xml.Serialization).
http://msdn.microsoft.com/en-us/library/system.xml.serialization.xmlserializer.aspx
It can automatically persist complex data structures based on their properties, and you can use attributes to control the output, if you wish:
http://msdn.microsoft.com/en-us/library/83y7df3e.aspx
Slightly OT as the OP didn't want XML, but seeing others mentioned the XML formatter...
If you want textual persistence, the SoapFormatter handles cases (cycles/object-graphs) that the default XML formatter does not - its XML is not as readable as XMLFormatter's, but it's more readable than binary :)
I want to use a GTFS feed in Google Maps, but I don't know how to. I want to display the buses available from a route. Just so you know, I'm planning on implementing the Google Map I make in a Visual C# application.
This is a very general question, so my answer will necessarily be general as well. If you can provide more detail about what you're trying to accomplish I'll try to offer more specific help.
At a high level, the steps for working with a GTFS feed are:
Parse the data. From the GTFS feed's URL you'll obtain a ZIP file containing a set of CSV files. The format of these files is specified in Google's GTFS reference, and most languages already have a CSV-parsing library available that can be used to read in the data. Additionally, for some languages there are GTFS-parsing libraries available that will return data from these files as objects; it looks like there's one available for C#, gtfsengine, that you might want to check out.
Load the data. You'll need to store the data somewhere, at least temporarily, to be able to work with it. This could simply be a data structure in memory (particularly if you've written your own parsing code) but since larger feeds can take some time to read you'll probably want to look at using a relational database or some other kind of storage you can serialize to disk. In the application I'm developing, a separate process parses and loads GTFS data into a relational database in one pass.
Query the data. Obviously how you do this will depend on the method you use for storing the data and the purpose of your application. If you're using a relational database, you will typically have one table per GTFS entity (or CSV file) on which you can construct indices and against which you can execute SQL queries. If you're working with objects in memory, you might construct a hash-table index in memory as well and query that to find the data you need.
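The three steps above (parse, load, query) can be sketched end to end. This is a minimal Python sketch rather than C#, using only the standard library: the feed is a tiny made-up ZIP built in memory instead of a real download, the table layout is an assumption, and only two GTFS files (routes.txt and trips.txt) are handled.

```python
import csv
import io
import sqlite3
import zipfile


def make_sample_feed():
    """Build a tiny in-memory stand-in for a downloaded GTFS ZIP (made-up data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr(
            "routes.txt",
            "route_id,route_short_name,route_long_name,route_type\n"
            "R1,10,Downtown Loop,3\n"
            "R2,20,Airport Express,3\n",
        )
        z.writestr(
            "trips.txt",
            "route_id,service_id,trip_id\n"
            "R1,WKD,T1\n"
            "R1,WKD,T2\n"
            "R2,WKD,T3\n",
        )
    buf.seek(0)
    return buf


def read_table(feed, name):
    """Step 1, parse: read one CSV file from the ZIP into a list of dicts."""
    with feed.open(name) as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
        return list(csv.DictReader(text))


def load_feed(conn, fileobj):
    """Step 2, load: one table per GTFS file, plus an index for lookups."""
    conn.execute(
        "CREATE TABLE routes (route_id TEXT PRIMARY KEY, short_name TEXT, long_name TEXT)"
    )
    conn.execute("CREATE TABLE trips (trip_id TEXT PRIMARY KEY, route_id TEXT)")
    with zipfile.ZipFile(fileobj) as feed, conn:
        conn.executemany(
            "INSERT INTO routes VALUES (:route_id, :route_short_name, :route_long_name)",
            read_table(feed, "routes.txt"),
        )
        conn.executemany(
            "INSERT INTO trips VALUES (:trip_id, :route_id)",
            read_table(feed, "trips.txt"),
        )
    conn.execute("CREATE INDEX trips_by_route ON trips (route_id)")


def trips_for_route(conn, short_name):
    """Step 3, query: trip ids that run on the route with this short name."""
    rows = conn.execute(
        "SELECT t.trip_id FROM trips AS t "
        "JOIN routes AS r ON r.route_id = t.route_id "
        "WHERE r.short_name = ? ORDER BY t.trip_id",
        (short_name,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
load_feed(conn, make_sample_feed())
print(trips_for_route(conn, "10"))  # ['T1', 'T2']
```

A real feed has more files (stops.txt, stop_times.txt, calendar.txt and so on) and far more rows, so a disk-backed database and a one-off import process, as described above, would replace the in-memory pieces used here.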
I have a requirement where a huge amount of data needs to be cached on the disk.
Whenever there is a change in the database, the data is retrieved from the database and cached on the disk. I will have a background process which keeps checking my cached data against the database, and updates it as and when required.
I would like to know what would be the best way to organize the cached data on my disk, so that writing and reading from the cache can be faster.
Another thread would be used to fetch some new data from the db and cache it on the disk. I also need to take care of synchronization between the two threads (one will be updating the existing cache data, and the other will be writing newly fetched data into the cache).
Please suggest a strategy for organizing the data on the cache and also synchronization between the threads.
SQL Server has something called XML tables. Those tables are based on physical XML files located on the disk. You can map/link XML data on the disk to a table in SQL Server. For users, it is seamless; in other words, they see those tables as regular tables.
Besides technical/philosophical discussion about caching huge data on the disk, this is just an idea...
Do you care about the consistency of the data on power failures?
Memory-mapped files along with occasional flushes will probably get you what you want.
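As a rough illustration of a memory-mapped file with an explicit flush, here is a minimal Python sketch (not .NET code). The function name write_mapped, the file name cache.bin and the 4096-byte size are assumptions for the example. How far a mapping flush reaches depends on the platform, so the sketch follows it with a file-level sync as well.

```python
import mmap
import os
import tempfile


def write_mapped(path, offset, payload):
    """Write payload into an existing file through a memory mapping, then flush."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:  # length 0 maps the whole file
            m[offset:offset + len(payload)] = payload
            m.flush()             # write dirty pages of the mapping back to the file
        os.fsync(f.fileno())      # then ask the OS to push the file to the device


# A mapping needs a file of nonzero size, so pre-size the cache file.
path = os.path.join(tempfile.mkdtemp(), "cache.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)

write_mapped(path, 0, b"hello")
```

Between flushes, reads and writes go through memory at page granularity, which is where the speed comes from; the flush frequency is the knob that trades speed against how much recent data could be lost on a power failure.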
Do you need to have an indexed access to data?
You probably need to design something like a B-tree or B+tree implementation, which gives efficient retrieval of the indexed data and better block-level locking.
http://code.google.com/p/high-concurrency-btree/
As an alternative answer, my own B+Tree implementation will neatly address this as a completely managed code (C#) implementation of an IDictionary<TKey, TValue>. It is a single-file key/value store that is thread-safe and optimized for concurrency. It was built from the ground up expressly for this purpose and for providing write-through caches.
Discussion - http://csharptest.net/projects/bplustree/
Online Help – http://help.csharptest.net/
Source Code – http://code.google.com/p/csharptest-net/
Downloads – http://code.google.com/p/csharptest-net/downloads
NuGet Package – http://nuget.org/List/Packages/CSharpTest.Net.BPlusTree
What is the best way to save a large amount of data for a .NET 4.0 application?
Right now I am using Lists and serializing to a file in the "User Data" folder, and it's working OK, but I want to know if there is a better/faster way of saving/loading a large amount of data.
The data that I am saving contains only lots of words, like documents.
The size of the data is almost 1 MB.
That really depends on the type of your application. I wouldn't use a SQL database of any sort just to load and save data that I do not need to query or transform. The time it will take to map your object graph to a relational model is just not worth it.
Also, I don't believe it will ever be faster than simple serialization, due to the overhead associated with databases (connection management and mapping).
My recent experience was with BinaryFormatter, which had excellent results (files ~ 15 MB). If worse comes to worst, you can always write your own formatter.
Kinda depends on your data and how you have it stored in your app.
But all these NoSQL storage systems are a possibility, or just plain binary data in a file.
When you say "large amout [sic] of data", what exactly do you mean by that? A megabyte? a terabyte?
And what exactly is the data?
If it's a set of account records, it might well belong in a database of some sort; if it's a set of images or word processing documents, perhaps not.
If you want fast access, one approach would be to serialize to a hashtable, and cache it. In between reads and writes...
The problems here are, of course, versioning, changing namespaces (then you won't be able to deserialize easily), deadlocks, concurrency, etc.
Better to save the file as XML/JSON, and when you read it into memory, load it into a hashtable for fast access.
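The "save as JSON, load into a hashtable" suggestion can be sketched like this. It is a minimal Python illustration rather than .NET code; the record shape (an id plus a list of words) and the function names are assumptions made up for the example.

```python
import json
import os
import tempfile


def save_documents(path, documents):
    """Persist the documents as plain JSON text."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(documents, f)


def load_index(path):
    """Read the JSON back and build an in-memory hash table keyed by id."""
    with open(path, encoding="utf-8") as f:
        documents = json.load(f)
    return {doc["id"]: doc for doc in documents}


# Hypothetical sample data.
documents = [
    {"id": "a", "words": ["alpha", "beta"]},
    {"id": "b", "words": ["gamma"]},
]

path = os.path.join(tempfile.mkdtemp(), "documents.json")
save_documents(path, documents)
index = load_index(path)
```

Because the file holds plain field names rather than type or namespace information, renaming a class in the application does not by itself break loading, which is the versioning concern raised above.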
So far all the serialization examples I have found on the web are related to storing arrays or lists in a file, with each class of object having to be serialized into its own file such as a ".bin". The root of my problem is that I want to have the information for my product stored locally, but I'm so used to working with SQL that it's hard for me to visualize how to store information locally. If C# is anything like ASP I should be able to connect to an Access database, but that pretty much defeats one of the ideas of serialization, which is user non-readability. Is there a serialization method similar to using tables and fields, or at least one allowing you to store all user information in one file?
You could use an ADO.NET DataSet that is serialized and stored locally. It will contain all of the data structures that you're familiar with and allow you to query the data the way you seem to want to, and if you serialize it with a binary serializer, it will be unreadable to end users.
Also, you could look at SQLite as an alternative to using DataSets.
"SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite is the most widely deployed SQL database engine in the world. The source code for SQLite is in the public domain."
NHibernate with SQLite is a great combination as well.
Cheers.
Check out NHibernate. That will give you your 'database-like' storage.
If it's human-readability you're after, consider serializing your objects using XML. .NET has decent support for serializing (and deserializing) objects using both XML and binary formats.
The tutorial I used for learning serialization in C# is this CodeProject article.
Update:
I misread one point you made: serialization does not necessarily mean human-readable or not - if you decide to serialize, figure out if you want the data readable or not. Binary serialization is likely to be more compact and less readable.
I am currently writing an IRC client and I've been trying to figure out a good way to store the server settings. Basically a big list of networks and their servers as most IRC clients have.
I had decided on using SQLite but then I wanted to make the list freely available online in XML format (and perhaps definitive), for other IRC apps to use. So now I may just store the settings locally in the same format.
I have very little experience with either ADO.NET or XML so I'm not sure how they would compare in a situation like this.
Is one easier to work with programmatically? Is one faster? Does it matter?
It's a vaguer question than you realize. "Settings" can encompass an awful lot of things.
There's a good .NET infrastructure for handling application settings in configuration files. These, generally, are exposed to your program as properties of a global Settings object; the classes in the System.Configuration namespace take care of reading and persisting them, and there are tools built into Visual Studio to auto-generate the code for dealing with them. One of the data types that this infrastructure supports is StringCollection, so you could use that to store a list of servers.
But for a large list of servers, this wouldn't be my first choice, for a couple of reasons. I'd expect that the elements in your list are actually tuples (e.g. host name, port, description), not simple strings, in which case you'll end up having to format and parse the data to get it into a StringCollection, and that is generally a sign that you should be doing something else. Also, application settings are read-only (under Vista, at least), and while you can give a setting user scope to make it persistable, that leads you down a path that you probably want to understand before committing to.
So, another thing I'd consider: Is your list of servers simply a list, or do you have an internal object model representing it? In the latter case, I might consider using XML serialization to store and retrieve the objects. (The only thing I'd keep in the application configuration file would be the path to the serialized object file.) I'd do this because serializing and deserializing simple objects into XML is really easy; you don't have to be concerned with designing and testing a proper serialization format because the tools do it for you.
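To make the "serialize the object model to XML" idea concrete, here is a minimal Python sketch (in .NET, XmlSerializer would generate the format for you). The Server class, the element and attribute names, and the sample host are all hypothetical, chosen only to mirror the host name, port, description tuple mentioned above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Server:
    host: str
    port: int
    description: str


def to_xml(networks):
    """Serialize a dict of network name -> list of Server into an XML string."""
    root = ET.Element("networks")
    for name, servers in networks.items():
        net = ET.SubElement(root, "network", name=name)
        for s in servers:
            ET.SubElement(
                net, "server",
                host=s.host, port=str(s.port), description=s.description,
            )
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """Rebuild the same dict of Server objects from the XML string."""
    root = ET.fromstring(text)
    return {
        net.get("name"): [
            Server(e.get("host"), int(e.get("port")), e.get("description"))
            for e in net.findall("server")
        ]
        for net in root.findall("network")
    }


networks = {
    "ExampleNet": [
        Server("irc.example.net", 6667, "Main server"),
        Server("irc2.example.net", 6697, "TLS server"),
    ]
}
xml_text = to_xml(networks)
restored = from_xml(xml_text)
```

The whole model is written and read in one go, which is exactly the trade-off discussed in this answer: simple to implement, but every save rewrites the entire file.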
The primary reason I look at using a database is if my program performs a bunch of operations whose results need to be atomic and durable, or if for some reason I don't want all of my data in memory at once. If every time X happens, I want a permanent record of it, that's leading me in the direction of using a database. You don't want to use XML serialization for something like that, generally, because you can't realistically serialize just one object if you're saving all of your objects to a single physical file. (Though it's certainly not crazy to simply serialize your whole object model to save one change. In fact, that's exactly what my company's product does, and it points to another circumstance in which I wouldn't use a database: if the data's schema is changing frequently.)
I would personally use XML for settings - .NET is already built to do this and as such has many built-in facilities for storing your settings in XML configuration files.
If you want to use a custom schema (be it XML or DB) for storing settings then I would say that either XML or SQLite will work just as well since you ought to be using a decent API around the data store.
Every tool has its own right
There is plenty of hype around XML, I know. But you should see that XML is basically an exchange format -- not a storage format (unless you use a native XML database that gives you more options -- but also might add some headaches).
When your configuration is rather small (say fewer than 10,000 records), you might use XML and be fine. You will load the whole thing into memory and access the entries there. Done.
But when your configuration is so big that you don't want to load it completely, then you should rethink your decision and stay with SQLite, which gives you the option to dynamically load those parts of the configuration you need.
You could also provide a little tool to create an XML file from the DB content -- creating XML from a DB is a rather simple task.
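Both halves of this suggestion, loading only the needed part from SQLite and exporting the whole database as XML, can be sketched together. This is a minimal Python illustration with an assumed single-table schema (servers with network, host, port) and made-up rows; the function names are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET


def servers_for(conn, network):
    """Load only the rows for one network instead of the whole configuration."""
    return conn.execute(
        "SELECT host, port FROM servers WHERE network = ? ORDER BY host",
        (network,),
    ).fetchall()


def export_xml(conn):
    """Dump the whole table as an XML document, grouped by network."""
    root = ET.Element("networks")
    current_name, current = None, None
    rows = conn.execute(
        "SELECT network, host, port FROM servers ORDER BY network, host"
    )
    for network, host, port in rows:
        if network != current_name:
            # Rows arrive sorted by network, so a change starts a new group.
            current = ET.SubElement(root, "network", name=network)
            current_name = network
        ET.SubElement(current, "server", host=host, port=str(port))
    return ET.tostring(root, encoding="unicode")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE servers (network TEXT, host TEXT, port INTEGER)")
with conn:
    conn.executemany(
        "INSERT INTO servers VALUES (?, ?, ?)",
        [
            ("AlphaNet", "irc.alpha.example", 6667),
            ("AlphaNet", "irc2.alpha.example", 6697),
            ("BetaNet", "irc.beta.example", 6667),
        ],
    )

xml_text = export_xml(conn)
```

With this split, the client keeps SQLite as its working store and the XML file becomes a generated artifact for other applications to consume.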
Looks like you have two separate applications here: a web server and a desktop client (because that is traditionally where these things run), each with its own storage needs.
On the server side: go with a relational data store, not XML. Basically, at some point you need to keep one user's data separate from other users' data on the server. XML is not a good store for that.
On the client: it doesn't really matter. XML will probably be easier for you to manipulate. And don't think that because you are using one technology in one setting, you have to use it in the other.