What is the best way to store big objects? In my case it's something like a tree or a linked list.
I tried the following:
1) Relational DB
Not a good fit for tree structures.
2) Document DB
I tried RavenDB, but it raised a System.OutOfMemoryException when I called SaveChanges.
3) .NET serialization
It works very slowly.
4) Protobuf
It can't deserialize List<List<>> types, and I'm not sure about linked structures.
So...?
You mention protobuf - I routinely use protobuf-net with objects that are many hundreds of megabytes in size, but: it does need to be suitably written as a DTO, and ideally as a tree (not a bidirectional graph, although that usage is supported in some scenarios).
In the case of a doubly-linked list, that might mean simply: marking the "previous" links as not serialized, then doing a fix-up in an after-deserialize callback to restore them. Pretty easy normally.
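For illustration, a minimal sketch of that idea (the names are mine, not from the question; this variant stores the nodes flat and rebuilds both links in the callback, which also sidesteps deep recursion on very long lists):

    using System.Collections.Generic;
    using ProtoBuf;

    [ProtoContract]
    public class Node
    {
        [ProtoMember(1)]
        public string Value { get; set; }

        // Deliberately NOT [ProtoMember]s: the links are rebuilt on load,
        // which breaks the cycle for the serializer.
        public Node Next { get; set; }
        public Node Previous { get; set; }
    }

    [ProtoContract]
    public class LinkedListDto
    {
        [ProtoMember(1)]
        public List<Node> Nodes { get; set; }

        // protobuf-net invokes this once the object has been deserialized;
        // a single forward pass restores both directions of the links.
        [ProtoAfterDeserialization]
        public void FixUp()
        {
            if (Nodes == null) return;
            for (int i = 1; i < Nodes.Count; i++)
            {
                Nodes[i - 1].Next = Nodes[i];
                Nodes[i].Previous = Nodes[i - 1];
            }
        }
    }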
You are correct in that it doesn't currently support nested lists. This is usually trivial to side-step by using a list of something that has a list, but I'm tempted to make this implicit - i.e. the library should be able to simulate this without you needing to change your model. If you are interested in me doing this, let me know.
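As a sketch of that side-step (names invented): instead of a List<List<int>>, serialize a list of wrapper objects:

    using System.Collections.Generic;
    using ProtoBuf;

    // Wraps the inner list so the outer type is List<Row>, not List<List<int>>.
    [ProtoContract]
    public class Row
    {
        [ProtoMember(1)]
        public List<int> Items { get; set; }
    }

    [ProtoContract]
    public class Grid
    {
        [ProtoMember(1)]
        public List<Row> Rows { get; set; }
    }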
If you have a concrete example of a model you'd like to serialize, and want me to offer guidance, let me know - if you can't post it here, then my email is in my profile. Entirely up to you.
Have you tried Json.NET, storing the result in a file?
Option 2: NoSQL (Document) database
I suggest Cassandra.
From the Cassandra wiki:

Cassandra's public API is based on Thrift, which offers no streaming abilities; any value written or fetched has to fit in memory. This is inherent to Thrift's design and is therefore unlikely to change. So adding large object support to Cassandra would need a special API that manually split the large objects up into pieces. A potential approach is described in http://issues.apache.org/jira/browse/CASSANDRA-265.

As a workaround in the meantime, you can manually split files into chunks of whatever size you are comfortable with -- at least one person is using 64MB -- and make a file correspond to a row, with the chunks as column values.
So if your files are < 10MB you should be fine, just make sure to limit the file size, or break large files up into chunks.
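If you do need bigger values, the chunking workaround is easy to sketch; in the hedged example below, the StoreChunk helper is a hypothetical stand-in for the actual Cassandra column write:

    using System.IO;

    class ChunkedWriter
    {
        const int ChunkSize = 64 * 1024 * 1024; // 64 MB, as the wiki suggests

        static void Main()
        {
            var buffer = new byte[ChunkSize];
            using (var input = File.OpenRead("bigobject.bin"))
            {
                int index = 0, read;
                while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
                {
                    // One row per file, one column per chunk.
                    StoreChunk("bigobject.bin", index++, buffer, read);
                }
            }
        }

        // Hypothetical stand-in: writes each chunk to its own file here,
        // but would be a Cassandra column insert in practice.
        static void StoreChunk(string key, int index, byte[] data, int count)
        {
            using (var output = File.Create(key + "." + index))
            {
                output.Write(data, 0, count);
            }
        }
    }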
CouchDB does a very good job with challenges like this one.
storing a tree in CouchDB
storing a tree in relational databases
A C# beginner question:
Say I have a big table with data (a detailed "log" table of transactions, for example, held in a CSV file), and I need to perform database-like operations, such as queries (WHERE X > Y...), deletions, insertions and updates.
This data comes from a file which is read, and from that moment this is "my database" to query and make changes to.
Now, the file is not too big (approximately hundreds of KB) and not too complicated; namely, I don't need any inner joins or similar relations.
My very naive approach, since I am a beginner in C#, is to retrieve all the data into an object (some Table/DataSet or similar object) and use LINQ for my relatively simple queries. An external database is of no use on this project of mine.
I would appreciate any recommendations on the best way of doing this. What would be the best object to use - the DataSet class, or are there other classes that are easier to use or perform better for reading files and holding my data? How should the data be kept in the table - as raw strings, or as a typed object for each row? Is there a common way this is usually done? Is the idea okay, or should I use external classes?
I could not find any good examples that match my situation exactly, probably because I didn't know how to search for it accurately enough, so any suggestions, especially ones accompanied with examples, are welcome!
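To make it concrete, here is roughly the naive approach I have in mind (the column layout below is invented just for illustration):

    using System;
    using System.IO;
    using System.Linq;

    // Hypothetical record type for one row of the log CSV.
    class LogEntry
    {
        public DateTime Date;
        public string Account;
        public decimal Amount;
    }

    class Program
    {
        static void Main()
        {
            // Read the whole file into typed objects; fine for a few hundred KB.
            var entries = File.ReadLines("log.csv")
                .Skip(1) // skip the header row
                .Select(line => line.Split(','))
                .Select(f => new LogEntry
                {
                    Date = DateTime.Parse(f[0]),
                    Account = f[1],
                    Amount = decimal.Parse(f[2])
                })
                .ToList();

            // "WHERE X > Y" becomes a LINQ Where clause.
            var large = entries.Where(e => e.Amount > 100m).ToList();

            // Deletions, insertions and updates are plain list operations.
            entries.RemoveAll(e => e.Account == "closed");
            Console.WriteLine(large.Count);
        }
    }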
Thanks in advance!
Please excuse my English, I'm still trying to master it.
I've started to learn MongoDB (coming from a C# background) and I like the idea behind MongoDB. I have some issues with the examples on the internet.
Take the popular blog post / comments example. A Post has none or many Comments associated with it. I create a Post object and add a few Comment objects to the IList in Post. That's fine.
Do I add that to just a "Posts" collection in MongoDB, or should I have two collections - one blog.posts and one blog.posts.comments?
I have a fairly complicated object model; the easiest way to think of it is as a banking system - ours is mining. I tried to highlight tables with square brackets.
[Users] have one or many [Accounts], which have one or many [Transactions], each of which has one and only one [Type]. [Transactions] can have one or more [Tags] assigned to the transaction. [Users] create their own [Tags], unique to that user account, and we sometimes need to offer reporting by those tags (e.g. for May, tag drilling-expense was $123456.78).
For indexing, I would have thought separating them would be good, but I'm worried that this thinking is bad practice left over from the old RDBMS days.
In a way, it's like the blog example. I'm not sure if I should have one [Account] collection and persist all information there, or have an intermediate step that splits it up into separate collections.
The other related query is: when you persist back and forth, do you usually get back everything associated with that record - even if it's not required - or do you limit what comes back?
It depends.
It depends on how many of each of these types of objects you expect to have. Can you fit them all into a single MongoDB document for a given User? Probably not.
It depends on the relationships - is User-Account one-to-many or many-to-many? If it's one-to-many and the number of Accounts is small, you might choose to put them in an IList on the User document.
You can still model relationships in MongoDB with separate collections, BUT there are no joins in the database, so you have to do that in code. Loading a User and then loading their Accounts might be just fine from a performance perspective.
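As a sketch of the two options, as plain classes (the names are illustrative, not from your model):

    using System.Collections.Generic;

    // Option A: embed Accounts inside the User document.
    public class User
    {
        public string Id { get; set; }
        public string Name { get; set; }
        public List<Account> Accounts { get; set; } // embedded array
    }

    // Option B: separate collections, "joined" in application code.
    public class Account
    {
        public string Id { get; set; }
        public string UserId { get; set; } // back-reference to the User document
        public List<string> Tags { get; set; }
    }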
You can index INTO arrays on documents. Don't think of an Index as just being an index on a simple field on a document (like SQL). You can use, say, a Tag collection on a document and index into the tags. (See http://www.mongodb.org/display/DOCS/Indexes#Indexes-Arrays)
When you retrieve or write data you can do a partial read and a partial write of any document. (see http://www.mongodb.org/display/DOCS/Retrieving+a+Subset+of+Fields)
And, finally, when you can't see how to get what you want using collections and indexes, you might be able to achieve it using map reduce. For example, to find all the tags currently in use, sorted by their frequency of use, you would map each document, emitting the tags used in it, and then you would reduce that set to get the result you want. You might then store the result of that map reduce permanently and only update it when you need to.
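As a sketch, the map and reduce functions for that tag-frequency example could look like this, held as strings on the C# side and passed to your driver's map-reduce call (the Tags field name is an assumption, and the exact driver API varies by version):

    class TagFrequency
    {
        // JavaScript map/reduce bodies; emits each tag with a count of 1,
        // then the reduce step sums the counts per tag.
        public const string Map = @"
            function() {
                this.Tags.forEach(function(tag) { emit(tag, 1); });
            }";

        public const string Reduce = @"
            function(key, values) {
                return Array.sum(values);
            }";
    }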
One further concern: You mention calculating totals by tag. If you want accounting-quality transactional consistency, MongoDB might not be the right choice for you. "Eventual-consistency" is the name of the game for NoSQL data stores and they generally aren't a good fit for financial transactions. For example, it doesn't matter if one user sees a blog post with 3 comments while another sees 4 because they hit different replica copies that aren't in sync yet, but for a financial report, that kind of consistency does matter - your report might not add up!
I'm supposed to do the following:
1) read a huge (700MB, ~10 million elements) XML file;
2) parse it, preserving order;
3) create one or more text files with SQL insert statements to bulk load it into the DB;
4) take the relational tuples and write them back out as XML.
I'm here to exchange some ideas about the best (== fast fast fast...) way to do this. I will use C# 4.0 and SQL Server 2008.
I believe XmlTextReader is a good start, but I don't know if it can handle such a huge file. Does it load the whole file when instantiated, or does it hold just the current node in memory? I suppose I can do a while (reader.Read()) and that should be fine.
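For example, I imagine the loop looking roughly like this (a minimal sketch; XmlReader pulls one node at a time, so the whole 700MB file is never in memory at once):

    using System;
    using System.Xml;

    class StreamingParse
    {
        static void Main()
        {
            using (XmlReader reader = XmlReader.Create("huge.xml"))
            {
                while (reader.Read())
                {
                    switch (reader.NodeType)
                    {
                        case XmlNodeType.Element:
                            Console.WriteLine("element: " + reader.Name);
                            // Attributes are enumerated off the element node.
                            while (reader.MoveToNextAttribute())
                                Console.WriteLine("  attribute: " + reader.Name + "=" + reader.Value);
                            break;
                        case XmlNodeType.Text:
                            Console.WriteLine("text: " + reader.Value);
                            break;
                    }
                }
            }
        }
    }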
What is the best way to write the text files? Since I must preserve the ordering of the XML (adopting some numbering scheme), I will have to hold some parts of the tree in memory to do the calculations etc. Should I build them up with a StringBuilder?
I will have two scenarios: one where every node (element, attribute or text) will go in the same table (i.e., will be the same object), and another where I have one table in the DB and one class for each type of node (just these three types, no comments etc.).
My last specific question is: how good is DataSet's ds.WriteXml? Will it handle 10M tuples? Maybe it's best to bring chunks from the database and use an XmlWriter - I really don't know.
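To illustrate the chunked alternative I mean (table and column names invented): stream rows out with SqlDataReader and write XML incrementally, so only one row is held in memory at a time.

    using System.Data.SqlClient;
    using System.Xml;

    class ExportXml
    {
        static void Main()
        {
            const string connectionString = "Data Source=.;Initial Catalog=MyDb;Integrated Security=true";
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand("SELECT Id, Name, Value FROM Nodes ORDER BY Id", conn))
            using (var writer = XmlWriter.Create("output.xml"))
            {
                conn.Open();
                writer.WriteStartElement("nodes");
                using (var reader = cmd.ExecuteReader())
                {
                    // One row at a time: read, write, move on.
                    while (reader.Read())
                    {
                        writer.WriteStartElement("node");
                        writer.WriteAttributeString("id", reader.GetInt32(0).ToString());
                        writer.WriteElementString("name", reader.GetString(1));
                        writer.WriteElementString("value", reader.GetString(2));
                        writer.WriteEndElement();
                    }
                }
                writer.WriteEndElement();
            }
        }
    }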
I'm testing all this stuff, but I decided to post this question to hear from you guys, hoping your expertise can help me do these things more correctly and faster.
Thanks in advance,
Pedro Dusso
I'd use the SQLXML Bulk Load Component for this. You provide a specially annotated XSD schema for your XML with embedded mappings to your relational model. It can then bulk load the XML data blazingly fast.
If your XML has no schema, you can create one from Visual Studio by loading the file and selecting Create Schema from the XML menu. You will need to add the mappings to your relational model yourself, however. This blog has some posts on how to do that.
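Driving the component from C# is only a few lines; a hedged sketch using the SQLXMLBULKLOADLib COM interop (the file names and connection string are placeholders):

    // Requires a COM reference to the "Microsoft SQLXML Bulkload 4.0 Type Library".
    using SQLXMLBULKLOADLib;

    class BulkLoad
    {
        static void Main()
        {
            var loader = new SQLXMLBulkLoad4Class();
            loader.ConnectionString =
                "Provider=SQLOLEDB;Data Source=.;Initial Catalog=MyDb;Integrated Security=SSPI";
            loader.ErrorLogFile = "errors.xml";

            // Annotated XSD (with the relational mappings) plus the data file.
            loader.Execute("schema.xsd", "data.xml");
        }
    }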
Guess what? You don't have a SQL Server problem. You have an XML problem!
Faced with your situation, I wouldn't hesitate. I'd use Perl and one of its many XML modules to parse the data, create simple tab- or other-delimited files to bulk load, and bcp the resulting files.
Using the server to parse your XML has many disadvantages:
Not fast, more than likely
Positively useless error messages, in my experience
No debugger
Nowhere to turn when one of the above turns out to be true
If you use Perl on the other hand, you have line-by-line processing and debugging, error messages intended to guide a programmer, and many alternatives should your first choice of package turn out not to do the job.
If you do this kind of work often and don't know Perl, learn it. It will repay you many times over.
I am currently writing an IRC client and I've been trying to figure out a good way to store the server settings - basically a big list of networks and their servers, as most IRC clients have.
I had decided on using SQLite, but then I wanted to make the list freely available online in XML format (and perhaps definitive), for other IRC apps to use. So now I may just store the settings locally in the same format.
I have very little experience with either ADO.NET or XML so I'm not sure how they would compare in a situation like this.
Is one easier to work with programmatically? Is one faster? Does it matter?
It's a vaguer question than you realize. "Settings" can encompass an awful lot of things.
There's a good .NET infrastructure for handling application settings in configuration files. These, generally, are exposed to your program as properties of a global Settings object; the classes in the System.Configuration namespace take care of reading and persisting them, and there are tools built into Visual Studio to auto-generate the code for dealing with them. One of the data types that this infrastructure supports is StringCollection, so you could use that to store a list of servers.
But for a large list of servers, this wouldn't be my first choice, for a couple of reasons. I'd expect that the elements in your list are actually tuples (e.g. host name, port, description), not simple strings, in which case you'll end up having to format and parse the data to get it into a StringCollection, and that is generally a sign that you should be doing something else. Also, application settings are read-only (under Vista, at least), and while you can give a setting user scope to make it persistable, that leads you down a path that you probably want to understand before committing to.
So, another thing I'd consider: Is your list of servers simply a list, or do you have an internal object model representing it? In the latter case, I might consider using XML serialization to store and retrieve the objects. (The only thing I'd keep in the application configuration file would be the path to the serialized object file.) I'd do this because serializing and deserializing simple objects into XML is really easy; you don't have to be concerned with designing and testing a proper serialization format because the tools do it for you.
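A minimal sketch of that serialization approach, with an invented object model:

    using System.Collections.Generic;
    using System.IO;
    using System.Xml.Serialization;

    public class Server
    {
        public string HostName { get; set; }
        public int Port { get; set; }
        public string Description { get; set; }
    }

    public class ServerList
    {
        public List<Server> Servers { get; set; }
    }

    class SettingsStore
    {
        static readonly XmlSerializer Serializer = new XmlSerializer(typeof(ServerList));

        // The tools design the XML format for you; these two methods are
        // the entire persistence layer.
        public static void Save(ServerList list, string path)
        {
            using (var stream = File.Create(path))
            {
                Serializer.Serialize(stream, list);
            }
        }

        public static ServerList Load(string path)
        {
            using (var stream = File.OpenRead(path))
            {
                return (ServerList)Serializer.Deserialize(stream);
            }
        }
    }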
The primary reason I look at using a database is if my program performs a bunch of operations whose results need to be atomic and durable, or if for some reason I don't want all of my data in memory at once. If every time X happens, I want a permanent record of it, that's leading me in the direction of using a database. You don't want to use XML serialization for something like that, generally, because you can't realistically serialize just one object if you're saving all of your objects to a single physical file. (Though it's certainly not crazy to simply serialize your whole object model to save one change. In fact, that's exactly what my company's product does, and it points to another circumstance in which I wouldn't use a database: if the data's schema is changing frequently.)
I would personally use XML for settings - .NET is already built to do this and as such has many built-in facilities for storing your settings in XML configuration files.
If you want to use a custom schema (be it XML or DB) for storing settings then I would say that either XML or SQLite will work just as well since you ought to be using a decent API around the data store.
Every tool has its place.
There is plenty of hype around XML, I know. But you should see that XML is basically an exchange format, not a storage format (unless you use a native XML database, which gives you more options but may also add some headaches).
When your configuration is rather small (say, fewer than 10,000 records), you can use XML and be fine. You will load the whole thing into memory and access the entries there. Done.
But when your configuration is so big that you don't want to load it completely, then you should rethink that decision and stay with SQLite, which gives you the option of dynamically loading just the parts of the configuration you need.
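For example, a hedged sketch of that partial load using System.Data.SQLite (the table and key names are invented):

    using System.Data.SQLite;

    class PartialLoad
    {
        static void Main()
        {
            using (var conn = new SQLiteConnection("Data Source=settings.db"))
            {
                conn.Open();
                using (var cmd = new SQLiteCommand(
                    "SELECT Value FROM Settings WHERE Key = @key", conn))
                {
                    cmd.Parameters.AddWithValue("@key", "networks.freenode.port");
                    // Only this one entry is read; the rest stays on disk.
                    var value = cmd.ExecuteScalar() as string;
                }
            }
        }
    }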
You could also provide a little tool to create an XML file from the DB content -- creating XML from a DB is a rather simple task.
Looks like you have two separate applications here: a web server and a desktop client (because that is traditionally where these things run), each with its own storage needs.
On the server side: go with a relational data store, not XML. Basically, at some point you need to keep one user's data separate from other users' data on the server. XML is not a good store for that.
On the client: it doesn't really matter. XML will probably be easier for you to manipulate. And don't think that because you are using one technology in one setting, you have to use it in the other.
In continuation of: Storing DataRelation in xml?
Thanks to everybody for the answers to my earlier thread. However, could I ask why nobody supports this XML-based approach? What exactly would the problems be? I can apply constraints to a DataSet, and I can, I guess, also use transactions.
I am new to this, so if you could point me to some link where I can find this sort of comparison, that would be really helpful.
According to the FAQ, discussions are not very encouraged, but I guess this is quite specific. I hope not to be fired for this... :)
Thanks for reading,
Saurabh.
Database management systems are specifically designed to store data and retrieve it quickly, to preserve the integrity of the data and to leverage concurrent access to the data.
XML, on the other hand, was originally designed for documents, separating content from presentation. It became a handy way to store simple data because the file structure is so well defined, and then it got out of hand with people trying to store entire databases in an unsuited structure.
XML doesn't guarantee atomicity, concurrency, integrity, fast access or anything like that. Not inherently, anyway. .NET's DataSet libraries do help in that regard, but just because you can serialize DataSet objects to XML doesn't make it a good place to store data for multiple users.
When you're faced with two tools, one which was designed to do exactly what you need to do (in this case a DBMS) and one that was designed to do something else but has been kludged to do what you want, sorta (in this case XML), you should probably go with the first option.
Concurrency will be the main issue, where multiple users want to access the same "database" file. Performance is the other, because the whole file has to be loaded into memory; if the file size grows, it'll become unmanageable. Query performance also suffers, because it won't be as efficient as letting something as tuned and honed as an RDBMS do the work for you.
Though I would need to research to back this up, I suspect that there are some performance implications to not using an actual database.
If there is a chance that another application built on a different platform (not ADO.NET) might need to access your data in the future, having to work with a giant XML file will very likely make life more difficult. A relational DB is the standard approach to this sort of problem.