A very beginner-level C# question:
Say I have a big table with data (a detailed "log" table of transactions, for example, held in a CSV file), and I need to perform database-like operations, such as queries (WHERE X > Y...), deletions, insertions and updates.
This data comes from a file which is read, and from that moment this is "my database" to query and make changes to.
Now, the file is not too big (hundreds of KB, approximately) and not too complicated: I don't need any inner joins or similar relations.
My very naive approach, since I am a beginner in C#, is to read all the data into an object (a DataTable / DataSet or similar) and use LINQ for my relatively simple queries. There is no place for any kind of external database in this project of mine.
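Something like this is what I have in mind, just so you can see the level I'm at; the column layout and names here are invented:

    using System;
    using System.IO;
    using System.Linq;

    // Invented shape for one row of my transactions "log".
    class LogEntry
    {
        public DateTime Timestamp { get; set; }
        public string Account { get; set; }
        public decimal Amount { get; set; }
    }

    class Program
    {
        static void Main()
        {
            // Read the whole file into a List<LogEntry>; fine for a few hundred KB.
            var entries = File.ReadLines("transactions.csv")
                .Skip(1)                          // skip the header row
                .Select(line => line.Split(','))
                .Select(f => new LogEntry
                {
                    Timestamp = DateTime.Parse(f[0]),
                    Account = f[1],
                    Amount = decimal.Parse(f[2])
                })
                .ToList();

            // "WHERE X > Y"-style query via LINQ:
            var large = entries.Where(e => e.Amount > 1000m).ToList();

            // Deletions and updates are just list operations:
            entries.RemoveAll(e => e.Account == "closed");
        }
    }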
I would appreciate any recommendations on the best way of doing this: what would be the best object to use (the DataSet class? are there other, easier, or better-performing classes for reading files and holding my data?), and how should the data be kept in the table: as raw strings, or as a typed object for each row? Is there a common way this is usually done? Is the idea okay, or should I use external libraries?
I could not find any good examples that match my situation exactly, probably because I didn't know how to search for it accurately enough, so any suggestions, especially ones accompanied by examples, are welcome!
Thanks in advance!
Sorry if this has been asked elsewhere, but I couldn't find a clear answer anywhere.
I have decided to begin learning to use relational databases a bit more, namely SQL. This is a major beginner's question, but it's probably essential to get started.
I'm basically a little confused about the best practice for how to utilize SQL (or similar). At college I accessed databases (using JSON strings) for things such as mobile apps, but I have never actually designed and built a database myself, as my tutor made the database we accessed himself.
Let's say I have a C# application that holds genealogy information (i.e. families and their members) and I wanted to store each individual in a database. Would I simply use the structure I already have, but save to fields in a database instead of an XML or text document? Or does it work the other way around, i.e. do I create a database with the required fields and then just retrieve from the database in a C# application and manipulate the data as I wish, so the application would be entirely different (the C# application basically doesn't hold/store any data itself and just works on what's fed to it from the database)?
What's troubling me is this: where I would usually store my C# objects in a dictionary or list, for example, would I instead just retrieve straight from the database? Or would I retrieve from the database, store the data in a normal structure, and work from there (surely this would defeat the point of fast searching in a database)?
I may be over-thinking it slightly. Hope that makes sense. Thanks in advance.
Would I simply use the structure I already...
or
do I create a database with the required fields...
I think that is the crux of your question.
Starting from the database
For me, when building an application that uses a backend database, an Entity-Relationship diagram is pretty crucial. I found quite a nice little tutorial for you here: http://www.sum-it.nl/cursus/dbdesign/english/index.php3 but you can easily find one that suits your learning style. The key point is that you are trying to model the problem domain (the real world out there that needs your application) in a way that your application can somehow capture. Once you have an E-R diagram of related tables, it is easier to figure out the details. Using SQL Server Management Studio for SQL Server 2008 (Express edition) you can create a few basic tables, build the E-R diagram right there, and have it generate relationships for you. You can then, at your leisure, examine the SQL used to achieve that and refine accordingly.
Personally, I always start by examining the problem domain, then I build the E-R diagram, then I build the database. I start building the C# application when I'm reasonably confident the database reflects the problem domain.
Starting from your C# application
However, what really matters is that you model the real world in a meaningful and effective way. In your case you already have a starting point in structures you've created in C# and you can use them to give you a starting point to build the E-R diagram. If you find it easier to get a C# application going and then build a database that reflects it, that should be fine. Perhaps you already have an approach that helps you capture the problem domain effectively. It's an iterative process whatever you do: building the C# code might reveal problems with the underlying database design and vice versa.
Diagramming - E-R or UML?
I'm personally convinced that this whole business is so complicated that you really need some diagrams.
to visualise your database, use an E-R diagram
to visualise your C# application, use a UML class diagram
As you head towards a working application, you'll see how these two diagrams begin to match, or at least reflect each other pretty closely. In both cases (entities or classes), understanding the relationships between objects will be really important when you query the database, because it is crucial to understand relationships between tables (especially using one-to-many relationships to resolve a complex many-to-many relationship) and the various techniques for joining tables in queries (INNER or OUTER joins, etc.). No matter how clever your C# application is, you will at some point need to understand at least some of the complexities of the SQL language, and it is easier if you can refer to an E-R diagram.
Where to store?
What's troubling me is this: where I would usually store my C# objects in a dictionary or list, for example, would I instead just retrieve straight from the database?
In the database, without a doubt. A C# class called Family would have a property FamilyName, say, with a setter method built in. If you discover a spelling mistake and want to change the name, the setter method would open a connection to the database, run an UPDATE query with the specified family name, (and probably the family id) as a parameter, and update the underlying field accordingly. Retrieving data would involve running a SELECT query etc.
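As a rough sketch only (the table and column names are assumed, and in a real application you would probably batch updates rather than hit the database from every setter):

    using System.Data.SqlClient;

    class Family
    {
        private readonly int _id;
        private string _familyName;

        public Family(int id, string familyName)
        {
            _id = id;
            _familyName = familyName;
        }

        public string FamilyName
        {
            get { return _familyName; }
            set
            {
                // Push the correction straight through to the database.
                using (var conn = new SqlConnection("<your connection string>"))
                using (var cmd = new SqlCommand(
                    "UPDATE Family SET FamilyName = @name WHERE FamilyId = @id", conn))
                {
                    cmd.Parameters.AddWithValue("@name", value);
                    cmd.Parameters.AddWithValue("@id", _id);
                    conn.Open();
                    cmd.ExecuteNonQuery();
                }
                _familyName = value;
            }
        }
    }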
Conclusion
Do some tutorials on how to examine a problem domain, create an entity-relationship diagram and build a set of related tables based on the diagram. I'm convinced that way you'll find it much easier to keep track of the C# classes that you build to communicate with the backend database.
Here's an example of a simple E-R diagram for families and their members:
To begin with, you might think members and families could go in one table, but then you discover that this creates a lot of duplication, so you separate them out into Family and Member tables with a one-to-many relationship. Then you realise that, through marriage for instance, people can belong to more than one family, and you need to create a many-to-many relationship. I think the E-R diagram is the best place to work out that kind of complexity.
Not knowing what your structures look like or how your DB will be designed this is hard to answer. But you should be able to use existing data structures, and just pipe the data from the database instead of the XML file.
Look into LINQ to XML; C# also has a strong library for interacting with SQL. It may be a bit confusing at first, but it is very powerful once you learn it.
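For example, a quick LINQ to XML sketch (the file, element, and attribute names are invented):

    using System.Linq;
    using System.Xml.Linq;

    class Demo
    {
        static void Main()
        {
            // Query an existing XML document of families.
            var doc = XDocument.Load("families.xml");
            var names = doc.Descendants("member")
                           .Select(m => (string)m.Attribute("name"))
                           .ToList();
        }
    }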
If I understand correctly, you are also asking whether you should retrieve all the records from the database and store them as objects in a collection, or retrieve selected records from the database and use the dataset results without placing them in a purpose-defined structure.
I tend to select the records I want from the database and then load the results into my purpose-defined classes / structures. This allows you to add your manipulation methods to the class holding a record result, etc., without needing to pass dataset results to each method. However, you will find yourself doing singular updates all the time when a batch update might be more efficient, if that makes sense.
Take a look at Entity Framework Code First. If your data structures are classes in your application, there are techniques to create your database schema from them. As for the data: store it in your database and populate your lists and dictionaries with it, or populate a list of your genealogy-individual class with it.
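A minimal Code First sketch, assuming the EntityFramework package; the class and property names here are invented:

    using System.Collections.Generic;
    using System.Data.Entity;

    // Your existing classes become the schema.
    public class Family
    {
        public int Id { get; set; }
        public string Name { get; set; }
        public virtual List<Person> Members { get; set; }
    }

    public class Person
    {
        public int Id { get; set; }
        public string FullName { get; set; }
    }

    // EF Code First creates the database from this context on first use.
    public class GenealogyContext : DbContext
    {
        public DbSet<Family> Families { get; set; }
        public DbSet<Person> People { get; set; }
    }

From there, something like new GenealogyContext().Families.ToList() populates your in-memory collections from the database.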
If you want to write your own data classes, there's a free tutorial here written by myself. What I would definitely not do is use the data sources in ASP.NET, as these wizards are the Barty Crouches of the ASP.NET world: they appear good, but turn out to be evil, as inevitably you'll want to be able to tweak them and you won't understand how to do this.
What is the best way to store big objects? In my case it's something like a tree or a linked list.
I tried the following:
1) Relational DB: not good for tree structures.
2) Document DB: I tried RavenDB, but it raised a System.OutOfMemoryException when I called the SaveChanges method.
3) .NET serialization: it works very slowly.
4) Protobuf: it cannot deserialize List<List<>> types, and I'm not sure about linked structures.
So...?
You mention protobuf - I routinely use protobuf-net with objects that are many hundreds of megabytes in size, but: it does need to be suitably written as a DTO, and ideally as a tree (not a bidirectional graph, although that usage is supported in some scenarios).
In the case of a doubly-linked list, that might mean simply: marking the "previous" links as not serialized, then doing a fix-up in an after-deserialize callback, to correctly set the "previous" links. Pretty easy normally.
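A sketch of what I mean, using protobuf-net's attribute-based callbacks (the type and member names are just for illustration):

    using ProtoBuf;

    [ProtoContract]
    class Node
    {
        [ProtoMember(1)]
        public string Value { get; set; }

        [ProtoMember(2)]
        public Node Next { get; set; }     // serialized: the forward chain

        public Node Previous { get; set; } // not marked, so not serialized
    }

    [ProtoContract]
    class LinkedData
    {
        [ProtoMember(1)]
        public Node Head { get; set; }

        // After deserialization, walk the forward chain and restore the back-links.
        [ProtoAfterDeserialization]
        private void FixUpPreviousLinks()
        {
            for (var node = Head; node != null && node.Next != null; node = node.Next)
                node.Next.Previous = node;
        }
    }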
You are correct in that it doesn't currently support nested lists. This is usually trivial to side-step by using a list of something that has a list, but I'm tempted to make this implicit, i.e. the library should be able to simulate this without you needing to change your model. If you are interested in me doing this, let me know.
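In the meantime, the side-step looks something like this (same assumptions as the sketch above):

    using System.Collections.Generic;
    using ProtoBuf;

    [ProtoContract]
    class Row<T>
    {
        [ProtoMember(1)]
        public List<T> Items { get; set; }
    }

    // Instead of serializing a List<List<int>> directly,
    // serialize a List<Row<int>> and unwrap it after deserializing.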
If you have a concrete example of a model you'd like to serialize, and want me to offer guidance, let me know - if you can't post it here, then my email is in my profile. Entirely up to you.
Did you try Json.NET and storing the result in a file?
Option 2: NoSQL (document) database
I suggest Cassandra.
From the Cassandra wiki:
Cassandra's public API is based on Thrift, which offers no streaming abilities; any value written or fetched has to fit in memory. This is inherent to Thrift's design and is therefore unlikely to change. So adding large object support to Cassandra would need a special API that manually split the large objects up into pieces. A potential approach is described in http://issues.apache.org/jira/browse/CASSANDRA-265. As a workaround in the meantime, you can manually split files into chunks of whatever size you are comfortable with (at least one person is using 64MB), making a file correspond to a row, with the chunks as column values.
So if your files are < 10MB you should be fine, just make sure to limit the file size, or break large files up into chunks.
CouchDb does a very good job with challenges like that one.
storing a tree in CouchDb
storing a tree in relational databases
I'm supposed to do the following:
1) read a huge (700MB, ~10 million elements) XML file;
2) parse it, preserving order;
3) create one or more text files with SQL insert statements to bulk load it into the DB;
4) take the relational tuples and write them back out as XML.
I'm here to exchange some ideas about the best (== fast fast fast...) way to do this. I will use C# 4.0 and SQL Server 2008.
I believe that XmlTextReader is a good start, but I do not know if it can handle such a huge file. Does it load the whole file when it is instantiated, or does it hold just the current node in memory? I suppose I can do a while(reader.Read()) loop and that should be fine.
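For reference, the kind of streaming loop I have in mind (the file name is just a placeholder):

    using System.Xml;

    class Parse
    {
        static void Main()
        {
            // XmlReader streams node by node; it never holds the whole
            // document in memory, so a 700MB file should be fine.
            using (var reader = XmlReader.Create("huge.xml"))
            {
                while (reader.Read())
                {
                    switch (reader.NodeType)
                    {
                        case XmlNodeType.Element:
                            // reader.Name, reader.GetAttribute(...), reader.Depth
                            break;
                        case XmlNodeType.Text:
                            // reader.Value
                            break;
                    }
                }
            }
        }
    }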
What is the best way to write the text files? As I have to preserve the ordering of the XML (adopting some numbering scheme), I will have to hold some parts of the tree in memory to do the calculations, etc. Should I build the output up with a StringBuilder?
I will have two scenarios: one where every node (element, attribute or text) will be in the same table (i.e., will be the same object), and another where for each type of node (just these three types, no comments etc.) I will have a table in the DB and a class to represent the entity.
My last specific question is: how good is DataSet's ds.WriteXml? Will it handle 10M tuples? Maybe it's best to bring chunks from the database and use an XmlWriter; I really don't know.
I'm testing all this stuff, but I decided to post this question to hear from you guys, hoping your expertise can help me do these things more correctly and faster.
Thanks in advance,
Pedro Dusso
I'd use the SQLXML Bulk Load Component for this. You provide a specially annotated XSD schema for your XML with embedded mappings to your relational model. It can then bulk load the XML data blazingly fast.
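A minimal sketch of driving it from C#, assuming a COM reference to Microsoft SQLXML BulkLoad 4.0 (SQLXMLBULKLOADLib); the file names and connection string are placeholders:

    class BulkLoader
    {
        static void Main()
        {
            var loader = new SQLXMLBULKLOADLib.SQLXMLBulkLoad4Class();
            loader.ConnectionString =
                "Provider=SQLOLEDB;Data Source=(local);Initial Catalog=MyDb;Integrated Security=SSPI;";
            loader.ErrorLogFile = "bulkload-errors.xml";

            // The annotated XSD carries the mappings from XML nodes to tables/columns.
            loader.Execute("mapping.xsd", "huge.xml");
        }
    }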
If your XML has no schema, you can create one from Visual Studio by loading the file and selecting Create Schema from the XML menu. However, you will need to add the mappings to your relational model yourself. This blog has some posts on how to do that.
Guess what? You don't have a SQL Server problem. You have an XML problem!
Faced with your situation, I wouldn't hesitate. I'd use Perl and one of its many XML modules to parse the data, create simple tab- or other-delimited files to bulk load, and bcp the resulting files.
Using the server to parse your XML has many disadvantages:
Not fast, more than likely
Positively useless error messages, in my experience
No debugger
Nowhere to turn when one of the above turns out to be true
If you use Perl on the other hand, you have line-by-line processing and debugging, error messages intended to guide a programmer, and many alternatives should your first choice of package turn out not to do the job.
If you do this kind of work often and don't know Perl, learn it. It will repay you many times over.
I'm having a bit of a problem deciding how to store some data. Seen from a simple perspective, each one will be a simple table of data, but there will be many tables. There will be about 7 columns in each table, but again, there will be a lot of tables (and they will be created at runtime, whenever the customer wants a clean grid).
The data has to be stored locally in a file (and there will not be multiple instances of the software running).
I'm using C# 4.0 and I have been looking at using XML files (one file per table, or multiple tables per file), SQLite, SQL Server CE, Access, etc. I would be happy if someone here has comments or suggestions on what to do and what not to do. Stability and reliability (e.g. no trashed databases because of unstable third-party software) is probably my biggest concern.
If you are looking to store the data locally in a file, I would recommend the sqlite option since it seems your data is created in the form of a database table already. Sqlite is already built to handle multiple tables and columns so it means less mental overhead for you, the developer.
http://web.archive.org/web/20100208133236/http://www.mikeduncan.com/sqlite-on-dotnet-in-3-mins/ is a decent tutorial to give a quick overview on how to set it up and get going.
As for what NOT to do: don't try to make your own scheme to save the data to a file, it's a well understood problem that has been solved many times over, why re-invent the wheel?
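For instance, a minimal sketch with the System.Data.SQLite provider (names invented); a customer's new "clean grid" is just a new table created at runtime:

    using System.Data.SQLite;

    class GridStore
    {
        static void Main()
        {
            using (var conn = new SQLiteConnection("Data Source=grids.db;Version=3"))
            {
                conn.Open();
                using (var cmd = conn.CreateCommand())
                {
                    cmd.CommandText =
                        @"CREATE TABLE IF NOT EXISTS grid_42 (
                            id   INTEGER PRIMARY KEY,
                            col1 TEXT, col2 TEXT, col3 TEXT,
                            col4 TEXT, col5 TEXT, col6 TEXT, col7 TEXT)";
                    cmd.ExecuteNonQuery();
                }
            }
        }
    }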
XML won't be a good choice if you are planning to make several queries, since loading text files gets painful as they grow (I'm talking about files over 1MB). If you plan to keep the data small, XML would be good for keeping things simple. I still wouldn't use it, but if you have a background in it, the benefits will outweigh the learning curve.
If you have no expertise in any of them, and the data is light, my suggestion is SQLite. I believe it is the best lightweight DB for .NET, and the provider is very good; you can find it easily on Google.
I would tell you that Access is not recommendable, but this is a personal opinion. Many people use it, and I assume there is a reason for that, so you should check it out and try it.
Again, my final recommendation is SQLite, unless you know another one very well, in which case you'll have to think about how much your data is going to grow. If you plan to have a DB around 100MB, any of them except XML would do; if you think it'll grow bigger than that, lean heavily towards SQLite.
In continuation of: Storing DataRelation in xml?
Thanks to everybody for the answers to my earlier thread. However, could I ask why nobody supports this XML-based approach? What exactly would the problems be? I can apply constraints to a DataSet, and I can, I guess, also use transactions.
I am new to this. So if you could point me to some link, where I can find some sort of comparisons, that would be really helpful.
According to the FAQ, discussions are not very encouraged, but I guess this is quite specific. I hope not to be fired for this... :)
Thanks for reading,
Saurabh.
Database management systems are specifically designed to store data and retrieve it quickly, to preserve the integrity of the data and to leverage concurrent access to the data.
XML, on the other hand, was originally designed for documents, separating content from presentation. It became a handy way to store simple data because the file structure is so well defined, but then it got out of hand, with people trying to store entire databases in an unsuited structure.
XML doesn't guarantee atomicity, concurrency, integrity, fast access or anything like that. Not inherently, anyway. .NET's DataSet libraries do help in that regard, but just because you can serialize DataSet objects to XML doesn't make it a good place to store data for multiple users.
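To illustrate (a throwaway sketch, names invented): the round-trip is easy, but notice that nothing below protects you from a second process doing the same thing to the same file at the same time.

    using System.Data;

    class Demo
    {
        static void Main()
        {
            var ds = new DataSet();
            var people = ds.Tables.Add("Person");
            people.Columns.Add("Id", typeof(int));
            people.Columns.Add("Name", typeof(string));
            people.PrimaryKey = new[] { people.Columns["Id"] }; // a constraint the DataSet enforces
            people.Rows.Add(1, "Ann");

            // Serialize the data plus its schema/constraints, then read it back.
            ds.WriteXml("data.xml", XmlWriteMode.WriteSchema);

            var copy = new DataSet();
            copy.ReadXml("data.xml", XmlReadMode.ReadSchema);
        }
    }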
When you're faced with two tools, one which was designed to do exactly what you need to do (in this case a DBMS) and one that was designed to do something else but has been kludged to do what you want, sorta (in this case XML), you should probably go with the first option.
Concurrency will be the main issue, where multiple users want to access the same "database" file. Performance is the other, because the whole file has to be loaded into memory; if the file size grows, it'll get unmanageable. Query performance also suffers, because it won't be as efficient as letting something as tuned and honed as an RDBMS do it for you.
Though I would need to research to back this up, I suspect that there are some performance implications to not using an actual database.
If there is a chance that another application built on a different platform (not ADO.NET) might need to access your data in the future, having to work with a giant XML file will very likely make life more difficult. A relational DB is the standard approach to this sort of problem.