I'm supposed to do the following:
1) read a huge (700 MB, ~10 million elements) XML file;
2) parse it, preserving order;
3) create one or more text files with SQL insert statements to bulk load it into the DB;
4) read the relational tuples back and write them out as XML again.
I'm here to exchange some ideas about the best (read: fast, fast, fast...) way to do this. I will use C# 4.0 and SQL Server 2008.
I believe XmlTextReader is a good start, but I do not know if it can handle such a huge file. Does it load the whole file when it is instantiated, or does it hold only the current node in memory? I suppose I can do a while (reader.Read()) loop and that should be fine.
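Something like this is what I have in mind - just a sketch (using XmlReader.Create, which as far as I know is the recommended way to get a streaming reader):

using System.Xml;

// Stream the document node by node; only the current node is held in memory.
using (XmlReader reader = XmlReader.Create("huge.xml"))
{
    while (reader.Read())
    {
        if (reader.NodeType == XmlNodeType.Element)
        {
            // look at reader.Name, reader.Depth, its attributes, etc.
            // and emit the corresponding INSERT line here
        }
        else if (reader.NodeType == XmlNodeType.Text)
        {
            // reader.Value holds the text content
        }
    }
}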
What is the best way to write the text files? Since I have to preserve the ordering of the XML (adopting some numbering scheme), I will have to hold parts of the tree in memory to do the calculations, etc. Should I build the files up with a StringBuilder?
I will have two scenarios: one where every node (element, attribute or text) will be in the same table (i.e., will be the same object), and another where for each type of node (just these three types, no comments etc.) I will have a table in the DB and a class to represent the entity.
My last specific question is: how good is DataSet's ds.WriteXml? Will it handle 10M tuples? Maybe it's best to bring chunks from the database and use an XmlWriter... I really don't know.
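For the chunked XmlWriter idea, I'm picturing something roughly like this (just a sketch; the table and column names are made up):

using System.Data.SqlClient;
using System.Xml;

// Stream rows out of the database and write them with XmlWriter,
// so neither side has to hold all 10M tuples in memory at once.
using (var conn = new SqlConnection("Data Source=.;Initial Catalog=MyDb;Integrated Security=SSPI;"))
using (var writer = XmlWriter.Create("output.xml", new XmlWriterSettings { Indent = true }))
{
    conn.Open();
    var cmd = new SqlCommand("SELECT Id, Name FROM dbo.Nodes ORDER BY Id", conn);

    writer.WriteStartElement("nodes");
    using (var dataReader = cmd.ExecuteReader())
    {
        while (dataReader.Read())
        {
            writer.WriteStartElement("node");
            writer.WriteAttributeString("id", dataReader.GetInt32(0).ToString());
            writer.WriteElementString("name", dataReader.GetString(1));
            writer.WriteEndElement();
        }
    }
    writer.WriteEndElement();
}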
I'm testing all this stuff, but I decided to post this question to hear from you guys, hoping your expertise can help me do these things more correctly and faster.
Thanks in advance,
Pedro Dusso
I'd use the SQLXML Bulk Load Component for this. You provide a specially annotated XSD schema for your XML with embedded mappings to your relational model. It can then bulk load the XML data blazingly fast.
If your XML has no schema you can create one from Visual Studio by loading the file and selecting Create Schema from the XML menu. You will need to add the mappings to your relational model yourself, however. This blog has some posts on how to do that.
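Invoking the component from C# 4.0 can look roughly like this (a sketch only; the ProgID, connection string and file paths are assumptions and depend on the SQLXML version you have installed):

using System;

// Late-bound COM call via dynamic, so no interop assembly reference is needed.
Type bulkLoadType = Type.GetTypeFromProgID("SQLXMLBulkLoad.SQLXMLBulkload.4.0");
dynamic bulkLoad = Activator.CreateInstance(bulkLoadType);

bulkLoad.ConnectionString =
    "Provider=SQLOLEDB;Data Source=.;Initial Catalog=MyDb;Integrated Security=SSPI;";
bulkLoad.ErrorLogFile = @"C:\temp\bulkload_errors.xml";

// First argument: the annotated XSD with the relational mappings; second: the XML data file.
bulkLoad.Execute(@"C:\temp\mapping.xsd", @"C:\temp\data.xml");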
Guess what? You don't have a SQL Server problem. You have an XML problem!
Faced with your situation, I wouldn't hesitate. I'd use Perl and one of its many XML modules to parse the data, create simple tab- or other-delimited files to bulk load, and bcp the resulting files.
Using the server to parse your XML has many disadvantages:
Not fast, more than likely
Positively useless error messages, in my experience
No debugger
Nowhere to turn when one of the above turns out to be true
If you use Perl on the other hand, you have line-by-line processing and debugging, error messages intended to guide a programmer, and many alternatives should your first choice of package turn out not to do the job.
If you do this kind of work often and don't know Perl, learn it. It will repay you many times over.
My concern is this: I'm trying to make a program that will show, let's say, images of cars to create a collection. I'm using Windows Forms and I have made the graphical interface I want.
The thing is, I have an XML file (like a database) with 300 cars (elements), and each car has some child elements, one of them being "carname.png", so I can use it to find the image file to show as well. My question is whether there is a way to convert this XML file into a binary one that my program will be able to read much faster. This is for efficiency and self-educational-practice purposes, as I am new to programming. Thanks in advance for your time and thoughts.
Simply import your XML file into a suitable database and query that instead. If it naturally fits into relational form, use SQLite as a first choice. In the Windows environment, as a learning experience, you might prefer to use SQL Server, but it's overkill.
If the data does not fit naturally into relational form, then use a NoSQL database like MongoDB and store your XML records (or JSON or whatever) in it. You get fast access and XML flexibility together.
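As a rough sketch of the import step (assuming the Microsoft.Data.Sqlite package, and guessing at element names like <car>, <name> and <image> - adjust them to your file):

using System.Xml.Linq;
using Microsoft.Data.Sqlite;

// Read the 300 <car> elements once and copy them into a small SQLite database.
var doc = XDocument.Load("cars.xml");

using (var conn = new SqliteConnection("Data Source=cars.db"))
{
    conn.Open();

    var create = conn.CreateCommand();
    create.CommandText = "CREATE TABLE IF NOT EXISTS Cars (Name TEXT, Image TEXT)";
    create.ExecuteNonQuery();

    foreach (var car in doc.Descendants("car"))
    {
        var insert = conn.CreateCommand();
        insert.CommandText = "INSERT INTO Cars (Name, Image) VALUES ($name, $image)";
        insert.Parameters.AddWithValue("$name", (string)car.Element("name") ?? "");
        insert.Parameters.AddWithValue("$image", (string)car.Element("image") ?? "");
        insert.ExecuteNonQuery();
    }
}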
I use CSV files as a database in separate processes. I only store all the data or read all the data into my data grid, with no relationships between the files. Every record in every txt file has one and only one number, starting from zero.
// When I get to countries, I read allcountries.txt,
// when I get to cities, I read allcities.txt,
// when I get to places, I read allplaces.txt.
But one country has many cities and one city has many places. Yet I don't use any relationships. I want to, and I know there is a need for this. How can I read and write related data by adding one extra column to all the text files?
And is it possible to access the data without SQL queries?
Text files don't have any mechanism for SELECTs or JOINs. You'll be at a pretty steep disadvantage.
However, LINQ gives you the ability to search through object collections in SQL-like ways, and you can certainly create entities that have relationships. The application (whatever you're building) will have to load everything from the text files at application start. Since your code will be doing the searching, it has to have the entities loaded (from text files) and in-memory.
That may or may not be a pleasant or effective experience, but if you're set on replacing SQL with text files, it's not impossible.
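For example, something like this is the general idea (a minimal sketch; the file layouts and the extra CountryId column are assumptions):

using System;
using System.IO;
using System.Linq;

// Assumed layouts: allcountries.txt = "Id,Name"; allcities.txt = "Id,CountryId,Name".
// The CountryId column is the single extra column that expresses the relationship.
class Country { public int Id; public string Name; }
class City { public int Id; public int CountryId; public string Name; }

static class Program
{
    static void Main()
    {
        var countries = File.ReadAllLines("allcountries.txt")
                            .Select(l => l.Split(','))
                            .Select(p => new Country { Id = int.Parse(p[0]), Name = p[1] })
                            .ToList();

        var cities = File.ReadAllLines("allcities.txt")
                         .Select(l => l.Split(','))
                         .Select(p => new City { Id = int.Parse(p[0]), CountryId = int.Parse(p[1]), Name = p[2] })
                         .ToList();

        // A SQL-like join, done in memory with LINQ instead of a database query.
        var pairs = from city in cities
                    join country in countries on city.CountryId equals country.Id
                    select country.Name + " / " + city.Name;

        foreach (var pair in pairs)
            Console.WriteLine(pair);
    }
}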
CSV files are good for mass input and mass output. They are not good for point manipulations or maintaining relationships. Consider using a database. SQLite might be something useful in your application.
Based on your comments, it would make more sense to use XML instead of CSV. This meets your requirements for being human- and machine-readable, and XML has nice built-in libraries for searching, manipulating, serializing, etc.
You can use SQL queries against CSV files: How to use SQL against a CSV file. I have done it for reading but never for writing, so I don't know if this will work for you.
What is the best way to store big objects? In my case it's something like a tree or a linked list.
I tried the following:
1) Relational DB
It is not good for tree structures.
2) Document DB
I tried RavenDB, but it raised a System.OutOfMemoryException when I called the SaveChanges method.
3) .NET serialization
It works very slowly.
4) Protobuf
It cannot deserialize List<List<>> types, and I'm not sure about linked structures.
So...?
You mention protobuf - I routinely use protobuf-net with objects that are many hundreds of megabytes in size, but: it does need to be suitably written as a DTO, and ideally as a tree (not a bidirectional graph, although that usage is supported in some scenarios).
In the case of a doubly-linked list, that might mean simply: marking the "previous" links as not serialized, then doing a fix-up in an after-deserialize callback, to correctly set the "previous" links. Pretty easy normally.
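Something along these lines, as a minimal sketch (the Node/NodeList types are invented; only the attributes and the callback reflect protobuf-net usage):

using ProtoBuf;

[ProtoContract]
class Node
{
    [ProtoMember(1)] public int Value;
    [ProtoMember(2)] public Node Next;      // serialized: the forward chain is a simple tree
    [ProtoIgnore]    public Node Previous;  // skipped: this is the back-reference
}

[ProtoContract]
class NodeList
{
    [ProtoMember(1)] public Node Head;

    [ProtoAfterDeserialization]
    void RebuildBackLinks()
    {
        // Re-create the "previous" links once the forward chain has been deserialized.
        for (var n = Head; n != null && n.Next != null; n = n.Next)
            n.Next.Previous = n;
    }
}

You then serialize and deserialize with Serializer.Serialize / Serializer.Deserialize<NodeList> as usual; only the forward chain travels over the wire.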
You are correct in that it doesn't currently support nested lists. This is usually trivial to side-step by using a list of something that has a list, but I'm tempted to make this implicit - i.e. the library should be able to simulate this without you needing to change your model. If you are interested in me doing this, let me know.
If you have a concrete example of a model you'd like to serialize, and want me to offer guidance, let me know - if you can't post it here, then my email is in my profile. Entirely up to you.
Did you try Json.NET and store the result in a file?
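If you try that, the one setting worth knowing about for linked structures is reference preservation. A rough sketch (myTree / MyTree are placeholders for your own model):

using System.IO;
using Newtonsoft.Json;

var settings = new JsonSerializerSettings
{
    // Emits $id/$ref markers so back-references in a linked structure
    // are written once instead of recursing forever.
    PreserveReferencesHandling = PreserveReferencesHandling.Objects
};

// myTree / MyTree stand in for your own object model.
File.WriteAllText("data.json", JsonConvert.SerializeObject(myTree, settings));
MyTree roundTripped = JsonConvert.DeserializeObject<MyTree>(File.ReadAllText("data.json"), settings);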
Option 2: NoSQL (document) database
I suggest Cassandra.
From the Cassandra wiki:
Cassandra's public API is based on Thrift, which offers no streaming abilities; any value written or fetched has to fit in memory. This is inherent to Thrift's design and is therefore unlikely to change. So adding large object support to Cassandra would need a special API that manually split the large objects up into pieces. A potential approach is described in http://issues.apache.org/jira/browse/CASSANDRA-265.
As a workaround in the meantime, you can manually split files into chunks of whatever size you are comfortable with -- at least one person is using 64MB -- and making a file correspond to a row, with the chunks as column values.
So if your files are < 10MB you should be fine, just make sure to limit the file size, or break large files up into chunks.
CouchDb does a very good job with challenges like that one.
storing a tree in CouchDb
storing a tree in relational databases
I'm currently loading a lot of XML files (thousands, ranging from 1 KB to 6 MB each) into destination databases. Currently, I'm using the SQLXMLBULKLOAD COM object. One of the biggest problems I'm having is that the COM object doesn't always play nice within our transactional environment. There are other problems too, such as performance; the process really begins choking on files approaching ~2MB, taking several minutes, if not longer in some cases (hours), to load into the tables.
So now I'm looking for an alternative, of which there seems to be two flavors:
1) Something like OPENXML, where XML is inserted as XML data into SQL Server
or
2) Solutions that parse the XML in memory and load it as rowsets into the database (roughly sketched below).
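For option 2, this is roughly what I'm picturing (a sketch only; in practice the shredding would be driven by the client-supplied XSD, and the table name is made up):

using System.Data;
using System.Data.SqlClient;

// Shred the XML in memory, then push normal rowsets with SqlBulkCopy.
var ds = new DataSet();
ds.ReadXml(@"C:\incoming\somefile.xml");   // infers tables from the XML shape

using (var bulk = new SqlBulkCopy("Data Source=.;Initial Catalog=MyDb;Integrated Security=SSPI;"))
{
    bulk.DestinationTableName = "dbo.DestinationTable";   // an existing destination table
    bulk.WriteToServer(ds.Tables[0]);
}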
There are drawbacks to either approach, and I know I'm going to have to start doing some benchmarking of prototype solutions before I jump to any conclusions. The OPENXML approach is very attractive IMO, mainly because it promises some really good performance numbers (others claim going from hours to milliseconds). But it has the drawback of storing data as XML - not ideal in my particular scenario, since the destination tables already exist, and many other processes rely on queries and SPROCs out there that query these tables as normal rowset data.
Whatever solution I choose, I must meet the following requirements:
1) Must accept any XML file. Clients (in a business sense) need only provide an XSD, and an appropriate destination database/table(s) for the data.
2) Individual files (never larger than ~6MB) must be processed in under 1 minute (ideally even much quicker than that).
3) Inserted data must be able to accommodate existing queries and SPROCs (i.e., it must ultimately end up as normal rowset data).
So my question is, do you have any experience in this situation, and what are your thoughts and insights?
I am not completely opposed to an OPENXML-like solution, just as long as the data can end up as normal rowset data at some point. I am also interested in 3rd-party solutions you may have experience with; this is an important part of our process, and we are willing to spend some $ if it gets us the best and most stable solution.
I'm also not opposed to "roll-your-own" suggestions, or things on CodePlex, etc. I came across the LINQ to XSD project, but couldn't find much documentation about what its capabilities are (just as an example of the kind of thing I am interested in).
I would revisit the performance issues you are having with the SQLXMLBULKLOAD COM object. I have used this component to load 500 MB XML files before. Can you post the code you are using to invoke the component?
I need to analyze tens of thousands of lines of data. The data is imported from a text file. Each line of data has eight variables. Currently, I use a class to define the data structure. As I read through the text file, I store each line object in a generic List<T>.
I am wondering if I should switch to using a relational database (SQL), as I will need to analyze the data in each line of text, trying to relate it to definition terms which I also currently store in generic List<T> collections.
The goal is to translate a large amount of data using definitions. I want the defined data to be filterable, searchable, etc. Using a database makes more sense the more I think about it, but I would like to confirm with more experienced developers before I make the changes yet again (I was using structs and ArrayLists at first).
The only drawback I can think of is that the data does not need to be retained after it has been translated and viewed by the user. There is no need for permanent storage of the data; therefore, using a database might be a little overkill.
It is not absolutely necessary to go to a database. It depends on the actual size of the data and the processing you need to do. If you are loading the data into a List<T> with a custom class, why not use LINQ to do your querying and filtering? Something like:
var query = from foo in fooList   // fooList is your List<Foo> loaded from the text file
            where foo.Prop == criteriaVar
            select foo;
The real question is whether the data is so large that it cannot be loaded up into memory comfortably. If that is the case, then yes, a database would be much simpler.
This is not a large amount of data. I don't see any reason to involve a database in your analysis.
There IS a query language built into C# -- LINQ. The original poster currently uses a list of objects, so there is really nothing left to do. It seems to me that a database in this situation would add far more heat than light.
It sounds like what you want is a database. Sqlite supports in-memory databases (use ":memory:" as the filename). I suspect others may have an in-memory mode as well.
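With the .NET SQLite providers that usually means a connection string along these lines (a sketch, assuming the Microsoft.Data.Sqlite package):

using Microsoft.Data.Sqlite;

// Nothing touches disk: the database lives only as long as this connection stays open.
using (var conn = new SqliteConnection("Data Source=:memory:"))
{
    conn.Open();
    var cmd = conn.CreateCommand();
    cmd.CommandText = "CREATE TABLE Lines (Field1 TEXT, Field2 REAL)";
    cmd.ExecuteNonQuery();
    // ... insert the parsed lines, then filter/search with ordinary SQL ...
}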
I was facing the same problem you are facing now while I was working at my previous company. I was looking for a solid solution for a lot of barcode-generated files; the barcode system generates a text file with thousands of records in a single file. Manipulating and presenting the data was very difficult for me at first. Based on the records, what I programmed was a class that reads the file, loads the data into a DataTable, and can save it to the database. The database I used was SQL Server 2005. I was then able to manage the saved data easily and present it any way I liked. The main point is to read the data from the file and save it to the database. If you do that, you will have a lot of options to manipulate and present it the way you like.
If you do not mind using Access, here is what you can do:
Attach a blank Access db as a resource
When needed, write the db out to file.
Run a CREATE TABLE statement that handles the columns of your data
Import the data into the new table
Use SQL to run your calculations
On close, delete that Access db.
You can use a program like Resourcer to load the db into a resx file
Then use the following code to pull the resource out of the project. Take the byte array and save it to the temp location with the temp filename:
ResourceManager res = new ResourceManager("MyProject.blank_db", this.GetType().Assembly);
byte[] b = (byte[])res.GetObject("access.blank");
"MyProject.blank_db" is the location and name of the resource file
"access.blank" is the tab given to the resource to save
If the only thing you need to do is search and replace, you may consider using sed and awk, and you can do searches using grep. Of course, this is on a Unix platform.
From your description, I think Linux command line tools can handle your data very well. Using a database may unnecessarily complicate your work. If you are using Windows, these tools are also available in various ways; I would recommend Cygwin. The following tools may cover your task: sort, grep, cut, awk, sed, join, paste.
These Unix/Linux command line tools may look scary to a Windows person, but there are reasons why people love them. The following are my reasons for loving them:
They allow your skill to accumulate - your knowledge of a particular tool will be helpful in different future tasks.
They allow your efforts to accumulate - the command line (or scripts) you used to finish the task can be repeated as many times as needed with different data, without human interaction.
They usually outperform the same tool you could write yourself. If you don't believe it, try to beat sort with your own version on terabyte-sized files.