I have quite a few datatables that are transported between processes by converting them to XML.
My question is: is it quicker to read the XML into a DataSet/DataTable on the other side and query that (with LINQ), or should I just leave the data as XML and query it with LINQ to XML?
Is the overhead of converting from XML to a DataTable justified by whatever search performance gains a DataTable may offer?
The queries are mainly just finding a primary key.
If the number of queries per table is small, then my guess is that it is faster to query the XML. Reverse that advice if there are numerous queries.
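As a rough sketch of the two approaches (the element and column names here are invented, not taken from your actual data):

    using System;
    using System.Data;
    using System.Linq;
    using System.Xml.Linq;

    class XmlVsDataTableLookup
    {
        static void Main()
        {
            // Hypothetical payload; the real element names will differ.
            string xml = "<rows><row id=\"1\" name=\"a\"/><row id=\"2\" name=\"b\"/></rows>";

            // Option 1: query the XML directly with LINQ to XML.
            // No conversion cost, but every lookup scans the elements.
            XDocument doc = XDocument.Parse(xml);
            XElement hit = doc.Root.Elements("row")
                                   .FirstOrDefault(e => (int)e.Attribute("id") == 2);
            Console.WriteLine(hit);

            // Option 2: load into a DataTable once; Rows.Find then uses the
            // primary-key index, so repeated lookups are cheap.
            var table = new DataTable("row");
            table.Columns.Add("id", typeof(int));
            table.Columns.Add("name", typeof(string));
            table.PrimaryKey = new[] { table.Columns["id"] };
            foreach (var e in doc.Root.Elements("row"))
                table.Rows.Add((int)e.Attribute("id"), (string)e.Attribute("name"));

            DataRow row = table.Rows.Find(2);
            Console.WriteLine(row["name"]);
        }
    }

The conversion in option 2 is pure overhead if you only do one or two lookups per table, which is why the number of queries is the deciding factor.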
When using SQLite.Net the Linq queries are turned into hopefully performant SQL queries.
Is there however a chance that some sort of query is too complex and leads to a full and slow table scan? Like, let's say I want to get one item from a table with 50k rows - I don't want it to loop all rows and compare the values against my requested value in .Net code; it should generate a proper "where" clause.
I know that one can inspect the queries generated by SQLite.Net, but since they are built dynamically there is no guarantee that some unfortunate combination of clauses won't lead to slow performance.
Or am I thinking too far and this cannot happen?
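For example, something like this (just a sketch of the kind of lookup I mean; the [PrimaryKey]/[Indexed] attributes and the Table<T>() syntax are from the sqlite-net flavour I'm using, so adjust to your version):

    using SQLite;

    public class Item
    {
        [PrimaryKey]
        public int Id { get; set; }

        [Indexed]
        public string Name { get; set; }
    }

    class Lookup
    {
        static Item FindByName(SQLiteConnection conn, string name)
        {
            // What I hope happens: this turns into something like
            //   SELECT * FROM Item WHERE Name = ? LIMIT 1
            // and uses the index, rather than pulling all 50k rows into
            // .NET and comparing the values there.
            return conn.Table<Item>()
                       .Where(i => i.Name == name)
                       .FirstOrDefault();
        }
    }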
I have a problem which I cannot seem to get around no matter how hard I try.
The company works in market analysis and has pretty large tables (300K - 1M rows) and MANY columns (think 250-300) which we do some calculations on.
I'll try to get straight to the problem:
The problem is the filtering of the data. All databases I've tried so far are way too slow to select data and return it.
At the moment I am storing the entire table in memory and filtering using dynamic LINQ.
However, while this is quite fast (about 100 ms to filter 250 000 rows) I need better results than this...
Is there any way I can change something in my code (not the data model) which could speed the filtering up?
I have tried using:
DataTable.Select, which is slow.
Dynamic LINQ, which is better but still too slow.
Normal LINQ (just for testing purposes), which is almost good enough.
Fetching from MySQL and doing the processing later on, which is badass slow.
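Roughly, the first two attempts look like this (a simplified sketch, not my actual code; the column names are made up):

    using System;
    using System.Data;
    using System.Linq;

    class FilterAttemptsSketch
    {
        static void Demo(DataTable table)
        {
            // 1) DataTable.Select: the string expression is parsed and
            //    evaluated against every row.
            DataRow[] viaSelect = table.Select("Segment = 'A' AND Value > 100");

            // 2) LINQ to DataSet: still a sequential scan, but the predicate
            //    is a compiled delegate rather than a parsed expression.
            var viaLinq = table.AsEnumerable()
                               .Where(r => r.Field<string>("Segment") == "A"
                                        && r.Field<double>("Value") > 100)
                               .ToArray();

            Console.WriteLine("{0} / {1} rows matched", viaSelect.Length, viaLinq.Length);
        }
    }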
At the beginning of this project we thought that some high-performance database would be able to handle this, but I tried:
H2 (IKVM)
HSQLDB (compiled ODBC-driver)
CubeSQL
MySQL
SQL
SQLite
...
And they are all very slow to interface with .NET and get results from.
I have also tried splitting the data into chunks and combining them at runtime to make the total amount of data that needs filtering smaller.
Is there any way in this universe I can make this faster?
Thanks in advance!
UPDATE
I just want to add that I have not created the database in question.
To add some figures, if I do a simple select of 2 fields in the database query window (SQLyog), like this (visit_munic_name is indexed):
SELECT key1, key2 FROM table1 WHERE filter1 = filterValue1
It takes 125 milliseconds on 225639 rows.
Why is it so slow? I have tested 2 different boxes.
Of course they must change something, obviously?
You do not explain what exactly you want to do, or why filtering a lot of rows is important. Why should it matter how fast you can filter 1M rows to get an aggregate if your database can precalculate that aggregate for you? In any case it seems you are using the wrong tools for the job.
On one hand, 1M rows is a small number of rows for most databases. As long as you have the proper indexes, querying shouldn't be a big problem. I suspect that either you do not have indexes on your query columns or you want to perform ad-hoc queries on non-indexed columns.
Furthermore, it doesn't matter which database you use if your data schema is wrong for the job. Analytical applications typically use star schemas to allow much faster queries for a lot more data than you describe.
All databases used for analysis purposes use special data structures which require that you transform your data to a form they like.
For typical relational databases you have to create star schemas that are combined with cubes to precalculate aggregates.
Column databases store data in a columnar format usually combined with compression to achieve fast analytical queries, but they require that you learn to query them in their own language, which may be very different than the SQL language most people are accustomed to.
On the other hand, the way you query (LINQ or DataTable.Select or whatever) has minimal effect on performance. Picking the proper data structure is much more important.
For instance, using a Dictionary<> is much faster than using any of the techniques you mentioned. A dictionary essentially checks for single values in memory. Executing DataTable.Select without indexes, or using LINQ to DataSets or LINQ to Objects, is essentially the same as scanning all entries of an array or a List<> for a specific value, because that is what all these methods do: scan an entire list sequentially.
The various LINQ providers do not do the job of a database. They do not optimize your queries. They just execute what you tell them to execute. Even doing a binary search on a sorted list is faster than using the generic LINQ providers.
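To illustrate the difference (a minimal sketch with an invented row type, not your actual schema):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class LookupVsScanSketch
    {
        static void Main()
        {
            // Pretend each row is keyed by an integer id.
            var rows = Enumerable.Range(0, 250000)
                                 .Select(i => new { Id = i, Value = i * 2 })
                                 .ToList();

            // LINQ to Objects: walks the list sequentially, O(n) per lookup.
            var viaScan = rows.FirstOrDefault(r => r.Id == 123456);

            // Dictionary: one O(n) build, then O(1) per lookup afterwards.
            var byId = rows.ToDictionary(r => r.Id);
            var viaDictionary = byId[123456];

            Console.WriteLine("{0} == {1}", viaScan.Value, viaDictionary.Value);
        }
    }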
There are various things you can try, depending on what you need to do:
If you are looking for a quick way to slice and dice your data, use an existing product like the PowerPivot functionality of Excel 2010. PowerPivot loads and compresses MANY millions of rows in an in-memory columnar format, allows you to query your data just as you would with a pivot table, and even lets you define joins with other in-memory sources.
If you want a more repeatable process you can either create the appropriate star schemas in a relational database or use a columnar database. In either case you will have to write the scripts to load your data in the proper structures.
If you are creating your own application, you really need to investigate the various algorithms and data structures used by other similar tools for in-memory analysis.
Just a quick question.
Other than the way you manipulate them, are XmlDocuments and DataSets basically the same thing? I'm just wondering because of speed issues.
I have come across some code that calls dataSet.getXML() and then traverses the new XMLDocument.
I'm just curious what the performance difference is and which is the better one to use!
Thanks,
Adam
Very different.
A DataSet is a collection of related tabular records (with a strong focus on databases), including change tracking.
An XmlDocument is a tree structure of arbitrary data. You can convert between the two.
For "which is best".... what are you trying to do? Personally I very rarely (if ever) use DataSet / DataTable, but some people like them. I prefer an object (class) representation (perhaps via deserialization), but xml processing is fine in many cases.
It does, however, seem odd to write a DataSet to xml and then query the xml. In that scenario I would just access the original data directly.
No, they are not. A DataSet does not store its internal data in XML, and an XmlDocument does not use a table/row structure to store XML elements. You can convert from one to the other, within severe limits, but that's it. One of the biggest limitations is that a DataSet requires data to fit a strict table/column format, whereas an XmlDocument can have a wildly different structure from one XmlElement to the next. Moreover, the hierarchical structure of an XmlDocument usually doesn't map well to the tabular structure of a DataSet.
.NET provides XmlDataDocument as a way to handle XML data in a tabular way. You have to remember, though, that XmlDataDocument is an XmlDocument first. The generated DataSet is just an alternative and limited way to look at the underlying XML data.
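A small sketch of what that synchronization looks like (the table and element names here are made up for illustration):

    using System;
    using System.Data;
    using System.Xml;

    class XmlDataDocumentSketch
    {
        static void Main()
        {
            // DataSet whose schema matches the part of the XML we care about.
            var ds = new DataSet("orders");
            DataTable t = ds.Tables.Add("order");
            t.Columns.Add("id", typeof(int));
            t.Columns.Add("customer", typeof(string));

            // Constraint checking is usually switched off while loading.
            ds.EnforceConstraints = false;

            // Wrap the DataSet; the XML and the rows stay in sync.
            var doc = new XmlDataDocument(ds);
            doc.LoadXml(
                "<orders>" +
                "<order><id>1</id><customer>Acme</customer></order>" +
                "<order><id>2</id><customer>Contoso</customer></order>" +
                "</orders>");

            // Tabular view of the same data...
            foreach (DataRow row in ds.Tables["order"].Rows)
                Console.WriteLine("{0}: {1}", row["id"], row["customer"]);

            // ...while the full XML view is still available.
            Console.WriteLine(doc.DocumentElement.ChildNodes.Count);
        }
    }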
Depending on the size of your tables, LINQ to XML or XQuery might be faster for querying your data than looking through the table. I'm not positive about this; it is something you are going to have to test against your own data.
I know it's a common question asked several times on SO, but help me either way. I have to upload data from my local machine to a remote SQL database. The remote database has a single table with 800,000 records. Locally I have around 121,311 records, of which roughly 75% already exist on the remote database, but we don't know exactly which ones. We identify records using a unique code called DCNNumber: if a DCN already exists on the server it should be excluded, otherwise inserted.
So what I did is collect all the DCNs from the remote database into XML using a DataSet; that XML alone becomes a 24 MB file. From my local text files I parse and pull the 1.2 lakh records into a generic List, and the DCNs from the XML are added to a generic List of String.
Then these two lists are compared using if (!lstODCN.Contains(DCNFromXML)) { lstNewDCN.Add(item); }
But this code is taking almost an hour to execute and filter the records, so I need a more efficient way to filter such a huge number of records.
Load all the results into a HashSet<string> - that will be much faster at checking containment.
It's possible that LINQ would also make this simpler, but I'm somewhat confused as to exactly what's going on... I suspect you can just use:
var newDCNs = xmlDCNs.Except(oldDCNs);
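Putting the two suggestions together, something along these lines (a sketch; the variable names are guesses at what you have):

    using System.Collections.Generic;
    using System.Linq;

    class DcnFilterSketch
    {
        // serverDcns: DCNs already on the remote table; localDcns: parsed from the text files.
        static List<string> FindNewDcns(IEnumerable<string> serverDcns, IEnumerable<string> localDcns)
        {
            // HashSet gives O(1) membership tests instead of List<T>.Contains, which is O(n) per call.
            var existing = new HashSet<string>(serverDcns);
            return localDcns.Where(dcn => !existing.Contains(dcn)).ToList();

            // Or, equivalently (Except builds a set internally and also removes duplicates):
            // return localDcns.Except(serverDcns).ToList();
        }
    }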
In addition to Jon's answer: using an XML dataset to transfer the data from the server is probably a bad idea, because XML is a very verbose format. Using a flat-file format + compression would be much more efficient.
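For example, something along these lines (a sketch; GZipStream over a plain one-DCN-per-line file is just one option):

    using System.Collections.Generic;
    using System.IO;
    using System.IO.Compression;

    class DcnTransportSketch
    {
        // Write one DCN per line into a gzip-compressed file.
        static void WriteDcns(IEnumerable<string> dcns, string path)
        {
            using (var file = File.Create(path))
            using (var gzip = new GZipStream(file, CompressionMode.Compress))
            using (var writer = new StreamWriter(gzip))
            {
                foreach (var dcn in dcns)
                    writer.WriteLine(dcn);
            }
        }

        // Read them back into a list on the other side.
        static List<string> ReadDcns(string path)
        {
            var result = new List<string>();
            using (var file = File.OpenRead(path))
            using (var gzip = new GZipStream(file, CompressionMode.Decompress))
            using (var reader = new StreamReader(gzip))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                    result.Add(line);
            }
            return result;
        }
    }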
How can I write join statements on a DataSet?
I have data in XML format. I can load that data into a DataSet,
but how do I fetch data from two DataTables using a join query?
Well, it partly depends on how you want to express that join. If you know the query beforehand, I would personally use LINQ to Objects via LINQ to DataSet - that's particularly handy if you're working with strongly typed datasets, but it can work even without that.
The sample code for C# in Depth has some examples in LINQ to DataSet you could have a look at.
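For instance, a join over two untyped tables might look something like this (a sketch with invented table and column names):

    using System;
    using System.Data;
    using System.Linq;

    class DataSetJoinSketch
    {
        static void PrintOrders(DataSet ds)
        {
            DataTable orders = ds.Tables["Order"];
            DataTable customers = ds.Tables["Customer"];

            var query = from o in orders.AsEnumerable()
                        join c in customers.AsEnumerable()
                            on o.Field<int>("CustomerId") equals c.Field<int>("Id")
                        select new
                        {
                            OrderId = o.Field<int>("Id"),
                            Customer = c.Field<string>("Name")
                        };

            foreach (var row in query)
                Console.WriteLine("{0}: {1}", row.OrderId, row.Customer);
        }
    }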
Now, if you want to read the query dynamically as well, that makes it a lot harder.
Is this XML actually an XML-serialized dataset? Do you definitely need to get datasets involved at all? If it's just plain XML, have you tried using LINQ to XML with LINQ to Objects? It may be less efficient, but how vital is that for your application? How large is the data likely to be?