Sort-Merge two Datasets: which approach is faster?

Greetings Overflowers,
I have two huge DataSets that I need to merge and sort.
Should I:
1. Use DataSet.Merge and then DataSet.Sort?
2. Insert both DataSets into an in-memory database that is indexed by the sort columns, then query against it?
3. Insert both DataSets into an in-memory database that is NOT indexed and do the sorting at query time?
4. Use another option, e.g. a multi-threaded merge-sort algorithm that takes both DataSets and outputs a merged and sorted DataSet?
Kind regards
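For reference, the first option can be sketched in a few lines. As far as I know, DataSet itself exposes no Sort method, so this sketch sorts through a DataView instead. The table schema (Id, Name) and the helper names MakeTable and MergeAndSort are assumptions made up for the illustration; this is not a benchmark and says nothing about which option is fastest.

```csharp
using System;
using System.Data;

public static class MergeSortSketch
{
    // Builds a small table with the hypothetical schema (Id, Name).
    public static DataTable MakeTable(params (int Id, string Name)[] rows)
    {
        var t = new DataTable("Items");
        t.Columns.Add("Id", typeof(int));
        t.Columns.Add("Name", typeof(string));
        foreach (var r in rows) t.Rows.Add(r.Id, r.Name);
        return t;
    }

    // Merges b into a copy of a, then returns a new table sorted by Id.
    public static DataTable MergeAndSort(DataTable a, DataTable b)
    {
        DataTable merged = a.Copy();
        merged.Merge(b);                          // no primary key is set, so rows are appended
        var view = new DataView(merged) { Sort = "Id ASC" };
        return view.ToTable();                    // materialises the rows in view order
    }

    public static void Main()
    {
        var first = MakeTable((3, "c"), (1, "a"));
        var second = MakeTable((4, "d"), (2, "b"));
        foreach (DataRow row in MergeAndSort(first, second).Rows)
            Console.WriteLine($"{row["Id"]} {row["Name"]}");
    }
}
```

If both inputs are already sorted, a hand-written two-pointer merge would avoid the full sort, but that is a separate technique and is not shown here.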

Related

DataTable vs Array performance in c#

I am querying data from a database table, say ABCD, and want to store it in some data structure, then read it row by row, perform calculations on the values, and update the data structure with the modified values.
Finally, once the calculation on all the rows is finished, I would update the ABCD table with the updated values for each row.
Please suggest the data structure that would give the best performance.
I am considering a 2D array or a DataTable.

An array is much lighter than a DataTable but has no data structure operations. The DataTable is a heavier object, but it is much easier to work with.

Array
The fastest data structure to read from is an array.
Searching takes time.
Inserting into or removing from an array is very inefficient.
DataTables
Faster searches (LINQ to DataSet is faster than DataTable.Select).
LINQ was still a lot faster than using normal DataSet methods (40-50 times faster!).
Fast insertions, removal, and sorting.

How to manage a million records?

I really need an expert's help with this.
Here is the scenario:
I'm using a SQL select query to retrieve a million records.
I need to sort and group the resulting records, which I'm storing in a DataTable (in one execution) and then looping through to group and sort them.
I know this is childish and not the right way to process it.
How can I manage the million records effectively and apply the grouping and sorting to them?
I really need help here. I have heard of executing the select query batch-wise, but how do I implement the grouping and sorting when we don't have the entire data in hand?
I cannot use SQL ORDER BY and GROUP BY directly; that is against my requirement.
Here is what I'm doing right now:
I have the following objects, i.e. the column names for grouping and sorting:
List<Group> groupList;
List<Sort> sortList;
DataTable reportData; // holds all the records from the database
I'm looping through 'reportData' row by row and comparing the current and previous rows for the custom grouping and sorting. How can the same be done with batch-wise execution, or is there an alternative solution?
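One way to picture batch-wise grouping: if the grouping boils down to a per-group aggregate, each batch can be folded into a running result so that only the totals stay in memory. The sketch below assumes exactly that. The Record type, its Group and Amount fields, and the Aggregate helper are hypothetical, and the asker's actual custom grouping rules are not known. A SortedDictionary keeps the group keys ordered as a side effect.

```csharp
using System;
using System.Collections.Generic;

public static class BatchGroupingSketch
{
    // Hypothetical row shape: a group key and a numeric value.
    public sealed class Record
    {
        public Record(string group, decimal amount) { Group = group; Amount = amount; }
        public string Group { get; }
        public decimal Amount { get; }
    }

    // Folds every batch into per-group totals; batches can be discarded once consumed.
    public static SortedDictionary<string, decimal> Aggregate(IEnumerable<Record[]> batches)
    {
        var totals = new SortedDictionary<string, decimal>(StringComparer.Ordinal);
        foreach (Record[] batch in batches)
        {
            foreach (Record rec in batch)
            {
                totals.TryGetValue(rec.Group, out decimal sum); // sum is 0 when the key is new
                totals[rec.Group] = sum + rec.Amount;
            }
        }
        return totals;
    }

    public static void Main()
    {
        var batches = new List<Record[]>
        {
            new[] { new Record("b", 1m), new Record("a", 2m) },
            new[] { new Record("a", 3m), new Record("c", 4m) },
        };
        foreach (var kv in Aggregate(batches))
            Console.WriteLine($"{kv.Key}: {kv.Value}");
    }
}
```

This does not cover the case where every individual row must come out fully sorted; that would need an external merge of sorted batches, which is beyond this sketch.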

I need to perform sorting and grouping on the resultant records which
im storing in a datatable( in one execution) and looping through it
for grouping and sorting it.
What for?
Seriously.
Do not pull the data and then try playing smart with a stupid object model behind it (and datasets are not particularly smart, sorry).
Group and sort in your select statement, pull the data already grouped and joined, and be done with it.
A million records was a small amount of data for SQL Server when the original version was released (4.2 it was, a port of Sybase SQL Server) 17 years or so ago. These days it is something that likely fits into the processor's third-level cache, and a proper SQL Server barely notices it has just processed it.
SQL is particularly good at doing projections, and ever since they introduced MARS you can even run multiple queries over one connection, which comes in handy here.
So, go back, throw away the dataset and the "I try to program a sort algo" approach, and create proper SQL statements to pull the data as you need it.

Sounds like you should implement Partition Pruning. Partitioning will allow for a separation of content like you are requesting in order to have faster queries.

If I understood correctly, in your case I would create a temporary database table with the structure I want, designed specifically to cover my grouping.
Then I would select the records from the main tables and insert them into the temporary one, applying all modifications including grouping.
An index matching the sort order you want should also be created on it.
After that, just select from this table, do what you have to do, and finally, if the data is not needed any more, drop the temporary table.
I would choose the above solution because a million records in memory smells like trouble to me...

For example (assuming reportData is a sequence of typed objects; for a DataTable, call reportData.AsEnumerable() and read columns with row.Field<T>("...")):
1. Let's assume that you would like to group them by their DocumentTypeID
var groupByType = reportData.GroupBy(g => g.DocumentTypeID);
2. Sorting alphabetically
var sortAlphabetically = reportData.OrderBy(g => g.DocumentName);
3. Grouping, then sorting the groups by their key
var groupThenSort = reportData.GroupBy(g => g.DocumentTypeID)
    .OrderBy(grp => grp.Key);
4. Sorting, then grouping (rows keep their sorted order inside each group)
var sortThenGroup = reportData.OrderBy(g => g.DocumentName)
    .GroupBy(g => g.DocumentTypeID);
5. Grouping on multiple keys, with sorted rows inside each group
var multipleGroupAndSort = reportData.OrderBy(g => g.DocumentName)
    .GroupBy(g => new { g.DocumentTypeID, g.CreatedOnDate.Month });
and so on and so forth...
But I would still discourage bringing a million rows into the application. It will cost memory. There are of course ways to manage it, through stored procedures etc.
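Since reportData in the question is a DataTable, the same kind of query has to go through LINQ to DataSet. The following is a minimal sketch under assumed column names (DocumentTypeID, DocumentName) and a made-up three-row table; the MakeReportData and GroupSorted helpers are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class LinqToDataSetSketch
{
    // Hypothetical sample data standing in for the real report table.
    public static DataTable MakeReportData()
    {
        var t = new DataTable("Report");
        t.Columns.Add("DocumentTypeID", typeof(int));
        t.Columns.Add("DocumentName", typeof(string));
        t.Rows.Add(2, "Zeta");
        t.Rows.Add(1, "Beta");
        t.Rows.Add(2, "Alpha");
        return t;
    }

    // Sorts rows by name, groups them by type, then orders the groups by key.
    public static List<IGrouping<int, DataRow>> GroupSorted(DataTable reportData)
    {
        return reportData.AsEnumerable()
            .OrderBy(r => r.Field<string>("DocumentName"))
            .GroupBy(r => r.Field<int>("DocumentTypeID"))
            .OrderBy(g => g.Key)
            .ToList();
    }

    public static void Main()
    {
        foreach (var group in GroupSorted(MakeReportData()))
        {
            Console.WriteLine($"Type {group.Key}");
            foreach (DataRow row in group)
                Console.WriteLine($"  {row.Field<string>("DocumentName")}");
        }
    }
}
```

Enumerable.GroupBy yields the elements of each group in source order, which is why sorting before grouping leaves the rows sorted inside every group.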

Should I use dataset or datatable?

If I need to fetch one whole column from Table1 in the DB, should I fetch it using a DataTable or a DataSet? I can do it both ways. I mean, OK, I should use a DataTable. Why is that? What would happen if I used a DataSet?
OK, that's what I wanted to know. So there's a memory issue. Now I am confused. Whichever I use, DataTable or DataSet, both will be fetching only ONE column from my table in the DB. How is the DataSet going to use more memory, then?

Use a DataTable.
A DataSet is an in-memory database while DataTable is an in-memory table.
DataSets are more complicated and heavier-weight; they can contain multiple DataTables and relations between DataTables.

You are better off using a DataTable (it uses less memory).
Or you can try user-created value objects or DTOs.

To answer your edited question, there's more overhead to a dataset. DataTables are better for what you need. If you're doing a lot of data fetching, though, it's easier (and way more maintainable!) to use an ORM of sorts.

Do ADO.Net DataTables have indexes?

I am using VSTS 2008 + C# + .Net 3.5 + SQL Server 2008 + ADO.Net. Suppose I load a table from a database into an ADO.Net DataTable, and I have defined a couple of indexes on the table in the database. My question is whether the ADO.Net DataTable has related indexes (the same as the indexes I created on the physical database table) to improve the performance of certain operations on the DataTable.
thanks in advance,
George

Actually George's question is not so "bad" as some people insist it is. (I am more and more convinced that there's no such thing as, "a bad question").
I have a rather big table which I load into the memory, in a DataTable object. A lot of processing is done on lines from this table, a lot of times, on various (and different) subsets which I can easily describe as "WHERE ..." of SELECT clauses. Now with this DataTable I can run Select() - a method of DataTable class - but it is quite inefficient.
In the end, I decided to load the DataTable sorted by specific columns and implemented my own quick search, instead of using the Select() function. It proved to be much faster, but of course it works only on those sorted columns. The trouble would have been avoided, had a DataTable had indexes.

No, but possibly yes.
You can set up your own indices on a DataTable, using a DataView. As you change the table, the DataView will be rebuilt, so the index should always be up to date.
I did some bench tests for my own app. I use a DataTable to approximate a Boost MultiIndexContainer. To create an index on a column called "Author", I initialise the DataTable, and then the DataView...
_dvChangesByAuthor =
    new DataView(
        _dtChanges,
        string.Empty,
        "Author ASC",
        DataViewRowState.CurrentRows);
To then pull data by Author from the table, you use the view's FindRows function...
DataRowView[] dataRowViews = _dvChangesByAuthor.FindRows(author);
List<DataRow> returnRows = new List<DataRow>();
foreach (DataRowView drv in dataRowViews)
{
    returnRows.Add(drv.Row);
}
I made a random large DataTable, and ran queries using DataTable.Select(), Linq-To-DataSet (with forced execution by exporting to list) and the above DataView method. The DataView method won easily. Linq took 5000 ticks, Select took over 26000 ticks, DataView took 192 ticks...
LOC=20141121-14:46:32.863,UTC=20141121-14:46:32.863,DELTA=72718,THR=9,DEBUG,LOG=Program,volumeTest() - Running queries for author >TFYN_AUTHOR_047<
LOC=20141121-14:46:32.863,UTC=20141121-14:46:32.863,DELTA=72718,THR=9,DEBUG,LOG=RightsChangeTracker,GetChangesByAuthorUsingLinqToDataset() - Query elapsed time: 2 ms, 4934 ticks; Rows=65
LOC=20141121-14:46:32.879,UTC=20141121-14:46:32.879,DELTA=72733,THR=9,DEBUG,LOG=RightsChangeTracker,GetChangesByAuthorUsingSelect() - Query elapsed time: 11 ms, 26575 ticks; Rows=65
LOC=20141121-14:46:32.879,UTC=20141121-14:46:32.879,DELTA=72733,THR=9,DEBUG,LOG=RightsChangeTracker,GetChangesByAuthorUsingDataview() - Query elapsed time: 0 ms, 192 ticks; Rows=65
So, if you want indices on a DataTable, I would suggest DataView, if you can deal with the fact that the index is re-built when the data changes.

You can create a primary key for the datatable. Filter operations get a big boost if you are searching in the primary key field. Check out this link: here

I had the same problem with many queries against a large DataTable that do not go by the primary key.
The solution I found was to create a DataView for each index I wanted to use, and then use its Find and FindRows methods to extract the data.
DataView creates an internal index on the DataTable and behaves virtually as an index for this purpose.
In my case I was able to reduce 10,000 queries from 40 seconds to ONE!!!

John above is correct. DataTables are disconnected, in-memory structures. They do not map to the physical implementation of the database.
The indexes on disk are used to speed up lookups because you don't have all the rows. If you have to load every row and scan them it is slow, so an index makes sense. In a DataTable you already have all the rows, so a comparison is fast already.

The correct answer here to the implicit question of creating an index on a DataTable is that you can't do that, but you can create one or more DataViews for the DataTable, which according to the doc will create an index based on the sorting the DataView specifies:
DataView constructs an index. An index contains keys built from one or more columns in the table or view. These keys are stored in a structure that enables the DataView to find the row or rows associated with the key values quickly and efficiently. Operations that use the index, such as filtering and sorting, see significant performance increases. The index for a DataView is built both when the DataView is created and when any of the sorting or filtering information is modified. Creating a DataView and then setting the sorting or filtering information later causes the index to be built at least twice: once when the DataView is created, and again when any of the sort or filter properties are modified.
If you need to do a large number of lookups to an in-memory DataTable, it may be the most straightforward and performant to use a DataView with the Find() or FindRows() method to do indexed key lookups. In particular, if you need to do a number of lookups and modifications to the data this would prevent needing to transform your DataTable into another indexed class like a Dictionary and then transforming it back into a DataTable again.

Others have made the point that a DataSet is not intended to serve as a database system--just a representation of data. If you are working under the impression that a DataSet is a database then you are mistaken and might need to reconsider your implementation.
If you need a client-side database, consider using SQL Server Compact or SQLite; both are free, redistributable database systems which can be used without requiring separate installations or services. If you need something more full-featured, then SQL Server Express is the next step up.
To help clarify though, DataSets/Tables are used in .NET development to temporarily hold data as needed. Think of them as the results of a SELECT query against a database; they are roughly similar to CSV files or other forms of tabular data--you can pull data into them from a database, work with the data, and then push the changes back to a database--but they, on their own, are not databases.
If you have a large collection of items which you need to keep in memory for one reason or another, then you might consider building a lightweight DTO (data transfer object, Google it, they're very simple) and loading the items into a Hashtable. Hashtables won't give you any form of relational data, but they are very efficient at look-ups.
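A minimal sketch of the DTO idea follows. It swaps in the generic Dictionary<TKey, TValue> for the non-generic Hashtable mentioned above, purely to get typed keys and values; the CustomerDto class, its fields, and the BuildIndex helper are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical DTO: a plain class carrying only the fields that are needed.
public sealed class CustomerDto
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class DtoLookupSketch
{
    // Indexes the DTOs by Id so that each lookup is a single hash probe.
    public static Dictionary<int, CustomerDto> BuildIndex(IEnumerable<CustomerDto> items)
    {
        var byId = new Dictionary<int, CustomerDto>();
        foreach (CustomerDto item in items)
            byId[item.Id] = item;   // the last item wins if an Id repeats
        return byId;
    }

    public static void Main()
    {
        var index = BuildIndex(new[]
        {
            new CustomerDto { Id = 1, Name = "Ada" },
            new CustomerDto { Id = 2, Name = "Grace" },
        });
        if (index.TryGetValue(2, out CustomerDto found))
            Console.WriteLine(found.Name);
    }
}
```

The trade-off is the one already noted: the dictionary serves exactly one key, so each additional lookup column needs its own dictionary.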

DataTables have a PrimaryKey field that can serve as an index (they are fast already anyway). This field is not copied from the Primary Keys of the database (although that might be nice).

My reading of the docs is that the correct way to achieve this (if needed) is to use AsDataView to produce a DataView (or LinqDataView) that's bound to the underlying table. If your DataTable is invariant, then the DataView can be static to avoid redundant re-indexing.
I am currently investigating LINQ to DataSet, and this question was helpful to me, so thanks.

DataTables are indexed if you (the coder) specify one or more DataColumns as the Primary Key. Internally, ADO.NET uses a Red-Black tree to form this index, giving log-time lookups. This Primary Key is not set automatically based on any underlying keying from the data provider.
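Setting the key and looking a row up by it might look like the sketch below. The table name, the Id and Author columns, and the MakeTable helper are made up for the example; PrimaryKey and Rows.Find are the standard DataTable members.

```csharp
using System;
using System.Data;

public static class PrimaryKeySketch
{
    // Builds a hypothetical table and declares Id as its primary key.
    public static DataTable MakeTable()
    {
        var t = new DataTable("Changes");
        DataColumn id = t.Columns.Add("Id", typeof(int));
        t.Columns.Add("Author", typeof(string));
        t.PrimaryKey = new[] { id };     // must be set by the coder; it is not inferred
        t.Rows.Add(1, "alice");
        t.Rows.Add(2, "bob");
        return t;
    }

    public static void Main()
    {
        DataTable table = MakeTable();
        DataRow row = table.Rows.Find(2);   // keyed lookup; null when no row matches
        Console.WriteLine(row == null ? "not found" : (string)row["Author"]);
    }
}
```

Rows.Find throws MissingPrimaryKeyException when the table has no primary key, so the key has to be declared before the first lookup.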

George,
The answer is no.
Actually, some sort of indexing may be used internally, but only as an implementation detail. For instance, if you create a foreign key constraint, maybe that's assisted by an index. But it doesn't matter to a developer.

Datatable vs Dataset

I currently use a DataTable to get results from a database which I can use in my code.
However, many examples on the web show using a DataSet instead and accessing the table(s) through the Tables collection.
Is there any advantage, performance wise or otherwise, of using DataSets or DataTables as a storage method for SQL results?

It really depends on the sort of data you're bringing back. Since a DataSet is (in effect) just a collection of DataTable objects, you can return multiple distinct sets of data into a single, and therefore more manageable, object.
Performance-wise, you're more likely to get inefficiency from unoptimized queries than from the "wrong" choice of .NET construct. At least, that's been my experience.

One major difference is that DataSets can hold multiple tables and you can define relationships between those tables.
If you are only returning a single result set though I would think a DataTable would be more optimized. I would think there has to be some overhead (granted small) to offer the functionality a DataSet does and keep track of multiple DataTables.

In 1.x there used to be things DataTables couldn't do which DataSets could (I don't remember exactly what). All that was changed in 2.x. My guess is that's why a lot of examples still use DataSets. DataTables should be quicker as they are more lightweight. If you're only pulling a single result set, it's your best choice between the two.

One feature of the DataSet is that if you call multiple select statements in your stored procedure, the DataSet will have one DataTable for each.

There are some optimizations you can use when filling a DataTable, such as calling BeginLoadData(), inserting the data, then calling EndLoadData(). This turns off some internal behavior within the DataTable, such as index maintenance, etc. See this article for further details.
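The loading pattern described above can be sketched as follows. The Bulk table, its two columns, and the Load helper are assumptions for illustration; how much time this actually saves depends on the constraints and indexes the table carries, and nothing is measured here.

```csharp
using System;
using System.Data;

public static class BulkLoadSketch
{
    // Fills a hypothetical two-column table inside a BeginLoadData/EndLoadData pair.
    public static DataTable Load(int rowCount)
    {
        var t = new DataTable("Bulk");
        t.Columns.Add("Id", typeof(int));
        t.Columns.Add("Value", typeof(string));

        t.BeginLoadData();               // suspends notifications, index maintenance and constraints
        try
        {
            for (int i = 0; i < rowCount; i++)
                t.LoadDataRow(new object[] { i, "v" + i }, true);  // true = accept changes immediately
        }
        finally
        {
            t.EndLoadData();             // re-enables them, even if loading threw
        }
        return t;
    }

    public static void Main()
    {
        DataTable table = Load(1000);
        Console.WriteLine(table.Rows.Count);
    }
}
```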

When you are only dealing with a single table anyway, the biggest practical difference I have found is that DataSet has a "HasChanges" method but DataTable does not. Both have a "GetChanges" however, so you can use that and test for null.
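The null test on GetChanges can be wrapped in a small helper, sketched here. The TableHasChanges name and the sample table are hypothetical. Note that GetChanges builds a copy of the changed rows, so it is likely to cost more than DataSet.HasChanges on large tables.

```csharp
using System;
using System.Data;

public static class ChangeCheckSketch
{
    // DataTable has no HasChanges, so test whether GetChanges returns anything.
    public static bool TableHasChanges(DataTable table)
    {
        return table.GetChanges() != null;   // GetChanges returns null when nothing changed
    }

    // Hypothetical one-row table whose changes have already been accepted.
    public static DataTable MakeCleanTable()
    {
        var t = new DataTable("Items");
        t.Columns.Add("Name", typeof(string));
        t.Rows.Add("original");
        t.AcceptChanges();                   // marks every row as Unchanged
        return t;
    }

    public static void Main()
    {
        DataTable table = MakeCleanTable();
        Console.WriteLine(TableHasChanges(table));   // nothing pending yet
        table.Rows[0]["Name"] = "edited";
        Console.WriteLine(TableHasChanges(table));   // one modified row is pending
    }
}
```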

A DataTable object represents tabular data as an in-memory, tabular cache of rows, columns, and constraints.
The DataSet consists of a collection of DataTable objects that you can relate to each other with DataRelation objects.
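To make the DataRelation part concrete, here is a small sketch with two related tables. The Authors and Books tables, their columns, the relation name AuthorBooks, and the Build helper are all made up for the example.

```csharp
using System;
using System.Data;

public static class RelationSketch
{
    // Builds a DataSet with two hypothetical tables linked on AuthorId.
    public static DataSet Build()
    {
        var ds = new DataSet("Library");

        DataTable authors = ds.Tables.Add("Authors");
        authors.Columns.Add("AuthorId", typeof(int));
        authors.Columns.Add("Name", typeof(string));

        DataTable books = ds.Tables.Add("Books");
        books.Columns.Add("AuthorId", typeof(int));
        books.Columns.Add("Title", typeof(string));

        // Parent column first, then child column.
        ds.Relations.Add("AuthorBooks", authors.Columns["AuthorId"], books.Columns["AuthorId"]);

        authors.Rows.Add(1, "Ada");          // parents go in before their children
        books.Rows.Add(1, "Notes");
        books.Rows.Add(1, "Sketches");
        return ds;
    }

    public static void Main()
    {
        DataSet ds = Build();
        DataRow ada = ds.Tables["Authors"].Rows[0];
        foreach (DataRow book in ada.GetChildRows("AuthorBooks"))
            Console.WriteLine($"{ada["Name"]}: {book["Title"]}");
    }
}
```

This navigation between related rows is what a lone DataTable cannot offer, which is the practical reason to reach for a DataSet when several result sets belong together.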
