Suppose we have the following situation:
We have 2-3 tables in a database with a huge amount of data (say 50-100 million records) and we want to add 2,000 new records. But before adding them we need to check the DB for duplicates: if any of these 2,000 records already exist in the DB, we should ignore them. To find out whether a new record is a duplicate or not, we need information from both tables (for example, we need to do a left join).
The idea of the solution is: one task or thread creates suitable data for comparison and pushes it into a queue (in batches, not record by record), so our queue (or ConcurrentQueue) is a global variable. A second thread takes a batch from the queue and looks through it. But there's a problem - memory keeps growing...
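Roughly, the setup is shaped like this (a simplified sketch only; BatchRecord, LoadNextBatch and CheckForDuplicates are placeholders for the real record type, the joined load and the comparison logic, and the queue here is a BlockingCollection, which wraps a ConcurrentQueue and can be bounded):

// Simplified sketch of the producer/consumer batching described above.
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class DuplicateChecker
{
    // A bounded queue (capacity 4 batches here) keeps the producer from
    // outrunning the consumer, so only a few batches are in memory at once.
    private readonly BlockingCollection<List<BatchRecord>> _queue =
        new BlockingCollection<List<BatchRecord>>(4);

    public void Run()
    {
        var producer = Task.Run(() =>
        {
            List<BatchRecord> batch;
            while ((batch = LoadNextBatch()) != null)  // reads the joined data in batches
                _queue.Add(batch);                     // blocks when the queue is full
            _queue.CompleteAdding();
        });

        var consumer = Task.Run(() =>
        {
            foreach (var batch in _queue.GetConsumingEnumerable())
            {
                CheckForDuplicates(batch);
                // After this iteration nothing references the batch any more,
                // so the GC can reclaim it; no explicit cleanup is needed.
            }
        });

        Task.WaitAll(producer, consumer);
    }

    private List<BatchRecord> LoadNextBatch() { return null; }        // placeholder
    private void CheckForDuplicates(List<BatchRecord> batch) { }      // placeholder
}

class BatchRecord { }                                                  // placeholder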
How can I free memory after I've looked through a batch?
P.S. If somebody has another idea of how to optimize this process, please describe it...
This is not a specific answer to the question you are asking, because what you are asking doesn't really make sense to me.
If you are looking to update specific rows:
INSERT INTO tablename (UniqueKey,columnname1, columnname2, etc...)
VALUES (UniqueKeyValue,value1,value2, etc....)
ON DUPLICATE KEY
UPDATE columnname1=value1, columnname2=value2, etc...
If not, simply ignore/remove the update statement.
This would be darn fast, considering it would use the unique index of whatever field you want to be unique, and just do an insert or update. There is no need to validate against a separate table or anything.
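For example, assuming a MySQL database (where ON DUPLICATE KEY UPDATE is supported), a unique index on UniqueKey, and the MySql.Data connector, running the statement from C# might look roughly like this (table and column names are placeholders):

using MySql.Data.MySqlClient;

static void Upsert(MySqlConnection conn, int key, string value1, string value2)
{
    const string sql =
        "INSERT INTO tablename (UniqueKey, columnname1, columnname2) " +
        "VALUES (@key, @v1, @v2) " +
        "ON DUPLICATE KEY UPDATE columnname1 = @v1, columnname2 = @v2";

    using (var cmd = new MySqlCommand(sql, conn))
    {
        cmd.Parameters.AddWithValue("@key", key);
        cmd.Parameters.AddWithValue("@v1", value1);
        cmd.Parameters.AddWithValue("@v2", value2);
        cmd.ExecuteNonQuery();   // inserts, or updates the existing row on a key collision
    }
}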
I'm creating a HashMap mapping the ID field of a row in a DataTable to the row itself, to improve lookup time for some frequently accessed tables. Now, from time to time, I'm getting the RowNotInTableException:
This row has been removed from a table and does not have any data. BeginEdit() will allow creation of new data in this row.
After looking around the net a bit, it seems that DataRows don't like not being attached to a DataTable. Even though the DataTable stays in memory (not sure if the DataRows keep a reference to it, but I'm definitely still caching it anyway), is it possible I'm breaking something by keeping those rows all isolated in a HashMap? What other reason can there be for this error? This post
RowNotInTableException when accessing second time
discusses a similar problem but there's no solution either.
UPDATE
I'm actually storing DataRowViews if that makes any difference.
A DataRow should always be attached to some DataTable. Even if it is removed from the DataTable, the row still holds a reference to the table.
The reason is that the schema of the table lives in the DataTable, not in the DataRow (and so does the data itself).
If you want fast lookups without DataTables, use a structure of your own instead of DataRow.
I am using VSTS 2008 + C# + .Net 3.5 + SQL Server 2008 + ADO.Net. Suppose I load a table from the database into an ADO.Net DataTable, and on the database table I have defined a couple of indexes. My question is: does the ADO.Net DataTable have corresponding indexes (the same as the ones I created on the physical database table) to improve the performance of certain operations on the DataTable?
thanks in advance,
George
Actually George's question is not as "bad" as some people insist it is. (I am more and more convinced that there's no such thing as "a bad question".)
I have a rather big table which I load into memory, in a DataTable object. A lot of processing is done on rows from this table, many times, on various (and different) subsets which I can easily describe as the "WHERE ..." part of SELECT clauses. Now with this DataTable I can run Select() - a method of the DataTable class - but it is quite inefficient.
In the end, I decided to load the DataTable sorted by specific columns and implemented my own quick search, instead of using the Select() function. It proved to be much faster, but of course it works only on those sorted columns. The trouble would have been avoided had DataTable had indexes.
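As an illustration (not the original code), a binary search over a DataTable loaded pre-sorted on one column might look like this (the column name "CustomerId" is made up):

using System.Data;

// Assumes the rows were loaded already sorted ascending by "CustomerId".
static DataRow FindByCustomerId(DataTable table, int customerId)
{
    int lo = 0, hi = table.Rows.Count - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        int value = (int)table.Rows[mid]["CustomerId"];

        if (value == customerId) return table.Rows[mid];
        if (value < customerId) lo = mid + 1;
        else hi = mid - 1;
    }
    return null;   // not found
}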
No, but possibly yes.
You can set up your own indices on a DataTable, using a DataView. As you change the table, the DataView will be rebuilt, so the index should always be up to date.
I did some bench tests for my own app. I use a DataTable to approximate a Boost MultiIndexContainer. To create an index on a column called "Author", I initialise the DataTable and then the DataView...
_dvChangesByAuthor = new DataView(
    _dtChanges,                      // the DataTable being indexed
    string.Empty,                    // no row filter
    "Author ASC",                    // the sort builds the index on Author
    DataViewRowState.CurrentRows);
To then pull data by Author from the table, you use the view's FindRows function...
// FindRows uses the view's sort key as an index to locate the matching rows.
DataRowView[] dataRowViews = _dvChangesByAuthor.FindRows(author);

List<DataRow> returnRows = new List<DataRow>();
foreach (DataRowView drv in dataRowViews)
{
    returnRows.Add(drv.Row);
}
I made a random large DataTable, and ran queries using DataTable.Select(), Linq-To-DataSet (with forced execution by exporting to list) and the above DataView method. The DataView method won easily. Linq took 5000 ticks, Select took over 26000 ticks, DataView took 192 ticks...
LOC=20141121-14:46:32.863,UTC=20141121-14:46:32.863,DELTA=72718,THR=9,DEBUG,LOG=Program,volumeTest() - Running queries for author >TFYN_AUTHOR_047<
LOC=20141121-14:46:32.863,UTC=20141121-14:46:32.863,DELTA=72718,THR=9,DEBUG,LOG=RightsChangeTracker,GetChangesByAuthorUsingLinqToDataset() - Query elapsed time: 2 ms, 4934 ticks; Rows=65
LOC=20141121-14:46:32.879,UTC=20141121-14:46:32.879,DELTA=72733,THR=9,DEBUG,LOG=RightsChangeTracker,GetChangesByAuthorUsingSelect() - Query elapsed time: 11 ms, 26575 ticks; Rows=65
LOC=20141121-14:46:32.879,UTC=20141121-14:46:32.879,DELTA=72733,THR=9,DEBUG,LOG=RightsChangeTracker,GetChangesByAuthorUsingDataview() - Query elapsed time: 0 ms, 192 ticks; Rows=65
So, if you want indices on a DataTable, I would suggest a DataView, if you can live with the fact that the index is rebuilt whenever the data changes.
You can create a primary key for the DataTable. Filter operations get a big boost if you are searching on the primary key field. Check out this link: here
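For example, a minimal sketch (the table and column names are invented):

using System.Data;

DataTable orders = new DataTable();
orders.Columns.Add("OrderId", typeof(int));
orders.Columns.Add("Customer", typeof(string));

// Setting the primary key makes ADO.NET maintain an internal index on that column.
orders.PrimaryKey = new[] { orders.Columns["OrderId"] };

orders.Rows.Add(1, "Alice");
orders.Rows.Add(2, "Bob");

// Rows.Find uses the primary-key index rather than scanning every row.
DataRow hit = orders.Rows.Find(2);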
I had the same problem with many queries on a large DataTable that were not against the primary key.
The solution I found was to create a DataView for each index I wanted to use, and then use its Find and FindRows methods to extract the data.
A DataView creates an internal index on the DataTable and behaves virtually as an index for this purpose.
In my case I was able to reduce 10,000 queries from 40 seconds to ONE!!!
John above is correct. DataTables are disconnected in-memory structures. They do not map to the physical implementation of the database.
The indexes on disk are used to speed up lookups because you don't have all the rows. If you have to load every row and scan them it is slow, so an index makes sense. In a DataTable you already have all the rows, so a comparison is fast already.
The correct answer here to the implicit question of creating an index on a DataTable is that you can't do that, but you can create one or more DataViews for the DataTable, which according to the doc will create an index based on the sorting the DataView specifies:
DataView constructs an index. An index contains keys built from one or more columns in the table or view. These keys are stored in a structure that enables the DataView to find the row or rows associated with the key values quickly and efficiently. Operations that use the index, such as filtering and sorting, see significant performance increases. The index for a DataView is built both when the DataView is created and when any of the sorting or filtering information is modified. Creating a DataView and then setting the sorting or filtering information later causes the index to be built at least twice: once when the DataView is created, and again when any of the sort or filter properties are modified.
If you need to do a large number of lookups to an in-memory DataTable, it may be the most straightforward and performant to use a DataView with the Find() or FindRows() method to do indexed key lookups. In particular, if you need to do a number of lookups and modifications to the data this would prevent needing to transform your DataTable into another indexed class like a Dictionary and then transforming it back into a DataTable again.
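For instance, a sketch of a keyed lookup plus an in-place edit through a DataView (the column names and values here are illustrative):

using System.Data;

static void MarkReviewed(DataTable table, string author)
{
    // The sort on Author is what builds the view's index.
    var byAuthor = new DataView(table, string.Empty, "Author ASC",
                                DataViewRowState.CurrentRows);

    // Indexed lookup of all rows for one author, then modification through
    // the underlying DataRow, so the DataTable itself stays current.
    foreach (DataRowView drv in byAuthor.FindRows(author))
    {
        drv.Row["Reviewed"] = true;
    }
}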
Others have made the point that a DataSet is not intended to serve as a database system--just a representation of data. If you are working under the impression that a DataSet is a database then you are mistaken and might need to reconsider your implementation.
If you need a client-side database, consider using SQL Compact or SQLite; both are free, redistributable database systems which can be used without requiring separate installations or services. If you need something more full-featured, then SQL Express is the next step up.
To help clarify though, DataSets/Tables are used in .NET development to temporarily hold data as needed. Think of them as the results of a SELECT query against a database; they are roughly similar to CSV files or other forms of tabular data--you can pull data into them from a database, work with the data, and then push the changes back to a database--but they, on their own, are not databases.
If you have a large collection of items which you need to keep in memory for one reason or another then you might consider building a lightweight DTO (data transfer object, Google it, they're very simple) and loading them into a HashTable. HashTables won't give you any form of relational data, but are very efficient at look-ups.
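For example, a minimal DTO plus a generic Dictionary (the typed equivalent of a Hashtable); the names here are invented:

using System.Collections.Generic;

// A bare-bones DTO: only the fields you actually need in memory.
class CustomerDto
{
    public int Id { get; set; }
    public string Name { get; set; }
}

class CustomerCache
{
    // Keyed lookups are O(1) on average; there is no relational behaviour,
    // just a fast in-memory map from key to object.
    private readonly Dictionary<int, CustomerDto> _byId =
        new Dictionary<int, CustomerDto>();

    public void Add(CustomerDto dto) { _byId[dto.Id] = dto; }

    public CustomerDto Find(int id)
    {
        CustomerDto dto;
        return _byId.TryGetValue(id, out dto) ? dto : null;
    }
}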
DataTables have a PrimaryKey property that can serve as an index (they are fast already anyway). This property is not populated from the primary keys of the database (although that might be nice).
My reading of the docs is that the correct way to achieve this (if needed) is to use AsDataView to produce a DataView (or LinqDataView) that's bound to the underlying table. If your DataTable is invariant then the DataView can be static to avoid redundant re-indexing.
I am currently investigating Linq to DataSet, and this question was helpful to me, so thanks.
DataTables are indexed if you (the coder) specify one or more DataColumns as the Primary Key. Internally ADO.NET uses a red-black tree to form this index, giving log-time lookups. This Primary Key is not set automatically based on any underlying keying from the data provider.
George,
The answer is no.
Actually, some sort of indexing may be used internally, but only as an implementation detail. For instance, if you create a foreign key constraint, maybe that's assisted by an index. But it doesn't matter to a developer.
I want to create a simple class that is similar to a datatable, but without the overhead.
So I would load the object with a SqlDataReader, and then return this custom DataTable-like object that gives me access to the rows and columns like:
myObject[rowID]["columnname"]
How would you go about creating such an object?
I don't want any built in methods/behavior for this object except for accessing the rows and columns of the data.
Update:
I don't want a DataTable; I want something much leaner (plus I want to learn how to create such an object).
This type of structure can be easily created with a type signature of:
List<Dictionary<string, object>>
This will allow access as you specify and should be pretty easy to populate.
You can always create an object that inherits from List<Dictionary<string, object>> and implements a constructor that takes a SqlDataReader. This constructor should create a new dictionary for each row, and insert a new entry into the dictionary for each column, using the column name as the key.
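A sketch of what that could look like (untested; assumes a SqlDataReader that has not yet been read from):

using System.Collections.Generic;
using System.Data.SqlClient;

// One dictionary per row, keyed by column name, as described above.
class SimpleTable : List<Dictionary<string, object>>
{
    public SimpleTable(SqlDataReader reader)
    {
        while (reader.Read())
        {
            var row = new Dictionary<string, object>(reader.FieldCount);
            for (int i = 0; i < reader.FieldCount; i++)
            {
                row[reader.GetName(i)] = reader.GetValue(i);
            }
            Add(row);
        }
    }
}

// Usage, matching the syntax asked for in the question:
//   var table = new SimpleTable(reader);
//   object value = table[rowId]["columnname"];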
I think you're missing something about how .Net works. The extra overhead involved in a DataTable is not significant. Can you point to a specific performance problem in existing code that you believe is caused by a DataTable? Perhaps we can help correct that in a more elegant way.
Perhaps the specific thing you're asking about is how to use the convenient ["whatever"] indexing syntax in your own table object.
If so, I suggest you refer to this MSDN page on indexers.
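As a rough illustration of that indexer syntax (the class and member names are made up):

using System.Collections.Generic;

// Minimal illustration of the ["..."] syntax via a C# indexer.
class SimpleRow
{
    private readonly Dictionary<string, object> _values =
        new Dictionary<string, object>();

    // row["columnname"] reads or writes the stored value for that column.
    public object this[string columnName]
    {
        get { return _values[columnName]; }
        set { _values[columnName] = value; }
    }
}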
Dictionary<int,object[]> would be better than List<Dictionary<string, object>>. You don't really need a dictionary for each row, since column names are the same for all rows. And if you want to have it lightweight, you should use column indexes instead of names.
So if you have a column "Name" that is the 3rd column, then to get its value from the row with ID 10, the code would be:
object val = table[10][2];
Another option is SortedList<int,object[]>... depending on the way you access the data (forward only or random access).
You could also use MultiDictionary<int,object> from PowerCollections.
From the memory usage perspective, I think the best option would be to use a single-dimension array with some slack capacity. So after every, say, 100 rows, you would create a new array, copy the old contents into it, and leave 100 empty rows at the end. But you would have to keep some sort of an index when you delete a row, so that it is marked as deleted without resizing the array.
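A rough sketch of that grow-in-chunks / soft-delete idea (all names here are invented):

using System;

class RowStore
{
    private const int Chunk = 100;
    private object[][] _rows = new object[Chunk][];   // each element is one row
    private bool[] _deleted = new bool[Chunk];        // soft-delete flags
    private int _count;

    public void Add(object[] row)
    {
        if (_count == _rows.Length)
        {
            // Out of slack: grow both arrays by another chunk and copy the contents.
            Array.Resize(ref _rows, _rows.Length + Chunk);
            Array.Resize(ref _deleted, _deleted.Length + Chunk);
        }
        _rows[_count++] = row;
    }

    // Mark a row as deleted without resizing or shifting anything.
    public void Delete(int index) { _deleted[index] = true; }

    public object[] this[int index]
    {
        get { return _deleted[index] ? null : _rows[index]; }
    }
}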
Isn't this a DataSet/DataTable? Maybe I didn't get the question.
Also, what is the programming language?
I am inserting a column in a DataGridView programmatically (i.e., not bound to any data tables/databases) as follows:
int lastIndex = m_DGV.Columns.Count - 1; // Count = 4 in this case
DataGridViewTextBoxColumn col = (DataGridViewTextBoxColumn)m_DGV.Columns[lastIndex];
m_DGV.Columns.RemoveAt(lastIndex);
m_DGV.Columns.Insert(insertIndex, col); // insertIndex = 2
I have found that my columns are visually out of order sometimes using this method. A workaround is to manually set the DisplayIndex property of the column afterwards. Adding this code "fixes it", but I don't understand why it behaves this way.
Console.Write(m_DGV.Columns[0].DisplayIndex); // Has value of 0
Console.Write(m_DGV.Columns[1].DisplayIndex); // Has value of 1
Console.Write(m_DGV.Columns[2].DisplayIndex); // Has value of 3
Console.Write(m_DGV.Columns[3].DisplayIndex); // Has value of 2
col.DisplayIndex = insertIndex;
Console.Write(m_DGV.Columns[0].DisplayIndex); // Has value of 0
Console.Write(m_DGV.Columns[1].DisplayIndex); // Has value of 1
Console.Write(m_DGV.Columns[2].DisplayIndex); // Has value of 2
Console.Write(m_DGV.Columns[3].DisplayIndex); // Has value of 3
As an aside, my grid can grow its column count dynamically. I wanted to grow it in chunks, so each insert didn't require a column allocation (and associated initialization). Each "new" column would then be added by grabbing an unused column from the end, inserting it into the desired position, and making it visible.
I suspect this is because the order of the columns in the DataGridView does not necessarily dictate the display order, though without being explicitly assigned, by default the order of the columns dictates the DisplayIndex property values. That is why there is a DisplayIndex property: so you may add columns to the collection without performing Inserts - you just need to specify the DisplayIndex value and a cascade update occurs for everything with an equal or greater DisplayIndex. It appears from your example that the inserted column is also receiving the first skipped DisplayIndex value.
From a question/answer I found:
Changing the DisplayIndex will cause all the columns between the old DisplayIndex and the new DisplayIndex to be shifted.
As with nearly all collections (other than linked lists), it's always better to add to a collection than to insert into a collection. The behavior you are seeing is a reflection of that rule.
I have a couple of ideas.
How about addressing your columns by a unique name, rather than the index in the collection? They might not already have a name, but you could keep track of who's who if you gave them a name that meant something.
You can use the GetFirstColumn, GetNextColumn, GetPreviousColumn, GetLastColumn methods of the DataGridViewColumnCollection class, which work on display order, not the order in the collection. You can also just iterate through the collection using a for loop and m_DGV.Columns[i] until you find the one you want.
Create an inherited DataGridView and DataGridViewColumnCollection. The DataGridView is simply overridden to use your new collection class. Your new DataGridViewColumnCollection will include a method to address the collection by display index, presumably by iterating through the collection until you find the one you want (see #2). Or you can keep a dictionary and keep it updated for very large numbers of columns.
I doubt the performance increase of keeping a dictionary, since every time a column moves, you essentially have to rewrite the entire thing. Iterating through is O(n) anyway, and unless you're talking asynchronous operations with hundreds of columns, you're probably okay.
You might be able to override the this[] operator as well, assuming it doesn't screw up the DataGridView.
Idea #1 might be the easiest to implement, but not necessarily the prettiest. Idea #2 works, and you can put it in a function DataGridViewColumn GetColumnByDisplayIndex(int Index). Idea #3 is cute, and certainly the most encapsulated approach, but isn't exactly trivial.
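A sketch of that GetColumnByDisplayIndex helper, written here as a standalone method that takes the DataGridView as a parameter:

using System.Windows.Forms;

// Linear scan over the collection, matching on DisplayIndex rather than
// on the column's position in the collection.
static DataGridViewColumn GetColumnByDisplayIndex(DataGridView dgv, int displayIndex)
{
    for (int i = 0; i < dgv.Columns.Count; i++)
    {
        if (dgv.Columns[i].DisplayIndex == displayIndex)
            return dgv.Columns[i];
    }
    return null;   // no column has that display index
}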
Thanks to cfeduke for excellent advice. I suspected Insert would be slower, but the provided link enlightened me on JUST HOW MUCH slower.
This brings up the question of how to efficiently insert and remove columns dynamically on a DataGridView. It looks like the ideal design would be to add plenty of columns using Add or AddRange, and then never really remove them. You could then simulate removal by setting the Visible property to false. And you could insert a column by grabbing an invisible column, setting its DisplayIndex and making it visible.
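Something along these lines, perhaps (an untested sketch against a DataGridView named m_DGV, as in the original code):

// "Remove" a column by hiding it instead of deleting it.
void HideColumn(int index)
{
    m_DGV.Columns[index].Visible = false;
}

// "Insert" a column by reusing a hidden spare: position it via DisplayIndex
// and make it visible again, so no columns are allocated or destroyed.
void ShowColumnAt(DataGridViewColumn spareColumn, int displayIndex)
{
    spareColumn.DisplayIndex = displayIndex;
    spareColumn.Visible = true;
}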
However, I suspect there would be landmines to avoid with this approach. Foremost being that you can no longer index your data in a straightforward manner. That is, m_DGV.Columns[i] and m_DGV.Rows[n].Cells[i] will not be mapped properly. I suppose you could create a Map/Dictionary to maintain an external intuitive mapping.
Since my application (as currently designed) requires frequent column insertion and removal it might be worth it. Anyone have any suggestions?