I'm developing an ASP.NET app that analyzes Excel files uploaded by users. The files contain various data about customers (one row = one customer); the key field is CustomerCode. Basically, the data arrives in the form of a DataTable object.
At some point I need to get information about the specified customers from SQL and compare it to what user uploaded. I'm doing it the following way:
Make a comma-separated list of customers from the CustomerCode column: 'Customer1','Customer2',...,'CustomerN'.
Pass this string into the SQL query's IN (...) clause and execute it.
This was working okay until I ran into a "The query processor ran out of internal resources and could not produce a query plan" exception when trying to pass ~40,000 items inside the IN (...) clause.
The trivial way seems to be:
Replace IN (...) with = 'SomeCustomerCode' in the query template.
Execute this query 40,000 times, once for each CustomerCode.
Do DataTable.Merge 40,000 times.
Is there any better way to work around this problem?
Note: I can't do IN (SELECT CustomerCode FROM ... WHERE SomeConditions) because the data comes from Excel files and thus cannot be queried from DB.
"Table valued parameters" would be worth investigating, which let you pass in (usually via a DataTable on the C# side) multiple rows - the downside is that you need to formally declare and name the data shape on the SQL server first.
Alternatively, you could use SqlBulkCopy to throw the rows into a staging table and then just JOIN to that table. If you have parallel callers, you will need some kind of session identifier on each row to distinguish between concurrent uses (and don't forget to remove your session's data afterwards).
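A rough sketch of the staging-table route, assuming a staging table such as dbo.CustomerCodeStaging(SessionId, CustomerCode) exists; all names are placeholders:
Guid sessionId = Guid.NewGuid();

DataTable staging = new DataTable();
staging.Columns.Add("SessionId", typeof(Guid));
staging.Columns.Add("CustomerCode", typeof(string));
foreach (DataRow row in uploadedTable.Rows)
{
    staging.Rows.Add(sessionId, row["CustomerCode"]);
}

using (SqlConnection conn = new SqlConnection(connectionString))
{
    conn.Open();

    // Bulk-load the codes for this session into the staging table.
    using (SqlBulkCopy bulk = new SqlBulkCopy(conn))
    {
        bulk.DestinationTableName = "dbo.CustomerCodeStaging";
        bulk.ColumnMappings.Add("SessionId", "SessionId");
        bulk.ColumnMappings.Add("CustomerCode", "CustomerCode");
        bulk.WriteToServer(staging);
    }

    // JOIN against the staged codes, then remove this session's rows.
    SqlCommand cmd = new SqlCommand(
        @"SELECT c.*
          FROM dbo.Customers c
          JOIN dbo.CustomerCodeStaging s ON s.CustomerCode = c.CustomerCode
          WHERE s.SessionId = @session;
          DELETE FROM dbo.CustomerCodeStaging WHERE SessionId = @session;", conn);
    cmd.Parameters.AddWithValue("@session", sessionId);

    DataTable dbCustomers = new DataTable();
    new SqlDataAdapter(cmd).Fill(dbCustomers);
}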
You shouldn't process too many records at once: you hit errors like the one you mentioned, and such a big batch takes so long to run that you can't do anything in parallel. You shouldn't process only one record at a time either, because then the overhead of the SQL Server round trips becomes too big. Choose something in the middle and process, say, 10,000 records at a time. You can even parallelize the processing: start running the SQL for the next 10,000 in the background while you are still processing the previous batch.
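A sketch of that batching idea; QueryCustomers is a hypothetical helper that runs your existing IN (...) query for one batch, and 10,000 is just an example size:
const int batchSize = 10000;

List<string> allCodes = new List<string>();
foreach (DataRow row in uploadedTable.Rows)
{
    allCodes.Add((string)row["CustomerCode"]);
}

DataTable merged = new DataTable();
for (int i = 0; i < allCodes.Count; i += batchSize)
{
    List<string> batch = allCodes.GetRange(i, Math.Min(batchSize, allCodes.Count - i));
    DataTable part = QueryCustomers(batch);   // hypothetical: builds and runs the IN (...) query for this batch
    merged.Merge(part);
}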
I am working on a C# application, which loads data from a MS SQL 2008 or 2008 R2 database. The table looks something like this:
ID | binary_data | Timestamp
I need to get only the last entry, and only its binary data. Entries are added to this table irregularly by another program, so I have no way of knowing whether there is a new entry.
Which version is better (performance etc.) and why?
// Option 1: always a query, which might not be needed
public void ProcessData()
{
    byte[] data = QueryLatestBinaryData();   // pseudocode: query the DB for the latest binary_data
}
vs
// Option 2: always a smaller check query, and sometimes two queries
public void ProcessData()
{
    DateTime timestamp = QueryLatestTimestamp();   // pseudocode: query the DB for the latest Timestamp
    if (timestamp > old_timestamp)
    {
        data = QueryLatestBinaryData();
    }
}
The binary_data field will be around 30 kB. The "ProcessData" function will be called several times per minute, but sometimes it can be called every 1-2 seconds. This is only a small part of a bigger program with lots of threading/database access, so I want the "lightest" solution. Thanks.
Luckily, you can have both:
SELECT TOP 1 binary_data
FROM myTable
WHERE Timestamp > @last_timestamp
ORDER BY Timestamp DESC
If there is no record newer than @last_timestamp, nothing is returned and thus no data transmission takes place (= fast). If there are new records, the binary data of the newest one is returned immediately (= no need for a second query).
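A minimal sketch of calling that query from C# (connection string and variable names are illustrative):
byte[] data = null;

using (SqlConnection conn = new SqlConnection(connectionString))
using (SqlCommand cmd = new SqlCommand(
    @"SELECT TOP 1 binary_data
      FROM myTable
      WHERE Timestamp > @last_timestamp
      ORDER BY Timestamp DESC", conn))
{
    cmd.Parameters.AddWithValue("@last_timestamp", lastTimestamp);
    conn.Open();

    object result = cmd.ExecuteScalar();   // null when nothing newer exists
    if (result != null)
    {
        data = (byte[])result;
        // In practice you would also select the row's Timestamp and store it
        // as the new lastTimestamp for the next call.
    }
}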
I would suggest you test both methods, as the answer depends on your usage. Simulate some expected behaviour.
I would say, though, that you are probably okay just doing the first query. Do what works. Don't prematurely optimise; if the single query turns out to be too slow, try your second, two-query approach.
A two-step approach is more efficient from the point of view of the overall workload of the system:
Get informed that you need to query new data
Query new data
There are several ways to implement this approach. Here are a pair of them.
Using Query Notifications, which is built-in SQL Server functionality supported in .NET (see the sketch after this list).
Using an indirect method of getting informed of database table updates, e.g. the one described in this article on the SQL Authority blog.
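For reference, a rough sketch of the first option using SqlDependency; it assumes Service Broker is enabled on the database and a notification-compatible query (explicit column list, two-part table name):
SqlDependency.Start(connectionString);   // once per AppDomain

using (SqlConnection conn = new SqlConnection(connectionString))
using (SqlCommand cmd = new SqlCommand(
    "SELECT ID, binary_data, Timestamp FROM dbo.myTable", conn))
{
    SqlDependency dependency = new SqlDependency(cmd);
    dependency.OnChange += delegate(object sender, SqlNotificationEventArgs e)
    {
        // Fires once when the result set changes: re-run the query and re-subscribe here.
    };

    conn.Open();
    using (SqlDataReader reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // Consume the current data; the notification covers later changes.
        }
    }
}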
I think the better path is a stored procedure that keeps the logic inside the database: something with an output parameter carrying the required data and a return value like TRUE/FALSE to signal the presence of new data.
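A hypothetical sketch of that idea from the C# side; the procedure name, parameters, and the 1/0 return convention are assumptions, not an existing API:
using (SqlConnection conn = new SqlConnection(connectionString))
using (SqlCommand cmd = new SqlCommand("dbo.GetLatestBinaryData", conn))
{
    cmd.CommandType = CommandType.StoredProcedure;
    cmd.Parameters.AddWithValue("@last_timestamp", lastTimestamp);

    SqlParameter dataParam = cmd.Parameters.Add("@binary_data", SqlDbType.VarBinary, -1);
    dataParam.Direction = ParameterDirection.Output;

    SqlParameter returnParam = cmd.Parameters.Add("@return_value", SqlDbType.Int);
    returnParam.Direction = ParameterDirection.ReturnValue;

    conn.Open();
    cmd.ExecuteNonQuery();

    if ((int)returnParam.Value == 1)           // 1 = new data available
    {
        byte[] data = (byte[])dataParam.Value;
        // ... process data ...
    }
}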
C# with .NET 2.0 and a SQL Server 2005 DB backend.
I've got a bunch of XML files which contain data along the lines of the following; the structure varies a little but is more or less as follows:
<TankAdvisory>
<WarningType name="Tank Overflow">
<ValidIn>All current tanks</ValidIn>
<Warning>Tank is close to capacity</Warning>
<IssueTime Issue-time="2011-02-11T10:00:00" />
<ValidFrom ValidFrom-time="2011-01-11T13:00:00" />
<ValidTo ValidTo-time="2011-01-11T14:00:00" />
</WarningType>
</TankAdvisory>
I have a single DB table that has all the above fields ready to be filled.
When I use the following method of reading the data from the XML file:
DataSet reportData = new DataSet();
reportData.ReadXml("../File.xml");
It successfully populates the DataSet, but with multiple tables. So when I come to use SqlBulkCopy, I can either save just one table this way:
sbc.WriteToServer(reportData.Tables[0]);
Or, if I loop through all the tables in the DataSet and add each of them, a new row is added to the database for each table, when in actuality they should all be stored in one row.
Then, of course, there's also the issue of column mappings; I'm thinking that maybe SqlBulkCopy is the wrong way of doing this.
What I need to do is find a quick way of getting the data from that XML file into the Database under the relevant columns in the DB.
OK, so the original question is a little old, but I have just come across a way to resolve this issue.
All you need to do is loop through all the DataTables in your DataSet and add their columns to the one DataTable that matches the columns of the table in your DB, like so...
DataTable dataTable = reportData.Tables[0];
//Second DataTable
DataTable dtSecond = reportData.Tables[1];
foreach (DataColumn myCol in dtSecond.Columns)
{
sbc.ColumnMappings.Add(myCol.ColumnName, myCol.ColumnName);
dataTable.Columns.Add(myCol.ColumnName);
dataTable.Rows[0][myCol.ColumnName] = dtSecond.Rows[0][myCol];
}
//Finally Perform the BulkCopy
sbc.WriteToServer(dataTable);
If your tables contain more than one row, use this loop instead of the single-row loop above; it maps the columns the same way but copies every row:
foreach (DataColumn myCol in dtSecond.Columns)
{
    sbc.ColumnMappings.Add(myCol.ColumnName, myCol.ColumnName);
    dataTable.Columns.Add(myCol.ColumnName);
    for (int intRowcnt = 0; intRowcnt <= dtSecond.Rows.Count - 1; intRowcnt++)
    {
        dataTable.Rows[intRowcnt][myCol.ColumnName] = dtSecond.Rows[intRowcnt][myCol];
    }
}
SqlBulkCopy is for many inserts. It's perfect for those cases when you would otherwise generate a lot of INSERT statements and juggle the limit on the total number of parameters per batch. The thing about the SqlBulkCopy class, though, is that it's a bit cranky: unless you fully specify all column mappings for the data set, it will throw an exception.
I'm assuming that your data is quite manageable since you're reading it into a DataSet. If you were to have even larger data sets, you could lift chunks into memory and then flush them to the database piece by piece. But if everything fits in one go, it's as simple as that.
SqlBulkCopy is the fastest way to put data into the database. Just set up column mappings for all the columns, otherwise it won't work.
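For example, a minimal sketch (the destination table name is an assumption, and it presumes the destination columns share the DataTable's column names):
using (SqlBulkCopy sbc = new SqlBulkCopy(connectionString))
{
    sbc.DestinationTableName = "dbo.TankAdvisory";

    // Map every source column by name so SqlBulkCopy doesn't fall back on ordinal matching.
    foreach (DataColumn col in dataTable.Columns)
    {
        sbc.ColumnMappings.Add(col.ColumnName, col.ColumnName);
    }

    sbc.WriteToServer(dataTable);
}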
Why reinvent the wheel? Use SSIS. Read with an XML Source, transform with one of the many Transformations, then load it with an OLE DB Destination into the SQL Server table. You will never beat SSIS in terms of runtime, speed of deploying the solution, maintenance, error handling, etc.
I am using VSTS 2008 + C# + .NET 3.5 + SQL Server 2008 + ADO.NET. Suppose I load a table from the database into an ADO.NET DataTable, and I have defined a couple of indexes on the database table. My question is: does the ADO.NET DataTable carry a related index (the same as the indexes I created on the physical database table) to improve the performance of certain operations on the DataTable?
thanks in advance,
George
Actually, George's question is not as "bad" as some people insist it is. (I am more and more convinced that there's no such thing as "a bad question".)
I have a rather big table which I load into memory as a DataTable object. A lot of processing is done on rows from this table, many times over, on various subsets which I could easily describe as the "WHERE ..." part of SELECT clauses. Now, with this DataTable I can run Select() - a method of the DataTable class - but it is quite inefficient.
In the end, I decided to load the DataTable sorted by specific columns and implemented my own quick search instead of using the Select() function. It proved to be much faster, but of course it works only on those sorted columns. The trouble could have been avoided had DataTable had indexes.
No, but possibly yes.
You can set up your own indices on a DataTable, using a DataView. As you change the table, the DataView will be rebuilt, so the index should always be up to date.
I did some bench tests for my own app. I use a DataTable to approximate a Boost MultiIndexContainer. To create an index on a column called "Author", I initialise the DataTable and then the DataView...
_dvChangesByAuthor =
new DataView(
_dtChanges,
string.Empty,
"Author ASC",
DataViewRowState.CurrentRows);
To then pull data by Author from the table, you use the view's FindRows function...
DataRowView[] dataRowViews = _dvChangesByAuthor.FindRows(author);
List<DataRow> returnRows = new List<DataRow>();
foreach (DataRowView drv in dataRowViews)
{
returnRows.Add(drv.Row);
}
I made a random large DataTable, and ran queries using DataTable.Select(), Linq-To-DataSet (with forced execution by exporting to list) and the above DataView method. The DataView method won easily. Linq took 5000 ticks, Select took over 26000 ticks, DataView took 192 ticks...
LOC=20141121-14:46:32.863,UTC=20141121-14:46:32.863,DELTA=72718,THR=9,DEBUG,LOG=Program,volumeTest() - Running queries for author >TFYN_AUTHOR_047<
LOC=20141121-14:46:32.863,UTC=20141121-14:46:32.863,DELTA=72718,THR=9,DEBUG,LOG=RightsChangeTracker,GetChangesByAuthorUsingLinqToDataset() - Query elapsed time: 2 ms, 4934 ticks; Rows=65
LOC=20141121-14:46:32.879,UTC=20141121-14:46:32.879,DELTA=72733,THR=9,DEBUG,LOG=RightsChangeTracker,GetChangesByAuthorUsingSelect() - Query elapsed time: 11 ms, 26575 ticks; Rows=65
LOC=20141121-14:46:32.879,UTC=20141121-14:46:32.879,DELTA=72733,THR=9,DEBUG,LOG=RightsChangeTracker,GetChangesByAuthorUsingDataview() - Query elapsed time: 0 ms, 192 ticks; Rows=65
So, if you want indices on a DataTable, I would suggest DataView, if you can deal with the fact that the index is re-built when the data changes.
You can create a primary key for the DataTable. Filter operations get a big boost if you are searching in the primary key field. Check out this link: here
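A small sketch of that (the loader and column name are just examples):
DataTable customers = LoadCustomers();                          // hypothetical loader
customers.PrimaryKey = new DataColumn[] { customers.Columns["CustomerCode"] };

// Rows.Find uses the primary-key index instead of scanning the whole table.
DataRow match = customers.Rows.Find("Customer42");
if (match != null)
{
    // ... work with the matching row ...
}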
I had the same problem with many queries against a large DataTable that were not based on the primary key.
The solution I found was to create a DataView for each index I wanted to use, and then use its Find and FindRows methods to extract the data.
DataView creates an internal index on the DataTable and behaves virtually as an index for this purpose.
In my case I was able to reduce 10,000 queries from 40 seconds to ONE!!!
John above is correct. DataTables are disconnected in-memory structures. They do not map to the physical implementation of the database.
The indexes on disk are used to speed up lookups because you don't have all the rows. If you have to load every row and scan them it is slow, so an index makes sense. In a DataTable you already have all the rows, so a comparison is fast already.
The correct answer here to the implicit question of creating an index on a DataTable is that you can't do that, but you can create one or more DataViews for the DataTable, which according to the docs will create an index based on the sort the DataView specifies:
DataView constructs an index. An index contains keys built from one or more columns in the table or view. These keys are stored in a structure that enables the DataView to find the row or rows associated with the key values quickly and efficiently. Operations that use the index, such as filtering and sorting, see significant performance increases. The index for a DataView is built both when the DataView is created and when any of the sorting or filtering information is modified. Creating a DataView and then setting the sorting or filtering information later causes the index to be built at least twice: once when the DataView is created, and again when any of the sort or filter properties are modified.
If you need to do a large number of lookups to an in-memory DataTable, it may be the most straightforward and performant to use a DataView with the Find() or FindRows() method to do indexed key lookups. In particular, if you need to do a number of lookups and modifications to the data this would prevent needing to transform your DataTable into another indexed class like a Dictionary and then transforming it back into a DataTable again.
Others have made the point that a DataSet is not intended to serve as a database system--just a representation of data. If you are working under the impression that a DataSet is a database then you are mistaken and might need to reconsider your implementation.
If you need a client-side database, consider using SQL Server Compact or SQLite; both are free, redistributable database systems which can be used without requiring separate installations or services. If you need something more full-featured, SQL Server Express is the next step up.
To help clarify though, DataSets/Tables are used in .NET development to temporarily hold data as needed. Think of them as the results of a SELECT query against a database; they are roughly similar to CSV files or other forms of tabular data--you can pull data into them from a database, work with the data, and then push the changes back to a database--but they, on their own, are not databases.
If you have a large collection of items which you need to keep in memory for one reason or another, then you might consider building a lightweight DTO (data transfer object; Google it, they're very simple) and loading them into a Hashtable. Hashtables won't give you any form of relational data, but they are very efficient at look-ups.
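A minimal sketch of the DTO-plus-lookup idea, using the generic Dictionary rather than the older Hashtable; all names here are illustrative:
public class CustomerDto
{
    public string CustomerCode { get; set; }
    public string Name { get; set; }
}

// Build the lookup once...
Dictionary<string, CustomerDto> byCode = new Dictionary<string, CustomerDto>();
foreach (DataRow row in customersTable.Rows)
{
    CustomerDto dto = new CustomerDto
    {
        CustomerCode = (string)row["CustomerCode"],
        Name = (string)row["Name"]
    };
    byCode[dto.CustomerCode] = dto;
}

// ...then look items up in O(1) instead of scanning the table.
CustomerDto hit;
if (byCode.TryGetValue("Customer42", out hit))
{
    // ... use hit ...
}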
DataTables have a PrimaryKey property that can serve as an index (lookups on it are fast already anyway). This property is not copied from the primary keys of the database (although that might be nice).
My reading of the docs is that the correct way to achieve this (if needed) is to use AsDataView to produce a DataView (or LinqDataView) that's bound to the underlying table. If your DataTable is invariant then the DataView can be static to avoid redundant re-indexing.
I am currently investigating LINQ to DataSet, and this question was helpful to me, so thanks.
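For example, a short sketch of the AsDataView route (it requires a reference to System.Data.DataSetExtensions; the table and column names are illustrative):
// AsDataView returns a DataView bound to the table; setting Sort builds the index once.
DataView view = customersTable.AsDataView();
view.Sort = "CustomerCode ASC";

// Indexed lookups against the sort key:
DataRowView[] rows = view.FindRows("Customer42");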
DataTables are indexed if you (the coder) specify one or more DataColumns as the PrimaryKey. Internally, ADO.NET uses a red-black tree to form this index, giving log-time lookups. This primary key is not set automatically based on any underlying keying from the data provider.
George,
The answer is no.
Actually, some sort of indexing may be used internally, but only as an implementation detail. For instance, if you create a foreign key constraint, maybe that's assisted by an index. But it doesn't matter to a developer.
I'm currently using SMO and C# to traverse databases to create a tree of settings representing various aspects of the two databases, then comparing these trees to see where and how they are different.
The problem is, for two reasonably sized databases, it takes almost 10 minutes to crawl them locally and collect the table/column/stored-procedure information I wish to compare.
Is there a better interface than SMO to access databases in this fashion? I would prefer not to include any additional dependencies, but I'll take that pain for a 50% speed improvement. Below is a sample of how I'm enumerating tables and columns.
Microsoft.SqlServer.Management.Smo.Database db = db_in;
foreach (Table t in db.Tables)
{
if (t.IsSystemObject == false)
{
foreach (Column c in t.Columns)
{
}
}
}
Try to force SMO to read all the required fields at once, instead of querying on access.
See this blog for more information
EDIT: Link is dead but I found the page on archive.org. Here's the relevant code:
Server server = new Server();
// Force IsSystemObject to be returned by default.
server.SetDefaultInitFields(typeof(StoredProcedure), "IsSystemObject");
StoredProcedureCollection storedProcedures = server.Databases["AdventureWorks"].StoredProcedures;
foreach (StoredProcedure sp in storedProcedures) {
if (!sp.IsSystemObject) {
// We only want user stored procedures
}
}
Use the system views in each database and query conventionally.
http://www.microsoft.com/downloads/details.aspx?familyid=2EC9E842-40BE-4321-9B56-92FD3860FB32&displaylang=en
There is little that you can't get via T-SQL queries, and getting metadata that way is usually very fast.
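A rough sketch of that approach: one query over the catalog views returns all table/column metadata in a single round trip, instead of SMO walking objects one by one.
using (SqlConnection conn = new SqlConnection(connectionString))
using (SqlCommand cmd = new SqlCommand(
    @"SELECT t.name AS TableName, c.name AS ColumnName, ty.name AS TypeName,
             c.max_length, c.is_nullable
      FROM sys.tables t
      JOIN sys.columns c ON c.object_id = t.object_id
      JOIN sys.types ty ON ty.user_type_id = c.user_type_id
      WHERE t.is_ms_shipped = 0
      ORDER BY t.name, c.column_id", conn))
{
    conn.Open();
    using (SqlDataReader reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // Build the comparison tree from TableName / ColumnName / TypeName here.
        }
    }
}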