I'm developing an ASP.NET app that analyzes Excel files uploaded by users. The files contain various data about customers (one row = one customer); the key field is CustomerCode. Basically, the data comes in the form of a DataTable object.
At some point I need to get information about the specified customers from SQL and compare it to what user uploaded. I'm doing it the following way:
Make a comma-separated list of customers from the CustomerCode column: 'Customer1','Customer2',...,'CustomerN'.
Pass this string to the SQL query's IN (...) clause and execute it.
This was working okay until I ran into a "The query processor ran out of internal resources and could not produce a query plan" exception when trying to pass ~40,000 items inside the IN (...) clause.
The trivial way seems to be:
Replace IN (...) with = 'SomeCustomerCode' in the query template.
Execute this query 40,000 times, once per CustomerCode.
Do DataTable.Merge 40,000 times.
Is there any better way to work around this problem?
Note: I can't do IN (SELECT CustomerCode FROM ... WHERE SomeConditions) because the data comes from Excel files and thus cannot be queried from the DB.
"Table valued parameters" would be worth investigating, which let you pass in (usually via a DataTable on the C# side) multiple rows - the downside is that you need to formally declare and name the data shape on the SQL server first.
Alternatively, you could use SqlBulkCopy to throw the rows into a staging table and then just JOIN to that table. If you have parallel callers, you will need some kind of session identifier on each row to distinguish between concurrent uses (and don't forget to remove your session's data afterwards).
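A minimal sketch of that staging-table route, assuming a hypothetical dbo.CustomerCodeStaging table with SessionId and CustomerCode columns (adjust names to your schema):

using System;
using System.Data;
using System.Data.SqlClient;

public static class CustomerLookup
{
    public static DataTable GetDbCustomers(DataTable uploaded, string connectionString)
    {
        // Tag this upload so concurrent callers don't see each other's rows.
        var sessionId = Guid.NewGuid();

        var staging = new DataTable();
        staging.Columns.Add("SessionId", typeof(Guid));
        staging.Columns.Add("CustomerCode", typeof(string));
        foreach (DataRow row in uploaded.Rows)
            staging.Rows.Add(sessionId, row["CustomerCode"]);

        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // Bulk-insert the uploaded codes into the staging table.
            using (var bulk = new SqlBulkCopy(connection))
            {
                bulk.DestinationTableName = "dbo.CustomerCodeStaging";
                bulk.ColumnMappings.Add("SessionId", "SessionId");
                bulk.ColumnMappings.Add("CustomerCode", "CustomerCode");
                bulk.WriteToServer(staging);
            }

            // JOIN against the staging table instead of a giant IN (...) list.
            var result = new DataTable();
            using (var command = new SqlCommand(
                @"SELECT c.* FROM dbo.Customer c
                  JOIN dbo.CustomerCodeStaging s ON s.CustomerCode = c.CustomerCode
                  WHERE s.SessionId = @sessionId", connection))
            {
                command.Parameters.AddWithValue("@sessionId", sessionId);
                new SqlDataAdapter(command).Fill(result);
            }

            // Remove this session's staging rows afterwards.
            using (var cleanup = new SqlCommand(
                "DELETE FROM dbo.CustomerCodeStaging WHERE SessionId = @sessionId", connection))
            {
                cleanup.Parameters.AddWithValue("@sessionId", sessionId);
                cleanup.ExecuteNonQuery();
            }

            return result;
        }
    }
}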
You shouldn't process too many records at once: you run into errors like the one you mentioned, and such a big batch takes so long to run that you can't do anything in parallel. You shouldn't process only one record at a time either, because then the overhead of communicating with the SQL server becomes too big. Choose something in the middle and process, e.g., 10,000 records at a time. You can even parallelize the processing: start running the SQL for the next 10,000 in the background while you are still processing the previous batch of 10,000.
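A rough sketch of that batching idea; the batch size is arbitrary and queryCustomers stands in for whatever method currently runs the IN (...) query for a single batch:

using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class BatchedLookup
{
    private const int BatchSize = 10000;   // tune to what your server handles comfortably

    public static DataTable LookupAll(
        DataTable uploaded,
        Func<IReadOnlyList<string>, DataTable> queryCustomers)   // runs the IN (...) query for one batch
    {
        var codes = uploaded.AsEnumerable()
                            .Select(r => r.Field<string>("CustomerCode"))
                            .ToList();

        var merged = new DataTable();
        for (int offset = 0; offset < codes.Count; offset += BatchSize)
        {
            var batch = codes.Skip(offset).Take(BatchSize).ToList();
            merged.Merge(queryCustomers(batch));
        }
        return merged;
    }
}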
This may be a dumb question, but I wanted to be sure. I am creating a WinForms app and using a C# OleDbConnection to connect to an MS Access database. Right now, I am using a "SELECT * FROM table_name" and looping through each row to see if it is the row with the criteria I want, then breaking out of the loop if it is. Would the performance be improved if I used something like "SELECT * FROM table_name WHERE id=something", so basically a "WHERE" clause instead of looping through every row?
The best way to validate the performance of anything is to test. Otherwise, a lot of assumptions are made about what is the best versus the reality of performance.
With that said, using a WHERE clause will be better than retrieving the data and then filtering via a loop 100% of the time. This is for a few different reasons, but ultimately you are filtering the data on a column before retrieving the rows, versus retrieving all of the rows and then filtering out the data. Relational data should be dealt with using set logic, which is how a WHERE clause works: it operates on the data set. The loop is not set logic; it compares each individual row, expensively, and discards those that don't meet the criteria.
Don’t take my word for it though. Try it out. Especially try it out when your app has a lot of data in the table.
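As a rough illustration of the difference, letting the database do the filtering looks something like this (connectionString, targetId, and the table/column names are placeholders from the question):

using System.Data.OleDb;

// Instead of SELECT * plus a loop, ask only for the row you want.
using (var connection = new OleDbConnection(connectionString))
using (var command = new OleDbCommand(
    "SELECT * FROM table_name WHERE id = ?", connection))
{
    command.Parameters.AddWithValue("?", targetId);   // OleDb parameters are positional
    connection.Open();
    using (var reader = command.ExecuteReader())
    {
        if (reader.Read())
        {
            // Only the matching row is ever read; no manual loop-and-break needed.
        }
    }
}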
Yes, of course.
Say you have an Access database file shared in a folder, and you deploy your .NET desktop application to each workstation.
And furthermore, say the table has 1 million rows.
If you do this:
SELECT * from tblInvoice WHERE InvoiceNumber = 123245
Then ONLY one row is pulled down the network pipe - and this holds true EVEN if the table has 1 million rows. To traverse and pull 1 million rows is going to take a HUGE amount of time, but if you add criteria to your select, then it would be in this case about 1 million times faster to pull one row as opposed to the whole table.
And what if this is/was multi-user? Then again, even over a network, ONLY the one record that meets your criteria will be pulled. The only requirement for this "one row pull" over the network? The Access data engine needs a usable index on that criteria column. By default the PK column (ID) always has that index, so no worries there. But if, as above, we are pulling invoice numbers from a table, then an index on that column (InvoiceNumber) is required for the data engine to pull only one row. If no index can be used, then behind the scenes all rows are pulled until a match occurs - and over a network this means significant amounts of data will be pulled without that index (or, if local, pulled from the file on disk).
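If the criteria column doesn't have an index yet, a one-time statement along these lines (names taken from the invoice example above, connectionString assumed) creates one:

using System.Data.OleDb;

// Create the index the Access data engine needs to satisfy the WHERE clause
// without scanning (and pulling) the whole table.
using (var connection = new OleDbConnection(connectionString))
using (var command = new OleDbCommand(
    "CREATE INDEX idxInvoiceNumber ON tblInvoice (InvoiceNumber)", connection))
{
    connection.Open();
    command.ExecuteNonQuery();
}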
We have a 4-column data grid displayed on a page. I would like to perform a query based on the value in each cell of the grid, i.e. 4 queries per row.
This is so I can populate each cell with the count of records in the DB that match that value.
When each row gets populated by jqGrid, it fires off an AJAX call for each cell.
I think this is a very bad idea, since I have already discovered that the browser limits the number of AJAX calls to the same server.
Are there similar limits for ADO.NET?
I would like to batch these queries together so I make fewer calls to the DB. Is this what you would do?
How would you approach this?
You could combine your AJAX calls into one, have the resulting object contain an array (or multiple properties, one per result set), and then run your SQL in parallel on the server, as sketched below.
Check out this Q&A for options on how to use the TPL with SQL:
Parallel.Foreach SQL querying sometimes results in Connection
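A sketch of what the server side of that combined call could look like with the TPL; the counting query, table, and column names are assumptions, and each query gets its own connection so they can genuinely run in parallel:

using System.Collections.Generic;
using System.Data.SqlClient;
using System.Linq;
using System.Threading.Tasks;

public static async Task<int[]> GetCountsAsync(string connectionString, IEnumerable<string> cellValues)
{
    // One task per cell value; the counts come back in the same order.
    var tasks = cellValues.Select(async value =>
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "SELECT COUNT(*) FROM dbo.Records WHERE SomeColumn = @value", connection))
        {
            command.Parameters.AddWithValue("@value", value);
            await connection.OpenAsync();
            return (int)await command.ExecuteScalarAsync();
        }
    });

    return await Task.WhenAll(tasks);
}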
I would suggest you select the associated ID data in the first request and populate the values in the UI.
You could use either a JOIN or a LEFT JOIN in your first query, depending on your requirements and architecture, and fetch the specific column value / count(id) (which you mentioned here as the count of records).
Here is the problem,
I have a SQL DB with your regular customer, product, order schema, but HUGE [each table has tens of millions of rows]. There is also a large order_email table [approx. 100 million rows] that holds all email communication associated with an order. I have implemented a full-text search using Lucene on top of order_email, which works fine.
Now I want to extend the email search functionality to filter based on other domain objects, i.e. to answer queries like:
show customers who sent an email with the phrase 'never gonna give you up'
show orders which have an associated email with the phrase 'more ponies'.
The implementation would be to do an intersection/join of the Lucene result and a SQL result, but I can't think of a way to do this without running into issues due to the SIZE of the tables and indexes involved.
My failing approaches
Brute force: adding most of my DB columns as Lucene fields. This is the equivalent of denormalizing my entire DB and creating a Lucene index (terabytes in size) with all columns as fields. Performance sucks and the cost is prohibitive.
Getting the Lucene result set, extracting the OrderIDs from it, and querying the DB like SELECT * FROM Order WHERE OrderID IN (OrderIDs from Lucene). This doesn't work because the email search could yield a million OrderIDs, which makes the SQL query perform poorly, if it completes at all.
Doing the join in application code by iterating over the SQL result and the Lucene result. Depending on the size of the results, this means a single query could load two multi-million-row datasets and iterate over them, thrashing CPU and memory.
Thoughts on how I can structure this join/intersection of 2 large datasets?
P.S.: the first one to suggest Hadoop is a rotten egg. I wish I could, but we don't have the budget for more hardware.
Like OzrenTkalcecKrznaric said in the comments to the question, paging is your friend. (Remember that the single most powerful algorithm ever developed is "divide and conquer".)
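A sketch of how that divide-and-conquer could look in code, paging through the Lucene OrderIDs and joining one page at a time (the page size and the table/column names are assumptions):

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Linq;

public static class PagedIntersection
{
    public static IEnumerable<DataTable> JoinInPages(
        IEnumerable<int> luceneOrderIds, string connectionString, int pageSize = 1000)
    {
        var page = new List<int>(pageSize);
        foreach (var id in luceneOrderIds)
        {
            page.Add(id);
            if (page.Count == pageSize)
            {
                yield return QueryPage(page, connectionString);
                page.Clear();
            }
        }
        if (page.Count > 0)
            yield return QueryPage(page, connectionString);
    }

    private static DataTable QueryPage(List<int> ids, string connectionString)
    {
        // 1000 parameters per page stays well under SQL Server's 2100-parameter limit.
        var placeholders = string.Join(",", ids.Select((_, i) => "@p" + i));
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "SELECT * FROM [Order] WHERE OrderID IN (" + placeholders + ")", connection))
        {
            for (int i = 0; i < ids.Count; i++)
                command.Parameters.AddWithValue("@p" + i, ids[i]);

            var result = new DataTable();
            new SqlDataAdapter(command).Fill(result);   // Fill opens and closes the connection
            return result;
        }
    }
}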
I am working on a C# application which loads data from an MS SQL 2008 or 2008 R2 database. The table looks something like this:
ID | binary_data | Timestamp
I need to get only the last entry and only the binary data. Entries are added to this table irregularly by another program, so I have no way of knowing whether there is a new entry.
Which version is better (performance etc.) and why?
//Always a query, which might not be needed
public void ProcessData()
{
byte[] data = "query code get latest binary data from db"
}
vs
//Always a smaller check-query, and sometimes two queries
public void ProcessData()
{
DateTime timestamp = "query code get latest timestamp from db"
if(timestamp > old_timestamp)
data = "query code get latest binary data from db"
}
The binary_data field will be around 30 kB in size. The "ProcessData" function will be called several times per minute, but sometimes it can be called every 1-2 seconds. This is only a small part of a bigger program with lots of threading/database access, so I want the "lightest" solution. Thanks.
Luckily, you can have both:
SELECT TOP 1 binary_data
FROM myTable
WHERE Timestamp > @last_timestamp
ORDER BY Timestamp DESC
If there is no record newer than @last_timestamp, no record will be returned and thus no data transmission takes place (= fast). If there are new records, the binary data of the newest one is returned immediately (= no need for a second query).
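Wired into ProcessData, that could look roughly like this (the open connection is assumed to be managed elsewhere; myTable and the column names come from the question):

using System;
using System.Data.SqlClient;

public byte[] GetNewDataIfAny(SqlConnection connection, ref DateTime lastTimestamp)
{
    using (var command = new SqlCommand(
        @"SELECT TOP 1 binary_data, Timestamp
          FROM myTable
          WHERE Timestamp > @last_timestamp
          ORDER BY Timestamp DESC", connection))
    {
        command.Parameters.AddWithValue("@last_timestamp", lastTimestamp);
        using (var reader = command.ExecuteReader())
        {
            if (!reader.Read())
                return null;                        // nothing newer, nothing transmitted

            lastTimestamp = reader.GetDateTime(1);  // remember the newest timestamp for next time
            return (byte[])reader["binary_data"];
        }
    }
}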
I would suggest you perform tests using both methods, as the answer depends on your usage. Simulate some expected behaviour.
I would say, though, that you are probably okay just doing the first query. Do what works. Don't prematurely optimise; if the single query is too slow, try your second, two-query approach.
A two-step approach is more efficient from the point of view of the overall workload of the system:
Get informed that you need to query new data
Query new data
There are several ways to implement this approach. Here are a pair of them.
Using Query Notifications, which is built-in SQL Server functionality supported in .NET (a minimal sketch follows this list).
Using an implied method of getting informed of a database table update, e.g. the one described in this article on the SQL Authority blog.
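For the Query Notifications route, a minimal SqlDependency sketch could look like this; it assumes Service Broker is enabled on the database, that SqlDependency.Start(connectionString) was called once at application startup, and it reuses the table/column names from the question:

using System.Data.SqlClient;

public void WatchForNewRows(string connectionString)
{
    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(
        "SELECT ID, binary_data, Timestamp FROM dbo.myTable", connection))   // notification queries need explicit columns and two-part table names
    {
        var dependency = new SqlDependency(command);
        dependency.OnChange += (sender, e) =>
        {
            // Fires when the result set changes; re-query and re-subscribe here.
        };

        connection.Open();
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read()) { }   // consume the initial results; the subscription is then active
        }
    }
}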
I think the better path is a stored procedure that keeps the logic inside the database: something with an output parameter holding the required data and a return value, like TRUE/FALSE, to signal the presence of new data.
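Calling such a procedure from C# could look roughly like this; the procedure name, the output parameter, and the TRUE/FALSE return convention are assumptions, and the connection is assumed to be open:

using System.Data;
using System.Data.SqlClient;

public byte[] TryGetLatestData(SqlConnection connection)
{
    using (var command = new SqlCommand("dbo.GetLatestBinaryData", connection))
    {
        command.CommandType = CommandType.StoredProcedure;

        var dataParam = command.Parameters.Add("@binary_data", SqlDbType.VarBinary, -1);
        dataParam.Direction = ParameterDirection.Output;

        var returnParam = command.Parameters.Add("@return", SqlDbType.Int);
        returnParam.Direction = ParameterDirection.ReturnValue;

        command.ExecuteNonQuery();

        bool hasNewData = (int)returnParam.Value == 1;   // TRUE/FALSE signal from the procedure
        return hasNewData ? (byte[])dataParam.Value : null;
    }
}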
I have the following code snippet:
var matchingAuthors = from authors in DB.AuthorTable
where m_authors.Keys.Contains(authors.AuthorId)
select authors;
foreach (AuthorTableEntry author in matchingAuthors)
{
....
}
where m_authors is a Dictionary containing the "Author" entries, and DB.AuthorTable is a SQL table. When the size of m_authors goes beyond a certain value (somewhere around the 3000 entries mark), I get an exception:
System.Data.SqlClient.SqlException: The incoming tabular data stream (TDS) remote procedure call (RPC) protocol stream is incorrect.
Too many parameters were provided in this RPC request. The maximum is 2100.
Is there any way I can get around this and work with a larger size dictionary? Alternatively, is there a better way to get all rows in a SQL table where a particular column value for that row matches one of the dictionary entries?
LINQ to SQL translates a local Contains() into a parameterized IN clause:
...
WHERE AuthorId IN (@p0, @p1, @p2, ...)
...
So the error you're seeing means SQL Server ran out of parameters to use for your keys. I can think of two options:
Select the whole table and filter using LINQ to Objects.
Generate an expression tree from your keys: see Option 2 here.
Another option is to consider how you populate m_authors and whether you can include that in the query as a query element itself so it turns into a server-side join/subselect.
Depending on your requirements, you could break apart the work into multiple smaller chunks (first thousand, second thousand, etc.) This runs certain risks if your data is read-write and changes frequently, but it might give you a bit better scalability beyond pulling back thousands of rows in one big gulp. And, if your data can be worked on in part (i.e. without having the entire set in memory), you could send off chunks to be worked on in a separate thread while you are pulling back the next chunk.
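A sketch of that chunking idea against the code from the question, keeping each query safely under the 2100-parameter ceiling (the chunk size is arbitrary):

using System.Collections.Generic;
using System.Linq;

const int ChunkSize = 2000;   // stays below the 2100-parameter limit

var allKeys = m_authors.Keys.ToList();
var matchingAuthors = new List<AuthorTableEntry>();

for (int offset = 0; offset < allKeys.Count; offset += ChunkSize)
{
    var chunk = allKeys.Skip(offset).Take(ChunkSize).ToList();

    // Each iteration produces an IN (...) with at most ChunkSize parameters.
    matchingAuthors.AddRange(
        from authors in DB.AuthorTable
        where chunk.Contains(authors.AuthorId)
        select authors);
}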