Insert and Check for Copies of Data using SQL and C#

So I'm upgrading an old parser right now. It's written in C# and uses SQL to insert records into a database.
Currently it reads and parses a few thousand lines of data from a file, then inserts the new data into a database containing over a million records.
Sometimes it can take over 10 minutes just to add a few thousand lines.
I've concluded that the performance bottleneck is a SQL command that uses an IF NOT EXISTS check to determine whether the row being inserted already exists and, if it doesn't, inserts the record.
I believe the problem is that it just takes way too long to call the IF NOT EXISTS on every single row in the new data.
Is there a faster way to determine whether data exists already or not?
I was thinking of inserting all of the records first anyway using the SqlBulkCopy class, then running a stored procedure to remove the duplicates.
Does anyone else have any suggestions or methods to do this as efficiently and quickly as possible? Anything would be appreciated.
EDIT: To clarify, I'd run a stored procedure (on the large table) after copying the new data into the large table
large table = 1,000,000+ rows
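The SqlBulkCopy-then-dedupe idea could be sketched roughly as below: bulk copy everything into a staging table, then let one set-based statement insert only the rows that are new. All table and column names (StagingRecords, Records, RecordKey, Value) are hypothetical placeholders, not anything from the original code.

```csharp
using System.Data;
using System.Data.SqlClient;

static void BulkInsertNewRows(string connectionString, DataTable parsedRows)
{
    using var conn = new SqlConnection(connectionString);
    conn.Open();

    // 1. Bulk copy the parsed file into an (empty) staging table.
    using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.StagingRecords" })
    {
        bulk.WriteToServer(parsedRows);
    }

    // 2. One set-based INSERT moves only the rows that don't exist yet --
    //    far cheaper than an IF NOT EXISTS round trip per row.
    const string sql = @"
        INSERT INTO dbo.Records (RecordKey, Value)
        SELECT s.RecordKey, s.Value
        FROM dbo.StagingRecords AS s
        WHERE NOT EXISTS (SELECT 1 FROM dbo.Records AS r
                          WHERE r.RecordKey = s.RecordKey);
        TRUNCATE TABLE dbo.StagingRecords;";
    using var cmd = new SqlCommand(sql, conn);
    cmd.ExecuteNonQuery();
}
```

This keeps the existence check on the server as a single set operation instead of a per-row round trip.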

1. Create an IDataReader to loop over your source data.
2. Place the values into a strongly typed DataSet.
3. Every N rows, send the DataSet (via .GetXml) to a stored procedure. Let's say N = 1000 for the heck of it.
4. Have the stored procedure shred the XML.
5. Do your INSERT/UPDATE based on this shredded XML.
6. Return from the procedure and keep looping until you're done.
Here is an older example:
http://granadacoder.wordpress.com/2009/01/27/bulk-insert-example-using-an-idatareader-to-strong-dataset-to-sql-server-xml/
The key is that you are doing "bulk" operations instead of row by row, and you can pick a sweet-spot batch size (1000, for example) that gives you the best performance.
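The loop above can be sketched as follows, using a plain DataTable instead of the strongly typed DataSet from the linked post. The stored procedure name (dbo.ImportChunk) and its @xml parameter are assumptions for illustration.

```csharp
using System.Data;
using System.Data.SqlClient;

static void ImportInChunks(IDataReader reader, string connectionString, int chunkSize = 1000)
{
    // Mirror the reader's schema into a DataTable named "Row"
    // (the name becomes the XML element name in GetXml).
    var table = new DataTable("Row");
    for (int i = 0; i < reader.FieldCount; i++)
        table.Columns.Add(reader.GetName(i), reader.GetFieldType(i));

    using var conn = new SqlConnection(connectionString);
    conn.Open();

    while (reader.Read())
    {
        var values = new object[reader.FieldCount];
        reader.GetValues(values);
        table.Rows.Add(values);

        if (table.Rows.Count == chunkSize)
        {
            SendChunk(conn, table);   // ship a full chunk, then reuse the table
            table.Clear();
        }
    }
    if (table.Rows.Count > 0) SendChunk(conn, table);  // final partial chunk
}

static void SendChunk(SqlConnection conn, DataTable chunk)
{
    var ds = new DataSet("Chunk");
    ds.Tables.Add(chunk.Copy());                       // GetXml works on the DataSet
    using var cmd = new SqlCommand("dbo.ImportChunk", conn)
    { CommandType = CommandType.StoredProcedure };
    cmd.Parameters.AddWithValue("@xml", ds.GetXml());
    cmd.ExecuteNonQuery();                             // proc shreds the XML and does INSERT/UPDATE
}
```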

Related

Strategy to optimize this large SQL insert via C#?

I have about 1.5 million files I need to insert records for in the database.
Each record is inserted with a key that includes the name of the file.
The catch: The files are not uniquely identified currently.
So, what we'd like to do is, for each file:
Insert a record. One of the fields in the record should include an amazon S3 key which should include the ID of the newly inserted record.
Rename the file to include the ID so that it matches the format of the key.
The best thing I can think to do is:
Run an individual insert command that returns the ID of the added row.
Add that back as a property to the individual business object I'm looping through.
Generate an update statement that updates the S3 key to include the ID
Output the file, concatenate the ID into the end the file name.
As far as I can tell, that looks to be:
1.5 million insert statements,
1.5 million individual SqlCommand executions and reads (because we need the ID back),
1.5 million times setting a property on an object,
1.5 million update statements generated and executed (perhaps this could be made into one giant concatenated update statement to do them all at once; not sure if that helps),
1.5 million file copies.
I can't get around the actual file part, but for the rest, is there a better strategy I'm not seeing?
If you make the client application generate the IDs, you can use a straightforward SqlBulkCopy to insert all rows at once. It will be done in seconds.
If you want to keep the IDENTITY property of the column, you can run DBCC CHECKIDENT with RESEED to advance the identity counter by 1.5M, giving you a guaranteed gap that you can insert into. If the number of rows is not statically known, you can perform the inserting in smaller chunks of maybe 100K until you are done.
You will cut the number of SQL statements in half by not relying on the database to generate your ID for each row. Do everything locally (including the assignment of an ID) and then do a single batch of inserts at the end, with IDENTITY_INSERT ON.
This will cause SQL Server to use your IDs for this batch of records.
If this is still too slow (and 1.5 million inserts might be), the next step would be to output your data to a text file (XML, comma delimited, or whatever) and then do a bulk import operation on the file.
That's as fast as you will be able to make it, I think.
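A minimal sketch of the client-assigned-ID approach: reserve an ID range on the server (e.g. via DBCC CHECKIDENT with RESEED, as suggested above), stamp each row and its S3 key locally, then bulk insert with KeepIdentity so SQL Server keeps our IDs. The table name, columns, and key format here are made up for illustration.

```csharp
using System.Data;
using System.Data.SqlClient;

static void BulkInsertWithClientIds(string connectionString, string[] fileNames, long firstId)
{
    var table = new DataTable();
    table.Columns.Add("Id", typeof(long));
    table.Columns.Add("S3Key", typeof(string));
    table.Columns.Add("FileName", typeof(string));

    long id = firstId;  // start of the gap reserved via DBCC CHECKIDENT ... RESEED
    foreach (var name in fileNames)
    {
        // Hypothetical S3 key format embedding the ID, per the question.
        table.Rows.Add(id, $"uploads/{id}-{name}", name);
        id++;
    }

    using var conn = new SqlConnection(connectionString);
    conn.Open();

    // KeepIdentity tells SqlBulkCopy to preserve our client-assigned IDs
    // instead of letting the IDENTITY column generate new ones.
    using var bulk = new SqlBulkCopy(conn, SqlBulkCopyOptions.KeepIdentity, null)
    {
        DestinationTableName = "dbo.Files",
        BatchSize = 100_000
    };
    bulk.WriteToServer(table);
}
```

Because every ID is known before the insert, the S3 keys and renamed files can be produced in the same pass, eliminating the per-row read-back and update statements entirely.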

How large of a SQL string (for mass update) can I realistically use in C# .net4?

I'm receiving and parsing a large text file.
In that file I have a numerical ID identifying a row in a table, and another field that I need to update.
ID    Current Location
======================
1     Boston
2     Cambridge
3     Idaho
I was thinking of creating a single SQL command string and firing that off using ADO.NET, but some of the files I'm going to receive have thousands of lines. Is this doable, or is there a limit I'm not seeing?
If you may have thousands of lines, then composing a single SQL statement is definitely NOT the way to go. Better code-based alternatives include:
1. Use SqlBulkCopy to insert the change data into a staging table and then UPDATE your target table using the staging table as the source. It also has excellent batching options (unlike the other choices).
2. Write a stored procedure to do the update that accepts an XML parameter containing the update data.
3. Write a stored procedure to do the update that accepts a table-valued parameter containing the update data.
I have not compared them myself, but it is my understanding that #3 is generally the fastest (though #1 is plenty fast for almost any need).
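Option #3 could be sketched as below, assuming a user-defined table type and a procedure along these lines already exist on the server (both names are hypothetical):

```csharp
// Assumed server-side objects:
//   CREATE TYPE dbo.LocationUpdate AS TABLE (Id int, Location nvarchar(100));
//   CREATE PROCEDURE dbo.UpdateLocations @rows dbo.LocationUpdate READONLY AS
//     UPDATE t SET t.CurrentLocation = r.Location
//     FROM dbo.Targets AS t JOIN @rows AS r ON t.Id = r.Id;
using System.Data;
using System.Data.SqlClient;

static void UpdateViaTvp(string connectionString, DataTable updates)
{
    using var conn = new SqlConnection(connectionString);
    conn.Open();
    using var cmd = new SqlCommand("dbo.UpdateLocations", conn)
    {
        CommandType = CommandType.StoredProcedure
    };
    var p = cmd.Parameters.AddWithValue("@rows", updates);
    p.SqlDbType = SqlDbType.Structured;   // marks the parameter as a TVP
    p.TypeName = "dbo.LocationUpdate";    // must match the server-side table type
    cmd.ExecuteNonQuery();
}
```

The whole set of ID/location pairs travels to the server in one call, and the update itself is a single set-based join.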
Writing one huge INSERT statement will be very slow. You also don't want to parse the whole massive file at once. What you need to do is something along the lines of:
Figure out a good chunk size. Let's call it chunk_size. This will be the number of records you'll read from the file at a time.
Load chunk_size number of records from the file into a DataTable.
Use SQLBulkCopy to insert the DataTable into the DB.
Repeat 2 & 3 until the file is done.
You'll have to experiment to find an optimal size for chunk_size so start small and work your way up.
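The chunking loop above might look like this, with made-up parsing details (tab-delimited lines, a staging table named dbo.StagingLocations):

```csharp
using System.Data;
using System.Data.SqlClient;
using System.IO;

static void LoadFileInChunks(string path, string connectionString, int chunkSize = 5000)
{
    var table = new DataTable();
    table.Columns.Add("Id", typeof(int));
    table.Columns.Add("CurrentLocation", typeof(string));

    using var conn = new SqlConnection(connectionString);
    conn.Open();

    // File.ReadLines streams the file, so only one chunk is in memory at a time.
    foreach (var line in File.ReadLines(path))
    {
        var parts = line.Split('\t');   // assumed tab-delimited
        table.Rows.Add(int.Parse(parts[0]), parts[1]);

        if (table.Rows.Count == chunkSize)
            Flush(conn, table);
    }
    if (table.Rows.Count > 0)
        Flush(conn, table);             // final partial chunk
}

static void Flush(SqlConnection conn, DataTable table)
{
    using var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.StagingLocations" };
    bulk.WriteToServer(table);
    table.Clear();                      // reuse the same DataTable for the next chunk
}
```

Varying chunkSize and timing the run is the simplest way to find the sweet spot mentioned above.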
I'm not sure of an actual limit, if one exists, but why not take "bite-sized" chunks of the file that you feel comfortable with and break it into several commands? You can always wrap it in a single transaction if it's important that they all fail or succeed.
Say grab 250 lines at a time, or whatever.

How to efficiently perform SQL Server database updates from my C# application

I am trying to figure out the best way to design my C# application which utilizes data from a SQL Server backend.
My application periodically has to update 55K rows each one at a time from a loop in my application. Before it does an update it needs to check if the record to be updated exists.
If it exists it updates one field.
If not it performs an insert of 4 fields.
The table to be updated has 600K rows.
What is the most efficient way to handle these updates/inserts from my application?
Should I create a data dictionary in C#, load the 600K records, and query the dictionary first instead of the database?
Is this a faster approach?
Should I use a stored procedure?
What’s the best way to achieve maximum performance based on this scenario?
You could use SqlBulkCopy to upload to a temp table then have a SQL Server job do the merge.
You should try to avoid "update 55K rows each one at a time from a loop". That will be very slow.
Instead, try to find a way to batch the updates (n at a time). Look into SQL Server table-valued parameters as a way to send a set of data to a stored procedure.
Here's an article on updating multiple rows with TVPs: http://www.sqlmag.com/article/sql-server-2008/using-table-valued-parameters-to-update-multiple-rows
What if you did something like this, instead?
By some means, get those 55,000 rows of data into the database, if they're not already there. (If you're currently getting those rows from some query, arrange instead for the query results to be stored in a temporary table in that database. This might be a proper application for a stored procedure.)
Now, you could express the operations that you need to perform, perhaps, as two separate SQL queries: one to do the updates, and one or more others to do the inserts. The first query might use a clause such as "WHERE FOO IN (SELECT BAR FROM #TEMP_TABLE ...)" to identify the rows to be updated. The others might use "WHERE FOO NOT IN (...)"
This is exactly the sort of thing I would expect to need a stored procedure for, because, if you think about it, the SQL server itself is precisely the right party to be doing the work: it's the only one that already has on hand the data you intend to manipulate, and it doesn't have to transmit those 55,000 rows anywhere. Perfect.
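The two-query idea can be sketched as a single batch: once the 55K rows sit in a staging table, one UPDATE handles the keys that already exist and one INSERT handles the new ones. All object names here are placeholders.

```csharp
using System.Data.SqlClient;

static void MergeFromStaging(string connectionString)
{
    const string sql = @"
        -- Update rows whose key already exists in the target.
        UPDATE t
        SET    t.SomeField = s.SomeField
        FROM   dbo.Target  AS t
        JOIN   dbo.Staging AS s ON s.KeyCol = t.KeyCol;

        -- Insert the rows whose key does not exist yet (4 fields, per the question).
        INSERT INTO dbo.Target (KeyCol, SomeField, FieldB, FieldC)
        SELECT s.KeyCol, s.SomeField, s.FieldB, s.FieldC
        FROM   dbo.Staging AS s
        WHERE  NOT EXISTS (SELECT 1 FROM dbo.Target AS t
                           WHERE t.KeyCol = s.KeyCol);";

    using var conn = new SqlConnection(connectionString);
    conn.Open();
    using var cmd = new SqlCommand(sql, conn);
    cmd.ExecuteNonQuery();
}
```

Two set-based statements replace 55K per-row existence checks, and nothing but the command text crosses the network.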

C# and SQL Server 2008 - Batch Update

I have a requirement where I need to update thousands of records in live database table, and although there are many columns in this table, I only need to update 2-3 columns.
Further, I can't hit the database thousands of times just for updates, which can be done in one batch update using a SQL Server table-valued parameter. But again, I shouldn't update all thousands of records in one go, for better error handling; instead I want to update records in batches of x*100.
So, below is my approach, please give your valuable inputs for any other alternatives or any change in the proposed process -
1 Fetch required records from database to List<T> MainCollection
2 Save this collection to XML file with each element Status = Pending
3 Take first 'n' elements from XML file with Status = Pending and add them to new List<T> SubsetCollection
4 Loop over List<T> SubsetCollection - make required changes to T
5 Convert List<T> SubsetCollection to DataTable
6 Call Update Stored Procedure and pass above DataTable as TVP
7 Update Status = Processed for XML Elements corresponding to List<T> SubsetCollection
8 If more records with Pending status exist in the XML file, go to Step 3.
Please guide for a better approach or any enhancement in above process.
I would do a database-only approach if possible and if not possible, eliminate the parts that will be the slowest. If you are unable to do all the work in a stored procedure, then retrieve all the records and make changes.
The next step is to write the changes to a staging table with SqlBulkCopy. This is a fast bulk loader that will copy thousands of records in seconds. You will store the primary key and the columns to be updated, as well as a batch number. The batch number is assigned to each batch of records, allowing another batch to be loaded without conflicting with the first.
Use a stored procedure on the server to process the records in batches of 100 or 1000 depending on performance. Pass the batch number to the stored procedure.
We use such a method to load and update millions of records in batches. The best speed is obtained by eliminating the network and allowing the database server to handle the bulk of the work.
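One way to sketch the batch-number idea: stamp every row in the staging DataTable with a batch number before the bulk copy, then pass that number to a server-side procedure that applies just that batch. The table and procedure names (dbo.UpdateStaging, dbo.ProcessUpdateBatch) are hypothetical.

```csharp
using System.Data;
using System.Data.SqlClient;

static void LoadBatch(string connectionString, DataTable changes, int batchNumber)
{
    // Tag every row with this load's batch number so concurrent loads
    // don't conflict in the shared staging table.
    changes.Columns.Add("BatchNumber", typeof(int));
    foreach (DataRow row in changes.Rows)
        row["BatchNumber"] = batchNumber;

    using var conn = new SqlConnection(connectionString);
    conn.Open();

    using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.UpdateStaging" })
        bulk.WriteToServer(changes);

    // The server-side proc processes only this batch (e.g. in sets of 100-1000).
    using var cmd = new SqlCommand("dbo.ProcessUpdateBatch", conn)
    { CommandType = CommandType.StoredProcedure };
    cmd.Parameters.AddWithValue("@BatchNumber", batchNumber);
    cmd.ExecuteNonQuery();
}
```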
I hope this might provide you with an alternate solution to evaluate.
It may not be best practice, but you could embed some logic inside a SQL Server CLR function. The function could be called by a query, a stored procedure, or a schedule to run at a certain time.
The only issue I can see is getting step 4 to make the required changes on T. Embedding that logic in the database could be detrimental to maintenance, but this is no different from people who embed massive amounts of business logic into stored procedures.
Either way, SQL Server CLR functions may be the way to go. You can create them in Visual Studio 2008 or 2010 (check the new database project types).
Tutorial : http://msdn.microsoft.com/en-us/library/w2kae45k(v=vs.80).aspx

Parsing excel sheet in C#, inserting new values into database

I am currently working on a project to parse an Excel sheet and insert any values into a database which were not inserted previously. The sheet contains roughly 80 date-value pairs for different names, with an average of about 1500 rows per pair.
Each name has 5 date-value pairs entered manually at the end of the week. Over the weekend, my process will parse the excel file and insert any values that are not currently in the database.
My question is, given the large amount of total data and the small amount added each week, how would you determine easily which values need to be inserted? I have considered adding another table to store the last date inserted for each name and taking any rows after that.
Simplest solution: I would bring it all into a staging table and do the compare on the server. Alternatively, SSIS with an appropriate sort and lookup could determine the differences and insert them.
120,000 rows is not a significant amount to compare in the database using SQL, but 120,000 individual calls to the database to verify whether each row exists might take a while on the client side.
Option 1 would be to create a "last date" table that is automatically stamped at the end of your weekend import. Then the next week your program could query the last record in that table and only read from the Excel file after that date. Probably your best bet.
Option 2 would be to find a unique field in the data and, row by row, check whether that key exists in the database. If it doesn't exist, you add it; if it does, you skip it. This would be my second choice if Option 1 didn't work how you expect.
It all depends on how bulletproof your solution needs to be. If you trust the users that the spreadsheet will not be tweaked in any way that would make it inconsistent, then your solution would be fine.
If you want to be on the safe side (e.g., if some old values could potentially change), you would need to compare the whole thing with the database. To be honest, the amount of data you are talking about here doesn't seem very big, especially when your process will run on a weekend. And you can still optimize by writing "batch"-type stored procedures for the database.
Thanks for the answers all.
I have decided, rather than creating a new table that stores the last date, I will just select the max date for each name, then insert values after that date into the table.
This assumes that the data prior to the last date remains consistent, which should be fine for this problem.
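The chosen approach might be sketched as below: ask the database for the newest date per name, then insert only spreadsheet rows after that date. The table and column names (dbo.Entries, Name, EntryDate) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Data.SqlClient;

static Dictionary<string, DateTime> GetMaxDates(string connectionString)
{
    var maxDates = new Dictionary<string, DateTime>();
    using var conn = new SqlConnection(connectionString);
    conn.Open();

    // One query replaces per-row existence checks: the newest stored
    // date per name tells us where to resume in the spreadsheet.
    using var cmd = new SqlCommand(
        "SELECT Name, MAX(EntryDate) FROM dbo.Entries GROUP BY Name", conn);
    using var reader = cmd.ExecuteReader();
    while (reader.Read())
        maxDates[reader.GetString(0)] = reader.GetDateTime(1);

    return maxDates;   // insert only rows whose date is later than this
}
```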
