Strategy to optimize this large SQL insert via C#?

I have about 1.5 million files I need to insert records for in the database.
Each record is inserted with a key that includes the name of the file.
The catch: The files are not uniquely identified currently.
So, what we'd like to do is, for each file:
Insert a record. One of the fields in the record should include an amazon S3 key which should include the ID of the newly inserted record.
Rename the file to include the ID so that it matches the format of the key.
The best thing I can think to do is:
Run an individual insert command that returns the ID of the added row.
Add that back as a property to the individual business object I'm looping through.
Generate an update statement that updates the S3 key to include the ID
Output the file, concatenating the ID onto the end of the file name.
As far as I can tell, that works out to:
1.5 million insert statements,
1.5 million individual SqlCommand executions and reads (because we need the ID back),
1.5 million times setting a property on an object.
1.5 million update statements generated and executed
Perhaps this could be made into one giant concatenated update statement to do them all at once; I'm not sure if that helps.
1.5 million file copies.
I can't get around the actual file part, but for the rest, is there a better strategy I'm not seeing?

If you make the client application generate the IDs, you can use a straightforward SqlBulkCopy to insert all rows at once. It will be done in seconds.
If you want to keep the IDENTITY property of the column, you can run DBCC CHECKIDENT with RESEED to advance the identity counter by 1.5m, giving you a guaranteed gap that you can insert into. If the number of rows is not known up front, you can perform the insert in smaller chunks of maybe 100k until you are done.
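A minimal sketch of that approach, assuming a hypothetical dbo.Files table (Id INT IDENTITY, Name, S3Key) and that you have already reseeded the identity to reserve a gap starting at a known value; the names and connection string are placeholders:

using System;
using System.Data;
using System.Data.SqlClient;

class BulkInsertWithClientIds
{
    static void Main()
    {
        var connectionString = "...";                 // your connection string
        var firstReservedId = 1000001;                // first ID in the gap reserved via DBCC CHECKIDENT
        var fileNames = new[] { "a.pdf", "b.pdf" };   // in reality, ~1.5 million names

        // Build every row locally, assigning the ID ourselves so no round trip is needed per row.
        var table = new DataTable();
        table.Columns.Add("Id", typeof(int));
        table.Columns.Add("Name", typeof(string));
        table.Columns.Add("S3Key", typeof(string));

        var id = firstReservedId;
        foreach (var name in fileNames)
        {
            table.Rows.Add(id, name, "files/" + id + "_" + name);   // S3 key already contains the ID
            id++;
        }

        // KeepIdentity makes SQL Server take our Id values instead of generating new ones.
        using (var bulk = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.KeepIdentity))
        {
            bulk.DestinationTableName = "dbo.Files";
            bulk.BatchSize = 10000;
            bulk.WriteToServer(table);   // one bulk operation instead of 1.5m INSERTs
        }
    }
}

From there, renaming the files is just a loop over the same in-memory list, since each object already knows its ID.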

You will cut the number of SQL statements in half by not relying on the database to generate your ID for each row. Do everything locally (including the assignment of an ID) and then do a single batch of inserts at the end, with identity_insert on.
This will cause SQL Server to use your ID's for this batch of records.
If this is still too slow (and 1.5 million inserts might be), the next step would be to output your data to a text file (XML, comma delimited, or whatever) and then do a bulk import operation on the file.
That's as fast as you will be able to make it, I think.
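For the flat-file route, a rough sketch; the table name, file share, and delimiters are assumptions, and the path must be readable by the SQL Server service account, not just by your application:

using System.Data.SqlClient;
using System.IO;

class BulkImportFromFile
{
    static void Main()
    {
        // Pre-built "Id|Name|S3Key" lines, with the IDs you assigned locally.
        var lines = new[]
        {
            "1000001|a.pdf|files/1000001_a.pdf",
            "1000002|b.pdf|files/1000002_b.pdf"
        };
        var path = @"\\fileshare\staging\files.txt";
        File.WriteAllLines(path, lines);

        var sql = @"BULK INSERT dbo.Files
                    FROM '\\fileshare\staging\files.txt'
                    WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', KEEPIDENTITY, TABLOCK);";

        using (var connection = new SqlConnection("...your connection string..."))
        using (var command = new SqlCommand(sql, connection))
        {
            command.CommandTimeout = 0;   // a 1.5m-row import can exceed the default 30-second timeout
            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}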

Related

Insert and Check for Copies of Data using SQL and C#

So I'm upgrading an old parser right now. It's written in C# and uses SQL to insert records into a database.
Currently it reads and parses a few thousand lines of data from a file, then inserts the new data into a database containing over a million records.
Sometimes it can take over 10 minutes just to add a few thousand lines.
I've come to the conclusion that the performance bottleneck is a SQL command that uses an IF NOT EXISTS statement to determine whether the row being inserted already exists, and inserts the record only if it doesn't.
I believe the problem is that it just takes way too long to call the IF NOT EXISTS on every single row in the new data.
Is there a faster way to determine whether data exists already or not?
I was thinking of inserting all of the records first anyway using the SqlBulkCopy class, then running a stored procedure to remove the duplicates.
Does anyone else have any suggestions or methods to do this as efficiently and quickly as possible? Anything would be appreciated.
EDIT: To clarify, I'd run a stored procedure (on the large table) after copying the new data into the large table
large table = 1,000,000+ rows
1. Create an IDataReader to loop over your source data.
2. Place the values into a strong dataset.
3. Every N number of rows, send the dataset (.GetXml) to a stored procedure. Let's say 1000 for the heck of it.
4. Have the stored procedure shred the xml.
5. Do your INSERT/UPDATE based on this shredded xml.
6. Return from the procedure, keep looping until you're done.
Here is an older example:
http://granadacoder.wordpress.com/2009/01/27/bulk-insert-example-using-an-idatareader-to-strong-dataset-to-sql-server-xml/
The key is that you are doing "bulk" operations instead of row by row, and you can pick a sweet-spot batch size (1,000, for example) that gives you the best performance.
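A rough sketch of that loop, assuming a hypothetical dbo.UpsertParsedRows procedure that takes one xml parameter and shreds it with nodes()/value() before doing the INSERT/UPDATE; the single LineText column stands in for whatever parsed columns you actually have:

using System.Data;
using System.Data.SqlClient;
using System.IO;

class XmlBatchLoader
{
    static void SendBatches(string connectionString, string sourceFile)
    {
        const int batchSize = 1000;   // the "sweet spot"; tune by experiment

        // Strongly shaped DataSet: one Row table whose XML the procedure will shred.
        var dataSet = new DataSet("Batch");
        var rows = dataSet.Tables.Add("Row");
        rows.Columns.Add("LineText", typeof(string));

        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            foreach (var line in File.ReadLines(sourceFile))
            {
                rows.Rows.Add(line);
                if (rows.Rows.Count == batchSize)
                {
                    SendBatch(connection, dataSet);
                    rows.Clear();
                }
            }
            if (rows.Rows.Count > 0)
                SendBatch(connection, dataSet);   // final partial batch
        }
    }

    static void SendBatch(SqlConnection connection, DataSet dataSet)
    {
        using (var command = new SqlCommand("dbo.UpsertParsedRows", connection))
        {
            command.CommandType = CommandType.StoredProcedure;
            // Produces <Batch><Row><LineText>...</LineText></Row>...</Batch>
            command.Parameters.Add("@rowsXml", SqlDbType.Xml).Value = dataSet.GetXml();
            command.ExecuteNonQuery();
        }
    }
}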

How large of a SQL string (for mass update) can I realistically use in C# .net4?

I'm receiving and parsing a large text file.
In that file I have a numerical ID identifying a row in a table, and another field that I need to update.
ID Current Location
=========================
1 Boston
2 Cambridge
3 Idaho
I was thinking of creating a single SQL command string and firing that off using ADO.NET, but some of the files I'm going to receive have thousands of lines. Is this doable, or is there a limit I'm not seeing?
If you may have thousands of lines, then composing a SQL statement is definitely NOT the way to go. Better code-based alternatives include:
Use SqlBulkCopy to insert the change data into a staging table and then UPDATE your target table using the staging table as the source. It also has excellent batching options (unlike the other choices).
Write a stored procedure to do the Update that accepts an XML parameter that contains the UPDATE data.
Write a stored procedure to do the Update that accepts a table-valued parameter that contains the UPDATE data.
I have not compared them myself but it is my understanding that #3 is generally the fastest (though #1 is plenty fast for almost any need).
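For option 3, a hedged sketch, assuming you have created a table type dbo.LocationUpdate (Id int, CurrentLocation nvarchar) and a procedure dbo.UpdateLocations that joins the parameter against the target table; all of those names are placeholders:

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

class TvpUpdateSketch
{
    static void UpdateLocations(string connectionString, IDictionary<int, string> changes)
    {
        // Shape must match the SQL table type:
        //   CREATE TYPE dbo.LocationUpdate AS TABLE (Id int PRIMARY KEY, CurrentLocation nvarchar(100));
        var parameterTable = new DataTable();
        parameterTable.Columns.Add("Id", typeof(int));
        parameterTable.Columns.Add("CurrentLocation", typeof(string));
        foreach (var change in changes)
            parameterTable.Rows.Add(change.Key, change.Value);

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand("dbo.UpdateLocations", connection))
        {
            command.CommandType = CommandType.StoredProcedure;
            var p = command.Parameters.AddWithValue("@changes", parameterTable);
            p.SqlDbType = SqlDbType.Structured;
            p.TypeName = "dbo.LocationUpdate";
            connection.Open();
            // Inside the proc: UPDATE t SET t.CurrentLocation = c.CurrentLocation
            //                  FROM TargetTable t JOIN @changes c ON c.Id = t.Id;
            command.ExecuteNonQuery();
        }
    }
}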
Writing one huge INSERT statement will be very slow. You also don't want to parse the whole massive file at once. What you need to do is something along the lines of:
Figure out a good chunk size. Let's call it chunk_size. This will be the number of records you'll read from the file at a time.
Load chunk_size number of records from the file into a DataTable.
Use SqlBulkCopy to insert the DataTable into the DB.
Repeat 2 & 3 until the file is done.
You'll have to experiment to find an optimal size for chunk_size so start small and work your way up.
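A minimal sketch of that chunked loop, assuming a tab-delimited file and a dbo.LocationStaging table with matching columns (both are assumptions):

using System.Data;
using System.Data.SqlClient;
using System.IO;

class ChunkedBulkCopy
{
    static void Load(string connectionString, string filePath)
    {
        const int chunkSize = 5000;   // start small and work your way up

        var chunk = new DataTable();
        chunk.Columns.Add("Id", typeof(int));
        chunk.Columns.Add("CurrentLocation", typeof(string));

        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.LocationStaging";

            foreach (var line in File.ReadLines(filePath))   // streams the file, never loads it all
            {
                var fields = line.Split('\t');
                chunk.Rows.Add(int.Parse(fields[0]), fields[1]);

                if (chunk.Rows.Count == chunkSize)
                {
                    bulk.WriteToServer(chunk);
                    chunk.Clear();
                }
            }
            if (chunk.Rows.Count > 0)
                bulk.WriteToServer(chunk);   // last partial chunk
        }
    }
}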
I'm not sure of an actual limit, if one exists, but why not take "bite sized" chunks of the file that you feel comfortable with and break it into several commands? You can always wrap it in a single transaction if it's important that they all fail or succeed.
Say grab 250 lines at a time, or whatever.
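A sketch of that idea, assuming you have already generated the individual UPDATE statements as strings; the batch size of 250 is arbitrary:

using System;
using System.Collections.Generic;
using System.Data.SqlClient;

class BatchedTransactionalUpdate
{
    static void Run(string connectionString, List<string> updateStatements)
    {
        const int statementsPerCommand = 250;

        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var transaction = connection.BeginTransaction())
            {
                for (var i = 0; i < updateStatements.Count; i += statementsPerCommand)
                {
                    var count = Math.Min(statementsPerCommand, updateStatements.Count - i);
                    var sql = string.Join(";" + Environment.NewLine,
                                          updateStatements.GetRange(i, count));
                    using (var command = new SqlCommand(sql, connection, transaction))
                        command.ExecuteNonQuery();
                }
                transaction.Commit();   // everything succeeds, or nothing does
            }
        }
    }
}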

Import CSV into SQL multiple tables

I'm migrating data from one system to another and will be receiving a CSV file with the data to import. The file could contain up to a million records to import. I need to get each line in the file, validate it and put the data into the relevant tables. For example, the CSV would be like:
Mr,Bob,Smith,1 high street,London,ec1,012345789,work (this needs to be looked up in another table to get the ID)
There's a lot more data than this example in the real files.
So, the SQL would be something like this:
Declare @UserID int
Insert into [User]
Values ('Mr', 'Bob', 'Smith', '0123456789')
Set @UserID = @@Identity
Insert into Address
Values (@UserID, '1 high street', 'London', 'ec1', (Select ID from AddressType where AddressTypeName = 'work'))
I was thinking of iterating over each row and calling an SP with the parameters from the file, where the SP contains the SQL above. Would this be the best way of tackling it? It's not time critical, as this will just be run once when updating a site.
I'm using C# and SQL Server 2008 R2.
What about loading it into a temporary table (note that this may be logically temporary, not necessarily technically) as staging, then processing it from there? This is standard ETL behavior (and a million rows is tiny for ETL): you first stage the data, then clean it, then put it in its final place.
When performing tasks of this nature, you do not think in terms of looping through each record individually, as that would be a huge performance problem. In this case you bulk insert the records to a staging table, or use the import wizard to load them into a staging table (watch out for the default 50-character column width, especially in the address field). Then you write set-based code to do any clean-up you need: removing bad telephone numbers, zip codes, or email addresses, dropping records that are missing data in fields your database requires, and transforming data using lookup tables.

Suppose you have a table with certain required values; those are likely not the same values you will find in this file, so you need to convert them. We use doctor specialties a lot, so our system might store one as GP while the file gives us a value of General Practitioner. You need to look at all the non-matching values for the field and then determine whether you can map them to existing values, whether you need to throw the record out, or whether you need to add more values to your lookup table. Once you have gotten rid of the records you don't want and cleaned up those you can in your staging table, you import to the prod tables. Inserts should be written using the SELECT version of INSERT, not the VALUES clause, when you are writing more than one or two records.
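To make the set-based final step concrete, here's a sketch of what the import from a hypothetical dbo.ImportStaging table could look like after the clean-up passes. All table and column names here are assumptions; in practice you would carry a staging key so each Address row can find its User row, rather than re-joining on names:

using System.Data.SqlClient;

class StagingImport
{
    static void ImportCleanRows(string connectionString)
    {
        // Set-based INSERT ... SELECT: the whole staging table in two statements, no per-row loop.
        const string sql = @"
            INSERT INTO [User] (Title, FirstName, LastName, Phone)
            SELECT s.Title, s.FirstName, s.LastName, s.Phone
            FROM dbo.ImportStaging AS s;

            INSERT INTO Address (UserID, Street, City, Postcode, AddressTypeID)
            SELECT u.ID, s.Street, s.City, s.Postcode, adt.ID
            FROM dbo.ImportStaging AS s
            JOIN [User] AS u ON u.FirstName = s.FirstName AND u.LastName = s.LastName AND u.Phone = s.Phone
            JOIN AddressType AS adt ON adt.AddressTypeName = s.AddressTypeName;";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            command.CommandTimeout = 0;   // a million-row import can take a while
            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}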

How to write data in database efficiently using c#?

My Windows app is reading a text file and inserting it into the database. The problem is that the text file is extremely big (at least for our low-end machines). It has 100 thousand rows and it takes time to write it into the database.
Can you guys suggest how I should read and write the data efficiently so that it does not hog machine memory?
FYI...
Column delimiter : '|'
Row delimiter : NewLine
It has approximately 10 columns.. (It has an information of clients...like first name, last name, address, phones, emails etc.)
CONSIDER THAT...I AM RESTRICTED FROM USING BULK CMD.
You don't say what kind of database you're using, but if it is SQL Server, then you should look into the BULK INSERT command or the BCP utility.
Given that there is absolutely no chance of getting help from your security folks and using BULK commands, here is the approach I would take:
Make sure you are reading the entire text file first, before inserting into the database, thus reducing the I/O.
Check what indexes you have on the destination table. Can you insert into a temporary table with no indexes or dependencies so that the individual inserts are fast?
Does this data need to be visible immediately after insert? If not then you can have a scheduled job to read from the temp table in step 2 above and insert into the destination table (that has indexes, foreign keys etc.).
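Given the no-bulk restriction, a sketch of steps 1 and 2 above; the dbo.ClientStaging table, its columns, and the 1,000-row commit size are all assumptions:

using System.Data;
using System.Data.SqlClient;
using System.IO;

class NonBulkLoader
{
    static void Load(string connectionString, string filePath)
    {
        const int rowsPerTransaction = 1000;

        var lines = File.ReadAllLines(filePath);   // read the whole file up front (step 1)

        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            var transaction = connection.BeginTransaction();
            var command = new SqlCommand(
                "INSERT INTO dbo.ClientStaging (FirstName, LastName, Address, Phone, Email) " +
                "VALUES (@first, @last, @address, @phone, @email)", connection, transaction);
            command.Parameters.Add("@first", SqlDbType.NVarChar, 100);
            command.Parameters.Add("@last", SqlDbType.NVarChar, 100);
            command.Parameters.Add("@address", SqlDbType.NVarChar, 200);
            command.Parameters.Add("@phone", SqlDbType.NVarChar, 50);
            command.Parameters.Add("@email", SqlDbType.NVarChar, 200);

            var rowsInTransaction = 0;
            foreach (var line in lines)
            {
                var fields = line.Split('|');
                command.Parameters["@first"].Value = fields[0];
                command.Parameters["@last"].Value = fields[1];
                command.Parameters["@address"].Value = fields[2];
                command.Parameters["@phone"].Value = fields[3];
                command.Parameters["@email"].Value = fields[4];
                command.ExecuteNonQuery();

                // Commit every 1,000 rows so each transaction (and log entry) stays small.
                if (++rowsInTransaction == rowsPerTransaction)
                {
                    transaction.Commit();
                    transaction = connection.BeginTransaction();
                    command.Transaction = transaction;
                    rowsInTransaction = 0;
                }
            }
            if (rowsInTransaction > 0)
                transaction.Commit();
        }
    }
}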
Is it possible for you to register a custom assembly in SQL Server? (I'm assuming it's SQL Server because you've already mentioned bulk insert.)
Then you can call your assembly to do (mostly) whatever you need, like getting the file from some service (or whatever your option is), parsing it and inserting directly into the tables.
This is not an option I like, but it can be a life-saver sometimes.

Faster way to update 250k rows with SQL

I need to update about 250k rows on a table and each field to update will have a different value depending on the row itself (not calculated based on the row id or the key but externally).
I tried a parameterized query, but it turned out to be slow (I can still try a table-valued parameter, SqlDbType.Structured, in SQL Server 2008, but I'd like a general way to do it on several databases, including MySQL, Oracle and Firebird).
Making one huge concatenation of individual updates is also slow (but about 2 times faster than making thousands of individual calls (round trips!) using parameterized queries).
What about creating a temp table and running an update joining my table and the tmp one? Will it work faster?
How slow is "slow"?
The main problem with this is that it would create an enormous entry in the database's log file (if there's a power failure halfway through the update, the database needs to have logged each action so that it can roll back). That is most likely where the "slowness" is coming from, more than anything else (obviously with such a large number of rows there are other ways to make the thing inefficient, e.g. doing one DB round trip per update would be unbearably slow; I'm just saying that once you eliminate the obvious things, you'll still find it's pretty slow).
There's a few ways you can do it more efficiently. One would be to do the update in chunks, 1,000 rows at a time, say. That way, the database writes lots of small log entries, rather than one really huge one.
Another way would be to turn off, or turn "down", the database's logging for the duration of the update. In SQL Server, for example, you can set the recovery model to "simple" or "bulk-logged", which would speed it up considerably (with the caveat that you are more at risk if there's a power failure or something during the update).
Edit Just to expand a little more, probably the most efficient way to actually execute the queries in the first place would be to do a BULK INSERT of all the new rows into a temporary table, and then do a single UPDATE of the existing table from that (or to do the UPDATE in chunks of 1,000 as I said above). Most of my answer was addressing the problem once you've implemented it like that: you'll still find it's pretty slow...
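A rough sketch of that last suggestion, assuming a pre-created staging table dbo.NewValues (Id, NewValue) and a dbo.TargetTable to update; the names and columns are placeholders:

using System.Data;
using System.Data.SqlClient;

class StagedUpdate
{
    static void ApplyUpdates(string connectionString, DataTable newValues)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // 1. Bulk copy the ~250k new values into the staging table.
            using (var bulk = new SqlBulkCopy(connection, SqlBulkCopyOptions.TableLock, null))
            {
                bulk.DestinationTableName = "dbo.NewValues";
                bulk.WriteToServer(newValues);
            }

            // 2. One set-based UPDATE joining the target to the staging table.
            const string sql = @"
                UPDATE t
                SET    t.SomeField = s.NewValue
                FROM   dbo.TargetTable AS t
                JOIN   dbo.NewValues AS s ON s.Id = t.Id;";
            using (var command = new SqlCommand(sql, connection))
            {
                command.CommandTimeout = 0;
                command.ExecuteNonQuery();
            }
        }
    }
}

The final UPDATE can also be broken into chunks, as described above, if a single statement produces too large a log entry.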
call a stored procedure if possible
If the columns updated are part of indexes you could
drop these indexes
do the update
re-create the indexes.
If you need these indexes to retrieve the data, well, it doesn't help.
You should use SqlBulkCopy with the KeepIdentity option set.
As part of a SqlTransaction do a query to SELECT all the records that need updating and then DELETE THEM, returning those selected (and now removed) records. Read them into C# in a single batch. Update the records on the C# side in memory, now that you've narrowed the selection and then SqlBulkCopy those updated records back, keys and all. And don't forget to commit the transaction. It's more work, but it's very fast.
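A sketch of that select/delete/re-insert pattern; NeedsUpdate, TargetTable and SomeField are placeholders, and the DELETE uses an OUTPUT clause so the select and delete happen in one statement:

using System.Data;
using System.Data.SqlClient;

class DeleteModifyReinsert
{
    static void Run(string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var transaction = connection.BeginTransaction())
            {
                // 1. Pull out the rows that need changing and delete them in a single statement.
                var rows = new DataTable();
                using (var command = new SqlCommand(
                    "DELETE FROM dbo.TargetTable OUTPUT DELETED.* WHERE NeedsUpdate = 1",
                    connection, transaction))
                using (var reader = command.ExecuteReader())
                    rows.Load(reader);

                // 2. Apply the externally calculated changes in memory.
                foreach (DataRow row in rows.Rows)
                    row["SomeField"] = Transform((string)row["SomeField"]);

                // 3. Bulk copy the rows back, keeping their original keys, inside the same transaction.
                using (var bulk = new SqlBulkCopy(connection, SqlBulkCopyOptions.KeepIdentity, transaction))
                {
                    bulk.DestinationTableName = "dbo.TargetTable";
                    bulk.WriteToServer(rows);
                }

                transaction.Commit();
            }
        }
    }

    static string Transform(string currentValue)
    {
        return currentValue.Trim();   // stand-in for whatever the real external calculation is
    }
}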
Here's what I would do:
Retrieve the entire table, that is, the columns you need in order to calculate/retrieve/find/produce the changes externally
Calculate/produce those changes
Run a bulk insert to a temporary table, uploading the information you need server-side in order to do the changes. This would require the key information + new values for all the rows you intend to change.
Run SQL on the server to copy new values from the temporary table into the production table.
Pros:
Running the final step server-side is faster than running tons and tons of individual SQL, so you're going to lock the table in question for a shorter time
Bulk insert like this is fast
Cons:
Requires extra space in your database for the temporary table
Produces more log data, logging both the bulk insert and the changes to the production table
Here are things that can make your updates slow:
executing updates one by one through parametrized query
solution: do update in one statement
large transaction creates big log entry
see codeka's answer
updating indexes (the RDBMS will update the index after each row; if you change an indexed column, it can be very costly on a large table)
if you can, drop indices before update and recreate them after
updating a field that has a foreign key constraint: for each updated record the RDBMS will go and look for the appropriate key
if you can, disable foreign key constraints before update and enable them after update
triggers and row level checks
if you can, disable triggers before update and enable them after
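A sketch of wrapping a big update with those disable/enable steps on SQL Server; the index, constraint, and trigger names are placeholders, NOCHECK leaves the constraint untrusted until it is re-checked, and the other databases mentioned in the question have their own equivalents:

using System.Data.SqlClient;

class UpdateWithReducedOverhead
{
    static void Run(string connectionString, string bigSetBasedUpdateSql)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // Drop the per-row overhead before the update...
            Execute(connection, "ALTER INDEX IX_Target_SomeField ON dbo.TargetTable DISABLE;");
            Execute(connection, "ALTER TABLE dbo.TargetTable NOCHECK CONSTRAINT FK_Target_Parent;");
            Execute(connection, "DISABLE TRIGGER trg_Target_Audit ON dbo.TargetTable;");

            Execute(connection, bigSetBasedUpdateSql);   // the single set-based UPDATE

            // ...and put everything back afterwards.
            Execute(connection, "ENABLE TRIGGER trg_Target_Audit ON dbo.TargetTable;");
            Execute(connection, "ALTER TABLE dbo.TargetTable WITH CHECK CHECK CONSTRAINT FK_Target_Parent;");
            Execute(connection, "ALTER INDEX IX_Target_SomeField ON dbo.TargetTable REBUILD;");
        }
    }

    static void Execute(SqlConnection connection, string sql)
    {
        using (var command = new SqlCommand(sql, connection))
        {
            command.CommandTimeout = 0;
            command.ExecuteNonQuery();
        }
    }
}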
