Updating SQL Server database without duplicates - c#

I wrote this program a couple of months back that splits large delimited .CSV files and uploads them into a SQL Server database. Since the .CSV file was basically just being appended with new data each time, I set it up so that each time a user uploads data, it deletes everything from the table and uploads the newly appended data, like so:
myConnection.Open();
string sql = @"DELETE FROM TestTable;";
SqlCommand cmd = new SqlCommand(sql, myConnection);
cmd.ExecuteNonQuery();
myConnection.Close();
Now I have to set it up to upload the data without deleting previous entries in the table, but the catch is that I can't have duplicate data. Luckily the .CSV file comes with a unique identifier for each row, which I use as the primary key, but I'm having trouble coming up with an algorithm to do this. Is there perhaps something similar to the DELETE syntax above that I can use with SQL Server to only update unique data? I'm asking you guys since I'm not the biggest expert when it comes to SQL Server.
I have a couple of different classes and a background worker, so I didn't want to paste all of the code since it's a lot, but if you guys need any specifics let me know.
EDIT
There is an example here: http://msdn.microsoft.com/en-us/library/y06xa2h1(v=vs.80).aspx?cs-save-lang=1&cs-lang=csharp#code-snippet-1
of what I'm pretty sure I need to do, but the only thing I'm confused about is what dataSet1 is and where it comes from. I'm just using a connection string to open a connection to the SQL Server database and then using SqlDataAdapter to perform operations like inserts.
If anyone has any clarification on this that would be of great help.
Thanks

A cheapo way to achieve this would be to create a unique index on your unique identifying column in SQL Server, and tell it to just simply ignore any duplicates.
CREATE UNIQUE INDEX UIX_YourIndexNameHere
ON dbo.YourTableNameHere(YourUniqueColumnNameHere)
WITH (IGNORE_DUP_KEY = ON);
This means:
SQL Server will only allow unique values in this column - no duplicates
if any duplicates are being inserted, they will be tossed out without raising any error ("silently ignored", so to speak)
This also means:
possible duplicates will just be ignored - the existing row for that unique ID will remain as is (no updates)
new rows are being inserted
If you need to update pre-existing rows with their ID, I would recommend to do this:
bulk load the .CSV into a temporary staging table
use the MERGE command (hoping that you're using SQL Server 2008 or newer!) to update the real table from the staging table; this allows easily inserting new rows and updating (instead of ignoring) pre-existing rows
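A minimal sketch of the staging-plus-MERGE approach described above, assuming SQL Server 2008 or newer. Only TestTable comes from the question; the staging table dbo.TestTable_Staging and the columns Id, Name, and Amount are hypothetical placeholders for the real schema.

```sql
-- Hypothetical schema: Id is the unique identifier from the .CSV.
-- Assumes dbo.TestTable_Staging has already been bulk loaded
-- and contains each Id at most once.
MERGE dbo.TestTable AS tgt
USING dbo.TestTable_Staging AS src
    ON tgt.Id = src.Id
WHEN MATCHED THEN
    -- The row already exists: refresh its non-key columns.
    UPDATE SET tgt.Name   = src.Name,
               tgt.Amount = src.Amount
WHEN NOT MATCHED BY TARGET THEN
    -- The row is new: insert it.
    INSERT (Id, Name, Amount)
    VALUES (src.Id, src.Name, src.Amount);

-- Empty the staging table so the next upload starts clean.
TRUNCATE TABLE dbo.TestTable_Staging;
```

If existing rows should be left untouched, dropping the WHEN MATCHED clause leaves a statement that only inserts the rows not yet present.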

Related

SqlBulkCopy to call StoredProcedure to insert or update data in SQLDb

I have a list of records which I need to insert or update in a SQL DB, depending on whether each record is already present in the database.
The current flow is that I process the records one by one and call a stored procedure from my C# code, which does the task of inserting or updating the database.
This process is very inefficient. Can I use SqlBulkCopy to insert all of these into the SQL DB at once?
Will that increase the performance?
Regards
Ankur
SqlBulkCopy can only insert. If you need to upsert, you might want to SqlBulkCopy into a staging table (a separate table off to one side that isn't part of the main model), and then do the merge in TSQL. You might also want to think about concurrency (how many people can be using the staging table at once, etc).

Editing a large dataset for SQLBulkCopy into a SQL Server database

I have a VERY large (50 million+ records) dataset that I am importing from an old Interbase database into a new SQL Server database.
My current approach is:
1. acquire csv files from the Interbase database (done, used a program called "FBExport" I found somewhere online)
2. The schema of the old database doesn't match the new one (not under my control), so now I need to mass edit certain fields in order for them to work in the new database. This is the area I need help with
3. after editing to the correct schema, I am using SqlBulkCopy to copy the newly edited data set into the SQL Server database.
Part 3 works very quickly, diagnostics shows that importing 10,000 records at once is done almost instantly.
My current (slow) approach to part 2 is to read the csv file line by line and look up the relevant information, then insert a new row into my local table, and then SqlBulkCopy once my local table gets large. Two examples of the lookups: the csv file has an ID of the form XXX########, whereas the new database has a separate column for each of XXX and ########; and the csv file references a model via a string, but the new database references it via an ID in the model table.
My question is: What would be the "best" approach (performance-wise) for this data-editing step? I figure there is very likely a LINQ-type approach to this. Would that perform better, and if so, how would I go about doing it?
If step #3’s importing is very quick, I would be tempted to create a temporary database whose schema exactly matches the old database and import the records into it. Then I’d look at adding additional columns to the temporary table where you need to split the XXX######## into XXX and ########. You could then use SQL to split the source column into the two separate ones. You could likewise use SQL to do whatever ID based lookups and updates you need to ensure the record relationships continue to be correct.
Once the data has been massaged into a format which is acceptable, you can insert the records into the final tables using IDENTITY_INSERT ON, excluding all legacy columns/information.
In my mind, the primary advantage of doing it within the temporary SQL DB is that at any time you can write queries to ensure that record relationships using the old key(s) are still correctly related to records using the new database’s auto generated keys.
This is of course based on me being more comfortable doing data transformations/validation in SQL than in C#.
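As a rough sketch of the SQL-side transformation described in this answer: every table and column name below (dbo.LegacyStaging, LegacyCode, ModelName, CodePrefix, CodeNumber, ModelId, dbo.Model) is hypothetical, and the split assumes the legacy ID is always a 3-character prefix followed by 8 characters, as the XXX######## pattern in the question suggests. GO is the batch separator used by SSMS and sqlcmd; it is needed here so the new columns exist before the UPDATE statements are compiled.

```sql
-- Add the new columns next to the legacy ones (all names hypothetical).
ALTER TABLE dbo.LegacyStaging
    ADD CodePrefix char(3)    NULL,
        CodeNumber varchar(8) NULL,
        ModelId    int        NULL;
GO

-- Split 'XXX########' into its prefix and its numeric part in one pass.
UPDATE dbo.LegacyStaging
SET CodePrefix = LEFT(LegacyCode, 3),
    CodeNumber = SUBSTRING(LegacyCode, 4, 8);

-- Resolve the model name to the new database's model ID with a join.
UPDATE s
SET s.ModelId = m.Id
FROM dbo.LegacyStaging AS s
JOIN dbo.Model AS m
    ON m.Name = s.ModelName;

-- Model names that found no match still have a NULL ModelId
-- and need attention before the final insert.
SELECT DISTINCT ModelName
FROM dbo.LegacyStaging
WHERE ModelId IS NULL;
```

Each UPDATE touches the whole table in one statement, which is what replaces the line-by-line lookup in the question.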

Import CSV into SQL multiple tables

I'm migrating data from one system to another and will be receiving a CSV file with the data to import. The file could contain up to a million records to import. I need to get each line in the file, validate it and put the data into the relevant tables. For example, the CSV would be like:
Mr,Bob,Smith,1 high street,London,ec1,012345789,work(this needs to be looked up in another table to get the ID)
There's a lot more data than this example in the real files.
So, the SQL would be something like this:
Declare @UserID int
Insert into [User]
Values ('Mr', 'Bob', 'Smith', '0123456789')
Set @UserID = @@Identity
Insert into Address
Values ('1 high street', 'London', 'ec1', (select ID from AddressType where AddressTypeName = 'work'))
I was thinking of iterating over each row and call an SP with the parameters from the file which will contain the SQL above. Would this be the best way of tackling this? It's not time critical as this will just be run once when updating a site.
I'm using C# and SQL Server 2008 R2.
How about loading it into a temporary table as staging (note that this may be logically temporary, not necessarily technically), then processing it from there? This is standard ETL behavior (and a million rows is tiny for ETL): you first stage the data, then clean it, then put it in its final place.
When performing tasks of this nature, do not think in terms of looping through each record individually, as that will be a huge performance problem. In this case, bulk insert the records into a staging table, or use the import wizard to load a staging table (look out for the default column width of 50 characters, especially in the address field).
Then write set-based code to do any cleanup you need: removing bad telephone numbers, zip codes, email addresses or states; removing records missing data in fields that are required in your database; or transforming data using lookup tables. For example, suppose you have a table with certain required values. Those are likely not the same values that you will find in this file, so you need to convert them. We use doctor specialties a lot: our system might store them as GP, but the file might give us a value of General Practitioner. You need to look at all the non-matching values for the field and then determine whether you can map them to existing values, whether you need to throw the record out, or whether you need to add more values to your lookup table.
Once you have gotten rid of the records you don't want and cleaned up those you can in your staging table, import to the prod tables. Inserts should be written using the SELECT version of INSERT, not the VALUES clause, when you are writing more than one or two records.
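A small sketch of the set-based style described above, applied to the address part of the question's example. The AddressType table with its ID and AddressTypeName columns comes from the question; the staging table dbo.ImportStaging and the column names Street, City, Postcode, and AddressTypeID are assumptions made for illustration.

```sql
-- List address-type values in the file that have no match in the lookup
-- table, so they can be mapped, added to the lookup, or rejected.
SELECT s.AddressTypeName, COUNT(*) AS RowsAffected
FROM dbo.ImportStaging AS s
LEFT JOIN dbo.AddressType AS t
    ON t.AddressTypeName = s.AddressTypeName
WHERE t.ID IS NULL
GROUP BY s.AddressTypeName;

-- Set-based insert: one INSERT ... SELECT for all clean rows,
-- with the lookup resolved by a join instead of a per-row subquery.
INSERT INTO dbo.Address (Street, City, Postcode, AddressTypeID)
SELECT s.Street, s.City, s.Postcode, t.ID
FROM dbo.ImportStaging AS s
JOIN dbo.AddressType AS t
    ON t.AddressTypeName = s.AddressTypeName;
```

Linking each address to its newly created user row needs a mapping between staging rows and generated identity values, which this sketch leaves out.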

How to efficiently perform SQL Server database updates from my C# application

I am trying to figure out the best way to design my C# application which utilizes data from a SQL Server backend.
My application periodically has to update 55K rows each one at a time from a loop in my application. Before it does an update it needs to check if the record to be updated exists.
If it exists it updates one field.
If not it performs an insert of 4 fields.
The table to be updated has 600K rows.
What is the most efficient way to handle these updates/inserts from my application?
Should I create a data dictionary in c# and load the 600K records and query the dictionary first instead of the database?
Is this a faster approach?
Should I use a stored procedure?
What’s the best way to achieve maximum performance based on this scenario?
You could use SqlBulkCopy to upload to a temp table then have a SQL Server job do the merge.
You should try to avoid "update 55K rows each one at a time from a loop". That will be very slow.
Instead, try to find a way to batch the updates (n at a time). Look into SQL Server table-valued parameters as a way to send a set of data to a stored procedure.
Here's an article on updating multiple rows with TVPs: http://www.sqlmag.com/article/sql-server-2008/using-table-valued-parameters-to-update-multiple-rows
What if you did something like this, instead?
By some means, get those 55,000 rows of data into the database, if they're not already there. (If you're right now getting those rows from some query, arrange instead for the query results to be stored in a temporary table on that database. This might be a proper application for a stored procedure.)
Now, you could express the operations that you need to perform, perhaps, as two separate SQL queries: one to do the updates, and one or more others to do the inserts. The first query might use a clause such as "WHERE FOO IN (SELECT BAR FROM #TEMP_TABLE ...)" to identify the rows to be updated. The others might use "WHERE FOO NOT IN (...)"
This is, to be precise, exactly the sort of thing that I would expect to need to use a stored procedure to do, because, if you think about it, "the SQL server itself" is precisely the right party to be doing the work, because he's the only guy around who already has the data on-hand that you intend to manipulate. He alone doesn't have to "transmit" those 55,000 rows anywhere. Perfect.
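One possible shape for the two statements this answer describes, assuming the 55,000 rows are already in a temporary table. The names #Incoming, dbo.Items, Id, Status, Name, and CreatedOn are hypothetical. The sketch uses a join and NOT EXISTS rather than the IN / NOT IN clauses mentioned above, because NOT IN returns no rows when the subquery yields a NULL.

```sql
BEGIN TRANSACTION;

-- Rows that already exist: update the single field.
UPDATE t
SET t.Status = i.Status
FROM dbo.Items AS t
JOIN #Incoming AS i
    ON i.Id = t.Id;

-- Rows that do not exist yet: insert the four fields.
INSERT INTO dbo.Items (Id, Status, Name, CreatedOn)
SELECT i.Id, i.Status, i.Name, i.CreatedOn
FROM #Incoming AS i
WHERE NOT EXISTS (SELECT 1 FROM dbo.Items AS t WHERE t.Id = i.Id);

COMMIT TRANSACTION;
```

Running the UPDATE before the INSERT avoids updating rows that were only just inserted. This pair also works on versions of SQL Server that predate MERGE.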

SQL Server: SqlBulkCopy import causes primary key violations

I am trying to design a Windows-based application in C#.NET. I read a CSV file into a DataGridView and then insert that data into a database using SqlBulkCopy. My concern is that when I try to insert data into a table that already contains data, I get a primary key constraint error. I want to know whether it is possible to compare values before inserting into the database using SqlBulkCopy. If a value already exists in the database, it should be updated instead.
Can anyone provide me logic for this.
Thanks.
If you really, really know that the dupes aren't needed, just set the "Ignore Dupes" option on the index for the PK and you're done.
You need to read the data from the database into a list first.
Then when you load the csv you discard any duplicates.
Also, when you insert you should possibly ignore the primary key and let the sql database generate the key.
There's no way to do any logic when you're firehosing data in via SQLBulkCopy. Even triggers (shudder) are turned off.
What you should do is bulkcopy your data into an empty staging table, then run a stored procedure that merges the data from the staging table into the real, non-empty table.
If you're using SQL 2008, then you can use the MERGE command, otherwise you'll just have to code around it with a separate update and insert.
