I'm currently working with an import file that has 460,000 rows of data in it. Each row consists of an ID and a quantity (e.g. "1,120"). This information is read from the file and should then be used to update each individual row in a database (e.g. UPDATE item SET quantity = QTY WHERE id = 1).
The problem I'm having, though, is actually being able to run the query efficiently. If I run an individual query for each line, it's really not going to work (as I've found out the hard way).
I'm not an experienced SQL user by any means and I'm currently learning, but from what I've seen, the web doesn't seem to have any useful results on this.
I was wondering if anybody had experience with updating such a large dataset, and if so, would they be willing to share the methods that they used to achieve this?
460k rows isn't a lot, so you should be okay there.
I'd recommend importing the entire dataset into a temporary table or table variable. While you're getting the solution working, start with an actual physical table, which you can DROP or TRUNCATE between runs.
Create the table, then import all the data into it. Then, do your table update based on a join to this import table.
Discard the import table when appropriate. Once this is all working how you want it to, you can do the entire thing using a stored procedure, and use a temporary table to handle the imported data while you are working with it.
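For example, a rough sketch of that flow from C# might look like the following, assuming the file has already been parsed into a DataTable with Id and Quantity columns, and that the target table is the item table from the question (all names here are illustrative):

using System.Data;
using System.Data.SqlClient;

static class QuantityImporter
{
    public static void ImportQuantities(DataTable importRows, string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();

            // Recreate the physical import table; switch to a #temp table once the process is stable.
            using (var create = new SqlCommand(
                "IF OBJECT_ID('dbo.ItemImport', 'U') IS NOT NULL DROP TABLE dbo.ItemImport; " +
                "CREATE TABLE dbo.ItemImport (Id int PRIMARY KEY, Quantity int);", conn))
            {
                create.ExecuteNonQuery();
            }

            // Push all 460k parsed rows across in one bulk operation.
            using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.ItemImport" })
            {
                bulk.WriteToServer(importRows);
            }

            // One set-based update driven by a join to the import table.
            using (var update = new SqlCommand(
                "UPDATE i SET i.quantity = s.Quantity " +
                "FROM dbo.item AS i JOIN dbo.ItemImport AS s ON s.Id = i.id;", conn))
            {
                update.CommandTimeout = 0; // allow the single large update to finish
                update.ExecuteNonQuery();
            }
        }
    }
}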
460000 rows is a small dataset. Really small.
Bulk insert into a temporary table, then use a single UPDATE command to update the original data in one pass.
Evening all,
Background: I have many XML data files which I need to import into a SQL database via my C# WPF-based Windows application. While some data files are a simple and straightforward INSERT, many require validation and verification checks, and need GUIDs from existing records (from within the db) before the INSERT takes place, in order to maintain certain relationships.
So I've broken the process into three stages:
Validate records. Many checks exist, e.g. the 50,000 accounts in the XML file must each have a matching reference to an associated record already in the database. If there is no corresponding account, abandon the entire import process. Another would be that only the 'Current' record can be updated, so again, if it points to a 'historic' record then it needs to crash and burn.
UPDATE the associated database records, e.g. set from 'Current' to 'Historic', as the record is to be superseded by what's to come next...
INSERT the records across multiple tables. e.g. 50,000 accounts to be inserted.
So, in summary, I need to validate the records with a few checks before any changes can be made to the database. At that point I will change various tables for existing records before inserting the 'latest' records.
My first attempt to resolve this situation was to load the XML file into an XmlDocument or XDocument instance, iterate over every account in the XML file and perform a SQL command for every account: verify it exists, verify it's a current account, change the record before inserting it. Rinse and repeat for thousands of records. Not ideal, to say the least.
So my second attempt is to load the XML file into a data table, export the corresponding accounts from the database into another data table, and perform a nested-loop validation, e.g. does DT1.AccountID exist in DT2.AccountID, move DT2.GUID to DT1.GUID, etc. I appreciate this could also be a slow process. That said, I then have the luxury of performing both the UPDATE and INSERT stored procedures with a table-valued parameter (TVP) and making use of the data table information.
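For what it's worth, this is roughly how I picture the TVP call looking; the type, proc and column names below are placeholders rather than my real schema:

using System.Data;
using System.Data.SqlClient;

class AccountImporter
{
    // Assumes a table type and proc already exist on the server, e.g.
    // CREATE TYPE dbo.AccountImportType AS TABLE (AccountRef nvarchar(50), AccountGuid uniqueidentifier NULL)
    // and CREATE PROCEDURE dbo.usp_UpdateAccounts @Accounts dbo.AccountImportType READONLY
    public static void SaveAccounts(DataTable accounts, string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("dbo.usp_UpdateAccounts", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;

            // The DataTable's columns must line up with the table type's columns.
            var tvp = cmd.Parameters.AddWithValue("@Accounts", accounts);
            tvp.SqlDbType = SqlDbType.Structured;
            tvp.TypeName = "dbo.AccountImportType";

            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}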
I appreciate many will suggest letting SQL do all of the work, but I'm lacking in that skill set unfortunately (happy to learn if that's the general consensus); I would much rather do the work in C# code if at all possible.
Any views on this are greatly appreciated. Many of the questions I've found are around bulk INSERT, not so much about validating existing records, followed by updating records, followed by inserting records. I suppose my question is around the first part, the validation. Does extracting the data from the db into a data table to work on seem wrong, old fashioned, pointless?
I'm sure I've missed out some vital piece of information, so apologies if anything is unclear.
Cheers
I'm migrating data from one system to another and will be receiving a CSV file with the data to import. The file could contain up to a million records to import. I need to get each line in the file, validate it and put the data into the relevant tables. For example, the CSV would be like:
Mr,Bob,Smith,1 high street,London,ec1,012345789,work(this needs to be looked up in another table to get the ID)
There's a lot more data than this example in the real files.
So, the SQL would be something like this:
Declare @UserID int
Insert into [User]
Values ('Mr', 'Bob', 'Smith', '0123456789')
Set @UserID = @@Identity
Insert into Address
Values (@UserID, '1 high street', 'London', 'ec1', (select ID from AddressType where AddressTypeName = 'work'))
I was thinking of iterating over each row and calling an SP with the parameters from the file, with the SP containing the SQL above. Would this be the best way of tackling this? It's not time critical, as this will just be run once when updating a site.
I'm using C# and SQL Server 2008 R2.
What about loading it into a temporary table (note that this may be logically temporary, not necessarily technically) as staging, then processing it from there? This is standard ETL behaviour (and a million rows is tiny for ETL): you first stage the data, then clean it, then put it in its final place.
When performing tasks of this nature, you do not think in terms of rotating through each record individually, as that would be a huge performance problem. In this case you bulk insert the records into a staging table, or use the import wizard to load them into a staging table (watch out for the default 50-character column width, especially in the address field).
Then you write set-based code to do any clean-up you need: removing bad telephone numbers, zip codes or email addresses, dropping records that are missing data in fields your database requires, and transforming data using lookup tables. Suppose you have a table of certain required values; those are likely not the same values you will find in this file, so you need to convert them. We use doctor specialties a lot, so our system might store a specialty as 'GP' while the file gives us 'General Practitioner'. You need to look at all the non-matching values for the field and then decide whether you can map them to existing values, whether you need to throw the record out, or whether you need to add more values to your lookup table.
Once you have got rid of the records you don't want and cleaned up the ones you can in your staging table, you import to the prod tables. Inserts should be written using the SELECT form of INSERT, not the VALUES clause, when you are writing more than one or two records.
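To make that concrete, here is a rough sketch of the set-based steps run from C# once the staging table is loaded; every table and column name below is invented purely to show the pattern:

using System.Data.SqlClient;

static class StagingCleanup
{
    public static void CleanAndLoad(string connectionString)
    {
        const string sql = @"
            -- Throw out rows missing data in fields the database requires.
            DELETE FROM dbo.Staging
            WHERE Phone IS NULL OR Zip IS NULL;

            -- Map incoming values onto the lookup table's values,
            -- e.g. 'General Practitioner' in the file becomes 'GP' in our system.
            UPDATE s
            SET s.SpecialtyCode = m.SystemValue
            FROM dbo.Staging AS s
            JOIN dbo.SpecialtyMap AS m ON m.IncomingValue = s.SpecialtyRaw;

            -- Load the cleaned rows using the SELECT form of INSERT.
            INSERT INTO dbo.Doctor (Name, Phone, Zip, SpecialtyCode)
            SELECT Name, Phone, Zip, SpecialtyCode
            FROM dbo.Staging;";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}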
I have a table with about 100 columns and about 10000 rows.
Periodically, I will receive an Excel with similar data and I now need to update the table.
If new rows exist in Excel, I have to add them to the db.
If old rows have been updated, I need to update the rows in the db.
If some rows have been deleted, I need to delete the row from my main table and add to another table.
I have thought about proceeding as follows:
Fetch all rows from db into a DataSet.
Import all rows from Excel into a DataSet.
Compare these 2 DataSets now using joins and perform the required operations.
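Roughly what I have in mind for step 3, assuming both tables share an integer Key column (all names here are made up):

using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

class DataSetComparer
{
    // Requires a reference to System.Data.DataSetExtensions for AsEnumerable()/Field<T>().
    public static void Compare(DataTable dbRows, DataTable excelRows)
    {
        var dbKeys = new HashSet<int>(dbRows.AsEnumerable().Select(r => r.Field<int>("Key")));
        var excelKeys = new HashSet<int>(excelRows.AsEnumerable().Select(r => r.Field<int>("Key")));

        // In Excel but not in the db -> rows to insert.
        var toInsert = excelRows.AsEnumerable().Where(r => !dbKeys.Contains(r.Field<int>("Key"))).ToList();

        // In the db but not in Excel -> rows to delete and copy to the archive table.
        var toArchive = dbRows.AsEnumerable().Where(r => !excelKeys.Contains(r.Field<int>("Key"))).ToList();

        // In both -> compare the remaining columns and update where they differ.
        var matched = (from d in dbRows.AsEnumerable()
                       join e in excelRows.AsEnumerable()
                           on d.Field<int>("Key") equals e.Field<int>("Key")
                       select new { Db = d, Excel = e }).ToList();

        Console.WriteLine("{0} to insert, {1} to archive, {2} to compare", toInsert.Count, toArchive.Count, matched.Count);
    }
}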
I have never worked with data of this magnitude and am worried about the performance. Let me know the ideal way to go about realizing this requirement.
Thanks in advance. :)
Don't worry about the performance with 10k records; you will not notice it...
Maybe a better way to do it is to import the Excel file into a temp table and do the processing with a couple of simple SQL queries... you'll save on dev time and it will potentially perform better...
In my experience, this is quite simple if you choose to do the work in T-SQL, as follows:
You can use OPENROWSET, OPENQUERY, linked servers, DTS and many other things in SQL Server to import the Excel file into a temporary table.
You can then write some simple queries to do the comparison. If you are using SQL Server 2008, MERGE was made for exactly this kind of question.
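For example, once the Excel data has been pulled into a staging table, something along these lines could do the whole synchronisation; every table and column name below is made up:

using System.Data.SqlClient;

static class ExcelSync
{
    public static void SyncFromStaging(string connectionString)
    {
        const string sql = @"
            -- Copy soon-to-be-deleted rows to the archive table first.
            INSERT INTO dbo.ArchiveTable (Id, Col1, Col2)
            SELECT t.Id, t.Col1, t.Col2
            FROM dbo.MainTable AS t
            WHERE NOT EXISTS (SELECT 1 FROM dbo.ExcelStaging AS s WHERE s.Id = t.Id);

            -- One MERGE then handles the insert/update/delete cases in a single statement.
            MERGE dbo.MainTable AS target
            USING dbo.ExcelStaging AS source
                ON target.Id = source.Id
            WHEN MATCHED THEN
                UPDATE SET target.Col1 = source.Col1, target.Col2 = source.Col2
            WHEN NOT MATCHED BY TARGET THEN
                INSERT (Id, Col1, Col2) VALUES (source.Id, source.Col1, source.Col2)
            WHEN NOT MATCHED BY SOURCE THEN
                DELETE;";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}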
Another thing is that the performance will be far better than doing it in C#. You can use the TOP clause to chunk the comparison, and do many other things besides.
Hope it helps.
Cheers
When I load from the database I use one stored procedure which loads the DataItem and any Data associated with it. This comes back in one DataSet with two tables: the first table has one row describing the DataItem, and each row in the other table describes the related Data.
This DataSet is then used to populate my objects.
My problem comes when I have to save the objects back to the database. I am currently saving the DataItem and then looping through all of my Data and performing a save on each one. Completely horrible way to go about doing it, I know. It's both slow and it's not transactional.
So what I'd ideally like to do is convert my objects back into my DataSet and then save it all back to the database in one efficient, transactional operation. What code do I need on the C# side to make this transactional and to allow me to pass back a DataSet? I presume this will involve using a TableAdapter, but given that I have two tables, how will this work? What do I use on the SQL side? Can I use stored procedures? (I would like to avoid having SQL in my C# project.) Would I need to write something that will cycle through a DataTable to save each record?
What's the best way to go about doing all this? This will form the lynchpin of a project I'm working on so I want it to be as fast and efficient as it can be!
(.NET 4.0 and SQL 2005)
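Something along these lines is the shape I'm imagining, using plain SqlDataAdapters rather than a typed TableAdapter just to sketch it out; the proc, table and column names below are all placeholders:

using System.Data;
using System.Data.SqlClient;

class DataItemSaver
{
    public void Save(DataSet ds, string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            using (var tran = conn.BeginTransaction())
            {
                try
                {
                    // Proc and column names are invented; each proc would need to
                    // handle both the insert and the update case for its row.
                    var itemCmd = new SqlCommand("dbo.usp_SaveDataItem", conn, tran) { CommandType = CommandType.StoredProcedure };
                    itemCmd.Parameters.Add("@Id", SqlDbType.Int).SourceColumn = "Id";
                    itemCmd.Parameters.Add("@Name", SqlDbType.NVarChar, 100).SourceColumn = "Name";

                    var dataCmd = new SqlCommand("dbo.usp_SaveData", conn, tran) { CommandType = CommandType.StoredProcedure };
                    dataCmd.Parameters.Add("@DataItemId", SqlDbType.Int).SourceColumn = "DataItemId";
                    dataCmd.Parameters.Add("@Value", SqlDbType.NVarChar, -1).SourceColumn = "Value";

                    // Update() pushes every changed row in the table through the relevant command.
                    new SqlDataAdapter { InsertCommand = itemCmd, UpdateCommand = itemCmd }.Update(ds.Tables["DataItem"]);
                    new SqlDataAdapter { InsertCommand = dataCmd, UpdateCommand = dataCmd }.Update(ds.Tables["Data"]);

                    tran.Commit();
                }
                catch
                {
                    tran.Rollback();
                    throw;
                }
            }
        }
    }
}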
Did not use TableAdapter in the end as it was more effort than it was worth.
From the comments:
http://msdn.microsoft.com/en-us/library/4esb49b4.aspx
I am trying to figure out the best way to design my C# application which utilizes data from a SQL Server backend.
My application periodically has to update 55K rows, one at a time, from a loop. Before it does an update it needs to check whether the record to be updated exists.
If it exists it updates one field.
If not it performs an insert of 4 fields.
The table to be updated has 600K rows.
What is the most efficient way to handle these updates/inserts from my application?
Should I create a dictionary in C#, load the 600K records into it, and query the dictionary first instead of the database?
Is this a faster approach?
Should I use a stored procedure?
What’s the best way to achieve maximum performance based on this scenario?
You could use SqlBulkCopy to upload to a temp table then have a SQL Server job do the merge.
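A rough sketch of the C# half, assuming the 55K rows have been loaded into a DataTable first; the column names and the merge procedure are placeholders:

using System.Data;
using System.Data.SqlClient;

static class BulkUpsert
{
    public static void StageAndMerge(DataTable rows, string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();

            // A #temp table is only visible to the connection that created it, so the
            // merge has to be kicked off from this same connection (a SQL Agent job
            // would need a permanent staging table instead).
            using (var create = new SqlCommand(
                "CREATE TABLE #Incoming (Id int PRIMARY KEY, FieldA int, FieldB int, FieldC int);", conn))
            {
                create.ExecuteNonQuery();
            }

            using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "#Incoming" })
            {
                bulk.WriteToServer(rows);
            }

            // usp_MergeIncoming is a placeholder for whatever server-side merge you write;
            // called from this connection it can see #Incoming.
            using (var merge = new SqlCommand("EXEC dbo.usp_MergeIncoming;", conn))
            {
                merge.ExecuteNonQuery();
            }
        }
    }
}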
You should try to avoid "update 55K rows each one at a time from a loop". That will be very slow.
Instead, try to find a way to batch the updates (n at a time). Look into SQL Server table-valued parameters as a way to send a set of data to a stored procedure.
Here's an article on updating multiple rows with TVPs: http://www.sqlmag.com/article/sql-server-2008/using-table-valued-parameters-to-update-multiple-rows
What if you did something like this, instead?
By some means, get those 55,000 rows of data into the database, if they're not already there. (If you're currently getting those rows from some query, arrange instead for the query results to be stored in a temporary table in that database. This might be a proper application for a stored procedure.)
Now, you could express the operations that you need to perform, perhaps, as two separate SQL queries: one to do the updates, and one or more others to do the inserts. The first query might use a clause such as "WHERE FOO IN (SELECT BAR FROM #TEMP_TABLE ...)" to identify the rows to be updated. The others might use "WHERE FOO NOT IN (...)"
This is, to be precise, exactly the sort of thing I would expect to use a stored procedure for, because, if you think about it, the SQL Server itself is precisely the right party to be doing the work: it's the only one around that already has on hand the data you intend to manipulate. It alone doesn't have to "transmit" those 55,000 rows anywhere. Perfect.
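In sketch form, with invented table and column names, the two queries could look like this (shown here fired from C#, though they would normally live inside that stored procedure):

using System.Data.SqlClient;

static class SetBasedUpsert
{
    public static void ApplyChanges(string connectionString)
    {
        // Assumes #TEMP_TABLE has already been populated on this same connection.
        const string sql = @"
            -- Rows that already exist: update the one field.
            UPDATE b
            SET b.SomeField = t.SomeField
            FROM dbo.BigTable AS b
            JOIN #TEMP_TABLE AS t ON t.Id = b.Id;

            -- Rows that don't exist yet: insert the four fields.
            INSERT INTO dbo.BigTable (Id, SomeField, OtherField, ThirdField)
            SELECT t.Id, t.SomeField, t.OtherField, t.ThirdField
            FROM #TEMP_TABLE AS t
            WHERE t.Id NOT IN (SELECT Id FROM dbo.BigTable);";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}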