I'm building an application to import data into a SQL Server 2008 Express database.
This database is being used by an application that is currently in production.
The data that needs to be imported comes from various sources, mostly Excel sheets and XML files.
The database has the following tables:
tools
powertools
strikingtools
owners
Each row, or XML element, in the source files has information about one tool:
name, tooltype, weight, wattage, owner, material, etc...
Each of these rows has the name of the tool's owner. This name has to be inserted into the owners table, but only if the name isn't already in there.
For each of these rows a new row needs to be inserted in the tools table.
The tools table has a field owner_id with a foreign key to the owners table, where the primary key of the corresponding row in the owners table needs to be set.
Depending on the tooltype a new row must be created in either the powertools table or the strikingtools table. These 2 tables also have a tool_id field with a foreign key to the tools table that must be filled in.
The tools table has a tool_owner_id field with a foreign key to the owners table that must be filled in.
If any of the rows in the import file fails to import for some reason, the entire import needs to be rolled back.
Currently I'm using a DataSet to do this, but for some large files (over 200,000 tools) this requires quite a lot of memory. Can anybody think of a better approach for this?
There are two main issues to be solved:
Parsing a large XML document efficiently.
Adding a large amount of records to the database.
XML Parsing
Although the DataSet approach works, the whole XML document is loaded into memory. To improve the efficiency of working with large XML documents, you might want to look at the XmlReader class. The API is slightly more difficult to use than what DataSet provides, but you get the benefit of not loading the whole DOM into memory at once.
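For illustration, a minimal sketch of the streaming approach. The element names ("tool", "name", "owner", "tooltype") follow the field list in the question but are otherwise assumptions about the file layout:

    using System.Collections.Generic;
    using System.Xml;
    using System.Xml.Linq;

    class ToolRecord
    {
        public string Name { get; set; }
        public string Owner { get; set; }
        public string ToolType { get; set; }
    }

    static class ToolXmlStreamer
    {
        // Streams tools one at a time; only the current <tool> element is materialised.
        public static IEnumerable<ToolRecord> ReadTools(string path)
        {
            using (var reader = XmlReader.Create(path))
            {
                while (reader.Read())
                {
                    if (reader.NodeType == XmlNodeType.Element && reader.Name == "tool")
                    {
                        // ReadSubtree gives a reader scoped to this single element.
                        using (var subtree = reader.ReadSubtree())
                        {
                            var element = XElement.Load(subtree);
                            yield return new ToolRecord
                            {
                                Name = (string)element.Element("name"),
                                Owner = (string)element.Element("owner"),
                                ToolType = (string)element.Element("tooltype")
                            };
                        }
                    }
                }
            }
        }
    }

Because the method is an iterator, the caller can process and insert each tool as it is read instead of building the whole list up front.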
Inserting records into the DB
To satisfy your atomicity requirement you can use a single database transaction, but the large number of records you are dealing with makes a single transaction far from ideal. You will most likely run into issues like:
Database having to deal with a large number of locks
Database locks that might escalate from row locks to page locks and even table locks.
Concurrent use of the database being severely affected during the import.
I would recommend the following instead of a single DB transaction:
See if it is possible to create smaller transaction batches, maybe 100 records at a time. Perhaps it is possible to logically load sections of the XML file together, where it would be acceptable to load a subset of the data as a unit into the system (see the sketch after this list).
Validate as much of your data upfront. E.g. Check that required fields are filled or that FK's are correct.
Make the upload repeatable. Skip over existing data.
Provide a manual undo strategy. I know this is easier said than done, but might even be required as an additional business rule. For example the upload was successful but someone realises a couple of hours later that the wrong file was uploaded.
It might be useful to upload your data to an initial staging area in your DB to perform validations and to mark which records have been processed.
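To make the batching idea concrete, here is a minimal sketch, assuming all-or-nothing semantics over the whole file are being relaxed as discussed. It reuses the ToolRecord type from the XmlReader sketch and only shows the insert into the tools table; the owner lookup and the powertools/strikingtools inserts would go inside the same per-batch transaction:

    using System;
    using System.Collections.Generic;
    using System.Data.SqlClient;

    static class BatchedImporter
    {
        public static void Import(IEnumerable<ToolRecord> tools, string connectionString, int batchSize = 100)
        {
            using (var connection = new SqlConnection(connectionString))
            {
                connection.Open();
                var batch = new List<ToolRecord>(batchSize);

                foreach (var tool in tools)
                {
                    batch.Add(tool);
                    if (batch.Count == batchSize)
                    {
                        InsertBatch(connection, batch);
                        batch.Clear();
                    }
                }
                if (batch.Count > 0)
                    InsertBatch(connection, batch);
            }
        }

        static void InsertBatch(SqlConnection connection, List<ToolRecord> batch)
        {
            using (var transaction = connection.BeginTransaction())
            {
                try
                {
                    foreach (var tool in batch)
                    {
                        using (var command = new SqlCommand(
                            "INSERT INTO tools (name, tooltype) VALUES (@name, @tooltype)",
                            connection, transaction))
                        {
                            command.Parameters.AddWithValue("@name", (object)tool.Name ?? DBNull.Value);
                            command.Parameters.AddWithValue("@tooltype", (object)tool.ToolType ?? DBNull.Value);
                            command.ExecuteNonQuery();
                        }
                    }
                    transaction.Commit();   // only this batch is rolled back if something fails
                }
                catch
                {
                    transaction.Rollback();
                    throw;
                }
            }
        }
    }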
Use SSIS, and create an ETL package.
Use transactions for the rollback feature, and stored procedures that handle creating/checking the foreign keys.
Related
Evening all,
Background: I have many xml data files which I need to import into an SQL database via my C# WPF based Windows application. Whilst some data files are a simple and straightforward INSERT, many require validation and verification checks, and require GUIDs from existing records (from within the db) before the INSERT takes place in order to maintain certain relationships.
So I've broken the process in to three stages:-
Validate records. Many checks exist, e.g. 50,000 accounts in the xml file must have a matching reference with an associated record already in the database. If there is no corresponding account, abandon the entire import process. Another would be: only the 'Current' record can be updated, so again, if it points to a 'historic' record then it needs to crash and burn.
UPDATE the associated database records, e.g. set from 'Current' to 'Historic' as the record is to be superseded by what's to come next...
INSERT the records across multiple tables. e.g. 50,000 accounts to be inserted.
So, in summary, I need to validate records with a few checks before any changes can be made to the database. At that point I will change various tables for existing records before inserting the 'latest' records.
My first attempt to resolve this situation was to load the xml file into an XmlDocument or XDocument instance, iterate over every account in the xml file and perform an SQL command for every account: verify it exists, verify it's a current account, change the record before inserting it. Rinse and repeat for thousands of records - not ideal to say the least.
So my second attempt is to load the xml file into a data table, export the corresponding accounts from the database into another data table and perform a nested-loop validation, e.g. does DT1.AccountID exist in DT2.AccountID, move DT2.GUID to DT1.GUID, etc. I appreciate this could also be a slow process. That said, I do have the luxury then of performing both the UPDATE and INSERT stored procedures with a table-valued parameter (TVP) and making use of the data table information.
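If you do go the TVP route, the C# side is fairly small. A minimal sketch, assuming a user-defined table type and an upsert stored procedure already exist in the database; both names below (dbo.AccountTableType, dbo.usp_UpsertAccounts) are hypothetical:

    using System.Data;
    using System.Data.SqlClient;

    static class AccountUploader
    {
        public static void Upload(DataTable accounts, string connectionString)
        {
            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand("dbo.usp_UpsertAccounts", connection))
            {
                command.CommandType = CommandType.StoredProcedure;

                var parameter = command.Parameters.AddWithValue("@Accounts", accounts);
                parameter.SqlDbType = SqlDbType.Structured;   // marks it as a table-valued parameter
                parameter.TypeName = "dbo.AccountTableType";  // the user-defined table type

                connection.Open();
                command.ExecuteNonQuery();
            }
        }
    }

The DataTable's column order and types need to match the user-defined table type.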
I appreciate many will suggest letting the SQL do all of the work, but I'm lacking in that skillset unfortunately (happy to learn if that's the general consensus); I would much rather do the work in C# code if at all possible.
Any views on this are greatly appreciated. Many of the questions I've found are around bulk INSERT, not so much about validating existing records, followed by updating records, followed by inserting records. I suppose my question is around the first part, the validation. Does extracting the data from the db into a data table to work on seem wrong, old fashioned, pointless?
I'm sure I've missed out some vital piece of information, so apologies if unclear.
Cheers
So we are facing these problems and are trying to solve them as efficiently as possible:
import data (130k+ rows) from an Excel file
load it into a given table on a SQL Server database
disperse (normalize) the data from that table into 6 others
So far we've managed to cope with the first 2 problems.
We access the data from the file with the help of the Excel Data Reader library, and bind and bulk insert the loaded data with EntityFramework.BulkInsert. So far the process takes less than 2 minutes for a little over 130k rows, which is acceptable.
The big problem comes when we try to normalize the data in the table loaded on the server.
The Database contains previous information and must be operational during changes => no table locking.
The table has 130 columns and contains information dispersed into 6 tables. The main record keeps foreign keys to the other 5.
Other records might or might not already be present in the database. If they are present, missing information is updated.
The tricky part for us is knowing the IDs of the sub-entities at the time of the insertion of the main entity. Inserting row by row would solve the problem, but it would take too much time.
We would like to use some kind of bulk insert, but we're not sure how to keep the connection between the records.
The database design is legacy and reorganising it is not an option. :)
Any ideas for tool, approach or advice on the matter is welcomed :)
Thanks in advance.
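A hedged sketch of one pattern that can work when the staging rows carry a natural/business key for each sub-entity: insert the missing sub-entities set-based, then resolve their generated IDs by joining back on that key while inserting the main records. All table and column names below ("Staging", "SubEntity", "MainEntity", "NaturalKey") are placeholders:

    using System.Data.SqlClient;

    static class Normalizer
    {
        public static void Normalize(string connectionString)
        {
            const string sql = @"
                -- 1. add sub-entities that are not in the database yet
                INSERT INTO SubEntity (NaturalKey, SomeColumn)
                SELECT DISTINCT s.SubNaturalKey, s.SomeColumn
                FROM Staging s
                WHERE NOT EXISTS (SELECT 1 FROM SubEntity e
                                  WHERE e.NaturalKey = s.SubNaturalKey);

                -- 2. insert the main records, looking up the generated IDs by joining
                --    back on the natural key instead of tracking them in C#
                INSERT INTO MainEntity (SubEntityId, OtherColumn)
                SELECT e.Id, s.OtherColumn
                FROM Staging s
                JOIN SubEntity e ON e.NaturalKey = s.SubNaturalKey;";

            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(sql, connection))
            {
                connection.Open();
                command.ExecuteNonQuery();
            }
        }
    }

Both statements are set-based, so the round trips stay small even for 130k rows; wrapping them in a transaction and handling the update-existing case would still need to be added.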
I have a VERY large (50 million+ records) dataset that I am importing from an old Interbase database into a new SQL Server database.
My current approach is:
acquire csv files from the Interbase database (done, used a program called "FBExport" I found somewhere online)
The schema of the old database doesn't match the new one (not under my control), so now I need to mass-edit certain fields in order for them to work in the new database. This is the area I need help with.
after editing to the correct schema, I am using SqlBulkCopy to copy the newly edited data set into the SQL Server database.
Part 3 works very quickly; diagnostics show that importing 10,000 records at once is done almost instantly.
My current (slow) approach to part 2 is to read the csv file line by line and look up the relevant information (e.g. the csv file has an ID that is XXX########, whereas the new database has a separate column for each of XXX and ########; another example: the csv file references a model via a string, but the new database references it via an ID in the model table), then insert a new row into my local table, and then SqlBulkCopy after my local table gets large.
My question is: what would be the "best" approach (performance-wise) for this data-editing step? I figure there is very likely a LINQ-type approach to this; would that perform better, and how would I go about doing that if it would?
If step #3’s importing is very quick, I would be tempted to create a temporary database whose schema exactly matches the old database and import the records into it. Then I’d look at adding additional columns to the temporary table where you need to split the XXX######## into XXX and ########. You could then use SQL to split the source column into the two separate ones. You could likewise use SQL to do whatever ID based lookups and updates you need to ensure the record relationships continue to be correct.
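For the splitting step, a minimal sketch of the kind of SQL you could run from C# against the temporary database, assuming the prefix really is a fixed three characters as the XXX######## pattern suggests; the table and column names ("StagingRecords", "LegacyId", "Prefix", "Number") are placeholders:

    using System.Data.SqlClient;

    static class LegacyIdSplitter
    {
        public static void Split(string connectionString)
        {
            const string sql = @"
                UPDATE StagingRecords
                SET Prefix = LEFT(LegacyId, 3),
                    Number = SUBSTRING(LegacyId, 4, LEN(LegacyId) - 3);";

            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(sql, connection))
            {
                connection.Open();
                command.ExecuteNonQuery();
            }
        }
    }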
Once the data has been massaged into a format which is acceptable, you can insert the records into the final tables using IDENTITY_INSERT ON, excluding all legacy columns/information.
In my mind, the primary advantage of doing it within the temporary SQL DB is that at any time you can write queries to ensure that record relationships using the old key(s) are still correctly related to records using the new database’s auto generated keys.
This is of course based on me being more comfortable doing data transformations/validation in SQL than in C#.
What is the standard way of copying data from one Oracle database to another?
1) Read data from the source table and copy it to a temp table on the destination using configuration (i.e. there is more than 1 table and each table has a separate temp table)
2) Right now there is no CLOB data, but in the future CLOB data might be used.
3) Read everything into memory (if the data is large, read it in chunks)
Should not use Oracle links
Should not use files
Code should only use C#, not any database procedures.
One way that I've used to do this is to use a DataReader on the source database and just perform inserts on the target database (using Bind Parameters for sure).
Note that the DataReader is excellent at not using much memory as it moves through a table (I believe that by default it uses a Fast Forward, Read Only cursor). This means that only a small amount of data is held in memory at a given time.
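A rough sketch of that pattern, assuming the Oracle.ManagedDataAccess.Client ADO.NET provider (any provider exposing DbConnection/DbCommand looks much the same); the table and column names are placeholders:

    using Oracle.ManagedDataAccess.Client;

    static class TableCopier
    {
        public static void Copy(string sourceConnStr, string targetConnStr)
        {
            using (var source = new OracleConnection(sourceConnStr))
            using (var target = new OracleConnection(targetConnStr))
            {
                source.Open();
                target.Open();

                using (var select = source.CreateCommand())
                using (var insert = target.CreateCommand())
                {
                    select.CommandText = "SELECT id, name FROM source_table";
                    insert.CommandText = "INSERT INTO temp_table (id, name) VALUES (:id, :name)";

                    var idParam = insert.CreateParameter();
                    idParam.ParameterName = "id";
                    insert.Parameters.Add(idParam);

                    var nameParam = insert.CreateParameter();
                    nameParam.ParameterName = "name";
                    insert.Parameters.Add(nameParam);

                    // The reader only holds the current row in memory as it streams the table.
                    using (var reader = select.ExecuteReader())
                    {
                        while (reader.Read())
                        {
                            idParam.Value = reader.GetValue(0);
                            nameParam.Value = reader.GetValue(1);
                            insert.ExecuteNonQuery();
                        }
                    }
                }
            }
        }
    }

The same parameterised insert command is reused for every row, so only the bind values change between executions.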
Here are the things to watch out for:
Relationships
If you're working with data that has relationships, you're going to need to deal with that. There are two ways that I've seen to deal with this:
Temporarily drop the relationships in the target database before doing the copy, then recreate them after.
Copy the data in the correct order for the relationships to work correctly (this is usually pretty difficult / inefficient)
Auto Generated Id Values
These columns are usually handled by disabling the auto increment functionality for the given table and allowing identity insert (I'm using some SQL Server terms, I can't remember how it works on Oracle).
Transactions
If you're moving a lot of data, transactions will be expensive.
Repeatability / Deleting Target Data
Unless you're way more awesome than the rest of us, you'll probably have to run this thing more than once (at least during development). That means you might want a way to delete the target data.
Platform Specific Methods
In SQL Server, there are ways to perform bulk inserts that are blazingly fast (by giving up little things like referential integrity checking). There might be a similar feature within the Oracle toolset.
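For the SQL Server side of that remark, the usual tool from C# is SqlBulkCopy; a minimal sketch (the destination table name is a placeholder):

    using System.Data;
    using System.Data.SqlClient;

    static class BulkLoader
    {
        public static void Load(IDataReader sourceReader, string targetConnStr)
        {
            using (var bulkCopy = new SqlBulkCopy(targetConnStr))
            {
                bulkCopy.DestinationTableName = "dbo.TargetTable";
                bulkCopy.BatchSize = 10000;            // send rows to the server in chunks
                bulkCopy.WriteToServer(sourceReader);  // streams rows from any IDataReader
            }
        }
    }

By default SqlBulkCopy does not check constraints or fire triggers, which is part of why it is so fast; SqlBulkCopyOptions flags can turn those back on.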
Table / Column Metadata
I haven't had to do this in Oracle yet, but it looks like you can get metadata on tables and columns using the views mentioned here.
Hi, I'm working on a project where we need to process several xml files once a day and populate a database with the information contained in those files.
Each file is roughly 1 MB and contains about 1000 records; we usually need to process between 12 and 25 of these files. I've seen some information regarding bulk inserts using NHibernate, but our problem is somewhat trickier since the xml files contain new records mixed with updated records.
In the xml there is a flag that tells us if a specific record is a new one or an update to an existing record, but not what information has changed. The xml records do not contain our DB identifier, but we can use an identifier from the xml record to uniquely locate a record in our DB.
Our strategy so far has been to identify whether the current record is an insert or an update, and based on that we either perform an insert on the DB, or we do a search, update the information on the object with the information coming from the xml record, and finally do an update on the DB.
The problem with our current approach is that we are having issues with DB locks and our performance degrades really fast. We have thought about some alternatives like having separate tables for the distinct operations or even separate DB’s but doing such a move would mean a big effort so before any decisions I would like to ask for the community opinion on this matter, thanks in advance.
A couple of ideas:
Always try to use IStatelessSession for bulk operations (a short sketch follows this list).
If you're still not happy with the performance, just skip NHibernate and use a stored procedure or parameterized query specific to this, or use IQuery.ExecuteUpdate()
If you're using SQL Server, you could convert your xml format to BCPFORMAT xml then run BULK INSERT on it (only for insertions)
If you're having too many DB locks, try grouping the operations (i.e. first find out what needs to be inserted and what updated, then get PKs for the updates, then run BULK INSERT for insertions, then run updates)
If parsing the source files is a performance issue (i.e. it maxes out a CPU core), try doing it in parallel (you could use Parallel Extensions)
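As a rough illustration of the IStatelessSession suggestion above, a minimal sketch; the Record class and its IsNew flag stand in for whatever your mapped entity and the new/update flag from the xml look like:

    using System.Collections.Generic;
    using NHibernate;

    static class RecordImporter
    {
        public static void Import(ISessionFactory sessionFactory, IEnumerable<Record> records)
        {
            using (IStatelessSession session = sessionFactory.OpenStatelessSession())
            using (ITransaction transaction = session.BeginTransaction())
            {
                foreach (var record in records)
                {
                    // No first-level cache, no change tracking, no cascades:
                    // each call goes straight to the database.
                    if (record.IsNew)
                        session.Insert(record);
                    else
                        session.Update(record);
                }
                transaction.Commit();
            }
        }
    }

    // Placeholder entity for the sketch.
    class Record
    {
        public virtual int Id { get; set; }
        public virtual bool IsNew { get; set; }
    }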
This might help: http://ideas-net.blogspot.com/2009/03/nhibernate-update-performance-issue.html