Importing CSV data into application DB maintaining foreign key consistency

In my ASP.NET web app I'm trying to implement an import/export procedure to save or insert data in the application DB. My procedure generates some CSV files: one for each table.
Obviously there are relations between some of these tables and when I import CSV in my DB I'd like to maintain association between rows.
Say I have Table1 and Table2 with Table2 that has a foreign key to Table1. So I could have a row in Table1 with ID = 100 and a row in Table2 with Table1_ID = 100.
When I import CSV with Table1 data, new IDs are generated for Table1 rows, how can I maintain consistency of the foreign keys in Table2 when I import the corresponding CSV file?
I'm using Linq-to-SQL to retrieve data from the DB... would using DataSet and DataTable help?
NOTE I'd like to permit cumulative imports, so when I import a CSV file there may already be data in the DB. So I cannot simply keep the original IDs with SET IDENTITY_INSERT ON.

Add the items of Table1 first, so that when you add the items of Table2 the corresponding records of Table1 are already in the database. For more tables you will have to figure out the order. If you are building a system for an arbitrary database schema, you will want to create a table graph in memory (where each node is a table and each arc is a foreign key) [There are no types for that in the base library] and then convert it to a tree such that traversing the tree (breadth-first) visits every table after the tables it references.
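One way to get such a load order is a topological sort of the foreign-key graph. The sketch below swaps that in for the tree conversion described above; it uses Python's standard graphlib module (Python 3.9 or later), and the three-table schema is purely hypothetical.

```python
from graphlib import TopologicalSorter

def import_order(foreign_keys):
    """Return table names ordered so that referenced tables come first.

    foreign_keys maps each table to the set of tables it references.
    """
    # TopologicalSorter expects node -> predecessors, which is exactly
    # "table -> tables that must be loaded before it".
    return list(TopologicalSorter(foreign_keys).static_order())

# Hypothetical schema: Table2 references Table1, Table3 references both.
order = import_order({
    "Table1": set(),
    "Table2": {"Table1"},
    "Table3": {"Table1", "Table2"},
})
print(order)
```

If the foreign keys form a cycle, static_order() raises graphlib.CycleError, so circular references have to be handled separately (for example by inserting with a null key and updating afterwards).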
You can let the database handle the cases where there is a foreign key violation because the referenced record does not exist. You will have to decide whether you make a transaction of the whole import operation, or one per item.
Analysing the CSVs beforehand is also possible. To do that, you will want to store the primary key values of each table [Use a set for that] (again, iterate over the tables in the correct order), and then, when you are reading a table that has a foreign key to a table that you have already read, you can check whether the key is there. It will also help you to detect any possible duplicates. [If you have things already in the database to take into account, you would have to query too... although, take care if the database is in an active system where records could be deleted while you are still deciding if you can add the CSVs without problem].
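A minimal sketch of that pre-check, assuming two CSV texts with a header row; the column names ID and Table1_ID come from the question, while the function name and sample data are made up for illustration. It only looks at the files, not at rows already in the database.

```python
import csv
import io

def check_csvs(parent_csv, child_csv, pk="ID", fk="Table1_ID"):
    """Return (duplicate_pks, orphan_fks) found in two CSV texts."""
    seen, duplicates = set(), set()
    for row in csv.DictReader(io.StringIO(parent_csv)):
        if row[pk] in seen:
            duplicates.add(row[pk])  # same primary key appears twice
        seen.add(row[pk])
    # A child row is an orphan when its foreign key is not a known primary key.
    orphans = {
        row[fk]
        for row in csv.DictReader(io.StringIO(child_csv))
        if row[fk] not in seen
    }
    return duplicates, orphans

table1 = "ID,Name\n100,a\n101,b\n100,c\n"
table2 = "ID,Table1_ID\n1,100\n2,999\n"
duplicates, orphans = check_csvs(table1, table2)
print(duplicates, orphans)
```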
To address that you are generating new IDs when you add...
The simplest solution that I can think of is: don't. In particular if it is an active system, where other requests are being processed, because then there is no way to predict the new IDs beforehand. Your best bet would be to add them one by one; in that case, you will have to plan your transaction strategy accordingly... it may be the case that you will not be able to roll back.
Although, I think your question is a bit deeper: If the ID of the Table1 did change, then how can I update the corresponding records in the Table2 so they point to the correct record in Table1?
To do that, I suggest doing the analysis as I described above; then you will have a group of sets that work as indexes. This will help you locate the records that you need to update in Table2 for each ID in Table1. [It is also important to keep track of whether you have already updated a record, and not do it twice, because a generated ID may match an ID that is yet to be sent to the database].
To roll back, you can also use those sets, as they will end up having the new IDs that identify the records that you will have to pull out of the database if you want to abort the operation.
Edit: those sets (I recommend HashSet) are only half the story, because they only hold the primary keys (for instance: ID in Table1). You will need bags to keep the foreign keys (in this case Table1_ID in Table2).
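As a rough illustration of the remapping idea, the sketch below uses a dictionary from old ID to new ID instead of the sets and bags described above, and inserts the parent rows one by one so each generated ID can be captured. It runs against an in-memory SQLite database only so that it is self-contained; the Name and Value columns and the function name are assumptions, not part of the question.

```python
import sqlite3

def import_with_remap(conn, table1_rows, table2_rows):
    """Insert parents, record old->new IDs, then insert children with remapped keys."""
    id_map = {}
    with conn:  # one transaction: commit on success, roll back on error
        for old_id, name in table1_rows:
            cur = conn.execute("INSERT INTO Table1 (Name) VALUES (?)", (name,))
            id_map[old_id] = cur.lastrowid  # ID generated by the database
        for old_fk, value in table2_rows:
            conn.execute(
                "INSERT INTO Table2 (Table1_ID, Value) VALUES (?, ?)",
                (id_map[old_fk], value),
            )
    return id_map

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (ID INTEGER PRIMARY KEY AUTOINCREMENT, Name TEXT);
    CREATE TABLE Table2 (ID INTEGER PRIMARY KEY AUTOINCREMENT,
                         Table1_ID INTEGER REFERENCES Table1(ID), Value TEXT);
    INSERT INTO Table1 (Name) VALUES ('already here');
""")
# The CSV row had ID 100; the database assigns a fresh ID on import.
id_map = import_with_remap(conn, [(100, "imported")], [(100, "child")])
print(id_map)
```

Because the child rows are written with the new key from the start, nothing has to be updated afterwards, which sidesteps the "updated twice" problem mentioned above.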

Related

Insert bulk data into tables that are in a one to many relationship

I have a .NET App connected to a Postgres DB using Npgsql and I am trying to import data into two tables, say Users and Todos. A user has many todos. The User table has an id column that is automatically set by the DB, and the Todos table has a foreign key to the Users table called user_id.
Now, I know how to insert Users, and I know how to insert Todos, but I do not know how to set the user_id for those Todos since the id column from User is only known after the users are inserted into the DB. Any idea?

This depends on how you are importing and which tool you are using. If you are using raw INSERT statements, PostgreSQL has a RETURNING clause which will send you back the IDs of the inserted rows (see the docs).
If you are using binary COPY (which is the most efficient way to bulk-import data), there's no such option. In this case, one good way is to "allocate" all the IDs in one go, by incrementing the sequence backing the ID column, and then sending the IDs when you're importing. This means the database is no longer generating those IDs - you're sending them explicitly like any other field.
In practical terms, say you have 100 users (and any number of todos). You can do one call to setval to increment the sequence by 100, and then you can import your users, explicitly setting their IDs to those 100 values. This allows you to also specify the user IDs on the todos. However, if you do this, be mindful of concurrency issues if someone else modifies the sequence at the same time.
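The client-side half of that approach might look like the sketch below. It assumes the block of IDs has already been reserved from the sequence and that first_id is its first value; the database call itself is not shown. The key and user_key fields are hypothetical temporary identifiers used only to link rows before real IDs exist.

```python
def assign_ids(users, todos, first_id):
    """Give each user an explicit ID from a reserved block and point todos at it.

    users: list of dicts with a temporary client-side "key".
    todos: list of dicts with "user_key" naming the owning user.
    first_id: first value of the block reserved from the sequence (assumed).
    """
    key_to_id = {}
    for offset, user in enumerate(users):
        user["id"] = first_id + offset
        key_to_id[user["key"]] = user["id"]
    for todo in todos:
        todo["user_id"] = key_to_id[todo["user_key"]]
    return users, todos

users = [{"key": "u1", "name": "Ann"}, {"key": "u2", "name": "Bob"}]
todos = [{"user_key": "u2", "title": "write"}, {"user_key": "u1", "title": "read"}]
assign_ids(users, todos, first_id=501)
print(users, todos)
```

After this step both lists carry explicit IDs and can be sent with any bulk mechanism, users first.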

BULK INSERT across multiple related tables?

I need to do a BULK INSERT of several hundred-thousand records across 3 tables. A simple breakdown of the tables would be:
TableA
--------
TableAID (PK)
TableBID (FK)
TableCID (FK)
Other Columns
TableB
--------
TableBID (PK)
Other Columns
TableC
--------
TableCID (PK)
Other Columns
The problem with a bulk insert, of course, is that it only works with one table so FK's become a problem.
I've been looking around for ways to work around this, and from what I've gleaned from various sources, using a SEQUENCE column might be the best bet. I just want to make sure I have correctly cobbled together the logic from the various threads and posts I've read on this. Let me know if I have the right idea.
First, I would modify the tables to look like this:
TableA
--------
TableAID (PK)
TableBSequence
TableCSequence
Other Columns
TableB
--------
TableBID (PK)
TableBSequence
Other Columns
TableC
--------
TableCID (PK)
TableCSequence
Other Columns
Then, from within the application code, I would make five calls to the database with the following logic:
Request X Sequence numbers from TableC, where X is the known number of records to be inserted into TableC. (1st DB call.)
Request Y Sequence numbers from TableB, where Y is the known number of records to be inserted into TableB (2nd DB call.)
Modify the existing objects for A, B and C (which are models generated to mirror the tables) with the now known Sequence numbers.
Bulk insert to TableA. (3rd DB call)
Bulk insert to TableB. (4th DB call)
Bulk insert to TableC. (5th DB call)
And then, of course, we would always join on the Sequence.
I have three questions:
Do I have the basic logic correct?
In Tables B and C, would I remove the clustered index from the PK and put it on the Sequence instead?
Once the Sequence numbers are requested from Tables B and C, are they then somehow locked between the request and the bulk insert? I just need to make sure that between the request and the insert, some other process doesn't request and use the same numbers.
Thanks!
EDIT:
After typing this up and posting it, I've been reading deeper into the SEQUENCE documentation. I think I misunderstood it at first. SEQUENCE is not a column type. For the actual column in the table, I would just use an INT (or maybe a BIGINT, depending on the number of records I expect to have). The actual SEQUENCE object is an entirely separate entity whose job is to generate numeric values on request and keep track of which ones have already been generated. So, if I understand correctly, I would create two SEQUENCE objects, one to be used in conjunction with Table B and one with Table C.
So that answers my third question.

Do I have the basic logic correct?
Yes. The other common approach here is to bulk load your data into a staging table, and do something similar on the server-side.
From the client you can request ranges of sequence values using the sp_sequence_get_range stored procedure.
In Tables B and C, would I remove the clustered index from the PK
No, as you later noted the sequence just supplies the PK values for you.

Sorry, I read your question wrong at first. I see now that you are trying to generate your own PKs rather than allow MS SQL to generate them for you. Scratch my above comment.
As David Browne mentioned, you might want to use a staging table to avoid the strain you'll put on your app's heap. Use tempdb and do the modifications directly on the table using a single transaction for each table. Then, copy the staging tables over to their target or use a MERGE if appending. If you are enforcing FK's, you can temporarily remove those constraints if you choose to insert in reverse order (C=>B=>A). You also may want to consider temporarily removing indexes if experiencing performance issues during the insert. Last, consider using SSIS instead of a custom app.

Bulk insert related sets of data with unknown auto-incremented IDs

We are converting database primary keys from GUIDs to auto-incremented INTs. We have data that we parse from text files and put into two C# DataTables Claim and ClaimCharge that we have been using to bulk insert into identically named tables in the database. In the database, ClaimCharge.ClaimID is a foreign key to Claim.ID and several claim charges exist for one claim.
With GUIDs we generated the Claim and ClaimCharge IDs in C#, so bulk inserting was no problem. But with INTs, I don't know what the Claim.ID will be, so I can't assign ClaimCharge.ClaimID. I need some ideas on how this could be accomplished with INTs.
For instance, if the Claim table could be manually locked against inserts, I could:
1. Bulk insert into alternate tables named ClaimBulkData and ClaimChargeBulkData. These tables would still use GUIDs for convenience in keeping the relationship maintained between C# and SQL.
2. Manually lock the Claim table against inserts (don't know if this is possible) and get the MAX(ID).
3. Increment all of the IDs in ClaimBulkData using MAX(ID).
4. Associate ClaimChargeBulkData to ClaimBulkData using the newly updated INT.
5. Insert data into the real Claim table as a set using IDENTITY_INSERT ON, using some kind of exception to the imaginary lock created in step 2.
6. Release the manually created lock against inserts on the Claim table (again, I don't know if this is possible).
7. Insert data into the real ClaimCharge table.
I want to avoid inserting the data one row at a time in either C# or T-SQL.

Why not just add the new auto-increment column to the master tables -- you will then have both GUID and autoid column so you can fix up the foreign key relationship (one master table at a time)
i.e.,
Assume you have Master1, Detail1 and Detail2
alter table Master1 add ID int identity(1,1) not null
GO
alter table Detail1 add Master1ID int null
GO
alter table Detail2 add Master1ID int null
GO
Then update Detail1 and Detail2 by joining to Master1 on the old GUID key to set the corresponding value of Master1ID in each table
You can then add the foreign keys based on Master1ID to Detail1 and Detail2
At this point you should have a complete set of data based on both sets of keys, and you can test, update views, etc. to make sure they work with the new integer IDs
Finally, once all is cool, drop the unneeded GUID foreign keys and the GUID columns themselves.
You can always run a database pack once you get everything clean and converted if your intent was to reduce overall disk usage via this restructuring. The point is much of the work is fixups for foreign keys in a process like this.

TSQL Large Insert of Relational Data, W/ Foreign Key Upsert

Relatively simple problem.
Table A has ID int PK, unique Name varchar(500), and cola, colb, etc
Table B has a foreign key to Table A.
So, in the application, we are generating records for both table A and table B into DataTables in memory.
We would be generating thousands of these records on a very large number of "clients".
Eventually we make the call to store these records. However, records from table A may already exist in the database, so we need to get the primary keys for the records that already exist, and insert the missing ones. Then insert all records for table B with the correct foreign key.
Proposed solution:
I was considering sending an xml document to SQL Server to open as a rowset into TableVarA, update TableVarA with the primary keys for the records that already exist, then insert the missing records and output that to TableVarNew, I then select the Name and primary key from TableVarA union all TableVarNew.
Then in code populate the correct FKs into TableB in memory, and insert all of these records using SqlBulkCopy.
Does this sound like a good solution? And if so, what is the best way to populate the FKs in memory for TableB to match the primary key from the returned DataSet.

Sounds like a plan - but I think the handling of Table A can be simpler (a single in-memory table/table variable should be sufficient):
have a TableVarA that contains all rows for Table A
update the ID for all existing rows with their ID (should be doable in a single SQL statement)
insert all non-existing rows (that still have an empty ID) into Table A and make a note of their ID
This could all happen in a single table variable - I don't see why you need to copy stuff around....
Once you've handled your Table A, as you say, update Table B's foreign keys and bulk insert those rows in one go.
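A small sketch of that flow: insert whichever Table A names are missing, read back a name-to-ID map, and fill in Table B's foreign keys in memory before the bulk insert. It uses SQLite's INSERT OR IGNORE purely to stay runnable; on SQL Server the same step would be done with the table variable or a MERGE as discussed in this thread. The a_name and a_id fields on the Table B rows are hypothetical.

```python
import sqlite3

def resolve_ids(conn, names):
    """Insert names missing from TableA and return {name: ID} for all of them."""
    names = sorted(set(names))  # sorted only to make the insert order predictable
    with conn:
        # Rows whose Name already exists are skipped by the UNIQUE constraint.
        conn.executemany(
            "INSERT OR IGNORE INTO TableA (Name) VALUES (?)",
            [(n,) for n in names],
        )
    marks = ",".join("?" * len(names))
    rows = conn.execute(
        f"SELECT Name, ID FROM TableA WHERE Name IN ({marks})", names
    )
    return dict(rows.fetchall())

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableA (ID INTEGER PRIMARY KEY, Name TEXT UNIQUE)")
conn.execute("INSERT INTO TableA (Name) VALUES ('existing')")

name_to_id = resolve_ids(conn, ["existing", "new"])
table_b = [{"a_name": "new", "val": 1}, {"a_name": "existing", "val": 2}]
for row in table_b:
    row["a_id"] = name_to_id[row["a_name"]]  # foreign key resolved via Name
print(table_b)
```

For thousands of names the IN list would need to be batched or replaced by a temporary table; the sketch ignores that.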
What I'm not quite clear on is how Table B references Table A - you just said it had an FK, but you didn't specify what column it was on (assuming on ID). Then how are your rows from Table B referencing Table A for new rows, that aren't inserted yet and thus don't have an ID in Table A yet?

This is more of a comment than a complete answer but I was running out of room so please don't vote it down for not being up to answer criteria.
My concern would be that by evaluating a set for missing keys and then inserting in bulk you take a risk that the key got added elsewhere in the meantime. You stated this could be from a large number of clients, so this is going to happen. Yes, you could wrap it in a big transaction, but big transactions are hogs and would lock out other clients.
My thought is to deal with those that already have keys in bulk, separately, assuming there is no risk the PK would be deleted. A TVP is efficient but you need explicit knowledge of which rows got processed. I think you need to first search on Name to get a list of PKs that exist, then process that via TVP.
For data integrity process the rest one at a time via a stored procedure that creates the PK as necessary.
Thousands of records is not scary (millions is). Large number of "clients" that is the scary part.

TSQL: UPDATE with INSERT INTO SELECT FROM

so I have an old database that I'm migrating to a new one. The new one has a slightly different but mostly-compatible schema. Additionally, I want to renumber all tables from zero.
Currently I have been using a tool I wrote that manually retrieves the old record, inserts it into the new database, and updates a v2 ID field in the old database to show its corresponding ID location in the new database.
for example, I'm selecting from MV5.Posts and inserting into MV6.Posts. Upon the insert, I retrieve the ID of the new row in MV6.Posts and update it in the old MV5.Posts.MV6ID field.
Is there a way to do this UPDATE via INSERT INTO SELECT FROM so I don't have to process every record manually? I'm using SQL Server 2005, dev edition.

The key with migration is to do several things:
First, do not do anything without a current backup.
Second, if the keys will be changing, you need to store both the old and new in the new structure at least temporarily (Permanently if the key field is exposed to the users because they may be searching by it to get old records).
Next you need to have a thorough understanding of the relationships to child tables. If you change the key field all related tables must change as well. This is where having both old and new key stored comes in handy. If you forget to change any of them, the data will no longer be correct and will be useless. So this is a critical step.
Pick out some test cases of particularly complex data making sure to include one or more test cases for each related table. Store the existing values in work tables.
To start the migration you insert into the new table using a select from the old table. Depending on the number of records, you may want to loop through batches (not one record at a time) to improve performance. If the new key is an identity, you simply put the value of the old key in its own field and let the database create the new keys.
Then do the same with the related tables. Then use the old key value in the table to update the foreign key fields with something like:
UPDATE t2
SET fkfield = t1.newkey
FROM table2 t2
JOIN table1 t1 ON t1.oldkey = t2.fkfield
Test your migration by running the test cases and comparing the data with what you stored from before the migration. It is utterly critical to thoroughly test migration data or you can't be sure the data is consistent with the old structure. Migration is a very complex action; it pays to take your time and do it very methodically and thoroughly.

Probably the simplest way would be to add a column on MV6.Posts for oldId, then insert all the records from the old table into the new table. Last, update the old table matching on oldId in the new table with something like:
UPDATE o
SET MV6ID = n.id
FROM mv5.posts o
JOIN mv6.posts n ON o.id = n.oldid
You could clean up and drop the oldId column afterwards if you wanted to.
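The whole round trip can be tried out on a toy schema. The sketch below uses an in-memory SQLite database and made-up table names (old_posts, new_posts) standing in for MV5.Posts and MV6.Posts; SQLite has no UPDATE ... FROM in older versions, so the write-back is expressed as a correlated subquery instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE old_posts (id INTEGER PRIMARY KEY, title TEXT, new_id INTEGER);
    CREATE TABLE new_posts (id INTEGER PRIMARY KEY AUTOINCREMENT,
                            title TEXT, old_id INTEGER);
    INSERT INTO old_posts (id, title) VALUES (17, 'first'), (42, 'second');

    -- Copy every row, carrying the old key along in old_id.
    INSERT INTO new_posts (title, old_id)
        SELECT title, id FROM old_posts ORDER BY id;

    -- Write the generated keys back to the old table in one statement.
    UPDATE old_posts
       SET new_id = (SELECT n.id FROM new_posts n WHERE n.old_id = old_posts.id);
""")
mapping = conn.execute("SELECT id, new_id FROM old_posts ORDER BY id").fetchall()
print(mapping)
```

Child tables can be stitched the same way, by looking up the parent's new key through old_id.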

The best you can do that I know is with the output clause. Assuming you have SQL 2005 or 2008.
USE AdventureWorks;
GO
DECLARE @MyTableVar table( ScrapReasonID smallint,
Name varchar(50),
ModifiedDate datetime);
INSERT Production.ScrapReason
OUTPUT INSERTED.ScrapReasonID, INSERTED.Name, INSERTED.ModifiedDate
INTO @MyTableVar
VALUES (N'Operator error', GETDATE());
It still would require a second pass to update the original table; however, it might help make your logic simpler. Do you need to update the source table? You could just store the new id's in a third cross reference table.

Heh. I remember doing this in a migration.
Putting the old_id in the new table makes both the update easier -- you can just do an insert into newtable select ... from oldtable, -- and the subsequent "stitching" of records easier. In the "stitch" you'll either update child tables' foreign keys in the insert, by doing a subselect on the new parent (insert into newchild select ... (select id from new_parent where old_id = oldchild.fk) as fk, ... from oldchild) or you'll insert children and do a separate update to fix the foreign keys.
Doing it in one insert is faster; doing it in a separate step means that your inserts aren't order dependent, and can be re-done if necessary.
After the migration, you can either drop the old_id columns, or, if you have a case where the legacy system exposed the IDs and so users used the keys as data, you can keep them to allow lookups based on the old_id.
Indeed, if you have the foreign keys correctly defined, you can use systables/information-schema to generate your insert statements.

Is there a way to do this UPDATE via INSERT INTO SELECT FROM so I don't have to process every record manually?
Since you wouldn't want to do it manually, but automatically, create a trigger on MV6.Posts so that UPDATE occurs on MV5.Posts automatically when you insert into MV6.Posts.
And your trigger might look something like,
create trigger trg_MV6Posts
on MV6.Posts
after insert
as
begin
set nocount on
-- assumes MV6.Posts stores the old key in an OldMV5Id column
update o
set MV6ID = I.ID
from MV5.Posts o
join inserted I on I.OldMV5Id = o.ID
end

AFAIK, you cannot update two different tables with a single sql statement
You can however use triggers to achieve what you want to do.

Add a column MV6.Post.OldMV5Id, then run
insert into MV6.Post
select .. from MV5.Post
and afterwards update MV5.Post.MV6ID.
