TSQL Large Insert of Relational Data, with Foreign Key Upsert - C#

Relatively simple problem.
Table A has ID int PK, unique Name varchar(500), and cola, colb, etc
Table B has a foreign key to Table A.
So, in the application, we are generating records for both table A and table B into DataTables in memory.
We would be generating thousands of these records on a very large number of "clients".
Eventually we make the call to store these records. However, records from table A may already exist in the database, so we need to get the primary keys for the records that already exist, and insert the missing ones. Then insert all records for table B with the correct foreign key.
Proposed solution:
I was considering sending an XML document to SQL Server to open as a rowset into TableVarA, updating TableVarA with the primary keys for the records that already exist, then inserting the missing records and outputting those to TableVarNew. I would then select the Name and primary key from TableVarA union all TableVarNew.
Then in code populate the correct FKs into TableB in memory, and insert all of these records using SqlBulkCopy.
Does this sound like a good solution? And if so, what is the best way to populate the FKs in memory for TableB to match the primary keys from the returned DataSet?
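For that last step, this is roughly what I have in mind (a rough sketch; ParentName and TableA_ID are placeholder column names, and nameIdPairs is the Name/ID DataTable returned by the SQL above):

// Needs System.Data, System.Data.SqlClient and System.Collections.Generic.
// nameIdPairs = Name/ID pairs returned by the server; tableB = the in-memory Table B rows.
var nameToId = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
foreach (DataRow row in nameIdPairs.Rows)
    nameToId[(string)row["Name"]] = (int)row["ID"];

// "ParentName" and "TableA_ID" are placeholder column names, not the real schema.
foreach (DataRow row in tableB.Rows)
    row["TableA_ID"] = nameToId[(string)row["ParentName"]];

using (var bulk = new SqlBulkCopy(connectionString))
{
    bulk.DestinationTableName = "dbo.TableB";
    bulk.WriteToServer(tableB);
}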

Sounds like a plan - but I think the handling of Table A can be simpler (a single in-memory table/table variable should be sufficient):
have a TableVarA that contains all rows for Table A
update the ID column for the rows that already exist in Table A with their existing ID (should be doable in a single SQL statement)
insert all non-existing rows (that still have an empty ID) into Table A and make a note of their ID
This could all happen in a single table variable - I don't see why you need to copy stuff around....
Once you've handled your Table A, as you say, update Table B's foreign keys and bulk insert those rows in one go.
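A rough sketch of the Table A handling I described above, written as one batch you could send from C# (loading the table variable from your XML or a TVP is left out, and the column names are made up):

// The whole upsert for Table A as a single batch; execute it with a SqlCommand
// and read the final SELECT back into a DataTable of Name/ID pairs.
const string upsertTableA = @"
DECLARE @A TABLE (Name varchar(500) PRIMARY KEY, ID int NULL /*, other columns */);

-- 1. populate @A from the @xml parameter or a TVP here (not shown)

-- 2. pick up the IDs of rows that already exist in Table A
UPDATE a SET a.ID = t.ID
FROM @A AS a
JOIN dbo.TableA AS t ON t.Name = a.Name;

-- 3. insert the rows that are still missing (ID is still NULL)
INSERT INTO dbo.TableA (Name /*, other columns */)
SELECT Name /*, other columns */
FROM @A
WHERE ID IS NULL;

-- 4. pick up the newly generated identity values the same way
UPDATE a SET a.ID = t.ID
FROM @A AS a
JOIN dbo.TableA AS t ON t.Name = a.Name
WHERE a.ID IS NULL;

-- 5. hand the Name/ID pairs back to the client
SELECT Name, ID FROM @A;";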
What I'm not quite clear on is how Table B references Table A - you just said it has an FK, but you didn't specify which column it is on (I'm assuming ID). And how do your Table B rows reference Table A rows that haven't been inserted yet and thus don't have an ID yet?

This is more of a comment than a complete answer, but I was running out of room, so please don't vote it down for not meeting the answer criteria.
My concern would be that by evaluating a set for missing keys and then inserting in bulk, you take the risk that a key gets added elsewhere in the meantime. You stated this could come from a large number of clients, so this is going to happen. Yes, you could wrap it in a big transaction, but big transactions are hogs and would lock out other clients.
My thought is to deal with the records that already have keys separately, in bulk, assuming there is no risk the PK would be deleted. A TVP is efficient, but you need explicit knowledge of which records got processed. I think you need to first search on Name to get the list of PKs that already exist, then process those via a TVP.
For data integrity, process the rest one at a time via a stored procedure that creates the PK as necessary.
Thousands of records is not scary (millions is). The large number of "clients" is the scary part.
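For the TVP part, a sketch of how the names could be passed from C# (dbo.NameList and dbo.UpsertTableA are made-up names; tableA is your in-memory DataTable for Table A):

// Assumes something like this exists on the server:
//   CREATE TYPE dbo.NameList AS TABLE (Name varchar(500) PRIMARY KEY);
//   CREATE PROCEDURE dbo.UpsertTableA @names dbo.NameList READONLY AS ...
var names = new DataTable();
names.Columns.Add("Name", typeof(string));
foreach (DataRow row in tableA.Rows)
    names.Rows.Add(row["Name"]);

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("dbo.UpsertTableA", conn) { CommandType = CommandType.StoredProcedure })
{
    var p = cmd.Parameters.AddWithValue("@names", names);
    p.SqlDbType = SqlDbType.Structured;   // marks the parameter as a TVP
    p.TypeName = "dbo.NameList";          // must match the server-side table type
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        // read back the Name/PK pairs for the rows that now exist
    }
}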

Related

BULK INSERT across multiple related tables?

I need to do a BULK INSERT of several hundred-thousand records across 3 tables. A simple breakdown of the tables would be:
TableA
--------
TableAID (PK)
TableBID (FK)
TableCID (FK)
Other Columns
TableB
--------
TableBID (PK)
Other Columns
TableC
--------
TableCID (PK)
Other Columns
The problem with a bulk insert, of course, is that it only works with one table, so FKs become a problem.
I've been looking around for ways to work around this, and from what I've gleaned from various sources, using a SEQUENCE column might be the best bet. I just want to make sure I have correctly cobbled together the logic from the various threads and posts I've read on this. Let me know if I have the right idea.
First, I would modify the tables to look like this:
TableA
--------
TableAID (PK)
TableBSequence
TableCSequence
Other Columns
TableB
--------
TableBID (PK)
TableBSequence
Other Columns
TableC
--------
TableCID (PK)
TableCSequence
Other Columns
Then, from within the application code, I would make five calls to the database with the following logic:
Request X Sequence numbers from TableC, where X is the known number of records to be inserted into TableC. (1st DB call.)
Request Y Sequence numbers from TableB, where Y is the known number of records to be inserted into TableB. (2nd DB call.)
Modify the existing objects for A, B and C (which are models generated to mirror the tables) with the now known Sequence numbers.
Bulk insert to TableA. (3rd DB call)
Bulk insert to TableB. (4th DB call)
Bulk insert to TableC. (5th DB call)
And then, of course, we would always join on the Sequence.
I have three questions:
Do I have the basic logic correct?
In Tables B and C, would I remove the clustered index from the PK and put it on the Sequence instead?
Once the Sequence numbers are requested from Tables B and C, are they then somehow locked between the request and the bulk insert? I just need to make sure that between the request and the insert, some other process doesn't request and use the same numbers.
Thanks!
EDIT:
After typing this up and posting it, I've been reading deeper into the SEQUENCE documentation. I think I misunderstood it at first. SEQUENCE is not a column type. For the actual column in the table, I would just use an INT (or maybe a BIGINT, depending on the number of records I expect to have). The actual SEQUENCE object is an entirely separate entity whose job is to generate numeric values on request and keep track of which ones have already been generated. So, if I understand correctly, I would create two SEQUENCE objects, one to be used in conjunction with Table B and one with Table C.
So that answers my third question.
Do I have the basic logic correct?
Yes. The other common approach here is to bulk load your data into a staging table, and do something similar on the server-side.
From the client you can request ranges of sequence values using the sp_sequence_get_range stored procedure.
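For example, something like this can reserve a block of values from the client (a sketch; it assumes a sequence object such as dbo.TableBSequence was already created with CREATE SEQUENCE):

// Reserve rangeSize contiguous sequence values in one round trip.
// sys.sp_sequence_get_range returns the first value of the range as sql_variant.
static long GetSequenceRange(SqlConnection conn, string sequenceName, int rangeSize)
{
    using (var cmd = new SqlCommand("sys.sp_sequence_get_range", conn))
    {
        cmd.CommandType = CommandType.StoredProcedure;
        cmd.Parameters.AddWithValue("@sequence_name", sequenceName);
        cmd.Parameters.AddWithValue("@range_size", rangeSize);
        var first = cmd.Parameters.Add("@range_first_value", SqlDbType.Variant);
        first.Direction = ParameterDirection.Output;
        cmd.ExecuteNonQuery();
        return Convert.ToInt64(first.Value);   // range is [first, first + rangeSize - 1]
    }
}

// Usage: number the in-memory TableB models before the bulk insert.
// long firstB = GetSequenceRange(conn, "dbo.TableBSequence", tableBRows.Count);
// for (int i = 0; i < tableBRows.Count; i++) tableBRows[i].TableBSequence = firstB + i;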
In Tables B and C, would I remove the clustered index from the PK
No, as you later noted the sequence just supplies the PK values for you.
Sorry, I read your question wrong at first. I see now that you are trying to generate your own PKs rather than letting SQL Server generate them for you. Scratch my above comment.
As David Browne mentioned, you might want to use a staging table to avoid the strain you'll put on your app's heap. Use tempdb and do the modifications directly on the table, using a single transaction for each table. Then copy the staging tables over to their targets, or use a MERGE if appending. If you are enforcing FKs, you can temporarily remove those constraints if you choose to insert in reverse order (C => B => A). You may also want to consider temporarily removing indexes if you experience performance issues during the insert. Lastly, consider using SSIS instead of a custom app.
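A minimal sketch of that staging-table route (the staging table and columns are placeholders; transactions and error handling are omitted):

// 1. Bulk copy the raw rows into a staging table, 2. append server-side.
using (var conn = new SqlConnection(connectionString))
{
    conn.Open();

    using (var bulk = new SqlBulkCopy(conn))
    {
        bulk.DestinationTableName = "dbo.TableB_Staging";   // assumed staging table
        bulk.WriteToServer(tableB);                         // the in-memory DataTable
    }

    // Append only the staged rows that are not already present in the target.
    const string mergeSql = @"
        MERGE dbo.TableB AS target
        USING dbo.TableB_Staging AS src ON target.TableBID = src.TableBID
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (TableBID, OtherColumn) VALUES (src.TableBID, src.OtherColumn);";
    using (var cmd = new SqlCommand(mergeSql, conn))
        cmd.ExecuteNonQuery();
}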

How can I update a database table with a .Net DataTable and ignore existing records?

I have a .Net DataTable that contains records, all of which are "added" records. The corresponding table in the database may contain millions of rows. If I attempt to simply call the "Update" method on my SqlDataAdapter, any existing records cause an exception to be raised due to a violation of the primary key constraint. I considered loading all of the physical table's records into a second DataTable instance, merging the two, and then calling the Update method on the second DataTable. This actually works exactly like I want. However, my concern is that if there are 30 billion records in the physical table, loading all of that data into a DataTable in memory could be an issue.
I considered selecting a sub-set of data from the physical table and proceeding as described above, but the construction of the sub-query has proved to be very involved and very tedious. You see, I am not working with a single known table. I am working with a DataSet that contains several hundred DataTables. Each of the DataTables maps to its own physical table. The name and schema of the tables are not known at compile time. This has to all be done at run time.
I have played with the SqlBulkCopy class but have the same issue - duplicate records raise an exception.
I don't want to have to dynamically construct queries for each table at run time. If that is the only way, so be it, but I just can't help but think that there must be a simpler solution using what Ado.Net provides.
You could create your InsertCommand like this:
declare @pk int = 1
declare @txt nvarchar(100) = 'nothing'
insert into #temp (id, txt)
select @pk, @txt
where not exists (select 1 from #temp x where x.id = @pk)
assuming that your table #temp (a temporary table used for this example) is created like this, with a primary key on id:
create table #temp (id int not null primary key, txt nvarchar(100))
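Applied to the real target table through a SqlDataAdapter, the same pattern could look like this (MyTable and its columns are placeholders; the important part is the WHERE NOT EXISTS guard in the InsertCommand):

// InsertCommand that silently skips rows whose key already exists in the target table.
var insert = new SqlCommand(@"
    INSERT INTO dbo.MyTable (id, txt)
    SELECT @id, @txt
    WHERE NOT EXISTS (SELECT 1 FROM dbo.MyTable WHERE id = @id);", connection);
insert.Parameters.Add("@id", SqlDbType.Int, 0, "id");            // bound to the DataTable column "id"
insert.Parameters.Add("@txt", SqlDbType.NVarChar, 100, "txt");   // bound to the DataTable column "txt"

var adapter = new SqlDataAdapter { InsertCommand = insert };
adapter.Update(dataTable);   // "added" rows that would collide are simply not inserted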

Bulk insert related sets of data with unknown auto-incremented IDs

We are converting database primary keys from GUIDs to auto-incremented INTs. We have data that we parse from text files and put into two C# DataTables Claim and ClaimCharge that we have been using to bulk insert into identically named tables in the database. In the database, ClaimCharge.ClaimID is a foreign key to Claim.ID and several claim charges exist for one claim.
With GUIDs we generated the Claim and ClaimCharge IDs in C#, so bulk inserting was no problem. But with INTs, I don't know what the Claim.ID will be, so I can't assign ClaimCharge.ClaimID. I need some ideas on how this could be accomplished with INTs.
For instance, if the Claim table could be manually locked against inserts, I could:
Bulk insert into alternate tables named ClaimBulkData and ClaimChargeBulkData. These tables would still use GUIDs for convenience in keeping the relationship maintained between C# and SQL.
Manually lock the Claim table against inserts (don't know if this is possible) and get the max(ID).
Increment all of the data in ClaimBulkData using MAX(ID).
Associate ClaimChargeBulkData with ClaimBulkData using the newly updated INT values.
Insert data into the real Claim table as a set using IDENTITY_INSERT ON, relying on some kind of exception to the imaginary lock created in step 2.
Release the manually created lock against inserts on the Claim table (again, I don't know if this is possible).
Insert data into the real ClaimCharge table.
I want to avoid inserting the data one row at a time in either C# or T-SQL.
Why not just add the new auto-increment column to the master tables? You will then have both the GUID and the auto-ID columns, so you can fix up the foreign key relationships (one master table at a time).
i.e.,
Assume you have Master1, Detail1 and Detail2:
alter table Master1 add ID int identity(1,1) not null
GO
alter table Detail1 add Master1ID int null
GO
alter table Detail2 add Master1ID int null
GO
Then update Detail1 and Detail2 by joining to Master1 on the old GUID key to set the corresponding value of Master1ID in each table.
You can then add the foreign keys based on Master1ID to Detail1 and Detail2.
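The fix-up might look something like this (OldGuidID and Master1Guid are assumed names for the old GUID key columns; run the equivalent for Detail2):

// Executed once per detail table, e.g. via SqlCommand.ExecuteNonQuery().
const string fixupSql = @"
    UPDATE d
    SET d.Master1ID = m.ID
    FROM Detail1 AS d
    JOIN Master1 AS m ON m.OldGuidID = d.Master1Guid;

    ALTER TABLE Detail1
        ADD CONSTRAINT FK_Detail1_Master1 FOREIGN KEY (Master1ID) REFERENCES Master1 (ID);";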
At this point you should have a complete set of data based on both sets of keys, and you can test update views, etc. to make sure they work with the new integer ids
Finally, once all is cool, drop the now-unneeded GUID foreign keys and the GUID columns themselves.
You can always run a database pack once you get everything clean and converted, if your intent with this restructuring was to reduce overall disk usage. The point is that, in a process like this, much of the work is fixing up the foreign keys.

Importing CSV data into application DB maintaining foreign key consistence

In my ASP.NET web app I'm trying to implement an import/export procedure to save or insert data in the application DB. My procedure generates some CSV files: one for each table.
Obviously there are relations between some of these tables and when I import CSV in my DB I'd like to maintain association between rows.
Say I have Table1 and Table2 with Table2 that has a foreign key to Table1. So I could have a row in Table1 with ID = 100 and a row in Table2 with Table1_ID = 100.
When I import CSV with Table1 data, new IDs are generated for Table1 rows, how can I maintain consistency of the foreign keys in Table2 when I import the corresponding CSV file?
I'm using LINQ to SQL to retrieve data from the DB... can using DataSet and DataTable help me?
NOTE: I'd like to permit cumulative imports, so when I import a CSV file there may already be data in the DB. So I cannot simply use SET IDENTITY_INSERT to keep the original IDs.
Add the items of Table1 first, so that when you add the items of Table2 the corresponding records of Table1 are already in the database. For more tables you will have to figure out the order. If you are building a system for an arbitrary database schema, you will want to create a table graph in memory (where each node is a table and each arc is a foreign key - there are no types for that in the base library) and then convert it to a tree so that you get the correct order by traversing the tree (breadth-first).
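A rough sketch of turning such a dependency graph into an insert order (deps would be built from your schema metadata, and every table needs an entry, even if the set of tables it references is empty):

// Needs System.Linq and System.Collections.Generic.
// deps maps each table name to the set of table names it references via foreign keys.
static List<string> InsertOrder(Dictionary<string, HashSet<string>> deps)
{
    var ordered = new List<string>();
    var remaining = new HashSet<string>(deps.Keys);

    while (remaining.Count > 0)
    {
        // take every table whose referenced tables have already been emitted
        var ready = remaining.Where(t => deps[t].All(p => ordered.Contains(p) || p == t)).ToList();
        if (ready.Count == 0)
            throw new InvalidOperationException("Circular foreign key dependency.");
        ordered.AddRange(ready);
        remaining.ExceptWith(ready);
    }
    return ordered;   // parents come before children: import the CSVs in this order
}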
You can let the database handle the cases where a foreign key is violated because the referenced record does not exist. You will have to decide whether to wrap the whole import operation in one transaction, or use one per item.
Analysing the CSVs beforehand is also possible. To do that, store the values of the primary key of each table (use a set for that), again iterating over the tables in the correct order; then, when you are reading a table that has a foreign key to a table you have already read, you can check whether the key is there. This will also help you detect any possible duplicates. (If you have data already in the database to take into account, you would have to query for it too... although take care if the database belongs to an active system, where records could be deleted while you are still deciding whether you can add the CSVs without problems.)
To address the fact that you are generating new IDs when you add...
The simplest solution I can think of is: don't. In particular if it is an active system where other requests are being processed, because then there is no way to predict the new IDs beforehand. Your best bet would be to add the rows one by one; in that case you will have to plan your transaction strategy accordingly... it may be that you will not be able to roll back.
Although I think your question is a bit deeper: if the IDs in Table1 changed, how can I update the corresponding records in Table2 so they point to the correct records in Table1?
To do that, I suggest doing the analysis described above; you will then have a group of sets that work as indexes. This will help you locate the records in Table2 that need to be updated for each ID in Table1. (It is also important to keep track of whether you have already updated a record, and not do it twice, because a generated ID may happen to match an ID that has yet to be sent to the database.)
To roll back, you can also use those sets, as they will end up holding the new IDs that identify the records you would have to pull out of the database if you want to abort the operation.
Edit: those sets (I recommend HashSet) are only half the story, because they only hold the primary key (for instance, ID in Table1). You will need bags to keep the foreign keys (in this case, Table1_ID in Table2).
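One way to realize that bookkeeping, sketched with a dictionary per parent table (table1Csv, table2Csv and InsertTable1Row are hypothetical names used only for illustration):

// Remember the new ID generated for each original Table1 ID, then rewrite
// Table2's foreign keys before those rows are sent to the database.
var oldToNewTable1Id = new Dictionary<int, int>();

foreach (DataRow row in table1Csv.Rows)
{
    int oldId = (int)row["ID"];
    int newId = InsertTable1Row(row);          // hypothetical helper, e.g. INSERT ...; SELECT SCOPE_IDENTITY()
    oldToNewTable1Id[oldId] = newId;
}

foreach (DataRow row in table2Csv.Rows)
    row["Table1_ID"] = oldToNewTable1Id[(int)row["Table1_ID"]];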

SqlBulkCopy into table with composite primary key

I'm trying to use SqlBulkCopy to insert new rows into my DB table by manually populating a DataTable within my application.
This works fine for all tables except the table that has a composite primary key made up of 3 columns. Whenever I try to SqlBulkCopy anything into this table, I get the following error:
Violation of PRIMARY KEY constraint 'PK_MYCOMPOSITEKEY'. Cannot insert duplicate key in object 'dbo.MyTable'.
The statement has been terminated.
Is this even possible?
I have tried setting up my DataTable's primary keys with the following:
dt.PrimaryKey = new[] {dt.Columns["PKcolumn1"], dt.Columns["PKcolumn2"], dt.Columns["PKcolumn3"]};
but again, no luck.
The problem you have is with the data.
In the input file there is either or both of:
a row which has the same data in the PK columns as a row you already have in the table, or
at least two rows with the same values in the PK columns.
Bulk insert to a staging table. Clean up any duplicate records. Then do an insert using straight SQL. When you write the insert code be sure to limit it to records in the staging table that are not in the prod table.
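That promotion step could look something like this for a three-column key (MyTable_Staging and the column names are placeholders):

// De-duplicate inside the staging table, then insert only keys the target doesn't already have.
const string promoteSql = @"
    WITH deduped AS (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY PKcolumn1, PKcolumn2, PKcolumn3
                                     ORDER BY PKcolumn1) AS rn
        FROM dbo.MyTable_Staging
    )
    INSERT INTO dbo.MyTable (PKcolumn1, PKcolumn2, PKcolumn3 /*, other columns */)
    SELECT PKcolumn1, PKcolumn2, PKcolumn3 /*, other columns */
    FROM deduped AS s
    WHERE s.rn = 1
      AND NOT EXISTS (SELECT 1 FROM dbo.MyTable AS t
                      WHERE t.PKcolumn1 = s.PKcolumn1
                        AND t.PKcolumn2 = s.PKcolumn2
                        AND t.PKcolumn3 = s.PKcolumn3);";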
You should check your bulk data for duplicates before you hit the DB; the problem could be there as well (not just a clash with an existing constraint or record in the DB). SqlBulkCopy does work, and it is usually correct in reporting the violation.
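For example, a quick in-memory check before calling SqlBulkCopy (a sketch; it assumes int key columns and needs System.Data.DataSetExtensions for AsEnumerable):

// Group the DataTable rows by the three key columns and flag any group with more than one row.
var duplicates = dt.AsEnumerable()
    .GroupBy(r => new
    {
        K1 = r.Field<int>("PKcolumn1"),
        K2 = r.Field<int>("PKcolumn2"),
        K3 = r.Field<int>("PKcolumn3")
    })
    .Where(g => g.Count() > 1)
    .ToList();

if (duplicates.Count > 0)
    throw new InvalidOperationException(
        duplicates.Count + " duplicate composite keys found in the DataTable before bulk copy.");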
Nonetheless, the whole DataSet (or even DataReader) approach is a messy exercise in mappings, bad typeless design, and plenty of unnecessary transformations, allocations, and object[]-based values, and the entire thing becomes an order-, type- and string-dependent mess (something only MS could design and keeps designing). The native OLE DB bulk interfaces, on the other hand, are much cleaner.
