I am migrating data from an old schema to an updated schema (PostgreSQL is the new database).
I am using C# to automate this process.
Here is a screenshot that shows the original data (a sample for illustration).
Based on the updated schema, this source data needs to be split into two tables, Vehicles and PartPricing.
Each unique combination of Make, Model, and Year will be inserted into Vehicles and given a unique Id.
The Part and PartPrice will then be inserted into the PartPricing table and linked via VehicleId (VehicleId refers to the Id of the Vehicles table).
The screenshot below shows the expected output.
The approach I followed is:
Get the unique list of Make, Model, and Year, generate a bulk insert query, and execute it.
Fetch all the inserted vehicles and cache them in a collection.
Now loop through each line item in the source:
Look up the VehicleId based on Make, Model, and Year (from the collection, not the database).
Prepare the insert statement for PartPricing.
After the loop completes, execute the bulk insert query for PartPricing.
Although the Vehicles data is inserted quickly, preparing the bulk insert for PartPricing takes a considerable amount of time because of the lookup.
Is there a better alternative to this problem? Please suggest.
Just FYI, when I say bulk insert, it looks like this:
Insert into Vehicles(Make, Model, Year) values
('Honda', 'City', 2010),
('Honda', 'City', 2011),
('Hyundai', 'Accent', 2011),
....
('Toyota', 'Corolla', 2015);
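If the cached collection is scanned linearly for every line item, the lookup is O(n) per source row; indexing the cache once in a Dictionary keyed by (Make, Model, Year) makes each lookup a hash probe. A minimal sketch (the tuple shapes are illustrative, not the actual entity types):

```csharp
using System;
using System.Collections.Generic;

// Illustrative stand-ins for the cached vehicles; the real code would use
// the entities fetched back after the Vehicles bulk insert.
var vehicles = new[]
{
    (Id: 1, Make: "Honda",  Model: "City",    Year: 2010),
    (Id: 2, Make: "Honda",  Model: "City",    Year: 2011),
    (Id: 3, Make: "Toyota", Model: "Corolla", Year: 2015),
};

// Index the cache once, keyed by (Make, Model, Year), so each lookup in the
// PartPricing loop is O(1) instead of a scan of the whole collection.
var idByVehicle = new Dictionary<(string Make, string Model, int Year), int>();
foreach (var v in vehicles)
    idByVehicle[(v.Make, v.Model, v.Year)] = v.Id;

// Inside the loop over source line items:
int vehicleId = idByVehicle[("Honda", "City", 2011)];
Console.WriteLine(vehicleId); // 2
```

The dictionary is built once before the loop; each line item then resolves its VehicleId in constant time.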
This is for one of my ETL projects to sync two databases. Some tables are 4 GB, so the ETL job only loads updated data to insert or update, which works fine. But the source table also deletes some records, and I want to delete them from my table too. What I did is:
List<long> SourceIDList; // load all IDs from the source table
List<long> MyIDList;     // load all IDs from my table
var NeedRemoveIDList = MyIDList.Except(SourceIDList);
foreach (var ID in NeedRemoveIDList)
{
    // remove from my table
}
The code logic works, but loading the IDs from the 4 GB table into a List throws an "out of memory" exception. Is there a better way?
Thanks for all the comments. I ended up doing this in the database: I insert the two lists into temp tables and use SQL to compare them. It takes some time to insert the data, but since this is an ETL job, a few extra minutes is OK.
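As a half-measure before moving the comparison into the database, the Except step can at least be done with a single HashSet while streaming the other side, rather than materialising two full Lists. A minimal in-memory sketch (in the real job the IDs would stream from a SqlDataReader):

```csharp
using System;
using System.Collections.Generic;

// Illustrative ID streams; in the real job these would come from a
// SqlDataReader so that only one side is fully materialised in memory.
IEnumerable<long> sourceIds = new long[] { 1, 2, 4 };
IEnumerable<long> myIds = new long[] { 1, 2, 3, 4, 5 };

// Keep only the source side in a HashSet and stream through my IDs,
// collecting the ones to delete, instead of building two full Lists
// and calling Except on them.
var sourceSet = new HashSet<long>(sourceIds);
var toRemove = new List<long>();
foreach (var id in myIds)
    if (!sourceSet.Contains(id))
        toRemove.Add(id); // in the real job: delete this row (e.g. in batches)

Console.WriteLine(string.Join(",", toRemove)); // 3,5
```

This halves the memory footprint but still holds one full ID set; the temp-table comparison in the database avoids the problem entirely.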
I am facing an issue that I hope to get solved here. I have 3 different tables in a DataSet and I want to insert them into the database tables.
I know I can do this using SqlBulkCopy, but there is a catch: if the data already exists in the database, I want it to be updated instead of inserted.
And if the data doesn't exist in the database table, I want to insert it. Any help with this would be appreciated.
I know I can iterate through each record and fire a procedure that checks for its existence, and then updates or inserts accordingly. But the data size is huge, and iterating through each record would be time-consuming, so I don't want to use this approach.
Regards
Disclaimer: I'm the owner of the project Bulk Operations
This project allows you to BulkInsert, BulkUpdate, BulkDelete, and BulkMerge (upsert).
Under the hood, it does almost what @marc_s suggested: use SqlBulkCopy into a temporary table, then run a MERGE statement to insert or update depending on the primary key.
var bulk = new BulkOperation(connection);
bulk.BulkMerge(dt); // dt is the DataTable holding the rows to upsert
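The temporary-table-plus-MERGE pattern mentioned above might look roughly like this in T-SQL (table and column names are illustrative, not from the question):

```sql
-- 1. Bulk copy the DataTable into a staging table (via SqlBulkCopy from C#).
CREATE TABLE #Staging (Id int, Name varchar(500), Price decimal(18,2));

-- 2. Merge staged rows into the target: update on key match, insert otherwise.
MERGE TargetTable AS t
USING #Staging AS s
    ON t.Id = s.Id
WHEN MATCHED THEN
    UPDATE SET t.Name = s.Name, t.Price = s.Price
WHEN NOT MATCHED THEN
    INSERT (Id, Name, Price) VALUES (s.Id, s.Name, s.Price);
```

This keeps the round trips to two (one bulk copy, one MERGE) regardless of row count.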
I've got a database which, amongst others, has two tables of data:
Table 1
- ProductID
- ProductName
- ProductDescription
- IsVisible
- IsDeleted
Table 2
- ProductPriceID
- ProductID
- LocationID
- Price
Table 2 can hold many prices at different locations for each product in Table 1. I'm reading from a CSV file where the product details are listed in the first columns, followed by 15 columns of price values for 15 locations.
I have found that with nearly 10,000 products being imported each time, writing this file to the database by first writing each product and then a list of its 15 prices to Table 2, 10,000 times over, slows the import down hugely. It is up to 2.5x slower compared to 'attempting' to write a list of 10,000 products first, followed by the roughly 132,000 product prices. Making 2 writes to the database instead of 20,000 massively speeds up the whole process, since the lag time is incurred at the database.
I've created two lists of the database types for each object and added the data to each, and this is fine. The problem is the ProductID in Table 2. Entity Framework doesn't return it until I call:
context.Products.AddRange(productList);
context.SaveChanges();
But by the time this is saved, the list of product prices has already been created, but without the relevant ProductID values. When it saves, it crashes because of the foreign key constraint.
Is there any way with Entity Framework to get the ProductID that will be assigned to a product without writing each product to the database first? Keeping the number of database calls to a minimum is crucial here.
I have the option of re-parsing all the data from the file, but I'm not keen on that either, as it's extra processing time. The structure of the file cannot be changed.
I weighed up all the options, and it turned out the best way for us was to write all products to one list, and all product prices, with the known product code, to another list.
We then saved the product list to the database, and then iterated through those products to get the ProductID for each product code matching the local list, before saving the product prices list to the database.
Two saves to the database and one read from it, and we've cut a 47-minute data import down to 3 minutes.
Thanks for everyone's help!
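The ID-matching step described above can be sketched with a dictionary from product code to the database-assigned ProductID (the tuple shapes are made up; the real code would use the EF entities):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative shapes; the real code would use the EF entities after the
// first SaveChanges() has populated their database-generated ProductIDs.
var savedProducts = new[]
{
    (ProductID: 11, ProductCode: "A-001"),
    (ProductID: 12, ProductCode: "A-002"),
};
var prices = new List<(string ProductCode, int LocationID, decimal Price, int ProductID)>
{
    ("A-001", 1, 9.99m, 0),  // ProductID still unknown (0) at this point
    ("A-002", 1, 4.50m, 0),
};

// One read back from the database, one dictionary, one pass over the prices:
var idByCode = savedProducts.ToDictionary(p => p.ProductCode, p => p.ProductID);
for (int i = 0; i < prices.Count; i++)
{
    var p = prices[i];
    prices[i] = (p.ProductCode, p.LocationID, p.Price, idByCode[p.ProductCode]);
}

Console.WriteLine(prices[0].ProductID); // 11
```

With the foreign keys patched in memory, the whole prices list can then be saved in the second database call.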
In my ASP.NET web app I'm trying to implement an import/export procedure to save or insert data in the application DB. My procedure generates some CSV files: one for each table.
Obviously there are relations between some of these tables, and when I import the CSVs into my DB I'd like to maintain the associations between rows.
Say I have Table1 and Table2, where Table2 has a foreign key to Table1. So I could have a row in Table1 with ID = 100 and a row in Table2 with Table1_ID = 100.
When I import the CSV with Table1 data, new IDs are generated for the Table1 rows. How can I maintain the consistency of the foreign keys in Table2 when I import the corresponding CSV file?
I'm using LINQ to SQL to retrieve data from the DB... would using DataSet and DataTable help me?
NOTE: I'd like to allow cumulative imports, so when I import a CSV file there may already be data in the DB. So I cannot use SET IDENTITY_INSERT.
Add the items of Table1 first, so that when you add the items of Table2 the corresponding records of Table1 are already in the database. For more tables you will have to figure out the order. If you are building a system for an arbitrary database schema, you will want to create a table graph in memory, where each node is a table and each arc is a foreign key (there are no types for that in the base library), and then convert it to a tree such that traversing it breadth-first gives you the correct order.
You can let the database handle the cases where there is a foreign key violation, i.e. where no such record exists. You will have to decide whether to make the whole import operation a single transaction, or one per item.
Analysing the CSVs beforehand is also possible. To do that, store the values of the primary key of each table (use a set for this), again iterating over the tables in the correct order. Then, when you are reading a table that has a foreign key to a table you have already read, you can check whether the key is there; this will also help you detect any duplicates. (If there are existing rows in the database to take into account, you would have to query those too. But take care if the database belongs to an active system where records could be deleted while you are still deciding whether the CSVs can be added without problems.)
To address the fact that new IDs are generated when you add...
The simplest solution I can think of is: don't. In particular if it is an active system where other requests are being processed, because then there is no way to predict the new IDs beforehand. Your best bet would be to add the rows one by one; in that case, you will have to plan your transaction strategy accordingly... it may be that you will not be able to roll back.
Although, I think your question is a bit deeper: if the ID in Table1 changes, how can I update the corresponding records in Table2 so they point to the correct record in Table1?
To do that, I suggest doing the analysis described above; you will then have a group of sets that work as indexes. These will help you locate the records in Table2 that need updating for each ID in Table1. (It is also important to keep track of whether you have already updated a record, and not do it twice, because a generated ID may match an ID that has not yet been sent to the database.)
To roll back, you can also use those sets: they will end up holding the new IDs that identify the records you would have to pull out of the database if you wanted to abort the operation.
Edit: those sets (I recommend HashSet) are only half the story, because they only hold the primary keys (for instance, ID in Table1). You will also need bags to keep the foreign keys (in this case, Table1_ID in Table2).
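One concrete way to keep the Table2 foreign keys consistent is to record, while inserting the Table1 rows, a map from each row's old CSV ID to the ID the database generated, and then rewrite Table1_ID in the Table2 rows through that map before inserting them. A minimal in-memory sketch (IDs and shapes are invented):

```csharp
using System;
using System.Collections.Generic;

// Map from each Table1 row's old ID (as it appears in the CSV) to the new
// ID the database generated when the row was inserted. In the real import
// this would be filled in as each Table1 row is added.
var newIdByOldId = new Dictionary<int, int>
{
    [100] = 501,
    [101] = 502,
};

// Table2 rows from the CSV still carry the old Table1 IDs...
var table2Rows = new List<(int Table1_ID, string Value)>
{
    (100, "first"),
    (101, "second"),
};

// ...so rewrite each foreign key through the map before inserting Table2.
for (int i = 0; i < table2Rows.Count; i++)
    table2Rows[i] = (newIdByOldId[table2Rows[i].Table1_ID], table2Rows[i].Value);

Console.WriteLine(table2Rows[0].Table1_ID); // 501
```

Because the map is built as rows are inserted, this works for cumulative imports where the new IDs cannot be predicted up front.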
Relatively simple problem.
Table A has an int primary key ID, a unique Name varchar(500), and cola, colb, etc.
Table B has a foreign key to Table A.
So, in the application, we generate records for both Table A and Table B into DataTables in memory.
We would be generating thousands of these records on a very large number of "clients".
Eventually we make the call to store these records. However, records for Table A may already exist in the database, so we need to get the primary keys for the records that already exist and insert the missing ones. Then we insert all records for Table B with the correct foreign key.
Proposed solution:
I was considering sending an XML document to SQL Server to open as a rowset into TableVarA, updating TableVarA with the primary keys of the records that already exist, then inserting the missing records and outputting them to TableVarNew. I would then select the Name and primary key from TableVarA union all TableVarNew.
Then, in code, populate the correct FKs into Table B in memory and insert all of those records using SqlBulkCopy.
Does this sound like a good solution? And if so, what is the best way to populate the FKs in memory for Table B to match the primary keys from the returned DataSet?
Sounds like a plan, but I think the handling of Table A can be simpler (a single in-memory table/table variable should be sufficient):
- have a TableVarA that contains all rows for Table A
- update the ID for all rows that already exist in Table A (should be doable in a single SQL statement)
- insert all non-existing rows (those that still have an empty ID) into Table A and make a note of their IDs
This could all happen in a single table variable; I don't see why you need to copy stuff around....
Once you've handled Table A, as you say, update Table B's foreign keys and bulk insert those rows in one go.
What I'm not quite clear on is how Table B references Table A. You just said it has an FK, but you didn't specify which column it is on (assuming ID). And how do your Table B rows reference Table A for new rows that aren't inserted yet and thus don't have an ID in Table A?
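The three steps above might be sketched in T-SQL roughly like this (table and column names are illustrative):

```sql
-- TableVarA holds all incoming rows; ID is NULL until resolved.
DECLARE @TableVarA TABLE (ID int NULL, Name varchar(500), ColA int);

-- Step 2: fill in the ID for rows whose Name already exists in Table A.
UPDATE v
SET v.ID = a.ID
FROM @TableVarA v
JOIN TableA a ON a.Name = v.Name;

-- Step 3: insert the rows that are still missing and capture their new IDs.
DECLARE @NewIds TABLE (ID int, Name varchar(500));
INSERT INTO TableA (Name, ColA)
OUTPUT inserted.ID, inserted.Name INTO @NewIds
SELECT Name, ColA FROM @TableVarA WHERE ID IS NULL;

UPDATE v
SET v.ID = n.ID
FROM @TableVarA v
JOIN @NewIds n ON n.Name = v.Name;

-- @TableVarA now maps every Name to its primary key for the Table B insert.
SELECT ID, Name FROM @TableVarA;
```

The final SELECT returns the Name-to-key mapping the application needs to patch Table B's foreign keys before the bulk insert.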
This is more of a comment than a complete answer, but I was running out of room, so please don't vote it down for not meeting the answer criteria.
My concern would be that by evaluating a set for missing keys and then inserting in bulk, you take the risk that a key gets added elsewhere in the meantime. You stated this could come from a large number of clients, so this is going to happen. Yes, you could wrap it in a big transaction, but big transactions are hogs and would lock out other clients.
My thought is to handle the rows that already have keys in bulk separately, assuming there is no risk the PK will be deleted. A TVP is efficient, but you need explicit knowledge of which rows got processed. I think you first need to search on Name to get the list of PKs that exist, then process those via a TVP.
For data integrity, process the rest one at a time via a stored procedure that creates the PK as necessary.
Thousands of records is not scary (millions is). The large number of "clients" is the scary part.