Best way to compare two large list, C# - c#

This is for one of my ETL project to sync two database, some table is 4G, so ETL job just load updated data to insert or update, that works fine, but the source table will delete some records, and I want to delete from my table too. What I did is:
List<long> SourceIDList; // load all ID from source table
List<long> MyIDList; // load all ID from my table
var NeedRemoveIDList = MyIDList.Except( SourceIDList );
foreach(var ID in NeedRemoveIDList)
// remove from my table
The code logic work, but load ID from 4G table to List will through "out of memory" exception, is there better way?

Thanks for all the comments, I end up doing this in database, I insert two list into temp table, and use SQL to compare them, take some time to insert data, but since this is ETL job, few extra minutes is OK.

Related

Bulk insert in PGSQL with master-child relation

I am migrating data from old schema to updated schema (PG as new database in place).
I am using C# to automate this process.
Here is a screenshot that shows original data (sample to understand).
Based on updated schema, this source data need to be separated into two tables Vehicles and PartPricing.
The unique combination of Make, Model and Year will be inserted into Vehicles and linked with unique Id.
The Part and PartPrice will then be inserted into PartPricing table and need to be linked with VehicleId. (VehicleId refers to Id of Vehicles table)
Below screen shows the expected output.
The approach i followed is -
Get the unique list of Make, Model and Year and generate bulk insert query and execute.
Fetch all the inserted vehicles and cache into a collection.
Now loop through each line item in the source
Lookup for VehicleId based on Make, Model and Year (from within collection and not from database)
Prepare insert statement for PartPricing
After loop completes, execute the bulk insert query for PartPricing.
Although the Vehicles data is inserted pretty quickly but the preparation of bulk insert for PartPricing is taking considerable amount of time due to lookup.
Is there a better alternative to this problem? please suggest.
Just FYI, when i say bulk insert it follows -
Insert into Vehicles(Make, Model, Year) values
('Honda', 'City', 2010),
('Honda', 'City', 2011),
('Hyundai', 'Accent', 2011),
....
('Toyota', 'Corolla', 2015);

Insert/Update whole DataTable into database table C#

I am facing an issue I hope to get it solved by here. I have 3 different tables in a DataSet and I want to insert it in the database table.
I know I can do this using SqlBulkCopy but there is a catch and that is I want to check if the data already exists in the database then I want it to get updated instead of insert.
And if the data doesn't exist in the database table, I want to insert it then. Any help on this would be appreciated.
I know I can iterate it through each record and then fire a procedure which will check for its existence if it exists den update or else insert. But the data size is huge and iterating through each record would be a time taking process, I don't want to use this approach.
Regards
Disclaimer: I'm the owner of the project Bulk Operations
This project allows to BulkInsert, BulkUpdate, BulkDelete, and BulkMerge (Upsert).
Under the hood, it does almost what #marc_s have suggested (Use SqlBulkCopy into a temporary table and perform a merge statement to insert or update depending on the primary key).
var bulk = new BulkOperation(connection);
bulk.BulkMerge(dt);

Efficient insert statement

I'm looking for an efficient way of inserting records into SQL server for my C#/MVC application. Anyone know what the best method would be?
Normally I've just done a while loop and insert statement within, but then again I've not had quite so many records to deal with. I need to insert around half a million, and at 300 rows a minute with the while loop, I'll be here all day!
What I'm doing is looping through a large holding table, and using it's rows to create records in a different table. I've set up some functions for lookup data which is necessary for the new table, and this is no doubt adding to the drain.
So here is the query I have. Extremely inefficient for large amounts of data!
Declare #HoldingID int
Set #HoldingID = (Select min(HoldingID) From HoldingTable)
While #JourneyHoldingID IS NOT NULL
Begin
Insert Into Journeys (DepartureID, ArrivalID, ProviderID, JourneyNumber, Active)
Select
dbo.GetHubIDFromName(StartHubName),
dbo.GetHubIDFromName(EndHubName),
dbo.GetBusIDFromName(CompanyName),
JourneyNo, 1
From Holding
Where HoldingID = #HoldingID
Set #HoldingID = (Select MIN(HoldingID) From Holding Where HoldingID > #HoldingID)
End
I've heard about set-based approaches - is there anything that might work for the above problem?
If you want to insert a lot of data into a MSSQL Server then you should use BULK INSERTs - there is a command line tool called the bcp utility for this, and also a C# wrapper for performing Bulk Copy Operations, but under the covers they are all using BULK INSERT.
Depending on your application you may want to insert your data into a staging table first, and then either MERGE or INSERT INTO SELECT... to transfer those rows from the staging table(s) to the target table(s) - if you have a lot of data then this will take some time, however will be a lot quicker than performing the inserts individually.
If you want to speed this up then are various things that you can do such as changing the recovery model or tweaking / removing triggers and indexes (depending on whether or not this is a live database or not). If its still really slow then you should look into doing this process in batches (e.g. 1000 rows at a time).
This should be exactly what you are doing now.
Insert Into Journeys(DepartureID, ArrivalID, ProviderID, JourneyNumber, Active)
Select
dbo.GetHubIDFromName(StartHubName),
dbo.GetHubIDFromName(EndHubName),
dbo.GetBusIDFromName(CompanyName),
JourneyNo, 1
From Holding
ORDER BY HoldingID ASC
you (probably) are able to do it in one statement of the form
INSERT INTO JOURNEYS
SELECT * FROM HOLDING;
Without more information about your schema it is difficult to be absolutely sure.
SQLServer 2008 introduced Table Parameters. These allow you to insert multiple rows in a single trip to the database (send it as a large blob). Without using a temporary table. This article describes how it works (step four in the article)
http://www.altdevblogaday.com/2012/05/16/sql-server-high-performance-inserts/
It differs from bulk inserts in that you do not need special utilities and that all constraints and foreign keys are checked.
I quadrupled my throughput using this and parallelizing the inserts. Now at 15.000 inserts/second in the same table sustained. Regular table with indexes and over a billion rows.

Importing CSV data into application DB maintaining foreign key consistence

In my ASP.NET web app I'm trying to implement an import/export procedure to save or insert data in the application DB. My procedure generates some CSV files: one for each table.
Obviously there are relations between some of these tables and when I import CSV in my DB I'd like to maintain association between rows.
Say I have Table1 and Table2 with Table2 that has a foreign key to Table1. So I could have a row in Table1 with ID = 100 and a row in Table2 with Table1_ID = 100.
When I import CSV with Table1 data, new IDs are generated for Table1 rows, how can I maintain consistency of the foreign keys in Table2 when I import the corresponding CSV file?
I'm using Linq-to-SQL to retrieve data from DB... using DataSet and DataTable can help me?
NOTE I'd like to permit cumulative import, so when I import a CSV file there may already be data in the DB. So I cannot use 'Set Identity OFF'.
Add the items of Table1 first, so when you add the items of Table2 there are the corresponding records of Table1 already in the database. For more tables you will have figure out the order. If you are creating a system of arbitrary database schema, you will want to create a table graph (where each node is a table and each arc is a foreign key) in memory [There are no types for that in the base library] and then convert it to a tree such that you get the correct order by traversing the tree (breadth-first).
You can let the database handle the cases where there is a violation of the foreign key, because there is not such field. You will have to decide if you make a transaction of the whole import operation, or per item.
Although analisying the CSVs before hand is possible. To do that, you will want to store the values for the primary key of each table [Use a set for that] (again, iterate over the tables in the correct order), and then when you are reading a table that has a foreign key to a table that you have already read you can check if the key is there, also it will help you yo detect any possible duplicate. [If you have things already in the database to take into account, you would have to query too... although, take care if the database is in an active system where records could be deleted while you are still deciding if you can add the CSVs without problem].
To address that you are generating new IDs when you add...
The simplest solution that I can think of is: don't. In particular if it is an active system, where other requests are being processed, because then there is no way to predict the new IDs before hand. Your best bet would be to add them one by one, in that case, you will have to think your transaction strategy accordningly... it may be the case that you will not be able to roll back.
Although, I think your question is a bit deeper: If the ID of the Table1 did change, then how can I update the corresponding records in the Table2 so they point to the correct record in Table1?
To do that, I want to suggest to do the analysis as I described above, then you will have a group of sets that will works as indexes. This will help you locate the records that you need to update in Table2 for each ID in Table1. [It is also important to keep track if you have already updated a record, and don't do it twice, because it may happen the generated ID match an ID that is yet to be sent to the database].
To roll back, you can also use those sets, as they will end up having the new IDs that identify the records that you will have to pull out of the database if you want to abort the operation.
Edit: those sets (I recommend hashset) are only have the story, because they only have the primary key (for intance: ID in Table1). You will need bags to keep the foreing keys (in this case Table1_ID in Table2).

DataSet TableAdapter Fill Method

I have a DataSet with two TableAdapters (1 to many relationship) that was created using visual studio 2010's Configuration Wizard.
I make a call to an external source and populate a Dictionary with the results. These results should be all of the entries in the database. To synchronize the DB I don't want to just clear all of the tables and then repopulate them like dropping the tables and creating them with new data in sql.
Is there a clean way possibly using the TableAdapter.Fill() method or do I have to loop through the two tables row by row and decide if it stay or gets deleted and then add the new entries? What is the best approach to make the data that is in the dictionary be the only data in my two tables with the DataSet?
First Question: if it's the same DB why do you have 2 tables with the same information?
To the question at hand: that largley depend on the sizes. If the tables are not big then use a transaction, clear the table (DELETE * FROM TABLE or whatever) and write your data in there again.
If the tables are big on the other hand the question is: can you load all this into your dictionary?
Of course you have to ask yourself what happens to inconsistent data (another user/app changed the data while you had it in your dictionary).
If this takes to long you could remember what you did to the data - that means: flag the changed data and remember the deleted keys and new inserted rows and make your updates based on that.
Both can be achieved by remembering the Filled DataTable and use this as backing field or by implementing your own mechanisms.
In any way I would recommend think on the problem: do you really need the dictionary? Why not make queries against the database to get the data? Or only cache a part of the data for quick access?
PS: the update method on you DataAdapter will do all the work (changing the changed, removing the deleted and inserting the new datarows but it will update the DataTable/Set so this will only work once)
It could be that it is quicker to repopulate the entire table than to itterate through and decide what record go / stay. Could you not do the process of deciding if a records is deleteed via an sql statement ? (Delete from table where active = false) if you want them to stay in the database but not in the dataset (select * from table where active = true)
You could have a date field and select all records that have been added since the date you late 'pooled' the database (select * from table where active = true and date-added > #12:30#)

Categories