SQL comparison/synchronization speed - C#

I have two tables; let's just call them db1 and db2. db2 contains all the records db1 has, but db1 doesn't contain all the records of db2 (they both have the same columns). I have to check for modifications in db1 every day and apply the same changes to db2.
Currently my tool "exports" both tables into DataTables, performs the comparison, and updates/imports the records into db2:
SELECT * FROM db1 -> db1_table
SELECT * FROM db2 -> db2_table
for (int i = 0; i < db1_table.Rows.Count; i++)
{
for (int j = 0; j < db2_table.Rows.Count; j++)
{
//if db1_table.Rows[i] != db2_table.Rows[j] -> UPDATE db2 SET etc.
//if db1_table.Rows[i] doesn't exist in db2 -> INSERT INTO db2 etc.
}
}
This version becomes quite slow after a while. I'm talking about tens of thousands of records.
The other approach was my initial idea, but I found it slow: I pull the whole of db1, loop through all of its records, and execute an SQL query each time:
SELECT * FROM db1 -> db1_table
for (int i = 0; i < db1_table.Rows.Count; i++)
{
//SELECT * FROM db2 WHERE attributes LIKE db1_table.Rows[i]
//do the comparison here and execute the UPDATE/INSERT commands if necessary
}
Which is the faster (better) way? Are there any other options I might have?

Side note: you really shouldn't store duplicate data in two tables with the same structure in the first place...
Side note: you should be doing this update in SQL.
To answer your actual question: what you're experiencing is O(N^2) algorithmic complexity. It can be reduced to roughly O(N) if you build a hash table (dictionary) from one of the tables and only iterate over the other one. When you look for a match, you look it up in the hash table instead of iterating, which is roughly O(1) instead of O(N). You just need a good key value to hash on.
Something like this:
var dict = db2_table.Rows.Cast<DataRow>().ToDictionary(row2 => row2["keycolumn"]); // this is the hashing, make sure no duplicate keys exist!
foreach (DataRow row1 in db1_table.Rows) {
    DataRow row2;
    if (dict.TryGetValue(row1["keycolumn"], out row2)) {
        // row1 and row2 match by the key column, do something with them
        dict.Remove(row2["keycolumn"]);
    }
    // else no match, row1 must be a new row
}
// now dict contains the keys from db2 which have no match in db1, they must have been deleted

There's another option that's O(n) if you have a unique ID that you can order on and compare: Order both tables by the ID and walk them both at once, generating lists of pending changes. After that you can apply the pending changes. The reason for generating lists of changes is so that you can batch commands together at the end of the change detection and benefit from things like bulk inserts, CTEs or temp tables to join on for deletes, and batched command groups for updates -- all of which reduce one of the biggest sources of latency in this kind of operation: DB round trips.
The main loop looks like the following:
// Assuming that IDs are long and both tables are ordered by ID. Change as required.
var idsToAppend = new List<long>();
var idsToUpdate = new List<long>();
var idsToDelete = new List<long>();
int i = 0;
int j = 0;
while (i < db1_table.Rows.Count || j < db2_table.Rows.Count) {
    if (i == db1_table.Rows.Count) {
        // There are extra rows in the destination that have been removed from the source
        idsToDelete.Add((long)db2_table.Rows[j]["ID"]);
        j++;
    } else if (j == db2_table.Rows.Count) {
        // There are extra rows in the source that need to be added to the destination
        idsToAppend.Add((long)db1_table.Rows[i]["ID"]);
        i++;
    } else {
        long db1_id = (long)db1_table.Rows[i]["ID"];
        long db2_id = (long)db2_table.Rows[j]["ID"];
        if (db1_id == db2_id) {
            // On the same ID in both datasets
            if (!RowsAreEqual(db1_table.Rows[i], db2_table.Rows[j])) {
                // RowsAreEqual is a column-by-column comparison (== won't work on DataRows);
                // only do this if db1 may change and the changes must be propagated to db2
                idsToUpdate.Add(db1_id);
            }
            i++;
            j++;
        } else if (db1_id > db2_id) {
            // The row was removed from db1, remove the matching row from db2
            idsToDelete.Add(db2_id);
            j++;
        } else {
            // implicit: db1_id < db2_id
            // the row in db1 doesn't exist in db2 yet, it needs to be added
            idsToAppend.Add(db1_id);
            i++;
        }
    }
}
// Walk idsToAppend, idsToUpdate, and idsToDelete applying changes
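The row comparison and the batched apply step are only hinted at above, so here is a minimal sketch of both under a few assumptions: RowsAreEqual is a hypothetical column-by-column comparison, and ApplyDeletes assumes an open SqlConnection named connection plus a destination table literally named db2 with a long ID column. Adjust the names to your schema and data access layer; the same batching pattern applies to the inserts and updates.
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Linq;

static class SyncHelpers {
    // Hypothetical column-by-column comparison used in the merge walk above.
    public static bool RowsAreEqual(DataRow a, DataRow b) {
        for (int c = 0; c < a.Table.Columns.Count; c++) {
            if (!Equals(a[c], b[c]))
                return false;
        }
        return true;
    }

    // Apply the collected deletes in batches to cut down on round trips.
    // Table and column names here are placeholders taken from the question.
    public static void ApplyDeletes(SqlConnection connection, List<long> idsToDelete, int batchSize = 1000) {
        for (int start = 0; start < idsToDelete.Count; start += batchSize) {
            var batch = idsToDelete.Skip(start).Take(batchSize);
            string sql = "DELETE FROM db2 WHERE ID IN (" + string.Join(",", batch) + ")";
            using (var cmd = new SqlCommand(sql, connection)) {
                cmd.ExecuteNonQuery();
            }
        }
    }
}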

Related

Taking rows in chunks from datatable and inserting in database

I have around 25k records in a DataTable. I already have an update query, written by a previous developer, which I can't change. What I am trying to do is as follows:
Take 1000 records at a time from the DataTable (the total can vary from 1 to 25k).
Take the update query, which is in a string, replace its IN('values here') clause with these 1000 records, and fire the query against the database.
Now, I know there are more efficient ways to do it, like bulk insert using array binding, but I can't change the present coding pattern due to restrictions.
What I have tried to do:
if (dt.Rows.Count > 0)
{
    foreach (DataRow dr in dt.Rows)
    {
        reviewitemsend = reviewitemsend + dr["ItemID"].ToString() + ',';
        // If record count is 1000, execute against database.
    }
}
Now, the above approach is taking me nowhere and I am stuck. Another, better approach I am thinking of is below:
int TotalRecords = dt.Rows.Count;
if (TotalRecords < 1000 && TotalRecords > 0)
{
    // Update the existing query with these records by placing them in the IN clause and execute
}
else
{
    int intLoopCounter = TotalRecords / 1000; // Since the counter is a whole number, also check the modulus division; if it is non-zero, increment intLoopCounter by 1 to cover the extra records
    for (int i = 0; i < intLoopCounter; i++)
    {
        // Take a thousand records at a time (the last chunk may have fewer than 1000) and execute against the database
    }
}
Also, note the update query is below:
string UpdateStatement = @"UPDATE Table
    SET column1 = <STATUS>,
        column2 = '<NOTES>',
        changed_by = '<CHANGEDBY>',
        status = NULL
    WHERE ID IN (<IDS>)";
In the above update query, <IDS> is already replaced with all 25k record IDs, which is shown to the end user like that; internally, I have to execute it as separate chunks, so within the IN() clause I need to insert 1k records at a time.
You can split your DataTable using this LINQ method:
// table.AsEnumerable() requires a reference to System.Data.DataSetExtensions
private static List<List<DataRow>> SplitDataTable(DataTable table, int pageSize)
{
    return table.AsEnumerable()
        .Select((row, index) => new { Row = row, Index = index })
        .GroupBy(x => x.Index / pageSize)
        .Select(x => x.Select(v => v.Row).ToList())
        .ToList();
}
Then run the database query on each chunk:
foreach (List<DataRow> chunk in SplitDataTable(dt, 1000))
{
    foreach (DataRow row in chunk)
    {
        // prepare data from row
    }
    // execute against database
}
Tip: you can modify the split query to prepare your data directly inside of it (by replacing the x.Select(v => v.Row) part), instead of looping twice over that huge DataTable.
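To tie this back to the question's UpdateStatement, here is a minimal sketch of building the IN(...) list per chunk and executing it. It assumes you keep a template with the <IDS> placeholder for the internal execution (with <STATUS>, <NOTES> and <CHANGEDBY> already substituted), dt is the question's DataTable, and connection is an open SqlConnection; adapt it to your actual data access code.
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Linq;

// ...

foreach (List<DataRow> chunk in SplitDataTable(dt, 1000))
{
    // Build the comma-separated ID list for this chunk of up to 1000 rows
    string ids = string.Join(",", chunk.Select(r => r["ItemID"].ToString()));

    // Substitute the chunk's IDs into the existing query's IN(<IDS>) clause
    string sql = UpdateStatement.Replace("<IDS>", ids);

    using (var cmd = new SqlCommand(sql, connection))
    {
        cmd.ExecuteNonQuery();
    }
}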

What is the best approach to compare two columns with data in a table?

I want to compare data in a DataTable, column by column, automatically, following one rule:
I will compare the columns in pairs, Ax_y vs Bx_y.
A0_0 vs B0_0
A1_1 vs B1_1
...........
I tried this code:
foreach(DataRow r in dt.Rows)
{
if (r["A0_0"] == r["B0_0"])
{
// do something
}
}
But this fails. I want to loop over all rows and compare, but I have about 50 columns, so doing this manually is not a good idea.
Note: the picture just shows sample columns. In the real database it will look like:
A0_0 B0_0 A0_1 B0_1 A1_0 B1_0 A1_1 B1_1 A2_0 B2_0 A2_1 B2_1
Loop through the columns inside each iteration over the rows. Something like the following:
foreach (DataRow r in myTableData.Rows)
{
    for (int i = 1; i < myTableData.Columns.Count - 1; i += 2)
    {
        // use Equals rather than == so the boxed values are compared, not the references
        if (Equals(r[i], r[i + 1]))
        {
            // do something;
        }
    }
}
Here the inner loop iterates through the columns for each row and compares the i-th column's data with the (i+1)-th, i.e. when i = 1 it compares r["A0_0"] and r["B0_0"]. We skip the 0th column since it's the ID.

Performance of setting DataRow values in a large DataTable

I have a large DataTable - around 15000 rows and 100 columns - and I need to set the values for some of the columns in every row.
// Creating the DataTable
DataTable dt = new DataTable();
for (int i = 0; i < COLS_NUM; i++)
{
dt.Columns.Add("COL" + i);
}
for (int i = 0; i < ROWS_NUM; i++)
{
dt.Rows.Add(dt.NewRow());
}
// Setting several values in every row
Stopwatch sw2 = new Stopwatch();
sw2.Start();
foreach (DataRow row in dt.Rows)
{
for (int j = 0; j < 15; j++)
{
row["Col" + j] = 5;
}
}
sw2.Stop();
The measured time above is about 4.5 seconds. Is there any simple way to improve this?
Before you populate the data, call the BeginLoadData() method on the DataTable. When you have finished loading the data, call EndLoadData(). This turns off all notifications, index maintenance, and constraints, which will improve performance.
As an alternative, call BeginEdit() before updating each row, and EndEdit() when the editing for that row is complete.
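For example, a minimal sketch of wrapping the population loop from the question (dt and the COL0..COL14 column names are the ones from the question):
// Suspend notifications, index maintenance and constraint checking while writing
dt.BeginLoadData();
foreach (DataRow row in dt.Rows)
{
    // (the per-row alternative would be row.BeginEdit(); ... row.EndEdit();)
    for (int j = 0; j < 15; j++)
    {
        row["COL" + j] = 5;   // the per-column assignment from the question
    }
}
dt.EndLoadData();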
Here is a link with more information on improving DataSet performance:
http://www.softwire.com/blog/2011/08/04/dataset-performance-in-net-web-applications/
One improvement that I can think of is editing columns by their indices, rather than their names.
foreach (DataRow row in dt.Rows)
{
for (int j = 0; j < 15; j++)
{
row[j] = 5;
}
}
With an empirical test, your method seems to run in ~1500 milliseconds on my computer, and this index based version runs in ~1100 milliseconds.
Also, see Marc's answer in this post:
Set value for all rows in a datatable without for loop
This depends on your business logic, which is not clear in your question. However, if you want to set the values of some of the columns in every row, try the following:
Create a separate temp column (or columns); you might create it in the same loop that creates the original DataTable.
Fill the new values into this column.
Delete the old column and insert the new one in its place instead.
This solution makes sense if you know the new values in advance, if the value is the same for all rows (like in your example), or if the values repeat in some way; in those cases adding a new, pre-loaded column will be much faster than looping over all the rows.
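For the constant-value case in the question, a rough sketch of that idea could look like the following. It relies on the DefaultValue of a DataColumn being applied to the rows that already exist when the column is added; the column names and the value 5 are taken from the question, the _tmp name is made up here.
// Sketch: replace COL0 with a pre-loaded column instead of looping over the rows.
var newCol = new DataColumn("COL0_tmp", typeof(int)) { DefaultValue = 5 };
dt.Columns.Add(newCol);        // existing rows pick up the DefaultValue

int ordinal = dt.Columns["COL0"].Ordinal;
dt.Columns.Remove("COL0");     // drop the old column
newCol.ColumnName = "COL0";    // rename the replacement
newCol.SetOrdinal(ordinal);    // and move it back into the old position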

How to implement SQL Server paging using C# and entity framework?

I have to update every row in a SQL Server table with about 150,000 records using Entity Framework. To reduce the number of hits the server takes, I would like to do this in separate batches of 1000 rows. I need Entity Framework to:
Select the first 1000 rows from the DB.
Update those rows.
Call SaveChanges() method.
Get next 1000 rows.
Repeat.
What's the best way to achieve this?
I'm using Entity Framework 4 and SQL Server 2012.
Use LINQ Skip & Take:
return query.Skip(HOW_MUCH_TO_SKIP /* zero at the beginning */)
            .Take(HOW_MUCH_TO_TAKE /* your paging size */)
            .ToList();
If you want to do it within a loop you can do something like this:
int pagingIncrement = 1000;
for (int i = 0; i <= 150000; i += pagingIncrement)
{
    var query = ...; // your actual LINQ query
    var results = query.Skip(i).Take(pagingIncrement);
    UpdatePartialResults(results);
}
Note: It is important that while updating those rows you don't update the criteria used for the ORDER BY within your actual LINQ query, otherwise you could end up updating the same results again and again (because of the reordering).
Another idea would be to wrap the previous approach in an iterator, i.e. Skip(counter).Take(pagingSize) with yield return, so the pages can be processed as a stream; a rough sketch follows.
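A sketch of that idea with a hypothetical InPages extension method (the name and signature are made up here), which re-queries each page and yields it lazily; the same ordering caveat as above applies:
using System.Collections.Generic;
using System.Linq;

public static class PagingExtensions
{
    // Lazily yields the query results one page at a time.
    // The query must have a stable ordering for Skip/Take to be reliable.
    public static IEnumerable<List<T>> InPages<T>(this IQueryable<T> query, int pageSize)
    {
        int skip = 0;
        while (true)
        {
            var page = query.Skip(skip).Take(pageSize).ToList();
            if (page.Count == 0)
                yield break;
            yield return page;
            skip += pageSize;
        }
    }
}
Usage would then look something like foreach (var page in query.OrderBy(x => x.id).InPages(1000)) { /* update the page, then SaveChanges() */ }.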
Something like this should work:
int skip = 0;
int take = 1000;
for (int i = 0; i < 150; i++)
{
    var rows = (from x in Context.Table
                select x).OrderBy(x => x.id).Skip(skip).Take(take).ToList();
    // do some update stuff with rows, then call Context.SaveChanges()
    skip += take;
}

Combine 3 different datatables into 1 and performance with SQL

I was asked to do a report that combines 3 different Crystal Reports that we use. Those reports are already very slow and heavy, and making 1 big one was out of the question, so I created a little app in VS 2010.
My main problem is this: I have 3 DataTables (same schema), created with the DataSet designer, that I need to combine. I created an empty table to store the combined values. The queries are already pretty big, so combining them in a single SQL query is really out of the question.
Also, I do not have write access to the SQL Server (2005), because the server is maintained by the company that created our MRP program, although I could always ask support to add a view to the server.
My 3 DataTables consist of labor cost, material cost and subcontracting cost. I need to create a total cost table that sums the Cost column of each table by ID. All the tables have keys to find and select them.
The problem is that fetching all of the current jobs is OK (500 ms for 400 records), because I have a query that fetches only the jobs being worked on. The problem is with inventory: since I do not know when those jobs were finished, I have to fetch the entire database (around 10,000 jobs, with subqueries that each return up to 100 records), and this for my 3 tables. That takes around 5000 to 8000 ms; although it is very fast compared to the Crystal Report, there is one problem.
I need to create a summary table that combines all these different tables I created, but I also need to do it 2 times, once for each date that is output. So my data always changes, because it is based on a date parameter. Right now it takes around 12-20 seconds to fetch them all.
I need a way to reduce the load time. Here is what I tried:
Tried a for loop to combine the 3 tables.
Tried the DataTableReader class to read each row, using the FindBy methods that the DataSet designer created to find the matching value in the other table; I have to do this 2 times (it seems to go a little bit faster than the for loop).
Tried LINQ; I don't think it is possible, and would it even give better performance?
Tried a dynamic query that uses a "WHERE IN (comma-separated list)" (that actually doubled the execution time compared to fetching the whole database).
Tried to join my inventory query to my cost queries (that also increased the time it took).
1 - So is there any way to combine my tables more effectively? What is the fastest way to merge and sum the records of my 3 tables?
2 - Is there any way to increase the performance of my queries without having write access to the server?
Below is some of the code I used, for reference:
public static void Fill()
{
DateTime Date = Data.Date;
AllieesDBTableAdapters.CoutMatTableAdapter mat = new AllieesDBTableAdapters.CoutMatTableAdapter();
AllieesDBTableAdapters.CoutLaborTableAdapter lab = new AllieesDBTableAdapters.CoutLaborTableAdapter();
AllieesDBTableAdapters.CoutSTTableAdapter st = new AllieesDBTableAdapters.CoutSTTableAdapter();
Data.allieesDB.CoutTOT.Clear();
//Around 2 sec each Fill
mat.FillUni(Data.allieesDB.CoutMat, Date);
Data.allieesDB.CoutMat.CopyToDataTable(Data.allieesDB.CoutTOT, LoadOption.OverwriteChanges);
lab.FillUni(Data.allieesDB.CoutLabor, Date);
MergeTable(Data.allieesDB.CoutLabor);
st.FillUni(Data.allieesDB.CoutST, Date);
MergeTable(Data.allieesDB.CoutST);
}
Here is the MergeTable method (the for loop I tried is commented out):
private static void MergeTable(DataTable Table)
{
AllieesDB.CoutTOTDataTable dtTOT = Data.allieesDB.CoutTOT;
DataTableReader r = new DataTableReader(Table);
while (r.Read())
{
DataRow drToT = dtTOT.FindByWO(r.GetValue(2).ToString());
if (drToT != null)
{
drToT["Cout"] = (decimal)drToT["Cout"] + (decimal)r.GetValue(3);
} else
{
EA_CoutsDesVentes.AllieesDB.CoutTOTRow row = dtTOT.NewCoutTOTRow();
for (int j = 0; j < r.FieldCount; j++)
{
if (r.GetValue(j) != null)
{
row[j] = r.GetValue(j);
} else
{
row[j] = null;
}
}
dtTOT.AddCoutTOTRow(row);
}
Application.DoEvents();
}
//try
//{
// for (int i = 0; i < Table.Rows.Count; i++)
// {
// DataRow drSource = Table.Rows[i];
// DataRow drToT = dtTOT.FindByWO(drSource["WO"].ToString());
//if (drToT != null)
//{
// drToT["Cout"] = (decimal)drToT["Cout"] + (decimal)drSource["Cout"];
//} else
//{
//
// EA_CoutsDesVentes.AllieesDB.CoutTOTRow row = dtTOT.NewCoutTOTRow();
// for (int j = 0; j < drSource.Table.Columns.Count; j++)
// {
// if (drSource[j] != null)
// {
// row[j] = drSource[j];
// } else
// {
// row[j] = null;
// }
// }
// dtTOT.AddCoutTOTRow(row);
//}
//Application.DoEvents();
// }
//} catch (Exception)
//{
//}
On SQL Server 2005 and up, you can create an indexed (materialized) view of the aggregate values and dramatically speed up performance.
Look at Improving Performance with SQL Server 2005 Indexed Views.
