Combine 3 different datatables into 1 and performance with SQL - c#

I was asked to build a report that combines three Crystal Reports we already use. Those reports are already slow and heavy, so making one big report was out of the question. Instead I created a small application in VS 2010.
My main problem is this: I have three DataTables (same schema), created with the DataSet designer, that I need to combine. I created an empty table to store the combined values. The queries are already pretty big, so combining them in a single SQL query is really out of the question.
Also, I do not have write access to the SQL Server (2005), because the server is maintained by the company that created our MRP program, although I could always ask support to add a view to the server.
My three DataTables hold Labor Cost, Material Cost and Subcontracting Cost. I need to create a total cost table that adds up the Cost column of each table by ID. All the tables have keys to find and select rows.
Fetching all of the current jobs is fine (500 ms for 400 records), because I have a query that fetches only the jobs in progress. The problem is with Inventory: since I do not know when those jobs were finished, I have to fetch the entire database (around 10,000 jobs with subqueries that each have up to 100 records), and this for all three tables. This takes around 5,000 to 8,000 ms. That is very fast compared to the Crystal Report, but there is one problem.
I need to create a summary table that combines all these tables, and I need to do it twice, once for each date that is output. The data always changes because it is based on a Date parameter. Right now it takes around 12 to 20 seconds to fetch everything.
I need a way to reduce the load time. Here is what I tried:
Tried a for loop to combine the 3 tables.
Tried a DataTableReader to read each row, using the FindBy methods that the DataSet designer generated (FindByWO in the code below) to find the value in the other table; I have to do this twice. (It seems to go a little faster than the for loop.)
Tried LINQ, but I don't think it is possible. Would it even perform better?
Tried a dynamic query that uses WHERE ... IN with a comma-separated list (that actually doubled the execution time compared to fetching the whole database).
Tried joining my Inventory query to my Cost queries (that also increased the time it took).
1 - Is there any way to combine my tables more efficiently? What is the fastest way to merge and sum the records of my 3 tables?
2 - Is there any way to increase the performance of my queries without having write access to the server?
Below is some of the code I used, for reference:
public static void Fill()
{
    DateTime Date = Data.Date;
    AllieesDBTableAdapters.CoutMatTableAdapter mat = new AllieesDBTableAdapters.CoutMatTableAdapter();
    AllieesDBTableAdapters.CoutLaborTableAdapter lab = new AllieesDBTableAdapters.CoutLaborTableAdapter();
    AllieesDBTableAdapters.CoutSTTableAdapter st = new AllieesDBTableAdapters.CoutSTTableAdapter();
    Data.allieesDB.CoutTOT.Clear();
    // Around 2 sec each Fill
    mat.FillUni(Data.allieesDB.CoutMat, Date);
    Data.allieesDB.CoutMat.CopyToDataTable(Data.allieesDB.CoutTOT, LoadOption.OverwriteChanges);
    lab.FillUni(Data.allieesDB.CoutLabor, Date);
    MergeTable(Data.allieesDB.CoutLabor);
    st.FillUni(Data.allieesDB.CoutST, Date);
    MergeTable(Data.allieesDB.CoutST);
}
Here is the MergeTable method (the for loop I tried is commented out):
private static void MergeTable(DataTable Table)
{
    AllieesDB.CoutTOTDataTable dtTOT = Data.allieesDB.CoutTOT;
    DataTableReader r = new DataTableReader(Table);
    while (r.Read())
    {
        // Column 2 is the WO key, column 3 is the cost
        DataRow drToT = dtTOT.FindByWO(r.GetValue(2).ToString());
        if (drToT != null)
        {
            drToT["Cout"] = (decimal)drToT["Cout"] + (decimal)r.GetValue(3);
        }
        else
        {
            EA_CoutsDesVentes.AllieesDB.CoutTOTRow row = dtTOT.NewCoutTOTRow();
            for (int j = 0; j < r.FieldCount; j++)
            {
                // GetValue returns DBNull.Value (never null) for empty fields,
                // so the value can be copied as-is
                row[j] = r.GetValue(j);
            }
            dtTOT.AddCoutTOTRow(row);
        }
        Application.DoEvents();
    }
    //try
    //{
    //    for (int i = 0; i < Table.Rows.Count; i++)
    //    {
    //        DataRow drSource = Table.Rows[i];
    //        DataRow drToT = dtTOT.FindByWO(drSource["WO"].ToString());
    //        if (drToT != null)
    //        {
    //            drToT["Cout"] = (decimal)drToT["Cout"] + (decimal)drSource["Cout"];
    //        } else
    //        {
    //            EA_CoutsDesVentes.AllieesDB.CoutTOTRow row = dtTOT.NewCoutTOTRow();
    //            for (int j = 0; j < drSource.Table.Columns.Count; j++)
    //            {
    //                if (drSource[j] != null)
    //                {
    //                    row[j] = drSource[j];
    //                } else
    //                {
    //                    row[j] = null;
    //                }
    //            }
    //            dtTOT.AddCoutTOTRow(row);
    //        }
    //        Application.DoEvents();
    //    }
    //} catch (Exception)
    //{
    //}
}

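On SQL Server 2005 and up, you can create a materialized view of the aggregate values (SQL Server calls this an indexed view), which can dramatically speed up the queries. Since you have no write access, this is something you would have to ask support to add.
See "Improving Performance with SQL Server 2005 Indexed Views".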
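On the client side, the merge-and-sum step from the question can also be done in one pass with a dictionary keyed by the work order. The following is only a minimal sketch: the column names "WO" and "Cout" are taken from the question's code, while the class and method names (CostMerger, SumByWorkOrder, NewCostTable) are made up for illustration. Whether it helps depends on where the time actually goes; the question reports about 2 seconds per Fill, which this does not change.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class CostMerger
{
    // Sums the "Cout" column per "WO" key across any number of tables,
    // using one dictionary lookup per row instead of a Find call plus a row update.
    public static Dictionary<string, decimal> SumByWorkOrder(params DataTable[] tables)
    {
        var totals = new Dictionary<string, decimal>();
        foreach (DataTable table in tables)
        {
            foreach (DataRow row in table.Rows)
            {
                if (row.IsNull("Cout")) continue; // skip rows without a cost
                string wo = row["WO"].ToString();
                decimal current;
                totals.TryGetValue(wo, out current); // current is 0 when the key is new
                totals[wo] = current + Convert.ToDecimal(row["Cout"]);
            }
        }
        return totals;
    }

    // Hypothetical helper: an empty table with the two columns the sketch relies on.
    public static DataTable NewCostTable()
    {
        var table = new DataTable();
        table.Columns.Add("WO", typeof(string));
        table.Columns.Add("Cout", typeof(decimal));
        return table;
    }
}
```

With the typed DataSet from the question, the call would presumably look like CostMerger.SumByWorkOrder(Data.allieesDB.CoutMat, Data.allieesDB.CoutLabor, Data.allieesDB.CoutST), after which the totals can be written into CoutTOT once.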

Related

My DataTables within this method are producing csv files that are double the number of rows they contain, why?

I have the method below. I'm trying to clean up a text file based on some criteria, and with everyone's help I've gotten this far: everything is working, creating and filtering. But during my last set of tests I noticed that the CSV files I'm generating (only for testing purposes) contain double the number of rows of the actual DataTable. The first DataTable in my method is called "output" and has a row count of 2257, but the CSV that is created has 4514 records. The second one, "outputLoc", has a row count of 1402, but its CSV file ends up with 2804 records.
Here is the code currently being executed and working, but generating the above numbers.
try
{
    DataTable d = processFileData(concatFile);
    // REMOVES ALL RECORDS WITH A CLASS THAT IS NON-LABEL CLASS
    var query = from r in d.AsEnumerable()
                where !returnClass().Any(r.Field<string>("Column7").Contains)
                select r;
    DataTable output = query.CopyToDataTable<DataRow>(); // Should have 2257 records
    int dtoutputCount = output.Rows.Count;
    for (int i = 0; i < dtoutputCount; i++)
    {
        DataRow rows = output.Rows[i];
        output.ImportRow(rows);
    }
    ToCSV(output, ftype, "filteredclass"); // Only writing to csv for testing and verification of data
    // REMOVES ALL RECORDS THAT HAVE A NON-SELLING OR UNOPENED LOCATION
    var queryLoc = from rL in output.AsEnumerable()
                   where !returnLocations().Any(rL.Field<string>("Column2").Contains)
                   select rL;
    DataTable outputLoc = queryLoc.CopyToDataTable<DataRow>();
    int dtoutputLocCount = outputLoc.Rows.Count;
    for (int i = 0; i < dtoutputLocCount; i++)
    {
        DataRow rows = outputLoc.Rows[i];
        outputLoc.ImportRow(rows);
    }
    ToCSV(outputLoc, ftype, "filteredlocation"); // Only writing to csv for testing and verification of data
}
catch (Exception e)
{
    Console.WriteLine(e.InnerException);
}
Once everything is working as expected, we will eventually get rid of the ToCSV calls so that we can work with the data in memory, and only call ToCSV once everything is cleaned, to produce the final filtered file.
Any help in determining why I'm getting a file that is exactly twice as big as expected would be greatly appreciated.
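Going only by the code shown (ToCSV itself is not shown, so this is an assumption about where the duplication happens), a likely cause is the two for loops: CopyToDataTable already returns a filled table, and calling ImportRow on a table with one of its own rows appends a copy of that row. The sketch below reproduces the effect on a small in-memory table; the class and method names are made up for illustration.

```csharp
using System;
using System.Data;

public static class ImportRowDemo
{
    // Builds a table with rowCount rows, then re-imports each of its own rows,
    // mirroring the loop in the question. Returns { count before, count after }.
    public static int[] CountBeforeAndAfterSelfImport(int rowCount)
    {
        var table = new DataTable();
        table.Columns.Add("Column7", typeof(string));
        for (int i = 0; i < rowCount; i++)
            table.Rows.Add("value" + i);

        int before = table.Rows.Count;
        for (int i = 0; i < before; i++)
            table.ImportRow(table.Rows[i]); // appends a copy of row i to the same table

        return new[] { before, table.Rows.Count };
    }
}
```

If this is indeed the cause, removing both loops and passing the result of CopyToDataTable straight to ToCSV should give files with the expected row counts.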

IEnumerable Collection - Out of Memory Exception

I have code in my C# console app that queries a LARGE dataset in SQL and adds it to an IEnumerable collection that I iterate through later in the app. On a SQL table that returns fewer than 100K rows it works great, but I have to use this to iterate through 100 million records. After the SQL query runs and Dapper tries to fill the collection, I end up with an OUT OF MEMORY exception. I'm pretty certain it's because it's trying to materialize 100 million objects at once. Is there a way I can batch the collection with no more than, say, 500K objects, do what I need to do, then come back and process another 500K, and so on? I essentially need to READ 500K records from SQL, write those to a file, read another 500K, write to another file.
public List<AxDlsd> GetDistinctDlsdObjects(AxApp axApp, OperationType operationType)
{
    if (operationType == OperationType.Assessment)
    {
        string query = $"SELECT DISTINCT(clipid) from {axApp.dlname}";
        using (var connection = _dbConnectionFactory.GetAxDbConnection())
        {
            //SqlMapper.Settings.CommandTimeout = 0;
            List<AxDlsd> dlsdrecord = new List<AxDlsd>();
            return connection.Query<AxDlsd>(query, commandTimeout: 0, buffered: false).ToList();
        }
    }
You can run SELECT COUNT(DISTINCT clipid) FROM {axApp.dlname} to get the total and then use that to page:
int pageSize = 500000;
for (var page = 0; page < (total / pageSize) + 1; page++)
{
    string query = $"SELECT DISTINCT(clipid) from {axApp.dlname} ORDER BY clipid OFFSET {page * pageSize} FETCH NEXT {pageSize} ROWS ONLY";
    // ...
}
This lets you go through 500k rows at a time, or whatever your page size is. OFFSET/FETCH requires SQL Server 2012 or later; I'm not sure which version you are using.
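Another option, sketched here under the assumption that the rows can be streamed, is to keep buffered: false but drop the ToList() call (which materializes everything again), and group the streamed rows into fixed-size batches on the client. The Batching class and its Batch method are hypothetical names for this sketch, not part of Dapper.

```csharp
using System;
using System.Collections.Generic;

public static class Batching
{
    // Lazily groups a sequence into lists of at most batchSize items.
    // Each batch is handed to the caller before the next one is read,
    // so this method never holds more than one batch at a time.
    public static IEnumerable<List<T>> Batch<T>(IEnumerable<T> source, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException("batchSize");
        var batch = new List<T>();
        foreach (T item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(); // start a fresh list for the next batch
            }
        }
        if (batch.Count > 0) yield return batch; // final partial batch
    }
}
```

In the method from the question this would be used roughly as foreach (var batch in Batching.Batch(connection.Query<AxDlsd>(query, commandTimeout: 0, buffered: false), 500000)) { /* write batch to a file */ }, with the connection kept open for the whole loop because the rows are read as the loop advances.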

Taking rows in chunks from datatable and inserting in database

I have around 25k records in a DataTable. I already have an update query written by a previous developer, which I can't change. What I am trying to do is as follows:
Take 1000 records at a time from the DataTable; the total number of records can vary from 1 to 25k.
Take the update query, which is a string, replace its IN('values here') clause with these 1000 records and fire the query against the database.
Now, I know there are more efficient ways to do this, like a bulk insert using array binding, but I can't change the present coding pattern due to restrictions.
What I have tried to do:
if (dt.Rows.Count > 0)
{
    foreach (DataRow dr in dt.Rows)
    {
        reviewitemsend = reviewitemsend + dr["ItemID"].ToString() + ',';
        // If record count is 1000, execute against database.
    }
}
The above approach is taking me nowhere and I am stuck. Another, better approach I am thinking of is below:
int TotalRecords = dt.Rows.Count;
if (TotalRecords < 1000 && TotalRecords > 0)
{
    // Update the existing query with these records by placing them in the IN clause, then execute
}
else
{
    // Integer division drops the remainder, so also check the modulus:
    // if it is non-zero, one extra iteration is needed for the remaining records.
    int intLoopCounter = TotalRecords / 1000;
    if (TotalRecords % 1000 != 0) intLoopCounter++;
    for (int i = 0; i < intLoopCounter; i++)
    {
        // Take 1000 records at a time (the last iteration may have fewer) and execute against the database
    }
}
Also, note that the update query is below:
string UpdateStatement = @" UPDATE Table
SET column1=<STATUS>,
column2= '<NOTES>',
changed_by = '<CHANGEDBY>',
status= NULL
WHERE ID IN (<IDS>)";
In the above update query, IDS is already replaced with all 25k record IDs, which is how it will be shown to the end user. Only internally do I have to execute it as separate chunks, so within the IN() clause I need to insert 1k records at a time.
You can split your DataTable using this LINQ method:
private static List<List<DataRow>> SplitDataTable(DataTable table, int pageSize)
{
    return
        table.AsEnumerable()
            .Select((row, index) => new { Row = row, Index = index })
            .GroupBy(x => x.Index / pageSize)
            .Select(x => x.Select(v => v.Row).ToList())
            .ToList();
}
Then run the database query on each chunk:
foreach (List<DataRow> chunk in SplitDataTable(dt, 1000))
{
    foreach (DataRow row in chunk)
    {
        // prepare data from row
    }
    // execute against database
}
Tip: you can modify the split query to prepare your data directly inside it (by replacing the x.Select(v => v.Row) part), instead of looping twice over that huge DataTable.
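As a rough illustration of that tip, the sketch below prepares the comma-separated ID list inside the split query and returns one statement per chunk. The ItemID column and the <IDS> placeholder come from the question; the ChunkedUpdate class and BuildStatements method are hypothetical names. It keeps the question's string-replacement pattern, which is only reasonable when the IDs are known to be numeric.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class ChunkedUpdate
{
    // Returns one SQL statement per chunk, replacing the <IDS> placeholder
    // with a comma-separated list of at most chunkSize ItemID values.
    public static List<string> BuildStatements(DataTable dt, string template, int chunkSize)
    {
        return dt.Rows.Cast<DataRow>()
            .Select((row, index) => new { Id = row["ItemID"].ToString(), Index = index })
            .GroupBy(x => x.Index / chunkSize)          // rows 0..chunkSize-1 form group 0, and so on
            .Select(g => template.Replace("<IDS>", string.Join(",", g.Select(x => x.Id))))
            .ToList();
    }
}
```

Each returned string can then be executed in turn, for example foreach (string sql in ChunkedUpdate.BuildStatements(dt, UpdateStatement, 1000)) { /* execute sql */ }, assuming UpdateStatement still contains the <IDS> placeholder at that point.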

SQL comparison/synchronization speed

I have two tables, let's just call them db1 and db2. db2 contains all the records db1 has, but db1 doesn't contain all the records of db2 (they both have the same columns). Every day I have to check the modifications in db1 and apply the same to db2.
Currently my tool "exports" both tables into DataTables, performs the comparison and updates/imports the records into db2:
SELECT * FROM db1 -> db1_table
SELECT * FROM db2 -> db2_table
for (int i = 0; i < db1_table.Rows.Count; i++)
{
    for (int j = 0; j < db2_table.Rows.Count; j++)
    {
        // if db1_table.Rows[i] != db2_table.Rows[j] -> UPDATE db2 SET etc.
        // if db1_table.Rows[i] doesn't exist in db2 -> INSERT INTO db2 etc.
    }
}
This version becomes quite slow after a while. I'm talking about tens of thousands of records.
The other version was my initial idea, but I found it slow. I pull the whole of db1, loop through all of its records and execute an SQL query each time:
SELECT * FROM db1 -> db1_table
for (int i = 0; i < db1_table.Rows.Count; i++)
{
    // SELECT * FROM db2 WHERE attributes LIKE db1_table.Rows[i]
    // do the comparison here and execute the UPDATE/INSERT commands if necessary
}
Which is the faster (better) way? Are there any other options I might have?
Side note: you really shouldn't store duplicate data in two tables with the same structure in the first place...
Side note: you should be doing this update in SQL.
To answer your actual question: what you're experiencing is O(N^2) algorithmic complexity, because every row of one table is compared against every row of the other. It can be reduced to around O(N) if you build a hashtable (dictionary) from one of the tables and iterate only over the other one. When you look for a match, you look it up in the hashtable instead of iterating, which is around O(1) instead of O(N). You just need a good key value to use for hashing.
Something like this:
// The DataRow indexer already returns the value as object (there is no .Value property).
var dict = db2_table.Rows.Cast<DataRow>().ToDictionary(row2 => row2["keycolumn"]); // this is the hashing, make sure no duplicate keys exist!
foreach (DataRow row1 in db1_table.Rows) {
    DataRow row2;
    if (dict.TryGetValue(row1["keycolumn"], out row2)) {
        // row1 and row2 match by the key column, do something with them
        dict.Remove(row2["keycolumn"]);
    }
    // else no match, row1 must be a new row
}
// now dict contains the keys from db2 which have no match in db1, they must have been deleted
There's another option that's O(n) if you have a unique ID that you can order on and compare: Order both tables by the ID and walk them both at once, generating lists of pending changes. After that you can apply the pending changes. The reason for generating lists of changes is so that you can batch commands together at the end of the change detection and benefit from things like bulk inserts, CTEs or temp tables to join on for deletes, and batched command groups for updates -- all of which reduce one of the biggest sources of latency in this kind of operation: DB round trips.
The main loop looks like the following:
// Assuming both tables are sorted by ID and that IDs are long. Change as required.
var idsToAppend = new List<long>();
var idsToUpdate = new List<long>();
var idsToDelete = new List<long>();
int i = 0;
int j = 0;
// Keep going until BOTH tables are exhausted, so leftover rows on either side are handled
while (i < db1_table.Rows.Count || j < db2_table.Rows.Count) {
    if (i == db1_table.Rows.Count) {
        // There are extra rows in the destination that have been removed from the source
        idsToDelete.Add((long)db2_table.Rows[j]["ID"]);
        j++;
        continue;
    }
    if (j == db2_table.Rows.Count) {
        // There are extra rows in the source that need to be added to the destination
        idsToAppend.Add((long)db1_table.Rows[i]["ID"]);
        i++;
        continue;
    }
    long db1_id = (long)db1_table.Rows[i]["ID"];
    long db2_id = (long)db2_table.Rows[j]["ID"];
    if (db1_id == db2_id) {
        // On the same ID in both datasets
        // == on DataRow compares references, so compare the column values instead (needs System.Linq).
        // Only do this if db1 may change and the changes must be propagated to db2.
        if (!db1_table.Rows[i].ItemArray.SequenceEqual(db2_table.Rows[j].ItemArray)) {
            idsToUpdate.Add(db1_id);
        }
        i++;
        j++;
    } else if (db1_id > db2_id) {
        // row in db1 was removed, remove row in db2
        idsToDelete.Add(db2_id);
        j++;
    } else {
        // implicit: db1_id < db2_id
        // implicit: row in db1 doesn't exist in db2, needs to be added
        idsToAppend.Add(db1_id);
        i++;
    }
}
// Walk idsToAppend, idsToUpdate, and idsToDelete applying changes

How to implement SQL Server paging using C# and entity framework?

I have to update every row in a SQL Server table with about 150,000 records using Entity Framework. To reduce the number of hits the server takes, I would like to do this in separate batches of 1000 rows. I need Entity Framework to:
Select the first 1000 rows from the DB.
Update those rows.
Call the SaveChanges() method.
Get the next 1000 rows.
Repeat.
What's the best way to achieve this?
I'm using entity framework 4 and SQL Server 2012.
Use LINQ Skip and Take:
// skipCount: how many rows to skip (zero at the beginning)
// pageSize: how many rows to take (your paging size)
return query.Skip(skipCount).Take(pageSize).ToList();
If you want to do it within a loop you can do something like this:
int pagingIncrement = 1000;
for (int i = 0; i <= 150000; i = i + pagingIncrement)
{
    var query = GetOrderedQuery(); // placeholder for your actual LINQ query; it needs an OrderBy for Skip to work
    var results = query.Skip(i).Take(pagingIncrement);
    UpdatePartialResults(results);
}
Note: It is important that while updating those rows you don't change the columns used by the ORDER BY in your actual LINQ query; otherwise you could end up updating the same rows again and again (because of the reordering).
Another idea is to extend the IEnumerable iterator with the ideas above, such as Skip(counter).Take(pagingSize), and yield each result (so it can be processed more or less asynchronously).
Something like this should work:
int skip = 0;
int take = 1000;
for (int i = 0; i < 150; i++)
{
    var rows = (from x in Context.Table
                select x).OrderBy(x => x.id).Skip(skip).Take(take).ToList();
    // do some update stuff with rows
    skip += take;
}
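The iterator idea mentioned above (Skip and Take combined with yield) could look roughly like the sketch below. It is written against IOrderedQueryable so that it runs on any ordered LINQ source; the Paging class and InPages method are hypothetical names, and with Entity Framework the same caveat about not changing the ordering column during the update applies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Paging
{
    // Yields successive pages of an already ordered query
    // and stops as soon as a page comes back empty.
    public static IEnumerable<List<T>> InPages<T>(IOrderedQueryable<T> orderedQuery, int pageSize)
    {
        for (int skip = 0; ; skip += pageSize)
        {
            List<T> page = orderedQuery.Skip(skip).Take(pageSize).ToList();
            if (page.Count == 0) yield break;
            yield return page;
        }
    }
}
```

A caller would then write something like foreach (var rows in Paging.InPages(Context.Table.OrderBy(x => x.id), 1000)) { /* update rows */ Context.SaveChanges(); }, which also removes the need to hard-code the number of iterations.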
