I have around 25k records in datatable. I already have update query written by previous developer which I can't change. What I am trying to do is as follows:
Take 1000 records at a time from datatable, records can vary from 1 to 25k.
Update query which is in string, replace IN('values here') clause of that with these 1000 records and fire query against database.
Now, I know there are effecient ways to do it, like bulk insert by use of array binding , but I can't change present coding pattern due to restrictions.
What I have tried to do:
if (dt.Rows.Count>0)
{
foreach (DataRow dr in dt.Rows)
{
reviewitemsend =reviewitemsend + dr["ItemID"].ToString()+ ',';
//If record count is 1000 , execute against database.
}
}
Now above approach is taking me nowwhere and am like struck. So another better aproach which I am thinking is below :
int TotalRecords = dt.rows.count;
If (TotalRecords <1000 && TotalRecords >0 )
//Update existing query with this records by placing them in IN cluse and execute
else
{
intLoopCounter = TotalRecords/1000; //Manage for extra records, as counter will be whole number, so i will check modulus division also, if that is 0, means no need for extra counter, if that is non zero, intLoopCounter increment by 1
for(int i= 0;i < intLoopCounter; i++)
{
//Take thousand records at a time, unless last counter has less than 1000 records and execute against database
}
}
Also, note update query is below :
string UpdateStatement = #" UPDATE Table
SET column1=<STATUS>,
column2= '<NOTES>',
changed_by = '<CHANGEDBY>',
status= NULL,
WHERE ID IN (<IDS>)";
In above update query, IDS are already replaced with all 25K record ID's, which will be shown to end user like that, internally only I have to execute it as separate chunks, So within IN() cluase I need to insert 1k records at a time
You can split your Datatable using this linq method:
private static List<List<DataRow>> SplitDataTable(DataTable table, int pageSize)
{
return
table.AsEnumerable()
.Select((row, index) => new { Row = row, Index = index, })
.GroupBy(x => x.Index / pageSize)
.Select(x => x.Select(v => v.Row).ToList())
.ToList();
}
Then run the database query on each chunk:
foreach(List<DataRow> chuck in SplitDataTable(dt, 1000))
{
foreach(DataRow row in chuck)
{
// prepare data from row
}
// execute against database
}
Tip: you can modify the split query to prepare your data directly inside of it (by replacing the x.Select(v => v.Row) part, instead of looping twice on that huge DataTable.
Related
I have a main datatable with a variable number of rows (upwards of hundreds at a time) that I need to split into separate datatables for performance reasons. I was able to use the below code to successfully break the original table up into multiple tables in an array, with 20 rows in each.
DataTable[] splittables = tbl.AsEnumerable()
.Select((row, index) => new { row, index })
.GroupBy(x => x.index / 20) // integer division, the fractional part is truncated
.Select(g => g.Select(x => x.row).CopyToDataTable())
.ToArray();
However, being fairly new to C#, I wasn't able to figure out how to use the broken-up tables from that split array. I need to be able to loop through each of the split table chunks and do the same block of work on each subtable, i.e., do work on split table 1 (first 20 rows), then do the same work on split table 2 (next 20 rows), then do the same work on split table 3 (next 20 rows), etc.
Any help would be greatly appreciated.
You now have an array of DataTable so at its simplest you can simply iterate over the array executing some method
foreach(var dataTable in splittables)
{
DoWork(dataTable)
}
private void DoWork(DataTable table)
{
// do your work
}
I have a datatable that I am trying to make an update on.
my datatable is the data source of a data gridview (Forms application)
I want to update all rows that are part of a textbox
the textbox contains a comma separated values such as
A1,A11,B4,B38,C44
I have this code but stuck on how to make it working
DataTable dt = new DataTable();
dt = (DataTable)grd1.DataSource;
DataRow[] dr = dt.Select("'," + TextBox1.Text + ",' LIKE '%,Code,%'");
foreach (DataRow row in dr)
{
row["Price"] = 1000;
}
The problem is in this code
"'," + TextBox1.Text + ",' LIKE '%,Code,%'"
it does not retuen any rows so I think I did not write it the right way.
How to fix my select line?
Note : I added a comma before and after so I do not get "T37" when I am looking for "T3"
Your question wasn't easy to understand for me, but you seem to be saying that you will type a list of values into the textbox and these values are to be looked up in the [Code] column of the datatable. I'm not clear on whether the Code column itself is a single value or a comma separated list of codes, so I'll answer for both. Assuming the Code column is a CSV, and you want that any row where any one of the values in Code is one of these values in the textbox, shall have its price updated to 1000:
So for a textbox of "A1,B1" and a datarows like:
Code Price
A1,C3 200
B4,C7 400
The 200 row shall be updated and the 490 row shall not
I'd use LINQ for this rather than datatable select
var codes = textbox.Split(',');
var rows = dt.AsEnumerable().Where(r => codes.Any(c => (r["Code"] as string).Split(',').Contains(c)));
foreach(var r in rows)
r["Price") = 1000;
If you're doing this a lot I wouldn't have the codes in the row as a CSV string; a row field is allowed to be an array of strings - storing the codes as an array in the row will save having to split them every time you want to query them
If I've got this wrong and the row contains just a single Code value, the logic is the same, it just doesn't need the split(though the code above would work, it's not optimal):
var rows = dt.AsEnumerable().Where(r => codes.Any(c => (r["Code"] as string) == c));
And actually if you're going to be doing this a lot, I would index the datatable:
//if it's a csv in the datatable
var index = dt.AsEnumerable()
.SelectMany(r => r["Code"].ToString().Split(','), (row, code) => new { R=row, C=code})
.ToLookup(o => o.C, o => o.R);
This will give something like a dictionary where a code maps to a list of rows where the code appears. For a row set like
Code Price
A1,C3 200
B4,C3 400
You get a "dictionary" like:
A1: { "A1,C3", 200 }
C3: { "A1,C3", 200 },{ "B4,C3", 400 }
B4: { "B4,C3", 400 }
so you could:
foreach(var c in codesTextbox.Split)
foreach(var row in index["c"])
row["Price"] = 1000;
If the Code column doesn't contain a csv, doing a selectmany should still be fine, but to optimize it:
var index = dt.AsEnumerable().ToLookup(r => (string)r["Code"]);
I have code in my C# console app that is querying a LARGE dataset in SQL, and adding it to an IEnumerable collection that I use to iterate through later in the app. On a SQL table that returns less than 100K rows, it works great, but I have to use this to iterate through 100 Million records, After the SQL query runs, and Dapper tries to fill the collection, I end up with an OUT OF MEMORY exception error. I'm pretty certain it's because it's trying to write 100 Million objects at a time. Is there a way I can batch a collection with no more than say 500K objects, do what I need to do then come back and process another 500K and so on? I essentially need to READ from SQL 500K records, then write those to a file, Read another 500K , write to another file.
public List<AxDlsd> GetDistinctDlsdObjects(AxApp axApp, OperationType operationType)
{
if (operationType == OperationType.Assessment)
{
string query = $"SELECT DISTINCT(clipid) from {axApp.dlname}";
using (var connection = _dbConnectionFactory.GetAxDbConnection())
{
//SqlMapper.Settings.CommandTimeout = 0;
List<AxDlsd> dlsdrecord = new List<AxDlsd>();
return connection.Query<AxDlsd>(query, commandTimeout: 0, buffered: false ).ToList();
}
}
You can do a SELECT COUNT(DISTINCT clipid) from {axApp.dlname} to get the total and then use that to page
int pageSize = 500000;
for(var page = 0; page < (total / pageSize) + 1; page++)
{
string query = $"SELECT DISTINCT(clipid) from {axApp.dlname} ORDER BY clipid OFFSET {page * pageSize} FETCH NEXT {pageSize} ROWS ONLY";
///...
}
This will allow you to go through 500k rows at a time or whatever you page size is. FETCH/OFFSET does require SQL Server 2012. I'm not sure what SQL you are using.
I have two datatables let's just call them db1 and db2. db2 contains all the records db1 has but db1 doesn't contain all the records of db2 (they both have the same columns). I have to check the modifications every day in db1 and apply the same for db2.
Currently my tool "exports" both tables into DataTables, performs the conversion and updates/imports the records into db2:
SELECT * FROM db1 -> db1_table
SELECT * FROM db2 -> db2_table
for (int i = 0; i < db1_table.Rows.Count; i++)
{
for (int j = 0; j < db2_table.Rows.Count; j++)
{
//if db1_table.Rows[i] != db2_table.Rows[j] -> UPDATE db2 SET etc.
//if db1_table.Rows[i] doesn't exist in db2 -> INSERT INTO db2 etc.
}
}
This version becomes quite slow after a while. I'm talking about tens of thousands of records.
The other was my initial idea but I found it slow. I pull the whole db1, loop through all of its records and execute an sql query each time:
SELECT * FROM db1 -> db1_table
for (int i = 0; i < db1_table.Rows.Count; i++)
{
//SELECT * FROM db2 WHERE "attributes LIKE db1_table.Rows[i]
//do the comparsion here and execute the UPDATE/INSERT commands if necessary
}
Which is the faster(better) way? Are there any other option I might have?
Side note: you really shouldn't store duplicate data in two tables with the same structure in the first place...
Side note: you should be doing this update in SQL.
To answer your actual question. What you're experiencing is an O(N^2) algorithmic complexity. It can be reduced to around O(N) if you build a hashtable (dictionary) of one of the tables, and you only iterate on the other one. When you look for a match, then you look in the hashtable instead of iteration, that's around O(1) instead of O(N). You just need a good key value that you use for hashing.
Something like this:
var dict = db2_table.Rows.Cast<DataRow>().ToDictionary(row2 => row2["keycolumn"].Value); // this is the hashing, make sure no duplicate keys exist!
foreach (DataRow row1 in db1_table.Rows) {
DataRow row2;
if (dict.TryGetValue(row1["keycolumn"].Value, out row2)) {
// row1 and row2 match by the key column, do something with them
dict.Remove(row2["keycolumn"].Value);
}
// else no match, row1 must be a new row
}
// now dict contains the keys from db2 which have no match in db1, they must have been deleted
There's another option that's O(n) if you have a unique ID that you can order on and compare: Order both tables by the ID and walk them both at once, generating lists of pending changes. After that you can apply the pending changes. The reason for generating lists of changes is so that you can batch commands together at the end of the change detection and benefit from things like bulk inserts, CTEs or temp tables to join on for deletes, and batched command groups for updates -- all of which reduce one of the biggest sources of latency in this kind of operation: DB round trips.
The main loop looks like the following:
// Assuming that IDs are long. Change as required.
long db1_id;
long db2_id;
var idsToAppend = new List<long>();
var idsToUpdate = new List<long>();
var idsToDelete = new List<long>();
int i = 0;
int j = 0;
while (i < db1_table.Rows.Count && j < db2_table.Rows.Count) {
db1_id = db1_table.Rows[i]["ID"];
db2_id = db2_table.Rows[j]["ID"];
if (i == db1_table.Rows.Count && j < db2_table.Rows.Count) {
// There's extra rows in the destination that have been removed from the source
idsToDelete.Add(db1_id);
j++;
} else if (j < db1_table.Rows.Count && j == db2_table.Rows.Count) {
// There's extra rows in the source that need added to the destination
idsToAppend.Add(db1_id);
i++;
} else if (db1_id == db2_id) {
// On the same ID in both datasets
if !(db1_table.Rows[i] == db2_table.Rows[j]) {
// I know == won't work -- only do this if db1 may change and the changes must be propagated to db2
idsToUpdate.Add(db1_id);
}
i++;
j++;
} else if (db1_id > db2_id) {
// row in db1 was removed, remove row in db2
idsToDelete.Add(db1_id);
j++;
} else {
// implicit: db1_id < db2_id
// implicit: row in db1 doesn't exist in db2, needs added
idsToAppend(db1_id);
i++;
}
}
// Walk idsToAppend, idsToUpdate, and idsToDelete applying changes
I have to update every row in a Sql Server table with about 150,000 records using entity framework. To reduce the amount of hits the server takes, I would like to do this in separate batches of 1000 rows. I need entity framework to:
Select the first 1000 rows from the DB.
Update those rows.
Call SaveChanges() method.
Get next 1000 rows.
Repeat.
Whats the best way to achieve this?
I'm using entity framework 4 and SQL Server 2012.
Use LINQ Skip & Take:
return query.Skip(HOW MUCH TO SKIP -AT THE BEGINNING WILL BE ZERO-)
.Take(HOW MUCH TO TAKE -THE NUMBER OF YOUR PAGING SIZE-).ToList();
If you want to do it within a loop you can do something like this:
int pagingIncrement = 1000;
for (int i = 0; i <= 150 000; i=i+pagingIncrement)
{
var query = your actual LINQ query.
var results = query.Skip(i).Take(pagingIncrement);
UpdatePartialResults(results);
}
Note: It is important that while updating those rows you don't update the criteria for the ORDER BY within your actual LINQ query, otherwise you could be end up updating the same results again and again (because of the reordering).
Other idea will be to extend the IEnumerable iterator with some of the previously given ideas such as a Skip(counter).Take(pagingSize and yield result (to be processing kinda asynchronously).
something like this should work:
int skip =0;
int take = 1000;
for (int i = 0; i < 150; i++)
{
var rows = (from x in Context.Table
select x).OrderBy(x => x.id).Skip(skip).Take(take).ToList();
//do some update stuff with rows
skip += 1000;
}