Performance of setting DataRow values in a large DataTable - c#

I have a large DataTable - around 15000 rows and 100 columns - and I need to set the values for some of the columns in every row.
// Creating the DataTable
DataTable dt = new DataTable();
for (int i = 0; i < COLS_NUM; i++)
{
    dt.Columns.Add("COL" + i);
}
for (int i = 0; i < ROWS_NUM; i++)
{
    dt.Rows.Add(dt.NewRow());
}

// Setting several values in every row
Stopwatch sw2 = new Stopwatch();
sw2.Start();
foreach (DataRow row in dt.Rows)
{
    for (int j = 0; j < 15; j++)
    {
        row["Col" + j] = 5;
    }
}
sw2.Stop();
The measured time above is about 4.5 seconds. Is there any simple way to improve this?

Before you populate the data, call the BeginLoadData() method on the DataTable. When you have finished loading the data, call EndLoadData(). This turns off all notifications, index maintenance, and constraints, which will improve performance.
As an alternative, call BeginEdit() before updating each row, and EndEdit() when the editing for that row is complete.
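For example, a minimal sketch wrapping the question's update loop (same dt as above):
dt.BeginLoadData(); // suspend notifications, index maintenance and constraint checking
foreach (DataRow row in dt.Rows)
{
    for (int j = 0; j < 15; j++)
    {
        row["COL" + j] = 5;
    }
}
dt.EndLoadData(); // turn the checks back on and re-apply the constraints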
Here is a link with more information on improving DataSet performance:
http://www.softwire.com/blog/2011/08/04/dataset-performance-in-net-web-applications/

One improvement that I can think of is editing columns by their indices, rather than their names.
foreach (DataRow row in dt.Rows)
{
    for (int j = 0; j < 15; j++)
    {
        row[j] = 5;
    }
}
In an empirical test, your method runs in about 1500 milliseconds on my computer, while this index-based version runs in about 1100 milliseconds.
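A further tweak (my own suggestion, not benchmarked here) is to cache the DataColumn references once, so that neither a name nor an ordinal lookup happens inside the inner loop:
DataColumn[] cols = new DataColumn[15];
for (int j = 0; j < 15; j++)
{
    cols[j] = dt.Columns[j];
}
foreach (DataRow row in dt.Rows)
{
    for (int j = 0; j < 15; j++)
    {
        row[cols[j]] = 5; // DataRow has an indexer that accepts a DataColumn directly
    }
}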
Also, see Marc's answer in this post:
Set value for all rows in a datatable without for loop

This depends on your business logic, which is not clear in your question. However, if you want to set the values for some of the columns in every row, try the following:
Create a separate temp column (or columns); you might create it in the same loop that creates the original data table.
Fill the new values into this column.
Delete the old column and insert the new one in its place.
This solution makes sense if you can compute the new values up front, if you have the same value for all rows (as in your example), or if the values repeat in some pattern; in those cases adding a new, pre-loaded column is much faster than looping over all the rows.
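A rough sketch of that idea for the constant-value case in your example (it relies on a DataColumn's DefaultValue being applied to rows that already exist when the column is added; treat that as an assumption and verify it on your framework version, and note that "COL5" is just a placeholder name):
// Remember where the old column sat, then drop it
int oldOrdinal = dt.Columns["COL5"].Ordinal;
dt.Columns.Remove("COL5");
// Re-create it with the constant value pre-loaded via DefaultValue
var newCol = new DataColumn("COL5", typeof(string)) { DefaultValue = "5" };
dt.Columns.Add(newCol);        // existing rows should pick up the DefaultValue
newCol.SetOrdinal(oldOrdinal); // put it back in the old column's position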

Related

Removing certain row from datatable C#

I am trying to remove rows from a DataTable in which a certain cell is empty. I have tried using a for loop, but to no avail.
for (int i = dtbl.Rows.Count - 1; i >= 0; i--)
{
    DataRow dr = dtbl.Rows[i];
    if (dr["id"] == null)
        dtbl.Rows.Remove(dr);
}
If the cell in the ID column is empty, then that row should be deleted.
Any help is appreciated. Thanks in advance.
Change your test to this one.
for (int i = dtbl.Rows.Count - 1; i >= 0; i--)
{
    DataRow dr = dtbl.Rows[i];
    if (dr.IsNull("id"))
        dtbl.Rows.Remove(dr);
}
See docs: DataRow.IsNull
or you can use a check against the special value DBNull.Value
if (dr["id"] == DBNull.Value)
Another approach is this one
for (int i = dtbl.Rows.Count - 1; i >= 0; i--)
{
    DataRow dr = dtbl.Rows[i];
    if (dr.IsNull("id"))
        dr.Delete();
}
dtbl.AcceptChanges();
This last one just marks the row for deletion; the row remains in the table until you call AcceptChanges. (So this approach is also suitable for a foreach loop.)
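For example, a foreach version of the same approach (same dtbl and "id" column as above):
foreach (DataRow dr in dtbl.Rows)
{
    if (dr.IsNull("id"))
        dr.Delete(); // only marks the row, so the enumeration is not disturbed
}
dtbl.AcceptChanges(); // physically removes the marked rows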
Calling DataRow.Delete is the preferred way to work if you plan to update the real database table at a later time (for example, when you want your user to delete many rows from a bound grid and then make a single update to the database only if the user clicks a Save button).
You could use LINQ to select the rows where the id is not null:
dtbl = dtbl.AsEnumerable()
           .Where(r => r.Field<string>("id") != null)
           .CopyToDataTable();
If the column holds a value type, you might need to specify the nullable type and compare with row.Field<int?>("id").HasValue.

Error when trying to duplicate rows in DataTable in c#

I have an existing DataTable called _longDataTable containing data. Now I want to duplicate each row, and in each duplicate set only the value in the SheetCode column, taking it from a different DataTable called values (see the code below). For example, if the values table contains 1, 2 and 3, then I want each row of _longDataTable to be duplicated three times, with the SheetCode column in the duplicated rows holding the values 1, 2 and 3 respectively. My code now looks like this:
foreach (DataRow sheets in _longDataTable.Rows)
{
    for (int k = 0; k < number_of_sheets; k++)
    {
        var newRowSheets = _longDataTable.NewRow();
        newRowSheets.ItemArray = sheets.ItemArray;
        newRowSheets["SheetCode"] = values.Rows[k]["Sheet Code"];
        //add edited row to long datatable
        _longDataTable.Rows.Add(newRowSheets);
    }
}
However, I get the following error:
Collection was modified; enumeration operation may not execute.
Does anyone know where this error comes from and how to solve my problem?
You get the enumeration error because you are iterating over a collection that is being changed inside the loop (new rows are added to it).
As you said in the comment, you get the out-of-memory exception for the same reason: you add rows to _longDataTable while iterating over it, so the iteration never reaches the end.
I assume this can help you:
// assume _longDataTable has two columns: Column1 and SheetCode
var _longDataTable = new DataTable();
var duplicatedData = new DataTable();
duplicatedData.Columns.Add("Column1");
duplicatedData.Columns.Add("SheetCode");

foreach (DataRow sheets in _longDataTable.Rows)
{
    for (int k = 0; k < number_of_sheets; k++)
    {
        var newRowSheets = duplicatedData.NewRow();
        newRowSheets.ItemArray = sheets.ItemArray;
        newRowSheets["SheetCode"] = values.Rows[k]["Sheet Code"];
        newRowSheets["Column1"] = "anything";
        // add edited row to the temp table
        duplicatedData.Rows.Add(newRowSheets);
    }
}
_longDataTable.Merge(duplicatedData);
Do not modify _longDataTable while iterating over it; add the rows to a temp table with the same schema, and merge the two data tables after the iteration.
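As a side note (my addition, not part of the answer above), Clone() saves you from listing the temp table's columns by hand:
// Clone() copies the column definitions of _longDataTable but none of its rows,
// so the temp table always has exactly the same schema as the original.
var duplicatedData = _longDataTable.Clone();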

Creating multiple data tables programmatically and structuring classes

I'm working on a project where I will end up opening multiple text files.
Each of these text files will then be split into tables of data.
My initial thoughts were that I wanted to keep as much code away from the UI as possible and that I should create a class DataSetClass with a public method OpenFile.
The DataSet would be created as a global inside this class, and all my other functions for sorting the data or getting information from that particular DataSet would also live inside that class. However, since making the class, I believe I am unable to actually get at that data now; instead, I should create a new DataSet and then have a class that takes the DataSet and modifies it by adding the tables. But I am worried that this would be feature envy.
In addition to this, I am struggling with how to add multiple DataTables to a DataSet.
In my code below I add the DataTable to a DataSet after I have put all the relevant data into it. However, I end up just adding the exact same DataTable into the DataSet each time and only ever end up with one table.
for (int data_table = 1; data_table < header_location.Count; data_table++)
{
    DataTable dt = new DataTable();
    SaveHeaderInfo(dt, header_location[data_table], split_data_list);
    SaveDataUntilStop(dt, start_location[data_table], split_data_list);
    dataSet.Tables.Add(dt);
    //dt.Reset();
}
When using the dt.Reset line, it simply wipes all the data, even after the table has been added to the DataSet.
I presume this is because the Add function just inserts a reference to the table into the DataSet. So how would I create an entirely new DataTable each time?
Thank you
EDIT: Adding Extra Code:
private void SaveDataUntilStop(DataTable dt, int start, List<string[]> split_data_list)
{
    string[] row = new string[400];
    for (int i = start + 1; i < split_data_list.Count; i++)
    {
        if (split_data_list[i][0] != "STOP")
        {
            for (int j = 0; j < split_data_list[i].Length; j++)
                row[j] = split_data_list[i][j];
            CheckColumnCountAndAdd(dt, row);
        }
        else
            return;
    }
}

private void SaveHeaderInfo(DataTable dt, int location, List<string[]> list)
{
    CheckColumnCountAndAdd(dt, list[location + 1]);
    CheckColumnCountAndAdd(dt, list[location + 2]);
}

private void CheckColumnCountAndAdd(DataTable dt, string[] str_to_add)
{
    while (str_to_add.Length > dt.Columns.Count)
    {
        AddColumn(dt);
    }
    dt.Rows.Add(str_to_add);
}

SQL comparison/synchronization speed

I have two datatables; let's just call them db1 and db2. db2 contains all the records db1 has, but db1 doesn't contain all the records of db2 (they both have the same columns). I have to check the modifications every day in db1 and apply the same changes to db2.
Currently my tool "exports" both tables into DataTables, performs the comparison and updates/imports the records into db2:
SELECT * FROM db1 -> db1_table
SELECT * FROM db2 -> db2_table

for (int i = 0; i < db1_table.Rows.Count; i++)
{
    for (int j = 0; j < db2_table.Rows.Count; j++)
    {
        //if db1_table.Rows[i] != db2_table.Rows[j] -> UPDATE db2 SET etc.
        //if db1_table.Rows[i] doesn't exist in db2 -> INSERT INTO db2 etc.
    }
}
This version becomes quite slow after a while. I'm talking about tens of thousands of records.
The other approach was my initial idea, but I found it slow: I pull the whole db1, loop through all of its records, and execute an SQL query each time:
SELECT * FROM db1 -> db1_table

for (int i = 0; i < db1_table.Rows.Count; i++)
{
    //SELECT * FROM db2 WHERE attributes LIKE db1_table.Rows[i]
    //do the comparison here and execute the UPDATE/INSERT commands if necessary
}
Which is the faster (better) way? Are there any other options I might have?
Side note: you really shouldn't store duplicate data in two tables with the same structure in the first place...
Side note: you should be doing this update in SQL.
To answer your actual question: what you're experiencing is O(N^2) algorithmic complexity. It can be reduced to around O(N) if you build a hashtable (dictionary) from one of the tables and only iterate over the other one. When you look for a match, you look it up in the hashtable instead of iterating, which is roughly O(1) instead of O(N). You just need a good key value to hash on.
Something like this:
var dict = db2_table.Rows.Cast<DataRow>().ToDictionary(row2 => row2["keycolumn"]); // this is the hashing, make sure no duplicate keys exist!
foreach (DataRow row1 in db1_table.Rows)
{
    DataRow row2;
    if (dict.TryGetValue(row1["keycolumn"], out row2))
    {
        // row1 and row2 match by the key column, do something with them
        dict.Remove(row2["keycolumn"]);
    }
    // else no match, row1 must be a new row
}
// now dict contains the keys from db2 which have no match in db1; those rows must have been deleted
There's another option that's O(n) if you have a unique ID that you can order on and compare: Order both tables by the ID and walk them both at once, generating lists of pending changes. After that you can apply the pending changes. The reason for generating lists of changes is so that you can batch commands together at the end of the change detection and benefit from things like bulk inserts, CTEs or temp tables to join on for deletes, and batched command groups for updates -- all of which reduce one of the biggest sources of latency in this kind of operation: DB round trips.
The main loop looks like the following:
// Assuming that IDs are long and both tables are ordered by ID. Change as required.
var idsToAppend = new List<long>();
var idsToUpdate = new List<long>();
var idsToDelete = new List<long>();
int i = 0;
int j = 0;
while (i < db1_table.Rows.Count || j < db2_table.Rows.Count)
{
    if (i == db1_table.Rows.Count)
    {
        // db1 is exhausted: the remaining rows in the destination were removed from the source
        idsToDelete.Add((long)db2_table.Rows[j]["ID"]);
        j++;
    }
    else if (j == db2_table.Rows.Count)
    {
        // db2 is exhausted: the remaining rows in the source need to be added to the destination
        idsToAppend.Add((long)db1_table.Rows[i]["ID"]);
        i++;
    }
    else
    {
        long db1_id = (long)db1_table.Rows[i]["ID"];
        long db2_id = (long)db2_table.Rows[j]["ID"];
        if (db1_id == db2_id)
        {
            // On the same ID in both datasets.
            // Comparing DataRows with == won't work; compare the fields that matter.
            // Only needed if db1 may change and the changes must be propagated to db2.
            if (!db1_table.Rows[i].ItemArray.SequenceEqual(db2_table.Rows[j].ItemArray))
            {
                idsToUpdate.Add(db1_id);
            }
            i++;
            j++;
        }
        else if (db1_id > db2_id)
        {
            // Row was removed from db1, remove the corresponding row from db2
            idsToDelete.Add(db2_id);
            j++;
        }
        else
        {
            // db1_id < db2_id: row in db1 doesn't exist in db2, needs to be added
            idsToAppend.Add(db1_id);
            i++;
        }
    }
}
// Walk idsToAppend, idsToUpdate, and idsToDelete applying the changes
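As a rough sketch of applying one of those lists in a single batch (assuming SQL Server, an open SqlConnection named conn, and that the destination table and key column really are called db2 and ID; adjust to your schema):
// using System.Data.SqlClient;
if (idsToDelete.Count > 0)
{
    // One round trip for all the deletes; building the IN list from longs is safe here.
    string sql = "DELETE FROM db2 WHERE ID IN (" + string.Join(",", idsToDelete) + ")";
    using (var cmd = new SqlCommand(sql, conn))
    {
        cmd.ExecuteNonQuery();
    }
}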

Split DataTable into two by rows

I have a DataTable in an ASP.NET page and I want to databind it to two ASP.NET DataLists, so I thought I would slice the DataTable's rows into two DataTables, both the same size if the count is even.
Use the Take LINQ extension method to specify how many items to use, and Skip to jump over items if needed.
var half = myList.Take(myList.Count / 2);
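Applied to a DataTable (assuming it is called dt) it could look like this; note that AsEnumerable and CopyToDataTable come from System.Data.DataSetExtensions, and CopyToDataTable throws if the input sequence is empty:
int half = dt.Rows.Count / 2; // integer division; for odd counts the second half gets the extra row
DataTable firstHalf = dt.AsEnumerable().Take(half).CopyToDataTable();
DataTable secondHalf = dt.AsEnumerable().Skip(half).CopyToDataTable();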
If you are slicing by rows, you can simply create an empty copy of your original table's structure, find a suitable half-way point, and import the rows into the copy while deleting them from the original.
Something like the following should work:
DataTable originalTable = new DataTable();
//Load the data into your original table or wherever you get your original table from
DataTable otherTable = originalTable.Clone(); //Clone copies the table structure only - no data
int rowCount = originalTable.Rows.Count;
int wayPoint = rowCount / 2; //NB integer division rounds down towards 0
for (int i = 0; i < wayPoint; i++)
{
    otherTable.ImportRow(originalTable.Rows[i]); //Imports (copies) the row from the original table to the new one
    originalTable.Rows[i].Delete(); //Marks the row for deletion
}
originalTable.AcceptChanges(); //Removes the rows we marked for deletion
