I'm looping over a DataTable with 100 to 10,000 rows, comparing each row to every other row through a double loop.
for (int i = 0; i < DT1.Rows.Count; i++)
{
for (int j = 0; j < DT1.Rows.Count; j++)
{
//some code to compare data
}
}
For 100-200 rows it's done in a few minutes, which is OK, but comparing a few thousand rows to a few thousand takes hours and doesn't finish.
What can I do to speed it up? The best idea I've come up with is to use lists of objects instead of DataTables.
Any other suggestions?
Can threads be used to do this?
Thanks.
I recently came across a similar scenario that I had to work through, though in my case I was comparing a pair of Excel files. For my trial run, after getting it working, I had 530 rows on one side and 459,000 on the other inside the nested loop, which is roughly 243 million iterations. My program was able to work through it in roughly 30 seconds. I used a foreach in this scenario:
foreach (DataRow r1 in DT1.Rows) //Loop the First Source data
{
foreach (DataRow r2 in DT2.Rows) //Loop the Second Source data
{
//Comparison code here...
}
}
Edit: In your loop, as a point of reference, you are causing three variables to be tracked on each iteration; the first and second are your counters. The third is the major performance hit: DT1.Rows.Count. By using the row count directly in the loop condition, it must be re-evaluated at every iteration, which adds unneeded time to the program. If you absolutely require the counters, then assign the row count to a variable first:
int DT1Count = DT1.Rows.Count;
for (int i = 0; i < DT1Count; i++)
{
for (int j = 0; j < DT1Count; j++)
{
//some code to compare data
}
}
This way the row count is a fixed value, which removes the extra work of evaluating the row count on every iteration.
Although you can certainly optimize your search by using hash tables, the best optimization is to let the database engine do the search for you. RDBMS engines are optimized for exactly this kind of task; no client-side optimization is likely to beat them. Your biggest disadvantage is having to pull the data from the database into your program, which is very slow. The database engine already has all the data right there, and that is a huge advantage.
For example, if you are looking for rows representing users with identical first and last name, a simple query with a self-join will get you results in seconds, not minutes, because the data never leaves the engine.
select u1.userId, u2.userId
from User u1
join User u2 on u1.FirstName = u2.FirstName and u1.LastName = u2.LastName and u1.userId < u2.userId -- avoids self-matches and reporting each pair twice
Assuming that FirstName and LastName columns are indexed, this query will find you duplicates very quickly.
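If the data has to stay on the client, the hash-table idea mentioned above replaces the O(n^2) nested loop with a single pass. A minimal sketch, assuming you are matching on FirstName and LastName columns as in the SQL example (the column names are only illustrative):
// Group rows by a composite key in one pass; any bucket with more than one row holds duplicates.
var groups = new Dictionary<string, List<DataRow>>();
foreach (DataRow row in DT1.Rows)
{
    string key = row["FirstName"] + "|" + row["LastName"]; // assumed columns
    if (!groups.TryGetValue(key, out var bucket))
    {
        bucket = new List<DataRow>();
        groups[key] = bucket;
    }
    bucket.Add(row);
}
foreach (var bucket in groups.Values)
{
    if (bucket.Count > 1)
    {
        // handle the duplicate rows here
    }
}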
If the results are sorted in some order, you can put them into an array and look up matches with a binary search instead of scanning every row (a sketch of this appears after the loop below).
for (int i = 0; i < DT1.Rows.Count; i++)
{
for (int j = i+1; j < DT1.Rows.Count; j++) //<-- starts from next row
{
//some code to compare data
}
}
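To illustrate the sorted-array idea mentioned earlier: copy the values being compared into an array, sort it once, and each lookup then costs O(log n) instead of a full scan. A rough sketch, assuming the comparison is on a single string column named "Key" (a hypothetical name):
// Build and sort the key array once - O(n log n).
string[] keys = new string[DT1.Rows.Count];
for (int i = 0; i < keys.Length; i++)
{
    keys[i] = (string)DT1.Rows[i]["Key"]; // assumed column
}
Array.Sort(keys);

// Each probe is now a binary search - O(log n) per row instead of O(n).
foreach (DataRow r2 in DT2.Rows)
{
    int index = Array.BinarySearch(keys, (string)r2["Key"]);
    if (index >= 0)
    {
        // match found - compare the full rows here
    }
}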
You could also count on the .NET internals to do a better job than manual looping by using:
DataTable.Select(filterExpression, sortExpression)
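For example, something along these lines; the filter and sort expressions are only placeholders:
// DataTable.Select filters and sorts using the table's own engine instead of a hand-written scan.
DataRow[] matches = DT1.Select("FirstName = 'John'", "LastName ASC"); // assumed columns
foreach (DataRow row in matches)
{
    // work with the matching rows here
}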
The biggest optimization to be made here is the following:
Currently, you are comparing every pair of rows twice (once as (i, j) and once as (j, i)), and you are also comparing each row with itself; for example, in the first iteration of the loops you compare the first row with the first row, because both loops start at index 0.
The simplest fix to this would be to change the inner loop to this:
for (int j = i + 1; j < DT1.Rows.Count; j++)
This will dramatically reduce the number of comparisons. Your algorithm currently needs n^2 comparisons; the proposed fix cuts that to fewer than half of them: (n^2 - n) / 2 comparisons.
Related
I have a loop which in theory should iterate 40,000 times, but it exits after just one iteration and doesn't continue with the code after the loop. Since it didn't continue after the loops at all, I figured I wasn't making a silly mistake with the for loops themselves, so maybe it's something to do with restrictions on Lists? Or maybe something about the VS debugger not working properly? (Probably not, though...)
Edit: Thanks for pointing out that the last layer was pointless. I edited the code, but the problem persists.
Edit 2: To clarify, the code does not result in an exception or break. It just stops without any notification and shows the form (since this is a Windows Forms application). It just doesn't want to continue and skips the rest of the code.
for (int i = 0; i < hiddenLayerDepth - 1; i++)
{
Connectors.Add(new List<List<List<List<Connector>>>>());
for (int j = 0; j < playfieldSize; j++)
{
Connectors[i].Add(new List<List<List<Connector>>>());
for (int k = 0; k < playfieldSize; k++)
{
Connectors[i][j].Add(new List<List<Connector>>());
for (int l = 0; l < playfieldSize; l++)
{
Connectors[i][j][k][l].Add(new Connector());
}
}
}
}
hiddenLayerDepth is set to 5 when entering the loop, and playfieldSize is set to 10. It enters the innermost loop and executes the code inside, then it just stops without increasing l.
Missing, inside the innermost loop before the Add:
Connectors[i][j][k].Add(new List<Connector>());
If you know the sizes up front you should just create an array instead.
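For example, a sketch sized from the loop bounds in the question, so no Add calls are needed at all:
// A rectangular array with the same dimensions the nested loops were building.
var connectors = new Connector[hiddenLayerDepth - 1, playfieldSize, playfieldSize, playfieldSize];
for (int i = 0; i < hiddenLayerDepth - 1; i++)
    for (int j = 0; j < playfieldSize; j++)
        for (int k = 0; k < playfieldSize; k++)
            for (int l = 0; l < playfieldSize; l++)
                connectors[i, j, k, l] = new Connector();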
Well, I tried to add a Connector where there was no list. The list that would contain the Connectors was never added.
Take a look at the program below. It's pretty self-explanatory, but I'll explain anyway :)
I have two methods, one fast and one slow. These methods do the exact same thing: they create a table with 50,000 rows and 1000 columns. I write to a variable number of columns in the table. In the code below I've picked 10 (NUM_COLS_TO_WRITE_TO).
In other words, only 10 of the 1000 columns will actually contain data. OK. The only difference between the two methods is that the fast one populates the columns and then calls dt.Rows.Add, whereas the slow one adds the row first and populates it afterwards. That's it.
The performance difference, however, is shocking (to me anyway). The fast version is almost completely unaffected by changing the number of columns we write to, whereas the slow one goes up linearly. For example, when the number of columns I write to is 20, the fast version takes 2.8 seconds, but the slow version takes over a minute.
What in the world could possibly be going on here?
I thought that maybe adding dt.BeginLoadData would make a difference, and it did to some extent, it brought the time down from 61 seconds to ~50 seconds, but that's still a huge difference.
Of course, the obvious answer is, "Well, don't do it that way." OK. Sure. But what in world is causing this? Is this expected behavior? I sure didn't expect it. :)
using System;
using System.Data;
using System.Diagnostics;

public class Program
{
private const int NUM_ROWS = 50000;
private const int NUM_COLS_TO_WRITE_TO = 10;
private const int NUM_COLS_TO_CREATE = 1000;
private static void AddRowFast() {
DataTable dt = new DataTable();
//add a table with 1000 columns
for (int i = 0; i < NUM_COLS_TO_CREATE; i++) {
dt.Columns.Add("x" + i, typeof(string));
}
for (int i = 0; i < NUM_ROWS; i++) {
var theRow = dt.NewRow();
for (int j = 0; j < NUM_COLS_TO_WRITE_TO; j++) {
theRow[j] = "whatever";
}
//add the row *after* populating it
dt.Rows.Add(theRow);
}
}
private static void AddRowSlow() {
DataTable dt = new DataTable();
//add a table with 1000 columns
for (int i = 0; i < NUM_COLS_TO_CREATE; i++) {
dt.Columns.Add("x" + i, typeof(string));
}
for (int i = 0; i < NUM_ROWS; i++) {
var theRow = dt.NewRow();
//add the row *before* populating it
dt.Rows.Add(theRow);
for (int j = 0; j < NUM_COLS_TO_WRITE_TO; j++) {
theRow[j] = "whatever";
}
}
}
static void Main(string[] args)
{
var sw = Stopwatch.StartNew();
AddRowFast();
sw.Stop();
Console.WriteLine(sw.Elapsed.TotalMilliseconds);
sw.Restart();
AddRowSlow();
sw.Stop();
Console.WriteLine(sw.Elapsed.TotalMilliseconds);
//When NUM_COLS is 5
//FAST: 2754.6782
//SLOW: 15794.1378
//When NUM_COLS is 10
//FAST: 2777.431 ms
//SLOW 32004.7203 ms
//When NUM_COLS is 20
//FAST: 2831.1733 ms
//SLOW: 61246.2243 ms
}
}
Update
Calling theRow.BeginEdit and theRow.EndEdit in the slow version makes it more or less constant (~4 seconds on my machine). If I actually had some constraints on the table, I guess this might make sense to me.
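For reference, the change described in this update amounts to wrapping the column writes in AddRowSlow in a single edit, roughly like this:
dt.Rows.Add(theRow);
theRow.BeginEdit();   // start one edit covering all the column writes
for (int j = 0; j < NUM_COLS_TO_WRITE_TO; j++)
{
    theRow[j] = "whatever";
}
theRow.EndEdit();     // versioning and constraint checks now happen once per row, not once per column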
When the row is attached to a table, much more work is done to record and track its state on every change.
For example, if you do this,
theRow.BeginEdit();
for (int j = 0; j < NUM_COLS_TO_WRITE_TO; j++)
{
theRow[j] = "whatever";
}
theRow.CancelEdit();
Then in BeginEdit(), internally it takes a copy of the contents of the row so that at any point you can roll back, and the end result of the above is an empty row again, without "whatever". This is still possible even in BeginLoadData mode. Following the path of BeginEdit when the row is attached to a DataTable, you eventually get into DataTable.NewRecord(), which shows that it is just copying each value for every column to store the original state in case a cancel is needed; not much magic here. On the other hand, if the row is not attached to a DataTable, not much happens in BeginEdit at all and it exits quickly.
EndEdit() is similarly pretty heavy (when attached), as this is where all the constraints are checked (max length, does the column allow nulls, etc.). It also fires a bunch of events, explicitly frees the storage used in case the edit was cancelled, and makes the change available for recall with DataTable.GetChanges(), which is still possible under BeginLoadData. In fact, looking at the source, all BeginLoadData seems to do is turn off constraint checking and indexing.
So that describes what BeginEdit and EndEdit do, and they behave completely differently, in terms of what is stored, depending on whether the row is attached or not. Now consider that for a single theRow[j] = "whatever" you can see, in the indexer setter for DataRow, that it calls BeginEditInternal and then EndEdit on every single call (unless it is already in an edit because you explicitly called BeginEdit earlier). That means it is copying and storing every single value for each column in the row every time you make this call. Doing that 10 times per row means that, with your 1,000-column DataTable over 50,000 rows, it is allocating 500,000,000 objects. On top of this there is all the other versioning, the checks, and the events being fired after every single change, and so, overall, it's much slower when the row is attached to a DataTable than when it is not.
I have nested for loops as follows:
// This loop cannot be parallel because results of the next
// two loops will be used for next t
for (int t= 0; t< 6000000; t++)
// Calculations in the following two loops account for 75% of all the time spent
for (int i= 0; i< 1000; i++)
{
for (int j= 0; j< 1000; j++)
{
if (Vxcal){V.x= ............. // some calculations }
if (Vycal){V.y= ............. // some calculations }
// Vbar is a two dimensional array
Vbar = V;
}
}
I changed the above code to:
// This loop cannot be parallel because results of the next
// two loops will be used for next t
for (int t= 0; t< 6000000; t++)
// Calculations in the following two loops account for 75% of all the time spent
Parallel.For(0, 1000, i =>
{
    Parallel.For(0, 1000, j =>
    {
        if (Vxcal){V.x= ............. // some calculations }
        if (Vycal){V.y= ............. // some calculations }
        // Vbar is a two dimensional array
        Vbar = V;
    });
});
When I run the code the results are not correct, and it takes hours instead of 10 minutes. My questions are:
Are these kinds of for loops suitable for Parallel.For?
These loops just do some mathematical calculations.
How can I make these parallel for loops safe?
I found the lock keyword, which can help me get a safe loop, but which part of this loop is unsafe?
I see you modified your question to put in some other numbers. So your inner loop is now executed 6,000,000 * 1,000 * 1,000, or 6,000,000,000,000 times. At 4 billion calculations per second (which your computer can't do), that's going to take 1,500 seconds, or 25 minutes. If you get perfect parallelism with Parallel.For, you can cut that down to about 6.25 minutes on a 4-core machine. If your calculations are long and complicated, there's no surprise that it takes hours to complete.
Your code is slow because it's doing a lot of work. You need a better algorithm!
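That said, if the work for each (i, j) cell really is independent, the usual pattern is to parallelize only the outer of the two inner loops and have each iteration write only to its own element of the result array, so no lock is needed. A rough sketch under that assumption (the calculation and the shape of Vbar are placeholders taken from the question):
for (int t = 0; t < 6000000; t++)
{
    // Parallelize only the outer of the two inner loops; nesting a second
    // Parallel.For usually just adds scheduling overhead.
    Parallel.For(0, 1000, i =>
    {
        for (int j = 0; j < 1000; j++)
        {
            // Compute into locals, then write only to this iteration's own slot,
            // so no two iterations ever touch the same memory.
            double vx = 0.0, vy = 0.0; // placeholders for the real calculations
            Vbar[i, j] = vx + vy;      // assumed: Vbar is the 2-D result array
        }
    });
}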
Original answer
Consider your nested loops:
for (int t= 0; t< 5000000000; t++)
// Calculations in the following two loops account for 75% of all the time spent
for (int i= 0; i< 5000000000; i++)
{
for (int j= 0; j< 5000000000; j++)
The inner loop (your calculation) is being executed 5,000,000,000*5,000,000,000*5,000,000,000 times. 125,000,000,000,000,000,000,000,000,000 is a huge number. Even if your computer could do that loop 4 billion times per second (it can't--not even close), it would take 31,250,000,000,000,000,000 seconds, or about 990 billion years to complete. Using multiple threads on a four-core machine could cut that down to only 250 billion years.
I don't know what you're trying to do, but you'll need a much better algorithm, a computer that's about 500 billion times faster than the one you have, or several hundred billion processors if you want it to finish in your lifetime.
I have a homework assignment to code a bidirectional bubble sort. Can someone please check whether my logic is correct? I don't want code, as I want to figure it out myself; I just want a logic check of how I understand it.
As I understand the bidirectional bubble sort, you implement two for loops: one starts at position 1 in the list and performs a normal bubble sort, and as the first loop reaches the end, a second one works in reverse. I just don't completely understand what the terminating conditions for each loop would be.
Would the for loop conditions be something like the following?
loop 1 - for(i = 0; i < Count -i; i++)
loop 2 - for(j = Count - i; j > i; j--)
In each loop the swap conditions would be specified.
Thanks
The "classic" bubble sort goes through the entire array on each iteration, so the loops should be
for(i = 0; i < Count - 1; i++)
and
for(j = Count - 1; j > 0; j--)
Both loops skip one index: the first loop skips the last index, while the second loop skips the initial one. This is so that your code could safely compare data[i] to data[i+1], and data[j] to data[j-1].
EDIT The "optimized" bubble sort skips the initial k elements on the k-th iteration. Since your bubble sort is bidirectional, you will be able to skip both the initial k and the final k elements, like this:
int k = 0;
do { // The outer loop
...
for(int i = k; i < Count - k - 1; i++)
...
for(int j = Count - k - 1; j > k ; j--)
...
k++;
} while (<there were swaps>);
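Filling in the placeholders, a minimal sketch might look like the following (assuming the array is an int[] called data, so Count above corresponds to data.Length, and "there were swaps" is tracked with a flag):
bool swapped;
int k = 0;
do
{
    swapped = false;
    // Forward pass: push the largest remaining element towards the right end.
    for (int i = k; i < data.Length - k - 1; i++)
    {
        if (data[i] > data[i + 1])
        {
            int temp = data[i];
            data[i] = data[i + 1];
            data[i + 1] = temp;
            swapped = true;
        }
    }
    // Backward pass: push the smallest remaining element towards the left end.
    for (int j = data.Length - k - 1; j > k; j--)
    {
        if (data[j] < data[j - 1])
        {
            int temp = data[j];
            data[j] = data[j - 1];
            data[j - 1] = temp;
            swapped = true;
        }
    }
    k++;
} while (swapped);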
A bidirectional bubble sort works like this:
Instead of passing through the list from bottom to top every time (as in a plain bubble sort), you pass from the bottom on one pass and from the top on every second pass.
The Wikipedia article does a much better job of explaining it:
http://en.wikipedia.org/wiki/Cocktail_sort
- rich
I have a List of 500 items.
I need to take the first 100 and insert them into the database, and so on.
But once the first 100 have been inserted into the DB, I don't want those records to be inserted again.
By using LINQ
You can use Skip and Take to do this.
var stuffToInsert = myList.Skip(100).Take(100);
Skip will move forward past X objects; Take will enumerate up to Y objects (fewer if there is not enough data). You can use Skip(0) for the first tranche of objects, because you don't need to skip anything yet.
int batchSize = 100;
int batchCount = (myList.Count + batchSize - 1) / batchSize; // round up so a final partial batch isn't dropped
for (int j = 0; j < batchCount; j++)
{
    InsertIntoDataBase(myList.Skip(j * batchSize).Take(batchSize));
}
Where InsertIntoDataBase() is some function you can implement to do the insert.
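A hypothetical sketch of such a method, assuming SQL Server, a table called Items with a single Name column, and an item type exposing a Name property (all of these names are placeholders):
static void InsertIntoDataBase(IEnumerable<Item> batch)
{
    // connectionString is assumed to be defined elsewhere.
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();
        foreach (var item in batch)
        {
            using (var cmd = new SqlCommand("insert into Items (Name) values (@name)", conn))
            {
                cmd.Parameters.AddWithValue("@name", item.Name);
                cmd.ExecuteNonQuery();
            }
        }
    }
}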