Take a look at the program below. It's pretty self-explanatory, but I'll explain anyway :)
I have two methods, one fast and one slow. These methods do the exact same thing: they create a table with 50,000 rows and 1000 columns. I write to a variable number of columns in the table. In the code below I've picked 10 (NUM_COLS_TO_WRITE_TO).
In other words, only 10 columns out of the 1000 will actually contain data. OK. The only difference between the two methods is that the fast one populates the columns and then calls DataTable.Rows.Add, whereas the slow one adds the row first and then populates the columns. That's it.
The performance difference, however, is shocking (to me, anyway). The fast version is almost completely unaffected by changing the number of columns we write to, whereas the slow one scales up linearly with it. For example, when the number of columns I write to is 20, the fast version takes 2.8 seconds, but the slow version takes over a minute.
What in the world could possibly be going on here?
I thought that maybe calling dt.BeginLoadData would make a difference, and it did to some extent: it brought the time down from 61 seconds to ~50 seconds, but that's still a huge difference.
Of course, the obvious answer is, "Well, don't do it that way." OK. Sure. But what in the world is causing this? Is this expected behavior? I sure didn't expect it. :)
using System;
using System.Data;
using System.Diagnostics;

public class Program
{
    private const int NUM_ROWS = 50000;
    private const int NUM_COLS_TO_WRITE_TO = 10;
    private const int NUM_COLS_TO_CREATE = 1000;

    private static void AddRowFast()
    {
        DataTable dt = new DataTable();

        //add 1000 columns to the table
        for (int i = 0; i < NUM_COLS_TO_CREATE; i++)
        {
            dt.Columns.Add("x" + i, typeof(string));
        }

        for (int i = 0; i < NUM_ROWS; i++)
        {
            var theRow = dt.NewRow();
            for (int j = 0; j < NUM_COLS_TO_WRITE_TO; j++)
            {
                theRow[j] = "whatever";
            }

            //add the row *after* populating it
            dt.Rows.Add(theRow);
        }
    }

    private static void AddRowSlow()
    {
        DataTable dt = new DataTable();

        //add 1000 columns to the table
        for (int i = 0; i < NUM_COLS_TO_CREATE; i++)
        {
            dt.Columns.Add("x" + i, typeof(string));
        }

        for (int i = 0; i < NUM_ROWS; i++)
        {
            var theRow = dt.NewRow();

            //add the row *before* populating it
            dt.Rows.Add(theRow);
            for (int j = 0; j < NUM_COLS_TO_WRITE_TO; j++)
            {
                theRow[j] = "whatever";
            }
        }
    }

    static void Main(string[] args)
    {
        var sw = Stopwatch.StartNew();
        AddRowFast();
        sw.Stop();
        Console.WriteLine(sw.Elapsed.TotalMilliseconds);

        sw.Restart();
        AddRowSlow();
        sw.Stop();
        Console.WriteLine(sw.Elapsed.TotalMilliseconds);

        //When NUM_COLS_TO_WRITE_TO is 5
        //FAST: 2754.6782 ms
        //SLOW: 15794.1378 ms

        //When NUM_COLS_TO_WRITE_TO is 10
        //FAST: 2777.431 ms
        //SLOW: 32004.7203 ms

        //When NUM_COLS_TO_WRITE_TO is 20
        //FAST: 2831.1733 ms
        //SLOW: 61246.2243 ms
    }
}
Update
Calling theRow.BeginEdit and theRow.EndEdit around the column writes in the slow version makes it more or less constant (~4 seconds on my machine). If I actually had some constraints on the table, I guess this might make sense to me.
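For reference, the modified inner loop in AddRowSlow looks like this:

for (int i = 0; i < NUM_ROWS; i++)
{
    var theRow = dt.NewRow();
    dt.Rows.Add(theRow);

    //wrap the writes in one explicit edit instead of one per assignment
    theRow.BeginEdit();
    for (int j = 0; j < NUM_COLS_TO_WRITE_TO; j++)
    {
        theRow[j] = "whatever";
    }
    theRow.EndEdit();
}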
When the row is attached to a table, much more work is done to record and track its state on every change.
For example, if you do this,
theRow.BeginEdit();
for (int j = 0; j < NUM_COLS_TO_WRITE_TO; j++)
{
    theRow[j] = "whatever";
}
theRow.CancelEdit();
Then in BeginEdit(), internally it takes a copy of the contents of the row so that at any point you can roll back - and the end result of the above is an empty row again, without "whatever". This is still possible even in BeginLoadData mode. Following the path of BeginEdit when the row is attached to a DataTable, you eventually get into DataTable.NewRecord(), which shows that it simply copies each value for every column to store the original state in case a cancel is needed - not much magic here. On the other hand, if the row is not attached to a DataTable, not much happens in BeginEdit at all and it exits quickly.
EndEdit() is similarly pretty heavy (when attached), as this is where all the constraints are checked (max length, does the column allow nulls, etc.). It also fires a bunch of events, explicitly frees the storage used in case the edit was cancelled, and makes the change available for recall with DataTable.GetChanges(), which is still possible in BeginLoadData mode. In fact, looking at the source, all BeginLoadData seems to do is turn off constraint checking and indexing.
So that describes what BeginEdit and EndEdit do, and they behave completely differently, in terms of what is stored, depending on whether the row is attached. Now consider a single theRow[j] = "whatever": looking at the indexer setter on DataRow, it calls BeginEditInternal and then EndEdit on every single call (unless the row is already in an edit because you explicitly called BeginEdit earlier). That means it is copying and storing every single value for each column in the row, every time you make that call. You're doing it 10 times per row, so with your 1,000-column DataTable over 50,000 rows, that means it is allocating 10 × 1,000 × 50,000 = 500,000,000 objects. On top of this there is all the other versioning, checking and event firing after every single change, so overall it's much slower when the row is attached to a DataTable than when it is not.
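As an aside (not something the question measures): another way to avoid the per-assignment edit cycle is to hand all of the values to DataRowCollection.Add in a single call, so the row is created, populated and attached in one step. A minimal sketch against the same 1,000-column table:

//Sketch: reuse one values array; Rows.Add(object[]) copies the values into
//the new row in one operation instead of triggering an implicit
//BeginEdit/EndEdit for every column assignment on an attached row.
object[] values = new object[NUM_COLS_TO_CREATE];
for (int j = 0; j < NUM_COLS_TO_WRITE_TO; j++)
{
    values[j] = "whatever";
}

for (int i = 0; i < NUM_ROWS; i++)
{
    dt.Rows.Add(values);
}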
Related
I've been asked to work in an enterprise application that is using ArrayList everywhere (a Find in Files search came back with 1500+ instances of the keyword ArrayList). Very large objects are being passed around in very long ArrayLists to various methods; in many cases these ArrayLists are even 3-dimensional. I already know first-hand how terrible it is to work inside this code; however, I am not too sure about the performance impact it is having on the application. For example, when objects are retrieved from the list they have to be explicitly cast from object, whereas if a generic List had been used, no explicit cast would be needed. To test the performance I wrote a very simple test case, shown below:
AVeryLargeObject largeObj = new AVeryLargeObject();
int iterations = int.Parse(textBox1.Text);

timers.SetStart("ArrayListAdd"); // Using Win32PerformanceTimer
ArrayList oldList = new ArrayList(iterations);
for (int i = 0; i < iterations; i++)
{
    oldList.Add(largeObj);
}
timers.SetEnd("ArrayListAdd");

timers.SetStart("ArrayListPull");
for (int i = 0; i < iterations; i++)
{
    AVeryLargeObject obj = (AVeryLargeObject)oldList[i];
}
timers.SetEnd("ArrayListPull");

timers.SetStart("GenericListAdd");
List<AVeryLargeObject> properList = new List<AVeryLargeObject>(iterations);
for (int i = 0; i < iterations; i++)
{
    properList.Add(largeObj);
}
timers.SetEnd("GenericListAdd");

timers.SetStart("GenericListPull");
for (int i = 0; i < iterations; i++)
{
    AVeryLargeObject obj = properList[i];
}
timers.SetEnd("GenericListPull");
While I definitely see slower performance when pulling objects and casting from the ArrayList, I had to run upwards of a million iterations to see a millisecond of difference. Does this mean the performance of the two isn't really that different? The ArrayList pull may be as much as twice as slow, but this code runs so fast that I wonder whether refactoring this application to use a generic List would even yield a performance boost. Would the results be affected by the size of the stored objects, or by whether the objects' properties are populated? What about memory usage?
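As a side note that the test above doesn't cover: with reference types like AVeryLargeObject, both lists only store references, so the measured difference is mostly the cast. The gap is usually far bigger with value types, where ArrayList boxes every element on Add and unboxes it on every read. A minimal sketch of that variant, using Stopwatch since the Win32PerformanceTimer wrapper isn't shown in the question:

using System;
using System.Collections;
using System.Collections.Generic;
using System.Diagnostics;

class BoxingComparison
{
    static void Main()
    {
        const int iterations = 10000000;

        // ArrayList boxes each int on Add and requires an unboxing cast on read.
        Stopwatch sw = Stopwatch.StartNew();
        ArrayList oldList = new ArrayList(iterations);
        for (int i = 0; i < iterations; i++)
        {
            oldList.Add(i);
        }
        long sum1 = 0;
        for (int i = 0; i < iterations; i++)
        {
            sum1 += (int)oldList[i];
        }
        sw.Stop();
        Console.WriteLine("ArrayList (boxed ints): " + sw.ElapsedMilliseconds + " ms");

        // List<int> stores the values inline in its backing array, no boxing.
        sw.Restart();
        List<int> properList = new List<int>(iterations);
        for (int i = 0; i < iterations; i++)
        {
            properList.Add(i);
        }
        long sum2 = 0;
        for (int i = 0; i < iterations; i++)
        {
            sum2 += properList[i];
        }
        sw.Stop();
        Console.WriteLine("List<int> (no boxing): " + sw.ElapsedMilliseconds + " ms");
    }
}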
Recently, I needed to choose between using SortedDictionary and SortedList, and settled on SortedList.
However, I have now discovered that my C# program slows to a crawl when reading SortedList.Count, which I check in a function/method that is called thousands of times.
Usually my program would call the function 10,000 times within 35 ms, but while using SortedList.Count, it slowed to 300-400 ms, basically 10x slower.
I also tried SortedList.Keys.Count, but this reduced my performance another 10 times, to over 3000 ms.
I have only ~5000 keys/objects in SortedList<DateTime, object_name>.
I can easily and instantly retrieve data from my sorted list by SortedList[date] (in 35 ms), so I haven't found any problem with the list structure or the objects it's holding.
Is this performance normal?
What else can I use to obtain the number of records in the list, or just to check that the list is populated?
(besides adding a separate tracking flag, which I may do for now)
CORRECTION:
Sorry, I'm actually using:
ConcurrentDictionary<string, SortedList<DateTime, string>> dict_list = new ConcurrentDictionary<string, SortedList<DateTime, string>>();
And I had various counts in various places, sometimes checking items in the list and other times in the ConcurrentDictionary. So the issue applies to ConcurrentDictionary, and I wrote a quick test to confirm this; it takes 350 ms even without using concurrency.
Here is the test with ConcurrentDictionary, showing 350 ms:
public static void CountTest()
{
    //Create test ConcurrentDictionary
    ConcurrentDictionary<int, string> test_dict = new ConcurrentDictionary<int, string>();
    for (int i = 0; i < 50000; i++)
    {
        test_dict[i] = "ABC";
    }

    //Access .Count property 10,000 times
    int tick_count = Environment.TickCount;
    for (int i = 1; i <= 10000; i++)
    {
        int dict_count = test_dict.Count;
    }
    Console.WriteLine(string.Format("Time: {0} ms", Environment.TickCount - tick_count));
    Console.ReadKey();
}
This article recommends calling this instead:
dictionary.Skip(0).Count()
The count could be invalid as soon as the method call returns. If you want to write the count to a log for tracing purposes, for example, you can use alternative methods, such as the lock-free enumerator.
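For the "is it populated at all" case from the question, a lock-free check can also be done through the enumerator directly. A sketch, using the test_dict from the test code above:

//Lock-free "has at least one element" check: ConcurrentDictionary's
//enumerator does not take the internal bucket locks, unlike Count.
bool isPopulated;
using (IEnumerator<KeyValuePair<int, string>> e = test_dict.GetEnumerator())
{
    isPopulated = e.MoveNext();
}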
Well, ConcurrentDictionary<TKey, TValue> must work properly with many threads at once, so it needs some synchronization overhead.
Source code for Count property: https://referencesource.microsoft.com/#mscorlib/system/Collections/Concurrent/ConcurrentDictionary.cs,40c23c8c36011417
public int Count
{
    [SuppressMessage("Microsoft.Concurrency", "CA8001", Justification = "ConcurrencyCop just doesn't know about these locks")]
    get
    {
        int count = 0;
        int acquiredLocks = 0;
        try
        {
            // Acquire all locks
            AcquireAllLocks(ref acquiredLocks);

            // Compute the count, we allow overflow
            for (int i = 0; i < m_tables.m_countPerLock.Length; i++)
            {
                count += m_tables.m_countPerLock[i];
            }
        }
        finally
        {
            // Release locks that have been acquired earlier
            ReleaseLocks(0, acquiredLocks);
        }

        return count;
    }
}
Looks like you need to refactor your existing code. Since you didn't provide any code, we can't tell you what you could optimize.
For performance-sensitive code I would not recommend using the ConcurrentDictionary.Count property, as it has a locking implementation. You could use Interlocked.Increment instead to keep the count yourself.
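A minimal sketch of that idea; the wrapper class and member names here are made up for illustration:

using System.Collections.Concurrent;
using System.Threading;

class CountedCache
{
    private readonly ConcurrentDictionary<int, string> _dict =
        new ConcurrentDictionary<int, string>();
    private int _count; // maintained manually, read without taking any locks

    public void Add(int key, string value)
    {
        if (_dict.TryAdd(key, value))
        {
            Interlocked.Increment(ref _count);
        }
    }

    public bool Remove(int key)
    {
        string removed;
        if (_dict.TryRemove(key, out removed))
        {
            Interlocked.Decrement(ref _count);
            return true;
        }
        return false;
    }

    // Approximate count: cheap to read, may lag slightly behind concurrent
    // adds/removes, which is usually fine for logging or tracing.
    public int Count
    {
        get { return Volatile.Read(ref _count); }
    }
}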
I have a DataTable with one column filled with a list of words from a text file. I created a method that reads a string; if the string is found, the row must be deleted, but the problem is that the DataTable doesn't get the updates.
foreach (string line in file)
{
    tagst.Rows.Add(line);
}

string s;
for (int k = 0; k < tagst.Rows.Count; k++)
{
    s = tagst.Rows[k]["Tags"].ToString();
    if (s.Equals("Jad"))
    {
        tagst.Rows[k].Delete();
    }
}
After your loop, call tagst.AcceptChanges();
Per the documentation:
When AcceptChanges is called, any DataRow object still in edit mode successfully ends its edits. The DataRowState also changes: all Added and Modified rows become Unchanged, and Deleted rows are removed.
As @LarsTech stated, you'll want to rework your loop like this:
for (int i = tagst.Rows.Count - 1; i >= 0; i--)
{
    // ....
}
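Putting the two suggestions together, the loop from the question becomes something like this (keeping the "Tags" column name from the question):

//Iterate backwards as suggested above; Delete() only marks the row,
//and AcceptChanges() afterwards actually removes the marked rows.
for (int k = tagst.Rows.Count - 1; k >= 0; k--)
{
    string s = tagst.Rows[k]["Tags"].ToString();
    if (s.Equals("Jad"))
    {
        tagst.Rows[k].Delete();
    }
}
tagst.AcceptChanges();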
One issue with your code is that when you remove rows, you need to start from the last one and work backward, for a very simple reason: once you remove rows 3, 10 and 20 out of 100, how many rows remain? Why do you need to start from the bottom? Because tagst will be automatically re-indexed and reordered as soon as the delete completes.
But your tagst.Rows.Count is already set and never gets refreshed!
Basically, once your counter (k) hits a position where rows have already been deleted, you will see an error at best; at worst your app will crash if you do not have error-handling routines set up. Since you did not post the actual code for how you create your tagst, I will show how it can be done with an array. Declaration of variables omitted...
Try this:
for (int k = tagstArray.Count - 1; k >= 0; k--)
{
    s = tagstArray[k].ToString();
    if (s.Contains("Jad"))
    {
        tagstArray.RemoveAt(k);
    }
}
I'm looping over a DataTable with 100 to 10,000 rows, comparing each row to every other row in a double loop.
for (int i = 0; i < DT1.Rows.Count; i++)
{
    for (int j = 0; j < DT1.Rows.Count; j++)
    {
        //some code to compare data
    }
}
For 100-200 rows it's done in a few minutes, which is OK, but comparing a few thousand rows to a few thousand takes hours and doesn't finish.
What can I do to speed it up? The best I've come up with is to use lists of objects instead of DataTables (sketched below).
Any other suggestions?
Can threads be used to do this?
Thanks.
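To make the "lists of objects" idea concrete, this is roughly what I have in mind; the Id and Name columns are just placeholders for my real columns:

//Pull each row's values out of the DataTable once, up front, instead of
//going through the DataRow indexer millions of times inside the loops.
class RowData
{
    public int Id;
    public string Name;
}

List<RowData> rows = new List<RowData>(DT1.Rows.Count);
foreach (DataRow r in DT1.Rows)
{
    rows.Add(new RowData { Id = Convert.ToInt32(r["Id"]), Name = r["Name"].ToString() });
}

for (int i = 0; i < rows.Count; i++)
{
    for (int j = 0; j < rows.Count; j++)
    {
        //some code to compare rows[i] and rows[j]
    }
}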
I recently came across a similar scenario that I had to work through, though in my case I was comparing a pair of Excel files. For my trial run, after getting it working, I had 530 rows on one side and 459,000 on the other inside the nested loop. That is roughly 243 million iterations. My program was able to work through it in roughly 30 seconds. I used a foreach in this scenario:
foreach (DataRow r1 in DT1.Rows) //Loop the first source data
{
    foreach (DataRow r2 in DT2.Rows) //Loop the second source data
    {
        //Comparison code here...
    }
}
Edit: In your loop, as a point of reference, you are causing three variables to be tracked at each iteration of the loops; the first and second are your counters. The third is the major performance hit: DT1.Rows.Count. By using the direct row count as part of the loop conditions, it must be re-evaluated at each iteration. This adds unneeded time to the program. If you absolutely require the counters, then assign the row count to a variable first:
int DT1Count = DT1.Rows.Count;
for (int i = 0; i < DT1Count; i++)
{
    for (int j = 0; j < DT1Count; j++)
    {
        //some code to compare data
    }
}
This way, the row count is static, which removes the extra processing needed to evaluate the row count at each iteration.
Although you can certainly optimize your search by using hash tables, the best optimization is to let the database engine do the search for you. RDBMS engines are optimized for this kind of task - no client-side optimization should be able to beat it. Your biggest disadvantage is having to pull the data from the database into your program, which is very slow. The database engine has all the data right there - that is a huge advantage.
For example, if you are looking for rows representing users with identical first and last name, a simple query with a self-join will get you results in seconds, not minutes, because the data never leaves the engine.
select u1.userId, u2.userId
from User u1
join User u2 on u1.FirstName=u2.FirstName and u1.LastName=u2.LastName
Assuming that FirstName and LastName columns are indexed, this query will find you duplicates very quickly.
If the results are sorted in some sort of order, you can put them into an array and search it with a binary search.
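A sketch of that idea (assuming the rows can be projected to a comparable key; the "Name" column here is hypothetical):

//Project the rows to a key, sort once, then probe with a binary search
//instead of scanning all rows for every candidate value.
string[] keys = new string[DT1.Rows.Count];
for (int i = 0; i < DT1.Rows.Count; i++)
{
    keys[i] = DT1.Rows[i]["Name"].ToString(); // "Name" is a placeholder column
}
Array.Sort(keys);

string candidate = "value we are looking for";
int index = Array.BinarySearch(keys, candidate);
bool found = index >= 0; // O(log n) per lookup instead of O(n)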
for (int i = 0; i < DT1.Rows.Count; i++)
{
    for (int j = i + 1; j < DT1.Rows.Count; j++) //<-- starts from the next row
    {
        //some code to compare data
    }
}
You could also count on the .NET internals to do a better job than manual looping by using:
DataTable.Select(filterExpression, sortExpression)
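For example (the FirstName filter and LastName sort here are just placeholders):

//Let the DataTable's own filtering/sorting engine find matching rows
//instead of comparing every row against every other row by hand.
DataRow[] matches = DT1.Select("FirstName = 'John'", "LastName ASC");
foreach (DataRow row in matches)
{
    //process the matching rows
}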
The biggest optimization to be made here is the following:
Currently, you are comparing every pair of rows twice, and each row is also compared with itself. For example, in the first iteration of the loop you compare the first row with itself, because both loops start at index 0.
The simplest fix to this would be to change the inner loop to this:
for (int j = i + 1; j < DT1.Rows.Count; j++)
This will dramatically reduce the number of comparisons. Your algorithm currently needs n^2 comparisons; the proposed fix reduces this to less than half, since you only need (n^2 - n) / 2 comparisons.
I'm using the following code to count the number of rows in an HTML table and store that number for use in a for loop that extracts the text of a <td> in each row.
Table myTable = browser.Div(Find.ById("resultSpan")).Table(Find.First());
int numRows = myTable.TableRows.Count;
List<string> myList = new List<string>();

for (int i = 1; i < numRows; i++)
{
    myList.Add(myTable.TableRows[i].TableCells[1].Text);
}
I placed a label control on my form and I basically want it to increment in real time so I can see how fast the program is parsing the data.
On average I'm dealing with ~2000 rows, so it takes a long time, and I want to be able to see the status. The above code is using WatiN, but I'm sure this is a C# question.
Edit: This was easier than I thought:
for (int i = 1; i < numRows; i++)
{
    myList.Add(myTable.TableRows[i].TableCells[1].Text);
    label1.Text = i.ToString();
}
You might want to use a BackgroundWorker here so you are not blocking your UI, and so you can provide status updates through the BackgroundWorker.
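A minimal sketch of that approach (WinForms assumed; the actual WatiN call from the question is reduced to a comment, and for ~2000 rows reporting on every row is fine):

//Parse on a background thread and marshal progress back to the UI
//through ProgressChanged, so the label updates while work continues.
BackgroundWorker worker = new BackgroundWorker();
worker.WorkerReportsProgress = true;

worker.DoWork += (sender, e) =>
{
    for (int i = 1; i < numRows; i++)
    {
        //myList.Add(myTable.TableRows[i].TableCells[1].Text); // the WatiN work
        worker.ReportProgress(i);
    }
};

worker.ProgressChanged += (sender, e) =>
{
    //Runs on the UI thread, so touching the label is safe here.
    label1.Text = e.ProgressPercentage.ToString();
};

worker.RunWorkerAsync();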