I am working on a method that takes in two datatables and a list of primary key column names and gives back the matches. I do not have any other info about the tables.
I have searched the site for a solution to this problem and have found some answers, but none have given me a fast enough solution.
Based on results from Stack Overflow, I now have this:
var matches =
(from rowA in tableA.AsEnumerable()
from rowB in tableB.AsEnumerable()
where primaryKeyColumnNames.All(column => rowA[column].ToString() == rowB[column].ToString())
select new { rowA, rowB });
The problem is this is REALLY slow. It takes 4 minutes for two tables of 8,000 rows each. Before I came to Stack Overflow I was iterating through the columns and rows by hand, and that took 2 minutes (so this is actually slower than what I had). 2-4 minutes doesn't seem so bad until I hit the table with 350,000 rows; then it takes days. I need to find a better solution.
Can anyone think of a way to make this faster?
Edit: Per a suggestion from tinstaafl, this is now my code:
var matches = tableA.Rows.Cast<DataRow>()
    .Select(rowA => new
    {
        rowA,
        rowB = tableB.Rows.Find(rowA.ItemArray.Where((x, y) =>
            primaryKeyColumnNames.Contains(tableA.Columns[y].ColumnName,
                StringComparer.InvariantCultureIgnoreCase)).ToArray())
    })
    .Where(x => x.rowB != null);
Using the PrimaryKey property of the DataTable, which will accept an array of columns, should help. Perhaps something like this:
tableA.PrimaryKey = primaryKeyColumnNames.Select(x => tableA.Columns[x]).ToArray();
tableB.PrimaryKey = primaryKeyColumnNames.Select(x => tableB.Columns[x]).ToArray();
var matches = (from System.Data.DataRow RowA in tableA.Rows
where tableB.Rows.Contains(RowA.ItemArray.Where((x,y) => primaryKeyColumnNames.Contains(tableA.Columns[y].ColumnName)).ToArray())
select RowA).ToList();
In a test with 2 tables with 9900 rows and returning 9800 as common, this took about 1/3 of a second.
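If you would rather not touch the PrimaryKey property, a hash lookup keyed on the composite key gives similar performance and also returns the matched pairs the question asks for. This is only a sketch, assuming a string-joined key is unique enough for your data (i.e. the separator character never appears inside a key value):
// Build a composite string key from the primary-key columns of a row.
// Assumes the separator "\u001F" never occurs in the key values.
Func<DataRow, string> keyOf = row =>
    string.Join("\u001F", primaryKeyColumnNames.Select(c => row[c].ToString()));

// Index tableB once, so each row of tableA costs one O(1) lookup
// instead of a scan over all of tableB.
var lookupB = tableB.AsEnumerable().ToLookup(keyOf);

var matches = tableA.AsEnumerable()
    .SelectMany(rowA => lookupB[keyOf(rowA)],
                (rowA, rowB) => new { rowA, rowB })
    .ToList();
A lookup also tolerates duplicate keys in tableB; if the keys are guaranteed unique, ToDictionary works just as well.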
Related
I have a list of data retrieved from SQL and stored in a class. I want to now aggregate the data using LINQ in C# rather than querying the database again on a different dataset.
The example data has the columns Date, Period, Price, and Vol, and I am trying to create a histogram from it. I tried the LINQ code below but seem to be getting a 0 sum.
Period needs to be a where clause based on a variable
Volume needs to be aggregated for the price ranges
Price needs to be a bucket and grouped on this column
I don't want a range, just a number for each bucket.
Example output I want is (not real data just as example):
Bucket SumVol
18000 50
18100 30
18200 20
I attempted the following LINQ query, but my sum seems to be empty. I still need to add my where clause, but for some reason the data is not aggregating.
var ranges = new[] { 10000, 11000, 12000, 13000, 14000, 15000, 16000, 17000, 18000, 19000, 20000 };
var priceGroups = eod.GroupBy(x => ranges.FirstOrDefault(r => r > x.price))
.Select(g => new { Price = g.Key, Sum = g.Sum(s => s.vol)})
.ToList();
var grouped = ranges.Select(r => new
{
Price = r,
Sum = priceGroups.Where(g => g.Price > r || g.Price == 0).Sum(g => g.Sum)
});
First things first... There seems to be nothing wrong with your priceGroups list. I've run that on my end and, as far as I can understand your purpose, it seems to be grabbing the expected values from your dataset.
var ranges = new[] { 10000, 11000, 12000, 13000, 14000, 15000, 16000, 17000, 18000, 19000, 20000 };
var priceGroups = eod.GroupBy(x => ranges.FirstOrDefault(r => r > x.price))
.Select(g => new { Price = g.Key, Sum = g.Sum(s => s.vol) })
.ToList();
Now, I assume your intent with the grouped list was to obtain yet another anonymous type list, much like you did with your priceGroups list, which is also an anonymous type list... List<'a> in C#.
var grouped = ranges.Select(r => new
{
Price = r,
Sum = priceGroups.Where(g => g.Price > r || g.Price == 0).Sum(g => g.Sum)
});
For starters, you are missing the ToList() method call at the end of it. However, that's not the main issue here, as you could still work with an IEnumerable<'a> just as well for most purposes.
As I see it, the core problem is in how you assign the anonymous Sum property. Why are you filtering for g.Price > r || g.Price == 0?
There is no element with Price equal to zero in your priceGroups list. Those are a subset of ranges, and there is no zero there. You are then comparing every value in ranges against that subset in priceGroups, and consolidating the Sums of every element in priceGroups that has a Price higher than the range being evaluated. In other words, the property Sum in your grouped list is a sum of sums.
Keep in mind that priceGroups is already an aggregated list. It seems to me you are trying to aggregate it again when you call the Sum() method after a Where() clause like you are doing. That doesn't make much sense.
What you want (I believe) for the Sum property in the grouped list is for it to be the same as the Sum property in the priceGroups list when the range being evaluated matches the Price being evaluated. Furthermore, where there is no match, you want your grouped list Sum to be zero, as that means the range being evaluated was not in the original dataset. You can achieve that with the following instead:
Sum = priceGroups.FirstOrDefault(g => g.Price == r)?.Sum ?? 0
You said your Sum was "empty" in your post, but that's not the behavior I saw on my end. Try the above and, if it's still not behaving as you expect, share a small dataset for which you know the expected output and I can try to help you further.
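Putting the two pieces together, a sketch of the corrected version (assuming the item shape from the question, i.e. each eod element has a price and a vol property):
var ranges = new[] { 10000, 11000, 12000, 13000, 14000, 15000,
                     16000, 17000, 18000, 19000, 20000 };

// Aggregate the raw rows into the bucket each price falls under.
var priceGroups = eod.GroupBy(x => ranges.FirstOrDefault(r => r > x.price))
                     .Select(g => new { Price = g.Key, Sum = g.Sum(s => s.vol) })
                     .ToList();

// One output row per bucket; buckets with no data get a zero sum.
var grouped = ranges.Select(r => new
{
    Price = r,
    Sum = priceGroups.FirstOrDefault(g => g.Price == r)?.Sum ?? 0
}).ToList();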
Using LINQ on data you have already retrieved is a good approach, mainly because you save the cost of another call to the database. As long as the database isn't updated very frequently (i.e. the data doesn't change quickly underneath you), you can use the retrieved data and do all the calculations with LINQ.
Got a kind of edge case issue here. I've been tasked with pulling all data from one database to another, where the destination database has a different schema.
I've chosen to write a WinForms utility to do the data mapping and transfer with Entity Framework/ADO.NET when necessary.
This has worked great so far, except for one specific table that has 2.5 million records. The transfer takes about 10 minutes total when I disregard all foreign keys; however, when I start mapping foreign keys with FirstOrDefault() calls against in-memory lists of data that have already been moved to the destination database, it quite literally adds 4 days to the overall time.
I'm going to need to run this tool a lot over the coming days so this isn't really acceptable for me.
Here's my current approach (not my first approach; this is the result of much trial and error for efficiency's sake):
private OldModelContext _oldModelContext { get; } //instantiated in controller
using (var newModelContext = new NewModelContext())
{
//Takes no time at all to load these into memory, collections are small, 3 - 20 records each
var alreadyMigratedTable1 = newModelContext.alreadyMigratedTable1.ToList();
var alreadyMigratedTable2 = newModelContext.alreadyMigratedTable2.ToList();
var alreadyMigratedTable3 = newModelContext.alreadyMigratedTable3.ToList();
var alreadyMigratedTable4 = newModelContext.alreadyMigratedTable4.ToList();
var alreadyMigratedTable5 = newModelContext.alreadyMigratedTable5.ToList();
var oldDatasetInMemory = _oldModelContext.MasterData.AsNoTracking().ToList();//2.5 Million records, takes about 6 minutes
var table = new DataTable("MasterData");
table.Columns.Add("Column1");
table.Columns.Add("Column2");
table.Columns.Add("Column3");
table.Columns.Add("ForeignKeyColumn1");
table.Columns.Add("ForeignKeyColumn2");
table.Columns.Add("ForeignKeyColumn3");
table.Columns.Add("ForeignKeyColumn4");
table.Columns.Add("ForeignKeyColumn5");
foreach(var masterData in oldDatasetInMemory){
DataRow row = table.NewRow();
//With just these properties mapped, this takes about 2 minutes for all 2.5 Million
row["Column1"] = masterData.Property1;
row["Column2"] = masterData.Property2;
row["Column3"] = masterData.Property3;
//With this mapping, we add about 4 days to the overall process.
row["ForeignKeyColumn1"] = alreadyMigratedTable1.FirstOrDefault(s => s.uniquePropertyOnNewDataset == masterData.uniquePropertyOnOldDataset);
row["ForeignKeyColumn2"] = alreadyMigratedTable2.FirstOrDefault(s => s.uniquePropertyOnNewDataset == masterData.uniquePropertyOnOldDataset);
row["ForeignKeyColumn3"] = alreadyMigratedTable3.FirstOrDefault(s => s.uniquePropertyOnNewDataset == masterData.uniquePropertyOnOldDataset);
row["ForeignKeyColumn4"] = alreadyMigratedTable4.FirstOrDefault(s => s.uniquePropertyOnNewDataset == masterData.uniquePropertyOnOldDataset);
row["ForeignKeyColumn5"] = alreadyMigratedTable5.FirstOrDefault(s => s.uniquePropertyOnNewDataset == masterData.uniquePropertyOnOldDataset);
table.Rows.Add(row);
}
//Save table with SQLBulkCopy is very fast, takes about a minute and a half.
}
}
Note: uniquePropertyOn(New/Old)Dataset is most often a unique description string shared between the datasets; I can't match IDs because they won't be the same across databases.
I have tried:
Replacing the foreach with a LINQ Select statement; not much improvement was had.
Using .Where(predicate).FirstOrDefault(); didn't see any considerable improvement.
Running FirstOrDefault() against an IQueryable instead of the in-memory lists of migrated data; didn't see any improvement.
Mapping to a List instead of a DataTable, but that makes no difference in the mapping speed and also makes the bulk saves slower.
I've been messing around with the idea of turning the foreach into a parallel foreach loop and locking the calls to the DataTable, but I keep running into
Entity Framework connection closed issues
when querying the in-memory lists inside the parallel foreach. Not really sure what that's about, but initially the speed results were promising.
I'd be happy to post that code/errors if anyone thinks it's the right road to go down, but I'm not sure anymore.
The first thing I'd try is a dictionary, and pre-fetching the columns:
var fk1 = table.Columns["ForeignKeyColumn1"]; // cache the DataColumn from the destination DataTable
// ...
var alreadyMigratedTable1 = newModelContext.alreadyMigratedTable1.ToDictionary(
x => x.uniquePropertyOnNewDataset);
// ...
if (alreadyMigratedTable1.TryGetValue(masterData.uniquePropertyOnOldDataset, out var val))
row[fk1] = val;
However, in reality: I'd also try to avoid the entire DataTable piece unless it is really, really necessary.
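Pulling those pieces together, the per-row work becomes a handful of O(1) dictionary lookups instead of five linear scans repeated 2.5 million times. A rough sketch, assuming the unique properties are strings and keeping the question's pattern of storing the looked-up value directly in the row (adjust the stored value to whatever the destination column actually expects):
// Cache the DataColumn objects once instead of resolving them by name on every row.
var fk1 = table.Columns["ForeignKeyColumn1"];
var fk2 = table.Columns["ForeignKeyColumn2"];
// ... fk3, fk4, fk5

// Index each already-migrated list by its unique property for O(1) lookups.
var lookup1 = alreadyMigratedTable1.ToDictionary(x => x.uniquePropertyOnNewDataset);
var lookup2 = alreadyMigratedTable2.ToDictionary(x => x.uniquePropertyOnNewDataset);
// ... lookup3, lookup4, lookup5

foreach (var masterData in oldDatasetInMemory)
{
    var row = table.NewRow();
    row["Column1"] = masterData.Property1;
    row["Column2"] = masterData.Property2;
    row["Column3"] = masterData.Property3;

    if (lookup1.TryGetValue(masterData.uniquePropertyOnOldDataset, out var match1))
        row[fk1] = match1;
    if (lookup2.TryGetValue(masterData.uniquePropertyOnOldDataset, out var match2))
        row[fk2] = match2;
    // ... same pattern for the remaining foreign-key columns

    table.Rows.Add(row);
}
ToDictionary will throw if a unique property ever has duplicates; if that can happen, build a Lookup or GroupBy first and pick a representative.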
If there is really no other way to migrate this data than to load everything into memory, you can make it more efficient by avoiding this nested loop and by linking the lists via Join.
Read: Why is LINQ JOIN so much faster than linking with WHERE?
var newData =
from master in oldDatasetInMemory
join t1 in alreadyMigratedTable1
on master.uniquePropertyOnOldDataset equals t1.uniquePropertyOnNewDataset into t1Group
from join1 in t1Group.Take(1).DefaultIfEmpty()
join t2 in alreadyMigratedTable2
on master.uniquePropertyOnOldDataset equals t2.uniquePropertyOnNewDataset into t2Group
from join2 in t2Group.Take(1).DefaultIfEmpty()
join t3 in alreadyMigratedTable3
on master.uniquePropertyOnOldDataset equals t3.uniquePropertyOnNewDataset into t3Group
from join3 in t3Group.Take(1).DefaultIfEmpty()
join t4 in alreadyMigratedTable4
on master.uniquePropertyOnOldDataset equals t4.uniquePropertyOnNewDataset into t4Group
from join4 in t4Group.Take(1).DefaultIfEmpty()
join t5 in alreadyMigratedTable5
on master.uniquePropertyOnOldDataset equals t5.uniquePropertyOnNewDataset into t5Group
from join5 in t5Group.Take(1).DefaultIfEmpty()
select new { master, join1, join2, join3, join4, join5};
foreach (var x in newData)
{
DataRow row = table.Rows.Add();
row["Column1"] = x.master.Property1;
row["Column2"] = x.master.Property2;
row["Column3"] = x.master.Property3;
row["ForeignKeyColumn1"] = x.join1;
row["ForeignKeyColumn2"] = x.join2;
row["ForeignKeyColumn3"] = x.join3;
row["ForeignKeyColumn4"] = x.join4;
row["ForeignKeyColumn5"] = x.join5;
}
This is a LINQ Left-Outer-Join which takes only one row from the right side.
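For reference, the same "left outer join, first match only" shape for a single table in method syntax would look roughly like this (same assumed property names as above):
var newData = oldDatasetInMemory
    .GroupJoin(alreadyMigratedTable1,
               master => master.uniquePropertyOnOldDataset,
               t1 => t1.uniquePropertyOnNewDataset,
               (master, t1Group) => new { master, join1 = t1Group.FirstOrDefault() });
GroupJoin builds a hash table over the inner sequence, which is where the speed difference over repeated FirstOrDefault scans comes from.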
I have a LINQ to SQL data context set up. I expect a certain number of logs, identified by a mapping ID, and I'm using that to write a web client that shows the status of these downloads. Right now I have a situation where a LINQ statement is taking forever, although the number of rows seems relatively low.
The statement taking forever is:
var dlCatToUnion = (from cl in currentLogs
where ndcat.All(x=>x!=cl.CategoryCountryCategoryTypeMapping.CategoryID)
group cl by cl.CategoryCountryCategoryTypeMapping.Category.CategoryID into t1
select new CategoryStruct
{
CategoryName = t1.Max(x => x.CategoryCountryCategoryTypeMapping.Category.Name),
Status = t1.Any(x=>x.Response!=(int)ErrorCodes.staticCodes.success)
? (int)ErrorCodes.staticCodes.genericFailure : (int)ErrorCodes.staticCodes.success,
AverageResponseTime = 0,
categoryId = t1.Key
}
);
Specifically, look at the second line, where it says where ndcat.All(x => x != cl.CategoryCountryCategoryTypeMapping.CategoryID); if I take this part out, it's instant.
To take a look at what that line is doing:
var ndcat = (from ndid in notDownloadedIds
where ndid.Category.StorefrontID==StorefrontID
group ndid by ndid.CategoryID into finalCat
select finalCat.Key);
And then notDownloadedIds:
notDownloadedIds = cDataContext.CategoryCountryCategoryTypeMappings.Where(mapping =>!
currentLogs.Select(dll => dll.CategoryCountryCategoryTypeMappingID).Any(id => id == mapping.CategoryCountryCategoryTypeMappingID));
To give some estimates of row counts: currentLogs is around 25k rows, and CategoryCountryCategoryTypeMappings is about 53k rows. ndcat ends up being 47 rows (it also enumerates just about instantly).
Also, note that I've changed the suspect line to a !...Any(...) statement and it's just as slow.
Is there anywhere I'm being inefficient?
Have you tried changing:
where ndcat.All(x => x != cl.CategoryCountryCategoryTypeMapping.CategoryID)
to:
where !ndcat.Any(x => x == cl.CategoryCountryCategoryTypeMapping.CategoryID)
??
I ended up changing it a bit, but long story short, I did the ndcat check after grouping instead of before, which made it quite a bit faster.
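For anyone who hits the same problem: another option that usually translates much better to SQL is to materialize the small key set and use Contains, which LINQ to SQL turns into an IN (...) clause. A sketch of that, under the assumption that the ~47 category IDs comfortably fit in an IN list:
// ndcat is only ~47 values, so pull it into memory once...
var ndcatIds = ndcat.ToList();

var dlCatToUnion = from cl in currentLogs
                   // ...and let the provider translate Contains into an IN (...) clause.
                   where !ndcatIds.Contains(cl.CategoryCountryCategoryTypeMapping.CategoryID)
                   group cl by cl.CategoryCountryCategoryTypeMapping.Category.CategoryID into t1
                   select new CategoryStruct
                   {
                       CategoryName = t1.Max(x => x.CategoryCountryCategoryTypeMapping.Category.Name),
                       Status = t1.Any(x => x.Response != (int)ErrorCodes.staticCodes.success)
                           ? (int)ErrorCodes.staticCodes.genericFailure
                           : (int)ErrorCodes.staticCodes.success,
                       AverageResponseTime = 0,
                       categoryId = t1.Key
                   };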
I am looking for a way in C# LINQ, using lambda syntax, to group records per second. In my searching I have yet to find a good way to do this.
The SQL query is as follows:
select count(cct_id) as 'cnt'
,Year(cct_date_created)
,Month(cct_date_created)
,datepart(dd,cct_date_created)
,datepart(hh,cct_date_created)
,datepart(mi,cct_date_created)
,datepart(ss,cct_date_created)
from ams_transactions with (nolock)
where cct_date_created between dateadd(dd,-1,getdate()) and getdate()
group by
Year(cct_date_created)
,Month(cct_date_created)
,datepart(dd,cct_date_created)
,datepart(hh,cct_date_created)
,datepart(mi,cct_date_created)
,datepart(ss,cct_date_created)
The closest I was able to come is the following, but it is not giving me the right results:
var groupedResult = MyTable.Where(t => t.cct_date_created > start
&& t.cct_date_created < end)
.GroupBy(t => new { t.cct_date_created.Month,
t.cct_date_created.Day,
t.cct_date_created.Hour,
t.cct_date_created.Minute,
t.cct_date_created.Second })
.Select(group => new {
TPS = group.Key.Second
});
This appears to be grouping by the second alone; it is not treating each second of each minute in the date range separately, but instead lumps together that second of every minute in the range. To get transactions per second I need it to consider the month, day, hour, and minute separately as well.
The goal is then to pull a Max and an Average from this grouped list. Any help would be greatly appreciated :)
Currently you're selecting the second, rather than the count - why? (You're also using an anonymous type for no obvious reason - whenever you have a single property, consider just selecting that property instead of wrapping it in an anonymous type.)
So change your Select to:
.Select(group => new { Key = group.Key,
Transactions = group.Count() });
Or to have all of the key properties separately:
.Select(group => new { group.Month,
group.Day,
group.Hour,
group.Minute,
group.Second,
Transactions = group.Count() });
(As an aside, do you definitely not need the year part? It's in your SQL...)
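Putting it all together with the year included (to match the SQL) and the Max/Average the question mentions at the end, something along these lines should work; treat it as a sketch against the assumed MyTable shape:
var grouped = MyTable
    .Where(t => t.cct_date_created > start && t.cct_date_created < end)
    .GroupBy(t => new
    {
        t.cct_date_created.Year,
        t.cct_date_created.Month,
        t.cct_date_created.Day,
        t.cct_date_created.Hour,
        t.cct_date_created.Minute,
        t.cct_date_created.Second
    })
    .Select(g => new { g.Key, Transactions = g.Count() })
    .ToList();

// Transactions-per-second statistics over the whole window.
var maxTps = grouped.Max(x => x.Transactions);
var avgTps = grouped.Average(x => x.Transactions);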
I have a datatable that I have grouped as follows:
var result = from data in view.AsEnumerable()
group data by new {Group = data.Field<string>("group_no")}
into grp
select new
{
Group = grp.Key.Group,
PRAS = grp.Average(c => Convert.ToDouble(c.Field<string>("pAKT Total")))
};
Now, the average function is also counting the empty cells in its calculation. For example, if there are 10 cells with only 5 populated with values, I want the average to be the sum of the 5 values divided by 5.
How can I ensure that it does what I want?
Thanks.
Maybe something like this:
PRAS = grp.Select(row => row.Field<string>("pAKT Total"))
.Where(s => !String.IsNullOrEmpty(s))
.Select(Convert.ToDouble)
.Average()
To my knowledge, that's not possible with the Average method.
You can however achieve the result you want to, with the following substitute:
PRAS = grp.Sum(c => Convert.ToDouble(c.Field<string>("pAKT Total"))) / grp.Count(c => !c.IsNull("pAKT Total"))
This only makes sense when you want to select the "empty" rows into the group but just don't want to include them in your average. If you don't need the "empty" rows at all, don't select them in the first place, i.e. add a where clause that excludes them.
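A sketch of that last option (filtering the empty rows out before grouping), assuming the empty cells come through as null or empty strings:
var result = from data in view.AsEnumerable()
             where !String.IsNullOrEmpty(data.Field<string>("pAKT Total"))
             group data by new { Group = data.Field<string>("group_no") } into grp
             select new
             {
                 Group = grp.Key.Group,
                 PRAS = grp.Average(c => Convert.ToDouble(c.Field<string>("pAKT Total")))
             };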