Difference between datatable.Rows.Cast<DataRow> and datatable.AsEnumerable() in LINQ C#

I am working on some DataTable-related operations on data. What would be the most efficient way to use LINQ on a DataTable?
var list = dataSet.Tables[0]
.AsEnumerable()
.Where(p => p.Field<String>("EmployeeName") == "Jams");
OR
var listobj = dataSet.Tables[0].Rows
.Cast<DataRow>()
.Where(dr => dr["EmployeeName"].ToString() == "Jams");

.AsEnumerable() internally uses .Rows.Cast<DataRow>(), at least in the reference implementation. It does a few other bits as well but nothing that would appreciably affect performance.
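For what it's worth, a minimal sketch of that equivalence (assuming a populated DataSet and the System.Linq and System.Data.DataSetExtensions references are in place):
IEnumerable<DataRow> viaAsEnumerable = dataSet.Tables[0].AsEnumerable();
IEnumerable<DataRow> viaCast = dataSet.Tables[0].Rows.Cast<DataRow>();
// Both enumerate the same DataRow instances; AsEnumerable just wraps them in an
// EnumerableRowCollection<DataRow>, which is what enables extras like AsDataView().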

.AsEnumerable() and .Field do a lot of extra work that is not needed in most cases.
Also, field lookup by index is faster than lookup by name:
int columnIndex = dataTable.Columns["EmployeeName"].Ordinal;
var list = dataTable.Rows.Cast<DataRow>().Where(dr => "Jams".Equals(dr[columnIndex]));
For multiple names, the lookup is faster if the results are cached in a Dictionary or Lookup:
int colIndex = dataTable.Columns["EmployeeName"].Ordinal;
var lookup = dataTable.Rows.Cast<DataRow>().ToLookup(dr => dr[colIndex]?.ToString());
// .. and later when the result is needed:
var list = lookup["Jams"];

Define "efficient".
From a performance standpoint, I doubt there are any significant differences between these two options: the overall run time will be dominated by network I/O, not by the time spent casting.
From a pure code-style point of view, the second one looks too inelegant to me. If you can get away with an all-LINQ solution, go with it, as it's generally (IMO, at least) more readable by virtue of being declarative.

Interestingly enough, AsEnumerable() returns EnumerableRowCollection<DataRow>,
and if you look into the code for it, you will see the following:
this._enumerableRows = Enumerable.Cast<TRow>((IEnumerable) table.Rows);
So I would say that they are basically equivalent!

Related

Is it a good practice to use query to table after select from different table?

I've been wondering whether it is good practice, from a performance point of view, to use the following syntax when querying the table with LINQ. The following is just an example, but I hope you get the idea:
Context.Pets.Where(p => p.Name == petname)
    .Select(p => new {
        SomeProperty = p.Age,
        SomeOtherProperty = p.Color,
        VeryDifferentProperty = Context.FavoriteFood
            .Where(f => f.FavFood == p.FavFood)
            .FirstOrDefault()
            .Nutrition.Protein
    });
Here I'm talking specifically about VeryDifferentProperty. Is it OK to make this kind of call?
Depending on the sizes of the FavoriteFood and Pets lists, you might benefit from converting the FavoriteFood list to a dictionary (with FavFood as the key) to reduce overall execution time.
Currently, processing Pets is an O(n) operation, and calculating the value of VeryDifferentProperty makes it O(n^2).
Depending on the number of items in the second list, it might be worthwhile to take the one-time hit of converting the list to a dictionary so that each lookup becomes O(1). There should be no further optimization needed beyond that.
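A minimal sketch of that idea (the dictionary key and the switch to in-memory evaluation via AsEnumerable() are assumptions based on the question's example, not a tested implementation):
var foodByName = Context.FavoriteFood.ToDictionary(f => f.FavFood); // one-time O(n) build
var query = Context.Pets
    .Where(p => p.Name == petname)
    .AsEnumerable() // finish the projection in memory so the dictionary can be used
    .Select(p => new {
        SomeProperty = p.Age,
        SomeOtherProperty = p.Color,
        VeryDifferentProperty = foodByName[p.FavFood].Nutrition.Protein // O(1) lookup
    });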

Orderby C# string record

I have the following OrderBy for records read from the database, which are then built into a string.
The following code works fine, but I know it can be improved; any suggestion is highly appreciated.
result.Sites.ForEach(x =>
{
result.SiteDetails +=
string.Concat(ICMSRepository.Instance.GetSiteInformationById(x.SiteInformationId).SiteCode,
",");
});
//Sort(Orderby) sites by string value NOT by numerical order
result.SiteDetails = result.SiteDetails.Trim(',');
List<string> siteCodes = result.SiteDetails.Split(',').ToList();
var siteCodesOrder = siteCodes.OrderBy(x => x).ToArray();
string siteCodesSorted = string.Join(", ", siteCodesOrder);
result.SiteDetails = siteCodesSorted;
That's a little convoluted, yeah.
All we need to do is select out the SiteCode as a string, sort with OrderBy, then join the results. Since string.Join has an overload that works with IEnumerable<string>, we don't need to convert to an array in the middle.
What we end up with is a single statement for assigning to your SiteDetails member:
result.SiteDetails = string.Join(", ",
result.Sites
.Select(x => $"{ICMSRepository.Instance.GetSiteInformationById(x.SiteInformationId).SiteCode}")
.OrderBy(x => x)
);
(Or you could use .ToString() instead of $"{...}")
This is the general process for most transforms in LINQ. Figure out what your inputs are, what you need to do with them, and how the outputs should look.
If you're using LINQ it's uncommon that you will have to build and manipulate intermediary lists unless you're doing something quite complex. For simple tasks like sorting a sequence of values there is almost never a reason to put them into transitional collections, since the framework handles all of that for you.
And the best part is it enumerates the collection one time to get the full set of data. No more loops to pull the data out, then process, then rebuild.
One thing that will improve performance is to get rid of the .ToList() and the .ToString(). Neither is necessary; they just take up extra processing time and memory.
Go with Corey's answer, which this is a variant of, but I thought I'd offer a slightly clearer way to express the query:
result.SiteDetails =
String.Join(", ",
from x in result.Sites
let sc = ICMSRepository.Instance.GetSiteInformationById(x.SiteInformationId).SiteCode
orderby sc
select sc);

Linq performance when diffing two lists using inner Contains

EDIT 01: I seem to have found a solution (click for the answer) that works for me, going from an hour to merely seconds by pre-computing and then applying the .Except() extension method; but I'm leaving this open in case anyone else encounters this problem or finds a better solution.
ORIGINAL QUESTION
I have the following set of queries for different kinds of objects I'm staging from a source system, so I can keep them in sync and make a delta stamp myself, as the source system doesn't provide one, nor can we build or touch it.
I get all the data in memory and then, for example, perform this query, where I look for objects that no longer exist in the source system but are still present in the staging database - and thus have to be marked "deleted". The bottleneck is the first part of the LINQ query - the .Contains(). How can I improve its performance - maybe with .Except(), or with a custom comparer?
Or would I be better off putting them in a hash-based collection and then performing the compare?
The problem is that I still need the staged objects afterwards to do some property transforms on them. This seemed the simplest solution, but unfortunately it's very slow on 20k objects:
stagedSystemObjects.Where(stagedSystemObject =>
!sourceSystemObjects.Select(sourceSystemObject => sourceSystemObject.Code)
.Contains(stagedSystemObject.Code)
)
.Select(x =>
{
x.ActiveStatus = ActiveStatuses.Disabled;
x.ChangeReason = ChangeReasons.Edited;
return x;
})
.ToList();
Based on Yves Schelpe's answer, I made a few tweaks to make it faster.
The basic idea is to drop the first two ToList calls and use PLINQ. See if this helps:
var stagedSystemCodes = stagedSystemObjects.Select(x => x.Code);
var sourceSystemCodes = sourceSystemObjects.Select(x => x.Code);
var codesThatNoLongerExistInSourceSystem = stagedSystemCodes.Except(sourceSystemCodes).ToArray();
var y = stagedSystemObjects.AsParallel()
.Where(stagedSystemObject =>
codesThatNoLongerExistInSourceSystem.Contains(stagedSystemObject.Code))
.Select(x =>
{
x.ActiveStatus = ActiveStatuses.Disabled;
x.ChangeReason = ChangeReasons.Edited;
return x;
}).ToArray();
Note that PLINQ may only work well for computation-bound tasks on a multi-core CPU. It can make things worse in other scenarios.
I have found a solution to this problem - it brought the run time down from an hour to mere seconds for 200k objects.
It's done by pre-computing and then applying the .Except() extension method.
So no more "chaining" LINQ queries or calling .Contains on a nested Select; instead it's made "simpler" by first projecting both sides to lists of strings, so that the inner calculation doesn't have to happen over and over again as it did in the original question's example code.
Here is my solution, which for now is satisfactory. However, I'm leaving this open in case anyone comes up with a refined/better solution!
var stagedSystemCodes = stagedSystemObjects.Select(x => x.Code).ToList();
var sourceSystemCodes = sourceSystemObjects.Select(x => x.Code).ToList();
var codesThatNoLongerExistInSourceSystem = stagedSystemCodes.Except(sourceSystemCodes).ToList();
return stagedSystemObjects
.Where(stagedSystemObject =>
codesThatNoLongerExistInSourceSystem.Contains(stagedSystemObject.Code))
.Select(x =>
{
x.ActiveStatus = ActiveStatuses.Disabled;
x.ChangeReason = ChangeReasons.Edited;
return x;
})
.ToList();
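(As a small variation that is not part of the original answer: putting the diffed codes into a HashSet<string> would make the Contains check in the final Where an O(1) operation per row instead of a scan of the list, e.g. var codesThatNoLongerExistInSourceSystem = new HashSet<string>(stagedSystemCodes.Except(sourceSystemCodes));)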

Speed improvement in LINQ Where(Array.Contains)

I initially had a method that contained a LINQ query returning int[], which then got used later in a fashion similar to:
var result = something.Where(s => previousarray.Contains(s.field));
This turned out to be horribly slow, until the first array was retrieved as the native IQueryable<int>. It now runs very quickly, but I'm wondering how I'd deal with the situation if I was provided an int[] from elsewhere which then had to be used as above.
Is there a way to speed up the query in such cases? Converting to a List doesn't seem to help.
In LINQ to SQL, a Contains will be converted to a SELECT ... WHERE field IN (...) and should be relatively fast. In LINQ to Objects, however, it will call ICollection<T>.Contains if the source is an ICollection<T>.
When a LINQ to SQL result is treated as an IEnumerable instead of an IQueryable, you lose the LINQ provider - i.e., any further operations will be done in memory and not in the database.
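A rough sketch of that distinction (the db context and entity names here are illustrative assumptions, not from the question):
// Still an IQueryable: Contains is translated into SQL, roughly WHERE Field IN (...)
IQueryable<int> previousIds = db.PreviousThings.Select(p => p.Id);
var inDatabase = db.Somethings.Where(s => previousIds.Contains(s.Field));
// Materialized first: the provider is gone, so the filter runs row by row in memory
int[] previousArray = previousIds.ToArray();
var inMemory = db.Somethings.AsEnumerable()
    .Where(s => previousArray.Contains(s.Field)); // linear scan of the array per element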
As for why it's much slower in memory:
Array.Contains() is an O(n) operation so
something.Where(s => previousarray.Contains(s.field));
is O(p * s) where p is the size of previousarray and s is the size of something.
HashSet<T>.Contains() on the other hand is an O(1) operation. If you first create a hashset, you will see a big improvement on the .Contains operation as it will be O(s) instead of O(p * s).
Example:
var previousSet = new HashSet<int>(previousarray);
var result = something.Where(s => previousSet.Contains(s.field));
Contains on Lists/Arrays/IEnumerables etc. is an O(n) operation. It is O(1) on a HashSet, so you should try to use one.

Does the order of LINQ functions matter?

Basically, as the question states... does the order of LINQ functions matter in terms of performance? Obviously the results would have to be identical still...
Example:
myCollection.OrderBy(item => item.CreatedDate).Where(item => item.Code > 3);
myCollection.Where(item => item.Code > 3).OrderBy(item => item.CreatedDate);
Both return the same results, but the LINQ calls are in a different order. I realize that reordering some operations will produce different results, and I'm not concerned about those. My main concern is knowing whether, when the results are the same, the ordering can impact performance - and not just for the two LINQ calls I made (OrderBy, Where), but for any LINQ calls.
It will depend on the LINQ provider in use. For LINQ to Objects, that could certainly make a huge difference. Assume we've actually got:
var query = myCollection.OrderBy(item => item.CreatedDate)
.Where(item => item.Code > 3);
var result = query.Last();
That requires the whole collection to be sorted and then filtered. If we had a million items, only one of which had a code greater than 3, we'd be wasting a lot of time ordering results which would be thrown away.
Compare that with the reversed operation, filtering first:
var query = myCollection.Where(item => item.Code > 3)
.OrderBy(item => item.CreatedDate);
var result = query.Last();
This time we're only ordering the filtered results, which in the sample case of "just a single item matching the filter" will be a lot more efficient - both in time and space.
It also could make a difference in whether the query executes correctly or not. Consider:
var query = myCollection.Where(item => item.Code != 0)
.OrderBy(item => 10 / item.Code);
var result = query.Last();
That's fine - we know we'll never be dividing by 0. But if we perform the ordering before the filtering, the query will throw an exception.
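For illustration, the reversed (failing) order would look like this - a sketch, assuming Code is an int so 10 / item.Code is integer division:
var badQuery = myCollection.OrderBy(item => 10 / item.Code)
                           .Where(item => item.Code != 0);
var badResult = badQuery.Last(); // throws DivideByZeroException during enumeration if any Code is 0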
Yes.
But exactly what that performance difference is depends on how the underlying expression tree is evaluated by the LINQ provider.
For instance, your query may well execute faster the second time (with the WHERE clause first) for LINQ-to-XML, but faster the first time for LINQ-to-SQL.
To find out precisely what the performance difference is, you'll most likely want to profile your application. As ever with such things, though, premature optimisation is not usually worth the effort -- you may well find issues other than LINQ performance are more important.
In your particular example it can make a difference to the performance.
First query: Your OrderBy call needs to iterate through the entire source sequence, including those items where Code is 3 or less. The Where clause then also needs to iterate the entire ordered sequence.
Second query: The Where call limits the sequence to only those items where Code is greater than 3. The OrderBy call then only needs to traverse the reduced sequence returned by the Where call.
In Linq-To-Objects:
Sorting is rather slow and uses O(n) memory. Where on the other hand is relatively fast and uses constant memory. So doing Where first will be faster, and for large collections significantly faster.
The reduced memory pressure can be significant too, since allocations on the large object heap (together with their collection) are relatively expensive in my experience.
Obviously the results would have to be identical still...
Note that this is not actually true - in particular, the following two lines will give different results (for most providers/datasets):
myCollection.OrderBy(o => o).Distinct();
myCollection.Distinct().OrderBy(o => o);
It's worth noting that you should be careful when considering how to optimize a LINQ query. For example, if you use the declarative version of LINQ to do the following:
public class Record
{
public string Name { get; set; }
public double Score1 { get; set; }
public double Score2 { get; set; }
}
var query = from record in Records
            orderby ((record.Score1 + record.Score2) / 2) descending
            select new
            {
                Name = record.Name,
                Average = ((record.Score1 + record.Score2) / 2)
            };
If, for whatever reason, you decided to "optimize" the query by storing the average in a variable first, you wouldn't get the desired results:
// The following two queries actually take up more space and are slower
var query = from record in Records
            let average = ((record.Score1 + record.Score2) / 2)
            orderby average descending
            select new
            {
                Name = record.Name,
                Average = average
            };
var query = from record in Records
            let average = ((record.Score1 + record.Score2) / 2)
            select new
            {
                Name = record.Name,
                Average = average
            } into result
            orderby result.Average descending
            select result;
I know not many people use declarative LINQ for objects, but it is some good food for thought.
It depends on the data. Suppose you have very few items with Code > 3; then the subsequent OrderBy only has to sort a small set of the collection by date.
Whereas if you have many items matching the filter, the subsequent OrderBy has to work on a much larger set of the collection to get the order by date.
So, in both cases there will be a difference in performance.
