Basically, as the question states... does the order of LINQ functions matter in terms of performance? Obviously the results would have to be identical still...
Example:
myCollection.OrderBy(item => item.CreatedDate).Where(item => item.Code > 3);
myCollection.Where(item => item.Code > 3).OrderBy(item => item.CreatedDate);
Both return the same results, just with the LINQ calls in a different order. I realize that reordering some calls would change the results, and I'm not concerned about those cases. My main concern is whether, when the results are the same, the ordering can impact performance - and not just for the two LINQ calls I used (OrderBy, Where), but for any LINQ calls.
It will depend on the LINQ provider in use. For LINQ to Objects, that could certainly make a huge difference. Assume we've actually got:
var query = myCollection.OrderBy(item => item.CreatedDate)
.Where(item => item.Code > 3);
var result = query.Last();
That requires the whole collection to be sorted and then filtered. If we had a million items, only one of which had a code greater than 3, we'd be wasting a lot of time ordering results which would be thrown away.
Compare that with the reversed operation, filtering first:
var query = myCollection.Where(item => item.Code > 3)
.OrderBy(item => item.CreatedDate);
var result = query.Last();
This time we're only ordering the filtered results, which in the sample case of "just a single item matching the filter" will be a lot more efficient - both in time and space.
It also could make a difference in whether the query executes correctly or not. Consider:
var query = myCollection.Where(item => item.Code != 0)
.OrderBy(item => 10 / item.Code);
var result = query.Last();
That's fine - we know we'll never be dividing by 0. But if we perform the ordering before the filtering, the query will throw an exception.
Yes.
But exactly what that performance difference is depends on how the underlying expression tree is evaluated by the LINQ provider.
For instance, your query may well execute faster in its second form (with the Where clause first) for LINQ to XML, but faster in its first form for LINQ to SQL.
To find out precisely what the performance difference is, you'll most likely want to profile your application. As ever with such things, though, premature optimisation is not usually worth the effort -- you may well find issues other than LINQ performance are more important.
In your particular example it can make a difference to the performance.
First query: Your OrderBy call needs to iterate through the entire source sequence, including those items where Code is 3 or less. The Where clause then also needs to iterate the entire ordered sequence.
Second query: The Where call limits the sequence to only those items where Code is greater than 3. The OrderBy call then only needs to traverse the reduced sequence returned by the Where call.
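To put rough numbers on this, here is a small LINQ to Objects sketch (the item shape and the data are made up purely for illustration) that times both orderings with a Stopwatch:
using System;
using System.Diagnostics;
using System.Linq;

// A million fake items; only items with Code == 4 (about 20%) pass the Code > 3 filter.
var items = Enumerable.Range(0, 1_000_000)
    .Select(i => new { Code = i % 5, CreatedDate = DateTime.Now.AddMinutes(-i) })
    .ToList();

var sw = Stopwatch.StartNew();
var sortThenFilter = items.OrderBy(i => i.CreatedDate).Where(i => i.Code > 3).ToList(); // sorts all 1,000,000 items
Console.WriteLine($"OrderBy then Where: {sw.Elapsed}");

sw.Restart();
var filterThenSort = items.Where(i => i.Code > 3).OrderBy(i => i.CreatedDate).ToList(); // sorts only ~200,000 items
Console.WriteLine($"Where then OrderBy: {sw.Elapsed}");
On a typical run the filter-first version finishes noticeably faster, simply because the sort only ever sees roughly a fifth of the elements.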
In LINQ to Objects:
Sorting is rather slow (O(n log n)) and uses O(n) memory. Where, on the other hand, is relatively fast, streams its input, and uses constant memory. So doing the Where first will be faster, and for large collections significantly faster.
The reduced memory pressure can be significant too, since allocations on the large object heap (together with their garbage collection) are relatively expensive in my experience.
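A toy way to see that buffering for yourself (a sketch, not a benchmark): Where passes elements through one at a time, while OrderBy has to consume and buffer its whole source before it can yield the first element.
using System;
using System.Collections.Generic;
using System.Linq;

IEnumerable<int> Source()
{
    for (var i = 0; i < 5; i++)
    {
        Console.WriteLine($"yielding {i}");
        yield return i;
    }
}

// Where: "yielding" and "got" lines interleave - elements stream through with constant memory.
foreach (var x in Source().Where(i => i % 2 == 0))
    Console.WriteLine($"got {x}");

// OrderBy: all five "yielding" lines print before the first "got" - the whole source was buffered (O(n) memory).
foreach (var x in Source().OrderBy(i => i))
    Console.WriteLine($"got {x}");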
Obviously the results would have to be identical still...
Note that this is not actually true - in particular, the following two lines will give different results (for most providers/datasets):
myCollection.OrderBy(o => o).Distinct();
myCollection.Distinct().OrderBy(o => o);
It's worth noting that you should be careful when considering how to optimize a LINQ query. For example, if you use the declarative version of LINQ to do the following:
public class Record
{
    public string Name { get; set; }
    public double Score1 { get; set; }
    public double Score2 { get; set; }
}
var query = from record in Records
            orderby ((record.Score1 + record.Score2) / 2) descending
            select new
            {
                Name = record.Name,
                Average = ((record.Score1 + record.Score2) / 2)
            };
If, for whatever reason, you decided to "optimize" the query by storing the average in a variable first, you wouldn't get the improvement you were hoping for:
// The following two queries actually take up more space and are slower
var query = from record in Records
            let average = ((record.Score1 + record.Score2) / 2)
            orderby average descending
            select new
            {
                Name = record.Name,
                Average = average
            };
var query = from record in Records
            let average = ((record.Score1 + record.Score2) / 2)
            select new
            {
                Name = record.Name,
                Average = average
            } into result
            orderby result.Average descending
            select result;
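For what it's worth, the extra cost of the let becomes easier to see if you look at roughly what the compiler turns it into in method syntax (a sketch of the translation, not the exact generated code): the let clause becomes an extra Select that wraps every record in an intermediate object just to carry average along.
var query = Records
    .Select(record => new { record, average = (record.Score1 + record.Score2) / 2 }) // extra per-record allocation introduced by 'let'
    .OrderByDescending(x => x.average)
    .Select(x => new { Name = x.record.Name, Average = x.average });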
I know not many people use declarative LINQ for objects, but it is some good food for thought.
It depends on how selective the filter is. If you have very few items with Code greater than 3, then the subsequent OrderBy only has to sort a small collection to get the ordering by date.
Whereas if many items have Code greater than 3, then the OrderBy has to work on a much larger collection to get the ordering by date.
So in either case the ordering of the calls can make a difference in performance.
Related
I've been wondering whether it is good practice, from a performance point of view, to use the following syntax when querying a table with LINQ. The following is just an example, but I hope you get the idea:
Context.Pets.Where(p => p.Name == petname)
    .Select(p => new {
        SomeProperty = p.Age,
        SomeOtherProperty = p.Color,
        VeryDifferentProperty = Context.FavoriteFood.Where(f => f.FavFood == p.FavFood).FirstOrDefault().Nutrition.Protein});
Here I'm talking specifically about VeryDifferentProperty. Is it OK to make this kind of call?
Depending on the sizes of the FavoriteFood and Pets lists, you might benefit from converting the FavoriteFood list to a dictionary (with FavFood as the key) to reduce overall execution time.
Currently, processing Pets is an O(n) operation, and calculating the value of VeryDifferentProperty inside it makes the whole thing O(n^2).
Depending on the number of items in the second list, it might be worthwhile to take the one-time hit of converting the list to a dictionary so that each lookup becomes O(1). There should be no further optimization needed beyond that.
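As a rough sketch of what that might look like (the property and type names are guessed from your snippet, and it assumes FavFood is unique per FavoriteFood row):
// One-time O(n) conversion of the FavoriteFood table into a dictionary keyed on FavFood.
var foodByName = Context.FavoriteFood.ToDictionary(f => f.FavFood);

var result = Context.Pets
    .Where(p => p.Name == petname)
    .AsEnumerable() // switch to in-memory so the dictionary lookup runs locally, once per pet
    .Select(p => new
    {
        SomeProperty = p.Age,
        SomeOtherProperty = p.Color,
        VeryDifferentProperty = foodByName.TryGetValue(p.FavFood, out var food)
            ? food.Nutrition.Protein
            : default
    });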
I am essentially trying to see whether entities exist in a local context and sort them accordingly. This approach seems to be faster than others we have tried - it runs in about 50 seconds for 1000 items - but I am wondering if there is something I can do to improve its efficiency. I believe the Find call is slowing it down significantly: a simple foreach iteration over 1000 items takes milliseconds, and benchmarking shows the bottleneck is there. Any ideas would be helpful. Thank you.
Sample code:
foreach (var entity in entities)
{
    var localItem = db.Set<T>().Find(Key);
    if (localItem != null)
    {
        list1.Add(entity);
    }
    else
    {
        list2.Add(entity);
    }
}
If this is a database (which, from the comments, I gather it is), you would be better off doing fewer queries.
list1.AddRange(db.Set<T>().Where(x => x.Key == Key));
list2.AddRange(db.Set<T>().Where(x => x.Key != Key));
This would be 2 queries instead of 1000+.
Also be aware that by adding every item to a List<T>, you're keeping two large arrays in memory. So if 1000+ turns into 10,000,000, you're going to have interesting memory issues.
See this post on my blog for more information: http://www.artisansoftware.blogspot.com/2014/01/synopsis-creating-large-collection-by.html
If I understand correctly, the database seems to be the bottleneck? If you want to (effectively) select data from a database relation whose attribute x should match an equality (==) criterion, you should consider creating a secondary access path for that attribute (an index structure). Depending on your database system and the value distribution in your table, this might be a hash index (especially good for == checks), a B+-tree (an all-rounder), or whatever else your system offers.
However, this only helps if:
you don't just fetch the full data set once and then live with that in your application;
adding (another) index to the relation is not out of the question (e.g. it might not be worth it for a single need); and
adding an index would actually be effective - it won't be, for example, if the attribute you are querying on has very few unique values.
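If you happen to be using Entity Framework Core, for example, such an index can be declared in the model configuration (a sketch only: the entity and property names here are hypothetical, and with another ORM or a plain database you would create the index directly in SQL):
protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    // Index the column used in the == check so lookups don't have to scan the whole table.
    modelBuilder.Entity<MyEntity>()
        .HasIndex(e => e.Key);
}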
I found your answers very helpful, but here is how I ultimately solved the problem. It seems .Find was the bottleneck.
var tableDictionary = db.Set<T>().ToDictionary(x => x.KeyValue, x => x);
foreach (var entity in entities)
{
    if (tableDictionary.ContainsKey(entity.KeyValue))
    {
        list1.Add(entity);
    }
    else
    {
        list2.Add(entity);
    }
}
This ran with 900+ rows in about a tenth of a second, which for our purposes was efficient enough.
Rather than querying the DB for each item, you can do a single query, get all of the data (since you want all of the data from the DB eventually anyway), and then group it in memory, which in this case can be done about as efficiently as in the database. By creating a lookup keyed on whether or not the key matches, we can easily get the two groups:
var lookup = db.Set<T>().ToLookup(item => item.Key == Key);
var list1 = lookup[true].ToList();
var list2 = lookup[false].ToList();
(You can use AddRange instead if the lists have previous values that should also be in them.)
I initially had a method that contained a LINQ query returning int[], which then got used later in a fashion similar to:
var result = something.Where(s => previousarray.Contains(s.field));
This turned out to be horribly slow, until the first array was retrieved as the native IQueryable<int>. It now runs very quickly, but I'm wondering how I'd deal with the situation if I was provided an int[] from elsewhere which then had to be used as above.
Is there a way to speed up the query in such cases? Converting to a List doesn't seem to help.
In LINQ to SQL, a Contains will be converted to a SELECT ... WHERE field IN (...) and should be relatively fast. In LINQ to Objects, however, it will call ICollection<T>.Contains if the source is an ICollection<T>.
When a LINQ to SQL result is treated as an IEnumerable instead of an IQueryable, you lose the LINQ provider - i.e., any further operations will be done in memory and not in the database.
As for why it's much slower in memory:
Array.Contains() is an O(n) operation so
something.Where(s => previousarray.Contains(s.field));
is O(p * s) where p is the size of previousarray and s is the size of something.
HashSet<T>.Contains() on the other hand is an O(1) operation. If you first create a HashSet, you will see a big improvement: each Contains check becomes O(1), so the whole query is O(s) instead of O(p * s).
Example:
var previousSet = new HashSet<int>(previousarray);
var result = something.Where(s => previousSet.Contains(s.field));
Contains on Lists/Arrays/IEnumerables etc. is an O(n) operation. It is O(1) on a HashSet. So you should try to use a HashSet here.
I am trying to create a method using LINQ that takes X amount of products from the DB, so I am using the .Take method for that.
The thing is, in some situations I need to take all the products. Is there a wildcard I can give to .Take, or some other method, that would bring back all the products in the DB?
Also, what happens if I do a .Take(50) and there are only 10 products in the DB?
My code looks something like:
var ratingsToPick = context.RatingAndProducts
.ToList()
.OrderByDescending(c => c.WeightedRating)
.Take(pAmmount);
You could split it out into a separate call based on your flag:
IEnumerable<RatingAndProducts> ratingsToPick = context.RatingAndProducts
.OrderByDescending(c => c.WeightedRating);
if (!takeAll)
ratingsToPick = ratingsToPick.Take(pAmmount);
var results = ratingsToPick.ToList();
If you don't include the Take, then it will simply take everything.
Note that you may need to type your original query as IEnumerable<MyType> (or IQueryable<MyType>), as OrderByDescending returns an ordered sequence type that won't be reassignable from the Take call. (Or you can simply work around this as appropriate based on your actual code.)
Also, as #Rene147 pointed out, you should move your ToList to the end; otherwise it will retrieve all items from the database every time, and the OrderByDescending and Take will then operate on a List<> of objects in memory rather than as part of the database query, which I assume is unintended.
Regarding your second question (performing a Take(50) when only 10 entries are available): that might depend on your database provider, but in my experience providers tend to be smart enough not to throw an exception and will simply return however many items are available. (I would suggest a quick test to make sure for your specific case.)
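For the in-memory case at least, Take simply stops when the source runs out rather than throwing; a quick sanity check with LINQ to Objects and made-up data:
using System;
using System.Linq;

var tenItems = Enumerable.Range(1, 10).ToList();
var taken = tenItems.Take(50).ToList();
Console.WriteLine(taken.Count); // prints 10 - you just get whatever is available, no exception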
Your current solution always loads all products from the database, because you are calling ToList(). After loading all products from the database, you take the first N in memory. In order to conditionally load only the first N products, you need to build the query first and execute it afterwards:
int? countToTake = 50;

// typed as IQueryable so the Take can be re-assigned and still run in the database
IQueryable<RatingAndProducts> ratingsToPick = context.RatingAndProducts
    .OrderByDescending(c => c.WeightedRating);

// conditionally take only the first results
if (countToTake.HasValue)
    ratingsToPick = ratingsToPick.Take(countToTake.Value);

var result = ratingsToPick.ToList(); // execute query
I have a little program that needs to do some calculations on a data range. The range may contain about half a million records. I just looked at my DB and saw that a GROUP BY was executed.
I thought that the query was executed at the first line, and that later on I was just working with the data in RAM. But now I think that the query builder combines the expressions.
var Test = db.Test.Where(x => x > DateTime.Now.AddDays(-7));
var Test2 = (from p in Test
group p by p.CustomerId into g
select new { UniqueCount = g.Count() } );
In my real-world app I have more subqueries that are based on the range selected by the first query. I think I just added a big overhead by letting the DB perform several different selects.
For now I basically just call .ToList() after the first expression.
So my question is: am I right that the query builder combines the different IQueryables when it builds the expression tree?
Yes, you are correct. LINQ queries are lazily evaluated: nothing runs until you enumerate them (via .ToList(), for example). At that point, Entity Framework will look at the total composed query and build a single SQL statement to represent it.
In this particular case, it's probably wiser not to evaluate the first query on its own, because the SQL database is optimized for set-based operations like grouping and counting. Rather than forcing the database to send all the Test objects across the wire, deserializing the results into in-memory objects, and then performing the grouping and counting locally, you will likely see better performance by having the SQL database return just the resulting counts.
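As a hedged sketch of the two shapes (using a made-up CreatedDate column, since the original Where compares the row itself): keeping the query composed lets the provider emit a single WHERE ... GROUP BY statement, whereas calling ToList() on the first query pulls every matching row into memory and then groups locally.
// Composed: a single SQL statement; the grouping and counting happen in the database.
var counts = db.Test
    .Where(x => x.CreatedDate > DateTime.Now.AddDays(-7))
    .GroupBy(x => x.CustomerId)
    .Select(g => new { CustomerId = g.Key, UniqueCount = g.Count() })
    .ToList();

// Evaluated early: every matching row crosses the wire, then the grouping runs in memory.
var rows = db.Test
    .Where(x => x.CreatedDate > DateTime.Now.AddDays(-7))
    .ToList();
var countsInMemory = rows
    .GroupBy(x => x.CustomerId)
    .Select(g => new { CustomerId = g.Key, UniqueCount = g.Count() })
    .ToList();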