OrderBy().Last() or OrderByDescending().First() performance - c#

I know that this probably is micro-optimization, but still I wonder if there is any difference in using
var lastObject = myList.OrderBy(item => item.Created).Last();
or
var lastObject = myList.OrderByDescending(item => item.Created).First();
I am looking for answers for Linq to objects and Linq to Entities.

Assuming that both ways of sorting take equal time (and that's a big 'if'), then the first method would have the extra cost of doing a .Last(), potentially requiring a full enumeration.
And that argument probably holds even more strongly for an SQL-oriented LINQ.
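For Linq to Objects specifically, the shape of Last() on a source that isn't an IList<T> is roughly the following (a sketch, not the actual framework source; the real Last() can short-circuit to list[list.Count - 1] for IList<T> sources), which is where that extra enumeration comes from:

using System;
using System.Collections.Generic;

static class LastSketch
{
    // Rough shape of Last() for a non-IList<T> source:
    // it has to walk the whole sequence to find the final element.
    public static T LastLike<T>(IEnumerable<T> source)
    {
        using (var e = source.GetEnumerator())
        {
            if (!e.MoveNext())
                throw new InvalidOperationException("Sequence contains no elements");

            T last = e.Current;
            while (e.MoveNext())
                last = e.Current;
            return last;
        }
    }
}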

(my answer is about Linq to Objects, not Linq to Entities)
I don't think there's a big difference between the two instructions; this is clearly a case of micro-optimization. In both cases, the collection needs to be sorted, which usually means a complexity of O(n log n). But you can easily get the same result with a complexity of O(n), by enumerating the collection and keeping track of the min or max value. Jon Skeet provides an implementation in his MoreLinq project, in the form of a MaxBy extension method:
var lastObject = myList.MaxBy(item => item.Created);
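If you don't want to take the MoreLinq dependency, the idea is easy to sketch yourself. This is a bare-bones version, assuming a non-empty source and the default comparer for the key (MoreLinq's real MaxBy is more general):

using System;
using System.Collections.Generic;

static class MaxBySketch
{
    // Single pass, O(n): keep the element whose key is the largest seen so far.
    public static TSource MaxByLike<TSource, TKey>(
        this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
    {
        var comparer = Comparer<TKey>.Default;
        using (var e = source.GetEnumerator())
        {
            if (!e.MoveNext())
                throw new InvalidOperationException("Sequence contains no elements");

            TSource best = e.Current;
            TKey bestKey = keySelector(best);
            while (e.MoveNext())
            {
                TKey key = keySelector(e.Current);
                if (comparer.Compare(key, bestKey) > 0)
                {
                    best = e.Current;
                    bestKey = key;
                }
            }
            return best;
        }
    }
}

Usage is then the same shape as above: var lastObject = myList.MaxByLike(item => item.Created);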

I'm sorry this doesn't directly answer your question, but...
Why not do a better optimization and use Jon Skeet's implementations of MaxBy or MinBy?
That will be O(n) as opposed to O(n log n) in both of the alternatives you presented.

In both cases it depends somewhat on your underlying collections. If you have knowledge up front about how the collections look before the order and select you could choose one over the other. For example, if you know the list is usually in an ascending (or mostly ascending) sorted order you could prefer the first choice. Or if you know you have indexes on the SQL tables that are sorted ascending. Although the SQL optimizer can probably deal with that anyway.
In a general case they are equivalent statements. You were right when you said it's micro-optimization.

Assuming OrderBy and OrderByDescending average the same performance, taking the first element would perform better than taking the last when the number of elements is large.

Just my two cents: since OrderBy and OrderByDescending have to iterate over all the objects anyway, there should be no difference. However, if it were me, I would probably just loop through all the items in a foreach with a comparison to hold the highest-comparing item, which would be an O(n) search instead of whatever the complexity of the sort is.
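Something like this, as a sketch (assuming myList is a non-empty List<T> and Created is a DateTime):

// O(n): a single pass that tracks the item with the latest Created value.
var lastObject = myList[0];
foreach (var item in myList)
{
    if (item.Created > lastObject.Created)
        lastObject = item;
}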

Related

Better way to search a List for a string

I've implemented the following code to search a list of objects for a particular value:
List<customer> matchingContacts = cAllServer
.Where(o => o.customerNum.Contains(searchTerm) ||
o.personInv.lastname.Contains(searchTerm) ||
o.personDel.lastname.Contains(searchTerm))
.ToList();
Is there a quicker or cleaner way to implement this search?
Since you will have to iterate through all of the list items, it will have O(n) complexity. Performance also depends on whether you are operating on an IQueryable collection (with or without lazy loading) or on a materialized IEnumerable collection. I'd advise checking first the properties that are most likely to contain the value you are searching for; since you are using the "or" operator, a match on the first property lets you skip the remaining Contains checks for that item. You iterate more quickly if you can tell that a particular entity is a match in 10ms rather than in 25ms. There is also the popular argument: which is faster, Contains or IndexOf? IndexOf should be a little bit faster, but I doubt you'll notice it unless you operate on lists with millions of elements. See: Is String.Contains() faster than String.IndexOf()?
I think this is fine as it is, but on the other hand I'd think twice about the need to convert to a list; you are already receiving an IEnumerable that lets you iterate through the items. Unless you need to go back and forth through the results or search by index, there's no need to convert it to a List.
This is a small optimization, though.
The one thing I'd suggest is to create a new "searchtext" column, prepopulated with (o.customerNum + "|" + o.personInv.lastname + "|" + o.personDel.lastname).ToUpper().
List<customer> matchingContacts = cAllServer
.Where(o => o.searchtext.Contains(searchTerm))
.ToList();
This performs one search instead of three (but on a longer string), and if you .ToUpper() searchTerm as well, you can perform a case-sensitive search, which might be trivially faster.
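If that field lives on the object rather than in the database, pre-populating it might look like this (a sketch; a settable searchtext property is my assumption):

// One-time pass to build the combined, upper-cased search field.
foreach (var o in cAllServer)
{
    o.searchtext = (o.customerNum + "|" +
                    o.personInv.lastname + "|" +
                    o.personDel.lastname).ToUpper();
}

// The search itself is then a single Contains per customer.
string needle = searchTerm.ToUpper();
List<customer> matchingContacts = cAllServer
    .Where(o => o.searchtext.Contains(needle))
    .ToList();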
On the whole, I wouldn't expect this to be significantly faster.

C# Using .Min() on a Dictionary, min value, but return a key?

I'm not sure if this is even possible (as far as I can tell, it isn't), but I have a dictionary and I want to find the Min() of the values, but return either the KVP or the key. Currently I am using OrderBy, but I feel like that is going to be super inefficient once this grows large enough.
You can use MoreLinq and its MinBy method:
var pair = dictionary.MinBy(x => x.Value);
var key = pair.Key;
(And yes, this is O(N) rather than O(N log N) - more efficient than ordering. It won't need as much memory, either.)
I don't think there are any shortcuts and I would think that iterating every item looking for the min would be faster than sorting it first. You could easily compare speeds to find out for sure.
-EDIT: After the comment-
If you use MoreLinq.MinBy, your code will be more concise, but you won't avoid iterating every element.
https://github.com/morelinq/MoreLINQ/blob/master/MoreLinq/MinBy.cs
So, to answer your question more explicitly, you are best to iterate every element looking for the lowest value, and when you have finished iterating all elements, return the key.
I offer this answer in case you don't want to bring in MoreLinq for any reason.
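A minimal sketch of that manual loop, assuming for illustration a Dictionary<string, int> (adjust the types to whatever you actually have):

// Single pass: remember the pair with the smallest value, then return its key.
bool found = false;
var best = default(KeyValuePair<string, int>);
foreach (var pair in dictionary)
{
    if (!found || pair.Value < best.Value)
    {
        best = pair;
        found = true;
    }
}
if (!found)
    throw new InvalidOperationException("Dictionary is empty");
var minKey = best.Key;   // the key whose value is smallest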

How does LINQ's OrderBy jive with MoveNext?

This thread says that LINQ's OrderBy uses Quicksort. I'm struggling with how that makes sense, given that OrderBy returns an IEnumerable.
Let's take the following piece of code for example.
int[] arr = new int[] { 1, -1, 0, 60, -1032, 9, 1 };
var ordered = arr.OrderBy(i => i);
foreach(int i in ordered)
Console.WriteLine(i);
The loop is the equivalent of
var mover = ordered.GetEnumerator();
while(mover.MoveNext())
Console.WriteLine(mover.Current);
The MoveNext() returns the next smallest element. The way that LINQ works, unless you "cash out" of the query by using ToList() or similar, there are not supposed to be any intermediate lists created, so each time you call MoveNext() the IEnumerator finds the next smallest element. That doesn't make sense, because during the execution of Quicksort there is no concept of a current smallest and next smallest element.
Where is the flaw in my thinking here?
the way that LINQ works, unless you "cash out" of the query by using ToList() or similar, there are not supposed to be any intermediate lists created
This statement is false. The flaw in your thinking is that you believe a false statement.
The LINQ to Objects implementation is smart about deferring work when possible at a reasonable cost. As you correctly note, it is not possible in the case of sorting. OrderBy produces as its result an object which, when MoveNext is called, enumerates the entire source sequence, generates the sorted list in memory and then enumerates the sorted list.
Similarly, joining and grouping also must enumerate the whole sequence before the first element is enumerated. (Logically, a join is just a cross product with a filter, and the work could be spread out over each MoveNext() but that would be inefficient; for practicality, a lookup table is built. It is educational to work out the asymptotic space vs time tradeoff; give it a shot.)
The source code is available; I encourage you to read it if you have questions about the implementation. Or check out Jon's "edulinq" series.
There's a great answer already, but to add a few things:
Enumerating the results of OrderBy() obviously can't yield an element until it has processed all elements, because not until it has seen the last input element can it know that that last element isn't the first it must yield. It also must work on sources that can't be repeated or that will give different results each time. As such, even if some sort of zeal meant the developers wanted to find the nth element anew each cycle, buffering is a logical requirement.
The quicksort is lazy in two regards though. One is that rather than sort the elements to return based on the keys from the delegate passed to the method, it sorts a mapping:
Buffer all the elements.
Get the keys. Note that this means the delegate is run only once per element. Among other things it means that non-pure keyselectors won't cause problems.
Get a map of numbers from 0 to n.
Sort the map.
Enumerate through the map, yielding the associated element each time.
So there is a sort of laziness in the final sorting of elements. This is significant in cases where moving elements is expensive (large value types).
There is of course also laziness in that none of the above is done until after the first attempt to enumerate, so until you call MoveNext() the first time, it won't have happened.
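Put together, the shape is roughly this (illustrative only, not the framework's code; it's written as an iterator so nothing runs until the first MoveNext(), and note the real implementation uses a stable sort, which Array.Sort is not):

using System;
using System.Collections.Generic;
using System.Linq;

static class OrderBySketch
{
    public static IEnumerable<TSource> OrderByLike<TSource, TKey>(
        this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
    {
        TSource[] buffer = source.ToArray();      // 1. buffer all the elements
        TKey[] keys = new TKey[buffer.Length];    // 2. run the key selector once per element
        for (int i = 0; i < buffer.Length; i++)
            keys[i] = keySelector(buffer[i]);

        int[] map = new int[buffer.Length];       // 3. a map of numbers 0..n-1
        for (int i = 0; i < map.Length; i++)
            map[i] = i;

        Array.Sort(map, (a, b) =>                 // 4. sort the map, not the elements
            Comparer<TKey>.Default.Compare(keys[a], keys[b]));

        foreach (int i in map)                    // 5. yield the associated element each time
            yield return buffer[i];
    }
}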
In .NET Core there is further laziness building on that, depending on what you then do with the results of OrderBy. Since OrderBy contains information about how to sort rather than the sorted buffer, the class returned by OrderBy can do something else with that information other than quicksorting:
The most obvious is ThenBy, which all implementations do. When you call ThenBy or ThenByDescending you get a new, similar class with different information about how to sort, and the sort the OrderBy result could have done probably never happens.
First() and Last() don't need to sort at all. Logically source.OrderBy(del).First() is a variant of source.Min() where del contains the information to determine what defines "less than" for that Min(). Therefore if you call First() on the results of an OrderBy() that's exactly what is done. The laziness of OrderBy allows it to do this instead of quicksort. (Which means O(n) time complexity and O(1) space complexity instead of O(n log n) and O(n) respectively).
Skip() and Take() define a subsequence of a sequence, which with OrderBy must conceptually happen after the sort. But since they are lazy too, what can be returned is an object that knows how to sort, how many to skip, and how many to take. As such a partial quicksort can be used so that the source need only be partially sorted: if a partition falls entirely outside of the range that will be returned, there's no point sorting it.
ElementAt() places more of a burden than First() or Last() but again doesn't require a full quicksort. Quickselect can be used to find just one result; if you're looking for the 3rd element and you've partitioned a set of 200 elements around the 90th element then you only need to look further in the first partition and can ignore the second partition from now on. Best-case and average-case time complexity is O(n).
The above can be combined, so e.g. .Skip(10).First() is equivalent to ElementAt(10) and can be treated as such.
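To make the quickselect point above concrete, here is a minimal version over an int array (illustrative only, not what the framework ships; note it partitions the array in place):

using System;

static class QuickSelectSketch
{
    // Returns the value that would sit at index k if the array were fully sorted.
    // Average O(n): only the partition containing k is ever examined further.
    public static int QuickSelect(int[] items, int k)
    {
        var rng = new Random();
        int lo = 0, hi = items.Length - 1;
        while (true)
        {
            if (lo == hi)
                return items[lo];

            // Lomuto partition around a randomly chosen pivot.
            int pivotIndex = lo + rng.Next(hi - lo + 1);
            int pivot = items[pivotIndex];
            Swap(items, pivotIndex, hi);
            int store = lo;
            for (int i = lo; i < hi; i++)
            {
                if (items[i] < pivot)
                {
                    Swap(items, i, store);
                    store++;
                }
            }
            Swap(items, store, hi);

            if (k == store)
                return items[store];
            if (k < store)
                hi = store - 1;   // the right partition can be ignored from now on
            else
                lo = store + 1;   // the left partition can be ignored from now on
        }
    }

    static void Swap(int[] a, int i, int j)
    {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}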
All of these exceptions to getting the entire buffer and sorting it all have one thing in common: They were all implemented after identifying a way in which the correct result can be returned after making the computer do less work*. That new [] {1, 2, 3, 4}.Where(i => i % 2 == 0) will yield the 2 before it has seen the 4 (or even the 3 it won't yield) comes from the same general principle. It just comes at it more easily (though there are still specialised variants of Where() results behind the scenes to provide other optimisations).
But note that Enumerable.Range(1, 10000).Where(i => i >= 10000) scans through 9999 elements to yield that first. Really it's not all that different to OrderBy's buffering; they're both bringing you the next result as quickly as they can†, and what differs is just what that means.
*And also identifying that the effort to detect and make use of the features of a particular case is worth it. E.g. many aggregate calls like Sum() can be optimised on the results of OrderBy by skipping the ordering completely. But this can generally be realised by the caller, who can just leave out the OrderBy, so while adding that optimisation would make most calls to Sum() slightly slower in order to make that case much faster, the case that benefits shouldn't really be happening anyway.
†Well, pretty much as quickly. It would be possible to get the first results back more quickly than OrderBy does (when you've got the left-most part of a sequence sorted, start giving out results), but that comes at a cost that would affect the later results, so the trade-off isn't necessarily that doing so would be better.

What .NET collection provides the fastest search

I have 60k items that need to be checked against a 20k lookup list. Is there a collection object (like List, Hashtable) that provides an exceptionally fast Contains() method? Or will I have to write my own? In other words, does the default Contains() method just scan each item or does it use a better search algorithm?
foreach (Record item in LargeCollection)
{
if (LookupCollection.Contains(item.Key))
{
// Do something
}
}
Note. The lookup list is already sorted.
In the most general case, consider System.Collections.Generic.HashSet as your default "Contains" workhorse data structure, because it takes constant time to evaluate Contains.
The actual answer to "What is the fastest searchable collection" depends on your specific data size, ordered-ness, cost-of-hashing, and search frequency.
If you don't need ordering, try HashSet<Record> (new to .Net 3.5)
If you do, use a List<Record> and call BinarySearch.
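Building on the loop in the question, a minimal sketch (assuming the lookup items are the keys themselves, e.g. strings, and that Record.Key is the same type):

// Build the set once: O(n). Each Contains is then amortised O(1).
var lookup = new HashSet<string>(LookupCollection);

foreach (Record item in LargeCollection)
{
    if (lookup.Contains(item.Key))
    {
        // Do something
    }
}

If you need the lookup collection sorted for other reasons anyway, sorting it once and calling BinarySearch per item (checking for a non-negative return value) is the O(log n) alternative.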
Have you considered List.BinarySearch(item)?
You said that your lookup list is already sorted, so this seems like the perfect opportunity. A hash would definitely be the fastest, but it brings its own problems and requires a lot more storage overhead.
You should read this blog that speed tested several different types of collections and methods for each using both single and multi-threaded techniques.
According to the results, BinarySearch on a List and SortedList were the top performers, consistently running neck-and-neck when looking up something as a "value".
When using a collection that allows "keys", the Dictionary, ConcurrentDictionary, HashSet, and Hashtable performed the best overall.
I've put a test together:
First - 3 chars with all of the possible combinations of A-Z0-9
Fill each of the collections mentioned here with those strings
Finally - search and time each collection for a random string (same string for each collection).
This test simulates a lookup when there is guaranteed to be a result.
Then I changed the initial collection from all possible combinations to only 10,000 random 3-character combinations. This should give about a 1 in 4.6 hit rate for a random 3-char lookup, so it is a test where a result isn't guaranteed, and I ran the test again.
IMHO Hashtable, although fastest, isn't always the most convenient when working with objects; a HashSet is so close behind that it's probably the one to recommend.
Just for fun (you know, FUN) I ran it with 1.68M rows (4 characters).
Keep both lists x and y in sorted order.
If x = y, do your action; if x < y, advance x; if y < x, advance y; continue until either list is empty.
The running time of this intersection is proportional to min(size(x), size(y)).
Don't run a .Contains() loop; that is proportional to size(x) * size(y), which is much worse.
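A sketch of that merge-style walk over two sorted lists of strings (the names and the delegate are mine, just for illustration):

using System;
using System.Collections.Generic;

static class SortedIntersectSketch
{
    // Both lists must already be sorted with the same ordinal comparison.
    // Each list is walked at most once.
    public static void IntersectSorted(List<string> x, List<string> y, Action<string> onMatch)
    {
        int i = 0, j = 0;
        while (i < x.Count && j < y.Count)
        {
            int cmp = string.CompareOrdinal(x[i], y[j]);
            if (cmp == 0)
            {
                onMatch(x[i]);   // x = y: do your action
                i++;
                j++;
            }
            else if (cmp < 0)
            {
                i++;             // x < y: advance x
            }
            else
            {
                j++;             // y < x: advance y
            }
        }
    }
}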
If it's possible to sort your items, then there is a much faster way to do this than doing key lookups into a hashtable or b-tree. Though if your items aren't sortable you can't really put them into a b-tree anyway.
Anyway, if they are sortable, sort both lists and then it's just a matter of walking the lookup list in order.
Walk lookup list
While items in check list <= lookup list item
if check list item = lookup list item do something
Move to next lookup list item
If you're using .Net 3.5, you can make cleaner code using:
foreach (Record item in LookupCollection.Intersect(LargeCollection))
{
//dostuff
}
I don't have .Net 3.5 here and so this is untested. It relies on an extension method. Note that LookupCollection.Intersect(LargeCollection) is probably not the same as LargeCollection.Intersect(LookupCollection) ... the latter is probably much slower.
This assumes LookupCollection is a HashSet
If you aren't worried about squeezing out every single last bit of performance, the suggestion to use a HashSet or binary search is solid. Your datasets just aren't large enough for this to be a problem 99% of the time.
But if this is just one of thousands of times you are going to do this, and performance is critical (and proven to be unacceptable using a HashSet or binary search), you could certainly write your own algorithm that walks the sorted lists, doing comparisons as you go. Each list would be walked at most once, and in the pathological cases it wouldn't be bad (once you went this route you'd probably find that the comparison, assuming it's a string or other non-integral value, would be the real expense, and that optimizing it would be the next step).

What are the benefits of List<T>.Find over alternatives?

Recently I used a predicate to describe search logic and passed it to the Find method of a few Lists.
foreach (IHiscoreBarItemView item in _view.HiscoreItems)
{
Predicate<Hiscore> matchOfHiscoreName =
(h) => h.Information.Name.Equals(item.HiscoreName);
var current = player.Hiscores.Find(matchOfHiscoreName);
item.GetLogicEngine().ForceSetHiscoreValue(current as Skill);
var goal = player.Goals.Find(matchOfHiscoreName);
item.GetLogicEngine().ForceSetGoalHiscoreValue(goal as Skill);
}
Are there any benefits, apart from 'less code', to using the aforementioned approach over an alternative?
I am particularly interested in performance.
Thanks
Benefit of Find over using LINQ: It's available in .NET 2.0
Benefit of LINQ over Find: consistency with other sequences; query expression syntax etc
Benefit of Find over BinarySearch: The list doesn't have to be sorted, and you only need equality comparison
Benefit of BinarySearch over Find: BinarySearch is O(log n); Find is O(n)
Benefit of Find over a foreach loop: compactness and not repeating yourself
Benefit of a foreach loop over Find: Any other custom processing you want to perform
Of these, only Find vs BinarySearch has any real performance difference. Of course, if you could change from a List<T> to a Dictionary<TKey,TValue> then finding elements would become amortized O(1) ...
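For example, if the hiscores and goals were keyed by name up front, the loop in the question could do dictionary lookups instead of Find. A sketch building on that code (ToDictionary is in System.Linq, and this assumes the names are unique):

// Build the lookups once, outside the loop: O(n) each.
var hiscoresByName = player.Hiscores.ToDictionary(h => h.Information.Name);
var goalsByName = player.Goals.ToDictionary(h => h.Information.Name);

foreach (IHiscoreBarItemView item in _view.HiscoreItems)
{
    hiscoresByName.TryGetValue(item.HiscoreName, out var current);
    item.GetLogicEngine().ForceSetHiscoreValue(current as Skill);

    goalsByName.TryGetValue(item.HiscoreName, out var goal);
    item.GetLogicEngine().ForceSetGoalHiscoreValue(goal as Skill);
}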
