This thread says that LINQ's OrderBy uses Quicksort. I'm struggling to see how that makes sense, given that OrderBy returns an IEnumerable.
Let's take the following piece of code for example.
int[] arr = new int[] { 1, -1, 0, 60, -1032, 9, 1 };
var ordered = arr.OrderBy(i => i);
foreach(int i in ordered)
Console.WriteLine(i);
The loop is the equivalent of
var mover = ordered.GetEnumerator();
while(mover.MoveNext())
Console.WriteLine(mover.Current);
The MoveNext() returns the next smallest element. The way that LINQ works, unless you "cash out" of the query by using ToList() or similar, there are not supposed to be any intermediate lists created, so each time you call MoveNext() the IEnumerator finds the next smallest element. That doesn't make sense, because during the execution of Quicksort there is no concept of a current smallest and next smallest element.
Where is the flaw in my thinking here?
the way that LINQ works, unless you "cash out" of the query by using ToList() or similar, there are not supposed to be any intermediate lists created
This statement is false. The flaw in your thinking is that you believe a false statement.
The LINQ to Objects implementation is smart about deferring work when possible at a reasonable cost. As you correctly note, it is not possible in the case of sorting. OrderBy produces as its result an object which, when MoveNext is called, enumerates the entire source sequence, generates the sorted list in memory and then enumerates the sorted list.
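A quick way to see this in action (a small illustrative snippet, not the library code) is to put a side effect into the source sequence: nothing is read when OrderBy is called, and the whole source is read before the first sorted item comes out.
var source = new[] { 3, 1, 2 }.Select(x =>
{
    Console.WriteLine($"read {x}");
    return x;
});
var ordered = source.OrderBy(x => x);   // deferred: nothing has been read or sorted yet
Console.WriteLine("about to enumerate");
foreach (var n in ordered)              // all three "read" lines print here first,
    Console.WriteLine(n);               // then 1, 2, 3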
Similarly, joining and grouping also must enumerate the whole sequence before the first element is enumerated. (Logically, a join is just a cross product with a filter, and the work could be spread out over each MoveNext() but that would be inefficient; for practicality, a lookup table is built. It is educational to work out the asymptotic space vs time tradeoff; give it a shot.)
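As a rough sketch of why a join behaves this way (hypothetical helper names, not the real Enumerable.Join source): the inner sequence is consumed up front to build a lookup, trading O(n) extra space for O(1) expected probes per outer element instead of the filtered cross product.
static IEnumerable<(TOuter, TInner)> JoinSketch<TOuter, TInner, TKey>(
    IEnumerable<TOuter> outer, IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector, Func<TInner, TKey> innerKeySelector)
{
    // Consume the inner sequence completely and index it by key.
    var lookup = inner.ToLookup(innerKeySelector);
    // Stream the outer sequence, probing the lookup for each element.
    foreach (var o in outer)
        foreach (var i in lookup[outerKeySelector(o)])
            yield return (o, i);
}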
The source code is available; I encourage you to read it if you have questions about the implementation. Or check out Jon's "edulinq" series.
There's a great answer already, but to add a few things:
Enumerating the results of OrderBy() obviously can't yield an element until it has processed all of the elements, because until it has seen the last input element it can't know that this last element isn't the first one it must yield. It also must work on sources that can't be repeated, or that will give different results each time. As such, even if the developers had for some reason wanted to find the nth element anew on each cycle, buffering is a logical requirement.
The quicksort is lazy in two regards though. One is that rather than sort the elements to return based on the keys from the delegate passed to the method, it sorts a mapping (there's a sketch of this after the steps below):
Buffer all the elements.
Get the keys. Note that this means the delegate is run only once per element. Among other things it means that non-pure key selectors won't cause problems.
Get a map of numbers from 0 to n − 1.
Sort the map.
Enumerate through the map, yielding the associated element each time.
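A condensed sketch of that scheme (hypothetical method and variable names, ignoring stability, which the real implementation guarantees, and using Array.Sort instead of its hand-rolled quicksort) might look like this:
static IEnumerable<TSource> OrderBySketch<TSource, TKey>(
    IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
{
    TSource[] buffer = source.ToArray();            // 1. buffer all the elements
    TKey[] keys = new TKey[buffer.Length];          // 2. run the key selector exactly once per element
    for (int i = 0; i < buffer.Length; i++)
        keys[i] = keySelector(buffer[i]);
    int[] map = Enumerable.Range(0, buffer.Length).ToArray();   // 3. a map of numbers 0..n-1
    Array.Sort(map, (x, y) => Comparer<TKey>.Default.Compare(keys[x], keys[y]));  // 4. sort the map, not the elements
    foreach (int index in map)                      // 5. yield the element each map entry points at
        yield return buffer[index];
}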
So there is a sort of laziness in the final sorting of elements. This is significant in cases where moving elements is expensive (large value types).
There is of course also laziness in that none of the above is done until after the first attempt to enumerate, so until you call MoveNext() the first time, it won't have happened.
In .NET Core there is further laziness building on that, depending on what you then do with the results of OrderBy. Since OrderBy contains information about how to sort rather than the sorted buffer, the class returned by OrderBy can do something else with that information other than quicksorting:
The most obvious is ThenBy, which all implementations do. When you call ThenBy or ThenByDescending you get a new similar class with different information about how to sort, and the sort that the bare OrderBy result could have done probably never happens.
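For example, chaining just layers more sort information onto the same deferred object; nothing is actually sorted until the result is enumerated:
string[] words = { "pear", "fig", "apple", "kiwi" };
var byLengthThenAlpha = words.OrderBy(w => w.Length).ThenBy(w => w);  // still deferred
foreach (var w in byLengthThenAlpha)
    Console.WriteLine(w);   // fig, kiwi, pear, apple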
First() and Last() don't need to sort at all. Logically source.OrderBy(del).First() is a variant of source.Min() where del contains the information to determine what defines "less than" for that Min(). Therefore if you call First() on the results of an OrderBy() that's exactly what is done. The laziness of OrderBy allows it to do this instead of quicksort. (Which means O(n) time complexity and O(1) space complexity instead of O(n log n) and O(n) respectively).
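A minimal sketch of that idea (an assumed helper, not the actual library code): track the smallest element seen so far, which is all that source.OrderBy(keySelector).First() needs.
static TSource FirstInOrder<TSource, TKey>(
    IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
{
    var comparer = Comparer<TKey>.Default;
    bool found = false;
    TSource best = default;
    TKey bestKey = default;
    foreach (TSource item in source)        // single pass: O(n) time, O(1) space
    {
        TKey key = keySelector(item);
        if (!found || comparer.Compare(key, bestKey) < 0)   // strict < keeps the first of equal keys,
        {                                                   // matching OrderBy's stable ordering
            best = item;
            bestKey = key;
            found = true;
        }
    }
    if (!found) throw new InvalidOperationException("Sequence contains no elements");
    return best;
}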
Skip() and Take() define a subsequence of a sequence, which with OrderBy must conceptually happen after that sort. But since they are lazy too, what can be returned is an object that knows how to sort, how many to skip, and how many to take. As such a partial quicksort can be used, so that the source need only be partially sorted: if a partition is outside of the range that will be returned then there's no point sorting it.
ElementAt() places more of a burden than First() or Last() but again doesn't require a full quicksort. Quickselect can be used to find just one result; if you're looking for the 3rd element and you've partitioned a set of 200 elements around the 90th element then you only need to look further in the first partition and can ignore the second partition from now on. Best-case and average-case time complexity is O(n).
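An illustrative quickselect (an assumed standalone helper, not what the library actually ships) shows the idea: partition around a pivot and only recurse into the side that can contain index k. It partially orders the array in place, just as the library would on its own buffered copy.
static T QuickSelect<T>(T[] items, int k) where T : IComparable<T>
{
    int left = 0, right = items.Length - 1;
    while (true)
    {
        if (left == right) return items[left];
        T pivot = items[(left + right) / 2];
        int i = left, j = right;
        while (i <= j)                              // Hoare-style partition around the pivot
        {
            while (items[i].CompareTo(pivot) < 0) i++;
            while (items[j].CompareTo(pivot) > 0) j--;
            if (i <= j) { (items[i], items[j]) = (items[j], items[i]); i++; j--; }
        }
        if (k <= j) right = j;                      // the k-th smallest lies in the left part
        else if (k >= i) left = i;                  // or in the right part
        else return items[k];                       // or it is already in its final position
    }
}
So QuickSelect(buffer, 2) finds the third-smallest element while leaving most of the buffer only partially sorted.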
The above can be combined, so e.g. .Skip(10).First() is equivalent to ElementAt(10) and can be treated as such.
All of these exceptions to getting the entire buffer and sorting it all have one thing in common: they were all implemented after identifying a way in which the correct result can be returned after making the computer do less work*. That new[] {1, 2, 3, 4}.Where(i => i % 2 == 0) will yield the 2 before it has seen the 4 (or even the 3 it won't yield) comes from the same general principle. It just comes at it more easily (though there are still specialised variants of Where() results behind the scenes to provide other optimisations).
But note that Enumerable.Range(1, 10000).Where(i => i >= 10000) scans through 9999 elements to yield that first. Really it's not all that different to OrderBy's buffering; they're both bringing you the next result as quickly as they can†, and what differs is just what that means.
*And also identifying that the effort to detect and make use of the features of a particular case is worth it. E.g. many aggregate calls like Sum() could be optimised on the results of OrderBy by skipping the ordering completely. But this can generally be realised by the caller, who can just leave out the OrderBy; adding that optimisation would make most calls to Sum() slightly slower in order to make much faster a case that shouldn't really be happening anyway.
†Well, pretty much as quickly. It would be possible to get the first results back more quickly than OrderBy does (once the left-most part of the sequence is sorted, start giving out results), but that comes at a cost that would affect the later results, so the trade-off isn't necessarily that doing that would be better.
I am looking at this code
var numbers = Enumerable.Range(0, 20);
var parallelResult = numbers.AsParallel().AsOrdered()
.Where(i => i % 2 == 0).AsSequential();
foreach (int i in parallelResult.Take(5))
Console.WriteLine(i);
The AsSequential() is supposed to make the resulting array sorted. Actually it is sorted after its execution, but if I remove the call to AsSequential(), it is still sorted (since AsOrdered() is called).
What is the difference between the two?
AsSequential is just meant to stop any further parallel execution - hence the name. I'm not sure where you got the idea that it's "supposed to make the resulting array sorted". The documentation is pretty clear:
Converts a ParallelQuery into an IEnumerable to force sequential evaluation of the query.
As you say, AsOrdered ensures ordering (for that particular sequence).
I know that this was asked over a year ago, but here are my two cents.
In the example shown, I think AsSequential is used so that the next query operator (in this case the Take operator) is executed sequentially.
However, the Take operator prevents a query from being parallelized unless the source elements are in their original indexing position, which is why the result is still sorted even when you remove the AsSequential operator.
I have a generic collection with 5000+ items in it. All items are unique, so I used SingleOrDefault to pull up an item from the collection. Today I used Red Gate ANTS profiler to look into the code and found that my SingleOrDefault call had 18 million hits for 5000 iterations (~3.5 sec), whereas when I changed it to FirstOrDefault it had 9 million hits (~1.5 sec).
I used SingleOrDefault because I know that all items in collection are unique.
Edit: The question is why FirstOrDefault is faster than SingleOrDefault even though this is exactly the scenario where we are supposed to use SingleOrDefault.
SingleOrDefault() raises an exception if there is more than one. In order to determine that, it must verify there are no more than one.
On the other hand, FirstOrDefault() can stop looking once it finds one. Therefore, I would expect it to be considerably faster in many cases.
SingleOrDefault(predicate) makes sure there is at most one item matching the given predicate, so even if it finds a matching item near the beginning of your collection, it still has to continue to the end of the IEnumerable.
FirstOrDefault(predicate) stops as soon as it finds a matching item in the collection. If your "first matches" are uniformly distributed throughout your IEnumerable, then you will, on average, have to go through half of the IEnumerable.
For a sequence of N items, SingleOrDefault will run your predicate N times, and FirstOrDefault will run your predicate (on average) N/2 times. This explains why you see SingleOrDefault has twice as many "hits" as FirstOrDefault.
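Simplified sketches of the two predicate overloads (not the actual BCL source) make the difference in scan length obvious:
static T FirstOrDefaultSketch<T>(IEnumerable<T> source, Func<T, bool> predicate)
{
    foreach (T item in source)
        if (predicate(item)) return item;   // stop at the first match
    return default;
}

static T SingleOrDefaultSketch<T>(IEnumerable<T> source, Func<T, bool> predicate)
{
    T found = default;
    bool any = false;
    foreach (T item in source)              // must keep scanning to rule out a second match
    {
        if (!predicate(item)) continue;
        if (any) throw new InvalidOperationException("Sequence contains more than one matching element");
        found = item;
        any = true;
    }
    return found;
}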
If you know you'll only ever have a single matching item because the source of your collection is controlled by you and your system, then you're probably better off using FirstOrDefault. If your collection is coming from a user for example, then it could make sense to use SingleOrDefault as a check on the user's input.
I doubt very seriously that the choice between SingleOrDefault and FirstOrDefault will be your bottleneck. I think profiling tools will hopefully highlight much larger fish to fry. Your own metrics reveal that this amounts to an almost indiscernible unit of time for any given iteration.
But I recommend using the one that matches your expectation. Namely, is having more than one that matches a predicate an error? If it is, use the method that enforces that expectation. SingleOrDefault. (Similarly, if having none is also an error, simply use Single.) If it is not an error for more than one, feel free to use the First variants, instead.
Now it should become obvious why one could be marginally faster than the other, as other answers discuss. One is enforcing a constraint, which of course is accomplished by executing logic. The other isn't enforcing that particular constraint and is thus not delayed by it.
FirstOrDefault will return on the first hit. SingleOrDefault will not return on the first hit but will also look at all the other elements to check that it's unique. So FirstOrDefault will be faster in most cases. If you don't need the uniqueness check, take FirstOrDefault.
I've run tests using LinqPad which indicate that queries using Single, and SingleOrDefault are faster than queries using First or FirstOrDefault. These tests were on rather simple queries of large datasets (no joins involved). I did not expect this to be the result, in fact I was trying to prove to another developer that we should be using First and FirstOrDefault, but my foundation for my argument died when the proof indicated Single was actually faster. There may be cases where First is faster, but don't assume it is the blanket case.
I've been making a lot of use of LINQ queries in the application I'm currently writing, and one of the situations that I keep running into is having to convert the LINQ query results into lists for further processing (I have my reasons for wanting lists).
I'd like to have a better understanding of what happens in this list conversion in case there are inefficiencies, since I've used it repeatedly now. So, given I execute a line like this:
var matches = (from x in list1 join y in list2 on x equals y select x).ToList();
Questions:
Is there any overhead here aside from the creation of a new list and its population with references to the elements in the Enumerable returned from the query?
Would you consider this inefficient?
Is there a way to get the LINQ query to directly generate a list to avoid the need for a conversion in this circumstance?
Well, it creates a copy of the data. That could be inefficient - but it depends on what's going on. If you need a List<T> at the end, List<T> is usually going to be close to as efficient as you'll get. The one exception to that is if you're going to just do a conversion and the source is already a list - then using ConvertAll will be more efficient, as it can create the backing array of the right size to start with.
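For example (assuming you only need a simple projection and already hold a List<T>), ConvertAll sizes the destination up front:
List<int> ids = new List<int> { 1, 2, 3 };
List<string> names = ids.ConvertAll(id => "user" + id);   // backing array allocated at the right size
// versus ids.Select(id => "user" + id).ToList(), which may have to grow as it goes.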
If you only need to stream the data - e.g. you're just going to do a foreach on it, and taking actions which don't affect the original data sources - then calling ToList is definitely a potential source of inefficiency. It will force the whole of list1 to be evaluated - and if that's a lazily-evaluated sequence (e.g. "the first 1,000,000 values from a random number generator") then that's not good. Note that as you're doing a join, list2 will be evaluated anyway as soon as you try to pull the first value from the sequence (whether that's in order to populate a list or not).
You might want to read my Edulinq post on ToList to see what's going on - at least in one possible implementation - in the background.
There is no other overhead except the ones you already mentioned.
I would say yes, but it depends on the concrete application scenario. In general, it's better to avoid additional calls (I think this is obvious).
I'm afraid not. A LINQ query returns a sequence of data, which could potentially be an infinite sequence. Converting it to a List<T> makes it finite, and also gives you the possibility of index access, which is not possible with a sequence or stream.
Suggestion: avoid situations where you need the List<T>. If you do need it, push into it only as much data as you need at that moment.
Hope this helps.
In addition to what has been said, if the initial two lists that you're joining were already quite large, creating a third (creating an "intersection" of the two) could cause out of memory errors. If you just iterate the result of the LINQ statement, you'll reduce the memory usage dramatically.
Most of the overhead happens before the list creation: the connection to the database, getting the data into an adapter, and .NET deciding the data type/structure for the var...
Efficiency is a very relative term. For a programmer who isn't strong in SQL, LINQ is efficient: development is faster (relative to old ADO), at the cost of the overheads detailed above.
On the other hand, LINQ can call procedures in the database itself, which is already faster.
I suggest the following test:
Run your program on the maximal amount of data and measure the time.
Use a database procedure to export the data to a file (XML, CSV, ...), build your list from that file, and measure the time.
Then you can see whether the difference is significant.
The second way is less efficient for the programmer, but can reduce the run time.
Enumerable.ToList(source) is essentially just a call to new List<TSource>(source).
This constructor will test whether source is an ICollection<T>, and if it is allocate an array of the appropriate size. In other cases, i.e. most cases where the source is a LINQ query, it will allocate an array with the default initial capacity (four items) and grow it by doubling the capacity as needed. Each time the capacity doubles, a new array is allocated and the old one is copied over into the new one.
This may introduce some overhead in cases where your list will have a lot of items (we're probably talking thousands at least). The overhead can be significant as soon as the list grows over 85 KB, as it is then allocated on the Large Object Heap, which is not compacted and may suffer from memory fragmentation. Note that I'm referring to the array in the list. If T is a reference type, that array contains only references, not the actual objects. Those objects then don't count for the 85 KB limitation.
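You can watch the growth pattern directly (a tiny demo; the exact numbers may differ between runtimes, but the doubling is the point):
var list = new List<int>();                 // capacity 0 until the first Add
for (int i = 0; i < 40; i++)
{
    list.Add(i);
    Console.WriteLine($"Count={list.Count,2} Capacity={list.Capacity}");  // 4, 8, 16, 32, 64...
}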
You could remove some of this overhead if you can accurately estimate the size of your sequence (where it is better to overestimate a little bit than it is to underestimate a little bit). For example, if you are only running a .Select() operator on something that implements ICollection<T>, you know the size of the output list.
In such cases, this extension method would reduce this overhead:
public static List<T> ToList<T>(this IEnumerable<T> source, int initialCapacity)
{
    // parameter validation omitted for brevity
var result = new List<T>(initialCapacity);
foreach (T item in source)
{
result.Add(item);
}
return result;
}
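Example use, given that the source count is known up front, so the result list never has to regrow:
List<int> numbers = Enumerable.Range(0, 10000).ToList();
List<int> doubled = numbers.Select(n => n * 2).ToList(numbers.Count);   // uses the extension above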
In some cases, the list you create is just going to replace a list that was already there, e.g. from a previous run. In those cases, you can avoid quite a few memory allocations if you reuse the old list. That would only work if you don't have concurrent access to that old list though, and I wouldn't do it if new lists will typically be significantly smaller than old lists. If that's the case, you can use this extension method:
public static void CopyToList<T>(this IEnumerable<T> source, List<T> destination)
{
    // parameter validation omitted for brevity
destination.Clear();
foreach (T item in source)
{
destination.Add(item);
}
}
That being said, would I consider .ToList() inefficient? No, not if you have the memory and you're going to use the list repeatedly, either for a lot of random indexing into it or for iterating over it multiple times.
Now back to your specific example:
var matches = (from x in list1 join y in list2 on x equals y select x).ToList();
It may be more efficient to do this in some other way, for example:
var matches = list1.Intersect(list2).ToList();
which would yield the same results if list1 and list2 don't contain duplicates, and is very efficient if list2 is small.
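Roughly speaking (a sketch of the idea rather than the exact BCL source), Intersect hashes the second sequence once and then streams the first against it:
static IEnumerable<T> IntersectSketch<T>(IEnumerable<T> first, IEnumerable<T> second)
{
    var set = new HashSet<T>(second);   // O(|second|) extra space, built up front
    foreach (T item in first)
        if (set.Remove(item))           // Remove also de-duplicates the output
            yield return item;
}
That is also why it's particularly cheap when list2 is small: only list2 ends up held in the set.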
The only way to really know though, as usual, is to measure using typical workloads.
I know that this probably is micro-optimization, but still I wonder if there is any difference in using
var lastObject = myList.OrderBy(item => item.Created).Last();
or
var lastObject = myList.OrderByDescending(item => item.Created).First();
I am looking for answers for Linq to objects and Linq to Entities.
Assuming that both ways of sorting take equal time (and that's a big 'if'), the first method would have the extra cost of doing a .Last(), potentially requiring a full enumeration.
And that argument probably holds even stronger for an SQL oriented LINQ.
(my answer is about Linq to Objects, not Linq to Entities)
I don't think there's a big difference between the two instructions, this is clearly a case of micro-optimization. In both cases, the collection needs to be sorted, which usually means a complexity of O(n log n). But you can easily get the same result with a complexity of O(n), by enumerating the collection and keeping track of the min or max value. Jon Skeet provides an implementation in his MoreLinq project, in the form of a MaxBy extension method:
var lastObject = myList.MaxBy(item => item.Created);
I'm sorry this doesn't directly answer your question, but...
Why not do a better optimization and use Jon Skeet's implementations of MaxBy or MinBy?
That will be O(n) as opposed to O(n log n) in both of the alternatives you presented.
In both cases it depends somewhat on your underlying collections. If you have knowledge up front about how the collections look before the order and select you could choose one over the other. For example, if you know the list is usually in an ascending (or mostly ascending) sorted order you could prefer the first choice. Or if you know you have indexes on the SQL tables that are sorted ascending. Although the SQL optimizer can probably deal with that anyway.
In a general case they are equivalent statements. You were right when you said it's micro-optimization.
Assuming OrderBy and OrderByDescending average the same performance, taking the first element would perform better than taking the last when the number of elements is large.
Just my two cents: since OrderBy and OrderByDescending have to iterate over all the objects anyway, there should be no difference. However, if it were me, I would probably just loop through all the items in a foreach with a comparison to hold the highest item so far, which would be an O(n) search instead of whatever the order of magnitude of the sorting is.
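That single-pass approach looks something like this (Doc is a hypothetical element type standing in for whatever myList holds):
class Doc { public DateTime Created; }

static Doc Newest(IEnumerable<Doc> myList)
{
    Doc newest = null;
    foreach (var item in myList)
        if (newest == null || item.Created > newest.Created)
            newest = item;              // keep the latest Created seen so far: O(n), no sort
    return newest;
}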
I have 60k items that need to be checked against a 20k lookup list. Is there a collection object (like List, Hashtable) that provides an exceptionally fast Contains() method? Or will I have to write my own? In other words, does the default Contains() method just scan each item, or does it use a better search algorithm?
foreach (Record item in LargeCollection)
{
if (LookupCollection.Contains(item.Key))
{
// Do something
}
}
Note: the lookup list is already sorted.
In the most general case, consider System.Collections.Generic.HashSet as your default "Contains" workhorse data structure, because it takes constant time to evaluate Contains.
The actual answer to "What is the fastest searchable collection" depends on your specific data size, ordered-ness, cost-of-hashing, and search frequency.
If you don't need ordering, try HashSet<Record> (new to .Net 3.5)
If you do, use a List<Record> and call BinarySearch.
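A minimal illustration of both options (the string keys here are just a stand-in for the real lookup data):
var lookupKeys = new List<string> { "A1", "B2", "C3", "D4" };   // stand-in for the 20k lookup list

var hashed = new HashSet<string>(lookupKeys);
Console.WriteLine(hashed.Contains("C3"));                // O(1) expected per lookup

lookupKeys.Sort();                                       // BinarySearch requires sorted data
Console.WriteLine(lookupKeys.BinarySearch("C3") >= 0);   // O(log n) per lookup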
Have you considered List.BinarySearch(item)?
You said that your large collection is already sorted so this seems like the perfect opportunity? A hash would definitely be the fastest, but this brings about its own problems and requires a lot more overhead for storage.
You should read this blog that speed tested several different types of collections and methods for each using both single and multi-threaded techniques.
According to the results, a BinarySearch on a List and a SortedList were the top performers, consistently running neck and neck when looking up something as a "value".
When using a collection that allows for "keys", the Dictionary, ConcurrentDictionary, HashSet, and Hashtable performed the best overall.
I've put a test together:
First: generate 3-character strings with all of the possible combinations of A-Z and 0-9.
Fill each of the collections mentioned here with those strings.
Finally: search each collection for a random string and time it (the same string for each collection).
This test simulates a lookup when there is guaranteed to be a result.
Then I changed the initial collection from all possible combinations to only 10,000 random 3-character combinations; this should give roughly a 1-in-4.6 hit rate for a random 3-character lookup, so it's a test where there isn't guaranteed to be a result. I ran the test again.
IMHO Hashtable, although fastest, isn't always the most convenient when working with objects. But a HashSet is so close behind that it's probably the one to recommend.
Just for fun (you know, FUN) I ran with 1.68M rows (4 characters).
Keep both lists x and y in sorted order.
If x = y, do your action; if x < y, advance x; if y < x, advance y; continue until either list is empty.
The run time of this intersection is proportional to size(x) + size(y): each list is walked at most once.
Don't run a .Contains() loop; that is proportional to size(x) * size(y), which is much worse.
If it's possible to sort your items then there is a much faster way to do this than doing key lookups into a hashtable or b-tree. Though if your items aren't sortable you can't really put them into a b-tree anyway.
Anyway, if they are sortable, sort both lists and then it's just a matter of walking the two lists in lock step. In C#, assuming both are sorted Lists and the lookup list holds keys, the walk looks roughly like this:
int i = 0, j = 0;
while (i < LargeCollection.Count && j < LookupCollection.Count)
{
    int cmp = LargeCollection[i].Key.CompareTo(LookupCollection[j]);
    if (cmp == 0) { /* do something */ i++; }
    else if (cmp < 0) i++;
    else j++;
}
If you're using .Net 3.5, you can make cleaner code using:
foreach (Record item in LookupCollection.Intersect(LargeCollection))
{
//dostuff
}
I don't have .Net 3.5 here and so this is untested. It relies on an extension method. Note that LookupCollection.Intersect(LargeCollection) is probably not the same as LargeCollection.Intersect(LookupCollection) ... the latter is probably much slower.
This assumes LookupCollection is a HashSet
If you aren't worried about squeezing out every single last bit of performance, the suggestion to use a HashSet or binary search is solid. Your datasets just aren't large enough that this is going to be a problem 99% of the time.
But if this just one of thousands of times you are going to do this and performance is critical (and proven to be unacceptable using HashSet/binary search), you could certainly write your own algorithm that walked the sorted lists doing comparisons as you went. Each list would be walked at most once and in the pathological cases wouldn't be bad (once you went this route you'd probably find that the comparison, assuming it's a string or other non-integral value, would be the real expense and that optimizing that would be the next step).