I have a generic collection with 5000+ items in it. All items are unique, so I used SingleOrDefault to pull an item from the collection. Today I used the Red Gate ANTS profiler to look into the code and found that my SingleOrDefault call had 18 million hits for 5000 iterations (~3.5 sec), whereas when I changed it to FirstOrDefault it had 9 million hits (~1.5 sec).
I used SingleOrDefault because I know that all items in the collection are unique.
Edit: The question, then, is why FirstOrDefault is faster than SingleOrDefault even though this is exactly the scenario where we're supposed to use SingleOrDefault.
SingleOrDefault() raises an exception if there is more than one matching element. In order to determine that, it must verify that there is no more than one.
On the other hand, FirstOrDefault() can stop looking once it finds one. Therefore, I would expect it to be considerably faster in many cases.
SingleOrDefault(predicate) makes sure there is at most one item matching the given predicate, so even if it finds a matching item near the beginning of your collection, it still has to continue to the end of the IEnumerable.
FirstOrDefault(predicate) stops as soon as it finds a matching item in the collection. If your "first matches" are uniformly distributed throughout your IEnumerable, then you will, on average, have to go through half of the IEnumerable.
For a sequence of N items, SingleOrDefault will run your predicate N times, and FirstOrDefault will run your predicate (on average) N/2 times. This explains why you see SingleOrDefault has twice as many "hits" as FirstOrDefault.
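To make that concrete, here is a minimal sketch using a list of unique ints as a stand-in for the 5000-item collection (the counts in the comments assume LINQ to Objects):
var items = Enumerable.Range(0, 5000).ToList();

// Scans to the match, then keeps scanning to the end to enforce
// uniqueness: the predicate runs all N times.
var single = items.SingleOrDefault(x => x == 2500);

// Stops at the first match: the predicate runs ~N/2 times on average.
var first = items.FirstOrDefault(x => x == 2500);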
If you know you'll only ever have a single matching item because the source of your collection is controlled by you and your system, then you're probably better off using FirstOrDefault. If your collection is coming from a user for example, then it could make sense to use SingleOrDefault as a check on the user's input.
I doubt very seriously that the choice between SingleOrDefault and FirstOrDefault will be your bottleneck. I think profiling tools will hopefully highlight much larger fish to fry. Your own metrics reveal that this amounts to an almost indiscernible unit of time for any given iteration.
But I recommend using the one that matches your expectation. Namely: is having more than one match for a predicate an error? If it is, use the method that enforces that expectation, SingleOrDefault. (Similarly, if having none is also an error, simply use Single.) If more than one match is not an error, feel free to use the First variants instead.
Now it should become obvious why one could be marginally faster than the other, as other answers discuss. One is enforcing a constraint, which of course is accomplished by executing logic. The other isn't enforcing that particular constraint and is thus not delayed by it.
FirstOrDefault will return on the first hit. SingleOrDefault will not return on the first hit; it will also look at all the other elements to check that the match is unique. So FirstOrDefault will be faster in most cases. If you don't need the uniqueness check, take FirstOrDefault.
I've run tests using LINQPad which indicate that queries using Single and SingleOrDefault are faster than queries using First or FirstOrDefault. These tests were on rather simple queries of large datasets (no joins involved). I did not expect this result; in fact, I was trying to prove to another developer that we should be using First and FirstOrDefault, but the foundation of my argument died when the proof indicated Single was actually faster. There may be cases where First is faster, but don't assume that it always is.
This thread says that LINQ's OrderBy uses Quicksort. I'm struggling to see how that makes sense, given that OrderBy returns an IEnumerable.
Let's take the following piece of code for example.
int[] arr = new int[] { 1, -1, 0, 60, -1032, 9, 1 };
var ordered = arr.OrderBy(i => i);
foreach(int i in ordered)
Console.WriteLine(i);
The loop is the equivalent of
var mover = ordered.GetEnumerator();
while(mover.MoveNext())
Console.WriteLine(mover.Current);
The MoveNext() returns the next smallest element. The way LINQ works, unless you "cash out" of the query by using ToList() or similar, there are not supposed to be any intermediate lists created, so each time you call MoveNext() the IEnumerator finds the next smallest element. That doesn't make sense, because during the execution of Quicksort there is no concept of a current smallest and next smallest element.
Where is the flaw in my thinking here?
the way LINQ works, unless you "cash out" of the query by using ToList() or similar, there are not supposed to be any intermediate lists created
This statement is false. The flaw in your thinking is that you believe a false statement.
The LINQ to Objects implementation is smart about deferring work when possible at a reasonable cost. As you correctly note, it is not possible in the case of sorting. OrderBy produces as its result an object which, when MoveNext is called, enumerates the entire source sequence, generates the sorted list in memory and then enumerates the sorted list.
Similarly, joining and grouping also must enumerate the whole sequence before the first element is enumerated. (Logically, a join is just a cross product with a filter, and the work could be spread out over each MoveNext() but that would be inefficient; for practicality, a lookup table is built. It is educational to work out the asymptotic space vs time tradeoff; give it a shot.)
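For illustration, here is a rough sketch of that lookup-table idea (illustrative only, not the actual Enumerable.Join source; it assumes the usual System.Linq usings):
static IEnumerable<TResult> JoinSketch<TOuter, TInner, TKey, TResult>(
    IEnumerable<TOuter> outer, IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKey, Func<TInner, TKey> innerKey,
    Func<TOuter, TInner, TResult> resultSelector)
{
    // O(m) extra space for the lookup buys roughly O(n + m) time,
    // versus the O(n * m) time of a cross product with a filter.
    ILookup<TKey, TInner> lookup = inner.ToLookup(innerKey);
    foreach (var o in outer)
        foreach (var i in lookup[outerKey(o)]) // empty sequence when the key has no matches
            yield return resultSelector(o, i);
}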
The source code is available; I encourage you to read it if you have questions about the implementation. Or check out Jon's "edulinq" series.
There's a great answer already, but to add a few things:
Enumerating the results of OrderBy() obviously can't yield an element until it has processed all elements, because only once it has seen the last input element can it know that this last element isn't the one it must yield first. It also must work on sources that can't be repeated, or that will give different results each time. As such, even if some sort of zeal meant the developers wanted to find the nth element anew on each cycle, buffering is a logical requirement.
The quicksort is lazy in two regards, though. One is that rather than sorting the elements to return based on the keys from the delegate passed to the method, it sorts a mapping:
Buffer all the elements.
Get the keys. Note that this means the delegate is run only once per element; among other things, it means that non-pure key selectors won't cause problems.
Get a map of the numbers 0 to n − 1.
Sort the map.
Enumerate through the map, yielding the associated element each time.
So there is a sort of laziness in the final sorting of elements. This is significant in cases where moving elements is expensive (large value types).
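A rough sketch of that mapping approach (the names are illustrative, not the actual BCL code, and the real sort is stable where Array.Sort with a comparison is not):
static IEnumerable<TSource> OrderBySketch<TSource, TKey>(
    IEnumerable<TSource> source, Func<TSource, TKey> keySelector, IComparer<TKey> comparer)
{
    TSource[] buffer = source.ToArray();  // 1. buffer all the elements
    var keys = new TKey[buffer.Length];   // 2. get the keys; the selector runs once per element
    for (int i = 0; i < buffer.Length; i++)
        keys[i] = keySelector(buffer[i]);
    var map = new int[buffer.Length];     // 3. a map of the numbers 0 to n - 1
    for (int i = 0; i < map.Length; i++)
        map[i] = i;
    Array.Sort(map, (a, b) => comparer.Compare(keys[a], keys[b])); // 4. sort the map, not the elements
    foreach (int index in map)            // 5. yield the associated element each time
        yield return buffer[index];
}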
There is of course also laziness in that none of the above is done until after the first attempt to enumerate, so until you call MoveNext() the first time, it won't have happened.
In .NET Core there is further laziness building on that, depending on what you then do with the results of OrderBy. Since OrderBy contains information about how to sort rather than the sorted buffer, the class returned by OrderBy can do something else with that information other than quicksorting:
The most obvious is ThenBy, which all implementations do. When you call ThenBy or ThenByDescending you get a new, similar class with different information about how to sort, and the sort the original OrderBy result could have done will probably never happen.
First() and Last() don't need to sort at all. Logically, source.OrderBy(del).First() is a variant of source.Min() where del contains the information that determines what "less than" means for that Min(). Therefore, if you call First() on the results of an OrderBy(), that's exactly what is done. The laziness of OrderBy allows it to do this instead of a quicksort (which means O(n) time complexity and O(1) space complexity instead of O(n log n) and O(n) respectively); a sketch of this idea appears after these cases.
Skip() and Take() define a subsequence of a sequence, which with OrderBy must conceptually happen after the sort. But since they are lazy too, what can be returned is an object that knows how to sort, how many to skip, and how many to take. As such, a partial quicksort can be used so that the source need only be partially sorted: if a partition falls outside of the range that will be returned, there's no point sorting it.
ElementAt() places more of a burden than First() or Last() but again doesn't require a full quicksort. Quickselect can be used to find just one result; if you're looking for the 3rd element and you've partitioned a set of 200 elements around the 90th element then you only need to look further in the first partition and can ignore the second partition from now on. Best-case and average-case time complexity is O(n).
The above can be combined, so e.g. .Skip(10).First() is equivalent to ElementAt(10) and can be treated as such.
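Here is the sketch of the First()-as-Min() idea promised above: a single O(n) pass with the comparer instead of a sort (illustrative, not the actual implementation):
static TSource SortedFirst<TSource, TKey>(
    IEnumerable<TSource> source, Func<TSource, TKey> keySelector, IComparer<TKey> comparer)
{
    using (var e = source.GetEnumerator())
    {
        if (!e.MoveNext())
            throw new InvalidOperationException("Sequence contains no elements");
        TSource best = e.Current;
        TKey bestKey = keySelector(best);
        while (e.MoveNext())
        {
            TKey key = keySelector(e.Current);
            if (comparer.Compare(key, bestKey) < 0) // strictly less: keeps the first of equal keys,
            {                                       // matching OrderBy's stable ordering
                best = e.Current;
                bestKey = key;
            }
        }
        return best;
    }
}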
All of these exceptions to getting the entire buffer and sorting it all have one thing in common: they were all implemented after identifying a way in which the correct result can be returned after making the computer do less work*. That new[] { 1, 2, 3, 4 }.Where(i => i % 2 == 0) will yield the 2 before it has seen the 4 (or even the 3 it won't yield) comes from the same general principle; it just comes at it more easily (though there are still specialised variants of Where() results behind the scenes to provide other optimisations).
But note that Enumerable.Range(1, 10000).Where(i => i >= 10000) scans through 9999 elements to yield that first result. Really, it's not all that different from OrderBy's buffering; they're both bringing you the next result as quickly as they can†, and what differs is just what that means.
*And also identifying that the effort to detect and make use of the features of a particular case is worth it. E.g. many aggregate calls like Sum() could be optimised on the results of OrderBy by skipping the ordering completely. But this can generally be realised by the caller, who can just leave out the OrderBy; adding that optimisation would make most calls to Sum() slightly slower in order to make a case much faster that shouldn't really be happening anyway.
†Well, pretty much as quickly. It would be possible to get the first results back more quickly than OrderBy does (once you've got the leftmost part of a sequence sorted, start giving out results), but that comes at a cost that would affect the later results, so the trade-off isn't necessarily that doing so would be better.
What is the difference between these two Linq queries:
var result = ResultLists().Where( c=> c.code == "abc").FirstOrDefault();
// vs.
var result = ResultLists().FirstOrDefault( c => c.code == "abc");
Are the semantics exactly the same?
If they are semantically equal, does the predicate form of FirstOrDefault offer any theoretical or practical performance benefit over Where() plus plain FirstOrDefault()?
Either is fine.
They both run lazily: if the source list has a million items but the tenth item matches, then both will only iterate over 10 items from the source.
Performance should be almost identical and any difference would be totally insignificant.
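A quick sketch to verify that with a counting source (the counts in the comments assume LINQ to Objects):
int pulled = 0;
IEnumerable<int> Source()
{
    for (int i = 1; i <= 1_000_000; i++)
    {
        pulled++;          // count how many items the pipeline pulls
        yield return i;
    }
}

var a = Source().Where(c => c == 10).FirstOrDefault();
Console.WriteLine(pulled); // 10: Where streams, and FirstOrDefault stops at the match

pulled = 0;
var b = Source().FirstOrDefault(c => c == 10);
Console.WriteLine(pulled); // 10: the same number of items is pulled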
The second one. All other things being equal, the iterator in the second case can stop as soon as it finds a match, where the first one must find all that match, and then pick the first of those.
Nice discussion; all the above answers are correct.
I didn't run any performance tests, but based on my experience FirstOrDefault() is sometimes faster and better optimised than Where().FirstOrDefault().
I recently fixed a memory-overflow/performance issue in a "neural-network algorithm", and the fix was changing Where(x => ...).FirstOrDefault() to simply FirstOrDefault(x => ...). Until then, I had been ignoring the editor's recommendation to make exactly that change.
So I believe the correct answer to the above question is:
The second option is the best approach in all cases.
Where actually uses deferred execution: the evaluation of an expression is delayed until its realized value is actually required. It can greatly improve performance by avoiding unnecessary execution.
Where looks kind of like this, and returns a new IEnumerable<T>:
static IEnumerable<T> Where<T>(this IEnumerable<T> source, Func<T, bool> predicate)
{
    foreach (var item in source)
    {
        if (predicate(item))
        {
            yield return item;
        }
    }
}
FirstOrDefault() returns the first element, or default(T) when there is no result; it does not throw an exception.
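For example:
int n = new List<int>().FirstOrDefault();       // 0: default(int), no exception
string s = new List<string>().FirstOrDefault(); // null: default for a reference type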
This is an almost academic question but I'm curious as to its answer.
Suppose you have a loop that performs a routine replace on every row in a dataset. Let's say there's 10,000 such rows.
Is it more efficient to have something like this:
Row = Row.Replace('X', 'Y');
Or to check whether the row even contains the character that is to be replaced in the first place, like this:
if (Row.Contains('X')) Row = Row.Replace('X', 'Y');
Is there any difference in terms of efficiency? I realize that the difference might be very minor, but I'm interested in knowing if one way is better than the other, regardless of how much better it may be. Also, would your answer be different if the probability of finding the character to be replaced was 10% versus 90%?
Your check, Row.Contains('X'), is an O(n) operation, which means that it iterates over the entire string one character at a time to see if that character exists.
Row.Replace('X', 'Y') works exactly the same way; it checks every single character, one character at a time.
So, if you have that check in place, you iterate over the string potentially twice. If you just replace, you iterate over the string once.
You need to measure first on a realistic dataset, then decide which is higher performance. If your typical dataset doesn't often have anything, then having the Contains() call may be faster (because although Replace also iterates through all chars in the string, there will be an extra string object created and garbage collected due to the immutability of strings), but if "X" is often present, the check becomes a waste and actually slows things down.
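For example, a rough sketch of such a measurement (the hit rate and strings are made up; for serious numbers, use a proper benchmarking tool such as BenchmarkDotNet):
string[] MakeRows(double hitRate) =>
    Enumerable.Range(0, 10_000)
        .Select(i => i < 10_000 * hitRate ? "ABXCD" : "ABCDE") // rows below the cutoff contain 'X'
        .ToArray();

var rows = MakeRows(0.10);
var sw = System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < rows.Length; i++)
    rows[i] = rows[i].Replace('X', 'Y');     // unconditional replace
Console.WriteLine($"Replace only:       {sw.Elapsed}");

rows = MakeRows(0.10);
sw.Restart();
for (int i = 0; i < rows.Length; i++)
    if (rows[i].Contains('X'))
        rows[i] = rows[i].Replace('X', 'Y'); // guarded replace
Console.WriteLine($"Contains + Replace: {sw.Elapsed}");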
Also, this typically isn't the first place to look for and worry about performance problems. Things like chatty interfaces, network I/O, web services, databases, file I/O and GUI updates are going to hurt you orders of magnitude more than stuff like this.
If you were going to do stuff like this, and if Row came back from a database (as its name suggests), then getting the database to do the query might be another approach to save performance. E.g.
select MyTextColumn from MyTable where MyTextColumn like '%X%'
Then perform the replacement on all the results, because you know you only returned results where the replacement was needed.
This does introduce other concerns though - for example, in SQL Server, if the above example included an index on MyTextColumn, SQL Server won't be able to use that index because the like argument starts with a wildcard (it's not considered to be "sargable").
In summary, write for correctness, readability and maintenance first, then measure performance and make targeted improvements where they are found to be required.
The first option is faster. In order to check whether a character is present, it first has to be found. As there won't be any caching mechanism, why not replace it directly? Otherwise you'd be searching twice. If 'X' is present many times, you would basically be doubling the effort.
Don't forget that strings in C# are IMMUTABLE. That means they cannot change.
For it to replace anything it has to create a new string in memory, and copy the data across, then garbage collect the old string later on.
Using Contains() first will prevent needless creation, copying, and garbage collection of string data, and therefore perform faster.
When an IEnumerable needs both to be sorted and to have elements removed, are there advantages/drawbacks to performing the stages in a particular order? My performance tests appear to indicate that it's irrelevant.
A simplified (and somewhat contrived) example of what I mean is shown below:
public IEnumerable<DataItem> GetDataItems(int maximum, IComparer<DataItem> sortOrder)
{
    List<DataItem> result = this.GetDataItems().ToList();
    result.Sort(sortOrder);
    result.RemoveAll(item => !item.Display);
    return result.Take(maximum);
}
If your tests indicate it's irrelevant, then why worry about it? Don't optimize before you need to, only when it becomes a problem. If you find a problem with performance, have used a profiler, and have found that this method is the hotspot, then you can worry more about it.
On second thought, have you considered using LINQ? Those calls could be replaced with a call to Where and OrderBy, both of which are deferred, and then calling Take, like you have in your example. The LINQ libraries should find the best way of doing this for you, and if your data size expands to the point where it takes a noticeable amount of time to process, you can use PLINQ with a simple call to AsParallel.
You might as well RemoveAll before sorting so that you'll have fewer elements to sort.
I think the Sort() method would usually have a complexity of O(n log n), and RemoveAll() just O(n), so in general it is probably better to remove the items first.
You'd want something like this:
public IEnumerable<DataItem> GetDataItems(int maximum, IComparer<DataItem> sortOrder)
{
    IEnumerable<DataItem> result = this.GetDataItems();
    return result
        .Where(item => item.Display)
        .OrderBy(item => item, sortOrder)
        .Take(maximum);
}
There are two answers that are correct, but won't teach you anything:
It doesn't matter.
You should probably do RemoveAll first.
The first is correct because you said your performance tests showed it's irrelevant. The second is correct because it will have an effect on larger datasets.
There's a third answer that also isn't very useful: Sometimes it's faster to do removals afterwards.
Again, it doesn't actually tell you anything, but "sometimes" always means there is more to learn.
There's also only so much value in saying "profile first". What if profiling shows that 90% of the time is spent doing x.Foo(), which it does in a loop? Is the problem with Foo(), with the loop or with both? Obviously if we can make both more efficient we should, but how do we reason about that without knowledge outside of what a profiler tells us?
When something happens over multiple items (which is true of both RemoveAll and Sort) there are five things (I'm sure there are more I'm not thinking of now) that will affect the performance impact:
The per-set constant costs (both time and memory). How much it costs to do things like calling the function that we pass a collection to, etc. These are almost always negligible, but there could be some nasty high cost hidden there (often because of a mistake).
The per-item constant costs (both time and memory). How much it costs to do something that we do on some or all of the items. Because this happens multiple times, there can be an appreciable win in improving them.
The number of items. As a rule the more items, the more the performance impact. There are exceptions (next item), but unless those exceptions apply (and we need to consider the next item to know when this is the case), then this will be important.
The complexity of the operation. Again, this is a matter of both time-complexity and memory-complexity, but here the chances that we might choose to improve one at the cost of another. I'll talk about this more below.
The number of simultaneous operations. This can be a big difference between "works on my machine" and "works on the live system". If a super time-efficient approach that uses 0.5GB of memory is tested on a machine with 2GB of memory available, it'll work wonderfully, but when you move it to a machine with 8GB of memory available and have multiple concurrent users, it'll hit a bottleneck at 16 simultaneous operations, and suddenly what was beating other approaches in your performance measurements becomes the application's hotspot.
To talk about complexity a bit more: time complexity is a measure of how the time taken to do something relates to the number of items it is done with, while memory complexity is a measure of how the memory used relates to that same number of items. Obtaining an item from a dictionary is O(1), or constant, because it takes the same amount of time however large the dictionary is (not strictly true; strictly it "approaches" O(1), but it's close enough for most thinking). Finding something in an already-sorted list can be O(log₂ n), or logarithmic. Filtering through a list will be linear, or O(n). Sorting something using a quicksort (which is what Sort uses) tends to be linearithmic, or O(n log₂ n), but in its worst case (against an already-sorted list) it will be quadratic, O(n²).
Considering these, with a set of 8 items, an O(1) operation will take 1k seconds to do something, where k is a constant amount of time; O(log₂ n) means 3k seconds; O(n) means 8k; O(n log₂ n) means 24k; and O(n²) means 64k. These are the most commonly found, though there are plenty of others, like O(n·m), which is affected by two different sizes, or O(n!), which would be 40320k.
Obviously, we want as low a complexity as possible, though since k will be different in each case, sometimes the best solution for a small set has a high complexity (but low k constant) though a lower-complexity case will beat it with larger input.
So. Let's go back to the cases you are considering, viz filtering followed by sorting vs. sorting followed by filtering.
Per-set constants. Since we are moving two operations around but still doing both, this will be the same either way.
Per-item constants. Again, we're still doing the same things per item in either case, so no effect.
Number of items. Filtering reduces the number of items. Therefore the sooner we filter items, the more efficient the rest of the operation. Therefore doing RemoveAll first wins in this regard.
Complexity of the operation. It's either an O(n) filter followed by a sort that is O(n log₂ n) in the average case and O(n²) in the worst case, or that same sort followed by the O(n) filter. Same either way.
Number of simultaneous cases. Total memory pressure will be relieved the sooner we remove some items (a slight win for RemoveAll first).
So, we've got two reasons to consider RemoveAll first as likely to be more efficient and none to consider it likely to be less efficient.
We should not assume that we are 100% guaranteed to be correct here. For a start, we could simply have made a mistake in our reasoning. For another, there could be other factors we've dismissed as irrelevant that were actually pertinent. It is still true that we should profile before optimising, but reasoning about the sort of things mentioned above will both make us more likely to write performant code in the first place (which is not the same as optimising, but a matter of picking between options when readability, clarity and correctness are equal either way) and make it easier to find likely ways to improve those things that profiling has found to be troublesome.
For a slightly different but relevant case, consider if the criteria sorted on matched those removed on. E.g. if we were to sort by date and remove all items after a given date.
In this case, if the list deallocates on all removals, it'll still be O(n), but with a much smaller constant. Alternatively, if it just moved the "last-item" pointer*, it becomes O(1). Finding the pointer is O(log₂ n), so here there are both reasons to consider that filtering first will be faster (the reasons given above) and that sorting first will be faster (because the removal can be made a much faster operation than it was before). With this sort of case it becomes possible to tell only by extending our profiling. It is also true that the performance will be affected by the type of data sent, so we need to profile with realistic data rather than artificial test data, and we may even find that what was the more performant choice becomes the less performant choice months later when the dataset it is used on changes. Here the ability to reason becomes even more important, because we should note the possibility that changes in real-world use may alter which is better, and know that it is something we need to keep an eye on throughout the project's life.
(*Note, List<T> does not just move a last-item pointer for a RemoveRange that covers the last item, but another collection could.)
It would probably be better to do the RemoveAll first, although it would only make much of a difference if your sorting comparison were expensive to compute.
In an application I will have between about 3000 and 30000 strings.
After creation (the strings are read from files, unordered), strings will not be added often (but sometimes they WILL be!). Deletion of strings will also not happen often.
Comparing a string with the ones stored will occur frequently.
What kind of structure would serve best: a hashtable, a tree (Red-Black, Splay, ...), or just an ordered list (maybe a StringArray)?
(Additional remark : a link to a good C# implementation would be appreciated as well)
It sounds like you simply need a hashtable. HashSet<T> would thus seem to be the ideal choice. (You don't seem to require keys, but Dictionary<TKey, TValue> would be the right option if you did, of course.)
Here's a summary of the time complexities of the different operations on a HashSet<T> of size n. They're partially based on the fact that the type uses an array as the backing data structure.
Insertion: Typically O(1), but potentially O(n) if the array needs to be resized.
Deletion: O(1)
Exists (Contains): O(1) (given ideal hashtable buckets)
Someone correct me if any of these are wrong please. They are just my best guesses from what I know of the implementation/hashtables in general.
HashSet is very good for fast insertion and search speeds. Add, Remove and Contains are O(1).
Edit: Add assumes the array does not need to be resized. If it does, then as Noldorin has stated, it is O(n).
I used HashSet on a recent VB6-to-.NET 3.5 upgrade project (I didn't write the original) where I was iterating over a collection that had child items, and each child item could appear in more than one parent item. The application processed a list of items I wanted to send to an API that charges a lot of money per call.
I basically used the HashSet to keep track of the items I'd already sent, to prevent us incurring an unnecessary charge. As the process was invoked several times (it is basically a batch job with multiple commands), I serialized the HashSet between invocations. This worked very well; I had a requirement to reuse as much of the existing code as possible, as it had been thoroughly tested. The HashSet certainly performed very fast.
If you're looking for real-time performance or optimal memory efficiency I'd recommend a radix tree or explicit suffix or prefix tree. Otherwise I'd probably use a hash.
Trees have the advantage of having fixed bounds on worst case lookup, insertion and deletion times (based on the length of the pattern you're looking up). Hash based solutions have the advantage of being a whole lot easier to code (you get these out of the box in C#), cheaper to construct initially and if properly configured have similar average-case performance. However, they do tend to use more memory and have non-deterministic time lookups, insertions (and depending on the implementation possibly deletions).
The answers recommending HashSet<T> are spot on if your comparisons are just "is this string present in the set or not". You could even use different IEqualityComparer<string> implementations (probably choosing from the ones in StringComparer) for case-sensitivity etc.
Is this the only type of comparison you need, or do you need things like "where would this string appear in the set if it were actually an ordered list?" If you need that sort of check, then you'll probably want to do a binary search. (List<T> provides a BinarySearch method; I don't know why SortedList and SortedDictionary don't, as both would be able to search pretty easily. Admittedly a SortedDictionary search wouldn't be quite the same as a normal binary search, but it would still usually have similar characteristics I believe.)
As I say, if you only want "in the set or not" checking, the HashSet<T> is your friend. I just thought I'd bring up the rest in case :)
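A minimal sketch of both kinds of check (words is a placeholder for your strings read from files):
var words = new[] { "alpha", "bravo", "charlie" };

var set = new HashSet<string>(words, StringComparer.OrdinalIgnoreCase);
bool present = set.Contains("BRAVO");       // O(1) membership test

var sorted = new List<string>(words);
sorted.Sort(StringComparer.Ordinal);
int index = sorted.BinarySearch("bravo", StringComparer.Ordinal);
int position = index >= 0 ? index : ~index; // found at 'index', or ~index is the insertion point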
If you need to know "where would this string appear in the set if it were actually an ordered list" (as in Jon Skeet's answer), you could consider a trie. This solution can only be used for certain types of "string-like" data, and if the "alphabet" is large compared to the number of strings it can quickly lose its advantages. Cache locality could also be a problem.
This could be over-engineered for a set of only N = 30,000 items that is largely precomputed, however. You might even do better just allocating an array of k * N optional slots and filling it by skipping k spaces between each actual item (thus reducing the probability that your rare insertions will require reallocation, still leaving you with a variant of binary search, and keeping your items in sorted order). If you need a precise "where would this string appear in the set", though, this wouldn't work, because you would need O(n) time to examine each space before the item, checking whether it is blank, or O(n) time on insert to update a "how many items are really before me" counter in each slot. It could provide you with very fast imprecise indexes, though, and those indexes would be stable between insertions/deletions.