I am new to LINQ and would like to write some extension methods. Before doing so, I wanted to check that I'm doing it correctly, so I compared the performance of my CustomSelect extension method with the built-in Select extension method.
static void Main(string[] args)
{
List<int> list = new List<int>();
for (int i = 0; i < 10000000; i++)
list.Add(i);
DateTime now1 = DateTime.Now;
List<int> process1 = list.Select(i => i).ToList();
Console.WriteLine(DateTime.Now - now1);
DateTime now2 = DateTime.Now;
List<int> process2 = list.CustomSelect(i => i).ToList();
Console.WriteLine(DateTime.Now - now2);
}
public static IEnumerable<TResult> CustomSelect<TSource, TResult>(this IEnumerable<TSource> source, Func<TSource, TResult> selector)
{
foreach (TSource item in source)
{
yield return selector(item);
}
}
Timespan for built-in method: 0.18 sec
Timespan for custom method: 0.35 sec
Changing the order of processes yields the same result.
If I collect the elements in a list and return it instead of using yield return, the elapsed time is nearly the same as the built-in one. But as far as I know, we should use yield return wherever possible.
So what might the code for the built-in method look like? What should my approach be?
Thanks in advance
The key difference I can see is that the inbuilt method checks for List<T> and special-cases it, exploiting the custom List<T>.Enumerator implementation, rather than IEnumerable<T> / IEnumerator<T>. You can do that special-case yourself:
public static IEnumerable<TResult> CustomSelect<TSource, TResult>(
this IEnumerable<TSource> source, Func<TSource, TResult> selector)
{
if (source is List<TSource>)
return CustomSelectList((List<TSource>)source, selector);
return CustomSelectDefault(source, selector);
}
private static IEnumerable<TResult> CustomSelectList<TSource, TResult>(
List<TSource> source, Func<TSource, TResult> selector)
{
foreach (TSource item in source)
{
yield return selector(item);
}
}
private static IEnumerable<TResult> CustomSelectDefault<TSource, TResult>(
IEnumerable<TSource> source, Func<TSource, TResult> selector)
{
foreach (TSource item in source)
{
yield return selector(item);
}
}
You could take that a stage further by hand-rolling the entire iterator (which is what WhereSelectListIterator<TSource, TResult> does), but the above is probably close enough.
The inbuilt implementation also special-cases arrays, and handles various forms of composed queries.
There are a lot of things wrong with your performance test, which makes it inconclusive - you should look into best practices for benchmarking code in .NET. Use Stopwatch instead of DateTime.Now, run many repetitions of the same operation instead of a single shot at each, and make sure you're not being hindered by the GC (.ToList() is going to skew your measurements quite a bit).
yield return should not be used because it's faster - the idea is that it's easy to write, and it's lazy. If I did Take(10) on the yield return variant, only 10 elements would ever be computed. The return variant, on the other hand, will produce the whole list, return it, and only then reduce it to 10 elements.
In effect, you're taking pretty much the simplest case, where there's very little reason to use Select at all (apart from clarity). Enumerables are made to handle far crazier stuff, and the LINQ methods do it in an easy-to-understand, concise manner, exposing an interface familiar to functional programmers. That often means you could get more performance by rewriting the whole thing in a less general way - but the point is that you should only do that if necessary. If this is not a performance bottleneck of your application (and it rarely will be), the cleaner, easier-to-extend code is the better option.
Related
I'm doing a C# exercise to create an operation that takes a collection, performs a function on each object in the collection, and returns a collection of modified objects.
My code is currently as follows:
public static IEnumerable<U> Accumulate<T, U>(this IEnumerable<T> collection, Func<T, U> func)
{
IEnumerable<U> output = Enumerable.Empty<U>();
foreach (T item in collection)
{
output.Append(func(item));
}
return output;
}
This is only returning an empty collection, and I have no idea why.
I have tried creating a copy of the item in the foreach after seeing this approach in another thread, like so:
foreach (T item in collection)
{
U copy = func(item);
output.Append(copy);
}
but that didn't solve anything.
I did some research but couldn't really find any examples doing exactly what I'm trying to do here. I read some things about closure, but couldn't really understand it, as I'm new to C#.
To answer your actual question: The reason it isn't working is because
output.Append(func(item));
doesn't change output - instead, it returns a new sequence consisting of output with func(item) appended. Thus when you eventually return output, you are just returning the original, empty sequence.
You could make yours work by this simple change:
output = output.Append(func(item));
However, this is not an efficient approach - you're much better off using yield, by modifying your method as follows:
public static IEnumerable<U> Accumulate<T, U>(this IEnumerable<T> collection, Func<T, U> func)
{
foreach (T item in collection)
{
yield return func(item);
}
}
Although note that that is more simply expressed as:
public static IEnumerable<U> Accumulate<T, U>(this IEnumerable<T> collection, Func<T, U> func)
{
return collection.Select(item => func(item));
}
But it is useful to know about how to do this with yield so that you can write solutions to more complex Linq-like problems.
Usually, when I want to achieve this kind of behaviour, I make use of C# Iterators.
They are very useful when you want to iterate over some data and, at each step, produce a value that becomes part of the resulting sequence.
Take a look at the docs: MS Docs
To be more specific: will the LINQ extension method Any<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate) stop checking the remaining elements of the collection once the predicate has yielded true for an item?
Because I don't want to spend too much time figuring out whether I need to do the really expensive parts at all:
if(lotsOfItems.Any(x => x.ID == target.ID))
//do expensive calculation here
So if Any always checks all the items in the source, this might end up being a waste of time, instead of just going with:
var candidate = lotsOfItems.FirstOrDefault(x => x.ID == target.ID)
if(candidate != null)
//do expensive calculation here
because I'm pretty sure that FirstOrDefault returns as soon as it gets a result and only goes through the whole Enumerable if it does not find a suitable entry in the collection.
Does anyone have information about the internal workings of Any, or could anyone suggest a solution for this kind of decision?
Also, a colleague suggested something along the lines of:
if(!lotsOfItems.All(x => x.ID != target.ID))
since this is supposed to stop once the condition returns false for the first time, but I'm not sure about that, so if anyone could shed some light on this as well it would be appreciated.
As we see from the source code, Yes:
internal static bool Any<T>(this IEnumerable<T> source, Func<T, bool> predicate) {
foreach (T element in source) {
if (predicate(element)) {
return true; // Attention to this line
}
}
return false;
}
Any() is the most efficient way to determine whether any element of a sequence satisfies a condition with LINQ.
Also: a colleague suggested something along the lines of
if(!lotsOfItems.All(x => x.ID != target.ID)) since this is supposed to stop once the condition returns false for the first time, but I'm not sure on that, so if anyone could shed some light on this as well it would be appreciated
All() determines whether all elements of a sequence satisfy a condition. So, the enumeration of source is stopped as soon as the result can be determined.
Additional note:
The above is true if you are using LINQ to Objects. If you are using LINQ against a database, it will create a query and execute it against the database.
You could test it yourself: https://ideone.com/nIDKxr
public static IEnumerable<int> Tester()
{
yield return 1;
yield return 2;
throw new Exception();
}
static void Main(string[] args)
{
Console.WriteLine(Tester().Any(x => x == 1));
Console.WriteLine(Tester().Any(x => x == 2));
try
{
Console.WriteLine(Tester().Any(x => x == 3));
}
catch
{
Console.WriteLine("Error here");
}
}
Yes, it does :-)
Also: a colleague suggested something along the lines of
if(!lotsOfItems.All(x => x.ID != target.ID))
since this is supposed to stop once the condition returns false for the first time, but I'm not sure on that, so if anyone could shed some light on this as well it would be appreciated
Using the same reasoning, All() could have been written to continue even after one of the elements returns false :-) No, All() is implemented correctly too :-)
It does whatever is the quickest way of doing what it has to do.
When used on an IEnumerable<T>, this will be along the lines of:
foreach(var item in source)
if(predicate(item))
return true;
return false;
Or for the variant that doesn't take a predicate:
using(var en = source.GetEnumerator())
return en.MoveNext();
When run against a database it will be something like:
SELECT EXISTS(SELECT null FROM [some table] WHERE [some where clause])
And so on. How that was executed would depend in turn on what indices were available for fulfilling the WHERE clause, so it could be a quick index lookup, a full table scan aborting on first match found, or an index lookup followed by a partial table scan aborting on first match found, depending on that.
Yet other Linq providers would have yet other implementations, but generally the people responsible will be trying to be at least reasonably efficient.
In all, you can depend on it being at least slightly more efficient than calling FirstOrDefault, as FirstOrDefault uses similar approaches but has to return a full object (perhaps constructing it). Likewise, !All(inversePredicate) tends to be pretty much on a par with Any(predicate), as per this answer.
Single is an exception to this
Update: The following from this point on no longer applies to .NET Core, which has changed the implementation of Single.
It's important to note that in the case of LINQ to Objects, the overloads of Single and SingleOrDefault that take a predicate do not stop on identified failure. While the obvious approach to Single<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate) would be something like:
public static TSource Single<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
/* do null checks */
using(var en = source.GetEnumerator())
while(en.MoveNext())
{
var val = en.Current;
if(predicate(val))
{
while(en.MoveNext())
if(predicate(en.Current))
throw new InvalidOperationException("too many matching items");
return val;
}
}
throw new InvalidOperationException("no matching items");
}
The actual implementation is something like:
public static TSource Single<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
/* do null checks */
var result = default(TSource);
long tally = 0;
foreach(var item in source)
if(predicate(item))
{
result = item;
checked{++tally;}
}
switch(tally)
{
case 0:
throw new InvalidOperationException("no matching items");
case 1:
return result;
default:
throw new InvalidOperationException("too many matching items");
}
}
Now, while a successful Single will always have to scan everything, this means that an unsuccessful Single can be much, much slower than it needs to be (and can even potentially throw an undocumented error from the checked increment). And if the reason for the unexpected duplicate is a bug that duplicates items into the sequence - making it far larger than it should be - then the Single that should have helped you find that problem is now dragging its way through the entire thing.
SingleOrDefault has the same issue.
This only applies to LINQ to Objects, but it remains safer to do .Where(predicate).Single() rather than .Single(predicate).
Any stops at the first match. All stops at the first non-match.
I don't know whether the documentation guarantees that but this behavior is now effectively fixed for all time due to compatibility reasons. It also makes sense.
Yes it stops when the predicate is satisfied once. Here is code via RedGate Reflector:
[__DynamicallyInvokable]
public static bool Any<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
if (source == null)
{
throw Error.ArgumentNull("source");
}
if (predicate == null)
{
throw Error.ArgumentNull("predicate");
}
foreach (TSource local in source)
{
if (predicate(local))
{
return true;
}
}
return false;
}
Say I have ISet<T> _set = new HashSet<T>();
Now if I do: _set.Cast<TInterface>().Contains(obj, comparer); (where T implements TInterface), do I lose the O(1) benefit of the HashSet<T>?
In other words - does .Cast<T>()ing change the underlying type (HashSet<T> in this case) to something else, or is the underlying type preserved?
Logically, a HashSet<T> uses an internal hash-table based on the hashing logic of the comparer that it was created with, so of course it's not possible to do an element-containment test on it with a different comparer and expect O(1) performance.
That said, let's look at things in a bit more detail for your specific scenario:
The Cast<T> method looks like this (from reference-source):
public static IEnumerable<TResult> Cast<TResult>(this IEnumerable source) {
IEnumerable<TResult> typedSource = source as IEnumerable<TResult>;
if (typedSource != null) return typedSource;
if (source == null) throw Error.ArgumentNull("source");
return CastIterator<TResult>(source);
}
As you can see, if the source implements IEnumerable<TResult> it just returns the source directly. Since IEnumerable<> is a covariant interface, this test will pass for your use case (assuming the concrete type implements the interface type) and the hash-set will be returned directly - a good thing as there's still hope of its internal hash-table being used.
However, the overload of Contains you are using looks like this:
public static bool Contains<TSource>(this IEnumerable<TSource> source, TSource value, IEqualityComparer<TSource> comparer)
{
if (comparer == null) comparer = EqualityComparer<TSource>.Default;
if (source == null) throw Error.ArgumentNull("source");
foreach (TSource element in source)
if (comparer.Equals(element, value)) return true;
return false;
}
As you can see, it always loops through the collection performing a linear search, which is O(n).
So the entire operation is going to be O(n) regardless.
_set.Cast<TInterface>() will return an IEnumerable<TInterface>, so _set.Cast<TInterface>().Contains(obj, comparer); doesn't invoke HashSet<T>.Contains; rather, it invokes the Enumerable.Contains extension method.
So obviously you don't get an O(1) operation anymore.
If you need O(1), you need to build a new HashSet<T> (with the right comparer) out of it:
var newSet = new HashSet<TInterface>(_set.Cast<TInterface>(), comparer);
newSet.Contains(obj);
The Cast method returns an IEnumerable<T>, so the Contains method will operate on the IEnumerable<T> rather than the HashSet<T>. So I think you'd lose the benefit of the HashSet. Why don't you do the cast in the comparer instead?
I wanted to write an extension method (for using it in a fluent syntax) so that If a sequence is :
List<int> lst = new List<int>() { 1, 2, 3 };
I want to repeat it 3 times (for example), so the output would be 123123123.
I wrote this :
public static IEnumerable<TSource> MyRepeat<TSource>(this IEnumerable<TSource> source,int n)
{
return Enumerable.Repeat(source,n).SelectMany(f=>f);
}
And now I can do this :
lst.MyRepeat(3)
output: 123123123
Question :
Shouldn't I use yield in the extension method? I tried yield return, but it's not working here. Why is that, and should I use it?
edit
After Ant's answer I changed it to :
public static IEnumerable<TSource> MyRepeat<TSource>(this IEnumerable<TSource> source,int n)
{
var k=Enumerable.Repeat(source,n).SelectMany(f=>f);
foreach (var element in k)
{
yield return element;
}
}
But is there any difference ?
This is because the following already returns an IEnumerable:
Enumerable.Repeat(source,n).SelectMany(f=>f);
When you use the yield keyword, you specify what each iteration of the method will produce. So you are essentially saying "each iteration will yield an IEnumerable<TSource>," when actually each iteration of a method returning IEnumerable<TSource> should yield a TSource.
Hence your error - when you iterate over MyRepeat, each iteration is expected to produce a TSource, but because you are yielding an IEnumerable, you are trying to return a whole sequence from every iteration instead of a single element.
Your edit should work but is a little pointless - if you simply return the IEnumerable directly it won't be enumerated until you iterate over it (or call ToList or something). In your very first example, SelectMany (or one of its nested methods) will already be using yield, meaning the yield is already there, it's just implicit in your method.
Ant P's answer is of course correct.
You would use yield if you were building the enumerable that is returned yourself, rather than relying on SelectMany. eg:
public static IEnumerable<T> Repeat<T>(this IEnumerable<T> items, int repeat)
{
for (int i = 0; i < repeat; ++i)
foreach(T item in items)
yield return item;
}
The thing you yield is an element of the sequence. The code is instructions for producing the sequence of yielded elements.
So I recently found myself writing a loop similar to this one:
var headers = new Dictionary<string, string>();
...
foreach (var header in headers)
{
if (String.IsNullOrEmpty(header.Value)) continue;
...
}
Which works fine, it iterates through the dictionary once and does all I need it to do. However, my IDE is suggesting this as a more readable / optimized alternative, but I disagree:
var headers = new Dictionary<string, string>();
...
foreach (var header in headers.Where(header => !String.IsNullOrEmpty(header.Value)))
{
...
}
But won't that iterate through the dictionary twice? Once to evaluate the .Where(...) and then once for the foreach loop?
If not, and the second code example only iterates the dictionary once, please explain why and how.
The code with continue is about twice as fast.
I ran the following code in LINQPad, and the results consistently say that the clause with continue is twice as fast.
void Main()
{
var headers = Enumerable.Range(1,1000).ToDictionary(i => "K"+i,i=> i % 2 == 0 ? null : "V"+i);
var stopwatch = new Stopwatch();
var sb = new StringBuilder();
stopwatch.Start();
foreach (var header in headers.Where(header => !String.IsNullOrEmpty(header.Value)))
sb.Append(header);
stopwatch.Stop();
Console.WriteLine("Using LINQ : " + stopwatch.ElapsedTicks);
sb.Clear();
stopwatch.Reset();
stopwatch.Start();
foreach (var header in headers)
{
if (String.IsNullOrEmpty(header.Value)) continue;
sb.Append(header);
}
stopwatch.Stop();
Console.WriteLine("Using continue : " + stopwatch.ElapsedTicks);
}
Here are some of the results I got
Using LINQ : 1077
Using continue : 348
Using LINQ : 939
Using continue : 459
Using LINQ : 768
Using continue : 382
Using LINQ : 1256
Using continue : 457
Using LINQ : 875
Using continue : 318
In general, LINQ is always going to be slower than the foreach counterpart when working with an already evaluated IEnumerable<T>. The reason is that LINQ-to-Objects is just a high-level wrapper over these lower-level language features. The benefit of using LINQ here is not performance, but the provision of a consistent interface.

LINQ absolutely does provide performance benefits, but they come into play when you are working with resources that are not already in active memory (where the provider can optimize the code that is actually executed). When the alternative code is already the most optimal version, LINQ just has to go through a redundant process to call the same code you would have written anyway. To illustrate this, here is the code that is actually called when you use LINQ's Where operator on a loaded enumerable:
public static IEnumerable<TSource> Where<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
if (source == null)
{
throw Error.ArgumentNull("source");
}
if (predicate == null)
{
throw Error.ArgumentNull("predicate");
}
if (source is Iterator<TSource>)
{
return ((Iterator<TSource>) source).Where(predicate);
}
if (source is TSource[])
{
return new WhereArrayIterator<TSource>((TSource[]) source, predicate);
}
if (source is List<TSource>)
{
return new WhereListIterator<TSource>((List<TSource>) source, predicate);
}
return new WhereEnumerableIterator<TSource>(source, predicate);
}
And here is the WhereSelectEnumerableIterator<TSource,TResult> class. The predicate field is the delegate that you pass into the Where() method. You will see where it actually gets executed in the MoveNext method (as well as all the redundant null checks). You will also see that the enumerable is only looped through once.

Stacking Where clauses will result in the creation of multiple iterator classes (each wrapping its predecessor), but will not result in multiple enumeration passes (due to deferred execution). Keep in mind that when you write a lambda like this, you are also actually creating a new Delegate instance (also affecting your performance in a minor way).
private class WhereSelectEnumerableIterator<TSource, TResult> : Enumerable.Iterator<TResult>
{
private IEnumerator<TSource> enumerator;
private Func<TSource, bool> predicate;
private Func<TSource, TResult> selector;
private IEnumerable<TSource> source;
public WhereSelectEnumerableIterator(IEnumerable<TSource> source, Func<TSource, bool> predicate, Func<TSource, TResult> selector)
{
this.source = source;
this.predicate = predicate;
this.selector = selector;
}
public override Enumerable.Iterator<TResult> Clone()
{
return new Enumerable.WhereSelectEnumerableIterator<TSource, TResult>(this.source, this.predicate, this.selector);
}
public override void Dispose()
{
if (this.enumerator != null)
{
this.enumerator.Dispose();
}
this.enumerator = null;
base.Dispose();
}
public override bool MoveNext()
{
switch (base.state)
{
case 1:
this.enumerator = this.source.GetEnumerator();
base.state = 2;
break;
case 2:
break;
default:
goto Label_007C;
}
while (this.enumerator.MoveNext())
{
TSource current = this.enumerator.Current;
if ((this.predicate == null) || this.predicate(current))
{
base.current = this.selector(current);
return true;
}
}
this.Dispose();
Label_007C:
return false;
}
public override IEnumerable<TResult2> Select<TResult2>(Func<TResult, TResult2> selector)
{
return new Enumerable.WhereSelectEnumerableIterator<TSource, TResult2>(this.source, this.predicate, Enumerable.CombineSelectors<TSource, TResult, TResult2>(this.selector, selector));
}
public override IEnumerable<TResult> Where(Func<TResult, bool> predicate)
{
return (IEnumerable<TResult>) new Enumerable.WhereEnumerableIterator<TResult>(this, predicate);
}
}
I personally think the performance difference is completely justifiable, because LINQ code is much easier to maintain and reuse. I also do things to offset the performance issues (like declaring all my anonymous lambda delegates and expressions as static readonly fields in a common class). But in reference to your actual question, your continue clause is definitely faster than the LINQ alternative.
No, it won't iterate through it twice. The .Where does not actually evaluate anything by itself; the foreach pulls each element that satisfies the clause out of the iterator as it goes.
Similarly, a headers.Select(...) doesn't actually process anything until you iterate over it or put a .ToList() or something behind it that forces it to evaluate.
EDIT:
To explain it a bit more, as Marcus pointed out, the .Where returns an iterator, so the elements are iterated over once and the expression is evaluated once per element; if it matches, the element goes into the body of the loop.
I think the second example will only iterate the dictionary once, because what headers.Where(...) returns is an iterator rather than a temporary collection. Every time the loop advances, it applies the filter defined in Where(...), which is what makes the single pass work.
However, I am not a very experienced C# coder, so I am not certain how C# handles this situation, but I think it should behave the same way.