Linq to Objects: TakeWhileOrFirst - c#

What would be the most readable way to apply the following to a sequence using linq:
TakeWhile elements are valid but always at least the first element
EDIT: I have updated the title to be more precise. I'm sorry for any confusion; the answers below have definitely taught me something!
The expected behavior is this: take while elements are valid. If the result is an empty sequence, take the first element anyway.
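For example, a minimal sketch of the intended behavior (sample data and predicate are purely illustrative):
var valid   = new[] { 2, 4, 5, 6 };
var invalid = new[] { 5, 6 };

// valid.TakeWhileOrFirst(x => x % 2 == 0)   -> 2, 4
// invalid.TakeWhileOrFirst(x => x % 2 == 0) -> 5   (first element, even though it fails the predicate)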

I think this makes the intention quite clear:
things.TakeWhile(x => x.Whatever).DefaultIfEmpty(things.First());
My earlier, more verbose solution:
var query = things.TakeWhile(x => x.Whatever);
if (!query.Any()) { query = things.Take(1); }

The following works*, and seems fairly readable to me:
seq.Take(1).Concat(seq.TakeWhile(condition).Skip(1));
There may be a better way, not sure.
*with thanks to @Jeff M for the correction

As far as I can tell, this is most efficiently implemented manually, to ensure that the source isn't enumerated more than necessary.
public static IEnumerable<TSource> TakeWhileOrFirst<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
    using (var enumerator = source.GetEnumerator())
    {
        if (!enumerator.MoveNext())
            yield break;

        TSource current = enumerator.Current;
        yield return current;

        if (predicate(current))
        {
            while (enumerator.MoveNext() && predicate(current = enumerator.Current))
                yield return current;
        }
    }
}
And for the sake of completion, an overload that includes the index:
public static IEnumerable<TSource> TakeWhileOrFirst<TSource>(this IEnumerable<TSource> source, Func<TSource, int, bool> predicate)
{
    using (var enumerator = source.GetEnumerator())
    {
        if (!enumerator.MoveNext())
            yield break;

        TSource current = enumerator.Current;
        int index = 0;
        yield return current;

        if (predicate(current, index++))
        {
            while (enumerator.MoveNext() && predicate(current = enumerator.Current, index++))
                yield return current;
        }
    }
}
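A quick usage sketch of the extension (sample data assumed):
var numbers = new[] { 1, 2, 9, 3 };

// 1, 2 (stops at 9, but the first element is always included)
var a = numbers.TakeWhileOrFirst(x => x < 5);

// 9 (the predicate fails immediately, so only the first element is returned)
var b = new[] { 9, 1, 2 }.TakeWhileOrFirst(x => x < 5);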

DISCLAIMER:
This is a variation of Jeff M's nice answer and as such is only meant to show the code using do-while instead. It's only provided as an extension to Jeff's answer.
public static IEnumerable<TSource> TakeWhileOrFirst<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
    using (var enumerator = source.GetEnumerator())
    {
        if (!enumerator.MoveNext())
            yield break;

        var current = enumerator.Current;
        do
        {
            yield return current;
        } while (predicate(current) &&
                 enumerator.MoveNext() &&
                 predicate(current = enumerator.Current));
    }
}
Of course it's a matter of style; I personally like to keep the nesting level of my conditional logic as low as possible, but the double use of the predicate might be harder to grasp and can be a slight performance hog (depending on optimization and branch prediction).

Related

Why is DefaultIfEmpty implemented this way?

Chasing the implementation of System.Linq.Enumerable.DefaultIfEmpty took me to this method. It looks alright except for the following quaint details:
// System.Linq.Enumerable
[IteratorStateMachine(typeof(Enumerable.<DefaultIfEmptyIterator>d__90<>))]
private static IEnumerable<TSource> DefaultIfEmptyIterator<TSource>(IEnumerable<TSource> source, TSource defaultValue)
{
    using (IEnumerator<TSource> enumerator = source.GetEnumerator())
    {
        if (enumerator.MoveNext())
        {
            do
            {
                yield return enumerator.Current;
            }
            while (enumerator.MoveNext());
        }
        else
        {
            yield return defaultValue;
        }
    }
    IEnumerator<TSource> enumerator = null;
    yield break;
    yield break;
}
1) Why does the code have to iterate over the whole sequence once it has been established that the sequence is not empty?
2) Why the yield break two times at the end?
3) Why explicitly set the enumerator to null at the end when there is no other reference to it?
I would have left it at this:
// System.Linq.Enumerable
[IteratorStateMachine(typeof(Enumerable.<DefaultIfEmptyIterator>d__90<>))]
private static IEnumerable<TSource> DefaultIfEmptyIterator<TSource>(IEnumerable<TSource> source, TSource defaultValue)
{
    using (IEnumerator<TSource> enumerator = source.GetEnumerator())
    {
        if (enumerator.MoveNext())
        {
            do
            {
                yield return enumerator.Current;
            }
            // while (enumerator.MoveNext());
        }
        else
        {
            yield return defaultValue;
        }
    }
    // IEnumerator<TSource> enumerator = null;
    yield break;
    // yield break;
}
DefaultIfEmpty needs to act as follows:
If the source enumerable has no entries, it needs to act as an enumerable with a single value: the default value.
If the source enumerable is not empty, it needs to act as the source enumerable. Therefore, it needs to yield all values.
When you start enumerating and this code is used as another layer of enumeration, you have to enumerate the whole thing. If you just yield return the first element and stop there, the code using this enumerator will think there is only one value. So you have to enumerate everything there is and yield return it forward.
You could of course just return the enumerator and that would work, but not after MoveNext() has been called, since that would cause the first value to be skipped. If there were another way to check whether values exist, then that would be the way to do it.
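A quick illustration of both cases (sample data assumed):
var empty    = new int[0];
var nonEmpty = new[] { 1, 2, 3 };

// Prints: 42       (empty source: a single default value)
Console.WriteLine(string.Join(", ", empty.DefaultIfEmpty(42)));

// Prints: 1, 2, 3  (non-empty source: every element, not just the first)
Console.WriteLine(string.Join(", ", nonEmpty.DefaultIfEmpty(42)));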
Why does the code have to iterate over the whole sequence once it has been established that the sequence is not empty?
As you can read in MSDN about the DefaultIfEmpty return value:
An IEnumerable<T> object that contains the default value for the TSource type if source is empty; otherwise, source.
So, if the enumerable is empty the result is an enumerable containing the default value, but if the enumerable isn't empty the same sequence is returned (not only the first element).
It may seem that this method is only about checking whether an enumerable contains elements or not, but that is not the case.
Why the yield break two times at the end?
No ideas :)

Does Any() stop on success?

To be more specific: will the Linq extension method Any(IEnumerable collection, Func predicate) stop checking all the remaining elements of the collections once the predicate has yielded true for an item?
Because I don't want to spend too much time figuring out if I need to do the really expensive parts at all:
if(lotsOfItems.Any(x => x.ID == target.ID))
//do expensive calculation here
So if Any is always checking all the items in the source this might end up being a waste of time instead of just going with:
var candidate = lotsOfItems.FirstOrDefault(x => x.ID == target.ID)
if(candidate != null)
//do expensive calculation here
because I'm pretty sure that FirstOrDefault returns as soon as it gets a result and only keeps going through the whole Enumerable if it does not find a suitable entry in the collection.
Does anyone have information about the internal workings of Any, or could anyone suggest a solution for this kind of decision?
Also, a colleague suggested something along the lines of:
if(!lotsOfItems.All(x => x.ID != target.ID))
since this is supposed to stop once the condition returns false for the first time, but I'm not sure on that, so if anyone could shed some light on this as well it would be appreciated.
As we see from the source code, Yes:
internal static bool Any<T>(this IEnumerable<T> source, Func<T, bool> predicate) {
    foreach (T element in source) {
        if (predicate(element)) {
            return true; // Attention to this line
        }
    }
    return false;
}
Any() is the most efficient way to determine whether any element of a sequence satisfies a condition with LINQ.
also: a colleague suggested something along the lines of
if(!lotsOfItems.All(x => x.ID != target.ID)) since this is supposed to
stop once the condition returns false for the first time, but I'm not
sure on that, so if anyone could shed some light on this as well it
would be appreciated :>]
All() determines whether all elements of a sequence satisfy a condition. So, the enumeration of source is stopped as soon as the result can be determined.
Additional note:
The above is true if you are using LINQ to Objects. If you are using LINQ to a database, then it will create a query and execute it against the database.
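A small sketch to check that All() short-circuits, in the same spirit as the tester in the next answer (names are illustrative):
static IEnumerable<int> ThrowingSource()
{
    yield return 1;
    yield return 2;
    throw new Exception("enumerated too far");
}

// Prints: False. All() stops at the first non-match (the predicate is false for 1),
// so the exception at the end of the source is never reached.
Console.WriteLine(ThrowingSource().All(x => x != 1));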
You could test it yourself: https://ideone.com/nIDKxr
public static IEnumerable<int> Tester()
{
    yield return 1;
    yield return 2;
    throw new Exception();
}

static void Main(string[] args)
{
    Console.WriteLine(Tester().Any(x => x == 1));
    Console.WriteLine(Tester().Any(x => x == 2));

    try
    {
        Console.WriteLine(Tester().Any(x => x == 3));
    }
    catch
    {
        Console.WriteLine("Error here");
    }
}
Yes, it does :-)
also: a colleague suggested something along the lines of
if(!lotsOfItems.All(x => x.ID != target.ID))
since this is supposed to stop once the condition returns false for the first time, but I'm not sure on that, so if anyone could shed some light on this as well it would be appreciated :>]
Using the same reasoning, All() could continue even if one of the elements returns false :-) No, even All() is programmed correctly :-)
It does whatever is the quickest way of doing what it has to do.
When used on an IEnumerable this will be along the lines of:
foreach(var item in source)
    if(predicate(item))
        return true;
return false;
Or for the variant that doesn't take a predicate:
using(var en = source.GetEnumerator())
    return en.MoveNext();
When run against a database it will be something like
SELECT EXISTS(SELECT null FROM [some table] WHERE [some where clause])
And so on. How that was executed would depend in turn on what indices were available for fulfilling the WHERE clause, so it could be a quick index lookup, a full table scan aborting on first match found, or an index lookup followed by a partial table scan aborting on first match found, depending on that.
Yet other Linq providers would have yet other implementations, but generally the people responsible will be trying to be at least reasonably efficient.
In all, you can depend upon it being at least slightly more efficient than calling FirstOrDefault, as FirstOrDefault uses similar approaches but does have to return a full object (perhaps constructing it). Likewise !All(inversePredicate) tends to be pretty much on a par with Any(predicate) as per this answer.
Single is an exception to this
Update: The following no longer applies to .NET Core, which has changed the implementation of Single.
It's important to note that in the case of linq-to-objects, the overloads of Single and SingleOrDefault that take a predicate do not stop on identified failure. While the obvious approach to Single<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate) would be something like:
public static TSource Single<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
    /* do null checks */
    using(var en = source.GetEnumerator())
        while(en.MoveNext())
        {
            var val = en.Current;
            if(predicate(val))
            {
                while(en.MoveNext())
                    if(predicate(en.Current))
                        throw new InvalidOperationException("too many matching items");
                return val;
            }
        }
    throw new InvalidOperationException("no matching items");
}
The actual implementation is something like:
public static TSource Single<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
    /* do null checks */
    var result = default(TSource);
    long tally = 0;
    foreach(var item in source)
        if(predicate(item))
        {
            result = item;
            checked { ++tally; }
        }
    switch(tally)
    {
        case 0:
            throw new InvalidOperationException("no matching items");
        case 1:
            return result;
        default:
            throw new InvalidOperationException("too many matching items");
    }
}
Now, while a successful Single will have to scan everything, this means that an unsuccessful Single can be much, much slower than it needs to be (and can even potentially throw an undocumented error). And if the reason for the unexpected duplicate is a bug that is duplicating items into the sequence, hence making it far larger than it should be, then the Single that should have helped you find that problem is now dragging through all of it.
SingleOrDefault has the same issue.
This only applies to linq-to-objects, but it remains safer to do .Where(predicate).Single() rather than Single(predicate).
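A rough sketch of why the ordering matters, using a source that blows up if it is enumerated past the second match (illustrative only, and assuming the pre-.NET Core behavior described above):
static IEnumerable<int> Duplicates()
{
    yield return 7;
    yield return 7;                        // unexpected duplicate
    throw new Exception("kept scanning");  // stands in for a huge remainder
}

try
{
    // Single() sees a second element and throws straight away,
    // without pulling anything further from the source.
    Duplicates().Where(x => x == 7).Single();
}
catch (InvalidOperationException) { /* "more than one element", as expected */ }

// The predicate overload (pre-.NET Core) keeps scanning and counting matches,
// so the equivalent Duplicates().Single(x => x == 7) would run into the
// "kept scanning" exception instead.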
Any stops at the first match. All stops at the first non-match.
I don't know whether the documentation guarantees that but this behavior is now effectively fixed for all time due to compatibility reasons. It also makes sense.
Yes, it stops as soon as the predicate is satisfied once. Here is the code via RedGate Reflector:
[__DynamicallyInvokable]
public static bool Any<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw Error.ArgumentNull("source");
    }
    if (predicate == null)
    {
        throw Error.ArgumentNull("predicate");
    }
    foreach (TSource local in source)
    {
        if (predicate(local))
        {
            return true;
        }
    }
    return false;
}

What can be the code for an IEnumerable.Select extension method?

I am new to LINQ and would like to write some extension methods. Before doing so, I wanted to test whether I would do it correctly. I just wanted to compare the performance of my CustomSelect extension method with the built-in Select extension method.
static void Main(string[] args)
{
    List<int> list = new List<int>();
    for (int i = 0; i < 10000000; i++)
        list.Add(i);

    DateTime now1 = DateTime.Now;
    List<int> process1 = list.Select(i => i).ToList();
    Console.WriteLine(DateTime.Now - now1);

    DateTime now2 = DateTime.Now;
    List<int> process2 = list.CustomSelect(i => i).ToList();
    Console.WriteLine(DateTime.Now - now2);
}

public static IEnumerable<TResult> CustomSelect<TSource, TResult>(this IEnumerable<TSource> source, Func<TSource, TResult> selector)
{
    foreach (TSource item in source)
    {
        yield return selector(item);
    }
}
Timespan for built-in method: 0.18 sec
Timespan for custom method: 0.35 sec
Changing the order of processes yields the same result.
If I collect the elements in a list and return it instead of using yield return, then the timespan is nearly the same as the built-in one. But as far as I know, we should use yield return wherever possible.
So what might the code for the built-in method look like? What should my approach be?
Thanks in advance
The key difference I can see is that the inbuilt method checks for List<T> and special-cases it, exploiting the custom List<T>.Enumerator implementation, rather than IEnumerable<T> / IEnumerator<T>. You can do that special-case yourself:
public static IEnumerable<TResult> CustomSelect<TSource, TResult>(
    this IEnumerable<TSource> source, Func<TSource, TResult> selector)
{
    if (source is List<TSource>)
        return CustomSelectList((List<TSource>)source, selector);
    return CustomSelectDefault(source, selector);
}

private static IEnumerable<TResult> CustomSelectList<TSource, TResult>(
    List<TSource> source, Func<TSource, TResult> selector)
{
    foreach (TSource item in source)
    {
        yield return selector(item);
    }
}

private static IEnumerable<TResult> CustomSelectDefault<TSource, TResult>(
    IEnumerable<TSource> source, Func<TSource, TResult> selector)
{
    foreach (TSource item in source)
    {
        yield return selector(item);
    }
}
You could take that a stage further by hand-rolling the entire iterator (which is what WhereSelectListIterator<TSource, TResult> does), but the above is probably close enough.
The inbuilt implementation also special-cases arrays, and handles various forms of composed queries.
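For illustration, a rough sketch of the general shape a hand-rolled iterator for the list case could take (this is not the BCL's WhereSelectListIterator, and the type name is made up):
using System;
using System.Collections;
using System.Collections.Generic;

// Avoids the compiler-generated state machine by implementing the enumerator
// by hand over List<TSource>.Enumerator (a struct, so no interface dispatch).
internal sealed class CustomSelectListIterator<TSource, TResult> : IEnumerable<TResult>, IEnumerator<TResult>
{
    private readonly Func<TSource, TResult> selector;
    private List<TSource>.Enumerator enumerator;
    private TResult current;

    public CustomSelectListIterator(List<TSource> source, Func<TSource, TResult> selector)
    {
        this.selector = selector;
        this.enumerator = source.GetEnumerator();
    }

    public TResult Current { get { return current; } }
    object IEnumerator.Current { get { return current; } }

    public bool MoveNext()
    {
        if (enumerator.MoveNext())
        {
            current = selector(enumerator.Current);
            return true;
        }
        return false;
    }

    public void Reset() { throw new NotSupportedException(); }
    public void Dispose() { }

    // For brevity this sketch hands itself out as its own enumerator;
    // the real LINQ iterators clone themselves so a query can be enumerated more than once.
    public IEnumerator<TResult> GetEnumerator() { return this; }
    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}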
There's a lot wrong with your performance test, which makes it inconclusive - you should look into best practices for benchmarking code in .NET. Use Stopwatch instead of DateTime.Now, run many repetitions of the same thing instead of a single shot at each, and make sure you're not being hindered by the GC (.ToList() is going to skew your measurements quite a bit).
yield return should not be used because it's faster, the idea is that it's easy to write, and it's lazy. If I did Take(10) on the yield return variant, I'd only get 10 elements. The return variant, on the other hand, will produce the whole list, return it, and then reduce it to 10 elements.
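A small sketch of that difference (names are illustrative, and the usual System.Linq usings are assumed):
// Lazy: elements are produced on demand, so Take(10) only runs the loop ten times.
static IEnumerable<int> LazySelect(IEnumerable<int> source, Func<int, int> selector)
{
    foreach (var item in source)
        yield return selector(item);
}

// Eager: the whole list is built before anything is returned,
// even if the caller only wants the first ten elements.
static IEnumerable<int> EagerSelect(IEnumerable<int> source, Func<int, int> selector)
{
    var results = new List<int>();
    foreach (var item in source)
        results.Add(selector(item));
    return results;
}

// LazySelect(hugeSequence, x => x).Take(10).ToList() does almost no work;
// EagerSelect(hugeSequence, x => x).Take(10).ToList() still processes every element first.
// (hugeSequence is a placeholder for any large IEnumerable<int>.)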
In effect, you're taking pretty much the simplest case, where there's very little reason to use Select at all (apart from clarity). Enumerables are made to handle far crazier stuff, and the LINQ methods let you do it in an easy-to-understand, concise manner, exposing an interface familiar to functional programmers. That often means you could get more performance by rewriting the whole thing in a less general way - the point is that you should really only do that if necessary. If this is not a performance bottleneck of your application (and it rarely will be), the cleaner, easier-to-extend code is the better option.

Should I use Yield when writing my own extension?

I wanted to write an extension method (for using it in a fluent syntax) so that If a sequence is :
List<int> lst = new List<int>() { 1, 2, 3 };
I want to repeat it 3 times (for example), so the output would be 123123123.
I wrote this :
public static IEnumerable<TSource> MyRepeat<TSource>(this IEnumerable<TSource> source, int n)
{
    return Enumerable.Repeat(source, n).SelectMany(f => f);
}
And now I can do this :
lst.MyRepeat(3)
output: 123123123
Question:
Shouldn't I use yield in the extension method? I tried yield return but it's not working here. Why is that, and should I use it?
edit
After Ant's answer I changed it to :
public static IEnumerable<TSource> MyRepeat<TSource>(this IEnumerable<TSource> source, int n)
{
    var k = Enumerable.Repeat(source, n).SelectMany(f => f);
    foreach (var element in k)
    {
        yield return element;
    }
}
But is there any difference ?
This is because the following already returns an IEnumerable:
Enumerable.Repeat(source,n).SelectMany(f=>f);
When you use the yield keyword, you specify that a given iteration over the method will return what follows. So you are essentially saying "each iteration will yield an IEnumerable<TSource>," when actually, each iteration over a method returning an IEnumerable<TSource> should yield a TSource.
Hence, your error - when you iterate over MyRepeat, you are expected to return a TSource but because you are trying to yield an IEnumerable, you are actually trying to return an IEnumerable from every iteration instead of returning a single element.
Your edit should work but is a little pointless - if you simply return the IEnumerable directly it won't be enumerated until you iterate over it (or call ToList or something). In your very first example, SelectMany (or one of its nested methods) will already be using yield, meaning the yield is already there, it's just implicit in your method.
Ant P's answer is of course correct.
You would use yield if you were building the enumerable that is returned yourself, rather than relying on SelectMany. eg:
public static IEnumerable<T> Repeat<T>(this IEnumerable<T> items, int repeat)
{
    for (int i = 0; i < repeat; ++i)
        foreach (T item in items)
            yield return item;
}
The thing you yield is an element of the sequence. The code is instructions for producing the sequence of yielded elements.
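For example, using the extension above (output written by hand):
var lst = new List<int> { 1, 2, 3 };

// Prints: 1 2 3 1 2 3 1 2 3
Console.WriteLine(string.Join(" ", lst.Repeat(3)));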

Linq + foreach loop optimization

So I recently found myself writing a loop similar to this one:
var headers = new Dictionary<string, string>();
...
foreach (var header in headers)
{
    if (String.IsNullOrEmpty(header.Value)) continue;
    ...
}
Which works fine, it iterates through the dictionary once and does all I need it to do. However, my IDE is suggesting this as a more readable / optimized alternative, but I disagree:
var headers = new Dictionary<string, string>();
...
foreach (var header in headers.Where(header => !String.IsNullOrEmpty(header.Value)))
{
    ...
}
But won't that iterate through the dictionary twice? Once to evaluate the .Where(...) and then once for the foreach loop?
If not, and the second code example only iterates the dictionary once, please explain why and how.
The code with continue is about twice as fast.
I ran the following code in LINQPad, and the results consistently say that the clause with continue is twice as fast.
void Main()
{
    var headers = Enumerable.Range(1, 1000).ToDictionary(i => "K" + i, i => i % 2 == 0 ? null : "V" + i);
    var stopwatch = new Stopwatch();
    var sb = new StringBuilder();

    stopwatch.Start();
    foreach (var header in headers.Where(header => !String.IsNullOrEmpty(header.Value)))
        sb.Append(header);
    stopwatch.Stop();
    Console.WriteLine("Using LINQ : " + stopwatch.ElapsedTicks);

    sb.Clear();
    stopwatch.Reset();

    stopwatch.Start();
    foreach (var header in headers)
    {
        if (String.IsNullOrEmpty(header.Value)) continue;
        sb.Append(header);
    }
    stopwatch.Stop();
    Console.WriteLine("Using continue : " + stopwatch.ElapsedTicks);
}
Here are some of the results I got
Using LINQ : 1077
Using continue : 348
Using LINQ : 939
Using continue : 459
Using LINQ : 768
Using continue : 382
Using LINQ : 1256
Using continue : 457
Using LINQ : 875
Using continue : 318
In general, LINQ is always going to be slower than the foreach counterpart when working with an already evaluated IEnumerable<T>. The reason is that LINQ-to-Objects is just a high-level wrapper over these lower-level language features. The benefit of using LINQ here is not performance, but the provision of a consistent interface.
LINQ absolutely does provide performance benefits, but they come into play when you are working with resources that are not already in active memory (and allow you to leverage the ability to optimize the code that is actually executed). When the alternative code is already the most optimal alternative, LINQ just has to go through a redundant process to call the same code you would have written anyway. To illustrate this, I'm going to paste the code below that is actually called when you use LINQ's Where operator on a loaded enumerable:
public static IEnumerable<TSource> Where<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw Error.ArgumentNull("source");
    }
    if (predicate == null)
    {
        throw Error.ArgumentNull("predicate");
    }
    if (source is Iterator<TSource>)
    {
        return ((Iterator<TSource>) source).Where(predicate);
    }
    if (source is TSource[])
    {
        return new WhereArrayIterator<TSource>((TSource[]) source, predicate);
    }
    if (source is List<TSource>)
    {
        return new WhereListIterator<TSource>((List<TSource>) source, predicate);
    }
    return new WhereEnumerableIterator<TSource>(source, predicate);
}
And here is the WhereSelectEnumerableIterator<TSource, TResult> class. The predicate field is the delegate that you pass into the Where() method. You can see where it actually gets executed in the MoveNext method (along with all the redundant null checks), and you can also see that the enumerable is only looped through once. Stacking Where clauses results in the creation of multiple iterator classes (each wrapping its predecessor), but does not result in multiple enumeration passes (due to deferred execution). Keep in mind that when you write a lambda like this, you are also creating a new Delegate instance (which also affects your performance in a minor way).
private class WhereSelectEnumerableIterator<TSource, TResult> : Enumerable.Iterator<TResult>
{
    private IEnumerator<TSource> enumerator;
    private Func<TSource, bool> predicate;
    private Func<TSource, TResult> selector;
    private IEnumerable<TSource> source;

    public WhereSelectEnumerableIterator(IEnumerable<TSource> source, Func<TSource, bool> predicate, Func<TSource, TResult> selector)
    {
        this.source = source;
        this.predicate = predicate;
        this.selector = selector;
    }

    public override Enumerable.Iterator<TResult> Clone()
    {
        return new Enumerable.WhereSelectEnumerableIterator<TSource, TResult>(this.source, this.predicate, this.selector);
    }

    public override void Dispose()
    {
        if (this.enumerator != null)
        {
            this.enumerator.Dispose();
        }
        this.enumerator = null;
        base.Dispose();
    }

    public override bool MoveNext()
    {
        switch (base.state)
        {
            case 1:
                this.enumerator = this.source.GetEnumerator();
                base.state = 2;
                break;
            case 2:
                break;
            default:
                goto Label_007C;
        }
        while (this.enumerator.MoveNext())
        {
            TSource current = this.enumerator.Current;
            if ((this.predicate == null) || this.predicate(current))
            {
                base.current = this.selector(current);
                return true;
            }
        }
        this.Dispose();
    Label_007C:
        return false;
    }

    public override IEnumerable<TResult2> Select<TResult2>(Func<TResult, TResult2> selector)
    {
        return new Enumerable.WhereSelectEnumerableIterator<TSource, TResult2>(this.source, this.predicate, Enumerable.CombineSelectors<TSource, TResult, TResult2>(this.selector, selector));
    }

    public override IEnumerable<TResult> Where(Func<TResult, bool> predicate)
    {
        return (IEnumerable<TResult>) new Enumerable.WhereEnumerableIterator<TResult>(this, predicate);
    }
}
I personally think the performance difference is completely justifiable, because LINQ code is much easier to maintain and reuse. I also do things to offset the performance issues (like declaring all my anonymous lambda delegates and expressions as static readonly fields in a common class). But in reference to your actual question, your continue clause is definitely faster than the LINQ alternative.
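A minimal sketch of that delegate-caching idea (the class and field names are made up for illustration; usings for System, System.Collections.Generic and System.Linq assumed):
internal static class HeaderFilters
{
    // Allocated once, instead of a fresh delegate every time the query is built.
    public static readonly Func<KeyValuePair<string, string>, bool> HasValue =
        header => !String.IsNullOrEmpty(header.Value);
}

// usage:
// foreach (var header in headers.Where(HeaderFilters.HasValue)) { ... }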
No, it won't iterate through it twice. The .Where does not actually evaluate anything by itself; the foreach pulls out each element from the Where that satisfies the clause.
Similarly, a headers.Select(...) doesn't actually process anything until you put a .ToList() or something behind it that forces it to evaluate.
EDIT:
To explain it a bit more, as Marcus pointed out, the .Where returns an iterator, so each element is iterated over and the expression is evaluated once; if it matches, the element goes into the body of the loop.
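A tiny check of that single-pass behavior (illustrative only):
var values = new List<string> { "a", null, "b" };
int calls = 0;

var query = values.Where(v =>
{
    calls++;                  // count predicate evaluations
    return !String.IsNullOrEmpty(v);
});

foreach (var v in query)
{
    // ...
}

Console.WriteLine(calls);     // 3: one evaluation per element, a single pass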
I think the second example will only iterate the dictionary once,
because what headers.Where(...) returns is an iterator rather than a temporary collection; every time the loop iterates, it applies the filter defined in Where(...), which is what makes the one-time iteration work.
However, I am not a sophisticated C# coder and I am not sure how C# deals with this situation, but I think it should work the same way.
