C# SkipWhile leaks memory if predicate is false

// C:\logs\AzureSDK.log is ~2.5GB file
IEnumerable<string> lines = File.ReadLines(@"C:\logs\AzureSDK.log").SkipWhile(line => false);
Console.WriteLine(string.Join("\n", lines));
return;
This clearly does not behave like an iterator: it allocates memory internally until I get an OOM. Returning true from the SkipWhile predicate does not lead to this and completes as expected (a couple of MB of memory usage during execution).
As per the docs, the method signature and common sense, SkipWhile must return an iterator and not load all the data into memory.
Machine info
Microsoft Windows [Version 10.0.14393]
Target 4.5.2, AnyCPU, Release
VS 2015 Update 3
NET 4.6.01586
Thoughts? I must be doing something stupid but unsure what
UPD: well the stupid thing was the string.Join I forgot about, which is appending to a single StringBuilder loading all the lines into memory.
I also checked SkipWhile sources and it's obviously perfectly fine:
public static IEnumerable<TSource> SkipWhile<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate) {
if (source == null) throw Error.ArgumentNull("source");
if (predicate == null) throw Error.ArgumentNull("predicate");
return SkipWhileIterator<TSource>(source, predicate);
}
static IEnumerable<TSource> SkipWhileIterator<TSource>(IEnumerable<TSource> source, Func<TSource, bool> predicate) {
bool yielding = false;
foreach (TSource element in source) {
if (!yielding && !predicate(element)) yielding = true;
if (yielding) yield return element;
}
}

SkipWhile does return an enumerator. But then you use string.Join to concatenate everything, and therefore end up loading the whole file into memory.
If you change your code to process each line independently, you'll see that you use much less memory:
foreach (var line in File.ReadLines(@"C:\logs\AzureSDK.log").SkipWhile(_ => false))
{
Console.WriteLine(line);
}

Your error is not in SkipWhile. When the predicate returns true, SkipWhile skips every line, so string.Join receives no results and has nothing to concatenate.
When the predicate returns false, string.Join causes the out of memory exception because it tries to build a single string holding the whole 2.5 GB file.
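The difference between streaming and joining can be sketched without a 2.5 GB file. This is a minimal illustration, not the asker's code: `FakeLines` and the `Produced` counter are hypothetical stand-ins for `File.ReadLines`, used only to show when the source is actually read.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazySkipWhileDemo
{
    // Counts how many lines the fake source has handed out so far.
    public static int Produced;

    // Hypothetical stand-in for File.ReadLines: yields one line at a time.
    public static IEnumerable<string> FakeLines(int count)
    {
        for (int i = 0; i < count; i++)
        {
            Produced++;
            yield return "line " + i;
        }
    }

    // Builds the SkipWhile query without enumerating it.
    public static int ProducedAfterBuildingQuery(int count)
    {
        Produced = 0;
        IEnumerable<string> query = FakeLines(count).SkipWhile(text => false);
        return Produced;
    }

    // Streams the query line by line, keeping only a running total.
    public static long StreamedLength(int count)
    {
        long total = 0;
        foreach (string line in FakeLines(count).SkipWhile(text => false))
        {
            total += line.Length;
        }
        return total;
    }

    // Materializes every line into one big string, as string.Join does.
    public static int JoinedLength(int count)
    {
        return string.Join("\n", FakeLines(count).SkipWhile(text => false)).Length;
    }

    public static void Main()
    {
        Console.WriteLine(ProducedAfterBuildingQuery(1000)); // 0: nothing has been read yet
        Console.WriteLine(StreamedLength(1000));
        Console.WriteLine(JoinedLength(1000));
    }
}
```

Only `JoinedLength` has to hold all lines in memory at once; the streaming loop keeps one line alive at a time.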

Related

Why use wrappers around the actual iterator functions in LINQ extension methods?

When looking at Microsoft's implementation of various C# LINQ methods, I noticed that the public extension methods are mere wrappers that return the actual implementation in the form of a separate iterator function.
For example (from System.Linq.Enumerable.cs):
public static IEnumerable<TSource> Concat<TSource>(this IEnumerable<TSource> first, IEnumerable<TSource> second) {
if (first == null) throw Error.ArgumentNull("first");
if (second == null) throw Error.ArgumentNull("second");
return ConcatIterator<TSource>(first, second);
}
static IEnumerable<TSource> ConcatIterator<TSource>(IEnumerable<TSource> first, IEnumerable<TSource> second) {
foreach (TSource element in first) yield return element;
foreach (TSource element in second) yield return element;
}
What is the reason for wrapping the iterator like that instead of combining them into one and return the iterator directly?
Like this:
public static IEnumerable<TSource> Concat<TSource>(this IEnumerable<TSource> first, IEnumerable<TSource> second) {
if (first == null) throw Error.ArgumentNull("first");
if (second == null) throw Error.ArgumentNull("second");
foreach (TSource element in first) yield return element;
foreach (TSource element in second) yield return element;
}
Wrappers are used to check method arguments immediately (i.e. when you call LINQ extension method). Otherwise arguments will not be checked until you start to consume iterator (i.e. use query in foreach loop or call some extension method which executes query - ToList, Count etc). This approach is used for all extension methods with deferred type of execution.
If you will use approach without wrapper then:
int[] first = { 1, 2, 3 };
int[] second = null;
var all = first.Concat(second); // note that query is not executed yet
// some other code
...
var name = Console.ReadLine();
Console.WriteLine($"Hello, {name}, we have {all.Count()} items!"); // boom! exception here
With argument-checking wrapper method you will get exception at first.Concat(second) line.
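The timing difference can be observed directly. The following is a small sketch with hypothetical method names (`LazyChecked`, `EagerChecked`); it is not the framework's code, just the same two shapes side by side.

```csharp
using System;
using System.Collections.Generic;

public static class ArgumentCheckDemo
{
    // Single-method shape: the null check is part of the iterator body,
    // so it does not run until the first MoveNext().
    public static IEnumerable<int> LazyChecked(IEnumerable<int> source)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        foreach (int item in source) yield return item;
    }

    // Wrapper shape: an ordinary method validates immediately,
    // then hands off to the iterator method.
    public static IEnumerable<int> EagerChecked(IEnumerable<int> source)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        return Iterate(source);
    }

    private static IEnumerable<int> Iterate(IEnumerable<int> source)
    {
        foreach (int item in source) yield return item;
    }

    // Returns true if merely calling the method (without enumerating) throws.
    public static bool ThrowsOnCall(Func<IEnumerable<int>, IEnumerable<int>> method)
    {
        try { method(null); return false; }
        catch (ArgumentNullException) { return true; }
    }

    public static void Main()
    {
        Console.WriteLine(ThrowsOnCall(LazyChecked));  // False
        Console.WriteLine(ThrowsOnCall(EagerChecked)); // True
    }
}
```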

Is this a bug in resharper?

I have me some Resharper squiggles here.
and they tell me that I have a possible multiple enumeration of IEnumerable going on. However you can see that this is not true. final is explicitly declared as a list ( List<Point2D> ) and pointTangents is declared previously as List<PointVector2D>
Any idea on why Resharper might be telling me this?
Edit Experiments To See If I can replicate with simpler code
As you can see below there are no squiggles and no warnings even though Bar is declared to take IEnumerable as arg.
Looks a lot like RSRP-429474 False-positive warning for possible multiple enumeration :
I have this code:
List<string> duplicateLabelsList = allResourcesLookup.SelectMany(x => x).Select(x => x.LoaderOptions.Label).Duplicates<string, string>().ToList(); ;
if (duplicateLabelsList.Any())
throw new DuplicateResourceLoaderLabelsException(duplicateLabelsList);
For both usages of duplicateLabelsList, I'm being warned about
possible multiple enumeration, despite the fact I've called ToList and
therefore there should be no multiple enumeration.
which (currently) has a Fix Version of 9.2, which (currently) isn't yet released.
The extension method public static TSource Last<TSource>(this IEnumerable<TSource> source); is defined for the type IEnumerable<TSource>.
If one looks at the implementation of Last<TSource>:
public static TSource Last<TSource>(this IEnumerable<TSource> source)
{
if (source == null) throw Error.ArgumentNull("source");
IList<TSource> list = source as IList<TSource>;
if (list != null)
{
int count = list.Count;
if (count > 0) return list[count - 1];
}
else
{
using (IEnumerator<TSource> e = source.GetEnumerator())
{
if (e.MoveNext())
{
TSource result;
do
{
result = e.Current;
} while (e.MoveNext());
return result;
}
}
}
throw Error.NoElements();
}
It is clear that if source implements IList then source is not enumerated and therefore your assumption that this is a "bug" in Resharper is correct.
I'd consider it more like a false positive probably due to the fact that Resharper has no general way to know that Last()'s implementation avoids unnecessary enumerations. It is probably deciding to flag the potential multiple enumeration based on the fact that Last<TSource> is defined for typed IEnumerable<T> objects.

Does Any() stop on success?

To be more specific: will the Linq extension method Any(IEnumerable collection, Func predicate) stop checking all the remaining elements of the collections once the predicate has yielded true for an item?
Because I don't want to spend too much time on figuring out if I need to do the really expensive parts at all:
if(lotsOfItems.Any(x => x.ID == target.ID))
//do expensive calculation here
So if Any is always checking all the items in the source this might end up being a waste of time instead of just going with:
var candidate = lotsOfItems.FirstOrDefault(x => x.ID == target.ID);
if(candidate != null)
//do expensive calculation here
because I'm pretty sure that FirstOrDefault does return once it got a result and only keeps going through the whole Enumerable if it does not find a suitable entry in the collection.
Does anyone have information about the internal workings of Any, or could anyone suggest a solution for this kind of decision?
Also, a colleague suggested something along the lines of:
if(!lotsOfItems.All(x => x.ID != target.ID))
since this is supposed to stop once the conditions returns false for the first time but I'm not sure on that, so if anyone could shed some light on this as well it would be appreciated.
As we see from the source code, Yes:
internal static bool Any<T>(this IEnumerable<T> source, Func<T, bool> predicate) {
foreach (T element in source) {
if (predicate(element)) {
return true; // Attention to this line
}
}
return false;
}
Any() is the most efficient way to determine whether any element of a sequence satisfies a condition with LINQ.
Also: a colleague suggested something along the lines of
if(!lotsOfItems.All(x => x.ID != target.ID))
since this is supposed to stop once the condition returns false for the first time, but I'm not sure about that, so if anyone could shed some light on this as well it would be appreciated.
All() determines whether all elements of a sequence satisfy a condition. So, the enumeration of source is stopped as soon as the result can be determined.
Additional note:
The above is true if you are using Linq to objects. If you are using Linq to Database, then it will create a query and will execute it against database.
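One way to check the short-circuiting for LINQ to Objects is to count predicate calls. This is only a sketch; `AnyCalls` and `NotAllCalls` are hypothetical helper names introduced for the illustration.

```csharp
using System;
using System.Linq;

public static class ShortCircuitDemo
{
    // Number of predicate calls Any makes before it returns.
    public static int AnyCalls(int[] items, int target)
    {
        int calls = 0;
        items.Any(x => { calls++; return x == target; });
        return calls;
    }

    // Number of predicate calls the !All(inverse) form makes before it returns.
    public static int NotAllCalls(int[] items, int target)
    {
        int calls = 0;
        items.All(x => { calls++; return x != target; });
        return calls;
    }

    public static void Main()
    {
        int[] items = { 1, 2, 3, 4, 5 };
        Console.WriteLine(AnyCalls(items, 2));    // 2: stops at the first match
        Console.WriteLine(NotAllCalls(items, 2)); // 2: stops at the first non-match
        Console.WriteLine(AnyCalls(items, 9));    // 5: no match, so every item is checked
    }
}
```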
You could test it yourself: https://ideone.com/nIDKxr
public static IEnumerable<int> Tester()
{
yield return 1;
yield return 2;
throw new Exception();
}
static void Main(string[] args)
{
Console.WriteLine(Tester().Any(x => x == 1));
Console.WriteLine(Tester().Any(x => x == 2));
try
{
Console.WriteLine(Tester().Any(x => x == 3));
}
catch
{
Console.WriteLine("Error here");
}
}
Yes, it does :-)
Also: a colleague suggested something along the lines of
if(!lotsOfItems.All(x => x.ID != target.ID))
since this is supposed to stop once the condition returns false for the first time, but I'm not sure about that, so if anyone could shed some light on this as well it would be appreciated.
By the same reasoning you could worry that All() keeps going even after one of the elements returns false :-) It doesn't: All() is programmed correctly too, and stops at the first element that fails the predicate :-)
It does whatever is the quickest way of doing what it has to do.
When used on an IEnumerable this will be along the lines of:
foreach(var item in source)
if(predicate(item))
return true;
return false;
Or for the variant that doesn't take a predicate:
using(var en = source.GetEnumerator())
return en.MoveNext();
When run against a database it will be something like
SELECT EXISTS(SELECT null FROM [some table] WHERE [some where clause])
And so on. How that was executed would depend in turn on what indices were available for fulfilling the WHERE clause, so it could be a quick index lookup, a full table scan aborting on first match found, or an index lookup followed by a partial table scan aborting on first match found, depending on that.
Yet other Linq providers would have yet other implementations, but generally the people responsible will be trying to be at least reasonably efficient.
In all, you can depend upon it being at least slightly more efficient than calling FirstOrDefault, as FirstOrDefault uses similar approaches but does have to return a full object (perhaps constructing it). Likewise !All(inversePredicate) tends to be pretty much on a par with Any(predicate) as per this answer.
Single is an exception to this
Update: The following from this point on no longer applies to .NET Core, which has changed the implementation of Single.
It's important to note that in the case of linq-to-objects, the overloads of Single and SingleOrDefault that take a predicate do not stop on identified failure. While the obvious approach to Single<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate) would be something like:
public static TSource Single<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
/* do null checks */
using(var en = source.GetEnumerator())
while(en.MoveNext())
{
var val = en.Current;
if(predicate(val))
{
while(en.MoveNext())
if(predicate(en.Current))
throw new InvalidOperationException("too many matching items");
return val;
}
}
throw new InvalidOperationException("no matching items");
}
The actual implementation is something like:
public static TSource Single<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
/* do null checks */
var result = default(TSource);
long tally = 0;
foreach(var item in source)
if(predicate(item))
{
result = item;
checked{++tally;}
}
switch(tally)
{
case 0:
throw new InvalidOperationException("no matching items");
case 1:
return result;
default:
throw new InvalidOperationException("too many matching items");
}
}
Now, while a successful Single has to scan everything anyway, this means that an unsuccessful Single can be much, much slower than it needs to be (and can even potentially throw an undocumented error, since the tally is incremented in a checked context). If the reason for the unexpected duplicate is a bug that is duplicating items into the sequence, and hence making it far larger than it should be, then the Single that should have helped you find that problem is now dragging its way through the whole thing.
SingleOrDefault has the same issue.
This only applies to linq-to-objects, but it remains safer to do .Where(predicate).Single() rather than Single(predicate).
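A rough way to see the difference is to count how many items each form pulls from the source when there are many matches. This is a sketch under stated assumptions: `Numbers` and `Pulled` are hypothetical helpers, and the count for Single(predicate) depends on the runtime (per the update above, newer implementations may stop early too), so no fixed number is claimed for it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SingleDemo
{
    // Counts how many items consumers have pulled from Numbers().
    public static int Pulled;

    // Hypothetical source: 0, 1, 2, ... with a pull counter.
    public static IEnumerable<int> Numbers(int count)
    {
        for (int i = 0; i < count; i++)
        {
            Pulled++;
            yield return i;
        }
    }

    // Items pulled before Where(predicate).Single() fails on a second match.
    public static int PulledByWhereThenSingle(int count)
    {
        Pulled = 0;
        try { Numbers(count).Where(x => x % 2 == 0).Single(); }
        catch (InvalidOperationException) { /* more than one match */ }
        return Pulled;
    }

    // Items pulled before Single(predicate) fails; runtime dependent.
    public static int PulledBySinglePredicate(int count)
    {
        Pulled = 0;
        try { Numbers(count).Single(x => x % 2 == 0); }
        catch (InvalidOperationException) { /* more than one match */ }
        return Pulled;
    }

    public static void Main()
    {
        Console.WriteLine(PulledByWhereThenSingle(1000)); // stops shortly after the second match
        Console.WriteLine(PulledBySinglePredicate(1000)); // may be the full 1000 on older runtimes
    }
}
```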
Any stops at the first match. All stops at the first non-match.
I don't know whether the documentation guarantees that but this behavior is now effectively fixed for all time due to compatibility reasons. It also makes sense.
Yes it stops when the predicate is satisfied once. Here is code via RedGate Reflector:
[__DynamicallyInvokable]
public static bool Any<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
if (source == null)
{
throw Error.ArgumentNull("source");
}
if (predicate == null)
{
throw Error.ArgumentNull("predicate");
}
foreach (TSource local in source)
{
if (predicate(local))
{
return true;
}
}
return false;
}

Linq + foreach loop optimization

So I recently found myself writing a loop similar to this one:
var headers = new Dictionary<string, string>();
...
foreach (var header in headers)
{
if (String.IsNullOrEmpty(header.Value)) continue;
...
}
Which works fine, it iterates through the dictionary once and does all I need it to do. However, my IDE is suggesting this as a more readable / optimized alternative, but I disagree:
var headers = new Dictionary<string, string>();
...
foreach (var header in headers.Where(header => !String.IsNullOrEmpty(header.Value)))
{
...
}
But won't that iterate through the dictionary twice? Once to evaluate the .Where(...) and then once for the for-each loop?
If not, and the second code example only iterates the dictionary once, please explain why and how.
The code with continue is about twice as fast.
I ran the following code in LINQPad, and the results consistently say that the clause with continue is twice as fast.
void Main()
{
var headers = Enumerable.Range(1,1000).ToDictionary(i => "K"+i,i=> i % 2 == 0 ? null : "V"+i);
var stopwatch = new Stopwatch();
var sb = new StringBuilder();
stopwatch.Start();
foreach (var header in headers.Where(header => !String.IsNullOrEmpty(header.Value)))
sb.Append(header);
stopwatch.Stop();
Console.WriteLine("Using LINQ : " + stopwatch.ElapsedTicks);
sb.Clear();
stopwatch.Reset();
stopwatch.Start();
foreach (var header in headers)
{
if (String.IsNullOrEmpty(header.Value)) continue;
sb.Append(header);
}
stopwatch.Stop();
Console.WriteLine("Using continue : " + stopwatch.ElapsedTicks);
}
Here are some of the results I got
Using LINQ : 1077
Using continue : 348
Using LINQ : 939
Using continue : 459
Using LINQ : 768
Using continue : 382
Using LINQ : 1256
Using continue : 457
Using LINQ : 875
Using continue : 318
In general LINQ is always going to be slower when working with an already evaluated IEnumerable<T>, than the foreach counterpart. The reason is that LINQ-to-Objects is just a high-level wrapper of these lower level language features. The benefit to using LINQ here is not performance, but the provision of a consistent interface. LINQ absolutely does provide performance benefits, but they come into play when you are working with resources that are not already in active memory (and allow you to leverage the ability to optimize the code that is actually executed). When the alternative code is the most optimal alternative, then LINQ just has to go through a redundant process to call the same code you would have written anyway. To illustrate this, I'm going to paste the code below that is actually called when you use LINQ's Where operator on a loaded enumerable:
public static IEnumerable<TSource> Where<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
if (source == null)
{
throw Error.ArgumentNull("source");
}
if (predicate == null)
{
throw Error.ArgumentNull("predicate");
}
if (source is Iterator<TSource>)
{
return ((Iterator<TSource>) source).Where(predicate);
}
if (source is TSource[])
{
return new WhereArrayIterator<TSource>((TSource[]) source, predicate);
}
if (source is List<TSource>)
{
return new WhereListIterator<TSource>((List<TSource>) source, predicate);
}
return new WhereEnumerableIterator<TSource>(source, predicate);
}
And here is the WhereSelectEnumerableIterator<TSource,TResult> class. The predicate field is the delegate that you pass into the Where() method. You will see where it actually gets executed in the MoveNext method (as well as all the redundant null checks). You will also see that the enumerable is only looped through once. Stacking where clauses will result in the creation of multiple iterator classes (wrapping their predecessors), but will not result in multiple enumeration actions (due to deferred execution). Keep in mind that when you write a Lambda like this, you are also actually creating a new Delegate instance (also affecting your performance in a minor way).
private class WhereSelectEnumerableIterator<TSource, TResult> : Enumerable.Iterator<TResult>
{
private IEnumerator<TSource> enumerator;
private Func<TSource, bool> predicate;
private Func<TSource, TResult> selector;
private IEnumerable<TSource> source;
public WhereSelectEnumerableIterator(IEnumerable<TSource> source, Func<TSource, bool> predicate, Func<TSource, TResult> selector)
{
this.source = source;
this.predicate = predicate;
this.selector = selector;
}
public override Enumerable.Iterator<TResult> Clone()
{
return new Enumerable.WhereSelectEnumerableIterator<TSource, TResult>(this.source, this.predicate, this.selector);
}
public override void Dispose()
{
if (this.enumerator != null)
{
this.enumerator.Dispose();
}
this.enumerator = null;
base.Dispose();
}
public override bool MoveNext()
{
switch (base.state)
{
case 1:
this.enumerator = this.source.GetEnumerator();
base.state = 2;
break;
case 2:
break;
default:
goto Label_007C;
}
while (this.enumerator.MoveNext())
{
TSource current = this.enumerator.Current;
if ((this.predicate == null) || this.predicate(current))
{
base.current = this.selector(current);
return true;
}
}
this.Dispose();
Label_007C:
return false;
}
public override IEnumerable<TResult2> Select<TResult2>(Func<TResult, TResult2> selector)
{
return new Enumerable.WhereSelectEnumerableIterator<TSource, TResult2>(this.source, this.predicate, Enumerable.CombineSelectors<TSource, TResult, TResult2>(this.selector, selector));
}
public override IEnumerable<TResult> Where(Func<TResult, bool> predicate)
{
return (IEnumerable<TResult>) new Enumerable.WhereEnumerableIterator<TResult>(this, predicate);
}
}
I personally think the performance difference is completely justifiable, because LINQ code is much easier to maintain and reuse. I also do things to offset the performance issues (like declaring all my anonymous lambda delegates and expressions as static readonly fields in a common class). But in reference to your actual question, your continue clause is definitely faster than the LINQ alternative.
No, it won't iterate through it twice. The .Where does not actually evaluate anything by itself. The foreach pulls each element that satisfies the clause out of the Where.
Similarly a headers.Select(x) doesn't actually process anything until you put a .ToList() or something behind it that forces it to evaluate.
EDIT:
To explain it a bit more, as Marcus pointed out, the .Where returns an iterator so each element is iterated over and the expression is processed once, if it matches then it goes into the body of the loop.
I think the second example will only iterate the dictionary once.
What headers.Where(...) returns is an iterator rather than a temporary collection. Each time the loop advances, the iterator applies the filter defined in Where(...) to the next element, which is what makes a single pass work.
However, I am not an experienced C# coder, so I am not sure exactly how C# handles this situation, but I think the principle should be the same.
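The single pass can be confirmed by instrumenting the source. This is a minimal sketch: `Values`, `Passes` and `Visited` are hypothetical names, and a small string sequence stands in for the dictionary.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SinglePassDemo
{
    public static int Passes;   // how many times enumeration of the source started
    public static int Visited;  // how many items the source handed out in total

    // Hypothetical source standing in for the dictionary's values.
    public static IEnumerable<string> Values()
    {
        Passes++; // runs once per enumeration of the source
        string[] data = { "a", "", "b", null, "c" };
        foreach (string value in data)
        {
            Visited++;
            yield return value;
        }
    }

    // Same shape as the IDE's suggestion: foreach over a Where filter.
    public static int CountNonEmpty()
    {
        Passes = 0;
        Visited = 0;
        int kept = 0;
        foreach (string value in Values().Where(v => !string.IsNullOrEmpty(v)))
        {
            kept++;
        }
        return kept;
    }

    public static void Main()
    {
        Console.WriteLine(CountNonEmpty()); // 3
        Console.WriteLine(Passes);          // 1: the source was enumerated once
        Console.WriteLine(Visited);         // 5: each item was visited once
    }
}
```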

C# Difference between First() and Find()

So I know that Find() is only a List<T> method, whereas First() is an extension for any IEnumerable<T>. I also know that First() will return the first element if no parameter is passed, whereas Find() will throw an exception. Lastly, I know that First() will throw an exception if the element is not found, whereas Find() will return the type's default value.
I hope that clears up confusion about what I'm actually asking. This is a computer science question and deals with these methods at the computational level. I've come to understand that IEnumerable<T> extensions do not always operate as one would expect under the hood. So here's the Q, and I mean from a "close to the metal" standpoint: What is the difference between Find() and First()?
Here's some code to provide basic assumptions to operate under for this question.
var l = new List<int> { 1, 2, 3, 4, 5 };
var x = l.First(i => i == 3);
var y = l.Find(i => i == 3);
Is there any actual computational difference between how First() and Find() discover their values in the code above?
Note: Let us ignore things like AsParallel() and AsQueryable() for now.
Here's the code for List<T>.Find (from Reflector):
public T Find(Predicate<T> match)
{
if (match == null)
{
ThrowHelper.ThrowArgumentNullException(ExceptionArgument.match);
}
for (int i = 0; i < this._size; i++)
{
if (match(this._items[i]))
{
return this._items[i];
}
}
return default(T);
}
And here's Enumerable.First:
public static TSource First<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
if (source == null)
{
throw Error.ArgumentNull("source");
}
if (predicate == null)
{
throw Error.ArgumentNull("predicate");
}
foreach (TSource local in source)
{
if (predicate(local))
{
return local;
}
}
throw Error.NoMatch();
}
So both methods work roughly the same way: they iterate all items until they find one that matches the predicate. The only noticeable difference is that Find uses a for loop because it already knows the number of elements, and First uses a foreach loop because it doesn't know it.
First will throw an exception when it finds nothing, FirstOrDefault however does exactly the same as Find (apart from how it iterates through the elements).
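The not-found behavior of the three methods can be compared in a few lines. This is a small sketch with hypothetical helper names; the list and the value 42 are just illustration data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FindVsFirstDemo
{
    // List<T>.Find returns default(T) when nothing matches.
    public static int FindMissing(List<int> list)
    {
        return list.Find(i => i == 42);
    }

    // Enumerable.FirstOrDefault behaves the same way.
    public static int FirstOrDefaultMissing(List<int> list)
    {
        return list.FirstOrDefault(i => i == 42);
    }

    // Enumerable.First throws InvalidOperationException when nothing matches.
    public static bool FirstThrows(List<int> list)
    {
        try { list.First(i => i == 42); return false; }
        catch (InvalidOperationException) { return true; }
    }

    public static void Main()
    {
        var list = new List<int> { 1, 2, 3, 4, 5 };
        Console.WriteLine(FindMissing(list));           // 0
        Console.WriteLine(FirstOrDefaultMissing(list)); // 0
        Console.WriteLine(FirstThrows(list));           // True
    }
}
```

Note that for a value type such as int the default is 0, which cannot be told apart from a genuine match on 0.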
BTW, Find is closer to FirstOrDefault() than to First(), because if the predicate passed to First() is not satisfied by any list element you will get an exception.
Here is what dotPeek returns (another great free Reflector replacement with some ReSharper features).
Here for Enumerable.First(...) and Enumerable.FirstOrDefault(...) extension methods:
public static TSource FirstOrDefault<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate) {
if (source == null) throw Error.ArgumentNull("source");
if (predicate == null) throw Error.ArgumentNull("predicate");
foreach (TSource element in source) {
if (predicate(element)) return element;
}
return default(TSource);
}
public static TSource First<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate) {
if (source == null) throw Error.ArgumentNull("source");
if (predicate == null) throw Error.ArgumentNull("predicate");
foreach (TSource element in source) {
if (predicate(element)) return element;
}
throw Error.NoMatch();
}
and here is for List<>.Find:
/// <summary>
/// Searches for an element that matches the conditions defined by the specified predicate, and returns the first occurrence within the entire <see cref="T:System.Collections.Generic.List`1"/>.
/// </summary>
///
/// <returns>
/// The first element that matches the conditions defined by the specified predicate, if found; otherwise, the default value for type <paramref name="T"/>.
/// </returns>
/// <param name="match">The <see cref="T:System.Predicate`1"/> delegate that defines the conditions of the element to search for.</param><exception cref="T:System.ArgumentNullException"><paramref name="match"/> is null.</exception>
[__DynamicallyInvokable]
public T Find(Predicate<T> match)
{
if (match == null)
ThrowHelper.ThrowArgumentNullException(ExceptionArgument.match);
for (int index = 0; index < this._size; ++index)
{
if (match(this._items[index]))
return this._items[index];
}
return default (T);
}
(This answer is about Find() on an Entity Framework context rather than List<T>.Find.)
1- Find() returns null if the entity is not in the context, but First() will throw an exception
2- Find() returns entities that have been added to the context but have not yet been saved to the database
Since List<> is not indexed in any way, it has to go through all values to find a specific value. Therefore it doesn't make much of a difference compared to traversing the list via an enumerable (apart from the creation of an enumerator helper object instance).
That said, keep in mind that the Find function was created way earlier than the First extension method (Framework V2.0 vs. V3.5), and I doubt that they would have implemented Find if the List<> class had been implemented at the same time as the extension methods.
