Filtering IEnumerable Pattern - c#

Consider the following simple code pattern:
foreach(Item item in itemList)
{
if(item.Foo)
{
DoStuff(item);
}
}
If I want to parallelize it using Parallel Extensions(PE) I might simply replace the for loop construct as follows:
Parallel.ForEach(itemList, delegate(Item item)
{
if(item.Foo)
{
DoStuff(item);
}
});
However, PE will perform unnecessary work assigning work to threads for those items where Foo turned out to be false. Thus I was thinking an intermediate wrapper/filtering IEnumerable might be a reasonable approach here. Do you agree? If so what is the simplest way of achieving this? (BTW I'm currently using C#2, so I'd be grateful for at least one example that doesn't use lambda expressions etc.)

I'm not sure how the partitioning in PE for .NET 2 works, so it's difficult to say there. If each element is being pushed into a separate work item (which would be a fairly poor partitioning strategy), then filtering in advance would make quite a bit of sense.
If, however, item.Foo happened to be at all expensive (I wouldn't expect this, given that it's a property, but it's always possible), allowing it to be parallelized could be advantageous.
In addition, in .NET 4, the partitioning strategy used by the TPL will handle this fairly well. It was specifically designed to handle situations with varying levels of work. It does partitioning in "chunks", so one item does not get sent to one thread, but rather a thread gets assigned a set of items, which it processes in bulk. Depending on the frequency of item.Foo being false, paralellizing (using TPL) would quite possibly be faster than filtering in advance.

That all factors down to this single line:
Parallel.ForEach(itemList.Where(i => i.Foo), DoStuff);
But reading a comment to another post I now see you're in .Net 2.0 yet, so some of this may be a bit tricky to sneak past the compiler.
For .Net 2.0, I think you can do it like this (I'm a little unclear that passing the method names as delegates will still just work, but I think it will):
public IEnumerable<T> Where(IEnumerable<T> source, Predicate<T> predicate)
{
foreach(T item in source)
if (predicate(item))
yield return item;
}
public bool HasFoo(Item item) { return item.Foo; }
Parallel.ForEach(Where(itemList, HasFoo), DoStuff);

If I was to implement this, I would simply filter the list, before calling the foreach.
var actionableResults = from x in ItemList WHERE x.Foo select x;
This will filter the list to get the items that can be acted upon.
NOTE: this might be a pre-mature optimization, and could not make a major difference in your performance.

Related

Using LINQ Where result in foreach: hidden if statement, double foreach?

foreach (Person criminal in people.Where(person => person.isCriminal)
{
// do something
}
I have this piece of code and want to know how does it actually work. Is it equivalent to an if statement nested inside the foreach iteration or does it first loop through the list of people and repeats the loop with selected values? I care to know more about this from the perspective of efficiency.
foreach (Person criminal in people)
{
if (criminal.isCriminal)
{
// do something
}
}
Where uses deferred execution.
This means that the filtering does not occur immediately when you call Where. Instead, each time you call GetEnumerator().MoveNext() on the return value of Where, it checks if the next element in the sequence satisfies the condition. If it does not, it skips over this element and checks the next one. When there is an element that satisfies the condition, it stops advancing and you can get the value using Current.
Basically, it is like having an if statement inside a foreach loop.
To understand what happens, you must know how IEnumerables<T> work (because LINQ to Objects always work on IEnumerables<T>. IEnumerables<T> return an IEnumerator<T> which implements an iterator. This iterator is lazy, i.e. it always only yields one element of the sequence at once. There is no looping done in advance, unless you have an OrderBy or another command which requires it.
So if you have ...
foreach (string name in source.Where(x => x.IsChecked).Select(x => x.Name)) {
Console.WriteLine(name);
}
... this will happen: The foreach-statement requires the first item which is requested from the Select, which in turn requires one item from Where, which in turn retrieves one item from the source. The first name is printed to the console.
Then the foreach-statement requires the second item which is requested from the Select, which in turn requires one item from Where, which in turn retrieves one item from the source. The second name is printed to the console.
and so on.
This means that both of your code snipptes are logically equivalent.
It depends on what people is.
If people is an IEnumerable object (like a collection, or the result of a method using yield) then the two pieces of code in your question are indeed equivalent.
A naïve Where could be implemented as:
public static IEnumerable<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
// Error handling left out for simplicity.
foreach (TSource item in source)
{
if (predicate(item))
{
yield return item;
}
}
}
The actual code in Enumerable is a bit different to make sure that errors from passing a null source or predicate happen immediately rather than on the deferred execution, and to optimise for a few cases (e.g. source.Where(x => x.IsCriminal).Where(x => x.IsOnParole) is turned into the equivalent of source.Where(x => x.IsCriminal && x.IsOnParole) so that there's one fewer step in the chains of iterations), but that's the basic principle.
If however people is an IQueryable then things are different, and depend on the details of the query provider in question.
The simplest possibility is that the query provider can't do anything special with the Where and so it ends up just doing pretty much the above, because that will still work.
But often the query provider can do something else. Let's say people is a DbSet<Person> in Entity Framework assocated with a table in a database called people. If you do:
foreach(var person in people)
{
DoSomething(person);
}
Then Entity Framework will run SQL similar to:
SELECT *
FROM people
And then create a Person object for each row returned. We could do the same filtering in about to implement Where but we can also do better.
If you do:
foreach (Person criminal in people.Where(person => person.isCriminal)
{
DoSomething(person);
}
Then Entity Framework will run SQL similar to:
SELECT *
FROM people
WHERE isCriminal = 1
This means that the logic of deciding which elements to return is done in the database before it comes back to .NET. It allows for indices to be used in computing the WHERE which can be much more efficient, but even in the worse case of there being no useful indices and the database having to do a full scan it will still mean that those records we don't care about are never reported back from the database and there is no object created for them just to be thrown away again, so the difference in performance can be immense.
I care to know more about this from the perspective of efficiency
You are hopefully satisfied that there's no double pass as you suggested might happen, and happy to learn that it's even more efficient than the foreach … if you suggested when possible.
A bare foreach and if will still beat .Where() against an IEnumerable (but not against a database source) as there are a few overheads to Where that foreach and if don't have, but it's to a degree that is only worth caring about in very hot paths. Generally Where can be used with reasonable confidence in its efficiency.

Difference between using where/lambda and for each

I'm new to C# and came across this function preformed on a dictionary.
_objDictionary.Keys.Where(a => (a is fooObject)).ToList().ForEach(a => ((fooObject)a).LaunchMissles());
My understanding is that this essentially puts every key that is a fooObject into a list, then performs the LaunchMissles function of each. How is that different than using a for each loop like this?
foreach(var entry in _objDictionary.Keys)
{
if (entry is fooObject)
{
entry.LaunchMissles();
}
}
EDIT: The resounding opinion appears to be that there is no functional difference.
This is good example of abusing LINQ - statement did not become more readable or better in any other way, but some people just like to put LINQ everywhere. Though in this case you might take the best from both worlds by doing:
foreach(var entry in _objDictionary.Keys.OfType<FooObject>())
{
entry.LaunchMissles();
}
Note that in your foreach example you are missing a cast to FooObject to invoke LaunchMissles.
In general, Linq is no Voodomagic and does the same stuff under the hood that you would need to write if you werent using it. Linq just makes things easier to write but it wont beat regular code performance wise (if it really is equivalent)
In your case, your "oldschool" approach is perfectly fine and in my opinion the favorable
foreach(var entry in _objDictionary.Keys)
{
fooObject foo = entry as fooObject;
if (foo != null)
{
foo .LaunchMissles();
}
}
Regarding the Linq-Approach:
Materializing the Sequence to a List just to call a method on it, that does the same as the code above, is just wasting ressources and making it less readable.
In your example it doesnt make a diffrence but if the source wasnt a Collection (like Dictionary.Keys is) but an IEnumerable that really works the lazy way, then there can be a huge impact.
Lazy evalutation is designed to yield items when needed, calling ToList inbetween would first gather all items before actually executing the ForEach.
While the plain foreach-approach would get one item, then process it, then get the next and so on.
If you really want to use a "Linq-Foreach" than dont use the List-Implementation but roll your own extensionmethod (like mentioned in the comments below your quesiton)
public static class EnumerableExtensionMethods
{
public static void ForEach<T>(this IEnumerable<T> sequence, Action<T> action)
{
foreach(T item in sequence)
action(item);
}
}
Then still rolling with a regular foreach should be prefered, unless you put the foreach-body into a different method
sequence.ForEach(_methodThatDoesThejob);
That is the only "for me acceptable" way of using this.

Does foreach loop work more slowly when used with a not stored list or array?

I am wondered at if foreach loop works slowly if an unstored list or array is used as an in array or List.
I mean like that:
foreach (int number in list.OrderBy(x => x.Value)
{
// DoSomething();
}
Does the loop in this code calculates the sorting every iteration or not?
The loop using stored value:
List<Tour> list = tours.OrderBy(x => x.Value) as List<Tour>;
foreach (int number in list)
{
// DoSomething();
}
And if it does, which code shows the better performance, storing the value or not?
This is often counter-intuitive, but generally speaking, the option that is best for performance is to wait as long as possible to materialize results into a concrete structure like a list or array. Please keep in mind that this is a generalization, and so there are plenty of cases where it doesn't hold. Nevertheless, the first instinct is better when you avoid creating the list for as long as possible.
To demonstrate with your sample, we have these two options:
var list = tours.OrderBy(x => x.Value).ToList();
foreach (int number in list)
{
// DoSomething();
}
vs this option:
foreach (int number in list.OrderBy(x => x.Value))
{
// DoSomething();
}
To understand what is going on here, you need to look at the .OrderBy() extension method. Reading the linked documentation, you'll see it returns a IOrderedEnumerable<TSource> object. With an IOrderedEnumerable, all of the sorting needed for the foreach loop is already finished when you first start iterating over the object (and that, I believe, is the crux of your question: No, it does not re-sort on each iteration). Also note that both samples use the same OrderBy() call. Therefore, both samples have the same problem to solve for ordering the results, and they accomplish it the same way, meaning they take exactly the same amount of time to reach that point in the code.
The difference in the code samples, then, is entirely in using the foreach loop directly vs first calling .ToList(), because in both cases we start from an IOrderedEnumerable. Let's look closely at those differences.
When you call .ToList(), what do you think happens? This method is not magic. There is still code here which must execute in order to produce the list. This code still effectively uses it's own foreach loop that you can't see. Additionally, where once you only needed to worry about enough RAM to handle one object at a time, you are now forcing your program to allocate a new block of RAM large enough to hold references for the entire collection. Moving beyond references, you may also potentially need to create new memory allocations for the full objects, if you were reading a from a stream or database reader before that really only needed one object in RAM at a time. This is an especially big deal on systems where memory is the primary constraint, which is often the case with web servers, where you may be serving and maintaining session RAM for many many sessions, but each session only occasionally uses any CPU time to request a new page.
Now I am making one assumption here, that you are working with something that is not already a list. What I mean by this, is the previous paragraphs talked about needing to convert an IOrderedEnumerable into a List, but not about converting a List into some form of IEnumerable. I need to admit that there is some small overhead in creating and operating the state machine that .Net uses to implement those objects. However, I think this is a good assumption. It turns out to be true far more often than we realize. Even in the samples for this question, we're paying this cost regardless, by the simple virtual of calling the OrderBy() function.
In summary, there can be some additional overhead in using a raw IEnumerable vs converting to a List, but there probably isn't. Additionally, you are almost certainly saving yourself some RAM by avoiding the conversions to List whenever possible... potentially a lot of RAM.
Yes and no.
Yes the foreach statement will seem to work slower.
No your program has the same total amount of work to do so you will not be able to measure a difference from the outside.
What you need to focus on is not using a lazy operation (in this case OrderBy) multiple times without a .ToList or ToArray. In this case you are only using it once(foreach) but it is an easy thing to miss.
Edit: Just to be clear. The as statement in the question will not work as intended but my answer assumes no .ToList() after OrderBy .
This line won't run:
List<Tour> list = tours.OrderBy(x => x.Value) as List<Tour>; // Returns null.
Instead, you want to store the results this way:
List<Tour> list = tours.OrderBy(x => x.Value).ToList();
And yes, the second option (storing the results) will enumerate much faster as it will skip the sorting operation.

Could locking an enumerable potentially cause multiple enumeration?

I think ReSharper is lying to me.
I have this extension method that (hopefully) returns an xor of two enumerations:
public static IEnumerable<T> Xor<T>(this IEnumerable<T> first, IEnumerable<T> second)
{
lock (first)
{
lock (second)
{
var firstAsList = first.ToList();
var secondAsList = second.ToList();
return firstAsList.Except(secondAsList).Union(secondAsList.Except(firstAsList));
}
}
}
ReSharper thinks I'm performing a multiple enumeration of an IEnumerable, as you can see, on both the arguments. If I remove the locks, then it's satisfied that I'm not.
Is ReSharper right or wrong? I believe it's wrong.
edit: I do realize that I'm enumerating the lists multiple times, but ReSharper is saying I'm enumerating over the original arguments multiple times, which I don't think is true. I'm enumerating both arguments once into a list so I may then perform the actual set manipulation, but as I see it, I'm not actually iterating over the arguments passed multiple times.
For example, if the passed arguments are actually query results, my belief is this method won't cause a storm of queries to be executed by the set manipulation. I do understand what ReSharper means by warning of multiple enumeration: if the enumerables passed are heavy to generate, then if they're enumerated multiple times, then performing multiple enumerations on them will be much slower.
Also, removing the locks definitely makes ReSharper happier:
You are indeed enumerating both of the lists multiple times. You are not enumerating the enumerables passed as parameters multiple times. Both lists are enumerated once for each call to Except. The call to Union is not enumerating either sequence an additional time, but rather is enumerating the results of the two calls to Except.
Of course, iterating a List multiple times in a context like this isn't really a problem; there aren't negative consequences to iterating an unchanging list multiple times.
The lock statements have nothing whatsoever to do with enumeration of the sequences. Locking on an IEnumerable does not iterate it. Of course, locking on two objects like this, specifically two objects that are not limited in scope to this section of code, is very dangerous. It's quite possible to end up deadlocking the program with locks used in this manor if code elsewhere (such as another invocation of this method) ends up taking locks on the same objects in the opposite order).
This is a bit of a funny one.
First things first: as you've correctly identified, R# is raising this inspection not against the multiple usages of the Lists - there is of course nothing to worry about in multiply enumerating a List - but against (what R' sees as multiple usages of) the IEnumerable arguments. I'm presuming you already know why this would be potentially bad, so I'll skip that.
Now to the question of whether R# is right to complain here. To quote the C# spec,
A lock statement of the form
lock (x) ...
where x is an expression of a reference-type, is precisely
equivalent to
System.Threading.Monitor.Enter(x);
try {
...
}
finally {
System.Threading.Monitor.Exit(x);
}
except that x is only evaluated once.
(I've put in this emphasis because I like this wording; it avoids debates (that I'm definitely not qualified to enter) about whether this is "syntatic sugar" or not.)
Taking a minimal example which produces this R# inspection:
private static void Method(IEnumerable<int> enumerable)
{
lock (enumerable)
{
var list = enumerable.ToList();
}
}
and replacing it by what I think is the precisely equivalent version as mandated by the spec:
private static void Method(IEnumerable<int> enumerable)
{
var x = enumerable;
System.Threading.Monitor.Enter(x);
try
{
var list = enumerable.ToList();
}
finally
{
System.Threading.Monitor.Exit(x);
}
}
also produces the inspection.
The question then is: is R# right to produce this inspection? And this is where I think we get into a grey area. When I pass the following enumerable to either of these methods:
static IEnumerable<int> MyEnumerable()
{
Console.WriteLine("Enumerable enumerated");
yield return 1;
yield return 2;
}
it is not multiply enumerated, which would suggest that R# is wrong to warn here; however, I can't actually find anything in documentation that guarantees this behaviour of either lock or Monitor.Enter. So for me it's not quite as clear-cut as this R# bug I reported, where use of GetType flagged this inspection; but nonetheless I'd guess you're safe.
If you raise this on the R# bug tracker, you can get JetBrains' finest looking at a) whether this behaviour is indeed guaranteed. and b) whether R# can be adjusted to either not warn, or provide a justification for warning.
That said, of course, using locking here probably isn't actually achieving what you want to achieve, as stated in other answers and comments...

Enumerator problem, Any way to avoid two loops?

I have a third party api, which has a class that returns an enumerator for different items in the class.
I need to remove an item in that enumerator, so I cannot use "for each". Only option I can think of is to get the count by iterating over the enum and then run a normal for loop to remove the items.
Anyone know of a way to avoid the two loops?
Thanks
[update] sorry for the confusion but Andrey below in comments is right.
Here is some pseudo code out of my head that won't work and for which I am looking a solution which won't involve two loops but I guess it's not possible:
for each (myProperty in MyProperty)
{
if (checking some criteria here)
MyProperty.Remove(myProperty)
}
MyProperty is the third party class that implements the enumerator and the remove method.
Common pattern is to do something like this:
List<Item> forDeletion = new List<Item>();
foreach (Item i in somelist)
if (condition for deletion) forDeletion.Add(i);
foreach (Item i in forDeletion)
somelist.Remove(i); //or how do you delete items
Loop through it once and create a second array which contains the items which should not be deleted.
If you know it's a collection, you can go with reverted for:
for (int i = items.Count - 1; i >= 0; i--)
{
items.RemoveAt(i);
}
Otherwise, you'll have to do two loops.
You can create something like this:
public IEnumerable<item> GetMyList()
{
foreach (var x in thirdParty )
{
if (x == ignore)
continue;
yield return x;
}
}
I need to remove an item in that enumerator
As long as this is a single item that's not a problem. The rule is that you cannot continue to iterate after modifying the collection. Thus:
foreach (var item in collection) {
if (item.Equals(toRemove) {
collection.Remove(toRemove);
break; // <== stop iterating!!
}
}
It is not possible to remove an item from an Enumerator. What you can do is to copy or filter(or both) the content of the whole enumeration sequence.
You can achieve this by using linq and do smth like this:
YourEnumerationReturningFunction().Where(item => yourRemovalCriteria);
Can you elaborate on the API and the API calls you are using?
If you receive an IEnumerator<T> or IEnumerable<T> you cannot remove any item from the sequence behind the enumerator because there is no method to do so. And you should of course not rely on down casting an received object because the implementation may change. (Actually a well designed API should not expose mutable objects holding internal state at all.)
If you receive IList<T> or something similar you can just use a normal for loop from back to front and remove the items as needed because there is no iterator which state could be corrupted. (Here the rule about exposing mutable state should apply again - modifying the returned collection should not change any state.)
IEnumerator.Count() will decide at run-time what it needs to do - enumerate to count or reflect to see it's a collection and call .Count that way.
I like SJoerd's suggestion but I worry about how many items we may be talking about.
Why not something like ..
// you don't want 2 and 3
IEnumerable<int> fromAPI = Enumerable.Range(0, 10);
IEnumerable<int> result = fromAPI.Except(new[] { 2, 3 });
A clean, readable way to do this is as follows (I'm guessing at the third-party container's API here since you haven't specified it.)
foreach(var delItem in ThirdPartyContainer.Items
.Where(item=>ShouldIDeleteThis(item))
//or: .Where(ShouldIDeleteThis)
.ToArray()) {
ThirdPartyContainer.Remove(delItem);
}
The call to .ToArray() ensures that all items to be deleted have been greedily cached before the foreach iteration begins.
Behind the scenes this involves an array and an extra iteration over that, but that's generally very cheap, and the advantage of this method over the other answers to this question is that it works on plain enumerables and does not involve tricky mutable state issues that are hard to read and easy to get wrong.
By contrast, iterating in reverse, while not rocket science, is much more prone to off-by-one errors and harder to read; and it also relies on internals of the collection such as not changing order in between deletions (e.g. better not be a binary heap, say). Manually adding items that should be deleted to a temporary list is just unnecessary code - that's what .ToArray() will do just fine :-).
an enumerator always has a private field pointing to the real collection.
you can get it via reflection.modify it.
have fun.

Categories