I just want to know whether a "FindAll" will be faster than a "Where" extension method, and why.
Example :
myList.FindAll(item => item.category == 5);
or
myList.Where(item => item.category == 5);
Which is better ?
Well, FindAll copies the matching elements to a new list, whereas Where just returns a lazily evaluated sequence - no copying is required.
I'd therefore expect Where to be slightly faster than FindAll even when the resulting sequence is fully evaluated - and of course the lazy evaluation strategy of Where means that if you only look at (say) the first match, it won't need to check the remainder of the list. (As Matthew points out, there's work in maintaining the state machine for Where. However, this will only have a fixed memory cost - whereas constructing a new list may require multiple array allocations etc.)
Basically, FindAll(predicate) is closer to Where(predicate).ToList() than to just Where(predicate).
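A minimal sketch of that difference (the list contents here are made up for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class LazyVsEager
{
    static void Main()
    {
        var myList = new List<int> { 1, 6, 3, 8, 2 };

        // FindAll runs the predicate over the whole list immediately
        // and allocates a new List<int> for the matches.
        List<int> eager = myList.FindAll(x => x > 2);

        // Where builds a lazy sequence; nothing has been evaluated yet.
        IEnumerable<int> lazy = myList.Where(x => x > 2);

        // Only now does Where iterate - and it stops at the first match.
        int first = lazy.First();

        // Materializing the lazy sequence gives the same result as FindAll.
        Console.WriteLine(eager.SequenceEqual(lazy.ToList())); // True
    }
}
```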
Just to react a bit more to Matthew's answer, I don't think he's tested it quite thoroughly enough. His predicate happens to pick half the items. Here's a short but complete program which tests the same list but with three different predicates - one picks no items, one picks all the items, and one picks half of them. In each case I run the test fifty times to get longer timings.
I'm using Count() to make sure that the Where result is fully evaluated. The results show that when collecting around half the results, the two are neck and neck. When collecting no results, FindAll wins. When collecting all the results, Where wins. I find this intriguing: all of the solutions become slower as more and more matches are found: FindAll has more copying to do, and Where has to return the matched values instead of just looping within the MoveNext() implementation. However, FindAll gets slower faster than Where does, so it loses its early lead. Very interesting.
Results:
FindAll: All: 11994
Where: All: 8176
FindAll: Half: 6887
Where: Half: 6844
FindAll: None: 3253
Where: None: 4891
(Compiled with /o+ /debug- and run from the command line, .NET 3.5.)
Code:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Test
{
    static List<int> ints = Enumerable.Range(0, 10000000).ToList();

    static void Main(string[] args)
    {
        Benchmark("All", i => i >= 0);      // Match all
        Benchmark("Half", i => i % 2 == 0); // Match half
        Benchmark("None", i => i < 0);      // Match none
    }

    static void Benchmark(string name, Predicate<int> predicate)
    {
        // We could just use new Func<int, bool>(predicate) but that
        // would create one delegate wrapping another.
        Func<int, bool> func = (Func<int, bool>)
            Delegate.CreateDelegate(typeof(Func<int, bool>), predicate.Target,
                                    predicate.Method);
        Benchmark("FindAll: " + name, () => ints.FindAll(predicate));
        Benchmark("Where: " + name, () => ints.Where(func).Count());
    }

    static void Benchmark(string name, Action action)
    {
        GC.Collect();
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < 50; i++)
        {
            action();
        }
        sw.Stop();
        Console.WriteLine("{0}: {1}", name, sw.ElapsedMilliseconds);
    }
}
How about we test instead of guess? Shame to see the wrong answer get out.
var ints = Enumerable.Range(0, 10000000).ToList();
var sw1 = Stopwatch.StartNew();
var findall = ints.FindAll(i => i % 2 == 0);
sw1.Stop();
var sw2 = Stopwatch.StartNew();
var where = ints.Where(i => i % 2 == 0).ToList();
sw2.Stop();
Console.WriteLine("sw1: {0}", sw1.ElapsedTicks);
Console.WriteLine("sw2: {0}", sw2.ElapsedTicks);
/*
Debug
sw1: 1149856
sw2: 1652284
Release
sw1: 532194
sw2: 1016524
*/
Edit:
Even if I turn the above code from
var findall = ints.FindAll(i => i % 2 == 0);
...
var where = ints.Where(i => i % 2 == 0).ToList();
... to ...
var findall = ints.FindAll(i => i % 2 == 0).Count;
...
var where = ints.Where(i => i % 2 == 0).Count();
I get these results
/*
Debug
sw1: 1250409
sw2: 1267016
Release
sw1: 539536
sw2: 600361
*/
Edit 2.0...
If you want a list of the subset of the current list, the fastest method is FindAll(). The reason for this is simple: the FindAll instance method uses the indexer on the current List instead of the enumerator state machine. The Where() extension method is an external call to a different class that uses the enumerator. If you step from each node in the list to the next node, you have to call the MoveNext() method under the covers. As you can see from the examples above, it is even faster to use the index entries to create a new list (one that points to the original items, so memory bloat will be minimal) than to just get a count of the filtered items.
Now if you are going to abort early from the enumerator, the Where() method could be faster. Of course, if you move the early-abort logic into the predicate of the FindAll() method, you will again be using the indexer instead of the enumerator.
Now there are other reasons to use the Where() method (such as the other LINQ methods, foreach blocks and many more), but the question was whether FindAll() is faster than Where(). And unless you never execute the Where(), the answer seems to be yes (when comparing apples to apples).
I am not saying don't use LINQ or the .Where() method; they make for code that is much simpler to read. The question was about performance, not about how easy the code is to read and understand. By far the fastest way to do this work is to use a for block stepping through each index and doing whatever logic you want (even early exits). The reason LINQ is so great is because of the complex expression trees and transformations you can get with them. But the iterator from the .Where() method has to go through tons of code to find its way to an in-memory state machine that is just getting the next index out of the List. It should also be noted that the .FindAll() method is only available on types that implement it (such as Array and List<T>).
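A sketch of that hand-rolled baseline (the even-number predicate is just a stand-in):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class IndexedFilter
{
    // Index the list directly: no enumerator, no delegate dispatch,
    // and room for an early exit if you only need the first N matches.
    static List<int> FilterByIndex(List<int> source)
    {
        var matches = new List<int>();
        for (int i = 0; i < source.Count; i++)
        {
            if (source[i] % 2 == 0)
                matches.Add(source[i]);
        }
        return matches;
    }

    static void Main()
    {
        var ints = Enumerable.Range(0, 10).ToList();
        Console.WriteLine(FilterByIndex(ints).Count); // 5
    }
}
```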
Yet more...
for (int x = 0; x < 20; x++)
{
    var ints = Enumerable.Range(0, 10000000).ToList();

    var sw1 = Stopwatch.StartNew();
    var findall = ints.FindAll(i => i % 2 == 0).Count;
    sw1.Stop();

    var sw2 = Stopwatch.StartNew();
    var where = ints.AsEnumerable().Where(i => i % 2 == 0).Count();
    sw2.Stop();

    var sw4 = Stopwatch.StartNew();
    var cntForeach = 0;
    foreach (var item in ints)
        if (item % 2 == 0)
            cntForeach++;
    sw4.Stop();

    Console.WriteLine("sw1: {0}", sw1.ElapsedTicks);
    Console.WriteLine("sw2: {0}", sw2.ElapsedTicks);
    Console.WriteLine("sw4: {0}", sw4.ElapsedTicks);
}
/* averaged results
sw1 575446.8
sw2 605954.05
sw4 394506.4
*/
Well, at least you can try to measure it.
The static Where method is implemented using an iterator block (the yield keyword), which basically means that execution will be deferred. If you only compare the calls to these two methods, the first one will be slower, since it immediately implies that the whole collection will be iterated.
But if you include the complete iteration of the results you get, things can be a bit different. I'm pretty sure the yield solution is slower, due to the generated state machine mechanism it implies (see Matthew's answer).
I can give some clues, but I'm not sure which one is faster.
FindAll() is executed right away.
Where() is deferred: it executes only when enumerated.
The advantage of Where is deferred execution. See the difference with the following:
BigSequence.FindAll( x => DoIt(x) ).First();
BigSequence.Where( x => DoIt(x) ).First();
FindAll will have covered the complete sequence, while Where will, in most sequences, stop enumerating as soon as one matching element is found.
The same effect applies when using Any(), Take(), Skip(), etc. I'm not sure, but I'd guess you get huge advantages with all functions that have deferred execution.
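A runnable sketch of that early-abort effect, counting how often a (hypothetical) expensive predicate is actually invoked:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class EarlyAbort
{
    static int calls;

    // Stand-in for an expensive predicate; counts its own invocations.
    static bool Expensive(int x)
    {
        calls++;
        return x % 2 == 0;
    }

    static void Main()
    {
        var bigSequence = Enumerable.Range(1, 1000).ToList();

        calls = 0;
        bigSequence.FindAll(Expensive).First(); // filters everything first
        Console.WriteLine(calls);               // 1000

        calls = 0;
        bigSequence.Where(Expensive).First();   // stops at the first even number
        Console.WriteLine(calls);               // 2
    }
}
```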
Related
Question: Why does the execution time of Sum() take much longer than a foreach() in the following scenario?
public void TestMethod4()
{
    List<int> numbers = new List<int>();
    for (int i = 0; i < 1000000000; i++)
    {
        numbers.Add(i);
    }

    Stopwatch sw = Stopwatch.StartNew();
    long totalCount = numbers.Sum(num => true ? 1 : 0); // simulating a dummy true condition
    sw.Stop();
    Console.WriteLine("Time taken Sum() : {0}ms", sw.Elapsed.TotalMilliseconds);

    sw = Stopwatch.StartNew();
    totalCount = 0;
    foreach (var num in numbers)
    {
        totalCount += true ? 1 : 0; // simulating a dummy true condition
    }
    sw.Stop();
    Console.WriteLine("Time taken foreach() : {0}ms", sw.Elapsed.TotalMilliseconds);
}
Sample run1
Time taken Sum() : 21443.8093ms
Time taken foreach() : 4251.9795ms
TL;DR: The difference in times is caused by the CLR applying two separate optimizations in the second case, but not the first case:
Linq's Sum operates on IEnumerable<T>, not List<T>.
The CLR/JIT does have a special-case optimization for foreach with List<T>, but not if a List<T> is passed as IEnumerable<T>.
Which means it's using IEnumerator<T> and incurring the cost of all of the virtual calls associated with that.
Whereas using List<T> directly uses static calls (instance method calls are still "static" calls, provided they're not virtual).
Linq's Sum accepts a delegate Func<T,Int64>.
Functions passed as a delegate Func<T,Int64> are not inlined, even with MethodImplOptions.AggressiveInline.
The cost of a delegate invocation is slightly higher than that of a virtual call.
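The two call shapes the bullets above describe can be sketched like this. These are hypothetical re-implementations for illustration, not the BCL source:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class CallShapes
{
    // Seen as IEnumerable<int>: foreach binds to IEnumerator<int>, so
    // MoveNext()/Current are interface (virtual) calls, and the selector
    // is a delegate invocation per element - this is Linq Sum's shape.
    static long SumViaInterface(IEnumerable<int> source, Func<int, long> selector)
    {
        long total = 0;
        foreach (var item in source)
            total += selector(item);
        return total;
    }

    // Seen as List<int>: foreach binds to the struct List<int>.Enumerator,
    // so MoveNext()/Current are non-virtual calls the JIT can optimize.
    static long SumViaList(List<int> source, Func<int, long> selector)
    {
        long total = 0;
        foreach (var item in source)
            total += selector(item);
        return total;
    }

    static void Main()
    {
        var numbers = Enumerable.Range(0, 100).ToList();
        Console.WriteLine(SumViaInterface(numbers, n => n)); // 4950
        Console.WriteLine(SumViaList(numbers, n => n));      // 4950
    }
}
```

Both produce the same result; only the dispatch costs differ, which is what the benchmark below isolates.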
I've reimplemented your SUM program using a variety of different approaches which you access here: https://gist.github.com/Jehoel/1a4fcd2e70374d3694c3a105061a6d1c
My benchmark results (Release build, x64, .NET Core 5, i7-7700HQ):
Approach                        Time (ms)
Test_Sum_Delegate               118
Test_MySum_DirectFunc_IEnum     112
Test_MySum_IndirectFunc_IEnum   114
Test_MySum_DirectCall_IEnum     89
Test_MySum_DirectFunc_List      58
Test_MySum_IndirectFunc_List    58
Test_MySum_DirectCall_List      37
Test_Sum_DelegateLambda         109
Test_For_Inline                 4
Test_For_Delegate               3
Test_ForUnrolled_Inline         4
Test_ForUnrolled_Delegate       4
Test_ForEach_Inline             38
Test_ForEach_Delegate           37
We can isolate the different behaviour by changing one thing at a time (e.g. foreach vs for, IEnumerable<T> vs List<T>, Func<T> vs direct function calls).
The System.Linq.Enumerable.Sum approach (Test_Sum_Delegate) is identical to Test_MySum_IndirectFunc_IEnum (disregard the 4ms difference between them). Both of these approaches iterate over the List<T> using IEnumerable<T>.
Changing the method to pass the List<T> as a List<T> instead of IEnumerable<T> (in Test_MySum_IndirectFunc_List) eliminates the virtual-calls from foreach using IEnumerator<T> which causes a reduction from ~114ms to 58ms, a 50% reduction in time already.
Then changing the Func<Int64,Int64> (a delegate) call to the GetValue func to a "static" call to GetValue (as in Test_MySum_DirectCall_List) brings the time down to 37ms - which is the same as Test_ForEach_Delegate. This approach is the same as your hand-written foreach loop.
The only way to get faster performance is with a for loop without any virtual calls. (In debug builds the Unrolled loop is even faster than the normal for loop, but in Release builds there's no observed difference).
I have a redis db that has thousands of keys and I'm currently running the following line to get all the keys:
string[] keysArr = keys.Select(key => (string)key).ToArray();
But because I have a lot of keys this takes a long time. I want to limit the number of keys being read. So I'm trying to run an execute command where I get 100 keys at a time:
var keys = Redis.Connection.GetDatabase(dbNum).Execute("scan", 0, "count", 100);
This command runs successfully; however, I am unable to access the value as it is private, and unable to cast it even though the RedisResult class provides an explicit cast to it:
public static explicit operator string[] (RedisResult result);
Any ideas to get x amount of keys at a time from redis?
Thanks
SE.Redis has a .Keys() method on IServer API which fully encapsulates the semantics of SCAN. If possible, just use this method, and consume the data 100 at a time. It is usually pretty easy to write a batching function, i.e.
ExecuteInBatches(server.Keys(), 100, batch => DoSomething(batch));
with:
public void ExecuteInBatches<T>(IEnumerable<T> source, int batchSize,
Action<List<T>> action)
{
List<T> batch = new List<T>();
foreach(var item in source) {
batch.Add(item);
if(batch.Count == batchSize) {
action(batch);
batch = new List<T>(); // in case the callback stores it
}
}
if (batch.Count != 0) {
action(batch); // any leftovers
}
}
The enumerator will worry about advancing the cursor.
You can use Execute, but: that is a lot of work! Also, SCAN makes no guarantees about how many will be returned per page; it can be zero - it can be 3 times what you asked for. It is... guidance only.
Incidentally, the reason that the cast fails is because SCAN doesn't return a string[] - it returns an array of two items, the first of which is the "next" cursor, the second is the keys. So maybe:
var arr = (RedisResult[])server.Execute("scan", 0);
var nextCursor = (int)arr[0];
var keys = (RedisKey[])arr[1];
But all this is doing is re-implementing IServer.Keys, the hard way (and significantly less efficiently - ServerResult is not the ideal way to store data, it is simply necessary in the case of Execute and ScriptEvaluate).
I would use the .Take() method, outlined by Microsoft here.
Returns a specified number of contiguous elements from the start of a
sequence.
It would look something like this:
//limit to 100
var keysArr = keys.Select(key => (string)key).Take(100).ToArray();
If I had a statement such as:
var item = Core.Collections.Items.FirstOrDefault(itm => itm.UserID == bytereader.readInt());
Does this code read an integer from my stream each iteration, or does it read the integer once, store it, then use its value throughout the lookup?
Consider this code:
static void Main(string[] args)
{
new[] { 1, 2, 3, 4 }.FirstOrDefault(j => j == Get());
Console.ReadLine();
}
static int i = 5;
static int Get()
{
Console.WriteLine("GET:" + i);
return i--;
}
It shows that it will call the method as many times as it needs to reach the first element matching the condition. The output will be:
GET:5
GET:4
GET:3
I don't know without checking, but I would expect it to read it each time.
But this is very easily remedied with the following version of your code:
int val = bytereader.readInt();
var item = Core.Collections.Items.FirstOrDefault(itm => itm.UserID == val);
Myself, I would automatically take this approach anyway, just to remove any doubt. It might be a good habit to form, as there is no reason to read it for each item.
It's actually quite obvious that the call is performed for each item - FirstOrDefault() takes a delegate as an argument. This fact is a bit obscured by the use of a lambda, but in the end the method only sees a delegate that it can call for each item to check the predicate. To evaluate the right-hand side only once, some magic mechanism would have to understand and rewrite the method, and (sometimes sadly) there is no real magic inside compilers and runtimes.
I am trying to utilize the parallel for loop in .NET Framework 4.0. However I noticed that, I am missing some elements in the result set.
I have snippet of code as below. lhs.ListData is a list of nullable double and rhs.ListData is a list of nullable double.
int recordCount = lhs.ListData.Count > rhs.ListData.Count ? rhs.ListData.Count : lhs.ListData.Count;
List<double?> listResult = new List<double?>(recordCount);
var rangePartitioner = Partitioner.Create(0, recordCount);

Parallel.ForEach(rangePartitioner, range =>
{
    for (int index = range.Item1; index < range.Item2; index++)
    {
        double? result = lhs.ListData[index] * rhs.ListData[index];
        listResult.Add(result);
    }
});
lhs.ListData has a length of 7964 and rhs.ListData has a length of 7962. After the "*" operation, listResult contains only 7867 elements. There are null elements in both input lists.
I am not sure what is happening during the execution. Is there any reason why I am seeing fewer elements in the result set? Please advise...
The correct way to do this is to use LINQ's AsParallel() extension. It does all of the partitioning for you, and everything in PLINQ is inherently thread-safe. There is another LINQ extension called Zip that zips two collections into one, based on a function that you give it. However, this isn't exactly what you need, as it only goes to the length of the shorter of the two lists, not the longer. It would probably be easiest to use it anyway, but first expand the shorter of the two lists to the length of the longer one by padding it with null at the end.
IEnumerable<double?> lhs, rhs; // Assume these are filled with your numbers.
double?[] result = System.Linq.Enumerable.Zip(lhs, rhs, (a, b) => a * b).AsParallel().ToArray();
Here's the MSDN page on Zip:
http://msdn.microsoft.com/en-us/library/dd267698%28VS.100%29.aspx
That's probably because the operations on a List<T> (e.g. Add) are not thread safe - your results may vary. As a workaround you could use a lock, but that would very much reduce performance.
It looks like you just want each item in the result list to be the product of the items at the corresponding index in the two input lists, how about this instead using PLINQ:
var listResult = lhs.AsParallel()
.Zip(rhs.AsParallel(), (a,b) => a*b)
.ToList();
Not sure why you chose parallelism here, I would benchmark if this is even necessary - is this truly the bottleneck in your application?
You are using List<double?> to store the results, but its Add method is not thread safe.
You can use an explicit index to store each result (instead of calling Add). Note that the capacity passed to the List constructor does not add elements, so the storage must be pre-sized first - use an array, or a list pre-filled with recordCount default values:
listResult[index] = result;
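Putting that together, a sketch using the question's shape (the input data here is made up for illustration):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class ParallelMultiply
{
    static void Main()
    {
        // Stand-ins for lhs.ListData and rhs.ListData from the question.
        List<double?> lhsData = Enumerable.Range(0, 7964).Select(i => (double?)i).ToList();
        List<double?> rhsData = Enumerable.Range(0, 7962).Select(i => (double?)i).ToList();

        int recordCount = Math.Min(lhsData.Count, rhsData.Count);

        // An array gives every index a slot up front, so each partition
        // writes to distinct indices - no shared Add call, no lock needed.
        double?[] result = new double?[recordCount];

        Parallel.ForEach(Partitioner.Create(0, recordCount), range =>
        {
            for (int index = range.Item1; index < range.Item2; index++)
            {
                result[index] = lhsData[index] * rhsData[index];
            }
        });

        Console.WriteLine(result.Length); // 7962 - no lost elements
        Console.WriteLine(result[10]);    // 100
    }
}
```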
Is the LINQ Count() method any faster or slower than List<>.Count or Array.Length?
In general, slower. LINQ's Count is an O(N) operation, while List.Count and Array.Length are both guaranteed to be O(1).
However, in some cases LINQ will special-case the IEnumerable<T> parameter by casting to certain interface types such as IList<T> or ICollection<T>. It will then use that Count property for the Count() operation, so it goes back down to O(1). But you still pay the minor overhead of the cast and the interface call.
The Enumerable.Count() method checks for ICollection<T>, using .Count - so in the case of arrays and lists, it is not much more inefficient (just an extra level of indirection).
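A sketch of that fast path (illustrative, not the actual BCL source):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class CountFastPath
{
    // Roughly what Enumerable.Count() does: O(1) when the source
    // exposes ICollection<T>.Count, O(n) enumeration otherwise.
    static int CountSketch<T>(IEnumerable<T> source)
    {
        if (source is ICollection<T> collection)
            return collection.Count; // List<T>, T[], etc.

        int count = 0;
        using (var e = source.GetEnumerator())
            while (e.MoveNext())
                count++;             // fallback: walk the sequence
        return count;
    }

    static void Main()
    {
        Console.WriteLine(CountSketch(new List<int> { 1, 2, 3 }));                // 3, via ICollection<T>
        Console.WriteLine(CountSketch(Enumerable.Range(0, 5).Where(i => i > 1))); // 3, via enumeration
    }
}
```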
Marc has the right answer but the devil is in the detail.
On my machine:
For arrays .Length is about 100 times faster than .Count()
For Lists .Count is about 10 times faster than .Count() - Note: I would expect similar performance from all Collections that implement IList<T>
Arrays start off slower with Count() because .Length involves only a single operation, while .Count() on arrays involves a layer of indirection. So .Count() on arrays starts off 10x slower (on my machine), which could be one of the reasons the interface is implemented explicitly. Imagine if you had an object with two public properties, .Count and .Length, both doing the exact same thing, but with .Count 10x slower.
Of course, none of this really makes much of a difference, since you would have to be counting your arrays and lists millions of times a second to feel a performance hit.
Code:
static void TimeAction(string description, int times, Action func)
{
    var watch = new Stopwatch();
    watch.Start();
    for (int i = 0; i < times; i++)
    {
        func();
    }
    watch.Stop();
    Console.Write(description);
    Console.WriteLine(" Time Elapsed {0} ms", watch.ElapsedMilliseconds);
}

static void Main(string[] args)
{
    var array = Enumerable.Range(0, 10000000).ToArray();
    var list = Enumerable.Range(0, 10000000).ToArray().ToList();

    // jit
    TimeAction("Ignore and jit", 1, () =>
    {
        var junk = array.Length;
        var junk2 = list.Count;
        array.Count();
        list.Count();
    });

    TimeAction("Array Length", 1000000, () =>
    {
        var tmp1 = array.Length;
    });

    TimeAction("Array Count()", 1000000, () =>
    {
        var tmp2 = array.Count();
    });

    TimeAction("Array Length through cast", 1000000, () =>
    {
        var tmp3 = (array as ICollection<int>).Count;
    });

    TimeAction("List Count", 1000000, () =>
    {
        var tmp1 = list.Count;
    });

    TimeAction("List Count()", 1000000, () =>
    {
        var tmp2 = list.Count();
    });

    Console.ReadKey();
}
Results:
Array Length Time Elapsed 3 ms
Array Count() Time Elapsed 264 ms
Array Length through cast Time Elapsed 16 ms
List Count Time Elapsed 3 ms
List Count() Time Elapsed 18 ms
I believe that if you call LINQ's Count() on an ICollection or IList (like an ArrayList or List<T>), it will just return the Count property's value, so the performance will be about the same as on plain collections.
I would say it depends on the list. If it is an IQueryable backed by a table in a database somewhere, then Count() will be much faster because it doesn't have to load all of the objects. But if the list is in memory, I would guess the Count property would be faster, if not about the same.
Some additional info: with LINQ's Count(), the difference between using it and not can be huge - and this doesn't have to be over 'large' collections either. I have a collection formed from LINQ to Objects with about 6,500 items (big, but not huge by any means). Count() in my case takes several seconds. After converting to a list (or array, whatever), the count is virtually immediate. Having this Count() in an inner loop means the impact can be huge: Count() enumerates through everything, while an array and a list are both 'self-aware' of their lengths and do not need to enumerate. Any debug statements (log4net, for example) that reference this Count() will also slow everything down considerably more. Do yourself a favor: if you need to reference it often, save the count once and only call Count() once on a LINQ collection - unless you convert it to a list, after which you can reference the count without a performance hit.
Here is a quick test of what I was talking about above. Note that every time we call Count() our collection size changes, so evaluation happens on each call, which is more than the expected 'count' operation. Just something to be aware of :)
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace LinqTest
{
    class TestClass
    {
        public TestClass()
        {
            CreateDate = DateTime.Now;
        }
        public DateTime CreateDate;
    }

    class Program
    {
        static void Main(string[] args)
        {
            // Populate the test class
            List<TestClass> list = new List<TestClass>(1000);
            for (int i = 0; i < 1000; i++)
            {
                System.Threading.Thread.Sleep(20);
                list.Add(new TestClass());
                if (i % 100 == 0)
                {
                    Console.WriteLine(i.ToString() + " items added");
                }
            }

            // Now query for items
            var newList = list.Where(o => o.CreateDate.AddSeconds(5) > DateTime.Now);
            while (newList.Count() > 0)
            {
                // Note - our actual count keeps decreasing, showing that the
                // query re-executes every time we call Count().
                Console.WriteLine(newList.Count());
                System.Threading.Thread.Sleep(500);
            }
        }
    }
}