I am trying to use the parallel for loop in .NET Framework 4.0. However, I noticed that I am missing some elements in the result set.
I have the snippet of code below. lhs.ListData is a list of nullable doubles and rhs.ListData is a list of nullable doubles.
int recordCount = lhs.ListData.Count > rhs.ListData.Count ? rhs.ListData.Count : lhs.ListData.Count;
List<double?> listResult = new List<double?>(recordCount);
var rangePartitioner = Partitioner.Create(0, recordCount);
Parallel.ForEach(rangePartitioner, range =>
{
    for (int index = range.Item1; index < range.Item2; index++)
    {
        double? result = lhs.ListData[index] * rhs.ListData[index];
        listResult.Add(result);
    }
});
lhs.ListData has a length of 7964 and rhs.ListData has a length of 7962. When I perform the "*" operation, listResult contains only 7867 elements. There are null elements in both input lists.
I am not sure what is happening during the execution. Is there any reason why I am seeing fewer elements in the result set? Please advise...
The correct way to do this is to use LINQ's IEnumerable.AsParallel() extension. It does all of the partitioning for you, and everything in PLINQ is inherently thread-safe. There is another LINQ extension called Zip that zips together two collections into one, based on a function that you give it. However, this isn't exactly what you need, as it only goes to the length of the shorter of the two lists, not the longer. It would probably be easiest to use it anyway, but first expand the shorter of the two lists to the length of the longer one by padding it with null at the end (a sketch follows below the link).
IEnumerable<double?> lhs, rhs; // Assume these are filled with your numbers.
double?[] result = System.Linq.Enumerable.Zip(lhs, rhs, (a, b) => a * b).AsParallel().AsOrdered().ToArray();
Here's the MSDN page on Zip:
http://msdn.microsoft.com/en-us/library/dd267698%28VS.100%29.aspx
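If you need the result to cover every index of the longer list, here is a sketch of the padding step (assuming lhs and rhs are the List&lt;double?&gt; instances from the question; null times anything is null for nullable doubles, so padded slots simply stay null):
int longer = Math.Max(lhs.Count, rhs.Count);

// Pad the shorter sequence with nulls so Zip runs to the longer length.
IEnumerable<double?> lhsPadded = lhs.Concat(Enumerable.Repeat((double?)null, longer - lhs.Count));
IEnumerable<double?> rhsPadded = rhs.Concat(Enumerable.Repeat((double?)null, longer - rhs.Count));

double?[] result = lhsPadded.Zip(rhsPadded, (a, b) => a * b).ToArray();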
That's probably because the operations on a List<T> (e.g. Add) are not thread safe - your results may vary. As a workaround you could use a lock, but that would very much reduce performance.
It looks like you just want each item in the result list to be the product of the items at the corresponding index in the two input lists; how about this instead, using PLINQ:
var listResult = lhs.AsParallel().AsOrdered()
                    .Zip(rhs.AsParallel().AsOrdered(), (a, b) => a * b)
                    .ToList();
Not sure why you chose parallelism here, I would benchmark if this is even necessary - is this truly the bottleneck in your application?
You are using List<double?> to store results, but the Add method is not thread-safe.
You can use an explicit index to store the result (instead of calling Add):
listResult[index] = result;
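Note, though, that new List&lt;double?&gt;(recordCount) only sets the capacity, not the Count, so indexed writes into the still-empty list would throw. A minimal sketch that avoids this by writing into a pre-sized array (reusing the question's variables):
double?[] resultArray = new double?[recordCount];

Parallel.ForEach(rangePartitioner, range =>
{
    for (int index = range.Item1; index < range.Item2; index++)
    {
        // Each iteration writes only to its own slot, so no locking is needed.
        resultArray[index] = lhs.ListData[index] * rhs.ListData[index];
    }
});

// Convert back to a list afterwards if one is required.
List<double?> listResult = resultArray.ToList();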
I'm trying to figure out the best way to perform a computation fast and wanted to find out what sort of approach people would usually take in a situation like this.
I have a List of objects which have properties that I want to compute the mean and standard deviation of. I thought using this Math.NET library would probably be easier/optimised for performance.
Unfortunately, the input arguments for these functions are arrays. Is my only solution to write my own function to compute means and STDs? Could I write some sort of extension method for lists that uses lambda functions like here? Or am I better off writing functions that return arrays of my object properties and using these with Math.NET?
Presumably the answer depends on some things like the size of the list? Let's say for argument's sake that the list has 50 elements. My concern is purely performance.
ArrayStatistics indeed expects arrays, as it is optimized for this special case (that's why it is called ArrayStatistics). Similarly, StreamingStatistics is optimized for IEnumerable sequence streaming without keeping data in memory. The general class that works with all kinds of input is the Statistics class.
Have you verified that simply using LINQ and StreamingStatistics is not fast enough in your use case? Computing these statistics for a list of merely 50 entries is barely measurable at all, unless, say, you do it a million times in a loop.
Example with Math.NET Numerics v3.0.0-alpha7, using Tuples in a list to emulate your custom types:
using MathNet.Numerics.Statistics;
var data = new List<Tuple<string, double>>
{
    Tuple.Create("A", 1.0),
    Tuple.Create("B", 2.0),
    Tuple.Create("C", 1.5)
};
// using the normal extension methods within `Statistics`
var stdDev1 = data.Select(x => x.Item2).StandardDeviation();
var mean1 = data.Select(x => x.Item2).Mean();
// single pass variant (unfortunately there's no single pass MeanStdDev yet):
var meanVar2 = data.Select(x => x.Item2).MeanVariance();
var mean2 = meanVar2.Item1;
var stdDev2 = Math.Sqrt(meanVar2.Item2);
// directly using the `StreamingStatistics` class:
StreamingStatistics.MeanVariance(data.Select(x => x.Item2));
The easiest solution is to use LINQ to transform the List into an array:
List<SomeClass> list = ...
GetMeanAndStdError(list.Select(x => x.SomeProperty).ToArray()); // <- Not that good performance
However, if performance is your concern, you'd rather compute the mean and variance explicitly (write your own function):
List<SomeClass> list = ...
Double sumX = 0.0;
Double sumXX = 0.0;
foreach (var item in list) {
    Double x = item.SomeProperty;
    sumX += x;
    sumXX += x * x;
}
Double mean = sumX / list.Count;
Double variance = sumXX / list.Count - mean * mean;
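For the standard deviation, take the square root of the variance. Note that the loop above yields the population variance; a quick sketch of both forms (standard formulas, not from the original answer):
// Population standard deviation:
Double stdDev = Math.Sqrt(variance);

// Sample (unbiased) variance uses the n - 1 divisor instead:
Double sampleVariance = (sumXX - list.Count * mean * mean) / (list.Count - 1);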
I'm working on a project, and I find myself repeatedly looping over a jagged array using nested for loops. I'm wondering if there might be a neater way of doing it using foreach?
Here's what I mean:
for (int ii = 0; ii < xDimension; ii++)
{
    for (int jj = 0; jj < yDimension; jj++)
    {
        OutputArray[ii][jj] = someFunction(InputArray[ii][jj]);
    }
}
Note that I'm using jagged arrays even though my data is of fixed size, because jagged arrays are faster than multidimensional arrays. Speed is an issue in this project, so performance will unfortunately outweigh my own OCD coding desires.
Is there a way to do this with foreach that avoids the nested for loops but puts the output data in the correct place in OutputArray? Would there be any benefit/loss from doing so (if it is possible), other than having slightly neater code?
If you really want to create a jagged array as the result, you could use Array.ConvertAll twice:
var result = Array.ConvertAll(input,
array => Array.ConvertAll(array, SomeFunction));
This is slightly more efficient than using Select/ToArray from LINQ, as it knows it's converting an array to another array, so can create the target arrays immediately. Using Select followed by ToArray requires the results to be built up gradually, as if you were putting them into a List<T>, copying them when the buffer is exhausted - and then "right-sizing" the array at the end.
On the other hand, using LINQ (as per Daniel's answer) would probably be more idiomatic these days... and the performance difference will usually be insignificant. I thought I'd give this as another option :)
(Note that this creates new arrays, ignoring the existing OutputArray... I'm assuming you can get rid of the creation of the existing OutputArray, although that may not be the case...)
The following code is "neater":
OutputArray = InputArray.Select(x => x.Select(y => someFunction(y)).ToArray())
.ToArray();
But I would just go with the loops, because this LINQ version has a significant disadvantage: It creates new arrays instead of using the existing ones in OutputArray. This argument is moot if you create OutputArray right before the loop you showed us.
Furthermore, it is quite a lot harder to read.
You could get data out of the jagged array more easily using a foreach, but since you wouldn't have variables holding the indexes, you wouldn't be able to set values in the array effectively, as you do in your example.
If you just wanted to read the values though, you can do this:
foreach (int n in OutputArray.SelectMany(array => array))
{
    Console.WriteLine(n);
}
The SelectMany is needed to flatten the sequence, so that instead of being a sequence of sequences it is just a single sequence.
You can even get the indices out with LINQ:
foreach (var t in InputArray.SelectMany(
    (inner, ii) => inner.Select((val, jj) => new { val, ii, jj })))
{
    OutputArray[t.ii][t.jj] = someFunction(t.val);
}
but, to be honest, the twin-for-loop construct is a lot more maintainable.
Another alternative, just for fun:
foreach (var item in InputArray.SelectMany(x => x).Select((value, index) => new { value, index }))
{
    var x = item.index / yDimension;
    var y = item.index % yDimension;
    OutputArray[x][y] = someFunction(item.value);
}
I'd just stick with the nested loops, though.
I have a List<byte[]> and I'd like to deserialize each byte[] into Foo. The list is ordered, and I'd like to write a parallel loop in which the resulting List<Foo> contains all Foos in the same order as the original byte[]s. The list is large enough to make parallel operation worthwhile. Is there a built-in way to accomplish this?
If not, any ideas how to achieve a speedup over running this all synchronously?
Thanks
From the info you've given, I understand you want to have an output array of Foo with size equal to the input array of bytes? Is this correct?
If so, yes, the operation is simple. Don't bother with locking or synchronized constructs; these will erode all the speed-up that parallelization gives you.
Instead, if you obey this simple rule any algorithm can be parallelized without locking or synchronization:
For each input element X[i] processed, you may read from any input element X[j], but only write to output element Y[i]
Look up Scatter/Gather, this type of operation is called a gather as only one output element is written to.
If you can use the above principle then you want to create your output array Foo[] up front, and use Parallel.For not ForEach on the input array.
E.g.
List<byte[]> inputArray = new List<byte[]>();
int[] outputArray = new int[inputArray.Count];

var waitHandle = new ManualResetEvent(false);
int counter = 0;

Parallel.For(0, inputArray.Count, index =>
{
    // Pass index to for loop, do long running operation
    // on input items, writing to only a single output item
    outputArray[index] = DoOperation(inputArray[index]);

    if (Interlocked.Increment(ref counter) == inputArray.Count)
    {
        waitHandle.Set();
    }
});

// Parallel.For itself blocks until all iterations have completed, so the
// wait handle is strictly needed only if the work is kicked off elsewhere.
waitHandle.WaitOne();

// Optional conversion back to list if you wanted this
var outputList = outputArray.ToList();
You can use a thread-safe dictionary with the index as an int key to store the result from Foo, so at the end you will have all the data ordered in the dictionary.
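A minimal sketch of that idea using ConcurrentDictionary (input is assumed to be the List&lt;byte[]&gt; from the question, and Deserialize is a hypothetical stand-in for whatever turns a byte[] into a Foo):
var results = new ConcurrentDictionary<int, Foo>();

Parallel.For(0, input.Count, i =>
{
    // Each index is written exactly once, so there is no contention per key.
    results[i] = Deserialize(input[i]);
});

// Rebuild the ordered list from the index keys.
List<Foo> ordered = Enumerable.Range(0, input.Count).Select(i => results[i]).ToList();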
I'm trying to find a solution to this problem:
Given an IEnumerable<IEnumerable<int>>, I need a method/algorithm that returns the input, but where several IEnumerable<int>s contain the same elements, only one per group of equals is returned.
ex.
IEnumerable<IEnumerable<int>> seqs = new[]
{
    new[] { 2, 3, 4 }, // #0
    new[] { 1, 2, 4 }, // #1 - equals #3
    new[] { 3, 1, 4 }, // #2
    new[] { 4, 1, 2 }  // #3 - equals #1
};
"foreach seq in seqs" .. yields {#0,#1,#2} or {#0,#2,#3}
Should I go with ..
.. some clever IEqualityComparer
.. some clever LINQ combination I havent figured out - groupby, sequenceequal ..?
.. some seq->HashSet stuff
.. what not. Anything will help
I'll be able to solve it by good'n'old programming but inspiration is always appreciated.
Here's a slightly simpler version of digEmAll's answer:
var result = seqs.Select(x => new HashSet<int>(x))
.Distinct(HashSet<int>.CreateSetComparer());
Given that you want to treat the elements as sets, you should have them that way to start with, IMO.
Of course this won't help if you want to maintain order within the sequences that are returned, you just don't mind which of the equal sets is returned... the above code will return an IEnumerable<HashSet<int>> which will no longer have any ordering within each sequence. (The order in which the sets are returned isn't guaranteed either, although it would be odd for them not to be returned on a first-seen-first-returned basis.)
It feels unlikely that this wouldn't be enough, but if you could give more details of what you really need to achieve, that would make it easier to help.
As noted in comments, this will also assume that there are no duplicates within each original source array... or at least, that they're irrelevant, so you're happy to treat { 1 } and { 1, 1, 1, 1 } as equal.
Use the correct collection type for the job. What you really want is an ISet<IEnumerable<int>> with an equality comparer that ignores the ordering of the IEnumerables.
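As a sketch, that could look like the following, using an order-insensitive comparer such as the MyEqualityComparer defined in the next answer:
// HashSet<T> implements ISet<T>; sequences equal under the comparer are dropped on Add.
var set = new HashSet<IEnumerable<int>>(new MyEqualityComparer());
foreach (var seq in seqs)
{
    set.Add(seq); // returns false for sequences already present
}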
EDITED:
You can get what you want by building your own IEqualityComparer<IEnumerable<int>> e.g.:
public class MyEqualityComparer : IEqualityComparer<IEnumerable<int>>
{
    public bool Equals(IEnumerable<int> x, IEnumerable<int> y)
    {
        return x.OrderBy(el1 => el1).SequenceEqual(y.OrderBy(el2 => el2));
    }

    public int GetHashCode(IEnumerable<int> elements)
    {
        int hash = 0;
        foreach (var el in elements)
        {
            hash = hash ^ el.GetHashCode();
        }
        return hash;
    }
}
Usage:
var values = seqs.Distinct(new MyEqualityComparer()).ToList();
N.B.
This solution is slightly different from the one given by Jon Skeet.
His answer considers sublists as sets, so basically two lists like [1,2] and [1,1,1,2,2] are equal.
This solution doesn't, i.e.:
[1,2,1,1] is equal to [2,1,1,1] but not to [2,2,1,1]; the two lists have to contain the same elements with the same number of occurrences.
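A quick illustration of the difference (hypothetical usage, not from the original answer):
var multisetComparer = new MyEqualityComparer();
var setComparer = HashSet<int>.CreateSetComparer();

// Multiset semantics: element counts matter.
multisetComparer.Equals(new[] { 1, 2, 1, 1 }, new[] { 2, 1, 1, 1 }); // true
multisetComparer.Equals(new[] { 1, 2, 1, 1 }, new[] { 2, 2, 1, 1 }); // false

// Set semantics: counts are ignored, only membership matters.
setComparer.Equals(new HashSet<int> { 1 }, new HashSet<int> { 1, 1, 1 }); // true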
I just want to know if a "FindAll" will be faster than a "Where" extension method, and why.
Example :
myList.FindAll(item=> item.category == 5);
or
myList.Where(item=> item.category == 5);
Which is better?
Well, FindAll copies the matching elements to a new list, whereas Where just returns a lazily evaluated sequence - no copying is required.
I'd therefore expect Where to be slightly faster than FindAll even when the resulting sequence is fully evaluated - and of course the lazy evaluation strategy of Where means that if you only look at (say) the first match, it won't need to check the remainder of the list. (As Matthew points out, there's work in maintaining the state machine for Where. However, this will only have a fixed memory cost - whereas constructing a new list may require multiple array allocations etc.)
Basically, FindAll(predicate) is closer to Where(predicate).ToList() than to just Where(predicate).
Just to react a bit more to Matthew's answer, I don't think he's tested it quite thoroughly enough. His predicate happens to pick half the items. Here's a short but complete program which tests the same list but with three different predicates - one picks no items, one picks all the items, and one picks half of them. In each case I run the test fifty times to get longer timing.
I'm using Count() to make sure that the Where result is fully evaluated. The results show that when collecting around half the results, the two are neck and neck. Collecting no results, FindAll wins. Collecting all the results, Where wins. I find this intriguing: all of the solutions become slower as more and more matches are found: FindAll has more copying to do, and Where has to return the matched values instead of just looping within the MoveNext() implementation. However, FindAll gets slower faster than Where does, so it loses its early lead. Very interesting.
Results:
FindAll: All: 11994
Where: All: 8176
FindAll: Half: 6887
Where: Half: 6844
FindAll: None: 3253
Where: None: 4891
(Compiled with /o+ /debug- and run from the command line, .NET 3.5.)
Code:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Test
{
    static List<int> ints = Enumerable.Range(0, 10000000).ToList();

    static void Main(string[] args)
    {
        Benchmark("All", i => i >= 0);      // Match all
        Benchmark("Half", i => i % 2 == 0); // Match half
        Benchmark("None", i => i < 0);      // Match none
    }

    static void Benchmark(string name, Predicate<int> predicate)
    {
        // We could just use new Func<int, bool>(predicate) but that
        // would create one delegate wrapping another.
        Func<int, bool> func = (Func<int, bool>)
            Delegate.CreateDelegate(typeof(Func<int, bool>), predicate.Target,
                predicate.Method);
        Benchmark("FindAll: " + name, () => ints.FindAll(predicate));
        Benchmark("Where: " + name, () => ints.Where(func).Count());
    }

    static void Benchmark(string name, Action action)
    {
        GC.Collect();
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < 50; i++)
        {
            action();
        }
        sw.Stop();
        Console.WriteLine("{0}: {1}", name, sw.ElapsedMilliseconds);
    }
}
How about we test instead of guess? Shame to see the wrong answer get out.
var ints = Enumerable.Range(0, 10000000).ToList();
var sw1 = Stopwatch.StartNew();
var findall = ints.FindAll(i => i % 2 == 0);
sw1.Stop();
var sw2 = Stopwatch.StartNew();
var where = ints.Where(i => i % 2 == 0).ToList();
sw2.Stop();
Console.WriteLine("sw1: {0}", sw1.ElapsedTicks);
Console.WriteLine("sw2: {0}", sw2.ElapsedTicks);
/*
Debug
sw1: 1149856
sw2: 1652284
Release
sw1: 532194
sw2: 1016524
*/
Edit:
Even if I turn the above code from
var findall = ints.FindAll(i => i % 2 == 0);
...
var where = ints.Where(i => i % 2 == 0).ToList();
... to ...
var findall = ints.FindAll(i => i % 2 == 0).Count;
...
var where = ints.Where(i => i % 2 == 0).Count();
I get these results
/*
Debug
sw1: 1250409
sw2: 1267016
Release
sw1: 539536
sw2: 600361
*/
Edit 2.0...
If you want a list of the subset of the current list, the fastest method is FindAll(). The reason for this is simple: the FindAll instance method uses the indexer on the current List instead of the enumerator state machine. The Where() extension method is an external call to a different class that uses the enumerator. If you step from each node in the list to the next node, you have to call the MoveNext() method under the covers. As you can see from the above examples, it is even faster to use the index entries to create a new list (one that points to the original items, so memory bloat will be minimal) than to just get a count of the filtered items.
Now, if you are going to abort early from the enumerator, the Where() method could be faster. Of course, if you move the early-abort logic into the predicate of the FindAll() method, you will again be using the indexer instead of the enumerator.
Now there are other reasons to use the Where() statement (such as the other LINQ methods, foreach blocks and many more), but the question was: is FindAll() faster than Where()? And unless you don't execute the Where(), the answer seems to be yes. (When comparing apples to apples.)
I am not saying don't use LINQ or the .Where() method. They make for code that is much simpler to read. The question was about performance, not about how easily you can read and understand the code. By far the fastest way to do this work would be to use a for block stepping through each index and doing whatever logic you want (even early exits). The reason LINQ is so great is because of the complex expression trees and transformations you can get with them. But using the iterator from the .Where() method has to go through tons of code to find its way to an in-memory state machine that is just getting the next index out of the List. It should also be noted that the .FindAll() method is only available on objects that implement it (such as Array and List).
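For reference, here is a sketch of the indexed for block the paragraph above describes, filtering the same ints list as the benchmarks (early exit shown as a comment):
var matches = new List<int>();
for (int i = 0; i < ints.Count; i++)
{
    // Direct indexer access; no enumerator state machine involved.
    if (ints[i] % 2 == 0)
    {
        matches.Add(ints[i]);
        // if (matches.Count == limit) break; // early exit is trivial here
    }
}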
Yet more...
for (int x = 0; x < 20; x++)
{
    var ints = Enumerable.Range(0, 10000000).ToList();

    var sw1 = Stopwatch.StartNew();
    var findall = ints.FindAll(i => i % 2 == 0).Count;
    sw1.Stop();

    var sw2 = Stopwatch.StartNew();
    var where = ints.AsEnumerable().Where(i => i % 2 == 0).Count();
    sw2.Stop();

    var sw4 = Stopwatch.StartNew();
    var cntForeach = 0;
    foreach (var item in ints)
        if (item % 2 == 0)
            cntForeach++;
    sw4.Stop();

    Console.WriteLine("sw1: {0}", sw1.ElapsedTicks);
    Console.WriteLine("sw2: {0}", sw2.ElapsedTicks);
    Console.WriteLine("sw4: {0}", sw4.ElapsedTicks);
}
/* averaged results
sw1 575446.8
sw2 605954.05
sw4 394506.4
*/
Well, at least you can try to measure it.
The static Where method is implemented using an iterator block (the yield keyword), which basically means that execution will be deferred. If you only compare the calls to these two methods, the first one will be slower, since it immediately implies that the whole collection will be iterated.
But if you include the complete iteration of the results you get, things can be a bit different. I'm pretty sure the yield solution is slower, due to the generated state machine mechanism it implies (see @Matthew's answer).
I can give some clues, but I'm not sure which one is faster.
FindAll() is executed right away.
Where() is deferred: it executes only when the results are enumerated.
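A small sketch illustrating that difference (an illustrative example, not from the original answer):
var numbers = new List<int> { 1, 2, 3 };

var found = numbers.FindAll(n => n > 1);   // evaluated immediately: snapshot {2, 3}
var filtered = numbers.Where(n => n > 1);  // nothing runs yet

numbers.Add(4);

Console.WriteLine(found.Count);      // 2 - the snapshot doesn't see the new element
Console.WriteLine(filtered.Count()); // 3 - the query runs now and sees 4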
The advantage of Where is the deferred execution. See the difference if you had the following functionality:
BigSequence.FindAll( x => DoIt(x) ).First();
BigSequence.Where( x => DoIt(x) ).First();
FindAll has processed the complete sequence, while Where in most sequences will stop enumerating as soon as one element is found.
The same effect applies when using Any(), Take(), Skip(), etc. I'm not sure, but I'd guess you have huge advantages in all functions that have deferred execution.