Select equal "chunks" from IEnumerable Range for Parallel For loop - c#

This is a two-part question.
I have programmatically determined a range of double values:
public static void Main(string[] args)
{
    var startRate = 0.0725;
    var rateStep = 0.001;
    var maxRate = 0.2;
    var stepsFromStartToMax = (int)Math.Ceiling((maxRate - startRate) / rateStep);
    var allRateSteps = Enumerable.Range(0, stepsFromStartToMax)
        .Select(i => startRate + (maxRate - startRate) * ((double)i / (stepsFromStartToMax - 1)))
        .ToArray();
    foreach (var i in allRateSteps)
    {
        Console.WriteLine(i); // this prints the correct values
    }
}
I would like to divide this list of numbers into chunks based on the processor count, which I can get from Environment.ProcessorCount (usually 8). Ideally, I would end up with something like a List of Tuples, where each Tuple contains the start and end values for each chunk:
[(0.0725, 0.0813), (0.0815, 0.0955), ...]
1) How do you select out the inner ranges in less code, without having to know how many tuples I will need? I've come up with a long way to do this with loops, but I'm hoping LINQ can help here:
var counter = 0;
var listOne = new List<double>();
//...
var listEight = new List<double>();
foreach (var i in allRateSteps)
{
    counter++;
    if (counter < allRateSteps.Length / 8)
    {
        listOne.Add(i);
    }
    //...
    else if (counter < allRateSteps.Length / 1)
    {
        listEight.Add(i);
    }
}
// Now that I have lists, I can get their First() and Last() to create tuples
var tupleList = new List<Tuple<double, double>>
{
    new Tuple<double, double>(listOne.First(), listOne.Last()),
    //...
    new Tuple<double, double>(listEight.First(), listEight.Last())
};
Once I have this new list of range Tuples, I want to use each of them as the basis for a parallel loop which writes to a ConcurrentDictionary under certain conditions. I'm not sure how to get this code into my loop...
I've got this piece of code working on multiple threads, but 2) how do I evenly distribute the work across all processors based on the ranges I've defined in tupleList:
var maxRateObj = new ConcurrentDictionary<string, double>();
var startTime = DateTime.Now;
Parallel.For(0,
    stepsFromStartToMax,
    new ParallelOptions
    {
        MaxDegreeOfParallelism = Environment.ProcessorCount
    },
    x =>
    {
        var i = (x * rateStep) + startRate;
        Console.WriteLine("{0} : {1} : {2} ",
            i,
            DateTime.Now - startTime,
            Thread.CurrentThread.ManagedThreadId);
        if (!maxRateObj.Any())
        {
            maxRateObj["highestRateSoFar"] = i;
        }
        else
        {
            if (i > maxRateObj["highestRateSoFar"])
            {
                maxRateObj["highestRateSoFar"] = i;
            }
        }
    });
This prints out, e.g.:
...
0.1295 : 00:00:00.4846470 : 5
0.0825 : 00:00:00.4846720 : 8
0.1645 : 00:00:00.4844220 : 6
0.0835 : 00:00:00.4847510 : 8
...
Thread1 needs to handle the ranges in the first tuple, thread2 handles the range defined in the second tuple, etc., where i is defined by the range in the loop. Again, the number of range tuples will depend on the number of processors. Thanks.

I would like to divide this list of numbers up into chunks based on the processor count
There are many possible implementations for a LINQ Batch method.
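For reference, here is a minimal sketch of such a Batch method (MoreLINQ ships a more robust implementation; this one simply buffers up to size items at a time, which is enough for the examples below):
using System.Collections.Generic;

public static class EnumerableExtensions
{
    // Yields successive buffers of up to `size` items from the source.
    public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int size)
    {
        var bucket = new List<T>(size);
        foreach (var item in source)
        {
            bucket.Add(item);
            if (bucket.Count == size)
            {
                yield return bucket;
                bucket = new List<T>(size);
            }
        }
        if (bucket.Count > 0)
            yield return bucket; // final partial batch
    }
}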
How do you select out the inner ranges in less code, without having to know how many tuples I will need?
Here's one way to handle that:
var batchRanges = from batch in allRateSteps.Batch(anyNumberGoesHere)
                  let first = batch.First()
                  let last = batch.Last()
                  select Tuple.Create(first, last);
(0.0725, 0.0795275590551181)
(0.0805314960629921, 0.0875590551181102)
(0.0885629921259842, 0.0955905511811024)
...
how do I evenly distribute the work across all processors based on the ranges I've defined in tupleList
This part of your example doesn't reference tupleList so it's hard to see the desired behavior.
Thread1 needs to handle the ranges in the first tuple, thread2 handles the range defined in the second tuple, etc...
Unless you have some hard requirement that certain threads process certain batches, I would strongly suggest generating your work as a single "stream" and using a higher-level abstraction for parallelism e.g. PLINQ.
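For instance, the highest-rate search from the question collapses to a few lines with PLINQ. This is only a sketch: MeetsCondition is a hypothetical stand-in for whatever per-rate check you actually run.
// PLINQ partitions allRateSteps across the cores for you.
var highestRate = allRateSteps
    .AsParallel()
    .Where(rate => MeetsCondition(rate)) // hypothetical per-rate predicate
    .Max();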
If you just want to do work in batches, you can still do that but not care about which thread(s) the work is being done on:
static void Work(IEnumerable<int> ints) {
    var sum = ints.Sum();
    Thread.Sleep(sum);
    Console.WriteLine(sum); // reuse the computed sum rather than enumerating again
}
public static void Main(string[] args) {
    var inputs = from i in Enumerable.Range(0, 100)
                 select i + i;
    var batches = inputs.Batch(8);
    var tasks = from batch in batches
                select Task.Run(() => Work(batch));
    Task.WaitAll(tasks.ToArray());
}
The default TaskScheduler is coordinating the work for you behind the scenes, and it'll likely outperform hand-rolling your own threading scheme.
Also consider something like this:
static int Work(IEnumerable<int> ints) {
    Console.WriteLine("Work on thread " + Thread.CurrentThread.ManagedThreadId);
    var sum = ints.Sum();
    Thread.Sleep(sum);
    return sum;
}

public static void Main(string[] args) {
    var inputs = from i in Enumerable.Range(0, 100)
                 select i + i;
    var batches = inputs.Batch(8);
    var results = from batch in batches
                  select Work(batch);
    foreach (var result in results.AsParallel()) {
        Console.WriteLine(result);
    }
}
/*
Work on thread 6
Work on thread 4
56
Work on thread 4
184
Work on thread 4
Work on thread 4
312
440
...
*/

Related

How can I determine whether a parallel foreach loop is going to have better performance than a foreach loop?

I just did a simple test in .NET Fiddle of sorting 100 random integer arrays of length 1000 and seeing whether doing so with a Parallel.ForEach loop is faster than a plain old foreach loop.
Here is my code (I put this together fast, so please ignore the repetition and overall bad look of the code):
using System;
using System.Net;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using System.Linq;

public class Program
{
    public static int[] RandomArray(int minval, int maxval, int arrsize)
    {
        Random randNum = new Random();
        int[] rand = Enumerable
            .Repeat(0, arrsize)
            .Select(i => randNum.Next(minval, maxval))
            .ToArray();
        return rand;
    }

    // Note: despite the name, this version uses Parallel.ForEach.
    public static void SortOneThousandArraysSync()
    {
        var arrs = new List<int[]>(100);
        for (int i = 0; i < 100; ++i)
            arrs.Add(RandomArray(Int32.MinValue, Int32.MaxValue, 1000));
        Parallel.ForEach(arrs, (arr) =>
        {
            Array.Sort(arr);
        });
    }

    // Note: despite the name, this version sorts sequentially.
    public static void SortOneThousandArraysAsync()
    {
        var arrs = new List<int[]>(100);
        for (int i = 0; i < 100; ++i)
            arrs.Add(RandomArray(Int32.MinValue, Int32.MaxValue, 1000));
        foreach (var arr in arrs)
        {
            Array.Sort(arr);
        }
    }

    public static void Main()
    {
        var start = DateTime.Now;
        SortOneThousandArraysSync();
        var end = DateTime.Now;
        Console.WriteLine("t1 = " + (end - start).ToString());
        start = DateTime.Now;
        SortOneThousandArraysAsync();
        end = DateTime.Now;
        Console.WriteLine("t2 = " + (end - start).ToString());
    }
}
and here are the results after hitting Run twice:
t1 = 00:00:00.0156244
t2 = 00:00:00.0156243
...
t1 = 00:00:00.0467854
t2 = 00:00:00.0156246
...
So, sometimes it's faster and sometimes it's about the same.
Possible explanations:
The random arrays were "more unsorted" for the sync one versus the async one in the 2nd test I ran
It has something to do with the processes running on .NET Fiddle. In the first case the parallel one basically ran like a non-parallel operation because there weren't any threads for my fiddle to take over. (Or something like that)
Thoughts?
You should only use Parallel.ForEach() if the code within the loop takes a significant amount of time to execute. In this case, it takes more time to create multiple threads, sort the arrays, and then combine the results onto one thread than it does to simply sort them on a single thread. For example, the Parallel.ForEach() in the following code snippet takes less time to execute than the normal foreach loop:
public static void Main(string[] args)
{
    var numbers = Enumerable.Range(1, 10000);
    Parallel.ForEach(numbers, n => Factorial(n));
    foreach (var number in numbers)
    {
        Factorial(number);
    }
}

// Note: overflows int for number > 12, but it serves here purely as CPU work.
private static int Factorial(int number)
{
    if (number == 1 || number == 0)
        return 1;
    return number * Factorial(number - 1);
}
However, if I change var numbers = Enumerable.Range(1, 10000); to var numbers = Enumerable.Range(1, 1000);, the ForEach loop is faster than Parallel.ForEach().
When working with small tasks (which don't take a significant amount of time to execute), have a look at the Partitioner class; in your case:
public static void SortOneThousandArraysAsyncWithPart()
{
    var arrs = new List<int[]>(100);
    for (int i = 0; i < 100; ++i)
        arrs.Add(RandomArray(Int32.MinValue, Int32.MaxValue, 1000));
    // Let's spread the tasks between threads manually with the help of Partitioner.
    // We don't want task stealing and other optimizations: just split the
    // list between 8 (on my workstation) threads and run them.
    Parallel.ForEach(Partitioner.Create(0, 100), part =>
    {
        for (int i = part.Item1; i < part.Item2; ++i)
            Array.Sort(arrs[i]);
    });
}
I get the following results (i7 3.2GHz 4 cores HT, .Net 4.6 IA-64), averaged over 100 runs:
0.0081 Async (foreach)
0.0119 Parallel.ForEach
0.0084 Parallel.ForEach + Partitioner
As you can see, plain foreach is still on top, but Parallel.ForEach + Partitioner comes very close to the winner.
Checking performance of algorithms is a tricky business, and performance at small scale can easily be affected by a variety of factors external to your code. Please see my answer to an almost-duplicate question here for an in-depth explanation, plus some links to benchmarking templates that you can adapt to better measure your algorithm's performance.
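As a rough illustration of what "more robust" means here, a fairer harness might look something like this (a minimal sketch: a warm-up pass so JIT cost isn't billed to the first run, Stopwatch instead of DateTime, and a median over many runs; SortOneThousandArraysSync stands in for whichever variant you're measuring):
using System;
using System.Collections.Generic;
using System.Diagnostics;

static double MedianMs(Action action, int runs = 100)
{
    action(); // warm-up: JIT-compile before timing
    var times = new List<double>(runs);
    for (int i = 0; i < runs; i++)
    {
        var sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        times.Add(sw.Elapsed.TotalMilliseconds);
    }
    times.Sort();
    return times[runs / 2]; // the median resists outliers better than a single run
}

// Usage: Console.WriteLine(MedianMs(SortOneThousandArraysSync));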

Parallel.ForEach slower than normal foreach

I'm playing around with Parallel.ForEach in a C# console application, but can't seem to get it right. I'm creating an array with random numbers and I have a sequential foreach and a Parallel.ForEach that finds the largest value in the array. With approximately the same code in C++ I started to see a tradeoff to using several threads at 3M values in the array. But the Parallel.ForEach is twice as slow even at 100M values. What am I doing wrong?
class Program
{
    static void Main(string[] args)
    {
        dostuff();
    }

    static void dostuff()
    {
        Console.WriteLine("How large do you want the array to be?");
        int size = int.Parse(Console.ReadLine());
        int[] arr = new int[size];
        Random rand = new Random();
        for (int i = 0; i < size; i++)
        {
            arr[i] = rand.Next(0, int.MaxValue);
        }

        var watchSeq = System.Diagnostics.Stopwatch.StartNew();
        var largestSeq = FindLargestSequentially(arr);
        watchSeq.Stop();
        var elapsedSeq = watchSeq.ElapsedMilliseconds;
        Console.WriteLine("Finished sequential in: " + elapsedSeq + "ms. Largest = " + largestSeq);

        var watchPar = System.Diagnostics.Stopwatch.StartNew();
        var largestPar = FindLargestParallel(arr);
        watchPar.Stop();
        var elapsedPar = watchPar.ElapsedMilliseconds;
        Console.WriteLine("Finished parallel in: " + elapsedPar + "ms Largest = " + largestPar);

        dostuff();
    }

    static int FindLargestSequentially(int[] arr)
    {
        int largest = arr[0];
        foreach (int i in arr)
        {
            if (largest < i)
            {
                largest = i;
            }
        }
        return largest;
    }

    static int FindLargestParallel(int[] arr)
    {
        int largest = arr[0];
        Parallel.ForEach<int, int>(arr, () => 0, (i, loop, subtotal) =>
        {
            if (i > subtotal)
                subtotal = i;
            return subtotal;
        },
        (finalResult) =>
        {
            Console.WriteLine("Thread finished with result: " + finalResult);
            if (largest < finalResult) largest = finalResult;
        });
        return largest;
    }
}
This is the performance ramification of having a very small delegate body. We can achieve better performance using partitioning: in that case the body delegate performs work over a high data volume.
static int FindLargestParallelRange(int[] arr)
{
    object locker = new object();
    int largest = arr[0];
    Parallel.ForEach(Partitioner.Create(0, arr.Length), () => arr[0], (range, loop, subtotal) =>
    {
        for (int i = range.Item1; i < range.Item2; i++)
            if (arr[i] > subtotal)
                subtotal = arr[i];
        return subtotal;
    },
    (finalResult) =>
    {
        lock (locker)
            if (largest < finalResult)
                largest = finalResult;
    });
    return largest;
}
Pay attention to synchronizing the localFinally delegate. Also note the need for proper initialization of localInit: () => arr[0] instead of () => 0 (with () => 0, an array of all negative values would incorrectly report 0 as the maximum).
Partitioning with PLINQ:
static int FindLargestPlinqRange(int[] arr)
{
    return Partitioner.Create(0, arr.Length)
        .AsParallel()
        .Select(range =>
        {
            int largest = arr[0];
            for (int i = range.Item1; i < range.Item2; i++)
                if (arr[i] > largest)
                    largest = arr[i];
            return largest;
        })
        .Max();
}
I highly recommend the free book Patterns of Parallel Programming by Stephen Toub.
As the other answerers have mentioned, the action you're trying to perform against each item here is so insignificant that there are a variety of other factors which end up carrying more weight than the actual work you're doing. These may include:
JIT optimizations
CPU branch prediction
I/O (outputting thread results while the timer is running)
the cost of invoking delegates
the cost of task management
the system incorrectly guessing what thread strategy will be optimal
memory/cpu caching
memory pressure
environment (debugging)
etc.
Running each approach a single time is not an adequate way to test, because it enables a number of the above factors to weigh more heavily on one iteration than on another. You should start with a more robust benchmarking strategy.
Furthermore, your implementation is actually dangerously incorrect. The documentation specifically says:
The localFinally delegate is invoked once per task to perform a final action on each task’s local state. This delegate might be invoked concurrently on multiple tasks; therefore, you must synchronize access to any shared variables.
You have not synchronized your final delegate, so your function is prone to race conditions that would make it produce incorrect results.
As in most cases, the best approach to this one is to take advantage of work done by people smarter than we are. In my testing, the following approach appears to be the fastest overall:
return arr.AsParallel().Max();
The Parallel.ForEach loop should be running slower because the algorithm used is not parallel and a lot more work is being done to run this algorithm.
In the single thread, to find the max value, we can take the first number as our max value and compare it to every other number in the array. If one of the numbers is larger than our current max, we swap and continue. This way we access each number in the array once, for a total of N comparisons.
In the Parallel loop above, the algorithm creates overhead because each operation is wrapped inside a function call with a return value. So in addition to doing the comparisons, it is running the overhead of pushing and popping these calls onto the call stack. In addition, since each call is dependent on the value of the function call before it, it needs to run in sequence.
In the Parallel For loop below, the array is divided among an explicit number of threads determined by the variable threadNumber. This limits the overhead of function calls to a low number.
Note that for low values, the parallel loop performs slower. However, for 100M values there is a decrease in time elapsed.
static int FindLargestParallel(int[] arr)
{
    var answers = new ConcurrentBag<int>();
    int threadNumber = 4;
    int partitionSize = arr.Length / threadNumber;
    Parallel.For(0, /* starting number */
        threadNumber + 1, /* adding 1 to threadNumber in case arr.Length is not evenly divisible by threadNumber */
        i =>
        {
            if (i * partitionSize < arr.Length) /* check in case the array length is divisible by the thread count */
            {
                var max = arr[i * partitionSize];
                for (var x = i * partitionSize;
                     x < (i + 1) * partitionSize && x < arr.Length;
                     ++x)
                {
                    if (arr[x] > max)
                        max = arr[x];
                }
                answers.Add(max);
            }
        });
    /* note the shortcut in finding max in the bag */
    return answers.Max(i => i);
}
Some thoughts here: In the parallel case, there is thread management logic involved that determines how many threads it wants to use. This thread management logic presumably runs on your main thread. Every time a thread returns with a new maximum value, the management logic kicks in and determines the next work item (the next number to process in your array). I'm pretty sure that this requires some kind of locking. In any case, determining the next item may even cost more than performing the comparison operation itself.
That sounds like a magnitude more work (overhead) to me than a single thread that processes one number after the other. In the single-threaded case there are a number of optimizations at play: no boundary checks, the CPU can load data into its first-level cache, etc. Not sure which of these optimizations apply to the parallel case.
Keep in mind that on a typical desktop machine there are only 2 to 4 physical CPU cores available, so you will never have more than that actually doing work. So if the parallel processing overhead is more than 2-4 times that of a single-threaded operation, the parallel version will inevitably be slower, which is what you are observing.
Have you attempted to run this on a 32 core machine? ;-)
A better solution would be to determine non-overlapping ranges (start + stop index) covering the entire array and let each parallel task process one range. This way, each parallel task can internally do a tight single-threaded loop and only return once the entire range has been processed. You could probably even determine a near-optimal number of ranges based on the number of logical cores of the machine. I haven't tried this but I'm pretty sure you will see an improvement over the single-threaded case.
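A minimal sketch of that idea (assuming the same int[] arr as in the question; the range count comes from Environment.ProcessorCount, and each task writes only its own slot so no locking is needed):
static int FindLargestByRanges(int[] arr)
{
    int cores = Environment.ProcessorCount;
    int chunk = (arr.Length + cores - 1) / cores; // ceiling division
    var maxima = new int[cores];
    Parallel.For(0, cores, c =>
    {
        int start = c * chunk;
        int end = Math.Min(start + chunk, arr.Length);
        int localMax = int.MinValue;
        for (int i = start; i < end; i++) // tight single-threaded inner loop
            if (arr[i] > localMax)
                localMax = arr[i];
        maxima[c] = localMax; // one writer per slot: no synchronization required
    });
    return maxima.Max(); // assumes arr is non-empty
}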
Try splitting the set into batches and running the batches in parallel, where the number of batches corresponds to your number of CPU cores.
I ran some equations 1K, 10K and 1M times using the following methods:
A "for" loop.
A "Parallel.For" from the System.Threading.Tasks lib, across the entire set.
A "Parallel.For" across 4 batches.
A "Parallel.ForEach" from the System.Threading.Tasks lib, across the entire set.
A "Parallel.ForEach" across 4 batches.
Results: (Measured in seconds)
Conclusion:
Processing batches in parallel using the "Parallel.ForEach" has the best outcome in cases above 10K records. I believe the batching helps because it utilizes all CPU cores (4 in this example), but also minimizes the amount of threading overhead associated with parallelization.
Here is my code:
public void ParallelSpeedTest()
{
    var rnd = new Random(56);
    int range = 1000000;
    int numberOfCores = 4;
    int batchSize = range / numberOfCores;
    int[] rangeIndexes = Enumerable.Range(0, range).ToArray();
    double[] inputs = rangeIndexes.Select(n => rnd.NextDouble()).ToArray();
    double[] weights = rangeIndexes.Select(n => rnd.NextDouble()).ToArray();
    double[] outputs = new double[rangeIndexes.Length];

    // Series "for"...
    var startTimeSeries = DateTime.Now;
    for (var i = 0; i < range; i++)
    {
        outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
    }
    var durationSeries = DateTime.Now - startTimeSeries;

    // "Parallel.For"...
    var startTimeParallel = DateTime.Now;
    Parallel.For(0, range, (i) =>
    {
        outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
    });
    var durationParallelFor = DateTime.Now - startTimeParallel;

    // "Parallel.For" in batches...
    var startTimeParallel2 = DateTime.Now;
    Parallel.For(0, numberOfCores, (c) =>
    {
        var endValue = (c == numberOfCores - 1) ? range : (c + 1) * batchSize;
        var startValue = c * batchSize;
        for (var i = startValue; i < endValue; i++)
        {
            outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
        }
    });
    var durationParallelForBatches = DateTime.Now - startTimeParallel2;

    // "Parallel.ForEach"...
    var startTimeParallelForEach = DateTime.Now;
    Parallel.ForEach(rangeIndexes, (i) =>
    {
        outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
    });
    var durationParallelForEach = DateTime.Now - startTimeParallelForEach;

    // "Parallel.ForEach" in batches...
    List<Tuple<int, int>> ranges = new List<Tuple<int, int>>();
    for (var i = 0; i < numberOfCores; i++)
    {
        int start = i * batchSize;
        int end = (i == numberOfCores - 1) ? range : (i + 1) * batchSize;
        ranges.Add(new Tuple<int, int>(start, end));
    }
    var startTimeParallelBatches = DateTime.Now;
    // The lambda parameter is named r to avoid shadowing the local `range` above.
    Parallel.ForEach(ranges, (r) =>
    {
        for (var i = r.Item1; i < r.Item2; i++) // iterate the full [Item1, Item2) batch
        {
            outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
        }
    });
    var durationParallelForEachBatches = DateTime.Now - startTimeParallelBatches;

    Debug.Print($"=================================================================");
    Debug.Print($"Given: Set-size: {range}, number-of-batches: {numberOfCores}, batch-size: {batchSize}");
    Debug.Print($".................................................................");
    Debug.Print($"Series For:               {durationSeries}");
    Debug.Print($"Parallel For:             {durationParallelFor}");
    Debug.Print($"Parallel For Batches:     {durationParallelForBatches}");
    Debug.Print($"Parallel ForEach:         {durationParallelForEach}");
    Debug.Print($"Parallel ForEach Batches: {durationParallelForEachBatches}");
    Debug.Print($"");
}

Why is Array.Sort working faster after using Array.ForEach?

I was spending my free time comparing built-in sorting algorithms in various libraries and languages, and when I hit C# and .NET I stumbled across a quite interesting and previously unknown to me "quirk".
Here's the first program I ran:
class Program
{
    static void Main(string[] args)
    {
        var a = new int[1000000];
        var r = new Random();
        var t = DateTime.Now;
        for (int i = 0; i < 1000000; i++)
        {
            a[i] = r.Next();
        }
        Console.WriteLine(DateTime.Now - t);
        t = DateTime.Now;
        Array.Sort(a);
        Console.WriteLine(DateTime.Now - t);
        Console.ReadKey();
    }
}
and I got an average result of 11 ms for filling the array and 77 ms for sorting.
Then I tried this code:
class Program
{
    static void Main(string[] args)
    {
        var a = new int[1000000];
        var r = new Random();
        var t = DateTime.Now;
        Array.ForEach(a, x => x = r.Next());
        Console.WriteLine(DateTime.Now - t);
        t = DateTime.Now;
        Array.Sort(a);
        Console.WriteLine(DateTime.Now - t);
        Console.ReadKey();
    }
}
and to my surprise the average times were 14 ms and 36 ms.
How can this be explained?
In the 2nd example, you are not assigning to the array items at all:
Array.ForEach(a, x => x = r.Next());
You're assigning to the lambda parameter x. Then, you're sorting an array consisting of zeros. This is probably faster because no data movement needs to happen. No writes to the array.
Apart from that, your benchmark methodology is questionable. Make the benchmark valid by using Stopwatch, increasing the running time by 10x, using Release mode without a debugger attached, repeating the experiment, and verifying the times are stable.
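For illustration, a corrected version of the fill-and-sort might look like this (a sketch: an indexed loop actually writes into the array, and Stopwatch replaces DateTime subtraction):
var a = new int[1000000];
var r = new Random();

var sw = System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < a.Length; i++)
{
    a[i] = r.Next(); // actually writes into the array, unlike the ForEach lambda
}
sw.Stop();
Console.WriteLine(sw.Elapsed);

sw.Restart();
Array.Sort(a); // now sorts genuinely random data
sw.Stop();
Console.WriteLine(sw.Elapsed);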
Because in the second case you do not really initialize the array and it remains all zeroes. In other words, it is already sorted.
ForEach does not change entries

LINQ query and sub-query enumeration count in C#?

Suppose I have this query:
int[] Numbers = new int[5] { 5, 2, 3, 4, 5 };
var query = from a in Numbers
            where a == Numbers.Max(n => n) // notice MAX; it should also get its value somehow
            select a;
foreach (var element in query)
    Console.WriteLine(element);
How many times is Numbers enumerated when running the foreach? And how can I test it (I mean, write code which tells me the number of iterations)?
It will be iterated 6 times. Once for the Where and once per element for the Max.
The code to demonstrate this:
private static int count = 0;

public static IEnumerable<int> Regurgitate(IEnumerable<int> source)
{
    count++;
    Console.WriteLine("Iterated sequence {0} times", count);
    foreach (int i in source)
        yield return i;
}
int[] Numbers = new int[5] { 5, 2, 3, 4, 5 };
IEnumerable<int> sequence = Regurgitate(Numbers);
var query = from a in sequence
            where a == sequence.Max(n => n)
            select a;
It will print "Iterated sequence 6 times".
We could make a more general-purpose wrapper that is more flexible, if you're planning to use this to experiment with other cases:
public class EnumerableWrapper<T> : IEnumerable<T>
{
    private IEnumerable<T> source;

    public EnumerableWrapper(IEnumerable<T> source)
    {
        this.source = source;
    }

    public int IterationsStarted { get; private set; }
    public int NumMoveNexts { get; private set; }
    public int IterationsFinished { get; private set; }

    public IEnumerator<T> GetEnumerator()
    {
        IterationsStarted++;
        foreach (T item in source)
        {
            NumMoveNexts++;
            yield return item;
        }
        IterationsFinished++;
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }

    public override string ToString()
    {
        return string.Format(
@"Iterations Started: {0}
Iterations Finished: {1}
Number of move next calls: {2}",
            IterationsStarted, IterationsFinished, NumMoveNexts);
    }
}
This has several advantages over the other function:
It records the number of iterations started, the number of iterations completed, and the total number of MoveNext calls across all of the sequences.
You can create different instances to wrap different underlying sequences, thus allowing you to inspect multiple sequences per program, instead of just one when using a static variable.
Here is how you can get a quick estimate of the number of times the collection is enumerated: wrap your collection in a CountedEnum<T>, and increment a counter on each yield return, like this --
static int counter = 0;

public static IEnumerable<T> CountedEnum<T>(IEnumerable<T> ee)
{
    foreach (var e in ee)
    {
        counter++;
        yield return e;
    }
}
Then change your array declaration to this:
var Numbers = CountedEnum(new int[5] { 5, 2, 3, 4, 5 });
run your query, and print the counter. For your query, the code prints 30 (link to ideone), meaning that your collection of five items has been enumerated six times.
Here is how you can check the count:
void Main()
{
    var Numbers = new int[5] { 5, 2, 3, 4, 5 }.Select(n =>
    {
        Console.Write(n);
        return n;
    });
    var query = from a in Numbers
                where a == Numbers.Max(n => n)
                select a;
    foreach (var element in query)
    {
        var v = element;
    }
}
Here is the output:
5 5 2 3 4 5 2 5 2 3 4 5 3 5 2 3 4 5 4 5 2 3 4 5 5 5 2 3 4 5
The number of iterations has to be equal to query.Count(), i.e. to the count of the elements in the result of the first query.
If you're asking about something else, please clarify.
EDIT
After clarification: if you're looking for the total count of iterations in the code provided, there will be 7 iterations (for this concrete case).
var query = from a in Numbers
            where a == Numbers.Max(n => n) // 5 iterations to find MAX among 5 elements
            select a;
and
foreach (var element in query)
    Console.WriteLine(element); // 2 iterations over the resulting collection (in this question)
How many times is Numbers enumerated when running the foreach
Loosely speaking, your code is morally equivalent to:
foreach (int a in Numbers)
{
    // 1. I've gotten rid of the unnecessary identity lambda.
    // 2. Note that Max works by enumerating the entire source.
    var max = Numbers.Max();
    if (a == max)
        Console.WriteLine(a);
}
So we enumerate the following times:
One enumeration of the sequence for the outer loop (1).
One enumeration of the sequence for each of its members (Count).
So in total, we enumerate Count + 1 times.
You could bring this down to 2 by hoisting the Max query outside the loop by introducing a local.
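Concretely, the hoisted version would be something like:
var max = Numbers.Max();                  // enumeration 1
var query = Numbers.Where(a => a == max); // enumeration 2 happens when query is consumed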
how can I test it (I mean, write code which tells me the number of iterations)
This wouldn't be easy with a raw array. But you could write your own enumerable implementation (that perhaps wrapped an array) and add some instrumentation to the GetEnumerator method. Or if you want to go deeper, go the whole hog and write a custom enumerator with instrumentation on MoveNext and Current as well.
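A minimal sketch of that deeper option (a hand-written wrapper that counts MoveNext calls and Current reads; the names are illustrative, not from the question):
using System.Collections.Generic;

public class CountingEnumerator<T> : IEnumerator<T>
{
    private readonly IEnumerator<T> inner;
    public int MoveNextCalls { get; private set; }
    public int CurrentReads { get; private set; }

    public CountingEnumerator(IEnumerator<T> inner) { this.inner = inner; }

    public bool MoveNext()
    {
        MoveNextCalls++; // instrument every advance of the cursor
        return inner.MoveNext();
    }

    public T Current
    {
        get { CurrentReads++; return inner.Current; }
    }

    object System.Collections.IEnumerator.Current => Current;
    public void Reset() => inner.Reset();
    public void Dispose() => inner.Dispose();
}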
Count via public property also yields 6.
private static int ncount = 0;
private int[] numbers = new int[5] { 5, 2, 3, 4, 5 };

public int[] Numbers
{
    get
    {
        ncount++;
        Debug.WriteLine("Numbers Get " + ncount.ToString());
        return numbers;
    }
}
This brings the count down to 2.
Makes sense but I would not have thought of it.
int nmax = Numbers.Max(n => n);
var query = from a in Numbers
            where a == nmax
            //where a == Numbers.Max(n => n) // the original subquery, for comparison
            select a;
Define and initialize a count variable outside the foreach loop and increment it as count++ inside the loop to get the number of times the loop enumerates.

Faster way to do a List<T>.Contains()

I am trying to do what I think is a "de-intersect" (I'm not sure what the proper name is, but that's what Tim Sweeney of EpicGames called it in the old UnrealEd)
// foo and bar have some identical elements (given a case-insensitive match)
List<string> foo = GetFoo();
List<string> bar = GetBar();
// remove non matches
foo = foo.Where(x => bar.Contains(x, StringComparer.InvariantCultureIgnoreCase)).ToList();
bar = bar.Where(x => foo.Contains(x, StringComparer.InvariantCultureIgnoreCase)).ToList();
Then later on, I do another thing where I subtract the result from the original, to see which elements I removed. That's super-fast using .Except(), so no troubles there.
There must be a faster way to do this, because this one performs pretty badly with ~30,000 elements (of string) in either List. Preferably, a method to do this step and the one later on in one fell swoop would be nice. I tried using .Exists() instead of .Contains(), but it's slightly slower. I feel a bit thick, but I think it should be possible with some combination of .Except() and .Intersect() and/or .Union().
This operation can be called a symmetric difference.
You need a different data structure, like a hash table. Add the intersection of both sets to it, then difference the intersection from each set.
UPDATE:
I got a bit of time to try this in code. I used HashSet<T> with a set of 50,000 strings, from 2 to 10 characters long with the following results:
Original: 79499 ms
Hashset: 33 ms
BTW, there is a method on HashSet called SymmetricExceptWith which I thought would do the work for me, but it actually adds the different elements from both sets to the set the method is called on. Maybe this is what you want, rather than leaving the initial two sets unmodified, and the code would be more elegant.
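If mutating one set is acceptable, that shortcut looks something like this (a sketch; note it modifies fooSet in place rather than producing the two filtered lists from the question):
var fooSet = new HashSet<string>(foo, StringComparer.InvariantCultureIgnoreCase);
// Afterwards fooSet holds the elements that appear in exactly one of the two lists.
fooSet.SymmetricExceptWith(bar);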
Here is the code:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        // foo and bar have some identical elements (given a case-insensitive match)
        var foo = getRandomStrings();
        var bar = getRandomStrings();
        var timer = new Stopwatch();
        timer.Start();
        // remove non matches
        var f = foo.Where(x => !bar.Contains(x)).ToList();
        var b = bar.Where(x => !foo.Contains(x)).ToList();
        timer.Stop();
        Debug.WriteLine(String.Format("Original: {0} ms", timer.ElapsedMilliseconds));
        timer.Reset();
        timer.Start();
        var intersect = new HashSet<String>(foo);
        intersect.IntersectWith(bar);
        var fSet = new HashSet<String>(foo);
        var bSet = new HashSet<String>(bar);
        fSet.ExceptWith(intersect);
        bSet.ExceptWith(intersect);
        timer.Stop();
        var fCheck = new HashSet<String>(f);
        var bCheck = new HashSet<String>(b);
        Debug.WriteLine(String.Format("Hashset: {0} ms", timer.ElapsedMilliseconds));
        Console.WriteLine("Sets equal? {0} {1}", fSet.SetEquals(fCheck), bSet.SetEquals(bCheck));
        Console.ReadKey();
    }

    static Random _rnd = new Random();
    private const int Count = 50000;

    private static List<string> getRandomStrings()
    {
        var strings = new List<String>(Count);
        var chars = new Char[10];
        for (var i = 0; i < Count; i++)
        {
            var len = _rnd.Next(2, 10);
            for (var j = 0; j < len; j++)
            {
                var c = (Char)_rnd.Next('a', 'z');
                chars[j] = c;
            }
            strings.Add(new String(chars, 0, len));
        }
        return strings;
    }
}
With intersect it would be done like this:
var matches = (from f in foo select f)
    .Intersect(from b in bar select b,
               StringComparer.InvariantCultureIgnoreCase);
If the elements are unique within each list you should consider using a HashSet:
The HashSet(T) class provides high performance set operations. A set is a collection that contains no duplicate elements, and whose elements are in no particular order.
With a sorted list, you can use binary search.
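A rough sketch of that variant (sorting bar once with the case-insensitive comparer, then probing it with List<T>.BinarySearch):
bar.Sort(StringComparer.InvariantCultureIgnoreCase); // O(N log N), done once
foo = foo
    .Where(x => bar.BinarySearch(x, StringComparer.InvariantCultureIgnoreCase) >= 0) // O(log N) per lookup
    .ToList();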
Contains on a list is an O(N) operation. If you had a different data structure, such as a sorted list or a Dictionary, you would dramatically reduce your time. Accessing a key in a sorted list is usually O(log N) time, and in a hash is usually O(1) time.
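Applied to the question's code, the hash-based approach might look like this (a sketch; it keeps the original Where/ToList shape but swaps the O(N) List.Contains for an O(1) HashSet.Contains, building both sets before filtering so neither lookup sees a half-filtered list):
var fooSet = new HashSet<string>(foo, StringComparer.InvariantCultureIgnoreCase);
var barSet = new HashSet<string>(bar, StringComparer.InvariantCultureIgnoreCase);
// remove non matches, now with O(1) membership tests
foo = foo.Where(x => barSet.Contains(x)).ToList();
bar = bar.Where(x => fooSet.Contains(x)).ToList();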
