I was trying to develop a method pipeline using asynchronous method invocation. The logic for the pipeline is as follows:
There are n items in a collection that have to be fed through m methods arranged in a pipeline:
Enumerate a collection of T
Feed the first element to the first method
Get the output, feed it to the second method asynchronously
At the same time, feed the second element of the collection to the first method
After the first method completes, feed its result to the second method (if the second method is still running, put the result into its queue and start executing the third element at the first method)
When the second method finishes executing, take the first element from its queue and execute it, and so on (every method should run asynchronously; no method should wait for the next one to finish)
At the mth method, after processing the data, store the result in a list
After completing the nth element at the mth method, return the list of results (n results) to the very first level.
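For example, with the input { 1, 2, 3, 4 } and the stage functions Add, Square, Add, Square used in the code below, element 1 should flow 1 → 2 → 4 → 5 → 25, element 2 → 3 → 9 → 10 → 100, element 3 → 4 → 16 → 17 → 289, and element 4 → 5 → 25 → 26 → 676, so the returned list (in input order) should be 25, 100, 289, 676.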
I came up with the following code, but it does not work as intended: the result never gets returned, and moreover it does not execute in the order it should.
static class Program
{
static void Main(string[] args)
{
var list = new List<int> { 1, 2, 3, 4 };
var result = list.ForEachPipeline(Add, Square, Add, Square);
foreach (var element in result)
{
Console.WriteLine(element);
Console.WriteLine("---------------------");
}
Console.ReadLine();
}
private static int Add(int j)
{
return j + 1;
}
private static int Square(int j)
{
return j * j;
}
internal static void AddNotify<T>(this List<T> list, T item)
{
Console.WriteLine("Adding {0} to the list", item);
list.Add(item);
}
}
internal class Function<T>
{
private readonly Func<T, T> _func;
private readonly List<T> _result = new List<T>();
private readonly Queue<T> DataQueue = new Queue<T>();
private bool _isBusy;
static readonly object Sync = new object();
readonly ManualResetEvent _waitHandle = new ManualResetEvent(false);
internal Function(Func<T, T> func)
{
_func = func;
}
internal Function<T> Next { get; set; }
internal Function<T> Start { get; set; }
internal int Count;
internal IEnumerable<T> Execute(IEnumerable<T> source)
{
var isSingle = true;
foreach (var element in source) {
var result = _func(element);
if (Next != null)
{
Next.ExecuteAsync(result, _waitHandle);
isSingle = false;
}
else
_result.AddNotify(result);
}
if (!isSingle)
_waitHandle.WaitOne();
return _result;
}
internal void ExecuteAsync(T element, ManualResetEvent resetEvent)
{
lock(Sync)
{
if(_isBusy)
{
DataQueue.Enqueue(element);
return;
}
_isBusy = true;
_func.BeginInvoke(element, CallBack, resetEvent);
}
}
internal void CallBack(IAsyncResult result)
{
bool set = false;
var worker = (Func<T, T>) ((AsyncResult) result).AsyncDelegate;
var resultElement = worker.EndInvoke(result);
var resetEvent = result.AsyncState as ManualResetEvent;
lock(Sync)
{
_isBusy = false;
if(Next != null)
Next.ExecuteAsync(resultElement, resetEvent);
else
Start._result.AddNotify(resultElement);
if(DataQueue.Count > 1)
{
var element = DataQueue.Dequeue();
ExecuteAsync(element, resetEvent);
}
if(Start._result.Count == Count)
set = true;
}
if(set)
resetEvent.Set();
}
}
public static class Pipe
{
public static IEnumerable<T> ForEachPipeline<T>(this IEnumerable<T> source, params Func<T, T>[] pipes)
{
Function<T> start = null, previous = null;
foreach (var function in pipes.Select(pipe => new Function<T>(pipe){ Count = source.Count()}))
{
if (start == null)
{
start = previous = function;
start.Start = function;
continue;
}
function.Start = start;
previous.Next = function;
previous = function;
}
return start != null ? start.Execute(source) : null;
}
}
Can you guys please help me to make this thing work? If this design is not good for an actual method pipeline, please feel free to suggest a different one.
Edit: I have to stick to .Net 3.5 strictly.
I didn't immediately find the problem in your code, but you might be overcomplicating things a bit. This might be a simpler way to do what you want.
public static class Pipe
{
public static IEnumerable<T> Execute<T>(
this IEnumerable<T> input, params Func<T, T>[] functions)
{
// each worker will put its result in this array
var results = new T[input.Count()];
// launch workers and return a WaitHandle for each one
var waitHandles = input.Select(
(element, index) =>
{
var waitHandle = new ManualResetEvent(false);
ThreadPool.QueueUserWorkItem(
delegate
{
T result = element;
foreach (var function in functions)
{
result = function(result);
}
results[index] = result;
waitHandle.Set();
});
return waitHandle;
            }).ToList(); // materialize the query so every worker is queued before we start waiting
// wait for each worker to finish
foreach (var waitHandle in waitHandles)
{
waitHandle.WaitOne();
}
return results;
}
}
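For reference, wiring this up with the sample data and the Add/Square methods from the question might look like this (the result array preserves the input order because each worker writes to its own index):
var list = new List<int> { 1, 2, 3, 4 };
var results = list.Execute(Add, Square, Add, Square);
foreach (var r in results)
    Console.WriteLine(r); // expected: 25, 100, 289, 676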
This does not create a lock for each stage of the pipeline as in your own attempt. I've omitted that because it did not seem useful. However, you could easily add it by wrapping the functions like this:
var wrappedFunctions = functions.Select(x => AddStageLock(x));
where AddStageLock is this:
private static Func<T,T> AddStageLock<T>(Func<T,T> function)
{
object stageLock = new object();
Func<T, T> wrappedFunction =
x =>
{
lock (stageLock)
{
return function(x);
}
};
return wrappedFunction;
}
Edit: The Execute implementation will probably be slower than single-threaded execution unless the work to be done for each individual element dwarfs the overhead of creating a wait handle and scheduling a task on the thread pool. To really benefit from multi-threading you need to limit the overhead; PLINQ in .NET 4 does this by partitioning the data.
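For comparison, a rough PLINQ sketch of the same work (PLINQ requires .NET 4, so it does not fit the asker's .NET 3.5 constraint, but it shows what the partitioned approach looks like):
var results = new List<int> { 1, 2, 3, 4 }
    .AsParallel()
    .AsOrdered()                                   // keep results in input order
    .Select(x => Square(Add(Square(Add(x)))))
    .ToList();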
Any particular reason for taking the pipeline approach? IMO, launching a separate thread for each input, with all the functions chained one after another, would be simpler to write and faster to execute. For example:
static T ExecPipe<T>(IEnumerable<Func<T, T>> pipe, T input)
{
T value = input;
foreach(var f in pipe)
{
value = f(value);
}
return value;
}
var pipe = new List<Func<int, int>>() { Add, Square, Add, Square };
var list = new List<int> { 1, 2, 3, 4 };
foreach(var value in list)
{
ThreadPool.QueueUserWorkItem(o => ExecPipe(pipe, (int)o), value);
}
Now, coming to your code: I believe that for an accurate pipeline implementation with M stages, you must have exactly M threads, since each stage can execute in parallel (some threads may be idle because input has not reached them yet). I am not certain whether your code launches any threads at all, or how many are running at any particular time. A sketch of such a stage-per-thread pipeline follows.
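Here is a minimal sketch of that idea, written against .NET 3.5 (plain Thread, Queue<T> and Monitor, since BlockingCollection is not available there). The names and structure are illustrative only, not the poster's code:
static List<T> RunPipeline<T>(IEnumerable<T> source, params Func<T, T>[] stages)
{
    // One queue feeding each stage, plus one holding the final results.
    var queues = new Queue<T>[stages.Length + 1];
    for (int i = 0; i < queues.Length; i++)
        queues[i] = new Queue<T>();
    var done = new bool[queues.Length]; // done[i]: nothing more will arrive in queues[i]
    var threads = new Thread[stages.Length];
    for (int s = 0; s < stages.Length; s++)
    {
        int stage = s; // capture a copy for the closure
        threads[s] = new Thread(() =>
        {
            var input = queues[stage];
            var output = queues[stage + 1];
            while (true)
            {
                T item;
                lock (input)
                {
                    while (input.Count == 0 && !done[stage])
                        Monitor.Wait(input);
                    if (input.Count == 0)
                        break; // upstream finished and the queue is drained
                    item = input.Dequeue();
                }
                T result = stages[stage](item);
                lock (output)
                {
                    output.Enqueue(result);
                    Monitor.PulseAll(output);
                }
            }
            lock (output) { done[stage + 1] = true; Monitor.PulseAll(output); }
        });
        threads[s].Start();
    }
    // Feed the source elements into the first stage's queue.
    var first = queues[0];
    foreach (var element in source)
        lock (first) { first.Enqueue(element); Monitor.PulseAll(first); }
    lock (first) { done[0] = true; Monitor.PulseAll(first); }
    foreach (var t in threads)
        t.Join();
    return new List<T>(queues[stages.Length]);
}
With the question's data, RunPipeline(new List<int> { 1, 2, 3, 4 }, Add, Square, Add, Square) should return 25, 100, 289, 676, because each stage is a single thread draining its queue in FIFO order.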
Why don't you break off a thread for each iteration and aggregate your results in a locked collection? You could use PLINQ for this.
I think you might be mistaking methods for resources. You only need to lock a method if it is dealing with a critical block that touches a shared resource. By picking an item off and breaking into a new thread from there, you eliminate the need to manage your second method.
i.e.: MethodX calls Method1, then passes the value into Method2:
Foreach item in arr
Async(MethodX(item));
I have a method that will spawn lots of CPU-bound workers with Task.Run(). Each worker may in turn spawn more workers, but I'm guaranteed that eventually, all workers will stop executing. My first thought was writing my method like this:
public Result OrchestrateWorkers(WorkItem[] workitems)
{
this.countdown = new CountdownEvent(0);
this.results = new ConcurrentQueue<WorkerResult>();
foreach (var workItem in workitems)
{
SpawnWorker(workItem);
}
this.countdown.Wait(); // until all spawned workers have completed.
return ComputeTotalResult(this.results);
}
The public SpawnWorker method is used to start a worker, and to keep track of when they complete by enqueueing the worker's result and decrementing the countdown.
public void SpawnWorker(WorkItem workItem)
{
this.countdown.AddCount();
Task.Run(() => {
// Worker is passed an instance of this class
// so it can call SpawnWorker if it needs to.
var worker = new Worker(workItem, this);
var result = worker.DoWork();
this.results.Enqueue(result);
countdown.Signal();
});
}
Each worker can call SpawnWorker as much as they like, but they're guaranteed to terminate at some point.
In this design, the thread that calls OrchestrateWorkers will block until all the workers have completed. My thinking is that it's a shame that there's a blocked thread; it would be nice if it could be doing work as well.
Would it be better to rearchitect the solution to something like this?
public Task<Result> OrchestrateWorkersAsync(WorkItem[] workitems)
{
if (this.tcs is not null) throw new InvalidOperationException("Already running!");
this.tcs = new TaskCompletionSource<Result>();
this.countdown = 0; // just a normal integer.
this.results = new ConcurrentQueue<WorkerResult>();
foreach (var workItem in workitems)
{
SpawnWorker(workItem);
}
return tcs.Task;
}
public void SpawnWorker(WorkItem workItem)
{
Interlocked.Increment(ref this.countdown);
Task.Run(() => {
var worker = new Worker(workItem, this);
var result = worker.DoWork();
this.results.Enqueue(result);
if (Interlocked.Decrement(ref countdown) == 0)
{
this.tcs.SetResult(this.ComputeTotalResult(this.results));
}
});
}
EDIT: I've added a more fully fleshed-out sample below. It should be compilable and runnable. I'm seeing a ~10% performance improvement on my 8-core system, but I want to make sure this is the "canonical" way to orchestrate a swarm of spawning tasks.
using System.Collections.Concurrent;
using System.Diagnostics;
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using System.Linq;
public class Program
{
const int ITERATIONS = 2500000;
const int WORKERS = 200;
public static async Task Main()
{
var o = new Orchestrator<int, int>();
var oo = new OrchestratorAsync<int, int>();
var array = Enumerable.Range(0, WORKERS);
var result = Time(() => o.OrchestrateWorkers(array, DoWork));
Console.Error.WriteLine("Sync spawned {0} workers", result.Count());
var resultAsync = await TimeAsync(() => oo.OrchestrateWorkersAsync(array, DoWorkAsync));
Console.Error.WriteLine("Async spawned {0} workers", resultAsync.Count());
}
static async Task<T> TimeAsync<T>(Func<Task<T>> work)
{
var sw = new Stopwatch();
sw.Start();
var result = await work();
sw.Stop();
Console.WriteLine("Total async time: {0}", sw.ElapsedMilliseconds);
return result;
}
static T Time<T>(Func<T> work)
{
var sw = new Stopwatch();
sw.Start();
var result = work();
sw.Stop();
Console.WriteLine("Total time: {0}", sw.ElapsedMilliseconds);
return result;
}
static int DoWork(int x, Orchestrator<int, int> arg2)
{
var rnd = new Random();
int n = 0;
for (int i = 0; i < ITERATIONS; ++i)
{
n += rnd.Next();
}
if (x >= 0)
{
arg2.SpawnWorker(-1, DoWork);
arg2.SpawnWorker(-1, DoWork);
}
return n;
}
static int DoWorkAsync(int x, OrchestratorAsync<int, int> arg2)
{
var rnd = new Random();
int n = 0;
for (int i = 0; i < ITERATIONS; ++i)
{
n += rnd.Next();
}
if (x >= 0)
{
arg2.SpawnWorker(-1, DoWorkAsync);
arg2.SpawnWorker(-1, DoWorkAsync);
}
return n;
}
public class Orchestrator<TWorkItem, TResult>
{
private ConcurrentQueue<TResult> results;
private CountdownEvent countdownEvent;
public Orchestrator()
{
this.results = new();
this.countdownEvent = new(1);
}
public IEnumerable<TResult> OrchestrateWorkers(
IEnumerable<TWorkItem> workItems,
Func<TWorkItem, Orchestrator<TWorkItem, TResult>, TResult> worker)
{
foreach (var workItem in workItems)
{
SpawnWorker(workItem, worker);
}
countdownEvent.Signal();
countdownEvent.Wait();
return results;
}
public void SpawnWorker(
TWorkItem workItem,
Func<TWorkItem, Orchestrator<TWorkItem, TResult>, TResult> worker)
{
this.countdownEvent.AddCount(1);
Task.Run(() =>
{
var result = worker(workItem, this);
this.results.Enqueue(result);
countdownEvent.Signal();
});
}
}
public class OrchestratorAsync<TWorkItem, TResult>
{
private ConcurrentQueue<TResult> results;
private volatile int countdown;
private TaskCompletionSource<IEnumerable<TResult>> tcs;
public OrchestratorAsync()
{
this.results = new();
this.countdown = 0;
this.tcs = new TaskCompletionSource<IEnumerable<TResult>>();
}
public Task<IEnumerable<TResult>> OrchestrateWorkersAsync(
IEnumerable<TWorkItem> workItems,
Func<TWorkItem, OrchestratorAsync<TWorkItem, TResult>, TResult> worker)
{
this.countdown = 0; // just a normal integer.
foreach (var workItem in workItems)
{
SpawnWorker(workItem, worker);
}
return tcs.Task;
}
public void SpawnWorker(TWorkItem workItem,
Func<TWorkItem, OrchestratorAsync<TWorkItem, TResult>, TResult> worker)
{
Interlocked.Increment(ref this.countdown);
Task.Run(() =>
{
var result = worker(workItem, this);
this.results.Enqueue(result);
if (Interlocked.Decrement(ref countdown) == 0)
{
this.tcs.SetResult(this.results);
}
});
}
}
}
There's one big problem with the code as-written: the tasks fired off by Task.Run are discarded. This means there's no way to detect if anything goes wrong (i.e., an exception). It also means that there's not an easy way to aggregate results during execution, which is a common requirement; this lack of natural result handling is making the code collect results "out of band" in a separate collection.
These are flags that this code's structure needs adjusting. This is actual parallel code (i.e., not asynchronous), so parallel patterns are appropriate. You don't know how many tasks you need initially, so basic Data/Task Parallelism (such as a Parallel or PLINQ approach) won't suffice. At this point you need Dynamic Task Parallelism, which is the most complex kind of parallelism. The TPL does support it; your code just has to use the lower-level APIs to get it done.
Since you have dynamically-added work and since your structure is generally tree-shaped (each work can add other work), you can introduce an artificial root and then use child tasks. This will give you two - and possibly three - benefits:
All exceptions are no longer ignored. Child task exceptions are propagated up to their parents, all the way to the root.
You know when all the tasks are complete. Since parent tasks only complete when all their children complete, there's no need for a countdown event or any other orchestrating synchronization primitive; your code just has to wait on the root task, and all the work is done when that task completes.
If it is possible/desirable to reduce results as you go (a common requirement), then the child tasks can return the results and you will end up with the already-reduced results as the result of your root task.
Example code (ignoring (3) since it's not clear whether results can be reduced):
public class OrchestratorParentChild<TWorkItem, TResult>
{
private readonly ConcurrentQueue<TResult> results = new();
public IEnumerable<TResult> OrchestrateWorkers(
IEnumerable<TWorkItem> workItems,
Func<TWorkItem, OrchestratorParentChild<TWorkItem, TResult>, TResult> worker)
{
var rootTask = Task.Factory.StartNew(
() =>
{
foreach (var workItem in workItems)
SpawnWorker(workItem, worker);
},
default,
TaskCreationOptions.None,
TaskScheduler.Default);
rootTask.Wait();
return results;
}
public void SpawnWorker(
TWorkItem workItem,
Func<TWorkItem, OrchestratorParentChild<TWorkItem, TResult>, TResult> worker)
{
_ = Task.Factory.StartNew(
() => results.Enqueue(worker(workItem, this)),
default,
TaskCreationOptions.AttachedToParent,
TaskScheduler.Default);
}
}
Note that an "orchestrator" isn't normally used. Code using the Dynamic Task Parallelism pattern usually just calls StartNew directly instead of calling some orchestrator "spawn work" method.
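As a rough illustration of that style, a worker can attach its children directly. This sketch uses int work items like the sample above; ProcessWorkItem is a hypothetical worker, not code from the answer:
static void ProcessWorkItem(int workItem)
{
    // ... do the actual CPU-bound work for this item ...
    if (workItem > 0)
    {
        // Spawn children attached to the current (parent) task.
        Task.Factory.StartNew(
            () => ProcessWorkItem(workItem - 1),
            default,
            TaskCreationOptions.AttachedToParent,
            TaskScheduler.Default);
        Task.Factory.StartNew(
            () => ProcessWorkItem(workItem - 1),
            default,
            TaskCreationOptions.AttachedToParent,
            TaskScheduler.Default);
    }
}
static void RunTree()
{
    var root = Task.Factory.StartNew(
        () => ProcessWorkItem(3),
        default,
        TaskCreationOptions.None,
        TaskScheduler.Default);
    // Waiting on the root waits for the whole tree of attached children;
    // any child exception surfaces here as an AggregateException.
    root.Wait();
}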
In case you're wondering how this may look with results, here's one possibility:
public class OrchestratorParentChild<TWorkItem, TResult>
{
public TResult OrchestrateWorkers(
IEnumerable<TWorkItem> workItems,
Func<TWorkItem, OrchestratorParentChild<TWorkItem, TResult>, Func<IEnumerable<TResult>, TResult>, TResult> worker,
Func<IEnumerable<TResult>, TResult> resultReducer)
{
var rootTask = Task.Factory.StartNew(
() =>
{
var childTasks = workItems.Select(x => SpawnWorker(x, worker, resultReducer)).ToArray();
Task.WaitAll(childTasks);
return resultReducer(childTasks.Select(x => x.Result));
},
default,
TaskCreationOptions.None,
TaskScheduler.Default);
return rootTask.Result;
}
public Task<TResult> SpawnWorker(
TWorkItem workItem,
Func<TWorkItem, OrchestratorParentChild<TWorkItem, TResult>, Func<IEnumerable<TResult>, TResult>, TResult> worker,
Func<IEnumerable<TResult>, TResult> resultReducer)
{
return Task.Factory.StartNew(
() => worker(workItem, this, resultReducer),
default,
TaskCreationOptions.AttachedToParent,
TaskScheduler.Default);
}
}
As a final note, I rarely plug my book on this site, but you may find it helpful. Also consider a copy of "Parallel Programming with Microsoft® .NET: Design Patterns for Decomposition and Coordination on Multicore Architectures" if you can find it; it's a bit out of date in some places but still good overall if you want to do TPL programming.
I am new to the programming world. I am doing my graduation and also learning .NET.
I want to iterate my list with a parallel foreach, but I want to use partitioning there. I lack the knowledge, so my code is not compiling.
This is the way I did it first, which works:
Parallel.ForEach(MyBroker, broker =>
{
mybrow = new WeightageRowNumber();
mybrow.RowNumber = Interlocked.Increment(ref rowNumber);
lock (_lock)
{
Mylist.Add(mybrow);
}
});
Now I want to use partitioning, so I changed my code this way, but now my code does not compile. Here is the code:
Parallel.ForEach(MyBroker, broker,
(j, loop, subtotal) =>
{
mybrow = new WeightageRowNumber();
mybrow.RowNumber = Interlocked.Increment(ref rowNumber);
lock (_lock)
{
Mylist.Add(mybrow);
}
return brokerRowWeightageRowNumber.RowNumber;
},
(finalResult) =>
var rownum= Interlocked.Increment(ref finalResult);
console.writeline(rownum);
);
Please see my second set of code and show me how to restructure it to use partitioning with Parallel.ForEach to iterate my list.
Please guide me. Thanks.
The Parallel.ForEach method has 20 overloads - perhaps try a different overload?
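In particular, the second snippet looks like an attempt at the overload that carries thread-local state (localInit / body / localFinally). A self-contained sketch of that shape, with placeholder data since the original types aren't shown, could look like this:
// Hypothetical, self-contained sketch of the localInit / body / localFinally overload.
// The broker collection and member names are stand-ins for the asker's code.
var myBrokers = new List<string> { "A", "B", "C", "D" };
var myList = new List<int>();
var gate = new object();
int rowNumber = 0;
Parallel.ForEach(
    myBrokers,
    () => 0,                              // localInit: one running subtotal per partition
    (broker, loopState, subtotal) =>      // body: runs for each element
    {
        int row = Interlocked.Increment(ref rowNumber);
        lock (gate)
        {
            myList.Add(row);
        }
        return subtotal + 1;              // carry the per-partition count forward
    },
    subtotal =>                           // localFinally: runs once per partition at the end
        Console.WriteLine("This partition handled {0} brokers", subtotal));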
Without your dependencies included I can't give a one-to-one example of your implementation, but here is an in-depth example (reformatted from here) that you can copy into your IDE and set debug breakpoints on (if that's useful). Unfortunately, building an instantiable subclass of OrderablePartitioner appears non-trivial, so apologies for all the boilerplate code:
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading;
using System.Collections.Concurrent;
using System.Collections;
using System.Linq;
// Simple partitioner that will extract one (index,item) pair at a time,
// in a thread-safe fashion, from the underlying collection.
class SingleElementOrderablePartitioner<T> : OrderablePartitioner<T>
{
// The collection being wrapped by this Partitioner
IEnumerable<T> m_referenceEnumerable;
// Class used to wrap m_index for the purpose of sharing access to it
// between an InternalEnumerable and multiple InternalEnumerators
private class Shared<U>
{
internal U Value;
public Shared(U item)
{
Value = item;
}
}
// Internal class that serves as a shared enumerable for the
// underlying collection.
private class InternalEnumerable : IEnumerable<KeyValuePair<long, T>>, IDisposable
{
IEnumerator<T> m_reader;
bool m_disposed = false;
Shared<long> m_index = null;
// These two are used to implement Dispose() when static partitioning is being performed
int m_activeEnumerators;
bool m_downcountEnumerators;
// "downcountEnumerators" will be true for static partitioning, false for
// dynamic partitioning.
public InternalEnumerable(IEnumerator<T> reader, bool downcountEnumerators)
{
m_reader = reader;
m_index = new Shared<long>(0);
m_activeEnumerators = 0;
m_downcountEnumerators = downcountEnumerators;
}
public IEnumerator<KeyValuePair<long, T>> GetEnumerator()
{
if (m_disposed)
throw new ObjectDisposedException("InternalEnumerable: Can't call GetEnumerator() after disposing");
// For static partitioning, keep track of the number of active enumerators.
if (m_downcountEnumerators) Interlocked.Increment(ref m_activeEnumerators);
return new InternalEnumerator(m_reader, this, m_index);
}
IEnumerator<KeyValuePair<long, T>> IEnumerable<KeyValuePair<long, T>>.GetEnumerator()
{
return this.GetEnumerator();
}
public void Dispose()
{
if (!m_disposed)
{
// Only dispose the source enumerator if you are doing dynamic partitioning
if (!m_downcountEnumerators)
{
m_reader.Dispose();
}
m_disposed = true;
}
}
// Called from Dispose() method of spawned InternalEnumerator. During
// static partitioning, the source enumerator will be automatically
// disposed once all requested InternalEnumerators have been disposed.
public void DisposeEnumerator()
{
if (m_downcountEnumerators)
{
if (Interlocked.Decrement(ref m_activeEnumerators) == 0)
{
m_reader.Dispose();
}
}
}
IEnumerator IEnumerable.GetEnumerator()
{
throw new NotImplementedException();
}
}
// Internal class that serves as a shared enumerator for
// the underlying collection.
private class InternalEnumerator : IEnumerator<KeyValuePair<long, T>>
{
KeyValuePair<long, T> m_current;
IEnumerator<T> m_source;
InternalEnumerable m_controllingEnumerable;
Shared<long> m_index = null;
bool m_disposed = false;
public InternalEnumerator(IEnumerator<T> source, InternalEnumerable controllingEnumerable, Shared<long> index)
{
m_source = source;
m_current = default(KeyValuePair<long, T>);
m_controllingEnumerable = controllingEnumerable;
m_index = index;
}
object IEnumerator.Current
{
get { return m_current; }
}
KeyValuePair<long, T> IEnumerator<KeyValuePair<long, T>>.Current
{
get { return m_current; }
}
void IEnumerator.Reset()
{
throw new NotSupportedException("Reset() not supported");
}
// This method is the crux of this class. Under lock, it calls
// MoveNext() on the underlying enumerator, grabs Current and index,
// and increments the index.
bool IEnumerator.MoveNext()
{
bool rval = false;
lock (m_source)
{
rval = m_source.MoveNext();
if (rval)
{
m_current = new KeyValuePair<long, T>(m_index.Value, m_source.Current);
m_index.Value = m_index.Value + 1;
}
else m_current = default(KeyValuePair<long, T>);
}
return rval;
}
void IDisposable.Dispose()
{
if (!m_disposed)
{
// Delegate to parent enumerable's DisposeEnumerator() method
m_controllingEnumerable.DisposeEnumerator();
m_disposed = true;
}
}
}
// Constructor just grabs the collection to wrap
public SingleElementOrderablePartitioner(IEnumerable<T> enumerable)
: base(true, true, true)
{
// Verify that the source IEnumerable is not null
if (enumerable == null)
throw new ArgumentNullException("enumerable");
m_referenceEnumerable = enumerable;
}
// Produces a list of "numPartitions" IEnumerators that can each be
// used to traverse the underlying collection in a thread-safe manner.
// This will return a static number of enumerators, as opposed to
// GetOrderableDynamicPartitions(), the result of which can be used to produce
// any number of enumerators.
public override IList<IEnumerator<KeyValuePair<long, T>>> GetOrderablePartitions(int numPartitions)
{
if (numPartitions < 1)
throw new ArgumentOutOfRangeException("NumPartitions");
List<IEnumerator<KeyValuePair<long, T>>> list = new List<IEnumerator<KeyValuePair<long, T>>>(numPartitions);
// Since we are doing static partitioning, create an InternalEnumerable with reference
// counting of spawned InternalEnumerators turned on. Once all of the spawned enumerators
// are disposed, dynamicPartitions will be disposed.
var dynamicPartitions = new InternalEnumerable(m_referenceEnumerable.GetEnumerator(), true);
for (int i = 0; i < numPartitions; i++)
list.Add(dynamicPartitions.GetEnumerator());
return list;
}
// Returns an instance of our internal Enumerable class. GetEnumerator()
// can then be called on that (multiple times) to produce shared enumerators.
public override IEnumerable<KeyValuePair<long, T>> GetOrderableDynamicPartitions()
{
// Since we are doing dynamic partitioning, create an InternalEnumerable with reference
// counting of spawned InternalEnumerators turned off. This returned InternalEnumerable
// will need to be explicitly disposed.
return new InternalEnumerable(m_referenceEnumerable.GetEnumerator(), false);
}
// Must be set to true if GetDynamicPartitions() is supported.
public override bool SupportsDynamicPartitions
{
get { return true; }
}
}
Here are examples of how to structure Parallel.ForEach using the above OrderablePartitioner. Note how you can refactor your localFinally block entirely out of the ForEach implementation.
public class Program
{
static void Main(string[] args)
{
//
// First a fairly simple visual test
//
var someCollection = new string[] { "four", "score", "and", "twenty", "years", "ago" };
var someOrderablePartitioner = new SingleElementOrderablePartitioner<string>(someCollection);
Parallel.ForEach(someOrderablePartitioner, (item, state, index) =>
{
Console.WriteLine("ForEach: item = {0}, index = {1}, thread id = {2}", item, index, Thread.CurrentThread.ManagedThreadId);
});
//
// Now a more rigorous test of dynamic partitioning (used by Parallel.ForEach)
//
List<int> src = Enumerable.Range(0, 100000).ToList();
SingleElementOrderablePartitioner<int> myOP = new SingleElementOrderablePartitioner<int>(src);
int counter = 0;
bool mismatch = false;
Parallel.ForEach(myOP, (item, state, index) =>
{
if (item != index) mismatch = true;
Interlocked.Increment(ref counter);
});
if (mismatch) Console.WriteLine("OrderablePartitioner Test: index mismatch detected");
Console.WriteLine("OrderablePartitioner test: counter = {0}, should be 100000", counter);
}
}
Also this link might be useful ("Write a simple parallel.ForEach Loop")
Consider the following possible interface for an immutable generic enumerator:
interface IImmutableEnumerator<T>
{
(bool Succesful, IImmutableEnumerator<T> NewEnumerator) MoveNext();
T Current { get; }
}
How would you implement this in a reasonably performant way in C#? I'm a little out of ideas, because the IEnumerator infrastructure in .NET is inherently mutable and I can't see a way around it.
A naive implementation would be to simply create a new enumerator on every MoveNext(), handing down a new inner mutable enumerator created with current.Skip(1).GetEnumerator(), but that is horribly inefficient (sketched below).
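To make the inefficiency concrete, that naive approach might look roughly like this sketch against the interface above (each MoveNext() rebuilds an enumerator over a Skip chain, so consuming n elements re-enumerates the underlying sequence O(n^2) times):
class NaiveImmutableEnumerator<T> : IImmutableEnumerator<T>
{
    private readonly IEnumerable<T> _rest;
    public T Current { get; }
    private NaiveImmutableEnumerator(T current, IEnumerable<T> rest)
    {
        Current = current;
        _rest = rest;
    }
    public static (bool Succesful, IImmutableEnumerator<T> NewEnumerator) Create(IEnumerable<T> source)
    {
        using (var e = source.GetEnumerator())
        {
            if (!e.MoveNext())
                return (false, null);
            return (true, new NaiveImmutableEnumerator<T>(e.Current, source.Skip(1)));
        }
    }
    public (bool Succesful, IImmutableEnumerator<T> NewEnumerator) MoveNext() => Create(_rest);
}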
I'm implementing a parser that needs to be able to look ahead; using an immutable enumerator would make things cleaner and easier to follow so I'm curious if there is an easy way to do this that I might be missing.
The input is an IEnumerable<T> and I can't change that. I can always materialize the enumerable with ToList() of course (with an IList in hand, looking ahead is trivial), but the data can be pretty large and I'd like to avoid it, if possible.
This is it:
public class ImmutableEnumerator<T> : IImmutableEnumerator<T>, IDisposable
{
public static (bool Succesful, IImmutableEnumerator<T> NewEnumerator) Create(IEnumerable<T> source)
{
var enumerator = source.GetEnumerator();
var successful = enumerator.MoveNext();
return (successful, new ImmutableEnumerator<T>(successful, enumerator));
}
private IEnumerator<T> _enumerator;
private (bool Succesful, IImmutableEnumerator<T> NewEnumerator) _runOnce = (false, null);
private ImmutableEnumerator(bool successful, IEnumerator<T> enumerator)
{
_enumerator = enumerator;
this.Current = successful ? _enumerator.Current : default(T);
if (!successful)
{
_enumerator.Dispose();
}
}
public (bool Succesful, IImmutableEnumerator<T> NewEnumerator) MoveNext()
{
if (_runOnce.NewEnumerator == null)
{
var successful = _enumerator.MoveNext();
_runOnce = (successful, new ImmutableEnumerator<T>(successful, _enumerator));
}
return _runOnce;
}
public T Current { get; private set; }
public void Dispose()
{
_enumerator.Dispose();
}
}
My test code succeeds nicely:
var xs = new[] { 1, 2, 3 };
var ie = ImmutableEnumerator<int>.Create(xs);
if (ie.Succesful)
{
Console.WriteLine(ie.NewEnumerator.Current);
var ie1 = ie.NewEnumerator.MoveNext();
if (ie1.Succesful)
{
Console.WriteLine(ie1.NewEnumerator.Current);
var ie2 = ie1.NewEnumerator.MoveNext();
if (ie2.Succesful)
{
Console.WriteLine(ie2.NewEnumerator.Current);
var ie3 = ie2.NewEnumerator.MoveNext();
if (ie3.Succesful)
{
Console.WriteLine(ie3.NewEnumerator.Current);
var ie4 = ie3.NewEnumerator.MoveNext();
}
}
}
}
This outputs:
1
2
3
It's immutable and it's efficient.
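One nice consequence for the parser scenario: because MoveNext() is memoized via _runOnce, calling it repeatedly from the same position re-uses the same result, so look-ahead never re-enumerates the source. A small sketch of that:
var start = ImmutableEnumerator<char>.Create("abc");
if (start.Succesful)
{
    Console.WriteLine(start.NewEnumerator.Current);                            // a
    var peek1 = start.NewEnumerator.MoveNext();                                // advances the source once
    var peek2 = start.NewEnumerator.MoveNext();                                // served from the cached tuple
    Console.WriteLine(ReferenceEquals(peek1.NewEnumerator, peek2.NewEnumerator)); // True
    Console.WriteLine(peek1.NewEnumerator.Current);                            // b
}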
Here's a version using Lazy<(bool, IImmutableEnumerator<T>)> as per a request in the comments:
public class ImmutableEnumerator<T> : IImmutableEnumerator<T>, IDisposable
{
public static (bool Succesful, IImmutableEnumerator<T> NewEnumerator) Create(IEnumerable<T> source)
{
var enumerator = source.GetEnumerator();
var successful = enumerator.MoveNext();
return (successful, new ImmutableEnumerator<T>(successful, enumerator));
}
private IEnumerator<T> _enumerator;
private Lazy<(bool, IImmutableEnumerator<T>)> _runOnce;
private ImmutableEnumerator(bool successful, IEnumerator<T> enumerator)
{
_enumerator = enumerator;
this.Current = successful ? _enumerator.Current : default(T);
if (!successful)
{
_enumerator.Dispose();
}
_runOnce = new Lazy<(bool, IImmutableEnumerator<T>)>(() =>
{
var s = _enumerator.MoveNext();
return (s, new ImmutableEnumerator<T>(s, _enumerator));
});
}
public (bool Succesful, IImmutableEnumerator<T> NewEnumerator) MoveNext()
{
return _runOnce.Value;
}
public T Current { get; private set; }
public void Dispose()
{
_enumerator.Dispose();
}
}
You can achieve pseudo immutability suitable in this particular scenario by utilising a singly linked list. It allows for infinite look-ahead (limited only by your heap size) without the ability to look at previously processed nodes (unless you happen to store a reference to a previously processed node - which you shouldn't).
This solution addresses the requirements as stated (except for not conforming to your exact interface, with all of its functionality nevertheless intact).
The usage of such a linked list might look like this:
IEnumerable<int> numbersFromZeroToNine = Enumerable.Range(0, 10);
using (IEnumerator<int> enumerator = numbersFromZeroToNine.GetEnumerator())
{
var node = LazySinglyLinkedListNode<int>.CreateListHead(enumerator);
while (node != null)
{
Console.WriteLine($"Current value: {node.Value}.");
if (node.Next != null)
{
// Single-element look-ahead. Technically you could do node.Next.Next...Next.
// You can also nest another while loop here, and look ahead as much as needed.
Console.WriteLine($"Next value: {node.Next.Value}.");
}
else
{
Console.WriteLine("End of collection reached. There is no next value.");
}
node = node.Next;
// At this point the object which used to be referenced by the "node" local
// becomes eligible for collection, preventing unbounded memory growth.
}
}
Output:
Current value: 0.
Next value: 1.
Current value: 1.
Next value: 2.
Current value: 2.
Next value: 3.
Current value: 3.
Next value: 4.
Current value: 4.
Next value: 5.
Current value: 5.
Next value: 6.
Current value: 6.
Next value: 7.
Current value: 7.
Next value: 8.
Current value: 8.
Next value: 9.
Current value: 9.
End of collection reached. There is no next value.
The implementation is as follows:
sealed class LazySinglyLinkedListNode<T>
{
public static LazySinglyLinkedListNode<T> CreateListHead(IEnumerator<T> enumerator)
{
return enumerator.MoveNext() ? new LazySinglyLinkedListNode<T>(enumerator) : null;
}
public T Value { get; }
private IEnumerator<T> Enumerator;
private LazySinglyLinkedListNode<T> _next;
public LazySinglyLinkedListNode<T> Next
{
get
{
if (_next == null && Enumerator != null)
{
if (Enumerator.MoveNext())
{
_next = new LazySinglyLinkedListNode<T>(Enumerator);
}
else
{
Enumerator = null; // We've reached the end.
}
}
return _next;
}
}
private LazySinglyLinkedListNode(IEnumerator<T> enumerator)
{
Value = enumerator.Current;
Enumerator = enumerator;
}
}
An important thing to note here is that the source collection is only enumerated once, lazily, with MoveNext being called at most once per each node's lifetime regardless of how many times you access Next.
Using a doubly-linked list would allow look-behind, but would cause unbounded memory growth and require periodic pruning, which is not trivial. A singly linked list avoids this issue as long as you are not storing node references outside of your main loop. In the example above, you could replace numbersFromZeroToNine with an IEnumerable<int> generator that yields integers infinitely, and the loop would run forever without running out of memory.
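To make that last point concrete, here is a sketch of an infinite source driving the same loop; memory stays bounded because each node becomes unreachable, and therefore collectible, as soon as the loop moves past it:
static IEnumerable<int> InfiniteIntegers()
{
    for (int i = 0; ; i++)
        yield return i;
}
using (IEnumerator<int> enumerator = InfiniteIntegers().GetEnumerator())
{
    var node = LazySinglyLinkedListNode<int>.CreateListHead(enumerator);
    while (node != null) // runs forever by design, with bounded memory
    {
        Console.WriteLine("Current: {0}, next: {1}", node.Value, node.Next.Value);
        node = node.Next;
        // The node we just left is now unreachable and eligible for collection.
    }
}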
I ran into a weird issue and I'm wondering what I should do about it.
I have this class that returns an IEnumerable<MyClass> using deferred execution. Right now, there are two possible consumers. One of them sorts the result.
See the following example :
public class SomeClass
{
public IEnumerable<MyClass> GetMyStuff(Param givenParam)
{
double culmulativeSum = 0;
return myStuff.Where(...)
.OrderBy(...)
.TakeWhile( o =>
{
bool returnValue = culmulativeSum < givenParam.Maximum;
culmulativeSum += o.SomeNumericValue;
return returnValue;
});
}
}
Consumers call the deferred execution only once, but if they were to call it more than that, the result would be wrong, as the culmulativeSum wouldn't be reset. I found the issue inadvertently through unit testing.
The easiest way for me to fix the issue would be to just add .ToArray() and get rid of the deferred execution at the cost of a little bit of overhead.
I could also add unit tests in the consumer classes to ensure they call it only once, but that wouldn't protect any new consumer coded in the future from this potential issue.
Another thing that came to my mind was to make subsequent execution throw.
Something like
return myStuff.Where(...)
.OrderBy(...)
.TakeWhile(...)
.ThrowIfExecutedMoreThan(1);
Obviously this doesn't exist.
Would it be a good idea to implement such thing and how would you do it?
Otherwise, if there is a big pink elephant that I don't see, pointing it out will be appreciated. (I feel there is one because this question is about a very basic scenario :| )
EDIT :
Here is a bad consumer usage example :
public class ConsumerClass
{
public void WhatEverMethod()
{
SomeClass some = new SomeClass();
var stuffs = some.GetMyStuff(param);
var nb = stuffs.Count(); //first deferred execution
var firstOne = stuffs.First(); //second deferred execution with the culmulativeSum not reset
}
}
You can solve the incorrect result issue by simply turning your method into an iterator:
double culmulativeSum = 0;
var query = myStuff.Where(...)
.OrderBy(...)
.TakeWhile(...);
foreach (var item in query) yield return item;
It can be encapsulated in a simple extension method:
public static class Iterators
{
public static IEnumerable<T> Lazy<T>(Func<IEnumerable<T>> source)
{
foreach (var item in source())
yield return item;
}
}
Then all you need to do in such scenarios is to surround the original method body with Iterators.Lazy call, e.g.:
return Iterators.Lazy(() =>
{
double culmulativeSum = 0;
return myStuff.Where(...)
.OrderBy(...)
.TakeWhile(...);
});
You can use the following class:
public class JustOnceOrElseEnumerable<T> : IEnumerable<T>
{
private readonly IEnumerable<T> decorated;
public JustOnceOrElseEnumerable(IEnumerable<T> decorated)
{
this.decorated = decorated;
}
private bool CalledAlready;
public IEnumerator<T> GetEnumerator()
{
if (CalledAlready)
throw new Exception("Enumerated already");
CalledAlready = true;
return decorated.GetEnumerator();
}
IEnumerator IEnumerable.GetEnumerator()
{
if (CalledAlready)
throw new Exception("Enumerated already");
CalledAlready = true;
return decorated.GetEnumerator();
}
}
to decorate an enumerable so that it can only be enumerated once. After that it would throw an exception.
You can use this class like this:
return new JustOnceOrElseEnumerable<MyClass>(
myStuff.Where(...)
...
);
Please note that I do not recommend this approach because it violates the contract of the IEnumerable interface and thus the Liskov Substitution Principle. It is legal for consumers of this contract to assume that they can enumerate the enumerable as many times as they like.
Instead, you can use a cached enumerable that caches the result of enumeration. This ensures that the enumerable is only enumerated once and that all subsequent enumeration attempts would read from the cache. See this answer here for more information.
Ivan's answer is very fitting for the underlying issue in OP's example - but for the general case, I have approached this in the past using an extension method similar to the one below. This ensures that the Enumerable has a single evaluation but is also deferred:
public static IMemoizedEnumerable<T> Memoize<T>(this IEnumerable<T> source)
{
return new MemoizedEnumerable<T>(source);
}
private class MemoizedEnumerable<T> : IMemoizedEnumerable<T>, IDisposable
{
private readonly IEnumerator<T> _sourceEnumerator;
private readonly List<T> _cache = new List<T>();
public MemoizedEnumerable(IEnumerable<T> source)
{
_sourceEnumerator = source.GetEnumerator();
}
public IEnumerator<T> GetEnumerator()
{
return IsMaterialized ? _cache.GetEnumerator() : Enumerate();
}
private IEnumerator<T> Enumerate()
{
foreach (var value in _cache)
{
yield return value;
}
while (_sourceEnumerator.MoveNext())
{
_cache.Add(_sourceEnumerator.Current);
yield return _sourceEnumerator.Current;
}
_sourceEnumerator.Dispose();
IsMaterialized = true;
}
IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
public List<T> Materialize()
{
if (IsMaterialized)
return _cache;
while (_sourceEnumerator.MoveNext())
{
_cache.Add(_sourceEnumerator.Current);
}
_sourceEnumerator.Dispose();
IsMaterialized = true;
return _cache;
}
public bool IsMaterialized { get; private set; }
void IDisposable.Dispose()
{
if(!IsMaterialized)
_sourceEnumerator.Dispose();
}
}
public interface IMemoizedEnumerable<T> : IEnumerable<T>
{
List<T> Materialize();
bool IsMaterialized { get; }
}
Example Usage:
void Consumer()
{
//var results = GetValuesComplex();
//var results = GetValuesComplex().ToList();
var results = GetValuesComplex().Memoize();
if(results.Any(i => i == 3))
{
Console.WriteLine("\nFirst Iteration");
//return; //Potential for early exit.
}
var last = results.Last(); // Causes multiple enumeration in naive case.
Console.WriteLine("\nSecond Iteration");
}
IEnumerable<int> GetValuesComplex()
{
for (int i = 0; i < 5; i++)
{
//... complex operations ...
Console.Write(i + ", ");
yield return i;
}
}
Naive: ✔ Deferred, ✘ Single enumeration.
ToList: ✘ Deferred, ✔ Single enumeration.
Memoize: ✔ Deferred, ✔ Single enumeration.
Edited to use the proper terminology and flesh out the implementation.
Provided items is the result of a LINQ expression:
var items = from item in ItemsSource.RetrieveItems()
where ...
Suppose generation of each item takes some non-negligible time.
Two modes of operation are possible:
Using foreach would allow us to start working with items at the beginning of the collection much sooner than those at the end become available. However, if we wanted to process the same collection again later, we would have to save a copy of it:
var storedItems = new List<Item>();
foreach(var item in items)
{
Process(item);
storedItems.Add(item);
}
// Later
foreach(var item in storedItems)
{
ProcessMore(item);
}
Because if we just did foreach(... in items) again, ItemsSource.RetrieveItems() would get called again.
We could use .ToList() right upfront, but that would force us to wait for the last item to be retrieved before we could start processing the first one.
Question: Is there an IEnumerable implementation that would iterate the first time like a regular LINQ query result, but would materialize in the process so that a second foreach would iterate over the stored values?
A fun challenge, so I have to provide my own solution. So fun, in fact, that my solution is now in version 3. Version 2 was a simplification I made based on feedback from Servy. I then realized that my solution had a huge drawback: if the first enumeration of the cached enumerable didn't complete, no caching would be done. Many LINQ extensions like First and Take will only enumerate enough of the enumerable to get the job done, so I had to update to version 3 to make this work with caching.
The question is about subsequent enumerations of the enumerable, which do not involve concurrent access. Nevertheless I have decided to make my solution thread safe. It adds some complexity and a bit of overhead, but should allow the solution to be used in all scenarios.
public static class EnumerableExtensions {
public static IEnumerable<T> Cached<T>(this IEnumerable<T> source) {
if (source == null)
throw new ArgumentNullException("source");
return new CachedEnumerable<T>(source);
}
}
class CachedEnumerable<T> : IEnumerable<T> {
readonly Object gate = new Object();
readonly IEnumerable<T> source;
readonly List<T> cache = new List<T>();
IEnumerator<T> enumerator;
bool isCacheComplete;
public CachedEnumerable(IEnumerable<T> source) {
this.source = source;
}
public IEnumerator<T> GetEnumerator() {
lock (this.gate) {
if (this.isCacheComplete)
return this.cache.GetEnumerator();
if (this.enumerator == null)
this.enumerator = source.GetEnumerator();
}
return GetCacheBuildingEnumerator();
}
public IEnumerator<T> GetCacheBuildingEnumerator() {
var index = 0;
T item;
while (TryGetItem(index, out item)) {
yield return item;
index += 1;
}
}
bool TryGetItem(Int32 index, out T item) {
lock (this.gate) {
if (!IsItemInCache(index)) {
// The iteration may have completed while waiting for the lock.
if (this.isCacheComplete) {
item = default(T);
return false;
}
if (!this.enumerator.MoveNext()) {
item = default(T);
this.isCacheComplete = true;
this.enumerator.Dispose();
return false;
}
this.cache.Add(this.enumerator.Current);
}
item = this.cache[index];
return true;
}
}
bool IsItemInCache(Int32 index) {
return index < this.cache.Count;
}
IEnumerator IEnumerable.GetEnumerator() {
return GetEnumerator();
}
}
The extension is used like this (sequence is an IEnumerable<T>):
var cachedSequence = sequence.Cached();
// Pulling 2 items from the sequence.
foreach (var item in cachedSequence.Take(2))
// ...
// Pulling 2 items from the cache and the rest from the source.
foreach (var item in cachedSequence)
// ...
// Pulling all items from the cache.
foreach (var item in cachedSequence)
// ...
There is a slight leak if only part of the enumerable is enumerated (e.g. cachedSequence.Take(2).ToList()). The enumerator used by ToList will be disposed, but the underlying source enumerator is not. This is because the first 2 items are cached and the source enumerator is kept alive should requests for subsequent items be made. In that case the source enumerator is only cleaned up when eligible for garbage collection (which will be at the same time as the possibly large cache).
Take a look at the Reactive Extensions library - there is a MemoizeAll() extension which will cache the items in your IEnumerable once they're accessed, and store them for future accesses.
See this blog post by Bart De Smet for a good read on MemoizeAll and other Rx methods.
Edit: This is actually found in the separate Interactive Extensions package now - available from NuGet or Microsoft Download.
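Usage is a one-liner. A sketch, reusing the question's placeholder methods and assuming the EnumerableEx extension described above (the operator is exposed as MemoizeAll in the older releases discussed here; current System.Interactive versions expose the equivalent behavior as Memoize):
var items = ItemsSource.RetrieveItems().MemoizeAll();
foreach (var item in items) Process(item);       // items are pulled from the source and cached as they arrive
foreach (var item in items) ProcessMore(item);   // second pass replays the cache, no re-retrieval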
public static IEnumerable<T> SingleEnumeration<T>(this IEnumerable<T> source)
{
return new SingleEnumerator<T>(source);
}
private class SingleEnumerator<T> : IEnumerable<T>
{
private CacheEntry<T> cacheEntry;
public SingleEnumerator(IEnumerable<T> sequence)
{
cacheEntry = new CacheEntry<T>(sequence.GetEnumerator());
}
public IEnumerator<T> GetEnumerator()
{
if (cacheEntry.FullyPopulated)
{
return cacheEntry.CachedValues.GetEnumerator();
}
else
{
return iterateSequence<T>(cacheEntry).GetEnumerator();
}
}
IEnumerator IEnumerable.GetEnumerator()
{
return this.GetEnumerator();
}
}
private static IEnumerable<T> iterateSequence<T>(CacheEntry<T> entry)
{
using (var iterator = entry.CachedValues.GetEnumerator())
{
int i = 0;
while (entry.ensureItemAt(i) && iterator.MoveNext())
{
yield return iterator.Current;
i++;
}
}
}
private class CacheEntry<T>
{
public bool FullyPopulated { get; private set; }
public ConcurrentQueue<T> CachedValues { get; private set; }
private static object key = new object();
private IEnumerator<T> sequence;
public CacheEntry(IEnumerator<T> sequence)
{
this.sequence = sequence;
CachedValues = new ConcurrentQueue<T>();
}
/// <summary>
/// Ensure that the cache has an item at the provided index. If not, take an item from the
/// input sequence and move it to the cache.
///
/// The method is thread safe.
/// </summary>
/// <returns>True if the cache already had enough items or
/// an item was moved to the cache,
/// false if there were no more items in the sequence.</returns>
public bool ensureItemAt(int index)
{
//if the cache already has the items we don't need to lock to know we
//can get it
if (index < CachedValues.Count)
return true;
//if we're done there are no race conditions here either
if (FullyPopulated)
return false;
lock (key)
{
//re-check the early-exit conditions in case they changed while we were
//waiting on the lock.
//we already have the cached item
if (index < CachedValues.Count)
return true;
//we don't have the cached item and there are no uncached items
if (FullyPopulated)
return false;
//we actually need to get the next item from the sequence.
if (sequence.MoveNext())
{
CachedValues.Enqueue(sequence.Current);
return true;
}
else
{
FullyPopulated = true;
return false;
}
}
}
}
So this has been edited (substantially) to support multithreaded access. Several threads can ask for items, and they will be cached on an item-by-item basis. It doesn't need to wait for the entire sequence to be iterated before it can return cached values. Below is a sample program that demonstrates this:
private static IEnumerable<int> interestingIntGenerationMethod(int maxValue)
{
for (int i = 0; i < maxValue; i++)
{
Thread.Sleep(1000);
Console.WriteLine("actually generating value: {0}", i);
yield return i;
}
}
public static void Main(string[] args)
{
IEnumerable<int> sequence = interestingIntGenerationMethod(10)
.SingleEnumeration();
int numThreads = 3;
for (int i = 0; i < numThreads; i++)
{
int taskID = i;
Task.Factory.StartNew(() =>
{
foreach (int value in sequence)
{
Console.WriteLine("Task: {0} Value:{1}",
taskID, value);
}
});
}
Console.WriteLine("Press any key to exit...");
Console.ReadKey(true);
}
You really need to see it run to understand the power here. As soon as a single thread forces the next actual value to be generated, all of the remaining threads can immediately print that generated value, but they will all be waiting if there are no uncached values for that thread to print. (Obviously thread/threadpool scheduling may result in one task taking longer to print its value than needed.)
Thread-safe implementations of the Cached/SingleEnumeration operator have already been posted by Martin Liversage and Servy respectively, and the thread-safe Memoise operator from the System.Interactive package is also available. In case thread safety is not a requirement, and paying the cost of thread synchronization is undesirable, there are answers offering unsynchronized ToCachedEnumerable implementations in this question. All these implementations have in common that they are based on custom types. My challenge was to write a similar non-synchronized operator in a single self-contained extension method (no strings attached). Here is my implementation:
public static IEnumerable<T> MemoiseNotSynchronized<T>(this IEnumerable<T> source)
{
// Argument validation omitted
IEnumerator<T> enumerator = null;
List<T> buffer = null;
return Implementation();
IEnumerable<T> Implementation()
{
if (buffer != null && enumerator == null)
{
// The source has been fully enumerated
foreach (var item in buffer) yield return item;
yield break;
}
enumerator ??= source.GetEnumerator();
buffer ??= new();
for (int i = 0; ; i = checked(i + 1))
{
if (i < buffer.Count)
{
yield return buffer[i];
}
else if (enumerator.MoveNext())
{
Debug.Assert(buffer.Count == i);
var current = enumerator.Current;
buffer.Add(current);
yield return current;
}
else
{
enumerator.Dispose(); enumerator = null;
yield break;
}
}
}
}
Usage example:
IEnumerable<Point> points = GetPointsFromDB().MemoiseNotSynchronized();
// Enumerate the 'points' any number of times, on a single thread.
// The data will be fetched from the DB only once.
// The connection with the DB will open when the 'points' is enumerated
// for the first time, partially or fully.
// The connection will stay open until the 'points' is enumerated fully
// for the first time.
Testing the MemoiseNotSynchronized operator on Fiddle.