List<T>.RemoveAll as parallel - c#

I would like to know an alternative to toProcess.RemoveAll that runs in parallel. The code in my example below works, but sequentially, and I'd like it to run in parallel.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace ParallelTest
{
    using System.Threading;
    using System.Threading.Tasks;

    class Program
    {
        static void Main(string[] args)
        {
            List<VerifySomethingFromInternet> foo = new List<VerifySomethingFromInternet>();
            foo.Add(new VerifySomethingFromInternet("id1", true));
            foo.Add(new VerifySomethingFromInternet("id2", false));
            foo.Add(new VerifySomethingFromInternet("id3", true));
            foo.Add(new VerifySomethingFromInternet("id4", false));
            foo.Add(new VerifySomethingFromInternet("id5", true));
            foo.Add(new VerifySomethingFromInternet("id6", false));

            DoSomethingFromIntert bar = new DoSomethingFromIntert();
            bar.DoesWork(foo);
            Console.ReadLine();
        }
    }

    public class DoSomethingFromIntert
    {
        bool RemoveIFTrueFromInternet(VerifySomethingFromInternet vsfi)
        {
            Console.WriteLine(String.Format("Identification : {0} - Thread : {1}", vsfi.Identification, Thread.CurrentThread.ManagedThreadId));
            // Do some blocking work at internet
            return vsfi.IsRemovable;
        }

        public void DoesWork(List<VerifySomethingFromInternet> toProcess)
        {
            Console.WriteLine(String.Format("total : {0}", toProcess.Count));
            // Remove all entries for which the check returns true
            toProcess.RemoveAll(f => this.RemoveIFTrueFromInternet(f));
            Console.WriteLine(String.Format("total : {0}", toProcess.Count));
        }
    }

    public class VerifySomethingFromInternet
    {
        public VerifySomethingFromInternet(string id, bool remove)
        {
            this.Identification = id;
            this.IsRemovable = remove;
        }

        public string Identification { get; set; }
        public bool IsRemovable { get; set; }
    }
}

var newList = toProcess.AsParallel()
                       .Where(f => !this.RemoveIFTrueFromInternet(f))
                       .ToList();
toProcess = newList;
Probably this answers your question, but I'm not sure that it's really faster. Try and measure.
Note that this may change the order of the elements in the list. If you care about order, add AsOrdered after AsParallel. (Thanks to weston for the [implicit] hint).
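For instance, a minimal sketch of that order-preserving variant (same method names as above):

var newList = toProcess.AsParallel()
                       .AsOrdered() // keep the original relative order of the surviving elements
                       .Where(f => !this.RemoveIFTrueFromInternet(f))
                       .ToList();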

List<T> isn't thread safe, so there is no way to do this in parallel with this type of list.
You can use the thread-safe ConcurrentBag instead, but that one doesn't have a RemoveAll method, obviously.
You can also convert the list to an array, edit that one, and turn it back into a list (see the sketch below).
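A minimal sketch of that array idea, reusing the RemoveIFTrueFromInternet method from the question: run the expensive checks in parallel into a result array, then rebuild the list sequentially:

var items = toProcess.ToArray();
var remove = new bool[items.Length];

// Each iteration writes only its own index, so the parallel phase needs no locking.
Parallel.For(0, items.Length, i =>
{
    remove[i] = RemoveIFTrueFromInternet(items[i]);
});

// Rebuild the list on a single thread from the recorded decisions.
toProcess.Clear();
for (int i = 0; i < items.Length; i++)
    if (!remove[i])
        toProcess.Add(items[i]);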

I tried to restructure your code a bit.
I used a BlockingCollection to implement a producer/consumer scenario.
This does not remove items in parallel, but it may solve your problem by processing them in parallel; give it a try, you may love it.
using System;
using System.Collections.Concurrent;
using System.Threading;

class Program
{
    static void Main(string[] args)
    {
        DoSomethingFromIntert bar = new DoSomethingFromIntert();
        bar.Verify("id1", true);
        bar.Verify("id2", false);
        bar.Verify("id3", true);
        bar.Verify("id4", false);
        bar.Verify("id5", true);
        bar.Verify("id6", false);
        bar.Complete();
        Console.ReadLine();
    }
}

public class DoSomethingFromIntert
{
    BlockingCollection<VerifySomethingFromInternet> toProcess = new BlockingCollection<VerifySomethingFromInternet>();
    ConcurrentBag<VerifySomethingFromInternet> workinglist = new ConcurrentBag<VerifySomethingFromInternet>();

    public DoSomethingFromIntert()
    {
        // init four consumers; you may choose as many as you want
        ThreadPool.QueueUserWorkItem(DoesWork);
        ThreadPool.QueueUserWorkItem(DoesWork);
        ThreadPool.QueueUserWorkItem(DoesWork);
        ThreadPool.QueueUserWorkItem(DoesWork);
    }

    public void Verify(string param, bool flag)
    {
        // add to the processing list
        toProcess.TryAdd(new VerifySomethingFromInternet(param, flag));
    }

    public void Complete()
    {
        // mark the producer as complete and let the threads exit when finished verifying
        toProcess.CompleteAdding();
    }

    bool RemoveIFTrueFromInternet(VerifySomethingFromInternet vsfi)
    {
        Console.WriteLine(String.Format("Identification : {0} - Thread : {1}", vsfi.Identification, Thread.CurrentThread.ManagedThreadId));
        // Do some blocking work at internet
        return vsfi.IsRemovable;
    }

    private void DoesWork(object state)
    {
        Console.WriteLine(String.Format("total : {0}", toProcess.Count));
        foreach (var item in toProcess.GetConsumingEnumerable())
        {
            // do work
            if (!RemoveIFTrueFromInternet(item))
            {
                // keep the item if it survived verification
                workinglist.TryAdd(item);
            }
            // no need to remove; consuming the enumerable removes it automatically
        }
        // this line is only reached after toProcess.CompleteAdding() and once all items are consumed (verified)
        Console.WriteLine(String.Format("total : {0}", toProcess.Count));
    }
}
In short, it will start verifying the items as soon as you add them and will keep the successful items in a separate list.
Edit
The foreach loop over GetConsumingEnumerable() does not end by default; it keeps waiting for the next element until CompleteAdding() is called. That is why I added the Complete() method to the wrapper class, to finish the verification loop once we have pushed all the elements.
The idea is to keep adding verification elements to the class and let the consumer loops verify each of them in parallel; once you are done with all of the elements, call Complete() so the consumers know that no more elements will be added and can terminate the foreach loop once the collection is empty.
In your code, the removal of elements is not the actual performance issue; the synchronous verification loop is the hot spot. Removing from a list costs just a few ms, whereas the expensive part of the code is the blocking work at the internet, so if we can make that parallel we can cut some precious time.
Be careful with the number of consumer threads you initialize. I used the thread pool here, but it can still hurt performance if used excessively, so pick a number based on the machine's capability, e.g. the number of cores/processors; see the sketch below.
more about BlockingCollection
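For example, a minimal sketch of sizing the consumer pool to the machine, as a hypothetical replacement for the four hard-coded calls in the constructor above:

// start one consumer per logical processor instead of a fixed four
for (int i = 0; i < Environment.ProcessorCount; i++)
    ThreadPool.QueueUserWorkItem(DoesWork);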

Related

Take all items from ConcurrentBag using a swap

I'm trying to take all items in one fell swoop from a ConcurrentBag. Since there's nothing like TryEmpty on the collection, I've resorted to using Interlocked.Exchange in the same fashion as described here: How to remove all Items from ConcurrentBag?
My code looks like this:
private ConcurrentBag<Foo> _allFoos; // Initialized in constructor.

public bool LotsOfThreadsAccessingThisMethod(Foo toInsert)
{
    this._allFoos.Add(toInsert);
    return true;
}

public void SingleThreadProcessingLoopAsALongRunningTask(object state)
{
    var token = (CancellationToken)state;
    var workingSet = new List<Foo>();
    while (!token.IsCancellationRequested)
    {
        if (!workingSet.Any())
        {
            workingSet = Interlocked.Exchange(ref this._allFoos, new ConcurrentBag<Foo>()).ToList();
        }
        var processingCount = (int)Math.Min(workingSet.Count, TRANSACTION_LIMIT);
        if (processingCount > 0)
        {
            using (var ctx = new MyEntityFrameworkContext())
            {
                ctx.BulkInsert(workingSet.Take(processingCount));
            }
            workingSet.RemoveRange(0, processingCount);
        }
    }
}
The problem is that this sometimes misses items that are added to the list. I've written a test application that feeds data to my ConcurrentBag.Add method and verified that it is sending all of the data. When I set a breakpoint on the Add call and check the count of the ConcurrentBag after, it's zero. The item just isn't being added.
I'm fairly positive that it's because the Interlocked.Exchange call doesn't use the internal locking mechanism of the ConcurrentBag so it's losing data somewhere in the swap, but I have no knowledge of what's actually happening.
How can I just grab all the items out of the ConcurrentBag at one time without resorting to my own locking mechanism? And why does Add ignore the item?
I think taking all the items from the ConcurrentBag is not needed. You can achieve exactly the same behavior you are trying to implement simply by changing the processing logic as follows (no need for your own synchronization or interlocked swaps):
public void SingleThreadProcessingLoopAsALongRunningTask(object state)
{
    var token = (CancellationToken)state;
    var buffer = new List<Foo>(TRANSACTION_LIMIT);
    while (!token.IsCancellationRequested)
    {
        Foo item;
        if (!this._allFoos.TryTake(out item))
        {
            if (buffer.Count == 0) continue;
        }
        else
        {
            buffer.Add(item);
            if (buffer.Count < TRANSACTION_LIMIT) continue;
        }
        using (var ctx = new MyEntityFrameworkContext())
        {
            ctx.BulkInsert(buffer);
        }
        buffer.Clear();
    }
}

Parallel.For out-of-sync output

I'm kind of new to the parallel programming classes in C# 4.0. I was trying a simple for loop: with the usual for loop I get the numbers 0 to 99 printed in sequential order, but with Parallel.For I'm getting inconsistent output in a random, jumbled-up order.
Code:
using System;
using System.Threading.Tasks;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            Parallel.For(0, 100, i =>
            {
                //object sync = new object();
                //lock (sync)
                {
                    Console.WriteLine("Writing" + i);
                }
            });
            Console.Read();
        }
    }
}
One output on the console:
Writing0
Writing1
Writing2
Writing3
Writing4
Writing5
Writing6
Writing7
Writing8
Writing9
Writing10
Writing11
Writing12
Writing13
Writing14
Writing15
Writing16
Writing17
Writing18
Writing19
Writing20
Writing21
Writing22
Writing23
Writing24
Writing25
Writing26
Writing27
Writing28
Writing29
Writing30
Writing31
Writing32
Writing33
Writing34
Writing35
Writing36
Writing37
Writing38
Writing39
Writing40
Writing41
Writing42
Writing43
Writing44
Writing45
Writing46
Writing47
Writing48
Writing49
Writing50
Writing66
Writing67
Writing68
Writing70
Writing71
Writing72
Writing73
Writing74
Writing75
Writing76
Writing77
Writing78
Writing69
Writing82
Writing83
Writing84
Writing85
Writing86
Writing87
Writing88
Writing89
Writing90
Writing51
Writing52
Writing53
Writing54
Writing55
Writing91
Writing92
Writing93
Writing94
Writing95
Writing56
Writing57
Writing79
Writing80
Writing81
Writing58
Writing59
Writing96
Writing97
Writing98
Writing99
Writing60
Writing61
Writing62
Writing63
Writing64
Writing65
Thanks in advance for whatever help you guys can give me.
That is parallel computing. The tasks are queued up and each available processor gets one; when it finishes, the next queued task is issued to it. There is no guarantee about the order in which the tasks are delivered to the processing units, nor about which one finishes next, so parallelizable code is not the same as sequential code plus the keyword parallel. Algorithms need to be designed to run in parallel. In your simple example all numbers from 0 to 99 are written, but the parallel loop does not write them in the natural order you'd expect.
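If the goal is to keep the work parallel but the output ordered, one minimal sketch (my own, not from the question) is to let each iteration fill its own slot and print sequentially afterwards:

using System;
using System.Threading.Tasks;

class OrderedOutput
{
    static void Main()
    {
        var lines = new string[100];

        // Each iteration writes only its own slot, so no locking is needed.
        Parallel.For(0, 100, i =>
        {
            lines[i] = "Writing" + i; // stand-in for the real per-item work
        });

        // Print sequentially once all the parallel work is done.
        foreach (var line in lines)
            Console.WriteLine(line);
    }
}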
This is to be expected: the way you are doing it, the work is partitioned across the current thread as well as a number of others pulled from the thread pool.
If you want to do the same thing on a different thread but keep the writing sequential, you could try:
using System;
using System.Threading.Tasks;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            var t = Task.Factory.StartNew(() =>
            {
                for (var i = 0; i < 100; i++)
                {
                    //object sync = new object();
                    //lock (sync)
                    {
                        Console.WriteLine("Writing" + i);
                    }
                }
            });
            Console.Read();
        }
    }
}

Design pattern for dynamic C# object

I have a queue that processes objects in a while loop. They are added asynchronously somewhere, like this:
myqueue.pushback(String value);
And they are processed like this:
while (true)
{
    String path = queue.pop();
    if (process(path))
    {
        Console.WriteLine("Good!");
    }
    else
    {
        queue.pushback(path);
    }
}
Now, the thing is that I'd like to modify this to support a TTL-like (time to live) flag, so a file path would be added no more than n times.
How could I do this while keeping the bool process(String path) function signature? I don't want to modify that.
I thought about holding a map or a list that counts how many times the process function has returned false for a path, and dropping the path from the list on the n-th false return. I wonder how this can be done more dynamically; preferably, the TTL would automatically decrement itself on each new addition to the process. I hope I am not talking trash.
Maybe use something like this:
class JobData
{
    public string path;
    public short ttl;

    // Converting a JobData to a String yields the path and burns one TTL unit.
    public static implicit operator String(JobData jobData) { jobData.ttl--; return jobData.path; }
}
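A hypothetical usage sketch (it assumes the queue is changed to hold JobData; the implicit conversion keeps the existing bool process(String) signature working, and each conversion costs one TTL unit):

var job = new JobData { path = "some/file/path", ttl = 3 };

// The implicit operator fires when job is passed as a String:
// process receives job.path, and job.ttl drops from 3 to 2.
if (!process(job) && job.ttl > 0)
{
    queue.pushback(job); // re-queue until the TTL is used up
}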
I like the idea of a JobData class, but there's already an answer demonstrating that, and the fact that you're working with file paths gives you another possible advantage. Certain characters are not valid in file paths, so you could choose one to use as a delimiter. The advantage here is that the queue type remains a string, and so you would not have to modify any of your existing asynchronous code. You can see a list of reserved path characters here:
http://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
For our purposes, I'll use the percent (%) character. Then you can modify your code as follows, and nothing else needs to change:
const int startingTTL = 100;
const string delimiter = "%";

while (true)
{
    String[] path = queue.pop().Split(delimiter.ToCharArray());
    int ttl = path.Length > 1 ? int.Parse(path[1]) - 1 : startingTTL;
    if (process(path[0]))
    {
        Console.WriteLine("Good!");
    }
    else if (ttl > 0)
    {
        queue.pushback(string.Format("{0}{1}{2}", path[0], delimiter, ttl));
    }
    else
    {
        Console.WriteLine("TTL expired for path: {0}", path[0]);
    }
}
Again, from a pure architecture standpoint, a class with two properties is a better design... but from a practical standpoint, YAGNI: this option means you can avoid going back and changing other asynchronous code that pushes into the queue. That code still only needs to know about the strings and will work with this unmodified.
One more thing: I want to point out that this is a fairly tight loop, prone to running away with a CPU core. Additionally, if this is the .NET Queue type and your tight loop gets ahead of your asynchronous producers and empties the queue, you'll throw an exception, which would break out of the while(true) block. You can solve both issues with code like this:
while (true)
{
    try
    {
        String[] path = queue.pop().Split(delimiter.ToCharArray());
        int ttl = path.Length > 1 ? int.Parse(path[1]) - 1 : startingTTL;
        if (process(path[0]))
        {
            Console.WriteLine("Good!");
        }
        else if (ttl > 0)
        {
            queue.pushback(string.Format("{0}{1}{2}", path[0], delimiter, ttl));
        }
        else
        {
            Console.WriteLine("TTL expired for path: {0}", path[0]);
        }
    }
    catch (InvalidOperationException ex)
    {
        // Queue.Dequeue throws InvalidOperationException if the queue is empty... sleep for a bit before trying again
        Thread.Sleep(100);
    }
}
If the constraint is that bool process(String path) cannot be touched/changed, then put the functionality into myqueue. You can keep its public signatures of void pushback(string path) and string pop(), but internally you can track the TTL. You can either wrap the string paths in a JobData-like class that gets added to the internal queue, or you can keep a secondary Dictionary keyed by path. It could even be something as simple as saving the last popped path: if the subsequent push is the same path, you can assume it was a rejected/failed item. In your pop method you can even discard a path that has been rejected too many times and internally fetch the next path, so the calling code is blissfully unaware of the issue. A minimal sketch of the dictionary idea follows.
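The class and constant names here are hypothetical, and the sketch is not synchronized; since the original code adds paths asynchronously, real code would need a lock or a concurrent collection:

using System.Collections.Generic;

public class TtlQueue
{
    private const int MaxAttempts = 5; // the "n" from the question

    private readonly Queue<string> _queue = new Queue<string>();
    private readonly Dictionary<string, int> _attempts = new Dictionary<string, int>();

    public void pushback(string path)
    {
        int count;
        _attempts.TryGetValue(path, out count);

        // Drop the path silently once it has been pushed n times.
        if (count < MaxAttempts)
        {
            _attempts[path] = count + 1;
            _queue.Enqueue(path);
        }
    }

    public string pop()
    {
        return _queue.Dequeue();
    }
}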
You could abstract/encapsulate the functionality of the "job manager". Hide the queue and implementation from the caller so you can do whatever you want without the callers caring. Something like this:
public static class JobManager
{
    private const short DEFAULT_TTL = 100; // default retry budget; pick whatever fits

    private static Queue<JobData> _queue = new Queue<JobData>();

    static JobManager() { Task.Factory.StartNew(() => { StartProcessing(); }); }

    public static void AddJob(string value)
    {
        //TODO: validate
        _queue.Enqueue(new JobData(value));
    }

    private static void StartProcessing()
    {
        while (true)
        {
            if (_queue.Count > 0)
            {
                JobData data = _queue.Dequeue();
                if (!process(data.Path))
                {
                    data.TTL--;
                    if (data.TTL > 0)
                        _queue.Enqueue(data);
                }
            }
            else
            {
                Thread.Sleep(1000);
            }
        }
    }

    private class JobData
    {
        public string Path { get; set; }
        public short TTL { get; set; }

        public JobData(string value)
        {
            this.Path = value;
            this.TTL = DEFAULT_TTL;
        }
    }
}
Then your processing loop can handle the TTL value.
Edit: added a simple processing loop. This code isn't thread safe, but it should hopefully give you an idea.
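For what it's worth, a minimal sketch of a thread-safe variant of those two members, swapping the plain Queue for a ConcurrentQueue (the rest of the class stays as above):

private static readonly ConcurrentQueue<JobData> _queue = new ConcurrentQueue<JobData>(); // requires using System.Collections.Concurrent;

private static void StartProcessing()
{
    while (true)
    {
        JobData data;
        if (_queue.TryDequeue(out data)) // atomic take; safe against concurrent AddJob calls
        {
            if (!process(data.Path))
            {
                data.TTL--;
                if (data.TTL > 0)
                    _queue.Enqueue(data);
            }
        }
        else
        {
            Thread.Sleep(1000); // nothing to do; avoid spinning a core
        }
    }
}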

Disruptor example with 1 publisher and 4 parallel consumers

In this example https://stackoverflow.com/a/9980346/93647 and here Why is my disruptor example so slow? (at the end of the question) there is 1 publisher which publishes items and 1 consumer.
But in my case the consumer's work is much more complicated and takes some time, so I want 4 consumers that process the data in parallel.
So, for example, if the producer produces the numbers 1,2,3,4,5,6,7,8,9,10,11...
I want consumer1 to catch 1,5,9,..., consumer2 to catch 2,6,10,..., consumer3 to catch 3,7,11,..., and consumer4 to catch 4,8,12,... (well, not exactly these numbers; the idea is that the data should be processed in parallel, and I don't care which particular number is processed by which consumer).
And remember, this needs to be done in parallel, because in the real application the consumer work is pretty expensive. I expect the consumers to execute on different threads to use the power of multicore systems.
Of course I can just create 4 ring buffers and attach 1 consumer to each ring buffer; that way I can use the original example. But I feel it wouldn't be correct. It seems correct to create 1 publisher (1 ring buffer) and 4 consumers, as this is what I need.
Adding a link to a very similar question in the Google Groups: https://groups.google.com/forum/#!msg/lmax-disruptor/-CLapWuwWLU/GHEP4UkxrAEJ
So we have two options:
one ring, many consumers (each consumer will "wake up" on every addition; all consumers should have the same WaitStrategy)
many "one ring, one consumer" pairs (each consumer will wake up only on the data it should process; each consumer can have its own WaitStrategy).
EDIT: I forgot to mention the code is partially taken from the FAQ. I have no idea if this approach is better or worse than Frank's suggestion.
The project is severely under-documented; that's a shame, as it looks nice.
Anyway, try the following snippet (based on your first link) - tested on Mono and it seems to be OK:
using System;
using System.Threading.Tasks;
using Disruptor;
using Disruptor.Dsl;

namespace DisruptorTest
{
    public sealed class ValueEntry
    {
        public long Value { get; set; }
    }

    public class MyHandler : IEventHandler<ValueEntry>
    {
        private static int _consumers = 0;
        private readonly int _ordinal;

        public MyHandler()
        {
            this._ordinal = _consumers++;
        }

        public void OnNext(ValueEntry data, long sequence, bool endOfBatch)
        {
            if ((sequence % _consumers) == _ordinal)
                Console.WriteLine("Event handled: Value = {0}, event {1} processed by {2}", data.Value, sequence, _ordinal);
            else
                Console.WriteLine("Event {0} rejected by {1}", sequence, _ordinal);
        }
    }

    class Program
    {
        private static readonly Random _random = new Random();
        private const int SIZE = 16; // Must be a power of 2
        private const int WORKERS = 4;

        static void Main()
        {
            var disruptor = new Disruptor.Dsl.Disruptor<ValueEntry>(() => new ValueEntry(), SIZE, TaskScheduler.Default);
            for (int i = 0; i < WORKERS; i++)
                disruptor.HandleEventsWith(new MyHandler());
            var ringBuffer = disruptor.Start();
            while (true)
            {
                long sequenceNo = ringBuffer.Next();
                ringBuffer[sequenceNo].Value = _random.Next();
                ringBuffer.Publish(sequenceNo);
                Console.WriteLine("Published entry {0}, value {1}", sequenceNo, ringBuffer[sequenceNo].Value);
                Console.ReadKey();
            }
        }
    }
}
From the specs of the ring buffer you will see that every consumer will try to process your ValueEntry; in your case you don't need that.
I solved it like this:
Add a field processed to your ValueEntry, and when a consumer takes an event it tests that field; if the event is already processed, it moves on to the next one (see the sketch below).
Not the prettiest way, but it's how the buffer works.
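A minimal sketch of that flag, reusing the ValueEntry/MyHandler shapes from the snippet above (it also needs using System.Threading for Interlocked). Using Interlocked.CompareExchange rather than a plain bool test makes the claim atomic, so exactly one handler wins each slot:

public sealed class ValueEntry
{
    public long Value { get; set; }
    public int Claimed; // 0 = unclaimed, 1 = claimed; the publisher must reset this to 0 before publishing
}

public class MyHandler : IEventHandler<ValueEntry>
{
    public void OnNext(ValueEntry data, long sequence, bool endOfBatch)
    {
        // Atomically flip Claimed from 0 to 1; only the first handler to get here succeeds.
        if (Interlocked.CompareExchange(ref data.Claimed, 1, 0) == 0)
        {
            Console.WriteLine("Event {0} processed, value {1}", sequence, data.Value);
        }
        // The other handlers see Claimed == 1 and simply skip this event.
    }
}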

How to unit test Thread Safe Generic List in C# using NUnit?

I asked a question about building a custom thread-safe generic list, and now I am trying to unit test it, but I have absolutely no idea how to do that. Since the lock happens inside the ThreadSafeList class, I am not sure how to make the list lock for a period of time while I try to mimic multiple Add calls. Thanks.
Can_add_one_item_at_a_time
[Test]
public void Can_add_one_item_at_a_time() // this test won't pass
{
    // I am not sure how to do this test...
    var list = new ThreadSafeList<string>();

    // somehow need to call lock and sleep inside the list instance
    // say somehow the list locks for 1 sec
    var ta = new Thread(x => list.Add("a"));
    ta.Start(); // does it need to abort, say, before 1 sec if locked?

    var tb = new Thread(x => list.Add("b"));
    tb.Start(); // does it need to abort, say, before 1 sec if locked?

    // it involves using GetSnapshot()
    // which is a bad idea for unit testing, I think
    var snapshot = list.GetSnapshot();
    Assert.IsFalse(snapshot.Contains("a"), "Should not contain a.");
    Assert.IsFalse(snapshot.Contains("b"), "Should not contain b.");
}
Snapshot_should_be_point_of_time_only
[Test]
public void Snapshot_should_be_point_of_time_only()
{
    var list = new ThreadSafeList<string>();
    var ta = new Thread(x => list.Add("a"));
    ta.Start();
    ta.Join();

    var snapshot = list.GetSnapshot();

    var tb = new Thread(x => list.Add("b"));
    tb.Start();
    var tc = new Thread(x => list.Add("c"));
    tc.Start();
    tb.Join();
    tc.Join();

    Assert.IsTrue(snapshot.Count == 1, "Snapshot should only contain 1 item.");
    Assert.IsFalse(snapshot.Contains("b"), "Should not contain b.");
    Assert.IsFalse(snapshot.Contains("c"), "Should not contain c.");
}
Instance method
public ThreadSafeList<T> Instance<T>()
{
    return new ThreadSafeList<T>();
}
Let's look at your first test, Can_add_one_item_at_a_time.
First of all, your exit conditions don't make sense. Both items should be added, just one at a time. So of course your test will fail.
You also don't need to make a snapshot; remember, this is a test, nothing else is going to be touching the list while your test is running.
Last but not least, you need to make sure that you aren't trying to evaluate your exit conditions until all of the threads have actually finished. Simplest way is to use a counter and a wait event. Here's an example:
[Test]
public void Can_add_from_multiple_threads()
{
    const int MaxWorkers = 10;
    var list = new ThreadSafeList<int>(MaxWorkers);
    int remainingWorkers = MaxWorkers;
    var workCompletedEvent = new ManualResetEvent(false);
    for (int i = 0; i < MaxWorkers; i++)
    {
        int workerNum = i; // Make a copy of the local variable for the next thread
        ThreadPool.QueueUserWorkItem(s =>
        {
            list.Add(workerNum);
            if (Interlocked.Decrement(ref remainingWorkers) == 0)
                workCompletedEvent.Set();
        });
    }
    workCompletedEvent.WaitOne();
    workCompletedEvent.Close();
    for (int i = 0; i < MaxWorkers; i++)
    {
        Assert.IsTrue(list.Contains(i), "Element was not added");
    }
    Assert.AreEqual(MaxWorkers, list.Count,
        "List count does not match worker count.");
}
Now this does carry the possibility that the Add happens so quickly that no two threads will ever attempt to do it at the same time. No Refunds No Returns partially explained how to insert a conditional delay. I would actually define a special testing flag instead of DEBUG: in your build configuration, add a flag called TEST, then add this to your ThreadSafeList class:
public class ThreadSafeList<T>
{
    // snip fields

    public void Add(T item)
    {
        lock (sync)
        {
            TestUtil.WaitStandardThreadDelay();
            innerList.Add(item);
        }
    }

    // snip other methods/properties
}

static class TestUtil
{
    [Conditional("TEST")]
    public static void WaitStandardThreadDelay()
    {
        Thread.Sleep(1000);
    }
}
This will cause the Add method to wait 1 second before actually adding the item, as long as the build configuration defines the TEST flag. Since the lock serializes the ten workers at one second each, the entire test should take at least 10 seconds; if it finishes any faster than that, something's wrong.
With that in mind, I'll leave the second test up to you. It's similar.
You will need to insert some TEST-only code that adds a delay inside your lock. You can create a function like this:
[Conditional("DEBUG")]
void SleepForABit(int delay) { Thread.Sleep(delay); }
and then call it in your class. The Conditional attribute ensures it is only called in DEBUG builds, and you can leave it in your compiled code.
Write something that consistently delays 100 ms or so and something that never waits, and let them slug it out.
You might want to take a look at CHESS. It's a tool specifically designed to find race conditions in multi-threaded code.
