Limiting the number of API requests started per second using Parallel.ForEach - C#

I am working on improving some of my code to increase efficiency. In the original code I was limiting the number of threads to 5, and if I already had 5 active threads I would wait until one finished before starting another one. Now I want to modify this code to allow any number of threads, but I want to make sure that only 5 threads get started every second. For example:
Second 0 - 5 new threads
Second 1 - 5 new threads
Second 2 - 5 new threads ...
Original Code (cleanseDictionary contains usually thousands of items):
ConcurrentDictionary<long, APIResponse> cleanseDictionary = new ConcurrentDictionary<long, APIResponse>();
ConcurrentBag<int> itemsinsec = new ConcurrentBag<int>();
ConcurrentDictionary<long, string> resourceDictionary = new ConcurrentDictionary<long, string>();
DateTime start = DateTime.Now;
Parallel.ForEach(resourceDictionary, new ParallelOptions { MaxDegreeOfParallelism = 5 }, row =>
{
    lock (itemsinsec)
    {
        ThrottleAPIRequests(itemsinsec, start);
        itemsinsec.Add(1);
    }
    cleanseDictionary.TryAdd(row.Key, _helper.MakeAPIRequest(string.Format("/endpoint?{0}", row.Value)));
});

private static void ThrottleAPIRequests(ConcurrentBag<int> itemsinsec, DateTime start)
{
    if ((start - DateTime.Now).Milliseconds < 10001 && itemsinsec.Count > 4)
    {
        System.Threading.Thread.Sleep(1000 - (start - DateTime.Now).Milliseconds);
        start = DateTime.Now;
        itemsinsec = new ConcurrentBag<int>();
    }
}
My first thought was to increase the MaxDegreeOfParallelism to something much higher and then have a helper method that limits only 5 threads in a second, but I am not sure if that is the best way to do it, and if it is, I would probably need a lock around that step?
Thanks in advance!
EDIT
I am actually looking for a way to throttle the API requests rather than the actual threads. I was thinking they were one and the same.
Edit 2: My requirements are to send over 5 API requests every second

"Parallel.ForEach" from the MS website
may run in parallel
If you want any degree of fine control over how the threads are managed, this is not the way.
How about creating your own helper class where you can queue jobs with a group id, allows you to wait for all jobs of group id X to complete, and it spawns extra threads as and when required?
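For illustration, a minimal sketch of such a helper, assuming thread-pool tasks count as "spawning extra threads as required" (the class name and API here are made up, not from any library):
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical helper: queue jobs under a group id, then wait for a group to drain.
public class GroupedJobQueue
{
    private readonly ConcurrentDictionary<int, ConcurrentBag<Task>> _groups =
        new ConcurrentDictionary<int, ConcurrentBag<Task>>();

    public void Enqueue(int groupId, Action job)
    {
        var tasks = _groups.GetOrAdd(groupId, _ => new ConcurrentBag<Task>());
        tasks.Add(Task.Run(job)); // the thread pool grows as needed
    }

    public void WaitForGroup(int groupId)
    {
        // Blocks until every job queued so far under this group id has finished.
        if (_groups.TryGetValue(groupId, out var tasks))
            Task.WaitAll(tasks.ToArray());
    }
}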

For me the best solution is:
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

namespace SomeNamespace
{
    // Minimal interface assumed here so the class compiles; it was not shown in the original answer.
    public interface IRequestLimiter
    {
        TResult Run<TResult>(int requestsOnSecond, Func<TResult> function);
    }

    public class RequestLimiter : IRequestLimiter
    {
        private readonly ConcurrentQueue<DateTime> _requestTimes;
        private readonly TimeSpan _timeSpan;
        private readonly object _locker = new object();

        public RequestLimiter()
        {
            _timeSpan = TimeSpan.FromSeconds(1);
            _requestTimes = new ConcurrentQueue<DateTime>();
        }

        public TResult Run<TResult>(int requestsOnSecond, Func<TResult> function)
        {
            WaitUntilRequestCanBeMade(requestsOnSecond).Wait();
            return function();
        }

        private Task WaitUntilRequestCanBeMade(int requestsOnSecond)
        {
            return Task.Factory.StartNew(() =>
            {
                // Busy-wait until a slot in the current one-second window frees up.
                while (!TryEnqueueRequest(requestsOnSecond).Result) ;
            });
        }

        private Task SynchronizeQueue()
        {
            return Task.Factory.StartNew(() =>
            {
                // Drop timestamps that have fallen out of the window.
                _requestTimes.TryPeek(out var first);
                while (_requestTimes.Count > 0 && (first.Add(_timeSpan) < DateTime.UtcNow))
                    _requestTimes.TryDequeue(out _);
            });
        }

        private Task<bool> TryEnqueueRequest(int requestsOnSecond)
        {
            lock (_locker)
            {
                SynchronizeQueue().Wait();
                if (_requestTimes.Count < requestsOnSecond)
                {
                    _requestTimes.Enqueue(DateTime.UtcNow);
                    return Task.FromResult(true);
                }
                return Task.FromResult(false);
            }
        }
    }
}
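A usage sketch against the question's own loop (here _helper, resourceDictionary and cleanseDictionary are the question's objects, not part of the limiter):
var limiter = new RequestLimiter();
Parallel.ForEach(resourceDictionary, row =>
{
    // Run() blocks until fewer than 5 requests were started in the last second.
    var response = limiter.Run(5, () =>
        _helper.MakeAPIRequest(string.Format("/endpoint?{0}", row.Value)));
    cleanseDictionary.TryAdd(row.Key, response);
});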

I want to be able to send over 5 API requests every second
That's really easy:
while (true) {
    await Task.Delay(TimeSpan.FromSeconds(1));
    await Task.WhenAll(Enumerable.Range(0, 5).Select(_ => RunRequestAsync()));
}
Maybe not the best approach, since the requests go out in a burst at the start of each second rather than continuously. There is also timing skew: each iteration takes a bit more than one second, because the delay doesn't account for the time the requests themselves take. This can be solved with a few lines of time logic.
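For example, a sketch of that time logic with a Stopwatch, subtracting the batch's own duration from the delay (this still fires in bursts; it only removes the drift):
var stopwatch = new System.Diagnostics.Stopwatch();
while (true) {
    stopwatch.Restart();
    await Task.WhenAll(Enumerable.Range(0, 5).Select(_ => RunRequestAsync()));
    // Only wait out whatever remains of the current second, if anything.
    var remaining = TimeSpan.FromSeconds(1) - stopwatch.Elapsed;
    if (remaining > TimeSpan.Zero)
        await Task.Delay(remaining);
}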

Related

How can I implement a concurrent execution queue class?

I need to have a class that will execute actions in thread pool, but these actions should be queued. For example:
method 1
method 2
method 3
When someone calls method 1 from his thread, he can also call method 2 or method 3, and all 3 methods can be performed concurrently; but when another call comes in from a user for method 1, 2, or 3, the thread pool should block that call until the earlier execution of the same method has completed.
Should I use channels?
To "should I use channels?", the answer is yes, but there are other options available too.
Dataflow
.NET already offers this feature through the TPL Dataflow classes. You can use an ActionBlock class to pass messages (i.e. data) to a worker method that executes in the background with guaranteed order and a configurable degree of parallelism. Channels are a newer feature that does essentially the same job.
What you describe is actually the simplest way of using an ActionBlock - just post data messages to it and have it process them one by one:
void Method1(MyDataObject1 data) { ... }

var block = new ActionBlock<MyDataObject1>(Method1);

// Start sending data to the block
foreach (var msg in someListOfItems)
{
    block.Post(msg);
}
By default, an ActionBlock has an infinite input queue. It will use only one task to process messages asynchronously, in the order they are posted.
When you're done with it, you can call Complete() and await its Completion task for all remaining items to finish processing:
block.Complete();
await block.Completion;
To handle different methods, you can simply use multiple blocks, e.g.:
var block1 = new ActionBlock<MyDataObject1>(Method1);
var block2 = new ActionBlock<MyDataObject1>(Method2);
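If strict one-at-a-time processing isn't needed, the degree of parallelism and the input queue size are configurable through ExecutionDataflowBlockOptions, e.g.:
var options = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 4, // process up to 4 messages concurrently
    BoundedCapacity = 1000      // cap the input queue; SendAsync() waits while it's full
};
var boundedBlock = new ActionBlock<MyDataObject1>(Method1, options);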
Channels
Channels are a lower-level feature than blocks. This means you have to write more code, but you get far better control over how the "processing blocks" work. In fact, you could probably rewrite the TPL Dataflow library using channels.
You could create a processing block similar to an ActionBlock with the following (a bit naive) method:
ChannelWriter<TIn> Work<TIn>(Action<TIn> action)
{
    var channel = Channel.CreateUnbounded<TIn>();
    var workerTask = Task.Run(async () =>
    {
        await foreach (var msg in channel.Reader.ReadAllAsync())
        {
            action(msg);
        }
    });
    var writer = channel.Writer;
    return writer;
}
This method creates a channel and runs a task in the background to read data asynchronously and process it. I'm cheating "a bit" here by using await foreach and ChannelReader.ReadAllAsync(), which are available in C# 8 and .NET Core 3.0.
This method can be used like a block:
ChannelWriter<MyDataObject1> writer1 = Work<MyDataObject1>(Method1);
foreach (var msg in someListOfItems)
{
    writer1.WriteAsync(msg);
}
writer1.Complete();
There's a lot more to Channels though. SignalR for example uses them to allow streaming of notifications to the clients.
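As an aside, if C# 8's await foreach isn't available, the reader loop inside Work can be written with the older WaitToReadAsync/TryRead pair, which should be roughly equivalent:
var workerTask = Task.Run(async () =>
{
    // Wait until data is available (or the channel completes),
    // then drain everything that is immediately readable.
    while (await channel.Reader.WaitToReadAsync())
    {
        while (channel.Reader.TryRead(out var msg))
        {
            action(msg);
        }
    }
});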
Here is my suggestion. For each synchronous method, an asynchronous method should be added. For example the method FireTheGun is synchronous:
private static void FireTheGun(int bulletsCount)
{
    var ratata = Enumerable.Repeat("Ta", bulletsCount).Prepend("Ra");
    Console.WriteLine(String.Join("-", ratata));
}
The asynchronous counterpart FireTheGunAsync is very simple, because the complexity of queuing the synchronous action is delegated to a helper method QueueAsync.
public static Task FireTheGunAsync(int bulletsCount)
{
    return QueueAsync(FireTheGun, bulletsCount);
}
Here is the implementation of QueueAsync. Each action has its dedicated SemaphoreSlim, to prevent multiple concurrent executions:
private static ConcurrentDictionary<MethodInfo, SemaphoreSlim> semaphores =
    new ConcurrentDictionary<MethodInfo, SemaphoreSlim>();

public static Task QueueAsync<T1>(Action<T1> action, T1 param1)
{
    return Task.Run(async () =>
    {
        var semaphore = semaphores
            .GetOrAdd(action.Method, key => new SemaphoreSlim(1));
        await semaphore.WaitAsync();
        try
        {
            action(param1);
        }
        finally
        {
            semaphore.Release();
        }
    });
}
Usage example:
FireTheGunAsync(5);
FireTheGunAsync(8);
Output:
Ra-Ta-Ta-Ta-Ta-Ta
Ra-Ta-Ta-Ta-Ta-Ta-Ta-Ta-Ta
Implementing versions of QueueAsync with different number of parameters should be trivial.
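For example, a two-parameter overload would presumably look like this (mirroring the one-parameter implementation above):
public static Task QueueAsync<T1, T2>(Action<T1, T2> action, T1 param1, T2 param2)
{
    return Task.Run(async () =>
    {
        var semaphore = semaphores
            .GetOrAdd(action.Method, key => new SemaphoreSlim(1));
        await semaphore.WaitAsync();
        try
        {
            action(param1, param2);
        }
        finally
        {
            semaphore.Release();
        }
    });
}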
Update: My previous implementation of QueueAsync has the probably undesirable behavior of executing the actions in random order. This happens because the second task may be the first one to acquire the semaphore. Below is an implementation that guarantees the correct order of execution. The performance could be bad in case of high contention, because each task enters a loop until it takes the semaphore in the right order.
private class QueueInfo
{
    public SemaphoreSlim Semaphore = new SemaphoreSlim(1);
    public int TicketToRide = 0;
    public int Current = 0;
}

private static ConcurrentDictionary<MethodInfo, QueueInfo> queues =
    new ConcurrentDictionary<MethodInfo, QueueInfo>();

public static Task QueueAsync<T1>(Action<T1> action, T1 param1)
{
    var queue = queues.GetOrAdd(action.Method, key => new QueueInfo());
    var ticket = Interlocked.Increment(ref queue.TicketToRide);
    return Task.Run(async () =>
    {
        while (true) // Loop until our ticket becomes current
        {
            await queue.Semaphore.WaitAsync();
            try
            {
                if (Interlocked.CompareExchange(ref queue.Current,
                    ticket, ticket - 1) == ticket - 1)
                {
                    action(param1);
                    break;
                }
            }
            finally
            {
                queue.Semaphore.Release();
            }
        }
    });
}
What about this solution?
public class ConcurrentQueue
{
    private Dictionary<byte, PoolFiber> Actionsfiber;

    public ConcurrentQueue()
    {
        Actionsfiber = new Dictionary<byte, PoolFiber>()
        {
            { 1, new PoolFiber() },
            { 2, new PoolFiber() },
            { 3, new PoolFiber() },
        };
        foreach (var fiber in Actionsfiber.Values)
        {
            fiber.Start();
        }
    }

    public void ExecuteAction(Action action, byte code)
    {
        if (Actionsfiber.ContainsKey(code))
            Actionsfiber[code].Enqueue(() => { action.Invoke(); });
        else
            Console.WriteLine("invalid byte code");
    }
}
public static void SomeAction1()
{
    Console.WriteLine($"{DateTime.Now} Action 1 is working");
    for (long i = 0; i < 2400000000; i++)
    {
    }
    Console.WriteLine($"{DateTime.Now} Action 1 stopped");
}

public static void SomeAction2()
{
    Console.WriteLine($"{DateTime.Now} Action 2 is working");
    for (long i = 0; i < 5000000000; i++)
    {
    }
    Console.WriteLine($"{DateTime.Now} Action 2 stopped");
}

public static void SomeAction3()
{
    Console.WriteLine($"{DateTime.Now} Action 3 is working");
    for (long i = 0; i < 5000000000; i++)
    {
    }
    Console.WriteLine($"{DateTime.Now} Action 3 stopped");
}

public static void Main(string[] args)
{
    ConcurrentQueue concurrentQueue = new ConcurrentQueue();
    concurrentQueue.ExecuteAction(SomeAction1, 1);
    concurrentQueue.ExecuteAction(SomeAction2, 2);
    concurrentQueue.ExecuteAction(SomeAction3, 3);
    concurrentQueue.ExecuteAction(SomeAction1, 1);
    concurrentQueue.ExecuteAction(SomeAction2, 2);
    concurrentQueue.ExecuteAction(SomeAction3, 3);
    Console.WriteLine("press any key to exit the program");
    Console.ReadKey();
}
The output:
8/5/2019 7:56:57 AM Action 1 is working
8/5/2019 7:56:57 AM Action 3 is working
8/5/2019 7:56:57 AM Action 2 is working
8/5/2019 7:57:08 AM Action 1 stopped
8/5/2019 7:57:08 AM Action 1 is working
8/5/2019 7:57:15 AM Action 2 stopped
8/5/2019 7:57:15 AM Action 2 is working
8/5/2019 7:57:16 AM Action 3 stopped
8/5/2019 7:57:16 AM Action 3 is working
8/5/2019 7:57:18 AM Action 1 stopped
8/5/2019 7:57:33 AM Action 2 stopped
8/5/2019 7:57:33 AM Action 3 stopped
PoolFiber is a class from the ExitGames.Concurrency.Fibers namespace.
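If taking the ExitGames dependency is not an option, a similar shape can be sketched with the ActionBlock discussed earlier on this page: one single-threaded block per code keeps each action type serialized while different codes run concurrently. The class name below is my own illustrative choice, not a drop-in PoolFiber replacement:
using System;
using System.Collections.Generic;
using System.Threading.Tasks.Dataflow;

public class ActionBlockQueue
{
    private readonly Dictionary<byte, ActionBlock<Action>> _blocks;

    public ActionBlockQueue()
    {
        // One block per code; each block processes its actions one at a time.
        _blocks = new Dictionary<byte, ActionBlock<Action>>
        {
            { 1, new ActionBlock<Action>(a => a()) },
            { 2, new ActionBlock<Action>(a => a()) },
            { 3, new ActionBlock<Action>(a => a()) },
        };
    }

    public void ExecuteAction(Action action, byte code)
    {
        if (_blocks.TryGetValue(code, out var block))
            block.Post(action);
        else
            Console.WriteLine("invalid byte code");
    }
}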
More info: How To Avoid Race Conditions And Other Multithreading Issues?

Unit testing of thread synchronization (lock)

I'm working on a multithreaded app, and there is a part of the code which should be run by only one thread at a time. Nothing complicated. I use lock to synchronize it. It's working in the live system, but I would like to write a unit test to check that only one thread is in the critical section. I wrote one; it was working, but then it stops :)
I can't figure out how to write the test in a proper way. I use NSubstitute to create mocks.
Class to test:
public interface IMultiThreadClass
{
    void Go();
}

public class Lock02 : IMultiThreadClass
{
    private readonly IProcessor _processor;
    private readonly string _threadName;
    private static readonly object Locker = new Object();

    public Lock02(IProcessor processor, string threadName)
    {
        _processor = processor;
        _threadName = threadName;
    }

    public void Go()
    {
        //critical section
        lock (Locker)
        {
            _processor.Process(_threadName);
        }
    }
}
Test:
private int _counter; // field assumed by the test; incremented by the mocked processor

[TestMethod()]
public void Run_Test()
{
    // Only one thread should run Processor.Process, but we allow max 2 threads to catch locking errors
    SemaphoreSlim semaphore = new SemaphoreSlim(1, 2);
    // Semaphore to synchronize the asserts
    SemaphoreSlim synchroSemaphore = new SemaphoreSlim(0, 1);

    IProcessor processor = Substitute.For<IProcessor>();
    processor.When(x => x.Process(Arg.Any<string>())).Do(y =>
    {
        // Increment the counter to check if the method was called
        Interlocked.Increment(ref _counter);
        // Release the synchro semaphore
        synchroSemaphore.Release();
        // Stop the thread and wait for release
        semaphore.Wait();
    });

    Lock02 locker1 = new Lock02(processor, "1");
    Lock02 locker2 = new Lock02(processor, "2");
    Lock02 locker3 = new Lock02(processor, "3");

    Task.Run(() => locker1.Go());
    Task.Run(() => locker2.Go());
    Task.Run(() => locker3.Go());

    // ASSERT
    synchroSemaphore.Wait();
    Assert.AreEqual(1, _counter);
    semaphore.Release(1);

    synchroSemaphore.Wait();
    Assert.AreEqual(2, _counter);
    semaphore.Release(1);

    synchroSemaphore.Wait();
    Assert.AreEqual(3, _counter);
    semaphore.Release(1);
}
A possible (simple, but not bulletproof) way is to spawn some threads/tasks in the unit test, each fetching and temporarily storing an int variable (possibly static), waiting a bit (delay), incrementing the value, and writing it back to the variable. Without thread synchronization (lock), many if not all threads will grab the same number, and it will not be equal (as it should be) to the number of threads/tasks.
This is not bulletproof, since there is still a race condition making it not reproducible (the smelly code is the 50 ms delay), although it seems (to me) very unlikely for all threads to wait for each other in the perfect way and produce the right result.
I consider this a smelly workaround, but it is simple and works.
[TestMethod]
public async Task APossibleTest()
{
    int importantNumber = 0;

    var proc = Substitute.For<IProcessor>();
    proc.WhenForAnyArgs(processor => processor.Process(Arg.Any<string>()))
        .Do(callInfo =>
        {
            int cached = importantNumber;
            // Wait for other threads to fetch the number too (if they were not synchronized).
            Thread.Sleep(TimeSpan.FromMilliseconds(50));
            // This kind of incrementation will check the thread synchronization.
            // Using a thread-safe Interlocked or other here does not make sense.
            importantNumber = cached + 1;
        });

    var locker = new Lock02(proc, "da horror");

    // Create 10 tasks all attempting to increment the important number.
    Task[] tasks =
        Enumerable
            .Range(0, 10)
            // You could create multiple lockers here (with their own processors).
            .Select(i => Task.Run(() => locker.Go()))
            .ToArray();

    await Task.WhenAll(tasks);

    Assert.AreEqual(10, importantNumber, "Exactly 10 increments were expected since we have 10 tasks.");
}

Odd behavior with yield and Parallel.ForEach

At work one of our processes uses a SQL database table as a queue. I've been designing a queue reader to check the table for queued work, update the row status when work starts, and delete the row when the work is finished. I'm using Parallel.Foreach to give each process its own thread and setting MaxDegreeOfParallelism to 4.
When the queue reader starts up it checks for any unfinished work and loads the work into a list, then it does a Concat with that list and a method that returns an IEnumerable which runs in an infinite loop checking for new work to do. The idea is that the unfinished work should be processed first, and then the new work can be picked up as threads become available. However, what I'm seeing is that FetchQueuedWork will change dozens of rows in the queue table to 'Processing' immediately, but only work on a few items at a time.
What I expected to happen was that FetchQueuedWork would only get new work and update the table when a slot opened up in the Parallel.Foreach. What's really odd to me is that it behaves exactly as I would expect when I run the code in my local developer environment, but in production I get the above problem.
I'm using .Net 4. Here is the code:
public void Go()
{
    List<WorkData> unfinishedWork = WorkData.LoadUnfinishedWork();
    IEnumerable<WorkData> work = unfinishedWork.Concat(FetchQueuedWork());
    Parallel.ForEach(work, new ParallelOptions { MaxDegreeOfParallelism = 4 }, DoWork);
}

private IEnumerable<WorkData> FetchQueuedWork()
{
    while (true)
    {
        var workUnit = WorkData.GetQueuedWorkAndSetStatusToProcessing();
        yield return workUnit;
    }
}

private void DoWork(WorkData workUnit)
{
    if (!workUnit.Loaded)
    {
        System.Threading.Thread.Sleep(5000);
        return;
    }
    Work();
}
I suspect that the default (Release mode?) behaviour is to buffer the input. You might need to create your own partitioner and pass it the NoBuffering option:
List<WorkData> unfinishedWork = WorkData.LoadUnfinishedWork();
IEnumerable<WorkData> work = unfinishedWork.Concat(FetchQueuedWork());
var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };
var partitioner = Partitioner.Create(work, EnumerablePartitionerOptions.NoBuffering);
Parallel.ForEach(partitioner, options, DoWork);
Blorgbeard's solution is correct when it comes to .NET 4.5 - hands down.
If you are constrained to .NET 4, you have a few options:
Replace your Parallel.ForEach with work.AsParallel().WithDegreeOfParallelism(4).ForAll(DoWork). PLINQ is more conservative when it comes to buffering items, so this should do the trick.
Write your own enumerable partitioner (good luck).
Create a grotty semaphore-based hack such as this:
(Side-effecting Select used for the sake of brevity)
public void Go()
{
    const int MAX_DEGREE_PARALLELISM = 4;

    using (var semaphore = new SemaphoreSlim(MAX_DEGREE_PARALLELISM, MAX_DEGREE_PARALLELISM))
    {
        List<WorkData> unfinishedWork = WorkData.LoadUnfinishedWork();
        IEnumerable<WorkData> work = unfinishedWork
            .Concat(FetchQueuedWork())
            .Select(w =>
            {
                // Side-effect: bad practice, but easier
                // than writing your own IEnumerable.
                semaphore.Wait();
                return w;
            });

        // You still need to specify MaxDegreeOfParallelism
        // here so as not to saturate your thread pool when
        // Parallel.ForEach's load balancer kicks in.
        Parallel.ForEach(work, new ParallelOptions { MaxDegreeOfParallelism = MAX_DEGREE_PARALLELISM }, workUnit =>
        {
            try
            {
                this.DoWork(workUnit);
            }
            finally
            {
                semaphore.Release();
            }
        });
    }
}

Correct use of Parallel for loop in C#?

In this example, is this the correct use of the Parallel.For loop if I want to limit the number of threads that can perform the function DoWork to ten at a time? Will other threads be blocked until one of the ten threads becomes available? If not, what is a better multi-threaded solution that would still let me execute that function 6000+ times?
class Program
{
    static void Main(string[] args)
    {
        ThreadExample ex = new ThreadExample();
    }
}

public class ThreadExample
{
    int limit = 6411;

    public ThreadExample()
    {
        Console.WriteLine("Starting threads...");
        int temp = 0;
        Parallel.For(temp, limit, new ParallelOptions { MaxDegreeOfParallelism = 10 }, i =>
        {
            DoWork(temp);
            temp++;
        });
    }

    public void DoWork(int info)
    {
        //Thread.Sleep(50); //doing some work here.
        int num = info * 5;
        Console.WriteLine("Thread: {0} Result: {1}", info.ToString(), num.ToString());
    }
}
You need to use the i passed to the lambda function as the index. Parallel.For relieves you from the hassle of working with the loop counter, but you need to use it!
Parallel.For(0, limit, new ParallelOptions { MaxDegreeOfParallelism = 10 }, i =>
{
    DoWork(i);
});
As for your other questions:
Yes, this will correctly limit the number of threads working simultaneously.
There are no threads being blocked. The iterations are queued and as soon as a thread becomes available, it takes the next iteration (in a synchronized manner) from the queue to process.

Thread-safe buffer of data to make batch inserts of controlled size

I have a simulation that generates data which must be saved to a database.
ParallelLoopResult res = Parallel.For(0, 1000000, options, (r, state) =>
{
    ComplexDataSet cds = GenerateData(r);
    SaveDataToDatabase(cds);
});
The simulation generates a whole lot of data, so it wouldn't be practical to first generate it and then save it to the database (up to 1 GB of data), and it also wouldn't make sense to save it to the database one item at a time (transactions too small to be practical). I want to insert them into the database as a batch insert of controlled size (say 100 with one commit).
However, I think my knowledge of parallel computing is less than theoretical. I came up with this (which, as you can see, is very flawed):
DataBuffer buffer = new DataBuffer(...);

ParallelLoopResult res = Parallel.For(0, 10000000, options, (r, state) =>
{
    ComplexDataSet cds = GenerateData(r);
    buffer.SaveDataToBuffer(cds, i == r - 1);
});

public class DataBuffer
{
    int count = 0;
    int limit = 100;
    object _locker = new object();
    ConcurrentBag<ComplexDataSet> _lastItemRef;
    ConcurrentQueue<ConcurrentBag<ComplexDataSet>> ComplexDataSetsQueue { get; set; }

    public void SaveDataToBuffer(ComplexDataSet data, bool isfinalcycle)
    {
        lock (_locker)
        {
            if (count >= limit)
            {
                ConcurrentBag<ComplexDataSet> dequeueRef;
                if (ComplexDataSetsQueue.TryDequeue(out dequeueRef))
                {
                    Commit(dequeueRef);
                }
                _lastItemRef = new ConcurrentBag<ComplexDataSet> { data };
                ComplexDataSetsQueue.Enqueue(_lastItemRef);
                count = 1;
            }
            else
            {
                // First time
                if (_lastItemRef == null)
                {
                    _lastItemRef = new ConcurrentBag<ComplexDataSet> { data };
                    ComplexDataSetsQueue.Enqueue(_lastItemRef);
                    count = 1;
                }
                // If buffer isn't full
                else
                {
                    _lastItemRef.Add(data);
                    count++;
                }
            }

            if (isfinalcycle)
            {
                // Commit everything that hasn't been committed yet
                ConcurrentBag<ComplexDataSet> dequeueRef;
                while (ComplexDataSetsQueue.TryDequeue(out dequeueRef))
                {
                    Commit(dequeueRef);
                }
            }
        }
    }

    public void Commit(ConcurrentBag<ComplexDataSet> data)
    {
        // Commit data to database.. should this be somehow in another thread or something?
    }
}
As you can see, I'm using a queue to create a buffer and then manually deciding when to commit. However, I have a strong feeling that this isn't a very performant solution to my problem. First, I'm unsure whether I'm doing the locking right. Second, I'm not sure whether this is even fully thread-safe (or at all).
Can you please take a look for a moment and comment on what I should do differently? Or if there is a completely better way of doing this (using some kind of producer-consumer technique or something)?
Thanks and best wishes,
D.
There is no need to use locks or expensive concurrency-safe data structures. The data is all independent, so introducing locking and sharing will only hurt performance and scalability.
Parallel.For has an overload that lets you specify per-thread data. In this you can store a private queue and private database connection.
Also: Parallel.For internally partitions your range into smaller chunks. It's perfectly efficient to pass it a huge range, so nothing to change there.
Parallel.For(0, 10000000, () => new ThreadState(),
    (i, loopstate, threadstate) =>
    {
        ComplexDataSet data = GenerateData(i);
        threadstate.Add(data);
        return threadstate;
    }, threadstate => threadstate.Dispose());

sealed class ThreadState : IDisposable
{
    readonly IDisposable db;
    readonly Queue<ComplexDataSet> queue = new Queue<ComplexDataSet>();

    public ThreadState()
    {
        // initialize db with a private MongoDb connection.
    }

    public void Add(ComplexDataSet cds)
    {
        queue.Enqueue(cds);
        if (queue.Count == 100)
        {
            Commit();
        }
    }

    void Commit()
    {
        db.Write(queue);
        queue.Clear();
    }

    public void Dispose()
    {
        try
        {
            if (queue.Count > 0)
            {
                Commit();
            }
        }
        finally
        {
            db.Dispose();
        }
    }
}
Now, MongoDb currently doesn't support truly concurrent inserts -- it holds some expensive locks in the server, so parallel commits won't gain you much (if any) speed. They want to fix this in the future, so you might get a free speed-up one day.
If you need to limit the number of database connections held, a producer/consumer setup is a good alternative. You can use a BlockingCollection queue to do this efficiently without using any locks:
// Specify a maximum of 1000 items in the collection so that we don't
// run out of memory if we get data faster than we can commit it.
// Add() will wait if it is full.
BlockingCollection<ComplexDataSet> commits =
    new BlockingCollection<ComplexDataSet>(1000);

Task consumer = Task.Factory.StartNew(() =>
{
    // This is the consumer. It processes the
    // "commits" queue until it signals completion.
    while (!commits.IsCompleted)
    {
        ComplexDataSet cds;

        // Timeout of -1 will wait for an item or IsCompleted == true.
        if (commits.TryTake(out cds, -1))
        {
            // Got at least one item, write it.
            db.Write(cds);

            // Continue dequeuing until the queue is empty, where it will
            // timeout instantly and return false, or until we've dequeued
            // 100 items.
            for (int i = 1; i < 100 && commits.TryTake(out cds, 0); ++i)
            {
                db.Write(cds);
            }

            // Now that we're waiting for more items or have dequeued 100
            // of them, commit. More items can continue to be added to the
            // queue by other threads while this commit is processing.
            db.Commit();
        }
    }
}, TaskCreationOptions.LongRunning);

try
{
    // This is the producer.
    Parallel.For(0, 1000000, i =>
    {
        ComplexDataSet data = GenerateData(i);
        commits.Add(data);
    });
}
finally // put in a finally to ensure the task closes down.
{
    commits.CompleteAdding(); // sets commits.IsAddingCompleted = true.
    consumer.Wait(); // wait for the task to finish committing all the items.
}
In your example you have 10,000,000 packages of work. Each of these needs to be distributed to a thread. Assuming you don't have a really large number of CPU cores, this is not optimal. You also have to synchronize your threads on every buffer.SaveDataToBuffer call (by using locks). Additionally, you should be aware that the variable r isn't necessarily incremented by one when viewed chronologically (example: Thread1 executes r with 1, 2, 3 and Thread2 with 4, 5, 6; chronologically this would lead to the following sequence of r passed to SaveDataToBuffer: 1, 4, 2, 5, 3, 6 (approximately)).
I would make the packages of work larger and then commit each package as a whole. This also has the benefit that you don't have to lock/synchronize too often.
Here's an example:
Here's an example:
int total = 10000000;
int step = 1000;

Parallel.For(0, total / step, (r, state) =>
{
    int start = r * step;
    int end = start + step;

    ComplexDataSet[] result = new ComplexDataSet[step];

    for (int i = start; i < end; i++)
    {
        result[i - start] = GenerateData(i);
    }

    Commit(result);
});
In this example the whole work is split into 10,000 packages (which are executed in parallel), and every package generates 1,000 data items and commits them to the database.
With this solution the Commit method might be a bottleneck if not wisely designed. It would be best to make it thread-safe without using any locks, which can be accomplished if you don't share objects between threads that need synchronization.
E.g. for a SQL Server backend, that would mean creating its own SQL connection in the context of every Commit() call:
private void Commit(ComplexDataSet[] data)
{
    using (var connection = new SqlConnection("connection string..."))
    {
        connection.Open();
        // insert your data here...
    }
}
Instead of increasing the complexity of the software, rather consider simplification. You can refactor the code into three parts:
[1] Workers that enqueue
This is concurrent GenerateData in Parallel.For that does some heavy computation and produces ComplexDataSet instances.
[2] Actual queue
A concurrent queue that stores the results from [1] - that is, many ComplexDataSet instances. Here I assumed that one instance of ComplexDataSet is actually not really resource consuming and fairly light. As long as the queue is concurrent it will support parallel "inserts" and "deletes".
[3] Workers that dequeue
Code that takes one instance of ComplexDataSet from the processing queue [2] and puts it into a concurrent bag (or other storage). Once the bag has N items, you block, stop dequeueing, flush the contents of the bag into the database and clear it. Finally, you unblock and resume dequeueing.
Here is some metacode (it still compiles, but needs improvements):
[1]
// [1] - Class is responsible for generating complex data sets and
// adding them to the processing queue
class EnqueueWorker
{
    // generate data and add to queue
    internal void ParallelEnqueue(ConcurrentQueue<ComplexDataSet> resultQueue)
    {
        Parallel.For(1, 10000, (i) =>
        {
            ComplexDataSet cds = GenerateData(i);
            resultQueue.Enqueue(cds);
        });
    }

    // generate data
    ComplexDataSet GenerateData(int i)
    {
        return new ComplexDataSet();
    }
}
[3]
// [3] - This guy takes sets from the processing queue and flushes results when
// N items have been generated
class DequeueWorker
{
    // buffer that holds processed dequeued data
    private static ConcurrentBag<ComplexDataSet> buffer;

    // lock to flush the data to the db once in a while
    private static object syncRoot = new object();

    // take an item from the processing queue and add it to the internal buffer storage;
    // once the buffer is full - flush it to the database
    internal void ParallelDequeue(ConcurrentQueue<ComplexDataSet> resultQueue)
    {
        buffer = new ConcurrentBag<ComplexDataSet>();
        int N = 100;

        Parallel.For(1, 10000, (i) =>
        {
            // try dequeue
            ComplexDataSet cds = null;
            var spinWait = new SpinWait();
            while (cds == null)
            {
                resultQueue.TryDequeue(out cds);
                spinWait.SpinOnce();
            }

            // add to buffer
            buffer.Add(cds);

            // flush to database if needed
            if (buffer.Count == N)
            {
                lock (syncRoot)
                {
                    IEnumerable<ComplexDataSet> data = buffer.ToArray();
                    // flush data to database
                    buffer = new ConcurrentBag<ComplexDataSet>();
                }
            }
        });
    }
}
[2] and usage
class ComplexDataSet { }

class Program
{
    // processing queue - [2]
    private static ConcurrentQueue<ComplexDataSet> processingQueue;

    static void Main(string[] args)
    {
        // create new processing queue - single instance for whole app
        processingQueue = new ConcurrentQueue<ComplexDataSet>();

        // enqueue worker
        Task enqueueTask = Task.Factory.StartNew(() =>
        {
            EnqueueWorker enqueueWorker = new EnqueueWorker();
            enqueueWorker.ParallelEnqueue(processingQueue);
        });

        // dequeue worker
        Task dequeueTask = Task.Factory.StartNew(() =>
        {
            DequeueWorker dequeueWorker = new DequeueWorker();
            dequeueWorker.ParallelDequeue(processingQueue);
        });
    }
}
