How to consume a BlockingCollection<T> in batches

I've come up with some code to consume all waiting items from a queue. Rather than processing the items one by one, it makes sense to process all waiting items as a set.
I've declared my queue like this.
private BlockingCollection<Item> items =
new BlockingCollection<Item>(new ConcurrentQueue<Item>());
Then, on a consumer thread, I plan to read the items in batches like this,
Item nextItem;
while (this.items.TryTake(out nextItem, -1))
{
var workToDo = new List<Item>();
workToDo.Add(nextItem);
while(this.items.TryTake(out nextItem))
{
workToDo.Add(nextItem);
}
// process workToDo, then go back to the queue.
}
This approach lacks the utility of GetConsumingEnumerable and I can't help wondering if I've missed a better way, or if my approach is flawed.
Is there a better way to consume a BlockingCollection<T> in batches?

One solution is to use BufferBlock<T> from System.Threading.Tasks.Dataflow (which is included in .NET Core 3.0 and later). It does not use GetConsumingEnumerable(), but it offers the same utility, mainly:
allows parallel processing w/ multiple (symmetrical and/or asymmetrical) consumers and producers
thread safe (allowing for the above) - no race conditions to worry about
can be cancelled by a cancellation token and/or collection completion
consumers block until data is available, avoiding wasting CPU cycles on polling
There is also a BatchBlock<T>, but that limits you to fixed-size batches.
var buffer = new BufferBlock<Item>();
while (await buffer.OutputAvailableAsync())
{
if (buffer.TryReceiveAll(out var items))
{
// process items
}
}
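For comparison, here is a minimal sketch of the BatchBlock<T> alternative mentioned above. The class and method names (BatchBlockSketch, CollectBatchesAsync) are made up for illustration and are not part of the library. It posts a few integers, completes the block, and collects whatever batches come out.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class BatchBlockSketch
{
    // Posts the numbers 0..itemCount-1 and returns the batches the block emits.
    public static async Task<List<int[]>> CollectBatchesAsync(int itemCount, int batchSize)
    {
        var batcher = new BatchBlock<int>(batchSize);
        for (var i = 0; i < itemCount; i++)
        {
            batcher.Post(i);
        }
        // Completing the block flushes any leftover items as a final, smaller batch.
        batcher.Complete();

        var batches = new List<int[]>();
        while (await batcher.OutputAvailableAsync())
        {
            while (batcher.TryReceive(out int[] batch))
            {
                batches.Add(batch);
            }
        }
        return batches;
    }
}
```

With 5 items and a batch size of 2, the batches should hold 2, 2 and 1 items. The batch size is fixed up front, which is why BufferBlock<T> with TryReceiveAll is the better fit when batches should simply contain "whatever is waiting".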
Here is a working example, which demos the following:
multiple symmetrical consumers which process variable length batches in parallel
multiple symmetrical producers (not truly operating in parallel in this example)
ability to complete the collection when the producers are done
to keep the example short, I did not demonstrate the use of a CancellationToken
ability to wait until the producers and/or consumers are done
ability to call from an area that doesn't allow async, such as a constructor
the Thread.Sleep() calls are not required, but help simulate some processing time that would occur in more taxing scenarios
both the Task.WaitAll() and the Thread.Sleep() can optionally be converted to their async equivalents
no need to use any external libraries
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;
static class Program
{
static void Main()
{
var buffer = new BufferBlock<string>();
// Kick off consumer task(s)
List<Task> consumers = new List<Task>();
for (int i = 0; i < 3; i++)
{
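// copy the loop variable outside the lambda; a copy made inside it would race with the loop
var num = i;
// Task.Run unwraps the async lambda, so the stored task ends only when the consumer loop does
consumers.Add(Task.Run(async () =>
{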
while (await buffer.OutputAvailableAsync())
{
if (buffer.TryReceiveAll(out var items))
Console.WriteLine($"Consumer {num}: " +
items.Aggregate((a, b) => a + ", " + b));
// real life processing would take some time
await Task.Delay(500);
}
Console.WriteLine($"Consumer {num} complete");
}));
// give consumer tasks time to activate for a better demo
Thread.Sleep(100);
}
// Kick off producer task(s)
List<Task> producers = new List<Task>();
for (int i = 0; i < 3; i++)
{
// compute the range start now; the lambda must not read the changing loop variable
var start = 1000 * i;
producers.Add(Task.Factory.StartNew(() =>
{
for (int j = start; j < start + 500; j++)
buffer.Post(j.ToString());
}));
// space out the producers for a better demo
Thread.Sleep(10);
}
// may also use the async equivalent
Task.WaitAll(producers.ToArray());
Console.WriteLine("Finished waiting on producers");
// demo being able to complete the collection
buffer.Complete();
// may also use the async equivalent
Task.WaitAll(consumers.ToArray());
Console.WriteLine("Finished waiting on consumers");
Console.ReadLine();
}
}
Here is a modernised and simplified version of the code.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;
class Program
{
private static async Task Main()
{
var buffer = new BufferBlock<string>();
// Kick off consumer task(s)
var consumers = new List<Task>();
for (var i = 0; i < 3; i++)
{
var id = i;
consumers.Add(Task.Run(() => StartConsumer(id, buffer)));
// give consumer tasks time to activate for a better demo
await Task.Delay(100);
}
// Kick off producer task(s)
var producers = new List<Task>();
for (var i = 0; i < 3; i++)
{
var pid = i;
producers.Add(Task.Run(() => StartProducer(pid, buffer)));
// space out the producers for a better demo
await Task.Delay(10);
}
// wait for all producers without blocking a thread
await Task.WhenAll(producers);
Console.WriteLine("Finished waiting on producers");
// demo being able to complete the collection
buffer.Complete();
// wait for all consumers to drain the buffer and finish
await Task.WhenAll(consumers);
Console.WriteLine("Finished waiting on consumers");
Console.ReadLine();
}
private static async Task StartConsumer(
int id,
IReceivableSourceBlock<string> buffer)
{
while (await buffer.OutputAvailableAsync())
{
if (buffer.TryReceiveAll(out var items))
{
Console.WriteLine($"Consumer {id}: " +
items.Aggregate((a, b) => a + ", " + b));
}
// real life processing would take some time
await Task.Delay(500);
}
Console.WriteLine($"Consumer {id} complete");
}
private static Task StartProducer(int pid, ITargetBlock<string> buffer)
{
for (var j = 0 + (1000 * pid); j < 500 + (1000 * pid); j++)
{
buffer.Post(j.ToString());
}
return Task.CompletedTask;
}
}

While not as good as ConcurrentQueue<T> in some ways, my own LLQueue<T> allows for a batched dequeue with an AtomicDequeueAll method, where all items currently on the queue are taken from it in a single (atomic and thread-safe) operation and are then in a non-thread-safe collection for consumption by a single thread. This method was designed precisely for the scenario where you want to batch the read operations.
It isn't blocking, though it could be used to create a blocking collection easily enough:
public class BlockingBatchedQueue<T>
{
private readonly AutoResetEvent _are = new AutoResetEvent(false);
private readonly LLQueue<T> _store;
public void Add(T item)
{
_store.Enqueue(item);
_are.Set();
}
public IEnumerable<T> Take()
{
_are.WaitOne();
return _store.AtomicDequeueAll();
}
public bool TryTake(out IEnumerable<T> items, int millisecTimeout)
{
if(_are.WaitOne(millisecTimeout))
{
items = _store.AtomicDequeueAll();
return true;
}
items = null;
return false;
}
}
That's a starting point that doesn't do the following:
Deal with a pending waiting reader upon disposal.
Worry about a potential race with multiple readers both being triggered by a write happening while one was reading (it just considers the occasional empty result enumerable to be okay).
Place any upper-bound on writing.
All of which could be added too, but I wanted to keep to the minimum of some practical use, that hopefully isn't buggy within the defined limitations above.

No, there is no better way. Your approach is basically correct.
You could wrap the "consume-in-batches" functionality in an extension method, for ease of use. The implementation below reuses the same List<T> as a buffer during the whole enumeration, to avoid allocating a new buffer on each iteration. It also includes a maxSize parameter that lets you limit the size of the emitted batches:
/// <summary>
/// Consumes the items in the collection in batches. Each batch contains all
/// the items that are immediately available, up to a specified maximum number.
/// </summary>
public static IEnumerable<T[]> GetConsumingEnumerableBatch<T>(
this BlockingCollection<T> source, int maxSize,
CancellationToken cancellationToken = default)
{
ArgumentNullException.ThrowIfNull(source);
if (maxSize < 1) throw new ArgumentOutOfRangeException(nameof(maxSize));
if (source.IsCompleted) yield break;
var buffer = new List<T>();
while (source.TryTake(out var item, Timeout.Infinite, cancellationToken))
{
Debug.Assert(buffer.Count == 0);
buffer.Add(item);
while (buffer.Count < maxSize && source.TryTake(out item))
buffer.Add(item);
T[] batch = buffer.ToArray();
int batchSize = batch.Length;
buffer.Clear();
yield return batch;
if (batchSize < buffer.Capacity >> 2)
buffer.Capacity = buffer.Capacity >> 1; // Shrink oversized buffer
}
}
Usage example:
foreach (Item[] batch in this.items.GetConsumingEnumerableBatch(Int32.MaxValue))
{
// Process the batch
}
The buffer's capacity is halved every time an emitted batch is smaller than a quarter of that capacity. This keeps the buffer under control in case it has become oversized at some point during the enumeration.
The intention of the if (source.IsCompleted) yield break line is to replicate the behavior of the built-in GetConsumingEnumerable method, when it is supplied with an already canceled token, and the collection is empty and completed.
In case of cancellation, no buffered messages are in danger of being lost. The cancellationToken is checked only when the buffer is empty.
A simpler implementation without memory management features, can be found in the first revision of this answer.
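To see how the plain loop from the question behaves once the producer is finished, here is a small self-contained sketch. It is not the first revision mentioned above, just an illustration; the names BatchDrainSketch and DrainInBatches are invented for it. The point it shows is that TryTake with an infinite timeout returns false once the collection is completed and empty, so the outer loop ends by itself.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

public static class BatchDrainSketch
{
    // Blocks for the first item, then drains whatever else is immediately available.
    // Returns every batch taken until the collection is completed and empty.
    public static List<List<T>> DrainInBatches<T>(BlockingCollection<T> source)
    {
        var batches = new List<List<T>>();
        // A timeout of -1 waits indefinitely; TryTake returns false once the
        // collection is marked complete for adding and has no items left.
        while (source.TryTake(out T first, -1))
        {
            var batch = new List<T> { first };
            while (source.TryTake(out T next))
            {
                batch.Add(next);
            }
            batches.Add(batch);
        }
        return batches;
    }
}
```

In a real consumer you would process each batch inside the outer loop instead of collecting them; the producer side only needs to call CompleteAdding() when it is done.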

Related

ConcurrentBag skipping some items C#

I am using a ConcurrentBag for scraping URLs. It works fine for 100 or 500 URLs, but when I try to scrape 8000 URLs, not all of them are processed and some items are left pending in inputQueue.
But I am using while (!inputQueue.IsEmpty), so the loop should keep running as long as any items exist in inputQueue.
I want to run at most 100 threads. So I first create 100 threads that call the Run() method; inside that method a loop takes items for as long as any exist in inputQueue, and adds each one to the output queue after scraping its URL.
public ConcurrentBag<Data> inputQueue = new ConcurrentBag<Data>();
public ConcurrentBag<Data> outPutQueue = new ConcurrentBag<Data>();
public List<Data> Scrapes(List<Data> scrapeRequests)
{
ServicePointManager.ServerCertificateValidationCallback += (sender, cert, chain, sslPolicyErrors) => true;
string proxy_session_id = new Random().Next().ToString();
numberOfRequestSent = 0;
watch.Start();
foreach (var sRequest in scrapeRequests)
{
inputQueue.Add(sRequest);
}
//inputQueue.CompleteAdding();
var taskList = new List<Task>();
for (var i = 0; i < n_parallel_exit_nodes; i++) //create 100 threads only
{
taskList.Add(Task.Factory.StartNew(async () =>
{
await Run();
}, TaskCreationOptions.RunContinuationsAsynchronously));
}
Task.WaitAll(taskList.ToArray()); //Waiting
//print result
Console.WriteLine("Number Of URLs Found - {0}", scrapeRequests.Count);
Console.WriteLine("Number Of Request Sent - {0}", numberOfRequestSent);
Console.WriteLine("Input Queue - {0}", inputQueue.Count);
Console.WriteLine("OutPut Queue - {0}", outPutQueue.ToList().Count);
Console.WriteLine("Success - {0}", outPutQueue.ToList().Where(x=>x.IsProxySuccess==true).Count().ToString());
Console.WriteLine("Failed - {0}", outPutQueue.ToList().Where(x => x.IsProxySuccess == false).Count().ToString());
Console.WriteLine("Process Time In - {0}", watch.Elapsed);
return outPutQueue.ToList();
}
async Task<string> Run()
{
while (!inputQueue.IsEmpty)
{
var client = new Client(super_proxy_ip, "US");
if (!client.have_good_super_proxy())
client.switch_session_id();
if (client.n_req_for_exit_node == switch_ip_every_n_req)
client.switch_session_id();
var scrapeRequest = new ProductResearch_ProData();
inputQueue.TryTake(out scrapeRequest);
try
{
numberOfRequestSent++;
// Console.WriteLine("Sending request for - {0}", scrapeRequest.URL);
scrapeRequest.HTML = client.DownloadString((string)scrapeRequest.URL);
//Console.WriteLine("Response done for - {0}", scrapeRequest.URL);
scrapeRequest.IsProxySuccess = true;
outPutQueue.Add(scrapeRequest); //add object to output queue
//lumanti code
client.handle_response();
}
catch (WebException e)
{
Console.WriteLine("Failed");
scrapeRequest.IsProxySuccess = false;
Console.WriteLine(e.Message);
outPutQueue.Add(scrapeRequest); //add object to output queue
//lumanti code
client.handle_response(e);
}
client.clean_connection_pool();
client.Dispose();
}
return await Task.Run(() => "Done");
}

There are multiple problems here, but none of them seems to be the cause of inputQueue.Count having a non-zero value at the end. In any case, I would like to point out the problems I can see.
var taskList = new List<Task>();
for (var i = 0; i < n_parallel_exit_nodes; i++) // create 100 threads only
{
taskList.Add(Task.Factory.StartNew(async () =>
{
await Run();
}, TaskCreationOptions.RunContinuationsAsynchronously));
}
The method Task.Factory.StartNew doesn't understand async delegates, so when it is called with an async lambda as argument it returns a nested task. In this case it returns a Task<Task<string>>. You store this nested task in a List<Task> collection, which is possible because the type Task<TResult> inherits from the type Task, but in doing so you lose the ability to await the completion (and get the result) of the inner task. You only hold a reference to the outer task. Miraculously this is not a problem in this case (it usually is), since the outer task does all the work and the inner task does essentially nothing (other than using a thread-pool thread to return a "Done" string that is not really needed anywhere).
You also don't attach any continuations to the outer tasks, so the flag TaskCreationOptions.RunContinuationsAsynchronously seems redundant.
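The nesting described above can be made visible with a short sketch. StartNewSketch and RunAsync are hypothetical names; RunAsync only stands in for an async method like Run(). The three methods show the nested task, the same task flattened with Unwrap, and Task.Run, which has overloads for async delegates.

```csharp
using System.Threading.Tasks;

public static class StartNewSketch
{
    private static async Task<string> RunAsync()
    {
        await Task.Delay(50); // stands in for real asynchronous work
        return "Done";
    }

    // StartNew wraps the async lambda's task: the result is a Task<Task<string>>.
    public static Task<Task<string>> StartNested() =>
        Task.Factory.StartNew(async () => { return await RunAsync(); });

    // Unwrap produces a proxy task that completes when the inner task does.
    public static Task<string> StartUnwrapped() =>
        Task.Factory.StartNew(async () => { return await RunAsync(); }).Unwrap();

    // Task.Run accepts async delegates and does the unwrapping itself.
    public static Task<string> StartWithRun() =>
        Task.Run(async () => { return await RunAsync(); });
}
```

Waiting on the task returned by StartNested only waits until the lambda reaches its first await; the other two complete when RunAsync has actually finished.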
// create 100 threads only
You don't create 100 threads, you create 100 tasks. These tasks are scheduled in the ThreadPool, which will be immediately starved because the tasks are long-running, and will start injecting one new thread every 500 msec until all scheduled tasks have been assigned to a thread.
var scrapeRequest = new ProductResearch_ProData();
inputQueue.TryTake(out scrapeRequest);
Here you instantiate an object of type ProductResearch_ProData that is immediately discarded and becomes eligible for garbage collection in the very next line. The TryTake method will either return an object removed from the bag, or null if the bag is empty. You ignore the return value of the TryTake method, which may well be false because another worker may have emptied the bag in the meantime, and then proceed with a scrapeRequest that is possibly null, resulting in that case in a NullReferenceException.
It is worth noting that you extract an object of type ProductResearch_ProData from a ConcurrentBag<Data>, so either the class Data inherits from the base class ProductResearch_ProData, or there is a transcription error in the code.
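A minimal sketch of a worker loop that avoids the TryTake problem is shown below. The names BagWorkerSketch and ProcessAll are hypothetical, and the process delegate stands in for the download step. The Interlocked counter is my own addition (a plain ++ on a shared field, as in numberOfRequestSent++, is not atomic); the answer above does not discuss it.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BagWorkerSketch
{
    // Drains the bag with workerCount workers; returns how many items were processed.
    public static int ProcessAll(ConcurrentBag<string> input, int workerCount, Action<string> process)
    {
        int processed = 0;
        var workers = Enumerable.Range(0, workerCount)
            .Select(_ => Task.Run(() =>
            {
                // TryTake is the loop condition, so a worker never proceeds
                // with an item that another worker has already removed.
                while (input.TryTake(out string item))
                {
                    process(item);
                    // ++ on a shared int is not atomic; Interlocked.Increment is.
                    Interlocked.Increment(ref processed);
                }
            }))
            .ToArray();
        Task.WaitAll(workers);
        return processed;
    }
}
```

Because the lambda passed to Task.Run is synchronous, the stored tasks are plain tasks and Task.WaitAll really waits for all the work.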

How to spawn a number of threads async to call a method multiple times in C#?

I need to call a worker method multiple times to load data into a database. I want to do some parallelism with this and be able to specify the number of threads to use. I thought of using the mod operator to split the workload, but getting stuck on how to implement with async await.
So the async method must create n number of threads and then call the worker method so there are n streams of work happening in parallel. The worker method is synchronous.
I had a go at it, but I'm not quite sure how to implement what I want. Is there a pattern for this?
Some code I was playing around with:
using System;
using System.Threading;
using System.Threading.Tasks;
namespace TestingAsync
{
class Program
{
static void Main(string[] args)
{
int threads = 3;
int numItems = 10;
Task task = ThreadIt(threads, numItems);
}
static async Task ThreadIt(int threads, int numItems)
{
Console.WriteLine($"Item limit: {numItems}, threads: {threads}");
for (int i = 0; i < numItems; i++)
{
Console.Write($"Adding item: {i} mod 1: {i % threads}. ");
int task = await DoSomeWork(i%threads, 500);
}
}
static async Task<int> DoSomeWork(int Item, int time)
{
Console.Write($" Working.on item {Item}..");
Thread.Sleep(time );
Console.WriteLine($"done.");
return Item;
}
}
}
EDIT:
I'm going to rephrase because maybe I wasn't clear in my requirements.
What I want is to create n number of threads. There will be x number of items to process and I want them to be queued up using mod (or something else) and then processed in order in parallel across the n threads. When one item has finished, I want the next item to be processed immediately and not wait for all three threads to finish. Some items will take longer to process than others, maybe even up to 10 times longer, so other threads should not be waiting for one of the threads to complete.
For example if we have 3 threads and 9 items, this would happen:
thread1: items 0,3,6
thread2: items 1,4,7
thread3: items 2,5,8
Each thread processes its workload in order and does not wait in between items.

You can try creating a List<Task<T>> and start them and then await it with WhenAll if you want all tasks to be completed or WhenAny if any of them completes:
static async Task ThreadIt(int threads, int numItems)
{
List<Task<int>> tasks = new List<Task<int>>();
Console.WriteLine($"Item limit: {numItems}, threads: {threads}");
for (int i = 0; i < numItems; i++)
{
Console.Write($"Adding item: {i} mod 1: {i % threads}. ");
tasks.Add(DoSomeWork(i%threads, 500));
}
var result = await Task.WhenAll(tasks);
}
When using Task, async and await, we should also use Task.Delay instead of Thread.Sleep:
static async Task<int> DoSomeWork(int Item, int time)
{
Console.Write($" Working.on item {Item}..");
await Task.Delay(time); // note this
Console.WriteLine($"done.");
return Item;
}
EDIT:
You can create a ConcurrentQueue, dequeue one item per thread, and start the next group of tasks once the current group has completed, like this:
static async Task ThreadIt(int threads, int numItems)
{
ConcurrentQueue<int> queue = new ConcurrentQueue<int>();
// IEnumerable<T> has no ForEach method, so enqueue with a plain loop
foreach (var x in Enumerable.Range(0, numItems)) queue.Enqueue(x);
List<Task<int>> tasks = new List<Task<int>>();
Console.WriteLine($"Item limit: {numItems}, threads: {threads}");
while (!queue.IsEmpty)
{
for (int i = 0; i < threads; i++)
{
if(queue.TryDequeue(out int val))
{
Console.Write($"Adding item: {val} mod 1: {val % threads}. ");
tasks.Add(DoSomeWork(val%threads, 500));
}
}
var result = await Task.WhenAll(tasks);
}
}

I need to call a worker method multiple times to load data into a database. I want to do some parallelism with this and be able to specify the number of threads to use... The worker method is synchronous... Is there a pattern for this?
Yes, the Task Parallel Library.
Given:
static int DoSomeWork(int Item, int time)
{
Console.Write($" Working.on item {Item}..");
Thread.Sleep(time);
Console.WriteLine($"done.");
return Item;
}
You can parallelize it as such:
static List<int> ThreadIt(int threads, int numItems)
{
Console.WriteLine($"Item limit: {numItems}, threads: {threads}");
var items = Enumerable.Range(0, numItems);
return items.AsParallel().WithDegreeOfParallelism(threads)
.Select(i => DoSomeWork(i, 500))
.ToList();
}
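PLINQ decides by itself which worker gets which item. If the fixed assignment from the question's edit is really required (worker w handles items w, w + n, w + 2n, ... in order), it can be sketched by hand as below. ModPartitionSketch and Run are invented names, and the work delegate stands in for the synchronous worker method.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ModPartitionSketch
{
    // Worker w handles items w, w + workers, w + 2 * workers, ... in that order.
    // Returns, per worker, the items it processed.
    public static List<int>[] Run(int workers, int numItems, Action<int> work)
    {
        var handled = new List<int>[workers];
        var tasks = new Task[workers];
        for (var w = 0; w < workers; w++)
        {
            var worker = w; // copy: the lambda must not capture the loop variable
            handled[worker] = new List<int>();
            tasks[worker] = Task.Factory.StartNew(() =>
            {
                for (var item = worker; item < numItems; item += workers)
                {
                    work(item); // synchronous worker method
                    handled[worker].Add(item); // each list is touched by one task only
                }
            }, TaskCreationOptions.LongRunning);
        }
        Task.WaitAll(tasks);
        return handled;
    }
}
```

No worker waits for another between items, but a slow item still delays the rest of its own partition. If that matters, a shared queue that idle workers pull from balances the load better than a fixed mod split.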

creating a .net async wrapper to a sync request

I have the following situation (or a basic misunderstanding with the async await mechanism).
Assume you have a set of 1-20 web request call that takes a long time: findItemsByProduct().
you want to wrap it around in an async request, that would be able to abstract all these calls into one async call, but I can't seem to be able to do it without using more threads.
If I'm doing:
int total = result.paginationOutput.totalPages;
for (int i = 2; i < total + 1; i++)
{
await Task.Factory.StartNew(() =>
{
result = client.findItemsByProduct(i);
});
newList.AddRange(result.searchResult.item);
}
}
return newList;
The problem here is that the calls don't run together; they run one after another.
I would like all the calls to run together and then harvest the results.
as pseudo code, I would like the code to run like this:
forEach item {
result = item.makeWebRequest();
}
foreach item {
List.addRange(item.harvestResults);
}
I have no idea how to make the code to do that though..

Ideally, you should add a findItemsByProductAsync that returns a Task<Item[]>. That way, you don't have to create unnecessary tasks using StartNew or Task.Run.
Then your code can look like this:
int total = result.paginationOutput.totalPages;
// Start all downloads; each download is represented by a task.
Task<Item[]>[] tasks = Enumerable.Range(2, total - 1)
.Select(i => client.findItemsByProductAsync(i)).ToArray();
// Wait for all downloads to complete.
Item[][] results = await Task.WhenAll(tasks);
// Flatten the results into a single collection.
return results.SelectMany(x => x).ToArray();

Given your requirements which I see as:
Process n number of non-blocking tasks
Process results after all queries have returned
I would use the CountdownEvent for this e.g.
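// ConcurrentBag has no capacity constructor, so start with an empty bag
var results = new ConcurrentBag<ItemType>();
using (var e = new CountdownEvent(result.pagination.totalPages))
{
for (int i = 2; i <= result.pagination.totalPages+1; i++)
{
int page = i; // copy the loop variable so each task requests its own page
Task.Factory.StartNew(() => client.findItemsByProduct(page))
.ContinueWith(t => {
// ConcurrentBag has no AddRange; add the items one by one
// (assumes findItemsByProduct returns a sequence of ItemType)
foreach (var item in t.Result) results.Add(item);
e.Signal(); // signal task is done
});
}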
// Wait for all requests to complete
e.Wait();
}
// Process results
foreach (var item in results)
{
...
}

This particular problem is solved easily enough without even using await. Simply create each of the tasks, put all of the tasks into a list, and then use WhenAll on that list to get a task that represents the completion of all of those tasks:
public static Task<Item[]> Foo()
{
int total = result.paginationOutput.totalPages;
var tasks = new List<Task<Item>>();
for (int i = 2; i < total + 1; i++)
{
int page = i; // copy the loop variable; the lambda would otherwise see later values of i
tasks.Add(Task.Factory.StartNew(() => client.findItemsByProduct(page)));
}
return Task.WhenAll(tasks);
}
Also note you have a major problem in how you use result in your code. You're having each of the different tasks all using the same variable, so there are race conditions as to whether or not it works properly. You could end up adding the same call twice and having one skipped entirely. Instead you should have the call to findItemsByProduct be the result of the task, and use that task's Result.
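Put together, the "each task returns its own result" idea can be sketched as follows. PageFetchSketch, FetchPagesAsync and the fetchPage delegate are hypothetical; fetchPage stands in for the synchronous client.findItemsByProduct call and is assumed to return the items of one page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class PageFetchSketch
{
    // Starts one task per page, waits for all of them, and flattens the results.
    public static async Task<List<TItem>> FetchPagesAsync<TItem>(
        int firstPage, int lastPage, Func<int, IEnumerable<TItem>> fetchPage)
    {
        var tasks = new List<Task<IEnumerable<TItem>>>();
        for (var i = firstPage; i <= lastPage; i++)
        {
            var page = i; // a for-loop variable is shared by all lambdas; copy it
            tasks.Add(Task.Run(() => fetchPage(page)));
        }
        // All calls are now in flight; wait for every one of them.
        IEnumerable<TItem>[] results = await Task.WhenAll(tasks);
        // WhenAll returns results in task order, so pages stay in sequence.
        return results.SelectMany(r => r).ToList();
    }
}
```

No shared result variable is written by the tasks, so there is nothing to race on; each page's items travel back through its own task.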

If you want to use async-await properly you have to declare your functions async, and the functions that call them also have to be async. This continues until you reach the one synchronous function that starts the async process.
Your function would look like this:
By the way, you didn't describe what's in the list. I assume they are objects of type T; in that case result.SearchResult.Item returns IEnumerable<T>.
private async Task<List<T>> FindItems(...)
{
int total = result.paginationOutput.totalPages;
var newList = new List<T>();
for (int i = 2; i < total + 1; i++)
{
IEnumerable<T> result = await Task.Factory.StartNew(() =>
{
return client.findItemsByProduct(i);
});
newList.AddRange(result.searchResult.item);
}
return newList;
}
If you do it this way, your function will be asynchronous, but the findItemsByProduct calls will be executed one after another. If you want to execute them simultaneously, you should not await each result, but start the next task before the previous one has finished. Once all tasks are started, wait until all are finished. Like this:
private async Task<List<T>> FindItems(...)
{
int total = result.paginationOutput.totalPages;
var tasks= new List<Task<IEnumerable<T>>>();
// start all tasks. don't wait for the result yet
for (int i = 2; i < total + 1; i++)
{
int page = i; // copy the loop variable so each task gets its own page number
Task<IEnumerable<T>> task = Task.Factory.StartNew(() =>
{
return client.findItemsByProduct(page);
});
tasks.Add(task);
}
// now that all tasks are started, wait until all are finished
await Task.WhenAll(tasks);
// the result of each task is now in task.Result
// the type of result is IEnumerable<T>
// put all into one big list using some linq:
return tasks.SelectMany ( task => task.Result.SearchResult.Item)
.ToList();
// if you're not familiar to linq yet, use a foreach:
var newList = new List<T>();
foreach (var task in tasks)
{
newList.AddRange(task.Result.searchResult.item);
}
return newList;
}

C# RX (System.Reactive) - Async - Publish an IEnumerable<DataRow> to multiple observing data handers

I'm new to RX.
I'd like to traverse an IEnumerable and publish to multi DataHandlers that process the data in their respective threads.
Below is my sample program. The publish works and a new thread is created, but the 3 RowHandlers are all running in 1 thread. I need 3 threads. What is the best way to implement this?
class Program
{
public class MyDataGenerator
{
public IEnumerable<int> myData()
{
//Heavy lifting....Don't want to process more than once.
yield return 1;
yield return 2;
yield return 3;
yield return 4;
yield return 5;
yield return 6;
}
}
static void Main(string[] args)
{
MyDataGenerator h = new MyDataGenerator();
Console.WriteLine("Thread id " + Thread.CurrentThread.ManagedThreadId.ToString());
//
var shared = h.myData().ToObservable().Publish();
///////////////////////////////
// Row Handling Requirements
//
// 1. Single Scan of IEnumerable.
// 2. Row handlers process data in their own threads.
// 3. OK if scanning thread blocks while data is processed
//
//Create the RowHandlers
MyRowHandler rn1 = new MyRowHandler();
rn1.ido = shared.Subscribe(i => rn1.processID(i));
MyRowHandler rn2 = new MyRowHandler();
rn2.ido = shared.Subscribe(i => rn2.processID(i));
MyRowHandler rn3 = new MyRowHandler();
rn3.ido = shared.Subscribe(i => rn3.processID(i));
//
shared.Connect();
}
public class MyRowHandler
{
public IDisposable ido = null;
public void processID(int i)
{
var o = Observable.Start(() =>
{
Console.WriteLine(String.Format("Start Thread ID {0} Int{1}", Thread.CurrentThread.ManagedThreadId, i));
Thread.Sleep(30);
Console.WriteLine("Done Thread ID"+Thread.CurrentThread.ManagedThreadId.ToString());
}
);
o.First();
}
}
}
Discovery :
The coding speed and code quality gains one receives from Rx come at the expense of performance. Tasks/delegates are without a doubt multiples faster. That means that the most important thing one needs to learn about Rx is when to use Rx. Below is a draft summary guideline. For large volumes I can see a use for Rx in chunking, combining, and other many-stream, many-handler models; however, basic async should not use Rx.
I'd post an image with a matrix guideline, but the site won't let me post images

If I understand your sequencing requirements correctly and you want three parallel running scans, you can just observe on the TaskPool and subscribe from there;
...
//Create the RowHandlers
MyRowHandler rn1 = new MyRowHandler();
rn1.ido = shared.ObserveOn(Scheduler.TaskPool).Subscribe(i => rn1.processID(i));
...
Note that since you're then running asynchronously and your main thread doesn't wait for the scans to get done, your program will terminate right away unless you for example put a Console.ReadKey() at the end of the program.
EDIT: Regarding running the same thread "all the way", you're scheduling a bit strangely for that. If you drop the observable in the rowhandler, you can use Scheduler.NewThread and get good results;
...
var rowHandler1 = new MyRowHandler();
rowHandler1.ido = shared.ObserveOn(Scheduler.NewThread).Subscribe(rowHandler1.ProcessID);
...
public void ProcessID(int i)
{
Console.WriteLine(String.Format("Start Thread ID {0} Int{1}", Thread.CurrentThread.ManagedThreadId, i));
Thread.Sleep(30);
Console.WriteLine("Done Thread ID" + Thread.CurrentThread.ManagedThreadId.ToString(CultureInfo.InvariantCulture));
}
That will give each subscription its own thread, and stay with it.

Thread-safe buffer of data to make batch inserts of controlled size

I have a simulation that generates data which must be saved to database.
ParallelLoopResult res = Parallel.For(0, 1000000, options, (r, state) =>
{
ComplexDataSet cds = GenerateData(r);
SaveDataToDatabase(cds);
});
The simulation generates a whole lot of data, so it wouldn't be practical to first generate it all and then save it to the database (up to 1 GB of data), and it also wouldn't make sense to save it to the database one item at a time (the transactions would be too small to be practical). I want to insert the data as batch inserts of controlled size (say 100 items with one commit).
However, I think my knowledge of parallel computing is less than theoretical. I came up with this (which as you can see is very flawed):
DataBuffer buffer = new DataBuffer(...);
ParallelLoopResult res = Parallel.For(0, 10000000, options, (r, state) =>
{
ComplexDataSet cds = GenerateData(r);
buffer.SaveDataToBuffer(cds, i == r - 1);
});
public class DataBuffer
{
int count = 0;
int limit = 100
object _locker = new object();
ConcurrentQueue<ConcurrentBag<ComplexDataSet>> ComplexDataBagQueue{ get; set; }
public void SaveDataToBuffer(ComplexDataSet data, bool isfinalcycle)
{
lock (_locker)
{
if(count >= limit)
{
ConcurrentBag<ComplexDataSet> dequeueRef;
if(ComplexDataBagQueue.TryDequeue(out dequeueRef))
{
Commit(dequeueRef);
}
_lastItemRef = new ConcurrentBag<ComplexDataSet>{data};
ComplexDataSetsQueue.Enqueue(_lastItemRef);
count = 1;
}
else
{
// First time
if(_lastItemRef == null)
{
_lastItemRef = new ConcurrentBag<ComplexDataSet>{data};
ComplexDataSetsQueue.Enqueue(_lastItemRef);
count = 1;
}
// If buffer isn't full
else
{
_lastItemRef.Add(data);
count++;
}
}
if(isfinalcycle)
{
// Commit everything that hasn't been committed yet
ConcurrentBag<ComplexDataSet> dequeueRef;
while (ComplexDataSetsQueue.TryDequeue(out dequeueRef))
{
Commit(dequeueRef);
}
}
}
}
public void Commit(ConcurrentBag<ComplexDataSet> data)
{
// Commit data to database..should this be somehow in another thread or something ?
}
}
As you can see, I'm using a queue to create a buffer and then manually deciding when to commit. However, I have a strong feeling that this isn't a very performant solution to my problem. First, I'm unsure whether I'm doing the locking right. Second, I'm not sure this is even fully thread-safe (or thread-safe at all).
Can you please take a look for a moment and comment on what I should do differently? Or is there a completely better way of doing this (using some kind of producer-consumer technique or something)?
Thanks and best wishes,
D.

There is no need to use locks or expensive concurrency-safe data structures. The data is all independent, so introducing locking and sharing will only hurt performance and scalability.
Parallel.For has an overload that lets you specify per-thread data. In this you can store a private queue and private database connection.
Also: Parallel.For internally partitions your range into smaller chunks. It's perfectly efficient to pass it a huge range, so nothing to change there.
Parallel.For(0, 10000000, () => new ThreadState(),
(i, loopstate, threadstate) =>
{
ComplexDataSet data = GenerateData(i);
threadstate.Add(data);
return threadstate;
}, threadstate => threadstate.Dispose());
sealed class ThreadState : IDisposable
{
readonly IDisposable db;
readonly Queue<ComplexDataSet> queue = new Queue<ComplexDataSet>();
public ThreadState()
{
// initialize db with a private MongoDb connection.
}
public void Add(ComplexDataSet cds)
{
queue.Enqueue(cds);
if(queue.Count == 100)
{
Commit();
}
}
void Commit()
{
db.Write(queue);
queue.Clear();
}
public void Dispose()
{
try
{
if(queue.Count > 0)
{
Commit();
}
}
finally
{
db.Dispose();
}
}
}
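The same per-thread buffering idea can be tried out without a database. In this sketch the names LocalBatchSketch and Run, and the generate and commit delegates, are placeholders; commit stands in for the batched write and must itself be safe to call from several threads at once.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class LocalBatchSketch
{
    // Generates `total` items in parallel; each worker buffers its own items
    // and passes them to commit in batches of at most batchSize.
    public static void Run<T>(int total, int batchSize, Func<int, T> generate, Action<List<T>> commit)
    {
        Parallel.For(0, total,
            () => new List<T>(batchSize),        // localInit: one private buffer per worker
            (i, loopState, buffer) =>
            {
                buffer.Add(generate(i));
                if (buffer.Count == batchSize)
                {
                    commit(new List<T>(buffer)); // hand over a copy, then reuse the buffer
                    buffer.Clear();
                }
                return buffer;
            },
            buffer =>                            // localFinally: flush the remainder
            {
                if (buffer.Count > 0) commit(buffer);
            });
    }
}
```

Each worker may end with a partial batch, so the number of commits is not exactly total / batchSize, but every item is committed exactly once and no buffer is ever shared between threads.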
Now, MongoDb currently doesn't support truly concurrent inserts -- it holds some expensive locks in the server, so parallel commits won't gain you much (if any) speed. They want to fix this in the future, so you might get a free speed-up one day.
If you need to limit the number of database connections held, a producer/consumer setup is a good alternative. You can use a BlockingCollection queue to do this efficiently without using any locks:
// Specify a maximum of 1000 items in the collection so that we don't
// run out of memory if we get data faster than we can commit it.
// Add() will wait if it is full.
BlockingCollection<ComplexDataSet> commits =
new BlockingCollection<ComplexDataSet>(1000);
Task consumer = Task.Factory.StartNew(() =>
{
// This is the consumer. It processes the
// "commits" queue until it signals completion.
while(!commits.IsCompleted)
{
ComplexDataSet cds;
// Timeout of -1 will wait for an item or IsCompleted == true.
if(commits.TryTake(out cds, -1))
{
// Got at least one item, write it.
db.Write(cds);
// Continue dequeuing until the queue is empty, where it will
// timeout instantly and return false, or until we've dequeued
// 100 items.
for(int i = 1; i < 100 && commits.TryTake(out cds, 0); ++i)
{
db.Write(cds);
}
// The queue is now empty or we have dequeued 100 items, so commit.
// Other threads can continue to add items to the queue while this
// commit is processing.
db.Commit();
}
}
}, TaskCreationOptions.LongRunning);
try
{
// This is the producer.
Parallel.For(0, 1000000, i =>
{
ComplexDataSet data = GenerateData(i);
commits.Add(data);
});
}
finally // put in a finally to ensure the task closes down.
{
commits.CompleteAdding(); // sets commits.IsAddingCompleted = true.
consumer.Wait(); // wait for task to finish committing all the items.
}
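The same "block for one item, then drain what is already waiting" loop can also be wrapped in an iterator, which gets back some of the convenience of GetConsumingEnumerable. This is only a sketch: BatchConsumer and GetConsumingBatches are names I made up, and it assumes a single consumer per call and C# 7 or later.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public static class BatchConsumer
{
    // Yields batches of up to maxBatchSize items. It blocks for the first
    // item of each batch, then takes whatever else is immediately available.
    // Enumeration ends once CompleteAdding() was called and the queue is empty.
    public static IEnumerable<List<T>> GetConsumingBatches<T>(
        BlockingCollection<T> source, int maxBatchSize)
    {
        foreach (T first in source.GetConsumingEnumerable())
        {
            var batch = new List<T>(maxBatchSize) { first };

            // TryTake without a timeout returns false as soon as nothing is waiting.
            while (batch.Count < maxBatchSize && source.TryTake(out T next))
            {
                batch.Add(next);
            }

            yield return batch;
        }
    }
}
```

The consumer body then shrinks to a foreach over GetConsumingBatches(commits, 100), writing every item of the batch and committing once per batch.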

In your example you have 10 000 000 packages of work. Each of these needs to be distributed to a thread.
Unless you have a very large number of CPU cores, this is not optimal. You also have to synchronize your threads on every buffer.SaveDataToBuffer call (by using locks). Additionally, you should be aware that the variable r isn't necessarily increased by one in chronological order. For example, if Thread1 executes r with 1,2,3 and Thread2 with 4,5,6, the values of r passed to SaveDataToBuffer would arrive in roughly the order 1,4,2,5,3,6.
I would make the packages of work larger and then commit each package at once. This also has the benefit that you don't have to lock/synchronize all too often.
Here's an example:
int total = 10000000;
int step = 1000;
Parallel.For(0, total / step, (r, state) =>
{
int start = r * step;
int end = start + step;
ComplexDataSet[] result = new ComplexDataSet[step];
for (int i = start; i < end; i++)
{
result[i - start] = GenerateData(i);
}
Commit(result);
});
In this example the whole work is split into 10 000 packages (which are executed in parallel) and every package generates 1000 data items and commits them to the database.
With this solution the Commit method might become a bottleneck if it is not designed carefully. The best option is to make it thread safe without using any locks. You can accomplish this by not sharing any objects between threads that would need synchronization.
For example, with a SQL Server backend that would mean creating a separate SqlConnection inside every Commit() call:
private void Commit(ComplexDataSet[] data)
{
using (var connection = new SqlConnection("connection string..."))
{
connection.Open();
// insert your data here...
}
}
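One caveat with the chunked loop: total / step silently drops the remainder when total is not a multiple of step. A possible alternative, not what the code above does, is to let Partitioner.Create produce the ranges, since it makes the last range shorter when needed. In this sketch RangeBatching and Run are made-up names, and the generate and commit steps are replaced by in-memory stand-ins. It assumes C# 7 or later for the value tuple.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class RangeBatching
{
    // Splits [0, total) into ranges of at most 'step' items and processes
    // each range as one package. Returns the item count and commit count.
    public static (long Items, int Commits) Run(int total, int step)
    {
        long items = 0;
        int commits = 0;

        Parallel.ForEach(Partitioner.Create(0, total, step), range =>
        {
            // range.Item1 is inclusive, range.Item2 is exclusive.
            int[] result = new int[range.Item2 - range.Item1];
            for (int i = range.Item1; i < range.Item2; i++)
            {
                result[i - range.Item1] = i;   // stand-in for GenerateData(i)
            }

            // Stand-in for Commit(result): only counts what would be written.
            Interlocked.Add(ref items, result.Length);
            Interlocked.Increment(ref commits);
        });

        return (items, commits);
    }
}
```

With total = 2500 and step = 1000 this yields three packages of 1000, 1000 and 500 items, so nothing is lost at the end of the range.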

Instead of increasing the complexity of the software, consider simplifying it. You can refactor the code into three parts:
[1] Workers that enqueue
This is the concurrent GenerateData call in Parallel.For, which does some heavy computation and produces a ComplexDataSet.
[2] Actual queue
A concurrent queue that stores the results from [1], i.e. many ComplexDataSet instances. Here I assumed that one instance of ComplexDataSet is fairly light and does not consume many resources. As long as the queue is concurrent, it will support parallel "inserts" and "deletes".
[3] Workers that dequeue
Code that takes one instance of ComplexDataSet from the processing queue [2] and puts it into a concurrent bag (or other storage). Once the bag holds N items you block, stop dequeueing, flush the content of the bag into the database and clear it. Finally, you unblock and resume dequeueing.
Here is some metacode (it still compiles, but needs improvements)
[1]
// [1] - Class is responsible for generating complex data sets and
// adding them to processing queue
class EnqueueWorker
{
//generate data and add to queue
internal void ParrallelEnqueue(ConcurrentQueue<ComplexDataSet> resultQueue)
{
Parallel.For(1, 10000, (i) =>
{
ComplexDataSet cds = GenerateData(i);
resultQueue.Enqueue(cds);
});
}
//generate data
ComplexDataSet GenerateData(int i)
{
return new ComplexDataSet();
}
}
[3]
//[3] Takes sets from the processing queue and flushes the results once
// N items have been buffered
class DequeueWorker
{
//buffer that holds processed dequeued data
private static ConcurrentBag<ComplexDataSet> buffer;
//lock to flush the data to the db once in a while
private static object syncRoot = new object();
//take item from processing queue and add it to internal buffer storage
//once buffer is full - flush it to the database
internal void ParrallelDequeue(ConcurrentQueue<ComplexDataSet> resultQueue)
{
buffer = new ConcurrentBag<ComplexDataSet>();
int N = 100;
Parallel.For(1, 10000, (i) =>
{
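//try dequeue, spinning only while the queue is empty
ComplexDataSet cds;
var spinWait = new SpinWait();
while (!resultQueue.TryDequeue(out cds))
{
spinWait.SpinOnce();
}
//add to buffer
buffer.Add(cds);
//flush to database if needed
if (buffer.Count >= N)
{
lock (syncRoot)
{
//re-check under the lock: another thread may have flushed already
if (buffer.Count >= N)
{
//swap in a fresh bag first so that later adds go to the new one
//(an Add racing with the swap can still land in the old bag)
ConcurrentBag<ComplexDataSet> full = buffer;
buffer = new ConcurrentBag<ComplexDataSet>();
IEnumerable<ComplexDataSet> data = full.ToArray();
// flush data to database
}
}
}
});
//anything left in buffer (fewer than N items) still needs a final flush here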
}
}
[2] and usage
class ComplexDataSet { }
class Program
{
//processing queue - [2]
private static ConcurrentQueue<ComplexDataSet> processingQueue;
static void Main(string[] args)
{
// create new processing queue - single instance for whole app
processingQueue = new ConcurrentQueue<ComplexDataSet>();
//enqueue worker
Task enqueueTask = Task.Factory.StartNew(() =>
{
EnqueueWorker enqueueWorker = new EnqueueWorker();
enqueueWorker.ParrallelEnqueue(processingQueue);
});
//dequeue worker
Task dequeueTask = Task.Factory.StartNew(() =>
{
DequeueWorker dequeueWorker = new DequeueWorker();
dequeueWorker.ParrallelDequeue(processingQueue);
});
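// wait for both workers; otherwise Main returns and the process exits
Task.WaitAll(enqueueTask, dequeueTask);
}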
}
