I have a finite set of consumer threads, each consuming a job. Once a thread processes a job, it has a list of subjobs that were listed in the consumed job, and I need to add the subjobs from that list that aren't already in the database. There are 3 million jobs in the database, so determining which ones aren't already there is slow. I don't mind each thread blocking on that call, but since I have a race condition (see code) I have to lock them all around that slow call, so they can only enter that section one at a time, and my program crawls. What can I do to fix this so the threads don't slow down for that call? I tried a queue, but since the threads push out lists of jobs faster than the computer can determine which ones should be added to the database, I end up with a queue that keeps growing and never empties.
My code:
IEnumerable<string> getUniqueJobNames(IEnumerable<job> subJobs, int setID)
{
return subJobs.Select(el => el.name)
.Except(db.jobs.Where(el => el.set_ID==setID).Select(el => el.name));
}
//...consumer thread i
lock(lockObj)
{
var uniqueJobNames = getUniqueJobNames(consumedJob.subJobs, consumerSetID);
//if there was a context switch here to some thread i+1
// and that thread found uniqueJobs that also were found in thread i
// then there will be multiple copies of the same job added in the database.
// So I put this section in a lock to prevent that.
saveJobsToDatabase(uniqueJobNames, consumerSetID);
}
//continue consumer thread i...
Rather than going back to the database to check the uniqueness of job names, you could load the relevant info into an in-memory lookup data structure, which lets you check for existence much faster:
Dictionary<int, HashSet<string>> jobLookup = db.jobs.GroupBy(j => j.set_ID)
    .ToDictionary(g => g.Key, g => new HashSet<string>(g.Select(j => j.name)));
You build this lookup only once. Afterwards, every time you need to check for uniqueness you use the lookup:
IEnumerable<string> getUniqueJobNames(IEnumerable<job> subJobs, int setID)
{
var existingJobs = jobLookup.ContainsKey(setID) ? jobLookup[setID] : new HashSet<string>();
return subJobs.Select(el => el.name)
.Except(existingJobs);
}
If you need to enter new subjobs, also add them to the lookup:
lock(lockObj)
{
    // materialize the query so it is evaluated only once
    var uniqueJobNames = getUniqueJobNames(consumedJob.subJobs, consumerSetID).ToList();
    // the lock is still needed so no other thread can insert the same
    // jobs between the uniqueness check and the save
    saveJobsToDatabase(uniqueJobNames, consumerSetID);
    // keep the in-memory lookup in sync with the database
    if (!jobLookup.ContainsKey(consumerSetID))
    {
        jobLookup.Add(consumerSetID, new HashSet<string>(uniqueJobNames));
    }
    else
    {
        jobLookup[consumerSetID].UnionWith(uniqueJobNames);
    }
}
Sorry for the confusing title, but that's basically what I need. I could do something with global variables, but that would only be viable for two threads that are requested one after the other.
Here is pseudocode that might explain it better.
/*Async function that gets requests from a server*/
if () // received request from server
{
new Thread(() =>
{
//do stuff
//in the meantime a new thread has been requested from server
//and another one 10 seconds later.. etc.
//wait for this current thread to finish
//fire up the first thread that was requested while this ongoing thread
//after the second thread is finished fire up the third thread that was requested 10 seconds after this thread
//etc...
}).Start();
}
I don't know when each thread will be requested, as it is based on the server sending info to the client, so I can't do Task.ContinueWith as it's dynamic.
So Michael suggested I look into queues, and I came up with this:
static Queue<Action> myQ = new Queue<Action>();
static void Main(string[] args)
{
new Thread(() =>
{
while (1 == 1)
{
if (myQ.FirstOrDefault() == null)
break;
myQ.FirstOrDefault().Invoke();
}
}).Start();
myQ.Enqueue(() =>
{
TestQ("First");
});
myQ.Enqueue(() =>
{
TestQ("Second");
});
Console.ReadLine();
}
private static void TestQ(string s)
{
Console.WriteLine(s);
Thread.Sleep(5000);
myQ.Dequeue();
}
I commented the code; I basically need to check whether the action is first in the queue or not.
EDIT: So I re-made it, and now it works. Surely there is a better way to do this? I can't afford to use an infinite while loop.
You will have to use a global container for the threads. Maybe check Queues.
This class implements a queue as a circular array. Objects stored in a
Queue are inserted at one end and removed from the other.
Queues and stacks are useful when you need temporary storage for
information; that is, when you might want to discard an element after
retrieving its value. Use Queue if you need to access the information
in the same order that it is stored in the collection. Use Stack if
you need to access the information in reverse order. Use
ConcurrentQueue(Of T) or ConcurrentStack(Of T) if you need to access
the collection from multiple threads concurrently.
Three main operations can be performed on a Queue and its elements:
Enqueue adds an element to the end of the Queue.
Dequeue removes the oldest element from the start of the Queue.
Peek returns the oldest element that is at the start of the Queue but does not remove it from the Queue.
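For example, a minimal sketch of those three operations:
var queue = new Queue<string>();
queue.Enqueue("first");           // add to the end
queue.Enqueue("second");
string oldest = queue.Peek();     // "first", still in the queue
string removed = queue.Dequeue(); // "first", now removed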
EDIT (From what you added)
Here is how I would change your example code to implement the infinite loop and keep it under your control.
static Queue<Action> myQ = new Queue<Action>();
static void Main(string[] args)
{
myQ.Enqueue(() =>
{
TestQ("First");
});
myQ.Enqueue(() =>
{
TestQ("Second");
});
Thread thread = new Thread(() =>
{
    while (true)
    {
        Thread.Sleep(5000);
        if (myQ.Count > 0)
        {
            myQ.Dequeue().Invoke();
        }
    }
});
thread.Start();
// Do other stuff, eventually signalling the loop to stop (for example via a
// volatile bool flag checked in the while condition) to end the thread.
Console.ReadLine();
}
private static void TestQ(string s)
{
Console.WriteLine(s);
}
You could put the requests that you receive into a queue if there is a thread currently running. Then, to find out when threads return, they could fire an event. When this event fires, if there is something in the queue, start a new thread to process this new request.
The only thing with this is that you have to be careful about race conditions, since you are essentially communicating between multiple threads.
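A rough sketch of that idea (the method names and the busy flag are my own placeholders, not from the original answer): when a request arrives it is either started immediately or queued, and each finishing worker kicks off the next queued request.
static readonly object sync = new object();
static readonly Queue<Action> pending = new Queue<Action>();
static bool busy;

static void OnRequestReceived(Action work) // called when the server sends a request
{
    lock (sync)
    {
        if (busy) { pending.Enqueue(work); return; }
        busy = true;
    }
    StartWorker(work);
}

static void StartWorker(Action work)
{
    new Thread(() =>
    {
        work();
        Action next = null;
        lock (sync)
        {
            if (pending.Count > 0) next = pending.Dequeue();
            else busy = false;
        }
        // the "event": the finished thread starts the next queued request
        if (next != null) StartWorker(next);
    }).Start();
}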
My C# application stops responding for a long time; when I break in the debugger, it stops on this function:
foreach (var item in list)
{
xmldiff.Compare(item, secondary, output);
...
}
I guess the running time of this function is long, or it hangs. Either way, I want to wait a certain time (e.g. 5 seconds) for the execution of this function, and if it exceeds this time, skip it and go to the next item in the loop. How can I do it? I found some similar questions, but they are mostly for processes or asynchronous methods.
You can do it the brutal way: spin up a thread to do the work, join it with a timeout, then abort it if the join timed out.
Example:
var worker = new Thread( () => { xmlDiff.Compare(item, secondary, output); } );
worker.Start();
if (!worker.Join( TimeSpan.FromSeconds( 1 ) ))
worker.Abort();
But be warned - aborting threads is not considered nice and can make your app unstable. If at all possible try to modify Compare to accept a CancellationToken to cancel the comparison.
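A minimal sketch of that cooperative approach, assuming Compare can be changed to take a token (the four-argument overload is hypothetical, not an existing API):
// Hypothetical: Compare periodically calls token.ThrowIfCancellationRequested().
using (var cts = new CancellationTokenSource(TimeSpan.FromSeconds(1)))
{
    try
    {
        xmlDiff.Compare(item, secondary, output, cts.Token);
    }
    catch (OperationCanceledException)
    {
        // took too long; skip this item and continue with the next
    }
}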
I would avoid directly using threads and use Microsoft's Reactive Extensions (NuGet "Rx-Main") to abstract away the management of the threads.
I don't know the exact signature of xmldiff.Compare(item, secondary, output) but if I assume it produces an integer then I could do this with Rx:
var query =
from item in list.ToObservable()
from result in
Observable
.Start(() => xmldiff.Compare(item, secondary, output))
.Timeout(TimeSpan.FromSeconds(5.0), Observable.Return(-1))
select new { item, result };
var subscription =
query
.Subscribe(x =>
{
/* do something with `x.item` and/or `x.result` */
});
This automatically iterates through each item and starts a background computation of xmldiff.Compare, but only allows each computation to take at most 5.0 seconds before returning a default value of -1.
The subscription variable is an IDisposable, so if you want to abort the entire query before it completes just call .Dispose().
I skip it and go to the next item in the loop
By "skip it", do you mean "leave it there" or "cancel it"? The two scenarios are quite different. But for both two I suggest you use Task.
//generate 10 example tasks
var tasks = Enumerable
.Range(0, 10)
.Select(n => Task.Run(() => DoSomething(n)))
.ToList();
var maxExecutionTime = TimeSpan.FromSeconds(5);
foreach (var task in tasks)
{
if (task.Wait(maxExecutionTime))
{
//the task is finished in time
}
else
{
// the task is over time
// just leave it there
// the loop continues
// if you want to cancel it, see
// http://stackoverflow.com/questions/4783865/how-do-i-abort-cancel-tpl-tasks
}
}
One thing to improve: do you really need to run your tasks one by one? If they are independent, you can run them in parallel.
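For example, a minimal sketch that starts them all at once and applies a single shared time budget instead of waiting on each task in turn:
// Sketch: run all tasks in parallel under one shared time budget.
var tasks = Enumerable.Range(0, 10)
                      .Select(n => Task.Run(() => DoSomething(n)))
                      .ToArray();
Task.WaitAll(tasks, TimeSpan.FromSeconds(5)); // returns false if not all finished
foreach (var task in tasks)
{
    if (task.IsCompleted)
    {
        // finished within the budget
    }
    else
    {
        // still running; leave it, or cancel it via a CancellationToken
    }
}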
I have multiple enumerators that enumerate over flat files. I originally had each enumerator in a Parallel.Invoke, and each Action was adding to a BlockingCollection<Entity>; that collection was returning a ConsumingEnumerable().
public interface IFlatFileQuery
{
IEnumerable<Entity> Run();
}
public class FlatFile1 : IFlatFileQuery
{
public IEnumerable<Entity> Run()
{
// loop over a flat file and yield each result
yield return new Entity();
}
}
public class Main
{
public IEnumerable<Entity> DoLongTask(ICollection<IFlatFileQuery> _flatFileQueries)
{
// do some other stuff that needs to be returned first:
yield return new Entity();
// then enumerate and return the flat file data
foreach (var entity in GetData(_flatFileQueries))
{
yield return entity;
}
}
private IEnumerable<Entity> GetData(ICollection<IFlatFileQuery> _flatFileQueries)
{
var buffer = new BlockingCollection<Entity>(100);
var actions = _flatFileQueries.Select(fundFileQuery => (Action)(() =>
{
foreach (var entity in fundFileQuery.Run())
{
buffer.TryAdd(entity, Timeout.Infinite);
}
})).ToArray();
Task.Factory.StartNew(() =>
{
Parallel.Invoke(actions);
buffer.CompleteAdding();
});
return buffer.GetConsumingEnumerable();
}
}
However after a bit of testing it turns out that the code change below is about 20-25% faster.
private IEnumerable<Entity> GetData(ICollection<IFlatFileQuery> _flatFileQueries)
{
return _flatFileQueries.AsParallel().SelectMany(ffq => ffq.Run());
}
The trouble with the code change is that it waits till all flat file queries are enumerated before it returns the whole lot that can then be enumerated and yielded.
Would it be possible to yield in the above bit of code somehow to make it even faster?
I should add that at most the combined results of all the flat file queries might only be 1000 or so Entities.
Edit:
Changing it to the below doesn't make a difference to the run time. (R# even suggests going back to the way it was.)
private IEnumerable<Entity> GetData(ICollection<IFlatFileQuery> _flatFileQueries)
{
foreach (var entity in _flatFileQueries.AsParallel().SelectMany(ffq => ffq.Run()))
{
yield return entity;
}
}
The trouble with the code change is that it waits till all flat file queries are enumerated before it returns the whole lot that can then be enumerated and yielded.
Let's prove that it's false by a simple example. First, let's create a TestQuery class that will yield a single entity after a given time. Second, let's execute several test queries in parallel and measure how long it took to yield their result.
public class TestQuery : IFlatFileQuery {
private readonly int _sleepTime;
public IEnumerable<Entity> Run() {
Thread.Sleep(_sleepTime);
return new[] { new Entity() };
}
public TestQuery(int sleepTime) {
_sleepTime = sleepTime;
}
}
internal static class Program {
private static void Main() {
Stopwatch stopwatch = Stopwatch.StartNew();
var queries = new IFlatFileQuery[] {
new TestQuery(2000),
new TestQuery(3000),
new TestQuery(1000)
};
foreach (var entity in queries.AsParallel().SelectMany(ffq => ffq.Run()))
Console.WriteLine("Yielded after {0:N0} seconds", stopwatch.Elapsed.TotalSeconds);
Console.ReadKey();
}
}
This code prints:
Yielded after 1 seconds
Yielded after 2 seconds
Yielded after 3 seconds
You can see from this output that AsParallel() will yield each result as soon as it's available, so everything works fine. Note that you might get different timings depending on the degree of parallelism (such as "2s, 5s, 6s" with a degree of parallelism of 1, effectively making the whole operation not parallel at all). This output comes from a 4-core machine.
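If you want to pin that behavior down, you can set the degree of parallelism explicitly on the same query:
foreach (var entity in queries.AsParallel().WithDegreeOfParallelism(1).SelectMany(ffq => ffq.Run()))
    Console.WriteLine("Yielded after {0:N0} seconds", stopwatch.Elapsed.TotalSeconds);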
Your long processing will probably scale with the number of cores, if there is no common bottleneck between the threads (such as a shared locked resource). You might want to profile your algorithm to see if there are slow parts that can be improved using tools such as dotTrace.
I don't think there is a red flag in your code anywhere. There are no outrageous inefficiencies. I think it comes down to multiple smaller differences.
PLINQ is very good at processing streams of data. Internally, it works more efficiently than adding items to a synchronized list one-by-one. I suspect that your calls to TryAdd are a bottleneck because each call requires at least two Interlocked operations internally. Those can put enormous load on the inter-processor memory bus because all threads will compete for the same cache line.
PLINQ is cheaper because internally it does some buffering. I'm sure it doesn't output items one by one; it probably batches them and amortizes the synchronization cost that way over multiple items.
A second issue would be the bounded capacity of the BlockingCollection. 100 is not a lot. This might lead to a lot of waiting. Waiting is costly because it requires a call to the kernel and a context switch.
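If you wanted to keep the BlockingCollection approach, a sketch of one way to amortize that synchronization cost is to add items in batches rather than one at a time (the batch size and capacity below are guesses to tune, not tested values):
// Sketch: batch entities locally so each TryAdd synchronizes once per batch
// instead of once per entity.
var buffer = new BlockingCollection<List<Entity>>(1000);
var actions = _flatFileQueries.Select(fundFileQuery => (Action)(() =>
{
    var batch = new List<Entity>(50);
    foreach (var entity in fundFileQuery.Run())
    {
        batch.Add(entity);
        if (batch.Count == 50)
        {
            buffer.TryAdd(batch, Timeout.Infinite);
            batch = new List<Entity>(50);
        }
    }
    if (batch.Count > 0)
        buffer.TryAdd(batch, Timeout.Infinite); // flush the remainder
})).ToArray();
// consumer side: foreach (var batch in buffer.GetConsumingEnumerable())
//                    foreach (var entity in batch) { ... }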
I use an alternative that has worked well for me in any scenario: inside a Task, a Parallel.ForEach reads, validates, and transforms the items, enqueueing each one into a ConcurrentQueue. A continuation on that Task sets a flag when it finishes. Meanwhile, the calling thread dequeues and yields items in a while loop until the flag is set and the queue is empty. This has been fast, with excellent results for me:
// qlines is a ConcurrentQueue<FileLineData> and isCompleted a bool flag,
// both declared in the enclosing iterator method.
Task.Factory.StartNew(() =>
{
    Parallel.ForEach<string>(TextHelper.ReadLines(FileName), ProcessHelper.DefaultParallelOptions,
        (string currentLine) =>
        {
            // Read line, validate and enqueue an instance of FileLineData (custom class)
        });
}).ContinueWith(ic => isCompleted = true);

while (!isCompleted || qlines.Count > 0)
{
    if (qlines.TryDequeue(out returnLine))
    {
        yield return returnLine;
    }
}
By default the ParallelQuery class, when working on IEnumerable<T> sources, employs a partitioning strategy known as "chunk partitioning". With this strategy each worker thread grabs a progressively larger number of items each time. This means that it has an input buffer. The results are then accumulated into an output buffer, with a size chosen by the system, before they are available to the consumer of the query. You can disable both buffers by using the configuration options EnumerablePartitionerOptions.NoBuffering and ParallelMergeOptions.NotBuffered.
private IEnumerable<Entity> GetData(ICollection<IFlatFileQuery> flatFileQueries)
{
return Partitioner
.Create(flatFileQueries, EnumerablePartitionerOptions.NoBuffering)
.AsParallel()
.AsOrdered()
.WithMergeOptions(ParallelMergeOptions.NotBuffered)
.SelectMany(ffq => ffq.Run());
}
This way each worker thread will grab only one item at a time, and will propagate the result as soon as it is computed.
NoBuffering: Create a partitioner that takes items from the source enumerable one at a time and does not use intermediate storage that can be accessed more efficiently by multiple threads. This option provides support for low latency (items will be processed as soon as they are available from the source) and provides partial support for dependencies between items (a thread cannot deadlock waiting for an item that the thread itself is responsible for processing).
NotBuffered: Use a merge without output buffers. As soon as result elements have been computed, make that element available to the consumer of the query.
I have to maintain information logs; these logs can be written from many threads concurrently, but when I need them I use only one thread to dequeue them, taking a break of around 5 seconds between dequeues.
Following is the code I've written to dequeue it:
if (timeNotReached)
{
InformationLogQueue.Enqueue(informationLog);
}
else
{
int currentLogCount = InformationLogQueue.Count;
var informationLogs = new List<InformationLog>();
for (int i = 0; i < currentLogCount; i++)
{
InformationLog informationLog1;
InformationLogQueue.TryDequeue(out informationLog1);
informationLogs.Add(informationLog1);
}
WriteToDatabase(informationLogs);
}
After dequeueing, I pass the list to LINQ's insert method, which requires a List<InformationLog> to insert into the database.
Is this the correct way or is there any other efficient way to do this?
You could use the ConcurrentQueue<T> directly in a Linq statement via an extension method like this:
static IEnumerable<T> DequeueExisting<T>(this ConcurrentQueue<T> queue)
{
T item;
while (queue.TryDequeue(out item))
yield return item;
}
This would save you from having to continuously allocate new List<T> and ConcurrentQueue<T> objects.
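Usage could then look something like this; if WriteToDatabase can only accept a List<InformationLog>, materialize the sequence:
// Drain whatever is currently queued and write it in one call.
WriteToDatabase(InformationLogQueue.DequeueExisting().ToList());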
You should probably be using the ConcurrentQueue<T> via a BlockingCollection<T> as described here.
Something like this:
private BlockingCollection<InformationLog> informationLogs =
    new BlockingCollection<InformationLog>(new ConcurrentQueue<InformationLog>());
Then on your consumer thread you can do
foreach(var log in this.informationLogs.GetConsumingEnumerable())
{
// process consumer logs 1 by 1.
}
Okay, here is an answer for consuming multiple items. On the consuming thread, do this:
InformationLog nextLog;
while (this.informationLogs.TryTake(out nextLog, -1))
{
    var workToDo = new List<InformationLog>();
    workToDo.Add(nextLog);
    while (this.informationLogs.TryTake(out nextLog, 50))
    {
        workToDo.Add(nextLog);
    }
    // process workToDo, then go back to the queue.
}
The first while loop takes items from the queue with an infinite wait time. I'm assuming that once adding is complete on the queue, i.e. CompleteAdding has been called, this call will return false without delay once the queue is empty.
The inner while loop takes items with a 50 millisecond timeout, which could be adjusted for your needs. Once the queue is empty it will return false, and the batch of work can then be processed.
I am automating some tasks on my website, but I'm currently stuck.
public void Execute(JobExecutionContext context)
{
var linqFindAccount = from Account in MainAccounts
where Account.Done == false
select Account;
foreach (var acc in linqFindAccount)
{
acc.Done = true;
// stuff
}
}
The issue is that when I start multiple threads the first threads get assigned to the same first account because they set the Done value to true at the same time. How am I supposed to avoid this?
EDIT:
private object locker = new object();
public void Execute(JobExecutionContext context)
{
lock (locker)
{
var linqFindAccount = from Account in MainAccounts
where Account.Done == false
select Account;
foreach (var acc in linqFindAccount)
{
Console.WriteLine(context.JobDetail.Name + " assigned to " + acc.Mail);
acc.Done = true;
// stuff
}
}
}
Instance [ 2 ] assigned to firstmail@hotmail.com
Instance [ 1 ] assigned to firstmail@hotmail.com
First two threads got assigned to the first account, even though the list contains 30 accounts.
Thanks.
Use
private static readonly object locker = new object();
instead of
private object locker = new object();
Your problem is that the deferred execution happens when you start the foreach loop, so the result is cached and not reevaluated on every iteration. Every thread will therefore work with its own list of the items, and when an Account is set to done, the other lists still contain the object.
A queue is more suitable in this case. Just put the items in a shared queue, let the loops take items off the queue, and let them finish when the queue is empty.
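A sketch of that queue-based approach (assuming MainAccounts is available statically and is loaded once up front, which the original post does not state):
// Every worker pulls from the same thread-safe queue, so no two
// workers can ever grab the same account.
private static readonly ConcurrentQueue<Account> pendingAccounts =
    new ConcurrentQueue<Account>(MainAccounts.Where(a => !a.Done));

public void Execute(JobExecutionContext context)
{
    Account acc;
    while (pendingAccounts.TryDequeue(out acc))
    {
        Console.WriteLine(context.JobDetail.Name + " assigned to " + acc.Mail);
        acc.Done = true;
        // stuff
    }
}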
A few problems with your code:
1) Assuming you use stateless Quartz jobs, your lock does not do any good. Quartz creates a new job instance every time it fires a trigger; that is why you see the same account processed twice. It would only work if you used a stateful job (IStatefulJob) or made the lock static, but read on.
2) Even if 1) is fixed, it would defeat the purpose of having multiple threads because they will all wait for each other on the same lock. You might as well have one thread doing this.
I don't know enough about your requirements, especially what's going on in // stuff. It may be that you don't need this code to run on multiple threads and sequential execution will do just fine. I assume this is not the case and you want multiple threads. The easiest way is to have only one Quartz job. In this job, load the accounts in chunks, say 100 accounts per chunk. This will give you 5 chunks if you have 500 accounts. Offload every chunk's processing to the thread pool, which will take care of using an optimal number of threads. This would be a poor man's producer-consumer queue.
public void Execute(JobExecutionContext context) {
var linqFindAccount = from Account in MainAccounts
where Account.Done == false
select Account;
IList<IList<Account>> chunks = linqFindAccount.SplitIntoChunks(/* TODO */);
foreach (IList<Account> chunk in chunks) {
ThreadPool.QueueUserWorkItem(DoStuff, chunk);
}
}
private static void DoStuff(Object parameter) {
IList<Account> chunk = (IList<Account>) parameter;
foreach (Account account in chunk) {
// stuff
}
}
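SplitIntoChunks is not a built-in method; a minimal extension sketch (the chunk size is passed in by the caller) could look like this:
public static class ChunkExtensions
{
    public static IList<IList<T>> SplitIntoChunks<T>(this IEnumerable<T> source, int chunkSize)
    {
        var chunks = new List<IList<T>>();
        var current = new List<T>(chunkSize);
        foreach (T item in source)
        {
            current.Add(item);
            if (current.Count == chunkSize)
            {
                chunks.Add(current);
                current = new List<T>(chunkSize);
            }
        }
        if (current.Count > 0)
            chunks.Add(current); // last, possibly smaller chunk
        return chunks;
    }
}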
As usual, with multiple threads you have to be very careful with accessing mutable shared state. You will have to make sure that everything you do in 'DoStuff' method will not cause undesired side effects. You may find this and this useful.
foreach (var acc in linqFindAccount)
{
string mailComponent = acc.Mail;
Console.WriteLine(context.JobDetail.Name + " assigned to " + mailComponent);
acc.Done = true;
// stuff
}
Try the above, which reads acc.Mail into a local variable before using it.