My situation is this:
Multiple threads must write concurrently to the same collection (Add and AddRange). Order of items is not an issue.
When all threads have completed (Join) and I'm back on my main thread, I need to read all the collected data quickly in a foreach style; no actual locking is needed at that point since all the writer threads are done.
In the "old days" I would probably have used a reader/writer lock over a List for this, but with the new concurrent collections I wonder whether there is a better alternative. I just can't figure out which one, as most concurrent collections seem to assume that the reader is also on a concurrent thread.
I don't believe you want to use any of the collections in System.Collections.Concurrent. These generally have extra overhead to allow for concurrent reading.
Unless you have a lot of contention, you are probably better off taking a lock on a simple List&lt;T&gt; and adding to it. You will have a small amount of overhead as the List resizes, but it will be fairly infrequent.
However, what I would probably do in this case is simply add to a List&lt;T&gt; per thread rather than a shared one, and either merge them at the end of processing or simply iterate over all elements in each of the collections.
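A minimal sketch of the per-thread approach (thread and item counts are illustrative): each worker appends to its own List&lt;T&gt; with no synchronization at all, and the main thread merges once after Join.

```csharp
// Sketch: each worker fills its own List<T>; the lists are only touched
// by the main thread after Join, so no locking is needed anywhere.
using System;
using System.Collections.Generic;
using System.Threading;

const int threadCount = 4;
var perThread = new List<int>[threadCount];
var threads = new Thread[threadCount];

for (int i = 0; i < threadCount; i++)
{
    int id = i;                      // capture the loop variable for the closure
    perThread[id] = new List<int>(); // one list per thread: no sharing, no locks
    threads[id] = new Thread(() =>
    {
        for (int n = 0; n < 1000; n++)
            perThread[id].Add(id * 1000 + n);
    });
    threads[id].Start();
}

foreach (var t in threads) t.Join(); // all writers are done here

// Back on the main thread: merge once (or just iterate list by list).
var merged = new List<int>();
foreach (var list in perThread)
    merged.AddRange(list);

Console.WriteLine(merged.Count); // 4000
```

Since each list is only ever written by one thread, the writers run at full speed with zero contention; the merge is a single-threaded AddRange pass at the end.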
You could possibly use a ConcurrentBag and then call .ToArray() or GetEnumerator() on it when you're ready to read (bypassing the per-read penalty), but you may find that insertions are a bit slower than your manual write lock on a simple List. It really depends on the amount of contention. ConcurrentBag is pretty good about partitioning but, as you noted, is geared toward concurrent reads and writes.
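For comparison, a sketch of the ConcurrentBag route (worker and item counts are illustrative): write concurrently, then take a single ToArray() snapshot on the main thread so the read loop is a plain array scan with no per-item synchronization.

```csharp
// Sketch: concurrent writers into a ConcurrentBag, then one ToArray()
// snapshot after all writers finish, so reading is just array iteration.
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

var bag = new ConcurrentBag<int>();

Parallel.For(0, 4, worker =>
{
    for (int n = 0; n < 1000; n++)
        bag.Add(worker * 1000 + n); // thread-safe; internally partitioned per thread
});

int[] snapshot = bag.ToArray(); // single snapshot once all writers are done

long sum = 0;
foreach (int value in snapshot) // plain array iteration: no locking
    sum += value;

Console.WriteLine(snapshot.Length); // 4000
```

Whether this beats a locked List&lt;T&gt; depends on contention, so benchmark both on your workload.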
As always, benchmark your particular situation! Multithreading performance is highly dependent on many things in actual usage, and things like type of data, number of insertions, and such will change the results dramatically - a handful of reality is worth a gallon of theory.
Order of items is not an issue. When all threads have completed (Join) and I'm back on my main thread, then I need to read all the collected data
You have not stated a requirement for a thread-safe collection at all. There's no point in sharing a single collection since you never read at the same time you write. Nor does it matter that all writing happens to the same collection since order doesn't matter. Nor should it matter since order would be random anyway.
So just give each thread its own collection to fill, no locking required. And iterate them one by one afterwards, no locking required.
Try the System.Collections.Concurrent.ConcurrentBag.
From the collection's description:
Represents a thread-safe, unordered collection of objects.
I believe this meets your criteria of handling multiple threads and order of items not being important, and later when you are back in the main thread, you can quickly foreach iterate over the collection and act on each item.
Related
I am working on a multi-thread application, where I load data from external feeds and store them in internal collections.
These collections are updated once per X minutes, by loading all data from the external feeds again.
There is no other adding/removing from these collection, just reading.
Normally I would use locking during the updating, same as everywhere I am accessing the collections.
Question:
Do the concurrent collections make my life easier in this case?
Basically I see two approaches:
Load the data from the external feed, then remove the items which are no longer present, add the missing ones, and update the changed ones - I guess this is a good fit for a concurrent collection (no locking required, right?), but it requires too much code on my side.
Simply overwrite the old collection object with a new one (e.g. _data = new ConcurrentBag(newData)). Here I am quite sure that using the concurrent collections has no advantage at all, am I right? A locking mechanism is still required.
Is there an out-of-the-box solution I can use with the concurrent collections? I would not like to reinvent the wheel again.
Yes, for concurrent collections the locking mechanism is stored inside the collections, so if you new up a collection in place of the old one, that just defeats the purpose. They are mostly used in producer-consumer situations, usually in combination with a BlockingCollection<T>. If your producer does more than just add data, it makes things a bit more complicated.
The benefit to not using concurrent collections is that your locking mechanism no longer depends on the collection - you can have a separate synchronization object that you lock on, and inside the critical section you're free to assign another instance like you wanted.
To answer your question - I don't know of any out-of-the-box mechanism to do what you want, but I wouldn't call using a simple lock statement "reinventing the wheel". That's a bit like saying that using for loops is reinventing the wheel. Just have a separate synchronization object alongside your non-concurrent collection.
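A sketch of that pattern, with illustrative names: a dedicated synchronization object guards both the readers and the periodic swap-in of a freshly built, non-concurrent Dictionary.

```csharp
// Sketch of the "separate synchronization object" pattern: readers and the
// periodic refresh lock on the same object, and the refresh simply assigns
// a brand-new non-concurrent Dictionary inside the critical section.
using System;
using System.Collections.Generic;

object sync = new object();
var data = new Dictionary<string, decimal>();

// Called every X minutes with freshly loaded feed data.
void Refresh(Dictionary<string, decimal> freshData)
{
    lock (sync)
    {
        data = freshData; // swap the whole collection atomically
    }
}

// Readers take the same lock, so they never observe a half-built collection.
decimal? TryGet(string key)
{
    lock (sync)
    {
        return data.TryGetValue(key, out var value) ? value : (decimal?)null;
    }
}

Refresh(new Dictionary<string, decimal> { ["EURUSD"] = 1.08m });
Console.WriteLine(TryGet("EURUSD")); // prints the cached rate
```

Because the lock lives outside the collection, replacing the instance is safe - exactly the thing that defeats the purpose of a concurrent collection's internal locking.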
I have a persistent B+ tree; multiple threads are reading different chunks of the tree and performing some operations on the data they read. The interesting part: each thread produces a set of results, and as the end user I want to see all the results in one place. What I do: one ConcurrentDictionary that all threads write to.
Everything works smoothly this way. But the application is time-critical; one extra second means total dissatisfaction. Because of its thread-safety overhead, ConcurrentDictionary is intrinsically slow compared to Dictionary.
I can use Dictionary, then each thread will write results to distinct dictionaries. But then I'll have the problem of merging different dictionaries.
My Questions:
Are concurrent collections a good decision for my scenario?
If not (1), then how would I optimally merge the different dictionaries? Given that (a) copying items one by one and (b) LINQ are the known solutions, and neither is as fast as expected :)
If not (2) ;-) what would you suggest instead?
A quick info:
Thread count = processor count. The application can run on a standard laptop (e.g., 4 threads) or a high-end server (e.g., up to 32 threads).
Item Count. The tree usually holds more than 1.0E+12 items.
From your timings it seems that the locking/building of the result dictionary is taking 3700ms per thread with the actual processing logic taking just 300ms.
I suggest that as an experiment you let each thread create its own local dictionary of results. Then you can see how much time is spent building the dictionary compared to how much is the effect of locking across threads.
If building the local dictionary adds more than 300ms then it will not be possible to meet your time limit, because even without any locking or any attempt to merge the results it has already taken too long.
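A sketch of the suggested experiment (key/value types and item counts are illustrative): time the lock-free per-thread build separately from the single-threaded merge, to see which phase actually dominates.

```csharp
// Sketch: each worker builds a private Dictionary with no locking, then a
// single-threaded merge runs after Task.WaitAll. Timing the two phases
// separately shows whether building or cross-thread locking is the cost.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

int workers = Environment.ProcessorCount;
var locals = new Dictionary<long, long>[workers];

var buildTimer = Stopwatch.StartNew();
var tasks = new Task[workers];
for (int i = 0; i < workers; i++)
{
    int id = i; // capture loop variable
    tasks[id] = Task.Run(() =>
    {
        var local = new Dictionary<long, long>(); // no locking at all
        for (long n = 0; n < 100_000; n++)
            local[id * 100_000L + n] = n;         // keys are disjoint per worker
        locals[id] = local;
    });
}
Task.WaitAll(tasks);
buildTimer.Stop();

var mergeTimer = Stopwatch.StartNew();
var merged = new Dictionary<long, long>(workers * 100_000); // pre-size the target
foreach (var local in locals)
    foreach (var pair in local)
        merged[pair.Key] = pair.Value;
mergeTimer.Stop();

Console.WriteLine($"build: {buildTimer.ElapsedMilliseconds} ms, " +
                  $"merge: {mergeTimer.ElapsedMilliseconds} ms, " +
                  $"items: {merged.Count}");
```

If the build phase alone already exceeds the budget, no amount of merge cleverness will help; if the merge dominates, pre-sizing the target dictionary or keeping the per-thread dictionaries separate and iterating them in sequence are the cheap wins.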
Update
It seems that you can either pay the merge price as you go along, with the locking causing the threads to sit idle for a significant percentage of time, or pay the price in a post-processing merge. But the core problem is that the locking means you are not fully utilising the available CPU.
The only real solution to getting maximum performance from your cores is to use a non-blocking dictionary implementation that is also thread-safe. I could not find a .NET implementation, but I did find a research paper detailing an algorithm that indicates it is possible.
Implementing such an algorithm correctly is not trivial but would be fun!
Scalable and Lock-Free Concurrent Dictionaries
Have you considered async persistence?
Is it allowed in your scenario?
You can offload the writes to a queue serviced by a separate thread pool (using a pool avoids the overhead of creating a (sub)thread for each request), and handle the merging logic there without affecting response time.
Is it necessary to lock LINQ statements as follows? If the lock is omitted, will any exceptions be encountered when multiple threads execute it concurrently?
lock (syncKey)
{
    return (from keyValue in dictionary
            where keyValue.Key > versionNumber
            select keyValue.Value).ToList();
}
PS: Writer threads do exist to mutate the dictionary.
Most types are thread-safe to read, but not thread-safe during mutation.
If none of the threads is changing the dictionary, then you don't need to do anything - just read away.
If, however, one of the threads is changing it, then you have problems and need to synchronize. The simplest approach is a lock; however, this prevents concurrent readers even when there is no writer. If there is a good chance you will have more readers than writers, consider using a ReaderWriterLockSlim to synchronize - this will allow any number of readers (with no writer), or one writer (with no readers).
In .NET 4.0 you might also consider a ConcurrentDictionary&lt;,&gt;.
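A sketch of the ReaderWriterLockSlim suggestion applied to the question's query (the lock object and sample data here are illustrative): readers take the shared read lock around the LINQ statement, writers take the exclusive write lock.

```csharp
// Sketch: ReaderWriterLockSlim lets many readers run the query at once,
// while a writer mutating the dictionary gets exclusive access.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

var rwLock = new ReaderWriterLockSlim();
var dictionary = new Dictionary<int, string>();

void Write(int versionNumber, string value)
{
    rwLock.EnterWriteLock(); // exclusive: blocks readers and other writers
    try { dictionary[versionNumber] = value; }
    finally { rwLock.ExitWriteLock(); }
}

List<string> ReadSince(int versionNumber)
{
    rwLock.EnterReadLock(); // shared: concurrent readers are fine
    try
    {
        // ToList() materializes the query before the lock is released.
        // Returning a lazy IEnumerable here would defer the dictionary
        // reads until after ExitReadLock - exactly the bug to avoid.
        return (from keyValue in dictionary
                where keyValue.Key > versionNumber
                select keyValue.Value).ToList();
    }
    finally { rwLock.ExitReadLock(); }
}

Write(1, "a"); Write(2, "b"); Write(3, "c");
Console.WriteLine(string.Join(",", ReadSince(1))); // the values with Key > 1
```

The crucial detail is that .ToList() runs inside the lock; a deferred query that escapes the critical section would enumerate the dictionary unprotected.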
So long as the query has no side-effects (such as any of the expressions calling code that makes changes), then there is no need to lock a LINQ statement.
Basically, if you don't modify the data (and nothing else is modifying the data you are using) then you don't need locks.
If you are using .NET 4.0, there is a ConcurrentDictionary that is thread-safe; here is an example of using a concurrent dictionary (admittedly not in a LINQ statement).
UPDATE
If you are modifying data then you need to use locks. If two or more threads attempt to access a locked section of code, there will be a small performance loss as one or more of the threads waits for the lock to be released. NOTE: If you over-lock then you may end up with worse performance than you would have had if you had just built the code with a sequential algorithm from the start.
If you are only ever reading data then you don't need locks as there is no mutable shared state to protect.
If you do not use locks then you may end up with intermittent bugs where the data is not quite right or exceptions are thrown when collisions occur between readers and writers. In my experience, most of the time you may never get an exception, you just get corrupt data (except you don't necessarily know it is corrupt). Here is another example showing how data can be corrupted if you don't use locks or redesign your algorithm to cope.
You often get the best out of a system if you consider the constraints of developing in a parallel system from the outset. Sometimes you can re-write your code so it uses no shared data. Sometime you can split the data up into chunks and have each thread/task work on its own chunk then have some process at the end stitch it all back together again.
If your dictionary is static and the method where you run the query is not (or in other concurrent-access scenarios), and the dictionary can be modified from another thread, then yes, a lock is required; otherwise it is not.
Yes, you need to lock your shared resources when using LINQ in multi-threaded scenarios (EDIT: of course, if your source collection is being modified as Marc said, if you are only reading it, you don't need to worry about it). If you are using .Net 4 or the parallel extensions for 3.5 you could look at replacing your Dictionary with a ConcurrentDictionary (or use some other custom implementation anyway).
I have found a possible slowdown in my app, so I have two questions:
What is the real difference between simple locking on object and reader/writer locks?
E.g. I have a collection of clients that changes quickly. For iteration, should I use a reader lock, or is a simple lock enough?
In order to decrease load, I have left the iteration (reading only) of one collection without any locks. This collection changes often and quickly, but items are added and removed under writer locks. Is it safe (I don't mind an occasionally skipped item; this method runs in a loop and isn't critical) to leave this reading unprotected by a lock? I just don't want random exceptions.
No, your current scenario is not safe.
In particular, if a collection changes while you're iterating over it, you'll get an InvalidOperationException in the iterating thread. You should obtain a reader lock for the whole duration of your iterator:
Obtain reader lock
Iterate over collection
Release reader lock
Note this is not the same as obtaining a reader lock for each step of the iteration - that won't help.
As for the difference between reader/writer locks and "normal" locks - the idea of a reader/writer lock is that multiple threads can read at the same time, but only one thread can write (and only when no-one is reading). In some cases this can improve performance - but it increases the complexity of the solution too (in terms of getting it right). I'd also advise you to use ReaderWriterLockSlim from .NET 3.5 if you possibly can - it's much more efficient than the original ReaderWriterLock, and there are some inherent problems with ReaderWriterLock IIRC.
Personally I normally use simple locks until I've proved that lock contention is a performance bottleneck. Have you profiled your application yet to find out where the bottleneck is?
Ok first about the reading iteration without locks thing. It's not safe, and you shouldn't do it. Just to illustrate the point in the most simple way - you're iterating through a collection but you never know how many items are in that collection and have no way to find out. Where do you stop? Checking the count every iteration doesn't help because it can change after you check it but before you get the element.
ReaderWriterLock is designed for a situation where you allow multiple threads have concurrent read access, but force synchronous write. From the sounds of your application you don't have multiple concurrent readers, and writes are just as common as reads, so the ReaderWriterLock provides no benefit. You'd be better served by classic locking in this case.
In general whatever tiny performance benefits you squeeze out of not locking access to shared objects with multithreading are dramatically offset by random weirdness and unexplainable behavior. Lock everything that is shared, test the application, and then when everything works you can run a profiler on it, check just how much time the app is waiting on locks and then implement some dangerous trickery if needed. But chances are the impact is going to be small.
“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning; he will be wise to look carefully at the critical code, but only after that code has been identified.” - Donald Knuth
From the MSDN documentation:
"Synchronized supports multiple writing threads, provided that no threads are reading the Hashtable. The synchronized wrapper does not provide thread-safe access in the case of one or more readers and one or more writers."
Source:
http://msdn.microsoft.com/en-us/library/system.collections.hashtable.synchronized.aspx
It sounds like I still have to use locks anyway, so my question is: why would we use Hashtable.Synchronized at all?
For the same reason there are different levels of DB transaction. You may care that writes are guaranteed, but not mind reading stale/possibly bad data.
EDIT I note that their specific example is an Enumerator. They can't handle this case in their wrapper, because if you break from the enumeration early, the wrapper class would have no way to know that it can release its lock.
Think instead of the case of a counter. Multiple threads can increase a value in the table, and you want to display the value of the count. It doesn't matter if you display 1,200,453 and the count is actually 1,200,454 - you just need it close. However, you don't want the data to be corrupt. This is a case where thread-safety is important for writes, but not reads.
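A sketch of that counter scenario (key names and counts are illustrative): writer threads update the synchronized Hashtable safely, while the display code reads without taking any extra lock and tolerates a slightly stale value. Here the read runs after the writers finish so the result is deterministic.

```csharp
// Sketch: Hashtable.Synchronized makes the concurrent writes safe (each
// worker owns its own key, so a plain Set suffices); the display code reads
// without any extra lock, accepting possibly stale but not corrupt data.
using System;
using System.Collections;
using System.Threading.Tasks;

Hashtable counts = Hashtable.Synchronized(new Hashtable());

Parallel.For(0, 4, worker =>
{
    for (int n = 1; n <= 1000; n++)
        counts[worker] = n; // synchronized: safe against concurrent writers
});

// Reader side: no lock. While writers run, the sum may be momentarily stale,
// which is acceptable for a display; it just must never be corrupt.
long total = 0;
for (int worker = 0; worker < 4; worker++)
    total += (int)counts[worker];

Console.WriteLine(total); // 4000 once all writers have finished
```

Note that a read-modify-write (get, increment, set) on a shared key would still race even through the synchronized wrapper; only the individual get and set operations are protected.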
It is for the case where you can guarantee that no reader will access the data structure while you are writing to it (or where you don't care about reading stale data). For example, where the structure is not continually being modified, but is a one-time calculation that you'll have to access later, though large enough to warrant many threads writing to it.
You would need it when you are foreach-ing over a Hashtable on one thread (reads) and there exist other threads that may add/remove items to/from it (writes)...