I have a persistent B+tree; multiple threads are reading different chunks of the tree and performing some operations on the data they read. The interesting part: each thread produces a set of results, and as the end user I want to see all the results in one place. What I do: one ConcurrentDictionary, and all threads write to it.
Everything works smoothly this way. But the application is time-critical; one extra second means total dissatisfaction. ConcurrentDictionary, because of its thread-safety overhead, is intrinsically slow compared to Dictionary.
I could use Dictionary instead, with each thread writing its results to a distinct dictionary. But then I have the problem of merging the different dictionaries.
My Questions:
Are concurrent collections a good choice for my scenario?
If not (1), how would I optimally merge the different dictionaries? Given that (a) copying items one by one and (b) LINQ are known solutions and are not as optimal as expected :)
If not (2) ;-) what would you suggest instead?
Some quick info:
#Threads = processorCount. The application can run on a standard laptop (e.g. 4 threads) or a high-end server (e.g. <32 threads).
Item count: the tree usually holds more than 1.0E+12 items.
From your timings it seems that the locking/building of the result dictionary is taking 3700 ms per thread, with the actual processing logic taking just 300 ms.
I suggest that as an experiment you let each thread create its own local dictionary of results. Then you can see how much time is spent building the dictionary compared to how much is the effect of locking across threads.
If building the local dictionary adds more than 300 ms then it will not be possible to meet your time limit, because even without any locking or any attempt to merge the results it has already taken too long.
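For example, a minimal sketch of that experiment (GetChunk and Process are stand-ins for your tree-reading and per-item logic):

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Linq;
    using System.Threading.Tasks;

    class LocalDictionaryExperiment
    {
        // Stubs so the sketch compiles; replace with your tree logic.
        static IEnumerable<long> GetChunk(int i) =>
            Enumerable.Range(0, 1_000_000).Select(x => (long)i * 1_000_000 + x);
        static long Process(long key) => key * 2;

        static void Main()
        {
            int workers = Environment.ProcessorCount;
            var localResults = new Dictionary<long, long>[workers];

            Parallel.For(0, workers, i =>
            {
                var local = new Dictionary<long, long>();   // private to this worker: no locking
                var sw = Stopwatch.StartNew();
                foreach (long key in GetChunk(i))
                    local[key] = Process(key);
                sw.Stop();
                Console.WriteLine($"worker {i}: {local.Count} entries in {sw.ElapsedMilliseconds} ms");
                localResults[i] = local;
            });
        }
    }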
Update
It seems that you can either pay the merge price as you go along, with the locking causing the threads to sit idle for a significant percentage of time, or pay the price in a post-processing merge. But the core problem is that the locking means you are not fully utilising the available CPU.
The only real solution to getting maximum performance from your cores is to use a non-blocking dictionary implementation that is also thread-safe. I could not find a .NET implementation, but I did find a research paper detailing an algorithm that indicates it is possible.
Implementing such an algorithm correctly is not trivial but would be fun!
Scalable and Lock-Free Concurrent Dictionaries
Have you considered async persistence?
Is it allowed in your scenario?
You can hand results off to a queue serviced by a separate thread pool (reusing a pool avoids the overhead of creating a new thread for each request), and handle the merging logic there without affecting response time.
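A minimal sketch of that idea using a BlockingCollection and a single merger task (the key/value types, capacity, and workload are placeholders):

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    class QueueMergeSketch
    {
        static void Main()
        {
            // Workers post results to a bounded queue; one consumer merges
            // them into a plain Dictionary off the hot path.
            var queue = new BlockingCollection<KeyValuePair<long, long>>(boundedCapacity: 10_000);
            var merged = new Dictionary<long, long>();

            var merger = Task.Run(() =>
            {
                foreach (var kv in queue.GetConsumingEnumerable())
                    merged[kv.Key] = kv.Value;   // only this task touches 'merged'
            });

            Parallel.For(0, Environment.ProcessorCount, i =>
            {
                for (long k = 0; k < 1_000; k++)   // placeholder work
                    queue.Add(new KeyValuePair<long, long>(i * 1_000 + k, k));
            });

            queue.CompleteAdding();   // signal: no more results
            merger.Wait();            // 'merged' is now safe to read
        }
    }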
Related
My situation is this:
Multiple threads must write concurrently to the same collection (Add and AddRange). The order of items is not an issue.
When all threads have completed (join) and I'm back on my main thread, I need to read all the collected data quickly in a foreach style, where no actual locking is needed since all threads are done.
In the "old days" I would probably use a readerwriter lock for this on a List, but with the new concurrent collections I wonder if not there is a better alternative. I just can't figure out which as most concurrent collections seem to assume that the reader is also on a concurrent thread.
I don't believe you want to use any of the collections in System.Collections.Concurrent. These generally have extra overhead to allow for concurrent reading.
Unless you have a lot of contention, you are probably better off taking a lock on a simple List<T> and adding to it. You will have a small amount of overhead as the List resizes, but it will be fairly infrequent.
However, what I would probably do in this case is simply add to a List<T> per thread rather than a shared one, and either merge them at the end of processing or simply iterate over all elements in each of the collections.
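A rough sketch of that per-thread approach (the workload is a placeholder):

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Threading.Tasks;

    class PerThreadLists
    {
        static void Main()
        {
            int workers = Environment.ProcessorCount;
            var perThread = new List<int>[workers];

            Parallel.For(0, workers, i =>
            {
                var local = new List<int>();      // no sharing, no locking
                for (int n = 0; n < 1_000; n++)   // placeholder work
                    local.Add(i * 1_000 + n);
                perThread[i] = local;
            });

            // Either merge once at the end...
            List<int> all = perThread.SelectMany(l => l).ToList();

            // ...or skip the merge and iterate each list in turn.
            foreach (var list in perThread)
                foreach (var item in list)
                {
                    // consume item
                }
        }
    }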
You could possibly use a ConcurrentBag and then call .ToArray() or GetEnumerator() on it when ready to read (bypassing a per-read penalty), but you may find that the speed of insertions is a bit slower than with your manual write lock on a simple List. It really depends on the amount of contention. The ConcurrentBag is pretty good about partitioning, but as you noted, it is geared to concurrent reads and writes.
As always, benchmark your particular situation! Multithreading performance is highly dependent on many things in actual usage, and things like type of data, number of insertions, and such will change the results dramatically - a handful of reality is worth a gallon of theory.
Order of items is not an issue. When all threads have completed (join) and I'm back on my main thread, then I need to read all the collected data
You have not stated a requirement for a thread-safe collection at all. There's no point in sharing a single collection since you never read at the same time you write. Nor does it matter that all writing happens to the same collection since order doesn't matter. Nor should it matter since order would be random anyway.
So just give each thread its own collection to fill, no locking required. And iterate them one by one afterwards, no locking required.
Try the System.Collections.Concurrent.ConcurrentBag.
From the collection's description:
Represents a thread-safe, unordered collection of objects.
I believe this meets your criteria of handling multiple threads with item order not being important, and later, when you are back on the main thread, you can quickly foreach over the collection and act on each item.
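A minimal sketch of that usage:

    using System;
    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    class BagSketch
    {
        static void Main()
        {
            var bag = new ConcurrentBag<int>();

            // Many concurrent writers, no explicit locking.
            Parallel.For(0, 100_000, i => bag.Add(i));

            // Back on the main thread after the join: plain enumeration.
            foreach (int item in bag)
            {
                // act on each item
            }
        }
    }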
I have a program that executes multiple threads. Each thread simply executes an HttpWebRequest and then screen-scrapes the page looking for some text. I am in a race against other users to find this text. I could execute 1,000,000 threads, all looking for the same thing.
My thought is that so many threads would put a lot of work on my processor and would actually cause the requests to execute more slowly. How can I find a balance between the number of threads and the performance of the web requests? Basically, what I want to do is find the optimal number of threads to spawn so that the amount of data they pull down is greatest.
The application uses .NET 4 and is written in C#.
You are right to assume that 1000000 threads will put undue pressure on your CPU. The work that your CPU would have to do to manage and switch between that many threads would probably cause your system to be very slow indeed.
Obviously you are not serious about 1,000,000 threads, but it demonstrates that you cannot simply throw more threads at the problem. You don't really want to write your own load balancer - that will not be easy and will not perform as well as the classes that come with the base class library. Have a look at using ThreadPool threads - the CLR will manage them for you. You can also look at the Task Parallel Library that is new in .NET 4.0 (since you mention that is what you are using).
Also check out this great article about multi-threading:
http://www.albahari.com/threading/
C# has a ThreadPool. Submit your web-scraping tasks to the pool. You can tweak the number of threads in the pool to tune your app - you will probably need to increase it well above the default for best performance with such a requirement as yours.
Huge numbers of threads are wasteful, as posted by @M Babcock.
I'm not sure if the number of threads in a C# ThreadPool can be changed at run time (I see no reason why not, but M$...). If it is tweakable during the run, tuning will be even easier!
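It can: ThreadPool.SetMinThreads and ThreadPool.SetMaxThreads are both callable at run time. A sketch of the pool-based approach (FetchAndScrape is a placeholder for the request-and-scrape logic; the thread counts are guesses to tune):

    using System;
    using System.Threading;

    class ScraperSketch
    {
        static void FetchAndScrape(int id) { /* HttpWebRequest + scraping here */ }

        static void Main()
        {
            // Raising the minimum avoids the pool's slow thread-injection
            // ramp-up for bursty, IO-bound work like web requests.
            ThreadPool.SetMinThreads(workerThreads: 50, completionPortThreads: 50);

            var done = new CountdownEvent(1000);
            for (int i = 0; i < 1000; i++)
            {
                int id = i;
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    FetchAndScrape(id);
                    done.Signal();
                });
            }
            done.Wait();   // block until all queued requests have finished
        }
    }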
You need to use Parallel.ForEach to manage your threads properly...
You are asking a performance question without providing any estimates of your actual requirements... so let me try doing it for you.
How much data can you pull in? Assuming an awesome network and a regular network card: 100 Mb/s at max, probably less than 10 Mb/s. This gives fewer than about 10,000 requests per second (assuming ~10K request/response pairs).
Can one thread handle that much data? Searching through 100 Mb a second should not be a problem even for a single thread. Super easy to prototype/measure.
How many threads do I need to read the data? Likely 1 - starting an asynchronous request is fast, and reading the response or posting it to a queue for processing is fast at 10,000 items a second.
So my estimate: 1 thread for simple code, or (1 + one thread per core) if you have more cores and are willing to run the processing in parallel.
I'm working on my 10th-grade science fair project right now and I've kind of hit a wall. My project tests the effect of parallelism on the efficiency of brute-forcing MD5 password hashes. I'll be calculating the number of password combinations per second tested to see how efficient it is, using 1, 4, 16, 32, 64, 128, 512, and 1024 threads. I'm not sure if I'll do a dictionary brute force or a pure brute force. I figure that the dictionary would be easier to parallelize: just split the list up into equal parts for each thread. I haven't written much code yet; I'm just trying to plan it out before I start coding.
My questions are:
Is calculating the password combinations tested/second the best way to determine the performance based on # of threads?
Dictionary or pure brute force? If pure brute force, how would you split up the task into a variable number of threads?
Any other suggestions?
I'm not trying to dampen your enthusiasm, but this is already quite a well-understood problem. I'll try to explain what to expect below. But maybe it would be better to do your project in another area. How about "Maximising MD5 hashing throughput"? Then you wouldn't be restricted to just looking at threading.
I think that when you write up your project, you'll need to offer some kind of analysis as to when parallel processing is appropriate and when it isn't.
Each time your CPU switches to another thread, it has to save the current thread's context and load the new thread's context. This overhead does not occur in a single-threaded process (apart from managed services like garbage collection). So, all else being equal, adding threads won't improve performance, because the CPU must do the original workload plus all of the context switching.
But if you have multiple CPUs (cores) at your disposal, creating one thread per CPU will mean that you can parallelize your calculations without incurring context switching costs. If you have more threads than CPUs then context switching will become an issue.
There are two classes of computation: IO-bound and compute-bound. An IO-bound computation can spend much of its time waiting for a response from hardware such as a network card or a hard disk. Because of this waiting, you can increase the number of threads to the point where the CPU is maxed out again, which cancels out the cost of context switching. However, there is a limit to the number of threads, beyond which context switching takes up more time than the threads spend blocking on IO.
Compute-bound computations simply require CPU time for number crunching. This is the kind of computation used by a password cracker. Compute-bound operations do not get blocked, so adding more threads than CPUs will slow down your overall throughput.
The C# ThreadPool already takes care of all of this for you - you just add tasks, and it queues them until a thread is available. New threads are only created when existing ones block. That way, context switches are minimised.
I have a quad-core machine - breaking the problem into 4 threads, each executing on its own core, will be more or less as fast as my machine can brute force passwords.
To seriously parallelize this problem, you're going to need a lot of CPUs. I've read about using the GPU of a graphics card to attack this problem.
There's an analysis of attack vectors that I wrote up here if it's any use to you. Rainbow tables and the processor/memory trade-offs would be another interesting area to do a project in.
To answer your questions:
1) There is no single best way to test thread performance. Different problems scale differently with threads, depending on how independent each operation in the target problem is. So you can try the dictionary approach, but when you analyse the results, keep in mind that they may not apply to all problems. One very popular example, however, is a shared counter that each thread increments a fixed number of times (see the sketch after this list).
2) Brute force covers a very large number of cases; in fact, without constraints there can be an infinite number of possibilities, so you will have to limit your passwords by constraints such as maximum length. One way to distribute brute force is to assign each thread a different starting character for the password. The thread then tests all possible passwords for that starting character. Once a thread finishes its work, it gets another starting character, until all possible starting symbols are used.
3) One suggestion I would give you is to test with a smaller number of threads. You are going up to 1024 threads; that is not a good idea. The number of cores on a machine is generally 4 to 10, so try not to exceed the number of cores by a huge margin, because a processor cannot run more threads simultaneously than it has cores - it's one thread per core at any given time. Instead, try to measure performance for different schemes of assigning the problem to the threads.
Let me know if this helps!
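A minimal sketch of the shared-counter test from (1), using Interlocked increments (Parallel.For only approximates an exact thread count, but it is close enough to show the scaling trend):

    using System;
    using System.Diagnostics;
    using System.Threading;
    using System.Threading.Tasks;

    class CounterScaling
    {
        static void Main()
        {
            const int IncrementsPerThread = 1_000_000;
            long counter = 0;

            foreach (int threads in new[] { 1, 2, 4, 8, 16 })
            {
                counter = 0;
                var sw = Stopwatch.StartNew();
                var options = new ParallelOptions { MaxDegreeOfParallelism = threads };
                Parallel.For(0, threads, options, _ =>
                {
                    for (int i = 0; i < IncrementsPerThread; i++)
                        Interlocked.Increment(ref counter);   // contended atomic add
                });
                sw.Stop();
                Console.WriteLine($"{threads} threads: {sw.ElapsedMilliseconds} ms");
            }
        }
    }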
One solution that will work for both a dictionary and a brute force of all possible passwords is to use an approach based around dividing the job into work units. Have a shared object responsible for dividing the problem space into units of work - ideally something like 100 ms to 5 seconds' worth of work each - and give a reference to this object to each thread you start. Each thread then operates in a loop like this:
    for work_block in work_block_generator.get():
        for item in work_block:
            # Do work
The advantage of this over just parcelling up the whole workspace into one chunk per thread up-front is that if one thread works faster than others, it won't run out of work and just sit idle - it'll pick up more chunks.
Ideally your work item generator would have an interface that, when called, returns an iterator, which itself returns individual passwords to test. The dictionary-based one, then, selects a range from the dictionary, while the brute force one selects a prefix to test for each batch. You'll need to use synchronization primitives to stop races between different threads trying to grab work units, of course.
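A sketch of such a generator in C# (the block contents come from whatever dictionary-range or prefix logic you choose; GetNext is the only synchronized point):

    using System.Collections.Generic;

    // Threads call GetNext() to grab a batch of candidate passwords;
    // a lock prevents two threads from taking the same block.
    class WorkBlockGenerator
    {
        private readonly IEnumerator<IEnumerable<string>> _blocks;
        private readonly object _gate = new object();

        public WorkBlockGenerator(IEnumerable<IEnumerable<string>> blocks)
        {
            _blocks = blocks.GetEnumerator();
        }

        // Returns the next block, or null when the space is exhausted.
        public IEnumerable<string> GetNext()
        {
            lock (_gate)
            {
                return _blocks.MoveNext() ? _blocks.Current : null;
            }
        }
    }

    // Worker loop:
    //   for (var block = gen.GetNext(); block != null; block = gen.GetNext())
    //       foreach (var candidate in block) { /* hash and compare */ }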
In both the dictionary and brute force methods, the problem is Embarrassingly Parallel.
To divide the problem for brute force with n threads, just split, say, all two- (or three-) letter prefixes into n pieces. Each thread then has a set of assigned prefixes, like "aa - fz", and is responsible only for testing everything that follows its prefixes.
Dictionary is usually statistically slightly better in practice for cracking more passwords, but brute force, since it covers everything, cannot miss a password within the target length.
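A sketch of that prefix division (the alphabet and the per-prefix test are placeholders):

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Threading.Tasks;

    class PrefixSplit
    {
        static void Main()
        {
            const string Alphabet = "abcdefghijklmnopqrstuvwxyz";
            int n = Environment.ProcessorCount;

            // All two-letter prefixes, dealt round-robin into n buckets.
            var prefixes = Alphabet.SelectMany(a => Alphabet, (a, b) => $"{a}{b}");
            List<List<string>> buckets = prefixes
                .Select((p, i) => (Prefix: p, Index: i))
                .GroupBy(x => x.Index % n, x => x.Prefix)
                .Select(g => g.ToList())
                .ToList();

            Parallel.ForEach(buckets, bucket =>
            {
                foreach (string prefix in bucket)
                {
                    // Placeholder: enumerate every suffix up to the target
                    // length and hash prefix + suffix.
                }
            });
        }
    }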
I have found a possible slowdown in my app, so I have two questions:
What is the real difference between simple locking on object and reader/writer locks?
E.g. I have a collection of clients that changes quickly. For iteration, should I use a reader lock, or is a simple lock enough?
In order to decrease load, I have left the iteration (reading only) of one collection without any locks. This collection changes often and quickly, but items are added and removed under writer locks. Is it safe to leave this reading unprotected by a lock (I don't mind the occasional skipped item; this method runs in a loop and is not critical)? I just don't want random exceptions.
No, your current scenario is not safe.
In particular, if a collection changes while you're iterating over it, you'll get an InvalidOperationException in the iterating thread. You should obtain a reader lock for the whole duration of your iteration:
Obtain reader lock
Iterate over collection
Release reader lock
Note this is not the same as obtaining a reader lock for each step of the iteration - that won't help.
As for the difference between reader/writer locks and "normal" locks - the idea of a reader/writer lock is that multiple threads can read at the same time, but only one thread can write (and only when no-one is reading). In some cases this can improve performance - but it increases the complexity of the solution too (in terms of getting it right). I'd also advise you to use ReaderWriterLockSlim from .NET 3.5 if you possibly can - it's much more efficient than the original ReaderWriterLock, and there are some inherent problems with ReaderWriterLock IIRC.
Personally I normally use simple locks until I've proved that lock contention is a performance bottleneck. Have you profiled your application yet to find out where the bottleneck is?
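For reference, a sketch of the pattern with ReaderWriterLockSlim - note that the read lock is held for the whole iteration, not per element:

    using System.Collections.Generic;
    using System.Threading;

    class GuardedIteration
    {
        private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();
        private readonly List<string> _clients = new List<string>();

        public void ReadAll()
        {
            _lock.EnterReadLock();      // many readers may hold this at once
            try
            {
                foreach (var client in _clients)
                {
                    // read-only work; the collection cannot change under us
                }
            }
            finally
            {
                _lock.ExitReadLock();
            }
        }

        public void Add(string client)
        {
            _lock.EnterWriteLock();     // exclusive: no readers, no other writers
            try { _clients.Add(client); }
            finally { _lock.ExitWriteLock(); }
        }
    }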
OK, first about the lock-free reading iteration: it's not safe, and you shouldn't do it. To illustrate the point in the simplest way - you're iterating through a collection, but you never know how many items are in it and have no way to find out. Where do you stop? Checking the count on every iteration doesn't help, because it can change after you check it but before you get the element.
ReaderWriterLock is designed for a situation where you allow multiple threads to have concurrent read access, but force synchronous writes. From the sounds of your application, you don't have multiple concurrent readers, and writes are just as common as reads, so the ReaderWriterLock provides no benefit. You'd be better served by classic locking in this case.
In general whatever tiny performance benefits you squeeze out of not locking access to shared objects with multithreading are dramatically offset by random weirdness and unexplainable behavior. Lock everything that is shared, test the application, and then when everything works you can run a profiler on it, check just how much time the app is waiting on locks and then implement some dangerous trickery if needed. But chances are the impact is going to be small.
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning; he will be wise to look carefully at the critical code, but only after that code has been identified." - Donald Knuth
I have a quad core machine and would like to write some code to parse a text file that takes advantage of all four cores. The text file basically contains one record per line.
Multithreading isn't my forte so I'm wondering if anyone could give me some patterns that I might be able to use to parse the file in an optimal manner.
My first thought is to read all the lines into some sort of queue and then spin up threads to pull the lines off the queue and process them, but that means the queue would have to exist in memory, and these are fairly large files, so I'm not keen on that idea.
My next thought is to have some sort of controller that reads in a line and assigns a thread to parse it, but I'm not sure whether the controller would become a bottleneck if the threads process the lines faster than it can read and assign them.
I know there's probably another simpler solution than both of these but at the moment I'm just not seeing it.
I'd go with your original idea. If you are concerned that the queue might get too large, implement a buffer zone for it (i.e. if it gets above 100 lines, stop reading the file, and if it gets below 20, start reading again; you'd need to do some testing to find the optimal barriers). Make it so that any of the threads can potentially be the "reader thread": since a thread has to lock the queue to pull an item out anyway, it can also check whether the "low buffer region" has been hit and start reading again. While it's doing this, the other threads can read out the rest of the queue.
Or if you prefer, have one reader thread assign the lines to three other processor threads (via their own queues) and implement a work-stealing strategy. I've never done this so I don't know how hard it is.
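A sketch of the bounded-queue version using BlockingCollection (the file name and ProcessLine are placeholders; the capacity of 100 plays the role of the "high barrier", since Add blocks when the queue is full):

    using System;
    using System.Collections.Concurrent;
    using System.IO;
    using System.Linq;
    using System.Threading.Tasks;

    class BoundedProducerConsumer
    {
        static void ProcessLine(string line) { /* parse one record */ }

        static void Main()
        {
            var lines = new BlockingCollection<string>(boundedCapacity: 100);

            var reader = Task.Run(() =>
            {
                foreach (var line in File.ReadLines("records.txt"))
                    lines.Add(line);        // blocks while consumers are behind
                lines.CompleteAdding();
            });

            var tasks = Enumerable.Range(0, 3)
                .Select(_ => Task.Run(() =>
                {
                    foreach (var line in lines.GetConsumingEnumerable())
                        ProcessLine(line);
                }))
                .Concat(new[] { reader })
                .ToArray();

            Task.WaitAll(tasks);
        }
    }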
Mark's answer is the simpler, more elegant solution. Why build a complex program with inter-thread communication if it's not necessary? Spawn 4 threads. Each thread calculates size-of-file/4 to determine its start point (and stop point). Each thread can then work entirely independently.
The only reason to add a special thread to handle reading is if you expect some lines to take a very long time to process and you expect that these lines are clustered in a single part of the file. Adding inter-thread communication when you don't need it is a very bad idea. You greatly increase the chance of introducing an unexpected bottleneck and/or synchronization bugs.
This will eliminate bottlenecks of having a single thread do the reading:
    open file
    for each thread n=0,1,2,3:
        seek to file offset (n/4) * filesize
        scan forward to the next complete line
        process all lines in your quarter of the file
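A sketch of that scheme in C# (the file name is a placeholder; byte-by-byte reading keeps the boundary bookkeeping exact, and FileStream's internal buffer keeps it from being painfully slow). Each worker owns exactly the lines that start inside its quarter:

    using System;
    using System.IO;
    using System.Text;
    using System.Threading.Tasks;

    class ChunkedLineReader
    {
        const string FilePath = "records.txt";   // placeholder

        static void Main()
        {
            long fileSize = new FileInfo(FilePath).Length;
            const int Parts = 4;

            Parallel.For(0, Parts, n =>
            {
                long start = fileSize * n / Parts;
                long end = fileSize * (n + 1) / Parts;

                using (var fs = new FileStream(FilePath, FileMode.Open,
                                               FileAccess.Read, FileShare.Read))
                {
                    // Seek one byte back: if we did not land just after a
                    // newline, skip the partial line (the previous part owns it).
                    fs.Seek(Math.Max(0, start - 1), SeekOrigin.Begin);
                    if (start > 0 && fs.ReadByte() != '\n')
                    {
                        int skip;
                        while ((skip = fs.ReadByte()) != -1 && skip != '\n') { }
                    }

                    var sb = new StringBuilder();
                    long lineStart = fs.Position;
                    while (lineStart < end)
                    {
                        int b = fs.ReadByte();
                        if (b == -1 || b == '\n')
                        {
                            // process sb.ToString() here (one complete record)
                            sb.Clear();
                            if (b == -1) break;
                            lineStart = fs.Position;   // next line begins here
                        }
                        else if (b != '\r')
                        {
                            sb.Append((char)b);
                        }
                    }
                }
            });
        }
    }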
My experience is with Java, not C#, so apologies if these solutions don't apply.
The immediate solution I can think up off the top of my head would be to have an executor that runs 3 threads (using Executors.newFixedThreadPool, say). For each line/record read from the input file, fire off a job at the executor (using ExecutorService.submit). The executor will queue requests for you, and allocate between the 3 threads.
Probably better solutions exist, but hopefully that will do the job. :-)
ETA: Sounds a lot like Wolfbyte's second solution. :-)
ETA2: System.Threading.ThreadPool sounds like a very similar idea in .NET. I've never used it, but it may be worth your while!
Since the bottleneck will generally be in the processing and not the reading when dealing with files I'd go with the producer-consumer pattern. To avoid locking I'd look at lock free lists. Since you are using C# you can take a look at Julian Bucknall's Lock-Free List code.
@lomaxx
#Derek & Mark: I wish there was a way to accept 2 answers. I'm going to have to end up going with Wolfbyte's solution because if I split the file into n sections there is the potential for a thread to come across a batch of "slow" transactions, however if I was processing a file where each process was guaranteed to require an equal amount of processing then I really like your solution of just splitting the file into chunks and assigning each chunk to a thread and being done with it.
No worries. If clustered "slow" transactions are an issue, then the queuing solution is the way to go. Depending on how fast or slow the average transaction is, you might also want to look at assigning multiple lines at a time to each worker. This will cut down on synchronization overhead. Likewise, you might need to optimize your buffer size. Of course, both of these are optimizations you should probably only do after profiling. (No point in worrying about synchronization if it's not a bottleneck.)
If the text you are parsing is made up of repeated strings and tokens, break the file into chunks, and for each chunk have one thread pre-parse it into tokens consisting of keywords, "punctuation", ID strings, and values. String comparisons and lookups can be quite expensive, and handing this off to several worker threads can speed up the purely logical/semantic part of the code if it doesn't have to do the string lookups and comparisons.
The pre-parsed data chunks (where you have already done all the string comparisons and "tokenized" it) can then be passed to the part of the code that would actually look at the semantics and ordering of the tokenized data.
Also, you mention you are concerned with the size of your file occupying a large amount of memory. There are a couple things you could do to cut back on your memory budget.
Split the file into chunks and parse it chunk by chunk. Read in only as many chunks as you are working on at a time, plus a few for "read-ahead", so you do not stall on disk when you finish processing one chunk before moving to the next.
Alternatively, large files can be memory-mapped and loaded on demand; a sketch follows. If you have more threads processing the file than CPUs (usually threads = 1.5-2x CPUs is a good number for demand-paging apps), the threads that stall on IO for the memory-mapped file will be halted automatically by the OS until their memory is ready, and the other threads will continue to process.
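A sketch of the memory-mapped variant (the file name is a placeholder; the line-boundary scanning from the earlier sketch still applies and is elided here):

    using System;
    using System.IO;
    using System.IO.MemoryMappedFiles;
    using System.Threading.Tasks;

    class MappedFileSketch
    {
        static void Main()
        {
            const string FilePath = "records.txt";   // placeholder
            long fileSize = new FileInfo(FilePath).Length;

            // More threads than cores, since some will stall on page faults.
            int workers = Environment.ProcessorCount * 2;

            using (var mmf = MemoryMappedFile.CreateFromFile(FilePath, FileMode.Open))
            {
                var options = new ParallelOptions { MaxDegreeOfParallelism = workers };
                Parallel.For(0, workers, options, n =>
                {
                    long start = fileSize * n / workers;
                    long length = fileSize * (n + 1) / workers - start;

                    using (var view = mmf.CreateViewAccessor(start, length,
                                                             MemoryMappedFileAccess.Read))
                    {
                        for (long i = 0; i < length; i++)
                        {
                            byte b = view.ReadByte(i);   // may fault the page in;
                                                         // the OS blocks this thread
                                                         // and schedules another
                            // scan for record boundaries and process as before
                        }
                    }
                });
            }
        }
    }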