Highest Performance for Cross AppDomain Signaling

My performance sensitive application uses MemoryMappedFiles for pushing bulk data between many AppDomains. I need the fastest mechanism to signal a receiving AD that there is new data to be read.
The design looks like this:
AD 1: Writer to MMF, when data is written it should notify the reader ADs
AD 2,3,N..: Reader of MMF
The readers do not need to know how much data is written because each message written will start with a non zero int and it will read until zero, don't worry about partially written messages.
(I think) Traditionally, within a single AD, Monitor.Wait/Pulse could be used for this, but I do not think it works across AppDomains.
A MarshalByRefObject remoting method or event can also be used but I would like something faster. (I benchmark 1,000,000 MarshalByRefObject calls/sec on my machine, not bad but I want more)
A named EventWaitHandle is about twice as fast from initial measurements.
Is there anything faster?
Note: The receiving ADs do not need to get every signal as long as the last signal is not dropped.

A thread context switch costs between 2000 and 10,000 machine cycles on Windows. If you want more than a million per second then you are going to have to solve the Great Silicon Speed Bottleneck. You are already on the very low end of the overhead.
Focus on switching less often and collecting more data in one whack. Nothing needs to switch at a microsecond.

The named EventWaitHandle is the way to go for a one-way signal (for lowest latency). From my measurements it is 2x faster than a cross-AppDomain method call. The method call performance is very impressive in the latest version of the CLR to date (4) and should make the most sense for the large majority of cases, since it's possible to pass some information in the method call (in my case, how much data to read).
If it's OK to continuously burn a thread on the receiving end, and performance is that critical, a tight loop may be faster.
I hope Microsoft continues to improve the cross-AppDomain functionality, as it can really help with application reliability and plug-ins.
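A minimal sketch of the named EventWaitHandle approach follows. The name prefix "MyApp.MmfDataReady." and the readMessages callback are hypothetical placeholders, not part of the original design. It assumes one auto-reset event per reader AppDomain, because a single auto-reset event releases only one waiter per Set(). Named handles are a Windows feature.

```csharp
using System;
using System.Threading;

// Sketch only: the name prefix and the readMessages callback are assumptions.
static class MmfSignal
{
    // One named auto-reset event per reader AppDomain.
    public static EventWaitHandle Open(int readerId)
    {
        return new EventWaitHandle(
            false, EventResetMode.AutoReset, "MyApp.MmfDataReady." + readerId);
    }

    // Writer side: call after the data has been written to the MMF.
    public static void NotifyAll(EventWaitHandle[] readerSignals)
    {
        foreach (EventWaitHandle signal in readerSignals)
            signal.Set(); // Set() calls made before the reader wakes collapse into one
    }

    // Reader side: wait for a signal, then drain everything available.
    public static void ReadLoop(EventWaitHandle signal, Action readMessages,
                                CancellationToken stop)
    {
        while (!stop.IsCancellationRequested)
        {
            if (signal.WaitOne(100)) // timeout so the stop request is noticed
                readMessages();
        }
    }
}
```

Because signals collapse while the reader is busy, the reader may miss intermediate signals but never the last one, which matches the requirement in the question.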

Related

Inefficient Parallel.For?

I'm using a parallel for loop in my code to run a long running process on a large number of entities (12,000).
The process parses a string, goes through a number of input files (I've read that given the number of IO based things the benefits of threading could be questionable, but it seems to have sped things up elsewhere) and outputs a matched result.
Initially, the process goes quite quickly - however it ends up slowing to a crawl. It's possible that it's just hit a number of particularly tricky input data, but this seems unlikely looking closer at things.
Within the loop, I added some debug code that prints "Started Processing: " and "Finished Processing: " when it begins/ends an iteration and then wrote a program that pairs a start and a finish, initially in order to find which ID was causing a crash.
However, looking at the number of unmatched ID's, it looks like the program is processing in excess of 400 different entities at once. This seems like, with the large number of IO, it could be the source of the issue.
So my question(s) is(are) this(these):
Am I interpreting the unmatched ID's properly, or is there some clever stuff going behind the scenes I'm missing, or even something obvious?
If you'd agree what I've spotted is correct, how can I limit the number it spins off and does at once?
I realise this is perhaps a somewhat unorthodox question and may be tricky to answer given there is no code, but any help is appreciated and if there's any more info you'd like, let me know in the comments.

Without seeing some code, I can guess at the answers to your questions:
Unmatched IDs indicate to me that the thread that is processing that data is being de-prioritized. This could be due to IO or the thread pool trying to optimize, however it seems like if you are strongly IO bound then that is most likely your issue.
I would take a look at Parallel.For, specifically using ParallelOptions.MaxDegreeOfParallelism to limit the maximum number of concurrent tasks to a reasonable number. I would suggest trial and error to determine the optimum degree, starting around the number of processor cores you have.
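As a rough illustration, the sketch below caps the loop and also reports the peak number of iterations in flight, which is the figure the question tried to infer from unmatched log lines. BoundedLoop and processEntity are hypothetical names.

```csharp
using System;
using System.Threading.Tasks;

static class BoundedLoop
{
    // Runs processEntity for ids 0..count-1 with at most maxParallel
    // iterations in flight; returns the peak concurrency observed.
    public static int Run(int count, int maxParallel, Action<int> processEntity)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };
        object gate = new object();
        int inFlight = 0, peak = 0;

        Parallel.For(0, count, options, id =>
        {
            lock (gate)
            {
                inFlight++;
                if (inFlight > peak) peak = inFlight;
            }
            try { processEntity(id); }
            finally { lock (gate) { inFlight--; } }
        });
        return peak;
    }
}
```

A starting point might be BoundedLoop.Run(12000, Environment.ProcessorCount, ProcessEntity), then adjusting the cap while measuring total run time.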
Good luck!

Let me start by confirming that it is indeed a very bad idea to read 2 files at the same time from a hard drive (at least until the majority of HDs out there are SSDs), let alone whichever number your whole thing is using.
The use of parallelism serves to optimize processing using an actually parallelizable resource, which is the CPU power. If your parallelized process reads from a hard drive then you're losing most of the benefit.
And even then, even the CPU power is not prone to infinite parallelization. A normal desktop CPU has the capacity to run up to 10 threads at the same time (depends on the model obviously, but that's the order of magnitude).
So two things:
First, I am going to make the assumption that your entities use all your files, but your files are not too big to be loaded into memory. If that's the case, you should read your files into objects (i.e. into memory), then parallelize the processing of your entities using those objects. If not, you're basically relying on your hard drive's cache to not reread your files every time you need them, and your hard drive's cache is far smaller than your memory (1000-fold).
Second, you shouldn't be running Parallel.For on 12,000 items. Parallel.For will actually (try to) create 12,000 threads, and that is actually worse than 10 threads, because of the big overhead that parallelizing will create, and the fact your CPU will not benefit from it at all since it cannot run more than 10 threads at a time.
You should probably use a more efficient method, which is the IEnumerable<T>.AsParallel() extension (comes with .NET 4.0). This one will, at runtime, determine the optimal number of threads to run, then divide your enumerable into as many batches. Basically, it does the job for you, but it creates a big overhead too, so it's only useful if the processing of one element is actually costly for the CPU.
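The two suggestions above could be combined roughly as follows. This is a sketch under assumptions: the files fit in memory, entity strings are distinct, and "matching" means a plain substring test (the real matching logic is not shown in the question). InMemoryMatcher is a hypothetical name.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class InMemoryMatcher
{
    // Phase 1: read every input file once, sequentially, so the disk is
    // never asked to serve several readers at the same time.
    public static List<string[]> LoadAll(IEnumerable<string> paths)
    {
        return paths.Select(p => File.ReadAllLines(p)).ToList();
    }

    // Phase 2: CPU-only work over the in-memory data, partitioned by PLINQ.
    // Returns, for each entity, how many lines contain it.
    public static Dictionary<string, int> CountMatches(
        IEnumerable<string> entities, List<string[]> files)
    {
        return entities
            .AsParallel()
            .Select(e => new KeyValuePair<string, int>(
                e, files.Sum(lines => lines.Count(l => l.Contains(e)))))
            .ToDictionary(kv => kv.Key, kv => kv.Value);
    }
}
```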
From my experience, using anything parallel should always be evaluated against not using it in real-life, i.e. by actually profiling your application. Don't assume it's going to work better.

Parallel code bad scalability

Recently I've been analyzing how my parallel computations actually speed up on a 16-core processor. And the general formula that I concluded (the more threads you have, the less speed per core you get) is embarrassing me. Here are the diagrams of my CPU load and processing speed:
So, you can see that processor load increases, but speed increases much slower. I want to know why such an effect takes place and how to get the reason of unscalable behaviour.
I've made sure to use Server GC mode.
I've made sure that I'm parallelizing appropriate code, in that the code does nothing more than:
Loads data from RAM (server has 96 GB of RAM, swap file shouldn't be hit)
Performs simple calculations
Stores data in RAM
I've profiled my application carefully and found no bottlenecks - looks like each operation becomes slower as thread number grows.
I'm stuck, what's wrong with my scenario?
I use .Net 4 Task Parallel Library.

You will always get this kind of curve; it's described by Amdahl's law.
The question is how soon it will level off.
You say you checked your code for bottlenecks, let's assume that's correct. Then there is still the memory bandwidth and other hardware factors.
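For reference, Amdahl's law says that if a fraction p of the work can run in parallel, the best possible speedup on n cores is 1 / ((1 - p) + p / n). A small helper (hypothetical name) makes the flattening easy to see:

```csharp
static class Amdahl
{
    // Upper bound on speedup when parallelFraction of the work scales
    // perfectly across the given number of cores and the rest is serial.
    public static double Speedup(double parallelFraction, int cores)
    {
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / cores);
    }
}
```

With 95% parallel work, 16 cores give at most about 9.1x, and no number of cores gets past 20x.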

The key to a linear scalability - in the context of where going from one to two cores doubles the throughput - is to use shared resources as little as possible. This means:
don't use hyperthreading (because the two threads share the same core resource)
tie every thread to a specific core (otherwise the OS will juggle the threads between cores)
don't use more threads than there are cores (the OS will swap in and out)
stay inside the core's own caches - nowadays the L1 & L2 caches
don't venture into the L3 cache or RAM unless it is absolutely necessary
minimize/economize on critical section/synchronization usage
If you've come this far you've probably profiled and hand-tuned your code too.
Thread pools are a compromise and not suited for uncompromising, high-performance applications. Total thread control is.
Don't worry about the OS scheduler. If your application is CPU-bound with long computations that mostly does local L1 & L2 memory accesses it's a better performance bet to tie each thread to its own core. Sure the OS will come in but compared to the work being performed by your threads the OS work is negligible.
Also I should say that my threading experience is mostly from Windows NT-engine machines.
_______EDIT_______
Not all memory accesses have to do with data reads and writes (see comment above). An often overlooked memory access is that of fetching code to be executed. So my statement about staying inside the core's own caches implies making sure that ALL necessary data AND code reside in these caches. Remember also that even quite simple OO code may generate hidden calls to library routines. In this respect (the code generation department), OO and interpreted code is a lot less WYSIWYG than perhaps C (generally WYSIWYG) or, of course, assembly (totally WYSIWYG).

A general decrease in return with more threads could indicate some kind of bottleneck.
Are there ANY shared resources, like a collection or queue or something or are you using some external functions that might be dependent on some limited resource?
The sharp break at 8 threads is interesting and in my comment I asked if the CPU is a true 16 core or an 8 core with hyper threading, where each core appears as 2 cores to the OS.
If it is hyper threading, you either have so much work that the hyper threading cannot double the performance of the core, or the memory pipe to the core cannot handle twice the data throughput.
Is the work performed by the threads even, or are some threads doing more than others? That could also indicate resource starvation.
Since you added that threads query for data very often, that indicates a very large risk of waiting.
Is there any way to let the threads get more data each time? Like reading 10 items instead of one?
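One way to hand each thread a batch instead of a single item is a range partitioner. The sketch below assumes the data is an int array and that results are simply summed; BatchedWork and transform are hypothetical names.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class BatchedWork
{
    // Each worker claims batchSize consecutive items at a time, so the
    // shared partitioner is consulted once per batch rather than per item.
    public static long Sum(int[] data, int batchSize, Func<int, long> transform)
    {
        if (data.Length == 0) return 0; // the range partitioner rejects empty ranges

        long total = 0;
        object gate = new object();
        Parallel.ForEach(Partitioner.Create(0, data.Length, batchSize), range =>
        {
            long local = 0; // accumulate privately, publish once per batch
            for (int i = range.Item1; i < range.Item2; i++)
                local += transform(data[i]);
            lock (gate) { total += local; }
        });
        return total;
    }
}
```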

If you are doing memory intensive stuff, you could be hitting cache capacity.
You could maybe test this with a mock algorithm which just processes the same small bit of data over and over, so it should all fit in cache.
If it indeed is cache, possible solutions could be making the threads work on same data somehow (like different parts of small data window), or just tweaking the algorithm to be more local (like in sorting, merge sort is generally slower than quick sort, but it is more cache friendly which still makes it better in some cases).

Are your threads reading and writing to items close together in memory? Then you're probably running into false sharing. If thread 1 works with data[1] and thread2 works with data[2], then even though in an ideal world we know that two consecutive reads of data[2] by thread2 will always produce the same result, in the actual world, if thread1 updates data[1] sometime between those two reads, then the CPU will mark the cache as dirty and update it. http://msdn.microsoft.com/en-us/magazine/cc872851.aspx. To solve it, make sure the data each thread is working with is adequately far away in memory from the data the other threads are working with.
That could give you a performance boost, but likely won't get you to 16x—there are lots of things going on under the hood and you'll just have to knock them out one-by-one. And really it's not that your algorithm is running at 30% speed when multithreaded; it's more that your single-threaded algorithm is running at 300% speed, enabled by all sorts of CPU and caching awesomeness that running multithreaded has a harder time taking advantage of. So there's nothing to be "embarrassed" about. But with some diligence, you can perhaps get the multithreaded version working at nearly 300% speed as well.
Also, if you're counting hyperthreaded cores as real cores, well, they're not. They only allow threads to swap really fast when one is blocked. But they'll never let you run at double speed unless your threads are getting blocked half the time anyway, in which case that already means you have opportunity for speedup.
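To illustrate the false-sharing fix described above, the sketch below gives each worker a counter slot that is spaced well apart from its neighbours. The stride of 16 longs (128 bytes) assumes 64-byte cache lines, which is common but not guaranteed; PaddedCounters is a hypothetical name.

```csharp
using System.Threading.Tasks;

static class PaddedCounters
{
    // Assumption: 64-byte cache lines. 16 longs = 128 bytes between slots,
    // so no two workers write to the same line.
    const int Stride = 16;

    public static long[] CountInParallel(int workers, int incrementsPerWorker)
    {
        var slots = new long[workers * Stride];
        Parallel.For(0, workers, w =>
        {
            int slot = w * Stride; // this worker's private slot
            for (int i = 0; i < incrementsPerWorker; i++)
                slots[slot]++;
        });

        var result = new long[workers];
        for (int w = 0; w < workers; w++)
            result[w] = slots[w * Stride];
        return result;
    }
}
```

Setting Stride to 1 gives the same results with adjacent slots, which makes it a simple way to measure how much false sharing costs on a given machine.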

Balance between Number of Threads and Web Requests

I have a program that executes multiple threads. Each thread simply executes an HttpWebRequest and then screen scrapes the page looking for some text. I am in a race against other users to find this text. I could execute 1000000 threads, all looking for the same thing.
My thought on that is that would put a lot of work on my processor and would actually cause the requests to execute slower. How can I find a balance between the number of threads to execute and the performance of the web requests. Basically what I want to do is find the optimal number of threads to spawn off so that the amount of data they pull down is greatest.
The application is using .NET4 and written in C#.

You are right to assume that 1000000 threads will put undue pressure on your CPU. The work that your CPU would have to do to manage and switch between that many threads would probably cause your system to be very slow indeed.
Obviously you are not serious about 1000000 threads, but it demonstrates that you cannot simply throw more threads at the problem. You don't really want to write your own load balancer - that will not be easy and will not perform as well as the classes that come with the base class library. Have a look at using ThreadPool threads - the CLR will manage them for you. You can also look at the Task Parallel Library that is new in .NET 4.0 (since you mention that is what you are using).
Also check out this great article about multi-threading:
http://www.albahari.com/threading/

C# has a ThreadPool. Submit your web-scraping tasks to the pool. You can tweak the number of threads in the pool to tune your app - you will probably need to increase it well above the default for best performance with such a requirement as yours.
Huge numbers of threads are wasteful, as posted by @M Babcock.
I'm not sure if the number of threads in a C# ThreadPool can be changed at run-time, (I see no reason why not, but M$...). If it is tweakable during the run, tuning will be even easier!
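For what it's worth, the pool limits can be read and changed at run time through ThreadPool.GetMinThreads/SetMinThreads (and the matching Max methods). A small sketch, with PoolTuning as a hypothetical helper name:

```csharp
using System;
using System.Threading;

static class PoolTuning
{
    // Raises the pool's minimum worker-thread count so that bursts of
    // blocking work do not wait for the pool's gradual thread injection.
    // Returns false if the pool rejects the value (e.g. above the maximum).
    public static bool RaiseMinWorkers(int minWorkers)
    {
        int worker, io;
        ThreadPool.GetMinThreads(out worker, out io);
        return ThreadPool.SetMinThreads(Math.Max(worker, minWorkers), io);
    }
}
```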

You need to use Parallel.ForEach to manage your threads properly...

You are asking performance question and not providing any estimates on your actual requirements... so let me try doing it for you.
How much data can you pull in - assuming awesome network and regular network card - 100Mb/s at max, probably less than 10Mb/sec. This gives fewer than about 10000 requests per second (assuming ~10K request/response pairs).
Can one thread handle that much data - searching through 100Mb a second should not be a problem even for single thread. Super easy to prototype/measure.
How many threads I need to read data - likely 1 - starting asynchronous request is fast, reading response OR posting response in a queue for processing is fast for 10000 items a second.
So my estimate: 1 thread for simple code, or (1 + one thread per core) if you have more cores and are willing to run processing in parallel.
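A sketch of the "few threads, many asynchronous requests" idea follows. It uses async/await, so it assumes .NET 4.5 or later rather than plain .NET 4. The fetch delegate is a stand-in for the real download call (for example a lambda around HttpClient.GetStringAsync), and Scraper, needle and maxInFlight are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class Scraper
{
    // Starts all requests asynchronously but lets at most maxInFlight run at
    // once; returns the URLs whose body contains the needle text.
    public static async Task<List<string>> FindAsync(
        IEnumerable<string> urls, string needle, int maxInFlight,
        Func<string, Task<string>> fetch)
    {
        var gate = new SemaphoreSlim(maxInFlight);
        var tasks = urls.Select(async url =>
        {
            await gate.WaitAsync();
            try
            {
                string body = await fetch(url);
                return body.Contains(needle) ? url : null;
            }
            finally { gate.Release(); }
        }).ToList();

        string[] hits = await Task.WhenAll(tasks);
        return hits.Where(h => h != null).ToList();
    }
}
```

No thread is blocked while a request is in flight, so maxInFlight tunes network concurrency independently of the number of threads.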

using C# for real time applications

Can C# be used for developing a real-time application that involves taking input from web cam continuously and processing the input?

You cannot use any mainstream garbage-collected language for “hard real-time systems”, as the garbage collector will sometimes stop the system from responding within a defined time. Avoiding allocating objects can help; however, you need a way to prove you are not creating any garbage and that the garbage collector will not kick in.
However most “real time” systems don’t in fact need to always respond within a hard time limit, so it all comes down to what you mean by “real time”.
Even when parts of the system need to be “hard real time”, often other large parts of the system like the UI don’t.
(I think your app needs to be fast rather than “real time”, if 1 frame is lost every 100 years how many people will get killed?)

I've used C# to create multiple realtime, high speed, machine vision applications that run 24/7 and have moving machinery dependent on the application. If something goes wrong in the software, something immediately and visibly goes wrong in the real world.
I've found that C#/.NET provides pretty good functionality for doing so. As others have said, definitely stay on top of garbage collection. Break up the processing into several logical steps, and have separate threads working on each. I've found the Producer Consumer programming model to work well for this, perhaps with ConcurrentQueue for starters.
You could start with something like:
Thread 1 captures the camera image, converts it to some format, and puts it into an ImageQueue
Thread 2 consumes from the ImageQueue, processing the image and comes up with a data object that is put onto a ProcessedQueue
Thread 3 consumes from the ProcessedQueue and does something interesting with the results.
If Thread 2 takes too long, Threads 1 and 3 are still chugging along. If you have a multicore processor you'll be throwing more hardware at the math. You could also use several threads in place of any thread that I wrote above, although you'd have to take care of ordering the results manually.
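The three-thread layout above could be sketched like this. It swaps in BlockingCollection (a bounded, blocking wrapper over ConcurrentQueue) so a slow stage applies back-pressure instead of letting a queue grow without limit. The camera source and the process function are placeholders, and error handling is left out.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

static class VisionPipeline
{
    public static List<TResult> Run<TImage, TResult>(
        IEnumerable<TImage> camera, Func<TImage, TResult> process, int capacity)
    {
        var imageQueue = new BlockingCollection<TImage>(capacity);
        var processedQueue = new BlockingCollection<TResult>(capacity);
        var results = new List<TResult>();

        // Thread 1: capture frames into the ImageQueue.
        Task capture = Task.Factory.StartNew(() =>
        {
            foreach (TImage frame in camera) imageQueue.Add(frame);
            imageQueue.CompleteAdding();
        }, TaskCreationOptions.LongRunning);

        // Thread 2: process frames and publish to the ProcessedQueue.
        Task analyze = Task.Factory.StartNew(() =>
        {
            foreach (TImage frame in imageQueue.GetConsumingEnumerable())
                processedQueue.Add(process(frame));
            processedQueue.CompleteAdding();
        }, TaskCreationOptions.LongRunning);

        // Thread 3: act on the results (here they are only collected).
        Task consume = Task.Factory.StartNew(() =>
        {
            foreach (TResult r in processedQueue.GetConsumingEnumerable())
                results.Add(r);
        }, TaskCreationOptions.LongRunning);

        Task.WaitAll(capture, analyze, consume);
        return results;
    }
}
```

With one thread per stage the results keep their original order; running several threads in the middle stage would require reordering, as noted above.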
Edit
After reading other people's answers, you could probably argue with my definition of "realtime". In my case, the computer produces targets that it sends to motion controllers which do the actual realtime motion. The motion controllers provide their own safety layers for things like timing, max/min ranges, smooth accel/decelerations and safety sensors. These controllers read sensors across an entire factory with a cycle time of less than 1ms.

Absolutely. The key will be to avoid garbage collection and memory management as much as possible. Try to avoid new-ing objects as much as possible, using buffers or object pools when you can.
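A very small buffer pool along these lines might look as follows. It is a sketch, not a tuned implementation: BufferPool is a hypothetical class, all buffers share one fixed size, and the pool never shrinks.

```csharp
using System.Collections.Concurrent;

sealed class BufferPool
{
    private readonly ConcurrentBag<byte[]> _free = new ConcurrentBag<byte[]>();
    private readonly int _size;

    public BufferPool(int bufferSize) { _size = bufferSize; }

    // Reuses a returned buffer when one is available; allocates otherwise.
    public byte[] Rent()
    {
        byte[] buffer;
        return _free.TryTake(out buffer) ? buffer : new byte[_size];
    }

    // Hands a buffer back for reuse; buffers of the wrong size are dropped.
    public void Return(byte[] buffer)
    {
        if (buffer != null && buffer.Length == _size)
            _free.Add(buffer);
    }
}
```

Once the pool is warm, steady-state frame processing allocates nothing, which keeps the garbage collector idle.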

Of course, someone has even developed a library to do that: AForge.NET
As with any real-time application and not just C#, you'll have to manage the buffers well as @David suggested.
Not only that, there's also the XNA Framework (for things like 3D games) and you can program DirectX using C# as well, both of which are very real-time.
And did you know that, if you want, you can do pointer manipulations in C# too?

It depends on how 'real-time' it needs to be; ie, what your timing constraints are, and how quickly you need to 'do something'.
If you can handle 'doing something' maybe every 300ms or so in .NET, say on a timer event, I've found Windows to work okay. Note that this is something I found true on multiple systems of different ages and different speeds. As always, YMMV.
But that number is awfully long for a lot of applications. Maybe not for yours.
Do some research, make sure your app responds quickly enough for your application.

C# - Moving files - to queue or multi-thread

I have an app that moves a project and its files from preview to production using a Flex front-end and a .NET web service. Currently, the process takes about 5-10 minutes per project. Aside from latency concerns, it really shouldn't take that long. I'm wondering whether or not this is a good use case for multi-threading. Also, considering the user may want to push multiple projects, or one right after another, is there a way to queue the jobs?
Any suggestions and examples are greatly appreciated.
Thanks!

Something that does heavy disk IO typically isn't a good candidate for multithreading since the disks can really only do one thing at a time. However, if you're pushing to multiple servers or the servers have particularly good disk subsystems some light threading may be beneficial.

As a note - regardless of whether or not you decide to queue the jobs, you will use multi-threading. Queueing is just one way of handling what is ultimately solved using multi-threading.
And yes, I'd recommend you build a queue to push out each project.

You should compare the speed of your code against just copying in Windows (i.e., Explorer or the command line) and against copying with something advanced like TeraCopy. If your code is significantly slower than Windows, then use a profiler to find the parts of your code to optimize. If your code is about as fast as Windows but slower than TeraCopy, then multithreading could help.
Multithreading is not generally helpful when the operation is I/O bound, but copying files involves reading from the disk AND writing over the network. These are two I/O operations, so if you separate them onto different threads, it could increase performance. For something like this you need a producer/consumer setup where you have a circular queue with one thread reading from disk and writing to the queue, and another thread reading from the queue and writing to the network. It'll be important to keep in mind that the two threads will not run at the same speed, so if the queue gets full, wait before writing more data, and if it's empty, wait before reading. Also the locking strategy could have a big impact on performance here and could cause the performance to degrade to slower than a single-threaded implementation.
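A sketch of that reader/writer split follows, using a bounded BlockingCollection in place of a hand-written circular queue (it blocks the reader when full and the writer when empty, which covers the waiting described above). PipelinedCopy, chunkSize and maxChunks are hypothetical names, and the streams stand in for the disk and network ends.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

static class PipelinedCopy
{
    public static void Copy(Stream source, Stream destination,
                            int chunkSize, int maxChunks)
    {
        var queue = new BlockingCollection<byte[]>(maxChunks);

        // Producer: read chunks from the source and enqueue them.
        Task reader = Task.Factory.StartNew(() =>
        {
            try
            {
                while (true)
                {
                    var chunk = new byte[chunkSize];
                    int n = source.Read(chunk, 0, chunkSize);
                    if (n == 0) break;                        // end of stream
                    if (n < chunkSize) Array.Resize(ref chunk, n);
                    queue.Add(chunk);                         // blocks when full
                }
            }
            finally { queue.CompleteAdding(); }
        }, TaskCreationOptions.LongRunning);

        // Consumer (calling thread): dequeue chunks and write them out.
        foreach (byte[] chunk in queue.GetConsumingEnumerable())
            destination.Write(chunk, 0, chunk.Length);

        reader.Wait(); // surfaces any read error
    }
}
```

This version allocates a new array per chunk for simplicity; whether it beats a plain Stream copy has to be measured on the real disk and network.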

If you're moving things between just two computers, the network is going to be the bottleneck, so you may want to queue these operations.
Likewise, on the same machine, the I/O is going to be the bottleneck, so you'd want to queue there, too.

You should try using the ThreadPool.
ThreadPool.QueueUserWorkItem(MoveProject, project);

Agreed with everyone over the limited performance of running the tasks in parallel.
If you have full control over your deployment environment, you could use Rhino Queues:
http://ayende.com/Blog/archive/2008/08/01/Rhino-Queues.aspx
This will allow you to produce a queue of jobs asynchronously (say from a WCF service being called from your Silverlight/Flex app) and consume them synchronously.
Alternatively you could use WCF and MSMQ, but the learning curve is greater.

When dealing with multiple files, using multiple threads usually IS a good idea as far as performance is concerned. The main reason is that most disks nowadays support native command queuing.
I wrote an article recently about reading/writing files with multiple threads on ddj.com.
See http://www.ddj.com/go-parallel/article/showArticle.jhtml?articleID=220300055.
Also see related question
Will using multiple threads with a RandomAccessFile help performance?
In particular, my experience is that when dealing with very many files it IS a good idea to use a number of threads. On the contrary, using many threads in many cases does not slow down applications as much as commonly expected.
Having said that i'd say there is no other way to find out than trying all possible different approaches. It depends on very many conditions: Hardware, OS, Drivers etc.

The very first thing you should do is point any kind of profiling tool towards your software. If you can't do that (like, if you haven't got such a tool), insert logging code.
The very first thing you need to do is figure out what is taking a long time to complete, and then why is it taking a long time to complete. That your "copy" operation as a whole takes a long time to complete isn't good enough, you need to pinpoint the reason for this down to a method or a set of methods.
Until you do that, all the other things you can do to your code will likely be guesswork. My experience has taught me that when it comes to performance, 9 out of 10 reasons for things running slow come as surprises to the guy(s) that wrote the code.
So measure first, then change.
For instance, you might discover that you're in fact reporting progress of copying the file on a byte-per-byte basis, to a GUI, using a synchronous call to the UI, in which case it wouldn't matter how fast the actual copying can run, you'll still be bound by message handling speed.
But that's just conjecture until you know, so measure first, then change.
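If no profiler is at hand, the logging approach can be as simple as wrapping each suspect step in a Stopwatch. Timing is a hypothetical helper name, and the step names in the usage note are placeholders.

```csharp
using System;
using System.Diagnostics;

static class Timing
{
    // Runs the step once and returns how long it took (wall-clock time).
    public static TimeSpan Measure(Action step)
    {
        Stopwatch watch = Stopwatch.StartNew();
        step();
        watch.Stop();
        return watch.Elapsed;
    }
}
```

For example, logging Timing.Measure(() => CopyFiles(project)) next to Timing.Measure(() => UpdateDatabase(project)) shows which stage of the move actually dominates before any threading is introduced.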
