ZeroMQ subscriber fails to initialize using 1000+ publishers

I am trying to evaluate ZeroMQ for a larger monitoring and data gathering system. On a smaller scale everything works nicely, but stepping up the load and scale a bit seems tricky.
Right now I am using a C# wrapper (clrzmq, 3.0.0-rc1) to create both a publisher and a subscriber application. I bind the publisher socket (1 socket, 1 context) to 1000 endpoints (localhost + a range of ports) and let the subscriber application's socket (again 1 socket, 1 context) connect to the publisher endpoints.
This sometimes works, and sometimes not (I guess it relates to the max number of sockets handled by the process somehow). It seems to depend on the order in which I start the applications, but I cannot tell for sure. The only thing I see is nasty SEHExceptions containing no details at all. If I create simple console applications I sometimes see low-level C++ asserts like:
Assertion failed: fds.size () <= FD_SETSIZE (......\src\select.cpp:70)
Assertion failed: Permission denied (......\src\signaler.cpp:281)
Assertion failed: Connection reset by peer (......\src\signaler.cpp:124)
Not very helpful to me. In the C# wrapper, the Context creation fails. It does not even get a chance to begin connecting to or even creating sockets. I would expect low level ZeroMQ errors to be handled by throwing exceptions, maybe I just have not understood how to deal with errors yet.
The questions I have right now are:
How do I create a (somewhat) realistic test setup to simulate 1000 separate publishers on a single machine (in the real world 1 publisher = 1 machine) and a couple of subscribers on another machine, all using C#? Is that even possible?
More importantly, how do I trap ZeroMQ errors in C# code to be able to understand what goes wrong?
Since ZeroMQ seems pretty stable and mature I have a hard time believing 1000 publishers should be a problem to handle. However, I need better error support than currently available (unless I completely missed something here) in order to use ZeroMQ over C#.
Update:
After digging into the source, I end up at a zmq_assert(...) leading to RaiseException (0x40000015, EXCEPTION_NONCONTINUABLE, 1, extra_info);. This abruptly terminates the application after dumping the original assert statement to the console. This seems a bit harsh, but may well be the best option given that the condition is really unrecoverable. However, a somewhat better error message would not hurt. Not everyone knows what fds.size () <= FD_SETSIZE means. The comment in the source gives some clues; it would be nice to have that comment in the error message. Anyway, given that my application is not a console app, this just leaves me with an unhandled SEHException, which does not seem to contain even the assert statement or line/file info. I wonder how many other bugs I will create that will result in similarly cryptic errors.

After looking into this a bit more, it seems the default number of sockets is set to 1024. The C# wrapper has a property on the Context object that should be able to change this setting, but it is not working, at least not as expected. Also, the native zmqlib does not have this setting on the context object.
Running a setup like the one in the description does not seem possible, at least not using the clrzmq C# ZeroMQ wrapper. I solved it by running 500 publishers on a separate machine and another 500 plus 1000 subscribers on another machine. This worked nicely without any errors.
The other topic is also a bit disappointing. When the maximum number of sockets is reached, ZeroMQ simply raises an uncatchable exception, causing the application to crash abruptly. This is a fail-fast approach that avoids any further data/state corruption, but unfortunately it also leaves very few clues as to what happened to cause the application to die. Judging from other posts, it seems very hard to gather data for a post-mortem when this happens. Catching the exception in the C# code seems impossible or very hard, and hooking into stdout to capture the printed assert also seems very hard to achieve (unless we are running from a command prompt, in which case the assert message is printed just before the application dies).
All in all, this makes low-level troubleshooting and post-mortem analysis in a non-console C# setting very hard when ZeroMQ terminates via the zmq_assert(...) call. Hopefully this was an extreme case. Not all failure modes seem to cause termination in this abrupt way.

The default FD_SETSIZE is 1024 (defined in the MSVC libzmq project), so you will hit this about half-way through your test case. The other asserts tumble on from that.
Increase this in your libzmq project, to 4K or 8K, and things should work better.
As for the assert() call, it's too brutal on Windows, for sure. On Linux this gives a decent stack dump and enough information to trace the problem. Feel free to improve the assert macro so that it does something smarter, e.g. launch the debugger. In any case if you hit an assert you can't reasonably continue.
Asserting when the FD set is full, well, that could be handled better. If you know anything about C/C++, feel free to take a look at the code. We do depend on people's patches.
Also, if you feel 1024 is too small, feel free to raise this in the project and send us the patch.

A quick and dirty look into this problem suggests that you're creating too many socket connections for your computer. Check out this link on the max number of sockets from MSDN. The errors you are getting look suspiciously relevant enough for this to be a possible source of your problem.
To be honest, having 1000 separate publishers seems like you are tackling the problem a little incorrectly for zmq. Why not have 1 publisher, use 'namespaces', and have each subscriber SUBSCRIBE to what it needs, to split out which messages each subscriber gets?
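To make the 'namespaces' idea a bit more concrete: a ZeroMQ SUB socket filters by prefix, so a message is delivered when it starts with one of the subscribed prefixes. The sketch below only mimics that matching rule in plain C# and makes no ZeroMQ calls; the class name TopicFilter and the topic strings such as "host042." are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

// Illustration only: mimics prefix-based subscription filtering without any ZeroMQ calls.
public static class TopicFilter
{
    // Returns true when the message topic starts with at least one subscribed prefix.
    public static bool Matches(IEnumerable<string> subscriptions, string topic)
    {
        foreach (string prefix in subscriptions)
        {
            if (topic.StartsWith(prefix, StringComparison.Ordinal))
                return true;
        }
        return false;
    }
}
```

With subscriptions "host042." and "alerts.", a message whose topic is "host042.cpu" would be delivered, while "host043.cpu" would be dropped. An empty prefix matches everything.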

Related

if i don't use NetworkComms.Shutdown will i break something?

Hey, I recently created a text message application in C# that sends messages back and forth in a console. I used NetworkCommsDotNet & NetworkCommsDotNet.Connections.
When I was researching it I found a command, NetworkComms.Shutdown() http://www.networkcomms.net/api/html/M_NetworkCommsDotNet_NetworkComms_Shutdown.htm
I'm also new to programming, so I didn't completely understand what they were saying and was left wondering: if I don't use this in my program, will it break something or mess up my router in any way?
ps - the program works and i had success with testing it between two computers on my home network.

I haven't used this, nor do I even know what it is; however, I am good at reading documentation and I believe what it tells me (for the most part):
Shutdown all connections, threads and execute OnCommsShutdown event.
Any packet handlers are left unchanged. If any network activity has
taken place this should be called on application close.
The reason it's telling you this is that the library is most likely using unmanaged resources and wants to gracefully shut them down or clean them up. Since there is no open source for this project, we can only listen to what it's telling you.

SCPI 34970A Connection Exception

I have tried posting this question on Keysight's community page, but apparently they lack a community. I am hoping there will be individuals here who can help me, however.
Anyways, I have written a program that talks to an Agilent 34970A roughly every minute (variable depending on input values, and also random depending on input values) and scans through almost 20 channels. I then parse all the data and make sure everything looks right. I have run into several issues, and I have worked through most of them, however, I am now stuck with one.
SOMETIMES, I get an exception when I first try to connect to the instrument. The short version of the exception is:
Keysight.CommandExpert.InstrumentAbstraction.CommunicationTimeoutException was unhandled
HResult=-2146233088
Message=Timed out while trying to query instrument errors
Source=Keysight.CommandExpert.Scpi
Timeout=10000
I have written my code in C#, and like I said, it happens right when I try to connect to the instrument. This line of code is:
Ag3497x v34970A = new Ag3497x("ASRL1::INSTR");
How do I handle this exception? Originally I was using IVI-COM drivers, but found them to be cumbersome and unreliable. I read on here somewhere that one person preferred the SCPI commands, so I switched to these. However, when I was using the IVI drivers, there was a command to close the connection. I couldn't find one with the SCPI commands. Could this be part of the issue?
The full code I use to talk with and read data from the 34970A is as follows:
string readings = null;
Ag3497x v34970A = new Ag3497x("ASRL1::INSTR");
v34970A.SCPI.RST.Command();
v34970A.SCPI.DISPlay.TEXT.Command("CYCLE TESTER");
v34970A.Transport.DefaultTimeout.Set(-1);
v34970A.SCPI.CONFigure.VOLTage.AC.Command(100, "MAX", "#101:116");
v34970A.SCPI.SENSe.VOLTage.AC.BANDwidth.Command(1000, "#101:117");
v34970A.SCPI.CONFigure.VOLTage.AC.Command(10, "MAX", "#117");
v34970A.SCPI.SENSe.VOLTage.AC.BANDwidth.Command(1000, "#117");
v34970A.SCPI.CONFigure.VOLTage.AC.Command(300, "MAX", "#119");
v34970A.SCPI.SENSe.VOLTage.AC.BANDwidth.Command(60, "#119");
v34970A.SCPI.CONFigure.TEMPerature.Command("TCouple", "K", 1, "MAX", "#118");
v34970A.SCPI.ROUTe.SCAN.Command("#101:119");
v34970A.SCPI.TRIGger.COUNt.Command(1);
v34970A.SCPI.READ.QueryAllData(null, out readings);
I run this code up to 10,000 times, as required by the test the program is written for. I am connected via serial, as you can see in the code.
Anyway, any help would be much appreciated. As I am sure you can understand, when a program fails at cycle 8000 out of 10000, it can be very frustrating, especially when 10000 cycles can take more than 10 days to complete. Maybe I'll get lucky and someone will know how I can prevent this, but I'm really looking for a way to handle this exception when it does happen, assuming it can't be prevented. So this long post is asking how to handle an exception. I have searched, but the answer has eluded me. If you would like to know any other detail, please let me know, and I will provide it promptly.
Thanks again,
Josh

A few suggestions:
Why do you call the Ag3497x constructor on each iteration? Construct the device only once and reuse it.
What is the baud rate? The cable length? Consider shorter cables and a lower baud rate.
Consider sniffing what exactly is happening on the port when this exception occurs, using a serial port sniffer like this one.
If everything else is fine and the issue seems to lie within the VISA driver, and you don't get a response from Keysight, consider using a different VISA driver or implementing your own using SerialPort.
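On the narrower question of how to handle the exception when it does happen: one option is to wrap the connect step in a small retry helper. This is only a sketch under assumptions. The Retry class is hypothetical, the attempt count and delay are arbitrary, and it catches Exception broadly to stay self-contained; in real code you would probably narrow the catch to the CommunicationTimeoutException shown in the question.

```csharp
using System;
using System.Threading;

// Hypothetical helper, not part of any Keysight library.
public static class Retry
{
    // Runs 'action' up to maxAttempts times, pausing between failed attempts.
    // If the final attempt also fails, its exception propagates to the caller.
    public static T Execute<T>(Func<T> action, int maxAttempts, TimeSpan delay)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return action();
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Swallow the failure and wait before the next attempt.
                Thread.Sleep(delay);
            }
        }
    }
}
```

Usage might look like Ag3497x v34970A = Retry.Execute(() => new Ag3497x("ASRL1::INSTR"), 3, TimeSpan.FromSeconds(5)); so that a single timeout at cycle 8000 costs a few seconds instead of the whole run.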

COM Add-in: Resolve the error DisconnectedContext in WinWord.exe

I built an add-on to Microsoft Word. When the user clicks a button, it runs a number of processes that export a list of Microsoft Word documents to Filtered HTML. This works fine.
Where the code falls down is in processing large numbers of files. After the file conversions are done and I call the next function, the app crashes and I get this information from Visual Studio:
Managed Debugging Assistant 'DisconnectedContext' has detected a problem in 'C:\Program Files\Microsoft Office\root\Office16\WINWORD.EXE'.
Additional information: Transition into COM context 0x56255b88 for
this RuntimeCallableWrapper failed with the following error: System
call failed. (Exception from HRESULT: 0x80010100
(RPC_E_SYS_CALL_FAILED)). This is typically because the COM context
0x56255b88 where this RuntimeCallableWrapper was created has been
disconnected or it is busy doing something else. Releasing the
interfaces from the current COM context (COM context 0x56255cb0). This
may cause corruption or data loss. To avoid this problem, please
ensure that all COM contexts/apartments/threads stay alive and are
available for context transition, until the application is completely
done with the RuntimeCallableWrappers that represents COM components
that live inside them.
After some testing, I realized that if I simply remove all the code after the file conversions, there are no problems. To work around this, I placed the remainder of my code behind yet another button.
The problem is I don't want to give the user two buttons. After reading various other threads, it sounds like my code has a memory or threading issue. The answers I am reading do not help me truly understand what to do next.
I feel like this is what I want to do:
1- Run conversion.
2- Close thread/cleanup memory issue from conversion.
3- Continue running code.
Unfortunately, I really don't know how to do #2 or if it is even possible. Your help is very much appreciated.

or it is busy doing something else
The managed debugging assistant diagnostic you got is pretty gobbledygooky but that's the part of the message that accurately describes the real problem. You have a firehose problem, the 3rd most common issue associated with threading. The mishap is hard to diagnose because this goes wrong inside the Word plumbing and not your code.
Trying not to commit the same gobbledygook sin myself, what goes wrong is that the interop calls you make into the Office program are queued, waiting for their turn to get executed. The underlying "system call" that the error code hints at is PostMessage(). Wherever there is a queue, there is a risk that the queue gets too large. This happens when the producer (your program) is adding items to the queue far faster than the consumer (the Office program) removes them. The firehose problem. Unless the producer slows down, the queue will grow without bounds and something is going to fail if it is allowed to grow endlessly; at a minimum the process runs out of memory.
It is not allowed to get close to that problem. The underlying queue that PostMessage() uses is protected by the OS. Windows fails the call when the queue already contains 10,000 messages. That's a fatal error that RPC does not know how to recover from, or rather should not try to recover from. Something is amiss and it isn't pretty. It returns an error code to your program to tell you about it. That's RPC_E_SYS_CALL_FAILED. Nothing much better happens in your program, the CLR doesn't know how to recover from it either, nor does your code. So the show is over, the interop call you made got lost and was not executed by Word.
Finding a completely reliable workaround for this awkward problem is not that straightforward. Beware that this can happen on any interop call, so catching the exception and trying again is pretty drastically impractical. But do keep in mind that the Q+D fix is very simple. The plain problem is that your program is running too fast; slowing it down with a Thread.Sleep() or Task.Delay() call is quite crude but will always fix the issue. Well, assuming you delay enough.
I think, but don't know for a fact because nobody ever posts repro code, that this issue is also associated with using a console mode app or a worker thread in your program. If it is a console mode app then try applying the [STAThread] attribute to your Main() method. If it is a worker thread then call Thread.SetApartmentState() before starting the thread, but beware it is very important to also create the Application interface on that worker thread. Not otherwise a workaround for an add-in.
If neither of those workarounds is effective, or they are too impractical, then consider that you can automagically slow your program down, and ensure the queue is emptied, by occasionally reading something back from the Office program. Something silly, any property getter call will do. Necessarily you can't get the property value until the Office program catches up. That can still fail; there is also a 60 second time-out on the interop call. But that's something you can fix: you can call CoRegisterMessageFilter() in your program to install a callback that runs when the timeout trips. Very gobbledygooky as well, but the cut-and-paste code is readily available.
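The read-back idea can be sketched as a small counter that forces a round trip every N interop calls. This is only an illustration: ReadBackThrottle is a made-up class, the interval is a guess you would have to tune, and the read-back delegate stands in for whatever cheap property getter you pick on the Office side.

```csharp
using System;

// Hypothetical helper: forces a synchronous read-back after every 'interval' calls.
public sealed class ReadBackThrottle
{
    private readonly int _interval;
    private readonly Action _readBack;
    private int _callsSinceReadBack;

    // interval: number of interop calls allowed between read-backs (assumed value, tune by experiment).
    // readBack: any cheap call that returns a value from the Office application.
    public ReadBackThrottle(int interval, Action readBack)
    {
        if (interval < 1) throw new ArgumentOutOfRangeException(nameof(interval));
        if (readBack == null) throw new ArgumentNullException(nameof(readBack));
        _interval = interval;
        _readBack = readBack;
    }

    // Call once after each interop call that does not itself return a value.
    public void AfterCall()
    {
        if (++_callsSinceReadBack >= _interval)
        {
            _callsSinceReadBack = 0;
            _readBack();
        }
    }
}
```

In an add-in this might be created as new ReadBackThrottle(50, () => { var unused = wordApp.Documents.Count; }), where wordApp is assumed to be your Word Application reference, with AfterCall() invoked after each export step.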

StackOverflowException in .NET >= 4.0 - give other threads chance to gracefully exit

Is there a way to at least postpone termination of a managed app (by a few dozen milliseconds) and set some shared flag to give other threads a chance to terminate gracefully (the SO thread itself obviously wouldn't execute anything further)? I'm contemplating using a JIT debugger or CLR hosting for this. I'm curious whether anybody has tried this before.
Why would I want to do something so wrong?:
Without too much detail - imagine this analogy - you are in a casino betting on a roulette and suddenly find out that the roulette is unreliable fake. So you want to immediately leave the casino, BUT likely want to collect your bets from the table first.
Unfortunately I cannot leverage separate process for this as there are very tight performance requirements.
Tried and didn't work:
.NET behavior for StackOverflowException (and contradicting info on MSDN) has been discussed several times on SO - to quickly sum up:
HandleProcessCorruptedStateExceptionsAttribute (e.g. on appdomain unhandled exception handler) doesn't work
ExecuteCodeWithGuaranteedCleanup doesn't work
legacyUnhandledExceptionPolicy doesn't work
There may be a few other ways to try to handle StackOverflowExceptions, but it seems apparent that the CLR terminates the whole process, as mentioned in this great answer by Hans Passant.
Considering to try:
JIT debugger - leave the thread with the exception frozen, set some shared flag (likely in a pinned location) and thaw other threads for a short time.
CLR hosting and setting unhandled exception policy
Do you have any other idea? Or any experience (successful/unsuccessful) with those two ways?

The word "fake" isn't quite the correct one for your casino analogy. There was a magnitude 9 earthquake and the casino building, along with the roulette table, the remaining chips and the player, disappeared in a giant cloud of smoke and dust.
The only shot you have at running code after an SOE is to stay far away from that casino, it has to run in another process. A "guard" process that starts your misbehaving program, it can use the Process.ExitCode to detect the crash. It will be -1073741571 (0xc00000fd). The process state is gone, you'll have to use one of the .NET out-of-process interop methods (like WCF, named pipes, sockets, memory-mapped file) to make the guard process aware of things that need to be done to clean up. This needs to be transactional, you cannot reason about the exact point in time that the crash occurred since it might have died while updating the guard.
Do beware that this is rarely worth the effort, because an SOE is pretty indistinguishable from an everyday process abort. Like getting killed by Task Manager. Or the machine losing power. Or being subjected to the effects of an earthquake :)
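For what the guard process's exit-code check could look like, here is a minimal sketch. The Guard class and the workerPath parameter are hypothetical; the exit code is the one quoted above, and the out-of-process clean-up channel (WCF, pipes and so on) is deliberately left out.

```csharp
using System;
using System.Diagnostics;

// Hypothetical guard-side helper.
public static class Guard
{
    // 0xC00000FD reinterpreted as a signed 32-bit exit code, i.e. -1073741571.
    public const int StackOverflowExitCode = unchecked((int)0xC00000FD);

    public static bool IsStackOverflow(int exitCode)
    {
        return exitCode == StackOverflowExitCode;
    }

    // Starts the worker executable, waits for it to exit, and returns its exit code.
    // 'workerPath' is a placeholder supplied by the caller.
    public static int RunWorker(string workerPath)
    {
        var startInfo = new ProcessStartInfo(workerPath) { UseShellExecute = false };
        using (Process worker = Process.Start(startInfo))
        {
            worker.WaitForExit();
            return worker.ExitCode;
        }
    }
}
```

The guard would call RunWorker, test the result with IsStackOverflow, and only then run whatever transactional clean-up it was told about before the crash.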

A StackOverflowException is an immediate and critical exception from which the runtime cannot recover - that's why you can't catch it, or recover from it, or anything else. In order to run another method (whether that's a cleanup method or anything else), you have to be able to create a stack frame for that method, and the stack is already full (that's what a StackOverflowException means!). You can't run another method because running a method is what causes the exception in the first place!
Fortunately, though, this kind of exception is always caused by program structure. You should be able to diagnose and fix the error in your code: when you get the exception, you will see in your call stack that there's a loop of one or more methods recursing indefinitely. You need to identify what the faulty logic is and fix it, and that'll be a lot easier than trying to fix the unfixable exception.

What is Environment.FailFast?

What is Environment.FailFast?
How is it useful?

It is used to kill an application. It's a static method that will instantly kill an application without being caught by any exception blocks.
Environment.FailFast(String) can actually be a great debugging tool. For example, say you have an application that is just downright giving you some weird output. You have no idea why. You know it's wrong, but there are just no exceptions bubbling to the surface to help you out. Well, if you have access to Visual Studio 2005's Debug->Exceptions... menu item, you can actually tell Visual Studio to allow you to see those first chance exceptions. If you don't have that, however, you can put Environment.FailFast(String) in an exception, and use deductive reasoning and process of elimination to find out where your problem is.
Reference

It also creates a dump and event viewer entry, which might be useful.

It's a way to immediately exit your application without throwing an exception.
Documentation is here.
Might be useful in some security or data-critical contexts.

Failfast can be used in situations where you might be endangering the user's data. Say in a database engine, when you detect a corruption of your internal data structures, the only sane course of action is to halt the process as quickly as possible, to avoid writing garbage to the database and risk corrupting it and lose the user's data. This is one possible scenario where failfast is useful.
Another use is to catch programmer errors. Say you are writing a library and some function accepts a pointer that cannot be null in any circumstance, that is, if it's null, you are clearly in presence of a programmer error. You can return an error like E_POINTER or throw some InvalidArgument exception and hope someone notices, but you'll get their attention better by failing fast :-)
Note that I'm not restricting the example to pointers, you can generalize to any parameter or condition that should never happen. Failing fast ultimately results in better quality apps, as many bugs no longer go unnoticed.
Finally, failing fast helps with capturing the state of the process as faithfully as possible (as a memory dump gets created), in particular when failing fast immediately upon detecting an unrecoverable error or a really unexpected condition.
If the process was allowed to continue, say the 'finally' clauses would run or the stack would be unwound, and things would get destroyed or disposed-of, before a memory dump is taken, then the state of the process might be altered in such as way that makes it much more difficult to diagnose the root cause of the problem.
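A small sketch of the "should never happen" check described above. Environment.FailFast(String) is the real API; the Invariant class and the optional terminate parameter are made up here, the latter only so the check can be exercised without actually killing the process.

```csharp
using System;

// Hypothetical helper wrapping Environment.FailFast for invariant checks.
public static class Invariant
{
    // Terminates the process immediately when 'condition' is false.
    // 'terminate' is an optional stand-in used for testing; by default
    // Environment.FailFast is called, which skips finally blocks and finalizers.
    public static void Require(bool condition, string message, Action<string> terminate = null)
    {
        if (condition) return;
        if (terminate != null)
        {
            terminate(message);
        }
        else
        {
            Environment.FailFast(message);
        }
    }
}
```

A database engine in the spirit of the example might call Invariant.Require(pageChecksumOk, "Page checksum mismatch") right before writing, where pageChecksumOk is whatever corruption check the engine already has.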

It kills the application and even skips try/finally blocks.

From .NET Framework Design Guidelines on Exception Throwing:
✓ CONSIDER terminating the process by calling System.Environment.FailFast (.NET Framework 2.0 feature) instead of throwing an exception if your code encounters a situation where it is unsafe for further execution.

Joe Duffy discusses failing fast and the discipline to make it useful, here.
http://joeduffyblog.com/2014/10/13/if-youre-going-to-fail-do-it-fast/
Essentially, he's saying that for programming bugs (i.e. unexpected errors that are the fault of the programmer, and not of the program's user or of other inputs or situations that can reasonably be expected to be bad), deciding to always fail fast has been seen to improve code quality.
I think that since it's an optional team decision and discipline, use of this API in C# is rare, since in reality we're all mostly writing LoB apps for 12 people in HR or an online shop at best.
So for us, we'd maybe use this when we want to deny the consumer of our API the opportunity of making any further moves.

An unhandled exception that is thrown (or rethrown) within a Task won't take effect until the Task is garbage-collected, at some perhaps-random time later.
This method lets you crash the process now -- see this answer.
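One way to get the "crash now" behaviour for a task is to attach a continuation that runs only when the task faults. This is a sketch, not the linked answer's code: FailFastOnFault and its optional terminate parameter are hypothetical names, while ContinueWith, TaskContinuationOptions.OnlyOnFaulted and Environment.FailFast(String, Exception) are the real APIs it relies on.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical extension: crash the process as soon as a task faults.
public static class TaskFailFast
{
    // 'terminate' is an optional stand-in for testing; by default the process
    // is terminated with Environment.FailFast and the task's AggregateException.
    public static Task FailFastOnFault(this Task task, Action<string, Exception> terminate = null)
    {
        return task.ContinueWith(t =>
        {
            const string message = "Unhandled exception in task";
            if (terminate != null)
            {
                terminate(message, t.Exception);
            }
            else
            {
                Environment.FailFast(message, t.Exception);
            }
        }, TaskContinuationOptions.OnlyOnFaulted | TaskContinuationOptions.ExecuteSynchronously);
    }
}
```

Note that the returned continuation task ends up cancelled when the original task succeeds, so it is meant to be fire-and-forget rather than awaited.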
