What should the client do while the TIBCO EMS server attempts failover? - c#

The TIBCO EMS user's guide (pg 292) says:
The backup server will work indefinitely to either A) become the
primary server or B) reconnect to the primary server. It also says
clients may receive fail-over notification when the switch is successful (see also TIBCO EMS .NET reference pg 220).
I have some questions spinning off of these facts...
1. What kind of errors occur on the client side while the servers are attempting fail-over/reconnect?
2. What is the appropriate response from the client?
   a. Get new Connection objects from the ConnectionFactory until one works?
   b. Wait for fail-over notification? (Are the current Connection instances fixed at this point, or do I need to get a new instance?)
I hope the scenario is clear; any related information or advice would be appreciated too.

I can at least answer #1 above.
If you have enabled Tibems.SetExceptionOnFTSwitch(true) and have set up an exception handler to capture the messages the server sends to the client, you will see the following:
For single-server, non-fault tolerant connection failures:
"Connection has been terminated".
For fault-tolerant connection failures:
"Connection has performed fault-tolerant switch to "
If you attempt to publish while the connection is down, a TIBCO.EMS.IllegalStateException is thrown with the "Producer is closed" message.
For #2 above, I think the answer is to let the EMS library handle as much as possible. Once we got the EMS reconnect functionality to work, it gracefully tried to reconnect until the server became available again, and once it reconnected, it was as if there had never been a problem. The only gotcha is probably if you try to publish a message before the EMS connection is back. This is where the exception handler comes in: once notified that you are in failover mode, you can adjust exception handling on the publisher side to suppress the error until the connection is back. What I don't know is how to tell when you've exhausted all reconnect attempts.
Anyway, it seems like our two worlds are closely related when it comes to EMS; I hope our findings (based on your comments on my questions) help you.
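The publisher-side idea above can be sketched as a small wrapper. This is a hypothetical class, not part of the TIBCO EMS API: `send` stands in for the real producer call, `isConnectionDown` decides which exceptions mean "not connected right now" (the thread mentions TIBCO.EMS.IllegalStateException with "Producer is closed"), and `OnReconnected` is meant to be called from your exception handler once it reports that the connection is usable again, for example after the fault-tolerant switch.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wrapper that suppresses "connection down" publish errors by
// buffering messages, then flushes them in order once reconnected.
public sealed class BufferingPublisher
{
    private readonly Action<string> send;                 // stands in for the real EMS send call
    private readonly Func<Exception, bool> isConnectionDown;
    private readonly Queue<string> backlog = new Queue<string>();
    private readonly object gate = new object();          // exception handler may run on another thread

    public BufferingPublisher(Action<string> send, Func<Exception, bool> isConnectionDown)
    {
        this.send = send;
        this.isConnectionDown = isConnectionDown;
    }

    public int BacklogCount
    {
        get { lock (gate) { return backlog.Count; } }
    }

    // Publishes immediately if possible; otherwise keeps the message for later.
    public void Publish(string message)
    {
        lock (gate)
        {
            // While a backlog exists, queue behind it so ordering is preserved.
            if (backlog.Count > 0)
            {
                backlog.Enqueue(message);
                return;
            }
            try
            {
                send(message);
            }
            catch (Exception ex) when (isConnectionDown(ex))
            {
                backlog.Enqueue(message);
            }
        }
    }

    // Call from the exception handler once it reports the connection is back.
    public void OnReconnected()
    {
        lock (gate)
        {
            while (backlog.Count > 0)
            {
                try
                {
                    send(backlog.Peek());
                }
                catch (Exception ex) when (isConnectionDown(ex))
                {
                    return; // still down: keep the rest for the next notification
                }
                backlog.Dequeue();
            }
        }
    }
}
```

The backlog is unbounded in this sketch; a real publisher would probably cap it or start failing once it grows too large, especially since there is no obvious signal for when reconnect attempts have been exhausted.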

We use TEMS (TIBCO EMS, a TIBCO product for WCF), so it becomes a custom binding. We tried to break it by doing things like bouncing the server to force switchovers, and it works really well. Make sure you are using version 1.2, not 1.1, because with 1.1 you cannot use anything other than client acknowledgement.

Related

WMQ: Distributing MQ readers over several machines

I am using WMQ from C# to access IBM WebSphere MQ on a mainframe.
We are considering spreading out our service on several machines, and we then need to make sure that two services on two different machines cannot read/get the same MQ message at the same time.
My code for getting messages is this:
var connectionProperties = new Hashtable();
const string transport = MQC.TRANSPORT_MQSERIES_CLIENT;
connectionProperties.Add(MQC.TRANSPORT_PROPERTY, transport);
connectionProperties.Add(MQC.HOST_NAME_PROPERTY, mqServerIP);
connectionProperties.Add(MQC.PORT_PROPERTY, mqServerPort);
connectionProperties.Add(MQC.CHANNEL_PROPERTY, mqChannelName);
_mqManager = new MQQueueManager(mqManagerName, connectionProperties);
var queue = _mqManager.AccessQueue(_queueName, MQC.MQOO_INPUT_SHARED + MQC.MQOO_FAIL_IF_QUIESCING);
var queueMessage = new MQMessage {Format = MQC.MQFMT_STRING};
var queueGetMessageOptions = new MQGetMessageOptions {Options = MQC.MQGMO_WAIT, WaitInterval = 2000};
queue.Get(queueMessage, queueGetMessageOptions);
queue.Close();
_mqManager.Commit();
return queueMessage.ReadString(queueMessage.MessageLength);
Is WebSphere MQ transactional by default, or is there something I need to change in my configuration to enable this?
Or - do I need to ask our mainframe guys to do some of their magic?
Thx
Unless you actively BROWSE the message (i.e. read it but leave it on the queue with no locks), only one getter will ever be able to 'get' the message. Even without transactionality, MQ will still only deliver the message once... but once delivered, it's gone.
MQ is not transactional 'by default': if you want transactionality, you need to get with GMO_SYNCPOINT (MQ transactions) and commit at the connection (MQQueueManager) level. Integrating with .NET transactions is another option.
If you use syncpoint, then one getter will get the message and the other will not, but if you subsequently have an issue and roll back, the message is made available to any getter again (as you would want). It is in this scenario that you might see a message twice, but that's because you aborted the transaction and hence asked for the queue to be put back to how it was before the get.
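To make the syncpoint behavior concrete, here is a toy in-memory model. It is not the IBM.WMQ API (there, the get option is MQC.MQGMO_SYNCPOINT and the unit of work is ended with MQQueueManager.Commit() or Backout()); `ToyQueue` and `UnitOfWork` are hypothetical names that only mimic the semantics described above: a message taken under syncpoint is invisible to other getters, Commit removes it for good, and Backout makes it available again.

```csharp
using System.Collections.Generic;

// Toy model of "get under syncpoint" semantics. NOT the IBM.WMQ API.
public sealed class ToyQueue
{
    private readonly LinkedList<string> available = new LinkedList<string>();
    private readonly object gate = new object();

    public void Put(string message)
    {
        lock (gate) { available.AddLast(message); }
    }

    // Number of messages that getters can currently see.
    public int Depth
    {
        get { lock (gate) { return available.Count; } }
    }

    // Takes the next visible message under syncpoint, or returns null if none.
    // While the unit of work is open, no other getter can see that message.
    public UnitOfWork GetUnderSyncpoint()
    {
        lock (gate)
        {
            if (available.Count == 0) return null;
            string message = available.First.Value;
            available.RemoveFirst();
            return new UnitOfWork(this, message);
        }
    }

    public sealed class UnitOfWork
    {
        private readonly ToyQueue owner;
        private bool finished;

        internal UnitOfWork(ToyQueue owner, string message)
        {
            this.owner = owner;
            Message = message;
        }

        public string Message { get; }

        // Commit: the get becomes permanent and the message is gone.
        public void Commit() { finished = true; }

        // Backout: the get is undone; the message goes back to the head of the queue.
        public void Backout()
        {
            if (finished) return;
            finished = true;
            lock (owner.gate) { owner.available.AddFirst(Message); }
        }
    }
}
```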
I wish I'd found this sooner because the accepted answer is incomplete. MQ provides once and only once delivery of messages as described in the other answer and IBM's documentation. If you have many clients listening on the same queue, MQ will deliver only one copy of the message. This is uncontested.
That said, MQ, or any other async messaging for that matter, must deal with session handling and ambiguous outcomes. The effect of these factors is that any async messaging application should be designed to gracefully handle dupe messages.
Consider an application putting a message onto a queue. If the PUT call receives a 2009 Connection Broken response, it is unclear whether the connection failed before or after the channel agent received and acted on the API call. The application, having no way to tell the difference, must put the message again to assure it is received. Doing the PUT under syncpoint can result in a 2009 on the COMMIT (or equivalent return code in messaging transports other than MQ) and the app doesn't know if the COMMIT was successful or if the PUT will eventually be rolled back. To be safe it must PUT the message again.
Now consider the partner application receiving the messages. A GET issued outside of syncpoint that reaches the channel agent will permanently remove the message from the queue, even if the channel agent cannot then deliver it. So use of transacted sessions ensures that the message is not lost. But suppose that the message has been received and processed and the COMMIT returns a 2009 Connection Broken. The app has no way to know whether the message was removed during the COMMIT or will be rolled back and delivered again. At the very least the app can avoid losing messages by using transacted sessions to retrieve them, but can not guarantee to never receive a dupe.
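Since duplicates cannot be ruled out, a common way to handle them gracefully is an idempotent receiver. The sketch below is hypothetical and assumes each message carries an application-assigned key that stays the same when the producer re-sends after an ambiguous outcome; an ID generated by the messaging provider at put time may differ between the two puts, so it would not catch that case.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical idempotent receiver: skips messages whose key was already processed.
// Single-threaded sketch; keys are kept in memory only.
public sealed class IdempotentConsumer
{
    private readonly HashSet<string> processedKeys = new HashSet<string>();
    private readonly Action<string> handle;

    public IdempotentConsumer(Action<string> handle)
    {
        this.handle = handle;
    }

    // Returns true if the message was processed, false if it was a duplicate.
    public bool OnMessage(string key, string body)
    {
        if (processedKeys.Contains(key))
            return false;          // already done: just acknowledge/commit it

        handle(body);              // if this throws, the key is not recorded,
                                   // so a redelivery will be processed again
        processedKeys.Add(key);
        return true;
    }
}
```

The in-memory set is only for illustration. To survive crashes, the processed keys would need to be stored durably and updated in the same transaction as the business work (for example, a table with a unique constraint on the key).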
This is of course endemic to all async messaging, not just MQ, which is why the JMS specification directly addresses it. The situation is addressed in all versions; in the JMS 1.1 spec, see section 4.4.13, Duplicate Production of Messages, which states:
If a failure occurs between the time a client commits its work on a
Session and the commit method returns, the client cannot determine if
the transaction was committed or rolled back. The same ambiguity
exists when a failure occurs between the non-transactional send of a
PERSISTENT message and the return from the sending method.
It is up to a JMS application to deal with this ambiguity. In some
cases, this may cause a client to produce functionally duplicate
messages.
A message that is redelivered due to session recovery is not
considered a duplicate message.
If it is critical that the application receive one and only one copy of the message, use 2-Phase transactions. The transaction manager and XA protocol will provide very strong (but still not absolute) assurance that only one copy of the message will be processed by the application.
The behavior of the messaging transport in delivering one and only one copy of a given message is a measure of the reliability of the transport. By contrast, the behavior of an application which relies on receipt of one and only one copy of the message is a measure of the reliability of the application.
Any duplicate messages received from an IBM MQ transport are almost certainly going to be due to the application's failure to use XA to account for the ambiguous outcomes inherent in async messaging and not a defect in MQ. Please keep this in mind when the Production version of the application chokes on its first duplicate message.
On a related note, if Disaster Recovery is involved, the app must also gracefully recover from lost messages, or else find a way to violate the laws of relativity.

TcpClient.BeginRead/TcpClient.EndRead doesn't throw exception when internet disconnected

I'm using TcpListener to accept & read from TcpClient.
The problem is that when reading from a TcpClient, TcpClient.BeginRead / TcpClient.EndRead doesn't throw an exception when the internet connection is lost. It throws an exception only if the client's process is ended or the connection is closed by the server or client.
The system generally has no way of knowing that the connection is broken. The only reliable way to find out is to attempt to send something. When you do, the packet is sent, then lost or bounced, and your system learns that the connection is no longer available and reports the problem back to you as an error code or exception (depending on the environment). Reading is usually not enough, because reading only checks the state of the input buffer and doesn't send anything to the remote side.
As far as I know, low-level sockets don't notify you in such cases. You should provide your own timeout implementation or ping the server periodically.
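The "own timeout" suggestion can be sketched as follows, assuming .NET 4.5+ async APIs; `StreamTimeouts` and `ReadWithTimeoutAsync` are hypothetical names. A timeout alone cannot distinguish a dead link from a peer that simply has nothing to say, so in practice it is usually paired with a periodic heartbeat message whose interval is shorter than the timeout.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class StreamTimeouts
{
    // Waits at most 'timeout' for data. A silent peer (for example because the
    // network path died) then surfaces as a TimeoutException instead of a read
    // that never completes. On timeout the caller should close the connection,
    // which also ends the still-pending read.
    public static async Task<int> ReadWithTimeoutAsync(Stream stream, byte[] buffer, TimeSpan timeout)
    {
        Task<int> readTask = stream.ReadAsync(buffer, 0, buffer.Length);
        Task winner = await Task.WhenAny(readTask, Task.Delay(timeout)).ConfigureAwait(false);
        if (winner != readTask)
            throw new TimeoutException("No data received within " + timeout + ".");
        return await readTask.ConfigureAwait(false); // 0 means the peer closed cleanly
    }
}
```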
If you want to know about when the network status changes you can subscribe to the System.Net.NetworkInformation.NetworkChange.NetworkAvailabilityChanged event. This is not specific to the internet, just the local network.
EDIT
Sorry, I misunderstood. The concept of "connected" really doesn't exist the more you think about it. This post does a great job of going into more details about that. There is a Connected property on the TcpClient but MSDN says (emphasis mine):
Because the Connected property only
reflects the state of the connection
as of the most recent operation, you
should attempt to send or receive a
message to determine the current
state. After the message send fails,
this property no longer returns true.
Note that this behavior is by design.
You cannot reliably test the state of
the connection because, in the time
between the test and a send/receive,
the connection could have been lost.
Your code should assume the socket is
connected, and gracefully handle
failed transmissions.
Basically, the only way to check for a client connection is to try to send data. If it goes through, you're connected. If it fails, you're not.
I don't think you'd want BeginRead and EndRead throwing exceptions, as these should be used in multithreaded scenarios.
You probably need to implement some other mechanism to respond to the dropping of a connection.

Proper implementation of C# TCP reconnecting client

I have to write a TCP Client that will have ability to reconnect to server. The server can be unavailable due to poor network connection quality or some maintenance issues. I'm searching for quality solutions in this area.
My current solution is as follows:
1. Keep the connection state in a ConnectionState enum {Offline, Online, Connecting}.
2. Create the client with the TcpClient class.
3. Create two timers, called ConnectionCheckTimer and ReconnectTimer.
4. Connect to the server.
5. Start the reader thread and the connection check timer.
6. Reading is performed with tcpClient.GetStream() and then reading from this stream.
7. When an exception is caught in readerLoop, the client state is changed to Offline and ReconnectTimer is launched.
8. ConnectionCheckTimer periodically checks lastMessageTimestamp and compares it with the current time; if the interval is greater than maxValue, it launches ReconnectTimer.
Currently I'm not satisfied with this solution because it still generates exceptions, for instance ObjectDisposedException on TcpClient.NetworkStream. I'm looking for a clean and reusable TCP reconnecting client implementation that can cope with all the socket problems that can occur while connecting, disconnecting, and reading data.
If you have connection issues, you will always have exceptions. I think you have a sound outline; you just need to handle the exceptions. You could start with your own Socket class implementation and write the TCP/IP server. Starter code is at MS:
http://msdn.microsoft.com/en-us/library/fx6588te(VS.71).aspx
The C# code is half way down the VB page.
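A minimal sketch of "just handle the exceptions" for the reconnect part, assuming .NET 4.5+. `Reconnect`, `BackoffDelay` and `ConnectWithRetryAsync` are hypothetical names, and doubling the delay up to a cap is one common choice rather than anything the posters prescribed. Only the exception types that usually mean "the connection went away" (including the ObjectDisposedException from the question) are retried; anything else is treated as a bug and propagates.

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

public static class Reconnect
{
    // Delay before retry number 'attempt' (0-based): baseDelay * 2^attempt, capped at maxDelay.
    public static TimeSpan BackoffDelay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        double ms = baseDelay.TotalMilliseconds * Math.Pow(2, Math.Min(attempt, 30));
        return TimeSpan.FromMilliseconds(Math.Min(ms, maxDelay.TotalMilliseconds));
    }

    // Keeps calling 'connect' until it succeeds or the token is cancelled.
    public static async Task<T> ConnectWithRetryAsync<T>(
        Func<Task<T>> connect, TimeSpan baseDelay, TimeSpan maxDelay,
        CancellationToken token, Action<Exception> onRetry = null)
    {
        for (int attempt = 0; ; attempt++)
        {
            token.ThrowIfCancellationRequested();
            try
            {
                return await connect().ConfigureAwait(false);
            }
            catch (Exception ex) when (ex is SocketException || ex is IOException || ex is ObjectDisposedException)
            {
                onRetry?.Invoke(ex); // e.g. log it and set ConnectionState.Connecting
                await Task.Delay(BackoffDelay(attempt, baseDelay, maxDelay), token).ConfigureAwait(false);
            }
        }
    }
}
```

The `connect` delegate should create a fresh TcpClient on every attempt, since a disposed TcpClient cannot be reconnected; the reader loop would call this again whenever it drops into the Offline state.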
The class you should use is "SocketAsyncEventArgs".
I've used it in this project:
http://ts3querylib.codeplex.com/
Have a look at the AsyncTcpDispatcher class.

TcpChannel registration problem

I've read here: Error 10048 when trying to open TcpChannel
I am having what I thought to be a similar problem; apparently not. I took the advice of the first respondent to reset winsock (how does winsock get corrupted, anyhow?). Anyway, here is my channel registration:
channel = new TcpChannel(channelPort);
ChannelServices.RegisterChannel(channel, false);
and the client call:
// Create a channel for communicating w/ the remote object
// Notice no port is specified on the client
TcpChannel channel = new TcpChannel();
ChannelServices.RegisterChannel(channel, false);
// Create an instance of the remote object
CommonDataObject obj = Activator.GetObject( typeof(CommonDataObject) ,
"tcp://localhost:49500/CommonDataObject") as CommonDataObject;
This seems far too straightforward to be such a hassle. But the problem seems to be with the server's ChannelServices.RegisterChannel(...). The reason I included the client portion is that the client, when instanced, checks for the server object. If it can't find it, it 'nudges' the server to instance itself. What I was wondering is whether checking that the object is available first (a la Activator.GetObject(...)) would cause ChannelServices to 'think' this TCP channel is already registered. It sounds dumb, but that is my only possible explanation. I have turned off the firewall and the anti-fungal app, and rebooted. I still receive this:
The channel 'tcp' is already
registered.
I looked at my stack trace and did notice:
at System.Runtime.Remoting.Channels.ChannelServices.RegisterChannelInternal(IChannel chnl, Boolean ensureSecurity)
at System.Runtime.Remoting.Channels.ChannelServices.RegisterChannel(IChannel chnl, Boolean ensureSecurity)
I wondered if RegisterChannelInternal(...) might be what is causing the 'already registered' issue. Other than that, I am at a loss...
It's possible that the call I'm making to check for that Channel is causing it. If that is the consensus, then my question changes to: How can I poll for the Channel?
UPDATE:
After removing the initial check for the server from the client and 'assuming' that the server needs to be instanced, I did discover that the client checking is causing the problem. I've managed to get the server going, and the client did get a 'transparent proxy' object. But the question still remains: "How can I poll to discover if the server is instanced?"
The answer is evidently, yes...when the client is registering the channel, it keeps the server from registering another Tcp channel. I have removed the client instancing of a Tcp channel and the registration.
Since I haven't gotten an answer on pinging, I'm going through with a try/catch block on the obj = Activator.GetObject(...). If obj is returned null, then I 'nudge' the server, it fires up...and then the client connects with the CommonDataObject (derived from MarshalByRefObject).
So, in a sense, that is the polling technique I'm using. I'd like something more elegant - that is, an implementation that didn't work by causing a failure. To me, that's more of a hack work-around than a solution.
I found the answer here. Thanks to Abhijeet for the inadvertent solution!!! Btw...don't forget to declare:
using System.Linq;
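The remaining question, polling for the server without provoking a failure, can also be approached one level down. As a hedged alternative (not the solution the poster linked to, which isn't reproduced here), a client could check whether anything is accepting TCP connections on the server's port, 49500 in the question, before going through the remoting proxy. `ServerProbe` and `IsListening` are hypothetical names.

```csharp
using System;
using System.Net.Sockets;
using System.Threading.Tasks;

public static class ServerProbe
{
    // Returns true if something accepts a TCP connection on host:port within
    // 'timeout'. This only proves that a process is listening on the port,
    // not that the remote object behind it is registered and healthy.
    public static bool IsListening(string host, int port, TimeSpan timeout)
    {
        using (var client = new TcpClient())
        {
            try
            {
                Task connect = client.ConnectAsync(host, port);
                return connect.Wait(timeout) && client.Connected;
            }
            catch (AggregateException)
            {
                return false; // typically "connection refused": nothing is listening
            }
            catch (SocketException)
            {
                return false;
            }
        }
    }
}
```

For example, ServerProbe.IsListening("localhost", 49500, TimeSpan.FromSeconds(1)) returning false would be the cue to 'nudge' the server. A successful probe does not guarantee the first remote call will work, so that call should still stay inside a try/catch.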

How do you deal with transport-level errors in SqlConnection?

Every now and then in a high volume .NET application, you might see this exception when you try to execute a query:
System.Data.SqlClient.SqlException: A transport-level error has
occurred when sending the request to the server.
According to my research, this is something that "just happens" and not much can be done to prevent it. It does not happen as a result of a bad query, and generally cannot be duplicated. It just crops up maybe once every few days in a busy OLTP system when the TCP connection to the database goes bad for some reason.
I am forced to detect this error by parsing the exception message, and then retrying the entire operation from scratch, to include using a new connection. None of that is pretty.
Anybody have any alternate solutions?
I posted an answer on another question on another topic that might have some use here. That answer involved SMB connections, not SQL. However it was identical in that it involved a low-level transport error.
What we found was that in a heavy load situation, it was fairly easy for the remote server to time out connections at the TCP layer simply because the server was busy. Part of the reason was the defaults for how many times TCP will retransmit data on Windows weren't appropriate for our situation.
Take a look at the registry settings for tuning TCP/IP on Windows. In particular you want to look at TcpMaxDataRetransmissions and maybe TcpMaxConnectRetransmissions. These default to 5 and 2 respectively, try upping them a little bit on the client system and duplicate the load situation.
Don't go crazy! TCP doubles the timeout with each successive retransmission, so the timeout behavior for bad connections can go exponential on you if you increase these too much. As I recall upping TcpMaxDataRetransmissions to 6 or 7 solved our problem in the vast majority of cases.
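To put numbers on "exponential", here is an idealized calculation: assume the stack starts from a retransmission timeout r, doubles it after every retransmission with no cap, and gives up after waiting out the last one. With n retransmissions, the total time before the connection is declared dead is

$$T(n) = \sum_{k=0}^{n} 2^{k}\, r = \left(2^{n+1} - 1\right) r$$

For example, with r = 1 s, n = 5 gives 63 s and n = 7 gives 255 s, so two extra retransmissions roughly quadruple how long a dead connection can hang. Real stacks derive r from measured round-trip times and may cap it, so treat these figures as illustrative only.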
This blog post by Michael Aspengren explains the error message "A transport-level error has occurred when sending the request to the server."
To answer your original question:
A more elegant way to detect this particular error, without parsing the error message, is to inspect the Number property of the SqlException.
(This actually returns the error number from the first SqlError in the Errors collection, but in your case the transport error should be the only one in the collection.)
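Building on the Number suggestion, a hypothetical helper might look like this. The set of numbers is an assumption: they are the Windows error codes behind messages quoted in this thread ("An existing connection was forcibly closed by the remote host", "The semaphore timeout period has expired") plus two closely related ones, so check them against the SqlException.Number values you actually log. It takes a plain int so it can be exercised without a live SqlException; with System.Data.SqlClient you would call IsTransportError(ex.Number) inside a catch (SqlException ex) block.

```csharp
using System.Collections.Generic;

public static class SqlTransportErrors
{
    // Assumed set of transport-level error numbers (Windows system/socket codes
    // surfaced by the TCP provider). Verify against your own logs.
    private static readonly HashSet<int> TransportErrorNumbers = new HashSet<int>
    {
        64,     // the specified network name is no longer available
        121,    // the semaphore timeout period has expired
        10053,  // an established connection was aborted by the software in your host machine
        10054,  // an existing connection was forcibly closed by the remote host
    };

    // Pass SqlException.Number (or each SqlError.Number in SqlException.Errors).
    public static bool IsTransportError(int errorNumber) => TransportErrorNumbers.Contains(errorNumber);
}
```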
I had the same problem albeit it was with service requests to a SQL DB.
This is what I had in my service error log:
System.Data.SqlClient.SqlException: A transport-level error has occurred when sending the request to the server. (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.)
I have a C# test suite that tests a service. The service and DB were both on external servers, so I thought that might be the issue. So I deployed the service and DB locally, to no avail: the issue continued. The test suite isn't even a demanding performance test, so I had no idea what was happening. The same test was failing each time, but when I disabled that test, another one would fail continuously.
I tried other methods suggested on the Internet that didn't work either:
Increase the registry values of TcpMaxDataRetransmissions and TcpMaxConnectRetransmissions.
Disable the "Shared Memory" option within SQL Server Configuration Manager under "Client Protocols" and sort TCP/IP to 1st in the list.
This might occur when you are testing scalability with a large number of client connection attempts. To resolve this issue, use the regedit.exe utility to add a new DWORD value named SynAttackProtect to the registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\ with value data of 00000000.
My last resort was the old adage "try and try again". So I have nested try-catch statements to ensure that if the TCP/IP connection is lost in the lower communications protocol, the code doesn't just give up there but tries again. This is now working for me; however, it's not a very elegant solution.
use Enterprise Services with transactional components
I have seen this happen in my own environment a number of times. The client application in this case is installed on many machines. Some of those machines happen to be laptops: people were leaving the application open, disconnecting the laptop, then plugging it back in and attempting to use the application. This causes the error you have mentioned.
My first step would be to look at the network and ensure that the servers aren't on DHCP, renewing IP addresses and causing this error. If that isn't the case, then you have to start trawling through your event logs looking for other network-related errors.
Unfortunately, as stated above, it is a network error. The main thing you can do is monitor the connections using a tool like netmon and work back from there.
Good Luck.
You should also check hardware connectivity to the database.
Perhaps this thread will be helpful:
http://channel9.msdn.com/forums/TechOff/234271-Conenction-forcibly-closed-SQL-2005/
I'm using a reliability layer around my DB commands (abstracted away behind the repository interface). Basically that's just code that intercepts any expected exception (DbException, and also InvalidOperationException, which happens to get thrown on connectivity issues), logs it, captures statistics and retries everything again.
With that reliability layer present, the service has been able to survive stress-testing gracefully (constant dead-locks, network failures etc). Production is far less hostile than that.
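The retry-everything approach can be sketched generically. `Retry.Execute` is a hypothetical helper, not the poster's code: each attempt runs the whole unit of work from scratch (open a new connection, run the command, close), so a retry never reuses a broken connection, and only exceptions accepted by the `isTransient` predicate are retried. For SqlClient, that predicate would typically look at SqlException.Number.

```csharp
using System;
using System.Threading;

public static class Retry
{
    // Runs 'operation' up to maxAttempts times. Non-transient exceptions, and the
    // exception from the final attempt, propagate to the caller unchanged.
    public static T Execute<T>(Func<T> operation, Func<Exception, bool> isTransient,
                               int maxAttempts, TimeSpan delayBetweenAttempts,
                               Action<Exception, int> log = null)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                log?.Invoke(ex, attempt);   // log and capture statistics, as described above
                Thread.Sleep(delayBetweenAttempts);
            }
        }
    }
}
```

Retrying is only safe for work that is idempotent or was rolled back: if the transport error hits after the server has already committed, a blind retry repeats the work, which is the same ambiguous-outcome problem as with messaging.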
PS: There is more on that here (along with a simple way to define reliability with the interception DSL)
I had the same problem. I asked my network geek friends, and they all said what people have replied here: it's the connection between the computer and the database server. In my case it was my Internet Service Provider, or their router, that was the problem. After a router update, the problem went away. But do you have any other drop-outs of the internet connection from your computer or server? I had...
I experienced the transport error this morning in SSMS while connected to SQL 2008 R2 Express.
I was trying to import a CSV with \r\n. I coded my row terminator for 0x0d0x0a. When I changed it to 0x0a, the error stopped. I can change it back and forth and watch it happen/not happen.
BULK INSERT #t1 FROM 'C:\123\Import123.csv' WITH
( FIRSTROW = 1, FIELDTERMINATOR = ',', ROWTERMINATOR = '0x0d0x0a' )
I suspect I am not writing my row terminator correctly, because SQL parses one character at a time while I'm trying to pass two characters.
Anyhow, this error is 4 years old now, but it may provide a bit of information for the next user.
I just wanted to post a fix here that worked for our company on new software we've installed. We were getting the following error since day 1 on the client log file: Server was unable to process request. ---> A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The semaphore timeout period has expired.) ---> The semaphore timeout period has expired.
What completely fixed the problem was to set up a link aggregate (LAG) on our switch. Our Dell FX1 server has redundant fiber lines coming out of the back of it. We did not realize that the switch they're plugged into needed to have a LAG configured on those two ports. See details here: https://docs.meraki.com/display/MS/Switch+Ports#SwitchPorts-LinkAggregation
