MSDN states that Socket.Shutdown can throw a SocketException. I've had this happen to me in production recently after introducing a load balancer between my clients and my server. But I cannot reproduce it in testing without a load balancer. Can you?
Some background - I have a server application written in C# that uses TCP sockets to communicate with clients. The application protocol is very simple for the server: accept connection, read request, send response, wait for client shutdown (read expecting 0 bytes), shutdown.
This code has been in production without issue for many years. However, after introducing a load balancer in front of multiple server machines, one of the server processes crashed due to an unhandled SocketException raised when the server called Socket.Shutdown. The particular client had timed out whilst waiting for the server to respond and attempted to close the connection early. The exception message on the server was "An existing connection was forcibly closed by the remote host." It is not unusual for the client to do this, but obviously, prior to the load balancer, the server raised this error at a different point in the code. Still, it's clearly a server bug and the fix is obvious: handle the exception.
However, using a test client application (also written in C#), I cannot find a sequence of operations that will cause the server to raise an exception during Socket.Shutdown. It appears that the load balancer did something unusual to the TCP packets, but still, I dislike using that as an excuse for failing to reproduce the issue.
I can run both server and client code under the debugger, and I have Wireshark watching the packets.
On the client side, after the connection is established, the operations are:
Socket.Send() // single call
Socket.Receive() // this one times out in our scenario
Socket.XXX() // various choices as described below
On the server side, after the connection is established, the operations are:
1) Socket.Receive() //multiple calls until complete message is received
2) // Processing...
3) Socket.Send() // single call
4) Socket.Receive() // single call expecting 0 bytes
5) Socket.Shutdown()
Presume each call is wrapped with try..catch(SocketException)
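For orientation, here is a minimal sketch of that server flow, with the steps numbered as above (the class name, buffer size, and response are illustrative, and the single try..catch is a shortcut; the production code wraps each call individually):

using System;
using System.Net.Sockets;
using System.Text;

static class ServerFlow
{
    // Hypothetical per-connection handler mirroring steps 1-5 above; not the real production code.
    public static void HandleClient(Socket socket)
    {
        var buffer = new byte[4096];
        try
        {
            // 1) Receive (the real server loops here until the full request has arrived).
            socket.Receive(buffer);

            // 2) Processing... (placeholder response)
            byte[] response = Encoding.ASCII.GetBytes("OK");

            // 3) Single send.
            socket.Send(response);

            // 4) Single receive expecting 0 bytes (the client's FIN).
            socket.Receive(buffer);

            // 5) Shutdown -- the call that threw SocketException in production.
            socket.Shutdown(SocketShutdown.Both);
        }
        catch (SocketException ex)
        {
            // The obvious fix: treat an abortive close by the peer as the end of the conversation.
            Console.WriteLine($"Socket error: {ex.SocketErrorCode}");
        }
        finally
        {
            socket.Close();
        }
    }
}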
A) If I pause the server during step 2, wait for the client to time out, and initiate a client shutdown using Socket.Shutdown(SocketShutdown.Send), a FIN packet is sent to the server. When the server resumes processing, all the remaining calls (3 through 5) succeed, because that's a perfectly acceptable TCP flow.
B) If I pause the server during step 2, wait for the client to time out, and initiate a client shutdown using Socket.Shutdown(SocketShutdown.Both) or Socket.Close(), again a FIN packet is sent to the server. When the server resumes processing, step 3 succeeds, but it causes the client to send an RST packet in response, as it is not accepting more data. If this RST arrives before step 4, then Socket.Receive throws and step 5 succeeds. If it arrives after step 4, then Socket.Receive succeeds (returns 0 bytes) and step 5 still succeeds.
C) If the client has "Don't Linger" set (linger enabled with a 0 timeout), and I pause the server during processing, wait for the client to time out, and initiate a client shutdown using Socket.Shutdown(SocketShutdown.Both) or Socket.Close(), an RST packet is immediately sent to the server. When the server resumes processing, steps 3 and 4 fail, but step 5 still succeeds.
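For reference, the scenario C client can be approximated like this (a sketch; the endpoint, request text, and 2-second timeout are placeholders for the real test setup):

using System;
using System.Net;
using System.Net.Sockets;
using System.Text;

class AbortiveClient
{
    static void Main()
    {
        // Placeholder endpoint; substitute the test server's address and port.
        var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        client.Connect(new IPEndPoint(IPAddress.Loopback, 9000));

        // "Don't Linger": linger enabled with a 0-second timeout, so Close() emits RST instead of FIN.
        client.LingerState = new LingerOption(true, 0);
        client.ReceiveTimeout = 2000;  // short timeout to simulate the client giving up

        client.Send(Encoding.ASCII.GetBytes("request"));

        try
        {
            var buffer = new byte[4096];
            client.Receive(buffer);    // times out while the server is paused in step 2
        }
        catch (SocketException ex) when (ex.SocketErrorCode == SocketError.TimedOut)
        {
            // Give up waiting, as the production client did.
        }

        client.Close();                // with the linger settings above this sends RST, not FIN
    }
}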
I think what puzzles me most is that Socket.Shutdown appears to ignore my test client's RST packets, and yet evidently my load balancer was able to send an RST packet that was not ignored. What am I missing? What else can I try?
Related
I'm developing a C# application, working with TCP sockets.
While debugging, I arrive in this piece of source code:
if ((_socket != null) && (_socket.Connected))
{
    Debug.WriteLine($"..."); // <= my breakpoint is here.
    return true;
}
In my watch-window, the value of _socket.RemoteEndPoint is:
_socket.RemoteEndPoint {10.1.0.160:50001} System.Net.EndPoint {...}
Still, in commandline, when I run netstat -aon | findstr /I "10.1.0.160", I just see this:
TCP 10.1.13.200:62720 10.1.0.160:3389 ESTABLISHED 78792
TCP 10.1.13.200:63264 10.1.0.160:445 ESTABLISHED 4
=> the remote endpoint "10.1.0.160:50001" is not visible in netstat result.
As netstat seems not reliable for testing TCP sockets, what tool can I use instead?
(For your information: even after having run further, there still is no "netstat" entry.)
The documentation of Socket.Connected says:
The value of the Connected property reflects the state of the
connection as of the most recent operation. If you need to determine
the current state of the connection, make a nonblocking, zero-byte
Send call. If the call returns successfully or throws a WSAEWOULDBLOCK
error code (10035), then the socket is still connected; otherwise, the
socket is no longer connected.
So if it returns true, the socket has been "connected" at some time in the past, but it is not necessarily still alive right now.
That's because it's not possible to detect whether your TCP connection is still alive with certainty without contacting the other side in one way or another. The TCP connection is kind of "virtual": the two sides just exchange packets, but there is no hard link between them. When one side decides to finish the communication, it sends a packet and waits for a response from the other side. If all goes well, the two sides will both agree that the connection is closed.
However, if side A does NOT send this close packet, for example because it crashed or the network died, the other side B has no way to figure out that the connection is no longer active UNTIL it tries to send some data to A. Then that send will fail, and now B knows the connection is dead.
So if you really need to know whether the other side is still alive, you have to send some data to it. You can use keepalive, which is available on TCP sockets and basically does the same thing (sends some data from time to time). Or, if you always write on this connection first (say the other side is a server and you make requests to it from time to time, but do not expect any data between those requests), then just don't check whether the other side is alive; you will find out the next time you attempt to write.
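For completeness, the zero-byte non-blocking Send probe from the documentation quoted above looks roughly like this (a sketch; it only tells you the connection was alive at the instant of the call, not that it still is a moment later):

using System.Net.Sockets;

static class SocketProbe
{
    // Best-effort check: true means the connection was alive at the moment of the probe.
    public static bool ProbeConnection(Socket socket)
    {
        bool blockingState = socket.Blocking;
        try
        {
            socket.Blocking = false;
            socket.Send(new byte[1], 0, 0, SocketFlags.None);   // zero-byte, non-blocking send
            return true;
        }
        catch (SocketException ex)
        {
            // WouldBlock (10035, WSAEWOULDBLOCK): the send buffer is full but the socket is still connected.
            return ex.SocketErrorCode == SocketError.WouldBlock;
        }
        finally
        {
            socket.Blocking = blockingState;
        }
    }
}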
I have a Dotnet application (running on a Windows PC) which communicates with a Linux box via OPC UA. The use case here is to make ~40 read requests to the server in serial. Once these 40 read calls are complete, the next cycle of 40 read calls begins. Each read call returns a response from the server carrying a payload of ~16KB which is fragmented and delivered to the client. For most requests, the server finishes delivering the complete response within 5ms. However for some requests it takes ~300 ms to complete.
In scenarios where this delay exists, I can see the following pattern of re-transmissions.
[71612] A new Read request is sent to the server.
[71613-71630] The response is delivered to the client.
[71631] A new Read request is sent to the server.
[71632] A TCP Spurious Retransmission occurs from the server for packet [71844] with Seq No. 61624844
[71633] Client sends a DUP ACK for the packet.
[71634] Client does a TCP Retransmission for the read request in [71846] after 288ms
This delay adds up and causes some 5-6 seconds of delay for a complete cycle of 40 requests. I want to figure out what is causing these retransmissions (and hence the delays) and what can possibly be done to:
Reduce the frequency of retransmissions.
Reduce the 300ms delay from the client side to quickly retransmit the obstructed read request.
I have tried disabling the Nagle algorithm on the server to possibly improve performance, but it did not have any effect. Also, when reducing the response size by half (8 KB), the retransmissions are rare and the delay is correspondingly small. But reducing the response size is not a valid solution in our use case.
The connection to the Linux box is through a switch; however, even when connecting to it directly point-to-point, there is only a marginal reduction in the delay.
I can share relevant code, but I think this issue lies with the TCP stack (or at least with some configuration that should be enabled?), so it would make little difference.
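For reference, the Nagle change mentioned above maps to the following on the .NET client side (a sketch only; the raw TcpClient here is hypothetical, since a real OPC UA stack may expose its own transport settings, and whether either setting helps depends on where the loss actually occurs):

using System.Net.Sockets;

class NoDelayExample
{
    static void Main()
    {
        // Hypothetical raw TCP client standing in for the OPC UA transport.
        using var client = new TcpClient();
        client.NoDelay = true;                  // disable Nagle (TCP_NODELAY) on the .NET side
        client.ReceiveBufferSize = 64 * 1024;   // a larger receive buffer can also be worth trying for ~16 KB responses
    }
}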
I'd like to wait for a slow response from a client with TcpClient but get a timeout after about 20s no matter how I configure it. This is my attempt:
using (var client = new TcpClient { ReceiveTimeout = 9999999, SendTimeout = 9999999 })
{
    await client.ConnectAsync(ip, port);
    using (var stream = client.GetStream())
    {
        // Some quick read/writes happen here via the stream with stream.Write() and stream.Read(), successfully.
        // Now the remote host is calculating something long and will reply if finished. This throws the below exception however instead of waiting for >20s.
        var bytesRead = await stream.ReadAsync(new byte[8], 0, 8);
    }
}
The exception is an IOException:
Unable to read data from the transport connection: A connection
attempt failed because the connected party did not properly respond
after a period of time, or established connection failed because
connected host has failed to respond.
...which contains a SocketException inside:
A connection attempt failed because the connected party did not
properly respond after a period of time, or established connection
failed because connected host has failed to respond
SocketErrorCode is TimedOut.
The 20s seems to be an OS default on Windows but isn't it possible to override it from managed code by interacting with TcpClient? Or how can I wait for the response otherwise?
I've also tried the old-style BeginRead-EndRead way and the same happens on EndRead. The problem is also not caused by Windows Firewall or Defender.
I'd like to wait for a slow response from a client
It's important to note that it's the connection that is failing. The connection timeout is only for establishing a connection, which should always be very fast. In fact, the OS will accept connections on behalf of an application, so you're literally just talking about a packet round-trip. 21 seconds should be plenty.
Once the connection is established, then you can just remove the ReceiveTimeout/SendTimeout and use asynchronous reads to wait forever.
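A minimal sketch of that suggestion, assuming a placeholder host and port (note that ReceiveTimeout/SendTimeout only affect synchronous calls anyway, so the asynchronous read simply waits):

using System.Net.Sockets;
using System.Threading.Tasks;

class SlowResponseClient
{
    // Waits for the reply without any socket-level timeouts.
    static async Task<int> ReadReplyAsync(string host, int port)
    {
        using var client = new TcpClient();
        await client.ConnectAsync(host, port);
        using var stream = client.GetStream();

        // ... the quick request/response exchange would go here ...

        var buffer = new byte[8];
        // Waits indefinitely from the application's point of view; it can still fail if the
        // OS declares the connection dead, which is what the accepted answer below describes.
        return await stream.ReadAsync(buffer, 0, buffer.Length);
    }
}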
It turns out that the remote host wasn't responding in a timely manner, hence the problem. Let me elaborate, and though this will be a solution very specific to my case maybe it will be useful for others too.
The real issue wasn't a timeout per se, as the exception indicated, but rather what the exceptions thrown on subsequent Read() calls showed: "An existing connection was forcibly closed by the remote host"
The remote host wasn't purposely closing the connection. Rather, what happened is that when it was slow to respond it was actually so busy that it wasn't processing any TCP traffic either. While the local host wasn't explicitly sending anything while waiting for a response, this was still an issue: the local host tried to send ACKs for previous transmissions from the remote host. Since these couldn't be delivered, the local host concluded that the remote host had "forcibly closed" the connection.
I got the clue from looking at the traffic with Wireshark (always good to try to look at what's beneath the surface instead of guessing around): it was apparent that while the remote host was busy it showed complete radio silence. At the same time Wireshark showed retransmission attempts carried out by the local host, indicating that this is behind the issue.
Thus the solution couldn't be implemented on the local host; the behavior of the remote host needed to be changed.
I'm writing a service that needs to maintain a long running SSL connection to a remote server. I need this server to be self-healing, that is if it's disconnected for any reason then the next time it's written to it will reconnect. I've tried this:
bool isConnected = client.Connected && client.Client.Poll(0, SelectMode.SelectWrite) && stream.CanWrite;
if (!isConnected)
{
    this.connected = false;
    GetConnection();
}
stream.Write(bytes, 0, bytes.Length);
stream.Flush();
But I find it doesn't act as I would expect. If I simulate a network outage by disabling my wifi, I'm still able to write to the stream with stream.Write() for approximately 20 seconds. Then the next time I try to write to it, none of client.Connected, client.Client.Poll(), or stream.CanWrite returns false, but when I go to write to the stream I get a socket exception. Finally, if I try to recreate the connection, I get this exception: An existing connection was forcibly closed by the remote host.
I would appreciate any help creating a long-running SslStream that can withstand network failure. Thanks!
From a 10,000-foot point of view:
The reason you can still write to the stream after shutting down your wifi is that there is a network buffer holding the data for transmission. Success from stream.Write/stream.Flush means the network interface (TCP/IP stack) has accepted the data and it has been buffered for transmission, not that the data has reached its target.
It takes time for the TCP/IP stack to notice a full media disconnection (connection lost/reset), because even if there is no physical link, TCP/IP will see this as a temporary issue in the network and will keep retrying for a while (the network could be dropping packets at some point and the stack will keep retrying).
If you think about it the other way around, you wouldn't want all your programs to fail whenever there is a network hiccup (this happens all too often on the internet), so TCP/IP takes its time before notifying the application layer that the connection has become invalid (after retrying several times and waiting a reasonable amount of time).
You can always reconnect to the server when the SslStream fails and continue sending data, although you will find it is not as easy as that, because there are scenarios where you send data that is never received by the server, and others where the server receives the data but you never get any ACK from it at all... So depending on your needs, self-healing alone may not be enough.
Self-healing is simple to implement; data consistency and reliability are harder and usually require the server to support some kind of reliable messaging mechanism to ensure all data has been sent and received.
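A minimal sketch of the self-healing part alone, assuming a caller-supplied delegate that rebuilds the TcpClient/SslStream (all names here are illustrative; note the comment about the consistency gap described above):

using System;
using System.IO;
using System.Net.Security;
using System.Net.Sockets;

class SelfHealingWriter
{
    private SslStream _stream;
    private readonly Func<SslStream> _connect;   // placeholder: creates the TcpClient and authenticates the SslStream

    public SelfHealingWriter(Func<SslStream> connect)
    {
        _connect = connect;
        _stream = connect();
    }

    // Write with a single reconnect-and-retry. If the first attempt fails we cannot know
    // whether the server received the data; that is the consistency problem mentioned above.
    public void Write(byte[] bytes)
    {
        try
        {
            _stream.Write(bytes, 0, bytes.Length);
            _stream.Flush();
        }
        catch (Exception ex) when (ex is IOException || ex is SocketException || ex is ObjectDisposedException)
        {
            _stream.Dispose();
            _stream = _connect();              // self-heal: rebuild TCP + SSL
            _stream.Write(bytes, 0, bytes.Length);
            _stream.Flush();
        }
    }
}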
The underlying protocol for SSL is TCP. TCP will usually only send data if the application wants it to deliver data, or if it needs to reply to data received from the other side by sending an ACK. This means that a broken connection, like a lost link, will not be noticed until you try to send some data. But you will not notice it immediately, because:
A write to the socket will only deliver the data to the OS kernel and return success if this delivery was successful.
The kernel will then try to deliver the data to the peer and will wait for the ACK from the other side.
If it does not get an ACK, it will retry the delivery, and only after several unsuccessful retries will the kernel declare the connection broken.
Only after the connection is marked broken by the kernel will the next write or read return the error from kernel to user space, such as returning EPIPE on a write.
This means that if you want to know up front whether the connection is still alive, you have to make sure there is a regular data exchange on the connection. At the TCP level you might set TCP_KEEPALIVE, but this might use an interval of several hours between keep-alive packets. At the SSL layer you might try to use the infamous heartbeat extension, but most peers will not understand it. The last option is to implement some kind of heartbeat in your own application.
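For reference, turning on TCP keep-alive from C# looks roughly like this; the basic option works everywhere, while the fine-grained interval options assume .NET Core 3.0 or later (older frameworks need Socket.IOControl with keep-alive values), and the specific intervals below are illustrative:

using System.Net.Sockets;

static class KeepAliveConfig
{
    public static void EnableKeepAlive(Socket socket)
    {
        // Turn keep-alive on (available on all .NET versions).
        socket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.KeepAlive, true);

        // Tune the default multi-hour interval (these options exist on .NET Core 3.0+ / .NET 5+).
        socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveTime, 60);      // seconds idle before probes start
        socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveInterval, 10);  // seconds between probes
        socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveRetryCount, 5); // probes before giving up
    }
}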
As for the self-healing: when reconnecting you get a new TCP connection, and you also need to do a full SSL handshake, because the last SSL connection was not cleanly closed and thus cannot be resumed. The server has no idea that this new connection is just a continuation of the old one, so you have to implement some kind of meta-connection spanning multiple TCP connections inside your application layer, on both client and server. Inside this meta-connection you need your own data tracking to detect which data were really accepted by the peer and which were only sent but never explicitly acknowledged because the connection broke. Sounds like a kind of TCP on top of TCP.
Error:
Unable to read data from the transport connection: A blocking operation was interrupted by a call to WSACancelBlockingCall
Situation
There is a TCP Server
My web application connects to this TCP Server
Using the below code:
TcpClientInfo = new TcpClient();
_result = TcpClientInfo.BeginConnect(<serverAddress>, <portNumber>, null, null);
bool success = _result.AsyncWaitHandle.WaitOne(20000, true);
if (!success)
{
    TcpClientInfo.Close();
    throw new Exception("Connection Timeout: Failed to establish connection.");
}
NetworkStreamInfo = TcpClientInfo.GetStream();
NetworkStreamInfo.ReadTimeout = 20000;
2 users use the same application from two different locations to access information from this server at the SAME TIME
The server takes around 2 seconds to reply
Both connect
But one of the users gets the above error
"Unable to read data from the transport connection: A blocking operation was interrupted by a call to WSACancelBlockingCall"
when trying to read data from the stream
How can I resolve this issue?
Use a better way of connecting to the server
Can't because it's a server issue
If it is a server issue, how should the server handle requests to avoid this problem?
This looks Windows-specific to me, which isn't my strong point, but...
You don't show us the server code, only the client code. I can only assume, then, that your server code accepts a socket connection, does its magic, sends something back, and closes the client connection. If this is your case, then that's the problem.
The accept() call is a blocking one that waits for the next client connection attempt and binds to it. There may be a queue of connection attempts created and administered by the OS, but it can still only accept one connection at a time.
If you want to be able to handle multiple simultaneous requests, you have to change your server to call accept(), and when a new connection comes in, launch a worker thread/process to handle the request and go back to the top of the loop where the accept() is. So the main loop hands off the actual work to another thread/process so it can get back to the business of waiting for the next connection attempt.
Real server applications are more complex than this. They launch a bunch of "worker bee" threads/processes in a pool and reuse them for future requests. Web servers do this, for instance.
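In C# terms, the accept-and-dispatch loop described above looks roughly like this (a sketch; the port, the 2-second Task.Delay, and the echo response are stand-ins for whatever the real server does):

using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

class ConcurrentTcpServer
{
    static async Task Main()
    {
        var listener = new TcpListener(IPAddress.Any, 9000);   // placeholder port
        listener.Start();

        while (true)
        {
            TcpClient client = await listener.AcceptTcpClientAsync();
            // Hand the connection off so the loop can immediately accept the next caller.
            _ = Task.Run(() => HandleClientAsync(client));
        }
    }

    static async Task HandleClientAsync(TcpClient client)
    {
        using (client)
        using (var stream = client.GetStream())
        {
            var buffer = new byte[4096];
            int read = await stream.ReadAsync(buffer, 0, buffer.Length);

            await Task.Delay(TimeSpan.FromSeconds(2));   // stand-in for the real 2-second work
            await stream.WriteAsync(buffer, 0, read);    // echo back as a placeholder response
        }
    }
}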
If my assumptions about your server code are wrong, please enlighten us as to what it looks like.
Just a thought.
If your server takes 2 seconds to respond, shouldn't the timeout value be 2000 instead of 20000 (which is 20 seconds)? The first argument to AsyncWaitHandle.WaitOne() is in milliseconds.
If you are waiting 20 seconds, maybe your server is disconnecting you for being idle?