Unable to resolve DNS (sometimes?)

My application requests 10,000 URLs in parallel, 100 at a time, and for 50 to 5,000 of them I receive the following error:
The remote name could not be resolved: 'www.url.com'
I understand that the error means the DNS server was unable to resolve the hostname. However, the number of URLs that cannot be resolved changes on each run (ranging from 50 to 5,000).
Am I making too many requests too fast? Is that even possible? Running the same test on a much more powerful server shows that only 10 URLs could not be resolved, which sounds much more realistic.
The code that does the parallel requesting:
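using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

var semp = new SemaphoreSlim(100);
var uris = File.ReadAllLines(@"C:\urls.txt").Select(x => new Uri(x));
foreach (var uri in uris)
{
    Task.Run(async () =>
    {
        // Allow at most 100 requests in flight at any time.
        await semp.WaitAsync();
        try
        {
            var result = await Web.TryGetPage(uri); // Using HttpWebRequest
        }
        finally
        {
            semp.Release(); // Free the slot even if the request throws.
        }
    });
}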

I'll bet that you didn't know that the DNS lookup of HttpWebRequest (which is the cornerstone of all .NET HTTP APIs) happens synchronously, even when making async requests (annoying, right?). This means that firing off many requests at once puts severe strain on the ThreadPool and adds a large amount of latency, which can lead to unexpected timeouts. If you really want to step things up, don't use the .NET DNS implementation. You can use a third-party library to resolve hosts, create your web request with an IP address instead of a hostname, and then manually set the Host header before firing off the request. You can achieve much higher throughput this way.
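A minimal sketch of that idea, under some assumptions: the `resolve` delegate is a hypothetical stand-in for whichever third-party DNS library you choose, and `PreResolvedRequest` is a name made up for this example. Only `WebRequest.CreateHttp`, `UriBuilder` and the `HttpWebRequest.Host` property are real framework APIs here. Note that this is only straightforward for plain HTTP; with HTTPS the certificate check runs against the IP address unless you handle validation yourself.

```csharp
using System;
using System.Net;
using System.Threading.Tasks;

// HttpWebRequest is flagged obsolete on newer .NET versions; it is used here
// because it is the API the question is about.
#pragma warning disable SYSLIB0014

public static class PreResolvedRequest
{
    // Builds a request that connects to `address` directly but still sends
    // the original hostname in the Host header.
    public static HttpWebRequest Create(Uri original, IPAddress address)
    {
        // Same scheme, port, path and query; only the host is swapped for the IP.
        Uri byIp = new UriBuilder(original) { Host = address.ToString() }.Uri;

        HttpWebRequest request = WebRequest.CreateHttp(byIp);

        // The Host header must carry the name (plus port if it is non-default),
        // otherwise name-based virtual hosting on the server breaks.
        request.Host = original.IsDefaultPort
            ? original.Host
            : original.Host + ":" + original.Port;
        return request;
    }

    // `resolve` is an assumption: plug in the lookup call of your DNS library.
    public static async Task<HttpWebRequest> CreateAsync(
        Uri original, Func<string, Task<IPAddress>> resolve)
    {
        IPAddress address = await resolve(original.Host).ConfigureAwait(false);
        return Create(original, address);
    }
}
```

Because the resolver is injected, you can also cache results per hostname in front of it, so 10,000 URLs on a few hundred hosts only cost a few hundred lookups.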

It does sound like you're swamping your local DNS server (in the jargon, your local recursive DNS resolver).
When your program issues a DNS resolution request, it sends a port 53 datagram to the local resolver. That resolver responds either by replying from its cache or recursively resending the request to some other resolver that's been identified as possibly having the record you're looking for.
So, your multithreaded program is causing a lot of datagrams to fly around. Internet Protocol hosts and routers handle congestion and overload by dropping datagram packets. It's like handling a traffic jam on a bridge by bulldozing cars off the bridge. In an overload situation, some packets just disappear.
So, it's up to endpoint software using datagram protocols to try again if their packets get lost. That's the purpose of TCP, and that's how it can provide the illusion of an error-free stream of data even though it can only communicate with datagrams.
So, your program will need to try again when you get a resolution failure on some of your DNS requests. You're a datagram endpoint, so you own the responsibility of retrying. I suspect the .NET library is giving you back a failure when some of your requests time out because your datagrams got dropped.
Now, here's the important thing. It is also the responsibility of a datagram endpoint program, like yours, to implement congestion control. TCP does this automatically using its sliding window system, with an algorithm called slow-start / exponential backoff. If TCP didn't do this all internet routers would be congested all the time. This algorithm was dreamed up by Van Jacobson, and you should go read about it.
In the meantime you should implement a simple form of it in your bulk DNS lookup program. Here's how you might do that.
Start with a batch size of, say, 5 lookups.
Every time you get the whole batch back successfully, increase your batch size by one for your next batch. This is the slow-start idea (strictly speaking, growing by a fixed step is what TCP calls additive increase; its slow-start phase grows faster). As long as you're not getting congestion, you increase the network load.
Every time you get a failure to resolve a name, reduce the size of the next batch by half. So, for example, if your batch size was 30 and you got a failure, your next batch size will be 15. This is exponential backoff (multiplicative decrease, in TCP terms). You respond to congestion by dramatically reducing the load you're putting on the network.
Implement a maximum batch size of something like 100 just to avoid being too much of a pig and looking like a crude denial-of-service attack to the DNS system.
I had a similar project a while ago and this strategy worked well for me.
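The batch-size rules above could be sketched roughly like this. All names here (`AdaptiveBatchSize`, `BulkLookup`, `tryLookup`) are made up for the example, and `tryLookup` is an assumed callback that returns false when a name fails to resolve. It is a sketch of the strategy, not the code from my project.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public sealed class AdaptiveBatchSize
{
    private readonly int _min;
    private readonly int _max;

    public int Current { get; private set; }

    // Defaults follow the numbers in the text: start at 5, cap at 100.
    public AdaptiveBatchSize(int initial = 5, int min = 1, int max = 100)
    {
        if (min < 1 || max < min || initial < min || initial > max)
            throw new ArgumentOutOfRangeException();
        _min = min;
        _max = max;
        Current = initial;
    }

    // Grow by one after a clean batch, halve after any failure.
    public int Report(bool batchSucceeded)
    {
        Current = batchSucceeded
            ? Math.Min(Current + 1, _max)
            : Math.Max(Current / 2, _min);
        return Current;
    }
}

public static class BulkLookup
{
    // Runs the items batch by batch and returns the ones that failed,
    // so the caller can retry them in a later pass.
    public static async Task<List<T>> RunAsync<T>(
        IReadOnlyList<T> items, Func<T, Task<bool>> tryLookup)
    {
        var sizer = new AdaptiveBatchSize();
        var failed = new List<T>();
        int index = 0;
        while (index < items.Count)
        {
            List<T> batch = items.Skip(index).Take(sizer.Current).ToList();
            index += batch.Count;

            bool[] results = await Task.WhenAll(batch.Select(tryLookup)).ConfigureAwait(false);
            for (int i = 0; i < batch.Count; i++)
            {
                if (!results[i]) failed.Add(batch[i]);
            }
            sizer.Report(results.All(ok => ok));
        }
        return failed;
    }
}
```

Feeding the returned failures back into `RunAsync` a few times gives you the retry behaviour described earlier, with the batch size starting small again on each pass.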

Related

How to reduce delays caused by a Server TCP Spurious retransmission and subsequent Client TCP retransmission?

I have a .NET application (running on a Windows PC) that communicates with a Linux box via OPC UA. The use case here is to make ~40 read requests to the server serially. Once these 40 read calls are complete, the next cycle of 40 read calls begins. Each read call returns a response from the server carrying a payload of ~16KB, which is fragmented and delivered to the client. For most requests, the server finishes delivering the complete response within 5 ms. However, for some requests it takes ~300 ms to complete.
In scenarios where this delay exists, I can see the following pattern of re-transmissions.
[71612] A new Read request is sent to the server.
[71613-71630] The response is delivered to the client.
[71631] A new Read request is sent to the server.
[71632] A TCP Spurious Retransmission occurs from the server for packet [71844] with Seq No. 61624844
[71633] Client sends a DUP ACK for the packet.
[71634] Client does a TCP Retransmission for the read request in [71846] after 288ms
This delay adds up and causes some 5-6 seconds of delay for a complete cycle of 40 requests. I want to figure out what is causing these retransmissions (and hence the delays) and what can be done to:
Reduce the frequency of retransmissions.
Reduce the 300ms delay from the client side to quickly retransmit the obstructed read request.
I have tried disabling the Nagle algorithm on the server to possibly improve performance but it did not have any effect. Also, when reducing the response size by half (8KB), the retransmissions are rare and hence the delay is minute as well. But reducing the response is not a valid solution in our use case.
The connection to the Linux box goes through a switch; connecting to it directly, point-to-point, gives only a marginal reduction in the delay.
I can share relevant code but I think this issue is likely with the TCP stack (or at least, some configuration that should be enabled?) hence it would make little difference.

ServicePoint Configuration - Application starving http connections?

I have two C# ASP.NET applications running on IIS:
The main application creates up to 80 threads, each of which establishes an HTTP connection to a certain endpoint (the same endpoint for all of them, on the LAN) roughly every 3 seconds.
That endpoint is hosted on localhost (e.g. localhost:4510).
This endpoint is the second application, which represents the "driver" that ultimately establishes a connection to a device within the LAN.
So it's entirely possible to have 80 threads trying to make a request to the driver/device at the same time.
Over time the app seems to have issues with anything involving HTTP clients: RavenDB, Elasticsearch, and also the 80 threads.
I read a few things about the ServicePointManager class, especially DefaultConnectionLimit and MaxServicePoints and how they influence HTTP throughput.
I only have a basic understanding of the underlying mechanism, so I'd like to ask whether I should focus on a specific subject, or what I should check to improve HTTP throughput.
Update:
With the current configuration, CPU load and memory consumption are both low.
The following code shows how each of the 80 HttpClients that connect to the driver on localhost:4510 is created:
var driverBaseAddressSp = ServicePointManager.FindServicePoint(driverBaseAddress);
Debug.WriteLine(driverBaseAddressSp.ConnectionLimit);
Debug.WriteLine(driverBaseAddressSp.MaxIdleTime);
var connectionUriSp = ServicePointManager.FindServicePoint(connectionUri);
Debug.WriteLine(connectionUriSp.ConnectionLimit);
Debug.WriteLine(connectionUriSp.MaxIdleTime);
return new HttpClient { BaseAddress = driverBaseAddress };
ConnectionLimit shows int.MaxValue when debugging, but I cannot find any configuration for it in the solution.

Detect and handle unresponsive RabbitMQ in .NET application

We recently had an outage where one of our APIs became unresponsive due to our rabbit cluster being given artificially high load. We were running out of threads in Mono (.NET) and requests to the API failed. Although this is unlikely to happen again, we would like to put some protection in place against it. Ideally, calls to bus.Publish() would time out after a set amount of time, but we can't work out how.
We then came across the blocked connections notification feature of RabbitMQ and thought this might help. However we can't figure out how to get at the connection object that is in the IServiceBus. So far we have tried
_serviceBus = serviceBus;
var connection =
    ((MassTransit.Transports.RabbitMq.RabbitMqEndpointAddress) _serviceBus.Endpoint.Address)
        .ConnectionFactory.CreateConnection();
connection.ConnectionBlocked += Connection_ConnectionBlocked;
connection.ConnectionUnblocked += Connection_ConnectionUnblocked;
But when we do this we get a BrokerUnreachableException which I don't understand.
My questions are: is this the right approach to detect timeouts and fail (we have a backup mechanism to collect the data in the message and repost it later), and if so, how do we make it work?

I think you can manage this by combining a timer (System.Timers.Timer or Observable.Timer) to schedule checks with a check that uses request-response. The consumer for the request should be inside the same process. You can specify a cancellation token with a reasonable timeout for the Request call, and if you get a timeout, your messaging infrastructure is down or too busy, or your endpoint is too busy.
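Roughly, the timed check could look like the sketch below. It deliberately avoids the MassTransit API: `probe` is an assumed delegate that you would implement with your bus's request-response call, and `BusHealthCheck` is just a name for this example. A timer would then call `IsResponsiveAsync` at whatever interval suits you.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class BusHealthCheck
{
    // Returns true if `probe` completes successfully within `timeout`.
    // `probe` is an assumption: wrap your own request-response round trip in it.
    public static async Task<bool> IsResponsiveAsync(
        Func<CancellationToken, Task> probe, TimeSpan timeout)
    {
        using (var cts = new CancellationTokenSource())
        {
            Task probeTask = probe(cts.Token);

            // Race the probe against a timer so a probe that ignores the
            // token still cannot block the caller past the timeout.
            Task first = await Task.WhenAny(probeTask, Task.Delay(timeout)).ConfigureAwait(false);
            if (first != probeTask)
            {
                cts.Cancel(); // ask the probe to stop; we no longer wait for it
                return false;
            }

            try
            {
                await probeTask.ConfigureAwait(false);
                return true;
            }
            catch (Exception)
            {
                // A faulted probe (e.g. broker unreachable) also counts as down.
                return false;
            }
        }
    }
}
```

When the check returns false you can switch publishing over to your backup mechanism until a later check succeeds again.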

.NET WebSockets forcibly closed despite keep-alive and activity on the connection

We have written a simple WebSocket client using System.Net.WebSockets. The KeepAliveInterval on the ClientWebSocket is set to 30 seconds.
The connection is opened successfully and traffic flows as expected in both directions, or if the connection is idle, the client sends Pong requests every 30 seconds to the server (visible in Wireshark).
But after 100 seconds the connection is abruptly terminated due to the TCP socket being closed at the client end (watching in Wireshark we see the client send a FIN). The server responds with a 1001 Going Away before closing the socket.
After a lot of digging we have tracked down the cause and found a rather heavy-handed workaround. Despite a lot of Google and Stack Overflow searching we have only seen a couple of other examples of people posting about the problem and nobody with an answer, so I'm posting this to save others the pain and in the hope that someone may be able to suggest a better workaround.
The source of the 100 second timeout is that the WebSocket uses a System.Net.ServicePoint, which has a MaxIdleTime property to allow idle sockets to be closed. On opening the WebSocket if there is an existing ServicePoint for the Uri it will use that, with whatever the MaxIdleTime property was set to on creation. If not, a new ServicePoint instance will be created, with MaxIdleTime set from the current value of the System.Net.ServicePointManager MaxServicePointIdleTime property (which defaults to 100,000 milliseconds).
The issue is that neither WebSocket traffic nor WebSocket keep-alives (Ping/Pong) appear to register as traffic as far as the ServicePoint idle timer is concerned. So exactly 100 seconds after opening the WebSocket it just gets torn down, despite traffic or keep-alives.
Our hunch is that this may be because the WebSocket starts life as an HTTP request which is then upgraded to a websocket. It appears that the idle timer is only looking for HTTP traffic. If that is indeed what is happening that seems like a major bug in the System.Net.WebSockets implementation.
The workaround we are using is to set the MaxIdleTime on the ServicePoint to int.MaxValue. This allows the WebSocket to stay open indefinitely. But the downside is that this value applies to any other connections for that ServicePoint. In our context (which is a Load test using Visual Studio Web and Load testing) we have other (HTTP) connections open for the same ServicePoint, and in fact there is already an active ServicePoint instance by the time that we open our WebSocket. This means that after we update the MaxIdleTime, all HTTP connections for the Load test will have no idle timeout. This doesn't feel quite comfortable, although in practice the web server should be closing idle connections anyway.
We also briefly explored whether we could create a new ServicePoint instance reserved just for our WebSocket connection, but couldn't see a clean way of doing that.
One other little twist which made this harder to track down is that although the System.Net.ServicePointManager MaxServicePointIdleTime property defaults to 100 seconds, Visual Studio is overriding this value and setting it to 120 seconds - which made it harder to search for.
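For anyone who wants to try the workaround, it amounts to something like the sketch below. The helper names are made up, and it rests on one assumption we have not confirmed in documentation: that the ServicePoint used by the WebSocket is keyed by the http/https form of the ws/wss address.

```csharp
using System;
using System.Net;

// ServicePointManager is flagged obsolete on newer .NET versions; it is used
// here because it is the .NET Framework mechanism this question is about.
#pragma warning disable SYSLIB0014

public static class WebSocketServicePoint
{
    // Maps ws -> http and wss -> https, keeping host, port and path.
    public static Uri ToHttpUri(Uri webSocketUri)
    {
        string scheme = webSocketUri.Scheme == "wss" ? "https" : "http";
        return new UriBuilder(webSocketUri) { Scheme = scheme }.Uri;
    }

    // Call this before opening the ClientWebSocket.
    // Caveat from the text: the setting affects every connection that
    // shares this ServicePoint, not just the WebSocket.
    public static ServicePoint DisableIdleTimeout(Uri webSocketUri)
    {
        ServicePoint servicePoint =
            ServicePointManager.FindServicePoint(ToHttpUri(webSocketUri));
        servicePoint.MaxIdleTime = int.MaxValue;
        return servicePoint;
    }
}
```

Logging `MaxIdleTime` from the returned ServicePoint before and after the change is also a quick way to spot a tool that overrides the default, as Visual Studio did for us.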

I ran into this issue this week. Your workaround got me pointed in the right direction, but I believe I've narrowed down the root cause.
If a "Content-Length: 0" header is included in the "101 Switching Protocols" response from a WebSocket server, ClientWebSocket gets confused and schedules the connection for cleanup in 100 seconds.
Here's the offending code from the .NET Reference Source:
//if the returned contentlength is zero, preemptively invoke calldone on the stream.
//this will wake up any pending reads.
if (m_ContentLength == 0 && m_ConnectStream is ConnectStream) {
    ((ConnectStream)m_ConnectStream).CallDone();
}
According to RFC 7230 Section 3.3.2, Content-Length is prohibited in 1xx (Informational) messages, but I've found it mistakenly included in some server implementations.
For additional details, including some sample code for diagnosing ServicePoint issues, see this thread: https://github.com/ably/ably-dotnet/issues/107

I set the KeepAliveInterval for the socket to 0 like this:
theSocket.Options.KeepAliveInterval = TimeSpan.Zero;
That eliminated the problem of the websocket shutting down when the timeout was reached. But then again, it also probably turns off the send of ping messages altogether.

I studied this issue recently by comparing Wireshark captures of a Python WebSocket client and the .NET ClientWebSocket, and found what happens. In ClientWebSocket, Options.KeepAliveInterval only sends a packet to the server when no message has been received from the server during that period. But some servers only check whether there are active messages from the client. So we have to manually send arbitrary packets to the server at regular intervals (not necessarily ping packets; WebSocketMessageType has no ping type), even if the server side continuously sends packets. That's the solution.
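A minimal sketch of such a manual heartbeat, assuming your server tolerates (or ignores) a small application-level message. `Heartbeat` and `sendHeartbeat` are names invented for this example; the actual send is passed in as a delegate so the loop itself has no dependency on the socket.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class Heartbeat
{
    // Calls `sendHeartbeat` every `interval` until `stop` is cancelled.
    //
    // Example wiring (socket is an open ClientWebSocket; the payload is arbitrary):
    //   var payload = new ArraySegment<byte>(Encoding.UTF8.GetBytes("hb"));
    //   Task loop = Heartbeat.RunAsync(
    //       ct => socket.SendAsync(payload, WebSocketMessageType.Text, true, ct),
    //       TimeSpan.FromSeconds(30), cts.Token);
    //
    // ClientWebSocket allows only one outstanding send at a time, so make sure
    // the heartbeat does not overlap with your regular sends (e.g. via a lock).
    public static async Task RunAsync(
        Func<CancellationToken, Task> sendHeartbeat,
        TimeSpan interval,
        CancellationToken stop)
    {
        try
        {
            while (true)
            {
                await Task.Delay(interval, stop).ConfigureAwait(false);
                await sendHeartbeat(stop).ConfigureAwait(false);
            }
        }
        catch (OperationCanceledException) when (stop.IsCancellationRequested)
        {
            // Normal shutdown: the caller cancelled the loop.
        }
    }
}
```

Any other exception from the send (for example a closed socket) ends the loop and surfaces on the returned task, so keep a reference to it and observe it.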

How long is a WCF connection held open?

I'm running a small WCF client application that connects to an IIS server every few minutes to download data. There are about 500 of these clients for 2 or 3 servers, and my basic code is something like this:
Client connection = null;
try
{
    connection = new Client();
    List<TPointer> objects = connection.GetList();
    // Some work on List<T>
    foreach (TPointer pointer in objects)
    {
        T data = GetDataFromStream(pointer, connection);
        // Some additional processing on T
    }
    connection.SendMoreData();
    // More work
}
catch (...)
{
    // Exception handling for various exceptions
}
finally
{
    // Handle Close() or Abort()
    if (connection != null)
        connection.Close();
}
When I simulate running all the clients at once for large amounts of TPointers, I start encountering the following error:
System.TimeoutException: The request channel timed out while waiting for a reply after 00:01:00.
That seems like one of those errors that can occur for any number of reasons. For all I know the server could just be swamped, or I could be requesting too large/too many objects and it's taking too long to download (a whole minute though?). Increasing the timeout is an option, but I'd like to understand the actual problem instead of fixing the symptom.
Given I have no control over the server, how can I streamline my client?
I'm not actually sure what the "request channel" mentioned in the timeout refers to. Does the timeout start ticking from when I create new Client() until I call Client.Close()? Or does each specific request I'm sending to the server (e.g. GetList or GetData) get another minute? Is it worth my while to close Client() in between each call to the server? (I'm hoping not... that would be ugly)
Would it be helpful to chunk up the amount of data I'm receiving? The GetList() call can be quite large (running into the thousands). I could try obtaining a few objects at a time and jobbing off the post-processing for later...
Edit:
Since a few people mentioned streaming:
The Client binding uses TransferMode.StreamedResponse.
GetDataFromStream() uses a Stream derived from TPointer, and SendMoreData()'s payload size is more or less negligible.
Only GetList() actually returns a non-stream object, but I'm unclear as to whether or not that affects the method of transfer.

Or does each specific request I'm sending to the server (e.g. GetList or GetData) get another minute?
The timeout property applies to each and every operation that you're doing. It's reset. If your timeout is one minute, then it starts the moment you invoke that method.
What I'd do is implement a retry policy, use an async version of the client's method, and use a CancellationToken or call Abort() on your client when it's taking too long. Alternatively, you can raise the operation timeout on the client's InnerChannel:
client.InnerChannel.OperationTimeout = TimeSpan.FromMinutes(10);
You can use that during your operation and in your retry policy you can abort entirely and reset your timeout after your retries have failed or succeeded.
Alternatively, you can try to stream your results and see if you can operate individually on them, but I don't know if keeping that connection open will trip the timeout. You'll have to hold off on operating on your collection until you have everything.
Also, set TransferMode = TransferMode.StreamedResponse in your binding.
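To make the retry idea concrete, here is a rough sketch. `RetryPolicy` is a name invented for this example and the `operation` delegate is an assumption; you would wrap your own client call in it, for instance the async version of GetList.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class RetryPolicy
{
    // Runs `operation` up to `maxAttempts` times. Each attempt gets its own
    // cancellation token that fires after `perAttemptTimeout`.
    public static async Task<T> ExecuteAsync<T>(
        Func<CancellationToken, Task<T>> operation,
        int maxAttempts,
        TimeSpan perAttemptTimeout)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        for (int attempt = 1; ; attempt++)
        {
            using (var cts = new CancellationTokenSource(perAttemptTimeout))
            {
                try
                {
                    return await operation(cts.Token).ConfigureAwait(false);
                }
                catch (Exception ex) when (
                    attempt < maxAttempts &&
                    (ex is TimeoutException || ex is OperationCanceledException))
                {
                    // Timed out: fall through and try again. On the last
                    // attempt the filter is false, so the exception propagates.
                }
            }
        }
    }
}
```

With WCF you would typically Abort() the timed-out client and create a fresh one inside `operation` for each attempt, since a channel that has timed out may be left in the Faulted state.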

I believe the timeout you are hitting is time to first response. In your scenario here first response is the whole response since you are returning the list, more data more time. You might want to consider streaming the data instead of returning a full list.

I suggest modifying both your web.config file (WCF side) and your app.config (client side), adding a bindings section like this (i.e. a timeout of 25 minutes instead of the default 1 minute):
<bindings>
  <wsHttpBinding>
    <binding name="WSHttpBinding_IYourService"
             openTimeout="00:25:00"
             closeTimeout="00:25:00"
             sendTimeout="00:25:00"
             receiveTimeout="00:25:00">
    </binding>
  </wsHttpBinding>
</bindings>

Given I have no control over the server, how can I streamline my client?
Basically, you cannot do this when you only have control over the client. It seems the operations return no Stream (unless the pointers are types that derive from Stream).
If you want to know more about how to achieve streaming in general, just read up on this MSDN article.
Everything you can do on the client only scratches the surface of the problem. As @The Anathema proposed in his answer, you can add retry logic and/or set the timeout to a higher value. But to eradicate the root of the problem, you'd need to investigate the source of the service itself so that it can handle a higher number of requests, or have instances of the service running on multiple servers with a load balancer in front.

I ended up going with a combination of the answers here, so I'll just post an answer. I chunked GetList() to a certain size to avoid keeping the connection open so long (it also had a positive effect on the code in general, since I was keeping less in memory temporarily.) I already have a retry policy in place, but will also plan on messing with the timeout, as The Anathema and a couple others suggested.
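How the chunking is done depends on what the service offers, so the following is only a generic sketch of the client-side part: a helper that splits any sequence into fixed-size batches. `Batching` and `InBatchesOf` are names made up for this example, not part of my actual code.

```csharp
using System;
using System.Collections.Generic;

public static class Batching
{
    // Splits `source` into lists of at most `size` items, preserving order.
    public static IEnumerable<List<T>> InBatchesOf<T>(this IEnumerable<T> source, int size)
    {
        // Validate eagerly; the iterator below only runs on enumeration.
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (size < 1) throw new ArgumentOutOfRangeException(nameof(size));
        return Iterate(source, size);
    }

    private static IEnumerable<List<T>> Iterate<T>(IEnumerable<T> source, int size)
    {
        var batch = new List<T>(size);
        foreach (T item in source)
        {
            batch.Add(item);
            if (batch.Count == size)
            {
                yield return batch;
                batch = new List<T>(size);
            }
        }
        // Emit the final, possibly shorter batch.
        if (batch.Count > 0) yield return batch;
    }
}
```

Each batch can then be processed and released before the next one is requested, which is where the lower memory use comes from.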
