ServicePoint Configuration - Application starving HTTP connections? - C#

I have two C# asp.net applications running on IIS:
The main application creates up to 80 threads, each of which establishes an HTTP connection to a certain endpoint (always the same endpoint, on the LAN) roughly every 3 seconds.
That endpoint is hosted on localhost (e.g. localhost:4510).
This endpoint is the second application, which represents the "driver" that ultimately establishes a connection to a device within the LAN.
So it's entirely possible to have 80 threads trying to make a request to the driver/device at the same time.
Over time the app seems to have issues with anything involving HttpClient: RavenDB, Elasticsearch, and also the 80 threads.
I read a few things about the ServicePointManager class, especially DefaultConnectionLimit and MaxServicePoints, and how they influence HTTP throughput.
I only have a basic understanding of the underlying mechanism, so I'd like to ask whether I should focus on a specific area, and what I should check in order to improve HTTP throughput.
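For context, a minimal sketch of how these two knobs are typically set on .NET Framework (the values are illustrative only, not a recommendation; they must be set before the first request to an endpoint creates its ServicePoint, e.g. in Application_Start):

using System.Net;

ServicePointManager.DefaultConnectionLimit = 100; // max parallel connections per endpoint
ServicePointManager.MaxServicePoints = 0;         // 0 = unlimited ServicePoints (the default)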
Update:
With the current configuration, CPU load and memory consumption are both low.
The following code shows how the 80 HttpClients that connect to the driver on localhost:4510 are created:
var driverBaseAddressSp = ServicePointManager.FindServicePoint(driverBaseAddress);
Debug.WriteLine(driverBaseAddressSp.ConnectionLimit);
Debug.WriteLine(driverBaseAddressSp.MaxIdleTime);

var connectionUriSp = ServicePointManager.FindServicePoint(connectionUri);
Debug.WriteLine(connectionUriSp.ConnectionLimit);
Debug.WriteLine(connectionUriSp.MaxIdleTime);

return new HttpClient { BaseAddress = driverBaseAddress };
ConnectionLimit shows int.MaxValue when debugging, but I cannot find any corresponding configuration anywhere in the solution. Where does that value come from?

Related

C# HttpClient not using all established connections to a host with parallel requests

I'm using .NET Framework 4.8 for a console application that manages an ETL process, and a .NET Standard 2.0 library for the HTTP requests, which uses HttpClient. The application is expected to handle millions of records and is long-running.
These requests are made in parallel with a maximum concurrency limit of 20.
At application launch, I increase the number of connections .NET can make to a single host in a connection pool via the ServicePointManager. This limit is a frequent cause of connection pool starvation, as it defaults to 2 in .NET Framework.
public static async Task Main(string[] args)
{
    ServicePointManager.DefaultConnectionLimit = 50;
    ...
}
This is greater than the maximum number of concurrent requests allowed, to ensure my requests are not queued up waiting on connections that are already in use.
I then loop through the records and post them to a TPL Dataflow block with a concurrency limit of 20, as sketched below.
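A minimal sketch of such a block (the Record type, records sequence, and apiClient instance are hypothetical stand-ins, not from the original code):

using System.Threading.Tasks.Dataflow;

var uploadBlock = new ActionBlock<Record>(
    async record => await apiClient.UploadNewDocumentAsync(record.Params, record.Data),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20 }); // cap at 20 concurrent requests

foreach (var record in records)
    await uploadBlock.SendAsync(record);

uploadBlock.Complete();
await uploadBlock.Completion; // wait for all posted records to be processed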
This block makes the request via my API client, which uses a singleton HttpClient for all requests.
public class MyApiClient
{
    private static HttpClient _httpClient { get; set; }

    public MyApiClient()
    {
        _httpClient = new HttpClient();
    }

    public async Task<ReturnedObject> UploadNewDocumentAsync(DocParams docParams, DocData docData)
    {
        MultipartFormDataContent content = ConstructMultipartFormDataContent(docParams);
        HttpContent httpContent = docData.ConvertToHttpContent();
        content.Add(httpContent);

        using (HttpResponseMessage response = await _httpClient.PostAsync("document/upload", content))
        {
            return await HttpResponseReader.ReadResponse<ReturnedObject>(response).ConfigureAwait(false);
        }
    }
}
I'm using Sysinternals TCPView.exe to view all connections made between the local host and the remote host, as well as the status of those connections and whether data is actively being sent.
When starting the application, I see new connections established for each request made, until around 50 connections exist. I also see activity on around 20 of them at any given time. This meets expectations.
After around 24 hours of activity, TCPView shows only 5 connections concurrently sending and receiving data. All 50 connections still exist and are all in the Established state, but the majority sit idle. I don't have a way of logging when connections stop being actively used, so I don't know whether usage suddenly drops from 20 connections to only 5, or whether it gradually decreases.
My log file records the elapsed time for every request made, and I see a degradation in performance at this point: requests take longer and longer to complete, and I see an increase in TaskCanceledExceptions as the HttpClient reaches its timeout value of 100 seconds.
I've also confirmed with the third-party API vendor that they are not seeing a large number of incoming requests time out.
This suggests to me that the application is still trying to make 20 requests at a time, but that they are being queued up on a smaller number of TCP connections. All the symptoms point to classic connection pool starvation.
While the application is running, I can output some information from the ServicePointManager to the console.
ServicePoint sp = ServicePointManager.FindServicePoint(myApiService.BasePath, WebRequest.DefaultWebProxy);
Console.WriteLine(
    $"CurrentConnections: {sp.CurrentConnections} " +
    $"ConnectionLimit: {sp.ConnectionLimit}");
Console output:
CurrentConnections: 50 ConnectionLimit: 50
This validates what I see in TCPView.exe, which is that all 50 connections are still allowed and are established.
The symptoms show connection pool starvation, but TCPView.exe and the ServicePointManager show there are plenty of established connections available; .NET is just not using them all. This behavior only shows up after several hours of runtime. The issue is repeatable: if I close and relaunch the application, it begins by rapidly opening all 50 TCP connections, and I see data being transferred on up to 20 of them at a time. When I check 24 hours later, the symptoms of connection pool starvation have appeared again.
What could cause this behavior, and is there anything further I could do to validate my assumptions?

Getting an error “Response code: Non HTTP response code: org.apache.http.conn.HttpHostConnectException” in jmeter

I am running a load test on my .NET web application using JMeter.
Application process: Launch - Login - Start Test - Answer Q&A - Home Page - Logout
Up to 500 users, or sometimes 750, the test runs successfully. But when I increase the load, I get an error:
Non HTTP response code: org.apache.http.conn.HttpHostConnectException/Non HTTP response message:
Connect to www.demoname.com:80 [www.demoname.com\/11.111.111.111] failed: Connection timed out: connect
'11.111.111.111' is my server's IP address.
I have increased the heap memory in the jmeter.bat file to HEAP=-Xms1g -Xmx4g -XX:MaxMetaspaceSize=256m
Apache JMeter version: 5.1.1 r1855137
Java version: 1.8.0_221
Server configuration: Standard D4 v2 (8 vCPUs, 28 GiB memory)
How can I get rid of this error?
HttpHostConnectException is basically an instance of ConnectException, which:
Signals that an error occurred while attempting to connect a socket to a remote address and port. Typically, the connection was refused remotely (e.g., no process is listening on the remote address/port).
So most probably the error is on your server side: JMeter attempts to establish the connection and fails to do so within the bounds of the defined timeout.
If you're absolutely sure that your application works normally, you can increase the connect timeout under HTTP Request Defaults.
However, a better idea would be getting to the bottom of the error and fixing it on the server side. The most likely reasons are:
your application is overloaded, i.e. it lacks essential resources like CPU, RAM, network, etc. Make sure to set up monitoring of these metrics using e.g. Azure Monitor or the JMeter PerfMon Plugin
your application infrastructure is not properly configured; double-check that your backend is properly tuned for high loads, including IIS and MSSQL
your application code might be the problem, i.e. it cannot handle more than X users due to poorly implemented algorithms. Consider using a profiler tool like NProfiler or dotTrace to detect the most expensive functions, largest objects, etc.

How to fully terminate HttpWebRequest

Even though I am properly terminating everything, when I check existing HTTP connections I see that they are not terminated.
For example, when I open 200 concurrent connections by starting different tasks, I see:
158 Established HTTP connections
927 TimeWait
95 SynSent
24 LastAck
6 CloseWait
34 FinWait
The worst part is that the number of TimeWait connections keeps increasing each minute.
So how can I prevent this issue from happening?
After a while, Windows becomes unable to make any new requests.
This problem occurs when I use web proxies: too many proxy connections kill Windows' ability to resolve hosts.
This is what I see when I use 200 connections with different proxies.
Connections in the TimeWait state can create a performance problem.
First, take a look at the TCP state diagram:
https://en.wikipedia.org/wiki/File:Tcp_state_diagram_fixed_new.svg
TimeWait is the state of a TCP connection after the machine's TCP stack has sent the ACK segment in response to a FIN segment received from its peer (details in RFC 793, which defined TCP back in 1981: http://www.ietf.org/rfc/rfc793.txt). During this state the socket resources, including the TCB (TCP Control Block) and of course the port, are not released to the OS; only after a timeout expires are they released. The original reason is to deal with the Two Generals problem that can arise between peers on an unreliable medium. The connection stays in TimeWait until a configurable timeout expires, whose default value depends on the operating system.
These links can help you to set the TcpTimedWaitDelay parameter in Windows:
https://technet.microsoft.com/en-us/library/cc938217.aspx
http://msdn.microsoft.com/en-us/library/ee377084%28v=bts.10%29.aspx
It says the default value is 240 seconds, but in my tests I observed lower times (between 60 and 120).
Anyway, today's networks are more reliable, and web services requiring high performance and throughput should reduce this value. I would suggest setting it to just 5 seconds; if you want to be more conservative, set it to 30 seconds.
Another parameter that could be useful to you is the maximum number of ephemeral ports Windows allows a client to open. Windows Server limits this by default; in some Windows versions the limit is 5000. You can change this behavior by setting the MaxUserPort value in the registry.
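As a concrete illustration, here is a hedged sketch of setting both values from C# (requires administrator rights, and a reboot for the TCP/IP stack to pick the changes up; the numbers are examples, not recommendations):

using Microsoft.Win32;

const string tcpParameters = @"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters";

// Shorten TIME_WAIT from the ~240-second default to 30 seconds.
Registry.SetValue(tcpParameters, "TcpTimedWaitDelay", 30, RegistryValueKind.DWord);

// Raise the ephemeral port ceiling (65534 is the maximum).
Registry.SetValue(tcpParameters, "MaxUserPort", 65534, RegistryValueKind.DWord);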

.NET WebSockets forcibly closed despite keep-alive and activity on the connection

We have written a simple WebSocket client using System.Net.WebSockets. The KeepAliveInterval on the ClientWebSocket is set to 30 seconds.
The connection is opened successfully and traffic flows as expected in both directions, or if the connection is idle, the client sends Pong requests every 30 seconds to the server (visible in Wireshark).
But after 100 seconds the connection is abruptly terminated due to the TCP socket being closed at the client end (watching in Wireshark we see the client send a FIN). The server responds with a 1001 Going Away before closing the socket.
After a lot of digging we have tracked down the cause and found a rather heavy-handed workaround. Despite a lot of Google and Stack Overflow searching we have only seen a couple of other examples of people posting about the problem and nobody with an answer, so I'm posting this to save others the pain and in the hope that someone may be able to suggest a better workaround.
The source of the 100-second timeout is that the WebSocket uses a System.Net.ServicePoint, which has a MaxIdleTime property to allow idle sockets to be closed. On opening the WebSocket, if there is an existing ServicePoint for the Uri, it will use that, with whatever MaxIdleTime was set at creation. If not, a new ServicePoint instance is created, with MaxIdleTime set from the current value of the System.Net.ServicePointManager MaxServicePointIdleTime property (which defaults to 100,000 milliseconds).
The issue is that neither WebSocket traffic nor WebSocket keep-alives (Ping/Pong) appear to register as traffic as far as the ServicePoint idle timer is concerned. So exactly 100 seconds after opening the WebSocket it just gets torn down, despite traffic or keep-alives.
Our hunch is that this may be because the WebSocket starts life as an HTTP request which is then upgraded to a websocket. It appears that the idle timer is only looking for HTTP traffic. If that is indeed what is happening that seems like a major bug in the System.Net.WebSockets implementation.
The workaround we are using is to set the MaxIdleTime on the ServicePoint to int.MaxValue. This allows the WebSocket to stay open indefinitely. But the downside is that this value applies to any other connections for that ServicePoint. In our context (which is a Load test using Visual Studio Web and Load testing) we have other (HTTP) connections open for the same ServicePoint, and in fact there is already an active ServicePoint instance by the time that we open our WebSocket. This means that after we update the MaxIdleTime, all HTTP connections for the Load test will have no idle timeout. This doesn't feel quite comfortable, although in practice the web server should be closing idle connections anyway.
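In code, the workaround looks roughly like this (the URI here is a placeholder; FindServicePoint must be called with a URI that maps to the same ServicePoint your WebSocket connection will use):

var servicePoint = ServicePointManager.FindServicePoint(new Uri("https://example.com"));
servicePoint.MaxIdleTime = int.MaxValue; // effectively disables the idle timeout for this ServicePoint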
We also briefly explored whether we could create a new ServicePoint instance reserved just for our WebSocket connection, but couldn't see a clean way of doing that.
One other little twist which made this harder to track down is that although the System.Net.ServicePointManager MaxServicePointIdleTime property defaults to 100 seconds, Visual Studio is overriding this value and setting it to 120 seconds - which made it harder to search for.
I ran into this issue this week. Your workaround got me pointed in the right direction, but I believe I've narrowed down the root cause.
If a "Content-Length: 0" header is included in the "101 Switching Protocols" response from a WebSocket server, WebSocketClient gets confused and schedules the connection for cleanup in 100 seconds.
Here's the offending code from the .NET Reference Source:
// if the returned contentlength is zero, preemptively invoke calldone on the stream.
// this will wake up any pending reads.
if (m_ContentLength == 0 && m_ConnectStream is ConnectStream) {
    ((ConnectStream)m_ConnectStream).CallDone();
}
According to RFC 7230 Section 3.3.2, Content-Length is prohibited in 1xx (Informational) messages, but I've found it mistakenly included in some server implementations.
For additional details, including some sample code for diagnosing ServicePoint issues, see this thread: https://github.com/ably/ably-dotnet/issues/107
I set the KeepAliveInterval for the socket to 0 like this:
theSocket.Options.KeepAliveInterval = TimeSpan.Zero;
That eliminated the problem of the WebSocket shutting down when the timeout was reached. But then again, it probably also turns off the sending of ping messages altogether.
I studied this issue for a few days, compared packet captures in Wireshark (Python's websocket-client versus .NET's ClientWebSocket), and found out what happens. With ClientWebSocket, Options.KeepAliveInterval only causes a packet to be sent to the server when no message has been received from the server within that period. But some servers only check whether there are active messages coming from the client. So we have to manually send arbitrary packets (not necessarily ping packets; WebSocketMessageType has no ping type) to the server at regular intervals, even if the server side continuously sends packets. That's the solution.
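A minimal sketch of such a manual heartbeat (the interval and payload are arbitrary choices):

using System;
using System.Net.WebSockets;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

static async Task HeartbeatAsync(ClientWebSocket socket, CancellationToken ct)
{
    // Any application-level message works; WebSocketMessageType has no Ping member.
    var payload = new ArraySegment<byte>(Encoding.UTF8.GetBytes("heartbeat"));
    while (socket.State == WebSocketState.Open && !ct.IsCancellationRequested)
    {
        await socket.SendAsync(payload, WebSocketMessageType.Text, true, ct);
        await Task.Delay(TimeSpan.FromSeconds(20), ct); // keep the server's idle timer satisfied
    }
}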

Unable to resolve DNS (sometimes?)

Given an application that requests 100 URLs at a time in parallel, out of 10,000 URLs total, I receive the following error for 50-5000 of them:
The remote name cannot be resolved 'www.url.com'
I understand that the error means the DNS server was unable to resolve the URL. However, on each run the number of URLs that cannot be resolved changes (ranging from 50 to 5000).
Am I making too many requests too fast? And can I even do that? Running the same test on a much more powerful server shows that only 10 URLs could not be resolved, which sounds much more realistic.
The code that does the parallel requesting:
var semp = new SemaphoreSlim(100);
var uris = File.ReadAllLines(@"C:\urls.txt").Select(x => new Uri(x));
foreach (var uri in uris)
{
    Task.Run(async () =>
    {
        await semp.WaitAsync();
        var result = await Web.TryGetPage(uri); // Using HttpWebRequest
        semp.Release();
    });
}
I'll bet you didn't know that the DNS lookup in HttpWebRequest (which is the cornerstone of all .NET HTTP APIs) happens synchronously, even when making async requests (annoying, right?). This means that firing off many requests at once causes severe ThreadPool strain and a large amount of latency, which can lead to unexpected timeouts. If you really want to step things up, don't use the .NET DNS implementation: use a third-party library to resolve hosts, create your web request with an IP address instead of a hostname, and manually set the Host header before firing off the request. You can achieve much higher throughput this way.
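A hedged sketch of that IP-plus-Host-header technique (plain HTTP only, since HTTPS complicates certificate validation; the hard-coded address stands in for whatever your third-party resolver returns):

using System;
using System.Net;
using System.Threading.Tasks;

// 'ip' is assumed to have been resolved out of band by a non-blocking resolver.
IPAddress ip = IPAddress.Parse("203.0.113.10"); // placeholder address
var request = (HttpWebRequest)WebRequest.Create($"http://{ip}/some/path");
request.Host = "www.url.com"; // restore the hostname the server expects
using (var response = (HttpWebResponse)await request.GetResponseAsync())
{
    // consume the response...
}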
It does sound like you're swamping your local DNS server (in the jargon, your local recursive DNS resolver).
When your program issues a DNS resolution request, it sends a UDP datagram to port 53 of the local resolver. That resolver responds either by replying from its cache or by recursively resending the request to some other resolver that has been identified as possibly having the record you're looking for.
So, your multithreaded program is causing a lot of datagrams to fly around. Internet Protocol hosts and routers handle congestion and overload by dropping datagram packets. It's like handling a traffic jam on a bridge by bulldozing cars off the bridge. In an overload situation, some packets just disappear.
So it's up to endpoint software using datagram protocols to try again if their packets get lost. That's the purpose of TCP, and that's how it can provide the illusion of an error-free stream of data even though it can only communicate via datagrams.
So your program will need to try again when it gets a resolution failure on some of its DNS requests. You're a datagram endpoint, so you own the responsibility for retrying. I suspect the .NET library is giving you back failures when some of your requests time out because your datagrams got dropped.
Now, here's the important thing: it is also the responsibility of a datagram endpoint program, like yours, to implement congestion control. TCP does this automatically using its sliding-window system, with an algorithm called slow start / exponential backoff. If TCP didn't do this, all internet routers would be congested all the time. This algorithm was dreamed up by Van Jacobson, and you should go read about it.
In the meantime, you should implement a simple form of it in your bulk DNS lookup program. Here's how you might do that (see the sketch after this list):
Start with a batch size of, say, 5 lookups.
Every time you get the whole batch back successfully, increase your batch size by one for your next batch. This is slow-start. As long as you're not getting congestion, you increase the network load.
Every time you get a failure to resolve a name, reduce the size of the next batch by half. So, for example, if your batch size was 30 and you got a failure, your next batch size will be 15. This is exponential backoff. You respond to congestion by dramatically reducing the load you're putting on the network.
Implement a maximum batch size of something like 100 just to avoid being too much of a pig and looking like a crude denial-of-service attack to the DNS system.
I had a similar project a while ago and this strategy worked well for me.
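A minimal sketch of that adaptive batching, reusing Web.TryGetPage from the question and assuming it returns null on failure:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

int batchSize = 5;            // slow start: begin small
const int maxBatchSize = 100; // cap so we never look like a denial-of-service attack
var pending = new Queue<Uri>(uris);

while (pending.Count > 0)
{
    var batch = new List<Uri>();
    while (batch.Count < batchSize && pending.Count > 0)
        batch.Add(pending.Dequeue());

    var outcomes = await Task.WhenAll(batch.Select(async uri =>
        (uri, ok: await Web.TryGetPage(uri) != null)));

    if (outcomes.All(o => o.ok))
        batchSize = Math.Min(batchSize + 1, maxBatchSize); // grow by one on a clean batch
    else
    {
        batchSize = Math.Max(batchSize / 2, 1); // halve on any failure (exponential backoff)
        foreach (var o in outcomes.Where(o => !o.ok))
            pending.Enqueue(o.uri);             // re-queue failures for retry
    }
}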
