Windows kernel queuing outbound network connections


We have an application (a meta-search engine) that must make 50 - 250 outbound HTTP connections in response to a user action, frequently.
The way we do this is by creating a bunch of HttpWebRequests and running them asynchronously using Action.BeginInvoke. This obviously uses the ThreadPool to launch the web requests, which run synchronously on their own thread. Note that it is currently this way as this was originally a .NET 2.0 app and there was no TPL to speak of.
What we see using ETW (our event sources combined with the .NET Framework and kernel ones) and NetMon is that while the thread pool can start 200 threads running our code in about 300ms (so, no thread pool exhaustion issues here), it takes a variable amount of time, sometimes up to 10 - 15 seconds, for the Windows kernel to make all the TCP connections that have been queued up.
This is very obvious in NetMon - you see around 60 - 100 TCP connections open (SYN) immediately (the number varies, but it's never more than around 120), then the rest trickle in over a period of time. It's as if the connections are being queued somewhere, but I don't know where and I don't know how to tune this so we can perform more concurrent outgoing connections. The Perfmon Outbound Connection Queue counter stays at 0, but in the Connections Established counter you can see an initial spike of connections and then a gradual increase as the rest filter through.
It does appear that latency to the endpoints we are connecting to plays a part, as running the code close to those endpoints doesn't show the problem as significantly.
I've taken comprehensive ETW traces but there is no decent documentation on many of the Microsoft providers, which would be a help I'm sure.
Any advice to work around this or advice on tuning windows for a large amount of outgoing connections would be great. The platform is Win7 (dev) and Win2k8R2 (prod).

It looks like slow DNS queries are the culprit here. Looking at the ETW provider "Microsoft-Windows-Networking-Correlation", I can trace the network call from inception to connection and note that many connections are taking > 1 second at the DNS resolver (Microsoft-Windows-RPC).
It appears our local DNS server is slow or can't handle the load we are throwing at it, and it isn't caching aggressively. Production wasn't showing symptoms as severe because the prod DNS servers do everything right.
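If a slow resolver is the bottleneck, one thing worth trying is resolving the endpoint host names once, ahead of the burst, and timing each lookup. The following is only a minimal sketch: `DnsCache`, `DnsCacheDemo` and the "localhost" host list are hypothetical names for illustration, and the cache has no TTL or eviction, so a real version would need to expire entries and drop failed lookups.

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Net;
using System.Threading.Tasks;

// Hypothetical helper: keeps one lookup task per host name so a burst of
// requests to the same hosts triggers a single DNS query per host.
public static class DnsCache
{
    private static readonly ConcurrentDictionary<string, Lazy<Task<IPAddress[]>>> Cache =
        new ConcurrentDictionary<string, Lazy<Task<IPAddress[]>>>(StringComparer.OrdinalIgnoreCase);

    // Concurrent callers for the same host await the same task.
    public static Task<IPAddress[]> ResolveAsync(string host)
    {
        return Cache.GetOrAdd(
            host,
            h => new Lazy<Task<IPAddress[]>>(() => Dns.GetHostAddressesAsync(h))).Value;
    }
}

public static class DnsCacheDemo
{
    public static async Task Main()
    {
        // Assumption: replace with the real endpoint host names.
        string[] hosts = { "localhost" };

        // Warm the cache before the burst and log how long each lookup took.
        foreach (string host in hosts)
        {
            Stopwatch timer = Stopwatch.StartNew();
            IPAddress[] addresses = await DnsCache.ResolveAsync(host);
            Console.WriteLine("{0}: {1} address(es) in {2} ms",
                host, addresses.Length, timer.ElapsedMilliseconds);
        }
    }
}
```

Logging the per-host lookup time this way also gives a cheap confirmation of the ETW finding without needing a full trace.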

Related

High-Availability TCP server application

In my project I have a cloud hosted virtual machine running a C# application which needs to:
1. accept TCP connection from several external clients (approximately 500)
2. receive data asynchronously from the connected clients (not high frequency, approximately 1 message per minute)
3. do some processing on received data
4. forward received data to other actors
5. reply back to connected clients and possibly do some asynchronous sending (based on internal time-checks)
The design seems quite straightforward to me. I provide a listener which accepts incoming TCP connections; when a new connection is established, a new thread is spawned. That thread runs in a loop (performing the activities in points 2 to 5) and checks whether the associated socket is still alive (if the socket is dead, the thread exits the loop and eventually terminates; later, the external client the socket belonged to will attempt a new connection).
So now the issue is that for limited amount of external clients (I would say 200/300) everything runs smoothly, but as that number grows (or when the clients send data with higher frequency) the communication gets very slow and obstructed.
I was thinking about some better design, for example:
using Tasks instead of Threads
using ThreadPool
replace 1Thread1Socket with something like 1Thread10Socket
or even some scaling strategies:
open two different TCP listeners (different port) within the same application (reconfiguring clients so that half of them target each listener)
provide two identical application with two different TCP listeners (different port) on the same virtual machine
set up two different virtual machines with the same application running on each of them (reconfiguring clients so that half of them target each virtual machine address)
Finally the questions: is the current design poor or naive? do you see any major criticality in the way I handle the communication? do you have any more robust and efficient option (among those mentioned above, or any additional one)?
Thanks

The number of listeners is unlikely to be a limiting factor. Here at Stack Overflow we handle ~60k sockets per instance, and the only reason we need multiple listeners is so we can split the traffic over multiple ports to avoid ephemeral port exhaustion at the load balancer. Likewise, I should note that those 60k per-instance socket servers run at basically zero CPU, so: it is premature to think about multiple exes, VMs, etc. That is not the problem. The problem is the code, and distributing a poor socket infrastructure over multiple processes just hides the problem.
Writing high performance socket servers is hard, but the good news is: you can avoid most of this. Kestrel (the ASP.NET Core http server) can act as a perfectly good TCP server, dealing with most of the horrible bits of async, sockets, buffer management, etc for you, so all you have to worry about is the actual data processing. The "pipelines" API even deals with back-buffers for you, so you don't need to worry about over-read.
An extensive walkthrough of this is in my 3-and-a-bit part blog series starting here - it is simply way too much information to try and post here. But it links through to a demo server - a dummy redis server hosted via Kestrel. It can also be hosted without Kestrel, using Pipelines.Sockets.Unofficial, but... frankly I'd use Kestrel. The server shown there is broadly similar (in terms of broad initialization - not the actual things it does) to our 60k-per-instance web-socket tier.
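To illustrate the underlying idea (no dedicated thread per socket), here is a minimal sketch of an async echo server. Note that this deliberately uses plain `TcpListener` with async/await rather than Kestrel or the pipelines API, so it does none of the buffer management described above; `AsyncEchoServer` and `AsyncEchoDemo` are hypothetical names, and error handling is reduced to swallowing the exceptions seen when a peer or the listener goes away.

```csharp
using System;
using System.IO;
using System.Net;
using System.Net.Sockets;
using System.Text;
using System.Threading.Tasks;

public static class AsyncEchoServer
{
    // Listens on a loopback port chosen by the OS and starts the accept loop.
    public static TcpListener Start()
    {
        var listener = new TcpListener(IPAddress.Loopback, 0);
        listener.Start();
        _ = AcceptLoopAsync(listener); // fire and forget in this sketch
        return listener;
    }

    private static async Task AcceptLoopAsync(TcpListener listener)
    {
        try
        {
            while (true)
            {
                TcpClient client = await listener.AcceptTcpClientAsync();
                _ = HandleClientAsync(client); // no thread is parked per connection
            }
        }
        catch (InvalidOperationException) { } // listener stopped or disposed
        catch (SocketException) { }           // accept aborted by Stop()
    }

    private static async Task HandleClientAsync(TcpClient client)
    {
        using (client)
        {
            try
            {
                NetworkStream stream = client.GetStream();
                var buffer = new byte[4096];
                int read;
                // A thread is only used while there is data to process.
                while ((read = await stream.ReadAsync(buffer, 0, buffer.Length)) > 0)
                {
                    await stream.WriteAsync(buffer, 0, read);
                }
            }
            catch (IOException) { } // peer dropped the connection
        }
    }
}

public static class AsyncEchoDemo
{
    // Sends one message to a freshly started server and returns the echoed text.
    public static async Task<string> RoundTripAsync(string message)
    {
        TcpListener listener = AsyncEchoServer.Start();
        try
        {
            int port = ((IPEndPoint)listener.LocalEndpoint).Port;
            using (var client = new TcpClient())
            {
                await client.ConnectAsync(IPAddress.Loopback, port);
                NetworkStream stream = client.GetStream();
                byte[] payload = Encoding.UTF8.GetBytes(message);
                await stream.WriteAsync(payload, 0, payload.Length);

                var reply = new byte[payload.Length];
                int total = 0;
                while (total < reply.Length)
                {
                    int n = await stream.ReadAsync(reply, total, reply.Length - total);
                    if (n == 0) break;
                    total += n;
                }
                return Encoding.UTF8.GetString(reply, 0, total);
            }
        }
        finally
        {
            listener.Stop();
        }
    }

    public static async Task Main()
    {
        Console.WriteLine(await RoundTripAsync("ping"));
    }
}
```

With roughly one message per minute per client, almost every connection is idle at any given moment, which is exactly the case where awaiting I/O instead of blocking a thread pays off.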

What could be rate limiting CPU cycles on my C# WCF Service?

Something very strange started happening on our production servers a day or two ago regarding a WCF Service we run there: it seems that something started rate limiting the process in question's CPU cycles to the amount of CPU cycles that would be available on one core, even though the load is spread across all cores (the process is not burning one core to 100% usage)
The Service is mostly just a CRUD (create, read, update, delete) service, with the exception of a few long running (up to 20 minutes) service calls. Each of these long running service calls kicks off a simple Thread and returns void so as not to make the Client application wait or hold up the WCF connection:
    // WCF Service Side
    [OperationBehavior]
    public void StartLongRunningProcess()
    {
        Thread workerThread = new Thread(DoWork);
        workerThread.Start();
    }
    private void DoWork()
    {
        // Call SQL Stored proc
        // Write the 100k+ records to new excel spreadsheet
        // return (which kills off this thread)
    }
Before the above call is kicked off, the service seems to respond as it should, fetching data to display on the front-end quickly.
When you kick off the long running process and the CPU usage goes to 100 / CPUCores, the front-end response gets slower and slower, and after a few minutes the service eventually won't accept any more WCF connections.
What I think is happening, is the long running process is using all the CPU cycles the OS is allowing, because something is rate limiting it, and WCF can't get a chance to accept the incoming connection, never mind execute the request.
At some point I started wondering if the Cluster our virtual servers run on is somehow doing this, but then we managed to reproduce this on our development machines with the client communicating to the service using the loopback address, so the hardware firewalls are not interfering with the network traffic either.
While testing this inside of Visual Studio, I managed to start 4 of these long running processes and confirmed with the debugger that all 4 are executing simultaneously, in different threads (by checking Thread.CurrentThread.ManagedThreadId), but still only using 100 / CPUCores worth of CPU cycles in total.
On the production server, it doesn't go over 25% CPU usage (4 cores); when we doubled the CPU cores to 8, it didn't go over 12.5% CPU usage.
Our development machines have 8 cores, and also won't go over 12.5% CPU usage.
Other things worth mentioning about the service
It's a Windows Service
It's running inside of a TopShelf host
The problem didn't start after a deployment (of our service anyway)
Production server is running Windows Server 2008 R2 Datacenter
Dev Machines are running Windows 7 Enterprise
Things that we have checked, double checked, and tried:
Changing the process' priority up to High from Normal
Checked that the processor affinity for the process is not limiting to a specific core
The [ServiceBehavior] Attribute is set to ConcurrencyMode = ConcurrencyMode.Multiple
Incoming WCF Service calls are executing on different threads
Removed TopShelf from the equation by hosting the WCF service in just a console application
Set the WCF Service throttling values: <serviceThrottling maxConcurrentCalls="1000" maxConcurrentInstances="1000" maxConcurrentSessions="1000" />
Any ideas on what could be causing this?

There must be a shared resource that only allows a single thread to access it at a time. This would effectively only allow one thread at a time to run, and create exactly the situation you have.
Processor affinity masks are the only way to limit a process to a single CPU, and if you did this you would see one CPU pinned and all the others idle (which is not your situation).
We use a tool called LeanSentry that is very good at identifying these kinds of problems. It will attach itself to IIS as a debugger and capture stack dumps of all executing processes, then tell you if most of your threads are blocked in the same spot. There is a free trial that would be long enough for you to figure this out.
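The effect is easy to reproduce in isolation. The sketch below is an illustration only (`LockContentionDemo` and its lock are hypothetical, not taken from the service in question): several threads do CPU-bound work, and the method reports how many of them were ever inside the work section at the same time. With the shared lock in place that number is 1, which is why total CPU usage flattens out at one core's worth even though the threads get scheduled on different cores.

```csharp
using System;
using System.Threading;

public static class LockContentionDemo
{
    // Stand-in for whatever resource the real threads are serialized on.
    private static readonly object SharedResource = new object();

    // Runs the workers and returns the highest number of them that were
    // inside the work section at the same moment.
    public static int MaxConcurrentWorkers(int threadCount, bool serialize)
    {
        int current = 0;
        int max = 0;
        var threads = new Thread[threadCount];

        for (int i = 0; i < threadCount; i++)
        {
            threads[i] = new Thread(() =>
            {
                for (int round = 0; round < 50; round++)
                {
                    if (serialize) Monitor.Enter(SharedResource);
                    try
                    {
                        int now = Interlocked.Increment(ref current);

                        // Record the high-water mark without losing updates.
                        int seen;
                        while (now > (seen = Volatile.Read(ref max)) &&
                               Interlocked.CompareExchange(ref max, now, seen) != seen)
                        {
                        }

                        Thread.SpinWait(20000); // simulated CPU-bound work
                        Interlocked.Decrement(ref current);
                    }
                    finally
                    {
                        if (serialize) Monitor.Exit(SharedResource);
                    }
                }
            });
        }

        foreach (Thread thread in threads) thread.Start();
        foreach (Thread thread in threads) thread.Join();
        return max;
    }

    public static void Main()
    {
        Console.WriteLine("With shared lock:    {0}", MaxConcurrentWorkers(4, true));
        Console.WriteLine("Without shared lock: {0}", MaxConcurrentWorkers(4, false));
    }
}
```

In a real service the lock is rarely this visible; it may sit inside a library, a COM component or a static helper, which is why capturing the stacks of all threads and looking for a common blocking frame is the practical way to find it.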

The CPU usage looks like a lock on a table in the SQL database to me. I would use SQL Server Management Studio to analyze the statements and see if it can confirm that.
You also mentioned that you call a stored procedure, so you might want to take a look at that as well.
This all just looks like a database issue to me.

NServiceBus - Performance degradation when endpoints are disconnected

I'm looking into using NServiceBus for fault tolerant communications between a central server and many remote located PCs.
I'm running the Gateway sample in the newest (3.2) release, and all works well - with a trial commercial license the performance seems great, sending to 3 remote PCs. But to test the fault tolerance, I disconnected one of the PCs, and I notice that although the other sites continue to receive messages from the server, the performance suffers greatly - a site can wait up to 30 seconds to receive a message, presumably because the server is busy dealing with retries for the site that is disconnected.
Is there a simple configuration-type answer to this?

I think the problem here was that I hadn't specified the number of worker threads on the server. I changed this to 5 worker threads, and the issue has now gone away.

500 Socket clients trying to push data to my socket server. Need help in handling

I am required to create a high performance application that will receive 500 socket messages from my socket clients simultaneously. Based on my logs, I can see that my dual core system is processing 80 messages at a time.
I am using async sockets (.BeginReceive) and I have set NoDelay to true.
From the logs on my clients and my server, I can see that a message written by a client is read by my server 3-4 seconds later.
The service time of my application should be much lower.

First, you should post your current code so any potential bugs can be identified.
Second, if you're on .NET 3.5, you might want to look at the SocketAsyncEventArgs enhancements.

Start looking at your resource usages:
CPU usage - both on the overall system, as well as your specific process.
Memory usage - same as above.
Networking statistics.
Once you identify where the bottleneck is, both the community and yourself will have an easier time looking at what to focus on next in improving the performance.
A review of your code may also be necessary - but this may be more appropriate for https://codereview.stackexchange.com/.
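As a starting point for the CPU and memory checks, the process can report its own numbers. This is a rough sketch using `System.Diagnostics.Process`; `ResourceSnapshot` is a hypothetical name, the one-second sample window is an arbitrary choice, and network statistics are left out (Perfmon or netstat cover those).

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class ResourceSnapshot
{
    // Average CPU usage of this process over the sample window,
    // as a percentage of all cores combined.
    public static double SampleCpuPercent(TimeSpan window)
    {
        using (Process process = Process.GetCurrentProcess())
        {
            TimeSpan cpuBefore = process.TotalProcessorTime;
            Stopwatch wall = Stopwatch.StartNew();

            Thread.Sleep(window);

            process.Refresh(); // re-read the counters after the wait
            TimeSpan cpuUsed = process.TotalProcessorTime - cpuBefore;
            double available = wall.Elapsed.TotalMilliseconds * Environment.ProcessorCount;
            return 100.0 * cpuUsed.TotalMilliseconds / available;
        }
    }

    public static void Main()
    {
        double cpu = SampleCpuPercent(TimeSpan.FromSeconds(1));
        using (Process process = Process.GetCurrentProcess())
        {
            Console.WriteLine("CPU: {0:F1}%  working set: {1} MB  threads: {2}",
                cpu,
                process.WorkingSet64 / (1024 * 1024),
                process.Threads.Count);
        }
    }
}
```

Logging these alongside the message timestamps makes it easier to tell whether the 3-4 second delay lines up with CPU saturation, memory pressure, or neither.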

When you do a socket.listen, what is your backlog set to? I can't speak to .net 4.0, but with 2.0 I have seen a problem where once your backlog is filled up (too many connection attempts too fast) then some of the sockets will get a TCP accept and then a TCP Reset. The Client then may or may not attempt to reconnect later again. This causes a connection bottleneck rather than a data throughput or a processing bottleneck.
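For reference, the backlog is the argument to `Socket.Listen`. A minimal sketch follows; `BacklogDemo` is a hypothetical name and 512 is an arbitrary illustrative value, not a recommendation (the OS may cap whatever is requested).

```csharp
using System;
using System.Net;
using System.Net.Sockets;

public static class BacklogDemo
{
    // Creates a listening socket with an explicit accept backlog.
    public static Socket Listen(IPEndPoint endpoint, int backlog)
    {
        var socket = new Socket(endpoint.AddressFamily, SocketType.Stream, ProtocolType.Tcp);
        socket.Bind(endpoint);
        // Length of the queue of connections waiting to be accepted;
        // attempts arriving while it is full may be refused or reset.
        socket.Listen(backlog);
        return socket;
    }

    public static void Main()
    {
        // Port 0 lets the OS pick a free port for this demo.
        using (Socket listener = Listen(new IPEndPoint(IPAddress.Loopback, 0), 512))
        {
            Console.WriteLine("Listening on {0}", listener.LocalEndPoint);
        }
    }
}
```

A larger backlog only buys time; the accept loop still has to drain the queue quickly, so it should do nothing but accept and hand the socket off.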

Is it possible to create a scalable WCF service with thousands of long-running TCP connections?

I'm attempting to create a WCF service where several thousand (~10,000) clients can connect via a duplex NetTcpBinding for extended periods of time (weeks, maybe months).
After a bit of reading, it looks like it's better to host in IIS than a custom application or Windows service.
Is using WCF for such a service acceptable, or even possible? If so, where can I expect to run into throttling or performance issues, such as increasing the WCF ListenBacklog & MaxConcurrentConnections?
Thanks!

Why do you need to keep a connection open for weeks or months? That will introduce a lot of complexity: timeout handling, error handling, recreating connections, etc. I even doubt that this will work.
Net.tcp connections use a transport session, which leads to PerSession instancing of the WCF service: a single service instance serves all requests and lives for the whole duration of the session (weeks or months in your case), so the instance and all its content stay in memory. Any interruption or unhandled exception will break the channel and close the session, so all of the session's local data is lost and the client must create a new proxy to start a new session. Any timeout (default is 20 minutes of inactivity) will also close the session. Finally, depending on the complexity of the business logic, you may find that if even a few hundred clients need processing at the same time, a single server is not able to serve all of them and some clients will time out (which again breaks the session). Load balancing with net.tcp demands a load balancing algorithm with sticky sessions (session affinity), and the whole architecture becomes even more complicated and fragile. Scalability in net.tcp means that the service can be deployed on multiple servers, but each client session must be handled by a single server (if that server dies, all sessions served by it die as well).
Hosting in IIS/WAS/AppFabric has several advantages, two of which are health monitoring and process recycling. Health monitoring continuously checks that the worker process is still alive and can process requests; if it isn't, it silently starts a new worker process and routes new incoming requests to that process. Process recycling regularly recycles the application domain (the default setting is after 29 hours), which keeps the process healthy and reduces the impact of memory leaks. The side effect is that recreating either the process or the application domain will kill all sessions. Once you self-host the service you lose both of these features, so you have to deal with the health of your service yourself.
Edit:
IMHO health status information doesn't have to be sent over TCP. That is information that doesn't require all the fancy stuff. If you lose some of it, nothing is affected, so you can use UDP for health status transfers.
When using TCP you don't need to keep a proxy / session open just to keep the connection open. The TCP connection is not closed immediately when you close the proxy. It remains open in a pool for a short time, and if any other proxy needs a connection to the same server it is reused (the default idle timeout in the pool should be 2 minutes) - I discussed Net.Tcp transport in WCF in another answer.
I'm not a fan of callbacks - this whole concept in WCF is overused and abused. Keeping 10.000 TCP connections open for months just to be able to occasionally send data back to a few PCs sounds ridiculous. If you need to communicate with a PC, expose a service on the PC and call it when you need to send commands. Just add functionality that calls the server when the PC starts and when the PC is about to shut down, plus the transfer of monitoring information.
Anyway, with 10.000 PCs sending information every minute, you can receive 10.000 requests at the same time, which can have the same effect as a denial of service attack. Depending on the processing time, your server(s) may not be able to process them and many requests will time out. You can also think about message queuing or publish-subscribe protocols. Messages are passed to a queue or topic and the server(s) process them continuously.
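The queuing idea can be sketched in-process with a bounded `BlockingCollection`. This is only a stand-in for a real message queue or topic (no durability, no distribution across servers), and `StatusQueue`, its capacity and its worker count are hypothetical; the point is that accepting a message and processing it are decoupled, so a burst fills the queue instead of timing out callers.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public sealed class StatusQueue : IDisposable
{
    private readonly BlockingCollection<string> _queue;
    private readonly Task[] _workers;
    private int _processed;

    public StatusQueue(int capacity, int workerCount)
    {
        _queue = new BlockingCollection<string>(capacity);
        _workers = new Task[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            _workers[i] = Task.Factory.StartNew(Consume, TaskCreationOptions.LongRunning);
        }
    }

    public int Processed
    {
        get { return Volatile.Read(ref _processed); }
    }

    // Returns false instead of blocking when the queue is full, so the
    // caller can shed load or ask the client to retry later.
    public bool TryEnqueue(string message)
    {
        return _queue.TryAdd(message);
    }

    private void Consume()
    {
        foreach (string message in _queue.GetConsumingEnumerable())
        {
            // Real processing of the message would go here.
            Interlocked.Increment(ref _processed);
        }
    }

    // Stops accepting messages and waits until the workers have drained the queue.
    public void Dispose()
    {
        _queue.CompleteAdding();
        Task.WaitAll(_workers);
        _queue.Dispose();
    }
}

public static class StatusQueueDemo
{
    public static void Main()
    {
        var queue = new StatusQueue(10000, 4);
        for (int i = 0; i < 10000; i++)
        {
            queue.TryEnqueue("status from PC " + i);
        }
        queue.Dispose();
        Console.WriteLine("Processed {0} messages", queue.Processed);
    }
}
```

The receiving operation then only needs to enqueue and return, which keeps its latency independent of how long the processing takes.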
