Is there a valid reason not to use TcpListener, instead of SocketAsyncEventArgs, for implementing a high-performance/high-throughput TCP server?
I've already implemented this high-performance/high-throughput TCP server using SocketAsyncEventArgs and went through all sorts of headaches: handling pinned buffers via a big pre-allocated byte array, pools of SocketAsyncEventArgs for accepting and receiving, and gluing it together with some low-level code, a bit of TPL Dataflow, and some Rx. It works perfectly, almost textbook, and honestly I learned more than 80% of this stuff from other people's code.
However, there are some problems and concerns:
Complexity: I cannot delegate any modification of this server to another member of the team. That ties me to this sort of task and keeps me from paying enough attention to other parts of other projects.
Memory usage (pinned byte arrays): With SocketAsyncEventArgs the pools need to be pre-allocated. So to handle 100,000 concurrent connections (the worst case, even across different ports) a big pile of RAM sits there uselessly, pre-allocated, even though those conditions are only met occasionally (and the server should be able to handle one or two such peaks every day).
TcpListener actually works well: I put TcpListener to the test (with some tricks, like calling AcceptTcpClient on a dedicated thread rather than the async version, pushing the accepted connections into a ConcurrentQueue instead of creating Tasks in place, and the like; roughly the sketch shown below), and with the latest version of .NET it worked very well, almost as well as SocketAsyncEventArgs: no data loss, a low memory footprint that doesn't waste RAM on the server, and no pre-allocation needed.
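For reference, here is roughly what that setup looked like (a simplified sketch, not my real code; HandleConnection stands in for the actual per-connection logic):
// Needs System.Net, System.Net.Sockets, System.Collections.Concurrent, System.Threading, System.Threading.Tasks.
var listener = new TcpListener(IPAddress.Any, 9000);
listener.Start();
var accepted = new ConcurrentQueue<TcpClient>();

// Dedicated accept thread: block on the synchronous AcceptTcpClient and just enqueue.
new Thread(() =>
{
    while (true)
        accepted.Enqueue(listener.AcceptTcpClient());
}) { IsBackground = true }.Start();

// Separate worker drains the queue and kicks off per-connection processing.
new Thread(() =>
{
    while (true)
    {
        if (accepted.TryDequeue(out TcpClient client))
            Task.Run(() => HandleConnection(client));   // placeholder for the real handler
        else
            Thread.Sleep(1);
    }
}) { IsBackground = true }.Start();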
So why don't I see TcpListener being used anywhere, and why is everybody (including myself) using SocketAsyncEventArgs? Am I missing something?
I see no evidence that this question is about TcpListener at all. It seems you are only concerned with the code that deals with a connection that has already been accepted. Such a connection is independent of the listener.
SocketAsyncEventArgs is a CPU-load optimization. I'm convinced you can achieve a higher rate of operations per second with it. How significant is the difference to normal APM/TAP async IO? Certainly less than an order of magnitude. Probably between 1.2x and 3x. Last time I benchmarked loopback TCP transaction rate I found that the kernel took about half of the CPU usage. That means your app can get at most 2x faster by being infinitely optimized.
Remember that SocketAsyncEventArgs was added to the BCL back in the .NET 3.5 era (around 2007), when CPUs were far less capable.
Use SocketAsyncEventArgs only when you have evidence that you need it. It makes you far less productive and creates more potential for bugs.
Here's roughly what your socket processing loop should look like:
while (ConnectionEstablished()) {
    var someData = await ReadFromSocketAsync(socket);
    await ProcessDataAsync(someData);
}
Very simple code. No callbacks thanks to await.
In case you are concerned about managed heap fragmentation: allocate a new byte[1024 * 1024] on startup. When you want to read from a socket, read a single byte into some free portion of this buffer. When that single-byte read completes, ask how many bytes are actually there (Socket.Available) and synchronously pull the rest. That way you only pin a single, rather small buffer and can still use async IO to wait for data to arrive.
This technique does not require polling. Since Socket.Available can only increase while we are not reading from the socket, we do not risk accidentally performing a read that is too small.
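As a rough sketch of that technique (assuming a plain Socket and a long-lived shared buffer that you manage yourself; this is one possible shape of the ReadFromSocketAsync placeholder above, not the only one):
// Wait for data with a 1-byte async read into a slice of a long-lived buffer,
// then drain whatever else the kernel already has synchronously.
static async Task<byte[]> ReadFromSocketAsync(Socket socket, byte[] sharedBuffer, int offset)
{
    // Only this tiny slice of the long-lived buffer is pinned while we wait.
    int n = await socket.ReceiveAsync(
        new ArraySegment<byte>(sharedBuffer, offset, 1), SocketFlags.None);
    if (n == 0)
        return new byte[0];                              // remote side closed the connection

    int available = socket.Available;                    // bytes already buffered by the kernel
    var result = new byte[1 + available];
    result[0] = sharedBuffer[offset];
    if (available > 0)
        socket.Receive(result, 1, available, SocketFlags.None);   // synchronous, no long-lived pin
    return result;
}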
Alternatively, you can combat managed heap fragmentation by allocating a few very big buffers and handing out chunks.
Or, if you don't find this to be a problem you don't need to do anything.
Related
I am writing a high-throughput server serving thousands of connections. Suppose I have 400 bytes to send via a socket. Suppose I do it in two ways:
Call Socket.Send() 40 times, each time sending 10 bytes.
Call Socket.Send() once, sending 400 bytes.
Do these two ways make much difference in terms of speed, CPU load, etc?
If Socket.NoDelay is left at false, then it will very rarely make any difference: most of the time you're just buffering locally, albeit with a bit more P/Invoke overhead than is absolutely necessary (due to lots of calls through the socket layer). Note that Socket.NoDelay should usually be set to true in anything where you care about latency.
If Socket.NoDelay is true, then if everything is working maximally, you might introduce additional packet fragmentation by using 40 sends of 10 bytes, which would be avoided when using one send of 400 bytes. However, in many cases, the various abstractions and layers in the OS/hardware stacks mean that a lot of the 10-byte chunks will probably end up sharing packets. That's still a lot more packets than 1 in the optimal case, though.
Note also that this is always a trade-off: packet fragmentation will decrease overall throughput, but sending the first bytes sooner could reduce the perceived latency, if the other 390 bytes are going to take a measurable (but presumably small) amount of time to construct.
In most cases: this is unlikely to be a bottleneck. If you can avoid packet fragmentation without causing latency, that may be desirable. If it was me, I'd probably be more concerned with efficient buffer management to maximise scalability while avoiding pauses due to GC; tools like the new "pipelines" IO API can really help with that, and Kestrel can be used to host a TCP server based on "pipelines" in a lot less code than you would be using if you wrote your own socket listener - and it then deals with all the buffer management for you.
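For illustration, a hedged sketch of the "set NoDelay and coalesce the writes yourself" idea from above (the 40 chunks and the sizes are just the example from the question):
socket.NoDelay = true;                        // disable Nagle: small sends go out immediately

// Coalesce the 40 x 10-byte pieces into one 400-byte payload and send it once.
using (var ms = new MemoryStream(400))
{
    foreach (byte[] piece in pieces)          // 'pieces' = the 40 chunks of 10 bytes
        ms.Write(piece, 0, piece.Length);

    byte[] payload = ms.ToArray();
    int sent = 0;
    while (sent < payload.Length)             // Send may transmit fewer bytes than requested
        sent += socket.Send(payload, sent, payload.Length - sent, SocketFlags.None);
}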
I've never really worked with COM sockets before, and now have some code that is listening to a rather steady stream of data (500Kb/s - 2000Kb/s).
I've experimented with different sizes, but am not really sure what I'm conceptually doing.
byte[] m_SocketBuffer = new byte[4048];
//vs
byte[] m_SocketBuffer = new byte[8096];
The socket I'm using is System.Net.Sockets.Socket, and this is my constructor:
new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp)
My questions are:
Is there a general trade-off for large vs. small buffers?
How does one go about sizing the buffer? What should you use as a gauge?
I'm retrieving the data like this:
string socketData = Encoding.ASCII.GetString(m_SocketBuffer, 0, iReceivedBytes);
while (socketData.Length > 0)
{ // do stuff }
Does my reading event happen when the buffer is full? Like whenever the socket buffer hits the threshold, that's when I can read from it?
Short version:
Optimal buffer size is dependent on a lot of factors, including the underlying network transport and how your own code handles the I/O.
10's of K is probably about right for a high-volume server moving a lot of data. But you can use much smaller if you know that remote endpoints won't ever be sending you a lot of data all at once.
Buffers are pinned for the duration of the I/O operation, which can cause or exacerbate heap fragmentation. For a really high-volume server, it might make sense to allocate very large buffers (larger than 85,000 bytes) so that the buffer is allocated from the large-object heap (which either has no fragmentation issues, or is in a perpetual state of fragmentation, depending on how you look at it :) ), and then use just a portion of each large buffer for any given I/O operation.
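A minimal sketch of that "carve a large buffer into chunks" idea (the 1 MB size and 8 KB chunk size are arbitrary):
// A single allocation of 85,000+ bytes goes on the large object heap, which the GC
// does not compact by default, so pinning slices of it for I/O causes no extra fragmentation.
const int ChunkSize = 8 * 1024;
byte[] bigBuffer = new byte[1024 * 1024];                 // 1 MB, well over the LOH threshold
var freeChunks = new ConcurrentStack<ArraySegment<byte>>();

for (int offset = 0; offset < bigBuffer.Length; offset += ChunkSize)
    freeChunks.Push(new ArraySegment<byte>(bigBuffer, offset, ChunkSize));

// Per I/O operation: rent a chunk, use it, return it.
if (freeChunks.TryPop(out ArraySegment<byte> chunk))
{
    int n = socket.Receive(chunk.Array, chunk.Offset, chunk.Count, SocketFlags.None);
    // ... process chunk.Array[chunk.Offset .. chunk.Offset + n) ...
    freeChunks.Push(chunk);                               // hand the slice back to the pool
}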
Re: your specific questions:
Is there a general trade-off for large vs. small buffers?
Probably the most obvious is the usual: a buffer larger than you will ever actually need is just wasting space.
Making buffers too small forces more I/O operations, possibly forcing more thread context switches (depending on how you are doing I/O), and for sure increasing the number of program statements that have to be executed.
There are other trade-offs of course, but to go into each and every one of them would be far too broad a discussion for this forum.
How does one go about sizing the buffer? What should you use as a gauge?
I'd start with a size that seems "reasonable", and then experiment from there. Adjust the buffer size in various load testing scenarios, increasing and decreasing, to see what if any effect there is on performance.
Does my reading event happen when the buffer is full? Like whenever the socket buffer hits the threshold, that's when I can read from it?
When you read from the socket, the network layer will put as much data into your buffer as it can. If there is more data available than will fit, it fills the buffer. If there is less data available than will fit, the operation will complete without filling the buffer (but only once at least one byte has been placed into the buffer; the only time a read operation completes with zero bytes is when the connection is being shut down).
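In code, the same semantics (a sketch with a plain synchronous Receive; the async variants behave the same way):
byte[] buffer = new byte[8192];
int n = socket.Receive(buffer);   // blocks until at least one byte is available

if (n == 0)
{
    // Zero bytes: the remote side has shut down its half of the connection.
}
else
{
    // 1 <= n <= buffer.Length: whatever was available, not necessarily a full buffer
    // and not necessarily one complete application-level message.
}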
I am currently working on a .NET C# socket server which should be able to scale up to 100K concurrent connections. I am using the SocketAsyncEventArgs class and the pattern mentioned here. Correct me if I am wrong, but I understand that maintaining 100K concurrent connections is different from 100K clients hitting the socket server at the exact same time. My question is: how many connections can I make simultaneously? Is this dependent on the socket backlog variable? If so, what is the max backlog value I can set?
Thanks in advance
I am currently working on a .NET C# socket server which should be able to scale up to 100K concurrent connections.
Last time I tested this on Win7 this was an easy goal to reach. The number of connections seems to be limited by memory usage.
I am using the SocketAsyncEventArgs class and the pattern mentioned here.
This pattern is meant for sustaining a very high frequency of calls. It is not what you need to maintain a high number of connections, because it uses more memory than a simple outstanding BeginRead call. Always ask why and don't just copy sample code. Most sample code about sockets is horribly wrong, even on MSDN.
Have a single BeginRead call outstanding per socket. Until a read call is completed, the memory buffer given to it is pinned. This causes GC problems. Either use one big pre-allocated buffer (64 MB or so) or read only one byte at first. Only when that one-byte read completes do you read the rest with a bigger buffer.
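A hedged sketch of that "one outstanding read, one byte first" idea with the APM calls (Process is a placeholder; error handling omitted):
// One outstanding BeginReceive per socket; only a 1-byte long-lived buffer stays pinned.
void StartRead(Socket socket, byte[] oneByte /* long-lived 1-byte buffer */)
{
    socket.BeginReceive(oneByte, 0, 1, SocketFlags.None, ar =>
    {
        int n = socket.EndReceive(ar);
        if (n == 0) { socket.Close(); return; }         // remote side closed

        // Data has arrived: pull whatever else is queued with a short-lived buffer.
        int available = socket.Available;
        byte[] rest = new byte[available];
        if (available > 0)
            socket.Receive(rest, 0, available, SocketFlags.None);

        Process(oneByte[0], rest);                      // placeholder for real handling
        StartRead(socket, oneByte);                     // post the next single-byte read
    }, null);
}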
Correct me if I am wrong, but I understand that maintaining 100K concurrent connections is different from 100K clients hitting the socket server at the exact same time.
Not sure I understand. 100K clients coming in within the same millisecond would be hard to handle, while maintaining 100K connections that were established over the course of seconds is much easier.
My question is: how many connections can I make simultaneously?
Test that and expect to find a high number. Watch your RAM usage.
Is this dependent on the socket backlog variable?
That is for outstanding connections that have not been handed to the application. It is mostly meaningless in practice because an app should have a fast accept loop immediately accepting anything.
If so, what is the max backlog value I can set?
Set the default.
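To illustrate the "fast accept loop" point from above, a rough sketch (HandleClientAsync is a placeholder):
var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
listener.Bind(new IPEndPoint(IPAddress.Any, 9000));
listener.Listen(512);                                 // the exact backlog value matters little here

while (true)
{
    Socket client = await listener.AcceptAsync();     // Task-based accept (SocketTaskExtensions)
    Task.Run(() => HandleClientAsync(client));        // hand off immediately; never await per-client work here
}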
In some asynchronous TCP server code I have, occasionally an error occurs that causes the process to consume the entire system's memory. Looking at the logs, the event viewer, and some MS docs, the problem happens if "the calling application makes Asynchronous IO calls to the same client multiple times then you might see a heap fragmentation and private byte increase if the remote client stops its end of I/O", which results in spikes in memory usage and pinning of System.Threading.OverlappedData structs and byte arrays.
The KB article's proposed solution is to "set an upper bound on the amount of buffers outstanding (either send or receive) with their asynchronous IO."
How does one do this? Is this referring to the byte[] buffers that are passed into BeginRead? So is the solution simply wrapping access to those byte[]s with a semaphore?
EDIT: Semaphore-controlled access to byte buffers, or just having a statically sized pool of byte buffers, are two common solutions. A concern I have that still remains is that when this async-client problem occurs (maybe it's actually some weird network event), having semaphores or byte-buffer pools will prevent me from running out of memory, but it does not solve the problem. My pool of buffers will likely get gobbled up by the problem client(s), in effect locking correctly functioning, legitimate clients out.
EDIT 2: Came across this great answer. Basically it shows how to manually unpin objects. And while asynchronous TCP code leaves pinning up to behind-the-scenes runtime rules, it might be possible to override that by explicitly pinning each buffer before use, then unpinning it at the end of the block or in a finally. I'm trying to figure that out now...
One way of addressing the problem is by pre-allocating buffers and other data structures used in async communications. If you preallocate on startup, there will be no fragmentation, since the memory will naturally reside in the same area of the heap.
I recommend using the ReceiveAsync/SendAsync APIs added in .NET 3.5 SP1, which allow you to cache or pre-allocate both the SocketAsyncEventArgs structure and the memory buffer stored in the SocketAsyncEventArgs.Buffer property, unlike the older BeginXXX/EndXXX APIs which only allow caching or pre-allocating the memory buffer.
Using the old API also incurred significant CPU costs, because the API internally created Windows Overlapped I/O structures again and again. In the new API this takes place within SocketAsyncEventArgs, and so by pooling these objects, the CPU cost is paid only once.
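A minimal sketch of such a pool, assuming one big pre-allocated backing buffer carved into fixed-size slots (slot count and size are arbitrary; OnIoCompleted stands for your completion handler):
// Pre-allocate both the SocketAsyncEventArgs objects and one big backing buffer,
// then reuse them for the lifetime of the server.
const int SlotSize = 4096;
const int MaxOps   = 10000;

byte[] backing = new byte[SlotSize * MaxOps];          // single long-lived allocation
var pool = new ConcurrentStack<SocketAsyncEventArgs>();

for (int i = 0; i < MaxOps; i++)
{
    var args = new SocketAsyncEventArgs();
    args.SetBuffer(backing, i * SlotSize, SlotSize);   // each args owns one fixed slot
    args.Completed += OnIoCompleted;                   // your completion handler
    pool.Push(args);
}

// Per operation: rent, use, return when the operation completes.
if (pool.TryPop(out SocketAsyncEventArgs receiveArgs))
{
    if (!socket.ReceiveAsync(receiveArgs))             // false = completed synchronously
        OnIoCompleted(socket, receiveArgs);
}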
Regarding your update on pinning: pinning is there for a reason, namely to prevent GC from moving the buffer during defragmentation. By unpinning manually, you may cause memory corruption.
We have a hardware system with some FPGA's and an FTDI USB controller. The hardware streams data over USB bulk transfer to the PC at around 5MB/s and the software is tasked with staying in sync, checking the CRC and writing the data to file.
The FTDI chip has a 'busy' pin which goes high while it's waiting for the PC to do its business. There is a limited amount of buffering in the FTDI and elsewhere on the hardware.
The busy line is going high for longer than the hardware can buffer (50-100ms) so we are losing data. To save us from having to re-design the hardware I have been asked to 'fix' this issue!
I think my code is quick enough as we've had it running up to 15MB/s, so that leaves an IO bottleneck somewhere. Are we just expecting too much from the PC/OS?
Here is my data entry point. Occasionally we get a dropped bit or byte. If the checksum doesn't compute, I shift through until it does. byte[] data is nearly always 4k.
void ftdi_OnData(byte[] data)
{
    List<byte> buffer = new List<byte>(data.Length);
    int index = 0;
    while ((index + rawFile.Header.PacketLength + 1) < data.Length)
    {
        if (CheckSum.CRC16(data, index, rawFile.Header.PacketLength + 2)) // <- packet length + 2 for 16bit checksum
        {
            buffer.AddRange(data.SubArray<byte>(index, rawFile.Header.PacketLength));
            index += rawFile.Header.PacketLength + 2; // <- skip the two checksums, we don't want to save them...
        }
        else
        {
            index++; // shift through
        }
    }
    rawFile.AddData(buffer.ToArray(), 0, buffer.Count);
}
Tip: do not write to a file directly... queue the data instead.
Modern computers have multiple processors. If you want certain things as fast as possible, use multiple processors.
Have one thread deal with the USB data, check checksums, etc. It queues (ONLY) the results into a thread-safe queue.
Another thread reads data from the queue and writes it to a file, possibly buffered.
Finished ;)
100ms is a lot of time for decent operations. I have successfully handled around 250,000 IO data packets per second (financial data) in C# without breaking a sweat.
Basically, make sure your IO threads do ONLY that and use your internal memory as a buffer. Especially when dealing with hardware on one end, the thread doing that should do ONLY that, possibly, if needed, running at high priority.
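A sketch of that split using BlockingCollection as the thread-safe queue (file name, capacity, and buffer sizes are illustrative; needs System.Collections.Concurrent, System.IO, System.Threading):
// The acquisition side only validates and enqueues; a dedicated thread owns the
// FileStream and does all the (buffered) writing.
var queue = new BlockingCollection<byte[]>(boundedCapacity: 1024);

var writer = new Thread(() =>
{
    using (var fs = new FileStream("capture.raw", FileMode.Append,
                                   FileAccess.Write, FileShare.Read, 1 << 20))
    {
        foreach (byte[] packet in queue.GetConsumingEnumerable())
            fs.Write(packet, 0, packet.Length);         // sequential, buffered writes
    }
}) { IsBackground = true };
writer.Start();

// In the USB data handler: after the checksum passes, just hand off and return.
//     queue.Add(validatedPacket);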
To get good read throughput on Windows on USB, you generally need to have multiple asynchronous reads (or very large reads, which is often less convenient) queued onto the USB device stack. I'm not quite sure what the FTDI drivers / libraries do internally in this regard.
Traditionally I have written mechanisms with an array of OVERLAPPED structures and an array of buffers, and kept shovelling them into ReadFile as soon as they're free. I was doing 40+ MB/s reads on USB 2 like this about 5-6 years ago, so modern PCs should certainly be able to cope.
It's very important that you (or your drivers/libraries) don't get into a "start a read, finish a read, deal with the data, start another read" cycle, because you'll find that the bus is idle for vast swathes of time. A USB analyser would show you if this was happening.
I agree with the others that you should get off the thread that the read is happening on as soon as possible; don't block the FTDI event handler for any longer than it takes to put the buffer into another queue.
I'd preallocate a circular queue of buffers, pick the next free one and throw the received data into it, then complete the event handling as quickly as possible.
All that checksumming and concatenation with its attendant memory allocation, garbage collection, etc, can be done the other side of potentially 100s of MB of buffer time/space on the PC. At the moment you may well be effectively asking your FPGA/hardware buffer to accommodate the time taken for you to do all sorts of ponderous PC stuff which can be done much later.
I'm optimistic though - if you can really buffer 100ms of data on the hardware, you should be able to get this working reliably. I wish I could persuade all my clients to allow so much...
So what does your receiving code look like? Do you have a thread running at high priority responsible solely for capturing the data and passing it in memory to another thread in a non-blocking fashion? Do you run the process itself at an elevated priority?
Have you designed the rest of your code to avoid the more expensive gen-2 garbage collections? How large are your buffers, are they on the large object heap? Do you reuse them efficiently?