socket buffer size: pros and cons of bigger vs smaller - c#

I've never really worked with COM sockets before, and now have some code that is listening to a rather steady stream of data (500Kb/s - 2000Kb/s).
I've experimented with different sizes, but am not really sure what I'm conceptually doing.
byte[] m_SocketBuffer = new byte[4048];
//vs
byte[] m_SocketBuffer = new byte[8096];
The socket I'm using is System.Net.Sockets.Socket, and this is my constructor:
new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp)
My questions are:
Is there a general trade-off for large vs. small buffers?
How does one go about sizing the buffer? What should you use as a gauge?
I'm retrieving the data like this:
string socketData = Encoding.ASCII.GetString(m_SocketBuffer, 0, iReceivedBytes);
while (socketData.Length > 0)
{
    // do stuff
}
Does my reading event happen when the buffer is full? Like whenever the socket buffer hits the threshold, that's when I can read from it?

Short version:
Optimal buffer size is dependent on a lot of factors, including the underlying network transport and how your own code handles the I/O.
Tens of kilobytes is probably about right for a high-volume server moving a lot of data. But you can use a much smaller buffer if you know that remote endpoints won't ever be sending you a lot of data all at once.
Buffers are pinned for the duration of the I/O operation, which can cause or exacerbate heap fragmentation. For a really high-volume server, it might make sense to allocate very large buffers (larger than 85,000 bytes) so that the buffer is allocated from the large-object heap (which either has no fragmentation issues, or is in a perpetual state of fragmentation, depending on how you look at it :) ), and then use just a portion of each large buffer for any given I/O operation.
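As a rough sketch of the "one large buffer, many slices" idea, assuming fixed-size slices and no return path (a real pool would also take slices back), something like the following could work. BufferSlab and Rent are hypothetical names used only for illustration.

```csharp
using System;

// Hypothetical helper: one large array carved into fixed-size slices.
public sealed class BufferSlab
{
    private readonly byte[] _slab;
    private readonly int _sliceSize;
    private int _next;

    public BufferSlab(int sliceSize, int sliceCount)
    {
        _sliceSize = sliceSize;
        // An array larger than 85,000 bytes is allocated from the large-object heap.
        _slab = new byte[sliceSize * sliceCount];
    }

    public int TotalBytes => _slab.Length;

    // Hands out the next unused slice of the shared array.
    public ArraySegment<byte> Rent()
    {
        if (_next + _sliceSize > _slab.Length)
            throw new InvalidOperationException("Slab exhausted.");
        var slice = new ArraySegment<byte>(_slab, _next, _sliceSize);
        _next += _sliceSize;
        return slice;
    }
}
```

A slice can then be passed to an I/O call such as socket.Receive(slice.Array, slice.Offset, slice.Count, SocketFlags.None), so every operation pins the same large array rather than many small ones.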
Re: your specific questions:
Is there a general trade-off for large vs. small buffers?
Probably the most obvious is the usual: a buffer larger than you will ever actually need is just wasting space.
Making buffers too small forces more I/O operations, possibly forcing more thread context switches (depending on how you are doing I/O), and for sure increasing the number of program statements that have to be executed.
There are other trade-offs of course, but to go into each and every one of them would be far too broad a discussion for this forum.
How does one go about sizing the buffer? What should you use as a gauge?
I'd start with a size that seems "reasonable", and then experiment from there. Adjust the buffer size in various load testing scenarios, increasing and decreasing, to see what if any effect there is on performance.
Does my reading event happen when the buffer is full? Like whenever the socket buffer hits the threshold, that's when I can read from it?
When you read from the socket, the network layer will put as much data into your buffer as it can. If there is more data available than will fit, it fills the buffer. If there is less data available than will fit, the operation completes without filling the buffer, but only once at least one byte has been placed into it. The only time a read operation completes with zero bytes is when the connection is being shut down.
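A minimal sketch of a receive loop that follows these rules is shown below. It is written against Stream (a NetworkStream wraps a connected Socket) so it can be tried without a network; ReceiveLoop and Drain are hypothetical names.

```csharp
using System;
using System.IO;

public static class ReceiveLoop
{
    // Reads until the peer closes the stream; returns the total bytes seen.
    public static long Drain(Stream stream, int bufferSize, Action<byte[], int> onData)
    {
        var buffer = new byte[bufferSize];
        long total = 0;
        while (true)
        {
            // Completes with anywhere from 1 to buffer.Length bytes,
            // not only when the buffer is full.
            int received = stream.Read(buffer, 0, buffer.Length);
            if (received == 0)
                break; // zero bytes: the other side has shut down
            onData(buffer, received); // only the first 'received' bytes are valid
            total += received;
        }
        return total;
    }
}
```

The callback must use the returned count rather than buffer.Length, since most reads will not fill the buffer.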

Related

C# Socket: is multiple sending less efficient than a single send?

I am writing a high-throughput server serving thousands of connections. Suppose I have 400 bytes to send via a socket. Suppose I do it in two ways:
Call the Socket.Send() 40 times, each time sending 10 bytes.
Call the Socket.Send() once, sending 400 bytes.
Do these two ways make much difference in terms of speed, CPU load, etc?
If Socket.NoDelay is left at false, then it will very rarely make any difference - most of the time, you're just going to be buffering locally - albeit with a bit more P/Invoke overhead than is absolutely necessary (due to lots of calls through the socket layer). Note that Socket.NoDelay should usually be set to true in anything where you care.
If Socket.NoDelay is true and everything is working maximally, then you might introduce additional packet fragmentation by using 40 sends of 10 bytes, which would be avoided when using one send of 400 bytes. However, in many cases, the various abstractions and layers in the OS/hardware stacks mean that a lot of the 10-byte chunks will probably end up sharing packets. That's still a lot more packets than 1, in the optimal case, though.
Note also that this is always a trade-off: packet fragmentation will decrease overall throughput, but sending the first bytes sooner could reduce the perceived latency, if the other 390 bytes are going to take a measurable (but presumably small) amount of time to construct.
In most cases: this is unlikely to be a bottleneck. If you can avoid packet fragmentation without causing latency, that may be desirable. If it was me, I'd probably be more concerned with efficient buffer management to maximise scalability while avoiding pauses due to GC; tools like the new "pipelines" IO API can really help with that, and Kestrel can be used to host a TCP server based on "pipelines" in a lot less code than you would be using if you wrote your own socket listener - and it then deals with all the buffer management for you.
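If the small pieces are all available up front, one simple way to get the "single send" behaviour is to copy them into one array first. This is only a sketch under that assumption; SendCoalescer and Coalesce are hypothetical names, and the extra copy is itself a cost worth measuring.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class SendCoalescer
{
    // Copies many small payloads, in order, into one contiguous array.
    public static byte[] Coalesce(IEnumerable<byte[]> chunks)
    {
        using (var ms = new MemoryStream())
        {
            foreach (var chunk in chunks)
                ms.Write(chunk, 0, chunk.Length);
            return ms.ToArray();
        }
    }
}
```

The result can then go out in one call, for example socket.Send(SendCoalescer.Coalesce(chunks)), instead of 40 separate calls to Socket.Send.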

SocketAsyncEventArgs vs TcpListener/TcpClient [duplicate]

Is there a valid reason to not use TcpListener for implementing a high performance/high throughput TCP server instead of SocketAsyncEventArgs?
I've already implemented this high performance/high throughput TCP server using SocketAsyncEventArgs. I went through all sorts of headaches handling those pinned buffers, using a big pre-allocated byte array and pools of SocketAsyncEventArgs for accepting and receiving, and put it together using some low-level stuff and shiny smart code with some TPL Dataflow and some Rx. It works perfectly, almost textbook in this endeavor; actually, I learnt more than 80% of this stuff from other people's code.
However there are some problems and concerns:
Complexity: I cannot delegate any sort of modification to this server to another member of the team. That binds me to this sort of task, and I cannot pay enough attention to other parts of other projects.
Memory usage (pinned byte arrays): With SocketAsyncEventArgs the pools need to be pre-allocated. So to handle 100000 concurrent connections (the worst case, even on different ports) a big pile of RAM sits there uselessly, pre-allocated (even if these conditions are met only some of the time, the server should be able to handle 1 or 2 such peaks every day).
TcpListener actually works well: I put TcpListener to the test (with some tricks like using AcceptTcpClient on a dedicated thread rather than the async version, then sending the accepted connections to a ConcurrentQueue rather than creating Tasks in place, and the like), and with the latest version of .NET it worked very well, almost as well as SocketAsyncEventArgs, with no data loss and a low memory footprint, which helps avoid wasting RAM on the server, and no pre-allocation is needed.
So why do I not see TcpListener being used anywhere, while everybody (including myself) is using SocketAsyncEventArgs? Am I missing something?
I see no evidence that this question is about TcpListener at all. It seems you are only concerned with the code that deals with a connection that already has been accepted. Such a connection is independent of the listener.
SocketAsyncEventArgs is a CPU-load optimization. I'm convinced you can achieve a higher rate of operations per second with it. How significant is the difference to normal APM/TAP async IO? Certainly less than an order of magnitude. Probably between 1.2x and 3x. Last time I benchmarked loopback TCP transaction rate I found that the kernel took about half of the CPU usage. That means your app can get at most 2x faster by being infinitely optimized.
Remember that SocketAsyncEventArgs was added to the BCL around 2007 (with .NET Framework 3.5), when CPUs were far less capable.
Use SocketAsyncEventArgs only when you have evidence that you need it. It causes you to be far less productive. More potential for bugs.
Here's the template that your socket processing loop should look like:
while (ConnectionEstablished()) {
    var someData = await ReadFromSocketAsync(socket);
    await ProcessDataAsync(someData);
}
Very simple code. No callbacks thanks to await.
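A slightly more concrete version of that template might look like the sketch below. It assumes the connection is exposed as a Stream (for example a NetworkStream over the accepted socket) and that a zero-byte read means the peer has closed; ConnectionLoop, RunAsync and processAsync are hypothetical names standing in for the placeholders in the template.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class ConnectionLoop
{
    // Reads and processes data until the peer closes; returns total bytes handled.
    public static async Task<long> RunAsync(
        Stream stream,
        Func<byte[], int, Task> processAsync,
        int bufferSize = 16 * 1024)
    {
        var buffer = new byte[bufferSize]; // one buffer reused for the whole connection
        long total = 0;
        int read;
        // ReadAsync completes with 0 once the connection has been shut down.
        while ((read = await stream.ReadAsync(buffer, 0, buffer.Length)) > 0)
        {
            // The buffer is reused on the next iteration, so the callback
            // must finish with (or copy) the data before returning.
            await processAsync(buffer, read);
            total += read;
        }
        return total;
    }
}
```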
In case you are concerned about managed heap fragmentation: Allocate a new byte[1024 * 1024] on startup. When you want to read from a socket read a single byte into some free portion of this buffer. When that single-byte read completes you ask how many bytes are actually there (Socket.Available) and synchronously pull the rest. That way you only pin a single rather small buffer and still can use async IO to wait for data to arrive.
This technique does not require polling. Since Socket.Available can only increase until we read from the socket, we do not risk accidentally performing a read that is too small.
Alternatively, you can combat managed heap fragmentation by allocating a few very big buffers and handing out chunks.
Or, if you don't find this to be a problem you don't need to do anything.

Avoid memory fragmentation in high performance server sockets

I've been looking into using the new async socket functionality in .NET 3.5, and I think I understand the concepts of using a large preallocated buffer pool in order to prevent memory fragmentation and the performance decrease from the GC processing the buffers.
While thinking about all of this, however, I got concerned about the way I actually save the data from the buffers until the entire message is received. Currently, on each receive, I append the received data onto a pending message object (a StringBuilder in some implementations, and another byte array in others).
Are these byte arrays safe since they never get marshalled to unmanaged code? Or is there a common approach used to retain data across multiple receives in order to build the complete message for processing?

C# get max value for Socket.ReceiveBufferSize and Socket.SendBufferSize on a system

Our high throughput application (~1gbps) benefits greatly from a large ReceiveBufferSize and SendBufferSize.
I noticed on my machine I can have a 100 MB buffer size with no problems but on some client and test machines the max value is a little over 10 MB and seems to be variable.
Are there any methods to query the system for what the max tx/rx buffer size can be?
Actually for high performance networking the SO_RCVBUF and SO_SNDBUF options should be set to 0 to avoid buffer copies, as per KB181611:
If you use the SO_RCVBUF and SO_SNDBUF option to set zero TCP stack receive and send buffer, you basically instruct the TCP stack to directly perform I/O using the buffer provided in your I/O call. Therefore, in addition to the nonblocking advantage of the overlapped socket I/O, the other advantage is better performance because you save a buffer copy between the TCP stack buffer and the user buffer for each I/O call. But you have to make sure you don't access the user buffer once it's submitted for overlapped operation and before the overlapped operation completes.
The max values you can set these options to (which are the real setting behind the managed Socket.ReceiveBufferSize) are 'implementation dependent'. Other TCP parameters are documented at TCP/IP Registry Settings.
Those two properties internally play with the socket options (via SetSocketOption, eventually to the native setsockopt). If memory serves these are going to depend on the non-paged pool memory available (which changes machine to machine) and potentially which network driver is on each machine.
Regardless, you actually aren't guaranteed that the buffer size you requested is used, you'll have to retrieve the current buffer size after the fact to make sure it was used. Moreover, on Windows 7 and 2008 machines it is my understanding that your buffers may be dynamically increased/decreased.
In short, you likely can only test increasing buffer sizes and take the maximum that does not cause an error. There are too many variables at play which could determine the maximum size.
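A minimal sketch of the "request a size, then read back what you actually got" check is shown below. BufferProbe and TryReceiveBufferSize are hypothetical names; the value read back is platform dependent and may differ from the request, and the same pattern would apply to SendBufferSize.

```csharp
using System;
using System.Net.Sockets;

public static class BufferProbe
{
    // Requests a receive buffer size and returns what the OS reports afterwards,
    // or -1 if the socket could not be created or the request was rejected.
    public static int TryReceiveBufferSize(int requestedBytes)
    {
        try
        {
            using (var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp))
            {
                socket.ReceiveBufferSize = requestedBytes; // sets SO_RCVBUF
                return socket.ReceiveBufferSize;           // read back the effective value
            }
        }
        catch (SocketException)
        {
            return -1;
        }
    }
}
```

Calling this with increasing sizes and keeping the largest value that is accepted gives an empirical maximum for that particular machine.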

Fixing gaps in streamed USB data

We have a hardware system with some FPGAs and an FTDI USB controller. The hardware streams data over USB bulk transfer to the PC at around 5MB/s and the software is tasked with staying in sync, checking the CRC and writing the data to file.
The FTDI chip has a 'busy' pin which goes high while it's waiting for the PC to do its business. There is a limited amount of buffering in the FTDI and elsewhere on the hardware.
The busy line is going high for longer than the hardware can buffer (50-100ms) so we are losing data. To save us from having to re-design the hardware I have been asked to 'fix' this issue!
I think my code is quick enough as we've had it running up to 15MB/s, so that leaves an IO bottleneck somewhere. Are we just expecting too much from the PC/OS?
Here is my data entry point. Occasionally we get a dropped bit or byte. If the checksum doesn't compute, I shift through until it does. byte[] data is nearly always 4k.
void ftdi_OnData(byte[] data)
{
    List<byte> buffer = new List<byte>(data.Length);
    int index = 0;
    while ((index + rawFile.Header.PacketLength + 1) < data.Length)
    {
        if (CheckSum.CRC16(data, index, rawFile.Header.PacketLength + 2)) // <- packet length + 2 for 16bit checksum
        {
            buffer.AddRange(data.SubArray<byte>(index, rawFile.Header.PacketLength));
            index += rawFile.Header.PacketLength + 2; // <- skip the two checksum bytes, we don't want to save them...
        }
        else
        {
            index++; // shift through
        }
    }
    rawFile.AddData(buffer.ToArray(), 0, buffer.Count);
}
Tip: do not write to a file directly; queue.
Modern computers have multiple processors. If you want certain things as fast as possible, use multiple processors.
Have one thread deal with the USB data, check checksums etc. It queues (ONLY) the results to a thread-safe queue.
Another thread reads data from the queue and writes it to a file, possibly buffered.
Finished ;)
100ms is a lot of time for decent operations. I have successfully managed around 250,000 IO data packets per second (financial data) using C# without breaking a sweat.
Basically, make sure your IO threads do ONLY that and use your internal memory as a buffer. Especially when dealing with hardware on one end, the thread doing that should ONLY do that, POSSIBLY running at high priority if needed.
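The two-thread arrangement described above could be sketched roughly as follows, using BlockingCollection as the thread-safe queue. QueuedWriter and its members are hypothetical names, and the queue bound is an arbitrary assumption to tune under load.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public sealed class QueuedWriter : IDisposable
{
    private readonly BlockingCollection<byte[]> _queue;
    private readonly Task _writerTask;

    public QueuedWriter(Stream output, int maxQueuedBlocks = 1024)
    {
        _queue = new BlockingCollection<byte[]>(maxQueuedBlocks);
        // Dedicated consumer: the only code that touches the output stream.
        _writerTask = Task.Factory.StartNew(() =>
        {
            foreach (var block in _queue.GetConsumingEnumerable())
                output.Write(block, 0, block.Length);
            output.Flush();
        }, TaskCreationOptions.LongRunning);
    }

    // Called from the I/O thread. Blocks only if the queue is full.
    // The caller hands over the array and must not reuse it afterwards.
    public void Enqueue(byte[] block) => _queue.Add(block);

    // Signals that no more data is coming and waits for the writer to drain the queue.
    public void Dispose()
    {
        _queue.CompleteAdding();
        _writerTask.Wait();
        _queue.Dispose();
    }
}
```

The data handler would then only validate the packet and call Enqueue, leaving file I/O to the writer thread.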
To get good read throughput on Windows on USB, you generally need to have multiple asynchronous reads (or very large reads, which is often less convenient) queued onto the USB device stack. I'm not quite sure what the FTDI drivers / libraries do internally in this regard.
Traditionally I have written mechanisms with an array of OVERLAPPED structures and an array of buffers, and kept shovelling them into ReadFile as soon as they're free. I was doing 40+MB/s reads on USB2 like this about 5-6 years ago, so modern PCs should certainly be able to cope.
It's very important that you (or your drivers/libraries) don't get into a "start a read, finish a read, deal with the data, start another read" cycle, because you'll find that the bus is idle for vast swathes of time. A USB analyser would show you if this was happening.
I agree with the others that you should get off the thread that the read is happening on as soon as possible - don't block the FTDI event handler for any longer than it takes to put the buffer into another queue.
I'd preallocate a circular queue of buffers, pick the next free one and throw the received data into it, then complete the event handling as quickly as possible.
All that checksumming and concatenation with its attendant memory allocation, garbage collection, etc, can be done the other side of potentially 100s of MB of buffer time/space on the PC. At the moment you may well be effectively asking your FPGA/hardware buffer to accommodate the time taken for you to do all sorts of ponderous PC stuff which can be done much later.
I'm optimistic though - if you can really buffer 100ms of data on the hardware, you should be able to get this working reliably. I wish I could persuade all my clients to allow so much...
So what does your receiving code look like? Do you have a thread running at high priority responsible solely for capturing the data and passing it in memory to another thread in a non-blocking fashion? Do you run the process itself at an elevated priority?
Have you designed the rest of your code to avoid the more expensive 2nd gen garbage collections? How large are your buffers, are they on the large object heap? Do you reuse them efficiently?
