Avoid memory fragmentation in high performance server sockets - c#

I've been looking into using the new async socket functionality in .NET 3.5, and I think I understand the concept of using a large pre-allocated buffer pool in order to prevent memory fragmentation and the performance hit of the GC processing the buffers.
While thinking about all of this, however, I got concerned about the way I actually save the data from the buffers until the entire message is received. Currently, on each receive, I append the received data onto a pending message object (a StringBuilder in some implementations, and another byte array in others).
Are these byte arrays safe since they never get marshalled to unmanaged code? Or is there a common approach used to retain data across multiple receives in order to build the complete message for processing?
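For reference, the per-connection accumulation described above can stay entirely in managed code. Below is a minimal sketch of that idea, assuming a delimiter-terminated message format; the ConnectionState class and its members are illustrative, not an established API:

using System;
using System.IO;

// Hypothetical per-connection state: accumulates bytes copied out of the
// pooled receive buffer until a complete, delimiter-terminated message exists.
class ConnectionState
{
    private readonly MemoryStream _pending = new MemoryStream();

    // Called after each completed receive; buffer/offset/count describe
    // the bytes that just arrived in the pooled receive buffer.
    public void Append(byte[] buffer, int offset, int count)
    {
        _pending.Write(buffer, offset, count);
    }

    // Copies out one complete message if the delimiter has been seen.
    // The pooled receive buffer itself is never retained here.
    public bool TryTakeMessage(byte delimiter, out byte[] message)
    {
        byte[] data = _pending.GetBuffer();
        int used = (int)_pending.Length;
        int end = Array.IndexOf(data, delimiter, 0, used);
        if (end < 0) { message = null; return false; }

        message = new byte[end];
        Buffer.BlockCopy(data, 0, message, 0, end);

        // Shift the unconsumed tail to the front and truncate.
        int remaining = used - (end + 1);
        Buffer.BlockCopy(data, end + 1, data, 0, remaining);
        _pending.SetLength(remaining);
        return true;
    }
}

Arrays like these stay purely managed and only get pinned if they are themselves handed to an I/O call, which is one reason copying out of the pinned receive buffer into per-connection storage is a common pattern.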

Related

socket buffer size: pros and cons of bigger vs smaller

I've never really worked with sockets before, and now have some code that is listening to a rather steady stream of data (500Kb/s - 2000Kb/s).
I've experimented with different sizes, but am not really sure what I'm conceptually doing.
byte[] m_SocketBuffer = new byte[4048];
//vs
byte[] m_SocketBuffer = new byte[8096];
The socket I'm using is System.Net.Sockets.Socket, and this is my constructor:
new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp)
My questions are:
Is there a general trade-off for large vs. small buffers?
How does one go about sizing the buffer? What should you use as a gauge?
I'm retrieving the data like this:
string socketData = Encoding.ASCII.GetString(m_SocketBuffer, 0, iReceivedBytes);
while (socketData.Length > 0)
{ //do stuff }
Does my reading event happen when the buffer is full? Like whenever the socket buffer hits the threshold, that's when I can read from it?
Short version:
Optimal buffer size is dependent on a lot of factors, including the underlying network transport and how your own code handles the I/O.
Tens of KB is probably about right for a high-volume server moving a lot of data. But you can use much smaller buffers if you know that remote endpoints won't ever be sending you a lot of data all at once.
Buffers are pinned for the duration of the I/O operation, which can cause or exacerbate heap fragmentation. For a really high-volume server, it might make sense to allocate very large buffers (larger than 85,000 bytes) so that the buffer is allocated from the large-object heap (which either has no fragmentation issues, or is in a perpetual state of fragmentation, depending on how you look at it :) ), and then use just a portion of each large buffer for any given I/O operation.
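As a rough illustration of that large-buffer approach (the pool class, slice size, and names below are assumptions, not a prescribed design), one array above the 85,000-byte threshold can be carved into fixed-size slices that are handed out per I/O operation:

using System;
using System.Collections.Concurrent;

// Minimal slab-allocator sketch: one large array lives on the LOH,
// and each socket operation borrows a fixed-size slice of it.
class BufferSlicePool
{
    private readonly byte[] _slab;
    private readonly ConcurrentQueue<int> _freeOffsets = new ConcurrentQueue<int>();
    private readonly int _sliceSize;

    public BufferSlicePool(int sliceSize, int sliceCount)
    {
        _sliceSize = sliceSize;
        _slab = new byte[sliceSize * sliceCount];   // > 85,000 bytes => large-object heap
        for (int i = 0; i < sliceCount; i++)
            _freeOffsets.Enqueue(i * sliceSize);
    }

    public bool TryRent(out ArraySegment<byte> slice)
    {
        int offset;
        if (_freeOffsets.TryDequeue(out offset))
        {
            slice = new ArraySegment<byte>(_slab, offset, _sliceSize);
            return true;
        }
        slice = default(ArraySegment<byte>);
        return false;
    }

    public void Return(ArraySegment<byte> slice)
    {
        _freeOffsets.Enqueue(slice.Offset);
    }
}

Each slice can be passed as the array/offset/count arguments of a receive or send call (or via SocketAsyncEventArgs.SetBuffer), so all the pinning lands on a single LOH array that the GC would not compact anyway.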
Re: your specific questions:
Is there a general trade-off for large vs. small buffers?
Probably the most obvious is the usual: a buffer larger than you will ever actually need is just wasting space.
Making buffers too small forces more I/O operations, possibly forcing more thread context switches (depending on how you are doing I/O), and for sure increasing the number of program statements that have to be executed.
There are other trade-offs of course, but to go into each and every one of them would be far too broad a discussion for this forum.
How does one go about sizing the buffer? What should you use as a gauge?
I'd start with a size that seems "reasonable", and then experiment from there. Adjust the buffer size in various load testing scenarios, increasing and decreasing, to see what if any effect there is on performance.
Does my reading event happen when the buffer is full? Like whenever the socket buffer hits the threshold, that's when I can read from it?
When you read from the socket, the network layer will put as much data into your buffer as it can. If more data is available than will fit, it fills the buffer. If less data is available than will fit, the operation completes without filling the buffer, but only once at least one byte has been placed into it; the only time a read operation completes with zero bytes is when the connection is being shut down.
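To make those completion semantics concrete, here is a minimal synchronous receive loop; socket is assumed to be a connected System.Net.Sockets.Socket and HandleBytes is a placeholder for your own processing:

// Receive() returns whatever is currently available (1..buffer.Length bytes),
// and returns 0 only when the remote side has shut down its end.
byte[] buffer = new byte[8192];
int received;
while ((received = socket.Receive(buffer)) > 0)
{
    // Only the first 'received' bytes are valid; the rest of the buffer
    // still contains stale data from earlier reads.
    HandleBytes(buffer, received);   // placeholder
}

The same rule applies to the async APIs: the completion tells you how many bytes arrived, not that the buffer is full.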

SocketAsyncEventArgs vs TcpListener/TcpClient [duplicate]

Is there a valid reason not to use TcpListener, instead of SocketAsyncEventArgs, for implementing a high-performance/high-throughput TCP server?
I've already implemented such a high-performance/high-throughput TCP server using SocketAsyncEventArgs and went through all sorts of headaches handling those pinned buffers: a big pre-allocated byte array and pools of SocketAsyncEventArgs for accepting and receiving, glued together with some low-level code and some shiny smart code using TPL Dataflow and a bit of Rx. It works perfectly, almost textbook; in fact I've learnt more than 80% of this from other people's code.
However there are some problems and concerns:
Complexity: I cannot delegate any modifications to this server to another member of the team. That ties me to this kind of task and keeps me from paying enough attention to other parts of other projects.
Memory usage (pinned byte arrays): with SocketAsyncEventArgs the pools need to be pre-allocated, so to handle 100,000 concurrent connections (the worst case, even across different ports) a big pile of RAM sits there uselessly, pre-allocated, even though those conditions only occur occasionally (the server should be able to handle one or two such peaks every day).
TcpListener actually works well: I did put TcpListener to the test (with some tricks, like using AcceptTcpClient on a dedicated thread rather than the async version, then handing accepted connections to a ConcurrentQueue instead of creating Tasks in place, and the like), and with the latest version of .NET it worked very well, almost as well as SocketAsyncEventArgs: no data loss and a low memory footprint, which avoids wasting too much RAM on the server, and no pre-allocation is needed.
So why do I not see TcpListener being used anywhere, and why is everybody (including myself) using SocketAsyncEventArgs? Am I missing something?
I see no evidence that this question is about TcpListener at all. It seems you are only concerned with the code that deals with a connection that already has been accepted. Such a connection is independent of the listener.
SocketAsyncEventArgs is a CPU-load optimization. I'm convinced you can achieve a higher rate of operations per second with it. How significant is the difference to normal APM/TAP async IO? Certainly less than an order of magnitude. Probably between 1.2x and 3x. Last time I benchmarked loopback TCP transaction rate I found that the kernel took about half of the CPU usage. That means your app can get at most 2x faster by being infinitely optimized.
Remember that SocketAsyncEventArgs was added to the BCL back in the .NET 3.5 days (around 2007), when CPUs were far less capable than today.
Use SocketAsyncEventArgs only when you have evidence that you need it. It causes you to be far less productive. More potential for bugs.
Here's the template that your socket processing loop should look like:
while (ConnectionEstablished()) {
    var someData = await ReadFromSocketAsync(socket);
    await ProcessDataAsync(someData);
}
Very simple code. No callbacks thanks to await.
In case you are concerned about managed heap fragmentation: Allocate a new byte[1024 * 1024] on startup. When you want to read from a socket read a single byte into some free portion of this buffer. When that single-byte read completes you ask how many bytes are actually there (Socket.Available) and synchronously pull the rest. That way you only pin a single rather small buffer and still can use async IO to wait for data to arrive.
This technique does not require polling. Since Socket.Available can only increase until you read from the socket, we do not risk accidentally performing a read that is too small.
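A rough sketch of that trick, assuming a connected Socket, the long-lived byte[1024 * 1024] slab, and a caller-chosen free offset into it (all names below are illustrative):

using System.Net.Sockets;
using System.Threading.Tasks;

static class PinnedSlabReader
{
    // Await a single byte into the pinned slab, then drain the rest synchronously.
    public static async Task<byte[]> ReadChunkAsync(Socket socket, byte[] slab, int freeOffset)
    {
        // Only this 1-byte segment of the slab is pinned while we wait.
        int first = await Task<int>.Factory.FromAsync(
            socket.BeginReceive(slab, freeOffset, 1, SocketFlags.None, null, null),
            socket.EndReceive);
        if (first == 0)
            return null;                                   // connection closed

        // Whatever the OS has already buffered can now be pulled without blocking.
        int available = socket.Available;
        byte[] result = new byte[1 + available];
        result[0] = slab[freeOffset];
        int read = 0;
        while (read < available)
            read += socket.Receive(result, 1 + read, available - read, SocketFlags.None);
        return result;
    }
}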
Alternatively, you can combat managed heap fragmentation by allocating a few very big buffers and handing out chunks.
Or, if you don't find this to be a problem, you don't need to do anything.

Memory-efficient IList<T> implementation

I need a collection type for received bytes in my socket application (which deals with ~5k concurrent connections).
I tried using a List<byte> but since it has one internal array and I receive lots of data, it can cause OutOfMemoryExceptions.
So I need a collection that,
Keeps the data in smaller blocks; like an Unrolled Linked List.
Provides fast lookup (Preferably an IList<T>) because I look for a delimiter that marks the end of the message after each receive operation.
What I use right now is Stream. I supply a MemoryStream for the operations that don't involve too much data and supply a FileStream of a temporary file for the operations that involve serious amounts of data.
MemoryStream is no different from a List<T> in this respect, though, and I prefer not to use files as buffers.
So...
What collection or approach do you recommend?
It appears that you are using an inappropriate architecture for this kind of network application. You should buffer only the data that is actually required; here you are buffering everything in a list until the full amount of data has been received.
I would recommend checking for the delimiter in each chunk of data as it arrives, and only appending the data up to the delimiter. Once a complete message is ready, fetch it out of the list, use it, and dispose of the list. Accumulating everything in one list is not a good approach and will certainly consume a lot of memory.
Ideally, you should have a protocol that tells you, before the payload actually arrives, how much data you are going to receive. That way you can be sure the required data has been received without relying on a delimiter.
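For example, with a simple 4-byte length prefix (a hypothetical framing; your protocol may differ) the receive side can allocate exactly what each message needs and never accumulate an unbounded list:

using System;
using System.Net.Sockets;

static class LengthPrefixedReader
{
    // Reads one [4-byte length prefix][payload] frame from the stream.
    public static byte[] ReadMessage(NetworkStream stream)
    {
        byte[] header = ReadExactly(stream, 4);
        if (header == null) return null;                 // connection closed
        int length = BitConverter.ToInt32(header, 0);
        return ReadExactly(stream, length);
    }

    // Loops until 'count' bytes have been read or the peer closes.
    private static byte[] ReadExactly(NetworkStream stream, int count)
    {
        byte[] buffer = new byte[count];
        int offset = 0;
        while (offset < count)
        {
            int read = stream.Read(buffer, offset, count - offset);
            if (read == 0) return null;                   // closed mid-message
            offset += read;
        }
        return buffer;
    }
}

Note that BitConverter uses the machine's byte order, so a real protocol would also pin down endianness.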
A possible quick and dirty solution:
At the start of the program, allocate a buffer large enough for the largest amount of data you will receive. Use a separate 'count' field to keep track of how much data is currently in use.
(I don't really like this solution; I'd use files or find some way of working with the data in blocks, but it might work for you).

How to stop asynchronous tcp .NET code from using up an entire system's resources

In some asynchronous TCP server code I have, occasionally an error occurs that causes the process to consume the entire system's memory. Looking at the logs, the Event Viewer and some MS docs, the problem happens if "the calling application makes Asynchronous IO calls to the same client multiple times then you might see a heap fragmentation and private byte increase if the remote client stops its end of I/O", which results in spikes in memory usage and pinning of System.Threading.OverlappedData structs and byte arrays.
The KB article's proposed solution is to "set an upper bound on the amount of buffers outstanding (either send or receive) with their asynchronous IO."
How does one do this? Is this referring to the byte[]s that are passed into BeginRead? So is the solution simply wrapping access to those byte[]s with a semaphore?
EDIT: Semaphore-controlled access to the byte buffers, or just having a statically sized pool of byte buffers, are two common solutions. A concern that still remains is that when this async client problem occurs (maybe it's actually some weird network event), having semaphores or byte buffer pools will keep me from running out of memory, but it does not solve the problem: my pool of buffers will likely get gobbled up by the problem client(s), in effect locking legitimate clients out.
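As an illustration of the bounded approach (the limit, the names, and the BeginReceive wrapper below are assumptions; a per-client cap on top of the global one would be needed to keep a misbehaving client from consuming every slot):

using System;
using System.Net.Sockets;
using System.Threading;

// Sketch: cap how many receives (and therefore pinned buffers) may be
// outstanding at any one time. The limit of 1000 is illustrative.
class BoundedReceiver
{
    private static readonly SemaphoreSlim OutstandingReads = new SemaphoreSlim(1000);

    public void StartReceive(Socket socket, byte[] buffer)
    {
        // Blocks (or use WaitAsync) when too many reads are in flight, so the
        // growth of pinned OverlappedData/byte[] objects stays bounded.
        OutstandingReads.Wait();
        try
        {
            socket.BeginReceive(buffer, 0, buffer.Length, SocketFlags.None, OnReceive,
                                Tuple.Create(socket, buffer));
        }
        catch
        {
            OutstandingReads.Release();
            throw;
        }
    }

    private void OnReceive(IAsyncResult ar)
    {
        OutstandingReads.Release();            // the buffer is no longer pinned
        var state = (Tuple<Socket, byte[]>)ar.AsyncState;
        int received = state.Item1.EndReceive(ar);
        // ... process state.Item2[0..received), then call StartReceive again ...
    }
}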
EDIT 2: Came across this great answer. Basically it shows how to manually unpin objects. And while asynchronous TCP code leaves pinning up to behind-the-scenes runtime rules, it might be possible to override that by explicitly pinning each buffer before use, then unpinning it at the end of the block or in a finally. I'm trying to figure that out now...
One way of addressing the problem is by pre-allocating buffers and other data structures used in async communications. If you preallocate on startup, there will be no fragmentation, since the memory will naturally reside in the same area of the heap.
I recommend using the ReceiveAsync/SendAsync APIs added in .NET 3.5 SP1, which allow you to cache or pre-allocate both the SocketAsyncEventArgs structure and the memory buffer stored in the SocketAsyncEventArgs.Buffer property, unlike the older BeginXXX/EndXXX APIs, which only allow caching or pre-allocating the memory buffer.
Using the old API also incurred significant CPU costs, because the API internally created Windows Overlapped I/O structures again and again. In the new API this takes place within SocketAsyncEventArgs, and so by pooling these objects, the CPU cost is paid only once.
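A bare-bones version of that pooling pattern (pool shape and buffer size are assumptions) might look like the sketch below; each SocketAsyncEventArgs keeps its buffer for its whole lifetime, so nothing new is allocated or pinned per operation:

using System.Collections.Concurrent;
using System.Net.Sockets;

// Sketch of a SocketAsyncEventArgs pool: each args object is created once,
// given a permanent buffer, and reused for many ReceiveAsync/SendAsync calls.
class SocketArgsPool
{
    private readonly ConcurrentStack<SocketAsyncEventArgs> _pool =
        new ConcurrentStack<SocketAsyncEventArgs>();

    public SocketArgsPool(int count, int bufferSize)
    {
        for (int i = 0; i < count; i++)
        {
            var args = new SocketAsyncEventArgs();
            args.SetBuffer(new byte[bufferSize], 0, bufferSize);  // allocated once, up front
            _pool.Push(args);
        }
    }

    public SocketAsyncEventArgs Rent()
    {
        SocketAsyncEventArgs args;
        return _pool.TryPop(out args) ? args : null;  // null => pool exhausted, caller must back off
    }

    public void Return(SocketAsyncEventArgs args)
    {
        args.AcceptSocket = null;                      // drop per-operation state before reuse
        _pool.Push(args);
    }
}

Rent() returning null when the pool is empty is itself a crude upper bound on outstanding operations, which ties back to the KB advice quoted in the question.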
Regarding your update on pinning: pinning is there for a reason, namely to prevent the GC from moving the buffer while native I/O still references it. By unpinning manually, you may cause memory corruption.

Fixing gaps in streamed USB data

We have a hardware system with some FPGAs and an FTDI USB controller. The hardware streams data over USB bulk transfer to the PC at around 5MB/s, and the software is tasked with staying in sync, checking the CRC and writing the data to file.
The FTDI chip has a 'busy' pin which goes high while it's waiting for the PC to do its business. There is a limited amount of buffering in the FTDI and elsewhere on the hardware.
The busy line is going high for longer than the hardware can buffer (50-100ms) so we are losing data. To save us from having to re-design the hardware I have been asked to 'fix' this issue!
I think my code is quick enough as we've had it running up to 15MB/s, so that leaves an IO bottleneck somewhere. Are we just expecting too much from the PC/OS?
Here is my data entry point. Occasionally we get a dropped bit or byte. If the checksum doesn't compute, I shift through until it does. byte[] data is nearly always 4k.
void ftdi_OnData(byte[] data)
{
    List<byte> buffer = new List<byte>(data.Length);
    int index = 0;
    while ((index + rawFile.Header.PacketLength + 1) < data.Length)
    {
        if (CheckSum.CRC16(data, index, rawFile.Header.PacketLength + 2)) // <- packet length + 2 for 16-bit checksum
        {
            buffer.AddRange(data.SubArray<byte>(index, rawFile.Header.PacketLength));
            index += rawFile.Header.PacketLength + 2; // <- skip the two checksum bytes, we don't want to save them...
        }
        else
        {
            index++; // shift through until the CRC matches again
        }
    }
    rawFile.AddData(buffer.ToArray(), 0, buffer.Count);
}
Tip: do not write to a file... queue it instead.
Modern computers have multiple processors. If you want certain things as fast as possible, use multiple processors.
Have one thread deal with the USB data, check checksums etc. It queues (ONLY) the results to a thread-safe queue.
Another thread reads data from the queue and writes it to a file, possibly buffered.
Finished ;)
100ms is a lot of time for decent operations. I have successfully managed around 250,000 IO data packets per second (financial data) using C# without breaking a sweat.
Basically, make sure your IO threads do ONLY that, and use your internal memory as a buffer. Especially when dealing with hardware on one end, the thread doing that should do ONLY that, possibly running at high priority if needed.
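A minimal shape for that split, using BlockingCollection<T> as the thread-safe queue (the bounded capacity, file handling, and packet type are placeholders):

using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// Sketch: the USB/checksum thread only enqueues validated packets; one
// dedicated writer thread drains the queue and does the slow file I/O.
class PacketWriterPipeline
{
    private readonly BlockingCollection<byte[]> _queue =
        new BlockingCollection<byte[]>(10000);   // bounded so memory can't run away

    public void Start(string path)
    {
        Task.Factory.StartNew(() =>
        {
            using (var file = new FileStream(path, FileMode.Create, FileAccess.Write,
                                             FileShare.None, 1 << 20))   // 1 MB write buffer
            {
                foreach (byte[] packet in _queue.GetConsumingEnumerable())
                    file.Write(packet, 0, packet.Length);
            }
        }, TaskCreationOptions.LongRunning);
    }

    // Called from the receive/checksum thread; never touches the disk.
    public void Enqueue(byte[] validatedPacket)
    {
        _queue.Add(validatedPacket);
    }

    public void Complete()
    {
        _queue.CompleteAdding();   // lets the writer loop drain and exit
    }
}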
To get good read throughput on Windows on USB, you generally need to have multiple asynchronous reads (or very large reads, which is often less convenient) queued onto the USB device stack. I'm not quite sure what the FTDI drivers / libraries do internally in this regard.
Traditionally I have written mechanisms with an array of OVERLAPPED structures and an array of buffers, and kept shovelling them into ReadFile as soon as they're free. I was doing 40+MB/s reads on USB2 like this about 5-6 years ago, so modern PCs should certainly be able to cope.
It's very important that you (or your drivers/libraries) don't get into a "start a read, finish a read, deal with the data, start another read" cycle, because you'll find that the bus is idle for vast swathes of time. A USB analyser would show you if this was happening.
I agree with the others that you should get off the thread that the read is happening on as soon as possible - don't block the FTDI event handler for any longer than it takes to put the buffer into another queue.
I'd preallocate a circular queue of buffers, pick the next free one and throw the received data into it, then complete the event handling as quickly as possible.
All that checksumming and concatenation, with its attendant memory allocation, garbage collection, etc., can be done on the other side of potentially hundreds of MB of buffer time/space on the PC. At the moment you may well be effectively asking your FPGA/hardware buffer to accommodate the time taken for you to do all sorts of ponderous PC stuff which can be done much later.
I'm optimistic though - if you can really buffer 100ms of data on the hardware, you should be able to get this working reliably. I wish I could persuade all my clients to allow so much...
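To make the hand-off concrete, here is a sketch of that circular pool of pre-allocated buffers (buffer count, buffer size, and the pending queue are assumptions; each buffer must be at least as large as the ~4k callbacks):

using System;
using System.Collections.Concurrent;

// Sketch: a fixed ring of pre-allocated buffers. The FTDI callback copies the
// incoming bytes into a free buffer and hands it to the processing thread;
// no allocation or checksumming happens on the USB callback itself.
class CapturePool
{
    private readonly ConcurrentQueue<byte[]> _free = new ConcurrentQueue<byte[]>();
    public readonly BlockingCollection<ArraySegment<byte>> Pending =
        new BlockingCollection<ArraySegment<byte>>();

    public CapturePool(int bufferCount, int bufferSize)
    {
        for (int i = 0; i < bufferCount; i++)
            _free.Enqueue(new byte[bufferSize]);
    }

    // Runs on the FTDI event thread: copy and get out fast.
    public void OnData(byte[] data)
    {
        byte[] slot;
        if (!_free.TryDequeue(out slot))
            throw new InvalidOperationException("Capture pool exhausted");  // size it for the worst case
        Buffer.BlockCopy(data, 0, slot, 0, data.Length);
        Pending.Add(new ArraySegment<byte>(slot, 0, data.Length));
    }

    // Called by the processing thread once a buffer has been consumed.
    public void Return(byte[] slot)
    {
        _free.Enqueue(slot);
    }
}

The processing thread does the CRC checking, concatenation and Return() of each buffer, so the only work on the FTDI callback is one Buffer.BlockCopy and one queue Add.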
So what does your receiving code look like? Do you have a thread running at high priority responsible solely for capturing the data and passing it in memory to another thread in a non-blocking fashion? Do you run the process itself at an elevated priority?
Have you designed the rest of your code to avoid the more expensive gen-2 garbage collections? How large are your buffers, are they on the large object heap? Do you reuse them efficiently?
