Fixing gaps in streamed USB data - C#

We have a hardware system with some FPGAs and an FTDI USB controller. The hardware streams data over USB bulk transfer to the PC at around 5 MB/s, and the software is tasked with staying in sync, checking the CRC and writing the data to file.
The FTDI chip has a 'busy' pin which goes high while it's waiting for the PC to do its business. There is a limited amount of buffering in the FTDI and elsewhere on the hardware.
The busy line is going high for longer than the hardware can buffer (50-100ms), so we are losing data. To save us from having to re-design the hardware, I have been asked to 'fix' this issue!
I think my code is quick enough as we've had it running up to 15MB/s, so that leaves an IO bottleneck somewhere. Are we just expecting too much from the PC/OS?
Here is my data entry point. Occasionally we get a dropped bit or byte. If the checksum doesn't compute, I shift through until it does. byte[] data is nearly always 4k.
void ftdi_OnData(byte[] data)
{
    List<byte> buffer = new List<byte>(data.Length);
    int index = 0;
    while ((index + rawFile.Header.PacketLength + 1) < data.Length)
    {
        if (CheckSum.CRC16(data, index, rawFile.Header.PacketLength + 2)) // <- packet length + 2 for the 16-bit checksum
        {
            buffer.AddRange(data.SubArray<byte>(index, rawFile.Header.PacketLength));
            index += rawFile.Header.PacketLength + 2; // <- skip the two checksum bytes, we don't want to save them...
        }
        else
        {
            index++; // shift through until the CRC matches again
        }
    }
    rawFile.AddData(buffer.ToArray(), 0, buffer.Count);
}

Tip: do not write to the file directly from the data callback... queue.
Modern computers have multiple processors. If you want certain things to be as fast as possible, use multiple processors.
Have one thread deal with the USB data, check checksums etc. It ONLY queues the results into a thread-safe queue.
Another thread reads data from the queue and writes it to the file, possibly buffered.
Finished ;)
100ms is a lot of time for decent operations. I have successfully managed around 250,000 IO data packets per second (financial data) in C# without breaking a sweat.
Basically, make sure your IO threads do ONLY that, and use your main memory as the buffer. Especially when dealing with hardware on one end, the thread doing that should ONLY do that, POSSIBLY running at high priority if needed. A sketch of this producer/consumer split follows.
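A minimal sketch of that split, assuming the same FTDI event entry point as the question (the class wiring and ProcessAndWrite are illustrative, not the FTDI API):

using System.Collections.Concurrent;
using System.Threading;

class UsbCapture
{
    // The USB event thread only enqueues; a dedicated thread does the slow work.
    private readonly BlockingCollection<byte[]> writeQueue = new BlockingCollection<byte[]>();

    public void Start()
    {
        var writer = new Thread(WriterLoop) { IsBackground = true, Priority = ThreadPriority.AboveNormal };
        writer.Start();
    }

    // Producer: called from the FTDI event; does nothing but hand the data off.
    void ftdi_OnData(byte[] data)
    {
        writeQueue.Add(data);
    }

    // Consumer: CRC checking and file writing happen here, off the USB thread.
    void WriterLoop()
    {
        foreach (byte[] chunk in writeQueue.GetConsumingEnumerable())
        {
            ProcessAndWrite(chunk); // stand-in for the CRC check + rawFile.AddData logic
        }
    }

    void ProcessAndWrite(byte[] chunk) { /* CRC + write to file */ }
}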

To get good read throughput on Windows on USB, you generally need to have multiple asynchronous reads (or very large reads, which is often less convenient) queued onto the USB device stack. I'm not quite sure what the FTDI drivers / libraries do internally in this regard.
Traditionally I have written mechanisms with an array of OVERLAPPED structures and an array of buffers, and kept shovelling them into ReadFile as soon as they're free. I was doing 40+MB/s reads on USB2 like this about 5-6 years ago, so modern PCs should certainly be able to cope.
It's very important that you (or your drivers/libraries) don't get into a "start a read, finish a read, deal with the data, start another read" cycle, because you'll find that the bus is idle for vast swathes of time. A USB analyser would show you if this was happening.
I agree with the others that you should get off the thread that the read is happening on as soon as possible - don't block the FTDI event handler for any longer than it takes to put the buffer into another queue.
I'd preallocate a circular queue of buffers, pick the next free one and throw the received data into it, then complete the event handling as quickly as possible.
All that checksumming and concatenation with its attendant memory allocation, garbage collection, etc, can be done the other side of potentially 100s of MB of buffer time/space on the PC. At the moment you may well be effectively asking your FPGA/hardware buffer to accommodate the time taken for you to do all sorts of ponderous PC stuff which can be done much later.
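A minimal sketch of that preallocated ring of buffers (all sizes and names are illustrative; 4 KB matches the "nearly always 4k" chunks from the question):

using System;
using System.Collections.Concurrent;

const int BufferCount = 256;  // ~1 MB of standby space; size to your worst-case stall
const int BufferSize  = 4096; // assumes data.Length <= 4 KB, per the question

var freeBuffers   = new ConcurrentQueue<byte[]>();
var filledBuffers = new BlockingCollection<ArraySegment<byte>>();

for (int i = 0; i < BufferCount; i++)
    freeBuffers.Enqueue(new byte[BufferSize]);

void ftdi_OnData(byte[] data)
{
    if (!freeBuffers.TryDequeue(out byte[] buf))
        buf = new byte[BufferSize]; // pool exhausted: allocate rather than lose data

    Array.Copy(data, 0, buf, 0, data.Length);
    filledBuffers.Add(new ArraySegment<byte>(buf, 0, data.Length));
    // A consumer thread drains filledBuffers (checksumming, file writes)
    // and returns each buffer to freeBuffers when it is done.
}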
I'm optimistic though - if you can really buffer 100ms of data on the hardware, you should be able to get this working reliably. I wish I could persuade all my clients to allow so much...

So what does your receiving code look like? Do you have a thread running at high priority responsible solely for capturing the data and passing it in memory to another thread in a non-blocking fashion? Do you run the process itself at an elevated priority?
Have you designed the rest of your code to avoid the more expensive gen-2 garbage collections? How large are your buffers, and are they on the large object heap? Do you reuse them efficiently?

Related

TCP Socket ReceiveAsync sometimes takes too long

I have to work with a device which is controlled over a TCP connection. It sends 1 byte of data every 30 milliseconds and I must react to it as soon as possible. Usually everything works fine, but sometimes Socket.ReceiveAsync() gets stuck for 400-800 milliseconds and then returns some number of received bytes.
I use code like this:
_socket.ReceiveTimeout = 5;
var sw = Stopwatch.StartNew();
var len = await _socket.ReceiveAsync(new ArraySegment<byte>(Buffer, Offset, Count),
                                     SocketFlags.None)
                       .ConfigureAwait(false);
_logger.Info($"Reading took {sw.ElapsedMilliseconds}ms"); // usually 0-6ms, but sometimes up to 800ms
I recorded this process with Wireshark, where I could see that all the data was received on time, at roughly 30ms intervals.
I also noticed that the probability of this delay happening is higher when you do something on the computer, like opening the Start menu or Explorer. I would expect switching to another process, or a garbage collection, to be much faster than that.
It sends 1 byte of data every 30 milliseconds and I must react to it as soon as possible.
This is extremely difficult to do on Windows, which is not a realtime operating system. Really, all you can do is your best, understanding that an unexpected antivirus scan may throw it off.
I was once sent to a customer site to investigate TCP/IP communications timeouts that only happened at night. Plane flight, hotel stay, the whole shebang. Turns out the night shift was playing DOOM. True story.
So, you can't really guarantee this kind of app will always work; you just have to do your best.
First, I recommend keeping a continuous read going. Always Be Reading. Literally, as soon as a read completes, shove that byte into a producer/consumer queue and start reading again ASAP. Don't bother with a timeout; just keep reading.
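A minimal sketch of that pattern, using System.Threading.Channels as the producer/consumer queue (the wiring is illustrative):

using System;
using System.Net.Sockets;
using System.Threading.Channels;
using System.Threading.Tasks;

var received = Channel.CreateUnbounded<byte>();

async Task ReadLoopAsync(Socket socket)
{
    var buf = new byte[1]; // the device sends single bytes
    while (true)
    {
        int n = await socket.ReceiveAsync(buf.AsMemory(), SocketFlags.None)
                            .ConfigureAwait(false);
        if (n == 0) break;                        // remote side closed
        await received.Writer.WriteAsync(buf[0]); // hand off, then read again
    }
    received.Writer.Complete();
}

// Consumer side reacts to each byte without ever stalling the read loop:
// await foreach (byte b in received.Reader.ReadAllAsync()) HandleByte(b);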
The second thing you can do is reduce overhead. In this case, you're calling a Socket API that returns Task<T>; it would be better to call one that returns ValueTask<T>:
var len = await _socket.ReceiveAsync(
        new ArraySegment<byte>(Buffer, Offset, Count).AsMemory(),
        SocketFlags.None)
    .ConfigureAwait(false);
See this blog post for more about how ValueTask<T> reduces memory usage, particularly with sockets.
Alternatively, you can make your read loop run synchronously on a separate thread, as mentioned in the comments. One nice aspect to the separate thread is that you can boost its priority - even into the "realtime" range. But There Be Dragons - you really do not want to do that unless you literally have no choice. If there's any bug on that thread, you can deadlock the entire OS.
Once you make your read loop as tight as possible (never waiting for any kind of processing), and reduce your memory allocations (preventing unnecessary GCs), that's about all you can do. Someday, somebody's gonna fire up DOOM, and your app will just have to try its best.

socket buffer size: pros and cons of bigger vs smaller

I've never really worked with COM sockets before, and now have some code that is listening to a rather steady stream of data (500Kb/s - 2000Kb/s).
I've experimented with different sizes, but am not really sure what I'm conceptually doing.
byte[] m_SocketBuffer = new byte[4048];
//vs
byte[] m_SocketBuffer = new byte[8096];
The socket I'm using is System.Net.Sockets.Socket, and this is my constructor:
new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp)
My questions are:
Is there a general trade-off for large vs. small buffers?
How does one go about sizing the buffer? What should you use as a gauge?
I'm retrieving the data like this:
string socketData = Encoding.ASCII.GetString(m_SocketBuffer, 0, iReceivedBytes);
while (socketData.Length > 0)
{ //do stuff }
Does my reading event happen when the buffer is full? Like whenever the socket buffer hits the threshold, that's when I can read from it?
Short version:
Optimal buffer size is dependent on a lot of factors, including the underlying network transport and how your own code handles the I/O.
Tens of KB is probably about right for a high-volume server moving a lot of data. But you can use much smaller buffers if you know that remote endpoints won't ever send you a lot of data all at once.
Buffers are pinned for the duration of the I/O operation, which can cause or exacerbate heap fragmentation. For a really high-volume server, it might make sense to allocate very large buffers (larger than 85,000 bytes) so that the buffer is allocated from the large-object heap (which either has no fragmentation issues, or is in a perpetual state of fragmentation, depending on how you look at it :) ), and then use just a portion of each large buffer for any given I/O operation.
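For example, a hedged sketch of that large-buffer approach (sizes and names are illustrative):

using System;
using System.Collections.Generic;

// A single 1 MB array lands on the large-object heap (the threshold is
// 85,000 bytes), so pinning pieces of it for I/O cannot fragment the
// ordinary generational heaps.
byte[] bigBuffer = new byte[1024 * 1024];
const int SegmentSize = 8192;

var segments = new Queue<ArraySegment<byte>>();
for (int offset = 0; offset + SegmentSize <= bigBuffer.Length; offset += SegmentSize)
    segments.Enqueue(new ArraySegment<byte>(bigBuffer, offset, SegmentSize));

// Each I/O operation borrows one segment and returns it afterwards, e.g.:
// var seg = segments.Dequeue();
// socket.BeginReceive(seg.Array, seg.Offset, seg.Count, SocketFlags.None, callback, state);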
Re: your specific questions:
Is there a general trade-off for large vs. small buffers?
Probably the most obvious is the usual: a buffer larger than you will ever actually need is just wasting space.
Making buffers too small forces more I/O operations, possibly forcing more thread context switches (depending on how you are doing I/O), and certainly increasing the number of program statements that have to be executed.
There are other trade-offs of course, but to go into each and every one of them would be far too broad a discussion for this forum.
How does one go about sizing the buffer? What should you use as a gauge?
I'd start with a size that seems "reasonable", and then experiment from there. Adjust the buffer size in various load testing scenarios, increasing and decreasing, to see what if any effect there is on performance.
Does my reading event happen when the buffer is full? Like whenever the socket buffer hits the threshold, that's when I can read from it?
When you read from the socket, the network layer will put as much data into your buffer as it can. If more data is available than will fit, it fills the buffer. If less data is available than will fit, the operation completes without filling the buffer, but only once at least one byte has been placed into it: the only time a read operation completes with zero bytes is when the connection is being shut down.
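In code, that means a receive loop must treat the returned byte count as the only source of truth; a sketch (Process stands in for your own handling):

using System.Net.Sockets;
using System.Text;

byte[] m_SocketBuffer = new byte[8192];
int received;
while ((received = socket.Receive(m_SocketBuffer)) > 0) // 0 => remote side shut down
{
    // Only the first `received` bytes are valid; the buffer is rarely full.
    string socketData = Encoding.ASCII.GetString(m_SocketBuffer, 0, received);
    Process(socketData);
}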

SocketAsyncEventArgs vs TcpListener/TcpClient [duplicate]

Is there a valid reason to not use TcpListener for implementing a high performance/high throughput TCP server instead of SocketAsyncEventArgs?
I've already implemented this high performance/high throughput TCP server using SocketAsyncEventArgs and went through all sorts of headaches handling those pinned buffers, using a big pre-allocated byte array and pools of SocketAsyncEventArgs for accepting and receiving, putting it together with some low-level stuff and shiny smart code, some TPL Dataflow and some Rx, and it works perfectly; almost textbook in this endeavour - actually, I've learnt more than 80% of this stuff from other people's code.
However there are some problems and concerns:
Complexity: I cannot delegate any sort of modification of this server to another member of the team. That binds me to this sort of task, and I cannot pay enough attention to other parts of other projects.
Memory usage (pinned byte arrays): Using SocketAsyncEventArgs, the pools need to be pre-allocated. So for handling 100,000 concurrent connections (the worst case, even on different ports), a big pile of RAM uselessly hovers there, pre-allocated (even if those conditions are only met occasionally, the server should be able to handle 1 or 2 such peaks every day).
TcpListener actually works well: I actually put TcpListener to the test (with some tricks, like using AcceptTcpClient on a dedicated thread rather than the async version, then sending the accepted connections to a ConcurrentQueue instead of creating Tasks in-place, and the like), and with the latest version of .NET it worked very well - almost as well as SocketAsyncEventArgs, with no data loss and a low memory footprint that helps with not wasting too much RAM on the server, and no pre-allocation is needed.
So why do I not see TcpListener being used anywhere, and why is everybody (including myself) using SocketAsyncEventArgs? Am I missing something?
I see no evidence that this question is about TcpListener at all. It seems you are only concerned with the code that deals with a connection that already has been accepted. Such a connection is independent of the listener.
SocketAsyncEventArgs is a CPU-load optimization. I'm convinced you can achieve a higher rate of operations per second with it. How significant is the difference to normal APM/TAP async IO? Certainly less than an order of magnitude. Probably between 1.2x and 3x. Last time I benchmarked loopback TCP transaction rate I found that the kernel took about half of the CPU usage. That means your app can get at most 2x faster by being infinitely optimized.
Remember that SocketAsyncEventArgs was added to the BCL with .NET 3.5, around 2007, when CPUs were far less capable.
Use SocketAsyncEventArgs only when you have evidence that you need it. It causes you to be far less productive. More potential for bugs.
Here's the template that your socket processing loop should look like:
while (ConnectionEstablished())
{
    var someData = await ReadFromSocketAsync(socket);
    await ProcessDataAsync(someData);
}
Very simple code. No callbacks thanks to await.
In case you are concerned about managed heap fragmentation: Allocate a new byte[1024 * 1024] on startup. When you want to read from a socket read a single byte into some free portion of this buffer. When that single-byte read completes you ask how many bytes are actually there (Socket.Available) and synchronously pull the rest. That way you only pin a single rather small buffer and still can use async IO to wait for data to arrive.
This technique does not require polling. Since Socket.Available can only increase until you read from the socket, we do not risk accidentally performing a read that is too small.
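A sketch of that single-byte wake-up read (assuming a connected Socket; the offset bookkeeping is illustrative):

using System;
using System.Net.Sockets;
using System.Threading.Tasks;

byte[] bigBuffer = new byte[1024 * 1024]; // allocated once, at startup

async Task<int> ReadChunkAsync(Socket socket, int offset)
{
    // Await the first byte; only this 1-byte slice of the big buffer is
    // pinned for the duration of the wait.
    int n = await socket.ReceiveAsync(
        new ArraySegment<byte>(bigBuffer, offset, 1), SocketFlags.None);
    if (n == 0) return 0; // connection closed

    // Socket.Available can only grow until we read, so this never under-reads.
    int extra = Math.Min(socket.Available, bigBuffer.Length - offset - 1);
    if (extra > 0)
        extra = socket.Receive(bigBuffer, offset + 1, extra, SocketFlags.None);
    return 1 + extra; // bytes now sitting at bigBuffer[offset..offset + 1 + extra]
}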
Alternatively, you can combat managed heap fragmentation by allocating few very big buffers and handing out chunks.
Or, if you don't find this to be a problem you don't need to do anything.

What is the most efficient and reliable way to operate a critical streaming buffer in C#?

I am developing an application in C# which handles a large stream of incoming and outgoing data through a queue-like buffer. The buffer needs to be some sort of file on disk. Data will be written to the buffer very often (I'm talking once every 10 ms!). Each write will make 1 record/line in the buffer.
The same program will also read the buffer (line by line) and process the buffered data. After one line/record has been processed, it must immediately be deleted from the buffer file to prevent it from being reprocessed in the event of a system reboot. This read-and-delete will also happen at a very fast rate (every 10 ms or so).
So it's a system which writes, reads, and purges what has been read. It gets even harder, as this buffer may grow up to 5 GB (GIGABYTES) in size if the program decides not to process the buffered data.
So my question is: what kind of method should I use to handle this buffering mechanism? I had a look at using SQLite and a simple text file, but they may be inefficient at handling large sizes, or maybe not so good at handling concurrent inserts, reads and deletes.
Anyway, I'd really appreciate any advice. Thank you in advance for any answers!
You sound like you're describing Message Queues
There's MSMQ, ZeroMQ, RabbitMQ, and a couple of others.
Here's a bunch of links on MSMQ:
http://www.techrepublic.com/article/use-microsoft-message-queuing-in-c-for-inter-process-communication/6170794
http://support.microsoft.com/KB/815811
http://www.csharphelp.com/2007/06/msmq-your-reliable-asynchronous-message-processing/
http://msdn.microsoft.com/en-us/library/ms973816.aspx
http://www.c-sharpcorner.com/UploadFile/rajkpt/101262007012217AM/1.aspx
Here's ZeroMQ (or 0MQ)
And here's RabbitMQ
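As a rough illustration of the MSMQ option (a sketch only; the queue path is hypothetical), note that Receive removes the message and recoverable messages are persisted to disk, which is exactly the write/read/purge-with-reboot-safety cycle described above:

using System;
using System.Messaging; // .NET Framework; reference System.Messaging.dll

const string path = @".\Private$\bufferQueue"; // hypothetical local private queue
if (!MessageQueue.Exists(path))
    MessageQueue.Create(path);

using (var queue = new MessageQueue(path))
{
    queue.Formatter = new XmlMessageFormatter(new[] { typeof(string) });

    // Producer side: Recoverable messages survive a reboot.
    queue.Send(new Message("one record") { Recoverable = true });

    // Consumer side: Receive removes the message from the queue,
    // i.e. the read-then-delete semantics the question asks for.
    Message m = queue.Receive(TimeSpan.FromSeconds(1));
    string record = (string)m.Body;
}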

Highest Performance for Cross AppDomain Signaling

My performance sensitive application uses MemoryMappedFiles for pushing bulk data between many AppDomains. I need the fastest mechanism to signal a receiving AD that there is new data to be read.
The design looks like this:
AD 1: Writer to MMF, when data is written it should notify the reader ADs
AD 2,3,N..: Reader of MMF
The readers do not need to know how much data is written, because each message written will start with a non-zero int and they will read until zero; don't worry about partially written messages.
(I think) Traditionally, within a single AD, Monitor.Wait/Pulse could be used for this; I do not think it works across AppDomains.
A MarshalByRefObject remoting method or event could also be used, but I would like something faster. (I benchmarked 1,000,000 MarshalByRefObject calls/sec on my machine; not bad, but I want more.)
A named EventWaitHandle is about twice as fast from initial measurements.
Is there anything faster?
Note: The receiving ADs do not need to get every signal as long as the last signal is not dropped.
A thread context switch costs between 2000 and 10,000 machine cycles on Windows. If you want more than a million per second then you are going to have to solve the Great Silicon Speed Bottleneck. You are already on the very low end of the overhead.
Focus on switching less often and collecting more data in one whack. Nothing needs to switch at a microsecond.
The named EventWaitHandle is the way to go for a one-way signal (for lowest latency). From my measurements it is 2x faster than a cross-appdomain method call. Method-call performance is very impressive in the latest version of the CLR to date (4), and it should make the most sense for the large majority of cases, since it's possible to pass some information in the method call (in my case, how much data to read).
If it's OK to continuously burn a thread on the receiving end, and performance is that critical, a tight loop may be faster.
I hope Microsoft continues to improve the cross-appdomain functionality, as it can really help with application reliability and plug-ins.
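A minimal sketch of the named-event signalling described above (the event name is illustrative; since an AutoResetEvent wakes only one waiter per Set, you would typically create one named event per reader AD):

using System.Threading;

// Writer side (AD 1): pulse after each write to the MMF.
var dataReady = new EventWaitHandle(false, EventResetMode.AutoReset, @"Global\MmfDataReady");
dataReady.Set(); // new data is available

// Reader side: open the same named event and wait on it.
var ready = EventWaitHandle.OpenExisting(@"Global\MmfDataReady");
while (ready.WaitOne())
{
    // Drain everything currently in the MMF. Coalesced signals are fine:
    // the post says readers may miss pulses as long as the last one isn't lost.
}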
