I have to work with a device which is controlled over a TCP connection. It sends 1 byte of data every 30 milliseconds and I must react to it as soon as possible. Usually everything works fine, but sometimes Socket.ReceiveAsync() gets stuck for 400-800 milliseconds and then returns some number of received bytes.
I use code like this:
_socket.ReceiveTimeout = 5;
var sw = Stopwatch.StartNew();
var len = await _socket.ReceiveAsync(new ArraySegment<byte>(Buffer, Offset, Count),
                                     SocketFlags.None)
                       .ConfigureAwait(false);
_logger.Info($"Reading took {sw.ElapsedMilliseconds}ms"); // usually 0-6ms, but sometimes up to 800ms
I recorded this process with Wireshark, where I could see that all the data was received on time, at roughly 30 ms intervals.
I also noticed that the probability of this delay is higher when you do something on the computer, like opening the Start menu or Explorer. I would think that switching to another process or a garbage collection should be much faster than that.
It sends 1 byte of data every 30 milliseconds and I must react to it as soon as possible.
This is extremely difficult to do on Windows, which is not a realtime operating system. Really, all you can do is your best, understanding that an unexpected antivirus scan may throw it off.
I was once sent to a customer site to investigate TCP/IP communications timeouts that only happened at night. Plane flight, hotel stay, the whole shebang. Turns out the night shift was playing DOOM. True story.
So, you can't really guarantee this kind of app will always work; you just have to do your best.
First, I recommend keeping a continuous read going. Always Be Reading. Literally, as soon as a read completes, shove that byte into a producer/consumer queue and start reading again ASAP. Don't bother with a timeout; just keep reading.
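As a rough sketch of that "Always Be Reading" pattern (assuming modern .NET with System.Threading.Channels; the buffer size, the HandleByte consumer and the channel itself are my placeholders, not taken from your code), something like this keeps a read pending at all times and never waits on processing:
// Producer/consumer sketch - not a drop-in implementation.
// using System.Net.Sockets; using System.Threading.Channels; using System.Threading.Tasks;
var bytes = Channel.CreateUnbounded<byte>(
    new UnboundedChannelOptions { SingleReader = true, SingleWriter = true });

async Task ReadLoopAsync(Socket socket)
{
    var buffer = new byte[256];
    while (true)
    {
        int len = await socket.ReceiveAsync(buffer.AsMemory(), SocketFlags.None)
                              .ConfigureAwait(false);
        if (len == 0) break;                      // remote side closed the connection
        for (int i = 0; i < len; i++)
            bytes.Writer.TryWrite(buffer[i]);     // never blocks on an unbounded channel
    }
    bytes.Writer.Complete();
}

async Task ConsumeAsync()
{
    await foreach (byte b in bytes.Reader.ReadAllAsync().ConfigureAwait(false))
        HandleByte(b);                            // HandleByte = your reaction logic (placeholder)
}
The point of the split is that the read loop only ever reads; however slow your reaction logic is, it can never delay the next Receive.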
The second thing you can do is reduce overhead. In this case, you're calling a Socket API that returns Task<T>; it would be better to call one that returns ValueTask<T>:
var len = await _socket.ReceiveAsync(Buffer.AsMemory(Offset, Count), SocketFlags.None)
                       .ConfigureAwait(false);
See this blog post for more about how ValueTask<T> reduces memory usage, particularly with sockets.
Alternatively, you can make your read loop run synchronously on a separate thread, as mentioned in the comments. One nice aspect of the separate thread is that you can boost its priority - even into the "realtime" range. But There Be Dragons - you really do not want to do that unless you literally have no choice. If there's any bug on that thread, you can deadlock the entire OS.
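For illustration, a minimal sketch of that dedicated-thread variant (reusing the channel writer and buffer-size assumptions from the sketch above); note it only raises the priority within the normal class, not into the realtime range:
// Dedicated synchronous read thread - a sketch under the same assumptions as above.
// using System.Net.Sockets; using System.Threading; using System.Threading.Channels;
void StartReadThread(Socket socket, ChannelWriter<byte> output)
{
    var thread = new Thread(() =>
    {
        var buffer = new byte[256];
        while (true)
        {
            int len = socket.Receive(buffer, SocketFlags.None);  // blocking read, no async overhead
            if (len == 0) break;
            for (int i = 0; i < len; i++)
                output.TryWrite(buffer[i]);
        }
        output.Complete();
    })
    {
        IsBackground = true,
        Priority = ThreadPriority.Highest   // stays inside the normal priority class
    };
    thread.Start();
}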
Once you make your read loop as tight as possible (never waiting for any kind of processing), and reduce your memory allocations (preventing unnecessary GCs), that's about all you can do. Someday, somebody's gonna fire up DOOM, and your app will just have to try its best.
OK, that title was perhaps vague, but allow me to explain.
I'm dealing with a large list of hundreds of messages to be sent to a CAN bus as byte arrays. Each of these messages has an Interval property detailing how often the message must be sent, in milliseconds. But I'll get back to that.
So I have a thread. The thread loops through this giant list of messages until stopped, with the body roughly like this:
Stopwatch timer = Stopwatch.StartNew();
while (!ShouldStop)
{
    foreach (Message msg in list)
    {
        if (msg.IsReadyToSend(timer)) msg.Send();
    }
}
This works great, with phenomenal accuracy in honoring the Message objects' Interval. However, it hogs an entire CPU. The problem is that, because of the massive number of messages and the nature of the CAN bus, there is generally less than half a millisecond before the thread has to send another message. There would never be a case the thread would be able to sleep for, say, more than 15 milliseconds.
What I'm trying to figure out is if there is a way to do this that allows for the thread to block or yield momentarily, allowing the processor to sleep and save some cycles. Would I get any kind of accuracy at all if I try splitting the work into a thread per message? Is there any other way of doing this that I'm not seeing?
EDIT: It may be worth mentioning that the Message's Interval property is not absolute. As long as the thread continues to spew messages, the receiver should be happy, but if the thread regularly sleeps for, say, 25 ms because of higher priority threads stealing its time-slice, it could raise red flags for the receiver.
Based on the updated requirement, there is a very good chance that the default setup with Sleep(0) could be enough - messages may be sent in small bursts, but it sounds like that is OK. Using a multimedia timer may make the bursts less noticeable. Building more tolerance into the receiver of the messages may be a better approach (if possible).
If you need hard millisecond accuracy with good guarantees, C# on Windows is not the best choice - separate hardware (even an Arduino) may be needed, or at least lower-level code than C#.
Windows is not an RT OS, so you can't really get sub-millisecond accuracy.
A busy loop (possibly on a high-priority thread), as you have, is a common approach if you need sub-millisecond accuracy.
You can try using multimedia timers (sample: Multimedia timer interrupts in C# (first two interrupts are bad)), as well as changing the default time slice to 1 ms (see Why are .NET timers limited to 15 ms resolution? for a sample/explanation).
In any case you should be aware that your code can lose its time slice if there are other, higher-priority threads to be scheduled, and then all your efforts would be lost.
Note: you should obviously consider whether a more suitable data structure would help (i.e. a heap or priority queue may work better for finding the next item to send).
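To make the "sleep only when there is slack" idea concrete, here is a sketch. The Message type with Interval and Send() comes from the question; the due-time bookkeeping, the 2 ms threshold, the shouldStop delegate and the winmm timeBeginPeriod call are my own assumptions, and a heap keyed on the due times could replace the linear scan:
// Sketch: request ~1 ms timer resolution, then sleep only when the nearest deadline allows it.
// using System; using System.Collections.Generic; using System.Diagnostics;
// using System.Runtime.InteropServices; using System.Threading;
static class MmTimer
{
    [DllImport("winmm.dll")] public static extern uint timeBeginPeriod(uint ms);
    [DllImport("winmm.dll")] public static extern uint timeEndPeriod(uint ms);
}

static void SendLoop(List<Message> list, Func<bool> shouldStop)
{
    MmTimer.timeBeginPeriod(1);                    // ask Windows for ~1 ms scheduling granularity
    try
    {
        var sw = Stopwatch.StartNew();
        var due = new long[list.Count];            // next due time per message, in ms

        while (!shouldStop())
        {
            long now = sw.ElapsedMilliseconds;
            long nextDue = long.MaxValue;

            for (int i = 0; i < list.Count; i++)   // a heap keyed on due[i] would avoid this scan
            {
                if (now >= due[i])
                {
                    list[i].Send();
                    due[i] = now + list[i].Interval;
                }
                if (due[i] < nextDue) nextDue = due[i];
            }

            long slack = nextDue - sw.ElapsedMilliseconds;
            if (slack > 2) Thread.Sleep(1);        // enough slack: actually give up the CPU
            else Thread.Sleep(0);                  // otherwise just yield the rest of the time-slice
        }
    }
    finally
    {
        MmTimer.timeEndPeriod(1);
    }
}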
As you have discovered, the most accurate way to "wait" on a CPU is to poll the RTC. However, that is computationally intensive. If you need clock-level timing accuracy, there is no other way.
However, in your original post, you said that the timing was in the order of 15ms.
On my 3.3GHz Quad Core i5 at home, 15ms x 3.3GHz = 50 Million Clock cycles (or 200 million if you count all the cores).
That is an eternity.
Loose sleep timing is most likely more than accurate enough for your purposes.
To be frank, if you need hard RT, C# on the .NET VM with the .NET GC on top of the Windows kernel is the wrong choice.
I'm developing a C#, UWP 10 solution that communicates with a network device using a fast, continual read/write loop. The StreamSocket offered by the API seemed to work great, until I realized that there was a memory leak: there is an accumulation of Task<UInt32> instances on the heap, on the order of hundreds per minute.
Whether I use a plain old while (true) loop inside an async Task, or using a self-posting ActionBlock<T> with TPL Dataflow (as per this answer), the result is the same.
I'm able to isolate the problem further if I eliminate reading from the socket and focus on writing:
Whether I use the DataWriter.StoreAsync approach or the more direct StreamSocket.OutputStream.WriteAsync(IBuffer buffer), the problem remains. Furthermore, adding the .AsTask() to these makes no difference.
Even when the garbage collector runs, these Task<UInt32> instances are never removed from the heap. All of these tasks are complete (RanToCompletion) and have no errors or any other property value that would indicate "not quite ready to be reclaimed".
There seems to be a hint to my problem on this page (a byte array going from the managed to the unmanaged world prevents release of memory), but the prescribed solution seems pretty stark: that the only way around this is to write all communications logic in C++/CX. I hope this is not true; surely other C# developers have successfully realized continual high-speed network communications without memory leaks. And surely Microsoft wouldn't release an API that only works without memory leaks in C++/CX.
EDIT
As requested, some sample code. My own code has too many layers, but a much simpler example can be observed with this Microsoft sample. I made a simple modification to send 1000 times in a loop to highlight the problem. This is the relevant code:
public sealed partial class Scenario3 : Page
{
    // some code omitted
    private async void SendHello_Click(object sender, RoutedEventArgs e)
    {
        // some code omitted
        StreamSocket socket = // get global object; socket is already connected
        DataWriter writer = new DataWriter(socket.OutputStream);
        for (int i = 0; i < 1000; i++)
        {
            string stringToSend = "Hello";
            writer.WriteUInt32(writer.MeasureString(stringToSend));
            writer.WriteString(stringToSend);
            await writer.StoreAsync();
        }
    }
}
Upon starting up the app and connecting the socket, there is only one instance of Task<UInt32> on the heap. After clicking the "SendHello" button, there are 86 instances. After pressing it a 2nd time: 129 instances.
Edit #2
After running my app (with the tight send/receive loop) for 3 hours, I can see that there definitely is a problem: 0.5 million Task instances, which never get GC'd, and the app's process memory rose from an initial 46 MB to 105 MB. Obviously this app can't run indefinitely.
However... this only applies to running in debug mode. If I compile my app in Release mode, deploy it and run it, there are no memory issues. I can leave it running all night and it is clear that memory is being managed properly.
Case closed.
there are 86 instances. After pressing it a 2nd time: 129 instances.
That's entirely normal. And a strong hint that the real problem here is that you don't know how to interpret the memory profiler report properly.
Task sounds like a very expensive object: it promises a lot of bang for the buck, and a thread is involved, the most expensive operating system object you could ever create. But it is not; a Task object is actually a puny object. It only takes 44 bytes in 32-bit mode, 80 bytes in 64-bit mode. The truly expensive resource is not owned by the Task; the threadpool manager takes care of it.
That means you can create a lot of Task objects before you put enough pressure on the GC heap to trigger a collection. About 47 thousand of them to fill the gen #0 segment in 32-bit mode. Many more on a server - hundreds of thousands - since its segments are much bigger.
In your code snippet, Task objects are the only objects you actually create. Your for loop therefore does not run nearly often enough for you to ever see the number of Task objects decrease or level off.
So it is the usual story: accusations of the .NET Framework having leaks, especially in these kinds of essential object types that are used heavily in server-style apps that run for months, are forever highly exaggerated. Second-guessing the garbage collector is always tricky; you typically only gain confidence by actually having your app run for months and never failing with OOM.
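If you want to convince yourself of this in a debug session, a throwaway check like the following (forcing a full blocking collection, which you would never do in production code) should make the completed, unreferenced Task<UInt32> instances disappear from the next heap snapshot:
// Diagnostic sketch only - do not ship forced collections.
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
System.Diagnostics.Debug.WriteLine(
    $"Managed heap after forced GC: {GC.GetTotalMemory(forceFullCollection: true):N0} bytes");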
I would create and close the DataWriter within the for loop.
Is there a valid reason to not use TcpListener for implementing a high performance/high throughput TCP server instead of SocketAsyncEventArgs?
I've already implemented this high performance/high throughput TCP server using SocketAsyncEventArgs and went through all sorts of headaches handling those pinned buffers, using a big pre-allocated byte array and pools of SocketAsyncEventArgs for accepting and receiving, putting it together with some low-level stuff and shiny smart code with some TPL Dataflow and some Rx, and it works perfectly; almost textbook in this endeavor - actually I've learned more than 80% of this stuff from other people's code.
However there are some problems and concerns:
Complexity: I cannot delegate any sort of modification to this server to another member of the team. That ties me to this sort of task, and I cannot pay enough attention to other parts of other projects.
Memory usage (pinned byte arrays): With SocketAsyncEventArgs the pools need to be pre-allocated. So for handling 100,000 concurrent connections (the worst case, even on different ports) a big pile of RAM sits there uselessly, pre-allocated (even if those conditions are only met some of the time, the server should be able to handle 1 or 2 such peaks every day).
TcpListener actually works well: I actually put TcpListener to the test (with some tricks, like using AcceptTcpClient on a dedicated thread rather than the async version, then sending the accepted connections to a ConcurrentQueue instead of creating Tasks in place, and the like), and with the latest version of .NET it worked very well - almost as well as SocketAsyncEventArgs, with no data loss and a low memory footprint, which helps by not wasting too much RAM on the server, and no pre-allocation is needed.
So why do I not see TcpListener being used anywhere, with everybody (including myself) using SocketAsyncEventArgs instead? Am I missing something?
I see no evidence that this question is about TcpListener at all. It seems you are only concerned with the code that deals with a connection that already has been accepted. Such a connection is independent of the listener.
SocketAsyncEventArgs is a CPU-load optimization. I'm convinced you can achieve a higher rate of operations per second with it. How significant is the difference to normal APM/TAP async IO? Certainly less than an order of magnitude. Probably between 1.2x and 3x. Last time I benchmarked loopback TCP transaction rate I found that the kernel took about half of the CPU usage. That means your app can get at most 2x faster by being infinitely optimized.
Remember that SocketAsyncEventArgs was added to the BCL back in .NET Framework 3.5 (around 2007), when CPUs were far less capable.
Use SocketAsyncEventArgs only when you have evidence that you need it. It causes you to be far less productive. More potential for bugs.
Here's the template that your socket processing loop should look like:
while (ConnectionEstablished()) {
    var someData = await ReadFromSocketAsync(socket);
    await ProcessDataAsync(someData);
}
Very simple code. No callbacks thanks to await.
In case you are concerned about managed heap fragmentation: allocate a new byte[1024 * 1024] on startup. When you want to read from a socket, read a single byte into some free portion of this buffer. When that single-byte read completes, ask how many bytes are actually there (Socket.Available) and synchronously pull the rest. That way you only pin a single, rather small buffer and can still use async IO to wait for data to arrive.
This technique does not require polling. Since Socket.Available can only increase until you read from the socket, we do not risk accidentally performing a read that is too small.
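A minimal sketch of that read pattern (the bigBuffer name and offset bookkeeping are placeholders; wrap-around and error handling are omitted):
// Wait asynchronously for one byte, then drain the rest synchronously - a sketch.
// using System; using System.Net.Sockets; using System.Threading.Tasks;
async Task<int> ReadChunkAsync(Socket socket, byte[] bigBuffer, int offset)
{
    // Pin only a 1-byte segment of the big preallocated buffer while we wait.
    int got = await socket.ReceiveAsync(
        new ArraySegment<byte>(bigBuffer, offset, 1), SocketFlags.None);
    if (got == 0) return 0;                                   // connection closed

    // Everything already buffered by the OS can now be pulled without awaiting.
    int available = Math.Min(socket.Available, bigBuffer.Length - offset - got);
    if (available > 0)
        got += socket.Receive(bigBuffer, offset + got, available, SocketFlags.None);

    return got;
}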
Alternatively, you can combat managed heap fragmentation by allocating few very big buffers and handing out chunks.
Or, if you don't find this to be a problem you don't need to do anything.
The documentation for the Serial Port states that:
The DataReceived event is not guaranteed to be raised for every byte
received. Use the BytesToRead property to determine how much data is
left to be read in the buffer.
The protocol we are trying to implement separates messages by idle periods. Since we have to rely on the arrival time of each character, this .NET restriction seems to be a problem.
Does anyone know how .NET's SerialPort decides whether to raise an event or not? Is it to avoid event spamming at high baud rates, so it buffers them?
Is there any guarantee that at least one event will be raised every XY milliseconds? What is that minimal period, if any?
How to approach this problem?
EDIT: A little more research shows that it can be done by setting the timeouts. Stupid me!
This is not a good plan for a protocol, unless the periods between messages are at least a couple of seconds. Windows does not provide any kind of service guarantee for code that runs in user mode; it is not a real-time operating system. Your code will fail when the machine gets heavily loaded and your code gets pre-empted by other threads that run at a higher priority. Like kernel threads. Delays of hundreds of milliseconds are common; several seconds is certainly quite possible, especially when your code got paged out and the paging file is fragmented. Very hard to troubleshoot, and it reproduces very poorly.
The alternative is simple, just use a frame around the message so you can reliably detect the start and the end of a message. Two bytes will do, STX and ETX are popular choices. Add a length byte if the end-of-message byte can also appear in the data.
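For illustration, a sketch of such framing on the receiving side (the STX/ETX values, the port variable and HandleMessage are placeholders; escaping of STX/ETX bytes that appear inside the payload is not handled):
// Reassemble STX ... ETX frames regardless of how DataReceived chops up the bytes.
// using System.Collections.Generic; using System.IO.Ports;
const byte STX = 0x02, ETX = 0x03;
var frame = new List<byte>();
bool inFrame = false;

port.DataReceived += (sender, args) =>
{
    while (port.BytesToRead > 0)
    {
        byte b = (byte)port.ReadByte();
        if (b == STX) { frame.Clear(); inFrame = true; }
        else if (b == ETX && inFrame) { HandleMessage(frame.ToArray()); inFrame = false; }
        else if (inFrame) { frame.Add(b); }
    }
};
With framing in place, timing no longer matters at all: a message is complete when its ETX arrives, however late the event fires.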
We have a hardware system with some FPGA's and an FTDI USB controller. The hardware streams data over USB bulk transfer to the PC at around 5MB/s and the software is tasked with staying in sync, checking the CRC and writing the data to file.
The FTDI chip has a 'busy' pin which goes high while it's waiting for the PC to do its business. There is a limited amount of buffering in the FTDI and elsewhere on the hardware.
The busy line is going high for longer than the hardware can buffer (50-100ms) so we are losing data. To save us from having to re-design the hardware I have been asked to 'fix' this issue!
I think my code is quick enough as we've had it running up to 15MB/s, so that leaves an IO bottleneck somewhere. Are we just expecting too much from the PC/OS?
Here is my data entry point. Occasionally we get a dropped bit or byte. If the checksum doesn't compute, I shift through until it does. byte[] data is nearly always 4k.
void ftdi_OnData(byte[] data)
{
    List<byte> buffer = new List<byte>(data.Length);
    int index = 0;
    while ((index + rawFile.Header.PacketLength + 1) < data.Length)
    {
        if (CheckSum.CRC16(data, index, rawFile.Header.PacketLength + 2)) // <- packet length + 2 for 16-bit checksum
        {
            buffer.AddRange(data.SubArray<byte>(index, rawFile.Header.PacketLength));
            index += rawFile.Header.PacketLength + 2; // <- skip the two checksum bytes, we don't want to save them...
        }
        else
        {
            index++; // shift through
        }
    }
    rawFile.AddData(buffer.ToArray(), 0, buffer.Count);
}
Tip: do not write to a file... queue.
Modern computers have multiple processors. If you want certain things to go as fast as possible, use multiple processors.
Have one thread deal with the USB data, check checksums, etc. It queues (ONLY) the results into a thread-safe queue.
Another thread reads data from the queue and writes it to a file, possibly buffered.
Finished ;)
100 ms is a lot of time for decent operations. I have successfully handled around 250,000 IO data packets per second (financial data) in C# without breaking a sweat.
Basically, make sure your IO threads do ONLY that and use your internal memory as a buffer. Especially when dealing with hardware on one end, the thread doing that should ONLY do that, possibly (if needed) running at high priority.
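As a sketch of that split (BlockingCollection, the bounded capacity, the file name and the 64 KB FileStream buffer are my assumptions, not from the original post):
// One thread validates and enqueues; a second, long-running task is the only file writer.
// using System.Collections.Concurrent; using System.IO; using System.Threading.Tasks;
var queue = new BlockingCollection<byte[]>(boundedCapacity: 1024);

var writer = Task.Factory.StartNew(() =>
{
    using (var file = new FileStream("capture.raw", FileMode.Create,
                                     FileAccess.Write, FileShare.None, 64 * 1024))
    {
        foreach (var chunk in queue.GetConsumingEnumerable())
            file.Write(chunk, 0, chunk.Length);
    }
}, TaskCreationOptions.LongRunning);

// From the USB/CRC thread: hand off the validated payload and return immediately.
void Enqueue(byte[] validatedPayload) => queue.Add(validatedPayload);

// On shutdown:
queue.CompleteAdding();     // lets GetConsumingEnumerable finish
writer.Wait();
The bounded capacity gives you hundreds of megabytes of elastic buffering in RAM, so a slow disk moment no longer shows up on the FTDI busy pin.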
To get good read throughput on Windows on USB, you generally need to have multiple asynchronous reads (or very large reads, which is often less convenient) queued onto the USB device stack. I'm not quite sure what the FTDI drivers / libraries do internally in this regard.
Traditionally I have written mechanisms with an array of OVERLAPPED structures and an array of buffers, and kept shovelling them into ReadFile as soon as they're free. I was doing 40+ MB/s reads over USB 2 like this about 5-6 years ago, so modern PCs should certainly be able to cope.
It's very important that you (or your drivers/libraries) don't get into a "start a read, finish a read, deal with the data, start another read" cycle, because you'll find that the bus is idle for vast swathes of time. A USB analyser would show you if this was happening.
I agree with the others that you should get off the thread that the read is happening on as soon as possible - don't block the FTDI event handler for any longer than it takes to put the buffer into another queue.
I'd preallocate a circular queue of buffers, pick the next free one and throw the received data into it, then complete the event handling as quickly as possible.
All that checksumming and concatenation, with its attendant memory allocation, garbage collection, etc., can be done on the other side of potentially hundreds of MB of buffer time/space on the PC. At the moment you may well be effectively asking your FPGA/hardware buffer to accommodate the time taken for you to do all sorts of ponderous PC stuff which could be done much later.
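A sketch of that handoff (the pool size, buffer size and tuple-based queue are my guesses; the CRC scan from the original handler moves to the consumer side):
// Grab a preallocated buffer, copy, enqueue, return - keep the FTDI handler tiny.
// using System; using System.Collections.Concurrent;
const int PoolSize = 256, BufferSize = 4096;
var freeBuffers = new ConcurrentQueue<byte[]>();
var filled = new BlockingCollection<(byte[] Buffer, int Length)>();
for (int i = 0; i < PoolSize; i++) freeBuffers.Enqueue(new byte[BufferSize]);

void ftdi_OnData(byte[] data)
{
    if (!freeBuffers.TryDequeue(out var buf) || buf.Length < data.Length)
        buf = new byte[data.Length];                 // pool exhausted or too small: fall back
    Buffer.BlockCopy(data, 0, buf, 0, data.Length);
    filled.Add((buf, data.Length));                  // hand off and return as fast as possible
}

// Consumer thread: do the CRC16 scan and rawFile.AddData here, then recycle:
// foreach (var (buf, len) in filled.GetConsumingEnumerable()) { ...; freeBuffers.Enqueue(buf); }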
I'm optimistic though - if you can really buffer 100ms of data on the hardware, you should be able to get this working reliably. I wish I could persuade all my clients to allow so much...
So what does your receiving code look like? Do you have a thread running at high priority responsible solely for capturing the data and passing it in memory to another thread in a non-blocking fashion? Do you run the process itself at an elevated priority?
Have you designed the rest of your code to avoid the more expensive gen 2 garbage collections? How large are your buffers - are they on the large object heap? Do you reuse them efficiently?