When writing functions that operate on a potentially infinite "stream" of data, i.e. bytes, chars, whatever, what are some design considerations in deciding to use Strings/Arrays vs. Streams for the input/output?
Is there a significant performance impact in always writing functions to use streams, and then adding overloads that use stream wrappers (e.g. StringReader/StringWriter) to return "simple" data like an array or a string that doesn't require disposing and other considerations?
I would think functions operating on arrays are far more convenient, because you can "return" a resulting array and you don't usually have to worry about disposing. I assume stream-based functions are nice because they can operate on an infinite source of data and are probably more memory-efficient as well.
If you are working with binary data of unknown size, always use streams. Reading an entire file into a byte array, for example, is usually a bad idea if it can be avoided. Most APIs in .NET that work with binary data, such as encryption and compression, are built to use streams as input/output.
If you are writing a function to process a stream of data, why not accept it as an IEnumerable<T>? You can then return the results as an IEnumerable<T> from a generator function, in other words using yield return to produce each result one at a time.
You can end up with asymptotic improvements in performance in some cases because the evaluation is done as needed.
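A minimal sketch of that pattern, assuming the input is textual and a line is the unit of work (the ReadLines name is just illustrative):

```csharp
using System.Collections.Generic;
using System.IO;

static class StreamHelpers
{
    // Lazily yields each line of the stream; nothing is read until the caller
    // starts enumerating, and only one line is held in memory at a time.
    public static IEnumerable<string> ReadLines(Stream source)
    {
        using var reader = new StreamReader(source);
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}
```

Because the enumeration is lazy, a caller can stop early (e.g. with LINQ's Take or First) and the rest of the stream is never read.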
Okay, so I'm trying to make a multiplayer game with self-built netcode in Unity3D using C#.
The thing is, since I'm using raw TCP, I need to convert everything to a byte[], and I got sick of using Array.Copy, since I'm reserving a few bytes of every message sent over the network as a sort of message identifier that I can use to interpret the data I receive.
So my question is: for the purpose of making this code friendlier to myself, is it a terrible idea to use a List<byte> instead of a byte array, and once I've prepared the message to be sent, just call .ToArray() on that list?
Would this be terrible for performance?
As a general rule when dealing with sockets: you should usually be working with oversized arrays (ArrayPool<byte>.Shared can be useful here), and then use the Send overloads that accept either byte[], int, int (offset + count) or ArraySegment<byte>, so you aren't constantly re-copying things and can re-use buffers. However, frankly, you may also want to look into "pipelines"; this is the new IO API that Kestrel uses, and it deals with all the buffer management for you, so you don't have to. In the core framework, pipelines is currently mostly server-focused, but this is hopefully being improved in .NET 5 as part of "Bedrock"; in the meantime, client-side pipelines wrappers are available in Pipelines.Sockets.Unofficial (on NuGet).
To be explicit: no, constantly calling ToArray() on a List<byte> is not a good way of making buffers more convenient; that will probably work, but could contribute to GC stalls, plus it is unnecessary overhead in your socket code.
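A rough sketch of the buffer-reuse idea (the one-byte message identifier and the SendMessage helper are hypothetical, just to illustrate the pattern):

```csharp
using System;
using System.Buffers;
using System.Net.Sockets;

static class MessageSender
{
    // Hypothetical wire format: 1 identifier byte followed by the payload.
    public static void SendMessage(Socket socket, byte messageId, ReadOnlySpan<byte> payload)
    {
        // Rent an oversized buffer instead of allocating a fresh array per message.
        byte[] buffer = ArrayPool<byte>.Shared.Rent(1 + payload.Length);
        try
        {
            buffer[0] = messageId;
            payload.CopyTo(buffer.AsSpan(1));

            // Send only the portion we actually wrote (the offset + count overload).
            socket.Send(buffer, 0, 1 + payload.Length, SocketFlags.None);
        }
        finally
        {
            // Return the buffer so later sends can reuse it.
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```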
I'm porting over some C++ and Java code to C#, and all the data IO is done with Streams, even though all the streams are less than 1K in size.
Given that the buffer size of the stream equals the whole stream in almost all circumstances, is there any reason I shouldn't simply use a Byte[]?
The 1K data units are arriving from a stream source (network or disk); however, once read into memory, the access is a little random. I think direct byte[x] access might be more efficient (or logical).
So my question is: is it generally acceptable, from a security and architecture perspective, to use a Byte[] array directly instead of wrapping it in a stream? Assume that no further "stream" access is needed for other operations (e.g. an encoded media stream).
Since you stated that the usage is "a little random," I think a byte array makes the most sense; these are inherently good for lookup at a given position, while a stream would require you to do a linear read and reset the position. I'm not sure what security concerns you might have, but if you're passing the array to any unmanaged resources you might want to consider pinning it in memory.
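If the buffer does end up being handed to unmanaged code, a minimal pinning sketch might look like this (purely illustrative):

```csharp
using System;
using System.Runtime.InteropServices;

byte[] data = new byte[1024];

// Pin the array so the GC cannot move it while unmanaged code holds a pointer to it.
GCHandle handle = GCHandle.Alloc(data, GCHandleType.Pinned);
try
{
    IntPtr pointer = handle.AddrOfPinnedObject();
    // ... pass 'pointer' to the unmanaged API here ...
}
finally
{
    handle.Free();   // always release the pin, or the array stays immovable
}
```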
(This post is regarding High Frequency type programming)
I recently saw on a forum (I think they were discussing Java) that if you have to parse a lot of string data, it's better to use a byte array than a string with split(). The exact post was:
One performance trick to working with any language, C++, Java, C#, is to avoid object creation. It's not the cost of allocation or GC, it's the cost of accessing large memory arrays that don't fit in the CPU cache.
Modern CPUs are much faster than their memory. They stall for many, many cycles on each cache miss. Most of the CPU transistor budget is allocated to reducing this with large caches and lots of tricks.
GPUs solve the problem differently by having lots of threads ready to execute to hide memory access latency, and have little or no cache and spend the transistors on more cores.
So, for example, rather than using Strings and split to parse a message, use byte arrays that can be updated in place. You really want to avoid random memory access over large data structures, at least in the inner loops.
Is he just saying "don't use strings because they're an object and creating objects is costly"? Or is he saying something else?
Does using a byte array ensure the data remains in the cache for as long as possible?
When you use a string, is it too large to be held in the CPU cache?
Generally, is using primitive data types the best method for writing faster code?
He's saying that if you break a chunk of text up into separate string objects, those string objects have worse locality than the large array of text. Each string, and the array of characters it contains, is going to be somewhere else in memory; they can be spread all over the place. It is likely that the memory cache will have to thrash in and out to access the various strings as you process the data. In contrast, the one large array has the best possible locality, as all the data is in one area of memory, and cache-thrashing will be kept to a minimum.
There are limits to this, of course: if the text is very, very large, and you only need to parse out part of it, then those few small strings might fit better in the cache than the large chunk of text.
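As a rough illustration of the "parse in place" idea (the message layout and field here are made up, and ASCII data is assumed):

```csharp
static class InPlaceParsing
{
    // Hypothetical message: an ASCII-encoded integer field terminated by ','.
    // The value is parsed directly out of the byte array, so no substrings
    // (and no extra objects) are allocated, unlike string.Split.
    public static int ParseIntField(byte[] message, int offset)
    {
        int value = 0;
        while (message[offset] != (byte)',')
        {
            value = value * 10 + (message[offset] - (byte)'0');
            offset++;
        }
        return value;
    }
}
```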
There are lots of other reasons to use byte[] or char* instead of Strings for HFT. Strings consist of 16-bit chars in Java and are immutable. byte[] or ByteBuffer are easily recycled, have good cache locality, can be off the heap (direct), saving a copy, and avoid character encoders. This all assumes you are using ASCII data.
char* or ByteBuffers can also be mapped to network adapters to save another copy (with some fiddling for ByteBuffers).
In HFT you are rarely dealing with large amounts of data at once. Ideally you want to be processing data as soon as it comes down the socket, i.e. one packet at a time (about 1.5 KB).
I'm not familiar with the internals of DeflateStream, but I need to store files in a vendor's DB system that uses DeflateStream on binary attachments. The first thing I noticed was that all of my files were 10-50% BIGGER after compression, but I attribute that to a less sophisticated compression algorithm running on top of files that are already highly compressed (in this case they were all PDFs). My question, however, relates to the fact that when I just wrote the original file into the BLOB, the vendor's application had no problem opening it (it opened the attachments I compressed with deflate as well). Is there a header on the compressed data that tells DeflateStream that the data's not compressed and to basically pass it on as-is? This is the specification; can anyone familiar with it point to where this is defined, or am I off base and the vendor is doing some magic behind the scenes?
No, there is no such magic in DeflateStream.
The built-in DeflateStream exhibits a compression anomaly in which previously-compressed data actually increases in size. This has been reported to Microsoft, but they declined to fix the problem. It has to do with a naive implementation of the DEFLATE protocol in DeflateStream.
Ways that I know of to avoid the problem:
use an alternative DeflateStream that does not exhibit this problem. See DotNetZip for one example; it includes a DeflateStream that just works.
use the broken DeflateStream, compress the stream, compare sizes, and if the "compressed" stream is larger, fall back to using the "uncompressed" stream (see the sketch below).
If you choose the former case, you still have the condition where you are compressing stuff that has already been compressed; in other words, unnecessary double-compression. So you may want to look into avoiding that, regardless of what you choose.
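A minimal sketch of the compare-and-fallback approach (the single leading flag byte used to record which form was stored is just an assumed convention for illustration):

```csharp
using System.IO;
using System.IO.Compression;

static class MaybeCompress
{
    // Returns [1][deflate-compressed data] or [0][original data], whichever is smaller.
    public static byte[] CompressIfSmaller(byte[] original)
    {
        byte[] compressed;
        using (var ms = new MemoryStream())
        {
            using (var deflate = new DeflateStream(ms, CompressionMode.Compress, leaveOpen: true))
            {
                deflate.Write(original, 0, original.Length);
            }
            compressed = ms.ToArray();
        }

        byte[] smaller = compressed.Length < original.Length ? compressed : original;
        var result = new byte[smaller.Length + 1];
        result[0] = (byte)(smaller == compressed ? 1 : 0);   // flag: 1 = compressed, 0 = raw
        smaller.CopyTo(result, 1);
        return result;
    }
}
```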
Stream compression is different from file compression. When compressing a file, it's generally possible to make multiple passes over the entire file and determine which compression scheme to use before having to commit to one. When compressing a stream, it's often necessary to start outputting data before the compression routine has processed enough data to know what compression method is going to be optimal.
This effect can be somewhat mitigated by dividing data into blocks, deciding for each block how to represent the data, and including a header at the start of each block identifying how it is stored. Unfortunately, the extra block headers will add to the size of the resulting stream. Further, many compression schemes improve in effectiveness as they process a stream; it may well be that every 1K block in a file would expand if "compressed" individually, even if compressing the whole file would result in a considerable space savings (since the compressor could, e.g., build up a dictionary of common byte sequences).

It would be possible to design a compress/uncompress pair so that a block of data which would expand is written out verbatim by the compressor (with a header byte indicating that's what it was), and have the uncompressor process that block the same way the compressor would have, so as to add to the dictionary the same byte sequences that would have been added had the block been stored in "compressed" form. Such an approach would probably be a good one, though it would add considerably to the complexity of the uncompressor.
I suspect the biggest problem for DeflateStream, though, is that there may not be any way to improve the worst-case "compression" performance without producing compressed data that is incompatible with the existing "uncompress" code. Suppose one has a sequence of bytes Q, and one needs a sequence of bytes which, when fed to the "uncompress" code that shipped with .NET 2.0, will yield that same sequence. It may well be that for some possible values of Q, there are no such input sequences which aren't a lot bigger than Q. If that's the case, there's no way Microsoft could "fix" the problem without a time machine.
It all depends on how the DEFLATE stream was created.
DEFLATE supports a "non-compressed block" (BTYPE=00) and all data in this block, should it be utilized, is stored verbatim with no compression -- just the block header, length, and raw data. However, a stream can be a valid DEFLATE stream and contain zero (or not enough) "non-compressed" blocks even if this resulted in a sub-par compression ratio.
The overall compression ratio will depend upon the data, compressor algorithm/implementation, and amount of effort it puts into performing the compression.
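As an illustration of what a stored (BTYPE=00) block looks like on the wire, here is a rough sketch that builds one by hand per RFC 1951 (single final block only, so it is not a general-purpose encoder):

```csharp
using System;
using System.IO;

static class DeflateStored
{
    // Builds a one-block DEFLATE stream that stores the payload verbatim.
    // A single stored block holds at most 65535 bytes.
    public static byte[] StoredBlock(byte[] payload)
    {
        if (payload.Length > 0xFFFF)
            throw new ArgumentException("A single stored block holds at most 65535 bytes.");

        using var ms = new MemoryStream();
        ms.WriteByte(0x01);                         // BFINAL=1, BTYPE=00, padded to a byte boundary
        ushort len = (ushort)payload.Length;
        ms.WriteByte((byte)(len & 0xFF));           // LEN, little-endian
        ms.WriteByte((byte)(len >> 8));
        ms.WriteByte((byte)(~len & 0xFF));          // NLEN = one's complement of LEN
        ms.WriteByte((byte)((~len >> 8) & 0xFF));
        ms.Write(payload, 0, payload.Length);       // the raw data, uncompressed
        return ms.ToArray();
    }
}
```

A stream built this way is valid DEFLATE and can be read back with a DeflateStream in CompressionMode.Decompress; the point is that "no compression" is expressed per block via the block type, not via a flag on the stream as a whole.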
Happy coding.
I have an out-of-memory exception in C# when reading in a massive file.
I need to change the code, but for the time being, can I increase the heap size (like I would in Java) as a short-term fix?
.NET does that automatically.
Looks like you have reached the limit of the memory one .NET process can use for its objects (on a 32-bit machine this is 2 GB by default, or 3 GB using the /3GB boot switch; credits to Leppie and Eric Lippert for the info).
Rethink your algorithm, or perhaps a change to a 64 bit machine might help.
No, this is not possible. This problem might occur because you're running on a 32-bit OS and memory is too fragmented. Try not to load the whole file into memory (for instance, by processing line by line) or, when you really need to load it completely, by loading it in multiple, smaller parts.
No, you can't; see my answer here: Is there any way to pre-allocate the heap in the .NET runtime, like -Xmx/-Xms in Java?
For reading large files it is usually preferable to stream them from disk, reading them in chunks and dealing with them a piece at a time instead of loading the whole thing up front.
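For example, a rough sketch of processing a large binary file in fixed-size chunks rather than loading it whole (the 64 KB buffer size and the processChunk callback are arbitrary placeholders):

```csharp
using System;
using System.IO;

static class ChunkedReader
{
    public static void ProcessFile(string path, Action<byte[], int> processChunk)
    {
        // Only one 64 KB buffer is ever held in memory, regardless of file size.
        var buffer = new byte[64 * 1024];
        using var stream = File.OpenRead(path);

        int bytesRead;
        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            processChunk(buffer, bytesRead);   // hand each chunk to the caller
        }
    }
}
```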
As others have already pointed out, this is not possible. The .NET runtime handles heap allocations on behalf of the application.
In my experience .NET applications commonly suffer from OOM when there should be plenty of memory available (or at least, so it appears). The reason for this is usually the use of huge collections such as arrays, List (which uses an array to store its data) or similar.
The problem is these types will sometimes create peaks in memory use. If these peak requests cannot be honored, an OOM exception is thrown. E.g. when a List needs to increase its capacity, it does so by allocating a new array of double the current size and then copying all the references/values from the old array to the new one. Similarly, operations such as ToArray make a new copy of the array. I've also seen similar problems with big LINQ operations.
Each array is stored as contiguous memory, so to avoid OOM the runtime must be able to obtain one big chunk of memory. As the address space of the process may be fragmented due to both DLL loading and general heap use, this is not always possible, in which case an OOM exception is thrown.
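As a small illustration of how much those capacity-doubling peaks matter, pre-sizing the collection avoids them entirely (the element count here is arbitrary):

```csharp
using System.Collections.Generic;

// Growing from the default capacity forces repeated double-and-copy reallocations,
// each of which briefly needs both the old and the new backing array in memory.
var grown = new List<byte>();
for (int i = 0; i < 1_000_000; i++) grown.Add(0);

// Pre-sizing allocates the backing array once, avoiding those transient peaks.
var presized = new List<byte>(1_000_000);
for (int i = 0; i < 1_000_000; i++) presized.Add(0);
```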
What sort of file are you dealing with?
You might be better off using a StreamReader and yield returning the ReadLine result, if it's textual.
Sure, you'll be keeping a file pointer around, but the worst-case scenario is massively reduced.
There are similar methods for binary files; if you're uploading a file to SQL, for example, you can read into a byte[] and use the SQL pointer mechanics to write the buffer to the end of a BLOB.