Why no byte strings in .NET / C#?

Is there a good reason that .NET provides string functions (like search, substring extraction, splitting, etc.) only for UTF-16 and not for byte arrays? I can see many cases where it would be easier and much more efficient to work with 8-bit characters instead of 16-bit.
Take the MIME (.EML) format, for example. It's basically an 8-bit text file. You cannot read it properly using ANY single encoding, because the encoding info is contained within the file and, moreover, different parts can have different encodings.
So you're better off reading a MIME file as bytes, determining its structure (ideally with 8-bit string parsing tools), and, once you've found the encodings for all encoding-dependent data blocks, applying encoding.GetString(data) to get a normal UTF-16 representation of them.
Another case is base64 data blocks (base64 is just an example; there are also UUE and others). Currently .NET expects you to hand it base64 as a 16-bit string, but it's not efficient to read twice as much data and convert it from bytes to a string just to decode it. When dealing with megabytes of data, this becomes important.
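(For what it's worth, later versions of .NET did add exactly this for base64: System.Buffers.Text.Base64 decodes straight from bytes, skipping the byte[] -> string detour. A minimal sketch, available from .NET Core 2.1 onward; the literal here merely stands in for bytes read from the file:)

using System;
using System.Buffers;
using System.Buffers.Text;
using System.Text;

class Base64FromBytes
{
    static void Main()
    {
        // Stand-in for raw ASCII bytes read straight from a MIME part.
        // Note: DecodeFromUtf8 rejects embedded line breaks, so strip them first
        // (unlike Convert.FromBase64String, which skips whitespace).
        byte[] base64Bytes = Encoding.ASCII.GetBytes("SGVsbG8sIHdvcmxkIQ==");

        byte[] decoded = new byte[Base64.GetMaxDecodedFromUtf8Length(base64Bytes.Length)];
        OperationStatus status = Base64.DecodeFromUtf8(
            base64Bytes, decoded, out int consumed, out int written);

        Console.WriteLine($"{status}: {written} bytes decoded"); // Done: 13 bytes decoded
    }
}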
The missing byte-string manipulation functions have to be written by hand, and such implementations are obviously less efficient than the framework's native string functions.
I'm not saying they need to be called 8-bit chars; let's keep calling them bytes. Just give us a set of native methods that mirror most of the string manipulation routines, but operate on byte arrays. Is this needed only by me, or am I missing something important about the overall .NET architecture?

Take the MIME (.EML) format, for example. It's basically an 8-bit text file. You cannot read it properly using ANY single encoding, because the encoding info is contained within the file and different parts can have different encodings.
So, you're talking about a case where general-purpose byte-string methods aren't very useful, and you'd need to specialise.
And then for other cases, you'd need to specialise again.
And again.
I actually think byte-string methods would be more useful than your example suggests, but it remains that many of the use cases for them have specialised needs that differ from the other uses in incompatible ways.
Which suggests they may not be well suited to the base library. And it's not as if you can't write your own that fit those specialised needs.

Code that deals with mixed-encoding string manipulation is unnecessarily hard and much harder to explain and get right. The way you suggest handling mixed encodings, every "string" would need to carry its encoding information with it, and the framework would have to provide implementations of all possible combinations of encodings.
The standard solution for such a problem is to provide a well-defined way to convert all types to/from a single "canonical" representation, and to perform most operations on that canonical type. You can see this more easily in image/video processing, where arbitrary incoming formats are converted into the one format the tool knows about, processed, and converted back to the original (or any other) format.
.NET strings are almost there, with a "canonical" way to represent a Unicode string. There are still many ways to represent what, from the user's point of view, is the same string but is actually composed of different char elements. Even ordinary string comparison is a huge problem (as, in addition to encoding, there are frequently locale differences).
Notes
there are already plenty of APIs for comparing/slicing byte arrays, both on the Array/List classes and as LINQ helpers. The only real missing part is regex-like matching.
even dealing with a single encoding for strings (UTF-16 in .NET, UTF-8 in many other systems) is hard enough; even getting the "string length" is a problem (do you count surrogate pairs only, include all combining characters, or is plain .Length enough?). See the sketch after these notes.
it is a good idea to try to write the code yourself to see where the complexity comes from and whether a particular framework decision makes sense. Try implementing 10-15 common string functions that support several encodings, e.g. UTF-8, UTF-16, and one 8-bit encoding.
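A small sketch of the length problem from the second note: .Length counts UTF-16 code units, while StringInfo counts user-perceived characters (the sample string is arbitrary):

using System;
using System.Globalization;

class LengthDemo
{
    static void Main()
    {
        // 'a', then U+1D11E (a musical symbol, stored as a surrogate pair),
        // then U+0301 (a combining acute accent).
        string s = "a\U0001D11E\u0301";

        Console.WriteLine(s.Length);                               // 4 UTF-16 code units
        Console.WriteLine(new StringInfo(s).LengthInTextElements); // 2 user-perceived characters
    }
}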

Related

Is there a preferred manner for sending data over a web socket connection?

Is there a 'correct' or preferred manner for sending data over a web socket connection?
In my case, I am sending the information from a C# application to a python (tornado) web server, and I am simply sending a string consisting of several elements separated by commas. In python, I use rudimentary techniques to split the string and then structure the elements into an object.
e.g.:
'foo,0,bar,1'
becomes:
object = {
    'foo': 0,
    'bar': 1
}
In the other direction, I am sending the information as a JSON string, which I then deserialise using Json.NET.
I imagine there is no strictly right or wrong way of doing this, but are there significant advantages and disadvantages that I should be thinking of? And, somewhat related, is there a consensus for using string vs. binary formats?
Writing a custom encoding (e.g. as "k,v,..") is different from 'using binary'.
It is still text, just a rigid, under-defined, one-off, hand-rolled format that must be manually replicated. (What happens if a key or value contains a comma? What happens if the data needs to contain nested objects? How can null be interpreted differently from '' or 'null'?)
While JSON is definitely the most ubiquitous format for WebSockets, one shouldn't (for interchange purposes) write JSON by hand; one uses an existing serialization library on both ends. (There are many reasons why JSON is ubiquitous, covered in other answers; this doesn't mean it is always the 'best' format, however.)
To this end a binary serializer can also be used (BSON being a trivial example as it is effectively JSON-like in structure and operation). Just replace JSON.parse with FORMATX.parse as appropriate.
The only requirements are then:
There is a suitable serializer/deserializer for all the clients and servers. JSON works well here because it is so popular and there is no shortage of implementations.
There are various binary serialization formats with both Python and C# libraries, but it will require finding a 'happy intersection'.
The serialization format can represent the data. JSON usually works sufficiently and it has a very nice 1-1 correspondence with basic object graphs and simple values. It is also inherently schema-less.
Some formats are better at certain tasks and have different characteristics, features, or tool-chains. However, most concepts (and arguably most DTOs) can be mapped onto JSON easily, which makes it a good 'default' choice.
The other differences between the various kinds of binary and text serializations are mostly dressing - unless you'd like to start talking about schema vs. schema-less, extensibility, external tooling, metadata, non-compressed encoded sizes (or size after transport compression), compliance with a specific existing protocol, etc..
.. but the point to take away is: don't create a 'new' one-off format. Unless, of course, you just like reinventing wheels or there is a very specific use-case to fit.
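As a sketch of that advice: the question's comma-separated payload maps directly onto JSON with a stock library on each end. This assumes Json.NET, which the question already uses on the C# side; the payload shape is illustrative:

using System;
using System.Collections.Generic;
using Newtonsoft.Json; // Json.NET

class JsonBothWays
{
    static void Main()
    {
        // Instead of "foo,0,bar,1", send the same structure as JSON in both directions.
        var payload = new Dictionary<string, int> { ["foo"] = 0, ["bar"] = 1 };

        string wire = JsonConvert.SerializeObject(payload); // {"foo":0,"bar":1}
        // ... send `wire` over the WebSocket; on the Python side: json.loads(message)

        var back = JsonConvert.DeserializeObject<Dictionary<string, int>>(wire);
        Console.WriteLine(back["bar"]); // 1
    }
}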
First advice would be to use the same format in both directions, not plain text one way and JSON the other.
I personally think {'foo':0,'bar':1} is better than foo,0,bar,1 because everybody understands JSON, whereas your custom format needs explaining. The point is that you are inventing a data interchange format when JSON already is one, and @jfriend00 is right: pretty much every language understands JSON now, Python included.
Regarding text vs. binary, there isn't any consensus. As @user2864740 mentions in the comments on my answer, as long as the two sides understand each other, it doesn't really matter. It only becomes relevant if one of the sides has a preference for a format (consider, for example, opening the connection from the browser using JavaScript; there, people might prefer JSON over binary).
My advice is to go with something simple like JSON and design your app so that you can change the wire format by swapping in another implementation, without affecting the logic of your application.

Fast and memory efficient ASCII string class for .NET

This might have been asked before, but I can't find any such posts. Is there a class to work with ASCII Strings? The benefits are numerous:
Comparison should be faster since it's just byte-for-byte (instead of UTF-16 with its variable-width encoding)
Memory efficient: it should use about half the memory for large strings
Faster versions of ToUpper()/ToLower() which use a look-up table that is language-invariant
Jon Skeet wrote a basic AsciiString implementation and proved point 2, but I'm wondering whether anyone has taken this further and completed such a class. I'm sure there would be uses, although no one would typically take this route, since all the existing String functions would have to be re-implemented by hand, and conversions between String <> AsciiString would be scattered everywhere, complicating an otherwise simple program.
Is there such a class? Where?
I thought I would post the outcome of my efforts to implement a system as described, with as much string support and compatibility as I could manage. It's possibly not perfect, but it should give you a decent base to improve on if needed.
The ASCIIChar struct and ASCIIString class implicitly convert to their native counterparts for ease of use.
The OP's suggested replacements for ToUpper/ToLower etc. have been implemented in a much quicker way than a lookup table, and all the operations are as fast and memory-friendly as I could make them.
Sorry, I couldn't post the source; it was too long. See the links below.
ASCIIChar - Replaces char, stores the value in a byte instead of an int, and provides support methods and compatibility for the string class. Implements virtually all methods and properties available for char.
ASCIIChars - Provides static properties for each of the valid ASCII characters for ease of use.
ASCIIString - Replaces string, stores characters in a byte array and implements virtually all methods and properties available for string.
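As a rough sketch of the idea behind these classes (illustrative only; the real ASCIIString implements far more of string's surface):

using System;
using System.Text;

// Byte-backed string with implicit conversion back to System.String.
public sealed class ASCIIString
{
    private readonly byte[] _bytes;

    public ASCIIString(string s) => _bytes = Encoding.ASCII.GetBytes(s);
    private ASCIIString(byte[] bytes) => _bytes = bytes;

    public int Length => _bytes.Length;

    // Arithmetic case conversion: locale-invariant and needs no lookup table.
    public ASCIIString ToUpper()
    {
        var up = (byte[])_bytes.Clone();
        for (int i = 0; i < up.Length; i++)
            if (up[i] >= (byte)'a' && up[i] <= (byte)'z')
                up[i] -= 32;
        return new ASCIIString(up);
    }

    public static implicit operator string(ASCIIString s) => Encoding.ASCII.GetString(s._bytes);
}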
.NET has no direct ASCII string support. Strings are UTF-16 because the Windows API works only with 'ANSI' (one char, one byte) or UTF-16 strings. UTF-8 might have been a good choice for many workloads, but .NET does not use it internally because Windows doesn't.
The Windows API can convert between charsets, but it only works with 1-byte or 2-byte chars, so if you used UTF-8 strings in .NET you would have to convert them every time, which would hurt performance. .NET can still use UTF-8 and other encodings at the boundaries, via BinaryWriter/BinaryReader or a simple StreamWriter/StreamReader.
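For example, a minimal sketch of UTF-8 at the I/O boundary while strings stay UTF-16 in memory (the file name is illustrative):

using System;
using System.IO;
using System.Text;

class Utf8Io
{
    static void Main()
    {
        // Strings stay UTF-16 in memory, but I/O can use any encoding.
        using (var w = new StreamWriter("out.txt", append: false, encoding: Encoding.UTF8))
            w.WriteLine("héllo"); // written to disk as UTF-8 bytes

        using (var r = new StreamReader("out.txt", Encoding.UTF8))
            Console.WriteLine(r.ReadToEnd()); // decoded back into a UTF-16 string
    }
}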

Protobuf-net IsPacked=true for user defined structures

Is it currently possible to use IsPacked=true for user-defined structures? If not, is it planned for the future?
I get the following exception when I try to apply that attribute to a field of type ColorBGRA8[]: System.InvalidOperationException: Only simple data-types can use packed encoding.
My scenario is as follows: I'm writing a game and have tons of blittable structures for various things such as colors, vectors, matrices, vertices, and constant buffers. Their memory layout needs to be precisely defined at compile time in order to match, for example, the constant buffer layout from a shader (where fields generally need to be aligned on a 16-byte boundary).
I don't mean to waste anyone's time, but I couldn't find any recent information about this particular question.
Edit after it has been answered
I am currently testing a solution which uses protobuf-net for almost everything except large arrays of user-defined but blittable structures. All my fields containing arrays of custom structures have been replaced by arrays of bytes, which can be packed. After protobuf-net has finished deserializing the data, I use memcpy via P/Invoke to get back to an array of custom structures.
The following numbers are from a test which serializes one instance containing one field of either byte[] or ColorBGRA8[]. The raw test data is ~38 MiB, i.e. 1,000,000 entries in the color array. Serialization was done in memory using a MemoryStream.
Writing
Platform.Copy + Protobuf: 51 ms, size: 38.15 MiB
Protobuf: 2093 ms, size: 109.45 MiB
Reading
Platform.Copy + Protobuf: 43 ms
Protobuf: 2307 ms
The test shows that for huge arrays of more or less random data, a noticeable memory overhead can occur. That wouldn't have been such a big deal if not for the (de)serialization times. I understand protobuf-net might not be designed for my extreme case, let alone optimized for it, but those times are something I am not willing to accept.
I think I will stick with this hybrid approach, as protobuf-net works extremely well for everything else.
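For reference, on newer runtimes (.NET Core 2.1+) a span cast can stand in for the memcpy/P/Invoke step without leaving managed code. A sketch; the ColorBGRA8 layout here is assumed for illustration, not taken from the question:

using System;
using System.Runtime.InteropServices;

// Layout assumed for illustration: 4 bytes, one per channel.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct ColorBGRA8 { public byte B, G, R, A; }

class SpanCastDemo
{
    static void Main()
    {
        byte[] raw = new byte[4_000_000]; // the packed byte[] protobuf-net deserialized

        // Reinterpret the bytes as structs without copying...
        Span<ColorBGRA8> view = MemoryMarshal.Cast<byte, ColorBGRA8>(raw.AsSpan());

        // ...or materialize a real array with a single bulk copy.
        ColorBGRA8[] colors = view.ToArray();
        Console.WriteLine(colors.Length); // 1000000
    }
}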
Simply "does not apply". To quote from the encoding specification:
Only repeated fields of primitive numeric types (types which use the varint, 32-bit, or 64-bit wire types) can be declared "packed".
This doesn't work with custom structures or classes. The two approaches that apply there are strings (length-prefixed) and groups (start/end tokens); the latter is often cheaper to encode, but Google prefers the former.
Protobuf is not designed to arbitrarily match some other byte layout. It is its own encoding format and is only designed to process / output protobuf data. It would be like saying "I'm writing XML, but I want it to look like {non-xml} instead".

How to reduce memory footprint on .NET string intensive applications?

I have an application that have ~1,000,000 strings in memory for performance reasons. My application consumes ~200 MB RAM.
I want to reduce the amount of memory consumed by the strings.
I know .NET represents strings in UTF-16 encoding (2 bytes per char). Most strings in my application contain pure English characters, so storing them in UTF-8 would take about half the memory of UTF-16.
Is there a way to store a string in memory in UTF-8 encoding while still allowing standard string functions? (My needs mostly involve IndexOf with StringComparison.OrdinalIgnoreCase.)
Unfortunately, you can't change .NET's internal representation of strings. My guess is that the CLR is optimized for its native two-byte strings.
What you are dealing with is the classic space-time tradeoff: to save memory you have to spend more processor time, and to save processor time you have to spend more memory.
That said, take a look at some considerations here. If I were you, once you've established that the memory gain will be enough, do try to write your own "string" class that uses ASCII encoding. That will probably suffice.
UPDATE:
More to the point, you should check the post "Of memory and strings" by StackOverflow legend Jon Skeet, which deals with exactly the problem you are facing. Sorry I didn't mention it right away; it took me some time to find that exact post.
Is there a way to store a string in memory in UTF-8 encoding while still allowing standard string functions? (My needs mostly involve IndexOf with StringComparison.OrdinalIgnoreCase.)
You could store the data as byte arrays and provide your own IndexOf implementation (since converting back to string just for IndexOf would likely be a huge performance hit). Use the System.Text.Encoding functions for that (the best bet would be a build step that converts to bytes, then reading the byte arrays from disk, converting back to string only for display, if needed).
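A minimal sketch of such a hand-rolled IndexOf, assuming the strings are pure ASCII as the question suggests (this folds only ASCII letters, unlike the full OrdinalIgnoreCase, which also folds non-ASCII):

using System;

static class AsciiSearch
{
    // Case-insensitive search over ASCII bytes; returns the first match index or -1.
    public static int IndexOfIgnoreCase(byte[] haystack, byte[] needle)
    {
        for (int i = 0; i <= haystack.Length - needle.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && ToUpper(haystack[i + j]) == ToUpper(needle[j]))
                j++;
            if (j == needle.Length)
                return i;
        }
        return -1;
    }

    private static byte ToUpper(byte b) =>
        (byte)(b >= (byte)'a' && b <= (byte)'z' ? b - 32 : b);
}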
You could store them in a C/C++ library, letting you use single-byte strings. You probably wouldn't want to marshal them all back, but you could possibly marshal just the results (I assume there's some sort of searching going on here) without too much of a perf hit. C++/CLI may make this easier (letting you write the searching code in C++/CLI while keeping the string "database" in C++).
Or, you could revisit the initial performance issue that requires all of the strings to be in memory. An embedded database, indexing, etc. may both speed things up and reduce memory usage - and be more maintainable.
What if you store it as a byte array, and just restore it to a string when you need to do operations on it? I'd make a class for setting and getting the strings which internally stores them as byte arrays.
To byte array:
    string s = "whatever";
    byte[] b = System.Text.Encoding.UTF8.GetBytes(s);
Back to string:
    string s = System.Text.Encoding.UTF8.GetString(b);
Try using an in-memory DB as "storage" and SQL to interact with the data. For example, SQLite can be deployed as part of your application (it consists of just 1-2 DLLs which can be placed in the same folder as your application).
What if you create your own UTF-8 string class (UTF8String?) and supply an implicit cast to String? You'll be sacrificing some speed for the sake of memory, but that might be what you're looking for.

Is serialization a must in order to transfer data across the wire?

Below is something I read, and I was wondering whether the statement is true.
Serialization is the process of converting a data structure or object into a sequence of bits so that it can be stored in a file or memory buffer, or transmitted across a network connection link to be "resurrected" later in the same or another computer environment.[1] When the resulting series of bits is reread according to the serialization format, it can be used to create a semantically identical clone of the original object. For many complex objects, such as those that make extensive use of references, this process is not straightforward.
Serialization is just a fancy way of describing what you do when you want a certain data structure, class, etc., to be transmitted.
For example, say I have a structure:
struct Color
{
    int R, G, B;
};
When you transmit this over a network, you don't say "send Color". You create a sequence of bits and send it. I could create an unsigned char*, concatenate R, G, and B, and then send that. I've just done serialization.
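In .NET terms, the same hand-rolled serialization might look like this (a sketch; the field order and widths are simply the convention you invent):

using System;
using System.IO;

class ManualSerialization
{
    static void Main()
    {
        int r = 255, g = 128, b = 0; // the fields of the Color struct above

        var ms = new MemoryStream();
        using (var writer = new BinaryWriter(ms))
        {
            // The convention: three 4-byte little-endian ints, in R, G, B order.
            writer.Write(r);
            writer.Write(g);
            writer.Write(b);
        }

        byte[] wire = ms.ToArray(); // the "sequence of bits" to send
        Console.WriteLine(wire.Length); // 12
    }
}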
Serialization of some kind is required, but it can take many forms. It can be something like .NET serialization, handled by the framework, or it can be a custom-built format. Maybe a series of bytes where each byte represents some "magic value" that only you and your application understand.
For example, in .NET I can create a class with a single string property, mark it as serializable, and the framework takes care of most everything else.
I can also build my own custom format where the first 4 bytes represent the length of the data being sent and all subsequent bytes are the characters of a string. But then, of course, you need to worry about byte ordering, Unicode vs. ANSI encodings, etc.
Typically it is easier to make use of whatever your language/OS/dev framework provides, but it is not required.
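A sketch of the custom format just described: a 4-byte length prefix followed by the payload, with the byte order and text encoding pinned down by convention:

using System;
using System.IO;
using System.Text;

class LengthPrefixed
{
    static void Main()
    {
        var ms = new MemoryStream();

        // Write: a 4-byte little-endian length, then the UTF-8 payload.
        byte[] payload = Encoding.UTF8.GetBytes("hello");
        using (var w = new BinaryWriter(ms, Encoding.UTF8, leaveOpen: true))
        {
            w.Write(payload.Length);
            w.Write(payload);
        }

        // Read it back.
        ms.Position = 0;
        using (var r = new BinaryReader(ms))
        {
            int len = r.ReadInt32();
            string text = Encoding.UTF8.GetString(r.ReadBytes(len));
            Console.WriteLine(text); // hello
        }
    }
}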
Yes, serialization is the only way to transmit data over the wire. Consider what the purpose of serialization is: you define the way a class is stored. In memory, though, you have no way of knowing exactly where each portion of the class is. If you have, for instance, a list that was allocated early but then reallocated, it's likely to be fragmented all over the place rather than being one contiguous block of memory. How do you send that fragmented class over the line?
For that matter, if you send a List<ComplexType> over the wire, how does the other side know where each ComplexType begins and ends?
The real problem here is not getting over the wire, the problem is ending up with the same semantic object on the other side of the wire. For properly transporting data between dissimilar systems -- whether via TCP/IP, floppy, or punch card -- the data must be encoded (serialized) into a platform independent representation.
Because of alignment and type-size issues, if you attempted to do a straight binary transfer of your object it would cause Undefined Behavior (to borrow the definition from the C/C++ standards).
For example the size and alignment of the long datatype can differ between architectures, platforms, languages, and even different builds of the same compiler.
Is serialization a must in order to transfer data across the wire?
Literally no.
It is conceivable that you can move data from one address space to another without serializing it. For example, a hypothetical system using distributed virtual memory could move data / objects from one machine to another by sending pages ... without any specific serialization step.
And within a machine, objects could be transferred by switching pages from one virtual address space to another.
But in practice, the answer is yes. I'm not aware of any mainstream technology that works that way.
For anything more complex than a primitive or a homogeneous run of primitives, yes.
Binary serialization is not the only option. You can also serialize an object as an XML file, for example, or as JSON.
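For instance, a minimal sketch with the framework's XML serializer (the Point type is illustrative):

using System;
using System.IO;
using System.Xml.Serialization;

public class Point { public int X; public int Y; }

class XmlDemo
{
    static void Main()
    {
        var serializer = new XmlSerializer(typeof(Point));
        using (var sw = new StringWriter())
        {
            serializer.Serialize(sw, new Point { X = 1, Y = 2 });
            Console.WriteLine(sw); // <?xml ...?><Point ...><X>1</X><Y>2</Y></Point>
        }
    }
}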
I think you're asking the wrong question. Serialization is a concept in computer programming and there are certain requirements which must be satisfied for something to be considered a serialization mechanism.
Any means of preparing data such that it can be transmitted or stored in such a way that another program (including but not limited to another instance of the same program on another system or at another time) can read the data and re-instantiate whatever objects the data represents.
Note I slipped the term "objects" in there. If I write a program that stores a bunch of text in a file, and I later use some other program (or another instance of the first) to read that data, I haven't really used a "serialization" mechanism. If I write it in such a way that the text is also stored with some state about how it was being manipulated, that might entail serialization.
The term is used mostly to convey the concept that active combinations of behavior and state are being rendered into a form which can be read by another program/instance and instantiated. Most serialization mechanisms are bound to a particular programming language or virtual machine system (in the sense of a Java VM or a C# VM, not in the sense of "VMware" virtual machines). JSON (and YAML) are a notable exception to this: they represent data for which many languages have reasonably close object classes with reasonably similar semantics, such that the data can be instantiated in multiple different programming languages in a meaningful way.
It's not that all data transmission or storage entails "serialization"; it's that certain ways of storing and transmitting data can be used for serialization. At the very least it must be possible to disambiguate among the types of data that the programming language supports: if it reads 1, it has to know whether that's text, an integer, a real (equivalent to 1.0), or a bit.
Strictly speaking, serialization isn't the only option; you could argue that "remoting" meets the meaning in the text. There, a fake object is created at the receiver that contains no state; all calls (methods, properties, etc.) are intercepted, and only the call and its result are transferred. This avoids the need to transfer the object itself, but it can get very expensive if overly "chatty" usage is involved (i.e. lots of calls), as each call pays the speed-of-light latency, which adds up.
However, "remoting" is now rather out of fashion. Most often, yes: the object will need to be serialised and deserialized in some way (there are lots of options here). The paragraph is then pretty-much correct.
Having messages as objects and serializing them into bytes is a better way of understanding and managing what is transmitted over the wire. In the old days protocols and data were much simpler; often, programmers just put bytes onto the output stream, and common understanding was maintained through well-known, simple specifications.
I would say serialization is needed to store objects in a file for persistence, and any dynamically allocated pointers inside the objects need to be rebuilt when we deserialize. Whether serialization is needed for transfer, though, depends on the physical protocol and mechanism used: if I use a UART to transfer data, it is serialized bit by bit, but if I use a parallel port, 8 bits are transferred together, which is not serial.
