C#: What is the difference between Text.Encoder and Text.Encoding?

I am currently working with Unicode text as bytes, using the Encoding class to get bytes from strings and strings from bytes.
However, I saw that there is also an Encoder class, and it seems to do the same thing as the Encoding class. Does anyone know what the difference between them is, and when to use each of them?
Here are the Microsoft documentation pages:
Encoder: https://msdn.microsoft.com/en-us/library/system.text.encoder(v=vs.110).aspx
Encoding: https://msdn.microsoft.com/en-us/library/system.text.encoding(v=vs.110).aspx

There is definitely a difference. An Encoding is an algorithm for transforming a sequence of characters into bytes and vice versa. An Encoder is a stateful object that transforms sequences of characters into bytes. To get an Encoder object you usually call GetEncoder on an Encoding object.
Why is it necessary to have a stateful transformation? Imagine you are trying to efficiently encode long sequences of characters. You want to avoid creating a lot of arrays or one huge array, so you break the characters down into, say, reusable 1K character buffers. However, this might produce some illegal character sequences, for example a UTF-16 surrogate pair broken across two separate calls to GetBytes. The Encoder object knows how to handle this and saves the necessary state across successive calls to GetBytes.
Thus you use an Encoder for transforming one block of text that is self-contained. I believe you can reuse an Encoder instance for transforms of multiple sections of text, as long as you have called GetBytes with flush equal to true on the last array of characters. If you just want to easily encode short strings, use the Encoding.GetBytes methods. For the decoding operations there is a similar Decoder class that holds the decoding state.
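To make the chunking scenario concrete, here is a minimal sketch of encoding a string in small character chunks with a single Encoder. The class and method names (ChunkedEncodingSketch, EncodeInChunks) are made up for illustration, and collecting the output in a List<byte> is only for brevity; real code would more likely write each chunk to a stream.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of the framework.
public static class ChunkedEncodingSketch
{
    // Encodes text in chunks of chunkSize chars, letting the Encoder
    // carry any pending state (e.g. a lone high surrogate) between calls.
    public static byte[] EncodeInChunks(string text, Encoding encoding, int chunkSize)
    {
        Encoder encoder = encoding.GetEncoder();
        char[] chars = text.ToCharArray();
        // Reusable output buffer, sized for the worst case of one chunk.
        byte[] buffer = new byte[encoding.GetMaxByteCount(chunkSize)];
        var output = new List<byte>();

        for (int offset = 0; offset < chars.Length; offset += chunkSize)
        {
            int count = Math.Min(chunkSize, chars.Length - offset);
            // flush is true only for the final chunk of this block of text.
            bool isLast = offset + count >= chars.Length;
            int written = encoder.GetBytes(chars, offset, count, buffer, 0, isLast);
            for (int i = 0; i < written; i++)
            {
                output.Add(buffer[i]);
            }
        }
        return output.ToArray();
    }
}
```

With a chunk size of 1, a character outside the BMP is necessarily split across two calls: the call that sees only the high surrogate writes nothing, and the next call writes the complete sequence. The result should match what Encoding.GetBytes produces for the whole string in one go.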

Related

Advantage in using SerialPort.ReadByte over ReadChar?

All the example code I have read online regarding SerialPort uses ReadByte and then converts the result to a character, instead of using ReadChar in the first place.
Is there an advantage in doing this?
The SerialPort.Encoding property is often misunderstood. The default is ASCIIEncoding, which produces ? for byte values 0x80..0xFF, and people don't like getting these question marks. If you see code that converts the byte to a char directly, it is getting it really wrong: Unicode has lots of unprintable code points in that byte range, and the odds that the device actually meant to send those characters are zero. A string tends to be regarded as easier to handle than a byte[], and it is.
When you use ReadChar, the result depends on the encoding you are using, as @Preston Guillot said. According to the documentation of ReadChar:
This method reads one complete character based on the encoding.
Use caution when using ReadByte and ReadChar together. Switching
between reading bytes and reading characters can cause extra data to
be read and/or other unintended behavior. If it is necessary to switch
between reading text and reading binary data from the stream, select a
protocol that carefully defines the boundary between text and binary
data, such as manually reading bytes and decoding the data.
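The "manually reading bytes and decoding the data" option from the quote could look roughly like the sketch below. It assumes the bytes have already been read into a buffer (for example via SerialPort.Read) and that you know which encoding the device uses; the class and method names are made up for illustration.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of SerialPort.
public static class ChunkedDecodingSketch
{
    // Decodes the first 'count' bytes of one received block. An incomplete
    // trailing byte sequence stays buffered inside the Decoder and is
    // completed by the next call.
    public static string DecodeChunk(Decoder decoder, byte[] bytes, int count)
    {
        char[] chars = new char[decoder.GetCharCount(bytes, 0, count)];
        int produced = decoder.GetChars(bytes, 0, count, chars, 0);
        return new string(chars, 0, produced);
    }
}
```

For example, with a decoder from Encoding.UTF8.GetDecoder(), a first block of 0x48 0xC3 yields "H" and a second block of 0xA9 yields "é", because the decoder remembers the dangling 0xC3 between the two calls.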

How to write with a single-byte character encoding?

I have a webservice that returns the config file to a low level hardware device.
The manufacturer of this device tells me it only supports single-byte character sets for this config file.
On this wiki page I found out that the following should be single byte character sets:
ISO 8859
ISO/IEC 646 (I could not find this one here)
various Microsoft/IBM code pages
But when I call Encoding.GetMaxByteCount(1) on these character sets it always returns 2.
I also tried various other encodings (for instance IBM437), but GetMaxByteCount also returns 2 for other character sets.
The Encoding.IsSingleByte property seems unreliable according to this:
You should be careful in what your application does with the value for
IsSingleByte. An assumption of how an Encoding will proceed may still
be wrong. For example, Windows-1252 has a value of true for
Encoding.IsSingleByte, but Encoding.GetMaxByteCount(1) returns 2. This
is because the method considers potential leftover surrogates from a
previous decoder operation.
Also the method Encoding.GetMaxByteCount has some of the same issues according to this
Note that GetMaxByteCount considers potential leftover surrogates from
a previous decoder operation. Because of the decoder, passing a value
of 1 to the method retrieves 2 for a single-byte encoding, such as
ASCII. Your application should use the IsSingleByte property if this
information is necessary.
Because of this I am not sure anymore on what to use.
Further reading.
Basically, GetMaxByteCount considers an edge-case that you will probably never need in regular code, specifically what it says about the decoder and surrogates. The point here is that some code-points are encoded as surrogate pairs, which in unfortunate cases can mean that it straddles two calls to GetBytes() / GetChars (on the encoder/decoder). As a consequence, the implementation may theoretically have a single byte/character still buffered and waiting to be processed, therefore GetMaxByteCount needs to warn about this.
However! All of this only makes sense if you are using the encoder/decoder directly. If you are using operations on the Encoding, such as Encoding.GetBytes, then all of this is abstracted away from you and you will never need to know. In which case, just use IsSingleByte and you'll be fine.
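As a rough sketch of that advice for the config-file case: pick a single-byte encoding and call Encoding.GetBytes on it directly. ISO-8859-1 is used here purely as an example (which character set the device really expects is something only the manufacturer can say), and the class and member names are made up. The exception fallback is my own addition so that unrepresentable characters fail loudly instead of being replaced silently.

```csharp
using System;
using System.Text;

// Hypothetical helper; ISO-8859-1 is an assumed choice of character set.
public static class SingleByteConfigSketch
{
    private static readonly Encoding Latin1Strict = Encoding.GetEncoding(
        "iso-8859-1",
        EncoderFallback.ExceptionFallback,
        DecoderFallback.ExceptionFallback);

    // True for ISO-8859-1: every character it can encode takes one byte.
    public static bool IsSingleByte
    {
        get { return Latin1Strict.IsSingleByte; }
    }

    // Encodes the config text one byte per character. Throws
    // EncoderFallbackException if a character is not in ISO-8859-1.
    public static byte[] EncodeConfig(string configText)
    {
        return Latin1Strict.GetBytes(configText);
    }
}
```

The length of the returned array equals the number of characters in the input, regardless of what GetMaxByteCount(1) reports.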
Maybe you should use the example from Encoding.Convert Method page on MSDN
The Encoding.Convert method should provide an ASCII encoded string. Hopefully single byte..

Is there a .NET StringBuilder that uses UTF-8 directly?

I have a performance sensitive scenario where I would like to write UTF-8 to a byte array.
A quick glance at the .NET StringBuilder class leads me to believe that it only builds UTF-16 natively. Encoding.UTF8.GetBytes(str) means extra allocations and extra clock cycles that I am not willing to spend.
Is there a native UTF-8 writer?
The MemoryStream is like a StringBuilder for bytes; you can use it to create a sequence of bytes efficiently by repeatedly appending sequences of bytes to it. It doesn't have methods to append strings of characters though. To avoid converting each string to a byte array first, you can wrap the stream in a StreamWriter which takes care of the conversion.
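A minimal sketch of that MemoryStream plus StreamWriter combination follows. The class and method names are made up for illustration, and passing new UTF8Encoding(false) is an assumption on my part that you don't want a byte order mark at the start of the array.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper, not part of the framework.
public static class Utf8BuilderSketch
{
    // Appends each string to an in-memory stream as UTF-8 and returns the bytes.
    public static byte[] BuildUtf8(params string[] parts)
    {
        using (var stream = new MemoryStream())
        {
            // UTF8Encoding(false) means no byte order mark is emitted.
            using (var writer = new StreamWriter(stream, new UTF8Encoding(false)))
            {
                foreach (string part in parts)
                {
                    writer.Write(part);
                }
            } // Disposing the writer flushes its buffered output into the stream.

            // ToArray still works after the writer has closed the stream.
            return stream.ToArray();
        }
    }
}
```

Whether this is actually faster than Encoding.UTF8.GetBytes for your workload is something to measure; the StreamWriter keeps its own internal buffers.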

C#: String -> MD5 -> Hex

In languages like PHP or Python there are convenient functions that turn an input string into an output string containing the hex representation of its hash.
I find it a very common and useful task (password storing and checking, checksum of file content...), but in .NET, as far as I know, you can only work on byte streams.
A function to do the work is easy to put together (e.g. http://blog.stevex.net/index.php/c-code-snippet-creating-an-md5-hash-string/), but I'd like to know whether I'm missing something, using the wrong pattern, or there is simply no such thing in .NET.
Thanks
The method you linked to seems right; a slightly different method is shown in the MSDN C# FAQ.
A comment suggests you can use:
System.Web.Security.FormsAuthentication.HashPasswordForStoringInConfigFile(string, "MD5");
Yes you can only work with bytes (as far as I know). But you can turn those bytes easily into their hex representation by looping through them and doing something like:
myByte.ToString("x2");
And you can get the bytes that make up the string using:
System.Text.Encoding.UTF8.GetBytes(myString);
So it could be done in a couple lines.
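Put together, those pieces might look like the sketch below. The class and method names are made up, and UTF-8 is an assumed choice of encoding; a different encoding will give a different hash for non-ASCII input.

```csharp
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper, not part of the framework.
public static class Md5HexSketch
{
    // Returns the MD5 hash of the UTF-8 bytes of input as lowercase hex.
    public static string Md5Hex(string input)
    {
        byte[] data = Encoding.UTF8.GetBytes(input);
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(data);
            var hex = new StringBuilder(hash.Length * 2);
            foreach (byte b in hash)
            {
                // "x2" formats each byte as two lowercase hex digits.
                hex.Append(b.ToString("x2"));
            }
            return hex.ToString();
        }
    }
}
```

For example, Md5Hex("hello") returns "5d41402abc4b2a76b9719d911017c592".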
One problem is with the very concept of "the HEXed representation of [a string]".
A string is a sequence of characters. How those characters are represented as individual bits depends on the encoding. The "native" encoding to .NET is UTF-16, but usually a more compact representation is achieved (while preserving the ability to encode any string) using UTF-8.
You can use Encoding.GetBytes to get the encoded version of a string once you've chosen an appropriate encoding - but the fact that there is that choice to make is the reason that there aren't many APIs which go straight from string to base64/hex or which perform encryption/hashing directly on strings. Any such APIs which do exist will almost certainly be doing the same three steps underneath: encode to a byte array, perform the appropriate binary operation, convert the opaque binary result to hex/base64.
(That makes me wonder whether it wouldn't be worth writing a utility class which could take an encoding, a Func<byte[], byte[]> and an output format such as hex/base64 - that could represent an arbitrary binary operation applied to a string.)
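One possible shape for such a utility, purely as a sketch: the names below (StringBinaryOpSketch, Apply, ToHex) are invented, and the output format is passed as a delegate rather than an enum just to keep the example short.

```csharp
using System;
using System.Text;

// Hypothetical utility; nothing like this ships with .NET as far as I know.
public static class StringBinaryOpSketch
{
    // Encodes input with the given encoding, applies an arbitrary binary
    // operation to the bytes, and formats the resulting bytes as text.
    public static string Apply(
        string input,
        Encoding encoding,
        Func<byte[], byte[]> operation,
        Func<byte[], string> format)
    {
        byte[] encoded = encoding.GetBytes(input);
        byte[] result = operation(encoded);
        return format(result);
    }

    // Lowercase hex formatter, e.g. { 0x68, 0x69 } becomes "6869".
    public static string ToHex(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", "").ToLowerInvariant();
    }
}
```

A call such as Apply("hi", Encoding.UTF8, b => b, b => Convert.ToBase64String(b)) gives "aGk=", and swapping the identity lambda for a hash function's ComputeHash would give the hashed variant.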

Create SecureString from unmanaged unicode string

I am wanting to try to tie the CryptUnprotectData windows API function and the .net SecureString together the best way possible. CryptUnprotectData returns a DATA_BLOB structure consisting of an array of bytes and a byte length. In my program this will be a Unicode UTF-16 string. SecureString has a constructor which takes a char* and length params, so I would like to be able to do something like:
SecureString ss = new SecureString((char*)textBlob.pbData, textBlob.cbData / 2);
This works, except UTF-16 is variable length, so I don't really know what to use as the length argument. The above example assumes 2 byte characters (BMP), but for other planes it could be up to 4 bytes. I need to know the number of UTF-16 characters in the byte array. What is the best way to do this without copying the values around in memory (thereby compromising security). I plan on zeroing out and freeing the byte array as quickly as possible.
Most of the Windows API deals with UTF-16 code units as far as I'm aware - in other words, you treat a surrogate pair as two code units instead of a single character. Given that the constructor for SecureString is dealing with a pointer to .NET System.Char values (which are UTF-16 code units), I think the code snippet you've got is fine - the number of elements in pbData is half its size in bytes.
For instance if pbData contained (just) a surrogate pair, cbData would be 4 and you'd still want to pass in 2 as the second argument - because that's the number of System.Char values you're constructing the SecureString from. The fact that it's one non-BMP unicode character is irrelevant to the number of UTF-16 System.Char values it's represented in.
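The arithmetic can be checked without any unmanaged code. The sketch below uses a managed byte array as a stand-in for the decrypted blob; the class and method names are made up for illustration.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the framework.
public static class Utf16LengthSketch
{
    // Number of System.Char values (UTF-16 code units) in a UTF-16 buffer
    // of the given size in bytes. Surrogate pairs count as two.
    public static int CharCountFromByteLength(int byteLength)
    {
        if (byteLength % 2 != 0)
        {
            throw new ArgumentException("UTF-16 data must have an even byte length.", "byteLength");
        }
        return byteLength / 2;
    }

    // Encodes text as little-endian UTF-16, standing in for the blob contents.
    public static byte[] ToUtf16Bytes(string text)
    {
        return Encoding.Unicode.GetBytes(text);
    }
}
```

For a string holding a single non-BMP character such as "\U0001F600", ToUtf16Bytes returns 4 bytes, CharCountFromByteLength(4) is 2, and the string's Length property is also 2, which is the count the SecureString constructor expects.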
(And yes, the support for non-BMP data is a bit of a mess, and I suspect very few people get it right everywhere. I'm sure I don't. Fortunately in many places you don't need to worry...)
