I have a performance sensitive scenario where I would like to write UTF-8 to a byte array.
A quick glance at the .NET StringBuilder class leads me to believe that it only builds UTF-16 natively. Encoding.UTF8.GetBytes(str) means extra allocations and extra clock cycles that I am not willing to spend.
Is there a native UTF-8 writer?
The MemoryStream is like a StringBuilder for bytes; you can use it to create a sequence of bytes efficiently by repeatedly appending sequences of bytes to it. It doesn't have methods to append strings of characters though. To avoid converting each string to a byte array first, you can wrap the stream in a StreamWriter which takes care of the conversion.
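For example, a minimal sketch of that approach (the example strings are placeholders, and UTF8Encoding(false) is used here just to avoid a BOM at the start of the buffer):

using System;
using System.IO;
using System.Text;

using (var stream = new MemoryStream())
using (var writer = new StreamWriter(stream, new UTF8Encoding(false)))
{
    writer.Write("Hello, ");
    writer.Write("wörld");  // non-ASCII characters become multi-byte UTF-8 sequences
    writer.Flush();         // push buffered characters through the encoder into the stream

    byte[] utf8Bytes = stream.ToArray();
    Console.WriteLine(utf8Bytes.Length); // byte count, not character count
}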
From the MSDN pages for StreamWriter and BinaryWriter you can clearly see the differences:
StreamWriter:
Implements a TextWriter for writing characters to a stream in a particular encoding.
And:
BinaryWriter:
Writes primitive types in binary to a stream and supports writing strings in a specific encoding.
I know that the most important feature of BinaryWriter is that it can write primitive types in binary, which can be more compact and efficient (consider writing the integer 23861398 - the binary writer would require 4 bytes, but the stream writer would require 8, 16, or even 32 depending on the encoding).
But when it comes to writing strings, can we say StreamWriter and BinaryWriter are interchangeable?
From the BinaryWriter documentation:
Writes a length-prefixed string to this stream in the current encoding of the BinaryWriter, and advances the current position of the stream in accordance with the encoding used and the specific characters being written to the stream.
and:
Length-prefixed means that this method first writes the length of the string, in bytes, when encoded with the BinaryWriter instance's current encoding to the stream. This value is written as an unsigned integer. This method then writes that many bytes to the stream.
For example, the string "A" has a length of 1, but when encoded with UTF-16 the length is 2 bytes, so the value written in the prefix is 2, and 3 bytes are written to the stream, including the prefix.
The StreamWriter class does not write any string lengths to the output. It just writes the text itself.
Both do respect the encoding you have specified when you create the object. The actual text being written uses that encoding.
So depending on what you mean by "interchangeable", they either are, or they are not. I would say they are not, but if all you are looking at is the sequence of bytes that represents the text written itself, one might consider them to be.
To address your comment:
I'm new to BinaryWriter, so why does it need to write the length to the stream first? Why not just write the "real data" directly?
Unlike StreamWriter, which only writes one kind of data, BinaryWriter can write all sorts of data, including raw bytes and various primitive types, to a single Stream. When writing each of these, it needs a way of indicating where that particular section of data ends, so that BinaryWriter's reading counterpart, BinaryReader, has a way of knowing where each section of data ends.
For some things, it's simple, because the data item itself is fixed size. A System.Int32 is 4 bytes, a System.Int64 is 8, and so on. They're fixed-length, so when reading you just read the expected number of bytes.
But string objects are variable length. There are two common approaches to handling this: prefixing the string with a count, or terminating the string with a null character ('\0'). In-memory .NET strings are counted, not null-terminated, so a string object can actually contain a null character, which means null-termination doesn't work for storing a string in a file.
So the prefix with the count of bytes is used instead.
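A rough sketch of the difference, just to make the byte layout concrete (the string "A" and the UTF-16 encoding mirror the documentation example quoted above):

using System;
using System.IO;
using System.Text;

string text = "A";

// BinaryWriter: a length prefix followed by the encoded characters.
byte[] viaBinaryWriter;
using (var ms = new MemoryStream())
using (var bw = new BinaryWriter(ms, Encoding.Unicode))
{
    bw.Write(text);
    bw.Flush();
    viaBinaryWriter = ms.ToArray();
}

// StreamWriter: only the encoded characters, no prefix.
byte[] viaStreamWriter;
using (var ms = new MemoryStream())
using (var sw = new StreamWriter(ms, new UnicodeEncoding(false, false)))
{
    sw.Write(text);
    sw.Flush();
    viaStreamWriter = ms.ToArray();
}

Console.WriteLine(BitConverter.ToString(viaBinaryWriter)); // e.g. 02-41-00 (prefix 2, then "A" in UTF-16)
Console.WriteLine(BitConverter.ToString(viaStreamWriter)); // e.g. 41-00   (just "A" in UTF-16)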
No, they're absolutely not interchangeable. BinaryWriter is an opinionated basic IO tool for writing to binary outputs; StreamWriter is for writing text outputs. Any time the output looks similar is purely incidental - they should absolutely not be used interchangeably.
For strings, they will look similar, but that's simply because both are ultimately just using a text encoding. The boilerplate around them is very different though (with the BinaryWriter using a length prefix, etc.).
If you're looking to use one as though it were the other: you're probably doing something very wrong. If you clarify what you're trying to do, we can probably offer guidance.
I am currently working with Unicode as bytes and using the Encoding class to get bytes and get strings back.
However, I saw there is an Encoder class and it seems to do the same thing as the Encoding class. Does anyone know what the difference between them is and when to use each of them?
Here are the Microsoft documentation pages:
Encoder: https://msdn.microsoft.com/en-us/library/system.text.encoder(v=vs.110).aspx
Encoding: https://msdn.microsoft.com/en-us/library/system.text.encoding(v=vs.110).aspx
There is definitely a difference. An Encoding is an algorithm for transforming a sequence of characters into bytes and vice versa. An Encoder is a stateful object that transforms sequences of characters into bytes. To get an Encoder object you usually call GetEncoder on an Encoding object.
Why is it necessary to have a stateful transformation? Imagine you are trying to efficiently encode long sequences of characters. You want to avoid creating a lot of arrays or one huge array, so you break the characters down into, say, reusable 1K character buffers. However, this might create some illegal character sequences, for example a UTF-16 surrogate pair broken across two separate calls to GetBytes. The Encoder object knows how to handle this and saves the necessary state across successive calls to GetBytes.
Thus you use an Encoder for transforming one block of text that is self-contained. I believe you can reuse an Encoder instance for transforms of multiple sections of text as long as you have called GetBytes with flush equal to true on the last array of characters. If you just want to easily encode short strings, use the Encoding.GetBytes methods. For decoding operations there is a similar Decoder class that holds the decoding state.
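A small sketch of that buffering scenario (the musical G clef character is just a convenient example of a surrogate pair):

using System;
using System.Text;

// "𝄞" (U+1D11E) is a surrogate pair: two chars in UTF-16.
char[] chars = "a𝄞b".ToCharArray();   // { 'a', high surrogate, low surrogate, 'b' }

Encoder encoder = Encoding.UTF8.GetEncoder();
byte[] buffer = new byte[16];

// Feed the characters in two chunks, splitting the surrogate pair across the calls.
int written1 = encoder.GetBytes(chars, 0, 2, buffer, 0, flush: false);
int written2 = encoder.GetBytes(chars, 2, 2, buffer, written1, flush: true);

// The encoder held on to the dangling high surrogate between calls, so the
// two chunks still produce one valid 4-byte UTF-8 sequence for the pair.
Console.WriteLine(BitConverter.ToString(buffer, 0, written1 + written2)); // 61-F0-9D-84-9E-62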
All of the example code I have read online regarding SerialPort reads bytes with ReadByte and then converts them to characters, instead of using ReadChar in the first place.
Is there an advantage to doing this?
The SerialPort.Encoding property is often misunderstood. The default is ASCIIEncoding, which produces ? for byte values 0x80..0xFF, so people don't like getting those question marks. But if such code converts the byte to a char directly, it is getting it really wrong: Unicode has lots of unprintable code points in that byte range, and the odds that the device actually meant to send those characters are zero. A string tends to be regarded as easier to handle than a byte[], and it is.
When you use ReadChar it is based on the encoding you are using, as @Preston Guillot said. According to the documentation of ReadChar:
This method reads one complete character based on the encoding.
Use caution when using ReadByte and ReadChar together. Switching between reading bytes and reading characters can cause extra data to be read and/or other unintended behavior. If it is necessary to switch between reading text and reading binary data from the stream, select a protocol that carefully defines the boundary between text and binary data, such as manually reading bytes and decoding the data.
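For what it's worth, a minimal sketch of the ReadChar route (the port name, baud rate, and the assumption that the device actually sends UTF-8 text are all placeholders):

using System;
using System.IO.Ports;
using System.Text;

using (var port = new SerialPort("COM1", 9600))
{
    // The default is ASCIIEncoding; only override it if the device really
    // sends text in another encoding.
    port.Encoding = Encoding.UTF8;
    port.Open();

    // ReadChar decodes one complete character using port.Encoding, so a
    // multi-byte sequence comes back as a single char rather than a '?'.
    int ch = port.ReadChar();
    Console.WriteLine((char)ch);
}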
DateTime todayDateTime = DateTime.Now;
StringBuilder todayDateTimeSB = new StringBuilder("0");
todayDateTimeSB.Append(todayDateTime.ToString("MMddyyyy"));
long todayDateTimeLongValue = Convert.ToInt64(todayDateTimeSB.ToString());
// convert to byte array packed decimal
byte[] packedDecValue = ToComp3UsingStrings(todayDateTimeLongValue);
// append each byte to the string builder
foreach (byte b in packedDecValue)
{
sb.Append(b); // bytes 56-60
}
sb.Append(' ', 37);
The above code takes the current date time, formats it into a long value, and passes that to a method which converts it to a packed decimal format. I know that the above works, since when I step through the code the byte array has the correct hex values for all of the bytes that I am expecting.
However, the above is the code I am having issues with. Specifically, I have researched and found that StringBuilder.Append(byte) actually does a ToString() on that byte, which alters the value of the byte when it adds it to the string. The question is how do I tell the StringBuilder to take the byte as-is and store it in memory without formatting/altering the value. I know that there is also an .AppendFormat() which has several overloads that use an IFormatProvider to give lots and lots of options on how to format things, but I don't see any way to tell it NOT to format/change/alter the value of the data.
You can cast the byte to a char:
sb.Append((char)b);
You can also use an ASCIIEncoding to convert all the bytes at once:
string s = Encoding.ASCII.GetString(packedDecValue);
sb.Append(s);
As noted, in a Unicode world, bytes (octets) are not characters. The CLR works with Unicode characters and represents them internally in the UTF-16 encoding. A StringBuilder builds a UTF-16 encoded Unicode string.
Once you have that UTF-16 string, however, you can re-encode it, using, say, UTF-8 or the ASCIIEncoding. However, in both of those, code points 0x0080 and higher will not be left as-is.
UTF-8 uses 2 octets for code points 0x0080–0x07FF; 3 octets for code points 0x0800–0xFFFF and so on. http://en.wikipedia.org/wiki/UTF-8#Description
The ASCII encoding is worse: per the documentation, code points outside 0x0000–0x007F are simply chucked:
If you use the default encoder returned by the Encoding.ASCII property or the ASCIIEncoding constructor, characters outside that range are replaced with a question mark (?) before the encoding operation is performed.
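A quick illustration of both behaviours (the character 'é', U+00E9, is just an arbitrary example above U+007F):

using System;
using System.Text;

string s = "é";   // U+00E9, outside the ASCII range

byte[] utf8  = Encoding.UTF8.GetBytes(s);   // 2 octets: C3-A9
byte[] ascii = Encoding.ASCII.GetBytes(s);  // 1 octet: 3F, i.e. '?', the character is lost

Console.WriteLine(BitConverter.ToString(utf8));
Console.WriteLine(BitConverter.ToString(ascii));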
If you need to send a stream of octets unscathed, you are better off using a System.IO.MemoryStream wrapped in a StreamReader and StreamWriter.
You can then access the MemoryStream's backing store via its GetBuffer() method or its ToArray() method. GetBuffer() gives you a reference to the actual backing store; however, it likely contains allocated but unused bytes, so you need to check the stream's Length and Capacity. ToArray() allocates a new array and copies the actual stream content into it, so the array reference you receive is the correct length.
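Roughly, the difference looks like this (the exact Capacity depends on how the stream has grown; 256 is just what a default MemoryStream typically expands to first):

using System;
using System.IO;

var stream = new MemoryStream();
byte[] data = { 0x01, 0x02, 0x03 };
stream.Write(data, 0, data.Length);

byte[] backing = stream.GetBuffer();  // the actual backing store, Capacity bytes long
byte[] copy    = stream.ToArray();    // a new array trimmed to Length bytes

Console.WriteLine(stream.Length);     // 3
Console.WriteLine(stream.Capacity);   // larger, e.g. 256
Console.WriteLine(backing.Length);    // == Capacity
Console.WriteLine(copy.Length);       // == Length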
I am working in C#, trying the below code:
byte[] buffer = new byte[str.Length];
buffer = Encoding.UTF8.GetBytes(str);
In str I've got lengthy data, but I'm having a problem getting the complete encoded bytes.
Please tell me what's going wrong and how can I overcome this problem?
Why are you creating a new byte array and then ignoring it? The value of buffer before the call to GetBytes is being replaced with a reference to a new byte array returned by GetBytes.
However, you shouldn't expect the UTF-8 encoded version of a string to be the same length in bytes as the original string's length in characters, unless it's all ASCII. Any character over U+007F takes up at least 2 bytes.
What's the bigger picture here? What are you trying to achieve, and why does the length of the byte array matter to you?
The proper use is:
byte[] buffer = Encoding.UTF8.GetBytes(str);
In general, you should not make any assumptions about length/size/count when working with encoding, bytes and chars/strings. Let the Encoding objects do their work and then query the resulting objects for that info.
Having said that, I don't believe there is an inherent length restriction for the encoding classes. I have several production apps doing the same work in the opposite direction (bytes decoded to chars) which are processing byte arrays in the tens of megabytes.
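For example, a small sketch of why the byte count and the character count diverge (the strings are arbitrary):

using System;
using System.Text;

string plain    = "hello";
string accented = "héllo";   // 'é' needs two bytes in UTF-8

Console.WriteLine(plain.Length);                            // 5 chars
Console.WriteLine(Encoding.UTF8.GetByteCount(plain));       // 5 bytes
Console.WriteLine(accented.Length);                         // 5 chars
Console.WriteLine(Encoding.UTF8.GetByteCount(accented));    // 6 bytes

// Let the encoding size the array for you:
byte[] buffer = Encoding.UTF8.GetBytes(accented);
Console.WriteLine(buffer.Length);                           // 6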