1:1 decoding of UTF-8 octets for visualization - C#

I'm making a tool (C#, WPF) for viewing binary data which may contain embedded text. It's traditional for such data viewers to use two vertical columns, one displaying the hexadecimal value of each byte and the other displaying the ASCII character corresponding to each byte, if printable.
I've been thinking it would be nice to support display of embedded text in non-ASCII encodings as well, in particular UTF-8 and UTF-16. The issue is that UTF code points don't map 1:1 to octets. I would like to keep the output grid-aligned according to its location in the data, so every octet needs to map to something that appears in the corresponding cell of the grid. What I'm thinking is that the final octet of each encoded code point maps to the resulting Unicode character, lead bytes map to placeholders that vary with sequence length (perhaps circled forms, colored to distinguish them from the actual encoded characters), and continuation and invalid bytes likewise map to placeholders.
struct UtfOctetVisualization
{
    public enum Classification
    {
        Ascii,
        NonAscii,
        LeadByteOf2,
        LeadByteOf3,
        LeadByteOf4,
        Continuation,
        Error
    }

    public Classification OctetClass;
    public int CodePoint; // valid only when OctetClass == Ascii or NonAscii
}
The Encoding.UTF8.GetString() method doesn't provide any information about the location each resulting character came from.
I could use Encoding.UTF8.GetDecoder() and call Convert passing a single byte at a time so that the completed output parameter gives a classification for each octet.
But in both methods, in order to have handling of invalid characters, I would need to implement a DecoderFallback class? This looks complicated.
Is there a simple way to get this information using the APIs provided with .NET (in System.Text or otherwise)? Using System.Text.Decoder, what would the fallback look like that fills in an output array shared with the decoder?
Or is it more feasible to write a custom UTF-8 recognizer (finite state machine)?
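For what it's worth, the per-octet part of such a recognizer is small, since an octet's class is determined entirely by its leading bits; here is a minimal sketch (an illustrative helper, not a .NET API, reusing the Classification enum above -- sequence-level validation such as overlong forms and surrogates would still need the surrounding state machine):
static UtfOctetVisualization.Classification ClassifyOctet(byte b)
{
    if (b <= 0x7F) return UtfOctetVisualization.Classification.Ascii;                   // 0xxxxxxx
    if ((b & 0xC0) == 0x80) return UtfOctetVisualization.Classification.Continuation;   // 10xxxxxx
    if ((b & 0xE0) == 0xC0) return UtfOctetVisualization.Classification.LeadByteOf2;    // 110xxxxx
    if ((b & 0xF0) == 0xE0) return UtfOctetVisualization.Classification.LeadByteOf3;    // 1110xxxx
    if ((b & 0xF8) == 0xF0) return UtfOctetVisualization.Classification.LeadByteOf4;    // 11110xxx
    return UtfOctetVisualization.Classification.Error;                                  // 0xF8..0xFF never occur in UTF-8
}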

How about decoding one character at a time so that you can capture the number of bytes each character occupies. Something like this:
string data = "hello????";
byte[] buffer = new byte[Encoding.UTF8.GetByteCount(data)];
int bufferIndex = 0;
for (int i = 0; i < data.Length; i++)
{
    int bytes = Encoding.UTF8.GetBytes(data, i, 1, buffer, bufferIndex);
    Console.WriteLine("Character: {0}, Position: {1}, Bytes: {2}", data[i], i, bytes);
    bufferIndex += bytes;
}
Fiddle: https://dotnetfiddle.net/poohHM
Those "????" in the string are supposed to be multi-byte characters, but SO doesn't let me paste them in. See the Fiddle.
I don't think this is going to work out the way you want when you mix binary stuff with characters, as @Jon has pointed out. I mean you'll see something, but it may not be what you expect, because the encoder won't be able to distinguish which bytes are supposed to be characters.

Related

C# BinaryWriter Write Method String Size

When writing a string to a binary file using C#, the length (in bytes) is automatically prepended to the output. According to the MSDN documentation this is an unsigned integer, yet in the example it occupies only a single byte. The example they give is that a single two-byte UTF-8 character would produce three written bytes: 1 size byte and 2 bytes for the character. This is fine for strings up to 255 bytes long, and matches the behaviour I've observed.
However, if your string is longer than 255 bytes, the size of the unsigned integer grows as necessary. As a simple example, consider a 1024-character string built as:
string header = "ABCDEFGHIJKLMNOP";
for (int ii = 0; ii < 63; ii++)
{
    header += "ABCDEFGHIJKLMNOP";
}
fileObject.Write(header);
results in a 2-byte length prefix before the string. Creating a string of length 2^17 results in a somewhat maddening 3-byte prefix.
The question, therefore, is how to know how many bytes to read to get the size of what follows when reading? I wouldn't necessarily know a priori the header size. Ultimately, can I force the Write(string) method to always use a consistent size (say 2 bytes)?
A possible workaround is to write my own write(string) method, but I would like to avoid that for obvious reasons (similar questions here and here accept this as an answer). Another more palatable workaround is to have the reader look for a specific character that starts the ASCII string information (maybe an unprintable character?), but that is not infallible. A final workaround (that I can think of) would be to force the string to be within the range of sizes for a particular number of size bytes; again, that is non-ideal.
While forcing the size of the byte array to be consistent is the easiest, I have control over the reader so any clever reader solutions are also welcome.
BinaryWriter and BinaryReader aren't the only way of writing binary data; simply: they provide a convention that is shared between that specific reader and writer. No, you can't tell them to use another convention - unless of course you subclass both of them and override the ReadString and Write(string) methods completely.
If you want to use a different convention, then simply: don't use BinaryReader and BinaryWriter. It is pretty easy to talk to a Stream directly using any text Encoding you want to get hold of the bytes and the byte count. Then you can use whatever convention you want. If you only ever need to write strings up to 65k then sure: use fixed 2 bytes (unsigned short). You'll also need to decide which byte comes first, of course (the "endianness").
As for the size of the prefix: it is essentially using:
int byteCount = this._encoding.GetByteCount(value);
this.Write7BitEncodedInt(byteCount);
with:
protected void Write7BitEncodedInt(int value)
{
    uint num = (uint) value;
    while (num >= 0x80)
    {
        this.Write((byte) (num | 0x80));
        num = num >> 7;
    }
    this.Write((byte) num);
}
This type of encoding of lengths is pretty common - it is the same idea as the "varint" that "protobuf" uses, for example (base-128, least significant group first, retaining bit order in 7-bit groups, 8th bit as continuation)
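The reader side is the mirror image of that loop; a minimal sketch of the base-128 decode (equivalent in spirit to BinaryReader's protected Read7BitEncodedInt):
// needs using System.IO;
static int Read7BitEncodedInt(Stream stream)
{
    int result = 0;
    int shift = 0;
    while (true)
    {
        int b = stream.ReadByte();
        if (b < 0) throw new EndOfStreamException();
        result |= (b & 0x7F) << shift;        // low 7 bits carry the payload
        if ((b & 0x80) == 0) return result;   // high bit clear marks the last group
        shift += 7;
        if (shift > 28) throw new FormatException("Too many bytes in 7-bit encoded int.");
    }
}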
If you want to write the length yourself:
using (var bw = new BinaryWriter(fs))
{
    bw.Write(length); // Use a byte, a short...
    bw.Write(Encoding.Unicode.GetBytes("Your string"));
}
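A matching reader for that hand-rolled convention might look like the following sketch, assuming an unsigned 2-byte little-endian length prefix followed by UTF-16 payload bytes:
using (var br = new BinaryReader(fs))
{
    ushort byteCount = br.ReadUInt16();                 // fixed 2-byte length prefix
    byte[] payload = br.ReadBytes(byteCount);           // exactly that many string bytes
    string text = Encoding.Unicode.GetString(payload);  // decode with the agreed encoding
}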

Encryption/Decryption invalid parameters using DES algorithm [duplicate]

How do I convert a string to a byte[] in .NET (C#) without manually specifying a specific encoding?
I'm going to encrypt the string. I can encrypt it without converting, but I'd still like to know why encoding comes into play here.
Also, why should encoding even be taken into consideration? Can't I simply get what bytes the string has been stored in? Why is there a dependency on character encodings?
Contrary to the answers here, you DON'T need to worry about encoding if the bytes don't need to be interpreted!
Like you mentioned, your goal is, simply, to "get what bytes the string has been stored in".
(And, of course, to be able to re-construct the string from the bytes.)
For those goals, I honestly do not understand why people keep telling you that you need the encodings. You certainly do NOT need to worry about encodings for this.
Just do this instead:
static byte[] GetBytes(string str)
{
    byte[] bytes = new byte[str.Length * sizeof(char)];
    System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
    return bytes;
}

// Do NOT use on arbitrary bytes; only use on GetBytes's output on the SAME system
static string GetString(byte[] bytes)
{
    char[] chars = new char[bytes.Length / sizeof(char)];
    System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
    return new string(chars);
}
As long as your program (or other programs) don't try to interpret the bytes somehow, which you obviously didn't mention you intend to do, then there is nothing wrong with this approach! Worrying about encodings just makes your life more complicated for no real reason.
Additional benefit to this approach: It doesn't matter if the string contains invalid characters, because you can still get the data and reconstruct the original string anyway!
It will be encoded and decoded just the same, because you are just looking at the bytes.
If you used a specific encoding, though, it would've given you trouble with encoding/decoding invalid characters.
It depends on the encoding of your string (ASCII, UTF-8, ...).
For example:
byte[] b1 = System.Text.Encoding.UTF8.GetBytes (myString);
byte[] b2 = System.Text.Encoding.ASCII.GetBytes (myString);
A small sample why encoding matters:
string pi = "\u03a0";
byte[] ascii = System.Text.Encoding.ASCII.GetBytes (pi);
byte[] utf8 = System.Text.Encoding.UTF8.GetBytes (pi);
Console.WriteLine (ascii.Length); //Will print 1
Console.WriteLine (utf8.Length); //Will print 2
Console.WriteLine (System.Text.Encoding.ASCII.GetString (ascii)); //Will print '?'
ASCII simply isn't equipped to deal with special characters.
Internally, the .NET framework uses UTF-16 to represent strings, so if you simply want to get the exact bytes that .NET uses, use System.Text.Encoding.Unicode.GetBytes (...).
See Character Encoding in the .NET Framework (MSDN) for more information.
The accepted answer is very, very complicated. Use the included .NET classes for this:
const string data = "A string with international characters: Norwegian: ÆØÅæøå, Chinese: 喂 谢谢";
var bytes = System.Text.Encoding.UTF8.GetBytes(data);
var decoded = System.Text.Encoding.UTF8.GetString(bytes);
Don't reinvent the wheel if you don't have to...
BinaryFormatter bf = new BinaryFormatter();
byte[] bytes;
MemoryStream ms = new MemoryStream();
string orig = "喂 Hello 谢谢 Thank You";
bf.Serialize(ms, orig);
ms.Seek(0, 0);
bytes = ms.ToArray();
MessageBox.Show("Original bytes Length: " + bytes.Length.ToString());
MessageBox.Show("Original string Length: " + orig.Length.ToString());
for (int i = 0; i < bytes.Length; ++i) bytes[i] ^= 168; // pseudo encrypt
for (int i = 0; i < bytes.Length; ++i) bytes[i] ^= 168; // pseudo decrypt
BinaryFormatter bfx = new BinaryFormatter();
MemoryStream msx = new MemoryStream();
msx.Write(bytes, 0, bytes.Length);
msx.Seek(0, 0);
string sx = (string)bfx.Deserialize(msx);
MessageBox.Show("Still intact :" + sx);
MessageBox.Show("Deserialize string Length(still intact): "
+ sx.Length.ToString());
BinaryFormatter bfy = new BinaryFormatter();
MemoryStream msy = new MemoryStream();
bfy.Serialize(msy, sx);
msy.Seek(0, 0);
byte[] bytesy = msy.ToArray();
MessageBox.Show("Deserialize bytes Length(still intact): "
+ bytesy.Length.ToString());
This is a popular question. It is important to understand what the question author is asking, and that it is different from what is likely the most common need. To discourage misuse of the code where it is not needed, I've answered the latter first.
Common Need
Every string has a character set and encoding. When you convert a System.String object to an array of System.Byte you still have a character set and encoding. For most usages, you'd know which character set and encoding you need and .NET makes it simple to "copy with conversion." Just choose the appropriate Encoding class.
// using System.Text;
Encoding.UTF8.GetBytes(".NET String to byte array")
The conversion may need to handle cases where the target character set or encoding doesn't support a character that's in the source. You have some choices: exception, substitution, or skipping. The default policy is to substitute a '?'.
// using System.Text;
var text = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes("You win €100"));
// -> "You win ?100"
Clearly, conversions are not necessarily lossless!
Note: For System.String the source character set is Unicode.
The only confusing thing is that .NET uses the name of a character set for the name of one particular encoding of that character set. Encoding.Unicode should be called Encoding.UTF16.
That's it for most usages. If that's what you need, stop reading here. See the fun Joel Spolsky article if you don't understand what encoding is.
Specific Need
Now, what the question author asks is, "Every string is stored as an array of bytes, right? Why can't I simply have those bytes?"
He doesn't want any conversion.
From the C# spec:
Character and string processing in C# uses Unicode encoding. The char
type represents a UTF-16 code unit, and the string type represents a
sequence of UTF-16 code units.
So, we know that if we ask for the null conversion (i.e., from UTF-16 to UTF-16), we'll get the desired result:
Encoding.Unicode.GetBytes(".NET String to byte array")
But to avoid the mention of encodings, we must do it another way. If an intermediate data type is acceptable, there is a conceptual shortcut for this:
".NET String to byte array".ToCharArray()
That doesn't get us the desired datatype but Mehrdad's answer shows how to convert this Char array to a Byte array using BlockCopy. However, this copies the string twice! And, it too explicitly uses encoding-specific code: the datatype System.Char.
The only way to get to the actual bytes the String is stored in is to use a pointer. The fixed statement allows taking the address of values. From the C# spec:
[For] an expression of type string, ... the initializer computes the
address of the first character in the string.
To do so, the compiler writes code skipping over the other parts of the string object with RuntimeHelpers.OffsetToStringData. So, to get the raw bytes, just create a pointer to the string and copy the number of bytes needed.
// using System.Runtime.InteropServices
unsafe byte[] GetRawBytes(String s)
{
    if (s == null) return null;
    var codeunitCount = s.Length;
    /* We know that String is a sequence of UTF-16 code units
       and such code units are 2 bytes */
    var byteCount = codeunitCount * 2;
    var bytes = new byte[byteCount];
    fixed (void* pRaw = s)
    {
        Marshal.Copy((IntPtr)pRaw, bytes, 0, byteCount);
    }
    return bytes;
}
As @CodesInChaos pointed out, the result depends on the endianness of the machine. But the question author is not concerned with that.
You need to take the encoding into account, because 1 character could be represented by 1 or more bytes (up to 4 in UTF-8, for example), and different encodings will treat these bytes differently.
Joel has a posting on this:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
The first part of your question (how to get the bytes) was already answered by others: look in the System.Text.Encoding namespace.
I will address your follow-up question: why do you need to pick an encoding? Why can't you get that from the string class itself?
The answer is in two parts.
First of all, the bytes used internally by the string class don't matter, and whenever you assume they do you're likely introducing a bug.
If your program is entirely within the .Net world then you don't need to worry about getting byte arrays for strings at all, even if you're sending data across a network. Instead, use .Net Serialization to worry about transmitting the data. You don't worry about the actual bytes anymore: the Serialization formatter does it for you.
On the other hand, what if you are sending these bytes somewhere that you can't guarantee will pull in data from a .Net serialized stream? In this case, you definitely do need to worry about encoding, because obviously, this external system cares. So again, the internal bytes used by the string don't matter: you need to pick an encoding so you can be explicit about this encoding on the receiving end, even if it's the same encoding used internally by .Net.
I understand that in this case, you might prefer to use the actual bytes stored by the string variable in memory where possible, with the idea that it might save some work creating your byte stream. However, I put it to you it's just not important compared to making sure that your output is understood at the other end, and to guarantee that you must be explicit with your encoding. Additionally, if you really want to match your internal bytes, you can already just choose the Unicode encoding, and get those performance savings.
This brings me to the second part... picking the Unicode encoding is telling .Net to use the underlying bytes. You do need to pick this encoding because when some new-fangled Unicode-Plus comes out the .Net runtime needs to be free to use this newer, better encoding model without breaking your program. But, for the moment (and foreseeable future), just choosing the Unicode encoding gives you what you want.
It's also important to understand that your string has to be written to the wire, and that involves at least some translation of the bit-pattern even when you use a matching encoding. The computer needs to account for things like big vs. little endian, network byte order, packetization, session information, etc.
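To make that concrete, a minimal sketch of being explicit about an encoding that happens to match the internal one:
// Explicitly UTF-16LE on the way out and back; no reliance on internals.
byte[] wire = Encoding.Unicode.GetBytes("café");
string back = Encoding.Unicode.GetString(wire);   // round-trips exactly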
Just to demonstrate that Mehrdad's sound answer works: his approach can even persist unpaired surrogate characters (a criticism many have leveled against my answer, but of which everyone is equally guilty, e.g. System.Text.Encoding.UTF8.GetBytes, System.Text.Encoding.Unicode.GetBytes; those encoding methods can't persist the high surrogate character d800, for example, and merely replace high surrogate characters with the value fffd):
using System;

class Program
{
    static void Main(string[] args)
    {
        string t = "爱虫";
        string s = "Test\ud800Test";
        byte[] dumpToBytes = GetBytes(s);
        string getItBack = GetString(dumpToBytes);
        foreach (char item in getItBack)
        {
            Console.WriteLine("{0} {1}", item, ((ushort)item).ToString("x"));
        }
    }

    static byte[] GetBytes(string str)
    {
        byte[] bytes = new byte[str.Length * sizeof(char)];
        System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
        return bytes;
    }

    static string GetString(byte[] bytes)
    {
        char[] chars = new char[bytes.Length / sizeof(char)];
        System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
        return new string(chars);
    }
}
Output:
T 54
e 65
s 73
t 74
? d800
T 54
e 65
s 73
t 74
Try that with System.Text.Encoding.UTF8.GetBytes or System.Text.Encoding.Unicode.GetBytes; they will merely replace high surrogate characters with the value fffd.
Every time there's movement on this question, I'm still thinking of a serializer (be it from Microsoft or from a 3rd-party component) that can persist strings even if they contain unpaired surrogate characters; I google this every now and then: serialization unpaired surrogate character .NET. This doesn't make me lose any sleep, but it's kind of annoying when somebody comments on my answer that it's flawed, yet their answers are equally flawed when it comes to unpaired surrogate characters.
Darn, Microsoft should have just used System.Buffer.BlockCopy in its BinaryFormatter ツ
谢谢!
Try this, a lot less code:
System.Text.Encoding.UTF8.GetBytes("TEST String");
Well, I've read all the answers, and they were either about using an encoding or about serialization, which drops unpaired surrogates.
It's bad when the string, for example, comes from SQL Server where it was built from a byte array storing, for example, a password hash. If we drop anything from it, it'll store an invalid hash, and if we want to store it in XML, we want to leave it intact (because the XML writer throws an exception on any unpaired surrogate it finds).
So I use Base64 encoding of byte arrays in such cases, but hey, on the Internet there is only one solution to this in C#, it has a bug in it, and it only goes one way, so I've fixed the bug and written the reverse procedure. Here you are, future googlers:
public static byte[] StringToBytes(string str)
{
    byte[] data = new byte[str.Length * 2];
    for (int i = 0; i < str.Length; ++i)
    {
        char ch = str[i];
        data[i * 2] = (byte)(ch & 0xFF);
        data[i * 2 + 1] = (byte)((ch & 0xFF00) >> 8);
    }
    return data;
}

public static string StringFromBytes(byte[] arr)
{
    char[] ch = new char[arr.Length / 2];
    for (int i = 0; i < ch.Length; ++i)
    {
        ch[i] = (char)((int)arr[i * 2] + (((int)arr[i * 2 + 1]) << 8));
    }
    return new String(ch);
}
Also please explain why encoding should be taken into consideration.
Can't I simply get what bytes the string has been stored in?
Why this dependency on encoding?!!!
Because there is no such thing as "the bytes of the string".
A string (or more generically, a text) is composed of characters: letters, digits, and other symbols. That's all. Computers, however, do not know anything about characters; they can only handle bytes. Therefore, if you want to store or transmit text by using a computer, you need to transform the characters to bytes. How do you do that? Here's where encodings come to the scene.
An encoding is nothing but a convention to translate logical characters to physical bytes. The simplest and best known encoding is ASCII, and it is all you need if you write in English. For other languages you will need more complete encodings, with any of the Unicode flavours being the safest choice nowadays.
So, in short, trying to "get the bytes of a string without using encodings" is as impossible as "writing a text without using any language".
By the way, I strongly recommend you (and anyone, for that matter) to read this small piece of wisdom: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
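A short illustration of the point (nothing beyond System.Text is assumed): the same single character yields different byte sequences under different conventions.
// U+03A9 GREEK CAPITAL LETTER OMEGA under three Unicode encodings
byte[] utf8  = Encoding.UTF8.GetBytes("\u03A9");    // CE A9        (2 bytes)
byte[] utf16 = Encoding.Unicode.GetBytes("\u03A9"); // A9 03        (2 bytes, little-endian)
byte[] utf32 = Encoding.UTF32.GetBytes("\u03A9");   // A9 03 00 00  (4 bytes)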
C# to convert a string to a byte array:
public static byte[] StrToByteArray(string str)
{
    System.Text.UTF8Encoding encoding = new System.Text.UTF8Encoding();
    return encoding.GetBytes(str);
}

byte[] strToByteArray(string str)
{
    System.Text.ASCIIEncoding enc = new System.Text.ASCIIEncoding();
    return enc.GetBytes(str);
}
With the advent of Span<T> released with C# 7.2, the canonical technique to capture the underlying memory representation of a string into a managed byte array is:
byte[] bytes = "rubbish_\u9999_string".AsSpan().AsBytes().ToArray();
Converting it back should be a non-starter because that means you are in fact interpreting the data somehow, but for the sake of completeness:
string s;
unsafe
{
    fixed (char* f = &bytes.AsSpan().NonPortableCast<byte, char>().DangerousGetPinnableReference())
    {
        s = new string(f);
    }
}
The names NonPortableCast and DangerousGetPinnableReference should further the argument that you probably shouldn't be doing this.
Note that working with Span<T> requires installing the System.Memory NuGet package.
Regardless, the actual original question and follow-up comments imply that the underlying memory is not being "interpreted" (which I assume means is not modified or read beyond the need to write it as-is), indicating that some implementation of the Stream class should be used instead of reasoning about the data as strings at all.
You can use the following code for conversion between string and byte array.
string s = "Hello World";
// String to Byte[]
byte[] byte1 = System.Text.Encoding.Default.GetBytes(s);
// OR
byte[] byte2 = System.Text.ASCIIEncoding.Default.GetBytes(s);
// Byte[] to string
string str = System.Text.Encoding.UTF8.GetString(byte1);
I'm not sure, but I think the string stores its info as an array of Chars, which is inefficient with bytes. Specifically, the definition of a Char is "Represents a Unicode character".
Take this example:
String str = "asdf éß";
String str2 = "asdf gh";
EncodingInfo[] info = Encoding.GetEncodings();
foreach (EncodingInfo enc in info)
{
    System.Console.WriteLine(enc.Name + " - "
        + enc.GetEncoding().GetByteCount(str)
        + enc.GetEncoding().GetByteCount(str2));
}
Take note that the Unicode answer is 14 bytes in both instances, whereas the UTF-8 answer is only 9 bytes for the first, and only 7 for the second.
So if you just want the bytes used by the string, simply use Encoding.Unicode, but it will be inefficient with storage space.
The key issue is that a character in a string takes 16 bits (a UTF-16 code unit; code points outside the Basic Multilingual Plane take two code units, 32 bits), but a byte only has 8 bits to spare. A one-to-one mapping doesn't exist unless you restrict yourself to strings that only contain ASCII characters. System.Text.Encoding has lots of ways to map a string to byte[]; you need to pick one that avoids loss of information and that is easy to use by your client when she needs to map the byte[] back to a string.
UTF-8 is a popular encoding; it is compact and not lossy.
Use:
string text = "string";
byte[] array = System.Text.Encoding.UTF8.GetBytes(text);
The result is:
[0] = 115
[1] = 116
[2] = 114
[3] = 105
[4] = 110
[5] = 103
Fastest way
public static byte[] GetBytes(string text)
{
    return System.Text.ASCIIEncoding.UTF8.GetBytes(text);
}
EDIT
As Makotosan commented, this is now the best way:
Encoding.UTF8.GetBytes(text)
The closest approach to the OP's question is Tom Blodget's, which actually goes into the object and extracts the bytes. I say closest because it depends on implementation of the String Object.
"Can't I simply get what bytes the string has been stored in?"
Sure, but that's where the fundamental error in the question arises. The String is an object which could have an interesting data structure. We already know it does, because it allows unpaired surrogates to be stored. It might store the length. It might keep a pointer to each of the 'paired' surrogates allowing quick counting. Etc. All of these extra bytes are not part of the character data.
What you want is each character's bytes in an array. And that is where 'encoding' comes in. By default you will get UTF-16LE. If you don't care about the bytes themselves except for the round trip, then you can choose any encoding, including the 'default', and convert it back later (assuming the same parameters, such as what the default encoding was, code points, bug fixes, things allowed such as unpaired surrogates, etc.).
But why leave the 'encoding' up to magic? Why not specify the encoding so that you know what bytes you are going to get?
"Why is there a dependency on character encodings?"
Encoding (in this context) simply means the bytes that represent your string. Not the bytes of the string object. You wanted the bytes the string has been stored in -- this is where the question was asked naively. You wanted the bytes of string in a contiguous array that represent the string, and not all of the other binary data that a string object may contain.
Which means how a string is stored is irrelevant. You want a string "Encoded" into bytes in a byte array.
I like Tom Blodget's answer because he took you towards the 'bytes of the string object' direction. It's implementation dependent though, and because he's peeking at internals it might be difficult to reconstitute a copy of the string.
Mehrdad's response is wrong because it is misleading at the conceptual level. You still have a list of bytes, encoded. His particular solution allows for unpaired surrogates to be preserved -- this is implementation dependent. His particular solution would not produce the string's bytes accurately if GetBytes returned the string in UTF-8 by default.
I've changed my mind about this (Mehrdad's solution) -- this isn't getting the bytes of the string; rather it is getting the bytes of the character array that was created from the string. Regardless of encoding, the char datatype in C# is a fixed size. This allows a consistent-length byte array to be produced, and it allows the character array to be reproduced based on the size of the byte array. So if the encoding were UTF-8, but each char was 6 bytes to accommodate the largest UTF-8 value, it would still work. So indeed -- the encoding of the character does not matter.
But a conversion was used -- each character was placed into a fixed size box (c#'s character type). However what that representation is does not matter, which is technically the answer to the OP. So -- if you are going to convert anyway... Why not 'encode'?
How do I convert a string to a byte[] in .NET (C#) without manually specifying a specific encoding?
A string in .NET represents text as a sequence of UTF-16 code units, so the bytes are encoded in memory in UTF-16 already.
Mehrdad's Answer
You can use Mehrdad's answer, but it does actually use an encoding, because chars are UTF-16. It calls ToCharArray which, looking at the source, creates a char[] and copies the memory to it directly. Then it copies the data to a byte array that is also allocated. So under the hood it copies the underlying bytes twice and allocates a char array that is not used after the call.
Tom Blodget's Answer
Tom Blodget's answer is 20-30% faster than Mehrdad's since it skips the intermediate step of allocating a char array and copying the bytes to it, but it requires you to compile with the /unsafe option. If you absolutely do not want to use an encoding, I think this is the way to go. If you put your encryption logic inside the fixed block, you don't even need to allocate a separate byte array and copy the bytes to it.
Also, why should encoding be taken into consideration? Can't I simply get what bytes the string has been stored in? Why is there a dependency on character encodings?
Because that is the proper way to do it. string is an abstraction.
Using an encoding could give you trouble if you have 'strings' with invalid characters, but that shouldn't happen. If you are getting data into your string with invalid characters you are doing it wrong. You should probably be using a byte array or a Base64 encoding to start with.
If you use System.Text.Encoding.Unicode, your code will be more resilient. You don't have to worry about the endianness of the system your code will be running on. You don't need to worry if the next version of the CLR will use a different internal character encoding.
I think the question isn't why you want to worry about the encoding, but why you want to ignore it and use something else. Encoding is meant to represent the abstraction of a string in a sequence of bytes. System.Text.Encoding.Unicode will give you a little endian byte order encoding and will perform the same on every system, now and in the future.
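For instance, the byte order is part of what the encoding pins down (a minimal sketch):
byte[] le = Encoding.Unicode.GetBytes("A");          // 41 00  (UTF-16LE)
byte[] be = Encoding.BigEndianUnicode.GetBytes("A"); // 00 41  (UTF-16BE)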
You can use the following code to convert a string to a byte array in .NET
string s_unicode = "abcéabc";
byte[] utf8Bytes = System.Text.Encoding.UTF8.GetBytes(s_unicode);
If you really want a copy of the underlying bytes of a string, you can use a function like the one that follows. However, you shouldn't; please read on to find out why.
[DllImport(
    "msvcrt.dll",
    EntryPoint = "memcpy",
    CallingConvention = CallingConvention.Cdecl,
    SetLastError = false)]
private static extern unsafe void* UnsafeMemoryCopy(
    void* destination,
    void* source,
    uint count);

public static byte[] GetUnderlyingBytes(string source)
{
    var length = source.Length * sizeof(char);
    var result = new byte[length];
    unsafe
    {
        fixed (char* firstSourceChar = source)
        fixed (byte* firstDestination = result)
        {
            var firstSource = (byte*)firstSourceChar;
            UnsafeMemoryCopy(
                firstDestination,
                firstSource,
                (uint)length);
        }
    }
    return result;
}
This function will get you a copy of the bytes underlying your string, pretty quickly. You'll get those bytes in whatever way they are encoded on your system. This encoding is almost certainly UTF-16LE, but that is an implementation detail you shouldn't have to care about.
It would be safer, simpler and more reliable to just call,
System.Text.Encoding.Unicode.GetBytes()
In all likelihood this will give the same result, is easier to type, and the bytes will round-trip, as well as a byte representation in Unicode can, with a call to
System.Text.Encoding.Unicode.GetString()
Here is my unsafe implementation of String to Byte[] conversion:
public static unsafe Byte[] GetBytes(String s)
{
    Int32 length = s.Length * sizeof(Char);
    Byte[] bytes = new Byte[length];

    fixed (Char* pInput = s)
    fixed (Byte* pBytes = bytes)
    {
        Byte* source = (Byte*)pInput;
        Byte* destination = pBytes;

        if (length >= 16)
        {
            do
            {
                *((Int64*)destination) = *((Int64*)source);
                *((Int64*)(destination + 8)) = *((Int64*)(source + 8));
                source += 16;
                destination += 16;
            }
            while ((length -= 16) >= 16);
        }

        if (length > 0)
        {
            if ((length & 8) != 0)
            {
                *((Int64*)destination) = *((Int64*)source);
                source += 8;
                destination += 8;
            }
            if ((length & 4) != 0)
            {
                *((Int32*)destination) = *((Int32*)source);
                source += 4;
                destination += 4;
            }
            if ((length & 2) != 0)
            {
                *((Int16*)destination) = *((Int16*)source);
                source += 2;
                destination += 2;
            }
            if ((length & 1) != 0)
            {
                ++source;
                ++destination;
                destination[0] = source[0];
            }
        }
    }

    return bytes;
}
It's way faster than the accepted answer's, even if not as elegant.
Here are my Stopwatch benchmarks over 10000000 iterations:
[First String: Length 20]
Buffer.BlockCopy: 746ms
Unsafe: 557ms
[Second String: Length 50]
Buffer.BlockCopy: 861ms
Unsafe: 753ms
[Third String: Length 100]
Buffer.BlockCopy: 1250ms
Unsafe: 1063ms
In order to use it, you have to tick "Allow Unsafe Code" in your project build properties.
As of .NET Framework 3.5, this method can also be used as a String extension:
public static unsafe class StringExtensions
{
    public static Byte[] ToByteArray(this String s)
    {
        // Method Code
    }
}
Upon being asked what you intend to do with the bytes, you responded:
I'm going to encrypt it. I can encrypt it without converting but I'd still like to know why encoding comes to play here. Just give me the bytes is what I say.
Regardless of whether you intend to send this encrypted data over the network, load it back into memory later, or stream it to another process, you are clearly intending to decrypt it at some point. In that case, the answer is that you're defining a communication protocol. A communication protocol should not be defined in terms of implementation details of your programming language and its associated runtime. There are several reasons for this:
You may need to communicate with a process implemented in a different language or runtime. (This might include a server running on another machine or sending the string to a JavaScript browser client, for example.)
The program may be re-implemented in a different language or runtime in the future.
The .NET implementation might change the internal representation of strings. You may think this sounds farfetched, but this actually happened in Java 9 to reduce memory usage. There's no reason .NET couldn't follow suit. Skeet suggests that UTF-16 probably isn't optimal today given the rise of emoji and other blocks of Unicode needing more than 2 bytes to represent as well, increasing the likelihood that the internal representation could change in the future.
For communicating (either with a completely disparate process or with the same program in the future), you need to define your protocol strictly to minimize the difficulty of working with it or accidentally creating bugs. Depending on .NET's internal representation is not a strict, clear, or even guaranteed to be consistent definition. A standard encoding is a strict definition that will not fail you in the future.
In other words, you can't satisfy your requirement for consistency without specifying an encoding.
You may certainly choose to use UTF-16 directly if you find that your process performs significantly better since .NET uses it internally or for any other reason, but you need to choose that encoding explicitly and perform those conversions explicitly in your code rather than depending on .NET's internal implementation.
So choose an encoding and use it:
using System.Text;
// ...
Encoding.Unicode.GetBytes("abc"); // UTF-16 little endian
Encoding.UTF8.GetBytes("abc");
As you can see, it's also actually less code to just use the built in encoding objects than to implement your own reader/writer methods.
The string can be converted to a byte array in a few different ways, due to the following fact: .NET supports Unicode, and Unicode standardizes several different encodings called UTFs. They have different byte representations but are equivalent in the sense that when a string is encoded, it can be decoded back to the string; but if the string is encoded with one UTF and decoded on the assumption of a different UTF, it can be screwed up.
Also, .NET supports non-Unicode encodings, but they are not valid in the general case (they will be valid only if a limited subset of Unicode code points is used in the actual string, such as ASCII). Internally, .NET supports UTF-16, but for stream representation UTF-8 is usually used. It is also a de-facto standard for the Internet.
Not surprisingly, serialization of a string into an array of bytes and deserialization is supported by the class System.Text.Encoding, which is an abstract class; its derived classes support concrete encodings: ASCIIEncoding and four UTFs (System.Text.UnicodeEncoding supports UTF-16).
Ref this link.
For serialization to an array of bytes, use System.Text.Encoding.GetBytes. For the inverse operation use System.Text.Encoding.GetChars. This function returns an array of characters, so to get a string, use the string constructor System.String(char[]).
Ref this page.
Example:
string myString = //... some string
System.Text.Encoding encoding = System.Text.Encoding.UTF8; // or some other, but prefer a UTF if Unicode is used
byte[] bytes = encoding.GetBytes(myString);
// next lines are written in response to a follow-up question:
myString = new string(encoding.GetChars(bytes));
byte[] bytes = encoding.GetBytes(myString);
myString = new string(encoding.GetChars(bytes));
byte[] bytes = encoding.GetBytes(myString);
//how many times shall I repeat it to show there is a round-trip? :-)
It depends on what you want the bytes FOR
This is because, as Tyler so aptly said, "Strings aren't pure data. They also have information." In this case, the information is an encoding that was assumed when the string was created.
Assuming that you have binary data (rather than text) stored in a string
This is based off of OP's comment on his own question, and is the correct question if I understand OP's hints at the use-case.
Storing binary data in strings is probably the wrong approach because of the assumed encoding mentioned above! Whatever program or library stored that binary data in a string (instead of a byte[] array which would have been more appropriate) has already lost the battle before it has begun. If they are sending the bytes to you in a REST request/response or anything that must transmit strings, Base64 would be the right approach.
If you have a text string with an unknown encoding
Everybody else answered this incorrect question incorrectly.
If the string looks good as-is, just pick an encoding (preferably one starting with UTF), use the corresponding System.Text.Encoding.???.GetBytes() function, and tell whoever you give the bytes to which encoding you picked.
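If the binary payload really does have to ride inside text, a minimal Base64 sketch (values are illustrative):
byte[] payload = { 0x00, 0xFF, 0x10, 0x80 };    // arbitrary binary data
string wire = Convert.ToBase64String(payload);  // "AP8QgA=="
byte[] back = Convert.FromBase64String(wire);   // identical bytes on any system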
If you are using .NET Core or System.Memory for .NET Framework, there is a very efficient marshaling mechanism available via Span<T> and Memory<T> that can effectively reinterpret string memory as a span of bytes. Once you have a span of bytes you are free to marshal back to another type, or copy the span to an array for serialization.
To summarize what others have said:
Storing a representation of this kind of serialization is sensitive to system endianness, compiler optimizations, and changes to the internal representation of strings in the executing .NET Runtime.
Avoid long-term storage
Avoid deserializing or interpreting the string in other environments
This includes other machines, processor architectures, .NET runtimes, containers, etc.
This includes comparisons, formatting, encryption, string manipulation, localization, character transforms, etc.
Avoid making assumptions about the character encoding
The default encoding tends to be UTF-16LE in practice, but the compiler / runtime can choose any internal representation
Implementation
public static class MarshalExtensions
{
    public static ReadOnlySpan<byte> AsBytes(this string value) => MemoryMarshal.AsBytes(value.AsSpan());
    public static string AsString(this ReadOnlySpan<byte> value) => new string(MemoryMarshal.Cast<byte, char>(value));
}
Example
static void Main(string[] args)
{
    string str1 = "你好,世界";
    ReadOnlySpan<byte> span = str1.AsBytes();
    string str2 = span.AsString();

    byte[] bytes = span.ToArray();

    Debug.Assert(bytes.Length > 0);
    Debug.Assert(str1 == str2);
}
Further Insight
In C++ this is roughly equivalent to reinterpret_cast, and in C this is roughly equivalent to a cast to the system's word type (char).
In recent versions of the .NET Core Runtime (CoreCLR), operations on spans effectively invoke compiler intrinsics and various optimizations that can sometimes eliminate bounds checking, leading to exceptional performance while preserving memory safety, assuming that your memory was allocated by the CLR and the spans are not derived from pointers from an unmanaged memory allocator.
Caveats
This uses a mechanism supported by the CLR that returns ReadOnlySpan<char> from a string; Additionally, this span does not necessarily encompass the complete internal string layout. ReadOnlySpan<T> implies that you must create a copy if you need to perform mutation, as strings are immutable.
Computers only understand raw binary data, raw bits.
One bit is a Binary Digit : 0 or 1.
An 8-bit number is a byte. One byte is a number between 0 and 255.
ASCII is a table that converts numbers to characters.
Numbers between 0 and 31 are controls: tab, new line, and others.
Numbers between 32 and 126 are printable characters :
letter a, number 1, % sign, underscore _
So with ASCII, there are 33 control characters and 95 printable characters.
ASCII is the most commonly used character encoding today.
The first entries of the Unicode table are ASCII and match the ASCII character set.
ASCII is a 7-bit character set. Numbers between 0 and 127.
With 8 bits we can go up to 255.
The most common alternative to ASCII is EBCDIC which is not compatible with ASCII and still exists today on IBM computers and databases.
One byte, an 8-bit number, is the most common unit in computer science nowadays. One byte is a number between 0 and 255.
ASCII defines a meaning for each number between 0 and 127.
The character associated with numbers between 128 and 255 depends on the character encoding being used. Two widely used character encodings used nowadays are windows1252 and UTF-8.
In windows1252 the number corresponding to the € sign is 128.
One byte: [80].
In the Unicode Database, the € sign is number 8364.
Now I give you the number 8364. Two bytes: [20,AC].
In UTF-8 the Euro sign is the number 14844588. Three bytes : [E282AC].
Now I give you some raw data. Let's say 20AC.
Is it two windows1252 characters (a space followed by ¬), or one single Unicode € sign?
I give you some more raw data. E282AC.
Well, 82 is a rarely used punctuation character in windows1252, so it is probably not windows1252.
It could be macRoman "‚Ǩ" or OEM 437 "ßéó" or the UTF-8 "€" sign.
It is possible to guess the encoding of a stream of raw bytes based on the characteristics of the character encodings and on statistics but there is no reliable way to do that.
Numbers between 128 and 255 on their own are invalid in UTF-8.
The é is common in some languages (French), so if you see many bytes with the value E9 surrounded by letters it is probably a windows1252-encoded string, the E9 byte representing the é character.
When you have a stream of raw bytes that represents a string, it is far better to know the matching encoding rather than trying to guess.
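As a concrete example of the same character under encodings mentioned above (a sketch; on .NET Core, Encoding.GetEncoding(1252) additionally requires registering CodePagesEncodingProvider):
// The € sign, U+20AC, in three widely used encodings
byte[] w1252 = Encoding.GetEncoding(1252).GetBytes("\u20AC"); // 80
byte[] utf16 = Encoding.Unicode.GetBytes("\u20AC");           // AC 20  (little-endian 20AC)
byte[] utf8  = Encoding.UTF8.GetBytes("\u20AC");              // E2 82 AC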
Two ways:
public static byte[] StrToByteArray(this string s)
{
    List<byte> value = new List<byte>();
    foreach (char c in s.ToCharArray())
        value.Add(Convert.ToByte(c)); // throws for chars above 255
    return value.ToArray();
}
And,
public static byte[] StrToByteArray(this string s)
{
    s = s.Replace(" ", string.Empty);
    byte[] buffer = new byte[s.Length / 2];
    for (int i = 0; i < s.Length; i += 2)
        buffer[i / 2] = (byte)Convert.ToByte(s.Substring(i, 2), 16);
    return buffer;
}
I tend to use the bottom one more often than the top, haven't benchmarked them for speed.

How to reverse a Reed - Solomon algorithm? [duplicate]

I want to transmit binary data over a noisy channel.
I read that a good ECC algorithm to detect errors is Reed-Solomon.
The problem is I don't understand the input for this algorithm.
Here is my naive failed attempt with zxing.net:
int[] toEncode = { 123,232,432};
var gf = GenericGF.AZTEC_DATA_12;
ReedSolomonEncoder rse = new ReedSolomonEncoder(gf);
rse.encode(toEncode, 2);
ReedSolomonDecoder rsd = new ReedSolomonDecoder(gf);
rse.encode(toEncode, 2);
Please explain to me the input for the encoder and decoder.
Is this the implementation you are using here: ReedSolomonEncoder.cs?
If so, to encode N integers with M data correction integers, you need to pass an array of length N+M. Your data should be in the first N indices and the codes look to be added at the end in the final M entries.
Also, note the following restriction in the encoder:
Update: a more recent version is here: http://zxingnet.codeplex.com/. Its most recent version of ReedSolomonEncoder.cs does not have this restriction.
This class implements Reed-Solomon encoding schemes used in processing QR codes. A very brief description of Reed Solomon encoding is here: Reed-Solomon Codes.
An encoding choice of "QR_CODE_FIELD_256" (which is probably a reasonable choice for you) means that error correction codes are being generated on byte-sized chunks ("symbols") of your message, which means your maximum message length (data to encode plus error correction codes) is 255 bytes long. If you are sending more data you will need to break it into chunks.
Update 2: Using QR_CODE_FIELD_256, your integers need to be between 0 and 255 as well, so to encode a general byte stream, you need to put each byte into a separate integer in the integer array, pass the int array (plus space for error correction codes) through the encoder, then reconvert to a (larger) byte array. And the reverse for decoding.
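Putting that together with the question's zxing.net types, a sketch of the intended call pattern (treat the exact decode signature as an assumption about your ZXing.Net version):
// N data symbols followed by M error-correction symbols share one array.
int[] data = { 123, 232, 201 };                 // N = 3; each value 0..255 for QR_CODE_FIELD_256
const int ecSymbols = 2;                        // M = 2
int[] block = new int[data.Length + ecSymbols];
Array.Copy(data, block, data.Length);           // data goes in the first N slots

var field = GenericGF.QR_CODE_FIELD_256;
new ReedSolomonEncoder(field).encode(block, ecSymbols);   // fills the last M slots

block[1] ^= 0x5A;                                         // simulate channel noise
new ReedSolomonDecoder(field).decode(block, ecSymbols);   // corrects block in place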

Byte array replace byte with byte sequence efficiency: iterate and copy versus SelectMany

I'm dealing with a byte array that comprises a text message, but some of the characters in the message are control characters (i.e. less than 0x20) and I want to replace them with sequences of characters that are human readable when decoded into ASCII (for instance 0x09 would display [TAB] instead of actually being a tab character). So as I see it, I have three options:
Decode the whole thing into an ASCII string, then use String.Replace() to swap out what I want. The problem with this is that the characters seem to just be decoded as the unprintable box character or question marks, thus losing their actual byte values.
Iterate through the byte array looking for any of my control characters and performing an array insert operation (make new larger array, copy existing pieces in, write new pieces).
Use Array.ToList<byte>() to convert the byte array to a List, then use IEnumerable.SelectMany() to transform the control characters into sequences of readable characters which SelectMany will then flatten back out for me.
So the question is, which is the best option in terms of efficiency? I don't really have a good feel for the performance implications of the IEnumerable lambda operations. I believe option 1 is out as functionally unworkable, but I could be wrong.
Try
// your byte array for the message
byte[] TheMessage = ...;
// a string representation of your message (the character 0x01... 0x32 are NOT altered)
string MessageString = Encoding.ASCII.GetString(TheMessage);
// replace whatever you want...
MessageString = MessageString.Replace (" ", "x").Replace ( "\n", " " )...
// the replaced message back as byte array
byte[] TheReplacedMessage= Encoding.ASCII.GetBytes(MessageString.ToCharArray());
EDIT:
Sample for replacing an 8-bit byte value
MessageString = MessageString.Replace ( Encoding.ASCII.GetString (new byte[] {0xF7}), " " )...
Regarding the performance
I am not 100% sure whether it is the fastest approach... we just tried several approaches. Our requirement was to replace a "byte array of 1-n bytes" within the original byte array, and this came out the fastest and cleanest for our use case (1 MB - 1 GB files).
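For comparison, option 3 from the question is nearly a one-liner; a minimal sketch (the "[09]" rendering is illustrative, not something the answer above prescribes):
// needs using System.Linq; and using System.Text;
byte[] expanded = TheMessage
    .SelectMany(b => b < 0x20
        ? Encoding.ASCII.GetBytes("[" + b.ToString("X2") + "]")  // e.g. 0x09 -> "[09]"
        : new[] { b })
    .ToArray();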

How do I get a consistent byte representation of strings in C# without manually specifying an encoding?

How do I convert a string to a byte[] in .NET (C#) without manually specifying a specific encoding?
I'm going to encrypt the string. I can encrypt it without converting, but I'd still like to know why encoding comes to play here.
Also, why should encoding even be taken into consideration? Can't I simply get what bytes the string has been stored in? Why is there a dependency on character encodings?
Contrary to the answers here, you DON'T need to worry about encoding if the bytes don't need to be interpreted!
Like you mentioned, your goal is, simply, to "get what bytes the string has been stored in".
(And, of course, to be able to re-construct the string from the bytes.)
For those goals, I honestly do not understand why people keep telling you that you need the encodings. You certainly do NOT need to worry about encodings for this.
Just do this instead:
static byte[] GetBytes(string str)
{
byte[] bytes = new byte[str.Length * sizeof(char)];
System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
return bytes;
}
// Do NOT use on arbitrary bytes; only use on GetBytes's output on the SAME system
static string GetString(byte[] bytes)
{
char[] chars = new char[bytes.Length / sizeof(char)];
System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
return new string(chars);
}
As long as your program (or other programs) don't try to interpret the bytes somehow, which you obviously didn't mention you intend to do, then there is nothing wrong with this approach! Worrying about encodings just makes your life more complicated for no real reason.
Additional benefit to this approach: It doesn't matter if the string contains invalid characters, because you can still get the data and reconstruct the original string anyway!
It will be encoded and decoded just the same, because you are just looking at the bytes.
If you used a specific encoding, though, it would've given you trouble with encoding/decoding invalid characters.
It depends on the encoding of your string (ASCII, UTF-8, ...).
For example:
byte[] b1 = System.Text.Encoding.UTF8.GetBytes (myString);
byte[] b2 = System.Text.Encoding.ASCII.GetBytes (myString);
A small sample why encoding matters:
string pi = "\u03a0";
byte[] ascii = System.Text.Encoding.ASCII.GetBytes (pi);
byte[] utf8 = System.Text.Encoding.UTF8.GetBytes (pi);
Console.WriteLine (ascii.Length); //Will print 1
Console.WriteLine (utf8.Length); //Will print 2
Console.WriteLine (System.Text.Encoding.ASCII.GetString (ascii)); //Will print '?'
ASCII simply isn't equipped to deal with special characters.
Internally, the .NET framework uses UTF-16 to represent strings, so if you simply want to get the exact bytes that .NET uses, use System.Text.Encoding.Unicode.GetBytes (...).
See Character Encoding in the .NET Framework (MSDN) for more information.
The accepted answer is very, very complicated. Use the included .NET classes for this:
const string data = "A string with international characters: Norwegian: ÆØÅæøå, Chinese: 喂 谢谢";
var bytes = System.Text.Encoding.UTF8.GetBytes(data);
var decoded = System.Text.Encoding.UTF8.GetString(bytes);
Don't reinvent the wheel if you don't have to...
BinaryFormatter bf = new BinaryFormatter();
byte[] bytes;
MemoryStream ms = new MemoryStream();
string orig = "喂 Hello 谢谢 Thank You";
bf.Serialize(ms, orig);
ms.Seek(0, 0);
bytes = ms.ToArray();
MessageBox.Show("Original bytes Length: " + bytes.Length.ToString());
MessageBox.Show("Original string Length: " + orig.Length.ToString());
for (int i = 0; i < bytes.Length; ++i) bytes[i] ^= 168; // pseudo encrypt
for (int i = 0; i < bytes.Length; ++i) bytes[i] ^= 168; // pseudo decrypt
BinaryFormatter bfx = new BinaryFormatter();
MemoryStream msx = new MemoryStream();
msx.Write(bytes, 0, bytes.Length);
msx.Seek(0, 0);
string sx = (string)bfx.Deserialize(msx);
MessageBox.Show("Still intact :" + sx);
MessageBox.Show("Deserialize string Length(still intact): "
+ sx.Length.ToString());
BinaryFormatter bfy = new BinaryFormatter();
MemoryStream msy = new MemoryStream();
bfy.Serialize(msy, sx);
msy.Seek(0, 0);
byte[] bytesy = msy.ToArray();
MessageBox.Show("Deserialize bytes Length(still intact): "
+ bytesy.Length.ToString());
This is a popular question. It is important to understand what the question author is asking, and that it is different from what is likely the most common need. To discourage misuse of the code where it is not needed, I've answered the latter first.
Common Need
Every string has a character set and encoding. When you convert a System.String object to an array of System.Byte you still have a character set and encoding. For most usages, you'd know which character set and encoding you need and .NET makes it simple to "copy with conversion." Just choose the appropriate Encoding class.
// using System.Text;
Encoding.UTF8.GetBytes(".NET String to byte array")
The conversion may need to handle cases where the target character set or encoding doesn't support a character that's in the source. You have some choices: exception, substitution, or skipping. The default policy is to substitute a '?'.
// using System.Text;
var text = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes("You win €100"));
// -> "You win ?100"
Clearly, conversions are not necessarily lossless!
Note: For System.String the source character set is Unicode.
The only confusing thing is that .NET uses the name of a character set for the name of one particular encoding of that character set. Encoding.Unicode should be called Encoding.UTF16.
That's it for most usages. If that's what you need, stop reading here. See the fun Joel Spolsky article if you don't understand what encoding is.
Specific Need
Now, the question author asks is, "Every string is stored as an array of bytes, right? Why can't I simply have those bytes?"
He doesn't want any conversion.
From the C# spec:
Character and string processing in C# uses Unicode encoding. The char
type represents a UTF-16 code unit, and the string type represents a
sequence of UTF-16 code units.
So, we know that if we ask for the null conversion (i.e., from UTF-16 to UTF-16), we'll get the desired result:
Encoding.Unicode.GetBytes(".NET String to byte array")
But to avoid the mention of encodings, we must do it another way. If an intermediate data type is acceptable, there is a conceptual shortcut for this:
".NET String to byte array".ToCharArray()
That doesn't get us the desired datatype but Mehrdad's answer shows how to convert this Char array to a Byte array using BlockCopy. However, this copies the string twice! And, it too explicitly uses encoding-specific code: the datatype System.Char.
The only way to get to the actual bytes the String is stored in is to use a pointer. The fixed statement allows taking the address of values. From the C# spec:
[For] an expression of type string, ... the initializer computes the
address of the first character in the string.
To do so, the compiler writes code skipping over the other parts of the string object with RuntimeHelpers.OffsetToStringData. So, to get the raw bytes, just create a pointer to the string and copy the number of bytes needed.
// using System.Runtime.InteropServices
unsafe byte[] GetRawBytes(String s)
{
if (s == null) return null;
var codeunitCount = s.Length;
/* We know that String is a sequence of UTF-16 code units
and such code units are 2 bytes */
var byteCount = codeunitCount * 2;
var bytes = new byte[byteCount];
fixed(void* pRaw = s)
{
Marshal.Copy((IntPtr)pRaw, bytes, 0, byteCount);
}
return bytes;
}
As #CodesInChaos pointed out, the result depends on the endianness of the machine. But the question author is not concerned with that.
You need to take the encoding into account, because 1 character could be represented by 1 or more bytes (up to about 6), and different encodings will treat these bytes differently.
Joel has a posting on this:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
The first part of your question (how to get the bytes) was already answered by others: look in the System.Text.Encoding namespace.
I will address your follow-up question: why do you need to pick an encoding? Why can't you get that from the string class itself?
The answer is in two parts.
First of all, the bytes used internally by the string class don't matter, and whenever you assume they do you're likely introducing a bug.
If your program is entirely within the .Net world then you don't need to worry about getting byte arrays for strings at all, even if you're sending data across a network. Instead, use .Net Serialization to worry about transmitting the data. You don't worry about the actual bytes anymore: the Serialization formatter does it for you.
On the other hand, what if you are sending these bytes somewhere that you can't guarantee will pull in data from a .Net serialized stream? In this case, you definitely do need to worry about encoding, because obviously, this external system cares. So again, the internal bytes used by the string don't matter: you need to pick an encoding so you can be explicit about this encoding on the receiving end, even if it's the same encoding used internally by .Net.
I understand that in this case, you might prefer to use the actual bytes stored by the string variable in memory where possible, with the idea that it might save some work creating your byte stream. However, I put it to you it's just not important compared to making sure that your output is understood at the other end, and to guarantee that you must be explicit with your encoding. Additionally, if you really want to match your internal bytes, you can already just choose the Unicode encoding, and get those performance savings.
This brings me to the second part... picking the Unicode encoding is telling .Net to use the underlying bytes. You do need to pick this encoding because when some new-fangled Unicode-Plus comes out the .Net runtime needs to be free to use this newer, better encoding model without breaking your program. But, for the moment (and foreseeable future), just choosing the Unicode encoding gives you what you want.
It's also important to understand that your string has to be rewritten to the wire, and that involves at least some translation of the bit-pattern even when you use a matching encoding. The computer needs to account for things like Big vs Little Endian, network byte order, packetization, session information, etc.
Just to demonstrate that Mehrdad's sound answer works, his approach can even persist unpaired surrogate characters (a charge many have levelled against my answer, but one every encoding-based answer is equally guilty of, e.g. System.Text.Encoding.UTF8.GetBytes and System.Text.Encoding.Unicode.GetBytes; those encoding methods can't persist a high surrogate character such as d800, and merely replace it with the value fffd):
using System;
class Program
{
static void Main(string[] args)
{
string t = "爱虫";
string s = "Test\ud800Test";
byte[] dumpToBytes = GetBytes(s);
string getItBack = GetString(dumpToBytes);
foreach (char item in getItBack)
{
Console.WriteLine("{0} {1}", item, ((ushort)item).ToString("x"));
}
}
static byte[] GetBytes(string str)
{
byte[] bytes = new byte[str.Length * sizeof(char)];
System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
return bytes;
}
static string GetString(byte[] bytes)
{
char[] chars = new char[bytes.Length / sizeof(char)];
System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
return new string(chars);
}
}
Output:
T 54
e 65
s 73
t 74
? d800
T 54
e 65
s 73
t 74
Try that with System.Text.Encoding.UTF8.GetBytes or System.Text.Encoding.Unicode.GetBytes; they will merely replace the high surrogate character with the value fffd.
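For reference, here is a minimal sketch of that lossy behaviour (the sample is mine, not part of the answer above):
using System;
using System.Text;

class SurrogateLossDemo
{
    static void Main()
    {
        string s = "Test\ud800Test";               // unpaired high surrogate
        byte[] utf8 = Encoding.UTF8.GetBytes(s);   // the surrogate cannot be encoded
        string back = Encoding.UTF8.GetString(utf8);
        foreach (char c in back)
            Console.WriteLine("{0:x4}", (ushort)c); // prints fffd where d800 used to be
    }
}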
Every time there's movement on this question, I'm still thinking of a serializer (be it from Microsoft or from a 3rd party component) that can persist strings even if they contain unpaired surrogate characters; I google this every now and then: serialization unpaired surrogate character .NET. This doesn't make me lose any sleep, but it's kind of annoying when every now and then somebody comments that my answer is flawed, yet their answers are equally flawed when it comes to unpaired surrogate characters.
Darn, Microsoft should have just used System.Buffer.BlockCopy in its BinaryFormatter ツ
Thanks! (谢谢!)
Try this, a lot less code:
System.Text.Encoding.UTF8.GetBytes("TEST String");
Well, I've read all the answers, and they either use encoding or rely on serialization that drops unpaired surrogates.
It's bad when the string, for example, comes from SQL Server, where it was built from a byte array storing, say, a password hash. If we drop anything from it, it'll store an invalid hash, and if we want to store it in XML, we want to leave it intact (because the XML writer throws an exception on any unpaired surrogate it finds).
So I use Base64 encoding of byte arrays in such cases, but hey, on the Internet there is only one solution to this in C#, it had a bug in it, and it was one-way only, so I've fixed the bug and written the reverse procedure. Here you are, future googlers:
public static byte[] StringToBytes(string str)
{
byte[] data = new byte[str.Length * 2];
for (int i = 0; i < str.Length; ++i)
{
char ch = str[i];
data[i * 2] = (byte)(ch & 0xFF);
data[i * 2 + 1] = (byte)((ch & 0xFF00) >> 8);
}
return data;
}
public static string StringFromBytes(byte[] arr)
{
char[] ch = new char[arr.Length / 2];
for (int i = 0; i < ch.Length; ++i)
{
ch[i] = (char)((int)arr[i * 2] + (((int)arr[i * 2 + 1]) << 8));
}
return new String(ch);
}
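A short usage sketch of these helpers (sample values are mine), showing the lossless round trip plus the Base64 step mentioned above:
string original = "hash\ud800data";            // contains an unpaired surrogate
byte[] raw = StringToBytes(original);          // two bytes per UTF-16 code unit
string restored = StringFromBytes(raw);
Console.WriteLine(original == restored);       // True -- nothing was dropped or replaced
string xmlSafe = Convert.ToBase64String(raw);  // Base64 keeps the bytes safe for XML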
Also please explain why encoding should be taken into consideration.
Can't I simply get what bytes the string has been stored in?
Why this dependency on encoding?!!!
Because there is no such thing as "the bytes of the string".
A string (or more generically, a text) is composed of characters: letters, digits, and other symbols. That's all. Computers, however, do not know anything about characters; they can only handle bytes. Therefore, if you want to store or transmit text by using a computer, you need to transform the characters to bytes. How do you do that? Here's where encodings come to the scene.
An encoding is nothing but a convention to translate logical characters to physical bytes. The simplest and best known encoding is ASCII, and it is all you need if you write in English. For other languages you will need more complete encodings, with any of the Unicode flavours being the safest choice nowadays.
So, in short, trying to "get the bytes of a string without using encodings" is as impossible as "writing a text without using any language".
By the way, I strongly recommend you (and anyone, for that matter) to read this small piece of wisdom: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
C# to convert a string to a byte array:
public static byte[] StrToByteArray(string str)
{
System.Text.UTF8Encoding encoding=new System.Text.UTF8Encoding();
return encoding.GetBytes(str);
}
byte[] strToByteArray(string str)
{
System.Text.ASCIIEncoding enc = new System.Text.ASCIIEncoding();
return enc.GetBytes(str);
}
With the advent of Span<T> released with C# 7.2, the canonical technique to capture the underlying memory representation of a string into a managed byte array is:
byte[] bytes = "rubbish_\u9999_string".AsSpan().AsBytes().ToArray();
Converting it back should be a non-starter because that means you are in fact interpreting the data somehow, but for the sake of completeness:
string s;
unsafe
{
fixed (char* f = &bytes.AsSpan().NonPortableCast<byte, char>().DangerousGetPinnableReference())
{
s = new string(f);
}
}
The names NonPortableCast and DangerousGetPinnableReference should further the argument that you probably shouldn't be doing this.
Note that working with Span<T> requires installing the System.Memory NuGet package.
Regardless, the actual original question and follow-up comments imply that the underlying memory is not being "interpreted" (which I assume means is not modified or read beyond the need to write it as-is), indicating that some implementation of the Stream class should be used instead of reasoning about the data as strings at all.
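As a hedged sketch of that last suggestion (the file name is made up), the octets are copied as a stream and never pass through a string at all:
using System.IO;

// Copy raw octets from a source stream without ever interpreting them as text.
byte[] payload;
using (FileStream source = File.OpenRead("input.bin"))   // hypothetical input file
using (MemoryStream buffer = new MemoryStream())
{
    source.CopyTo(buffer);
    payload = buffer.ToArray();
}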
You can use the following code for conversion between string and byte array.
string s = "Hello World";
// String to Byte[]
byte[] byte1 = System.Text.Encoding.Default.GetBytes(s);
// OR
byte[] byte2 = System.Text.ASCIIEncoding.Default.GetBytes(s);
// Byte[] to string
string str = System.Text.Encoding.UTF8.GetString(byte1);
I'm not sure, but I think the string stores its info as an array of Chars, which is not byte-efficient. Specifically, the definition of a Char is "Represents a Unicode character".
Take this example:
String str = "asdf éß";
String str2 = "asdf gh";
EncodingInfo[] info = Encoding.GetEncodings();
foreach (EncodingInfo enc in info)
{
System.Console.WriteLine(enc.Name + " - "
    + enc.GetEncoding().GetByteCount(str) + " - "
    + enc.GetEncoding().GetByteCount(str2));
}
Take note that the Unicode answer is 14 bytes in both instances, whereas the UTF-8 answer is only 9 bytes for the first, and only 7 for the second.
So if you just want the bytes used by the string, simply use Encoding.Unicode, but it will be inefficient with storage space.
The key issue is that a character in a string takes 16 bits (one UTF-16 code unit, and a full code point may need two of them), but a byte only has 8 bits to spare. A one-to-one mapping doesn't exist unless you restrict yourself to strings that only contain ASCII characters. System.Text.Encoding has lots of ways to map a string to byte[]; you need to pick one that avoids loss of information and that is easy for your client to use when she needs to map the byte[] back to a string.
UTF-8 is a popular encoding; it is compact and not lossy.
Use:
string text = "string";
byte[] array = System.Text.Encoding.UTF8.GetBytes(text);
The result is:
[0] = 115
[1] = 116
[2] = 114
[3] = 105
[4] = 110
[5] = 103
Fastest way
public static byte[] GetBytes(string text)
{
return System.Text.ASCIIEncoding.UTF8.GetBytes(text);
}
EDIT
As Makotosan commented, this is now the best way (ASCIIEncoding.UTF8 above is just Encoding.UTF8 accessed through a derived type):
Encoding.UTF8.GetBytes(text)
The closest approach to the OP's question is Tom Blodget's, which actually goes into the object and extracts the bytes. I say closest because it depends on implementation of the String Object.
"Can't I simply get what bytes the string has been stored in?"
Sure, but that's where the fundamental error in the question arises. The String is an object which could have an interesting data structure. We already know it does, because it allows unpaired surrogates to be stored. It might store the length. It might keep a pointer to each of the 'paired' surrogates allowing quick counting. Etc. All of these extra bytes are not part of the character data.
What you want is each character's bytes in an array. And that is where 'encoding' comes in. By default you will get UTF-16LE. If you don't care about the bytes themselves except for the round trip, then you can choose any encoding, including the 'default', and convert it back later (assuming the same parameters, such as what the default encoding was, code points, bug fixes, things allowed such as unpaired surrogates, etc.).
But why leave the 'encoding' up to magic? Why not specify the encoding so that you know what bytes you are going to get?
"Why is there a dependency on character encodings?"
Encoding (in this context) simply means the bytes that represent your string. Not the bytes of the string object. You wanted the bytes the string has been stored in -- this is where the question was asked naively. You wanted the bytes of string in a contiguous array that represent the string, and not all of the other binary data that a string object may contain.
Which means how a string is stored is irrelevant. You want a string "Encoded" into bytes in a byte array.
I like Tom Blodget's answer because he took you towards the 'bytes of the string object' direction. It's implementation dependent though, and because he's peeking at internals it might be difficult to reconstitute a copy of the string.
Mehrdad's response is wrong because it is misleading at the conceptual level. You still have a list of bytes, encoded. His particular solution allows for unpaired surrogates to be preserved -- this is implementation dependent. His particular solution would not produce the string's bytes accurately if GetBytes returned the string in UTF-8 by default.
I've changed my mind about this (Mehrdad's solution) -- this isn't getting the bytes of the string; rather it is getting the bytes of the character array that was created from the string. Regardless of encoding, the char datatype in c# is a fixed size. This allows a consistent length byte array to be produced, and it allows the character array to be reproduced based on the size of the byte array. So if the encoding were UTF-8, but each char was 6 bytes to accommodate the largest utf8 value, it would still work. So indeed -- encoding of the character does not matter.
But a conversion was used -- each character was placed into a fixed size box (c#'s character type). However what that representation is does not matter, which is technically the answer to the OP. So -- if you are going to convert anyway... Why not 'encode'?
How do I convert a string to a byte[] in .NET (C#) without manually specifying a specific encoding?
A string in .NET represents text as a sequence of UTF-16 code units, so the bytes are encoded in memory in UTF-16 already.
Mehrdad's Answer
You can use Mehrdad's answer, but it does actually use an encoding, because chars are UTF-16. It calls ToCharArray which, looking at the source, creates a char[] and copies the memory to it directly. Then it copies the data to a byte array that is also allocated. So under the hood it is copying the underlying bytes twice and allocating a char array that is not used after the call.
Tom Blodget's Answer
Tom Blodget's answer is 20-30% faster than Mehrdad's since it skips the intermediate step of allocating a char array and copying the bytes to it, but it requires you to compile with the /unsafe option. If you absolutely do not want to use encoding, I think this is the way to go. If you put your encryption logic inside the fixed block, you don't even need to allocate a separate byte array and copy the bytes to it.
Also, why should encoding be taken into consideration? Can't I simply get what bytes the string has been stored in? Why is there a dependency on character encodings?
Because that is the proper way to do it. string is an abstraction.
Using an encoding could give you trouble if you have 'strings' with invalid characters, but that shouldn't happen. If you are getting data into your string with invalid characters you are doing it wrong. You should probably be using a byte array or a Base64 encoding to start with.
If you use System.Text.Encoding.Unicode, your code will be more resilient. You don't have to worry about the endianness of the system your code will be running on. You don't need to worry if the next version of the CLR will use a different internal character encoding.
I think the question isn't why you want to worry about the encoding, but why you want to ignore it and use something else. Encoding is meant to represent the abstraction of a string in a sequence of bytes. System.Text.Encoding.Unicode will give you a little endian byte order encoding and will perform the same on every system, now and in the future.
You can use following code to convert a string to a byte array in .NET
string s_unicode = "abcéabc";
byte[] utf8Bytes = System.Text.Encoding.UTF8.GetBytes(s_unicode);
If you really want a copy of the underlying bytes of a string, you can use a function like the one that follows. However, you shouldn't; please read on to find out why.
[DllImport(
"msvcrt.dll",
EntryPoint = "memcpy",
CallingConvention = CallingConvention.Cdecl,
SetLastError = false)]
private static extern unsafe void* UnsafeMemoryCopy(
void* destination,
void* source,
uint count);
public static byte[] GetUnderlyingBytes(string source)
{
var length = source.Length * sizeof(char);
var result = new byte[length];
unsafe
{
fixed (char* firstSourceChar = source)
fixed (byte* firstDestination = result)
{
var firstSource = (byte*)firstSourceChar;
UnsafeMemoryCopy(
firstDestination,
firstSource,
(uint)length);
}
}
return result;
}
This function will get you a copy of the bytes underlying your string, pretty quickly. You'll get those bytes in whatever way they are encoded on your system. This encoding is almost certainly UTF-16LE, but that is an implementation detail you shouldn't have to care about.
It would be safer, simpler and more reliable to just call,
System.Text.Encoding.Unicode.GetBytes()
In all likelihood this will give the same result, it is easier to type, and the bytes will round-trip (as well as a Unicode byte representation can) with a call to
System.Text.Encoding.Unicode.GetString()
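A minimal round-trip sketch of that safer pair (the sample text is mine):
using System;
using System.Text;

string original = "café ☕";
byte[] bytes = Encoding.Unicode.GetBytes(original);   // UTF-16LE bytes
string restored = Encoding.Unicode.GetString(bytes);
Console.WriteLine(original == restored);              // True for any well-formed string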
Here is my unsafe implementation of String to Byte[] conversion:
public static unsafe Byte[] GetBytes(String s)
{
Int32 length = s.Length * sizeof(Char);
Byte[] bytes = new Byte[length];
fixed (Char* pInput = s)
fixed (Byte* pBytes = bytes)
{
Byte* source = (Byte*)pInput;
Byte* destination = pBytes;
if (length >= 16)
{
do
{
*((Int64*)destination) = *((Int64*)source);
*((Int64*)(destination + 8)) = *((Int64*)(source + 8));
source += 16;
destination += 16;
}
while ((length -= 16) >= 16);
}
if (length > 0)
{
if ((length & 8) != 0)
{
*((Int64*)destination) = *((Int64*)source);
source += 8;
destination += 8;
}
if ((length & 4) != 0)
{
*((Int32*)destination) = *((Int32*)source);
source += 4;
destination += 4;
}
if ((length & 2) != 0)
{
*((Int16*)destination) = *((Int16*)source);
source += 2;
destination += 2;
}
if ((length & 1) != 0)
{
    // Copy the final remaining byte. (For strings this branch never runs,
    // because the byte length is always a multiple of sizeof(Char).)
    destination[0] = source[0];
}
}
}
return bytes;
}
It's way faster than the accepted answer's version, even if not as elegant.
Here are my Stopwatch benchmarks over 10000000 iterations:
[First String: Length 20]
Buffer.BlockCopy: 746ms
Unsafe: 557ms
[Second String: Length 50]
Buffer.BlockCopy: 861ms
Unsafe: 753ms
[Third String: Length 100]
Buffer.BlockCopy: 1250ms
Unsafe: 1063ms
In order to use it, you have to tick "Allow Unsafe Code" in your project build properties.
As of .NET Framework 3.5, this method can also be used as a String extension:
public static unsafe class StringExtensions
{
public static Byte[] ToByteArray(this String s)
{
// Method Code
}
}
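A quick usage sketch of that extension (assuming the method body above is filled in with the GetBytes implementation):
byte[] bytes = "Hello".ToByteArray();   // 10 bytes: 2 per UTF-16 code unit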
Upon being asked what you intend to do with the bytes, you responded:
I'm going to encrypt it. I can encrypt it without converting but I'd still like to know why encoding comes to play here. Just give me the bytes is what I say.
Regardless of whether you intend to send this encrypted data over the network, load it back into memory later, or stream it to another process, you are clearly intending to decrypt it at some point. In that case, the answer is that you're defining a communication protocol. A communication protocol should not be defined in terms of implementation details of your programming language and its associated runtime. There are several reasons for this:
You may need to communicate with a process implemented in a different language or runtime. (This might include a server running on another machine or sending the string to a JavaScript browser client, for example.)
The program may be re-implemented in a different language or runtime in the future.
The .NET implementation might change the internal representation of strings. You may think this sounds farfetched, but this actually happened in Java 9 to reduce memory usage. There's no reason .NET couldn't follow suit. Jon Skeet suggests that UTF-16 probably isn't optimal today, given the rise of emoji and other blocks of Unicode that need more than 2 bytes to represent, increasing the likelihood that the internal representation could change in the future.
For communicating (either with a completely disparate process or with the same program in the future), you need to define your protocol strictly to minimize the difficulty of working with it or accidentally creating bugs. Depending on .NET's internal representation is not a strict, clear, or even guaranteed to be consistent definition. A standard encoding is a strict definition that will not fail you in the future.
In other words, you can't satisfy your requirement for consistency without specifying an encoding.
You may certainly choose to use UTF-16 directly if you find that your process performs significantly better since .NET uses it internally or for any other reason, but you need to choose that encoding explicitly and perform those conversions explicitly in your code rather than depending on .NET's internal implementation.
So choose an encoding and use it:
using System.Text;
// ...
Encoding.Unicode.GetBytes("abc"); // UTF-16 little endian
Encoding.UTF8.GetBytes("abc");
As you can see, it's also actually less code to just use the built in encoding objects than to implement your own reader/writer methods.
The string can be converted to a byte array in a few different ways, due to the following fact: .NET supports Unicode, and Unicode standardizes several different encodings called UTFs. They have different lengths of byte representation but are equivalent in the sense that when a string is encoded, it can be decoded back to the same string; but if the string is encoded with one UTF and decoded under the assumption of a different UTF, it can be garbled.
Also, .NET supports non-Unicode encodings, but they are not valid in the general case (they will be valid only if a limited subset of Unicode code points is used in the actual string, such as ASCII). Internally, .NET uses UTF-16, but for stream representation, UTF-8 is usually used. It is also a de facto standard for the Internet.
Not surprisingly, serialization of a string into an array of bytes and deserialization are supported by the class System.Text.Encoding, which is an abstract class; its derived classes support concrete encodings: ASCIIEncoding and four UTFs (System.Text.UnicodeEncoding supports UTF-16).
Ref this link.
For serialization to an array of bytes, use System.Text.Encoding.GetBytes. For the inverse operation use System.Text.Encoding.GetChars. This function returns an array of characters, so to get a string, use the string constructor System.String(char[]).
Ref this page.
Example:
string myString = //... some string
System.Text.Encoding encoding = System.Text.Encoding.UTF8; //or some other, but prefer a UTF if Unicode is used
byte[] bytes = encoding.GetBytes(myString);
//next lines are written in response to follow-up questions:
myString = new string(encoding.GetChars(bytes));
bytes = encoding.GetBytes(myString);
myString = new string(encoding.GetChars(bytes));
bytes = encoding.GetBytes(myString);
//how many times shall I repeat it to show there is a round-trip? :-)
It depends on what you want the bytes FOR
This is because, as Tyler so aptly said, "Strings aren't pure data. They also have information." In this case, the information is an encoding that was assumed when the string was created.
Assuming that you have binary data (rather than text) stored in a string
This is based off of OP's comment on his own question, and is the correct question if I understand OP's hints at the use-case.
Storing binary data in strings is probably the wrong approach because of the assumed encoding mentioned above! Whatever program or library stored that binary data in a string (instead of a byte[] array which would have been more appropriate) has already lost the battle before it has begun. If they are sending the bytes to you in a REST request/response or anything that must transmit strings, Base64 would be the right approach.
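As a hedged sketch of that Base64 suggestion (sample bytes and variable names are mine), the transport-safe round trip looks like this:
using System;

byte[] binary = { 0x00, 0xFF, 0x10, 0x80 };            // arbitrary non-text bytes
string wireSafe = Convert.ToBase64String(binary);      // "AP8QgA==" -- plain ASCII, safe in JSON/XML
byte[] restored = Convert.FromBase64String(wireSafe);  // identical bytes back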
If you have a text string with an unknown encoding
Everybody else answered this incorrect question incorrectly.
If the string looks good as-is, just pick an encoding (preferably one starting with UTF), use the corresponding System.Text.Encoding.???.GetBytes() function, and tell whoever you give the bytes to which encoding you picked.
If you are using .NET Core or System.Memory for .NET Framework, there is a very efficient marshaling mechanism available via Span<T> and Memory<T> that can effectively reinterpret string memory as a span of bytes. Once you have a span of bytes you are free to marshal back to another type, or copy the span to an array for serialization.
To summarize what others have said:
Storing a representation of this kind of serialization is sensitive to system endianness, compiler optimizations, and changes to the internal representation of strings in the executing .NET Runtime.
Avoid long-term storage
Avoid deserializing or interpreting the string in other environments
This includes other machines, processor architectures, .NET runtimes, containers, etc.
This includes comparisons, formatting, encryption, string manipulation, localization, character transforms, etc.
Avoid making assumptions about the character encoding
The default encoding tends to be UTF-16LE in practice, but the compiler / runtime can choose any internal representation
Implementation
public static class MarshalExtensions
{
public static ReadOnlySpan<byte> AsBytes(this string value) => MemoryMarshal.AsBytes(value.AsSpan());
public static string AsString(this ReadOnlySpan<byte> value) => new string(MemoryMarshal.Cast<byte, char>(value));
}
Example
static void Main(string[] args)
{
string str1 = "你好,世界";
ReadOnlySpan<byte> span = str1.AsBytes();
string str2 = span.AsString();
byte[] bytes = span.ToArray();
Debug.Assert(bytes.Length > 0);
Debug.Assert(str1 == str2);
}
Further Insight
In C++ this is roughly equivalent to reinterpret_cast, and in C it is roughly equivalent to a cast to the system's word type (char).
In recent versions of the .NET Core Runtime (CoreCLR), operations on spans effectively invoke compiler intrinsics and various optimizations that can sometimes eliminate bounds checking, leading to exceptional performance while preserving memory safety, assuming that your memory was allocated by the CLR and the spans are not derived from pointers from an unmanaged memory allocator.
Caveats
This uses a mechanism supported by the CLR that returns ReadOnlySpan<char> from a string; Additionally, this span does not necessarily encompass the complete internal string layout. ReadOnlySpan<T> implies that you must create a copy if you need to perform mutation, as strings are immutable.
Computers only understand raw binary data, raw bits.
One bit is a binary digit: 0 or 1.
An 8-bit number is a byte. One byte is a number between 0 and 255.
ASCII is a table that converts numbers to characters.
Numbers between 0 and 31 are controls: tab, new line, and others.
Numbers between 32 and 126 are printable characters:
the letter a, the number 1, the % sign, the underscore _.
So with ASCII, there are 33 control characters and 95 printable characters.
ASCII is the foundation of the most commonly used character encodings today.
The first entries of the Unicode table are ASCII and match the ASCII character set.
ASCII is a 7-bit character set. Numbers between 0 and 127.
With 8 bits we can go up to 255.
The most common alternative to ASCII is EBCDIC which is not compatible with ASCII and still exists today on IBM computers and databases.
One byte, an 8-bit number, is the most common unit in computer science nowadays. One byte is a number between 0 and 255.
ASCII defines a meaning for each number between 0 and 127.
The character associated with numbers between 128 and 255 depends on the character encoding being used. Two widely used character encodings nowadays are windows-1252 and UTF-8.
In windows-1252 the number corresponding to the € sign is 128.
One byte: [80].
In the Unicode Database, the € sign is number 8364.
Now I give you the number 8364. Two bytes: [20,AC].
In UTF-8 the Euro sign is the number 14844588. Three bytes: [E2,82,AC].
Now I give you some raw data. Let's say 20AC.
Is it two windows-1252 characters (a space followed by ¬), or one single Unicode € sign?
I give you some more raw data. E282AC.
Well, decoded as windows-1252 it would read "â‚¬", which is unlikely to be intended text, so it is probably not windows-1252.
It could be macRoman "‚Ǩ" or OEM 437 "ßéó" or the UTF-8 "€" sign.
It is possible to guess the encoding of a stream of raw bytes based on the characteristics of the character encodings and on statistics but there is no reliable way to do that.
Numbers between 128 and 255 on their own are invalid in UTF-8.
The é is common in some languages (French), so if you see many bytes with the value E9 surrounded by letters, it is probably a windows-1252-encoded string, the E9 byte representing the é character.
When you have a stream of raw bytes that represents a string, it is far better to know the matching encoding rather than trying to guess.
Below is a screenshot of one raw byte in various encodings that were once widely used.
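In the same spirit as that comparison, here is a small sketch of the guessing problem; note that on .NET Core / .NET 5+ the windows-1252 code page requires the System.Text.Encoding.CodePages package to be registered first:
using System;
using System.Text;

class EncodingGuessDemo
{
    static void Main()
    {
        // Needed on .NET Core / .NET 5+ to make legacy code pages such as 1252 available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] raw = { 0xE2, 0x82, 0xAC };                            // the bytes discussed above
        Console.WriteLine(Encoding.UTF8.GetString(raw));              // €
        Console.WriteLine(Encoding.GetEncoding(1252).GetString(raw)); // â‚¬ -- mojibake
    }
}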
Two ways:
public static byte[] StrToByteArray(this string s)
{
    // Narrows each UTF-16 code unit to a byte; lossy for characters above 255.
    List<byte> value = new List<byte>();
    foreach (char c in s.ToCharArray())
        value.Add((byte)c);
    return value.ToArray();
}
And,
public static byte[] StrToByteArray(this string s)
{
    // Parses a hexadecimal string such as "48 65 6C" (spaces optional) into bytes.
    s = s.Replace(" ", string.Empty);
    byte[] buffer = new byte[s.Length / 2];
    for (int i = 0; i < s.Length; i += 2)
        buffer[i / 2] = Convert.ToByte(s.Substring(i, 2), 16);
    return buffer;
}
I tend to use the bottom one more often than the top; I haven't benchmarked them for speed.
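Note that the two methods do quite different things, and since they share a name and signature they can't live side by side in one class; here is a brief usage sketch with hypothetical renames (NarrowToBytes for the first, ParseHexBytes for the second):
byte[] narrowed = "Hi!".NarrowToBytes();      // { 0x48, 0x69, 0x21 }
byte[] parsed   = "48 69 21".ParseHexBytes(); // also { 0x48, 0x69, 0x21 }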
