Array and String Encoding - C#

When I do
string s = Encoding.Unicode.GetString(a);
byte[] aa = Encoding.Unicode.GetBytes(s);
I get a different array back (a != aa). Why?
But when I do this, it's all right:
string s = Encoding.Default.GetString(a);
byte[] aa = Encoding.Default.GetBytes(s);

That is because you are using the encoding backwards. An encoding is used to encode a string to bytes, and then to decode those bytes back to a string again.
In an encoding every character has a corresponding set of bytes, but not every set of bytes has a corresponding character. That's why you can't take arbitrary bytes and decode them into a string.
With Encoding.Default this misuse happens to work, because it is a single-byte encoding that has a character defined for every byte value. It still doesn't make sense to use it that way, though.

To add to Guffa's answer, here is a detailed example of how your code fails for certain byte sequences, such as 0, 216:
// Let's start with some character from the ancient Aegean numbers:
// The code point of Aegean One is U+10107. Code points > U+FFFF need two
// code units with two bytes each if you encode them in UTF-16 (Encoding.Unicode)
string aegeanOne = char.ConvertFromUtf32(0x10107);
byte[] aegeanOneBytes = Encoding.Unicode.GetBytes(aegeanOne);
// Length == 4 (2 bytes each for high and low surrogate)
// == 0, 216, 7, 221
// Let's just take the first two bytes.
// This creates a malformed byte sequence,
// because the corresponding low surrogate is missing.
byte[] a = new byte[2];
a[0] = aegeanOneBytes[0]; // == 0
a[1] = aegeanOneBytes[1]; // == 216
string s = Encoding.Unicode.GetString(a);
// == replacement character � (U+FFFD),
// because the bytes could not be decoded properly (missing low surrogate)
byte[] aa = Encoding.Unicode.GetBytes(s);
// == 253, 255 == 0xFFFD != 0, 216
string s2 = Encoding.Default.GetString(a);
// == "\0Ø" (NUL + LATIN CAPITAL LETTER O WITH STROKE)
// Results may differ, depending on the default encoding of the operating system
byte[] aa2 = Encoding.Default.GetBytes(s2);
// == 0, 216

In other words, your byte[] a contains a byte sequence that is not valid UTF-16, so the round trip through a string cannot preserve it.

Related

Get a guid to encode using big-endian formatting C#

I have an unusual situation whereby I have an existing MySQL database that uses binary(16) primary keys; these are the basis for UUIDs that are used in an existing API.
My problem is that I now want to add a replacement API written with .NET Core, and I'm running into a problem with encoding that has been explained here.
Specifically, the Guid struct in .NET uses a mixed-endian format that produces a different string from the existing API. This isn't acceptable for obvious reasons.
So my question is this: is there an elegant way to force the Guid struct to encode entirely in big-endian format?
If there isn't, I can just write a terrible hack, but I thought I'd check with the collective intelligence of the SO community first!
Nope; as far as I'm aware there's no inbuilt way to get this. And yes, Guid has what I can only call "crazy-endian" implementation currently. You'd need to get the Guid-ordered bits (either via unsafe or Guid.ToByteArray) and then order them manually, figuring out which chunks to reverse - it isn't a simple Array.Reverse(). So: very manual, I'm afraid. I suggest using a guid like
00010203-0405-0607-0809-0a0b0c0d0e0f
to debug it; this gives you (as I suspect you are aware):
03-02-01-00-05-04-07-06-08-09-0A-0B-0C-0D-0E-0F
so (see the sketch after this list):
reverse 4
reverse 2
reverse 2
straight 8
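For illustration, here is a minimal sketch of that manual reordering (the chunk offsets are taken from the byte layout shown above):
Guid guid = new Guid("00010203-0405-0607-0809-0a0b0c0d0e0f");
byte[] bytes = guid.ToByteArray(); // 03-02-01-00-05-04-07-06-08-09-0A-0B-0C-0D-0E-0F
Array.Reverse(bytes, 0, 4);        // reverse 4
Array.Reverse(bytes, 4, 2);        // reverse 2
Array.Reverse(bytes, 6, 2);        // reverse 2
// the remaining 8 bytes are already in order ("straight 8")
// bytes is now 00-01-02-03-04-05-06-07-08-09-0A-0B-0C-0D-0E-0F (big-endian)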
As of 2021 there still isn't a built-in way to convert a System.Guid to a MySQL compatible big endian string in C#.
Here's the extension we came up with when we encountered this exact C# mixed-endian Guid problem at work:
public static string ToStringBigEndian(this Guid guid)
{
    // allocate enough bytes to store the Guid ASCII string
    Span<byte> result = stackalloc byte[36];
    // set all bytes to 0xFF (to be able to distinguish them from real data)
    result.Fill(0xFF);
    // get bytes from guid
    Span<byte> buffer = stackalloc byte[16];
    _ = guid.TryWriteBytes(buffer);
    int skip = 0;
    // iterate over guid bytes
    for (int i = 0; i < buffer.Length; i++)
    {
        // indices 4, 6, 8 and 10 will contain a '-' delimiter character in the Guid string.
        // --> leave space for those delimiters
        if (i is 4 or 6 or 8 or 10)
        {
            skip++;
        }
        // stretch the high and low nibbles of every byte into two bytes (skipping '-' delimiter slots)
        result[(2 * i) + skip] = (byte)(buffer[i] >> 0x4);
        result[(2 * i) + 1 + skip] = (byte)(buffer[i] & 0x0Fu);
    }
    // iterate over the precomputed byte array.
    // values 0x0 to 0xF are final hex values, but must be mapped to ASCII characters.
    // value 0xFF is to be mapped to the '-' delimiter character.
    for (int i = 0; i < result.Length; i++)
    {
        // map bytes to ASCII values (a-f will be lowercase)
        ref byte b = ref result[i];
        b = b switch
        {
            0xFF => 0x2D,               // map 0xFF to the '-' character
            < 0xA => (byte)(b + 0x30u), // map 0x0 - 0x9 to '0' - '9'
            _ => (byte)(b + 0x57u)      // map 0xA - 0xF to 'a' - 'f'
        };
    }
    // get string from the ASCII-encoded guid byte array
    return Encoding.ASCII.GetString(result);
}
It's a bit lengthy, but apart from the big-endian string it returns it does no heap allocations, so it's guaranteed to be fast :)
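Hypothetical usage (the expected output reflects the raw byte order that Guid.TryWriteBytes produces, which is the point of the extension):
Guid guid = new Guid("00010203-0405-0607-0809-0a0b0c0d0e0f");
Console.WriteLine(guid.ToStringBigEndian());
// prints 03020100-0504-0706-0809-0a0b0c0d0e0f (the hex of the raw bytes in storage order)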

Conversion from Base 64 error

I'm trying to convert from a Base64 string. First I tried this:
string a = "BTQmJiI6JzFkZ2ZhY";
byte[] b = Convert.FromBase64String(a);
string c = System.Text.Encoding.ASCII.GetString(b);
Then I got the exception: System.FormatException - "Invalid length for a Base-64 char array."
So after googling, I tried this:
string a1 = "BTQmJiI6JzFkZ2ZhY";
int mod4 = a1.Length % 4;
if (mod4 > 0)
{
    a1 += new string('=', 4 - mod4);
}
byte[] b1 = Convert.FromBase64String(a1);
string c1 = System.Text.Encoding.ASCII.GetString(b1);
Here I got the exception: System.FormatException - "Invalid character in a Base-64 string."
Is there any invalid character in "BTQmJiI6JzFkZ2ZhY"? Or is it the length issue?
EDIT: I first decrypt the input string using the code below:
string sourstr, deststr, strchar;
int strlen;
decimal ascvalue, ConvValue;
deststr = "";
sourstr = "InputString";
strlen = sourstr.Length;
for (int intI = 0; intI <= strlen - 1; intI++)
{
    strchar = sourstr.Substring(intI, 1);
    ascvalue = (decimal)strchar[0];
    ConvValue = (decimal)((int)ascvalue ^ 85);
    if ((char)ConvValue.ToString().Length == 0)
    {
        deststr = deststr + strchar;
    }
    else
    {
        deststr = deststr + (char)ConvValue;
    }
}
This output deststr is then passed to the code below:
Convert.ToBase64String(System.Text.Encoding.ASCII.GetBytes(deststr));
This is where I got "BTQmJiI6JzFkZ2ZhY"
You cannot get such a base64 string by encoding a whole number of bytes. While encoding, every 3 bytes are represented as 4 characters, because 3 bytes is 24 bits and each base64 character encodes 6 bits (2^6 = 64), so 4 of them is also 24 bits. If the number of bytes to encode is not divisible by 3, you have some bytes left over: either 1 or 2.
If you have 2 bytes left, that's 16 bits, and you need at least 3 characters to encode them (2 characters is just 12 bits, not enough). So with 2 bytes left you encode them with 3 characters and apply "=" padding.
If you have 1 byte left, that's 8 bits. You need at least 2 characters for that. You encode it to 2 characters and apply "==" padding.
Note that there is no way to encode something to just one character (and for that reason there is no "===" padding).
Your string can be divided into 4-character blocks: "BTQm", "JiI6", "JzFk", "Z2Zh", "Y". The first 4 blocks each represent 3 bytes, but what does "Y" represent? Who knows. You could say it represents 1 byte in the range 0-63, but from the above you can see that's not how base64 works, so to interpret it like that you would have to do it yourself.
From the above you can see that you cannot get a base64 string of length 17 (without padding). You can get 16, 18, 19, 20, but never 17.
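You can see these padding rules in action with a few illustrative byte values (the values themselves are arbitrary):
Console.WriteLine(Convert.ToBase64String(new byte[] { 1, 2, 3 })); // "AQID" - 3 bytes -> 4 characters, no padding
Console.WriteLine(Convert.ToBase64String(new byte[] { 1, 2 }));    // "AQI=" - 2 bytes -> 3 characters + "=" padding
Console.WriteLine(Convert.ToBase64String(new byte[] { 1 }));       // "AQ==" - 1 byte  -> 2 characters + "==" padding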
Are you sure you took all the characters from the base64 output?
Appending "==" at the end of the string will make your first approach work without any problems, although there is a strange character at the beginning of the output. So the next question is: are you sure it is ASCII encoding?

How to decode a UTF-8 encoded string split across two buffers right in between a 4-byte-long char?

A character in UTF-8 encoding takes up to 4 bytes. Now imagine I read from a stream into one buffer and then into another. Unfortunately it just happens that at the end of the first buffer 2 bytes of a 4-byte UTF-8 encoded character are left, and the remaining 2 bytes sit at the beginning of the second buffer.
Is there a way to partially decode that string (while leaving the 2 leftover bytes alone) without copying those two buffers into one big buffer?
string str = "Hello\u263AWorld";
Console.WriteLine(str);
Console.WriteLine("Length of 'HelloWorld': " + Encoding.UTF8.GetBytes("HelloWorld").Length);
var bytes = Encoding.UTF8.GetBytes(str);
Console.WriteLine("Length of 'Hello\u263AWorld': " + bytes.Length);
Console.WriteLine(Encoding.UTF8.GetString(bytes, 0, 6));
Console.WriteLine(Encoding.UTF8.GetString(bytes, 7, bytes.Length - 7));
This returns:
Hello☺World
Length of 'HelloWorld': 10
Length of 'Hello☺World': 13
Hello�
�World
The smiley face is 3 bytes long.
Is there a class that deals with split decoding of strings?
I would like to get first "Hello" and then "☺World", reusing the remainder of the not-yet-decoded byte array, without copying both arrays into one big array. I really just want to use the remainder of the first buffer and somehow make the magic happen.
You should use a Decoder, which is able to maintain state between calls to GetChars - it remembers the bytes it hasn't decoded yet.
using System;
using System.Text;

class Test
{
    static void Main()
    {
        string str = "Hello\u263AWorld";
        var bytes = Encoding.UTF8.GetBytes(str);
        var decoder = Encoding.UTF8.GetDecoder();

        // Long enough for the whole string
        char[] buffer = new char[100];

        // Convert the first "packet"
        var length1 = decoder.GetChars(bytes, 0, 6, buffer, 0);

        // Convert the second "packet", writing into the buffer
        // from where we left off
        // Note: 6 not 7, because otherwise we're skipping a byte...
        var length2 = decoder.GetChars(bytes, 6, bytes.Length - 6,
                                       buffer, length1);

        var reconstituted = new string(buffer, 0, length1 + length2);
        Console.WriteLine(str == reconstituted); // true
    }
}
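Since the question is about buffers read from a stream, here is a self-contained sketch (the MemoryStream and buffer sizes are illustrative, not part of the answer above) showing the same Decoder carrying a split character across Read calls:
using System;
using System.IO;
using System.Text;

class StreamTest
{
    static void Main()
    {
        var bytes = Encoding.UTF8.GetBytes("Hello\u263AWorld");
        using var stream = new MemoryStream(bytes);

        var decoder = Encoding.UTF8.GetDecoder();
        var byteBuffer = new byte[6]; // deliberately small so the smiley gets split across reads
        var charBuffer = new char[Encoding.UTF8.GetMaxCharCount(byteBuffer.Length)];

        int bytesRead;
        while ((bytesRead = stream.Read(byteBuffer, 0, byteBuffer.Length)) > 0)
        {
            // the decoder keeps any incomplete trailing bytes and completes them on the next call
            int charsDecoded = decoder.GetChars(byteBuffer, 0, bytesRead, charBuffer, 0);
            Console.Write(new string(charBuffer, 0, charsDecoded));
        }
        Console.WriteLine(); // prints Hello☺World
    }
}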

Add PPPoE layer to packet - convert length into bytes

I have an application that plays Pcap files, and I am trying to add a function that wraps my packets with a PPPoE layer.
Almost everything is done, except for large packets, where I haven't yet understood how to set the new length after adding the PPPoE layer.
For example, take a packet whose length field is 972 bytes (03 cc). All I want is to convert it to decimal. Looking at the packet byte[] in my code, I can see that this value is stored as 3 and 204, so my question is: how does this calculation work?
Those two bytes represent a short (System.Int16) in big-endian notation (most significant byte first).
You can follow two approaches to get the decimal value of those two bytes: one is with the BitConverter class, the other is doing the calculation yourself.
BitConverter
// the bytes
var bytes = new byte[] { 3, 204 };
// are the bytes little endian?
var littleEndian = false; // no
// What architecture is the BitConverter running on?
if (BitConverter.IsLittleEndian != littleEndian)
{
    // reverse the bytes if the endianness doesn't match (Reverse() here is LINQ's Enumerable.Reverse)
    bytes = bytes.Reverse().ToArray();
}
// convert
var value = BitConverter.ToInt16(bytes, 0);
value.Dump(); // or Console.WriteLine(value); --> 972
Calculate it yourself
Treat the two bytes as base-256 digits:
// the bytes
var bytes2 = new byte[] {3, 204};
// [0] * 256 + [1]
var value2 = bytes2[0] * 256 + bytes2[1]; // 3 * 256 + 204
value2.Dump(); // 972
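Going the other way (writing a new length back into the packet as two big-endian bytes, which is what adding the PPPoE layer requires) is just the reverse calculation; the value 972 here is only an example:
int newLength = 972;
var lengthBytes = new byte[2];
lengthBytes[0] = (byte)(newLength / 256); // high byte: 972 / 256 = 3
lengthBytes[1] = (byte)(newLength % 256); // low byte:  972 % 256 = 204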

Truncating a byte array vs. substringing the Encoded string coming out of SHA-256

I am not familiar with hashing algorithms and the risks associated with using them, and therefore have a question about the answer below that I received to a previous question . . .
Based on the comment that the hash value must, when encoded to ASCII, fit within 16 ASCII characters, the solution is first to choose some cryptographic hash function (the SHA-2 family includes SHA-256, SHA-384, and SHA-512),
then to truncate the output of the chosen hash function to 96 bits (12 bytes) - that is, keep the first 12 bytes of the hash function output and discard the remaining bytes -
then to base-64-encode the truncated output to 16 ASCII characters (128 bits),
yielding effectively a 96-bit-strong cryptographic hash.
If I substring the base-64-encoded string to 16 characters, is that fundamentally different from keeping the first 12 bytes of the hash function output and then base-64-encoding them? If so, could someone please explain (and provide example code) how to truncate the byte array?
I tested the substring of the full hash value against 36,000+ distinct values and had no collisions. The code below is my current implementation.
Thanks for any help (and clarity) you can provide.
public static byte[] CreateSha256Hash(string data)
{
    byte[] dataToHash = (new UnicodeEncoding()).GetBytes(data);
    SHA256 shaM = new SHA256Managed();
    byte[] hashedData = shaM.ComputeHash(dataToHash);
    return hashedData;
}

public override void InputBuffer_ProcessInputRow(InputBufferBuffer Row)
{
    byte[] hashedData = CreateSha256Hash(Row.HashString);
    string s = Convert.ToBase64String(hashedData, Base64FormattingOptions.None);
    Row.HashValue = s.Substring(0, 16);
}
Original post: http://stackoverflow.com/questions/4340471/is-there-a-hash-algorithm-that-produces-a-hash-size-of-64-bits-in-c
No, there is no difference. However, it's easier to just get the base64 string of the first 12 bytes of the array, instead of truncating the array:
public override void InputBuffer_ProcessInputRow(InputBufferBuffer Row)
{
    byte[] hashedData = CreateSha256Hash(Row.HashString);
    Row.HashValue = Convert.ToBase64String(hashedData, 0, 12);
}
The base-64 encoding simply puts 6 bits in each character, so 3 bytes (24 bits) go into 4 characters. As long as you are splitting the data at a 3-byte boundary, it's the same as splitting the string at the corresponding 4-character boundary.
If you try to split the data between these boundaries, the base64 string will be padded with filler data up to the next boundary, so the result would not be the same.
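As a quick sanity check of that equivalence (the input string here is arbitrary):
using var sha = SHA256.Create();
byte[] hash = sha.ComputeHash(Encoding.Unicode.GetBytes("some arbitrary input"));
string fromTruncatedBytes = Convert.ToBase64String(hash, 0, 12);
string fromSubstring = Convert.ToBase64String(hash).Substring(0, 16);
Console.WriteLine(fromTruncatedBytes == fromSubstring); // true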
Truncating is as easy as adding Take(12) here:
Change
byte[] hashedData = CreateSha256Hash(Row.HashString);
To:
byte[] hashedData = CreateSha256Hash(Row.HashString).Take(12).ToArray();
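Note that Take is a LINQ extension method, so this requires a using System.Linq; directive.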
