UTF8 Encoding not adding byte order mark - c#

We know that the constructor of the class UTF8Encoding can receive an optional parameter: a bool specifying if the encoder should provide a byte order mark (BOM) or not.
However, when encoding the same text using both approaches, the output is the same:
string text = "Hello, world!";
byte[] withBom= new UTF8Encoding(true).GetBytes(text);
byte[] withoutBom = new UTF8Encoding(false).GetBytes(text);
Both withBom and withoutBom have the same content, one doesn't even have one byte more than the other one.
Why does this happen? Why is the byte order mark not being added to withBom?

BOM parameter in the constructor does no affect the result of GetBytes, it affects the result of GetPreamble. Users are expected to append it manually.
byte[] bom = new UTF8Encoding(true).GetPreamble(); // 3 bytes
byte[] noBom = new UTF8Encoding(false).GetPreamble(); // 0 bytes

The BOM is returned via the UTF8Encoding.GetPreamble method:
UTF8Encoding enc = new UTF8Encoding(true);
byte[] withBom = enc.GetPreamble().Concat(enc.GetBytes(text)).ToArray();

Related

String value system.byte[] while coverting Base64 value to string in c#

I have two LDIF files from where I am reading values and using it for comparsion using c#
One of the attribute: value in LDIF is a base64 value, need to convert it in UTF-8 format
displayName:: Rmlyc3ROYW1lTGFzdE5hbWU=
So I thought of using string -> byte[], but I am not able to use the above displayName value as string
byte[] newbytes = Convert.FromBase64String(displayname);
string displaynamereadable = Encoding.UTF8.GetString(newbytes);
In my C# code, I am doing this to retrieve the values from the ldif file
for(Entry entry ldif.ReadEntry() ) //reads value from ldif for particular user's
{
foreach(Attr attr in entry) //here attr gives attributes of a particular user
{
if(attr.Name.Equals("displayName"))
{
string attVal = attr.Value[0].ToString(); //here the value of String attVal is system.Byte[], so not able to use it in the next line
byte[] newbytes = Convert.FromBase64String(attVal); //here it throws an error mentioned below
string displaynamereadable = Encoding.UTF8.GetString(attVal);
}
}
}
Error:
The input is not a valid Base-64 string as it contains a non-base 64 character, more than two padding characters, or an illegal character among the padding characters.
I am trying to user attVal as String so that I can get the encoded UTf-8 value but its throwing an error.
I tried to use BinaryFormatter and MemoryStream as well, it worked but it inserted so many new chars with the original value.
Snapshot of BinaryFormatter:
object obj = attr.Value[0];
byte[] bytes = null;
BinaryFormatter bf = new BinaryFormatter();
using (MemoryStream ms = new MemoryStream())
{
bf.Serialize(ms, obj);
bytes = (ms.ToArray());
}
string d = Encoding.UTF8.GetString(bytes);
So the result after encoding should be: "FirstNameLastName"
But it gives "\u0002 \u004 \u004 ...................FirstNameLastName\v"
Thanks,
Base64 was designed to send binary data through transfer channels that only support plain text, and as a result, Base64 is always ASCII text. So if attr.Value[0] is a byte array, just interpret those bytes as string using ASCII encoding:
String attVal = Encoding.ASCII.GetString(attr.Value[0] as Byte[]);
Byte[] newbytes = Convert.FromBase64String(attVal);
String displaynamereadable = Encoding.UTF8.GetString(newbytes);
Also note, your code above was feeding attVal into that final line rather than newbytes.

How to convert unicode to utf-8 encoding in c#

I want to convert unicode string to UTF8 string. I want to use this UTF8 string in SMS API to send unicode SMS.
I want conversion like this tool
https://cafewebmaster.com/online_tools/utf8_encode
eg. I have unicode string "हैलो फ़्रेंड्स" and it should be converted into "हà¥à¤²à¥ à¥à¥à¤°à¥à¤à¤¡à¥à¤¸"
I have tried this but not getting expected output
private string UnicodeToUTF8(string strFrom)
{
byte[] bytes = Encoding.Default.GetBytes(strFrom);
return Encoding.UTF8.GetString(bytes);
}
and calling function like this
string myUTF8String = UnicodeToUTF8("हैलो फ़्रेंड्स");
I don't think this is possible to answer concretely without knowing more about the SMS API you want to use. The string type in C# is UTF-16. If you want a different encoding, it's given to you as a byte[] (because a string is UTF-16, always).
You could 'cast' that into a string by doing something like this:
static string UnicodeToUTF8(string from) {
var bytes = Encoding.UTF8.GetBytes(from);
return new string(bytes.Select(b => (char)b).ToArray());
}
As far as I can tell this yields the same output as the website you linked. However, without knowing what API you're handing this string off to, I can't guarantee that this will ultimately work.
The point of string is that we don't need to worry about its underlying encoding, but this casting operation is kind of a giant hack and makes no guarantees that string represents a well-formed string anymore.
If something expects a UTF-8 encoding, it should accept a byte[], not a string.
Try this:
string output = "hello world";
byte[] bytes1 = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, Encoding.Unicode.GetBytes(output));
byte[] bytes2 = Encoding.Convert(Encoding.Unicode, Encoding.Unicode, Encoding.Unicode.GetBytes(output));
var output1 = Encoding.UTF8.GetString(bytes1);
var output2 = Encoding.Unicode.GetString(bytes2);
You will see that bytes1 is 11 bytes (1 byte per char UTF-8) and bytes2 is 22 bytes (2 bytes per char for unicode)

UtF-8 gives extra string in German character

I have file name testtäöüßÄÖÜ . I want to convert in UTF-8 using c#.
string test ="testtäöüß";
var bytes = new List<byte>(test.Length);
foreach (var c in test)
bytes.Add((byte)c);
var retValue = Encoding.UTF8.GetString(bytes.ToArray());
after running this code my output is : 'testt mit Umlaute äöü?x. where mit Umlaute is extra
text.
Can anybody help me ?
Thanks in advance.
You can't do that. You can't cast an UTF-8 character to byte. UTF-8 for anything other than ASCII requires at least two bytes, byte can can't store this
Instead of creating a list, use
byte[] bytes = System.Text.Encoding.UTF8.GetBytes (test);
I think, Tseng means the following
Taken from: http://www.chilkatsoft.com/p/p_320.asp
System.Text.Encoding utf_8 = System.Text.Encoding.UTF8;
// This is our Unicode string:
string s_unicode = "abcéabc";
// Convert a string to utf-8 bytes.
byte[] utf8Bytes = System.Text.Encoding.UTF8.GetBytes(s_unicode);
// Convert utf-8 bytes to a string.
string s_unicode2 = System.Text.Encoding.UTF8.GetString(utf8Bytes);
MessageBox.Show(s_unicode2);

C# Encoding.Convert Vs C++ MultiByteToWideChar

I have a C++ code snippet that uses MultiByteToWideChar to convert UTF-8 string to UTF-16
For C++, if input is "Hôtel", the output is "Hôtel" which is correct
For C#, if input is "Hôtel", the output is "Hôtel" which is not correct.
The C# code to convert from UTF8 to UTF16 looks like
Encoding.Unicode.GetString(
Encoding.Convert(
Encoding.UTF8,
Encoding.Unicode,
Encoding.UTF8.GetBytes(utf8)));
In C++ the conversion code looks like
MultiByteToWideChar(
CP_UTF8, // convert from UTF-8
0, // default flags
utf8.data(), // source UTF-8 string
utf8.length(), // length (in chars) of source UTF-8 string
&utf16[0], // destination buffer
utf16.length() // size of destination buffer, in wchar_t's
)
I want to have the same results in C# that I am getting in C++. Is there anything wrong with the C# code ?
It appears you want to treat string characters as Windows-1252 (Often mislabeled as ANSI) code points, and have those code points decoded as UTF-8 bytes, where Windows-1252 code point == UTF-8 byte value.
The reason the accepted answer doesn't work is that it treats the string characters as unicode code points, rather than
Windows-1252. It can get away with most characters because Windows-1252 maps them exactly the same as unicode, but input with characters
like –, €, ™, ‘, ’, ”, • etc.. will fail because Windows-1252 maps those differently than unicode in this sense.
So what you want is simply this:
public static string doWeirdMapping(string arg)
{
Encoding w1252 = Encoding.GetEncoding(1252);
return Encoding.UTF8.GetString(w1252.GetBytes(arg));
}
Then:
Console.WriteLine(doWeirdMapping("Hôtel")); //prints Hôtel
Console.WriteLine(doWeirdMapping("HVOLSVÖLLUR")); //prints HVOLSVÖLLUR
Maybe this one:
private static string Utf8ToUnicode(string input)
{
return Encoding.UTF8.GetString(input.Select(item => (byte)item).ToArray());
}
Try This
string str = "abc!";
Encoding unicode = Encoding.Unicode;
Encoding utf8 = Encoding.UTF8;
byte[] unicodeBytes = unicode.GetBytes(str);
byte[] utf8Bytes = Encoding.Convert( unicode,
utf8,
unicodeBytes );
Console.WriteLine( "UTF Bytes:" );
StringBuilder sb = new StringBuilder();
foreach( byte b in utf8Bytes ) {
sb.Append( b ).Append(" : ");
}
Console.WriteLine( sb.ToString() );
This Link would be helpful for you to understand about encodings and their conversions
Use System.Text.Encoding.UTF8.GetString().
Pass in your UTF-8 encoded text, as a byte array. The function returns a standard .net string which is encoded in UTF-16.
Sample function will be as below:
private string ReadData(Stream binary_file) {
System.Text.Encoding encoding = System.Text.Encoding.UTF8;
// Read string from binary file with UTF8 encoding
byte[] buffer = new byte[30];
binary_file.Read(buffer, 0, 30);
return encoding.GetString(buffer);
}

How do i compare a byte[] to string?

I want to compare the first few bytes in byte[] with a string. How can i do this?
You must know the encoding of the byte array to properly compare them.
For example, if you know your byte array is made of UTF-8 bytes, then you can create a string from the byte array:
System.Text.UTF8Encoding enc = new System.Text.UTF8Encoding();
string s = enc.GetString(originalBytes);
Now you can compare string s to your other string.
Conversely, if you want to compare just the first few bytes, you can convert the string into a UTF8 byte array:
System.Text.UTF8Encoding enc = new System.Text.UTF8Encoding();
byte[] b = enc.GetBytes(originalString);
Now you can compare byte array b to your other byte array.
There are several other encoding objects for ASCII, Unicode, etc. See the MSDN page here.
use
byte [] fromString = Encoding.Default.GetBytes("helloworld");

Categories