UTF-8 gives extra string with German characters - C#

I have a file named testtäöüßÄÖÜ. I want to convert it to UTF-8 using C#.
string test = "testtäöüß";
var bytes = new List<byte>(test.Length);
foreach (var c in test)
    bytes.Add((byte)c);
var retValue = Encoding.UTF8.GetString(bytes.ToArray());
After running this code my output is 'testt mit Umlaute äöü?x', where 'mit Umlaute' (German for 'with umlauts') is extra text.
Can anybody help me ?
Thanks in advance.

You can't do that. You can't cast a UTF-8 character to a byte: UTF-8 requires at least two bytes for anything other than ASCII, and a single byte can't store that.
Instead of creating a list, use:
byte[] bytes = System.Text.Encoding.UTF8.GetBytes (test);
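A minimal round-trip sketch of that fix, using the string from the question:
byte[] bytes = System.Text.Encoding.UTF8.GetBytes("testtäöüßÄÖÜ"); // each umlaut becomes two UTF-8 bytes
string roundTripped = System.Text.Encoding.UTF8.GetString(bytes);  // decode with the SAME encoding
Console.WriteLine(roundTripped); // testtäöüßÄÖÜ, with no '?' and no extra text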

I think Tseng means the following.
Taken from: http://www.chilkatsoft.com/p/p_320.asp
System.Text.Encoding utf_8 = System.Text.Encoding.UTF8;
// This is our Unicode string:
string s_unicode = "abcéabc";
// Convert the string to UTF-8 bytes.
byte[] utf8Bytes = utf_8.GetBytes(s_unicode);
// Convert the UTF-8 bytes back to a string.
string s_unicode2 = utf_8.GetString(utf8Bytes);
MessageBox.Show(s_unicode2);

Related

How to convert Unicode to UTF-8 encoding in C#

I want to convert a Unicode string to a UTF-8 string, to use in an SMS API for sending Unicode SMS.
I want the conversion to behave like this tool:
https://cafewebmaster.com/online_tools/utf8_encode
e.g. I have the Unicode string "हैलो फ़्रेंड्स" and it should be converted into "हà¥à¤²à¥ à¥à¥à¤°à¥à¤à¤¡à¥à¤¸"
I have tried this, but I'm not getting the expected output:
private string UnicodeToUTF8(string strFrom)
{
    byte[] bytes = Encoding.Default.GetBytes(strFrom);
    return Encoding.UTF8.GetString(bytes);
}
and I'm calling the function like this:
string myUTF8String = UnicodeToUTF8("हैलो फ़्रेंड्स");
I don't think this is possible to answer concretely without knowing more about the SMS API you want to use. The string type in C# is UTF-16. If you want a different encoding, it's given to you as a byte[] (because a string is UTF-16, always).
You could 'cast' that into a string by doing something like this:
static string UnicodeToUTF8(string from) {
    // needs using System.Linq; each UTF-8 byte is widened to one char
    var bytes = Encoding.UTF8.GetBytes(from);
    return new string(bytes.Select(b => (char)b).ToArray());
}
As far as I can tell this yields the same output as the website you linked. However, without knowing what API you're handing this string off to, I can't guarantee that this will ultimately work.
The point of string is that we don't need to worry about its underlying encoding, but this casting operation is kind of a giant hack and makes no guarantees that string represents a well-formed string anymore.
If something expects a UTF-8 encoding, it should accept a byte[], not a string.
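For what it's worth, on .NET 5 or later the same byte-per-char trick can be written with Encoding.Latin1 (a sketch under that assumption; Latin-1 maps bytes 0-255 one-to-one onto code points U+0000-U+00FF, which is exactly what the cast above does):
static string UnicodeToUtf8ViaLatin1(string from) {
    // Encoding.Latin1 requires .NET 5+
    return Encoding.Latin1.GetString(Encoding.UTF8.GetBytes(from));
}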
Try this:
string output = "hello world";
byte[] bytes1 = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, Encoding.Unicode.GetBytes(output));
byte[] bytes2 = Encoding.Convert(Encoding.Unicode, Encoding.Unicode, Encoding.Unicode.GetBytes(output));
var output1 = Encoding.UTF8.GetString(bytes1);
var output2 = Encoding.Unicode.GetString(bytes2);
You will see that bytes1 is 11 bytes (one byte per character in UTF-8, since "hello world" is pure ASCII) and bytes2 is 22 bytes (two bytes per character in UTF-16, which is what Encoding.Unicode means).
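A quick check, continuing the snippet above:
Console.WriteLine(bytes1.Length); // 11: "hello world" is pure ASCII, one UTF-8 byte per character
Console.WriteLine(bytes2.Length); // 22: Encoding.Unicode is UTF-16, two bytes per character here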

Convert String to ByteArray in C#

I want to convert a string to a byte array in C# to decrypt some data.
When I get the string back from the byte array I created, it shows question marks (?).
Example code:
byte[] strTemp = Encoding.ASCII.GetBytes(strData);
MessageBox.Show(strData);
MessageBox.Show(Encoding.ASCII.GetString(strTemp));
The string is "Ê<,,l"x¡" (including the double quotation mark) and the result of converting it back to a string is: ???l?x?
I hope this helps you:
To get a byte array from a string:
private byte[] StringToByteArray(string str)
{
    System.Text.ASCIIEncoding enc = new System.Text.ASCIIEncoding();
    return enc.GetBytes(str);
}
To get a string back from a byte array:
private string ByteArrayToString(byte[] arr)
{
    System.Text.ASCIIEncoding enc = new System.Text.ASCIIEncoding();
    return enc.GetString(arr);
}
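Note that these helpers keep the limitation described in the question: ASCII only covers code points 0-127, so the sample string still degrades. A quick check with the question's string:
var enc = new System.Text.ASCIIEncoding();
Console.WriteLine(enc.GetString(enc.GetBytes("Ê<,,l\"x¡"))); // ?<,,l"x? : the non-ASCII characters are gone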
For that specific input, the BigEndianUnicode encoding seems to work fine:
byte[] strTemp = Encoding.BigEndianUnicode.GetBytes(strData);
MessageBox.Show(strData);
MessageBox.Show(Encoding.BigEndianUnicode.GetString(strTemp));
You are getting a byte array for the ASCII representation of your string, but your string is Unicode.
C# uses Unicode to encode strings, and Unicode can represent far more symbols than ASCII.
In your example, every symbol that has no ASCII representation is replaced by '?', which is why only 'l' and 'x' survive in the output.
The proper way to do it is to use a Unicode encoding instead:
byte[] strTemp = Encoding.UTF8.GetBytes(strData);
MessageBox.Show(strData);
MessageBox.Show(Encoding.UTF8.GetString(strTemp));
Basically, any Unicode encoding can be used: UTF8, UTF32, Unicode, BigEndianUnicode (https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings).
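With a Unicode encoding the round trip from the question becomes lossless; a sketch with UTF-8:
string strData = "Ê<,,l\"x¡";
byte[] strTemp = Encoding.UTF8.GetBytes(strData); // 'Ê' and '¡' each become two bytes
Console.WriteLine(Encoding.UTF8.GetString(strTemp) == strData); // True: nothing was lost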

Decoding a UTF-8 string to Windows-1256

I used this code to encode a UTF-8 string to Windows-1256 string:
string q = textBox1.Text;
UTF7Encoding utf = new UTF7Encoding();
byte[] winByte = Encoding.GetEncoding(1256).GetBytes(q);
string result = utf.GetString(winByte);
This code works, but I can't decode the result back to the original string!
How can I decode the encoded string (the result variable) back to what it was before the conversion (the q variable)?
You are converting the strings incorrectly.
Have a look at the commented code below. The comments explain what is wrong, and how to do it correctly, but basically what is happening is:
First you use Encoding.GetEncoding(1256).GetBytes(q) to convert a string (which is UTF-16) to ANSI codepage 1256 bytes.
Then you use a UTF-7 encoding to convert it back. But that's wrong, because you need to use an ANSI codepage 1256 encoding to convert it back:
string q = "ABئبئ"; // UTF16.
UTF7Encoding utf = new UTF7Encoding(); // Used to convert UTF16 to/from UTF7
// Convert UTF16 to ANSI codepage 1256. winByte[] will be ANSI codepage 1256.
byte[] winByte = Encoding.GetEncoding(1256).GetBytes(q);
// Convert UTF7 to UTF16.
// But this is WRONG because winByte is ANSI codepage 1256, NOT UTF7!
string result = utf.GetString(winByte);
Debug.Assert(result != q); // So result doesn't equal q
// The CORRECT way to convert the ANSI string back:
// Convert ANSI codepage 1256 string to UTF16
result = Encoding.GetEncoding(1256).GetString(winByte);
Debug.Assert(result == q); // Now result DOES equal q
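Two caveats worth adding: the round trip above only holds for characters that codepage 1256 can represent (anything else becomes '?' in the GetBytes step), and on .NET Core / .NET 5+ the legacy codepages must be registered first via the System.Text.Encoding.CodePages package. A short sketch under those assumptions:
// Needed once on .NET Core / .NET 5+ before Encoding.GetEncoding(1256) works:
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding cp1256 = Encoding.GetEncoding(1256);
Console.WriteLine(cp1256.GetString(cp1256.GetBytes("ABئبئ")) == "ABئبئ"); // True: representable in 1256
Console.WriteLine(cp1256.GetString(cp1256.GetBytes("漢")));               // ? : an arbitrary character outside 1256 is lost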

C# Encoding.Convert vs C++ MultiByteToWideChar

I have a C++ code snippet that uses MultiByteToWideChar to convert a UTF-8 string to UTF-16.
In C++, if the input is "Hôtel", the output is "Hôtel", which is correct.
In C#, if the input is "Hôtel", the output is still "Hôtel", which is not correct.
The C# code to convert from UTF-8 to UTF-16 looks like:
Encoding.Unicode.GetString(
    Encoding.Convert(
        Encoding.UTF8,
        Encoding.Unicode,
        Encoding.UTF8.GetBytes(utf8)));
In C++, the conversion code looks like:
MultiByteToWideChar(
CP_UTF8, // convert from UTF-8
0, // default flags
utf8.data(), // source UTF-8 string
utf8.length(), // length (in chars) of source UTF-8 string
&utf16[0], // destination buffer
utf16.length() // size of destination buffer, in wchar_t's
)
I want to have the same results in C# that I am getting in C++. Is there anything wrong with the C# code?
It appears you want to treat the string's characters as Windows-1252 (often mislabeled as ANSI) code points, and have those code points decoded as UTF-8 bytes, where Windows-1252 code point == UTF-8 byte value.
The reason the accepted answer doesn't work is that it treats the string characters as Unicode code points rather than Windows-1252. It can get away with that for most characters, because Windows-1252 maps them exactly as Unicode does, but input containing characters like –, €, ™, ‘, ’, ”, • will fail, because Windows-1252 maps those differently from Unicode.
So what you want is simply this:
public static string doWeirdMapping(string arg)
{
    Encoding w1252 = Encoding.GetEncoding(1252);
    return Encoding.UTF8.GetString(w1252.GetBytes(arg));
}
Then:
Console.WriteLine(doWeirdMapping("Hôtel")); // prints Hôtel
Console.WriteLine(doWeirdMapping("HVOLSVÃ–LLUR")); // prints HVOLSVÖLLUR
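Two details worth seeing in action: the Encoding.Convert round trip from the question is an identity transform (UTF-16 to UTF-8 bytes and straight back), and the 1252 detour differs from Unicode exactly in the 0x80-0x9F range. A sketch (€ is U+20AC in Unicode but byte 0x80 in Windows-1252):
string mojibake = "Hôtel";
string viaConvert = Encoding.Unicode.GetString(
    Encoding.Convert(Encoding.UTF8, Encoding.Unicode, Encoding.UTF8.GetBytes(mojibake)));
Console.WriteLine(viaConvert == mojibake); // True: the question's code changes nothing
Console.WriteLine((int)'€');                                    // 8364 (U+20AC), the Unicode code point
Console.WriteLine(Encoding.GetEncoding(1252).GetBytes("€")[0]); // 128 (0x80), the Windows-1252 byte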
Maybe this one:
private static string Utf8ToUnicode(string input)
{
    // needs using System.Linq; narrows each char to a single byte first
    return Encoding.UTF8.GetString(input.Select(item => (byte)item).ToArray());
}
Try this:
string str = "abc!";
Encoding unicode = Encoding.Unicode;
Encoding utf8 = Encoding.UTF8;
byte[] unicodeBytes = unicode.GetBytes(str);
byte[] utf8Bytes = Encoding.Convert(unicode, utf8, unicodeBytes);
Console.WriteLine("UTF Bytes:");
StringBuilder sb = new StringBuilder();
foreach (byte b in utf8Bytes) {
    sb.Append(b).Append(" : ");
}
Console.WriteLine(sb.ToString());
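For "abc!" every character is ASCII, so UTF-8 uses one byte per character and the loop prints: 97 : 98 : 99 : 33 :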
This link would be helpful for understanding encodings and their conversions.
Use System.Text.Encoding.UTF8.GetString().
Pass in your UTF-8 encoded text as a byte array. The function returns a standard .NET string, which is encoded in UTF-16.
A sample function would be as below:
private string ReadData(Stream binary_file) {
    System.Text.Encoding encoding = System.Text.Encoding.UTF8;
    // Read up to 30 bytes from the binary file; Read may return fewer than requested.
    byte[] buffer = new byte[30];
    int bytesRead = binary_file.Read(buffer, 0, 30);
    // Decode only the bytes actually read, as UTF-8.
    return encoding.GetString(buffer, 0, bytesRead);
}
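A quick way to exercise that helper (hypothetical usage; a MemoryStream stands in for the binary file):
byte[] utf8Bytes = System.Text.Encoding.UTF8.GetBytes("Hôtel");
using (var ms = new System.IO.MemoryStream(utf8Bytes))
{
    Console.WriteLine(ReadData(ms)); // Hôtel
}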

C# Encoding.Converting Latin to Hebrew

I'm trying to fetch and parse an online Excel document which is written in Hebrew, but unfortunately in a non-Hebrew encoding.
As an example, I'm trying to convert the following string, which serves as the first sheet's name, to Hebrew using C# code: "âìéåï_1". I'm unable to do so.
I know the above is convertible, since when I open it in Notepad++ and select Encoding/Character sets/Hebrew/Windows-1255, I can see "גליון_1", which is the correct Hebrew representation of the above string.
I'm using the code below:
string str = "âìéåï_1";
Encoding windows = Encoding.GetEncoding("Windows-1255");
Encoding ascii = Encoding.GetEncoding("Windows-1252");
byte[] asciiBytes = ascii.GetBytes(str);
byte[] windowsBytes = Encoding.Convert(ascii, windows, asciiBytes);
char[] windowsChars = new char[windows.GetCharCount(windowsBytes, 0, windowsBytes.Length)];
windows.GetChars(windowsBytes, 0, windowsBytes.Length, windowsChars, 0);
string windowsString = new string(windowsChars);
I assumed that the encoding of the original string is Windows-1252, since when I paste it into Notepad++ and change the encoding to Windows-1252 the string stays the same.
I'm probably doing something wrong here; does anyone know how to convert the above correctly?
Thanks,
Mikey
const string Str = "âìéåï_1";
Encoding latinEncoding = Encoding.GetEncoding("Windows-1252");
Encoding hebrewEncoding = Encoding.GetEncoding("Windows-1255");
byte[] latinBytes = latinEncoding.GetBytes(Str);
string hebrewString = hebrewEncoding.GetString(latinBytes);
hebrewString:
גליון_1
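The mapping is also reversible, since every byte involved round-trips through both codepages (a sketch continuing the snippet above):
byte[] hebrewBytes = hebrewEncoding.GetBytes(hebrewString); // גליון_1 -> E2 EC E9 E5 EF 5F 31
string latinAgain = latinEncoding.GetString(hebrewBytes);   // back to "âìéåï_1"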
In your supplied example, "Windows-1252" is not actually ASCII, it is extended ASCII, and for some reason Encoding.Convert with these two encodings cannot convert the extended range, so all characters above 127 are converted to 63 (i.e. '?'). When "converting" from one extended-ASCII byte[] to another, I would expect the bytes to stay the same; it is only when you convert them to a .NET Unicode string that I would expect them to differ. Not sure why Convert turns characters above 127 into '?'.
