How to convert unicode to utf-8 encoding in c#

I want to convert a Unicode string to a UTF-8 string, so that I can pass it to an SMS API and send Unicode SMS messages.
I want the same conversion that this tool performs:
https://cafewebmaster.com/online_tools/utf8_encode
eg. I have unicode string "हैलो फ़्रेंड्स" and it should be converted into "à¤¹à¥à¤²à¥ à¥à¥à¤°à¥à¤à¤¡à¥à¤¸"
I have tried this, but I am not getting the expected output:
private string UnicodeToUTF8(string strFrom)
{
byte[] bytes = Encoding.Default.GetBytes(strFrom);
return Encoding.UTF8.GetString(bytes);
}
and I call the function like this:
string myUTF8String = UnicodeToUTF8("हैलो फ़्रेंड्स");

I don't think this is possible to answer concretely without knowing more about the SMS API you want to use. The string type in C# is UTF-16. If you want a different encoding, it's given to you as a byte[] (because a string is UTF-16, always).
You could 'cast' that into a string by doing something like this:
static string UnicodeToUTF8(string from) {
var bytes = Encoding.UTF8.GetBytes(from);
return new string(bytes.Select(b => (char)b).ToArray());
}
As far as I can tell this yields the same output as the website you linked. However, without knowing what API you're handing this string off to, I can't guarantee that this will ultimately work.
The point of string is that we don't need to worry about its underlying encoding, but this casting operation is kind of a giant hack and makes no guarantees that string represents a well-formed string anymore.
If something expects a UTF-8 encoding, it should accept a byte[], not a string.
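To make the distinction above concrete, here is a minimal sketch that is not tied to any particular SMS API. The class and method names (Utf8Sketch, ToUtf8Bytes, ToByteString) are made up for illustration. The first method is the form an API ought to accept; the second reproduces the 'cast' hack, which is only useful if the receiving side really expects one character per UTF-8 byte.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class Utf8Sketch
{
    // Preferred: the UTF-8 form of a string is a byte array.
    public static byte[] ToUtf8Bytes(string text)
    {
        return Encoding.UTF8.GetBytes(text);
    }

    // The 'cast' hack: each UTF-8 byte becomes one char with the same
    // numeric value (0-255). The result is not meaningful text.
    public static string ToByteString(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return new string(bytes.Select(b => (char)b).ToArray());
    }
}
```

For example, "é" (U+00E9) is two bytes in UTF-8 (0xC3 0xA9), so ToByteString("é") returns the two-character string "Ã©".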

Try this:
string output = "hello world";
byte[] bytes1 = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, Encoding.Unicode.GetBytes(output));
byte[] bytes2 = Encoding.Convert(Encoding.Unicode, Encoding.Unicode, Encoding.Unicode.GetBytes(output));
var output1 = Encoding.UTF8.GetString(bytes1);
var output2 = Encoding.Unicode.GetString(bytes2);
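You will see that bytes1 is 11 bytes (UTF-8 uses 1 byte per character for ASCII text like this) and bytes2 is 22 bytes (Encoding.Unicode is UTF-16, which uses 2 bytes per character here). Non-ASCII characters take 2 to 4 bytes each in UTF-8.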
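The byte counts can also be checked directly with Encoding.GetByteCount, without converting anything. The following is a small sketch; ByteCountSketch and its method names are hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class ByteCountSketch
{
    // Number of bytes the string occupies when encoded as UTF-8.
    public static int Utf8Count(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }

    // Number of bytes the string occupies when encoded as UTF-16
    // (which .NET calls Encoding.Unicode).
    public static int Utf16Count(string s)
    {
        return Encoding.Unicode.GetByteCount(s);
    }
}
```

For "hello world" this gives 11 and 22. For the four Devanagari code points "\u0939\u0948\u0932\u094B" (हैलो) it gives 12 for UTF-8 and 8 for UTF-16, so UTF-8 is not always the smaller encoding.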

Related

Base 64 encoding and Decoding

I want to decode special characters from Base64, and I want to include a space among them. Please tell me how to handle spaces in Base64 encoding and decoding.
<add key="SpecialCharacter" value="w6J8YSzDoXxhLMOlfGEsw6R8YSzDo3xhLMOmfGFlLMSNfGMsw6l8ZSzDqHxlLMOqfGUsw6t8ZSzDu3x1LMO6fHUsw7x8dSzDuHxvLMOzfG8sw7R8byzDtnxvLMOyfG8sxaF8byxgfCwnfCzFgnxsLMW8fHo="/>

Take a look into these functions:
public static string ToBase64(this string value)
{
byte[] bytes = Encoding.Default.GetBytes(value);
return Convert.ToBase64String(bytes);
}
public static string FromBase64(this string value)
{
byte[] bytes = Convert.FromBase64String(value);
return Encoding.Default.GetString(bytes);
}
The first one converts a string to a base 64 string.
e.g.: string base64 = "Hello World!".ToBase64()
The second one returns it to a 'normal' string:
string original = base64.FromBase64()
The core functionality is in Convert.FromBase64String(string value). It returns a byte array, which has to be converted to a string with an Encoding. When you know which encoding was used, you should use it rather than Encoding.Default (which is the system ANSI code page on .NET Framework and UTF-8 on .NET Core and later, not UTF-16).
I tested some encodings. In your case, the string is encoded in UTF-8 and decodes to:
â|a,á|a,å|a,ä|a,ã|a,æ|ae,č|c,é|e,è|e,ê|e,ë|e,û|u,ú|u,ü|u,ø|o,ó|o,ô|o,ö|o,ò|o,š|o,`|,'|,ł|l,ż|z
In addition to my comment to @WDS, here again: spaces get encoded to Base64 like any other character. No special handling is needed for them.
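Since the data in the question turned out to be UTF-8, a variant of the two functions that names the encoding explicitly might look like the sketch below. Base64Utf8Sketch, Encode and Decode are hypothetical names; plain static methods are used instead of extension methods only to keep the example short.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class Base64Utf8Sketch
{
    // Text -> UTF-8 bytes -> Base64 text.
    public static string Encode(string value)
    {
        return Convert.ToBase64String(Encoding.UTF8.GetBytes(value));
    }

    // Base64 text -> bytes -> text, assuming the bytes are UTF-8.
    public static string Decode(string base64)
    {
        return Encoding.UTF8.GetString(Convert.FromBase64String(base64));
    }
}
```

A space is just another byte (0x20) to Base64: Encode("a b") gives "YSBi", and Decode("YSBi") gives "a b" back.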

"Specified value has invalid Control characters" when converting SHA512 output to string

I am attempting to create a hash for an API.
My input is something like this:
FBN|Web|3QTC0001|RS1|260214133217|000000131127897656
And my expected output is something like this:
17361DU87HT56F0O9967E34FDFFDFG7UO334665324308667FDGJKD66F9888766DFKKJJR466634HH6566734JHJH34766734NMBBN463499876554234343432456
I tried the code below, but I keep getting
"Specified value has invalid Control characters. Parameter name: value"
I am actually doing this in a REST service.
public static string GetHash(string text)
{
string hash = "";
SHA512 alg = SHA512.Create();
byte[] result = alg.ComputeHash(Encoding.UTF8.GetBytes(text));
hash = Encoding.UTF8.GetString(result);
return hash;
}
What am I missing?

The problem is Encoding.UTF8.GetString(result) as the data in result is invalid UTF-8 (it's just binary goo!) so trying to convert it to text is invalid - in general, and specifically for this input - which results in the Exception being thrown.
Instead, convert the byte[] to the hex representation of said byte sequence; don't treat it as UTF-8 encoded text.
See the questions How do you convert Byte Array to Hexadecimal String, and vice versa? and How can I convert a hex string to a byte array?, which discuss several different methods of achieving this task.

In order to make this work, you need to convert the individual bytes into a hex representation:
var builder = new StringBuilder();
foreach(var b in result) {
builder.AppendFormat("{0:X2}", b);
}
return builder.ToString();
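Put together with the hashing code from the question, the loop above could be wrapped up as in the following sketch. HashHexSketch and GetHashHex are hypothetical names; the hashing and formatting calls are the same ones already used in this thread.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class HashHexSketch
{
    // Returns the SHA-512 hash of the UTF-8 bytes of 'text'
    // as 128 uppercase hex characters.
    public static string GetHashHex(string text)
    {
        using (SHA512 alg = SHA512.Create())
        {
            byte[] result = alg.ComputeHash(Encoding.UTF8.GetBytes(text));
            var builder = new StringBuilder(result.Length * 2);
            foreach (byte b in result)
            {
                // Two hex digits per byte, zero-padded.
                builder.AppendFormat("{0:X2}", b);
            }
            return builder.ToString();
        }
    }
}
```

The result contains only the characters 0-9 and A-F, so it is safe to place in a header or URL. Use "{0:x2}" instead if the API expects lowercase hex.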

You might want to consider using Base64 encoding instead:
public static string GetHash(string text)
{
SHA512 alg = SHA512.Create();
byte[] result = alg.ComputeHash(Encoding.UTF8.GetBytes(text));
return Convert.ToBase64String(result);
}
For your example string, the result is
OJgzW5JdC1IMdVfC0dH98J8tIIlbUgkNtZLmOZsjg9H0wRmwd02tT0Bh/uTOw/Zs+sgaImQD3hh0MlzVbqWXZg==
It has the advantage of being more compact than encoding each byte as two hex characters: three bytes take four characters with Base64, versus six characters as hex.
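The size difference is easy to check for a 64-byte SHA-512 digest. This is a rough sketch with hypothetical names (DigestTextSketch, ToHex, ToBase64); it only formats bytes and does no hashing itself.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class DigestTextSketch
{
    // Two hex characters per byte. BitConverter.ToString yields
    // "AB-CD-...", so the dashes are stripped.
    public static string ToHex(byte[] data)
    {
        return BitConverter.ToString(data).Replace("-", "");
    }

    // Four Base64 characters per three bytes, padded with '='.
    public static string ToBase64(byte[] data)
    {
        return Convert.ToBase64String(data);
    }
}
```

For 64 bytes, ToHex produces 128 characters and ToBase64 produces 88. Whichever form is chosen, it has to match what the API on the other end expects.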

UTF-8 gives extra string in German character

I have a file name testtäöüßÄÖÜ. I want to convert it to UTF-8 using C#.
string test ="testtäöüß";
var bytes = new List<byte>(test.Length);
foreach (var c in test)
bytes.Add((byte)c);
var retValue = Encoding.UTF8.GetString(bytes.ToArray());
After running this code, my output is: 'testt mit Umlaute äöü?x', where "mit Umlaute" is extra text.
Can anybody help me?
Thanks in advance.

You can't do that. You can't cast a character to a byte to get UTF-8. In UTF-8, anything other than ASCII requires at least two bytes, and a single byte can't store that.
Instead of creating a list, use
byte[] bytes = System.Text.Encoding.UTF8.GetBytes (test);
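The difference between the two approaches shows up as soon as a character falls outside ASCII. Here is a small sketch with hypothetical names (UmlautSketch, CastEachChar, EncodeUtf8) that returns the bytes each approach produces.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class UmlautSketch
{
    // What the code in the question does: keeps only the low 8 bits
    // of each UTF-16 code unit. The result is not UTF-8.
    public static byte[] CastEachChar(string s)
    {
        return s.Select(c => (byte)c).ToArray();
    }

    // What was intended: let the encoder produce the UTF-8 bytes.
    public static byte[] EncodeUtf8(string s)
    {
        return Encoding.UTF8.GetBytes(s);
    }
}
```

For "ä" (U+00E4), CastEachChar returns the single byte 0xE4, which is not a valid UTF-8 sequence on its own, while EncodeUtf8 returns the two bytes 0xC3 0xA4. Decoding the first result with Encoding.UTF8.GetString is what garbles the output.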

I think Tseng means the following.
Taken from: http://www.chilkatsoft.com/p/p_320.asp
System.Text.Encoding utf_8 = System.Text.Encoding.UTF8;
// This is our Unicode string:
string s_unicode = "abcéabc";
// Convert a string to utf-8 bytes.
byte[] utf8Bytes = System.Text.Encoding.UTF8.GetBytes(s_unicode);
// Convert utf-8 bytes to a string.
string s_unicode2 = System.Text.Encoding.UTF8.GetString(utf8Bytes);
MessageBox.Show(s_unicode2);

Encoding not converting

An ASP.NET page (ashx) receives a GET request with a UTF8 string. It reads a SqlServer database with Windows-1255 data.
I can't seem to get them to work together. I've used information gathered on SO (mainly Convert a string's character encoding from windows-1252 to utf-8) as well as msdn on the subject.
When I run anything through the functions below - it always ends up the same as it started - not converted at all.
Is something done wrong?
EDIT
What I'm specifically trying to do (getData returns a Dictionary<int, string>):
getData().Where(a => a.Value.Contains(context.Request.QueryString["q"]))
Result is empty, unless I send a "neutral" character such as "'" or ",".
CODE
string windows1255FromUTF8(string p)
{
Encoding win = Encoding.GetEncoding(1255);
Encoding utf8 = Encoding.UTF8;
byte[] utfBytes = utf8.GetBytes(p);
byte[] winBytes = Encoding.Convert(utf8, win, utfBytes);
return win.GetString(winBytes);
}
string UTF8FromWindows1255(string p)
{
Encoding win = Encoding.GetEncoding(1255);
Encoding utf8 = Encoding.UTF8;
byte[] winBytes = win.GetBytes(p);
byte[] utfBytes = Encoding.Convert(win, utf8, winBytes);
return utf8.GetString(utfBytes);
}

There is nothing wrong with the functions, they are simply useless.
What the functions do is to encode the strings into bytes, convert the data from one encoding to another, then decode the bytes back to a string. Unless the string contains a character that is not possible to encode using the windows-1255 encoding, the returned value should be identical to the input.
Strings in .NET don't have an encoding that you need to manage. If you get a string from a source where the text was encoded using, for example, UTF-8, then once it has been decoded into a string it no longer has that encoding. You don't have to do anything to a string to use it when the destination has a specific encoding; whatever library you are using that takes the string will take care of the encoding.
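The point that such functions return their input unchanged can be checked with a generic version of them. The sketch below uses hypothetical names (RoundTripSketch, RoundTrip) and takes the two encodings as parameters. Note that on .NET Core and later, Encoding.GetEncoding(1255) only works after registering CodePagesEncodingProvider, so the usage example sticks to the built-in UTF-8 and UTF-16 encodings.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class RoundTripSketch
{
    // Same shape as the functions in the question: encode, convert,
    // decode. As long as every character exists in both encodings,
    // the returned string equals the input.
    public static string RoundTrip(string text, Encoding from, Encoding to)
    {
        byte[] fromBytes = from.GetBytes(text);
        byte[] toBytes = Encoding.Convert(from, to, fromBytes);
        return to.GetString(toBytes);
    }
}
```

For example, RoundTrip("שלום", Encoding.UTF8, Encoding.Unicode) returns "שלום" again. If the comparison in the question still fails, the likelier cause is that one of the two strings was decoded with the wrong encoding before it ever reached this code.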

For some reason this worked:
byte[] fromBytes = Encoding.UTF8.GetBytes(myString);
string finalString = (Encoding.GetEncoding(1255)).GetString(fromBytes);
Switching encoding without the conversion...

C# Encoding.Convert Vs C++ MultiByteToWideChar

I have a C++ code snippet that uses MultiByteToWideChar to convert a UTF-8 string to UTF-16.
For C++, if input is "HÃ´tel", the output is "Hôtel" which is correct
For C#, if input is "HÃ´tel", the output is "HÃ´tel" which is not correct.
The C# code to convert from UTF8 to UTF16 looks like
Encoding.Unicode.GetString(
Encoding.Convert(
Encoding.UTF8,
Encoding.Unicode,
Encoding.UTF8.GetBytes(utf8)));
In C++ the conversion code looks like
MultiByteToWideChar(
CP_UTF8, // convert from UTF-8
0, // default flags
utf8.data(), // source UTF-8 string
utf8.length(), // length (in chars) of source UTF-8 string
&utf16[0], // destination buffer
utf16.length() // size of destination buffer, in wchar_t's
)
I want to get the same results in C# that I am getting in C++. Is there anything wrong with the C# code?

It appears you want to treat the string's characters as Windows-1252 (often mislabeled as ANSI) code points, and have those code points decoded as UTF-8 bytes, where Windows-1252 code point == UTF-8 byte value.
The reason the accepted answer doesn't work is that it treats the string characters as Unicode code points rather than Windows-1252. It gets away with most characters because Windows-1252 maps them to the same values as Unicode, but input with characters like –, €, ™, ‘, ’, ”, • etc. will fail because Windows-1252 maps those differently than Unicode does.
So what you want is simply this:
public static string doWeirdMapping(string arg)
{
Encoding w1252 = Encoding.GetEncoding(1252);
return Encoding.UTF8.GetString(w1252.GetBytes(arg));
}
Then:
Console.WriteLine(doWeirdMapping("HÃ´tel")); //prints Hôtel
Console.WriteLine(doWeirdMapping("HVOLSVÃ–LLUR")); //prints HVOLSVÖLLUR

Maybe this one:
private static string Utf8ToUnicode(string input)
{
return Encoding.UTF8.GetString(input.Select(item => (byte)item).ToArray());
}
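The byte cast in this last answer behaves like encoding the string as ISO-8859-1 (Latin-1), which differs from Windows-1252 for characters such as "–" or "€". The sketch below makes that explicit; Latin1Sketch and DecodeAsUtf8 are hypothetical names, and ISO-8859-1 is used here only because it is available in .NET without registering extra code pages.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class Latin1Sketch
{
    // Turns each char (U+0000 to U+00FF) back into a single byte,
    // then decodes those bytes as UTF-8. Characters above U+00FF
    // cannot be represented and are replaced with '?'.
    public static string DecodeAsUtf8(string mojibake)
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        return Encoding.UTF8.GetString(latin1.GetBytes(mojibake));
    }
}
```

DecodeAsUtf8("HÃ´tel") returns "Hôtel", because both "Ã" and "´" are Latin-1 characters. "HVOLSVÃ–LLUR" does not come out as "HVOLSVÖLLUR", since "–" (U+2013) has no Latin-1 byte; that case needs Encoding.GetEncoding(1252) as in the earlier answer.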

Try This
string str = "abc!";
Encoding unicode = Encoding.Unicode;
Encoding utf8 = Encoding.UTF8;
byte[] unicodeBytes = unicode.GetBytes(str);
byte[] utf8Bytes = Encoding.Convert( unicode,
utf8,
unicodeBytes );
Console.WriteLine( "UTF Bytes:" );
StringBuilder sb = new StringBuilder();
foreach( byte b in utf8Bytes ) {
sb.Append( b ).Append(" : ");
}
Console.WriteLine( sb.ToString() );
This Link would be helpful for you to understand about encodings and their conversions

Use System.Text.Encoding.UTF8.GetString().
Pass in your UTF-8 encoded text as a byte array. The function returns a standard .NET string, which is UTF-16 internally.
A sample function is shown below:
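private string ReadData(Stream binary_file) {
    System.Text.Encoding encoding = System.Text.Encoding.UTF8;
    // Read up to 30 bytes from the binary file
    byte[] buffer = new byte[30];
    int bytesRead = binary_file.Read(buffer, 0, buffer.Length);
    // Decode only the bytes that were actually read, as UTF-8;
    // otherwise unused buffer space shows up as trailing '\0' characters
    return encoding.GetString(buffer, 0, bytesRead);
}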
