I searched for "How to encode data in UTF-8 format", and the best result I found was the following:
UTF8Encoding utf8 = new UTF8Encoding();
String unicodeString = "ABCD";
// Encode the string.
Byte[] encodedBytes = utf8.GetBytes(unicodeString);
// Decode bytes back to string.
String decodedString = utf8.GetString(encodedBytes);
But the problem is that when I look at the encoded data, it is nothing more than the ASCII codes.
Can anyone help me improve my understanding?
For example, when I pass "ABCD", it gets converted into 65, 66, 67, 68... I don't think this is UTF-8.
UTF-8 is backwards compatible with ASCII, of course. You should test with some characters that are not included in ASCII.
If you program in C#, the strings are already encoded in UTF-16, so you will not see anything special there. If you want to see a difference, compare the LENGTH of the byte[] you get when you encode the same string into different encodings.
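For instance, here is a minimal sketch (the sample string is just an illustration) that encodes a string containing non-ASCII characters and prints the byte count for several encodings:

using System;
using System.Text;

class EncodingLengthDemo
{
    static void Main()
    {
        // "héllo €" contains characters outside the ASCII range
        string s = "héllo €";

        Console.WriteLine(Encoding.ASCII.GetBytes(s).Length);   // 7  (é and € are replaced with '?')
        Console.WriteLine(Encoding.UTF8.GetBytes(s).Length);    // 10 (é takes 2 bytes, € takes 3)
        Console.WriteLine(Encoding.Unicode.GetBytes(s).Length); // 14 (UTF-16: 2 bytes per character here)
    }
}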
Check out the Wikipedia article on UTF8: Wikipedia.
From there:
Backward compatibility: One-byte codes are used only for the ASCII
values 0 through 127. In this case the UTF-8 code has the same value
as the ASCII code. The high-order bit of these codes is always 0. This
means that UTF-8 can be used for parsers expecting 8-bit extended
ASCII even if they are not designed for UTF-8.
The point here is that anything that would be ASCII 0-127 is the same in UTF-8. You need to try characters outside that range (an example in the article is the Euro symbol) to see how it's different, or try a code point greater than 127 and you'll see a difference.
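As a quick illustration (reusing the utf8 instance from the question), the Euro sign U+20AC comes out as three bytes in UTF-8, while "ABCD" stays identical to its ASCII codes:

// '€' (U+20AC) encodes to the three bytes 226, 130, 172 (0xE2 0x82 0xAC) in UTF-8
byte[] euroBytes = utf8.GetBytes("\u20AC");
Console.WriteLine(string.Join(",", euroBytes)); // prints 226,130,172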
I'm trying to make a decoder. The base system is .NET 4.7. I'm trying to migrate this system to PHP, but I'm having trouble converting bytes. As far as I understand, the default string encoding in C# is UTF-16LE, and I understand the ord and chr functions as UCS-2 on the PHP side. I want to do the following, but I do not get the same result; the code is below. What can I do to fix this? Thanks in advance.
XOR Encoded Text Bytes = [101,107,217,78,40,68,234,218,162,67,139,81,44,166,24,148];
on C#
string result = System.Text.Encoding.UTF8.GetString(destinationArray);
On PHP
for ($i = 0; $i < sizeof($encoded); $i++) {
    echo "\t" . $encoded[$i] . " => " . chr($encoded[$i]) . "\n";
    $tmpStr .= chr($encoded[$i]);
}
C# Result size=26:
ek�N(D�ڢC�Q,��
PHP Result size=16:
ek�N(D�ڢC�Q,��
The strings look the same, but the byte translation is quite different.
C# Result to Bytes array:
byte[] utf8 = System.Text.Encoding.Unicode.GetBytes(result);
Console.WriteLine(string.Join("-", utf8));
response =
101-0-107-0-253-255-78-0-40-0-68-0-253-255-162-6-67-0-253-255-81-0-44-0-253-255-24-0-253-255
PHP Result to Bytes Array:
echo implode("-",unpack("C*", $tmpStr));
response = 101-107-217-78-40-68-234-218-162-67-139-81-44-166-24-148
If the PHP response is converted to UTF-16LE, the results are again different:
echo implode("-",unpack("C*", mb_convert_encoding($tmpStr,'UTF-16le')));
response =
101-0-107-0-63-0-78-0-40-0-68-0-63-0-162-6-67-0-63-0-81-0-44-0-63-0-24-0-63-0
You are mixing quite different things here.
First, in the C# code, you are not using the same encoding when converting from bytes to a string and then from the string back to bytes: Encoding.UTF8 in the first case and Encoding.Unicode (which is the .NET name for UTF-16) in the latter... Things cannot go well if you do this. And by the way, I'm not sure that PHP's UCS-2 is equivalent to UTF-16 (a small illustration follows the list below):
UTF-8 encodes characters on 1, 2, 3 or 4 bytes depending on the character
UTF-16 encodes characters on 2 or 4 bytes depending on the character
UCS-2 always encodes characters on 2 bytes, and hence cannot encode more than 65536 characters...
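Here is a small illustration of those widths (a sketch; the clef character is just an example of a code point outside the Basic Multilingual Plane):

// '𝄞' (U+1D11E, musical G clef) lies outside the BMP
string clef = "\U0001D11E";
Console.WriteLine(Encoding.UTF8.GetBytes(clef).Length);    // 4 bytes in UTF-8
Console.WriteLine(Encoding.Unicode.GetBytes(clef).Length); // 4 bytes in UTF-16 (a surrogate pair)
// UCS-2 cannot represent this character at all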
Then what you pass to the 'bytes to string' conversions is not necessarily valid! Because you've XORed the input data (I assume it to be some secret string), the resulting bytes may or may not be a valid sequence in some encodings. For example:
It is not valid in ASCII because you have (in your example) bytes > 127
It is not valid in UTF-8 because 217 followed by 78 is not recognized as a valid 1-, 2-, 3-, or 4-byte UTF-8 sequence; hence the � you see before the N.
It seems to be invalid UTF-16 as well, but roundtripping works (I could get back the original array using .NET's Unicode.GetString, then Unicode.GetBytes). If I remove your last byte - ending up with an odd number of bytes - then UTF-16 roundtripping does not work any more...
Although I did not test it, it should also be invalid UCS-2 because UCS-2 'looks like' UTF-16 for 2-byte characters.
Roundtripping also works with ANSI encodings such as windows-1252, because these encodings accept any byte. However, I would discourage using such a trick, because you have to be sure the same code page is used on both sides of the encoding/decoding process.
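To illustrate the roundtrip (a sketch only, not a recommendation; on .NET Core/.NET 5+ you would also have to register the code-page provider first):

// Roundtripping arbitrary bytes through windows-1252
// (.NET Core / .NET 5+: call Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) first)
byte[] original = { 101, 107, 217, 78, 40, 68, 234, 218, 162, 67, 139, 81, 44, 166, 24, 148 };
Encoding win1252 = Encoding.GetEncoding(1252);
string asText = win1252.GetString(original);
byte[] roundTripped = win1252.GetBytes(asText); // same 16 bytes as original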
Therefore, I think, in your case, the best way to store your XORed bytes into a string would be to convert the array to base64. In C# you can do it this way:
// The code below gives you ZWt1TihEInY+QydRLEIYMA==
var converted = Convert.ToBase64String(array);
// And this one gives you back the initial array
var bytes = Convert.FromBase64String(converted);
Quick googling will tell you to use base64_encode and base64_decode in PHP.
Bottom note: if you want to really understand what's going on with all this encoding stuff, here is the must-read blog post on the subject: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
I have an array of bytes which contains some characters that are not valid UTF-8. These characters cannot be deserialized using UTF-8 encoding. So, my question is: how can I handle these characters and make the string readable in whatever language it is?
For example, if I have an array:
byte[] b = myArrayWithNonUTF8Characters;
And I try to deserialize the array with:
DataContractJsonSerializer jsonSerializer = new DataContractJsonSerializer(typeof(MyObject));
MyObject objResponse = (MyObject)jsonSerializer.ReadObject(new MemoryStream(b));
Then I get an error that the array contains invalid UTF8 bytes.
Any way to make this work?
PS: Please, do not give me this answer: string s = System.Text.Encoding.UTF8.GetString(b, 0, b.Length); It will only return symbols replacing the non-UTF-8 characters.
The beauty of UTF is that it encodes characters in most languages, so you can have Greek and Japanese in the same character stream.
Without UTF, your entire stream (or, in your case, array) must be in a single language defined by a code page. Each character is represented by a single byte, but which character that byte stands for is determined by the code page (see http://en.wikipedia.org/wiki/Code_page for more details).
For example if your text was written in Greek you might use Code Page 111:
System.Text.Encoding.GetEncoding(111)
In short, you need to know which code page (i.e. which language) the text was written in.
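If you do know the code page, one possible approach (a sketch, assuming the data is Greek in windows-1253; substitute whatever code page your text actually uses) is to convert the bytes to UTF-8 before handing them to the serializer:

using System.IO;
using System.Runtime.Serialization.Json;
using System.Text;

// Assumption: the source bytes are windows-1253 (Greek); use your real code page here
byte[] utf8Bytes = Encoding.Convert(Encoding.GetEncoding(1253), Encoding.UTF8, b);

var jsonSerializer = new DataContractJsonSerializer(typeof(MyObject));
MyObject objResponse = (MyObject)jsonSerializer.ReadObject(new MemoryStream(utf8Bytes));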
I want to convert the ASCII-encoded text input by my users into UTF-8 encoding, so that I can display it with any Unicode font. For example, I want to display the English letter 'l' (in ASCII) as 'ക' in Unicode. I think I would also require a mapping system, so that I can map 'l' to 'ക'. Please help me solve this issue.
Your text is in ISCII (Indian Script Code for Information Interchange). You need to convert ISCII, using the proper code page, to Unicode. The following methods should do the job: Convert converts a given byte array from one encoding to another, and GetEncoding provides the Encoding objects to be used by the Convert method.
Example code can be found here: http://www.dotnetframework.org/default.aspx/Net/Net/3#5#50727#3053/DEVDIV/depot/DevDiv/releases/whidbey/netfxsp/ndp/clr/src/BCL/System/Text/ISCIIEncoding#cs/1/ISCIIEncoding#cs
Code page identifiers can be found here:
http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx
public static byte[] Convert(System.Text.Encoding srcEncoding, System.Text.Encoding dstEncoding, byte[] bytes)
Member of System.Text.Encoding
Summary:
Converts an entire byte array from one encoding to another.
Parameters:
srcEncoding: The encoding format of bytes.
dstEncoding: The target encoding format.
bytes: The array of bytes to convert.
Returns:
An array of type System.Byte containing the results of converting bytes from srcEncoding to dstEncoding.
and this
public static System.Text.Encoding GetEncoding(int codepage)
Member of System.Text.Encoding
Summary:
Returns the encoding associated with the specified code page identifier.
Parameters:
codepage: The code page identifier of the preferred encoding. -or- 0, to use the default encoding.
Returns:
The System.Text.Encoding associated with the specified code page.
As per the Wikipedia article, the code page for Malayalam ISCII is 57009.
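Putting those two methods together, a minimal sketch might look like this (isciiBytes is a hypothetical variable holding your ISCII input; on .NET Core/.NET 5+ this code page may additionally require registering the System.Text.Encoding.CodePages provider):

using System.Text;

// 57009 is the ISCII Malayalam code page mentioned above
Encoding isciiMalayalam = Encoding.GetEncoding(57009);

// Either convert the raw bytes to UTF-8...
byte[] utf8Bytes = Encoding.Convert(isciiMalayalam, Encoding.UTF8, isciiBytes);

// ...or decode straight to a .NET (UTF-16) string for display
string text = isciiMalayalam.GetString(isciiBytes);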
Encoding.UTF8.GetString(Encoding.ASCII.GetBytes(input))
Your question makes no sense. Changing the encoding from ASCII to UTF-8 does not magically turn an 'l' into a 'ക'; it only changes the byte representation of the 'l' (actually, since ASCII is a subset of UTF-8, it does not even do that here; it does nothing).
What you probably want is some kind of transliteration between the Latin and Malayalam alphabets, but that is something completely different.
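If transliteration is what you are actually after, you would need an explicit mapping table of your own; a trivial and entirely hypothetical sketch:

using System.Collections.Generic;
using System.Text;

// Hypothetical Latin-to-Malayalam mapping; a real transliteration scheme is far more involved
var map = new Dictionary<char, char> { { 'l', 'ക' } };

var sb = new StringBuilder();
foreach (char c in "l")
    sb.Append(map.TryGetValue(c, out char mapped) ? mapped : c);

string result = sb.ToString(); // "ക"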
How do I make a string in C# accept non-printable extended-ASCII characters like '•'? When I try to put '•' in a string, it just gives a blank space or null.
Extended ASCII is just ASCII with the high (eighth) bit used to add another 128 characters, and those extra characters mean different things in different variants.
The problem lies in the fact that no commission has ratified a standard for extended ASCII. There are a lot of variants out there and there's no way to tell what you are using.
Now C# uses UTF-16 encoding which will be different from whichever extended ASCII you are using.
You will have to find the matching Unicode character and display it as follows
string a = "\u2649"; // where 2649 is the Unicode code point in hexadecimal
Console.Write(a);
Alternatively you could find out which encoding your files use and use it like so
eg. encoding Windows-1252:
Encoding encoding = Encoding.GetEncoding(1252);
and for UTF-16
Encoding enc = new UnicodeEncoding(false, true, true);
and convert it using
Encoding.Convert (Encoding, Encoding, Byte[], Int32, Int32)
Details are here
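A rough sketch of those steps, assuming the source byte is Windows-1252 (0x95 is the '•' bullet in that code page):

// Assumed input: a single Windows-1252 byte for '•'
byte[] sourceBytes = { 0x95 };
Encoding windows1252 = Encoding.GetEncoding(1252);
Encoding utf16 = new UnicodeEncoding(false, true, true);

byte[] utf16Bytes = Encoding.Convert(windows1252, utf16, sourceBytes, 0, sourceBytes.Length);
string text = utf16.GetString(utf16Bytes); // "•"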
Try this:
Convert those characters to a string as follows.
string equivalentLetter = Encoding.Default.GetString(new byte[] { (byte)letter });
Now equivalentLetter contains the correct string.
I tried this with the Euro symbol and it worked.
.NET strings are UTF-16 encoded, not extended ASCII (whatever that is). Simply adding a number to a character gives you another character defined within UTF-16. If you want to see the underlying character as it would be in your extended-ASCII encoding, you need to convert the newly calculated character from whatever encoding you are talking about to UTF-16. See: http://msdn.microsoft.com/en-us/library/66sschk1.aspx
Does Ruby have an equivalent to .NET's Encoding.ASCII.GetString(byte[])?
Encoding.ASCII.GetString(bytes[]) takes an array of bytes and returns a string after decoding the bytes using the ASCII encoding.
Assuming your data is in an array like so (each element is a byte, and further, from the description you posted, no larger than 127 in value, that is, a 7-bit ASCII character):
array =[104, 101, 108, 108, 111]
string = array.pack("c*")
After this, string will contain "hello", which is what I believe you're requesting.
The pack method "Packs the contents of arr into a binary sequence according to the directives in the given template string".
"c*" asks the method to interpret each element of the array as a "char". Use "C*" if you want to interpret them as unsigned chars.
http://ruby-doc.org/core/classes/Array.html#M002222
The example given in the documentation page uses the function to convert a string with Unicode characters. In Ruby I believe this is best done using Iconv:
require "iconv"
require "pp"
# Ruby's representation of Unicode characters is different
unicodeString = "This unicode string contains two characters " +
                "with codes outside the ASCII code range, " +
                "Pi (\316\240) and Sigma (\316\243)."
#printing original string
puts unicodeString
i = Iconv.new("ASCII//IGNORE","UTF-8")
#Printing converted string, unicode characters stripped
puts i.iconv(unicodeString)
bytes = i.iconv(unicodeString).unpack("c*")
#printing array of bytes of converted string
pp bytes
Read up on Ruby's Iconv here.
You might also want to check this question.