Convert strings to UTF-16 encoding in C#

I have a few strings which are encoded in Windows-1252, UTF-8 and UTF-16. Ultimately I have to convert all of them to UTF-16 for comparison. How do I do this?
I know that if we know the source encoding we can convert to a destination encoding, but here I need to convert strings (which may be encoded in any of those formats) to UTF-16 (the default).
var url = @"file:///C:/Users/Œser/file.html";
Uri parsedurl;
var pass = Uri.TryCreate(url.Trim(), UriKind.Absolute, out parsedurl);
At this point parsedurl.AbsoluteUri prints file:///C:/Users/%C5%92ser/file.html, which is expected.
Then I load the HTML file in the IE WebBrowser control and intercept Navigate:
strURL = URL.ToString();
Now strURL prints file:///C:/Users/%8Cser/file.html

.NET string values are always UTF-16 (at least until Utf8String, which is looking like .NET 7 or .NET 8 now). So presumably you have some bytes or streams that are encoded in various encodings, which you want to convert to UTF-16 string instances.
The key here is Encoding; for example:
var enc = Encoding.GetEncoding(1252); // Windows-1252
var enc = Encoding.UTF8;              // UTF-8
var enc = Encoding.BigEndianUnicode;  // UTF-16, big-endian
var enc = Encoding.Unicode;           // UTF-16, little-endian
You can use this encoding manually (GetString(...), GetEncoder(...) etc) - or you can pass it to a TextReader such as StreamReader as an optional constructor argument.
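A minimal sketch of both routes, assuming you start from raw Windows-1252 bytes and a UTF-8 file (the byte values and file name are just placeholders):

using System.IO;
using System.Text;

// Decode raw Windows-1252 bytes into a (UTF-16) string manually.
byte[] cp1252Bytes = { 0x8C, 0x73, 0x65, 0x72 }; // "Œser" in Windows-1252
string decoded = Encoding.GetEncoding(1252).GetString(cp1252Bytes);

// Or let a StreamReader do the decoding while reading a file.
using (var reader = new StreamReader("file.html", Encoding.UTF8))
{
    string contents = reader.ReadToEnd(); // contents is now a regular UTF-16 string
}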
Note that code page 1252 may not be available in .NET Core / .NET 5 (only .NET Framework), as it depends on the encodings the OS provides. You may have to settle for "Western European (ISO)" (iso-8859-1, code-page 28591, i.e. Encoding.GetEncoding(28591)).
From https://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html:
ISO-8859-1 (also called Latin-1) is identical to Windows-1252 (also called CP1252) except for the code points 128-159 (0x80-0x9F). ISO-8859-1 assigns several control codes in this range. Windows-1252 has several characters, punctuation, arithmetic and business symbols assigned to these code points.
Similarly, Encoding can be used to write to any chosen encoding, if you want to get bytes again - presumably using either of the UTF-16 variants.
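For example, a sketch of getting bytes back out in whichever UTF-16 variant the consumer expects:

using System.Text;

string text = "Œser";
byte[] utf16LittleEndian = Encoding.Unicode.GetBytes(text);        // UTF-16 LE
byte[] utf16BigEndian = Encoding.BigEndianUnicode.GetBytes(text);  // UTF-16 BE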

Related

Why does File.ReadAllText() also recognize UTF-16 encodings?

I read a file using
File.ReadAllText(..., Encoding.ASCII);
According to the documentation [MSDN] (emphasis mine),
This method attempts to automatically detect the encoding of a file based on the presence of byte order marks. Encoding formats UTF-8 and UTF-32 (both big-endian and little-endian) can be detected.
However, in my case the ASCII file incorrectly started with 0xFE 0xFF and it detected UTF-16 (probably big endian, but I did not check).
According to File [referencesource] it uses a StreamReader:
private static String InternalReadAllText(String path, Encoding encoding, bool checkHost)
{
    ...
    using (StreamReader sr = new StreamReader(path, encoding, true, StreamReader.DefaultBufferSize, checkHost))
        return sr.ReadToEnd();
}
and that StreamReader overload with 5 parameters [MSDN] is documented to support UTF-16 as well:
It automatically recognizes UTF-8, little-endian Unicode, big-endian Unicode, little-endian UTF-32, and big-endian UTF-32 text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used.
(emphasis mine)
Since File.ReadAllText() is supposed to and documented to detect Unicode BOMs, it's probably a good idea that it detects UTF-16 as well. However, the documentation is wrong and should be updated. I filed issue #7515.
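If you want the file read strictly with the encoding you supply, regardless of a stray BOM, a sketch using the StreamReader overload that turns detection off (path is a placeholder):

using System.IO;
using System.Text;

using (var sr = new StreamReader(path, Encoding.ASCII, detectEncodingFromByteOrderMarks: false))
{
    // The leading 0xFE 0xFF bytes are now decoded via the ASCII fallback ('?')
    // instead of being interpreted as a UTF-16 byte order mark.
    string text = sr.ReadToEnd();
}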

UTF-8 Encoding and Decoding in C#

I searched for "how to encode data in UTF-8 format". The best result I found is the following:
UTF8Encoding utf8 = new UTF8Encoding();
String unicodeString = "ABCD";
// Encode the string.
Byte[] encodedBytes = utf8.GetBytes(unicodeString);
// Decode bytes back to string.
String decodedString = utf8.GetString(encodedBytes);
But the problem is that when I look at the encoded data, it is nothing more than ASCII codes.
Can anyone help me improve my understanding?
For example, when I pass "ABCD" it gets converted into 65, 66, 67, 68. I think this is not UTF-8.
UTF-8 is backwards compatible with ASCII, of course. You should test with some characters that are not included in ASCII.
In C#, strings are already encoded in UTF-16, so you will not see anything special there. If you want to see a difference, compare the length of the byte[] when you encode the string with different encodings.
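For instance, a quick sketch comparing byte[] lengths of the same string (assuming it contains a non-ASCII character such as 'é'):

using System;
using System.Text;

string s = "ABCDé";
Console.WriteLine(Encoding.ASCII.GetBytes(s).Length);   // 5  - 'é' is replaced by '?'
Console.WriteLine(Encoding.UTF8.GetBytes(s).Length);    // 6  - 'é' takes two bytes in UTF-8
Console.WriteLine(Encoding.Unicode.GetBytes(s).Length); // 10 - two bytes per char in UTF-16 LE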
Check out the Wikipedia article on UTF-8.
From there:
Backward compatibility: One-byte codes are used only for the ASCII values 0 through 127. In this case the UTF-8 code has the same value as the ASCII code. The high-order bit of these codes is always 0. This means that UTF-8 can be used for parsers expecting 8-bit extended ASCII even if they are not designed for UTF-8.
The point here is that anything that would be ASCII 0-127 is encoded identically in UTF-8. You need to try more extended characters (an example in the article is the Euro symbol) to see the difference, or try a value greater than 127 and you'll see it encoded differently.
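A small sketch along those lines, using the Euro sign ('€', U+20AC), which takes three bytes in UTF-8:

using System;
using System.Text;

var utf8 = new UTF8Encoding();
byte[] asciiRange = utf8.GetBytes("ABCD"); // 65, 66, 67, 68 - identical to ASCII
byte[] euro = utf8.GetBytes("€");          // 0xE2, 0x82, 0xAC - three bytes, clearly not ASCII
Console.WriteLine(string.Join(", ", euro));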

Strange behaviour when writing a string to a UTF-8 file (no BOM) - ANSI file being returned

I'm attempting to write out C# string data to a UTF-8 file without a byte order mark (BOM), but an ANSI file is created instead.
using (StreamWriter objStreamWriter = new StreamWriter(SomePath, false, new UTF8Encoding(false)))
{
    objStreamWriter.Write("Hello world - Encoding no BOM but actually returns ANSI");
    objStreamWriter.Close();
}
According to the documentation for the UTF8Encoding class constructor, setting the encoderShouldEmitUTF8Identifier parameter to false should inhibit the Byte Order Mark.
I'm using .NET Framework 4.5 on my British (en-gb) computer. A screenshot of the StreamWriter object confirmed the UTF8Encoding was in place.
So why am I getting an ANSI file (as checked with Notepad++) back from this operation?
Your example string that you're writing to the file consists only of characters in the ASCII range. The ASCII range is shared by ASCII, UTF-8 and most (all?) ANSI code pages. So, given that there is no BOM, Notepad++ has no indication if UTF-8 or ANSI is meant, and apparently defaults to ANSI.
If there is no BOM and no non-ASCII characters, how do you expect Notepad++ to recognise it as UTF-8? UTF-8, ANSI and ASCII are all identical for the characters you are emitting.
(Even if you include some unicode characters Notepad++ may struggle to guess the correct encoding.)
In "Hello world - Encoding no BOM but actually returns ANSI", no character is encoded differently in UTF8 and ANSI. Because of BOM absence, Notepad++ shows that the file is encoded in ANSI because there is no 'special character'. Try adding a "é, à, ê" character in your file and Notepad++ will show it as being encoded in UTF8 without BOM.

Extended ASCII in a C# string

How do I make a string in C# accept non-printable extended ASCII characters like "•"? When I try to put "•" in a string it just gives a blank space or null.
Extended ASCII is just ASCII with the high (eighth) bit used to add another 128 code points.
The problem lies in the fact that no commission has ratified a standard for extended ASCII. There are a lot of variants out there and there's no way to tell what you are using.
Now C# uses UTF-16 encoding which will be different from whichever extended ASCII you are using.
You will have to find the matching Unicode character and display it as follows:
string a = "\u2649"; // where 2649 is the Unicode code point (in hex)
Console.Write(a);
Alternatively, you could find out which encoding your files use and use it like so, e.g. for Windows-1252:
Encoding encoding = Encoding.GetEncoding(1252);
and for UTF-16
Encoding enc = new UnicodeEncoding(false, true, true);
and convert it using
Encoding.Convert (Encoding, Encoding, Byte[], Int32, Int32)
Details are here
Try this:
Convert those characters to a string as follows.
string equivalentLetter = Encoding.Default.GetString(new byte[] { (byte)letter });
Now equivalentLetter contains the correct string.
I tried this for the Euro symbol and it worked.
.NET strings are UTF-16 encoded, not extended ASCII (whatever that is). Simply adding a number to a character will give you another defined character within the UTF-16 code space. If you want to see the character as it would appear in your extended ASCII encoding, you need to convert the newly calculated letter from whatever encoding you are talking about to UTF-16. See: http://msdn.microsoft.com/en-us/library/66sschk1.aspx
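As a sketch, assuming the bytes really are Windows-1252 (where the bullet is byte 0x95), converting a single extended-ASCII byte into the .NET string "•":

using System;
using System.Text;

byte bulletByte = 0x95;                       // '•' in Windows-1252
Encoding cp1252 = Encoding.GetEncoding(1252);
string bullet = cp1252.GetString(new[] { bulletByte });
Console.WriteLine(bullet);                    // prints • (U+2022 in the UTF-16 string)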

How to convert a string from ISO-8859-1 to UTF-8 in C# on Windows Phone 7

My question is very simple, but at the moment I don't know how to do this. I have a string in ISO-8859-1 format and I need to convert it to UTF-8. I need to do it in C# with the Windows Phone 7 SDK. How can I do it? Thanks.
The MSDN page for the Encoding class lists the recognized encodings.
28591 iso-8859-1 Western European (ISO)
For your question the correct choice is iso-8859-1 which you can pass to Encoding.GetEncoding.
var inputEncoding = Encoding.GetEncoding("iso-8859-1");
var text = inputEncoding.GetString(input);
var output = Encoding.UTF8.GetBytes(text);
Two clarifications on the previous answers:
There is no Encoding.GetText method (unless it was introduced specifically for the WP7 framework). The method should presumably be Encoding.GetString.
The Encoding.GetString method takes a byte[] parameter, not a string. All strings in .NET are internally represented as UTF-16; there is no way of having a “string in ISO-8859-1 format”. Thus, you must be careful how you read your source (file, network), rather than how you process your string.
For example, to read from a text file encoded in ISO-8859-1, you could use:
string text = File.ReadAllText(path, Encoding.GetEncoding("iso-8859-1"));
To save to a text file encoded in UTF-8, you could use:
File.WriteAllText(path, text, Encoding.UTF8);
Reply to comment:
Yes. You can use Encoding.GetString to decode your byte array (assuming it contains character values for text under a particular encoding) into a string, and Encoding.GetBytes to convert your string back into a byte array (possibly of a different encoding), as demonstrated in the other answers.
The concept of “encoding” relates to how byte sequences (be they a byte[] array in memory or the content of a file on disk) are to be interpreted. The string class is oblivious to the encoding that the text was read from, or should be saved to.
You can use Encoding.Convert, which works pretty well, especially when you have a byte array:
var latinString = "år";
Encoding latinEncoding = Encoding.GetEncoding("iso-8859-1");
Encoding utf8Encoding = Encoding.UTF8;
byte[] latinBytes = latinEncoding.GetBytes(latinString);
byte[] utf8Bytes = Encoding.Convert(latinEncoding, utf8Encoding, latinBytes);
var utf8String = Encoding.UTF8.GetString(utf8Bytes);
