Why does File.ReadAllText() also recognize UTF-16 encodings? - c#

I read a file using
File.ReadAllText(..., Encoding.ASCII);
According to the documentation [MSDN] (emphasis mine),
This method attempts to automatically detect the encoding of a file based on the presence of byte order marks. Encoding formats UTF-8 and UTF-32 (both big-endian and little-endian) can be detected.
However, in my case the ASCII file incorrectly started with 0xFE 0xFF, and UTF-16 was detected (probably big-endian, but I did not check).

According to the File reference source [referencesource], it uses a StreamReader:
private static String InternalReadAllText(String path, Encoding encoding, bool checkHost)
{
    ...
    using (StreamReader sr = new StreamReader(path, encoding, true, StreamReader.DefaultBufferSize, checkHost))
        return sr.ReadToEnd();
}
and that StreamReader overload with five parameters [MSDN] is documented to support UTF-16 as well:
It automatically recognizes UTF-8, little-endian Unicode, big-endian Unicode, little-endian UTF-32, and big-endian UTF-32 text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used.
(emphasis mine)
Since File.ReadAllText() is supposed to and documented to detect Unicode BOMs, it's probably a good idea that it detects UTF-16 as well. However, the documentation is wrong and should be updated. I filed issue #7515.
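To see the behavior concretely, here is a minimal sketch (the temp file and its bytes are made up for illustration) that reproduces the BOM taking precedence over the requested ASCII encoding:
using System;
using System.IO;
using System.Text;

class BomWinsDemo
{
    static void Main()
    {
        string path = Path.GetTempFileName();
        // 0xFE 0xFF is the UTF-16 big-endian BOM; the remaining bytes
        // spell "Hi" as big-endian UTF-16 code units.
        File.WriteAllBytes(path, new byte[] { 0xFE, 0xFF, 0x00, 0x48, 0x00, 0x69 });
        // Despite asking for ASCII, the BOM is detected and the file
        // is decoded as UTF-16 big-endian.
        Console.WriteLine(File.ReadAllText(path, Encoding.ASCII)); // prints "Hi"
    }
}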

Related

Convert the strings to UTF-16 Encoding C#

I have a few strings which are Windows-1252, UTF-8 and UTF-16 encoded. Ultimately I have to convert all the strings to UTF-16 encoding for comparison; how do I do this?
I came across the fact that if we know the source encoding we can convert to a destination encoding, but I need to convert strings (which may be encoded in any format) to UTF-16 (the default).
var url = @"file:///C:/Users/Œser/file.html";
Uri parsedurl;
var pass = Uri.TryCreate(url.Trim(), UriKind.Absolute, out parsedurl);
At this point parsedurl.AbsoluteUri prints file:///C:/Users/%C5%92ser/file.html, which is expected.
Then I load the HTML file in the IE WebBrowser control and intercept the Navigate event:
strURL = URL.ToString();
Now strURL prints file:///C:/Users/%8Cser/file.html
.NET string values are always UTF-16 (at least until Utf8String, which is looking like .NET 7 or .NET 8 now). So presumably you have some bytes or streams that are encoded in various encodings, which you want to convert to UTF-16 string instances.
The key here is Encoding; for example:
var enc = Encoding.GetEncoding(1252); // Windows-1252
var enc = Encoding.UTF8; // UTF-8
var enc = Encoding.BigEndianUnicode; // UTF-16, big-endian
var enc = Encoding.Unicode; // UTF-16, little-endian
You can use this encoding manually (GetString(...), GetEncoder(...), etc.), or you can pass it to a TextReader such as StreamReader as an optional constructor argument.
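For example, a minimal sketch of manual decoding (the input bytes are hypothetical, spelling "café" in Windows-1252, where é is 0xE9):
using System;
using System.Text;

class DecodeDemo
{
    static void Main()
    {
        byte[] cp1252Bytes = { 0x63, 0x61, 0x66, 0xE9 }; // "café" in Windows-1252
        Encoding enc = Encoding.GetEncoding(1252);
        // GetString decodes the bytes into a .NET string, which is
        // always UTF-16 internally - no further conversion needed.
        string text = enc.GetString(cp1252Bytes);
        Console.WriteLine(text); // café
    }
}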
Note that code page 1252 may not be available out of the box on .NET Core / .NET 5+ (it is on .NET Framework), because the legacy code pages are not bundled with the runtime. You may have to settle for "Western European (ISO)" (ISO-8859-1, code page 28591, i.e. Encoding.GetEncoding(28591)) - or register the code-pages provider, as sketched after the quote below.
From https://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html:
ISO-8859-1 (also called Latin-1) is identical to Windows-1252 (also called CP1252) except for the code points 128-159 (0x80-0x9F). ISO-8859-1 assigns several control codes in this range. Windows-1252 has several characters, punctuation, arithmetic and business symbols assigned to these code points.
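If you do need the real Windows-1252 code page on .NET Core / .NET 5+, registering the code-pages encoding provider (from the System.Text.Encoding.CodePages NuGet package) should make it available; a minimal sketch:
using System.Text;

// One-time registration, e.g. at application startup; afterwards
// legacy code pages such as 1252 resolve as they do on .NET Framework.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
var enc1252 = Encoding.GetEncoding(1252);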
Similarly, Encoding can be used to write to any chosen encoding, if you want to get bytes again - presumably using either of the UTF-16 variants.
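A minimal sketch of that reverse direction, encoding a string back to bytes with the little-endian UTF-16 variant:
using System;
using System.Text;

class EncodeDemo
{
    static void Main()
    {
        string text = "café";
        // Encoding.Unicode is UTF-16 little-endian; GetBytes re-encodes
        // the (internally UTF-16) string into that byte representation.
        byte[] utf16Bytes = Encoding.Unicode.GetBytes(text);
        Console.WriteLine(BitConverter.ToString(utf16Bytes)); // 63-00-61-00-66-00-E9-00
    }
}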

Why does byte 150 show up as a dash in Notepad but not when I read it programmatically?

I've got a file that looks OK in Notepad (and Notepad++) but when I try to read it with a C# program, the dash shows up as a replacement character (�) instead. After some trial and error, I can reproduce the error as follows:
File.WriteAllBytes("C:\\Misc\\CharTest\\wtf.txt", new byte[] { 150 });
var readFile = File.ReadAllText("C:\\Misc\\CharTest\\wtf.txt");
Console.WriteLine(readFile);
Now, if you go and look in the wtf.txt file using Notepad, you'll see a dash... but I don't get it. I know that's not a "real" Unicode value so that's probably the root of the issue, but I don't get why it looks fine in Notepad and not when I read in the file. And how do I get the file to read it as a dash?
As an aside, a VB6 program I'm trying to rewrite in C# also reads it as a dash.
The File.ReadAllText(string) overload defaults to UTF-8 encoding, in which a standalone byte with value 150 is invalid.
Specify the actual encoding of the file, for example:
var encoding = Encoding.GetEncoding(1252);
string content = File.ReadAllText(fileName, encoding);
I used the Windows-1252 encoding, which has a dash at codepoint 150.
Edit: Notepad displays the file correctly because, for non-Unicode files, the Windows-1252 code page is the default for Western regional settings. So you can likely also use Encoding.Default to get the correct result, but keep in mind that Encoding.Default can return different code pages under different regional settings.
You are writing bytes to a text file, and then you are reading those bytes back and interpreting them as chars.
When you write raw bytes, no encoding is involved; but to read those very same bytes back as chars, you have to specify one.
Notepad++ appears to interpret the byte according to the ANSI code page (Windows-1252, where 150 is an en dash) and therefore prints the dash.
File.ReadAllText reads the bytes using the specified encoding; since you did not specify one, it defaults to UTF-8, where a lone byte 150 is not a valid sequence.
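Putting the two answers together, a minimal sketch (the temp-file path is made up; on .NET Core / .NET 5+, code page 1252 needs the code-pages provider registered first):
using System;
using System.IO;
using System.Text;

class DashDemo
{
    static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "wtf.txt");
        File.WriteAllBytes(path, new byte[] { 150 });
        // The default (UTF-8) decoder rejects the lone byte and
        // substitutes the replacement character U+FFFD:
        Console.WriteLine(File.ReadAllText(path)); // �
        // Windows-1252 maps byte 150 to U+2013, the en dash:
        Console.WriteLine(File.ReadAllText(path, Encoding.GetEncoding(1252))); // –
    }
}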

Handle — while importing a CSV file using C#

I have written code (in C#) to import a CSV file using FileHelpers.
I am facing an issue: if the file contains an &mdash; (—), it is replaced by a ? character (not an exact ?, but a special replacement character, as shown in the image below).
How can I handle this in code?
Thanks.
How is your StreamReader object created? Have you provided any specific encoding to it? If not, I think you should try that, as the default encoding cannot be detected when there is no BOM.
From MSDN
The character encoding is set by the encoding parameter, and the
buffer size is set to 1024 bytes. The StreamReader object attempts to
detect the encoding by looking at the first three bytes of the stream.
It automatically recognizes UTF-8, little-endian Unicode, and
big-endian Unicode text if the file starts with the appropriate byte
order marks. Otherwise, the user-provided encoding is used. See the
Encoding.GetPreamble method for more information.
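For example, a minimal sketch, assuming the CSV was produced by a Windows tool and saved as Windows-1252 (a common source of em dashes; the file name is hypothetical):
using System;
using System.IO;
using System.Text;

class CsvEncodingDemo
{
    static void Main()
    {
        // Windows-1252 encodes the em dash as the single byte 0x97;
        // without an explicit encoding, a BOM-less file would be read
        // as UTF-8 and the em dash replaced by U+FFFD.
        using (var reader = new StreamReader("data.csv", Encoding.GetEncoding(1252)))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                Console.WriteLine(line);
            }
        }
    }
}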

Strange behaviour when writing a string to a UTF-8 file (no BOM) - ANSI file being returned

I'm attempting to write out C# string data to a UTF-8 file without a byte order mark (BOM), but am getting an ANSI file created.
using (StreamWriter objStreamWriter = new StreamWriter(SomePath, false, new UTF8Encoding(false)))
{
    objStreamWriter.Write("Hello world - Encoding no BOM but actually returns ANSI");
    objStreamWriter.Close();
}
According to the documentation for the UTF8Encoding class constructor, setting the encoderShouldEmitUTF8Identifier parameter to false should inhibit the Byte Order Mark.
I'm using .NET Framework 4.5 on my British (en-GB) computer. Below is a screenshot of the StreamWriter object showing UTF8Encoding in place.
So why am I getting an ANSI file (as checked with Notepad++) back from this operation?
Your example string that you're writing to the file consists only of characters in the ASCII range. The ASCII range is shared by ASCII, UTF-8 and most (all?) ANSI code pages. So, given that there is no BOM, Notepad++ has no indication if UTF-8 or ANSI is meant, and apparently defaults to ANSI.
If there is no BOM and no characters outside the ASCII range, how do you expect Notepad++ to recognise it as UTF-8? UTF-8, ANSI and ASCII are all identical for the characters you are emitting.
(Even if you include some non-ASCII characters, Notepad++ may struggle to guess the correct encoding.)
In "Hello world - Encoding no BOM but actually returns ANSI", no character is encoded differently in UTF8 and ANSI. Because of BOM absence, Notepad++ shows that the file is encoded in ANSI because there is no 'special character'. Try adding a "é, à, ê" character in your file and Notepad++ will show it as being encoded in UTF8 without BOM.

StreamReader weird error with �

StreamReader reads '–' (Alt+0150) as � even though I have UTF-8 encoding and detectEncodingFromByteOrderMarks (BOM) set to true. Can anyone guide me on this?
That byte value won't appear in UTF-8 encoded text. The en dash is '\u2013', encoded as 0xE2 0x80 0x93 in UTF-8. If you get this character when you type Alt+0150 on the numeric keypad, then your default system code page is probably 1252. Simply pass Encoding.Default to the StreamReader constructor.
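A minimal sketch illustrating the point:
using System;
using System.Text;

class EnDashDemo
{
    static void Main()
    {
        // The en dash typed as Alt+0150 is U+2013; in UTF-8 it occupies
        // three bytes, never the single byte 0x96 of code page 1252.
        byte[] utf8 = Encoding.UTF8.GetBytes("\u2013");
        Console.WriteLine(BitConverter.ToString(utf8)); // E2-80-93
        // Decoding a lone 0x96 byte as UTF-8 therefore yields U+FFFD (�):
        Console.WriteLine(Encoding.UTF8.GetString(new byte[] { 0x96 })); // �
    }
}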
You need to know the encoding that was used to encode the text. There's no way around that. Try different encodings until you get the desired results.
From MSDN:
The detectEncodingFromByteOrderMarks parameter detects the encoding by
looking at the first three bytes of the stream. It automatically
recognizes UTF-8, little-endian Unicode, and big-endian Unicode text
if the file starts with the appropriate byte order marks. Otherwise,
the user-provided encoding is used. See the Encoding.GetPreamble
method for more information.
Which means that BOM detection is just an extra step that may or may not apply, and can easily be overridden.
As the other users wrote, the probable reason for this issue is that the file you are trying to read is ANSI-encoded. I recreated the issue you described by saving the file in ANSI encoding.
Try to use this code:
var stream = new StreamReader(fileName, Encoding.Default);
The Encoding.Default parameter is important in here. This code should read the character you've mentioned correctly.
