StreamReader reads '–' (Alt+0150) as � even though I specify UTF-8 encoding and have detectEncodingFromByteOrderMarks set to true. Can anyone guide me on this?
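That byte value won't appear on its own in UTF-8 encoded text. The character is '\u2013', which UTF-8 encodes as the three bytes 0xE2 0x80 0x93. If you get this character when you type Alt+0150 on the numeric keypad, then your default system code page is probably 1252, where the character is the single byte 0x96 (150). Simply pass Encoding.Default to the StreamReader constructor. Note that this relies on .NET Framework behaviour: on .NET Core and .NET 5+, Encoding.Default is always UTF-8, so an explicit Windows-1252 encoding is needed there.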
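The byte values quoted above can be checked with a small sketch. It assumes .NET Core 3.0+ or .NET 5+, where code page 1252 has to be registered first (on .NET Framework the RegisterProvider line can be dropped). The class and helper names are made up for illustration.

```csharp
using System;
using System.Text;

static class DashBytesDemo
{
    // Hypothetical helper: hex dump of a string's bytes in the given encoding.
    public static string HexBytes(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        // Assumption: running on .NET Core / .NET 5+, where code page 1252
        // is only available after registering the code pages provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding windows1252 = Encoding.GetEncoding(1252);

        string enDash = "\u2013";
        Console.WriteLine(HexBytes(enDash, Encoding.UTF8)); // E2-80-93
        Console.WriteLine(HexBytes(enDash, windows1252));   // 96
    }
}
```

The same character is three bytes in UTF-8 and one byte (0x96 = 150) in Windows-1252, which is why a reader expecting UTF-8 cannot make sense of a lone 0x96.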
You need to know the encoding that was used to encode the text. There's no way around that. Try different encodings until you get the desired results.
From MSDN:
The detectEncodingFromByteOrderMarks parameter detects the encoding by
looking at the first three bytes of the stream. It automatically
recognizes UTF-8, little-endian Unicode, and big-endian Unicode text
if the file starts with the appropriate byte order marks. Otherwise,
the user-provided encoding is used. See the Encoding.GetPreamble
method for more information.
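In other words, BOM detection only helps when the file actually starts with a byte order mark. Without one, the reader falls back to whatever encoding you passed in, so the flag alone cannot rescue a file that is in a different encoding.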
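A minimal sketch of the behaviour the MSDN excerpt describes, using in-memory streams instead of files. It assumes .NET Core 3.0+ or .NET 5+ (hence the provider registration); the class and method names are illustrative only.

```csharp
using System;
using System.IO;
using System.Text;

static class BomDetectionDemo
{
    // Decodes raw bytes with BOM detection switched on:
    // a BOM wins, otherwise the supplied fallback encoding is used.
    public static string Decode(byte[] data, Encoding fallback)
    {
        using (var reader = new StreamReader(new MemoryStream(data), fallback, true))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Assumption: .NET Core / .NET 5+, where code page 1252 must be registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding ansi = Encoding.GetEncoding(1252);

        byte[] noBom = { 0x96 };                                  // ANSI en dash, no BOM
        byte[] withBom = { 0xEF, 0xBB, 0xBF, 0xE2, 0x80, 0x93 };  // UTF-8 BOM + en dash

        Console.WriteLine(Decode(noBom, Encoding.UTF8) == "\uFFFD"); // True: invalid UTF-8
        Console.WriteLine(Decode(noBom, ansi) == "\u2013");          // True: fallback used
        Console.WriteLine(Decode(withBom, ansi) == "\u2013");        // True: BOM overrides fallback
    }
}
```

Only the third case benefits from detectEncodingFromByteOrderMarks; the first two are decided entirely by the encoding argument.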
As the other users wrote, the probable cause of this issue is that the file you are trying to read is ANSI-encoded. I reproduced the issue you described by saving the file in ANSI encoding.
Try to use this code:
var stream = new StreamReader(fileName, Encoding.Default);
The Encoding.Default argument is the important part here. With it, this code should read the character you mentioned correctly.
Related
I've got a file that looks OK in Notepad (and Notepad++) but when I try to read it with a C# program, the dash shows up as a replacement character (�) instead. After some trial and error, I can reproduce the error as follows:
File.WriteAllBytes("C:\\Misc\\CharTest\\wtf.txt", new byte[] { 150 });
var readFile = File.ReadAllText("C:\\Misc\\CharTest\\wtf.txt");
Console.WriteLine(readFile);
Now, if you go and look in the wtf.txt file using Notepad, you'll see a dash... but I don't get it. I know that's not a "real" Unicode value so that's probably the root of the issue, but I don't get why it looks fine in Notepad and not when I read in the file. And how do I get the file to read it as a dash?
As an aside, a VB6 program I'm trying to rewrite in C# also reads it as a dash.
The File.ReadAllText(string) overload defaults to UTF8 encoding, in which a standalone byte with value 150 is invalid.
Specify the actual encoding of the file, for example:
var encoding = Encoding.GetEncoding(1252);
string content = File.ReadAllText(fileName, encoding);
I used the Windows-1252 encoding, which has an en dash at code point 150 (0x96).
Edit: Notepad displays the file correctly because, for non-Unicode files, the Windows-1252 code page is the default for western regional settings. So you can likely also use Encoding.Default to get the correct result, but keep in mind that Encoding.Default can return different code pages under different regional settings.
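The reproduction and the fix can be put side by side in one sketch. It assumes .NET Core 3.0+ or .NET 5+, where code page 1252 needs registering and Encoding.Default is always UTF-8; the class name, the helper and the temp-file path are illustrative.

```csharp
using System;
using System.IO;
using System.Text;

static class ReadAnsiFileDemo
{
    // Reads a file as Windows-1252 regardless of the machine's regional settings.
    public static string ReadWindows1252(string path)
    {
        // Assumption: .NET Core / .NET 5+; not needed on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return File.ReadAllText(path, Encoding.GetEncoding(1252));
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(path, new byte[] { 150 });

            // Default overload assumes UTF-8, so the lone byte is invalid.
            Console.WriteLine(File.ReadAllText(path) == "\uFFFD"); // True
            // Naming the real encoding recovers the en dash.
            Console.WriteLine(ReadWindows1252(path) == "\u2013");  // True
            // Shows what Encoding.Default resolves to on this machine and runtime.
            Console.WriteLine(Encoding.Default.WebName);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Naming the code page explicitly avoids depending on what Encoding.Default happens to return on a given machine or runtime.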
You are writing bytes to a text file, and then you are reading those bytes back and interpreting them as chars.
When you write bytes, you don't have to care about encoding, but you do have to care about it in order to read those same bytes back as chars.
Notepad++ seems to interpret the byte using an ANSI code page and therefore shows the dash.
File.ReadAllText, on the other hand, decodes the bytes using an encoding. Since you did not specify one, it falls back to a default, which appears to be UTF-8, where a lone byte of 150 is not a valid sequence.
I have written code (in C#) to import a CSV file using FileHelpers.
I am facing one issue: if the file contains an &mdash; (—), it gets replaced by a ? character (not an exact ?, but some special character, as shown in the image below).
How can I handle this in code?
Thanks.
How is your StreamReader object created? Have you passed a specific encoding to it? If not, I think you should try that, because the encoding cannot be detected when the file has no BOM.
From MSDN
The character encoding is set by the encoding parameter, and the
buffer size is set to 1024 bytes. The StreamReader object attempts to
detect the encoding by looking at the first three bytes of the stream.
It automatically recognizes UTF-8, little-endian Unicode, and
big-endian Unicode text if the file starts with the appropriate byte
order marks. Otherwise, the user-provided encoding is used. See the
Encoding.GetPreamble method for more information.
I'm calling File.ReadAllText() in a program designed to format some files that I have.
Some of these files contain the ® (174) symbol. However, when the text is being read, the returned string contains � (65533) symbols where the ® (174) should be.
What would cause this and how can I fix it?
Most likely the file uses a different encoding than the default. If you know it, you can specify it using the File.ReadAllText(String, Encoding) overload.
Code sample:
string readText = File.ReadAllText(path, Encoding.Default); // <-- change the encoding to whatever the encoding really is
If you DON'T know the encoding, see this previous SO question: How to use ReadAllText when file encoding unknown
This is likely due to a mismatch in the Encoding. Use the ReadAllText overload which allows you to specify the proper Encoding to use when reading the file.
The default overload will assume UTF-8 unless it can detect UTF-32. Any other encoding will come through incorrectly.
You need to specify the encoding when you call File.ReadAllText, unless the file is actually in UTF-8, which it sounds like it's not. (Basically the one-parameter overload is equivalent to passing in UTF-8 as the second argument. It will also detect UTF-32 with an appropriate byte-order mark, I believe.)
The first thing is to work out which encoding it is in (e.g. ISO-8859-1 - but you need to check this) and then pass that as a second argument.
For example:
Encoding isoLatin1 = Encoding.GetEncoding(28591);
string text = File.ReadAllText(path, isoLatin1);
It's always important that you know what encoding binary data is using before you try to read it as text. That's true for files, network streams, anything.
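When the encoding really is unknown, one common heuristic (not something the answers above prescribe) is to try strict UTF-8 first and fall back to a single-byte encoding only if the bytes are not valid UTF-8. The sketch below assumes ISO-8859-1 as the fallback purely for illustration; the class and method names are made up.

```csharp
using System;
using System.Text;

static class StrictUtf8Demo
{
    // Tries strict UTF-8 first; if the bytes are not valid UTF-8,
    // decodes them with the caller-supplied fallback encoding instead.
    public static string DecodeUtf8OrFallback(byte[] data, Encoding fallback)
    {
        // Second argument (throwOnInvalidBytes: true) makes invalid
        // sequences throw instead of silently becoming U+FFFD.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            return strictUtf8.GetString(data);
        }
        catch (DecoderFallbackException)
        {
            return fallback.GetString(data);
        }
    }

    public static void Main()
    {
        // Assumption: the non-UTF-8 files are ISO-8859-1 (code page 28591).
        Encoding isoLatin1 = Encoding.GetEncoding(28591);

        byte[] latin1Bytes = { 0xAE };       // (R) sign as a single byte
        byte[] utf8Bytes = { 0xC2, 0xAE };   // (R) sign encoded as UTF-8

        Console.WriteLine(DecodeUtf8OrFallback(latin1Bytes, isoLatin1) == "\u00AE"); // True
        Console.WriteLine(DecodeUtf8OrFallback(utf8Bytes, isoLatin1) == "\u00AE");   // True
    }
}
```

This is only a guess-by-validation approach: it cannot tell one single-byte code page from another, and GetString does not strip a leading BOM, so knowing the real encoding remains the reliable fix.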
The character you are reading is the Unicode replacement character (U+FFFD), which is described as:
used to replace an incoming character whose value is unknown or unrepresentable in Unicode
compare the use of U+001A as a control character to indicate the substitute function
http://www.fileformat.info/info/unicode/char/fffd/index.htm
You are getting this because the actual encoding of the file does not match the encoding your program expects.
By default, ReadAllText expects UTF-8. It is encountering a byte sequence that does not represent a valid UTF-8 character, so it substitutes the replacement character.
I am converting HTML to docx using http://www.codeproject.com/Articles/91894/HTML-as-a-Source-for-a-DOCX-File.
Most of the characters are read properly but some special characters such as •,“ ” are being displayed as •. What should I be doing to correct this?
The HTML that I was passing to HTMLtoDocx was also not reading special characters properly. Instead it was displaying as '?'. After changing the encoding to Encoding.Default it's returning the correct characters.
In HTMLtoDOCX there are two places where I can set the encoding (lines below). In both places I tried changing the encoding from Encoding.UTF8 to Encoding. But it isn't helping.
StreamWriter streamStartPart = new StreamWriter(docpartDocumentXML.GetStream(FileMode.Create, FileAccess.Write), Encoding.Default);
byte[] Origem = Encoding.Default.GetBytes(html);
• indicates a UTF-8 sequences incorrectly interpreted as ANSI (=Encoding.Default).
You should check whether the HTML file is read with the correct encoding.
While the encoding info is available in the HTTP Header or in HTML META tags, this encoding may not be correct if the HTML is read from a file.
Since .NET stores string characters as 2-byte UTF-16 code units, making sure the correct encoding is applied when reading and writing byte streams is the first step to fixing your problem.
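The misinterpretation described in this answer can be reproduced, and sometimes reversed, with a short sketch. It assumes the "ANSI" code page is Windows-1252 and a runtime of .NET Core 3.0+ or .NET 5+; the class and method names are hypothetical.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Simulates the bug: UTF-8 bytes decoded with an ANSI code page.
    public static string MisreadUtf8AsAnsi(string text, Encoding ansi)
    {
        return ansi.GetString(Encoding.UTF8.GetBytes(text));
    }

    // Reverses the bug, provided no bytes were lost along the way.
    public static string Repair(string garbled, Encoding ansi)
    {
        return Encoding.UTF8.GetString(ansi.GetBytes(garbled));
    }

    public static void Main()
    {
        // Assumption: .NET Core / .NET 5+, where code page 1252 must be registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding ansi = Encoding.GetEncoding(1252);

        string bullet = "\u2022"; // UTF-8 bytes E2 80 A2
        string garbled = MisreadUtf8AsAnsi(bullet, ansi);

        Console.WriteLine(garbled.Length);                    // 3
        Console.WriteLine(garbled == "\u00E2\u20AC\u00A2");   // True
        Console.WriteLine(Repair(garbled, ansi) == bullet);   // True
    }
}
```

One bullet turning into three characters is the usual sign of this mismatch. The repair step is a diagnostic aid; the proper fix is to read the HTML with the encoding it was actually written in.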
I'm attempting to write out C# string data to a UTF-8 file without a byte order mark (BOM), but am getting an ANSI file created.
using (StreamWriter objStreamWriter = new StreamWriter(SomePath, false, new UTF8Encoding(false)))
{
objStreamWriter.Write("Hello world - Encoding no BOM but actually returns ANSI");
objStreamWriter.Close();
}
According to the documentation for the UTF8Encoding class constructor, setting the encoderShouldEmitUTF8Identifier parameter to false should inhibit the Byte Order Mark.
I'm using .NET Framework 4.5 on my British (en-gb) computer. Below is a screenshot of the StreamWriter object showing UTF8Encoding in place.
So why am I getting an ANSI file (as checked with Notepad++) back from this operation?
Your example string that you're writing to the file consists only of characters in the ASCII range. The ASCII range is shared by ASCII, UTF-8 and most (all?) ANSI code pages. So, given that there is no BOM, Notepad++ has no indication if UTF-8 or ANSI is meant, and apparently defaults to ANSI.
If there is no BOM and there are no non-ASCII characters, how do you expect Notepad++ to recognise it as UTF-8? UTF-8, ANSI and ASCII are all identical for the characters you are emitting.
(Even if you include some non-ASCII characters, Notepad++ may struggle to guess the correct encoding.)
In "Hello world - Encoding no BOM but actually returns ANSI", no character is encoded differently in UTF-8 and ANSI. Since there is no BOM and no 'special character', Notepad++ reports the file as ANSI. Try adding characters such as "é, à, ê" to your file and Notepad++ will show it as UTF-8 without BOM.
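To see what actually lands on disk, the bytes can be inspected directly rather than trusting an editor's guess. This is a small sketch writing to a MemoryStream instead of a file; the class and method names are illustrative.

```csharp
using System;
using System.IO;
using System.Text;

static class NoBomDemo
{
    // Writes text through a StreamWriter and returns the raw bytes produced.
    public static byte[] Write(string text, Encoding encoding)
    {
        using (var buffer = new MemoryStream())
        {
            using (var writer = new StreamWriter(buffer, encoding))
            {
                writer.Write(text);
            }
            // ToArray still works after the stream has been closed.
            return buffer.ToArray();
        }
    }

    public static void Main()
    {
        byte[] noBom = Write("Hello", new UTF8Encoding(false));
        byte[] withBom = Write("Hello", new UTF8Encoding(true));
        byte[] accented = Write("\u00E9", new UTF8Encoding(false));

        Console.WriteLine(BitConverter.ToString(noBom));    // 48-65-6C-6C-6F
        Console.WriteLine(BitConverter.ToString(withBom));  // EF-BB-BF-48-65-6C-6C-6F
        Console.WriteLine(BitConverter.ToString(accented)); // C3-A9

        // ASCII-only text yields the same bytes in ASCII and BOM-less UTF-8.
        Console.WriteLine(BitConverter.ToString(Encoding.ASCII.GetBytes("Hello"))
                          == BitConverter.ToString(noBom)); // True
    }
}
```

The BOM-less output for ASCII text is byte-for-byte identical to an ASCII or ANSI file, so the file is valid UTF-8 even though an editor may label it ANSI.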