I'm calling File.ReadAllText() in a program designed to format some files that I have.
Some of these files contain the ® (174) symbol. However, when the text is being read, the returned string contains � (65533) symbols where the ® (174) should be.
What would cause this and how can I fix it?
Most likely the file is in a different encoding than the default. If you know which one, you can specify it using the File.ReadAllText(String, Encoding) overload.
Code sample:
string readText = File.ReadAllText(path, Encoding.Default); // <-- change the encoding to whatever the encoding really is
If you DON'T know the encoding, see this previous SO question: How to use ReadAllText when file encoding unknown
This is likely due to a mismatch in the Encoding. Use the ReadAllText overload which allows you to specify the proper Encoding to use when reading the file.
The default overload assumes UTF-8 unless it detects a UTF-32 byte order mark. Any other encoding will come through incorrectly.
You need to specify the encoding when you call File.ReadAllText, unless the file is actually in UTF-8, which it sounds like it's not. (Basically the one-parameter overload is equivalent to passing in UTF-8 as the second argument. It will also detect UTF-32 with an appropriate byte-order mark, I believe.)
The first thing is to work out which encoding it is in (e.g. ISO-8859-1 - but you need to check this) and then pass that as a second argument.
For example:
Encoding isoLatin1 = Encoding.GetEncoding(28591);
string text = File.ReadAllText(path, isoLatin1);
It's always important that you know what encoding binary data is using before you try to read it as text. That's true for files, network streams, anything.
The character you are getting is the Unicode Replacement Character (U+FFFD),
used to replace an incoming character whose value is unknown or unrepresentable in Unicode
(compare the use of U+001A as a control character to indicate the substitute function).
http://www.fileformat.info/info/unicode/char/fffd/index.htm
You are getting this because the actual encoding of the file does not match the encoding your program expects.
By default, ReadAllText expects UTF-8. When it encounters a byte sequence that does not form a valid UTF-8 character, it substitutes the Replacement Character.
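To see the mismatch concretely, here is a minimal sketch: the single byte 0xAE is '®' in ISO-8859-1, but it is not a valid UTF-8 sequence, so the UTF-8 decoder substitutes U+FFFD.
using System.Text;

byte[] data = { 0xAE };                        // '®' as a single Latin-1 byte
string asUtf8 = Encoding.UTF8.GetString(data); // "\uFFFD" (�): invalid UTF-8, replaced
string asLatin1 = Encoding.GetEncoding("ISO-8859-1").GetString(data); // "®": decoded correctly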
Related
This line:
IList<string> text = await FileIO.ReadLinesAsync(file);
throws the exception "No mapping for the Unicode character exists in the target multi-byte code page".
When I remove chars like ąśźćóż from my file it runs OK, but the problem is that I can't guarantee that those chars won't appear in the future.
I tried changing the encoding in advanced save options but it is already
Unicode (UTF-8 with signature) - Codepage 65001
I have a hard time trying to figure this one out.
Make FileIO.ReadLinesAsync use a matching encoding. I don't know what your custom class does, but according to the error message it does not use any Unicode encoding.
I think those characters (ąśźćóż) are UTF-16 encoded, so it's better to use UTF-16. Use the overload ReadLinesAsync(IStorageFile, UnicodeEncoding) and set the UnicodeEncoding parameter to UnicodeEncoding.Utf16BE.
From MSDN:
This method uses the character encoding of the specified file. If you want to specify a different encoding, call ReadLinesAsync(IStorageFile, UnicodeEncoding) instead.
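As a sketch of that overload (assuming file is the StorageFile from the question), pass whichever UnicodeEncoding actually matches the file on disk:
// Windows.Storage.Streams.UnicodeEncoding: Utf8, Utf16LE or Utf16BE
IList<string> text = await FileIO.ReadLinesAsync(file, Windows.Storage.Streams.UnicodeEncoding.Utf8);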
I am converting HTML to docx using http://www.codeproject.com/Articles/91894/HTML-as-a-Source-for-a-DOCX-File.
Most of the characters are read properly, but some special characters such as •, “ and ” are being displayed as •. What should I be doing to correct this?
The HTML that I was passing to HTMLtoDocx was also not being read properly; special characters were displayed as '?'. After changing the encoding to Encoding.Default it returns the correct characters.
In HTMLtoDOCX there are two places where I can set the encoding (lines below). In both places I tried changing the encoding from Encoding.UTF8 to Encoding.Default, but it isn't helping.
StreamWriter streamStartPart = new StreamWriter(docpartDocumentXML.GetStream(FileMode.Create, FileAccess.Write), Encoding.Default);
byte[] Origem = Encoding.Default.GetBytes(html);
• indicates a UTF-8 sequence incorrectly interpreted as ANSI (= Encoding.Default).
You should check whether the HTML file is read with the correct encoding.
While the encoding info may be available in the HTTP header or in HTML META tags, that declared encoding may not be correct when the HTML is read from a file.
Since .NET treats string characters as 2-byte Unicode values, making sure the correct encoding is applied when reading and writing byte streams is the first step to fixing your problem.
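As a rough sketch (htmlPath is a hypothetical variable), reading and converting the HTML with one consistent encoding avoids the double interpretation, assuming the source file really is UTF-8:
string html = File.ReadAllText(htmlPath, Encoding.UTF8); // read with the file's real encoding
byte[] Origem = Encoding.UTF8.GetBytes(html);            // convert back with the same encoding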
StreamReader reads '–' (Alt+0150) as � even though I have UTF-8 encoding and detectEncodingFromByteOrderMarks (BOM) set to true. Can anyone guide me on this?
That byte value won't appear in UTF-8 encoded text. The en dash is '\u2013', encoded as 0xE2 0x80 0x93 in UTF-8. If you get this character when you type Alt+0150 on the numeric keypad, then your default system code page is probably 1252. Simply pass Encoding.Default to the StreamReader constructor.
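You can verify this with a quick sketch comparing the bytes each encoding produces for the en dash:
byte[] ansi = Encoding.GetEncoding(1252).GetBytes("\u2013"); // { 0x96 } - the byte Alt+0150 produces
byte[] utf8 = Encoding.UTF8.GetBytes("\u2013");              // { 0xE2, 0x80, 0x93 }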
You need to know the encoding that was used to encode the text. There's no way around that. Try different encodings until you get the desired results.
From MSDN:
The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used. See the Encoding.GetPreamble method for more information.
Which means that BOM detection is just an extra step that may or may not apply: if no byte order mark is present, the encoding you provided is used instead.
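In other words, something like this sketch: the encoding you pass is only a fallback for when no BOM is found, and CurrentEncoding shows what was actually used (reliable only after the first read):
using (var reader = new StreamReader(path, Encoding.Default, detectEncodingFromByteOrderMarks: true))
{
    string text = reader.ReadToEnd();
    Console.WriteLine(reader.CurrentEncoding); // the encoding the reader actually used
}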
As the other users wrote, the probable cause of this issue is an ANSI encoding of the file you are trying to read. I recreated the issue you described by saving the file in ANSI encoding.
Try to use this code:
var stream = new StreamReader(fileName, Encoding.Default);
The Encoding.Default parameter is important here. This code should read the character you mentioned correctly.
I am having some issues with the default string encoding in C#. I need to read strings from certain files/packets. However, these strings include characters from the 128–255 range (extended ASCII), and all of these characters show up as question marks instead of the proper character. For example, when reading a string, it could come up as "S?meStr?n?" if the string contained extended ASCII characters.
Now, is there any way to change the default encoding for my application? I know in java you could define the default character set from command line.
There's no one single "extended ASCII" encoding. There are lots of different 8-bit encodings which are compatible with ASCII for the bottom 128 values.
You need to find out what encoding your files actually use, and specify that when reading the data with StreamReader (or whatever else you're using). For example, you may want encoding Windows-1252:
Encoding encoding = Encoding.GetEncoding(1252);
.NET strings are always sequences of UTF-16 code points. You can't change that, and you shouldn't try. (That's true in Java as well, and you really shouldn't use the platform default encoding when calling getBytes() etc unless that's what you really, really mean.)
An Encoding can be specified in at least one overload of functions for reading text - for example, ReadAllText(string, Encoding).
So if you know a file is encoded using Windows-1252, then you can specify it like so:
string contents = File.ReadAllText(someFilePath, Encoding.GetEncoding(1252));
Of course, doing this requires knowing ahead of time which code page is being used.
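One caveat not covered above: on .NET Core and .NET 5+, the legacy code pages are not built in, so Encoding.GetEncoding(1252) throws unless you first register the provider from the System.Text.Encoding.CodePages package:
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); // once, at startup
Encoding windows1252 = Encoding.GetEncoding(1252);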
I've got lots of text that I need to output, which includes all sorts of characters from many languages. Sometimes I need to output the text in character encodings other than Unicode (eg, Shift-JIS, or ISO-8859-2), in order to match the page it's going to.
If the text has characters that the encoding can't handle (eg, Japanese characters in ISO-8859-2 encoded output) I end up with odd characters in the output. I can escape them, but I'd rather do that only if it's really necessary.
So, my question is this: Is there a way I can tell ahead of time if an encoding can handle all the characters in my string?
EDIT:
I think the EncoderFallback is probably the right answer to the question I asked. Unfortunately it doesn't seem to work in my particular situation. My thought was to convert the characters to their HTML entity equivalents (eg, &#x30E2; instead of モ). However, the encoder only converts the first such character it finds, and if I set the Response.ContentEncoding it never calls my EncoderFallback at all.
You can write your own EncoderFallback class and assign it to the encoding before encoding.
Using this approach, you need to do nothing in advance (which would likely amount to scanning the whole output string looking for problems).
Instead, your fallback class only needs to handle replacements where the encoding has no value for a character.
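Here is a minimal sketch of such a fallback that emits HTML numeric entities (the class names are made up for illustration):
using System;
using System.Text;

// Replaces any character the target encoding cannot represent
// with an HTML numeric entity such as &#x30E2;.
class HtmlEntityFallback : EncoderFallback
{
    public override int MaxCharCount => 10; // worst case: "&#x10FFFF;"

    public override EncoderFallbackBuffer CreateFallbackBuffer()
        => new HtmlEntityFallbackBuffer();
}

class HtmlEntityFallbackBuffer : EncoderFallbackBuffer
{
    private string replacement = "";
    private int position;

    public override bool Fallback(char charUnknown, int index)
    {
        replacement = $"&#x{(int)charUnknown:X};";
        position = 0;
        return true;
    }

    public override bool Fallback(char charUnknownHigh, char charUnknownLow, int index)
    {
        // Surrogate pairs are combined into a single code point first
        replacement = $"&#x{char.ConvertToUtf32(charUnknownHigh, charUnknownLow):X};";
        position = 0;
        return true;
    }

    public override char GetNextChar()
        => position < replacement.Length ? replacement[position++] : '\0';

    public override bool MovePrevious()
    {
        if (position == 0) return false;
        position--;
        return true;
    }

    public override int Remaining => replacement.Length - position;
}
Usage:
Encoding latin2 = Encoding.GetEncoding("ISO-8859-2",
    new HtmlEntityFallback(), DecoderFallback.ReplacementFallback);
byte[] bytes = latin2.GetBytes("モ"); // encodes as the ASCII bytes of "&#x30E2;"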
Try to encode the string with an Encoding whose EncoderFallback is set to EncoderExceptionFallback, e.g.:
Encoding e = Encoding.GetEncoding(932, new EncoderExceptionFallback(), new DecoderExceptionFallback());
Then catch EncoderFallbackException when you call GetBytes().
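As a short sketch of that pattern (text here stands for whatever string you are checking):
Encoding enc = Encoding.GetEncoding(932, new EncoderExceptionFallback(), new DecoderExceptionFallback());
try
{
    byte[] bytes = enc.GetBytes(text); // every character fits code page 932 (Shift-JIS)
}
catch (EncoderFallbackException)
{
    // at least one character cannot be represented in this encoding
}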
I think the methods already suggested should work (the EncoderFallback solution seems quite nice). Here's an alternative, however, in case you prefer it.
Create an encoder for the encoding you want to test by calling encoding.GetEncoder().
You can then call the Convert method of the Encoder object, passing in your text, and look at the value of the completed out parameter to determine whether it succeeded or not.
If speed is an issue, you may want to benchmark the various methods, but I suspect they would all have quite similar performance profiles.
Convert it to the target encoding, convert it back, and compare the result with the original?
Try Encoding.GetBytes() and Encoding.GetString() to convert back and forth.
As an optimization, you could collect the set of distinct Unicode characters used in your original string and test just those against the encoding.
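A minimal sketch of that round trip, assuming the encoding keeps its default replacement fallback (so unrepresentable characters come back changed):
static bool CanEncode(Encoding encoding, string text)
{
    // Unrepresentable characters become '?' (or a best-fit substitute)
    // on the way out, so the round trip matches only when nothing was lost.
    byte[] bytes = encoding.GetBytes(text);
    return encoding.GetString(bytes) == text;
}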