When I call:
oStreamReader = new StreamReader(_sFileName, System.Text.Encoding.Default);
I don't get accented characters (by the way, I'm expecting French characters with accents).
When I display System.Text.Encoding.Default, I get:
{System.Text.SBCSCodePageEncoding}
[System.Text.SBCSCodePageEncoding]: {System.Text.SBCSCodePageEncoding}
BodyName: "iso-8859-1"
CodePage: 1252
DecoderFallback: {System.Text.InternalDecoderBestFitFallback}
EncoderFallback: {System.Text.InternalEncoderBestFitFallback}
EncodingName: "Europe de l'Ouest (Windows)"
HeaderName: "Windows-1252"
IsBrowserDisplay: true
IsBrowserSave: true
IsMailNewsDisplay: true
IsMailNewsSave: true
IsReadOnly: true
IsSingleByte: true
WebName: "Windows-1252"
WindowsCodePage: 1252
Isn't it supposed to be UTF-8?
Where can I set System.Text.Encoding.Default?
Is it tied to the Windows settings?
Thanks a lot in advance.
Eric.
Isn't it supposed to be UTF-8?
On .NET Framework, it's your configured Windows code page. On .NET Core, it is UTF-8.
From the docs:
In .NET Framework on the Windows desktop, the Default property always gets the system's active code page and creates an Encoding object that corresponds to it. The active code page may be an ANSI code page, which includes the ASCII character set along with additional characters that vary by code page. Because all Default encodings based on ANSI code pages lose data, consider using the Encoding.UTF8 encoding instead. UTF-8 is often identical in the U+00 to U+7F range, but can encode characters outside the ASCII range without loss.
Where can I set System.Text.Encoding.Default?
It's your configured Windows code page.
Is it tied to the Windows settings?
Yep
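You can check which default you're actually getting with a quick sketch:
using System;
using System.Text;

// On .NET Framework this prints the system's active ANSI code page
// (e.g. 1252); on .NET Core / .NET 5+ it always prints 65001 (UTF-8).
Console.WriteLine(Encoding.Default.CodePage);
Console.WriteLine(Encoding.Default.EncodingName);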
oStreamReader = new StreamReader(_sFileName, System.Text.Encoding.Default);
The easiest thing is to just do:
oStreamReader = new StreamReader(_sFileName);
StreamReader will try to detect the encoding used from the byte order marks, but will fall back to UTF-8 if that fails, so just let it do that.
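For reference, the one-argument constructor is equivalent to spelling out the defaults (a sketch, reusing the question's variable name):
// Same behaviour as new StreamReader(_sFileName): assume UTF-8, but
// sniff the start of the stream for a BOM and honour it if present.
oStreamReader = new StreamReader(_sFileName, System.Text.Encoding.UTF8, detectEncodingFromByteOrderMarks: true);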
There should be almost no need to ever type Encoding.Default in your code: it's a badly-named property which should be ignored.
Different computers can use different encodings as the default, and the default encoding can change on a single computer. Refer to https://learn.microsoft.com/en-us/dotnet/api/system.text.encoding.default?view=net-5.0 for details. This page clearly explains how System.Text.Encoding should be used.
I would not suggest relying on the default encoding, because the encoding returned by the Default property uses best-fit fallback to map unsupported characters to characters the code page does support. Instead, use the encoding your specific data actually requires:
oStreamReader = new StreamReader(_sFileName, System.Text.Encoding.UTF8);
Related
The line:
IList<string> text = await FileIO.ReadLinesAsync(file);
causes the exception "No mapping for the Unicode character exists in the target multi-byte code page".
When I remove chars like ąśźćóż from my file it runs OK, but the problem is that I can't guarantee those chars won't appear in the future.
I tried changing the encoding in advanced save options but it is already
Unicode (UTF-8 with signature) - Codepage 65001
I have a hard time trying to figure this one out.
Make FileIO.ReadLinesAsync use a matching encoding. I don't know what your custom class does, but according to the error message it does not use any Unicode encoding.
I think those characters (ąśźćóż) are UTF-16 encoded, so it's better to use UTF-16. Use the overload ReadLinesAsync(IStorageFile, UnicodeEncoding) and set the UnicodeEncoding parameter to UnicodeEncoding.Utf16BE.
From MSDN :
This method uses the character encoding of the specified file. If you want to specify different encoding, call ReadLinesAsync(IStorageFile, UnicodeEncoding) instead.
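Passing the encoding explicitly might look like this (a sketch; file is assumed to be the IStorageFile from the question):
using System.Collections.Generic;
using Windows.Storage;
using Windows.Storage.Streams;

// Tell ReadLinesAsync the encoding up front instead of letting it
// guess from the file's contents.
IList<string> text = await FileIO.ReadLinesAsync(file, UnicodeEncoding.Utf16BE);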
StreamReader reads '–' (Alt+0150) as � even though I have UTF-8 encoding and detectEncodingFromByteOrderMarks (BOM detection) set to true. Can anyone guide me on this?
That byte value won't appear in UTF-8 encoded text. The character is '\u2013' (en dash), which becomes the bytes 0xE2 0x80 0x93 when encoded in UTF-8. If you get this character when you type Alt+0150 on the numeric keypad, then your default system code page is probably 1252. Simply pass Encoding.Default to the StreamReader constructor.
You need to know the encoding that was used to encode the text. There's no way around that. Try different encodings until you get the desired results.
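A crude way to try them (a sketch; the file path is a placeholder):
using System;
using System.IO;
using System.Text;

// Decode the same raw bytes with several candidate encodings and
// eyeball which one renders the dash correctly.
byte[] raw = File.ReadAllBytes("input.txt");
foreach (Encoding enc in new[] { Encoding.UTF8, Encoding.GetEncoding(1252), Encoding.GetEncoding("iso-8859-1") })
{
    Console.WriteLine(enc.WebName + ": " + enc.GetString(raw));
}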
From MSDN:
The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used. See the Encoding.GetPreamble method for more information.
Which means that relying on the BOM is just an extra step that may or may not work, and one that can easily be overridden.
As the other users wrote, the probable reason for this issue is that the file you are trying to read is in an ANSI encoding. I recreated the issue you described when I saved the file in ANSI encoding.
Try to use this code:
var stream = new StreamReader(fileName, Encoding.Default);
The Encoding.Default parameter is the important part here. This code should read the character you mentioned correctly.
I am having some issues with the default string encoding in C#. I need to read strings from certain files/packets. However, these strings include characters from the 128–255 range (extended ASCII), and all of these characters show up as question marks instead of the proper character. For example, a string could come up as "S?meStr?n?" if it contained extended ASCII characters.
Now, is there any way to change the default encoding for my application? I know that in Java you can define the default character set from the command line.
There's no one single "extended ASCII" encoding. There are lots of different 8-bit encodings which are compatible with ASCII for the bottom 128 values.
You need to find out what encoding your files actually use, and specify that when reading the data with StreamReader (or whatever else you're using). For example, you may want Windows-1252:
Encoding encoding = Encoding.GetEncoding(1252);
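Continuing from the line above, you would then hand it to the reader (a sketch; the path is a placeholder):
using (var reader = new StreamReader("input.txt", encoding))
{
    // Every byte is now decoded through the Windows-1252 table.
    string contents = reader.ReadToEnd();
}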
.NET strings are always sequences of UTF-16 code points. You can't change that, and you shouldn't try. (That's true in Java as well, and you really shouldn't use the platform default encoding when calling getBytes() etc unless that's what you really, really mean.)
An Encoding can be specified in at least one overload of the functions for reading text, for example File.ReadAllText(string, Encoding).
So if you know a file is encoded using Windows-1252, then you can specify it like so:
string contents = File.ReadAllText(someFilePath, Encoding.GetEncoding(1252));
Of course, doing this requires knowing ahead of time which code page is being used.
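One caveat worth adding: on .NET Core and .NET 5+, the Windows code pages are not available out of the box, so Encoding.GetEncoding(1252) will throw unless you first register the provider from the System.Text.Encoding.CodePages package (a sketch):
using System.Text;

// Run once at startup; afterwards Encoding.GetEncoding(1252) works
// on .NET Core / .NET 5+ just as it does on .NET Framework.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);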
When I try to get some text from a file and display it in a textbox, it is okay until I want to write Czech characters (e.g. ě, š, č, ř, ž). They show up like: Moj� nejv�t�� z�libou je �e�en� koresponden�n�ch semin���
Should I set the encoding of the loaded text just before I assign it to textBox1.Text, or is it possible to change the encoding of textBox1.Text itself?
I use following code:
textBox1.Text = File.ReadAllText(file);
Try to force the encoding (the machine default should be OK, if you don't know the correct one):
textBox1.Text = File.ReadAllText(file, Encoding.Default);
Anyway, since you're Czech, I guess your current default encoding is "Western European (Windows)" (you can also get it via Encoding.GetEncoding(1252)).
That is also the one on my PC (I have an Italian version of Win7).
From MSDN for ReadAllText()
This method attempts to automatically detect the encoding of a file based on the presence of byte order marks. Encoding formats UTF-8 and UTF-32 (both big-endian and little-endian) can be detected.
Use the ReadAllText(String, Encoding) method overload when reading files that might contain imported text, because unrecognized characters may not be read correctly.
Try using the other overload to explicitly specify the Encoding, since automatic detection is not working in your case. Something like:
textBox1.Text = File.ReadAllText(file, Encoding.UTF8);
Has anyone noticed that if you retrieve HTML from the clipboard, it gets the encoding wrong and injects weird characters?
For example, executing a command like this:
string s = (string)Clipboard.GetData(DataFormats.Html);
Results in stuff like:
<FONT size=-2>Â Â <A href="/advanced_search?hl=en">Advanced
Search</A><BR>Â Â Preferences<BR>Â Â <A
href="/language_tools?hl=en">Language
Tools</A></FONT>
Not sure how Markdown will process this, but there are weird characters in the resulting markup above.
It appears that the bug is with the .NET framework. What do you think is the best way to get correctly-encoded HTML from the clipboard?
In this case it is not as visible as it was in my case. Today I tried to copy data from the clipboard and there were a few Unicode characters in it. The data I got looked as if I had read a UTF-8 encoded file using Windows-1250 (the local encoding on my Windows).
It seems your case is the same. Save the HTML data in Windows-1252 (or Windows-1250; both work), remembering to put a non-breaking space (0xA0) after the Â character rather than a standard space. Then open this file as a UTF-8 file and you will see what should really be there.
For another project of mine I made a function that fixes data with corrupted encoding.
In this case simple conversion should be sufficient:
byte[] data = Encoding.Default.GetBytes(text); // back to the raw bytes via the local code page
text = Encoding.UTF8.GetString(data);          // decode those bytes as the UTF-8 they really were
My original function is a little more complex and contains tests to ensure that the data is not corrupted:
public static bool FixMisencodedUTF8(ref string text, Encoding encoding)
{
    if (string.IsNullOrEmpty(text))
        return false;
    byte[] data = encoding.GetBytes(text);
    // there should not be any character outside the source encoding
    string newStr = encoding.GetString(data);
    if (!string.Equals(text, newStr)) // if there is any character "outside"
        return false;                 // leave; the input is in a different encoding
    // IsValidUtf8 is the author's helper (not shown here); it returns 0
    // when data is not a valid UTF-8 byte sequence
    if (IsValidUtf8(data) == 0)
        return false; // if not valid UTF-8, we cannot convert
    text = Encoding.UTF8.GetString(data);
    return true;
}
I know that this is not the best (or correct) solution, but I did not find any other way to fix the input...
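Usage might look like this (a sketch; Encoding.Default stands in for whichever local code page produced the corruption):
string text = (string)Clipboard.GetData(DataFormats.Html);
// Only overwrite the string when the round-trip proves it really was
// misdecoded UTF-8; otherwise it is left untouched.
if (FixMisencodedUTF8(ref text, Encoding.Default))
{
    // text now holds the correctly decoded content
}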
EDIT (July 20, 2017):
It seems that Microsoft has already found and fixed this error, and now it works correctly. I'm not sure whether the problem was in a particular framework, but I know for sure that the application now uses a different framework than it did when I wrote this answer. (Now it is 4.5; the previous version was 2.0.)
(Now all my fixing code fails when parsing the data. There is a new problem: determining the correct behaviour for an application both with the fix already applied and without it.)
You have to interpret the data as UTF-8. See MS Office hyperlinks change code page?.
The DataFormats.Html specification states that it's encoded in UTF-8. But there's a bug in .NET Framework 4 and lower: it actually reads the UTF-8 data as Windows-1252.
You get a lot of wrongly decoded text, leading to funny/bad characters such as
'Å','‹','Å’','Ž','Å¡','Å“','ž','Ÿ','Â','¡','¢','£','¤','Â¥','¦','§','¨','©'
Full explanation here
Debugging Chart Mapping Windows-1252 Characters to UTF-8 Bytes to Latin-1 Characters
Solution: create a translation dictionary and search and replace.
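Alternatively, the byte round-trip shown in the earlier answer works here too (a sketch; it assumes the data was misdecoded as Windows-1252):
using System.Text;
using System.Windows.Forms;

// Turn the misdecoded string back into its original bytes via
// Windows-1252, then decode those bytes as the UTF-8 they really were.
string html = (string)Clipboard.GetData(DataFormats.Html);
byte[] raw = Encoding.GetEncoding(1252).GetBytes(html);
string fixedHtml = Encoding.UTF8.GetString(raw);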
I don't know what your original source document is, but be aware that Word and Outlook provide several versions of the clipboard in different encodings. One is usually Windows-1252 and another is UTF-8. Possibly you're grabbing the UTF-8 encoded version by default, when you're expecting the Windows-1252 (Latin-1 + Smart Quotes)? Non-ASCII characters would show up as multiple odd Latin-1 accented characters. Most "Smart Quotes" are not in the Latin-1 set and are often three bytes in UTF-8.
Can you specify which encoding you want the clipboard contents in?
Try this:
string html = System.Windows.Forms.Clipboard.GetText(System.Windows.Forms.TextDataFormat.Html);