string encoding in C# - strange characters


I have a file that I need to import.
The problem is that a lot of characters in that file come out wrong.
For example these names are wrong:
BjÃ¶rn (in file) - Should be Björn
Ã…ke (in file) - Should be Åke
Unfortunately I can't recreate the file with the correct encoding.
Many other characters are wrong as well (these were just examples), so I can't do a search and replace on all of them (unless there is a dictionary with all the conversions).
Can I decode the strings in some way?
Thanks, Patrik
Edit:
Just some more info that I should have added before (I blame my tiredness).
The file is an .xlsx file.

I debugged this with Notepad++. I copied the correct strings into Notepad++ and used Encoding | Convert to UTF-8. Then I selected Encoding | Encode as ANSI, which has the effect of interpreting the UTF-8 bytes as if they were ANSI. When I did this I ended up with the same erroneous values as you. So clearly, when you read the file, you are interpreting it as ANSI rather than UTF-8.
The file has been encoded as UTF-8, so the solution is to make sure it is interpreted as UTF-8 when you read it. I can't tell you exactly how to do that, since you didn't show how you were reading the file in the first place.
It's possible that your file does not contain a byte-order mark (BOM). If so, specify the encoding when you read the file by passing Encoding.UTF8.
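If the strings have already been read with the wrong encoding and you cannot re-read the source, the Notepad++ experiment above can be reversed in code. This is only a sketch: it assumes the damage is exactly "UTF-8 bytes decoded as Windows-1252", that you are on .NET Core 3.0 or later (where code page 1252 has to be registered first), and the class and method names are made up for illustration.

```csharp
using System;
using System.Text;

static class MojibakeRepair
{
    // Hypothetical helper: undoes "UTF-8 bytes that were decoded as Windows-1252".
    public static string Repair(string garbled)
    {
        // On .NET Core / .NET 5+ the Windows code pages must be registered before use.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding ansi = Encoding.GetEncoding(1252);

        // Turn the wrongly decoded characters back into the bytes they came from...
        byte[] originalBytes = ansi.GetBytes(garbled);

        // ...and decode those bytes as what they really are.
        return Encoding.UTF8.GetString(originalBytes);
    }
}
```

With this, Repair("BjÃ¶rn") gives "Björn" and Repair("Ã…ke") gives "Åke". It will not help if the text was damaged in some other way, and a few byte values have no Windows-1252 character assigned, so reading the file correctly in the first place is still the better fix.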

I've just tried your first example, and it definitely looks like that's UTF-8.
It's unclear what you're using to look at the file in the first place, but if you load it with a text editor which understands UTF-8 and tell it that it's a UTF-8 file, it should be fine.
When you load it with .NET, you should just be able to use File.OpenText, File.ReadAllText etc - most IO dealing with encodings in .NET defaults to UTF-8 anyway.

Related

How to convert ANSI text to Unicode correctly without installing language pack? [duplicate]

I've got a FileUpload control in an ASP.NET web page which is used to upload a file, the contents of which (in a stream) are processed in the C# code behind and output on the page later, using HtmlEncode.
But, some of this output is becoming mangled, specifically the symbol '£' is output as the Unicode FFFD REPLACEMENT CHARACTER. I've tracked this down to the input file, which is Windows 1252 ('ANSI') encoded.
The question is,
How do I determine whether the file is encoded as 1252 or UTF8? It could be either, and
How do I convert it to UTF8 if it is in Windows 1252, preserving the symbol £ etc?
I've looked online but cannot find a satisfactory answer.

If you know that the file is encoded with Windows 1252, you can open the file with a StreamReader and pass the proper encoding. That is:
StreamReader reader = new StreamReader("filename", Encoding.GetEncoding("Windows-1252"), true);
The "true" tells it to set the encoding based on the byte order marks at the front of the file, if they're there. Otherwise it opens it as Windows-1252.
You can then read the file and, if you want to convert to UTF-8, write to a file that you've opened with that encoding.
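Put together, the read-then-write conversion described above might look like the following. It is a sketch under a few assumptions: .NET Core 3.0 or later (hence the provider registration), a file small enough to read in one go, and hypothetical class, method and parameter names.

```csharp
using System.IO;
using System.Text;

static class AnsiToUtf8
{
    // Hypothetical helper: reads sourcePath as Windows-1252 (unless a BOM says
    // otherwise) and writes the same text to targetPath as UTF-8 with a BOM.
    public static void Convert(string sourcePath, string targetPath)
    {
        // Needed on .NET Core / .NET 5+ so that code page 1252 can be resolved.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding fallback = Encoding.GetEncoding(1252);

        // "true" lets a byte order mark override the fallback encoding.
        using (var reader = new StreamReader(sourcePath, fallback, true))
        using (var writer = new StreamWriter(targetPath, false, new UTF8Encoding(true)))
        {
            writer.Write(reader.ReadToEnd());
        }
    }
}
```

A '£' stored as the single byte 0xA3 in the source file ends up as the two bytes 0xC2 0xA3 in the target file.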
The short answer to your first question is that there isn't a 100% satisfactory way to determine the encoding of a file. If there are byte order marks, you can determine what flavor of Unicode it is, but without the BOM, you're stuck with using heuristics to determine the encoding.
I don't have a good reference for the heuristics. You might search for "how does Notepad determine the character set". I recall seeing something about that some time ago.
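One common heuristic for the "UTF-8 or Windows-1252?" case is to try a strict UTF-8 decode first and fall back to the code page only if the bytes are not valid UTF-8. The sketch below illustrates that idea; it is not what Notepad does, it can be fooled by short or pure-ASCII input, and the names are hypothetical. It assumes .NET Core 3.0 or later for the provider registration.

```csharp
using System.Text;

static class EncodingGuess
{
    // Hypothetical helper: decodes as UTF-8 when the bytes are valid UTF-8,
    // otherwise treats them as Windows-1252.
    public static string DecodeUtf8OrAnsi(byte[] bytes)
    {
        // Second argument = throwOnInvalidBytes, so bad sequences raise an exception
        // instead of silently becoming U+FFFD. A leading BOM, if any, is kept as U+FEFF.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8: assume the legacy code page instead.
            Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
            return Encoding.GetEncoding(1252).GetString(bytes);
        }
    }
}
```

This works reasonably well in practice because text in Windows-1252 that contains characters such as '£' is very unlikely to also form valid UTF-8 sequences.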
In practice, I've found the following to work for most of what I do:
StreamReader reader = new StreamReader("filename", Encoding.Default, true);
Most of the files I read are those that I create with .NET's StreamWriter, and they're in UTF-8 with the BOM. Other files that I get are typically written with some tool that doesn't understand Unicode or code pages, and I just treat it as a stream of bytes, which Encoding.Default does well.

Can Encoding.Default recognize utf8 characters? Should I really not use it?

Well, when using IO.File.ReadAllText(path) or ReadAllText(path, System.Text.Encoding.UTF8) to read a text file which is saved in ANSI encoding, non-latin characters aren't displayed correctly.
So, I decided to use Encoding.Default. It worked just fine, but I see recommendations against using it everywhere (like here and here) because it "will only guarantee that all UTF-7 character sets will be read correctly". Also, Microsoft says:
Gets an encoding for the operating system's current ANSI code page.
However, it seems to me that it can recognize a file with any encoding. I tested that on a file that contains Chinese, Japanese, and Arabic characters -the file is saved in utf8 encoding-, and I was able to display the file correctly.
Code used:
Dim loadedText As String = IO.File.ReadAllText(path, System.Text.Encoding.Default)
MessageBox.Show(loadedText, "utf8")
Output: (screenshot not preserved)
So my question in points:
Is there something I'm missing here?
Why is it not recommended to use Encoding.Default when reading a file?
I know that a file with ANSI encoding would be displayed incorrectly if the default system encoding/system locale is changed, which is something I don't care about in my current case. But..
Is there even another way to prevent this from happening?
Side note: Please don't mind me using the c# tag. Although my code is in VB, any answer with C# code is welcomed.

File.ReadAllText actually tries to auto-detect the encoding. If the encoding cannot be determined from a BOM, then the encoding argument is used to decode the file.
This method attempts to automatically detect the encoding of a file based on the presence of byte order marks. Encoding formats UTF-8 and UTF-32 (both big-endian and little-endian) can be detected.
If you used Encoding.UTF8 to write the file, then it would include a BOM. Your Encoding.Default is likely being ignored.
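A small experiment makes the BOM behaviour visible. This sketch writes the same text as UTF-8 with and without a BOM and reads it back while passing a non-UTF-8 encoding. ISO-8859-1 stands in for an "ANSI" code page here because it is available on every .NET runtime without extra setup; the class and method names are hypothetical.

```csharp
using System.IO;
using System.Text;

static class BomDemo
{
    // Hypothetical helper: writes text as UTF-8 (with or without a BOM) to a
    // temporary file, then reads it back while asking for ISO-8859-1.
    public static string WriteThenReadAsLatin1(string text, bool withBom)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, text, new UTF8Encoding(withBom));

            // The encoding argument is only used when no BOM is found.
            return File.ReadAllText(path, Encoding.GetEncoding("iso-8859-1"));
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

With a BOM, the text comes back unchanged even though ISO-8859-1 was requested. Without a BOM, "Björn" comes back as "BjÃ¶rn".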

Using Encoding.Default is not recommended because it is the operating system's ANSI code page, which is limited to that code page's character set. In other words, a text file created in Notepad (ANSI encoding) on Czech Windows will be displayed incorrectly on English Windows. For this reason, everything should be saved and opened in UTF-8 encoding.
Saved in ANSI and opened in Unicode may not work
Saved in Unicode and opened in ANSI will not work
Saved in ANSI and opened in another ANSI may not work

What all types of file can be read with File class c#?

I have tried reading a text file and an XML file with the File class, and it works fine. I was wondering if we can read Excel, Word, or other file types the same way.
var str = File.ReadAllLines("Test.xlsx");
While debugging, str shows special characters.
I hope I have made the question clear. Kindly advise.
Downvotes are welcome if accompanied by a proper comment to improve :).
Thanks in advance.

XML and text files are plain text: the characters you see on screen are the characters stored in the file. That's why File.ReadAllLines works.
With Excel, it is different. An .xlsx file is a ZIP archive containing several XML parts (older .xls files are a binary format), and a program that understands that format, such as MS Excel, unpacks and interprets it before displaying it on screen.
Think of it as a packaged, encoded file meant to be read by programs written specifically to decode it.
To read an Excel file in .NET, you can load it into a DataSet/DataTable, as in this article: Read Excel File in C# (Example)

With File.ReadAllLines you can read text files (and XML is, as we know, a text file as well).
Of course the function will read other kinds of data files too, but you will not get meaningful results: the binary data is interpreted as characters. This will not work for Office files.
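Before treating a file as text, it can help to check whether it is really a ZIP container, which is what .xlsx and .docx files are. The sketch below only looks at the two-byte "PK" signature and lists the parts inside the archive; it does not parse the spreadsheet itself. The class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

static class FileKind
{
    // True when the file starts with the bytes 'P' 'K', the signature that
    // ZIP containers (including .xlsx and .docx) begin with.
    public static bool LooksLikeZip(string path)
    {
        using (FileStream stream = File.OpenRead(path))
        {
            return stream.ReadByte() == 0x50 && stream.ReadByte() == 0x4B;
        }
    }

    // Lists the names of the parts stored inside a ZIP-based document.
    public static List<string> ListParts(string path)
    {
        using (ZipArchive archive = ZipFile.OpenRead(path))
        {
            return archive.Entries.Select(entry => entry.FullName).ToList();
        }
    }
}
```

If LooksLikeZip returns true for Test.xlsx, the "special characters" seen in the debugger are compressed binary data, and no choice of text encoding will make them readable.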

The MSDN documentation for File.ReadAllLines() states that:
This method attempts to automatically detect the encoding of a file based on the presence of byte order marks. Encoding formats UTF-8 and UTF-32 (both big-endian and little-endian) can be detected.
Therefore you can read text files with one of the UTF encodings it supports. To read files that use other encodings (e.g. Windows ANSI, non-Latin text) you should use the overload that takes an Encoding parameter.

ANSI vs SHIFT JIS vs UTF-8 in c#

I have been trying to figure the difference for quite sometime now. The issue is with a file that is in ANSI encoding has japanese characters like: ‚È‚‚Æ‚à1‚Â‚ÌINCREMENTs‚ª•K—v‚Å‚·. It equivalent in shift-jis is 少なくとも1つのINCREMENT行が必要です. which is expected to be in japanese.
I need to display these characters on a web page after reading them from the file (in ANSI). Other files, in UTF-8, display their characters correctly and don't show this problem. I am finding it difficult to figure out what the difference is and how to change the encoding to do the right thing here.
I use C# for reading this file and displaying it, and I also need to write the string back into the file if it is modified on the web. Which encoding and decoding schemes should I use here?

As far as code pages are concerned, "ANSI" (and Encoding.Default in .NET) basically just means "the non-Unicode codepage used by this system" - exactly what codepage that is, depends on how the system is configured, but on a Western European system, it's likely to be Windows-1252.
For the system where that text comes from, then "ANSI" would appear to mean Shift-JIS - so unless your system has the same code page, you'll need to tell your code to read the text as Shift-JIS.
Assuming you're reading the file with a StreamReader, there are various constructors that take an Encoding, so just grab a Shift-JIS encoding with Encoding.GetEncoding("shift_jis") or Encoding.GetEncoding(932) and use it to construct your StreamReader.
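As a rough sketch of that approach, including writing the text back in the same encoding: the code below assumes .NET Core 3.0 or later, where code page 932 has to be registered before Encoding.GetEncoding can find it, and the class and method names are hypothetical.

```csharp
using System.IO;
using System.Text;

static class ShiftJisFile
{
    // Looks up the Shift-JIS encoding (code page 932).
    private static Encoding GetShiftJis()
    {
        // Needed on .NET Core / .NET 5+; registering more than once is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(932);
    }

    // Reads the whole file, decoding its bytes as Shift-JIS.
    public static string Read(string path)
    {
        using (var reader = new StreamReader(path, GetShiftJis()))
        {
            return reader.ReadToEnd();
        }
    }

    // Writes the text back using the same encoding the file was read with.
    public static void Write(string path, string text)
    {
        File.WriteAllText(path, text, GetShiftJis());
    }
}
```

Once the text has been read into a .NET string it is Unicode, so it can be sent to the web page as UTF-8 regardless of how the file on disk is encoded.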

How does Encoding.Default work in .NET?

I'm reading a file using:
var source = File.ReadAllText(path);
and the character © wasn't being loaded correctly.
Then, I changed it to:
var source = File.ReadAllText(path, Encoding.UTF8);
and nothing.
I decided to try using
var source = File.ReadAllText(path, Encoding.Default);
and it worked perfectly.
Then I debugged it and tried to find which Encoding did the trick, and I found that it was UTF-7.
What I want to know is:
Is it recommended to use Encoding.Default, and can it guarantee all the characters of the file will be read without problems?

Encoding.Default only guarantees that text in the system's current ANSI code page will be read correctly (on .NET Core and later it is UTF-8 instead), so whether it works depends on the machine. On the other hand, if you try to read a file that is not encoded as UTF-8 in UTF-8 mode, you'll get corrupted characters like you did.
For instance, if the file is encoded as UTF-16 and you read it in UTF-16 mode, you'll be fine even if the file does not contain a single UTF-16 specific character. It all boils down to the file's encoding.
You'll need to save and reopen with the same encoding to be safe from corruption. Otherwise, use UTF-8 wherever you can, since it can represent every Unicode character and is what most .NET file APIs default to.

It is not recommended to use Encoding.Default.
Quote from MSDN:
Different computers can use different encodings as the default, and the default encoding can even change on a single computer. Therefore, data streamed from one computer to another or even retrieved at different times on the same computer might be translated incorrectly. In addition, the encoding returned by the Default property uses best-fit fallback to map unsupported characters to characters supported by the code page. For these two reasons, using the default encoding is generally not recommended. To ensure that encoded bytes are decoded properly, your application should use a Unicode encoding, such as UTF8Encoding or UnicodeEncoding, with a preamble. Another option is to use a higher-level protocol to ensure that the same format is used for encoding and decoding.
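In practice, "a Unicode encoding with a preamble" can be as simple as always saving with a UTF8Encoding that emits its identifier. A minimal sketch, with hypothetical class and method names:

```csharp
using System.IO;
using System.Text;

static class Utf8WithPreamble
{
    // UTF-8 that writes the EF BB BF preamble (byte order mark) at the start of the file.
    private static readonly Encoding Utf8Bom = new UTF8Encoding(true);

    // Saves the text as UTF-8 with a preamble.
    public static void Save(string path, string text)
    {
        File.WriteAllText(path, text, Utf8Bom);
    }

    // Reads the text back; the preamble identifies the encoding, so no guess is needed.
    public static string Load(string path)
    {
        return File.ReadAllText(path);
    }
}
```

A file saved this way reads back identically on a machine with a different ANSI code page, which is exactly the situation where Encoding.Default breaks down.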

It sounds like you are interested in auto-detecting the encoding of a file, in some sort of situation where you are not in control of the encoding used to save it. There are several questions on StackOverflow addressing this; some cursory browsing points to Determine a string's encoding in C# as a pretty good one. My favorite answer is the one pointing to a C# port of Mozilla's universal charset detector.

I think your file is in UTF-7 encoding, nothing more.
