Some questions about encoding - C#

I have some questions about encoding. I have files of various types and encodings. Am I right that only text files (.txt, .csv, .xml) can have a byte order mark? In a first method I want to prepare the file: change its encoding and add or remove the preamble. In a second method I only want to call:
File.WriteAllBytes("testFile.csv", fileBytes);
I don't want to pass an encoding to the second method, so should every file have a BOM? Or will it otherwise be saved with a Unicode encoding?
To convert the encoding I want to use the Encoding.Convert method, but after converting there is no BOM in the file. Is it better to use a StreamReader and StreamWriter with the source and destination encodings?

The XML format declares its encoding in the XML declaration, but that is only a label; the actual encoding of the file can differ.
In my opinion, use the same encoding for all of your input and output files. Unicode is a good choice, as you mentioned.
If you cannot control the encoding of the input files because they were written by someone else, try UTF-8 for them. Most text editors use UTF-8 as their default encoding, and so does File.WriteAllBytes().
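A StreamReader/StreamWriter round-trip will also write the BOM for you, because the Unicode encodings emit their preamble when the writer starts a new file. Here is a minimal sketch; the file names and the Windows-1252 source encoding are assumptions, so substitute whatever your inputs really use:

using System.IO;
using System.Text;

class ConvertToUtf8WithBom
{
    static void Main()
    {
        // Assumed source encoding - replace with the input file's real one.
        Encoding sourceEncoding = Encoding.GetEncoding("Windows-1252");

        // UTF8Encoding(true) emits the 3-byte BOM (EF BB BF) as its preamble.
        Encoding destEncoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);

        using (var reader = new StreamReader("input.csv", sourceEncoding))
        using (var writer = new StreamWriter("testFile.csv", false, destEncoding))
        {
            writer.Write(reader.ReadToEnd());
        }
    }
}

Note that Encoding.Convert only translates the payload bytes and never adds a preamble; if you want to stay with File.WriteAllBytes, you can prepend destEncoding.GetPreamble() to the converted bytes yourself.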

Related

How do I extract UTF-8 strings out of a JSON file using LitJSON, as JsonData does not seem to convert?

I've tried many methods to extract some strings out of a JSON file using LitJson in Unity.
I've tried encoding conversions all over, and tried getting byte arrays and passing them around, and nothing seems to work.
I went back to the very start, where I create the JsonData object, and ran the following test:
public JsonData CreateJSONDataObject()
{
    Debug.Assert(pathName != null, "No JSON Data path name set. Please set before commencing read.");

    string jsonString = File.ReadAllText(Application.dataPath + pathName, System.Text.Encoding.UTF8);
    JsonData jsonDataObject = JsonMapper.ToObject(jsonString);

    Debug.Log("Test compatibility: ë | " + jsonDataObject["Roots"][2]["name"]);
    return jsonDataObject;
}
I made sure my jsonString is read as UTF-8; however, the output shows this:
Test compatibility: ë | W�den
I've tried many other methods, but since this test checks the encoding right where the JsonData object is created, I can't think of what I am doing wrong; I just don't know enough about JSON.
Thank you in advance.
This type of problem occurs when a text file is written with one encoding and read using a different one. I was able to reproduce your problem with the following program, which removes the JSON serialization from the equation entirely:
string file = @"c:\temp\test.txt";
string text = "Wöden";

// Write with the system ANSI code page, then read back as UTF-8:
File.WriteAllText(file, text, Encoding.Default);
string text2 = File.ReadAllText(file, Encoding.UTF8);
Debug.WriteLine(text2); // prints "W�den"
Since you are reading with UTF-8 and it is not working, the real question is, what encoding was used to write the file originally? You should be using the same encoding to read it back. I suspect that the file was originally created using either Windows-1252 or iso-8859-1 instead of UTF-8. Try using one of those when you read the file, e.g.:
string jsonString = File.ReadAllText(Application.dataPath + pathName,
    Encoding.GetEncoding("Windows-1252"));
You said in the comments that your JSON file was not created programmatically, but was "written by hand", meaning you used Notepad or some other text editor to make the file. If that is so, then that explains how you got into this situation. When you save the file, you should have the option to choose an encoding. For Notepad at least, the default encoding is "ANSI", which most likely maps to Windows-1252 (Western European), but depends on your locale. If you are in the Baltic region, for example, it would be Windows-1257 (Baltic). In any case, "ANSI" is not UTF-8.

If you want to save the file in UTF-8 encoding, you have to specifically choose that option. Whatever option you use to save the file, that is the encoding you need to use to read it the next time, whether it is with a text editor or with code. Using the wrong encoding to read the file is what causes the corruption.
To change the encoding of a file, you first have to read it in using the same encoding that it was saved in originally, and then you can write it back out using a different encoding. You can do that with your text editor, simply by re-saving the file with a different encoding, or you can do that programmatically:
string text = File.ReadAllText(file, originalEncoding);
File.WriteAllText(file, text, newEncoding);
The key is knowing which encoding was used originally, and therein lies the rub. For legacy encodings (such as Windows-12xx) there is no way to tell because there is no marker in the file which identifies it. Unicode encodings (e.g. UTF-8, UTF-16), on the other hand, do write out a marker at the beginning of the file, called a BOM, or byte-order mark, which can be detected programmatically. That, coupled with the fact that Unicode encodings can represent all characters, is why they are much preferred over legacy encodings.
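If you want to check for a BOM yourself, sniffing the first few bytes is enough. A small sketch (the ordering matters, because the UTF-32 LE mark starts with the same two bytes as UTF-16 LE):

using System.IO;

class BomSniffer
{
    // Returns a rough description of any BOM at the start of the file.
    // Absence of a BOM proves nothing: UTF-8 files are often saved without one.
    static string DetectBom(string path)
    {
        byte[] b = new byte[4];
        int n;
        using (FileStream fs = File.OpenRead(path))
            n = fs.Read(b, 0, 4);

        if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) return "UTF-32 LE";
        if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) return "UTF-32 BE";
        if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
        if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16 LE";
        if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16 BE";
        return "no BOM (legacy encoding, or Unicode saved without one)";
    }
}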
For more information, I highly recommend reading What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.

Handle '—' while importing a CSV file using C#

I have written code in C# to import a CSV file using FileHelpers.
I am facing one issue: if the file contains an &mdash; (—), it gets replaced by a ?-like character (not exactly ?, but a special character).
How can I handle this in code?
Thanks.
How is your StreamReader object created? Have you provided a specific encoding to it? If not, you should try that, because the default detection cannot determine the encoding when there is no BOM.
From MSDN:
The character encoding is set by the encoding parameter, and the buffer size is set to 1024 bytes. The StreamReader object attempts to detect the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used. See the Encoding.GetPreamble method for more information.
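If you are using FileHelpers' engine API, you can open the file with an explicit encoding yourself and hand the reader to the engine. A hedged sketch - the record layout, file name, and code page below are my assumptions, not taken from your code:

using System.IO;
using System.Text;
using FileHelpers;

[DelimitedRecord(",")]
class ImportRecord   // hypothetical layout for illustration only
{
    public string Name;
    public string Description;
}

class CsvImport
{
    static ImportRecord[] Load(string path)
    {
        var engine = new FileHelperEngine<ImportRecord>();

        // Open with an explicit encoding instead of relying on BOM detection.
        // Windows-1252 is a guess; use whatever the file was actually saved
        // with (UTF-8 if it came from a modern editor).
        using (var reader = new StreamReader(path, Encoding.GetEncoding(1252)))
        {
            return engine.ReadStream(reader);
        }
    }
}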

Invalid characters in File.ReadAllText

I'm calling File.ReadAllText() in a program designed to format some files that I have.
Some of these files contain the ® (174) symbol. However, when the text is being read, the returned string contains � (65533) symbols where the ® (174) should be.
What would cause this and how can I fix it?
Most likely the file is in a different encoding than the default. If you know the encoding, you can specify it using the File.ReadAllText(String, Encoding) overload.
Code sample:
string readText = File.ReadAllText(path, Encoding.Default); // <-- change the encoding to whatever the encoding really is
If you DON'T know the encoding, see this previous SO question: How to use ReadAllText when file encoding unknown
This is likely due to a mismatch in the Encoding. Use the ReadAllText overload which allows you to specify the proper Encoding to use when reading the file.
The default overload will assume UTF-8 unless it can detect UTF-32. Any other encoding will come through incorrectly.
You need to specify the encoding when you call File.ReadAllText, unless the file is actually in UTF-8, which it sounds like it's not. (Basically the one-parameter overload is equivalent to passing in UTF-8 as the second argument. It will also detect UTF-32 with an appropriate byte-order mark, I believe.)
The first thing is to work out which encoding it is in (e.g. ISO-8859-1 - but you need to check this) and then pass that as a second argument.
For example:
Encoding isoLatin1 = Encoding.GetEncoding(28591);
string text = File.ReadAllText(path, isoLatin1);
It's always important that you know what encoding binary data is using before you try to read it as text. That's true for files, network streams, anything.
The character you are reading is the Replacement Character (U+FFFD), which is "used to replace an incoming character whose value is unknown or unrepresentable in Unicode" (compare the use of U+001A as a control character to indicate the substitute function):
http://www.fileformat.info/info/unicode/char/fffd/index.htm
You are getting this because the actual encoding of the file does not match the encoding your program expects.
By default ReadAllText expects UTF-8. It encounters a byte sequence that does not represent a valid UTF-8 character, so it replaces it with the replacement character.
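If you would rather fail loudly than silently get U+FFFD, you can ask the decoder to throw on invalid bytes. A minimal sketch; the path is a placeholder:

using System;
using System.IO;
using System.Text;

class StrictDecode
{
    static void Main()
    {
        // throwOnInvalidBytes: true turns the silent '\uFFFD' substitution
        // into a DecoderFallbackException, so a wrong encoding guess is obvious.
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);

        try
        {
            string text = File.ReadAllText(@"c:\temp\input.txt", strictUtf8);
            Console.WriteLine("Decoded " + text.Length + " chars as valid UTF-8.");
        }
        catch (DecoderFallbackException)
        {
            Console.WriteLine("Not valid UTF-8 - try Windows-1252 or ISO-8859-1 instead.");
        }
    }
}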

Special characters for DOCX with OOXML

I am converting HTML to DOCX using http://www.codeproject.com/Articles/91894/HTML-as-a-Source-for-a-DOCX-File.
Most of the characters are read properly, but some special characters such as •, “, and ” are being displayed as •. What should I do to correct this?
The HTML that I was passing to HTMLtoDocx was also not reading special characters properly; instead they were displayed as '?'. After changing the encoding to Encoding.Default it returned the correct characters.
In HTMLtoDOCX there are two places where I can set the encoding (lines below). In both places I tried changing the encoding from Encoding.UTF8 to Encoding.Default, but it isn't helping.
StreamWriter streamStartPart = new StreamWriter(docpartDocumentXML.GetStream(FileMode.Create, FileAccess.Write), Encoding.Default);
byte[] Origem = Encoding.Default.GetBytes(html);
• indicates a UTF-8 sequence incorrectly interpreted as ANSI (= Encoding.Default).
You should check whether the HTML file is read with the correct encoding.
While the encoding info may be available in the HTTP header or in HTML META tags, that declared encoding may not be correct when the HTML is read from a file.
Since .NET treats string characters as 2-byte Unicode values, making sure the correct encoding is applied when reading and writing byte streams is the first step to fixing your problem.
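In other words, decode the HTML with its real encoding first, and then use one encoding consistently when writing the document part. A minimal sketch, assuming the HTML file was saved as UTF-8 (the path is a placeholder):

using System.IO;
using System.Text;

class HtmlSource
{
    static string LoadHtml(string path)
    {
        // If the editor saved the HTML as UTF-8, read it as UTF-8.
        // Reading it with Encoding.Default instead mis-decodes the multi-byte
        // sequences and produces the "â€¢"-style garbage described above.
        return File.ReadAllText(path, Encoding.UTF8);
    }
}

Then pass that same encoding to the StreamWriter and GetBytes calls shown above, so the bytes written into the part match what the XML declares.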

StreamReader weird error with �

StreamReader reads '–' (Alt+0150) as � even though I have UTF-8 encoding and detectEncodingFromByteOrderMarks (BOM) set to true. Can anyone guide me on this?
That byte code won't appear in UTF-8 encoded text. The en dash is '\u2013', encoded as 0xE2 0x80 0x93 in UTF-8. If you get this character when you type Alt+0150 on the numeric keypad, then your default system code page is probably 1252. Simply pass Encoding.Default to the StreamReader constructor.
You need to know the encoding that was used to encode the text. There's no way around that. Try different encodings until you get the desired results.
From MSDN:
The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used. See the Encoding.GetPreamble method for more information.
This means that BOM detection is just an extra step that works only when a BOM is actually present; otherwise the reader falls back to the encoding you provided.
As the other users wrote, the probable cause of this issue is an ANSI encoding of the file you are trying to read. I recreated the issue you described by saving the file in ANSI encoding.
Try to use this code:
var stream = new StreamReader(fileName, Encoding.Default);
The Encoding.Default parameter is important here. This code should read the character you mentioned correctly.
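One caveat: Encoding.Default depends on the machine's ANSI code page, and on .NET Core it is UTF-8 rather than an ANSI code page at all, so pinning the code page explicitly is more predictable. The 1252 below is an assumption based on the Alt+0150 observation above:

// On .NET Core / .NET 5+, code page encodings require the
// System.Text.Encoding.CodePages package and a one-time
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) call.
var stream = new StreamReader(fileName, Encoding.GetEncoding(1252));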
