Special characters for Docx with ooxml - c#

I am converting HTML to docx using http://www.codeproject.com/Articles/91894/HTML-as-a-Source-for-a-DOCX-File.
Most of the characters are read properly, but some special characters such as •, “ ” are being displayed as â€¢. What should I be doing to correct this?
The HTML that I was passing to HTMLtoDocx was also not being read with the special characters intact; instead they were displayed as '?'. After changing the encoding to Encoding.Default it returns the correct characters.
In HTMLtoDOCX there are two places where I can set the encoding (lines below). In both places I tried changing the encoding from Encoding.UTF8 to Encoding.Default, but it isn't helping.
StreamWriter streamStartPart = new StreamWriter(docpartDocumentXML.GetStream(FileMode.Create, FileAccess.Write), Encoding.Default);
byte[] Origem = Encoding.Default.GetBytes(html);

â€¢ indicates a UTF-8 sequence incorrectly interpreted as ANSI (= Encoding.Default).
You should check whether the HTML file is read with the correct encoding.
While the encoding info may be available in the HTTP header or in HTML META tags, it may not be correct if the HTML is read from a file.
Since .NET strings hold characters as 16-bit Unicode (UTF-16) code units, making sure the correct encoding is applied when reading and writing byte streams is the first step toward fixing your problem.
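As a rough sketch (the input path, the source code page, and the documentXml variable are assumptions, not part of the original article), the idea is to read the HTML with the encoding it was actually saved in and keep the write side UTF-8:

// Hypothetical example: read the HTML with the encoding it was really saved in
// (Windows-1252 here purely for illustration), so the .NET string is correct.
string html = File.ReadAllText(@"C:\temp\input.html", Encoding.GetEncoding(1252));

// Keep the write side UTF-8, matching the encoding declared in the docx part's XML.
using (StreamWriter streamStartPart = new StreamWriter(
    docpartDocumentXML.GetStream(FileMode.Create, FileAccess.Write), Encoding.UTF8))
{
    streamStartPart.Write(documentXml); // documentXml: the generated WordprocessingML string
}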

Related

Some questions about encoding

I have some questions about encoding. I have files of various types and encodings. Can only text files (.txt, .csv, .xml) have a byte order mark, or am I wrong? In the first method I want to prepare the file: change the encoding and remove the preamble (or not); in the second I only want to use:
File.WriteAllBytes("testFile.csv", fileBytes);
I don't want to pass an encoding to the second method, so should there be a BOM in every file? Or will it be saved with a Unicode encoding?
To convert the encoding I want to use the Encoding.Convert method, but after converting there is no BOM in the file, so is it better to use a StreamReader and StreamWriter with the source and destination encodings?
The XML format declares its encoding in its declaration, but that is just a notation; the real encoding can be different.
In my opinion, use the same encoding for all input and output files. Unicode is a good choice, as you mentioned.
If you cannot control the input files' encoding because they were written by someone else, try UTF-8 for them. Most text editors use UTF-8 as their default encoding. Note that File.WriteAllBytes() just writes the bytes you give it unchanged, so whether a BOM ends up in the file depends entirely on the bytes you pass.
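A minimal sketch of the StreamReader/StreamWriter route (the file names and the source code page are placeholders); the writer emits the BOM for you, which Encoding.Convert by itself does not:

using System.IO;
using System.Text;

// Read with the source encoding, write with the target encoding.
// new UTF8Encoding(true) emits a BOM; Encoding.Unicode (UTF-16 LE) also writes one by default.
using (var reader = new StreamReader("source.csv", Encoding.GetEncoding(1250)))
using (var writer = new StreamWriter("testFile.csv", false, new UTF8Encoding(true)))
{
    writer.Write(reader.ReadToEnd());
}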

Invalid characters in File.ReadAllText

I'm calling File.ReadAllText() in a program designed to format some files that I have.
Some of these files contain the ® (174) symbol. However, when the text is being read, the returned string contains � (65533) symbols where the ® (174) should be.
What would cause this and how can I fix it?
Most likely the file is in a different encoding than the default. If you know it, you can specify it using the File.ReadAllText(String, Encoding) overload.
Code sample:
string readText = File.ReadAllText(path, Encoding.Default); // <-- change the encoding to whatever the encoding really is
If you DON'T know the encoding, see this previous SO question: How to use ReadAllText when file encoding unknown
This is likely due to a mismatch in the Encoding. Use the ReadAllText overload which allows you to specify the proper Encoding to use when reading the file.
The default overload will assume UTF-8 unless it can detect UTF-32. Any other encoding will come through incorrectly.
You need to specify the encoding when you call File.ReadAllText, unless the file is actually in UTF-8, which it sounds like it's not. (Basically the one-parameter overload is equivalent to passing in UTF-8 as the second argument. It will also detect UTF-32 with an appropriate byte-order mark, I believe.)
The first thing is to work out which encoding it is in (e.g. ISO-8859-1 - but you need to check this) and then pass that as a second argument.
For example:
Encoding isoLatin1 = Encoding.GetEncoding(28591);
string text = File.ReadAllText(path, isoLatin1);
It's always important that you know what encoding binary data is using before you try to read it as text. That's true for files, network streams, anything.
The character you are reading is the Replacement Character (U+FFFD), "used to replace an incoming character whose value is unknown or unrepresentable in Unicode; compare the use of U+001A as a control character to indicate the substitute function":
http://www.fileformat.info/info/unicode/char/fffd/index.htm
You are getting this because the actual encoding of the file does not match the encoding your program expects.
By default ReadAllText expects UTF-8. It encounters a byte sequence that does not represent a valid UTF-8 character, so it replaces it with the replacement character.
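If you would rather have the mismatch fail loudly than silently produce U+FFFD, a strict UTF-8 decoder can be tried first; this is a sketch, not part of the original answers, and path stands for the file path used above:

// Throws DecoderFallbackException on invalid byte sequences instead of
// substituting U+FFFD, which makes an encoding mismatch obvious.
var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
string text;
try
{
    text = File.ReadAllText(path, strictUtf8);
}
catch (DecoderFallbackException)
{
    // Not valid UTF-8: fall back to the legacy encoding you expect, e.g. ISO-8859-1.
    text = File.ReadAllText(path, Encoding.GetEncoding(28591));
}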

Strange behaviour when writing a string to a UTF-8 file (no BOM) - ANSI file being returned

I'm attempting to write out C# string data to a UTF-8 file without a byte order mark (BOM), but am getting an ANSI file created.
using (StreamWriter objStreamWriter = new StreamWriter(SomePath, false, new UTF8Encoding(false)))
{
    objStreamWriter.Write("Hello world - Encoding no BOM but actually returns ANSI");
    objStreamWriter.Close();
}
According to the documentation for the UTF8Encoding class constructor, setting the encoderShouldEmitUTF8Identifier parameter to false should inhibit the Byte Order Mark.
I'm using .NET Framework 4.5 on my British (en-gb) computer. Inspecting the StreamWriter object in the debugger shows that UTF8Encoding is in place.
So why am I getting an ANSI file (as checked with Notepad++) back from this operation?
Your example string that you're writing to the file consists only of characters in the ASCII range. The ASCII range is shared by ASCII, UTF-8 and most (all?) ANSI code pages. So, given that there is no BOM, Notepad++ has no indication if UTF-8 or ANSI is meant, and apparently defaults to ANSI.
If there is no BOM and no characters outside the ASCII range, how do you expect Notepad++ to recognise it as UTF-8? UTF-8, ANSI and ASCII are all identical for the characters you are emitting.
(Even if you include some unicode characters Notepad++ may struggle to guess the correct encoding.)
In "Hello world - Encoding no BOM but actually returns ANSI", no character is encoded differently in UTF8 and ANSI. Because of BOM absence, Notepad++ shows that the file is encoded in ANSI because there is no 'special character'. Try adding a "é, à, ê" character in your file and Notepad++ will show it as being encoded in UTF8 without BOM.

StreamReader weird error with �

StreamReader reads '–' (Alt+0150) as � even though I have UTF-8 encoding and detectEncodingFromByteOrderMarks (BOM) set to true. Can anyone guide me on this?
That byte value won't appear in UTF-8 encoded text. The character is '\u2013', encoded as 0xE2 0x80 0x93 in UTF-8. If you get this character when you type Alt+0150 on the numeric keypad, then your default system code page is probably 1252. Simply pass Encoding.Default to the StreamReader constructor.
You need to know the encoding that was used to encode the text. There's no way around that. Try different encodings until you get the desired results.
From MSDN:
The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used. See the Encoding.GetPreamble method for more information.
This means that relying on the BOM is just an extra hint that may or may not work, and it is easily overridden when the file carries no byte order mark at all.
As the other users wrote, the probable reason for this issue is that the file you are trying to read is in an ANSI encoding. I recreated the issue you described when I saved the file with ANSI encoding.
Try to use this code:
var stream = new StreamReader(fileName, Encoding.Default);
The Encoding.Default parameter is important here. This code should read the character you mentioned correctly.
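Note that Encoding.Default depends on the machine's ANSI code page, so if you know the file is Windows-1252 it is safer to name it explicitly; a small sketch (the file name is assumed):

// Code page 1252 spelled out, instead of whatever ANSI code page this machine happens to use.
var stream = new StreamReader("input.txt", Encoding.GetEncoding(1252));
string content = stream.ReadToEnd();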

How to get correctly-encoded HTML from the clipboard?

Has anyone noticed that if you retrieve HTML from the clipboard, it gets the encoding wrong and injects weird characters?
For example, executing a command like this:
string s = (string) Clipboard.GetData(DataFormats.Html)
Results in stuff like:
<FONT size=-2>  <A href="/advanced_search?hl=en">Advanced
Search</A><BR>  Preferences<BR>  <A
href="/language_tools?hl=en">Language
Tools</A></FONT>
Not sure how MarkDown will process this, but there are weird characters in the resulting markup above.
It appears that the bug is with the .NET framework. What do you think is the best way to get correctly-encoded HTML from the clipboard?
In this case it is not as visible as it was in my case. Today I tried to copy data from the clipboard that contained a few Unicode characters, and the data I got looked as if a UTF-8 encoded file had been read using the Windows-1250 encoding (the local encoding on my Windows).
It seems your case is the same. If you save the HTML data to a file (remember to put a non-breakable space, 0xA0, after the Â character, not a standard space) in Windows-1252 (or Windows-1250; both work) and then open that file as UTF-8, you will see what should be there.
For another project of mine I wrote a function that fixes data with corrupted encoding.
In this case simple conversion should be sufficient:
byte[] data = Encoding.Default.GetBytes(text);
text = Encoding.UTF8.GetString(data);
My original function is a little bit more complex and contains tests to ensure that data are not corrupted...
public static bool FixMisencodedUTF8(ref string text, Encoding encoding)
{
    if (string.IsNullOrEmpty(text))
        return false;
    byte[] data = encoding.GetBytes(text);
    // there should not be any character outside the source encoding
    string newStr = encoding.GetString(data);
    if (!string.Equals(text, newStr)) // if there is any character "outside"
        return false;                 // leave; the input is in a different encoding
    // IsValidUtf8 is a helper (not shown here) that returns 0 when the bytes
    // are not a valid UTF-8 sequence
    if (IsValidUtf8(data) == 0)
        return false; // if not, we cannot convert to UTF-8
    text = Encoding.UTF8.GetString(data);
    return true;
}
I know that this is not the best (or correct) solution, but I did not find any other way to fix the input...
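For illustration, the function above might be called like this (using Encoding.Default as the encoding the text was wrongly decoded with is an assumption):

string clipboardHtml = (string)Clipboard.GetData(DataFormats.Html);
if (FixMisencodedUTF8(ref clipboardHtml, Encoding.Default))
{
    // clipboardHtml now contains the correctly decoded text.
}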
EDIT (July 20, 2017):
It seems that Microsoft has already found and fixed this error, and it now works correctly. I'm not sure whether the problem exists only in some framework versions, but I know for sure that the application now uses a different framework than it did when I wrote this answer (now 4.5; the previous version was 2.0).
(Now all my fix-up code fails when parsing the data; the new problem is determining the correct behaviour for applications with the fix already applied and without it.)
You have to interpret the data as UTF-8. See MS Office hyperlinks change code page?.
The DataFormats.Html specification states that it's encoded in UTF-8, but there's a bug in the .NET Framework 4 and lower: it actually reads the UTF-8 data as Windows-1252.
You get a lot of wrongly decoded text, leading to funny/bad characters such as
'Å','‹','Å’','Ž','Å¡','Å“','ž','Ÿ','Â','¡','¢','£','¤','Â¥','¦','§','¨','©'
Full explanation here: Debugging Chart Mapping Windows-1252 Characters to UTF-8 Bytes to Latin-1 Characters.
Solution: create a translation dictionary and search and replace.
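A small sketch of the translation-dictionary approach (the mappings shown are only a sample; which sequences appear depends on the source text, and html stands for the mis-decoded clipboard string):

using System.Collections.Generic;

// Map common Windows-1252-mangled UTF-8 sequences back to the characters they represent.
var fixups = new Dictionary<string, string>
{
    { "â€™", "’" },  // right single quotation mark
    { "â€œ", "“" },  // left double quotation mark
    { "â€“", "–" },  // en dash
    { "â€¢", "•" },  // bullet
};
foreach (var pair in fixups)
    html = html.Replace(pair.Key, pair.Value);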
I don't know what your original source document is, but be aware that Word and Outlook provide several versions of the clipboard in different encodings. One is usually Windows-1252 and another is UTF-8. Possibly you're grabbing the UTF-8 encoded version by default, when you're expecting the Windows-1252 (Latin-1 + Smart Quotes)? Non-ASCII characters would show up as multiple odd Latin-1 accented characters. Most "Smart Quotes" are not in the Latin-1 set and are often three bytes in UTF-8.
Can you specify which encoding you want the clipboard contents in?
Try this:
System.Windows.Forms.Clipboard.GetText(System.Windows.Forms.TextDataFormat.Html);
