How to get correctly-encoded HTML from the clipboard? - c#

Has anyone noticed that if you retrieve HTML from the clipboard, it gets the encoding wrong and injects weird characters?
For example, executing a command like this:
string s = (string) Clipboard.GetData(DataFormats.Html)
Results in stuff like:
<FONT size=-2>  <A href="/advanced_search?hl=en">Advanced
Search</A><BR>  Preferences<BR>  <A
href="/language_tools?hl=en">Language
Tools</A></FONT>
Not sure how MarkDown will process this, but there are weird characters in the resulting markup above.
It appears that the bug is with the .NET framework. What do you think is the best way to get correctly-encoded HTML from the clipboard?

In this case it is not so visible as it was in my case. Today I tried to copy data from clipboard but there were a few unicode characters. The data I got were as if I would read a UTF-8 encoded file in Windows-1250 encoding (local encoding in my Windows).
It seems you case is the same. If you save the html data (remember to put non-breakable space = 0xa0 after the  character, not a standard space) in Windows-1252 (or Windows-1250; both works). Then open this file as a UTF-8 file and you will see what there should be.
For my other project I made a function that fix data with corrupted encoding.
In this case simple conversion should be sufficient:
byte[] data = Encoding.Default.GetBytes(text);
text = Encoding.UTF8.GetString(data);
My original function is a little bit more complex and contains tests to ensure that data are not corrupted...
public static bool FixMisencodedUTF8(ref string text, Encoding encoding)
{
if (string.IsNullOrEmpty(text))
return false;
byte[] data = encoding.GetBytes(text);
// there should not be any character outside source encoding
string newStr = encoding.GetString(data);
if (!string.Equals(text, newStr)) // if there is any character "outside"
return false; // leave, the input is in a different encoding
if (IsValidUtf8(data) == 0) // test data to be valid UTF-8 byte sequence
return false; // if not, can not convert to UTF-8
text = Encoding.UTF8.GetString(data);
return true;
}
I know that this is not the best (or correct solution) but I did not found any other way how to fix the input...
EDIT: (July 20, 2017)
It Seems like the Microsoft already found this error and now it works correctly. I'm not sure whether the problem is in some frameworks, but I know for sure, that now the application uses a different framework as in time, when I wrote the answer. (Now it is 4.5; the previous version was 2.0)
(Now all my code fails in parsing the data. There is another problem to determine the correct behaviour for application with fix already aplied and without fix.)

You have to interpret the data as UTF-8. See MS Office hyperlinks change code page?.

DataFormats.Html specification states it's encoded in UTF-8. But there's a bug in .NET 4 Framework and lower, and it actually reads as UTF-8 as Windows-1252.
You get allot of wrong encodings, leading funny/bad characters such as
'Å','‹','Å’','Ž','Å¡','Å“','ž','Ÿ','Â','¡','¢','£','¤','Â¥','¦','§','¨','©'
Full explanation here
Debugging Chart Mapping Windows-1252 Characters to UTF-8 Bytes to Latin-1 Characters
Soln: Create a translation dictionary and search and replace.

I don't know what your original source document is, but be aware that Word and Outlook provide several versions of the clipboard in different encodings. One is usually Windows-1252 and another is UTF-8. Possibly you're grabbing the UTF-8 encoded version by default, when you're expecting the Windows-1252 (Latin-1 + Smart Quotes)? Non-ASCII characters would show up as multiple odd Latin-1 accented characters. Most "Smart Quotes" are not in the Latin-1 set and are often three bytes in UTF-8.
Can you specify which encoding you want the clipboard contents in?

Try this:
System.Windows.Forms.Clipboard.GetText(System.Windows.Forms.TextDataFormat.Html);

Related

Why Does Byte 150 show up as a dash in Notepad but Not when I read it programatically?

I've got a file that looks OK in Notepad (and Notepad++) but when I try to read it with a C# program, the dash shows up as a replacement character (�) instead. After some trial and error, I can reproduce the error as follows:
File.WriteAllBytes("C:\\Misc\\CharTest\\wtf.txt", new byte[] { 150 });
var readFile = File.ReadAllText("C:\\Misc\\CharTest\\wtf.txt");
Console.WriteLine(readFile);
Now, if you go and look in the wtf.txt file using Notepad, you'll see a dash... but I don't get it. I know that's not a "real" Unicode value so that's probably the root of the issue, but I don't get why it looks fine in Notepad and not when I read in the file. And how do I get the file to read it as a dash?
As an aside, a VB6 program I'm trying to rewrite in C# also reads it as a dash.
The File.ReadAllText(string) overload defaults to UTF8 encoding, in which a standalone byte with value 150 is invalid.
Specify the actual encoding of the file, for example:
var encoding = Encoding.GetEncoding(1252);
string content = File.ReadAllText(fileName, encoding);
I used the Windows-1252 encoding, which has a dash at codepoint 150.
Edit: Notepad displays the file correctly because for non-Unicode files the Windows-1252 codepage is the default for western regional settings. So likely you can use also Encoding.Default to get the correct result but keep in mind that Encoding.Default can return different code pages with different regional settings.
You are writing bytes in a textfile. And the you are reading those bytes and interpret them as chars.
Now, when you write bytes, you don't care about encoding, while you have to, in order to read those very same bytes as char.
Notepad++ seems to interpret the byte as Unicode char and therefore prints the _.
Now File.ReadAllText reads the bytes in the specified encoding, which you did not specify and there will be set to one of these and seems to be UTF-8, where 150 is not a valid entry.

C# metro No mapping for the Unicode character exists in the target multi-byte code page

Line:
IList<string> text = await FileIO.ReadLinesAsync(file);
causes exception No mapping for the Unicode character exists in the target multi-byte code page
When I remove chars like ąśźćóż from my file it runs ok, but the problem is that I can't guarantee that those chars won't happen in future.
I tried changing the encoding in advanced save options but it is already
Unicode (UTF-8 with signature) - Codepage 65001
I have a hard time trying to figure this one out.
Make FileIO.ReadLinesAsync use a matching encoding. I don't know what you custom class does but according to the error message it does not use any Unicode encoding.
I think those characters ąśźćóż are UTF-16 encoded.So, it's better to use UTF-16. Use the overload ReadLinesAsync(IStorageFile, UnicodeEncoding) and set UnicodeEncdoing parameter to UnicodeEncoding.Utf16BE
From MSDN :
This method uses the character encoding of the specified file. If you
want to specify different encoding, call ReadLinesAsync(IStorageFile,
UnicodeEncoding) instead.

How to decode a utf string in c#

I have been trying to decode the following string:
Crédit
in c# using the following code:
Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;
string msg = iso.GetString(utf8.GetBytes(#"Crédit"));
which is yielding:
Crédit
I looked online http://jeppesn.dk/utf-8.html and this is in correct utf 8 and should yield:
Crédit
Can someone please point out where i am going wrong?
Thanks
It should be the other way around, and Windows-1252, not ISO-8859-1. Depending on context, people usually mean Windows-1252 when they say Latin-1 or ISO-8859-1, but actually using ISO-8859-1 will fail when there are characters like € because it was a mislabeling in the first place. Even browsers use Windows-1252 when ISO-8859-1 is specified as encoding.
Encoding w1252 = Encoding.GetEncoding(1252);
Encoding utf8 = Encoding.UTF8;
string msg = utf8.GetString(w1252.GetBytes(#"Crédit"));
You're trying to do something that doesn't make sense, basically. You should almost never1 be interpreting the output of one encoding as the input to another encoding. It's like saying, "Suppose I save this image as a gif... then load that file using a jpeg loader... what does it look like?"
I suspect that if you use:
// Just an example: don't actually do this.
string msg = utf8.GetString(iso.GetBytes(#"Crédit"));
... it will do what you want, but you shouldn't be doing this at all.
Now, what is your real input (in what form) and what are you trying to achieve?
1 If you're doing so, it's usually because someone else has already done the wrong thing, or there's a configuration problem somewhere. If you find yourself doing this, you should think very carefully about whether you should really be doing it, or whether you're just working around a different problem which should be tackled differently.

How do I read and write smart quotes (and other silly characters) in C#

I'm writing a program that reads all the text in a file into a string, loops over that string looking at the characters, and then appends the characters back to another string using a Stringbuilder. The issue I'm having is when it's written back out, the special characters such as “ and ” , come out looking like � characters instead. I don't need to do a conversion, I just want it written back out the way I read it in:
StringBuilder sb = new StringBuilder();
string text = File.ReadAllText(filePath);
for (int i = 0; i < text.Length; ++i) {
if (text[i] != '{') { // looking for opening curly brace
sb.Append(text[i]);
continue;
}
// Do stuff
}
File.WriteAllText(destinationFile, sb.ToString());
I tried using different Encodings (UTF-8, UTF-16, ASCII), but then it just came out even worse; I started getting question mark symbols and Chinese characters (yes, a bit of a shotgun approach, but I was just experimenting).
I did read this article: http://www.joelonsoftware.com/articles/Unicode.html
...but it didn't really explain why I was seeing what I saw, unless in C#, the reader starts cutting off bits when it hits weird characters like that. Thanks in advance for any help!
TL;DR that is definitely not UTF-8 and you are not even using UTF-8 to read the resulting file. Read as Windows1252, write as Windows1252 (If you are going to use the same viewing method to view the resulting file)
Well let's first just say that there is no way a file made by a regular user will be in UTF-8. Not all programs in windows even support it (excel, notepad..), let alone have it as default encoding (even most developer tools don't default to utf-8, which drives me insane). Since a lot of developers don't understand that such a thing as encoding even exists, then what chances do regular users have of saving their files in an utf-8 hostile environment?
This is where your problems first start. According to documentation, the overload you are using File.ReadAllText(filePath); can only detect UTF-8 or UTF-32.
Indeed, simply reading a file encoded normally in Windows-1252 that contains "a”a" results in a string "a�a", where � is the unicode replacement character (Read the wikipedia section, it describes exactly the situation you are in!) used to replace invalid bytes. When the replacement character is again encoded as UTF-8, and interpreted as Windows-1252, you will see � because the bytes for � in UTF-8 are 0xEF, 0xBF, 0xBD which are the bytes for � in Windows-1252.
So read it as Windows-1252 and you're half-way there:
Encoding windows1252 = Encoding.GetEncoding("Windows-1252");
String result = File.ReadAllText(#"C:\myfile.txt", windows1252);
Console.WriteLine(result); //Correctly prints "a”a" now
Because you saw �, the tool you are viewing the newly made file with is also using Windows-1252. So if the goal is to have the file show correct characters in that tool, you must encode the output as Windows-1252:
Encoding windows1252 = Encoding.GetEncoding("Windows-1252");
File.WriteAllText(#"C:\myFile", sb.toString(), windows1252);
Chances are the text will be UTF8.
File.ReadAllText(filePath, Encoding.UTF8)
coupled with
File.WriteAllText(destinationFile, sb.ToString(), Encoding.UTF8)
should cover off dealing with the Unicode characters. If you do one or the other you're going to get garbage output, both or nothing.

Default C# String encoding

I am having some issues with the default string encoding in C#. I need to read strings from certain files/packets. However, these strings include characters from the 128-256 range (extended ascii), and all of these characters show up as question marks , instead of the proper character. For example, when reading a string ,it could come up as "S?meStr?n?" if the string contained the extended ascii characters.
Now, is there any way to change the default encoding for my application? I know in java you could define the default character set from command line.
There's no one single "extended ASCII" encoding. There are lots of different 8-bit encodings which are compatible with ASCII for the bottom 128 values.
You need to find out what encoding your files actually use, and specific that when reading the data with StreamReader (or whatever else you're using). For example, you may want encoding Windows-1252:
Encoding encoding = Encoding.GetEncoding(1252);
.NET strings are always sequences of UTF-16 code points. You can't change that, and you shouldn't try. (That's true in Java as well, and you really shouldn't use the platform default encoding when calling getBytes() etc unless that's what you really, really mean.)
An Encoding can be specified in at least one overload of functions for reading text - for example, ReadAllText(string, Encoding).
So if you no a file's encoded using Windows-1252, then you can specify it like so:
string contents = File.ReadAllText(someFilePath, Encoding.GetEncoding(1252));
Of course, doing this requires knowing ahead of time which code page is being used.

Categories