Re-encoding of wrong filenames - C#

So I read the Spolsky article twice, this question too, and tried a lot. Now I'm here.
I created a tarball of a directory structure on a Linux machine with locale ISO-8859-1 and untarred it on Windows with 7zip. As a result, the filenames are scrambled when I view them in Windows Explorer (and in my C# program, too): where I expect to see a German umlaut ü there is a ³ instead. No wonder, because the filenames were written to the tar file using the ISO-8859-1 codepage and Windows obviously does not know about this.
I want to fix this by renaming the files to their correct names. So I think I have to tell the program: "read the filename, treat it as ISO-8859-1, and return every character as a UTF-16 character."
My code to find the correct filename:
void Main()
{
    string[] files = Directory.GetFiles(@"C:\test", "*", SearchOption.AllDirectories);
    var e1 = Encoding.GetEncoding("ISO-8859-1");
    var e2 = Encoding.GetEncoding("UTF-16");
    foreach (var f in files)
    {
        Console.WriteLine($"Source: {f}");
        var source = e1.GetBytes(f);
        var dest = Encoding.Convert(e1, e2, source);
        Console.WriteLine($"Result: {e2.GetString(dest)}");
    }
}
Result - nothing happened:
Source: C:\test\Brief-mrl³.odt
Result: C:\test\Brief-mrl³.odt
Expected result:
Source: C:\test\Brief-mrl³.odt
Result: C:\test\Brief-mrlü.odt
When I exchange e1 and e2 I get weird results. My brain hurts. What am I not getting?
Edit: I know that the mistake was made earlier, but now I have wrong filenames on the Windows machine that I need to correct. However, it might not be solvable via the Encoding class. I found this blog post, and the author states:
It turns out, this isn't a problem with the encoding at all, but the same character address meaning different things to different character sets.
In conclusion, he wrote a method to replace the characters between 130 and 173 with specific, different characters. This does not look straightforward to me, but is it possible that this is the only way? Can anyone comment on this, please?

After some more reading I found the solution myself. This excellent article helped. The point is: once a wrong encoding has been used, you can only guess (or have to know) what exactly went wrong. If you know, you can revert the whole thing in code.
void Main()
{
    // We get the source string e.g. by reading filenames from a directory. We see a "³"
    // where we expect a German umlaut "ü". The reason can be a poorly configured smb share
    // on a Linux server or other problems.
    string source = "M³nch";
    // We are in a .NET program, so the source string (here in the
    // program) is Unicode in UTF-16 encoding, i.e. the codepoints
    // M, ³, n, c and h are encoded in UTF-16.
    byte[] bytesFromSource = Encoding.Unicode.GetBytes(source);
    // The source encoding is UTF-16, hence we get two bytes per character.
    // We accidentally worked with the OEM850 codepage, so we now look up the bytes of
    // the codepoints in the OEM850 codepage: we convert bytesFromSource to the wrong codepage.
    byte[] bytesInWrongCodepage = Encoding.Convert(Encoding.Unicode, Encoding.GetEncoding(850), bytesFromSource);
    // Here's the trick: although we converted to OEM850, we now assume that the bytes are codepage ISO-8859-1.
    // We convert the bytes from ISO-8859-1 to Unicode.
    byte[] bytesFromCorrectCodepage = Encoding.Convert(Encoding.GetEncoding("ISO-8859-1"), Encoding.Unicode, bytesInWrongCodepage);
    // And finally we get the right character.
    string result = Encoding.Unicode.GetString(bytesFromCorrectCodepage);
    Console.WriteLine(result); // Münch
}
CAVEAT: Do not run this method over its results. This is likely to produce non-printable characters or other mayhem.
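To actually rename the files, the same round trip can be applied to each filename; note that the two Encoding.Convert calls collapse into a single GetBytes/GetString pair. Below is a minimal sketch, assuming .NET Framework (on .NET Core/.NET 5+, codepage 850 first needs Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)); FixName and the rename loop are illustrative, not from the original post:
using System;
using System.IO;
using System.Text;

static class FilenameFixer
{
    // Re-interpret a mangled name: encode it with the codepage Windows
    // used for display (OEM850) and decode the resulting bytes as the
    // codepage the tar file was actually written with (ISO-8859-1).
    static string FixName(string broken)
    {
        byte[] raw = Encoding.GetEncoding(850).GetBytes(broken);
        return Encoding.GetEncoding("ISO-8859-1").GetString(raw);
    }

    static void Main()
    {
        foreach (var path in Directory.EnumerateFiles(@"C:\test", "*", SearchOption.AllDirectories))
        {
            string fixedName = FixName(Path.GetFileName(path));
            string target = Path.Combine(Path.GetDirectoryName(path), fixedName);
            if (target != path && !File.Exists(target))
                File.Move(path, target); // rename exactly once - see the caveat above
        }
    }
}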

Related

German character ß encoding in Livelink using C#

I have a folder name that contains German special characters such as äÄéöÖüß. The following screenshot displays the contents of the LiveLink server.
I want to extract the folder from the Livelink server using C#.
value is obtained from LLserver:
var bytes = new List<byte>(value.Length);
foreach (var c in value)
{
    bytes.Add((byte)c);
}
var result = Encoding.UTF8.GetString(bytes.ToArray());
Finally, the result is äÄéöÖü�x, where ß is seen as the box character '�x'. All other characters present in the folder name are decoded properly, except the ß character.
I am just wondering why the same code works for all the other German special characters but not for ß.
Could anybody help to fix this problem in C#?
Thanks in advance.
Go to the admin panel of the server, Livelink/livelink.exe?func=admin.sysvars,
set Character Set: UTF-8,
and change the code section as follows:
byte[] bytes = Encoding.Default.GetBytes(value);
var retValue = Encoding.UTF8.GetString(bytes);
It works fine.
You guessed your encoding to be UTF-8 and it obviously is not. You will need to find out what encoding the byte stream really represents and use that instead. We cannot help you with that; you will have to ask the sender of said bytes.
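A hedged aside on why only ß breaks, assuming the server decoded the UTF-8 folder name as Windows-1252: ß is 0xC3 0x9F in UTF-8, and Windows-1252 maps byte 0x9F to 'Ÿ' (U+0178). That codepoint no longer fits into a single byte, so the (byte)c cast silently truncates it to 0x78, the letter 'x', which matches the '�x' in the question. A small repro sketch:
using System;
using System.Text;

class SharpSDemo
{
    static void Main()
    {
        // Assumption: ß arrived as "ÃŸ", the Windows-1252 view of UTF-8 bytes 0xC3 0x9F.
        string value = "ÃŸ";
        // The narrowing cast: 'Ã' (U+00C3) -> 0xC3, but 'Ÿ' (U+0178) -> 0x78 ('x')!
        var bytes = new byte[value.Length];
        for (int i = 0; i < value.Length; i++)
            bytes[i] = (byte)value[i];
        Console.WriteLine(Encoding.UTF8.GetString(bytes));   // "�x": broken lead byte, then 'x'

        // Encoding.Default on such a system (Windows-1252) maps 'Ÿ' back to 0x9F:
        byte[] correct = Encoding.GetEncoding(1252).GetBytes(value);
        Console.WriteLine(Encoding.UTF8.GetString(correct)); // "ß"
    }
}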

How to decode a UTF string in C#

I have been trying to decode the following string:
Crédit
in c# using the following code:
Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;
string msg = iso.GetString(utf8.GetBytes(@"Crédit"));
which is yielding:
Crédit
I looked online (http://jeppesn.dk/utf-8.html) and this is correct UTF-8 and should yield:
Crédit
Can someone please point out where I am going wrong?
Thanks
It should be the other way around, and Windows-1252, not ISO-8859-1. Depending on context, people usually mean Windows-1252 when they say Latin-1 or ISO-8859-1, but actually using ISO-8859-1 will fail when there are characters like € because it was a mislabeling in the first place. Even browsers use Windows-1252 when ISO-8859-1 is specified as encoding.
Encoding w1252 = Encoding.GetEncoding(1252);
Encoding utf8 = Encoding.UTF8;
string msg = utf8.GetString(w1252.GetBytes(@"Crédit"));
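As a quick illustration of the difference described above (a sketch, not from the original answer): the Euro sign is byte 0x80 in Windows-1252 but does not exist in true ISO-8859-1, so encoding it with ISO-8859-1 silently falls back to '?':
using System;
using System.Text;

class EuroDemo
{
    static void Main()
    {
        byte[] w1252 = Encoding.GetEncoding(1252).GetBytes("€");
        byte[] iso = Encoding.GetEncoding("ISO-8859-1").GetBytes("€");
        Console.WriteLine(w1252[0]); // 128 (0x80): € exists in Windows-1252
        Console.WriteLine(iso[0]);   // 63 ('?'): the replacement fallback in ISO-8859-1
    }
}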
You're trying to do something that doesn't make sense, basically. You should almost never¹ be interpreting the output of one encoding as the input to another encoding. It's like saying, "Suppose I save this image as a GIF... then load that file using a JPEG loader... what does it look like?"
I suspect that if you use:
// Just an example: don't actually do this.
string msg = utf8.GetString(iso.GetBytes(@"Crédit"));
... it will do what you want, but you shouldn't be doing this at all.
Now, what is your real input (in what form) and what are you trying to achieve?
¹ If you're doing so, it's usually because someone else has already done the wrong thing, or there's a configuration problem somewhere. If you find yourself doing this, you should think very carefully about whether you should really be doing it, or whether you're just working around a different problem which should be tackled differently.

How do I read and write smart quotes (and other silly characters) in C#

I'm writing a program that reads all the text in a file into a string, loops over that string looking at the characters, and then appends the characters back to another string using a StringBuilder. The issue I'm having is that when it's written back out, special characters such as “ and ” come out looking like � characters instead. I don't need to do a conversion; I just want it written back out the way I read it in:
StringBuilder sb = new StringBuilder();
string text = File.ReadAllText(filePath);
for (int i = 0; i < text.Length; ++i) {
    if (text[i] != '{') { // looking for opening curly brace
        sb.Append(text[i]);
        continue;
    }
    // Do stuff
}
File.WriteAllText(destinationFile, sb.ToString());
I tried using different Encodings (UTF-8, UTF-16, ASCII), but then it just came out even worse; I started getting question mark symbols and Chinese characters (yes, a bit of a shotgun approach, but I was just experimenting).
I did read this article: http://www.joelonsoftware.com/articles/Unicode.html
...but it didn't really explain why I was seeing what I saw, unless in C#, the reader starts cutting off bits when it hits weird characters like that. Thanks in advance for any help!
TL;DR: that is definitely not UTF-8, and you are not even using UTF-8 to read the resulting file. Read as Windows-1252, write as Windows-1252 (if you are going to use the same viewing method to view the resulting file).
Well, let's first just say that there is no way a file made by a regular user will be in UTF-8. Not all programs in Windows even support it (Excel, Notepad...), let alone have it as the default encoding (even most developer tools don't default to UTF-8, which drives me insane). Since a lot of developers don't understand that such a thing as encoding even exists, what chances do regular users have of saving their files in a UTF-8 hostile environment?
This is where your problems first start. According to the documentation, the overload you are using, File.ReadAllText(filePath), can only detect UTF-8 or UTF-32.
Indeed, simply reading a file encoded normally in Windows-1252 that contains "a”a" results in the string "a�a", where � is the Unicode replacement character (read the Wikipedia section, it describes exactly the situation you are in!) used to replace invalid bytes. When the replacement character is encoded as UTF-8 again and interpreted as Windows-1252, you will see ï¿½, because the bytes for � in UTF-8 are 0xEF, 0xBF, 0xBD, which are the bytes for ï, ¿ and ½ in Windows-1252.
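Here is a short sketch of that exact chain, using the Windows-1252 byte 0x94 for the smart quote ” (an assumption about how the file was originally saved):
using System;
using System.Text;

class ReplacementCharDemo
{
    static void Main()
    {
        // "a”a" saved as Windows-1252: the smart quote ” is the single byte 0x94.
        byte[] fileBytes = { 0x61, 0x94, 0x61 };

        // Step 1: read it as UTF-8. 0x94 is not valid UTF-8, so we get U+FFFD.
        string misread = Encoding.UTF8.GetString(fileBytes);
        Console.WriteLine(misread); // a�a

        // Step 2: write that string back out as UTF-8, then view the file with
        // a Windows-1252 tool. U+FFFD becomes the three bytes 0xEF 0xBF 0xBD.
        byte[] written = Encoding.UTF8.GetBytes(misread);
        Console.WriteLine(Encoding.GetEncoding(1252).GetString(written)); // aï¿½a
    }
}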
So read it as Windows-1252 and you're half-way there:
Encoding windows1252 = Encoding.GetEncoding("Windows-1252");
String result = File.ReadAllText(@"C:\myfile.txt", windows1252);
Console.WriteLine(result); // Correctly prints "a”a" now
Because you saw ï¿½, the tool you are viewing the newly made file with is also using Windows-1252. So if the goal is to have the file show correct characters in that tool, you must encode the output as Windows-1252:
Encoding windows1252 = Encoding.GetEncoding("Windows-1252");
File.WriteAllText(@"C:\myFile", sb.ToString(), windows1252);
Chances are the text will be UTF8.
File.ReadAllText(filePath, Encoding.UTF8)
coupled with
File.WriteAllText(destinationFile, sb.ToString(), Encoding.UTF8)
should cover off dealing with the Unicode characters. If you do only one or the other, you're going to get garbage output; it's both or nothing.

Convert UCS-2 characters to UTF-8 Using C#

I'm pulling some internationalized text from a MS SQL Server 2005 database. As per the defaults for that DB, the characters are stored as UCS-2. However, I need to output the data in UTF-8 format, as I'm sending it out over the web. Currently, I have the following code to convert:
SqlString dbString = resultReader.GetSqlString(0);
byte[] dbBytes = dbString.GetUnicodeBytes();
byte[] utf8Bytes = System.Text.Encoding.Convert(System.Text.Encoding.Unicode,
                                                System.Text.Encoding.UTF8, dbBytes);
System.Text.UTF8Encoding encoder = new System.Text.UTF8Encoding();
string outputString = encoder.GetString(utf8Bytes);
However, when I examine the output in the browser, it appears to be garbage, no matter what I set the encoding to.
What am I missing?
EDIT:
In response to the answers below, the reason I thought I had to perform a conversion is because I can output literal multibyte strings just fine. For example:
OutputControl.Text = "カルフォルニア工科大学とチューリッヒ工科大学は共同で、太陽光を保管可能な燃料に直接変えることのできる装置の開発に成功したとのこと";
works. Here, OutputControl is an ASP.Net Literal. However,
OutputControl.Text = outputString; //Output from above snippet
results in mangled output as described above. My hypothesis was that the database's output was somehow getting mangled by ASP.Net. If that's not the case, then what are some other possibilities?
EDIT 2:
Okay, I'm stupid. It turns out that there's nothing wrong with the database at all. When I tried inserting my own literal double byte characters (材料,原料;木料), I could read and output them just fine even without any conversion process at all. It seems to me that whatever is inserting the data into the DB is mangling the characters somehow, so I'm going to look at that. With my verified, "clean" data, the following code works:
OutputControl.Text = dbString.ToString();
as the responses below indicate it should.
Your code does essentially the same as:
SqlString dbString = resultReader.GetSqlString(0);
string outputString = dbString.ToString();
string itself is a UNICODE string (specifically, UTF-16, which is 'almost' the same as UCS-2, except for codepoints not fitting into the lowest 16 bits). In other words, the conversions you are performing are redundant.
Your web app most likely mangles the encoding somewhere else as well, or sets a wrong encoding for the HTML output. However, that can't be diagnosed from the information you provided so far.
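One place worth checking, as a hedged sketch (the post does not show where the HTML output is configured; classic ASP.NET Web Forms is assumed here): make sure the response itself is declared as UTF-8, e.g. in Page_Load:
// Sketch, assuming ASP.NET Web Forms: serve the page explicitly as UTF-8
// so the browser cannot mis-decode the output of OutputControl.
protected void Page_Load(object sender, EventArgs e)
{
    Response.ContentEncoding = System.Text.Encoding.UTF8; // encoding of the bytes sent
    Response.Charset = "utf-8";                           // charset advertised in the Content-Type header
}
The same can also be set globally via <globalization responseEncoding="utf-8" /> in web.config.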
String in .NET is 'encoding agnostic'.
You can convert bytes to a string using a particular encoding to tell .NET how to interpret your bytes.
You can convert a string to bytes using a particular encoding to tell .NET how you want your bytes served.
But trying to convert a string to another string using encodings makes no sense at all.
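A minimal illustration of the two legitimate directions (not from the original answer):
using System;
using System.Text;

class RoundTrip
{
    static void Main()
    {
        string s = "カルフォルニア";                  // .NET strings are UTF-16 internally
        byte[] utf8 = Encoding.UTF8.GetBytes(s);     // string -> bytes: pick a wire encoding
        string back = Encoding.UTF8.GetString(utf8); // bytes -> string: same encoding!
        Console.WriteLine(back == s);                // True; no string-to-string "conversion" exists
    }
}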

How to get correctly-encoded HTML from the clipboard?

Has anyone noticed that if you retrieve HTML from the clipboard, it gets the encoding wrong and injects weird characters?
For example, executing a command like this:
string s = (string) Clipboard.GetData(DataFormats.Html)
Results in stuff like:
<FONT size=-2>Â  <A href="/advanced_search?hl=en">Advanced
Search</A><BR>Â  Preferences<BR>Â  <A
href="/language_tools?hl=en">Language
Tools</A></FONT>
Not sure how MarkDown will process this, but there are weird characters in the resulting markup above.
It appears that the bug is with the .NET framework. What do you think is the best way to get correctly-encoded HTML from the clipboard?
In this case it is not as visible as it was in my case. Today I tried to copy data from the clipboard, and there were a few Unicode characters. The data I got looked as if I had read a UTF-8 encoded file in Windows-1250 encoding (the local encoding on my Windows).
It seems your case is the same. If you save the HTML data (remember to put a non-breaking space = 0xA0 after the Â character, not a standard space) in Windows-1252 (or Windows-1250; both work) and then open this file as a UTF-8 file, you will see what should be there.
For my other project I made a function that fixes data with corrupted encoding.
In this case a simple conversion should be sufficient:
byte[] data = Encoding.Default.GetBytes(text);
text = Encoding.UTF8.GetString(data);
My original function is a little bit more complex and contains tests to ensure that the data are not corrupted:
public static bool FixMisencodedUTF8(ref string text, Encoding encoding)
{
    if (string.IsNullOrEmpty(text))
        return false;
    byte[] data = encoding.GetBytes(text);
    // there should not be any character outside the source encoding
    string newStr = encoding.GetString(data);
    if (!string.Equals(text, newStr)) // if there is any character "outside"
        return false;                 // leave; the input is in a different encoding
    if (IsValidUtf8(data) == 0)       // test that data is a valid UTF-8 byte sequence
        return false;                 // if not, we cannot convert to UTF-8
    text = Encoding.UTF8.GetString(data);
    return true;
}
I know that this is not the best (or a fully correct) solution, but I did not find any other way to fix the input...
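A hypothetical usage sketch (the clipboard source and the Windows-1250 guess are assumptions matching the answer's scenario; IsValidUtf8 is the author's helper, not shown):
// Hypothetical call site for the author's FixMisencodedUTF8 helper.
string html = (string)Clipboard.GetData(DataFormats.Html);
// Try to undo "UTF-8 bytes decoded as Windows-1250" (the author's local codepage):
if (FixMisencodedUTF8(ref html, Encoding.GetEncoding(1250)))
{
    // html now contains the repaired text
}
// otherwise the input was not misencoded UTF-8 and is left untouched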
EDIT (July 20, 2017):
It seems Microsoft has already found and fixed this error, and now it works correctly. I'm not sure whether the problem was in some framework versions, but I know for sure that the application now uses a different framework than at the time I wrote this answer. (Now it is 4.5; the previous version was 2.0.)
(Now all my code fails in parsing the data. There is another problem: determining the correct behaviour for an application with the fix already applied and one without it.)
You have to interpret the data as UTF-8. See MS Office hyperlinks change code page?.
DataFormats.Html specification states it's encoded in UTF-8. But there's a bug in the .NET 4 Framework and lower: it actually reads the UTF-8 data as Windows-1252.
You get a lot of wrong encodings, leading to funny/bad characters such as
'Å','‹','Å’','Ž','Å¡','Å“','ž','Ÿ','Â','¡','¢','£','¤','Â¥','¦','§','¨','©'
Full explanation here:
Debugging Chart Mapping Windows-1252 Characters to UTF-8 Bytes to Latin-1 Characters
Solution: create a translation dictionary and search and replace.
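Instead of a translation dictionary, the same repair can be expressed as a round trip (a sketch, not from the original answer): re-encode the mangled string with Windows-1252 and decode the resulting bytes as the UTF-8 they really are, which is exactly the inverse of the framework's mistake:
using System;
using System.Text;
using System.Windows.Forms;

class ClipboardFix
{
    [STAThread]
    static void Main()
    {
        // The framework (.NET 4 and lower) decoded UTF-8 clipboard bytes as Windows-1252.
        string mangled = (string)Clipboard.GetData(DataFormats.Html);
        if (mangled != null)
        {
            byte[] raw = Encoding.GetEncoding(1252).GetBytes(mangled); // recover the raw bytes
            string html = Encoding.UTF8.GetString(raw);                // decode them correctly
            Console.WriteLine(html);
        }
    }
}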
I don't know what your original source document is, but be aware that Word and Outlook provide several versions of the clipboard in different encodings. One is usually Windows-1252 and another is UTF-8. Possibly you're grabbing the UTF-8 encoded version by default, when you're expecting the Windows-1252 (Latin-1 + Smart Quotes)? Non-ASCII characters would show up as multiple odd Latin-1 accented characters. Most "Smart Quotes" are not in the Latin-1 set and are often three bytes in UTF-8.
Can you specify which encoding you want the clipboard contents in?
Try this:
System.Windows.Forms.Clipboard.GetText(System.Windows.Forms.TextDataFormat.Html);
