I have a folder name that contains German special characters such as äÄéöÖüß. The following screenshot displays the contents of the Livelink server.
I want to extract the folder from the Livelink server using C#.
value is obtained from LLserver.
var bytes = new List<byte>(value.Length);
foreach (var c in value)
{
bytes.Add((byte)c);
}
var result = Encoding.UTF8.GetString(bytes.ToArray());
Finally, the result is äÄéöÖü�x, where ß appears as the box character '�x'. All other characters in the folder name are decoded correctly; only ß is not.
I am just wondering why the same code works for all other German special characters but not for ß.
Could anybody help to fix this problem in C#?
Thanks in advance.
Go to the admin panel of the Livelink server (Livelink/livelink.exe?func=admin.sysvars)
and set Character Set to UTF-8.
Then change the code section as follows:
byte[] bytes = Encoding.Default.GetBytes(value);
var retValue = Encoding.UTF8.GetString(bytes);
It works fine.
You guessed your encoding to be UTF8 and it obviously is not. You will need to find out what encoding the byte stream really represents and use that instead. We cannot help you with that, you will have to ask the sender of said bytes.
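One possible explanation for the '�x' is sketched below. It rests on an assumption the question does not confirm: that the server's UTF-8 bytes were decoded as Windows-1252 before they reached value. ß is C3 9F in UTF-8, Windows-1252 maps byte 9F to Ÿ (U+0178), and the cast (byte)c keeps only the low byte 0x78, which is 'x'. Characters whose second UTF-8 byte lies in A0-BF (ä, é, ö, ü) survive the cast because Windows-1252 and Latin-1 agree in that range. The mis-decoded string is hard-coded here so no extra code pages are needed.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Assumption: UTF-8 bytes C3 9F (ß) were decoded as Windows-1252,
// which yields the two chars U+00C3 (Ã) and U+0178 (Ÿ).
string value = "\u00C3\u0178";

// The loop from the question: each char is truncated to its low byte.
var bytes = new List<byte>(value.Length);
foreach (var c in value)
{
    bytes.Add(unchecked((byte)c)); // U+0178 becomes 0x78 ('x'), not 0x9F
}

// C3 78 is not a valid UTF-8 sequence: C3 decodes to U+FFFD, 78 to 'x'.
string result = Encoding.UTF8.GetString(bytes.ToArray());
Console.WriteLine(result == "\uFFFDx"); // True
```

If that assumption holds, it would also explain why Encoding.Default.GetBytes(value) helps on a machine whose default code page is Windows-1252: it maps Ÿ back to the byte 9F instead of truncating it.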
So I read the Spolsky Article twice, this question too and tried a lot. Now I'm here.
I created a tarball of a directory structure on a Linux Machine with locale ISO-8859-1 and untarred it on Windows with 7zip. As a result, the filenames are scrambled up when I view them in Windows Explorer (and in my C# program, too): Where I expect to see a German umlaut ü it's a ³ - No wonder, because the filenames are written to the tar file using the ISO-8859-1 codepage and Windows obviously does not know about this.
I want to fix this by renaming the files to their correct names. So I think I have to tell the program "read the filename, think of it as ISO-8859-1 and return every character as UTF-16 character."
My code to find the correct filename:
void Main()
{
string[] files = Directory.GetFiles(@"C:\test", @"*", SearchOption.AllDirectories);
var e1 = Encoding.GetEncoding("ISO-8859-1");
var e2 = Encoding.GetEncoding("UTF-16");
foreach (var f in files)
{
Console.WriteLine($"Source: {f}");
var source = e1.GetBytes(f);
var dest = Encoding.Convert(e1, e2, source);
Console.WriteLine($"Result: {e2.GetString(dest)}");
}
}
Result - nothing happened:
Source: C:\test\Brief-mrl³.odt
Result: C:\test\Brief-mrl³.odt
expected Result:
Source: C:\test\Brief-mrl³.odt
Result: C:\test\Brief-mrlü.odt
When I exchange e1 and e2 I get weird results. My brain hurts. What am I not getting?
Edit: I know that the mistake has been made earlier, but now I have wrong filenames on the Windows machine that I need to correct. However, it might not be solvable via the Encoding-Class. I found this blog post and the author states
It turns out, this isn't a problem with the encoding at all, but the same character address meaning different things to different character sets.
In conclusion, he wrote a method to replace the characters between 130 and 173 with specific, different characters. This does not look straightforward to me, but is it possible that this is the only way? Can anyone comment on this, please?
After some more reading I got the solution myself. This excellent article helped. The point is: Once a wrong encoding was used, you can only guess (or have to know) what went wrong exactly. If you know, you can revert the whole thing in code.
void Main()
{
// We get the source string e.g. reading files from a directory. We see a "³" when
// we expect a German umlaut "ü". The reason can be a poorly configured smb share
// on a Linux server or other problems.
string source = "M³nch";
// We are in a .NET program, so the source string (here in the
// program) is Unicode in UTF-16 encoding. I.e., the codepoints
// M, ³, n, c and h are encoded in UTF-16.
byte[] bytesFromSource = Encoding.Unicode.GetBytes(source);
// The source encoding is UTF-16, hence we get two bytes per character.
// We accidentally worked with the OEM850 codepage, so we now have to look up the bytes of
// the codepoints on the OEM850 codepage: We convert our bytesFromSource to the wrong codepage.
byte[] bytesInWrongCodepage = Encoding.Convert(Encoding.Unicode, Encoding.GetEncoding(850), bytesFromSource);
// Here's the trick: Although converting to OEM850, we now assume that the bytes are Codepage ISO-8859-1.
// We convert the bytes from ISO-8859-1 to Unicode.
byte[] bytesFromCorrectCodepage = Encoding.Convert(Encoding.GetEncoding("ISO-8859-1"), Encoding.Unicode, bytesInWrongCodepage);
// And finally we get the right character.
string result = Encoding.Unicode.GetString(bytesFromCorrectCodepage);
Console.WriteLine(result); // Münch
}
CAVEAT: Do not run this method over its results. This is likely to produce non-printable characters or other mayhem.
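The same round trip can be written more compactly without going through UTF-16 byte arrays. This is only a sketch under the same assumptions as the answer above (the text was decoded as OEM 850 but was really ISO-8859-1); Repair is a hypothetical helper name, and the provider registration is needed on .NET Core / .NET 5+ to make code page 850 available.

```csharp
using System;
using System.Text;

// Needed on .NET Core / .NET 5+ so that code page 850 is available.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Hypothetical helper: re-encode with the code page that was wrongly used
// for decoding, then decode the same bytes with the intended one.
static string Repair(string garbled, Encoding wrong, Encoding intended)
{
    byte[] originalBytes = wrong.GetBytes(garbled);
    return intended.GetString(originalBytes);
}

Encoding oem850 = Encoding.GetEncoding(850);
Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

string repaired = Repair("M\u00B3nch", oem850, latin1); // input is "M³nch"
Console.WriteLine(repaired == "M\u00FCnch"); // True: "Münch"
```

The caveat above still applies: running Repair on an already correct name will garble it.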
I searched and found some partial answers that work in some instances, but nothing that works in all. The scenario is that I get sent an XML file over the network. The file names its encoding, e.g. encoding = "Windows-932" or encoding = "Windows-1254" and so on. I need to parse the file, get certain info, convert that info to Unicode chars and send it on to another machine that can only read Unicode.
So if the encoding is
1251 it is Cyrillic so Char E1 = ASCII225 = Unicode 0431.
1254 it is Turkish so Char E1 = ASCII225 = Unicode 00E1.
1253 it is Greek so Char E1 = ASCII225 = Unicode 03B1.
So far I thought I could have a lookup table that looked at the encoding and then I just add the Unicode page in front of the E1, BUT that will not work, as in Unicode they do not have the same page position, as you see above.
To further complicate things I can also get encodings such as Japanese (Shift-JIS), which is codepage 932. Now this does not get all the Japanese from the same page, and almost every character on the ASCII pages comes from a different Unicode page.
So the question is: how in C# do I convert the XML data to Unicode and get it correct every time? Any ideas?
Encoding.GetEncoding("windows-1253").GetString(new byte[] {0xE1}) // -> "\u03B1" α
Encoding.GetEncoding("windows-1254").GetString(new byte[] {0xE1}) // -> "\u00E1" á
Encoding.GetEncoding("windows-1251").GetString(new byte[] {0xE1}) // -> "\u0431" б
But for an XML file you should be using an existing XML parser (eg XmlReader or XDocument.Load) which will deal with encodings for you.
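As a rough illustration of letting the parser handle the declared encoding, the sketch below builds a tiny XML document as raw bytes (the element name and content are made up for the example) and loads it with XDocument.Load. On .NET Core / .NET 5+ the Windows code pages must be registered first; on .NET Framework that line is not needed.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

// Needed on .NET Core / .NET 5+ so the Windows code pages can be resolved.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Build a tiny document as raw bytes: an ASCII declaration naming
// windows-1253, then the single byte 0xE1 as element content.
byte[] head = Encoding.ASCII.GetBytes(
    "<?xml version=\"1.0\" encoding=\"windows-1253\"?><name>");
byte[] tail = Encoding.ASCII.GetBytes("</name>");

using var stream = new MemoryStream();
stream.Write(head, 0, head.Length);
stream.WriteByte(0xE1);
stream.Write(tail, 0, tail.Length);
stream.Position = 0;

// The parser reads the declaration and decodes the bytes accordingly.
XDocument doc = XDocument.Load(stream);
string name = doc.Root?.Value;
Console.WriteLine(name == "\u03B1"); // True: Greek alpha
```

Once loaded, every string in the document is already a normal .NET (UTF-16) string, so it can be written out in whatever Unicode encoding the receiving machine expects.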
The goal is simple. Grab some French text containing special characters from a .txt file and stick it into the variable "content". All is working well except that all instances of the character "à" are being interpreted as "À". Did I pick the wrong encoding (UTF7) or what?
Any help would be much appreciated.
// Using encoding to ensure special characters work
Encoding enc = Encoding.UTF7;
// Make sure we get all the information about special chars
byte[] fileBytes = System.IO.File.ReadAllBytes(path);
// Convert the new byte[] into a char[] and then into a string.
char[] fileChars = new char[enc.GetCharCount(fileBytes, 0, fileBytes.Length)];
enc.GetChars(fileBytes, 0, fileBytes.Length, fileChars, 0);
string fileString = new string(fileChars);
// Insert the resulting encoded string "fileString" into "content"
content = fileString;
Your code is correct besides the wrong encoding. Find the correct one and plug it in. Nobody uses UTF7 so this is probably not it.
Maybe it is a non-Unicode one. Try Encoding.Default. That one empirically often helps in Germany.
Also, just use File.ReadAllText. It does everything you are doing.
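A minimal sketch of the File.ReadAllText suggestion follows. The encoding is an assumption for illustration only (ISO-8859-1, where "à" is the single byte 0xE0); the question does not say how the file was actually saved, so substitute the real encoding. The sample file is created in a temp location just to make the example self-contained.

```csharp
using System;
using System.IO;
using System.Text;

// Assumption for illustration: the file is ISO-8859-1 (Latin-1), where
// "à" is the single byte 0xE0. Substitute the file's real encoding.
Encoding assumed = Encoding.GetEncoding("ISO-8859-1");

// Create a sample file containing the bytes for "voilà".
string path = Path.GetTempFileName();
File.WriteAllBytes(path, new byte[] { 0x76, 0x6F, 0x69, 0x6C, 0xE0 });

// One call replaces ReadAllBytes + GetCharCount + GetChars.
string content = File.ReadAllText(path, assumed);
File.Delete(path);

Console.WriteLine(content == "voil\u00E0"); // True
```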
I need your help concerning something which disturbs me when working with UNICODE encoding in .NET Framework ...
I have to interface with some customer data systems which are non-UNICODE applications, and those customers are worldwide companies (Chinese, Korean, Russian, ...). So they have to provide me with an 8-bit "ASCII" file, which will be encoded with their Windows code page.
So, if a Greek customer sends me a text file containing 'Σ' (sigma letter '\u03A3') in a product name, I will get an equivalent letter corresponding to the 211 ANSI code point, represented in my own code page. My computer is a French Windows, which means the code page is Windows-1252, so I will have in place 'Ó' in this text file... Ok.
I know this customer is a Greek one, so I can read his file by forcing the windows-1253 code page in my import parameters.
/// <summary>
/// Convert a string ASCII value using code page encoding to Unicode encoding
/// </summary>
/// <param name="value"></param>
/// <returns></returns>
public static string ToUnicode(string value, int codePage)
{
Encoding windows = Encoding.Default;
Encoding unicode = Encoding.Unicode;
Encoding sp = Encoding.GetEncoding(codePage);
if (sp != null && !String.IsNullOrEmpty(value))
{
// First get bytes in windows encoding
byte[] wbytes = windows.GetBytes(value);
// Check if CodePage to use is different from current Windows one
if (windows.CodePage != sp.CodePage)
{
// Convert to Unicode using SP code page
byte[] ubytes = Encoding.Convert(sp, unicode, wbytes);
return unicode.GetString(ubytes);
}
else
{
// Directly convert to Unicode using windows code page
byte[] ubytes = Encoding.Convert(windows, unicode, wbytes);
return unicode.GetString(ubytes);
}
}
else
{
return value;
}
}
Well in the end I got 'Σ' in my application and I am able to save this into my SQL Server database. Now my application has to perform some complex computations, and then I have to give back this file to the customer with an automatic export...
So my problem is that I have to perform a UNICODE => ANSI conversion?! But this is not as simple as I thought at the beginning...
I don't want to save the code page used during import, so my first idea was to convert UNICODE to windows-1252, and then automatically send the file to the customers. They will read the exported text file with their own code page so this idea was interesting for me.
But the problem is that the conversion in this way has a strange behaviour... Here are two different examples:
1st example (я)
char ya = '\u042F';
string strYa = Char.ConvertFromUtf32(ya);
System.Text.Encoding unicode = System.Text.Encoding.Unicode;
System.Text.Encoding ansi1252 = System.Text.Encoding.GetEncoding(1252);
System.Text.Encoding ansi1251 = System.Text.Encoding.GetEncoding(1251);
string strYa1252 = ansi1252.GetString(System.Text.Encoding.Convert(unicode, ansi1252, unicode.GetBytes(strYa)));
string strYa1251 = ansi1251.GetString(System.Text.Encoding.Convert(unicode, ansi1251, unicode.GetBytes(strYa)));
So strYa1252 contains '?', whereas strYa1251 contains the valid char 'я'. So it seems it is impossible to convert to ANSI if the valid code page is not passed to the Convert() function... So nothing in the Unicode Encoding class helps the user get equivalences between ANSI and UNICODE code points? :\
2nd example (Σ)
char sigma = '\u03A3';
string strSigma = Char.ConvertFromUtf32(sigma);
System.Text.Encoding unicode = System.Text.Encoding.Unicode;
System.Text.Encoding ansi1252 = System.Text.Encoding.GetEncoding(1252);
System.Text.Encoding ansi1253 = System.Text.Encoding.GetEncoding(1253);
string strSigma1252 = ansi1252.GetString(System.Text.Encoding.Convert(unicode, ansi1252, unicode.GetBytes(strSigma)));
string strSigma1253 = ansi1253.GetString(System.Text.Encoding.Convert(unicode, ansi1253, unicode.GetBytes(strSigma)));
At this time, I have the correct 'Σ' in the strSigma1253 string, but I also have 'S' for strSigma1252. As indicated at the beginning, I should have 'Ó' if ANSI code has been found, or '?' if the character has not been found, but not 'S'. Why?
Yes of course, a linguist could say that 'S' is equivalent to the greek Sigma character because they sound the same in both alphabets, but they don't have the same ANSI code!
So how can the Convert() function in the .NET framework manage this kind of equivalence?
And does someone have an idea to write back ANSI characters from UNICODE in text files I have to send to customers?
I should have ...'?' if the character has not been found, but not 'S'. Why?
This is known as 'best-fit' encoding, and in most cases it's a bad thing. When Windows can't encode a character to the target code page (because Σ does not exist in code page 1252), it makes best efforts to map the character to something a bit like it. This can mean losing the diacritical marks (ë→e), or mapping to a cognate (Σ→S), a character that's related (≤→=), a character that's unrelated but looks a bit similar (∞→8), or whatever other madcap replacement seemed like a good idea at the time but turns out to be culturally or mathematically offensive in practice.
You can see the tables for cp1252, including that Sigma mapping, here.
Apart from being a silent mangling of dubious usefulness, it also has some quite bad security implications. You should be able to stop it happening by setting EncoderFallback to ReplacementFallback or ExceptionFallback.
does someone have an idea to write back ANSI characters from UNICODE in text files I have to send to customers?
You'll have to keep a table of encodings for each customer. Read their input files using that encoding to decode; write their output files using the same encoding.
(For sanity, set new customers to UTF-8 and document that this is the preferred encoding.)
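To make the fallback suggestion concrete, here is a small sketch that requests code page 1252 with an explicit encoder fallback, so that Σ is never silently turned into 'S'. The variable names are illustrative; the provider registration is only needed on .NET Core / .NET 5+.

```csharp
using System;
using System.Text;

// Needed on .NET Core / .NET 5+ so that code page 1252 is available.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

string sigma = "\u03A3"; // Σ, which does not exist in Windows-1252

// Replace unencodable characters with '?' instead of a best-fit letter.
Encoding replacing = Encoding.GetEncoding(
    1252, new EncoderReplacementFallback("?"), DecoderFallback.ReplacementFallback);
byte[] replaced = replacing.GetBytes(sigma);
Console.WriteLine(replaced[0] == 0x3F); // True: '?'

// Or refuse to write data that cannot be represented at all.
Encoding strict = Encoding.GetEncoding(
    1252, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
bool threw = false;
try
{
    strict.GetBytes(sigma);
}
catch (EncoderFallbackException)
{
    threw = true;
}
Console.WriteLine(threw); // True
```

For an export, the exception variant is probably the safer choice: it surfaces a customer whose stored encoding does not match their data instead of shipping a file full of question marks.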
I'm trying to read data from the registry at "SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\RecentDocs\"
The return value I get is System.Byte[], which I convert to a string as suggested here.
It works (I think). But I only get 1 letter returned and not the whole string.
Perhaps I'm doing something wrong? I'm fairly certain there can't be only one letter in there..
I've tried Encoding.ASCII.GetString(bytes); and Encoding.UTF8.GetString(bytes); and Encoding.Default.GetString(bytes); but it all returns only 1 character/letter.
I've checked out this link as well, but that's for C++; I'm using C# and don't see the method they suggested (RegGetValueA).
Here is my code:
RegistryKey pRegKey = Registry.CurrentUser;
pRegKey = pRegKey.OpenSubKey("SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Explorer\\RecentDocs\\");
Object val = pRegKey.GetValue("0");
byte[] bytes = (byte[])pRegKey.GetValue ("0");
string str = Encoding.ASCII.GetString(bytes);
System.Windows.MessageBox.Show("The value is: " + str);
Thanks in advance for any help :)
The string is encoded using UTF-16, so you should use Encoding.Unicode.
But it doesn't seem it's just UTF-16 encoded strings, there's some more data. For me, (when decoded as UTF-16), it displays as
Stažené soubory□Š6□□□□□Stažené soubory.lnk□T□□뻯□□□□*□□□□□□□□□□□□Stažené soubory.lnk□6□
Stažené soubory means Downloads in Czech, which is the language of my Windows. And the U+25A1 squares in the above text are actually zero chars.
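A sketch of how the leading name could be pulled out of such a value, assuming (as the output above suggests, but not as a documented format) that the blob starts with a NUL-terminated UTF-16 string. FirstUtf16String is a hypothetical helper, and the sample bytes are a stand-in for the real registry data so the example runs anywhere.

```csharp
using System;
using System.Text;

// Hypothetical helper: decode a UTF-16LE buffer and keep only the text
// before the first NUL character.
static string FirstUtf16String(byte[] data)
{
    string decoded = Encoding.Unicode.GetString(data);
    int nul = decoded.IndexOf('\0');
    return nul >= 0 ? decoded.Substring(0, nul) : decoded;
}

// Stand-in for the registry value: a name, a NUL, then trailing data.
byte[] sample = Encoding.Unicode.GetBytes("Downloads\0trailing data");

Console.WriteLine(FirstUtf16String(sample)); // Downloads

// Decoding the same bytes as ASCII puts a NUL after every letter,
// which is why only the first character tends to be displayed.
string asAscii = Encoding.ASCII.GetString(sample);
Console.WriteLine(asAscii[1] == '\0'); // True
```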
Are you sure that the encoding is ASCII?
I would suspect some UTF like Encoding.UTF8 or Encoding.Unicode - try that...