Read mixed encoding string - c#

I read a string with the windows-1256 encoding, but the numbers in that string are encoded in UTF-8. As a result, all of the text is read correctly except the numbers, which display as (?). That is acceptable, but I want to know how I can read the complete text without problems: how can I know when to switch between encodings to read the text correctly?
NOTE: Browsers display this kind of text correctly, so they know when they should switch.
Any solution or code?

The lower half of the windows-1256 code page is the same as ASCII. Digits in UTF-8 are also the same as ASCII - if you read the string with windows-1256 encoding, it should work just fine.
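A minimal sketch of that claim, assuming .NET Core / .NET 5+ (where code-page encodings must be registered first) and a hypothetical helper name. It encodes text as UTF-8 and decodes the same bytes as windows-1256. Note that this only holds for the ASCII digits 0-9; Arabic-Indic digits (U+0660 to U+0669) take two bytes each in UTF-8 and would not survive.

```csharp
using System;
using System.Text;

public static class DigitBytesDemo
{
    static DigitBytesDemo()
    {
        // Assumption: running on .NET Core / .NET 5+, where windows-1256
        // is only available after registering the code pages provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Hypothetical helper: encodes the text as UTF-8, then decodes
    // those same bytes as windows-1256.
    public static string Utf8BytesReadAs1256(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding(1256).GetString(utf8Bytes);
    }
}
```

Calling DigitBytesDemo.Utf8BytesReadAs1256("0123456789") gives back the same ten digits, because each one is a single byte below 128 in both encodings.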

Related

Why Does Byte 150 show up as a dash in Notepad but Not when I read it programmatically?

I've got a file that looks OK in Notepad (and Notepad++) but when I try to read it with a C# program, the dash shows up as a replacement character (�) instead. After some trial and error, I can reproduce the error as follows:
File.WriteAllBytes("C:\\Misc\\CharTest\\wtf.txt", new byte[] { 150 });
var readFile = File.ReadAllText("C:\\Misc\\CharTest\\wtf.txt");
Console.WriteLine(readFile);
Now, if you go and look in the wtf.txt file using Notepad, you'll see a dash... but I don't get it. I know that's not a "real" Unicode value so that's probably the root of the issue, but I don't get why it looks fine in Notepad and not when I read in the file. And how do I get the file to read it as a dash?
As an aside, a VB6 program I'm trying to rewrite in C# also reads it as a dash.
The File.ReadAllText(string) overload defaults to UTF8 encoding, in which a standalone byte with value 150 is invalid.
Specify the actual encoding of the file, for example:
var encoding = Encoding.GetEncoding(1252);
string content = File.ReadAllText(fileName, encoding);
I used the Windows-1252 encoding, which has a dash at codepoint 150.
Edit: Notepad displays the file correctly because, for non-Unicode files, the Windows-1252 code page is the default under Western regional settings. So you can likely also use Encoding.Default to get the correct result, but keep in mind that Encoding.Default can return different code pages under different regional settings (and on .NET Core and later it always returns UTF-8).
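To see the difference side by side, here is a small sketch that decodes the same single byte both ways. The class and method names are made up for illustration, and the provider registration assumes .NET Core / .NET 5+ (on .NET Framework, code page 1252 is available without it).

```csharp
using System;
using System.Text;

public static class Byte150Demo
{
    static Byte150Demo()
    {
        // Assumption: .NET Core / .NET 5+, where Windows-1252 must be registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // A lone byte 150 (0x96) is not valid UTF-8, so the default decoder
    // substitutes the replacement character U+FFFD.
    public static string DecodeAsUtf8(byte[] bytes) =>
        Encoding.UTF8.GetString(bytes);

    // In Windows-1252, byte 150 maps to the en dash U+2013.
    public static string DecodeAsWindows1252(byte[] bytes) =>
        Encoding.GetEncoding(1252).GetString(bytes);
}
```

With new byte[] { 150 } as input, the first method returns "\uFFFD" and the second returns "\u2013".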
You are writing bytes to a text file, and then you are reading those bytes back and interpreting them as chars.
When you write bytes, you don't have to care about encoding, but you do have to care about it in order to read those very same bytes as chars.
Notepad++ seems to interpret the byte as a character in the ANSI code page and therefore prints the dash.
File.ReadAllText decodes the bytes using an encoding. You did not specify one, so it falls back to its default, UTF-8, in which a lone byte 150 is not a valid sequence.

Decode UTF-8 bytes as Latin-1 characters

I have a string that I receive from a third party app and I would like to display it correctly in any language using C# on my Windows Surface.
Due to incorrect encoding, a piece of my string looks like this in Farsi (Persian-Arabic):
Ù…Ø¯Ù„-Ø±Ù†Ú¯-Ù…ÙˆÛŒ-Ø¬Ø¯ÛŒØ¯-5-436x500
whereas it should look like this:
مدل-رنگ-موی-جدید-5-436x500
This link converts it correctly:
http://www.ltg.ed.ac.uk/~richard/utf-8.html
How can I do it in C#?
It is very hard to tell exactly what is going on from the description of your question. We would all be much better off if you provided us with an example of what is happening using a single character instead of a whole string, and if you chose an example character which does not belong to some exotic character set, for example the bullet character (u2022) or something like that.
Anyhow, what is probably happening is this:
The letter "ر" is represented in UTF-8 as the byte sequence D8 B1, but what you see is "Ø±", and that's because Ø is U+00D8 and ± is U+00B1. So, the incoming text was originally in UTF-8, but in the process of importing it into a .NET Unicode string in your application it was incorrectly interpreted as being in some 8-bit character set such as ANSI or Latin-1. That's why you now have a Unicode string which appears to contain garbage.
However, the process of converting 8-bit characters to Unicode is for the most part not destructive, so all of the information is still there, that's why the UTF-8 tool that you linked to can still kind of make sense out of it.
What you need to do is convert the string back to an array of ANSI (or Latin-1, whatever) bytes, and then re-construct the string the right way, which is a conversion of UTF-8 to Unicode.
I cannot easily reproduce your situation, so here are some things to try:
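// Windows-1252 is the usual "ANSI" code page; use whichever 8-bit encoding the text was wrongly decoded with
byte[] bytes = System.Text.Encoding.GetEncoding(1252).GetBytes( garbledUnicodeString );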
followed by
string properUnicodeString = System.Text.Encoding.UTF8.GetString( bytes );
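The claim that the wrong decoding loses no information can be checked with a round trip. The following is only a sketch under two assumptions: the wrong encoding was Windows-1252, and the code runs on .NET Core / .NET 5+ (hence the provider registration). The names Garble and Repair are hypothetical.

```csharp
using System;
using System.Text;

public static class MojibakeRoundTrip
{
    static MojibakeRoundTrip()
    {
        // Assumption: .NET Core / .NET 5+, where Windows-1252 must be registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Reproduces the damage: UTF-8 bytes wrongly decoded as Windows-1252.
    public static string Garble(string correct)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(correct);
        return Encoding.GetEncoding(1252).GetString(utf8Bytes);
    }

    // Reverses it: recover the original bytes, then decode them as UTF-8.
    public static string Repair(string garbled)
    {
        byte[] originalBytes = Encoding.GetEncoding(1252).GetBytes(garbled);
        return Encoding.UTF8.GetString(originalBytes);
    }
}
```

For the single letter discussed above, Garble("\u0631") yields the two characters U+00D8 and U+00B1, and Repair turns them back into U+0631.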

How do I use C#'s IndexOf when strange characters are in the string

Below is what the text looks like when viewed in NotePad++.
I need to get the IndexOf for that piece of the string, for use in the code below, and I can't figure out how to put the odd characters in my code.
int start = text.IndexOf("AppxxxxxxDB INFO");
Where the "xxxxx"'s represent the strange characters.
All of these characters have corresponding ASCII codes; you can insert them into a string literal by escaping them.
For instance:
"App\x0000\x0001\x0000\x0003\x0000\x0000\x0000DB INFO"
or shorter:
"App\x00\x01\x00\x03\x00\x00\x00"+"DB INFO"
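\xXXXX specifies one character, where XXXX is the hexadecimal number of that character. Because \x accepts one to four hex digits, the shorter form splits the literal before "DB INFO"; otherwise \x00DB would be read as a single character.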
Notepad++ simply wants to make it a bit more convenient by rendering these characters by printing the abbreviation in a "bubble". But that's just rendering.
The origin of these characters is printer (and other media) directives. For instance you needed to instruct a printer to move to the next line, stop the printing job, nowadays they are still used. Some terminals use them to communicate color changes, etc. The most well known is \n or \x000A which means you start a new line. For text they are thus characters that specify how to handle text. A bit equivalent to modern html, etc. (although it's only a limited equivalence). \n is thus only a new line because there is a consensus about that. If one defines his/her own encoding, he can invent a new system.
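As a variation on the literals above, here is a sketch that uses \u escapes, which always take exactly four hex digits and so need no concatenation trick. It also passes StringComparison.Ordinal, because culture-sensitive searches on newer .NET versions can treat some control characters (such as the null character) as ignorable. The class name and the exact control-character sequence are taken from the example above and may not match the real file.

```csharp
using System;

public static class ControlCharSearch
{
    // "App", then the seven control characters from the example, then "DB INFO".
    // \u escapes are fixed-width, so \u0000 followed by 'D' is unambiguous.
    public const string Marker =
        "App\u0000\u0001\u0000\u0003\u0000\u0000\u0000DB INFO";

    // Ordinal comparison matches char by char, control characters included.
    public static int FindMarker(string text) =>
        text.IndexOf(Marker, StringComparison.Ordinal);
}
```

For instance, ControlCharSearch.FindMarker("junk" + ControlCharSearch.Marker) returns 4, while a string containing only "AppDB INFO" gives -1.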
Echoing @JonSkeet's warning: when you read a file into a string, the file's bytes are decoded according to a character set encoding. The decoder has to do something with byte values or sequences that are invalid per the encoding rules. Typical decoders substitute a replacement character and attempt to go on.
I call that data corruption. In most cases, I'd rather have the decoder throw an exception.
You can use a standard decoder, customize one or create a new one with the Encoding class to get the behavior you want. Or, you can preserve the original bytes by reading the file as bytes instead of as text.
If you insist on reading the file as text, I suggest using the 437 encoding because it has 256 characters, one for every byte value, no restrictions on byte sequences and each 437 character is also in Unicode. The bytes that represent text will possibly decode the same characters that you want to search for as strings, but you have to check, comparing 437 and Unicode in this table.
Really, you should have and follow the specification for the file type you are reading. After all, there is no text but encoded text, and you have to know which encoding it is.
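If you go the route of keeping the original bytes, the search has to work on bytes too. Below is a naive sketch of such a search; the class and method names are invented for illustration, and it is not tuned for large files.

```csharp
using System;

public static class ByteSearch
{
    // Returns the index of the first occurrence of needle in haystack,
    // or -1 if it does not occur. Simple O(n*m) scan.
    public static int IndexOf(byte[] haystack, byte[] needle)
    {
        for (int i = 0; i <= haystack.Length - needle.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j])
            {
                j++;
            }
            if (j == needle.Length)
            {
                return i; // every byte of needle matched at offset i
            }
        }
        return -1;
    }
}
```

You would feed it the result of File.ReadAllBytes and a byte array holding the marker, for example the ASCII bytes of "App", the control bytes, and the ASCII bytes of "DB INFO".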

Accented characters are not showing properly after copying from a text box

I am using the code below to copy text from a control. Please note that the text could be in Spanish or English. Later I show it inside a rich text box.
Clipboard.Clear();
MyDocBodyControl.Range.Copy();
html = Convert.ToString(Clipboard.GetData(DataFormats.Html));
But when I display it in the rich text box, the accented characters do not show properly. If I use any other format, like Text, I get proper accented characters. But I have to use the HTML format because I have some styles to add to the copied text.
Is there any way to show the accented characters properly with the HTML data format?
Set a correct encoding? UTF-8/Unicode/... ?
Also have a look on these topics: How to convert a Unicode character to its ASCII equivalent
The DataFormats.Html specification states that the data is encoded in UTF-8. But there's a bug in .NET Framework 4 and lower: it actually reads the UTF-8 data as Windows-1252.
You get a lot of wrongly decoded text, which leads to funny/bad characters such as
'Å','‹','Å’','Ž','Å¡','Å“','ž','Ÿ','Â','¡','¢','£','¤','Â¥','¦','§','¨','©'
For example, '€' is wrongly decoded as 'â‚¬' in Windows-1252.
Full explanation here at this dedicated website
Debugging Chart Mapping Windows-1252 Characters to UTF-8 Bytes to Latin-1 Characters
But by using the conversion tables you will not lose any UTF-8 characters. You can recover the original pristine UTF-8 characters from DataFormats.Html. (Note: Ppm solutions default to ASCII on a fail and you lose encoding information!)
Also, Chrome adds Apple-converted-* characters that appear as, for example, 'Â ' in a clip, although they are claimed to be removed.
Solution: create a translation dictionary and search and replace.

How to get correctly-encoded HTML from the clipboard?

Has anyone noticed that if you retrieve HTML from the clipboard, it gets the encoding wrong and injects weird characters?
For example, executing a command like this:
string s = (string) Clipboard.GetData(DataFormats.Html)
Results in stuff like:
<FONT size=-2>  <A href="/advanced_search?hl=en">Advanced
Search</A><BR>  Preferences<BR>  <A
href="/language_tools?hl=en">Language
Tools</A></FONT>
Not sure how MarkDown will process this, but there are weird characters in the resulting markup above.
It appears that the bug is with the .NET framework. What do you think is the best way to get correctly-encoded HTML from the clipboard?
In this case it is not as visible as it was in my case. Today I tried to copy data from the clipboard, and it contained a few Unicode characters. The data I got looked as if I had read a UTF-8 encoded file in the Windows-1250 encoding (the local encoding on my Windows).
It seems your case is the same. Save the HTML data in Windows-1252 (or Windows-1250; both work), remembering to put a non-breaking space (0xA0), not a standard space, after the Â character. Then open this file as a UTF-8 file and you will see what should have been there.
For my other project I made a function that fixes data with corrupted encoding.
In this case simple conversion should be sufficient:
byte[] data = Encoding.Default.GetBytes(text);
text = Encoding.UTF8.GetString(data);
My original function is a little bit more complex and contains tests to ensure that data are not corrupted...
public static bool FixMisencodedUTF8(ref string text, Encoding encoding)
{
    if (string.IsNullOrEmpty(text))
        return false;
    byte[] data = encoding.GetBytes(text);
    // there should not be any character outside source encoding
    string newStr = encoding.GetString(data);
    if (!string.Equals(text, newStr)) // if there is any character "outside"
        return false; // leave, the input is in a different encoding
    if (IsValidUtf8(data) == 0) // test data to be valid UTF-8 byte sequence
        return false; // if not, can not convert to UTF-8
    text = Encoding.UTF8.GetString(data);
    return true;
}
I know that this is not the best (or correct) solution, but I did not find any other way to fix the input...
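The IsValidUtf8 helper called in the function is not shown, and from the comparison with 0 it appears to return a number. As a hypothetical stand-in (not the original helper), a check could lean on a UTF8Encoding constructed to throw on invalid bytes, along these lines:

```csharp
using System;
using System.Text;

public static class Utf8Check
{
    // Strict decoder: no BOM on output, throws on invalid byte sequences
    // instead of substituting U+FFFD.
    private static readonly UTF8Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Hypothetical replacement for the missing IsValidUtf8 helper;
    // returns a bool rather than the number the original seems to return.
    public static bool IsStrictlyValidUtf8(byte[] data)
    {
        try
        {
            StrictUtf8.GetCharCount(data); // decodes without allocating a string
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

With this variant, the test in the function would read if (!Utf8Check.IsStrictlyValidUtf8(data)) return false;.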
EDIT: (July 20, 2017)
It seems that Microsoft has already found this error and it now works correctly. I'm not sure whether the problem was specific to some framework versions, but I know for sure that the application now uses a different framework than it did when I wrote the answer. (Now it is 4.5; the previous version was 2.0.)
(Now all my code fails when parsing the data. There is a further problem: determining the correct behaviour for an application running with the fix already applied and without it.)
You have to interpret the data as UTF-8. See MS Office hyperlinks change code page?.
The DataFormats.Html specification states that the data is encoded in UTF-8. But there's a bug in .NET Framework 4 and lower: it actually reads the UTF-8 data as Windows-1252.
You get a lot of wrongly decoded text, leading to funny/bad characters such as
'Å','‹','Å’','Ž','Å¡','Å“','ž','Ÿ','Â','¡','¢','£','¤','Â¥','¦','§','¨','©'
Full explanation here
Debugging Chart Mapping Windows-1252 Characters to UTF-8 Bytes to Latin-1 Characters
Solution: create a translation dictionary and search and replace.
I don't know what your original source document is, but be aware that Word and Outlook provide several versions of the clipboard in different encodings. One is usually Windows-1252 and another is UTF-8. Possibly you're grabbing the UTF-8 encoded version by default, when you're expecting the Windows-1252 (Latin-1 + Smart Quotes)? Non-ASCII characters would show up as multiple odd Latin-1 accented characters. Most "Smart Quotes" are not in the Latin-1 set and are often three bytes in UTF-8.
Can you specify which encoding you want the clipboard contents in?
Try this:
System.Windows.Forms.Clipboard.GetText(System.Windows.Forms.TextDataFormat.Html);
