Convert Hebrew chars to English - c#

To keep this question from being marked as a duplicate: here is the "answer" to a similar question, which unfortunately doesn't work. I type Hebrew text into a TextBox and need to convert each typed letter into a standard char code like those in the ASCII table (decimal) for the English language. Converting the Hebrew chars directly returns entirely different codes (they seem to be Unicode), so I need to convert the Hebrew input into English. I tried different encodings for the conversion: Unicode, UTF8, UTF16, "Windows-1255". I always get "?".
So, for example, a possible solution from the question mentioned above is this:
public static string convertUTF8ASCII(string initialString)
{
byte[] unicodeBytes = Encoding.Unicode.GetBytes(initialString);
byte[] asciiBytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, unicodeBytes);
return Encoding.GetEncoding("Windows-1255").GetString(asciiBytes);
}
And it doesn't work.
I have also tried something like this:
public static int GetASCIICodeFromUnicode(string letter){
Encoding ascii = Encoding.GetEncoding("Windows-1252");
Encoding unicode = Encoding.GetEncoding("Windows-1255");
byte[] unicodeBytes = unicode.GetBytes(letter);
byte[] asciiBytes = Encoding.Convert(unicode, ascii, unicodeBytes);
char[] asciiChars = new char[ascii.GetCharCount(asciiBytes, 0, asciiBytes.Length)];
ascii.GetChars(asciiBytes, 0, asciiBytes.Length, asciiChars, 0);
string asciiString = new string(asciiChars);
return (int)Convert.ToChar(asciiString);
}
Doesn't work either.

By the time you get strings in .NET, it is too late to find out what the keyboard sent.
To find out which key position was pressed, you will need a mapping from the Hebrew characters to the English key positions; a Dictionary<char,char> is a good candidate data structure for this.
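As a rough illustration of that mapping idea, here is a minimal sketch. It assumes the standard Israeli keyboard layout against US QWERTY; the class and method names (HebrewKeyboard, ToEnglishKeys) are hypothetical, and characters without an entry are passed through unchanged.

```csharp
using System.Collections.Generic;
using System.Text;

public static class HebrewKeyboard
{
    // Hebrew character -> character on the same physical key of a US QWERTY layout.
    // Assumption: standard Israeli layout; adjust the table for other layouts.
    private static readonly Dictionary<char, char> KeyMap = new Dictionary<char, char>
    {
        ['ק'] = 'e', ['ר'] = 'r', ['א'] = 't', ['ט'] = 'y', ['ו'] = 'u',
        ['ן'] = 'i', ['ם'] = 'o', ['פ'] = 'p', ['ש'] = 'a', ['ד'] = 's',
        ['ג'] = 'd', ['כ'] = 'f', ['ע'] = 'g', ['י'] = 'h', ['ח'] = 'j',
        ['ל'] = 'k', ['ך'] = 'l', ['ז'] = 'z', ['ס'] = 'x', ['ב'] = 'c',
        ['ה'] = 'v', ['נ'] = 'b', ['מ'] = 'n', ['צ'] = 'm',
        ['ת'] = ',', ['ץ'] = '.', ['ף'] = ';'
    };

    // Replaces every mapped Hebrew character; anything else is kept as-is.
    public static string ToEnglishKeys(string hebrewText)
    {
        var sb = new StringBuilder(hebrewText.Length);
        foreach (char c in hebrewText)
        {
            sb.Append(KeyMap.TryGetValue(c, out char mapped) ? mapped : c);
        }
        return sb.ToString();
    }
}
```

Once a character is mapped, casting it to int gives the ASCII code the question asks for; for example, the key that types ש maps to 'a', whose code is 97.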

How to convert unicode to utf-8 encoding in c#

I want to convert a Unicode string to a UTF-8 string, so that I can pass it to an SMS API to send Unicode SMS messages.
I want a conversion like the one this tool performs:
https://cafewebmaster.com/online_tools/utf8_encode
eg. I have unicode string "हैलो फ़्रेंड्स" and it should be converted into "हà¥à¤²à¥ à¥à¥à¤°à¥à¤à¤¡à¥à¤¸"
I have tried this, but I am not getting the expected output:
private string UnicodeToUTF8(string strFrom)
{
byte[] bytes = Encoding.Default.GetBytes(strFrom);
return Encoding.UTF8.GetString(bytes);
}
and I call the function like this:
string myUTF8String = UnicodeToUTF8("हैलो फ़्रेंड्स");
I don't think this is possible to answer concretely without knowing more about the SMS API you want to use. The string type in C# is UTF-16. If you want a different encoding, it's given to you as a byte[] (because a string is UTF-16, always).
You could 'cast' that into a string by doing something like this:
static string UnicodeToUTF8(string from) {
var bytes = Encoding.UTF8.GetBytes(from);
return new string(bytes.Select(b => (char)b).ToArray());
}
As far as I can tell this yields the same output as the website you linked. However, without knowing what API you're handing this string off to, I can't guarantee that this will ultimately work.
The point of string is that we don't need to worry about its underlying encoding, but this casting operation is kind of a giant hack and makes no guarantees that string represents a well-formed string anymore.
If something expects a UTF-8 encoding, it should accept a byte[], not a string.
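If the SMS API does accept raw bytes, a minimal sketch of that route could look like the following. The class and method names (Utf8Payload, ToUtf8Bytes, ToHex) are hypothetical, and nothing here is specific to any particular SMS API.

```csharp
using System;
using System.Text;

public static class Utf8Payload
{
    // Encode a .NET (UTF-16) string as UTF-8 bytes for a byte-oriented API.
    public static byte[] ToUtf8Bytes(string text)
    {
        return Encoding.UTF8.GetBytes(text);
    }

    // Hex dump of the UTF-8 bytes, handy for comparing against an online encoder.
    public static string ToHex(string text)
    {
        return BitConverter.ToString(ToUtf8Bytes(text));
    }
}
```

For example, Utf8Payload.ToHex("ह") gives "E0-A4-B9": the single Devanagari letter takes three bytes in UTF-8.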
Try this:
string output = "hello world";
byte[] bytes1 = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, Encoding.Unicode.GetBytes(output));
byte[] bytes2 = Encoding.Convert(Encoding.Unicode, Encoding.Unicode, Encoding.Unicode.GetBytes(output));
var output1 = Encoding.UTF8.GetString(bytes1);
var output2 = Encoding.Unicode.GetString(bytes2);
You will see that bytes1 is 11 bytes (1 byte per char in UTF-8, because this text is plain ASCII) and bytes2 is 22 bytes (2 bytes per char in UTF-16, which .NET calls Unicode).

Converting unknown characters to Greek characters

I have a file which contains the following characters:
ÇËÅÊÔÑÏÖÏÑÇÓÇ ÁÉÌÏÓÖÁÉÑÉÍÇÓ
I am trying to convert that to Greek words and the result should be:
ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΑΙΜΟΣΦΑΙΡΙΝΗΣ
The file that the above value is stored in is in Unicode format.
I have tried all possible encodings, but had no luck with the conversion.
private void Convert()
{
string textFilePhysicalPath = @"C:\Users\Nec\Desktop\a.txt";
string contents = File.ReadAllText(textFilePhysicalPath);
List<string> sLines = new List<string>();
// For every encoding, get the property values.
foreach (var ei in Encoding.GetEncodings())
{
Encoding e = ei.GetEncoding();
Encoding iso = Encoding.GetEncoding(ei.Name);
Encoding utfx = Encoding.Unicode;
byte[] utfBytes = utfx.GetBytes(contents);
byte[] isoBytes = Encoding.Convert(utfx, iso, utfBytes);
string msg = iso.GetString(isoBytes);
string xx = (ei.Name + " " + msg);
sLines.Add(xx);
}
using (StreamWriter file = new StreamWriter(@"C:\Users\Nec\Desktop\result.txt"))
{
foreach (var line in sLines)
file.WriteLine(line);
}
}
A website that converts it correctly is http://www.online-decoder.com/el
but even when I use the ISO-8859-1 to ISO-8859-7 it still doesn't work in .NET.
This code converts the C# string, which is UTF-16, to an 8-bit representation using the common ISO-8859-1 codepage. Then it converts the bytes back to UTF-16 using the Greek codepage windows-1253. The result is ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΑΙΜΟΣΦΑΙΡΙΝΗΣ, as you want.
string erroneousString = "ÇËÅÊÔÑÏÖÏÑÇÓÇ ÁÉÌÏÓÖÁÉÑÉÍÇÓ";
byte[] asIso88591Bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(erroneousString);
string asGreekString = Encoding.GetEncoding("windows-1253").GetString(asIso88591Bytes);
Console.OutputEncoding = System.Text.Encoding.UTF8;
Console.WriteLine(asGreekString);
Edit: Since your file is encoded in an 8-bit format, you need to specify the codepage when reading it. Use this:
string fileContents = File.ReadAllText("189.dat", Encoding.GetEncoding("windows-1253"));
Console.OutputEncoding = System.Text.Encoding.UTF8;
Console.WriteLine(fileContents);
That reads the content as
'CS','C.S.F. EXAMINATION','ΕΞΕΤΑΣΗ Ε.Ν.Υ.'
'EH','Hb ELECTROPHORESIS','ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΑΙΜΟΣΦΑΙΡΙΝΗΣ'
'EP','PROTEIN ELECTROPHORESIS','ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΠΡΩΤΕΙΝΩΝ'
'FB','HAEMATOLOGY - FBC','ΓΕΝΙΚΗ ΕΞΕΤΑΣΗ ΑΙΜΑΤΟΣ - FBC'
'FR','FREE TEXT',
'GT','GLUCOSE TOLERANCE TEST','ΔΟΚΙΜΑΣΙΑ ΑΝΟΧΗΣ ΓΛΥΚΟΖΗΣ'
'MI','MICROBIOLOGY','ΜΙΚΡΟΒΙΟΛΟΓΙΑ'
'NO','NORMAL FORM','ΚΑΝΟΝΙΚΟ ΔΕΛΤΙΟ'
'RE','RENAL CALCULUS','ΧΗΜΙΚΗ ΑΝΑΛΥΣΗ ΟΥΡΟΛΙΘΟΥ'
'SE','SEMEN ANALYSIS','ΣΠΕΡΜΟΔΙΑΓΡΑΜΜΑ'
'SP','SPECIAL PATHOLOGY','SPECIAL PATHOLOGY'
'ST','STOOL EXAMINATION ','ΕΞΕΤΑΣΗ ΚΟΠΡΑΝΩΝ'
'SW','SEMEN WASH','SEMEN WASH'
'TH','THROMBOPHILIA PANEL','THROMBOPHILIA PANEL'
'UR','URINE ANALYSIS','ΓΕΝΙΚΗ ΕΞΕΤΑΣΗ ΟΥΡΩΝ'
'WA','WATER CULTURE REPORT','ΑΝΑΛΥΣΗ ΝΕΡΟΥ'
'WI','WIDAL ','ΑΝΟΣΟΒΙΟΛΟΓΙΑ'
This is an 8-bit text file stored using the Greek (1253) codepage, which was read using a different codepage.
File.ReadAllText tries to detect whether the file is UTF16 or UTF8 by checking the BOM bytes and falls back to UTF8 by default. For single-byte characters, UTF8 is identical to 7-bit ASCII, which means that trying to read a file like this one, which is neither Unicode nor pure ASCII, will result in garbled text.
To load a file using a specific encoding/codepage, just pass the encoding as the Encoding parameter, e.g.:
var enc = Encoding.GetEncoding(1253);
var text = File.ReadAllText(@"189.dat", enc);
Strings in .NET are Unicode, specifically UTF16. This means that the text variable doesn't need any conversion. Its contents will be:
'CS','C.S.F. EXAMINATION','ΕΞΕΤΑΣΗ Ε.Ν.Υ.'
'EH','Hb ELECTROPHORESIS','ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΑΙΜΟΣΦΑΙΡΙΝΗΣ'
'EP','PROTEIN ELECTROPHORESIS','ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΠΡΩΤΕΙΝΩΝ'
'FB','HAEMATOLOGY - FBC','ΓΕΝΙΚΗ ΕΞΕΤΑΣΗ ΑΙΜΑΤΟΣ - FBC'
'FR','FREE TEXT',
'GT','GLUCOSE TOLERANCE TEST','ΔΟΚΙΜΑΣΙΑ ΑΝΟΧΗΣ ΓΛΥΚΟΖΗΣ'
'MI','MICROBIOLOGY','ΜΙΚΡΟΒΙΟΛΟΓΙΑ'
'NO','NORMAL FORM','ΚΑΝΟΝΙΚΟ ΔΕΛΤΙΟ'
'RE','RENAL CALCULUS','ΧΗΜΙΚΗ ΑΝΑΛΥΣΗ ΟΥΡΟΛΙΘΟΥ'
'SE','SEMEN ANALYSIS','ΣΠΕΡΜΟΔΙΑΓΡΑΜΜΑ'
'SP','SPECIAL PATHOLOGY','SPECIAL PATHOLOGY'
'ST','STOOL EXAMINATION ','ΕΞΕΤΑΣΗ ΚΟΠΡΑΝΩΝ'
'SW','SEMEN WASH','SEMEN WASH'
'TH','THROMBOPHILIA PANEL','THROMBOPHILIA PANEL'
'UR','URINE ANALYSIS','ΓΕΝΙΚΗ ΕΞΕΤΑΣΗ ΟΥΡΩΝ'
'WA','WATER CULTURE REPORT','ΑΝΑΛΥΣΗ ΝΕΡΟΥ'
'WI','WIDAL ','ΑΝΟΣΟΒΙΟΛΟΓΙΑ'
The file is not UTF16: UTF16 uses at least two bytes for every character, so if a UTF16 file were opened in a hex viewer, every other byte of the Latin text would be NUL (0x00). It's not UTF8 either: outside the 7-bit ASCII range, each character uses two or more bytes that always have the high bit set, so instead of one garbled character there would be at least two.
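The BOM check described above can be sketched roughly as follows. This is an illustration only, not how File.ReadAllText is actually implemented; the names BomSniffer and DetectBom are hypothetical, and the sketch ignores UTF-32 (whose little-endian BOM also starts with FF FE).

```csharp
using System.Text;

public static class BomSniffer
{
    // Returns the encoding indicated by a byte order mark at the start of the data,
    // or the supplied fallback (for example codepage 1253) when no BOM is present.
    public static Encoding DetectBom(byte[] data, Encoding fallback)
    {
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
            return Encoding.UTF8;             // UTF-8 BOM
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
            return Encoding.Unicode;          // UTF-16 little endian
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
            return Encoding.BigEndianUnicode; // UTF-16 big endian
        return fallback;
    }
}
```

A file like the one in the question has no BOM, so the caller-supplied fallback decides how its bytes are interpreted, which is why the codepage has to be passed explicitly.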
File and stream methods that could be affected by encoding or culture in .NET always have an overload that accepts an Encoding or CultureInfo parameter.
Console
Writing the output to the console may display garbled text. The text isn't really converted, just displayed the wrong way.
While the console can display Unicode text, it assumes by default that the system's codepage is used. In the past it couldn't even support UTF8 as a codepage; there was no such option in the settings. After all, the label for the system locale setting is "Language used for non-Unicode programs".
The latest Windows 10 Insider releases offer UTF8 as the system codepage as a beta option.
To ensure Unicode text appears properly in the console, set its encoding to UTF8, e.g.:
var text = File.ReadAllText(@"189.dat", enc);
Console.OutputEncoding = Encoding.UTF8;
Console.WriteLine(text);
I don't know what codepage this is, but the UTF-8 bytes seem to be simply offset by fixed values. You can convert the source string to the target string by adding 11 to the first byte and 16 to the second byte of each two-byte character:
// Use UTF-8 explicitly; Encoding.Default is only UTF-8 on some platforms
var input = Encoding.UTF8.GetBytes("ÇËÅÊÔÑÏÖÏÑÇÓÇ ÁÉÌÏÓÖÁÉÑÉÍÇÓ");
for (var i = 0; i < input.Length; i++)
{
if (input[i] == 32) continue;
input[i++] += 11;
input[i] += 16;
}
var output = Encoding.UTF8.GetString(input);
Result: ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΑΙΜΟΣΦΑΙΡΙΝΗΣ
Not sure if this is a solution, but it may give you a hint.
Just use this in C#:
HtmlWeb web = new HtmlWeb();
web.OverrideEncoding = Encoding.GetEncoding(65001);

Encoding not converting

An ASP.NET page (ashx) receives a GET request with a UTF8 string. It reads a SqlServer database with Windows-1255 data.
I can't seem to get them to work together. I've used information gathered on SO (mainly Convert a string's character encoding from windows-1252 to utf-8) as well as msdn on the subject.
When I run anything through the functions below - it always ends up the same as it started - not converted at all.
Is something done wrong?
EDIT
What I'm specifically trying to do (getData returns a Dictionary<int, string>):
getData().Where(a => a.Value.Contains(context.Request.QueryString["q"]))
Result is empty, unless I send a "neutral" character such as "'" or ",".
CODE
string windows1255FromUTF8(string p)
{
Encoding win = Encoding.GetEncoding(1255);
Encoding utf8 = Encoding.UTF8;
byte[] utfBytes = utf8.GetBytes(p);
byte[] winBytes = Encoding.Convert(utf8, win, utfBytes);
return win.GetString(winBytes);
}
string UTF8FromWindows1255(string p)
{
Encoding win = Encoding.GetEncoding(1255);
Encoding utf8 = Encoding.UTF8;
byte[] winBytes = win.GetBytes(p);
byte[] utfBytes = Encoding.Convert(win, utf8, winBytes);
return utf8.GetString(utfBytes);
}
There is nothing wrong with the functions; they are simply useless.
What the functions do is encode the string into bytes, convert the data from one encoding to another, then decode the bytes back into a string. Unless the string contains a character that cannot be encoded using the windows-1255 encoding, the returned value should be identical to the input.
Strings in .NET don't have an encoding. If you get a string from a source where the text was encoded using, for example, UTF-8, once it's decoded into a string it doesn't have that encoding any more. You don't have to do anything to a string to use it when the destination has a specific encoding; whatever library you are using that takes the string will take care of the encoding.
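To make the round trip concrete, here is a small sketch with the same shape as the question's helpers. It uses UTF-8, UTF-16 and ASCII only because those encodings are available on every .NET runtime; the names RoundTrip and Through are hypothetical.

```csharp
using System.Text;

public static class RoundTrip
{
    // Encode with 'source', convert the bytes to 'target', then decode with 'target'.
    // The result equals the input unless 'target' cannot represent some character.
    public static string Through(string text, Encoding source, Encoding target)
    {
        byte[] sourceBytes = source.GetBytes(text);
        byte[] targetBytes = Encoding.Convert(source, target, sourceBytes);
        return target.GetString(targetBytes);
    }
}
```

RoundTrip.Through("שלום", Encoding.UTF8, Encoding.Unicode) returns the original string unchanged, while passing Encoding.ASCII as the target returns "????", because ASCII cannot represent Hebrew letters and the encoder substitutes a question mark for each one.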
For some reason this worked:
byte[] fromBytes = Encoding.UTF8.GetBytes(myString);
string finalString = (Encoding.GetEncoding(1255)).GetString(fromBytes);
Switching encoding without the conversion...

C# Encoding.Convert Vs C++ MultiByteToWideChar

I have a C++ code snippet that uses MultiByteToWideChar to convert a UTF-8 string to UTF-16.
For C++, if the input is "HÃ´tel", the output is "Hôtel", which is correct.
For C#, if the input is "HÃ´tel", the output is "HÃ´tel", which is not correct.
The C# code to convert from UTF8 to UTF16 looks like this:
Encoding.Unicode.GetString(
Encoding.Convert(
Encoding.UTF8,
Encoding.Unicode,
Encoding.UTF8.GetBytes(utf8)));
In C++ the conversion code looks like
MultiByteToWideChar(
CP_UTF8, // convert from UTF-8
0, // default flags
utf8.data(), // source UTF-8 string
utf8.length(), // length (in chars) of source UTF-8 string
&utf16[0], // destination buffer
utf16.length() // size of destination buffer, in wchar_t's
)
I want to have the same results in C# that I am getting in C++. Is there anything wrong with the C# code ?
It appears you want to treat string characters as Windows-1252 (often mislabeled as ANSI) code points, and have those code points decoded as UTF-8 bytes, where Windows-1252 code point == UTF-8 byte value.
The reason the accepted answer doesn't work is that it treats the string characters as Unicode code points rather than Windows-1252. It can get away with most characters because Windows-1252 maps them exactly the same as Unicode, but input with characters like –, €, ™, ‘, ’, ”, • etc. will fail, because Windows-1252 maps those differently than Unicode in this sense.
So what you want is simply this:
public static string doWeirdMapping(string arg)
{
Encoding w1252 = Encoding.GetEncoding(1252);
return Encoding.UTF8.GetString(w1252.GetBytes(arg));
}
Then:
Console.WriteLine(doWeirdMapping("HÃ´tel")); // prints Hôtel
Console.WriteLine(doWeirdMapping("HVOLSVÃ–LLUR")); // prints HVOLSVÖLLUR
Maybe this one:
private static string Utf8ToUnicode(string input)
{
return Encoding.UTF8.GetString(input.Select(item => (byte)item).ToArray());
}
Try this:
string str = "abc!";
Encoding unicode = Encoding.Unicode;
Encoding utf8 = Encoding.UTF8;
byte[] unicodeBytes = unicode.GetBytes(str);
byte[] utf8Bytes = Encoding.Convert( unicode,
utf8,
unicodeBytes );
Console.WriteLine( "UTF Bytes:" );
StringBuilder sb = new StringBuilder();
foreach( byte b in utf8Bytes ) {
sb.Append( b ).Append(" : ");
}
Console.WriteLine( sb.ToString() );
Use System.Text.Encoding.UTF8.GetString().
Pass in your UTF-8 encoded text, as a byte array. The function returns a standard .net string which is encoded in UTF-16.
Sample function will be as below:
private string ReadData(Stream binary_file) {
System.Text.Encoding encoding = System.Text.Encoding.UTF8;
// Read up to 30 bytes from the binary file
byte[] buffer = new byte[30];
int bytesRead = binary_file.Read(buffer, 0, buffer.Length);
// Decode only the bytes that were actually read, as UTF-8
return encoding.GetString(buffer, 0, bytesRead);
}
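A fixed-size buffer like the one above can cut a multi-byte UTF-8 sequence in half. One alternative, sketched here under the assumption that the whole stream is UTF-8 text, is to let StreamReader do the decoding; the names Utf8Reader and ReadAllUtf8 are hypothetical.

```csharp
using System.IO;
using System.Text;

public static class Utf8Reader
{
    // Decodes the entire stream as UTF-8 and returns it as a regular .NET string.
    // Note: disposing the reader also closes the underlying stream.
    public static string ReadAllUtf8(Stream stream)
    {
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

The returned string needs no further conversion to UTF-16, since that is how .NET strings are stored.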

C# Encoding.Converting Latin to Hebrew

I'm trying to fetch and parse an online Excel document which is written in Hebrew but, unfortunately, stored in a non-Hebrew encoding.
As an example, I'm trying to convert the following string, which serves as the first sheet name, to Hebrew using C# code: "âìéåï_1". I'm unable to do so.
I know the above is convertible, since when I open it in NotePad++ and select Encoding/Character Sets/Hebrew/Windows 1255, I see "גליון_1", which is the correct Hebrew representation of the above string.
I'm using the code below:
string str = "âìéåï_1";
Encoding windows = Encoding.GetEncoding("Windows-1255");
Encoding ascii = Encoding.GetEncoding("Windows-1252");
byte[] asciiBytes = ascii.GetBytes(str);
byte[] windowsBytes = Encoding.Convert(ascii, windows, asciiBytes);
char[] windowsChars = new char[windows.GetCharCount(windowsBytes, 0, windowsBytes.Length)];
windows.GetChars(windowsBytes, 0, windowsBytes.Length, windowsChars, 0);
string windowsString = new string(windowsChars);
I assumed that the encoding of the origin string is Windows-1252 since when I paste it in NotePad++ and change the encoding to Windows-1252 the string remains the same...
I'm probably doing something wrong here, anyone know how to convert the above correctly?
Thanks,
Mikey
const string Str = "âìéåï_1";
Encoding latinEncoding = Encoding.GetEncoding("Windows-1252");
Encoding hebrewEncoding = Encoding.GetEncoding("Windows-1255");
byte[] latinBytes = latinEncoding.GetBytes(Str);
string hebrewString = hebrewEncoding.GetString(latinBytes);
hebrewString:
גליון_1
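In your supplied example, "Windows-1252" is not actually ASCII; it is extended ASCII. Encoding.Convert with these two encodings cannot convert the extended range, so all characters above 127 are converted to 63 (i.e. ?). When "converting" from one extended ASCII byte[] to another, I would expect the bytes to stay the same; it is only when you convert them to a .NET Unicode string that I would expect them to differ. The likely reason for the ? is that Encoding.Convert decodes the source bytes into Unicode characters and then re-encodes them in the target encoding, and characters such as â have no equivalent in Windows-1255, so the encoder substitutes a replacement. To reinterpret the same bytes under a different codepage, skip Convert and decode the original bytes with the target encoding, as in the code above.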
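The decode-then-re-encode behaviour can be checked with a small sketch. It uses ISO-8859-1 and ASCII as stand-ins for the two Windows codepages, on the assumption that those are available without registering extra encoding providers; the names ConvertDemo and ManualConvert are hypothetical.

```csharp
using System.Text;

public static class ConvertDemo
{
    // Spells out what Encoding.Convert is documented to do:
    // decode the bytes into characters, then encode those characters again.
    public static byte[] ManualConvert(Encoding source, Encoding target, byte[] bytes)
    {
        string decoded = source.GetString(bytes);
        return target.GetBytes(decoded);
    }
}
```

With the bytes E2 5F 31 ("â_1" in ISO-8859-1) and ASCII as the target, both Encoding.Convert and ManualConvert produce 3F 5F 31: the character â has no ASCII equivalent, so byte 63 (?) is substituted while "_1" survives.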
