I have to create a text file containing numbers and Hebrew letters encoded as ASCII.
This is the file-creation method, which is triggered on a button click:
protected void ToFile(object sender, EventArgs e)
{
    filename = Transactions.generateDateYMDHMS();
    string path = string.Format("{0}{1}.001", Server.MapPath("~/transactions/"), filename);
    // using ensures the writer is flushed and closed even if an exception is thrown
    using (StreamWriter sw = new StreamWriter(path, false, Encoding.ASCII))
    {
        sw.WriteLine("hello");
        sw.WriteLine(Transactions.convertUTF8ASCII("שלום"));
        sw.WriteLine("bye");
    }
}
As you can see, I use the static method Transactions.convertUTF8ASCII() to convert what is presumably a Unicode string in .NET to its ASCII representation. I call it on the Hebrew word 'shalom' and get back '????' instead of the result I need.
Here is the method.
public static string convertUTF8ASCII(string initialString)
{
byte[] unicodeBytes = Encoding.Unicode.GetBytes(initialString);
byte[] asciiBytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, unicodeBytes);
return Encoding.ASCII.GetString(asciiBytes);
}
Instead of getting the initial word encoded as ASCII, I get '????' in the file I create; even when I step through with the debugger I see the same result.
What am I doing wrong?
You can't simply translate arbitrary Unicode characters to ASCII. The best the converter can do is discard the unsupported characters, hence the '????'. The basic 7-bit characters will work, but not much else. I'm curious what the expected result is.
If you need this for transfer (rather than representation) you might consider base-64 encoding of the underlying UTF8 bytes.
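As a sketch of that idea (the Hebrew string is just an example value): base-64 text uses only 7-bit ASCII characters, so it survives an ASCII-only file intact and can be decoded back to the original string later.

```csharp
using System;
using System.Text;

class Base64Demo
{
    static void Main()
    {
        string original = "שלום";

        // Encode: UTF-8 bytes -> base-64 text, safe to store in an ASCII-only file
        string encoded = Convert.ToBase64String(Encoding.UTF8.GetBytes(original));
        Console.WriteLine(encoded); // 16nXnNeV150=

        // Decode: base-64 text -> UTF-8 bytes -> original string
        string decoded = Encoding.UTF8.GetString(Convert.FromBase64String(encoded));
        Console.WriteLine(decoded == original); // True
    }
}
```

The trade-off is that the file contents are no longer human-readable, so this suits transfer rather than display.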
Do you perhaps mean ANSI, not ASCII?
ASCII doesn't define any Hebrew characters. There are, however, ANSI code pages which do, such as "windows-1255".
In which case, you may want to consider looking at:
http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx
In short, where you have:
Encoding.ASCII
You would replace it with:
Encoding.GetEncoding(1255)
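Applied to the question's code, that looks something like this (a sketch; on .NET Framework windows-1255 is available out of the box, while on .NET Core/5+ you would first need to register the code-pages encoding provider):

```csharp
using System.IO;
using System.Text;

class Windows1255Demo
{
    static void Main()
    {
        // Code page 1255 maps each Hebrew letter to a single byte,
        // so no separate conversion step is needed
        Encoding hebrew = Encoding.GetEncoding(1255);

        using (var sw = new StreamWriter("output.001", false, hebrew))
        {
            sw.WriteLine("hello");
            sw.WriteLine("שלום"); // written as windows-1255 bytes, not '????'
            sw.WriteLine("bye");
        }
    }
}
```

With Encoding.ASCII the same WriteLine would fall back to '?' for every Hebrew letter, which is exactly the '????' the question describes.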
Are you perhaps asking about transliteration (as in "Romanization") instead of encoding conversion, if you really are talking about ASCII?
I just faced the same issue when the original XML file was in ASCII encoding.
As Userx suggested, use
Encoding.GetEncoding(1255)
XDocument.Parse(System.IO.File.ReadAllText(xmlPath, Encoding.GetEncoding(1255)));
So now my XDocument can read Hebrew even though the XML file was saved as ASCII.
Related
I have an array of bytes which contains some characters that are not UTF-8. These characters cannot be deserialized using UTF-8 encoding. So my question is: how can I handle these characters and make the string readable in whatever language it is?
For example, if I have an array:
byte[] b = myArrayWithNonUTF8Characters;
And I try to deserialize the array with:
DataContractJsonSerializer jsonSerializer = new DataContractJsonSerializer(typeof(MyObject));
MyObject objResponse = (MyObject)jsonSerializer.ReadObject(new MemoryStream(b));
Then I get an error that the array contains invalid UTF8 bytes.
Any way to make this work?
PS: Please do not give me this answer: string s = System.Text.Encoding.UTF8.GetString(b, 0, b.Length); It will only return replacement symbols in place of the non-UTF-8 characters.
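One approach, assuming you know (or can guess) which code page the bytes were actually written in, is to transcode them to UTF-8 before handing them to the serializer. This is a sketch; MyObject and the iso-8859-1 source encoding are placeholders for your own type and encoding:

```csharp
using System.IO;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Json;
using System.Text;

[DataContract]
public class MyObject
{
    [DataMember(Name = "name")]
    public string Name { get; set; }
}

class Transcode
{
    static void Main()
    {
        // Bytes produced by some non-UTF-8 encoding; here é is the single
        // byte 0xE9, which is invalid in a UTF-8 stream
        Encoding source = Encoding.GetEncoding("iso-8859-1");
        byte[] b = source.GetBytes("{\"name\":\"café\"}");

        // Transcode: source-encoding bytes -> UTF-8 bytes
        byte[] utf8 = Encoding.Convert(source, Encoding.UTF8, b);

        var jsonSerializer = new DataContractJsonSerializer(typeof(MyObject));
        var objResponse = (MyObject)jsonSerializer.ReadObject(new MemoryStream(utf8));
        // objResponse.Name is now "café"
    }
}
```

The hard part is knowing the source encoding; there is no reliable way to detect it from the bytes alone.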
The beauty of Unicode (and the UTF encodings of it) is that it covers characters from most languages, so you can have Greek and Japanese in the same character stream.
Without Unicode, your entire stream (or, in your case, your array) must be in a single language defined by a code page. Each character is stored as a single byte, and which character that byte represents is determined by the code page (see http://en.wikipedia.org/wiki/Code_page for more details).
For example, if your text was written in Greek you might use code page 1253 (Windows Greek):
System.Text.Encoding.GetEncoding(1253)
In short, you need to know which code page the text was written in.
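A sketch of the idea (Windows-1253 is the Greek code page; on .NET Core/5+ it requires registering the code-pages encoding provider first). The same three bytes decode to entirely different characters depending on the code page you pick:

```csharp
using System;
using System.Text;

class CodePageDemo
{
    static void Main()
    {
        // Bytes for "αβγ" as written under Windows-1253 (Greek)
        byte[] bytes = { 0xE1, 0xE2, 0xE3 };

        // Decoded with the right code page: Greek letters
        Console.WriteLine(Encoding.GetEncoding(1253).GetString(bytes)); // αβγ

        // Decoded with the wrong code page: accented Latin letters
        Console.WriteLine(Encoding.GetEncoding("iso-8859-1").GetString(bytes)); // áâã
    }
}
```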
I'm getting the following behavior from C#'s string encoder:
(test-case screenshot omitted)
poundFromBytes should be "£", but instead it's "?".
It's as if it's trying to encode the byte array using ASCII instead of UTF-8.
Is this a bug in Windows 7 / C#'s string encoder, or am I missing something?
My real issue here is that I get the same problem when I use File.ReadAllText on an ANSI text file, and I get a related issue in a third party library.
EDIT
I found my problem: I was running under the assumption that UTF-8 was backwards compatible with ANSI, but it's actually only backwards compatible with ASCII. Cheers anyway; at least I'll know to make sure I have no such problems with my test case next time.
The single-byte representation of the pound sign is not valid UTF-8.
Use Encoding.GetBytes instead:
byte[] poundBytes = Encoding.GetEncoding("UTF-8").GetBytes(sPound);
The correct block of code should read something like:
var testChar = '£';
var bytes = Encoding.UTF32.GetBytes(new []{testChar});
string testConvert = Encoding.UTF32.GetString(bytes, 0, bytes.Length);
As others have said, you need to use a UTF encoder to get the bytes for a character. Incidentally characters are UTF-16 format by default (see: http://msdn.microsoft.com/en-us/library/x9h8tsay.aspx)
If you want to use an Encoding's GetString() method, you should probably also use its corresponding GetBytes() method:
static void Main(string[] args)
{
char cPound = '£';
byte bPound = (byte)cPound; //not really valid
string sPound = "" + cPound;
byte[] poundBytes = Encoding.UTF8.GetBytes(sPound);
string poundFromBytes = Encoding.UTF8.GetString(poundBytes);
Console.WriteLine(poundFromBytes);
Console.ReadKey(true);
}
Check out the documentation here: http://msdn.microsoft.com/en-us/library/ds4kkd55(v=vs.110).aspx. As mentioned in the comments, you can't just cast your char to a byte.
char[] pound = new char[] { '£' };
byte[] poundAsBytes = Encoding.UTF8.GetBytes(pound);
Also, why is everyone using this GetEncoding with a hard coded argument rather than accessing UTF8 directly?
The goal is simple. Grab some French text containing special characters from a .txt file and store it in the variable "content". All is working well except that every instance of the character "à" is being read as "À". Did I pick the wrong encoding (UTF7), or what?
Any help would be much appreciated.
// Using an encoding to ensure special characters work
Encoding enc = Encoding.UTF7;
// Make sure we get all the information about special chars
byte[] fileBytes = System.IO.File.ReadAllBytes(path);
// Convert the new byte[] into a char[] and then into a string.
char[] fileChars = new char[enc.GetCharCount(fileBytes, 0, fileBytes.Length)];
enc.GetChars(fileBytes, 0, fileBytes.Length, fileChars, 0);
string fileString = new string(fileChars);
// Insert the resulting encoded string "fileString" into "content"
content = fileString;
Your code is correct aside from the wrong encoding. Find the correct one and plug it in; hardly anybody uses UTF-7, so that is probably not it.
Maybe it is a non-Unicode one. Try Encoding.Default. That one empirically often helps in Germany.
Also, just use File.ReadAllText. It does everything you are doing.
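A sketch of that suggestion, using iso-8859-1 as a stand-in for whatever encoding the file actually uses (and a temp file as a stand-in for your path):

```csharp
using System;
using System.IO;
using System.Text;

class ReadDemo
{
    static void Main()
    {
        string path = Path.GetTempFileName();              // placeholder for your .txt file
        Encoding enc = Encoding.GetEncoding("iso-8859-1"); // try the file's real encoding here

        File.WriteAllText(path, "déjà vu", enc);

        // One call replaces the ReadAllBytes / GetCharCount / GetChars dance
        string content = File.ReadAllText(path, enc);
        Console.WriteLine(content); // déjà vu
    }
}
```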
I'm writing a program that reads all the text in a file into a string, loops over that string looking at the characters, and then appends the characters back to another string using a StringBuilder. The issue I'm having is that when it's written back out, special characters such as “ and ” come out looking like � characters instead. I don't need to do a conversion; I just want it written back out the way I read it in:
StringBuilder sb = new StringBuilder();
string text = File.ReadAllText(filePath);
for (int i = 0; i < text.Length; ++i) {
if (text[i] != '{') { // looking for opening curly brace
sb.Append(text[i]);
continue;
}
// Do stuff
}
File.WriteAllText(destinationFile, sb.ToString());
I tried using different Encodings (UTF-8, UTF-16, ASCII), but then it just came out even worse; I started getting question mark symbols and Chinese characters (yes, a bit of a shotgun approach, but I was just experimenting).
I did read this article: http://www.joelonsoftware.com/articles/Unicode.html
...but it didn't really explain why I was seeing what I saw, unless in C#, the reader starts cutting off bits when it hits weird characters like that. Thanks in advance for any help!
TL;DR that is definitely not UTF-8 and you are not even using UTF-8 to read the resulting file. Read as Windows1252, write as Windows1252 (If you are going to use the same viewing method to view the resulting file)
Let's first say that there is almost no chance a file made by a regular user will be in UTF-8. Not all programs in Windows even support it (Excel, Notepad...), let alone use it as the default encoding (even most developer tools don't default to UTF-8, which drives me insane). Since a lot of developers don't understand that such a thing as encoding even exists, what chance do regular users have of saving their files in a UTF-8-hostile environment?
This is where your problems start. According to the documentation, the overload you are using, File.ReadAllText(filePath), can only detect UTF-8 or UTF-32.
Indeed, simply reading a file encoded normally in Windows-1252 that contains "a”a" results in the string "a�a", where � is the Unicode replacement character (read the Wikipedia section on it; it describes exactly the situation you are in!) used to replace invalid bytes. When the replacement character is then encoded as UTF-8 and interpreted as Windows-1252, you will see ï»½, because the bytes for � in UTF-8 are 0xEF, 0xBF, 0xBD, which are the bytes for ï»½ in Windows-1252.
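You can confirm those bytes directly; this is just a sketch of the U+FFFD mechanics described above:

```csharp
using System;
using System.Text;

class ReplacementCharDemo
{
    static void Main()
    {
        // U+FFFD is the Unicode replacement character '�'
        byte[] bytes = Encoding.UTF8.GetBytes("\uFFFD");
        Console.WriteLine(BitConverter.ToString(bytes)); // EF-BF-BD
    }
}
```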
So read it as Windows-1252 and you're half-way there:
Encoding windows1252 = Encoding.GetEncoding("Windows-1252");
String result = File.ReadAllText(@"C:\myfile.txt", windows1252);
Console.WriteLine(result); //Correctly prints "a”a" now
Because you saw �, the tool you are viewing the newly made file with is also using Windows-1252. So if the goal is to have the file show correct characters in that tool, you must encode the output as Windows-1252:
Encoding windows1252 = Encoding.GetEncoding("Windows-1252");
File.WriteAllText(@"C:\myFile", sb.ToString(), windows1252);
Chances are the text will be UTF8.
File.ReadAllText(filePath, Encoding.UTF8)
coupled with
File.WriteAllText(destinationFile, sb.ToString(), Encoding.UTF8)
should cover off dealing with the Unicode characters. If you do only one or the other you're going to get garbage output; do both, or neither.
My question is very simple, but at the moment I don't know how to do this. I have a string in ISO-8859-1 format and I need to convert it to UTF-8, in C# on the Windows Phone 7 SDK. How can I do it? Thanks.
The MSDN page for the Encoding class lists the recognized encodings.
28591 iso-8859-1 Western European (ISO)
For your question the correct choice is iso-8859-1 which you can pass to Encoding.GetEncoding.
var inputEncoding = Encoding.GetEncoding("iso-8859-1");
var text = inputEncoding.GetString(input);
var output = Encoding.UTF8.GetBytes(text);
Two clarifications on the previous answers:
There is no Encoding.GetText method (unless it was introduced specifically for the WP7 framework). The method should presumably be Encoding.GetString.
The Encoding.GetString method takes a byte[] parameter, not a string. All strings in .NET are internally represented as UTF-16; there is no way of having a “string in ISO-8859-1 format”. Thus, you must be careful how you read your source (file, network), rather than how you process your string.
For example, to read from a text file encoded in ISO-8859-1, you could use:
string text = File.ReadAllText(path, Encoding.GetEncoding("iso-8859-1"));
To save to a text file encoded in UTF-8, you could use:
File.WriteAllText(path, text, Encoding.UTF8);
Reply to comment:
Yes. You can use Encoding.GetString to decode your byte array (assuming it contains character values for text under a particular encoding) into a string, and Encoding.GetBytes to convert your string back into a byte array (possibly of a different encoding), as demonstrated in the other answers.
The concept of “encoding” relates to how byte sequences (be they a byte[] array in memory or the content of a file on disk) are to be interpreted. The string class is oblivious to the encoding that the text was read from, or should be saved to.
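To illustrate: the same .NET string produces different byte sequences depending on which encoding you ask for; the string itself has no encoding (a sketch):

```csharp
using System;
using System.Text;

class EncodingDemo
{
    static void Main()
    {
        string text = "é"; // internally UTF-16, like every .NET string

        // Same string, three different byte representations
        Console.WriteLine(Encoding.UTF8.GetBytes(text).Length);                      // 2 (0xC3 0xA9)
        Console.WriteLine(Encoding.GetEncoding("iso-8859-1").GetBytes(text).Length); // 1 (0xE9)
        Console.WriteLine(Encoding.Unicode.GetBytes(text).Length);                   // 2 (0xE9 0x00)
    }
}
```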
You can use Encoding.Convert, which works pretty well, especially when you have a byte array:
var latinString = "år";
Encoding latinEncoding = Encoding.GetEncoding("iso-8859-1");
Encoding utf8Encoding = Encoding.UTF8;
byte[] latinBytes = latinEncoding.GetBytes(latinString);
byte[] utf8Bytes = Encoding.Convert(latinEncoding, utf8Encoding, latinBytes);
var utf8String = Encoding.UTF8.GetString(utf8Bytes);