Reading a file containing Arabic text - C#

I have a file that contains both Arabic and English words/letters/numbers. I'm trying to print the file using the code from Here. When I open the file in Notepad, I see funny, unprintable characters. When I re-save the same file from Notepad's Save as... menu with the encoding set to Unicode, the file is displayed properly (I can see the Arabic letters, etc.).
When I open the same file in Notepad++, the only option that displays the file correctly is
Menu->Encoding->Character set->Arabic
With C#, I'm trying to read the file line by line and print it using
ev.Graphics.DrawString(line, printFont, Brushes.Red, leftMargin, yPos, _sf);
where line is a line from the file. When the file is saved in the right encoding, everything prints out fine. But when we have encoding issues, we get a bunch of diamonds, question marks, etc.
Here are a few ways (from various sources) that I tried to open the file with the right encoding (please let me know if one of them should work and I'll try again):
Attempt 1
var arabic = Encoding.GetEncoding(1252);
var bytes = arabic.GetBytes(line);
line = arabic.GetString(bytes);
Attempt 2
streamToPrint = new StreamReader(this.filepath, System.Text.Encoding.UTF8, true);
Attempt 3
byte[] utf8Bytes = Encoding.UTF8.GetBytes(line);
line = Encoding.Unicode.GetString(utf8Bytes);
None of them work. Can someone kindly show me what changes I have to make to the Here code so that it will read the file and print it?

var arabic = Encoding.GetEncoding(1252);
That's not it; 1252 is the Windows code page for Western Europe and the Americas. Your next guess is 1256, the default Windows code page for Arabic. After that, try the legacy MS-DOS code pages, 864 and 720.
This kind of misery ought to inspire you to contact the company or programmer that created the file. It is high time they updated. The best argument you can give them is that you are available now and probably won't be whenever they get around to it.
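A minimal sketch of that probing approach (the file name is a placeholder; on .NET Core / .NET 5+ these legacy code pages additionally require the System.Text.Encoding.CodePages package and a one-time Encoding.RegisterProvider call):
using System;
using System.IO;
using System.Text;

class Probe
{
    static void Main()
    {
        // Candidates in the order suggested above: Windows Arabic, then the DOS pages.
        foreach (int codePage in new[] { 1256, 864, 720 })
        {
            string text = File.ReadAllText("arabic.txt", Encoding.GetEncoding(codePage));
            Console.WriteLine("--- code page " + codePage + " ---");
            Console.WriteLine(text); // eyeball which one yields readable Arabic
        }
    }
}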

You need to look at the BOM (byte order mark, U+FEFF), which should be the first Unicode character in the file. If it's not found, the file is either plain ASCII, UTF-8 without a byte order mark, or something odd.
Read the first several octets of the file. The BOM is encoded differently for different encodings:
hex EF BB BF indicates UTF-8. HOWEVER, for UTF-8 the BOM is optional, it being meaningless, what with UTF-8 being a byte-oriented encoding and all. If it's not found, that's no proof the file isn't UTF-8, though; it could equally be plain ASCII or encoded with some other non-Unicode SBCS or DBCS scheme.
hex FE FF indicates UTF-16, big-endian (network byte order).
hex FF FE indicates UTF-16, little-endian.
hex 00 00 FE FF indicates UTF-32, big-endian (network byte order).
hex FF FE 00 00 indicates UTF-32, little-endian.
etc. See http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding for more.
You might notice that this isn't fool-proof. A little-endian UTF-16 encoded file would be hard to differentiate from a little-endian UTF-32 encoded file... if its first non-BOM Unicode character is an ASCII NUL (U+0000).
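Here is one way that sniff could look in C# (a sketch, not fool-proof for the reason just given; note the UTF-32 patterns must be tested before the UTF-16 ones, since FF FE is a prefix of FF FE 00 00):
using System.IO;
using System.Text;

static Encoding DetectEncodingFromBom(string path, Encoding fallback)
{
    var b = new byte[4]; // zero-filled, so short files are handled safely
    using (var fs = File.OpenRead(path))
        fs.Read(b, 0, 4);
    if (b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
        return new UTF32Encoding(bigEndian: true, byteOrderMark: true); // UTF-32, big-endian
    if (b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
        return Encoding.UTF32;            // UTF-32, little-endian (or UTF-16 LE followed by U+0000!)
    if (b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return Encoding.UTF8;             // UTF-8 with BOM
    if (b[0] == 0xFE && b[1] == 0xFF)
        return Encoding.BigEndianUnicode; // UTF-16, big-endian
    if (b[0] == 0xFF && b[1] == 0xFE)
        return Encoding.Unicode;          // UTF-16, little-endian
    return fallback;                      // no BOM: plain ASCII, BOM-less UTF-8, or a legacy code page
}
In practice, new StreamReader(path, detectEncodingFromByteOrderMarks: true) performs the same detection for you.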

Related

Why does byte 150 show up as a dash in Notepad but not when I read it programmatically?

I've got a file that looks OK in Notepad (and Notepad++) but when I try to read it with a C# program, the dash shows up as a replacement character (�) instead. After some trial and error, I can reproduce the error as follows:
File.WriteAllBytes("C:\\Misc\\CharTest\\wtf.txt", new byte[] { 150 });
var readFile = File.ReadAllText("C:\\Misc\\CharTest\\wtf.txt");
Console.WriteLine(readFile);
Now, if you go and look in the wtf.txt file using Notepad, you'll see a dash... but I don't get it. I know 150 isn't a "real" Unicode value, so that's probably the root of the issue, but I don't get why it looks fine in Notepad and not when I read the file. And how do I get my program to read it as a dash?
As an aside, a VB6 program I'm trying to rewrite in C# also reads it as a dash.
The File.ReadAllText(string) overload defaults to UTF8 encoding, in which a standalone byte with value 150 is invalid.
Specify the actual encoding of the file, for example:
var encoding = Encoding.GetEncoding(1252);
string content = File.ReadAllText(fileName, encoding);
I used the Windows-1252 encoding, which has an en dash (–) at code point 150 (hex 96).
Edit: Notepad displays the file correctly because Windows-1252 is the default code page for non-Unicode files under Western regional settings. So you can likely also use Encoding.Default to get the correct result, but keep in mind that Encoding.Default returns different code pages under different regional settings.
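One caveat worth adding (not part of the original answer): on .NET Core and .NET 5+, Windows-1252 and the other legacy code pages are not available out of the box; they come from the System.Text.Encoding.CodePages package and must be registered once:
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); // one-time setup
var encoding = Encoding.GetEncoding(1252);
string content = File.ReadAllText(fileName, encoding); // byte 0x96 now decodes to the en dash U+2013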
You are writing bytes to a text file, and then you are reading those very same bytes and interpreting them as chars.
When you write raw bytes you don't care about encoding, but you have to pick one in order to read those bytes back as chars.
Notepad++ seems to interpret the byte as an ANSI character and therefore prints the dash.
Now, File.ReadAllText reads the bytes in the specified encoding, which you did not specify, so it defaults to UTF-8, where a lone byte 150 is not a valid sequence.

String lengths differ in Python 3 between file input and copy-and-paste

I have a string like this from Wikipedia (https://en.wikipedia.org/wiki/Tyre,_Lebanon)
Tyre (Arabic: صور‎‎, Ṣūr; Phoenician: 𐤑𐤅𐤓, Ṣur; Hebrew: צוֹר‎, Tsor; Tiberian Hebrew צֹר‎, Ṣōr; Akkadian: 𒀫𒊒, Ṣurru; Greek: Τύρος, Týros; Turkish: Sur; Latin: Tyrus, Armenian Տիր [Dir]), sometimes romanized as Sour, is a city in the South Governorate of Lebanon.
When this sentence is loaded from a file, its length is 262. When it is copied and pasted from the browser, it is 267.
My question is that I have an existing data pipeline in C# that reports the length as 266 (roughly the copy-and-paste length above, obtained by a default read-from-file in C#), while Python 3 reads the C# text output file and gets a length of 262. The issue is that character indexing (e.g. s[10:20]) differs between these two systems, which makes the end-to-end algorithm fail on cases like this.
It appears the underlying encoding is different, though the two strings have the same appearance to human readers (only the differing parts are shown):
Phoenician: \xf0\x90\xa4\x91\xf0\x90\xa4\x85\xf0\x90\xa4\x93, \xe1\xb9\xa2ur;
Phoenician: \xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd, \xe1\xb9\xa2ur;
And
Akkadian: \xf0\x92\x80\xab\xf0\x92\x8a\x92, \xe1\xb9\xa2urru;
Akkadian: \xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd, \xe1\xb9\xa2urru;
Is there a way for Python to read the file using the latter encoding, of length 266? And how can one detect/determine the proper encoding from the UTF-8 bytes above?
The full UTF-8 encoding for each case is shown below for further investigation.
From file
b'Tyre (Arabic: \xd8\xb5\xd9\x88\xd8\xb1\xe2\x80\x8e\xe2\x80\x8e, \xe1\xb9\xa2\xc5\xabr; Phoenician: \xf0\x90\xa4\x91\xf0\x90\xa4\x85\xf0\x90\xa4\x93, \xe1\xb9\xa2ur; Hebrew: \xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e, Tsor; Tiberian Hebrew \xd7\xa6\xd6\xb9\xd7\xa8\xe2\x80\x8e, \xe1\xb9\xa2\xc5\x8dr; Akkadian: \xf0\x92\x80\xab\xf0\x92\x8a\x92, \xe1\xb9\xa2urru; Greek: \xce\xa4\xcf\x8d\xcf\x81\xce\xbf\xcf\x82, T\xc3\xbdros; Turkish: Sur; Latin: Tyrus, Armenian \xd5\x8f\xd5\xab\xd6\x80 [Dir]), sometimes romanized as Sour, is a city in the South Governorate of Lebanon.'
From copy and paste
b'Tyre (Arabic: \xd8\xb5\xd9\x88\xd8\xb1\xe2\x80\x8e\xe2\x80\x8e, \xe1\xb9\xa2\xc5\xabr; Phoenician: \xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd, \xe1\xb9\xa2ur; Hebrew: \xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e, Tsor; Tiberian Hebrew \xd7\xa6\xd6\xb9\xd7\xa8\xe2\x80\x8e, \xe1\xb9\xa2\xc5\x8dr; Akkadian: \xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd, \xe1\xb9\xa2urru; Greek: \xce\xa4\xcf\x8d\xcf\x81\xce\xbf\xcf\x82, T\xc3\xbdros; Turkish: Sur; Latin: Tyrus, Armenian \xd5\x8f\xd5\xab\xd6\x80 [Dir]), sometimes romanized as Sour, is a city in the South Governorate of Lebanon.'
You probably don't have Phoenician fonts installed on your system, so the web browser (as @lenz mentioned in the comments) displays placeholder/replacement characters instead of letters like 𐤓. Python loads your string properly.
There are 5 problematic characters in the text: 3 Phoenician and 2 Akkadian:
The first character of the problematic part with Phoenician symbols is 'Phoenician Letter Sade' (https://unicode-table.com/en/10911/) -- it spans 4 bytes in UTF-8: F0 90 A4 91
It is followed with 'Phoenician Letter Wau' (https://unicode-table.com/en/10905/) -- again 4 bytes: F0 90 A4 85
The third letter is 'Phoenician Letter Rosh' (https://unicode-table.com/en/10913/) -- it uses 4 bytes as well: F0 90 A4 93
(I omit the Akkadian ones.)
Each of those letters is replaced in your second dump by \xef\xbf\xbd\xef\xbf\xbd, which corresponds to ��.
Each problematic letter somehow gets replaced by two � signs, so the total length of the string increases by 5, from 262 to 267 characters.
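A quick check of that reading (a sketch; the string literal is the replacement character itself):
byte[] fffd = Encoding.UTF8.GetBytes("\uFFFD"); // U+FFFD, the replacement character �
Console.WriteLine(BitConverter.ToString(fffd)); // EF-BF-BD, the pattern repeated in the second dump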
It turns out I found a different viewpoint from which to answer this question. C# does report a longer length for the string, but that does not mean it is incorrect; the underlying encoding model is simply different and has its limitations.
http://illegalargumentexception.blogspot.com/2010/04/i18n-comparing-character-encoding-in-c.html
See also: Unicode character is not the same on Python and C#
When reading a file and decoding it to Unicode, C# and Java store Unicode strings internally encoded as UTF-16, and C#'s string.Length counts UTF-16 code units. Code points outside the Basic Multilingual Plane (BMP, U+0000 to U+FFFF) use surrogate pairs (two 16-bit code units) to represent a single code point. The fact that you can see one Unicode code point as two units is a leaky abstraction.
Python 3.3+ hides this abstraction. It internally uses 1-, 2- or 4-byte encodings as needed to represent a Unicode string, but presents only the Unicode code points to the user.
This explains why the lengths reported by C# can be longer than Python's.
How to make them congruent? Hmmm... probably not directly, but perhaps through a substring search as post-processing (see the sketch below for a more direct alternative)...
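One more direct route, sketched here: make the C# side count Unicode code points, as Python 3's len does. The method name is a placeholder; on .NET Core 3.0+, text.EnumerateRunes().Count() gives the same number.
static int CodePointLength(string text)
{
    int count = 0;
    for (int i = 0; i < text.Length; i++)
    {
        if (char.IsSurrogatePair(text, i))
            i++; // a surrogate pair is two UTF-16 code units but only one code point
        count++;
    }
    return count;
}
Going the other way, Python can reproduce the C# code-unit count with len(s.encode('utf-16-le')) // 2.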

What encoding should be used to create an MS-DOS txt file using C# (UTF8Encoding vs Encoding)

I am trying to create a flat file for a legacy system, and they mandate that the data be presented in the text encoding of an MS-DOS .txt file (Text Document - MS-DOS Format, CP_OEM). I am a bit confused about the difference between files generated using the UTF8Encoding class in C# (.NET 4.0) and the default txt encoding (CP_ACP), which I think is what I am producing.
I think the encoding names CP_ACP, Windows and ANSI all refer to the same thing, that the Windows default is ANSI, and that it will drop any Unicode-only character information.
If I use UTF8Encoding class in C# library to create a text file(as below), is it going to be in the MS DOS txt file format?
byte[] title = new UTF8Encoding(true).GetBytes("New Text File");
As per the answer supplied, it is evident that UTF-8 is NOT equivalent to the MS-DOS txt format, and that one should use the Encoding.GetEncoding(850) method to get the right encoding.
I read the following posts to check my information, but found nothing conclusive yet:
https://blogs.msdn.microsoft.com/oldnewthing/20120220-00?p=8273
https://blog.mh-nexus.de/2015/01/character-encoding-confusion
https://blogs.msdn.microsoft.com/oldnewthing/20090115-00?p=19483
Finally, the conclusion is to go with Encoding.GetEncoding(850) when creating the byte array that is converted back into the actual file (note: I am using a byte array so I can leverage existing middleware).
You can use the File.ReadXY(String, Encoding) and File.WriteXY(String, String[], Encoding) methods, where XY is either AllLines, Lines or AllText working with string[], IEnumerable<string> and string respectively.
MS-DOS uses different code pages. Probably code page 850 "Western European / Latin-1" or code page 437 "OEM-US / OEM / PC-8 / DOS Latin US" (as @HansPassant suggests) will be okay. If you are not sure which code page you need, create example files containing letters like ä, ö, ü, é, è, ê, ç, à or Greek letters with the legacy system and see whether they round-trip. If you don't use such letters or other special characters, then the code page is not very critical.
File.WriteAllText(path, "Hello World", Encoding.GetEncoding(850));
The character codes from 0 to 127 (7-bit) are the same for all MS-DOS code pages, for ANSI and UTF-8. UTF files are sometimes introduced with a BOM (byte order mark).
MS-DOS knows only 8-bit characters. The codes 128 to 255 differ for the different national code pages.
See: File Class, Encoding Class and Wikipedia: Code Page.
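Since the question mentions going through a byte array to reuse existing middleware, a minimal sketch of that route (the file path and sample text are placeholders):
Encoding cp850 = Encoding.GetEncoding(850); // on .NET Core, register CodePagesEncodingProvider first
byte[] payload = cp850.GetBytes("Grüße");   // ü becomes the single byte 0x81 in CP850
File.WriteAllBytes(@"C:\out\legacy.txt", payload);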

C# - Converting code page input to Unicode

I searched and found some partial answers that work in some instances, but nothing that works in all. The scenario: I receive an XML file over the network. The file has the encoding named in its declaration, e.g. encoding="Windows-932" or encoding="Windows-1254" and so on. I need to parse the file, get certain info, convert that info to Unicode chars, and send it on to another machine that can only read Unicode.
So if the encoding is:
1253, it is Greek, so char E1 = ASCII 225 = Unicode 03B1.
1254, it is Turkish, so char E1 = ASCII 225 = Unicode 00E1.
1251, it is Cyrillic, so char E1 = ASCII 225 = Unicode 0431.
So far I thought I could have a lookup table that looks at the encoding and then just put the Unicode page in front of the E1, BUT that will not work: as you can see above, the characters do not sit at the same position within their Unicode blocks.
To further complicate things, I can also get encodings such as Japanese (Shift-JIS), which is code page 932. That one does not draw all the Japanese characters from the same block, and almost every character in its tables comes from a different Unicode block.
So the question is: how, in C#, do I convert the XML data to Unicode and get it correct every time? Any ideas?
Encoding.GetEncoding("windows-1253").GetString(new byte[] {0xE1}) // -> "\u03B1" α
Encoding.GetEncoding("windows-1254").GetString(new byte[] {0xE1}) // -> "\u00E1" á
Encoding.GetEncoding("windows-1251").GetString(new byte[] {0xE1}) // -> "\u0431" б
But for an XML file you should be using an existing XML parser (e.g. XmlReader or XDocument.Load), which will deal with encodings for you, as sketched below.
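A sketch of that parser-based route (the file name and element name are placeholders). XDocument.Load reads the encoding attribute in the XML declaration and decodes accordingly, so the resulting .NET strings are already Unicode:
using System.Xml.Linq;

// For encodings like windows-932 on .NET Core, register
// CodePagesEncodingProvider.Instance before loading.
XDocument doc = XDocument.Load("message.xml"); // honors encoding="..." in the declaration
string info = (string)doc.Root.Element("Info"); // ready to send on as Unicode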

Problems reading text file - non-breaking space character

I am trying to read a text file containing the following line:
"WE BUY : 10 000.00 USD"
First I opened this file in a binary editor, and the 13th character (12th with C#'s 0-based indexing), the thousands separator, is code 160 decimal (A0 hex) in the Windows-1251 encoding.
However, after I read this line into a string using File.ReadAllLines,
in the debugger I can see that the character now has code 65533:
"lines[9][12] 65533 '�' char"
The default encoding Encoding.Default for my PC is "Windows-1251".
How come?
UPDATE
Tried opening the file with UTF-8 encoding; still the same result.
UPDATE 2
The problem is that the file's encoding is 8-bit, but the debugger shows the 16-bit value 65533 for the 8-bit character A0.
The one-argument File.ReadAllLines assumes the input is UTF-8, whatever the system default encoding. A lone A0 byte is not valid UTF-8, so it is decoded as the replacement character U+FFFD, which is 65533.
For anything else you need to specify the encoding:
var lines = File.ReadAllLines(filename, Encoding.GetEncoding(name));
You can get name from your Encoding.Default.WebName ("Windows-1252" is what I get here, but check locally).
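Putting it together for this file, a sketch assuming it really is Windows-1251 (filename is a placeholder):
var lines = File.ReadAllLines(filename, Encoding.GetEncoding(1251));
char c = lines[9][12];
Console.WriteLine((int)c); // 160 = U+00A0, the no-break space, instead of 65533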
