Python C# - Unicode character is not the same on Python and C#

I encountered a problem while working on text files: the Unicode representation of a character differs between Python and C#.
When opening the file with Python 3.5.2, the character at a specific index is a single code point:
with open('file.txt', 'r', encoding='utf-8') as f:
    text = f.read()
text[189]
# Output: u"\U0001F464"
When opening the file with C#, the character at the same index is represented by two UTF-16 code units:
string text = File.ReadAllText("file.txt", Encoding.UTF8);
Console.WriteLine(((int)text[189]).ToString("X4"));
// Output: "D83D"
string text = File.ReadAllText("file.txt", Encoding.UTF8);
Console.WriteLine(((int)text[190]).ToString("X4"));
// Output: "DC64"
So in Python this char is at index 189, while in C# it spans indexes 189 and 190.
Reference for this character on the fileformat website:
http://www.fileformat.info/info/unicode/char/1F464/index.htm
As you can see there, the representation of this character has a different length: on C#/C/C++/Java it is "\uD83D\uDC64", and on Python it is u"\U0001F464".
The part of the text that is problematic:
👤 Sign in
Is there a way to use the same Unicode representation in Python 3.5 and C#?
Edit:
Download link for the original file in which this error happened:
https://ufile.io/pr5v6

You can't fix it. It is inherent in the Unicode implementation of the languages.
When reading a file and decoding to Unicode, C# and Java store Unicode strings internally encoded as UTF-16. Code points outside the basic multilingual plane (BMP, U+0000 to U+FFFF) use surrogates (two words) to represent a Unicode code point. The fact that you can see a Unicode code point as two words is a leaky abstraction.
Python 3.3+ hides this abstraction. It internally uses 1-, 2- or 4-byte encodings as needed to represent a Unicode string, but presents only the Unicode code points to the user.
Python 2 (same leaky abstraction as C# and Java):
Python 2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:53:40) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> len(u'\U0001F464')
2
>>> u'\U0001F464'[0]
u'\ud83d'
>>> u'\U0001F464'[1]
u'\udc64'
Python 3.3+:
Python 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> len(u'\U0001F464')
1
>>> u'\U0001F464'[0]
'👤'
Internally, Python 3 uses UTF-32 to store a Unicode string containing a non-BMP code point and would use four bytes to store U+1F464.
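If you need C# to report the same thing Python does, you can walk the string by Unicode code point instead of by UTF-16 code unit. Below is a minimal sketch, not from the original answer, using char.ConvertToUtf32 and char.IsSurrogatePair, and assuming the same file.txt and index 189 as in the question:

using System;
using System.IO;
using System.Text;

class CodePoints
{
    static void Main()
    {
        string text = File.ReadAllText("file.txt", Encoding.UTF8);

        // Walk the UTF-16 string one code point at a time, advancing
        // two chars whenever the current position is a surrogate pair.
        int codePointIndex = 0;
        for (int i = 0; i < text.Length; )
        {
            if (codePointIndex == 189)
            {
                // Prints U+1F464, matching Python's view of the string.
                Console.WriteLine("U+{0:X}", char.ConvertToUtf32(text, i));
                break;
            }
            i += char.IsSurrogatePair(text, i) ? 2 : 1;
            codePointIndex++;
        }
    }
}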

Related

C# UTF8 GetString from byte array not equal to PHP chr function

I'm trying to build a decoder on .NET 4.7 and migrate it to PHP, but I'm having trouble converting bytes. As far as I understand, C# strings are UTF-16LE by default, and I understood PHP's ord and chr functions to behave like UCS-2. I want to do the following, but I don't get the same result in the two languages. What can I do to fix this? Thanks in advance.
XOR Encoded Text Bytes = [101,107,217,78,40,68,234,218,162,67,139,81,44,166,24,148];
On C#:
string result = System.Text.Encoding.UTF8.GetString(destinationArray);
On PHP:
for ($i = 0; $i < sizeof($encoded); $i++) {
    echo "\t".$encoded[$i]." => ".chr($encoded[$i])."\n";
    $tmpStr .= chr($encoded[$i]);
}
C# Result size=26:
ek�N(D�ڢC�Q,��
PHP Result size=16:
ek�N(D�ڢC�Q,��
The strings look the same, but the byte translation is quite different.
C# Result to Bytes array:
byte[] utf8 = System.Text.Encoding.Unicode.GetBytes(result);
Console.WriteLine(string.Join("-", utf8));
response =
101-0-107-0-253-255-78-0-40-0-68-0-253-255-162-6-67-0-253-255-81-0-44-0-253-255-24-0-253-255
PHP Result to Bytes Array:
echo implode("-",unpack("C*", $tmpStr));
response = 101-107-217-78-40-68-234-218-162-67-139-81-44-166-24-148
If the PHP result is converted to UTF-16LE, the results differ again:
echo implode("-",unpack("C*", mb_convert_encoding($tmpStr,'UTF-16le')));
response =
101-0-107-0-63-0-78-0-40-0-68-0-63-0-162-6-67-0-63-0-81-0-44-0-63-0-24-0-63-0
You are mixing quite different things here.
First, in the C# code, you are not using the same encoding when converting from bytes to a string and then from the string back to bytes: Encoding.UTF8 in the first case and Encoding.Unicode (which is the .NET name for UTF-16) in the latter... Things cannot go well if you do this. And by the way, I'm not sure that PHP's UCS-2 is equivalent to UTF-16:
UTF-8 encodes characters on 1, 2, 3 or 4 bytes depending on the character
UTF-16 encodes characters on 2 or 4 bytes depending on the character
UCS-2 always encodes characters on 2 bytes, and hence cannot encode more than 65536 characters...
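As a quick illustration of those size differences in .NET (a sketch of my own, not part of the original answer; Encoding.Unicode is .NET's UTF-16LE):

using System;
using System.Text;

class EncodingSizes
{
    static void Main()
    {
        string bmp = "\u00E9";        // é, inside the BMP
        string astral = "\U0001F464"; // 👤, outside the BMP

        Console.WriteLine(Encoding.UTF8.GetByteCount(bmp));       // 2
        Console.WriteLine(Encoding.Unicode.GetByteCount(bmp));    // 2
        Console.WriteLine(Encoding.UTF8.GetByteCount(astral));    // 4
        Console.WriteLine(Encoding.Unicode.GetByteCount(astral)); // 4 (a surrogate pair)
    }
}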
Then what you pass to the 'bytes to string' conversions is not necessarily valid! Because you've XORed the input data (I assume it to be some secret string), the resulting bytes may or may not be a valid sequence in some encodings. For example:
It is not valid in ASCII because you have (in your example) bytes > 127
It is not valid in UTF-8 because 217 followed by 78 is recognized neither as a 1-, 2-, 3-, or 4-byte character by UTF-8; hence, the � you see before the N.
It seems to be invalid UTF-16 as well, but roundtripping works (I could get back the original array using .NET's Unicode.GetString, then Unicode.GetBytes). However, if I remove your last byte - ending up with an odd number of bytes - then UTF-16 roundtripping does not work any more...
Although I did not test it, it should also be invalid UCS-2 because UCS-2 'looks like' UTF-16 for 2-byte characters.
Roundtripping works with ANSI encodings such as windows-1252 because these encodings accept any byte. However, I would discourage using such a trick, because you have to be sure the same code page is used on both sides of the encoding/decoding process.
Therefore I think that, in your case, the best way to store your XORed bytes in a string is to convert the array to base64. In C# you can do it this way:
// The code below gives you ZWt1TihEInY+QydRLEIYMA==
var converted = Convert.ToBase64String(array);
// And this one gives you back the initial array
var bytes = Convert.FromBase64String(converted);
Quick googling will tell you to use base64_encode and base64_decode in PHP.
Bottom note: if you want to really understand what's going on with all this encoding stuff, here is the must-read blog post on the subject: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

String lengths differ in Python3 from file and through copy-and-paste

I have a string like this from Wikipedia (https://en.wikipedia.org/wiki/Tyre,_Lebanon)
Tyre (Arabic: صور‎‎, Ṣūr; Phoenician: 𐤑𐤅𐤓, Ṣur; Hebrew: צוֹר‎, Tsor; Tiberian Hebrew צֹר‎, Ṣōr; Akkadian: 𒀫𒊒, Ṣurru; Greek: Τύρος, Týros; Turkish: Sur; Latin: Tyrus, Armenian Տիր [Dir]), sometimes romanized as Sour, is a city in the South Governorate of Lebanon.
When this sentence is loaded from a file, its length is 262. When it is copied and pasted from the browser, it is 267.
My question is that I have an existing data pipeline in C# that recognizes the length as 266 (the copy-and-paste length above, but with C#'s default read-from-file), while Python 3 reads the C# text output file and considers it to have length 262. The issue is that character indexing (e.g. s[10:20]) through these two encoding systems will differ and make the end-to-end algorithm fail on this type of case.
It appears the underlying encoding is different, though they have the same appearance to human readers (only the different parts shown):
Phoenician: \xf0\x90\xa4\x91\xf0\x90\xa4\x85\xf0\x90\xa4\x93, \xe1\xb9\xa2ur;
Phoenician: \xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd, \xe1\xb9\xa2ur;
And
Akkadian: \xf0\x92\x80\xab\xf0\x92\x8a\x92, \xe1\xb9\xa2urru;
Akkadian: \xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd, \xe1\xb9\xa2urru;
Is there a way for Python to read the file using the latter encoding of length 266? And how can one detect/determine the proper encoding system from the UTF-8 bytes above?
The full UTF-8 encoding for each case is shown below for further investigation.
From file
b'Tyre (Arabic: \xd8\xb5\xd9\x88\xd8\xb1\xe2\x80\x8e\xe2\x80\x8e, \xe1\xb9\xa2\xc5\xabr; Phoenician: \xf0\x90\xa4\x91\xf0\x90\xa4\x85\xf0\x90\xa4\x93, \xe1\xb9\xa2ur; Hebrew: \xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e, Tsor; Tiberian Hebrew \xd7\xa6\xd6\xb9\xd7\xa8\xe2\x80\x8e, \xe1\xb9\xa2\xc5\x8dr; Akkadian: \xf0\x92\x80\xab\xf0\x92\x8a\x92, \xe1\xb9\xa2urru; Greek: \xce\xa4\xcf\x8d\xcf\x81\xce\xbf\xcf\x82, T\xc3\xbdros; Turkish: Sur; Latin: Tyrus, Armenian \xd5\x8f\xd5\xab\xd6\x80 [Dir]), sometimes romanized as Sour, is a city in the South Governorate of Lebanon.'
From copy and paste
b'Tyre (Arabic: \xd8\xb5\xd9\x88\xd8\xb1\xe2\x80\x8e\xe2\x80\x8e, \xe1\xb9\xa2\xc5\xabr; Phoenician: \xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd, \xe1\xb9\xa2ur; Hebrew: \xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e, Tsor; Tiberian Hebrew \xd7\xa6\xd6\xb9\xd7\xa8\xe2\x80\x8e, \xe1\xb9\xa2\xc5\x8dr; Akkadian: \xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd, \xe1\xb9\xa2urru; Greek: \xce\xa4\xcf\x8d\xcf\x81\xce\xbf\xcf\x82, T\xc3\xbdros; Turkish: Sur; Latin: Tyrus, Armenian \xd5\x8f\xd5\xab\xd6\x80 [Dir]), sometimes romanized as Sour, is a city in the South Governorate of Lebanon.'
You probably don't have Phoenician fonts installed on your system, so the web browser (as @lenz mentioned in the comments) displays placeholder glyphs instead. Python loads your string properly.
There are 5 problematic characters in the text: 3 Phoenician and 2 Akkadian:
The first character of the problematic part with Phoenician symbols is 'Phoenician Letter Sade' (https://unicode-table.com/en/10911/) -- it spans 4 bytes in UTF-8: F0 90 A4 91
It is followed with 'Phoenician Letter Wau' (https://unicode-table.com/en/10905/) -- again 4 bytes: F0 90 A4 85
The third letter is 'Phoenician Letter Rosh' (https://unicode-table.com/en/10913/) -- it uses 4 bytes as well: F0 90 A4 93
(I omit the Akkadian ones.)
Each of those letters is replaced in your copy-and-pasted bytes by \xef\xbf\xbd\xef\xbf\xbd, which corresponds to �� (two U+FFFD replacement characters).
Each problematic letter somehow gets replaced by two � signs, so the total length of the string increases by 5, from 262 to 267 characters.
It turns out I found a different viewpoint from which to answer this question. C# does report a longer length for the string, but that does not mean it is incorrect; the underlying encoding system is simply different and has its own limitations.
http://illegalargumentexception.blogspot.com/2010/04/i18n-comparing-character-encoding-in-c.html
Python C# - Unicode character is not the same on Python and C#
When reading a file and decoding to Unicode, C# and Java store Unicode strings internally encoded as UTF-16. Code points outside the basic multilingual plane (BMP, U+0000 to U+FFFF) use surrogates (two words) to represent a Unicode code point. The fact that you can see a Unicode code point as two words is a leaky abstraction.
Python 3.3+ hides this abstraction. It internally uses 1-, 2- or 4-byte encodings as needed to represent a Unicode string, but presents only the Unicode code points to the user.
This explains why the lengths reported by C# can be longer than those reported by Python.
How to make them congruent? Hmmm... probably not directly, but a substring search as a post-processing step can realign the two.
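If changing the C# side is an option, a more direct fix than substring search is to translate C#'s UTF-16 indexes into code point indexes, the way Python counts them. A rough sketch of my own, with made-up sample data:

using System;

class IndexMapping
{
    // Converts a UTF-16 (char) index, as C# reports it, into a
    // code point index, as Python 3 reports it.
    static int ToCodePointIndex(string s, int utf16Index)
    {
        int codePoints = 0;
        for (int i = 0; i < utf16Index; )
        {
            i += char.IsSurrogatePair(s, i) ? 2 : 1;
            codePoints++;
        }
        return codePoints;
    }

    static void Main()
    {
        string s = "a\U0001F464b";                 // 'a', U+1F464 (a surrogate pair in C#), 'b'
        Console.WriteLine(ToCodePointIndex(s, 3)); // 2 - Python sees 'b' at index 2
    }
}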

What encoding should be used to create an MS-DOS txt file using C# (UTF8Encoding vs Encoding)

I am trying to create a flat file for a legacy system, and it mandates that the data be presented in the text encoding of an MS-DOS .txt file (Text Document - MS-DOS Format, CP_OEM). I am a bit confused about files generated by the UTF8Encoding class in C# (.NET 4.0 framework); I think it produces a file in the default txt encoding (Encoding: CP_ACP).
I think the encoding names CP_ACP, Windows and ANSI refer to the same thing, that the Windows default is ANSI, and that it will omit any Unicode character information.
If I use the UTF8Encoding class in the C# library to create a text file (as below), is it going to be in the MS-DOS txt file format?
byte[] title = new UTF8Encoding(true).GetBytes("New Text File");
As per the answer supplied, it is evident that UTF-8 is NOT equivalent to the MS-DOS txt format, and one should use the Encoding.GetEncoding(850) method to get the right encoding.
I read the following posts to check on my information but nothing conclusive yet.
https://blogs.msdn.microsoft.com/oldnewthing/20120220-00?p=8273
https://blog.mh-nexus.de/2015/01/character-encoding-confusion
https://blogs.msdn.microsoft.com/oldnewthing/20090115-00?p=19483
Finally, the conclusion is to go with Encoding.GetEncoding(850) when creating the byte array to be converted back to the actual file (note: I am using a byte array so I can leverage existing middleware).
You can use the File.ReadXY(String, Encoding) and File.WriteXY(String, String[], Encoding) methods, where XY is either AllLines, Lines or AllText working with string[], IEnumerable<string> and string respectively.
MS-DOS uses different code pages. Probably code page 850 "Western European / Latin-1" or code page 437 "OEM-US / OEM / PC-8 / DOS Latin US" (as @HansPassant suggests) will be okay. If you are not sure which code page you need, create example files containing letters like ä, ö, ü, é, è, ê, ç, à or Greek letters with the legacy system and see whether they work. If you don't use such letters or other special characters, then the code page is not very critical.
File.WriteAllText(path, "Hello World", Encoding.GetEncoding(850));
The character codes from 0 to 127 (7-bit) are the same for all MS-DOS code pages, for ANSI and UTF-8. UTF files are sometimes introduced with a BOM (byte order mark).
MS-DOS knows only 8-bit characters. The codes 128 to 255 differ for the different national code pages.
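One caveat (my addition, not part of the original answer): on .NET Core and .NET 5+, the legacy code pages are not available out of the box; as far as I know you need the System.Text.Encoding.CodePages package and a provider registration first. On .NET Framework 4.0, as used in the question, Encoding.GetEncoding(850) works directly. A sketch for the newer runtimes (legacy.txt is a made-up file name):

using System.IO;
using System.Text;

class DosFile
{
    static void Main()
    {
        // Only needed on .NET Core / .NET 5+; requires the
        // System.Text.Encoding.CodePages NuGet package.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding cp850 = Encoding.GetEncoding(850);
        File.WriteAllText("legacy.txt", "Hello äöü éèê çà", cp850);

        // Round-trip check: read it back with the same code page
        // and compare with what was written.
        string roundTripped = File.ReadAllText("legacy.txt", cp850);
    }
}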
See: File Class, Encoding Class and Wikipedia: Code Page.

What happens to a null byte when converting bytes to ISO 8859-1 encoding?

I'm not entirely sure if the question even makes sense. I'm taking a byte array from an ID3 tag and converting it to a string. Most text frames in an ID3 tag use ISO 8859-1 encoding, but it depends on the frame. In any case, if you look up what 0x00 is in the ISO 8859-1 codes, it is invalid.
To further complicate things, whether due to programmer error or just poor formatting, some of the strings end in 0x00 and some do not.
When converting a series of bytes into a string using ISO 8859-1 encoding, do you have to manually check the end of the string to see if it is a null? Or will the encoding object, through whatever method it uses to convert in the first place, deal with the null properly? Furthermore, is there some sort of function that could normalize or "fix" a null-terminated string?
When you try to display these strings they do not display properly.
I am using C# for this particular project.
Some extra info here about ID3 Tags: ID3 Specs
Or am I completely misunderstanding the whole thing? Is a null terminator simply a way a particular language handles strings and it has nothing to do with encoding?
Edit: I used System.Text.Encoding.GetEncoding("iso-8859-1") followed by a GetString call
If you use Encoding.GetEncoding(28591), it just converts a byte 0 to the Unicode U+0000. Encodings generally assume that they have to convert all the bytes - they don't look for terminators.
This treatment of 0 as Unicode U+0000 is in line with the Wikipedia description:
In 1992, the IANA registered the character map ISO_8859-1:1987, more commonly known by its preferred MIME name of ISO-8859-1 (note the extra hyphen over ISO 8859-1), a superset of ISO 8859-1, for use on the Internet. This map assigns the C0 and C1 control characters to the unassigned code values thus provides for 256 characters via every possible 8-bit value.
The C0 and C1 control characters page includes:
0: Originally used to allow gaps to be left on paper tape for edits. Later used for padding after a code that might take a terminal some time to process (e.g. a carriage return or line feed on a printing terminal). Now often used as a string terminator, especially in the C programming language.
Sample code:
using System;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        byte[] data = { 0, 0 };
        Encoding latin1 = Encoding.GetEncoding(28591);
        string text = latin1.GetString(data);
        Console.WriteLine(text.Length);   // 2
        Console.WriteLine((int) text[0]); // 0
        Console.WriteLine((int) text[1]); // 0
    }
}
Happily, ASCII, ISO-8859-1 and Unicode all agree on codepoints in the range 0..127. Thus your character '\0' will be encoded identically in ASCII, ISO-8859-1 and UTF-8.
If your program assigns special semantics to the zero byte, you have to take care of that appropriately.
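If you also want to normalize strings that may or may not carry a trailing terminator, as the ID3 frames in the question do, trimming the zero characters after decoding is a simple approach. A sketch (the sample bytes are made up):

using System;
using System.Text;

class TrimNul
{
    static void Main()
    {
        // A fake ID3-style frame: "Title" followed by a 0x00 terminator.
        byte[] frame = { (byte)'T', (byte)'i', (byte)'t', (byte)'l', (byte)'e', 0 };

        Encoding latin1 = Encoding.GetEncoding(28591);
        string raw = latin1.GetString(frame);

        // Strip any trailing U+0000 terminators; strings without one
        // come back unchanged.
        string text = raw.TrimEnd('\0');

        Console.WriteLine(raw.Length);  // 6
        Console.WriteLine(text.Length); // 5
    }
}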

"Unable to translate Unicode character" error when saving to txt file

Additional information: Unable to translate Unicode character \uDFFF at index 195 to specified code page.
I made an algorithm whose results are binary values (of different lengths). I transformed them into uint, then into chars, and saved them into a StringBuilder, as you can see below:
uint n = Convert.ToUInt16(tmp_chars, 2);
_koded_text.Append(Convert.ToChar(n));
My problem is that when I try to save those values into a .txt file, I get the previously mentioned error.
StreamWriter file = new StreamWriter(filename);
file.WriteLine(_koded_text);
file.Close();
What I am saving is this: "忿췾᷿]볯褟ﶞ痢ﳻ��伞ﳴ㿯ﹽ翼蛿㐻ﰻ筹��﷿₩マ랿鳿⏟麞펿"... which is a series of weird signs.
What I need is to convert those binary values into some kind of string of chars and save them to txt. I saw somewhere that converting to UTF-8 should help, but I don't know how. Would changing the file's encoding help too?
You cannot transform binary data to a string directly. The Unicode characters in a string are encoded using UTF-16 in .NET. That encoding uses two bytes per character, providing 65536 distinct values. Unicode, however, has over one million code points. To make that work, the Unicode code points above \uffff (above the BMP, Basic Multilingual Plane) are encoded with a surrogate pair. The first one has a value between 0xd800 and 0xdbff, the second between 0xdc00 and 0xdfff. That provides 2 ^ (10 + 10) = 1 million additional codes.
You can perhaps see where this leads, in your case the code detects a high surrogate value (0xdfff) that isn't paired with a low surrogate. That's illegal. Lots more possible mishaps, several codepoints are unassigned, several are diacritics that get mangled when the string is normalized.
You just can't make this work. Base64 encoding is the standard way to carry binary data across a text stream. It uses 6 bits per character, so 3 bytes require 4 characters. The character set is ASCII, so the odds of the receiving program decoding the characters back to binary incorrectly are minimal. Only a decades-old IBM mainframe that uses EBCDIC could get you into trouble. Or just plain avoid encoding to text and keep it binary.
Since you're trying to encode binary data to a text stream, this SO question already contains an answer to the question "How do I encode something as base64?" From there, plain ASCII/ANSI text is fine for the output encoding.
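Applied to the code in the question, that means collecting the algorithm's output as raw bytes and base64-encoding them, instead of forcing them through Convert.ToChar. A sketch (the binary sample values are made up; the original algorithm produced values of varying lengths):

using System;
using System.Collections.Generic;
using System.IO;

class SaveBinaryAsText
{
    static void Main()
    {
        // Stand-in for the algorithm's output: binary strings, here 8 bits each.
        string[] binaryValues = { "01100101", "01101011", "11011001" };

        var bytes = new List<byte>();
        foreach (string bits in binaryValues)
            bytes.Add(Convert.ToByte(bits, 2)); // base-2 parse, like Convert.ToUInt16(tmp_chars, 2)

        // Base64 is plain ASCII, so it survives any text encoding.
        File.WriteAllText("output.txt", Convert.ToBase64String(bytes.ToArray()));

        // To recover the original bytes later:
        byte[] decoded = Convert.FromBase64String(File.ReadAllText("output.txt"));
    }
}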
