I am using a Node.js chat server which sends ASCII messages containing user IDs, and Unicode (UTF-16 little-endian) messages as text messages. How can I determine the encoding of a given message in the C# client?
Any given ASCII character has the same byte value in Unicode (if you're using UTF-8). So the only way to know is if one of the message characters is not an ASCII character.
A better way to handle this is to use a flag (a bit, a character, or multiple characters) that indicates whether something is a user ID or a text message.
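A minimal sketch of that flag approach, assuming the server prepends a one-byte type marker to each frame (the 0x01/0x02 values and the names are illustrative, not part of any actual protocol):

using System;
using System.Text;

class MessageParser
{
    const byte UserIdMessage = 0x01; // payload encoded as ASCII
    const byte TextMessage   = 0x02; // payload encoded as UTF-16 LE

    static string Parse(byte[] frame)
    {
        if (frame.Length < 1)
            throw new ArgumentException("Empty frame");

        // Everything after the type marker is the message payload.
        var payload = new byte[frame.Length - 1];
        Array.Copy(frame, 1, payload, 0, payload.Length);

        return frame[0] switch
        {
            UserIdMessage => Encoding.ASCII.GetString(payload),
            TextMessage   => Encoding.Unicode.GetString(payload), // UTF-16 LE
            _             => throw new ArgumentException("Unknown message type"),
        };
    }
}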
I have a project where everything that is stored in the database is encrypted. For encoding we use System.Text.Encoding.Default.GetBytes(text).
The problem is that the client now wants to add support for Polish (and other Nordic) characters, and the Default encoding doesn't work: the Polish characters get converted to English ones (e.g. Ą gets converted to A).
I can't simply switch the encoding (Unicode seems to work), as the previously stored data would then be lost.
Is there any way to get around this and add support for new characters while keeping the old data?
To be clear: you realise that "encoding" is not "encrypting", but I suppose you encrypt the byte array you get from encoding your string data?
Then I'd suggest either decrypting, re-encoding, and re-encrypting all existing data using UTF-8 (the most efficient encoding for Western alphabets), or adding a "version" or "encoding" column indicating which encoding the data was encrypted with.
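A sketch of the migration route; Decrypt/Encrypt are placeholders for whatever cipher the project actually uses (hypothetical names, not from the original post):

using System.Text;

class EncodingMigration
{
    static byte[] MigrateRow(byte[] storedValue)
    {
        byte[] legacyBytes = Decrypt(storedValue);

        // Decode with the encoding the data was originally written in...
        string text = Encoding.Default.GetString(legacyBytes);

        // ...then re-encode as UTF-8 before encrypting again.
        return Encrypt(Encoding.UTF8.GetBytes(text));
    }

    // Placeholder stubs; substitute the real encryption routines.
    static byte[] Decrypt(byte[] data) => data;
    static byte[] Encrypt(byte[] data) => data;
}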
I have text which is encoded in UTF-8, well formatted. I can't predict what languages the text will be in.
To print this text on a receipt thermal printer, I need to choose the best encoding to display the text with and convert to it. Unfortunately, UTF-8 is not supported. All available encodings can represent only a subset of characters.
So I would need to find the best option (e.g. from the list mentioned here: https://reference.epson-biz.com/modules/ref_escpos/index.php?content_id=321) to convert the data to, losing as few characters as possible in the process (I am aware I will not be able to print Cyrillic together with Thai).
It is not a "guess the encoding" question, but rather choosing the best encoding for minimum loss of representable characters.
Has anybody seen a good approach?
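One approach, as a sketch: for each candidate code page from the printer's supported list, count how many characters fail to round-trip, and pick the page that loses the fewest. The candidate numbers below are an illustrative subset, not the full ESC/POS list; best-fit mappings are disabled so lost characters are actually counted as lost.

using System;
using System.Linq;
using System.Text;

class BestCodePagePicker
{
    static Encoding GetStrict(int codePage) =>
        Encoding.GetEncoding(codePage,
            new EncoderReplacementFallback("?"),
            DecoderFallback.ReplacementFallback);

    static int CountLostChars(string text, Encoding enc)
    {
        int lost = 0;
        foreach (char c in text)
        {
            // A representable character survives the round trip unchanged.
            string back = enc.GetString(enc.GetBytes(c.ToString()));
            if (back != c.ToString()) lost++;
        }
        return lost;
    }

    static void Main()
    {
        // Needed on .NET Core / .NET 5+ for legacy code pages.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        int[] candidates = { 437, 852, 866, 1252 }; // illustrative subset
        string text = "Zażółć gęślą jaźń";

        Encoding best = candidates
            .Select(GetStrict)
            .OrderBy(enc => CountLostChars(text, enc))
            .First();

        Console.WriteLine($"Best: {best.CodePage}, loses {CountLostChars(text, best)} chars");
    }
}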
I have to read DataMatrix barcodes (VDA 4902, GTIN, GS1) which use non-printable characters as separators.
The goal is to scan the barcode with Intermec or Honeywell hardware and send it to a C# MVC web application.
The printable characters are received by the web application, but the non-printable characters are not.
I've scanned the code into the vi editor on a Linux server; there I can see the special characters. But I couldn't get it to work with an ASP.NET application, nor with a C# Windows Forms application.
So currently I don't know where to look...
Most likely, if you are passing values to another page or web service, you are forgetting the step of properly encoding the characters you are sending. You should probably look at using something like System.Web.HttpServerUtility.HtmlEncode. This function converts special characters in the value you are sending to an alternate representation that gets decoded on the receiving end.
Depending on other specifics you did not elaborate on in your original question, there are many other ways to encode/escape characters for purposes like this. But the above is what I would suggest starting with if you are unsure.
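As one concrete illustration, here is a sketch using percent-encoding (Uri.EscapeDataString) rather than the HtmlEncode call mentioned above; it keeps non-printable separators such as the GS1 group separator (0x1D) intact across the request. The sample payload is made up:

using System;

class BarcodeEscaping
{
    static void Main()
    {
        // Raw scan containing a GS (0x1D) separator, as GS1 barcodes use.
        string rawScan = "01012345678901281719112510AB12" + '\u001D' + "21SER123";

        string escaped = Uri.EscapeDataString(rawScan);  // GS becomes %1D
        Console.WriteLine(escaped);

        // On the receiving end, decode back to the original value.
        string restored = Uri.UnescapeDataString(escaped);
        Console.WriteLine(restored == rawScan);          // True
    }
}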
I have a website where a user can upload a txt file of data and the data will be imported into the db. However, some users are uploading the data in UTF-8, and others are uploading it in UTF-16.
byte[] fileData = new byte[length];
uploader.PostedFile.InputStream.Read(fileData, 0, length);
data = TLCommon.EncodeJsString(System.Text.Encoding.UTF8.GetString(fileData));
When the file is saved in UTF-16 and uploaded, the data is garbage. How can I handle this situation?
There are various heuristics you can employ, such as checking for a high percentage of 00 bytes in the stream. (These won't be present in UTF-8, but are common in UTF-16 text that contains ASCII characters.)
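For example, a rough sketch of that heuristic (the 20% threshold is an arbitrary illustrative value, not a standard one):

using System;
using System.Linq;
using System.Text;

class EncodingSniffer
{
    static Encoding GuessEncoding(byte[] data)
    {
        if (data.Length == 0) return Encoding.UTF8;

        double zeroRatio = data.Count(b => b == 0) / (double)data.Length;
        if (zeroRatio > 0.2)
        {
            // For ASCII text in UTF-16, zeros at even indices suggest
            // big-endian; zeros at odd indices suggest little-endian.
            bool evenZeros = data.Where((b, i) => i % 2 == 0).Count(b => b == 0)
                           > data.Where((b, i) => i % 2 == 1).Count(b => b == 0);
            return evenZeros ? Encoding.BigEndianUnicode : Encoding.Unicode;
        }
        return Encoding.UTF8;
    }

    static void Main()
    {
        byte[] utf16Sample = Encoding.Unicode.GetBytes("hello");
        Console.WriteLine(GuessEncoding(utf16Sample).WebName); // utf-16
    }
}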
Such a check, however, can't distinguish between UTF-8 and Windows-1252, two incompatible encodings that are both very common on U.S. English Windows systems. You can add more checks, such as looking for byte sequences that are invalid in one encoding but not the other, but this gets complex quickly and typically still can't distinguish between different single-byte encodings.
Microsoft provides a library named MLang, which can automatically detect UTF-8, UTF-16, and many 8-bit codepages using statistical analysis of the bytes in the stream. Its accuracy is quite good if it has a large-enough sample of text to work with. I blogged about how to use this method, and posted the full source code on GitHub.
There are a few options: check whether the Content-Type header includes a charset parameter indicating the encoding (e.g. Content-Type: text/plain; charset=utf-16); check whether the uploaded data starts with a BOM (the encoded form of the Unicode character U+FEFF: 2 bytes for UTF-16, 3 for UTF-8); or, if you know something about the file (for example that the first character must be ASCII, as in XML, which starts with '<'), use that to work out the encoding. If none of that information is available, you'll have to guess using heuristics.
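A minimal BOM check along those lines; it returns null when no BOM is present, in which case you fall back to heuristics:

using System;
using System.Text;

class BomDetector
{
    static Encoding DetectFromBom(byte[] data)
    {
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
            return Encoding.UTF8;
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
            return Encoding.Unicode;            // UTF-16 LE
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
            return Encoding.BigEndianUnicode;   // UTF-16 BE
        return null; // no BOM: fall back to heuristics
    }

    static void Main()
    {
        byte[] sample = { 0xFF, 0xFE, 0x41, 0x00 }; // BOM + 'A' in UTF-16 LE
        Console.WriteLine(DetectFromBom(sample)?.WebName ?? "no BOM");
    }
}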
What's the relation between the CharacterSet here:
http://msdn.microsoft.com/en-us/library/ms709353(VS.85).aspx
and the ASCII encoding here:
http://msdn.microsoft.com/en-us/library/system.text.asciiencoding.getbytes(VS.71).aspx
ANSI is the current Windows ANSI code page, equivalent to Encoding.Default.
OEM is the current OEM code page typically used by console applications.
You can get this using:
Encoding.GetEncoding(CultureInfo.CurrentCulture.TextInfo.OEMCodePage)
In a console application, the OEM encoding will also be available using
Console.OutputEncoding
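For reference, here's all of that in one runnable snippet. Note that on .NET Core / .NET 5+, Encoding.Default returns UTF-8 rather than the ANSI code page, so the ANSI line reflects .NET Framework behaviour:

using System;
using System.Globalization;
using System.Text;

class AnsiOemDemo
{
    static void Main()
    {
        // On .NET Core / .NET 5+ the OEM code pages need this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // ANSI: the current Windows ANSI code page (.NET Framework behaviour).
        Encoding ansi = Encoding.Default;

        // OEM: the code page typically used by console applications.
        Encoding oem = Encoding.GetEncoding(
            CultureInfo.CurrentCulture.TextInfo.OEMCodePage);

        Console.WriteLine($"ANSI code page: {ansi.CodePage}");
        Console.WriteLine($"OEM code page:  {oem.CodePage}");
        Console.WriteLine($"Console output: {Console.OutputEncoding.CodePage}");
    }
}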
This is really, really ancient. ODBC dates from the stone age, back when Windows started taking over from MS-DOS. Back then, lots of text was still encoded in the original IBM-PC character set, named the "OEM character set" by Microsoft. The standard IBM-PC set had some accented characters and pseudo-graphics glyphs in the upper half, codes 0x80-0xFF.
Because that set was too limited for text in non-English languages, Microsoft started using code pages: ranges of character glyphs suitable for a certain language group. The American English character set was standardized by ANSI, and that label is now attached (incorrectly) to any non-OEM code page.
Nobody encodes text in the OEM character set anymore; it went the way of the dodo at least 10 years ago. The proper setting here is ANSI, while keeping your fingers crossed that the code page used to encode the text matches your system's default code page. That approach is headed for the dodo too; Unicode solved the problem.
The short answer to your question: there's no direct relation.
The longer version:
CharacterSet for the "Schema.ini" file can be either ANSI or OEM.
ANSI and ASCII refer to different things.
You can read more of it here:
Understanding ASCII and ANSI Characters
ASCII vs ANSI Encoding by Alex Hoffman
From my understanding, CharacterSet=ANSI is equivalent to Encoding.Default. OEM might be ASCIIEncoding then.
However, ANSI uses the system ANSI code page, so incompatibilities may arise if the same file is accessed from computers with different code pages.
I've compiled my own reference in order to switch between the two:
Windows code page   Name           System.Text.Encoding   schema.ini CharacterSet
20127               ASCII (US)     ASCII                  20127
1252                ANSI Latin I   Default                ANSI
65001               UTF-8          UTF8                   65001
1200                UTF-16 LE      Unicode                Unicode
1201                UTF-16 BE      BigEndianUnicode       1201
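A small sketch putting the table to use; the code-page choices are illustrative, and the 1252 mapping for CharacterSet=ANSI only holds on a Western-European Windows system:

using System;
using System.Text;

class SchemaIniMap
{
    static void Main()
    {
        // Legacy code pages need this provider on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // schema.ini CharacterSet=ANSI on a Western-European system:
        Encoding ansi = Encoding.GetEncoding(1252);
        // schema.ini CharacterSet=Unicode:
        Encoding utf16le = Encoding.GetEncoding(1200);

        Console.WriteLine(ansi.WebName);    // windows-1252
        Console.WriteLine(utf16le.WebName); // utf-16
    }
}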