I'm using a Windows service written in C# to send a raw string to a receipt printer that uses escape sequences to trigger features within the printer, such as:
\x1b\x64\x02
to operate the paper cutter.
I'm using a receipt template file on another server and placing the text, base64-encoded, into XML to hand it to the Windows service. The problem is that the escape text is printed literally on the receipt after it is extracted from the XML; the escape characters are not being translated.
I'm guessing that when I type those escape sequences into Visual Studio, the compiler translates them into the actual control characters right there in the string (because it works when I do it that way, and the paper cutter triggers).
How would I translate a string containing escape sequences into the proper format for use within the C# Windows service? The goal is to type these simple escape sequences into a template file and then convert them properly on the other end.
C# code I have at the moment, which tries to decode the base64 and then convert to ANSI encoding:
byte[] data = Convert.FromBase64String(content);
content = Encoding.Default.GetString(data);
byte[] utf8Bytes = Encoding.UTF8.GetBytes(content);
content = Encoding.GetEncoding("Windows-1252").GetString(utf8Bytes);
return RawPrinterHelper.SendStringToPrinter(getConfig().receiptPrinter, content);
Here is the content of the test receipt that is received by the service:
XHgxYlx4MWRceDYxXHgxIE1OSG9tZU91dGxldC5jb20yMzAwIFdlc3QgSGlnaHdheSAxMw0KQnVybnN2aWxsZSwgTU4gNTUzMzcNCjk1Mi0yNzktMTU4Nw0KDQpceDFiXHgxZFx4NjFceDBceDFiXHg0NFx4Mlx4MTBceDIyXHgwIERhdGU6IDIwMTctMDYtMTVceDkgVGltZTogMzo0NmFtLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tIA0KXHgxYlx4NDUgU0FMRSANClx4MWJceDQ2
When I base64-encode \x7, for example:
From within the service: Bw== (as ASCII)
From within the service: BwA= (as Unicode)
From a string in PHP: XHg3
So I've found a solution that requires a small amount of change to the template:
On the template side, using PHP, I convert the \x7-style references (written as HTML entities, e.g. &#x7;) into the actual characters before I base64-encode the string:
$receipt_str = mb_convert_encoding($receipt_str, 'UTF-8', 'HTML-ENTITIES');
This is working so far. I'm just not sure if there is a better way or if there are limits to this.
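For reference, an alternative that needs no template change at all would be to translate the literal \xNN sequences on the service side after base64-decoding. A minimal sketch (DecodeEscapes and xmlValue are my own names, not part of the original service; assumes using System.Text and System.Text.RegularExpressions):
static string DecodeEscapes(string s)
{
    // Replace each literal \xNN (one or two hex digits) with the character it names,
    // e.g. the four characters "\x1b" become the single ESC control character.
    return Regex.Replace(s, @"\\x([0-9A-Fa-f]{1,2})",
        m => ((char)Convert.ToInt32(m.Groups[1].Value, 16)).ToString());
}
// Usage: decode the base64 payload with a single-byte encoding, then translate the escapes.
string content = Encoding.GetEncoding("Windows-1252").GetString(Convert.FromBase64String(xmlValue));
content = DecodeEscapes(content);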
Related
I have a string that I receive from a third party app and I would like to display it correctly in any language using C# on my Windows Surface.
Due to incorrect encoding, a piece of my string looks like this in Farsi (Persian-Arabic):
Ù…Ø¯Ù„-Ø±Ù†Ú¯-Ù…ÙˆÛŒ-Ø¬Ø¯ÛŒØ¯-5-436x500
whereas it should look like this:
مدل-رنگ-موی-جدید-5-436x500
This link converts it correctly:
http://www.ltg.ed.ac.uk/~richard/utf-8.html
How can I do it in C#?
It is very hard to tell exactly what is going on from the description of your question. We would all be much better off if you provided an example of what is happening using a single character instead of a whole string, and if you chose an example character which does not belong to some exotic character set, for example the bullet character (U+2022) or something like that.
Anyhow, what is probably happening is this:
The letter "ر" is represented in UTF-8 as the byte sequence D8 B1, but what you see is "Ø±", and that's because in UTF-16 Ø is U+00D8 and ± is U+00B1. So the incoming text was originally in UTF-8, but in the process of importing it into a .NET Unicode string in your application it was incorrectly interpreted as being in some 8-bit character set such as ANSI or Latin-1. That's why you now have a Unicode string which appears to contain garbage.
However, the process of converting 8-bit characters to Unicode is for the most part not destructive, so all of the information is still there, that's why the UTF-8 tool that you linked to can still kind of make sense out of it.
What you need to do is convert the string back to an array of ANSI (or Latin-1, whatever) bytes, and then re-construct the string the right way, which is a conversion of UTF-8 to Unicode.
I cannot easily reproduce your situation, so here are some things to try:
byte[] bytes = System.Text.Encoding.Default.GetBytes( garbledUnicodeString ); // Encoding.Default uses the system's ANSI code page
followed by
string properUnicodeString = System.Text.Encoding.UTF8.GetString( bytes );
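Putting it together with the example above (a minimal sketch; the garbled literal assumes the UTF-8 bytes were mis-read as Windows-1252):
// "Ù…Ø¯Ù„" is "مدل" whose UTF-8 bytes were decoded as Windows-1252.
string garbled = "Ù…Ø¯Ù„";
byte[] bytes = System.Text.Encoding.GetEncoding(1252).GetBytes(garbled);
string proper = System.Text.Encoding.UTF8.GetString(bytes); // "مدل"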
I am trying to create a flat file for a legacy system, and they mandate that the data be presented in the text encoding of an MS-DOS .txt file (Text Document - MS-DOS Format, CP_OEM). I am a bit confused, because files generated using the UTF8Encoding class in C# (.NET 4.0 framework) seem to produce the default txt format (Encoding: CP_ACP).
I think the encoding names CP_ACP, Windows and ANSI refer to the same thing, that the Windows default is ANSI, and that it will omit any Unicode character information.
If I use the UTF8Encoding class in C# to create a text file (as below), is it going to be in the MS-DOS txt file format?
byte[] title = new UTF8Encoding(true).GetBytes("New Text File");
As per the answer supplied, it is evident that UTF-8 is NOT equivalent to the MS-DOS txt format, and one should use the Encoding.GetEncoding(850) method to get the right encoding.
I read the following posts to check on my information but nothing conclusive yet.
https://blogs.msdn.microsoft.com/oldnewthing/20120220-00?p=8273
https://blog.mh-nexus.de/2015/01/character-encoding-confusion
https://blogs.msdn.microsoft.com/oldnewthing/20090115-00?p=19483
Finally, the conclusion is to go with Encoding.GetEncoding(850) when creating the byte array to be converted back into the actual file (note: I am using a byte array since I can leverage existing middleware), as sketched below.
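A minimal sketch of that byte-array step (the file name and text are placeholders):
// Encode the text with the DOS Latin-1 code page (850) instead of UTF-8.
byte[] data = System.Text.Encoding.GetEncoding(850).GetBytes("Fachgeschäft");
System.IO.File.WriteAllBytes(@"C:\temp\legacy.txt", data);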
You can use the File.ReadXY(String, Encoding) and File.WriteXY(String, String[], Encoding) methods, where XY is either AllLines, Lines or AllText working with string[], IEnumerable<string> and string respectively.
MS-DOS uses different code pages. Probably code page 850 "Western European / Latin-1" or code page 437 "OEM-US / OEM / PC-8 / DOS Latin US" (as @HansPassant suggests) will be okay. If you are not sure which code page you need, create example files containing letters like ä, ö, ü, é, è, ê, ç, à or Greek letters with the legacy system and see whether they work (see the sketch after this answer). If you don't use such letters or other special characters, then the code page is not very critical.
File.WriteAllText(path, "Hello World", Encoding.GetEncoding(850));
The character codes from 0 to 127 (7-bit) are the same for all MS-DOS code pages, for ANSI and UTF-8. UTF files are sometimes introduced with a BOM (byte order mark).
MS-DOS knows only 8-bit characters. The codes 128 to 255 differ for the different national code pages.
See: File Class, Encoding Class and Wikipedia: Code Page.
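Following the suggestion above, here is a minimal sketch for producing such probe files (paths and letters are placeholders):
// Write the same probe text in code pages 850 and 437;
// check which one the legacy system reads correctly.
string probe = "äöüéèêçà";
System.IO.File.WriteAllText(@"C:\temp\probe850.txt", probe, System.Text.Encoding.GetEncoding(850));
System.IO.File.WriteAllText(@"C:\temp\probe437.txt", probe, System.Text.Encoding.GetEncoding(437));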
I need your help concerning something which bothers me when working with Unicode encoding in the .NET Framework...
I have to interface with some customer data systems which are non-Unicode applications, and those customers are worldwide companies (Chinese, Korean, Russian, ...). So they have to provide me an 8-bit ASCII file, which will be encoded with their Windows code page.
So, if a Greek customer sends me a text file containing 'Σ' (the sigma letter, '\u03A3') in a product name, I will get the equivalent letter at ANSI code point 211, represented in my own code page. My computer is a French Windows, which means the code page is Windows-1252, so I will see 'Ó' in this text file... Ok.
I know this customer is a Greek one, so I can read his file by forcing the windows-1253 code page in my import parameters.
/// <summary>
/// Convert a string read with a given ANSI code page to Unicode encoding
/// </summary>
/// <param name="value">The ANSI-decoded string to convert</param>
/// <param name="codePage">The Windows code page the customer file was written with</param>
/// <returns>The converted Unicode string</returns>
public static string ToUnicode(string value, int codePage)
{
    Encoding windows = Encoding.Default;
    Encoding unicode = Encoding.Unicode;
    Encoding sp = Encoding.GetEncoding(codePage);
    if (sp != null && !String.IsNullOrEmpty(value))
    {
        // First get the bytes in the Windows default encoding
        byte[] wbytes = windows.GetBytes(value);
        // Check if the code page to use is different from the current Windows one
        if (windows.CodePage != sp.CodePage)
        {
            // Convert to Unicode using the supplied code page
            byte[] ubytes = Encoding.Convert(sp, unicode, wbytes);
            return unicode.GetString(ubytes);
        }
        else
        {
            // Directly convert to Unicode using the Windows code page
            byte[] ubytes = Encoding.Convert(windows, unicode, wbytes);
            return unicode.GetString(ubytes);
        }
    }
    else
    {
        return value;
    }
}
Well, in the end I get 'Σ' in my application and I am able to save it into my SQL Server database. Now my application has to perform some complex computations, and then I have to give this file back to the customer with an automatic export...
So my problem is that I have to perform a Unicode => ANSI conversion?! But this is not as simple as I thought at the beginning...
I don't want to store the code page used during import, so my first idea was to convert Unicode to Windows-1252 and then automatically send the file to the customers. They will read the exported text file with their own code page, so this idea was interesting to me.
But the problem is that the conversion in this direction behaves strangely... Here are two different examples:
1st example (я)
char ya = '\u042F';
string strYa = Char.ConvertFromUtf32(ya);
System.Text.Encoding unicode = System.Text.Encoding.Unicode;
System.Text.Encoding ansi1252 = System.Text.Encoding.GetEncoding(1252);
System.Text.Encoding ansi1251 = System.Text.Encoding.GetEncoding(1251);
string strYa1252 = ansi1252.GetString(System.Text.Encoding.Convert(unicode, ansi1252, unicode.GetBytes(strYa)));
string strYa1251 = ansi1251.GetString(System.Text.Encoding.Convert(unicode, ansi1251, unicode.GetBytes(strYa)));
So strYa1252 contains '?', whereas strYa1251 contains the valid char 'я'. So it seems it is impossible to convert to ANSI if the right code page is not passed to the Convert() function... So nothing in the Unicode Encoding class helps the user to get equivalences between ANSI and Unicode code points? :\
2nd example (Σ)
char sigma = '\u03A3';
string strSigma = Char.ConvertFromUtf32(sigma);
System.Text.Encoding unicode = System.Text.Encoding.Unicode;
System.Text.Encoding ansi1252 = System.Text.Encoding.GetEncoding(1252);
System.Text.Encoding ansi1253 = System.Text.Encoding.GetEncoding(1253);
string strSigma1252 = ansi1252.GetString(System.Text.Encoding.Convert(unicode, ansi1252, unicode.GetBytes(strSigma)));
string strSigma1253 = ansi1253.GetString(System.Text.Encoding.Convert(unicode, ansi1253, unicode.GetBytes(strSigma)));
This time, I have the correct 'Σ' in the strSigma1253 string, but I also have 'S' in strSigma1252. As indicated at the beginning, I expected 'Ó' if an ANSI code had been found, or '?' if the character had not been found, but not 'S'. Why?
Yes, of course, a linguist could say that 'S' is equivalent to the Greek sigma character because they sound the same in both alphabets, but they don't have the same ANSI code!
So how can the Convert() function in the .NET framework manage this kind of equivalence?
And does someone have an idea to write back ANSI characters from UNICODE in text files I have to send to customers?
I should have ...'?' if the character has not been found, but not 'S'. Why?
This is known as 'best-fit' encoding, and in most cases it's a bad thing. When Windows can't encode a character to the target code page (because Σ does not exist in code page 1252), it makes best efforts to map the character to something a bit like it. This can mean losing the diacritical marks (ë→e), or mapping to a cognate (Σ→S), a character that's related (≤→=), a character that's unrelated but looks a bit similar (∞→8), or whatever other madcap replacement seemed like a good idea at the time but turns out to be culturally or mathematically offensive in practice.
You can see the tables for cp1252, including that Sigma mapping, here.
Apart from being a silent mangling of dubious usefulness, it also has some quite bad security implications. You should be able to stop it happening by setting EncoderFallback to ReplacementFallback or ExceptionFallback.
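For example (a minimal sketch; the variable names are mine), Encoding.GetEncoding has an overload taking explicit fallbacks, so encoding 'Σ' to code page 1252 then yields '?' or throws instead of silently producing 'S':
// Replace unmappable characters with '?' instead of best-fit mapping.
Encoding strict1252 = Encoding.GetEncoding(1252,
    new EncoderReplacementFallback("?"),
    new DecoderReplacementFallback("?"));
byte[] bytes = strict1252.GetBytes("Σ"); // yields 0x3F ('?'), not 'S'
// Or make unmappable characters throw an EncoderFallbackException.
Encoding throwing1252 = Encoding.GetEncoding(1252,
    new EncoderExceptionFallback(),
    new DecoderExceptionFallback());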
does someone have an idea to write back ANSI characters from UNICODE in text files I have to send to customers?
You'll have to keep a table of encodings for each customer. Read their input files using that encoding to decode; write their output files using the same encoding.
(For sanity, set new customers to UTF-8 and document that this is the preferred encoding.)
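A minimal sketch of such a per-customer table (the customer keys and code pages are made-up examples; assumes using System.Collections.Generic, System.IO and System.Text):
// Per-customer encodings; in practice this would be loaded from configuration.
var customerEncodings = new Dictionary<string, Encoding>
{
    ["GreekCustomer"] = Encoding.GetEncoding(1253),
    ["RussianCustomer"] = Encoding.GetEncoding(1251),
    ["NewCustomer"] = Encoding.UTF8 // preferred default for new customers
};
Encoding enc = customerEncodings["GreekCustomer"];
string text = File.ReadAllText(importPath, enc); // decode their input
File.WriteAllText(exportPath, text, enc);        // encode the export the same way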
Hi, I have a string written in ASCII code whose output is " ”˜‰ƒ ‰™˜€" (this is a name in Hebrew). How can I convert it to Hebrew letters?
.net c# winform
There are no Hebrew letters in ASCII, so you must mean ANSI. There is a default encoding for the system that is used for ANSI, and you need to know it to decode the data.
It's probably the Windows-1255 or ISO-8859-8 encoding that was used. You can use the Encoding class to decode the data. Example:
Encoding.GetEncoding("iso-8859-8").GetString(data);
If you already have a string, the problem is that you have decoded data using the wrong encoding. You have to go further back in the process before the data was a string, so that you can get the actual encoded bytes.
If you for example are reading the string from a file, you have to either read the file as bytes instead, or set the encoding that the stream reader uses to decode the file data into characters.
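For example, if the string came from a file (a minimal sketch; the path and the Windows-1255 guess are assumptions):
// Read the raw bytes and decode them with the Hebrew ANSI code page.
byte[] data = File.ReadAllBytes(@"C:\temp\name.txt");
string hebrew = Encoding.GetEncoding("windows-1255").GetString(data);
// Or let the StreamReader decode while reading:
using (var reader = new StreamReader(@"C:\temp\name.txt", Encoding.GetEncoding("windows-1255")))
{
    string line = reader.ReadLine();
}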
Has anyone noticed that if you retrieve HTML from the clipboard, it gets the encoding wrong and injects weird characters?
For example, executing a command like this:
string s = (string)Clipboard.GetData(DataFormats.Html);
Results in stuff like:
<FONT size=-2>Â Â <A href="/advanced_search?hl=en">Advanced
Search</A><BR>Â Â Preferences<BR>Â Â <A
href="/language_tools?hl=en">Language
Tools</A></FONT>
Not sure how Markdown will process this, but there are weird characters in the resulting markup above.
It appears that the bug is with the .NET framework. What do you think is the best way to get correctly-encoded HTML from the clipboard?
In your case it is not as visible as it was in mine. Today I tried to copy data from the clipboard and there were a few Unicode characters in it. The data I got looked as if I had read a UTF-8 encoded file using the Windows-1250 encoding (the local encoding on my Windows).
It seems your case is the same. If you save the HTML data in Windows-1252 (or Windows-1250; both work), remembering to put a non-breaking space (0xA0) after the Â character rather than a standard space, and then open this file as UTF-8, you will see what should be there.
For another project of mine I wrote a function that fixes data with corrupted encoding.
In this case a simple conversion should be sufficient:
byte[] data = Encoding.Default.GetBytes(text);
text = Encoding.UTF8.GetString(data);
My original function is a little more complex and contains tests to ensure that the data are not corrupted...
public static bool FixMisencodedUTF8(ref string text, Encoding encoding)
{
    if (string.IsNullOrEmpty(text))
        return false;
    byte[] data = encoding.GetBytes(text);
    // there should not be any character outside the source encoding
    string newStr = encoding.GetString(data);
    if (!string.Equals(text, newStr)) // if there is any character "outside"
        return false;                 // leave, the input is in a different encoding
    // IsValidUtf8 is my own validator (not shown here); it tests that the
    // bytes form a valid UTF-8 sequence.
    if (IsValidUtf8(data) == 0)
        return false;                 // if not, we cannot convert to UTF-8
    text = Encoding.UTF8.GetString(data);
    return true;
}
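Typical usage (a sketch; it assumes the clipboard text was mis-decoded with the local Windows-1250 code page, as described above):
string text = (string)Clipboard.GetData(DataFormats.Html);
if (FixMisencodedUTF8(ref text, Encoding.GetEncoding(1250)))
{
    // text now holds the correctly decoded UTF-8 content
}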
I know that this is not the best (or most correct) solution, but I did not find any other way to fix the input...
EDIT (July 20, 2017):
It seems that Microsoft has already found this error, and it now works correctly. I'm not sure whether the problem is in some frameworks, but I know for sure that the application now uses a different framework than at the time I wrote this answer. (Now it is 4.5; the previous version was 2.0.)
(Now all my code fails when parsing the data. There is the further problem of determining the correct behaviour for applications with the fix already applied and without it.)
You have to interpret the data as UTF-8. See MS Office hyperlinks change code page?.
The DataFormats.Html specification states it's encoded in UTF-8. But there's a bug in the .NET 4 Framework and lower: it actually reads the UTF-8 data as Windows-1252.
You get a lot of wrongly decoded characters, leading to funny/bad text such as
'Å','‹','Å’','Ž','Å¡','Å“','ž','Ÿ','Â','¡','¢','£','¤','Â¥','¦','§','¨','©'
A full explanation can be found here:
Debugging Chart Mapping Windows-1252 Characters to UTF-8 Bytes to Latin-1 Characters
Solution: create a translation dictionary and search and replace, as sketched below.
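A minimal sketch of that search-and-replace idea (the entries below are just a few examples from such a chart; re-decoding the whole string as UTF-8, as in the answer above, achieves the same in one step):
// Map mis-decoded Windows-1252 byte sequences back to the intended characters.
// Check longer keys first if some keys are prefixes of others.
string html = (string)Clipboard.GetData(DataFormats.Html);
var fixes = new Dictionary<string, string>
{
    ["â€™"] = "\u2019", // right single quotation mark
    ["â€œ"] = "\u201C", // left double quotation mark
    ["Ã©"] = "é",
    ["Â"] = ""          // stray byte in front of non-breaking spaces
};
foreach (var pair in fixes)
    html = html.Replace(pair.Key, pair.Value);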
I don't know what your original source document is, but be aware that Word and Outlook provide several versions of the clipboard contents in different encodings. One is usually Windows-1252 and another is UTF-8. Possibly you're grabbing the UTF-8 encoded version by default when you're expecting the Windows-1252 one (Latin-1 plus smart quotes)? Non-ASCII characters would show up as multiple odd Latin-1 accented characters. Most smart quotes are not in the Latin-1 set and are often three bytes in UTF-8.
Can you specify which encoding you want the clipboard contents in?
Try this:
System.Windows.Forms.Clipboard.GetText(System.Windows.Forms.TextDataFormat.Html);