Get all text encodings in a Universal Windows app - C#

I want to make a Windows application that converts the text encoding of a .txt file, so I need to get all text encodings supported by Windows.
I looked for the Encoding.GetEncodings() method, but it does not exist in Universal Windows Apps.
So I tried to get encodings by code page; this is my code:
List<string> list = new List<string>();
List<string> errors = new List<string>();
int[] code_page = { 0, 1200, 1201, 1252, 10003, 10008, 12000, 12001, 20127, 20936, 20949, 28591, 28598, 38598, 50220, 50221,
    50222, 50225, 50227, 51932, 51936, 51949, 52936, 57002, 57003, 57004, 57005, 57006, 57007, 57008, 57009, 57010, 57011, 65000, 65001 };
for (int i = 0; i < code_page.Length; i++)
{
    try
    {
        list.Add(Encoding.GetEncoding(code_page[i]).EncodingName);
    }
    catch (Exception ex)
    {
        errors.Add(code_page[i] + "\t\t" + ex.Message);
    }
}
This is the result I get:
First list (encodings):
Unicode (UTF-8)
Unicode
Unicode (Big-Endian)
Unicode (UTF-32)
Unicode (UTF-32 Big-Endian)
US-ASCII
Western European (ISO)
Unicode (UTF-7)
Unicode (UTF-8)
Errors list
37___No data is available for encoding 37. For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method.
437__No data is available for encoding 437. For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method.
500__No data is available for encoding 500. For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method.
...etc
My question: is there any way to get all Windows text encodings?
Thank you.
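For reference, the error messages above point at Encoding.RegisterProvider. A minimal sketch of that route (assuming the System.Text.Encoding.CodePages package can be referenced from the project; note that on some runtimes Encoding.GetEncodings() may still list only the built-in Unicode encodings even after registering the provider):

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); // requires the System.Text.Encoding.CodePages package

foreach (EncodingInfo info in Encoding.GetEncodings())
{
    // List whatever encodings the runtime now knows about.
    System.Diagnostics.Debug.WriteLine(info.CodePage + "\t" + info.DisplayName);
}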

Related

Printer is printing Korean characters when Arabic CodePages are specified

Printer: Yujin Thermal Printer
Library: ESC-POS-.NET (C#)
Diagnostics done so far:
Printed Arabic words using Notepad / WordPad / MS Word: the Arabic characters print perfectly.
Printed Arabic characters using the wrong code page: question marks are printed as placeholders for the Arabic characters.
Code:
var e = new EPSON();
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding enc = Encoding.GetEncoding(1256);
var bytes = ByteSplicer.Combine(
    e.CodePage(CodePage.WPC1256_ARABIC),
    e.CenterAlign(),
    e.PrintLine(" Test Page See the Arabic text Below WPC1256_ARABIC"),
    enc.GetBytes("طباعة صفحة إختبار "),
    e.PrintLine("\x1D\x56\x42\x00"),
    e.CodePage(CodePage.PC720_ARABIC),
    e.CenterAlign(),
    e.PrintLine(" Test Page See the Arabic text Below PC864_ARABIC"),
    enc.GetBytes("طباعة صفحة إختبار "),
    e.PrintLine("\x1D\x56\x42\x00"),
    e.CodePage(CodePage.PC864_ARABIC),
    e.CenterAlign(),
    e.PrintLine(" Test Page See the Arabic text Below PC864_ARABIC"),
    enc.GetBytes("طباعة صفحة إختبار "),
    e.PrintLine("\x1D\x56\x42\x00"),
    e.CodePage(CodePage.PC864_ARABIC));
This is the method I use to send bytes to the printer. I hope I am not losing any byte data.
public static bool SendBytesToPrinter(string szPrinterName, byte[] data)
{
    var pUnmanagedBytes = Marshal.AllocCoTaskMem(data.Length); // Allocate unmanaged memory
    Marshal.Copy(data, 0, pUnmanagedBytes, data.Length);       // Copy the bytes into unmanaged memory
    var retval = SendBytesToPrinter(szPrinterName, pUnmanagedBytes, data.Length);
    Marshal.FreeCoTaskMem(pUnmanagedBytes);                    // Free the allocated unmanaged memory
    return retval;
}
Output:
These are the Korean characters that get printed: "핸 턱한"
I have opened a GitHub issue about this as well.
Thank you.

HttpClient: Correct order to detect encoding

I'm using HttpClient to fetch some files. I put the content into a byte array (bytes). Now I need to detect the encoding. The content type will be either HTML, CSS, JavaScript or XML.
Currently I check the charset from the headers, then check for a BOM (byte order mark), before I finally check the first part of the file for a charset meta tag.
Normally this works fine, because there are no conflicts.
But: is that order correct (in case of a conflict)?
The code I currently use:
Encoding encoding;
try
{
    encoding = Encoding.GetEncoding(responseMessage.Content.Headers.ContentType.CharSet);
}
catch
{
    using (MemoryStream ms = new MemoryStream(bytes))
    {
        using (StreamReader sr = new StreamReader(ms, Encoding.Default, true))
        {
            char[] chars = new char[1024];
            sr.Read(chars, 0, 1024);
            string textDefault = new string(chars);
            if (sr.CurrentEncoding == Encoding.Default)
            {
                encoding = Global.EncodingFraContentType(textDefault);
            }
            else
            {
                encoding = sr.CurrentEncoding;
            }
        }
    }
}
responseInfo.Text = encoding.GetString(bytes);
Global.EncodingFraContentType uses a regular expression that finds the charset defined in either the XML declaration or a meta tag.
What order is the correct to detect charset/encoding?
The correct answer depends not on order, but on which actually gives the correct result, and there's no perfect answer here.
If there is a conflict, then the server has given you something incorrect. Since it's incorrect there can't be a "correct" order because there isn't a correct way of being incorrect. And, maybe the header and the embedded metadata are both wrong!
No even slightly common encoding can begin with something that looks like a UTF-8 or UTF-16 BOM and still be a valid example of the content types you mention, so if there's a BOM, the BOM wins.
(The one exception is a document so badly edited that it switches encoding part-way through, which is not unheard of, but then the content is so very buggy as to have no real meaning.)
If the content contains no octet greater than 0x7F, and the header and metadata claim it as different examples of US-ASCII, UTF-8, any of the ISO-8859 family of encodings, or any of the other encodings for which those octets all map to the same code points, then it doesn't really matter which you consider it to be, as the net result is the same. Consider it to be whatever the metadata says, as then you don't need to rewrite it to match.
If it's UTF-16 without a BOM, that will likely become clear very quickly, because those formats use a lot of characters with special meaning in the range U+0000 to U+00FF (indeed, generally U+0020 to U+007F), so you'll see long runs with a zero byte every other character.
If it has octets above 0x7F and is valid UTF-8, then it's almost certainly UTF-8. (By the same token, if it's not UTF-8 and has octets above 0x7F, then it almost certainly can't be mistaken for UTF-8.)
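A minimal sketch of that "is it valid UTF-8?" probe (an illustration, assuming a strict decoder is acceptable; the throwOnInvalidBytes flag makes decoding fail on malformed sequences):

static bool LooksLikeUtf8(byte[] bytes)
{
    try
    {
        // Strict UTF-8 decoder: throws DecoderFallbackException on any invalid sequence.
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true).GetString(bytes);
        return true;
    }
    catch (DecoderFallbackException)
    {
        return false;
    }
}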
The trickiest reasonably common case is two conflicting claims of different single-octet-per-character encodings together with an octet in the range 0x80-0xFF. This is the case you can't be sure about. If one encoding is a subset of the other (especially when the C1 controls are excluded) you could go for the superset, but that requires storing knowledge about those encodings and a considerable amount of work. Most of the time I'd be inclined to just throw an exception and, when it turns up in the logs, see if I can get the source to fix their bug or else special-case that source; but that doesn't work if you are dealing with a very large number of disparate sources that you may not have a relationship with. Alas, there is no perfect answer here.
It's also worth noting that sometimes the header and the embedded metadata agree with each other incorrectly. A common case is content in CP-1252 that is claimed to be ISO-8859-1.
According to the W3C FAQ:
If you have a UTF-8 byte-order mark (BOM) at the start of your file then recent browser versions other than Internet Explorer 10 or 11 will use that to determine that the encoding of your page is UTF-8. It has a higher precedence than any other declaration, including the HTTP header.
When it comes to the HTTP header vs. the meta element, the BOM takes precedence; the meta can take precedence as long as it appears within the first 1024 bytes, though there is no strict rule on that.
Conclusion - in order of importance:
1. Byte order mark (BOM): if present, this is authoritative, since it was added by the editor that actually saved the file (it can only be present in Unicode encodings).
2. Content-Type charset (in the header set by the server): for dynamically created/processed files it should be present (since the server knows), but it might not be for static files (the server just sends those).
3. Inline charset: for XML, HTML and CSS the encoding can be specified inside the document, in either the XML prolog, an HTML meta tag, or @charset in CSS. To read that, you need to decode the first part of the document using, for instance, the 'Windows-1252' encoding.
4. Assume UTF-8. This is the standard of the web and is by far the most used today.
If the found encoding equals 'ISO-8859-1', use 'Windows-1252' instead (required in HTML5 - read more at Wikipedia).
Now try to decode the document using the found encoding. If error handling is turned on, that might fail! In that case:
Use 'Windows-1252'. This was the standard in old Windows files and works fine as a last try (there are still a lot of old files out there). This will never throw errors, but it might of course be wrong.
I have made a method that implements this. The regex I use is able to find encodings specified as:
XML: <?xml version="1.0" encoding="utf-8"?> OR <?xml encoding="utf-8"?>
HTML: <meta charset="utf-8" /> OR <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
CSS: @charset "utf-8";
(It works with both single and double quotes.)
You will need:
using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
Here's the method that returns the decoded string (parameters are the HttpClient and the Uri):
public static async Task<string> GetString(HttpClient httpClient, Uri url)
{
    byte[] bytes;
    Encoding encoding = null;
    Regex charsetRegex = new Regex(@"(?<=(<meta.*?charset=|^\<\?xml.*?encoding=|^@charset[ ]?)[""']?)[\w-]+?(?=[""';\r\n])",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture);
    using (HttpResponseMessage responseMessage = await httpClient.GetAsync(url).ConfigureAwait(false))
    {
        responseMessage.EnsureSuccessStatusCode();
        bytes = await responseMessage.Content.ReadAsByteArrayAsync().ConfigureAwait(false);
        string headerCharset = responseMessage?.Content?.Headers?.ContentType?.CharSet;
        byte[] buffer = new byte[0x1000];
        Array.Copy(bytes, buffer, Math.Min(bytes.Length, buffer.Length));
        using (MemoryStream ms = new MemoryStream(buffer))
        {
            using (StreamReader sr = new StreamReader(ms, Encoding.GetEncoding("Windows-1252"), true, buffer.Length, true))
            {
                string testString = await sr.ReadToEndAsync().ConfigureAwait(false);
                if (!sr.CurrentEncoding.Equals(Encoding.GetEncoding("Windows-1252")))
                {
                    encoding = sr.CurrentEncoding;
                }
                else if (headerCharset != null)
                {
                    encoding = Encoding.GetEncoding(headerCharset, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
                }
                else
                {
                    string inlineCharset = charsetRegex.Match(testString).Value;
                    if (!string.IsNullOrEmpty(inlineCharset))
                    {
                        encoding = Encoding.GetEncoding(inlineCharset, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
                    }
                    else
                    {
                        encoding = new UTF8Encoding(false, true);
                    }
                }
                if (encoding.Equals(Encoding.GetEncoding("iso-8859-1")))
                {
                    encoding = Encoding.GetEncoding("Windows-1252", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
                }
            }
        }
        using (MemoryStream ms = new MemoryStream(bytes))
        {
            try
            {
                using (StreamReader sr = new StreamReader(ms, encoding, false, 0x8000, true))
                {
                    return await sr.ReadToEndAsync().ConfigureAwait(false);
                }
            }
            catch (DecoderFallbackException)
            {
                ms.Position = 0;
                using (StreamReader sr = new StreamReader(ms, Encoding.GetEncoding("Windows-1252"), false, 0x8000, true))
                {
                    return await sr.ReadToEndAsync().ConfigureAwait(false);
                }
            }
        }
    }
}
You should wrap the method call in a try/catch, since HttpClient can throw errors if the request fails.
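A usage sketch along those lines (the URL is a placeholder, and the call is assumed to be made from the class that declares GetString):

static async Task RunAsync()
{
    using (var client = new HttpClient())
    {
        try
        {
            string text = await GetString(client, new Uri("https://example.com/page.html"));
            Console.WriteLine(text.Length);
        }
        catch (HttpRequestException ex)
        {
            Console.WriteLine("Request failed: " + ex.Message);
        }
    }
}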
Update:
In .NET Core you don't have the 'Windows-1252' encoding out of the box (a big mistake IMHO), so there you must settle for 'ISO-8859-1'.
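That said, if the System.Text.Encoding.CodePages package can be referenced, Windows-1252 can usually be restored on .NET Core with a sketch like this:

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); // from System.Text.Encoding.CodePages
Encoding win1252 = Encoding.GetEncoding("Windows-1252");       // now resolves on .NET Core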

How to encode and decode Broken Chinese/Unicode characters?

I've tried googling around but wasn't able to find what charset the text below belongs to:
具有éœé›»ç”¢ç”Ÿè£ç½®ä¹‹å½±åƒè¼¸å…¥è£ç½®
But by putting <meta http-equiv="Content-Type" Content="text/html; charset=utf-8"> into an HTML file along with that string, I was able to view the Chinese characters properly:
具有靜電產生裝置之影像輸入裝置
So my question is:
What tools can I use to detect the character set of this text?
And how do I convert/encode/decode them properly in C#?
Updates:
For completeness' sake, I've updated this test.
[TestMethod]
public void TestMethod1()
{
    string encodedText = "具有éœé›»ç”¢ç”Ÿè£ç½®ä¹‹å½±åƒè¼¸å…¥è£ç½®";
    Encoding utf8 = new UTF8Encoding();
    Encoding window1252 = Encoding.GetEncoding("Windows-1252");
    byte[] postBytes = window1252.GetBytes(encodedText);
    string decodedText = utf8.GetString(postBytes);
    string actualText = "具有靜電產生裝置之影像輸入裝置";
    Assert.AreEqual(actualText, decodedText);
}
What is happening when you save the "bad" string in a text file with a meta tag declaring the correct encoding is that your text editor saves the file with Windows-1252 encoding, while the browser reads the file and interprets it as UTF-8. Since the "bad" string is UTF-8 bytes incorrectly decoded with the Windows-1252 encoding, you reverse the process by encoding the text as Windows-1252 and decoding it as UTF-8.
Here's an example:
using System.Text;
using System.Windows.Forms;

namespace Demo
{
    class Program
    {
        static void Main(string[] args)
        {
            string s = "具有靜電產生裝置之影像輸入裝置"; // Unicode
            Encoding Windows1252 = Encoding.GetEncoding("Windows-1252");
            Encoding Utf8 = Encoding.UTF8;
            byte[] utf8Bytes = Utf8.GetBytes(s);                 // Unicode -> UTF-8
            string badDecode = Windows1252.GetString(utf8Bytes); // Mis-decode as Latin1
            MessageBox.Show(badDecode, "Mis-decoded");           // Shows your garbage string.
            string goodDecode = Utf8.GetString(utf8Bytes);       // Correctly decode as UTF-8
            MessageBox.Show(goodDecode, "Correctly decoded");
            // Recovering from bad decode...
            byte[] originalBytes = Windows1252.GetBytes(badDecode);
            goodDecode = Utf8.GetString(originalBytes);
            MessageBox.Show(goodDecode, "Re-decoded");
        }
    }
}
Even with correct decoding, you'll still need a font that supports the characters being displayed. If your default font doesn't support Chinese, you still might not see the correct characters.
The correct thing to do is figure out why the string you have was decoded as Windows-1252 in the first place. Sometimes, though, data in a database is stored incorrectly to begin with and you have to resort to these games to fix the problem.
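For reference, the recovery step from the example above can be written compactly (the same Windows-1252/UTF-8 round trip, reusing the badDecode variable from the sample):

string repaired = Encoding.UTF8.GetString(Encoding.GetEncoding("Windows-1252").GetBytes(badDecode));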
// The garbled string is what you get when the single-byte data of "mesutpiskin" is read as UTF-16,
// so Encoding.Unicode.GetBytes recovers the original bytes, which are rebuilt char by char below.
string test = "敭畳灴獩楫n"; // incoming data, must be "mesutpiskin"
byte[] bytes = Encoding.Unicode.GetBytes(test);
string s = string.Empty;
for (int i = 0; i < bytes.Length; i++)
{
    s += (char)bytes[i];
}
s = s.Trim((char)0);
MessageBox.Show(s);
// s == "mesutpiskin"
I'm not really sure what you mean, but I'm guessing you want to convert between a byte array holding text in a certain encoding and a string. Let's assume the character encoding is called "FooBar":
This is how you encode and decode:
Encoding myEncoding = Encoding.GetEncoding("FooBar");
string myString = "lala";
byte[] myEncodedBytes = myEncoding.GetBytes(myString);
string myDecodedString = myEncoding.GetString(myEncodedBytes);
You can learn more about the Encoding class over at MSDN.
Answering the question at the end of your post:
If you want to determine the text encoding at runtime, you should look at this: http://code.google.com/p/ude/
For converting between character sets you can use http://msdn.microsoft.com/en-us/library/system.text.encoding.convert(v=vs.100).aspx
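A small sketch of what Encoding.Convert does (re-encoding UTF-8 bytes as UTF-16 bytes and reading them back):

byte[] utf8Bytes = Encoding.UTF8.GetBytes("lala");
byte[] utf16Bytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);
string roundTripped = Encoding.Unicode.GetString(utf16Bytes); // "lala" again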
It's Windows Latin 1. I pasted the Chinese text as UTF-8 into BBEdit (a text editor for Mac), re-opened the file as Windows Latin 1, and bang, the exact diacritics appeared.

c# encoding issue with?

I have an input like: DisplaygröÃe
And I want output like: Displaygröÿe
With Notepad++ the problem was solved by converting to ANSI, encoding to UTF-8 and converting back to ANSI.
I need to do this programmatically in C#.
I've tried converting to/from ANSI, UTF-8 and Latin-1 and none of them work properly; I get ? characters with a function that uses Encoding.Default.GetBytes, then
res = Encoding.Convert(src1, dest1, bytes) and
EncodingDest.GetChars(res);
where EncodingDest represents the output encoding.
The code runs in a console application, but the results are the same in WPF.
It doesn't matter which encoding is used for the output, as long as it works; the same problems occur for languages like Spanish, Italian or Swedish.
use System.Text.Encoding
var ascii = Encoding.ASCII.GetBytes("DisplaygröÃe");
var utf8 = Encoding.Convert(Encoding.ASCII, Encoding.UTF8, ascii);
var output = Encoding.UTF8.GetString(utf8);
When you output a string somewhere (like a TextWriter, or a Stream, or a byte[]), you should always specify the encoding, unless you want the UTF-8 output (the default one):
using (StreamWriter sw = new StreamWriter("file.txt", false, Encoding.GetEncoding("windows-1252")))
    sw.WriteLine("Displaygröÿe");
@DanM: You need to know what character set your input is in.
"DisplaygröÃe" is what you will see if you take the string "Displaygröße" (suggested by Vlad) encode it to bytes as UTF-8, and then incorrectly decode it as latin1.
If you do the same with Displaygröÿe, you would see "Displaygröÿe" (the inverted question mark is literally there, it is not a placeholder for something that can't be displayed.) Technically, "DisplaygröÃe" probably has another character between the à and e, but it is a control code, and is thus invisible to you.
If you have an character set foo, this is true: my_string = foo_decode(foo_encode(my_string)). If you have another character set bar, this is true: barf = bar_decode(foo_encode(my_string)) where barf is garbage like you're seeing.
If you don't know what character set your input is in, you will only decode it correctly by chance.
It appears that your input files are in UTF-8, and you will need to decode the bytes from the file as such. (I don't speak enough C# to help you here... I only speak character encodings.)
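A minimal sketch of that suggestion (assuming the input is a file on disk; the file name is a placeholder):

string text = File.ReadAllText("input.txt", Encoding.UTF8); // decode the file's bytes as UTF-8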
using (var rdr = new StreamReader(fs, Encoding.GetEncoding(1252)))
{
    result = rdr.ReadToEnd();
}
We had a similar problem when sending data to a text printer, and the only thing I got working is this (written as an extension method):
public static byte[] ToAnsiMemBytes(this string input)
{
    int length = input.Length;
    byte[] result = new byte[length];
    IntPtr bytes = IntPtr.Zero;
    try
    {
        // Convert to an ANSI (system code page) string in unmanaged memory,
        // then copy the raw bytes back into the managed array.
        bytes = Marshal.StringToCoTaskMemAnsi(input);
        Marshal.Copy(bytes, result, 0, length);
    }
    catch (Exception)
    {
        result = null;
    }
    finally
    {
        if (bytes != IntPtr.Zero)
        {
            Marshal.FreeCoTaskMem(bytes); // free the unmanaged buffer
        }
    }
    return result;
}
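A usage sketch (the extension simply yields the string's bytes in the system's default ANSI code page):

byte[] ansiBytes = "Displaygröÿe".ToAnsiMemBytes();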

Encoding in StreamReader in my Silverlight application

Having trouble getting the encoding right in my Silverlight application.
I need support for western European letters like æ, ø, å, â and so on (Latin-1?).
But I can't get it right. What should go in place of SOMEENCODINGHERE? I did try
Encoding enc = Encoding.GetEncoding("Latin1"); but no name I used as the parameter was recognized =(.
If I use Encoding.Unicode, tr.ReadLine() reads the whole file and converts it to Chinese for some reason.
private Dictionary<int, string> InitDictionary()
{
    var d = new Dictionary<int, string>();
    var sri = App.GetResourceStream(new Uri(fileDic, UriKind.Relative));
    using (TextReader tr = new StreamReader(sri.Stream, Encoding.SOMEENCODINGHERE))
    {
        int i = 0;
        string line;
        while ((line = tr.ReadLine()) != null)
        {
            d.Add(i++, line);
        }
    }
    return d;
}
If you really want ISO-Latin-1, you can use
Encoding.GetEncoding(28591);
But the normal Windows Western Europe code page is 1252:
Encoding.GetEncoding(1252);
Are you absolutely sure that's the encoding for your stream though? These days it's more common to use UTF-8. What's generating your text resource?
Silverlight (1-4; I don't know about 5) doesn't support ANSI encodings (code pages). It supports only the Unicode encodings: UTF-8 and UTF-16.
See http://msdn.microsoft.com/en-us/library/system.text.encoding%28VS.95%29.aspx for details.
So the suggested Encoding.GetEncoding(1252), and any other code page numbers, do not work.
You have to implement your own Encoding class for the needed code page.
If you have found an appropriate implementation, please share it; I'd be interested.
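A minimal sketch of that idea for ISO-8859-1 (Latin-1), which maps every byte 0x00-0xFF to the Unicode code point of the same value, so no lookup table is needed; Windows-1252 would additionally need a small table for the 0x80-0x9F range:

static string DecodeLatin1(byte[] bytes)
{
    var chars = new char[bytes.Length];
    for (int i = 0; i < bytes.Length; i++)
    {
        chars[i] = (char)bytes[i];
    }
    return new string(chars);
}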
