Encoding in StreamReader in my Silverlight application - C#

I'm having trouble getting the encoding right in my Silverlight application.
I need support for Western European letters like æ, ø, å, â and so on (Latin-1?).
But I can't get it right. What should go in place of SOMEENCODINGHERE? I did try
Encoding enc = Encoding.GetEncoding("Latin1"); but none of the names I passed as a parameter were recognized =( .
If I use Encoding.Unicode, tr.ReadLine() reads the whole file and converts it to Chinese for some reason.
private Dictionary<int, string> InitDictionary()
{
    var d = new Dictionary<int, string>();
    var sri = App.GetResourceStream(new Uri(fileDic, UriKind.Relative));
    using (TextReader tr = new StreamReader(sri.Stream, Encoding.SOMEENCODINGHERE))
    {
        int i = 0;
        string line;
        while ((line = tr.ReadLine()) != null)
        {
            d.Add(i++, line);
        }
    }
    return d;
}

If you really want ISO-Latin-1, you can use
Encoding.GetEncoding(28591);
But the normal Windows Western European code page is 1252:
Encoding.GetEncoding(1252);
Are you absolutely sure that's the encoding for your stream though? These days it's more common to use UTF-8. What's generating your text resource?
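For reference, if code-page encodings are available on your platform, wiring one into the question's reader would look like this (a sketch reusing fileDic and the resource setup from the question):
// Sketch: open the embedded resource with the Windows-1252 code page.
// (On Silverlight this call throws; see the next answer for the caveat.)
var sri = App.GetResourceStream(new Uri(fileDic, UriKind.Relative));
using (TextReader tr = new StreamReader(sri.Stream, Encoding.GetEncoding(1252)))
{
    string line;
    while ((line = tr.ReadLine()) != null)
    {
        // æ, ø, å, â decode correctly here if the file really is Windows-1252.
    }
}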

Silverlight (1-4, I don't know about 5) doesn't support ANSI encodings (code pages). It supports only Unicode encodings: UTF-8 and UTF-16.
See http://msdn.microsoft.com/en-us/library/system.text.encoding%28VS.95%29.aspx for details.
So the suggested Encoding.GetEncoding(1252), and any other code-page number, does not work.
You have to implement your own Encoding class for the code page you need.
If you have found an appropriate implementation, please share it; I'd be interested.
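As a starting point, here is a minimal sketch of such a class for ISO-8859-1 (the class name is just illustrative, and only decoding is really fleshed out):
// A decode-only Latin-1 (ISO-8859-1) Encoding sketch: every byte value 0x00-0xFF
// is the Unicode code point with the same value, so decoding is a plain cast.
public class Latin1Encoding : System.Text.Encoding
{
    public override int GetByteCount(char[] chars, int index, int count)
    {
        return count;
    }

    public override int GetBytes(char[] chars, int charIndex, int charCount, byte[] bytes, int byteIndex)
    {
        for (int i = 0; i < charCount; i++)
        {
            char c = chars[charIndex + i];
            bytes[byteIndex + i] = c <= '\u00FF' ? (byte)c : (byte)'?';
        }
        return charCount;
    }

    public override int GetCharCount(byte[] bytes, int index, int count)
    {
        return count;
    }

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex)
    {
        for (int i = 0; i < byteCount; i++)
        {
            chars[charIndex + i] = (char)bytes[byteIndex + i];
        }
        return byteCount;
    }

    public override int GetMaxByteCount(int charCount)
    {
        return charCount;
    }

    public override int GetMaxCharCount(int byteCount)
    {
        return byteCount;
    }
}
With this in place, new StreamReader(sri.Stream, new Latin1Encoding()) should give the æ/ø/å characters from the original question, assuming the resource file really is Latin-1. A full Windows-1252 version would additionally need a lookup table for bytes 0x80-0x9F.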

Related

get all Text Encoding in universal windows app

I want to make a Windows application that converts the text encoding of a TXT file,
so I need to get all text encodings supported by Windows.
I looked for the Encoding.GetEncodings() method, but it does not exist in Universal Windows Apps,
so I tried to get each encoding by code page; this is my code:
List<string> list = new List<string>();
List<string> errors = new List<string>();
int[] code_page = { 0, 1200, 1201, 1252, 10003, 10008, 12000, 12001, 20127, 20936, 20949, 28591, 28598, 38598, 50220, 50221,
                    50222, 50225, 50227, 51932, 51936, 51949, 52936, 57002, 57003, 57004, 57005, 57006, 57007, 57008, 57009, 57010, 57011, 65000, 65001 };
for (int i = 0; i < code_page.Length; i++)
{
    try
    {
        list.Add(Encoding.GetEncoding(code_page[i]).EncodingName);
    }
    catch (Exception ex) { errors.Add(code_page[i] + "\t\t" + ex.Message); }
}
And I have this result:
First list (Encoding)
Unicode (UTF-8)
Unicode
Unicode (Big-Endian)
Unicode (UTF-32)
Unicode (UTF-32 Big-Endian)
US-ASCII
Western European (ISO)
Unicode (UTF-7)
Unicode (UTF-8)
Errors list
37___No data is available for encoding 37. For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method.
437__No data is available for encoding 437. For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method.
500__No data is available for encoding 500. For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method.
...etc
My question: is there any way to get all Windows text encodings?
Thank you.
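One route hinted at by the error text itself is Encoding.RegisterProvider; a sketch, assuming the System.Text.Encoding.CodePages package can be referenced from your project:
using System.Text;

// Register the provider that carries the legacy Windows code pages,
// then the code-page loop above resolves far more entries.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

Encoding latin1 = Encoding.GetEncoding(28591); // resolves after registration
Encoding cp437 = Encoding.GetEncoding(437);    // previously threw "No data is available"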

Dynamic loaded XML and XmlTextReader character encoding issue

I load an XML file like this:
var url = Application.dataPath + @"/config.xml";
var www = new WWW(url);
while (!www.isDone)
{
    yield return new WaitForSeconds(0.2f);
}
After that I create an XmlTextReader in order to parse that XML:
GameSettings.ParseXML(new XmlTextReader(new StringReader(www.text)));
But I'm having problems with character encoding (é, ç, ã, ê, etc.). What can I do to make it work?
If you use WWW.text, Unity expects the downloaded contents to be encoded as UTF-8 or ASCII, but your customer uses Windows-1252.
Like Bart already suggested, the best way would be to request that the customer just use UTF-8. If that is not possible and you are sure that the customer always uses Windows-1252, you can convert the encoding inside your application.
Encoding windows1252 = Encoding.GetEncoding("Windows-1252");
Encoding utf8 = Encoding.UTF8;
byte[] windowsBytes = www.bytes;
byte[] utf8Bytes = Encoding.Convert(windows1252, utf8, windowsBytes);
string converted_xml = utf8.GetString(utf8Bytes);
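The converted string can then go into the same parsing call as in the question (a sketch reusing the question's GameSettings.ParseXML):
// Parse the re-decoded XML exactly as before, just with the repaired string.
GameSettings.ParseXML(new XmlTextReader(new StringReader(converted_xml)));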

Change StreamReader Encoding while reading from NetworkStream

I am trying to read an email from POP3 and switch to the correct encoding when I find the charset in the headers.
I use a TCP client to connect to the POP3 server.
Below is my code:
public string ReadToEnd(POP3Client pop3client, out System.Text.Encoding messageEncoding)
{
    messageEncoding = TCPStream.CurrentEncoding;
    if (EOF)
        return ("");
    System.Text.StringBuilder sb = new System.Text.StringBuilder(m_bytetotal * 2);
    string st = "";
    string tmp;
    do
    {
        tmp = TCPStream.ReadLine();
        if (tmp == ".")
            EOF = true;
        else
            sb.Append(tmp + "\r\n");
        //st += tmp + "\r\n";
        m_byteread += tmp.Length + 2; // CRLF discarded by read
        FireReceived();
        if (tmp.ToLower().Contains("content-type:") && tmp.ToLower().Contains("charset="))
        {
            try
            {
                string charSetFound = tmp.Substring(tmp.IndexOf("charset=") + "charset=".Length).Replace("\"", "").Replace(";", "");
                var realEnc = System.Text.Encoding.GetEncoding(charSetFound);
                if (realEnc != TCPStream.CurrentEncoding)
                {
                    TCPStream = new StreamReader(pop3client.m_tcpClient.GetStream(), realEnc);
                }
            }
            catch { }
        }
    } while (!EOF);
    messageEncoding = TCPStream.CurrentEncoding;
    return (sb.ToString());
}
If I remove this line:
TCPStream = new StreamReader(pop3client.m_tcpClient.GetStream(), realEnc);
everything works fine, except that when the e-mail contains characters from a different charset I get question marks, because the initial encoding is ASCII.
Any suggestions on how to change the encoding while reading data from the NetworkStream?
You're doing it wrong (tm).
Seriously, though, you are going about trying to solve this problem in completely the wrong way. Don't use a StreamReader for this. And especially don't read 1 byte at a time (as you said you needed to do in a comment on an earlier "solution").
For an explanation of why not to use a StreamReader, besides the obvious "because it isn't designed to switch between encodings during the process of reading", feel free to read over another answer I gave about the inefficiencies of using a StreamReader here: Reading an mbox file in C#
What you need to do is buffer your reads (a 4K buffer should be fine). Then, as you are already having to do anyway, scan for the '\n' byte to extract content on a line-by-line basis, combining header lines that were folded (see the sketch after this answer).
Each header may have multiple encoded-word tokens which may each be in a separate charset, assuming they are properly encoded, otherwise you'll have to deal with undeclared 8-bit data and try to massage that into unicode somehow (probably by having a set of fallback charsets). I'd recommend trying UTF-8 first followed by a selection of charsets that the user of your library has provided before finally trying iso-8859-1 (make sure not to try iso-8859-1 until you've tried everything else, because any sequence of 8-bit text will convert properly to unicode using the iso-8859-1 character encoding).
When you get to text content of the message, you'll want to check the Content-Type header for a charset parameter. If no charset parameter is defined, it should be US-ASCII, but in practice it could be anything. Even if the charset is defined, it might not match the actual character encoding used in the text body of the message, so once again you'll probably want to have a set of fallbacks.
As you've probably guessed by this point, this is very clearly not a trivial task as it requires the parser to do on-the-fly character conversion as it goes (and the character conversion requires internal parser state about what the expected charset is at any given time).
Since I've already done the work, you should really consider using MimeKit which will parse the email and properly do charset conversion on the headers and the content using the appropriate charset encoding.
I've also written a Pop3Client class that is included in my MailKit library.
If your goal is to learn and write your own library, I'd still highly recommend reading over my code because it is highly efficient and does things in a proper way.
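A very rough sketch of the buffered, line-oriented approach described above; networkStream is illustrative, header unfolding and encoded-word handling are omitted, and the usual using directives (System, System.Collections.Generic, System.IO, System.Text) are assumed:
// Sketch: read raw bytes, split lines on '\n', and decode each completed line
// with whatever encoding is currently selected. Switching encodings never
// re-wraps the stream, so no buffered bytes are lost.
static List<string> ReadMessageLines(Stream networkStream)
{
    var lines = new List<string>();
    var lineBuffer = new MemoryStream();
    Encoding currentEncoding = Encoding.ASCII;   // safe default for headers
    byte[] buffer = new byte[4096];
    int read;

    while ((read = networkStream.Read(buffer, 0, buffer.Length)) > 0)
    {
        for (int i = 0; i < read; i++)
        {
            lineBuffer.WriteByte(buffer[i]);
            if (buffer[i] != (byte)'\n')
                continue;

            // Decode the completed line with the encoding in effect right now.
            string line = currentEncoding.GetString(lineBuffer.ToArray()).TrimEnd('\r', '\n');
            lineBuffer.SetLength(0);

            if (line == ".")
                return lines;                    // POP3 end-of-message marker

            lines.Add(line);

            // If this header declares a charset, switch the decoder for later lines.
            int idx = line.IndexOf("charset=", StringComparison.OrdinalIgnoreCase);
            if (idx >= 0)
            {
                try { currentEncoding = Encoding.GetEncoding(line.Substring(idx + 8).Trim('"', ';', ' ')); }
                catch { /* unknown charset: keep the current one */ }
            }
        }
    }
    return lines;
}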
There are some ways to detect the encoding by looking at the byte order mark (BOM), the first few bytes of the stream; these will tell you the encoding. However, the stream might not have a BOM, in which case it could be ASCII, a UTF encoding without a BOM, or something else (a small BOM check is sketched after the block-reading code below).
You can convert your stream from one encoding to another with the Encoding class:
Encoding textEncoding = Encoding.[your detected encoding here];
byte[] converted = Encoding.UTF8.GetBytes(textEncoding.GetString(TCPStream.GetBuffer()));
You may select your preferred encoding when converting.
Hope it answers your question.
edit
You may use this code to read your stream in blocks.
MemoryStream st = new MemoryStream();
int numOfBytes = 1024;
int reads = 1;
while (reads > 0)
{
    byte[] bytes = new byte[numOfBytes];
    reads = yourStream.Read(bytes, 0, numOfBytes);
    if (reads > 0)
    {
        int writes = (reads < numOfBytes ? reads : numOfBytes);
        st.Write(bytes, 0, writes);
    }
}
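For completeness, here is a small sketch of the BOM check mentioned at the start of this answer, run against the bytes buffered above (only the common BOMs are covered):
// Sketch: inspect the first bytes of the buffered data for a BOM.
byte[] data = st.ToArray();
Encoding detected = null;

if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
    detected = Encoding.UTF8;                 // UTF-8 BOM
else if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
    detected = Encoding.Unicode;              // UTF-16 little-endian BOM
else if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
    detected = Encoding.BigEndianUnicode;     // UTF-16 big-endian BOM
// null here means "no BOM": fall back to ASCII/UTF-8 or another heuristic.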

How to encode and decode Broken Chinese/Unicode characters?

I've tried googling around but wasn't able to find what charset the text below belongs to:
具有éœé›»ç”¢ç”Ÿè£ç½®ä¹‹å½±åƒè¼¸å…¥è£ç½®
But by putting <meta http-equiv="Content-Type" Content="text/html; charset=utf-8"> in an HTML file together with that string, I was able to view the Chinese characters properly:
具有靜電產生裝置之影像輸入裝置
So my question is:
What tools can I use to detect the character set of this text?
And how do I convert/encode/decode them properly in C#?
Update:
For completeness' sake, I've added this test.
[TestMethod]
public void TestMethod1()
{
    string encodedText = "具有éœé›»ç”¢ç”Ÿè£ç½®ä¹‹å½±åƒè¼¸å…¥è£ç½®";
    Encoding utf8 = new UTF8Encoding();
    Encoding window1252 = Encoding.GetEncoding("Windows-1252");
    byte[] postBytes = window1252.GetBytes(encodedText);
    string decodedText = utf8.GetString(postBytes);
    string actualText = "具有靜電產生裝置之影像輸入裝置";
    Assert.AreEqual(actualText, decodedText);
}
What is happening when you save the "bad" string in a text file with a meta tag declaring the correct encoding is that your text editor is saving the file with Windows-1252 encoding, but the browser is reading the file and interpreting it as UTF-8. Since the "bad" string is incorrectly decoded UTF-8 bytes with the Windows-1252 encoding, you are reversing the process by encoding the file as Windows-1252 and decoding as UTF-8.
Here's an example:
using System.Text;
using System.Windows.Forms;
namespace Demo
{
class Program
{
static void Main(string[] args)
{
string s = "具有靜電產生裝置之影像輸入裝置"; // Unicode
Encoding Windows1252 = Encoding.GetEncoding("Windows-1252");
Encoding Utf8 = Encoding.UTF8;
byte[] utf8Bytes = Utf8.GetBytes(s); // Unicode -> UTF-8
string badDecode = Windows1252.GetString(utf8Bytes); // Mis-decode as Latin1
MessageBox.Show(badDecode,"Mis-decoded"); // Shows your garbage string.
string goodDecode = Utf8.GetString(utf8Bytes); // Correctly decode as UTF-8
MessageBox.Show(goodDecode, "Correctly decoded");
// Recovering from bad decode...
byte[] originalBytes = Windows1252.GetBytes(badDecode);
goodDecode = Utf8.GetString(originalBytes);
MessageBox.Show(goodDecode, "Re-decoded");
}
}
}
Even with correct decoding, you'll still need a font that supports the characters being displayed. If your default font doesn't support Chinese, you still might not see the correct characters.
The correct thing to do is figure out why the string you have was decoded as Windows-1252 in the first place. Sometimes, though, data in a database is stored incorrectly to begin with and you have to resort to these games to fix the problem.
// The garbled string is the UTF-16 byte stream of "mesutpiskin" reinterpreted as
// CJK characters; pulling the bytes back out and reading them one per char undoes it.
string test = "敭畳灴獩楫n"; // incoming data. must be mesutpiskin
byte[] bytes = Encoding.Unicode.GetBytes(test);
string s = string.Empty;
for (int i = 0; i < bytes.Length; i++)
{
    s += (char)bytes[i];
}
s = s.Trim((char)0);
MessageBox.Show(s);
// s = mesutpiskin
I'm not really sure what you mean, but I'm guessing you want to convert between a byte array holding text in a certain encoding and a string. Let's assume the character encoding is called "FooBar":
This is how you encode and decode:
Encoding myEncoding = Encoding.GetEncoding("FooBar");
string myString = "lala";
byte[] myEncodedBytes = myEncoding.GetBytes(myString);
string myDecodedString = myEncoding.GetString(myEncodedBytes);
You can learn more about the Encoding class over at MSDN.
Answering your question at the end of your post:
If you want to determine the text encoding at runtime, you should look at this: http://code.google.com/p/ude/
For converting character sets you can use http://msdn.microsoft.com/en-us/library/system.text.encoding.convert(v=vs.100).aspx
It's Windows Latin 1. I pasted the Chinese text as UTF-8 into BBEdit (a text editor for the Mac) and re-opened the file as Windows Latin 1, and bang, the exact diacritics appeared.

c# encoding issue with?

I have an input like: DisplaygröÃe
And I want output like: Displaygröÿe
With Notepad++ the problem was solved by converting to ANSI, encoding to UTF-8, and converting back to ANSI.
I need to do this programmatically in C#.
I've tried converting to/from ANSI, UTF-8 and Latin-1, and none work properly; it shows ? with a function that uses Encoding.Default.GetBytes, then
res = Encoding.Convert(src1, dest1, bytes) and
EncodingDest.GetChars(res);
where EncodingDest represents the output encoding.
The code is running in a console application, but the results are the same in WPF.
It doesn't matter which encoding is used for the output as long as it works; these problems also occur for languages from countries like Spain, Italy or Sweden.
use System.Text.Encoding
var ascii = Encoding.ASCII.GetBytes("DisplaygröÃe");
var utf8 = Encoding.Convert(Encoding.ASCII, Encoding.UTF8, ascii);
var output = Encoding.UTF8.GetString(utf8);
When you output a string somewhere (like a TextWriter, a Stream, or a byte[]), you should always specify the encoding, unless you want UTF-8 output (the default):
using (StreamWriter sw = new StreamWriter("file.txt", false, Encoding.GetEncoding("windows-1252")))
    sw.WriteLine("Displaygröÿe");
#DanM: You need to know what character set your input is in.
"DisplaygröÃe" is what you will see if you take the string "Displaygröße" (suggested by Vlad) encode it to bytes as UTF-8, and then incorrectly decode it as latin1.
If you do the same with Displaygröÿe, you would see "Displaygröÿe" (the inverted question mark is literally there, it is not a placeholder for something that can't be displayed.) Technically, "DisplaygröÃe" probably has another character between the à and e, but it is a control code, and is thus invisible to you.
If you have a character set foo, this is true: my_string = foo_decode(foo_encode(my_string)). If you have another character set bar, this is true: barf = bar_decode(foo_encode(my_string)), where barf is garbage like you're seeing.
If you don't know what character set your input is in, you will only decode it correctly by chance.
It appears that your input files are in UTF-8, and you will need to decode the bytes from the file as such. (I don't speak enough C# to help you here... I only speak character encodings.)
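In C#, the decoding that paragraph describes would look roughly like this (a sketch; the file name and the brokenString variable, standing for an already mis-decoded input, are illustrative, with the usual System.IO/System.Text usings assumed):
// Read the file as UTF-8 so characters like ö and ß survive intact.
string text = File.ReadAllText("input.txt", Encoding.UTF8);

// If a string has already been mis-decoded as Windows-1252 (the "DisplaygröÃe" case),
// the damage can be reversed by re-encoding and decoding it the right way round:
string repaired = Encoding.UTF8.GetString(Encoding.GetEncoding(1252).GetBytes(brokenString));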
using (var rdr = new StreamReader(fs, Encoding.GetEncoding(1252)))
{
    result = rdr.ReadToEnd();
}
We had a similar problem when sending data to a text printer, and the only thing I got working is this (written as an extension method):
// Requires: using System.Runtime.InteropServices;
public static byte[] ToAnsiMemBytes(this string input)
{
    int length = input.Length;
    byte[] result = new byte[length];
    IntPtr bytes = IntPtr.Zero;
    try
    {
        // Let the marshaller convert the string to the system ANSI code page.
        bytes = Marshal.StringToCoTaskMemAnsi(input);
        Marshal.Copy(bytes, result, 0, length);
    }
    catch (Exception)
    {
        result = null;
    }
    finally
    {
        if (bytes != IntPtr.Zero)
            Marshal.FreeCoTaskMem(bytes);
    }
    return result;
}
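Usage is then a single call on the string you want to print (the sample text is taken from the question):
byte[] printerBytes = "Displaygröÿe".ToAnsiMemBytes();
// printerBytes holds the string in the machine's ANSI code page, or null on failure.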
