Dynamically loaded XML and XmlTextReader character encoding issue - C#

I load an XML file like this:
var url = Application.dataPath + "/config.xml";
var www = new WWW(url);
while (!www.isDone)
{
yield return new WaitForSeconds(0.2f);
}
After that I create an XmlTextReader in order to parse that XML:
GameSettings.ParseXML(new XmlTextReader(new StringReader(www.text)));
But I'm having problems with character encoding (é, ç, ã, ê, etc.). What can I do to make it work?

If you use WWW.text, it expects the downloaded content to be encoded in UTF-8 or ASCII, but your customer uses Windows-1252.
As Bart already suggested, the best way would be to ask the customer to just use UTF-8. If that is not possible and you are sure that the customer always uses Windows-1252, you can convert the encoding inside your application:
Encoding windows1252 = Encoding.GetEncoding("Windows-1252");
Encoding utf8 = Encoding.UTF8;
byte[] windowsBytes = www.bytes;
byte[] utf8Bytes = Encoding.Convert(windows1252, utf8, windowsBytes);
string converted_xml = utf8.GetString(utf8Bytes);
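You can then hand the converted string to your existing parsing call; assuming the rest of your code stays as in the question, that would look like this:
GameSettings.ParseXML(new XmlTextReader(new StringReader(converted_xml)));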

Related

How to remove BOM from an encoded base64 UTF string?

I have a file encoded in base64 using openssl base64 -in en -out en1 on the macOS command line, and I am reading this file using the following code:
string fileContent = File.ReadAllText(Path.Combine(AppContext.BaseDirectory, MConst.BASE_DIR, "en1"));
var b1 = Convert.FromBase64String(fileContent);
var str1 = System.Text.Encoding.UTF8.GetString(b1);
The string I am getting has a ? before the actual file content. I am not sure what's causing this; any help will be appreciated.
Example Input:
import pandas
import json
Encoded file example:
77u/DQppbXBvcnQgY29ubmVjdG9yX2FwaQ0KaW1wb3J0IGpzb24NCg0K
Output based on the C# code:
?import pandas
import json
Normally, when you read UTF (with BOM) from a text file, the decoding is handled for you behind the scenes. For example, both of the following lines will read UTF text correctly regardless of whether or not the text file has a BOM:
File.ReadAllText(path, Encoding.UTF8);
File.ReadAllText(path); // UTF8 is the default.
The problem is that you're dealing with UTF text that has been encoded to a Base64 string. So, ReadAllText() can no longer handle the BOM for you. You can either do it yourself by (checking and) removing the first 3 bytes from the byte array or delegate that job to a StreamReader, which is exactly what ReadAllText() does:
var bytes = Convert.FromBase64String(fileContent);
string finalString = null;
using (var ms = new MemoryStream(bytes))
using (var reader = new StreamReader(ms)) // Or:
// using (var reader = new StreamReader(ms, Encoding.UTF8))
{
    finalString = reader.ReadToEnd();
}
// Proceed to using finalString.
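For reference, here is a minimal sketch of the other option mentioned above: checking for and stripping the three-byte UTF-8 BOM yourself before decoding (assumes a using System.Text; directive):
var bytes = Convert.FromBase64String(fileContent);
// Skip the UTF-8 BOM (EF BB BF) if it is present at the start of the payload.
int offset = (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF) ? 3 : 0;
string finalString = Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset);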

Bytes read as UTF8 string and converted to Base64

Forgive the lengthy setup here but I thought it may help to have the context...
I am implementing a custom digital signature validation method as part of a WCF service. We're using a custom method because of various differing interpretations of some industry standards, but the details there aren't all that relevant.
In this particular scenario, I am receiving an MTOM/XOP encoded request where the root MIME part contains a digital signature and the signature DigestValue and SignatureValue pieces are split up into separate MIME parts.
The MIME parts that contain the signature DigestValue and SignatureValue data are binary encoded, so they are literally a bunch of raw bytes in the web request, like this:
Content-Id: <c18605af-18ec-4fcb-bec7-e3767ef6fe53#example.jaxws.sun.com>
Content-Type: application/octet-stream
Content-Transfer-Encoding: binary
[non-printable-binary-data-goes-here]
--uuid:eda4d7f2-4647-4632-8ecb-5ba44f1a076d
I am reading the contents of the message in as a string (using the default UTF8 encoding) like this (see the requestAsString parameter below):
MessageBuffer buffer = request.CreateBufferedCopy(int.MaxValue);
try
{
    using (MemoryStream mstream = new MemoryStream())
    {
        buffer.WriteMessage(mstream);
        mstream.Position = 0;
        using (StreamReader sr = new StreamReader(mstream))
        {
            requestAsString = sr.ReadToEnd();
        }
        request = buffer.CreateMessage();
    }
}
finally
{
    buffer.Close(); // cleanup as in the fuller snippet further down
}
After I read the MTOM/XOP message in, I am attempting to re-organize the multiple MIME parts into one SOAP message where the signature DigestValue and SignatureValue elements are restored to the original SOAP envelope (and not left as attachments). So basically I am decoding the MTOM/XOP request.
Unfortunately, I am having trouble reading the DigestValue and SignatureValue pieces correctly. I need to read the bytes out of the message and get the base64 string representation of that data.
Despite all the context above, it seems the core problem is reading the binary data in as a string (UTF8 encoded) and then converting it to a proper base64 representation.
Here is what I am seeing in my test code:
This is my example base64 string:
string base64String = "mowXMw68eLSv9J1W7f43MvNgCrc=";
I can then get the byte representation of that string. This yields an array of 20 bytes:
byte[] base64Bytes = Convert.FromBase64String(base64String);
I then get the UTF8 encoded version of those bytes:
string decodedString = UTF8Encoding.UTF8.GetString(base64Bytes);
Now the strange part... if I convert the string back to bytes as follows, I get an array of bytes that is 39 bytes long:
byte[] base64BytesBack = UTF8Encoding.UTF8.GetBytes(decodedString);
So obviously at this point, when I convert back into a base64 string, it doesn't match the original value:
string base64StringBack = Convert.ToBase64String(base64BytesBack);
base64StringBack is set to "77+977+9FzMO77+9eO+/ve+/ve+/vVbvv73vv703Mu+/vWAK77+9"
What am I doing wrong here? If I switch to using UTF8Encoding.Unicode.GetString() and UTF8Encoding.Unicode.GetBytes(), it works as expected:
string base64String = "mowXMw68eLSv9J1W7f43MvNgCrc=";
// First get an array of bytes from the base64 string
byte[] base64Bytes = Convert.FromBase64String(base64String);
// Get the Unicode representation of the base64 bytes.
string decodedString = UTF8Encoding.Unicode.GetString(base64Bytes);
byte[] base64BytesBack = UTF8Encoding.Unicode.GetBytes(decodedString);
string base64StringBack = Convert.ToBase64String(base64BytesBack);
Now base64StringBack is set to "mowXMw68eLSv9J1W7f43MvNgCrc=", so it seems I am misusing the UTF-8 encoding somehow, or it is behaving differently than I would expect.
Arbitrary binary data cannot be decoded into a UTF-8 string and then encoded back to the same binary data. The section "Invalid byte sequences" at http://en.wikipedia.org/wiki/UTF-8 points that out. (The Encoding.Unicode round trip only appears to work because this particular 20-byte value happens to contain no invalid UTF-16 sequences; that is luck, not a general solution.)
I am a bit confused as to why you want the data encoded/decoded as UTF-8 at all.
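If the end goal is simply the Base64 form of the attachment bytes, keep them as a byte[] and convert directly, without any intermediate string. A minimal sketch using the example value from the question:
byte[] rawBytes = Convert.FromBase64String("mowXMw68eLSv9J1W7f43MvNgCrc=");
// Re-encoding the raw bytes directly round-trips losslessly, with no text encoding involved.
string roundTripped = Convert.ToBase64String(rawBytes); // "mowXMw68eLSv9J1W7f43MvNgCrc="
The same applies to the binary MIME parts: read their contents as bytes and pass them straight to Convert.ToBase64String().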
Ok, I took a different approach to reading the MTOM/XOP message:
Instead of relying on my own code to parse the MIME parts by hand, I just used XmlDictionaryReader.CreateMtomReader() to get an XmlDictionaryReader and read the message into an XmlDocument (being careful to preserve whitespace on the XmlDocument so digital signatures aren't broken):
MessageBuffer buffer = request.CreateBufferedCopy(int.MaxValue);
messageContentType = WebOperationContext.Current.IncomingRequest.ContentType;
try
{
    using (MemoryStream mstream = new MemoryStream())
    {
        buffer.WriteMessage(mstream);
        mstream.Position = 0;
        if (messageContentType.Contains("multipart/related;"))
        {
            Encoding[] encodings = new Encoding[1];
            encodings[0] = Encoding.UTF8;
            // MTOM
            using (XmlDictionaryReader reader = XmlDictionaryReader.CreateMtomReader(mstream, encodings, messageContentType, XmlDictionaryReaderQuotas.Max))
            {
                XmlDocument msgDoc = new XmlDocument();
                msgDoc.PreserveWhitespace = true;
                msgDoc.Load(reader);
                requestAsString = msgDoc.OuterXml;
                reader.Close();
            }
        }
        else
        {
            // Text
            using (StreamReader sr = new StreamReader(mstream))
            {
                requestAsString = sr.ReadToEnd();
            }
        }
        request = buffer.CreateMessage();
    }
}
finally
{
    buffer.Close();
}

Converting a string encoded in utf8 to unicode in C#

I've got this string returned via HTTP POST from a URL in a C# application, and it contains some Chinese characters, e.g.:
Gelatos® Colors Gift Set中­文
Problem is I want to convert it to
Gelatos® Colors Gift Set中文
Both strings are actually the same text, just encoded differently. I understand that in C# everything is UTF-16. I've tried reading a lot of postings here regarding converting from one encoding to the other, but no luck.
Hope someone could help.
Here's the C# code:
WebClient wc = new WebClient();
json = wc.DownloadString("http://mysite.com/ext/export.asp");
textBox2.Text = "Receiving orders....";
//convert the string to UTF16
Encoding ascii = Encoding.ASCII;
Encoding unicode = Encoding.Unicode;
Encoding utf8 = Encoding.UTF8;
byte[] asciiBytes = ascii.GetBytes(json);
byte[] utf8Bytes = utf8.GetBytes(json);
byte[] unicodeBytes = Encoding.Convert(utf8, unicode, utf8Bytes);
string sOut = unicode.GetString(unicodeBytes);
System.Windows.Forms.MessageBox.Show(sOut); //doesn't work...
Here's the code from the server:
<%#CodePage = 65001%>
<%option explicit%>
<%
Session.CodePage = 65001
Response.charset ="utf-8"
Session.LCID = 1033 'en-US
.....
response.write (strJSON)
%>
The output viewed on the web is correct, but I was just wondering whether something is being changed on the HTTP stream on its way to the C# application.
Thanks.
Download the web page as bytes in the first place, then convert the bytes using the correct encoding.
By first converting with the wrong encoding, you are probably losing data, especially with ASCII.
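A minimal sketch of that approach, assuming the server really does return UTF-8 (as the ASP code suggests):
using (WebClient wc = new WebClient())
{
    byte[] raw = wc.DownloadData("http://mysite.com/ext/export.asp");
    string json = Encoding.UTF8.GetString(raw); // decode once, with the encoding the server actually used
}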
If the server is really returning UTF-8 text, you can configure your WebClient by setting its Encoding property. This would eliminate any need for subsequent conversions.
using (WebClient wc = new WebClient())
{
    wc.Encoding = Encoding.UTF8;
    json = wc.DownloadString("http://mysite.com/ext/export.asp");
}

How to send UTF-8 coded XML without receiving umlauts in entity notation

I'm sending data to a business partner. The partner's demand is clear: UTF-8 signed XML. Furthermore, they want to receive German umlauts that are not in entity notation.
This German site calls it decimal notation (e.g. ä arriving as &#228;), but it means the same thing.
So how do I tell the encoding to send the umlauts as literal characters?
Thanks in advance!
//edit: the business partner confirmed they receive the umlauts in entity notation (&#123)
//edit2: here's the code I use to send the data
WebClient client = new WebClient
{
    Proxy = new WebProxy("wwwproxy", 80)
    {
        Credentials = CredentialCache.DefaultCredentials
    }
};
byte[] response = client.UploadData(
    Data.Resources.gateway_uri,
    Encoding.UTF8.GetBytes(request.ToString(SaveOptions.DisableFormatting)));
return XDocument.Parse(System.Text.Encoding.UTF8.GetString(response));
I'm guessing that request is an XNode instance, and that its ToString method tries to avoid encoding problems by escaping most characters outside the ASCII range, as it cannot know what the final encoding will be.
To have better control over the encoding, you'll need to use an XML writer with a stream:
MemoryStream buffer = new MemoryStream();
XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = Encoding.UTF8; // should be the default, but you never know
XmlWriter writer = XmlWriter.Create(buffer, settings);
request.WriteTo(writer);
writer.Flush();
byte[] requestData = buffer.ToArray();
byte[] response = client.UploadData(Data.Resources.gateway_uri, requestData);

How to encode and decode Broken Chinese/Unicode characters?

I've tried googling around but wasn't able to find what charset the text below belongs to:
具有éœé›»ç”¢ç”Ÿè£ç½®ä¹‹å½±åƒè¼¸å…¥è£ç½®
But by putting <meta http-equiv="Content-Type" Content="text/html; charset=utf-8"> into an HTML file along with that string, I was able to view the Chinese characters properly:
具有靜電產生裝置之影像輸入裝置
So my question is:
What tools can I use to detect the character set of this text?
And how do I convert/encode/decode them properly in C#?
Updates:
For completeness' sake, I've updated this test:
[TestMethod]
public void TestMethod1()
{
    string encodedText = "具有éœé›»ç”¢ç”Ÿè£ç½®ä¹‹å½±åƒè¼¸å…¥è£ç½®";
    Encoding utf8 = new UTF8Encoding();
    Encoding window1252 = Encoding.GetEncoding("Windows-1252");
    byte[] postBytes = window1252.GetBytes(encodedText);
    string decodedText = utf8.GetString(postBytes);
    string actualText = "具有靜電產生裝置之影像輸入裝置";
    Assert.AreEqual(actualText, decodedText);
}
What is happening when you save the "bad" string in a text file with a meta tag declaring the correct encoding is this: your text editor saves the file with Windows-1252 encoding, but the browser reads the file back and interprets it as UTF-8. Since the "bad" string is UTF-8 bytes incorrectly decoded with the Windows-1252 encoding, you are reversing the process by encoding the text as Windows-1252 and decoding the result as UTF-8.
Here's an example:
using System.Text;
using System.Windows.Forms;

namespace Demo
{
    class Program
    {
        static void Main(string[] args)
        {
            string s = "具有靜電產生裝置之影像輸入裝置"; // Unicode
            Encoding Windows1252 = Encoding.GetEncoding("Windows-1252");
            Encoding Utf8 = Encoding.UTF8;

            byte[] utf8Bytes = Utf8.GetBytes(s);                 // Unicode -> UTF-8
            string badDecode = Windows1252.GetString(utf8Bytes); // Mis-decode as Windows-1252
            MessageBox.Show(badDecode, "Mis-decoded");           // Shows your garbage string.

            string goodDecode = Utf8.GetString(utf8Bytes);       // Correctly decode as UTF-8
            MessageBox.Show(goodDecode, "Correctly decoded");

            // Recovering from the bad decode...
            byte[] originalBytes = Windows1252.GetBytes(badDecode);
            goodDecode = Utf8.GetString(originalBytes);
            MessageBox.Show(goodDecode, "Re-decoded");
        }
    }
}
Even with correct decoding, you'll still need a font that supports the characters being displayed. If your default font doesn't support Chinese, you still might not see the correct characters.
The correct thing to do is figure out why the string you have was decoded as Windows-1252 in the first place. Sometimes, though, data in a database is stored incorrectly to begin with and you have to resort to these games to fix the problem.
// (This looks like a different flavor of breakage: pairs of ASCII bytes appear to have been
// packed into single UTF-16 code units, so taking the string's UTF-16 bytes and reading them
// back as individual characters recovers the original text.)
string test = "敭畳灴獩楫n"; // incoming data; should be "mesutpiskin"
byte[] bytes = Encoding.Unicode.GetBytes(test);
string s = string.Empty;
for (int i = 0; i < bytes.Length; i++)
{
    s += (char)bytes[i];
}
s = s.Trim((char)0);
MessageBox.Show(s);
// s = "mesutpiskin"
I'm not really sure what you mean, but I'm guessing you want to convert between a byte array holding text in a certain encoding and a C# string. Let's assume the character encoding is called "FooBar":
This is how you encode and decode:
Encoding myEncoding = Encoding.GetEncoding("FooBar");
string myString = "lala";
byte[] myEncodedBytes = myEncoding.GetBytes(myString);
string myDecodedString = myEncoding.GetString(myEncodedBytes);
You can learn more about the Encoding class over at MSDN.
Answering your question at the end of your post:
If you want to detect the text encoding at runtime, you should look at this: http://code.google.com/p/ude/
For converting between character sets you can use Encoding.Convert: http://msdn.microsoft.com/en-us/library/system.text.encoding.convert(v=vs.100).aspx
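A rough sketch of what detection with that library (Ude) looks like; this assumes the Ude package's CharsetDetector type, which mirrors Mozilla's universal charset detector:
// Assumes: using System; using System.IO; and a reference to the Ude package.
byte[] data = File.ReadAllBytes("unknown.txt"); // hypothetical input file
Ude.CharsetDetector detector = new Ude.CharsetDetector();
detector.Feed(data, 0, data.Length);
detector.DataEnd();
if (detector.Charset != null)
{
    // e.g. "UTF-8" or "windows-1252", with a confidence value between 0 and 1
    Console.WriteLine("{0} (confidence {1})", detector.Charset, detector.Confidence);
}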
It's Windows Latin 1 (Windows-1252). I pasted the Chinese text as UTF-8 into BBEdit (a text editor for Mac), re-opened the file as Windows Latin 1, and bang, the exact diacritics appeared.
