XmlException due to leading Unicode character in REST API response - C#

When I try to parse a response from a certain REST API, I'm getting an XmlException saying "Data at the root level is invalid. Line 1, position 1." Looking at the XML it looks fine, but then examining the first character I see that it is actually a zero-width no-break space (character code 65279 or 0xFEFF).
Is there any good reason for that character to be there? Maybe I'm supposed to be setting a different Encoding when I make my request? Currently I'm using Encoding.UTF8.
I've thought about just removing the character from the string, or asking the developer of the REST API to fix it, but before I do either of those things I wanted to check if there is a valid reason for that character to be there. I'm no Unicode expert. Is there something different I should be doing?
Edit: I suspected that it might be something like that (BOM). So, the question becomes, should I have to deal with this character specially? I've tried loading the XML two ways and both throw the same exception:
public static User GetUser()
{
    WebClient req = new WebClient();
    req.Encoding = Encoding.UTF8;
    string response = req.DownloadString(url);

    XmlSerializer ser = new XmlSerializer(typeof(User));
    User user = ser.Deserialize(new StringReader(response)) as User;

    XElement xUser = XElement.Parse(response);
    ...
    return user;
}

U+FEFF is a byte order mark. It's there at the start of the document to indicate the character encoding (or rather, the byte order of an encoding which could go either way round, most specifically UTF-16). It's entirely reasonable for it to be there at the start of an XML document. Its use as a zero-width no-break space is deprecated in favour of U+2060.
It would be unreasonable if the byte order mark were for a different encoding than the document claims, e.g. a UTF-16 BOM on a document which claimed to be UTF-8.
How are you loading the document? Perhaps you're specifying an inappropriate encoding somewhere? It's best to let the XML API detect the encoding if at all possible.
EDIT: After you've downloaded it as a string, I can imagine that could cause problems, given that the BOM is used to detect the encoding, which you've already got. Don't download it as a string - download it as binary data (WebClient.DownloadData) and then you should be able to parse it okay, I believe. However, you probably still shouldn't use XElement.Parse, as there may well be a document declaration - use XDocument.Parse. I'd be slightly surprised if the result of the call could be fed straight into XmlSerializer, but you can have a go... wrap it in a MemoryStream if necessary.
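A minimal sketch of that approach (my own illustration, not from the original answer; url and the User type are taken from the question):

byte[] data = new WebClient().DownloadData(url);
using (var stream = new MemoryStream(data))
{
    // XDocument.Load reads the BOM and/or the XML declaration to pick the encoding.
    XDocument doc = XDocument.Load(stream);

    // Alternatively, rewind and hand the stream straight to XmlSerializer:
    stream.Position = 0;
    User user = (User)new XmlSerializer(typeof(User)).Deserialize(stream);
}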

That is called a Byte Order Mark. It's not required in UTF-8 though.

Instead of using Encoding.UTF8, create your own UTF-8 encoder, using the constructor overload that lets you specify whether or not the BOM is to be emitted:
req.Encoding = new UTF8Encoding(false); // omit the BOM
I believe that will do the trick for you.
Amended to Note: The following will work:
public static User GetUser()
{
    WebClient req = new WebClient();
    byte[] response = req.DownloadData(url); // raw bytes: no string decoding happens here

    User instance;
    using (MemoryStream stream = new MemoryStream(response))
    using (XmlReader reader = XmlReader.Create(stream)) // reader detects the encoding from the BOM/declaration
    {
        XmlSerializer serializer = new XmlSerializer(typeof(User));
        instance = (User)serializer.Deserialize(reader);
    }
    return instance;
}

That character at the beginning is the BOM (Byte Order Mark). It's placed as the first character in unicode text files to specify which encoding was used to create the file.
The BOM should not be part of the response, as the encoding is specified differently for HTTP content.
Typically a BOM in the response comes from sending a text file as the response, where the text file was saved with the BOM signature. Visual Studio, for example, has an option to save a file without the BOM signature so that it can be sent directly as a response.
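If you do have to tolerate such a response on the client side, one defensive sketch (my own, not part of this answer; req and url as in the question's code) is to strip a leading U+FEFF from the decoded string before parsing:

string response = req.DownloadString(url);
if (response.Length > 0 && response[0] == '\uFEFF')
    response = response.Substring(1); // drop the BOM that leaked into the text
XElement xUser = XElement.Parse(response);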

Related

How do I extract UTF-8 strings out of a JSON file using LitJSON, as JsonData does not seem to convert?

I've tried many methods to extract some strings out of a JSON file using LitJson in Unity.
I've tried encoding conversions all over, tried getting byte arrays and sending them around, and nothing seems to work.
I went to the very start of where I create the JsonData object and tried to run the following test:
public JsonData CreateJSONDataObject()
{
    Debug.Assert(pathName != null, "No JSON Data path name set. Please set before commencing read.");

    string jsonString = File.ReadAllText(Application.dataPath + pathName, System.Text.Encoding.UTF8);
    JsonData jsonDataObject = JsonMapper.ToObject(jsonString);
    Debug.Log("Test compatibility: ë | " + jsonDataObject["Roots"][2]["name"]);

    return jsonDataObject;
}
I made sure my jsonString is read as UTF-8; however, the output shows this:
Test compatibility: ë | W�den
I've tried many other methods, but since this code already makes sure to use the right encoding when creating the JsonData object, I can't think of what I am doing wrong; I just don't know enough about JSON.
Thank you in advance.
This type of problem occurs when a text file is written with one encoding and read using a different one. I was able to reproduce your problem with the following program, which removes the JSON serialization from the equation entirely:
string file = @"c:\temp\test.txt";
string text = "Wöden";
File.WriteAllText(file, text, Encoding.Default);      // write with the system ANSI code page
string text2 = File.ReadAllText(file, Encoding.UTF8); // read back as UTF-8
Debug.WriteLine(text2);                               // prints "W�den"
Since you are reading with UTF-8 and it is not working, the real question is, what encoding was used to write the file originally? You should be using the same encoding to read it back. I suspect that the file was originally created using either Windows-1252 or iso-8859-1 instead of UTF-8. Try using one of those when you read the file, e.g.:
string jsonString = File.ReadAllText(Application.dataPath + pathName,
                                     Encoding.GetEncoding("Windows-1252"));
You said in the comments that your JSON file was not created programmatically, but was "written by hand", meaning you used Notepad or some other text editor to make the file. If that is so, then that explains how you got into this situation. When you save the file, you should have the option to choose an encoding. For Notepad at least, the default encoding is "ANSI", which most likely maps to Windows-1252 (Western European), but depends on your locale. If you are in the Baltic region, for example, it would be Windows-1257 (Baltic). In any case, "ANSI" is not UTF-8. If you want to save the file in UTF-8 encoding, you have to specifically choose that option. Whatever option you use to save the file, that is the encoding you need to use to read it the next time, whether it is with a text editor or with code. Using the wrong encoding to read the file is what causes the corruption.
To change the encoding of a file, you first have to read it in using the same encoding that it was saved in originally, and then you can write it back out using a different encoding. You can do that with your text editor, simply by re-saving the file with a different encoding, or you can do that programmatically:
string text = File.ReadAllText(file, originalEncoding);
File.WriteAllText(file, text, newEncoding);
The key is knowing which encoding was used originally, and therein lies the rub. For legacy encodings (such as Windows-12xx) there is no way to tell because there is no marker in the file which identifies it. Unicode encodings (e.g. UTF-8, UTF-16), on the other hand, do write out a marker at the beginning of the file, called a BOM, or byte-order mark, which can be detected programmatically. That, coupled with the fact that Unicode encodings can represent all characters, is why they are much preferred over legacy encodings.
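To illustrate that detection (my own sketch, not from the original answer), the common BOMs can be recognised by inspecting the first bytes of the file:

// Returns the encoding indicated by a BOM, or null when no BOM is present
// (which usually means a legacy encoding such as Windows-1252).
static Encoding DetectBom(string path)
{
    byte[] head = new byte[3];
    int n;
    using (var fs = File.OpenRead(path))
        n = fs.Read(head, 0, head.Length);

    if (n >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
        return Encoding.UTF8;             // EF BB BF
    if (n >= 2 && head[0] == 0xFF && head[1] == 0xFE)
        return Encoding.Unicode;          // FF FE (UTF-16 little-endian)
    if (n >= 2 && head[0] == 0xFE && head[1] == 0xFF)
        return Encoding.BigEndianUnicode; // FE FF (UTF-16 big-endian)
    return null;
}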
For more information, I highly recommend reading What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.

C# encoding problems (question marks) while reading file from StreamReader

I've a problem while reading a .txt file from my Windows Phone app.
I've made a simple app, that reads a stream from a .txt file and prints it.
Unfortunately I'm from Italy and we've many letters with accents. And here's the problem, in fact all accented letters are printed as a question mark.
Here's the sample code:
var resourceStream = Application.GetResourceStream(new Uri("frasi.txt", UriKind.RelativeOrAbsolute));
if (resourceStream != null)
{
    //System.Text.Encoding.Default, true
    using (var reader = new StreamReader(resourceStream.Stream, System.Text.Encoding.UTF8))
    {
        string line = reader.ReadLine();
        while (line != null)
        {
            frasi.Add(line);
            line = reader.ReadLine();
        }
    }
}
So, I'm asking how to avoid this problem.
All the best.
[EDIT:] Solution: I hadn't made sure the file was encoded in UTF-8. I saved it with the correct encoding and it worked like a charm. Thank you, Oscar.
You need to use Encoding.Default. Change:
using (var reader = new StreamReader(resourceStream.Stream, System.Text.Encoding.UTF8))
to
using (var reader = new StreamReader(resourceStream.Stream, System.Text.Encoding.Default))
What you have commented out is what you should be using if you do not know the exact encoding of your source data. System.Text.Encoding.Default uses the encoding for the operating system's current ANSI code page and provides the best chance of a correct decoding: it picks up the current region settings and uses those.
However, note this warning from MSDN:
Different computers can use different encodings as the default, and the default encoding can even change on a single computer. Therefore, data streamed from one computer to another or even retrieved at different times on the same computer might be translated incorrectly. In addition, the encoding returned by the Default property uses best-fit fallback to map unsupported characters to characters supported by the code page. For these two reasons, using the default encoding is generally not recommended. To ensure that encoded bytes are decoded properly, your application should use a Unicode encoding, such as UTF8Encoding or UnicodeEncoding, with a preamble. Another option is to use a higher-level protocol to ensure that the same format is used for encoding and decoding.
Despite this, in my experience with data coming from a number of different sources and various different cultures, this is the one that provides the most consistent results out of the box, especially for the case of diacritic marks, which are turned into question marks when ANSI data is read as UTF-8.
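One further option worth knowing about (my addition, not part of the original answer; resourceStream as in the question's code): the StreamReader constructor can be asked to sniff a BOM and fall back to the encoding you pass only when no BOM is found:

using (var reader = new StreamReader(resourceStream.Stream,
                                     System.Text.Encoding.Default,
                                     detectEncodingFromByteOrderMarks: true))
{
    // After the first read, reader.CurrentEncoding reports what was actually detected.
}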
I hope this helps.

C# Issue with reading XML with chars of different encodings in it

I faced a problem with reading an XML file. The solution was found, but there are still some questions. The incorrect XML file is encoded in UTF-8 and has the appropriate mark in its header. But it also includes a char encoded in UTF-16 - 'é'. This code was used to read the XML file and validate its content:
var xDoc = XDocument.Load(taxFile);
It raises an exception for the specified incorrect XML file: "Invalid character in the given encoding. Line 59, position 104." The quick fix is as follows:
XDocument xDoc = null;
using (var oReader = new StreamReader(taxFile, Encoding.UTF8))
{
    xDoc = XDocument.Load(oReader);
}
This code doesn't raise an exception for the incorrect file, but the 'é' character is loaded as �. My first question is: why does it work?
Another point: using XmlReader doesn't raise an exception until the node with 'é' is loaded.
XmlReader xmlTax = XmlReader.Create(filePath);
And again the workaround with StreamReader helps. The same question applies.
It seems like the fix is not good enough, because one day :) an XML file encoded in another format may appear and could be processed in the wrong way. BUT I've tried to process a UTF-16 formatted XML file and it worked fine (even with the reader configured for UTF-8).
The final question is whether there are any options that can be given to XDocument/XmlReader to ignore character-encoding problems, or something like that.
Looking forward to your replies. Thanks in advance.
The first thing to note is that the XML file is in fact flawed - mixing text encodings in the same file like this should not be done. The error is even more obvious when the file actually has an explicit encoding embedded.
As for why it can be read without exception with StreamReader: it's because Encoding contains settings to control what happens when incompatible data is encountered.
Encoding.UTF8 is documented to use fallback characters. From http://msdn.microsoft.com/en-us/library/system.text.encoding.utf8.aspx:
The UTF8Encoding object that is returned by this property may not have the appropriate behavior for your application. It uses replacement fallback to replace each string that it cannot encode and each byte that it cannot decode with a question mark ("?") character.
You can instantiate the encoding yourself to get different settings. This is most probably what XDocument.Load() does, as it would generally be bad to hide errors by default.
http://msdn.microsoft.com/en-us/library/system.text.utf8encoding.aspx
If you are being sent such broken XML files, step 1 is to complain (loudly) about it. There is no valid reason for such behavior. If you then absolutely must process them anyway, I suggest having a look at the UTF8Encoding class and its DecoderFallback property. It seems you should be able to implement a custom DecoderFallback and DecoderFallbackBuffer to add logic that will understand the UTF-16 byte sequence.
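As a sketch in that direction (mine, not from the answer; taxFile as in the question), you can at least make the bad bytes loud instead of silent by constructing a UTF-8 encoding with an exception fallback; the offending bytes are then available to custom repair logic:

Encoding strictUtf8 = Encoding.GetEncoding(
    "utf-8",
    EncoderFallback.ExceptionFallback,
    DecoderFallback.ExceptionFallback);

try
{
    using (var reader = new StreamReader(taxFile, strictUtf8))
    {
        var xDoc = XDocument.Load(reader);
    }
}
catch (DecoderFallbackException ex)
{
    // ex.BytesUnknown holds the bytes that were not valid UTF-8.
    Console.WriteLine("Invalid bytes: " + BitConverter.ToString(ex.BytesUnknown));
}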

How should the '\t' character be handled within XML attribute values?

I seem to have found something of an inconsistency between the various XML implementations within .NET 3.5, and I'm struggling to work out which is nominally correct.
The issue is actually fairly easy to reproduce:
Create a simple XML document with a text element containing '\t' characters and give it an attribute that contains '\t' characters:
var xmlDoc = new XmlDocument { PreserveWhitespace = false };
xmlDoc.LoadXml("<test><text attrib=\"Tab'\t'space' '\">Tab'\t'space' '</text></test>");
xmlDoc.Save(@"d:\TabTest.xml");
NB: This means that XmlDocument itself is quite happy with '\t' characters in an attribute value.
Load the document using XmlReader.Create:
var rawFile = XmlReader.Create(@"D:\TabTest.xml");
var rawDoc = new XmlDocument();
rawDoc.Load(rawFile);
Load the document using new XmlTextReader:
var rawFile2 = new XmlTextReader(@"D:\TabTest.xml");
var rawDoc2 = new XmlDocument();
rawDoc2.Load(rawFile2);
Compare the documents in the debugger:
(rawDoc).InnerXml "<test><text attrib=\"Tab' 'space' '\">Tab'\t'space' '</text></test>" string
(rawDoc2).InnerXml "<test><text attrib=\"Tab'\t'space' '\">Tab'\t'space' '</text></test>" string
The document read using new XmlTextReader (rawDoc2) was what I expected: both the '\t' in the text value and in the attribute value were preserved.
However, if you look at the document read by XmlReader.Create (rawDoc), you find that the '\t' character in the attribute value has been converted into a ' ' character.
What the....!! :-)
After a bit of a Google search I found that I could encode a '\t' as &#9; - if I used this instead of a literal '\t' in the example XML, both readers work as expected.
Now Altova XMLSpy and various other XML readers seem to be perfectly happy with '\t' characters in attribute values, so my question is: what is the correct way to handle this?
Should I be writing XML files with '\t' characters encoded as &#9; in attribute values, as XmlReader.Create expects, or are the other XML tools right that literal '\t' characters are valid and XmlReader.Create is broken?
Which way should I go to fix/work around this issue?
Probably something to do with Attribute Value Normalization. For CDATA attributes, an XML parser is required to replace newlines and tabs in attribute values with spaces, unless they are written in escaped form as character references.
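A quick demonstration of that rule (my own sketch, not part of the answer): a literal tab in an attribute is normalized to a space by a conformant parser, while the escaped character reference survives:

var literal = XDocument.Parse("<t a=\"x\ty\"/>");
var escaped = XDocument.Parse("<t a=\"x&#9;y\"/>");

Console.WriteLine((int)literal.Root.Attribute("a").Value[1]); // 32 - the tab became a space
Console.WriteLine((int)escaped.Root.Attribute("a").Value[1]); // 9  - the reference kept the tab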
@all: Thanks for all your answers and comments.
It would seem that Justin and Michael Kay are correct: white space should be encoded according to the W3C XML specification, and the issue is that a significant number of the MS implementations do not honour this requirement.
In my case, XML specification aside, all I really want is for the attribute values to be correctly persisted - i.e. the values saved should be exactly the values read.
The answer to that is to force the use of an XmlWriter created by using XmlWriter.Create method when saving the XML files in the first place.
While both DataSet and XmlDocument provide save/write mechanisms, neither of them correctly encodes white space in attributes when used in its default form. If I force them to use a manually created XmlWriter, however, the correct encoding is applied and written to the file.
So the original file save code becomes:
var xmlDoc = new XmlDocument { PreserveWhitespace = false };
xmlDoc.LoadXml("<test><text attrib=\"Tab'\t'space' '\">Tab'\t'space' '</text></test>");
using (var xmlWriter = XmlWriter.Create(@"d:\TabTest.Encoded.xml"))
{
    xmlDoc.Save(xmlWriter);
}
This writer then correctly encodes the white space in a symmetrical way for the XmlReader.Create reader to read without altering the attribute values.
The other thing to note here is that this solution encapsulates the encoding from my code entirely as the reader and writer perform the encoding and decoding transparently on read and write.
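To see the symmetry, a small round-trip check (my own sketch) can be appended after the save:

var roundTrip = new XmlDocument();
using (var reader = XmlReader.Create(@"d:\TabTest.Encoded.xml"))
{
    roundTrip.Load(reader);
}
// The attribute should still contain the literal tab character:
Debug.Assert(roundTrip.DocumentElement.FirstChild.Attributes["attrib"].Value.Contains("\t"));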
Check out XmlReaderSettings.ConformanceLevel. In particular, this description:
Note that XmlReader objects created by the Create method are more compliant by default than the XmlTextReader class. The following are conformance improvements that are not enabled on XmlTextReader, but are available by default on readers created by the Create method
At a glance it seems that XmlTextReader is not compliant with the W3C recommendation. See the section in the recommendation on attribute value normalization, specifically
For a white space character (#x20, #xD, #xA, #x9), append a space character (#x20) to the normalized value.
Hence the behaviour that you weren't expecting (seeing a space instead of a tab) is actually the correct recommended behaviour.
I have no idea why XmlTextReader is behaving this way (there is nothing in the documentation); however, you seem to have already identified the correct workaround - encode the tab as &#9; instead. In this case the normalised string will contain the tab character itself.

Converting string from memorystream to binary[] contains leading crap

--Edit with more background information--
A (black box) COM object returns me a string.
A 2nd COM object expects this same string as byte[] as input and returns a byte[] with the processed data.
This will be fed to a browser as downloadable, non-human-readable file that will be loaded in a client side stand-alone application.
So I get the string inputString from the 1st COM object and convert it into a byte[] as follows:
BinaryFormatter bf = new BinaryFormatter();
MemoryStream ms = new MemoryStream();
bf.Serialize(ms, inputString); // writes BinaryFormatter header bytes before the string data
byte[] obj = ms.ToArray();
I feed it to the 2nd COM and read it back out.
The result gets written to the browser.
Response.ContentType = "application/octet-stream";
Response.AddHeader("content-disposition", "attachment; filename="test.dat");
Response.BinaryWrite(obj);
The error occurs in the 2nd COM object because the formatting is incorrect.
I went to check the original string, and that was perfectly fine. I then pumped the result from the 1st COM directly to the browser and watched what came out. It appeared that somewhere along the road extra unreadable characters were added. What are these characters, what are they used for, and how can I prevent them from making my 2nd COM grind to a halt?
The unreadable characters are of this kind:
NUL/SOH/NUL/NUL/NUL/FF/FF/FF/FF/SOH/NUL/NUL/NUL etc
Any ideas?
--Answer--
Use
System.Text.Encoding.UTF8.GetBytes(theString)
rather than
BinaryFormatter.Serialize()
BinaryFormatter is almost certainly not what you want to use.
If you just need to convert a string to bytes, use Encoding.GetBytes for a suitable encoding, of course. UTF-8 is usually correct, but check whether the document specifies an encoding.
Okay, with your updated information: your 2nd COM object expects binary data, but you want to create that binary data from a string. Does it treat it as plain binary data?
My guess is that something is going to reverse this process on the client side. If it's eventually going to want to reconstruct the data as a string, you need to pick the right encoding to use, and use it on both sides. UTF-8 is a good bet in most cases, but if the client side is just going to write out the data to a file and use it as an XML file, you need to choose the appropriate encoding based on the XML.
You said before that the first few characters of the string were just "<foo>" (or something similar) - does that mean there's no XML declaration? If not, choose UTF-8. Otherwise, you should look at the XML declaration and use that to determine your encoding (again defaulting to UTF-8 if the declaration doesn't specify the encoding).
Once you've got the right encoding, use Encoding.GetBytes as mentioned in earlier answers.
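In other words, the symmetric round trip described above (assuming UTF-8 on both sides) is just:

byte[] payload = Encoding.UTF8.GetBytes(inputString); // producer: string -> bytes
string restored = Encoding.UTF8.GetString(payload);   // consumer: bytes -> string
// restored == inputString, with no BinaryFormatter header bytes in front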
I think you are missing the point of binary serialization.
For starters, what Type is formulaXml?
Binary serialization will compact that into a machine-represented value, NOT XML! The content will look like:
ÿÿÿÿ AIronScheme, Version=1.0.0.0, Culture=neutral, Public
Perhaps you should be looking at the XML serializer instead.
Update:
You want to write out some XML as a 'content-disposition' stream.
To do this, do something like:
byte[] buffer = Encoding.Default.GetBytes(formulaXml);
Response.BinaryWrite(buffer);
That should work like you hoped (I think).
The BinaryFormatter's job is to convert the objects into some opaque serialisation format that can only be understood by another BinaryFormatter at the other end.
(Just about to mention Encoding.GetBytes as well, but Jon beat me to it.)
You'll probably want to use System.Text.Encoding.UTF8.GetBytes().
Is the crap at the beginning two bytes long?
This could be the byte order mark of a Unicode encoded string.
http://en.wikipedia.org/wiki/Byte-order_mark
