C# issue with reading XML that contains characters in different encodings

I ran into a problem reading an XML file. I found a workaround, but some questions remain. The faulty XML file is encoded in UTF-8 and declares that encoding in its header, but it also contains one character, 'é', encoded in UTF-16. This code was used to read the XML file in order to validate its content:
var xDoc = XDocument.Load(taxFile);
It throws an exception for the faulty XML file: "Invalid character in the given encoding. Line 59, position 104." The quick fix is as follows:
XDocument xDoc = null;
using (var oReader = new StreamReader(taxFile, Encoding.UTF8))
{
xDoc = XDocument.Load(oReader);
}
This code doesn't throw an exception for the faulty file, but the 'é' character is loaded as �. My first question is: why does it work?
Another point is that XmlReader doesn't throw an exception until the node containing 'é' is read.
XmlReader xmlTax = XmlReader.Create(filePath);
Again, the StreamReader workaround helps, and the same question applies.
The fix doesn't seem good enough, because one day an XML file in another encoding may show up and be processed incorrectly. However, I tried processing a UTF-16 encoded XML file and it worked fine (with the reader configured for UTF-8).
The final question is whether there are any options for XDocument/XmlReader to ignore character encoding problems, or something along those lines.
Looking forward to your replies. Thanks in advance.

The first thing to note is that the XML file is in fact flawed - mixing text encodings in the same file like this should not be done. The error is even more obvious when the file actually has an explicit encoding embedded.
As for why it can be read without an exception using StreamReader: an Encoding carries settings that control what happens when incompatible data is encountered.
Encoding.UTF8 is documented to use fallback characters. From http://msdn.microsoft.com/en-us/library/system.text.encoding.utf8.aspx:
The UTF8Encoding object that is returned by this property may not have
the appropriate behavior for your application. It uses replacement
fallback to replace each string that it cannot encode and each byte
that it cannot decode with a question mark ("?") character.
You can instantiate the encoding yourself to get different settings. This is most probably what XDocument.Load() does, as it would generally be bad to hide errors by default.
http://msdn.microsoft.com/en-us/library/system.text.utf8encoding.aspx
If you are being sent such broken XML files, step 1 is to complain (loudly) about it. There is no valid reason for such behavior. If you absolutely must process them anyway, I suggest having a look at the UTF8Encoding class and its DecoderFallback property. It seems you should be able to implement a custom DecoderFallback and DecoderFallbackBuffer to add logic that understands the UTF-16 byte sequence.
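To make the difference between the two fallback behaviours concrete, here is a minimal sketch. The class name, method names and sample bytes are invented for illustration; the sample uses a lone 0xE9 byte, which is how 'é' is stored in single-byte encodings such as ISO-8859-1 and which is not valid UTF-8 on its own.

```csharp
using System;
using System.Text;

public static class FallbackDemo
{
    // "W", a lone 0xE9 byte, "d". This is not valid UTF-8: 0xE9 opens a
    // three-byte sequence that is never completed.
    public static readonly byte[] Sample = { 0x57, 0xE9, 0x64 };

    // Encoding.UTF8 uses a replacement fallback, so bad bytes are
    // silently swapped for a replacement character.
    public static string DecodeLenient(byte[] data)
    {
        return Encoding.UTF8.GetString(data);
    }

    // UTF8Encoding(encoderShouldEmitUTF8Identifier, throwOnInvalidBytes):
    // passing true as the second argument makes bad bytes throw a
    // DecoderFallbackException instead of being replaced.
    public static string DecodeStrict(byte[] data)
    {
        var strict = new UTF8Encoding(false, true);
        return strict.GetString(data);
    }
}
```

DecodeLenient(Sample) returns a string in which the bad byte has become U+FFFD (the � seen in the question), whereas DecodeStrict(Sample) throws. Whether XDocument.Load uses exactly this strict configuration internally is not confirmed here; the sketch only shows that both behaviours are available.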

Related

How do I extract UTF-8 strings out of a JSON file using LitJSON, as JsonData does not seem to convert?

I've tried many methods to extract some strings out of a JSON file using LitJson in Unity.
I've added encoding conversions all over, tried getting byte arrays and passing them around, and nothing seems to work.
I went to the very start of where I create the JsonData object and tried to run the following test:
public JsonData CreateJSONDataObject()
{
Debug.Assert(pathName != null, "No JSON Data path name set. Please set before commencing read.");
string jsonString = File.ReadAllText(Application.dataPath + pathName, System.Text.Encoding.UTF8);
JsonData jsonDataObject = JsonMapper.ToObject(jsonString);
Debug.Log("Test compatibility: ë | " + jsonDataObject["Roots"][2]["name"]);
return jsonDataObject;
}
I made sure my jsonString is using UTF-8, however the output shows this:
Test compatibility: ë | W�den
I've tried many other methods, but as this is making sure to encode right when creating a JsonData object I can't think of what I am doing wrong as I just don't know enough about JSON.
Thank you in advance.
This type of problem occurs when a text file is written with one encoding and read using a different one. I was able to reproduce your problem with the following program, which removes the JSON serialization from the equation entirely:
string file = @"c:\temp\test.txt";
string text = "Wöden";
File.WriteAllText(file, text, Encoding.Default);
string text2 = File.ReadAllText(file, Encoding.UTF8);
Debug.WriteLine(text2);
Since you are reading with UTF-8 and it is not working, the real question is, what encoding was used to write the file originally? You should be using the same encoding to read it back. I suspect that the file was originally created using either Windows-1252 or iso-8859-1 instead of UTF-8. Try using one of those when you read the file, e.g.:
string jsonString = File.ReadAllText(Application.dataPath + pathName,
Encoding.GetEncoding("Windows-1252"));
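The same mismatch can be reproduced entirely in memory, without touching the file system. This is only a sketch: the class and method names are made up, and ISO-8859-1 stands in for whatever legacy encoding the file was really saved in (it is used here because it is available on every .NET platform without registering extra code pages).

```csharp
using System.Text;

public static class MismatchDemo
{
    // Encode with a single-byte legacy encoding, then decode as UTF-8.
    // Bytes above 0x7F are not valid UTF-8 on their own, so they come
    // back as the replacement character U+FFFD.
    public static string WriteLatin1ReadUtf8(string text)
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        byte[] bytes = latin1.GetBytes(text);
        return Encoding.UTF8.GetString(bytes);
    }

    // Encode and decode with the same encoding: the text survives.
    public static string RoundTripLatin1(string text)
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        return latin1.GetString(latin1.GetBytes(text));
    }
}
```

Calling WriteLatin1ReadUtf8("W\u00F6den") gives back a string with U+FFFD in place of the ö, which matches the output in the question, while RoundTripLatin1 returns the original text.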
You said in the comments that your JSON file was not created programmatically, but was "written by hand", meaning you used Notepad or some other text editor to make the file. If that is so, then that explains how you got into this situation. When you save the file, you should have the option to choose an encoding. For Notepad at least, the default encoding is "ANSI", which most likely maps to Windows-1252 (Western European), but depends on your locale. If you are in the Baltic region, for example, it would be Windows-1257 (Baltic). In any case, "ANSI" is not UTF-8. If you want to save the file in UTF-8 encoding, you have to specifically choose that option. Whatever option you use to save the file, that is the encoding you need to use to read it the next time, whether it is with a text editor or with code. Using the wrong encoding to read the file is what causes the corruption.
To change the encoding of a file, you first have to read it in using the same encoding that it was saved in originally, and then you can write it back out using a different encoding. You can do that with your text editor, simply by re-saving the file with a different encoding, or you can do that programmatically:
string text = File.ReadAllText(file, originalEncoding);
File.WriteAllText(file, text, newEncoding);
The key is knowing which encoding was used originally, and therein lies the rub. For legacy encodings (such as Windows-12xx) there is no way to tell because there is no marker in the file which identifies it. Unicode encodings (e.g. UTF-8, UTF-16), on the other hand, do write out a marker at the beginning of the file, called a BOM, or byte-order mark, which can be detected programmatically. That, coupled with the fact that Unicode encodings can represent all characters, is why they are much preferred over legacy encodings.
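As a rough illustration of detecting a BOM programmatically, the following sketch inspects the first bytes of a buffer. The class name, method names and returned labels are hypothetical. Note that a UTF-16 little-endian file whose first character is U+0000 would look like UTF-32 little-endian to this check, and that a missing BOM proves nothing about the encoding.

```csharp
public static class BomSniffer
{
    // Returns a label for the BOM found at the start of the data,
    // or null when no known BOM is present.
    public static string DetectBom(byte[] data)
    {
        if (data == null) return null;
        if (StartsWith(data, new byte[] { 0xEF, 0xBB, 0xBF })) return "UTF-8";
        // UTF-32 LE must be tested before UTF-16 LE: both begin with FF FE.
        if (StartsWith(data, new byte[] { 0xFF, 0xFE, 0x00, 0x00 })) return "UTF-32 LE";
        if (StartsWith(data, new byte[] { 0x00, 0x00, 0xFE, 0xFF })) return "UTF-32 BE";
        if (StartsWith(data, new byte[] { 0xFF, 0xFE })) return "UTF-16 LE";
        if (StartsWith(data, new byte[] { 0xFE, 0xFF })) return "UTF-16 BE";
        return null;
    }

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

In practice, StreamReader with detectEncodingFromByteOrderMarks set to true performs a similar check, so hand-rolled detection like this is mainly useful for diagnostics.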
For more information, I highly recommend reading What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.

Xml exception due to leading unicode character in REST API response

When I try to parse a response from a certain REST API, I'm getting an XmlException saying "Data at the root level is invalid. Line 1, position 1." Looking at the XML it looks fine, but then examining the first character I see that it is actually a zero-width no-break space (character code 65279 or 0xFEFF).
Is there any good reason for that character to be there? Maybe I'm supposed to be setting a different Encoding when I make my request? Currently I'm using Encoding.UTF8.
I've thought about just removing the character from the string, or asking the developer of the REST API to fix it, but before I do either of those things I wanted to check if there is a valid reason for that character to be there. I'm no unicode expert. Is there something different I should be doing?
Edit: I suspected that it might be something like that (BOM). So, the question becomes, should I have to deal with this character specially? I've tried loading the XML two ways and both throw the same exception:
public static User GetUser()
{
WebClient req = new WebClient();
req.Encoding = Encoding.UTF8;
string response = req.DownloadString(url);
XmlSerializer ser = new XmlSerializer(typeof(User));
User user = ser.Deserialize(new StringReader(response)) as User;
XElement xUser = XElement.Parse(response);
...
return user;
}
U+FEFF is a byte order mark. It's there at the start of the document to indicate the character encoding (or rather, the byte order of an encoding which could go either way; most specifically UTF-16). It's entirely reasonable for it to be there at the start of an XML document. Its use as a zero-width non-breaking space is deprecated in favour of U+2060.
It would be unreasonable if the byte-order mark was in a different encoding, e.g. if it were a UTF-8 BOM in a document which claimed to be UTF-16.
How are you loading the document? Perhaps you're specifying an inappropriate encoding somewhere? It's best to let the XML API detect the encoding if at all possible.
EDIT: After you've downloaded it as a string, I can imagine that could cause problems, given that the BOM is used to detect the encoding, which you've already decided on. Don't download it as a string; download it as binary data (WebClient.DownloadData) and then you should be able to parse it okay, I believe. However, you probably still shouldn't use XElement.Parse as there may well be a document declaration; use XDocument.Parse. I'd be slightly surprised if the result of the call could be fed straight into XmlSerializer, but you can have a go; wrap it in a MemoryStream if necessary.
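A minimal sketch of the binary route, assuming the response body is already available as a byte array (for example from WebClient.DownloadData). The class and method names are made up for illustration. Because the XML API sees the raw bytes, it can consume the BOM itself while working out the encoding.

```csharp
using System.IO;
using System.Xml.Linq;

public static class BomSafeXml
{
    // Parses XML from raw bytes. XDocument.Load(Stream) inspects the
    // leading bytes (BOM and XML declaration) to pick the encoding, so
    // a BOM at the start does not reach the parser as a character.
    public static XDocument LoadFromBytes(byte[] payload)
    {
        using (var stream = new MemoryStream(payload))
        {
            return XDocument.Load(stream);
        }
    }
}
```

With a payload made of the UTF-8 preamble followed by a UTF-8 encoded document, LoadFromBytes returns the parsed document and its Root is the expected element.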
That is called a Byte Order Mark. It's not required in UTF-8 though.
Instead of using Encoding.UTF8, create your own UTF-8 encoder, using the constructor overload that lets you specify whether or not the BOM is to be emitted:
req.Encoding = new UTF8Encoding( false ) ; // omit the BOM
I believe that will do the trick for you.
Amended to Note: The following will work:
public static User GetUser()
{
WebClient req = new WebClient();
req.Encoding = Encoding.UTF8;
byte[] response = req.DownloadData(url);
User instance ;
using ( MemoryStream stream = new MemoryStream(response) )
using ( XmlReader reader = XmlReader.Create( stream ) )
{
XmlSerializer serializer = new XmlSerializer(typeof(User)) ;
instance = (User) serializer.Deserialize( reader ) ;
}
return instance ;
}
That character at the beginning is the BOM (Byte Order Mark). It's placed as the first character in unicode text files to specify which encoding was used to create the file.
The BOM should not be part of the response, as the encoding is specified differently for HTTP content.
Typically a BOM in the response comes from sending a text file as the response, where the text file was saved with the BOM signature. Visual Studio, for example, has an option to save a file without the BOM signature so that it can be sent directly as a response.

XML exception on loading

I have an XML file. When I try to load it using .LOAD methods, I get this exception:
System.Xml.XmlException: data at root level invalid at position 1 line 1.
What I have at the beginning of the XML file is this:
<?xml version="1.0" standalone="yes" ?>
I think the string passed to LoadXml is constructed incorrectly, in one of these ways:
ignoring the BOM and forcing the wrong encoding
reading the BOM as the first character
building the string by hand so that its first character is not <
Based on the last comment, I bet the code looks something like the following (or some variation of it), instead of loading the XML directly from a Stream object, which would handle the encoding properly:
// My guess of how wrong code looks like! Not a solution!!!!
StreamReader r = new StreamReader(path, System.Text.Encoding.Unicode);
string xml = r.ReadToEnd();
XmlDocument d = new XmlDocument();
d.LoadXml(xml);
You should review your code that constructs the string you are using in XmlDocument.LoadXml and check if it is indeed valid XML. I'd recommend to create small program that models code that is failing and investigate the behavior.
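For comparison, here is a sketch of the stream-based approach mentioned above. The class and method names are hypothetical. XmlDocument.Load is given the file stream directly, so the parser works out the encoding from the BOM or the XML declaration rather than relying on a guess made by the caller.

```csharp
using System.IO;
using System.Xml;

public static class XmlDocumentLoader
{
    // Opens the file as raw bytes and lets XmlDocument detect the
    // encoding itself, instead of decoding to a string first.
    public static XmlDocument LoadFromFile(string path)
    {
        var doc = new XmlDocument();
        using (FileStream stream = File.OpenRead(path))
        {
            doc.Load(stream);
        }
        return doc;
    }
}
```

This only addresses encoding-related causes of the error; it does not help if the content is not XML in the first place.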
Position 1 line 1 suggests a problem with the very first char it encounters.
I would suggest firstly confirming that no leading whitespace/other char is in there (sounds silly, but they can creep in easily).
It could also be a char encoding issue, causing that first char to not be read as a '<'.
I bet it's not there. I've found that when I've gotten this error the file or path is missing/incorrect.
Thanks for pouring in your suggestions. The problem was on the build server, the XML file was being pulled from a field called contents in a table called File. I am accessing the XML using the FileID. But the FileID is not the same as FileID on my local database. So, On the build server, I was pulling the XML from a test record which had dummy data. Hence the error. Hope I have made sense. I have fixed the issue by dynamically finding the FileID and querying the contents.

XML Exception: Invalid Character(s)

I am working on a small project that is receiving XML data in string form from a long running application. I am trying to load this string data into an XDocument (System.Xml.Linq.XDocument), and then from there do some XML Magic and create an xlsx file for a report on the data.
On occasion, I receive the data that has invalid XML characters, and when trying to parse the string into an XDocument, I get this error.
[System.Xml.XmlException]
Message: '?', hexadecimal value 0x1C, is an invalid character.
Since I have no control over the remote application, you could expect ANY kind of character.
I am well aware that XML has a way where you can put characters in it such as &#x1C or something like that.
If at all possible I would SERIOUSLY like to keep ALL the data. If not, then so be it.
I have thought about editing the response string programmatically and then re-parsing it if an exception is thrown, but I have tried a few methods and none of them seem to work.
Thank you for your thought.
Code is something along the line of this:
TextReader tr;
XDocument doc;
string response; //XML string received from server.
...
tr = new StringReader (response);
try
{
doc = XDocument.Load(tr);
}
catch (XmlException e)
{
//handle here?
}
You can use an XmlReader and set the XmlReaderSettings.CheckCharacters property to false. This will let you read the XML despite the invalid characters. From there you can pass it to an XmlDocument or XDocument object.
You can read a little more about it in my blog.
To load the data to a System.Xml.Linq.XDocument it will look a little something like this:
XDocument xDocument = null;
XmlReaderSettings xmlReaderSettings = new XmlReaderSettings { CheckCharacters = false };
using (XmlReader xmlReader = XmlReader.Create(filename, xmlReaderSettings))
{
xmlReader.MoveToContent();
xDocument = XDocument.Load(xmlReader);
}
More information can be found here.
XML can handle just about any character, but there are ranges, control codes and such, that it won't.
Your best bet, if you can't get them to fix their output, is to sanitize the raw data you're receiving. You need to replace illegal characters with the character reference format you noted.
(You can't even resort to CDATA, as there is no way to escape these characters there.)
Would something as described in this blog post be helpful?
Basically, he creates a sanitizing xml stream.
If your input is not XML, you should use something like Tidy or Tagsoup to clean the mess up.
They would take any input and try, hopefully, to make a useful DOM from it.
I don't know how relevant dark side libraries are called.
Garbage In, Garbage Out. If the remote application is sending you garbage, then that's all you'll get. If they think they're sending XML, then they need to be fixed. In this case, you're not doing them any favors by working around their bug.
You should also make sure of what they think they're sending. What did the %1C mean to them? What did they want it to be?
IMHO the best solution would be to modify the code/program/whatever produced the invalid XML that is being fed to your program. Unfortunately this is not always possible. In this case you need to escape all characters < 0x20 before trying to load the document.
If you really can't fix the source XML data, consider taking an approach like I described in this answer. Basically, you create a TextReader subclass (e.g StripTextReader) that wraps an existing TextReader (tr) and discards invalid characters.
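A simpler variant of the same idea works on a string that is already in memory rather than wrapping a TextReader. This is a sketch only: the class and method names are invented, and it drops the offending characters, so it does not meet the goal of keeping all the data. It relies on XmlConvert.IsXmlChar, which tests a character against the XML 1.0 character range.

```csharp
using System.Text;
using System.Xml;

public static class XmlSanitizer
{
    // Returns a copy of the input without characters that XML 1.0 forbids
    // (for example 0x1C). Valid surrogate pairs are kept; lone surrogates
    // are dropped.
    public static string StripInvalidXmlChars(string input)
    {
        if (string.IsNullOrEmpty(input)) return input;
        var result = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (char.IsHighSurrogate(c) && i + 1 < input.Length && char.IsLowSurrogate(input[i + 1]))
            {
                // Keep the pair together and skip past its second half.
                result.Append(c).Append(input[i + 1]);
                i++;
            }
            else if (!char.IsSurrogate(c) && XmlConvert.IsXmlChar(c))
            {
                result.Append(c);
            }
        }
        return result.ToString();
    }
}
```

The cleaned string can then be passed to XDocument.Parse or wrapped in a StringReader as in the question.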
It's a late answer, but it may help someone. When you read or serialize XML, it may have one invisible character at the beginning. XDocument doesn't like this invisible character.
So while reading the XML, just start reading from the first < character:
var myXml = XDocument.Parse(loadedString.Substring(loadedString.IndexOf("<")));
That's it and it loads just fine.

How do I safely create an XPathNavigator against a Stream in C#?

Given a Stream as input, how do I safely create an XPathNavigator against an XML data source?
The XML data source:
May possibly contain invalid hexadecimal characters that need to be removed.
May contain characters that do not match the declared encoding of the document.
As an example, some XML data sources in the cloud will have a declared encoding of utf-8, but the actual encoding is windows-1252 or ISO 8859-1, which can cause an invalid character exception to be thrown when creating an XmlReader against the Stream.
From the StreamReader.CurrentEncoding property documentation: "The current character encoding used by the current reader. The value can be different after the first call to any Read method of StreamReader, since encoding autodetection is not done until the first call to a Read method." This seems to indicate that CurrentEncoding can be checked after the first read, but are we stuck storing this encoding when we need to write out the XML data to a Stream?
I am hoping to find a best practice for safely creating an XPathNavigator/IXPathNavigable instance against an XML data source that will gracefully handle encoding and invalid character issues (in C# preferably).
I had a similar issue when some XML fragments were imported into a CRM system using the wrong encoding (there was no encoding stored along with the XML fragments).
In a loop I created a wrapper stream using the current encoding from a list. The encoding was constructed using the DecoderExceptionFallback and EncoderExceptionFallback options (as mentioned by @Doug). If a DecoderFallbackException was thrown during processing, the original stream was reset and the next-most-likely encoding was used.
Our encoding list was something like UTF-8, Windows-1252, GB-2312 and US-ASCII. If you fell off the end of the list then the stream was really bad and was rejected/ignored/etc.
EDIT:
I whipped up a quick sample and basic test files (source here). The code doesn't have any heuristics to choose between code pages that both match the same set of bytes, so a Windows-1252 file may be detected as GB2312, and vice-versa, depending on file content, and encoding preference ordering.
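Since the linked sample is not reproduced here, the following is an independent sketch of the loop described above, not the original code. It works on a byte array instead of a stream, and the class name, method name and candidate list are assumptions. Each candidate is created with exception fallbacks so that a wrong guess throws instead of silently producing replacement characters.

```csharp
using System.Collections.Generic;
using System.Text;

public static class EncodingGuesser
{
    // Tries each candidate encoding in order and returns the text decoded
    // by the first one that accepts every byte. matchedName is set to the
    // winning encoding name, or null if none matched.
    public static string DecodeWithFirstMatch(
        byte[] data, IEnumerable<string> candidateNames, out string matchedName)
    {
        foreach (string name in candidateNames)
        {
            Encoding strict = Encoding.GetEncoding(
                name,
                EncoderFallback.ExceptionFallback,
                DecoderFallback.ExceptionFallback);
            try
            {
                string text = strict.GetString(data);
                matchedName = name;
                return text;
            }
            catch (DecoderFallbackException)
            {
                // Not valid in this encoding; fall through to the next one.
            }
        }
        matchedName = null;
        return null;
    }
}
```

The order of the list matters: a single-byte encoding such as ISO-8859-1 accepts any byte sequence, so it only makes sense as the last candidate. As noted above, no heuristic here distinguishes between code pages that both accept the same bytes.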
It's possible to use the DecoderFallback class (and a few related classes) to deal with bad characters, either by skipping them or by doing something else (restarting with a new encoding?).
When using an XmlTextReader or something similar, the reader itself will figure out the encoding declared in the XML file.
