Hey guys, XDocument is being very finicky with one of the xml feeds I have to parse, and keeps giving me the error
'=' is an unexpected token. The expected token is ';'. Line 1, position 576.
Which is basically XDocument crying about a loose "=" sign in the XML document.
I don't have any control over the source XML document, so I need to either get XDocument to ignore this error, or use some other class. Any ideas on either one?
If the document isn't well-formed XML (and my guess is that you have '&=' in the document or some other entity-looking string) then it's unlikely that any other XML parsers are going to be any happier with it. Have you tried loading the document in, say, IE to see if it parses there or pasted to an XML validator? You can also just try XmlDocument.Load() and see if it parses there, that's the next closest XML parser (aside from XmlReader which takes a little bit of setting up).
It won't make for good XML, but if you need to just load up a bad document then the HTML Agility Pack is a good tool. It can overlook many of the things that make HTML not XHTML and not XML-like, so your erroneous XML input will likely be parsed too. The object model it expresses is similar to XmlDocument. e.g.
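HtmlDocument doc = new HtmlDocument();
doc.Load("file.xml");
// Select every <a> element that has an href attribute.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    att.Value = FixLink(att); // FixLink is your own helper, not part of the library
}
doc.Save("file.htm");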
Or you can use Agility Pack to clean up the XML and then feed its clean output to a real XML parser for further processing.
This is a quick and dirty trick that I've used for one-time tasks. It's not necessarily recommended over a proper solution.
What I would recommend, if time permits, is to fix the erroneous XML content (e.g. in its string form, or using another tool) before feeding it to an XML parser.
Take a look at the answers to this question: Parsing an XML/XHTML document but ignoring errors in C#
The best option I believe is to parse it in a try/catch block, remove the offending block inside the catch block, and re-parse.
Related
Currently, I'm working on a feature that involves parsing XML that we receive from another product. I decided to run some tests against some actual customer data, and it looks like the other product is allowing input from users that should be considered invalid. Anyways, I still have to try and figure out a way to parse it. We're using javax.xml.parsers.DocumentBuilder and I'm getting an error on input that looks like the following.
<xml>
...
<description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description>
...
</xml>
As you can tell, the description has what appears to be an invalid tag inside of it (<THIS-IS-PART-OF-DESCRIPTION>). Now, this description tag is known to be a leaf tag and shouldn't have any nested tags inside of it. Regardless, this is still an issue and yields an exception on DocumentBuilder.parse(...)
I know this is invalid XML, but it's predictably invalid. Any ideas on a way to parse such input?
That "XML" is worse than invalid – it's not well-formed; see Well Formed vs Valid XML.
An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.
Options, most desirable first:
1. Have the provider fix the problem on their end. Demand well-formed XML. (Technically the phrase well-formed XML is redundant but may be useful for emphasis.)
2. Use a tolerant markup parser to clean up the problem ahead of parsing as XML:
   - Standalone: xmlstarlet has robust recovering and repair capabilities (credit: RomanPerekhrest):
     xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null
   - Standalone and C/C++: HTML Tidy works with XML too. Taggle is a port of TagSoup to C++.
   - Python: Beautiful Soup is Python-based. See notes in the Differences between parsers section. See also answers to this question for more suggestions for dealing with not-well-formed markup in Python, including especially lxml's recover=True option. See also this answer for how to use codecs.EncodedFile() to clean up illegal characters.
   - Java: TagSoup and JSoup focus on HTML. FilterInputStream can be used for preprocessing cleanup.
   - .NET:
     XmlReaderSettings.CheckCharacters can be disabled to get past illegal XML character problems.
     @jdweng notes that XmlReaderSettings.ConformanceLevel can be set to ConformanceLevel.Fragment so that XmlReader can read XML Well-Formed Parsed Entities lacking a root element.
     @jdweng also reports that XmlReader.ReadToFollowing() can sometimes be used to work around XML syntactical issues, but note the rule-breaking warning in #3 below.
     Microsoft.Language.Xml.XMLParser is said to be “error-tolerant”.
   - Go: Set Decoder.Strict to false as shown in this example by @chuckx.
   - PHP: See DOMDocument::$recover and libxml_use_internal_errors(true). See nice example here.
   - Ruby: Nokogiri supports “Gentle Well-Formedness”.
   - R: See htmlTreeParse() for fault-tolerant markup parsing in R.
   - Perl: See XML::Liberal, a "super liberal XML parser that parses broken XML."
3. Process the data as text manually using a text editor or programmatically using character/string functions. Doing this programmatically can range from tricky to impossible, as what appears to be predictable often is not: rule breaking is rarely bound by rules.
For invalid character errors, use regex to remove/replace invalid characters:
PHP: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
Ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000}-\u{FFFD}", ' ')
JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')
For ampersands, use regex to replace matches with &amp; (credit: blhsin, demo):
&(?!(?:#\d+|#x[0-9a-f]+|\w+);)
Note that the above regular expressions won't take comments or CDATA sections into account.
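For Java readers, the invalid-character cleanup above can be sketched as follows. This is a minimal sketch, not a definitive implementation: the class name XmlCharCleaner and the method clean are made up for illustration, the character ranges follow the XML 1.0 Char production (including the supplementary range that the PHP and Ruby one-liners omit), and, like the regexes above, it does not treat comments or CDATA sections specially.

```java
import java.util.regex.Pattern;

// Hypothetical helper: replaces every character that XML 1.0 does not allow.
class XmlCharCleaner {
    // Negated class built from the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static final Pattern INVALID = Pattern.compile(
        "[^\\x{9}\\x{A}\\x{D}\\x{20}-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]");

    // Each disallowed character becomes a single space, so offsets are preserved.
    static String clean(String s) {
        return INVALID.matcher(s).replaceAll(" ");
    }
}
```

Replacing with a space rather than deleting keeps line and column positions stable, which helps when a later parse error has to be traced back to the raw input.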
A standard XML parser will NEVER accept invalid XML, by design.
Your only option is to pre-process the input to remove the "predictably invalid" content, or wrap it in CDATA, prior to parsing it.
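A rough sketch of the CDATA idea in Java, under the question's own assumption that description is a leaf element and never contains a literal </description>. The class and method names (CdataWrapper, wrapDescriptions, firstDescription) are invented for this example, and Matcher.replaceAll with a function needs Java 9 or later.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical pre-processor: wraps the body of every <description> in CDATA.
class CdataWrapper {
    private static final Pattern DESCRIPTION =
        Pattern.compile("<description>(.*?)</description>", Pattern.DOTALL);

    static String wrapDescriptions(String raw) {
        return DESCRIPTION.matcher(raw).replaceAll(m -> {
            // "]]>" would end the CDATA section early, so split it across two sections.
            String body = m.group(1).replace("]]>", "]]]]><![CDATA[>");
            return Matcher.quoteReplacement(
                "<description><![CDATA[" + body + "]]></description>");
        });
    }

    // Parses the repaired text and returns the text of the first description.
    static String firstDescription(String raw) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(wrapDescriptions(raw))));
            return doc.getElementsByTagName("description").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("input is still not well-formed", e);
        }
    }
}
```

Any description that already contains escaped entities would be read literally after wrapping, so this only suits feeds where the description is known to be raw text.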
The accepted answer is good advice, and contains very useful links.
I'd like to add that this, and many other cases of not-well-formed and/or DTD-invalid XML, can be repaired using SGML, the ISO-standardized superset of HTML and XML. In your case, what works is to declare the bogus THIS-IS-PART-OF-DESCRIPTION element as an SGML empty element and then use e.g. the osx program (part of the OpenSP/OpenJade SGML package) to convert it to XML. For example, if you supply the following to osx
<!DOCTYPE xml [
<!ELEMENT xml - - ANY>
<!ELEMENT description - - ANY>
<!ELEMENT THIS-IS-PART-OF-DESCRIPTION - - EMPTY>
]>
<xml>
<description>blah blah
<THIS-IS-PART-OF-DESCRIPTION>
</description>
</xml>
it will output well-formed XML for further processing with the XML tools of your choice.
Note, however, that your example snippet has another problem: names starting with the letters xml (in any case combination, such as XML or Xml) are reserved by the XML specification, so they are best avoided even where a parser happens to accept them.
IMO these cases should be solved by using JSoup.
Below is not really an answer for this specific case, but I found it on the web (thanks to inuyasha82 on Coderwall). This bit of code inspired me on a similar problem with malformed XML, so I share it here.
Please do not edit what is below; it is reproduced from the original website.
To be well-formed, an XML document must have exactly one root element.
So, for example, this is well-formed XML:
<root>
<element>...</element>
<element>...</element>
</root>
But if you have a document like:
<element>...</element>
<element>...</element>
<element>...</element>
<element>...</element>
This is not well-formed XML, so many XML parsers just throw an exception complaining that there is no root element.
This example shows how to work around that problem and successfully parse the malformed XML above.
Basically, what we will do is add a root element programmatically.
So, first of all, you have to open the resource that contains your "malformed" XML (e.g. a file):
File file = new File(pathtofile);
Then open a FileInputStream:
FileInputStream fis = new FileInputStream(file);
If we try to parse this stream with any XML library at this point, it will raise the malformed-document exception.
Now we create a list of InputStream objects with three elements:
A ByteArrayInputStream that contains the string: <root>
Our FileInputStream
A ByteArrayInputStream with the string: </root>
So the code is:
List<InputStream> streams =
Arrays.asList(
new ByteArrayInputStream("<root>".getBytes()),
fis,
new ByteArrayInputStream("</root>".getBytes()));
Now, using a SequenceInputStream, we create a container for the list created above:
InputStream cntr =
new SequenceInputStream(Collections.enumeration(streams));
Now we can use any XML parser library on cntr, and it will be parsed without any problem (checked with the StAX library).
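Put together, the steps above might look like the following self-contained sketch. The class and method names (RootWrapper, parseWithSyntheticRoot, countElements) are illustrative only, a DOM DocumentBuilder is used here instead of StAX, and the input is assumed to be UTF-8 without an XML declaration or byte-order mark, since either would end up after the injected <root> and break the parse.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper: parses a rootless run of elements by surrounding it with <root>...</root>.
class RootWrapper {
    static Document parseWithSyntheticRoot(InputStream fragment) {
        List<InputStream> streams = Arrays.asList(
            new ByteArrayInputStream("<root>".getBytes(StandardCharsets.UTF_8)),
            fragment,
            new ByteArrayInputStream("</root>".getBytes(StandardCharsets.UTF_8)));
        // SequenceInputStream reads the three streams one after another as a single stream.
        try (InputStream wrapped = new SequenceInputStream(Collections.enumeration(streams))) {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(wrapped);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse wrapped input", e);
        }
    }

    // Convenience wrapper for in-memory text: counts the <element> nodes that were found.
    static int countElements(String fragment) {
        InputStream in = new ByteArrayInputStream(fragment.getBytes(StandardCharsets.UTF_8));
        return parseWithSyntheticRoot(in).getDocumentElement()
            .getElementsByTagName("element").getLength();
    }
}
```

For a file, pass a FileInputStream as the fragment argument; nothing is copied into memory before parsing, which is the main appeal of this approach over string concatenation.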
XML snippet:
<field>&amp; is escaped</field>
<field>&quot;also escaped&quot;</field>
<field>is & "not" escaped</field>
<field>is &quot; and is not & escaped</field>
I'm looking for suggestions on how to pre-process any XML so that everything not already escaped gets escaped before the XML is run through a parser.
I do not have control over the XML being passed to me, they likely won't fix it anytime soon, and I have to find a way to parse it.
The primary issue is that running the XML as-is through a parser, as below, throws an exception because some of it is not escaped properly:
string xml = "<field>& is not escaped</field>";
XmlReader.Create(new StringReader(xml));
I'd suggest you use a Regex to replace un-escaped ampersands with their entity equivalent.
This question is helpful as it gives you a Regex to find these rogue ampersands:
&(?!(?:apos|quot|[gl]t|amp);|#)
And you can see that it matches the correct text in this demo. You can use this in a simple replace operation:
var escXml = Regex.Replace(xml, "&(?!(?:apos|quot|[gl]t|amp);|#)", "&amp;");
And then you'll be able to parse your XML.
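The same replace-then-parse flow, sketched in Java with the regex quoted above. AmpersandFixer, escape and textOf are names made up for this illustration. As with the original regex, any & followed by # is left alone whether or not it is a valid character reference, and comments and CDATA sections are not treated specially.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: escapes ampersands that do not start a known entity reference.
class AmpersandFixer {
    // An "&" that is not followed by one of the five predefined entity names or by "#".
    private static final Pattern BARE_AMP =
        Pattern.compile("&(?!(?:apos|quot|[gl]t|amp);|#)");

    static String escape(String xml) {
        return BARE_AMP.matcher(xml).replaceAll("&amp;");
    }

    // Escapes, parses, and returns the text content of the root element.
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(escape(xml))));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("input is still not well-formed", e);
        }
    }
}
```

Already-escaped input passes through unchanged, so running escape twice is harmless.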
Preprocess the textual data (not really XML) with HTML Tidy with quote-ampersand set to true.
If you want to parse something that isn't XML, you first need to decide exactly what this language is and what you intend to do with it: when you've written a grammar for the non-XML language that you intend to process, you can then decide whether it's possible to handle it by preprocessing or whether you need a full-blown parser.
For example, if you only need to handle an unescaped "&" that's followed by a space, and if you don't care about what happens inside comments and CDATA sections, then it's a fairly easy problem. If you don't want to corrupt the contents of comments or CDATA, or if you need to handle things like &nbsp; when there's no definition of &nbsp;, then life starts to become rather more difficult.
Of course, you and your supplier could save yourselves a great deal of time and expense if you wrote software that conformed to standards. That's what standards are for.
My English is not the best, but I think it will do.
Also, I'm an absolute newcomer to C#.
Given is the following code snippet.
It has to open an XML document in which I KNOW that one of the nodes can be misinterpreted, or rather is really wrong.
try
{
XPathDocument expressionLib = new XPathDocument(path);
XPathNavigator xnav = expressionLib.CreateNavigator();
}
...and so on
My intention is to create the XPathDocument and the XPathNavigator and THEN watch out for the errors.
But my code fails at "XPathDocument expressionLib = new XPathDocument(path);" (well, it raises an exception, which I catch), so I assume that "XPathDocument(path);" checks the whole XML document before returning it.
On the Microsoft pages I didn't find any hints about that assumed behavior. Can you verify it?
And what could be the workaround?
Yes, I WANT to open that XML with the error inside (not at the topmost node), react just to that invalid node, and work with the rest of the file.
Enjoy the weekend
Alex.
There is no workaround. If the document is not a valid XML document or there are invalid characters or sections in the document you'll get the exception.
The only way to continue is to handle the XmlException and try to manipulate the XML data to make it well-formed. That could range from simple, if it's just a matter of escaping some invalid character(s), to complex, if you have to perform some advanced formatting or if you receive documents containing many different types of errors.
Perhaps the best course of action is to write an XML validator/repair class you'd put your XML document through before attempting to load it with XPathDocument class although I'm pretty sure there must be some library out there that would be able to do all the heavy lifting for you...
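As a starting point for such a checker, here is a minimal sketch, written in Java rather than C# and therefore only an analogue of the .NET case. It parses the text and reports where the first well-formedness error occurs instead of letting the exception escape. WellFormednessCheck and firstError are invented names, and the returned "line:column" format is just a convention chosen for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical checker: locates the first well-formedness error in a piece of XML text.
class WellFormednessCheck {
    // Returns null if the text parses, otherwise "line:column" of the first error.
    static String firstError(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow instead of using the default handler, which prints to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXParseException { throw e; }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return e.getLineNumber() + ":" + e.getColumnNumber();
        } catch (Exception e) {
            throw new IllegalStateException("unexpected parser failure", e);
        }
    }
}
```

A repair step could use the reported position to patch or cut the offending region and then call firstError again until it returns null, though, as noted above, the parser stops at the first error, so each pass only reveals one problem.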
I have some very simple code:
XmlDocument doc = new XmlDocument();
Console.WriteLine("loading");
doc.Load(url);
Console.WriteLine("loaded");
XmlNodeList nodeList = doc.GetElementsByTagName("p");
foreach(XmlNode node in nodeList)
{
Console.WriteLine(node.ChildNodes[0].Value);
}
return source;
I'm working on this file and it takes two minutes to load. Why does it take so long? I tried both fetching the file from the net and loading a local file.
I imagine it's the DTD of the page that's taking so long to load. Given that it defines entities, you shouldn't disable it, so you're probably better off not going down this path.
Given the inner workings of the wikipedia parser (a right mess), I'd say it's a big leap to assume it's going to produce well-formed XHTML every time.
Use HTML Agility Pack to parse (then you can convert to XmlDocument a little more easily if required, IIRC).
If you really want to go down the XmlDocument route you can keep a local cache of the HTML DTDs. See this post, this post and this post for details.
It is because XmlDocument doesn't just load your XML into a nice class hierarchy; it also goes and fetches all of the DTDs referenced in the document. Run Fiddler and you will see the calls to fetch
http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent
http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent
These all took me about 20 seconds to fetch.
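The local-DTD-cache idea mentioned earlier is not specific to .NET. As an analogue only, this is roughly how it looks with Java's DocumentBuilder and an EntityResolver. CachedDtdParser and its dtdCache map are hypothetical, and a real cache would hold the full DTD and entity files rather than the one-line stand-in shown in the usage note.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical parser wrapper: serves DTDs from a local map instead of the network.
class CachedDtdParser {
    // dtdCache maps a system identifier (the DTD URL) to locally stored DTD text.
    static Document parse(String xml, Map<String, String> dtdCache) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setEntityResolver((publicId, systemId) -> {
                String cached = dtdCache.get(systemId);
                // Returning null tells the parser to fetch the resource itself.
                return cached == null ? null : new InputSource(new StringReader(cached));
            });
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }
}
```

For example, mapping http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd to a stored copy lets entities such as &nbsp; resolve without any request to w3.org; even a stand-in containing only <!ENTITY nbsp "&#160;"> is enough for a document that uses no other entity.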
I have a question similar to "XML indenting when injecting an XML string into an XmlWriter".
I also use XmlWriter.WriteRaw to output an XML fragment into an XML node. This document fragment does not follow the indentation, which would be nice. The answer given in the post above does not work for XML fragments. Well, I can read the fragment with the XmlReader, but WriteNode does not work with document fragments (see http://msdn.microsoft.com/en-us/library/1wd6aw1b.aspx).
Any hints on how to cheat to get proper indentation?
Thanks for any hints
It's an old question, but I got the same problem today. My solution is to use XmlDocument.WriteContentTo method:
var innerXmlDoc = new XmlDocument();
innerXmlDoc.LoadXml("<params><param id=\"param1\" value=\"value1\"/></params>");
innerXmlDoc.WriteContentTo(xmlWriter);
https://dotnetfiddle.net/a2928r
You could build a valid Xml-Document in memory containing your xml fragment. Then read this document with the XmlReader (e.g. by using MemoryStream) and let the XmlWriter write indented xml.
A faster approach would be to indent the XML yourself by manipulating the string. Search for <, increase the nesting level and add the indentation spaces. If you find </ or a self-closing tag, decrease the nesting level and append a \n.
I don't think there is a fast and nice solution to your problem, but I might be wrong...