Reading oddly-formatted XML file C# - c#

I need some help with reading an oddly-formatted XML file. Because of the way the nodes and attributes are structured, I keep running into XMLException errors (at least, that's what the output window is telling me; my breakpoints refuse to fire so that I can check it). Anyway, here's the XML. Anyone experienced anything like this before?
<ApplicationMonitoring>
<MonitoredApps>
<Application>
<function1 listenPort="5000"/>
</Application>
<Application>
<function2 listenPort="6000"/>
</Application>
</MonitoredApps>
<MIBs>
<site1 location="test.mib"/>
</MIBs>
<Community value="public"/>
<proxyAgent listenPort="161" timeOut="2"/>
</ApplicationMonitoring>
Cheers
EDIT: Current version of the parsing code (file path shortened - Im not actually using this one):
XmlDocument xml = new XmlDocument();
xml.LoadXml(#"..\..\..\ApplicationMonitoring.xml");
string port = xml.DocumentElement["proxyAgent"].InnerText;

Your problem in loading the XML is that xml.LoadXml expects you to pass the xml document as a string, not a file reference.
Try instead using:
xml.Load(#"..\..\..\ApplicationMonitoring.xml");
Essentially in your original code you are telling it that your xml document is
..\..\..\ApplicationMonitoring.xml
And I'm sure you can now see why there is a parse exception. :) I've tested this with your xml document and the modified load and it works fine (except for the issue that Only Bolivian Here pointed out with the fact that your inner Text is not going to return anything.
For completeness you probably want:
XmlDocument xml = new XmlDocument();
xml.Load(#"..\..\..\ApplicationMonitoring.xml");
string port = xml.DocumentElement["proxyAgent"].Attributes["listenPort"].Value;
//And to get stuff more specifically in the tree something like this
string function1 = xml.SelectSingleNode("//function1").Attributes["listenPort"].Value;
Note the use of the Value property on the attribute and not the ToString method which won't do what you are expecting.
Exactly how you extract the data from the xml is probably dependant on what you are doing with it. For example you may want to get a list of Application nodes to enumerate over with a foreach by doing this xml.SelectNodes("//Application").
If you are having trouble with extdacting stuff though that is probably the scope of a different question since this was just about how to get the XML document loaded.

xml.DocumentElement["proxyAgent"].InnerText;
The proxyAgent element is self closing. InnerText will return the string inside of an XML element, in this case, there is no inner elements.
You need to access an attribute of the element, not the InnerText.

Try this:
string port = xml.GetElementsByTagName("ProxyAgent")[0].Attributes["listenPort"].ToString();
Or use Linq to XML:
http://msdn.microsoft.com/en-us/library/bb387098.aspx
And... your XML is not malformed...

Related

C# does XPathDocument(Path) evaluate the whole path before returning the document?

my English is not he best, but it will work I think.
Also I'm an absolut newcomer to C#.
Given is the following code snippet:
It has to open an XML-document, from which I KNOW that one of the nodes can be missintepreted, btw is really wrong.
try
{
XPathDocument expressionLib = new XPathDocument(path);
XPathNavigator xnav = expressionLib.CreateNavigator();
}
...and so on
my intention is to create the XPathDocument and the XPathNavigator and THEN watch out for the errors.
but my code Fails with "XPathDocument expressionLib = new XPathDocument(path);" (well, it raises an expception which I catch) so I assume that "XPathDocument(path);" validates the whole XML-document before returning it.
At Microsoft pages I didn't find any hints for that assumed behavior - can you verify it?
And, what could be the workaround?
Yes, I WANT open that XML with that error inside (not at the topmost node) and react just for that invalid node and work with the rest of the file.
Enjoy Weekend
Alex.
There is no workaround. If the document is not a valid XML document or there are invalid characters or sections in the document you'll get the exception.
The only way to continue is to handle the XmlException and try to manipulate the Xml data to make it valid which could range from simple if it's just a matter of escaping some invalid character(s) to complex if you have to perform some advanced formatting or if you receive documents containing many different types of errors.
Perhaps the best course of action is to write an XML validator/repair class you'd put your XML document through before attempting to load it with XPathDocument class although I'm pretty sure there must be some library out there that would be able to do all the heavy lifting for you...

Loading XML Document - Name cannot begin with the zero character

I am trying to load something which claims to be an XML document into any type of .net XML object: XElement, XmlDocument, or XmlTextReader. All of them throw an exception :
Name cannot begin with the '0' character, hexadecimal value 0x30
The error related to a bit of 'XML'
<chart_value
color="ff4400"
alpha="100"
size="12"
position="cursor"
decimal_char="."
0=""
/>
I believe the problem is the author should not have named an attribute as 0.
If I could change this I would, but I do not have control of this feed. I suppose those who use it are using more permissive tools. Is there anyway I can load this as XML without throwing an error?
There is no XML declaration either, nor namespace or contract definition. I was thinking I might have to turn it into a string and do a replace, but this is not very elegant. Was wondering if there was any other options.
As many have said, this is not XML.
Having said that, it's almost XML and WANTS to be XML, so I don't think you should use a regex to screw around inside of it (here's why).
Wherever you're getting the stream, dump into into a string, change 0= to something like zero= and try parsing it.
Don't forget to reverse the operation if you have to return-to-sender.
If you're reading from a file, you can do something like this:
var txt = File.ReadAllText(#"\path\to\wannabe.xml");
var clean = txt.Replace("0=", "zero=");
var doc = new XmlDocument();
doc.LoadXml(clean);
This is not guaranteed to remove all potential XML problems -- but it should remove the one you have.
Just replace the Numeric value with '_'
Example: "0=" replace to "_0="
I hope that will fix the problem, thanks.
It might claim to be an XML document, but the claim is clearly false, so you should reject the document.
The only good way to deal with bad XML is to find out what bit of software is producing it, and either fix it or throw it away. All the benefits of XML go out of the window if people start tolerating stuff that's nearly XML but not quite.
The 0="" obviously uses an invalid attribute name 0. You'd probably have to do a find/replace to try and fix the XML if you cannot fix it at the source that created it. You might be able to use RegEx to try to do more efficient manipulation of the XML string.

Alternatives to XDocument

Hey guys, XDocument is being very finicky with one of the xml feeds I have to parse, and keeps giving me the error
'=' is an unexpected token. The expected token is ';'. Line 1, position 576.
Which is basically XDocument crying about a loose "=" sign in the XML document.
I don't have any control over the source XML document, so I need to either get XDocument to ignore this error, or use some other class. Any ideas on either one?
If the document isn't well-formed XML (and my guess is that you have '&=' in the document or some other entity-looking string) then it's unlikely that any other XML parsers are going to be any happier with it. Have you tried loading the document in, say, IE to see if it parses there or pasted to an XML validator? You can also just try XmlDocument.Load() and see if it parses there, that's the next closest XML parser (aside from XmlReader which takes a little bit of setting up).
It won't make for good XML, but if you need to just load up a bad document then the HTML Agility Pack is a good tool. It can overlook many of the things that make HTML not XHTML and not XML-like, so your erroneous XML input will likely be parsed too. The object model it expresses is similar to XmlDocument. e.g.
HtmlDocument doc = new HtmlDocument();
doc.Load("file.xml");
foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//a[#href"])
{
HtmlAttribute att = link["href"];
att.Value = FixLink(att);
}
doc.Save("file.htm");
Or you can use Agility Pack to clean up the XML and then feed its clean output to a real XML parser for further processing.
This is a quick and dirty trick that I've used for one-time tasks. It's not necessarily recommended over a proper solution.
What I would recommended if time permits is to somehow format/fix the erroneous XML content (e.g. maybe in its string form, or using another tool) before feeding it to an XML parser.
Take a look at the answers of this question: Parsing an XML/XHTML document but ignoring errors in C#
The best option I believe is to parse it in a try/catch block, remove the offending block inside the catch block, and re-parse.

How to tell if a string is xml?

We have a string field which can contain XML or plain text. The XML contains no <?xml header, and no root element, i.e. is not well formed.
We need to be able to redact XML data, emptying element and attribute values, leaving just their names, so I need to test if this string is XML before it's redacted.
Currently I'm using this approach:
string redact(string eventDetail)
{
string detail = eventDetail.Trim();
if (!detail.StartsWith("<") && !detail.EndsWith(">")) return eventDetail;
...
Is there a better way?
Are there any edge cases this approach could miss?
I appreciate I could use XmlDocument.LoadXml and catch XmlException, but this feels like an expensive option, since I already know that a lot of the data will not be in XML.
Here's an example of the XML data, apart from missing a root element (which is omitted to save space, since there will be a lot of data), we can assume it is well formed:
<TableName FirstField="Foo" SecondField="Bar" />
<TableName FirstField="Foo" SecondField="Bar" />
...
Currently we are only using attribute based values, but we may use elements in the future if the data becomes more complex.
SOLUTION
Based on multiple comments (thanks guys!)
string redact(string eventDetail)
{
if (string.IsNullOrEmpty(eventDetail)) return eventDetail; //+1 for unit tests :)
string detail = eventDetail.Trim();
if (!detail.StartsWith("<") && !detail.EndsWith(">")) return eventDetail;
XmlDocument xml = new XmlDocument();
try
{
xml.LoadXml(string.Format("<Root>{0}</Root>", detail));
}
catch (XmlException e)
{
log.WarnFormat("Data NOT redacted. Caught {0} loading eventDetail {1}", e.Message, eventDetail);
return eventDetail;
}
... // redact
If you're going to accept not well formed XML in the first place, I think catching the exception is the best way to handle it.
One possibility is to mix both solutions. You can use your redact method and try to load it (inside the if). This way, you'll only try to load what is likely to be a well-formed xml, and discard most of the non-xml entries.
If your goal is reliability then the best option is to use XmlDocument.LoadXml to determine if it's valid XML or not. A full parse of the data may be expensive but it's the only way to reliably tell if it's valid XML or not. Otherwise any character you don't examine in the buffer could cause the data to be illegal XML.
Depends on how accurate a test you want. Considering that you already don't have the official <xml, you're already trying to detect something that isn't XML. Ideally you'd parse the text by a full XML parser (as you suggest LoadXML); anything it rejects isn't XML. The question is, do you care if you accept a non-XML string? For instance,
are you OK with accepting
<the quick brown fox jumped over the lazy dog's back>
as XML and stripping it? If so, your technique is fine. If not, you have to decide how tight a test you want and code a recognizer with that degree of tightness.
How is the data coming to you? What is the other type of data surrounding it? Perhaps there is a better way; perhaps you can tokenise the data you control, and then infer that anything that is not within those tokens is XML, but we'd need to know more.
Failing a cute solution like that, I think what you have is fine (for validating that it starts and ends with those characters).
We need to know more about the data format really.
If the XML contains no root element (i.e. it's an XML fragment, not a full document), then the following would be perfectly valid sample, as well - but wouldn't match your detector:
foo<bar/>baz
In fact, any text string would be valid XML fragment (consider if the original XML document was just the root element wrapping some text, and you take the root element tags away)!
try
{
XmlDocument myDoc = new XmlDocument();
myDoc.LoadXml(myString);
}
catch(XmlException ex)
{
//take care of the exception
}

XML Exception: Invalid Character(s)

I am working on a small project that is receiving XML data in string form from a long running application. I am trying to load this string data into an XDocument (System.Xml.Linq.XDocument), and then from there do some XML Magic and create an xlsx file for a report on the data.
On occasion, I receive the data that has invalid XML characters, and when trying to parse the string into an XDocument, I get this error.
[System.Xml.XmlException]
Message: '?', hexadecimal value 0x1C, is an invalid character.
Since I have no control over the remote application, you could expect ANY kind of character.
I am well aware that XML has a way where you can put characters in it such as &#x1C or something like that.
If at all possible I would SERIOUSLY like to keep ALL the data. If not, than let it be.
I have thought about editing the response string programatically, then going back and trying to re-parse should an exception be thrown, but I have tried a few methods and none of them seem successful.
Thank you for your thought.
Code is something along the line of this:
TextReader tr;
XDocument doc;
string response; //XML string received from server.
...
tr = new StringReader (response);
try
{
doc = XDocument.Load(tr);
}
catch (XmlException e)
{
//handle here?
}
You can use the XmlReader and set the XmlReaderSettings.CheckCharacters property to false. This will let you to read the XML file despite the invalid characters. From there you can import pass it to a XmlDocument or XDocument object.
You can read a little more about in my blog.
To load the data to a System.Xml.Linq.XDocument it will look a little something like this:
XDocument xDocument = null;
XmlReaderSettings xmlReaderSettings = new XmlReaderSettings { CheckCharacters = false };
using (XmlReader xmlReader = XmlReader.Create(filename, xmlReaderSettings))
{
xmlReader.MoveToContent();
xDocument = XDocument.Load(xmlReader);
}
More information can be found here.
XML can handle just about any character, but there are ranges, control codes and such, that it won't.
Your best bet, if you can't get them to fix their output, is to sanitize the raw data you're receiving. You need replace illegal characters with the character reference format you noted.
(You can't even resort to CDATA, as there is no way to escape these characters there.)
Would something as described in this blog post be helpful?
Basically, he creates a sanitizing xml stream.
If your input is not XML, you should use something like Tidy or Tagsoup to clean the mess up.
They would take any input and try, hopefully, to make a useful DOM from it.
I don't know how relevant dark side libraries are called.
Garbage In, Garbage Out. If the remote application is sending you garbage, then that's all you'll get. If they think they're sending XML, then they need to be fixed. In this case, you're not doing them any favors by working around their bug.
You should also make sure of what they think they're sending. What did the %1C mean to them? What did they want it to be?
IMHO the best solution would be to modify the code/program/whatever produced the invalid XML that is being fed to your program. Unfortunately this is not always possible. In this case you need to escape all characters < 0x20 before trying to load the document.
If you really can't fix the source XML data, consider taking an approach like I described in this answer. Basically, you create a TextReader subclass (e.g StripTextReader) that wraps an existing TextReader (tr) and discards invalid characters.
Its a late answer, but may help someone. When you read or serialize an XML it may have 1 invisible character at the beginning of the XML. XDocument don't like this invisible character.
So while reading the XML, just start reading from the first < character:
var myXml = XDocument.Parse(loadedString.Substring(loadedString.IndexOf("<")));
That's it and it loads just fine.

Categories