Validating an XML file without an xsd file - c#

I found a lot of great questions on this subject. Unfortunately the answers all say to use an xsd file. I made an xsd file from an xml file using xsd.exe. I copied code from here and pasted into Visual studio and I got an error on the first line.
Not wanted to spend time figuring out why it would not run I decided to code the validation myself.
Here are the two points I am using:
Each left caret will have a right caret, so at the end of the file their will be an equal amount of left and right carets.
At the end of the file, if I either take the amount of left carets, or the amount of right carets subtract one from the total (Because the header does not have a backslash) and divide the total by two, I get the number of slashes.
I am encountering some problems though.
I am using string.count() This method also counts carets that are in the attributes (Which I do not want).
I compute the expected number of backslashes when I am done reading the file. If the numbers do not match I write "The expected number of slashes do not match" But I don't know where it is in the file.
I can't think of a way to fix these problems at the moment.
Does anyone have a better way to validate an xml file without using an xsd file?

Take note that WELL FORMED XML is different to VALID XML.
XML that adheres to the XML standard is considered well formed, while XML that adheres to a DTD is considered VALID.
If you just want to check if XML is well formed try this:
try
{
var file = "Your xml path";
var settings = new XmlReaderSettings { DtdProcessing = DtdProcessing.Ignore, XmlResolver = null };
using (var reader = XmlReader.Create(new StreamReader(file), settings))
{
var document = new XmlDocument();
document.Load(reader);
}
}
catch (Exception exc)
{
//show the exception here
}
P.S: Well formedness of an XML is always a prerequisite of a valid XML.
Hope it helps!

When you say "caret", I think you must be talking about the "<" and ">" symbols, which in the XML world are usually referred to as "angle brackets".
When you talk about checking whether the carets match up, you are therefore talking about whether the file conforms to XML syntax. That's called "well-formedness checking" in the XML world. Validation is something different and deeper. You need a schema (either an XSD schema or some other kind) for validation, but for well-formedness checking all you need is an XML parser.
Don't try to implement the well-formedness checking yourself. (a) because it's not easy, (b) because parsers are readily available off-the-shelf, and (c) because you clearly don't have a very advanced understanding of the problem. Just run your file through an XML parser and it will do the job for you.

Related

Parsing XML file in C# - how to report errors

First I load the file in a structure
XElement xTree = XElement.Load(xml_file);
Then I create an enumerable collection of the elements.
IEnumerable<XElement> elements = xTree.Elements();
And iterate elements
foreach (XElement el in elements)
{
}
The problem is - when I fail to parse the element (a user made a typo or inserted a wrong value) - how can I report exact line in the file?
Is there any way to tie an element to its corresponding line in the file?
One way to do it (although not a proper one) –
When you find a wrong value, add an invalid char (e.g. ‘<’) to it.
So instead of: <ExeMode>Bla bla bla</ExeMode>
You’ll have: <ExeMode><Bla bla bla</ExeMode>
Then load the XML again with try / catch (System.Xml.XmlException ex).
This XmlException has LineNumber and LinePosition.
If there is a limited set of acceptable values, I believe XML Schemas have the concept of an enumerated type -- so write a schema for the file and have the parser validate against that. Assuming the parser you're using supports Schemas, which most should by now.
I haven't looked at DTDs in decades, but they may have the same kind of capability.
Otherwise, you would have to consider this semantic checking rather than syntactic checking, and that makes it your application's responsibility. If you are using a SAX parser and interpreting the data as you go, you may be able to get the line number; check your parser's features.
Otherwise the best answer I've found is to report the problem using an xpath to the affected node/token rather than a line number. You may be able to find a canned solution for that, either as a routine to run against a DOM tree or as a state machine you can run alongside your SAX code to track the path as you go.
(All the "maybe"s are because I haven't looked at what's currently available in a very long time, and because I'm trying to give an answer that is valid for all languages and parser implementations. This should still get you pointed in some useful directions.)

C# does XPathDocument(Path) evaluate the whole path before returning the document?

my English is not he best, but it will work I think.
Also I'm an absolut newcomer to C#.
Given is the following code snippet:
It has to open an XML-document, from which I KNOW that one of the nodes can be missintepreted, btw is really wrong.
try
{
XPathDocument expressionLib = new XPathDocument(path);
XPathNavigator xnav = expressionLib.CreateNavigator();
}
...and so on
my intention is to create the XPathDocument and the XPathNavigator and THEN watch out for the errors.
but my code Fails with "XPathDocument expressionLib = new XPathDocument(path);" (well, it raises an expception which I catch) so I assume that "XPathDocument(path);" validates the whole XML-document before returning it.
At Microsoft pages I didn't find any hints for that assumed behavior - can you verify it?
And, what could be the workaround?
Yes, I WANT open that XML with that error inside (not at the topmost node) and react just for that invalid node and work with the rest of the file.
Enjoy Weekend
Alex.
There is no workaround. If the document is not a valid XML document or there are invalid characters or sections in the document you'll get the exception.
The only way to continue is to handle the XmlException and try to manipulate the Xml data to make it valid which could range from simple if it's just a matter of escaping some invalid character(s) to complex if you have to perform some advanced formatting or if you receive documents containing many different types of errors.
Perhaps the best course of action is to write an XML validator/repair class you'd put your XML document through before attempting to load it with XPathDocument class although I'm pretty sure there must be some library out there that would be able to do all the heavy lifting for you...

Is there a way to make XmlDocument parsing less strict

I am making a program that will store its data in an XML file. When people write XML they can make subtle mistakes, like ending a comment with - so it looks like <!-- comment ---> or adding a </>inside an attribute. Naturally, the XML still can be read all right, but trying to input this text into XmlDocument will give a syntax error (and it wont be parsed).
Is there a way to make XmlDocument less strict and make it ignore violations of the standard that do not make the document unparseable? For example, its clear that <!-- comment ---> is still a comment even though it contains - at the end which is against the standard specification).
No, and that's a good thing.
XML is a strict format, the solution here is to have correct (corrected) input.
All XML tools are very picky, by design. You might have some luck with an XMLReeader and fixing or rejecting faulty elements.
But it's far better to create the XML with a suitable tool. Quite a few of them are named XmlPad
No, XML parsers are expected to reject input that is not valid XML.
You may try your luck preprocessing the invalid files by Tidy, but better simply make sure the input is valid.
Here's an example usage. Tidy will fix your comments and do some escaping, but an extra opening < will break things up more often than not - guessing in that case is simply too much to ask.
Tidy tidy = new Tidy();
tidy.Options.FixComments = true;
tidy.Options.XmlTags = true;
tidy.Options.XmlOut = true;
string invalid = "<root>< <!--comment--->></root>";
MemoryStream input = new MemoryStream(Encoding.UTF8.GetBytes(invalid));
MemoryStream output = new MemoryStream();
tidy.Parse(input, output, new TidyMessageCollection());
// TODO check the messages
string repaired = Encoding.UTF8.GetString(output.ToArray());

Loading XML Document - Name cannot begin with the zero character

I am trying to load something which claims to be an XML document into any type of .net XML object: XElement, XmlDocument, or XmlTextReader. All of them throw an exception :
Name cannot begin with the '0' character, hexadecimal value 0x30
The error related to a bit of 'XML'
<chart_value
color="ff4400"
alpha="100"
size="12"
position="cursor"
decimal_char="."
0=""
/>
I believe the problem is the author should not have named an attribute as 0.
If I could change this I would, but I do not have control of this feed. I suppose those who use it are using more permissive tools. Is there anyway I can load this as XML without throwing an error?
There is no XML declaration either, nor namespace or contract definition. I was thinking I might have to turn it into a string and do a replace, but this is not very elegant. Was wondering if there was any other options.
As many have said, this is not XML.
Having said that, it's almost XML and WANTS to be XML, so I don't think you should use a regex to screw around inside of it (here's why).
Wherever you're getting the stream, dump into into a string, change 0= to something like zero= and try parsing it.
Don't forget to reverse the operation if you have to return-to-sender.
If you're reading from a file, you can do something like this:
var txt = File.ReadAllText(#"\path\to\wannabe.xml");
var clean = txt.Replace("0=", "zero=");
var doc = new XmlDocument();
doc.LoadXml(clean);
This is not guaranteed to remove all potential XML problems -- but it should remove the one you have.
Just replace the Numeric value with '_'
Example: "0=" replace to "_0="
I hope that will fix the problem, thanks.
It might claim to be an XML document, but the claim is clearly false, so you should reject the document.
The only good way to deal with bad XML is to find out what bit of software is producing it, and either fix it or throw it away. All the benefits of XML go out of the window if people start tolerating stuff that's nearly XML but not quite.
The 0="" obviously uses an invalid attribute name 0. You'd probably have to do a find/replace to try and fix the XML if you cannot fix it at the source that created it. You might be able to use RegEx to try to do more efficient manipulation of the XML string.

How to tell if a string is xml?

We have a string field which can contain XML or plain text. The XML contains no <?xml header, and no root element, i.e. is not well formed.
We need to be able to redact XML data, emptying element and attribute values, leaving just their names, so I need to test if this string is XML before it's redacted.
Currently I'm using this approach:
string redact(string eventDetail)
{
string detail = eventDetail.Trim();
if (!detail.StartsWith("<") && !detail.EndsWith(">")) return eventDetail;
...
Is there a better way?
Are there any edge cases this approach could miss?
I appreciate I could use XmlDocument.LoadXml and catch XmlException, but this feels like an expensive option, since I already know that a lot of the data will not be in XML.
Here's an example of the XML data, apart from missing a root element (which is omitted to save space, since there will be a lot of data), we can assume it is well formed:
<TableName FirstField="Foo" SecondField="Bar" />
<TableName FirstField="Foo" SecondField="Bar" />
...
Currently we are only using attribute based values, but we may use elements in the future if the data becomes more complex.
SOLUTION
Based on multiple comments (thanks guys!)
string redact(string eventDetail)
{
if (string.IsNullOrEmpty(eventDetail)) return eventDetail; //+1 for unit tests :)
string detail = eventDetail.Trim();
if (!detail.StartsWith("<") && !detail.EndsWith(">")) return eventDetail;
XmlDocument xml = new XmlDocument();
try
{
xml.LoadXml(string.Format("<Root>{0}</Root>", detail));
}
catch (XmlException e)
{
log.WarnFormat("Data NOT redacted. Caught {0} loading eventDetail {1}", e.Message, eventDetail);
return eventDetail;
}
... // redact
If you're going to accept not well formed XML in the first place, I think catching the exception is the best way to handle it.
One possibility is to mix both solutions. You can use your redact method and try to load it (inside the if). This way, you'll only try to load what is likely to be a well-formed xml, and discard most of the non-xml entries.
If your goal is reliability then the best option is to use XmlDocument.LoadXml to determine if it's valid XML or not. A full parse of the data may be expensive but it's the only way to reliably tell if it's valid XML or not. Otherwise any character you don't examine in the buffer could cause the data to be illegal XML.
Depends on how accurate a test you want. Considering that you already don't have the official <xml, you're already trying to detect something that isn't XML. Ideally you'd parse the text by a full XML parser (as you suggest LoadXML); anything it rejects isn't XML. The question is, do you care if you accept a non-XML string? For instance,
are you OK with accepting
<the quick brown fox jumped over the lazy dog's back>
as XML and stripping it? If so, your technique is fine. If not, you have to decide how tight a test you want and code a recognizer with that degree of tightness.
How is the data coming to you? What is the other type of data surrounding it? Perhaps there is a better way; perhaps you can tokenise the data you control, and then infer that anything that is not within those tokens is XML, but we'd need to know more.
Failing a cute solution like that, I think what you have is fine (for validating that it starts and ends with those characters).
We need to know more about the data format really.
If the XML contains no root element (i.e. it's an XML fragment, not a full document), then the following would be perfectly valid sample, as well - but wouldn't match your detector:
foo<bar/>baz
In fact, any text string would be valid XML fragment (consider if the original XML document was just the root element wrapping some text, and you take the root element tags away)!
try
{
XmlDocument myDoc = new XmlDocument();
myDoc.LoadXml(myString);
}
catch(XmlException ex)
{
//take care of the exception
}

Categories