Automatically correct invalid XML?

I am currently using SSIS on a project where I need to verify that an XML file has the correct structure. In particular, I have to check that no tag is missing from the XML file and, if one is, I have to retrieve the line that has no tag.
I'll give you an example to better understand.
<?xml version="1.0"?>
<catalog>
<DATA>0000000061E82D821590010000409525CD</DATA>
<DATA>0000000061E82D8C163001000140AD0DF6</DATA>
<DATA>0000000061E82D9616E301000240776CAB</DATA>
<DATA> 0000000061E82DA0178001000340C56B6</DATA>
<DATA>0000000061E82DAA188001000440C0C7CB</DATA>
0000000061E82DDAEA4001000540BB9A276
</catalog>
For example in the above XML there is a <DATA> tag missing. I have no influence on the creation of the XML.
How can I detect that a <DATA> tag is missing (the number of data lines is not fixed), and then retrieve the line that has no tag?
The solution can be a set of SSIS components or a C# script.

It is impossible to automatically correct invalid XML in the general case.
Terminology correction
For example in the above XML there is a <DATA> tag missing.
There is not a <DATA> tag missing. You probably mean that there are supposed to be begin and end DATA tags surrounding 0000000061E82DDAEA4001000540BB9A276. The difference is significant because if there were only a single tag missing, the "XML" would not be well-formed. If a schema says that a catalog element may only have DATA children, then the XML is not valid.
See Well-formed vs Valid XML for a detailed description of this important distinction.
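Because the sample is well-formed, an ordinary parser can still load it, and the untagged value shows up as a text node directly under catalog. The following is a minimal sketch of detecting such text with LINQ to XML. The names StrayTextFinder and Find are made up for illustration, and the sketch assumes that any non-whitespace text directly under the root element is unwanted.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class StrayTextFinder
{
    // Returns the trimmed value of every non-whitespace text node that
    // sits directly under the root element (i.e. not wrapped in a child tag).
    public static List<string> Find(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc.Root.Nodes()
            .OfType<XText>()
            .Select(t => t.Value.Trim())
            .Where(v => v.Length > 0)
            .ToList();
    }
}
```

An empty result means every value is wrapped in an element. A non-empty result only tells you that untagged text exists; what to do with it is a separate decision.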
If you actually had to try to read not-well-formed "XML," you could try the suggestions listed in How to parse invalid (bad / not well-formed) XML?.
Don't try to automatically correct invalid XML
Best practice is to reject the input and force the sender/creator to fix the document. The entire raison d'être for a schema is to express the invariants that can be relied upon to process the data. Violating those invariants means all bets are off.
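Rejecting can be as simple as validating on receipt and refusing anything that produces errors. The sketch below assumes a hypothetical schema in which catalog may contain only DATA elements (the question does not show the real schema). CatalogGate and EnsureValid are illustrative names, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Schema;

public static class CatalogGate
{
    // Hypothetical schema: catalog holds zero or more DATA elements and nothing else.
    const string Xsd = @"<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='catalog'>
    <xs:complexType>
      <xs:sequence>
        <xs:element name='DATA' type='xs:string' minOccurs='0' maxOccurs='unbounded'/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>";

    // Throws if the document violates the schema; the caller then rejects the file.
    public static void EnsureValid(string xml)
    {
        var schemas = new XmlSchemaSet();
        schemas.Add("", XmlReader.Create(new StringReader(Xsd)));

        var errors = new List<string>();
        XDocument doc = XDocument.Parse(xml);
        doc.Validate(schemas, (sender, e) => errors.Add(e.Message));

        if (errors.Count > 0)
            throw new InvalidOperationException("Rejected: " + string.Join("; ", errors));
    }
}
```

In an SSIS package, a check like this could run in a Script Task before the data flow, so that a failing file is routed back to its producer instead of being patched.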
Don't be seduced by the superficial simplicity of peep-hole repair ideas
Every repair idea implies an assumption about the data that is not expressed in the schema, which would be bad because:
1. There should be a clearly and explicitly expressed definition of validity, and
2. The assumptions
   - will likely not be expressed unambiguously.
   - may not be expressed at all.
   - may be incomplete or entirely incorrect.
   - will probably go unconfirmed because an errant producer that can/will not fix validity against a schema is unlikely to be able to assess the validity of an assumption over all data that it is, or could be, sending over all time.
See also
How to remove invalid XML elements

Related

How do I detect this kind of XML error in C# code?

<?xml version="1.0" encoding="utf-8"?>
<xml>
<a>str1234</a>xxxx
</xml>
In this XML file, as you can see, there is "xxxx" after the closing "a" tag.
I tried the XmlDocument.Load() method, but it did not throw any exception.
I then generated an XSD file from this XML and validated the XML against the generated XSD.
However, that did not report any errors either.

It is important to understand the difference between valid and well-formed XML.
Commenters have sloppily said that your XML is valid. They actually should not make such a statement without a schema against which to assess validity. They should be saying that your XML is well-formed.
You seem to be concerned that xxxx text as a sibling to an a element is not well formed, but it is perfectly well-formed XML. It might also be valid, if the parent element, xml is defined by a schema to allow mixed-content.
I tried to generate a xsd file from this xml, then validate this xml
with the generated xsd.
Well, if you used a tool to generate an XSD from an XML document instance, and the XSD said the XML was valid, then the tool is working as designed.
But when my code loading this kind of .config file,it got stuck.
Just like being well-formed doesn't guarantee validity, it also doesn't guarantee that it meets the needs of any given consuming XML application. A configuration file has rules, perhaps expressed in an XSD, that the XML must follow. These rules are in addition to being well-formed (the rules that a parser requires in order to parse XML).
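To make the two layers concrete, here is a sketch that checks well-formedness and schema validity separately. It assumes a hypothetical hand-written schema in which the xml element may contain only a elements and no text (not the schema a generator would infer from the sample). ConfigCheck, IsWellFormed, and SchemaErrors are illustrative names.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class ConfigCheck
{
    // Hypothetical schema: element-only content, so stray text is not allowed.
    const string Xsd = @"<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='xml'>
    <xs:complexType>
      <xs:sequence>
        <xs:element name='a' type='xs:string' maxOccurs='unbounded'/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>";

    // Layer 1: can a parser read it at all?
    public static bool IsWellFormed(string xml)
    {
        try { new XmlDocument().LoadXml(xml); return true; }
        catch (XmlException) { return false; }
    }

    // Layer 2: does it follow the rules of the schema?
    public static List<string> SchemaErrors(string xml)
    {
        var settings = new XmlReaderSettings { ValidationType = ValidationType.Schema };
        settings.Schemas.Add("", XmlReader.Create(new StringReader(Xsd)));

        var errors = new List<string>();
        settings.ValidationEventHandler += (sender, e) => errors.Add(e.Message);

        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read()) { } // reading drives validation
        }
        return errors;
    }
}
```

Under that assumed schema, the sample with the trailing xxxx passes the first check and fails the second, which is the behavior the question was looking for.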
See also How to validate xml code file though .NET? + How would I do it if I use XML serialization?

Is CDATA required to validate/deserialize against a schema if a string element contains valid XML

I am hosting a C# WCF SOAP service that has a call which contains the following element:
<element name="SomeXmlElement" type="xsd:string" minOccurs="0"/>
The WSDL in question is provided by the client.
The content of this element is valid XML which in general will conform to a different XSD, but for our purposes it is arbitrary valid XML.
If the data is passed "raw", which is the way the client prefers to send it, SomeXmlElement is null after being deserialized:
<SomeXmlElement><SomeArbitraryXml/></SomeXmlElement>
If I have them wrap it in CDATA, it works correctly, but the client complains that they don't have to do that for other implementations, and that it causes compatibility issues:
<SomeXmlElement><![CDATA[<SomeArbitraryXml/>]]></SomeXmlElement>
My understanding is that there are only a few choices to have this deserialize correctly.
wrap in CDATA (nested cdata ugh)
Change the schema to use a complex type instead of string, where the complex type references the other XSD schema
xs:any in the schema (what would this deserialize as?)
The customer insists that this is just a deficiency in my code/.Net and that this should deserialize/process fine in the raw format.
Rolling my own deserializer would be possible, as would loading the message into a DOM and reading the InnerXml property, but that's a lot of work to override the default expected behavior, in my opinion.
Thoughts? Suggestions? Am I interpreting the XML specs correctly? Are there any choices that don't require schema changes or rewriting lots of WCF default behavior?

Your client has no right to complain. They're publishing an interface and then telling you out-of-band to ignore parts of the interface specification.
If they want to allow arbitrary XML under SomeXmlElement, then they should use xsd:any.
If they want to restrict the XML under SomeXmlElement to that given by another XSD, then they should import or include the other XSD and explicitly reference the allowed elements.
But they should not specify that SomeXmlElement contains an xsd:string and then expect its content model to really be XML. You're the one who has the right to complain.
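As for what xs:any deserializes as: with XmlSerializer, wildcard content is typically surfaced as XmlElement nodes through [XmlAnyElement]. The sketch below assumes XmlSerializer is in use (a WCF service only uses it when opted in, for example with [XmlSerializerFormat]). Request and AnyContent are hypothetical types for illustration, not the client's actual contract.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical wrapper type for an element declared with xs:any content.
public class AnyContent
{
    [XmlAnyElement]
    public XmlElement[] Any { get; set; }
}

// Hypothetical message type; only the SomeXmlElement name comes from the question.
[XmlRoot("Request")]
public class Request
{
    public AnyContent SomeXmlElement { get; set; }
}

public static class RequestReader
{
    public static Request Read(string xml)
    {
        var serializer = new XmlSerializer(typeof(Request));
        using (var reader = new StringReader(xml))
        {
            return (Request)serializer.Deserialize(reader);
        }
    }
}
```

With this shape, the nested markup arrives as DOM nodes whose OuterXml can be passed on, so the sender does not need CDATA. It does require the schema to say xs:any rather than xsd:string.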
That their implementations are 10 years old or Java based is irrelevant. XML and XSD specifications go back that far and work well in Java.
So, besides looking for validation here, you probably want advice beyond telling your client to fix their broken interface definition...
Consider rewriting their XSD to be what they really mean, and hold yourself and your code to a higher standard (an actual standard, that is). Anything else would be a hack upon a hack and make you an accessory to their crime.

Do I need to write my own validator for xml validation against xsd schema?

I'm trying to support my users in creating an xml based on an xsd (xml schema).
So I show possible elements and the user can add it to an xml.
However, I have trouble determining the possible elements and validating that what the user adds is correct. How do I check complex elements?
Let's say we have a sequence element. How am I going to check that the user adds an element at the right place?
Let's say we have a choice element. How am I going to check that an element from the other particle has been added already?
I can validate the XML against the schema in C#, and the errors it returns could (maybe) be shown to the user, but I can't use them in my code because their format is not suited to that and they don't carry enough detail.
Do I need to write my own validator (and implement all the w3c specs)??
Thanks!

You shouldn't need to implement your own validator. XmlSchemaValidator will actually give you a good amount of information. See the answer to my own similar question here : XML Schemas -- List allowed attributes/tags at position in XML
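As a rough sketch of that approach, XmlSchemaValidator can be driven by hand and asked which particles it would accept next via GetExpectedParticles. The schema here (a root element containing a sequence of a then b) and the name NextElements are assumptions made up for the example.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class NextElements
{
    // Hypothetical schema: root must contain a, then b.
    const string Xsd = @"<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='root'>
    <xs:complexType>
      <xs:sequence>
        <xs:element name='a' type='xs:string'/>
        <xs:element name='b' type='xs:string'/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>";

    // Names of the elements the validator would accept right after <root> is opened.
    public static List<string> AfterOpeningRoot()
    {
        var nameTable = new NameTable();
        var schemas = new XmlSchemaSet(nameTable);
        schemas.Add("", XmlReader.Create(new StringReader(Xsd)));
        schemas.Compile();

        var validator = new XmlSchemaValidator(
            nameTable, schemas, new XmlNamespaceManager(nameTable),
            XmlSchemaValidationFlags.None);
        validator.Initialize();
        validator.ValidateElement("root", "", null); // push <root>
        validator.ValidateEndOfAttributes(null);     // no attributes to report

        var names = new List<string>();
        foreach (XmlSchemaParticle particle in validator.GetExpectedParticles())
        {
            var element = particle as XmlSchemaElement;
            if (element != null) names.Add(element.QualifiedName.Name);
        }
        return names;
    }
}
```

An editor could replay the user's current document through the validator in the same way and offer the expected particles at the cursor position as the choices to show.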

Evaluate XML Find Bad Character?

I have some XML coming from a remote (Java) web service into my C# console app; it is written to a Microsoft SQL Server XML column via a stored procedure. Sometimes the XML has a bad character somewhere, and SQL Server does not give enough information about exactly where the problem is.
I would like to evaluate the XML before the database-write happens, and of course I have no XSD.
What is a good way to evaluate every part of the XML for "regular conformance" before writing to the database? I am using .NET 4.0, C#.
Thanks.

If you have the possibility, I would recommend doing XML Schema validation on all XML data that you retrieve from third-party services.
XML Schema validation will ensure that every element of an XML document is valid against its defined contract.
You should consider making the XML Schema validation optional, as it introduces an overhead that you might want to avoid in production environments. But in development and testing environments it can be quite beneficial to get detailed validation error information from all your third-party services.

You can try sanitizing your XML, which might help a little: http://seattlesoftware.wordpress.com/2008/09/11/hexadecimal-value-0-is-an-invalid-character/
That link only helps filter invalid characters. Most of the time that will not be enough (although I still recommend filtering unknown characters for security).
To check whether the tags are valid, you can use a try/catch around the parse. If the exception reports a problem on line 1, the cause could be that your XML has no root element, or that the encoding of the XML document is incorrect. The two cases should produce different errors.
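A minimal sketch of that try/catch idea: reading the whole document with XmlReader forces a well-formedness and character check, and XmlException carries the line and position of the first problem. XmlProbe and FirstProblem are illustrative names.

```csharp
using System.IO;
using System.Xml;

public static class XmlProbe
{
    // Returns null when the XML parses cleanly, otherwise a description
    // of the first problem, including where the parser stopped.
    public static string FirstProblem(string xml)
    {
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
            {
                while (reader.Read()) { } // walk every node
            }
            return null;
        }
        catch (XmlException ex)
        {
            return string.Format("line {0}, position {1}: {2}",
                ex.LineNumber, ex.LinePosition, ex.Message);
        }
    }
}
```

Calling this before the stored procedure gives a location to log or report back to the service owner. It only reports the first error, since the parser stops there.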

XSD validation error human readable

I want to be able to validate an XML document against an XSD and generate user-readable errors that include, for example, the XSD documentation tag.
I just wanted to know if C# provides this in an easy, elegant, and non-painful way; otherwise I'll parse the error and find the node within the XSD myself.

XML Schema itself doesn't provide a way to do what you want!
XML Schema is not meant to communicate to humans in a human manner. Forget about it.
