Remove XML that does not validate against XSD

I have an XML and XSD.
The problem I have is that if one element or attribute fails validation during the upload, nothing is uploaded. So I would like to use the XSD to strip out any invalid "rows" before the upload.
Take the following as an example:
<Row>
<Column1>1</Column1>
<Column2>2</Column2>
</Row>
<Row>
<Column1>1</Column1>
<Column2>2</Column2>
</Row>
<Row>
<Column1>1</Column1>
<Column2>B</Column2>
</Row>
<Row>
<Column1>1</Column1>
<Column2>C</Column2>
</Row>
In the above example, Column2 in the 3rd and 4th rows is invalid, so I would like to remove both of those rows from the XML.
I tried:
foreach (XmlElement row in doc.SelectNodes("TableName/Row"))
{
    if (row.SchemaInfo.Validity == XmlSchemaValidity.Invalid)
    {
        row.ParentNode.RemoveChild(row);
    }
}
but it removes only the first invalid section; for any later sections with errors, the SchemaInfo.Validity value is "NotKnown".

I think the only way to do this is to validate the XML manually in your own code.
Given the structures an XSD can describe and the errors that can occur, writing a validator that consistently skips over an error and carries on would be very difficult (which is why none of the parsers I'm aware of do it).
In some circumstances they will continue validating after an error, but they typically then ignore all siblings after the initial error in order to get back to a consistent state. Once an error is encountered, the validation state becomes ambiguous and there are often several validation paths that could be taken.
That said, if your data looks like your sample and you have some control over your XSD, you could refactor the XSD definition of <Row> into a root (global) element, then use an element ref wherever you need it. You could then load the <Row> elements one at a time and validate each as you go. That way the code that reads the document is decoupled from the validation of each <Row>, so if one is invalid you discard it and move on to the next.
NOTE: This approach means the rest of the XML document is NOT validated.
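The per-row approach described above might be sketched as follows. This is only an illustration under assumptions: the schema text, the xs:int column types, and the names RowFilter and RemoveInvalidRows are hypothetical and not taken from the question. Each <Row> is copied into its own document and validated against a schema in which Row is a global element.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Schema;

public static class RowFilter
{
    // Hypothetical schema: <Row> is a global element and both columns are integers.
    public const string RowXsd = @"<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='Row'>
    <xs:complexType>
      <xs:sequence>
        <xs:element name='Column1' type='xs:int'/>
        <xs:element name='Column2' type='xs:int'/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>";

    // Removes every <Row> child of the root that fails validation on its own.
    public static XDocument RemoveInvalidRows(string xml)
    {
        var schemas = new XmlSchemaSet();
        schemas.Add("", XmlReader.Create(new StringReader(RowXsd)));

        var doc = XDocument.Parse(xml);
        // ToList() takes a snapshot so rows can be removed while looping.
        foreach (var row in doc.Root.Elements("Row").ToList())
        {
            bool valid = true;
            // Validate a standalone copy so one bad row cannot affect the next.
            new XDocument(new XElement(row)).Validate(schemas, (sender, e) => valid = false);
            if (!valid) row.Remove();
        }
        return doc;
    }

    public static void Main()
    {
        var cleaned = RemoveInvalidRows(
            "<TableName>" +
            "<Row><Column1>1</Column1><Column2>2</Column2></Row>" +
            "<Row><Column1>1</Column1><Column2>B</Column2></Row>" +
            "<Row><Column1>1</Column1><Column2>C</Column2></Row>" +
            "<Row><Column1>1</Column1><Column2>2</Column2></Row>" +
            "</TableName>");
        Console.WriteLine(cleaned);
    }
}
```

As the note above says, nothing outside the <Row> elements is validated by this sketch.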

Related

Needs space in the element name in xml formation

I have a set of data in the database that needs to be converted to XML. The problem is that one of the element names contains spaces, and I want to use that name in the XML:
<Data>
<Out put xml>
<ROW>
</ROW>
</Out put xml>
</Data>
I am aware that it is not possible to use a space in an element name. Please suggest an alternative way to achieve this.

Try encoding the name when converting to XML:
XmlConvert.EncodeName(Name);
Then, to convert back, just decode it:
XmlConvert.DecodeName(Name);
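A small round-trip sketch of the two calls above, using the element name from the question. The class name NameEncoding is hypothetical. Each space is replaced by the escape sequence _x0020_, which is legal in an XML name.

```csharp
using System;
using System.Xml;

public static class NameEncoding
{
    // Encodes a display name into a legal XML element name.
    public static string ToXmlName(string displayName) => XmlConvert.EncodeName(displayName);

    // Reverses the encoding to recover the original display name.
    public static string FromXmlName(string xmlName) => XmlConvert.DecodeName(xmlName);

    public static void Main()
    {
        string encoded = ToXmlName("Out put xml");
        Console.WriteLine(encoded);               // Out_x0020_put_x0020_xml
        Console.WriteLine(FromXmlName(encoded));  // Out put xml
    }
}
```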

XAttribute contains encoded HTML

I have the following XML element defined in a document:
<modelDef name="EmployeeOutput" description="Current Model Run output.">
<resultSet name="Values">
<field name="Net P&amp;L Impact" typeName="Decimal" formatString="C2"/>
</resultSet>
</modelDef>
Notice the name attribute on the field element. When I retrieve the value it comes back with a plain ampersand, whereas I want to keep the encoded value as defined in the XML. This is the simple code I use to retrieve the value of the attribute:
field.Attribute("name").Value
How can I make sure that I get the &amp; code back rather than the actual ampersand symbol?
Thanks!

Actually, "Net P&L Impact" is the value you've encoded in XML, because &amp; is the XML code for an ampersand. If the actual encoded value should be "Net P&amp;L Impact" then you need to change your XML:
<modelDef name="EmployeeOutput" description="Current Model Run output.">
<resultSet name="Values">
<field name="Net P&amp;amp;L Impact" typeName="Decimal" formatString="C2"/>
</resultSet>
</modelDef>
However, it's rare to encounter real-world entities with &amp; in their names. Why do you want it to be encoded that way? Is it because you're outputting it to HTML? If so, use HttpUtility.HtmlEncode() on the resulting value as you go to output it in HTML. That's the correct and safe way. Don't expect your XML-parsing code to anticipate the context that the value will be output into.
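A minimal sketch of encoding at output time rather than at parse time. It swaps in System.Net.WebUtility.HtmlEncode in place of HttpUtility.HtmlEncode so that it runs without a reference to System.Web, and it assumes the attribute is written as &amp; in the source XML. The class and method names are hypothetical.

```csharp
using System;
using System.Net;
using System.Xml.Linq;

public static class AttributeEncoding
{
    // Reads the decoded attribute value, then encodes it for an HTML context.
    public static string NameForHtml(XElement field)
    {
        string raw = field.Attribute("name").Value; // the parser has already decoded &amp; to &
        return WebUtility.HtmlEncode(raw);
    }

    public static void Main()
    {
        var field = XElement.Parse(
            "<field name=\"Net P&amp;L Impact\" typeName=\"Decimal\" formatString=\"C2\"/>");
        Console.WriteLine(field.Attribute("name").Value); // Net P&L Impact
        Console.WriteLine(NameForHtml(field));            // Net P&amp;L Impact
    }
}
```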

Linq duplicate elements when iterating over XML

<?xml version='1.0' encoding='utf-8' standalone='yes'?>
<stock-items>
<stock-item>
<name>Loader 34</name>
<sku>45GH6</sku>
<vendor>HITINANY</vendor>
<useage>Lifter 45 models B to C</useage>
<typeid>01</typeid>
<version>01</version>
<reference>33</reference>
<comments>EOL item. No Re-order</comments>
<traits>
<header>56765</header>
<site>H4</site>
<site>A6</site>
<site>V1</site>
</traits>
<type-validators>
<actions>
<endurance-tester>bake/shake</endurance-tester>
</actions>
<rules>
<results-file>Test-Results.txt</results-file>
<file-must-contain file-name="Test-Results.xml">
<search>
<term>[<![CDATA[<"TEST TYPES 23 & 49 PASSED"/>]]></term>
<search-type>exactMatch</search-type>
</search>
</file-must-contain>
</rules>
</type-validators>
</stock-item>
</stock-items>
I'm trying to get the rules fragment from the XML above into a string so it can be added to a database. Currently the search element and its contents are added twice. I know why this is happening but can't figure out how to prevent it.
Here's my code:
var Rules = from rules in Type.Descendants("rules")
            select rules.Descendants();
StringBuilder RulesString = new StringBuilder();
foreach (var rule in Rules)
{
    foreach (var item in rule)
    {
        RulesString.AppendLine(item.ToString());
    }
}
Console.WriteLine(RulesString);
Finally, any elements in rules are optional, and some of them may contain child elements up to 4 or 5 levels deep. TIA
UPDATE:
To make it clearer what I'm trying to achieve:
From the XML above I should end up with a string containing everything in the rules element, exactly like this:
<results-file>Test-Results.txt</results-file>
<file-must-contain file-name="Test-Results.xml">
<search>
<term>[<![CDATA[<"TEST TYPES 23 & 49 PASSED"/>]]></term>
<search-type>exactMatch</search-type>
</search>
</file-must-contain>

The objective is to extract the entire contents of the rules element as is, taking into account that the rules element may or may not contain child elements several levels deep.
If you just want the entirety of the rules element as a string (rather than caring about its contents as XML), you don't need to dig into its contents; you just need to get the element as an XNode and then call ToString() on it:
The following example uses this method to retrieve indented XML.
XElement xmlTree = new XElement("Root",
new XElement("Child1", 1)
);
Console.WriteLine(xmlTree);
This example produces the following output:
<Root>
<Child1>1</Child1>
</Root>
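Applied to the question, one possible sketch is below. The duplication arises because Descendants() returns every nested element, and ToString() on a parent already includes its children. Using Elements() visits only the direct children of rules. The class and method names, and the trimmed sample document, are hypothetical.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class RulesExtractor
{
    // Returns the contents of the first <rules> element without the wrapper itself.
    public static string ExtractRules(XElement root)
    {
        XElement rules = root.Descendants("rules").FirstOrDefault();
        if (rules == null) return string.Empty; // rules is optional

        // Each direct child is serialized once, together with everything nested in it.
        return string.Join(Environment.NewLine, rules.Elements().Select(e => e.ToString()));
    }

    public static void Main()
    {
        var root = XElement.Parse(@"<stock-item>
  <type-validators>
    <rules>
      <results-file>Test-Results.txt</results-file>
      <file-must-contain file-name=""Test-Results.xml"">
        <search>
          <term><![CDATA[<""TEST TYPES 23 & 49 PASSED""/>]]></term>
          <search-type>exactMatch</search-type>
        </search>
      </file-must-contain>
    </rules>
  </type-validators>
</stock-item>");
        Console.WriteLine(ExtractRules(root));
    }
}
```

If text or comment nodes directly under rules also need to be kept, rules.Nodes() could be used in place of rules.Elements().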

If you want to prevent duplicates, then you will need to use Distinct() or GroupBy() after parsing the XML and before building the string.
I still don't fully understand what the output should be, so I can't give a clear solution for what to use, or how, to locate the duplicates. If you can refine the original post, that would help:
We need the structure of the XML as it would appear in your scenario, nesting and all.
We need an example of the final string.
Saving it to a DB doesn't really matter for this post, so you only need to mention it briefly, if at all.

Clean out/replace invalid XML characters in element attributes

UPDATE: The invalid characters are actually in the attributes rather than the elements, which prevents me from using the CDATA solution suggested below.
In my application I receive the following XML as a string. There are two problems that stop it from being accepted as valid XML.
I hope someone has a solution for fixing these bugs gracefully.
There are ASCII character codes in the XML that aren't allowed. I would like to replace all such codes with their corresponding characters, not only the one shown in the example.
Within an attribute value the '<' character appears. I would like to remove all of these 'inner elements' entirely (<L CODE="C01">WWW.cars.com</L>) from the XML.
<?xml version="1.0" encoding="ISO-8859-1"?>
<cars>
<car model="ford" description="Argentinië love this"/>
<car model="kia" description="a small family car"/>
<car model="opel" description="great car <L CODE="C01">WWW.cars.com</L>"/>
</cars>

For a quick fix, you could load this not-XML into a string, and add CDATA markers inside any XML tags that you know usually tend to contain invalid data. For example, if you only ever see bad data inside <description> tags, you could do:
var soCalledXml = ...;
var xml = soCalledXml
.Replace("<description>", "<description><![CDATA[")
.Replace("</description>", "]]></description>");
This would turn the tag into this:
<description><![CDATA[great car <L CODE="C01">WWW.cars.com</L>]]></description>
which you could then process successfully -- it would be a <description> tag that contains the simple string great car <L CODE="C01">WWW.cars.com</L>.
If the <description> tag could ever have any attributes, then this kind of string replacement would be fraught with problems. But if you can count on the open tag to always be exactly the string <description> with no attributes and no extra whitespace inside the tag, and if you can count on the close tag to always be </description> with no whitespace before the >, then this should get you by until you can convince whoever is producing your crap input that they need to produce well-formed XML.
Update
Since the malformed data is inside an attribute, CDATA won't work. But you could use a regular expression to find everything inside those quote characters, and then do string manipulation to properly escape the <s and >s. They're at least escaping embedded quotes, so a regex that matches from one " to the next " would work.
Keep in mind that it's generally a bad idea to use regexes on XML. Of course, what you're getting isn't actually XML, but it's still hard to get right for all the same reasons. So expect this to be brittle -- it'll work for your sample input, but it may break when they send you the next file, especially if they don't escape & properly. Your best bet is still to convince them to give you well-formed XML.
using System.Text.RegularExpressions;
var soCalledXml = ...;
// Escape < and > inside every description="..." attribute value.
var xml = Regex.Replace(soCalledXml, "description=\"[^\"]*\"",
    match => match.Value.Replace("<", "&lt;").Replace(">", "&gt;"));
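To show the escaping step end to end, here is a hedged sketch that escapes the attribute, parses the result, and then strips the inner <L> element from the parsed value, which is what the question asked for. It assumes embedded quotes arrive escaped as &quot; (as the answer above suggests). The class name, method names, and the pattern used to strip <L> are hypothetical, and the approach is as brittle as the answer warns.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class AttributeRepair
{
    // Escapes < and > inside description="..." so the input becomes well-formed.
    public static string EscapeDescriptions(string soCalledXml) =>
        Regex.Replace(soCalledXml, "description=\"[^\"]*\"",
            m => m.Value.Replace("<", "&lt;").Replace(">", "&gt;"));

    // Removes <L ...>...</L> fragments from an already-parsed attribute value.
    public static string StripInnerElements(string value) =>
        Regex.Replace(value, "<L\\b[^>]*>.*?</L>", string.Empty).Trim();

    public static XDocument Repair(string soCalledXml)
    {
        var doc = XDocument.Parse(EscapeDescriptions(soCalledXml));
        foreach (var attr in doc.Descendants("car").Attributes("description").ToList())
            attr.Value = StripInnerElements(attr.Value);
        return doc;
    }

    public static void Main()
    {
        // Assumed input shape: quotes inside the attribute are escaped as &quot;.
        string input = "<cars><car model=\"opel\" " +
            "description=\"great car <L CODE=&quot;C01&quot;>WWW.cars.com</L>\"/></cars>";
        Console.WriteLine(Repair(input));
    }
}
```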

You could wrap that content in a CDATA section.
With regex it will be something like this, match
"<description>(.*?)</description>"
and replace with
"<description><![CDATA[$1]]></description>"

Creating an XSD schema

I have an xml tag:
<ROW field1="value 1" field2="value 2" ... />
Each fieldi has a string value, and the number of fieldi attributes is variable but never less than 1. Is it possible to create an XSD schema for this tag?
A possible XML document:
<ROWDATA>
<ROW field1="dfgdf" field2="ddfg"></ROW>
<ROW field1="dfedf" field2="djkfg" field3="cdffd"></ROW>
<ROW field1="dfedf" field2="djkfg" field3="cdffd" field4="dfedf" field5="djkfg" field6="cdffd"></ROW>
</ROWDATA>
This XML document, which I receive from a web server, can contain a variable number of field attributes (I have written them as fieldi, where i is the position of a specific field attribute).
So I have an unknown number of ROW elements and an unknown number of field attributes in each ROW element.
Thanks

If you're using Visual Studio 2008:
Open your Xml file in Visual Studio
Go to the 'Xml' menu option at the top of the screen
Choose 'Create Schema'
This will generate your xsd schema(s)
EDIT
Try this example for details on setting minOccurs (on elements) or required (on attributes) so you can manipulate your derived schema.

If you are not comfortable in writing XSD yourself, use some generator like this.
EDIT: Based on your XML in comments, I can think of below structure of XSD
<xsd:element name="FieldHeader">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="Fields" type="xsd:string"/> <!--use minOccurs maxOccurs here-->
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:simpleType name="fieldi">
<xsd:restriction base="xsd:string"/>
</xsd:simpleType>
<xsd:simpleType name="Fields">
<xsd:list itemType="fieldi" />
</xsd:simpleType>

I think I have understood your requirement. To avoid misunderstandings, let me state what I have understood:
"You have an XML file containing an element that comes with a set of unknown attributes named fieldi. That is, you don't know (or don't care about) the names and values of those attributes. You just want to check that at least 1 attribute appears."
Well, sorry to say, this requirement is beyond the capability of XML Schema. :-[
You cannot have attributes that are undeclared in the schema. If an attribute appears in the XML, it needs a proper definition. There is something called <anyAttribute/>, which again requires a definition (somewhere, in another linked schema).
1) Defining all the possible attributes with use="optional" doesn't look practical, and your last requirement would also be skipped.
2) If possible, convert all the attributes to elements (using a transformation, or by asking the sender to do so; I don't know how complicated that is in your case) and define the element <any/>, which is somewhat more comfortable. But your requirement (at least one attribute must appear) is still not achieved.
That is all I can add for you. If you can change the requirement or the input XML structure, let me know and I will see whether I can help in any other way.
regards,
infant-pro
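For comparison, here is a hedged sketch of one way the ROW element might be described. It relies on two assumptions not confirmed by the question: that the attributes are always numbered from field1, so requiring field1 stands in for "at least one", and that accepting any other attribute name is tolerable. With processContents='skip' the validator does not look for declarations of the wildcard attributes. The schema text and the class name are hypothetical.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Schema;

public static class FlexibleRowSchema
{
    // field1 is required; any further attributes are accepted without declarations.
    public const string Xsd = @"<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='ROWDATA'>
    <xs:complexType>
      <xs:sequence>
        <xs:element name='ROW' maxOccurs='unbounded'>
          <xs:complexType>
            <xs:attribute name='field1' type='xs:string' use='required'/>
            <xs:anyAttribute processContents='skip'/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>";

    // Returns true when the document raises no validation events.
    public static bool IsValid(string xml)
    {
        var schemas = new XmlSchemaSet();
        schemas.Add("", XmlReader.Create(new StringReader(Xsd)));
        bool valid = true;
        XDocument.Parse(xml).Validate(schemas, (sender, e) => valid = false);
        return valid;
    }

    public static void Main()
    {
        Console.WriteLine(IsValid(
            "<ROWDATA><ROW field1='a' field2='b'/><ROW field1='a' field2='b' field3='c'/></ROWDATA>"));
        Console.WriteLine(IsValid("<ROWDATA><ROW/></ROWDATA>"));
    }
}
```

The wildcard does not restrict attribute names to the fieldN pattern, so this is looser than the stated requirement.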

I solved the problem another way, by controlling the deserialization of the XML document myself. However, I don't like this solution, because I had wanted to generate classes from the XSD schema and use them in my code.
Anyway, thanks to everybody.
