xml entity missing while loading document in c# asp.net - c#

Hi i have a xml file with entities,
below is a piece of my xml code
<line>Intellectual life – 1268–1559. I. Title.</line>
<line>DG533.R84 2015</line>
<line>945′.05–dc23 2014019659</line>
when i load the above xml in c#, entities are missing and substituted with some other values,
may i know what is the reason
below is modified xml using c#
<line>Intellectual life – 1268–1559. I. Title.</line>
<line>DG533.R84 2015</line>
<line>945′.05–dc23 2014019659</line>
i want the modified xml as same as source xml
here is my c# code to do above process
using System.Xml;
XmlDocument doc= new XmlDocument();
doc.Load("sample.xml");
doc.Save("sample.xml");
Thanks
Appu

How do I preserve special characters when writing XML with XDocument.Save()?
According to #JonSkeet's answer in the above link, they (the encoded entity and it's corresponding character getting saved) are just different representation of the same thing. This translation shouldn't cause you any trouble because normally the receiving party that process the XML further will recognize either representation as the same thing.
XDocument.Save() removes my
entities
If you really need to preserve the entities, there is also an attempt to do so by inheriting XmlTextWriter class and overriding it's WriteString() method to be manually replacing each special character with the corresponding entity. See the 2nd link above for example implementation. Anyway this approach is going to be cumbersome if you have many different entities to preserve.

Related

Parsing XML file in C# - how to report errors

First I load the file in a structure
XElement xTree = XElement.Load(xml_file);
Then I create an enumerable collection of the elements.
IEnumerable<XElement> elements = xTree.Elements();
And iterate elements
foreach (XElement el in elements)
{
}
The problem is - when I fail to parse the element (a user made a typo or inserted a wrong value) - how can I report exact line in the file?
Is there any way to tie an element to its corresponding line in the file?
One way to do it (although not a proper one) –
When you find a wrong value, add an invalid char (e.g. ‘<’) to it.
So instead of: <ExeMode>Bla bla bla</ExeMode>
You’ll have: <ExeMode><Bla bla bla</ExeMode>
Then load the XML again with try / catch (System.Xml.XmlException ex).
This XmlException has LineNumber and LinePosition.
If there is a limited set of acceptable values, I believe XML Schemas have the concept of an enumerated type -- so write a schema for the file and have the parser validate against that. Assuming the parser you're using supports Schemas, which most should by now.
I haven't looked at DTDs in decades, but they may have the same kind of capability.
Otherwise, you would have to consider this semantic checking rather than syntactic checking, and that makes it your application's responsibility. If you are using a SAX parser and interpreting the data as you go, you may be able to get the line number; check your parser's features.
Otherwise the best answer I've found is to report the problem using an xpath to the affected node/token rather than a line number. You may be able to find a canned solution for that, either as a routine to run against a DOM tree or as a state machine you can run alongside your SAX code to track the path as you go.
(All the "maybe"s are because I haven't looked at what's currently available in a very long time, and because I'm trying to give an answer that is valid for all languages and parser implementations. This should still get you pointed in some useful directions.)

C# does XPathDocument(Path) evaluate the whole path before returning the document?

my English is not he best, but it will work I think.
Also I'm an absolut newcomer to C#.
Given is the following code snippet:
It has to open an XML-document, from which I KNOW that one of the nodes can be missintepreted, btw is really wrong.
try
{
XPathDocument expressionLib = new XPathDocument(path);
XPathNavigator xnav = expressionLib.CreateNavigator();
}
...and so on
my intention is to create the XPathDocument and the XPathNavigator and THEN watch out for the errors.
but my code Fails with "XPathDocument expressionLib = new XPathDocument(path);" (well, it raises an expception which I catch) so I assume that "XPathDocument(path);" validates the whole XML-document before returning it.
At Microsoft pages I didn't find any hints for that assumed behavior - can you verify it?
And, what could be the workaround?
Yes, I WANT open that XML with that error inside (not at the topmost node) and react just for that invalid node and work with the rest of the file.
Enjoy Weekend
Alex.
There is no workaround. If the document is not a valid XML document or there are invalid characters or sections in the document you'll get the exception.
The only way to continue is to handle the XmlException and try to manipulate the Xml data to make it valid which could range from simple if it's just a matter of escaping some invalid character(s) to complex if you have to perform some advanced formatting or if you receive documents containing many different types of errors.
Perhaps the best course of action is to write an XML validator/repair class you'd put your XML document through before attempting to load it with XPathDocument class although I'm pretty sure there must be some library out there that would be able to do all the heavy lifting for you...

Validate xml subset using schema subset in c# .net XmlDocument

Currently I have a solution that builds an XML document in a number of sections and then validates the final concatenated xml against a single schema. Is it possible to use a subset of the same schema to validate each section individually?
The answer is yes in most of the cases. For a disclaimer, in theory someone could intentionally write an XML Schema that would make some of my proposals impossible, but then that would be just bad practice in XSD authoring.
For a straightforward solution, the following assumptions should be true:
A section is well formed XML; you're concatenating XmlElement nodes. E.g.:
<section-element ... attribute content>
... more content
</section-element>
Each of the sections being merged has a matching global element declaration in your XML Schema set. If you use the xsi:type attribute for any of your sections, things might get a bit tricky, but not hard to fix.
The validation would be common code, where the XmlReader would be an XmlNodeReader on the node you're concatenating. Use the XmlReaderSettings as usual...
The above would work for any XSD (you don't have a design time dependency of knowing the XSD). For anything below, the code would have to match your XSD...
If you don't have the matching global elements in the XML Schema then you have to look at the type of each matching local element declaration. If the type is global, then you can easily create, in memory, dummy elements that match your sections, of the global type (assuming a Venetian Blind authoring style).
If even the type is anonymous (more of a Russian Doll style), then you can even fake that, by creating a global element with a type that is a copy of the anonymous type - all in memory.

Loading XML Document - Name cannot begin with the zero character

I am trying to load something which claims to be an XML document into any type of .net XML object: XElement, XmlDocument, or XmlTextReader. All of them throw an exception :
Name cannot begin with the '0' character, hexadecimal value 0x30
The error related to a bit of 'XML'
<chart_value
color="ff4400"
alpha="100"
size="12"
position="cursor"
decimal_char="."
0=""
/>
I believe the problem is the author should not have named an attribute as 0.
If I could change this I would, but I do not have control of this feed. I suppose those who use it are using more permissive tools. Is there anyway I can load this as XML without throwing an error?
There is no XML declaration either, nor namespace or contract definition. I was thinking I might have to turn it into a string and do a replace, but this is not very elegant. Was wondering if there was any other options.
As many have said, this is not XML.
Having said that, it's almost XML and WANTS to be XML, so I don't think you should use a regex to screw around inside of it (here's why).
Wherever you're getting the stream, dump into into a string, change 0= to something like zero= and try parsing it.
Don't forget to reverse the operation if you have to return-to-sender.
If you're reading from a file, you can do something like this:
var txt = File.ReadAllText(#"\path\to\wannabe.xml");
var clean = txt.Replace("0=", "zero=");
var doc = new XmlDocument();
doc.LoadXml(clean);
This is not guaranteed to remove all potential XML problems -- but it should remove the one you have.
Just replace the Numeric value with '_'
Example: "0=" replace to "_0="
I hope that will fix the problem, thanks.
It might claim to be an XML document, but the claim is clearly false, so you should reject the document.
The only good way to deal with bad XML is to find out what bit of software is producing it, and either fix it or throw it away. All the benefits of XML go out of the window if people start tolerating stuff that's nearly XML but not quite.
The 0="" obviously uses an invalid attribute name 0. You'd probably have to do a find/replace to try and fix the XML if you cannot fix it at the source that created it. You might be able to use RegEx to try to do more efficient manipulation of the XML string.

Parsing XML-ish data

Yes, I really am going to ask about parsing XML with regexes... here goes.
I have some XML-ish data, and I need to parse it. I can't do it completely with an XMLDocument or similar because it's not proper XML, and I'm not sure I can (or want to) change the format. The main problem is tags which have special meaning, and look like this:
<$ something_here $>
C#'s XmlDocument falls over parsing that, and I assume other methods will too. I could, with a lot of work, change the above to something like
<some_special_tag><![CDATA[ something_here ]]></some_special_tag>
But that's ugly, and I don't really want to. The reason it would be time consuming to change is that I have hundreds, maybe thousands of XML documents which would need to be changed.
At the moment, I'm parsing the document with regexes. I only need to pick out a couple of specific tags (not the ones above), and it seems to be working, but I'm uncomfortable with it. I'm doing something like this at the moment:
...
MatchCollection mc = Regex.Matches(Template, "<tagname.*?/tagname>"); // or similar
foreach (Match m in mc) {
try {
XmlDocument xd = new XmlDocument();
xd.LoadXml(m.Value);
...
This at least means I'm not using regexes exclusively :)
Can anyone think of a better way? Is there some way of getting XmlDocument to politely ignore the $ character that causes it to fall over? It doesn't seem likely, but I thought I should at least get some opinions.
No, there is no way to get XmlDocument to parse a document which isn't xml, no matter how close to xml it might look!
If its possible to do then I would definitely recommend that you convert your documents to be actual xml (or at least some recognised document format). Trying to create and maintain a reliable working parser for any format is quite a lot of work, let alone a format that doesn't appear to be rigeriously defined.
Using a some_special_tag element to identify special sections seems like a good idea to me. If necessary you can use a different namespace to ensure no clashes with other elements in your document - this is in fact exactly the way that xslt works ("special" tags are used to mean special things, like templates or nodes that should be replaced) and exactly what xml was designed to support.
Also I don't understand why you would need to place the something_here bit in CDATA sections. All characters that "break" xml can be escaped fairly easily (for example by writing < as <). CDATA sections are generally only used when the contents of a node needs so much escaping that its easier and less messy to just to use CDATA sections instead.
Update: Regarding migration to a new format, can you not use both methods? Attempt to parse the document as an XML document (or if there are performance concerns then perform some other test to quickly determine if the document is in the "old" or "new" format such as checking for a version attribute in the root element) - if it doesn't work then fall back to the old method.
This way as long as everything is working fine (which is will be as long as nothing changes) users don't need to modify their documents, however if they run into problems or want to use any new features then explain to them that they must update their document to the new format.
Depending on how well your current "parser" works, you may even be able to provide an upgrade utility that automatically performns the conversion (as best it can).
Can't you replace <$ something_here $> to that big CDATA section at run-time and then load the XML document as usual?

Categories