XML: how to pre-parse when only SOME data is escaped? - c#

XML snippet:
<field>&amp; is escaped</field>
<field>&quot;also escaped&quot;</field>
<field>is & "not" escaped</field>
<field>is &quot; and is not & escaped</field>
I'm looking for suggestions on how I could go about pre-parsing any XML to escape everything not escaped prior to running the XML through a parser?
I do not have control over the XML being passed to me, they likely won't fix it anytime soon, and I have to find a way to parse it.
The primary issue I'm running into is that feeding the XML as is to a parser, as in the snippet below, throws an exception because some of the content is not escaped properly:
string xml = "<field>& is not escaped</field>";
using (var reader = XmlReader.Create(new StringReader(xml)))
    while (reader.Read()) { } // XmlException is thrown while reading, at the bare '&'

I'd suggest you use a Regex to replace un-escaped ampersands with their entity equivalent.
A related question gives a Regex to find these rogue ampersands:
&(?!(?:apos|quot|[gl]t|amp);|#)
It matches any & that is not the start of one of the five predefined entities or of a character reference. You can use it in a simple replace operation:
var escXml = Regex.Replace(xml, "&(?!(?:apos|quot|[gl]t|amp);|#)", "&amp;");
And then you'll be able to parse your XML.
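Put together, the replace-then-parse step might look like the sketch below. The class name AmpersandFixer, the ParseFragments helper and the wrapping <root> element are assumptions made for illustration (the snippet in the question has several top-level <field> elements, which a parser would reject without a single root).

```csharp
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class AmpersandFixer
{
    // '&' that is not followed by a predefined entity name plus ';', nor by '#'.
    private static readonly Regex RogueAmpersand =
        new Regex("&(?!(?:apos|quot|[gl]t|amp);|#)");

    // Escapes only the ampersands that are not already part of an entity.
    public static string Escape(string xml) =>
        RogueAmpersand.Replace(xml, "&amp;");

    // Hypothetical helper: wraps the fragments in a made-up <root> element
    // so that several sibling elements can be parsed in one go.
    public static XElement ParseFragments(string fragments) =>
        XElement.Parse("<root>" + Escape(fragments) + "</root>");
}
```

Note that the pattern treats anything starting with &# as already escaped, so a stray "&#" that is not a real character reference would still break the parser.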

Preprocess the textual data (not really XML) with HTML Tidy with quote-ampersand set to true.

If you want to parse something that isn't XML, you first need to decide exactly what this language is and what you intend to do with it: when you've written a grammar for the non-XML language that you intend to process, you can then decide whether it's possible to handle it by preprocessing or whether you need a full-blown parser.
For example, if you only need to handle an unescaped "&" that's followed by a space, and if you don't care about what happens inside comments and CDATA sections, then it's a fairly easy problem. If you don't want to corrupt the contents of comments or CDATA, or if you need to handle things like a reference to &nbsp; when there's no definition of it, then life starts to become rather more difficult.
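As a rough illustration of the "don't corrupt comments or CDATA" case, one possible preprocessing sketch is shown here. It assumes that comments and CDATA sections are the only regions to protect; processing instructions, DOCTYPE declarations and undefined entities are not handled, and the class name SelectiveEscaper is made up.

```csharp
using System.Text.RegularExpressions;

public static class SelectiveEscaper
{
    // Alternatives are tried left to right at each position: a whole comment
    // or CDATA section is consumed first (group "skip"), so ampersands inside
    // it are never offered to the rogue-ampersand alternative.
    private static readonly Regex Pattern = new Regex(
        @"(?<skip><!--.*?-->|<!\[CDATA\[.*?\]\]>)|&(?!(?:apos|quot|[gl]t|amp);|#)",
        RegexOptions.Singleline);

    // Protected regions are copied through unchanged; every other match is a
    // bare '&' and is replaced by its entity.
    public static string Escape(string text) =>
        Pattern.Replace(text, m => m.Groups["skip"].Success ? m.Value : "&amp;");
}
```

This is still preprocessing by pattern matching, not a grammar for the input language, so it only helps if the incoming data really is limited to the cases listed.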
Of course, you and your supplier could save yourselves a great deal of time and expense if you wrote software that conformed to standards. That's what standards are for.

Related

How do I generate &quot in XML from C#

I'm using XmlTextWriter to generate XML from a C# app. The initial input will be in this format, 1" , but I can replace it with whatever. I need to end up with 1&quot; but I keep getting 1&amp;quot;
C#
xml.WriteStartElement("data");
xml.WriteAttributeString("type", "wstring");
xml.WriteString("1&quot;");
xml.WriteEndElement();
Ha! When I paste the XML I need in here it converts it to 1". But what I really need to show is the actual 1&quot; in the 3rd line of the code.
I also need to use / and Ø. How can I do this, thanks.
I need to end up with 1&quot; but I keep getting 1&amp;quot;
That's because you're trying to do the XML API's job for it.
Your job is to provide just text - its job is to handle escaping and anything else XML needs to do.
So you should just use
xml.WriteString("1\"");
... where the backslash is just for the sake of C#'s string literal handling, and has nothing to do with XML. The logical value you're trying to write out is a 1 followed by a double-quote. Whether it's escaped as &quot; or not should be irrelevant to anything processing it. If you've got something which is over-sensitive, you should fix that.
If you desperately need this (and I would again strongly urge you not to), try:
xml.WriteString("1");
xml.WriteEntityRef("quot");
(I'd also urge you to use LINQ to XML unless you're in a situation where you really need to use XmlWriter. It'll make your life significantly simpler.)
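For comparison, a LINQ to XML version of the same element could be sketched as follows. The class and method names are invented for the example; the element name, attribute and text come from the question.

```csharp
using System.Xml.Linq;

public static class QuoteExample
{
    // Builds <data type="wstring">...</data>. The text is passed as a plain
    // string; any escaping needed is applied when the element is serialized.
    public static XElement BuildData(string text) =>
        new XElement("data", new XAttribute("type", "wstring"), text);
}
```

Calling QuoteExample.BuildData("1\"") gives an element whose Value is a 1 followed by a double-quote, and that value survives a ToString() / XElement.Parse round trip regardless of how the serializer chooses to write the quote.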

Name cannot begin with the '1' character, hexadecimal value 0x31. Line 2, position 2

While loading XML file in a C# application, I am getting
Name cannot begin with the '1' character, hexadecimal value 0x31.
Line 2, position 2.
The XML tag begins like this.
<version="1.0" encoding="us-ascii" standalone="yes" />
<1212041205115912>
I am not supposed to change this tag at any cost.
How can I resolve this?
You are supposed to change the tag name, since the one you wrote violates the XML standard.
The relevant portion of the naming rules is:
XML Naming Rules
XML elements MUST follow these naming rules:
Names can contain letters, numbers, and other characters
Names cannot start with a number or punctuation character
Names cannot start with the letters xml (or XML, or Xml, etc)
Names cannot contain spaces
Any name can be used, no words are reserved.
As suggestions for solving your problem while keeping to the standard:
Use an attribute, i.e. <Number value="1212041205115912"/>
Add a prefix to the tag, i.e. <_1212041205115912/>
Of course you can keep the structure you propose by writing your own format parser, but I think that would be a really bad idea: in the future someone will probably extend the format and will not be happy to discover that a file that looks like XML actually is not. Furthermore, if you want a custom format, use something simpler: sprinkling '<' and '>' through a text file adds no value if the result is not an officially recognized format, so a simple plain text file would be better.
If you absolutely can't change it, e.g. because the format is already out in the wild and used by other systems/customers/whatever:
Since it is an invalid XML document, try to clean it up before parsing it.
E.g. make a regex that replaces all <number> tags with <IMessedUp>number</IMessedUp> and then parse it.
It's a sort of iffy way to do it, but it will solve your problem.
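A minimal sketch of such a clean-up, using the underscore-prefix idea instead of a wrapper element, might be the following. NumericTagFixer is a made-up name, and the pattern assumes that a '<' directly followed by a digit only ever occurs at the start of a tag, not inside comments or CDATA. It also does nothing about the malformed <version=... /> first line shown in the question.

```csharp
using System.Text.RegularExpressions;

public static class NumericTagFixer
{
    // Matches '<' or '</' when the next character is a digit.
    private static readonly Regex NumericTagStart = new Regex(@"<(/?)(?=\d)");

    // Inserts '_' in front of the numeric name, for opening and closing tags.
    public static string PrefixNumericTags(string text) =>
        NumericTagStart.Replace(text, "<${1}_");
}
```

If the data has to be sent back in its original form, the prefix needs to be stripped again on the way out.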
If you need to process this document, then stop thinking of it as XML, and cast aside any thoughts of using XML tools to process it. You're dealing with a proprietary format and you will need to write your own tools to handle it. If you want the benefits of using XML technology, you will have to redesign your documents so they are valid XML.

Loading XML Document - Name cannot begin with the zero character

I am trying to load something which claims to be an XML document into any type of .net XML object: XElement, XmlDocument, or XmlTextReader. All of them throw an exception :
Name cannot begin with the '0' character, hexadecimal value 0x30
The error relates to this bit of 'XML':
<chart_value
color="ff4400"
alpha="100"
size="12"
position="cursor"
decimal_char="."
0=""
/>
I believe the problem is the author should not have named an attribute as 0.
If I could change this I would, but I do not have control of this feed. I suppose those who use it are using more permissive tools. Is there any way I can load this as XML without throwing an error?
There is no XML declaration either, nor namespace or contract definition. I was thinking I might have to turn it into a string and do a replace, but this is not very elegant. I was wondering if there were any other options.
As many have said, this is not XML.
Having said that, it's almost XML and WANTS to be XML, so I don't think you should use a regex to screw around inside of it (here's why).
Wherever you're getting the stream, dump it into a string, change 0= to something like zero= and try parsing it.
Don't forget to reverse the operation if you have to return-to-sender.
If you're reading from a file, you can do something like this:
var txt = File.ReadAllText(@"\path\to\wannabe.xml");
var clean = txt.Replace("0=", "zero=");
var doc = new XmlDocument();
doc.LoadXml(clean);
This is not guaranteed to remove all potential XML problems -- but it should remove the one you have.
Just prefix the numeric attribute name with '_'.
Example: replace "0=" with "_0=".
I hope that will fix the problem, thanks.
It might claim to be an XML document, but the claim is clearly false, so you should reject the document.
The only good way to deal with bad XML is to find out what bit of software is producing it, and either fix it or throw it away. All the benefits of XML go out of the window if people start tolerating stuff that's nearly XML but not quite.
The 0="" obviously uses an invalid attribute name 0. You'd probably have to do a find/replace to try and fix the XML if you cannot fix it at the source that created it. You might be able to use RegEx to try to do more efficient manipulation of the XML string.
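If a plain string replace of "0=" feels too broad, a slightly more targeted find/replace could be sketched like this. It assumes the offending names always follow whitespace and are followed by '=' and a quote; it can still misfire on attribute values that happen to contain that same shape, and AttributeNameFixer is a hypothetical name.

```csharp
using System.Text.RegularExpressions;

public static class AttributeNameFixer
{
    // Zero-width match: a position preceded by whitespace and followed by a
    // name that starts with a digit, then '=' and an opening quote.
    private static readonly Regex DigitAttribute =
        new Regex(@"(?<=\s)(?=\d[\w.\-]*\s*=\s*[""'])");

    // Inserts '_' at each such position, turning 0="" into _0="".
    public static string PrefixDigitAttributes(string text) =>
        DigitAttribute.Replace(text, "_");
}
```

After this step the chart_value element from the question loads, with the attribute available under the name _0.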

Parsing XML-ish data

Yes, I really am going to ask about parsing XML with regexes... here goes.
I have some XML-ish data, and I need to parse it. I can't do it completely with an XMLDocument or similar because it's not proper XML, and I'm not sure I can (or want to) change the format. The main problem is tags which have special meaning, and look like this:
<$ something_here $>
C#'s XmlDocument falls over parsing that, and I assume other methods will too. I could, with a lot of work, change the above to something like
<some_special_tag><![CDATA[ something_here ]]></some_special_tag>
But that's ugly, and I don't really want to. The reason it would be time consuming to change is that I have hundreds, maybe thousands of XML documents which would need to be changed.
At the moment, I'm parsing the document with regexes. I only need to pick out a couple of specific tags (not the ones above), and it seems to be working, but I'm uncomfortable with it. I'm doing something like this at the moment:
...
MatchCollection mc = Regex.Matches(Template, "<tagname.*?/tagname>"); // or similar
foreach (Match m in mc) {
try {
XmlDocument xd = new XmlDocument();
xd.LoadXml(m.Value);
...
This at least means I'm not using regexes exclusively :)
Can anyone think of a better way? Is there some way of getting XmlDocument to politely ignore the $ character that causes it to fall over? It doesn't seem likely, but I thought I should at least get some opinions.
No, there is no way to get XmlDocument to parse a document which isn't xml, no matter how close to xml it might look!
If it's possible to do, then I would definitely recommend that you convert your documents to be actual XML (or at least some recognised document format). Trying to create and maintain a reliable working parser for any format is quite a lot of work, let alone for a format that doesn't appear to be rigorously defined.
Using a some_special_tag element to identify special sections seems like a good idea to me. If necessary you can use a different namespace to ensure no clashes with other elements in your document - this is in fact exactly the way that xslt works ("special" tags are used to mean special things, like templates or nodes that should be replaced) and exactly what xml was designed to support.
Also I don't understand why you would need to place the something_here bit in CDATA sections. All characters that "break" XML can be escaped fairly easily (for example by writing < as &lt;). CDATA sections are generally only used when the contents of a node need so much escaping that it's easier and less messy just to use a CDATA section instead.
Update: Regarding migration to a new format, can you not use both methods? Attempt to parse the document as an XML document (or if there are performance concerns then perform some other test to quickly determine if the document is in the "old" or "new" format such as checking for a version attribute in the root element) - if it doesn't work then fall back to the old method.
This way, as long as everything is working fine (which it will be as long as nothing changes), users don't need to modify their documents. However, if they run into problems or want to use any new features, explain to them that they must update their documents to the new format.
Depending on how well your current "parser" works, you may even be able to provide an upgrade utility that automatically performs the conversion (as best it can).
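The "try XML first, fall back to the old method" idea could be sketched roughly as below. TemplateLoader and TryParseXml are invented names, and the fallback itself (the existing regex-based code) is left to the caller.

```csharp
using System.Xml;
using System.Xml.Linq;

public static class TemplateLoader
{
    // Returns true and the parsed document when the text is well-formed XML.
    // Returns false otherwise, so the caller can fall back to the legacy path.
    public static bool TryParseXml(string text, out XDocument document)
    {
        try
        {
            document = XDocument.Parse(text);
            return true;
        }
        catch (XmlException)
        {
            document = null;
            return false;
        }
    }
}
```

A caller would check the return value and only run the old regex-based handling when it is false. If exceptions on the common path are a performance concern, a cheaper up-front test such as the version attribute check mentioned above may be preferable.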
Can't you replace <$ something_here $> with that big CDATA section at run-time and then load the XML document as usual?
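A run-time conversion along those lines might look like this sketch. It assumes the special tags never nest and never contain the sequence ]]>, which would end the CDATA section early; SpecialTagConverter is a made-up name, while some_special_tag is the element name from the question.

```csharp
using System.Text.RegularExpressions;

public static class SpecialTagConverter
{
    // Lazily matches everything between "<$" and the next "$>".
    private static readonly Regex SpecialTag =
        new Regex(@"<\$(.*?)\$>", RegexOptions.Singleline);

    // Rewrites each special tag as an element wrapping a CDATA section.
    public static string ToCData(string template) =>
        SpecialTag.Replace(template, m =>
            "<some_special_tag><![CDATA[" + m.Groups[1].Value + "]]></some_special_tag>");
}
```

The files on disk stay untouched; only the in-memory string handed to XmlDocument.LoadXml is changed.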

removing xml tag with regex

I need to remove the tag "image" with regex.
I'm working with C# .Net
example <rrr><image from="91524" to="92505" /></rrr> should become:
<rrr></rrr>
Anyone???
You shouldn't really be using regex for this task, especially when .NET provides such powerful tools to handle XML:
XElement xml = XElement.Parse("<rrr><image from=\"91524\" to=\"92505\" /></rrr>");
xml.Descendants("image").Remove();
However if you insist on doing this with regex, let's see what happens:
string xml = "<rrr><image from=\"91524\" to=\"92505\" /></rrr>";
string output = Regex.Replace(xml, "<image.*?>", "");
This method has some problems though that the first method solves for you. Example problems:
Doesn't handle case sensitivity.
> characters in attributes can confuse the regex.
Newlines won't be matched correctly.
Incorrectly matches other tags that start with image like <image2 />.
XML comments can cause problems.
Doesn't handle both <image /> and <image></image>.
etc...
Some of these are easy to fix, some are more tricky. But in the end it's not worth spending time improving the regular expression solution to handle all the special cases when the LINQ to XML solution is so simple and does all this for you.
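To make a couple of the listed problems concrete, the sketch below runs both approaches on the same input. The ImageRemover class and the sample input are made up for the comparison.

```csharp
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class ImageRemover
{
    // The naive regex from above: strips anything that starts with "<image".
    public static string WithRegex(string xml) =>
        Regex.Replace(xml, "<image.*?>", "");

    // The LINQ to XML approach: removes elements named exactly "image".
    public static string WithLinq(string xml)
    {
        XElement root = XElement.Parse(xml);
        root.Descendants("image").Remove();
        return root.ToString(SaveOptions.DisableFormatting);
    }
}
```

For the input <rrr><image2 /><image from="1"></image></rrr>, WithRegex also deletes <image2 /> and leaves an orphaned </image> behind, so its result is no longer well-formed, whereas WithLinq removes only the image element and keeps image2.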
Even though XML is very regular and suffers from a draconian "validate or die" policy, this Stack Overflow question will prove very enlightening.
Regular expressions are powerful--but the XML tools in .NET are better for this task, because they are designed to handle this sort of thing. You can manipulate the XML based upon its structure, something Regexes can't do because they see your XML as text.
XML is text, but it's text with a particular structure. Take advantage of that known quality.
Try this:
<image[^>]*>
