XmlException - inserting attribute gives "unexpected token" exception - c#

I have an XmlDocument object in C# that I transform, using XslTransform, to html. In the stylesheet I insert an id attribute in a span tag, taking the id from an element in the XmlDocument. Here is the template for the element:
<xsl:template match="word">
<span>
<xsl:attribute name="id"><xsl:value-of select="@id"></xsl:value-of></xsl:attribute>
<xsl:apply-templates/>
</span>
</xsl:template>
But then I want to process the result document as an XHTML document (using the XmlDocument DOM). So I'm taking a selected element in the HTML, creating a range out of it, and trying to load the element using LoadXml():
wordElem.LoadXml(range.htmlText);
But this gives me the following exception: "'598' is an unexpected token. The expected token is '"' or '''. Line 1, position 10."
And if I move the cursor over range.htmlText, I see the tags for the element, and the "id" shows without quotes, which confuses me (i.e. SPAN id=598 instead of SPAN id="598"). To confuse the matter further, if I insert a blank space or something like that in the value of the id in the stylesheet, it works fine, i.e.:
<span>
<xsl:attribute name="id"><xsl:text> </xsl:text> <xsl:value-of select="@id"></xsl:value-of></xsl:attribute>
<xsl:apply-templates/>
</span>
(Notice the whitespace in the xsl:text element). Now if I move the cursor over the range.htmlText, I see an id with quotes as usual in attributes (and as it shows if I open the html file in notepad or something).
What is going on here? Why can't I insert an attribute this way and have a result that is acceptable as xhtml for XmlDocument to read? I feel I am missing something fundamental, but all this surprises me, since I do this sort of transformations using xsl:attribute to insert attributes all the time for other types of xsl transformations. Why doesn't XmlDocument accept this value?
By the way, it doesn't matter whether it is an id attribute. I have tried the "class" attribute, "style", etc., and also literal values, such as using "style" and setting the value to "color:red". It always complains about an invalid token, and the value gets no quotes unless there is whitespace or something similar in it (line breaks etc.).
I hope I have provided enough information. Any help will be greatly appreciated.
Basically, what I want to accomplish is set an id in a span element in html, select a word in a webbrowser control with this document loaded, and get the id attribute out of the selected element. I've accomplished everything, and can actually do what I want, but only if I use regex e.g. to get the attribute value, and I want to be able to use XmlDocument instead to simply get the value out of the attribute directly.
I'm sure I'm missing something simple, but if so please tell me.
Regards,
Anders

Try doing this.
<span id="{@id}">
Inner Text
</span>
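Whatever the stylesheet emits, the exception itself comes from the markup handed to LoadXml(): an attribute value without quotes is not well-formed XML. Here is a minimal sketch, assuming the unquoted SPAN markup described in the question. The helper name TryReadId is made up for illustration. It shows the parser rejecting the unquoted form and accepting the quoted one:

```csharp
using System;
using System.Xml;

// Hypothetical helper: tries to parse the markup as XML and read its id attribute.
static bool TryReadId(string markup, out string id)
{
    id = "";
    var doc = new XmlDocument();
    try
    {
        doc.LoadXml(markup);
    }
    catch (XmlException)
    {
        // Not well-formed XML, e.g. an attribute value without quotes.
        return false;
    }
    id = doc.DocumentElement?.GetAttribute("id") ?? "";
    return true;
}

// Markup as the question describes it in range.htmlText (unquoted value).
bool unquotedOk = TryReadId("<SPAN id=598>word</SPAN>", out string unquotedId);

// The same element with a quoted attribute value.
bool quotedOk = TryReadId("<SPAN id=\"598\">word</SPAN>", out string quotedId);

Console.WriteLine($"unquoted parsed: {unquotedOk}");
Console.WriteLine($"quoted parsed: {quotedOk}, id = {quotedId}");
```

If that is what is happening here, the fix belongs on the side that produces the markup string rather than in the xsl:attribute instruction.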


How can I grab text before a tag with HTMLAgilityPack

Suppose I have this HTML string:
These are some links<br>1234 - <a id="1234" href="#">My Number 1</a><br>4321 - My Number 2...
I want to extract the text after the <br> tag (1234 -), the inner text of the <a> tag (My Number 1), and the id attribute of the <a> tag (1234) as well. I am using the HTMLAgilityPack to help parse the HTML data that I get.
So far I have tried doing this:
// mNodes = code to get html string I want to parse
HtmlNode mNumberListNodes = mNodes[1]; // mNodes[1] is equal to a string as shown above
List<HtmlNode> mNumberNodes = mNumberListNodes.Descendants("a").ToList();
I am using debugging points to stop and view the previous child nodes in each of the HtmlNode's, but I am not having any luck finding the corresponding number text.
Anyone have any experience using the HTMLAgilityPack in C# that could help me?
I believe the
mNodes.InnerText
property will give you all the text that is not in html tags, specifically the "1234" you want. Text itself is not a node in the DOM.
Assuming the code above is correct, to get the id value, use:
mNumberListNodes.Descendants("a").ToList()[0].Attributes["id"].Value
I've had pretty good success using XPath with this library, and also regular expressions.

Parsing xml attributes with embedded double-quotes in their values using XMLDocument object

This is a web project.
I receive a partial html string from an external source. Using XMLDocument to parse it works well except when it encounters an attribute with embedded quotes such as the "style" attribute below.
<span id="someId" style="font-family:"Calibri", Sans-Serif;">Some Text</span>
It seems (but I could be wrong) that LoadXml() thinks the double quote before Calibri ends the style attribute and that Calibri is another "token" (token is the term I get in the error message).
var xml = new XmlDocument();
xml.LoadXml(<the html string above, properly escaped>); // <--- here is where I get the error message below
"'Calibri' is an unexpected token. Expecting white space. Line 1, position 18."
I can use Regex to replace the inner quotes but it will be rather ugly. And, I may well end up doing it!
I thought perhaps HtmlAgilityPack would help, but I couldn't find good documentation on it and I would rather avoid 3rd party libraries with sparse documentation.
Is there a way to make LoadXml() accept it (and, subsequently, have the Attributes collection parse it correctly)? I don't have much hope for that, but I am throwing it out there anyways. Or should I be using another class altogether other than XmlDocument? I am open to using a 3rd party library with good documentation.
That data is invalid. An attribute quoted with double quotes cannot contain double quotes in the attribute value. An attribute quoted with single quotes cannot have single quotes in the value.
Valid:
<tag attr1="value with 'single' quotes" attr2='value with "double" quotes' />
Invalid:
<tag attr1="value with "double" quotes" attr2='value with 'single' quotes' />
Note that the invalid example can be made valid as follows:
<tag attr1="value with &quot;double&quot; quotes" attr2='value with &apos;single&apos; quotes' />
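To check the escaped form against the parser, here is a small sketch that loads the span from the question with its inner quotes written as &quot; and reads the style value back. The variable names are illustrative only:

```csharp
using System;
using System.Xml;

// Inner double quotes are written as &quot; so the attribute stays well-formed.
string markup =
    "<span id=\"someId\" style=\"font-family:&quot;Calibri&quot;, Sans-Serif;\">Some Text</span>";

var doc = new XmlDocument();
doc.LoadXml(markup);

// The parser resolves the entities, so the value holds literal quotes again.
string style = doc.DocumentElement?.GetAttribute("style") ?? "";
Console.WriteLine(style);
```

The escaping only exists in the serialized text. Code that reads the attribute through the DOM sees the plain value.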

Writing xml with at most two tags per line

I am saving xml from .NET's XElement. I've been using the method ToString, but the formatting doesn't look how I'd like (examples below). I'd like at most two tags per line. How can I achieve that?
Saving XElement.Parse("<a><b><c>one</c><c>two</c></b><b>three<c>four</c><c>five</c></b></a>").ToString() gives me
<a>
<b>
<c>one</c>
<c>two</c>
</b>
<b>three<c>four</c><c>five</c></b>
</a>
But for readability I would rather 'three', 'four' and 'five' were on separate lines:
<a>
<b>
<c>one</c>
<c>two</c>
</b>
<b>three
<c>four</c>
<c>five</c>
</b>
</a>
Edit: Yes I understand this is syntactically different and "not in the spirit of xml", but I'm being pragmatic. Recently I've seen megabyte-size xml files with as few as 3 lines—these are challenging to text editors, source control, and diff tools. Something needs to be done! I've tested that changing the formatting above is compatible with our application.
If you want exactly that output, you'll need to do it manually, adding whitespace around nodes as necessary.
Almost all whitespace in XML documents is significant, even if we only think of it as indenting. When we ask a serializer to indent the document for us, it is changing content that can be extracted, so serializers try to be as conservative as possible. The elements
<tag>foo</tag>
and
<tag>
foo
</tag>
have different content, and if a serializer changed the former into the latter, it would change what you get back from your XML API when asking for the contents of <tag>.
The usual rule of thumb is that no indenting will be applied if there's any existing non-whitespace between the elements. In this case, your three between the tags would be modified if a serializer applied the indenting you desire, so nothing will do it for you automatically.
If you have control over the XML format, it's inadvisable to mix element and text children like this, where <b> has both text (three) and element (<c>) children, as it causes issues like what you're seeing.
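The manual approach could look something like the sketch below, which is only one way to do it. It assumes the input contains no formatting whitespace of its own, and BreakLines is a made-up helper name. It inserts a line break before every child element and before the closing tag of any element that has child elements, then saves with formatting disabled so the serializer leaves that whitespace alone. Note that this does change the text content of mixed elements such as the second <b>, exactly as described above.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: puts every child element on its own line by adding
// whitespace text around it. Elements without child elements are left alone.
static void BreakLines(XElement element, int depth)
{
    if (!element.HasElements) return;

    string childIndent = "\n" + new string(' ', 2 * (depth + 1));
    // Copy the child list first, because the loop modifies the tree.
    foreach (XElement child in element.Elements().ToList())
    {
        child.AddBeforeSelf(childIndent);
        BreakLines(child, depth + 1);
    }
    // Line break before the closing tag of this element.
    element.Add("\n" + new string(' ', 2 * depth));
}

XElement root = XElement.Parse(
    "<a><b><c>one</c><c>two</c></b><b>three<c>four</c><c>five</c></b></a>");
BreakLines(root, 0);

// DisableFormatting stops the serializer from adding indentation of its own.
string text = root.ToString(SaveOptions.DisableFormatting);
Console.WriteLine(text);
```

Any consumer that reads the text of <b> will now see the trailing whitespace as well, so this is only safe when the application tolerates it.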
The formatting isn't working the way you want because of the naked "three". Is there a reason it's not in its own tag? Should it be an attribute of "b" instead?
Explained reasons to colleagues - we're going to change the file format. I recommend you try to do the same. It's nigh impossible to do what I wanted, because most xml tools assume whitespace is significant.
XML is an information exchange format, intended for computers. The whitespace is irrelevant (depending on location and schema, really) and as such, it would be arbitrary to use one or the other.
You could use XmlTextWriter with XElement.Save and see whether you can tweak it to your liking with the XmlWriter.Settings Property
I've had to do something similar before (for a client request). All I ended up doing was writing a custom .ToString() method used only for displaying the XML in a browser (ugh, I know) or for letting them download an XML file of the content. Because the code did not have to be computationally efficient, it was merely a matter of checking the children of each tag and arranging the 'hanging' text as such.
Eventually we were able to convince the user that the text should be an attribute instead.

Clean out/replace invalid XML characters in element attributes

UPDATE: The invalid characters are actually in the attributes instead of the elements, this will prevent me from using the CDATA solution as suggested below.
In my application I receive the following XML as a string. There are two problems that prevent it from being accepted as valid XML.
I hope someone has a solution for fixing these problems gracefully.
There are ASCII characters in the XML that aren't allowed. Not only the one displayed in the example but I would like to replace all the ASCII code with their corresponding characters.
Within an element the '<' exists - I would like to remove all these entire 'inner elements' (<L CODE="C01">WWW.cars.com</L>) from the XML.
<?xml version="1.0" encoding="ISO-8859-1"?>
<cars>
<car model="ford" description="Argentinië love this"/>
<car model="kia" description="a small family car"/>
<car model="opel" description="great car <L CODE=&quot;C01&quot;>WWW.cars.com</L>"/>
</cars>
For a quick fix, you could load this not-XML into a string, and add CDATA markers inside any XML tags that you know usually tend to contain invalid data. For example, if you only ever see bad data inside <description> tags, you could do:
var soCalledXml = ...;
var xml = soCalledXml
.Replace("<description>", "<description><![CDATA[")
.Replace("</description>", "]]></description>");
This would turn the tag into this:
<description><![CDATA[great car <L CODE="C01">WWW.cars.com</L>]]></description>
which you could then process successfully -- it would be a <description> tag that contains the simple string great car <L CODE="C01">WWW.cars.com</L>.
If the <description> tag could ever have any attributes, then this kind of string replacement would be fraught with problems. But if you can count on the open tag to always be exactly the string <description> with no attributes and no extra whitespace inside the tag, and if you can count on the close tag to always be </description> with no whitespace before the >, then this should get you by until you can convince whoever is producing your crap input that they need to produce well-formed XML.
Update
Since the malformed data is inside an attribute, CDATA won't work. But you could use a regular expression to find everything inside the attribute's quote characters, and then do string manipulation to properly escape each < and >. They're at least escaping embedded quotes, so a regex that matches from one " to the next " would work.
Keep in mind that it's generally a bad idea to use regexes on XML. Of course, what you're getting isn't actually XML, but it's still hard to get right for all the same reasons. So expect this to be brittle -- it'll work for your sample input, but it may break when they send you the next file, especially if they don't escape & properly. Your best bet is still to convince them to give you well-formed XML.
using System.Text.RegularExpressions;
var soCalledXml = ...;
// Escape the angle brackets inside each description="..." attribute value.
var xml = Regex.Replace(soCalledXml, "description=\"[^\"]*\"",
match => match.Value.Replace("<", "&lt;").Replace(">", "&gt;"));
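As a rough end-to-end check of that idea, the sketch below runs the replacement on a cut-down version of the sample, loads the result, and reads the attribute back. It assumes the embedded quotes arrive escaped as &quot;. The final step, which strips the inner <L> element from the decoded value, is my own guess at what the question asks for and not part of the answer above.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

// Cut-down sample input; assumes inner quotes are already escaped as &quot;.
string soCalledXml =
    "<cars><car model=\"opel\" description=\"great car <L CODE=&quot;C01&quot;>WWW.cars.com</L>\"/></cars>";

// Escape angle brackets inside each description="..." attribute value.
string xml = Regex.Replace(soCalledXml, "description=\"[^\"]*\"",
    match => match.Value.Replace("<", "&lt;").Replace(">", "&gt;"));

var doc = new XmlDocument();
doc.LoadXml(xml);

// The DOM hands back the decoded value, angle brackets and quotes included.
var car = doc.SelectSingleNode("/cars/car") as XmlElement;
string description = car?.GetAttribute("description") ?? "";

// Assumption: drop the whole inner <L ...>...</L> element from the value.
string cleaned = Regex.Replace(description, @"<L\b[^>]*>.*?</L>", "").Trim();

Console.WriteLine(description);
Console.WriteLine(cleaned);
```

The same brittleness warning applies: an unescaped & or quote in the input would still break either the regex match or the load.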
You could wrap that content in a CDATA section.
With regex it will be something like this, match
"<description>(.*?)</description>"
and replace with
"<description><![CDATA[$1]]></description>"

XSLT getting a piece of xml to custommethod

I have a small question about XSLT, I've only recently started with xslt.
The problem is that I need to pass a piece of XML that matches the template to my custom method, but what arrives is a string that no longer has any tags.
For example, if my XML looks like this:
<a>hi</a>
<a>bye</a>
I receive only a string like this: "hi bye"
So instead of only the value/text of the node, I need to pass the whole node, with its tags, attributes, child elements and so on.
My xslt looks like this:
<xsl:template match="SpecialNode">
<xsl:value-of select="CustomMethod:Handler(node()[*], @name)"/>
</xsl:template>
But whatever I try (./node(), descendant::node(), * and so on), I always get the string without XML tags :(
I need something like this passed to my method as a string:
<a>hi</a><a>bye</a>
If you just want to get the tag name, try
<xsl:template match="SpecialNode">
<xsl:value-of select="CustomMethod:Handler(name(.))"/>
</xsl:template>
If you want the whole element, as well as the tag name, try
<xsl:template match="SpecialNode">
<xsl:value-of select="CustomMethod:Handler(., name(.))"/>
</xsl:template>
Use:
CustomMethod:Handler(.)
Your XSLT stylesheet is processing a tree of nodes, and you want your external C# (?) code to see lexical serialized XML containing angle brackets. So the tree of nodes needs to be serialized into lexical XML somewhere along the line. That's not going to happen by magic as an implicit conversion done by a function call. It's probably best to let the C# code receive the data as nodes, and do the serialization from there, assuming the processing can't be done at the tree level.
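Receiving nodes and serializing them on the C# side might look roughly like this. It is a sketch assuming the stylesheet runs under XslCompiledTransform with an extension object. The NodeHandler class, the urn:custom-method namespace URI and the sample input are all invented for the example. The method takes the node-set as an XPathNodeIterator and builds the markup from each node's OuterXml.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.XPath;
using System.Xml.Xsl;

const string stylesheet = @"<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    xmlns:CustomMethod='urn:custom-method'>
  <xsl:output method='text'/>
  <xsl:template match='SpecialNode'>
    <xsl:value-of select='CustomMethod:Handler(*, string(@name))'/>
  </xsl:template>
</xsl:stylesheet>";

const string input = "<SpecialNode name='greeting'><a>hi</a><a>bye</a></SpecialNode>";

var transform = new XslCompiledTransform();
using (var stylesheetReader = XmlReader.Create(new StringReader(stylesheet)))
{
    transform.Load(stylesheetReader);
}

// Bind the extension object to the namespace used by the CustomMethod prefix.
var arguments = new XsltArgumentList();
arguments.AddExtensionObject("urn:custom-method", new NodeHandler());

var output = new StringWriter();
using (var inputReader = XmlReader.Create(new StringReader(input)))
{
    transform.Transform(inputReader, arguments, output);
}
string result = output.ToString();
Console.WriteLine(result);

// Hypothetical extension object: receives nodes, not a flattened string.
public class NodeHandler
{
    public string Handler(XPathNodeIterator nodes, string name)
    {
        var markup = new StringBuilder();
        while (nodes.MoveNext())
        {
            // OuterXml serializes the current node including its tags.
            markup.Append(nodes.Current?.OuterXml);
        }
        return name + ": " + markup;
    }
}
```

Passing * hands over the child elements of the matched node. Passing . instead would give the handler the SpecialNode element itself.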
