I've found a lot of articles about how to get node content using a simple XPath expression and C#, for example:
XPath:
/bookstore/author/first-name
C#:
string xpathExpression = "/bookstore/author/first-name";
nodes = navigator.Select(xpathExpression);
I wonder how to get the content of an element that is nested inside another element, which is itself nested several levels deep.
Take a look at the XML below:
<Cell>
<CellContent>
<Para>
<ParaLine>
<String>ABCabcABC abcABC abc ABCABCABC.</String>
</ParaLine>
</Para>
</CellContent>
</Cell>
I only want to extract the content ABCabcABC abcABC abc ABCABCABC. from the String element.
Do you know how to solve this problem using an XPath expression and .NET C#?
A quick search for "c# .net xpath" turns up this article, which provides an example that you can easily adapt to use XPathDocument, XPathNavigator and XPathNavigator.SelectSingleNode():
XPathNavigator nav;
XPathDocument docNav;
string xPath;
docNav = new XPathDocument("c:\\books.xml");
nav = docNav.CreateNavigator();
xPath = "/Cell/CellContent/Para/ParaLine/String/text()";
string value = nav.SelectSingleNode(xPath).Value;
I recommend more reading on xPath syntax. Much more.
navigator.SelectSingleNode("/Cell/CellContent/Para/ParaLine/String/text()").Value
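The answers above can be tried without a file on disk. The following is a minimal sketch, assuming the XML from the question is held in a string (the Program class and the inline XML literal are illustrative, not part of the original answers):

```csharp
using System;
using System.IO;
using System.Xml.XPath;

class Program
{
    static void Main()
    {
        // The question's XML, inlined so the sketch is self-contained.
        const string xml =
            "<Cell><CellContent><Para><ParaLine>" +
            "<String>ABCabcABC abcABC abc ABCABCABC.</String>" +
            "</ParaLine></Para></CellContent></Cell>";

        // XPathDocument accepts any TextReader, so a StringReader works here.
        var doc = new XPathDocument(new StringReader(xml));
        XPathNavigator nav = doc.CreateNavigator();

        // Select the element itself; Value returns its text content.
        XPathNavigator node = nav.SelectSingleNode("/Cell/CellContent/Para/ParaLine/String");

        // SelectSingleNode returns null when nothing matches, so check first.
        Console.WriteLine(node != null ? node.Value : "(not found)");
    }
}
```

Element names in XPath are case-sensitive, so the step must be written String, exactly as in the document.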
You can also use LINQ to XML to get the value of a specific element:
var list = XDocument.Parse("xml string").Descendants("ParaLine")
.Select(x => x.Element("String").Value).ToList();
The query above returns the values of all String elements that are inside ParaLine elements.
I'm trying to capture the attribute "descripcion" in this XML:
<ProductoModel xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/WebApi.Models">
<descripcion>descripcion 1</descripcion>
<fecha_registro>2016-03-01</fecha_registro>
<id_producto>1</id_producto>
<id_proveedor>1</id_proveedor>
<nombre_producto>producto 1</nombre_producto>
<precio>200</precio>
</ProductoModel>
My Code :
XmlDocument xDoc = new XmlDocument();
xDoc.LoadXml(content);
XmlNamespaceManager manager = new XmlNamespaceManager(xDoc.NameTable);
manager.AddNamespace("MYNS", "http://schemas.datacontract.org/2004/07/WebApi.Models");
XmlNode node = xDoc.DocumentElement.SelectSingleNode("MYNS:ProductoModel", manager);
MessageBox.Show(node.Attributes.GetNamedItem("descripcion").Value);
The problem is that I cannot capture the attribute "descripcion", and I get the following error:
Object reference not set to an instance of an object.
How can I capture the required attribute?
<descripcion> is not an attribute. It is an element.
You can get any element (or attribute) with a single XPath query.
XmlNode node = xDoc.DocumentElement.SelectSingleNode("/MYNS:ProductoModel/MYNS:descripcion", manager);
MessageBox.Show(node.InnerText);
Note the character / at the beginning of the xpath expression.
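To see why the leading / matters, here is a small sketch, assuming a trimmed copy of the XML from the question (only the descripcion element is kept). A relative path evaluated from DocumentElement looks for a ProductoModel child of the root ProductoModel, which does not exist, so the result is null:

```csharp
using System;
using System.Xml;

class Program
{
    static void Main()
    {
        // Trimmed version of the question's XML; the default namespace is what matters.
        const string content =
            "<ProductoModel xmlns=\"http://schemas.datacontract.org/2004/07/WebApi.Models\">" +
            "<descripcion>descripcion 1</descripcion>" +
            "</ProductoModel>";

        var xDoc = new XmlDocument();
        xDoc.LoadXml(content);

        // Map an arbitrary prefix to the document's default namespace.
        var manager = new XmlNamespaceManager(xDoc.NameTable);
        manager.AddNamespace("MYNS", "http://schemas.datacontract.org/2004/07/WebApi.Models");

        // Relative path: searches the children of the root element, finds nothing.
        XmlNode relative = xDoc.DocumentElement.SelectSingleNode("MYNS:ProductoModel", manager);
        Console.WriteLine(relative == null); // True

        // Absolute path: starts at the document root, so the element is found.
        XmlNode absolute = xDoc.DocumentElement.SelectSingleNode(
            "/MYNS:ProductoModel/MYNS:descripcion", manager);
        Console.WriteLine(absolute.InnerText); // descripcion 1
    }
}
```

The null result from the relative path is what leads to the "Object reference not set to an instance of an object" error in the question.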
If you want another easy way to work with XML, check this out. It is a small tool for XML manipulation that is much easier to use and understand than XmlNode.
I am parsing a number of HTML documents, and within each need to try and extract a UK postal address. In order to do so I am parsing the HTML with AngleSharp and then looking for nodes with TextContent that match my RegEx:
var parser = new HtmlParser();
var source = "<html><head><title>Test Title</title></head><body><h1>Some example source</h1><p>This is a paragraph element and example postode EC1A 4NP</body></html>";
var document = parser.Parse(source);
Regex searchTerm = new Regex("([A-PR-UWYZ][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)");
var list = document.All.Where(m => searchTerm.IsMatch((m.TextContent ?? "").ToUpper()));
This returns 3 results, the html, body and p elements. The only element I want to return is the p element as that has the innerText matching the regex correctly. There may also be more than one match on a page so I can't just return the last result. I am looking to just return any elements where the text in that element (not in any child nodes) matches the regex.
Edit
I don't know the document structure in advance, or even the tag that the postcode will be in, which is why I'm using a regex. Once I have the result, I am planning on traversing the DOM to obtain the rest of the address, so I don't just want to treat the document as a string.
If you are looking to extract a particular node from a well-formed HTML/XML document, then have a look at using XPath. There are some examples here on MSDN.
You can use utility libraries such as HTML Tidy to clean up the HTML and make it well formed if it isn't already.
OK, I took a different approach in the end. I searched the HTML document as a string with the regex, not to parse the HTML but simply to find the exact matching value. Once I had that value, it was simple enough to use an XPath expression to return the node. In the example above, the regex search returns EC1A 4NP, and the following XPath:
//*[contains(text(),'EC1A 4NP')]
returns the required node. For easier XPath support, I switched from AngleSharp to HtmlAgilityPack for the HTML parsing.
I've had a quick look at the documentation of the parser. Below is what you need to do if you want to check only the text in <p> tags.
var list = document.All.Where(m => m.LocalName.ToUpper() == "P" && searchTerm.IsMatch((m.TextContent ?? "").ToUpper()));
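The original requirement (match only elements whose own text, not their children's text, satisfies the regex) can be sketched as follows. This is an illustration under a loud assumption: it uses LINQ to XML on well-formed markup as a stand-in for AngleSharp or HtmlAgilityPack, so it will not cope with real-world malformed HTML, and the closing </p> tag has been added to the sample source.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

class Program
{
    static void Main()
    {
        // Well-formed variant of the question's sample document (assumption).
        const string source =
            "<html><head><title>Test Title</title></head><body>" +
            "<h1>Some example source</h1>" +
            "<p>This is a paragraph element and example postcode EC1A 4NP</p>" +
            "</body></html>";

        var searchTerm = new Regex(
            "([A-PR-UWYZ][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)");

        // For each element, concatenate only its direct text nodes,
        // ignoring text that belongs to child elements.
        var matches = XDocument.Parse(source)
            .Descendants()
            .Where(e => searchTerm.IsMatch(
                string.Concat(e.Nodes().OfType<XText>().Select(t => t.Value))
                      .ToUpperInvariant()));

        foreach (XElement e in matches)
            Console.WriteLine(e.Name.LocalName); // only the p element matches
    }
}
```

Because html and body have no direct text of their own, they drop out of the result, and no tag name has to be known in advance.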
I am using a C# console app to get an XML document. Once the XmlDocument is loaded, I want to search for a specific href attribute:
href="/abc/def
inside the XML document.
Once that node is found, I want to strip the tag completely and just show Hello.
Hello
I think I can simply find the tag using a regex. But can anyone please tell me how I can remove the href tag completely using a regex?
XML and HTML are much the same thing: tagged content. XML is stricter in its formatting.
For this use case I would use transformations and XPath queries to rebuild the document. As @Yahia stated, using a regex on tagged documents is typically a bad idea. The regex needed for parsing is far too complex to be effective as a generic solution.
The most popular technology for similar tasks is called XPath. (It is also a key component of XQuery and XSLT.) Would the following perhaps solve your task, too?
root.SelectSingleNode("//a[@href='/abc/def']").InnerText = "Hello";
You could try
string x = @"<?xml version='1.0'?>
<EXAMPLE>
<a href='/abc/def'>Hello</a>
</EXAMPLE>";
System.Xml.XmlDocument doc = new XmlDocument();
doc.LoadXml(x);
XmlNode n = doc.SelectSingleNode("//a[@href='/abc/def']");
XmlNode p = n.ParentNode;
p.RemoveChild(n);
System.Xml.XmlNode newNode = doc.CreateNode("element", "a", "");
newNode.InnerXml = "Hello";
p.AppendChild(newNode);
I'm not really sure if this is what you are trying to do, but it should be enough to get you headed in the right direction.
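If the goal is to drop the a element entirely and keep only its text, one option is to replace the element with a text node. This is a minimal sketch, assuming the same sample XML as in the answer above and that a plain text replacement is what the question is after:

```csharp
using System;
using System.Xml;

class Program
{
    static void Main()
    {
        const string xml = "<EXAMPLE><a href='/abc/def'>Hello</a></EXAMPLE>";

        var doc = new XmlDocument();
        doc.LoadXml(xml);

        XmlNode link = doc.SelectSingleNode("//a[@href='/abc/def']");
        if (link != null)
        {
            // Swap the whole <a> element for a text node holding its inner text.
            XmlText text = doc.CreateTextNode(link.InnerText);
            link.ParentNode.ReplaceChild(text, link);
        }

        Console.WriteLine(doc.OuterXml); // <EXAMPLE>Hello</EXAMPLE>
    }
}
```

No regex is involved: the XPath query finds the node, and the DOM call removes the tag while keeping the content.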
I'm having trouble retrieving a single node by its explicit XPath, which I have already found by other means. I have the node and I can get its XPath, but when I try to retrieve that same node again, this time via node.XPath, it gives the "expression must evaluate to a node-set" error. Shouldn't this work? I'm using HtmlAgilityPack in C#, by the way, for the HtmlDocument.
HtmlDocument doc = new HtmlDocument();
doc.Load(@"..\..\test1.htm");
HtmlNode node = doc.DocumentNode.SelectSingleNode("(//node()[@id='something'])[1]");
HtmlNode same = doc.DocumentNode.SelectSingleNode(node.XPath);
BTW: this is the value of node.XPath:
"/html[1]/body[1]/table[1]/tr[1]/td[1]/div[1]/div[1]/div[2]/table[1]/tr[1]/td[1]/div[1]/div[1]/table[1]/tr[1]/td[1]/div[1]/div[1]/div[4]/div[2]/div[1]/div[1]/div[4]/#text[2]"
I was able to get it working by replacing #text with the function text(). I'm not sure why it didn't just emit the XPath that way in the first place.
HtmlNode same = doc.DocumentNode.SelectSingleNode(node.XPath.Replace("#text", "text()"));
Your XPath ends in "#text[2]", which means "the second 'text' attribute". Attributes aren't nodes, they're node metadata.
This is a common problem I've had with XPath: wanting the value of an attribute while the XPath operation absolutely has to extract a node.
The solution I've used for this is to wrap my XPath fetching with something that detects and strips off the attribute portion of the string (via a myXPathString.LastIndexOf( "#" ) method call) and then uses the truncated myXPathString to fetch the node and collect the desired attribute value as a second step.
Hope that helps,
J
I have the fragment of XML below. Notice that the Reference node holds a URI which links to the Id attribute of the Body node.
<Reference URI="#Body">
<SOAP-ENV:Body Id="Body" xmlns:SOAP-ENV="http://www.dingo.org">
<ns0:Add xmlns:ns0="http://www.moo.com">
<ns0:a>2</ns0:a>
<ns0:b>3</ns0:b>
</ns0:Add>
</SOAP-ENV:Body>
If I had the value of the URI attribute, how would I then get the whole Body XmlNode? I presume this would be best done via an XPath expression, but I haven't any clue about XPath. Note that the XML will not always be so simple. I'm doing this in C#, by the way :)
Any ideas?
Thanks
Jon
EDIT: I wouldn't know the XML structure or namespaces beforehand. All I would know is that the Reference element has the ID of the XmlNode I want to retrieve. I hope this is slightly clearer.
You can add a condition that refers to a relative (or absolute) node to any step of an XPath expression.
In this case:
//*[@Id=substring-after(/Reference/@URI, '#')]
The //* matches all elements in the document. The part in [] is a condition. Inside the condition, the value of the URI attribute of the root Reference element is taken, keeping only the part after the '#'. Attribute names are case-sensitive, so Id is written as it appears in the sample XML.
Sample code, assuming you have loaded your XML into XPathDocument doc:
var nav = doc.CreateNavigator();
var found = nav.SelectSingleNode("//*[@Id=substring-after(/Reference/@URI, '#')]");
If you have the value of the URI attribute in a variable you could use
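myXmlDocument.DocumentElement.SelectSingleNode("//SOAP-ENV:Body[@Id='" + pURI + "']", manager)
where pURI is the value of the URI attribute without the leading '#', myXmlDocument is the XmlDocument object, and manager is an XmlNamespaceManager that maps the SOAP-ENV prefix to the namespace used in the document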
Something like this:
XmlDocument requestDocument = new XmlDocument();
requestDocument.LoadXml(yourXmlString);
String someXml = requestDocument.SelectSingleNode(@"/*[local-name()='Reference']/*[local-name()='Body']").InnerXml;
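Putting the ideas from these answers together, the lookup can be done without knowing the target element's name or namespace. This is a sketch under stated assumptions: the Envelope wrapper element is hypothetical (added only so the fragment from the question is a well-formed document), and the Id value is assumed not to contain a single quote.

```csharp
using System;
using System.Xml;

class Program
{
    static void Main()
    {
        // The question's fragment wrapped in a hypothetical <Envelope> root.
        const string xml =
            "<Envelope>" +
            "<Reference URI=\"#Body\" />" +
            "<SOAP-ENV:Body Id=\"Body\" xmlns:SOAP-ENV=\"http://www.dingo.org\">" +
            "<ns0:Add xmlns:ns0=\"http://www.moo.com\"><ns0:a>2</ns0:a><ns0:b>3</ns0:b></ns0:Add>" +
            "</SOAP-ENV:Body>" +
            "</Envelope>";

        var doc = new XmlDocument();
        doc.LoadXml(xml);

        // Find the Reference element by local name, wherever it is.
        XmlNode reference = doc.SelectSingleNode("//*[local-name()='Reference']");

        // "#Body" -> "Body"
        string id = reference.Attributes["URI"].Value.TrimStart('#');

        // Match any element whose Id attribute equals that value.
        // Naive string concatenation: assumes id contains no single quote.
        XmlNode body = doc.SelectSingleNode("//*[@Id='" + id + "']");

        Console.WriteLine(body.Name); // SOAP-ENV:Body
    }
}
```

Because the query uses //* and an unprefixed attribute, no XmlNamespaceManager is needed, which suits the case where the namespaces are not known beforehand.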