Get end of element with HTML Agility Pack? - c#

I am using HTML Agility Pack to parse my HTML, and I need to know the position of each element within the HTML. HtmlNode.StreamPosition gives me the start location in the HTML, which works great. However, I'd also like the stream position of the end of the element. I can take the StreamPosition and add the length of the OuterHtml, but this is inaccurate because the OuterHtml from HTML Agility Pack often does not match the actual HTML text exactly.
I'm also game for using AngleSharp, if it's any easier or better suited to this. So basically, I can get the location of the start of an HTML element; how can I get the location of the end?

There is actually a private _endnode field on HtmlNode that holds the closing tag of the element. So you can either change the HAP source code to expose it, or use System.Reflection to access it.
There is another similar HAP issue with some example code.

Related

Select "src" value with XPath to HtmlAgilityPack

I'm developing a crawling engine. My program crawls websites through XPath with HtmlAgilityPack. I need to get some image src attributes directly. You can see my simple code below, which is not working correctly. Thanks in advance!
PS: Please ignore the " character problem; the XPath patterns are provided by a database.
Agility.DocumentNode.SelectSingleNode("//img[@id="product_photo"]/@src");
And this is the line I need to crawl (the *...* part shows the block to extract):
<img id="product_photo" src="*/images/thumb/4400/10280/st.jpg*">
Some pages provide the image in meta tags, so .Attributes["src"] won't work.
UPDATE: You can see my query and result here
You can't get the value of "src" or any other attribute using:
Agility.DocumentNode.SelectSingleNode(yourXpath);
Only by using:
string s=Agility.DocumentNode.SelectSingleNode(yourXpath).value;
This is because XPath can't return the value of an attribute through the SelectSingleNode() function in the HtmlAgilityPack class. So you must use SelectSingleNode(yourXpath).value, or use a Regex after parsing, to get just the "src" without the outer text.

How to generate xpath by looking for a string in an HTML document?

I have an HTML document, and I want to find the XPath to an element containing a certain string.
To elaborate a bit more:
My HTML document is created dynamically and I have no specific names for the divs. The divs I am interested in look like (more or less):
<div>Country: China</div>
<div>Type: Earphones</div>
I want to get the whole string "Country: China". In order to do so, I want to find the xpath to this div by searching for "Country:" in the HTML.
I hope I was specific enough... Thank you!
Here are a couple ways:
//div[contains(child::text(), "Country:")]
//div/child::text()[contains(., "Country:")]/parent::node()
If you want to try things out within a browser, try an in-browser XPath bookmarklet.
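The two expressions above can also be tried from C#. This is only a minimal sketch: it assumes the markup is well formed enough to load with System.Xml's XmlDocument (HTML Agility Pack is not used here), and the class name XPathTextFinder and the sample markup in the usage note are made up for illustration.

```csharp
using System;
using System.Xml;

static class XPathTextFinder
{
    // Returns the text of the first node matched by xpath, or null if nothing matches.
    // Assumption: markup is well-formed XML; LoadXml throws XmlException otherwise.
    public static string FirstMatchText(string markup, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(markup);
        XmlNode node = doc.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText;
    }
}
```

For example, calling FirstMatchText on "<root><div>Country: China</div><div>Type: Earphones</div></root>" with either expression should return the whole string "Country: China", since both expressions select the div itself rather than only the matching text.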

Get a particular input element from a particular form

Like the statement,
string value = document.forms["sap.client.SsrClient.form"].elements["sapwdssr..requestCounter"].value;
in JavaScript, is there a corresponding statement to get the value of a particular input element within a particular form in C#?
I can do so by using HTMLDocument and mshtml interface. But that is a rather cumbersome process so if any direct method or property exists it would be great.
I assume you are asking to parse HTML, rather than attempting to do some form of runtime manipulation of a rendered web page, correct?
If that's the case, I highly suggest you look into the HTML Agility Pack, which we have used very successfully to parse HTML as if it were XML. You could do your stuff with a simple XPath query.

How do I access a specific HTML element using C#?

I have a string containing HTML and I need to be able to access a specific element to get the text from it (the element has no id or class or name so regex is out of the question).
For example, let's say I needed to access: "/html/body/div/div[3]/div/table[0]/div/ul/li[12]/a/".
How could I go about doing this?
If the HTML is well formed, you can parse it with an XmlDocument.
Also, as Maxim mentioned, the HTML Agility Pack can probably do what you need.
Here's a recent article from 4guysfromrolla on parsing HTML with the HTML Agility Pack.
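A minimal sketch of the XmlDocument route, assuming the HTML really is well-formed XML. The class name XmlHtmlReader and the method TryGetText are hypothetical names for illustration, not part of any library.

```csharp
using System;
using System.Xml;

static class XmlHtmlReader
{
    // Tries to read the text at an XPath. Returns false when the markup is not
    // well-formed XML or when the path matches nothing.
    public static bool TryGetText(string markup, string xpath, out string text)
    {
        text = null;
        var doc = new XmlDocument();
        try
        {
            doc.LoadXml(markup);
        }
        catch (XmlException)
        {
            // Typical for real-world HTML, e.g. an unclosed <br>.
            return false;
        }

        XmlNode node = doc.SelectSingleNode(xpath);
        if (node == null) return false;

        text = node.InnerText;
        return true;
    }
}
```

Note that XPath positions start at 1, so a step such as table[0] matches nothing, and an expression cannot end with a trailing slash. Markup that declares an XML namespace (as XHTML does) would additionally need an XmlNamespaceManager, which this sketch leaves out.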

Regular Expression for Extracting Script Tags

I am trying to write a regular expression in C# to remove all script tags and anything contained within them.
So far I have come up with the following: \<([^:]*?:)?script\>[^(\</<([^:]*?:)?script\>)]*?\</script\>, however this does not work.
I'll break it up and explain my thinking in each section:
\<([^:]*?:)?script\>
Here I am trying to state that it should get any script element, even if it is prefixed with a namespace, say, <a:script></a:script>. I have also added this to the closing tag.
[^(\</<([^:]*?:)?script\>)]*?
Here I am trying to state that it should allow anything to be contained within the tags except for </a:script>, </script>, etc.
\</script\>
Here I am stating that it should have a closing tag.
Can anyone spot where I am going wrong?
This regular expression does the trick just fine:
\<(?:[^:]+:)?script\>.*?\<\/(?:[^:]+:)?script\>
But please don't do it.
You will run into a problem with this simple HTML:
<script>
var s = "<script></script>";
</script>
How are you going to solve this problem? It is smarter to use the HTML Agility Pack for such things.
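To see the problem concretely, here is a small sketch that applies the pattern above with System.Text.RegularExpressions. The class name ScriptStripper is made up, and RegexOptions.Singleline is an assumption added so that '.' can span the line breaks in the sample.

```csharp
using System;
using System.Text.RegularExpressions;

static class ScriptStripper
{
    // Pattern taken from the answer above. Singleline (an added assumption)
    // lets '.' match newlines so multi-line script blocks are covered.
    static readonly Regex ScriptTag = new Regex(
        @"\<(?:[^:]+:)?script\>.*?\<\/(?:[^:]+:)?script\>",
        RegexOptions.Singleline);

    // Removes every match of the pattern from the input.
    public static string Strip(string html)
    {
        return ScriptTag.Replace(html, "");
    }
}
```

On simple input such as "<p>a</p><script>alert(1)</script><p>b</p>" this leaves "<p>a</p><p>b</p>". On the sample above, the lazy match stops at the </script> inside the string literal, so the tail of the script (the closing quote, the semicolon, and the real </script>) is left behind in the output.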
You can't parse HTML with regular expressions.
Use the HTML Agility Pack instead.
