How do I access a specific HTML element using C#? - c#

I have a string containing HTML and I need to be able to access a specific element to get the text from it (the element has no id or class or name so regex is out of the question).
For example, let's say I needed to access: "/html/body/div/div[3]/div/table[0]/div/ul/li[12]/a/".
How could I go about doing this?

If the HTML is well-formed (that is, it also parses as XML), you can load it into an XmlDocument.
Also as Maxim mentioned, the HTML Agility Pack can probably do what you need.
Here's a recent article from 4guysfromrolla on parsing HTML with the HTML Agility Pack
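For the well-formed case, a minimal sketch with XmlDocument might look like the following. The class name, method name, and sample markup are made up for illustration, and it assumes the input really is valid XML (LoadXml throws an XmlException otherwise). Note that XPath positions start at 1, so an index like table[0] from the question would never match.

```csharp
using System.Xml;

public static class XmlElementLookup
{
    // Returns the text of the first node matching the XPath expression,
    // or null when nothing matches. Only suitable for well-formed markup.
    public static string GetText(string markup, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(markup);                         // throws XmlException on malformed input
        XmlNode node = doc.SelectSingleNode(xpath);  // null when there is no match
        return node?.InnerText;
    }
}
```

For example, calling GetText with the markup "<html><body><div><ul><li><a>first</a></li><li><a>second</a></li></ul></div></body></html>" and the path "/html/body/div/ul/li[2]/a" returns "second". For HTML found in the wild, which is rarely well-formed, the HTML Agility Pack is the safer choice.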

Related

Get end of element with HTML Agility Pack?

I am using HTML Agility Pack to parse my HTML, and I need to know the position of each element within the HTML. HtmlNode.StreamPosition gives me the start location in the HTML and works great. However, I'd also like the stream position of the end of the element. I can take the StreamPosition and add the length of the OuterHtml, but this is inaccurate because the OuterHtml that HTML Agility Pack produces often does not match the actual HTML text exactly.
I'm also game for using AngleSharp if it's any easier or better suited to this. So basically, I can get the location of the start of an HTML element; how can I get the location of the end?
There is actually a private _endnode field on HtmlNode, which is the closing tag of the element. So you can either change the HAP source code to expose it, or use System.Reflection to access it.
There is another similar HAP issue with some example code.

Regular expression in .net to get a special tag

Here is some sample HTML:
<div><span>span1</span></div>
<b>for test</b>
<span>span2</span>
Is there any way to get all span tags that are not inside div tags (in this sample: span2)?
Following this post, C# Regular Expression excluding a string, I came up with the pattern below, but it does not work.
pattern: ((?:(?!\b<div>\b))*)((.|\n)*?)<span>((.|\n)*?)</span>((.|\n)*?)((?:(?!\b</div>\b))*)
You really don't want to be using regular expressions to try to parse HTML. You can read more about the many reasons on this Stack Overflow question:
RegEx match open tags except XHTML self-contained tags
You should use an HTML parser like Html Agility Pack, or even a simple XML parser like XmlReader.
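As a rough sketch of the XML-parser route, the sample above can be handled with XmlDocument and an XPath ancestor test. This assumes the fragment is well-formed; the class name, method name, and the synthetic root wrapper are illustrative and not from the original answer.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class SpanFinder
{
    // Returns the text of every <span> that has no <div> ancestor.
    public static List<string> SpansOutsideDivs(string fragment)
    {
        var doc = new XmlDocument();
        // The sample has several top-level elements, so wrap it in a
        // made-up root element to obtain a single-rooted XML document.
        doc.LoadXml("<root>" + fragment + "</root>");

        var result = new List<string>();
        foreach (XmlNode span in doc.SelectNodes("//span[not(ancestor::div)]"))
        {
            result.Add(span.InnerText);
        }
        return result;
    }
}
```

With the three lines of sample HTML from the question as input, the returned list contains only "span2". The same XPath expression should also work with a real HTML parser when the markup is not valid XML.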

How to generate xpath by looking for a string in an HTML document?

I have an HTML document, and I want to find the XPath to an element containing a certain string.
To elaborate a bit more:
My HTML document is created dynamically and I have no specific names for the <div>s. The divs I am interested in look (more or less) like this:
<div>Country: China</div>
<div>Type: Earphones</div>
I want to get the whole string "Country: China". In order to do so, I want to find the xpath to this div by searching for "Country:" in the HTML.
I hope I was specific enough... Thank you!
Here are a couple ways:
//div[contains(child::text(), "Country:")]
//div/child::text()[contains(., "Country:")]/parent::node()
If you want to try things out within a browser, try an in-browser XPath bookmarklet.
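To try the first expression from C#, here is a small sketch using XmlDocument. It assumes the markup is well-formed, and the step that builds an absolute path by walking up the parents is an illustrative addition, not part of the answer above. All names are made up.

```csharp
using System.Xml;

public static class XPathLocator
{
    // Finds the first <div> whose first text node contains the given text and
    // returns an absolute, position-based XPath to it (null if none is found).
    public static string PathToDivContaining(string markup, string text)
    {
        var doc = new XmlDocument();
        doc.LoadXml(markup);

        // Assumption: text contains no double quote, because it is spliced
        // directly into the XPath string literal.
        XmlNode node = doc.SelectSingleNode(
            "//div[contains(child::text(), \"" + text + "\")]");
        if (node == null) return null;

        string path = "";
        // Walk from the match up to the document element, recording each
        // element's 1-based position among same-named preceding siblings.
        for (XmlNode n = node; n != null && n.NodeType == XmlNodeType.Element; n = n.ParentNode)
        {
            int index = 1;
            for (XmlNode s = n.PreviousSibling; s != null; s = s.PreviousSibling)
            {
                if (s.NodeType == XmlNodeType.Element && s.Name == n.Name) index++;
            }
            path = "/" + n.Name + "[" + index + "]" + path;
        }
        return path;
    }
}
```

For the markup "<html><body><div>Type: Earphones</div><div>Country: China</div></body></html>" and the text "Country:", this returns "/html[1]/body[1]/div[2]". If only the string itself is needed, node.InnerText on the matched node already gives "Country: China".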

Regex to get the tags

I have HTML like this:
<h1> Headhing </h>
<font name="arial">some text</font></br>
some other text
In C#, I want to get the output below: simply the content from the font start tag to the font end tag.
<font name="arial">some text</font>
First off, your HTML is wrong: you should close an <h1> with </h1>, not </h>. Mistakes like this one are why regex is inappropriate for parsing tags.
Second, there are hundreds of questions on SO about parsing HTML with regex. The answer is: don't. Use something like the HTML Agility Pack.
I wouldn't recommend trying it with regex.
I use the HTML Agility Pack to parse HTML and get what I want.
It's a lovely HTML parser that is commonly recommended for this. It will take malformed HTML and massage it into XHTML and then a traversable DOM, like the XML classes. So it is very useful for the code you find in the wild.
There's also an HTML parser from Microsoft, MSHTML, but I haven't tried it.
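// Requires: using System.Text.RegularExpressions;
// html is the string holding the markup; the lazy .*? stops at the first </font>.
Regex regExfont = new Regex(@"<font name=""arial""[^>]*>.*?</font>");
MatchCollection rows = regExfont.Matches(html);
A good site for testing regular expressions is http://www.regexlib.com/RETester.aspx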

How to find all img tags on a string?

I want to find all img tags in a string of text and wrap each one in a link (an a tag).
What's the best way to do it? I also want to retrieve the src value and put it in the link's href attribute.
By the way I'm doing it in C# but the way should be similar in every language. Has anybody done such filtering and replacing? Any advice would be appreciated.
If you're parsing and dealing with raw HTML strings, I would highly recommend using the Html Agility Pack library.
How do you parse an HTML string for image tags to get at the SRC information?
