How to parse this piece of HTML? - c#

good morning!
i am using c# (framework 3.5sp1) and want to parse following piece of html via regex:
<h1>My caption</h1>
<p>Here will be some text</p>
<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>
<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>
<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>
i need following output:
group 1: content of h1
group 2: content of h1-following text
group 3-n: content of subcaptions + text
what i have atm:
<hr.*?/>
<h2.*?>(.*?)</h2>
([\W\S]*?)
<hr.*?/>
this will give me every odd subcaption + content (eg. 1, 3, ...) due to the trailing <hr/>. for parsing the h1-caption i have another pattern (<h1.*?>(.*?)</h1>), which only gives me the caption but not the content - i'm fine with that atm.
does anybody have a hint/solution for me or any alternative logics (eg. parsing the html via reader and assigning it this way?)?
edit:
as some brought in HTMLAgilityPack, i was curious about this nice tool. i accomplished getting content of the <h1>-tag.
but ... my problem is parsing the rest. this is caused by: the tags for the content may vary - from <p> to <div> and <ul>...
atm this seems to mean more or less iterating over the whole document and parsing it tag by tag ...?
any hints?

You will really need an HTML parser for this.

Don't use regex to parse HTML. Consider using the HTML Agility Pack.
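For the layout in the question (a heading followed by content tags that may vary, separated by <hr>), one way to use the HTML Agility Pack is to walk the top-level nodes in document order and start a new group at every heading. This is only a minimal sketch, assuming the headings are direct children of the parsed fragment as in the sample; SectionSplitter and Split are hypothetical names, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack; // third-party package: HtmlAgilityPack

static class SectionSplitter
{
    // Returns one (caption, body HTML) pair per <h1>/<h2> heading.
    // Assumption: headings sit at the top level of the fragment.
    public static List<KeyValuePair<string, string>> Split(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sections = new List<KeyValuePair<string, string>>();
        string caption = null;
        var body = new StringBuilder();

        foreach (HtmlNode node in doc.DocumentNode.ChildNodes)
        {
            if (node.Name == "h1" || node.Name == "h2")
            {
                // A new heading closes the previous group.
                if (caption != null)
                    sections.Add(new KeyValuePair<string, string>(caption, body.ToString().Trim()));
                caption = node.InnerText.Trim();
                body.Length = 0;
            }
            else if (node.Name != "hr" && caption != null)
            {
                // Any other tag (<p>, <div>, <ul>, ...) belongs to the current group.
                body.Append(node.OuterHtml);
            }
        }

        if (caption != null)
            sections.Add(new KeyValuePair<string, string>(caption, body.ToString().Trim()));
        return sections;
    }

    static void Main()
    {
        string html =
            "<h1>My caption</h1><p>Here will be some text</p><hr class=\"cs\" />" +
            "<h2 id=\"x\">CaptionX</h2><div>Some text</div><hr class=\"cs\" />" +
            "<h2 id=\"x\">CaptionY</h2><ul><li>Some text</li></ul>";

        foreach (var section in Split(html))
            Console.WriteLine(section.Key + " => " + section.Value);
    }
}
```

Because the loop only looks at node names, it does not matter whether the content after a heading is a <p>, a <div> or a <ul>.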

There are some possibilities:
REGEX - Fast but not reliable; it can't deal with malformed HTML.
HtmlAgilityPack - Good, but it has many memory leaks. If you only need to deal with a few files, that is no problem.
SGMLReader - Really good, but there is a problem: sometimes it can't find the default namespace needed to get the other nodes, and then it is impossible to parse the HTML.
http://developer.mindtouch.com/SgmlReader
Majestic-12 - Good, but not as fast as SGMLReader.
http://www.majestic12.co.uk/projects/html_parser.php
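On the regex option: the pattern in the question returns only every other subcaption because regex matches cannot overlap, so the trailing <hr> consumed by one match is no longer available as the leading <hr> of the next. A lookahead avoids consuming it. The following is a rough sketch that only holds for well-formed input shaped like the sample; SubcaptionRegexDemo and FindSections are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

static class SubcaptionRegexDemo
{
    // Group 1: caption text, group 2: everything up to (but not including)
    // the next <hr> or the end of the input.
    static readonly Regex Section = new Regex(
        @"<h2[^>]*>(.*?)</h2>\s*([\s\S]*?)(?=<hr\b|\z)");

    public static MatchCollection FindSections(string html)
    {
        return Section.Matches(html);
    }

    static void Main()
    {
        string html =
            "<h1>My caption</h1>\n<p>Here will be some text</p>\n<hr class=\"cs\" />\n" +
            "<h2 id=\"x\">Caption1</h2>\n<p>Text 1</p>\n<hr class=\"cs\" />\n" +
            "<h2 id=\"x\">Caption2</h2>\n<p>Text 2</p>\n<hr class=\"cs\" />\n" +
            "<h2 id=\"x\">Caption3</h2>\n<p>Text 3</p>";

        foreach (Match m in FindSections(html))
            Console.WriteLine(m.Groups[1].Value + " => " + m.Groups[2].Value.Trim());
    }
}
```

This still breaks on nested or malformed markup, which is why a parser is the safer choice.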
Example for SGMLReader (VB.NET):
' Needs Imports System.Xml.Linq and a reference to the SgmlReader library
Dim sgmlReader As New Sgml.SgmlReader()
sgmlReader.DocType = "HTML"
sgmlReader.WhitespaceHandling = System.Xml.WhitespaceHandling.All
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower
' vSource holds the HTML text
sgmlReader.InputStream = New System.IO.StringReader(vSource)
Dim htmldoc As XDocument = XDocument.Load(sgmlReader)
Dim XNS As XNamespace
' This part can fail: sometimes it can't get the default namespace
Try
    XNS = htmldoc.Root.GetDefaultNamespace()
Catch
    XNS = "http://www.w3.org/1999/xhtml"
End Try
If XNS.NamespaceName.Trim() = "" Then
    XNS = "http://www.w3.org/1999/xhtml"
End If
' Use it with the LINQ commands
Dim Scripts As String = ""
For Each link In htmldoc.Descendants(XNS + "script")
    Scripts &= link.Value
Next
Majestic-12 is different: you have to walk to every tag with a "Next" command. You can find example code with the DLL.

As others have mentioned, use the HtmlAgilityPack. However, if you like jQuery/CSS selectors, I just found a fork of the HtmlAgilityPack called Fizzler:
http://code.google.com/p/fizzler/
Using this you could find all <p> tags using:
var pTags = doc.DocumentNode.QuerySelectorAll("p").ToList();
Or find a specific div like <div id="myDiv"></div>:
var myDiv = doc.DocumentNode.QuerySelectorAll("#myDiv");
It can't get any easier than that!


Grab text value using HTML Agility Pack

Please check the code below. I am trying to grab a text value from this HTML doc. I want to grab the text Quick Kill 32 oz. Mosquito Yard Spray. I already tried to do it using SelectSingleNode as shown below, but this can't grab the text value. Any idea how to fix it?
string html = @"<div class='pod-plp__description js-podclick-analytics' data-podaction='product name'>
<a class='' data-pos='0' data-request-type='sr' data-pod-type='pr' href='/p/AMDRO-Quick-Kill-32-oz-Mosquito-Yard-Spray-100530440/304755303'>
<span class='pod-plp__brand-name'>AMDRO</span>
Quick Kill 32 oz. Mosquito Yard Spray
</a>
</div>";
var doc = new HtmlDocument();
doc.Load(html);
string title = doc.DocumentNode
.SelectSingleNode("//div[@class='pod-plp__description js-podclick-analytics']span[@class='pod-plp__brand-name']")
.InnerText;
You are targeting only span[@class='pod-plp__brand-name'], which returns only the text inside the span; you need following-sibling::text() to grab the text after the span. Please see my example code below. You can also learn more from the Html Agility Pack official site.
var titleNode = doc.DocumentNode.SelectSingleNode("//span[@class='pod-plp__brand-name']/following-sibling::text()[1]");
string title = titleNode.InnerText.Trim();

HTML Agility Pack Node Selection

I'm brand new to HTML Agility Pack (as well as network-based programming in general). I am trying to extract a specific line of HTML, but I don't know enough about HTML Agility Pack's syntax to understand what I'm not writing correctly (and am lost in their documentation). URLs here are modified.
string html;
using (WebClient client = new WebClient())
{
html = client.DownloadString("https://google.com/");
}
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
foreach (HtmlNode img in doc.DocumentNode.SelectNodes("//div[@class='ngg-gallery-thumbnail-box']//div[@class='ngg-gallery-thumbnail']//a"))
{
Debug.Log(img.GetAttributeValue("href", null));
}
return null;
This is what the HTML looks like
<div id="ngg-image-3" class="ngg-gallery-thumbnail-box" >
<div class="ngg-gallery-thumbnail">
<a href="https://urlhere.png"
// More code here
</a>
</div>
</div>
The problem occurs on the foreach line. I've tried matching examples online the best I can but am missing it. TIA.
HtmlAgilityPack uses XPath syntax to query nodes; HAP effectively exposes the HTML document as an XML-like tree. So the trick is learning XPath querying so you can get the right combination of tags and attributes for the result you need.
The HTML snippet you pasted isn't well formed (there's no closing > on the anchor tag). Assuming that it is closed, then
//div[@class='ngg-gallery-thumbnail-box']//div[@class='ngg-gallery-thumbnail']//a[@href]
will return an HtmlNodeCollection of only those a tags that have href attributes.
If none meet your criteria, SelectNodes returns null rather than an empty collection, so the foreach throws a NullReferenceException; check the result for null before iterating.
For debugging purposes, perhaps log the node count or OuterHtml of a less specific query to see what you're getting, e.g.
Debug.Log(doc.DocumentNode.SelectNodes("//div[@class='ngg-gallery-thumbnail-box']//div[@class='ngg-gallery-thumbnail']")[0].OuterHtml);
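Since SelectNodes hands back null when nothing matches, a guard before the loop keeps the foreach from failing. A small sketch of that pattern follows; LinkLister and GetThumbnailLinks are hypothetical names, and the XPath is just the one from the question with the [@href] filter added.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package: HtmlAgilityPack

static class LinkLister
{
    // Collects the href of every anchor inside a thumbnail div.
    public static List<string> GetThumbnailLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes(
            "//div[@class='ngg-gallery-thumbnail-box']" +
            "//div[@class='ngg-gallery-thumbnail']//a[@href]");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (anchors == null)
            return links;

        foreach (HtmlNode a in anchors)
            links.Add(a.GetAttributeValue("href", string.Empty));
        return links;
    }

    static void Main()
    {
        string html =
            "<div id=\"ngg-image-3\" class=\"ngg-gallery-thumbnail-box\">" +
            "<div class=\"ngg-gallery-thumbnail\">" +
            "<a href=\"https://urlhere.png\">thumb</a></div></div>";

        foreach (string link in GetThumbnailLinks(html))
            Console.WriteLine(link);
    }
}
```

If the list comes back empty against the real page, the class names in the downloaded HTML probably differ from what the browser shows, for example because the gallery is built by script.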

Unable to build a regex to match the article tag

I have been trying to create a regex to match the article tag and get all the text.
Here is my article tag:
<article id="post-82" class="post-82 post type-post status-publish format-standard hentry category-publishing">
<div class="entry-content clearfix">
<div class="abh_box abh_box_up abh_box_drop-down"><ul class="abh_tabs"> <li class="abh_about abh_active">
<p>With India playing host,</p>
<footer class="entry-meta-bar clearfix"><div class="entry-meta clearfix">
<span class="comments">No Comments</span>
</div></footer>
</article>
I need everything which is inside the article tag. So far I have tried the following regexes:
<article (.*?)</article>
(?:<article>)(.*?)(?:</article>)
None of them works. Please help.
Don't use regex for parsing HTML. Use an HTML parser like Html Agility Pack:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlContent);
// "//article" searches the whole document; SelectSingleNode returns null if there is no match
var result = doc.DocumentNode.SelectSingleNode("//article");
You don't want to use regex for something like this and you don't need to load an XML parser. Just use .getAttribute("innerHTML") on the element you want the contained HTML for.
For example, this gets only the article element in your supplied HTML by ID.
System.out.println(driver.findElement(By.id("post-82")).getAttribute("innerHTML"));
This gets the HTML for all articles on the page.
for (WebElement article : driver.findElements(By.tagName("article")))
{
System.out.println(article.getAttribute("innerHTML"));
}
You can try this regex:
<article\b[^>]*>((.|\n)*?)<\/article>
https://regex101.com/r/oOJ9bt/2
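If a regex is used anyway, the two attempts in the question most likely fail for two reasons: by default . does not match newlines, and the second pattern expects a bare <article> with no attributes. Here is a hedged C# sketch of the single-line variant; ArticleRegexDemo and ExtractArticle are hypothetical names, and it assumes articles are not nested.

```csharp
using System;
using System.Text.RegularExpressions;

static class ArticleRegexDemo
{
    // Returns the inner HTML of the first <article> element, or null if none is found.
    public static string ExtractArticle(string html)
    {
        Match m = Regex.Match(
            html,
            @"<article\b[^>]*>(.*?)</article>",
            // Singleline lets '.' match newlines as well.
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        string html =
            "<article id=\"post-82\" class=\"post-82 post\">\n" +
            "<p>With India playing host,</p>\n" +
            "</article>";
        Console.WriteLine(ExtractArticle(html));
    }
}
```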

How to get an element using c#

I'm new with C#, and I'm trying to access an element from a website using webBrowser. I wondered how can I get the "Developers" string from the site:
<div id="title" style="display: block;">
<b>Title:</b> **Developers**
</div>
I tried to use webBrowser1.Document.GetElementById("title") ,but I have no idea how to keep going from here.
Thanks :)
You can download the page source using the WebClient class,
then look within it for <b>Title:</b>**Developers**</div> and strip everything except "Developers".
HtmlAgilityPack and CsQuery are what many people use to work with HTML pages in .NET, and I'd recommend them too.
But if your task is limited to this simple requirement, and you have <div> markup that is valid XHTML (like the sample you posted), then you can treat it as XML. That means you can use a native .NET API such as XDocument or XmlDocument to parse the HTML and run an XPath query to get a specific part of it, for example:
var xml = @"<div id=""title"" style=""display: block;""> <b>Title:</b> Developers</div>";
//or according to your code snippet, you may be able to do as follow :
//var xml = webBrowser1.Document.GetElementById("title").OuterHtml;
var doc = new XmlDocument();
doc.LoadXml(xml);
var text = doc.DocumentElement.SelectSingleNode("//div/b/following-sibling::text()");
Console.WriteLine(text.InnerText);
//above prints " Developers"
The XPath above selects the text node ("Developers") next to the <b> node.
You can use HtmlAgilityPack (As mentioned by Giannis http://htmlagilitypack.codeplex.com/). Using a web browser control is too much for this task:
HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("http://www.google.com");
var el = doc.GetElementbyId("title");
string s = el.InnerHtml; // get the : <b>Title:</b> **Developers**
I haven't tried this code but it should be very close to working.
HtmlAgilityPack has an InnerText property as well, allowing you to do this:
string s = el.InnerText; // get the : Title: **Developers**
You can also remove the Title: by removing the appropriate node:
el.SelectSingleNode(".//b").Remove(); // the leading "." keeps the search inside el
string s = el.InnerText; // get the : **Developers**
If for some reason you want to stick to the web browser control, I think you can do this:
var el = webBrowser1.Document.GetElementById("title");
string s = el.InnerText; // get the : Title: **Developers**
UPDATE
Note that the argument passed to SelectSingleNode above is XPath syntax, which may be interesting for you to learn:
http://www.w3schools.com/XPath/xpath_syntax.asp
http://www.freeformatter.com/xpath-tester.html

Explicit Element Closing Tags with System.Xml.Linq Namespace

I am using the (.NET 3.5 SP1) System.Xml.Linq namespace to populate an html template document with div tags of data (and then save it to disk). Sometimes the div tags are empty and this seems to be a problem when it comes to HTML. According to my research, the DIV tag is not self-closing. Therefore, under Firefox at least, a <div /> is considered an opening div tag without a matching closing tag.
So, when I create new div elements by declaring:
XElement divTag = new XElement("div");
How can I force the generated XML to be <div></div> instead of <div /> ?
I'm not sure why you'd end up with an empty DIV (seems a bit pointless!) But:
divTag.SetValue(string.Empty);
Should do it.
With
XElement divTag = new XElement("div", String.Empty);
you get the explicit closing tag.
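The difference is easy to check by printing both forms. The sketch below also includes a small helper that expands already-built empty elements of one name; EmptyElementDemo, Render and ExpandEmpty are hypothetical names used only for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class EmptyElementDemo
{
    public static string Render(XElement element)
    {
        return element.ToString(SaveOptions.DisableFormatting);
    }

    // Gives every empty element called 'name' an empty text value,
    // so it is written as <name></name> instead of <name />.
    public static void ExpandEmpty(XElement root, string name)
    {
        // ToList() so the tree is not modified while it is being enumerated.
        foreach (XElement e in root.DescendantsAndSelf(name).Where(x => x.IsEmpty).ToList())
            e.Value = string.Empty;
    }

    static void Main()
    {
        Console.WriteLine(Render(new XElement("div")));               // <div />
        Console.WriteLine(Render(new XElement("div", string.Empty))); // <div></div>

        var body = new XElement("body", new XElement("div"), new XElement("br"));
        ExpandEmpty(body, "div");
        Console.WriteLine(Render(body)); // <body><div></div><br /></body>
    }
}
```

Restricting the helper to one element name leaves void elements such as <br /> in their self-closing form.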
I don't know the answer to your question using LINQ. But there is a project called HTML Agility Pack on CodePlex that lets you create and manipulate HTML documents much like the way you manipulate XML documents using the System.Xml namespace classes.
I did this, and it works as expected:
myXml = new XElement("script", new XAttribute("src", "value"));
myXml.Value = "";
This gives the following result:
<script src="value"></script>
