C# HtmlAgilityPack: parse first link from div with class?

I am trying to extract the first link, /search?id=3, from the HTML below:
<div class="brs_col">
<p>
<a href="/search?id=3">
<b>
vastu shastra
</b>
</a>
</p>
<p>
<a href="/search?id=1">
<b>
bygga
</b>
bastu
</a>
</p>
</div>
I've tried to select it with the following XPath expressions, but can't get any of them to work:
//div[@class='brs_col']//p//a[@href]
//div[@class='brs_col']//p[0]//a[@href]
//div[@class='brs_col']//p//a[0][@href]
Any ideas?

Try this:
var doc = new HtmlDocument();
doc.LoadHtml(@"<div class=""brs_col"">
<p><a href=""/search?id=3""><b>vastu shastra</b></a></p>
<p><a href=""/search?id=1""><b>bygga</b>bastu</a></p>
</div>");
// SelectSingleNode returns the first match in document order
var hrefValue = doc.DocumentNode
.SelectSingleNode("//div[@class='brs_col']/p/a")
.Attributes["href"]
.Value;

You can try this (FirstOrDefault requires using System.Linq):
doc.DocumentNode.SelectNodes("//a[@href]").FirstOrDefault();

Use this if you are sure it is the first URL in the whole HTML document:
doc.DocumentNode.SelectSingleNode("//a").Attributes["href"].Value;
Or this if you are sure it is the first URL inside the div with class brs_col:
doc.DocumentNode.SelectSingleNode("//div[@class='brs_col']//a").Attributes["href"].Value;
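The attempts in the question use [0], but XPath positions are 1-based, so a [0] predicate never matches. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package is installed; the class name FirstLinkDemo is made up for illustration. It wraps the path in parentheses so that [1] applies to the whole result set, and it guards against the null that SelectSingleNode returns when nothing matches.

```csharp
using System;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is referenced

class FirstLinkDemo // hypothetical name, for illustration only
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(@"<div class=""brs_col"">
<p><a href=""/search?id=3""><b>vastu shastra</b></a></p>
<p><a href=""/search?id=1""><b>bygga</b>bastu</a></p>
</div>");

        // XPath positions start at 1; the parentheses make [1] pick the
        // first node of the whole result set, not the first <a> of each <p>.
        HtmlNode link = doc.DocumentNode.SelectSingleNode(
            "(//div[@class='brs_col']//a[@href])[1]");

        // SelectSingleNode returns null when nothing matches, so avoid
        // dereferencing the result directly.
        string href = link?.GetAttributeValue("href", string.Empty);
        Console.WriteLine(href ?? "(no link found)");
    }
}
```

Without the parentheses, //p//a[1] would select the first a under each parent element, which is a different set of nodes.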

Related

HTMLAgilityPack C#, How to extract text from nested Tags in DIV

I have this HTML code where I want to extract the date from:
<div id="footer">
<div style="font-size:smaller">
Added in:
<strong>
07/06/2021 2:15:36 PM
</strong>
</div>
</div>
This is my C# HtmlAgilityPack code:
doc.DocumentNode.SelectSingleNode("//div[@id='footer']").InnerText
doc.DocumentNode.SelectSingleNode("//div[@id='footer']/div/strong").InnerText
Update:
Full code:
var html ="<div id=\"footer\"><div style=\"font-size:smaller\"> Added in:<strong> 07/06/2021 2:15:36 PM </strong></div></div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var time = doc.DocumentNode.SelectSingleNode("//div[@id='footer']/div/strong").InnerText;
With this I was able to extract the date.

Html agility pack Addressing

In this HTML:
<div class="contacts-list">
<h4 class="title">Contact</h4>
<div class="contact-phone">
<span class="icon"><i class="ee-phone"></i></span><span class="type">تلفن</span>
<span class="contact-data">
<a dir='auto' href='tel:05138946697'>05138946697</a> </span>
</div>
I have to extract the value of the "a" tag, but I must be sure it is inside a "div" tag with the "contact-phone" class.
I don't really understand how to do this. Can someone help me?
In the end I got the value I need like this, using HTML Agility Pack and XPath:
foreach (HtmlNode node in htmlDocument.DocumentNode.SelectNodes("//div[@class='contact-phone']/span[@class='contact-data']/a"))
{
value = node.InnerText;
}

fetching span value from html document

I have the following XPath, obtained with a Firefox XPath plugin:
id('some_id')/x:ul/x:li[4]/x:span
Using HTML Agility Pack I'm able to fetch id('some_id')/x:ul/x:li[4]:
htmlDoc.DocumentNode.SelectNodes(@"//div[@id='some_id']/ul/li[4]").FirstOrDefault();
but I don't know how to get the span value.
Update:
<div id="some_id">
<ul>
<li></li>
<li></li>
<li></li>
<li>
Some text
<span>text I want to grab</span>
</li>
</ul>
</div>
You don't need to parse the HTML with LINQ to XML. HtmlAgilityPack is made for this, and it is easier to get the node this way:
var html = @" <div id=""some_id"">
<ul>
<li></li>
<li></li>
<li></li>
<li>
Some text
<span>text I want to grab</span>
</li>
</ul>
</div>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var value = doc.DocumentNode.SelectSingleNode("div[@id='some_id']/ul/li/span").InnerText;
Console.WriteLine(value);
An alternative approach (without html-agility-pack) would be to use LINQ to XML. You can use the XDocument.Descendants method to find the span element and read its value:
var xml = @" <div id=""some_id"">
<ul>
<li></li>
<li></li>
<li></li>
<li>
Some text
<span>text I want to grab</span>
</li>
</ul>
</div>";
var doc = XDocument.Parse(xml);
Console.WriteLine(doc.Root.Descendants("span").FirstOrDefault().Value);
The code can be extended to check that the div element has the matching id, using the XElement.Attribute method (the string cast yields null instead of throwing when the attribute is missing):
var doc = XDocument.Parse(xml);
Console.WriteLine(doc.Elements("div").Where(e => (string)e.Attribute("id") == "some_id").Descendants("span").FirstOrDefault().Value);
One drawback of this solution is that the XML structure (HTML, XHTML) needs to be properly closed or else the parsing will fail.
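To make that drawback concrete, here is a small sketch using only System.Xml.Linq. The markup is an assumed variation of the example above with an unclosed br tag added, which is acceptable HTML but not well-formed XML; the class name is hypothetical.

```csharp
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

class WellFormedDemo // hypothetical name, for illustration only
{
    static void Main()
    {
        // Assumed input: the unclosed <br> is fine for browsers,
        // but it makes the string invalid as XML.
        var html = "<div id=\"some_id\"><ul><li>Some text<br>" +
                   "<span>text I want to grab</span></li></ul></div>";
        try
        {
            var doc = XDocument.Parse(html);
            Console.WriteLine(doc.Root.Descendants("span").First().Value);
        }
        catch (XmlException ex)
        {
            // XDocument.Parse throws XmlException on malformed markup.
            Console.WriteLine("Not well-formed XML: " + ex.Message);
        }
    }
}
```

An HTML parser such as HtmlAgilityPack is generally more tolerant of this kind of markup, which is why it is the safer choice for pages you do not control.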

Linq to XML - Render CDATA as HTML

I have the following XML:
<stories>
<story id="1234">
<title>This is a title</title>
<date>1/1/1980</date>
<article>
<![CDATA[<p>This is an article.</p>]]>
</article>
</story>
</stories>
And the following Linq to XML code in C#:
@{
XDocument xmlDoc = XDocument.Load("foo.xml");
var stories = from story in xmlDoc.Descendants("stories")
.Descendants("story")
.OrderByDescending(s => (string)s.Attribute("id"))
select new
{
title = story.Element("title").Value,
date = story.Element("date").Value,
article = story.Element("article").Value,
};
foreach (var story in stories)
{
<text><div class="news_item">
<span class="title">@story.title</span>
<span class="date">@story.date</span>
<div class="story">@story.article</div>
</div></text>
}
}
The rendered HTML is output to the browser as:
<div class="news_item">
<span class="title">This is a title</span>
<span class="date">1/1/1980</span>
<div class="story">&lt;p&gt;This is an article.&lt;/p&gt;</div>
</div>
I want the <p> tag rendered as HTML to the browser, not encoded. How do I accomplish this?
Razor HTML-encodes values by default. Use the Html.Raw helper to output the markup unencoded (see "Html.Raw() in ASP.NET MVC Razor view"), and only do so for content you trust:
<div class="story">@Html.Raw(story.article)</div>

c# - reading HTML?

I'm developing a program in C# and need some help. I'm trying to create an array or a list of the items displayed on a certain website. What I want to read is each anchor's text and its href. For example, this is the HTML:
<div class="menu-1">
<div class="items">
<div class="minor">
<ul>
<li class="menu-item">
<a class="menu-link" title="Item-1" id="menu-item-1"
href="/?item=1">Item 1</a>
</li>
<li class="menu-item">
<a class="menu-link" title="Item-1" id="menu-item-2"
href="/?item=2">Item 2</a>
</li>
<li class="menu-item">
<a class="menu-link" title="Item-1" id="menu-item-3"
href="/?item=3">Item 3</a>
</li>
<li class="menu-item">
<a class="menu-link" title="Item-1" id="menu-item-4"
href="/?item=4">Item 4</a>
</li>
<li class="menu-item">
<a class="menu-link" title="Item-1" id="menu-item-5"
href="/?item=5">Item 5</a>
</li>
</ul>
</div>
</div>
</div>
So from that HTML I would like to read this:
string[,] array = {{"Item 1", "/?item=1"}, {"Item 2", "/?item=2"},
{"Item 3", "/?item=3"}, {"Item 4", "/?item=4"}, {"Item 5", "/?item=5"}};
The HTML is an example I had written, the actual site does not look like that.
As others said, HtmlAgilityPack is the best choice for HTML parsing. Also be sure to download HAP Explorer from the HtmlAgilityPack site and use it to test your selections. This SelectNodes call gets all anchors whose id starts with menu-item-:
HtmlDocument doc = new HtmlDocument();
doc.Load(htmlFile);
var myNodes = doc.DocumentNode.SelectNodes("//a[starts-with(@id,'menu-item-')]");
foreach (HtmlNode node in myNodes)
{
Console.WriteLine(node.Id);
}
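The loop above prints only each node's id, while the question asks for the anchor text and the href. A possible extension is sketched below, assuming the HtmlAgilityPack NuGet package and a shortened copy of the question's markup; the class name and the tuple list are illustrative choices, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is referenced

class MenuLinksDemo // hypothetical name, for illustration only
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(@"<ul>
<li class=""menu-item""><a class=""menu-link"" id=""menu-item-1"" href=""/?item=1"">Item 1</a></li>
<li class=""menu-item""><a class=""menu-link"" id=""menu-item-2"" href=""/?item=2"">Item 2</a></li>
</ul>");

        var links = new List<(string Text, string Href)>();
        var nodes = doc.DocumentNode.SelectNodes("//a[starts-with(@id,'menu-item-')]");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
            {
                links.Add((node.InnerText.Trim(),
                           node.GetAttributeValue("href", string.Empty)));
            }
        }

        foreach (var link in links)
        {
            Console.WriteLine($"{link.Text} -> {link.Href}");
        }
    }
}
```

A list of tuples is used here instead of the string[,] from the question because the number of links is not known in advance.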
If the HTML is valid XML, you can load it using the XmlDocument class and then access the pieces you want using XPath, or you can use an XmlReader as Adriano suggests (a bit more work).
If the HTML is not valid XML, I'd suggest using an existing HTML parser (see for example this one, which worked OK for us).
You can also use the HTML Agility Pack.
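I think this case is simple enough for a regular expression, like <a[^>]*title="([^"]*)"[^>]*href="([^"]*)" (with [^>]* instead of .* so the match stays inside one tag and can span the line break before href):
string strRegex = @"<a[^>]*title=""([^""]*)""[^>]*href=""([^""]*)""";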
RegexOptions myRegexOptions = RegexOptions.None;
Regex myRegex = new Regex(strRegex, myRegexOptions);
string strTargetString = ...;
foreach (Match myMatch in myRegex.Matches(strTargetString))
{
if (myMatch.Success)
{
// Use the groups matched
}
}
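For completeness, a self-contained sketch of the regex route that captures the href and the anchor text the question asks for. The pattern is an assumption tailored to the sample markup (double-quoted attributes, no nested tags inside the anchor) and is not a general HTML parser; the class name is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

class RegexLinksDemo // hypothetical name, for illustration only
{
    static void Main()
    {
        // Shortened copy of the question's markup, including the
        // line break between the id and href attributes.
        string html = @"<li class=""menu-item"">
<a class=""menu-link"" title=""Item-1"" id=""menu-item-1""
href=""/?item=1"">Item 1</a>
</li>
<li class=""menu-item"">
<a class=""menu-link"" title=""Item-1"" id=""menu-item-2""
href=""/?item=2"">Item 2</a>
</li>";

        // [^>]* keeps the match inside a single opening tag and also
        // matches newlines; (.*?) lazily captures the anchor text.
        var regex = new Regex(
            @"<a\b[^>]*\bhref=""([^""]*)""[^>]*>(.*?)</a>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        foreach (Match match in regex.Matches(html))
        {
            string href = match.Groups[1].Value;
            string text = match.Groups[2].Value.Trim();
            Console.WriteLine($"{text} -> {href}");
        }
    }
}
```

This breaks down as soon as attributes use single quotes or the anchor contains nested markup, so for real pages an HTML parser remains the more reliable option.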
