Selecting the next element using HTML Agility Pack - c#

I am using HTML Agility Pack and searching for div with class="fileHeader" that has "RelayClinical Patient Education with Animations Install zip" in a child h4 element. Once found, I want to capture the "href" attribute inside the anchor tag of that particular block. How can I get it?
HTML Source
<div class="fileHeader" id="fileHeader_7311111">
<h4 class="collapsed">RelayClinical Patient Education with Animations Install zip</h4>
<div class="defaultMethod">
<a class="buttonGrey" href="https://mckc-esd.subscribenet.com/cgi-bin/download?rid=2511740931&rp=DTM20130905162949MzcyODIwNjM0" title="Clicking this link will open a new window." rel="noreferrer">
HTTPS Download
</a>
</div>
</div>
Code
HtmlNodeCollection fileHeaderNodes = bodyNode.SelectNodes("//div[@class='fileHeader']//h4");
foreach (HtmlNode fileHeader in fileHeaderNodes)
{
    if (fileHeader.InnerText.Trim() == "RelayClinical Patient Education with Animations Install zip")
    {
        foreach (HtmlNode link in fileHeader.SelectNodes("//a[@href]"))
        {
            // extract the link and put in dataUrl var
            if ((link.InnerText.Trim() == "HTTPS Download") && isFound == true)
            {
                count++;
                // select all a tags (html anchor tags) that have a href attribute
                HtmlAttribute att = link.Attributes["href"];
                dataUrl = att.Value;
            }
        }
    }
}

Rather than selecting the h4 element, select the a element directly. Then you can grab the href attribute.
var h4Text = "RelayClinical Patient Education with Animations Install zip";
var xpath = String.Format(
    "//div[@class='fileHeader' and h4='{0}']/div[@class='defaultMethod']/a",
    h4Text
);
var anchor = doc.DocumentNode.SelectSingleNode(xpath);
if (anchor != null)
{
    var attr = anchor.GetAttributeValue("href", null);
    // do stuff with attr
}
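The attempt in the question also has a second problem: an XPath that begins with // is evaluated from the document root even when it is called on a child node, so fileHeader.SelectNodes("//a[@href]") returns every link on the page rather than the links near that h4. A minimal sketch of a relative alternative, which starts at the matched h4 and steps to its following sibling div, might look like this. The names SiblingLinkFinder and FindHref, the sample markup, and the example.com URLs are illustrative assumptions and not part of the original answer.

```csharp
using System;
using HtmlAgilityPack;

static class SiblingLinkFinder
{
    // Returns the href of the download link that follows the h4 whose
    // trimmed text equals `title`, or null when nothing matches.
    public static string FindHref(string html, string title)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var headings = doc.DocumentNode.SelectNodes("//div[@class='fileHeader']/h4");
        if (headings == null) return null;

        foreach (var h4 in headings)
        {
            if (h4.InnerText.Trim() != title) continue;

            // No leading "//": the path is relative to this h4, so only the
            // sibling div of the matched heading is searched.
            var anchor = h4.SelectSingleNode(
                "following-sibling::div[@class='defaultMethod']/a[@href]");
            if (anchor != null) return anchor.GetAttributeValue("href", null);
        }
        return null;
    }
}

static class Program
{
    static void Main()
    {
        // Hypothetical sample markup with two blocks of the same shape.
        const string html = @"
<div class='fileHeader'><h4 class='collapsed'>First file</h4>
  <div class='defaultMethod'><a class='buttonGrey' href='https://example.com/first'>HTTPS Download</a></div></div>
<div class='fileHeader'><h4 class='collapsed'>Second file</h4>
  <div class='defaultMethod'><a class='buttonGrey' href='https://example.com/second'>HTTPS Download</a></div></div>";

        Console.WriteLine(SiblingLinkFinder.FindHref(html, "Second file"));
    }
}
```

Comparing the trimmed InnerText in C# rather than inside the XPath keeps the title out of the query string, so a title containing a quote character does not break the expression.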

Related

Suppress Tags from HTML with AgilityHTMLPack

I need help because I am not really used to working with HTML. My code displays a web document that is read from an HTML file containing some images.
Every time, just before the image tag, I see two tags that produce some wrong characters. An example will make this clearer.
<p ><br clear=all> </span>
<img border=0 width=265 height=105 id="Picture 84856"
src="Test_HTML/image272.jpg"></p>
The output is partially correct: it shows the images, but also a lot of wrong ÂÂÂÂÂÂÂÂÂ characters.
So I decided to try to remove the tags.
I don't know how to do this. Perhaps I am completely wrong, but I think it is a good start, isn't it?
My attempt to remove these tags from an HTML node is:
public void ShowTag(string tag)
{
    string innerHtml = "//div[@id='" + tag + "']";
    string inner = "//p";
    string brToRemove = "//br";
    string spanToRemove = "//span";
    var nodes = document.DocumentNode.SelectSingleNode(innerHtml);
    bool br_deleted = false;
    foreach (HtmlNode nd in nodes.SelectNodes(inner))
    {
        foreach (HtmlNode child in nd.ChildNodes)
        {
            if (child.Name == "br")
            {
                int a = 0;
                a++;
                child.ParentNode.RemoveChild(child);
                br_deleted = true;
            }
            if (child.Name == "span")
            {
                int b = 0;
                b++;
                if (br_deleted == true)
                {
                    //nd.ParentNode.RemoveChild(child);
                    child.Remove();
                    br_deleted = false;
                }
            }
        }
    }
}
But I cannot remove the child. Do you have any idea?
I found where the problem came from: when selecting the right node, I needed to add the document head as well, so that the encoding could be identified.
string innerHtml = "//div[@id='" + tag + "']";
string inner = "//p";
webbrowser.Navigate("about:blank");
LoadDocument();
HtmlNode nodes = document.DocumentNode.SelectSingleNode(innerHtml);
HtmlNode head = document.DocumentNode.SelectSingleNode("/html/head");
head.AppendChild(nodes);
webbrowser.NavigateToString(head.InnerHtml);
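Separately from the encoding fix, the loop in the question removes nodes from ChildNodes while foreach is still enumerating that same collection, which typically invalidates the enumerator, and SelectNodes("//p") searches the whole document rather than only the selected div. A minimal sketch that avoids both issues is given here: it iterates over a snapshot made with ToList() and uses the relative path .//p. The names TagStripper and StripBrAndSpan and the sample markup are hypothetical, and removing every br and span child is a simplification of the original br-then-span logic.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class TagStripper
{
    // Removes <br> and <span> children from every <p> inside the div with
    // the given id and returns the resulting markup of that div.
    public static string StripBrAndSpan(string html, string divId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var div = doc.DocumentNode.SelectSingleNode("//div[@id='" + divId + "']");
        if (div == null) return null;

        // ".//p" is relative to the div; "//p" would search the whole document.
        var paragraphs = div.SelectNodes(".//p");
        if (paragraphs == null) return div.OuterHtml;

        foreach (var p in paragraphs)
        {
            // ToList() copies the children, so removing nodes does not
            // modify the collection that is being enumerated.
            foreach (var child in p.ChildNodes.ToList())
            {
                if (child.Name == "br" || child.Name == "span")
                {
                    child.Remove();
                }
            }
        }
        return div.OuterHtml;
    }
}

static class Program
{
    static void Main()
    {
        // Hypothetical sample markup loosely modelled on the question.
        const string html =
            "<html><body><div id='t'><p><br clear='all'><span>x</span>" +
            "<img src='Test_HTML/image272.jpg'></p></div></body></html>";

        Console.WriteLine(TagStripper.StripBrAndSpan(html, "t"));
    }
}
```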

Get element by class name via browser C#

I have:
<h2 class="entry-title" itemprop="headline">
<a href="http://www.printesaurbana.ro/2015/10/idei-despre-un-start-bun-in-blogging.html">Idei despre un start bun în blogging </a>
</h2>
I want to get the href using the class name. I tried:
if (webBrowser1.Document != null)
{
    var links = webBrowser1.Document.GetElementsByTagName("a");
    foreach (HtmlElement link in links)
    {
        if (link.GetAttribute("class") == "entry-title")
        {
            MessageBox.Show("Here");
        }
    }
}
But it didn't work. How can I solve this?
You should use link.GetAttribute("className"). Besides, it is the h2 tag in your HTML document that has the entry-title class, not the a tag. Corrected code:
if (webBrowser1.Document != null)
{
    var links = webBrowser1.Document.GetElementsByTagName("h2");
    foreach (HtmlElement link in links)
    {
        if (link.GetAttribute("className") == "entry-title")
        {
            MessageBox.Show("Here");
        }
    }
}

HTML Agility pack with c#

C# code:
var node = new HtmlWeb();
var doc = node.Load("http://ask.fm/");
HtmlNode ournode = doc.DocumentNode.SelectSingleNode("//div[@id='heads']");
textBox1.Text = ournode.InnerHtml;
HTML code:
<div id="heads">
<img alt="" class="head" id="face_30132803" src="http://img3.ask.fm/assets2/103/548/655/872/thumb_tiny/IMG_20150513_192250.jpg" />
<img alt="" class="head" id="face_56578735" src="http://img1.ask.fm/assets2/091/364/883/712/thumb_tiny/11094711_919135961470973_149663457_njpg720960png1280963.png" />
I want to see the following in the text box
/sudenur3434
/leylaulucay
I have added an additional line to your code:
var node = new HtmlWeb();
var doc = node.Load("http://ask.fm/");
HtmlNode ournode = doc.DocumentNode.SelectSingleNode("//div[@id='heads']");
var val = ournode.Attributes["href"].Value;
textBox1.Text = val;
This would let you get the href attribute. Simply use the same code to get the other nodes' href values and then add them to your text box.
Since a text box is usually used for single lines, I am giving you an example that simply writes all links to the debug output window of Visual Studio.
If you use, for example, a ListBox instead of a text box, you can replace Debug.Print with ListBox1.Items.Add(href.Value).
This will give you the href URLs of all a children of the div with id="heads":
var site = new HtmlWeb();
var htmldoc = site.Load("http://ask.fm/");
var headDiv = htmldoc.DocumentNode.SelectSingleNode("//div[@id='heads']");
if (headDiv != null)
{
    var anchors = headDiv.SelectNodes("a");
    // SelectNodes returns null when no anchor matches
    if (anchors != null)
    {
        foreach (HtmlNode aNode in anchors)
        {
            var href = aNode.Attributes.AttributesWithName("href").FirstOrDefault();
            if (href != null)
                Debug.Print(href.Value);
        }
    }
}
< div id="heads" >
<img alt="" class="head" id="face_30132803" src="http://img3.ask.fm/assets2/103/548/655/872/thumb_tiny/IMG_20150513_192250.jpg" />
<a href="/leylaulucay" data-rlt-aid="welcome_head"><img alt="" class="head" id="face_5
How do I parse this with HTML Agility Pack and show it in the text box?

Xpath for <a></a> tag

Let's say this is my HTML code:
<a class="" data-tracking-id="0_Motorola"
href="/motorola?otracker=nmenu_sub_electronics_0_Motorola">
Motorola
</a>
I used C# code to find the href value like this
var tags = htmlDoc.DocumentNode.SelectNodes(
    "//div[@class='top-menu unit']//ul//li//div[@id='submenu_electronics']//a");
if (tags != null)
{
    foreach (var t in tags)
    {
        var name = t.InnerText.Trim();
        var url = t.Attributes["href"].Value;
    }
}
I am getting url='/motorola', but I need url='/motorola?otracker=nmenu_sub_electronics_0_Motorola'.
The text after the ? and & characters is not included. Please clarify where I went wrong.
I have used HtmlAgilityPack in the past and I have previously used it like this :
var url = t.GetAttributeValue("href","");
You can try that and see if it works.

html agility pack remove children

I'm having difficulty trying to remove a div with a particular ID, and its children, using the HTML Agility Pack. I am sure I'm just missing a config option, but it's Friday and I'm struggling.
The simplified HTML runs:
<html><head></head><body><div id='wrapper'><div id='functionBar'><div id='search'></div></div></div></body></html>
This is as far as I have got. The error thrown by the agility pack shows it cannot find a div structure:
<div id='functionBar'></div>
Here's the code so far (taken from Stack Overflow):
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
// There are various options, set as needed
//htmlDoc.OptionFixNestedTags = true;
// filePath is a path to a file containing the html
htmlDoc.LoadHtml(Html);
string output = string.Empty;
// ParseErrors is an ArrayList containing any errors from the Load statement
if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count > 0)
{
    // Handle any parse errors as required
}
else
{
    if (htmlDoc.DocumentNode != null)
    {
        HtmlAgilityPack.HtmlNode bodyNode = htmlDoc.DocumentNode.SelectSingleNode("//body");
        if (bodyNode != null)
        {
            HtmlAgilityPack.HtmlNode functionBarNode = bodyNode.SelectSingleNode("//div[@id='functionBar']");
            bodyNode.RemoveChild(functionBarNode, false);
            output = bodyNode.InnerHtml;
        }
    }
}
bodyNode.RemoveChild(functionBarNode,false);
But functionBarNode is not a child of bodyNode.
How about functionBarNode.ParentNode.RemoveChild(functionBarNode, false)? (And forget the bit about finding bodyNode.)
You can simply call:
var documentNode = document.DocumentNode;
var functionBarNode = documentNode.SelectSingleNode("//div[@id='functionBar']");
functionBarNode.Remove();
It is much simpler, and does the same as:
functionBarNode.ParentNode.RemoveChild(functionBarNode, false);
This will work for multiple:
HtmlDocument d = this.Download(string.Format(validatorUrl, Url));
foreach (var toGo in QuerySelectorAll(d.DocumentNode, "p[class=helpwanted]").ToList())
{
    toGo.Remove();
}
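For reference, the single-node removal discussed above can be put together as a small self-contained sketch that runs against the simplified HTML from the question. The names NodeRemover and RemoveDivById are hypothetical; the Html Agility Pack calls used (LoadHtml, SelectSingleNode, Remove) are the ones that already appear in the answers.

```csharp
using System;
using HtmlAgilityPack;

static class NodeRemover
{
    // Removes the div with the given id, together with its children,
    // and returns the markup of the whole document afterwards.
    public static string RemoveDivById(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var node = doc.DocumentNode.SelectSingleNode("//div[@id='" + id + "']");

        // Remove() detaches the node from its actual parent, so it does not
        // matter how deeply the div is nested below <body>.
        if (node != null)
        {
            node.Remove();
        }
        return doc.DocumentNode.OuterHtml;
    }
}

static class Program
{
    static void Main()
    {
        const string html =
            "<html><head></head><body><div id='wrapper'><div id='functionBar'>" +
            "<div id='search'></div></div></div></body></html>";

        Console.WriteLine(NodeRemover.RemoveDivById(html, "functionBar"));
    }
}
```

Because the node is removed through its own parent, the original error (asking body to remove a div that is really a child of wrapper) cannot occur.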
