I'm trying to learn web scraping and want to get the href value from the "a" node using HtmlAgilityPack in C#. There are multiple GridCells within the GridW that contain articles with smallerCells, and I want the "a" node's href value from all of them:
<div class="Tabpanel">
  <div class="GridW">
    <div class="GridCell">
      <article>
        <div class="smallerCell">
          <a href="..........">
        </div>
      </article>
    </div>
  </div>
  <div class="random">
  </div>
  <div class="random">
  </div>
</div>
This is what I have come up with so far; it feels like I'm making it way more complicated than it has to be. Where do I go from here? Or is there an easier way to do this?
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(Url);
var htmldoc = new HtmlDocument();
htmldoc.LoadHtml(html);

var receptLista = htmldoc.DocumentNode.Descendants("div")
    .Where(node => node.GetAttributeValue("class", "").Equals("GridW"))
    .ToList();

var finalList = receptLista[0].Descendants("article").ToList();

var finalList2 = new List<List<HtmlNode>>();
for (int i = 0; i < finalList.Count; i++) {
    finalList2.Add(finalList[i].DescendantNodes()
        .Where(node => node.GetAttributeValue("class", "").Equals("RecipeTeaser-content"))
        .ToList());
}

var finalList3 = new List<List<HtmlNode>>();
for (int i = 0; i < finalList2.Count; i++) {
    finalList3.Add(finalList2[i]
        .Where(node => node.GetAttributeValue("class", "").Equals("RecipeTeaser-link js-searchRecipeLink"))
        .ToList());
}
You can probably make things a lot simpler by using XPath.
If you want all the links in article tags, you can do the following.
var anchors = htmldoc.DocumentNode.SelectNodes("//article//a");
var links = anchors.Select(a => a.Attributes["href"].Value).ToList();
I think it is Value. Check with the docs.
If you want only the anchor tags under article that sit inside a div with class smallerCell, you can change the XPath to //article/div[@class='smallerCell']/a.
You get the idea. I think you're just missing XPath knowledge. Also note that HtmlAgilityPack has plugins that add CSS selector support, so that's also an option if you don't want to use XPath.
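Putting it together for the markup in the question, here is a minimal sketch (assuming the class names really are GridW/smallerCell as in the sample above, and that Url holds the page address):

// Minimal sketch, assuming the sample markup above; Url is the page address.
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(Url);
var htmldoc = new HtmlDocument();
htmldoc.LoadHtml(html);

// One XPath query replaces the nested loops: every <a> inside a
// div with class smallerCell that sits inside an <article>.
var hrefs = htmldoc.DocumentNode
    .SelectNodes("//article/div[@class='smallerCell']/a")
    ?.Select(a => a.GetAttributeValue("href", ""))
    .ToList() ?? new List<string>();

SelectNodes returns null when nothing matches, hence the null-conditional.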
Simplest way I'd go about it would be this...
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(text);
var nodesWithARef = doc.DocumentNode.Descendants("a");
foreach (HtmlNode node in nodesWithARef)
{
Console.WriteLine(node.GetAttributeValue("href", ""));
}
Reasoning: the Descendants("a") call gives you all the anchor nodes in the entire HTML. You can go over the nodes and do whatever you need with them; here I am simply printing the href.
Another way to go about it would be to look up all the nodes that have the class named 'smallerCell'. Then, for each of those nodes, look up the href if it exists underneath and print it (or do something with it).
var nodesWithSmallerCells = doc.DocumentNode.SelectNodes("//div[@class='smallerCell']");
if (nodesWithSmallerCells != null)
foreach (HtmlNode node in nodesWithSmallerCells)
{
HtmlNodeCollection children = node.SelectNodes(".//a");
if (children != null)
foreach (HtmlNode child in children)
Console.WriteLine(child.GetAttributeValue("href", ""));
}
Suppose I have the following HTML code:
<div class="MyDiv">
<h2>Josh</h2>
</div>
<div class="MyDiv">
<h2>Anna</h2>
</div>
<div class="MyDiv">
<h2>Peter</h2>
</div>
And I want to get the names, so this is what I did (C#):
string url = "https://...";
var web = new HtmlWeb();
HtmlNode[] nodes = null;
HtmlDocument doc = null;
doc = web.Load(url);
nodes = doc.DocumentNode.SelectNodes("//div[@class='MyDiv']").ToArray() ?? null;
foreach (HtmlNode n in nodes){
var name = n.SelectSingleNode("//h2");
Console.WriteLine(name.InnerHtml);
}
Output:
Josh
Josh
Josh
and it is so strange because n contains only the desired <div>. How can I resolve this issue?
Fixed by writing .//h2 instead of //h2
It's because of your XPath statement "//h2". You should change this simply to "h2". When the path starts with "//", it is evaluated from the top of the document rather than from the current node, so it selects "Josh" every time, because that is the first h2 node in the document.
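A minimal sketch of the corrected loop from the question, with only the inner XPath changed so it is relative to the current node:

foreach (HtmlNode n in nodes)
{
    // "h2" (or ".//h2") is resolved relative to n, not the document root.
    var name = n.SelectSingleNode("h2");
    Console.WriteLine(name.InnerHtml);
}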
You could also do it like this:
List<string> names =
doc.DocumentNode.SelectNodes("//div[@class='MyDiv']/h2")
.Select(dn => dn.InnerText)
.ToList();
foreach (string name in names)
{
Console.WriteLine(name);
}
I am trying to parse an HTML document in order to retrieve a specific link within the page. I know this may not be the best way, but I'm trying to find the HTML node I need by its inner text. However, there are two instances in the HTML where this occurs: the footer and the navigation bar. I need the link from the navigation bar. The "footer" in the HTML comes first. Here is my code:
public string findCollegeURL(string catalog, string college)
{
//Find college
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(catalog);
var root = doc.DocumentNode;
var htmlNodes = root.DescendantsAndSelf();
// Search through fetched html nodes for relevant information
int counter = 0;
foreach (HtmlNode node in htmlNodes) {
string linkName = node.InnerText;
if (linkName == colleges[college] && counter == 0)
{
counter++;
continue;
}
else if(linkName == colleges[college] && counter == 1)
{
string targetURL = node.Attributes["href"].Value;
return targetURL;
}
}
return "DID NOT WORK";
}
The program is entering into the if else statement, but when attempting to retrieve the link, I get a NullReferenceException. Why is that? How can I retrieve the link I need?
Here is the code in the HTML doc that I'm trying to access:
<tr class>
<td id="acalog-navigation">
<div class="n2_links" id="gateway-nav-current">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<a href="/content.php?catoid=10&navoid=1210">College of Science</a>
</td>
This is the link that I want: /content.php?catoid=10&navoid=1210
I find using XPath easier than writing a lot of code.
var link = doc.DocumentNode.SelectSingleNode("//a[text()='College of Science']")
.Attributes["href"].Value;
If you have two links with the same text, you can select the second one like this:
var link = doc.DocumentNode.SelectSingleNode("(//a[text()='College of Science'])[2]")
.Attributes["href"].Value;
The LINQ version of it:
var links = doc.DocumentNode.Descendants("a")
.Where(a => a.InnerText == "College of Science")
.Select(a => a.Attributes["href"].Value)
.ToList();
I'm currently using HtmlAgilityPack to strip the content of a div (with contentEditable) of all the unnecessary tags, so I only keep the text between <p></p> tags. After getting the text I send it to a corrector that gives me back the words in error inside this specific <p></p>.
Dictionary<string, List<string>> DicoError = new Dictionary<string, List<string>>();
int nbError = 0;
HtmlDocument html = new HtmlDocument();
html.LoadHtml(texteAFormater);
var nodesSpan = html.DocumentNode.SelectNodes("//span");
var nodesA = html.DocumentNode.SelectNodes("//div");
if (nodesSpan != null)
{
foreach (var node in nodesSpan)
{
node.Remove();
}
}
if (nodesA != null)
{
foreach (var node in nodesA)
{
if (node.Attributes["edth_type"] != null)
{
if (string.Equals(node.Attributes["edth_type"].Value, "contenu", StringComparison.InvariantCultureIgnoreCase)==false)
{
node.Remove();
}
}
}
}
var paragraphe = html.DocumentNode.SelectNodes("p");
for (int i = 0; i < paragraphe.Count; i++) {
    string texteToCorrect = paragraphe[i].InnerText;
    List<string> errorInsideParagraph =
        callProlexis(HtmlEntity.DeEntitize(texteToCorrect), nbError, DicoError);
    for (int j = 0; j < errorInsideParagraph.Count; j++) {
        HtmlNode spanNode = html.CreateElement("span");
        spanNode.Attributes.Add("class", typeError);
        spanNode.Attributes.Add("id", nbError.ToString());
        spanNode.Attributes.Add("oncontextmenu", "rightClickMustWork(event, this);return false");
    }
}
I manage to send the InnerText to my corrector. The problem I have is this: suppose the content of the paragraph is:
<p>this is some text <em>error</em> how should this work</p>
In this one, two words are in error: "error" and "should".
How can I add my spanNode so that it keeps the <em></em> around "error"? (I need to keep the existing tag around a word in error if there is one, and just wrap the spanNode around it.)
So the expected result will be :
<p>this is some text <span ...><em>error</em></span> how <span ...>should</span> this work</p>
Edit: I was thinking of something like finding the word in error inside the InnerHtml and then getting the parent node of that word. If the parent is <p>, there is no tag around the word and we can just add the spanNode; if it is another tag, we need to insert the spanNode as that tag's parent, so that spanNode is a child of <p> but the parent of the tag around the word (see the sketch below). I'm not sure how to do it.
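A minimal sketch of the wrapping step, assuming you have already located the node around the word in error (here the <em> element); WrapInSpan and cssClass are hypothetical names, not part of HtmlAgilityPack:

// Hypothetical helper: wraps an existing node (e.g. the <em> around the word
// in error) in a new <span>, keeping the original node and its contents intact.
static HtmlNode WrapInSpan(HtmlDocument doc, HtmlNode nodeToWrap, string cssClass)
{
    HtmlNode spanNode = doc.CreateElement("span");
    spanNode.Attributes.Add("class", cssClass);

    // Put the span where the wrapped node was, then move that node inside it.
    HtmlNode parent = nodeToWrap.ParentNode;
    parent.ReplaceChild(spanNode, nodeToWrap);
    spanNode.AppendChild(nodeToWrap);
    return spanNode;
}

For a bare word sitting directly inside the <p>, you would first have to split the surrounding text node so you have a node to wrap; the same ReplaceChild/AppendChild pattern applies after that.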
I have some HTML code which I would like to parse.
I have written the code below:
HtmlAgilityPack.HtmlWeb web5 = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc5 = web5.Load("http://www.analytics4.co.uk/pdf.js/web/viewer.html?file=http://www.analytics4.co.uk/pdf.js/pdf/w15639.pdf");
//var divs5 = doc5.DocumentNode.SelectNodes("//div[id='viewerContainer']").SelectMany(x => x.Descendants("div"));
// HtmlAgilityPack.HtmlDocument doc5 = web5.Load("http://google.co.uk");
HtmlNodeCollection tl = doc5.DocumentNode.SelectNodes("//div[@id='viewerContainer']//div[@id='viewer']");
foreach (HtmlAgilityPack.HtmlNode node in tl)
{
Console.WriteLine(node.InnerHtml);
Console.WriteLine(node.OuterHtml);
}
The result I get for Inner HTML is just
<div id="viewer" class="pdfViewer"></div>
and it doesn't make sense. Could anyone explain how I can go deeper and deeper into the inner divs and so on? Please, guys... I need your help.
To go deeper you can use these techniques:
foreach (var node in tl){
var a = node.ChildNodes[2]; // a is the third child of node
var b = node.SelectSingleNode("./div[3]"); // b is the third "div"
// element in node children. The "./" in XPath means "from current node"
}
Good luck!
I have a problem: my XPath is not working.
I am trying to get the URLs from Google.com's search result list into a string list.
But I am unable to reach the URL using XPath.
Please help me correct my XPath. Also tell me what should go in the place of the ????????? below.
HtmlWeb hw = new HtmlWeb();
List<string> urls = new List<string>();
HtmlAgilityPack.HtmlDocument doc = hw.Load("http://www.google.com/search?q=" +txtURL.Text.Replace(" " , "+"));
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//div[@class='f kv']");
foreach (HtmlNode linkNode in linkNodes)
{
HtmlAttribute link = linkNode.Attributes["?????????"];
urls.Add(link.Value);
}
for (int i = 0; i <= urls.Count - 1; i++)
{
if (urls.ElementAt(i) != null)
{
if (IsValid(urls.ElementAt(i)) != true)
{
grid.Rows.Add(urls.ElementAt(i));
}
}
}
The URLs seem to live in the cite element under the selected divs, so the XPath to select those is //div[@class='f kv']/cite.
Now, since these contain markup but you only want the text, select the InnerText of the selected nodes. Note that these do not begin with http://.
HtmlNodeCollection linkNodes =
    doc.DocumentNode.SelectNodes("//div[@class='f kv']/cite");
foreach (HtmlNode linkNode in linkNodes)
{
    urls.Add(linkNode.InnerText);
}
The correct XPath is "//div[@class='kv']/cite". The f class you see in the browser element inspector is (probably) added after the page is rendered, using JavaScript.
Also, the link text is not in an attribute; you can get it using the InnerText property of the <cite> element(s) obtained in the earlier step.
I changed these lines and it works:
var linkNodes = doc.DocumentNode.SelectNodes("//div[@class='kv']/cite");
foreach (HtmlNode linkNode in linkNodes)
{
urls.Add(linkNode.InnerText);
}
There's a caveat though: some links are trimmed (you'll see a ... in the middle)