Parsing dl with HtmlAgilityPack - c#

This is the sample HTML I am trying to parse with Html Agility Pack in ASP.Net (C#).
<div class="content-div">
<dl>
<dt>
<b><a href="1.html">1</a></b>
</dt>
<dd> First Entry</dd>
<dt>
<b><a href="2.html">2</a></b>
</dt>
<dd> Second Entry</dd>
<dt>
<b><a href="3.html">3</a></b>
</dt>
<dd> Third Entry</dd>
</dl>
</div>
The values I want are:
The hyperlink -> 1.html
The anchor text -> 1
Inner text of dd -> First Entry
(I have used the first entry as an example, but I want these values for every entry in the list.)
This is the code I am currently using:
var webGet = new HtmlWeb();
var document = webGet.Load(url2);
var parsedValues=
from info in document.DocumentNode.SelectNodes("//div[@class='content-div']")
from content in info.SelectNodes("dl//dd")
from link in info.SelectNodes("dl//dt/b/a")
.Where(x => x.Attributes.Contains("href"))
select new
{
Text = content.InnerText,
Url = link.Attributes["href"].Value,
AnchorText = link.InnerText,
};
GridView1.DataSource = parsedValues;
GridView1.DataBind();
The problem is that I get the link and the anchor text correctly, but the inner text of the dd is wrong: the first entry's text is repeated once for every link, then the second entry's text is repeated for every link, and so on. Here is a sample of the output I get with this code:
First Entry 1.html 1
First Entry 2.html 2
First Entry 3.html 3
Second Entry 1.html 1
Second Entry 2.html 2
Second Entry 3.html 3
Third Entry 1.html 1
Third Entry 2.html 2
Third Entry 3.html 3
Whereas I am trying to get:
First Entry 1.html 1
Second Entry 2.html 2
Third Entry 3.html 3
I am pretty new to HAP and have very little knowledge of XPath, so I am sure I am doing something wrong here, but I couldn't make it work even after spending hours on it. Any help would be much appreciated.

Solution 1
I have defined a function that, given a dt node, returns the next dd node after it:
private static HtmlNode GetNextDDSibling(HtmlNode dtElement)
{
// Walk forward through the siblings until the next <dd> element is found.
for (var node = dtElement.NextSibling; node != null; node = node.NextSibling)
{
if (node.NodeType == HtmlNodeType.Element && node.Name == "dd")
return node;
}
return null; // no <dd> follows this <dt>
}
and now the LINQ code can be transformed to:
var parsedValues =
from info in document.DocumentNode.SelectNodes("//div[@class='content-div']")
from dtElement in info.SelectNodes("dl/dt")
let link = dtElement.SelectSingleNode("b/a[@href]")
let ddElement = GetNextDDSibling(dtElement)
where link != null && ddElement != null
select new
{
Text = ddElement.InnerHtml,
Url = link.GetAttributeValue("href", ""),
AnchorText = link.InnerText
};
Solution 2
Without additional functions:
var infoNode =
document.DocumentNode.SelectSingleNode("//div[@class='content-div']");
var dts = infoNode.SelectNodes("dl/dt");
var dds = infoNode.SelectNodes("dl/dd");
var parsedValues = dts.Zip(dds,
(dt, dd) => new
{
Text = dd.InnerHtml,
Url = dt.SelectSingleNode("b/a[@href]").GetAttributeValue("href", ""),
AnchorText = dt.SelectSingleNode("b/a[@href]").InnerText
});
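The duplicated rows in the question appear to come from the two independent `from` clauses, which combine every dd with every link (a cross join), whereas `Zip` pairs items by position. Here is a minimal sketch of that difference using plain string arrays instead of HAP nodes; the class and method names (`PairingDemo`, `CrossJoin`, `Pairwise`) are hypothetical and only serve the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PairingDemo
{
    // Two independent `from` clauses: every text is combined with every link.
    public static List<string> CrossJoin(string[] texts, string[] links) =>
        (from text in texts
         from link in links
         select text + " " + link).ToList();

    // Zip pairs elements by position: first with first, second with second.
    public static List<string> Pairwise(string[] texts, string[] links) =>
        texts.Zip(links, (text, link) => text + " " + link).ToList();

    public static void Main()
    {
        var texts = new[] { "First Entry", "Second Entry", "Third Entry" };
        var links = new[] { "1.html", "2.html", "3.html" };

        // 3 texts x 3 links = 9 combined rows, as in the question's output.
        Console.WriteLine(CrossJoin(texts, links).Count);

        // One row per position, which is the output the question asks for.
        foreach (var row in Pairwise(texts, links))
            Console.WriteLine(row);
    }
}
```

Note that `Zip` stops at the shorter sequence, so pairing by position is only safe when every dt has exactly one matching dd; the sibling-walking approach in Solution 1 does not rely on that.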

Just an example of how you can parse these elements using Html Agility Pack:
public string ParseHtml()
{
string output = null;
HtmlDocument htmldocument = new HtmlDocument();
htmldocument.LoadHtml(YourHTML);
HtmlNode node = htmldocument.DocumentNode;
HtmlNodeCollection dds = node.SelectNodes("//dd"); //Select all dd tags
HtmlNodeCollection anchors = node.SelectNodes("//b/a[@href]"); //Select all 'a' tags that contain an href attribute
for (int i = 0; i < dds.Count; i++)
{
string attributeValue = null; // default returned when href is missing
Text = dds[i].InnerText;
Url = anchors[i].GetAttributeValue("href", attributeValue);
AnchorText = anchors[i].InnerText;
//Your code...
}
return output;
}

Related

HtmlAgilityPack issue

Suppose I have the following HTML code:
<div class="MyDiv">
<h2>Josh</h2>
</div>
<div class="MyDiv">
<h2>Anna</h2>
</div>
<div class="MyDiv">
<h2>Peter</h2>
</div>
And I want to get the names, so this is what I did (C#):
string url = "https://...";
var web = new HtmlWeb();
HtmlNode[] nodes = null;
HtmlDocument doc = null;
doc = web.Load(url);
nodes = doc.DocumentNode.SelectNodes("//div[@class='MyDiv']").ToArray() ?? null;
foreach (HtmlNode n in nodes){
var name = n.SelectSingleNode("//h2");
Console.WriteLine(name.InnerHtml);
}
Output:
Josh
Josh
Josh
This is strange, because n contains only the desired <div>. How can I resolve this issue?
Fixed by writing .//h2 instead of //h2
It's because of your XPath expression "//h2". Change it to "h2" (or ".//h2"). A path that starts with "//" is evaluated from the top of the document rather than from the current node, so it selects "Josh" every time, because that is the first h2 node in the document.
You could also do like this:
List<string> names =
doc.DocumentNode.SelectNodes("//div[@class='MyDiv']/h2")
.Select(dn => dn.InnerText)
.ToList();
foreach (string name in names)
{
Console.WriteLine(name);
}
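The difference between "//h2" and ".//h2" can be reproduced without Html Agility Pack. The sketch below uses `System.Xml.XmlDocument` from the base class library on a small well-formed fragment; it assumes HAP's XPath evaluation follows the same rule for absolute paths, and the names `XPathContextDemo` and `FirstH2` are hypothetical.

```csharp
using System;
using System.Xml;

public static class XPathContextDemo
{
    // Evaluates the XPath against the given context node and returns the
    // text of the first match, or null when nothing matches.
    public static string FirstH2(XmlNode context, string xpath) =>
        context.SelectSingleNode(xpath)?.InnerText;

    public static XmlDocument LoadSample()
    {
        var doc = new XmlDocument();
        doc.LoadXml(
            "<root>" +
            "<div class='MyDiv'><h2>Josh</h2></div>" +
            "<div class='MyDiv'><h2>Anna</h2></div>" +
            "<div class='MyDiv'><h2>Peter</h2></div>" +
            "</root>");
        return doc;
    }

    public static void Main()
    {
        var doc = LoadSample();
        foreach (XmlNode div in doc.SelectNodes("//div[@class='MyDiv']"))
        {
            // "//h2" restarts at the document root; ".//h2" stays inside div.
            Console.WriteLine(FirstH2(div, "//h2") + " vs " + FirstH2(div, ".//h2"));
        }
    }
}
```

The left column repeats the first name for every div, while the right column follows the current div, which matches the behaviour described in the question and its fix.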

Parse single data elements from HTML tables with C#?

I have this code in my main function and I want to parse only the first row of the table (e.g. Nov 7, 2017 73.78 74.00 72.32 72.71 17,245,947).
I created a node that should contain only the first row, but when I start debugging, the node value is null. How can I parse this data and store it, for example, in a string or in individual variables?
WebClient web = new WebClient();
string page = web.DownloadString("https://finance.google.com/finance/historical?q=NYSE:C&ei=7O4nV9GdJcHomAG02L_wCw");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);
var node = doc.DocumentNode.SelectSingleNode("//*[@id=\"prices\"]/table/tbody/tr[2]");
List<List<string>> rows = doc.DocumentNode.SelectSingleNode("//*[@id=\"prices\"]/table").Descendants("tr").Skip(1).Where(tr => tr.Elements("td").Count() > 1).Select(tr => tr.Elements("td").Select(td=>td.InnerText.Trim()).ToList()).ToList() ;
It seems that your XPath selection string has an error. Since tbody is a generated node, it should not be included in the path:
//*[@id=\"prices\"]/table/tr[2]
While this should read the value, Html Agility Pack hits another problem: malformed HTML. None of the <tr> and <td> nodes in the parsed text have corresponding </tr> or </td> closing tags, and Html Agility Pack fails to select values from a table with malformed rows. It is therefore necessary to select the whole table as a first step:
//*[@id=\"prices\"]/table
In the next step, either sanitize the HTML by adding the missing </tr> and </td> closing tags and parse the corrected table again, or parse the extracted string by hand: extract lines 10 to 15 from the table string and split each of them on the > character. Raw parsing is shown below. The code is tested and working.
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
namespace GoogleFinanceDataScraper
{
class Program
{
static void Main(string[] args)
{
WebClient web = new WebClient();
string page = web.DownloadString("https://finance.google.com/finance/historical?q=NYSE:C&ei=7O4nV9GdJcHomAG02L_wCw");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);
var node = doc.DocumentNode.SelectSingleNode("//div[@id='prices']/table");
string outerHtml = node.OuterHtml;
List<String> data = new List<string>();
using(StringReader reader = new StringReader(outerHtml))
{
for(int i = 0; ; i++)
{
var line = reader.ReadLine();
if (i < 9) continue;
else if (i < 15)
{
var dataRawArray = line.Split(new char[] { '>' });
var value = dataRawArray[1];
data.Add(value);
}
else break;
}
}
Console.WriteLine($"{data[0]}, {data[1]}, {data[2]}, {data[3]}, {data[4]}, {data[5]}");
}
}
}

Find specific link in html doc c# using HTML Agility Pack

I am trying to parse an HTML document in order to retrieve a specific link within the page. I know this may not be the best way, but I'm trying to find the HTML node I need by its inner text. However, there are two instances in the HTML where this occurs: the footer and the navigation bar. I need the link from the navigation bar. The "footer" in the HTML comes first. Here is my code:
public string findCollegeURL(string catalog, string college)
{
//Find college
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(catalog);
var root = doc.DocumentNode;
var htmlNodes = root.DescendantsAndSelf();
// Search through fetched html nodes for relevant information
int counter = 0;
foreach (HtmlNode node in htmlNodes) {
string linkName = node.InnerText;
if (linkName == colleges[college] && counter == 0)
{
counter++;
continue;
}
else if(linkName == colleges[college] && counter == 1)
{
string targetURL = node.Attributes["href"].Value; //"found it!"; //
return targetURL;
}/* */
}
return "DID NOT WORK";
}
The program is entering into the if else statement, but when attempting to retrieve the link, I get a NullReferenceException. Why is that? How can I retrieve the link I need?
Here is the code in the HTML doc that I'm trying to access:
<tr class>
<td id="acalog-navigation">
<div class="n2_links" id="gateway-nav-current">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
College of Science ==$0
</div>
This is the link that I want: /content.php?catoid=10&navoid=1210
I find XPath easier than writing a lot of code:
var link = doc.DocumentNode.SelectSingleNode("//a[text()='College of Science']")
.Attributes["href"].Value;
If you have two links with the same text, you can select the second one like this:
var link = doc.DocumentNode.SelectSingleNode("(//a[text()='College of Science'])[2]")
.Attributes["href"].Value;
The LINQ version, which returns every matching link:
var links = doc.DocumentNode.Descendants("a")
.Where(a => a.InnerText == "College of Science")
.Select(a => a.Attributes["href"].Value)
.ToList();

Regex to remove and replace characters

I have the following:
<option value="Abercrombie">Abercrombie</option>
My file has about 2000 rows in it, and each row has a different location. I'm trying to understand regex, but nothing I learn seems to stick, and I'm unsure whether this is even possible.
What I want to do is run a regex that strips the above HTML and leaves the following:
Abercrombie
I then want to prefix a particular number to the front, so the result would be, for example:
2,Abercrombie
Is this possible?
Don't use a regular expression, since HTML is not a regular language. You can use LINQ to XML instead. If you want to process the entire file, you can replace the elements inline:
int myNumber = 2;
var html = @"<html><body><option value=""Abercrombie"">Abercrombie</option><div><option value=""Forever21"">Forever21</option></div></body></html>";
var doc = XDocument.Load(new StringReader(html));
var options = doc.Descendants().Where(o => o.Name == "option").ToList();
foreach (var element in options)
{
element.ReplaceWith(string.Format("{0},{1}", myNumber, element.Value));
}
var result = doc.ToString();
This gives:
<html>
<body>2,Abercrombie<div>2,Forever21</div></body>
</html>
If you just want to grab the text for a specific tag, you can use the following:
int myNumber = 2;
var html = @"<option value=""Abercrombie"">Abercrombie</option>";
var doc = XDocument.Load(new StringReader(html));
var element = doc.Descendants().FirstOrDefault(o => o.Name == "option");
var attribute = element.Attribute("value").Value;
var result = string.Format("{0},{1}", myNumber, attribute);
//result == "2,Abercrombie"
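For a file with one option element per row, as the question describes, the same idea can be applied line by line. This is only a sketch under assumptions: each line is taken to be a single well-formed option element, the same number is prefixed to every row, and the names `OptionLineDemo` and `ToCsvRow` are hypothetical.

```csharp
using System;
using System.Xml.Linq;

public static class OptionLineDemo
{
    // Parses one <option> line and returns "<prefix>,<element text>".
    // XElement.Parse throws an XmlException if the line is not well-formed XML.
    public static string ToCsvRow(string optionLine, int prefix)
    {
        var element = XElement.Parse(optionLine.Trim());
        return prefix + "," + element.Value;
    }

    public static void Main()
    {
        // In practice these lines might come from File.ReadLines(path).
        var lines = new[]
        {
            "<option value=\"Abercrombie\">Abercrombie</option>",
            "<option value=\"Forever21\">Forever21</option>"
        };

        foreach (var line in lines)
            Console.WriteLine(ToCsvRow(line, 2));
    }
}
```

Rows that contain unescaped characters such as a bare & are not well-formed XML and would need a more tolerant parser such as Html Agility Pack.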

HtmlAgilityPack scraping "href"

I wrote this code.
Warning: the link points to an adult site!
var getHtmlWeb = new HtmlWeb();
var document = getHtmlWeb.Load("http://xhamster.com/movies/2808613/jewel_is_a_sexy_cougar_who_loves_to_fuck_lucky_younger_guys.html");
var aTags = document.DocumentNode.SelectNodes("//div[contains(@class,'noFlash')]");
if (aTags != null)
foreach (var aTag in aTags)
{
var href = aTag.Attributes["href"].Value;
textBox2.Text = href;
}
I get an error when I try to run this program.
If I assign something else to "var href", for example:
var href = aTag.InnerHtml
I get the inner HTML, and I can see the "href=" link there along with some other data.
But I need only the link after the href!
You are selecting div elements, and a div element can't have an href attribute. If you want to get the hrefs of the anchor tags, you can use:
var hrefs = aTags.Descendants("a")
.Select(node => node.GetAttributeValue("href",""))
.ToList();