Parsing HTML with Html Agility Pack in C#

I want to collect all the tags inside this div, but I don't know the best way to do it with XPath:
<div class="biz_info">
<h3>Sørby Rehab</h3>
<table class="string_14">
<tbody>
<tr>
<td>Postadr.:</td>
<td class="tab_space">Rognerudveien 8 B, 0681 Oslo</td>
</tr>
<tr>
<td>Telefon:</td>
<td class="tab_space">928 70 700</td>
</tr>
<tr>
<td>Nettside:</td>
<td class="tab_space">www.sorby-rehab.no</td>
</tr>
</tbody>
</table>
</div>
My current code looks like this (it works very badly):
HtmlDocument doc = new HtmlDocument();
doc.Load(new StringReader(result));
HtmlNode root = doc.DocumentNode;
List<string> anchorTags = new List<string>();
foreach (HtmlNode link in root.SelectNodes("//@class=biz_info"))
{
string att = link.OuterHtml;
anchorTags.Add(att);
}
Can someone experienced with XPath help me?

HtmlDocument html = new HtmlDocument();
html.Load(new StringReader(result));
var anchorTags = html.DocumentNode.SelectNodes("//div[@class='biz_info']//a")
    .Select(a => a.OuterHtml)
    .ToList();
That gives you a list of the anchor tags' HTML. If you need just the URLs:
var urls = html.DocumentNode.SelectNodes("//div[@class='biz_info']//a[@href!='']")
    .Select(a => a.Attributes["href"].Value)
    .ToList();
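
Note that the sample div in the question contains no anchor tags, and as far as I know SelectNodes returns null rather than an empty collection when nothing matches. A minimal sketch of reading the label/value rows instead, assuming the markup shown in the question (the class name BizInfoReader and the method ReadRows are made up for illustration):

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class BizInfoReader
{
    // Returns label -> value pairs from the table rows inside div.biz_info.
    public static Dictionary<string, string> ReadRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var pairs = new Dictionary<string, string>();

        // SelectNodes returns null when nothing matches, so check before looping.
        var rows = doc.DocumentNode.SelectNodes("//div[@class='biz_info']//tr");
        if (rows == null)
            return pairs;

        foreach (HtmlNode row in rows)
        {
            // Relative paths (no leading slash) look only at children of this row.
            // XPath positions start at 1, so td[1] is the first cell.
            HtmlNode label = row.SelectSingleNode("td[1]");
            HtmlNode value = row.SelectSingleNode("td[2]");
            if (label != null && value != null)
                pairs[label.InnerText.Trim()] = value.InnerText.Trim();
        }
        return pairs;
    }
}
```

Calling BizInfoReader.ReadRows(result) on the question's page should then give entries keyed by "Postadr.:", "Telefon:" and "Nettside:".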

Related

How to find last column of a table using Html Agility Pack

I have a table like this:
<table border="0" cellpadding="0" cellspacing="0" id="table2">
<tr>
<th>Name
</th>
<th>Age
</th>
</tr>
<tr>
<td>Mario
</td>
<td>Age: 78
</td>
</tr>
<tr>
<td>Jane
</td>
<td>Age: 67
</td>
</tr>
<tr>
<td>James
</td>
<td>Age: 92
</td>
</tr>
</table>
I want to get the last td from all rows using Html Agility Pack.
Here is my C# code so far:
await page.GoToAsync(NumOfSaleItems, new NavigationOptions
{
WaitUntil = new WaitUntilNavigation[] { WaitUntilNavigation.DOMContentLoaded }
});
var html4 = page.GetContentAsync().GetAwaiter().GetResult();
var htmlDoc4 = new HtmlDocument();
htmlDoc4.LoadHtml(html4);
var SelectTable = htmlDoc4.DocumentNode.SelectNodes("/html/body/div[2]/div/div/div/table[2]/tbody/tr/td[1]/div[3]/div[2]/div/table[2]/tbody/tr/td[4]");
if (SelectTable.Count == 0)
{
continue;
}
else
{
foreach (HtmlNode row in SelectTable)
{
string value = row.InnerText;
value = value.ToString();
var firstSpaceIndex = value.IndexOf(" ");
var firstString = value.Substring(0, firstSpaceIndex);
LastSellingDates.Add(firstString);
}
}
How can I get only the last column of the table?

I think the XPath you want is: //table[@id='table2']//tr/td[last()].
//table[@id='table2'] finds the table by ID anywhere in the document. This is preferable to a long, brittle path from the root, since a table ID is less likely to change than the rest of the HTML structure.
//tr gets the descendant rows in the table. I'm using two slashes in case there is an intervening <tbody> element in the actual HTML.
/td[last()] gets the last <td> in each row.
From there you just need to select the InnerText of each <td>.
var tds = htmlDoc.DocumentNode.SelectNodes("//table[@id='table2']//tr/td[last()]");
var values = tds?.Select(td => td.InnerText).ToList() ?? new List<string>();
Working demo here: https://dotnetfiddle.net/7I8yk1

How can I get the following sibling using HtmlAgilityPack?

I have many tr tags in my HTML:
<div class="noticeTabBoxWrapper">
<tr>
<td>
<span>Text for anchor</span>
</td>
</tr>
<tr>
<td>
<span>*constantly changing text*</span>
</td>
</tr>
In my code I wrote this:
//div[@class = 'noticeTabBoxWrapper']//span[contains(text(), 'Text for anchor')]").InnerText
How can I rewrite the code so that I can extract the text that follows immediately after the anchor?
Thanks.

Assuming raw is your sample data:
var doc = new HtmlDocument();
doc.LoadHtml(raw);
var xpath = "//div[@class='noticeTabBoxWrapper']//span[contains(., 'Text for anchor')]/following::td[1]/span";
var result = doc.DocumentNode.SelectSingleNode(xpath);
Console.WriteLine(result.InnerText);
Output: *constantly changing text*
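
Since the question title asks about following-sibling specifically: the two spans are not siblings of each other, but their enclosing rows are. A rough sketch that climbs to the row and then steps to the next row, assuming the structure from the question (AnchorLookup and TextAfter are hypothetical names, and the anchor text is assumed to contain no apostrophe because it is pasted straight into the XPath string):

```csharp
using HtmlAgilityPack;

public static class AnchorLookup
{
    // Returns the text of the first span in the row that directly follows
    // the row containing anchorText, or null when there is no such row.
    public static string TextAfter(string html, string anchorText)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // ancestor::tr[1] is the nearest enclosing row;
        // following-sibling::tr[1] is the next row at the same level.
        string xpath = "//span[contains(., '" + anchorText + "')]" +
                       "/ancestor::tr[1]/following-sibling::tr[1]//span";

        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node?.InnerText.Trim();
    }
}
```

Compared with following::td[1], this stays within sibling rows, so it should not wander into unrelated markup further down the page.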

C#: How can I retrieve this information?

On the HTML page I have something like this:
<table class="information">
<tbody>
<tr>
<td class="name">Name:</td>
<td>John</td>
</tr>
<tr>
<td>...</td>
<td>...</td>
</tr>
....
</tbody>
</table>
How can I retrieve the name? (There is other information too, but my example shows only the name.)
Note: the HTML has more than one table.
I tried this:
foreach (HtmlElement item in wb.Document.GetElementsByTagName("table"))
{
if (item.OuterHtml.Contains("information"))
{
... // Here I don't know how to continue
}
}

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var table = doc.DocumentNode.SelectSingleNode("//table[@class='information']");
// The leading dot keeps the search inside this table; a bare // would search the whole document
var td = table.SelectSingleNode(".//td[@class='name']");
Console.WriteLine(td.InnerText);
or
var text = doc.DocumentNode.Descendants("td")
.First(td => td.Attributes["class"] != null && td.Attributes["class"].Value == "name")
.InnerText;
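
Both snippets above return the text of the label cell itself ("Name:"). If the goal is the value next to it ("John"), one option is to step to the following sibling cell. A minimal sketch, assuming the table layout from the question (InfoTable and ReadName are illustrative names only):

```csharp
using HtmlAgilityPack;

public static class InfoTable
{
    // Returns the text of the cell right after td.name in the
    // "information" table, or null when it cannot be found.
    public static string ReadName(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode valueCell = doc.DocumentNode.SelectSingleNode(
            "//table[@class='information']//td[@class='name']/following-sibling::td[1]");

        return valueCell?.InnerText.Trim();
    }
}
```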

string name = "";
HtmlElementCollection tData = wb.Document.GetElementsByTagName("td");
foreach (HtmlElement td in tData)
{
    // In the WebBrowser control the class attribute is read as "className"
    if (td.GetAttribute("className") == "name")
    {
        name = td.InnerText;
        break;
    }
}

Check out HtmlAgilityPack: it is a free and quite good library for working with HTML sources.

How to get a link's title and href value separately with html agility pack?

I'm trying to download a page that contains a table like this:
<table id="content-table">
<tbody>
<tr>
<th id="name">Name</th>
<th id="link">link</th>
</tr>
<tr class="tt_row">
<td class="ttr_name">
<a title="name_of_the_movie" href="#"><b>name_of_the_movie</b></a>
<br>
<span class="pre">message</span>
</td>
<td class="td_dl">
<img alt="Download" src="#">
</td>
</tr>
<tr class="tt_row"> .... </tr>
<tr class="tt_row"> .... </tr>
</tbody>
</table>
I want to extract name_of_the_movie from the td with class="ttr_name" and the download link from the td with class="td_dl".
This is the code I use to loop through the table rows:
HtmlAgilityPack.HtmlDocument hDocument = new HtmlAgilityPack.HtmlDocument();
hDocument.LoadHtml(htmlSource);
HtmlNode table = hDocument.DocumentNode.SelectSingleNode("//table");
foreach (var row in table.SelectNodes("//tr"))
{
HtmlNode nameNode = row.SelectSingleNode("td[0]");
HtmlNode linkNode = row.SelectSingleNode("td[1]");
}
Currently I have no idea how to check nameNode and linkNode and extract the data inside them.
Any help would be appreciated.
Regards

I can't test it right now, but it should be something along the lines of:
string name = nameNode.Element("a").Element("b").InnerText;
string url = linkNode.Element("a").GetAttributeValue("href", "unknown");

nameNode.Attributes["title"]
linkNode.Attributes["href"]
Presuming you are getting the correct nodes.
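
On getting the correct nodes: XPath positions start at 1, so td[0] in the question's loop matches nothing, and table.SelectNodes("//tr") searches the whole document rather than just that table because of the leading //. A hedged sketch of the loop with relative paths, assuming the table from the question and reading both the title and the href from the anchor in the ttr_name cell (the sample's td_dl cell only shows an img, so nothing is assumed about it; MovieTable and ReadLinks are made-up names):

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class MovieTable
{
    // Returns (title, href) for the anchor in each tt_row row.
    public static List<(string Title, string Href)> ReadLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var results = new List<(string Title, string Href)>();

        var rows = doc.DocumentNode.SelectNodes(
            "//table[@id='content-table']//tr[@class='tt_row']");
        if (rows == null)
            return results; // SelectNodes returns null when nothing matches

        foreach (HtmlNode row in rows)
        {
            // No leading slash: the path is evaluated relative to this row.
            HtmlNode link = row.SelectSingleNode("td[@class='ttr_name']/a");
            if (link == null)
                continue; // row without a link in the name cell

            results.Add((link.GetAttributeValue("title", ""),
                         link.GetAttributeValue("href", "")));
        }
        return results;
    }
}
```

GetAttributeValue takes a default, so a missing title or href yields an empty string here instead of a null reference.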

public const string UrlExtractor = @"(?: href\s*=)(?:[\s""']*)(?!#|mailto|location.|javascript|.*css|.*this\.)(?<url>.*?)(?:[\s>""'])";
public static Match GetMatchRegEx(string text)
{
return new Regex(UrlExtractor, RegexOptions.IgnoreCase).Match(text);
}
Here is how you can extract href URLs. I use this regex in one of my projects; you can modify it to fit your needs and extend it to match the title as well. I find it more convenient to match them in bulk.

How to extract text from HTML using htmlagilitypack for this sample?

I want to extract text from an HTML source. I'm using C# with the HtmlAgilityPack DLL.
The source is:
<table>
<tr>
<td class="title">
<a onclick="func1">Here 2</a>
</td>
<td class="arrow">
<img src="src1" width="9" height="8" alt="Down">
</td>
<td class="percent">
<span>39%</span>
</td>
<td class="title">
<a onclick="func2">Here 1</a>
</td>
<td class="arrow">
<img src="func3" width="9" height="8" alt="Up">
</td>
<td class="percent">
<span>263%</span>
</td>
</tr>
</table>
How can I get the text Here 1 and Here 2 from the table?

HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml("web page string");
var xyz = from x in htmlDoc.DocumentNode.DescendantNodes()
where x.Name == "td" && x.Attributes.Contains("class")
where x.Attributes["class"].Value == "title"
select x.InnerText;
Not so pretty, but it should work.

XPath version:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(t);
// This works because InnerText recursively includes the text of all child nodes
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//td[@class='title']");
// To be more precise, select the anchors directly instead:
// HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//td[@class='title']/a");
string result = "";
foreach (HtmlNode item in nodes)
    result += item.InnerText;
For the LINQ version, just replace the nodes = ... line with:
var nodes = from x in doc.DocumentNode.DescendantNodes()
            where x.Name == "td" && x.GetAttributeValue("class", "") == "title"
            select x;
