Get only the text of a webpage using HTML Agility Pack? - c#

I'm trying to scrape a web page to get just the text. I'm putting each word into a dictionary and counting how many times each word appears on the page. I'm trying to use HTML Agility Pack as suggested in this post: How to get number of words on a web page?
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
int wordCount = 0;
Dictionary<string, int> dict = new Dictionary<string, int>();

foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
{
    MatchCollection matches = Regex.Matches(node.InnerText, @"\b(?:[a-z]{2,}|[ai])\b", RegexOptions.IgnoreCase);
    foreach (Match s in matches)
    {
        // Add the entry to the dictionary
    }
}
However, with my current implementation, I'm still getting lots of results that are from the markup that should not be counted. It's close, but not quite there yet (I don't expect it to be perfect).
I'm using this page as an example. My results are showing a lot of the uses of the words "width" and "googletag", despite those not being in the actual text of the page at all.
Any suggestions on how to fix this? Thanks!

You can't be sure whether the word you are searching for is actually displayed to the user, since JS execution and CSS rules will affect what ends up visible.
The following program finds 0 matches for "width" and "googletag", but finds 126 matches for "html", whereas Chrome's Ctrl+F finds 106.
Note that the program does not count a match if its parent node is <script>.
using HtmlAgilityPack;
using System;

namespace WordCounter
{
    class Program
    {
        private static readonly Uri Uri = new Uri("https://www.w3schools.com/html/html_editors.asp");

        static void Main(string[] args)
        {
            var doc = new HtmlWeb().Load(Uri);
            var nodes = doc.DocumentNode.SelectSingleNode("//body").DescendantsAndSelf();
            var word = Console.ReadLine().ToLower();
            while (word != "exit")
            {
                var count = 0;
                foreach (var node in nodes)
                {
                    if (node.NodeType == HtmlNodeType.Text &&
                        node.ParentNode.Name != "script" &&
                        node.InnerText.ToLower().Contains(word))
                    {
                        count++;
                    }
                }
                Console.WriteLine($"{word} is displayed {count} times.");
                word = Console.ReadLine().ToLower();
            }
        }
    }
}
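If the goal is still the word-frequency dictionary from the question, the same parent-node filter can be combined with the original regex. This is only a sketch, under the assumption that skipping text nodes inside <script> and <style> is a good enough definition of "visible text"; CSS-hidden elements would still be counted.
using HtmlAgilityPack;
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace WordFrequency
{
    class Program
    {
        static void Main(string[] args)
        {
            var doc = new HtmlWeb().Load("https://www.w3schools.com/html/html_editors.asp");
            var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

            foreach (var node in doc.DocumentNode.SelectSingleNode("//body").DescendantsAndSelf())
            {
                // Only raw text nodes, and only those not sitting inside <script> or <style>.
                if (node.NodeType != HtmlNodeType.Text)
                    continue;
                var parent = node.ParentNode.Name;
                if (parent == "script" || parent == "style")
                    continue;

                foreach (Match m in Regex.Matches(node.InnerText, @"\b(?:[a-z]{2,}|[ai])\b", RegexOptions.IgnoreCase))
                {
                    counts.TryGetValue(m.Value, out var n);
                    counts[m.Value] = n + 1;
                }
            }

            foreach (var pair in counts)
                Console.WriteLine($"{pair.Key}: {pair.Value}");
        }
    }
}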

Related

Parse HTML class in individual items with htmlagilitypack

I want to parse HTML. I used the following code, but I get all of it in one item instead of getting the items individually:
var url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
var web = new HtmlWeb();
var doc = web.Load(url);

IEnumerable<HtmlNode> nodes =
    doc.DocumentNode.Descendants()
        .Where(n => n.HasClass("search-result"));

foreach (var item in nodes)
{
    string itemx = item.SelectSingleNode(".//a").Attributes["href"].Value;
    MessageBox.Show(itemx);
    MessageBox.Show(item.InnerText);
}
I only receive one message for the first item, and the second message displays all the items.
When you search the data from the url based on class 'search-result', there is only one node that is returned. Instead of iterating through its children, you only go through that one div, which is why you are only getting one result.
If you want to get a list of all the links inside the div with class "search-result", then you can do the following.
Code:
string url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);

List<string> listOfUrls = new List<string>();

HtmlNode searchResult = doc.DocumentNode.SelectSingleNode("//div[@class='search-result']");

// Iterate through all the child nodes that have the 'a' tag.
foreach (HtmlNode node in searchResult.SelectNodes(".//a"))
{
    string thisUrl = node.GetAttributeValue("href", "");
    if (!string.IsNullOrEmpty(thisUrl) && !listOfUrls.Contains(thisUrl))
        listOfUrls.Add(thisUrl);
}
What does it do?
SelectSingleNode("//div[@class='search-result']") -> retrieves the div that has all the search results and ignores the rest of the document.
Iterates through only the "subnodes" (descendants) that are <a> tags and adds each href to a list. Subnodes are determined by the dot notation in SelectNodes(".//a") (if you use // instead of .//, it searches the entire page, which is not what you want).
The if statement makes sure it only adds unique, non-empty values.
You have all the links now.
Fiddle: https://dotnetfiddle.net/j5aQFp
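If you also want the title text alongside each link (the question shows both the href and item.InnerText), a small variation of the same idea works. This is just a sketch, and it assumes the anchor's own text is the title you are after:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("https://subscene.com/subtitles/searchbytitle?query=joker&l=");

HtmlNode searchResult = doc.DocumentNode.SelectSingleNode("//div[@class='search-result']");
foreach (HtmlNode node in searchResult.SelectNodes(".//a"))
{
    string href = node.GetAttributeValue("href", "");
    // DeEntitize cleans up &amp; and friends in the visible title text.
    string title = HtmlEntity.DeEntitize(node.InnerText).Trim();
    if (!string.IsNullOrEmpty(href))
        Console.WriteLine($"{title} -> {href}");
}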
I think it's how you're looking up and storing the data. Try:
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    string hrefValue = link.GetAttributeValue("href", string.Empty);
    MessageBox.Show(hrefValue);
    MessageBox.Show(link.InnerText);
}

Injecting HTML at specific location using HTMLAgilityPack

I've been asked to inject a bunch of HTML at a specific point in an HTML document, and have been looking at using HtmlAgilityPack to do so.
The recommended way to do this, as far as I can tell, is to parse using nodes and replace/delete the relevant nodes.
This is my code so far:
// Load original HTML
var originalHtml = new HtmlDocument();
originalHtml.Load(@"C:\Temp\test.html");

// Load inject HTML
var inject = new HtmlDocument();
inject.Load(@"C:\Temp\Temp\inject.html");
var injectNode = HtmlNode.CreateNode(inject.Text);

// Get all HTML nodes to inject/delete
var nodesToDelete = originalHtml.DocumentNode.SelectNodes("//p[@style='page-break-after:avoid']");
var countToDelete = nodesToDelete.Count();

// Loop through stuff to remove
int count = 0;
foreach (var nodeToDelete in nodesToDelete)
{
    count++;
    if (count == 1)
    {
        // Replace with inject HTML
        nodeToDelete.ParentNode.ReplaceChild(injectNode, nodeToDelete);
    }
    else if (count <= countToDelete)
    {
        // Remove, as HTML already injected
        nodeToDelete.ParentNode.RemoveChild(nodeToDelete);
    }
}
What I'm finding is that the original HTML is not correctly updated; it appears as though it only injects the parent-level node, which is a single simple element, and none of the child nodes.
Any help??
Thanks,
Patrick.
Well, I couldn't work out how to do this using HtmlAgilityPack, probably due to my lack of understanding of nodes more than anything else, but I did find an easy fix using AngleSharp.
// Load original HTML into document
var parser = new HtmlParser();
var htmlDocument = parser.Parse(File.ReadAllText(@"C:\Temp\test.html"));

// Load inject HTML as raw text
var injectHtml = File.ReadAllText(@"C:\Temp\inject.html");

// Get all HTML elements to inject/delete
var elements = htmlDocument.All.Where(e => e.Attributes.Any(a => a.Name == "style" && a.Value == "page-break-after:avoid"));

// Loop through stuff to remove
int count = 1;
foreach (var element in elements)
{
    if (count == 1)
    {
        // Replace with inject HTML
        element.OuterHtml = injectHtml;
    }
    else
    {
        // Remove, as HTML already injected
        element.Remove();
    }
    count++;
}

// Re-write updated file
File.WriteAllText(@"C:\Temp\test_updated.html", string.Format("{0}{1}{2}{3}", "<html>", htmlDocument.Head.OuterHtml, htmlDocument.Body.OuterHtml, "</html>"));
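For completeness, here is a rough HtmlAgilityPack-only sketch of the same replacement. HtmlNode.CreateNode only parses a single root node, which would explain why only the parent-level element of inject.html showed up; cloning and inserting each top-level node of the injected document keeps the whole fragment. The file paths are the ones from the question, and this is an untested sketch rather than a drop-in fix:
using System.Linq;
using HtmlAgilityPack;

class InjectFragment
{
    static void Main()
    {
        // Load original HTML
        var originalHtml = new HtmlDocument();
        originalHtml.Load(@"C:\Temp\test.html");

        // Load inject HTML as a full document, not a single node
        var inject = new HtmlDocument();
        inject.Load(@"C:\Temp\inject.html");

        var nodesToDelete = originalHtml.DocumentNode
            .SelectNodes("//p[@style='page-break-after:avoid']");
        if (nodesToDelete == null) return;   // SelectNodes returns null when nothing matches

        bool injected = false;
        foreach (var nodeToDelete in nodesToDelete.ToList())
        {
            if (!injected)
            {
                // Insert a deep copy of every top-level node of the fragment
                // where the first placeholder paragraph was.
                foreach (var child in inject.DocumentNode.ChildNodes)
                    nodeToDelete.ParentNode.InsertBefore(child.CloneNode(true), nodeToDelete);
                injected = true;
            }

            // Remove the placeholder paragraph itself.
            nodeToDelete.ParentNode.RemoveChild(nodeToDelete);
        }

        originalHtml.Save(@"C:\Temp\test_updated.html");
    }
}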

How do I access the tr of this table using HtmlAgilityPack?

So I am currently playing around with HtmlAgilityPack, trying to understand how traversing an XML-like document such as an HTML document works and how it flows.
The website I selected was this one https://www.kijiji.ca
What I am trying to do is grab the Title of the Featured listings,
but I have stumbled onto an issue.
I managed to find all the Featured tables, but now I would like to dive into the current one I am at and find its tr that contains the class description.
This is what I have so far.
private static string URL = "https://www.kijiji.ca/b-renovation-contracting-handyman/ontario/home-renovations/k0c753l9004";
private static HtmlWeb client = new HtmlWeb();

static void Main(string[] args)
{
    var DOM = client.Load(URL);
    var Featured = DOM.DocumentNode.SelectNodes("//table[contains(@class,'top-feature')]");

    foreach (var Listing in Featured)
    {
    }
}
There are a few things I wonder about. For one, the thing I asked above: how to dive in deeper. And also, with what I have right there,
what does Listing actually contain? Does it contain all the child nodes, which I guess in this case would be tbody, looking at this for reference?
Or would it contain all the child nodes, not only tbody but also tr & td?
How about this example:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p><table id=""foo""><tr><th>hello</th></tr><tr><td>world</td></tr></table></body></html>");

foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table"))
{
    Console.WriteLine("Found: " + table.Id);
    foreach (HtmlNode row in table.SelectNodes("tr"))
    {
        Console.WriteLine("row");
        foreach (HtmlNode cell in row.SelectNodes("th|td"))
        {
            Console.WriteLine("cell: " + cell.InnerText);
        }
    }
}
Note that you can make it prettier with LINQ-to-Objects if you want:
var query = from table in doc.DocumentNode.SelectNodes("//table").Cast<HtmlNode>()
            from row in table.SelectNodes("tr").Cast<HtmlNode>()
            from cell in row.SelectNodes("th|td").Cast<HtmlNode>()
            select new { Table = table.Id, CellText = cell.InnerText };

foreach (var cell in query)
{
    Console.WriteLine("{0}: {1}", cell.Table, cell.CellText);
}
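Applied to the kijiji page from the question, the same nested-SelectNodes idea might look like this. It is only a sketch: the 'top-feature' and 'description' class names come from the question, and the assumption that the title sits in an <a> inside that row is mine, so the selectors may need adjusting against the live markup.
using System;
using HtmlAgilityPack;

class FeaturedTitles
{
    private static string URL = "https://www.kijiji.ca/b-renovation-contracting-handyman/ontario/home-renovations/k0c753l9004";
    private static HtmlWeb client = new HtmlWeb();

    static void Main(string[] args)
    {
        var DOM = client.Load(URL);
        var Featured = DOM.DocumentNode.SelectNodes("//table[contains(@class,'top-feature')]");
        if (Featured == null) return;   // SelectNodes returns null when nothing matches

        foreach (var Listing in Featured)
        {
            // Listing is the <table> node; the "." prefix keeps the search
            // inside this table instead of the whole document.
            var description = Listing.SelectSingleNode(".//tr[contains(@class,'description')]");
            var title = description?.SelectSingleNode(".//a");
            if (title != null)
                Console.WriteLine(title.InnerText.Trim());
        }
    }
}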

How to wrap text with a span tag using HtmlAgilityPack

I'm currently using HtmlAgilityPack to strip the content of a div (with contentEditable) of all the unnecessary tags, so I only keep the text between <p></p> tags. After getting the text, I send it to a corrector that gives me back the words in error inside this specific <p></p>.
Dictionary<string, List<string>> DicoError = new Dictionary<string, List<string>>();
int nbError = 0;

HtmlDocument html = new HtmlDocument();
html.LoadHtml(texteAFormater);

var nodesSpan = html.DocumentNode.SelectNodes("//span");
var nodesA = html.DocumentNode.SelectNodes("//div");

if (nodesSpan != null)
{
    foreach (var node in nodesSpan)
    {
        node.Remove();
    }
}

if (nodesA != null)
{
    foreach (var node in nodesA)
    {
        if (node.Attributes["edth_type"] != null)
        {
            if (string.Equals(node.Attributes["edth_type"].Value, "contenu", StringComparison.InvariantCultureIgnoreCase) == false)
            {
                node.Remove();
            }
        }
    }
}

var paragraphe = html.DocumentNode.SelectNodes("p");
for (int i = 0; i < paragraphe.Count; i++)
{
    string texteToCorrect = paragraphe[i].InnerText;
    List<string> errorInsideParagraph = new List<string>();
    errorInsideParagraph = callProlexis(HtmlEntity.DeEntitize(texteToCorrect), nbError, DicoError);

    for (int j = 0; j < motEnErreur.Count; j++)
    {
        HtmlNode spanNode = html.CreateElement("span");
        spanNode.Attributes.Add("class", typeError);
        spanNode.Attributes.Add("id", nbError.ToString());
        spanNode.Attributes.Add("oncontextmenu", "rightClickMustWork(event, this);return false");
    }
}
I manage to send the InnerText to my corrector. The issue I have is, say my InnerText for this paragraph is:
<p>this is some text <em>error</em> how should this work</p>
In this one, two words are in error: error and should.
How can I add my spanNode so it keeps the <em></em> around error? (I need to keep the actual tag around the word in error if there is one already, and just wrap the spanNode around it.)
So the expected result will be :
<p>this is some text <span ...><em>error</em></span> how <span ...>should</span> this work</p>
Edit: I was thinking of something like finding the word in error inside the InnerHtml, then getting the parent node of this word. If it is <p>, there is no tag around the word and we can just add the spanNode; if it is another tag, we need to add the spanNode as its parent node, such that spanNode is the child of <p> but the parent of the tag around this word. I'm not sure how to do it.
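Since there is no answer here, the following is a minimal sketch of the approach described in the edit, using the example paragraph from the question. The wordsInError list and the "typeError" class are placeholders for whatever the corrector returns (hypothetical names). If the word sits in a bare text node directly under <p>, the text node is split and the word wrapped in a <span>; if it already sits inside a tag such as <em>, that whole tag is wrapped instead:
using System;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

class WrapErrors
{
    static void Main()
    {
        var html = new HtmlDocument();
        html.LoadHtml("<p>this is some text <em>error</em> how should this work</p>");
        var p = html.DocumentNode.SelectSingleNode("//p");

        string[] wordsInError = { "error", "should" };   // hypothetical: would come from the corrector

        foreach (var word in wordsInError)
        {
            // Snapshot the descendants, because the loop modifies the tree.
            foreach (var node in p.Descendants().ToList())
            {
                if (node.NodeType != HtmlNodeType.Text ||
                    !Regex.IsMatch(node.InnerText, $@"\b{Regex.Escape(word)}\b"))
                    continue;

                if (node.ParentNode == p)
                {
                    // Case 1: plain text directly under <p> - split the text node
                    // and wrap only the word itself.
                    var parts = Regex.Split(node.InnerText, $@"(\b{Regex.Escape(word)}\b)");
                    foreach (var part in parts)
                    {
                        if (part.Length == 0)
                            continue;
                        HtmlNode newNode;
                        if (part == word)
                        {
                            newNode = html.CreateElement("span");
                            newNode.SetAttributeValue("class", "typeError");
                            newNode.AppendChild(html.CreateTextNode(part));
                        }
                        else
                        {
                            newNode = html.CreateTextNode(part);
                        }
                        p.InsertBefore(newNode, node);
                    }
                    node.Remove();
                }
                else
                {
                    // Case 2: the word already sits inside a tag (<em> here) -
                    // wrap that whole tag in the <span>.
                    var wrapped = node.ParentNode;
                    var span = html.CreateElement("span");
                    span.SetAttributeValue("class", "typeError");
                    wrapped.ParentNode.ReplaceChild(span, wrapped);
                    span.AppendChild(wrapped);
                }
            }
        }

        Console.WriteLine(p.OuterHtml);
        // <p>this is some text <span class="typeError"><em>error</em></span>
        //    how <span class="typeError">should</span> this work</p>
    }
}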

htmlagilitypack xpath incorrect

I have a problem: my XPath is not working.
I am trying to get the URLs from Google.com's search result list into a string list.
But I am unable to reach the URL using XPath.
Please help me correct my XPath. Also, tell me what should go in place of the ?????????.
HtmlWeb hw = new HtmlWeb();
List<string> urls = new List<string>();
HtmlAgilityPack.HtmlDocument doc = hw.Load("http://www.google.com/search?q=" + txtURL.Text.Replace(" ", "+"));

HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//div[@class='f kv']");
foreach (HtmlNode linkNode in linkNodes)
{
    HtmlAttribute link = linkNode.Attributes["?????????"];
    urls.Add(link.Value);
}

for (int i = 0; i <= urls.Count - 1; i++)
{
    if (urls.ElementAt(i) != null)
    {
        if (IsValid(urls.ElementAt(i)) != true)
        {
            grid.Rows.Add(urls.ElementAt(i));
        }
    }
}
The URLs seem to live in the cite element under the selected divs, so the XPath to select those is //div[@class='f kv']/cite.
Now, since these contain markup but you only want the text, select the InnerText of the selected nodes. Note that these do not begin with http://.
HtmlNodeCollection linkNodes =
    doc.DocumentNode.SelectNodes("//div[@class='f kv']/cite");

foreach (HtmlNode linkNode in linkNodes)
{
    string link = linkNode.InnerText;
    urls.Add(link);
}
The correct XPath is "//div[@class='kv']/cite". The f class you see in the browser element inspector is (probably) added after the page is rendered using javascript.
Also, the link text is not in an attribute; you can get it using the InnerText property of the <cite> element(s) obtained in the earlier step.
I changed these lines and it works:
var linkNodes = doc.DocumentNode.SelectNodes("//div[@class='kv']/cite");
foreach (HtmlNode linkNode in linkNodes)
{
    urls.Add(linkNode.InnerText);
}
There's a caveat though: some links are trimmed (you'll see a ... in the middle)
