I am using the HtmlAgilityPack.
It is an excellent tool for parsing data; however, every time I have used it I have had either a class or an id to aim at, e.g.:
string example = doc.DocumentNode.SelectSingleNode("//div[@class='target']").InnerText.Trim();
However, I have come across a piece of text that isn't nested in any particular pattern and has no class or id I can aim at, e.g.:
<p>Example Header</p>: This is the text I want!<br>
However, the example given does always follow the same pattern, i.e. the text will always be after </p>: and before <br>.
I can extract the text using a regular expression, but I would prefer to use the Agility Pack as the rest of the code follows suit. Is there a means of doing this using the pack?
This XPath works for me:
var html = @"<div class=""target"">
<p>Example Header</p>: This is the text I want!<br>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var result = doc.DocumentNode.SelectSingleNode("/div[@class='target']/text()[(normalize-space())]").OuterHtml;
Console.WriteLine(result);
/text() selects all text nodes that are direct children of the <div>.
[(normalize-space())] excludes all text nodes that contain only whitespace (two such nodes are excluded from this HTML sample: one before <p> and the other after <br>).
Result:
: This is the text I want!
UPDATE I:
All elements must have a parent, like the <div> in the example above. And if it is the root node you're talking about, the same approach should still work; the key is to use the /text() XPath to get the text node:
var html = @"<p>Example Header</p>: This is the text I want!<br>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var result = doc.DocumentNode.SelectSingleNode("/text()[(normalize-space())]").OuterHtml;
Console.WriteLine(result);
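As before, the OuterHtml of the selected text node should simply be ": This is the text I want!".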
UPDATE II:
OK, so you want to select the text node after the <p> element and before the <br> element. You can use this XPath then:
var result = doc.DocumentNode
    .SelectSingleNode("/text()[following-sibling::br and preceding-sibling::p]")
    .OuterHtml;
Related
I'm trying to build a web scraping tool for a news website. I'm having problems selecting the relevant text, since the text is divided across multiple different elements. I'm using HTML Agility Pack and I have tried to select text (//text()) from the main div, but when I do this I get a lot of garbage text I don't want, like JavaScript code.
How can I select text from some nested elements and ignore other elements?
<div class="texto_container paywall">
Some text I want
<a href="https://www.sabado.pt/sabermais/ana-gomes" target="_blank" rel="noopener">
Text I want
</a>
sample of text I want
<em>
another text i want
</em>
<aside class="multimediaEmbed contentRight">
A lot of nested elements here with some text I dont want
</aside>
<div class="inContent">
A lot of nested elements here with some text I don't want
</div>
Back to the text I want!
<twitter-widget class="twitter-tweet twitter-tweet-rendered" id="twitter-widget-0" >
Don't want any of this text located in nested elements!
</twitter-widget>
<p>
Final revelant text i want to collect!
</p>
</div>
EDIT
I tried to use XPath to exclude the tags I don't want, but I still get text nodes from those tags in the result.
var parse_me = htmlDoc.DocumentNode.SelectNodes("//div[@class='texto_container paywall']//text()[not(parent::aside)][not(parent::div[@class='inContent'])][not(parent::twitter-widget)]");
I think this doesn't work because, for the tags I don't want to include, the parent of the text nodes isn't the excluded tag itself; the text is buried inside a lot of nested tags.
EDIT
After some thinking and some research I fixed the previous problem by using ancestor:: instead of parent::, and that got rid of some of the unwanted text.
But I still can't get rid of the twitter-widget text, because selecting it always returns a null node, even with the XPath copied from the Google Chrome inspect-element tool.
var Twitter_Node = htmlDoc.DocumentNode.SelectSingleNode("//*[@id='twitter - widget - 0']");
This gets returned as null. How is this possible? The XPath was copied from Chrome.
You can try to exclude the text from specific tags:
//body//text()[not(parent::aside)][not(parent::div[@class="inContent"])][not(parent::twitter-widget)]
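As noted in the question's later edit, swapping parent:: for ancestor:: also excludes text nested deeper inside those tags. A minimal sketch of applying that variant with SelectNodes (untested against the real page; it assumes System.Linq is imported and htmlDoc is the loaded HtmlDocument):
var textNodes = htmlDoc.DocumentNode.SelectNodes(
    "//div[@class='texto_container paywall']//text()" +
    "[not(ancestor::aside)][not(ancestor::div[@class='inContent'])]" +
    "[not(ancestor::twitter-widget)][normalize-space()]");
if (textNodes != null)
{
    // Join the surviving text nodes into a single string.
    var wanted = string.Join(" ", textNodes.Select(n => n.InnerText.Trim()));
    Console.WriteLine(wanted);
}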
You could use concat but it's more complicated since you have to know the number and the position of each tag in the "chain":
concat(//body//div[@class="texto_container paywall"]/text()[1],//body//a[@href]/text(),//body//div[@class="texto_container paywall"]/text()[2],//body//em/text(),//body//div[@class="texto_container paywall"]/text()[5],//body//p/text())
I am using the ScrapySharp NuGet package, which adds the CssSelect extension method used in my sample below. (It's possible HtmlAgilityPack offers the same functionality built in; I am just used to ScrapySharp from years ago.)
You can simply extract all of the text you do NOT want, and then replace its occurrences in the main div's text with an empty string, removing it from the final result.
var doc = new HtmlDocument();
doc.Load(@"C:\Desktop\z.html"); // I created an html file with your sample HTML set as the html body
List<string> textsIWant = new List<string>();
var textsIdoNotWant = new List<string>();
// text I do not want
var aside = doc.DocumentNode.CssSelect(".multimediaEmbed.contentRight").FirstOrDefault();
if (aside != null)
{
    textsIdoNotWant.Add(aside.InnerText);
}
var inContent = doc.DocumentNode.CssSelect(".inContent").FirstOrDefault();
if (inContent != null)
{
    textsIdoNotWant.Add(inContent.InnerText);
}
var twitterWidget = doc.DocumentNode.CssSelect("#twitter-widget-0").FirstOrDefault();
if (twitterWidget != null)
{
    textsIdoNotWant.Add(twitterWidget.InnerText);
}
var div = doc.DocumentNode.CssSelect(".texto_container.paywall").FirstOrDefault();
if (div != null)
{
    var text = div.InnerText;
    foreach (var textIDoNotWant in textsIdoNotWant)
    {
        text = text.Replace(textIDoNotWant, string.Empty);
    }
    textsIWant.Add(text);
}
foreach (var text in textsIWant)
    Console.WriteLine(text);
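If you would rather stay with plain HtmlAgilityPack instead of pulling in ScrapySharp, the same nodes can likely be selected with XPath predicates in place of CssSelect; a rough equivalent sketch (class and id values are taken from the sample HTML above, so adjust them if the real page differs):
var aside = doc.DocumentNode.SelectSingleNode("//aside[contains(@class, 'multimediaEmbed')]");
var inContent = doc.DocumentNode.SelectSingleNode("//div[@class='inContent']");
var twitterWidget = doc.DocumentNode.SelectSingleNode("//*[@id='twitter-widget-0']");
var mainDiv = doc.DocumentNode.SelectSingleNode("//div[@class='texto_container paywall']");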
I am trying to get the span called Start from Here.
Chrome gives me this XPath: //*[@id="guide-pages"]/div[2]/div[1]/div/div[1]/div/div/div[2]/div/div[3]/div[2]/div[1]/h2
But HtmlAgilityPack returns null. After I tried removing the steps one by one, this works: //*[@id="guide-pages"]/div[2]/div[1], but not the rest of them.
My full Code:
HtmlDocument doc = new HtmlDocument();
var text = await ReadUrl();
doc.LoadHtml(text);
Console.WriteLine($"Getting Data From: {doc.DocumentNode.SelectSingleNode("//head/title").InnerText}"); //Works fine
Console.WriteLine(doc.DocumentNode.SelectSingleNode("//*[@id='guide-pages']/div[2]/div[1]/div/div[1]/div/div/div[2]/div/div[3]/div[2]/div[1]/h2") == null);
Output:
Getting Data From: Miss Fortune Build Guide : [7.11] KOREAN MF Build - Destroy the Carry! [Added Support] :: League of Legends Strategy Builds
True
Don't use the XPath from Chrome. Use LINQ in HtmlAgilityPack instead. For example,
.Descendants("div") will give you all the divs under one HTML node. Each HTML node has metadata like its id and attributes (classes, ...), and you can query for the div you want from there.
Here is one handy extension method to check whether an HtmlNode has certain classes or not.
public static bool HasClass(this HtmlNode node, params string[] classValueArray)
{
    var classValue = node.GetAttributeValue("class", "");
    var classValues = classValue.Split(' ');
    return classValueArray.All(c => classValues.Contains(c));
}
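A hypothetical usage sketch of this approach (it needs System.Linq; the id "guide-pages" comes from the question's XPath, while treating the target heading as an <h2> whose text starts with "Start" is only a guess, since the real markup isn't shown):
var guide = doc.DocumentNode
    .Descendants()
    .FirstOrDefault(d => d.GetAttributeValue("id", "") == "guide-pages");
var header = guide?.Descendants("h2")
    .FirstOrDefault(h => h.InnerText.Trim().StartsWith("Start"));
// The HasClass helper above works the same way, e.g.
// .FirstOrDefault(d => d.HasClass("someClass", "anotherClass"))
Console.WriteLine(header?.InnerText.Trim());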
The HTML string contains exactly the following:
<div class="XKa d-k-l"><span class="VTb d-k-l"></span></div><div class="pha d-k-l"><div ><div>Hello World </div>
I want to retrieve the Hello World text from the div.
I am using HtmlAgilityPack.
var item = HTMLContent.DocumentNode
    .SelectSingleNode("//div[@class='XKa d-k-l']//span[@class='VTb d-k-l']//div[@class='pha d-k-l']")
    .InnerHtml;
Exception: Object reference not set to an instance of an object. I can't figure out the correct syntax. I'd appreciate your help.
div[@class='pha d-k-l'] is not a descendant of div[@class='XKa d-k-l']; the relationship is siblings rather than ancestor-descendant. You can try using the following-sibling axis like so:
//div[@class='XKa d-k-l']/following-sibling::div[@class='pha d-k-l']
Working demo example:
var html = @"<div class='XKa d-k-l'><span class='VTb d-k-l'></span></div><div class='pha d-k-l'><div><div>Hello World </div></div></div>";
var HTMLContent = new HtmlDocument();
HTMLContent.LoadHtml(html);
var item = HTMLContent.DocumentNode
    .SelectSingleNode("//div[@class='XKa d-k-l']/following-sibling::div[@class='pha d-k-l']").InnerHtml;
Console.WriteLine(item);
Output:
<div><div>Hello World </div></div>
Update:
You can add the span check like this:
//div[@class='XKa d-k-l'][span/@class='VTb d-k-l']/following-sibling::div[@class='pha d-k-l']
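Plugged into the demo above, that would look roughly like this (the null-conditional guards against the case where nothing matches):
var item = HTMLContent.DocumentNode
    .SelectSingleNode("//div[@class='XKa d-k-l'][span/@class='VTb d-k-l']/following-sibling::div[@class='pha d-k-l']")
    ?.InnerHtml;
Console.WriteLine(item);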
I am using the Html Agility Pack to select textual data from within RSS XML. For every other node type (title, pubDate, guid, etc.) I can select the inner text using XPath conventions; however, when querying "//link" or indeed "item/link", empty strings are returned.
public static IEnumerable<string> ExtractAllLinks(string rssSource)
{
    //Create a new document.
    var document = new HtmlDocument();
    //Populate the document with an rss file.
    document.LoadHtml(rssSource);
    //Select out all of the required nodes.
    var itemNodes = document.DocumentNode.SelectNodes("item/link");
    //If zero nodes were found, return an empty list, otherwise return the content of those nodes.
    return itemNodes == null ? new List<string>() : itemNodes.Select(itemNode => itemNode.InnerText).ToList();
}
Does anybody have an understanding of why this element behaves differently to the others?
Additional: Running "item/link" returns zero nodes. Running "//link" returns the correct number of nodes, however the inner text is zero characters in length.
Using the test data below, "//name" returns a single record for "Fred", however with "//link" a single record with an empty string is returned.
<site><link>Hello World</link><name>Fred</name></site>
I am certain it's because of the word "link". If I change it to "linkz" it works perfectly.
The workaround below works perfectly; however, I would like to understand why searching on "//link" does not work the way other elements do.
public static IEnumerable<string> ExtractAllLinks(string rssSource)
{
    rssSource = rssSource.Replace("<link>", "<link-renamed>");
    rssSource = rssSource.Replace("</link>", "</link-renamed>");
    //Create a new document.
    var document = new HtmlDocument();
    //Populate the document with an rss file.
    document.LoadHtml(rssSource);
    //Select out all of the required nodes.
    var itemNodes = document.DocumentNode.SelectNodes("//link-renamed");
    //If zero nodes were found, return an empty list, otherwise return the content of those nodes.
    return itemNodes == null ? new List<string>() : itemNodes.Select(itemNode => itemNode.InnerText).ToList();
}
If you print the DocumentNode.OuterHtml, you will see the problem:
var html = @"<site><link>Hello World</link><name>Fred</name></site>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
Console.WriteLine(doc.DocumentNode.OuterHtml);
Output:
<site><link>Hello World<name>Fred</name></site>
link happens to be one of several special tags* that are treated as self-closing tags by HAP. You can alter this behavior by adjusting ElementsFlags before parsing the HTML, for example:
var html = @"<site><link>Hello World</link><name>Fred</name></site>";
HtmlNode.ElementsFlags.Remove("link"); //remove link from list of special tags
var doc = new HtmlDocument();
doc.LoadHtml(html);
Console.WriteLine(doc.DocumentNode.OuterHtml);
var links = doc.DocumentNode.SelectNodes("//link");
foreach (HtmlNode link in links)
{
Console.WriteLine(link.InnerText);
}
Dotnetfiddle Demo
Output:
<site><link>Hello World</link><name>Fred</name></site>
Hello World
*) The complete list of special tags besides link that are included in the ElementsFlags dictionary by default can be seen in the source code of HtmlNode.cs. Some of the most common among them are <meta>, <img>, <frame>, <input>, <form>, <option>, etc.
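If you want to check at runtime how HAP will treat a particular tag, you can inspect that dictionary directly, for example:
// "link" is flagged as Empty (self-closing) by default, which is why its inner text was lost.
if (HtmlNode.ElementsFlags.TryGetValue("link", out HtmlElementFlag flag))
{
    Console.WriteLine($"link -> {flag}");
}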
I'm trying to parse stock exchange information out of an HTML document with a simple piece of C#. The problem is that I cannot get my head around the syntax: the tr class="LomakeTaustaVari" rows get parsed out, but how do I get the second kind of row, which has no tr class?
Here's a piece of the HTML; it repeats itself with different values.
<tr class="LomakeTaustaVari">
<td><div class="Ensimmainen">12:09</div></td>
<td><div>MSI</div></td>
<td><div>POH</div></td>
<td><div>42</div></td>
<td><div>64,50</div></td>
</tr>
<tr>
<td><div class="Ensimmainen">12:09</div></td>
<td><div>SRE</div></td>
<td><div>POH</div></td>
<td><div>156</div></td>
<td><div>64,50</div></td>
</tr>
My C# code:
{
    HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = web.Load("https://www.op.fi/op/henkiloasiakkaat/saastot-ja-sijoitukset/kurssit-ja-markkinat/markkinat?sivu=alltrades.html&sym=KNEBV.HSE&from=10:00&to=19:00&id=32453");
    foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//tr[@class='LomakeTaustaVari']"))
    {
        Console.WriteLine(row.InnerText);
    }
    Console.ReadKey();
}
Try to use the following XPath, //tr[preceding-sibling::tr[@class='LomakeTaustaVari']]:
var nodes = doc.DocumentNode.SelectNodes("//tr[preceding-sibling::tr[@class='LomakeTaustaVari']]");
It should select tr nodes that have a preceding sibling tr with the class LomakeTaustaVari.
Just FYI: if no nodes are found, the SelectNodes method returns null.
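A minimal defensive sketch of that, guarding against the null result before iterating:
var nodes = doc.DocumentNode.SelectNodes("//tr[preceding-sibling::tr[@class='LomakeTaustaVari']]");
if (nodes != null)
{
    foreach (HtmlNode row in nodes)
    {
        Console.WriteLine(row.InnerText);
    }
}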
If you manage to get a reference to the <tr class="LomakeTaustaVari"> element, I see two possible solutions.
You can navigate to the parent and then find all its <tr> children:
lomakeTaustaVariElement.Parent.SelectNodes("tr"); // iterate over these if needed
You can also use NextSibling to get the next <tr>:
var trWithoutClass = lomakeTaustaVariElement.NextSibling;
Please note that using the second alternative you may run into issues, because whitespace present in the HTML may be parsed as a distinct (text) node.
To overcome this, you can repeatedly call NextSibling until you encounter a tr element, as in the sketch below.
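A minimal sketch of that loop, assuming lomakeTaustaVariElement is the <tr class="LomakeTaustaVari"> node obtained earlier:
var sibling = lomakeTaustaVariElement.NextSibling;
// Skip whitespace/text nodes until the next real <tr> element (or the end of the siblings) is reached.
while (sibling != null && sibling.Name != "tr")
{
    sibling = sibling.NextSibling;
}
// sibling now points at the class-less <tr>, or is null if there is none.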
This will iterate over all tr rows in the document. You will probably also need to be more specific with the starting node, so that you only select the rows you are interested in.
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//tr"))
{
Console.WriteLine(row.InnerText);
}
Perhaps I don't understand something, but the simplest XPath for selecting any tr element should do the job:
doc.DocumentNode.SelectNodes("//tr")
Otherwise, if you would like to select only elements with specific class attributes, it could be:
doc.DocumentNode.SelectNodes("//tr[@class = 'someClass1' or @class = 'someClass2']")
If you do not want to load the page and would rather use a ready HTML string, e.g. from a WebBrowser control, you can use the following example:
var web = new HtmlAgilityPack.HtmlDocument();
web.LoadHtml(webBrowser1.Document.Body.Parent.OuterHtml);
var q = web.DocumentNode.SelectNodes("/html/body/div[2]/div/div[1]"); // XPath: /html/body/div[2]/div/div[1]
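As with SelectNodes elsewhere, q will be null if the positional XPath matches nothing, so it is worth guarding before use:
if (q != null)
{
    foreach (HtmlNode node in q)
    {
        Console.WriteLine(node.InnerText);
    }
}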