Retrieving value of element using HTMLAgility pack - c#

I am using HTMLAgility pack to parse html and then using xpath retrieve a table column with a specific class.
HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("www.url.com");
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("(//td[#class='titleColumn'])[2]"))
{
Response.Write(row.InnerHtml + "<br />");
}
I retrieve the data and row.Innerhtml looks like this.
<a>Title</a> <span>Year</span><br />
I want to save the value of a and span element in separate string variables. Please help

Your xpath expression selects the second <td> that has the class titleColumn. According to the node's inner html, this <td> hode has two child nodes: <a> and <span>. So you could easily find these nodes, and then put inner text (or inner html) into your string variables. See, this:
foreach (var row in doc.DocumentNode.SelectNodes("(//td[#class='titleColumn'])[2]"))
{
var a = row.SelectSingleNode("a");
var span = row.SelectSingleNode("span");
Console.WriteLine(a.InnerText);
Console.WriteLine(span.InnerText);
}
will output:
Title
Year

Related

Parse HTML class in individual items with htmlagilitypack

I want to parse HTML, I used the following code but I get all of it in one item instead of getting the items individually
var url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
var web = new HtmlWeb();
var doc = web.Load(url);
IEnumerable<HtmlNode> nodes =
doc.DocumentNode.Descendants()
.Where(n => n.HasClass("search-result"));
foreach (var item in nodes)
{
string itemx = item.SelectSingleNode(".//a").Attributes["href"].Value;
MessageBox.Show(itemx);
MessageBox.Show(item.InnerText);
}
I only receive 1 message for the first item and the second message displays all items
When you search the data from the url based on class 'search-result', there is only one node that is returned. Instead of iterating through its children, you only go through that one div, which is why you are only getting one result.
If you want to get a list of all the links inside the div with class "search-result", then you can do the following.
Code:
string url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
List<string> listOfUrls = new List<string>();
HtmlNode searchResult = doc.DocumentNode.SelectSingleNode("//div[#class='search-result']");
// Iterate through all the child nodes that have the 'a' tag.
foreach (HtmlNode node in searchResult.SelectNodes(".//a"))
{
string thisUrl = node.GetAttributeValue("href", "");
if (!string.IsNullOrEmpty(thisUrl) && !listOfUrls.Contains(thisUrl))
listOfUrls.Add(thisUrl);
}
What does it do?
SelectSingleNode("//div[#class='search-result']") -> retrieves the div that has all the search results and ignores the rest of the document.
Iterates through all the "subnodes" only that have href in it and adds it to a list. Subnodes are determined based on the dot notation SelectNodes(".//a") (Instead of .//, if you do //, it will search the entire page which is not what you want).
If statement makes sure its only adding unique non-null values.
You have all the links now.
Fiddle: https://dotnetfiddle.net/j5aQFp
I think it's how you're looking up and storing the data. Try:
foreach (HtmlNode link doc.DocumentNode.SelectNodes("//a[#href]"))
{
string hrefValue = link.GetAttributeValue( "href", string.Empty );
MessageBox.Show(hrefValue);
MessageBox.Show(link.InnerText);
}

how do i get all the value of a table from a website

string Url = "http://www.dsebd.org/latest_share_price_scroll_l.php";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
string a = doc.DocumentNode.SelectNodes("//iframe*[#src=latest_share_price_all\"]//html/body/div/table/tbody")[0].InnerText;
i have tried, but null value found in string a.
Ok this one confused me for a while but I've got it now. Instead of pulling the whole page from http://www.dsebd.org/latest_share_price_scroll_l.php, you can get just the table data from http://www.dsebd.org/latest_share_price_all.php.
There was some strange behaviour with trying to select child elements of the #document node under the iframe element. Someone with more xpath experience might be able to explain this.
Now you can get all the table row nodes by using the following xpath:
string url = "http://www.dsebd.org/latest_share_price_all.php";
HtmlDocument doc = new HtmlWeb().Load(url);
HtmlNode docNode = doc.DocumentNode;
var nodes = docNode.SelectNodes("//body/div/table/tr");
That will give you all the table row nodes. Then you need to go through each node you just got and get the values you want.
Just for example if you wanted to get the trading code, high, and volume you would do the following:
//Remove the first node because it is the header row at the top of the table
nodes.RemoveAt(0);
foreach(HtmlNode rowNode in nodes)
{
HtmlNode tradingCodeNode = rowNode.SelectSingleNode("td[2]/a");
string tradingCode = tradingCodeNode.InnerText;
HtmlNode highNode = rowNode.SelectSingleNode("td[4]");
string highValue = highNode.InnerText;
HtmlNode volumeNode = rowNode.SelectSingleNode("td[11]");
string volumeValue = volumeNode.InnerText;
//Do whatever you want with the values here
//Put them in a class or add them to a list
}
XPath uses 1-based indices so when you are referring to a particular cell in a table row by number the first element is at index 1, instead of using index 0 as in a C# array.

Get text that lies after pattern without class or id

I am using the HtmlAgiityPack.
It is an excellent tool for parsing data, however every instance I have used it, I have always had either a class or id to aim at, i.e. -
string example = doc.DocumentNode.SelectSingleNode("//div[#class='target']").InnerText.Trim();
However I have come across a piece of text that isn't nested in any particular pattern with a class or id I can aim at. E.g. -
<p>Example Header</p>: This is the text I want!<br>
However the example given does always following the same patter i.e. the text will always be after </p>: and before <br>.
I can extract the text using a regular expression however would prefer to use the agility pack as the rest of the code follows suit. Is there a means of doing this using the pack?
This XPath works for me :
var html = #"<div class=""target"">
<p>Example Header</p>: This is the text I want!<br>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var result = doc.DocumentNode.SelectSingleNode("/div[#class='target']/text()[(normalize-space())]").OuterHtml;
Console.WriteLine(result);
/text() select all text nodes that is direct child of the <div>
[(normalize-space())] exclude all text nodes those contain only white
spaces (there are 2 new lines excluded from this html sample : one before <p> and the other after <br>)
Result :
UPDATE I :
All element must have a parent, like <div> in above example. Or if it is the root node you're talking about, the same approach should still work. The key is to use /text() XPath to get text node :
var html = #"<p>Example Header</p>: This is the text I want!<br>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var result = doc.DocumentNode.SelectSingleNode("/text()[(normalize-space())]").OuterHtml;
Console.WriteLine(result);
UPDATE II :
Ok, so you want to select text node after <p> element and before <br> element. You can use this XPath then :
var result =
doc.DocumentNode
.SelectSingleNode("/text()[following-sibling::br and preceding-sibling::p]")
.OuterHtml;

C# parse html with xpath

I'm trying to parse out stock exchange information whit a simple piece of C# from a HTML document. The problem is that I can not get my head around the syntax, the tr class="LomakeTaustaVari" gets parsed out but how do I get the second bit that has no tr-class?
Here's a piece of the HTML, it repeats it self whit different values.
<tr class="LomakeTaustaVari">
<td><div class="Ensimmainen">12:09</div></td>
<td><div>MSI</div></td>
<td><div>POH</div></td>
<td><div>42</div></td>
<td><div>64,50</div></td>
</tr>
<tr>
<td><div class="Ensimmainen">12:09</div></td>
<td><div>SRE</div></td>
<td><div>POH</div></td>
<td><div>156</div></td>
<td><div>64,50</div></td>
</tr>
My C# code:
{
HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load ("https://www.op.fi/op/henkiloasiakkaat/saastot-ja-sijoitukset/kurssit-ja-markkinat/markkinat?sivu=alltrades.html&sym=KNEBV.HSE&from=10:00&to=19:00&id=32453");
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//tr[#class='LomakeTaustaVari']"))
{
Console.WriteLine(row.InnerText);
}
Console.ReadKey();
}
Try to use the next xpath //tr[preceding-sibling::tr[#class='LomakeTaustaVari']]:
var nodes = doc.DocumentNode.SelectNodes("//tr[preceding-sibling::tr[#class='LomakeTaustaVari']]");
It should select nodes that have preceding node tr with class LomakeTaustaVari.
Just FYI: if no nodes found, SelectNodes method returns null.
If you manage to get a reference to the <tr class="LomakeTaustaVari"> element, I see two possible solutions.
You can navigate to the parent and then find all its <tr> children:
lomakeTaustaVariElement.Parent.SelectNodes("tr"); // iterate over these if needed
You can also use NextSibling to get the next <tr>:
var trWithoutClass = lomakeTaustaVariElement.NextSibling;
Please note that using the second alternative you may run into issues, because whitespace present in the HTML may be interpreted as being a distinct element.
To overcome this, you may recursively call NextSibling until you encounter a tr element.
This will iterate over all nodes in document. You will probably also need to be more specific with starting node, so you will only select that you are interested in.
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//tr"))
{
Console.WriteLine(row.InnerText);
}
Probably I don't understand something, but the simplest XPath for any tr element selection should do the work:
doc.DocumentNode.SelectNodes("//tr")
Otherwise, in case you would like to select elements with specific class attributes only, it could be:
doc.DocumentNode.SelectNodes("//tr[#class = 'someClass1' or #class = 'someClass2']")
If you do not like to load the page and want to use a ready html string, e.g. from a WebBrowser element, you can use the following example:
var web = new HtmlAgilityPack.HtmlDocument();
web.LoadHtml(webBrowser1.Document.Body.Parent.OuterHtml);
var q = web.DocumentNode.SelectNodes("/html/body/div[2]/div/div[1]") //XPath /html/body/div[2]/div/div[1]

HtmlAgilityPack set node InnerText

I want to replace inner text of HTML tags with another text.
I am using HtmlAgilityPack
I use this code to extract all texts
HtmlDocument doc = new HtmlDocument();
doc.Load("some path")
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()[normalize-space(.) != '']")) {
// How to replace node.InnerText with some text ?
}
But InnerText is readonly. How can I replace texts with another text and save them to file ?
Try code below. It select all nodes without children and filtered out script nodes. Maybe you need to add some additional filtering. In addition to your XPath expression this one also looking for leaf nodes and filter out text content of <script> tags.
var nodes = doc.DocumentNode.SelectNodes("//body//text()[(normalize-space(.) != '') and not(parent::script) and not(*)]");
foreach (HtmlNode htmlNode in nodes)
{
htmlNode.ParentNode.ReplaceChild(HtmlTextNode.CreateNode(htmlNode.InnerText + "_translated"), htmlNode);
}
Strange, but I found that InnerHtml isn't readonly. And when I tried to set it like that
aElement.InnerHtml = "sometext";
the value of InnerText also changed to "sometext"
The HtmlTextNode class has a Text property* which works perfectly for this purpose.
Here's an example:
var textNodes = doc.DocumentNode.SelectNodes("//body/text()").Cast<HtmlTextNode>();
foreach (var node in textNodes)
{
node.Text = node.Text.Replace("foo", "bar");
}
And if we have an HtmlNode that we want to change its direct text, we can do something like the following:
HtmlNode node = //...
var textNode = (HtmlTextNode)node.SelectSingleNode("text()");
textNode.Text = "new text";
Or we can use node.SelectNodes("text()") in case it has more than one.
* Not to be confused with the readonly InnerText property.

Categories