C# parse html with xpath - c#

I'm trying to parse stock exchange information out of an HTML document with a simple piece of C#. The problem is that I cannot get my head around the syntax: the rows with tr class="LomakeTaustaVari" get parsed out, but how do I get the second kind of row, which has no class on the tr?
Here's a piece of the HTML; it repeats itself with different values.
<tr class="LomakeTaustaVari">
<td><div class="Ensimmainen">12:09</div></td>
<td><div>MSI</div></td>
<td><div>POH</div></td>
<td><div>42</div></td>
<td><div>64,50</div></td>
</tr>
<tr>
<td><div class="Ensimmainen">12:09</div></td>
<td><div>SRE</div></td>
<td><div>POH</div></td>
<td><div>156</div></td>
<td><div>64,50</div></td>
</tr>
My C# code:
{
HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load ("https://www.op.fi/op/henkiloasiakkaat/saastot-ja-sijoitukset/kurssit-ja-markkinat/markkinat?sivu=alltrades.html&sym=KNEBV.HSE&from=10:00&to=19:00&id=32453");
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//tr[@class='LomakeTaustaVari']"))
{
Console.WriteLine(row.InnerText);
}
Console.ReadKey();
}

Try the following XPath, //tr[preceding-sibling::tr[@class='LomakeTaustaVari']]:
var nodes = doc.DocumentNode.SelectNodes("//tr[preceding-sibling::tr[@class='LomakeTaustaVari']]");
It should select the tr nodes that have a preceding sibling tr with the class LomakeTaustaVari.
Just FYI: if no nodes are found, the SelectNodes method returns null.
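Because SelectNodes returns null rather than an empty collection, a foreach placed directly over its result throws when nothing matches. Here is a minimal sketch of a guard, assuming the HtmlAgilityPack NuGet package is referenced. The RowReader class and its RowTexts helper are hypothetical names made up for this illustration.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class RowReader
{
    // Hypothetical helper: returns the trimmed text of every node matched
    // by the XPath, or an empty list when the XPath matches nothing.
    public static List<string> RowTexts(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes(xpath);
        if (rows == null)
        {
            return result; // no match: SelectNodes returned null
        }

        foreach (HtmlNode row in rows)
        {
            result.Add(row.InnerText.Trim());
        }
        return result;
    }
}
```

Calling RowReader.RowTexts(html, "//tr") on the table from the question would then return one entry per row, with and without the class, and an XPath that matches nothing yields an empty list instead of an exception.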

If you manage to get a reference to the <tr class="LomakeTaustaVari"> element, I see two possible solutions.
You can navigate to the parent and then find all its <tr> children:
lomakeTaustaVariElement.ParentNode.SelectNodes("tr"); // iterate over these if needed
You can also use NextSibling to get the next <tr>:
var trWithoutClass = lomakeTaustaVariElement.NextSibling;
Please note that with the second alternative you may run into issues, because whitespace in the HTML may be parsed as a separate text node.
To overcome this, keep calling NextSibling until you encounter a tr element.
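The sibling walk described above could be sketched as follows, assuming the HtmlAgilityPack NuGet package is referenced. SiblingHelper and NextElement are hypothetical names for this illustration, not part of the library.

```csharp
using HtmlAgilityPack;

public static class SiblingHelper
{
    // Hypothetical helper: follows NextSibling until it reaches an element
    // with the given tag name, skipping whitespace text nodes and comments.
    public static HtmlNode NextElement(HtmlNode node, string tagName)
    {
        HtmlNode current = node.NextSibling;
        while (current != null &&
               !(current.NodeType == HtmlNodeType.Element && current.Name == tagName))
        {
            current = current.NextSibling;
        }
        return current; // null when no such sibling exists
    }
}
```

A loop is used instead of recursion so that long runs of siblings do not grow the call stack. The caller still has to handle the null returned after the last row.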

This will iterate over all tr nodes in the document. You will probably also need to be more specific about the starting node, so that you select only the rows you are interested in.
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//tr"))
{
Console.WriteLine(row.InnerText);
}

Probably I don't understand something, but the simplest XPath for any tr element selection should do the work:
doc.DocumentNode.SelectNodes("//tr")
Otherwise, in case you would like to select elements with specific class attributes only, it could be:
doc.DocumentNode.SelectNodes("//tr[@class = 'someClass1' or @class = 'someClass2']")

If you do not want to load the page and would rather use a ready HTML string, e.g. from a WebBrowser control, you can use the following example:
var web = new HtmlAgilityPack.HtmlDocument();
web.LoadHtml(webBrowser1.Document.Body.Parent.OuterHtml);
var q = web.DocumentNode.SelectNodes("/html/body/div[2]/div/div[1]"); // XPath /html/body/div[2]/div/div[1]

Related

Get text that lies after pattern without class or id

I am using the HtmlAgilityPack.
It is an excellent tool for parsing data; however, every time I have used it I have had either a class or an id to aim at, i.e. -
string example = doc.DocumentNode.SelectSingleNode("//div[@class='target']").InnerText.Trim();
Now I have come across a piece of text that isn't nested in any particular pattern with a class or id I can aim at. E.g. -
<p>Example Header</p>: This is the text I want!<br>
The example given does always follow the same pattern, i.e. the text will always be after </p>: and before <br>.
I can extract the text using a regular expression, but I would prefer to use the agility pack as the rest of the code follows suit. Is there a means of doing this using the pack?
This XPath works for me :
var html = @"<div class=""target"">
<p>Example Header</p>: This is the text I want!<br>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var result = doc.DocumentNode.SelectSingleNode("/div[@class='target']/text()[(normalize-space())]").OuterHtml;
Console.WriteLine(result);
/text() selects all text nodes that are direct children of the <div>.
[(normalize-space())] excludes all text nodes that contain only whitespace (two newlines are excluded from this HTML sample: one before <p> and the other after <br>).
Result :
: This is the text I want!
UPDATE I :
Every element must have a parent, like the <div> in the example above. If the text sits directly under the root node instead, the same approach should still work. The key is to use the /text() XPath to get the text node:
var html = @"<p>Example Header</p>: This is the text I want!<br>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var result = doc.DocumentNode.SelectSingleNode("/text()[(normalize-space())]").OuterHtml;
Console.WriteLine(result);
UPDATE II :
Ok, so you want to select text node after <p> element and before <br> element. You can use this XPath then :
var result =
doc.DocumentNode
.SelectSingleNode("/text()[following-sibling::br and preceding-sibling::p]")
.OuterHtml;

Retrieving value of element using HTMLAgility pack

I am using HtmlAgilityPack to parse HTML and then using XPath to retrieve a table column with a specific class.
HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("www.url.com");
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("(//td[@class='titleColumn'])[2]"))
{
Response.Write(row.InnerHtml + "<br />");
}
I retrieve the data and row.InnerHtml looks like this.
<a>Title</a> <span>Year</span><br />
I want to save the values of the a and span elements in separate string variables. Please help.
Your XPath expression selects the second <td> that has the class titleColumn. According to the node's inner HTML, this <td> node has two child elements: <a> and <span>. So you can easily find these nodes and then put their inner text (or inner HTML) into your string variables. See this:
foreach (var row in doc.DocumentNode.SelectNodes("(//td[@class='titleColumn'])[2]"))
{
var a = row.SelectSingleNode("a");
var span = row.SelectSingleNode("span");
Console.WriteLine(a.InnerText);
Console.WriteLine(span.InnerText);
}
will output:
Title
Year

Using HTMLAgilityPack to get all the values of a select element

Here is what I have so far:
HtmlAgilityPack.HtmlDocument ht = new HtmlAgilityPack.HtmlDocument();
TextReader reader = File.OpenText(@"C:\Users\TheGateKeeper\Desktop\New folder\html.txt");
ht.Load(reader);
reader.Close();
HtmlNode select= ht.GetElementbyId("cats[]");
List<HtmlNode> options = new List<HtmlNode>();
foreach (HtmlNode option in select.ChildNodes)
{
if (option.Name == "option")
{
options.Add(option);
}
}
Now I have a list of all the "options" for the select element. What properties do I need to access to get the key and the text?
So if for example the html for one option would be:
<option class="level-1" value="1">Funky Town</option>
I want to get as output:
1 - Funky Town
Thanks
Edit: I just noticed something. When I got the child nodes of the select element, it returned nodes of type "option" and nodes of type "#text".
Hmm... #text has the string I want, but select has the value.
I thought HtmlAgilityPack was an HTML parser? Why did it give me confusing values like this?
This is due to the default configuration of the HTML parser: it configures <option> as HtmlElementFlag.Empty (with the comment 'they sometimes contain, and sometimes they don't...'). The <form> tag has a similar setup (CanOverlap + Empty), which causes these elements to appear as empty nodes in the DOM, without any child nodes.
You need to remove that flag before parsing the document.
HtmlNode.ElementsFlags.Remove("option");
Note that the ElementsFlags property is static, so any change affects all further parsing.
Edit: you should probably select the option nodes directly via XPath. I think this should work:
var options = select.SelectNodes("option");
That will get your options without the text nodes. The options should contain the string you want somewhere. Waiting for your HTML sample.
foreach (var option in options)
{
int value = int.Parse(option.Attributes["value"].Value);
string text = option.InnerText;
}
 
You can add some sanity checking on the attribute to make sure it exists.
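One way such a sanity check might look, as a sketch only: it assumes the HtmlAgilityPack NuGet package is referenced and that HtmlNode.ElementsFlags.Remove("option") was called before the document was parsed, as described above. OptionReader and Describe are hypothetical names.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class OptionReader
{
    // Hypothetical helper: returns "value - text" for each <option> whose
    // value attribute is present and numeric; other options are skipped.
    public static List<string> Describe(HtmlNode select)
    {
        var lines = new List<string>();
        HtmlNodeCollection options = select.SelectNodes("./option");
        if (options == null)
        {
            return lines; // the select has no option children
        }

        foreach (HtmlNode option in options)
        {
            // GetAttributeValue falls back to "" when the attribute is missing.
            string raw = option.GetAttributeValue("value", "");
            int value;
            if (int.TryParse(raw, out value))
            {
                lines.Add(value + " - " + option.InnerText.Trim());
            }
        }
        return lines;
    }
}
```

Using GetAttributeValue with a default plus int.TryParse avoids both the null reference on a missing attribute and the FormatException that int.Parse would throw on a non-numeric value.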

Parse through current page

Is there a way to get a page to parse through itself?
So far I have:
string whatever = TwitterSpot.InnerHtml;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(whatever);
foreach("this is where I am stuck")
{
}
I want to parse the page, so I created a parent div named TwitterSpot, put its InnerHtml into a string, and loaded that as a new HtmlDocument.
Next I want to find a string value of "#XXXX+n " within it and replace it in the page with some cool formatting.
I am stuck on my foreach loop: I do not know how I should search for a # or how to look through the loaded HtmlDocument.
The next step is to apply the change wherever I have seen a # tag. I know I could probably do this a lot more easily in JavaScript, but I am adamant about seeing how I can get ASP.NET C# to do it.
The # is a string value within the HTML; I am not referring to it as a control ID.
Assuming you're using HtmlAgilityPack, you could use xpath to find text nodes which contain your value:
var matchedNodes = document.DocumentNode
.SelectNodes("//text()[contains(.,'#XXXX+n ')]");
Then you could just iterate through these nodes and make all the necessary replacements:
foreach (HtmlTextNode node in matchedNodes)
{
node.Text = node.Text.Replace("#XXXX+n ", "brand new text");
}
You can use http://htmlagilitypack.codeplex.com/ to parse HTML and manipulate its content; works very well.
I guess you could use RegEx to find all matches and loop through them.
You could just change it to be:
string whatever = TwitterSpot.InnerHtml;
whatever = whatever.Replace("#XXXX+n ", String.Format("<b>{0}</b>", "#XXXX+n "));
No parsing required...
When I did this before, I stored the HTML in an XML doc and looped through each node. You can then apply XSLT or just parse the nodes.
It sounds like for your purposes though that you don't really need to do that. I'd recommend making the divs into server controls and programmatically looping through their child controls, as such:
foreach (Control c in divSomething.Controls)
{
    // "as" yields null for child controls that are not text boxes
    TextBox textBox = c as TextBox;
    if (textBox != null && textBox.ID == "txtSomething")
    {
        textBox.Attributes.Add("style", "font-family: Arial; color: Red;");
    }
}

HtmlAgilityPack | Wrongly retrieved table nodes

Hi there I've just registered on this website, because I need some help.
I want to get results from the nyaa.eu website.
Basically:
Table Node is called <table class="tlist">
Every row node is called <tr class="tlistrow"> also sometimes it's 'trusted tlistrow' etc.
The Nodes I try to retrieve are: <td class="tlistname"> <td class="tlistsize"> <td class="tlistsn"> and <td class="tlistln">
Firstly I'm retrieving a table which contains all the info about torrents:
HtmlNode hnTable = doc.DocumentNode.SelectSingleNode("//table[@class='tlist']");
So, the next thing is retrieving all the rows which contain 'tlistrow' in their class attribute:
HtmlNodeCollection hncRows = hnTable.SelectNodes("//tr[contains(@class,'tlistrow')]");
And finally the problem is when I read every node it's always the same one:
foreach (HtmlNode row in hncRows)
{
foreach (HtmlNode child in row.ChildNodes)
{
if (child.SelectSingleNode("//td[@class='tlistname']") != null)
{
MessageBox.Show("Something found!\n\n" + child.SelectSingleNode("//td[@class='tlistname']").InnerText);
break;
}
}
}
The text displayed in the messagebox is always the same, it looks like it only selects one node multiple times.
How can I fix this or if I am doing anything wrong, please correct me.
foreach (HtmlNode child in row.ChildNodes)
{
if (child.SelectSingleNode("//td[@class='tlistname']") != null)
{
MessageBox.Show("Something found!\n\n" + child.SelectSingleNode("//td[@class='tlistname']").InnerText);
break;
}
}
You need to understand the difference between relative and absolute XPath expressions.
A relative XPath expression is evaluated against a specific node in the document, which serves as its initial context node.
An absolute XPath expression is evaluated against the whole document (its initial context node is the document node).
Any XPath expression that starts with the character / is an absolute XPath expression.
Based on the provided code, you want a relative XPath expression whose initial context node is the one contained in the variable named child.
The problem is that the expression you are using:
//td[@class='tlistname']
starts with / and is therefore an absolute XPath expression.
Passed to the SelectSingleNode() method, it always selects the first td element in the whole document that has a class attribute with the string value "tlistname".
Solution: Use a relative XPath expression, such as:
.//td[@class='tlistname']
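The difference can be seen side by side in a small sketch, assuming the HtmlAgilityPack NuGet package is referenced. XPathContextDemo and NameOfRow are hypothetical names, and the two-row table used in the usage note is made up for illustration.

```csharp
using HtmlAgilityPack;

public static class XPathContextDemo
{
    // Hypothetical helper: picks the row at rowIndex (0-based) and evaluates
    // cellXPath with that row as the context node. Returns the matched cell's
    // text, or null when the row or the cell cannot be found.
    public static string NameOfRow(string html, int rowIndex, string cellXPath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection rows =
            doc.DocumentNode.SelectNodes("//tr[contains(@class,'tlistrow')]");
        if (rows == null || rowIndex >= rows.Count)
        {
            return null;
        }

        HtmlNode cell = rows[rowIndex].SelectSingleNode(cellXPath);
        return cell == null ? null : cell.InnerText;
    }
}
```

For a table whose two rows hold the names A and B, asking for row 1 with "//td[@class='tlistname']" gives A, because the absolute expression ignores the row and finds the first match in the document. The same call with ".//td[@class='tlistname']" gives B.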
An XPath that starts with / or // is evaluated from the document root, so it looks anywhere in the document for matches. Start the expression with . when you only want matches under the current node.
So, try something like:
HtmlNode hnTable = doc.DocumentNode.SelectSingleNode("//table[@class='tlist']");
HtmlNodeCollection hncRows = hnTable.SelectNodes(".//tr[contains(@class,'tlistrow')]");
foreach (HtmlNode row in hncRows)
{
    // ./td is resolved against the current row, not the whole document
    HtmlNode name = row.SelectSingleNode("./td[@class='tlistname']");
    if (name != null)
    {
        MessageBox.Show("Something found!\n\n" + name.InnerText);
    }
}
