Parse HTML class in individual items with htmlagilitypack


I want to parse HTML. I used the following code, but I get everything in one item instead of getting the items individually:
var url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
var web = new HtmlWeb();
var doc = web.Load(url);
IEnumerable<HtmlNode> nodes =
doc.DocumentNode.Descendants()
.Where(n => n.HasClass("search-result"));
foreach (var item in nodes)
{
string itemx = item.SelectSingleNode(".//a").Attributes["href"].Value;
MessageBox.Show(itemx);
MessageBox.Show(item.InnerText);
}
I only receive one message for the first item, and the second message displays all of the items at once.

When you search the page for the class 'search-result', only one node is returned. Your loop visits that single div rather than its children, which is why you get only one result.
If you want a list of all the links inside the div with class "search-result", you can do the following.
Code:
string url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
List<string> listOfUrls = new List<string>();
HtmlNode searchResult = doc.DocumentNode.SelectSingleNode("//div[@class='search-result']");
// Iterate through all the child nodes that have the 'a' tag.
foreach (HtmlNode node in searchResult.SelectNodes(".//a"))
{
string thisUrl = node.GetAttributeValue("href", "");
if (!string.IsNullOrEmpty(thisUrl) && !listOfUrls.Contains(thisUrl))
listOfUrls.Add(thisUrl);
}
What does it do?
SelectSingleNode("//div[@class='search-result']") -> retrieves the div that contains all the search results and ignores the rest of the document.
The loop visits only the 'a' nodes beneath that div and adds each href to a list. The leading dot in SelectNodes(".//a") is what limits the search to descendants of the div (with // instead of .//, the whole page is searched, which is not what you want).
The if statement makes sure only unique, non-empty values are added.
You have all the links now.
Fiddle: https://dotnetfiddle.net/j5aQFp
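To see why the leading dot matters, here is a minimal sketch that runs against a small inline HTML string instead of the live site. The markup and variable names are made up for illustration; only SelectSingleNode, SelectNodes and GetAttributeValue come from the answer above. It is written with top-level statements, so it assumes C# 9 or later and the HtmlAgilityPack NuGet package.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical markup standing in for the real page: two links inside
// the search-result div and one link outside of it.
string html =
    "<div class='search-result'><ul>" +
    "<li><a href='/subtitles/one'>One</a></li>" +
    "<li><a href='/subtitles/two'>Two</a></li>" +
    "</ul></div>" +
    "<a href='/elsewhere'>Elsewhere</a>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

HtmlNode searchResult = doc.DocumentNode.SelectSingleNode("//div[@class='search-result']");

// ".//a" is relative to searchResult, so only its descendants match.
int relativeCount = searchResult.SelectNodes(".//a").Count;

// "//a" starts from the document root even when called on searchResult.
int absoluteCount = searchResult.SelectNodes("//a").Count;

Console.WriteLine($"relative: {relativeCount}, absolute: {absoluteCount}");

// Each link is handled as its own item.
foreach (HtmlNode link in searchResult.SelectNodes(".//a"))
{
    Console.WriteLine($"{link.InnerText} -> {link.GetAttributeValue("href", "")}");
}
```

With this markup the relative query should find the two links inside the div, while the absolute query also picks up the link outside of it.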

I think it's how you're looking up and storing the data. Try:
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
string hrefValue = link.GetAttributeValue( "href", string.Empty );
MessageBox.Show(hrefValue);
MessageBox.Show(link.InnerText);
}

Related

how do i get all the value of a table from a website

string Url = "http://www.dsebd.org/latest_share_price_scroll_l.php";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
string a = doc.DocumentNode.SelectNodes("//iframe*[@src=latest_share_price_all\"]//html/body/div/table/tbody")[0].InnerText;
I have tried this, but I get a null value for string a.

Ok this one confused me for a while but I've got it now. Instead of pulling the whole page from http://www.dsebd.org/latest_share_price_scroll_l.php, you can get just the table data from http://www.dsebd.org/latest_share_price_all.php.
There was some strange behaviour with trying to select child elements of the #document node under the iframe element. Someone with more xpath experience might be able to explain this.
Now you can get all the table row nodes by using the following xpath:
string url = "http://www.dsebd.org/latest_share_price_all.php";
HtmlDocument doc = new HtmlWeb().Load(url);
HtmlNode docNode = doc.DocumentNode;
var nodes = docNode.SelectNodes("//body/div/table/tr");
That will give you all the table row nodes. Then you need to go through each node you just got and get the values you want.
Just for example if you wanted to get the trading code, high, and volume you would do the following:
//Remove the first node because it is the header row at the top of the table
nodes.RemoveAt(0);
foreach(HtmlNode rowNode in nodes)
{
HtmlNode tradingCodeNode = rowNode.SelectSingleNode("td[2]/a");
string tradingCode = tradingCodeNode.InnerText;
HtmlNode highNode = rowNode.SelectSingleNode("td[4]");
string highValue = highNode.InnerText;
HtmlNode volumeNode = rowNode.SelectSingleNode("td[11]");
string volumeValue = volumeNode.InnerText;
//Do whatever you want with the values here
//Put them in a class or add them to a list
}
XPath uses 1-based indices so when you are referring to a particular cell in a table row by number the first element is at index 1, instead of using index 0 as in a C# array.
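The difference between the two indexing schemes can be shown on a tiny table. This is only a sketch: the table content is invented, and it uses Skip(1) from System.Linq instead of RemoveAt(0) so the node collection is left unchanged. It assumes C# 9 or later (top-level statements) and the HtmlAgilityPack NuGet package.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical table standing in for the real page.
string html =
    "<table>" +
    "<tr><th>No</th><th>Code</th><th>High</th></tr>" +
    "<tr><td>1</td><td><a href='/abc'>ABC</a></td><td>10.5</td></tr>" +
    "<tr><td>2</td><td><a href='/xyz'>XYZ</a></td><td>7.2</td></tr>" +
    "</table>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//table/tr");

// Skip(1) drops the header row without modifying the collection.
foreach (HtmlNode row in rows.Skip(1))
{
    // XPath: td[1] is the first cell, so td[2] is the second.
    string byXPath = row.SelectSingleNode("td[2]/a").InnerText;

    // C#: the same cell sits at index 1 of a zero-based sequence.
    string byIndex = row.Elements("td").ElementAt(1).InnerText;

    Console.WriteLine($"{byXPath} == {byIndex}");
}
```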

Cannot extract <link> element using HtmlAgilityPack and XPath

I am using the Html Agility Pack to select textual data from RSS XML. For every other node type (title, pubdate, guid, etc.) I can select the inner text using XPath; however, when querying "//link" or "item/link", empty strings are returned.
public static IEnumerable<string> ExtractAllLinks(string rssSource)
{
//Create a new document.
var document = new HtmlDocument();
//Populate the document with an rss file.
document.LoadHtml(rssSource);
//Select out all of the required nodes.
var itemNodes = document.DocumentNode.SelectNodes("item/link");
//If zero nodes were found, return an empty list, otherwise return the content of those nodes.
return itemNodes == null ? new List<string>() : itemNodes.Select(itemNode => itemNode.InnerText).ToList();
}
Does anybody have an understanding of why this element behaves differently to the others?
Additional: Running "item/link" returns zero nodes. Running "//link" returns the correct number of nodes however the inner text is zero chars in length.
Using the below test data, with "//name" returns a single record for "fred" however with "//link" a single record with an empty string is returned.
<site><link>Hello World</link><name>Fred</name></site>
I am certain it's because of the word "link". If I change it to "linkz" it works perfectly.
The below workaround works perfectly. However I would like to understand why searching on "//link" does not work as other elements do.
public static IEnumerable<string> ExtractAllLinks(string rssSource)
{
rssSource = rssSource.Replace("<link>", "<link-renamed>");
rssSource = rssSource.Replace("</link>", "</link-renamed>");
//Create a new document.
var document = new HtmlDocument();
//Populate the document with an rss file.
document.LoadHtml(rssSource);
//Select out all of the required nodes.
var itemNodes = document.DocumentNode.SelectNodes("//link-renamed");
//If zero nodes were found, return an empty list, otherwise return the content of those nodes.
return itemNodes == null ? new List<string>() : itemNodes.Select(itemNode => itemNode.InnerText).ToList();
}

If you print the DocumentNode.OuterHtml, you will see the problem :
var html = @"<site><link>Hello World</link><name>Fred</name></site>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
Console.WriteLine(doc.DocumentNode.OuterHtml);
output :
<site><link>Hello World<name>Fred</name></site>
link happens to be one of the special tags* that HAP treats as self-closing. You can alter this behavior by changing ElementsFlags before parsing the HTML, for example:
var html = @"<site><link>Hello World</link><name>Fred</name></site>";
HtmlNode.ElementsFlags.Remove("link"); //remove link from list of special tags
var doc = new HtmlDocument();
doc.LoadHtml(html);
Console.WriteLine(doc.DocumentNode.OuterHtml);
var links = doc.DocumentNode.SelectNodes("//link");
foreach (HtmlNode link in links)
{
Console.WriteLine(link.InnerText);
}
output :
<site><link>Hello World</link><name>Fred</name></site>
Hello World
*) The complete list of special tags that are included in the ElementsFlags dictionary by default can be seen in the source code of HtmlNode.cs. Besides link, some of the most common ones are <meta>, <img>, <frame>, <input>, <form> and <option>.
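Because ElementsFlags is a static dictionary, removing an entry affects every document parsed afterwards in the same process. One way to limit that is to put the entry back once parsing is done. The sketch below assumes that is what you want; ParseKeepingLinkContent is a hypothetical helper name, not part of HAP, and the approach is not thread-safe since the dictionary is shared. It assumes C# 9 or later (top-level statements).

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper: parses the source with the special handling of
// <link> switched off, then restores the previous setting.
static HtmlDocument ParseKeepingLinkContent(string source)
{
    // Remember the current flag (if any) so it can be put back afterwards.
    bool hadFlag = HtmlNode.ElementsFlags.TryGetValue("link", out HtmlElementFlag saved);
    HtmlNode.ElementsFlags.Remove("link");
    try
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(source);
        return doc;
    }
    finally
    {
        if (hadFlag)
        {
            HtmlNode.ElementsFlags["link"] = saved;
        }
    }
}

var rss = ParseKeepingLinkContent("<site><link>Hello World</link><name>Fred</name></site>");
foreach (HtmlNode link in rss.DocumentNode.SelectNodes("//link"))
{
    Console.WriteLine(link.InnerText);
}
```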

Getting HtmlElement based on HtmlAgilityPack.HtmlNode

I use HtmlAgilityPack to parse the html document of a webbrowser control.
I am able to find my desired HtmlNode, but after getting the HtmlNode, I want to return the corresponding HtmlElement in WebBrowserControl.Document.
HtmlAgilityPack parses an offline copy of the live document, while I want to access the live elements of the WebBrowser control to read rendered attributes such as currentStyle or runtimeStyle.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(webBrowser1.Document.Body.InnerHtml);
var some_nodes = doc.DocumentNode.SelectNodes("//p");
// this selection could be more sophisticated
// and the answer shouldn't rely on it.
foreach (HtmlNode node in some_nodes)
{
HtmlElement live_element = CorrespondingElementFromWebBrowserControl(node);
// CorrespondingElementFromWebBrowserControl is what I am searching for
}
If the element had a specific attribute this would be easy, but I want a solution that works for any element.
What can I do about it?

There seems to be no direct way to change the document in the WebBrowser control.
But you can extract the HTML from it, manipulate it, and write it back again like this:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(webBrowser1.DocumentText);
foreach (HtmlAgilityPack.HtmlNode node in doc.DocumentNode.ChildNodes) {
node.Attributes.Add("TEST", "TEST");
}
StringBuilder sb = new StringBuilder();
using (StringWriter sw = new StringWriter(sb)) {
doc.Save(sw);
webBrowser1.DocumentText = sb.ToString();
}
For direct manipulation you may be able to use webBrowser1.Document.DomDocument, the unmanaged pointer to the document, but this is outside of my knowledge.

HtmlAgilityPack definitely can't provide access to nodes in live HTML directly. Since you said there is no distinct style/class/id on the element you have to walk through the nodes manually and find matches.
Assuming HTML is reasonably valid (so both browser and HtmlAgilityPack perform normalization similarly) you can walk pairs of elements starting from the root of both trees and selecting the same child node.
Basically you can build a "position-based" XPath to the node in one tree and select it in the other tree. The XPath would look something like one of the following, depending on whether you want to match on positions only or on position and node name:
"/*[1]/*[4]/*[2]/*[7]"
"/body/div[2]/span[1]/p[3]"
Steps:
1. Using the HtmlNode you've found, collect all parent nodes up to the root.
2. Get the root element of the HTML in the browser.
3. For each level of children, find the position of the corresponding child from the collection in step 1 within its parent, then find the live HtmlElement at that position among the children of the current live node.
4. Move to the newly found child and go back to step 3 until you reach the node you are looking for.
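The HtmlAgilityPack half of these steps (collecting the ancestors and their positions) can be sketched as follows. BuildPositionalXPath is a hypothetical helper name and the markup is invented; the sketch only shows how a name-and-position path can be derived, and checks it by selecting the node again in the same document. It assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper: builds a path such as /div[2]/span[2] by walking
// from the node up to (but not including) the document node.
static string BuildPositionalXPath(HtmlNode node)
{
    var steps = new Stack<string>();
    for (HtmlNode cur = node;
         cur != null && cur.NodeType == HtmlNodeType.Element;
         cur = cur.ParentNode)
    {
        // XPath positions are 1-based and count only same-named elements.
        int position = 1;
        for (HtmlNode sib = cur.PreviousSibling; sib != null; sib = sib.PreviousSibling)
        {
            if (sib.NodeType == HtmlNodeType.Element && sib.Name == cur.Name)
            {
                position++;
            }
        }
        steps.Push($"{cur.Name}[{position}]");
    }
    // The stack enumerates from the last push (the outermost ancestor) down.
    return "/" + string.Join("/", steps);
}

var doc = new HtmlDocument();
doc.LoadHtml("<div><span>a</span></div><div><em>x</em><span>b</span><span>c</span></div>");

// Pick the third span in document order (the one containing "c").
HtmlNode target = doc.DocumentNode.SelectNodes("//span")[2];
string path = BuildPositionalXPath(target);
HtmlNode found = doc.DocumentNode.SelectSingleNode(path);

Console.WriteLine(path);
Console.WriteLine(ReferenceEquals(found, target));
```

On the live side, the same name-and-position pairs would have to be matched against the children of each live element, which only works if both parsers normalize the markup the same way.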

The XPath property of HtmlAgilityPack.HtmlNode gives the path from the root to the node, for example /div[1]/div[2]/table[1]. You can traverse this path in the live document to find the corresponding live element. However, the path may not be precise, because HtmlAgilityPack handles some tags such as <form> specially. Before using this solution, restore the normal handling of those tags with:
HtmlNode.ElementsFlags.Remove("form");
struct DocNode
{
public string Name;
public int Pos;
}
///// structure to hold the name and position of each node in the path
The following method finds the live element according to the XPath
static public HtmlElement GetLiveElement(HtmlNode node, HtmlDocument doc)
{
var pattern = @"/(.*?)\[(.*?)\]"; // like div[1]
// Parse the XPath to extract the nodes on the path
var matches = Regex.Matches(node.XPath, pattern);
List<DocNode> PathToNode = new List<DocNode>();
foreach (Match m in matches) // Make a path of nodes
{
DocNode n = new DocNode();
n.Name = m.Groups[1].Value;
n.Pos = Convert.ToInt32(m.Groups[2].Value)-1;
PathToNode.Add(n); // add the node to path
}
HtmlElement elem = null; //Traverse to the element using the path
if (PathToNode.Count > 0)
{
elem = doc.Body; //begin from the body
foreach (DocNode n in PathToNode)
{
//Find the corresponding child by its name and position
elem = GetChild(elem, n);
}
}
return elem;
}
The code for the GetChild method used above:
public static HtmlElement GetChild(HtmlElement el, DocNode node)
{
// Find the corresponding child of the element
// based on the name and position of the node
int childPos = 0;
foreach (HtmlElement child in el.Children)
{
if (child.TagName.Equals(node.Name,
StringComparison.OrdinalIgnoreCase))
{
if (childPos == node.Pos)
{
return child;
}
childPos++;
}
}
return null;
}

htmlagilitypack xpath incorrect

My XPath is not working.
I am trying to get the URLs from Google.com's search result list into a string list,
but I am unable to reach the URLs using XPath.
Please help me correct my XPath, and tell me what should go in place of the ????????? below.
HtmlWeb hw = new HtmlWeb();
List<string> urls = new List<string>();
HtmlAgilityPack.HtmlDocument doc = hw.Load("http://www.google.com/search?q=" +txtURL.Text.Replace(" " , "+"));
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//div[@class='f kv']");
foreach (HtmlNode linkNode in linkNodes)
{
HtmlAttribute link = linkNode.Attributes["?????????"];
urls.Add(link.Value);
}
for (int i = 0; i <= urls.Count - 1; i++)
{
if (urls.ElementAt(i) != null)
{
if (IsValid(urls.ElementAt(i)) != true)
{
grid.Rows.Add(urls.ElementAt(i));
}
}
}

The URLs seem to live in the cite element under the selected divs, so the XPath to select them is //div[@class='f kv']/cite.
Since these contain markup but you only want the text, read the InnerText of the selected nodes. Note that these do not begin with http://.
HtmlNodeCollection linkNodes =
    doc.DocumentNode.SelectNodes("//div[@class='f kv']/cite");
foreach (HtmlNode linkNode in linkNodes)
{
    // InnerText is a string, not an HtmlAttribute
    string link = linkNode.InnerText;
    urls.Add(link);
}

The correct XPath is "//div[@class='kv']/cite". The f class you see in the browser's element inspector is (probably) added by JavaScript after the page is rendered.
Also, the link text is not in an attribute; you can get it from the InnerText property of the <cite> element(s) selected by that XPath.
I changed these lines and it works:
var linkNodes = doc.DocumentNode.SelectNodes("//div[@class='kv']/cite");
foreach (HtmlNode linkNode in linkNodes)
{
urls.Add(linkNode.InnerText);
}
There's a caveat though: some links are trimmed (you'll see a ... in the middle)
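Since the two answers disagree on whether the class attribute is 'f kv' or 'kv', a query that matches the kv class regardless of what other classes are present may be more robust. The sketch below uses a common XPath 1.0 idiom for that; the markup is invented and does not reflect what Google actually serves. It also guards against SelectNodes returning null when nothing matches. It assumes C# 9 or later (top-level statements).

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical markup: two divs carry the kv class, one only looks similar.
string html =
    "<div class='f kv'><cite>example.com/a</cite></div>" +
    "<div class='kv'><cite>example.com/b</cite></div>" +
    "<div class='kvx'><cite>example.com/c</cite></div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Pads the class list with spaces so ' kv ' matches only the whole token.
string xpath = "//div[contains(concat(' ', normalize-space(@class), ' '), ' kv ')]/cite";
HtmlNodeCollection cites = doc.DocumentNode.SelectNodes(xpath);

// SelectNodes can return null instead of an empty collection.
if (cites != null)
{
    foreach (HtmlNode cite in cites)
    {
        Console.WriteLine(cite.InnerText);
    }
}
```

With this markup the query should match the first two divs and skip the third.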

HtmlAgilityPack replace node

I want to replace a node with a new node. How can I get the exact position of the node and do a complete replace?
I've tried the following, but I can't figure out how to get the index of the node or which parent node to call ReplaceChild() on.
string html = "<b>bold_one</b><strong>strong</strong><b>bold_two</b>";
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
var bolds = document.DocumentNode.Descendants().Where(item => item.Name == "b");
foreach (var item in bolds)
{
string newNodeHtml = GenerateNewNodeHtml();
HtmlNode newNode = new HtmlNode(HtmlNodeType.Text, document, ?);
item.ParentNode.ReplaceChild( )
}

To create a new node, use the HtmlNode.CreateNode() factory method; do not use the constructor directly.
This code should work out for you:
var htmlStr = "<b>bold_one</b><strong>strong</strong><b>bold_two</b>";
var doc = new HtmlDocument();
doc.LoadHtml(htmlStr);
var query = doc.DocumentNode.Descendants("b");
foreach (var item in query.ToList())
{
var newNodeStr = "<foo>bar</foo>";
var newNode = HtmlNode.CreateNode(newNodeStr);
item.ParentNode.ReplaceChild(newNode, item);
}
Note that we need to call ToList() on the query. The loop modifies the document, so enumerating the live query while replacing nodes would fail.
If you wish to replace with this string:
"some text <b>node</b> <strong>another node</strong>"
The problem is that it is no longer a single node but a series of nodes. You can parse it fine using HtmlNode.CreateNode() but in the end, you're only referencing the first node of the sequence. You would need to replace using the parent node.
var htmlStr = "<b>bold_one</b><strong>strong</strong><b>bold_two</b>";
var doc = new HtmlDocument();
doc.LoadHtml(htmlStr);
var query = doc.DocumentNode.Descendants("b");
foreach (var item in query.ToList())
{
var newNodesStr = "some text <b>node</b> <strong>another node</strong>";
var newHeadNode = HtmlNode.CreateNode(newNodesStr);
item.ParentNode.ReplaceChild(newHeadNode.ParentNode, item);
}

I implemented the following solution to achieve the same thing.
var htmlStr = "<b>bold_one</b><div class='LatestLayout'><div class='olddiv'><strong>strong</strong></div></div><b>bold_two</b>";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(htmlStr); // LoadHtml parses a string; Load expects a file path or stream
// newChild is the replacement node; the markup here is only a placeholder
HtmlNode newChild = HtmlNode.CreateNode("<div class='newdiv'>new</div>");
htmlDoc.DocumentNode.SelectSingleNode("//div[@class='olddiv']").Remove();
htmlDoc.DocumentNode.SelectSingleNode("//div[@class='LatestLayout']").PrependChild(newChild);
htmlDoc.Save(FilePath); // FilePath is the full path of the .html file, if you need to save it.
So this selects the old element and removes it,
then adds the new node as a child of the containing element.
