how to add data in href with HTML Agility Pack - c#

I have code that gets all 5 links from a site, and I need to change each link by putting "https://advancecare.pt" in front of it.
For now I have this code:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("myLink");
foreach (HtmlNode ic in doc.DocumentNode.SelectNodes("//div[@class='component row-splitter']"))
{
    foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
    {
        HtmlNode test = doc.DocumentNode.SelectNodes("//a[@href]").First();
        string hrefValue = link.GetAttributeValue("href", string.Empty);
        // test.SetAttributeValue("href", "mylink" + hrefValue);
        link.SetAttributeValue("href", "mylink" + hrefValue);
    }
}
This code returns:
https:mylinkmylinkmylinkmylink/hrefValue

You iterate over all the div nodes in the document, and inside that loop you iterate over all the links in the whole document, so each link is processed once per div element.
Search only the links that are descendants of the current div. Note the leading dot: without it, // starts from the document root even when called on ic:
foreach (HtmlNode link in ic.SelectNodes(".//a[@href]"))
...
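A minimal sketch of the whole fix, assuming the links are relative paths and that the HtmlAgilityPack NuGet package is referenced. The names LinkRewriter and PrefixLinks and the sample HTML are made up for illustration. It uses System.Uri instead of string concatenation, so a link that is already absolute is left alone and a link visited twice (for example inside nested divs) is not prefixed twice:

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkRewriter
{
    // Makes every href inside the matching divs absolute and returns
    // the rewritten values in the order they were processed.
    public static List<string> PrefixLinks(HtmlDocument doc, string divXPath, Uri baseUri)
    {
        var rewritten = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection divs = doc.DocumentNode.SelectNodes(divXPath);
        if (divs == null) return rewritten;

        foreach (HtmlNode div in divs)
        {
            // The leading dot scopes the search to this div's descendants.
            HtmlNodeCollection links = div.SelectNodes(".//a[@href]");
            if (links == null) continue;

            foreach (HtmlNode link in links)
            {
                string href = link.GetAttributeValue("href", string.Empty);
                // Resolving against the base leaves absolute URLs unchanged.
                string absolute = new Uri(baseUri, href).ToString();
                link.SetAttributeValue("href", absolute);
                rewritten.Add(absolute);
            }
        }
        return rewritten;
    }

    public static void Main()
    {
        // Hypothetical markup standing in for the real page.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<div class='component row-splitter'><a href='/a'>A</a></div>" +
            "<div class='component row-splitter'><a href='/b'>B</a></div>");

        List<string> result = PrefixLinks(
            doc,
            "//div[@class='component row-splitter']",
            new Uri("https://advancecare.pt"));

        Console.WriteLine(string.Join(", ", result));
    }
}
```

Values that are not valid URI references may make the Uri constructor throw, so real pages might need a try/catch or Uri.TryCreate around that line.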

Related

Suppress a series of tags in a document

I am using HtmlAgilityPack, but I am not really used to HTML documents.
A combination of tags causes a problem when printed, so I decided to cut them out, but how?
<p ><br clear=all>
</span></p>
I have an HtmlDocument that I load at the beginning, and then I try to cut out the tags shown above.
To cut them I have tried:
HtmlAgilityPack.HtmlDocument document; // get the document
foreach (var node in document.DocumentNode.SelectNodes("//div"))
{
    IEnumerable<HtmlNode> test = node.ChildNodes;
    foreach (HtmlNode val in test)
    {
        if (val.Name == "br") // Want to check whether I reach the node I am looking for
        {
            int a = 0;
            a++;
            IEnumerable<HtmlAttribute> attribute = node.GetAttributes();
            foreach (HtmlAttribute att in attribute)
            {
                if (att.Name == "clear")
                {
                    HtmlNode getNode = node.NextSibling;
                    node.Remove();
                }
            }
        }
    }
}
It doesn't work!
If I instead use:
foreach (var node in document.DocumentNode.SelectNodes("//br"))
I can remove the tag, but I cannot access the node that follows it.
Could you help me?

Parse HTML class in individual items with htmlagilitypack

I want to parse HTML. I used the following code, but I get everything in one item instead of getting the items individually.
var url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
var web = new HtmlWeb();
var doc = web.Load(url);
IEnumerable<HtmlNode> nodes =
doc.DocumentNode.Descendants()
.Where(n => n.HasClass("search-result"));
foreach (var item in nodes)
{
string itemx = item.SelectSingleNode(".//a").Attributes["href"].Value;
MessageBox.Show(itemx);
MessageBox.Show(item.InnerText);
}
I only receive one message for the first item, and the second message displays all the items.
When you search the data from the url based on class 'search-result', there is only one node that is returned. Instead of iterating through its children, you only go through that one div, which is why you are only getting one result.
If you want to get a list of all the links inside the div with class "search-result", then you can do the following.
Code:
string url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
List<string> listOfUrls = new List<string>();
HtmlNode searchResult = doc.DocumentNode.SelectSingleNode("//div[@class='search-result']");
// Iterate through all the child nodes that have the 'a' tag.
foreach (HtmlNode node in searchResult.SelectNodes(".//a"))
{
string thisUrl = node.GetAttributeValue("href", "");
if (!string.IsNullOrEmpty(thisUrl) && !listOfUrls.Contains(thisUrl))
listOfUrls.Add(thisUrl);
}
What does it do?
SelectSingleNode("//div[@class='search-result']") retrieves the div that holds all the search results and ignores the rest of the document.
The loop iterates only over the a nodes below that div and adds each href to a list. The leading dot in SelectNodes(".//a") limits the search to descendants of the current node (with // instead of .//, it would search the entire page, which is not what you want).
The if statement makes sure only unique, non-empty values are added.
You have all the links now.
Fiddle: https://dotnetfiddle.net/j5aQFp
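To see the difference between // and .// in isolation, here is a small sketch, assuming the HtmlAgilityPack NuGet package is referenced. The class name XPathScopeDemo, the Count helper and the sample markup are invented for the demonstration. It evaluates both expressions from the same context node and prints how many a elements each one selects:

```csharp
using System;
using HtmlAgilityPack;

public static class XPathScopeDemo
{
    // Counts the nodes an XPath selects when evaluated from the given context node.
    public static int Count(HtmlNode context, string xpath)
    {
        // SelectNodes returns null when nothing matches.
        HtmlNodeCollection nodes = context.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        // Hypothetical page: one link outside the result div, two inside it.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<div class='nav'><a href='/home'>Home</a></div>" +
            "<div class='search-result'>" +
            "<a href='/s/1'>One</a><a href='/s/2'>Two</a>" +
            "</div>");

        HtmlNode result = doc.DocumentNode.SelectSingleNode("//div[@class='search-result']");

        // "//a" starts at the document root, even though it is called on the div.
        Console.WriteLine(Count(result, "//a"));
        // ".//a" starts at the div itself.
        Console.WriteLine(Count(result, ".//a"));
    }
}
```

With this sample markup the first count covers all three links in the document, while the second covers only the two inside the search-result div.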
I think it's how you're looking up and storing the data. Try:
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
string hrefValue = link.GetAttributeValue( "href", string.Empty );
MessageBox.Show(hrefValue);
MessageBox.Show(link.InnerText);
}

HTML Agility Issue

I have some HTML code that I would like to parse.
I have written the code below:
HtmlAgilityPack.HtmlWeb web5 = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc5 = web5.Load("http://www.analytics4.co.uk/pdf.js/web/viewer.html?file=http://www.analytics4.co.uk/pdf.js/pdf/w15639.pdf");
//var divs5 = doc5.DocumentNode.SelectNodes("//div[id='viewerContainer']").SelectMany(x => x.Descendants("div"));
// HtmlAgilityPack.HtmlDocument doc5 = web5.Load("http://google.co.uk");
HtmlNodeCollection tl = doc5.DocumentNode.SelectNodes("//div[@id='viewerContainer']//div[@id='viewer']//");
foreach (HtmlAgilityPack.HtmlNode node in tl)
{
Console.WriteLine(node.InnerHtml);
Console.WriteLine(node.OuterHtml);
}
The result I get for Inner HTML is just
<div id="viewer" class="pdfViewer"></div>
and it doesn't make sense. Could anyone explain how I can go deeper into the inner divs and so on? I need your help.
To go deeper you can use these techniques:
foreach (var node in tl){
var a = node.ChildNodes[2]; // a is the third child of node
var b = node.SelectSingleNode("./div[3]"); // b is the third "div"
// element in node children. The "./" in XPath means "from current node"
}
Good luck!
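As a complement to indexing into ChildNodes, here is a hedged sketch that walks every nested div with Descendants, assuming the HtmlAgilityPack NuGet package is referenced. The class name DescendantWalker, the DivIds helper and the sample markup (including the page1 and text1 ids) are invented. Keep in mind that HtmlAgilityPack does not execute JavaScript, so if the viewer div is filled in by a script in the browser, it may simply be empty in the HTML that HtmlWeb.Load receives, and there would be nothing deeper to walk:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class DescendantWalker
{
    // Returns the id of every div below root, in document order.
    // Descendants does not include the root node itself.
    public static string[] DivIds(HtmlNode root)
    {
        return root.Descendants("div")
                   .Select(d => d.GetAttributeValue("id", "(no id)"))
                   .ToArray();
    }

    public static void Main()
    {
        // Hypothetical markup with the nesting already present in the HTML.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<div id='viewerContainer'>" +
            "<div id='viewer'><div id='page1'><div id='text1'>Hello</div></div></div>" +
            "</div>");

        HtmlNode container = doc.DocumentNode.SelectSingleNode("//div[@id='viewerContainer']");
        if (container == null) return; // SelectSingleNode returns null when nothing matches

        foreach (string id in DivIds(container))
        {
            Console.WriteLine(id);
        }
    }
}
```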

htmlagilitypack xpath incorrect

I have a problem: my XPath is not working.
I am trying to get the URLs from Google.com's search result list into a string list,
but I am unable to reach the URL using XPath.
Please help me correct my XPath, and tell me what should go in place of the ?????????.
HtmlWeb hw = new HtmlWeb();
List<string> urls = new List<string>();
HtmlAgilityPack.HtmlDocument doc = hw.Load("http://www.google.com/search?q=" +txtURL.Text.Replace(" " , "+"));
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//div[@class='f kv']");
foreach (HtmlNode linkNode in linkNodes)
{
HtmlAttribute link = linkNode.Attributes["?????????"];
urls.Add(link.Value);
}
for (int i = 0; i <= urls.Count - 1; i++)
{
if (urls.ElementAt(i) != null)
{
if (IsValid(urls.ElementAt(i)) != true)
{
grid.Rows.Add(urls.ElementAt(i));
}
}
}
The URLs seem to live in the cite element under the selected divs, so the XPath to select those is //div[@class='f kv']/cite.
Now, since these contain markup but you only want the text, select the InnerText of the selected nodes. Note that these do not begin with http://.
HtmlNodeCollection linkNodes =
    doc.DocumentNode.SelectNodes("//div[@class='f kv']/cite");
foreach (HtmlNode linkNode in linkNodes)
{
    // InnerText is a string, not an HtmlAttribute.
    string link = linkNode.InnerText;
    urls.Add(link);
}
The correct XPath is "//div[@class='kv']/cite". The f class you see in the browser element inspector is (probably) added after the page is rendered using JavaScript.
Also, the link text is not in an attribute; you can get it using the InnerText property of the <cite> element(s) selected by that XPath.
I changed these lines and it works:
var linkNodes = doc.DocumentNode.SelectNodes("//div[@class='kv']/cite");
foreach (HtmlNode linkNode in linkNodes)
{
urls.Add(linkNode.InnerText);
}
There's a caveat though: some links are truncated (you'll see a ... in the middle).

HtmlAgilityPack set node InnerText

I want to replace the inner text of HTML tags with other text.
I am using HtmlAgilityPack.
I use this code to extract all the text:
HtmlDocument doc = new HtmlDocument();
doc.Load("some path");
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()[normalize-space(.) != '']")) {
// How to replace node.InnerText with some text ?
}
But InnerText is read-only. How can I replace the text with other text and save the result to a file?
Try the code below. It selects all text nodes without children and filters out script nodes; you may need to add some additional filtering. Compared with your XPath expression, this one also restricts the match to leaf nodes and excludes the text content of <script> tags.
var nodes = doc.DocumentNode.SelectNodes("//body//text()[(normalize-space(.) != '') and not(parent::script) and not(*)]");
foreach (HtmlNode htmlNode in nodes)
{
htmlNode.ParentNode.ReplaceChild(HtmlTextNode.CreateNode(htmlNode.InnerText + "_translated"), htmlNode);
}
Strange, but I found that InnerHtml isn't read-only. When I set it like this:
aElement.InnerHtml = "sometext";
the value of InnerText also changed to "sometext".
The HtmlTextNode class has a Text property* which works perfectly for this purpose.
Here's an example:
var textNodes = doc.DocumentNode.SelectNodes("//body/text()").Cast<HtmlTextNode>();
foreach (var node in textNodes)
{
node.Text = node.Text.Replace("foo", "bar");
}
And if we have an HtmlNode whose direct text we want to change, we can do something like the following:
HtmlNode node = //...
var textNode = (HtmlTextNode)node.SelectSingleNode("text()");
textNode.Text = "new text";
Or we can use node.SelectNodes("text()") in case it has more than one text node.
* Not to be confused with the read-only InnerText property.
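Putting the pieces together, here is a minimal end-to-end sketch of the Text-property approach, assuming the HtmlAgilityPack NuGet package is referenced. The class name TextReplacer, the ReplaceText helper and the sample markup are invented for illustration. It loads HTML from a string, rewrites every non-blank text node, and serializes the result. To write to a file, as the question asks, doc.Save also accepts a file path:

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

public static class TextReplacer
{
    // Replaces oldValue with newValue in every non-blank text node
    // and returns the serialized document.
    public static string ReplaceText(string html, string oldValue, string newValue)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches.
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//text()[normalize-space(.) != '']");
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
            {
                // Only HtmlTextNode exposes the writable Text property.
                if (node is HtmlTextNode textNode)
                {
                    textNode.Text = textNode.Text.Replace(oldValue, newValue);
                }
            }
        }

        using (var writer = new StringWriter())
        {
            doc.Save(writer); // doc.Save("some path") would write a file instead
            return writer.ToString();
        }
    }

    public static void Main()
    {
        string result = ReplaceText("<div>foo <span>foo</span></div>", "foo", "bar");
        Console.WriteLine(result);
    }
}
```

This sketch does not skip <script> or <style> content; the not(parent::script) filter from the earlier answer can be added to the XPath if that matters.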
