Get URLs inside a HTML page with HTML Agility Pack

I have this code:
foreach (HtmlNode node in hd.DocumentNode.SelectNodes("//div[@class='compTitle options-toggle']//a"))
{
string s=("node:" + node.GetAttributeValue("href", string.Empty));
}
I want to get urls in tags like this:
<div class="compTitle options-toggle">
<a class=" ac-algo fz-l ac-21th lh-24" href="http://www.bestbuy.com">
<b>Huawei</b> Products - Best Buy
</a>
</div>
I want to get "http://www.bestbuy.com" and "Huawei Products - Best Buy".
What should I do? Is my code correct?

This is an example of working code:
var document = new HtmlDocument();
document.LoadHtml("<div class=\"compTitle options-toggle\"><a class=\" ac-algo fz-l ac-21th lh-24\" href=\"http://www.bestbuy.com\"><b>Huawei</b> Products - Best Buy</a></div>");
var tags = document.DocumentNode.SelectNodes("//div[@class='compTitle options-toggle']//a").ToList();
foreach (var tag in tags)
{
var link = tag.Attributes["href"].Value; // http://www.bestbuy.com
var text = tag.InnerText; // Huawei Products - Best Buy
}

Adding the closing double quote should fix the selection (it worked for me).
Get the plain text with:
string contentText = node.InnerText;
or, to keep the word Huawei in bold, get the markup instead:
string contentHtml = node.InnerHtml;

Related

c# Regex inside some html tag

I have been trying for a few hours to use a regex to extract the text inside some HTML tags:
<div class="ewok-rater-header-section">
<ul class="header">
<li><h1>meow</h1></li>
<li><h1>meow2</h1></li>
<li><h1>Time = <span class="work-weight">9.0 minutes</span></h1></li>
</ul>
</div>
I get meow with:
var regexpost = new System.Text.RegularExpressions.Regex(@"<h1(.*?)>(.*?)</h1>");
var mpost = regexpost.Match(reqpost);
string lechat = (mpost.Groups[2].Value).ToString();
but not the others.
I would like to put meow in one textbox, meow2 in a second textbox, and 9.0 (minutes) in a last one.

In situations like this an HTML parser can help a lot, and it is also much more precise and robust than a regex.
For example, with Html Agility Pack:
var html = @"<div class=""ewok-rater-header-section"">
<li><h1>meow</h1></li>
<li><h1>meow2</h1></li>
<li><h1>Time = <span class=""work-weight"">9.0 minutes</span></h1></li>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
// you can search for the heading
foreach (var node in doc.DocumentNode.SelectNodes("//li//h1"))
{
Console.WriteLine("Found heading : " + node.InnerText);
}
// or you can be more specific
var someSpan = doc.DocumentNode
.SelectNodes("//span[@class='work-weight']")
.FirstOrDefault();
Console.WriteLine("Found span : " + someSpan.InnerText);
Output
Found heading : meow
Found heading : meow2
Found heading : Time = 9.0 minutes
Found span : 9.0 minutes

It is for parsing an HTTP response. Isn't it slow to use an HTML parser to build a document?
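On the regex side of the question: only meow comes back because Regex.Match returns just the first match, while Regex.Matches returns all of them. The following is a minimal sketch, assuming the fragment is small and regular enough that a parser is not worth the cost; the names HeadingExtractor and Extract are made up for illustration and do not come from the original post.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class HeadingExtractor
{
    // Returns the visible text of every <h1>...</h1> in the input, in document order.
    public static List<string> Extract(string html)
    {
        // Matches (plural) yields every occurrence; Match would stop at the first one.
        return Regex.Matches(html, @"<h1[^>]*>(.*?)</h1>", RegexOptions.Singleline)
            .Cast<Match>()
            // Strip nested tags such as <span> from the captured heading content.
            .Select(m => Regex.Replace(m.Groups[1].Value, "<[^>]+>", "").Trim())
            .ToList();
    }
}
```

Calling HeadingExtractor.Extract on the markup from the question should give three strings, one per textbox. For markup less regular than this, the parser approach is still the more robust choice.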

Find specific link in html doc c# using HTML Agility Pack

I am trying to parse an HTML document in order to retrieve a specific link within the page. I know this may not be the best way, but I'm trying to find the HTML node I need by its inner text. However, there are two instances in the HTML where this occurs: the footer and the navigation bar. I need the link from the navigation bar. The "footer" in the HTML comes first. Here is my code:
public string findCollegeURL(string catalog, string college)
{
    // Find college
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(catalog);
    var root = doc.DocumentNode;
    var htmlNodes = root.DescendantsAndSelf();
    // Search through fetched html nodes for relevant information
    int counter = 0;
    foreach (HtmlNode node in htmlNodes) {
        string linkName = node.InnerText;
        if (linkName == colleges[college] && counter == 0)
        {
            counter++;
            continue;
        }
        else if (linkName == colleges[college] && counter == 1)
        {
            string targetURL = node.Attributes["href"].Value; // "found it!";
            return targetURL;
        }
    }
    return "DID NOT WORK";
}
The program is entering into the if else statement, but when attempting to retrieve the link, I get a NullReferenceException. Why is that? How can I retrieve the link I need?
Here is the code in the HTML doc that I'm trying to access:
<tr class>
<td id="acalog-navigation">
<div class="n2_links" id="gateway-nav-current">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
College of Science ==$0
</div>
This is the link that I want: /content.php?catoid=10&navoid=1210

The exception most likely comes from DescendantsAndSelf() returning every node, including text nodes and wrapper elements whose InnerText equals the link text; those have no href attribute, so Attributes["href"] is null.
I find XPath easier than writing a lot of code:
var link = doc.DocumentNode.SelectSingleNode("//a[text()='College of Science']")
    .Attributes["href"].Value;
If you have two links with the same text, select the second one like this:
var link = doc.DocumentNode.SelectSingleNode("(//a[text()='College of Science'])[2]")
    .Attributes["href"].Value;
The LINQ version:
var links = doc.DocumentNode.Descendants("a")
    .Where(a => a.InnerText == "College of Science")
    .Select(a => a.Attributes["href"].Value)
    .ToList();

AngleSharp Parsing

Can't find many examples of using AngleSharp for parsing when you don't have a class name or id to use.
HTML
<span><span class="icon icon_none"></span></span>
<span><span class="icon icon_none"></span></span>
<span><span class="icon icon_none"></span></span>
I want to find the href from any <a> tags that have a title = Bing
In Python BeautifulSoup I would use
item_needed = a_row.find('a', {'title': 'Bing'})
and then grab the href attribute
or jQuery
a[title='Bing']
But, I'm stuck using AngleSharp
eg. following example
https://github.com/AngleSharp/AngleSharp/wiki/Examples#getting-certain-elements
c# AngleSharp
var parser = new AngleSharp.Parser.Html.HtmlParser();
var document = parser.Parse(@"<span><span class=""icon icon_none""></span></span>< span >< a href = ""bing.com"" title = ""Bing"" >< span class=""icon icon_none""></span></a></span><span><span class=""icon icon_none""></span></span>");
//Do something with LINQ
var blueListItemsLinq = document.All.Where(m => m.LocalName == "a" && //stuck);

It looks like a problem in your HTML markup caused AngleSharp to miss the target element, namely the spaces around the angle brackets:
< span >< a href = ""bing.com"" title = ""Bing"" >< span class=""icon icon_none"">
With the HTML fixed, both LINQ and a CSS selector successfully select the target link:
var parser = new AngleSharp.Parser.Html.HtmlParser();
var document = parser.ParseDocument(@"<span><span class=""icon icon_none""></span></span><span><a href=""bing.com"" title=""Bing""><span class=""icon icon_none""></span></a></span><span><span class=""icon icon_none""></span></span>");
//LINQ example
var blueListItemsLinq = document.All
.Where(m => m.LocalName == "a" &&
m.GetAttribute("title") == "Bing"
);
//LINQ equivalent CSS selector example
var blueListItemsCSS = document.QuerySelectorAll("a[title='Bing']");
//print href attributes value to console
foreach (var item in blueListItemsCSS)
{
Console.WriteLine(item.GetAttribute("href"));
}

string with HTML - replace elements/part of string (??Regex)

... I try to explain it in another way. I have a string like this:
string myText = "... <p class="MsoNormal">bla gezeichnete bla zuzustellen.</p><p>10.0080</p><p class="MsoNormal">text text text</p><p class="p--heading-2"><span class="anchor--on anchorname--160p001200">Schriftliche Bearbeitung</span</p><p>1.02</p><p>Eine blablabla text text</p><p>1.010</p><p>Ein text text (look <a xlink:type="simple" xlink:show="replace" xlink:role="17160" xlink:actuate="onRequest" xlink:href="link/a1000-text.xml">10.0060</a>) text text text</p> ..."
Now I want to edit part of the string (C#), for example:
myText = myText.Replace("<p class="p--heading-2"><span class="anchor--on anchorname--160p00">Schriftliche Bearbeitung</span</p>", "<h2><a name="anchorname">Schriftliche Bearbeitung</a></p>");
The problem is the variable values (for example, the anchorname needs a different value each time), so I can't use a fixed string replacement.
Comment on the first answer: I don't want to use third-party software (such as HtmlAgilityPack).
Are there any ideas for a solution? If a regex is the best solution, what would the regex look like?
Thanks.

Use HtmlAgilityPack, not regex:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var nodes = doc.DocumentNode.SelectNodes("//p[@class='p--heading-2']");
foreach (HtmlNode htmlNode in nodes)
{
    var newNodeStr = htmlNode.InnerText;
    var newNode = HtmlNode.CreateNode("<h3><a>" + newNodeStr + "</a></h3>");
    htmlNode.ParentNode.ReplaceChild(newNode, htmlNode);
}
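Since the question rules out third-party libraries, here is a hedged regex sketch using only System.Text.RegularExpressions. It assumes the headings always have exactly the shape shown in the question, and that the part of the class after "anchorname--" is what should become the anchor name; both are assumptions, as are the names HeadingRewriter and Rewrite.

```csharp
using System.Text.RegularExpressions;

public static class HeadingRewriter
{
    // Captures the variable part of the class (after "anchorname--") and the heading text.
    // The trailing ">?" tolerates the "</span</p>" spelling that appears in the question.
    private static readonly Regex Heading = new Regex(
        @"<p class=""p--heading-2""><span class=""anchor--on anchorname--(?<name>[^""]+)"">(?<text>.*?)</span>?</p>",
        RegexOptions.Singleline);

    public static string Rewrite(string html)
    {
        // ${name} and ${text} are filled in from the named groups of each match.
        return Heading.Replace(html, @"<h2><a name=""${name}"">${text}</a></h2>");
    }
}
```

Everything outside the matched paragraphs is left untouched. If the markup varies (extra attributes, different attribute order, nested tags), a pattern like this will silently miss headings, which is the usual argument for a parser.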

Removing DIV from a text file if it contains a certain classname

I am currently working with an XML document which has RSS feeds inside. And I wanted to parse it so that if a div tag with a class name "feedflare" is found, the code would remove the whole DIV.
I could not find an example of doing this as the search for it is polluted with "HTML editor errors" and other irrelevant data.
Would anyone here be kind enough to share methods in reaching my goal?
I must state that I DO NOT want to use HtmlAgilityPack if I can avoid it.
This is my process:
Load XML, parse through elements and pick out, Title, Description, Link.
Then save all this as HTML (with tags being added programmatically to build a web page), and once all of the tags are added, I want to parse the resulting "HTML text" and remove the annoying DIV tag.
Let's assume "string HTML = textBox1.text", where textBox1 is where the resulting HTML is pasted after parsing the main XML document.
How would I then loop through the contents of textBox1.text and remove ONLY the div tag with the class "feedflare" (see below)?
<div class="feedflare">
<a href="http://feeds.gawker.com/~ff/kotaku/full?a=lB-zYAGjzDU:1zqeSgzxt90:yIl2AUoC8zA">
<img src="http://feeds.feedburner.com/~ff/kotaku/full?d=yIl2AUoC8zA" border="0"></img></a>
<a href="http://feeds.gawker.com/~ff/kotaku/full?a=lB-zYAGjzDU:1zqeSgzxt90:H0mrP-F8Qgo">
<img src="http://feeds.feedburner.com/~ff/kotaku/full?d=H0mrP-F8Qgo" border="0"></img></a>
<a href="http://feeds.gawker.com/~ff/kotaku/full?a=lB-zYAGjzDU:1zqeSgzxt90:D7DqB2pKExk">
<img src="http://feeds.feedburner.com/~ff/kotaku/full?i=lB-zYAGjzDU:1zqeSgzxt90:D7DqB2pKExk" border="0"></img></a>
<a href="http://feeds.gawker.com/~ff/kotaku/full?a=lB-zYAGjzDU:1zqeSgzxt90:V_sGLiPBpWU">
<img src="http://feeds.feedburner.com/~ff/kotaku/full?i=lB-zYAGjzDU:1zqeSgzxt90:V_sGLiPBpWU" border="0"></img></a>
</div>
Thank you in advance.

Using this xml library, do:
XElement root = XElement.Load(file); // or .Parse(string);
XElement div = root.XPathElement("//div[@class={0}]", "feedflare");
div.Remove();
root.Save(file); // or string = root.ToString();
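The same idea can be sketched with only the built-in System.Xml.Linq types, without the extra library. This assumes the text is well-formed XML with a single root element and that the div elements are not in an XML namespace; the names FeedCleaner and RemoveFeedflare are illustrative only.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class FeedCleaner
{
    // Removes every <div class="feedflare"> (with its children) and returns the remaining XML.
    public static string RemoveFeedflare(string xml)
    {
        XElement root = XElement.Parse(xml);

        root.Descendants("div")
            // The cast yields null when the attribute is missing, so no exception is thrown.
            .Where(d => (string)d.Attribute("class") == "feedflare")
            // The Remove() extension detaches every matched element from its parent.
            .Remove();

        return root.ToString(SaveOptions.DisableFormatting);
    }
}
```

Real-world HTML is often not valid XML (unclosed tags, bare ampersands in URLs), in which case XElement.Parse throws and a dedicated HTML parser is the safer route.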

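Try this:
System.Xml.XmlDocument d = new System.Xml.XmlDocument();
d.LoadXml(Your_XML_as_String);
// Select only the feedflare divs and copy them to a list before removing anything (needs using System.Linq;).
var targets = d.SelectNodes("//div[@class='feedflare']").Cast<System.Xml.XmlNode>().ToList();
foreach (System.Xml.XmlNode n in targets)
    n.ParentNode.RemoveChild(n); // remove from the node's own parent, not from the document
Then use d.OuterXml to retrieve the new XML.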

My solution in JavaScript is:
function unrichText(texto) {
var n = texto.indexOf("\">"); //Finding end of "<div class="ExternalClass...">
var sub = texto.substring(0, n+2); //Adding first char and last two (">)
var tmp = texto.replace(sub, ""); //Removing it
tmp = replaceAll(tmp, "</div>", ""); //Removing last "div"
tmp = replaceAll(tmp, "<p>", ""); //Removing other stuff
tmp = replaceAll(tmp, "</p>", "");
tmp = replaceAll(tmp, " ", "");
return (tmp);
}
function replaceAll(str, find, replace) {
return str.replace(new RegExp(find, 'g'), replace);
}
