Get text from webpage using HtmlAgilityPack

I would like to get the inner text of the <p> tags. I use HtmlAgilityPack to get the HTML from the website, but this isn't working. What am I doing wrong?
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml("urlwebsite");
var itemList = doc.DocumentNode.SelectNodes("//p")
.Select(p => p.InnerText)
.ToList();

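LoadHtml parses a string of HTML markup; it does not download anything, so passing it a URL just parses the URL text itself. To fetch the page, use HtmlWeb.Load instead: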
HtmlAgilityPack.HtmlDocument doc;
var web = new HtmlAgilityPack.HtmlWeb();
doc = web.Load("urlwebsite");
var itemList = doc.DocumentNode.SelectNodes("//p")
.Select(p => p.InnerText)
.ToList();
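
The difference between parsing markup and downloading a page can be checked offline. The following is a minimal sketch, not a definitive implementation: the class name ParagraphText, the helper FromMarkup, and the sample markup are assumptions made for illustration. It also guards against SelectNodes returning null, which it does when no node matches.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ParagraphText
{
    // Hypothetical helper: returns the text of every <p> in the given markup.
    public static List<string> FromMarkup(string markup)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(markup); // parses the string only; never performs a download

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//p");
        if (nodes == null)
            return new List<string>();

        // DeEntitize turns entities such as &amp; back into plain characters.
        return nodes.Select(p => HtmlEntity.DeEntitize(p.InnerText).Trim()).ToList();
    }

    public static void Main()
    {
        var paragraphs = FromMarkup("<html><body><p>first</p><p>second &amp; third</p></body></html>");
        Console.WriteLine(string.Join(" | ", paragraphs));

        // A URL passed as markup is just text that contains no <p> elements.
        var fromUrlText = FromMarkup("http://example.com");
        Console.WriteLine(fromUrlText.Count);
    }
}
```

With real pages, the same null check applies to the document returned by HtmlWeb.Load.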

Related

Using a method in a Lambda Expression - HTMLDoc

I want to load content into a list of HtmlDocument objects using HTML Agility Pack.
I have successfully achieved what I want using:
var htmllist = new List<HtmlDocument>();
int counter = 0;
foreach (var c in content)
{
    htmllist.Add(new HtmlDocument());
    htmllist[counter].LoadHtml(c);
    counter += 1;
}
How can I write this as a lambda expression? I tried:
var htmllist = content.Select(p => new HtmlDocument() {Text = p })

You need to add ToList() to execute the query, like this:
var htmllist = content.Select(p => new HtmlDocument() { Text = p }).ToList();
Per your comment and another comment, you can change your existing code a bit, like this:
private HtmlDocument LoadHtmlFromContent(string content)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(content);
    return doc;
}
Now call this in your LINQ query:
var htmllist = content.Select(p => this.LoadHtmlFromContent(p)).ToList();

Enumerable.Select accepts an arbitrary selector as a Func<TSource, TResult>, so you can inline the conversion, but IMHO it really doesn't look great:
content.Select(c => { var doc = new HtmlDocument(); doc.LoadHtml(c); return doc; });
If you're using C# 7.0 or later, you could use a local function instead, e.g.:
void Convert(IEnumerable<string> content)
{
    var htmls = content.Select(ConvertToHtml);

    HtmlDocument ConvertToHtml(string c)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(c);
        return doc;
    }
}
That looks more maintainable to me.

Getting text from h1 className

I'm trying to get the text 'getthis' from the following element using the Agility Pack:
<h1 class="point">getthis<span class="level">Niveau 0</span> </h1>
I've tried:
var links = webBrowser1.Document.GetElementsByTagName("point");
foreach (HtmlElement link in links)
{
    Console.WriteLine("AAAAA");
}
var varit = doc.DocumentNode
    .SelectSingleNode("//h1[@class='point']")
    .InnerText;
Console.WriteLine("WB2 TRUE ", varit.ToString());
var varit = doc.DocumentNode
    .SelectSingleNode("//h1[@class='point']")
    .InnerHtml;
Console.WriteLine("WB2 TRUE ", varit.ToString());
Why isn't my code working?

Try this:
htmlDoc.DocumentNode.Descendants("h1")
    .Where(d => d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("point"))
    .First().InnerHtml;
For example, this code:
string html = "<h1 class=\"point\">getthis<span class=\"level\">Niveau 0</span> </h1>";
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml(html);
var str = htmlDoc.DocumentNode.Descendants("h1")
    .Where(d => d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("point"))
    .First().ChildNodes[0].InnerText;
returns "getthis".
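
Note that Value.Contains("point") is a substring test on the raw attribute value, so it would also match a class such as "pointer". The sketch below shows one way to match the exact class token instead, assuming class names are separated by whitespace. The names HeadingText, HasClassToken, and FirstHeadingText, as well as the sample markup, are hypothetical and not part of the original answer.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class HeadingText
{
    // Hypothetical helper: true when the class attribute contains the exact token.
    public static bool HasClassToken(HtmlNode node, string token)
    {
        return node.GetAttributeValue("class", "")
                   .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
                   .Contains(token);
    }

    // Returns the leading text of the first <h1> carrying the class, or null if there is none.
    public static string FirstHeadingText(string markup, string cssClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(markup);

        var h1 = doc.DocumentNode.Descendants("h1")
                    .FirstOrDefault(n => HasClassToken(n, cssClass));

        // The first child is the text node that precedes the nested <span>.
        return h1?.FirstChild?.InnerText.Trim();
    }

    public static void Main()
    {
        var html = "<h1 class=\"pointer\">other</h1>" +
                   "<h1 class=\"big point\">getthis<span class=\"level\">Niveau 0</span> </h1>";
        Console.WriteLine(FirstHeadingText(html, "point"));
    }
}
```

Using FirstOrDefault with the null-conditional operator avoids the exception that First() throws when no heading matches.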

I solved this after finding the following:
var a = doc.DocumentNode.SelectNodes("//h1").Skip(0).Take(1).Single();

Html Agility Pack get contents from table

I need to get the location, address, and phone number from "http://anytimefitness.com/find-gym/list/AL". So far I have this:
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.LoadHtml(stateURLs[0].ToString());
var BlankNode =
    htmlDoc.DocumentNode.SelectNodes("/div[@class='segmentwhite']/table[@style='width: 100%;']//tr[@class='']");
var GrayNode =
    htmlDoc.DocumentNode.SelectNodes("/div[@class='segmentwhite']/table[@style='width: 100%;']//tr[@class='gray_bk']");
I have looked around Stack Overflow for a while, but none of the existing posts about HtmlAgilityPack have really helped. I have also been using http://www.w3schools.com/xpath/xpath_syntax.asp

Since the <div> you're after is not a direct child of the root node, you need to use // instead of /. You can then combine the XPath expressions for BlankNode and GrayNode using the or operator, for example:
var htmlweb = new HtmlWeb();
HtmlDocument htmlDoc = htmlweb.Load("http://anytimefitness.com/find-gym/list/AL");
htmlDoc.OptionFixNestedTags = true;
var AllNode =
    htmlDoc.DocumentNode.SelectNodes("//div[@class='segmentwhite']/table//tr[@class='' or @class='gray_bk']");
foreach (HtmlNode node in AllNode)
{
    var location = node.SelectSingleNode("./td[2]").InnerText;
    var address = node.SelectSingleNode("./td[3]").InnerText;
    var phone = node.SelectSingleNode("./td[4]").InnerText;
    // do something with the values above
}

Here's an example I tested in LINQPad.
string url = @"http://anytimefitness.com/find-gym/list/AL";
var client = new System.Net.WebClient();
var data = client.DownloadData(url);
var html = Encoding.UTF8.GetString(data);
var htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.LoadHtml(html);
var gyms = htmlDoc.DocumentNode.SelectNodes("//tbody/tr[@class='' or @class='gray_bk']");
foreach (var gym in gyms) {
    var city = gym.SelectSingleNode("./td[2]").InnerText;
    var address = gym.SelectSingleNode("./td[3]").InnerText;
    var phone = gym.SelectSingleNode("./td[4]").InnerText;
}
Since HtmlAgilityPack also supports LINQ, you could also do something like:
string[] classes = { "", "gray_bk" };
var gyms = htmlDoc
    .DocumentNode
    .Descendants("tr")
    // Attributes["class"] is null for rows without a class attribute, so check first
    .Where(t => t.Attributes.Contains("class") && classes.Contains(t.Attributes["class"].Value))
    .ToList();
gyms.ForEach(gym => {
    var city = gym.SelectSingleNode("./td[2]").InnerText;
    var address = gym.SelectSingleNode("./td[3]").InnerText;
    var phone = gym.SelectSingleNode("./td[4]").InnerText;
});

Load website using Html-agility-pack using xpath-syntax

I have this method that selects specific HTML and puts it in a list.
It works perfectly when I use an HTML file saved on my computer, but how can I load content from a website?
This is my method that loads the .html file, which works:
public void TestGetHtml()
{
    var doc = new HtmlDocument();
    doc.Load("C:/Users/Jonathan/Desktop/laggen.html");
    var xpath = "//table[@id='tableSearchArticle']/tbody/tr/td[4]";
    var listOfGtins = doc.DocumentNode.SelectNodes(xpath)
        .Select(td => td.InnerText.Replace("GTIN:", ""));
}
But I want to load a website instead of a file, like this:
public void TestGetHtml()
{
    var doc = new HtmlDocument();
    doc.Load("http://www.dabas.com/mypages/search.aspx?typ=FP&sosokord=laggen"); // this is the site I want to load
    var xpath = "//table[@id='tableSearchArticle']/tbody/tr/td[4]";
    var listOfGtins = doc.DocumentNode.SelectNodes(xpath)
        .Select(td => td.InnerText.Replace("GTIN:", ""));
}

Use HtmlWeb if you want to load HTML over HTTP from a URL:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.dabas.com/mypages/search.aspx?typ=FP&sosokord=laggen");
var xpath = "//table[@id='tableSearchArticle']/tbody/tr/td[4]";
var listOfGtins = doc.DocumentNode.SelectNodes(xpath)
    .Select(td => td.InnerText.Replace("GTIN:", ""));
foreach (string gtin in listOfGtins)
{
    Console.WriteLine(gtin);
}

extracting values of text from html source file

In this code, the variable TempTxt holds the HTML body content as a string.
How can I extract the inner text or inner HTML of a <table> or <td> element using lambda syntax?
public string ExtractPageValue(IWebDriver DDriver, string url = "")
{
    if (string.IsNullOrEmpty(url))
        url = @"http://www.boi.org.il/he/Markets/ExchangeRates/Pages/Default.aspx";
    var service = InternetExplorerDriverService.CreateDefaultService(directory);
    service.LogFile = directory + @"\seleniumlog.txt";
    service.LoggingLevel = InternetExplorerDriverLogLevel.Trace;
    var options = new InternetExplorerOptions();
    options.IntroduceInstabilityByIgnoringProtectedModeSettings = true;
    DDriver = new InternetExplorerDriver(service, options, TimeSpan.FromSeconds(60));
    DDriver.Navigate().GoToUrl(url);
    var TempTxt = DDriver.PageSource;
    return ""; // Math.Round(Convert.ToDouble(TempTxt.Split(' ')[10]), 2).ToString();
}

If you are open to trying HtmlAgilityPack (html below is the page source string, i.e. TempTxt in the question):
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var table = doc.DocumentNode.SelectNodes("//table/tr")
    .Select(tr => tr.Elements("td").Select(td => td.InnerText).ToList())
    .ToList();
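
HtmlAgilityPack keeps the markup as written and does not add a <tbody> element on its own, so the path "//table/tr" only matches rows that sit directly under <table>. The sketch below is one possible variation under that assumption: it uses "//table//tr" so rows are found with or without a <tbody>, and it handles the null that SelectNodes returns when nothing matches. The names TableText and Rows and the sample markup are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TableText
{
    // Hypothetical helper: one inner list per row, one string per <td> cell.
    public static List<List<string>> Rows(string markup)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(markup);

        // "//table//tr" matches rows with or without an intervening <tbody>.
        var rows = doc.DocumentNode.SelectNodes("//table//tr");
        if (rows == null)
            return new List<List<string>>();

        return rows
            .Select(tr => tr.Elements("td")
                            .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
                            .ToList())
            .Where(cells => cells.Count > 0) // skip rows that contain no <td>, e.g. header rows
            .ToList();
    }

    public static void Main()
    {
        var html = "<table><tbody>" +
                   "<tr><th>Currency</th><th>Rate</th></tr>" +
                   "<tr><td>USD</td><td>3.5</td></tr>" +
                   "</tbody></table>";

        foreach (var cells in Rows(html))
            Console.WriteLine(string.Join(", ", cells));
    }
}
```

If the page contains nested tables, "//table//tr" also returns the inner rows, so a more specific path (for example one that selects the table by id) may be preferable.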
