Load website using Html-agility-pack using xpath-syntax - c#

I have this method that selects specific HTML and puts it in a list.
It works perfectly when I use an HTML file saved on my computer. But how can I load the content from a website?
This is my method for loading the .html file, which works:
public void TestGetHtml()
{
    var doc = new HtmlDocument();
    doc.Load("C:/Users/Jonathan/Desktop/laggen.html");
    var xpath = "//table[@id='tableSearchArticle']/tbody/tr/td[4]";
    var listOfGtins = doc.DocumentNode.SelectNodes(xpath)
        .Select(td => td.InnerText.Replace("GTIN:", ""));
}
But I want to load a website instead of a file, like this:
public void TestGetHtml()
{
    var doc = new HtmlDocument();
    doc.Load("http://www.dabas.com/mypages/search.aspx?typ=FP&sosokord=laggen"); // <--- this is the site I want to load
    var xpath = "//table[@id='tableSearchArticle']/tbody/tr/td[4]";
    var listOfGtins = doc.DocumentNode.SelectNodes(xpath)
        .Select(td => td.InnerText.Replace("GTIN:", ""));
}

Use HtmlWeb.Load if you want to load HTML over HTTP from a URL:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.dabas.com/mypages/search.aspx?typ=FP&sosokord=laggen");
var xpath = "//table[@id='tableSearchArticle']/tbody/tr/td[4]";
var listOfGtins = doc.DocumentNode.SelectNodes(xpath)
    .Select(td => td.InnerText.Replace("GTIN:", ""));
foreach (string gtin in listOfGtins)
{
    Console.WriteLine(gtin);
}
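One thing to watch for with the XPath approach above: SelectNodes returns null when nothing matches, and a `/tbody/` step only matches if the downloaded markup really contains a `<tbody>` element (browsers add one to the DOM, the raw HTML may not have it). A minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the inline markup is a hypothetical stand-in for the downloaded page:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class GtinSketch
{
    static void Main()
    {
        // Hypothetical markup standing in for the real page; note there is no <tbody>.
        var html = "<table id='tableSearchArticle'>" +
                   "<tr><td>1</td><td>2</td><td>3</td><td>GTIN:123</td></tr>" +
                   "</table>";
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//tr" instead of "/tbody/tr" matches rows with or without a <tbody> wrapper.
        var cells = doc.DocumentNode.SelectNodes("//table[@id='tableSearchArticle']//tr/td[4]");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var gtins = cells == null
            ? Enumerable.Empty<string>()
            : cells.Select(td => td.InnerText.Replace("GTIN:", "").Trim());

        foreach (var gtin in gtins)
        {
            Console.WriteLine(gtin);
        }
    }
}
```

With a real page, the `html` string would be replaced by the document returned from `HtmlWeb.Load`.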

Related

Parsing site using HtmlAgilityPack in C#

For example, I have the link https://shikimori.one/animes/38256-magia-record-mahou-shoujo-madoka-magica-gaiden-tv/art
I want to get the list of divs with the class "container packery" from it using HtmlAgilityPack in C# (in order to download the images from all the links). But this part
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(link);
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(link);
returns, as far as I understand, the HTML of this page: https://shikimori.one/animes/38256-magia-record-mahou-shoujo-madoka-magica-gaiden-tv. So I can't parse anything from "/art", and the next part of the code just returns null.
var links = htmlDocument.DocumentNode.Descendants("div")
    .Where(node => node.GetAttributeValue("class", "")
        .Equals("menu-slide-outer x199")).ToList();
What am I missing?
Final code:
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static List<string> sources = new List<string>();

    [STAThread]
    static void Main(string[] args)
    {
        var link = "https://shikimori.one/animes/1577-taiho-shichau-zo/art";
        var web = new HtmlWeb();
        web.BrowserTimeout = TimeSpan.FromTicks(0);
        // LoadFromBrowser renders the page so that dynamically loaded content is present.
        var htmlDocument = web.LoadFromBrowser(link);
        var divlink = htmlDocument.DocumentNode.Descendants("div")
            .Where(node => node.GetAttributeValue("class", "")
                .Equals("container packery")).ToList();
        var alink = htmlDocument.DocumentNode.Descendants("a")
            .Where(node => node.GetAttributeValue("class", "")
                .Equals("b-image")).ToList();
        // Collect the href of every image link.
        foreach (var a in alink)
        {
            sources.Add(a.GetAttributeValue("href", string.Empty));
        }
        Console.WriteLine("done");
        Console.ReadKey();
    }
}
With HttpClient:
string html = string.Empty;
using (var httpClient = new HttpClient())
{
    html = await httpClient.GetStringAsync(link);
}
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html); // LoadHtml expects the page source, not the URL.
Or, a simple way to parse HTML code from a URL:
var web = new HtmlAgilityPack.HtmlWeb();
var htmlDocument = await web.LoadFromWebAsync(link);
Or, load dynamic (i.e. Ajax) content:
var web = new HtmlAgilityPack.HtmlWeb();
var htmlDocument = web.LoadFromBrowser(link);
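To make the original mistake concrete: LoadHtml treats whatever string it receives as markup, so passing the URL simply produces a document containing that URL as text. A small sketch of the difference, assuming the HtmlAgilityPack NuGet package is referenced; the URL and the inline markup are placeholders and nothing is fetched over the network:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class LoadHtmlSketch
{
    static void Main()
    {
        var link = "https://example.com/art"; // placeholder URL, never fetched

        // Wrong: the URL text itself is parsed as a document.
        var wrong = new HtmlDocument();
        wrong.LoadHtml(link);
        Console.WriteLine(wrong.DocumentNode.Descendants("div").Count()); // 0

        // Right: pass the page source. This string is a hypothetical stand-in for it.
        var html = "<div class='container packery'>" +
                   "<a class='b-image' href='/img/1.jpg'></a>" +
                   "</div>";
        var right = new HtmlDocument();
        right.LoadHtml(html);

        var hrefs = right.DocumentNode.Descendants("a")
            .Where(a => a.GetAttributeValue("class", "") == "b-image")
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .ToList();
        Console.WriteLine(hrefs.Count); // 1
    }
}
```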

Using a method in a Lambda Expression - HTMLDoc

I want to load content into a htmlDocument list using HTML Agility pack.
I have successfully achieved what I want using:
var htmllist = new List<HtmlDocument>();
int counter = 0;
foreach (var c in content)
{
    htmllist.Add(new HtmlDocument());
    htmllist[counter].LoadHtml(c);
    counter += 1;
}
How can I write this as a lambda expression? I tried:
var htmllist = content.Select(p => new HtmlDocument() {Text = p })
You need to add a ToList() to execute the query like
var htmllist = content.Select(p => new HtmlDocument() { Text = p }).ToList();
Per your comment and another comment: you can change your existing code a bit like
private HtmlDocument LoadHtmlFromContent(string content)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(content);
    return doc;
}
Now call this in your Linq query like
var htmllist = content.Select(p => this.LoadHtmlFromContent(p)).ToList();
Enumerable.Select accepts an arbitrary selector as a Func<TSource,TResult>. So you can inline the conversion method, but IMHO it really doesn't look great…
content.Select(c => {var doc = new HtmlDocument(); doc.LoadHtml(c); return doc;});
If you're using C# >=7.0 you could think about using a local function for that. E.g.
void Convert(IEnumerable<string> content)
{
    var htmls = content.Select(ConvertToHtml);

    HtmlDocument ConvertToHtml(string c)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(c);
        return doc;
    }
}
That looks more maintainable to me.
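The ToList() advice earlier in this thread matters because Select is lazy: the selector does not run until the sequence is enumerated, and it runs again on every enumeration. A minimal sketch using only System.Linq; the `convert` delegate is a hypothetical stand-in for the HtmlDocument conversion and merely counts how often it is called:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class DeferredSketch
{
    static void Main()
    {
        var content = new List<string> { "<p>a</p>", "<p>b</p>" };
        int calls = 0;

        // Hypothetical stand-in for the conversion; it only counts invocations.
        Func<string, string> convert = c => { calls++; return c.ToUpperInvariant(); };

        var lazy = content.Select(convert);
        Console.WriteLine(calls); // 0: nothing has run yet

        var list = lazy.ToList();
        Console.WriteLine(calls); // 2: each element converted once

        var again = lazy.ToList();
        Console.WriteLine(calls); // 4: re-enumerating runs the selector again
    }
}
```

Materializing once with ToList() and reusing the list avoids parsing every document again on each pass.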

Html Agility Pack get contents from table

I need to get the location, address, and phone number from "http://anytimefitness.com/find-gym/list/AL". So far I have this:
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.LoadHtml(stateURLs[0].ToString());
var BlankNode =
    htmlDoc.DocumentNode.SelectNodes("/div[@class='segmentwhite']/table[@style='width: 100%;']//tr[@class='']");
var GrayNode =
    htmlDoc.DocumentNode.SelectNodes("/div[@class='segmentwhite']/table[@style='width: 100%;']//tr[@class='gray_bk']");
I have looked around Stack Overflow for a while, but none of the existing posts about HtmlAgilityPack has really helped. I have also been using http://www.w3schools.com/xpath/xpath_syntax.asp
Since the <div> you're after is not a direct child of the root node, you need to use // instead of /. You can then combine the XPath expressions for BlankNode and GrayNode using the or operator, for example:
var htmlweb = new HtmlWeb();
HtmlDocument htmlDoc = htmlweb.Load("http://anytimefitness.com/find-gym/list/AL");
htmlDoc.OptionFixNestedTags = true;
var AllNode =
    htmlDoc.DocumentNode.SelectNodes("//div[@class='segmentwhite']/table//tr[@class='' or @class='gray_bk']");
foreach (HtmlNode node in AllNode)
{
    var location = node.SelectSingleNode("./td[2]").InnerText;
    var address = node.SelectSingleNode("./td[3]").InnerText;
    var phone = node.SelectSingleNode("./td[4]").InnerText;
    // do something with the information above
}
Here's an example I tested in LinqPad.
string url = @"http://anytimefitness.com/find-gym/list/AL";
var client = new System.Net.WebClient();
var data = client.DownloadData(url);
var html = Encoding.UTF8.GetString(data);
var htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.LoadHtml(html);
var gyms = htmlDoc.DocumentNode.SelectNodes("//tbody/tr[@class='' or @class='gray_bk']");
foreach (var gym in gyms) {
    var city = gym.SelectSingleNode("./td[2]").InnerText;
    var address = gym.SelectSingleNode("./td[3]").InnerText;
    var phone = gym.SelectSingleNode("./td[4]").InnerText;
}
Since the HtmlAgilityPack also supports Linq, you could also do something like:
string[] classes = { "", "gray_bk" };
var gyms = htmlDoc
    .DocumentNode
    .Descendants("tr")
    // ?. avoids a NullReferenceException on rows that have no class attribute
    .Where(t => classes.Contains(t.Attributes["class"]?.Value))
    .ToList();
gyms.ForEach(gym => {
    var city = gym.SelectSingleNode("./td[2]").InnerText;
    var address = gym.SelectSingleNode("./td[3]").InnerText;
    var phone = gym.SelectSingleNode("./td[4]").InnerText;
});

HtmlAgilityPack c# taking several images and links

Please help me with this question:
private void DownLoadCompleted(object sender, HtmlDocumentLoadCompleted e)
{
    var doc = new HtmlDocument();
    doc.LoadHtml("http://www.unnu.com/popular-music-videos");
    //var query = doc.DocumentNode.Descendants("img");
    MessageBox.Show("chegou");
    foreach (HtmlNode linkNode in doc.DocumentNode.SelectNodes(@"//img[@src]"))
    {
        HtmlAttribute link = linkNode.Attributes[@"href"];
        HtmlNode imageNode = linkNode.SelectSingleNode(@"//.php?src");
        HtmlAttribute src = imageNode.Attributes[@"src"];
        string Link = link.Value;
        Uri imageUrl = new Uri(src.Value);
        MessageBox.Show("chegou");
    }
}
I need to get all the images and titles with their respective URLs. I'm using Windows Phone 7.5. The DLL is the same.
doc.LoadHtml expects html, not the url. Are you looking for something like this?
var web = new HtmlAgilityPack.HtmlWeb();
var doc = web.Load("http://www.unnu.com/popular-music-videos");
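var imgs = doc.DocumentNode.SelectNodes(@"//img[@src]")
    .Select(img => new
    {
        Link = img.Attributes["src"].Value,
        // GetAttributeValue falls back to an empty title when an <img> has no alt attribute
        Title = img.GetAttributeValue("alt", string.Empty)
    })
    .ToList();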
If the WP7 version of HtmlAgilityPack doesn't support HtmlWeb, you can also use WebClient to get the HTML string and pass it to doc.LoadHtml.

extracting values of text from html source file

In this code, the variable TempTxt holds the HTML body content as a string.
How can I extract the inner text/HTML of a <table> or <td> element using lambda syntax?
public string ExtractPageValue(IWebDriver DDriver, string url="")
{
    if(string.IsNullOrEmpty(url))
        url = @"http://www.boi.org.il/he/Markets/ExchangeRates/Pages/Default.aspx";
    var service = InternetExplorerDriverService.CreateDefaultService(directory);
    service.LogFile = directory + @"\seleniumlog.txt";
    service.LoggingLevel = InternetExplorerDriverLogLevel.Trace;
    var options = new InternetExplorerOptions();
    options.IntroduceInstabilityByIgnoringProtectedModeSettings = true;
    DDriver = new InternetExplorerDriver(service, options, TimeSpan.FromSeconds(60));
    DDriver.Navigate().GoToUrl(url);
    var TempTxt = DDriver.PageSource;
    return "";//Math.Round(Convert.ToDouble( TempTxt.Split(' ')[10]),2).ToString();
}
If you are open to trying HtmlAgilityPack:
// html is the page source string (TempTxt in the question)
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var table = doc.DocumentNode.SelectNodes("//table/tr")
    .Select(tr => tr.Elements("td").Select(td => td.InnerText).ToList())
    .ToList();
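A slightly more defensive variant of the snippet above, as a sketch only: it uses `//table//tr` so that rows wrapped in a `<tbody>` are also found, guards against SelectNodes returning null, and decodes entities with HtmlEntity.DeEntitize. It assumes the HtmlAgilityPack NuGet package is referenced; the markup and its values are invented placeholders standing in for DDriver.PageSource:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

class TableSketch
{
    static void Main()
    {
        // Hypothetical page source; in the question this would be DDriver.PageSource.
        var html = "<table><tbody>" +
                   "<tr><th>Currency</th><th>Rate</th></tr>" +
                   "<tr><td>USD</td><td>3.5</td></tr>" +
                   "<tr><td>EUR</td><td>4.0</td></tr>" +
                   "</tbody></table>";
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//table//tr" finds rows whether or not they sit inside a <tbody>.
        var rows = doc.DocumentNode.SelectNodes("//table//tr");

        List<List<string>> table = rows == null
            ? new List<List<string>>()
            : rows.Select(tr => tr.Elements("td")
                                  .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
                                  .ToList())
                  .Where(cells => cells.Count > 0) // drops header rows that only contain <th>
                  .ToList();

        foreach (var cells in table)
        {
            Console.WriteLine(string.Join(" | ", cells));
        }
    }
}
```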
