
C# + WebClient + Html Agility Pack + web parsing
I want to go through the list of jobs on this page, but I can't parse those links because they change.
For example, I see the link as it appears in the browser (Link), but when I parse the page using WebClient and HtmlAgilityPack I get the changed link.
Do I have to set something on WebClient to include sessions or scripts?
Here is my code:
private void getLinks()
{
    StreamReader sr = new StreamReader("categories.txt");
    while (!sr.EndOfStream)
    {
        string url = sr.ReadLine();
        WebClient wc = new WebClient();
        string source = wc.DownloadString(url);
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(source);
        // Select every anchor with this class; SelectNodes returns null when nothing matches
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(".//a[@class='internerLink primaerElement']");
        if (nodes == null)
            continue;
        foreach (HtmlNode node in nodes)
        {
            Console.WriteLine("http://jobboerse.arbeitsagentur.de" + node.Attributes["href"].Value);
        }
    }
    sr.Close();
}

You may try the WebBrowser class (http://msdn.microsoft.com/en-us/library/system.windows.controls.webbrowser%28v=vs.110%29.aspx), which actually runs the page's scripts, and then use its DOM (see "Accessing DOM from WebBrowser") to retrieve the links.
mshtml.IHTMLDocument2 htmlDoc = webBrowser.Document as mshtml.IHTMLDocument2;
// do something like find button and click
htmlDoc.all.item("testBtn").click();
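For the links themselves, here is a minimal sketch along the same lines, assuming a WPF WebBrowser field named webBrowser and a project reference to Microsoft.mshtml (the job-board URL comes from the question):
// Wait for the page, and the scripts that rewrite the links, to finish loading.
webBrowser.LoadCompleted += (s, e) =>
{
    var htmlDoc = webBrowser.Document as mshtml.IHTMLDocument2;
    if (htmlDoc == null) return;
    // IHTMLDocument2.links collects every <a href> (and <area>) on the page.
    foreach (mshtml.IHTMLElement link in htmlDoc.links)
    {
        Console.WriteLine(link.getAttribute("href", 0));
    }
};
webBrowser.Navigate(new Uri("http://jobboerse.arbeitsagentur.de"));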

Related

Html Agility Pack get specific content from a <li> tag

I need some text from this website: https://www.amazon.com/dp/B074J9SSPD. To be specific, I need to extract the data under the "About the Product" section.
I tried
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = new HtmlDocument();
doc = web.Load("https://amazon.com/dp/B074J9SSPD");
foreach (var node in doc.DocumentNode.SelectNodes("//li[@class='showHiddenFeatureBullets']") {
    string ar = node.InnerText;
    HtmlAttribute att = node.Attributes["class"];
    MessageBox.Show(ar.ToString());
    if (att.Value.Contains("showHiddenFeatureBullets"))
    {
    }
}
Please suggest the right way; I'm getting a blank string.
Your original code (before that first edit) worked for me; it was just missing the right parenthesis on the foreach loop. I also broke the nodes out into their own variable to make it easier to read, but this should work for you. I tested it locally and it worked.
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = new HtmlDocument();
doc = web.Load("https://amazon.com/dp/B074J9SSPD");
var aboutProductNodes = doc.DocumentNode.SelectNodes("//li[@class='showHiddenFeatureBullets']");
foreach (var node in aboutProductNodes)
{
    string ar = node.InnerText;
    HtmlAttribute att = node.Attributes["class"];
    MessageBox.Show(ar.ToString().Trim());
    if (att.Value.Contains("showHiddenFeatureBullets"))
    {
    }
}
However, I would suggest looking into the Amazon API. Scraping worked for me about half the time; the other half, Amazon responded telling me to use their API rather than scrape them. So that might have been part of your problem too.
https://developer.amazon.com/services-and-apis
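If you do keep scraping instead, note that Amazon often serves different markup (or a captcha) to clients without a browser-like User-Agent. A hedged sketch using HtmlWeb's PreRequest hook to set one; the exact header value is just an example:
var web = new HtmlWeb();
// PreRequest lets you adjust the outgoing HttpWebRequest before it is sent.
web.PreRequest = request =>
{
    request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";
    return true; // true = go ahead with the request
};
var doc = web.Load("https://amazon.com/dp/B074J9SSPD");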

How do I correctly grab the images I scrape with HtmlAgilityPack?

I am currently working on a project and learning HAP as I go.
I get the basics of it, and it seems like it could be very powerful.
I'm having an issue right now: I am trying to scrape a product on one website and get the links to the images, but I don't know how to extract the link with XPath.
I used to do this with Regex, which was a lot easier, but I am moving on to HAP.
This is my current code. I don't think it will be very useful to see, but I'll put it in either way.
private static void HAP()
{
    var url = "https://www.dhgate.com/product/brass-hexagonal-fidget-spinner-hexa-spinner/403294406.html#gw-0-4|ff8080815e03d6df015e9394cc681f8a:ff80808159abe8a5015a3fd78c5b51bb";
    // HtmlWeb - a utility class to get an HTML document over HTTP
    var web = new HtmlWeb();
    // Load() downloads the specified HTML document from an Internet resource
    var doc = web.Load(url);
    var rootNode = doc.DocumentNode;
    var divs = doc.DocumentNode.SelectNodes(String.Format("//IMG[@src='{0}']", "www.dhresource.com/webp/m/100x100/f2/albu/g5/M00/14/45/rBVaI1kWttaAI1IrAATeirRp-t8793.jpg"));
    Console.WriteLine(divs);
    Console.ReadLine();
}
This is the link I am scraping from:
https://www.dhgate.com/product/2017-led-light-up-hand-spinners-fidget-spinner/398793721.html#s1-0-1b;searl|4175152669
And this should be the XPath of the first image:
//IMG[@src='//www.dhresource.com/webp/m/100x100s/f2-albu-g5-M00-6E-20-rBVaI1kWtmmAF9cmAANMKysq_GY926.jpg/2017-led-light-up-hand-spinners-fidget-spinner.jpg']
I created a helper method for this.
I had to get the nodes, then the src attribute, and then cycle through them to get all the links.
private static void HAP()
{
    // Declare the URL
    var url = "https://www.dhgate.com/product/brass-hexagonal-fidget-spinner-hexa-spinner/403294406.html#gw-0-4|ff8080815e03d6df015e9394cc681f8a:ff80808159abe8a5015a3fd78c5b51bb";
    // HtmlWeb - a utility class to get an HTML document over HTTP
    var web = new HtmlWeb();
    // Load() downloads the specified HTML document from an Internet resource
    var doc = web.Load(url);
    var rootNode = doc.DocumentNode;
    // Grab every <img> element, then read each one's src attribute
    var nodes = doc.DocumentNode.SelectNodes("//img");
    foreach (var src in nodes)
    {
        var links = src.Attributes["src"].Value;
        Console.WriteLine(links);
    }
    Console.ReadLine();
}
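One thing to watch for on that page: the src values are protocol-relative ("//www.dhresource.com/..."), so prepend a scheme before trying to download them. A small follow-up sketch, assuming the nodes variable from the method above:
foreach (var img in nodes)
{
    // GetAttributeValue returns the fallback ("" here) when src is missing.
    var src = img.GetAttributeValue("src", "");
    if (src.Length == 0) continue;
    // "//host/path" means "same scheme as the page"; pick https explicitly.
    var absolute = src.StartsWith("//") ? "https:" + src : src;
    Console.WriteLine(absolute);
}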

How to remove the "HtmlDocument.cs not found" error in Html Agility Pack

class Response:
public string WebResponse(string url) // class through which I'll pass the link of a website and parse some divs in a method of this class
{
    string html = string.Empty;
    try
    {
        HtmlDocument doc = new HtmlDocument(); // when execution reaches here it gives the error "HtmlDocument.cs not found" and opens a window for browsing the source
        WebClient client = new WebClient();    // even if I use HtmlWeb here instead, it still looks for HtmlWeb.cs
        html = client.DownloadString(url);     // is this from some breakpoint error? I set only one, in the method where I am parsing
        doc.LoadHtml(html);
    }
    catch (Exception)
    {
        html = string.Empty;
    }
    return html; // please help me remove this error, using Html Agility Pack with a console application
}
Even if I make a new project and run the code, it gets stuck here. I have added the DLL too, and it still gives me this error. Please help me remove it.
WebResponse is an abstract class, so first of all that name is already taken. Second, in order to use WebResponse a class has to inherit from WebResponse, i.e.:
public class WR : WebResponse
{
    // Code
}
Also, your current code has nothing to do with Html Agility Pack. If you want to load the HTML of a webpage into an HtmlDocument, do the following:
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
try
{
    var temp = new Uri(url);
    var request = (HttpWebRequest)WebRequest.Create(temp);
    request.Method = "GET";
    using (var response = (HttpWebResponse)request.GetResponse())
    {
        using (var stream = response.GetResponseStream())
        {
            htmlDoc.Load(stream, Encoding.GetEncoding("iso-8859-9"));
        }
    }
}
catch (WebException ex)
{
    Console.WriteLine(ex.Message);
}
Then, in order to get nodes in the HtmlDocument, you have to use XPath, like so:
HtmlNode node = htmlDoc.DocumentNode.SelectSingleNode("//body");
Console.WriteLine(node.InnerText);
That error is sometimes caused by the version of the Html Agility Pack NuGet package you are using. Update NuGet in the Visual Studio gallery, then try installing Html Agility Pack again and run your project.
You can try cleaning and re-building the solution. This may fix the issue.

Get the content of an element of a Web page using C#

Is there any way to get the content of an element or control of a web page that is open in a browser, from a C# app?
I tried to get the window handle, but I don't know how to use it after that to have any sort of communication with the page. I also tried this code:
using (var client = new WebClient())
{
    var contents = client.DownloadString("http://www.google.com");
    Console.WriteLine(contents);
}
This code gives me a lot of data I can't use.
You could use an HTML parser such as HTML Agility Pack to extract the information you are interested in from the HTML you downloaded:
using (var client = new WebClient())
{
    // Download the HTML
    string html = client.DownloadString("http://www.google.com");

    // Now feed it to HTML Agility Pack:
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    // Now you can query the DOM. For example, extract
    // all href attributes from all anchors:
    foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
    {
        HtmlAttribute href = link.Attributes["href"];
        if (href != null)
        {
            Console.WriteLine(href.Value);
        }
    }
}

Read content of Web Browser in WPF

Hello developers, I want to read external content from a website, such as the elements between certain tags. I am using the WebBrowser control, and here is my code; however, it just fills my WebBrowser control with the web page:
public MainWindow()
{
    InitializeComponent();
    wbMain.Navigate(new Uri("http://www.annonymous.com", UriKind.RelativeOrAbsolute));
}
You can use the Html Agility Pack library to parse any HTML formatted data.
HtmlDocument doc = new HtmlDocument();
// LoadHtml takes the HTML string itself (Load expects a file or stream);
// see the last answer below for reading that string out of the control
doc.LoadHtml(htmlText);
var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
NOTE: The method SelectNodes accepts XPath, not CSS or jQuery selectors. To select by id, for example:
var node = doc.DocumentNode.SelectSingleNode("//*[@id='my_element_id']");
As I understood from your question, you are only trying to parse the HTML data and you don't need to show the actual web page.
If that is the case, then you can take a very simple approach and use HttpWebRequest:
var _plainText = string.Empty;
var _request = (HttpWebRequest)WebRequest.Create("http://www.google.com");
_request.Timeout = 5000;
_request.Method = "GET";
_request.ContentType = "text/plain";
using (var _webResponse = (HttpWebResponse)_request.GetResponse())
{
    var _webResponseStatus = _webResponse.StatusCode;
    var _stream = _webResponse.GetResponseStream();
    using (var _streamReader = new StreamReader(_stream))
    {
        _plainText = _streamReader.ReadToEnd();
    }
}
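From there you could hand _plainText to Html Agility Pack exactly as in the first answer; a minimal continuation (the //title XPath is just an example):
var doc = new HtmlDocument();
doc.LoadHtml(_plainText);
// SelectSingleNode returns null when nothing matches, so guard it.
var title = doc.DocumentNode.SelectSingleNode("//title");
if (title != null)
{
    Console.WriteLine(title.InnerText);
}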
Try this:
dynamic doc = wbMain.Document;
var htmlText = doc.documentElement.InnerHtml;
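Putting the last two ideas together, a sketch (assuming the wbMain control from the question and that the page has finished loading):
// Read the rendered HTML out of the control via its COM document...
dynamic comDoc = wbMain.Document;
string htmlText = comDoc.documentElement.InnerHtml;
// ...then parse that string with Html Agility Pack.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlText);
var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
if (anchors != null)
{
    foreach (var a in anchors)
    {
        Console.WriteLine(a.GetAttributeValue("href", ""));
    }
}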
