Read content of Web Browser in WPF - C#

Hello developers, I want to read external content from a website, such as the elements between tags. I am using the WebBrowser control, and here is my code; however, this code just fills my WebBrowser control with the web page:
public MainWindow()
{
    InitializeComponent();
    wbMain.Navigate(new Uri("http://www.annonymous.com", UriKind.RelativeOrAbsolute));
}

You can use the Html Agility Pack library to parse any HTML-formatted data.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html); // html holds the page source as a string
var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
NOTE: The SelectNodes method accepts XPath, not CSS or jQuery selectors.
var node = doc.DocumentNode.SelectNodes("id('my_element_id')");
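Since the WPF WebBrowser does not expose a DocumentText property (that is the WinForms control), one way to get the rendered HTML as a string is through the COM document object. A sketch, assuming a project reference to Microsoft.mshtml and that the page has finished loading:
// Sketch: pull the rendered markup out of the WPF WebBrowser.
var comDoc = (mshtml.HTMLDocument)wbMain.Document;
string html = comDoc.documentElement.outerHTML;

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
if (nodes != null)
{
    foreach (var link in nodes)
        Console.WriteLine(link.GetAttributeValue("href", string.Empty));
}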

As I understood from your question, you are only trying to parse the HTML data, and you don't need to show the actual web page.
If that is the case, then you can take a very simple approach and use HttpWebRequest:
var _plainText = string.Empty;
var _request = (HttpWebRequest)WebRequest.Create("http://www.google.com");
_request.Timeout = 5000;
_request.Method = "GET";
_request.ContentType = "text/plain";
using (var _webResponse = (HttpWebResponse)_request.GetResponse())
{
    var _webResponseStatus = _webResponse.StatusCode;
    var _stream = _webResponse.GetResponseStream();
    using (var _streamReader = new StreamReader(_stream))
    {
        _plainText = _streamReader.ReadToEnd();
    }
}
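On .NET 4.5 and later the same fetch is usually written with HttpClient. A minimal sketch (the class name is a placeholder; in real code a single HttpClient instance should be reused):
using System;
using System.Net.Http;
using System.Threading.Tasks;

static class HtmlFetcher
{
    // Reuse one client for the application's lifetime.
    private static readonly HttpClient _client = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(5)
    };

    public static async Task<string> GetHtmlAsync(string url)
    {
        using (var response = await _client.GetAsync(url))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}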

Try this:
dynamic doc = wbMain.Document;
var htmlText = doc.documentElement.InnerHtml;
edit: Taken from here.
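Keep in mind that Document is null until navigation has finished; a hedged sketch that waits for the LoadCompleted event before reading it:
// Sketch: read the document only after the page has loaded.
wbMain.LoadCompleted += (sender, e) =>
{
    dynamic doc = wbMain.Document;
    string htmlText = doc.documentElement.InnerHtml;
    // ...parse htmlText here...
};
wbMain.Navigate(new Uri("http://www.annonymous.com", UriKind.RelativeOrAbsolute));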

Related

How do I correctly grab the images I scrape with HtmlAgilityPack?

I am currently working on a project and I am learning HAP as I go.
I get the basics of it and it seems like it could be very powerful.
I'm having an issue right now: I am trying to scrape a product on one website and get the links to the images, but I don't know how to extract the link from the XPath.
I used to do this with Regex, which was a lot easier, but I am moving on to HAP.
This is my current code. I don't think it will be very useful to see, but I'll put it in either way.
private static void HAP()
{
    var url = "https://www.dhgate.com/product/brass-hexagonal-fidget-spinner-hexa-spinner/403294406.html#gw-0-4|ff8080815e03d6df015e9394cc681f8a:ff80808159abe8a5015a3fd78c5b51bb";
    // HtmlWeb - a utility class to get an HTML document over HTTP
    var web = new HtmlWeb();
    // Load() downloads the specified HTML document from an Internet resource.
    var doc = web.Load(url);
    var rootNode = doc.DocumentNode;
    var divs = doc.DocumentNode.SelectNodes(String.Format("//IMG[@src='{0}']", "www.dhresource.com/webp/m/100x100/f2/albu/g5/M00/14/45/rBVaI1kWttaAI1IrAATeirRp-t8793.jpg"));
    Console.WriteLine(divs);
    Console.ReadLine();
}
This is the link I am scraping from
https://www.dhgate.com/product/2017-led-light-up-hand-spinners-fidget-spinner/398793721.html#s1-0-1b;searl|4175152669
And this should be the XPath of the first image:
//IMG[@src='//www.dhresource.com/webp/m/100x100s/f2-albu-g5-M00-6E-20-rBVaI1kWtmmAF9cmAANMKysq_GY926.jpg/2017-led-light-up-hand-spinners-fidget-spinner.jpg']
I created a helper method for this.
I had to get the nodes, then the attribute, and then cycle through the attributes to get all the links.
private static void HAP()
{
    // Declare the URL
    var url = "https://www.dhgate.com/product/brass-hexagonal-fidget-spinner-hexa-spinner/403294406.html#gw-0-4|ff8080815e03d6df015e9394cc681f8a:ff80808159abe8a5015a3fd78c5b51bb";
    // HtmlWeb - a utility class to get an HTML document over HTTP
    var web = new HtmlWeb();
    // Load() downloads the specified HTML document from an Internet resource.
    var doc = web.Load(url);
    var rootNode = doc.DocumentNode;
    var nodes = doc.DocumentNode.SelectNodes("//img");
    foreach (var node in nodes)
    {
        var link = node.Attributes["src"].Value;
        Console.WriteLine(link);
    }
    Console.ReadLine();
}
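If you only want images that actually carry a src attribute, a slightly safer variant of the loop (a sketch using HAP's GetAttributeValue) avoids a possible NullReferenceException:
// Select only <img> elements that carry a src attribute.
var nodes = doc.DocumentNode.SelectNodes("//img[@src]");
if (nodes != null)
{
    foreach (var node in nodes)
    {
        // GetAttributeValue returns the fallback value instead of throwing.
        Console.WriteLine(node.GetAttributeValue("src", string.Empty));
    }
}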

C# + WebClient + HtmlAgilityPack + web parsing

I wanted to go through the list of jobs on this page, but I can't parse those links because they change.
For example, the link I see as it is in the browser is one URL, but when I parse the page using WebClient and HtmlAgilityPack I get a changed link.
Do I have to change settings on WebClient to include sessions or scripts?
Here is my code:
private void getLinks()
{
    StreamReader sr = new StreamReader("categories.txt");
    while (!sr.EndOfStream)
    {
        string url = sr.ReadLine();
        WebClient wc = new WebClient();
        string source = wc.DownloadString(url);
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(source);
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(".//a[@class='internerLink primaerElement']");
        foreach (HtmlNode node in nodes)
        {
            Console.WriteLine("http://jobboerse.arbeitsagentur.de" + node.Attributes["href"].Value);
        }
    }
    sr.Close();
}
You may try the WebBrowser class (http://msdn.microsoft.com/en-us/library/system.windows.controls.webbrowser%28v=vs.110%29.aspx) and then use its DOM to retrieve the links:
mshtml.IHTMLDocument2 htmlDoc = webBrowser.Document as mshtml.IHTMLDocument2;
// do something like find a button and click it
htmlDoc.all.item("testBtn").click();
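For your case, the WebBrowser can run the page's scripts for you; once loading finishes you can read the generated links from the DOM. A sketch, assuming a reference to Microsoft.mshtml:
// Sketch: enumerate anchors after the page (including its scripts) has loaded.
webBrowser.LoadCompleted += (sender, e) =>
{
    var htmlDoc = webBrowser.Document as mshtml.IHTMLDocument2;
    if (htmlDoc == null)
        return;

    foreach (var element in htmlDoc.links)
    {
        var anchor = element as mshtml.IHTMLAnchorElement;
        if (anchor != null)
            Console.WriteLine(anchor.href);
    }
};
webBrowser.Navigate(new Uri("http://jobboerse.arbeitsagentur.de"));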

Get the content of an element of a Web page using C#

Is there any way to get the content of an element or control of an open web page in a browser from a c# app?
I tried to get the window handle, but I don't know how to use it afterwards to have any sort of communication with it. I also tried this code:
using (var client = new WebClient())
{
    var contents = client.DownloadString("http://www.google.com");
    Console.WriteLine(contents);
}
This code gives me a lot of data I can't use.
You could use an HTML parser such as HTML Agility Pack to extract the information you are interested in from the HTML you downloaded:
using (var client = new WebClient())
{
    // Download the HTML
    string html = client.DownloadString("http://www.google.com");

    // Now feed it to HTML Agility Pack:
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    // Now you could query the DOM. For example you could extract
    // all href attributes from all anchors:
    foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
    {
        HtmlAttribute href = link.Attributes["href"];
        if (href != null)
        {
            Console.WriteLine(href.Value);
        }
    }
}

Is it possible to navigate a site by clicking links and then downloading the correct piece?

I'll try to explain what exactly I mean. I'm working on a program and I'm trying to download a bunch of images automatically from this site.
Namely, I want to download the big square icons from the page you get when you click on a hero name there; for example, on the Darius page, the image in the top left named DariusSquare.png, and save it into a folder.
Is this possible, or am I asking too much from C#?
Thank you very much!
In general, everything is possible given enough time and money. In your case, you need very little of the former and none of the latter :)
What you need to do can be described in the following high-level steps:
1. Get all <a> tags within the table with heroes.
2. Use the WebClient class to navigate to the URLs these <a> tags point to (i.e. the values of their href attributes) and download the HTML.
3. Find some wrapper element that is present on each hero page and contains the hero's image, then read the image's src attribute and download it. Alternatively, perhaps each image has a common ID you can use?
I don't think anyone will provide you with exact code that performs these steps for you; instead, you need to do some research of your own. Still, a rough sketch of the idea follows.
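A minimal sketch of those steps (every selector, URL, and folder here is a hypothetical placeholder, not taken from the actual site):
// Hypothetical sketch of the three steps; selectors and URLs are placeholders.
using (var client = new WebClient())
{
    // Step 1: download the list page and collect the hero links.
    var listDoc = new HtmlDocument();
    listDoc.LoadHtml(client.DownloadString("http://example.com/heroes"));
    var heroLinks = listDoc.DocumentNode.SelectNodes("//table//a[@href]");
    if (heroLinks != null)
    {
        foreach (var link in heroLinks)
        {
            // Step 2: follow each hero link and download its page.
            var heroDoc = new HtmlDocument();
            heroDoc.LoadHtml(client.DownloadString(link.GetAttributeValue("href", "")));

            // Step 3: locate the image inside its wrapper and save it.
            var img = heroDoc.DocumentNode.SelectSingleNode("//div[@class='hero-portrait']//img[@src]");
            if (img != null)
            {
                var src = img.GetAttributeValue("src", "");
                client.DownloadFile(src, Path.Combine(@"C:\icons", Path.GetFileName(src)));
            }
        }
    }
}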
Yes, it's possible: do a C# web request and use the HTML Agility Pack to find the image URL.
Then you can use another web request to download the image.
Example of downloading an image from a URL:
public static Image LoadImage(string url)
{
    var backgroundUrl = url;
    var request = WebRequest.Create(backgroundUrl);
    var response = request.GetResponse();
    var stream = response.GetResponseStream();
    return Image.FromStream(stream);
}
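Usage might look like this (the URL and folder are placeholders); note that Image.FromStream needs its stream kept open for the image's lifetime, so saving promptly is the safest pattern:
// Hypothetical usage: fetch the icon and save it to disk.
var image = LoadImage("https://example.com/images/DariusSquare.png");
image.Save(@"C:\icons\DariusSquare.png");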
Example using the HTML Agility Pack and getting some other data (profileurl and battleNetPlayerFromDB come from the surrounding code):
var request = (HttpWebRequest)WebRequest.Create(profileurl);
request.Method = "GET";
string result;
using (var response = request.GetResponse())
{
    using (var stream = response.GetResponseStream())
    {
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            result = reader.ReadToEnd();
        }
        var doc = new HtmlDocument();
        doc.Load(new StringReader(result));
        var root = doc.DocumentNode;
        HtmlNode profileHeader = root.SelectSingleNode("//*[@id='profile-header']");
        HtmlNode profileRight = root.SelectSingleNode("//*[@id='profile-right']");
        string rankHtml = profileHeader.SelectSingleNode("//*[@id='best-team-1']").OuterHtml.Trim();

        #region GetPlayerAvatar
        var avatarMatch = Regex.Match(profileHeader.SelectSingleNode("/html/body/div/div[2]/div/div/div/div/div/span").OuterHtml, @"(portraits[^(h3)]+).*no-repeat;", RegexOptions.IgnoreCase);
        if (avatarMatch.Success)
        {
            battleNetPlayerFromDB.PlayerAvatarCss = avatarMatch.Value;
        }
        #endregion
    }
}

HtmlAgilityPack - How to set custom encoding when loading pages

Is it possible to set a custom encoding when loading pages with the method below?
HtmlWeb hwWeb = new HtmlWeb();
HtmlDocument hd = hwWeb.Load("myurl");
I want to set the encoding to "iso-8859-9".
I use C# 4.0 and WPF.
Edit: The question has been answered on MSDN.
I suppose you could try overriding the encoding in the HtmlWeb object.
Try this:
var web = new HtmlWeb
{
    AutoDetectEncoding = false,
    OverrideEncoding = myEncoding,
};
var doc = web.Load(myUrl);
Note: It appears that the OverrideEncoding property was added to HTML agility pack in revision 76610 so it is not available in the current release v1.4 (66017). The next best thing to do would be to read the page manually with the encodings overridden.
var document = new HtmlDocument();
using (var client = new WebClient())
{
    using (var stream = client.OpenRead(url))
    {
        var reader = new StreamReader(stream, Encoding.GetEncoding("iso-8859-9"));
        var html = reader.ReadToEnd();
        document.LoadHtml(html);
    }
}
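Alternatively, HtmlDocument.Load has an overload that takes the encoding directly, which skips the intermediate string; a sketch of the same idea:
// Sketch: let HtmlDocument decode the stream with the forced encoding.
var document = new HtmlDocument();
using (var client = new WebClient())
using (var stream = client.OpenRead(url))
{
    document.Load(stream, Encoding.GetEncoding("iso-8859-9"));
}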
This is a simple version of the solution answered here (for some reason it got deleted).
A decent answer is over here which handles auto-detecting the encoding as well as some other nifty features:
C# and HtmlAgilityPack encoding problem
