Is there any way to get the content of an element or control of a web page that is open in a browser, from a C# app?
I tried to get the window handle, but I don't know how to use it afterwards to communicate with the browser. I also tried this code:
using (var client = new WebClient())
{
    var contents = client.DownloadString("http://www.google.com");
    Console.WriteLine(contents);
}
This code gives me a lot of data I can't use.
You could use an HTML parser such as HTML Agility Pack to extract the information you are interested in from the HTML you downloaded:
using (var client = new WebClient())
{
    // Download the HTML
    string html = client.DownloadString("http://www.google.com");

    // Now feed it to HTML Agility Pack:
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    // Now you can query the DOM. For example, extract
    // all href attributes from all anchors:
    foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
    {
        HtmlAttribute href = link.Attributes["href"];
        if (href != null)
        {
            Console.WriteLine(href.Value);
        }
    }
}
class Response:

public string WebResponse(string url) // Class through which I'll have the link of a website; I will parse some divs in a method of this class
{
    string html = string.Empty;
    try
    {
        HtmlDocument doc = new HtmlDocument(); // When execution reaches here it gives an error, "HtmlDocument.cs not found", and opens a window for browsing the source
        WebClient client = new WebClient();    // Even if I use HtmlWeb here, it still looks for HtmlWeb.cs and doesn't find it
        html = client.DownloadString(url);     // Is this some breakpoint error? I set only one breakpoint, in the method where I am parsing
        doc.LoadHtml(html);
    }
    catch (Exception)
    {
        html = string.Empty;
    }
    return html; // Please help me remove this error; I'm using Html Agility Pack with a console application
}
Even if I make a new project and run the code, it gets stuck here, and I have added the DLL too, but it still gives me this error. Please help me remove it.
WebResponse is an abstract class in System.Net, so first of all that name is already taken. Second, in order to use WebResponse, a class has to inherit from WebResponse, i.e.:
public class WR : WebResponse
{
    // Code
}
Also, your current code has nothing to do with Html Agility Pack. If you want to load the HTML of a web page into an HtmlDocument, do the following:
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
try
{
    var temp = new Uri(url);
    var request = (HttpWebRequest)WebRequest.Create(temp);
    request.Method = "GET";
    using (var response = (HttpWebResponse)request.GetResponse())
    {
        using (var stream = response.GetResponseStream())
        {
            htmlDoc.Load(stream, Encoding.GetEncoding("iso-8859-9"));
        }
    }
}
catch (WebException ex)
{
    Console.WriteLine(ex.Message);
}
Then, in order to get nodes from the HtmlDocument, you have to use XPath, like so:
HtmlNode node = htmlDoc.DocumentNode.SelectSingleNode("//body");
Console.WriteLine(node.InnerText);
That error is sometimes caused by the version of the Html Agility Pack NuGet package you are using. Update NuGet in the Visual Studio gallery, then try reinstalling Html Agility Pack and running your project.
You can try cleaning and rebuilding the solution. This may fix the issue.
C# + WebClient + Html Agility Pack + web parsing
I wanted to go through the list of jobs on this page, but I can't parse those links because they change.
For example, here is the link as it appears in the browser (Link),
but when I parse it using WebClient and the Html Agility Pack, I get a changed link.
Do I have to configure settings on WebClient to include sessions or scripts?
Here is my code:
private void getLinks()
{
    StreamReader sr = new StreamReader("categories.txt");
    while (!sr.EndOfStream)
    {
        string url = sr.ReadLine();
        WebClient wc = new WebClient();
        string source = wc.DownloadString(url);
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(source);
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(".//a[@class='internerLink primaerElement']");
        foreach (HtmlNode node in nodes)
        {
            Console.WriteLine("http://jobboerse.arbeitsagentur.de" + node.Attributes["href"].Value);
        }
    }
    sr.Close();
}
You may try the WebBrowser class (http://msdn.microsoft.com/en-us/library/system.windows.controls.webbrowser%28v=vs.110%29.aspx) and then use its DOM (see "Accessing DOM from WebBrowser") to retrieve the links:
mshtml.IHTMLDocument2 htmlDoc = webBrowser.Document as mshtml.IHTMLDocument2;
// do something like find button and click
htmlDoc.all.item("testBtn").click();
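If the goal is to collect the links rather than click a button, the same COM document exposes a links collection. Here is a minimal sketch, assuming the page has already finished loading, the project has a COM reference to Microsoft.mshtml, and webBrowser is the control from the snippet above:
mshtml.IHTMLDocument2 htmlDoc = webBrowser.Document as mshtml.IHTMLDocument2;
if (htmlDoc != null)
{
    // IHTMLDocument2.links enumerates <a> and <area> elements with an href.
    foreach (mshtml.IHTMLElement el in htmlDoc.links)
    {
        Console.WriteLine(el.getAttribute("href", 0));
    }
}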
Hello developers, I want to read external content from a website, such as the text between tags. I am using the WebBrowser control; here is my code, however it just fills my WebBrowser control with the web page:
public MainWindow()
{
    InitializeComponent();
    wbMain.Navigate(new Uri("http://www.annonymous.com", UriKind.RelativeOrAbsolute));
}
You can use the Html Agility Pack library to parse any HTML formatted data.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(wbMain.DocumentText);
var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
NOTE: The method SelectNodes accepts XPath, not CSS or jQuery selectors.
var node = doc.DocumentNode.SelectSingleNode("//*[@id='my_element_id']");
As I understood from your question, you are only trying to parse the HTML data, and you don't need to show the actual web page.
If that is the case, then you can take a very simple approach and use HttpWebRequest:
var _plainText = string.Empty;
var _request = (HttpWebRequest)WebRequest.Create("http://www.google.com");
_request.Timeout = 5000;
_request.Method = "GET";
_request.ContentType = "text/plain";

using (var _webResponse = (HttpWebResponse)_request.GetResponse())
{
    var _webResponseStatus = _webResponse.StatusCode;
    var _stream = _webResponse.GetResponseStream();
    using (var _streamReader = new StreamReader(_stream))
    {
        _plainText = _streamReader.ReadToEnd();
    }
}
Try this:
dynamic doc = wbMain.Document;
var htmlText = doc.documentElement.InnerHtml;
I have this piece of code:
string x = textBox1.Text;
string[] list = x.Split(';');
foreach (string u in list)
{
    string url = "http://*********/index.php?n=" + u;
    webBrowser1.Navigate(url);
    webBrowser1.Document.GetElementsByTagName("META");
}
and I'm trying to get the <META> tags to output to a message box, but when I test it out, I keep getting this error:
Object reference not set to an instance of an object.
Your problem is that you're accessing the Document object before the document has loaded - WebBrowsers are asynchronous. Just parse the HTML using a library like the HTML Agility Pack.
Here's how you might get the <meta> tags using the HTML Agility Pack. (Assumes using System.Net; and using HtmlAgilityPack;.)
// Create a WebClient to use to download the string:
using (WebClient wc = new WebClient())
{
    // Create a document object
    HtmlDocument d = new HtmlDocument();

    // Download the content and parse the HTML:
    d.LoadHtml(wc.DownloadString("http://stackoverflow.com/questions/10368605/getelementsbytagname-in-c-sharp/10368631#10368631"));

    // Loop through all the <meta> tags:
    foreach (HtmlNode metaTag in d.DocumentNode.Descendants("meta"))
    {
        // It's a <meta> tag! Do something with it.
    }
}
You shouldn't try to access the document until it has finished loading. Run that code inside a handler for the DocumentCompleted event.
But Matti is right: if all you need is to read the HTML, you shouldn't be using a WebBrowser. Just fetch the text and parse it using an HTML parser.
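For illustration, here is a minimal sketch of that suggestion, assuming the webBrowser1 control and the META loop from the question (the URL is a placeholder). Note that navigating in a loop as in the original code won't work either, since each Navigate call cancels the previous load; this handles a single page:
// Sketch: read the <meta> tags only after DocumentCompleted fires.
webBrowser1.DocumentCompleted += (s, e) =>
{
    foreach (HtmlElement meta in webBrowser1.Document.GetElementsByTagName("META"))
    {
        // GetAttribute returns an empty string for tags without "content".
        MessageBox.Show(meta.GetAttribute("content"));
    }
};
webBrowser1.Navigate("http://example.com/");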
You can retrieve META tags and any other HTML element directly from your WebBrowser control; there is no need for the Html Agility Pack or any other component.
Like Mark said, wait first for the DocumentCompleted event:
webBrowser.DocumentCompleted += WebBrowser_DocumentCompleted;
Then you can catch any element and content from the HTML document. The following code gets the title and the meta description:
private void WebBrowser_DocumentCompleted(object sender, System.Windows.Forms.WebBrowserDocumentCompletedEventArgs e)
{
    System.Windows.Forms.WebBrowser browser = sender as System.Windows.Forms.WebBrowser;
    string title = browser.Document.Title;
    string description = String.Empty;

    foreach (HtmlElement meta in browser.Document.GetElementsByTagName("META"))
    {
        if (meta.Name.ToLower() == "description")
        {
            description = meta.GetAttribute("content");
        }
    }
}
Folks, I'm trying to extract data from a web page using C#. For the moment I use the Stream from the WebResponse and parse it as one big string. It's long and painful. Does anyone know a better way to extract data from a web page? I saw WinHTTP, but it isn't for C#.
To download data from a web page it is easier to use WebClient:
string data;
using (var client = new WebClient())
{
    data = client.DownloadString("http://www.google.com");
}
For parsing downloaded data, provided that it is HTML, you could use the excellent Html Agility Pack library.
And here's a complete example extracting all the links from a given page:
using System;
using System.Net;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        using (var client = new WebClient())
        {
            string data = client.DownloadString("http://www.google.com");
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(data);
            var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
            foreach (HtmlNode link in nodes)
            {
                HtmlAttribute att = link.Attributes["href"];
                Console.WriteLine(att.Value);
            }
        }
    }
}
If the web page is valid XHTML, you can read it into an XPathDocument and use XPath to get quickly and easily straight to the data you want. If it's not valid XHTML, I'm sure there are some HTML parsers out there you can use.
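For instance, here is a minimal sketch of the XPathDocument route, assuming a well-formed XHTML page at a placeholder URL. Since XHTML elements normally live in the XHTML namespace, the query uses local-name() to avoid setting up an XmlNamespaceManager:
using System;
using System.Xml.XPath;

class XhtmlLinks
{
    static void Main()
    {
        // This throws if the page is not well-formed XML, which is the
        // limitation mentioned above.
        var doc = new XPathDocument("http://www.example.com/page.xhtml");
        XPathNavigator nav = doc.CreateNavigator();

        // Select every href attribute on anchor elements, ignoring namespaces.
        XPathNodeIterator links = nav.Select("//*[local-name()='a']/@href");
        while (links.MoveNext())
        {
            Console.WriteLine(links.Current.Value);
        }
    }
}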
Found a similar question with an answer that should help: Looking for C# HTML parser