GetElementsByTagName in C#

I have this piece of code:
string x = textBox1.Text;
string[] list = x.Split(';');
foreach (string u in list)
{
string url = "http://*********/index.php?n=" + u;
webBrowser1.Navigate(url);
webBrowser1.Document.GetElementsByTagName("META");
}
and I'm trying to output the <META> tags to a message box, but when I test it, I keep getting this error:
Object reference not set to an instance of an object.

Your problem is that you're accessing the Document object before the document has loaded; the WebBrowser control navigates asynchronously. If all you need is the HTML, just parse it with a library like the HTML Agility Pack.
Here's how you might get the <meta> tags using the HTML Agility Pack. (Assumes using System.Net; and using HtmlAgilityPack;.)
// Create a WebClient to use to download the string:
using (WebClient wc = new WebClient()) {
    // Create a document object
    HtmlDocument d = new HtmlDocument();
    // Download the content and parse the HTML:
    d.LoadHtml(wc.DownloadString("http://stackoverflow.com/questions/10368605/getelementsbytagname-in-c-sharp/10368631#10368631"));
    // Loop through all the <meta> tags:
    foreach (HtmlNode metaTag in d.DocumentNode.Descendants("meta")) {
        // It's a <meta> tag! Do something with it.
    }
}
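Inside that loop you could, for example, read the name and content attributes and show the one you care about. A minimal sketch (the "description" filter and the MessageBox call are just illustrations, assuming a WinForms app):
    string name = metaTag.GetAttributeValue("name", "");
    string content = metaTag.GetAttributeValue("content", "");
    if (name.Equals("description", StringComparison.OrdinalIgnoreCase))
    {
        MessageBox.Show(content);
    }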

You shouldn't try to access the document until it has finished loading. Run that code inside a handler for the DocumentCompleted event.
But Matti is right: if all you need is to read the HTML, you shouldn't be using a WebBrowser at all. Just fetch the text and parse it with an HTML parser.

You can retrieve META tags and any other HTML element directly from your WebBrowser control; there is no need for the HTML Agility Pack or any other component.
Like Mark said, wait first for the DocumentCompleted event:
webBrowser.DocumentCompleted += WebBrowser_DocumentCompleted;
Then you can read any element and its content from the HTML document. The following code gets the title and the meta description:
private void WebBrowser_DocumentCompleted(object sender, System.Windows.Forms.WebBrowserDocumentCompletedEventArgs e)
{
    System.Windows.Forms.WebBrowser browser = sender as System.Windows.Forms.WebBrowser;
    // The document has finished loading here, so Document is safe to use.
    string title = browser.Document.Title;
    string description = String.Empty;
    foreach (HtmlElement meta in browser.Document.GetElementsByTagName("META"))
    {
        if (meta.Name.ToLower() == "description")
        {
            description = meta.GetAttribute("content");
        }
    }
}
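To tie this back to the original question, you could surface the result at the end of the handler, for example (a hypothetical usage, adapt to your UI):
    MessageBox.Show("Title: " + title + Environment.NewLine + "Description: " + description);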

Related


C# + webclient + htmlagility pack + web parsing
I want to go through the list of jobs on this page, but I can't parse those links because they change.
For example, when I view a link as it appears in the browser (Link), and then parse the same page with WebClient and the HTML Agility Pack, I get a changed link.
Do I have to configure something on WebClient, for example to include sessions or scripts?
Here is my code:
private void getLinks()
{
    StreamReader sr = new StreamReader("categories.txt");
    while (!sr.EndOfStream)
    {
        string url = sr.ReadLine();
        WebClient wc = new WebClient();
        string source = wc.DownloadString(url);
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(source);
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(".//a[@class='internerLink primaerElement']");
        foreach (HtmlNode node in nodes)
        {
            Console.WriteLine("http://jobboerse.arbeitsagentur.de" + node.Attributes["href"].Value);
        }
    }
    sr.Close();
}
You may try the WebBrowser class (http://msdn.microsoft.com/en-us/library/system.windows.controls.webbrowser%28v=vs.110%29.aspx) and then use its DOM ("Accessing the DOM from WebBrowser") to retrieve the links.
mshtml.IHTMLDocument2 htmlDoc = webBrowser.Document as mshtml.IHTMLDocument2;
// do something like find button and click
htmlDoc.all.item("testBtn").click();
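If what you actually need are the anchors rather than a button click, a minimal sketch with the WinForms WebBrowser control might look like this (the webBrowser1 control and handler name are mine; by the time DocumentCompleted fires, the page, including any scripts, has finished loading):
// Somewhere in your form: subscribe and navigate.
webBrowser1.DocumentCompleted += webBrowser1_DocumentCompleted;
webBrowser1.Navigate(url);

private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // Document.Links contains every <a> element of the rendered page:
    foreach (HtmlElement a in webBrowser1.Document.Links)
    {
        Console.WriteLine(a.GetAttribute("href"));
    }
}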

Get the content of an element of a Web page using C#

Is there any way to get the content of an element or control of an open web page in a browser from a C# app?
I tried getting the window handle, but I don't know how to use it afterwards to communicate with the page. I also tried this code:
using (var client = new WebClient())
{
    var contents = client.DownloadString("http://www.google.com");
    Console.WriteLine(contents);
}
This code gives me a lot of data I can't use.
You could use an HTML parser such as HTML Agility Pack to extract the information you are interested in from the HTML you downloaded:
using (var client = new WebClient())
{
    // Download the HTML
    string html = client.DownloadString("http://www.google.com");
    // Now feed it to HTML Agility Pack:
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    // Now you could query the DOM. For example you could extract
    // all href attributes from all anchors:
    foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
    {
        HtmlAttribute href = link.Attributes["href"];
        if (href != null)
        {
            Console.WriteLine(href.Value);
        }
    }
}
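One caveat worth knowing about the Agility Pack: SelectNodes returns null rather than an empty collection when nothing matches, so if the query may come up empty you might guard it first:
var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
if (anchors != null)
{
    // iterate as above
}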

Extracting image URLs from HTML in C# using the HTML Agility Pack and writing them to an XML file

I am new to C# and I really need help with the following problem. I want to extract the photo URLs from a webpage, but only those that match a specific pattern; for example, all the images whose file names look like name_412s.jpg. I use the following code to extract images from the HTML, but I do not know how to adapt it.
public void Images()
{
    WebClient x = new WebClient();
    string source = x.DownloadString(@"http://www.google.com");
    HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
    document.LoadHtml(source);
    List<string> images = new List<string>();
    foreach (HtmlNode link in document.DocumentNode.SelectNodes("//img"))
    {
        images.Add(link.GetAttributeValue("src", ""));
    }
}
I also need to write the results to an XML file. Can you help me with that as well?
Thank you!
To limit the query results, you need to add a condition to your XPath. For instance, //img[contains(@src, 'name_412s.jpg')] will limit the results to only img elements whose src attribute contains that file name.
As far as writing out the results to XML, you'll need to create a new XML document and then copy the matching elements into it. Since you won't be able to directly import an HtmlAgilityPack node into an XmlDocument, you'll have to manually copy all the attributes. For instance:
using System.Net;
using System.Xml;
using HtmlAgilityPack;
// ...
public void Images()
{
    WebClient x = new WebClient();
    string source = x.DownloadString(@"http://www.google.com");
    HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
    document.LoadHtml(source);
    XmlDocument output = new XmlDocument();
    XmlElement imgElements = output.CreateElement("ImgElements");
    output.AppendChild(imgElements);
    // Copy each matching <img> element and all of its attributes into the XML document:
    foreach (HtmlNode link in document.DocumentNode.SelectNodes("//img[contains(@src, '_412s.jpg')]"))
    {
        XmlElement img = output.CreateElement(link.Name);
        foreach (HtmlAttribute a in link.Attributes)
        {
            img.SetAttribute(a.Name, a.Value);
        }
        imgElements.AppendChild(img);
    }
    output.Save(@"C:\test.xml");
}
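The saved C:\test.xml then ends up with an <ImgElements> root containing one child element per matching image, with the original attributes (src and so on) carried over. Adjust the '_412s.jpg' string in the XPath to whatever pattern you actually need.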

Getting all the anchor tags of a web page

Given a web URL, I want to detect all the links on a website, identify the internal links, and list them.
What I have is this:
WebClient webClient = new WebClient();
string strUrl = "http://www.anysite.com";
string completeHTMLCode = "";
try
{
    completeHTMLCode = webClient.DownloadString(strUrl);
}
catch (Exception)
{
}
Using this I can read the contents of the page, but the only idea I have is to parse the string myself: search for <a, then href, then the value between the double quotes.
Is this the only way, or is there a better solution?
Use the HTML Agility Pack. Here's a link to a blog post to get you started. Do not use Regex.
using HtmlAgilityPack;
// ...
completeHTMLCode = webClient.DownloadString(strUrl);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(completeHTMLCode);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    // each anchor element ends up here
}
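The question also asks about identifying the internal links. One way to do that (a sketch, assuming "internal" means "same host as strUrl") is to resolve each href against the base URL and compare hosts:
Uri baseUri = new Uri(strUrl);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    // Resolve relative hrefs such as "/about" or "page.html" against the base URL:
    Uri target;
    if (Uri.TryCreate(baseUri, link.Attributes["href"].Value, out target))
    {
        if (string.Equals(target.Host, baseUri.Host, StringComparison.OrdinalIgnoreCase))
        {
            Console.WriteLine(target.AbsoluteUri); // an internal link
        }
    }
}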

Extract data webpage

Folks,
I'm trying to extract data from a web page using C#. At the moment I read the Stream from the WebResponse and parse it as one big string, which is long and painful. Does anyone know a better way to extract data from a web page? I've looked at WinHTTP, but it isn't for C#.
To download data from a web page it is easier to use WebClient:
string data;
using (var client = new WebClient())
{
    data = client.DownloadString("http://www.google.com");
}
For parsing downloaded data, provided that it is HTML, you could use the excellent Html Agility Pack library.
And here's a complete example extracting all the links from a given page:
using System;
using System.Net;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        using (var client = new WebClient())
        {
            string data = client.DownloadString("http://www.google.com");
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(data);
            var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
            foreach (HtmlNode link in nodes)
            {
                HtmlAttribute att = link.Attributes["href"];
                Console.WriteLine(att.Value);
            }
        }
    }
}
If the webpage is valid XHTML, you can read it into an XPathDocument and XPath your way quickly and easily to the data you want. If it's not valid XHTML, there are HTML parsers out there you can use.
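For the valid-XHTML case, a minimal sketch (the URL and the XPath expression are placeholders; local-name() is used to sidestep the XHTML namespace):
using System;
using System.Xml.XPath;
// ...
XPathDocument xdoc = new XPathDocument("http://example.com/page.xhtml");
XPathNavigator nav = xdoc.CreateNavigator();
// Select whatever you are after; here, every href on the page:
foreach (XPathNavigator href in nav.Select("//*[local-name()='a']/@href"))
{
    Console.WriteLine(href.Value);
}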
Found a similar question with an answer that should help.
Looking for C# HTML parser
