Extract data from a webpage - C#

Folks,
I'm trying to extract data from a web page using C#. For the moment I use the Stream from the WebResponse and parse it as one big string. It's long and painful. Does anyone know a better way to extract data from a webpage? I've seen WinHTTP, but that isn't for C#.

To download data from a web page it is easier to use WebClient:
string data;
using (var client = new WebClient())
{
    data = client.DownloadString("http://www.google.com");
}
For parsing downloaded data, provided that it is HTML, you could use the excellent Html Agility Pack library.
And here's a complete example extracting all the links from a given page:
using System;
using System.Net;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        using (var client = new WebClient())
        {
            string data = client.DownloadString("http://www.google.com");
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(data);
            var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
            foreach (HtmlNode link in nodes)
            {
                HtmlAttribute att = link.Attributes["href"];
                Console.WriteLine(att.Value);
            }
        }
    }
}

If the webpage is valid XHTML, you can read it into an XPathDocument and xpath your way quickly and easily straight to the data you want. If it's not valid XHTML, I'm sure there are some HTML parsers out there you can use.
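For illustration, here is a minimal, self-contained sketch of the XPathDocument route; the XHTML string is a made-up stand-in for a real downloaded page:

```csharp
using System;
using System.IO;
using System.Xml.XPath;

class XhtmlLinks
{
    static void Main()
    {
        // Hand-written XHTML snippet standing in for a downloaded page.
        string xhtml =
            "<html><body>" +
            "<a href='http://example.com/a'>A</a>" +
            "<a href='http://example.com/b'>B</a>" +
            "</body></html>";

        // XPathDocument parses well-formed XML and exposes it to XPath queries.
        var doc = new XPathDocument(new StringReader(xhtml));
        XPathNavigator nav = doc.CreateNavigator();

        // Select every href attribute of every anchor.
        foreach (XPathNavigator href in nav.Select("//a/@href"))
        {
            Console.WriteLine(href.Value);
        }
        // Prints:
        // http://example.com/a
        // http://example.com/b
    }
}
```

Keep in mind this only works when the markup is well-formed XML; real-world HTML usually isn't, which is why the HTML-parser suggestions elsewhere on this page exist.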
Found a similar question with an answer that should help.
Looking for C# HTML parser

How do I correctly grab the images I scrape with HtmlAgilityPack?

I am currently working on a project and I am learning HAP as I go.
I get the basics of it and it seems like it could be very powerful.
I'm having an issue right now: I'm trying to scrape a product on one website and get the links to the images, but I don't know how to extract the link from the XPath.
I used to do this with Regex, which was a lot easier, but I'm moving on to HAP.
This is my current code. I don't think it will be very useful to see, but I'll put it in either way.
private static void HAP()
{
    var url = "https://www.dhgate.com/product/brass-hexagonal-fidget-spinner-hexa-spinner/403294406.html#gw-0-4|ff8080815e03d6df015e9394cc681f8a:ff80808159abe8a5015a3fd78c5b51bb";
    // HtmlWeb - a utility class to get an HTML document over HTTP
    var web = new HtmlWeb();
    // Load() downloads the specified HTML document from an Internet resource.
    var doc = web.Load(url);
    var rootNode = doc.DocumentNode;
    var divs = doc.DocumentNode.SelectNodes(String.Format("//IMG[@src='{0}']", "www.dhresource.com/webp/m/100x100/f2/albu/g5/M00/14/45/rBVaI1kWttaAI1IrAATeirRp-t8793.jpg"));
    Console.WriteLine(divs);
    Console.ReadLine();
}
This is the link I am scraping from
https://www.dhgate.com/product/2017-led-light-up-hand-spinners-fidget-spinner/398793721.html#s1-0-1b;searl|4175152669
And this should be the XPath of the first image:
//IMG[@src='//www.dhresource.com/webp/m/100x100s/f2-albu-g5-M00-6E-20-rBVaI1kWtmmAF9cmAANMKysq_GY926.jpg/2017-led-light-up-hand-spinners-fidget-spinner.jpg']
I created a helper method for this.
I had to get the nodes, then the attribute, and then cycle through the attributes to get all the links.
private static void HAP()
{
    // Declare the URL
    var url = "https://www.dhgate.com/product/brass-hexagonal-fidget-spinner-hexa-spinner/403294406.html#gw-0-4|ff8080815e03d6df015e9394cc681f8a:ff80808159abe8a5015a3fd78c5b51bb";
    // HtmlWeb - a utility class to get an HTML document over HTTP
    var web = new HtmlWeb();
    // Load() downloads the specified HTML document from an Internet resource.
    var doc = web.Load(url);
    var rootNode = doc.DocumentNode;
    var nodes = doc.DocumentNode.SelectNodes("//img");
    foreach (var src in nodes)
    {
        var links = src.Attributes["src"].Value;
        Console.WriteLine(links);
    }
    Console.ReadLine();
}

jQuery using C# WebClient .NET namespace?

I have the following jQuery code that relates to an HTML document from a website:
$
Anything is appreciated,
Salute.
From what I can remember using the HtmlAgilityPack:
var rawText = "<html><head></head><body><div id='container'><article><p>stuff<p></article><article><p>stuff2</p></article></div></body></html>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(rawText);
var stuff = doc.DocumentNode.Descendants("div")
    .SelectMany(div => div.Descendants("article"));
var length = stuff.Count();
var textValues = stuff.Select(a => a.InnerHtml).ToList();
Output:
length: 2
textValues: List<String> (2 items)
<p>stuff<p>
<p>stuff2</p>
To get the HTML, instead of hardcoding it as above, use the WebClient class, since it has a simpler API than WebRequest.
var client = new WebClient();
var html = client.DownloadString("http://yoursite.com/file.html");
To answer your question specifically related to using the System.Net namespace, see the following for how to use the WebRequest class itself to get the content:
http://msdn.microsoft.com/en-us/library/456dfw4f%28v=vs.110%29.aspx
Then, after you get the content back, you need to parse it using the HTML Agility Pack, found here: http://htmlagilitypack.codeplex.com/
As for how one would code the jQuery in C#, this is an untested example:
var doc = new HtmlDocument();
doc.Load(@"D:\test.html"); // you can also use a memory stream instead.
var container = doc.GetElementbyId("container");
foreach (HtmlNode node in container.Elements("img"))
{
    HtmlAttribute valueAttribute = node.Attributes["value"];
    if (valueAttribute != null) Console.WriteLine(valueAttribute.Value);
}
In your case, the attributes you want after you find the element are alt, src, and href.
It will take you about a day to learn the Agility Pack, but it's mature, fast, and well liked by the community.

Get the content of an element of a Web page using C#

Is there any way to get the content of an element or control of an open web page in a browser from a c# app?
I tried to get the window handle, but I don't know how to use it afterwards to communicate with the browser. I also tried this code:
using (var client = new WebClient())
{
    var contents = client.DownloadString("http://www.google.com");
    Console.WriteLine(contents);
}
This code gives me a lot of data I can't use.
You could use an HTML parser such as HTML Agility Pack to extract the information you are interested in from the HTML you downloaded:
using (var client = new WebClient())
{
    // Download the HTML
    string html = client.DownloadString("http://www.google.com");

    // Now feed it to HTML Agility Pack:
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    // Now you can query the DOM. For example, you could extract
    // all href attributes from all anchors:
    foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
    {
        HtmlAttribute href = link.Attributes["href"];
        if (href != null)
        {
            Console.WriteLine(href.Value);
        }
    }
}

Extracting images urls from html in c# using html agility pack and writing them in a xml file

I am new to C# and I really need help with the following problem. I wish to extract the photo URLs from a webpage that match a specific pattern. For example, I wish to extract all the images whose names follow the pattern name_412s.jpg. I use the following code to extract images from HTML, but I do not know how to adapt it.
public void Images()
{
    WebClient x = new WebClient();
    string source = x.DownloadString(@"http://www.google.com");
    HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
    document.LoadHtml(source);
    var images = new List<string>();
    foreach (HtmlNode link in document.DocumentNode.SelectNodes("//img"))
    {
        images.Add(link.Attributes["src"].Value);
    }
}
I also need to write the results in a xml file. Can you also help me with that?
Thank you !
To limit the query results, you need to add a condition to your XPath. For instance, //img[contains(@src, 'name_412s.jpg')] will limit the results to only img elements that have an src attribute containing that file name.
As far as writing out the results to XML, you'll need to create a new XML document and then copy the matching elements into it. Since you won't be able to directly import an HtmlAgilityPack node into an XmlDocument, you'll have to manually copy all the attributes. For instance:
using System.Net;
using System.Xml;
// ...
public void Images()
{
    WebClient x = new WebClient();
    string source = x.DownloadString(@"http://www.google.com");
    HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
    document.LoadHtml(source);
    XmlDocument output = new XmlDocument();
    XmlElement imgElements = output.CreateElement("ImgElements");
    output.AppendChild(imgElements);
    foreach (HtmlNode link in document.DocumentNode.SelectNodes("//img[contains(@src, '_412s.jpg')]"))
    {
        XmlElement img = output.CreateElement(link.Name);
        foreach (HtmlAttribute a in link.Attributes)
        {
            img.SetAttribute(a.Name, a.Value);
        }
        imgElements.AppendChild(img);
    }
    output.Save(@"C:\test.xml");
}

Issue with HTMLAgilityPack parsing HTML using C#

I'm just trying to learn about the HTML Agility Pack and XPath. I'm attempting to get a list of companies (HTML links) from the NASDAQ website:
http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx
I currently have the following code;
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
// Create a request for the URL.
WebRequest request = WebRequest.Create("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx");
// Get the response.
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
// Get the stream containing content returned by the server.
Stream dataStream = response.GetResponseStream();
// Open the stream using a StreamReader for easy access.
StreamReader reader = new StreamReader(dataStream);
// Read the content.
string responseFromServer = reader.ReadToEnd();
// Read into a HTML store read for HAP
htmlDoc.LoadHtml(responseFromServer);
HtmlNodeCollection tl = htmlDoc.DocumentNode.SelectNodes("//*[@id='indu_table']/tbody/tr[*]/td/b/a");
foreach (HtmlAgilityPack.HtmlNode node in tl)
{
    Debug.Write(node.InnerText);
}
// Cleanup the streams and the response.
reader.Close();
dataStream.Close();
response.Close();
I've used an XPath addon for Chrome to get the XPath of;
//*table[@id='indu_table']/tbody/tr[*]/td/b/a
When running my project, I get an unhandled XPath exception about an invalid token.
I'm a little unsure what's wrong with it; I've tried to put a number in the tr[*] section above, but I still get the same error.
I've been looking at this for the last hour; is it anything simple?
Thanks
Since the data comes from JavaScript, you have to parse the JavaScript and not the HTML, so the Agility Pack doesn't help that much, but it makes things a bit easier. The following is how it could be done using the Agility Pack and Newtonsoft JSON.Net to parse the JavaScript.
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.Load(new WebClient().OpenRead("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx"));
List<string> listStocks = new List<string>();
HtmlNode scriptNode = htmlDoc.DocumentNode.SelectSingleNode("//script[contains(text(),'var table_body =')]");
if (scriptNode != null)
{
    // Using Regex here to get just the array we're interested in...
    string stockArray = Regex.Match(scriptNode.InnerText, "table_body = (?<Array>\\[.+?\\]);").Groups["Array"].Value;
    JArray jArray = JArray.Parse(stockArray);
    foreach (JToken token in jArray.Children())
    {
        listStocks.Add("http://www.nasdaq.com/symbol/" + token.First.Value<string>().ToLower());
    }
}
To explain a bit more in detail, the data comes from one big javascript array on the page var table_body = [....
Each stock is one element in the array and is an array itself.
["ATVI", "Activision Blizzard, Inc", 11.75, 0.06, 0.51, 3058125, 0.06, "N", "N"]
So by parsing the array, taking the first element, and appending the fixed URL prefix, we get the same result as the JavaScript.
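The Regex step above can be tried in isolation. Here is a minimal, self-contained sketch; the script text is a made-up stand-in for the page's actual table_body assignment:

```csharp
using System;
using System.Text.RegularExpressions;

class TableBodyRegex
{
    static void Main()
    {
        // Made-up script text mimicking the page's "var table_body = [...];" assignment.
        string script = "var table_body = [[\"ATVI\", \"Activision Blizzard, Inc\", 11.75]];";

        // Capture just the array literal in a named group, as in the answer above.
        // The lazy .+? stops at the first "];", which closes the outer array.
        string stockArray = Regex.Match(script, "table_body = (?<Array>\\[.+?\\]);")
                                 .Groups["Array"].Value;

        Console.WriteLine(stockArray);
        // Prints: [["ATVI", "Activision Blizzard, Inc", 11.75]]
    }
}
```

From there, the captured string is a valid JSON array and can be handed to JArray.Parse as shown above.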
Why don't you just use the Descendants("a") method?
It's much simpler and more object oriented. You'll just get a bunch of node objects, and you can read the "href" attribute from each of them.
Sample code (Descendants returns a collection, so project each node's attribute):
htmlDoc.DocumentNode.Descendants("a").Select(a => a.GetAttributeValue("href", null))
If you just need the list of links from a certain webpage, this method will do just fine.
If you look at the page source for that URL, there's not actually an element with id=indu_table. It appears to be generated dynamically (i.e. in javascript); the html that you get when loading directly from the server will not reflect anything that's changed by client script. This is probably why it's not working.
