Issue with HTMLAgilityPack parsing HTML using C#

I'm just trying to learn about the HtmlAgilityPack and XPath. I'm attempting to get a list of companies (as HTML links) from the NASDAQ website:
http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx
I currently have the following code:
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
// Create a request for the URL.
WebRequest request = WebRequest.Create("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx");
// Get the response.
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
// Get the stream containing content returned by the server.
Stream dataStream = response.GetResponseStream();
// Open the stream using a StreamReader for easy access.
StreamReader reader = new StreamReader(dataStream);
// Read the content.
string responseFromServer = reader.ReadToEnd();
// Read into a HTML store read for HAP
htmlDoc.LoadHtml(responseFromServer);
HtmlNodeCollection tl = htmlDoc.DocumentNode.SelectNodes("//*[@id='indu_table']/tbody/tr[*]/td/b/a");
foreach (HtmlAgilityPack.HtmlNode node in tl)
{
    Debug.Write(node.InnerText);
}
// Cleanup the streams and the response.
reader.Close();
dataStream.Close();
response.Close();
I've used an XPath add-on for Chrome to get the XPath of:
//*table[@id='indu_table']/tbody/tr[*]/td/b/a
When running my project, I get an unhandled XPath exception about the expression having an invalid token.
I'm a little unsure what's wrong with it; I've tried putting a number in the tr[*] section above, but I still get the same error.
I've been looking at this for the last hour; is it anything simple?
Thanks

Since the data comes from javascript, you have to parse the javascript and not the HTML, so the Agility Pack alone doesn't help that much, but it does make finding the right script block a bit easier. The following shows how it could be done using the Agility Pack and Newtonsoft's Json.NET to parse the javascript.
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.Load(new WebClient().OpenRead("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx"));
List<string> listStocks = new List<string>();
HtmlNode scriptNode = htmlDoc.DocumentNode.SelectSingleNode("//script[contains(text(),'var table_body =')]");
if (scriptNode != null)
{
    // Using Regex here to get just the array we're interested in...
    string stockArray = Regex.Match(scriptNode.InnerText, "table_body = (?<Array>\\[.+?\\]);").Groups["Array"].Value;
    JArray jArray = JArray.Parse(stockArray);
    foreach (JToken token in jArray.Children())
    {
        listStocks.Add("http://www.nasdaq.com/symbol/" + token.First.Value<string>().ToLower());
    }
}
To explain in a bit more detail: the data comes from one big javascript array on the page, var table_body = [....
Each stock is one element in the array and is an array itself:
["ATVI", "Activision Blizzard, Inc", 11.75, 0.06, 0.51, 3058125, 0.06, "N", "N"]
So by parsing the array, taking the first element, and appending it to the fixed URL prefix, we get the same result as the javascript.

Why not just use the Descendants("a") method?
It's much simpler and more object oriented: you just get a collection of link nodes.
Then you can read the "href" attribute from each of those objects.
Sample code:
var links = htmlDoc.DocumentNode.Descendants("a")
    .Select(a => a.GetAttributeValue("href", ""));
If you just need a list of links from a certain webpage, this method will do just fine.

If you look at the page source for that URL, there's not actually an element with id=indu_table. It appears to be generated dynamically (i.e. in javascript); the html that you get when loading directly from the server will not reflect anything that's changed by client script. This is probably why it's not working.
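A quick way to confirm this from code is to check the result of SelectNodes, which returns null when nothing matches (a small sketch reusing the htmlDoc from the question):
var table = htmlDoc.DocumentNode.SelectNodes("//*[@id='indu_table']");
if (table == null)
{
    // The table is not in the markup the server returned; it is added
    // later by the page's javascript, so XPath cannot find it here.
    Debug.Write("indu_table not found in the downloaded HTML");
}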

Related

jquery using c# webclient .net namespace?

I have the following jQuery code that relates to an HTML document from a website.
$
Anything is appreciated,
Salute.
From what I can remember of using the HtmlAgilityPack:
var rawText = "<html><head></head><body><div id='container'><article><p>stuff<p></article><article><p>stuff2</p></article></div></body></html>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(rawText);
var stuff = doc.DocumentNode.Descendants("div")
.SelectMany(div => div.Descendants("article"));
var length = stuff.Count();
var textValues = stuff.Select(a => a.InnerHtml).ToList();
Output:
length: 2
textValues: List<String> (2 items)
<p>stuff<p>
<p>stuff2</p>
To get the HTML, instead of hardcoding it as above, use the WebClient class, since it has a simpler API than WebRequest.
var client = new WebClient();
var html = client.DownloadString("http://yoursite.com/file.html");
To answer your question specifically about using the System.Net namespace: see here for the way to use the WebRequest class itself to get the content:
http://msdn.microsoft.com/en-us/library/456dfw4f%28v=vs.110%29.aspx
After you get the content back, you need to parse it using the HTML Agility Pack, found here: http://htmlagilitypack.codeplex.com/
As for how one would translate the jQuery into C#, this is an untested example:
var doc = new HtmlDocument();
doc.Load(@"D:\test.html"); // you can also use a memory stream instead.
var container = doc.GetElementbyId("container");
foreach (HtmlNode node in container.Elements("img"))
{
    HtmlAttribute valueAttribute = node.Attributes["value"];
    if (valueAttribute != null) Console.WriteLine(valueAttribute.Value);
}
In your case the attributes you want after you find the element are alt, src, and href
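For example, once you have the container, reading those attributes could look like this (a sketch continuing the snippet above; GetAttributeValue returns the supplied fallback when an attribute is missing):
foreach (HtmlNode img in container.Elements("img"))
{
    // src and alt of each image directly under the container
    Console.WriteLine(img.GetAttributeValue("src", "") + " | " + img.GetAttributeValue("alt", ""));
}
foreach (HtmlNode a in container.Elements("a"))
{
    // href of each link directly under the container
    Console.WriteLine(a.GetAttributeValue("href", ""));
}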
It will take you about a day to learn the Agility Pack, but it's mature, fast, and well liked by the community.

C# filter JS files from HttpWebRequest/WebResponse

I searched but could not find anything that worked for me.
A while ago I started with C#, and my first personal project was a simple web crawler.
It should check the source code for particular strings to identify whether, for example, Google Analytics or something similar is included.
It works fine, but of course I'm missing the JS and iframes, since HttpWebRequest does not render the website, as far as I know.
So I wanted to check for "<script src=" for example, and then get the URL through a split.
But this does not work as expected, and I don't think it is a clean and good way.
Since I'm checking for strings, it could be broken by simply changing the string from "<script" to "< script", for example, so I have no idea how to reliably pull a specific string out of a big string.
I found regular expressions (regex) and Split, but I'm not sure they would hold up, since there could be more variants of "src=" than a regex or a call like split("\"", "\"", text) can handle.
I don't want a "here you go", of course; I want to understand and do it myself, but I have no idea where to go from here.
Sorry for the long text and no examples, but at the moment I have no access to the code, and there is not really much in it except for the regex and splits.
EDIT: I think I'll create a class which checks every char for a special row like "
Best,
Mike
Try the Html Agility Pack.
I haven't used it personally, but something like this should work (I haven't tested it):
string url = "some/url";
var request = (HttpWebRequest)HttpWebRequest.Create(url);
var webResponse = (HttpWebResponse)request.GetResponse();
var responseStream = webResponse.GetResponseStream();
var streamReader = new StreamReader(responseStream);
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(streamReader.ReadToEnd());
var scripts = doc.DocumentNode.Descendants()
.Where(n => n.Name == "script");
This should get you all the script nodes, to do with as you want =)
So I found a way to get the JS URLs; here is my code:
List<string> srcurl = new List<string>();
HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = hw.Load("some/url");
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//script[@src]");
foreach (HtmlNode linkNode in linkNodes)
{
    HtmlAttribute link = linkNode.Attributes["src"];
    srcurl.Add(link.Value);
}
Regarding the code from @avidenic: if you want to use it, be sure to use
doc.LoadHtml(streamReader.ReadToEnd());
Best,
Mike

Is it possible to navigate a site by clicking links and then downloading the correct piece?

I'll try to explain what exactly I mean: I'm working on a program, and I'm trying to download a bunch of images automatically from this site.
Namely, I want to download the big square icons from the page you get when you click on a hero name there; for example, on the Darius page, the image in the top left named DariusSquare.png, and save it into a folder.
Is this possible, or am I asking too much of C#?
Thank you very much!
In general, everything is possible given enough time and money. In your case, you need very little of the former and none of the latter :)
What you need to do can be described in the following high-level steps:
Get all <a> tags within the table of heroes.
Use the WebClient class to navigate to the URLs these <a> tags point to (i.e. to the values of their href attributes) and download the HTML.
Find some wrapper element that is present on each hero's page and contains his image; you should then be able to get to the image's src attribute and download it. Alternatively, perhaps each image has a common ID you can use?
I don't think anyone will provide you with exact code that performs these steps for you; you need to do some research of your own.
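Still, a rough sketch of those steps to get you started might look like this (untested; the URL and both selectors are assumptions you would replace after inspecting the actual pages):
var web = new HtmlWeb();
var listPage = web.Load("http://example.com/champions"); // hypothetical list page
var heroLinks = listPage.DocumentNode.SelectNodes("//table//a[@href]");
using (var client = new WebClient())
{
    foreach (HtmlNode link in heroLinks)
    {
        // Load each hero's page and look for a wrapper element around the portrait
        var heroPage = web.Load(link.GetAttributeValue("href", ""));
        var img = heroPage.DocumentNode.SelectSingleNode("//div[@class='portrait']//img"); // assumed wrapper
        if (img != null)
        {
            string src = img.GetAttributeValue("src", "");
            client.DownloadFile(src, Path.GetFileName(src)); // e.g. DariusSquare.png
        }
    }
}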
Yes, it's possible: make a C# web request and use the HTML Agility Pack to find the image URL.
Then you can use another web request to download the image.
Example downloading an image from a URL:
public static Image LoadImage(string url)
{
    var backgroundUrl = url;
    var request = WebRequest.Create(backgroundUrl);
    var response = request.GetResponse();
    var stream = response.GetResponseStream();
    return Image.FromStream(stream);
}
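Usage would then be something like this (the URL is hypothetical; note that Image lives in System.Drawing, and Image.FromStream requires the underlying stream to stay open for the image's lifetime):
var icon = LoadImage("http://example.com/images/DariusSquare.png"); // hypothetical URL
icon.Save(@"C:\icons\DariusSquare.png"); // target folder must exist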
Example using the HTML Agility Pack and getting some other data:
var request = (HttpWebRequest)WebRequest.Create(profileurl);
request.Method = "GET";
using (var response = request.GetResponse())
{
    using (var stream = response.GetResponseStream())
    {
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            result = reader.ReadToEnd();
        }
        var doc = new HtmlDocument();
        doc.Load(new StringReader(result));
        var root = doc.DocumentNode;
        HtmlNode profileHeader = root.SelectSingleNode("//*[@id='profile-header']");
        HtmlNode profileRight = root.SelectSingleNode("//*[@id='profile-right']");
        string rankHtml = profileHeader.SelectSingleNode("//*[@id='best-team-1']").OuterHtml.Trim();
        #region GetPlayerAvatar
        var avatarMatch = Regex.Match(profileHeader.SelectSingleNode("/html/body/div/div[2]/div/div/div/div/div/span").OuterHtml, @"(portraits[^(h3)]+).*no-repeat;", RegexOptions.IgnoreCase);
        if (avatarMatch.Success)
        {
            battleNetPlayerFromDB.PlayerAvatarCss = avatarMatch.Value;
        }
        #endregion
    }
}

Read specific div from HttpResponse

I am sending one HttpWebRequest and reading the response.
I am getting the full page in the response.
I want to get one div, named rate, out of the response.
So how can I match that pattern?
My code is like:
HttpWebRequest WebReq = (HttpWebRequest)WebRequest.Create("http://www.domain.com/");
HttpWebResponse WebResp = (HttpWebResponse)WebReq.GetResponse();
Stream response = WebResp.GetResponseStream();
StreamReader data = new StreamReader(response);
string result = data.ReadToEnd();
I am getting a response like:
<HTML><BODY><div id="rate">Todays rate 55 Rs.</div></BODY></HTML>
I want to read the content of the rate div, i.e. I should get "Todays rate 55 Rs."
So how can I write a regex for this?
The HTML Agility Pack can load and parse the page for you; no need for messy streams and responses:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://jsbin.com/owobe3");
HtmlNode rateNode = doc.DocumentNode.SelectSingleNode("//div[@id='rate']");
string rate = rateNode.InnerText;
You should read the entire response and then use something like the Html Agility Pack to parse it and extract the bits you want with an XPath-like syntax:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(result);
var output = doc.DocumentNode.SelectSingleNode("//div[@id='rate']").InnerHtml;
Don't use regular expressions!
If you have only one "Todays rate" text, then you can match it like this:
Todays rate \d+ Rs.
Otherwise, you can add the div tag to your regex.
Edit: Sorry, I haven't got a regex tool installed locally to test with.
You need to use grouping and get the value from the group. It will look like this:
<div id="rate">(?<group>[^<]+)</div>
I don't know if it works, but use the idea.
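Put together, using the grouping version might look like this (a sketch; the pattern assumes the div's text contains no '<'):
Match m = Regex.Match(result, "<div id=\"rate\">(?<group>[^<]+)</div>");
if (m.Success)
{
    Console.WriteLine(m.Groups["group"].Value); // Todays rate 55 Rs.
}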

Extract data webpage

Folks,
I'm trying to extract data from a web page using C#. For the moment I use the Stream from the WebResponse and parse it as one big string, which is long and painful. Does someone know a better way to extract data from a web page? I saw WINHTTP, but it isn't for C#.
To download data from a web page it is easier to use WebClient:
string data;
using (var client = new WebClient())
{
    data = client.DownloadString("http://www.google.com");
}
For parsing downloaded data, provided that it is HTML, you could use the excellent Html Agility Pack library.
And here's a complete example extracting all the links from a given page:
class Program
{
    static void Main(string[] args)
    {
        using (var client = new WebClient())
        {
            string data = client.DownloadString("http://www.google.com");
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(data);
            var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
            foreach (HtmlNode link in nodes)
            {
                HtmlAttribute att = link.Attributes["href"];
                Console.WriteLine(att.Value);
            }
        }
    }
}
If the webpage is valid XHTML, you can read it into an XPathDocument and XPath your way quickly and easily straight to the data you want. If it's not valid XHTML, there are HTML parsers out there you can use.
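For instance, a minimal XPathDocument sketch (the URL is hypothetical; the constructor will throw if the markup is not well-formed XML, and local-name() sidesteps the XHTML default namespace):
var xdoc = new XPathDocument("http://example.com/page.xhtml");
XPathNavigator nav = xdoc.CreateNavigator();
XPathNodeIterator it = nav.Select("//*[local-name()='a']/@href");
while (it.MoveNext())
{
    Console.WriteLine(it.Current.Value);
}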
Found a similar question with an answer that should help.
Looking for C# HTML parser
