I need to retrieve the HTML of a page that uses AngularJS to process some information and generate a graph. I can easily retrieve the HTML using WebRequest, as in the example below, but the content (the graph) generated by AngularJS is not included in the page source that comes back.
WebRequest request = WebRequest.Create("http://localhost:36789/minhaapp#/index");
WebResponse response = request.GetResponse();
Stream data = response.GetResponseStream();
string html = String.Empty;
using (StreamReader sr = new StreamReader(data))
{
html = sr.ReadToEnd();
}
Has anyone ever experienced this?
Thank you in advance for your support.
At the end of your Page_Load method, call this getHTMLContent() method:
public string getHTMLContent()
{
StringBuilder sb = new StringBuilder();
StringWriter tw = new StringWriter(sb);
HtmlTextWriter hw = new HtmlTextWriter(tw);
panel.RenderControl(hw);
String html = sb.ToString();
return html;
}
The entire page is contained in an asp:Panel called panel. This works through the RenderControl() method: it renders everything within the asp:Panel tags (the whole page), and when called after the Page_Load logic has run, it returns the raw HTML for the page.
There's a headless browser called PhantomJS, which executes the page's JavaScript first, so you can get the source after it has been rendered. It will obviously slow the process down if you are processing a lot of websites.
I'm trying to make an application where it will read certain texts from a website.
using AngleSharp.Parser.Html;
...
var source = @"
<html>
<head>
</head>
<body>
<td class=""period_slot_1"">
<strong>TG</strong>
</body>
</html>";
var parser = new HtmlParser();
var document = parser.Parse(source);
var strong = document.QuerySelector("strong");
MessageBox.Show(strong.TextContent); // Display text
After some googling, I got the code above working. I copied and pasted part of the HTML into a variable to see if I could get the value I'm looking for.
It does get the value I want, which is the string "TG".
However, the website will show a different value than "TG" each time, so my program needs to read the website's HTML directly at the time it runs.
Is it possible to load the whole HTML source into the source variable and make it work? If so, how can I do it, and what would be the best way to get what I want?
Thank you so much for reading the question.
I assume you're saying you want to read directly from a page on the internet from a url. In which case you should do:
WebClient myClient = new WebClient();
Stream response = myClient.OpenRead("http://yahoo.com");
StreamReader reader = new StreamReader(response);
string source = reader.ReadToEnd();
var parser = new HtmlParser();
var document = parser.Parse(source);
var p = document.QuerySelector("p");
// I used 'p' instead of 'strong' because there's no
//strong on that page
MessageBox.Show(p.TextContent); // Display text
response.Close();
I'll try to explain what exactly I mean. I'm working on a program and I'm trying to download a bunch of images automatically from this site.
Namely, I want to download the big square icons from the page you get when you click on a hero name there, for example on the Darius page the image in the top left with the name DariusSquare.png and save that into a folder.
Is this possible or am I asking too much from C#?
Thank you very much!
In general, everything is possible given enough time and money. In your case, you need very little of the former and none of the latter :)
What you need to do can be described in the following high-level steps:
1. Get all <a> tags within the table of heroes.
2. Use the WebClient class to navigate to the URLs these <a> tags point to (i.e. the values of their href attributes) and download the HTML.
3. Find some wrapper element that is present on each hero page and that contains the hero's image. Then you should be able to read the image's src attribute and download it. Alternatively, perhaps each image has a common ID you can use?
I don't think anyone will provide you with an exact code that will perform these steps for you. Instead, you need to do some research of your own.
Yes, it's possible: make a C# web request and use the HTML Agility Pack to find the image URL.
Then you can use another web request to download the image.
Example of downloading an image from a URL:
public static Image LoadImage(string url)
{
var backgroundUrl = url;
var request = WebRequest.Create(backgroundUrl);
var response = request.GetResponse();
var stream = response.GetResponseStream();
return Image.FromStream(stream);
}
Example using html agility pack and getting some other data:
var request = (HttpWebRequest)WebRequest.Create(profileurl);
request.Method = "GET";
using (var response = request.GetResponse())
{
using (var stream = response.GetResponseStream())
{
using (var reader = new StreamReader(stream, Encoding.UTF8))
{
result = reader.ReadToEnd();
}
var doc = new HtmlDocument();
doc.Load(new StringReader(result));
var root = doc.DocumentNode;
HtmlNode profileHeader = root.SelectSingleNode("//*[@id='profile-header']");
HtmlNode profileRight = root.SelectSingleNode("//*[@id='profile-right']");
string rankHtml = profileHeader.SelectSingleNode("//*[@id='best-team-1']").OuterHtml.Trim();
#region GetPlayerAvatar
var avatarMatch = Regex.Match(profileHeader.SelectSingleNode("/html/body/div/div[2]/div/div/div/div/div/span").OuterHtml, @"(portraits[^(h3)]+).*no-repeat;", RegexOptions.IgnoreCase);
if (avatarMatch.Success)
{
battleNetPlayerFromDB.PlayerAvatarCss = avatarMatch.Value;
}
#endregion
}
}
I'm just trying to learn about HtmlAgilityPack and XPath. I'm attempting to get a list of companies (HTML links) from the NASDAQ website:
http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx
I currently have the following code;
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
// Create a request for the URL.
WebRequest request = WebRequest.Create("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx");
// Get the response.
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
// Get the stream containing content returned by the server.
Stream dataStream = response.GetResponseStream();
// Open the stream using a StreamReader for easy access.
StreamReader reader = new StreamReader(dataStream);
// Read the content.
string responseFromServer = reader.ReadToEnd();
// Read into a HTML store read for HAP
htmlDoc.LoadHtml(responseFromServer);
HtmlNodeCollection tl = htmlDoc.DocumentNode.SelectNodes("//*[@id='indu_table']/tbody/tr[*]/td/b/a");
foreach (HtmlAgilityPack.HtmlNode node in tl)
{
Debug.Write(node.InnerText);
}
// Cleanup the streams and the response.
reader.Close();
dataStream.Close();
response.Close();
I used an XPath add-on for Chrome to get the following XPath:
//*table[@id='indu_table']/tbody/tr[*]/td/b/a
When I run my project, I get an unhandled XPath exception saying the expression has an invalid token.
I'm not sure what's wrong with it. I've tried putting a number in the tr[*] section above, but I still get the same error.
I've been looking at this for the last hour. Is it something simple?
thanks
Since the data comes from JavaScript, you have to parse the JavaScript rather than the HTML, so the Agility Pack doesn't help that much, although it does make things a bit easier. The following is how it could be done using the Agility Pack to locate the script and Newtonsoft Json.NET to parse the JavaScript array.
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.Load(new WebClient().OpenRead("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx"));
List<string> listStocks = new List<string>();
HtmlNode scriptNode = htmlDoc.DocumentNode.SelectSingleNode("//script[contains(text(),'var table_body =')]");
if (scriptNode != null)
{
//Using Regex here to get just the array we're interested in...
string stockArray = Regex.Match(scriptNode.InnerText, "table_body = (?<Array>\\[.+?\\]);").Groups["Array"].Value;
JArray jArray = JArray.Parse(stockArray);
foreach (JToken token in jArray.Children())
{
listStocks.Add("http://www.nasdaq.com/symbol/" + token.First.Value<string>().ToLower());
}
}
To explain a bit more in detail, the data comes from one big javascript array on the page var table_body = [....
Each stock is one element in the array and is an array itself.
["ATVI", "Activision Blizzard, Inc", 11.75, 0.06, 0.51, 3058125, 0.06, "N", "N"]
So by parsing the array, taking the first element of each entry, and appending it to the fixed base URL, we get the same result as the JavaScript does.
Why not just use the Descendants("a") method?
It's much simpler and more object-oriented. You'll just get a collection of node objects.
Then you can read the "href" attribute from each of those objects.
Sample code (requires using System.Linq):
htmlDoc.DocumentNode.Descendants("a").Select(a => a.GetAttributeValue("href", ""))
If you just need list of links from certain webpage, this method will do just fine.
If you look at the page source for that URL, there's not actually an element with id=indu_table. It appears to be generated dynamically (i.e. in javascript); the html that you get when loading directly from the server will not reflect anything that's changed by client script. This is probably why it's not working.
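A minimal sketch of how to confirm this from code, assuming the HtmlAgilityPack NuGet package is referenced. The inline HTML is a hypothetical stand-in for the raw server response, not the actual NASDAQ markup. SelectSingleNode returns null when nothing matches, so a null check shows whether the element exists in the downloaded source at all:

```csharp
using System;
using HtmlAgilityPack;

class MissingElementDemo
{
    static void Main()
    {
        // Hypothetical stand-in for the raw server response: the table is
        // absent and the data lives in a script block instead.
        var html = "<html><body><div id=\"content\"></div>" +
                   "<script>var table_body = [[\"ATVI\", \"Activision Blizzard, Inc\"]];</script>" +
                   "</body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when no node matches the XPath.
        HtmlNode table = doc.DocumentNode.SelectSingleNode("//*[@id='indu_table']");
        Console.WriteLine(table == null
            ? "indu_table is not in the downloaded HTML"
            : "indu_table found");
    }
}
```

If the check reports the element as missing while the browser's inspector shows it, the element is most likely being created by client-side script after the page loads.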
I am trying to grab data from a web page, specifically from <div> elements with a particular class: <div class="personal_info">. Each page has 10 to 15 similar <div>s, all with the same class "personal_info" (as shown in the HTML below), and I want to extract all of them from every page.
<div class="personal_info"><span class="bold">Rama Anand</span><br><br> Mobile: 9916184586<br>rama_asset@hotmail.com<br> Bangalore</div>
To do this, I started using the HTML Agility Pack, as suggested by someone on Stack Overflow,
but I got stuck right at the beginning because I don't know the HTML Agility Pack well. My C# code looks like this:
HtmlAgilityPack.HtmlDocument docHtml = new HtmlAgilityPack.HtmlDocument();
HtmlAgilityPack.HtmlWeb docHFile = new HtmlWeb();
docHtml = docHFile.Load("http://127.0.0.1/2.html");
How should I continue from here so that the data from the <div>s whose class is "personal_info" can be grabbed? A suggestion with an example would be appreciated.
I can't check this right now, but isn't it:
var infos = from info in docHtml.DocumentNode.SelectNodes("//div[@class='personal_info']") select info;
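As a hedged sketch of how the selected nodes might then be consumed, assuming the HtmlAgilityPack NuGet package is referenced: the inline HTML and the names in it are made-up sample data standing in for the downloaded page. Note that SelectNodes returns null rather than an empty collection when nothing matches, so it is worth checking before iterating:

```csharp
using System;
using HtmlAgilityPack;

class PersonalInfoDemo
{
    static void Main()
    {
        // Made-up sample markup standing in for the real page (assumption).
        var html = "<html><body>" +
                   "<div class=\"personal_info\"><span class=\"bold\">Sample Name</span><br> Bangalore</div>" +
                   "<div class=\"other\">ignored</div>" +
                   "</body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[@class='personal_info']");
        if (nodes == null)
        {
            Console.WriteLine("No matching div found.");
            return;
        }

        foreach (HtmlNode node in nodes)
        {
            // InnerText drops the tags and keeps only the text content.
            Console.WriteLine(node.InnerText.Trim());
        }
    }
}
```

The XPath test @class='personal_info' matches only when the class attribute is exactly that string; a div with several classes would need something like contains(@class, 'personal_info') instead.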
To get a url loaded you can do something like:
var document = new HtmlAgilityPack.HtmlDocument();
var url = "http://www.google.com";
var request = (HttpWebRequest)WebRequest.Create(url);
using (var responseStream = request.GetResponse().GetResponseStream())
{
document.Load(responseStream, Encoding.UTF8);
}
Also note there is a fork to let you use jquery selectors in agility pack.
IEnumerable<HtmlNode> myList = document.QuerySelectorAll(".personal_info");
http://yosi-havia.blogspot.com/2010/10/using-jquery-selectors-on-server-sidec.html
What happened to Where?
node.DescendantNodes().Where(node_it => node_it.Name=="div");
if you want top node (root) you use page.DocumentNode as "node".
Since I don't have access to the TemplateControl or Page from a WCF service, I was wondering whether it is possible to render a custom control. If so, how would one do it?
private string GetRenderedHtmlFrom(Control control)
{
StringBuilder stringBuilder = new StringBuilder();
StringWriter sw = new System.IO.StringWriter(stringBuilder);
HtmlTextWriter htmlWriter = new HtmlTextWriter(sw);
control.RenderControl(htmlWriter );
return stringBuilder.ToString();
}
Thanks
This actually wasn't achievable, and I ended up abandoning the idea. The rough solution I implemented was to load an HTML page, use string.Format() to fill in its values, return the result as a string, and let the JavaScript 'load the control'.
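A minimal sketch of that string.Format() approach, under stated assumptions: the template text, the class name, and the Render helper are all hypothetical, and the template is inlined here where the original setup would read it from an HTML file.

```csharp
using System;
using System.Net;

class TemplateDemo
{
    // Hypothetical template; in the setup described above this text
    // would be loaded from an HTML file instead of being inlined.
    const string Template = "<div class=\"widget\"><h2>{0}</h2><p>{1}</p></div>";

    static string Render(string title, string body)
    {
        // HTML-encode the values so inserted text cannot inject markup.
        return string.Format(Template,
            WebUtility.HtmlEncode(title),
            WebUtility.HtmlEncode(body));
    }

    static void Main()
    {
        Console.WriteLine(Render("Status", "5 < 10"));
    }
}
```

One caveat with this design: string.Format treats braces as placeholders, so any literal { or } in the template (inline CSS or JavaScript, for example) has to be doubled as {{ and }}.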