Getting nodes from html page using HtmlAgilityPack - c#

My program collects info about Steam users' profiles (such as games, badges, etc.). I use HtmlAgilityPack to collect data from the HTML page, and so far it has worked well for me.
The problem is that it works well on some pages, but on others it returns null nodes or throws an exception:
object reference not set to an instance of an object
Here's an example.
This part works well (when I'm getting badges):
WebClient client = new WebClient();
string html = client.DownloadString("http://steamcommunity.com/profiles/*id*/badges/");
var doc = new HtmlDocument();
doc.LoadHtml(html);
HtmlNodeCollection div = doc.DocumentNode.SelectNodes("//div[@class=\"badge_row is_link\"]");
This returns the exact number of badges, and then I can do whatever I want with them.
But here I do exactly the same thing (for games), and it keeps throwing the error I mentioned above:
WebClient client = new WebClient();
string html = client.DownloadString("http://steamcommunity.com/profiles/*id*/games/?tab=all");
var doc = new HtmlDocument();
doc.LoadHtml(html);
HtmlNodeCollection div = doc.DocumentNode.SelectNodes("//*[@id='game_33120']");
I know that the node is on the page (I checked via Google Chrome's code view), and I don't know why it works in the first case but not in the second.

When you right-click on the page and choose View Source do you still see an element with id='game_33120'? My guess is you won't. My guess is that the page is being built dynamically, client-side. Therefore, the HTML that comes down in the request doesn't contain the element you're looking for. Instead that element appears once the Javascript code has run in the browser.
It appears that the original request will have a section of Javascript that contains a variable called rgGames which is a Javascript array of the games that will be rendered on the screen. You should be able to extract the information from that.
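One way to pull that array out of the downloaded HTML is sketched below. It assumes the script contains a statement of the form var rgGames = [...]; whose array literal is valid JSON, and that each entry carries appid and name fields. The class name, method name and those field names are illustrative assumptions, not something confirmed by the page. The lazy match stops at the first "];" it sees, so it would need adjusting if that sequence can occur inside the data.

```csharp
using System.Collections.Generic;
using System.Text.Json;
using System.Text.RegularExpressions;

public static class SteamGamesParser
{
    // Captures the array literal assigned to rgGames (assumed format: var rgGames = [...];)
    private static readonly Regex RgGamesPattern =
        new Regex(@"var\s+rgGames\s*=\s*(\[.*?\]);", RegexOptions.Singleline);

    // Returns the games found in the raw HTML, or an empty list when rgGames is absent.
    public static List<(int AppId, string Name)> ExtractGames(string html)
    {
        var games = new List<(int AppId, string Name)>();
        Match match = RgGamesPattern.Match(html);
        if (!match.Success)
        {
            return games;
        }

        using JsonDocument json = JsonDocument.Parse(match.Groups[1].Value);
        foreach (JsonElement game in json.RootElement.EnumerateArray())
        {
            // "appid" and "name" are assumed property names
            if (game.TryGetProperty("appid", out JsonElement id) &&
                game.TryGetProperty("name", out JsonElement name))
            {
                games.Add((id.GetInt32(), name.GetString() ?? ""));
            }
        }
        return games;
    }
}
```

Calling SteamGamesParser.ExtractGames(html) on the string returned by DownloadString would then give a list to search for a particular game, instead of looking for a node that only exists after the browser has run the script.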

I don't understand the SelectNodes call with the parameter "//*[@id='game_33120']"; maybe that is the problem. But you can also check this:
The real link to a Steam profile's badges page is:
http://steamcommunity.com/id/id/badges/
and not
http://steamcommunity.com/profiles/id/badges/
After I visited a badges page, the URL stayed in the browser. With the games link, you are redirected to
http://steamcommunity.com
Maybe this can help you.

Related

How to extract a specific line from a webpage in c#

HttpWebRequest myReq = (HttpWebRequest)WebRequest.Create("https://www.google.com/search?q=" + "msg");
HttpWebResponse myres = (HttpWebResponse)myReq.GetResponse();
using (StreamReader sr = new StreamReader(myres.GetResponseStream()))
{
pageContent = sr.ReadToEnd();
}
if (pageContent.Contains("find"))
{
display = "done";
}
Currently, this code checks whether "find" exists in the page at a URL and displays "done" if it is present.
What I want is to display the whole line or paragraph that contains "find".
So instead of display = "done", I want to store the line that contains "find" in display.
HTML pages don't have lines. Whitespace outside tags is ignored, and an entire minified page may have no newlines at all. Even if it did, newlines are simply ignored, even inside tags. That's why <br> is necessary. If you want to find a specific element, you'll have to use an HTML parser like HtmlAgilityPack and identify the element using an XPath or CSS selector expression.
Copying from the landing page examples:
var url = $"https://www.google.com/search?q={msg}" ;
var web = new HtmlWeb();
var doc = web.Load(url);
var value = doc.DocumentNode
.SelectNodes("//div[@id='center_col']")
.First()
.Attributes["value"].Value;
What you put in SelectNodes depends on what you want to find.
One way to test various expressions is to open the web page you want in a browser, open the browser's Developer Tools and start searching in the Element inspector. The search functionality there accepts XPath and CSS selectors.
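For the original goal of getting the paragraph that contains a word, a minimal sketch with HtmlAgilityPack could look like the following. It assumes the text of interest sits inside p elements and that the search term contains no single quote (it is pasted into the XPath literal as-is). The class and method names are made up for illustration. Note that XPath's contains() is case-sensitive.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ParagraphFinder
{
    // Returns the text of every <p> element whose text contains the given term.
    public static List<string> FindParagraphsContaining(string html, string term)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var results = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes($"//p[contains(., '{term}')]");
        if (nodes == null)
        {
            return results;
        }

        foreach (HtmlNode node in nodes)
        {
            // Decode entities such as &amp; and trim surrounding whitespace
            results.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }
        return results;
    }
}
```

The returned strings can then be stored in display in place of the fixed "done" value.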

HtmlAgilityPack Not Finding Specific Node That Should Be There

I'm loading a URL and am looking for a specific node that should exist in the HTML doc but it is returning null every time. In fact, every node that I try to find is returning null. I have used this same code on other web pages but for some reason in this instance it isn't working. Could the HtmlDoc be loading something different than the source I see in my browser?
I'm obviously new to web scraping but have run into this kind of problem multiple times where I have to make an elaborate workaround because I'm unable to select a node that I can see in my browser. Is there something fundamentally wrong with how I'm going about this?
string[] arr = { "abercrombie", "adt" };
for(int i=0;i<1;i++)
{
string url = @"https://www.google.com/search?rlz=1C1CHBF_enCA834CA834&ei=lsfeXKqsCKOzggf9ub3ICg&q=" + arr[i] + "+ticker" + "&oq=abercrombie+ticker&gs_l=psy-ab.3..35i39j0j0i22i30l2.102876.105833..106007...0.0..0.134.1388.9j5......0....1..gws-wiz.......0i71j0i67j0i131j0i131i67j0i20i263j0i10j0i22i10i30.3zqfY4KZsOg";
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(url);
var node = htmlDoc.DocumentNode.SelectSingleNode("//span[@class = 'HfMth']");
Console.WriteLine(node.InnerHtml);
}
UPDATE
Thanks to RobertBaron for pointing me in the right direction. Here is a great copy paste solution.
The page that you are trying to scrape has javascript code that runs to load the entire contents of the page. Because your browser runs that javascript, you see the entire contents of the page. The HtmlWeb.Load() does not run any javascript code and so you only see a partial page.
You can use the WebBrowser control to scrape that page. Just like your browser, it will run any javascript code, and the entire page will be loaded. There are several stack overflow articles that show how to do this. Here are some of them.
WebBrowser Control in a new thread
Perform screen-scape of Webbrowser control in thread
How to cancel Task await after a timeout period
That content is dynamically added and not present in what is returned via your current method + url; which is why your xpath is unsuccessful. You can check what is returned with, for example:
var node = htmlDoc.DocumentNode.SelectSingleNode("//*");
Selecting something which is present for your first url - to show you can select a node
var node = htmlDoc.DocumentNode.SelectSingleNode("//span[@class = 'st']");
You can use developer tools > network tab > to see if any specific dynamic content you are after is available by a separate xhr request url.
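Because SelectSingleNode returns null when nothing matches, a small guard avoids the NullReferenceException while you investigate what was actually downloaded. The helper below is a sketch; its class and method names are hypothetical, and it works on an HTML string so it can be tried without a network request.

```csharp
using HtmlAgilityPack;

public static class SafeSelector
{
    // Returns true and the node's inner HTML when the XPath matches; false and "" otherwise.
    public static bool TryGetInnerHtml(string html, string xpath, out string innerHtml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // null here means the node is not in the HTML that was downloaded
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        innerHtml = node?.InnerHtml ?? "";
        return node != null;
    }
}
```

If this returns false for an element you can see in the browser, that is a strong hint the element is added by script after the initial response.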

Fetching data from a web page to a C# application

I am trying to create a desktop application in C# that will retrieve data from a website. In short, this is an application that I will use to create statistics for my local league's fantasy football (soccer) game. All the data I want to use is freely available online, but there are no APIs available to retrieve the data.
The first thing I tried was to get the HTML code for the website using WebClient and DownloadString:
WebClient client = new WebClient();
string priceChangeString = client.DownloadString(url);
However, it turned out that the data is not in the HTML string.
If I use Developer Tools in Chrome, I can inspect the page under "Elements". There I can see the data I want:
Screenshot from Chrome Developer Tools
I have tried to get these values by using "Copy as XPath" and HtmlAgilityPack, but I can't get my code to work:
using HtmlAgilityPack;
string url = "https://fantasy.eliteserien.no/a/statistics/cost_change_start";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
string userscore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[2]/div[1]/div/div[2]/a/span[1]")[0].InnerText;
I have tried several variations of this code, but they all throw a NullReferenceException:
Unhandled Exception: System.NullReferenceException: Object reference not set to an instance of an object.
at FantasyTest.Program.Main(String[] args) in C:\Users\my_username\source\repos\FantasyTest\FantasyTest\Program.cs:line 27
Does anyone see what I'm doing wrong when I try to use HtmlAgilityPack and XPath? Are there any other approaches I can take to solve this?
The web page from this example can be found here
I used a list to store all the information and then searched through that list, for example for <span> elements, and within all the <span> elements I made the application look for class="card-list".
var url = "https://fantasy.eliteserien.no/a/statistics/cost_change_start";
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
//This is the part of the code that takes information from the website
//Note that this part matches your screenshot, in the HTML code
//You can use that there is a table with class="ism-table ism-table--el"
//This piece of code target that specific table
var ProductsHtml = htmlDocument.DocumentNode.Descendants("table")
    .Where(node => node.GetAttributeValue("class", "")
        .Equals("ism-table ism-table--el")).ToList();
try
{
    var ProductListItems = ProductsHtml[0].Descendants("tr");
    foreach (var ProductListItem in ProductListItems)
    {
        //This targets what's inside the table
        Console.WriteLine("Id: " +
            ProductListItem.Descendants("<HEADER>")
                .Where(node => node.GetAttributeValue("<CLASS>", "")
                    .Equals("<CLASS=>")).FirstOrDefault()?.InnerText
        );
    }
}
catch (ArgumentOutOfRangeException)
{
    //ProductsHtml was empty: no matching table in the downloaded HTML
    Console.WriteLine("Table not found");
}
In your case I think you need a regex to match the numbers. This site has the numbers in <td>number</td> format. What we need is <td class="mNOK">number</td>.
So you need to use a regex to match all the numbers. To do that we do:
//Regex: match the digits in the first <td>
Console.WriteLine("numbers: " +
    Regex.Match(ProductListItem.Descendants("td").FirstOrDefault().InnerText
        , @"[0-9]+")
);
Note that you need to change <URL>, <HEADER>, <CLASS> and <CLASS=>.
<URL>: The site you want to take information from.
<HEADER>: Which element in the HTML code you want to read, for example "span, div, li, ul".
<CLASS>: Which attribute of that element to look at, for example "id, name".
<CLASS=>: The value <CLASS> must equal for the inner text to be read.
If you don’t mind calling an external python program, I’d suggest looking at python and the library called “BeautifulSoup”. It parses html nicely. Have the python program write out an xml file that your application can deserialize... the c# program can then do whatever it needs to do using that deserialized structure.
Thank you all for the feedback on this post, it has helped me find a solution to this problem.
It turned out that the data I wanted to retrieve was loaded with JavaScript. This means that the HtmlWeb and HtmlDocument classes from HtmlAgilityPack load the HTML before the data I want has been added to the page, so they cannot be used for this purpose.
I got around this by using a headless browser. I downloaded ChromeDriver and Selenium via NuGet, and got the data I wanted by using the following code:
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
var chromeOptions = new ChromeOptions();
chromeOptions.AddArguments("headless");
using (var driver = new ChromeDriver(chromeOptions))
{
    driver.Navigate().GoToUrl("https://fantasy.eliteserien.no/a/statistics/cost_change_start");
    // As IWebElement (FindElement(By...) is used because the FindElementBy* helpers are not available in Selenium 4)
    var fantasyTable = driver.FindElement(By.ClassName("ism-scroll-table"));
    // Content as text-string
    string fantasyTableText = fantasyTable.Text;
    // As Html-string
    string fantasyTableAsHtml = fantasyTable.GetAttribute("innerHTML");
    // My code for handling the data follows here...
}
Resource used to solve this:
How to start ChromeDriver in headless mode

Pull timer value from a webpage using xPath and C#

I am trying to pull some timer values off of websites using the xpath in the HtmlAgilityPack. However, when I am using the xpath, I get null reference exceptions because a particular node does not exist when I am grabbing it. To test why this was, I used a doc.Save to check the nodes myself, and I found that the nodes truly do not exist. From my understanding, HtmlAgilityPack should download the webpage almost exactly how I see it, with all the data in there as well. However, most of the data in fact is missing.
How exactly am I supposed to grab the timer values, or even an event title from either of the following websites:
http://dulfy.net/2014/04/23/event-timer/
http://guildwarstemple.com/dragontimer/eventsb.php?serverKey=108&langKey=1
My current code to pull just the title of the event from the first timebox from guildwarstemple is:
public void updateEventData()
{
//string Url = "http://dulfy.net/2014/04/23/event-timer/";
string Url = "http://guildwarstemple.com/dragontimer/eventsb.php?serverKey=108&langKey=1";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
doc.Save("c:/doc.html");
Title = doc.DocumentNode.SelectNodes("//*[@id='ep1']/p")[0].InnerText;
//*[@id="scheduleList"]/div[3]
//*[@id="scheduleList"]/div[3]/div[3]/text()
}
Your XPath expression fails because there is only one div with @id='ep1' in the document, and it has no p inside:
<div id="ep1" class="eventTimeBox"></div>
In fact, all the divs in megaContainer are empty in the link you are trying to load with your code.
If you think there should be p elements in there, it's probably being added dynamically via JavaScript, so it might not be available when you are scraping the site with a C# client.
In fact, there are some JavaScript variables:
<script>
...
var e7 = 'ep1';
...
var e7t = '57600';
...
Maybe you want to get that data. This:
substring-before(substring-after(normalize-space(//script[contains(.,"var e7t")]),"var e7t = '"),"'")
selects the <script> which contains var e7t and extracts the string in the apostrophes. It will return:
57600
The same with your other link. The expression:
//*[@id="scheduleList"]
is an empty div. You can't navigate further inside it:
<div id="scheduleList" style="width: 720px; min-width: 720px; background: #1a1717; color: #656565;"></div>
But this time there seems to be no nested JavaScript that refers to it in the page.
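As an alternative to evaluating the substring-before/substring-after XPath expression, the same value can be read from the raw HTML with a regular expression. The sketch below swaps in that technique; it assumes the variable is declared as var name = 'value'; with single quotes, as in the snippet above, and the class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class ScriptVariableReader
{
    // Reads the single-quoted string assigned to a JavaScript variable, e.g. var e7t = '57600';
    public static bool TryReadStringVariable(string html, string variableName, out string value)
    {
        string pattern = @"\bvar\s+" + Regex.Escape(variableName) + @"\s*=\s*'([^']*)'";
        Match match = Regex.Match(html, pattern);

        // Empty string when the variable is not present in the page source
        value = match.Success ? match.Groups[1].Value : "";
        return match.Success;
    }
}
```

For the page above, TryReadStringVariable(html, "e7t", out var seconds) would yield the same string that the XPath expression extracts, provided the declaration keeps that form.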

c# find image in html and download them

I want to download all images stored in an HTML web page. I don't know how many images there will be, and I don't want to use "HTML AGILITY PACK".
I searched on Google, but every site just left me more confused.
I tried a regex, but it gives only one result...
People are giving you the right answer - you can't be picky and lazy, too. ;-)
If you use a half-baked solution, you'll deal with a lot of edge cases. Here's a working sample that gets all links in an HTML document using HTML Agility Pack (it's included in the HTML Agility Pack download).
And here's a blog post that shows how to grab all images in an HTML document with HTML Agility Pack and LINQ
// Bing Image Result for Cat, First Page
string url = "http://www.bing.com/images/search?q=cat&go=&form=QB&qs=n";
// For speed of dev, I use a WebClient
WebClient client = new WebClient();
string html = client.DownloadString(url);
// Load the Html into the agility pack
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// Now, using LINQ to get all Images
List<HtmlNode> imageNodes = null;
imageNodes = (from HtmlNode node in doc.DocumentNode.SelectNodes("//img")
where node.Name == "img"
&& node.Attributes["class"] != null
&& node.Attributes["class"].Value.StartsWith("img_")
select node).ToList();
foreach(HtmlNode node in imageNodes)
{
Console.WriteLine(node.Attributes["src"].Value);
}
First of all I just can't leave this phrase alone:
images stored in html
That phrase is probably a big part of the reason your question was down-voted twice. Images are not stored in html. Html pages have references to images that web browsers download separately.
This means you need to do this in three steps: first download the html, then find the image references inside the html, and finally use those references to download the images themselves.
To accomplish this, look at the System.Net.WebClient class. It has a .DownloadString() method you can use to get the html. Then you need to find all the <img /> tags. You're on your own here, but it's straightforward enough. Finally, you use WebClient's .DownloadData() or .DownloadFile() methods to retrieve the images.
You can use a WebBrowser control and extract the HTML from that e.g.
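System.Windows.Forms.WebBrowser objWebBrowser = new System.Windows.Forms.WebBrowser();
objWebBrowser.Navigate(new Uri("your url of html document"));
// Navigate is asynchronous: read Document only after the DocumentCompleted event has fired
System.Windows.Forms.HtmlDocument objDoc = objWebBrowser.Document;
// "IMG" is a tag name, so look it up by tag name rather than by name attribute
System.Windows.Forms.HtmlElementCollection aColl = objDoc.GetElementsByTagName("IMG");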
...
or directly invoke the IHTMLDocument family of COM interfaces
In general terms
You need to fetch the html page
Search for img tags and extract the src="..." portion out of them
Keep a list of all these extracted image urls.
Download them one by one.
Maybe this question about C# HTML parser will help you a little bit more.
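The middle steps of that outline (find the img tags, extract src, keep a list of URLs) can be sketched as follows. This version uses HtmlAgilityPack, as the answers above recommend, even though the question asked to avoid it. The class and method names are made up for illustration, and pageUri is assumed to be the address the HTML was downloaded from, so that relative src values can be resolved.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ImageUrlExtractor
{
    // Collects absolute http/https URLs from the src attribute of every <img> element.
    public static List<Uri> ExtractImageUrls(string html, Uri pageUri)
    {
        var urls = new List<Uri>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when the page has no <img src="..."> at all
        HtmlNodeCollection imgNodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (imgNodes == null)
        {
            return urls;
        }

        foreach (HtmlNode img in imgNodes)
        {
            string src = img.GetAttributeValue("src", "").Trim();
            if (src.Length == 0)
            {
                continue;
            }

            // Resolve relative paths against the page address; skip data: URIs and the like
            if (Uri.TryCreate(pageUri, src, out Uri absolute) &&
                (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps))
            {
                urls.Add(absolute);
            }
        }
        return urls;
    }
}
```

Each returned Uri can then be passed to WebClient's DownloadFile to save the image. Images that a page adds through script will not appear in this list, for the same reason discussed in the earlier questions.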
