Parse webpage with Fragment identifier in URL, using HTML Agility Pack - c#

I want to parse a webpage that has a fragment identifier (#) in its URL, e.g. http://steamcommunity.com/market/search?q=appid%3A570+uncommon#p4
When I use my browser (Google Chrome) I get different results for the different identifiers (#p1, #p2, #p3), but when I use HTML Agility Pack I always get the first page, regardless of the page identifier.
string sURL = "http://steamcommunity.com/market/search?q=appid%3A570+uncommon#p";
var wClient = new WebClient();
var html = new HtmlAgilityPack.HtmlDocument();
html.LoadHtml(wClient.DownloadString(sURL + i));
I understand that something like Ajax is used here and that, in fact, only one page exists. How can I fix my problem and get the results from the other pages using C#?

Like David said, use the URL http://steamcommunity.com/market/search/render/?query=appid%3A570%20uncommon&search_descriptions=0&start=30&count=10,
where start is the index of the first item and count is the number of items you want.
The fragment (#p4) is never sent to the server; the page's JavaScript reads it and fetches the listings over Ajax, which is why you always get the first page.
The result is JSON, so, to state the obvious, you only want to use results_html.
Side note: in Chrome, press F12, click the Network tab, and you will see the requests and responses being made.
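A minimal sketch of that request in C# (assuming the Newtonsoft.Json package for parsing the JSON; the span class in the XPath is a guess at the listing markup and may need adjusting against the actual fragment):

```csharp
using System;
using System.Net;
using HtmlAgilityPack;
using Newtonsoft.Json.Linq;

int start = 30, count = 10; // item offset and page size, as described above
string url = "http://steamcommunity.com/market/search/render/" +
             "?query=appid%3A570%20uncommon&search_descriptions=0" +
             $"&start={start}&count={count}";

var client = new WebClient();
var json = JObject.Parse(client.DownloadString(url));

// The listings come back as an HTML fragment in the "results_html" property.
var doc = new HtmlDocument();
doc.LoadHtml((string)json["results_html"]);

// Hypothetical selector for the item names; inspect the fragment to confirm it.
var names = doc.DocumentNode.SelectNodes("//span[contains(@class,'market_listing_item_name')]");
if (names != null)
    foreach (var n in names)
        Console.WriteLine(n.InnerText.Trim());
```

Loop over start in steps of count to walk through the pages that the #p fragments represent in the browser.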

Related

I can't get the content of a web page without html codes in C#

I want to get the text of a web page in a Windows Forms application. I am using:
WebClient client = new WebClient();
string downloadString = client.DownloadString(link);
However, it gives me the HTML code of the web page.
Here is the question:
Can I get a specific part of a website? For example, a part that has the class name "ask-page new-topbar". I want to get every part that has the class name "ask-page new-topbar".
No, you can't request only part of a website when you send a request to a URL.
What you can do is use the Html Agility Pack and let it dig through the HTML to give you the contents of the requested nodes.
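For instance, a minimal sketch (using a literal HTML string in place of the downloaded page, and the class name from the question):

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<div class='ask-page new-topbar'>first</div>" +
             "<div class='other'>skip</div>" +
             "<div class='ask-page new-topbar'>second</div>");

// XPath has no class-aware selector, so contains() on @class is the usual idiom.
var texts = new List<string>();
foreach (var node in doc.DocumentNode.SelectNodes("//*[contains(@class,'ask-page new-topbar')]"))
    texts.Add(node.InnerText);
// texts now holds "first" and "second"
```

In the real case you would pass the string from DownloadString to LoadHtml instead of the literal.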

Get Proper XPath for SelectNodes

I just started using HtmlAgilityPack to scrape some text from websites. I have experimented and found that some websites are easier than others in regards to getting the proper XPath when using the SelectNodes method. I believe I am doing something wrong but can't figure it out.
For example, when exploring the DOM in Google Chrome, I am able to copy the XPath //*[@id="page"]/span/table[7]/tbody/tr[1]/td/span[2]/a, and then I would do something like:
var search = doc.DocumentNode.SelectNodes("//*[@id=\"page\"]//span//table//tr//td//span//a");
When using search in a foreach loop I get a null reference error, and sure enough the debugger says search is null. So I am assuming the XPath is wrong (or I am doing something else totally wrong). So my question is: how exactly do I get the proper XPath for HtmlAgilityPack to find these nodes?
Following up on what you asked in your last comment: the HTML is fully rendered only after the HTTP GET request returns.
Several JavaScript calls insert blocks of HTML into the document.
You want this one: loadCompanyProfileData('ContactInfo'), which issues an HTTP GET request that looks like:
http://financials.morningstar.com/cmpind/company-profile/component.action?component=ContactInfo&t=XNAS:AAPL&region=usa&culture=en-US&cur=&_=1465809033745
This returns the email, which you can extract with code like the following:
HtmlWeb w = new HtmlWeb();
var doc = w.Load("http://financials.morningstar.com/cmpind/company-profile/component.action?component=ContactInfo&t=XNAS:AAPL&region=usa&culture=en-US&cur=&_=1465809033745");
var emails = doc.DocumentNode.CssSelect("a") // CssSelect is a ScrapySharp extension method
    .Where(a => a.GetAttributeValue("href", string.Empty)
        .StartsWith("mailto:"))
    .Select(a => a.GetAttributeValue("href", string.Empty)
        .Replace("mailto:", string.Empty));
emails ends up containing one element: investor_relations@apple.com.
Your problem is to determine what the "cur" parameter that the loadCompanyProfileData JavaScript function uses should be for each distinct company.
I could not locate in the code where or how this parameter is generated.
One alternative would be to use a browser emulator (like the Selenium WebDriver port for C#) so you can execute JavaScript and run the call to loadCompanyProfileData('ContactInfo') for each company request.
But I could not get this to work either; my WebDriver script execution does not seem to be working.

Selenium C# Dynamic Meta Tags

I'm using Selenium for C# in order to serve fully rendered JavaScript applications to Google spiders and to users with JavaScript disabled. I am using ASP.NET MVC to serve the pages from my controller. I need to be able to generate dynamic meta tags before the content is served to the caller. For example, the following pseudo code:
var pageSource = driver.PageSource; // This is where i get my page content
var meta = driver.findElement(By.tagname("meta.description")).getAttribute("content");
meta.content = "My New Meta Tag Value Here";
return driver.PageSource; // return the page source with edited meta tags to the client
I know how to get the page source to the caller (I am already doing this), but I can't seem to find the right selector to edit the meta tags before I push the content back to the requester. How would I accomplish this?
Selenium doesn't have a feature specifically for this. But technically, you can change meta tags with JavaScript, so you can use Selenium's IJavaScriptExecutor in C#.
If the page is using jQuery, here's one way to do it:
// new content to swap in
string newContent = "My New Meta Tag Value Here";
// jQuery function to do the swapping
string changeMetasScript = "$('meta[name=author]').attr('content', arguments[0]);";
// execute with the JavaScript executor
IJavaScriptExecutor js = driver as IJavaScriptExecutor;
js.ExecuteScript(changeMetasScript, newContent);

What is the best way to get all webpage links with Html Agility Pack?

I am trying to get all the links from a webpage with Html Agility Pack. After sending the web URL (cnn.com) I have this list (returned by the Html Agility class):
What is the best way to get all of this page's links, since some of those links start with "/" and not with the page address?
That's what I use in cases like these:
protected Uri GetAbsoluteUri(string linkUri)
{
    var uri = new Uri(linkUri, UriKind.RelativeOrAbsolute);
    return uri.IsAbsoluteUri ? uri : new Uri(PageUri, uri);
}
The code above assumes that:
linkUri is the value of an anchor's href attribute
PageUri is a System.Uri object that represents the Absolute Uri of the current page
Those links that don't start with http:// are relative to the current address (http://cnn.com), so you can prepend it to get the full address. As for those that represent JavaScript functions, there's not much you can do with HTML Agility Pack, as it only parses HTML.
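Putting the pieces together, a sketch that collects every href on a page and resolves the relative ones against the page's own address (a literal HTML string stands in for the downloaded cnn.com page here):

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

var pageUri = new Uri("http://cnn.com/"); // absolute URI of the current page
var doc = new HtmlDocument();
doc.LoadHtml("<a href='/world'>World</a><a href='http://example.com/x'>X</a>");

var links = new List<string>();
foreach (var a in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    var uri = new Uri(a.GetAttributeValue("href", ""), UriKind.RelativeOrAbsolute);
    // relative hrefs are resolved against the page's base address
    links.Add((uri.IsAbsoluteUri ? uri : new Uri(pageUri, uri)).ToString());
}
// links: "http://cnn.com/world", "http://example.com/x"
```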

Add querystring to all anchor links in HTML body

In C#, given a string which contains HTML, what is the best way to automatically add the query string data test=1 to the end of every hyperlink? It should only modify the URL inside the href attribute of anchor links (e.g. not image URLs etc.).
An example would be:
Input
Visit <a href="http://www.test.com">http://www.test.com</a> today
and see what deals we have.
Output
Visit <a href="http://www.test.com?test=1">http://www.test.com</a> today
and see what deals we have.
This seems to be a bit tricky and am not sure where the best place to start on this would be. Any help appreciated!
HTML Agility Pack is a very good library for parsing HTML.
Here is a sample that gets all the text in an HTML document:
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("your path (local or web)");
var result = doc.DocumentNode.SelectNodes("//body//text()"); // returns an HtmlNodeCollection
foreach (var node in result)
{
    string achievedText = node.InnerText; // your desired text
}
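To actually answer the question above (appending test=1 only to anchor hrefs), a sketch along the same lines:

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<a href='http://www.test.com'>link</a>" +
             "<img src='http://www.test.com/pic.png'>");

var newHrefs = new List<string>();
var anchors = doc.DocumentNode.SelectNodes("//a[@href]"); // anchors only, never <img> etc.
if (anchors != null)
{
    foreach (var a in anchors)
    {
        string href = a.GetAttributeValue("href", "");
        string sep = href.Contains("?") ? "&" : "?"; // respect an existing query string
        a.SetAttributeValue("href", href + sep + "test=1");
        newHrefs.Add(a.GetAttributeValue("href", ""));
    }
}
// newHrefs: "http://www.test.com?test=1"; doc.DocumentNode.OuterHtml has the rewritten markup
```

The image src is untouched because the XPath only matches a elements with an href attribute.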
