How can certain things be omitted from the page source? - C#

I am writing a program that automatically starts up a web browser to a song of choice. In order to do this, my program uses an HttpWebRequest/response to:
1. get the source code of a webpage that contains a link to the audio source page,
2. search through that website's source code and find the audio source website,
3. open up Chrome to the specified webpage, so I can listen to the song.
I am using project.com as my audio source, and I do not plan on using another site such as YouTube.
The problem I am having is that, while I can see a link to the audio source website when I load the initial website in Chrome, the page source does not contain it. For instance, this website has a link to the audio source http://pl.st/s/1709472017 which you can actually see on the initial website, but when I look at the page source using Chrome, I cannot find this audio source link.
If I right-click on the audio source textbox and select Inspect Element, I can see this:
<input class="copy-song-link"
type="textbox"
value="http://pl.st/s/1709472017"
name="url"
onclick="javascript:select();" title="Copy and share this song URL">
So this link is clearly located somewhere. My questions are these:
1. Why/how is this link not in the page source, even though I can see it when I look at the website through Chrome?
2. How come Chrome's Inspect Element can find this URL, while the page source does not include it?
3. How does Inspect Element differ from looking at the source code?
I am pretty new to HTTP communication, so any help would be appreciated.

I generally use the plugin Firebug for Firefox for situations like this. It will allow you to use the "Net" tab to inspect all subsequent requests (often Ajax) that occur while the page is loading.
In your case it appears an AJAX request is collecting the data that is being used to generate the link you want to pull out. This data comes back as JSON, and JavaScript is likely generating the links on the client side from it. Take a look at this link:
http://www3.playlist.com/async/searchbeta/tracks?searchfor=r%20u%20mine
The linkid used to generate the link is in the first part of the response: ...PPL.search.trackdata = [{"linkid":1709472017...
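If you want to do the same thing from C# instead of Firebug, a minimal sketch along these lines should work. The search URL and the "linkid" field come from the response quoted above; the regex and the http://pl.st/s/<linkid> link format are assumptions based on that example.

using System;
using System.Net;
using System.Text.RegularExpressions;

class LinkIdExample
{
    static void Main()
    {
        // The AJAX endpoint the page itself calls (visible in Firebug's Net tab).
        string searchUrl = "http://www3.playlist.com/async/searchbeta/tracks?searchfor=r%20u%20mine";

        using (var client = new WebClient())
        {
            string json = client.DownloadString(searchUrl);

            // Grab the first "linkid" value from the JSON response.
            Match m = Regex.Match(json, "\"linkid\":(\\d+)");
            if (m.Success)
            {
                string songUrl = "http://pl.st/s/" + m.Groups[1].Value;
                Console.WriteLine(songUrl);
                // From here you could launch Chrome, e.g.:
                // System.Diagnostics.Process.Start("chrome.exe", songUrl);
            }
        }
    }
}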

Related

Display thumbnail of page when displayed in website search results

Our client wants to display a thumbnail screenshot of a page when it is listed in the website's search results. Of course, they want it to be automated. The website is built on Sitecore 9.0 and uses SOLR for indexing. It seems that creating a computed index field would be the best option performance-wise, but I feel like it will still take forever when running a full index rebuild, as it would make an HTTP request for every page.
I took a look at some solutions for capturing thumbnails; this one looks to be the most promising: http://html2canvas.hertzen.com/. However, it doesn't seem like this will work with server-side C# HTTP requests. I am also not sure how I'd selectively toggle the html2canvas event on pages, or have the page send back the image as a response to the HTTP request.
Any other solution ideas would be appreciated.

Loading Javascript with C# Console Application

I am currently using HtmlAgilityPack for some web scraping; however, I've encountered a website that has script tags, and I am unable to load it for scraping. I have little experience with web development and am unsure how to properly load the webpage and convert it back into something HtmlAgilityPack can parse.
Pretty much, when I inspect element in Chrome, there is a table, but HtmlAgilityPack reads a script tag.
Any help would be appreciated.
Thank you
I have had similar problems too. It is very annoying that there is not one unified method of doing this for all websites from a C# console application.
However, depending on the site you are looking at, there may be some information in meta tags in the head section of the HTML. When I was making an application to get a YouTube subscription count, I found it had the count in a meta tag (I assume this information is there for the scripts to use). This may be similar for the web page you are scraping.
To do this, I first added a
document.Save("page.html"); // put the path to where the HTML file needs to go
Then I opened the saved HTML document in Google Chrome, opened up dev tools, and searched for "Subscriptions" (you can replace this with whatever you are looking for). Hopefully, depending on the website you are scraping, there will be a tag with some info in it for you.
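If the data does turn out to live in a meta tag, a rough sketch like this will dump the page's meta tags so you can spot the one you need. The URL here is a placeholder, and the page you are scraping may use different attribute names.

using System;
using HtmlAgilityPack;

class MetaTagExample
{
    static void Main()
    {
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("https://example.com/some-page"); // placeholder URL

        var metaTags = doc.DocumentNode.SelectNodes("//head/meta");
        if (metaTags != null)
        {
            foreach (HtmlNode meta in metaTags)
            {
                // Print each meta tag's name (or property) and content so you can
                // spot the one that carries the data you are after.
                string name = meta.GetAttributeValue("name", meta.GetAttributeValue("property", ""));
                string content = meta.GetAttributeValue("content", "");
                Console.WriteLine("{0} = {1}", name, content);
            }
        }
    }
}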
Good Luck! :)

C# Webbrowser not loading full web page - page loaded event handler

I have a webpage that I want to monitor that has stock market information that I want to read and store. The information gathered is to be stored somewhere, say a .csv file or similar for later analysis.
The first problem I have is detecting when this page has fully loaded. The time taken to load can vary enormously. The event handlers I have tried all fire multiple times (I know this has been covered and I have tried the various techniques, but to no avail). Perhaps it is something specific to do with this web-page? Anyway, I need to know when this page has fully loaded and is sitting pretty with all graphics displayed properly.
The second problem is that I cannot get the true source page into the WebBrowser control. As a consequence, all access to the DOM fails, as the HTML representation inside the WebBrowser control appears not to match what is actually happening on the webpage. I have dumped the text (webBrowser2.DocumentText) and it looks nothing like what you see when I check the source in a browser, Chrome for example. (I also use the Firebug extension in Firefox to double-check things.) How can I get the correct page into the WebBrowser control so I can start to manipulate things?
Essentially, in terms of the data, I need the GMT Time, Strike Rate and expiration time. My process will monitor with a timer control. To be able to read all the other element data on screen is a nice-to-have.
Can this be done?
I am an experienced programmer new to web programming and C#.
I think you want this AJAX request.
As a review, the web works by first loading the web page, then scanning it for additional files it needs to load (JS, CSS, images, etc.). When those finish, the onload event is triggered and some AJAX functions may run.
In this case, only some of the page is loaded up front, and AJAX functions update the data in the graph later. As you've seen, "Show Source" only shows the original file that was downloaded; it is not a dump of the page's current state.
The easiest way to get the data is to find the URL of the AJAX request that loads the graph data. It is already conveniently formatted in JSON for you to scrape.
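For example, a minimal sketch along these lines, run from a timer, fetches the data without a WebBrowser control at all. The endpoint URL and the parsing step are placeholders, since the actual AJAX request depends on the site:

using System;
using System.Net;

class GraphDataPoller
{
    static void Main()
    {
        string ajaxUrl = "https://example.com/api/graph-data"; // placeholder: the request you find in the Net tab

        using (var client = new WebClient())
        {
            string json = client.DownloadString(ajaxUrl);

            // Parse the JSON with your preferred library (e.g. Json.NET) and append
            // the fields you need (GMT time, strike rate, expiration) to your .csv file.
            Console.WriteLine(json);
        }
    }
}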

I cannot see the whole HTML source code by pressing Ctrl+U

I am working on scraping a website, so I have made a desktop application for that.
When I check the website using Inspect Element, I can see all of the website's data, but when I check it using the page source (Ctrl+U), there is nothing.
That means I can't find any of the website's data in the page source, but I can see it in Firebug (Inspect Element).
Because of this, when I try to get the data using C# code, I only get the page source, which doesn't contain any of the website's data, only the schema (structure) and JS links.
See the Firebug image below.
And this is the page source image.
You have hit a JS-powered site. The content is loaded dynamically through JS, so it is not visible in the page source. Turn to scraping libraries that support JavaScript evaluation. See here for an example.
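One example of such a library is Selenium WebDriver, which drives a real browser and so executes the page's JavaScript before you read the DOM. Selenium is my own choice here for illustration; the answer above does not name a specific library, and the URL is a placeholder.

using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class JsPageScraper
{
    static void Main()
    {
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl("https://example.com/js-powered-page"); // placeholder URL

            // PageSource reflects the DOM after the JavaScript has run,
            // unlike the raw Ctrl+U source.
            string renderedHtml = driver.PageSource;
            Console.WriteLine(renderedHtml);
        }
    }
}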

Making Dynamically Created ASP.NET Pages SEO Friendly

I'm starting the pseudo-code of a new site, and I want it to be as SEO-friendly as possible.
The site I am creating is a booking agency site built with C# and ASP.NET. Essentially, bands will register on the site with their availability and other info, and fill out their profile information with images etc. This info will be stored in a DB.
Creating this is not a problem, but I want the site to be as SEO-friendly as possible.
I know Google loves huge sites with great content, and all of these profile pages would be an excellent addition to my site for SEO purposes. I also hear that Google cannot see dynamically generated content when crawling a site.
I want to find a method of coding these pages so Google can see the content when it crawls them.
I need a pointer in the right direction for a solution. Nothing is off limits - I will basically code my entire site around this principle; I just have no idea where to start looking. I'm not looking for a code solution, just what I should be researching to solve this issue.
Thanks in advance
I also hear that Google cannot see dynamically generated content when crawling a site.
Google can see anything you can retrieve via an HTTP GET request (i.e., there's a specific URL for it) that someone has either linked to or that is listed in a published XML sitemap file.
To make your profile pages fit this, you will want all profiles rendered via a single ASP.NET *.aspx file that determines which profile is shown via a URL parameter. Something that looks like this:
http://example.com/profiles.aspx?profile=SomeBandName
Now, you probably also want a friendly URL, that looks like this:
http://example.com/profiles/SomeBandName
To do that, you need to set up routing.
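A minimal routing sketch, assuming the single profiles.aspx page above (the route name and parameter name are my own placeholders):

using System;
using System.Web.Routing;

public class Global : System.Web.HttpApplication
{
    void Application_Start(object sender, EventArgs e)
    {
        // /profiles/SomeBandName is served by the single physical profiles.aspx page.
        RouteTable.Routes.MapPageRoute("BandProfile", "profiles/{bandName}", "~/profiles.aspx");
    }
}

In the profiles.aspx code-behind you can then read which band was requested with Page.RouteData.Values["bandName"] and load that profile from the database.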
In order for Google or other search engines to crawl and index your pages properly, follow these guidelines:
i: The page title must be precise and match the content available on the page.
ii: The page URL should be user-friendly.
iii: Content is king (useful content).
iv: Don't rely on AJAX or JavaScript to load content.
v: No Flash or other media files; if they exist, they must have a description via an alt tag.
vi: Create a URL sitemap of all static and dynamically generated content.
vii: Submit the sitemap to Google and keep tracking how Google crawls and indexes your pages.
Fix issues continuously as Google finds them via crawling.
This way, most of your pages and content will be indexed properly and quickly.
I'd look into dynamic URL Rewriting.
Basically, instead of having one page, say http://localhost/Profile.aspx, you'll have a bunch of simulated URLs like
http://localhost/profiles/Band1
http://localhost/profiles/Band2
http://localhost/profiles/Band3
etc.
All of those will then map back to the original Profile.aspx page with a parameter, so internally in your code it would look like http://localhost/Profile.aspx?Name=Band1, http://localhost/Profile.aspx?Name=Band2, etc.
Basically, your website appears to have a separate page for each band, but in reality they all map back to the same ASP.NET page, just with different parameters.
This is an article I read about it some time back: http://weblogs.asp.net/scottgu/archive/2007/02/26/tip-trick-url-rewriting-with-asp-net.aspx
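As a rough illustration of the idea (not the exact approach from the article), a simple rewrite can be done in Global.asax; the path prefix and page name here are assumptions matching the URLs above:

using System;
using System.Web;

public class Global : System.Web.HttpApplication
{
    void Application_BeginRequest(object sender, EventArgs e)
    {
        string path = Request.Path; // e.g. "/profiles/Band1"
        if (path.StartsWith("/profiles/", StringComparison.OrdinalIgnoreCase))
        {
            string bandName = path.Substring("/profiles/".Length);
            // Internally rewrite to the single physical page; the visitor (and Google)
            // still sees the friendly /profiles/Band1 URL.
            Context.RewritePath("~/Profile.aspx?Name=" + Server.UrlEncode(bandName));
        }
    }
}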
I also hear that Google cannot see dynamically generated content when crawling a site.
You could create a sitemap.xml with the URLs pointing to the dynamic profile pages. Using Google Webmaster Tools, you can submit it and monitor the crawling progress.
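A rough sketch of generating that sitemap.xml for the dynamic profile pages (the band names and domain below are placeholders; in practice they would come from your DB):

using System;
using System.Xml.Linq;

class SitemapGenerator
{
    static void Main()
    {
        string[] bandNames = { "Band1", "Band2", "Band3" }; // placeholder: load from your DB

        XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";
        var urlset = new XElement(ns + "urlset");

        foreach (string band in bandNames)
        {
            urlset.Add(new XElement(ns + "url",
                new XElement(ns + "loc", "http://example.com/profiles/" + Uri.EscapeDataString(band)),
                new XElement(ns + "changefreq", "weekly")));
        }

        new XDocument(urlset).Save("sitemap.xml");
    }
}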
You may also create an index page or something similar ('browse by category' pages) that links to matching profile pages.
A reference for SEO I regularly use is http://www.seomoz.org/learn-seo
