So here's the deal. I'm creating a spider bot for a website that scans all the product pages and records the product data. I'm using C# and the WebClient library to download the HTML string. The site I'm crawling must be specially made because the HTML that is received from WebClient.DownloadString() is different than the HTML that I get when I view the source of the HTML when visiting it on a browser. This seems intentional because the only info I can't get is the price.
Does anyone know a workaround for this problem or can anyone explain what is happening? Thanks.
It is probably using the the user agent string to decide what content to send. The example here shows how to set the user agent header.
Related
I am currently using htmlAgilityPack for some web scraping, however I've encountered a website that has script tags and I am unable to load it for scraping. I have little experience with web and am unsure how to properly load the webpage and convert back to something htmlAgility can parse.
Pretty much, when I inspect element in chrome, there is a table, but the htmlAgilityPack reads a script tag.
Any help would be appreciated.
Thank you
I have had similar problems too. It is very annoying that their is not one unified method of doing on all websites in a C# console.
However depending on the site you are looking at there may be some information in meta tags in the head section of the html. When I was making an application to get Youtube Subscription count I found it had the count in a meta tag (I assume this information is here for the scripts to use). This may be similar for the web page you are scraping.
To do this I first added a
document.save(//put a link to where the html file needs to go)
then I opened the html document in Google Chrome, opened up dev tools and did a search for "Subscriptions" (You can replace this for whatever you are looking for). Hopefully depending on the website you are scraping there may be a tag with some info in it for you.
Good Luck! :)
im starting the pseudo code of a new site, and want it to be as SEO friendly as possible.
the site i am creating is a booking agency site with c# and asp.net. essentially bands will register on the site with their availability and other info, and fill out their profile information with images etc. this info will be stored in a db.
creating this is not a problem, but i want the site to be a SEO friendly as possible.
I know google loves huge sites with great content. And all of these profile pages would be an excellent addition to my site for seo purposes. i also hear that google cannot see dynamically generated content when crawling a site.
i want to find a method of coding these pages, so google can see the content when it crawls them.
i need a pointer in the right direction for a solution for this. nothing is off limits - i will basically code my entire site around this principle, i just have no idea where to start looking for a solution. im not looking for a code solution, just what i should be researching to solve this issue.
Thanks in advance
i also hear that google cannot see dynamically generated content when crawling a site.
Google can see anything you can retrieve via http GET request (ie: there's a specific URL for it) and that someone either linked to or is listed in a published xml site map file.
To make sure that your profile pages fit this, you will want to make sure that profiles are all rendered via a single asp.net *.aspx file that determines which page is shown via a url parameter. Something that looks like this:
http://example.com/profiles.aspx?profile=SomeBandName
Now, you probably also want a friendly URL, that looks like this:
http://example.com/profiles/SomeBandName
To do that, you need to set up routing.
In order to crawl and index your pages by google or other search engine properly. Follow the following guidelines.
i: Page title must be precise and according to content available in page.
ii: Page url should be user friendly.
iii: Content is king (useful content)
iv: No ajax or javascript oriented way to load contents.
v: No flash or other media files. if exist must have description via alt tag.
vi: Create url sitemap of all static and dynamically generated contents.
vii: Submit sitemap to google and keep tracking how google crawl and index your pages.
fix issues contineously if google found via crawling.
In this way your most pages and content will be index properly and fastly.
I'd look into dynamic URL Rewriting.
Basically instead of having one page say http://localhost/Profile.aspx you'll have a bunch of simulated urls like
http://localhost/profiles/Band1
http://localhost/profiles/Band2
http://localhost/profiles/Band3
etc.
All of those will then map to back to the orgial profile.aspx page with a parameter so internally in your code it would look like http://localhost/Profile.aspx?Name=Band1, http://localhost/Profile.aspx?Name=Band2, etc
Basically your website appears to have a bunch of pages for each band but in reality they are all getting mapped back to the same asp.net page but have different parameters.
This is article I read about it some time back. http://weblogs.asp.net/scottgu/archive/2007/02/26/tip-trick-url-rewriting-with-asp-net.aspx
i also hear that google cannot see dynamically generated content when crawling a site.
you could create a sitemap.xml with the urls pointing to the dynamic profile pages. using google webmaster tools you can submit and monitor the crawling progress.
you may also create an index page or something similar ('browse by category' pages) that link to matching profile pages.
a reference for seo I regularly use is http://www.seomoz.org/learn-seo
Is there a way in C# to get the output of AJAX or Java? What I'm trying to do is grab the specifics of items on a webpage, however the webpage does not load it into the original source. Does anybody have a good tutorial or a good place to start?
For example, I would want to get all the car listings from http://www.madisonhonda.com/Preowned-Inventory.aspx#layout=layout1
If the DOM is being modified by javascript through ajax calls, and this modified data is what you are trying to capture then using a standard .NET WebClient won't work. You need to use a WebBrowser control so that it will actually execute the script, otherwise you will just be downloading the source.
If you need to just "load" it, then you'll need to understand how the page functions and try making the AJAX call yourself. Firebug and other similar tools allow you to see what requests are made by the browser.
There is no reason you cannot make the same web request from C# that the original page is making from Javascript. Depending on the architecture of the website, this could range in difficulty from constructing the proper URL with query string arguments (easy) to simulating a post with lots of page state (hard). The response content would most likely then be XML or JSON content instead of the HTML DOM, which if you're scraping for data will be a plus.
A long time ago I wrote a VB app to screen scrape financial sites and made it so that you could fire up multiple of these "harvester" screen scrapers. That might ease the time period loading data. We could do thousands of scrapes a day with multiple of these running on multiple boxes. Each harvester got its marching orders from information stored in the database, like what customer to get next and what was needed to scrape (balances, transaction history, etc.).
Like Michael said above, make a simple WinForms app with a WebBrowser control in it. You have to trap the DocumentComplete event. That should only fire when the web page is completely loaded. Then check out this post which gives an overview of how to do it.
Use the Html Agility Pack. It allows download of .html and scraping via XPath.
See How to use HTML Agility pack
First of all, I hope my question doesn't bother you. I really need to get and idea of how I can accomplish that, but unfortunatelly, I'm really a beginner, I'm crawling when it comes to programming. I'm struggling to learn it the best way I can. I'll thank you for any help you give me.
Here's the task: I was ordered to find a way to collect some data from a website using a c# application. This will be done everyday, in order to update the data which we'll use to calculate some financial index.
I know my question might sound vague, anyway, even telling me how I can be more precise will help me. I know I seem to know desperate, but putting appart all the personell issues, my scholarship kind of depends on it.
Thanks in advance! (Please, don't mind the bad English, I'm brasilian and my English might not be that good yet.)
First, your English is fine. In fact, I thought you were a native speaker until you said otherwise.
The term you're looking for is 'site scraping'. Observe this question: Options for HTML scraping?. The second answer points to an HTML agility pack library you can use.
Now, there are two possibilities here. The first is you have to parse the HTML and scrape your data out of it. This is more computationally intensive and depends on the layout of the page. If they change the way the site looks, it could break the scraper.
The second possibility is they provide some XML or JSON web service you can consume. In this case you aren't scraping anything, but are rather using a true data feed. If the layout of the site changes, you will not break. Whether your target site supports this form of data feed is up to the site.
If I understand your question, you're being asked to do some Web Scraping, where you 1) download the contents of a web page and 2) try to parse data from that content.
For step #1, you should look into using a WebClient object in C# to download the HTML from the web page. You can give a WebClient object the URL you want to download the content from and obtain a String containing the content (probably HTML) of the URL.
How you go about doing step #2 depends on what content is present at the web site. If you know of certain patterns you're looking for in the HTML, you can search the HTML string using various methods. A more general solution for parsing HTML data can be found through using the Html Agility Pack, which will let you handle the HTML as a tree structure (DOM).
Use the WebClient class to get the page.
Turn the html into xml.
Use XPath to select the data you are interested in.
Ok, this is a pretty straightforward app design, and a lot of the code exists that you can reuse. Since you're a beginner, I'll break down into steps of what you need to do and recommend approaches.
1) You will use classes from System.Net to pull the web pages (WebClient being the easiest to usse). You will want to have this part of the program run on a timer if you can (using the scheduled jobs feature of the OS) and have it just pull the pages and drop them in a folder.
2) You have a second job which will run separately, pulling unread files from that folder, parsing them (using the HtmlAgility pack library is best) and then storing them in an index of some kind (Lucene is best for that)
3) You have a front end application of some sort (web or desktop) which queries that index for the information you're looking for.
I was just curious if anyone has heard of any sort of API for guitar tabs? The thought passed through my mind that it would be really neat if I could grab guitar tabs from the internet to bring into my C# app, but haven't been able to find anything.
Thanks,
Chris
You can use System.Net.WebRequest to read the repository of your choice.
You can also use System.ServiceModel.Syndication.SyndicationFeed to read RSS feeds.
EDIT: Try scraping http://www.mxtabs.net/guitar_tabs/ using WebRequest.
You can write code that sends requests to the pages of that (or any other) website and parses the responses to extract information.
However, you might want to get their permission first. They might even offer you an API.
You should approach it from another angle.. Why isn't there a good data format to transfer guitar tab data? I mean, ASCII art is good but it is easily damaged, and it doesn't convery timing information well.
If you could come up with a format that could reach critical mass, that would be a good thing.
To scrape a web page, first figure out exactly which data you want to extract.
Visit the relevant pages with Fiddler running, and look at the HTTP requests and responses that you get.
You can then write C# code that requests the relevant page and reads through the response, line by line, looking for lines that you're interested in.
If the web page is XHTML compliant, you can also parse it using XDocument, but most web pages aren't.
http://www.911tabs.com/ seems to have a large selection, and they appear to all be a variant of ASCII-art, so it would be relatively easy to write an HTML- or text-scraping routine. They don't appear to use a very standardized format, however, so this might be more work than I think.