I want to get the wind data from this site "www.windguru.cz".
From Google Chrome's built-in developer tools, the XPath of the table is //*[@id="tabid_0_content_div"].
But the problem is that when I query this XPath through HtmlAgilityPack, I get a NullReferenceException again and again.
I have no idea how to solve it.
Can anyone give some hints?
Thanks a million.
Because you are getting exactly what the browser gets when requesting that URL. Those tables are generated dynamically by JavaScript. I believe what you can do here is use a combination of the WebBrowser control and HtmlAgilityPack: just wait until the page is loaded, and after that you can use HAP.
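A rough sketch of that idea follows; the form class, navigation URL, and timing are illustrative assumptions, and since the windguru tables are filled by AJAX you may need an extra delay or re-check after DocumentCompleted fires:

    using System;
    using System.Windows.Forms;
    using HtmlAgilityPack;

    public class ScraperForm : Form
    {
        private readonly WebBrowser browser = new WebBrowser();

        public ScraperForm()
        {
            browser.ScriptErrorsSuppressed = true;
            browser.DocumentCompleted += OnDocumentCompleted;
            browser.Navigate("http://www.windguru.cz/");
            Controls.Add(browser);
        }

        private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            // Fully qualified because WinForms also defines an HtmlDocument type.
            var doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(browser.Document.Body.OuterHtml);  // the rendered DOM, after scripts have run

            var table = doc.DocumentNode.SelectSingleNode("//*[@id='tabid_0_content_div']");
            if (table != null)
                MessageBox.Show(table.InnerText);
        }

        [STAThread]
        static void Main()
        {
            Application.Run(new ScraperForm());
        }
    }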
Related
I am currently using HtmlAgilityPack for some web scraping; however, I've encountered a website that uses script tags, and I am unable to load it for scraping. I have little experience with web development and am unsure how to properly load the webpage and convert it back to something HtmlAgilityPack can parse.
Pretty much, when I inspect the element in Chrome there is a table, but HtmlAgilityPack reads a script tag.
Any help would be appreciated.
Thank you
I have had similar problems too. It is very annoying that there is not one unified method that works on all websites from a C# console application.
However, depending on the site you are looking at, there may be some useful information in meta tags in the head section of the HTML. When I was making an application to get a YouTube subscription count, I found it had the count in a meta tag (I assume this information is there for the scripts to use). It may be similar for the web page you are scraping.
To do this I first added a call to save the document:

    document.Save(outputPath);  // outputPath: wherever the html file needs to go
then I opened the saved HTML document in Google Chrome, opened up the dev tools and did a search for "Subscriptions" (replace this with whatever you are looking for). Hopefully, depending on the website you are scraping, there will be a meta tag with some useful info in it for you.
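Once you know which meta tag holds the value, HtmlAgilityPack can pull it out directly. A minimal sketch, assuming a hypothetical URL and attribute name (inspect the saved HTML to see what the site actually uses):

    using System;
    using HtmlAgilityPack;

    class MetaTagScrape
    {
        static void Main()
        {
            var web = new HtmlWeb();
            var doc = web.Load("https://www.youtube.com/user/SomeChannel");  // hypothetical page

            // The itemprop value below is an assumption; search the <head> of the
            // saved HTML to find the attribute the site really uses.
            var meta = doc.DocumentNode.SelectSingleNode("//head/meta[@itemprop='interactionCount']");
            if (meta != null)
                Console.WriteLine(meta.GetAttributeValue("content", ""));
        }
    }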
Good Luck! :)
Is there a way in C# to get the output of AJAX or JavaScript? What I'm trying to do is grab the specifics of items on a webpage; however, the webpage does not load them into the original source. Does anybody have a good tutorial or a good place to start?
For example, I would want to get all the car listings from http://www.madisonhonda.com/Preowned-Inventory.aspx#layout=layout1
If the DOM is being modified by JavaScript through AJAX calls, and this modified data is what you are trying to capture, then using a standard .NET WebClient won't work. You need to use a WebBrowser control so that it will actually execute the script; otherwise you will just be downloading the source.
If you need to just "load" it, then you'll need to understand how the page functions and try making the AJAX call yourself. Firebug and other similar tools allow you to see what requests are made by the browser.
There is no reason you cannot make the same web request from C# that the original page is making from Javascript. Depending on the architecture of the website, this could range in difficulty from constructing the proper URL with query string arguments (easy) to simulating a post with lots of page state (hard). The response content would most likely then be XML or JSON content instead of the HTML DOM, which if you're scraping for data will be a plus.
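A minimal sketch of calling such an endpoint directly from C# (the URL, query string, and response format are assumptions; use Firebug or your browser's network tab to find the real request the page makes):

    using System;
    using System.Net;

    class AjaxEndpointCall
    {
        static void Main()
        {
            using (var client = new WebClient())
            {
                // Reproduce the request that the page's JavaScript makes (hypothetical endpoint).
                string json = client.DownloadString(
                    "http://www.example.com/inventory/search?type=preowned&page=1");

                // The response is usually JSON or XML rather than an HTML DOM,
                // which tends to be much easier to pick data out of.
                Console.WriteLine(json);
            }
        }
    }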
A long time ago I wrote a VB app to screen-scrape financial sites, and made it so that you could fire up multiple of these "harvester" screen scrapers at once. That helped cut down the time spent loading data. We could do thousands of scrapes a day with several of these running on multiple boxes. Each harvester got its marching orders from information stored in the database, such as which customer to get next and what needed to be scraped (balances, transaction history, etc.).
Like Michael said above, make a simple WinForms app with a WebBrowser control in it. You have to trap the DocumentCompleted event, which should only fire when the web page is completely loaded. Then check out this post, which gives an overview of how to do it.
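A minimal sketch of trapping that event (the control name is an assumption; note that the event also fires for each frame, so it is worth checking that the completed URL is the top-level page):

    // Inside the form's constructor or Load handler:
    webBrowser1.DocumentCompleted += (sender, e) =>
    {
        // Ignore completion events raised by frames/iframes.
        if (e.Url != webBrowser1.Url) return;

        string renderedHtml = webBrowser1.Document.Body.OuterHtml;
        // Hand renderedHtml to HtmlAgilityPack, a regex, etc.
    };
    webBrowser1.Navigate("http://www.madisonhonda.com/Preowned-Inventory.aspx#layout=layout1");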
Use the Html Agility Pack. It allows you to download the HTML and scrape it via XPath.
See How to use HTML Agility pack
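A minimal example of that combination, assuming a placeholder URL (HtmlWeb downloads the page and XPath selects nodes from it):

    using System;
    using HtmlAgilityPack;

    class HapXPathExample
    {
        static void Main()
        {
            var web = new HtmlWeb();
            HtmlDocument doc = web.Load("http://www.example.com/");  // placeholder URL

            // Select every link on the page and print its text and target.
            var links = doc.DocumentNode.SelectNodes("//a[@href]");
            if (links != null)
            {
                foreach (HtmlNode link in links)
                    Console.WriteLine("{0} -> {1}", link.InnerText.Trim(), link.GetAttributeValue("href", ""));
            }
        }
    }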
First of all, I hope my question doesn't bother you. I really need to get an idea of how I can accomplish this, but unfortunately I'm really a beginner; I'm crawling when it comes to programming. I'm struggling to learn it the best way I can, and I'll be thankful for any help you give me.
Here's the task: I was ordered to find a way to collect some data from a website using a C# application. This will be done every day, in order to update the data which we'll use to calculate a financial index.
I know my question might sound vague; even telling me how I can be more precise will help me. I know I seem desperate, but putting aside all the personal issues, my scholarship kind of depends on it.
Thanks in advance! (Please don't mind the bad English; I'm Brazilian and my English might not be that good yet.)
First, your English is fine. In fact, I thought you were a native speaker until you said otherwise.
The term you're looking for is 'site scraping'. See this question: Options for HTML scraping?. The second answer points to the Html Agility Pack library, which you can use.
Now, there are two possibilities here. The first is you have to parse the HTML and scrape your data out of it. This is more computationally intensive and depends on the layout of the page. If they change the way the site looks, it could break the scraper.
The second possibility is they provide some XML or JSON web service you can consume. In this case you aren't scraping anything, but are rather using a true data feed. If the layout of the site changes, you will not break. Whether your target site supports this form of data feed is up to the site.
If I understand your question, you're being asked to do some Web Scraping, where you 1) download the contents of a web page and 2) try to parse data from that content.
For step #1, you should look into using a WebClient object in C# to download the HTML from the web page. You can give a WebClient object the URL you want to download the content from and obtain a String containing the content (probably HTML) of the URL.
How you go about doing step #2 depends on what content is present at the web site. If you know of certain patterns you're looking for in the HTML, you can search the HTML string using various methods. A more general solution for parsing HTML data is the Html Agility Pack, which lets you handle the HTML as a tree structure (DOM).
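Putting the two steps together might look roughly like this (the URL and XPath are placeholders for whatever the site actually contains):

    using System;
    using System.Net;
    using HtmlAgilityPack;

    class TwoStepScrape
    {
        static void Main()
        {
            // Step 1: download the page content as a string.
            string html;
            using (var client = new WebClient())
                html = client.DownloadString("http://www.example.com/data");  // placeholder URL

            // Step 2: parse the HTML into a DOM and pull out the values you need.
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            var node = doc.DocumentNode.SelectSingleNode("//span[@class='price']");  // placeholder XPath
            if (node != null)
                Console.WriteLine(node.InnerText);
        }
    }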
Use the WebClient class to get the page.
Turn the HTML into XML.
Use XPath to select the data you are interested in.
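A rough sketch of those three steps, using HtmlAgilityPack's XML output option to turn the (often messy) HTML into well-formed XML that standard XPath can query; the URL and XPath are placeholders:

    using System;
    using System.IO;
    using System.Net;
    using System.Xml.Linq;
    using System.Xml.XPath;
    using HtmlAgilityPack;

    class HtmlToXmlExample
    {
        static void Main()
        {
            // 1. Get the page.
            string html;
            using (var client = new WebClient())
                html = client.DownloadString("http://www.example.com/");  // placeholder URL

            // 2. Turn the HTML into XML (HtmlAgilityPack fixes up the markup for us).
            var doc = new HtmlDocument();
            doc.OptionOutputAsXml = true;
            doc.LoadHtml(html);
            var writer = new StringWriter();
            doc.Save(writer);
            XDocument xml = XDocument.Parse(writer.ToString());

            // 3. Use XPath to select the data you are interested in.
            foreach (XElement title in xml.XPathSelectElements("//title"))
                Console.WriteLine(title.Value);
        }
    }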
OK, this is a pretty straightforward app design, and a lot of the code you need already exists and can be reused. Since you're a beginner, I'll break down what you need to do into steps and recommend approaches.
1) You will use classes from System.Net to pull the web pages (WebClient being the easiest to use). You will want to have this part of the program run on a timer if you can (using the scheduled jobs feature of the OS) and have it just pull the pages and drop them in a folder.
2) You have a second job which runs separately, pulling unread files from that folder, parsing them (the HtmlAgilityPack library is best for this) and then storing them in an index of some kind (Lucene is best for that).
3) You have a front-end application of some sort (web or desktop) which queries that index for the information you're looking for.
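A rough sketch of steps 1 and 2 under those assumptions (the folder path, URL and XPath are placeholders, and the Lucene indexing is only hinted at):

    using System;
    using System.IO;
    using System.Net;
    using HtmlAgilityPack;

    class DailyScrapeJob
    {
        const string DropFolder = @"C:\scrape\incoming";  // assumption: where step 1 drops pages

        // Step 1: run this on a schedule (e.g. Windows Task Scheduler) to pull pages into a folder.
        static void DownloadPages()
        {
            Directory.CreateDirectory(DropFolder);
            using (var client = new WebClient())
            {
                string html = client.DownloadString("http://www.example.com/prices");  // placeholder URL
                string file = Path.Combine(DropFolder, DateTime.Now.ToString("yyyyMMdd-HHmmss") + ".html");
                File.WriteAllText(file, html);
            }
        }

        // Step 2: a separate job parses each saved file and hands the values to your index/database.
        static void ParseSavedPages()
        {
            foreach (string file in Directory.GetFiles(DropFolder, "*.html"))
            {
                var doc = new HtmlDocument();
                doc.Load(file);

                var node = doc.DocumentNode.SelectSingleNode("//span[@id='indexValue']");  // placeholder XPath
                if (node != null)
                    Console.WriteLine("{0}: {1}", Path.GetFileName(file), node.InnerText);
                // ...store the value in Lucene or a database here.
            }
        }

        static void Main()
        {
            DownloadPages();
            ParseSavedPages();
        }
    }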
So here's the deal. I'm creating a spider bot for a website that scans all the product pages and records the product data. I'm using C# and the WebClient class to download the HTML string. The site I'm crawling must be specially made, because the HTML received from WebClient.DownloadString() is different from the HTML I get when I view the page source in a browser. This seems intentional, because the only info I can't get is the price.
Does anyone know a workaround for this problem or can anyone explain what is happening? Thanks.
It is probably using the user agent string to decide what content to send. The example here shows how to set the user agent header.
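With WebClient, for instance, the header can be set before downloading; the user agent string and URL below are just example placeholders:

    using System;
    using System.Net;

    class UserAgentExample
    {
        static void Main()
        {
            using (var client = new WebClient())
            {
                // Pretend to be a regular browser so the server sends the full page.
                client.Headers[HttpRequestHeader.UserAgent] =
                    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0 Safari/537.36";

                string html = client.DownloadString("http://www.example.com/product/123");  // placeholder URL
                Console.WriteLine(html.Length);
            }
        }
    }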
I'm hoping people have some ideas to help solve this problem.
I am developing a C# ASP.NET website and the client requires an online form that users will fill in and submit. OK, so far so good.....
Imagine, say, a form that you fill in on paper - they normally have a distinctive look specific to the company and will be filed, quite possibly as a legally binding document.
I need to have an online form that when submitted emails the client with something they can print out and will look exactly like their printed forms.
As this is web based, I think the option of capturing a screenshot is out of the question, so I'm wondering how best to approach this?
Even if I just had a form that captures the data presented the way I want, how could I translate this data into the view they want?
Any ideas and suggestions greatly appreciated.
You'll need to take the raw data that was submitted and import it into a standard document (likely PDF). You can use Crystal or another reporting solution, or direct to PDF using one of the many PDF .NET solutions that are out there.
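As one possible sketch of the direct-to-PDF route, here is what it might look like with iTextSharp (one of the many .NET PDF libraries; the field names and layout are placeholders for the client's real form):

    using System.IO;
    using iTextSharp.text;
    using iTextSharp.text.pdf;

    class FormPdfExample
    {
        // Turn submitted form values into a simple, print-ready PDF document.
        static void CreatePdf(string customerName, string requestDetails, string outputPath)
        {
            var pdf = new Document(PageSize.A4);
            PdfWriter.GetInstance(pdf, new FileStream(outputPath, FileMode.Create));
            pdf.Open();

            pdf.Add(new Paragraph("Company Request Form"));  // replace with the company's letterhead/layout
            pdf.Add(new Paragraph("Customer: " + customerName));
            pdf.Add(new Paragraph("Details: " + requestDetails));

            pdf.Close();
            // Attach the file at outputPath to the notification email sent to the client.
        }
    }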
I don't think you'd even want to deal with making the document physically match the screen - much easier to make the web look like the web, and make the printed doc look like a printed doc.
Print a page (this one) from a browser and notice all the headers and footers.
If you want serious control over how it is going to look, you will need to generate a PDF (or maybe XPS).
Couldn't you just use a separate page with CSS that gives the desired look & feel?