I’m working on Web Scraping using C# HttpWebRequest/HttpWebResponse. For the most part this process has gone smoothly. But after POSTing my way through several pages, I have gotten stuck with what seems to be an inconsistency between testing with the Web Browser and the HttpWebRequest/HttpWebResponse calls.
The problem occurs when I land on a page containing an input element that has a name similar to this: “RidiculouslyLongInputName.RidiculouslyLongInputName.RidiculouslyLongInputName.#RidiculouslyLong”
POSTing a value for this input element causes a 500 error when using HttpWebRequest, but works fine when POSTing through the browser. If I remove this input value from the POST data, HttpWebRequest no longer gets the 500 error, but then I'm stuck with a data validation issue from the website.
Any idea on why HttpWebRequest is failing?
It's times like these when packet sniffers come in extremely useful for seeing exactly what kind of data is flowing through and what the difference is.
Wireshark (http://www.wireshark.org/) is a great tool for things like this.
Filter down to only the domains you're interested in, then send off the packet with HttpWebRequest. Save the packet data somewhere. Repeat but do the request through the browser. Check the difference.
If it is indeed an issue with POST variables, it should be evident in the HTTP payload.
Not sure why you are running into the problem, but I would recommend grabbing a copy of Fiddler and taking a look at what the browser is sending in the POST request. It is possible there is something less than obvious going on.
You can also use the Firebug extension with Firefox. With this extension installed and enabled, go through the entire scenario in Firefox. Firebug will show you the exact request/response sent by the browser. You can then duplicate that as closely as possible using HttpWebRequest.
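For illustration, here is a minimal sketch of replaying a captured POST with HttpWebRequest. The URL, field name, and value below are placeholders; one common browser-versus-code mismatch is forgetting to URL-encode form field names, which matters when a name contains a reserved character like "#" (browsers encode it as "%23" automatically):

using System;
using System.IO;
using System.Net;
using System.Text;
using System.Web; // for HttpUtility.UrlEncode (requires a reference to System.Web)

class PostReplayExample
{
    static void Main()
    {
        // Placeholder field name/value; note the "#", which must be
        // percent-encoded ("%23") in a form-urlencoded body. Browsers
        // do this automatically; hand-built POST data often misses it.
        string postData = HttpUtility.UrlEncode("Some.Input.#Name") + "=" +
                          HttpUtility.UrlEncode("some value");
        byte[] body = Encoding.UTF8.GetBytes(postData);

        // URL and headers are placeholders; copy the real ones from the
        // captured browser request.
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://example.com/form");
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.ContentLength = body.Length;

        using (Stream requestStream = request.GetRequestStream())
            requestStream.Write(body, 0, body.Length);

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            Console.WriteLine(reader.ReadToEnd());
    }
}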
I think the best tool for your case is Fiddler, but my guess is there is some JavaScript attached to that button, or something similar, that you are failing to mimic. WebRequest cannot run that for you, while the WebBrowser control can, since it works on the DOM.
To use WebRequest correctly, you really need to reverse engineer every request with something like Fiddler. It's very hard to tell what's actually going on just by looking at the page's source (and its referenced JavaScript/CSS files).
What's the best way to scrape a web page that has AJAX/dynamic loading of data?
For example: scraping a webpage that presents 20 images on load, but when a user scrolls down the page it loads more images (sort of like Facebook). In such a case, how do you scrape all the images, not just the first 20?
This is something that not even the major search engines have mastered yet. It's called "event-driven crawling".
Google even has a guide on what to do to help them crawl your AJAX sites better.
Best thing would be to read some open source crawlers and see what they do. But your chances of crawling even 80% are slim at best, unless you have a specific target in mind.
There are also some interesting reads at Crawljax.
Basically, you should look for scripts and check whether they make any AJAX calls, then determine what kind of parameters they take and make repeated calls with incremented/decremented parameter values. This only works if the parameters have a logical pattern, such as being numbers, single letters, etc. It also depends on whether you're targeting a known site or just sending your crawler into the wild. If you know your target, you can inspect its DOM and customize your code for greater accuracy, as mentioned by wolf.
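For illustration, a minimal sketch of that incremented-parameter approach; the endpoint URL, the "offset" parameter, the page size, and the empty-response check are all assumptions about a hypothetical site:

using System;
using System.Net;

class AjaxPagingExample
{
    static void Main()
    {
        // Hypothetical endpoint discovered by watching the page's XHR traffic.
        const string endpoint = "http://example.com/api/images?offset={0}";

        using (WebClient client = new WebClient())
        {
            // Many "infinite scroll" endpoints expect this header.
            client.Headers["X-Requested-With"] = "XMLHttpRequest";

            for (int offset = 0; ; offset += 20)
            {
                string json = client.DownloadString(string.Format(endpoint, offset));

                // Stop when the server runs out of items; what an "empty"
                // response looks like depends on the site.
                if (string.IsNullOrEmpty(json) || json == "[]")
                    break;

                Console.WriteLine("Got a batch at offset {0}", offset);
                // ... parse the JSON and collect the image URLs here ...
            }
        }
    }
}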
Good luck
Use a tool such as Fiddler or Wireshark to inspect the web request that is made when loading more items.
Then replicate the request in your code.
Update (thanks to pguardiario for his comment):
Note that Wireshark is a low-level network capture tool that offers a great deal of detail about the traffic (packets being exchanged, DNS lookups, and so on), and may be painful to use in this scenario, where you only wish to see the HTTP requests.
So you're better off using Fiddler, or a similar tool built into a browser (e.g. Chrome's Network panel).
Crawljax is open source and can dynamically crawl Ajax-based content.
We are downloading a full web page using the System.Net.WebClient class, but we only want less than half of the page. Is there a way to download just a portion of the page, say a third or a half, using the .NET library, so that we can save network bandwidth and space? If so, please share your ideas, thanks.
You need to send a Range header with your GET or POST request (Accept-Ranges is the corresponding response header a server uses to advertise that it supports this). That can be done by using the AddRange method of your HttpWebRequest:
HttpWebRequest myHttpWebRequest =
    (HttpWebRequest)WebRequest.Create("http://www.foo.com");
myHttpWebRequest.AddRange(0, 99); // bytes 0-99, i.e. the first 100 bytes
That would yield the first 100 bytes. The server, however, needs to support range requests; if it doesn't, it will simply return the whole page with a 200 status.
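A slightly fuller sketch that also checks whether the server actually honored the range (it replies 206 Partial Content when it does; the URL is a placeholder):

using System;
using System.IO;
using System.Net;

class RangeRequestExample
{
    static void Main()
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.foo.com");
        request.AddRange(0, 99); // ask for bytes 0-99, i.e. the first 100 bytes

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            // 206 Partial Content means the server honored the range;
            // a plain 200 OK means it ignored it and sent the whole page.
            if (response.StatusCode == HttpStatusCode.PartialContent)
            {
                using (StreamReader reader = new StreamReader(response.GetResponseStream()))
                    Console.WriteLine(reader.ReadToEnd());
            }
            else
            {
                Console.WriteLine("Server ignored the range and sent the full page.");
            }
        }
    }
}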
The short answer is no, not unless the web app supports some way to tailor its response to what you want it to return. This could take the form of a query string parameter or a header field value.
The simplest to add would be a query string parameter: when it is detected, write out only the necessary HTML to the response object. If you are unable to make changes to the web app, then you won't be able to control how much of a page is returned to you.
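If you do control the app, here is a minimal sketch of the query string approach; the "fragment" parameter and its "summary" value are hypothetical names:

using System;

public partial class SomePage : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // "fragment" and "summary" are hypothetical; pick whatever
        // convention suits your app.
        if (Request.QueryString["fragment"] == "summary")
        {
            Response.Write("<div><!-- only the part of the page the caller needs --></div>");
            Response.End(); // skip rendering the rest of the page
        }
    }
}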
You might want to read up on how HTTP works, since the question and its answer rely upon this. Specifically, the header definitions should be helpful.
I was just curious if anyone has heard of any sort of API for guitar tabs? The thought passed through my mind that it would be really neat if I could grab guitar tabs from the internet to bring into my C# app, but haven't been able to find anything.
Thanks,
Chris
You can use System.Net.WebRequest to read the repository of your choice.
You can also use System.ServiceModel.Syndication.SyndicationFeed to read RSS feeds.
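For example, a minimal sketch of reading a feed with SyndicationFeed; the feed URL is a placeholder, and you'll need a reference to the System.ServiceModel.Web assembly on .NET 3.5 (System.ServiceModel on 4.0):

using System;
using System.ServiceModel.Syndication;
using System.Xml;

class FeedExample
{
    static void Main()
    {
        // The feed URL is a placeholder.
        using (XmlReader reader = XmlReader.Create("http://example.com/tabs.rss"))
        {
            SyndicationFeed feed = SyndicationFeed.Load(reader);
            foreach (SyndicationItem item in feed.Items)
            {
                // Assumes each item carries at least one link.
                Console.WriteLine("{0} -> {1}", item.Title.Text, item.Links[0].Uri);
            }
        }
    }
}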
EDIT: Try scraping http://www.mxtabs.net/guitar_tabs/ using WebRequest.
You can write code that sends requests to the pages of that (or any other) website and parses the responses to extract information.
However, you might want to get their permission first. They might even offer you an API.
You should approach it from another angle: why isn't there a good data format for transferring guitar tab data? ASCII art is fine, but it is easily damaged, and it doesn't convey timing information well.
If you could come up with a format that could reach critical mass, that would be a good thing.
To scrape a web page, first figure out exactly which data you want to extract.
Visit the relevant pages with Fiddler running, and look at the HTTP requests and responses that you get.
You can then write C# code that requests the relevant page and reads through the response, line by line, looking for lines that you're interested in.
If the web page is XHTML compliant, you can also parse it using XDocument, but most web pages aren't.
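As a minimal sketch of the line-by-line approach; the URL and the "&lt;pre&gt;" marker are assumptions, so inspect the real page (e.g. in Fiddler) to see what actually surrounds the data you want:

using System;
using System.IO;
using System.Net;

class LineScrapeExample
{
    static void Main()
    {
        // The URL is a placeholder.
        WebRequest request = WebRequest.Create("http://example.com/tab/12345");

        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.Contains("<pre>"))
                    Console.WriteLine(line); // a line we're interested in
            }
        }
    }
}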
http://www.911tabs.com/ seems to have a large selection, and they appear to all be a variant of ASCII-art, so it would be relatively easy to write an HTML- or text-scraping routine. They don't appear to use a very standardized format, however, so this might be more work than I think.
I'm not sure how to modify the CustomRules.js file to only show requests for a certain domain.
Does anyone know how to accomplish this?
This is easy to do.
On the Filters tab, check "Show only if URL contains", then key in your domain.
edit
Turns out it is quite easy; edit OnBeforeRequest to add:
if (!oSession.HostnameIs("www.google.com")) {oSession["ui-hide"] = "yup";}
This filters to www.google.com, for example.
(original answer)
I honestly don't know if this is something that Fiddler has built in (I've never tried), but it is certainly something that Wireshark will do pretty easily - of course, you get different data (in particular for SSL) - so YMMV.
My answer is somewhat similar to @Marc Gravell's; however, I prefer to filter by URLs containing a specific string.
You will need FiddlerScript, which is an add-on to Fiddler. Once it is installed, go to the FiddlerScript tab and paste the following into the OnBeforeRequest function:
if (oSession.url.Contains("ruby:8080") || oSession.url.Contains("localhost:53929")) { oSession["ui-hide"] = "yup"; }
This way you can filter by any part of the URL, be it the port, the hostname, or whatever.
Hope this saves you some time.
You can filter the requests using the Filters tab in Fiddler. If you are using Google Chrome, be sure to use the correct process ID in Fiddler (Chrome spawns multiple processes).
The Fiddler site has a cookbook of a whole bunch of things that you can do with CustomRules.js, including how to do exactly this :)
I've got the following piece of code in an aspx webpage:
Response.Redirect("/Someurl/");
I also want to send a different referrer with the redirect something like:
Response.Redirect("/Someurl/", "/previousurl/?message=hello");
Is this possible in Asp.net or is the referrer handled solely by the browser?
Cheers
Stephen
The referrer is read-only, and meant to be that way. I do not know why you need it, but you can send query string variables instead. Instead of
Response.Redirect("/Someurl/");
you can call
Response.Redirect("/Someurl/?message=hello");
and get what you need there, if that helps.
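On the target page you could then read the value back; a minimal sketch (lblMessage is a hypothetical Label control on that page):

// On /Someurl/:
protected void Page_Load(object sender, EventArgs e)
{
    string message = Request.QueryString["message"]; // "hello", or null if absent
    if (!String.IsNullOrEmpty(message))
        lblMessage.Text = Server.HtmlEncode(message);
}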
Response.Redirect sends a response code (HTTP 302) to the browser, which in turn issues a new request (at least this is the expected behavior). Another possibility is to use Server.Transfer (see here), which doesn't go back to the browser at all.
Anyway, neither of these solves your request. Perhaps giving some more detail on your case can help find another solution. ;-)
The referrer comes solely from the client browser (which may be lying to you, too).