I am using AngleSharp in C# to simulate a web browser. For debugging purposes I sometimes want to see the page I am traversing. Is there an easy way to show the current document in a web browser (preferably the system's default browser), ideally with the current cookie state?
I am very late to the party, but hopefully someone will find my answer useful: the short answer is no; the long answer is yes, with some work it is possible in a limited way.
How to make it possible? By injecting some code into AngleSharp that opens a (local) webserver. The content from this webserver could then be inspected in any webbrowser (e.g., the system's default browser).
The injected local webserver would serve the current document at its root (e.g., http://localhost:9000/), along with all auxiliary information in HTTP headers (e.g., cookie states). The problem with this approach is that we transport either the document's original source or a serialization of the DOM as seen by AngleSharp, so there could be some deviations and it may not be what you want. Alternatively, the server could emit JS code that replicates what AngleSharp currently sees (however, at that point standard debugging seems more viable).
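As a rough, hedged sketch of the webserver idea, assuming you already hold an AngleSharp IDocument: serialize the DOM, serve it once over HttpListener, and open the system's default browser. The port and the one-shot behaviour are arbitrary choices, and cookie state would still have to be copied into headers separately.

    using System;
    using System.Diagnostics;
    using System.Net;
    using System.Text;
    using AngleSharp.Dom;

    static class DocumentDebugger
    {
        // Serves a one-shot serialization of the DOM on http://localhost:9000/
        // and opens it in the default browser. The serialized DOM may differ
        // from the original source, as noted above.
        public static void ShowInBrowser(IDocument document)
        {
            string html = document.DocumentElement.OuterHtml;

            var listener = new HttpListener();
            listener.Prefixes.Add("http://localhost:9000/");
            listener.Start();

            // UseShellExecute is required on .NET Core/5+ to launch the default browser.
            Process.Start(new ProcessStartInfo("http://localhost:9000/") { UseShellExecute = true });

            // Serve a single request, then shut down.
            HttpListenerContext context = listener.GetContext();
            byte[] buffer = Encoding.UTF8.GetBytes(html);
            context.Response.ContentType = "text/html; charset=utf-8";
            context.Response.OutputStream.Write(buffer, 0, buffer.Length);
            context.Response.OutputStream.Close();
            listener.Stop();
        }
    }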
Any approach, however, requires some (tedious?) work and therefore needs to be justified. Since you want to "see" the page, I guess a CSS renderer would be more interesting (it could also be embedded in any application or made available in the form of a VS extension).
Hope this helps!
Related
I'm trying to scrape a particular webpage which works as follows.
First the page loads, then it runs some sort of javascript to fetch the data it needs to populate the page. I'm interested in that data.
If I GET the page with HtmlAgilityPack, the script doesn't run, so I get what is essentially a mostly-blank page.
Is there a way to force it to run a script, so I can get the data?
You are getting what the server is returning - the same as a web browser. A web browser, of course, then runs the scripts. Html Agility Pack is an HTML parser only - it has no way to interpret the javascript or bind it to its internal representation of the document. If you wanted to run the script you would need a web browser. The perfect answer to your problem would be a complete "headless" web browser. That is something that incorporates an HTML parser, a javascript interpreter, and a model that simulates the browser DOM, all working together. Basically, that's a web browser, except without the rendering part of it. At this time there isn't such a thing that works entirely within the .NET environment.
Your best bet is to use a WebBrowser control and actually load and run the page in Internet Explorer under programmatic control. This won't be fast or pretty, but it will do what you need to do.
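For example, a minimal sketch of that approach: a hidden WebBrowser control on an STA thread with a message loop. The URL is a placeholder, and AJAX-heavy pages may need an extra wait or a poll for a specific element before the data appears.

    using System;
    using System.Windows.Forms;

    class RenderedPageDumper
    {
        [STAThread]
        static void Main()
        {
            var browser = new WebBrowser { ScriptErrorsSuppressed = true };
            browser.DocumentCompleted += (s, e) =>
            {
                // By now the page's scripts have had a chance to run.
                Console.WriteLine(browser.Document?.Body?.InnerHtml);
                Application.ExitThread();
            };
            browser.Navigate("https://example.com/page-with-script"); // placeholder URL
            Application.Run(); // message loop so the control can do its work
        }
    }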
Also see my answer to a similar question: Load a DOM and Execute javascript, server side, with .Net which discusses the available technology in .NET to do this. Most of the pieces exist right now but just aren't quite there yet or haven't been integrated in the right way, unfortunately.
You can use Awesomium for this, http://www.awesomium.com/. It works fairly well, but it has no x64 support and is not thread-safe. I'm using it to scan some websites 24x7; it runs fine for a couple of days at a time, but then it usually crashes.
I've implemented an async pluggable protocol in a .NET 2.0 application using C#, which loads HTML files stored on the local machine into a MemoryStream.
When I load the HTML files normally in the WebBrowser control using their local file paths, XMLHttpRequest works fine, but when I load the files through the protocol, any attempt to use XMLHttpRequest returns an access-denied error.
I presume this behavior is because the WebBrowser control no longer knows that the HTML files are stored on the local machine and loads them in an untrusted Internet zone.
Even though I'm returning S_OK for URLACTION_CROSS_DOMAIN_DATA inside IInternetSecurityManager's ProcessUrlAction (which I checked with a breakpoint to make sure it was fired), the return value my IInternetSecurityManager gives for this action is being ignored.
I've tried setting pdwZone to tagURLZONE.URLZONE_LOCAL_MACHINE in IInternetSecurityManager's MapUrlToZone for my protocol URLs, and I played around a little with GetSecurityId, although I'm not sure exactly what I'm doing with it, and I broke other things, like allowing scripts to load. Nothing seems to work to allow the cross-domain XMLHttpRequest.
Does anyone have any idea how I can get this to work?
Not really an answer, but it may help to isolate the problem. I'd first implement this APP handler in C++ and test it with some robust unmanaged WebBrowser ActiveX host sample, like Lucian Wischik's Webform:
http://www.wischik.com/lu/programmer/webform.html
If I could get it working reliably with the unmanaged host, I'd proceed with C# implementation.
I'd also try setting FEATURE_BROWSER_EMULATION to 8000 or less to impose emulation of legacy IE behavior, just to check whether it works that way.
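For reference, the per-user emulation flag can be set from managed code before the WebBrowser control is created; a small sketch (8000 requests IE8 standards-mode emulation):

    using System.Diagnostics;
    using Microsoft.Win32;

    static class BrowserEmulation
    {
        // Writes the FEATURE_BROWSER_EMULATION value for the current process
        // under HKCU (no admin rights needed).
        public static void EnableLegacyEmulation()
        {
            string exeName = Process.GetCurrentProcess().ProcessName + ".exe";
            using (RegistryKey key = Registry.CurrentUser.CreateSubKey(
                @"Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION"))
            {
                key.SetValue(exeName, 8000, RegistryValueKind.DWord);
            }
        }
    }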
That said, I wouldn't hold my hopes high. I've done my share of WebBrowser/MSHTML integration in the past, and I have a feeling that APP support hasn't been regression-tested since IE9, in favor of new IE features aimed at embracing open web standards.
Update: MSDN vaguely mentions this:
Upon successful completion, pbSecurityId contains the scheme, domain, and zone information, as well as whether the specified pwszUrl was derived from a Mark of the Web.
Here's the format that worked for me long ago (perhaps way before "Mark of the Web" was introduced):
static const char security[] = "https:www.mysite.com\2\0\0"; // C++ puts the termination \0 for us
I believe 2 here stands for the "Trusted Sites" zone. Other zones can be found here:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Internet Settings\Lockdown_Zones
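If it helps, here is a hedged C# sketch of a GetSecurityId body that reproduces the byte layout of that C++ literal. The interop signature shown is an assumption; match it to the IInternetSecurityManager declaration you already use in your project.

    using System;
    using System.Text;

    // Assumed to be a member of the class that implements IInternetSecurityManager.
    public int GetSecurityId(string pwszUrl, byte[] pbSecurityId, ref uint pcbSecurityId, uint dwReserved)
    {
        // "scheme:host" followed by the zone byte (2 = Trusted Sites) and the
        // terminating zeros (the last \0 mirrors the implicit C++ terminator).
        byte[] id = Encoding.ASCII.GetBytes("https:www.mysite.com\u0002\0\0\0");

        if (pbSecurityId == null || pcbSecurityId < id.Length)
            return unchecked((int)0x80070057); // E_INVALIDARG: caller's buffer too small

        Array.Copy(id, pbSecurityId, id.Length);
        pcbSecurityId = (uint)id.Length;
        return 0; // S_OK
    }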
Hope this helps.
Maybe I'm wrong, but have you tried sending the Access-Control-Allow-Origin: * header in your protocol's responses?
What's the best way to scrape a web page that has AJAX/dynamic loading of data?
For example: scraping a webpage that presents 20 images on load, but loads more images when the user scrolls down the page (sort of like Facebook). In such a case, how do you scrape all the images, not just the first 20?
This is something that not even the major search engines have mastered yet. It's called "event-driven crawling".
Google even has a guide on what to do to help them crawl your AJAX sites better.
Best thing would be to read some open source crawlers and see what they do. But your chances of crawling even 80% are slim at best, unless you have a specific target in mind.
There are also some interesting reads at Crawljax.
Basically, you should try looking for scripts and checking whether they make any AJAX calls, then determine what kind of parameters they take and make repeated calls with incremented/decremented parameter values. This only works if the parameters follow a logical pattern, such as numbers, single letters, etc. It also depends on whether you're targeting a known site or just sending your crawler into the wild. If you know your target, you can inspect its DOM and customize your code for greater accuracy, as mentioned by wolf.
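As a minimal illustration of the "repeat the AJAX call with changed parameters" idea (the endpoint and its page parameter are hypothetical; find the real ones by inspecting the site's scripts or network traffic):

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    class AjaxPager
    {
        static async Task Main()
        {
            using var client = new HttpClient();
            for (int page = 1; page <= 10; page++)
            {
                // Hypothetical paginated endpoint discovered in the page's scripts.
                string url = $"https://example.com/api/items?page={page}";
                string json = await client.GetStringAsync(url);
                Console.WriteLine($"Page {page}: {json.Length} bytes");
                // Parse the JSON here (e.g. with System.Text.Json) and stop
                // once a response comes back empty.
            }
        }
    }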
Good luck
Use a tool such as Fiddler or WireShark to inspect the web request that is done when loading more items.
Then replicate the request in your code.
Update (thanks to pguardiario for his comment):
Note that Wireshark is a low-level network capture tool that offers a great deal of detail about the traffic (packets being exchanged, DNS lookups, and so on) and may be painful to use in such a scenario, where you only wish to see the HTTP requests.
So you're better off using Fiddler, or a similar tool built into a browser (e.g., Chrome's Network panel).
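A small sketch of replaying a captured request in C# (the URL and header values are placeholders; copy the real ones from the Fiddler or network-panel capture):

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    class CapturedRequestReplay
    {
        static async Task<string> FetchMoreItemsAsync()
        {
            using var client = new HttpClient();
            // Hypothetical endpoint observed when the page loaded more items.
            var request = new HttpRequestMessage(HttpMethod.Get, "https://example.com/feed/page/2");
            // Many endpoints check these headers before answering an XHR-style request.
            request.Headers.Add("X-Requested-With", "XMLHttpRequest");
            request.Headers.Referrer = new Uri("https://example.com/");
            HttpResponseMessage response = await client.SendAsync(request);
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }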
Crawljax is open source and can dynamically crawl Ajax-based content.
We are using Html Agility Pack to scrape data from HTML-based sites; is there any DLL like Html Agility Pack for scraping Flash-based sites?
It really depends on the site you are trying to scrape. There are two types of sites in this regard:
If the site has the data inside the SWF file, then you'll have to decompile the SWF file and read the data inside. With enough work you can probably do it programmatically. However, if this is the case, it might be easier to just gather the data manually, since it probably isn't going to change much.
In most cases, however, especially with sites that have a lot of data, the Flash file is actually contacting an external API. In that case you can simply ignore the Flash altogether and go to the API directly. If you're not sure, just activate Firebug's Net panel and start browsing. If the site is using an external API, it should become obvious.
Once you find that API, you could probably reverse engineer how to manipulate it to give you whatever data you need.
Also note that if it's a big enough site, there are probably non-flash ways to get to the same data:
It might have a mobile site (with no Flash) - try accessing the site with an iPhone user-agent (see the sketch after this list).
It might have a site for crawlers (like googlebot) - try accessing the site with a googlebot user-agent.
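A quick sketch of the user-agent trick from the list above (the user-agent string is just an illustrative example; swap in a Googlebot string to try the crawler variant):

    using System.Net.Http;
    using System.Threading.Tasks;

    class UserAgentFetcher
    {
        // Fetches the page while pretending to be an iPhone, in the hope of
        // getting the non-Flash version of the site.
        static async Task<string> FetchAsMobileAsync(string url)
        {
            using var client = new HttpClient();
            client.DefaultRequestHeaders.UserAgent.ParseAdd(
                "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 " +
                "(KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25");
            return await client.GetStringAsync(url);
        }
    }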
EDIT:
If you're talking about crawling (getting data from any random site) rather than scraping (getting structured data from a specific site), then there's not much you can do; even Googlebot isn't scraping Flash content. That's mostly because, unlike HTML, Flash doesn't have a standardized syntax from which you can immediately tell what is text, what is a link, etc.
You won't have much luck with the HTML Agility Pack. One method would be to use something like FiddlerCore to proxy HTTP requests to/from a Flash site. You would start the FiddlerCore proxy, then use something like the C# WebBrowser control to go to the URL you want to scrape. As the page loads, all those HTTP requests get proxied, and you can inspect their contents. However, you wouldn't get most of the text, since that's often static within the Flash. Instead, you'd mostly get larger content (videos, audio, and maybe images) that is usually stored separately. This will be slow compared to more traditional scraping/crawling, because you actually have to execute/run the page in a browser.
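A rough sketch of the FiddlerCore side of that setup; exact API names can differ between FiddlerCore versions, so treat this as a starting point rather than a drop-in implementation:

    using System;
    using Fiddler;

    class FlashTrafficSniffer
    {
        static void Main()
        {
            // Each proxied request (API calls, media files, ...) shows up here
            // once its response has completed.
            FiddlerApplication.AfterSessionComplete += session =>
            {
                Console.WriteLine("{0} {1} ({2} bytes)",
                    session.responseCode,
                    session.fullUrl,
                    session.responseBodyBytes?.Length ?? 0);
            };

            FiddlerApplication.Startup(8877, FiddlerCoreStartupFlags.RegisterAsSystemProxy);
            Console.WriteLine("Proxy running; navigate the WebBrowser control to the Flash page now.");
            Console.ReadLine();
            FiddlerApplication.Shutdown();
        }
    }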
If you're familiar with all of those YouTube Downloader type extensions, they work on this same principle, except that they intercept HTTP requests directly from Firefox (for example) rather than through a separate proxy.
I believe that Google and some of the big search engines have a special arrangement with Adobe/Flash and are provided with some software that lets their search engine crawlers see more of the text and things that Google relies on. Same goes for PDF content. I don't know if any of this software is publicly available.
Scraping Flash content would be quite involved, and the reliability of any component that claims to do so is questionable at best. However, if you wish to "crawl" or follow hyperlinks in a Flash animation on some web page, you might have some luck with Infant. Infant is a free Java library for web crawling, and offers limited / best-effort Flash content hyperlink following abilities. Infant is not open source, but is free for personal and commercial use. No registration required!
How about capturing the whole page as an image and running OCR on it to read the data?
I want to find a decent solution to track the URLs and HTML content that users are visiting and provide more information to the user. The solution should have minimal impact on end users.
I don't want to write plugins for different browsers; they're hard to maintain.
I won't accept a proxy-based method, since I don't want to change any of the user's proxy settings.
My application is written in C# and targets Windows. It would be best if the solution could support other operating systems as well.
Based on my research, I found the following methods that look workable, but all of them have drawbacks and I can't determine which one is best.
Use WinPcap
WinPcap sniffs all TCP packets without changing any user settings and only requires installing WinPcap, which is acceptable to me. But I have two questions:
a. How do I convert TCP packets into URLs and HTML?
b. Does it really impact performance? I don't know whether sniffing all TCP traffic is too much overhead for this requirement.
Find history files for different browsers
This way looks like the easiest one, but I wonder if the solution is stable. I am not sure whether the browser writes the history reliably, or when it writes it. My application needs to pop up information before the user leaves the current page, so the solution won't work for me if the browser only writes to the history file when the user closes the browser.
Use FindWindow, accessibility objects, or a COM interface to find the UI element that contains the URL
I find this way incomplete; for example, Chrome will only expose the active tab's URL, not all of them.
Another drawback is that I have to request the URL a second time to get its HTML content.
Any comment or suggestion is welcome.
BTW, I am not writing any spyware. The application tries to find all RSS feeds on a web page and show them to end users. I could easily do that in a browser plugin, but I really want to support multiple browsers with a single UI. Thanks.
Though this is a very old post, I thought I'd give my input.
Approach 1 (WinPcap) is the best one. It will work for any browser, and even for the built-in browser of any other installed application, and it is less resource-consuming too.
There is a library, Pcap.Net, that has an HTTP parser. You can reconstruct the HTTP stream and use its HttpResponseDatagram to parse the body, which your application can then consume.
This link helped give me more insight:
Tcp Session Reconstruction with Winpcap
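For illustration, a minimal sketch of the Pcap.Net capture loop described above. Device selection, HTTPS, and TCP reassembly across packets are left out; see the linked article for full session reconstruction.

    using System;
    using PcapDotNet.Core;
    using PcapDotNet.Packets;
    using PcapDotNet.Packets.Http;

    class HttpSniffer
    {
        static void Main()
        {
            // Pick the right NIC in real code instead of blindly taking the first one.
            LivePacketDevice device = LivePacketDevice.AllLocalMachine[0];

            using (PacketCommunicator communicator =
                   device.Open(65536, PacketDeviceOpenAttributes.Promiscuous, 1000))
            {
                communicator.SetFilter("tcp port 80"); // plain HTTP only
                communicator.ReceivePackets(0, packet =>
                {
                    HttpDatagram http = packet.Ethernet.IpV4.Tcp.Http;
                    if (http == null || http.Header == null)
                        return;

                    if (http.IsRequest)
                        Console.WriteLine("Request:  " + ((HttpRequestDatagram)http).Uri);
                    else
                        Console.WriteLine("Response: " + ((HttpResponseDatagram)http).StatusCode);
                });
            }
        }
    }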