I noticed that when using a 3G dongle to access the web, a JavaScript file is included at the end of every page I visit, but only for certain page types (html, htm, php, asp, aspx). The script adds the ability to download reduced-quality images instead of full-size ones, to save on bandwidth. However, its function is irrelevant to my question.
I need to be able to do the same thing: for any request that comes into my machine, I would like to add a JavaScript include, but not via a BHO or browser extension.
Does anyone know how this is done?
You would probably need to write a web proxy engine to achieve this. You would then either configure it as a transparent proxy (which might require implementing a network driver or filter), or configure all browsers on the machine to direct their traffic through the proxy.
A 3G dongle may achieve this through various means, e.g. a filter inserted somewhere in the driver software, or maybe even some processing right in the hardware (less likely, in my opinion).
I think I am going to use FiddlerCore as a service to do this. It seems like a nice, clean way of doing it.
http://www.fiddler2.com/Fiddler/Core/
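In case it helps anyone else, here is a minimal sketch of the kind of thing I have in mind with FiddlerCore (the injected script URL and the port number are placeholders, and real code would need error handling and certificate setup for HTTPS):

    using System;
    using Fiddler;

    class InjectingProxy
    {
        static void Main()
        {
            // Tamper with every HTML response before it reaches the browser.
            FiddlerApplication.BeforeResponse += delegate(Session session)
            {
                if (session.oResponse.headers.ExistsAndContains("Content-Type", "text/html"))
                {
                    session.utilDecodeResponse(); // remove gzip/chunking so the body can be edited
                    session.utilReplaceInResponse(
                        "</body>",
                        "<script src='http://example.com/inject.js'></script></body>"); // placeholder script URL
                }
            };

            // Register as the system proxy so WinINET-based traffic flows through us.
            FiddlerApplication.Startup(8877,
                FiddlerCoreStartupFlags.Default | FiddlerCoreStartupFlags.RegisterAsSystemProxy);

            Console.WriteLine("Proxy running on port 8877; press Enter to stop.");
            Console.ReadLine();
            FiddlerApplication.Shutdown();
        }
    }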
I've implemented an asynchronous pluggable protocol in a .NET 2.0 application using C#, which loads HTML files stored on the local machine into a MemoryStream.
When I load the HTML files normally in the WebBrowser control using their local file paths, XMLHttpRequest works fine, but when I load the files through the protocol, any attempt to use XMLHttpRequest returns an access denied error.
I presume this behavior is because the WebBrowser control no longer knows that the HTML files are stored on the local machine and is loading them in an untrusted Internet zone.
Even though I'm returning S_OK for URLACTION_CROSS_DOMAIN_DATA inside IInternetSecurityManager's ProcessUrlAction (which I checked with a breakpoint to make sure it was fired), my IInternetSecurityManager's return value for this action is being ignored.
I've tried setting pdwZone to tagURLZONE.URLZONE_LOCAL_MACHINE in IInternetSecurityManager's MapUrlToZone for my protocol URLs, and I've played around a little with GetSecurityId, although I'm not sure exactly what I'm doing there and have broken other things, like allowing scripts to load. Nothing seems to work to allow cross-domain XMLHttpRequest.
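To illustrate, here is a simplified sketch of the kind of ProcessUrlAction override I mean (the interop signature is only one possible declaration; the constants are the urlmon.h values):

    // Fragment of the class implementing IInternetSecurityManager
    // (requires System and System.Runtime.InteropServices).
    private const uint URLACTION_CROSS_DOMAIN_DATA = 0x00000406; // from urlmon.h
    private const uint URLPOLICY_ALLOW = 0x00;
    private const int S_OK = 0;
    private const int INET_E_DEFAULT_ACTION = unchecked((int)0x800C0011);

    public int ProcessUrlAction(string pwszUrl, uint dwAction,
        IntPtr pPolicy, uint cbPolicy, IntPtr pContext, uint cbContext,
        uint dwFlags, uint dwReserved)
    {
        if (dwAction == URLACTION_CROSS_DOMAIN_DATA && pPolicy != IntPtr.Zero)
        {
            Marshal.WriteInt32(pPolicy, (int)URLPOLICY_ALLOW); // allow cross-domain data access
            return S_OK;
        }
        return INET_E_DEFAULT_ACTION; // defer everything else to the default security manager
    }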
Does anyone have any idea how I can get this to work?
Not really an answer, but it may help to isolate the problem. I'd first implement this APP handler in C++ and test it with a robust unmanaged WebBrowser ActiveX host sample, like Lucian Wischik's Webform:
http://www.wischik.com/lu/programmer/webform.html
If I could get it working reliably with the unmanaged host, I'd proceed with C# implementation.
I'd also try setting FEATURE_BROWSER_EMULATION to 8000 or less to force emulation of legacy IE behavior, just to check whether it works that way.
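For reference, the emulation value goes into the FeatureControl registry key, keyed by the host executable's name; something along these lines (the exe name is just a placeholder):

    using Microsoft.Win32;

    static class BrowserEmulation
    {
        // Call once at startup, before any WebBrowser control is created.
        public static void EnableIe8Emulation()
        {
            using (RegistryKey key = Registry.CurrentUser.CreateSubKey(
                @"Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION"))
            {
                // The value name is the host executable's file name; "MyApp.exe" is a placeholder.
                key.SetValue("MyApp.exe", 8000, RegistryValueKind.DWord); // 8000 = IE8 standards mode
            }
        }
    }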
That said, I wouldn't hold my hopes high. I've done my share of WebBrowser/MSHTML integration in the past, and I have a feeling that APP support hasn't been regression-tested since IE9, in favor of newer IE features aimed at embracing open web standards.
Update: MSDN vaguely mentions this:
Upon successful completion, pbSecurityId contains the scheme, domain, and zone information, as well as whether the specified pwszUrl was derived from a Mark of the Web.
Here's the format that worked for me long ago (perhaps well before "Mark of the Web" was introduced):
static const char security[] = "https:www.mysite.com\2\0\0"; // C++ puts the termination \0 for us
I believe 2 here stands for the "Trusted Sites" zone. Other zones can be found here:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Internet Settings\Lockdown_Zones
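In C#, producing that same buffer from GetSecurityId could look roughly like this (a sketch only: the interop signature, the host string, and the error handling are assumptions; the 4-byte zone value 2 again means "Trusted Sites"):

    // Fragment of the class implementing IInternetSecurityManager
    // (requires System and System.Text).
    public int GetSecurityId(string pwszUrl, byte[] pbSecurityId,
        ref uint pcbSecurityId, uint dwReserved)
    {
        byte[] host = Encoding.ASCII.GetBytes("https:www.mysite.com"); // scheme + host, no "//"
        byte[] zone = BitConverter.GetBytes(2u);                       // 4-byte zone, little-endian

        byte[] securityId = new byte[host.Length + zone.Length];
        Buffer.BlockCopy(host, 0, securityId, 0, host.Length);
        Buffer.BlockCopy(zone, 0, securityId, host.Length, zone.Length);

        if (pbSecurityId == null || pcbSecurityId < securityId.Length)
        {
            pcbSecurityId = (uint)securityId.Length;
            return unchecked((int)0x80070057); // E_INVALIDARG: caller's buffer is too small
        }

        Buffer.BlockCopy(securityId, 0, pbSecurityId, 0, securityId.Length);
        pcbSecurityId = (uint)securityId.Length;
        return 0; // S_OK
    }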
Hope this helps.
Maybe I'm wrong, but have you tried sending Access-Control-Allow-Origin: * in your protocol's response headers?
What's the best way to scrape a web page that has AJAX/dynamic loading of data?
For example: scraping a webpage that presents 20 images on load, but loads more images as the user scrolls down the page (sort of like Facebook). In such a case, how do you scrape all the images, not just the first 20?
This is something that not even the major search engines have mastered yet. It's called "event-driven crawling".
Google even has a guide on what to do to help them crawl your AJAX sites better.
The best thing would be to read some open-source crawlers and see what they do, but your chances of crawling even 80% of such content are slim at best, unless you have a specific target in mind.
There are also some interesting reads at Crawljax.
Basically, you should look for the page's scripts and check whether they make any AJAX calls, then determine what kind of parameters those calls take and make repeated calls with incremented/decremented parameter values. This only works if the parameters follow a logical pattern, such as numbers, single letters, etc. It also depends on whether you're targeting a known site or just sending the crawler into the wild. If you know your target, you can inspect its DOM and customize your code for greater accuracy, as mentioned by wolf.
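For instance, if the page's script turns out to fetch images from a paged endpoint, the repeat-call idea could look something like this (the URL and parameter names are made up for illustration):

    using System;
    using System.Net;

    class AjaxPager
    {
        static void Main()
        {
            using (var client = new WebClient())
            {
                // Hypothetical endpoint discovered by inspecting the page's scripts or network traffic.
                // Keep increasing the offset until the server stops returning items.
                for (int offset = 0; ; offset += 20)
                {
                    string url = "http://example.com/api/images?offset=" + offset + "&count=20";
                    string json = client.DownloadString(url);

                    if (string.IsNullOrEmpty(json) || json == "[]")
                        break; // no more items

                    Console.WriteLine("Got page at offset {0}: {1} bytes", offset, json.Length);
                    // Parse the JSON and extract the image URLs here.
                }
            }
        }
    }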
Good luck
Use a tool such as Fiddler or Wireshark to inspect the web request that is made when more items are loaded.
Then replicate the request in your code.
Update (thanks to pguardiario for his comment):
Note that Wireshark is a low-level network capture tool that offers a great deal of detail about the traffic (packets being exchanged, DNS lookups, and so on), and it may be painful to use in this scenario, where you only wish to see the HTTP requests.
So you're better off using Fiddler, or a similar tool built into a browser (e.g. Chrome's Network panel in the developer tools).
Crawljax is open source and can dynamically crawl Ajax-based content.
We are using the Html Agility Pack to scrape data from HTML-based sites; is there any DLL like the Html Agility Pack for scraping Flash-based sites?
It really depends on the site you are trying to scrape. There are two types of sites in this regard:
If the site has the data inside the SWF file, then you'll have to decompile the SWF file and read the data inside. With enough work you can probably do it programmatically. However, if this is the case, it might be easier to just gather the data manually, since it probably isn't going to change much.
In most cases, however, especially with sites that have a lot of data, the Flash file is actually contacting an external API. In that case you can simply ignore the Flash altogether and go to the API directly. If you're not sure, just activate Firebug's Net panel and start browsing; if the site is using an external API, it should become obvious.
Once you find that API, you could probably reverse engineer how to manipulate it to give you whatever data you need.
Also note that if it's a big enough site, there are probably non-Flash ways to get to the same data (see the user-agent sketch after this list):
It might have a mobile site (with no Flash) - try accessing the site with an iPhone user-agent.
It might have a version for crawlers (like Googlebot) - try accessing the site with a Googlebot user-agent.
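A quick way to try those user-agent tricks from C# (the user-agent strings and URL below are only illustrative):

    using System;
    using System.IO;
    using System.Net;

    class UserAgentProbe
    {
        static string Fetch(string url, string userAgent)
        {
            // Request the page while pretending to be a different client.
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.UserAgent = userAgent;
            using (var response = request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                return reader.ReadToEnd();
            }
        }

        static void Main()
        {
            // Example UA strings; substitute current ones for real probing.
            string iphoneUa = "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26";
            string googlebotUa = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";

            Console.WriteLine(Fetch("http://example.com/", iphoneUa).Length);
            Console.WriteLine(Fetch("http://example.com/", googlebotUa).Length);
        }
    }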
EDIT:
If you're talking about crawling (getting data from any random site) rather than scraping (getting structured data from a specific site), then there's not much you can do; even Googlebot isn't scraping Flash content. That's mostly because, unlike HTML, Flash doesn't have a standardized syntax from which you can immediately tell what is text, what is a link, and so on.
You won't have much luck with the Html Agility Pack. One method would be to use something like FiddlerCore to proxy HTTP requests to/from a Flash site. You would start the FiddlerCore proxy, then use something like the C# WebBrowser control to go to the URL you want to scrape. As the page loads, all those HTTP requests will get proxied and you can inspect their contents. However, you wouldn't get most of the text, since that's often static within the Flash. Instead, you'd mostly get the larger content (videos, audio, and maybe images) that is usually stored separately. This will be slower than more traditional scraping/crawling because you'll actually have to execute/run the page in a browser.
If you're familiar with all of those YouTube downloader type of extensions, they work on this same principle, except that they intercept HTTP requests directly from Firefox (for example) rather than through a separate proxy.
I believe that Google and some of the big search engines have a special arrangement with Adobe/Flash and are provided with software that lets their crawlers see more of the text and other content that Google relies on. The same goes for PDF content. I don't know whether any of this software is publicly available.
Scraping Flash content would be quite involved, and the reliability of any component that claims to do so is questionable at best. However, if you wish to "crawl" or follow hyperlinks in a Flash animation on some web page, you might have some luck with Infant. Infant is a free Java library for web crawling and offers limited, best-effort support for following hyperlinks in Flash content. Infant is not open source, but it is free for personal and commercial use. No registration required!
How about capturing the whole page as an image and running OCR on it to read the data?
I would like to know if there is an easy or practical way to determine how long it takes for 1 kB of a website to load.
If you are using Firefox, you can get information on how long various files take to download using the Firebug plugin. It has a bunch of network monitoring features.
Chrome has a console similar to Firebug under Tools -> Developer Tools.
This is great for a quick reality check when you are developing, but sometimes you need a little more. For instance, you might want to set up a monitoring script to ensure response times aren't creeping up. Selenium is great for this and supports both Java and C#. Another option is to write a quick script using a headless browser like Mechanize (Ruby, Perl).
I prefer doing this kind of monitoring from the client end as opposed to on the server side because you get a more realistic perspective of what your end users are experiencing.
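As a rough illustration of the Selenium approach, a client-side timing check in C# could look like this (the URL and threshold are placeholders, and it measures a full page load rather than a single kilobyte):

    using System;
    using System.Diagnostics;
    using OpenQA.Selenium;
    using OpenQA.Selenium.Firefox;

    class LoadTimeCheck
    {
        static void Main()
        {
            // Measure wall-clock time for a full page load from the client side.
            using (IWebDriver driver = new FirefoxDriver())
            {
                var stopwatch = Stopwatch.StartNew();
                driver.Navigate().GoToUrl("http://example.com/"); // placeholder URL
                stopwatch.Stop();

                Console.WriteLine("Page loaded in {0} ms", stopwatch.ElapsedMilliseconds);
                if (stopwatch.ElapsedMilliseconds > 3000) // placeholder threshold
                    Console.WriteLine("Warning: response time is creeping up.");
            }
        }
    }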
I want to find a decent solution to track the URLs and HTML content that users are visiting and provide more information to the user. The solution should have minimal impact on end users.
I don't want to write plugins for different browsers; they're hard to maintain.
A proxy-based approach isn't acceptable, since I don't want to change any of the user's proxy settings.
My application is written in C# and targets Windows. It would be best if the solution could support other operating systems as well.
Based on my research, I found the following methods that look workable, but all of them have drawbacks and I can't determine which one is best.
Use WinPcap
WinPcap sniffs all TCP packets without changing any user settings and only requires installing WinPcap itself, which is acceptable to me. But I have two questions:
a. How do I convert TCP packets into URLs and HTML?
b. Does it really impact performance? I don't know whether sniffing all TCP traffic is too much overhead for this requirement.
Find history files for different browsers
This looks like the easiest way, but I wonder if the solution is reliable. I am not sure whether the browser writes history reliably, or when it writes it. My application will pop up information before the user leaves the current page, so this solution won't work for me if the browser only writes to the history file when the user closes it.
Use FindWindow, an accessibility object, or a COM interface to find the UI element which contains the URL
I find this approach incomplete; for example, Chrome will only show the active tab's URL, not all of them.
Another drawback is that I have to request the URL a second time to get its HTML content.
Any comment or suggestion is welcome.
BTW, I am not writing spyware. The application is trying to find all the RSS feeds on a web page and show them to end users. I could easily do that in a browser plugin, but I really want to support multiple browsers with a single UI. Thanks.
Though this is a very old post, I thought I'd give some input.
Approach 1, WinPcap, is the best one. It will work with any browser, even the built-in browser of any other installed application. The approach is also less resource-consuming.
There is a library, Pcap.Net, that has an HTTP parser. You can reconstruct the HTTP stream and use its HttpResponseDatagram to parse the body, which your application can then consume.
This link gave me more insight:
Tcp Session Reconstruction with Winpcap
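For what it's worth, a bare-bones Pcap.Net capture loop looks roughly like this (it only parses HTTP requests that fit in a single packet; for full pages and bodies you still need the TCP session reconstruction described in the article above):

    using System;
    using PcapDotNet.Core;
    using PcapDotNet.Packets;
    using PcapDotNet.Packets.Http;

    class HttpSniffer
    {
        static void Main()
        {
            // Take the first capture device; a real application should let the user choose.
            LivePacketDevice device = LivePacketDevice.AllLocalMachine[0];

            using (PacketCommunicator communicator =
                device.Open(65536, PacketDeviceOpenAttributes.Promiscuous, 1000))
            {
                communicator.SetFilter("tcp port 80"); // plain HTTP only
                communicator.ReceivePackets(0, PacketHandler);
            }
        }

        static void PacketHandler(Packet packet)
        {
            // The filter guarantees TCP, so the HTTP layer is available (possibly empty).
            HttpDatagram http = packet.Ethernet.IpV4.Tcp.Http;
            if (http != null && http.IsRequest)
            {
                var request = (HttpRequestDatagram)http;
                // Request URI of the visited page; combine with the Host header for the full URL.
                Console.WriteLine(request.Uri);
            }
        }
    }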