I noticed that when using a 3G dongle to access the web, a JavaScript include is appended to the end of every page I visit, but only for certain types (html, htm, php, asp, aspx). The script adds the ability to download reduced-quality images instead of full-size ones to save bandwidth. However, its function is irrelevant to my question.
I need to be able to do the same thing: for any request that passes through my machine, I would like to add a JavaScript include, but without a BHO or browser extension.
Does anyone know how this is done?
You would probably need to write a web proxy engine to achieve this. Then you would either configure it as a transparent proxy (perhaps you would need to implement a network driver or filter for this), or configure all browsers on the machine to direct traffic through the proxy.
A 3G dongle may achieve this through various means, e.g. a filter inserted somewhere in the driver software, or maybe even some processing that occurs right in the hardware (less likely, in my opinion).
I think I am going to use FiddlerCore as a service to do this. It seems like a nice, clean way to do it.
http://www.fiddler2.com/Fiddler/Core/
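Here's a rough sketch of what that might look like with FiddlerCore; the listening port and the injected script URL are placeholders, and the exact event and flag names are from memory, so they may differ slightly between FiddlerCore versions:

```csharp
using System;
using Fiddler;

class ScriptInjectingProxy
{
    static void Main()
    {
        // Append a <script> include to every HTML response that passes through.
        FiddlerApplication.BeforeResponse += delegate(Session session)
        {
            if (session.oResponse.headers.ExistsAndContains("Content-Type", "text/html"))
            {
                session.utilDecodeResponse(); // undo gzip/chunked encoding first
                session.utilReplaceInResponse(
                    "</body>",
                    "<script src=\"http://localhost/inject.js\"></script></body>");
            }
        };

        // Register as the system proxy so every browser's traffic flows through us.
        FiddlerApplication.Startup(
            8877,
            FiddlerCoreStartupFlags.Default | FiddlerCoreStartupFlags.RegisterAsSystemProxy);

        Console.WriteLine("Proxy listening on 8877; press Enter to stop.");
        Console.ReadLine();
        FiddlerApplication.Shutdown();
    }
}
```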
We are using Html Agility Pack to scrape data from HTML-based sites; is there any DLL like Html Agility Pack for scraping Flash-based sites?
It really depends on the site you are trying to scrape. There are two types of sites in this regard:
If the site has the data inside the SWF file, then you'll have to decompile the SWF file and read the data inside. With enough work you can probably do it programmatically. However, if this is the case, it might be easier to just gather the data manually, since it probably isn't going to change much.
In most cases, however, especially with sites that have a lot of data, the Flash file is actually contacting an external API. In that case you can simply ignore the Flash altogether and go to the API directly. If you're not sure, just activate Firebug's Net panel and start browsing; if the site is using an external API, it should become obvious.
Once you find that API, you could probably reverse engineer how to manipulate it to give you whatever data you need.
Also note that if it's a big enough site, there are probably non-Flash ways to get to the same data (a user-agent sketch follows this list):
It might have a mobile site (with no flash) - try accessing the site with an iPhone user-agent.
It might have a site for crawlers (like googlebot) - try accessing the site with a googlebot user-agent.
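For example, a minimal sketch; the URL and user-agent strings are only illustrative:

```csharp
using System;
using System.IO;
using System.Net;

class UserAgentFetcher
{
    static void Main()
    {
        // Request the page while identifying as an iPhone; swap in a Googlebot
        // string to see the crawler-facing version instead.
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://example.com/");
        request.UserAgent =
            "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26";
        // request.UserAgent = "Googlebot/2.1 (+http://www.google.com/bot.html)";

        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}
```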
EDIT:
If you're talking about crawling (getting data from any random site) rather than scraping (getting structured data from a specific site), then there's not much you can do; even Googlebot isn't scraping Flash content. That's mostly because, unlike HTML, Flash doesn't have a standardized syntax from which you can immediately tell what is text, what is a link, etc.
You won't have much luck with the HTML Agility Pack. One method would be to use something like FiddlerCore to proxy HTTP requests to/from a Flash site. You would start the FiddlerCore proxy, then use something like the C# WebBrowser control to go to the URL you want to scrape. As the page loads, all those HTTP requests will get proxied and you can inspect their contents. However, you wouldn't get most text, since that's often static within the Flash. Instead, you'd get mostly larger content (videos, audio, and maybe images) that is usually stored separately. This will be slower than more traditional scraping/crawling because you'll actually have to execute/run the page in the browser.
If you're familiar with all of those YouTube Downloader types of extensions, they work on this same principle, except that they intercept HTTP requests directly in Firefox (for example) rather than through a separate proxy.
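A minimal sketch of that approach (the page URL is a placeholder, and the FiddlerCore event names are from memory; you may also need to wait beyond the initial page load for the Flash movie to issue its own requests):

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Forms;
using Fiddler;

class FlashTrafficCapture
{
    [STAThread]
    static void Main()
    {
        List<string> urls = new List<string>();

        // Record every URL the page (including its Flash movie) requests.
        FiddlerApplication.AfterSessionComplete += delegate(Session session)
        {
            urls.Add(session.fullUrl);
        };

        FiddlerApplication.Startup(
            8877,
            FiddlerCoreStartupFlags.Default | FiddlerCoreStartupFlags.RegisterAsSystemProxy);

        // Drive a real browser so the Flash movie actually runs and calls its APIs.
        WebBrowser browser = new WebBrowser();
        browser.Navigate("http://example.com/flash-page");
        while (browser.ReadyState != WebBrowserReadyState.Complete)
            Application.DoEvents();

        FiddlerApplication.Shutdown();

        foreach (string url in urls)
            Console.WriteLine(url);
    }
}
```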
I believe that Google and some of the big search engines have a special arrangement with Adobe/Flash and are provided with some software that lets their search engine crawlers see more of the text and things that Google relies on. Same goes for PDF content. I don't know if any of this software is publicly available.
Scraping Flash content would be quite involved, and the reliability of any component that claims to do so is questionable at best. However, if you wish to "crawl" or follow hyperlinks in a Flash animation on some web page, you might have some luck with Infant. Infant is a free Java library for web crawling, and offers limited / best-effort Flash content hyperlink following abilities. Infant is not open source, but is free for personal and commercial use. No registration required!
How about capturing the whole page as an image and running OCR on it to read the data?
I would like to know if there is an easy or practical way to determine how long 1 kB of a website takes to load.
If you are using Firefox, you can get information on how long various files take to download using the Firebug plugin. It has a bunch of network monitoring features.
Chrome has a console similar to Firebug under tools->developer tools.
This is great for a quick reality check when you are developing, but sometimes you need a little more. For instance, you might want to set up a monitoring script to ensure response times aren't creeping up. Selenium is great for this and supports both Java and C#. Another option is to write a quick script using a headless browser like Mechanize (Ruby, Perl).
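As a rough illustration using the Selenium WebDriver C# bindings (the URL is a placeholder; this measures overall load time, which you can then relate to the page's size in kB):

```csharp
using System;
using System.Diagnostics;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class LoadTimeMonitor
{
    static void Main()
    {
        using (IWebDriver driver = new ChromeDriver())
        {
            Stopwatch timer = Stopwatch.StartNew();
            driver.Navigate().GoToUrl("http://example.com/"); // blocks until the page loads
            timer.Stop();

            Console.WriteLine("Page loaded in {0} ms", timer.ElapsedMilliseconds);
        }
    }
}
```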
I prefer doing this kind of monitoring from the client end as opposed to on the server side because you get a more realistic perspective of what your end users are experiencing.
I want to find a decent solution to track the URLs and HTML content that users are visiting and provide more information to the user. The solution should have minimal impact on end users.
I don't want to write plugins for different browsers. They're hard to maintain.
A proxy-based method isn't acceptable, since I don't want to change any of the user's proxy settings.
My application is written in C# and targets Windows. It would be best if the solution could support other operating systems as well.
Based on my research, I found the following methods that look workable, but all of them have drawbacks and I can't determine which one is best.
Use WinPcap
WinPcap sniffs all TCP packets without changing any user settings; it only requires installing the WinPcap setup, which is acceptable to me. But I have two questions:
a. How do I convert TCP packets into URLs and HTML?
b. Does it really impact performance? I don't know whether sniffing all TCP traffic is too much overhead for this requirement.
Find history files for different browsers
This way looks like the easiest one, but I wonder if the solution is stable. I am not sure whether the browser writes history reliably, or when it writes it. My application will pop up information before the user leaves the current page, so this won't work for me if the browser only writes to the history file when the user closes it.
Use FindWindow, accessibility objects, or a COM interface to find the UI element which contains the URL
I find this approach incomplete; for example, Chrome will only expose the active tab's URL, not all of them.
Another drawback is that I have to request the URL a second time to get its HTML content.
Any comment or suggestion is welcome.
BTW, I am not writing spyware. The application is trying to find all RSS feeds on the web page and show them to end users. I could easily do that in a browser plugin, but I really want to support multiple browsers with a single UI. Thanks.
Though this is a very old post, I thought I'd give some input.
Approach 1, WinPcap, is the best one. It will work for any browser, even the built-in browser of any other installed application. It will consume fewer resources too.
There is a library, Pcap.Net, that has an HTTP parser. You can reconstruct the HTTP stream and use its HttpResponseDatagram to parse the body, which your application can then consume.
This link gave me more insight:
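A minimal sketch of what that looks like with Pcap.Net (device selection and error handling omitted; note that large bodies span multiple TCP segments, so real use still needs the stream reconstruction described in the article linked below):

```csharp
using System;
using PcapDotNet.Core;
using PcapDotNet.Packets;
using PcapDotNet.Packets.Http;

class HttpSniffer
{
    static void Main()
    {
        // Take the first capture device; a real application would let the user pick one.
        LivePacketDevice device = LivePacketDevice.AllLocalMachine[0];

        using (PacketCommunicator communicator =
            device.Open(65536, PacketDeviceOpenAttributes.Promiscuous, 1000))
        {
            communicator.SetFilter("tcp port 80"); // plain HTTP only; HTTPS is encrypted
            communicator.ReceivePackets(0, PacketHandler);
        }
    }

    static void PacketHandler(Packet packet)
    {
        // Pcap.Net parses the HTTP layer of each captured packet for us.
        HttpDatagram http = packet.Ethernet.IpV4.Tcp.Http;
        if (http != null && http.IsRequest)
        {
            HttpRequestDatagram request = (HttpRequestDatagram)http;
            Console.WriteLine("Request: " + request.Uri);
        }
    }
}
```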
Tcp Session Reconstruction with Winpcap
I am working on a project right now that involves receiving a message from another application, formatting the contents of that message, and sending it to a printer. The technology of choice is a C# Windows service. The output could be called a report, I suppose, but a reporting engine is not necessary; a simple templating engine like StringTemplate, or even XSLT outputting HTML, would be fine. The problem I'm having is finding a free way to print this kind of output from a service. Since it seems that it will work, I'm working on a prototype using Microsoft's RDLC, populating a local report and then rendering it as an image to a memory stream, which I will then print. Issues with that are:
Multi-page printing will be a big headache.
Still have to use PrintDocument to print the memory stream, which is unsupported in a Windows Service (though it may work - haven't gotten that far with the prototype yet)
If the data coming across changes, I have to change the dataset and the class that the data is being deserialized into. Bad, bad, bad.
Has anyone had to do anything remotely like this? Any advice? I already posted a question about printing HTML without user input, and after wasting about 3 days on that, I have come to the conclusion that it cannot be done, at least not with any freely available tool.
All help is appreciated.
EDIT: We are on version 2.0 of the .NET framework.
Trust me, you will spend more money trying to find or develop a solution for this than you would buying a third-party component. Do not reinvent the wheel; go for the paid solution.
Printing is a complex problem and I would love to see the day when better framework support is added for this.
Printing from a Windows service is really painful. It seems to work... sometimes... but eventually it crashes or throws an exception from time to time, without any clear reason. It's really hopeless. Officially, it's not even supported, without any explanation or any proposal for an alternative solution.
Recently, I was confronted with the problem, and after several unsuccessful trials and experiments, I finally came up with two viable solutions:
Write your own printing DLL using the Win32 API (in C/C++, for instance), then use it from your service with P/Invoke (works fine; a minimal sketch follows this list).
Write your own printing COM+ component, then use it from your service. I chose this solution with success recently (though it was a third-party COM+ component, not one I wrote). It works absolutely fine too.
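For the first option, a related shortcut is to P/Invoke the spooler API in winspool.drv directly instead of writing your own native DLL. A minimal sketch, based on the classic RawPrinterHelper sample (the printer name is a placeholder, and the data must already be print-ready, e.g. PCL or PostScript):

```csharp
using System;
using System.Runtime.InteropServices;

class RawPrinterHelper
{
    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Ansi)]
    class DOCINFOA
    {
        [MarshalAs(UnmanagedType.LPStr)] public string pDocName;
        [MarshalAs(UnmanagedType.LPStr)] public string pOutputFile;
        [MarshalAs(UnmanagedType.LPStr)] public string pDataType;
    }

    [DllImport("winspool.drv", CharSet = CharSet.Ansi, SetLastError = true)]
    static extern bool OpenPrinter(string szPrinter, out IntPtr hPrinter, IntPtr pd);
    [DllImport("winspool.drv", SetLastError = true)]
    static extern bool ClosePrinter(IntPtr hPrinter);
    [DllImport("winspool.drv", CharSet = CharSet.Ansi, SetLastError = true)]
    static extern bool StartDocPrinter(IntPtr hPrinter, int level,
        [In, MarshalAs(UnmanagedType.LPStruct)] DOCINFOA di);
    [DllImport("winspool.drv", SetLastError = true)]
    static extern bool EndDocPrinter(IntPtr hPrinter);
    [DllImport("winspool.drv", SetLastError = true)]
    static extern bool StartPagePrinter(IntPtr hPrinter);
    [DllImport("winspool.drv", SetLastError = true)]
    static extern bool EndPagePrinter(IntPtr hPrinter);
    [DllImport("winspool.drv", SetLastError = true)]
    static extern bool WritePrinter(IntPtr hPrinter, byte[] pBytes, int dwCount, out int dwWritten);

    // Send already-formatted bytes (PCL, PostScript, plain text...) straight to the spooler.
    public static void SendRawBytes(string printerName, byte[] data)
    {
        IntPtr hPrinter;
        if (!OpenPrinter(printerName, out hPrinter, IntPtr.Zero))
            throw new InvalidOperationException("Cannot open printer: " + printerName);
        try
        {
            DOCINFOA docInfo = new DOCINFOA();
            docInfo.pDocName = "Service print job";
            docInfo.pDataType = "RAW"; // bypass any driver rendering
            StartDocPrinter(hPrinter, 1, docInfo);
            StartPagePrinter(hPrinter);
            int written;
            WritePrinter(hPrinter, data, data.Length, out written);
            EndPagePrinter(hPrinter);
            EndDocPrinter(hPrinter);
        }
        finally
        {
            ClosePrinter(hPrinter);
        }
    }
}
```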
I've done it. It's a pain in the A*s. The problem is that printing requires the GDI engine to be in place, which normally means that you have to have the desktop, which only loads when you're logged in. If you're attempting to do this from a service on a server, then you normally aren't logged in.
So first, you can't run as the normal service user; instead you run as a real user that has interactive login rights. Then you have to tweak the service registry entries (I forget how at the moment; I would have to find the code, which I can do tonight if you're really interested). Finally, you have to pray.
Your biggest long term headache will be with print drivers. If you are running as a service without a logged in user, some print drivers like to pop up dialogs from time to time. What happens when your printer is out of toner? Or out of paper? The driver may pop up a dialog that will never be seen, and hold up the printer queue because nobody is logged in!
To answer your first question, this can be fairly straightforward depending on the data. We have a variety of service-based applications that do exactly what you are asking. Typically, we parse the incoming file and wrap our own PostScript or PCL around it. If your layout is fairly simple, then there are some very basic PCL codes you can wrap it with to provide the font/print layout you want (I'd be more than happy to give you some guidance here offline).
Once you have a print-ready file, you can send it to a shared UNC printer, directly to a locally installed printer, or even to the IP address of the device (RAW or LPR type data).
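For sending RAW data straight to the device's IP, something along these lines should work on most network printers (port 9100 is the conventional raw/JetDirect port; the address and file path are placeholders):

```csharp
using System.IO;
using System.Net.Sockets;

class JetDirectSender
{
    // Push a print-ready file (PCL or PostScript) to the printer's raw port.
    public static void Send(string printerAddress, string filePath)
    {
        byte[] data = File.ReadAllBytes(filePath);
        using (TcpClient client = new TcpClient(printerAddress, 9100))
        using (NetworkStream stream = client.GetStream())
        {
            stream.Write(data, 0, data.Length);
        }
    }
}
```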
If, however, you are going down the PDF path, the simplest method is to send the PDF output to a printer that supports direct PDF printing (many do now). In this case you just send the PDF to the device and away it prints.
The other option is to launch Ghostscript, which should be free for your needs (check the licensing, as they have a few different versions, some GNU, some GPL, etc.), and either use its built-in print function or simply convert to PostScript and send that to the device. I've used Ghostscript many times in service apps, but I'm not a huge fan, as you will basically be shelling out and executing a command-line app to do the conversion. That being said, it's a stable app that does tend to fail gracefully.
Printing from a service is a bad idea. Network printers are connected "per-user". You can mark the service to be run as a particular user, but I'd consider that a bad security practice. You might be able to connect to a local printer, but I'd still hesitate before going this route.
The best option is to have the service store the data and have a user-launched application do the printing by asking the service for the data. Or a common location that the data is stored, like a database.
If you need to have the data printed at regular intervals, set up a task through the Task Scheduler. Launching a process from a service will require knowing the user name and password, which again is bad security practice.
As for the printing itself, using a third-party tool to generate the report will be the easiest route.
This may not be what you're looking for, but if I needed to do this quick&dirty, I would:
Create a separate WPF application (so I could use the built-in document handling)
Give the service the ability to interact with the desktop (note that you don't actually have to show anything on the desktop, or be logged in for this to work)
Have the service run the application, and give it the data to print.
You could probably also jigger this to print from a web browser that you run from the service (though I'd recommend building your own shell around IE, rather than using a full browser).
For a more detailed (also free) solution, your best bet is probably to manually format the document yourself (using GDI+ to do the layout for you). This is tedious, error-prone, time-consuming, and wastes a lot of paper during development, but it also gives you the most control over what's going to the printer.
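A minimal sketch of that manual GDI+ approach (the printer name and text are placeholders; remember that System.Drawing.Printing is officially unsupported inside a service, so this is safest in a helper process running in a user session):

```csharp
using System.Drawing;
using System.Drawing.Printing;

class ManualLayoutPrinter
{
    public static void PrintText(string printerName, string text)
    {
        PrintDocument document = new PrintDocument();
        document.PrinterSettings.PrinterName = printerName;
        document.PrintPage += delegate(object sender, PrintPageEventArgs e)
        {
            // Do the layout yourself: measure and draw within the page margins.
            using (Font font = new Font("Arial", 10))
            {
                e.Graphics.DrawString(text, font, Brushes.Black, e.MarginBounds);
            }
            e.HasMorePages = false; // multi-page output means paginating by hand
        };
        document.Print();
    }
}
```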
If you can output to PostScript, some printers will print anything that gets FTPed to a certain directory on them.
We used this to get around the print credits that our university imposed on us, but if your service outputs a PS file, then you can just FTP it to the printer.
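A minimal sketch of FTPing a PostScript file to such a printer (the host, credentials, and target directory vary by device and are placeholders here):

```csharp
using System.IO;
using System.Net;

class FtpPrintSender
{
    public static void Send(string printerHost, string localPsFile)
    {
        // Many network printers print whatever lands on their FTP interface.
        FtpWebRequest request = (FtpWebRequest)WebRequest.Create(
            "ftp://" + printerHost + "/" + Path.GetFileName(localPsFile));
        request.Method = WebRequestMethods.Ftp.UploadFile;
        request.Credentials = new NetworkCredential("anonymous", "");

        byte[] data = File.ReadAllBytes(localPsFile);
        using (Stream requestStream = request.GetRequestStream())
        {
            requestStream.Write(data, 0, data.Length);
        }
        request.GetResponse().Close();
    }
}
```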
We are using DevExpress' XtraReports to print from a service without any problems. Their report model is similar to that of Windows Forms, so you could dynamically insert text elements and then issue the print command.
I think we are going to go the third-party route. I like the XSL -> HTML -> PDF -> printer flow... Winnovative's HTML to PDF looks good for the first part, but I'm running into a block finding a good PDF printing solution... any suggestions? Ideally the license would be on a per-developer basis, not a per-deployed-runtime basis.
In answer to your question about PDF printing, I have not found an elegant solution. I was "shell"ing out to Adobe Reader, which was unreliable and required a user to be logged in at all times. To fix this specific problem, I requested that the files we process (invoices) be formatted as multi-page TIFF files instead, which can be split apart and printed using native .NET printing functions. Adobe's position seems to be "get the user to view the file in Adobe Reader and they can click print". Useless.
I am still keen to find a good way of producing quality reports which can be output from the web server...
Printing using System.Drawing.Printing is not supported by MS, as per Yann Trevin's response. However, you might be able to use the new WPF-based System.Printing namespace (I think).