Website scraper to the next level - C#

So!
For a fansite I run, I also run a website scraper (an XML reader) that pulls information from a secure web location of a game. It works fine as it is now, but I want to make it better and, mainly, faster.
The first problem I faced was how to maintain a session that can handle a ton of requests (1 to 10 every 30 seconds) while staying logged in. A normal HttpWebRequest didn't really work because the login is secured with a token that must be submitted together with my login information. The solution I ended up with is this: I place a WebBrowser control on a Form, and when the login page has loaded (the DocumentCompleted event) I fill the login information into the document and simply submit it.
Now I can access all the secure pages I want, BUT not with an HttpWebRequest in code. However, when I placed multiple WebBrowser controls on the same form, all of them could access the secure part of the site. So I ended up placing six of them to do kind-of-parallel requests (for XML and HTML) and pull information from my account quickly.
This actually works like a charm: you can watch seven browsers happily browsing away while I analyse the DomDocument. Naturally, though, it creates a lot of overhead, since I don't need the images and all the Flash etc. to load (or the iframes, which cause very annoying extra DocumentCompleted events). So I want to log in once and then be able to make requests in code with HttpWebRequest, using the session/cookie information of the WebBrowser (or log in some other way).
So how do I do this? Is this even possible, or should I approach it completely differently?
(PS: I write everything in C#.)

You can show the first WebBrowser, log in and, after the submit, grab the cookies from it and attach them to all of your HttpWebRequests.
Using the WebBrowser only for that first login should improve your performance a lot!
Just pay attention to browser validation / async content loading.
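A minimal sketch of that cookie hand-off, assuming the session cookies are not marked HttpOnly; the secure path is a placeholder:

    // Requires: using System; using System.IO; using System.Net; using System.Windows.Forms;
    // Run this after the WebBrowser has finished logging in (e.g. in DocumentCompleted).
    Uri site = webBrowser1.Document.Url;
    var cookies = new CookieContainer();

    // Document.Cookie returns "name=value; name2=value2"; SetCookies wants comma separators.
    // Note: HttpOnly cookies are NOT visible here, which is the usual reason this approach fails.
    cookies.SetCookies(new Uri(site.GetLeftPart(UriPartial.Authority)),
                       webBrowser1.Document.Cookie.Replace(';', ','));

    var request = (HttpWebRequest)WebRequest.Create(
        site.GetLeftPart(UriPartial.Authority) + "/account/data.xml"); // hypothetical path
    request.CookieContainer = cookies;

    using (var response = (HttpWebResponse)request.GetResponse())
    using (var reader = new StreamReader(response.GetResponseStream()))
    {
        string xml = reader.ReadToEnd();   // parse with XDocument / HtmlAgilityPack as needed
    }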

You can't use HttpWebRequest to share the same session with the WebBrowser. You'd need to use an API based on UrlMon or WinInet; that's what WebBrowser uses behind the scenes.
I listed some of the options here: https://stackoverflow.com/a/22686805/1768303.
Perhaps the XMLHTTPRequest COM object would be the most feasible one.
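For example, something along these lines (a sketch only; the URL is a placeholder). Because Msxml2.XMLHTTP goes through WinInet, it sees the same cookie jar as the WebBrowser control, unlike ServerXMLHTTP or HttpWebRequest:

    // Late-bound call to the XMLHTTPRequest COM object (no interop assembly needed).
    Type xhrType = Type.GetTypeFromProgID("Msxml2.XMLHTTP.6.0");
    dynamic xhr = Activator.CreateInstance(xhrType);

    xhr.open("GET", "https://game.example.com/account/data.xml", false);  // synchronous; run it off the UI thread
    xhr.send();

    if ((int)xhr.status == 200)
    {
        string xml = xhr.responseText;    // same session as the logged-in WebBrowser
    }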

Related

Monitoring changes in web application

I want to monitor changes in a complex web application in the background. It is a one-page application with many scripts and so on, and I need to be logged in to have access to the data I want to monitor.
I tried to use WebRequest, but I think the application is too complex to do it that way. There is also a problem with authentication.
I also tried the WebBrowser component, but the web application tells me that this browser is too old and I should get a newer one.
The perfect solution would:
Open this web application in Chrome (or some other modern browser) in the background
Save the page to memory
Extract values using something like HtmlAgilityPack
While this is happening I want to use the computer normally (so opening a Chrome window is not a good solution for me).
Is there any way to achieve something like that?
If you can cope with an extra browser running, have a look at SeleniumHQ. With its WebDriver-backed Selenium you can start a dedicated browser instance and perform user actions by coding in high-level programming languages like Java (a C# sketch follows below). It should not interfere with your manual work at all, but it will take up about the same amount of memory and CPU time your "real" browser would.
If the web application has no captcha and does not object to automated scripts accessing it, you could also log in from a background program by sending the appropriate HTTP requests and parsing the responses. Python's urllib2 would be my first choice.
If you don't want any additional processes running, you could also create a browser plugin that auto-refreshes and parses a certain open tab every few seconds.
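A minimal C# sketch of the Selenium idea, assuming the Selenium.WebDriver, Selenium.WebDriver.ChromeDriver and HtmlAgilityPack NuGet packages and a Chrome version that supports --headless; the URLs, field names and XPath are placeholders:

    // Requires: using System; using OpenQA.Selenium; using OpenQA.Selenium.Chrome; using HtmlAgilityPack;
    var options = new ChromeOptions();
    options.AddArgument("--headless");                 // no visible window, so normal work is undisturbed

    using (IWebDriver driver = new ChromeDriver(options))
    {
        driver.Navigate().GoToUrl("https://app.example.com/login");
        driver.FindElement(By.Name("username")).SendKeys("me");
        driver.FindElement(By.Name("password")).SendKeys("secret");
        driver.FindElement(By.CssSelector("button[type=submit]")).Click();

        // Hand the fully rendered DOM to HtmlAgilityPack and pull out the value to monitor.
        var doc = new HtmlDocument();
        doc.LoadHtml(driver.PageSource);
        string value = doc.DocumentNode.SelectSingleNode("//span[@id='price']")?.InnerText;
    }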

How to programmatically click a button on a webpage in bot (web crawler)?

I would like to build a bot - web crawler - to collect phone numbers.
I have a problem though: to see the phone number, a user must click something like "Show".
How can I solve this problem?
Check what the act of clicking the button does. Does it call a JavaScript function? Does that make an HTTP call to a backend? If so, your bot should do that call instead of screen-scraping the first page. If not, does it just play with the DOM of the page to show an item on screen?
All the data you're looking for comes from some sort of back end, so if you watch your browser's developer tools while going through the page, you can usually figure out which calls you need to script in order to get the data.
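For instance, if the Network tab shows that clicking "Show" fires a GET to something like /api/listing/{id}/phone, the bot can call that endpoint itself. The URL, header and JSON shape below are invented for illustration, and this belongs inside an async method:

    // Requires: using System; using System.Net.Http; using System.Threading.Tasks;
    using (var client = new HttpClient())
    {
        // Many such endpoints check this header to distinguish AJAX calls from page loads.
        client.DefaultRequestHeaders.Add("X-Requested-With", "XMLHttpRequest");

        string json = await client.GetStringAsync("https://example.com/api/listing/12345/phone");
        // json might look like {"phone":"+1 555 0100"} - parse it with Json.NET or System.Text.Json.
    }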
It is possible to make this harder (and that is what some sites do to protect themselves from scraping). Typically, if you're in this situation, what you're doing is not entirely legal or nice. But technically it's very interesting, so here goes.
The best way forward is to run the site in a real browser (like PhantomJS or Chrome) and use a framework like WebDriver to simulate browser interactions. This way you can usually pull most of the data out.
If you find that your IP gets blocked, you could use Tor and hit the site from multiple instances dynamically... but make sure you ask the site owner nicely whether you're allowed to do that, of course.
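A short sketch of that WebDriver route in C# (Selenium.WebDriver plus Selenium.Support for the wait helper); the URL and selectors are hypothetical:

    // Requires: using System; using OpenQA.Selenium; using OpenQA.Selenium.Chrome; using OpenQA.Selenium.Support.UI;
    using (IWebDriver driver = new ChromeDriver())
    {
        driver.Navigate().GoToUrl("https://example.com/listing/12345");

        // Click the "Show" control that reveals the number.
        driver.FindElement(By.CssSelector(".show-phone")).Click();

        // Wait until the revealed element is actually in the DOM, then read it.
        var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
        string phone = wait.Until(d => d.FindElement(By.CssSelector(".phone-number"))).Text;
    }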

C# Webbrowser Automation

Background:
I am creating a Windows Forms app that automates order entry on an intranet web application. We have a large amount of order entry that needs to be done, which will cost us a lot of money, so I volunteered to automate the task.
Problem:
I am using the WebBrowser class to navigate the web app. I have gotten very far but have hit a roadblock. There is a part of the app that opens a web dialog page. How do I interact with that web dialog? My instance of the WebBrowser class is still attached to the parent page. I am hoping someone can point me in the right direction.
You've got a number of options. To expand on the answers from others and add a new idea...
Do it using the WebBrowser control: This is technically possible, by either injecting JavaScript into the target page as demonstrated here, or creating a JavaScript object and using it as a bridge via the WebBrowser.ObjectForScripting property. This is very fragile - something as simple as the website changing an element's id could break it. You also need to make sure your code doesn't interfere with the functioning of the form (clashing function names, etc.).
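For illustration, the kind of thing that route looks like; the element ids and script are made up, and as said above they break the moment the page changes:

    // Requires: using System.Windows.Forms;
    // Poke the page the WebBrowser control is currently showing.
    HtmlElement okButton = webBrowser1.Document.GetElementById("btnConfirm");
    if (okButton != null)
        okButton.InvokeMember("click");                       // simulate a user click

    // Or run arbitrary script in the page's context:
    webBrowser1.Document.InvokeScript("eval",
        new object[] { "document.getElementById('txtQty').value = '5';" });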
Do it using a postback: Monitor the communications between the web browser and the server (I personally prefer Firefox/Firebug, but IE/Fiddler or Chrome/F12 are both good too). As long as you can replicate the actions of the browser exactly, the server can't know the difference. The problem here is that browsers are complex, and the more secure a form is, the more demanding the server is. This means you may have to fake a login, get cookies and send them back on subsequent requests, and handle ViewState data and XSS-prevention variables. It's possible, and it's far more robust than the first option, but it can be a pain to get working. If it's not a highly secure form, this is your best bet. More information here
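A rough sketch of that postback approach against an ASP.NET WebForms page; the URL and field names are hypothetical, and a real page may also need __EVENTVALIDATION, __EVENTTARGET and friends:

    // Requires: using System; using System.IO; using System.Net; using System.Text; using System.Text.RegularExpressions;
    var cookies = new CookieContainer();

    // 1. GET the form once so we hold a session cookie and the current ViewState.
    var get = (HttpWebRequest)WebRequest.Create("https://intranet.example.com/order.aspx");
    get.CookieContainer = cookies;
    string page;
    using (var reader = new StreamReader(get.GetResponse().GetResponseStream()))
        page = reader.ReadToEnd();

    string viewState = Regex.Match(page, "id=\"__VIEWSTATE\" value=\"([^\"]*)\"").Groups[1].Value;

    // 2. POST it back with our own field values, exactly as the browser would.
    var post = (HttpWebRequest)WebRequest.Create("https://intranet.example.com/order.aspx");
    post.Method = "POST";
    post.ContentType = "application/x-www-form-urlencoded";
    post.CookieContainer = cookies;                          // same cookies = same session

    string body = "__VIEWSTATE=" + Uri.EscapeDataString(viewState) +
                  "&txtOrderQty=5&btnSubmit=Submit";
    byte[] bytes = Encoding.UTF8.GetBytes(body);
    using (var stream = post.GetRequestStream())
        stream.Write(bytes, 0, bytes.Length);

    using (var response = (HttpWebResponse)post.GetResponse())
    {
        // Inspect the response to confirm the order went through.
    }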
Do it by browser automation: Selenium is probably the best option here (as mentioned by others), but it suffers from a similar flaw to the WebBrowser control in that it's sensitive to changes on the form itself (though not as much so as the WebBrowser control).
Incidentally, if you have Visual Studio Ultimate/Test edition (and some others, not sure which), it includes a suite of testing tools including an excellent engine to automate load testing a website. This is also superb for tracking down what exactly a form does as you can see every step of the emulation.
Hope this helps
You have two choices depending on the level of complexity you need:
Use an HTTP debugger like Fiddler to find out the POST data you need to send to each page and mimic it via an HttpWebRequest.
Use a browser automation tool like Selenium and let it do the job.
NOTE: Your actions may be considered spamming by the website, so be ready for IP blocking, CAPTCHAs...
You could give Selenium a go: http://seleniumhq.org/
UI automation is a far more intuitive approach to these types of tasks.

How can I tell if 2 browser windows are sharing their session?

Many of our users, internal and external, start our web application. Then at some later point, they open a new window from within the browser. They want to have 2 independent sessions of the application running. However, by doing it this way they are actually using the same session data.
Is there a way, in code, to determine if there is another browser window open with the same session?
We're using VS 2008, C# and/or VB.Net.
Thanks.
COMBINING MY RESPONSES FROM BELOW:
Maybe I'm saying this wrong. When they open a second window and change it to a different widget number, and then go back to the original window, on the next postback it will be using the second window's widget number, not its own.
We are using IE7.
The major browsers that I've tested apps on (IE, FF and Google Chrome) all default to using the same collection of cookies regardless of whether you are opening a duplicate web page in a new tab or a new browser instance.
The result is that 2 different tabs, or 2 instances of the same browser, by default, will look like the same session to the server.
Because the multiple instances use the same cookies, the server cannot tell requests from them apart, and will associate them with the same Session data, because they all have the same SessionID, assuming cookie-based SessionID.
Generally there is nothing wrong with this behaviour, and you would have to have a good business case against it to want to code a workaround.
I do not believe it is possible to distinguish the different browser tabs from server side code. There may be some sort of client side script hack that would help.
Would it help to include an HTML meta refresh tag so that the various tabs at least update themselves periodically?
If, on the other hand, what you are after is to treat a group of user/server interactions as a kind of "session within a session", you may be able to do this by storing a random Guid (or Widget Number) in ViewState, and checking it on postback.
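A sketch of that idea in a WebForms code-behind; the "widget" keying is made up to match the question:

    // ViewState travels with the page in each window, while Session is shared by all of them,
    // so a per-page ViewState id lets you key per-window data inside the shared Session.
    protected void Page_Load(object sender, EventArgs e)
    {
        if (!IsPostBack)
        {
            ViewState["WindowId"] = Guid.NewGuid().ToString();
        }
        else
        {
            string windowId = (string)ViewState["WindowId"];
            // e.g. keep each window's widget number under its own key
            object widgetNumber = Session["Widget_" + windowId];
        }
    }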
Hope this helps.
IE8 - shares the session between tabs and browser instances; a new session can be started using the File -> New Session command
IE7 - shares the session between tabs but not between browser instances
Firefox - shares the session between tabs and instances; another Firefox can be started with a different profile (firefox.exe -P "profileName" -no-remote) and will then have a separate session
See http://blogs.msdn.com/ie/archive/2009/05/06/session-cookies-sessionstorage-and-ie8.aspx for discussion of this topic for IE7 and IE8.
They're not sharing the same data. A new session is started in the new browser window and a separate trip to the database is initiated.
You can inspect the headers in Fiddler, or you can output the Session.SessionID in the windows. Sessions are created for each browser instance, not each window.
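An easy way to check that for yourself is to surface the id on the page and compare two windows side by side (lblSessionId is a hypothetical Label control):

    protected void Page_Load(object sender, EventArgs e)
    {
        // If both windows show the same value, they are on the same server session.
        lblSessionId.Text = Session.SessionID;
    }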

Best method for Website Automation?

Let me rephrase the question...
Here's the scenario: as an insurance agent you are constantly working with multiple insurance websites. For each website I need to log in and pull up a client. I am looking to automate this process.
I currently have a solution built for iMacros but that requires a download/installation.
I'm looking for a solution using the .NET framework that will allow the user to provide their login credentials and information about a client and I will be able to automate this process for them.
This will involve knowledge of each specific website which is fine, I will have all of that information.
I would like for this process to be able to happen in the background and then launch the website to the user once the action is performed.
You could try the following tools:
StoryTestIQ
Selenium
Watir
Windmill Testing Framework
Visual Studio Web Tests
They are automated testing tools/frameworks that allow you to write automated tests from a UI perspective and verify the results.
Use WatiN. It's an open source .NET library to automate IE and Firefox. It's a lot easier than manipulating raw HTTP requests or hacking the WebBrowser control to do what you want, and you can run it from a console app or service, since you mentioned this wouldn't be a WinForms app.
You can also make the browser window invisible if needed, since you mentioned only showing this to the user at a certain point.
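A small WatiN sketch of the kind of flow described in the question; the site URL, element finders and the userName/password/clientName variables are placeholders:

    // Requires the WatiN package: using WatiN.Core;
    using (var browser = new IE("https://portal.example-insurer.com/login"))
    {
        browser.TextField(Find.ByName("username")).TypeText(userName);
        browser.TextField(Find.ByName("password")).TypeText(password);
        browser.Button(Find.ByValue("Log in")).Click();

        browser.TextField(Find.ById("clientSearch")).TypeText(clientName);
        browser.Button(Find.ById("btnSearch")).Click();
        // The client page is now loaded in the automated browser; read elements
        // off it here, or bring the window to the front for the user.
    }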
I've done this in the past using the WebBrowser control inside a WinForms app that I execute on the server. The WebBrowser control will allow you to access the HTML elements on the page, input information, click buttons/links, etc. It should allow you to accomplish your goal.
There are ways to do this without the WebBrowser control, look at the HTML Agility Pack.
Assuming that you are talking about filling in and submitting a form or forms using a bot of some sort, then scraping the response to display to the user:
Use HttpWebRequest(?) to create a form post containing the relevant form fields and data from your model and submit the request.
Retrieve and analyse the response, store any cookies as you will need to resubmit the cookie on the next request.
Formulate the next request based on the results of the first request (remembering to attach cookies as necessary) and submit it.
Retrieve the response and display it, or parse it and display the result (depending on what you are hoping to achieve).
You say this is not a client app - therefore I will assume a web app. The downside of this is that once you start proxying requests for the user, you will always have to proxy those requests, as there is no way for you to transfer any session cookies from the target site to the user, and there is no (simple / easy / logical) way for the user to log in to the target site and then transfer the cookie to you.
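A compact sketch of that loop using HttpClient (the newer wrapper over the same stack as HttpWebRequest), with one handler holding the cookies across requests; the URLs and field names are invented, and this belongs in an async method:

    // Requires: using System.Collections.Generic; using System.Net; using System.Net.Http;
    var handler = new HttpClientHandler { CookieContainer = new CookieContainer() };
    using (var client = new HttpClient(handler))
    {
        // 1. Submit the login form; the session cookie lands in the handler's container.
        var loginForm = new FormUrlEncodedContent(new Dictionary<string, string>
        {
            { "username", userName },      // placeholders supplied by your model
            { "password", password }
        });
        await client.PostAsync("https://portal.example-insurer.com/login", loginForm);

        // 2. Follow-up requests reuse the cookie automatically.
        string clientPage = await client.GetStringAsync("https://portal.example-insurer.com/clients/42");

        // 3. Parse clientPage (HtmlAgilityPack, Regex, ...) to build the next request.
    }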
Usually when trying to do this sort of integration, people will use some form of published API for interacting with the companies / systems in question as they are designed for the type of interactions that you are referring to.
It is not clear to me what difficulty you are trying to communicate when you wrote:
I currently have a solution built for iMacros but that requires a download/installation.
I think here lie some requirements about which you are not being explicit. You certainly need to "download/install" your .NET program on your clients' machines, so what's the difference?
Anyway, Crowbar seems promising:
Crowbar is a web scraping environment based on the use of a server-side headless mozilla-based browser.
Its purpose is to allow running javascript scrapers against a DOM to automate web sites scraping but avoiding all the syntax normalization issues.
For people not familiar with this terminology: "javascript scrapers" here means something like an iMacros macro, used to extract information from a web site (in the end it is a JavaScript program; what purpose you use it for does not, I think, make a difference).
Design
Crowbar is implemented as a (rather simple, in fact) XULRunner application that provides an HTTP RESTful web service implemented in javascript (basically turning a web browser into a web server!) that you can use to 'remotely control' the browser.
I don't know whether this headless browser can be extended with add-ons like a normal Firefox installation. In that case you could even think about using your iMacros macros (or using CoScripter) with appropriate packaging.
The more I think about this, the more I feel that this is a convoluted solution for what you wrote you want to achieve. So please clarify.
