Crawl a site and detect 3rd-party cookies - C#

I am writing a crawler to log all cookies being deployed by a set number of sites. I can pick up 1st-party cookies being set on page visit using Selenium, but a limitation in the software means that it won't pick up 3rd-party cookies. Are there any other tools available that can pick up all cookies?
Thanks.

If you are doing this as a one-time task, you can use something like the FireCookie extension to the Firefox browser, which lets you export all the cookies:
http://www.softwareishard.com/blog/firecookie/
If you want to automate this task and run it periodically, consider a solution like the following:
First get a list of pages that need to be crawled.
Then load each page consecutively into a web browser. It's not enough to simply fetch the HTML of the page, because you need to load and process all the JavaScript, iframes, and so forth that might set cookies. This could be a headless browser such as PhantomJS ( http://www.phantomjs.org/ ) or some other solution, as long as it actually renders the page the way a browser would.
Use a web proxy such as Charles proxy ( http://www.charlesproxy.com/ ) to record all the network requests from the browser. The recorded session can be saved and processed to extract all the cookie headers. Charles proxy has an API that can be used to export the session to an XML file, so you might be able to automate this part as well.
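Since the question mentions Selenium, here is a minimal C# sketch of that pipeline, assuming Selenium WebDriver with Chrome and a recording proxy (Charles, Fiddler, mitmproxy, or similar) already listening on localhost:8888; extracting the third-party cookies from the proxy's saved session is left out, because it depends entirely on the proxy's export format.

```csharp
using System;
using System.Collections.Generic;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class CookieCrawler
{
    static void Main()
    {
        // Placeholder list of pages to crawl.
        var pages = new List<string>
        {
            "https://www.example.com/",
            "https://www.example.org/"
        };

        // Route all browser traffic through the recording proxy so that
        // third-party requests (and their Set-Cookie headers) are captured there.
        var options = new ChromeOptions();
        options.Proxy = new Proxy
        {
            HttpProxy = "localhost:8888",
            SslProxy = "localhost:8888"
        };

        using (IWebDriver driver = new ChromeDriver(options))
        {
            foreach (var url in pages)
            {
                driver.Navigate().GoToUrl(url);

                // First-party cookies are still available directly from the driver;
                // the third-party ones have to be pulled from the proxy's session log.
                foreach (var cookie in driver.Manage().Cookies.AllCookies)
                    Console.WriteLine("{0}: {1}={2}", url, cookie.Name, cookie.Value);
            }
        }
    }
}
```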

I believe you could use a regex and ie.GetCookie() to collect all the cookies from a website. I haven't tried it myself, but as far as the documentation goes I think it'll be rather easy.

Related

How to: Encrypt URL in WebBrowser Controls

I have a program that opens a web browser control and just displays a web page from our server. Users can't navigate around or do anything else.
The users are not allowed to know the credentials required to login, so after some googling on how to log into a server I found this:
http://user_name:password@URL
This is 'hard coded' into the web browser's code. It works fine.
HOWEVER: Some smart ass managed to grab the credentials by using Wireshark, which captures all the packets sent from your machine.
Is there a way I can encrypt this so the users cannot find out?
I've tried other things like using POST, but with the way the page was set up it was proving extremely difficult to get working. (It's an SSRS Report Manager web page.)
I forgot to include a link to this question: How to encrypt/decrypt the url in C#
I cannot use that answer, as I am not allowed to change any of the server setup!
Sorry if this is an awful question, I've tried searching around for the past few days but can't find anything that works.
Perhaps you could work around your issue with a layer of indirection - for example, you could create a simple MVC website that doesn't require any authentication (or indeed, requires some authentication that you fully control) and it is this site that actually makes the request to the SSRS page.
That way you can have full control over how you send authentication, and you need never worry about someone ever getting access to the actual SSRS system. Now if your solution requires the webpage to be interactive then I'm not sure this will work for you, but if it's just a static report, it might be the way to go.
i.e. your flow from the app would be
User logs into your app (or use Windows credentials, etc)
User clicks to request the SSRS page
Your app makes an HTTP request to your MVC application
Your MVC application makes the "real" HTTP request to SSRS (e.g. via HttpClient) and dumps the result back to the caller (for example, it could write the SSRS response via @Html.Raw in an MVC view). The credentials for SSRS will therefore never be sent by your app, so you don't need to worry about that problem any more...
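A rough sketch of that last step, assuming ASP.NET MVC 5 and HttpClient; the controller name, report URL, and credentials below are placeholders rather than anything from your setup:

```csharp
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using System.Web.Mvc;

public class ReportProxyController : Controller
{
    // The SSRS credentials live only on this server; the client never sees them.
    private static readonly HttpClient SsrsClient = new HttpClient(
        new HttpClientHandler
        {
            Credentials = new NetworkCredential("ssrs_user", "ssrs_password", "DOMAIN")
        });

    public async Task<ActionResult> Report()
    {
        // Placeholder report URL; point this at your actual Report Manager page.
        var html = await SsrsClient.GetStringAsync(
            "https://ssrs.example.com/Reports/Pages/Report.aspx?ItemPath=%2fSales%2fMonthly");

        // Hand the SSRS markup straight back to the caller.
        return Content(html, "text/html");
    }
}
```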
Just a thought.
Incidentally, you could take a look here at the various options that SSRS allows for authentication; you may find a method that suits (e.g. custom authentication). I know you mentioned you can't change anything on the server, so I'm just including it for posterity.

Load and Use a Web Cookie

I am writing a small web scraper in C#. It is going to use HttpWebRequest to get the HTML file, find the required data, then report it back to the caller.
The required data is only available when a user is logged in. As I am new to interfacing programmatically with HTTP, JavaScript, et al., I am not going to try to log on programmatically. The user will have to log on to the website, and my program will get the stored cookie and load it into the CookieContainer for the HTTP request.
I've done enough research to know that the data belongs in the CookieContainer (I think), but I can't seem to find an example anywhere of how to find a cookie created by IE (or Firefox, or Chrome, etc.), load it programmatically, populate the CookieContainer, and send it with an HTTP GET request. So how does one do all that?
Thanks!
I'm afraid you can't do that. The main reason is security: because cookies are used to identify a user, the browser can't provide easy access to them; otherwise it would be really easy to steal them.
You would be better off learning how to log the user in with HttpWebRequest or a similar class.
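For illustration, a bare-bones forms login with HttpWebRequest and a shared CookieContainer; the login URL and form field names are made up and would have to match the real site:

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

class LoginScraper
{
    static void Main()
    {
        var cookies = new CookieContainer();

        // POST the login form; the session cookie ends up in the container.
        var login = (HttpWebRequest)WebRequest.Create("https://example.com/login");
        login.Method = "POST";
        login.ContentType = "application/x-www-form-urlencoded";
        login.CookieContainer = cookies;

        var body = Encoding.UTF8.GetBytes("username=me&password=secret");
        using (var stream = login.GetRequestStream())
            stream.Write(body, 0, body.Length);
        login.GetResponse().Close();

        // Reuse the same container so the next request is "logged in".
        var page = (HttpWebRequest)WebRequest.Create("https://example.com/members/data");
        page.CookieContainer = cookies;
        using (var reader = new StreamReader(page.GetResponse().GetResponseStream()))
            Console.WriteLine(reader.ReadToEnd());
    }
}
```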

Capture Website Thumbnail without Security Concerns

I've been looking for a way to navigate to a website URL and capture a thumbnail of what the home page looks like. I found a solution on CodeProject using the web browser control, but people were saying it is not for production use and there are security risks (malicious stuff on the web page, etc.).
Is there any 'safe' way to do this without worrying about downloading a virus, etc.? It would be nice to capture the page as it really is, but perhaps the only way to do that is to disable JavaScript? I'm using ASP.NET C#.
You can use free online services within your application: 20+ Free Online Website Thumbnails Generators
After you register a free account on for example w3snapshot.com, you will be able to request thumbnails:
http://images.w3snapshot.com/?url=http://www.google.com&size=L&key=1234567890&format=jpg&quality=80
I think you can pick whichever one has monthly/daily request limits that work for you.
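As a sketch, downloading such a thumbnail from C# is then just a file download; the URL follows the format shown above, with a placeholder API key:

```csharp
using System.Net;

class ThumbnailDownload
{
    static void Main()
    {
        // URL format copied from the example above; the key is a placeholder
        // for whatever the service gives you when you register.
        var url = "http://images.w3snapshot.com/?url=http://www.google.com"
                + "&size=L&key=YOUR_API_KEY&format=jpg&quality=80";

        using (var client = new WebClient())
        {
            // Save the returned JPEG next to the executable.
            client.DownloadFile(url, "homepage-thumbnail.jpg");
        }
    }
}
```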

Send an HTTP POST with the default browser in C#

I am wondering if it is possible to send POST data with the default browser of a computer in C#.
Here is the situation. My client would like their C# application to be able to open the browser and send client information to a web form. This web form would be behind a login screen. The assumption on the application side is that once the client data is sent to the login screen, the login screen would pass that information on to the web form to prepopulate it. This would be done over HTTPS, and the client would like this to be done with a POST and not a GET, as the client information would be sent as plain text.
I have found some wonderful solutions that do POSTs and handle the requests. As an example:
http://geekswithblogs.net/rakker/archive/2006/04/21/76044.aspx
So the TL;DR version of this would be
1) Open Browser
2) Open some URL with POST data
Thanks for your help,
Paul
I've handled a similar situation once by generating an HTML page on the fly with a form set up with hidden values for everything. There was a bit of JavaScript on the page so that when it loaded, it would submit the form, thereby posting the data as necessary.
I suspect this method would work for you.
Generate a dictionary of fields and values
Generate an HTML page with JavaScript that automatically submits the form when the page loads
Write the page to a temp location on disk
Launch the default browser with that page (see the sketch below)
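A sketch of those four steps, with a made-up URL and field names; on a default Windows/.NET Framework setup, Process.Start on the temp file opens it in whatever browser is the default:

```csharp
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Net;

class BrowserPost
{
    static void Main()
    {
        // Hypothetical target URL and form fields.
        PostViaDefaultBrowser("https://example.com/webform", new Dictionary<string, string>
        {
            { "clientName", "Jane Doe" },
            { "clientId", "12345" }
        });
    }

    static void PostViaDefaultBrowser(string url, Dictionary<string, string> fields)
    {
        // One hidden input per field, HTML-encoded.
        var inputs = string.Concat(fields.Select(kv =>
            string.Format("<input type=\"hidden\" name=\"{0}\" value=\"{1}\" />",
                WebUtility.HtmlEncode(kv.Key), WebUtility.HtmlEncode(kv.Value))));

        // The onload handler submits the form as soon as the page opens.
        var html = "<html><body onload=\"document.forms[0].submit()\">"
                 + "<form method=\"POST\" action=\"" + url + "\">" + inputs + "</form>"
                 + "</body></html>";

        // Write to a temp file and open it with the default browser (shell execute).
        var path = Path.Combine(Path.GetTempPath(), "autopost.html");
        File.WriteAllText(path, html);
        Process.Start(path);
    }
}
```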
Remember, though, that POST data is sent as plain text as well. POST is generally the way to go for more than a couple of fields, as you can fit in more data (URLs are limited to about 2048 bytes) and your user has a friendlier URL to see in their browser.
Nothing is sent as plain text when you use SSL; it is encrypted. Unless you control what the default browser is (IE, Firefox, Chrome, etc.), you'll have to figure out which browser is the default and use its API to do this work (if that's even possible).
What would probably be much faster and more efficient would be to open the default browser by invoking a URL with Process.Start and pass the information on the query string (this is doing a GET instead of a POST, which I know isn't what you're asking for).
The response from the server could be a redirect, and the redirect could send down the filled-out form (storing the values in session or something similar).
That way the complexity is pushed to the website and not the windows application, which should be easier to update if something goes wrong.
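For completeness, that GET-based alternative is just a Process.Start call with the values on the query string (placeholder endpoint and parameters):

```csharp
using System;
using System.Diagnostics;

class BrowserGet
{
    static void Main()
    {
        // Placeholder endpoint and parameters; the server would read them,
        // stash the values in session, and redirect to the pre-filled form.
        var url = "https://example.com/prefill"
                + "?clientName=" + Uri.EscapeDataString("Jane Doe")
                + "&clientId=" + Uri.EscapeDataString("12345");

        // Opening a URL launches the default browser, which performs the GET.
        Process.Start(url);
    }
}
```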
HTH
Can you compile your logic in C# and then call it from PowerShell? From PowerShell you can very easily automate Internet Explorer. This is IE-only, but you might also be able to use WatiN.
Anything you put at the end of the URL counts as the query string, which is what a GET fills. It is more visible than POSTed data in the body, but no more secure with regard to a sniffer.
So, in short, no.

Writing a crawler that stays logged in with any server

I am writing a crawler. Once the crawler logs into a website, I want it to stay logged in. How can I do that? Can a client (like a browser, a crawler, etc.) make a server obey this rule? This scenario could occur when the server allows only a limited number of logins per day.
A "logged-in state" is usually represented by cookies. So what you have to do is store the cookie information sent by that server on login, then send that cookie with each of your subsequent requests (as noted by Aiden Bell in his message, thx).
See also this question:
How to "keep-alive" with cookielib and httplib in python?
A more comprehensive article on how to implement it:
http://www.voidspace.org.uk/python/articles/cookielib.shtml
The simplest examples are at the bottom of this manual page:
https://docs.python.org/library/cookielib.html
You can also use a regular browser (like Firefox) to log in manually. Then you'll be able to save the cookie from that browser and use it in your crawler. But such cookies are usually valid only for a limited time, so it is not a long-term, fully automated solution. It can be quite handy for downloading content from a website once, however.
UPDATE:
I've just found another interesting tool in a recent question:
http://www.scrapy.org
It can also do such cookie based login:
http://doc.scrapy.org/topics/request-response.html#topics-request-response-ref-request-userlogin
The question I mentioned is here:
Scrapy domain_name for spider
Hope this helps.
