I am writing a crawler. Once the crawler logs into a website, I want it to stay logged in permanently. How can I do that? Can a client (a browser, a crawler, etc.) make a server obey this rule? This scenario could occur when the server allows only a limited number of logins per day.
"Logged-in state" is usually represented by cookies. So what your have to do is to store the cookie information sent by that server on login, then send that cookie with each of your subsequent requests (as noted by Aiden Bell in his message, thx).
See also this question:
How to "keep-alive" with cookielib and httplib in python?
A more comprehensive article on how to implement it:
http://www.voidspace.org.uk/python/articles/cookielib.shtml
The simplest examples are at the bottom of this manual page:
https://docs.python.org/library/cookielib.html
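The links above are Python-specific (cookielib); as a rough, language-agnostic sketch of the same idea - log in once, let a cookie container capture the Set-Cookie response, and replay it on every later request - here is a minimal C# example. The URLs and form field names are made up for illustration.
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class CrawlerLogin
{
    static async Task Main()
    {
        // The CookieContainer stores whatever cookie the server sets at login
        var cookies = new CookieContainer();
        var handler = new HttpClientHandler { CookieContainer = cookies };
        using (var client = new HttpClient(handler))
        {
            // Log in once; the Set-Cookie response header is captured automatically
            var loginForm = new FormUrlEncodedContent(new Dictionary<string, string>
            {
                ["username"] = "me",      // hypothetical form field names
                ["password"] = "secret"
            });
            await client.PostAsync("https://example.com/login", loginForm);

            // Later requests reuse the stored cookie, so the session stays logged in
            string page = await client.GetStringAsync("https://example.com/protected/page");
            Console.WriteLine(page.Length);
        }
    }
}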
You can also use a regular browser (like Firefox) to log in manually. Then you'll be able to save the cookie from that browser and use it in your crawler. But such cookies are usually valid only for a limited time, so it is not a long-term, fully automated solution. It can be quite handy for downloading content from a website once, however.
UPDATE:
I've just found another interesting tool in a recent question:
http://www.scrapy.org
It can also do such cookie based login:
http://doc.scrapy.org/topics/request-response.html#topics-request-response-ref-request-userlogin
The question I mentioned is here:
Scrapy domain_name for spider
Hope this helps.
I am creating a console application in C# (Visual Studio), but I don't know where to start.
First I want to log in (PhantomJS or Selenium), then go to a (specified) website URL and extract the HTML.
I want to know how to save the login information in my web request.
Thank you.
Long story short, it's not easy to do that with a plain web request, because each site has its own way of managing cookies and security.
It's easier if you use a web browser control to log in first. From there, the browser can obtain a valid cookie and you can start crawling data from there.
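A rough sketch of that idea, assuming a WinForms WebBrowser control (the login and data URLs are placeholders):
using System;
using System.IO;
using System.Net;
using System.Windows.Forms;

// Sketch: log in inside a WebBrowser control, then reuse the cookie it holds
// for ordinary HTTP requests when crawling.
public class BrowserLoginForm : Form
{
    private readonly WebBrowser browser = new WebBrowser { Dock = DockStyle.Fill };

    public BrowserLoginForm()
    {
        Controls.Add(browser);
        browser.DocumentCompleted += OnDocumentCompleted;
        browser.Navigate("https://example.com/login"); // placeholder login URL
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // In practice you would wait until the post-login page has loaded
        // before grabbing the cookie; this is simplified.
        string cookieHeader = browser.Document.Cookie;

        // Attach that cookie to a plain request and start crawling
        var request = (HttpWebRequest)WebRequest.Create("https://example.com/data"); // placeholder URL
        request.Headers[HttpRequestHeader.Cookie] = cookieHeader;
        using (var response = request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            string html = reader.ReadToEnd();
            Console.WriteLine(html.Length);
        }
    }

    [STAThread]
    static void Main() => Application.Run(new BrowserLoginForm());
}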
I've done a similar thing with Chegg website. For details, you can check out my repository https://github.com/hungqcao/chegg-solutions-saver
In your case, it can get a little complicated since FB, Twitter may have 2-factor authentication or something similar to that but the idea stays the same.
Let me know if you need help.
I have a program that opens a web browser control and just displays a web page from our server. They can't navigate around or anything.
The users are not allowed to know the credentials required to login, so after some googling on how to log into a server I found this:
http://user_name:password@URL
This is hard-coded into the web browser control's code. It works fine.
HOWEVER: Some smart ass managed to grab the credentials by using Wireshark, which captures all the packets sent from your machine.
Is there a way I can encrypt this so the users cannot find out?
I've tried other things like using POST, but with the way the page was set up, it was proving extremely difficult to get working. (It's an SSRS Report Manager web page.)
I forgot to include a link to this question: How to encrypt/decrypt the url in C#
I cannot use that answer, as I am not allowed to change any of the server setup!
Sorry if this is an awful question, I've tried searching around for the past few days but can't find anything that works.
Perhaps you could work around your issue with a layer of indirection - for example, you could create a simple MVC website that doesn't require any authentication (or indeed, requires some authentication that you fully control) and it is this site that actually makes the request to the SSRS page.
That way you can have full control over how you send authentication, and you need never worry about someone ever getting access to the actual SSRS system. Now if your solution requires the webpage to be interactive then I'm not sure this will work for you, but if it's just a static report, it might be the way to go.
i.e. your flow from the app would be:
User logs into your app (or use Windows credentials, etc)
User clicks to request the SSRS page
Your app makes an HTTP request to your MVC application
Your MVC application makes the "real" HTTP request to SSRS (e.g. via HttpClient) and dumps the result back to the caller (for example, it could write the SSRS response via @Html.Raw in an MVC view; a rough sketch of this step follows below). The credentials for SSRS will therefore never be sent by your app, so you don't need to worry about that problem any more...
Just a thought.
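A rough sketch of step 4, assuming an ASP.NET MVC controller and that the SSRS instance accepts Windows/NTLM credentials (the URL, credentials, and names below are placeholders):
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using System.Web.Mvc;

public class ReportProxyController : Controller
{
    // Credentials live only on the server; the browser never sees them.
    private const string SsrsUrl =
        "http://reportserver/Reports/Pages/Report.aspx?ItemPath=/MyReport"; // placeholder
    private static readonly NetworkCredential SsrsCredential =
        new NetworkCredential("svc_report_user", "secret"); // placeholder

    public async Task<ActionResult> MyReport()
    {
        var handler = new HttpClientHandler { Credentials = SsrsCredential };
        using (var client = new HttpClient(handler))
        {
            // Fetch the SSRS page server-side and hand the HTML back to the caller
            string html = await client.GetStringAsync(SsrsUrl);
            return Content(html, "text/html");
        }
    }
}
Alternatively, the string could be passed to a Razor view and written out with @Html.Raw, as mentioned above.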
Incidentally, you could take a look here for the various options that SSRS allows for authentication; you may find some method that suits (e.g. custom authentication) - I know you mentioned you can't change anything on the server, so I'm just including it for posterity.
I want to get the search terms that user typed on Google to get to my long-tail landing page (and use them on that page).
Getting the the "q" variable from the query string using the response referrer (in ASP C#) works well but only if the referring Google page was not loaded as https.
This is obviously a problem due to the fact that almost everyone is logged in to their Google accounts on their browsers all the time and, if they are, all Google pages will be automatically loaded (and redirected) to use https.
When a user (on https://www.google.com) searches for something and clicks a search result, Google seems to redirect the user to an intermediate page that strips the request of its query string and replaces it with a different one that pretty much only contains the URL that the intermediate page should redirect to (i.e. the URL of my long-tail landing page).
Is there any way that I can get the original search terms that were used on https://www.google.com anyway? Maybe if JavaScript could access the browser history or something similar?
Is there any way that I can get the original search terms that were used on https://www.google.com
No. The full content of the HTTPS session is secured via SSL; this includes headers, URLs, etc. In your scenario, for security reasons, browsers omit the Referer header, so you won't be able to access it (unless the destination URL is also secured via HTTPS). This is part of the HTTP spec: 15.1.3 Encoding Sensitive Information in URI's.
The only thing you can do is put a disclaimer on your site to say it doesn't work over https.
Since it is Google, it is not possible, because there is no shared link with your website.
Once you are on HTTPS, it does not allow sending of the Referer header. I am sure you are aware that headers can be manipulated and cannot be trusted, but you may trust Google. However, due to its privacy policy, any activity done on Google by Google users is not shared with third parties. Link
Again, in server-side languages you can find functions for the HTTP referrer but not an HTTPS referrer. That is for a reason!
Unless you have a collaboration with the originating server, which might make an exception to send the referrer only for your website, it isn't possible.
Hope that helps! (in moving on) :)
EDIT: Wikipedia link: see "Referrer hiding" (second-to-last line).
To see the referrer data you need to either be a paying Google Ads customer (and the visitor comes via an ad click) or have your site on HTTPS as well. Certs are cheap these days, or you could use an intermediary like CloudFlare to handle the SSL and keep a self-signed cert on your site.
You can also see queries no matter the method used, with Google Webmaster tools.
I wrote a little about this here: http://blogs.dixcart.com/public/technology/2012/03/say-goodbye-to-keyword-tracking.html
So I play an online game that's web-based and I'd like to automate certain things with it using C#. The problem is that I can't simply use WebClient.DownloadData() because I need to be logged in to actually receive the source. The other alternative was to use the built-in web browser control, but that doesn't give me access to the source code. Any suggestions?
I don't think NetworkCredentials will work in all cases. This only works with "Basic" or "Negotiate" authentication.
I've done this before with an internal website for some load testing, but it sounds like you are trying to "game" the game. For that reason I won't go into details, but the login to the site is probably being done in the form of an HTTP POST when you hit the login button.
You'd have to trap the POST request and replicate it in your code, and make sure that your implementation maintains the session state as well, because if the game site is written well at all, it will make sure that the current session has logged in before doing anything game-related.
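A sketch of that approach (the login URL and form field names are hypothetical), using a shared CookieContainer so the session cookie from the login POST is sent on later requests:
using System;
using System.IO;
using System.Net;
using System.Text;

class GameClient
{
    static void Main()
    {
        // One CookieContainer shared by all requests = the session state
        var cookies = new CookieContainer();

        // 1. Replicate the captured login POST (field names are hypothetical)
        var login = (HttpWebRequest)WebRequest.Create("https://example-game.com/login");
        login.Method = "POST";
        login.ContentType = "application/x-www-form-urlencoded";
        login.CookieContainer = cookies;
        byte[] body = Encoding.UTF8.GetBytes("username=me&password=secret");
        using (var stream = login.GetRequestStream())
        {
            stream.Write(body, 0, body.Length);
        }
        login.GetResponse().Close();

        // 2. Later requests carry the session cookie, so the site sees a logged-in user
        var page = (HttpWebRequest)WebRequest.Create("https://example-game.com/town");
        page.CookieContainer = cookies;
        using (var reader = new StreamReader(page.GetResponse().GetResponseStream()))
        {
            Console.WriteLine(reader.ReadToEnd().Length);
        }
    }
}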
You can set the login credentials on the webclient using its Credentials property before calling DownloadData:
// Works when the server uses challenge-response auth (e.g. Basic or Negotiate):
WebClient client = new WebClient();
client.Credentials = new NetworkCredential("username", "password");
byte[] data = client.DownloadData("http://example.com/page"); // example URL
EDIT: As mjmarsh points out, this will only work for sites that use a challenge-response method of authentication as part of a single request (I'm so used to dealing with this at work, I hadn't considered the other types!). If the site uses forms authentication (or indeed any other form of authentication), this method will not work as the authentication is not part of a single request - multiple requests are needed that you will need to handle yourself.
Network credentials will not work as mjmarsh has already pointed out.
While web scraping, we come across a lot of pages where a login is needed. One of the approaches I use is to install Fiddler and monitor the POST and GET requests while manually logging in to the site. This lets you find out how the browser performs the login. Then you need to recreate the same process in code.
For example, most web servers use cookies to track whether the session is authenticated. So you can post the UserName and Password to the website with your credentials and record the cookie. This cookie can then be used to access any further details on the website.
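A minimal sketch of that "post the credentials, record the cookie, replay it" flow with WebClient (the URLs and field names are made up; a real crawler would trim the Set-Cookie value down to its name=value pairs):
using System;
using System.Collections.Specialized;
using System.Net;

class CookieRecorder
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            // Post the credentials exactly as the browser did (captured with Fiddler)
            var form = new NameValueCollection
            {
                { "UserName", "me" },     // hypothetical field names
                { "Password", "secret" }
            };
            client.UploadValues("https://example.com/login", form);

            // Record the session cookie the server set
            string setCookie = client.ResponseHeaders[HttpResponseHeader.SetCookie];

            // Replay it on further requests to stay authenticated
            client.Headers[HttpRequestHeader.Cookie] = setCookie;
            string html = client.DownloadString("https://example.com/inbox");
            Console.WriteLine(html.Length);
        }
    }
}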
Please check the following link to find out more about advanced web scraping:
http://krishnan.co.in/blog/post/Web-Scraping-Yahoo-Mail.aspx
In this blog post, you will find out how to authenticate to a Yahoo account and then read the page after authentication.
We are rolling out a site for a client using IIS tomorrow.
I am to take the site down for the general public (with a "Sorry, we are updating" message) and allow the client to test over the weekend after we perform the upgrade.
If it is successful, I open it to everybody - if not, I roll back.
What is the easiest way to put a "We're not open" sign for the general public, but leave the rest open to testers?
Redirect via IIS. Create a new website in IIS and put your "Sorry, updating" message in its Default.aspx. Then switch ports between the real site (which will go from 80 to something else, e.g. 6666) and the 'maintenance' site (set on 80).
Then tell your testers to go to yoursite.com:6666.
Then switch the real site back to 80 after taking down the 'maintenance' site.
I thought it would be worthwhile to mention ASP.NET 2.0+'s "app offline" feature. (Yes, I realize the questioner wants to leave the app up for testing, but I'm writing this for later readers who might come here with different needs).
If you really want to take the application offline for everyone (for instance to do server maintenance) there is a very simple option. All you have to do in ASP.NET 2.0 and higher is put a file with this name:
app_offline.htm
...in the root directory of your ASP.NET application. Put an appropriate "sorry come back later" message in there. That's it. The ASP.NET runtime does the rest.
Details on Scott Guthrie's blog.
Require that testers login. You can even hide the login page so that you need a direct link to even see it. Then, for all people not logged in, redirect to the page that displays your message.
Fire up another "site" in IIS which will catch your host-header for your primary site. Use either a custom 307/503/404 page that has "we're down for maintainance" or use some sort of URL-rewrite to redirect people to your single static file.
Switch the host-header binding on your real site to something else, like dev.domain.com or testing.domain.com, that your developers use.
Or block by IP, and have your custom "Not authorized" page tell visitors that you're down for maintenance.
You have several options.
Some methods that I've used before:
Windows authentication and/or separate subdomains for client to test.
Disable anonymous website access in IIS and give your client a username/password combo to test the website.
Disable default document in IIS and give your client an absolute URL to the main index file.
We tend to have a login page and an include file across all pages in the site (usually the DB connection, as it's included in all files) that checks for a valid logged-in session. If you've not logged in, you get a message saying the site's down for maintenance.
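A rough ASP.NET illustration of that pattern, written as an HttpModule rather than an include file (the session key, page name, and the module itself are made up; it would be registered in web.config):
using System;
using System.Web;

// Sketch: redirect anyone without a logged-in session to the maintenance page.
// (A real version would also exempt the login page itself.)
public class MaintenanceGateModule : IHttpModule
{
    public void Init(HttpApplication app)
    {
        // PostAcquireRequestState fires after session state is available
        app.PostAcquireRequestState += (sender, e) =>
        {
            HttpContext context = ((HttpApplication)sender).Context;

            bool isTester = context.Session != null && context.Session["LoggedIn"] != null;
            bool isMaintenancePage = context.Request.Path
                .EndsWith("maintenance.aspx", StringComparison.OrdinalIgnoreCase);

            if (!isTester && !isMaintenancePage)
                context.Response.Redirect("~/maintenance.aspx"); // "we're updating" page
        };
    }

    public void Dispose() { }
}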