What sort of web host lets you run crawlers on it? - c#

I'm working on a graduation project for one of my university courses, and I need to find somewhere to run several crawlers I wrote in C#. With no web hosting experience, I'm a bit lost. Is this something that any host allows? Do I need a special host that gives more access to the server? The crawler is a simple app that does its work, then periodically writes information to a remote database.

A web crawler simulates a normal user: it accesses sites the way a browser does, getting back the HTML (JavaScript, etc.) the server returns, with no internal access to server code. That being so, any public site can be crawled.
Be aware of web crawler ethics guidelines, though. There are pages you shouldn't index and links you shouldn't follow, and web developers publish files and directives (such as robots.txt) telling crawlers what they may index or follow.
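For illustration, a minimal sketch of that kind of fetch in C# using HttpClient; the URLs and user-agent string are placeholders, and a real crawler would actually parse robots.txt rather than merely download it:

using System;
using System.Net.Http;
using System.Threading.Tasks;

class CrawlerSketch
{
    static async Task Main()
    {
        using var client = new HttpClient();
        // Identify the crawler honestly; many sites block anonymous bots.
        client.DefaultRequestHeaders.UserAgent.ParseAdd("MyUniCrawler/1.0");

        // A polite crawler consults robots.txt first (parsing omitted here).
        string robots = await client.GetStringAsync("https://example.com/robots.txt");

        // Fetch the raw HTML exactly as a browser would receive it.
        string html = await client.GetStringAsync("https://example.com/");
        Console.WriteLine($"Fetched {html.Length} characters of HTML");
    }
}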

If you can't run it off your desktop for some reason, you'll need a host that lets you execute arbitrary C# code. Most cheap web hosts don't allow this due to the potential security implications, since several other customers will be running on the same server.
This means you'll need to be on a server where you have your own OS. Either a VPS - Virtual Private Server, where virtualization is used to give you your own OS but share the hardware - or your own dedicated server, where you have both the hardware and software to yourself.
Note that if you're running on a server that's shared in any way, you'll need to make sure to throttle yourself so as to not cause problems for your neighbors; your primary issue will be not using too much CPU or bandwidth. This isn't just for politeness - most web hosts will suspend your hosting if you're causing problems on their network, such as denying the other users of the hardware you're on resources by consuming them all yourself. You can usually burst higher usage levels, but they'll cut you off if you sustain them for a significant period of time.
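To make that concrete, here is one minimal way to self-throttle in C#; the two-second delay is an arbitrary example value you would tune to your host's limits:

using System;
using System.Net.Http;
using System.Threading.Tasks;

class ThrottledFetcher
{
    static readonly HttpClient Client = new HttpClient();

    static async Task Main()
    {
        string[] urls = { "https://example.com/a", "https://example.com/b" };
        foreach (var url in urls)
        {
            string html = await Client.GetStringAsync(url);
            Console.WriteLine($"{url}: {html.Length} chars");
            // Pausing between requests keeps CPU and bandwidth usage
            // low and steady instead of bursty.
            await Task.Delay(TimeSpan.FromSeconds(2));
        }
    }
}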

This doesn't seem to have anything to do with web hosting. You just need a machine with an internet connection and a database server.
I'd check with your university if I were you. At least in my time, a lot was possible to arrange in-house when it came to graduation projects.
Failing that, you could look into a simple VPS (Virtual Private Server) account. Unless you are sure your app runs under Mono, you will need a Windows one. The resource limits are usually a lot lower than you'd get from a dedicated server, but they're relatively affordable. Some hosts will offer an MS SQL Server database you can use alongside the VPS account (on another machine); installing SQL Server on the VPS itself can be a problem license-wise.
Make sure you check the terms of use before you open an account, as well as the (virtual) system specs. Also check whether there is some kind of minimum contract period; sometimes it can be longer than a single month, especially if there is no setup fee.
If at all possible, find a host that's geographically close to you. A server on the other side of the world can get a little annoying to access remotely using Remote Desktop.

80legs lets you use their crawlers to process millions of web pages with your own program.
The rates are:
$2.00 per million pages
$0.03 per CPU-hour
They claim to crawl 2 billion web pages a day.

You will need a VPS (virtual private server) or a full-on dedicated server. Crawlers are nothing more than applications that "crawl" the internet. While you could set up a web site to act as a crawler, it isn't practical, because the page would have to be requested for your crawler to do any work. Read the host's ToS (terms of service) to see what the usage terms are. Some of the lower-priced hosts will cut your connection, citing "negatively impacting the network," if you use too much bandwidth, even though they nominally gave you plenty to use.
VPSes run around $30-80 for a Linux server and $60+ for a Windows server.
Dedicated servers run $100+ for both Linux and Windows.

You don't need any web hosting to run your spider. Just ask for a PC with an internet connection that can act as a dedicated server, configure the database, and run the crawler from there.

Related

Hosting a RESTful API

I'm a junior C# ASP.NET dev and have worked with Code First C# databases, RESTful APIs, MVC, and Vue (a frontend framework sort of like React) to create websites.
Now at work and during my education, I've never handled deployment.
At this time I have a personal project. I have successfully hosted my relational MySQL database (administered through phpMyAdmin) and can update it from my local desktop.
My hosting site let me know they do not host C# or anything of the sort.
I found some posts suggesting Azure, AWS, and others, but for every post recommending one I find just as many people protesting it.
What is a good site to host my first REST API? I'm looking for something that can go beyond a minimum viable product, and I'd like to keep my website under the hosting service I'm currently using (so the API would not share hosting with the website).
What would the cost look like for an API that's deployed and being used by clients?
I realize this cost depends on the amount of traffic, but assume a basic API used for, let's say, posting orders in an online shop (whether via a website, an app, or whatever else, it would all communicate through the API).
Any tips are welcome as I feel I'm swimming in the dark researching this.
Thank you
Any hosting service that grants you real access to a machine will be able to run your API, as will services specialized for the .NET/Core ecosystem.
I assume you know about PHP, given the phpMyAdmin service. The ecosystem of hosts that support PHP, although cheaper, does not exactly give you access to the machine, and probably will not support .NET/Core, like numerous other tech stacks.
As a junior developer, I believe you should get a little practice in every deployment ecosystem, so I encourage you to try most of the big clouds (Azure, GCP, AWS), but also some smaller hosts, to gain experience and understand the differences in deployment and ecosystem.
Azure will be really easy: you can create an account and publish your API at no cost using a free Web App, and Visual Studio even has publishing tools that handle 90% of the job. GCP will be a little trickier and will require you to know a bit about containers and clusters. If you go for a non-specialized host like DigitalOcean, you will need to understand more about the operating system and the associated servers/controllers to deploy and publish.
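For a sense of scale, the kind of API that fits a free Azure Web App can be very small. This is a generic ASP.NET Core minimal-API sketch (assuming .NET 6 or later), with a made-up /orders endpoint echoing the shop scenario from the question:

// Program.cs
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// Hypothetical endpoint for the "posting orders" scenario.
app.MapPost("/orders", (Order order) =>
    Results.Created($"/orders/{order.Id}", order));

app.MapGet("/orders/{id}", (int id) =>
    Results.Ok(new Order(id, "sample item")));

app.Run();

record Order(int Id, string Item);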
The cost part is a lot more difficult: it depends on the host you are using and on the load (processing, memory, size, and throughput of data). In my experience, some very small-scale APIs required more processing or memory to accomplish tasks like PDF generation than a medium-scale API that only handled JSON data transactions.

What happens to open sockets when scaling down Azure web app?

We have a scenario where we use multiple web apps in Azure. When scaling up, I understand Azure simply starts more web processes and thereby allows connections to multiple servers; there's a broadcast system in place for synchronization. The issue is what happens to an open socket if we manually or automatically scale down. Say we have 5 servers, each with an open web socket, and we scale down to 1: what happens to the 4 sockets that were connected to the servers being removed?
As a side note, if they stick around until the client disconnects, will Azure bill me for that time?
If they don't stick, it's only a matter of making sure the client reconnects properly.
From what I've seen so far, they seem to stick, but that might just be a grace period while it's scaling down, so I'd rather be on the safe side here with an answer from someone who actually knows.
From another thread a few years ago: it's the newest instance that is removed (most of the time), but I cannot find anything about it waiting for connections to drop.
Which instances are stopped when I scale my Azure role down?
There is, however, a management API that you can use to scale down (delete) specific cloud service roles:
The Delete Role Instances operation deletes multiple role instances from a deployment in a cloud service.
POST Request
https://management.core.windows.net/<subscription-id>/services/hostedservices/<cloudservice-name>/deployments/<deployment-name>/roleinstances/
Using this you can monitor which instances you want to remove and send the delete command programmatically. That way you could wait for the users to cleanly disconnect from the instance before deleting it.
Reference to the Microsoft API doc for this:
https://msdn.microsoft.com/library/azure/dn469418.aspx
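For illustration, a rough sketch of calling that operation from C#. The classic Service Management API authenticates with a management certificate and an x-ms-version header; the certificate path, instance name, and XML body below are placeholders, and the exact schema and minimum version should be verified against the linked documentation:

using System;
using System.Net.Http;
using System.Security.Cryptography.X509Certificates;
using System.Text;
using System.Threading.Tasks;

class DeleteRoleInstancesSketch
{
    static async Task Main()
    {
        // Management certificate previously uploaded to the subscription.
        var handler = new HttpClientHandler();
        handler.ClientCertificates.Add(new X509Certificate2("management.pfx", "password"));
        using var client = new HttpClient(handler);

        // Version header required by the Service Management API (check the
        // linked doc for the minimum version this operation needs).
        client.DefaultRequestHeaders.Add("x-ms-version", "2013-08-01");

        var url = "https://management.core.windows.net/<subscription-id>" +
                  "/services/hostedservices/<cloudservice-name>" +
                  "/deployments/<deployment-name>/roleinstances/?comp=delete";

        // Body lists the specific instances to remove; here, the one our
        // monitoring says has no connected users left.
        var body = new StringContent(
            "<RoleInstances xmlns=\"http://schemas.microsoft.com/windowsazure\">" +
            "<Name>WebRole1_IN_3</Name></RoleInstances>",
            Encoding.UTF8, "application/xml");

        var response = await client.PostAsync(url, body);
        Console.WriteLine(response.StatusCode);
    }
}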
So, after quite a bit of development and testing, there's an answer. We're deploying using Kudu, so Azure builds and publishes the web app. The IIS instances that have open websockets will run their Application_End cycle and shut down the TCP connection.
As far as I can tell so far, this happens before the new site is spun up and accepts connections, so there is no fear of being billed for additional hours. This also seems to occur for all web apps (sites) within the plan when scaling the web app plan (server farm), no matter whether it's up or down.
This might be an inconvenience for our users, but with a proper shutdown on the server side and reconnection from the client side it should work out just fine.
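As a sketch of what that "proper shutdown on the server side" can look like in Global.asax, assuming the app keeps its own registry of open sockets (the registry itself is hypothetical):

// Global.asax.cs
using System;
using System.Collections.Concurrent;
using System.Net.WebSockets;
using System.Threading;

public class Global : System.Web.HttpApplication
{
    // Assumed to be populated by the WebSocket handler as clients connect.
    public static readonly ConcurrentDictionary<Guid, WebSocket> OpenSockets =
        new ConcurrentDictionary<Guid, WebSocket>();

    protected void Application_End()
    {
        foreach (var pair in OpenSockets)
        {
            // A normal close frame tells well-behaved clients to reconnect
            // (to whichever instance the load balancer picks next).
            pair.Value.CloseAsync(WebSocketCloseStatus.EndpointUnavailable,
                "Server shutting down", CancellationToken.None).Wait();
        }
    }
}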
There's only one way to find out: test it. (Or ask an Azure engineer, but that might take ages.)
I would assume that it would not scale down a machine while someone is connected. Imagine watching a stream and having it randomly stop so you can connect to another server. I wouldn't think Microsoft would design it to drop connections.
The first paragraph mentions web roles:
On the Scale page of the Azure Management Portal, you can manually scale your application or you can set parameters to automatically scale it. You can scale applications that are running Web Roles, Worker Roles, or Virtual Machines. To scale an application that is running instances of Web Roles or Worker Roles, you add or remove role instances to accommodate the work load.
Web apps or web roles require the use of a VM. This is detailed in the first bullet point listed:
You should consider the following information before you configure scaling for your application:
•You must add Virtual Machines that you create to an availability set to scale an application that uses them. The Virtual Machines that you add can be initially turned on or turned off, but they will be turned on in a scale-up action and turned off in a scale-down action. For more information about Virtual Machines and availability sets, see Manage the Availability of Virtual Machines.
The information that follows the bullet points details the scaling process.
For additional information, this link also mentions the use of VMs for web apps. The text below can be found in the section titled Web App Concepts:
Auto Scaling - Web Apps enables you to quickly scale-up or out to handle any incoming customer load. Manually select the number and size of VMs or set up auto-scaling to scale your servers based on load or schedule.
https://azure.microsoft.com/en-us/documentation/articles/app-service-web-overview/

Create a file from the browser

I'm looking for a way to establish simple communication between a C# web application and the operating system.
Since I'm working in Silverlight, I get everything I need to create files in any folder on the C:\ drive. The problem is that we're going to migrate from Silverlight to HTML5 / C#.
So I need a way to create files FROM any browser ON any OS: Windows, Mac, Linux...
I thought about using Microsoft ActiveX, but that's not cross-platform.
I'm simply looking for a technology/plugin/software, or anything that would allow me to do that; the less client interaction, the better.
I think your need is in conflict with any common sense about security. If there were a simple way to create any file on any computer that loads your web app, just imagine how quickly all sorts of malware would spread.
But going back to your question: I think it will not be simple (by the way, was it really simple in Silverlight?). What I can imagine is some kind of service running on the client PC (the user would have to install it, or it could be corporate policy if your web app targets corporate environments). The service would listen on some TCP port, and your web app would send requests to that port asking it to create a particular file with particular content. All the security concerns would then be implemented in that service, so that it doesn't get abused by hostile web apps.
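To make the idea concrete, here is a bare-bones sketch of such a service using HttpListener (HTTP over a local TCP port, for simplicity); the port, folder, and request format are all invented, and real code would need authentication on top of the path check shown:

using System;
using System.IO;
using System.Net;

class FileWriterService
{
    static void Main()
    {
        var listener = new HttpListener();
        listener.Prefixes.Add("http://localhost:8923/createfile/");
        listener.Start();

        while (true)
        {
            var context = listener.GetContext();
            // File name comes from a query parameter; body is the content.
            string name = context.Request.QueryString["name"];
            // SECURITY: confine writes to one folder and strip any directory
            // components, or any web page could write anywhere on the disk.
            string safePath = Path.Combine(@"C:\AllowedFolder",
                Path.GetFileName(name));
            using (var file = File.Create(safePath))
                context.Request.InputStream.CopyTo(file);

            context.Response.StatusCode = 200;
            context.Response.Close();
        }
    }
}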

Obtaining Files From A Local Web Server or Network Share - Which Is Better

Here's the scenario.
I have to access a web service on the local LAN to obtain a list of files which I then must retrieve from the machine running the web service. The question has arisen whether to use a mapped drive or just retrieve the files via HTTP from the web service (or web server if the service is self-hosting).
All machines are running Windows XP or later.
I am leaning towards the web server approach, because it has the fewest unknowns as far as having the necessary permissions to access the files.
So basically the question is which is the better approach - web server or network share?
I would go the web service route because it reduces the number of variables in the equation. With your current setup you already need a web service in order to get the list of files to download, and at that point you know access to the web service isn't a problem, so serving the files from there too removes a lot of unknowns.
If you put the files onto another machine, you run the risk of hitting at least the following problems that do not exist with the web service (since you already know you have access):
Permission issues
Firewall issues
I would think it depends on various factors you haven't mentioned: will lots of clients be trying to access these files at a given time? Will the app be distributed across multiple servers in the future? Might you need to implement a caching system in the future?
If the answer is no to all of these, then you should probably pick what's easiest.
I would lean towards plain old HTTP. Doing it via the web service would probably involve marshalling the file as a byte array, for example, which makes it larger. A file share means having to worry about permissions.
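For what it's worth, the plain-HTTP route can be as small as this sketch; the endpoint names are hypothetical, and WebClient is used because it has been available since the early .NET Framework versions those XP-era machines would run:

using System;
using System.Net;

class FileFetcher
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            // Ask the web service for the list, one file name per line.
            string[] files = client.DownloadString(
                "http://server/filelist").Split('\n');

            // Then fetch each file over plain HTTP.
            foreach (var name in files)
                client.DownloadFile("http://server/files/" + name.Trim(),
                    name.Trim());
        }
    }
}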

Shaky connectivity - favor web or desktop app?

I'm a desktop application developer who is temporarily working in the web. I'm working with a client that wants me to build an app for use by locations all over the state; however, these locations have very shaky connectivity.
They really want a centralized web app and are suggesting I build a "lean" web app. I don't know what a "lean web app" means: lots of small HTTP requests, or a few large ones? I tend to favor chunky over chatty, but I've never had to worry about connectivity before.
Do I suggest a desktop app that replicates data when connectivity exists? If not, what's the best way to approach a web app when connectivity is shaky?
EDIT:
I must qualify my question with further information. Assuming the web option, they've disallowed the use of browser runtime technologies and anything that requires installation. Thus Silverlight is out, Flash is out, Gears is out; only ASP.NET and JavaScript are available to me. Having stated this, part of my question was whether to use a desktop app; I suppose that can be extended to "thicker technologies".
EDIT #2: Network is homogeneous - every node is Windows. This won't be changing.
You should get a definition of what the client means by "lean" so that you don't have confusion surrounding it. Maybe present them with several options of lean that you think they might mean. One thing I've found is it's no good at all to guess about client requirements. Just get clarification before you waste a bunch of time.
Shaky connectivity definitely favors a desktop application. Web apps are great for users that have always-on Internet connections, and that might be using a variety of different browsers and operating systems.
Your client probably has locations that are all using Windows, so a desktop application is an appropriate choice. One other advantage of web applications is that they make the deployment issue easy to deal with. Auto-update technologies like ClickOnce make the deployment and update of desktop applications almost as easy.
And not to knock Google Gears, but it's relatively new and would have to be considered more risky than a tried-and-true desktop application.
Update: and if you're limited to just JavaScript on the client side, you definitely do not want to make this a web app. Your application simply will not be available whenever the Internet connection is down. There are ways to save data locally from JavaScript, using cookies, user stores, and whatnot, but you just don't want to do this.
If connectivity is that bad, I would suggest you write a WinForms app that downloads information, edits it locally, and then uploads it. That way, if your connection goes down, all you have to do is retry until it works.
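A sketch of that retry-until-it-works upload, with a placeholder endpoint; a real app would back off and keep the pending data on disk in the meantime:

using System;
using System.Net;
using System.Threading;

class RetryingUploader
{
    public static void Upload(string url, byte[] payload)
    {
        while (true)
        {
            try
            {
                using (var client = new WebClient())
                    client.UploadData(url, payload);
                return; // success: the locally edited data is on the server
            }
            catch (WebException)
            {
                // Connection is down: wait and try again.
                Thread.Sleep(TimeSpan.FromSeconds(30));
            }
        }
    }
}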
They seem to be suggesting a plain vanilla web app that doesn't use AJAX or rely on .NET postbacks or do anything that might break down horribly if your connection goes away for a bit. Instead, it should be designed so that you can hit Refresh until it works. In other words, they seem to want the closest thing to a WinForms app, only uglier.
You may consider using a framework like Google Gears to help provide functionality during network down time. This allows users to connect to the web page once (with a functioning connection) and then be able to use the web app from then on, even without a connection.
When the network is restored, the framework can sync changes back with the central database.
There is even a tutorial for using Google Gears with the .Net Framework.
Gears with other languages
You mention that connectivity is shaky at these locations but that the app needs to be centralized. One thing you might consider is using multiple decentralized read database servers and a single centralized write server. MySQL makes this possible and affordable if your app is small.
Have the main database server at the datacenter/central office. Put small web/db servers at each location with your app installed; you can even run them off a user's computer if the remote location is not too big. Make the local database servers connect to the centralized database server as replication slaves. As changes come into the centralized database, the slaves pull the data down and make it available locally. When the connection is unavailable, your app's data is still at least available, if not up to date; when the connection comes back, the database handles replicating all relevant data down.
Now all you have to do is make your app use two separate database handles: for reading it uses the local database, for writing it uses the central database.
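In code, that pattern is just two connection strings, as in this sketch (the connection strings are placeholders, and MySql.Data is the Connector/NET package):

using MySql.Data.MySqlClient;

class ReplicatedDataAccess
{
    // Local replication slave: fast, and available even when the WAN is down.
    const string LocalRead =
        "Server=localhost;Database=app;Uid=reader;Pwd=secret;";
    // Central master: the only place writes may go.
    const string CentralWrite =
        "Server=central.example.com;Database=app;Uid=writer;Pwd=secret;";

    public static MySqlConnection OpenForRead() =>
        new MySqlConnection(LocalRead);

    public static MySqlConnection OpenForWrite() =>
        new MySqlConnection(CentralWrite);
}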
