I need to build a windows forms application to measure the time it takes to fully load a web page, what's the best approach to do that?
The purpose of this small app is to monitor some pages in a website, in a predetermined interval, in order to be able to know beforehand if something is going wrong with the webserver or the database server.
Additional info:
I can't use a commercial app, I need to develop this in order to be able to save the results to a database and create a series of reports based on this info.
The WebRequest solution seems to be the approach I'm going to be using; however, it would be nice to be able to measure the time it takes to fully load the page (images, CSS, JavaScript, etc.). Any idea how that could be done?
If you just want to record how long it takes to get the basic page source, you can time an HttpWebRequest with a Stopwatch. E.g.
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(address);

System.Diagnostics.Stopwatch timer = new System.Diagnostics.Stopwatch();
timer.Start();

// GetResponse() returns once the response headers arrive; read the response
// stream as well if you want to include the time to download the HTML body.
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
timer.Stop();

TimeSpan timeTaken = timer.Elapsed;
response.Close();
However, this will not take into account time to download extra content, such as images.
[edit] As an alternative, you may be able to use the WebBrowser control and measure the time between calling .Navigate() and the DocumentCompleted event firing. I think this will also include the download and rendering time of extra content. However, I haven't used the WebBrowser control a huge amount, so I don't know whether you have to clear out a cache if you are repeatedly requesting the same page.
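A rough, untested sketch of that idea (it assumes a form with a WinForms WebBrowser control named webBrowser1, with using System.Diagnostics and using System.Windows.Forms in place; note that DocumentCompleted can fire once per frame):

private readonly Stopwatch pageTimer = new Stopwatch();

private void MeasurePage(string address)
{
    webBrowser1.DocumentCompleted += OnDocumentCompleted;
    pageTimer.Reset();
    pageTimer.Start();
    webBrowser1.Navigate(address);
}

private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // DocumentCompleted is raised for each frame; only stop on the top-level document.
    if (e.Url != webBrowser1.Url) return;

    webBrowser1.DocumentCompleted -= OnDocumentCompleted;
    pageTimer.Stop();
    TimeSpan timeTaken = pageTimer.Elapsed;
    // TODO: save timeTaken to the database here.
}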
Depending on the frequency at which you need to do it, maybe you can try using Selenium (an automated testing tool for web applications): since it internally uses a web browser, you will get a pretty close measure. I think it would not be too difficult to use the Selenium API from a .NET application (you can even use Selenium in unit tests).
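For what it's worth, a minimal sketch of timing a navigation with the Selenium WebDriver .NET bindings might look like this (GoToUrl normally blocks until the browser considers the page loaded, though exactly what "loaded" includes depends on the browser and driver; the URL is a placeholder):

using System;
using System.Diagnostics;
using OpenQA.Selenium;
using OpenQA.Selenium.Firefox;

class SeleniumLoadTimer
{
    static void Main()
    {
        IWebDriver driver = new FirefoxDriver();
        try
        {
            Stopwatch timer = Stopwatch.StartNew();
            driver.Navigate().GoToUrl("http://www.example.com");  // placeholder URL
            timer.Stop();
            Console.WriteLine("Page loaded in {0} ms", timer.ElapsedMilliseconds);
        }
        finally
        {
            driver.Quit();  // always close the browser window
        }
    }
}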
Measuring this kind of thing is tricky because web browsers have some particularities in how they download all the page's elements (JS, CSS, images, iframes, etc.) - these particularities are explained in this excellent book (http://www.amazon.com/High-Performance-Web-Sites-Essential/dp/0596529309/).
A homemade solution would probably be too complex to code, or would fail to account for some of those particularities (measuring the time spent downloading the HTML alone is not good enough).
One thing you need to take account of is the cache. Make sure you are measuring the time to download from the server and not from the cache, so you will need to ensure that client-side caching is turned off.
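For instance, with HttpWebRequest you can ask the framework to bypass the local cache for each measurement (a minimal sketch; a server-side cache may of course still answer):

// requires using System.Net; and using System.Net.Cache;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(address);
// Don't read from or write to the local cache for this request.
request.CachePolicy = new HttpRequestCachePolicy(HttpRequestCacheLevel.NoCacheNoStore);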
Also be mindful of server-side caching. Suppose you download the page at 9:00 AM and it takes 15 seconds, then you download it at 9:05 and it takes 3 seconds, and finally at 10:00 it takes 15 seconds again.
What might be happening is that at 9 the server had to fully render the page since there was nothing in the cache. At 9:05 the page was in the cache, so it did not need to render it again. Finally by 10 the cache had been cleared so the page needed to be rendered by the server again.
I highly recommend that you check out the YSlow add-on for Firefox, which will give you a detailed analysis of the times taken to download each of the items on the page.
Something like this would probably work fine:
System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
System.Net.HttpWebRequest req = (System.Net.HttpWebRequest)System.Net.WebRequest.Create("http://www.example.com");

// other request details, credentials, proxy settings, etc...

sw.Start();
System.Net.HttpWebResponse res = (System.Net.HttpWebResponse)req.GetResponse();
sw.Stop();

TimeSpan timeToLoad = sw.Elapsed;
I once wrote an experimental program which downloads an HTML page and the objects it references (images, iframes, etc.).
It is more complicated than it seems, because of HTTP content negotiation: some Web clients will get the SVG version of an image and some the PNG one, which can differ widely in size. The same goes for <object>.
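A crude sketch of that kind of program, for the simple cases only - it ignores content negotiation, CSS-referenced images and anything added by scripts at runtime, which is exactly where this approach falls short:

using System;
using System.Diagnostics;
using System.Net;
using System.Text.RegularExpressions;

static class FullPageTimer
{
    public static TimeSpan TimeFullDownload(Uri pageUri)
    {
        Stopwatch timer = Stopwatch.StartNew();
        using (WebClient client = new WebClient())
        {
            string html = client.DownloadString(pageUri);

            // Very naive: grab src/href attributes that look like static resources.
            MatchCollection references = Regex.Matches(html,
                "(?:src|href)\\s*=\\s*[\"']([^\"']+\\.(?:png|gif|jpe?g|css|js))[\"']",
                RegexOptions.IgnoreCase);

            foreach (Match reference in references)
            {
                Uri resource = new Uri(pageUri, reference.Groups[1].Value); // resolve relative URLs
                try { client.DownloadData(resource); }
                catch (WebException) { /* ignore broken references in this rough estimate */ }
            }
        }
        timer.Stop();
        return timer.Elapsed;
    }
}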
I'm often confronted with a quite similar problem, but I take a slightly different approach. First of all, why should I care about static content at all? Of course it matters to the user whether an image takes 2 minutes or 2 seconds, but that's not my problem AFTER I have fully developed the page. Those things are problems while developing; after deployment it's usually not the static content but the dynamic stuff that slows things down (like you said in your last paragraph).

The next thing is: why do you trust that so many things stay constant? If someone on your network fires up a P2P program, the routing goes wrong, or your ISP has some issues, your server stats will certainly go down. And what does your benchmark say for a user living across the globe, or just using a different ISP? All I'm saying is that you are benchmarking YOUR point of view, and that doesn't say much about the server's performance, does it?
Why not let the site/server itself determine how long it took to load? Here is a small example written in PHP:
function microtime_float()
{
    list($usec, $sec) = explode(" ", microtime());
    return ((float)$usec + (float)$sec);
}

function benchmark($finish)
{
    if ($finish == FALSE) {  /* benchmark start */
        $GLOBALS["time_start"] = microtime_float();
    } else {                 /* benchmark end */
        $time = microtime_float() - $GLOBALS["time_start"];
        echo '<div id="performance"><p>'.$time.'</p></div>';
    }
}
It adds at the end of the page the time it took to build the page (hidden with CSS). Every couple of minutes I fetch the page, grep this value out with a regular expression and parse it. If this time goes up I know there is something wrong (including with the static content!), I get informed via an RSS feed, and I can act.
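If your monitoring app is the C# one from the question, the polling side could look roughly like this (the threshold and what you do when it is exceeded are up to you):

using System;
using System.Globalization;
using System.Net;
using System.Text.RegularExpressions;

static class BenchmarkPoller
{
    public static void CheckRenderTime(string address, double thresholdSeconds)
    {
        using (WebClient client = new WebClient())
        {
            string html = client.DownloadString(address);
            Match match = Regex.Match(html, "<div id=\"performance\"><p>([0-9.]+)</p></div>");
            if (!match.Success) return; // marker missing: treat as a failed check

            double seconds = double.Parse(match.Groups[1].Value, CultureInfo.InvariantCulture);
            if (seconds > thresholdSeconds)
            {
                // TODO: write to your database / trigger the alert feed.
            }
        }
    }
}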
With Firebug we know the "normal" performance of a site loading all content (development phase). With the benchmark we get the current server situation (even on our cell phone). OK, what next? We have to make certain that all/most visitors are getting a good connection. I find this part really difficult and am open to suggestions. However, I try to take the log files and ping a couple of IPs to see how long it takes to reach those networks. Additionally, before I decide on a specific ISP, I try to read about its connectivity and user opinions...
You can use software like these:
http://www.cyscape.com/products/bhawk/page-load-time.aspx
http://www.trafficflowseo.com/2008/10/website-load-timer-great-to-monitor.html
Google will be helpful to find the one best suited for your needs.
http://www.httpwatch.com/
Firebug NET tab
If you're using Firefox, install the Firebug extension found at http://getfirebug.com. From there, choose the Net tab, and it will show you the load/response time for everything on the page.
tl;dr
Use a headless browser to measure the loading times. One example of doing so is Website Loading Time.
Long version
I ran into the same challenges you're running into, so I created a side-project to measure actual loading times. It uses Node and Nightmare to manipulate a headless ("invisible") web browser. Once all of the resources are loaded, it reports the number of milliseconds it took to fully load the page.
One nice feature that would be useful for you is that it loads the webpage repeatedly and can feed the results to a third-party service. I feed the values into NIXStats for reporting; you should be able to adapt the code to feed the values into your own database.
Here's a screenshot of the resulting values for our backend once fed into NIXStats:
Example usage:
website-loading-time rinogo$ node website-loading-time.js https://google.com
1657
967
1179
1005
1084
1076
...
Also, if the main bulk of your code must be in C#, you can still take advantage of this script/library: since it is a command-line tool, you can call it from your C# code and process the result.
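For example, something along these lines should work for shelling out to the tool and reading its output (the node/script paths are assumptions; adjust them to your installation):

using System;
using System.Diagnostics;

class LoadTimeRunner
{
    static void Main()
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "node",
            Arguments = "website-loading-time.js https://google.com",
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            // Each line is a load time in milliseconds; parse and store as needed.
            Console.WriteLine(output);
        }
    }
}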
https://github.com/rinogo/website-loading-time
Disclosure: I am the author of this project.
Today I am using Selenium to parse data from a website. Here is my code:
public ActionResult ParseData()
{
    IWebDriver driver = new FirefoxDriver();
    driver.Navigate().GoToUrl(myURL);
    IList<IWebElement> nameList = driver.FindElements(By.XPath(myXPath));
    return View(nameList);
}
The problem is that whenever it runs, it opens a new window at the myURL location, gets the data, and leaves that window open.
I don't want Selenium to open any new window here - just run in the background and give me the parsed data. How can I achieve that? Please help me. Thanks a lot.
Generally I agree with andrei: why use Selenium if you are not planning to interact with the browser window?
Having said that, the simplest way to prevent Selenium from leaving the window open is to close it before returning from the function:
driver.Quit();
Another option, if the page doesn't have to be loaded in Firefox, is to use the HtmlUnit driver instead (it has no UI).
Well, it seems that on each web request you are creating (though not closing/disposing) a Selenium driver object. As I have said in the comment, there may be better solutions for your problem...
As you want to fetch a web page and extract some data from it, feel free to use one of the following (a minimal sketch follows the list):
WebClient
WebRequest
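A minimal sketch of the WebClient route; the URL and the regular expression are placeholders, and for real parsing an HTML parser such as HtmlAgilityPack is usually a better idea than a regex:

using System;
using System.Net;
using System.Text.RegularExpressions;

class PageFetcher
{
    static void Main()
    {
        using (WebClient client = new WebClient())
        {
            string html = client.DownloadString("http://www.example.com/page-to-parse");

            // Extract whatever you need from the markup.
            MatchCollection names = Regex.Matches(html, "<span class=\"name\">([^<]+)</span>");
            foreach (Match name in names)
                Console.WriteLine(name.Groups[1].Value);
        }
    }
}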
A web application is not a very hospitable environment for a Selenium driver instance, IMHO. If you still want to play a bit with it, make the Selenium instance static and reuse it among requests. Still, if it is used from concurrent requests (multiple threads running at the same time), a crash is very likely :) You have the option of protecting the instance (locks, critical sections etc.), but then you will have zero scalability.
Short answer: fetch the data another way; Selenium is just for automated exploratory tests, as far as I know...
But...
If you really have to explore that website - the source of your data - with Selenium, then fetch the data using Selenium in advance - speculatively, in another process (a console application that runs in the background) - and store it in some files or in a database. Then, from the web application, read the data and return it to your clients :)
If you do not yet have the data the client has asked for, respond with some error - "please try again in 5 minutes" - and tell the console application running in the background to fetch that data. (There are various ways of communicating across process boundaries - the web app and the console app in our case - but you can also use a simple file or db table for queuing "data requests", whatever works.)
What's the best way to scrape a web page that has AJAX/dynamic loading of data?
For example: scraping a webpage that presents 20 images on load, but when the user scrolls down the page it loads more images (sort of like Facebook). In such a case, how do you scrape all the images, not just the first 20?
This is something that not even the major search engines have mastered yet. It's called "event-driven crawling".
Google even has a guide on what to do to help them crawl your AJAX sites better.
The best thing would be to read some open-source crawlers and see what they do. But your chances of crawling even 80% of such content are slim at best, unless you have a specific target in mind.
There are also some interesting reads at Crawljax.
Basically, you should try looking for the scripts on the page and checking whether they make any AJAX calls, then determine what kind of parameters they take and make repeated calls with incremented/decremented parameter values. This only works if the parameters have a logical pattern, such as being numbers, single letters, etc. It also depends on whether you're targeting a known site or just sending your crawler into the wild. If you know your target, you can inspect its DOM and customize your code for greater accuracy, as mentioned by wolf.
Good luck
Use a tool such as Fiddler or Wireshark to inspect the web request that is made when more items are loaded.
Then replicate the request in your code.
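As a sketch of what "replicate the request" can look like in C# - the endpoint and the offset/count parameters below are hypothetical; in practice they are whatever Fiddler shows you the page calling when you scroll:

using System;
using System.Net;

class AjaxScraper
{
    static void Main()
    {
        using (WebClient client = new WebClient())
        {
            for (int offset = 0; offset < 200; offset += 20)
            {
                string json = client.DownloadString(
                    "http://www.example.com/api/images?offset=" + offset + "&count=20");

                if (string.IsNullOrEmpty(json) || json == "[]")
                    break; // no more items

                // TODO: parse the JSON and collect the image URLs.
            }
        }
    }
}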
Update (thanks to pguardiario for his comment):
Note that Wireshark is a low-level network capture tool that offers a great deal of detail about the traffic (packets being exchanged, DNS lookups, and so on) and may be painful to use in such a scenario, where you only wish to see the HTTP requests.
So you're better off using Fiddler, or a similar tool in a browser (e.g. Chrome's Network panel).
Crawljax is open source and can dynamically crawl Ajax-based content.
I need to get the HTML code of this site (with C#):
http://urbs-web.curitiba.pr.gov.br/centro/defmapalinhas.asp?l=n (only works with IE8)
Using the WebClient class, or HttpWebRequest, or any other library, I do not have access to the HTML code that is generated dynamically.
So my only solution (I guess) would be to use the WebBrowser Control (WPF).
I was trying and trying, using mshtml.HTMLDocument and SHDocVw.IWebBrowser2, but it is a mess and I cannot find what I want in it; it seems there are many iframes, and inside them there are more iframes. I do not know - I tried:
IHTMLElementCollection elcol = htmlDoc.getElementsByTagName("iframe");
var test = htmlDoc.getElementsByTagName("HTML");
var test2 = doc.all;
but made no progress. Does anyone know how to help me?
Observation / trivia: this is the site that shows where all the buses pass in my city. The site is horrible, only works in IE8 and has serious problems. I would like to use this information to try to create a better service, using Google Maps or Bing Maps later.
The site that I was trying to get the information from is no longer available. The idea of getting the dynamically generated HTML source code was abandoned, and I could not find a solution using the WebBrowser control for WPF.
I believe that today there are other ways to solve this problem.
You need to use the "Frames" object in the WebBrowser control, this object collection will return all frames and iframes if I recall correctly, and you need to look at the frames collection for each newly discovered frame you find on the page, get me? So, it’s like a recursive discovery loop that you need to run, you add each frame you find to your array or collection, and for each "unsearched" frame, you must look at that frames ".Frames" collection (they will all have a .Count etc, just a typical collection) and you do this for every newly discovered frame that you find, until of course, there are no longer any newly discovered frames that haven't had their ".Frames" collection searched.
A function written as described above will allow infinitely nested frames to be discovered; I've done this in a VB6 project (and I'm happy to give you the source for it if you would like it). The nesting is not preserved in my example, but that is OK, since the nesting structure isn't important and you can figure out which frame belongs where from the order in which the frames are added to the collection, which follows the frame hierarchy.
Once you do that, getting the HTML source is pretty straightforward and I'm sure you know how to do it - probably a .DocumentText, depending on the version of the WB control you are using.
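The above describes the COM WebBrowser from VB6; a rough managed equivalent using the WinForms WebBrowser control's HtmlWindow.Frames collection might look like this (a sketch only - call it after DocumentCompleted has fired; frames served from another domain can throw an access exception, which is swallowed here):

using System;
using System.Collections.Generic;
using System.Windows.Forms;

static class FrameWalker
{
    public static List<HtmlWindow> CollectFrames(WebBrowser browser)
    {
        var found = new List<HtmlWindow>();
        if (browser.Document != null)
            Collect(browser.Document.Window, found);
        return found;
    }

    private static void Collect(HtmlWindow window, List<HtmlWindow> found)
    {
        try
        {
            foreach (HtmlWindow frame in window.Frames)
            {
                found.Add(frame);
                Collect(frame, found); // recurse into nested frames
            }
        }
        catch (UnauthorizedAccessException)
        {
            // Frame from another domain: its content is not accessible from here.
        }
    }
}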
Also, you say it is not possible to use HTTP clients to directly grab the source code? I must disagree: once you have the frame objects, you can get the URL from each frame and do a URL-to-string type call with any httpclient-like class or framework. The only ways they could prevent this on their end are if they accept requests only from a particular referrer (i.e. the referrer must be from their domain name for some of their files), or if they check the USER_AGENT and reject requests that don't come from one of the expected browsers - unlikely, but possible.
However, both the referrer and the user agent can be changed in the HTTP client you are using, so if they are imposing limits based on this sort of thing, you can spoof them very easily and give them the data they expect. Once again, this is low-probability stuff, but it is possible they have set things up this way, especially if their data is proprietary.
PS: My first visit to the site ended up with IE crashing and reopening that tab :) - a terrible site, I agree.
What I'm trying to do is create an ASP.NET page that runs a random number generator, displays the random number, and writes it to a text file. That part is no worries; the issue is that I want the number generation and file writing to continue while the page is live - i.e. even if no one is actually viewing the page and it's just sitting on the server, the process should continue.
Is this possible?
EDIT: I foolishly overlooked using a web service to generate the number - I've knocked up a basic service that generates a number and writes it to a text file. I can't work out how to schedule/automate it - could I set up a timer with a given interval and then use timer_Tick?
Scheduling is new to me, any advice is appreciated.
You can use a Windows Service to do the work in the background; please see the links below:
http://www.codeproject.com/KB/dotnet/simplewindowsservice.aspx
http://www.codeguru.com/columns/dotnet/article.php/c6919
Have you considered the use of scheduled tasks? Rather than the page triggering the updates, the scheduled task does that, and the page viewer just sees the "latest results" at any given point. Of course, that may not be feasible, but by the sound of it you're after a constantly running service/task with the ability to view the latest number - a little like an RSA token, which shows new numbers even if you don't need one.
Not sure if this is what you want, but if you are interested in using a scheduler for this task, you can try Quartz.NET. It is a very popular, full-featured and open-source scheduling system.
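A rough sketch of what that could look like with Quartz.NET - the fluent JobBuilder/TriggerBuilder API shown here is the 2.x style (newer versions are async), and the file path and interval are placeholders:

using System;
using System.IO;
using Quartz;
using Quartz.Impl;

public class WriteRandomNumberJob : IJob
{
    private static readonly Random Rng = new Random();

    public void Execute(IJobExecutionContext context)
    {
        // Append one random number per run (placeholder path).
        File.AppendAllText(@"C:\data\numbers.txt", Rng.Next() + Environment.NewLine);
    }
}

public static class SchedulerSetup
{
    public static void Start()
    {
        IScheduler scheduler = StdSchedulerFactory.GetDefaultScheduler();
        scheduler.Start();

        IJobDetail job = JobBuilder.Create<WriteRandomNumberJob>().Build();
        ITrigger trigger = TriggerBuilder.Create()
            .StartNow()
            .WithSimpleSchedule(x => x.WithIntervalInSeconds(60).RepeatForever())
            .Build();

        scheduler.ScheduleJob(job, trigger);
    }
}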
Please describe what you are trying to achieve. There might be a better way than writing random numbers to a file.
I would not use a service (web or Windows service) for this. There is no benefit to using a web service, since it will just do exactly the same as your web app would do. A Windows service will continue to run independently of your web app, and you would need to create some kind of IPC and keep track of several timers/files.
The easiest way to do this is to use a System.Threading.Timer and keep it in a session variable. Note that you need to kill it when the user session expires.
You should also be aware that one timer will be created per user that uses the page.
Update
Create a Windows Service application and add a System.Threading.Timer to it. Write to the file in the timer callback.
Then open the text file in your web app (using FileShare.ReadWrite + FileMode.Read).
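A minimal sketch of that idea (the service name, output path and interval are placeholders; the service still needs the usual installer/Main plumbing):

using System;
using System.IO;
using System.ServiceProcess;
using System.Threading;

public class RandomNumberService : ServiceBase
{
    private Timer timer;
    private readonly Random rng = new Random();
    private const string OutputPath = @"C:\data\numbers.txt"; // placeholder path

    protected override void OnStart(string[] args)
    {
        // Fire immediately, then every 10 seconds.
        timer = new Timer(WriteNumber, null, TimeSpan.Zero, TimeSpan.FromSeconds(10));
    }

    protected override void OnStop()
    {
        timer.Dispose();
    }

    private void WriteNumber(object state)
    {
        int number = rng.Next();
        File.AppendAllText(OutputPath, number + Environment.NewLine);
    }
}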
I would like to know if there is an easy or practical way to determine how long it takes for 1 kB of a website to load.
If you are using Firefox, you can get information on how long various files take to download using the Firebug plugin. It has a bunch of network monitoring features.
Chrome has a similar console under Tools -> Developer Tools.
This is great for a quick reality check when you are developing, but sometimes you need a little more. For instance, you might want to set up a monitoring script to ensure response times aren't creeping up. Selenium is great for this and supports both Java and C#. Another option is to write a quick script using a headless browser like Mechanize (Ruby, Perl).
I prefer doing this kind of monitoring from the client end as opposed to on the server side because you get a more realistic perspective of what your end users are experiencing.