I'm using C# + HttpWebRequest.
I have an HTML page I need to frequently check for updates.
Assuming I already have an older version of the HTML page (in a string for example), is there any way to download ONLY the "delta", or modified portion of the page, without downloading the entire page itself and comparing it to the older version?
Only if that functionality is included in the web server, and that's pretty unlikely. So no, sorry.
Not for any given page, no.
But if you wrote a facility to give you the differences based on a timestamp or some kind of ID, then yes. This isn't anything standard. You'd have to create a feed for the page using syndication, or create a web service to satisfy the need. Of course, you have to be in control of the web server you want to monitor, which may not be the case for you.
The short answer is, no. The long answer is that if the HTML is in version control and you write some server side code that, given a particular version number, gives you the diff between the current version and the specified version, yes. If the HTML isn't in version control and you just want to compare your version to the current version, then either you need to download the current version to do the comparison on the client or upload your version to the server and have it do the comparison -- and send the difference back. Obviously, it's more efficient just to have your client re-download the new version.
Set the IfModifiedSince property of HttpWebRequest.
This won't give you a 'delta', but the server will reply with 304 (Not Modified) instead of the page body if the page has not changed at all.
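A minimal sketch of such a conditional GET (the URL is a placeholder and lastFetchTimeUtc stands for whenever you last downloaded the page); note that HttpWebRequest surfaces the 304 as a WebException, so it has to be caught:
// using System; using System.Net;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.example.com/page.html");
request.IfModifiedSince = lastFetchTimeUtc; // DateTime of your previous successful download
try
{
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    {
        // 200 OK: the page changed, read the new content from response.GetResponseStream()
    }
}
catch (WebException ex)
{
    HttpWebResponse error = ex.Response as HttpWebResponse;
    if (error != null && error.StatusCode == HttpStatusCode.NotModified)
    {
        // 304 Not Modified: nothing was downloaded, keep using your cached copy
    }
    else
    {
        throw;
    }
}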
You have the old version and the server has the new version. How could you download just the delta without knowing what has been changed? How could the server deliver the delta without knowing which old version you have?
Obviously, there is no way around downloading the entire page. Or uploading the old version to the server (assuming the server has a service that allows that), but that would only increase the traffic.
Like the other answers before me: there is no way to get around the download.
You can, however, skip re-parsing the HTML when nothing has changed, by computing a hash for each page revision and comparing the stored hash with the new one. When the hashes differ, you would use a diff algorithm to extract only the 'delta' information. I think most modern crawlers do something along these lines.
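A small sketch of that hash check (SHA-256 here is just one reasonable choice; the URL is whatever page you are monitoring):
// using System; using System.Net; using System.Security.Cryptography; using System.Text;
static string lastHash;

static bool PageHasChanged(string url)
{
    using (WebClient client = new WebClient())
    using (SHA256 sha = SHA256.Create())
    {
        string html = client.DownloadString(url);
        string hash = Convert.ToBase64String(sha.ComputeHash(Encoding.UTF8.GetBytes(html)));

        bool changed = hash != lastHash; // only run your diff/parse step when this is true
        lastHash = hash;
        return changed;
    }
}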
If the older versions were kept on the web server, and when you requested the delta, you sent a 'version number' or a modified date for the version that you have, theoretically the server could diff the page and send only the difference. But both copies have to be on one machine for anybody to know what the difference is.
You could use the AddRange method of the HttpWebRequest class.
With this you can specify a byte range of the resource you want to download.
This is also used to continue interrupted http downloads.
This is not a true delta, but you can reduce traffic if you only load the parts that change.
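A minimal sketch of a ranged request; it assumes you already know which byte range you care about and that the server supports range requests:
// using System; using System.IO; using System.Net;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.example.com/page.html");
request.AddRange(0, 1023); // ask for the first 1024 bytes only

using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
    // PartialContent (206) means the server honoured the Range header;
    // OK (200) means it ignored it and sent the whole resource anyway.
    Console.WriteLine(response.StatusCode);
    string partialHtml = reader.ReadToEnd();
}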
I am using AngleSharp in C# to simulate a web browser. For debugging purposes, I sometimes want to see the page I am traversing. I am asking if there is an easy way to show the current document in a web browser (preferably the system's default browser) and, if possible, with the current cookie state.
I am very late to the party, but hopefully someone will find my answer useful: the short answer is no; the long answer is yes, with some work it is possible in a limited way.
How to make it possible? By injecting some code into AngleSharp that opens a (local) webserver. The content from this webserver could then be inspected in any web browser (e.g., the system's default browser).
The injected local webserver would serve the current document at its root (e.g., http://localhost:9000/), along with all auxiliary information in HTTP headers (e.g., cookie states). The problem with this approach is that we either transport the document's original source or a serialization of the DOM as seen by AngleSharp. Therefore, there could be some deviations and it may not be what you want. Alternatively, the server could emit JS code that replicates what AngleSharp currently sees (however, then standard debugging seems more viable).
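A rough sketch of that idea using HttpListener; serving DocumentElement.OuterHtml on port 9000 is my assumption of how one might serialize what AngleSharp currently sees, not an existing AngleSharp feature:
// using System; using System.Net; using System.Text; using AngleSharp.Dom;
public static void ServeForDebugging(IDocument document, int port = 9000)
{
    HttpListener listener = new HttpListener();
    listener.Prefixes.Add("http://localhost:" + port + "/");
    listener.Start();

    HttpListenerContext context = listener.GetContext(); // blocks until a browser connects
    byte[] body = Encoding.UTF8.GetBytes(document.DocumentElement.OuterHtml);

    context.Response.ContentType = "text/html; charset=utf-8";
    context.Response.ContentLength64 = body.Length;
    context.Response.OutputStream.Write(body, 0, body.Length);
    context.Response.Close();
    listener.Stop();
}
Opening the system's default browser on http://localhost:9000/ (for example with Process.Start) would then show the serialized document; cookie state would still have to be conveyed separately, e.g. via Set-Cookie headers.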
Any approach, however, requires some (tedious?) work and therefore needs to be justified. Since you want to "see" the page, I guess a CSS renderer would be more interesting (it could also be embedded in any application or made available in the form of a VS extension).
Hope this helps!
I want to find a decent solution to track the URLs and HTML content that users are visiting and provide more information to the user. The solution should have minimal impact on end users.
I don't want to write plugins for different browsers; they're hard to maintain.
I don't accept the proxy method, since I don't want to change any of the user's proxy settings.
My application is written in C# and targets Windows. It would be best if the solution could support other OSes as well.
Based on my research, I found the following methods that look workable for me, but all of them have their drawbacks, and I can't determine which one is best.
Use WinPcap
WinPcap sniffs all TCP packets without changing any user settings and only requires installing the WinPcap package, which is acceptable to me. But I have two questions:
a. How do I convert TCP packets into URLs and HTML?
b. Does it really impact performance? I don't know whether sniffing all TCP traffic is too much overhead for this requirement.
Find history files for different browsers
This way looks like the easiest one, but I wonder if the solution is stable. I am not sure whether the browser writes its history reliably, or when it writes it. My application needs to pop up information before the user leaves the current page, so this solution won't work for me if the browser only writes the history file when the user closes it.
Use FindWindow, accessibility objects, or a COM interface to find the UI element that contains the URL
I find this approach incomplete; for example, Chrome only exposes the active tab's URL, not all of them.
Another drawback is that I have to request the URL a second time to get its HTML content.
Any comment or suggestion is welcome.
BTW, I am not writing any spyware. The application tries to find all RSS feeds on a web page and show them to end users. I could easily do that in a browser plugin, but I really want to support multiple browsers with a single UI. Thanks.
Though this is a very old post, I thought I would give some input.
Approach 1 (WinPcap) is the best one. It will work for any browser, even the built-in browser of any other installed application, and it is also the least resource-consuming approach.
There is a library, Pcap.Net, that has an HTTP parser. You can reconstruct the HTTP stream and use its HttpResponseDatagram to parse the response body so it can be consumed by your application.
This link helped give me more insight:
Tcp Session Reconstruction with Winpcap
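For orientation only, here is a rough sketch of capturing HTTP requests with Pcap.Net; the type and property names (LivePacketDevice, PacketCommunicator, the Http property on the TCP datagram, HttpRequestDatagram.Uri) are from memory of that library's samples, so check them against the Pcap.Net documentation before relying on them:
// using System; using PcapDotNet.Core; using PcapDotNet.Packets; using PcapDotNet.Packets.Http;
LivePacketDevice device = LivePacketDevice.AllLocalMachine[0]; // pick the right NIC in a real app

using (PacketCommunicator communicator =
       device.Open(65536, PacketDeviceOpenAttributes.Promiscuous, 1000))
{
    communicator.SetFilter("tcp port 80"); // HTTPS would need a different approach entirely
    communicator.ReceivePackets(0, packet =>
    {
        HttpDatagram http = packet.Ethernet.IpV4.Tcp.Http;
        if (http != null && http.IsRequest)
        {
            HttpRequestDatagram request = (HttpRequestDatagram)http;
            Console.WriteLine(request.Uri); // the path; combine with the Host header for the full URL
        }
    });
}
Reassembling full HTML bodies that span multiple TCP segments still needs the session reconstruction described in the linked article.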
I was wondering if there is another way to spell check a Windows app instead of what I've been using: "Microsoft.Office.Interop.Word". I can't buy a spell-checking add-on, I also cannot use open source, and I would like the spell check to be dynamic. Any suggestions?
EDIT:
I have seen several similar questions; the problem is they all suggest using open-source applications (which I would love) or Microsoft Word.
I am currently using Word to spell check, and it slows my current application down and causes several glitches in it. Word is not a clean solution, so I really want to find some other way. Is my only other option to recreate my app as a WPF app so I can take advantage of the SpellCheck class?
If I were you I would download the data from the English Wiktionary and parse it to obtain a list of all English words (for instance). Then you could rather easily write at least a primitive spell-checker yourself. In fact, I use a parsed version of the English Wiktionary in my own mathematical application AlgoSim. If you'd like, I could send you the data file.
Update
I have now published a parsed word list at english.zip (942 kB, 383735 entries, zip). The data originates from the English Wiktionary, and as such, is licensed under the Creative Commons Attribution/Share-Alike License.
To obtain a list like this, you can either download all articles on Wiktionary as a huge XML file containing all the Wiki- and HTML-formatted articles (which is then more or less trivial to parse), or you can run a bot on the site. I got help obtaining a parsed file from a user at Wiktionary (I seem to have forgotten his name, though...), and this file (english.txt in english.zip) is a further processed version of the file I got.
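A primitive checker along those lines is then mostly a dictionary lookup; a minimal sketch, assuming english.txt contains one word per line:
// using System; using System.Collections.Generic; using System.IO; using System.Linq;
public class SimpleSpellChecker
{
    private readonly HashSet<string> words;

    public SimpleSpellChecker(string wordListPath)
    {
        // load the parsed Wiktionary word list once, case-insensitively
        words = new HashSet<string>(
            File.ReadLines(wordListPath).Select(w => w.Trim().ToLowerInvariant()));
    }

    public bool IsKnown(string word)
    {
        return words.Contains(word.ToLowerInvariant());
    }
}
Suggesting corrections would additionally need something like an edit-distance search over that set, but for merely flagging misspelled words the lookup alone is enough.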
http://msdn.microsoft.com/en-us/library/system.windows.controls.spellcheck.aspx
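For what it's worth, using that WPF SpellCheck class does not necessarily mean rewriting the whole app in WPF: a WPF TextBox can be hosted inside a WinForms form via ElementHost. A sketch (my suggestion, not part of the linked page; it assumes references to WindowsFormsIntegration, PresentationCore, PresentationFramework and WindowsBase):
// using System.Windows.Forms; using System.Windows.Forms.Integration;
public class SpellCheckForm : Form
{
    public SpellCheckForm()
    {
        System.Windows.Controls.TextBox wpfTextBox = new System.Windows.Controls.TextBox
        {
            AcceptsReturn = true,
            SpellCheck = { IsEnabled = true }, // red squiggles as the user types
            Language = System.Windows.Markup.XmlLanguage.GetLanguage("en-US")
        };

        ElementHost host = new ElementHost
        {
            Dock = DockStyle.Fill,
            Child = wpfTextBox
        };

        Controls.Add(host);
    }
}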
I use Aspell-win32; it's old but it's open source and works as well as or better than the Word spell check. I came here looking for a built-in solution.
First, some background to my problem.
There are many versions of Internet Explorer 6 and 7 that do not support more than 20 key-value pairs in a cookie. I have a list of the full versions that do and do not support this. This is fixed by a Windows update, but it's not possible for me to force the users of my app to run Windows Update in order to use my app.
We have developed a different cookie jar for versions of Internet Explorer that do not support this; however, its performance is not optimal, and therefore we need to use it only on versions of IE that require it.
The full version number of an IE browser is in the format 6.00.2900.2180. Everything I have found suggests using Request.Browser to get browser information, but this is far too limited for my needs. To clarify: MajorVersion returns 6 and MinorVersion returns 0, giving me 6.0 (and 6.0 is the version of pretty much every copy of Internet Explorer 6 in existence). So what I need is the third and fourth parts (or, at the very least, the third part) of the full version.
So, does anyone know of a way, in ASP.NET with C#, to find out the information I need? If someone has looked extensively into this and found it to be impossible, that is fine as an answer.
You may need to revisit why you're storing so many different key-value pairs. Going low-tech, couldn't you concatenate the values into fewer keys, or maybe even a single key? What sort of values are you storing in a cookie anyway?
Try grabbing the "User-Agent" request header using Request.Headers
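In code that looks roughly like the following; note that, as the answer below explains, IE does not put the full four-part build number in this header, so at best you get the major.minor part:
// in an ASP.NET page or handler; requires using System.Text.RegularExpressions;
string userAgent = Request.Headers["User-Agent"];
// e.g. "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
Match match = Regex.Match(userAgent ?? string.Empty, @"MSIE (?<version>[\d.]+)");
if (match.Success)
{
    string ieVersion = match.Groups["version"].Value; // "6.0" at best, never "6.00.2900.2180"
}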
Copying this from meandmycode to accept it as the answer:
IE doesn't specify the long version number in the user-agent header, so you have absolutely no chance of detecting this other than sending a 'snoop' page with JavaScript to detect the complex version number.. but doing something like that is dodge city, and JavaScript may not be able to find the full version either.
I need to build a Windows Forms application to measure the time it takes to fully load a web page. What's the best approach to do that?
The purpose of this small app is to monitor some pages of a website at a predetermined interval, in order to know beforehand if something is going wrong with the web server or the database server.
Additional info:
I can't use a commercial app, I need to develop this in order to be able to save the results to a database and create a series of reports based on this info.
The WebRequest solution seems to be the approach I'm going to use; however, it would be nice to be able to measure the time it takes to fully load the page (images, CSS, JavaScript, etc.). Any idea how that could be done?
If you just want to record how long it takes to get the basic page source, you can wrap a stopwatch around an HttpWebRequest. E.g.
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(address);
System.Diagnostics.Stopwatch timer = new Stopwatch();
timer.Start();
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
timer.Stop();
TimeSpan timeTaken = timer.Elapsed;
However, this will not take into account time to download extra content, such as images.
[edit] As an alternative to this, you may be able to use the WebBrowser control and measure the time between calling .Navigate() and the DocumentCompleted event firing. I think this will also include the download and rendering time of extra content. However, I haven't used the WebBrowser control a huge amount, so I don't know whether you have to clear a cache if you repeatedly request the same page.
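A rough sketch of that WebBrowser approach (WinForms, so it needs a running message loop; DocumentCompleted can fire once per frame, hence the URL check):
// using System; using System.Diagnostics; using System.Windows.Forms;
Stopwatch pageTimer = new Stopwatch();
WebBrowser browser = new WebBrowser();

browser.DocumentCompleted += (sender, e) =>
{
    if (e.Url == browser.Url) // ignore completion events raised by sub-frames
    {
        pageTimer.Stop();
        Console.WriteLine("Loaded (including images/CSS/JS) in " + pageTimer.ElapsedMilliseconds + " ms");
    }
};

pageTimer.Start();
browser.Navigate("http://www.example.com");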
Depending on the frequency you need, maybe you can try using Selenium (an automated testing tool for web applications); since it uses a web browser internally, you will get a pretty close measure. I think it would not be too difficult to use the Selenium API from a .NET application (you can even use Selenium in unit tests).
Measuring this kind of thing is tricky because web browsers have some particularities in how they download all the page's elements (JS, CSS, images, iframes, etc.); these particularities are explained in this excellent book (http://www.amazon.com/High-Performance-Web-Sites-Essential/dp/0596529309/).
A homemade solution would probably be too complex to code or would fail to account for some of those particularities (measuring only the time spent downloading the HTML is not good enough).
One thing you need to take account of is the cache. Make sure you are measuring the time to download from the server and not from the cache, which means you will need to ensure that client-side caching is turned off.
Also be mindful of server-side caching. Suppose you download the page at 9:00 AM and it takes 15 seconds, then you download it at 9:05 and it takes 3 seconds, and finally at 10:00 it takes 15 seconds again.
What might be happening is that at 9 the server had to fully render the page since there was nothing in the cache. At 9:05 the page was in the cache, so it did not need to render it again. Finally by 10 the cache had been cleared so the page needed to be rendered by the server again.
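For the client-side part, one way to keep the measurement honest is to attach a no-cache policy to the request from the earlier snippet (address is the same placeholder variable used there):
// using System.Net; using System.Net.Cache;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(address);
request.CachePolicy = new HttpRequestCachePolicy(HttpRequestCacheLevel.NoCacheNoStore);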
I highly recommend that you check out the YSlow add-on for Firefox, which will give you a detailed analysis of the times taken to download each of the items on the page.
Something like this would probably work fine:
System.Diagnostics.Stopwatch sw = new Stopwatch();
System.Net.HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create("http://www.example.com");
// other request details, credentials, proxy settings, etc...
sw.Start();
System.Net.HttpWebResponse res = (HttpWebResponse)req.GetResponse();
sw.Stop();
TimeSpan timeToLoad = sw.Elapsed;
I once wrote an experimental program which downloads an HTML page and the objects it references (images, iframes, etc.).
It is more complicated than it seems, because of HTTP content negotiation: some web clients will get the SVG version of an image and some the PNG one, and the two can differ widely in size. The same goes for <object>.
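A crude sketch of that kind of program: time the HTML plus everything referenced via a src attribute. Regex extraction misses plenty of cases (CSS url(...) references, content negotiation, unusual markup), so treat it as an approximation; the URL is a placeholder.
// using System; using System.Diagnostics; using System.Net; using System.Text.RegularExpressions;
Uri pageUri = new Uri("http://www.example.com/");
Stopwatch total = Stopwatch.StartNew();

using (WebClient client = new WebClient())
{
    string html = client.DownloadString(pageUri);
    foreach (Match m in Regex.Matches(html, "src\\s*=\\s*[\"'](?<url>[^\"']+)[\"']", RegexOptions.IgnoreCase))
    {
        Uri resource;
        if (Uri.TryCreate(pageUri, m.Groups["url"].Value, out resource))
        {
            try { client.DownloadData(resource); } // images, scripts, iframes...
            catch (WebException) { /* broken reference; skip it */ }
        }
    }
}

total.Stop();
Console.WriteLine("Page + referenced resources: " + total.ElapsedMilliseconds + " ms");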
I'm often confronted with a quite similar problem. However, I take a slightly different approach: first of all, why should I care about static content at all? Of course it matters to the user whether an image takes 2 minutes or 2 seconds, but that's not my problem AFTER I have fully developed the page. Those things are problems during development; after deployment it's usually not the static content but the dynamic stuff that slows things down (like you said in your last paragraph). The next thing is: why do you trust that so many things stay constant? If someone on your network fires up a P2P program, the routing goes wrong, or your ISP has some issues, your server stats will certainly go down. And what does your benchmark say for a user living across the globe, or just using a different ISP? All I'm saying is that you are benchmarking YOUR point of view, but that doesn't say much about the server's performance, does it?
Why not let the site/server itself determine how long it took to load? Here is a small example written in PHP:
function microtime_float()
{
    list($usec, $sec) = explode(" ", microtime());
    return ((float)$usec + (float)$sec);
}

function benchmark($finish)
{
    if($finish == FALSE){ /* benchmark start */
        $GLOBALS["time_start"] = microtime_float();
    }else{ /* benchmark end */
        $time = microtime_float() - $GLOBALS["time_start"];
        echo '<div id="performance"><p>'.$time.'</p></div>';
    }
}
It adds, at the end of the page, the time it took to build it (hidden with CSS). Every couple of minutes I grab this with a regular expression and parse it. If this time goes up, I know there is something wrong (including with the static content!), and I get informed via an RSS feed and can act.
With Firebug we know the "normal" performance of the site loading all content (development phase). With the benchmark we get the current server situation (even from a cell phone). OK, what next? We have to make sure that all/most visitors are getting a good connection. I find this part really difficult and am open to suggestions. However, I try to take the log files and ping a couple of IPs to see how long it takes to reach those networks. Additionally, before I decide on a specific ISP, I try to read about its connectivity and user opinions...
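On the monitoring side, the hidden div can be picked up from C# (to stay within the question's language) with a small fetch-and-parse step; the element id matches the PHP snippet above, and the URL is a placeholder:
// using System; using System.Globalization; using System.Net; using System.Text.RegularExpressions;
using (WebClient client = new WebClient())
{
    string html = client.DownloadString("http://www.example.com/");
    Match match = Regex.Match(html, "<div id=\"performance\"><p>(?<seconds>[\\d.]+)</p></div>");
    if (match.Success)
    {
        double buildSeconds = double.Parse(match.Groups["seconds"].Value, CultureInfo.InvariantCulture);
        Console.WriteLine("Server-side build time: " + buildSeconds + " s");
        // store it, chart it, or raise an alert when it drifts upwards
    }
}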
You can use software like these:
http://www.cyscape.com/products/bhawk/page-load-time.aspx
http://www.trafficflowseo.com/2008/10/website-load-timer-great-to-monitor.html
Google will be helpful to find the one best suited for your needs.
http://www.httpwatch.com/
Firebug NET tab
If you're using Firefox, install the Firebug extension found at http://getfirebug.com. From there, choose the Net tab, and it'll show you the load/response time for everything on the page.
tl;dr
Use a headless browser to measure the loading times. One example of doing so is Website Loading Time.
Long version
I ran into the same challenges you're running into, so I created a side-project to measure actual loading times. It uses Node and Nightmare to manipulate a headless ("invisible") web browser. Once all of the resources are loaded, it reports the number of milliseconds it took to fully load the page.
One nice feature that would be useful for you is that it loads the webpage repeatedly and can feed the results to a third-party service. I feed the values into NIXStats for reporting; you should be able to adapt the code to feed the values into your own database.
Here's a screenshot of the resulting values for our backend once fed into NIXStats:
Example usage:
website-loading-time rinogo$ node website-loading-time.js https://google.com
1657
967
1179
1005
1084
1076
...
Also, if the main bulk of your code must be in C#, you can still take advantage of this script/library. Since it is a command-line tool, you can call it from your C# code and process the result.
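A sketch of that call-out from C#, reading the per-run timings from standard output (the script path is a placeholder for wherever you cloned the repository):
// using System; using System.Diagnostics;
ProcessStartInfo psi = new ProcessStartInfo
{
    FileName = "node",
    Arguments = "website-loading-time.js https://google.com",
    RedirectStandardOutput = true,
    UseShellExecute = false
};

using (Process process = Process.Start(psi))
{
    string output = process.StandardOutput.ReadToEnd(); // one loading time in milliseconds per line
    process.WaitForExit();
    Console.WriteLine(output);
}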
https://github.com/rinogo/website-loading-time
Disclosure: I am the author of this project.