I was wondering if someone could give me some guidance here. I'd like to be able to programmatically get every image on a webpage as quickly as possible. This is what I'm currently doing (note that clear is a WebBrowser control):
if (clear.ReadyState == WebBrowserReadyState.Complete)
{
    doc = (IHTMLDocument2)clear.Document.DomDocument;
    sobj = doc.selection;
    body = doc.body as HTMLBody;
    sobj.clear();
    range = body.createControlRange() as IHTMLControlRange;
    for (int j = 0; j < clear.Document.Images.Count; j++)
    {
        img = (IHTMLControlElement)clear.Document.Images[j].DomElement;
        HtmlElement ele = clear.Document.Images[j];
        string test = ele.OuterHtml;
        string test2 = ele.InnerHtml;
        range.add(img);
        range.select();
        range.execCommand("Copy", false, null);
        Image image = Clipboard.GetImage();
        if (image != null)
        {
            temp = new Bitmap(image);
            Clipboard.Clear();
            ......Rest of code ...........
        }
    }
}
However, I find this can be slow for a lot of images, and it also hijacks my clipboard. Is there a better way?
I suggest using HttpWebRequest and HttpWebResponse. In your comment you asked about efficiency/speed.
From the standpoint of data being transferred, using HttpWebRequest will be at worst the same as using a browser control, and almost certainly much better. When you (or a browser) make a request to a web server, you initially get only the markup for the page itself. This markup may include image references, objects like Flash, and resources (like scripts and CSS files) that are referenced but not actually included in the page itself. A web browser will then proceed to request all the associated resources needed to render the page, but using HttpWebRequest you can request only the things you actually want (the images).
From the standpoint of resources or processing power required to extract entities from a page, there is no comparison: using a browser control is far more resource intensive than scanning an HttpWebResponse. Scanning some data using C# code is extremely fast. Rendering a web page involves JavaScript, graphics rendering, CSS parsing, layout, caching, and so on. It's a pretty intensive operation, actually. Using a browser under programmatic control, this will quickly become apparent: I doubt you could process more than a page every second or so.
On the other hand, a C# program dealing directly with a web server (with no rendering engine involved) could probably handle dozens if not hundreds of pages per second. For all practical purposes, you'd really be limited only by the response time of the server and your internet connection.
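To make this concrete, here is a minimal sketch of that approach (the method name, the save-folder parameter and the crude regex are my own illustrative assumptions; for real pages you'd want a proper HTML parser):

using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

static void DownloadImages(string pageUrl, string saveFolder)
{
    // 1. Request only the page markup
    var pageRequest = (HttpWebRequest)WebRequest.Create(pageUrl);
    string html;
    using (var response = (HttpWebResponse)pageRequest.GetResponse())
    using (var reader = new StreamReader(response.GetResponseStream()))
    {
        html = reader.ReadToEnd();
    }

    // 2. Pull out img src attributes (crude regex, for illustration only)
    var matches = Regex.Matches(html, "<img[^>]+src\\s*=\\s*[\"']([^\"']+)[\"']", RegexOptions.IgnoreCase);

    // 3. Request just the images you actually want
    foreach (Match m in matches)
    {
        var imageUri = new Uri(new Uri(pageUrl), m.Groups[1].Value); // resolves relative URLs
        var imgRequest = (HttpWebRequest)WebRequest.Create(imageUri);
        using (var imgResponse = (HttpWebResponse)imgRequest.GetResponse())
        using (var input = imgResponse.GetResponseStream())
        using (var output = File.Create(Path.Combine(saveFolder, Path.GetFileName(imageUri.LocalPath))))
        {
            input.CopyTo(output); // no clipboard involved
        }
    }
}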
There are multiple approaches here.
If it's a one time thing, just browse to the site and select File > Save Page As... and let the browser save all the images locally for you.
If it's a recurring thing, there are lots of different ways:
Buy a program that does this. I'm sure there are hundreds of implementations.
Use the HTML Agility Pack to grab the page and compile a list of all the images you want, then spin up a thread for each image that downloads and saves it. You might limit the number of threads depending on various factors like your (and the site's) bandwidth and local disk speed. Note that some sites place arbitrary limits on the number of concurrent requests per connection they will handle; depending on the site this might be as few as 3. A rough sketch of this approach follows below.
This is by no means exhaustive. There are lots of other ways. I probably wouldn't do it through a WebBrowser control, though; that code looks brittle.
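Here's a rough sketch of the HTML Agility Pack approach from the list above (the method name, the save folder, and the concurrency cap are illustrative assumptions; error handling is left out):

using System;
using System.IO;
using System.Linq;
using System.Net;
using System.Threading.Tasks;
using HtmlAgilityPack;

static void SaveAllImages(string pageUrl, string saveFolder)
{
    // Grab the page and compile the list of image URLs
    var doc = new HtmlWeb().Load(pageUrl);
    var nodes = doc.DocumentNode.SelectNodes("//img[@src]");
    if (nodes == null) return; // no images on the page

    var imageUrls = nodes
        .Select(img => new Uri(new Uri(pageUrl), img.GetAttributeValue("src", "")))
        .Distinct()
        .ToList();

    // Download in parallel, but cap concurrency; many sites only allow a few simultaneous requests
    Parallel.ForEach(imageUrls,
        new ParallelOptions { MaxDegreeOfParallelism = 3 },
        imageUri =>
        {
            string fileName = Path.Combine(saveFolder, Path.GetFileName(imageUri.LocalPath));
            using (var client = new WebClient())
            {
                client.DownloadFile(imageUri, fileName);
            }
        });
}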
I'm creating a mockup file upload tool for a community site using Fine Uploader.
I've got the session set up to retrieve the initial files from the server along with a thumbnail url.
It all works great, however the rendering of the thumbnails is really slow.
I can't work out why, so I hard-coded a very small thumbnail for each of the four files. This made no difference.
The server side is not the issue. The information is coming back very quickly.
Am I doing something wrong? Why is Fine Uploader so slow? Here's a screen grab: it's taking four seconds to render the four thumbnails.
I'm using the latest Chrome. It's a NancyFX project on a fairly powerful machine. Rendering other pages with big images on them is snappy.
Client side code:
thumbnails: {
    placeholders: {
        waitingPath: '/Content/js/fine-uploader/placeholders/waiting-generic.png',
        notAvailablePath: '/Content/js/fine-uploader/placeholders/not_available-generic.png'
    }
},
session: {
    endpoint: "/getfiles/FlickaId/342"
},
Server side code:
// Fine uploader makes session request to get existing files
Get["/getfiles/FlickaId/{FlickaId}"] = parameters =>
{
//get the image files from the server
var i = FilesDatabase.GetFlickaImagesById(parameters.FlickaId);
// list to hold the files
var list = new List<UploadedFiles>();
// build the response data object list
foreach (var imageFile in i)
{
var f = new UploadedFiles();
f.name = "test-thumb-small.jpg"; // imageFile.ImageFileName;
f.size = 1;
f.uuid = imageFile.FileGuid;
f.thumbnailUrl = "/Content/images/flickabase/thumbnails/" + "test-thumb-small.jpg"; // imageFile.ImageFileName;
list.Add(f);
}
return Response.AsJson(list); // our model is serialised by Nancy as Json!
};
This is by design, and was implemented both to prevent the UI thread from being flooded with the image scaling logic and to prevent a memory leak issue specific to Chrome. This is explained in the thumbnails and previews section of the documentation, specifically in the "performance considerations" area:
For browsers that support client-generated image previews (qq.supportedFeatures.imagePreviews === true), a configurable pause between template-generated previews is in effect. This is to prevent the complex process of generating previews from overwhelming the client machine's CPU for a lengthy amount of time. Without this limit in place, the browser's UI thread runs the risk of blocking, preventing any user interaction (scrolling, etc) until all previews have been generated.
You can adjust or remove this pause via the thumbnails option, but I suggest you not do this unless you are sure users will not drop a large number of complex image files.
I'm working on a program that should measure the loading time and volume of a website that I give as an input.
Here I have some code that returns just the response time of the website, but I want the total loading time and the total volume of items such as pictures, JavaScript, HTML, etc.
public string Loading_Time(string url)
{
    Stopwatch stopwatch = new Stopwatch();
    WebClient client = new WebClient();
    client.Credentials = CredentialCache.DefaultCredentials;
    stopwatch.Start();
    string result = client.DownloadString(url);
    stopwatch.Stop();
    // Elapsed.Milliseconds is only the milliseconds component (0-999);
    // ElapsedMilliseconds is the full elapsed time.
    return stopwatch.ElapsedMilliseconds.ToString();
}
How can I achieve that?
This is going to be a little bit tough. Start by using something like HtmlAgilityPack or similar to parse the returned HTML from your original request (don't try to parse HTML yourself!).
Scan through the object representation of the HTML once parsed, and decide what you want to measure the size of. Typically this will be:
Includes, such as CSS or JavaScript files
Images in IMG and BUTTON elements, as well as background images
The difficulty is that images are often specified as part of a CSS stylesheet - so are you going to try to parse every CSS file to obtain these too?
For the original request you made for the HTML, you can obtain the byte size of the downloaded string. Start with this number as your "volume".
Now make a separate request for each JS, CSS, image etc. file in the same way. All you're interested in is the byte size of each download - it's readily available when you make an HTTP request. Add each item's byte size to the total.
When you're finished you will have the total byte size for all artifacts of that web page.
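A rough sketch of that approach, assuming HtmlAgilityPack and ignoring images referenced from CSS (the method name and the limited set of tags scanned are my own assumptions):

using System;
using System.Linq;
using System.Net;
using System.Text;
using HtmlAgilityPack;

static long MeasurePageVolume(string url)
{
    var baseUri = new Uri(url);
    long totalBytes;

    using (var client = new WebClient())
    {
        // The page itself: download and count its bytes
        byte[] html = client.DownloadData(url);
        totalBytes = html.LongLength;

        var doc = new HtmlDocument();
        doc.LoadHtml(Encoding.UTF8.GetString(html));

        // Collect the resources we care about: scripts, stylesheets and images
        var resourceUrls = doc.DocumentNode.Descendants()
            .Select(node =>
                node.Name == "img" ? node.GetAttributeValue("src", "") :
                node.Name == "script" ? node.GetAttributeValue("src", "") :
                node.Name == "link" ? node.GetAttributeValue("href", "") : null)
            .Where(src => !string.IsNullOrEmpty(src))
            .Distinct();

        // Download each one and add its byte size to the total
        foreach (var src in resourceUrls)
        {
            byte[] data = client.DownloadData(new Uri(baseUri, src));
            totalBytes += data.LongLength;
        }
    }

    return totalBytes; // total volume in bytes for the page and its resources
}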
Currently, I have a feature on an ASP.NET website where the user can play back MP3 Files. The code looks something like this:
Response.Clear();
Response.ContentType = "audio/mpeg";
foreach (DataChunk leChunk in db.Mp3Files.First(mp3 => mp3.Mp3ResourceId.Equals(id)).Data.Chunks.OrderBy(chunk => chunk.ChunkOrder))
{
    Response.BinaryWrite(leChunk.Data);
}
Unfortunately, if a larger MP3 file is selected, the audio does not begin to play until the entire file is downloaded, which can cause a noticeable delay. Is there any way to get the MP3 to start playing immediately, even though the entire file may not yet be transferred?
You should be able to do what you want by writing to the output stream of the response, i.e.:
Response.OutputStream.Write
It is also probably a good idea to first check Response.IsClientConnected and give up if the client has disconnected.
I found a demo that allows playback of mp3 files from an asp.net web application:
http://aspsnippets.com/Articles/Save-MP3-Audio-Files-to-database-and-display-in-ASPNet-GridView-with-Play-and-Download-option.aspx
try this:
Response.BufferOutput = false; // sets chunked encoding
Response.ContentType = "audio/mpeg";
using (var bw = new BinaryWriter(Response.OutputStream))
{
    foreach (DataChunk leChunk in db.Mp3Files.First(mp3 => mp3.Mp3ResourceId.Equals(id)).Data.Chunks.OrderBy(chunk => chunk.ChunkOrder))
    {
        if (Response.IsClientConnected) // avoids the "host closed the connection" exception
        {
            bw.Write(leChunk.Data);
        }
    }
}
Also, go to your web.config file and add this if you still have problems with chunked encoding:
<system.webServer>
    <asp enableChunkedEncoding="true" />
</system.webServer>
The error you reported above about the host closing the connection is probably happening because you are opening the page in the browser; when the browser reads the content type, it hands off to the media player and closes the connection it had opened, which causes that error. To avoid this, you need to check periodically whether your client is still connected.
Finally, if you are serving this from an .aspx page, I would use a Generic Handler (.ashx) or a custom handler mapped to the .mp3 extension instead, to avoid the unnecessary overhead of a full web page.
I hope this helps.
Try setting Response.BufferOutput = false before streaming the response.
If the location of the MP3 files is publicly available to your users, then an alternative approach could be to just return the MP3's URL and use the HTML5 audio tag in your markup to stream the music. I am pretty sure that the default behaviour of the audio tag is to stream the file rather than wait until the whole file has downloaded.
One method to support this would be implementing HTTP byte range requests.
By default I don't believe that ASP.NET does this, and it definitely won't if you use any of the code in the question or the answers.
You can implement this manually with a little work though. Another option, which would be much less dev work, would be to let IIS serve a static file. I assume that isn't an option though.
Here's an example implementation:
http://www.codeproject.com/Articles/820146/HTTP-Partial-Content-In-ASP-NET-Web-API-Video
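For a rough idea of what the manual route involves, here is a minimal sketch of a Generic Handler that honours a simple single-range Range header (the file path is a placeholder, and multi-range requests and validation are ignored; the linked article covers a fuller implementation):

using System;
using System.IO;
using System.Web;

public class Mp3Handler : IHttpHandler
{
    public bool IsReusable { get { return true; } }

    public void ProcessRequest(HttpContext context)
    {
        // Placeholder file; in the question's case the bytes would come from the database instead
        byte[] data = File.ReadAllBytes(context.Server.MapPath("~/App_Data/song.mp3"));

        long start = 0, end = data.LongLength - 1;
        string rangeHeader = context.Request.Headers["Range"]; // e.g. "bytes=1000-"

        if (!string.IsNullOrEmpty(rangeHeader) && rangeHeader.StartsWith("bytes="))
        {
            string[] parts = rangeHeader.Substring(6).Split('-');
            if (parts[0].Length > 0) start = long.Parse(parts[0]);
            if (parts.Length > 1 && parts[1].Length > 0) end = long.Parse(parts[1]);

            context.Response.StatusCode = 206; // Partial Content
            context.Response.AddHeader("Content-Range",
                string.Format("bytes {0}-{1}/{2}", start, end, data.LongLength));
        }

        context.Response.ContentType = "audio/mpeg";
        context.Response.AddHeader("Accept-Ranges", "bytes");
        context.Response.AddHeader("Content-Length", (end - start + 1).ToString());
        context.Response.OutputStream.Write(data, (int)start, (int)(end - start + 1));
    }
}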
I have to download some content from a website every day, so I figure it would be nice to have a program that does it... The problem is that the website requires authentication.
My current solution is by using System.Windows.Forms.WebBrowser control. I currently do something like:
/* Create browser */
System.Windows.Forms.WebBrowser browser = new System.Windows.Forms.WebBrowser();
/* navigate to desired site */
browser.Navigate("http://stackoverflow.com/");
// wait for browser to download dom
/* Get all tags of type input */
var elements = browser.Document.Body.GetElementsByTagName("input");
/* let's look for the one we are interested */
foreach (System.Windows.Forms.HtmlElement curInput in elements)
{
    if (curInput.GetAttribute("name") == "q")
    {
        curInput.SetAttribute("value", "I changed the value of this input");
        break;
    }
}
// etc
I think this approach works, but it is not the best solution. I have tried to use the WebClient class, but for some reason it does not work. I believe the reason is that I have to save the cookies?
So my question is: how can I track all the bytes that get sent to the server and all the bytes that come back in the response, in order to download what I need? In other words, I would like the WebClient to act as a web browser, so that once I get to the part I need, I can parse the data I want just by looking at the source.
I would appreciate it if someone could show me an example of how to do this. Google Chrome does a pretty good job of displaying lots of information:
Thanks in advance,
Antonio
Answering your question:
The best utility I know of for tracking traffic is Fiddler (it's free).
For sending advanced HTTP requests, you should use the System.Net.HttpWebRequest class, which has CookieContainer and Headers properties, allowing you to do whatever you want.
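As a rough sketch of what that looks like (the login URL, form field names, and target page below are made-up placeholders; the real values come from watching the traffic in Fiddler):

using System;
using System.IO;
using System.Net;
using System.Text;

static string DownloadWithLogin()
{
    var cookies = new CookieContainer(); // shared between requests so the session survives

    // 1. POST the login form (field names are hypothetical; copy the real ones from Fiddler)
    var loginRequest = (HttpWebRequest)WebRequest.Create("https://example.com/login");
    loginRequest.Method = "POST";
    loginRequest.ContentType = "application/x-www-form-urlencoded";
    loginRequest.CookieContainer = cookies;

    byte[] body = Encoding.UTF8.GetBytes("username=me&password=secret");
    using (var stream = loginRequest.GetRequestStream())
    {
        stream.Write(body, 0, body.Length);
    }
    using (loginRequest.GetResponse()) { } // the response stores the auth cookies in the container

    // 2. GET the protected page with the same cookie container
    var pageRequest = (HttpWebRequest)WebRequest.Create("https://example.com/protected/page");
    pageRequest.CookieContainer = cookies;
    using (var response = (HttpWebResponse)pageRequest.GetResponse())
    using (var reader = new StreamReader(response.GetResponseStream()))
    {
        return reader.ReadToEnd(); // parse what you need out of this markup
    }
}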
Hope it helps.
This is basically what I'm doing. I select a science article from en.wikipedia.org and get a list of users that have made edits and how many times they've edited the article. To get this I follow links from the page that lead me to toolserver. I use this page http://toolserver.org/~daniel/WikiSense/Contributors.php?wikilang=en&wikifam=.wikipedia.org&page=Quantum_mechanics&since=&until=&grouped=on&hideanons=on&order=-edit_count&max=100&order=-edit_count&format=wiki to retrieve the editors in a sorted list, excluding anonymous users. This works well because it comes back as a nicely formatted list (even though it includes dates, which I don't need).
However, to judge their credibility, I need to look at the top users and see the top articles they're contributing to, to see whether they're editing a lot of science articles or just random junk. I'm having a hard time getting data on each of these users, as currently the only site I can find that shows user history is http://en.wikipedia.org/w/index.php?title=Special:Contributions&limit=5000&target=Aquirata
However, it takes quite a while to get a single user's webpage, at least 20 seconds, and then I still have to parse out the useless data, etc. I don't need close to as much data as I'm forced to download. This is my code so far for getting a user's data:
static string getWebPage(string url)
{
    WebClient client = new WebClient();
    client.Headers.Add("user-agent",
        "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
    return client.DownloadString(url);
}

static void Main(string[] args)
{
    string url = "http://en.wikipedia.org/w/index.php?title=Special:Contributions&limit=50&target=Aquirata";
    string page = getWebPage(url);
    var lines = page.Split('\n', '\r');
    var edits = lines.Where(t => t.StartsWith("<li class"));
    foreach (string s in edits)
        Console.WriteLine(s);
    Console.ReadLine();
}
Is there a possible alternative that will be faster and/or easier? Maybe there's a database somewhere for this? (I'm not sure whether Wikimedia publishes statistics on users' contributions.)
Also, I'm using C# because I'm most familiar with it. I might switch over to Java to be cross-platform, but I'm open to any other suggestions.
I think Wikipedia provides its data for download (so you don't have to strip it out of the HTML page).
See: http://dumps.wikimedia.org/enwiki/
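If you go the dump route, the files are large, so stream them with XmlReader rather than loading them into memory. A minimal sketch, assuming you've already downloaded and decompressed one of the dumps from that index (the file name and the elements picked out are illustrative):

using System;
using System.Xml;

static void Main()
{
    // Assumed: a decompressed dump file from dumps.wikimedia.org/enwiki/
    using (var reader = XmlReader.Create(@"C:\dumps\enwiki-latest-pages-articles.xml"))
    {
        while (reader.Read())
        {
            // As a trivial example, print every <username> element (the editor of each revision)
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "username")
            {
                Console.WriteLine(reader.ReadElementContentAsString());
            }
        }
    }
}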
HTH
Selecting only a certain part of a document can be done with a range request, which is documented in RFC 2616 Section 14.16.
For example:
$ curl -H"range: bytes=1-20" www.apache.org
!DOCTYPE HTML PUBLIC
$
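In C#, HttpWebRequest's AddRange method sets the same header for you. A minimal sketch using the example values from above:

using System;
using System.IO;
using System.Net;

static void Main()
{
    // Request only bytes 1-20 of the document, as in the curl example above
    var request = (HttpWebRequest)WebRequest.Create("http://www.apache.org/");
    request.AddRange(1, 20);

    using (var response = (HttpWebResponse)request.GetResponse())
    using (var reader = new StreamReader(response.GetResponseStream()))
    {
        Console.WriteLine(response.StatusCode);   // PartialContent (206) if the server honours the range
        Console.WriteLine(reader.ReadToEnd());    // !DOCTYPE HTML PUBLIC
    }
}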
I think that you can deal with the wiki as XML, so you can use XPath to get the required data.
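A minimal sketch of that idea, assuming Wikipedia's Special:Export endpoint, which returns a page and its revision metadata as XML (the page name and the specific nodes selected are illustrative):

using System;
using System.Net;
using System.Xml;

static void Main()
{
    // Special:Export returns the page plus revision metadata as XML
    string url = "https://en.wikipedia.org/wiki/Special:Export/Quantum_mechanics";

    var xml = new XmlDocument();
    using (var client = new WebClient())
    {
        xml.LoadXml(client.DownloadString(url));
    }

    // local-name() avoids having to hard-code the MediaWiki export namespace
    XmlNode contributor = xml.SelectSingleNode(
        "//*[local-name()='revision']/*[local-name()='contributor']/*[local-name()='username']");

    if (contributor != null)
        Console.WriteLine("Last editor: " + contributor.InnerText);
}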