Download content from the internet with code - C#

I have to download some content from a website every day, so I figured it would be nice to have a program that does it for me... The problem is that the website requires authentication.
My current solution uses the System.Windows.Forms.WebBrowser control. I currently do something like this:
/* Create the browser */
System.Windows.Forms.WebBrowser browser = new System.Windows.Forms.WebBrowser();

/* Navigate to the desired site */
browser.Navigate("http://stackoverflow.com/");

// ... wait for the browser to download the DOM ...

/* Get all tags of type input */
var elements = browser.Document.Body.GetElementsByTagName("input");

/* Look for the one we are interested in */
foreach (System.Windows.Forms.HtmlElement curInput in elements)
{
    if (curInput.GetAttribute("name") == "q")
    {
        curInput.SetAttribute("value", "I changed the value of this input");
        break;
    }
}
// etc.
I think this approach works, but it is not the best solution. I have tried the WebClient class instead, but for some reason it does not work. I believe the reason is that I have to save the cookies?
So my question is: how can I track all the bytes that get sent to the server and all the bytes that come back in the response, so that I can download what I need? In other words, I would like the WebClient to act like a web browser, and once I get to the part I need, I should be able to parse out the data I want just by looking at the source.
I would appreciate it if someone could show me an example of how to do this. Google Chrome does a pretty good job of displaying this kind of traffic in its developer tools.
Thanks in advance,
Antonio

Answering your question:
The best utility I know of for tracking traffic is Fiddler (it's free).
For sending advanced HTTP requests, you should use the System.Net.HttpWebRequest class, which has CookieContainer and Headers properties that let you do whatever you need.
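For example, a rough sketch of how that might look. The URLs and form-field names below are placeholders for whatever the site you are scraping actually uses (Fiddler will show you the real ones), so treat this as an outline rather than working code for your site:

using System;
using System.IO;
using System.Net;
using System.Text;

class AuthenticatedDownload
{
    static void Main()
    {
        var cookies = new CookieContainer();

        // 1) POST the login form so the server's session cookies end up in the container.
        var login = (HttpWebRequest)WebRequest.Create("http://example.com/login"); // placeholder URL
        login.Method = "POST";
        login.ContentType = "application/x-www-form-urlencoded";
        login.CookieContainer = cookies;

        byte[] body = Encoding.UTF8.GetBytes("username=me&password=secret"); // placeholder fields
        using (Stream requestStream = login.GetRequestStream())
            requestStream.Write(body, 0, body.Length);
        using (login.GetResponse()) { } // the response's cookies are now stored in the container

        // 2) Request the protected page with the same CookieContainer.
        var page = (HttpWebRequest)WebRequest.Create("http://example.com/data"); // placeholder URL
        page.CookieContainer = cookies;
        using (var response = (HttpWebResponse)page.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            string html = reader.ReadToEnd();
            Console.WriteLine(html); // parse whatever you need out of the HTML here
        }
    }
}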
Hope it helps.

Related

Streaming MP3 Chunks on ASP.NET

Currently, I have a feature on an ASP.NET website where the user can play back MP3 Files. The code looks something like this:
Response.Clear();
Response.ContentType = "audio/mpeg";
foreach (DataChunk leChunk in db.Mp3Files.First(mp3 => mp3.Mp3ResourceId.Equals(id)).Data.Chunks.OrderBy(chunk => chunk.ChunkOrder))
{
    Response.BinaryWrite(leChunk.Data);
}
Unfortunately, if a larger MP3 file is selected, the audio does not begin to play until the entire file is downloaded, which can cause a noticeable delay. Is there any way to get the MP3 to start playing immediately, even though the entire file may not yet be transferred?
You should be able to do what you want by writing to the output stream of the response, i.e.:
Response.OutputStream.Write
It is also probably a good idea to check Response.IsClientConnected first and give up if the client has disconnected.
I found a demo that allows playback of MP3 files from an ASP.NET web application:
http://aspsnippets.com/Articles/Save-MP3-Audio-Files-to-database-and-display-in-ASPNet-GridView-with-Play-and-Download-option.aspx
try this:
Response.BufferOutput = false; // sets chunked encoding
Response.ContentType = "audio/mpeg";
using (var bw = new BinaryWriter(Response.OutputStream))
{
    foreach (DataChunk leChunk in db.Mp3Files.First(mp3 => mp3.Mp3ResourceId.Equals(id)).Data.Chunks.OrderBy(chunk => chunk.ChunkOrder))
    {
        if (Response.IsClientConnected) // avoids the "host closed the connection" exception
        {
            bw.Write(leChunk.Data);
        }
    }
}
Also, go to your web.config file and add this if you still have problems with chunked encoding:
<system.webServer>
    <asp enableChunkedEncoding="true" />
</system.webServer>
The error you reported above about the host closing the connection probably happens because you are opening the page in the browser: when the browser reads the content type, it hands off to the media player and closes its own connection, which is what causes that error. To avoid this, you need to check periodically whether your client is still connected.
Finally, I would use a Generic Handler (.ashx) or a custom handler mapped to the .mp3 extension instead of an .aspx page, to avoid the unnecessary overhead of the web page life cycle.
I hope this helps.
Try setting Response.BufferOutput = false before streaming the response.
If the locations of the MP3 files are publicly available to your users, an alternative approach could be to just return the MP3's URL and use the HTML5 audio tag in your markup to stream the music. I am pretty sure that the default behaviour of the audio tag is to stream the file rather than wait until the whole file has downloaded.
One method to support this would be implementing HTTP byte range requests.
By default I don't believe that ASP.NET does this, and it definitely won't if you use any of the code in the question or the answers.
You can implement this manually with a little work though. Another option, which would be much less dev work, would be to let IIS serve a static file. I assume that isn't an option though.
Here's an example implementation:
http://www.codeproject.com/Articles/820146/HTTP-Partial-Content-In-ASP-NET-Web-API-Video
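To give an idea of the manual route, here is a rough sketch of Range handling in a Generic Handler. GetMp3Bytes is a placeholder for your own data access, only the simple single-range case is parsed, and the linked article covers a much more complete implementation:

using System;
using System.Web;

public class Mp3Handler : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        byte[] mp3 = GetMp3Bytes(context.Request.QueryString["id"]); // placeholder data access
        HttpResponse response = context.Response;
        response.ContentType = "audio/mpeg";
        response.AddHeader("Accept-Ranges", "bytes");

        string range = context.Request.Headers["Range"]; // e.g. "bytes=1000-" or "bytes=0-499"
        int start = 0, end = mp3.Length - 1;

        if (!string.IsNullOrEmpty(range) && range.StartsWith("bytes="))
        {
            string[] parts = range.Substring(6).Split('-');
            if (parts[0].Length > 0) start = int.Parse(parts[0]);
            if (parts.Length > 1 && parts[1].Length > 0) end = int.Parse(parts[1]);

            response.StatusCode = 206; // Partial Content
            response.AddHeader("Content-Range",
                string.Format("bytes {0}-{1}/{2}", start, end, mp3.Length));
        }

        response.AddHeader("Content-Length", (end - start + 1).ToString());
        response.OutputStream.Write(mp3, start, end - start + 1);
    }

    public bool IsReusable { get { return false; } }

    private static byte[] GetMp3Bytes(string id)
    {
        throw new NotImplementedException("Load the MP3 chunks from your database here.");
    }
}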

Accessing the content of a process

I create an instance of IE with this code:
System.Diagnostics.Process p = System.Diagnostics.Process.Start(
    "IEXPLORE.EXE",
    @"http://www.asnaf.ir/moreinfounit.php?sSdewfwo87kjLKH7624QAZMLLPIdyt75576rtffTfdef22de=1&iIkjkkewr782332ihdsfJHLKDSJKHWPQ397iuhdf87D3dffR=2009585&gGtkh87KJg89jhhJG75gjhu64HGKvuttt87guyr6e67JHGVt=117&cCli986gjdfJK755jh87KJ87hgf9871g00113kjJIZAEQ798=0a26e8ea07358781d128aa4bc98dd89a");
I want to get the contents of the opened window. Is it possible to read the HTML content through this process?
Use the following code:
using (var client = new WebClient())
{
    string result = client.DownloadString("http://www.asnaf.ir/moreinfounit.php?sSdewfwo87kjLKH7624QAZMLLPIdyt75576rtffTfdef22de=1&iIkjkkewr782332ihdsfJHLKDSJKHWPQ397iuhdf87D3dffR=2009585&gGtkh87KJg89jhhJG75gjhu64HGKvuttt87guyr6e67JHGVt=117&cCli986gjdfJK755jh87KJ87hgf9871g00113kjJIZAEQ798=0a26e8ea07358781d128aa4bc98dd89a");
    // TODO: your logic here
}
No. Your processes run in different virtual address spaces. It would be a serious security vulnerability if you could read the memory allocated to another process.
Edit: Consider using something like a WebBrowser control in your original process. That way you could easily retrieve the page it displays.
It might be possible, but I'd actually use an HttpWebRequest to obtain the HTML content. If you really just want the HTML for a given HTTP URL, launching IE as a separate process is definitely not the way to go.
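For instance, a minimal HttpWebRequest sketch (no error handling; the placeholder string is where the full asnaf.ir URL from the question would go):

using System;
using System.IO;
using System.Net;

class Program
{
    static void Main()
    {
        string url = "http://www.asnaf.ir/moreinfounit.php?..."; // the full query-string URL from the question
        var request = (HttpWebRequest)WebRequest.Create(url);
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            string html = reader.ReadToEnd();
            Console.WriteLine(html); // the raw HTML, ready for parsing
        }
    }
}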
You should use the WebClient class to retrieve web page content. Check this link:
http://msdn.microsoft.com/en-us/library/system.net.webclient(v=vs.80).aspx

How to get raw page source (not generated source) from C#

The goal is to get the raw source of the page; I mean, do not run the scripts or let the browser reformat the page at all. For example: suppose the source is <table><tr></table>. After the response, I don't want to get <table><tbody><tr></tr></tbody></table>. How can I do this from C# code?
More info: for example, typing "view-source:http://feeds.gawker.com/kotaku/full" in the browser's address bar will give you an XML file, but if you just call "http://feeds.gawker.com/kotaku/full" it will render an HTML page. What I want is the XML file. I hope this is clear.
Here's one way, but it's not really clear what you actually want.
using (var wc = new WebClient())
{
    var source = wc.DownloadString("http://google.com");
}
If you mean when rendering your own page: you can get access to the raw page content using a response filter, or by overriding the page's Render method. I would question your motives for doing this, though.
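For what it's worth, here is a sketch of the Render-override option for a page you control; the class name is a placeholder and this is illustrative rather than production code:

using System.IO;
using System.Web.UI;

public partial class MyPage : Page // placeholder page class
{
    protected override void Render(HtmlTextWriter writer)
    {
        using (var buffer = new StringWriter())
        using (var bufferedWriter = new HtmlTextWriter(buffer))
        {
            base.Render(bufferedWriter);        // render the page into the buffer
            string rawHtml = buffer.ToString(); // the raw generated markup

            // ... inspect or log rawHtml here ...

            writer.Write(rawHtml);              // still send the markup to the client
        }
    }
}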
Scripts run client-side, so they have no bearing on any C# code.
You can use a tool such as Fiddler to see what is actually being sent over the wire.
disclaimer: I think Fiddler is amazing

Downloading part of a web page - data mining

This is basically what I'm doing. I select a science article from en.wikipedia.org and get a list of users that have made edits, along with how many times they've edited the article. To get this, I follow links from the page that lead me to the toolserver. I use this page http://toolserver.org/~daniel/WikiSense/Contributors.php?wikilang=en&wikifam=.wikipedia.org&page=Quantum_mechanics&since=&until=&grouped=on&hideanons=on&order=-edit_count&max=100&order=-edit_count&format=wiki to retrieve the editors in a sorted list, excluding anonymous users. This works well, because it comes back as a nicely formatted list (even though it includes dates, which I don't need).
However, to judge their credibility, I need to look at the top users and see which articles they contribute to most, to see whether they're editing a lot of science articles or just random junk. I'm having a hard time getting data on each of these users; currently, the only site I can find that shows user history is http://en.wikipedia.org/w/index.php?title=Special:Contributions&limit=5000&target=Aquirata
However, it takes quite a while to get a single user's page, at least 20 seconds, and then I still have to parse out the useless data, etc. I don't need anywhere near as much data as I'm forced to download. This is my code so far for getting a user's data:
static string getWebPage(string url)
{
    WebClient client = new WebClient();
    client.Headers.Add("user-agent",
        "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
    return client.DownloadString(url);
}

static void Main(string[] args)
{
    string url = "http://en.wikipedia.org/w/index.php?title=Special:Contributions&limit=50&target=Aquirata";
    string page = getWebPage(url);
    var lines = page.Split('\n', '\r');
    var edits = lines.Where(t => t.StartsWith("<li class"));
    foreach (string s in edits)
        Console.WriteLine(s);
    Console.ReadLine();
}
Is there a possible alternative that will be faster and/or easier? Maybe there's a database somewhere for this? (I'm not sure if Wikimedia keeps statistics on users' contributions.)
Also, I'm using C# because I'm most familiar with it. I might switch over to Java for cross-platform support, but I'm open to any other suggestions.
I think Wikipedia provides its data for download (so you don't have to scrape it out of the HTML pages).
See: http://dumps.wikimedia.org/enwiki/
HTH
Selecting only a certain part of a document can be done with a range request, which is documented in RFC 2616, Section 14.16.
For example:
$ curl -H"range: bytes=1-20" www.apache.org
!DOCTYPE HTML PUBLIC
$
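The same request can be made from C# with HttpWebRequest.AddRange, assuming the server honors range requests; a minimal sketch:

using System;
using System.IO;
using System.Net;

class RangeExample
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create("http://www.apache.org/");
        request.AddRange(1, 20); // sends "Range: bytes=1-20", like the curl example above
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            Console.WriteLine(reader.ReadToEnd()); // only the requested slice of the document
        }
    }
}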
I think you can treat the wiki output as XML, so you can use XPath to pull out the data you need.
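If the tool does expose an XML output, something along these lines could work; note that the format parameter value and the element and attribute names are guesses you would need to check against the real output:

using System;
using System.Xml;

class XPathExample
{
    static void Main()
    {
        // Hypothetical: the real query string and format value need to be verified.
        var doc = new XmlDocument();
        doc.Load("http://toolserver.org/~daniel/WikiSense/Contributors.php?...&format=xml");

        // "contributor" and "name" are assumed names; inspect the actual XML first.
        foreach (XmlNode node in doc.SelectNodes("//contributor"))
            Console.WriteLine(node.Attributes["name"].Value);
    }
}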

Grabbing Images from a webpage quickly

I was wondering if someone could give me some guidance here. I'd like to be able to programmatically get every image on a webpage as quickly as possible. This is what I'm currently doing (note that clear is a WebBrowser control):
if (clear.ReadyState == WebBrowserReadyState.Complete)
{
    doc = (IHTMLDocument2)clear.Document.DomDocument;
    sobj = doc.selection;
    body = doc.body as HTMLBody;
    sobj.clear();
    range = body.createControlRange() as IHTMLControlRange;
    for (int j = 0; j < clear.Document.Images.Count; j++)
    {
        img = (IHTMLControlElement)clear.Document.Images[j].DomElement;
        HtmlElement ele = clear.Document.Images[j];
        string test = ele.OuterHtml;
        string test2 = ele.InnerHtml;
        range.add(img);
        range.select();
        range.execCommand("Copy", false, null);
        Image image = Clipboard.GetImage();
        if (image != null)
        {
            temp = new Bitmap(image);
            Clipboard.Clear();
            // ...... Rest of code ...........
        }
    }
}
However, I find this can be slow for a lot of images, and it also hijacks my clipboard. Is there a better way?
I suggest using HttpWebRequest and HttpWebResponse. In your comment you asked about efficiency/speed.
From the standpoint of the data being transferred, using HttpWebRequest will be at worst the same as using a browser control, and almost certainly much better. When you (or a browser) make a request to a web server, you initially get only the markup for the page itself. This markup may include image references, objects like Flash, and resources (like scripts and CSS files) that are referenced but not actually included in the page itself. A web browser will then proceed to request all the associated resources needed to render the page, but using HttpWebRequest you can request only those things that you actually want (the images).
From the standpoint of the resources or processing power required to extract entities from a page, there is no comparison: using a browser control is far more resource intensive than scanning an HttpWebResponse. Scanning some data using C# code is extremely fast. Rendering a web page involves JavaScript, graphics rendering, CSS parsing, layout, caching, and so on. It's a pretty intensive operation, actually. Using a browser under programmatic control, this will quickly become apparent: I doubt you could process more than a page every second or so.
On the other hand, a C# program dealing directly with a web server (with no rendering engine involved) could probably handle dozens if not hundreds of pages per second. For all practical purposes, you'd really be limited only by the response time of the server and your internet connection.
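As a small illustration of that approach, here is a sketch that fetches a single image's raw bytes once you already know its URL (the URL and file path are placeholders, and Stream.CopyTo needs .NET 4 or later):

using System.IO;
using System.Net;

class ImageDownload
{
    static void SaveImage(string imageUrl, string localPath)
    {
        var request = (HttpWebRequest)WebRequest.Create(imageUrl);
        using (var response = (HttpWebResponse)request.GetResponse())
        using (Stream source = response.GetResponseStream())
        using (FileStream target = File.Create(localPath))
        {
            source.CopyTo(target); // raw image bytes; no browser, DOM, or clipboard involved
        }
    }

    static void Main()
    {
        SaveImage("http://example.com/logo.png", "logo.png"); // placeholder values
    }
}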
There are multiple approaches here.
If it's a one time thing, just browse to the site and select File > Save Page As... and let the browser save all the images locally for you.
If it's a recurring thing there are lots of different ways.
Buy a program that does this. I'm sure there are hundreds of implementations.
Use the Html Agility Pack to grab the page and compile a list of all the images I want (sketched below). Then spin up a thread for each image that downloads and saves it. You might limit the number of threads depending on various factors like your (and the site's) bandwidth and local disk speed. Note that some sites place arbitrary limits on the number of concurrent requests per connection they will handle. Depending on the site, this might be as few as 3.
This is by no means exhaustive. There are lots of other ways. I probably wouldn't do it through a WebBrowser control, though. That code looks brittle.
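For what it's worth, here is the Html Agility Pack sketch referred to in the second option above. It assumes the HtmlAgilityPack package is installed, uses a placeholder page URL, and downloads sequentially; the threading and per-site limits discussed above are left out:

using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class ImageScraper
{
    static void Main()
    {
        string pageUrl = "http://example.com/"; // placeholder
        HtmlDocument doc = new HtmlWeb().Load(pageUrl);

        // SelectNodes returns null when nothing matches, so check before iterating.
        HtmlNodeCollection imgNodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (imgNodes == null) return;

        using (var client = new WebClient())
        {
            foreach (HtmlNode img in imgNodes)
            {
                string src = img.GetAttributeValue("src", "");
                Uri absolute = new Uri(new Uri(pageUrl), src); // resolve relative URLs
                string fileName = Path.GetFileName(absolute.LocalPath);
                if (fileName.Length == 0) continue; // skip URLs with no obvious file name

                client.DownloadFile(absolute, fileName);
                Console.WriteLine("Saved " + fileName);
            }
        }
    }
}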
