Using HttpClient for Asynchronous File Downloads - c#

I have a service which returns a csv file to a POST request. I would like to download said file using asynchronous techniques. While I can get the file, my code has a couple of outstanding problems and questions:
1) Is this really asynchronous?
2) Is there a way to know the length of the content even though it is being sent in chunked format? (Think progress bars.)
3) How can I best monitor progress in order to hold off the program exit until all work is complete?
using System;
using System.IO;
using System.Net.Http;
namespace TestHttpClient2
{
class Program
{
/*
* Use Yahoo portal to access quotes for stocks - perform asynchronous operations.
*/
static string baseUrl = "http://real-chart.finance.yahoo.com/";
static string requestUrlFormat = "/table.csv?s={0}&d=0&e=9&f=2015&g=d&a=4&b=5&c=2000&ignore=.csv";
static void Main(string[] args)
{
while (true)
{
Console.Write("Enter a symbol to research or [ENTER] to exit: ");
string symbol = Console.ReadLine();
if (string.IsNullOrEmpty(symbol))
break;
DownloadDataForStockAsync(symbol);
}
}
static async void DownloadDataForStockAsync(string symbol)
{
try
{
using (var client = new HttpClient())
{
client.BaseAddress = new Uri(baseUrl);
client.Timeout = TimeSpan.FromMinutes(5);
string requestUrl = string.Format(requestUrlFormat, symbol);
//var content = new KeyValuePair<string, string>[] {
// };
//var formUrlEncodedContent = new FormUrlEncodedContent(content);
var request = new HttpRequestMessage(HttpMethod.Post, requestUrl);
var sendTask = client.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);
var response = sendTask.Result.EnsureSuccessStatusCode();
var httpStream = await response.Content.ReadAsStreamAsync();
string OutputDirectory = "StockQuotes";
if (!Directory.Exists(OutputDirectory))
{
Directory.CreateDirectory(OutputDirectory);
}
DateTime currentDateTime = DateTime.Now;
var filePath = Path.Combine(OutputDirectory, string.Format("{1:D4}_{2:D2}_{3:D2}_{4:D2}_{5:D2}_{6:D2}_{7:D3}_{0}.csv",
symbol,
currentDateTime.Year, currentDateTime.Month, currentDateTime.Day,
currentDateTime.Hour, currentDateTime.Minute, currentDateTime.Second, currentDateTime.Millisecond
));
using (var fileStream = File.Create(filePath))
using (var reader = new StreamReader(httpStream))
{
httpStream.CopyTo(fileStream);
fileStream.Flush();
}
}
}
catch (Exception ex)
{
Console.WriteLine("Error, try again!");
}
}
}
}

"Is this really asynchronous?"
Yes, mostly. The DownloadDataForStockAsync() method will return before the operation is complete, at the await response.Content.ReadAsStreamAsync() statement.
The main exception is near the end of the method, where you call Stream.CopyTo(). This isn't asynchronous, and because it's a potentially lengthy operation it could result in noticeable delays. However, in a console program you won't notice, because the continuation of the method is executed in the thread pool rather than the original calling thread.
If you intend to move this code to a GUI framework, such as Winforms or WPF, you should change the statement to read await httpStream.CopyToAsync(fileStream);
Is there a way to know the length of the content even though it is being sent in chunked format? (Think progress bars.)
Assuming the server includes the Content-Length in the headers (and it should), yes. This should be possible.
Note that if you were using HttpWebRequest, the response object would have a ContentLength property giving you this value directly. You are using HttpRequestMessage here instead, which I'm less familiar with. But as near as I can tell, you should be able to access the Content-Length value like this:
long? contentLength = response.Content.Headers.ContentLength;
if (contentLength != null)
{
// use value to initialize "determinate" progress indication
}
else
{
// no content-length provided; will need to display progress as "indeterminate"
}
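If the content length is available, one way to drive a determinate progress display is to copy the stream manually in chunks instead of calling CopyTo(). A minimal sketch, reusing httpStream, fileStream, and contentLength from the code above (the buffer size and console output are just placeholders):
var buffer = new byte[81920];
long totalRead = 0;
int bytesRead;
while ((bytesRead = await httpStream.ReadAsync(buffer, 0, buffer.Length)) > 0)
{
    // write the chunk and update the running total
    await fileStream.WriteAsync(buffer, 0, bytesRead);
    totalRead += bytesRead;
    if (contentLength != null)
        Console.WriteLine("Progress: {0:P0}", (double)totalRead / contentLength.Value);
    else
        Console.WriteLine("Downloaded {0:N0} bytes so far", totalRead);
}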
How can I best monitor progress in order to hold off the program exit until all work is complete?
There are lots of ways. I will point out that any reasonable way will require that you change the DownloadDataForStockAsync() method so that it returns Task and not void. Otherwise, you don't have access to the task that's created. You should do this anyway though, so that's not a big deal. :)
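Concretely, that just means changing the signature (the body stays the same):
// returning Task instead of void lets callers await or track the operation
static async Task DownloadDataForStockAsync(string symbol)
{
    // ... same body as above ...
}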
The simplest would be to just keep a list of all the tasks you start, and then wait on them before exiting:
static void Main(string[] args)
{
List<Task> tasks = new List<Task>();
while (true)
{
Console.Write("Enter a symbol to research or [ENTER] to exit: ");
string symbol = Console.ReadLine();
if (string.IsNullOrEmpty(symbol))
break;
tasks.Add(DownloadDataForStockAsync(symbol));
}
Task.WaitAll(tasks.ToArray());
}
Of course, this requires that you explicitly maintain a list of each Task object, including those which have already completed. If you intend for this to run for a long time and process a very large number of symbols, that might be prohibitive. In that case, you might prefer to use the CountdownEvent class:
static void Main(string[] args)
{
CountdownEvent countDown = new CountdownEvent(1);
while (true)
{
Console.Write("Enter a symbol to research or [ENTER] to exit: ");
string symbol = Console.ReadLine();
if (string.IsNullOrEmpty(symbol))
break;
countDown.AddCount();
DownloadDataForStockAsync(symbol).ContinueWith(task => countDown.Signal());
}
countDown.Signal(); // release the initial count from the constructor
countDown.Wait();
}
This simply increments the CountdownEvent counter for each task you create, and attaches a continuation to each task to decrement the counter. The counter starts at one so that it cannot hit zero while you are still adding tasks; the final Signal() before Wait() removes that initial count. When the counter reaches zero, the event is set, allowing the call to Wait() to return.

Related

Images take a very long time to load in C#

The problem I have is:
I tried to download 1000+ images -> it works, but the downloaded images take a very long time to finish loading, while the program carries on and downloads the next image, and so on. Say it has reached image 100 while the 8th image still hasn't finished downloading.
So I would like to understand why I encounter such a problem here and/or how to fix it.
I hope you can spot the issue.
private string DownloadSourceCode(string url)
{
string sourceCode = "";
try
{
using (WebClient WC = new WebClient())
{
WC.Encoding = Encoding.UTF8;
WC.Headers.Add("Accept", "image / webp, */*");
WC.Headers.Add("Accept-Language", "fr, fr - FR");
WC.Headers.Add("Cache-Control", "max-age=1");
WC.Headers.Add("DNT", "1");
WC.Headers.Add("Origin", url);
WC.Headers.Add("TE", "Trailers");
WC.Headers.Add("user-agent", Fichier.LoadUserAgent());
sourceCode = WC.DownloadString(url);
}
}
catch (WebException e)
{
if (e.Status == WebExceptionStatus.ProtocolError)
{
string status = string.Format("{0}", ((HttpWebResponse)e.Response).StatusCode);
LabelID.TextInvoke(string.Format("{0} {1} {2} ", status,
((HttpWebResponse)e.Response).StatusDescription,
((HttpWebResponse)e.Response).Server));
}
}
catch (NotSupportedException a)
{
MessageBox.Show(a.Message);
}
return sourceCode;
}
private void DownloadImage(string URL, string filePath)
{
try
{
using (WebClient WC = new WebClient())
{
WC.Encoding = Encoding.UTF8;
WC.Headers.Add("Accept", "image / webp, */*");
WC.Headers.Add("Accept-Language", "fr, fr - FR");
WC.Headers.Add("Cache-Control", "max-age=1");
WC.Headers.Add("DNT", "1");
WC.Headers.Add("Origin", "https://myprivatesite.fr//" + STARTNBR.ToString());
WC.Headers.Add("user-agent", Fichier.LoadUserAgent());
WC.DownloadFile(URL, filePath);
NBRIMAGESDWLD++;
}
STARTNBR = CheckBoxBack.Checked ? --STARTNBR : ++STARTNBR;
}
catch (IOException)
{
LabelID.TextInvoke("Accès non autorisé au fichier");
}
catch (WebException e)
{
if (e.Status == WebExceptionStatus.ProtocolError)
{
LabelID.TextInvoke(string.Format("{0} / {1} / {2} ", ((HttpWebResponse)e.Response).StatusCode,
((HttpWebResponse)e.Response).StatusDescription,
((HttpWebResponse)e.Response).Server));
}
}
catch (NotSupportedException a)
{
MessageBox.Show(a.Message);
}
}
private void DownloadImages()
{
const string URL = "https://myprivatesite.fr/";
string imageIDURL = string.Concat(URL, STARTNBR);
string sourceCode = DownloadSourceCode(imageIDURL);
if (sourceCode != string.Empty)
{
string imageNameURL = Fichier.GetURLImage(sourceCode);
if (imageNameURL != string.Empty)
{
string imagePath = PATHIMAGES + STARTNBR + ".png";
LabelID.TextInvoke(STARTNBR.ToString());
LabelImageURL.TextInvoke(imageNameURL + "\r");
DownloadImage(imageNameURL, imagePath);
Extension.SaveOptions(STARTNBR, CheckBoxBack.Checked);
}
}
STARTNBR = CheckBoxBack.Checked ? --STARTNBR : ++STARTNBR;
}
// END FUNCTIONS
private void BoutonStartPause_Click(object sender, EventArgs e)
{
if (Fichier.RGBIMAGES != null)
{
if (boutonStartPause.Text == "Start")
{
boutonStartPause.ForeColor = Color.DarkRed;
boutonStartPause.Text = "Pause";
if (myTimer == null)
myTimer = new System.Threading.Timer(_ => new Task(DownloadImages).Start(), null, 0, Trackbar.Value);
}
else if (boutonStartPause.Text == "Pause")
EndTimer();
Extension.SaveOptions(STARTNBR, CheckBoxBack.Checked);
}
}
So I would like to understand why I encounter such a problem here and / or how to fix this problem.
There are two likely reasons I can think of.
Connection/Port Exhaustion
Thread Pool Exhaustion
Connection/Port Exhaustion
This happens when you're attempting to create too many connections at once, or when the connections you made previously have not yet been released. When you use a WebClient the resources it uses sometimes don't get released immediately. This causes a delay between when that object is disposed and the actual time that the next WebClient attempting to use the same port/connection actually gets access to that port.
An example of something that would most likely cause Connection/Port Exhaustion
int i = 1_000;
while(i --> 0)
{
using var Client = new WebClient();
// do some webclient stuff
}
When you create a lot of web clients (which is sometimes necessary due to the inherent lack of concurrency in WebClient), there's a possibility that by the time the next WebClient is instantiated, the port that the last one was using may not be available yet, causing either a delay (while it waits for the port) or, worse, the next WebClient opening another port/connection. This can cause a never-ending list of connections to open, causing things to grind to a halt!
Thread Pool Exhaustion
This is caused by trying to create too many Task or Thread objects at once that block their own execution (via Thread.Sleep or a long-running operation).
Normally this isn't an issue since the built in TaskScheduler does a really good job of keeping track of a lot of tasks and makes sure that they all get turns to execute their code.
Where this becomes a problem is that the TaskScheduler has no context for which tasks are important, or which tasks are going to need more time than others to complete. Therefore, when many tasks are processing long-running operations, blocking, or throwing exceptions, the TaskScheduler has to wait for those tasks to finish before it can start new ones. If you are particularly unlucky, the TaskScheduler can start a bunch of tasks that are all blocking, and no new tasks can start, even if all the other waiting tasks are small and would complete instantly.
You should generally use as few tasks as possible to increase reliability and avoid thread pool exhaustion.
What you can do
You have a few options to help improve the reliability and performance of this code.
Consider using HttpClient instead (see the sketch after this list). I understand you may be required to use WebClient, so I have provided answers using WebClient exclusively.
Consider Requesting multiple downloads/strings within the same task to avoid Thread Pool Exhaustion
Consider using a WebClient helper class that limits the available webclients that can be active at once, and has the ability to keep webclients open if you're going to be accessing the same website multiple times.
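For the first option, a minimal sketch of what the HttpClient route could look like (a single shared client reused for every request; the method name is just illustrative, and Fichier.LoadUserAgent() is borrowed from your code):
// one shared HttpClient avoids opening a new connection per request
private static readonly HttpClient Http = new HttpClient();

private static async Task<string> DownloadSourceCodeAsync(string url)
{
    using var request = new HttpRequestMessage(HttpMethod.Get, url);
    request.Headers.TryAddWithoutValidation("user-agent", Fichier.LoadUserAgent());
    using var response = await Http.SendAsync(request);
    response.EnsureSuccessStatusCode();
    return await response.Content.ReadAsStringAsync();
}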
WebClient Helper Class
I created a very simple helper class to get you started. This will allow you to create WebClient requests asynchronously without having to worry about creating too many clients at once. The default limit is the number of Cores in the client's processor(this was chosen arbitrarily).
public class ConcurrentWebClient
{
// limits the number of maximum clients able to be opened at once
public static int MaxConcurrentDownloads => Environment.ProcessorCount;
// holds any clients that should be kept open
private static readonly ConcurrentDictionary<string, WebClient> Clients;
// prevents more than the alloted webclients to be open at once
public static readonly SemaphoreSlim Locker;
// allows cancellation of clients
private static CancellationTokenSource TokenSource = new();
static ConcurrentWebClient()
{
Clients = new ConcurrentDictionary<string, WebClient>();
Locker ??= new SemaphoreSlim(MaxConcurrentDownloads, MaxConcurrentDownloads);
}
// creates new clients, or if a name is provided retrieves it from the dictionary so we don't need to create more than we need
private async Task<WebClient> CreateClient(string Name, bool persistent, CancellationToken token)
{
// try to retrieve it from the dictionary before creating a new one
if (Clients.ContainsKey(Name))
{
return Clients[Name];
}
WebClient newClient = new();
if (persistent)
{
// try to add the client to the dict so we can reference it later
while (Clients.TryAdd(Name, newClient) is false)
{
token.ThrowIfCancellationRequested();
// allow other tasks to do work while we wait to add the new client
await Task.Delay(1, token);
}
}
return newClient;
}
// allows sending basic dynamic requests without having to create webclients outside of this class
public async Task<T> NewRequest<T>(Func<WebClient, T> Expression, int? MaxTimeout = null, string Id = null)
{
// make sure we dont have more than the maximum clients open at one time
// 100s was chosen because WebClient has a default timeout of 100s
await Locker.WaitAsync(MaxTimeout ?? 100_000, TokenSource.Token);
bool persistent = true;
if (Id is null)
{
persistent = false;
Id = string.Empty;
}
try
{
WebClient client = await CreateClient(Id, persistent, TokenSource.Token);
// run the expression to get the result
T result = await Task.Run<T>(() => Expression(client), TokenSource.Token);
if (persistent is false)
{
// just in case the user disposes of the client or sets it to null in the expression, we should not assume it's not null at this point
client?.Dispose();
}
return result;
}
finally
{
// make sure even if we encounter an error we still
// release the lock
Locker.Release();
}
}
// allows assigning the headers without having to do it for every webclient manually
public static void AssignDefaultHeaders(WebClient client)
{
client.Encoding = System.Text.Encoding.UTF8;
client.Headers.Add("Accept", "image / webp, */*");
client.Headers.Add("Accept-Language", "fr, fr - FR");
client.Headers.Add("Cache-Control", "max-age=1");
client.Headers.Add("DNT", "1");
// i have no clue what Fichier is so this was not tested
client.Headers.Add("user-agent", Fichier.LoadUserAgent());
}
// cancels a webclient by name, whether its being used or not
public async Task Cancel(string Name)
{
// look to see if we can find the client
if (Clients.ContainsKey(Name))
{
// get a token in case we have to emergency cancel
CancellationToken token = TokenSource.Token;
// try to get the client from the dictionary
WebClient foundClient = null;
while (Clients.TryGetValue(Name, out foundClient) is false)
{
token.ThrowIfCancellationRequested();
// allow other tasks to perform work while we wait to get the value from the dictionary
await Task.Delay(1, token);
}
// if we found the client we should cancel and dispose of it so its resources get freed
if (foundClient != null)
{
foundClient?.CancelAsync();
foundClient?.Dispose();
}
}
}
// the emergency stop button
public void ForceCancelAll()
{
// this will throw lots of OperationCancelledException, be prepared to catch them, they're fast.
TokenSource?.Cancel();
TokenSource?.Dispose();
TokenSource = new();
foreach (var item in Clients)
{
item.Value?.CancelAsync();
item.Value?.Dispose();
}
Clients.Clear();
}
}
Request Multiple Things at Once
Here all I did was switch to using the helper class, and made it so you can request multiple things using the same connection
public async Task<string[]> DownloadSourceCode(string[] urls)
{
var downloader = new ConcurrentWebClient();
return await downloader.NewRequest<string[]>((WebClient client) =>
{
ConcurrentWebClient.AssignDefaultHeaders(client);
client.Headers.Add("TE", "Trailers");
string[] result = new string[urls.Length];
for (int i = 0; i < urls.Length; i++)
{
string url = urls[i];
client.Headers.Remove("Origin");
client.Headers.Add("Origin", url);
result[i] = client.DownloadString(url);
}
return result;
});
}
private async Task<bool> DownloadImage(string[] URLs, string[] filePaths)
{
var downloader = new ConcurrentWebClient();
bool downloadsSuccessful = await downloader.NewRequest<bool>((WebClient client) =>
{
ConcurrentWebClient.AssignDefaultHeaders(client);
int len = Math.Min(URLs.Length, filePaths.Length);
for (int i = 0; i < len; i++)
{
// side-note, this is assuming the websites you're visiting aren't mutating the headers
client.Headers.Remove("Origin");
client.Headers.Add("Origin", "https://myprivatesite.fr//" + STARTNBR.ToString());
client.DownloadFile(URLs[i], filePaths[i]);
NBRIMAGESDWLD++;
STARTNBR = CheckBoxBack.Checked ? --STARTNBR : ++STARTNBR;
}
return true;
});
return downloadsSuccessful;
}
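A hypothetical call site then batches the work instead of looping one request at a time (the URLs below are placeholders):
// download two pages over the same client in a single request batch
string[] urls = { "https://myprivatesite.fr/1", "https://myprivatesite.fr/2" };
string[] sources = await DownloadSourceCode(urls);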

HttpClient async - Too fast skipping webservice requests

I have an issue where I loop over about 31 webservice URLs.
If I put a Thread.Sleep(1000) at the top of the code, it works perfectly, but if I remove it, I only get success on 10 (sometimes fewer, sometimes more) requests out of 31. How do I make it wait?
Code
foreach(var item in ss)
{
//Call metaDataApi(url,conn,name,alias)
}
public static void metadataApi(string _url, string _connstring, string _spname, string _alias)
{
// Thread.Sleep(1000);
//Metadata creation - Table Creation
using (var httpClient = new HttpClient())
{
string url = _url;
using (HttpResponseMessage response = httpClient.GetAsync(url).GetAwaiter().GetResult())
using (HttpContent content = response.Content)
{
Console.WriteLine("CHECKING");
if (response.IsSuccessStatusCode)
{
Console.WriteLine("IS OK");
string json = content.ReadAsStringAsync().GetAwaiter().GetResult();
//Doing some stuff not relevant
}
}
}
}
How it can look
You should look to use async/await where you can, but you could try something like this:
// you should share this for connection pooling
public static HttpClient httpClient = new HttpClient();
public static void Main(string[] args)
{
// build a list of tasks to wait on, then wait
var tasks = ss.Select(x => metadataApi(url, conn, name, alias)).ToArray();
Task.WaitAll(tasks);
}
public static async Task metadataApi(string _url, string _connstring, string _spname, string _alias)
{
string url = _url;
var response = await httpClient.GetAsync(url);
Console.WriteLine("CHECKING");
if (response.IsSuccessStatusCode)
{
Console.WriteLine("IS OK");
string json = await response.Content.ReadAsStringAsync();
//Doing some stuff not relevant
}
}
One thing to note: this will try to run many requests in parallel. If you need to run them all one after the other, you may want to make another async function that awaits each result individually and call that from Main. .Result is a bit of an antipattern (with modern C# syntax, you can use async on the Main function); for your script it should be "ok", but I'd minimize usage of it (which is why I wouldn't use .Result inside of a loop).
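If you do need them to run one after the other, a minimal sketch of that alternative (using the same placeholder argument names as your original loop) would be:
public static async Task RunSequentiallyAsync()
{
    foreach (var item in ss)
    {
        // await each request before starting the next one
        await metadataApi(url, conn, name, alias);
    }
}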

Use DownloadFileTaskAsync to download all files at once

Given a input text file containing the Urls, I would like to download the corresponding files all at once. I use the answer to this question
UserState using WebClient and TaskAsync download from Async CTP as reference.
public void Run()
{
List<string> urls = File.ReadAllLines(@"c:/temp/Input/input.txt").ToList();
int index = 0;
Task[] tasks = new Task[urls.Count()];
foreach (string url in urls)
{
WebClient wc = new WebClient();
string path = string.Format("{0}image-{1}.jpg", @"c:/temp/Output/", index+1);
Task downloadTask = wc.DownloadFileTaskAsync(new Uri(url), path);
Task outputTask = downloadTask.ContinueWith(t => Output(path));
tasks[index] = outputTask;
index++;
}
Console.WriteLine("Start now");
Task.WhenAll(tasks);
Console.WriteLine("Done");
}
public void Output(string path)
{
Console.WriteLine(path);
}
I expected that the downloading of the files would begin at the point of "Task.WhenAll(tasks)". But it turns out that the output looks like
c:/temp/Output/image-2.jpg
c:/temp/Output/image-1.jpg
c:/temp/Output/image-4.jpg
c:/temp/Output/image-6.jpg
c:/temp/Output/image-3.jpg
[many lines deleted]
Start now
c:/temp/Output/image-18.jpg
c:/temp/Output/image-19.jpg
c:/temp/Output/image-20.jpg
c:/temp/Output/image-21.jpg
c:/temp/Output/image-23.jpg
[many lines deleted]
Done
Why does the downloading begin before WaitAll is called? What can I change to achieve what I would like (i.e. all tasks will begin at the same time)?
Thanks
Why does the downloading begin before WaitAll is called?
First of all, you're not calling Task.WaitAll, which synchronously blocks, you're calling Task.WhenAll, which returns an awaitable which should be awaited.
Now, as others said, when you call an async method, even without using await on it, it fires the asynchronous operation, because any method conforming to the TAP will return a "hot task".
What can I change to achieve what I would like (i.e. all tasks will
begin at the same time)?
Now, if you want to defer execution until Task.WhenAll, you can use Enumerable.Select to project each element to a Task, and materialize it when you pass it to Task.WhenAll:
public async Task RunAsync()
{
IEnumerable<string> urls = File.ReadAllLines(@"c:/temp/Input/input.txt");
var urlTasks = urls.Select((url, index) =>
{
WebClient wc = new WebClient();
string path = string.Format("{0}image-{1}.jpg", @"c:/temp/Output/", index);
var downloadTask = wc.DownloadFileTaskAsync(new Uri(url), path);
Output(path);
return downloadTask;
});
Console.WriteLine("Start now");
await Task.WhenAll(urlTasks);
Console.WriteLine("Done");
}
Why does the downloading begin before WaitAll is called?
Because:
Tasks created by its public constructors are referred to as “cold”
tasks, in that they begin their life cycle in the non-scheduled
TaskStatus.Created state, and it’s not until Start is called on these
instances that they progress to being scheduled. All other tasks begin
their life cycle in a “hot” state, meaning that the asynchronous
operations they represent have already been initiated and their
TaskStatus is an enumeration value other than Created. All tasks
returned from TAP methods must be “hot.”
Since DownloadFileTaskAsync is a TAP method, it returns a "hot" (that is, already started) task.
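You can see the difference directly (a small illustration unrelated to WebClient):
// "cold" task: created but not scheduled until Start() is called
var cold = new Task(() => Console.WriteLine("cold"));
Console.WriteLine(cold.Status);   // Created
cold.Start();

// "hot" task: already started by the time you get the reference
var hot = Task.Run(() => Console.WriteLine("hot"));
Console.WriteLine(hot.Status);    // e.g. WaitingToRun or Running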
What can I change to achieve what I would like (i.e. all tasks will begin at the same time)?
I'd look at TPL Data Flow. Something like this (I've used HttpClient instead of WebClient, but, actually, it doesn't matter):
static async Task DownloadData(IEnumerable<string> urls)
{
// we want to execute this in parallel
var executionOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };
// this block will receive a URL and download the content it points to
var downloadBlock = new TransformBlock<string, Tuple<string, string>>(async url =>
{
using (var client = new HttpClient())
{
var content = await client.GetStringAsync(url);
return Tuple.Create(url, content);
}
}, executionOptions);
// this block will print number of bytes downloaded
var outputBlock = new ActionBlock<Tuple<string, string>>(tuple =>
{
Console.WriteLine($"Downloaded {(string.IsNullOrEmpty(tuple.Item2) ? 0 : tuple.Item2.Length)} bytes from {tuple.Item1}");
}, executionOptions);
// here we tell downloadBlock that it is linked with outputBlock;
// this means that when downloadBlock produces an item,
// it will be posted to outputBlock
using (downloadBlock.LinkTo(outputBlock))
{
// fill downloadBlock with input data
foreach (var url in urls)
{
await downloadBlock.SendAsync(url);
}
// tell downloadBlock that no more items will be posted, so it can finish processing
downloadBlock.Complete();
// wait while downloading data
await downloadBlock.Completion;
// tell outputBlock, that it is completed
outputBlock.Complete();
// wait while printing output
await outputBlock.Completion;
}
}
static void Main(string[] args)
{
var urls = new[]
{
"http://www.microsoft.com",
"http://www.google.com",
"http://stackoverflow.com",
"http://www.amazon.com",
"http://www.asp.net"
};
Console.WriteLine("Start now.");
DownloadData(urls).Wait();
Console.WriteLine("Done.");
Console.ReadLine();
}
Output:
Start now.
Downloaded 1020 bytes from http://www.microsoft.com
Downloaded 53108 bytes from http://www.google.com
Downloaded 244143 bytes from http://stackoverflow.com
Downloaded 468922 bytes from http://www.amazon.com
Downloaded 27771 bytes from http://www.asp.net
Done.
What can I change to achieve what I would like (i.e. all tasks will
begin at the same time)?
To synchronize the beginning of the download you could use Barrier class.
public void Run()
{
List<string> urls = File.ReadAllLines(@"c:/temp/Input/input.txt").ToList();
Barrier barrier = new Barrier(urls.Count, () => { Console.WriteLine("Start now"); });
Task[] tasks = new Task[urls.Count()];
Parallel.For(0, urls.Count, (int index)=>
{
string path = string.Format("{0}image-{1}.jpg", @"c:/temp/Output/", index+1);
tasks[index] = DownloadAsync(new Uri(urls[index]), path, barrier);
});
Task.WaitAll(tasks); // wait for completion
Console.WriteLine("Done");
}
async Task DownloadAsync(Uri url, string path, Barrier barrier)
{
using (WebClient wc = new WebClient())
{
barrier.SignalAndWait();
await wc.DownloadFileTaskAsync(url, path);
Output(path);
}
}

Simple Task-returning Asynchronous HttpListener with async/await and handling high load

I have created the following simple HttpListener to serve multiple requests at the same time (on .NET 4.5):
class Program {
static void Main(string[] args) {
HttpListener listener = new HttpListener();
listener.Prefixes.Add("http://+:8088/");
listener.Start();
ProcessAsync(listener).ContinueWith(task => { });
Console.ReadLine();
}
static async Task ProcessAsync(HttpListener listener) {
HttpListenerContext ctx = await listener.GetContextAsync();
// spin up another listener
Task.Factory.StartNew(() => ProcessAsync(listener));
// Simulate long running operation
Thread.Sleep(1000);
// Perform
Perform(ctx);
await ProcessAsync(listener);
}
static void Perform(HttpListenerContext ctx) {
HttpListenerResponse response = ctx.Response;
string responseString = "<HTML><BODY> Hello world!</BODY></HTML>";
byte[] buffer = Encoding.UTF8.GetBytes(responseString);
// Get a response stream and write the response to it.
response.ContentLength64 = buffer.Length;
Stream output = response.OutputStream;
output.Write(buffer, 0, buffer.Length);
// You must close the output stream.
output.Close();
}
}
I use the Apache Benchmark Tool to load test this. When I make 1 request, the max wait time for a request is 1 second. If I make 10 requests, for example, the max wait time for a response goes up to 2 seconds.
How would you change my above code to make it as efficient as it can be?
Edit
After @JonSkeet's answer, I changed the code as below. Initially, I tried to simulate a blocking call, but I guess that was the core problem. So I took @JonSkeet's suggestion and changed it to Task.Delay(1000). Now the below code gives a max wait time of approx. 1 sec for 10 concurrent requests:
class Program {
static bool KeepGoing = true;
static List<Task> OngoingTasks = new List<Task>();
static void Main(string[] args) {
HttpListener listener = new HttpListener();
listener.Prefixes.Add("http://+:8088/");
listener.Start();
ProcessAsync(listener).ContinueWith(async task => {
await Task.WhenAll(OngoingTasks.ToArray());
});
var cmd = Console.ReadLine();
if (cmd.Equals("q", StringComparison.OrdinalIgnoreCase)) {
KeepGoing = false;
}
Console.ReadLine();
}
static async Task ProcessAsync(HttpListener listener) {
while (KeepGoing) {
HttpListenerContext context = await listener.GetContextAsync();
HandleRequestAsync(context);
// TODO: figure out the best way to add ongoing tasks to OngoingTasks.
}
}
static async Task HandleRequestAsync(HttpListenerContext context) {
// Do processing here, possibly affecting KeepGoing to make the
// server shut down.
await Task.Delay(1000);
Perform(context);
}
static void Perform(HttpListenerContext ctx) {
HttpListenerResponse response = ctx.Response;
string responseString = "<HTML><BODY> Hello world!</BODY></HTML>";
byte[] buffer = Encoding.UTF8.GetBytes(responseString);
// Get a response stream and write the response to it.
response.ContentLength64 = buffer.Length;
Stream output = response.OutputStream;
output.Write(buffer, 0, buffer.Length);
// You must close the output stream.
output.Close();
}
}
It looks to me like you'll end up with a bifurcation of listeners. Within ProcessAsync, you start a new task to listen (via Task.Factory.StartNew), and then you call ProcessAsync again at the end of the method. How can that ever finish? It's not clear whether that's the cause of your performance problems, but it definitely looks like an issue in general.
I'd suggest changing your code to be just a simple loop:
static async Task ProcessAsync(HttpListener listener) {
while (KeepGoing) {
var context = await listener.GetContextAsync();
HandleRequestAsync(context);
}
}
static async Task HandleRequestAsync(HttpListenerContext context) {
// Do processing here, possibly affecting KeepGoing to make the
// server shut down.
}
Now currently the above code ignores the return value of HandleRequestAsync. You may want to keep a list of the "currently in flight" tasks, and when you've been asked to shut down, use await Task.WhenAll(inFlightTasks) to avoid bringing the server down too quickly.
Also note that Thread.Sleep is a blocking delay. An asynchronous delay would be await Task.Delay(1000).
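One rough sketch for tracking those in-flight tasks (this assumes a ConcurrentDictionary used as a set, with a using System.Collections.Concurrent; directive; it is only one of several reasonable approaches):
static readonly ConcurrentDictionary<Task, byte> InFlight = new ConcurrentDictionary<Task, byte>();

static async Task ProcessAsync(HttpListener listener) {
    while (KeepGoing) {
        HttpListenerContext context = await listener.GetContextAsync();
        Task handler = HandleRequestAsync(context);
        InFlight.TryAdd(handler, 0);
        // remove the entry once the request has been handled
        handler.ContinueWith(t => { byte ignored; InFlight.TryRemove(t, out ignored); });
    }
    // let everything still running finish before shutting down
    await Task.WhenAll(InFlight.Keys);
}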

Mass Downloading of Webpages C#

My application requires that I download a large amount of webpages into memory for further parsing and processing. What is the fastest way to do it? My current method (shown below) seems to be too slow and occasionally results in timeouts.
for (int i = 1; i<=pages; i++)
{
string page_specific_link = baseurl + "&page=" + i.ToString();
try
{
WebClient client = new WebClient();
var pagesource = client.DownloadString(page_specific_link);
client.Dispose();
sourcelist.Add(pagesource);
}
catch (Exception)
{
}
}
The way you approach this problem is going to depend very much on how many pages you want to download, and how many sites you're referencing.
I'll use a good round number like 1,000. If you want to download that many pages from a single site, it's going to take a lot longer than if you want to download 1,000 pages that are spread out across dozens or hundreds of sites. The reason is that if you hit a single site with a whole bunch of concurrent requests, you'll probably end up getting blocked.
So you have to implement a type of "politeness policy," that issues a delay between multiple requests on a single site. The length of that delay depends on a number of things. If the site's robots.txt file has a crawl-delay entry, you should respect that. If they don't want you accessing more than one page per minute, then that's as fast as you should crawl. If there's no crawl-delay, you should base your delay on how long it takes a site to respond. For example, if you can download a page from the site in 500 milliseconds, you set your delay to X. If it takes a full second, set your delay to 2X. You can probably cap your delay to 60 seconds (unless crawl-delay is longer), and I would recommend that you set a minimum delay of 5 to 10 seconds.
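As a rough sketch of that rule (the multiplier, minimum, and cap below are just the numbers suggested above):
// compute a per-site politeness delay from robots.txt or the last response time
static TimeSpan ComputeDelay(TimeSpan lastResponseTime, TimeSpan? crawlDelay)
{
    if (crawlDelay.HasValue)
        return crawlDelay.Value;                                  // always respect robots.txt

    var delay = TimeSpan.FromTicks(lastResponseTime.Ticks * 2);   // roughly 2x the response time
    if (delay < TimeSpan.FromSeconds(5)) delay = TimeSpan.FromSeconds(5);    // minimum of 5 seconds
    if (delay > TimeSpan.FromSeconds(60)) delay = TimeSpan.FromSeconds(60);  // cap at 60 seconds
    return delay;
}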
I wouldn't recommend using Parallel.ForEach for this. My testing has shown that it doesn't do a good job. Sometimes it over-taxes the connection and often it doesn't allow enough concurrent connections. I would instead create a queue of WebClient instances and then write something like:
// Create queue of WebClient instances
BlockingCollection<WebClient> ClientQueue = new BlockingCollection<WebClient>();
// Initialize queue with some number of WebClient instances
// now process urls
foreach (var url in urls_to_download)
{
var worker = ClientQueue.Take();
worker.DownloadStringAsync(url, ...);
}
When you initialize the WebClient instances that go into the queue, set their DownloadStringCompleted event handlers to point to a completed event handler. That handler should save the string to a file (or perhaps you should just use DownloadFileAsync), and then add the client back to the ClientQueue.
In my testing, I've been able to support 10 to 15 concurrent connections with this method. Any more than that and I run into problems with DNS resolution (DownloadStringAsync doesn't do the DNS resolution asynchronously). You can get more connections, but doing so is a lot of work.
That's the approach I've taken in the past, and it's worked very well for downloading thousands of pages quickly. It's definitely not the approach I took with my high performance Web crawler, though.
I should also note that there is a huge difference in resource usage between these two blocks of code:
WebClient MyWebClient = new WebClient();
foreach (var url in urls_to_download)
{
MyWebClient.DownloadString(url);
}
---------------
foreach (var url in urls_to_download)
{
WebClient MyWebClient = new WebClient();
MyWebClient.DownloadString(url);
}
The first allocates a single WebClient instance that is used for all requests. The second allocates one WebClient for each request. The difference is huge. WebClient uses a lot of system resources, and allocating thousands of them in a relatively short time is going to impact performance. Believe me ... I've run into this. You're better off allocating just 10 or 20 WebClients (as many as you need for concurrent processing), rather than allocating one per request.
Why not just use a web crawling framework? It can handle all that stuff for you (multithreading, HTTP requests, parsing links, scheduling, politeness, etc.).
Abot (https://code.google.com/p/abot/) handles all that stuff for you and is written in c#.
In addition to @David's perfectly valid answer, I want to add a slightly cleaner "version" of his approach.
var pages = new List<string> { "http://bing.com", "http://stackoverflow.com" };
var sources = new BlockingCollection<string>();
Parallel.ForEach(pages, x =>
{
using(var client = new WebClient())
{
var pagesource = client.DownloadString(x);
sources.Add(pagesource);
}
});
Yet another approach that uses async:
static IEnumerable<string> GetSources(List<string> pages)
{
var sources = new BlockingCollection<string>();
var latch = new CountdownEvent(pages.Count);
foreach (var p in pages)
{
var wc = new WebClient();
wc.DownloadStringCompleted += (x, e) =>
{
// dispose the client only after its download has completed
sources.Add(e.Result);
latch.Signal();
wc.Dispose();
};
wc.DownloadStringAsync(new Uri(p));
}
latch.Wait();
return sources;
}
You should use parallel programming for this purpose.
There are a lot of ways to achieve what you want; the easiest would be something like this:
var pageList = new List<string>();
for (int i = 1; i <= pages; i++)
{
pageList.Add(baseurl + "&page=" + i.ToString());
}
// pageList is a list of urls
Parallel.ForEach<string>(pageList, (page) =>
{
try
{
WebClient client = new WebClient();
var pagesource = client.DownloadString(page);
client.Dispose();
lock (sourcelist)
sourcelist.Add(pagesource);
}
catch (Exception) {}
});
I had a similar case, and this is how I solved it:
using System;
using System.Threading;
using System.Collections.Generic;
using System.Net;
using System.IO;
namespace WebClientApp
{
class MainClassApp
{
private static int requests = 0;
private static object requests_lock = new object();
public static void Main() {
List<string> urls = new List<string> { "http://www.google.com", "http://www.slashdot.org"};
foreach(var url in urls) {
ThreadPool.QueueUserWorkItem(GetUrl, url);
}
int cur_req = 0;
while(cur_req<urls.Count) {
lock(requests_lock) {
cur_req = requests;
}
Thread.Sleep(1000);
}
Console.WriteLine("Done");
}
private static void GetUrl(Object the_url) {
string url = (string)the_url;
WebClient client = new WebClient();
Stream data = client.OpenRead (url);
StreamReader reader = new StreamReader(data);
string html = reader.ReadToEnd ();
/// Do something with html
Console.WriteLine(html);
lock(requests_lock) {
//Maybe you could add here the HTML to SourceList
requests++;
}
}
}
}
You should think about using parallelism, because the slow speed comes from your software waiting for I/O; while one thread is waiting on I/O, another one can get started.
While the other answers are perfectly valid, all of them (at the time of this writing) are neglecting something very important: calls to the web are IO bound, and having a thread wait on an operation like this is going to strain system resources and have an impact on performance.
What you really want to do is take advantage of the async methods on the WebClient class (as some have pointed out) as well as the Task Parallel Library's ability to handle the Event-Based Asynchronous Pattern.
First, you would get the urls that you want to download:
IEnumerable<Uri> urls = Enumerable.Range(1, pages).Select(i => new Uri(baseurl +
"&page=" + i.ToString(CultureInfo.InvariantCulture)));
Then, you would create a new WebClient instance for each url, using the TaskCompletionSource<T> class to handle the calls asynchronously (this won't burn a thread):
IEnumerable<Task<Tuple<Uri, string>>> tasks = urls.Select(url => {
// Create the task completion source.
var tcs = new TaskCompletionSource<Tuple<Uri, string>>();
// The web client.
var wc = new WebClient();
// Attach to the DownloadStringCompleted event.
wc.DownloadStringCompleted += (s, e) => {
// Dispose of the client when done.
using (wc)
{
// If there is an error, set it.
if (e.Error != null)
{
tcs.SetException(e.Error);
}
// Otherwise, set cancelled if cancelled.
else if (e.Cancelled)
{
tcs.SetCanceled();
}
else
{
// Set the result.
tcs.SetResult(new Tuple<Uri, string>(url, e.Result));
}
}
};
// Start the process asynchronously, don't burn a thread.
wc.DownloadStringAsync(url);
// Return the task.
return tcs.Task;
});
Now you have an IEnumerable<T> which you can convert to an array and wait on all of the results using Task.WaitAll:
// Materialize the tasks.
Task<Tuple<Uri, string>>[] materializedTasks = tasks.ToArray();
// Wait for all to complete.
Task.WaitAll(materializedTasks);
Then, you can just use Result property on the Task<T> instances to get the pair of the url and the content:
// Cycle through each of the results.
foreach (Tuple<Uri, string> pair in materializedTasks.Select(t => t.Result))
{
// pair.Item1 will contain the Uri.
// pair.Item2 will contain the content.
}
Note that the above code has the caveat of not having any error handling.
If you wanted to get even more throughput, instead of waiting for the entire list to be finished, you could process the content of a single page after it's done downloading; Task<T> is meant to be used like a pipeline, when you've completed your unit of work, have it continue to the next one instead of waiting for all of the items to be done (if they can be done in an asynchronous manner).
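For example, rather than waiting on the whole batch, you could attach a continuation to each download and process pages as they complete (ProcessPage here is a hypothetical per-page handler):
// process each page as soon as its download finishes
Task[] processingTasks = materializedTasks
    .Select(t => t.ContinueWith(completed =>
    {
        Tuple<Uri, string> pair = completed.Result;
        ProcessPage(pair.Item1, pair.Item2);   // hypothetical handler for a single page
    }))
    .ToArray();

Task.WaitAll(processingTasks);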
I am using an active thread count and an arbitrary limit:
private static volatile int activeThreads = 0;
public static void RecordData()
{
var groupSize = 10;
var source = db.ListOfUrls; // Thousands of urls
var iterations = source.Length / groupSize;
for (int i = 0; i < iterations; i++)
{
var subList = source.Skip(groupSize* i).Take(groupSize);
Parallel.ForEach(subList, (item) => RecordUri(item));
//I want to wait here until process further data to avoid overload
while (activeThreads > 30) Thread.Sleep(100);
}
}
private static async Task RecordUri(Uri uri)
{
using (WebClient wc = new WebClient())
{
Interlocked.Increment(ref activeThreads);
wc.DownloadStringCompleted += (sender, e) => Interlocked.Decrement(ref activeThreads);
var jsonData = await wc.DownloadStringTaskAsync(uri);
var root = JsonConvert.DeserializeObject<RootObject>(jsonData);
RecordData(root);
}
}
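A somewhat tidier way to enforce that limit than polling activeThreads in a loop is a SemaphoreSlim; a sketch under the same assumptions about db.ListOfUrls, RootObject, and RecordData:
private static readonly SemaphoreSlim Throttle = new SemaphoreSlim(30); // at most 30 requests in flight

public static async Task RecordDataAsync()
{
    var tasks = db.ListOfUrls.Select(async uri =>
    {
        await Throttle.WaitAsync();
        try
        {
            using (var wc = new WebClient())
            {
                var jsonData = await wc.DownloadStringTaskAsync(uri);
                var root = JsonConvert.DeserializeObject<RootObject>(jsonData);
                RecordData(root);
            }
        }
        finally
        {
            Throttle.Release();
        }
    });
    await Task.WhenAll(tasks);
}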
