Even though I am creating a new WebClient instance for each request, and letting the old instances be collected by the GC (which shouldn't even matter), the WebClient keeps returning content it previously retrieved from my server; only restarting the app allows new content to be retrieved. The content is a simple text file string with no fancy caching on my end, and the same code works fine on WinRT.
This is a mystery. I am building a ChatBox, so I need each refresh to pull new content, yet the WebClient keeps returning the content it retrieved the first time.
My Code:
public async Task<string> RetrieveDocURL(Uri uri)
{
return await DownloadStringTask(new WebClient(), uri);
}
/*
public async Task<byte[]> RetrieveImageURL(string url)
{
return await _webClient.GetByteArrayAsync(url);
}
*/
private Task<string> DownloadStringTask(WebClient client, Uri uri)
{
var tcs = new TaskCompletionSource<string>();
client.DownloadStringCompleted += (o, e) =>
{
if (e.Error != null)
tcs.SetException(e.Error);
else
tcs.SetResult(e.Result);
};
client.DownloadStringAsync(uri);
return tcs.Task;
}
The WebClient's caching strategy is really aggressive. If you're querying the same URL each time, you should consider adding a random parameter at the end. Something like:
"http://www.yourserver.com/yourService/?nocache=" + DateTime.UtcNow.Ticks
Related
I am trying to download multiple webpages using the WebClient class. When I try to download a website's HTML, a TargetInvocationException is thrown, and I do not know why. Here is my code:
public HashSet<string> DownloadWebpages(HashSet<string> urls)
{
HashSet<string> HTML = new HashSet<string>();
for (int i = 0; i < urls.Count; i++)
{
WebClient client = new WebClient();
client.DownloadStringCompleted += (s, e) =>
{
try
{
lock (HTML)
{
HTML.Add(e.Result); //The exception happens on this line
}
}
catch { }
};
client.DownloadStringAsync(new Uri(urls.ElementAt(i)));
}
return HTML;
}
Is there any way to fix this? All I'm trying to do is download multiple webpages asynchronously, making it as fast as possible.
The TargetInvocationException is thrown because the WebClient is not able to download the website. Here is a test:
string html = new WebClient().DownloadString("https://www.siteth#tw!llcause an error!/randompage/");
This will cause an exception. So if you tried to download the same sort of unreachable webpage with your code, it would cause a TargetInvocationException.
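In other words, reading e.Result after a failed download rethrows the failure wrapped in a TargetInvocationException. A minimal sketch of the handler from the question with the guard added:
client.DownloadStringCompleted += (s, e) =>
{
    // Check e.Error before touching e.Result; the Result getter rethrows
    // the download failure wrapped in a TargetInvocationException.
    if (e.Error != null)
    {
        Console.WriteLine("Download failed: " + e.Error.Message);
        return;
    }
    lock (HTML)
    {
        HTML.Add(e.Result);
    }
};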
WebClient is an obsolete class replaced since 2012 by HttpClient. It was never built with HTTP APIs or thread safety in mind. It's easier to do what you want with HttpClient's GetStringAsync:
public async Task<HashSet<string>> DownloadWebpages(IEnumerable<string> urls)
{
    HashSet<string> HTML = new HashSet<string>();
    var client = new HttpClient();
    foreach (var url in urls)
    {
        var source = await client.GetStringAsync(url);
        HTML.Add(source);
    }
    return HTML;
}
Since .NET 6 you can even retrieve the URLs concurrently with Parallel.ForEachAsync. In this case you'd need a thread-safe collection to store them, e.g. a ConcurrentDictionary:
HttpClient _client = new HttpClient();

public async Task<ConcurrentDictionary<string, string>> DownloadWebpages(IEnumerable<string> urls)
{
    var HTML = new ConcurrentDictionary<string, string>();
    await Parallel.ForEachAsync(urls, async (url, token) =>
    {
        var source = await _client.GetStringAsync(url, token);
        HTML.TryAdd(url, source);
    });
    return HTML;
}
HttpClient is thread-safe and meant to be reused.
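A common pattern (a sketch, independent of the code above) is a single shared, static instance for the whole application:
// One shared instance per application; creating a new HttpClient for
// every request can exhaust sockets under load.
static readonly HttpClient _client = new HttpClient();

public Task<string> GetPageAsync(string url) => _client.GetStringAsync(url);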
If you absolutely have to use WebClient (why???) you can use DownloadStringTaskAsync. You won't be able to make concurrent calls though because WebClient isn't thread-safe.
public async Task<HashSet<string>> DownloadWebpages(IEnumerable<string> urls)
{
    HashSet<string> HTML = new HashSet<string>();
    var client = new WebClient();
    foreach (var url in urls)
    {
        var source = await client.DownloadStringTaskAsync(url);
        HTML.Add(source);
    }
    return HTML;
}
When I execute the following code:
public static async Task UploadFile(string serverPath, string pathToFile, string authToken)
{
serverPath = @"C:\_Series\S1\The 100 S01E03.mp4";
var client = new WebClient();
var uri = new Uri($"http://localhost:50424/api/File/Upload?serverPath={WebUtility.UrlEncode(serverPath)}");
client.UploadProgressChanged += UploadProgressChanged;
client.UploadFileCompleted += UploadCompletedCallback;
//client.UploadFileAsync(uri, "POST", pathToFile);
client.UploadFile(uri, "POST", pathToFile);
}
I get the exception:
System.Net.WebException: 'The remote server returned an error: (404)
Not Found.'
I'm not too worried about the 404; I'm busy tracking down why the WebClient can't find the endpoint. My big concern is that if I call UploadFileAsync with the same uri, the method just executes as if nothing is wrong.
The only indication that something is wrong is that neither of the two event handlers is invoked. I strongly suspect that I don't get an exception because the async call is not async/await but event based, but then I would expect some kind of event or property that indicates an exception has occurred.
How is one supposed to use code that hides errors like this, especially network errors which are relatively more common, in production?
Why no error notification for UploadFileAsync with WebClient?
Citing WebClient.UploadFileAsync Method (Uri, String, String) Remarks
The file is sent asynchronously using thread resources that are automatically allocated from the thread pool. To receive notification when the file upload completes, add an event handler to the UploadFileCompleted event.
Emphasis mine.
You get no errors because it is being executed on another thread so as not to block the current thread. To see the error you can access it in the stated event handler via UploadFileCompletedEventArgs.Error.
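For example (a minimal sketch reusing the uri and pathToFile from the question):
client.UploadFileCompleted += (sender, e) =>
{
    // e.Error holds the WebException (the 404 here) that the synchronous
    // UploadFile call would have thrown.
    if (e.Error != null)
        Console.WriteLine("Upload failed: " + e.Error.Message);
    else if (e.Cancelled)
        Console.WriteLine("Upload cancelled.");
    else
        Console.WriteLine("Upload finished.");
};
client.UploadFileAsync(uri, "POST", pathToFile);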
I was curious why you are using WebClient and not HttpClient, which is already primarily async, but my assumption is that it's because of the upload progress reporting.
I would suggest wrapping the WebClient call with event handlers in a Task using a TaskCompletionSource to take advantage of TAP.
The following is similar to the examples provided in How to: Wrap EAP Patterns in a Task.
public static async Task UploadFileAsync(string serverPath, string pathToFile, string authToken, IProgress<int> progress = null) {
serverPath = @"C:\_Series\S1\The 100 S01E03.mp4";
using (var client = new WebClient()) {
// Wrap Event-Based Asynchronous Pattern (EAP) operations
// as one task by using a TaskCompletionSource<TResult>.
var task = client.createUploadFileTask(progress);
var uri = new Uri($"http://localhost:50424/api/File/Upload?serverPath={WebUtility.UrlEncode(serverPath)}");
client.UploadFileAsync(uri, "POST", pathToFile);
//wait here while the file uploads
await task;
}
}
Where createUploadFileTask is a custom extension method used to wrap the Event-Based Asynchronous Pattern (EAP) operations of the WebClient as one task by using a TaskCompletionSource<TResult>.
private static Task createUploadFileTask(this WebClient client, IProgress<int> progress = null) {
var tcs = new TaskCompletionSource<object>();
#region callbacks
// Specify the callback for the UploadProgressChanged event
// so it can be tracked and relayed through `IProgress<T>`
// if one is provided
UploadProgressChangedEventHandler uploadProgressChanged = null;
if (progress != null) {
uploadProgressChanged = (sender, args) => progress.Report(args.ProgressPercentage);
client.UploadProgressChanged += uploadProgressChanged;
}
// Specify the callback for the UploadFileCompleted
// event that will be raised by this WebClient instance.
UploadFileCompletedEventHandler uploadCompletedCallback = null;
uploadCompletedCallback = (sender, args) => {
// unsubscribing from events after asynchronous
// events have completed
client.UploadFileCompleted -= uploadCompletedCallback;
if (progress != null)
client.UploadProgressChanged -= uploadProgressChanged;
if (args.Cancelled) {
tcs.TrySetCanceled();
return;
} else if (args.Error != null) {
// Pass through to the underlying Task
// any exceptions thrown by the WebClient
// during the asynchronous operation.
tcs.TrySetException(args.Error);
return;
} else
//since no result object is actually expected
//just set it to null to allow task to complete
tcs.TrySetResult(null);
};
client.UploadFileCompleted += uploadCompletedCallback;
#endregion
// Return the underlying Task. The client code
// waits on the task to complete, and handles exceptions
// in the try-catch block there.
return tcs.Task;
}
Going one step further and creating another extension method to wrap the upload file to make it awaitable...
public static Task PostFileAsync(this WebClient client, Uri address, string fileName, IProgress<int> progress = null) {
var task = client.createUploadFileTask(progress);
client.UploadFileAsync(address, "POST", fileName);//this method does not block the calling thread.
return task;
}
This allows your UploadFile to be refactored to:
public static async Task UploadFileAsync(string serverPath, string pathToFile, string authToken, IProgress<int> progress = null) {
using (var client = new WebClient()) {
var uri = new Uri($"http://localhost:50424/api/File/Upload?serverPath={WebUtility.UrlEncode(serverPath)}");
await client.PostFileAsync(uri, pathToFile, progress);
}
}
This now allows you to call the upload asynchronously and even keep track of the progress with your very own progress reporting (optional).
For example, on a XAML-based platform:
public class UploadProgressViewModel : INotifyPropertyChanged, IProgress<int> {
    private int percentage;

    public int Percentage {
        get { return percentage; }
        set {
            percentage = value;
            PropertyChanged?.Invoke(this, new PropertyChangedEventArgs(nameof(Percentage)));
        }
    }

    public event PropertyChangedEventHandler PropertyChanged;

    // IProgress<int>: the WebClient wrapper reports progress here.
    public void Report(int value) {
        Percentage = value;
    }
}
Or use the out-of-the-box Progress<T> class.
So now you should be able to upload the file without blocking the thread and still be able to await it, get progress notifications, and handle exceptions, provided you have a try/catch in place.
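For example, a hypothetical call site (the paths and token are placeholders):
var progress = new Progress<int>(p => Console.WriteLine("Uploaded " + p + "%"));
try
{
    await UploadFileAsync(serverPath, pathToFile, authToken, progress);
}
catch (WebException ex)
{
    // Errors passed to tcs.TrySetException inside the wrapper surface here.
    Console.WriteLine("Upload failed: " + ex.Message);
}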
I have a challenge: I need to make many HTTP requests and handle each response.
I don't want to wait for the response of one request before sending the next, so I'd like to assign a method to process each response (like a callback).
How can I define a callback and assign it to each request?
What you need is the asynchronous programming model, where you create async tasks and later use the await keyword to wait for the response.
So essentially you are not waiting for the first async call to finish; you just fire as many async tasks as you wish and await their responses only when you need the results to move ahead with your program logic.
Have a look at the following for more details:
https://msdn.microsoft.com/en-us/library/hh696703.aspx
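To make this concrete, here is a sketch (FetchAndHandleAsync and its console output are placeholders for whatever per-response handling you need):
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

public class Downloader
{
    private static readonly HttpClient _client = new HttpClient();

    public async Task ProcessAllAsync(IEnumerable<string> urls)
    {
        // Fire every request up front; nothing waits on anything else yet.
        List<Task> tasks = urls.Select(FetchAndHandleAsync).ToList();

        // Yield (without blocking a thread) until the whole batch is done.
        await Task.WhenAll(tasks);
    }

    private static async Task FetchAndHandleAsync(string url)
    {
        string body = await _client.GetStringAsync(url);
        // The per-response "callback": replace with your own processing.
        Console.WriteLine(url + ": " + body.Length + " chars");
    }
}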
1) You can call it normally (non-async):
public string TestNoneAsync()
{
var webClient = new WebClient();
return webClient.DownloadString("http://www.google.com");
}
2) You can use APM, the Asynchronous Programming Model (async):
private void SpecAPI()
{
var req = (HttpWebRequest)WebRequest.Create("http://www.google.com");
//req.Method = "HEAD";
req.BeginGetResponse(
asyncResult =>
{
var resp = (HttpWebResponse)req.EndGetResponse(asyncResult);
var headersText = formatHeaders(resp.Headers);
Console.WriteLine(headersText);
}, null);
}
private string formatHeaders(WebHeaderCollection headers)
{
var headerString = headers.Keys.Cast<string>()
.Select(header => string.Format("{0}:{1}", header, headers[header]));
return string.Join(Environment.NewLine, headerString.ToArray());
}
3) You can create a callback and assign it: EAP, the Event-based Asynchronous Pattern (async since .NET 2.0):
public void EAPAsync()
{
    var webClient = new WebClient();
    // Subscribe before starting the download so the completion event can't be missed.
    webClient.DownloadStringCompleted += webClientDownloadStringCompleted;
    webClient.DownloadStringAsync(new Uri("http://www.google.com"));
}
void webClientDownloadStringCompleted(object sender, DownloadStringCompletedEventArgs e)
{
// use e.Result
Console.WriteLine("download completed callback.");
}
4) You can use the newer, cleaner way: TAP, the Task-based Asynchronous Pattern (C# 5). It's recommended:
public async Task<string> DownloadAsync(string url)
{
var webClient = new WebClient();
return await webClient.DownloadStringTaskAsync(url);
}
public void DownloadAsyncHandler()
{
//DownloadAsync("http://www.google.com");
}
Raw threading isn't a good approach here (you'd have many threads sitting idle, waiting on HTTP requests!).
So basically, I'm creating a DLL project which has to get some info from the Bing Maps API. Below is the method responsible for getting and deserializing the response:
private void GetResponse(Uri uri, Action<Response> callback)
{
WebClient wc = new WebClient();
wc.OpenReadCompleted += (o, a) =>
{
if (callback != null)
{
DataContractJsonSerializer responseSerializer = new DataContractJsonSerializer(typeof(Response));
callback(responseSerializer.ReadObject(a.Result) as Response);
}
};
wc.OpenReadAsync(uri);
}
The method below uses GetResponse to get the latitudes:
public Lattitudes getLattitudes()
{
Lattitudes lattitudes = new Lattitudes();
GetResponse(geocodeRequestURI, (x) =>
{
lattitudes.SouthLatitude = x.ResourceSets[0].Resources[0].BoundingBox[0];
lattitudes.WestLongitude = x.ResourceSets[0].Resources[0].BoundingBox[1];
lattitudes.NorthLatitude = x.ResourceSets[0].Resources[0].BoundingBox[2];
lattitudes.EastLongitude = x.ResourceSets[0].Resources[0].BoundingBox[3];
});
return lattitudes;
}
My problem is that the second method returns an empty object, because the callback inside GetResponse executes later. Is there a way to wait for this event to finish before returning, or do I need to redesign it? (The maximum version of .NET I can use is 4.0.)
I suggest, if you want to wait, making the method awaitable with async and await, like below. (Note: OpenReadAsync is the event-based method and returns void, so it cannot be awaited; DownloadStringTaskAsync is the awaitable equivalent.)
private async Task RequestDataAsync(string uri, Action<string> action)
{
    var client = new WebClient();
    string data = await client.DownloadStringTaskAsync(uri);
    action(data);
}
Call the method from your code (uri and callback stand for your actual arguments):
Task t = RequestDataAsync(uri, callback);
t.Wait();
This helps more: http://www.dotnetperls.com/async
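Note that await and DownloadStringTaskAsync require .NET 4.5, while the question is capped at .NET 4.0. On 4.0 you can get the same effect by wrapping the event in a TaskCompletionSource; a sketch reusing the Response type and geocodeRequestURI from the question (requires System.Net, System.Runtime.Serialization.Json, and System.Threading.Tasks):
private Task<Response> GetResponseTask(Uri uri)
{
    var tcs = new TaskCompletionSource<Response>();
    var wc = new WebClient();
    wc.OpenReadCompleted += (o, a) =>
    {
        if (a.Error != null) { tcs.SetException(a.Error); return; }
        var serializer = new DataContractJsonSerializer(typeof(Response));
        tcs.SetResult((Response)serializer.ReadObject(a.Result));
    };
    wc.OpenReadAsync(uri);
    return tcs.Task;
}

// Blocking on the task is only safe off the UI thread: OpenReadCompleted
// is raised on the captured SynchronizationContext, so calling .Result on
// the UI thread would deadlock.
// Response response = GetResponseTask(geocodeRequestURI).Result;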
My application requires that I download a large number of webpages into memory for further parsing and processing. What is the fastest way to do it? My current method (shown below) seems to be too slow and occasionally results in timeouts.
for (int i = 1; i<=pages; i++)
{
string page_specific_link = baseurl + "&page=" + i.ToString();
try
{
WebClient client = new WebClient();
var pagesource = client.DownloadString(page_specific_link);
client.Dispose();
sourcelist.Add(pagesource);
}
catch (Exception)
{
}
}
The way you approach this problem is going to depend very much on how many pages you want to download, and how many sites you're referencing.
I'll use a good round number like 1,000. If you want to download that many pages from a single site, it's going to take a lot longer than if you want to download 1,000 pages that are spread out across dozens or hundreds of sites. The reason is that if you hit a single site with a whole bunch of concurrent requests, you'll probably end up getting blocked.
So you have to implement a type of "politeness policy," that issues a delay between multiple requests on a single site. The length of that delay depends on a number of things. If the site's robots.txt file has a crawl-delay entry, you should respect that. If they don't want you accessing more than one page per minute, then that's as fast as you should crawl. If there's no crawl-delay, you should base your delay on how long it takes a site to respond. For example, if you can download a page from the site in 500 milliseconds, you set your delay to X. If it takes a full second, set your delay to 2X. You can probably cap your delay to 60 seconds (unless crawl-delay is longer), and I would recommend that you set a minimum delay of 5 to 10 seconds.
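As a rough sketch of that heuristic (the 2x multiplier and the 5/60-second clamps are just the guidance above, not hard rules):
static TimeSpan ComputeDelay(TimeSpan responseTime, TimeSpan? crawlDelay)
{
    // An explicit robots.txt crawl-delay always wins.
    if (crawlDelay.HasValue)
        return crawlDelay.Value;

    // Otherwise scale with how long the site takes to respond,
    // clamped to the recommended minimum and maximum.
    var delay = TimeSpan.FromSeconds(responseTime.TotalSeconds * 2);
    if (delay < TimeSpan.FromSeconds(5)) delay = TimeSpan.FromSeconds(5);
    if (delay > TimeSpan.FromSeconds(60)) delay = TimeSpan.FromSeconds(60);
    return delay;
}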
I wouldn't recommend using Parallel.ForEach for this. My testing has shown that it doesn't do a good job. Sometimes it over-taxes the connection and often it doesn't allow enough concurrent connections. I would instead create a queue of WebClient instances and then write something like:
// Create queue of WebClient instances
BlockingCollection<WebClient> ClientQueue = new BlockingCollection<WebClient>();
// Initialize queue with some number of WebClient instances
// now process urls
foreach (var url in urls_to_download)
{
var worker = ClientQueue.Take();
worker.DownloadStringAsync(url, ...);
}
When you initialize the WebClient instances that go into the queue, set their DownloadStringCompleted event handlers to point to a completed event handler. That handler should save the string to a file (or perhaps you should just use DownloadFileAsync), and then add the client back to the ClientQueue.
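A sketch of such a handler (GetOutputPath is a hypothetical naming scheme, and it assumes the URL was passed as the user token, i.e. DownloadStringAsync(new Uri(url), url)):
void OnDownloadStringCompleted(object sender, DownloadStringCompletedEventArgs e)
{
    var worker = (WebClient)sender;
    if (e.Error == null)
        File.WriteAllText(GetOutputPath((string)e.UserState), e.Result);

    // Hand the worker back so the next URL in the loop can Take() it.
    ClientQueue.Add(worker);
}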
In my testing, I've been able to support 10 to 15 concurrent connections with this method. Any more than that and I run into problems with DNS resolution (DownloadStringAsync doesn't do the DNS resolution asynchronously). You can get more connections, but doing so is a lot of work.
That's the approach I've taken in the past, and it's worked very well for downloading thousands of pages quickly. It's definitely not the approach I took with my high performance Web crawler, though.
I should also note that there is a huge difference in resource usage between these two blocks of code:
WebClient MyWebClient = new WebClient();
foreach (var url in urls_to_download)
{
MyWebClient.DownloadString(url);
}
---------------
foreach (var url in urls_to_download)
{
WebClient MyWebClient = new WebClient();
MyWebClient.DownloadString(url);
}
The first allocates a single WebClient instance that is used for all requests. The second allocates one WebClient for each request. The difference is huge. WebClient uses a lot of system resources, and allocating thousands of them in a relatively short time is going to impact performance. Believe me ... I've run into this. You're better off allocating just 10 or 20 WebClients (as many as you need for concurrent processing), rather than allocating one per request.
Why not just use a web crawling framework? It can handle all of this for you (multithreading, HTTP requests, parsing links, scheduling, politeness, etc.).
Abot (https://code.google.com/p/abot/) handles all that stuff for you and is written in C#.
In addition to @David's perfectly valid answer, I want to add a slightly cleaner "version" of his approach.
var pages = new List<string> { "http://bing.com", "http://stackoverflow.com" };
var sources = new BlockingCollection<string>();
Parallel.ForEach(pages, x =>
{
using(var client = new WebClient())
{
var pagesource = client.DownloadString(x);
sources.Add(pagesource);
}
});
Yet another approach, that uses async:
static IEnumerable<string> GetSources(List<string> pages)
{
    var sources = new BlockingCollection<string>();
    var latch = new CountdownEvent(pages.Count);
    foreach (var p in pages)
    {
        // No using block here: disposing the WebClient as soon as the loop
        // iteration ends would tear it down before the download finishes.
        var wc = new WebClient();
        wc.DownloadStringCompleted += (x, e) =>
        {
            sources.Add(e.Result);
            latch.Signal();
            wc.Dispose();
        };
        wc.DownloadStringAsync(new Uri(p));
    }
    latch.Wait();
    return sources;
}
You should use parallel programming for this purpose.
There are a lot of ways to achieve what you want; the easiest would be something like this:
var pageList = new List<string>();
for (int i = 1; i <= pages; i++)
{
pageList.Add(baseurl + "&page=" + i.ToString());
}
// pageList is a list of urls
Parallel.ForEach<string>(pageList, (page) =>
{
try
{
WebClient client = new WebClient();
var pagesource = client.DownloadString(page);
client.Dispose();
lock (sourcelist)
sourcelist.Add(pagesource);
}
catch (Exception) {}
});
I had a similar case, and here's how I solved it:
using System;
using System.Threading;
using System.Collections.Generic;
using System.Net;
using System.IO;
namespace WebClientApp
{
class MainClassApp
{
private static int requests = 0;
private static object requests_lock = new object();
public static void Main() {
List<string> urls = new List<string> { "http://www.google.com", "http://www.slashdot.org"};
foreach(var url in urls) {
ThreadPool.QueueUserWorkItem(GetUrl, url);
}
int cur_req = 0;
while(cur_req<urls.Count) {
lock(requests_lock) {
cur_req = requests;
}
Thread.Sleep(1000);
}
Console.WriteLine("Done");
}
private static void GetUrl(Object the_url) {
string url = (string)the_url;
WebClient client = new WebClient();
Stream data = client.OpenRead (url);
StreamReader reader = new StreamReader(data);
string html = reader.ReadToEnd ();
/// Do something with html
Console.WriteLine(html);
lock(requests_lock) {
//Maybe you could add here the HTML to SourceList
requests++;
}
}
}
}
You should think about using parallelism here, because the slow speed comes from your software waiting for I/O: while one thread is waiting for I/O, another one can get started.
While the other answers are perfectly valid, all of them (at the time of this writing) are neglecting something very important: calls to the web are IO bound, and having a thread wait on an operation like this is going to strain system resources and hurt your application's performance.
What you really want to do is take advantage of the async methods on the WebClient class (as some have pointed out) as well as the Task Parallel Library's ability to handle the Event-Based Asynchronous Pattern.
First, you would get the urls that you want to download:
IEnumerable<Uri> urls = Enumerable.Range(1, pages).Select(i => new Uri(baseurl +
    "&page=" + i.ToString(CultureInfo.InvariantCulture)));
Then, you would create a new WebClient instance for each url, using the TaskCompletionSource<T> class to handle the calls asynchronously (this won't burn a thread):
IEnumerable<Task<Tuple<Uri, string>>> tasks = urls.Select(url => {
    // Create the task completion source.
    var tcs = new TaskCompletionSource<Tuple<Uri, string>>();

    // The web client.
    var wc = new WebClient();

    // Attach to the DownloadStringCompleted event.
    wc.DownloadStringCompleted += (s, e) => {
        // Dispose of the client when done.
        using (wc)
        {
            // If there is an error, set it.
            if (e.Error != null)
            {
                tcs.SetException(e.Error);
            }
            // Otherwise, set cancelled if cancelled.
            else if (e.Cancelled)
            {
                tcs.SetCanceled();
            }
            else
            {
                // Set the result.
                tcs.SetResult(new Tuple<Uri, string>(url, e.Result));
            }
        }
    };

    // Start the process asynchronously, don't burn a thread.
    wc.DownloadStringAsync(url);

    // Return the task.
    return tcs.Task;
});
Now you have an IEnumerable<T> which you can convert to an array and wait on all of the results using Task.WaitAll:
// Materialize the tasks.
Task<Tuple<Uri, string>>[] materializedTasks = tasks.ToArray();
// Wait for all to complete.
Task.WaitAll(materializedTasks);
Then, you can just use the Result property on the Task<T> instances to get the pair of the url and the content:
// Cycle through each of the results.
foreach (Tuple<Uri, string> pair in materializedTasks.Select(t => t.Result))
{
// pair.Item1 will contain the Uri.
// pair.Item2 will contain the content.
}
Note that the above code has the caveat of not having any error handling.
If you wanted to get even more throughput, instead of waiting for the entire list to be finished, you could process the content of a single page after it's done downloading; Task<T> is meant to be used like a pipeline, when you've completed your unit of work, have it continue to the next one instead of waiting for all of the items to be done (if they can be done in an asynchronous manner).
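For instance, a sketch building on materializedTasks above (Parse is a placeholder for your processing step):
// Error handling is still elided here, matching the caveat above: a faulted
// download cancels its continuation, and Task.WaitAll will then throw.
Task[] pipeline = materializedTasks
    .Select(t => t.ContinueWith(
        completed => Parse(completed.Result.Item2),
        TaskContinuationOptions.OnlyOnRanToCompletion))
    .ToArray();

Task.WaitAll(pipeline);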
I am using an active thread count and an arbitrary limit:
private static volatile int activeThreads = 0;

public static void RecordData()
{
    var groupSize = 10;
    var source = db.ListOfUrls; // Thousands of urls
    var iterations = source.Length / groupSize;
    for (int i = 0; i < iterations; i++)
    {
        var subList = source.Skip(groupSize * i).Take(groupSize);
        // Fire-and-forget: RecordUri returns a Task that is deliberately
        // not awaited; the activeThreads counter throttles instead.
        Parallel.ForEach(subList, (item) => RecordUri(item));
        // I want to wait here before processing further data, to avoid overload.
        while (activeThreads > 30) Thread.Sleep(100);
    }
}

private static async Task RecordUri(Uri uri)
{
    using (WebClient wc = new WebClient())
    {
        Interlocked.Increment(ref activeThreads);
        try
        {
            var jsonData = await wc.DownloadStringTaskAsync(uri);
            var root = JsonConvert.DeserializeObject<RootObject>(jsonData);
            RecordData(root);
        }
        finally
        {
            Interlocked.Decrement(ref activeThreads);
        }
    }
}