I am working with the EventHubBufferedProducerClient class of the Azure SDK for .NET (Azure.Messaging.EventHubs v. 5.7.5); I need to send two groups of messages, with the second group starting after publishing the first.
No issues with the first group: I enqueue them and then use the FlushAsync method to make sure all the messages in the buffer are sent for publication.
When I try to enqueue a message of the second group, though, I receive an ObjectDisposedException: 'The CancellationTokenSource has been disposed.'.
NB: I do not use the EventHubProducerClient because I need to tailor the Partition Key to each message.
I also tried the following "toy code" (I hid the actual connection string and hub name for posting) to be sure the issue is not related to the processing of the data before and after publication; the issue also occurs with this code.
static async Task Main(string[] args)
{
EventHubBufferedProducerClient client = new EventHubBufferedProducerClient("connectionstring", "eventhubname");
client.SendEventBatchFailedAsync += args =>
{
return Task.CompletedTask;
};
for (int i = 0; i<3; i++)
{
EventData data = new EventData($"string {i}");
await client.EnqueueEventAsync(data);
}
await client.FlushAsync();
for (int i = 3; i < 6; i++)
{
EventData data = new EventData($"string {i}");
await client.EnqueueEventAsync(data); //EXCEPTION HERE at the first iteration
}
await client.FlushAsync();
}
I know I can "solve" this by creating a new instance of the client to enqueue and publish the second group of events, but I'm not sure it's the best solution; I'm also quite curious to understand why the issue happens.
Thank you for your help!
I am learning the correct usage of the ExecuteAsync method of RestSharp, and it seems I am facing a memory leak, probably due to incorrect disposal of RestRequests:
var client = new RestClient("some_URL");
int count = 99999;
for (int i = 0; i < count; i++) {
    var request = new RestRequest("some_link", Method.GET);
    request.AddParameter("some_param", i.ToString());
    client.ExecuteAsync(request, response =>
    {
        some_callback_function(response.Content);
        request = null;
        response = null;
    });
}
If I let this code run the memory of my app keeps increasing and doesn't seem to be garbage collected.
What is the proper way to handle ExecuteAsync's responses and dispose the objects afterwards?
I have a piece of code that looks like this:
var taskList = new Task<string>[masterResult.D.Count];
for (int i = 0; i < masterResult.D.Count; i++) //Go through all the lists we need to pull (based on master list) and create a task-list
{
using (var client = new WebClient())
{
Task<string> getDownloadsTask = client.DownloadStringTaskAsync(new Uri(agilityApiUrl + masterResult.D[i].ReferenceIdOfCollection + "?$format=json"));
taskList[i] = getDownloadsTask;
}
}
Task.WaitAll(taskList.Cast<Task>().ToArray()); //Wait for all results to come back
The code freezes at the Task.WaitAll call. I suspect this is because client is already disposed by the time the downloads actually run. Is it possible to delay its disposal until later? Can you recommend another approach?
You need to create and dispose the WebClient within your task. I don't have a way to test this, but see if it points you in the right direction:
var taskList = new Task<string>[masterResult.D.Count];
for (int i = 0; i < masterResult.D.Count; i++) //Go through all the lists we need to pull (based on master list) and create a task-list
{
    int index = i; // copy the loop variable so each lambda sees its own value
    taskList[index] = Task.Run(async () =>
    {
        using (var client = new WebClient())
        {
            // await inside the using block so the client is disposed only after the download finishes
            return await client.DownloadStringTaskAsync(new Uri(agilityApiUrl + masterResult.D[index].ReferenceIdOfCollection + "?$format=json"));
        }
    });
}
Task.WaitAll(taskList.Cast<Task>().ToArray());
I don't see how that code would ever work, since you dispose the WebClient before the task was run.
You want to do something like this:
var taskList = new Task<string>[masterResult.D.Count];
for (int i = 0; i < masterResult.D.Count; i++) //Go through all the lists we need to pull (based on master list) and create a task-list
{
var client = new WebClient();
Task<string> task = client.DownloadStringTaskAsync(new Uri(agilityApiUrl + masterResult.D[i].ReferenceIdOfCollection + "?$format=json"));
task.ContinueWith(x => client.Dispose());
taskList[i] = task;
}
Task.WaitAll(taskList.Cast<Task>().ToArray()); //Wait for all results to come back
That is, if you dispose the WebClient inside the loop, it is already gone while its download is still in progress and you are blocked on Task.WaitAll. The ContinueWith callback is invoked once the task completes, so it can be used to dispose each WebClient instance.
However, to get the code to execute concurrent requests to a single host you need to configure the service point. Read this question: Trying to run multiple HTTP requests in parallel, but being limited by Windows (registry)
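As a rough illustration of what "configuring the service point" can look like: on .NET Framework the number of simultaneous connections to a single host is capped by ServicePointManager.DefaultConnectionLimit, which is commonly 2 for client applications. The helper name and the limit passed to it are assumptions made for this sketch, and how much effect the setting has on newer runtimes is not covered by the answer above.

```csharp
using System.Net;

public static class ConnectionSetup
{
    // Raises the per-host connection cap; call once at startup, before the first request.
    // The limit is an illustrative choice by the caller, not a value from the answer above.
    public static int AllowConcurrentConnections(int limit)
    {
        ServicePointManager.DefaultConnectionLimit = limit;
        return ServicePointManager.DefaultConnectionLimit; // echo back what is now in effect
    }
}
```

Calling ConnectionSetup.AllowConcurrentConnections(10) before the download loop would let up to ten of the WebClient requests to the same host proceed in parallel, assuming the server tolerates it.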
Using this post I wrote code that checks, for example, 200 proxies. The timeout for a socket is 2 sec. Everything is working, but the problem is that Code #1 takes more than 2 minutes to check 200 proxies with a 2 sec timeout, while Code #2 takes 2 sec to check 200 proxies and would also take 2 sec to check 1000 proxies.
Code #1 uses ThreadPool.
Code #2 opens proxyCount sockets, goes to Sleep for 2 sec and then checks which ones succeeded. It takes exactly 2 sec.
So where is the problem in Code #1? Why is ThreadPool with a minimum of 20 threads so much slower than doing it without threads?
Code #1
int proxyCount = 200;
CountdownEvent cde = new CountdownEvent(proxyCount);
private void RefreshProxyIPs(object obj)
{
int workerThreads, ioThreads;
ThreadPool.GetMinThreads(out workerThreads, out ioThreads);
ThreadPool.SetMinThreads(20, ioThreads);
var proxies = GetServersIPs(proxyCount);
watch.Start();
for (int i = 0; i < proxyCount; i++)
{
var proxy = proxies[i];
ThreadPool.QueueUserWorkItem(CheckProxy, new IPEndPoint(IPAddress.Parse(proxy.IpAddress), proxy.Port));
}
cde.Wait();
cde.Dispose();
watch.Stop();
}
private List<IPEndPoint> list = new List<IPEndPoint>();
private void CheckProxy(object o)
{
var proxy = o as IPEndPoint;
using (var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp))
{
var asyncResult = socket.BeginConnect(proxy.Address, proxy.Port, null, null);
if (asyncResult.AsyncWaitHandle.WaitOne(2000))
{
try
{
socket.EndConnect(asyncResult);
}
catch (SocketException)
{
}
catch (ObjectDisposedException)
{
}
}
if (socket.Connected)
{
list.Add(proxy);
socket.Close();
}
}
cde.Signal();
}
Code #2
int proxyCount = 200;
var sockets = new Socket[proxyCount];
var socketsResults = new IAsyncResult[proxyCount];
var proxies = GetServersIPs(proxyCount);
for (int i = 0; i < proxyCount; i++)
{
var proxy = proxies[i];
sockets[i] = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
socketsResults[i] = sockets[i].BeginConnect(IPAddress.Parse(proxy.IpAddress), proxy.Port, null, proxy);
}
Thread.Sleep(2000);
for (int i = 0; i < proxyCount; i++)
{
var success = false;
try
{
if (socketsResults[i].IsCompleted)
{
sockets[i].EndConnect(socketsResults[i]);
success = sockets[i].Connected;
sockets[i].Close();
}
sockets[i].Dispose();
}
catch { }
var proxy = socketsResults[i].AsyncState as Proxy;
if (success) { _validProxies.Add(proxy); }
}
The thread pool threads you start are just not very good thread pool threads. They don't perform any real work but just block on the WaitOne() call. So 20 of them start executing right away and don't complete for 2 seconds. The thread pool scheduler only allows another thread to start when one of them completes, or when none of them complete within 0.5 seconds; in the latter case it allows one extra thread to run. So it takes a while before all the requests are completed.
You could fix it by calling SetMinThreads() and setting the minimum to 200. But that's incredibly wasteful of system resources. You might as well call Socket.BeginConnect() 200 times and find out what happened 2 seconds later. Your fast version.
Looks like in the first example, you're waiting for each proxy connection to timeout, or 2 seconds, whichever comes first. Plus, you're queuing up 200 separate work requests. Your thread pool size is probably going to be way less than this. Check it with GetMaxThreads. You're only going to have that number of work requests running concurrently, and the next request has to wait on a previous item to timeout.
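Both answers come down to the same point: the connection attempts should all be started at once instead of each one parking a pool thread in WaitOne(). A minimal sketch of that idea with tasks follows, assuming a runtime where Socket.ConnectAsync(EndPoint) returns a Task (.NET Framework 4.7.1+ or .NET Core). The class and method names are made up for illustration and are not from the original post.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

public static class ProxyChecker
{
    // Tries to open a TCP connection without blocking a thread while waiting.
    public static async Task<bool> CanConnectAsync(IPEndPoint endpoint, int timeoutMs)
    {
        using (var socket = new Socket(endpoint.AddressFamily, SocketType.Stream, ProtocolType.Tcp))
        {
            Task connectTask = socket.ConnectAsync(endpoint);
            Task finished = await Task.WhenAny(connectTask, Task.Delay(timeoutMs));
            if (finished != connectTask)
            {
                // Timed out: observe any later fault so it is not left unobserved.
                _ = connectTask.ContinueWith(t => { var ignored = t.Exception; },
                                             TaskContinuationOptions.OnlyOnFaulted);
                return false;
            }
            try
            {
                await connectTask; // rethrows if the connection was refused
                return socket.Connected;
            }
            catch (SocketException)
            {
                return false;
            }
        }
    }

    // Starts every attempt at once and returns the endpoints that answered in time.
    public static async Task<List<IPEndPoint>> FindReachableAsync(
        IReadOnlyList<IPEndPoint> endpoints, int timeoutMs)
    {
        bool[] results = await Task.WhenAll(endpoints.Select(e => CanConnectAsync(e, timeoutMs)));
        return endpoints.Where((e, i) => results[i]).ToList();
    }
}
```

Because no thread waits on any individual socket, the total time should stay close to the timeout regardless of how many endpoints are checked, which is the behaviour the question observed with Code #2.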
My application requires that I download a large number of web pages into memory for further parsing and processing. What is the fastest way to do it? My current method (shown below) seems to be too slow and occasionally results in timeouts.
for (int i = 1; i<=pages; i++)
{
string page_specific_link = baseurl + "&page=" + i.ToString();
try
{
WebClient client = new WebClient();
var pagesource = client.DownloadString(page_specific_link);
client.Dispose();
sourcelist.Add(pagesource);
}
catch (Exception)
{
}
}
The way you approach this problem is going to depend very much on how many pages you want to download, and how many sites you're referencing.
I'll use a good round number like 1,000. If you want to download that many pages from a single site, it's going to take a lot longer than if you want to download 1,000 pages that are spread out across dozens or hundreds of sites. The reason is that if you hit a single site with a whole bunch of concurrent requests, you'll probably end up getting blocked.
So you have to implement a type of "politeness policy," that issues a delay between multiple requests on a single site. The length of that delay depends on a number of things. If the site's robots.txt file has a crawl-delay entry, you should respect that. If they don't want you accessing more than one page per minute, then that's as fast as you should crawl. If there's no crawl-delay, you should base your delay on how long it takes a site to respond. For example, if you can download a page from the site in 500 milliseconds, you set your delay to X. If it takes a full second, set your delay to 2X. You can probably cap your delay to 60 seconds (unless crawl-delay is longer), and I would recommend that you set a minimum delay of 5 to 10 seconds.
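The delay rule described above can be sketched as a small function. The answer does not say what "X" is, so the multiplier parameter below is a hypothetical tuning knob standing in for it, and returning the crawl-delay unchanged when one exists is one possible reading of "you should respect that". All names here are illustrative.

```csharp
using System;

public static class Politeness
{
    // multiplier is a hypothetical knob standing in for the unspecified "X" in the text.
    public static TimeSpan ComputeDelay(TimeSpan? crawlDelay, TimeSpan lastResponseTime,
                                        double multiplier, TimeSpan minDelay, TimeSpan maxDelay)
    {
        // A crawl-delay from robots.txt wins, even when it exceeds the usual cap.
        if (crawlDelay.HasValue)
            return crawlDelay.Value;

        // Otherwise scale the delay with how long the site took to respond...
        var scaled = TimeSpan.FromMilliseconds(lastResponseTime.TotalMilliseconds * multiplier);

        // ...and clamp it between the minimum and maximum delays.
        if (scaled < minDelay) return minDelay;
        if (scaled > maxDelay) return maxDelay;
        return scaled;
    }
}
```

With a multiplier of 10 and bounds of 5 and 60 seconds, a 500 ms response yields a 5 second delay and a 1 second response yields 10 seconds, matching the "X, then 2X" relationship in the text.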
I wouldn't recommend using Parallel.ForEach for this. My testing has shown that it doesn't do a good job. Sometimes it over-taxes the connection and often it doesn't allow enough concurrent connections. I would instead create a queue of WebClient instances and then write something like:
// Create queue of WebClient instances
BlockingCollection<WebClient> ClientQueue = new BlockingCollection<WebClient>();
// Initialize queue with some number of WebClient instances
// now process urls
foreach (var url in urls_to_download)
{
var worker = ClientQueue.Take();
worker.DownloadStringAsync(url, ...);
}
When you initialize the WebClient instances that go into the queue, attach a handler to each one's DownloadStringCompleted event. That handler should save the string to a file (or perhaps you should just use DownloadFileAsync) and then add the client back to the ClientQueue.
In my testing, I've been able to support 10 to 15 concurrent connections with this method. Any more than that and I run into problems with DNS resolution (DownloadStringAsync doesn't do the DNS resolution asynchronously). You can get more connections, but doing so is a lot of work.
That's the approach I've taken in the past, and it's worked very well for downloading thousands of pages quickly. It's definitely not the approach I took with my high performance Web crawler, though.
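To make the queue-of-clients idea more concrete, here is one way it might be filled in. This is a sketch, not the author's actual code: the class name, the onPage callback and the choice to pass the URL as the user token are all assumptions made for illustration.

```csharp
#pragma warning disable SYSLIB0014 // WebClient is marked obsolete on newer .NET versions
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Net;

public static class ClientPool
{
    // Builds the queue of reusable clients; each one puts itself back when a download ends.
    public static BlockingCollection<WebClient> Create(int size, Action<Uri, string> onPage)
    {
        var pool = new BlockingCollection<WebClient>();
        for (int i = 0; i < size; i++)
        {
            var client = new WebClient();
            client.DownloadStringCompleted += (sender, e) =>
            {
                try
                {
                    // e.UserState carries the Uri passed as the user token in DownloadAll.
                    if (e.Error == null && !e.Cancelled)
                        onPage((Uri)e.UserState, e.Result);
                }
                finally
                {
                    pool.Add((WebClient)sender); // make the client available again
                }
            };
            pool.Add(client);
        }
        return pool;
    }

    // Take() blocks while every client is busy, which caps the number of concurrent downloads.
    public static void DownloadAll(IEnumerable<Uri> urls, int concurrency, Action<Uri, string> onPage)
    {
        BlockingCollection<WebClient> pool = Create(concurrency, onPage);
        foreach (Uri url in urls)
        {
            WebClient worker = pool.Take();
            worker.DownloadStringAsync(url, url);
        }
        // Reclaim every client so the method returns only after all downloads have finished.
        for (int i = 0; i < concurrency; i++)
            pool.Take().Dispose();
    }
}
```

Note that onPage may be invoked from several threads at once, so whatever it writes to needs to be thread-safe. Failed downloads are silently skipped in this sketch.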
I should also note that there is a huge difference in resource usage between these two blocks of code:
WebClient MyWebClient = new WebClient();
foreach (var url in urls_to_download)
{
MyWebClient.DownloadString(url);
}
---------------
foreach (var url in urls_to_download)
{
WebClient MyWebClient = new WebClient();
MyWebClient.DownloadString(url);
}
The first allocates a single WebClient instance that is used for all requests. The second allocates one WebClient for each request. The difference is huge. WebClient uses a lot of system resources, and allocating thousands of them in a relatively short time is going to impact performance. Believe me ... I've run into this. You're better off allocating just 10 or 20 WebClients (as many as you need for concurrent processing), rather than allocating one per request.
Why not just use a web crawling framework? It can handle all of this for you (multithreading, HTTP requests, parsing links, scheduling, politeness, etc.).
Abot (https://code.google.com/p/abot/) handles all that stuff for you and is written in C#.
In addition to @David's perfectly valid answer, I want to add a slightly cleaner "version" of his approach.
var pages = new List<string> { "http://bing.com", "http://stackoverflow.com" };
var sources = new BlockingCollection<string>();
Parallel.ForEach(pages, x =>
{
using(var client = new WebClient())
{
var pagesource = client.DownloadString(x);
sources.Add(pagesource);
}
});
Yet another approach, that uses async:
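static IEnumerable<string> GetSources(List<string> pages)
{
    var sources = new BlockingCollection<string>();
    var latch = new CountdownEvent(pages.Count);
    foreach (var p in pages)
    {
        // Do not wrap the client in a using block here: the download is still
        // running when the loop body ends, so dispose it in the completion handler.
        var wc = new WebClient();
        wc.DownloadStringCompleted += (x, e) =>
        {
            try
            {
                if (e.Error == null && !e.Cancelled)
                    sources.Add(e.Result);
            }
            finally
            {
                wc.Dispose();
                latch.Signal(); // always signal, or latch.Wait() would hang on a failed download
            }
        };
        wc.DownloadStringAsync(new Uri(p));
    }
    latch.Wait();
    return sources;
}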
You should use parallel programming for this purpose.
There are a lot of ways to achieve what you want; the easiest would be something like this:
var pageList = new List<string>();
for (int i = 1; i <= pages; i++)
{
pageList.Add(baseurl + "&page=" + i.ToString());
}
// pageList is a list of urls
Parallel.ForEach<string>(pageList, (page) =>
{
try
{
WebClient client = new WebClient();
var pagesource = client.DownloadString(page);
client.Dispose();
lock (sourcelist)
sourcelist.Add(pagesource);
}
catch (Exception) {}
});
I had a similar case, and this is how I solved it:
using System;
using System.Threading;
using System.Collections.Generic;
using System.Net;
using System.IO;
namespace WebClientApp
{
class MainClassApp
{
private static int requests = 0;
private static object requests_lock = new object();
public static void Main() {
List<string> urls = new List<string> { "http://www.google.com", "http://www.slashdot.org"};
foreach(var url in urls) {
ThreadPool.QueueUserWorkItem(GetUrl, url);
}
int cur_req = 0;
while(cur_req<urls.Count) {
lock(requests_lock) {
cur_req = requests;
}
Thread.Sleep(1000);
}
Console.WriteLine("Done");
}
private static void GetUrl(Object the_url) {
string url = (string)the_url;
WebClient client = new WebClient();
Stream data = client.OpenRead (url);
StreamReader reader = new StreamReader(data);
string html = reader.ReadToEnd ();
/// Do something with html
Console.WriteLine(html);
lock(requests_lock) {
//Maybe you could add here the HTML to SourceList
requests++;
}
}
}
}
You should consider running the requests in parallel: the code is slow because your software spends its time waiting for I/O, so while one thread is waiting for I/O another one can get started.
While the other answers are perfectly valid, all of them (at the time of this writing) are neglecting something very important: calls to the web are IO bound, and having a thread wait on an operation like this ties up that thread for no benefit and strains system resources.
What you really want to do is take advantage of the async methods on the WebClient class (as some have pointed out) as well as the Task Parallel Library's ability to handle the Event-Based Asynchronous Pattern.
First, you would get the urls that you want to download:
IEnumerable<Uri> urls = Enumerable.Range(1, pages).Select(i => new Uri(baseurl +
"&page=" + i.ToString(CultureInfo.InvariantCulture)));
Then, you would create a new WebClient instance for each url, using the TaskCompletionSource<T> class to handle the calls asynchronously (this won't burn a thread):
IEnumerable<Task<Tuple<Uri, string>>> tasks = urls.Select(url => {
// Create the task completion source.
var tcs = new TaskCompletionSource<Tuple<Uri, string>>();
// The web client.
var wc = new WebClient();
// Attach to the DownloadStringCompleted event.
wc.DownloadStringCompleted += (s, e) => {
// Dispose of the client when done.
using (wc)
{
// If there is an error, set it.
if (e.Error != null)
{
tcs.SetException(e.Error);
}
// Otherwise, set cancelled if cancelled.
else if (e.Cancelled)
{
tcs.SetCanceled();
}
else
{
// Set the result.
tcs.SetResult(new Tuple<Uri, string>(url, e.Result));
}
}
};
// Start the process asynchronously, don't burn a thread.
wc.DownloadStringAsync(url);
// Return the task.
return tcs.Task;
});
Now you have an IEnumerable<T> which you can convert to an array and wait on all of the results using Task.WaitAll:
// Materialize the tasks.
Task<Tuple<Uri, string>>[] materializedTasks = tasks.ToArray();
// Wait for all to complete.
Task.WaitAll(materializedTasks);
Then, you can just use Result property on the Task<T> instances to get the pair of the url and the content:
// Cycle through each of the results.
foreach (Tuple<Uri, string> pair in materializedTasks.Select(t => t.Result))
{
// pair.Item1 will contain the Uri.
// pair.Item2 will contain the content.
}
Note that the code above does not handle errors: if any download fails, Task.WaitAll throws an AggregateException.
If you wanted to get even more throughput, instead of waiting for the entire list to be finished, you could process the content of a single page after it's done downloading; Task<T> is meant to be used like a pipeline, when you've completed your unit of work, have it continue to the next one instead of waiting for all of the items to be done (if they can be done in an asynchronous manner).
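One way to process each page as soon as it arrives, rather than after the whole batch, is to loop over Task.WhenAny. The helper below is a generic sketch with made-up names. It is not part of the original answer, and for very large task lists the repeated WhenAny calls become quadratic, so treat it as an illustration of the idea only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class Pipeline
{
    // Hands each result to 'handle' as soon as its task finishes, in completion order.
    public static async Task ProcessAsCompletedAsync<T>(IEnumerable<Task<T>> tasks, Action<T> handle)
    {
        var pending = tasks.ToList();
        while (pending.Count > 0)
        {
            Task<T> finished = await Task.WhenAny(pending);
            pending.Remove(finished);
            handle(await finished); // rethrows here if that task failed
        }
    }
}
```

With the tasks built earlier, this could be called as Pipeline.ProcessAsCompletedAsync(materializedTasks, pair => { /* parse pair.Item2 */ }), so parsing of early pages overlaps with the downloads still in flight.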
I am using an active thread count and an arbitrary limit:
private static volatile int activeThreads = 0;
public static void RecordData()
{
var nbThreads = 10;
var source = db.ListOfUrls; // Thousands urls
var iterations = source.Length / groupSize;
for (int i = 0; i < iterations; i++)
{
var subList = source.Skip(groupSize* i).Take(groupSize);
Parallel.ForEach(subList, (item) => RecordUri(item));
//I want to wait here until process further data to avoid overload
while (activeThreads > 30) Thread.Sleep(100);
}
}
private static async Task RecordUri(Uri uri)
{
    using (WebClient wc = new WebClient())
    {
        Interlocked.Increment(ref activeThreads);
        wc.DownloadStringCompleted += (sender, e) => Interlocked.Decrement(ref activeThreads);
        string jsonData = await wc.DownloadStringTaskAsync(uri);
        RootObject root = JsonConvert.DeserializeObject<RootObject>(jsonData);
        RecordData(root);
    }
}