I have a piece of code that looks like this:
var taskList = new Task<string>[masterResult.D.Count];
for (int i = 0; i < masterResult.D.Count; i++) //Go through all the lists we need to pull (based on master list) and create a task-list
{
using (var client = new WebClient())
{
Task<string> getDownloadsTask = client.DownloadStringTaskAsync(new Uri(agilityApiUrl + masterResult.D[i].ReferenceIdOfCollection + "?$format=json"));
taskList[i] = getDownloadsTask;
}
}
Task.WaitAll(taskList.Cast<Task>().ToArray()); //Wait for all results to come back
The code freezes at Task.WaitAll. I suspect that is because client is already disposed by the time the download runs. Is it possible to delay its disposal until later, or can you recommend another approach?
You need to create and dispose the WebClient within your task. I don't have a way to test this, but see if it points you in the right direction:
var taskList = new Task<string>[masterResult.D.Count];
for (int i = 0; i < masterResult.D.Count; i++) //Go through all the lists we need to pull (based on master list) and create a task-list
{
int index = i; // copy the loop variable; the lambda would otherwise see i after the loop has advanced
taskList[i] = Task.Run(async () =>
{
using (var client = new WebClient())
{
// await inside the using block so the client is disposed only after the download has finished
return await client.DownloadStringTaskAsync(new Uri(agilityApiUrl + masterResult.D[index].ReferenceIdOfCollection + "?$format=json"));
}
});
}
Task.WaitAll(taskList.Cast<Task>().ToArray());
I don't see how that code would ever work, since you dispose the WebClient before the task was run.
You want to do something like this:
var taskList = new Task<string>[masterResult.D.Count];
for (int i = 0; i < masterResult.D.Count; i++) //Go through all the lists we need to pull (based on master list) and create a task-list
{
var client = new WebClient();
Task<string> task = client.DownloadStringTaskAsync(new Uri(agilityApiUrl + masterResult.D[i].ReferenceIdOfCollection + "?$format=json"));
task.ContinueWith(x => client.Dispose());
taskList[i] = task;
}
Task.WaitAll(taskList.Cast<Task>().ToArray()); //Wait for all results to come back
That is, if you dispose the WebClient inside the first loop, it is already disposed while its download is still in flight, which would explain why Task.WaitAll never returns. The ContinueWith callback is invoked once the task completes, so it can be used to dispose each WebClient instance.
However, to get the code to execute concurrent requests to a single host you need to configure the service point. Read this question: Trying to run multiple HTTP requests in parallel, but being limited by Windows (registry)
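A minimal sketch of what "configuring the service point" can look like, assuming .NET Framework, where WebClient requests go through ServicePointManager. The class name and the value 10 are illustrative assumptions, not taken from the linked question, and on newer runtimes the setting may not have the same effect.

```csharp
using System;
using System.Net;

public static class ConnectionLimitSketch
{
    // Raises the default number of concurrent connections allowed per host.
    // Call once at startup, before the first request is issued.
    public static int Configure(int maxConnectionsPerHost)
    {
        ServicePointManager.DefaultConnectionLimit = maxConnectionsPerHost;
        return ServicePointManager.DefaultConnectionLimit;
    }

    public static void Main()
    {
        // 10 is an arbitrary example value (assumption), not a recommendation.
        Console.WriteLine("Connection limit: " + Configure(10));
    }
}
```

Without a change like this, requests to the same host may be queued behind a small default limit even though the tasks were all started together.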
I have a method where I need recursion to get a hierarchy of files and folders from an API (graph). When I do my recursion inside of a for loop it works as expected and returns a hierarchy with 665 files. This takes about a minute though because it only fetches one folder at a time, whereas doing it with Task.WhenAll only takes 10 seconds.
When using Task.WhenAll I get inconsistent results, though: it returns the hierarchy with anywhere from 661 to 665 files depending on the run, with the exact same code. I'm using the variable totalFileCount as an indication of how many files it has found.
Obviously I'm doing something wrong, but I can't quite figure out what. Any help is greatly appreciated!
For loop
for (int i = 0; i < folders.Count; i++)
{
var folder = folders[i];
await GetSharePointHierarchy(folder, folderItem, $"{outPath}{folder.Name}\\");
}
Task.WhenAll
var tasks = new List<Task>();
for (int i = 0; i < folders.Count; i++)
{
var folder = folders[i];
var task = GetSharePointHierarchy(folder, folderItem, $"{outPath}{folder.Name}\\");
tasks.Add(task);
}
await Task.WhenAll(tasks);
Full method
public async Task<GraphFolderItem> GetSharePointHierarchy(DriveItem currentDrive, GraphFolderItem parentFolderItem, string outPath = "")
{
IEnumerable<DriveItem> children = await graphHandler.GetFolderChildren(sourceSharepointId, currentDrive.Id);
var folders = new List<DriveItem>();
var files = new List<DriveItem>();
var graphFolderItems = new List<GraphFolderItem>();
foreach (var item in children)
{
if (item.Folder != null)
{
System.IO.Directory.CreateDirectory(outPath + item.Name);
//Console.WriteLine(outPath + item.Name);
folders.Add(item);
}
else
{
totalFileCount++;
files.Add(item);
}
}
var folderItem = new GraphFolderItem
{
SourceFolder = currentDrive,
ItemChildren = files,
FolderChildren = graphFolderItems,
DownloadPath = outPath
};
parentFolderItem.FolderChildren.Add(folderItem);
for (int i = 0; i < folders.Count; i++)
{
var folder = folders[i];
await GetSharePointHierarchy(folder, folderItem, $"{outPath}{folder.Name}\\");
}
return parentFolderItem;
}
It is a race condition. Code that runs in parallel should not modify an ordinary shared variable without synchronization. Use a thread-safe construct that fits your requirement, such as a thread-safe collection, lock, Monitor, or Interlocked.
In this case, Interlocked.Increment is a good approach. Replace totalFileCount++ with:
Interlocked.Increment(ref totalFileCount);
For more background, see:
Thread Safe concept in details or Thread-safety
The difference is that with Task.WhenAll the recursive calls run in parallel, whereas awaiting each call in the for loop runs them one after another.
That is exactly your problem: the code inside the async function accesses a shared variable, totalFileCount,
so several threads can update it at the same time.
To fix it and still run the code in parallel, wrap the access to totalFileCount in a lock statement, which lets only one thread at a time execute the block:
lock(lockRefObject)
{
totalFileCount++;
}
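To compare the two fixes suggested above, here is a small self-contained sketch that updates one counter with Interlocked.Increment and another under a lock. The class, field, and method names are hypothetical and stand in for totalFileCount; nothing here touches the Graph API.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class CounterSketch
{
    private static int interlockedCount;
    private static int lockedCount;
    private static readonly object countLock = new object();

    // Runs `workers` tasks that each bump both counters `perWorker` times.
    public static Tuple<int, int> Run(int workers, int perWorker)
    {
        interlockedCount = 0;
        lockedCount = 0;
        var tasks = new Task[workers];
        for (int w = 0; w < workers; w++)
        {
            tasks[w] = Task.Run(() =>
            {
                for (int n = 0; n < perWorker; n++)
                {
                    Interlocked.Increment(ref interlockedCount); // atomic increment
                    lock (countLock) { lockedCount++; }          // one thread at a time
                }
            });
        }
        Task.WaitAll(tasks);
        return Tuple.Create(interlockedCount, lockedCount);
    }

    public static void Main()
    {
        var totals = Run(8, 100000);
        // Both totals equal workers * perWorker because no update is lost.
        Console.WriteLine(totals.Item1 + " " + totals.Item2);
    }
}
```

The same reasoning would apply to any other shared state the recursive method mutates concurrently, for example a List<T> that several calls add to, since List<T> is not safe for concurrent writes.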
I have a Web App and I noticed that after a while it restarts due to lack of memory.
After researching, I found that memory increases after sending a message via WebPubSub.
This can be easily reproduced (sample):
using Azure.Core;
using Azure.Messaging.WebPubSub;
var connectionString = "<ConnectionString >";
var hub = "<HubName>";
var serviceClient = new WebPubSubServiceClient(connectionString, hub);
Console.ReadKey();
Task[] tasks = new Task[100];
for (int i = 0; i < 100; i++)
{
tasks[i] = serviceClient.SendToUserAsync("testUser", RequestContent.Create("Message"), ContentType.TextPlain);
}
Task.WaitAll(tasks);
Console.ReadKey();
During debugging, I noticed that a new HttpConnection is created during each send and the old one remains. Thus, when sending 100 messages, 100 connections will be created, during the next sending, more will be created.
I concluded that the problem is in the WebPubSub SDK, but maybe it's not so and someone can help me solve it.
UPD:
When sending 100 messages in parallel, 100 connections are created in the HttpConnectionPool, hence the sharp increase in unmanaged memory. The next time you send 100 messages, existing connections from the pool will be used, and no new connections are created, but a lot of data is allocated on the heap.
So now I'm trying to find out how long connections live in the pool, what data lives on the heap, and how to free it. Calling GC.Collect(); after Task.WaitAll(tasks); solves the problem.
Both Response and RequestContent are IDisposable.
Does using the code below help?
var serviceClient = new WebPubSubServiceClient(connectionString, hub);
Task[] tasks = new Task[100];
for (int i = 0; i < 100; i++)
{
tasks[i] = SendToUser(serviceClient);
}
Task.WaitAll(tasks);
Console.ReadKey();
private async static Task SendToUser(WebPubSubServiceClient serviceClient)
{
using var content = RequestContent.Create("Message");
using var response = await serviceClient.SendToUserAsync("testUser", content, ContentType.TextPlain);
}
How can I execute this code in parallel? I tried running it on separate threads, but the requests are still executed sequentially. I am new to parallel programming and would be very grateful for your help.
public async Task<IList<AdModel>> LocalBitcoins_buy(int page_number)
{
IList<AdModel> Buy_ads = new List<AdModel>();
string next_page_url;
string url = "https://localbitcoins.net/buy-bitcoins-online/.json?page=" + page_number;
WebRequest request = WebRequest.Create(url);
request.Method = "GET";
using (WebResponse response = await request.GetResponseAsync())
{
using (var reader = new StreamReader(response.GetResponseStream()))
{
JObject json = JObject.Parse(await reader.ReadToEndAsync());
next_page_url = (string) json["pagination"]["next"];
int counter = (int) json["data"]["ad_count"];
for (int ad_list_index = 0; ad_list_index < counter; ad_list_index++)
{
AdModel save = new AdModel();
save.Seller = (string) json["data"]["ad_list"][ad_list_index]["data"]["profile"]["username"];
save.Give = (string) json["data"]["ad_list"][ad_list_index]["data"]["currency"];
save.Get = "BTC";
save.Limits = (string) json["data"]["ad_list"][ad_list_index]["data"]["first_time_limit_btc"];
save.Deals = (string) json["data"]["ad_list"][ad_list_index]["data"]["profile"]["trade_count"];
save.Reviews = (string) json["data"]["ad_list"][ad_list_index]["data"]["profile"]["feedback_score"];
save.PaymentWindow = (string) json["data"]["ad_list"][ad_list_index]["data"]["payment_window_minutes"];
Buy_ads.Add(save);
}
}
}
Console.WriteLine(page_number);
return Buy_ads;
}
I googled and found these links: 1, 2. It seems that WebRequest cannot execute requests in parallel. I also tried sending multiple requests in parallel using WebRequest, and for some reason they were not made in parallel.
But when I used the HttpClient class, the requests did run in parallel. Try using HttpClient instead of WebRequest, as Microsoft recommends.
So, firstly, you should use HttpClient to make the web requests.
Then you can use the next approach to download pages in parallel:
public static IList<AdModel> DownloadAllPages()
{
int[] pageNumbers = getPageNumbers();
// Array of tasks that download data from the pages.
Task<IList<AdModel>>[] tasks = new Task<IList<AdModel>>[pageNumbers.Length];
// This loop launches download tasks in parallel.
for (int i = 0; i < pageNumbers.Length; i++)
{
// Launch download task without waiting for its completion.
tasks[i] = LocalBitcoins_buy(pageNumbers[i]);
}
// Wait for all tasks to complete.
Task.WaitAll(tasks);
// Combine results from all tasks into a common list.
return tasks.SelectMany(t => t.Result).ToList();
}
Of course, you should add error handling into this method.
I suggest you study this post carefully: https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/async/walkthrough-accessing-the-web-by-using-async-and-await
It has a tutorial that is doing exactly what you want to do: Downloading multiple pages simultaneously.
WebRequest's GetResponse is synchronous by design. For asynchronous use it has a second method: GetResponseAsync().
To get your toes wet with threading, you can also use the drop-in replacement for the foreach and for loops: https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/how-to-write-a-simple-parallel-foreach-loop
I'm writing a C# ping application.
I started with a synchronous ping method, but I figured out that pinging several servers with one click takes more and more time.
So I decided to try the asynchronous method.
Can someone help me out?
public async Task<string> CustomPing(string ip, int amountOfPackets, int sizeOfPackets)
{
// timeout
int Timeout = 2000;
// PaketSize logic
string packet = "";
for (int j = 0; j < sizeOfPackets; j++)
{
packet += "b";
};
byte[] buffer = Encoding.ASCII.GetBytes(packet);
// time-var
long ms = 0;
// Main Method
using (Ping ping = new Ping())
for (int i = 0; i < amountOfPackets; i++)
{
PingReply reply = await ping.SendPingAsync(ip, Timeout, buffer);
ms += reply.RoundtripTime;
};
return (ms / amountOfPackets + " ms");
}
I defined a "Server"-Class (Ip or host, City, Country).
Then I create a "server"-List:
List<Server> ServerList = new List<Server>()
{
new Server("www.google.de", "Some City,", "Some Country")
};
Then I loop through this list and I try to call the method like this:
foreach (var server in ServerList)
ListBox.Items.Add("The average response time of your custom server is: " + server.CustomPing(server.IP, amountOfPackets, sizeOfPackets));
Unfortunately, this is much more complicated than the synchronous method, and at the point where my method should return the value, it returns
System.Threading.Tasks.Task`1[System.String]
Since you have an async method, it returns the task when it is called like this:
Task<string> task = server.CustomPing(server.IP, amountOfPackets, sizeOfPackets);
When you add it directly to your ListBox while concatenating it with a string, the ToString method is used, which by default prints the full class name of the object. This should explain your output:
System.Threading.Tasks.Task`1[System.String]
The [System.String] part actually tells you the return type of the task result. This is what you want, and to get it you need to await the task, like this:
foreach (var server in ServerList)
ListBox.Items.Add("The average response time of your custom server is: " + await server.CustomPing(server.IP, amountOfPackets, sizeOfPackets));
However, there are two catches:
1) This has to be done in another async method, and
2) it defeats the parallelism you are aiming for, because the loop waits for each method call to finish before starting the next.
What you can do is to start all tasks one after the other, collect the returning tasks and wait for all of them to finish. Preferably you would do this in an async method like a clickhandler:
private async void Button1_Click(object sender, EventArgs e)
{
Task<string> [] allTasks = ServerList.Select(server => server.CustomPing(server.IP, amountOfPackets, sizeOfPackets)).ToArray();
// WhenAll will wait for all tasks to finish and return the return values of each method call
string [] results = await Task.WhenAll(allTasks);
// now you can execute your loop and display the results:
foreach (var result in results)
{
ListBox.Items.Add(result);
}
}
The class System.Threading.Tasks.Task<TResult> is a helper class for multitasking. Although it lives in the Threading namespace, it works just as well for threadless multitasking. If a function returns a Task, you can usually use it with any form of multitasking; tasks are agnostic about how they are run. You can even run one synchronously, if you do not mind the small overhead of a Task that does very little.
Task helps with some of the most important rules/conventions of multitasking:
Do not accidentally swallow exceptions. Thread-based multitasking is notorious for doing just that.
Do not use the result after a cancellation.
It enforces these by throwing an exception (usually an AggregateException) if you access the Result property when convention says you should not.
It also has other useful properties for multitasking.
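The behaviour of the Result property described above can be seen in a short sketch. It uses already-completed tasks built with Task.FromResult, Task.FromException, and Task.FromCanceled; the class and method names and the "ping failed" message are made up for illustration.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class TaskResultSketch
{
    // Returns the type name of the exception wrapped in the AggregateException
    // thrown by Result, or "none" if Result could be read.
    public static string InnerExceptionOf(Task<string> task)
    {
        try
        {
            string value = task.Result;
            return "none";
        }
        catch (AggregateException ex)
        {
            return ex.InnerException.GetType().Name;
        }
    }

    public static void Main()
    {
        var ok = Task.FromResult("42 ms");
        var faulted = Task.FromException<string>(new InvalidOperationException("ping failed"));
        var canceled = Task.FromCanceled<string>(new CancellationToken(true));

        Console.WriteLine(InnerExceptionOf(ok));       // none
        Console.WriteLine(InnerExceptionOf(faulted));  // InvalidOperationException
        Console.WriteLine(InnerExceptionOf(canceled)); // TaskCanceledException
    }
}
```

When the task is awaited instead of read through Result, the original exception is rethrown directly rather than wrapped in an AggregateException.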
My application requires that I download a large amount of webpages into memory for further parsing and processing. What is the fastest way to do it? My current method (shown below) seems to be too slow and occasionally results in timeouts.
for (int i = 1; i<=pages; i++)
{
string page_specific_link = baseurl + "&page=" + i.ToString();
try
{
WebClient client = new WebClient();
var pagesource = client.DownloadString(page_specific_link);
client.Dispose();
sourcelist.Add(pagesource);
}
catch (Exception)
{
}
}
The way you approach this problem is going to depend very much on how many pages you want to download, and how many sites you're referencing.
I'll use a good round number like 1,000. If you want to download that many pages from a single site, it's going to take a lot longer than if you want to download 1,000 pages that are spread out across dozens or hundreds of sites. The reason is that if you hit a single site with a whole bunch of concurrent requests, you'll probably end up getting blocked.
So you have to implement a type of "politeness policy," that issues a delay between multiple requests on a single site. The length of that delay depends on a number of things. If the site's robots.txt file has a crawl-delay entry, you should respect that. If they don't want you accessing more than one page per minute, then that's as fast as you should crawl. If there's no crawl-delay, you should base your delay on how long it takes a site to respond. For example, if you can download a page from the site in 500 milliseconds, you set your delay to X. If it takes a full second, set your delay to 2X. You can probably cap your delay to 60 seconds (unless crawl-delay is longer), and I would recommend that you set a minimum delay of 5 to 10 seconds.
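The delay rules above could be sketched as a small function. The answer does not say what X is, so the multiplier of 10 used here is purely an assumption, as are the class and method names; the 5 second minimum and 60 second cap come from the paragraph above.

```csharp
using System;

public static class PolitenessSketch
{
    // Picks the delay between two requests to the same site.
    // crawlDelay is the robots.txt crawl-delay value, or null if the site has none.
    public static TimeSpan ComputeDelay(TimeSpan responseTime, TimeSpan? crawlDelay, double multiplier = 10.0)
    {
        // An explicit crawl-delay always wins, even when it is longer than the cap.
        if (crawlDelay.HasValue)
        {
            return crawlDelay.Value;
        }

        TimeSpan minimum = TimeSpan.FromSeconds(5);
        TimeSpan maximum = TimeSpan.FromSeconds(60);

        // Scale the delay with how long the site took to respond (multiplier is an assumption).
        TimeSpan delay = TimeSpan.FromMilliseconds(responseTime.TotalMilliseconds * multiplier);

        if (delay < minimum) return minimum;
        if (delay > maximum) return maximum;
        return delay;
    }

    public static void Main()
    {
        Console.WriteLine(ComputeDelay(TimeSpan.FromMilliseconds(500), null));
        Console.WriteLine(ComputeDelay(TimeSpan.FromSeconds(1), null));
        Console.WriteLine(ComputeDelay(TimeSpan.FromMilliseconds(500), TimeSpan.FromSeconds(120)));
    }
}
```

A crawler would call this once per site after each response and wait that long before sending the next request to the same host.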
I wouldn't recommend using Parallel.ForEach for this. My testing has shown that it doesn't do a good job. Sometimes it over-taxes the connection and often it doesn't allow enough concurrent connections. I would instead create a queue of WebClient instances and then write something like:
// Create queue of WebClient instances
BlockingCollection<WebClient> ClientQueue = new BlockingCollection<WebClient>();
// Initialize queue with some number of WebClient instances
// now process urls
foreach (var url in urls_to_download)
{
var worker = ClientQueue.Take();
worker.DownloadStringAsync(url, ...);
}
When you initialize the WebClient instances that go into the queue, attach a completed event handler to each instance's DownloadStringCompleted event. That handler should save the string to a file (or perhaps you should just use DownloadFileAsync), and then add the client back to the ClientQueue.
In my testing, I've been able to support 10 to 15 concurrent connections with this method. Any more than that and I run into problems with DNS resolution (DownloadStringAsync doesn't do the DNS resolution asynchronously). You can get more connections, but doing so is a lot of work.
That's the approach I've taken in the past, and it's worked very well for downloading thousands of pages quickly. It's definitely not the approach I took with my high performance Web crawler, though.
I should also note that there is a huge difference in resource usage between these two blocks of code:
WebClient MyWebClient = new WebClient();
foreach (var url in urls_to_download)
{
MyWebClient.DownloadString(url);
}
---------------
foreach (var url in urls_to_download)
{
WebClient MyWebClient = new WebClient();
MyWebClient.DownloadString(url);
}
The first allocates a single WebClient instance that is used for all requests. The second allocates one WebClient for each request. The difference is huge. WebClient uses a lot of system resources, and allocating thousands of them in a relatively short time is going to impact performance. Believe me ... I've run into this. You're better off allocating just 10 or 20 WebClients (as many as you need for concurrent processing), rather than allocating one per request.
Why not just use a web crawling framework? It can handle everything for you (multithreading, HTTP requests, parsing links, scheduling, politeness, etc.).
Abot (https://code.google.com/p/abot/) handles all of that and is written in C#.
In addition to @David's perfectly valid answer, I want to add a slightly cleaner "version" of his approach.
var pages = new List<string> { "http://bing.com", "http://stackoverflow.com" };
var sources = new BlockingCollection<string>();
Parallel.ForEach(pages, x =>
{
using(var client = new WebClient())
{
var pagesource = client.DownloadString(x);
sources.Add(pagesource);
}
});
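Yet another approach, that uses async:
static IEnumerable<string> GetSources(List<string> pages)
{
var sources = new BlockingCollection<string>();
var latch = new CountdownEvent(pages.Count);
foreach (var p in pages)
{
// No using block here: it would dispose the client before the download completes.
var wc = new WebClient();
wc.DownloadStringCompleted += (x, e) =>
{
try
{
// e.Result throws if the download failed or was cancelled, so check first.
if (e.Error == null && !e.Cancelled)
sources.Add(e.Result);
}
finally
{
wc.Dispose();
// Always signal, otherwise latch.Wait() never returns after a failed download.
latch.Signal();
}
};
wc.DownloadStringAsync(new Uri(p));
}
latch.Wait();
return sources;
}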
You should use parallel programming for this purpose.
There are a lot of ways to achieve what you want; the easiest would be something like this:
var pageList = new List<string>();
for (int i = 1; i <= pages; i++)
{
pageList.Add(baseurl + "&page=" + i.ToString());
}
// pageList is a list of urls
Parallel.ForEach<string>(pageList, (page) =>
{
try
{
WebClient client = new WebClient();
var pagesource = client.DownloadString(page);
client.Dispose();
lock (sourcelist)
sourcelist.Add(pagesource);
}
catch (Exception) {}
});
I had a similar case, and this is how I solved it:
using System;
using System.Threading;
using System.Collections.Generic;
using System.Net;
using System.IO;
namespace WebClientApp
{
class MainClassApp
{
private static int requests = 0;
private static object requests_lock = new object();
public static void Main() {
List<string> urls = new List<string> { "http://www.google.com", "http://www.slashdot.org"};
foreach(var url in urls) {
ThreadPool.QueueUserWorkItem(GetUrl, url);
}
int cur_req = 0;
while(cur_req<urls.Count) {
lock(requests_lock) {
cur_req = requests;
}
Thread.Sleep(1000);
}
Console.WriteLine("Done");
}
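private static void GetUrl(Object the_url) {
string url = (string)the_url;
try {
// Dispose the client, stream and reader even if the download fails
using (WebClient client = new WebClient())
using (Stream data = client.OpenRead(url))
using (StreamReader reader = new StreamReader(data)) {
string html = reader.ReadToEnd();
// Do something with html (for example, add it to SourceList under a lock)
Console.WriteLine(html);
}
}
catch (WebException ex) {
Console.WriteLine("Failed to download " + url + ": " + ex.Message);
}
finally {
lock(requests_lock) {
// Count failed requests too, so the wait loop in Main can finish
requests++;
}
}
}
}
}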
You should consider using Parallel, because the slowness comes from your software waiting for I/O; while one thread is waiting for I/O, another one can get started.
While the other answers are perfectly valid, all of them (at the time of this writing) neglect something very important: calls to the web are I/O bound, and having a thread block on an operation like this wastes system resources.
What you really want to do is take advantage of the async methods on the WebClient class (as some have pointed out) as well as the Task Parallel Library's ability to handle the Event-Based Asynchronous Pattern.
First, you would get the urls that you want to download:
IEnumerable<Uri> urls = Enumerable.Range(1, pages).Select(i => new Uri(baseurl +
"&page=" + i.ToString(CultureInfo.InvariantCulture)));
Then, you would create a new WebClient instance for each url, using the TaskCompletionSource<T> class to handle the calls asynchronously (this won't burn a thread):
IEnumerable<Task<Tuple<Uri, string>>> tasks = urls.Select(url => {
// Create the task completion source.
var tcs = new TaskCompletionSource<Tuple<Uri, string>>();
// The web client.
var wc = new WebClient();
// Attach to the DownloadStringCompleted event.
wc.DownloadStringCompleted += (s, e) => {
// Dispose of the client when done.
using (wc)
{
// If there is an error, set it.
if (e.Error != null)
{
tcs.SetException(e.Error);
}
// Otherwise, set cancelled if cancelled.
else if (e.Cancelled)
{
tcs.SetCanceled();
}
else
{
// Set the result.
tcs.SetResult(new Tuple<Uri, string>(url, e.Result));
}
}
};
// Start the process asynchronously, don't burn a thread.
wc.DownloadStringAsync(url);
// Return the task.
return tcs.Task;
});
Now you have an IEnumerable<T> which you can convert to an array and wait on all of the results using Task.WaitAll:
// Materialize the tasks.
Task<Tuple<Uri, string>>[] materializedTasks = tasks.ToArray();
// Wait for all to complete.
Task.WaitAll(materializedTasks);
Then, you can just use Result property on the Task<T> instances to get the pair of the url and the content:
// Cycle through each of the results.
foreach (Tuple<Uri, string> pair in materializedTasks.Select(t => t.Result))
{
// pair.Item1 will contain the Uri.
// pair.Item2 will contain the content.
}
Note that the above code does no error handling; if any download fails, Task.WaitAll throws an AggregateException.
If you wanted to get even more throughput, instead of waiting for the entire list to be finished, you could process the content of a single page after it's done downloading; Task<T> is meant to be used like a pipeline, when you've completed your unit of work, have it continue to the next one instead of waiting for all of the items to be done (if they can be done in an asynchronous manner).
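One way to read the pipeline idea above is to chain a processing step onto each download task with ContinueWith, so each page is handled as soon as it arrives. This is only a sketch: FakeDownloadAsync is a hypothetical stand-in for the real download (it just waits and returns fake content), and "processing" is reduced to measuring the content length.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class PipelineSketch
{
    // Stand-in for a real download: completes after delayMs with fake content.
    public static async Task<string> FakeDownloadAsync(string url, int delayMs)
    {
        await Task.Delay(delayMs);
        return "<html>" + url + "</html>";
    }

    // Chains a processing step onto each download so that it runs as soon as
    // that particular download finishes, independently of the others.
    public static Task<int>[] ProcessEachWhenDone(IEnumerable<Task<string>> downloads)
    {
        return downloads
            .Select(download => download.ContinueWith(t => t.Result.Length))
            .ToArray();
    }

    public static void Main()
    {
        var downloads = new[]
        {
            FakeDownloadAsync("a", 30),
            FakeDownloadAsync("bb", 10),
        };

        // Waiting happens only once, at the very end of the pipeline.
        int[] lengths = Task.WhenAll(ProcessEachWhenDone(downloads)).Result;
        foreach (int length in lengths)
        {
            Console.WriteLine(length);
        }
    }
}
```

The results come back in the order of the input tasks, even though the processing steps themselves run in whatever order the downloads finish.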
I am using an active download count and an arbitrary limit:
private static int activeThreads = 0;
public static void RecordData()
{
var groupSize = 10; // URLs started per batch (assumed; the original snippet left groupSize undefined)
var source = db.ListOfUrls; // Thousands urls
var iterations = (source.Length + groupSize - 1) / groupSize; // round up so the last partial group is not skipped
for (int i = 0; i < iterations; i++)
{
var subList = source.Skip(groupSize * i).Take(groupSize);
// RecordUri is async, so this only starts the downloads; it does not wait for them
Parallel.ForEach(subList, (item) => RecordUri(item));
// Wait here before processing further data to avoid overload
while (Volatile.Read(ref activeThreads) > 30) Thread.Sleep(100);
}
// Wait for the downloads that are still running
while (Volatile.Read(ref activeThreads) > 0) Thread.Sleep(100);
}
private static async Task RecordUri(Uri uri)
{
Interlocked.Increment(ref activeThreads);
try
{
using (WebClient wc = new WebClient())
{
var jsonData = await wc.DownloadStringTaskAsync(uri);
var root = JsonConvert.DeserializeObject<RootObject>(jsonData);
RecordData(root); // overload that stores the parsed object (not shown)
}
}
finally
{
// Decrement even when the download or parsing fails, so the wait loops can finish
Interlocked.Decrement(ref activeThreads);
}
}