I understand the implications of using an async lambda with Parallel.ForEach, which is why I'm not using one here. That forces me to call .Result on each of the Tasks that make HTTP requests. However, running this simple scraper through the performance profiler shows that .Result has an elapsed exclusive time of ~98%, which is obviously due to the blocking nature of the call.
My question is: is there any possibility of optimizing this for it to still be async? I'm not sure that will help in this case since it may just take this long to retrieve the HTML/XML.
I'm running a 4-core processor with 8 logical cores (hence MaxDegreeOfParallelism = 8). Right now I'm looking at around 2.5 hours to download and parse ~51,000 HTML/XML pages of simple financial data.
I was leaning towards using XmlReader instead of LINQ to XML to speed up the parsing, but it appears the bottleneck is the .Result call.
And although it should not matter here, the SEC limits scraping to 10 requests/sec.
public class SECScraper
{
public event EventHandler<ProgressChangedEventArgs> ProgressChangedEvent;

// fields referenced below; their declarations were missing from the snippet
// (the SEC_HOSTNAME value is assumed)
private const string SEC_HOSTNAME = "https://www.sec.gov";
private readonly HttpClient _client;
private readonly FinanceContext _financeContext;
private int _numDownloaded;
private int _interval;
public SECScraper(HttpClient client, FinanceContext financeContext)
{
_client = client;
_financeContext = financeContext;
}
public void Download()
{
_numDownloaded = 0;
_interval = _financeContext.Companies.Count() / 100;
Parallel.ForEach(_financeContext.Companies, new ParallelOptions {MaxDegreeOfParallelism = 8},
company =>
{
RetrieveSECData(company.CIK);
});
}
protected virtual void OnProgressChanged(ProgressChangedEventArgs e)
{
ProgressChangedEvent?.Invoke(this, e);
}
private void RetrieveSECData(int cik)
{
// move this url elsewhere
var url = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=" + cik +
"&type=10-q&dateb=&owner=include&count=100";
var srBody = ReadHTML(url).Result; // consider moving this to srPage
var srPage = new SearchResultsPage(srBody);
var reportLinks = srPage.GetAllReportLinks();
foreach (var link in reportLinks)
{
url = SEC_HOSTNAME + link;
var fdBody = ReadHTML(url).Result;
var fdPage = new FilingDetailsPage(fdBody);
var xbrlLink = fdPage.GetInstanceDocumentLink();
var xbrlBody = ReadHTML(SEC_HOSTNAME + xbrlLink).Result;
var xbrlDoc = new XBRLDocument(xbrlBody);
var epsData = xbrlDoc.GetAllEPSData();
//foreach (var eps in epsData)
// Console.WriteLine($"{eps.StartDate} to {eps.EndDate} -- {eps.EPS}");
}
IncrementNumDownloadedAndNotify();
}
private async Task<string> ReadHTML(string url)
{
using var response = await _client.GetAsync(url);
return await response.Content.ReadAsStringAsync();
}
}
The task is not CPU-bound but network-bound, so there is no need to use multiple threads.
Make multiple async calls on one thread; just don't await them right away. Put the tasks in a list. When you get a certain amount there (say you want 10 going at once), start waiting for the first one to finish (look up Task.WhenAny for more info).
Then put more on :-) You can then control the size of the list of tasks, e.g. requests/second, using other code.
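A minimal sketch of that idea (illustrative only; ReadHTML is the method from the question, urls is some list of addresses, and the window size of 10 is just an example):
var inFlight = new List<Task>();
foreach (var url in urls)
{
    if (inFlight.Count == 10)
        inFlight.Remove(await Task.WhenAny(inFlight)); // wait for any one task to finish, then remove it
    inFlight.Add(ReadHTML(url)); // start the request without awaiting it
}
await Task.WhenAll(inFlight); // drain the remainder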
is there any possibility of optimizing this for it to still be async?
Yes. I'm not sure why you're using Parallel in the first place; it seems like the wrong solution for this kind of problem. You have asynchronous work to do across a collection of items, so a better fit would be asynchronous concurrency; this is done using Task.WhenAll:
public class SECScraper
{
public async Task DownloadAsync()
{
_numDownloaded = 0;
_interval = _financeContext.Companies.Count() / 100;
var tasks = _financeContext.Companies.Select(company => RetrieveSECDataAsync(company.CIK)).ToList();
await Task.WhenAll(tasks);
}
private async Task RetrieveSECDataAsync(int cik)
{
var url = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=" + cik +
"&type=10-q&dateb=&owner=include&count=100";
var srBody = await ReadHTMLAsync(url);
var srPage = new SearchResultsPage(srBody);
var reportLinks = srPage.GetAllReportLinks();
foreach (var link in reportLinks)
{
url = SEC_HOSTNAME + link;
var fdBody = await ReadHTMLAsync(url);
var fdPage = new FilingDetailsPage(fdBody);
var xbrlLink = fdPage.GetInstanceDocumentLink();
var xbrlBody = await ReadHTMLAsync(SEC_HOSTNAME + xbrlLink);
var xbrlDoc = new XBRLDocument(xbrlBody);
var epsData = xbrlDoc.GetAllEPSData();
}
IncrementNumDownloadedAndNotify();
}
private async Task<string> ReadHTMLAsync(string url)
{
using var response = await _client.GetAsync(url);
return await response.Content.ReadAsStringAsync();
}
}
Also, I recommend you use IProgress<T> for reporting progress.
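A hedged sketch of what that could look like, building on DownloadAsync above (the wiring and the caller-side names are illustrative, not the original implementation):
// Progress<T> captures the current SynchronizationContext, so in a UI app
// the callback runs on the UI thread.
public async Task DownloadAsync(IProgress<int> progress)
{
    var tasks = _financeContext.Companies.Select(async company =>
    {
        await RetrieveSECDataAsync(company.CIK);
        progress?.Report(Interlocked.Increment(ref _numDownloaded));
    }).ToList();
    await Task.WhenAll(tasks);
}

// caller side:
var scraperProgress = new Progress<int>(count => Console.WriteLine($"{count} companies processed"));
await scraper.DownloadAsync(scraperProgress);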
Related
I have a list of URLs (thousands); I want to asynchronously get the page data from each URL as fast as possible without putting extreme load on the CPU.
I have tried using threading but it still feels quite slow:
public static ConcurrentQueue<string> List = new ConcurrentQueue<string>(); //URL List (assume I added them already)
public static void Threading()
{
for(int i=0;i<100;i++) //100 threads
{
Thread thread = new Thread(new ThreadStart(Task));
thread.Start();
}
}
public static void Task()
{
while (!List.IsEmpty) // IsEmpty (capital I), and check TryDequeue's result before using the value
{
    if (List.TryDequeue(out string URL))
    {
        //GET REQUEST HERE
    }
}
}
Is there any better way to do this? I want to do this asynchronously but I can't figure out how to do it, and I don't want to sacrifice speed or CPU efficiency to do so.
Thanks :)
You should use Microsoft's Reactive Framework (aka Rx) - NuGet System.Reactive and add using System.Reactive.Linq; - then you can do this:
public static IObservable<(string url, string content)> GetAllUrls(List<string> urls) =>
Observable
.Using(
() => new HttpClient(),
hc =>
from url in urls.ToObservable()
from response in Observable.FromAsync(() => hc.GetAsync(url))
from content in Observable.FromAsync(() => response.Content.ReadAsStringAsync())
select (url, content));
That allows you to consume the results in a couple of ways.
You can process them as they get produced:
IDisposable subscription =
GetAllUrls(urlsx).Subscribe(x => Console.WriteLine(x.content));
Or you can get all of them produced and then await the full results:
(string url, string content)[] results = await GetAllUrls(urlsx).ToArray();
You are best off using HttpClient, which allows async Task requests.
Just store each task in a list and await the whole list. To prevent too many requests at once, wait for any single one to complete whenever the list is full, and remove the completed one from the list.
const int maxDegreeOfParallelism = 100;
static HttpClient _client = new HttpClient();
public static async Task GetAllUrls(List<string> urls)
{
var tasks = new List<Task>(urls.Count);
foreach (var url in urls)
{
if (tasks.Count == maxDegreeOfParallelism) // this prevents too many requests at once
tasks.Remove(await Task.WhenAny(tasks));
tasks.Add(GetUrl(url));
}
await Task.WhenAll(tasks);
}
private static async Task GetUrl(string url)
{
using var response = await _client.GetAsync(url);
// handle response here
var responseStr = await response.Content.ReadAsStringAsync(); // whatever
// do stuff etc
}
I am trying to load a full auction house by loading each page asynchronously from an API call and putting all the same items together in a list in a dictionary. When I make the parallel for loop, it does not return anything. Help would be appreciated.
Have a great day!
-Vexea
private static async Task<int> LoadHypixelAuctionPages() // signature inferred from the calls below
{
string url = "https://api.hypixel.net/skyblock/auctions";
//Gets number of pages to make threads on the auction house...
using (HttpResponseMessage response = await ApiHelper.GetApiClient("application/json").GetAsync(url))
{
if (response.IsSuccessStatusCode)
{
AuctionHouseModel auctionHouse = await response.Content.ReadAsAsync<AuctionHouseModel>();
return auctionHouse.Pages;
}
else
{
return 0;
}
}
}
private static async Task<AuctionPageModel> LoadHypixelAuctionPage(int page, string apiKey)
{
//Loads a solid page...
string url = "https://api.hypixel.net/skyblock/auctions?key=" + apiKey + "&page=" + page;
using (HttpResponseMessage response = await ApiHelper.GetApiClient("application/json").GetAsync(url))
{
if (response.IsSuccessStatusCode)
{
return await response.Content.ReadAsAsync<AuctionPageModel>();
}
else
{
return null;
}
}
}
public async static Task<AuctionHouseModel> LoadHypixelAuctionHouse(string apiKey)
{
//Loads all pages needed and puts them into a dictionary...
List<AuctionPageModel> pages = new List<AuctionPageModel>();
AuctionHouseModel output = new AuctionHouseModel();
Parallel.For(1, await LoadHypixelAuctionPages(), async page => {
pages.Add(await LoadHypixelAuctionPage(page, apiKey)); //This returns nothing, count of pages stays 0 and nothing happens...
});
foreach (AuctionPageModel page in pages)
foreach(AuctionProductModel product in page.Products)
try
{
output.Products[product.Name].Add(product);
}
catch
{
output.Products.Add(product.Name, new List<AuctionProductModel>());
output.Products[product.Name].Add(product);
}
output.Pages = await LoadHypixelAuctionPages();
return output;
}
When you're doing parallel programming you need to make sure to use thread-safe types or locking. Perhaps there are more things wrong than this, but the first thing you need to fix is locking access to the pages list. Secondly, the first parameter of Parallel.For is inclusive while the second parameter is exclusive, so if LoadHypixelAuctionPages() returns 0 or 1, nothing will run inside the loop. You probably mean LoadHypixelAuctionPages() + 1 if the first page number is 1 and not 0:
List<AuctionPageModel> pages = new List<AuctionPageModel>();
AuctionHouseModel output = new AuctionHouseModel();
Parallel.For(1, await LoadHypixelAuctionPages() + 1, async page =>
{
var loadedPage = await LoadHypixelAuctionPage(page, apiKey);
lock(pages)
{
pages.Add(loadedPage);
}
});
//...
Take a look at this fiddle to see what can happen when not locking.
An alternative to locking is using one of the concurrent collections, such as ConcurrentQueue<T>.
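A minimal sketch of that alternative, keeping the loop above but dropping the lock (note: the async-lambda problem discussed in the next answer still applies here; this only fixes the collection's thread safety):
var pages = new ConcurrentQueue<AuctionPageModel>(); // Enqueue is safe to call from multiple threads
Parallel.For(1, await LoadHypixelAuctionPages() + 1, async page =>
{
    pages.Enqueue(await LoadHypixelAuctionPage(page, apiKey));
});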
You can't use any Parallel methods with async. Parallel is for CPU-bound code and async is (primarily) for I/O-bound code. The Parallel class doesn't properly understand anything async.
Instead of parallel concurrency, you need asynchronous concurrency (Task.WhenAll):
List<AuctionPageModel> pages = new List<AuctionPageModel>();
AuctionHouseModel output = new AuctionHouseModel();
var tasks = Enumerable
.Range(1, await LoadHypixelAuctionPages())
.Select(async page => pages.Add(await LoadHypixelAuctionPage(page, apiKey))) // note: List<T>.Add isn't thread-safe if these run concurrently; the simpler variant below avoids it
.ToList();
await Task.WhenAll(tasks);
or, more simply:
AuctionHouseModel output = new AuctionHouseModel();
var tasks = Enumerable
.Range(1, await LoadHypixelAuctionPages())
.Select(async page => await LoadHypixelAuctionPage(page, apiKey))
.ToList();
var pages = await Task.WhenAll(tasks);
Let's say I want to download 1000 recipes from a website. The website accepts at most 10 concurrent connections. Each recipe should be stored in an array, at its corresponding index. (I don't want to send the array to the DownloadRecipe method.)
Technically, I've already solved the problem, but I would like to know if there is an even cleaner way to use async/await or something else to achieve it?
static async Task MainAsync()
{
int recipeCount = 1000;
int connectionCount = 10;
string[] recipes = new string[recipeCount];
Task<string>[] tasks = new Task<string>[connectionCount];
int r = 0;
while (r < recipeCount)
{
for (int t = 0; t < tasks.Length; t++)
{
int index = r; // capture a copy: the lambda must not close over the loop variable, which keeps changing
tasks[t] = Task.Run(async () => recipes[index] = await DownloadRecipe(index));
r++;
}
await Task.WhenAll(tasks);
}
}
static async Task<string> DownloadRecipe(int index)
{
// ... await calls to download recipe
}
Also, this solution isn't optimal, since it doesn't bother starting a new download until all 10 running downloads are finished. Is there something we can improve there without bloating the code too much? A thread pool limited to 10 threads?
There are many, many ways you could do this. One way is to use an ActionBlock, which gives you access to MaxDegreeOfParallelism fairly easily and works well with async methods:
static async Task MainAsync()
{
var recipeCount = 1000;
var connectionCount = 10;
var recipes = new string[recipeCount];
async Task Action(int i) => recipes[i] = await DownloadRecipe(i);
var processor = new ActionBlock<int>(Action, new ExecutionDataflowBlockOptions()
{
MaxDegreeOfParallelism = connectionCount,
SingleProducerConstrained = true
});
for (var i = 0; i < recipeCount; i++)
await processor.SendAsync(i);
processor.Complete();
await processor.Completion;
}
static async Task<string> DownloadRecipe(int index)
{
...
}
Another way might be to use a SemaphoreSlim:
var slim = new SemaphoreSlim(connectionCount, connectionCount);
var tasks = Enumerable
.Range(0, recipeCount)
.Select(Selector);
async Task<string> Selector(int i)
{
    await slim.WaitAsync();
try
{
        return await DownloadRecipe(i);
}
finally
{
slim.Release();
}
}
var recipes = await Task.WhenAll(tasks);
Another set of approaches is to use Reactive Extensions (Rx). Once again there are many ways to do this; this is just an awaitable approach (and it could likely be better, all things considered):
var results = await Enumerable
.Range(0, recipeCount)
.ToObservable()
.Select(i => Observable.FromAsync(() => DownloadRecipe(i)))
.Merge(connectionCount)
.ToArray()
.ToTask();
An alternative approach: have 10 "pools" which load data "simultaneously".
You don't need to wrap IO operations in a separate thread; using a separate thread for IO operations is just a waste of resources.
Notice that a thread which downloads data does nothing but wait for a response. This is where the async-await approach comes in very handy: we can send multiple requests without waiting for them to complete and without wasting threads.
static async Task MainAsync()
{
var requests = Enumerable.Range(0, 1000).ToArray();
var maxConnections = 10;
var pools = requests
.GroupBy(i => i % maxConnections)
.Select(group => DownloadRecipesFor(group.ToArray()))
.ToArray();
await Task.WhenAll(pools);
var recipes = pools.SelectMany(pool => pool.Result).ToArray();
}
static async Task<IEnumerable<string>> DownloadRecipesFor(params int[] requests)
{
var recipes = new List<string>();
foreach (var request in requests)
{
var recipe = await DownloadRecipe(request);
recipes.Add(recipe);
}
return recipes;
}
Because inside each pool (the DownloadRecipesFor method) we download results one by one, we make sure that we have no more than 10 active requests at any time.
This is a little more efficient than the original, because we don't wait for 10 tasks to complete before starting the next "bunch".
It is not ideal, because if the last "pool" finishes earlier than the others, it isn't able to pick up the next request to handle; a shared-queue variant that avoids this is sketched below.
Within each pool the results keep the order of their requests, since each pool processes its requests sequentially; note, however, that the flattened array is grouped by pool rather than in the original 0..999 order.
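A sketch of that shared-queue variant (illustrative; DownloadRecipe and the counts come from the question, and DownloadAllRecipes is a hypothetical name):
static async Task<string[]> DownloadAllRecipes()
{
    var requests = new ConcurrentQueue<int>(Enumerable.Range(0, 1000));
    var recipes = new string[1000];
    var maxConnections = 10;

    // each worker pulls the next pending request as soon as it becomes free
    var workers = Enumerable.Range(0, maxConnections).Select(async _ =>
    {
        while (requests.TryDequeue(out var request))
            recipes[request] = await DownloadRecipe(request);
    }).ToArray();

    await Task.WhenAll(workers);
    return recipes; // indexes are preserved because each result is written to its own slot
}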
I'll describe my problem with a simple example first, then a closer-to-real version.
Imagine we have n items [i1, i2, i3, i4, ..., in] in box1, and a box2 that can process m items at a time (m is usually much less than n). The time required for each item varies. I want to keep m items in progress at all times until all the items have been processed.
A much closer problem: you have a list1 of n strings (URL addresses) of files, and you want a system that downloads m files concurrently (for example via HttpClient.GetAsync()). Whenever one of the m downloads finishes, another remaining item from list1 must be substituted as soon as possible, and this must continue until all of list1's items have been processed.
(The numbers n and m are specified by user input at runtime.)
How can this be done?
Here is a generic method you can use.
When you call this, TIn will be string (the URL address) and asyncProcessor will be your async method that takes the URL address as input and returns a Task.
The SemaphoreSlim used by this method will allow only n concurrent async I/O requests at any moment; as soon as one completes, the next one starts. Something like a sliding-window pattern.
public static Task ForEachAsync<TIn>(
IEnumerable<TIn> inputEnumerable,
Func<TIn, Task> asyncProcessor,
int? maxDegreeOfParallelism = null)
{
    int maxAsyncThreadCount = maxDegreeOfParallelism ?? DefaultMaxDegreeOfParallelism; // DefaultMaxDegreeOfParallelism: a class-level constant, not shown here
SemaphoreSlim throttler = new SemaphoreSlim(maxAsyncThreadCount, maxAsyncThreadCount);
IEnumerable<Task> tasks = inputEnumerable.Select(async input =>
{
await throttler.WaitAsync().ConfigureAwait(false);
try
{
await asyncProcessor(input).ConfigureAwait(false);
}
finally
{
throttler.Release();
}
});
return Task.WhenAll(tasks);
}
You should look into TPL Dataflow. Add the System.Threading.Tasks.Dataflow NuGet package to your project, then what you want is as simple as:
private static HttpClient _client = new HttpClient();
public async Task<List<MyClass>> ProcessDownloads(IEnumerable<string> uris,
int concurrentDownloads)
{
var result = new List<MyClass>();
var downloadData = new TransformBlock<string, string>(async uri =>
{
return await _client.GetStringAsync(uri); //GetStringAsync is a thread safe method.
}, new ExecutionDataflowBlockOptions{MaxDegreeOfParallelism = concurrentDownloads});
var processData = new TransformBlock<string, MyClass>(
json => JsonConvert.DeserializeObject<MyClass>(json),
new ExecutionDataflowBlockOptions {MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded});
var collectData = new ActionBlock<MyClass>(
data => result.Add(data)); //When you don't specify options, dataflow processes items one at a time.
//Set up the chain of blocks, have it call `.Complete()` on the next block when the current block finishes processing it's last item.
downloadData.LinkTo(processData, new DataflowLinkOptions {PropagateCompletion = true});
processData.LinkTo(collectData, new DataflowLinkOptions {PropagateCompletion = true});
//Load the data in to the first transform block to start off the process.
foreach (var uri in uris)
{
await downloadData.SendAsync(uri).ConfigureAwait(false);
}
downloadData.Complete(); //Signal you are done adding data.
//Wait for the last object to be added to the list.
await collectData.Completion.ConfigureAwait(false);
return result;
}
In the above code, only concurrentDownloads requests will be in flight at any given time (all through a single shared HttpClient), an unbounded number of threads will be processing the received strings and turning them into objects, and a single thread will be taking those objects and adding them to a list.
UPDATE: Here is a simplified example that does only what you asked for in the question:
private static HttpClient _client = new HttpClient();
public void ProcessDownloads(IEnumerable<string> uris, int concurrentDownloads)
{
var downloadData = new ActionBlock<string>(async uri =>
{
var response = await _client.GetAsync(uri); //GetAsync is a thread safe method.
//do something with response here.
}, new ExecutionDataflowBlockOptions{MaxDegreeOfParallelism = concurrentDownloads});
foreach (var uri in uris)
{
downloadData.Post(uri);
}
downloadData.Complete();
downloadData.Completion.Wait();
}
A simple solution for throttling is a SemaphoreSlim.
EDIT
After a slight alteration, the code now creates the tasks only when they are needed:
var client = new HttpClient();
SemaphoreSlim semaphore = new SemaphoreSlim(m, m); //set the max here
var tasks = new List<Task>();
foreach(var url in urls)
{
// moving the wait here throttles the foreach loop
await semaphore.WaitAsync();
tasks.Add(((Func<Task>)(async () =>
{
    //await semaphore.WaitAsync();
    try
    {
        var response = await client.GetAsync(url); // possibly ConfigureAwait(false) here
        // do something with response
    }
    finally
    {
        semaphore.Release(); // release even if the request throws
    }
}))());
}
await Task.WhenAll(tasks);
This is another way to do it
var client = new HttpClient();
var tasks = new HashSet<Task>();
foreach(var url in urls)
{
if(tasks.Count == m)
{
tasks.Remove(await Task.WhenAny(tasks));
}
tasks.Add(((Func<Task>)(async () =>
{
var response = await client.GetAsync(url); // possibly ConfigureAwait(false) here
// do something with response
}))());
}
await Task.WhenAll(tasks);
Process items in parallel, limiting the number of simultaneous jobs:
string[] strings = GetStrings(); // Items to process.
const int m = 2; // Max simultaneous jobs.
Parallel.ForEach(strings, new ParallelOptions {MaxDegreeOfParallelism = m}, s =>
{
DoWork(s);
});
I'm kinda new to async tasks.
I have a function that takes a student ID and scrapes data from a specific university website using that ID.
private static HttpClient client = new HttpClient();
public static async Task<Student> ParseAsync(string departmentLink, int id, CancellationToken ct)
{
string website = string.Format(departmentLink, id);
try
{
string data;
var response = await client.GetAsync(website, ct); // GetAsync returns an HttpResponseMessage, not a raw stream
using (var reader = new StreamReader(await response.Content.ReadAsStreamAsync(), Encoding.GetEncoding("windows-1256")))
data = reader.ReadToEnd();
//Parse data here and return Student.
} catch (Exception ex)
{
    Console.WriteLine(ex.Message);
    return null; // every code path must return a value for the method to compile
}
}
And it works correctly. Sometimes, though, I need to run this function for a lot of students, so I use the following:
for(int i = ids.first; i <= ids.last; i++)
{
tasks[i - ids.first] = ParseStudentData.ParseAsync(entity.Link, i, cts.Token).ContinueWith(t =>
{
Dispatcher.Invoke(() =>
{
listview_students.Items.Add(t.Result);
//Students.Add(t.Result);
//lbl_count.Content = $"{listview_students.Items.Count}/{testerino.Length}";
});
});
}
I'm storing tasks in an array to wait for them later.
This also works fine as long as the student count is between (0, ~600?); the exact cutoff is kinda random.
Beyond that, every student that still hasn't been parsed throws "A task was canceled".
Keep in mind that I never use the cancellation token at all.
I need to run this function on so many students that it can reach ~9000 async tasks altogether. So what's happening?
You are basically creating a denial-of-service attack on the website when you queue up 9000 requests in such a short time frame. Not only is this causing you errors, but it could take down the website. It would be best to limit the number of concurrent requests to a more reasonable value (say 30). While there are probably several ways to do this, one that comes to mind is the following:
private async Task Test()
{
var tasks = new List<Task>();
for (int i = ids.first; i <= ids.last; i++)
{
tasks.Add(/* Do stuff */);
await WaitList(tasks, 30);
    }
    await Task.WhenAll(tasks); // drain whatever is still running when the loop ends
}
private async Task WaitList(IList<Task> tasks, int maxSize)
{
while (tasks.Count > maxSize)
{
var completed = await Task.WhenAny(tasks).ConfigureAwait(false);
tasks.Remove(completed);
}
}
Other approaches might leverage the producer/consumer pattern using .NET classes such as BlockingCollection.
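For illustration, a hedged sketch of that pattern (ParseStudentData.ParseAsync, entity.Link, ids, and cts come from the question; the queue capacity and the 30 consumers are arbitrary choices):
var queue = new BlockingCollection<int>(boundedCapacity: 100);

// producer: enqueue every student id, then signal that no more are coming
var producer = Task.Run(() =>
{
    for (int i = ids.first; i <= ids.last; i++)
        queue.Add(i); // blocks while the queue is full
    queue.CompleteAdding();
});

// consumers: 30 workers drain the queue until CompleteAdding takes effect
var consumers = Enumerable.Range(0, 30).Select(_ => Task.Run(async () =>
{
    foreach (var id in queue.GetConsumingEnumerable())
        await ParseStudentData.ParseAsync(entity.Link, id, cts.Token);
})).ToList();

consumers.Add(producer);
await Task.WhenAll(consumers);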
This is what I ended up with, based on @erdomke's code:
public static async Task ForEachParallel<T>(
this IEnumerable<T> list,
Func<T, Task> action,
int dop)
{
var tasks = new List<Task>(dop);
foreach (var item in list)
{
tasks.Add(action(item));
while (tasks.Count >= dop)
{
var completed = await Task.WhenAny(tasks).ConfigureAwait(false);
tasks.Remove(completed);
}
}
// Wait for all remaining tasks.
await Task.WhenAll(tasks).ConfigureAwait(false);
}
// usage
await Enumerable
.Range(1, 500)
.ForEachParallel(i => ProcessItem(i), Environment.ProcessorCount);