C# and WebAPI: stream the result of a task collection

I have a simple API and one service. The service reads a tree structure from a path taken from configuration. The problem is that the tree can be rather large, so I thought I could solve this by creating a collection of tasks and resolving those tasks as a stream in the ActionResult. To make things harder, I need the whole tree as the result; I cannot split it into different requests.
Normally I would get the file tree like this:
public IEnumerable<string> GetFiles()
{
    var result = new List<string>();
    foreach (var resource in _root)
    {
        this.ValidateRootFolder(resource);
        result.AddRange(Directory.EnumerateFiles(resource, "*.*", SearchOption.AllDirectories));
    }
    return result;
}
That is simple, but it can be slow for a giant tree, so what I am trying to do is something like this:
public ConcurrentBag<Task<IEnumerable<string>>> GetFiles()
{
    var tasks = new ConcurrentBag<Task<IEnumerable<string>>>();
    Parallel.ForEach(_root, (resource, token) =>
    {
        this.ValidateRootFolder(resource);
        var task = Task.Run(() => Directory.EnumerateFiles(resource, "*.*", SearchOption.AllDirectories));
        tasks.Add(task);
    });
    return tasks;
}
This creates the task collection, so I can execute those tasks in the endpoint, something like:
[HttpGet, ActionName("GetFiles")]
public IActionResult GetFiles()
{
    ConcurrentBag<Task<IEnumerable<string>>> tasks = _fileService.GetFiles();
    return Ok(tasks); // how do I make a stream out of all these files?
}
So my question is: how can I convert this task collection into a stream of results, and is that even possible?
And if not, is there another way to do this?

Parallelism in a web application isn't always a great idea, because it uses threads that would otherwise be serving web requests. Each web request is served by a separate thread; if all cores are busy, a web request will have to wait.
Parallel.ForEach will use all available cores, which means no other request will be served until either Parallel.ForEach completes or one of the worker threads is rescheduled. In most web applications that would be very bad.
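For comparison, Parallel.ForEach can be capped with ParallelOptions.MaxDegreeOfParallelism, but the calling (request) thread still blocks until the loop finishes; a rough sketch using the question's _root and ValidateRootFolder:
// Limits the worker threads, but the request thread is still blocked while the loop runs.
var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };
Parallel.ForEach(_root, options, resource =>
{
    this.ValidateRootFolder(resource);
    // ... enumerate files for this root
});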
One way to handle this would be to use PLINQ to enumerate all folders with a limited degree-of-parallelism and return the results as a single list:
public IEnumerable<string> GetFiles()
{
    var files = _roots.AsParallel()
                      .WithDegreeOfParallelism(2)
                      .SelectMany(fld => Directory.EnumerateFiles(fld))
                      .AsEnumerable();
    return files;
}
or
public IEnumerable<string> GetFiles()
{
    var files = _roots.AsParallel()
                      .WithDegreeOfParallelism(2)
                      .SelectMany(fld => Directory.EnumerateFiles(fld))
                      .ToList();
    return files;
}
The benefit over Parallel.ForEach is that PLINQ handles the collection of the partial results into the final result set.
If you want to get the files grouped by root, you could use Select and GetFiles:
public IEnumerable<string[]> GetFiles()
{
    var files = _roots.AsParallel()
                      .WithDegreeOfParallelism(2)
                      .Select(fld => Directory.GetFiles(fld))
                      .ToList();
    return files;
}
You could also return a dictionary of files per root:
public Dictionary<string, string[]> GetFiles()
{
    var files = _roots.AsParallel()
                      .WithDegreeOfParallelism(2)
                      .Select(root => (root, files: Directory.GetFiles(root)))
                      .ToDictionary(p => p.root, p => p.files);
    return files;
}
No async
There's no Directory.EnumerateFilesAsync, so there's no way to benefit from asynchronous (not parallel) enumeration. The reason is that not all OSs have async I/O, and even when they do, the file system drivers may not support asynchronous file enumeration.
Windows NT was asynchronous from the start, with blocking operations emulated at the API level. Windows 9x wasn't. Linux, on the other hand, was synchronous, with async I/O added later. Even Windows doesn't have an async directory enumeration API though, because not all file systems support it.

Related

C# Add to a List Asynchronously in API

I have an API which needs to be run in a loop for mass processing.
The current single API is:
public async Task<ActionResult<CombinedAddressResponse>> GetCombinedAddress(AddressRequestDto request)
We are not allowed to touch/modify the original single API. However, it can be run in bulk using a foreach statement. What is the best way to run this asynchronously without locks?
My current solution below just builds up a list; would this be it?
public async Task<ActionResult<List<CombinedAddressResponse>>> GetCombinedAddress(List<AddressRequestDto> requests)
{
    var combinedAddressResponses = new List<CombinedAddressResponse>();
    foreach (AddressRequestDto request in requests)
    {
        var newCombinedAddress = (await GetCombinedAddress(request)).Value;
        combinedAddressResponses.Add(newCombinedAddress);
    }
    return combinedAddressResponses;
}
Update:
In the debugger it has to go to combinedAddressResponse.Result.Value;
combinedAddressResponse.Value is null.
Also, strangely, writing combinedAddressResponse.Result.Value in code gives the error below: "ActionResult does not contain a definition for 'Value' and no accessible extension method".
I'm writing this code off the top of my head without an IDE or sleep, so please comment if I'm missing something or there's a better way.
But effectively I think you want to run all your requests at once (not sequentially) doing something like this:
public async Task<ActionResult<List<CombinedAddressResponse>>> GetCombinedAddress(List<AddressRequestDto> requests)
{
    var combinedAddressResponses = new List<CombinedAddressResponse>(requests.Count);
    var tasks = new List<Task<ActionResult<CombinedAddressResponse>>>(requests.Count);
    foreach (var request in requests)
    {
        tasks.Add(Task.Run(async () => await GetCombinedAddress(request)));
    }
    // This waits for all the tasks to complete
    await Task.WhenAll(tasks);
    combinedAddressResponses.AddRange(tasks.Select(x => x.Result.Value));
    return combinedAddressResponses;
}
I'm looking for a way to speed things up and run the requests in parallel, thanks.
What you need is "asynchronous concurrency". I use the term "concurrency" to mean "doing more than one thing at a time", and "parallel" to mean "doing more than one thing at a time using threads". Since you're on ASP.NET, you don't want to use additional threads; you'd want to use a form of concurrency that works asynchronously (which uses fewer threads). So, Parallel and Task.Run should not be parts of your solution.
The way to do asynchronous concurrency is to build a collection of tasks, and then use await Task.WhenAll. E.g.:
public async Task<ActionResult<IReadOnlyList<CombinedAddressResponse>>> GetCombinedAddress(List<AddressRequestDto> requests)
{
    // Build the collection of tasks by doing an asynchronous operation for each request.
    var tasks = requests.Select(async request =>
    {
        var combinedAddressResponse = await GetCombinedAddress(request);
        return combinedAddressResponse.Value;
    }).ToList();

    // Wait for all the tasks to complete and get the results.
    var results = await Task.WhenAll(tasks);
    return results;
}

Spawn new thread inside each foreach(), but do not return until all complete

I have a foreach() that loops through 15 reports and generates a PDF for each. The PDF generation process is slow (3 seconds each). But if I could generate them all concurrently with threads, maybe all 15 could be done in 4-5 seconds total. One constraint is that the function must not return until ALL PDFs have been generated. Also, will 15 concurrent worker threads cause problems or instability for dotnet/windows?
Here is my pseudocode:
private void makePDFs(string path)
{
    string[] folders = Directory.GetDirectories(path);
    foreach (string folderPath in folders)
    {
        generatePDF(...);
    }
    // DO NOT RETURN UNTIL ALL PDFs HAVE BEEN GENERATED
}
What is the simplest way to achieve this?
The most straightforward approach is to use Parallel.ForEach:
private void makePDFs(string path)
{
    string[] folders = Directory.GetDirectories(path);
    Parallel.ForEach(folders, folderPath =>
    {
        generatePDF(folderPath);
    });
    // WILL NOT RETURN UNTIL ALL PDFs HAVE BEEN GENERATED
}
This way you avoid having to create, keep track of, and await each separate task; the TPL does it all for you.
You need to get a list of tasks and then use Task.WhenAll to wait for their completion:
var tasks = folders.Select(folder => Task.Run(() => generatePDF(...)));
await Task.WhenAll(tasks);
If you can't or don't want to use async/await you can use:
Task.WaitAll(tasks.ToArray());
It will block the current thread until all tasks are completed, so I'd recommend using the first approach if you can.
You can also run your PDF generation in parallel using the C# Parallel class:
Parallel.ForEach(folders, folder => generatePDF(...));
Please see this answer to choose which approach works the best for your problem.
.NET has a handy method just for this: Task.WhenAll(IEnumerable<Task>)
It will wait for all tasks in the IEnumerable to finish before continuing. It returns a Task, so you need to await it.
var tasks = new List<Task>();
foreach(string folderPath in folders) {
tasks.Add(Task.Run(() => generatePdf()));
}
await Task.WhenAll(tasks);

await thousands of Tasks

I have an application which converts some data; often there are 1,000 - 30,000 files.
I need to do 3 steps:
1. Copy a file (replace some text in there).
2. Make a web request with WebClient to download a file (I send the copied file to a web server, which converts the file to another format).
3. Take the downloaded file and change some of the content.
So all three steps include some I/O and I used async/await methods:
var tasks = files.Select(async (file) =>
{
    Item item = await createtempFile(file).ConfigureAwait(false);
    await convert(item).ConfigureAwait(false);
    await clean(item).ConfigureAwait(false);
}).ToList();

await Task.WhenAll(tasks).ConfigureAwait(false);
I don't know if this is best practice, because I create more than a thousand tasks. I thought about splitting the three steps, like this:
List<Item> items = new List<Item>();

var createTasks = files.Select(async (file) =>
{
    Item item = await createtempFile(file, ext).ConfigureAwait(false);
    lock (items)
        items.Add(item);
}).ToList();
await Task.WhenAll(createTasks).ConfigureAwait(false);

var convertTasks = items.Select(async (item) =>
{
    await convert(item, baseAddress, ext).ConfigureAwait(false);
}).ToList();
await Task.WhenAll(convertTasks).ConfigureAwait(false);

var cleanTasks = items.Select(async (item) =>
{
    await clean(targetFile, item.Doctype, ext).ConfigureAwait(false);
}).ToList();
await Task.WhenAll(cleanTasks).ConfigureAwait(false);
But that doesn't seem to be better or faster, because I create thousands of tasks three times over.
Should I throttle the creation of tasks? Maybe in chunks of 100 tasks?
Or am I just overthinking it, and creating thousands of tasks is just fine?
The CPU is idling at a 2-4% peak, so I wondered about too many awaits or context switches.
Maybe the WebRequest calls are too many, because the web server/web service can't handle thousands of requests simultaneously and I should only throttle the WebRequests?
I already increased the .NET maxconnection setting in the app.config file.
It is possible to execute async operations in parallel while limiting the number of concurrent operations. There is a cool extension method for that; it is not part of the .NET Framework.
/// <summary>
/// Enumerates a collection in parallel and calls an async method on each item. Useful for making
/// parallel async calls, e.g. independent web requests, when the degree of parallelism needs to be
/// limited.
/// </summary>
public static Task ForEachAsync<T>(this IEnumerable<T> source, int degreeOfParallelism, Func<T, Task> action)
{
    return Task.WhenAll(
        Partitioner.Create(source)
                   .GetPartitions(degreeOfParallelism)
                   .Select(partition => Task.Run(async () =>
                   {
                       using (partition)
                       {
                           while (partition.MoveNext())
                               await action(partition.Current);
                       }
                   })));
}
Call it like this:
var files = new List<string> {"one", "two", "three"};
await files.ForEachAsync(5, async file =>
{
// do async stuff here with the file
await Task.Delay(1000);
});
As commenters have correctly noted, you're overthinking it. The .NET runtime has absolutely no problem tracking thousands of tasks.
However, you might want to consider using a TPL Dataflow pipeline, which would enable you to easily have different concurrency levels for different operations ("blocks") in your pipeline.
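For illustration, a minimal sketch of such a pipeline for the three steps above, assuming the createtempFile/convert/clean methods and the Item type from the question (the degrees of parallelism are arbitrary, and the blocks come from the System.Threading.Tasks.Dataflow package):
using System.Threading.Tasks.Dataflow;

// Each block gets its own MaxDegreeOfParallelism, so the web requests (convert)
// can be throttled independently of the local file work.
var createBlock = new TransformBlock<string, Item>(
    file => createtempFile(file),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 8 });

var convertBlock = new TransformBlock<Item, Item>(
    async item => { await convert(item); return item; },
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

var cleanBlock = new ActionBlock<Item>(
    item => clean(item),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 8 });

// Propagate completion so completing the first block drains the whole pipeline.
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
createBlock.LinkTo(convertBlock, linkOptions);
convertBlock.LinkTo(cleanBlock, linkOptions);

foreach (var file in files)
    await createBlock.SendAsync(file);

createBlock.Complete();      // no more input
await cleanBlock.Completion; // wait until every file has been cleaned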

Best way to call many web services?

I have 30 sub companies and every one has implemented their own web service (with different technologies).
I need to implement a web service to aggregate them. For example, all the sub company web services have a web method named GetUserPoint(int nationalCode), and I need to implement my web service to call all of them and collect all of the responses (for example, the sum of points).
This is my base class:
public abstract class BaseClass
{
    // all same attributes and methods
    public abstract long GetPoint(int nationalCode);
}
For each sub company's web service, I implement a class that inherits this base class and defines its own GetPoint method.
public class Company1 : BaseClass
{
    // implement own GetPoint method (call a web service).
}
to
public class CompanyN : BaseClass
{
    // implement own GetPoint method (call a web service).
}
So, this is my web method:
[WebMethod]
public long MyCollector(int nationalCode)
{
    BaseClass[] Clients = new BaseClass[] { new Company1(), /* ... */ new CompanyN() };
    long Result = 0;
    foreach (var item in Clients)
    {
        long ResultTemp = item.GetPoint(nationalCode);
        Result += ResultTemp;
    }
    return Result;
}
OK, it works, but it's so slow, because every sub company's web service is hosted on a different server (on the internet).
I can use parallel programming like this (is this even called parallel programming!?):
foreach (var item in Clients)
{
    Tasks.Add(Task.Run(() =>
    {
        Result.AddRange(item.GetPoint(MasterLogId, mobileNumber));
    }));
}
I think parallel programming (and threading) isn't good for this solution, because my solution is IO bound (not CPU intensive)!
Calling every external web service is so slow, am I right? Many threads would just be pending on a response!
I think async programming is the best way, but I am new to async and parallel programming.
What is the best way? (parallel.foreach - async TAP - async APM - async EAP - threading)
Please write an example for me.
It's refreshing to see someone who has done their homework.
First things first, as of .NET 4 (and this is still very much the case today) TAP is the preferred technology for async workflow in .NET. Tasks are easily composable, and for you to parallelise your web service calls is a breeze if they provide true Task<T>-returning APIs. For now you have "faked" it with Task.Run, and for the time being this may very well suffice for your purposes. Sure, your thread pool threads will spend a lot of time blocking, but if the server load isn't very high you could very well get away with it even if it's not the ideal thing to do.
You just need to fix a potential race condition in your code (more on that towards the end).
If you want to follow the best practices though, you go with true TAP. If your APIs provide Task-returning methods out of the box, that's easy. If not, it's not game over as APM and EAP can easily be converted to TAP. MSDN reference: https://msdn.microsoft.com/en-us/library/hh873178(v=vs.110).aspx
I'll also include some conversion examples here.
APM (taken from another SO question):
MessageQueue does not provide a ReceiveAsync method, but we can get it to play ball via Task.Factory.FromAsync:
public static Task<Message> ReceiveAsync(this MessageQueue messageQueue)
{
    return Task.Factory.FromAsync(messageQueue.BeginReceive(), messageQueue.EndReceive);
}
...
Message message = await messageQueue.ReceiveAsync().ConfigureAwait(false);
If your web service proxies have BeginXXX/EndXXX methods, this is the way to go.
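For instance, if a generated proxy happened to expose a BeginGetPoint/EndGetPoint pair, the wrapper could look like this (CompanyProxy and the method names below are assumptions for illustration, not taken from your code):
// Hypothetical APM-to-TAP wrapper for a generated proxy.
public static Task<long> GetPointAsync(this CompanyProxy proxy, int nationalCode)
{
    return Task<long>.Factory.FromAsync(
        proxy.BeginGetPoint, // (nationalCode, callback, state) => IAsyncResult
        proxy.EndGetPoint,   // IAsyncResult => long
        nationalCode,
        null);               // state object (unused here)
}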
EAP
Assume you have an old web service proxy derived from SoapHttpClientProtocol, with only event-based async methods. You can convert them to TAP as follows:
public static Task<long> GetPointAsyncTask(this PointWebService webService, int nationalCode)
{
    TaskCompletionSource<long> tcs = new TaskCompletionSource<long>();
    webService.GetPointAsyncCompleted += (s, e) =>
    {
        if (e.Cancelled)
        {
            tcs.SetCanceled();
        }
        else if (e.Error != null)
        {
            tcs.SetException(e.Error);
        }
        else
        {
            tcs.SetResult(e.Result);
        }
    };
    webService.GetPointAsync(nationalCode);
    return tcs.Task;
}
...
using (PointWebService service = new PointWebService())
{
    long point = await service.GetPointAsyncTask(123).ConfigureAwait(false);
}
Avoiding races when aggregating results
With regards to aggregating parallel results, your TAP loop code is almost right, but you need to avoid mutating shared state inside your Task bodies as they will likely execute in parallel. Shared state being Result in your case - which is some kind of collection. If this collection is not thread-safe (i.e. if it's a simple List<long>), then you have a race condition and you may get exceptions and/or dropped results on Add (I'm assuming AddRange in your code was a typo, but if not - the above still applies).
A simple async-friendly rewrite that fixes your race would look like this:
List<Task<long>> tasks = new List<Task<long>>();
foreach (BaseClass item in Clients) {
tasks.Add(item.GetPointAsync(MasterLogId, mobileNumber));
}
long[] results = await Task.WhenAll(tasks).ConfigureAwait(false);
If you decide to be lazy and stick with the Task.Run solution for now, the corrected version will look like this:
List<Task<long>> tasks = new List<Task<long>>();
foreach (BaseClass item in Clients)
{
Task<long> dodgyThreadPoolTask = Task.Run(
() => item.GetPoint(MasterLogId, mobileNumber)
);
tasks.Add(dodgyThreadPoolTask);
}
long[] results = await Task.WhenAll(tasks).ConfigureAwait(false);
You can create an async version of GetPoint:
public abstract class BaseClass
{
    // all same attributes and methods
    public abstract long GetPoint(int nationalCode);

    public async Task<long> GetPointAsync(int nationalCode)
    {
        // Offload the synchronous call to the thread pool so the calls can run concurrently.
        return await Task.Run(() => GetPoint(nationalCode));
    }
}
Then, collect the tasks for each client call. After that, execute all the tasks using Task.WhenAll. This will execute them all in parallel. Also, as pointed out by Kirill, you can await the results of each task:
var tasks = Clients.Select(x => x.GetPointAsync(nationalCode));
long[] results = await Task.WhenAll(tasks);
If you do not want to make the aggregating method async, you can collect the results by calling .Result instead of awaiting, like so:
long[] results = Task.WhenAll(tasks).Result;

Async & Await issue in a Metro Style app

I have a simple Metro style app that's giving me an issue with (async & await).
List<string> fileNames = new List<string>();
...
...
LoadList();
...
...
(Problem) Code that accesses the elements of the fileNames List
...
...
private async void LoadList()
{
// Code that loops through a directory and adds the
// file names to the fileNames List using GetFilesAsync()
}
The problem is that the fileNames List is accessed prematurely - before it is fully loaded with items. This is because of the async method - the program continues with the next line of code while the async method continues its processing.
How can I access the List after it is fully loaded (after the async method is done)?
Is there a way to accomplish what I'm trying to do without using async in Metro apps?
You need the calling method to be asynchronous too - and rather than having a fileNames variable, I'd make the LoadList method return the list. So you'd have:
public async Task ProcessFiles()
{
    List<string> fileNames = await LoadList();
    // Now process the files
}

public async Task<List<string>> LoadList()
{
    List<string> fileNames = new List<string>();
    // Do stuff...
    return fileNames;
}
This does mean that you need to wait for all the files to be found before you start processing them; if you want to process them as you find them you'll need to think about using a BlockingCollection of some kind. EDIT: As Stephen points out, TPL Dataflow would be a great fit here too.
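A rough sketch of that producer/consumer idea with a BlockingCollection (folder and ProcessFile are placeholders, not part of the question's code):
// BlockingCollection lives in System.Collections.Concurrent.
var fileNames = new BlockingCollection<string>();

// Producer: add names as they are discovered, then signal completion.
var producer = Task.Run(async () =>
{
    foreach (var file in await folder.GetFilesAsync()) // folder is a placeholder StorageFolder
        fileNames.Add(file.Name);
    fileNames.CompleteAdding(); // tell the consumer no more items are coming
});

// Consumer: processes names as soon as they appear, without waiting for the full list.
var consumer = Task.Run(() =>
{
    foreach (var name in fileNames.GetConsumingEnumerable())
        ProcessFile(name); // placeholder for whatever per-file work you do
});

await Task.WhenAll(producer, consumer);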
