I have a server app (C# with .NET 5) that exposes a gRPC bi-directional endpoint. This endpoint takes in a binary stream, which the server analyzes to produce responses that are sent back on the gRPC response stream.
Each file being sent over gRPC is a few megabytes, and it takes a few minutes for the gRPC call to complete streaming (without latency). With latency, this time sometimes increases by 50%.
On the client, I have 2 tasks (Task.Run) running: one streaming the file from the client's file system using FileStream, the other reading responses from the server (gRPC).
On the server, I also have 2 tasks running: one reading messages from the gRPC request stream and pushing them into a queue (DataFlow.BufferBlock<byte[]>), the other processing messages from the queue and writing responses to gRPC.
The problem:
If I disable (comment out) all the server processing code and simply read and log messages from gRPC, there's almost zero latency from client to server.
When the server has processing enabled, the clients see latency while writing to the gRPC client.
With just 10 active parallel sessions (gRPC calls), these latencies can go up to 10-15 seconds.
PS: this only happens when I have more than one client running; a higher number of concurrent clients means higher latency.
The client code looks a bit like the below:
FileStream fs = new(audioFilePath, FileMode.Open, FileAccess.Read, FileShare.Read, 1024 * 1024, true);
byte[] buffer = new byte[10_000];
GrpcClient client = new GrpcClient(_singletonChannel); // using single channel since only 5-10 clients are there right now
BiDiCall call = client.BiDiService(headers: null, deadline: null, CancellationToken.None);
var writeTask = Task.Run(async () => {
    int bytesRead;
    while ((bytesRead = await fs.ReadAsync(buffer, 0, buffer.Length)) > 0)
    {
        await call.RequestStream.WriteAsync(new() { Chunk = ByteString.CopyFrom(buffer, 0, bytesRead) });
    }
    await call.RequestStream.CompleteAsync();
});
var readTask = Task.Run(async () => {
    while (await call.ResponseStream.MoveNext())
    {
        // write call.ResponseStream.Current to the log
    }
});
await Task.WhenAll(writeTask, readTask);
await call;
Server code looks like:
readonly BufferBlock<MessageRequest> messages = new();
MessageProcessor _processor = new();
public override async Task BiDiService(IAsyncStreamReader<MessageRequest> requestStream,
    IServerStreamWriter<MessageResponse> responseStream,
    ServerCallContext context)
{
    var readTask = Task.Factory.StartNew(async () => {
        while (await requestStream.MoveNext())
        {
            messages.Post(requestStream.Current); // add to queue
        }
        messages.Complete();
    }, TaskCreationOptions.LongRunning).Unwrap();
    var processTask = Task.Run(async () => {
        while (await messages.OutputAvailableAsync())
        {
            var message = await messages.ReceiveAsync(); // pick from queue
            // if I comment out the line below and run with multiple clients, the latency disappears
            var result = await _processor.Process(message); // takes some time to process
            if (result.IsImportantForClient())
                await responseStream.WriteAsync(result.Value);
        }
    });
    await Task.WhenAll(readTask, processTask);
}
So, as it turned out, the problem was due to the delay in spawning worker threads by the ThreadPool.
The ThreadPool was taking too long to spawn threads to process these tasks, causing gRPC reads to lag significantly.
This was fixed by increasing the minimum worker thread count using ThreadPool.SetMinThreads. MSDN reference
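For reference, the change is essentially a couple of lines at application startup; the numbers below are placeholders rather than the values from the actual fix, and should be tuned for your own workload:
// Raise the minimum number of threads the ThreadPool keeps ready, so bursts of work
// don't have to wait for the pool's slow thread-injection heuristic.
ThreadPool.GetMinThreads(out int minWorker, out int minIocp);
ThreadPool.SetMinThreads(Math.Max(minWorker, 100), minIocp);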
There have been a number of promising comments on the initial SO question, but I wanted to paraphrase what I thought was important: there's an outer async method that calls into two Task.Run()'s (with the TaskCreationOptions.LongRunning option) that wrap async loops, and finally a Task.WhenAll() that rejoins the two Tasks.
Alois Kraus offers that the OS task scheduler may be abstracting away what you might think is more efficient - this could very well be true, and if it is, I would suggest trying to remove the asynchronous processing and seeing what kind of difference various sync/async blends make for your particular scenario.
One thing to remember is that async/await logically blocks at the expense of automatic thread management - this is great for single-path-ish I/O-bound processing (e.g. needing to call a db/web service before moving on to the next step of execution) and can be less beneficial as you move toward compute-bound processing (execution that needs to be explicitly re-joined - async/await implicitly takes care of the Task re-join).
Related
My use case is this: send 100,000+ web requests to our application server and wait for the results. Here, most of the delay is IO-bound, not CPU-bound, so I understand the Dataflow libraries may not be the best tool for this. I've managed to use it with a lot of success and have set the MaxDegreeOfParallelism to the number of requests that I trust the server to be able to handle; however, since this is the maximum number of tasks, there's no guarantee that this will actually be the number of tasks running at any time.
The only bit of information I could find in the documentation is this:
Because the MaxDegreeOfParallelism property represents the maximum degree of parallelism, the dataflow block might execute with a lesser degree of parallelism than you specify. The dataflow block can use a lesser degree of parallelism to meet its functional requirements or to account for a lack of available system resources. A dataflow block never chooses a greater degree of parallelism than you specify.
This explanation is quite vague on how it actually determines when to spin up a new task. My hope was that it would recognize that the task is blocked due to IO, not a lack of system resources, and would basically stay at the maximum degree of parallelism for the entire duration of the operation.
However, after monitoring a network capture, it seems to be MUCH quicker in the beginning and slower near the end. I can see from the capture, that at the beginning it does reach the maximum as specified. The TPL library doesn't have any built-in way to monitor the current number of active threads, so I'm not really sure of the best way to investigate further on that end.
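One rough way to watch this (just a sketch of my own instrumentation idea, not anything built into TPL - the Request/Response/SendRequestAsync names below are placeholders for the real types and IO-bound body) is to wrap the delegate handed to the block and count in-flight invocations:
// Illustrative only: count how many invocations of the block's body are running at once.
int inFlight = 0;
var block = new TransformBlock<Request, Response>(async request =>
{
    var current = Interlocked.Increment(ref inFlight);
    Console.WriteLine($"in flight: {current}");
    try
    {
        return await SendRequestAsync(request); // the real IO-bound body goes here
    }
    finally
    {
        Interlocked.Decrement(ref inFlight);
    }
}, options);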
My implementation:
internal static ExecutionDataflowBlockOptions GetDefaultBlockOptions(int maxDegreeOfParallelism,
CancellationToken token) => new()
{
MaxDegreeOfParallelism = maxDegreeOfParallelism,
CancellationToken = token,
SingleProducerConstrained = true,
EnsureOrdered = false
};
private static async ValueTask<T?> ReceiveAsync<T>(this ISourceBlock<T?> block, bool configureAwait, CancellationToken token)
{
try
{
return await block.ReceiveAsync(token).ConfigureAwait(configureAwait);
} catch (InvalidOperationException)
{
return default;
}
}
internal static async IAsyncEnumerable<T> YieldResults<T>(this ISourceBlock<T?> block, bool configureAwait,
[EnumeratorCancellation]CancellationToken token)
{
while (await block.OutputAvailableAsync(token).ConfigureAwait(configureAwait))
if (await block.ReceiveAsync(configureAwait, token).ConfigureAwait(configureAwait) is T result)
yield return result;
// by the time OutputAvailableAsync returns false, the block is guaranteed to be complete. However,
// we want to await it anyway, since this will propagate any exception thrown to the consumer.
// we don't simply await the completion task, because that wouldn't return all aggregate exceptions,
// just the last to occur
if (block.Completion.Exception != null)
throw block.Completion.Exception;
}
public static IAsyncEnumerable<TResult> ParallelSelectAsync<T, TResult>(this IEnumerable<T> source, Func<T, Task<TResult?>> body,
int maxDegreeOfParallelism = DataflowBlockOptions.Unbounded, TaskScheduler? scheduler = null, CancellationToken token = default)
{
var options = GetDefaultBlockOptions(maxDegreeOfParallelism, token);
if (scheduler != null)
options.TaskScheduler = scheduler;
var block = new TransformBlock<T, TResult?>(body, options);
foreach (var item in source)
block.Post(item);
block.Complete();
return block.YieldResults(scheduler != null && scheduler != TaskScheduler.Default, token);
}
So, basically, my question is this: when an IO-bound action is executed in a TPL Dataflow block, how can I ensure the block stays at the MaxDegreeOfParallelism that is set?
On the contrary, Dataflow is great at IO work and perfect for this scenario. Dataflow architectures work by creating pipelines similar to Bash or PowerShell pipelines, with each block acting as a separate command, reading messages from its input queue and passing them to the next block through its output queue. That's why the default DOP is 1 - parallelism and concurrency come from using multiple commands/blocks, not one fat block with a high DOP.
This is a simplified example of what I use at work to request daily sales reports from about a hundred airlines (BSPs, for those that know about air tickets), parse the reports and then download individual ticket records, before importing everything into the database.
In this case the head block downloads content with a DOP=10, then the parser block parses the responses one at a time. The downloader is IO-bound, so it can make a lot more requests than there are cores - as many as the services allow, or the application wants to handle.
The parser, on the other hand, is CPU-bound. A high DOP would lock a lot of cores, which would harm not just the application, but other processes as well.
// Create the blocks
var dlOptions = new ExecutionDataflowBlockOptions {
MaxDegreeOfParallelism=10
};
var downloader=new TransformBlock<string,string>(
url => _client.GetStringAsync(url,cancellationToken),
dlOptions);
var parser=new TransformBlock<string,Something>(ParseIntoSomething);
var importer=new ActionBlock<Something>(ImportInDb);
// Link the blocks
var linkOptions = new DataflowLinkOptions {PropagateCompletion = true};
downloader.LinkTo(parser,linkOptions);
parser.LinkTo(importer,linkOptions);
After building this 3 step pipeline I post URLs at the front and expect the tail block to complete
foreach(var url in urls)
{
downloader.Post(url);
}
downloader.Complete();
await importer.Completion;
There are a lot of possible improvements to this. Right now, if the downloader is faster than the parser, all the content will be buffered in memory. In a long-running pipeline this can easily take up all available memory.
A simple way to avoid this is to add BoundedCapacity=N to the parser block options. If the parser's input buffer is full, upstream blocks - in this case the downloader - will pause and wait until a slot becomes available:
var parserOptions = new ExecutionDataflowBlockOptions {
BoundedCapacity=2,
MaxDegreeOfParallelism=2,
};
var parser=new TransformBlock<string,Something>(ParseIntoSomething, parserOptions);
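Note that if the head block itself is also given a BoundedCapacity, Post will start returning false once its input buffer fills up. In that case the feeding loop should switch to awaiting SendAsync, roughly like this:
// SendAsync asynchronously waits for a free slot instead of dropping the message
foreach (var url in urls)
{
    await downloader.SendAsync(url);
}
downloader.Complete();
await importer.Completion;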
My application hangs (force-killed by Polly's Circuit breaker or timeout) when trying to concurrently receive and deserialize very large JSON files, containing over 12,000,000 chars each.
using System.Text.Json;
Parallel.For(0, 1000000, async (i, state) =>
{
var strategy = GetPollyResilienceStrategy(); // see the method implementation in the following.
await strategy.ExecuteAsync(async () =>
{
var stream = await httpClient.GetStreamAsync(
GetEndPoint(i), cancellationToken);
var foo = await JsonSerializer.DeserializeAsync<Foo>(
stream, cancellationToken: cancellationToken);
// Process may require API calls to the client,
// but no other API calls and JSON deserialization
// is required after processedFoo is obtained.
var processedFoo = Process(foo); // Long running CPU and IO bound task, that may involve additional API calls and JSON deserialization.
queue.Add(processedFoo); // A background task, implementation in the following.
});
});
I use different resilience strategies, one from GetPollyResilienceStrategy to process the item i, and one for the HttpClient. The former is more relevant here, so I'm sharing that but I'm also happy to share the strategy used for the HttpClient if needed.
AsyncPolicyWrap GetPollyResilienceStrategy()
{
var retry = Policy
.Handle<Exception>()
.WaitAndRetryAsync(Backoff.DecorrelatedJitterBackoffV2(
retryCount: 3,
medianFirstRetryDelay: TimeSpan.FromMinutes(1)));
var timeout = Policy.TimeoutAsync(timeout: TimeSpan.FromMinutes(5));
var circuitBreaker = Policy
.Handle<Exception>()
.AdvancedCircuitBreakerAsync(
failureThreshold: 0.5,
samplingDuration: TimeSpan.FromMinutes(10),
minimumThroughput: 2,
durationOfBreak: TimeSpan.FromMinutes(1));
var strategy = Policy.WrapAsync(retry, timeout, circuitBreaker);
return strategy;
}
And the background task is implemented as the following.
var queue = new BlockingCollection<Foo>();
var listener = Task.Factory.StartNew(() =>
{
while (true)
{
Foo foo;
try
{
foo = queue.Take(cancellationToken);
}
catch (OperationCanceledException)
{
break;
}
LongRunningCpuBoundTask(foo);
SerializeToCsv(foo);
}
},
creationOptions: TaskCreationOptions.LongRunning);
Context
My .NET console application receives a very large JSON file over HTTP and tries to deserialize it by reading some fields necessary for the application and ignoring others. A successful process is expected to run for a few days. Though the program "hangs" after a few hours. After extensive debugging, it turns out an increasing number of threads are stuck trying to deserialize the JSON, and the program hangs (i.e., the Parallel.For does not start another one) when ~5 threads are stuck. Every time it gets stuck at a different i, and since JSON objects are very large, it is not feasible to log every received JSON for debugging.
Why does it get stuck? Is there any built-in max capacity in JsonSerializer that is reached? e.g., buffer size?
Is it possible that GetStreamAsync is reading corrupt data, hence JsonSerializer is stuck in some corner case trying to deserialize a corrupt JSON?
I found this thread relevant, though not sure if there was a resolution other than "fixed in newer version" https://github.com/dotnet/runtime/issues/41604
The program eventually exits, but only as a result of either the circuit breaker or the timeout. I have given very long intervals in the resilience strategy, e.g., giving the process 20 minutes to try deserializing the JSON before retrying.
Even without knowing how you set up your resiliency strategy, it seems like you want to kill two birds with one stone:
Add resilient behaviour for the http based communication
Add resilient behaviour for the stream parsing
I would recommend separating these two.
GetStreamAsync
The GetStreamAsync call returns a Task<Stream> which does not allow you to access the underlying HttpResponseMessage.
But if you issue your request for the stream like this:
var response = await httpClient.GetAsync(url, HttpCompletionOption.ResponseHeadersRead);
using var stream = await response.Content.ReadAsStreamAsync();
then you would be able to decorate the GetAsync call with an HTTP-based Polly policy definition.
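For example, the decorated request could look roughly like this - httpPolicy here is a placeholder for whatever HTTP-focused policy you define:
// Run the request through the HTTP policy; only the headers are read before the stream is exposed
var response = await httpPolicy.ExecuteAsync(
    ct => httpClient.GetAsync(GetEndPoint(i), HttpCompletionOption.ResponseHeadersRead, ct),
    cancellationToken);
response.EnsureSuccessStatusCode();
using var stream = await response.Content.ReadAsStreamAsync();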
DeserializeAsync
Looking at this problem purely from a resilience perspective, it would make sense to use a combination of CancellationTokenSources to enforce a timeout, like this:
CancellationTokenSource userCancellation = ...;
var timeoutSignal = new CancellationTokenSource(TimeSpan.FromMinutes(20));
var combinedCancellation = CancellationTokenSource.CreateLinkedTokenSource(userCancellation.Token, timeoutSignal.Token);
...
var foo = await JsonSerializer.DeserializeAsync<Foo>(stream, combinedCancellation.Token);
But you could achieve the same with optimistic timeout policy.
var foo = await timeoutPolicy.ExecuteAsync(
async token => await JsonSerializer.DeserializeAsync<Foo>(stream, token), ct);
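where timeoutPolicy could be defined as an optimistic timeout - optimistic meaning it relies on DeserializeAsync honoring the cancellation token it receives - for example:
// optimistic timeout: cancellation is signalled through the token passed into the delegate
var timeoutPolicy = Policy.TimeoutAsync(TimeSpan.FromMinutes(20), TimeoutStrategy.Optimistic);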
UPDATE #1
My understanding is that the strategy used for every item is independent from the others, correct? So if the circuit breaks, why does Parallel.ForEachAsync not continue with the others?
The timeout and retry policies are stateless, whereas the circuit breaker maintains state that is shared between executions. Here I have detailed some internals, if you are interested in how it works under the hood.
Also, if the loop is broken, why no exception?
If the threshold of successive/subsequent failed requests is reached, the CB transitions from Closed to Open and throws the original exception (if it was triggered by an exception). In the Open state, if you try to perform a new request, it will short-circuit the execution with a BrokenCircuitException.
So, back to your question. Yes, there should be an exception, but because you have used Parallel.For, which does not support async, the exception is swallowed. If you had used await Parallel.ForEachAsync, it would throw the exception.
UPDATE #2
After assessing your GetPollyResilienceStrategy code I have two more suggestions:
Please change the return type of the method to IAsyncPolicy from AsyncPolicyWrap
AsyncPolicyWrap is an implementation detail and should not be exposed
Please change the order of timeout and circuit breaker
Policy.WrapAsync(retry, circuitBreaker, timeout);
In your setup the CB will not break for timeout
In my suggested setup the CB could break for TimeoutRejectedException as well
Parallel.For does not play well with async. It has no overload that accepts an async delegate, so your async lambda is treated as async void and the loop does not wait for it - it will start just about all the tasks at the same time. Eventually this will likely lead to bad things, like thread pool exhaustion.
I would suggest using another pattern for your work:
If using .NET 6, use Parallel.ForEachAsync (see the sketch after this list)
Keep using Parallel.For but make your worker method synchronous
Use something like LimitedConcurrencyLevelTaskScheduler (see example) to limit the number of concurrent tasks.
Use Dataflow, but I'm not very familiar with this, so I cannot advise exactly how it should be used.
Manually split your work into chunks, and run the chunks in parallel.
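Here is the sketch referenced in option 1, reusing the names from the question (httpClient, GetEndPoint, Process, queue, cancellationToken) and assuming .NET 6+; the degree of parallelism is capped instead of being left unbounded:
var parallelOptions = new ParallelOptions
{
    MaxDegreeOfParallelism = 10, // tune to what the remote API tolerates
    CancellationToken = cancellationToken
};
await Parallel.ForEachAsync(Enumerable.Range(0, 1_000_000), parallelOptions, async (i, ct) =>
{
    var strategy = GetPollyResilienceStrategy();
    await strategy.ExecuteAsync(async () =>
    {
        using var stream = await httpClient.GetStreamAsync(GetEndPoint(i), ct);
        var foo = await JsonSerializer.DeserializeAsync<Foo>(stream, cancellationToken: ct);
        var processedFoo = Process(foo);
        queue.Add(processedFoo, ct);
    });
});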
I have a .NET 6 BackgroundService which pushes data from on-premises to a 3rd-party API.
The 3rd party API takes about 500 milliseconds to process the API call.
The problem is that I have about 1,000,000 rows of data to push to this API one at a time. At 1/2 second per row, it's going to take about 6 days to sync up.
So, I would like to try to spawn multiple threads in order to hit the API simultaneously with 10 threads.
var startTime = DateTimeOffset.Now;
var batchSize = _config.GetValue<int>("BatchSize");
using (var scope = _serviceScopeFactory.CreateScope())
{
var context = scope.ServiceProvider.GetRequiredService<PlankContext>();
var dncEntries = await context.PlankQueueDnc.Where(x => x.ToProcessFlag == true).Take(batchSize).ToListAsync();
foreach (var plankQueueDnc in dncEntries)
{
var response = await _plankConnector.InsertDncAsync(plankQueueDnc);
context.PlankQueueDnc.Update(plankQueueDnc);
}
await context.SaveChangesAsync();
}
Here is the code. As you can see, it gets a batch of 100 records and then processes them one by one. Is there a way to modify this so this line is not awaited? I don't quite understand how it would work if it were not awaited. Would it create a thread for each execution in the loop?
var response = await _plankConnector.InsertDncAsync(plankQueueDnc);
I am clearly not as up to speed on threads as the esteemed Stephen Cleary.
So suggestions would be appreciated.
In .NET 6 you can use Parallel.ForEachAsync to execute operations concurrently, using either all available cores or a limited Degree-Of-Parallelism.
The following code loads all records, executes the posts concurrently, then updates the records :
using (var scope = _serviceScopeFactory.CreateScope())
{
var context = scope.ServiceProvider.GetRequiredService<PlankContext>();
var dncEntries = await context.PlankQueueDnc
.Where(x => x.ToProcessFlag == true)
.Take(batchSize)
.ToListAsync();
await Parallel.ForEachAsync(dncEntries, async (plankQueueDnc, ct) =>
{
    var response = await _plankConnector.InsertDncAsync(plankQueueDnc);
    plankQueueDnc.Whatever = response.Something;
});
await context.SaveChangesAsync();
}
There's no reason to call Update, as a DbContext tracks the objects it loaded and knows which ones were modified. SaveChangesAsync will persist all changes in a single transaction.
DOP and Throttling
By default, Parallel.ForEachAsync will execute as many tasks concurrently as there are cores. This may be too few or too many for HTTP calls. On the one hand, the client machine isn't using its CPU at all while waiting for the remote service. On the other hand, the remote service itself may not like or even allow too many concurrent calls, and may impose throttling.
The ParallelOptions class can be used to specify the degree of parallelism. If the API allows it, we could execute e.g. 20 concurrent calls:
var options = new ParallelOptions { MaxDegreeOfParallelism = 20 };
await Parallel.ForEachAsync(dncEntries, options, async (plankQueueDnc, ct) => {...});
Many services impose a limit on how many requests can be made in a period of time. A (somewhat naive) way of handling this is to add a small delay in the task worker code:
var delay=100;
await Parallel.ForEachAsync(dncEntries,options,async plankQueueDnc=>{
...
await Task.Delay(delay);
});
I have a bunch of independent REST calls to make (say 1000), each with a different body. How do I make these calls in the least amount of time?
I am using a Parallel.ForEach loop to make the calls, but doesn't a call wait for the previous call to finish (on a single thread)? Is there any callback kind of system to prevent this and make the process faster?
Parallel.foreach(...){
(REST call)
HttpResponseMessage response = this.client.PostAsync(new Uri(url), content).Result;
}
Using await also gives almost same results.
Make all the calls with PostAsync:
var urlAndContentArray = ...
// The fast part
IEnumerable<Task<HttpResponseMessage>> tasks = urlAndContentArray.Select
(x => this.client.PostAsync(new Uri(x.Url), x.Content));
// IF you care about results: here is the slow part:
var responses = await Task.WhenAll(tasks);
Note that this will make all the calls very quickly as requested, but the time it takes to get the results is mostly not related to how many requests you send out - it's limited by the number of outgoing requests .NET will run in parallel, as well as how fast those servers reply (and whether they have any DoS protection/throttling).
The simplest way to do many async actions in parallel is to start them without awaiting, capture the tasks in a collection, and then wait until all tasks have completed.
For example
var httpClient = new HttpClient();
var payloads = new List<string>(); // this is 1000 payloads
var tasks = payloads.Select(p => httpClient.PostAsync("https://addresss", new StringContent(p)));
await Task.WhenAll(tasks);
It should be enough for a start, but mind 2 things.
There is still a connection pool per hostname, which defaults to 4. You can use SocketsHttpHandler to control the pool size.
It will really start all 1000 requests in parallel, which sometimes might not be what you want. To control the MAX number of parallel items you can look at ActionBlock.
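For the first point, the per-host connection limit can be raised when constructing the client, roughly like this (the value of 100 is just an example):
var handler = new SocketsHttpHandler
{
    MaxConnectionsPerServer = 100 // allow more concurrent connections to the same host
};
var httpClient = new HttpClient(handler);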
I have a console app written in C# on top of .NET Core 2.2.
My application allows me to trigger long-running admin jobs using Windows task scheduler.
One of the admin jobs makes a web-API call which downloads lots of files before uploading them to Azure Blob Storage. Here are the logical steps that my code needs to perform to get the job done:
Call the remote API, which responds with a MIME message where each message represents a file.
Parse out the MIME messages and convert each message into a MemoryStream, creating a collection of MemoryStreams.
Once I have a collection with 1000+ MemoryStreams, I want to write each stream to Azure Blob Storage. Since the write to the remote storage is slow, I am hoping that I can execute each write iteration on its own process or thread. This will allow me to have potentially 1000+ threads running at the same time in parallel, instead of having to wait for the result of each write operation. Each thread will be responsible for logging any errors that potentially occur during the write/upload process. Any logged errors will be dealt with by a different job, so I don't have to worry about retrying.
My understanding was that calling the code that writes/uploads the stream asynchronously would do exactly that. In other words, I would be saying "here is a Stream, execute it and run for as long as it takes; I don't really care about the result as long as the task gets completed."
While testing, I found out that my understanding of calling async was somewhat invalid. I was under the impression that a method defined with async would get executed on a background thread/worker until that process completed. But my understanding failed when I tested the code. My code showed me that without the await keyword, the async code never really executes. At the same time, when the await keyword is added, the code waits until the process finishes executing before it continues. In other words, adding await for my need would defeat the purpose of calling the method asynchronously.
Here is a stripped down version of my code for the sake of explaining what I am trying to accomplish
public async Task Run()
{
// This gets populated after calling the web-API and parsing out the result
List<Stream> files = new List<Stream>{.....};
foreach (Stream file in files)
{
// This code should get executed in the background without having to await the result
await Upload(file);
}
}
// This method is responsible of upload a stream to a storage and log error if any
private async Task Upload(Stream stream)
{
try
{
await Storage.Create(stream, GetUniqueName());
}
catch(Exception e)
{
// Log any errors
}
}
From the above code, calling await Upload(file); works and uploads the file as expected. However, since I am using await when calling the Upload() method, my loop will NOT jump to the next iteration until the upload code finishes. At the same time, if I remove the await keyword, the loop does not wait for the upload process, but the Stream never actually gets written to the storage, as if I had never called the code.
How can I execute multiple Upload method in parallel so that I have one thread running per upload in the background?
Convert the list to a list of "Upload" tasks, and await them all with Task.WhenAll():
public async Task Run()
{
// This gets populated after calling the web-API and parsing out the result
List<Stream> files = new List<MemoryStream>{.....};
var tasks = files.Select(Upload);
await Task.WhenAll(tasks);
}
See this post for some more information about tasks/await.
I am hoping that I can execute each write iteration using its own process or thread.
This is not really the best way to do this. Processes and Threads are limited resources. Your limiting factor is waiting on the network to perform an action.
What you'll want to do is just something like:
var tasks = new List<Task>(queue.Count);
while (queue.Count > 0)
{
var myobject = queue.Dequeue();
var task = blockBlob.UploadFromByteArrayAsync(myobject.content, 0, myobject.content.Length);
tasks.Add(task);
}
await Task.WhenAll(tasks);
Here we're just creating tasks as fast as we can and then waiting for them all to complete. We'll just let the .NET framework take care of the rest.
The important thing here is that threads don't improve the speed of waiting for network resources. Tasks are a way to delegate what needs to be done out of the thread's hands, so you have more threads available to do whatever comes next (like starting a new upload, or responding to a finished upload). If a thread simply waits for the upload to complete, it's a wasted resource.
You likely need this:
var tasks = files.Select(Upload);
await Task.WhenAll(tasks);
Just note that it'll spawn as many tasks as you have files, which may bring the process/machine down if there are too many of them. See Have a set of Tasks with only X running at a time as an example of how to address that.
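A common way to do that is to throttle with a SemaphoreSlim; a minimal sketch, with the limit of 10 concurrent uploads chosen arbitrarily:
var throttler = new SemaphoreSlim(10); // at most 10 uploads in flight at once
var tasks = files.Select(async file =>
{
    await throttler.WaitAsync();
    try
    {
        await Upload(file);
    }
    finally
    {
        throttler.Release();
    }
});
await Task.WhenAll(tasks);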
The other answers are fine; however, another approach is to use TPL Dataflow, available on NuGet from https://www.nuget.org/packages/System.Threading.Tasks.Dataflow/
public static async Task DoWorkLoads(List<Something> results)
{
var options = new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 50
};
var block = new ActionBlock<Something>(MyMethodAsync, options);
foreach (var result in results)
block.Post(result );
block.Complete();
await block.Completion;
}
...
public static async Task MyMethodAsync(Something result)
{
// Do async work here
}
The advantages of Dataflow:
It naturally works with async, as do the WhenAll task-based solutions
It can also be plumbed into a larger pipeline of tasks
You can retry errors by piping them back in
You can add any pre-processing calls into earlier blocks
You can limit MaxDegreeOfParallelism if throttling is a concern
You can make more complicated pipelines, hence the name Dataflow
You could convert your code to an Azure Function and have it let Azure handle most of the parallelism, scale-out and upload to Azure Blob Storage work.
You could use an Http Trigger or Service Bus trigger to initiate each download, process and upload task.
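A very rough sketch of the Service Bus variant, using the in-process Azure Functions programming model; the queue name, connection settings and message shape are placeholders, not taken from the question:
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class UploadFileFunction
{
    private static readonly HttpClient Http = new HttpClient();

    [FunctionName("UploadFile")]
    public static async Task Run(
        [ServiceBusTrigger("file-requests", Connection = "ServiceBusConnection")] string fileUrl,
        [Blob("uploads/{rand-guid}", FileAccess.Write, Connection = "StorageConnection")] Stream output,
        ILogger log)
    {
        // Download one file per queue message and copy it straight into the output blob.
        // Azure scales out function instances based on the queue length.
        using var source = await Http.GetStreamAsync(fileUrl);
        await source.CopyToAsync(output);
        log.LogInformation("Uploaded {FileUrl}", fileUrl);
    }
}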