Parsing multiple very large JSON files concurrently

Parsing multiple very large JSON files concurrently - c#

My application hangs (force-killed by Polly's Circuit breaker or timeout) when trying to concurrently receive and deserialize very large JSON files, containing over 12,000,000 chars each.
using System.Text.Json;
Parallel.For(0, 1000000, async (i, state) =>
{
var strategy = GetPollyResilienceStrategy(); // see the method implementation in the following.
await strategy.ExecuteAsync(async () =>
{
var stream = await httpClient.GetStreamAsync(
GetEndPoint(i), cancellationToken);
var foo = await JsonSerializer.DeserializeAsync<Foo>(
stream, cancellationToken: cancellationToken);
// Process may require API calls to the client,
// but no other API calls and JSON deserialization
// is required after processedFoo is obtained.
var processedFoo = Process(foo); // Long running CPU and IO bound task, that may involve additional API calls and JSON deserialization.
queue.Add(processedFoo); // A background task, implementation in the following.
});
});
I use different resilience strategies, one from GetPollyResilienceStrategy to process the item i, and one for the HttpClient. The former is more relevant here, so I'm sharing that but I'm also happy to share the strategy used for the HttpClient if needed.
AsyncPolicyWrap GetPollyResilienceStrategy()
{
var retry = Policy
.Handle<Exception>()
.WaitAndRetryAsync(Backoff.DecorrelatedJitterBackoffV2(
retryCount: 3,
medianFirstRetryDelay: TimeSpan.FromMinutes(1)));
var timeout = Policy.TimeoutAsync(timeout: TimeSpan.FromMinutes(5));
var circuitBreaker = Policy
.Handle<Exception>()
.AdvancedCircuitBreakerAsync(
failureThreshold: 0.5,
samplingDuration: TimeSpan.FromMinutes(10),
minimumThroughput: 2,
durationOfBreak: TimeSpan.FromMinutes(1));
var strategy = Policy.WrapAsync(retry, timeout, circuitBreaker);
return strategy;
}
And the background task is implemented as the following.
var queue = new BlockingCollection<Foo>();
var listner = Task.Factory.StartNew(() =>
{
while (true)
{
Foo foo;
try
{
foo = queue.Take(cancellationToken);
}
catch (OperationCanceledException)
{
break;
}
LongRunningCpuBoundTask(foo);
SerializeToCsv(foo);
}
},
creationOptions: TaskCreationOptions.LongRunning);
Context
My .NET console application receives a very large JSON file over HTTP and tries to deserialize it by reading some fields necessary for the application and ignoring others. A successful process is expected to run for a few days. Though the program "hangs" after a few hours. After extensive debugging, it turns out an increasing number of threads are stuck trying to deserialize the JSON, and the program hangs (i.e., the Parallel.For does not start another one) when ~5 threads are stuck. Every time it gets stuck at a different i, and since JSON objects are very large, it is not feasible to log every received JSON for debugging.
Why does it get stuck? Is there any built-in max capacity in JsonSerializer that is reached? e.g., buffer size?
Is it possible that GetStreamAsync is reading corrupt data, hence JsonSerializer is stuck in some corner case trying to deserialize a corrupt JSON?
I found this thread relevant, though not sure if there was a resolution other than "fixed in newer version" https://github.com/dotnet/runtime/issues/41604
The program eventually exists but as a result of either the circuit breaker or timeout. I have given very long intervals in the resilience strategy, e.g., giving the process 20min to try deserializing JSON before retrying.

Even without knowing how did you set up your resiliency strategy it seems like you want to kill two birds with one stone:
Add resilient behaviour for the http based communication
Add resilient behaviour for the stream parsing
I would recommend to separate these two.
GetStreamAsync
The GetStreamAsync call returns a Task<Stream> which does not allow you to access the underlying HttpResponseMessage.
But if you issue your request for the stream like this:
var response = await httpClient.GetAsync(url, HttpCompletionOption.ResponseHeadersRead);
using var stream = await response.Content.ReadAsStreamAsync();
then you would be able to decorate the GetAsync call with a http based Polly policy definition.
DeserializeAsync
Looking to this problem only from resilience perspective it would make sense to use a combination of CancellationTokenSources to enforce timeout like this:
CancellationTokenSource userCancellation = ...;
var timeoutSignal = new CancellationTokenSource(TimeSpan.FromMinutes(20));
var combinedCancellation = CancellationTokenSource.CreateLinkedTokenSource(userCancellation.Token, timeoutSignal.Token);
...
var foo = await JsonSerializer.DeserializeAsync<Foo>(stream, combinedCancellation.Token);
But you could achieve the same with optimistic timeout policy.
var foo = await timeoutPolicy.ExecuteAsync(
async token => await JsonSerializer.DeserializeAsync<Foo>(stream, token), ct);
UPDATE #1
My understanding is the strategy used for every item is independent from the other, correct? So if the circuit breaks, why the parallel.ForEachAsync does not continue with the others?
The timeout and retry policies are stateless. Whereas the circuit breaker maintains a state which is shared between the executions. Here I have detailed some internals if you are interested how does it work under the hood.
Also, if the loop is broken, why no exception?
If the threshold of the successive/subsequent requests is reached then the CB transitions from Closed to Open and the CB throws the original exception (if it was triggered for some exception). In Open state if you try to perform a new request then it will short-cut the execution with a BrokenCircuitException.
So, back to your question. Yes, there should be an exception but because you have used Parallel.Foreach which does not support async that's why the exception is shallowed. If you would have used await Parallel.ForeachAsync then it should throw the exception.
UPDATE #2
After assessing your GetPollyResilienceStrategy code I have two more advices:
Please change the return type of the method to IAsyncPolicy from AsyncPolicyWrap
AsyncPolicyWrap is an implementation detail and should not be exposed
Please change the order of timeout and circuit breaker
Policy.WrapAsync(retry, circuitBreaker, timeout);
In your setup the CB will not break for timeout
In my suggested setup the CB could break for TimeoutRejectedException as well

Parallel.For does not play well with async. I would not even expect your example to compile, since the lamda for the Parallel.For lacks an async keyword. So I would expect it to start just about all the tasks at the same time. Eventually this will likely lead to bad things, like threadpool exhaustion.
I would suggest using another pattern for your work
If using .Net 6, use Parallel.ForEachAsync
Keep using Parallel.For but make your worker method synchronous
Use something like LimitedConcurrencyTaskScheduler (see example) to limit the number of concurrent tasks.
Use Dataflow, But I'm not very familiar with this, so i cannot advice exactly how it should be used.
Manually split your work into chunks, and run chunks in parallel.

Related

Select with asynchronous method

I've found a question that was really useful for me, but I still can't realize what equivalent in LINQ-world has further construction:
public async Task<List<ObjectInfo>> GetObjectsInfo(string[] objectIds)
{
var result = new List<ObjectInfo>(objectIds.Length);
foreach (var id in objectIds)
{
result.Add(await GetObjectInfo(id));
}
return result;
}
If I wrote instead
var result = await Task.WhenAll(objectIds.Select(id => GetObjectInfo(id)));
wouldn't these tasks be started simultaneously?
In my case it would be better to run them in series.
Edit 1: Answering to the comment of Theodor Zoulias.
Of course, I forgot Async suffix in methods' names!
The method GetObjectInfoAsync makes http request to external service. Additionally, this service has restriction for requests' frequency, so I use following construction.
using (var throttler = new Throttler(clientId))
{
while (!throttler.IsCallAllowed(out var waitTime))
{
await Task.Delay(waitTime);
}
var response = await client.PerformHttpRequestAsync(request);
return response.Content.FromJson<TResponse>(serializerSettings);
}
Throttler knowns last request's time for each client.

You should take into account the following considerations when you are using whenAll
Quotes taken from MS documentation
If any of the supplied tasks completes in a faulted state, the returned task will also complete in a Faulted state, where its exceptions will contain the aggregation of the set of unwrapped exceptions from each of the supplied tasks.
If none of the supplied tasks faulted but at least one of them was canceled, the returned task will end in the Canceled state.
If none of the tasks faulted and none of the tasks were canceled, the resulting task will end in the RanToCompletion state.
If the supplied array/enumerable contains no tasks, the returned task will immediately transition to a RanToCompletion state before it's returned to the caller.
That means that while the items will run maybe simultaneously (you can't know how as it depends on the resources of the machine and you can't know the order), the above can't be avoided.
If you want to have specific error handling per case and a specific order, then the first solution is the way to go. It kind of beats the point of the async/await but not in it's entirety. Even with sequential execution of the async items, your thread at least will not go to sleep and can still be used until the awaitable is ready.

Waiting on a continuous UI background polling task

I am somewhat new to parallel programming C# (When I started my project I worked through the MSDN examples for TPL) and would appreciate some input on the following example code.
It is one of several background worker tasks. This specific task pushes status messages to a log.
var uiCts = new CancellationTokenSource();
var globalMsgQueue = new ConcurrentQueue<string>();
var backgroundUiTask = new Task(
() =>
{
while (!uiCts.IsCancellationRequested)
{
while (globalMsgQueue.Count > 0)
ConsumeMsgQueue();
Thread.Sleep(backgroundUiTimeOut);
}
},
uiCts.Token);
// Somewhere else entirely
backgroundUiTask.Start();
Task.WaitAll(backgroundUiTask);
I'm asking for professional input after reading several topics like Alternatives to using Thread.Sleep for waiting, Is it always bad to use Thread.Sleep()?, When to use Task.Delay, when to use Thread.Sleep?, Continuous polling using Tasks
Which prompts me to use Task.Delay instead of Thread.Sleep as a first step and introduce TaskCreationOptions.LongRunning.
But I wonder what other caveats I might be missing? Is polling the MsgQueue.Count a code smell? Would a better version rely on an event instead?

First of all, there's no reason to use Task.Start or use the Task constructor. Tasks aren't threads, they don't run themselves. They are a promise that something will complete in the future and may or may not produce any results. Some of them will run on a threadpool thread. Use Task.Run to create and run the task in a single step when you need to.
I assume the actual problem is how to create a buffered background worker. .NET already offers classes that can do this.
ActionBlock< T >
The ActionBlock class already implements this and a lot more - it allows you to specify how big the input buffer is, how many tasks will process incoming messages concurrently, supports cancellation and asynchronous completion.
A logging block could be as simple as this :
_logBlock=new ActionBlock<string>(msg=>File.AppendAllText("myLog.txt",msg));
The ActionBlock class itself takes care of buffering the inputs, feeding new messages to the worker function when it arrives, potentially blocking senders if the buffer gets full etc. There's no need for polling.
Other code can use Post or SendAsync to send messages to the block :
_block.Post("some message");
When we are done, we can tell the block to Complete() and await for it to process any remaining messages :
_block.Complete();
await _block.Completion;
Channels
A newer, lower-level option is to use Channels. You can think of channels as a kind of asynchronous queue, although they can be used to implement complex processing pipelines. If ActionBlock was written today, it would use Channels internally.
With channels, you need to provide the "worker" task yourself. There's no need for polling though, as the ChannelReader class allows you to read messages asynchronously or even use await foreach.
The writer method could look like this :
public ChannelWriter<string> LogIt(string path,CancellationToken token=default)
{
var channel=Channel.CreateUnbounded<string>();
var writer=channel.Writer;
_=Task.Run(async ()=>{
await foreach(var msg in channel.Reader.ReadAllAsync(token))
{
File.AppendAllText(path,msg);
}
},token).ContinueWith(t=>writer.TryComplete(t.Exception);
return writer;
}
....
_logWriter=LogIt(somePath);
Other code can send messages by using WriteAsync or TryWrite, eg :
_logWriter.TryWrite(someMessage);
When we're done, we can call Complete() or TryComplete() on the writer :
_logWriter.TryComplete();
The line
.ContinueWith(t=>writer.TryComplete(t.Exception);
is needed to ensure the channel is closed even if an exception occurs or the cancellation token is signaled.
This may seem too cumbersome at first. Channels allow us to easily run initialization code or carry state from one message to the next. We could open a stream before the loop starts and use it instead of reopening the file each time we call File.AppendAllText, eg :
public ChannelWriter<string> LogIt(string path,CancellationToken token=default)
{
var channel=Channel.CreateUnbounded<string>();
var writer=channel.Writer;
_=Task.Run(async ()=>{
//***** Can't do this with an ActionBlock ****
using(var writer=File.AppendText(somePath))
{
await foreach(var msg in channel.Reader.ReadAllAsync(token))
{
writer.WriteLine(msg);
//Or
//await writer.WriteLineAsync(msg);
}
}
},token).ContinueWith(t=>writer.TryComplete(t.Exception);
return writer;
}

Definitely Task.Delay is better than Thread.Sleep, because you will not be blocking the thread on the pool, and during the wait the thread on the pool will be available to handle other tasks. Then, you don't need to make your task long-running. Long-running tasks are run in a dedicated thread, and then Task.Delay is meaningless.
Instead, I will recommend a different approach. Just use System.Threading.Timer and make your life simple. Timers are kernel objects that will run their callback on the thread pool, and you will not have to worry about delay or sleep.

The TPL Dataflow library is the preferred tool for this kind of job. It allows building efficient producer-consumer pairs quite easily, and more complex pipelines as well, while offering a complete set of configuration options. In your case using a single ActionBlock should be enough.
A simpler solution you might consider is to use a BlockingCollection. It has the advantage of not requiring the installation of any package (because it is built-in), and it's also much easier to learn. You don't have to learn more than the methods Add, CompleteAdding, and GetConsumingEnumerable. It also supports cancellation. The drawback is that it's a blocking collection, so it blocks the consumer thread while waiting for new messages to arrive, and the producer thread while waiting for available space in the internal buffer (only if you specify a boundedCapacity in the constructor).
var uiCts = new CancellationTokenSource();
var globalMsgQueue = new BlockingCollection<string>();
var backgroundUiTask = new Task(() =>
{
foreach (var item in globalMsgQueue.GetConsumingEnumerable(uiCts.Token))
{
ConsumeMsgQueueItem(item);
}
}, uiCts.Token);
The BlockingCollection uses a ConcurrentQueue internally as a buffer.

Throttling with SemaphoreSlim -- "Task.Run()" vs "new Func<Task>()"

This might not be specific to SemaphoreSlim exclusively, but basically my question is about whether there is a difference between the below two methods of throttling a collection of long running tasks, and if so, what that difference is (and when if ever to use either).
In the example below, let's say that each tracked task involves loading data from a Url (totally made up example, but is a common one that I've found for SemaphoreSlim examples).
The main difference comes down to how the individual tasks are added to the list of tracked tasks. In the first example, we call Task.Run() with a lambda, whereas in the second, we new up a Func(<Task<Result>>()) with a lambda and then immediately call that func and add the result to the tracked task list.
Examples:
Using Task.Run():
SemaphoreSlim ss = new SemaphoreSlim(_concurrentTasks);
List<string> urls = ImportUrlsFromSource();
List<Task<Result>> trackedTasks = new List<Task<Result>>();
foreach (var item in urls)
{
await ss.WaitAsync().ConfigureAwait(false);
trackedTasks.Add(Task.Run(async () =>
{
try
{
return await ProcessUrl(item);
}
catch (Exception e)
{
_log.Error($"logging some stuff");
throw;
}
finally
{
ss.Release();
}
}));
}
var results = await Task.WhenAll(trackedTasks);
Using a new Func:
SemaphoreSlim ss = new SemaphoreSlim(_concurrentTasks);
List<string> urls = ImportUrlsFromSource();
List<Task<Result>> trackedTasks = new List<Task<Result>>();
foreach (var item in urls)
{
trackedTasks.Add(new Func<Task<Result>>(async () =>
{
await ss.WaitAsync().ConfigureAwait(false);
try
{
return await ProcessUrl(item);
}
catch (Exception e)
{
_log.Error($"logging some stuff");
throw;
}
finally
{
ss.Release();
}
})());
}
var results = await Task.WhenAll(trackedTasks);

There are two differences:
Task.Run does error handling
First off all, when you call the lambda, it runs. On the other hand, Task.Run would call it. This is relevant because Task.Run does a bit of work behind the scenes. The main work it does is handling a faulted task...
If you call a lambda, and the lambda throws, it would throw before you add the Task to the list...
However, in your case, because your lambda is async, the compiler would create the Task for it (you are not making it by hand), and it will correctly handle the exception and make it available via the returned Task. Therefore this point is moot.
Task.Run prevents task attachment
Task.Run sets DenyChildAttach. This means that the tasks created inside the Task.Run run independently from (are not synchronized with) the returned Task.
For example, this code:
List<Task<int>> trackedTasks = new List<Task<int>>();
var numbers = new int[]{0, 1, 2, 3, 4};
foreach (var item in numbers)
{
trackedTasks.Add(Task.Run(async () =>
{
var x = 0;
(new Func<Task<int>>(async () =>{x = item; return x;}))().Wait();
Console.WriteLine(x);
return x;
}));
}
var results = await Task.WhenAll(trackedTasks);
Will output the numbers from 0 to 4, in unknown order. However the following code:
List<Task<int>> trackedTasks = new List<Task<int>>();
var numbers = new int[]{0, 1, 2, 3, 4};
foreach (var item in numbers)
{
trackedTasks.Add(new Func<Task<int>>(async () =>
{
var x = 0;
(new Func<Task<int>>(async () =>{x = item; return x;}))().Wait();
Console.WriteLine(x);
return x;
})());
}
var results = await Task.WhenAll(trackedTasks);
Will output the numbers from 0 to 4, in order, every time. This is odd, right? What happens is that the inner task is attached to outer one, and executed right away in the same thread. But if you use Task.Run, the inner task is not attached and scheduled independently.
This remain true even if you use await, as long as the task you await does not go to an external system...
What happens with external system? Well, for example, if your task is reading from an URL - as in your example - the system would create a TaskCompletionSource, get the Task from it, set a response handler that writes the result to the TaskCompletionSource, make the request, and return the Task. This Task is not scheduled, it running on the same thread as a parent task makes no sense. And thus, it can break the order.
Since, you are using await to wait on an external system, this point is moot too.
Conclusion
I must conclude that these are equivalent.
If you want to be safe, and make sure it works as expected, even if - in a future version - some of the above points stops being moot, then keep Task.Run. On the other hand, if you really want to optimize, use the lambda and avoid the Task.Run (very small) overhead. However, that probably won't be a bottleneck.
Addendum
When I talk about a task that goes to an external system, I refer to something that runs outside of .NET. There a bit of code that will run in .NET to interface with the external system, but the bulk of the code will not run in .NET, and thus will not be in a managed thread at all.
The consumer of the API specify nothing for this to happen. The task would be a promise task, but that is not exposed, for the consumer there is nothing special about it.
In fact, a task that goes to an external system may barely run in the CPU at all. Futhermore, it might just be waiting on something exterior to the computer (it could be the network or user input).
The pattern is as follows:
The library creates a TaskCompletionSource.
The library sets a means to recieve a notification. It can be a callback, event, message loop, hook, listening to a socket, a pipe line, waiting on a global mutex... whatever is necesary.
The library sets code to react to the notification that will call SetResult, or SetException on the TaskCompletionSource as appropiate for the notification recieved.
The library does the actual call to the external system.
The library returns TaskCompletionSource.Task.
Note: with extra care of optimization not reordering things where it should not, and with care of handling errors during the setup phase. Also, if a CancellationToken is involved, it has to be taken into account (and call SetCancelled on the TaskCompletionSource when appropiate). Also, there could be tear down necesary in the reaction to the notification (or on cancellation). Ah, do not forget to validate your parameters.
Then the external system goes and does whatever it does. Then when it finishes, or something goes wrong, gives the library the notification, and your Task is sudendtly completed, faulted... (or if cancellation happened, your Task is now cancelled) and .NET will schedule the continuations of the task as needed.
Note: async/await uses continuations behind the scenes, that is how execution resumes.
Incidentally, if you wanted to implement SempahoreSlim yourself, you would have to do something very similar to what I describe above. You can see it in my backport of SemaphoreSlim.
Let us see a couple of examples of promise tasks...
Task.Delay: when we are waiting with Task.Delay, the CPU is not spinning. This is not running in a thread. In this case the notification mechanism will be an OS timer. When the OS sees that the time of the timer has elapsed, it will call into the CLR, and then the CLR will mark the task as completed. What thread was waiting? none.
FileStream.ReadSync: when we are reading from storage with FileStream.ReadSync the actual work is done by the device. The CRL has to declare a custom event, then pass the event, the file handle and the buffer to the OS... the OS calls the device driver, the device driver interfaces with the device. As the storage device recovers the information, it will write to memory (directly on the specified buffer) via DMA technology. And when it is done, it will set an interruption, that is handled by the driver, that notifies the OS, that calls the custom event, that marks the task as completed. What thread did read the data from storage? none.
A similar pattern will be used to download from a web page, except, this time the device goes to the network. How to make an HTTP request and how the system waits for a response is beyond the scope of this answer.
It is also possible that the external system is another program, in which case it would run on a thread. But it won't be a managed thread on your process.
Your take away is that these task do not run on any of your threads. And their timing might depend on external factors. Thus, it makes no sense to think of them as running in the same thread, or that we can predict their timing (well, except of course, in the case of the timer).

Both are not very good because they create the tasks immediately. The func version is a little less overhead since it saves the Task.Run route over the thread pool just to immediately end the thread pool work and suspend on the semaphore. You don't need an async Func, you could simplify this by using an async method (possibly a local function).
But you should not do this at all. Instead, use a helper method that implements a parallel async foreach.
public static Task ForEachAsync<T>(this IEnumerable<T> source, int dop, Func<T, Task> body)
{
return Task.WhenAll(
from partition in Partitioner.Create(source).GetPartitions(dop)
select Task.Run(async delegate {
using (partition)
while (partition.MoveNext())
await body(partition.Current);
}));
}
Then you just go urls.ForEachAsync(myDop, async input => await ProcessAsync(input));
Here, the tasks are created on demand. You can even make the input stream lazy.

Calling async methods from non-async code

I'm in the process of updating a library that has an API surface that was built in .NET 3.5. As a result, all methods are synchronous. I can't change the API (i.e., convert return values to Task) because that would require that all callers change. So I'm left with how to best call async methods in a synchronous way. This is in the context of ASP.NET 4, ASP.NET Core, and .NET/.NET Core console applications.
I may not have been clear enough - the situation is that I have existing code that is not async aware, and I want to use new libraries such as System.Net.Http and the AWS SDK that support only async methods. So I need to bridge the gap, and be able to have code that can be called synchronously but then can call async methods elsewhere.
I've done a lot of reading, and there are a number of times this has been asked and answered.
Calling async method from non async method
Synchronously waiting for an async operation, and why does Wait() freeze the program here
Calling an async method from a synchronous method
How would I run an async Task<T> method synchronously?
Calling async method synchronously
How to call asynchronous method from synchronous method in C#?
The problem is that most of the answers are different! The most common approach I've seen is use .Result, but this can deadlock. I've tried all the following, and they work, but I'm not sure which is the best approach to avoid deadlocks, have good performance, and plays nicely with the runtime (in terms of honoring task schedulers, task creation options, etc). Is there a definitive answer? What is the best approach?
private static T taskSyncRunner<T>(Func<Task<T>> task)
{
T result;
// approach 1
result = Task.Run(async () => await task()).ConfigureAwait(false).GetAwaiter().GetResult();
// approach 2
result = Task.Run(task).ConfigureAwait(false).GetAwaiter().GetResult();
// approach 3
result = task().ConfigureAwait(false).GetAwaiter().GetResult();
// approach 4
result = Task.Run(task).Result;
// approach 5
result = Task.Run(task).GetAwaiter().GetResult();
// approach 6
var t = task();
t.RunSynchronously();
result = t.Result;
// approach 7
var t1 = task();
Task.WaitAll(t1);
result = t1.Result;
// approach 8?
return result;
}

So I'm left with how to best call async methods in a synchronous way.
First, this is an OK thing to do. I'm stating this because it is common on Stack Overflow to point this out as a deed of the devil as a blanket statement without regard for the concrete case.
It is not required to be async all the way for correctness. Blocking on something async to make it sync has a performance cost that might matter or might be totally irrelevant. It depends on the concrete case.
Deadlocks come from two threads trying to enter the same single-threaded synchronization context at the same time. Any technique that avoids this reliably avoids deadlocks caused by blocking.
In your code snippet, all calls to .ConfigureAwait(false) are pointless because the return value is not awaited. ConfigureAwait returns a struct that, when awaited, exhibits the behavior that you requested. If that struct is simply dropped, it does nothing.
RunSynchronously is invalid to use because not all tasks can be processed that way. This method is meant for CPU-based tasks, and it can fail to work under certain circumstances.
.GetAwaiter().GetResult() is different from Result/Wait() in that it mimics the await exception propagation behavior. You need to decide if you want that or not. (So research what that behavior is; no need to repeat it here.) If your task contains a single exception then the await error behavior is usually convenient and has little downside. If there are multiple exceptions, for example from a failed Parallel loop where multiple tasks failed, then await will drop all exceptions but the first one. That makes debugging harder.
All these approaches have similar performance. They will allocate an OS event one way or another and block on it. That's the expensive part. The other machinery is rather cheap compared to that. I don't know which approach is absolutely cheapest.
In case an exception is being thrown, that is going to be the most expensive part. On .NET 5, exceptions are processed at a rate of at most 200,000 per second on a fast CPU. Deep stacks are slower, and the task machinery tends to rethrow exceptions multiplying their cost. There are ways of blocking on a task without the exception being rethrown, for example task.ContinueWith(_ => { }, TaskContinuationOptions.ExecuteSynchronously).Wait();.
I personally like the Task.Run(() => DoSomethingAsync()).Wait(); pattern because it avoids deadlocks categorically, it is simple and it does not hide some exceptions that GetResult() might hide. But you can use GetResult() as well with this.

I'm in the process of updating a library that has an API surface that was built in .NET 3.5. As a result, all methods are synchronous. I can't change the API (i.e., convert return values to Task) because that would require that all callers change. So I'm left with how to best call async methods in a synchronous way.
There is no universal "best" way to perform the sync-over-async anti-pattern. Only a variety of hacks that each have their own drawbacks.
What I recommend is that you keep the old synchronous APIs and then introduce asynchronous APIs alongside them. You can do this using the "boolean argument hack" as described in my MSDN article on Brownfield Async.
First, a brief explanation of the problems with each approach in your example:
ConfigureAwait only makes sense when there is an await; otherwise, it does nothing.
Result will wrap exceptions in an AggregateException; if you must block, use GetAwaiter().GetResult() instead.
Task.Run will execute its code on a thread pool thread (obviously). This is fine only if the code can run on a thread pool thread.
RunSynchronously is an advanced API used in extremely rare situations when doing dynamic task-based parallelism. You're not in that scenario at all.
Task.WaitAll with a single task is the same as just Wait().
async () => await x is just a less-efficient way of saying () => x.
Blocking on a task started from the current thread can cause deadlocks.
Here's the breakdown:
// Problems (1), (3), (6)
result = Task.Run(async () => await task()).ConfigureAwait(false).GetAwaiter().GetResult();
// Problems (1), (3)
result = Task.Run(task).ConfigureAwait(false).GetAwaiter().GetResult();
// Problems (1), (7)
result = task().ConfigureAwait(false).GetAwaiter().GetResult();
// Problems (2), (3)
result = Task.Run(task).Result;
// Problems (3)
result = Task.Run(task).GetAwaiter().GetResult();
// Problems (2), (4)
var t = task();
t.RunSynchronously();
result = t.Result;
// Problems (2), (5)
var t1 = task();
Task.WaitAll(t1);
result = t1.Result;
Instead of any of these approaches, since you have existing, working synchronous code, you should use it alongside the newer naturally-asynchronous code. For example, if your existing code used WebClient:
public string Get()
{
using (var client = new WebClient())
return client.DownloadString(...);
}
and you want to add an async API, then I would do it like this:
private async Task<string> GetCoreAsync(bool sync)
{
using (var client = new WebClient())
{
return sync ?
client.DownloadString(...) :
await client.DownloadStringTaskAsync(...);
}
}
public string Get() => GetCoreAsync(sync: true).GetAwaiter().GetResult();
public Task<string> GetAsync() => GetCoreAsync(sync: false);
or, if you must use HttpClient for some reason:
private string GetCoreSync()
{
using (var client = new WebClient())
return client.DownloadString(...);
}
private static HttpClient HttpClient { get; } = ...;
private async Task<string> GetCoreAsync(bool sync)
{
return sync ?
GetCoreSync() :
await HttpClient.GetString(...);
}
public string Get() => GetCoreAsync(sync: true).GetAwaiter().GetResult();
public Task<string> GetAsync() => GetCoreAsync(sync: false);
With this approach, your logic would go into the Core methods, which may be run synchronously or asynchronously (as determined by the sync parameter). If sync is true, then the core methods must return an already-completed task. For implemenation, use synchronous APIs to run synchronously, and use asynchronous APIs to run asynchronously.
Eventually, I recommend deprecating the synchronous APIs.

I just went thru this very thing with the AWS S3 SDK. Used to be sync, and I built a bunch of code on that, but now it's async. And that's fine: they changed it, nothing to be gained by moaning about it, move on.So I need to update my app, and my options are to either refactor a large part of my app to be async, or to "hack" the S3 async API to behave like sync.I'll eventually get around to the larger async refactoring - there are many benefits - but for today I have bigger fish to fry so I chose to fake the sync.Original sync code was ListObjectsResponse response = api.ListObjects(request); and a really simple async equivalent that works for me is Task<ListObjectsV2Response> task = api.ListObjectsV2Async(rq2);ListObjectsV2Response rsp2 = task.GetAwaiter().GetResult();
While I get it that purists might pillory me for this, the reality is that this is just one of many pressing issues and I have finite time so I need to make tradeoffs. Perfect? No. Works? Yes.

You Can Call Async Method From non-async method .Check below Code .
public ActionResult Test()
{
TestClass result = Task.Run(async () => await GetNumbers()).GetAwaiter().GetResult();
return PartialView(result);
}
public async Task<TestClass> GetNumbers()
{
TestClass obj = new TestClass();
HttpResponseMessage response = await APICallHelper.GetData(Functions.API_Call_Url.GetCommonNumbers);
if (response.IsSuccessStatusCode)
{
var result = response.Content.ReadAsStringAsync().Result;
obj = JsonConvert.DeserializeObject<TestClass>(result);
}
return obj;
}

Most efficient configuration of C# Tasks and Continuations when calling a web service?

I am creating a network client that talks to a real-time web API.
The client must make many different calls per second and feed a Task<TResult> back to each client component so the client can decide whether to block:
public Task<TResult> Execute<TResult>(IOperation<TResult> operation);
The process of making an API call runs like this:
Serialize the (small, less than 1KB) request to Json
Send the request using an HttpClient
On success, deserialize to TResult (Json could be a few hundred KB in size, but usually much smaller) and return
In my tests, choosing where to include each step within a Task workflow (and therefore on which thread) has a significant impact on the performance.
The fastest setup I have found so far is this (semi-pseudo-code, omitted generic type params for brevity):
// serialize on main thread
var requestString = JsonConvert.SerializeObject(request);
// create message - omitted
var post = Task.Factory.StartNew(() => this.client.SendAsync(requestMessage)).Unwrap();
return post.ContinueWith(response =>
{
var jsonString = response.Result.Content.ReadAsStringAsync();
return JsonConvert.DeserializeObject(jsonString.Result);
});
The slowest is this setup where the whole process is executed within a single task:
return Task.Factory.StartNew((request) =>
{
var requestString = JsonConvert.SerializeObject(request);
// create message - omitted
var post = client.SendAsync(requestMessage);
var jsonString = post.Result.Content.ReadAsStringAsync();
return JsonConvert.DeserializeObject(jsonString.Result);
})
I would have thought the last method could be quickest because you create one background thread per request. My assumption is that this doesn't allow the TPL to make the most efficient use of the available threads because of the blocking call.
So, is there a general rule around what should go in a task and what should sit outside it, or in a continuation?
In this specific case, are there any further optimisations I could try?

You shouldn't need to use Task.Factory.StartNew at all, as SendAsync returns a Task already:
var post = this.client.SendAsync(requestMessage);
return post.ContinueWith(response =>
{
var jsonString = response.Result.Content.ReadAsStringAsync();
return JsonConvert.DeserializeObject(jsonString.Result);
});
This will actually be more efficient, as it doesn't require a ThreadPool thread at all.
Note that you could further optimize this with async/await (to keep the response asynchronous):
var response = await this.client.SendAsync(requestMessage);
var jsonString = await response.Content.ReadAsStringAsync();
return JsonConvert.DeserializeObject(jsonString);
This could be written via TPL continuations, as well, but this would require returning the (unwrapped) Task from ReadAsStringAsync, then posting a new continuation on it in order to get the final string.

So, first off, there's no reason to use StartNew when sending the initial request. You already have a task, it's needless overhead to start the task in a background thread.
So this:
var post = Task.Factory.StartNew(() => this.client.SendAsync(requestMessage)).Unwrap();
Can become:
var post = this.client.SendAsync(requestMessage);
Next, in both cases, you're blocking on the results of the ReadAsStringAsync method, rather than processing it asynchronously. This is chewing up another thread pool thread just sitting there doing nothing.
Instead do something like:
return post.ContinueWith(response =>
{
return response.Result.Content.ReadAsStringAsync()
.ContinueWith(t => JsonConvert.DeserializeObject(t.Result));
}).UnWrap();

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Parsing multiple very large JSON files concurrently - c#

Related

Select with asynchronous method

Waiting on a continuous UI background polling task

Throttling with SemaphoreSlim -- "Task.Run()" vs "new Func<Task>()"

Calling async methods from non-async code

Most efficient configuration of C# Tasks and Continuations when calling a web service?

Categories

Resources