I am reading data from Azure tables (around 5k tables), collecting different metrics, and saving them back to some other Azure tables, all asynchronously. The problem I am facing is that when there is a huge amount of data, which can happen occasionally, the application starts hanging. The same code works fine with less data. The steps I am doing (all of them asynchronous, using Rx, async and await) are:
Read all the table names from Azure
Read all the tables' previous metric data (1 & 2 run in parallel - Task.WhenAll)
Get data from each table, process and save (Task.WhenAll)
What I want is to use asynchrony only as far as it doesn't make my application hang. If there is more data than can be handled, it should not read any more tables' data and should instead focus on completing the processing of the data it already has.
Does Parallel.ForEach take care of that?
The code (edited as per Stephen Cleary's suggestion) is still not working for all the tables, whereas it works for 500 tables:
I think it is the amount of data, rather than the number of threads, that brings the app (a console app) to a standstill. (One thread may end up retrieving a million rows in chunks of a thousand; each chunk is passed to a method, its count is added to a dictionary, and it can then be garbage collected when more memory is needed.) Or is it the way I have implemented SemaphoreSlim that is wrong?
public async Task CalculateMetricsForAllTablesAsync()
{
    var allWizardTableNamesTask = GetAllWizardTableNamesAsync();
    var allTablesNamesWithLastRunTimeTask = GetAllTableNamesWithLastRunTimeAsync();
    await Task.WhenAll(allWizardTableNamesTask, allTablesNamesWithLastRunTimeTask).ConfigureAwait(false);

    var allWizardTableNames = allWizardTableNamesTask.Result;
    var allTablesNamesWithLastRunTime = allTablesNamesWithLastRunTimeTask.Result;

    var throttler = new SemaphoreSlim(10);
    var concurrentTableProcessingTasks = new ConcurrentStack<Task>();
    foreach (var tname in allWizardTableNames)
    {
        await throttler.WaitAsync();
        try
        {
            concurrentTableProcessingTasks.Push(ProcessTableDataAsync(tname, getTableNameWithLastRunTime(tname)));
        }
        finally
        {
            throttler.Release();
        }
    }
    await Task.WhenAll(concurrentTableProcessingTasks).ConfigureAwait(false);
}

private async Task ProcessTableDataAsync(string tableName, Tuple<string, string> matchingTable)
{
    var tableDataRetrieved = new TaskCompletionSource<bool>();
    var metricCountsForEachDay = new ConcurrentDictionary<string, Tuple<int, int>>();
    _fromATS.GetTableDataAsync<DynamicTableEntity>(tableName, GetFilter(matchingTable))
        .Subscribe(entities => ProcessWizardDataChunk(metricCountsForEachDay, entities), () => tableDataRetrieved.TrySetResult(true));
    await tableDataRetrieved.Task;
    await SaveMetricDataAsync(tableName, metricCountsForEachDay).ConfigureAwait(false);
}
Since your async code is wrapping Rx, I'd recommend throttling at the async level. You can do this by defining a SemaphoreSlim and wrapping your method logic within a WaitAsync/Release.
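For example, here is a minimal sketch of that pattern applied to your loop. Note the semaphore must stay held until ProcessTableDataAsync completes; releasing it right after starting the task, as your code above does, throttles nothing:

var throttler = new SemaphoreSlim(10);

// Start one task per table, but let at most 10 run at a time; Release
// only happens after ProcessTableDataAsync finishes, which is what
// actually bounds the amount of in-flight work.
var tasks = allWizardTableNames.Select(async tname =>
{
    await throttler.WaitAsync().ConfigureAwait(false);
    try
    {
        await ProcessTableDataAsync(tname, getTableNameWithLastRunTime(tname)).ConfigureAwait(false);
    }
    finally
    {
        throttler.Release();
    }
});

await Task.WhenAll(tasks).ConfigureAwait(false);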
Alternatively, consider TPL Dataflow. Dataflow has built-in options for throttling (MaxDegreeOfParallelism), and also interoperates naturally with async and Rx.
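As a rough sketch of the Dataflow route (reusing the same ProcessTableDataAsync), an ActionBlock gives you both the concurrency cap and, via BoundedCapacity, your "stop reading more tables when overloaded" behaviour:

// Processes at most 10 tables at once; SendAsync waits when the block
// is full instead of queueing unbounded work.
var block = new ActionBlock<string>(
    tname => ProcessTableDataAsync(tname, getTableNameWithLastRunTime(tname)),
    new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = 10,
        BoundedCapacity = 10
    });

foreach (var tname in allWizardTableNames)
    await block.SendAsync(tname);

block.Complete();
await block.Completion;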
I have a .NET 6 BackgroundService which pushes data from on-premises to a 3rd-party API.
The 3rd party API takes about 500 milliseconds to process the API call.
The problem is that I have about 1,000,000 rows of data to push to this API one at a time. At 1/2 second per row, it's going to take about 6 days to sync up.
So, I would like to try to spawn multiple threads in order to hit the API simultaneously with 10 threads.
var startTime = DateTimeOffset.Now;
var batchSize = _config.GetValue<int>("BatchSize");
using (var scope = _serviceScopeFactory.CreateScope())
{
    var context = scope.ServiceProvider.GetRequiredService<PlankContext>();
    var dncEntries = await context.PlankQueueDnc.Where(x => x.ToProcessFlag == true).Take(batchSize).ToListAsync();
    foreach (var plankQueueDnc in dncEntries)
    {
        var response = await _plankConnector.InsertDncAsync(plankQueueDnc);
        context.PlankQueueDnc.Update(plankQueueDnc);
    }
    await context.SaveChangesAsync();
}
Here is the code. As you can see, it gets a batch of 100 records and then processes them one by one. Is there a way to modify this so this line is not awaited? I don't quite understand how it would work if it were not awaited. Would it create a thread for each execution in the loop?
var response = await _plankConnector.InsertDncAsync(plankQueueDnc);
I am clearly not as up to speed on threads as the esteemed @StephenCleary.
So suggestions would be appreciated.
In .NET 6 you can use Parallel.ForEachAsync to execute operations concurrently, using either all available cores or a limited Degree-Of-Parallelism.
The following code loads all records, executes the posts concurrently, then updates the records:
using (var scope = _serviceScopeFactory.CreateScope())
{
    var context = scope.ServiceProvider.GetRequiredService<PlankContext>();
    var dncEntries = await context.PlankQueueDnc
                                  .Where(x => x.ToProcessFlag == true)
                                  .Take(batchSize)
                                  .ToListAsync();

    await Parallel.ForEachAsync(dncEntries, async (plankQueueDnc, ct) =>
    {
        var response = await _plankConnector.InsertDncAsync(plankQueueDnc);
        plankQueueDnc.Whatever = response.Something;
    });

    await context.SaveChangesAsync();
}
There's no reason to call Update: a DbContext tracks the objects it loaded and knows which ones were modified. SaveChangesAsync will persist all changes in a single transaction.
DOP and Throttling
By default, Parallel.ForEachAsync will execute as many tasks concurrently as there are cores. This may be too few or too many for HTTP calls. On the one hand, the client machine isn't using its CPU at all while waiting for the remote service. On the other hand, the remote service itself may not like, or even allow, too many concurrent calls, and may impose throttling.
The ParallelOptions class can be used to specify the degree of parallelism. If the API allows it, we could execute e.g. 20 concurrent calls:
var options = new ParallelOptions { MaxDegreeOfParallelism = 20 };
await Parallel.ForEachAsync(dncEntries, options, async (plankQueueDnc, ct) => { ... });
Many services impose a rate limit on how many requests can be made in a period of time. A (somewhat naive) way of implementing this is to add a small delay in the task worker code:
var delay = 100;
await Parallel.ForEachAsync(dncEntries, options, async (plankQueueDnc, ct) =>
{
    ...
    await Task.Delay(delay);
});
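If the service enforces a hard requests-per-second limit, a sturdier option than a fixed delay is the System.Threading.RateLimiting package (built into .NET 7+, available on NuGet for .NET 6). A sketch, assuming a limit of 50 requests per second:

using System.Threading.RateLimiting;

// Token bucket shared by all workers: refills 50 tokens every second.
var limiter = new TokenBucketRateLimiter(new TokenBucketRateLimiterOptions
{
    TokenLimit = 50,
    TokensPerPeriod = 50,
    ReplenishmentPeriod = TimeSpan.FromSeconds(1),
    QueueLimit = int.MaxValue // callers wait rather than fail
});

await Parallel.ForEachAsync(dncEntries, options, async (plankQueueDnc, ct) =>
{
    using var lease = await limiter.AcquireAsync(1, ct); // waits for a token
    var response = await _plankConnector.InsertDncAsync(plankQueueDnc);
});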
We have a database with around 400k elements we need to compute. Below is a sample of an orchestrator function.
[FunctionName("Crawl")]
public static async Task<List<string>> RunOrchestrator(
[OrchestrationTrigger] DurableOrchestrationContext context)
{
if (!context.IsReplaying)
{
}
WriteLine("In orchistration");
var outputs = new List<string>();
var tasks = new Task<string>[3];
var retryOptions = new RetryOptions(
firstRetryInterval: TimeSpan.FromSeconds(60),
maxNumberOfAttempts: 3);
// Replace "hello" with the name of your Durable Activity Function.
tasks[0] = context.CallActivityWithRetryAsync<string>("Crawl_Hello",retryOptions, "Tokyo");
tasks[1] = context.CallActivityWithRetryAsync<string>("Crawl_Hello", retryOptions, "Seattle");
tasks[2] = context.CallActivityWithRetryAsync<string>("Crawl_Hello",retryOptions, "London");
await Task.WhenAll(tasks);
return outputs;
}
Every time an activity is called, the orchestrator function is replayed. But I don't want to fetch 400k items from the database each time an activity is called. Would I just add all the activity code inside the if statement, or what is the right approach here? I can't see that working with the Task.WhenAll call.
It looks like you've figured out the approach for this, as you've mentioned in your other query, but I'm elaborating on it here for the benefit of others.
Ideally, you should have an activity function that first fetches all the data you need, then batch it up and call another activity function to process each batch.
Since you have a large number of elements to compute, it's best to split the compute into separate sub-orchestrations, because the fan-in operation is performed on a single instance.
For further reading, there are some documented performance targets that could help when deploying durable functions.
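A rough sketch of that shape (the "GetAllElementIds" and "ProcessBatch" function names and the batch size of 1000 are hypothetical):

[FunctionName("CrawlCoordinator")]
public static async Task RunCoordinator(
    [OrchestrationTrigger] DurableOrchestrationContext context)
{
    // Fetch the ids once inside an activity; the result is checkpointed
    // in the orchestration history, so it is not re-queried on replay.
    var ids = await context.CallActivityAsync<List<string>>("GetAllElementIds", null);

    // Fan out batches to sub-orchestrations so that no single instance
    // has to fan-in all 400k results.
    var subOrchestrations = new List<Task>();
    for (var i = 0; i < ids.Count; i += 1000)
    {
        var batch = ids.Skip(i).Take(1000).ToList();
        subOrchestrations.Add(context.CallSubOrchestratorAsync("ProcessBatch", batch));
    }
    await Task.WhenAll(subOrchestrations);
}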
I recently started programming in C# (after having some experience with PHP and JavaScript), and I built a simple console program that downloads a JSON string and stores certain values in a database. The data in question is approx. 70,000 sets (converted into rows in my database). Due to a limitation on the server where I download this JSON from (Quandl), it was recommended to download 100 datasets per request, so I have 700 requests to make.
With every request, I download the JSON string, deserialize it and loop through it a 100 times to store the respective values in the database. I am using WebClient to make the request and I utilize JSON.net for the deserialization.
Currently, with the setup I have, each request takes approx. 7 seconds, and including inserting the data into the database, the whole run takes about one and a half hours to finish.
The question then becomes: is there any way to speed this up with async/await? Everything I read is more about the UI side of things (i.e. the UI is not frozen while a request is processed), but I was wondering if it is possible to start the requests simultaneously (or, say, 10 at a time). For completeness, I have added a sanitized version of my code (made it a bit shorter, but no logic has been removed).
https://dotnetfiddle.net/S0fnBc
async/await are for asynchronous operations. Asynchronous execution does not equal parallel execution. Asynchronous execution does not block the caller, and parallel execution allows for concurrent execution. You need parallel execution. To do this, you can use the Task Parallel Library. There is also a patterns and practices book that is a great read. Here is a simplified implementation:
var httpClient = new HttpClient();
httpClient.BaseAddress = new Uri("/path/to/data");

var tasks = new Task<HttpResponseMessage>[5];
for (var i = 0; i < tasks.Length; i++)
{
    tasks[i] = httpClient.GetAsync("?updatedFilterParams");
}

await Task.WhenAll(tasks); // wait for them all to complete

foreach (var task in tasks)
{
    var response = await task; // already completed; this just unwraps the result
    var data = await response.Content.ReadAsStringAsync();
    // do something
}
Some things to note: WebClient is not capable of concurrent requests, so you'll either have to new up another one for every request or use HttpClient as I have. Also, there are multiple things in between your code and the data that can, and often do, impose limits on concurrent requests to the same origin, so you'll want to throttle how many requests you fire off at a time.
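A sketch of that throttling, using a SemaphoreSlim (requestUrls here is a hypothetical list of the 700 request URLs, and 10 is an arbitrary cap):

var throttler = new SemaphoreSlim(10); // at most 10 requests in flight
var tasks = requestUrls.Select(async url =>
{
    await throttler.WaitAsync();
    try
    {
        return await httpClient.GetStringAsync(url);
    }
    finally
    {
        throttler.Release();
    }
});
var responses = await Task.WhenAll(tasks);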
I'm using the .NET API available from parse.com,
https://parse.com/docs/dotnet_guide#objects-saving
A snippet of my code looks like this;
public async void UploadCurrentXML()
{
    ...
    var query = ParseObject.GetQuery("RANDOM_TABLE").WhereEqualTo("some_field", "string");
    var count = await query.CountAsync();

    ParseObject temp_A;
    temp_A = await query.FirstAsync();
    ...
    // do lots of stuff
    ...
    await temp_A.SaveAsync();
}
To summarize: a query is made to a remote database. From the result, a specific object (or its reference) is obtained. Multiple operations are performed on the object, and in the end it is saved back into the database.
All the database operations happen via await ParseObject.randomfunction(). Is it possible to call these functions in a synchronous manner? Or at least wait until the operation returns before moving on? The application is designed for maintenance purposes, and time of operation is NOT an issue.
I'm asking this because as things stand, I get an error which states
The number of count operations in progress has reached its limit.
I've tried,
var count = await query.CountAsync().ConfigureAwait(false);
in all the await calls, but it doesn't help - the code is still running asynchronously.
var count = query.CountAsync().Result;
causes the application to get stuck - fairly certain that I've hit a deadlock.
A bit of searching led me to this question,
How would I run an async Task<T> method synchronously?
But I don't understand how it could apply to my case, since I do not have access to the source of ParseObject. Help? (Am using .NET 4.5)
I recommend that you use asynchronous programming throughout. If you're running into some kind of resource issue (i.e., multiple queries on a single db not allowed), then you should structure your code so that cannot happen (e.g., disabling UI buttons while operations are in flight). Or, if you must, you can use SemaphoreSlim to throttle your async code:
private readonly SemaphoreSlim _mutex = new SemaphoreSlim(1);

public async Task UploadCurrentXMLAsync()
{
    await _mutex.WaitAsync();
    try
    {
        ...
        var query = ParseObject.GetQuery("RANDOM_TABLE").WhereEqualTo("some_field", "string");
        var count = await query.CountAsync();

        ParseObject temp_A;
        temp_A = await query.FirstAsync();
        ...
        // do lots of stuff
        ...
        await temp_A.SaveAsync();
    }
    finally
    {
        _mutex.Release();
    }
}
But if you really, really want to synchronously block, you can do it like this:
// Given: public async Task UploadCurrentXMLAsync()
Task.Run(() => UploadCurrentXMLAsync()).Wait();
Again, I just can't recommend this last "solution", which is more of a hack than a proper solution.
If the API method returns a Task, you can get its awaiter and block for the result synchronously:
api.DoWorkAsync().GetAwaiter().GetResult();
I am working on a simple server that exposes webservices to clients. Some of the requests may take a long time to complete, and are logically broken into multiple steps. For such requests, it is required to report progress during execution. In addition, a new request may be initiated before a previous one completes, and it is required that both execute concurrently (barring some system-specific limitations).
I was thinking of having the server return a TaskId to its clients, and having the clients track the progress of the requests using the TaskId. I think this is a good approach, and I am left with the issue of how tasks are managed.
Never having used the TPL, I was thinking it would be a good way to approach this problem. Indeed, it allows me to run multiple tasks concurrently without having to manually manage threads. I can even create multi-step tasks relatively easily using ContinueWith.
I can't come up with a good way of tracking a task's progress, though. I realize that when my requests consist of a single "step", then the step has to cooperatively report its state. This is something I would prefer to avoid at this point. However, when a request consists of multiple steps, I would like to know which step is currently executing and report progress accordingly. The only way I could come up with is extremely tiresome:
Task<int> firstTask = new Task<int>(() => { DoFirstStep(); return 3; });
firstTask.
    ContinueWith<int>(task => { UpdateProgress("50%"); return task.Result; }).
    ContinueWith<string>(task => { DoSecondStep(task.Result); return "blah"; }).
    ContinueWith<string>(task => { UpdateProgress("100%"); return task.Result; });
And even this is not perfect since I would like the Task to store its own progress, instead of having UpdateProgress update some known location. Plus it has the obvious downside of having to change a lot of places when adding a new step (since now the progress is 33%, 66%, 100% instead of 50%, 100%).
Does anyone have a good solution?
Thanks!
This isn't really a scenario that the Task Parallel Library fully supports.
You might consider an approach where you feed progress updates to a queue and read them on another Task:
static void Main(string[] args)
{
    Example();
}

static BlockingCollection<Tuple<int, int, string>> _progressMessages =
    new BlockingCollection<Tuple<int, int, string>>();

public static void Example()
{
    List<Task<int>> tasks = new List<Task<int>>();
    for (int i = 0; i < 10; i++)
        tasks.Add(Task.Factory.StartNew((object state) =>
        {
            int id = (int)state;
            DoFirstStep(id);
            _progressMessages.Add(new Tuple<int, int, string>(
                id, 1, "10.0%"));
            DoSecondStep(id);
            _progressMessages.Add(new Tuple<int, int, string>(
                id, 2, "50.0%"));
            // ...
            return 1;
        },
        (object)i
        ));

    Task logger = Task.Factory.StartNew(() =>
    {
        foreach (var m in _progressMessages.GetConsumingEnumerable())
            Console.WriteLine("Task {0}: Step {1}, progress {2}.",
                m.Item1, m.Item2, m.Item3);
    });

    Task.WaitAll(tasks.ToArray());
    // Signal the logger that no more messages will arrive; without this,
    // GetConsumingEnumerable never finishes and waiting on the logger
    // would block forever.
    _progressMessages.CompleteAdding();
    logger.Wait();

    Console.ReadLine();
}

private static void DoFirstStep(int id)
{
    Console.WriteLine("{0}: First step", id);
}

private static void DoSecondStep(int id)
{
    Console.WriteLine("{0}: Second step", id);
}
This sample doesn't show cancellation or error handling, and doesn't account for your requirement that your tasks may be long running. Long running tasks place special requirements on the scheduler. More discussion of this can be found at http://parallelpatterns.codeplex.com/; download the book draft and look at Chapter 3.
This is simply an approach for using the Task Parallel Library in a scenario like this. The TPL may well not be the best approach here.
If your web services are running inside ASP.NET (or a similar web application server) then you should also consider the likely impact of using threads from the thread pool to execute tasks, rather than service web requests:
How does Task Parallel Library scale on a terminal server or in a web application?
I don't think the solution you are looking for will involve the Task API. Or at least, not directly. It doesn't support the notion of percentage complete, and the Task/ContinueWith functions would need to participate in that logic because that data is only available at that level (only the final invocation of ContinueWith is in any position to know the percentage complete, and even then, computing it algorithmically would be a guess at best, because it certainly doesn't know whether one task is going to take a lot longer than the other). I suggest you create your own API to do this, possibly leveraging the Task API to do the actual work.
This might help: http://blog.stephencleary.com/2010/06/reporting-progress-from-tasks.html. In addition to reporting progress, this solution also enables updating form controls without getting the "Cross-thread operation not valid" exception.
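For reference, later framework versions made this pattern first-class. Here is a minimal sketch using IProgress<T>/Progress<T> (introduced in .NET 4.5; DoFirstStep/DoSecondStep stand in for the steps from the question). Progress<T> captures the SynchronizationContext it was created on and marshals each Report call back to it, so UI updates are safe:

public static Task RunStepsAsync(IProgress<int> progress)
{
    return Task.Run(() =>
    {
        DoFirstStep();
        progress.Report(50);   // 50% after the first step
        DoSecondStep();
        progress.Report(100);  // 100% after the second step
    });
}

// Usage from a UI thread; the lambda runs back on the UI thread:
// await RunStepsAsync(new Progress<int>(pct => progressBar.Value = pct));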