I haven't coded for a year now and I need some help with LINQPad. I need to run multiple search queries to get data from my database.
Let's say I need to run:
void DeepSearch(string input)
{
Orders.Where(y => y.OrderReference.Contains(input)).Dump();
Invoice.Where(y => y.InvoiceReference.Contains(input)).Dump();
Clients.Where(y => y.Name.Contains(input)).Dump();
}
To speed up performance I would like to launch these 3 queries together and dump each one as soon as I get its result. I don't care about the order.
Can I simply add async?
async void DeepSearch(string input)
{
Orders.Where(y => y.OrderReference.Contains(input)).Dump();
Invoice.Where(y => y.InvoiceReference.Contains(input)).Dump();
Clients.Where(y => y.Name.Contains(input)).Dump();
}
No, you cannot; "async" is not the same as "parallel", and even if it were, data sources need to be designed (separately) for async and/or parallel use. I'm guessing that Orders, Invoices and Clients here are LINQ providers over a shared database connection (probably using EF), in which case: database connections in ADO.NET aren't designed for concurrency/parallelism. Additionally, LINQ doesn't directly expose async operations, but EF does make some concessions to async execution via methods like Microsoft.EntityFrameworkCore.EntityFrameworkQueryableExtensions.ToListAsync() - which just means: add using Microsoft.EntityFrameworkCore; to the top of the file, and use await source.Where(...).ToListAsync().
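In your LINQPad script that might look roughly like this (a sketch only; it assumes the EF Core driver so that ToListAsync() is available, and note that the three queries still run one after another):
async Task DeepSearchAsync(string input)
{
    // Each query is awaited in turn, so they still execute sequentially over the
    // shared connection - the thread is simply released while each one is in flight.
    (await Orders.Where(y => y.OrderReference.Contains(input)).ToListAsync()).Dump();
    (await Invoice.Where(y => y.InvoiceReference.Contains(input)).ToListAsync()).Dump();
    (await Clients.Where(y => y.Name.Contains(input)).ToListAsync()).Dump();
}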
However:
going async without going parallel doesn't improve the elapsed time you directly observe; it just means that you're not tying up a thread (a limited resource) while waiting on the DB, allowing that thread to be released to go and do more interesting things than waiting
going concurrent is complicated, and requires either separate isolated connections (etc), or connections that are designed for concurrency
You can certainly run queries in parallel if you instantiate a separate data context per parallel operation. (This is required because data contexts are not threadsafe, whether LINQ-to-SQL or EF Core).
LINQPad's automatically generated data context is called TypedDataContext, so your code in LINQPad will look like this:
Task.Run(() => new TypedDataContext().Orders.Where(y => y.OrderReference.Contains(input)).Dump());
Task.Run(() => new TypedDataContext().Invoices.Where(y => y.InvoiceReference.Contains(input)).Dump());
Task.Run(() => new TypedDataContext().Clients.Where(y => y.Name.Contains(input)).Dump());
If you assign the tasks to variables, you can await them (or use Task.WhenAll to await them all).
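For example (a sketch following the same pattern as above; as before, each task gets its own TypedDataContext):
var t1 = Task.Run(() => new TypedDataContext().Orders.Where(y => y.OrderReference.Contains(input)).Dump());
var t2 = Task.Run(() => new TypedDataContext().Invoices.Where(y => y.InvoiceReference.Contains(input)).Dump());
var t3 = Task.Run(() => new TypedDataContext().Clients.Where(y => y.Name.Contains(input)).Dump());
await Task.WhenAll(t1, t2, t3); // completes once all three result sets have been dumped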
Depending on whether you're using LINQ-to-SQL or EF Core, there's also a cost associated with instantiating data contexts, so you might want to re-use the same TypedDataContext for subsequent operations on the same thread.
I'm trying to figure out what the best approach is, apart from synchronous programming, for doing some EF6 queries that retrieve data. I'll post all 5 approaches here (these take place in a controller action):
// would it be better not to "async" the ActionResult?
public async Task<ActionResult> Index() {
// I depend on this so I don't even know if it's ok to make it async or not -> what do you think?
var userinfo = _dataservice.getUserInfo("John");
// C1: synchronous way
var watch1 = System.Diagnostics.Stopwatch.StartNew();
var info1 = _getInfoService.GetSomeInfo1(userinfo);
var info2 = _getInfoService.GetSomeInfo2(userinfo);
watch1.Stop();
var t1 = watch1.ElapsedMilliseconds; // this takes about 3200
// C2: asynchronous way
var watch2 = System.Diagnostics.Stopwatch.StartNew();
var infoA1 = await _getInfoService.GetSomeInfoAsync1(userinfo).ConfigureAwait(false);
var infoA2 = await _getInfoService.GetSomeInfoAsync2(userinfo).ConfigureAwait(false);
watch2.Stop();
var t2 = watch2.ElapsedMilliseconds; // this takes about 3020
// C2.1: asynchronous way launch then await
var watch21 = System.Diagnostics.Stopwatch.StartNew();
var infoA21 = _getInfoService.GetSomeInfoAsync1(userinfo).ConfigureAwait(false);
var infoA22 = _getInfoService.GetSomeInfoAsync2(userinfo).ConfigureAwait(false);
// I thought that if I launched them first and then awaited, it would run faster... but no
var a = await infoA21;
var b = await infoA22;
watch21.Stop();
var t21 = watch21.ElapsedMilliseconds; // this takes about the same, ~3020
// C3: asynchronous with Task.Run() and await Task.WhenAll()
var watch3 = System.Diagnostics.Stopwatch.StartNew();
var infoT1 = Task.Run(() => _getInfoService.GetSomeInfo1(userinfo));
var infoT2 = Task.Run(() => _getInfoService.GetSomeInfo2(userinfo));
await Task.WhenAll(infoT1, infoT2);
watch3.Stop();
var t3 = watch3.ElapsedMilliseconds; // this takes about 2010
// C4: Parallel way
MyType1 var1; MyType2 var2;
var watch4 = System.Diagnostics.Stopwatch.StartNew();
Parallel.Invoke(
() => var1 = _getInfoService.GetSomeInfoAsync1(userinfo).GetAwaiter().GetResult(), // also tried just _getInfoService.GetSomeInfo1(userinfo) - but that sometimes throws an Entity Framework error when F10 debugging
() => var2 = _getInfoService.GetSomeInfoAsync2(userinfo).GetAwaiter().GetResult() // also tried just _getInfoService.GetSomeInfo2(userinfo) - but that sometimes throws an Entity Framework error when F10 debugging
);
watch4.Stop();
var t4 = watch4.ElapsedMilliseconds; // this takes about 2012
}
Method implementations:
public MyType1 GetSomeInfo1(SomeOtherType param){
// result = some LINQ queries here
Thread.Sleep(1000);
return result;
}
public MyType2 GetSomeInfo2(SomeOtherType param){
// result = some LINQ queries here
Thread.Sleep(2000);
return result;
}
public Task<MyType1> GetSomeInfoAsync1(SomeOtherType param){
// result = some LINQ queries here
Thread.Sleep(1000);
return Task.FromResult(result);
}
public Task<MyType2> GetSomeInfoAsync2(SomeOtherType param){
// result = some LINQ queries here
Thread.Sleep(2000);
return Task.FromResult(result);
}
If I understood correctly, awaiting 2 tasks (like in C2 and C2.1) does not make them run in parallel (not even in the C2.1 example where I launch them first and then await); it just frees the current thread and hands the tasks to 2 other threads that deal with them.
Task.Run() will in fact do just what Parallel.Invoke does, spreading the work across 2 different CPUs so they run in parallel.
Shouldn't launching them first and then awaiting (the C2.1 example) make them run in some sort of parallel way?
Would it be better not to use async or parallel at all?
Please help me understand, using these examples, how I can have async and also better performance, and whether there are any implications with Entity Framework that I must consider. I've been reading for a few days already and I only get more confused, so please don't just give me more links to read :)
async code can be mixed with parallelism by calling the async methods without await and then awaiting a Task.WhenAll(). However, the main consideration when looking at parallelism is ensuring that the code being called is thread-safe. DbContexts are not thread-safe, so to run parallel operations you need a separate DbContext instance for each method. This means that code which normally relies on dependency injection to receive a DbContext/unit of work, and would get a reference that is lifetime-scoped to something like the web request, cannot be used in parallelized calls. Calls that are parallelized need a DbContext that is scoped to just that call.
When dealing with parallelized methods that work with EF entities, that also means you need to ensure that any entity references are treated as detached entities. They cannot safely be associated with one another, since they will have been returned by different DbContexts in different parallel tasks.
For example, using normal async & await:
var order = await Repository.GetOrderById(orderId);
var orderLine = await Repository.CreateOrderLineForProduct(productId, quantity);
order.OrderLines.Add(orderLine);
await Repository.SaveChanges();
This is a very basic example where the repository class gets a DbContext injected. The CreateOrderLineForProduct method would be using the DbContext to load the Product and possibly other details to build an OrderLine. When awaited, the async variants ensure only one thread is accessing the DbContext at a time, so the same single DbContext instance can be used by the repository. The Order, new OrderLine, Product, etc. are all tracked by the same DbContext instance, so a SaveChanges call issued by the repository against that single instance will work as expected.
If we tried to parallelize it like:
var orderTask = Repository.GetOrderById(orderId);
var orderLineTask = Repository.CreateOrderLineForProduct(productId, quantity);
await Task.WhenAll(orderTask, orderLineTask);
var order = orderTask.Result;
var orderLine = orderLineTask.Result;
order.OrderLines.Add(orderLine);
await Repository.SaveChanges();
This would likely result in exceptions from EF complaining that the DbContext is being accessed across threads, as both GetOrderById and the calls within CreateOrderLineForProduct would be using the same DbContext concurrently. What's worse is that EF won't detect that it is being called by multiple threads until those threads both try to access the DbSets etc. at the same time. So this can sometimes result in an intermittent error that might not appear during testing, or not appear reliably when not under load (queries all finish quite quickly and don't trip over each other), but grind to a halt with exceptions when running under load. To address this, the DbContext reference in the repository needs to be scoped to each method. This means that rather than using an injected DbContext, it needs to look more like:
public Order GetOrderById(int orderId)
{
using(var context = new AppDbContext())
{
return context.Orders
.Include(x=>x.OrderLines)
.AsNoTracking()
.Single(x => x.OrderId == orderId);
}
}
We could still use dependency injection to inject something like a DbContext factory class to create the DbContext, which can be mocked out. The key thing is that the scope of the DbContext must be moved to within the parallelized method. AsNoTracking() is important because we cannot leave this order "tracked" by this DbContext; when we want to save the order and any other associated entities, we will have to associate the order with a new DbContext instance (this one is being disposed). If the entity still thinks it's tracked, that will result in an error. This also means that the repository Save has to change to something more like:
Repository.Save(order);
to pass in an entity, associate it and all referenced entities with a DbContext, and then call SaveChanges.
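A rough sketch of what that Save might look like (the Attach/EntityState handling here is only illustrative - real object graphs need more care about which entities are new vs. modified):
public void Save(Order order)
{
    using (var context = new AppDbContext())
    {
        // Re-associate the detached order (and whatever it references) with a fresh context.
        context.Orders.Attach(order);
        context.Entry(order).State = EntityState.Modified; // use Added for brand-new orders
        context.SaveChanges();
    }
}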
Needless to say this starts getting messy, and it hasn't even touched on things like exception handling. You also lose aspects like change tracking because of the need to work with detached entities. To avoid potential issues between tracked and untracked entities and the like, I would recommend that parallelized code always deal with POCO view models or more complete "operations" with entities, rather than doing things like returning detached entities. We want to avoid confusion between code that might be called via an Order that is tracked (using synchronous or async calls) vs. an Order that is not tracked because it is the result of a parallelized call. That said, parallelization can have its uses, but I would highly recommend keeping its use to a minimum.
async/await can be an excellent pattern to adopt for longer, individual operations where the web request can be expected to wait a few seconds, such as a search or report. This frees up the request-handling thread to start responding to other requests while the user waits. Hence its use is to boost server responsiveness, not to be confused with making calls faster. For short and snappy operations it ends up adding a bit of extra overhead, so those should just be left as synchronous calls. async is not something I would ever argue needs to be an "all or nothing" decision in an application.
So the above example, loading an Order by ID and creating an OrderLine, is something that I would normally leave synchronous, not asynchronous. Loading an entity graph by ID is typically quite fast. A better example of where I would leverage async would be something like:
var query = Repository.GetOrders()
.Where(x => x.OrderStatus.OrderStatusId == OrderStatus.New
&& x.DispatchDate <= DateTime.Today);
if (searchCriteria.Any())
query = query.Where(buildCriteria(searchCriteria));
var pendingOrders = await query.Skip(pageNumber * pageSize)
.Take(pageSize)
.ProjectTo<OrderSearchResultViewModel>()
.ToListAsync();
In this example I have a search operation that is expected to run across a potentially large number of orders, possibly including less efficient user-defined search criteria, before fetching a page of results. It might take less than a second or several seconds to run, and there could be a number of other calls, including other searches, being processed for other users at the same time.
Parallelization is more geared towards situations where there is a mix of long- and short-running operations that need to be completed as a unit, so one doesn't have to wait for the others to complete before it starts. Much more care needs to be taken in this model when it comes to operations with EF entities, so it's definitely not a pattern I would adopt as the "default" in a system.
So to summarize:
Synchronous - Quick hits to the database or in-memory cache such as pulling rows by ID or in general queries expected to execute in 250ms or less. (Basically, the default)
Asynchronous - Bigger queries across larger sets with potentially slower execution time such as dynamic searches, or shorter operations that are expected to be called extremely frequently.
Parallel - Expensive operations that launch several queries, where each query can be "stripped" down to just the necessary data and run completely independently and in the background. E.g. reports, building exports, etc.
I have a job which does the following 2 tasks:
Read up to 300 unique customerIds from the database into a List.
Then call a stored procedure for each customerId; the SP executes queries, creates an XML (of up to 10 KB) and stores the XML into a database table.
So, in this case there should be 300 records in the table.
On average, the SP takes around 3 secs to process each customer, including its XML creation. That means it currently takes 15 minutes in total to process all 300 customers. The problem is that in the future it may become even more time-consuming.
I don't want to go with the bulk-insert option by moving the XML-creation logic into the application. With bulk-insert, I wouldn't be able to know which customerId's data was the problem if XML creation failed. So I want to call the SP once per customer.
To process all customers in parallel, I created 4 dedicated threads, each processing a collection of unique customerIds; together the 4 threads processed all 300 customers in 5 minutes, which is what I expected.
However, I want to use the ThreadPool rather than creating my own threads.
I want to have 2 types of threads here. One type processes and creates the XML for each customer; the other works on the customers whose XML has already been created. This second type will call an SP which updates a flag on the customer table based on whether the customer's XML is available.
So what is the best way to process 300 customers in parallel and quickly, while also updating the customer table in parallel or on a separate thread?
Are dedicated threads still a good option here, or Parallel.ForEach, or await Task.WhenAll?
I know Parallel.ForEach will block the main thread, which I want to use for updating the customer table.
You have to choose among several options for the implementation. First of all, choose the scheme you are going to use. You can implement your algorithm in a coroutine fashion: whenever the thread needs some data that takes a long time to prepare, it yields execution with the await construct.
// It can be run inside the `async void` event handler from your UI.
// As it is async, the UI thread wouldn't be blocked
async Task SaveAll()
{
for(int i = 0; i < 100; ++i)
{
// somehow get a started task for saving the (i)-th customer on this thread
await SaveAsync(i);
}
}
// This method is our first coroutine: it first starts fetching the data
// and after that saves the result to the database
async Task SaveAsync(int customerId)
{
// at this point we yield execution so other work can run,
// since we have nothing to do until the fetch completes
var customerData = await FetchCustomer(customerId);
// at this moment we start saving the data asynchronously
// and yield execution another time
var result = await SaveCustomer(customerData);
// at this line we can update the UI with result
}
FetchCustomer and SaveCustomer can use the TPL (they could be replaced with anonymous methods, but I don't like that approach). Task.Run will execute the code on the default ThreadPool, so the UI thread won't be blocked (more about this method in Stephen Cleary's blog):
async Task<CustomerData> FetchCustomer(int customerId)
{
return await Task.Run(() => DataRepository.GetCustomerById(customerId));
}
// ? here is a placeholder for your result type
async Task<?> SaveCustomer(CustomerData customer)
{
return await Task.Run(() => DataRepository.SaveCustomer(customer));
}
I also suggest you examine these articles from that blog:
StartNew is Dangerous
Async and Await
Don't Block on Async Code
Async/Await - Best Practices in Asynchronous Programming
Another option is to use the TPL Dataflow extension, very similar to this answer:
Nesting await in Parallel foreach
I suggest you examine the contents of the linked post and decide for yourself which approach you will implement.
I would try to solve your task completely within SQL. This will greatly reduce server roundtrips.
I think you can create an enumeration of tasks, each one dealing with one record, and then call Task.WhenAll() to run them all.
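Roughly like this (a sketch; ProcessCustomerAsync is a hypothetical method that calls your stored procedure for a single customerId):
var customerIds = GetCustomerIds(); // the ~300 ids already read from the database
var tasks = customerIds.Select(id => ProcessCustomerAsync(id)); // one task per customer
await Task.WhenAll(tasks); // completes once every stored-procedure call has finished
Be aware that starting 300 concurrent calls can exhaust the database connection pool, so you may still want to throttle them (batches, or a SemaphoreSlim) rather than launching them all at once.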
I have a helper method that returns IEnumerable<string>. As the collection grows, it's slowing down dramatically. My current approach is essentially the following:
var results = new List<string>();
foreach (var item in items)
{
results.Add(await item.Fetch());
}
I'm not actually sure whether this asynchronicity gives me any benefit (it sure doesn't seem like it), but all methods up the stack and to my controller's actions are asynchronous:
public async Task<IHttpActionResult> FetchAllItems()
As this code is ultimately used by my API, I'd really like to parallelize all of these for what I hope would be a great speedup. I've tried .AsParallel():
var results = items
.AsParallel()
.Select(i => i.Fetch().Result)
.ToList();
return results;
And .WhenAll (returning a string[]):
var tasks = items.Select(i => i.Fetch());
return Task<string>.WhenAll<string>(tasks).Result;
And a last-ditch effort of firing off all long-running jobs and sequentially awaiting them (hoping that they were all running in parallel, so waiting on one would let all others nearly complete):
var tasks = new LinkedList<Task<string>>();
foreach (var item in items)
tasks.AddLast(item.Fetch());
var results = new LinkedList<string>();
foreach (var task in tasks)
results.AddLast(task.Result);
In every test case, the time it takes to run is directly proportional to the number of items; there's no discernible speedup from any of these approaches. What am I missing in using Tasks and await/async?
There's a difference between parallel and concurrent. Concurrency just means doing more than one thing at a time, whereas parallel means doing more than one thing on multiple threads. async is great for concurrency, but doesn't (directly) help you with parallelism.
As a general rule, parallelism on ASP.NET should be avoided. This is because any parallel work you do (i.e., AsParallel, Parallel.ForEach, etc) shares the same thread pool as ASP.NET, so that reduces ASP.NET's capability to handle other requests. This impacts the scalability of your web service. It's best to leave the thread pool to ASP.NET.
However, concurrency is just fine - specifically, asynchronous concurrency. This is where Task.WhenAll comes in. Code like this is what you should be looking for (note that there is no call to Task<T>.Result):
var tasks = items.Select(i => i.Fetch());
return await Task<string>.WhenAll<string>(tasks);
Given your other code samples, it would be good to run through your call tree starting at Fetch and replace all Result calls with await. This may be (part of) your problem, because Result forces synchronous execution.
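For instance, if somewhere inside Fetch there is a blocking line like the first (hypothetical) one below, that is what to hunt down:
// blocking - ties up a thread pool thread while waiting:
var body = httpClient.GetStringAsync(url).Result;
// non-blocking replacement - frees the thread and lets the fetches overlap under Task.WhenAll:
var body = await httpClient.GetStringAsync(url);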
Another possible problem is that the underlying resource being fetched does not support concurrent access, or there may be throttling that you're not aware of. E.g., if Fetch retrieves data from another web service, check out System.Net.ServicePointManager.DefaultConnectionLimit.
There is also a configurable limit on the maximum number of connections to a single server, which can make download performance independent of the number of client threads.
To change the connection limit use
ServicePointManager.DefaultConnectionLimit
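For example, raise it once at application start-up (the value 100 here is arbitrary; the classic default for client applications is 2):
// e.g. in Application_Start / Main, before any requests are issued
System.Net.ServicePointManager.DefaultConnectionLimit = 100;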
Maximum concurrent requests for WebClient, HttpWebRequest, and HttpClient
As best I can, I opt for async all the way down. However, I am still stuck using ASP.NET Membership, which isn't built for async. As a result, my calls to methods like string[] GetRolesForUser() can't use async.
In order to build roles properly I depend on data from various sources so I am using multiple tasks to fetch the data in parallel:
public override string[] GetRolesForUser(string username) {
...
Task.WaitAll(taskAccounts, taskContracts, taskOtherContracts, taskMoreContracts, taskSomeProduct);
...
}
All of these tasks are simply fetching data from a SQL Server database using Entity Framework. However, the introduction of that last task (taskSomeProduct) is causing a deadlock, while none of the other tasks ever have.
Here is the method that causes a deadlock:
public async Task<int> SomeProduct(IEnumerable<string> ids) {
var q = from c in this.context.Contracts
join p in this.context.Products
on c.ProductId equals p.Id
where ids.Contains(c.Id)
select p.Code;
//Adding .ConfigureAwait(false) fixes the problem here
var codes = await q.ToListAsync();
var slotCount = codes.Sum(p => char.GetNumericValue(p, p.Length - 1));
return Convert.ToInt32(slotCount);
}
However, this method (which looks very similar to all the other methods) isn't causing deadlocks:
public async Task<List<CustomAccount>> SomeAccounts(IEnumerable<string> ids) {
return await this.context.Accounts
.Where(o => ids.Contains(o.Id))
.ToListAsync()
.ToCustomAccountListAsync();
}
I'm not quite sure what it is about that one method that causes the deadlock. Ultimately they are both doing the same task of querying the database. Adding ConfigureAwait(false) to the one method does fix the problem, but I'm not quite sure what differentiates it from the other methods, which execute fine.
Edit
Here is some additional code which I originally omitted for brevity:
public static Task<List<CustomAccount>> ToCustomAccountListAsync(this Task<List<Account>> sqlObjectsTask) {
var sqlObjects = sqlObjectsTask.Result;
var customObjects = sqlObjects.Select(o => PopulateCustomAccount(o)).ToList();
return Task.FromResult<List<CustomAccount>>(customObjects);
}
The PopulateCustomAccount method simply returns a CustomAccount object from the database Account object.
In ToCustomAccountListAsync you call Task.Result. That's a classic deadlock. Use await.
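That is, something like this (a sketch of the awaited version of the helper):
public static async Task<List<CustomAccount>> ToCustomAccountListAsync(this Task<List<Account>> sqlObjectsTask)
{
    // await instead of .Result, so we don't block while holding the ASP.NET synchronization context
    var sqlObjects = await sqlObjectsTask.ConfigureAwait(false);
    return sqlObjects.Select(o => PopulateCustomAccount(o)).ToList();
}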
This is not an answer, but I have a lot to say and it wouldn't fit in the comments.
Some facts: the EF context is not thread-safe and doesn't support parallel execution:
While thread safety would make async more useful it is an orthogonal feature. It is unclear that we could ever implement support for it in the most general case, given that EF interacts with a graph composed of user code to maintain state and there aren't easy ways to ensure that this code is also thread safe.
For the moment, EF will detect if the developer attempts to execute two async operations at one time and throw.
Some prediction:
You say that:
The parallel execution of the other four tasks has been in production for months without deadlocking.
They can't be executing in parallel. One possibility is that the thread pool cannot assign more than one thread to your operations; in that case they would be executed sequentially. Or it could be the way you are initializing your tasks; I'm not sure. Assuming they are executed sequentially (otherwise you would have recognized the exception I'm talking about), there is another problem:
Task.WaitAll hanging with multiple awaitable tasks in ASP.NET
So maybe it isn't about that specific SomeProduct task; maybe it always happens on the last task? Well, if they executed in parallel there wouldn't be a "last task", but as I've already pointed out, they must be running sequentially considering they have been in production for quite a long time.
I have a collection of objects. Before I save these objects, I need to grab some reference data from the database. I need to make a few calls to the DB for each object. These methods can happen in any order / are not dependent upon each other.
Can I use async / await to make these DB operations happen at the same time?
Parallel.ForEach(animals,
animal => {
animal.Owner = GetAnimalOwner(ownerId);
animal.FavouriteFood = GetFavouriteFood(foodId);
Database.Store(animal);
});
That's some pseudo code that explains what I'm trying to do.
Is this a good candidate for async / await ?
If all you want to do is execute the two operations in parallel, then you don't need async-await. You can use the older TPL methods, like Parallel.Invoke():
Parallel.ForEach(animals,
animal => {
Parallel.Invoke(
() => animal.Owner = GetAnimalOwner(ownerId),
() => animal.FavouriteFood = GetFavouriteFood(foodId));
Database.Store(animal);
});
Keep in mind that there is no guarantee that this will actually improve your performance, depending on your DB and your connection to it.
In general, async-await makes sense in two cases:
You want to offload some long-running code from the UI thread.
You want to use less threads to improve performance or scalability.
It seems to me like #1 is not your case. If you're in case #2, then you should rewrite your database methods into truly asynchronous ones (depending on your DB library, this may not be possible).
If you do that, then you could do something similar to what Scott Chamberlain suggested (but without using Parallel.ForEach()):
var tasks = animals.Select(
async animal => {
var getOwnerTask = GetAnimalOwnerAsync(ownerId);
var getFavouriteFoodTask = GetFavouriteFoodAsync(foodId);
animal.Owner = await getOwnerTask;
animal.FavouriteFood = await getFavouriteFoodTask;
await Database.StoreAsync(animal);
});
await Task.WhenAll(tasks);
This will execute with an even bigger degree of parallelism than Parallel.ForEach() (because that's limited by the thread pool), but again, that may not actually improve performance. If you wanted to limit the degree of parallelism, you could add a SemaphoreSlim, or you could use TPL Dataflow instead.
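A SemaphoreSlim version might look roughly like this (the limit of 4 concurrent animals is an arbitrary choice):
var throttle = new SemaphoreSlim(4); // at most 4 animals processed concurrently
var tasks = animals.Select(async animal =>
{
    await throttle.WaitAsync();
    try
    {
        animal.Owner = await GetAnimalOwnerAsync(ownerId);
        animal.FavouriteFood = await GetFavouriteFoodAsync(foodId);
        await Database.StoreAsync(animal);
    }
    finally
    {
        throttle.Release();
    }
});
await Task.WhenAll(tasks);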