I am currently using EF Extensions. One thing I don't understand: "it's supposed to help with performance", yet placing a million+ records into a List variable is a memory issue in itself.
So if I want to update a million records without holding everything in memory, how can this be done efficiently?
Should we use a for loop and update in batches of, say, 10,000? Does EF Extensions' BulkUpdate have any native functionality to support this?
Example:
var productUpdate = _dbContext.Set<Product>()
    .Where(x => x.ProductType == "Electronics"); // this creates an IQueryable

await productUpdate.ForEachAsync(c => c.ProductBrand = "ABC Company");

await _dbContext.BulkUpdateAsync(productUpdate.ToList());
Resource:
https://entityframework-extensions.net/bulk-update
This is actually something that EF is not made for. EF's database interactions start from the record object, and flow from there. EF cannot generate a partial UPDATE (i.e. not overwriting everything) if the entity wasn't change tracked (and therefore loaded), and similarly it cannot DELETE records based on a condition instead of a key.
There is no EF equivalent (without loading all of those records) for conditional update/delete logic such as
UPDATE People
SET FirstName = 'Bob'
WHERE FirstName = 'Robert'
or
DELETE FROM People
WHERE FirstName = 'Robert'
Doing this using the EF approach will require you to load all of these entities just to send them back (with an update or delete) to the database, and that's a waste of bandwidth and performance as you've already found.
The best solution I've found here is to bypass EF's LINQ-friendly methods and instead execute the raw SQL yourself. This can still be done using an EF context.
using (var ctx = new MyContext())
{
    string updateCommand = "UPDATE People SET FirstName = 'Bob' WHERE FirstName = 'Robert'";
    int noOfRowsUpdated = ctx.Database.ExecuteSqlCommand(updateCommand);

    string deleteCommand = "DELETE FROM People WHERE FirstName = 'Robert'";
    int noOfRowsDeleted = ctx.Database.ExecuteSqlCommand(deleteCommand);
}
More information here. Of course don't forget to protect against SQL injection where relevant.
The specific syntax to run raw SQL may vary per version of EF/EF Core but as far as I'm aware all versions allow you to execute raw SQL.
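For reference, here is roughly what the same thing looks like on a recent EF Core version, using a parameterized query to avoid injection (a sketch; ExecuteSqlRawAsync is the EF Core 3.0+ name for this API):

using (var ctx = new MyContext())
{
    // The {0}/{1} placeholders are turned into DbParameters by EF,
    // so the values are not concatenated into the SQL string.
    int rowsUpdated = await ctx.Database.ExecuteSqlRawAsync(
        "UPDATE People SET FirstName = {0} WHERE FirstName = {1}", "Bob", "Robert");

    int rowsDeleted = await ctx.Database.ExecuteSqlRawAsync(
        "DELETE FROM People WHERE FirstName = {0}", "Robert");
}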
I can't comment on the performance of EF Extensions or BulkUpdate specifically, and I'm not going to buy the paid library just to find out.
Based on their documentation, they don't seem to have the methods with the right signatures to allow for conditional update/delete logic.
BulkUpdate doesn't seem to allow you to input the logical condition (the WHERE in your UPDATE command) that would allow you to optimize this.
BulkDelete still has a BatchSize setting, which suggests that they are still handling the records one at a time (well, per batch I guess), and not using a single DELETE query with a condition (WHERE clause).
Based on your intended code in the question, EF Extensions isn't really giving you what you need. It's more performant and cheaper to simply execute raw SQL on the database, as this bypasses EF's need to load its entities.
Update
I might stand corrected; there is some support for conditional update logic, as seen here. However, it is unclear to me why the example still loads everything in memory, and what the purpose of that conditional WHERE logic then is if you've already loaded it all in memory (why not use in-memory LINQ then?).
However, even if this works without loading the entities, it's still:
more limited (only equality checks are allowed, compared to SQL allowing any boolean condition that is valid SQL),
relatively complex (I don't like their syntax, maybe that's subjective)
and more costly (still a paid library)
compared to rolling your own raw SQL query. I would still suggest rolling your own raw SQL here, but that's just my opinion.
I found the "proper" EF Extensions way to do a bulk update with a query-like condition:
var productUpdate = _dbContext.Set<Product>()
    .Where(x => x.ProductType == "Electronics")
    .UpdateFromQuery(x => new Product { ProductBrand = "ABC Company" });
This should result in a proper SQL UPDATE ... SET ... WHERE, without the need to load entities first, as per the documentation:
Why UpdateFromQuery is faster than SaveChanges, BulkSaveChanges, and BulkUpdate?
UpdateFromQuery executes a statement directly in SQL such as UPDATE [TableName] SET [SetColumnsAndValues] WHERE [Key].
Other operations normally require one or multiple database round-trips which makes the performance slower.
You can check the working syntax on this dotnet fiddle example, adapted from their example of BulkUpdate.
Other considerations
No mention of batch operations for this, unfortunately.
Before doing a big update like this, it might be worth considering disabling any indexes you have on this column and rebuilding them afterward. This is especially useful if you have many of them.
Be careful about the condition in the Where: if EF cannot translate it to SQL, it will be evaluated client side, meaning the "usual" terrible round trip of "load, change in memory, update" (see the sketch below).
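As an illustration (the helper method here is hypothetical), a condition like this cannot be translated to SQL because it calls an arbitrary local method:

// SomeLocalHelper is plain C# that EF knows nothing about, so this predicate cannot be
// turned into a WHERE clause; depending on the EF Core version it is either evaluated
// client side (after loading everything) or throws a translation exception.
var products = _dbContext.Set<Product>()
    .Where(x => SomeLocalHelper(x.ProductType))
    .ToList();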
Related
I have a function in my ASP.NET Core app which updates a bunch of records based on certain criteria I write in a where clause ... I read that ToList() has bad performance, so is there a better and faster way than using ToList and foreach?
This is my current way of doing it; I would appreciate it if someone provides a more efficient way.
public async Task UpdateCatalogOnTenantApproval(int tenantID)
{
    var catalogQuery = GetQueryable();
    var catalog = await catalogQuery.Where(x => x.IdTenant == tenantID).ToListAsync();
    catalog.ForEach(c => { c.IsApprovedByAdmin = true; c.IsActive = true; });
    Context.UpdateRange(catalog);
    await Context.SaveChangesAsync();
}
read that ToList() has bad performance

That is wrong. ToList has as good a performance as you will get. Submit a bad query that is overly complex and results in bad SQL that SQL Server takes ages to execute, and it is slow.
Also, many people think ToList is slow (as in: in the profiler). You see, you start with a db context, take a set of entities there, add some where clauses - all fast. Then you call ToList and it takes "long" (compared to the rest). Well, THAT is where the query is actually sent to the SQL server ;) Where(x => whatever) takes "no time" because all it does is add some nodes to the expression tree, not execute the query. THAT is what people mostly mix up - deferred execution, which executes only when the results are asked for.
And third, some people write ToList().Where() and complain about performance. Filter as much as possible on the DB.
All three reasons are why people think ToList is slow - but all they show is a lack of understanding of how LINQ and SQL operate.
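A minimal sketch of that deferred execution (assuming an ordinary DbContext with a Products set):

// Nothing is sent to the database here; Where only adds nodes to the expression tree.
IQueryable<Product> query = dbContext.Products
    .Where(p => p.ProductType == "Electronics");

// THIS is the moment the SQL is generated, sent to the server, and the results materialized.
List<Product> products = query.ToList();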
Entity Framework does not handle bulk update operations by default - hence your existing code. If you really want to do these bulk operations, then you have two options:
Write the SQL yourself and use the ExecuteSqlCommand() method to execute it (a minimal sketch follows below); or
Look at 3rd party extensions, such as https://entityframework-extensions.net/
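For the first option, a minimal sketch against the code in the question (the Catalog table name is an assumption; with an EF Core context, and in EF Core 3.0+ the method is called ExecuteSqlRawAsync instead):

public async Task UpdateCatalogOnTenantApproval(int tenantID)
{
    // One UPDATE statement executed on the server; no entities are loaded or tracked.
    int rowsAffected = await Context.Database.ExecuteSqlCommandAsync(
        "UPDATE Catalog SET IsApprovedByAdmin = 1, IsActive = 1 WHERE IdTenant = {0}",
        tenantID);
}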
We can reduce the query cost by selecting a subset of the data before attaching it for EF to track, and then updating.
However, it may just be pointless micro-optimization that does not perform significantly better unless you are processing a massive amount of records.
// select the PK for EF to track, plus the 2 fields to be modified
var catalog = await catalogQuery.Where(x => x.IdTenant == tenantID)
    .Select(x => new Catalog { CatalogId = x.CatalogId, IsApprovedByAdmin = x.IsApprovedByAdmin, IsActive = x.IsActive })
    .ToListAsync();

// next we attach the range here to let EF track the list
Context.AttachRange(catalog);

// perform your update as usual; this will be flagged as modified if changed
catalog.ForEach(c => { c.IsApprovedByAdmin = true; c.IsActive = true; });

// save and let EF update based on the modified fields
await Context.SaveChangesAsync();
Let me explain what you have done and what you are trying to do.
You are partially right about the performance issues related to ToList and ToListAsync, as they are mainly responsible for loading entities into memory and tracking them.
Based on that, if your request is expected to deal only with light data, you do not need to change your code. If it is not, however, there are several possible approaches, each with its own pros and cons, and you have to weigh them for each case where you want to avoid the dual application-to-SQL round trip.
Let's be more realistic by talking about your case:
1- We assume that your method is resource-consuming (it loads a high volume of data, is called intensively, or both).
2- I see the modification is static: every row is updated with c.IsApprovedByAdmin = true; c.IsActive = true;.
From (1) and (2) I suggest writing a stored procedure, or using ExecuteSqlCommand (as Bryan Lewis suggested), that does this for you (a sketch follows below),
because (3) stored procedures, triggers, and other SQL-based operations are hard to maintain and prone to hidden exceptions. In your case, however, you are less likely to run into that, as your code is quite basic, and you could further reduce the risk by constructing your query from dynamic elements such as nameof(YourClassName) for the table name, nameof(YourProperty) for the columns, and the like.
Anyway, this is an example to show that there is no ideal approach and that you have to study each case on its own.
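For completeness, a sketch of the stored-procedure route (the procedure name is illustrative, not from the question; the call uses EF Core's ExecuteSqlCommandAsync, named ExecuteSqlRawAsync in later versions):

// Assumes something like the following exists on the server:
//   CREATE PROCEDURE dbo.ApproveCatalogForTenant @TenantId INT AS
//       UPDATE Catalog SET IsApprovedByAdmin = 1, IsActive = 1 WHERE IdTenant = @TenantId;
int rows = await Context.Database.ExecuteSqlCommandAsync(
    "EXEC dbo.ApproveCatalogForTenant @TenantId = {0}", tenantID);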
Finally, I do not agree with using the 3rd party extensions, as most of the free ones are developed by non-professionals and tracking exceptions caused by them is a nightmare, while the paid versions are expensive and not exception-free either. The 3rd party extensions are more oriented toward complex bulk update/delete and/or huge data.
e.g.
await Context.UpdateAsync(e => new Catalog
{
    Archived = e.LastUpdate > DateTime.UtcNow.AddYears(-99) ? false : true
});
I had the following:
List<Message> unreadMessages = this.context.Messages
    .Where(x =>
        x.AncestorMessage.MessageID == ancestorMessageID &&
        x.Read == false &&
        x.SentTo.Id == userID).ToList();

foreach (var unreadMessage in unreadMessages)
{
    unreadMessage.Read = true;
}

this.context.SaveChanges();
But there must be a way of doing this without having to do 2 SQL queries: one for selecting the items, and one for updating them.
How do I do this?
Current idiomatic support in EF
As far as I know, there is no direct support for "bulk updates" yet in Entity Framework (there has been an ongoing discussion for bulk operation support for a while though, and it is likely it will be included at some point).
(Why) Do you want to do this?
It is clear that this is an operation that, in native SQL, can be achieved in a single statement, and provides some significant advantages over the approach followed in your question. Using the single SQL statement, only a very small amount of I/O is required between client and DB server, and the statement itself can be completely executed and optimized by the DB server. No need to transfer to and iterate through a potentially large result set client side, just to update one or two fields and send this back the other way.
How
So although not directly supported by EF, it is still possible to do this, using one of two approaches.
Option A. Handcode your SQL update statement
This is a very simple approach, that does not require any other tools/packages and can be performed Async as well:
var sql = "UPDATE TABLE x SET FIELDA = #fieldA WHERE FIELDB = #fieldb";
var parameters = new SqlParameter[] { ..., ... };
int result = db.Database.ExecuteSqlCommand(sql, parameters);
or
int result = await db.Database.ExecuteSqlCommandAsync(sql, parameters);
The obvious downside is, well, breaking the nice LINQ-y paradigm and having to hand-code your SQL (possibly for more than one target SQL dialect).
Option B. Use one of the EF extension/utility packages
For a while now, a number of open source NuGet packages have been available that offer specific extensions to EF. Several of them provide a nice "linqy" way to issue a single update SQL statement to the server. Two examples are:
Entity Framework Extended Library that allows performing a bulk update using a statement like:
context.Messages.Update(
    x => x.Read == false && x.SentTo.Id == userID,
    x => new Message { Read = true });
It is also available on github
EntityFramework.Utilities that allows performing a bulk update using a statement like:
EFBatchOperation
    .For(context, context.Messages)
    .Where(x => x.Read == false && x.SentTo.Id == userID)
    .Update(x => x.Read, x => x.Read = true);
It is also available on github
And there are definitely other packages and libraries out there that provide similar support.
Even SQL has to do this in two steps in a sense, in that an UPDATE query with a WHERE clause first runs the equivalent of a SELECT behind the scenes, filtering via the WHERE clause, then applying the update. So really, I don't think you need to be worried about improving this.
Further, the reason why it's broken into two steps like this in LINQ is precisely for performance reasons. You want that "select" to be as minimal as possible, i.e. you don't want to load any more objects from the database into in-memory objects than you have to. Only then do you alter the objects (in the foreach).
If you really want to run a native UPDATE on the SQL side, you could use a System.Data.SqlClient.SqlCommand to issue the update, instead of having LINQ give you back objects that you then update. That will be faster, but then you conceptually move some of your logic out of your C# code object model space into the database model space (you are doing things in the database, not in your object space), even if the SqlCommand is being issued from your code.
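A minimal sketch of that SqlCommand route (the connection string and the table/column names are assumptions based on the question):

using System.Data.SqlClient;

using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(
    "UPDATE Messages SET [Read] = 1 WHERE [Read] = 0 AND SentToId = @userId AND AncestorMessageID = @ancestorId",
    connection))
{
    command.Parameters.AddWithValue("@userId", userID);
    command.Parameters.AddWithValue("@ancestorId", ancestorMessageID);
    connection.Open();
    int rowsAffected = command.ExecuteNonQuery();
}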
I need to do a query on my database that might be something like this where there could realistically be 100 or more search terms.
public IQueryable<Address> GetAddressesWithTown(string[] towns)
{
    IQueryable<Address> addressQuery = DbContext.Addresses;
    addressQuery = addressQuery.Where(x => towns.Any(y => x.Town == y));
    return addressQuery;
}
However, when it contains more than about 15 terms it throws an exception on execution because the SQL generated is too long.
Can this kind of query be done through Entity Framework?
What other options are there available to complete a query like this?
Sorry, are we talking about THIS EXACT SQL?
In that case it is a very simple "open your eyes" thing.
There is a way (Contains) to map that array into an IN clause, which results in ONE SQL condition (town IN ('', '', '')).
Let me see whether I get this right:
addressQuery = addressQuery.Where(x => towns.Any(y => x.Town == y));
should be
addressQuery = addressQuery.Where(x => towns.Contains(x.Town));
The resulting SQL will be a LOT smaller. 100 items is still taxing it - I would dare say you may have a db or app design issue here that requires a business-side analysis; I have not met this requirement in the 20 years I have worked with databases.
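A minimal sketch of the corrected method (same signature as in the question):

public IQueryable<Address> GetAddressesWithTown(string[] towns)
{
    // Contains translates to a single IN clause: WHERE Town IN ('...', '...', ...)
    return DbContext.Addresses.Where(x => towns.Contains(x.Town));
}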
This looks like a scenario where you'd want to use the PredicateBuilder as this will help you create an Or based predicate and construct your dynamic lambda expression.
This is part of a library called LinqKit by Joseph Albahari who created LinqPad.
public IQueryable<Address> GetAddressesWithTown(string[] towns)
{
    var predicate = PredicateBuilder.False<Address>();

    foreach (string town in towns)
    {
        string temp = town;
        predicate = predicate.Or(p => p.Town.Equals(temp));
    }

    return DbContext.Addresses.Where(predicate);
}
You've broadly got two options:
You can replace .Any with a .Contains alternative.
You can use plain SQL with table-valued-parameters.
Using .Contains is easier to implement and will help performance because it translates to an inline SQL IN clause, so 100 towns shouldn't be a problem. However, it also means that the exact SQL depends on the exact number of towns: you're forcing SQL Server to recompile the query for each number of towns. These recompilations can be expensive when the query is complex, and they can evict other query plans from the cache as well.
Using table-valued parameters is the more general solution, but it's more work to implement, particularly because it means you'll need to write the SQL query yourself and cannot rely on Entity Framework. (Using ObjectContext.Translate you can still unpack the query results into strongly typed objects, despite writing the SQL yourself.) Unfortunately, you cannot yet use Entity Framework to pass a lot of data to SQL Server efficiently. Entity Framework supports neither table-valued parameters nor temporary tables (it's a commonly requested feature, however).
A bit of TVP SQL would look like this: select ... from ... join @townTableArg townArg on townArg.town = address.town, or select ... from ... where address.town in (select town from @townTableArg).
You probably can work around the EF restriction, but it's not going to be fast and will probably be tricky. A workaround would be to insert your values into some intermediate table, then join with that - that's still 100 inserts, but those are separate statements. If a future version of EF supports batch CUD statements, this might actually work reasonably.
Almost equivalent to table-valued parameters would be to bulk-insert into a temporary table and join with that in your query. Mostly that just means your table name will start with '#' rather than '@' :-). The temp table has a little more overhead, but you can put indexes on it, and in some cases that means the subsequent query will be much faster (for really huge data quantities).
Unfortunately, using either temporary tables or bulk insert from C# is a hassle. The simplest solution here is to make a DataTable; this can be passed to either. However, DataTables are relatively slow; the overhead might be relevant once you start adding millions of rows. The fastest (general) solution is to implement a custom IDataReader; almost as fast is an IEnumerable<SqlDataRecord>.
By the way, to use a table-valued parameter, the shape ("type") of the table parameter needs to be declared on the server; if you use a temporary table you'll need to create it too.
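A minimal sketch of passing a DataTable as a table-valued parameter (assumes a user-defined table type, here called dbo.TownList with a single town column, has already been created on the server; all names are illustrative):

using System.Data;
using System.Data.SqlClient;

// Assumed to exist on the server:
//   CREATE TYPE dbo.TownList AS TABLE (town NVARCHAR(100));
var townTable = new DataTable();
townTable.Columns.Add("town", typeof(string));
foreach (var town in towns)
{
    townTable.Rows.Add(town);
}

using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(
    "SELECT a.* FROM Addresses a JOIN @townTableArg t ON t.town = a.Town", connection))
{
    var parameter = command.Parameters.AddWithValue("@townTableArg", townTable);
    parameter.SqlDbType = SqlDbType.Structured;
    parameter.TypeName = "dbo.TownList";

    connection.Open();
    using (var reader = command.ExecuteReader())
    {
        // read the addresses here, or hand the reader to ObjectContext.Translate<Address>(reader)
    }
}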
Some pointers to get you started:
http://lennilobel.wordpress.com/2009/07/29/sql-server-2008-table-valued-parameters-and-c-custom-iterators-a-match-made-in-heaven/
SqlBulkCopy from a List<>
So using EF4, I'm running a search query that ends with this common function:
query = query.Skip(5).Take(10);
Those records in the database have a column called ImpressionCount, which I intend to use to count the number of times that each record displayed on a page of search results.
What's the most efficient way to do this? Off the top of my head, I'm just going to look at the result set, get a list of ID's and then hit the database again using ADO.NET to do something like:
UPDATE TableName SET ImpressionCount = ImpressionCount + 1 WHERE Id IN (1,2,3,4,5,6,7,8,9,10)
Seems simple enough, just wondering if there's a more .NET 4 / Linq-ish way to do this that I'm not thinking of. One that doesn't involve another hit to the database would be nice too. :)
EDIT: So I'm leaning towards IAbstract's response as the answer since it doesn't appear there's a "built in" way to do this. I didn't think there was but it never hurts to ask. However, the only other question I think I want to throw out there is: is it possible to write a SQL trigger that could only operate on this particular query? I don't want ImpressionCount to update on EVERY select statement for the record (for example, when someone goes to view the detail page, that's not an impression -- if an admin edits the record in the back end, that's not an impression)...possible using LINQ or no?
SQL Server would somehow need to be able to identify that the query was generated by that Linq command, not sure if that's possible or not. This site is expecting relatively heavy traffic so I'm just trying to optimize where possible, but if it's overkill, I might just go ahead and hit the database again one time for each page of results.
In my opinion, it is better to go ahead and run with the SQL command as you have it. Just because we have LINQ does not mean it is always the best choice. Your statement gives you a one-call process to update impression counts and should be fairly quick.
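In EF4 terms, a minimal sketch of issuing that one statement through the existing context (ExecuteStoreCommand is the ObjectContext method for raw SQL; the table name is taken from the question's example):

// ids = the IDs of the records shown on the current page of search results
var idList = string.Join(",", ids);

// One round trip; the IDs come from our own query, not from user input - otherwise parameterize.
int rowsAffected = context.ExecuteStoreCommand(
    "UPDATE TableName SET ImpressionCount = ImpressionCount + 1 WHERE Id IN (" + idList + ")");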
You can technically use a .Select to modify each element in a returned result, but it's not idiomatic C#/linq, so you're better off using a foreach loop.
Example:
var results = query.ToList().Select(x => { x.ImpressionCount++; return x; });
As IAbstract said, be careful of performance issues. Using the example above, or a foreach, will execute 10 separate updates. Your one UPDATE statement is better.
I know Linq2NHibernate has this same issue - trying to stick with LINQ just isn't any good for updates (that's why it's called "language integrated query").
Edit:
Actually, there's probably no reason why EF4 or NHibernate couldn't parse the select expression, realize it's an update, translate it into an UPDATE statement and execute it, but certainly neither framework will do that. If that were something that could happen, you'd want a new .Update() extension method for IQueryable<T> to explicitly state that you're modifying data. Using .Select() for it is a dirty hack.
... which means there's no reason you couldn't write your own .Update(x => x.ImpressionCount++) extension method for IQueryable<T> that output the SQL you want and call ExecuteStoreCommand, but it would be a lot of work.
I am creating a forum package for a CMS and am looking at caching some of the queries to help with performance, but I'm not sure if caching the code below will help/do what it should (BTW: CacheHelper is a simple helper class that just adds and removes from the cache).
// Set cache variables
IEnumerable<ForumTopic> maintopics;

if (!CacheHelper.Get(topicCacheKey, out maintopics))
{
    // Now get topics
    maintopics = from t in u.ForumTopics
                 where t.ParentNodeId == CurrentNode.Id
                 orderby t.ForumTopicLastPost descending
                 select t;

    // Add to cache
    CacheHelper.Add(maintopics, topicCacheKey);
}
// End cache

// Pass to my pager helper
var pagedResults = new PaginatedList<ForumTopic>(maintopics, p ?? 0, Convert.ToInt32(Settings.ForumTopicsPerPage));

// Now bind
rptTopicList.DataSource = pagedResults;
rptTopicList.DataBind();
Doesn't LINQ only execute when it's enumerated? So the above won't work, will it? It's only enumerated when I pass it to the paging helper, which .Take()'s a certain number of records based on a querystring value 'p'.
You need to enumerate your results, for example by calling the ToList() method.
maintopics = from t in u.ForumTopics
             where t.ParentNodeId == CurrentNode.Id
             orderby t.ForumTopicLastPost descending
             select t;

// Add to cache
CacheHelper.Add(maintopics.ToList(), topicCacheKey);
My experience with Linq-to-Sql is that it's not super performant when you start getting into complex objects and/or joins.
The first step is to set up LoadOptions on the datacontext. This will force joins so that a complete record is recalled. This was a problem in a ticket tracking system I wrote. I was displaying a list of 10 tickets and saw about 70 queries come across the wire. I had ticket->substatus->status. Due to L2S's lazy initialization, that caused each foreign key for each object that I referenced in the grid to fire off a new query.
Here's a blog post (not mine) about this subject (MSDN was weak): http://oakleafblog.blogspot.com/2007/08/linq-to-sql-query-execution-with.html
The next option is to create precompiled Linq queries. I had to do this with large joins. Here's another blog post on the subject: http://aspguy.wordpress.com/2008/08/15/speed-up-linq-to-sql-with-compiled-linq-queries/
The next option is to convert things over to using stored procedures. This makes programming and deployment harder for sure, but for complex queries where you only need a subset of data, they will be orders of magnitude faster.
The reason I bring this up is because the way you're talking about caching things (why not use the built in Cache in ASP.NET?) is going to cause you lots of headaches in the long term. I'd recommend building your system and then running SQL traces to see where your database performance problems are, then build optimizations around that. You might find that your real issues aren't in the "top 10 topics" but in other, much simpler to fix areas.
Yes, you need to enumerate your results. Linq will not evaluate your query until you enumerate the results.
If you want a general caching strategy for Linq, here is a great tutorial:
http://petemontgomery.wordpress.com/2008/08/07/caching-the-results-of-linq-queries/
The end goal is the ability to automatically generate unique cache keys for any Linq query.
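A simplified, hand-rolled version of that idea (here the cache key is chosen manually rather than generated from the query expression, and System.Runtime.Caching.MemoryCache is used as the store; names are illustrative):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.Caching;

public static class QueryCache
{
    // Returns the cached list if present; otherwise executes the query and caches the results.
    public static List<T> GetOrAdd<T>(IQueryable<T> query, string cacheKey, TimeSpan duration)
    {
        var cached = MemoryCache.Default.Get(cacheKey) as List<T>;
        if (cached != null)
        {
            return cached;
        }

        // ToList() forces execution here, so the materialized results are cached, not the query.
        var results = query.ToList();
        MemoryCache.Default.Set(cacheKey, results, DateTimeOffset.UtcNow.Add(duration));
        return results;
    }
}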