Multiple operations under linq2Entities - c#

I've been using Linq2Entities for quite some time now on small-scale programs, so my queries are usually basic (return elements with a certain value in a specific column, add/update an element, remove an element, ...).
Now I'm moving to a larger-scale program and, given my experience, went with Linq2Entities. Everything was fine until I had to remove a large number of elements.
In Linq2Entities I can only find a remove method that takes one entity, so to remove 1000 elements I first need to retrieve them, then remove them one by one, then call SaveChanges.
Either I am missing something, or a) this is really bad performance-wise and b) I would be better off making this a procedure in the database.
Am I right in my assumptions? Should massive removals/additions be made as procedures only? Or is there a way to do this efficiently with Linq2Entities that I am missing?

If you have the primary key values, you can use the other features of the context: create the objects manually, set their key(s), attach them to the context, and set their state to Deleted. This avoids loading the rows first.
var ids = new List<Guid>(); // fill with the primary keys of the rows to delete
foreach (var id in ids)
{
    // A stub with only the key set is enough to mark the row for deletion.
    var employee = new Employee { Id = id };
    var entityEntry = context.Entry(employee);
    entityEntry.State = EntityState.Deleted;
}
context.SaveChanges();

Related

Entity Framework bulk insert with updating rows from another table

I'm using Microsoft SQL Server and Entity Framework. I have N (for example 10 000) items to insert. Before inserting each item I need to insert or update an existing group. Performance is poor because I'm generating too many queries: on each loop iteration I look up the group by querying the Groups table on three (already indexed) parameters.
I was thinking about first querying all groups with a WHERE IN query (Groups.Where(g => owners.Contains(g.OwnerId) && .. )), but as I remember such queries are limited by the number of parameters.
Maybe I should write a stored procedure?
Here is my example code. I'm using IUnitOfWork pattern for wrapping the EF DbContext:
public async Task InsertAsync(IItem item)
{
    var existingGroup = await this.unitOfWork.Groups.GetByAsync(item.OwnerId, item.Type, item.TypeId);
    if (existingGroup == null)
    {
        existingGroup = this.unitOfWork.Groups.CreateNew();
        existingGroup.State = GroupState.New;
        existingGroup.Type = item.Code;
        existingGroup.TypeId = item.TypeId;
        existingGroup.OwnerId = item.OwnerId;
        existingGroup.UpdatedAt = item.CreatedAt;
        this.unitOfWork.Groups.Insert(existingGroup);
    }
    else
    {
        existingGroup.UpdatedAt = item.CreatedAt;
        existingGroup.State = GroupState.New;
        this.unitOfWork.Groups.Update(existingGroup);
    }
    this.unitOfWork.Items.Insert(item);
}
foreach (var item in items)
{
    // Await each call so the lookups do not overlap on the same context.
    await InsertAsync(item);
}
await this.unitOfWork.SaveChangesAsync();
There are three key elements to improve performance when bulk inserting:
1. Set AutoDetectChangesEnabled and ValidateOnSaveEnabled to false:
_db.Configuration.AutoDetectChangesEnabled = false;
_db.Configuration.ValidateOnSaveEnabled = false;
2. Break up your inserts into segments that share one DbContext, then recreate the context for the next segment. The best segment size varies from use case to use case; I got the best performance at around 100 elements before recreating the context. This is because the DbContext keeps tracking every element added to it. Also make sure not to recreate the context for every single insert. (See Slauma's answer in "Fastest Way of Inserting in Entity Framework".)
3. When checking other tables, use IQueryable where possible and call ToList() or FirstOrDefault() only where necessary, since both execute the query and load the objects. (See Richard Szalay's answer in "What's the difference between IQueryable and IEnumerable".)
These tricks helped me the most when bulk inserting in a scenario like the one you describe. There are other options as well, for example stored procedures and the BulkInsert function.
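The segmenting idea can be sketched without any EF dependency. The following is only an illustration: the InBatches helper and the batch size of 100 are assumptions made for the example, and the comment inside the loop marks where a fresh context would be created, filled, saved, and disposed in a real program.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: splits a sequence into consecutive batches of at most batchSize items.
static IEnumerable<List<T>> InBatches<T>(IEnumerable<T> source, int batchSize)
{
    var batch = new List<T>(batchSize);
    foreach (var item in source)
    {
        batch.Add(item);
        if (batch.Count == batchSize)
        {
            yield return batch;
            batch = new List<T>(batchSize);
        }
    }
    if (batch.Count > 0)
    {
        yield return batch; // last, partially filled batch
    }
}

var items = Enumerable.Range(1, 250).ToList();
var batchSizes = new List<int>();

foreach (var batch in InBatches(items, 100))
{
    // In the EF scenario this is where a new context would be created,
    // the batch added, SaveChanges() called, and the context disposed.
    batchSizes.Add(batch.Count);
}

Console.WriteLine(string.Join(", ", batchSizes)); // prints "100, 100, 50"
```

Keeping the batching separate from the persistence code makes it easy to try different segment sizes and measure which one performs best for a given workload.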

Populating DbSet<TEntity>.Local with only specified fields

In Linq to Sql, I would download only a subset of fields for processing in order to reduce query time. Something like this...
var local_data = from row in context.MyTable
select new {
ID = row.ID,
Name = row.Name,
EMAIL = row.EMAIL
};
And then I would simply convert the projected data into a POCO collection...
foreach (var item in local_data)
{
    collection.Add(new MyTable
    {
        ID = item.ID,
        Name = item.Name,
        EMAIL = item.EMAIL
    });
}
This is extremely useful when dealing with massive, unwieldy table records where I only want to pull a handful of columns. When I heard about DbSet<TEntity>.Local, I was eager to switch over from Linq2SQL, but I can't seem to find the version of this new streamlined caching system that allows me to narrow the query scope to specific columns. How would I go about this?
"caching system that allows me to narrow the query scope to specific columns"
Sorry, the answer is: not possible.
The reason is that EF's internal cache is used for tracking entities, full entities. Being able to access these cached entities through the Local collection is a mere bonus that was introduced with the DbContext API. The cache doesn't exist because of it. The cache is for change tracking.
When EF materializes an entity from the database, it stores its original values in the change tracker and also frequently stores copies of its current values. When it's time to save changes, these values are compared and SQL statements are generated accordingly to store the changes.
Now that you know this, you'll understand that EF can't store partly populated entities in its cache. How should EF carry out change tracking if an entity can have any random collection of original values and current values?
Also, the result of a projection -- select new -- is never tracked (cached) and, thus, not accessible through a Local collection.
So in this respect you won't gain much by moving to EF.

Compare very large lists of database objects in c#

I have inherited a poorly designed database table (no primary key or indexes, oversized nvarchar fields, dates stored as nvarchar, etc.). This table has roughly 350,000 records. I get handed a list of around 2,000 potentially new records at predefined intervals, and I have to insert any of the potentially new records if the database does not already have a matching record.
I initially tried making comparisons in a foreach loop, but it quickly became obvious that there was probably a much more efficient way. After doing some research, I then tried the .Any(), .Contains(), and .Except() methods.
My research leads me to believe that the .Except() method would be the most efficient, but I get out-of-memory errors when trying it. The .Any() and .Contains() methods both seem to take roughly the same time to complete (which is faster than the foreach loop).
The structure of the two lists is identical, and each contains multiple strings. I have a few questions that I have not found satisfying answers to, if you don't mind.
When comparing two lists of objects (made up of several strings), is the .Except() method considered to be the most efficient?
Is there a way to use projection when using the .Except() method? What I would like to accomplish is something like:
List<Data> storedData = db.Data.ToList();
List<Data> incomingData = someDataPreviouslyParsed;
// No projection; this runs out of memory
var newData = incomingData.Except(storedData).ToList();
// Pseudocode that I would like to figure out if it is possible
// First use projection on the db so as to not get a bunch of irrelevant data
var storedKeys = db.Data.Select(x => new { x.field1, x.field2, x.field3 }).ToList();
var newKeys = incomingData.Select(x => new { x.field1, x.field2, x.field3 }).Except(storedKeys).ToList();
Using a raw SQL statement in SQL Server Management Studio, the query takes slightly longer than 10 seconds. Using EF, it seems to take in excess of a minute. Is that poorly optimized SQL generated by EF, or is it overhead from EF that makes such a difference?
Would raw SQL in EF be a better practice in a situation like this?
Semi-Off-Topic:
When grabbing the data from the database and storing it in the variable storedData, does that eliminate the usefulness of any indexes (should there be any) stored in the table?
I hate to ask so many questions, and I'm sure that many (if not all) of them are quite noobish. However, I have nowhere else to turn, and I have been looking for clear answers all day. Any help is very much appreciated.
UPDATE
After further research, I have found what seems to be a very good solution to this problem. Using EF, I grab the 350,000 records from the database, keeping only the columns I need to identify a unique record. I then convert that data to a dictionary, using the combination of the kept columns as the key (as can be seen here). This solves the problem of duplicates already existing in the returned data, and gives me something fast to compare my newly parsed data against. The performance increase was very noticeable!
I'm still not sure if this would be approaching the best practice, but I can certainly live with the performance of this. I have also seen some references to ToLookup() that I may try to get working to see if there is a performance gain there as well. Nevertheless, here is some code to show what I did:
var storedDataDictionary = storedData.GroupBy(k => (k.Field1 + k.Field2 + k.Field3 + k.Field4)).ToDictionary(g => g.Key, g => g.First());
foreach (var item in parsedData)
{
if (storedDataDictionary.ContainsKey(item.Field1 + item.Field2 + item.Field3 + item.Field4))
{
// duplicateData is a previously defined list
duplicateData.Add(item);
}
else
{
// newData is a previously defined list
newData.Add(item);
}
}
There is no reason to use EF for that.
Grab only the columns you need to decide whether to insert or update a record (that is, the columns that stand in for the missing primary key). Don't waste memory on other columns.
Build a HashSet of the existing key values (if the key is a number, a HashSet of int; if it spans multiple columns, combine them into a string).
Check your 2000 items against the HashSet; that is very fast.
Update or insert items with raw SQL.
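The HashSet steps above can be sketched roughly as follows. This is a minimal illustration with made-up data: the tuple fields, the sample values, and the MakeKey helper are assumptions for the example, not part of the original table. The separator in MakeKey is there so that ("ab", "c") and ("a", "bc") do not collapse into the same key, which plain concatenation would allow.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins: each record reduced to the columns that identify it.
var storedRows = new[]
{
    (Field1: "alice", Field2: "2014-01-01"),
    (Field1: "bob", Field2: "2014-01-02"),
};

var incoming = new[]
{
    (Field1: "bob", Field2: "2014-01-02"),   // already stored
    (Field1: "carol", Field2: "2014-01-03"), // new
};

// Combines the identifying columns into one string key, using a separator
// character that is unlikely to occur in the data.
static string MakeKey(string field1, string field2) => field1 + "\u001F" + field2;

// Build the set of existing keys once; each lookup afterwards is O(1) on average.
var existing = new HashSet<string>(storedRows.Select(r => MakeKey(r.Field1, r.Field2)));

// Keep only the incoming records whose key is not in the set yet.
var newRecords = incoming
    .Where(r => !existing.Contains(MakeKey(r.Field1, r.Field2)))
    .ToList();

Console.WriteLine(newRecords.Count); // prints 1
```

Only the records left in newRecords would then need to be inserted.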
I suggest you consider doing it in SQL, not C#. You don't say what RDBMS you are using, but you could look at the MERGE statement, e.g. (for SQL Server 2008):
https://technet.microsoft.com/en-us/library/bb522522%28v=sql.105%29.aspx
Broadly, the statement checks whether a record is 'new'. If so, you can INSERT it; if not, there are UPDATE and DELETE capabilities, or you can just ignore it.

Efficiency of C# Find on 1000+ records

I am essentially trying to see whether entities exist in a local context and sort them accordingly. This function seems to be faster than others we have tried (it runs in about 50 seconds for 1000 items), but I am wondering if there is something I can do to improve its efficiency. I believe the Find here is slowing it down significantly, as a simple foreach iteration over 1000 items takes milliseconds and benchmarking shows the bottleneck there. Any ideas would be helpful. Thank you.
Sample code:
foreach(var entity in entities) {
var localItem = db.Set<T>().Find(Key);
if(localItem != null)
{
list1.Add(entity);
}
else
{
list2.Add(entity);
}
}
If this is a database (which from the comments I've gathered that it is...)
You would be better off doing fewer queries.
list1.AddRange(db.Set<T>().Where(x => x.Key == Key));
list2.AddRange(db.Set<T>().Where(x => x.Key != Key));
This would be 2 queries instead of 1000+.
Also be aware of the fact that by adding each one to a List<T>, you're keeping 2 large arrays. So if 1000+ turns into 10000000, you're going to have interesting memory issues.
See this post on my blog for more information: http://www.artisansoftware.blogspot.com/2014/01/synopsis-creating-large-collection-by.html
If I understand correctly, the database seems to be the bottleneck? If you want to efficiently select data from a database relation whose attribute x must match an equality (==) criterion, you should consider creating a secondary access path (an index structure) for that attribute. Depending on your database system and the distribution of values in your table, this might be a hash index (especially good for equality checks), a B+-tree (an all-rounder), or whatever else your system offers.
However, this only helps if...
you query the database repeatedly, rather than fetching the full data set once and working with it in your application.
adding (another) index to the relation is not out of the question (e.g. because it's not worth having for a single need).
the index would actually be effective, which it would not be if, for example, the attribute you are querying on has very few unique values.
I found your answers very helpful, but here is ultimately how I solved the problem. It seemed .Find was the bottleneck.
var tableDictionary = db.Set<T>().ToDictionary(x => x.KeyValue, x => x);
foreach (var entity in entities)
{
    if (tableDictionary.ContainsKey(entity.KeyValue))
    {
        list1.Add(entity);
    }
    else
    {
        list2.Add(entity);
    }
}
With 900+ rows this ran in about a tenth of a second, which for our purposes was efficient enough.
Rather than querying the DB for each item, you can just do one query, get all of the data (since you want all of the data from the DB eventually) and you can then group it in memory, which can be done (in this case) about as efficiently as in the database. By creating a lookup of whether or not the key is equal, we can easily get the two groups:
var lookup = db.Set<T>().ToLookup(item => item.Key == Key);
var list1 = lookup[true].ToList();
var list2 = lookup[false].ToList();
(You can use AddRange instead if the lists have previous values that should also be in them.)

Dictionary/List speed for database update

I've got my model updating the database according to some information that comes in in the form of a Dictionary. The way I currently do it is below:
SortedItems = db.SortedItems.ToList();
foreach (SortedItem si in SortedItems)
{
    string key = si.PK1 + si.PK2 + si.PK3 + si.PK4;
    if (updates.ContainsKey(key) && updates[key] != si.SortRank)
    {
        si.SortRank = updates[key];
        db.SortedItems.ApplyCurrentValues(si);
    }
}
db.SaveChanges();
Would it be faster to iterate through the dictionary, and do a db lookup for each item? The dictionary only contains the item that have changed, and can be anywhere from 2 items to the entire set. My idea for the alternate method would be:
foreach(KeyValuePair<string, int?> kvp in updates)
{
SortedItem si = db.SortedItems.Single(s => (s.PK1 + s.PK2 + s.PK3 + s.PK4).Equals(kvp.Key));
si.SortRank = kvp.Value;
db.SortedItems.ApplyCurrentValues(si);
}
db.SaveChanges();
EDIT: Assume the number of updates is usually about 5-20% of the db entries
Let's look:
Method 1:
You'd iterate through all 1000 items in the database
You'd still visit every item in the Dictionary and have 950 misses against the dictionary
You'd still have 50 update calls to the database.
Method 2:
You'd iterate every item in the dictionary with no misses in the dictionary
You'd have 50 individual lookup calls to the database.
You'd have 50 update calls to the database.
This really depends on how big the dataset is and what % on average get modified.
You could also do something like this:
Method 3:
Build a set of all the keys from the dictionary
Query the database once for all items matching those keys
Iterate over the results and update each item
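Method 3 can be sketched in memory as follows. The list of tuples is only a stand-in for the SortedItems table, and the keys and ranks are made-up values; with EF the filtering step would be a single Where query using Contains on the key set, which is typically translated into a SQL IN clause (subject to the parameter limits mentioned elsewhere on this page).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for the SortedItems table; with EF this would be db.SortedItems.
var table = new List<(string Key, int? SortRank)>
{
    ("A", 1), ("B", 2), ("C", 3), ("D", 4),
};

// Incoming changes, keyed the same way as in the question.
var updates = new Dictionary<string, int?> { ["B"] = 20, ["D"] = 4 };

// Step 1: build a set of all the keys from the dictionary.
var keys = new HashSet<string>(updates.Keys);

// Step 2: one filtered pass, standing in for a single database query.
var matching = table.Where(row => keys.Contains(row.Key)).ToList();

// Step 3: iterate over the results and keep only rows whose rank really changes.
var changed = new List<(string Key, int? SortRank)>();
foreach (var row in matching)
{
    var newRank = updates[row.Key];
    if (newRank != row.SortRank)
    {
        changed.Add((row.Key, newRank));
    }
}

Console.WriteLine(string.Join(", ", changed.Select(c => $"{c.Key}={c.SortRank}"))); // prints "B=20"
```

Compared with Method 2, this needs one lookup round trip instead of one per updated item, while still touching only the rows named in the dictionary.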
Personally, I would try to determine your typical case scenario and profile each solution to see which is best. I really think the 2nd solution, though, will result in a ton of database and network hits if you have a large set and a large number of updates, since for each update it would have to hit the database twice (once to get the item, once to update the item).
So yes, this is a very long winded way of saying, "it depends..."
When in doubt, I'd code both and time them based on reproductions of production scenarios.
To add to @James' answer, you would get the fastest results using a stored proc (or a regular SQL command).
The problem with LINQ-to-Entities (and other LINQ providers, if they haven't updated recently) is that they don't know how to produce SQL updates with where clauses:
update SortedItems set SortRank = @NewRank where PK1 = @PK1 and (etc.)
A stored procedure would do this at the server side, and you would only need a single db call.
