Why is EF4 Code First so slow when storing objects?

I'm currently doing some research on using db4o as storage for my web application. I'm quite happy with how easily db4o works. So when I read about the Code First approach I kinda liked it, because working with EF4 Code First is quite similar to working with db4o: create your domain objects (POCOs), throw them at db4o, and never look back.
But when I did a performance comparison, EF 4 was horribly slow. And I couldn't figure out why.
I use the following entities:
public class Recipe
{
    private List<RecipePreparation> _RecipePreparations = new List<RecipePreparation>();

    public int ID { get; set; }
    public String Name { get; set; }
    public String Description { get; set; }
    public List<String> Tags { get; set; }

    public ICollection<RecipePreparation> Preparations
    {
        get { return _RecipePreparations.AsReadOnly(); }
    }

    public void AddPreparation(RecipePreparation preparation)
    {
        this._RecipePreparations.Add(preparation);
    }
}
public class RecipePreparation
{
    public String Name { get; set; }
    public String Description { get; set; }
    public int Rating { get; set; }
    public List<String> Steps { get; set; }
    public List<String> Tags { get; set; }
    public int ID { get; set; }
}
To test the performance I new up a recipe and add 50,000 RecipePreparations. Then I store the object in db4o like so:
IObjectContainer db = Db4oEmbedded.OpenFile(Db4oEmbedded.NewConfiguration(), @"RecipeDB.db4o");
db.Store(recipe1);
db.Close();
This takes around 13,000 ms.
I store the stuff with EF4 in SQL Server 2008 (Express, locally) like this:
cookRecipes.Recipes.Add(recipe1);
cookRecipes.SaveChanges();
And that takes 200,000 ms.
Now how on earth is db4o 15(!!!) times faster than EF4/SQL? Am I missing a secret turbo button for EF4? I even think db4o could be made faster, since I don't pre-initialize the database file; I just let it grow dynamically.

Did you call SaveChanges() inside the loop? No wonder it's slow! Try doing this:
foreach (var recipe in The500000Recipes)
{
    cookRecipes.Recipes.Add(recipe);
}
cookRecipes.SaveChanges();
EF expects you to make all the changes you want and then call SaveChanges once. That way, it can optimize the database communication and the SQL needed to get from the opening state to the saving state, ignoring all changes that you have undone. (For example, adding 50,000 records, then removing half of them, then hitting SaveChanges will only add 25,000 records to the database. Ever.)

Perhaps you can disable change tracking while adding new objects; this really increases performance.
context.Configuration.AutoDetectChangesEnabled = false;
See also for more info: http://coding.abel.nu/2012/03/ef-code-first-change-tracking/
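For illustration, here is a minimal sketch combining both suggestions (assuming an EF 4.1+ DbContext; recipes is a stand-in for the collection being imported):

cookRecipes.Configuration.AutoDetectChangesEnabled = false;
try
{
    foreach (var recipe in recipes)
    {
        // With automatic detection off, Add no longer rescans the whole
        // tracked graph on every call.
        cookRecipes.Recipes.Add(recipe);
    }
    cookRecipes.ChangeTracker.DetectChanges(); // one pass instead of one per Add
    cookRecipes.SaveChanges();
}
finally
{
    cookRecipes.Configuration.AutoDetectChangesEnabled = true;
}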

EF excels at many things, but bulk loading is not one of them. If you want high-performance bulk loading, doing it directly through the DB server will be faster than any ORM. If your app's sole performance constraint is bulk loading, then you probably shouldn't use EF.
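To make "directly through the DB server" concrete, here is a hedged sketch using SqlBulkCopy from System.Data.SqlClient; the destination table and column names are assumptions, not something from the question:

var table = new DataTable();
table.Columns.Add("Name", typeof(string));
table.Columns.Add("Description", typeof(string));
table.Columns.Add("Rating", typeof(int));
foreach (var prep in recipe1.Preparations)
    table.Rows.Add(prep.Name, prep.Description, prep.Rating);

using (var bulk = new SqlBulkCopy(connectionString))
{
    bulk.DestinationTableName = "dbo.RecipePreparations";
    bulk.BatchSize = 5000; // stream rows to the server in batches
    bulk.WriteToServer(table);
}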

Just to add on to the other answers: db4o typically runs in-process, while EF abstracts an out-of-process (SQL) database. However, db4o is essentially single-threaded. So while it might be faster for this one example with one request, SQL will handle concurrency (multiple queries, multiple users) much better than a default db4o database setup.


Best way to query using EF

Using LINQ, I am having trouble querying my DbContext in an efficient way.
The database contains over 700,000 entities, which have a date, a name, and other information.
In my code, I have a new list of objects (which can potentially have 100,000 elements) coming in, and I would like to query my database and deduce which items are new entities and which are existing entities that need to be updated.
I would like to do this in a very efficient way (with a single query if possible).
This is my code:
public class MyDbContext : DbContext
{
    public DbSet<MyEntity> MyEntities { get; set; }
}

public class MyEntity
{
    [Key]
    public Guid Id { get; set; }
    public DateTime Date { get; set; }
    public string Name { get; set; }
    public double Amount { get; set; }
    public string Description { get; set; }
}

public class IncomingInfo
{
    public DateTime Date { get; set; }
    public string Name { get; set; }
    public double Amount { get; set; }
}

public class Modifier
{
    public void AddOrUpdate(IList<IncomingInfo> info)
    {
        using (var context = new MyDbContext())
        {
            //Find the new information to add as new entities
            IEnumerable<MyEntity> EntitiesToAdd = ??
            //Find the information to update in existing entities
            IEnumerable<MyEntity> EntitiesToUpdate = ??
        }
    }
}
Can someone help me construct my query?
Thank you very much.
Edit:
Sorry, I forgot to explain how I consider two entities equal.
They are equal if the Date and the Name properties are identical.
I first tried to build a predicate using LinqKit's PredicateBuilder without much success (I encountered a "parameter too large" error and had to make multiple queries, which took time).
So far the most successful approach I have found is a LEFT OUTER JOIN between the incoming list and the DbSet, which I implemented this way:
var values = info.GroupJoin(context.MyEntities,
    inf => inf.Name + inf.Date.ToString(),
    ent => ent.Name + ent.Date.ToString(),
    (inf, ents) => new { Info = inf, Entities = ents })
    .SelectMany(i => i.Entities.DefaultIfEmpty(),
        (i, ent) => new { i.Info.Name, i.Info.Amount, i.Info.Date, ToBeAdded = ent == null });

IEnumerable<MyEntity> EntitiesToAdd = values.Where(i => i.ToBeAdded)
    .Select(i => new MyEntity
    {
        Id = Guid.NewGuid(),
        Amount = i.Amount,
        Date = i.Date,
        Name = i.Name,
        Description = null
    }).ToList();
My test contains 700,000 entities in the database. The incoming info list contains 70,000 items, of which 50,000 are existing entities and 20,000 are new.
This query takes around 15 seconds to execute, which does not seem right to me.
Hopefully this is enough to ask for help. Can someone help me on this?
Thank you very much.
I read the pastebin response from @Leniency and it covers some of the same things I was going to say, like querying a date range and comparing there. The problem with that method, though, is that (depending on how those dates are set) it might return all 700K+ records in the database, which would give you the absolute worst performance.
My suggestion is that you analyze your network topology to see how expensive your calls to the database really are. I'm assuming this is running on a (web) server which is receiving these IncomingInfo objects from clients. If this server is closely connected to your database server (or on the same machine) then you might be better off not optimizing your calls to the database.
Also, if you have control over the behavior of the clients, you might want to force them to send only like 25 to 100 records with each request. This would make it so that you could deal with them in much more manageable chunks. The client might have to send 100 or more requests to the server (which you could do async so that they get sent ~5 at a time, depending on expected load profiles), but at least it wouldn't be sitting there for 5+ minutes waiting to get a response back from the server for a single request.
BTW, the GroupJoin call that you said took 15 seconds probably has to download all 700K records before doing the join. Joins can't be done on objects that don't exist on the same machine: either all the IncomingInfo objects (or at least the Name + Date.ToString() concatenations) have to be sent to the database, or all the records have to be requested from the database before any join can be done. You would have to look at the SQL being sent to the database to tell which method is being used. But you would probably find that querying the database for matches one at a time would be faster than the join in this case.
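To illustrate that last point, here is a minimal sketch of a batched variant (assuming EF 5/6 against SQL Server; the batch size and the "Name|Ticks" composite key are arbitrary choices, not part of the question):

const int batchSize = 1000;
using (var context = new MyDbContext())
{
    for (int offset = 0; offset < info.Count; offset += batchSize)
    {
        var batch = info.Skip(offset).Take(batchSize).ToList();
        var names = batch.Select(b => b.Name).Distinct().ToList();
        var dates = batch.Select(b => b.Date).Distinct().ToList();

        // Over-fetch candidate rows by Name and Date, then match exactly in
        // memory. Assumes Name+Date is unique, per the question's equality rule.
        var existing = context.MyEntities
            .Where(e => names.Contains(e.Name) && dates.Contains(e.Date))
            .ToList()
            .ToDictionary(e => e.Name + "|" + e.Date.Ticks);

        foreach (var item in batch)
        {
            MyEntity entity;
            if (existing.TryGetValue(item.Name + "|" + item.Date.Ticks, out entity))
            {
                entity.Amount = item.Amount; // existing row: update in place
            }
            else
            {
                context.MyEntities.Add(new MyEntity
                {
                    Id = Guid.NewGuid(),
                    Name = item.Name,
                    Date = item.Date,
                    Amount = item.Amount
                });
            }
        }
    }
    context.SaveChanges();
}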
Hope that helps! ;)

History tables in .NET MVC code first approach?

I need to track the change history of some database objects in a .NET MVC application using the code-first approach.
Here is what is meant by history table:
http://database-programmer.blogspot.de/2008/07/history-tables.html
I would use a history table for this if I were writing the SQL queries myself. But in the code-first approach the SQL is generated, and I would like to stick to this paradigm.
The goal is a structure that holds all "old" revisions of changed/deleted entries together with some additional information (e.g. timestamp, user who changed it, ...)
Any ideas?
Regards,
Stefan
To be more specific, here is a code example:
public class Node {
    public int NodeID { get; set; }
    public string? data { get; set; } // sample data
}

public class NodeHistory {
    public int NodeID { get; set; }
    public string? data { get; set; }
    public int UserID { get; set; }
    public DateTime timestamp { get; set; }
}
What I need is some "framework" assistance to be able to add an entry to NodeHistory whenever a change to a Node is persisted.
That means just overriding the setter isn't a solution, as it would also create an entry if the change to the Node is not persisted in the end (e.g. a rollback).
I think the best approach would be to use a repository pattern and insert into the NodeHistory table on every operation on the Node object that you see fit to keep a history of.
EDIT: Some code
public class NodeRepository {
    public Node EditNode(Node toEdit, int userId) {
        using (var scope = new TransactionScope())
        {
            // Edit the Node in NodeContext like you would anyway, without the repository
            NodeContext.NodeHistories.Add(new NodeHistory { /* initialise NodeHistory fields here */ });
            NodeContext.SaveChanges();
            scope.Complete(); // commit the edit and the history entry atomically
            return toEdit;
        }
    }
}

public class NodeContext : DbContext {
    public DbSet<Node> Nodes { get; set; }
    public DbSet<NodeHistory> NodeHistories { get; set; }
}
If you are looking for something simpler than this, then I have no idea what it might be.
This is really something you should do with a trigger. Yes, you have to write some SQL for it, but then the history is updated no matter how the update occurs, whether manually or through some other means.

CRUD on related entities using Dapper

My application has an entity model as below and uses Dapper:
public class Goal
{
    public string Text { get; set; }
    public List<SubGoal> SubGoals { get; set; }
}

public class SubGoal
{
    public string Text { get; set; }
    public List<Practise> Practices { get; set; }
    public List<Measure> Measures { get; set; }
}
and has a repository as below:
public interface IGoalPlannerRepository
{
    IEnumerable<Goal> FindAll();
    Goal Get(int id);
    void Save(Goal goal);
}
I came across the two scenarios below:
1) When retrieving data (a goal entity), all the related objects in the hierarchy need to be retrieved as well (all subgoals, along with their practices and measures).
2) When a goal is saved, all the related data needs to be inserted and/or updated.
Please suggest: is there a better way to handle these scenarios other than "looping through" the collections and writing lots and lots of SQL queries?
The best way to do large batch data updates in SQL using Dapper is with compound queries.
You can retrieve all your objects in one query as multiple result sets, like this:
CREATE PROCEDURE get_GoalAndAllChildObjects
    @goal_id int
AS
SELECT * FROM goal WHERE goal_id = @goal_id
SELECT * FROM subgoals WHERE goal_id = @goal_id
Then, you write a Dapper function that retrieves the objects like this:
using (var multi = connection.QueryMultiple("get_GoalAndAllChildObjects",
    new { goal_id = m_goal_id }, commandType: CommandType.StoredProcedure))
{
    var goal = multi.Read<Goal>().Single();         // first result set
    goal.SubGoals = multi.Read<SubGoal>().ToList(); // second result set
}
Next comes updating large data in batches. You do that through table-valued parameter inserts (I wrote an article on this here: http://www.altdevblogaday.com/2012/05/16/sql-server-high-performance-inserts/ ). Basically, you create one table type for each kind of data you are going to insert, then write a procedure that takes those tables as parameters and writes them to the database.
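A minimal client-side sketch of such a table-valued parameter insert (the dbo.SubGoalType table type and the save_SubGoals procedure are hypothetical names that would have to exist in the database):

var table = new DataTable();
table.Columns.Add("Text", typeof(string));
foreach (var subGoal in goal.SubGoals)
    table.Rows.Add(subGoal.Text);

using (var cmd = new SqlCommand("save_SubGoals", connection))
{
    cmd.CommandType = CommandType.StoredProcedure;
    var p = cmd.Parameters.AddWithValue("@rows", table);
    p.SqlDbType = SqlDbType.Structured; // marks the parameter as a TVP
    p.TypeName = "dbo.SubGoalType";
    cmd.ExecuteNonQuery();
}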
This is super high performance and about as optimized as you can get, plus the code isn't too complex.
However, I need to ask: is there any point in keeping "subgoals" and all the other objects relational? One easy alternative is to create an XML or JSON document that contains your goal and all its child objects serialized into text, and just save that to the file system. It's unbelievably high performance, very simple, very extensible, and takes very little code. The only downside is that you can't browse across all subgoals with a SQL statement without a bit of work. Consider it - it might be worth a thought ;)

Optimizing Repository’s SubmitChanges Method

I have the following repository, with a mapping between the LINQ to SQL generated classes and domain objects using a factory.
The following code works, but I see two potential issues:
1) It issues a SELECT query before the UPDATE statement.
2) It needs to update all the columns (not only the changed ones), because we don't know which columns were changed in the domain object.
How can I overcome these shortcomings?
Note: There can be scenarios (like triggers) that execute based on updates to specific columns, so I cannot update a column unnecessarily.
REFERENCE:
LINQ to SQL: Updating without Refresh when “UpdateCheck = Never”
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=113917
CODE
namespace RepositoryLayer
{
    public interface ILijosBankRepository
    {
        void SubmitChangesForEntity(DomainEntitiesForBank.IBankAccount iBankAcc);
    }

    public class LijosSimpleBankRepository : ILijosBankRepository
    {
        private IBankAccountFactory bankFactory = new MySimpleBankAccountFactory();

        public System.Data.Linq.DataContext Context
        {
            get;
            set;
        }

        public virtual void SubmitChangesForEntity(DomainEntitiesForBank.IBankAccount iBankAcc)
        {
            //Does not get help from automated change tracking (due to mapping)
            //Selecting the required entity
            DBML_Project.BankAccount tableEntity = Context.GetTable<DBML_Project.BankAccount>().SingleOrDefault(p => p.BankAccountID == iBankAcc.BankAccountID);

            if (tableEntity != null)
            {
                //Setting all the values to update (except the primary key)
                tableEntity.Status = iBankAcc.AccountStatus;

                //Type checking
                if (iBankAcc is DomainEntitiesForBank.FixedBankAccount)
                {
                    tableEntity.AccountType = "Fixed";
                }
                if (iBankAcc is DomainEntitiesForBank.SavingsBankAccount)
                {
                    tableEntity.AccountType = "Savings";
                }

                Context.SubmitChanges();
            }
        }
    }
}

namespace DomainEntitiesForBank
{
    public interface IBankAccount
    {
        int BankAccountID { get; set; }
        double Balance { get; set; }
        string AccountStatus { get; set; }
        void FreezeAccount();
    }

    public class FixedBankAccount : IBankAccount
    {
        public int BankAccountID { get; set; }
        public string AccountStatus { get; set; }
        public double Balance { get; set; }

        public void FreezeAccount()
        {
            AccountStatus = "Frozen";
        }
    }
}
If I understand your question, you are being passed an entity that you need to save to the database without knowing what the original values were, or which of the columns have actually changed.
If that is the case, then you have four options:
1) Go back to the database to get the original values, i.e. perform the SELECT, as your code is doing. This allows you to set all your entity values, and LINQ to SQL will take care of which columns actually changed. So if none of your columns actually changed, no UPDATE statement is issued.
2) Avoid the SELECT and just update the columns. You already know how to do this (but for others, see this question and answer). Since you don't know which columns have changed, you have no option but to set them all. This produces an UPDATE statement even if no columns actually changed, and that can fire database triggers. Apart from disabling the triggers, about the only thing you can do here is make sure the triggers are written to compare old and new column values, to avoid any further unnecessary updates.
3) Change your requirements/program so that you receive both the old and the new entity values, letting you determine which columns changed without going back to the database.
4) Don't use LINQ for your updates. LINQ stands for Language Integrated Query and it is (IMHO) brilliant at querying, but I have always looked on the updating/deleting features as an extra bonus, not something it was designed for. Also, if timing/performance is critical, there is no way LINQ will match properly hand-crafted SQL.
This isn't really a DDD question; from what I can tell you are asking:
Use linq to generate direct update without select
There the accepted answer was no, it's not possible; but there's a higher-voted answer that suggests you can attach an object to your context to initiate the change tracking of the data context.
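For illustration, a hedged sketch of that attach approach against the question's classes (it assumes the mapped columns use UpdateCheck = Never or a version member; otherwise LINQ to SQL cannot attach a modified entity without its original values):

var tableEntity = new DBML_Project.BankAccount
{
    BankAccountID = iBankAcc.BankAccountID,
    Status = iBankAcc.AccountStatus,
    AccountType = iBankAcc is DomainEntitiesForBank.FixedBankAccount ? "Fixed" : "Savings"
};

// Attach as modified: LINQ to SQL issues an UPDATE without a prior SELECT.
Context.GetTable<DBML_Project.BankAccount>().Attach(tableEntity, true);
Context.SubmitChanges();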
Your second point about disabling triggers has been answered here and here. But as others have commented, do you really need the triggers? Should you not be controlling these updates in code?
In general I think you're looking at premature optimization. You're using an ORM, and as part of that you're trusting L2S to make the database plumbing decisions for you. But remember, where appropriate you can use stored procedures to execute your specific SQL.

Nhibernate, painfully slow query, am I doing it wrong?

I have some major performance issues when running a specific NHibernate query.
I have two tables, A and B, where A has ~4,000 rows and B has ~50,000 rows. The relation between A and B is one-to-many.
So the query needs to load all entities in A and then force-load all entities in B, because I want to aggregate over the entities in B.
I'm using Fluent NHibernate and have configured it to allow lazy loading. This works great for all other queries except this one, where I have to load ~50,000 entities, a number that will likely grow by 50k a month. The query takes over a minute to run now (and will probably get even slower).
Obvious optimizations that I've already done: I only create one session factory, and lazy loading is not turned off.
So my question is this: will NHibernate be too slow in this respect (that is, should I build my DAL with plain SQL queries rather than NHibernate), or is there a way to improve the performance? This is a reporting application, so there won't be many concurrent users, but I would still like this query to take less than 5-10 seconds.
EDIT
Adding code:
public class ChatSessions
{
    public virtual int Id { get; set; }
    public virtual IList<ChatComments> Comments { get; set; }

    public ChatSessions()
    {
        Comments = new List<ChatComments>();
    }
}

public ChatCommentsMapping()
{
    Id(x => x.Id);
    References(x => x.ChatSession);
}

public class ChatComments
{
    public virtual int Id { get; set; }
    public virtual ChatSessions ChatSession { get; set; }
    public virtual string Comment { get; set; }
    public virtual DateTime TimeStamp { get; set; }
    public virtual int CommentType { get; set; }
    public virtual bool Deleted { get; set; }
    public virtual string ChatAlias { get; set; }
}

public ChatSessionsMapping()
{
    Id(x => x.Id);
    References(x => x.ChatRoom)
        .Not.LazyLoad();
    HasMany(x => x.Comments)
        .Table("chatcomments");
}
Then in my repository I use this query:
public IList<ChatComments> GetChatCommentsBySession(int chatsessionid)
{
    using (var session = _factory.OpenSession())
    {
        var chatsession = session.Get<ChatSessions>(chatsessionid);
        NHibernateUtil.Initialize(chatsession.Comments);
        return chatsession.Comments;
    }
}
And that method gets called once for every chat session.
The query that I aggregate with then looks something like this:
foreach (var hour in groupedByHour)
{
    var datetime = hour.Sessions.First().StartTimeStamp;
    var dp = new DataPoint<DateTime, double>
    {
        YValue = hour.Sessions.Select(x =>
                _chatCommentsRepo.GetChatCommentsBySession(x.Id).Count)
            .Aggregate((counter, item) => counter += item),
        XValue = new DateTime(datetime.Year, datetime.Month, datetime.Day, datetime.Hour, 0, 0)
    };
    datacollection.Add(dp);
}
Selecting 50,000 rows of any size is never going to be quick, but consider using a subselect fetching strategy - it should work a lot better in your scenario. Also, make sure you have an index on the foreign key in your database.
There's an example of what could be happening at the NHProf site
EDIT: I'd thoroughly recommend NHProf if you're doing any work with NHibernate - it's a quick way to get to WIN.
I posted a comment then re-read your question and suspect that you are probably utilizing NHibernate in a manner for which it's not ideal. You say you're pulling the table B rows to aggregate over them. Are you doing this using LINQ or something on the collections after you've pulled the individual records via NH?
If so, you might want to consider utilizing NH's capability to create projections that will perform the aggregates for you. In this way, NH will generate the SQL to do the aggregations, which in most cases is going to be much faster than doing 4000 retrievals of related items then performing aggregates in code.
This SO question might get you started: What's the best way to get aggregate results from NHibernate?
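For a flavor of what such a projection looks like, here is a hedged sketch using the criteria API and the question's classes (the alias and property paths are assumptions that may need adjusting to your mappings):

var countsBySession = session.CreateCriteria<ChatComments>()
    .CreateAlias("ChatSession", "cs")
    .SetProjection(Projections.ProjectionList()
        .Add(Projections.GroupProperty("cs.Id")) // group by session
        .Add(Projections.RowCount()))            // count comments per session
    .List<object[]>();

foreach (var row in countsBySession)
{
    int sessionId = (int)row[0];
    int commentCount = (int)row[1];
    // Aggregate per hour here instead of one round trip per session.
}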
UPDATE
Yeah, looking at your code, you're defeating lazy loading: a separate query fires for each of your chat sessions in order to pull its comments. It's taking forever because you're essentially doing 8,000 separate queries.
It appears that you're trying to return a count of comments by hour. You can either write some manual SQL that groups on a DATEPART expression, or incorporate the DATEPART evaluation in your criteria, like in this SO question: How to use DatePart in an NHibernate Criteria Query.
