Using LINQ, I am having trouble querying my DbContext in an efficient way.
The database contains over 700,000 entities, each of which has a date, a name, and other information.
In my code, a new list of objects (which can potentially have 100,000 elements) comes in, and I would like to query my database and work out which items are new entities and which are existing entities that need to be updated.
I would like to do this very efficiently (with a single query if possible).
This is my code:
public class MyDbContext : DbContext
{
    public DbSet<MyEntity> MyEntities { get; set; }
}

public class MyEntity
{
    [Key]
    public Guid Id { get; set; }
    public DateTime Date { get; set; }
    public string Name { get; set; }
    public double Amount { get; set; }
    public string Description { get; set; }
}
public class IncomingInfo
{
    public DateTime Date { get; set; }
    public string Name { get; set; }
    public double Amount { get; set; }
}
public class Modifier
{
    public void AddOrUpdate(IList<IncomingInfo> info)
    {
        using (var context = new MyDbContext())
        {
            // Find the new information
            // to add as new entities
            IEnumerable<MyEntity> EntitiesToAdd = ??

            // Find the information
            // to update in existing entities
            IEnumerable<MyEntity> EntitiesToUpdate = ??
        }
    }
}
Can someone help me construct this query?
Thank you very much.
Edit:
Sorry, I forgot to explain when I consider two entities equal.
They are equal if the Date and Name properties are identical.
I first tried to build a predicate using LinqKit's PredicateBuilder without much success (I hit the limit on the number of query parameters and had to split the work into multiple queries, which took time).
So far the most successful approach I have found is to implement a LEFT OUTER JOIN between the incoming list and the DbSet,
which I implemented this way:
var values = info.GroupJoin(context.MyEntities,
        inf => inf.Name + inf.Date.ToString(),
        ent => ent.Name + ent.Date.ToString(),
        (inf, ents) => new { Info = inf, Entities = ents })
    .SelectMany(i => i.Entities.DefaultIfEmpty(),
        (i, ent) => new { i.Info.Name, i.Info.Amount, i.Info.Date, ToBeAdded = ent == null });

IEnumerable<MyEntity> EntitiesToAdd = values.Where(i => i.ToBeAdded)
    .Select(i => new MyEntity
    {
        Id = Guid.NewGuid(),
        Amount = i.Amount,
        Date = i.Date,
        Name = i.Name,
        Description = null
    }).ToList();
My test database contains 700,000 entities. The incoming info list contains 70,000 items, of which 50,000 match existing entities and 20,000 are new.
This query takes around 15 seconds to execute, which does not seem right to me.
Hopefully this is enough to ask for help. Can someone help me with this?
Thank you very much.
I read the pastebin response from @Leniency and it covers some of the same things I was going to say, like querying a date range and comparing there. The problem with that method, though, is that (depending on how those dates are set) it might return all 700K+ records in the database, which would give you the absolute worst performance.
My suggestion is that you analyze your network topology to see how expensive your calls to the database really are. I'm assuming this is running on a (web) server which receives these IncomingInfo objects from clients. If this server is closely connected to your database server (or on the same machine), then you might be better off not optimizing your calls to the database.
Also, if you have control over the behavior of the clients, you might want to force them to send only 25 to 100 records with each request. This would let you deal with them in much more manageable chunks. The clients might have to send 100 or more requests to the server (which you could do async so that ~5 are in flight at a time, depending on expected load profiles), but at least they wouldn't be sitting there for 5+ minutes waiting to get a response back from the server for a single request.
BTW, the GroupJoin call that you said took 15 seconds is probably having to download all 700K records before doing the join. Joins can't be performed between objects that don't live on the same machine: either all the IncomingInfo objects (or at least the Name + Date.ToString() concatenations) have to be sent to the database, or all the records have to be pulled from the database before any join can happen. You would have to look at the SQL being sent to the database to tell which method is being used, but you would probably find that querying the database for matches one at a time (or in small batches) is faster than this join.
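To illustrate that last idea (just a sketch, not from the original post; the batch size and the composite-key format are my assumptions), you could look up existing rows in batches so that each SQL statement stays small:

// A sketch: process the incoming items in batches of ~1000 so each
// query stays well under SQL Server's parameter limits.
var existing = new List<MyEntity>();
for (int i = 0; i < info.Count; i += 1000)
{
    var batch = info.Skip(i).Take(1000).ToList();
    var dates = batch.Select(b => b.Date).Distinct().ToList();
    var names = batch.Select(b => b.Name).Distinct().ToList();

    // Over-fetches the cross product of the batch's dates and names,
    // then filters down to exact (Date, Name) pairs in memory.
    var candidates = context.MyEntities
        .Where(e => dates.Contains(e.Date) && names.Contains(e.Name))
        .ToList();

    var keys = new HashSet<string>(batch.Select(b => b.Name + "|" + b.Date.Ticks));
    existing.AddRange(candidates.Where(c => keys.Contains(c.Name + "|" + c.Date.Ticks)));
}
// Anything in info whose (Date, Name) key was not matched here is new.

Each round trip then touches at most a few thousand rows instead of the whole table.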
Hope that helps! ;)
Is it possible to run LINQ to SQL queries when the underlying database structure changes from time to time? (I mean database updates that happen due to business requirements; since the database is shared among several apps, they may happen without any announcement to me.)
Is there any way I can connect to the new database structure in LINQ to SQL without updating the .dbml file in my source code?
If I want to run raw queries, knowing that my database structure changes over time, can I still use any of LINQ to SQL's benefits somehow?
Provided the structure you have in your classes matches your tables (at least covering all the fields you need), you can do that. For example, the Northwind Customers table has more than four fields in reality; provided the four below are still in that table, this would work:
using System;
using System.Data.Linq;
using System.Data.Linq.Mapping;
using System.Linq;

void Main()
{
    var db = new DataContext(@"server=.\SQLexpress;trusted_connection=yes;database=Northwind");
    Table<Customer> customers = db.GetTable<Customer>();
    var data = customers.Where(c => c.Country == "USA");
    foreach (var customer in data)
    {
        Console.WriteLine($"{customer.CustomerID}, {customer.CompanyName}");
    }
}

[Table(Name = "Customers")]
public class Customer
{
    [Column]
    public string CustomerID { get; set; }
    [Column]
    public string CompanyName { get; set; }
    [Column]
    public string ContactName { get; set; }
    [Column]
    public string Country { get; set; }
}
For raw SQL, you could again use a type covering the fields in the select list, or dynamic.
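For instance (a minimal sketch; ExecuteQuery<T> maps the columns in the select list onto the members of the given type by name):

var usCustomers = db.ExecuteQuery<Customer>(
    @"SELECT CustomerID, CompanyName, ContactName, Country
      FROM Customers
      WHERE Country = {0}", "USA");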
Note: for inserts to work with this approach, fields that are not in your model should either accept null or have default values.
I am trying to work out how to use .NET Entity Framework to generate both readable, natural code and efficient SQL query statements when fetching related entities. For example, given the following code-first definition
public class WidgetContext : DbContext
{
    public DbSet<Widget> Widgets { get; set; }
    public DbSet<Gizmo> Gizmos { get; set; }
}

public class Widget
{
    public virtual int Id { get; set; }
    [Index]
    [MaxLength(512)]
    public virtual string Name { get; set; }
    public virtual ICollection<Gizmo> Gizmos { get; set; }
}

public class Gizmo
{
    public virtual long Id { get; set; }
    [Index]
    [MaxLength(512)]
    public virtual string Name { get; set; }
    public virtual Widget Widget { get; set; }
    public virtual int WidgetId { get; set; }
}
I want to be able to write code like
using (var wc = new WidgetContext())
{
    var widget = wc.Widgets.First(x => x.Id == 123);
    var gizmo = widget.Gizmos.First(x => x.Name == "gizmo 99");
}
and see a SQL query created along the lines of
SELECT TOP (1) * from Gizmos WHERE WidgetId = 123 AND Name = 'gizmo 99'
So that the work of picking the right Gizmo is performed by the database. This is important because in my use case each Widget can have thousands of related Gizmos, and in a particular request I only need to retrieve one at a time. Unfortunately, the code above causes Entity Framework to create SQL like this instead
SELECT * from Gizmos WHERE WidgetId = 123
The match on Gizmo.Name is then being performed in memory by scanning the complete set of related Gizmo entities.
After a good deal of experimentation, I have found ways of creating the efficient SQL use I am looking for in the entity framework, but only by using ugly code which is much less natural to write. The example below illustrates this.
using System;
using System.Data.Entity;
using System.Data.Entity.Core.Objects.DataClasses;
using System.Linq;

static void Main(string[] args)
{
    Database.SetInitializer(new DropCreateDatabaseAlways<WidgetContext>());
    using (var wc = new WidgetContext())
    {
        var widget = new Widget() { Name = "my widget" };
        wc.Widgets.Add(widget);
        wc.SaveChanges();
    }
    using (var wc = new WidgetContext())
    {
        var widget = wc.Widgets.First();
        for (int i = 0; i < 1000; i++)
            widget.Gizmos.Add(new Gizmo() { Name = string.Format("gizmo {0}", i) });
        wc.SaveChanges();
    }
    using (var wc = new WidgetContext())
    {
        wc.Database.Log = Console.WriteLine;
        var widget = wc.Widgets.First();
        Console.WriteLine("=====> Query 1");
        // Queries all gizmos associated with the widget, then runs the 'First' filter in memory. Nice code, ugly database usage.
        var g1 = widget.Gizmos.First(x => x.Name == "gizmo 99");
        Console.WriteLine("=====> Query 2");
        // Queries the DB with two terms in the WHERE clause - only pulls one record. Good SQL, ugly code.
        var g2 = ((EntityCollection<Gizmo>)widget.Gizmos).CreateSourceQuery().First(x => x.Name == "gizmo 99");
        Console.WriteLine("=====> Query 3");
        // Queries the DB with two terms in the WHERE clause - only pulls one record. Good SQL, ugly code.
        var g3 = wc.Gizmos.First(x => x.Name == "gizmo 99" && x.WidgetId == widget.Id);
        Console.WriteLine("=====> Query 4");
        // Queries the DB with two terms in the WHERE clause - only pulls one record. Also good SQL, ugly code.
        var g4 = wc.Entry(widget).Collection(x => x.Gizmos).Query().First(x => x.Name == "gizmo 99");
    }
    Console.ReadLine();
}
Query 1 demonstrates the 'fetch everything and filter' approach that is generated by the natural usage of the entity objects.
Queries 2, 3, and 4 above all generate what I would consider an efficient SQL query - one that returns a single row and has two terms in the WHERE clause - but they all involve very stilted C# code.
Does anyone have a solution that will allow natural C# code to be written and generate efficient SQL utilization in this case?
I should note that I have tried replacing ICollection with EntityCollection in my Widget object to allow the cast to be removed from the Query 2 code above. Unfortunately this leads to an EntityException telling me that
The object could not be added to the EntityCollection or
EntityReference. An object that is attached to an ObjectContext cannot
be added to an EntityCollection or EntityReference that is not
associated with a source object.
when I try to retrieve any related objects.
Any suggestions appreciated.
OK, further digging has gotten me as close as I think is possible to where I want to be (which, to reiterate, is code that looks OO but generates efficient DB usage patterns).
It turns out that Query 2 above (casting the related collection to an EntityCollection) isn't actually a good solution: although it generates the desired query type against the database, the mere act of fetching the Gizmos collection from the widget is enough to make Entity Framework go to the database and fetch all related Gizmos - i.e., to perform exactly the query I am trying to avoid.
However, it's possible to get the EntityCollection for a relationship without calling the getter of the collection property, as described here: http://blogs.msdn.com/b/alexj/archive/2009/06/08/tip-24-how-to-get-the-objectcontext-from-an-entity.aspx. This approach sidesteps Entity Framework's fetching of related entities when you access the Gizmos collection property.
So, an additional read-only property on the Widget can be added like this
public IQueryable<Gizmo> GizmosQuery
{
    get
    {
        var relationshipManager = ((IEntityWithRelationships)this).RelationshipManager;
        return (IQueryable<Gizmo>)relationshipManager
            .GetAllRelatedEnds()
            .First(x => x is EntityCollection<Gizmo>)
            .CreateSourceQuery();
    }
}
and then the calling code can look like this
var g1 = widget.GizmosQuery.First(x => x.Name == "gizmo 99");
This approach generates SQL that efficiently fetches only a single row from the database, but it depends on the following conditions holding true:
1. There is only one relationship from the source to the target type. Multiple relationships linking a Widget to Gizmos would require a more complicated predicate in the .First() call in GizmosQuery.
2. Proxy creation is enabled for the DbContext and the Widget class is eligible for proxy generation (https://msdn.microsoft.com/en-us/library/vstudio/dd468057%28v=vs.100%29.aspx).
3. The GizmosQuery property must not be called on objects newly created using new Widget(), since these will not be proxies and will not implement IEntityWithRelationships. New objects that are valid proxies can be created using wc.Widgets.Create() instead if necessary.
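If those conditions are too restrictive, a lighter-weight alternative is to hide the Query 4 pattern behind an extension method (a sketch of my own; the method name is hypothetical, but Entry(...).Collection(...).Query() is the documented DbContext API):

using System;
using System.Collections.Generic;
using System.Data.Entity;
using System.Linq;
using System.Linq.Expressions;

public static class DbContextExtensions
{
    // Builds a filtered IQueryable over a related collection without
    // loading the collection itself into memory.
    public static IQueryable<TElement> QueryCollection<TEntity, TElement>(
        this DbContext context,
        TEntity entity,
        Expression<Func<TEntity, ICollection<TElement>>> navigation)
        where TEntity : class
        where TElement : class
    {
        return context.Entry(entity).Collection(navigation).Query();
    }
}

The calling code then stays reasonably natural: var g = wc.QueryCollection(widget, w => w.Gizmos).First(x => x.Name == "gizmo 99");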
I'm new to the world of .Net, ASP, Entity Framework, and Linq, so bear with me...
I originally had a model set up like the following;
public class Pad
{
    [Key]
    [DatabaseGenerated(DatabaseGeneratedOption.Identity)]
    public Guid PadId { get; set; }
    public string StreetAddress { get; set; }
    public int ZipCode { get; set; }
    public virtual ICollection<Mate> Mates { get; set; }
    public virtual ICollection<Message> Messages { get; set; }
}
(A pad is a chat room - it contains many Mates and many thousands of Messages)
In a Web API controller, I have a function designed to get the 25 most recent messages from the specified Pad.
public IHttpActionResult GetMessages(string id)
{
    var padGuid = new Guid(id);

    // Try to find the pad referenced by the passed ID
    var pads = (from p in db.Pads
                where p.PadId == padGuid
                select p);
    if (pads.Count() <= 0)
    {
        return NotFound();
    }
    var pad = pads.First();

    // Grab the last 25 messages in this pad.
    // PERFORMANCE PROBLEM
    var messages = pad.Messages.OrderBy(c => c.SendTime).Skip(Math.Max(0, pad.Messages.Count() - 25));
    var messagesmodel = from m in messages
                        select m.toViewModel();
    return Ok(messagesmodel);
}
The problem with this implementation is that EF seems to load the entire set of messages (many thousands) into memory before getting the count, ordering, and so on, resulting in a massive performance penalty for Pads with a ton of messages.
My first thought was to change the type of the Pad.Messages member to IQueryable instead of ICollection - that should defer the LINQ queries to SQL, or so I thought. Upon doing this, however, calls like pad.Messages.Count() above fail - it turns out pad.Messages is null! And it breaks in other places, such as when adding new Messages to Pad.Messages.
What is the proper implementation of something like this? Elsewhere I've seen the recommended solution of constructing a second query against the context, such as select Messages where PadId = n, but that hardly seems intuitive when I have a Messages member to work with.
Thank you!
var messages = db.Pads.Where(p => p.PadId == padGuid)
    .SelectMany(pad => pad.Messages
        .OrderBy(c => c.SendTime)
        .Skip(Math.Max(0, pad.Messages.Count() - 25)));
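An equivalent shape that avoids the nested Count() (a sketch: sort descending, take 25, and let the final OrderBy re-sort ascending, all of which translates to SQL) would be:

var messages = db.Pads.Where(p => p.PadId == padGuid)
    .SelectMany(pad => pad.Messages
        .OrderByDescending(c => c.SendTime)
        .Take(25))
    .OrderBy(c => c.SendTime);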
How do you plan to count the number of results in a DB query without actually executing the DB query?
How do you plan to get the first item in the query without actually executing the query?
You cannot do either; both must execute the query.
I have some major performance issues when running a specific NHibernate query.
I have two tables, A and B, where A has ~4,000 rows and B has ~50,000 rows. The relation between A and B is one-to-many.
So the query needs to load all entities in A and then force-load all entities in B, because I want to aggregate over the entities in B.
I'm using Fluent NHibernate and have configured it to allow lazy loading; this works great for all other queries except this one, where I have to load ~50,000 entities, a number that will likely grow by 50k a month. The query takes over a minute to run now (probably even longer).
Obvious optimizations that I've already done: only one SessionFactory is created, and lazy loading is not turned off.
So my question is this: will NHibernate be too slow here (that is, should I build my DAL with plain SQL queries rather than NHibernate), or is there a way to improve the performance? This is a reporting application, so there won't be many concurrent users, but I would still like this query to take no more than 5-10 seconds.
EDIT
Adding code:
public class ChatSessions
{
    public virtual int Id { get; set; }
    public virtual IList<ChatComments> Comments { get; set; }

    public ChatSessions()
    {
        Comments = new List<ChatComments>();
    }
}

public ChatCommentsMapping()
{
    Id(x => x.Id);
    References(x => x.ChatSession);
}

public class ChatComments
{
    public virtual int Id { get; set; }
    public virtual ChatSessions ChatSession { get; set; }
    public virtual string Comment { get; set; }
    public virtual DateTime TimeStamp { get; set; }
    public virtual int CommentType { get; set; }
    public virtual bool Deleted { get; set; }
    public virtual string ChatAlias { get; set; }
}

public ChatSessionsMapping()
{
    Id(x => x.Id);
    References(x => x.ChatRoom)
        .Not.LazyLoad();
    HasMany(x => x.Comments)
        .Table("chatcomments");
}
Then in my repo I use this query:
public IList<ChatComments> GetChatCommentsBySession(int chatsessionid)
{
    using (var session = _factory.OpenSession())
    {
        var chatsession = session.Get<ChatSessions>(chatsessionid);
        NHibernateUtil.Initialize(chatsession.Comments);
        return chatsession.Comments;
    }
}
And that method gets called once for every ChatSession.
The query that I aggregate with then looks something like this:
foreach (var hour in groupedByHour)
{
    var datetime = hour.Sessions.First().StartTimeStamp;
    var dp = new DataPoint<DateTime, double>
    {
        YValue = hour.Sessions.Select(x =>
                _chatCommentsRepo.GetChatCommentsBySession(x.Id).Count)
            .Aggregate((counter, item) => counter += item),
        XValue = new DateTime(datetime.Year, datetime.Month, datetime.Day, datetime.Hour, 0, 0)
    };
    datacollection.Add(dp);
}
Selecting 50,000 rows of any size is never going to be quick, but consider using a subselect fetching strategy - it should work a lot better in your scenario. Also, make sure you have an index on the foreign key in your database.
There's an example of what could be happening at the NHProf site
EDIT: I'd thoroughly recommend NHProf if you're doing any work with NHibernate - it's a quick way to get to WIN.
I posted a comment then re-read your question and suspect that you are probably utilizing NHibernate in a manner for which it's not ideal. You say you're pulling the table B rows to aggregate over them. Are you doing this using LINQ or something on the collections after you've pulled the individual records via NH?
If so, you might want to consider utilizing NH's capability to create projections that will perform the aggregates for you. In this way, NH will generate the SQL to do the aggregations, which in most cases is going to be much faster than doing 4000 retrievals of related items then performing aggregates in code.
This SO question might get you started: What's the best way to get aggregate results from NHibernate?
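As a sketch of that idea in HQL (entity and property names taken from the code in the question), a single grouped query replaces the ~4,000 per-session round trips:

// One round trip: comment counts per session, aggregated by the database.
var counts = session.CreateQuery(
        @"select c.ChatSession.Id, count(c.Id)
          from ChatComments c
          group by c.ChatSession.Id")
    .List<object[]>();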
UPDATE
Yeah, looking at your code, you're forcing each collection to load separately, which fires off a separate query for each of your chat items in order to pull the comments. It's taking forever because you're essentially doing 8000 separate queries.
It appears that you're trying to return a count of comments by hour. You can either do some manual SQL to split your comment timestamps by grouping on a DATEPART SQL expression, or incorporate the DATEPART evaluation into your criteria, as in this SO question: How to use DatePart in an NHibernate Criteria Query.
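A raw SQL variant of that approach might look like this (a sketch; the table and column names come from the mappings above, and the DATEADD/DATEDIFF pair is the usual T-SQL trick for bucketing timestamps by hour):

// Let the server group comments into hourly buckets instead of
// counting them in C# one session at a time.
var perHour = session.CreateSQLQuery(
        @"SELECT DATEADD(hour, DATEDIFF(hour, 0, TimeStamp), 0) AS HourBucket,
                 COUNT(*) AS CommentCount
          FROM chatcomments
          GROUP BY DATEADD(hour, DATEDIFF(hour, 0, TimeStamp), 0)")
    .List<object[]>();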
I'm currently doing some research on using db4o as storage for my web application. I'm quite happy with how easily db4o works. So when I read about the Code First approach, I kind of liked it, because working with EF4 Code First is quite similar to working with db4o: create your domain objects (POCOs), throw them at db4o, and never look back.
But when I did a performance comparison, EF4 was horribly slow, and I couldn't figure out why.
I use the following entities:
public class Recipe
{
    private List<RecipePreparation> _RecipePreparations = new List<RecipePreparation>();

    public int ID { get; set; }
    public String Name { get; set; }
    public String Description { get; set; }
    public List<String> Tags { get; set; }

    public ICollection<RecipePreparation> Preparations
    { get { return _RecipePreparations.AsReadOnly(); } }

    public void AddPreparation(RecipePreparation preparation)
    {
        this._RecipePreparations.Add(preparation);
    }
}

public class RecipePreparation
{
    public String Name { get; set; }
    public String Description { get; set; }
    public int Rating { get; set; }
    public List<String> Steps { get; set; }
    public List<String> Tags { get; set; }
    public int ID { get; set; }
}
To test the performance, I new up a Recipe and add 50,000 RecipePreparations. Then I store the object in db4o like so:
IObjectContainer db = Db4oEmbedded.OpenFile(Db4oEmbedded.NewConfiguration(), @"RecipeDB.db4o");
db.Store(recipe1);
db.Close();
This takes around 13,000 ms.
I store the stuff with EF4 in SQL Server 2008 (Express, locally) like this:
cookRecipes.Recipes.Add(recipe1);
cookRecipes.SaveChanges();
And that takes 200,000 ms.
Now how on earth is db4o 15(!!!) times faster than EF4/SQL? Am I missing a secret turbo button for EF4? I even think db4o could be made faster, since I don't initialize the database file - I just let it grow dynamically.
Did you call SaveChanges() inside the loop? No wonder it's slow! Try doing this:
foreach (var recipe in The500000Recipes)
{
    cookRecipes.Recipes.Add(recipe);
}
cookRecipes.SaveChanges();
EF expects you to make all the changes you want and then call SaveChanges once. That way it can optimize the database communication and the SQL needed to move from the opening state to the saved state, ignoring any changes you have undone. (For example, adding 50,000 records, then removing half of them, then hitting SaveChanges will only ever add 25,000 records to the database.)
Perhaps you can disable change tracking while adding new objects; this can really improve performance:
context.Configuration.AutoDetectChangesEnabled = false;
See also this post for more info: http://coding.abel.nu/2012/03/ef-code-first-change-tracking/
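A sketch of the full pattern (reusing the names from the earlier answer; the flag is restored in a finally block so later change tracking still works):

cookRecipes.Configuration.AutoDetectChangesEnabled = false;
try
{
    foreach (var recipe in The500000Recipes)
    {
        cookRecipes.Recipes.Add(recipe);
    }
    cookRecipes.SaveChanges();
}
finally
{
    cookRecipes.Configuration.AutoDetectChangesEnabled = true;
}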
The EF excels at many things, but bulk loading is not one of them. If you want high-performance bulk loading, doing it directly through the DB server will be faster than any ORM. If your app's sole performance constraint is bulk loading, then you probably shouldn't use the EF.
Just to add to the other answers: db4o typically runs in-process, while EF abstracts an out-of-process (SQL) database. However, db4o is essentially single-threaded. So while it might be faster for this one example with one request, SQL Server will handle concurrency (multiple queries, multiple users) much better than a default db4o database setup.