EF LINQ ToList is very slow - c#

I am using ASP.NET MVC (.NET 4.5) and EF6 with code-first migrations.
I have this code, which takes about 6 seconds.
var filtered = _repository.Requests.Where(r => some conditions); // this is fast, conditions match only 8 items
var list = filtered.ToList(); // this takes 6 seconds, has 8 items inside
I thought this was because of the relations, that EF must be building them up in memory, but that is not the case: even when I select zero fields, it is still just as slow.
var filtered = _repository.Requests.Where(r => some conditions).Select(e => new {}); // this is fast, conditions match only 8 items
var list = filtered.ToList(); // this takes still around 5-6 seconds, has 8 items inside
Now, the Requests table is quite complex, with lots of relations, and it holds ~16k rows. On the other hand, the filtered list should only contain proxies to 8 items.
Why is the ToList() method so slow? I actually think the problem is not in the ToList() method itself, but is probably an EF issue or a design problem.
Has anyone had experience with anything like this?
EDIT:
These are the conditions:
_repository.Requests.Where(r => ids.Any(a => a == r.Student.Id) && r.StartDate <= cycle.EndDate && r.EndDate >= cycle.StartDate)
So basically, I am checking whether the Student id is in my id list and whether the dates overlap.

Your filtered variable contains a query, which is a question; it does not contain the answer. When you request the answer by calling .ToList(), that is when the query is executed, and that is why it is slow: only when you call .ToList() is the query run by your database.
This is called deferred execution. A quick search will give you more information about it.
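A minimal sketch of the effect, reusing the repository from the question (the IsActive property is made up):
var query = _repository.Requests.Where(r => r.IsActive); // fast: only builds an IQueryable, no SQL sent
var list = query.ToList(); // the SQL is generated, sent, and executed here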
If you show some of your conditions, we might be able to say why it is slow.

In addition to Maarten's answer, I think the problem is one of two situations:
the condition is complex and results in heavy joins or subqueries in your database
the condition filters on a column which does not have an index, causing a full table scan and making your query slow.
I suggest you start monitoring the queries generated by Entity Framework. It's very simple: you just need to set the Log property of your context and look at the results,
using (var context = new MyContext())
{
    context.Database.Log = Console.Write; // EF6: echoes every generated SQL statement to the console
    // Your code here...
}
If you see something strange in the generated query, try to improve it by breaking the query into parts; sometimes the queries Entity Framework generates are not very good.
If the query is okay, then the problem lies in your database (assuming no network problems).
Run your query under a SQL profiler and check what's wrong.
UPDATE
I suggest you:
add an index for the StartDate column and another for the EndDate column in your table (one for each, not one covering both); a code-first sketch follows below
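A hedged sketch of one way to add them with EF 6.1+ code-first data annotations (the class and index names are illustrative; IndexAttribute lives in System.ComponentModel.DataAnnotations.Schema):
public class Request
{
    public int Id { get; set; }

    [Index("IX_Requests_StartDate")] // one single-column index here...
    public DateTime StartDate { get; set; }

    [Index("IX_Requests_EndDate")] // ...and a separate one here
    public DateTime EndDate { get; set; }
}
After adding the attributes, scaffold a migration (e.g. Add-Migration AddRequestDateIndexes) and update the database so the indexes are actually created.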

ToList executes the query against the DB, while the first line does not.
Can you show the conditions code here?
To increase performance you need to optimize the query and/or create indexes on the DB tables.

Your first line of code only returns an IQueryable. This is a representation of a query that you want to run, not the result of the query. The query itself only runs on the database when you call .ToList() on your IQueryable, because that is the first point at which you have actually asked for data.
Your adjustment to add the .Select only adds to the existing IQueryable query definition. It doesn't change which conditions have to execute. You have essentially changed the following, where you get back 8 records:
select * from Requests where [some conditions];
to something like:
select '' from Requests where [some conditions];
You will still have to perform the full query with the conditions giving you 8 records, but for each one, you only asked for an empty string, so you get back 8 empty strings.
The long and the short of this is that any performance problem you are having is coming from your "some conditions". Without seeing them, it is difficult to know more. But I have seen people in the past add .Where clauses inside a loop before calling .ToList(), inadvertently creating a massively complicated query; a sketch of that pitfall follows.
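A hypothetical illustration of that pitfall (searchTerms and Notes are made-up names):
IQueryable<Request> query = context.Requests;
foreach (var term in searchTerms)
{
    var captured = term; // capture a copy to avoid the modified-closure bug
    query = query.Where(r => r.Notes.Contains(captured)); // each iteration stacks another predicate
}
var results = query.ToList(); // one large SQL statement containing every accumulated predicate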

Jaanus, the most likely reason for this issue is the complexity of the SQL query generated by Entity Framework. I guess that your filter condition contains checks against other tables.
Try to check the generated query with SQL Server Profiler. Then copy this query into Management Studio and check the estimated execution plan. As a rule, Management Studio generates index recommendations for your query; try to follow these recommendations.

Related

Linq ToList() and Count() performance problem

I have 200k rows in my table and I need to filter the table and then show the results in a datatable. When I run the equivalent SQL directly it is fast, but when I want to get the row count or run ToList(), it takes a long time. After filtering, the list has only 15 rows, so it is not a huge amount of data.
public static List<Books> GetBooks()
{
List<Books> bookList = new List<Books>();
var v = from a in ctx.Books select a;
int allBooksCount = v.Count(); // I need all books count before filter. but it is so slow is my first problem
if (isFilter)
{
v = v.Where(a => a.startdate <= DateTime.Now && a.enddate>= DateTime.Now);
}
.
.
bookList = v.ToList(); // also toList is so slow is my second problem
}
There's nothing wrong with the code you've shown. So either you have some trouble in the database itself, or you're ruining the query by using IEnumerable instead of IQueryable.
My guess is that either ctx.Books is IEnumerable<Books> (instead of IQueryable<Books>), or that the Count (and Where etc.) method you're calling is the Enumerable version, rather than the Queryable version.
Which version of Count are you actually calling?
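A sketch of the difference, assuming ctx.Books is a DbSet<Books>:
IQueryable<Books> q = ctx.Books; // Queryable.Count() translates to SELECT COUNT(*) on the server
IEnumerable<Books> e = ctx.Books; // Enumerable.Count() pulls back every row and counts in memory
int fast = q.Count();
int slow = e.Count();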
First, to get help you need to provide quantitative values for "fast" vs. "too long". Loading entities from EF will take longer than running a raw SQL statement in a client tool like TOAD etc. Are you seeing differences of 15ms vs. 15 seconds, or 15ms vs. 150ms?
To help identify and eliminate possible culprits:
Eliminate the possibility of a long-running DbContext instance tracking too many entities bogging down performance. The longer a DbContext is used and the more entities it tracks, the slower it gets. Temporarily change the code to:
List<Books> bookList = new List<Books>();
using (var context = new YourDbContext())
{
var v = from a in context.Books select a;
int allBooksCount = v.Count(); // I need all books count before filter. but it is so slow is my first problem
if (isFilter)
{
v = v.Where(a => a.startdate <= DateTime.Now && a.enddate>= DateTime.Now);
}
.
.
bookList = v.ToList();
}
Using a fresh DbContext ensures queries are not sifting through in-memory entities after running a query to find tracked instances to return. It also ensures we are running against an IQueryable off the Books DbSet within the context. We can only guess what "ctx" in your code actually represents.
Next: look at a profiler for MySQL, or have your database log the SQL statements, to capture exactly what EF is requesting. Check that the Count and ToList each trigger just one query against the database, and then run those exact statements against the database yourself. If more than these two queries are being run, then something odd is happening behind the scenes that you need to investigate, such as your example not really representing what your real code does; you could be tripping client-side evaluation (if using EF Core 2) or lazy loading. The next thing I would look at, if possible, is the execution plan for these queries, for hints like missing indexes. (My DB experience is primarily SQL Server, so I cannot advise on tools for MySQL.)
I would log the actual SQL queries here. You can then use DESCRIBE to look at how many rows each one hits. There are various tools that can further analyse the queries if DESCRIBE isn't sufficient. This way you can see whether it's the queries or the (lack of) indices that is the problem; the next step has to be guided by that.
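For EF6, a minimal sketch of that logging (assuming a DbContext instance named context):
context.Database.Log = s => System.Diagnostics.Debug.WriteLine(s); // captures every generated SQL statement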

Least expensive operation to check for new records in table with EF

I have PostgreSQL running together with EF7.
My table structure:
Id (bigint)
Content (text)
CreatedAt (timestamp without time zone NOT NULL)
I have infinite scroll in my frontend app and query results with .Skip(n) & .Take(m).
I.e.
.Skip(0), .Take(10);
.Skip(10), .Take(10);
.Skip(20), .Take(10);
<...>
Now, while scrolling, if there are newer records, I have to know how many and add them to the .Skip(n) count. I don't need to display them; I just need to take them into consideration while skipping.
Currently I'm checking for them like this, but it seems this will become quite expensive once the table exceeds 50k-100k records:
_myRepository.GetAll().Where(x => x.CreatedAt > newestActivityDate).Count();
GetAll():
public IQueryable<Activity> GetAll()
{
return _context.Activities.OrderByDescending(x => x.CreatedAt);
}
What would be the best (most performant) way to check whether there are new records, and how many? Checking by date and only then doing a count if there are any new records?
EDIT:
Added GetAll() description for more clarity.
I don't know what your .GetAll() method does, so that really is the crux of your problem.
LINQ has what is called deferred execution. That means the query will not be executed until the results are acted upon, so when you're building a SQL query, nothing is sent to the database until you finish with it.
So if your .GetAll() method runs a .ToList() internally, then it will pull everything from the database immediately, before the rest of your filtering is applied. In that case, yes, it will be very expensive.
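An illustrative contrast (the first body is hypothetical; the second matches the GetAll() shown in the question):
// Defeats deferred execution: pulls the whole table before any later filtering
public IEnumerable<Activity> GetAll() => _context.Activities.ToList();

// Preserves composability: later Where/Count calls are translated into SQL
public IQueryable<Activity> GetAll() => _context.Activities.OrderByDescending(x => x.CreatedAt);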
However, you can get around that by just asking for the count. You already have most of that set up.
If you add a new method in your repository like this:
public int GetNewRecordsCount(DateTime newestActivityDate)
{
    return _context.Widgets.Count(x => x.CreatedAt > newestActivityDate);
}
Then this will generate a different SQL query, similar to this:
SELECT COUNT(*)
FROM Widgets
WHERE CreatedAt > newestActivityDate;
That is a very quick operation where the database will do all of the filtering for you. The only thing returned to your code is just a single row, single column result of the total count.
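Hypothetical usage while scrolling (the skip bookkeeping is illustrative):
int newRecords = _myRepository.GetNewRecordsCount(newestActivityDate);
skip += newRecords; // widen the offset so the next .Skip(skip).Take(10) stays aligned with the older pages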

Efficiently paging large data sets with LINQ

When looking into the best ways to implement paging in C# (using LINQ), most suggestions are something along these lines:
// Execute the query
var query = db.Entity.Where(e => e.Something == something);
// Get the total num records
var total = query.Count();
// Page the results
var paged = query.Skip((pageNum - 1) * pageSize).Take(pageSize);
This seems to be the commonly suggested strategy (simplified).
For me, my main purpose in paging is for efficiency. If my table contains 1.2 million records where Something == something, I don't want to retrieve all of them at the same time. Instead, I want to page the data, grabbing as few records as possible. But with this method, it seems that this is a moot point.
If I understand it correctly, the first statement is still retrieving the 1.2 million records, then it is being paged as necessary.
Does paging in this way actually improve performance? If the 1.2 million records are going to be retrieved every time, what's the point (besides the obvious UI benefits)?
Am I misunderstanding this? Any .NET gurus out there that can give me a lesson on LINQ, paging, and performance (when dealing with large data sets)?
The first statement does not execute the actual SQL query; it only builds the part of the query you intend to run.
It is when you call query.Count() that the first query will be executed:
SELECT COUNT(*) FROM Table WHERE Something = something
Calling query.Skip().Take() won't execute the query either; it is only when you try to enumerate the results (doing a foreach over paged or calling .ToList() on it) that it executes the appropriate SQL statement, retrieving only the rows for the page (using ROW_NUMBER).
If you watch this in SQL Profiler, you will see that exactly two queries are executed and at no point does it try to retrieve the full table.
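A sketch of those two round-trips (illustrative; note that EF requires an explicit ordering before Skip/Take):
var total = query.Count(); // 1st SQL: SELECT COUNT(*) FROM ... WHERE ...
var page = query.OrderBy(e => e.Id) // a stable ordering is required for paging
    .Skip((pageNum - 1) * pageSize)
    .Take(pageSize)
    .ToList(); // 2nd SQL: the paged SELECT (ROW_NUMBER or OFFSET/FETCH)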
Be careful when you are using the debugger: if you step past the first statement and inspect the contents of query, that will execute the SQL query. Maybe that is the source of your misunderstanding.
// Execute the query
var query = db.Entity.Where(e => e.Something == something);
For your information, no call to the database is made by the first statement.
// Get the total num records
var total = query.Count();
This count query will be translated to SQL and will make a call to the database.
This call will not get all records, because the generated SQL is something like this:
SELECT COUNT(*) FROM Entity WHERE Something LIKE 'something'
The last query doesn't get all the records either. It will be translated into SQL, and the paging runs in the database.
Maybe you'll find this question useful: efficient way to implement paging
I believe Entity Framework might structure the SQL query with the appropriate conditions based on the LINQ statements (e.g. using ROW_NUMBER() OVER ...).
I could be wrong on that, however. I'd run SQL profiler and see what the generated query looks like.

Why does adding an unnecessary ToList() drastically speed this LINQ query up? [closed]

Why does forcing materialization using ToList() make my query orders of magnitude faster when, if anything, it should do the exact opposite?
1) Calling First() immediately
// "Context" is an Entity Framework DB-first model
var query = from x in Context.Users
where x.Username.ToLower().Equals(User.Identity.Name.ToLower())
select x;
var User = query.First();
// ** The above takes 30+ seconds to run **
2) Calling First() after calling ToList():
var query = from x in Context.Users
where x.Username.ToLower().Equals(User.Identity.Name.ToLower())
select x;
var User = query.ToList().First(); // Added ToList() before First()
// ** Now it takes < 1 second to run! **
Update and Resolution
After getting the generated SQL, the only difference is, as expected, the addition of TOP (1) in the first query. As Andyz Smith says in his answer below, the root cause is that the SQL Server optimizer, in this particular case, chooses a worse execution plan when TOP (1) is added. Thus the problem has nothing to do with LINQ (which did the right thing by adding TOP (1)) and everything to do with the idiosyncrasies of SQL Server.
I can only think of one reason...
To test it, can you please remove the Where clause and re-run the test? Comment here if the result is the first statement being faster, and I will explain why.
Edit
In the LINQ statement's Where clause, you are using the .ToLower() method of the string. My guess is that LINQ does not have a built-in conversion to SQL for this method, so the resultant SQL is something like:
SELECT *
FROM Users
Now, we know that LINQ lazy loads, but it also knows that since it has not evaluated the WHERE clause, it needs to load the elements to do the comparison.
Hypothesis
The first query is lazily loading EVERY element in the result set, then doing the .ToLower() comparison and returning the first result. This results in n requests to the server and a huge performance overhead. I cannot be sure without seeing the SQL trace log.
The second statement calls ToList, which requests a single SQL batch before doing the ToLower comparison, resulting in only one request to the server.
Alternative Hypothesis
If the profiler shows only one server execution, try executing the same query with the Top 1 clause and see if it takes as long. As per this post (Why is doing a top(1) on an indexed column in SQL Server slow?) the TOP clause can sometimes mess with the SQL server optimiser and stop it using the correct indices.
Curiosity edit
try changing the LINQ to
var query = from x in Context.Users
where x.Username.Equals(User.Identity.Name, StringComparison.OrdinalIgnoreCase)
select x;
Credit to @Scott for finding the way to do a case-insensitive comparison in LINQ. Give it a go and see if it is faster.
The SQL won't be the same, as LINQ is lazily evaluated. Your call to .ToList() forces .NET to evaluate the expression and then select the .First() item in memory.
Whereas the other option should add TOP 1 to the SQL.
E.G.
var query = from x in Context.Users
where x.Username.ToLower().Equals(User.Identity.Name.ToLower())
select x;
//SQL executed here
var User = query.First();
and
var query = from x in Context.Users
where x.Username.ToLower().Equals(User.Identity.Name.ToLower())
select x;
//SQL executed here!
var list = query.ToList();
var User = list.First();
Based on the above, the first query should be faster! I would suggest running a SQL profiler to see what's going on. The speed of the queries will depend on your data structure, number of records, indexes, etc.
The timing of your test will alter the results too. As a couple of people have mentioned in the comments, the first time you hit EF it needs to initialise and load the metadata, so if you run these together, the first one will always be slow.
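If you want to time the two approaches fairly, a hedged warm-up sketch (MyContext stands in for your actual context type):
using (var warmup = new MyContext())
{
    warmup.Users.FirstOrDefault(); // the first query pays the one-time metadata/view-generation cost
}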
Here's some more info on EF performance considerations
Notice the line:
Model and mapping metadata used by the Entity Framework is loaded into
a MetadataWorkspace. This metadata is cached globally and is available
to other instances of ObjectContext in the same application domain.
&
Because an open connection to the database consumes a valuable
resource, the Entity Framework opens and closes the database
connection only as needed. You can also explicitly open the
connection. For more information, see Managing Connections and
Transactions in the Entity Framework.
So, the optimizer chooses a bad way to run the query.
Since you can't add optimizer hints to the SQL to force the optimizer to choose a better plan, I see two options:
Add a covering index/indexed view on all the columns that are retrieved/included in the select. Pretty ludicrous, but I think it will work, because that index will make it easy peasy for the optimizer to choose a better plan.
Always prematurely materialize queries that include First, Last, or Take. Dangerous, because as the data gets larger, the break-even point between pulling all the data locally and doing the First() versus doing the query with TOP on the server is going to change.
http://geekswithblogs.net/Martinez/archive/2013/01/30/why-sql-top-may-slow-down-your-query-and-how.aspx
https://groups.google.com/forum/m/#!topic/microsoft.public.sqlserver.server/L2USxkyV1uw
http://connect.microsoft.com/SQLServer/feedback/details/781990/top-1-is-not-considered-as-a-factor-for-query-optimization
TOP slows down query
Why does TOP or SET ROWCOUNT make my query so slow?

Entity Framework + LINQ + "Contains" == Super Slow?

Trying to refactor some code that has gotten really slow recently and I came across a code block that is taking 5+ seconds to execute.
The code consists of 2 statements:
IEnumerable<int> StudentIds = _entities.Filters
.Where(x => x.TeacherId == Profile.TeacherId.Value && x.StudentId != null)
.Select(x => x.StudentId)
.Distinct<int>();
and
_entities.StudentClassrooms
.Include("ClassroomTerm.Classroom.School.District")
.Include("ClassroomTerm.Teacher.Profile")
.Include("Student")
.Where(x => StudentIds.Contains(x.StudentId)
&& x.ClassroomTerm.IsActive
&& x.ClassroomTerm.Classroom.IsActive
&& x.ClassroomTerm.Classroom.School.IsActive
&& x.ClassroomTerm.Classroom.School.District.IsActive).AsQueryable<StudentClassroom>();
So it's a bit messy, but first I get a distinct list of ids from one table (Filters), then I query another table using it.
These are relatively small tables, but it's still 5+ seconds of query time.
I put this in LINQPad and it showed that it was doing the bottom query first then running 1000 "distinct" queries afterwards.
On a whim I changed the "StudentIds" code by just adding .ToArray() at the end. This improved the speed 1000x ... it now takes like 100ms to complete the same query.
What's the deal? What am I doing wrong?
This is one of the pitfalls of deferred execution in LINQ: in your first approach, StudentIds is really an IQueryable, not an in-memory collection. That means using it in the second query will run the query again on the database, each and every time.
Forcing execution of the first query with ToArray() makes StudentIds an in-memory collection, and the Contains part in your second query runs over this fixed sequence of items. It gets mapped to something equivalent to a SQL WHERE StudentId IN (1,2,3,4) clause.
This will, of course, be much faster, since you determined the sequence once up-front rather than every time the Where clause is executed. Your second query without ToArray() would (I think) be mapped to a SQL query with a WHERE EXISTS (...) sub-query that gets evaluated for each row.
ToArray() materializes the initial query into memory on the application side.
My guess would be that the query provider is not able to parse the expression StudentIds.Contains(x.StudentId); it probably thinks StudentIds is an array already loaded into memory, so it queries the database over and over again during the translation phase. The only way to know for sure is to set up the profiler.
If you need to do this on the db server, use a join instead of Contains (a sketch follows below). If you need to use Contains to do what looks like a join problem, you are likely missing a surrogate primary key or a foreign key somewhere.
You could also declare StudentIds as IQueryable instead of IEnumerable. This might give the query provider the hint it needs to interpret StudentIds as an expression, i.e. data not already loaded into memory. I somewhat doubt this, but it's worth a try.
If all else fails, use ToArray(). This will load the initial StudentIds into memory.
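A sketch of the join approach suggested above, reusing the names from the question (untested; the IsActive filters from the original query would be added back in the where clause):
var result = (from sc in _entities.StudentClassrooms
              join f in _entities.Filters on sc.StudentId equals f.StudentId
              where f.TeacherId == Profile.TeacherId.Value && f.StudentId != null
              select sc).Distinct();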
