Efficiently paging large data sets with LINQ - c#

When looking into the best ways to implement paging in C# (using LINQ), most suggestions are something along these lines:
// Execute the query
var query = db.Entity.Where(e => e.Something == something);
// Get the total num records
var total = query.Count();
// Page the results
var paged = query.Skip((pageNum - 1) * pageSize).Take(pageSize);
This seems to be the commonly suggested strategy (simplified).
For me, my main purpose in paging is for efficiency. If my table contains 1.2 million records where Something == something, I don't want to retrieve all of them at the same time. Instead, I want to page the data, grabbing as few records as possible. But with this method, it seems that this is a moot point.
If I understand it correctly, the first statement is still retrieving the 1.2 million records, then it is being paged as necessary.
Does paging in this way actually improve performance? If the 1.2 million records are going to be retrieved every time, what's the point (besides the obvious UI benefits)?
Am I misunderstanding this? Any .NET gurus out there that can give me a lesson on LINQ, paging, and performance (when dealing with large data sets)?

The first statement does not execute the actual SQL query, it only builds part of the query you intend to run.
It is when you call query.Count() that the first will be executed
SELECT COUNT(*) FROM Table WHERE Something = something
On query.Skip().Take() won't execute the query either, it is only when you try to enumerate the results(doing a foreach over paged or calling .ToList() on it) that it will execute the appropriate SQL statement retrieving only the rows for the page (using ROW_NUMBER).
If watch this in the SQL Profiler you will see that exactly two queries are executed and at no point it will try to retrieve the full table.
Be careful when you are using the debugger, because if you step after the first statement and try to look at the contents of query that will execute the SQL query. Maybe that is the source of your misunderstanding.

// Execute the query
var query = db.Entity.Where(e => e.Something == something);
For your information, nothing is called after the first statement.
// Get the total num records
var total = query.Count();
This count query will be translated to SQL, and it'll make a call to database.
This call will not get all records, because the generated SQL is something like this:
SELECT COUNT(*) FROM Entity where Something LIKE 'something'
For the last query, it doesn't get all the records neither. The query will be translated into SQL, and the paging run in the database.
Maybe you'll find this question useful: efficient way to implement paging

I believe Entity Framework might structure the SQL query with the appropriate conditions based on the linq statements. (e.g. using ROWNUMBER() OVER ...).
I could be wrong on that, however. I'd run SQL profiler and see what the generated query looks like.

Related

How Dapper.NET works internally with .Count() and SingleOrDefault()?

I am new to Dapper though I am aware about ORMs and DAL and have implemented DAL with NHibernate earlier.
Example Query: -
string sql = "SELECT * FROM MyTable";
public int GetCount()
{
var result = Connection.Query<MyTablePoco>(sql).Count();
return result;
}
Will Dapper convert this query (internally) to SELECT COUNT(*) FROM MyTable looking at .Count() at the end?
Similarly, will it convert to SELECT TOP 1 * FROM MyTable in case of SingleOrDefault()?
I came from NHibernate world where it generates query accordingly. I am not sure about Dapper though. As I am working with MS Access, I do not see a way to check the query generated.
No, dapper will not adjust your query. The immediate way to tell this is simply: does the method return IEnumerable... vs IQueryable...? If it is the first, then it can only use local in-memory mechanisms.
Specifically, by default, Query will actually return a fully populated List<>. LINQ's Count() method recognises that and just accesses the .Count property of the list. So all the data is fetched from the database.
If you want to ask the database for the count, ask the database for the count.
As for mechanisms to view what is actually sent to the database: we use mini-profiler for this. It works great.
Note: when you are querying exactly one row: QueryFirstOrDefault (and the other variants you would expect) exist and have optimizations internally (including hints to ADO.NET, although not all providers can act on those things) to do things as efficiently as possible, but it does not adjust your query. In some cases the provider itself (not dapper) can help, but ultimately: if you only want the first row, ask the database for the first row (TOP or similar).

EF LINQ ToList is very slow

I am using ASP NET MVC 4.5 and EF6, code first migrations.
I have this code, which takes about 6 seconds.
var filtered = _repository.Requests.Where(r => some conditions); // this is fast, conditions match only 8 items
var list = filtered.ToList(); // this takes 6 seconds, has 8 items inside
I thought that this is because of relations, it must build them inside memory, but that is not the case, because even when I return 0 fields, it is still as slow.
var filtered = _repository.Requests.Where(r => some conditions).Select(e => new {}); // this is fast, conditions match only 8 items
var list = filtered.ToList(); // this takes still around 5-6 seconds, has 8 items inside
Now the Requests table is quite complex, lots of relations and has ~16k items. On the other hand, the filtered list should only contain proxies to 8 items.
Why is ToList() method so slow? I actually think the problem is not in ToList() method, but probably EF issue, or bad design problem.
Anyone has had experience with anything like this?
EDIT:
These are the conditions:
_repository.Requests.Where(r => ids.Any(a => a == r.Student.Id) && r.StartDate <= cycle.EndDate && r.EndDate >= cycle.StartDate)
So basically, I can checking if Student id is in my id list and checking if dates match.
Your filtered variable contains a query which is a question, and it doesn't contain the answer. If you request the answer by calling .ToList(), that is when the query is executed. And that is the reason why it is slow, because only when you call .ToList() is the query executed by your database.
It is called Deferred execution. A google might give you some more information about it.
If you show some of your conditions, we might be able to say why it is slow.
In addition to Maarten's answer I think the problem is about two different situation
some condition is complex and results in complex and heavy joins or query in your database
some condition is filtering on a column which does not have an index and this cause the full table scan and make your query slow.
I suggest start monitoring the query generated by Entity Framework, it's very simple, you just need to set Log function of your context and see the results,
using (var context = new MyContext())
{
context.Database.Log = Console.Write;
// Your code here...
}
if you see something strange in generated query try to make it better by breaking it in parts, some times Entity Framework generated queries are not so good.
if the query is okay then the problem lies in your database (assuming no network problem).
run your query with an SQL profiler and check what's wrong.
UPDATE
I suggest you to:
add index for StartDate and EndDate Column in your table (one for each, not one for both)
ToList executes the query against DB, while first line is not.
Can you show some conditions code here?
To increase the performance you need to optimize query/create indexes on the DB tables.
Your first line of code only returns an IQueryable. This is a representation of a query that you want to run not the result of the query. The query itself is only runs on the databse when you call .ToList() on your IQueryable, because its the first point that you have actually asked for data.
Your adjustment to add the .Select only adds to the existing IQueryable query definition. It doesnt change what conditions have to execute. You have essentially changed the following, where you get back 8 records:
select * from Requests where [some conditions];
to something like:
select '' from Requests where [some conditions];
You will still have to perform the full query with the conditions giving you 8 records, but for each one, you only asked for an empty string, so you get back 8 empty strings.
The long and the short of this is that any performance problem you are having is coming from your "some conditions". Without seeing them, its is difficult to know. But I have seen people in the past add .Where clauses inside a loop, before calling .ToList() and inadvertently creating a massively complicated query.
Jaanus. The most likely reason of this issue is complecity of generated SQL query by entity framework. I guess that your filter condition contains some check of other tables.
Try to check generated query by "SQL Server Profiler". And then copy this query to "Management Studio" and check "Estimated execution plan". As a rule "Management Studio" generatd index recomendation for your query try to follow these recomendations.

Improve LINQ query performance?

public List<Agents_main_view_distinct> getActiveAgents(DateTime start, DateTime end)
{
myactiveagents = null;
myactiveagents = mydb.Agents_main_view_distincts.Where(u => u.Status.Equals("Existing") && u.DateJoined2 >=start && u.DateJoined2 <=end).OrderByDescending(ac => ac.Recno).ToList();
return myactiveagents;
}
I have a simple LINQ query that queries a view.My worry is its performance. It works well with few hundreds records but when the records are over 2000. The SQL server times out.
Things I have done to improve on the performance.
1. Wrote a query to query the tables directly (No improvement).
2. Reduced unnecessary columns, previous it had 27 columns, reduces it to 20.
In a desperation attempt I increased the server time out to 600. But still it was timing out.
View SQL query
SELECT dbo.Agents.Recno, dbo.Agents.Rec_date, dbo.Agents.AgentsId, dbo.Agents.AgentsName, dbo.Agents.Industry_status, dbo.Agents.DOB, dbo.Agents.Branch,
dbo.Agents.MobileNumber, dbo.Agents.MaritalStatus, dbo.Agents.PIN, dbo.Agents.Gender, dbo.Agents.Email, dbo.Agents.ProvisionalLicense,
dbo.Agents.IRALicenseNumber, dbo.Agents.PreviousCompany, dbo.Agents.YearsOfExperience, dbo.Agents.COPNumber, dbo.Agents.DateJoined AS DateJoined2,
dbo.Agents.DateJoined, dbo.Agents.PreviousOccupation, dbo.Agents.ProffesionalQualification, dbo.Agents.EducationalQualification, dbo.Agents.Status,
dbo.Agents.Termination_Date2 AS Termination_Date, dbo.Agents.Comments, dbo.Agents.Temination_code, dbo.Agents.Company_ID, dbo.Agents.Submit_By,
dbo.Agents.PassportPhoto, dbo.Insurane_Companies.Company_name, DATEDIFF(year, dbo.Agents.DOB, GETDATE()) AS age, YEAR(dbo.Agents.DateJoined)
AS YearJoined, YEAR(dbo.Agents.Termination_Date2) AS YearTermination, dbo.Agents.REGION, dbo.Agents.DOB AS DOB2,
dbo.Insurane_Companies.Company_code
FROM dbo.Agents INNER JOIN
dbo.Insurane_Companies ON dbo.Agents.Company_ID = dbo.Insurane_Companies.Company_id
You could try moving the Where and OrderBy clauses into the view itself, passing in parameters by using a stored procedure/ user defined function if necessary.
You could also add Glimpse to your project. Amongst many other things, you can inspect SQL calls to see if you have any unnecessary or time consuming DB hits.
As James suggested, it would be best to create a stored procedure and call it via LINQ, passing any necessary parameters. This way, the server will handle the processing in the should be much faster since the query would not have to be converted back to SQL to be processed. Other than that you can use SQL's Query Analyzer or Profiler to see what the bottleneck may be.

Entity Framework + LINQ + "Contains" == Super Slow?

Trying to refactor some code that has gotten really slow recently and I came across a code block that is taking 5+ seconds to execute.
The code consists of 2 statements:
IEnumerable<int> StudentIds = _entities.Filters
.Where(x => x.TeacherId == Profile.TeacherId.Value && x.StudentId != null)
.Select(x => x.StudentId)
.Distinct<int>();
and
_entities.StudentClassrooms
.Include("ClassroomTerm.Classroom.School.District")
.Include("ClassroomTerm.Teacher.Profile")
.Include("Student")
.Where(x => StudentIds.Contains(x.StudentId)
&& x.ClassroomTerm.IsActive
&& x.ClassroomTerm.Classroom.IsActive
&& x.ClassroomTerm.Classroom.School.IsActive
&& x.ClassroomTerm.Classroom.School.District.IsActive).AsQueryable<StudentClassroom>();
So it's a bit messy but first I get a Distinct list of Id's from one Table (Filters), then I query another Table using it.
These are relatively small tables, but it's still 5+ seconds of query time.
I put this in LINQPad and it showed that it was doing the bottom query first then running 1000 "distinct" queries afterwards.
On a whim I changed the "StudentIds" code by just adding .ToArray() at the end. This improved the speed 1000x ... it now takes like 100ms to complete the same query.
What's the deal? What am I doing wrong?
This is one of the pitfalls of deferred execution in Linq: In your first approach StudentIds is really an IQueryable, not an in-memory collection. That means using it in the second query will run the query again on the database - each and every time.
Forcing execution of the first query by using ToArray() makes StudentIds an in-memory collection and the Contains part in your second query will run over this collection that contains a fixed sequence of items - This gets mapped to something equivalent to a SQL where StudentId in (1,2,3,4) query.
This query will of course, be much much faster since you determined this sequence once up-front, and not every time the Where clause is executed. Your second query without using ToArray() (I would think) would be mapped to a SQL query with an where exists (...) sub-query that gets evaluated for each row.
ToArray() Materializes the initial query to the server memory.
My guess would be the query provider is not able to parse the expression StudentIds.Contains(x.StudentId). Hence it probably thinks that the studentIds is an array already loaded to memory. So it's probably querying the database over and over again during the parsing phase. The only way to know for sure is to setup the profiler.
If you need to do this on the db server, use a join, instead of "contains". If you need to use contains to do what looks like a join problem, you are likely to be missing a surrogate primary key or a foreign key somewhere.
You could also declare studentIds as IQueryable instead of IEnumerable. This might give the query provider the hint it needs to interpret the studentIds as expression aka. data not already loaded to memory. I somehow doubt this but worth a try.
If all else fails, use ToArray(). This will load the initial studentIds to memory.

Selecting first 100 records using Linq

How can I return first 100 records using Linq?
I have a table with 40million records.
This code works, but it's slow, because will return all values before filter:
var values = (from e in dataContext.table_sample
where e.x == 1
select e)
.Take(100);
Is there a way to return filtered? Like T-SQL TOP clause?
No, that doesn't return all the values before filtering. The Take(100) will end up being part of the SQL sent up - quite possibly using TOP.
Of course, it makes more sense to do that when you've specified an orderby clause.
LINQ doesn't execute the query when it reaches the end of your query expression. It only sends up any SQL when either you call an aggregation operator (e.g. Count or Any) or you start iterating through the results. Even calling Take doesn't actually execute the query - you might want to put more filtering on it afterwards, for instance, which could end up being part of the query.
When you start iterating over the results (typically with foreach) - that's when the SQL will actually be sent to the database.
(I think your where clause is a bit broken, by the way. If you've got problems with your real code it would help to see code as close to reality as possible.)
I don't think you are right about it returning all records before taking the top 100. I think Linq decides what the SQL string is going to be at the time the query is executed (aka Lazy Loading), and your database server will optimize it out.
Have you compared standard SQL query with your linq query? Which one is faster and how significant is the difference?
I do agree with above comments that your linq query is generally correct, but...
in your 'where' clause should probably be x==1 not x=1 (comparison instead of assignment)
'select e' will return all columns where you probably need only some of them - be more precise with select clause (type only required columns); 'select *' is a vaste of resources
make sure your database is well indexed and try to make use of indexed data
Anyway, 40milions records database is quite huge - do you need all that data all the time? Maybe some kind of partitioning can reduce it to the most commonly used records.
I agree with Jon Skeet, but just wanted to add:
The generated SQL will use TOP to implement Take().
If you're able to run SQL-Profiler and step through your code in debug mode, you will be able to see exactly what SQL is generated and when it gets executed. If you find the time to do this, you will learn a lot about what happens underneath.
There is also a DataContext.Log property that you can assign a TextWriter to view the SQL generated, for example:
dbContext.Log = Console.Out;
Another option is to experiment with LINQPad. LINQPad allows you to connect to your datasource and easily try different LINQ expressions. In the results panel, you can switch to see the SQL generated the LINQ expression.
I'm going to go out on a limb and guess that you don't have an index on the column used in your where clause. If that's the case then it's undoubtedly doing a table scan when the query is materialized and that's why it's taking so long.

Categories