OrderBy is stable for LINQ to Objects, but the MSDN documentation for Queryable.OrderBy doesn't mention whether it is stable or not.
I guess it depends on the provider implementation. Is it unstable for SQL Server? It certainly looks that way. I took a quick look at the Queryable source code, but it is not obvious from there.
I need to order a collection before other operations and I want to use IQueryable, rather than IEnumerable for the sake of performance.
// All the timestamps are the same, and I am getting inconsistent
// results when running this multiple times; only the first few pages
// return the same results
var result = data.OrderBy(i => i.TimeStamp).Skip(start).Take(length);
but if I use
var result = data.ToList().OrderBy(i => i.TimeStamp).Skip(start).Take(length);
it works just fine, but I lose the performance boost from LINQ to SQL. It seems the combination of Queryable OrderBy/Skip/Take produces inconsistent results.
The generated SQL code seems fine to me:
SELECT
...
FROM [dbo].[Table] AS [Extent1]
ORDER BY [Extent1].[TimeStamp] ASC
OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY
In LINQ to Entities, LINQ queries are translated into SQL queries, so the LINQ to Objects implementation of OrderBy doesn't matter. You should look at your database's implementation of ORDER BY. If you are using MS SQL, you can find this in the docs:
To achieve stable results between query requests using OFFSET and FETCH, the following conditions must be met:
(...)
The ORDER BY clause contains a column or combination of columns that are guaranteed to be unique.
So ORDER BY does not guarantee a consistent order among rows with equal values, and limiting the result set can therefore return a different set of rows each time. To solve this, simply sort by an additional column that has unique values, e.g. the id. So basically you will have:
var result = data
    .OrderBy(i => i.TimeStamp)
    .ThenBy(i => i.Id)
    .Skip(start)
    .Take(length);
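If you page like this in more than one place, the same fix can be wrapped in a small reusable helper. A minimal sketch (PageBy is a hypothetical name, and it assumes your entity exposes a unique key such as Id to use as the tiebreaker):
using System;
using System.Linq;
using System.Linq.Expressions;

public static class PagingExtensions
{
    // Orders by the requested key, then by a unique tiebreaker, so that
    // OFFSET/FETCH paging returns the same rows on every request.
    public static IQueryable<T> PageBy<T, TKey, TTie>(
        this IQueryable<T> source,
        Expression<Func<T, TKey>> keySelector,
        Expression<Func<T, TTie>> tieBreaker,
        int start,
        int length)
    {
        return source
            .OrderBy(keySelector)
            .ThenBy(tieBreaker)
            .Skip(start)
            .Take(length);
    }
}

// Usage, with Id assumed to be a unique column:
// var page = data.PageBy(i => i.TimeStamp, i => i.Id, start, length);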
I take it that by "stable" you mean consistent. If you didn't have the ORDER BY in a SQL query, the order of the data would not be guaranteed from one run to the next; the server simply returns the rows in whatever order is most efficient for it. When you add the ORDER BY, the data is sorted, but since all of your sort values are the same, the sort does nothing to disambiguate the rows, and the result can come back in an order you don't expect. If you need a specific order, you will need to add a secondary sort column such as an ID.
It is best never to assume the order of data coming back from the server unless you explicitly define what that order is.
I'm trying to query my OData web service from a C# application.
When I do the following:
var SecurityDefs = from SD in nav.ICESecurityDefinition.Take(1)
                   orderby SD.Entry_No descending
                   select SD;
I get an exception because .top() and .orderby are not supposed to be used together.
I need to get the last record in the dataset, and only the last.
The purpose is to get the last used entry number in a ledger and then continue creating new entries, incrementing from the found entry number.
I can't seem to find anything online that explains how to do this.
It's very important that the service only returns the last record from the feed, since speed is paramount in this solution.
I get an exception because .top() and .orderby are not supposed to be used together.
Where did you read that? In general, .top() or .Take() should ONLY be used in conjunction WITH .orderby(); otherwise the record being retrieved is not guaranteed to be repeatable or predictable.
Probably the compounding issue here is mixing query and fluent expression syntax, which is valid, but you have to understand the order of precedence.
Your syntax is taking 1 record, then applying a sort order... you might find it easier to start with a query like this:
// build your query
var SecurityDefsQuery = from SD in nav.ICESecurityDefinition
                        orderby SD.Entry_No descending
                        select SD;
// Take the first item from the query; if it exists, this will be a single record.
var SecurityDefs = SecurityDefsQuery.FirstOrDefault();
// Or keep a deferred query for only the first record, if it exists
var SecurityDefsDeferred = SecurityDefsQuery.Take(1);
This can be written on a single line using brackets, but you can see how the query is the same in both cases. SecurityDefs here is a single record typed as ICESecurityDefinition, whereas SecurityDefsDeferred is an IQueryable<ICESecurityDefinition> that contains at most a single record.
If you only need the record itself, you can use this one-liner:
var SecurityDefs = (from SD in nav.ICESecurityDefinition
                    orderby SD.Entry_No descending
                    select SD).FirstOrDefault();
You can execute the same query using fluent notation as well:
var SecurityDefs = nav.ICESecurityDefinition
    .OrderByDescending(sd => sd.Entry_No)
    .FirstOrDefault();
In both cases, .Take(1) or .top() is effectively implemented through .FirstOrDefault(). You have indicated that speed is important, so use .First() or .FirstOrDefault() instead of .Single() or .SingleOrDefault(): the Single variants actually request .Take(2) so they can detect duplicates, and they throw an exception if more than one result comes back (Single() also throws when there are no results at all).
The OrDefault variants of both of these operators will not impact the performance of the query itself and should have a negligible effect in your code; use whichever is appropriate for the logic that consumes the returned record and for whether you need to handle the case where no record exists yet.
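To make the difference concrete, here is a sketch of the two shapes side by side (same hypothetical nav context as above; the first one is what you want here):
// FirstOrDefault translates to a TOP(1)-style request:
var newest = nav.ICESecurityDefinition
    .OrderByDescending(sd => sd.Entry_No)
    .FirstOrDefault();          // fetches at most one row

// SingleOrDefault must prove uniqueness, so it requests two rows
// and throws if a second row actually comes back:
var onlyOne = nav.ICESecurityDefinition
    .OrderByDescending(sd => sd.Entry_No)
    .SingleOrDefault();         // fetches up to two rows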
If the record being returned has many columns, and you are only interested in the Entry_No column value, then perhaps you should simply query for that specific value itself:
Query expression:
var lastEntryNo = (from SD in nav.ICESecurityDefinition
                   orderby SD.Entry_No descending
                   select SD.Entry_No).FirstOrDefault();
Fluent expression:
var lastEntryNo = nav.ICESecurityDefinition
    .OrderByDescending(sd => sd.Entry_No)
    .Select(sd => sd.Entry_No)
    .FirstOrDefault();
If speed is paramount, then look at providing a specific custom endpoint on the service to serve the record, or do not compute the Entry_No on the client at all: make that the job of the code that receives data from the client, and compute it at the time the entries are inserted.
Making the query perform faster is not the silver bullet you might be looking for, though. Even if it is highly optimised, your current pattern means that any number of clients could all call the service to get the current value of Entry_No, meaning all of them would start incrementing from the same value.
If you MUST increment the Entry_No from the client, then you should look at putting a custom endpoint on the service that simply returns the next Entry_No to use. This should be optimistic, meaning that you don't care whether the Entry_No actually gets used in the end, but you can implement the endpoint such that every call increments the field in the database and returns the next value.
It's getting a bit beyond the scope of your initial post, but SQL Server now has support for sequences, which formalise this type of logic from a database and schema point of view. Using a sequence simplifies how these incrementing values are managed from the client, because we no longer rely on data updates being committed to the table before the client can determine the next record (which is what your TOP / Order By Desc solution is trying to do).
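If you do move this server-side, a sequence-backed "next number" endpoint can be very small. A minimal sketch, assuming an EF6 DbContext and a pre-created SQL Server sequence (the EntryNoSequence name and GetNextEntryNo helper are hypothetical):
using System.Data.Entity;
using System.Linq;

public static class EntryNumbers
{
    // Assumes a sequence created beforehand, e.g.:
    //   CREATE SEQUENCE dbo.EntryNoSequence START WITH 1 INCREMENT BY 1;
    public static int GetNextEntryNo(DbContext db)
    {
        // Each call atomically reserves the next value on the server,
        // so concurrent clients can never be handed the same Entry_No.
        return db.Database
                 .SqlQuery<int>("SELECT NEXT VALUE FOR dbo.EntryNoSequence")
                 .Single();
    }
}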
I have a little program that needs to do some calculation on a data range. The range may contain about half a million records. I just looked at my db and saw that a GROUP BY was executed.
I thought that the result of the first line was materialized immediately and that I would later be working with the data in RAM. But now I think the query builder combines the expressions.
var Test = db.Test.Where(x => x.Date > DateTime.Now.AddDays(-7));
var Test2 = (from p in Test
             group p by p.CustomerId into g
             select new { UniqueCount = g.Count() });
In my real-world app I have more subqueries based on the range selected by the first query, so I think I just added a big overhead by making the DB run several different selects.
For now I basically just call .ToList() after the first expression.
So my question is: am I right that the query builder combines the different IQueryables when it builds the expression tree?
Yes, you are correct. LINQ queries are lazily evaluated: nothing runs until you materialize the results (via .ToList(), for example). At that point, Entity Framework looks at the whole composed query and builds a single SQL statement to represent it.
In this particular case, it's probably wiser not to evaluate the first query, because the SQL database is optimized for performing set-based operations like grouping and counting. Rather than forcing the database to send all the Test objects across the wire, deserializing the results into in-memory objects, and then performing the grouping and counting locally, you will likely see better performance by having the SQL database return just the resulting counts.
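For example, composing the filter and the grouping before materializing anything keeps all the work in a single SQL statement (a sketch reusing the query from the question; the Date column name is an assumption):
var weekAgo = DateTime.Now.AddDays(-7);

var counts = db.Test
    .Where(x => x.Date > weekAgo)                        // assumed column name
    .GroupBy(p => p.CustomerId)
    .Select(g => new { g.Key, UniqueCount = g.Count() })
    .ToList();   // executes once: a single WHERE + GROUP BY round trip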
Background:
Entity Framework 4, with SQL Server 2008
Problem:
I have a table Order. Each row has a column Timestamp.
The user can choose some time in past and I need to get the Order closest to the specified time, but that had occurred before the specified time. In other words, the last order before the specified time.
For example, if I have orders
2008-01-12
2009-04-17
2009-09-24
2010-11-02
2010-12-01
2011-05-16
and choose a date 2010-07-22, I should get the 2009-09-24 order, because that's the last order before the specified date.
var query = (from oData in db.OrderDatas
             where oData.Timestamp <= userTime
             orderby oData.Timestamp ascending
             select oData).Last();
This is the closest to what I am trying to do. However, I am not sure exactly how the Last operator works when translated to SQL, if it's translated at all.
Question:
Will this query fetch all data (earlier than userTime) and then take the last element, or will it be translated so that only one element will be returned from the database? My table can hold very large number of rows (100000+) so performance is an issue here.
Also, how would one retrieve the closest time in the database (not necessarily the earlier time)? In the example of 2010-07-22, one would get 2010-11-02, because it is closer to the date specified than 2009-09-24.
In general, if you're concerned about how LINQ behaves, you should check what happens with the SQL. If you haven't yet worked out how to see how your LINQ queries are turned into SQL, that should be the very next thing you do.
As you noted in your comment, Last() isn't supported by LINQ to SQL so the same may be true for EF. Fortunately, it's easy to use First() instead:
var query = (from oData in db.OrderDatas
             where oData.Timestamp <= userTime
             orderby oData.Timestamp descending
             select oData).First();
Try using:
var query = (from oData in db.OrderDatas
             where oData.Timestamp <= userTime
             orderby oData.Timestamp descending
             select oData).Take(1);
It's the equivalent of TOP 1. Note that Take(1) returns a sequence containing at most one element rather than the element itself, so you would still enumerate it (or append .First()) to get the record.
Question:
Will this query fetch all data (earlier than userTime) and then take the last element, or will it be translated so that only one element will be returned from the database? My table can hold very large number of rows (100000+) so performance is an issue here.
In this case, using the First() approach, the query will be executed immediately, and it will be optimized in such a way that it retrieves only one record, most probably a TOP(1) select. You really need to check the generated SQL with a SQL profiling tool, by using the log of the DataContext, or with LINQPad. LINQ to SQL can lead to N+1 queries if not used the proper way. This behaviour is quite predictable, but in the beginning you really have to be aware of it.
I have a LINQ query that is causing some timeout issues. Basically, I have a query that returns the top 100 results from a table that has approximately 500,000 records.
Here is the query:
using (var dc = CreateContext())
{
    var accounts = string.IsNullOrEmpty(searchText)
        ? dc.Genealogy_Accounts
            .Where(a => a.Genealogy_AccountClass.Searchable)
            .OrderByDescending(a => a.ID)
            .Take(100)
        : dc.Genealogy_Accounts
            .Where(a => (a.Code.StartsWith(searchText)
                         || a.Name.StartsWith(searchText))
                        && a.Genealogy_AccountClass.Searchable)
            .OrderBy(a => a.Code)
            .Take(100);

    return accounts.Select(a => ...); // projection elided
}
Oddly enough, it is the first LINQ query that is causing the timeout. I thought that by doing a Take we wouldn't need to scan all 500k records; however, that must be what is happening. I'm guessing that the join to find what is 'searchable' is causing the issue. I'm not able to denormalize the tables... so I'm wondering if there is a way to rewrite the LINQ query so that it returns quicker, or if I should just write this query as a stored procedure (and if so, what might it look like). Thanks.
Well to start with, I'd find out what query is being generated (in LINQ to SQL you'd set the Log on the data context) and then profile it in SQL Server Management Studio. Play with it there until you've found something that is fast enough (either by changing the query or adding indexes) and if you've had to change the query, work out how to represent that in LINQ.
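For LINQ to SQL, wiring up that log is a one-liner (any TextWriter works):
dc.Log = Console.Out;   // writes the generated SQL as queries execute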
I suspect the problem is that you're combining OrderBy and Take - which means it potentially needs to find out all the results in order to work out which the top 100 would look like. Is Code indexed? If not, try indexing that - it may help by allowing the server to consider records in the order in which they'd be returned, so it can stop after it's found 100 records. You should look at indexes for the other columns too.
The Take(100) translates to "SELECT TOP 100" etc. That would help if your problem were an otherwise huge result set being sent across the wire. I bet, though, that your problem is a table scan resulting from the query. In that case, .Take(100) might not help much at all.
So, the likely culprit is the same as if you were writing SQL with ADO.NET: how are your indexes? Are the fields being searched ones for which you don't have good indexes? That would cause a drastic decrease in performance compared to queries that can use good indexes. Add an index that includes Code and Name and see what happens. Not having an index on Code is guaranteed to hose you, because of the Order By. Also, what field links Genealogy_Accounts and Genealogy_AccountClass? A lack of an index on either table could hose things. (I would guess an index including Searchable is unlikely to help.)
Use SQL Profiler to see the actual query being run (though you can do this in VS too), and to see how bad it really is on the server.
The problem might be LINQ doing something stupid when generating the query, but this is probably not the case; we're finding LINQ to SQL often makes better queries than we do. Even if it looks goofy, it's usually very efficient. You can put the SQL in Query Analyzer and check out the query plan. Then rewrite the SQL to be more human-simple and see if it improves things -- I bet it won't. I think you'll still see a table scan, indicating something is wrong with your index.
How can I return the first 100 records using LINQ?
I have a table with 40 million records.
This code works, but it's slow, because it will return all values before filtering:
var values = (from e in dataContext.table_sample
              where e.x == 1
              select e)
             .Take(100);
Is there a way to return only the filtered rows, like the T-SQL TOP clause?
No, that doesn't return all the values before filtering. The Take(100) will end up being part of the SQL sent up - quite possibly using TOP.
Of course, it makes more sense to do that when you've specified an orderby clause.
LINQ doesn't execute the query when it reaches the end of your query expression. It only sends up any SQL when either you call an aggregation operator (e.g. Count or Any) or you start iterating through the results. Even calling Take doesn't actually execute the query - you might want to put more filtering on it afterwards, for instance, which could end up being part of the query.
When you start iterating over the results (typically with foreach) - that's when the SQL will actually be sent to the database.
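A sketch of where that boundary falls, using the query from the question:
// Nothing hits the database on these two lines; they only build up
// the expression tree:
var query = dataContext.table_sample.Where(e => e.x == 1);
var firstHundred = query.Take(100);

// The SQL (including TOP for the Take) is generated and executed here:
foreach (var row in firstHundred)
{
    // process row...
}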
(I think your where clause is a bit broken, by the way. If you've got problems with your real code it would help to see code as close to reality as possible.)
I don't think you are right about it returning all records before taking the top 100. LINQ decides what the SQL string is going to be at the time the query is executed (deferred execution), and your database server will optimize it.
Have you compared a standard SQL query with your LINQ query? Which one is faster, and how significant is the difference?
I do agree with the above comments that your LINQ query is generally correct, but...
in your 'where' clause it should probably be x == 1, not x = 1 (comparison instead of assignment)
'select e' will return all columns when you probably need only some of them - be more precise with the select clause and fetch only the required columns; 'select *' is a waste of resources (see the sketch after this list)
make sure your database is well indexed and try to make use of indexed data
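A sketch of the narrower projection from the second point (the Id and Name column names are assumptions):
var values = dataContext.table_sample
    .Where(e => e.x == 1)
    .Select(e => new { e.Id, e.Name })   // only the columns you need
    .Take(100)
    .ToList();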
Anyway, a 40-million-record database is quite huge - do you need all that data all the time? Maybe some kind of partitioning can reduce it to the most commonly used records.
I agree with Jon Skeet, but just wanted to add:
The generated SQL will use TOP to implement Take().
If you're able to run SQL Profiler and step through your code in debug mode, you will be able to see exactly what SQL is generated and when it gets executed. If you find the time to do this, you will learn a lot about what happens underneath.
There is also a DataContext.Log property to which you can assign a TextWriter in order to view the generated SQL, for example:
dbContext.Log = Console.Out;
Another option is to experiment with LINQPad. LINQPad allows you to connect to your data source and easily try different LINQ expressions. In the results panel, you can switch to see the SQL generated by the LINQ expression.
I'm going to go out on a limb and guess that you don't have an index on the column used in your where clause. If that's the case, it's undoubtedly doing a table scan when the query is materialized, and that's why it's taking so long.