Selecting first 100 records using Linq

Selecting first 100 records using Linq - c#

How can I return first 100 records using Linq?
I have a table with 40million records.
This code works, but it's slow, because will return all values before filter:
var values = (from e in dataContext.table_sample
where e.x == 1
select e)
.Take(100);
Is there a way to return filtered? Like T-SQL TOP clause?

No, that doesn't return all the values before filtering. The Take(100) will end up being part of the SQL sent up - quite possibly using TOP.
Of course, it makes more sense to do that when you've specified an orderby clause.
LINQ doesn't execute the query when it reaches the end of your query expression. It only sends up any SQL when either you call an aggregation operator (e.g. Count or Any) or you start iterating through the results. Even calling Take doesn't actually execute the query - you might want to put more filtering on it afterwards, for instance, which could end up being part of the query.
When you start iterating over the results (typically with foreach) - that's when the SQL will actually be sent to the database.
(I think your where clause is a bit broken, by the way. If you've got problems with your real code it would help to see code as close to reality as possible.)

I don't think you are right about it returning all records before taking the top 100. I think Linq decides what the SQL string is going to be at the time the query is executed (aka Lazy Loading), and your database server will optimize it out.

Have you compared standard SQL query with your linq query? Which one is faster and how significant is the difference?
I do agree with above comments that your linq query is generally correct, but...
in your 'where' clause should probably be x==1 not x=1 (comparison instead of assignment)
'select e' will return all columns where you probably need only some of them - be more precise with select clause (type only required columns); 'select *' is a vaste of resources
make sure your database is well indexed and try to make use of indexed data
Anyway, 40milions records database is quite huge - do you need all that data all the time? Maybe some kind of partitioning can reduce it to the most commonly used records.

I agree with Jon Skeet, but just wanted to add:
The generated SQL will use TOP to implement Take().
If you're able to run SQL-Profiler and step through your code in debug mode, you will be able to see exactly what SQL is generated and when it gets executed. If you find the time to do this, you will learn a lot about what happens underneath.
There is also a DataContext.Log property that you can assign a TextWriter to view the SQL generated, for example:
dbContext.Log = Console.Out;
Another option is to experiment with LINQPad. LINQPad allows you to connect to your datasource and easily try different LINQ expressions. In the results panel, you can switch to see the SQL generated the LINQ expression.

I'm going to go out on a limb and guess that you don't have an index on the column used in your where clause. If that's the case then it's undoubtedly doing a table scan when the query is materialized and that's why it's taking so long.

Related

Efficiently paging large data sets with LINQ

When looking into the best ways to implement paging in C# (using LINQ), most suggestions are something along these lines:
// Execute the query
var query = db.Entity.Where(e => e.Something == something);
// Get the total num records
var total = query.Count();
// Page the results
var paged = query.Skip((pageNum - 1) * pageSize).Take(pageSize);
This seems to be the commonly suggested strategy (simplified).
For me, my main purpose in paging is for efficiency. If my table contains 1.2 million records where Something == something, I don't want to retrieve all of them at the same time. Instead, I want to page the data, grabbing as few records as possible. But with this method, it seems that this is a moot point.
If I understand it correctly, the first statement is still retrieving the 1.2 million records, then it is being paged as necessary.
Does paging in this way actually improve performance? If the 1.2 million records are going to be retrieved every time, what's the point (besides the obvious UI benefits)?
Am I misunderstanding this? Any .NET gurus out there that can give me a lesson on LINQ, paging, and performance (when dealing with large data sets)?

The first statement does not execute the actual SQL query, it only builds part of the query you intend to run.
It is when you call query.Count() that the first will be executed
SELECT COUNT(*) FROM Table WHERE Something = something
On query.Skip().Take() won't execute the query either, it is only when you try to enumerate the results(doing a foreach over paged or calling .ToList() on it) that it will execute the appropriate SQL statement retrieving only the rows for the page (using ROW_NUMBER).
If watch this in the SQL Profiler you will see that exactly two queries are executed and at no point it will try to retrieve the full table.
Be careful when you are using the debugger, because if you step after the first statement and try to look at the contents of query that will execute the SQL query. Maybe that is the source of your misunderstanding.

// Execute the query
var query = db.Entity.Where(e => e.Something == something);
For your information, nothing is called after the first statement.
// Get the total num records
var total = query.Count();
This count query will be translated to SQL, and it'll make a call to database.
This call will not get all records, because the generated SQL is something like this:
SELECT COUNT(*) FROM Entity where Something LIKE 'something'
For the last query, it doesn't get all the records neither. The query will be translated into SQL, and the paging run in the database.
Maybe you'll find this question useful: efficient way to implement paging

I believe Entity Framework might structure the SQL query with the appropriate conditions based on the linq statements. (e.g. using ROWNUMBER() OVER ...).
I could be wrong on that, however. I'd run SQL profiler and see what the generated query looks like.

Is LINQ's Enumerable.Take method a good equivalent of SQL's TOP in terms of speed?

I need to query in a table of almost 1 million entries, but the interesting part is only the top 500.
I wonder if using the Take() method in the end of the query line is the best alternative in terms of speed (and memory) optimization, or it's the same thing of just doing the whole query and then taking the top elements.

If you run Enumerable.Take, then you are using linq-to-objects - and yes it will run the entire query and filter at the client. This will be unnecessarily expensive.
Which is why you shouldn't be doing that. You should be using Queryable.Take - then it will do the right thing, and page at the database.
If you use:
var page = query.Take(count)
(etc) it will get it right automatically.

That depends on your query provider. If you use linq to entities or linq to sql, the linq query will be translated to sql - in other words, it's equivalent to using sql directly.
Usually when using linq in context of databases, that's what you will do.

Querying with many (~100) search terms with Entity Framework

I need to do a query on my database that might be something like this where there could realistically be 100 or more search terms.
public IQueryable<Address> GetAddressesWithTown(string[] towns)
{
IQueryable<Address> addressQuery = DbContext.Addresses;
addressQuery.Where( x => towns.Any( y=> x.Town == y ) );
return addressQuery;
}
However when it contains more than about 15 terms it throws and exception on execution because the SQL generated is too long.
Can this kind of query be done through Entity Framework?
What other options are there available to complete a query like this?

Sorry, are we talking about THIS EXACT SQL?
In that case it is a very simple "open your eyes thing".
There is a way (contains) to map that string into an IN Clause, that results in ONE sql condition (town in ('','',''))
Let me see whether I get this right:
addressQuery.Where( x => towns.Any( y=> x.Town == y ) );
should be
addressQuery.Where ( x => towns.Contains (x.Town)
The resulting SQL will be a LOT smaller. 100 items is still taxing it - I would dare saying you may have a db or app design issue here and that requires a business side analysis, I have not me this requirement in 20 years I work with databases.

This looks like a scenario where you'd want to use the PredicateBuilder as this will help you create an Or based predicate and construct your dynamic lambda expression.
This is part of a library called LinqKit by Joseph Albahari who created LinqPad.
public IQueryable<Address> GetAddressesWithTown(string[] towns)
{
var predicate = PredicateBuilder.False<Address>();
foreach (string town in towns)
{
string temp = town;
predicate = predicate.Or (p => p.Town.Equals(temp));
}
return DbContext.Addresses.Where (predicate);
}

You've broadly got two options:
You can replace .Any with a .Contains alternative.
You can use plain SQL with table-valued-parameters.
Using .Contains is easier to implement and will help performance because it translated to an inline sql IN clause; so 100 towns shouldn't be a problem. However, it also means that the exact sql depends on the exact number of towns: you're forcing sql-server to recompile the query for each number of towns. These recompilations can be expensive when the query is complex; and they can evict other query plans from the cache as well.
Using table-valued-parameters is the more general solution, but it's more work to implement, particularly because it means you'll need to write the SQL query yourself and cannot rely on the entity framework. (Using ObjectContext.Translate you can still unpack the query results into strongly-typed objects, despite writing sql). Unfortunately, you cannot use the entity framework yet to pass a lot of data to sql server efficiently. The entity framework doesn't support table-valued-parameters, nor temporary tables (it's a commonly requested feature, however).
A bit of TVP sql would look like this select ... from ... join #townTableArg townArg on townArg.town = address.town or select ... from ... where address.town in (select town from #townTableArg).
You probably can work around the EF restriction, but it's not going to be fast and will probably be tricky. A workaround would be to insert your values into some intermediate table, then join with that - that's still 100 inserts, but those are separate statements. If a future version of EF supports batch CUD statements, this might actually work reasonably.
Almost equivalent to table-valued paramters would be to bulk-insert into a temporary table and join with that in your query. Mostly that just means you're table name will start with '#' rather than '#' :-). The temp table has a little more overhead, but you can put indexes on it and in some cases that means the subsequent query will be much faster (for really huge data-quantities).
Unfortunately, using either temporary tables or bulk insert from C# is a hassle. The simplest solution here is to make a DataTable; this can be passed to either. However, datatables are relatively slow; the over might be relevant once you start adding millions of rows. The fastest (general) solution is to implement a custom IDataReader, almost as fast is an IEnumerable<SqlDataRecord>.
By the way, to use a table-valued-parameter, the shape ("type") of the table parameter needs to be declared on the server; if you use a temporary table you'll need to create it too.
Some pointers to get you started:
http://lennilobel.wordpress.com/2009/07/29/sql-server-2008-table-valued-parameters-and-c-custom-iterators-a-match-made-in-heaven/
SqlBulkCopy from a List<>

Using EF4 how can I track the # of times a record is part of a Skip().Take() result set

So using EF4, I'm running a search query that ends with this common function:
query = query.Skip(5).Take(10);
Those records in the database have a column called ImpressionCount, which I intend to use to count the number of times that each record displayed on a page of search results.
What's the most efficient way to do this? Off the top of my head, I'm just going to look at the result set, get a list of ID's and then hit the database again using ADO.NET to do something like:
UPDATE TableName SET ImpressionCount = ImpressionCount + 1 WHERE Id IN (1,2,3,4,5,6,7,8,9,10)
Seems simple enough, just wondering if there's a more .NET 4 / Linq-ish way to do this that I'm not thinking of. One that doesn't involve another hit to the database would be nice too. :)
EDIT: So I'm leaning towards IAbstract's response as the answer since it doesn't appear there's a "built in" way to do this. I didn't think there was but it never hurts to ask. However, the only other question I think I want to throw out there is: is it possible to write a SQL trigger that could only operate on this particular query? I don't want ImpressionCount to update on EVERY select statement for the record (for example, when someone goes to view the detail page, that's not an impression -- if an admin edits the record in the back end, that's not an impression)...possible using LINQ or no?
SQL Server would somehow need to be able to identify that the query was generated by that Linq command, not sure if that's possible or not. This site is expecting relatively heavy traffic so I'm just trying to optimize where possible, but if it's overkill, I might just go ahead and hit the database again one time for each page of results.

In my opinion, it is better to go ahead and run with the SQL command as you have it. Just because we have LINQ does not mean it is always the best choice. Your statement gives you a one-call process to update impression counts and should be fairly quick.

You can technically use a .Select to modify each element in a returned result, but it's not idiomatic C#/linq, so you're better off using a foreach loop.
Example:
var query = query.ToList().Select(x => { x.ImpressionCount++; return x; });
As IAbstract said, be careful of performance issues. Using the example above, or a foreach will execute 10 updates. Your one update statement is better.
I know Linq2NHibernate has this same issue - trying to stick with Linq just isn't any good for updates (that's why it's called "language integrated query");
Edit:
Actually there's probably no reason why EF4 or NHibernate couldn't parse the select expression, realize it's an update, and translate it into an update statement an execute it, but certainly neither framework will do that. If that were something that could happen, you'd want a new .Update() extension method for IQueryable<T> to explicitly state that you're modifying data. Using .Select() for it is a dirty hack.
... which means there's no reason you couldn't write your own .Update(x => x.ImpressionCount++) extension method for IQueryable<T> that output the SQL you want and call ExecuteStoreCommand, but it would be a lot of work.

Linq. Help me tune this!

I have a linq query that is causing some timeout issues. Basically, I have a query that is returning the top 100 results from a table that has approximately 500,000 records.
Here is the query:
using (var dc = CreateContext())
{
var accounts = string.IsNullOrEmpty(searchText)
? dc.Genealogy_Accounts
.Where(a => a.Genealogy_AccountClass.Searchable)
.OrderByDescending(a => a.ID)
.Take(100)
: dc.Genealogy_Accounts
.Where(a => (a.Code.StartsWith(searchText)
|| a.Name.StartsWith(searchText))
&& a.Genealogy_AccountClass.Searchable)
.OrderBy(a => a.Code)
.Take(100);
return accounts.Select(a =>
}
}
Oddly enough it is the first linq query that is causing the timeout. I thought that by doing a 'Take' we wouldn't need to scan all 500k of records. However, that must be what is happening. I'm guessing that the join to find what is 'searchable' is causing the issue. I'm not able to denormalize the tables... so I'm wondering if there is a way to rewrite the linq query to get it to return quicker... or if I should just write this query as a Stored Procedure (and if so, what might it look like). Thanks.

Well to start with, I'd find out what query is being generated (in LINQ to SQL you'd set the Log on the data context) and then profile it in SQL Server Management Studio. Play with it there until you've found something that is fast enough (either by changing the query or adding indexes) and if you've had to change the query, work out how to represent that in LINQ.
I suspect the problem is that you're combining OrderBy and Take - which means it potentially needs to find out all the results in order to work out which the top 100 would look like. Is Code indexed? If not, try indexing that - it may help by allowing the server to consider records in the order in which they'd be returned, so it can stop after it's found 100 records. You should look at indexes for the other columns too.

The Take(100) translates to "Select Top 100" etc. This would help if your problem was an otherwise huge result set, where there are a lot of columns returned. I bet though that your problem is a table scan resulting from the query. In this case, .Take(100) might not help much at all.
So, the likely culprit is the same as if you were doing SQL using ADO.NET: How are your Indxes? Are the fields being searched fields for which you don't have good indexes? This would cause a drastic decrease in performance compared to queries that do utilize good indexes. Add an index that includes Code and Name and see what happens. Not using an index for Code is guaranteed to hose you, because of the Order By. Also, what field links Genealogy_Accounts and Genealogy_AccountClass? A lack of index on either table could hose things. (I would guess an index including Searchable is unlikely to help.)
Use SQL Profiler to see the actual query being run (though you can do this in VS too), and to see how bad it really is on the server.
The problem might be LINQ doing something stupid generating the query, but this is probably not the case. We're finding LINQ-to-SQL often makes better queries than we do. Even if it looks goofy, it's usually very efficient. You can put the SQL in Query Analyzer, and check out the query plan. Then rewrite the SQL to be more human-simple and see if it improve things -- I bet it won't. I think you'll still see a table scan, indicating something is wrong with your index.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.