I've been working for the first time with the Entity Framework in .NET, and have been writing LINQ queries in order to get information from my model. I would like to program in good habits from the beginning, so I've been doing research on the best way to write these queries, and get their results. Unfortunately, in browsing Stack Exchange, I've seem to have come across two conflicting explanations in how deferred/immediate execution works with LINQ:
A foreach causes the query to be executed in each iteration of the loop:
Demonstrated in question Slow foreach() on a LINQ query - ToList() boosts performance immensely - why is this? , the implication is that "ToList()" needs to be called in order to evaluate the query immediately, as the foreach is evaluating the query on the data source repeatedly, slowing down the operation considerably.
Another example is the question Foreaching through grouped linq results is incredibly slow, any tips? , where the accepted answer also implies that calling "ToList()" on the query will improve performance.
A foreach causes a query to be executed once, and is safe to use with LINQ
Demonstrated in question Does foreach execute the query only once? , the implication is that the foreach causes one enumeration to be established, and will not query the datasource each time.
Continued browsing of the site has turned up many questions where "repeated execution during a foreach loop" is the culprit of the performance concern, and plenty of other answers stating that a foreach will appropriately grab a single query from a datasource, which means that both explanations seem to have validity. If the "ToList()" hypothesis is incorrect (as most of the current answers as of 2013-06-05 1:51 PM EST seem to imply), where does this misconception come from? Is there one of these explanations that is accurate and one that isn't, or are there different circumstances that could cause a LINQ query to evaluate differently?
Edit: In addition to the accepted answer below, I've turned up the following question over on Programmers that very much helped my understanding of query execution, particularly the the pitfalls that could result in multiple datasource hits during a loop, which I think will be helpful for others interested in this question: https://softwareengineering.stackexchange.com/questions/178218/for-vs-foreach-vs-linq
In general LINQ uses deferred execution. If you use methods like First() and FirstOrDefault() the query is executed immediately. When you do something like;
foreach(string s in MyObjects.Select(x => x.AStringProp))
The results are retrieved in a streaming manner, meaning one by one. Each time the iterator calls MoveNext the projection is applied to the next object. If you were to have a Where it would first apply the filter, then the projection.
If you do something like;
List<string> names = People.Select(x => x.Name).ToList();
foreach (string name in names)
Then I believe this is a wasteful operation. ToList() will force the query to be executed, enumerating the People list and applying the x => x.Name projection. Afterwards you will enumerate the list again. So unless you have a good reason to have the data in a list (rather than IEnumerale) you're just wasting CPU cycles.
Generally speaking using a LINQ query on the collection you're enumerating with a foreach will not have worse performance than any other similar and practical options.
Also it's worth noting that people implementing LINQ providers are encouraged to make the common methods work as they do in the Microsoft provided providers but they're not required to. If I were to go write a LINQ to HTML or LINQ to My Proprietary Data Format provider there would be no guarantee that it behaves in this manner. Perhaps the nature of the data would make immediate execution the only practical option.
Also, final edit; if you're interested in this Jon Skeet's C# In Depth is very informative and a great read. My answer summarizes a few pages of the book (hopefully with reasonable accuracy) but if you want more details on how LINQ works under the covers, it's a good place to look.
try this on LinqPad
void Main()
{
var testList = Enumerable.Range(1,10);
var query = testList.Where(x =>
{
Console.WriteLine(string.Format("Doing where on {0}", x));
return x % 2 == 0;
});
Console.WriteLine("First foreach starting");
foreach(var i in query)
{
Console.WriteLine(string.Format("Foreached where on {0}", i));
}
Console.WriteLine("First foreach ending");
Console.WriteLine("Second foreach starting");
foreach(var i in query)
{
Console.WriteLine(string.Format("Foreached where on {0} for the second time.", i));
}
Console.WriteLine("Second foreach ending");
}
Each time the where delegate is being run we shall see a console output, hence we can see the Linq query being run each time. Now by looking at the console output we see the second foreach loop still causes the "Doing where on" to print, thus showing that the second usage of foreach does in fact cause the where clause to run again...potentially causing a slow down.
First foreach starting
Doing where on 1
Doing where on 2
Foreached where on 2
Doing where on 3
Doing where on 4
Foreached where on 4
Doing where on 5
Doing where on 6
Foreached where on 6
Doing where on 7
Doing where on 8
Foreached where on 8
Doing where on 9
Doing where on 10
Foreached where on 10
First foreach ending
Second foreach starting
Doing where on 1
Doing where on 2
Foreached where on 2 for the second time.
Doing where on 3
Doing where on 4
Foreached where on 4 for the second time.
Doing where on 5
Doing where on 6
Foreached where on 6 for the second time.
Doing where on 7
Doing where on 8
Foreached where on 8 for the second time.
Doing where on 9
Doing where on 10
Foreached where on 10 for the second time.
Second foreach ending
It depends on how the Linq query is being used.
var q = {some linq query here}
while (true)
{
foreach(var item in q)
{
...
}
}
The code above will execute the Linq query multiple times. Not because of the foreach, but because the foreach is inside another loop, so the foreach itself is being executed multiple times.
If all consumers of a linq query use it "carefully" and avoid dumb mistakes such as the nested loops above, then a linq query should not be executed multiple times needlessly.
There are occasions when reducing a linq query to an in-memory result set using ToList() are warranted, but in my opinion ToList() is used far, far too often. ToList() almost always becomes a poison pill whenever large data is involved, because it forces the entire result set (potentially millions of rows) to be pulled into memory and cached, even if the outermost consumer/enumerator only needs 10 rows. Avoid ToList() unless you have a very specific justification and you know your data will never be large.
Sometimes it might be a good idea to "cache" a LINQ query using ToList() or ToArray(), if the query is being accessed multiple times in your code.
But keep in mind that "caching" it still calls a foreach in turn.
So the basic rule for me is:
if a query is simply used in one foreach (and thats it) - then I don't cache the query
if a query is used in a foreach and in some other places in the code - then I cache it in a var using ToList/ToArray
foreach, by itself, only runs through its data once. In fact, it specifically runs through it once. You can't look ahead or back, or alter the index the way you can with a for loop.
However, if you have multiple foreachs in your code, all operating on the same LINQ query, you may get the query executed multiple times. This is entirely dependent on the data, though. If you're iterating over an LINQ-based IEnumerable/IQueryable that represents a database query, it will run that query each time. If you're iterating over an List or other collection of objets, it will run through the list each time, but won't hit your database repeatedly.
In other words, this is a property of LINQ, not a property of foreach.
The difference is in the underlying type. As LINQ is built on top of IEnumerable (or IQueryable) the same LINQ operator may have completely different performance characteristics.
A List will always be quick to respond, but it takes an upfront effort to build a list.
An iterator is also IEnumerable and may employ any algorithm every time it fetches the "next" item. This will be faster if you don't actually need to go through the complete set of items.
You can turn any IEnumerable into a list by calling ToList() on it and storing the resulting list in a local variable. This is advisable if
You don't depend on deferred execution.
You have to access more total items than the whole set.
You can pay the upfront cost of retrieving and storing all items.
Using LINQ even without entities what you will get is that deferred execution is in effect.
It is only by forcing an iteration that the actual linq expression is evaluated.
In that sense each time you use the linq expression it is going to be evaluated.
Now with entities this is still the same, but there is just more functionality at work here.
When the entity framework sees the expression for the first time, it looks if he has executed this query already. If not, it will go to the database and fetch the data, setup its internal memory model and return the data to you. If the entity framework sees it already fetched the data beforehand, it is not going to go to the database and use the memory model that it setup earlier to return data to you.
This can make your life easier, but it can also be a pain. For instance if you request all records from a table by using a linq expression. The entity framework will load all data from the table. If later on you evaluate the same linq expression, even if in the time being records were deleted or added, you will get the same result.
The entity framework is a complicated thing. There are of course ways to make it reexecute the query, taking into account the changes it has in its own memory model and the like.
I suggest reading "programming entity framework" of Julia Lerman. It addresses lots of issues like the one you having right now.
It will execute the LINQ statement the same number of times no matter if you do .ToList() or not. I have an example here with colored output to the console:
What happens in the code (see code at the bottom):
Create a list of 100 ints (0-99).
Create a LINQ statement that prints every int from the list followed by two * to the console in red color, and then return the int if it's an even number.
Do a foreach on the query, printing out every even number in green color.
Do a foreach on the query.ToList(), printing out every even number in green color.
As you can see in the output below, the number of ints written to the console is the same, meaning the LINQ statement is executed the same number of times.
The difference is in when the statement is executed. As you can see, when you do a foreach on the query (that you have not invoked .ToList() on), the list and the IEnumerable object, returned from the LINQ statement, are enumerated at the same time.
When you cache the list first, they are enumerated separately, but still the same amount of times.
The difference is very important to understand, because if the list is modified after you have defined your LINQ statement, the LINQ statement will operate on the modified list when it is executed (e.g. by .ToList()). BUT if you force execution of the LINQ statement (.ToList()) and then modify the list afterwards, the LINQ statement will NOT work on the modified list.
Here's the output:
Here's my code:
// Main method:
static void Main(string[] args)
{
IEnumerable<int> ints = Enumerable.Range(0, 100);
var query = ints.Where(x =>
{
Console.ForegroundColor = ConsoleColor.Red;
Console.Write($"{x}**, ");
return x % 2 == 0;
});
DoForeach(query, "query");
DoForeach(query, "query.ToList()");
Console.ForegroundColor = ConsoleColor.White;
}
// DoForeach method:
private static void DoForeach(IEnumerable<int> collection, string collectionName)
{
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("\n--- {0} FOREACH BEGIN: ---", collectionName);
if (collectionName.Contains("query.ToList()"))
collection = collection.ToList();
foreach (var item in collection)
{
Console.ForegroundColor = ConsoleColor.Green;
Console.Write($"{item}, ");
}
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("\n--- {0} FOREACH END ---", collectionName);
}
Note about execution time: I did a few timing tests (not enough to post it here though) and I didn't find any consistency in either method being faster than the other (including the execution of .ToList() in the timing). On larger collections, caching the collection first and then iterating it seemed a bit faster, but there was no definitive conclusion from my test.
Related
Below is the code snippet. EF6 is used.
var itemNames = context.cam.AsParallel()
.Where(x=> x.cams ==
"edsfdf")
.Select(item => item.fg)
.FirstOrDefault();
Why is PLINQ slower?
If you look at the signature of .AsParallel(), it takes an IEnumerable<T>, rather than an IQueryable<T>. A linq query is only converted to a SQL statement while it's kept as an IQueryable. Once you enumerate it, it executes the query and returns the records.
So to break down the query you have:
context.cam.AsParallel()
This bit of code essentially will execute SELECT * FROM cam on the database, and then start iterating through the results.The results will be passed into a ParallelQuery. This essentially will load the entire table into memory.
.Where(x=> x.cams == "edsfdf")
.Select(item => item.fg)
.FirstOrDefault()
After this point, all of those operations will happen in parallel. A simple string equality comparison is likely extremely inexpensive compared to the overhead of spinning up a lot of threads and managing locking and concurrency between them (which PLINQ will take care of for you). Parallel processing and the costs/benefits is a complicated topic, but it's usually best saved for CPU-intensive work.
If you skipped the AsParallel() call, everything remains as an IQueryable all the way through the linq statement, so EntityFramework will send a single SQL command that looks something like SELECT fg FROM cam WHERE cams = 'edsfdf' and return that single result, which SQL Server will optimize to a very fast lookup, especially if there's an index on cams.
It's slower because you have used parralisim in a bad place. Your Where clause doesn't do heavy duty work, then it's not good scenario to use PLINQ.
If PLINQ comes with associated
complexity then what is the real benefit for it's existence?
You don't need hundred people to find a missing child in a home, because finding 100 people takes longer than finding the child by yourself. But if you were to search a forest, this overhead was worth it, because finding a child in a big forest alone takes longer than to find 100 people to find a child in a big forest!
parallelism is effective when used at the right time in the right place.
I have read several different sources over the years that indicate that when storing a collection of data, a List<T> is efficient when you want to insert objects, and an IEnumerable<T> is best for enumerating over a collection.
In LINQ-to-Entities, there is the AsEnumerable() function, that will return an IEnumerable<T>, but it will not resolve the SQL created by the LINQ statement until you start enumerating over the list.
What if I want to store objects from LINQ to Entities in a collection and then query on that collection later?
Using this strategy causes the SQL to be resolved by adding a WHERE clause and querying each record separately. I specifically don't want to do that because I'm trying to limit network chatter:
var myDataToLookup = context.MyData.AsEnumerable();
for(var myOtherDatum in myOtherDataList)
{
// gets singular record from database each time.
var myDatum = myDataToLookup.SingleOrDefault(w => w.key == myOtherDatum.key)
}
How do I resolve the SQL upfront so myDataToLookup actually contains the data in memory? I've tried ToArray:
var myDataToLookup = context.MyData.ToArray();
But I recently learned that it actually uses more memory than ToList does:
Is it better to call ToList() or ToArray() in LINQ queries?
Should I use a join instead?
var myCombinedData = from o in myOtherDataList
join d in myDataToLookup on
o.key equals d.key
select { myOtherData: o, myData: d};
Should I use ToDictionary and store my key as the key to the dictionary? Or am I worrying too much about this?
If you're using LINQ to Entities then you should not worry if ToArray is slower than ToList. There is almost no difference between them in terms of performance and LINQ to Entities itself will be a bottleneck anyway.
Regarding a dictionary. It is a structure optimized for reads by keys. There is an additional cost on adding new items though. So, if you will read by key a lot and add new items not that often then that's the way to go. But to be honest - you probably should not bother at all. If data size is not big enough, you won't see a difference.
Think of IEnumerable, ICollection and IList/IDictionary as a hierarchy each one inheriting from the previous one. Arrays add a level of restriction and complexity on top of Lists. Simply, IEnumerable gives you iteration only. ICollection adds counting and IList then gives richer functionality including find, add and remove elements by index or via lambda expressions. Dictionaries provide efficient access via a key. Arrays are much more static.
So, the answer then depends on your requirements. If it is appropriate to hold the data in memory and you need to frequently re-query it then I usually convert the Entity result to a List. This also loads the data.
If access via a set of keys is paramount then I use a Dictionary.
I cannot remember that last time I used an array except for infrequent and very specific purposes.
SO, not a direct answer, but as your question and the other replies indicate there isn't a single answer and the solution will be a compromise.
When I code and measure performance and data carried over the network, here is how I look at things based on your example above.
Let's say your result returns 100 records. Your code has now run a query on the server and performed 1 second of processing (I made the number up for sake of argument).
Then you need to cast it to a list which is going to be 1 more second of processing. Then you want to find all records that have a value of 1. The code will now Loop through the entire list to find the values with 1 and then return you the result. This is let's say another 1 second of processing and it finds 10 records.
Your network is going to carry over 10 records that took 3 seconds to process.
If you move your logic to your Data layer and make your query search right away for the records that you want, you can then save 2 seconds of performance and still only carry 10 records across the network. The bonus side is also that you can just use IEnumerable<T> as a result and not have to cast it a list. Thus eliminating the 1 second of casting to list and 1 second of iterating through the list.
I hope this helps answer your question.
After reading: http://forums.devart.com/viewtopic.php?f=31&t=22425&p=74949&hilit=IQueryable#p74949
It seems like I should use IQueryable to figure out if a entity's child table has a record or not.
My initial attempt was like:
Ace.DirDs.Any();
This line of code (or similar lines of code) could be run hundreds of times and was causing a huge performance issue.
So from reading the previous post in the link above I thought I would try something like:
IQueryable<DirD> dddd = CurrentContext.DirDs
.Where(d => d.AceConfigModelID == ace.ID).Take(1);
bool hasAChild = dddd.Any();
Would there be a better way?
There's no need for a Take(1). Plus, this one is shorter to type.
bool hasAChild = CurrentContext.DirDs.Any(d => d.AceConfigModelID == ace.ID);
I may be wrong, but I think Any() will still cause an initial Read() of the first row from the database server back to the client. You may be better getting a Count so you only get a number back:
bool hasAChild = CurrentContext.DirDs.Count(d => d.AceConfigModelID == ace.ID) > 0;
By the way this doesn't appear to be looking at a child table just DirDs.
In your example, you are materializing the IQueryable into an IEnumerable, and so the entire query is executed and you are then only just taking the first row in the result. Either of the previously exampled answers will be exponentially faster than this. Be careful when using Count as there is both a property Count and a method Count(). In order to avoid your original problem (and if you choose the Count route), you'll want to use the method Count() like in Rhumborl's example, otherwise it'll execute the query and give you the property Count of your IEnumerable that was returned. The method Count() essentially translates into a SQL COUNT whereas the method Any() translates into a SQL EXISTS (at least when working with Microsoft SQL Server). One can be more efficient than the other at this level depending on what your backend database is and which version of EF you're using.
My vote would be to always default to Any() and explore Count() if you run into performance troubles. There can still be performance costs with the method Count() at the database level, but that still depends on the database you're using.
Here's a good related answer: Linq To Entities - Any VS First VS Exists
I am using ASP NET MVC 4.5 and EF6, code first migrations.
I have this code, which takes about 6 seconds.
var filtered = _repository.Requests.Where(r => some conditions); // this is fast, conditions match only 8 items
var list = filtered.ToList(); // this takes 6 seconds, has 8 items inside
I thought that this is because of relations, it must build them inside memory, but that is not the case, because even when I return 0 fields, it is still as slow.
var filtered = _repository.Requests.Where(r => some conditions).Select(e => new {}); // this is fast, conditions match only 8 items
var list = filtered.ToList(); // this takes still around 5-6 seconds, has 8 items inside
Now the Requests table is quite complex, lots of relations and has ~16k items. On the other hand, the filtered list should only contain proxies to 8 items.
Why is ToList() method so slow? I actually think the problem is not in ToList() method, but probably EF issue, or bad design problem.
Anyone has had experience with anything like this?
EDIT:
These are the conditions:
_repository.Requests.Where(r => ids.Any(a => a == r.Student.Id) && r.StartDate <= cycle.EndDate && r.EndDate >= cycle.StartDate)
So basically, I can checking if Student id is in my id list and checking if dates match.
Your filtered variable contains a query which is a question, and it doesn't contain the answer. If you request the answer by calling .ToList(), that is when the query is executed. And that is the reason why it is slow, because only when you call .ToList() is the query executed by your database.
It is called Deferred execution. A google might give you some more information about it.
If you show some of your conditions, we might be able to say why it is slow.
In addition to Maarten's answer I think the problem is about two different situation
some condition is complex and results in complex and heavy joins or query in your database
some condition is filtering on a column which does not have an index and this cause the full table scan and make your query slow.
I suggest start monitoring the query generated by Entity Framework, it's very simple, you just need to set Log function of your context and see the results,
using (var context = new MyContext())
{
context.Database.Log = Console.Write;
// Your code here...
}
if you see something strange in generated query try to make it better by breaking it in parts, some times Entity Framework generated queries are not so good.
if the query is okay then the problem lies in your database (assuming no network problem).
run your query with an SQL profiler and check what's wrong.
UPDATE
I suggest you to:
add index for StartDate and EndDate Column in your table (one for each, not one for both)
ToList executes the query against DB, while first line is not.
Can you show some conditions code here?
To increase the performance you need to optimize query/create indexes on the DB tables.
Your first line of code only returns an IQueryable. This is a representation of a query that you want to run not the result of the query. The query itself is only runs on the databse when you call .ToList() on your IQueryable, because its the first point that you have actually asked for data.
Your adjustment to add the .Select only adds to the existing IQueryable query definition. It doesnt change what conditions have to execute. You have essentially changed the following, where you get back 8 records:
select * from Requests where [some conditions];
to something like:
select '' from Requests where [some conditions];
You will still have to perform the full query with the conditions giving you 8 records, but for each one, you only asked for an empty string, so you get back 8 empty strings.
The long and the short of this is that any performance problem you are having is coming from your "some conditions". Without seeing them, its is difficult to know. But I have seen people in the past add .Where clauses inside a loop, before calling .ToList() and inadvertently creating a massively complicated query.
Jaanus. The most likely reason of this issue is complecity of generated SQL query by entity framework. I guess that your filter condition contains some check of other tables.
Try to check generated query by "SQL Server Profiler". And then copy this query to "Management Studio" and check "Estimated execution plan". As a rule "Management Studio" generatd index recomendation for your query try to follow these recomendations.
I have a list of items and a LINQ query over them. Now, with LINQ's deferred execution, would a subsequent foreach loop execute the query only once or for each turn in the loop?
Given this example (Taken from Introduction to LINQ Queries (C#), on MSDN)
// The Three Parts of a LINQ Query:
// 1. Data source.
int[] numbers = new int[7] { 0, 1, 2, 3, 4, 5, 6 };
// 2. Query creation.
// numQuery is an IEnumerable<int>
var numQuery =
from num in numbers
where (num % 2) == 0
select num;
// 3. Query execution.
foreach (int num in numQuery)
{
Console.Write("{0,1} ", num);
}
Or, in other words, would there be any difference if I had:
foreach (int num in numQuery.ToList())
And, would it matter, if the underlying data is not in an array, but in a Database?
Now, with LINQ's deferred execution, would a subsequent foreach loop execute the query only once or for each turn in the loop?
Yes, once for the loop. Actually, it may execute the query less than once - you could abort the looping part way through and the (num % 2) == 0 test wouldn't be performed on any remaining items.
Or, in other words, would there be any difference if I had:
foreach (int num in numQuery.ToList())
Two differences:
In the case above, ToList() wastes time and memory, because it first does the same thing as the initial foreach, builds a list from it, and then foreachs that list. The differences will be somewhere between trivial and preventing the code from ever working, depending on the size of the results.
However, in the case where you are going to repeatedly do foreach on the same results, or otherwise use it repeatedly, the then while the foreach only runs the query once, the next foreach runs it again. If the query is expensive, then the ToList() approach (and storing that list) can be a massive saving.
No, it makes no difference. The in expression is evaluated once. More specifically, the foreach construct invokes the GetEnumerator() method on the in expression and repeatedly calls MoveNext() and accesses the Current property in order to traverse the IEnumerable.
OTOH, calling ToList() is redundant. You shouldn't bother calling it.
If the input is a database, the situation is slightly different, since LINQ outputs IQueryable, but I'm pretty sure that foreach still treats it as an IEnumerable (which IQueryable inherits).
As written, each iteration of the loop would do exactly as much work as it needed to fetch the next result. So the answer would technically be "none of the above". The query would execute "in pieces".
If you use ToList() or any other materialization method (ToArray() etc) then the query will be evaluated once on the spot and subsequent operations (such as iterating over the results) will simply work on a "dumb" list.
If numbers were an IQueryable instead of an IEnumerable -- as it would likely be in a database scenario -- then the above is still close to the truth although not a perfectly accurate description. In particular, on the first attempt to materialize a result the queryable provider would talk to the database and produce a result set; then, rows from this result set would be pulled on each iteration.
The linq query will be executed when it is enumerated (either as the result of a .ToList() call or doing a foreach over the results.
If you are enumerating the results of the linq query twice, both times will cause it to query the data source (in your example, enumerating the collection) as it is itself only returning an IEnumerable. However, depending on the linq query, it may not always enumerate the entire collection (e.g .Any() and .Single() will stop on the first object or the first matching object if there is a .Where()).
The implementation details of a linq provider may differ so the usual behaviour when the data source is a database is to call .ToList() straight away to cache the results of the query & also ensure that the query (in the case of EF, L2S or NHibernate) is executed once there and then rather than when the collection is enumerated at some point later in the code and to prevent the query being executed multiple times if the results are enumerated multiple times.