What are the mechanics of the expression tree limitation here? - c#

The fact that I expected this to work and it didn't leads me to search for the piece of the picture that I don't see.

Imagine that query being remoted over to a database. How is the database engine supposed to reach over the internet as it is executing the query and tell the "count" variable on your machine to update itself? There is no standard mechanism for doing so, and therefore anything that would mutate a variable on the local machine cannot be put into an expression tree that would run on a remote machine.
More generally, queries that cause side effects when they are executed are very, very bad queries. Never, ever put a side effect in a query, even in the cases where doing so is legal. It can be very confusing. Remember, a query is not a for loop. A query results in an object represents the query, not the results of the query. A query which has side effects when executed will execute those side effects twice when asked for its results twice, and execute them zero times when asked for the results zero times. A query where the first part mutates a variable will mutate that variable before the second clause executes, not during the execution of the second clause. As a result, many query clauses give totally bizarre results when they depend on side effect execution.
For more thoughts on this, see Bill Wagner's MSDN article on the subject:
http://msdn.microsoft.com/en-us/vcsharp/hh264182
If you were writing a LINQ-to-Objects query you could do what you want by zipping:
var zipped = model.DirectTrackCampaigns.Top("10").Zip(Enumerable.Range(0, 10000), (first, second)=>new { first, second });
var res = zipped.Select(c=> new Campaign { IsSelected = c.second % 2 == 0, Name = c.first.CampaignName };
That is, make a sequence of pairs of numbers and campaigns, then manipulate the pairs.
How you'd do that in LINQ-to-whatever-you're-using, I don't know.

When you pass anything to a method that expects an Expression, nothing is actually evaluated at that time. All it does is, it breaks apart the code and creates an Expression tree out of it.
Now, in .Net 4.0, alot of things were added to the Expression API (including Expression.Increment and Expression.Assign which would actually be what you're doing), however the compiler (I think it's a limitation of the C# compiler) hasn't been updated yet to take advantage of the new 4.0 Expression stuff. Therefore, we're limited to method calls, and assignment calls will not work. Hypothetically, this could be supported in the future.

IsSelected = (count++) % 2 == 0
Pretty much means:
IsSelected = (count = count + 1) % 2 == 0
That's where the assignment is happening and the source of the error.
According to the different answers to this question, this should be possible with .NET 4.0 using expressions, though not in lambdas.

Related

Fastest way to see if a entity's child table contains any records

After reading: http://forums.devart.com/viewtopic.php?f=31&t=22425&p=74949&hilit=IQueryable#p74949
It seems like I should use IQueryable to figure out if a entity's child table has a record or not.
My initial attempt was like:
Ace.DirDs.Any();
This line of code (or similar lines of code) could be run hundreds of times and was causing a huge performance issue.
So from reading the previous post in the link above I thought I would try something like:
IQueryable<DirD> dddd = CurrentContext.DirDs
.Where(d => d.AceConfigModelID == ace.ID).Take(1);
bool hasAChild = dddd.Any();
Would there be a better way?
There's no need for a Take(1). Plus, this one is shorter to type.
bool hasAChild = CurrentContext.DirDs.Any(d => d.AceConfigModelID == ace.ID);
I may be wrong, but I think Any() will still cause an initial Read() of the first row from the database server back to the client. You may be better getting a Count so you only get a number back:
bool hasAChild = CurrentContext.DirDs.Count(d => d.AceConfigModelID == ace.ID) > 0;
By the way this doesn't appear to be looking at a child table just DirDs.
In your example, you are materializing the IQueryable into an IEnumerable, and so the entire query is executed and you are then only just taking the first row in the result. Either of the previously exampled answers will be exponentially faster than this. Be careful when using Count as there is both a property Count and a method Count(). In order to avoid your original problem (and if you choose the Count route), you'll want to use the method Count() like in Rhumborl's example, otherwise it'll execute the query and give you the property Count of your IEnumerable that was returned. The method Count() essentially translates into a SQL COUNT whereas the method Any() translates into a SQL EXISTS (at least when working with Microsoft SQL Server). One can be more efficient than the other at this level depending on what your backend database is and which version of EF you're using.
My vote would be to always default to Any() and explore Count() if you run into performance troubles. There can still be performance costs with the method Count() at the database level, but that still depends on the database you're using.
Here's a good related answer: Linq To Entities - Any VS First VS Exists

Does "foreach" cause repeated Linq execution?

I've been working for the first time with the Entity Framework in .NET, and have been writing LINQ queries in order to get information from my model. I would like to program in good habits from the beginning, so I've been doing research on the best way to write these queries, and get their results. Unfortunately, in browsing Stack Exchange, I've seem to have come across two conflicting explanations in how deferred/immediate execution works with LINQ:
A foreach causes the query to be executed in each iteration of the loop:
Demonstrated in question Slow foreach() on a LINQ query - ToList() boosts performance immensely - why is this? , the implication is that "ToList()" needs to be called in order to evaluate the query immediately, as the foreach is evaluating the query on the data source repeatedly, slowing down the operation considerably.
Another example is the question Foreaching through grouped linq results is incredibly slow, any tips? , where the accepted answer also implies that calling "ToList()" on the query will improve performance.
A foreach causes a query to be executed once, and is safe to use with LINQ
Demonstrated in question Does foreach execute the query only once? , the implication is that the foreach causes one enumeration to be established, and will not query the datasource each time.
Continued browsing of the site has turned up many questions where "repeated execution during a foreach loop" is the culprit of the performance concern, and plenty of other answers stating that a foreach will appropriately grab a single query from a datasource, which means that both explanations seem to have validity. If the "ToList()" hypothesis is incorrect (as most of the current answers as of 2013-06-05 1:51 PM EST seem to imply), where does this misconception come from? Is there one of these explanations that is accurate and one that isn't, or are there different circumstances that could cause a LINQ query to evaluate differently?
Edit: In addition to the accepted answer below, I've turned up the following question over on Programmers that very much helped my understanding of query execution, particularly the the pitfalls that could result in multiple datasource hits during a loop, which I think will be helpful for others interested in this question: https://softwareengineering.stackexchange.com/questions/178218/for-vs-foreach-vs-linq
In general LINQ uses deferred execution. If you use methods like First() and FirstOrDefault() the query is executed immediately. When you do something like;
foreach(string s in MyObjects.Select(x => x.AStringProp))
The results are retrieved in a streaming manner, meaning one by one. Each time the iterator calls MoveNext the projection is applied to the next object. If you were to have a Where it would first apply the filter, then the projection.
If you do something like;
List<string> names = People.Select(x => x.Name).ToList();
foreach (string name in names)
Then I believe this is a wasteful operation. ToList() will force the query to be executed, enumerating the People list and applying the x => x.Name projection. Afterwards you will enumerate the list again. So unless you have a good reason to have the data in a list (rather than IEnumerale) you're just wasting CPU cycles.
Generally speaking using a LINQ query on the collection you're enumerating with a foreach will not have worse performance than any other similar and practical options.
Also it's worth noting that people implementing LINQ providers are encouraged to make the common methods work as they do in the Microsoft provided providers but they're not required to. If I were to go write a LINQ to HTML or LINQ to My Proprietary Data Format provider there would be no guarantee that it behaves in this manner. Perhaps the nature of the data would make immediate execution the only practical option.
Also, final edit; if you're interested in this Jon Skeet's C# In Depth is very informative and a great read. My answer summarizes a few pages of the book (hopefully with reasonable accuracy) but if you want more details on how LINQ works under the covers, it's a good place to look.
try this on LinqPad
void Main()
{
var testList = Enumerable.Range(1,10);
var query = testList.Where(x =>
{
Console.WriteLine(string.Format("Doing where on {0}", x));
return x % 2 == 0;
});
Console.WriteLine("First foreach starting");
foreach(var i in query)
{
Console.WriteLine(string.Format("Foreached where on {0}", i));
}
Console.WriteLine("First foreach ending");
Console.WriteLine("Second foreach starting");
foreach(var i in query)
{
Console.WriteLine(string.Format("Foreached where on {0} for the second time.", i));
}
Console.WriteLine("Second foreach ending");
}
Each time the where delegate is being run we shall see a console output, hence we can see the Linq query being run each time. Now by looking at the console output we see the second foreach loop still causes the "Doing where on" to print, thus showing that the second usage of foreach does in fact cause the where clause to run again...potentially causing a slow down.
First foreach starting
Doing where on 1
Doing where on 2
Foreached where on 2
Doing where on 3
Doing where on 4
Foreached where on 4
Doing where on 5
Doing where on 6
Foreached where on 6
Doing where on 7
Doing where on 8
Foreached where on 8
Doing where on 9
Doing where on 10
Foreached where on 10
First foreach ending
Second foreach starting
Doing where on 1
Doing where on 2
Foreached where on 2 for the second time.
Doing where on 3
Doing where on 4
Foreached where on 4 for the second time.
Doing where on 5
Doing where on 6
Foreached where on 6 for the second time.
Doing where on 7
Doing where on 8
Foreached where on 8 for the second time.
Doing where on 9
Doing where on 10
Foreached where on 10 for the second time.
Second foreach ending
It depends on how the Linq query is being used.
var q = {some linq query here}
while (true)
{
foreach(var item in q)
{
...
}
}
The code above will execute the Linq query multiple times. Not because of the foreach, but because the foreach is inside another loop, so the foreach itself is being executed multiple times.
If all consumers of a linq query use it "carefully" and avoid dumb mistakes such as the nested loops above, then a linq query should not be executed multiple times needlessly.
There are occasions when reducing a linq query to an in-memory result set using ToList() are warranted, but in my opinion ToList() is used far, far too often. ToList() almost always becomes a poison pill whenever large data is involved, because it forces the entire result set (potentially millions of rows) to be pulled into memory and cached, even if the outermost consumer/enumerator only needs 10 rows. Avoid ToList() unless you have a very specific justification and you know your data will never be large.
Sometimes it might be a good idea to "cache" a LINQ query using ToList() or ToArray(), if the query is being accessed multiple times in your code.
But keep in mind that "caching" it still calls a foreach in turn.
So the basic rule for me is:
if a query is simply used in one foreach (and thats it) - then I don't cache the query
if a query is used in a foreach and in some other places in the code - then I cache it in a var using ToList/ToArray
foreach, by itself, only runs through its data once. In fact, it specifically runs through it once. You can't look ahead or back, or alter the index the way you can with a for loop.
However, if you have multiple foreachs in your code, all operating on the same LINQ query, you may get the query executed multiple times. This is entirely dependent on the data, though. If you're iterating over an LINQ-based IEnumerable/IQueryable that represents a database query, it will run that query each time. If you're iterating over an List or other collection of objets, it will run through the list each time, but won't hit your database repeatedly.
In other words, this is a property of LINQ, not a property of foreach.
The difference is in the underlying type. As LINQ is built on top of IEnumerable (or IQueryable) the same LINQ operator may have completely different performance characteristics.
A List will always be quick to respond, but it takes an upfront effort to build a list.
An iterator is also IEnumerable and may employ any algorithm every time it fetches the "next" item. This will be faster if you don't actually need to go through the complete set of items.
You can turn any IEnumerable into a list by calling ToList() on it and storing the resulting list in a local variable. This is advisable if
You don't depend on deferred execution.
You have to access more total items than the whole set.
You can pay the upfront cost of retrieving and storing all items.
Using LINQ even without entities what you will get is that deferred execution is in effect.
It is only by forcing an iteration that the actual linq expression is evaluated.
In that sense each time you use the linq expression it is going to be evaluated.
Now with entities this is still the same, but there is just more functionality at work here.
When the entity framework sees the expression for the first time, it looks if he has executed this query already. If not, it will go to the database and fetch the data, setup its internal memory model and return the data to you. If the entity framework sees it already fetched the data beforehand, it is not going to go to the database and use the memory model that it setup earlier to return data to you.
This can make your life easier, but it can also be a pain. For instance if you request all records from a table by using a linq expression. The entity framework will load all data from the table. If later on you evaluate the same linq expression, even if in the time being records were deleted or added, you will get the same result.
The entity framework is a complicated thing. There are of course ways to make it reexecute the query, taking into account the changes it has in its own memory model and the like.
I suggest reading "programming entity framework" of Julia Lerman. It addresses lots of issues like the one you having right now.
It will execute the LINQ statement the same number of times no matter if you do .ToList() or not. I have an example here with colored output to the console:
What happens in the code (see code at the bottom):
Create a list of 100 ints (0-99).
Create a LINQ statement that prints every int from the list followed by two * to the console in red color, and then return the int if it's an even number.
Do a foreach on the query, printing out every even number in green color.
Do a foreach on the query.ToList(), printing out every even number in green color.
As you can see in the output below, the number of ints written to the console is the same, meaning the LINQ statement is executed the same number of times.
The difference is in when the statement is executed. As you can see, when you do a foreach on the query (that you have not invoked .ToList() on), the list and the IEnumerable object, returned from the LINQ statement, are enumerated at the same time.
When you cache the list first, they are enumerated separately, but still the same amount of times.
The difference is very important to understand, because if the list is modified after you have defined your LINQ statement, the LINQ statement will operate on the modified list when it is executed (e.g. by .ToList()). BUT if you force execution of the LINQ statement (.ToList()) and then modify the list afterwards, the LINQ statement will NOT work on the modified list.
Here's the output:
Here's my code:
// Main method:
static void Main(string[] args)
{
IEnumerable<int> ints = Enumerable.Range(0, 100);
var query = ints.Where(x =>
{
Console.ForegroundColor = ConsoleColor.Red;
Console.Write($"{x}**, ");
return x % 2 == 0;
});
DoForeach(query, "query");
DoForeach(query, "query.ToList()");
Console.ForegroundColor = ConsoleColor.White;
}
// DoForeach method:
private static void DoForeach(IEnumerable<int> collection, string collectionName)
{
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("\n--- {0} FOREACH BEGIN: ---", collectionName);
if (collectionName.Contains("query.ToList()"))
collection = collection.ToList();
foreach (var item in collection)
{
Console.ForegroundColor = ConsoleColor.Green;
Console.Write($"{item}, ");
}
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("\n--- {0} FOREACH END ---", collectionName);
}
Note about execution time: I did a few timing tests (not enough to post it here though) and I didn't find any consistency in either method being faster than the other (including the execution of .ToList() in the timing). On larger collections, caching the collection first and then iterating it seemed a bit faster, but there was no definitive conclusion from my test.

How and When LINQ Queries are Translated and Evaluated?

I have a LINQ to SQL Query as below:
var carIds = from car in _db.Cars
where car.Color == 'Blue'
select car.Id;
The above will be translated into the below sql finally:
select Id from Cars
where Color = 'Blue'
I've read that the phases are:
LINQ to Lambda Expression
Lambda Expression to Expression Trees
Expression Trees to SQL Statements
SQL Statements gets executed
So, my questions are when and how does this get translated and executed?
I know that phase 4 happens at runtime when my "carIds" variable get accessed within a foreach loop.
foreach(var carId in carIds) // here?
{
Console.Writeline(carId) // or here?
}
What about the other phases? when do they occur? at compile time or runtime? at which line (on the definition or after the definition, when accessed or before it gets accessed)?
What you are talking about is deferred execution - essentially, the linq to SQL query will be executed at the time you attempt to enumerate the results (e.g. iterate round it, .ToArray() it etc).
In your example, the statement is executed on the following line:
foreach(var carId in carIds)
See this MSDN article on LINQ and Deferred Execution
is done by either you, the dev, or the C# compiler; you can write lambda expression manually, or with LINQ syntax they can be implied from LINQ (i.e. where x.Foo == 12 is treated as though it were .Where(x => x.Foo == 12), and then the compiler processes step 2 based on that)
is done by the C# compiler; it translates the "lambda expression" into IL that constructs (or re-uses, where possible) an expression tree. You could also argue that the runtime is what processes this step
and
are typically done when the query is enumerated, specifically when foreach calls GetEnumerator() (and possibly even deferred to the first .MoveNext)
the deferred execution allows queries to be composed; for example:
var query = ...something complex...
var items = query.Take(20).ToList();
here the Take(20) is performed as part of the overall query, so you don't bring back everything from the server and then (at the caller) filter it to the first 20; only 20 rows are ever retrieved from the server.
You can also use compiled queries to restrict the query to just step 4 (or steps 3 and 4 in some cases where the query needs different TSQL depending on parameters). Or you can skip everything except 4 if you write just TSQL to start with.
You should also add a step "5": materialization; this is quite a complex and important step. In my experience, the materialization step can actually be the most significant in terms of overall performance (hence why we wrote "dapper").

Linq performance: does it make sense to move out condition from query?

There is a LINQ query:
int criteria = GetCriteria();
var teams = Team.GetTeams()
.Where(team=> criteria == 0 || team.Criteria == criteria)
.ToList();
Does it make sense from performance (or any other point of view) to convert it into the following?
var teams = Team.GetTeams();
if (criteria != 0)
{
teams = teams.Where(team => team.Criteria == criteria);
}
teams = teams.ToList();
I have a solid set of criteria; should I separate them and apply each criteria only if necessary? Or just apply them in one LINQ query and leave up-to .NET to optimize the query?
Please advise. Any thoughts are welcome!
P.S. Guys, I don't use Linq2Sql, just LINQ, that is a pure C# code
First, it is important to understand that LINQ actually comes in several varieties, which can mostly be divided by whether they use expression trees or compiled code to execute the query.
Linq2Sql (as well as any Linq provider that converts the query into another form to perform the query) use expression trees so that the query can be analyzed and converted into another form (often SQL). In this case it is possible for the provider to modify the query, potenitally performing some level of optimization. I am not aware of any providers that currently do this.
Linq to Objects uses compiled code, and will always execute the query as written. There is no opportunity for the query to be optimized other than by the developer.
Second, all Linq queries are deferred. That means that the query is not actually executed until an attempt to get results. A side effect of this is that you can build a query in several steps, and only the final query will be executed. The second example in the question still results in only one query being executed.
If you are using Linq2Sql then it is likely that both queries have roughly similar performance. However, the first example extended to handle many criteria could potentially lead to a poor overly general execution plan which ultimately degrades performance. The trimmed query does not run this risk, and will generate an execution plan for each real permutation of criteria.
If you are using Linq to Objects, then the second query is definitely preferable, since the first query will execute the predicate passed to Where once for every input even when the condition always returns true.
I think the 2nd option is better because it is checking the criteria value only once. Whereas in the first query , criteria is being checked for every row.
But I also believe that you will not get any significant performance gain from it. It may give you better results on tables with a high number of records (theoretically)

Selecting first 100 records using Linq

How can I return first 100 records using Linq?
I have a table with 40million records.
This code works, but it's slow, because will return all values before filter:
var values = (from e in dataContext.table_sample
where e.x == 1
select e)
.Take(100);
Is there a way to return filtered? Like T-SQL TOP clause?
No, that doesn't return all the values before filtering. The Take(100) will end up being part of the SQL sent up - quite possibly using TOP.
Of course, it makes more sense to do that when you've specified an orderby clause.
LINQ doesn't execute the query when it reaches the end of your query expression. It only sends up any SQL when either you call an aggregation operator (e.g. Count or Any) or you start iterating through the results. Even calling Take doesn't actually execute the query - you might want to put more filtering on it afterwards, for instance, which could end up being part of the query.
When you start iterating over the results (typically with foreach) - that's when the SQL will actually be sent to the database.
(I think your where clause is a bit broken, by the way. If you've got problems with your real code it would help to see code as close to reality as possible.)
I don't think you are right about it returning all records before taking the top 100. I think Linq decides what the SQL string is going to be at the time the query is executed (aka Lazy Loading), and your database server will optimize it out.
Have you compared standard SQL query with your linq query? Which one is faster and how significant is the difference?
I do agree with above comments that your linq query is generally correct, but...
in your 'where' clause should probably be x==1 not x=1 (comparison instead of assignment)
'select e' will return all columns where you probably need only some of them - be more precise with select clause (type only required columns); 'select *' is a vaste of resources
make sure your database is well indexed and try to make use of indexed data
Anyway, 40milions records database is quite huge - do you need all that data all the time? Maybe some kind of partitioning can reduce it to the most commonly used records.
I agree with Jon Skeet, but just wanted to add:
The generated SQL will use TOP to implement Take().
If you're able to run SQL-Profiler and step through your code in debug mode, you will be able to see exactly what SQL is generated and when it gets executed. If you find the time to do this, you will learn a lot about what happens underneath.
There is also a DataContext.Log property that you can assign a TextWriter to view the SQL generated, for example:
dbContext.Log = Console.Out;
Another option is to experiment with LINQPad. LINQPad allows you to connect to your datasource and easily try different LINQ expressions. In the results panel, you can switch to see the SQL generated the LINQ expression.
I'm going to go out on a limb and guess that you don't have an index on the column used in your where clause. If that's the case then it's undoubtedly doing a table scan when the query is materialized and that's why it's taking so long.

Categories