How and When LINQ Queries are Translated and Evaluated?

How and When LINQ Queries are Translated and Evaluated? - c#

I have a LINQ to SQL Query as below:
var carIds = from car in _db.Cars
where car.Color == 'Blue'
select car.Id;
The above will be translated into the below sql finally:
select Id from Cars
where Color = 'Blue'
I've read that the phases are:
LINQ to Lambda Expression
Lambda Expression to Expression Trees
Expression Trees to SQL Statements
SQL Statements gets executed
So, my questions are when and how does this get translated and executed?
I know that phase 4 happens at runtime when my "carIds" variable get accessed within a foreach loop.
foreach(var carId in carIds) // here?
{
Console.Writeline(carId) // or here?
}
What about the other phases? when do they occur? at compile time or runtime? at which line (on the definition or after the definition, when accessed or before it gets accessed)?

What you are talking about is deferred execution - essentially, the linq to SQL query will be executed at the time you attempt to enumerate the results (e.g. iterate round it, .ToArray() it etc).
In your example, the statement is executed on the following line:
foreach(var carId in carIds)
See this MSDN article on LINQ and Deferred Execution

is done by either you, the dev, or the C# compiler; you can write lambda expression manually, or with LINQ syntax they can be implied from LINQ (i.e. where x.Foo == 12 is treated as though it were .Where(x => x.Foo == 12), and then the compiler processes step 2 based on that)
is done by the C# compiler; it translates the "lambda expression" into IL that constructs (or re-uses, where possible) an expression tree. You could also argue that the runtime is what processes this step
and
are typically done when the query is enumerated, specifically when foreach calls GetEnumerator() (and possibly even deferred to the first .MoveNext)
the deferred execution allows queries to be composed; for example:
var query = ...something complex...
var items = query.Take(20).ToList();
here the Take(20) is performed as part of the overall query, so you don't bring back everything from the server and then (at the caller) filter it to the first 20; only 20 rows are ever retrieved from the server.
You can also use compiled queries to restrict the query to just step 4 (or steps 3 and 4 in some cases where the query needs different TSQL depending on parameters). Or you can skip everything except 4 if you write just TSQL to start with.
You should also add a step "5": materialization; this is quite a complex and important step. In my experience, the materialization step can actually be the most significant in terms of overall performance (hence why we wrote "dapper").

Related

Understanding deferred execution performance

Suppose we have a list of students:
var students = new List<Student>(//fill with 5000 students);
We then find the youngest male student from this list:
Method 1:
var youngestMaleStudent = students.Where(s => s.Gender == "male").OrderBy(s => s.Age).First();
Console.WriteLine(youngestMaleStudent.Name);
Method 2:
var maleStudents = students.Where(s => s.Gender == "male").ToList()
var youngestMaleStudent = maleStudents.OrderBy(s => s.Age).First();
Console.WriteLine(youngestMaleStudent.Name);
I would think Method 1 should be more efficient as Method 2 creates a new list and moves everything into it, but presumably, this isn't a huge deal as copying memory is relatively fast? (though 5000 objects may start to weight things down)
But then I think, do they run differently at all performance-wise? How does LINQ process each step in Method 1, does it not need to copy everything into a list of some form in order to then start sorting (ordering) the data?

Linq deferred execution allows to enqueue, or to chain, differents parts of a query like select, where and order, as for SQL, which is executed when it is used, with a foreach or a ToList() for example.
This is obtained using fluent interface pattern.
What are the benefits of a Deferred Execution in LINQ?
Deferred Execution of LINQ Query (tutorialsteacher.com)
Deferred Vs Immediate Query Execution in LINQ (c-sharpcorner.com)
Therefore method 1 is faster because in method 2 ToList() executes the query and First() executes a new query. Thus this last can be about 2x time at worst without considering underlying caches and optimizations. Because it uses an executed query (ToList()) to do an orderby on it that is a second executed query (First()).
In other words, in method 1, the query is executed only by the First() method call and all previous calls are deferred to prepare the final query for this process like adding parameters in a string (in the case of SQL and it's about the same thing for any other target). But in method 2, the ToList() creates a List<> instance from an executed query that consumes time and memory, and next the First() call do another query on this list, that consumes time and memory again...
So it is important to check in the documentation of every Linq method if it is deferred or not.
Linq can be both a performance and spaghetti code killer as well as a black hole.

In method 2 .ToList() convert an IQuryable to IEnumerable and based on that, all data from database fetched and then students.Where(s => s.Gender == "male") condition apply on it in memory.
In method 1 youngestMaleStudent is IQuryable then the query students.Where(s => s.Gender == "male").OrderBy(s => s.Age).First(); processed on database side.
The result is method 1 performed better specially when your data is huge.

What is difference between these 2 EF queries ? Which is best and why?

What is difference between these 2 queries in EF, which is best and why?
using (var context = new BloggingContext())
{
var blogs = (from b in context.Blogs select b).ToList();
var blogs = context.Blogs.ToList();
}

I believe your question is about Method Syntax vs Query Syntax. Your first query is based on Query Syntax and the second one is based on Method syntax.
See: Query Syntax and Method Syntax in LINQ (C#)
Most queries in the introductory Language Integrated Query (LINQ)
documentation are written by using the LINQ declarative query syntax.
However, the query syntax must be translated into method calls for the
.NET common language runtime (CLR) when the code is compiled.
EDIT:
With your edited code snippets, there is no difference between two queries at the time of execution. Your first query is based on query syntax which will compile into method syntax (your second query). To select between those two is a matter of choice. I personally finds method syntax more readable.
Old Answer:
However,There is a major difference between your two queries. Your first query is just a query construct, It hasn't been executed,considering that you have tagged Entity framework, Your first query will not bring any records from database in memory. To iterate the result set you need ToList(), ToArray() etc.
Your second query is infact getting all the records from your table and loading in a List<T> object in memory.
Also see: Deferred query execution
In a query that returns a sequence of values, the query variable
itself never holds the query results and only stores the query
commands. Execution of the query is deferred until the query variable
is iterated over in a foreach or For Each loop. This is known as
deferred execution; that is, query execution occurs some time after
the query is constructed. This means that you can execute a query as
frequently as you want to. This is useful when, for example, you have
a database that is being updated by other applications. In your
application, you can create a query to retrieve the latest information
and repeatedly execute the query, returning the updated information
every time.

Totally Agreed with #Habib right answer courtesy #Habib
Remember I copied from #Habib
I believe your question is about Method Syntax vs Query Syntax. Your first query is based on Query Syntax and the second one is based on Method syntax.
See: Query Syntax and Method Syntax in LINQ (C#)
Most queries in the introductory Language Integrated Query (LINQ) documentation are written by using the LINQ declarative query syntax. However, the query syntax must be translated into method calls for the .NET common language runtime (CLR) when the code is compiled.
EDIT: With your edited code snippets, there is no difference between two queries at the time of execution. Your first query is based on query syntax which will compile into method syntax (your second query). To select between those two is a matter of choice. I personally finds method syntax more readable.

First query doesn't perform anything, you are just constructing the query.
Second query fetches all the blog records from DB and load them into memory.
Which is best is depends on your needs.
Edit: If you are asking for syntax, there is no difference.The query syntax will be compiled into extension method calls by the compiler.
Apart from that, this:
from b in context.Blogs select b
is kinda pointless. It's equivelant to context.Blogs.Select(x => x). So in this case I would go with context.Blogs.ToList();

Are different IQueryable objects combined?

I have a little program that needs to do some calculation on a data range. The range maybe contain about half a millon of records. I just looked to my db and saw that a group by was executed.
I thought that the result was executed on the first line, and later I just worked with data in RAM. But now I think that the query builder combine the expression.
var Test = db.Test.Where(x => x > Date.Now.AddDays(-7));
var Test2 = (from p in Test
group p by p.CustomerId into g
select new { UniqueCount = g.Count() } );
In my real world app I got more subqueries that is based on the range selected by the first query. I think I just added a big overhead to let the DB make different selects.
Now I bascilly just call .ToList() after the first expression.
So my question is am I right about that the query builder combine different IQueryable when it builds the expression tree?

Yes, you are correct. LINQ expressions are lazily evaluated at the moment you evaluate them (via .ToList(), for example). At that point in time, Entity Framework will look at the total query and build an SQL statement to represent it.
In this particular case, it's probably wiser to not evaluate the first query, because the SQL database is optimized for performing set-based operations like grouping and counting. Rather than forcing the database to send all the Test objects across the wire, deserializing the results into in-memory objects, and then performing the grouping and counting locally, you will likely see better performance by having the SQL database just return the resulting Counts.

Linq performance: does it make sense to move out condition from query?

There is a LINQ query:
int criteria = GetCriteria();
var teams = Team.GetTeams()
.Where(team=> criteria == 0 || team.Criteria == criteria)
.ToList();
Does it make sense from performance (or any other point of view) to convert it into the following?
var teams = Team.GetTeams();
if (criteria != 0)
{
teams = teams.Where(team => team.Criteria == criteria);
}
teams = teams.ToList();
I have a solid set of criteria; should I separate them and apply each criteria only if necessary? Or just apply them in one LINQ query and leave up-to .NET to optimize the query?
Please advise. Any thoughts are welcome!
P.S. Guys, I don't use Linq2Sql, just LINQ, that is a pure C# code

First, it is important to understand that LINQ actually comes in several varieties, which can mostly be divided by whether they use expression trees or compiled code to execute the query.
Linq2Sql (as well as any Linq provider that converts the query into another form to perform the query) use expression trees so that the query can be analyzed and converted into another form (often SQL). In this case it is possible for the provider to modify the query, potenitally performing some level of optimization. I am not aware of any providers that currently do this.
Linq to Objects uses compiled code, and will always execute the query as written. There is no opportunity for the query to be optimized other than by the developer.
Second, all Linq queries are deferred. That means that the query is not actually executed until an attempt to get results. A side effect of this is that you can build a query in several steps, and only the final query will be executed. The second example in the question still results in only one query being executed.
If you are using Linq2Sql then it is likely that both queries have roughly similar performance. However, the first example extended to handle many criteria could potentially lead to a poor overly general execution plan which ultimately degrades performance. The trimmed query does not run this risk, and will generate an execution plan for each real permutation of criteria.
If you are using Linq to Objects, then the second query is definitely preferable, since the first query will execute the predicate passed to Where once for every input even when the condition always returns true.

I think the 2nd option is better because it is checking the criteria value only once. Whereas in the first query , criteria is being checked for every row.
But I also believe that you will not get any significant performance gain from it. It may give you better results on tables with a high number of records (theoretically)

What are the mechanics of the expression tree limitation here?

The fact that I expected this to work and it didn't leads me to search for the piece of the picture that I don't see.

Imagine that query being remoted over to a database. How is the database engine supposed to reach over the internet as it is executing the query and tell the "count" variable on your machine to update itself? There is no standard mechanism for doing so, and therefore anything that would mutate a variable on the local machine cannot be put into an expression tree that would run on a remote machine.
More generally, queries that cause side effects when they are executed are very, very bad queries. Never, ever put a side effect in a query, even in the cases where doing so is legal. It can be very confusing. Remember, a query is not a for loop. A query results in an object represents the query, not the results of the query. A query which has side effects when executed will execute those side effects twice when asked for its results twice, and execute them zero times when asked for the results zero times. A query where the first part mutates a variable will mutate that variable before the second clause executes, not during the execution of the second clause. As a result, many query clauses give totally bizarre results when they depend on side effect execution.
For more thoughts on this, see Bill Wagner's MSDN article on the subject:
http://msdn.microsoft.com/en-us/vcsharp/hh264182
If you were writing a LINQ-to-Objects query you could do what you want by zipping:
var zipped = model.DirectTrackCampaigns.Top("10").Zip(Enumerable.Range(0, 10000), (first, second)=>new { first, second });
var res = zipped.Select(c=> new Campaign { IsSelected = c.second % 2 == 0, Name = c.first.CampaignName };
That is, make a sequence of pairs of numbers and campaigns, then manipulate the pairs.
How you'd do that in LINQ-to-whatever-you're-using, I don't know.

When you pass anything to a method that expects an Expression, nothing is actually evaluated at that time. All it does is, it breaks apart the code and creates an Expression tree out of it.
Now, in .Net 4.0, alot of things were added to the Expression API (including Expression.Increment and Expression.Assign which would actually be what you're doing), however the compiler (I think it's a limitation of the C# compiler) hasn't been updated yet to take advantage of the new 4.0 Expression stuff. Therefore, we're limited to method calls, and assignment calls will not work. Hypothetically, this could be supported in the future.

IsSelected = (count++) % 2 == 0
Pretty much means:
IsSelected = (count = count + 1) % 2 == 0
That's where the assignment is happening and the source of the error.
According to the different answers to this question, this should be possible with .NET 4.0 using expressions, though not in lambdas.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.