Came across some legacy code where the logic attempts to prevent unnecessary repeated calls to an expensive query, GetStudentsOnCourse(), but fails due to a misunderstanding of deferred execution.
var students = studentsToRemoveRecords.Select(x => x.CourseId)
.Distinct()
.SelectMany(c => studentRepository.GetStudentsOnCourse(c.Value));
var studentsToRemove = new List<Student>();
foreach (var record in studentsToRemoveRecords)
{
studentsToRemove.Add(
students.Single(s => s.Id == record.StudentId));
}
Here, if there are 2 records for the same course in studentsToRemoveRecords, the query GetStudentsOnCourse() will needlessly be called twice (with the same course id) instead of once.
You can solve this by converting students to a list beforehand and forcing it to memory (preventing the execution from being deferred). Or by simply rewriting the logic into something a bit simpler.
But I then realised I actually struggle to put into words exactly why GetStudentsOnCourse() is called twice in the scenario above... is it that LINQ is repeating the same work every time studentsToRemoveRecords is iterated over, even though the resulting input values are identical each time?
is it that LINQ is repeating the same work every time studentsToRemoveRecords is iterated over, even though the resulting input values are identical each time?
Yes, that's the nature of LINQ. Some Visual Studio Extensions, like ReSharper, give you warnings when you create code that might lead to multiple iterations of a LINQ Query.
If you want to avoid it, do this:
var students = studentsToRemoveRecords.Select(x => x.CourseId)
.Distinct()
.SelectMany(c => studentRepository.GetStudentsOnCourse(c.Value))
.ToList();
With ToList() the Query is executed immediately and the resulting entities are stored in a List<T>. Now you can iterate several times over students without having performance issues.
Edit to include comments:
Here is a link to some good documentation about it (thank you Sergio): LINQ Documentation
And some thoughts about your question how to handle this in a large code base:
Well, there are reasons for both scenarios - direct execution and storing the result into a new list, and deferred execution.
If you are familiar with SQL databases, you can think of a LINQ Query like a View or a Stored Procedure. You define what filtering/altering you want to execute on a base table to get the resulting entities. And each time you query that View/execute that Stored Procedure, it runs based on the current data in the base table.
Same for LINQ. Your Query (without ToList()) was just like the definition of the View. And each time you iterate over it, that definition gets executed based on the current Entities in studentsToRemoveRecords at that moment.
And maybe that's your intention. Maybe you know that this base list changes and you want to execute your query several times, expecting different results. Then do it without ToList().
But when you want to execute your query only once and then work with a fixed snapshot of the results that you can iterate over multiple times, do it with ToList().
So both Scenarios are valid. And when you iterate only once, both scenarios are equal (disclaimer: when you iterate directly after defining the query). Maybe that's why you saw it so many times like this. It depends what you want.
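To make the view analogy concrete, here is a minimal sketch using plain LINQ to Objects (no database involved). The class name DeferredVsSnapshot and the sample numbers are made up for illustration. It counts how often the Select lambda runs and shows that the query sees an item added later, while the ToList() snapshot does not.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredVsSnapshot
{
    // Returns { items seen by the live query, items in the snapshot, selector calls }.
    public static int[] Run()
    {
        int selectorCalls = 0;
        var source = new List<int> { 1, 2, 3 };

        // Only a definition at this point: the lambda has not run yet.
        IEnumerable<int> query = source.Select(x => { selectorCalls++; return x * 10; });

        // ToList() enumerates the query once and copies the results.
        List<int> snapshot = query.ToList();

        // Change the underlying data after the snapshot was taken.
        source.Add(4);

        // Enumerating the query again re-runs the lambda over the current source.
        int liveCount = 0;
        foreach (int item in query) liveCount++;

        return new[] { liveCount, snapshot.Count, selectorCalls };
    }

    public static void Main()
    {
        int[] r = Run();
        Console.WriteLine("live query items: " + r[0]); // 4
        Console.WriteLine("snapshot items:   " + r[1]); // 3
        Console.WriteLine("selector calls:   " + r[2]); // 7 (3 for ToList, 4 for foreach)
    }
}
```

The snapshot costs memory but runs the work once; the live query costs nothing until enumerated but repeats the work on every enumeration.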
Unclear exactly how your classes are done, BUT:
public class Student
{
public int Id { get; set; }
}
public class StudentCourse
{
public int StudentId { get; set; }
public int? CourseId { get; set; }
}
public class StudentRepository
{
public StudentCourse[] StudentCourses = new[]
{
new StudentCourse { CourseId = 1, StudentId = 100 },
new StudentCourse { CourseId = 2, StudentId = 200 },
new StudentCourse { CourseId = 3, StudentId = 300 },
new StudentCourse { CourseId = 4, StudentId = 400 },
};
public Student[] GetStudentsOnCourse(int courseId)
{
Console.WriteLine($"{nameof(GetStudentsOnCourse)}({courseId})");
return StudentCourses.Where(x => x.CourseId == courseId).Select(x => new Student { Id = x.StudentId }).ToArray();
}
}
and then
static void Main(string[] args)
{
var studentRepository = new StudentRepository();
var studentsToRemoveRecords = studentRepository.StudentCourses.ToArray();
var students = studentsToRemoveRecords.Select(x => x.CourseId)
.Distinct()
.SelectMany(c => studentRepository.GetStudentsOnCourse(c.Value));
//.ToArray();
var studentsToRemove = new List<Student>();
foreach (var record in studentsToRemoveRecords)
{
studentsToRemove.Add(
students.Single(s => s.Id == record.StudentId));
}
}
Without the .ToArray(), the method is called 16 times; with .ToArray() it is called 4 times. Note that .Single() has to scan the full students sequence to check that exactly one student has the "right" Id. Compare that with First(), which stops as soon as it finds a record with the right Id (10 calls of the method in total for this data). As I said in my comment, the method is called studentsToRemoveRecords.Count() * studentsToRemoveRecords.Select(x => x.CourseId).Distinct().Count() times, so something like x ^ 2. Doing a .ToArray() "memoizes" the result of GetStudentsOnCourse().
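If you would rather count the calls than read console output, something like the following sketch reproduces the 16 / 10 / 4 figures. It is a simplified stand-in, not the code above: students are reduced to their integer Ids, and CallCounter and GetStudentIdsOnCourse are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CallCounter
{
    private static int calls;

    // Stand-in for the expensive query: course N has a single student with Id N * 100.
    private static int[] GetStudentIdsOnCourse(int courseId)
    {
        calls++;
        return new[] { courseId * 100 };
    }

    // Returns how many times the stand-in query ran for the chosen strategy.
    public static int Count(bool materialize, bool useFirst)
    {
        calls = 0;
        int[] courseIds = { 1, 2, 3, 4 };
        int[] wantedStudentIds = { 100, 200, 300, 400 };

        IEnumerable<int> students = courseIds
            .Distinct()
            .SelectMany(c => GetStudentIdsOnCourse(c));

        // Optionally force the query to run exactly once, up front.
        if (materialize) students = students.ToArray();

        long checksum = 0;
        foreach (int id in wantedStudentIds)
        {
            // Each lookup enumerates 'students' again; when it is still a deferred
            // query, that means calling the stand-in query again as well.
            checksum += useFirst
                ? students.First(s => s == id)
                : students.Single(s => s == id);
        }

        return calls;
    }

    public static void Main()
    {
        Console.WriteLine(Count(false, false)); // Single on the deferred query: 16
        Console.WriteLine(Count(false, true));  // First on the deferred query: 10
        Console.WriteLine(Count(true, false));  // Single on the materialized array: 4
    }
}
```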
Just out of curiosity, you can add this class to your code:
public static class Tools
{
public static IEnumerable<T> DebugEnumeration<T>(this IEnumerable<T> enu)
{
Console.WriteLine("Begin Enumeration");
foreach (var res in enu)
{
yield return res;
}
}
}
and then do:
.SelectMany(c => studentRepository.GetStudentsOnCourse(c.Value))
.DebugEnumeration();
This will show you when the SelectMany is enumerated.
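As for the "something a bit simpler" rewrite mentioned in the question, one possible shape is to materialize the students into a dictionary keyed by Id, so each record becomes a lookup instead of a scan. This is only a sketch: RemovalRecord, SimplerRewrite, and the in-memory GetStudentsOnCourse are hypothetical stand-ins, and it assumes a student appears on only one of the fetched courses (ToDictionary throws on duplicate keys, much as Single would).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class Student
{
    public int Id { get; set; }
}

// Hypothetical stand-in for the question's record type.
public sealed class RemovalRecord
{
    public int StudentId { get; set; }
    public int? CourseId { get; set; }
}

public static class SimplerRewrite
{
    public static int RepositoryCalls;

    // Stand-in for the expensive query: course N has students N*100 and N*100+1.
    private static Student[] GetStudentsOnCourse(int courseId)
    {
        RepositoryCalls++;
        return new[]
        {
            new Student { Id = courseId * 100 },
            new Student { Id = courseId * 100 + 1 }
        };
    }

    public static List<Student> FindStudentsToRemove(IReadOnlyList<RemovalRecord> records)
    {
        // One query call per distinct course, materialized into a lookup by student Id.
        Dictionary<int, Student> studentsById = records
            .Where(r => r.CourseId.HasValue)
            .Select(r => r.CourseId.Value)
            .Distinct()
            .SelectMany(courseId => GetStudentsOnCourse(courseId))
            .ToDictionary(s => s.Id);

        // Each record is now a dictionary lookup; an unknown Id throws KeyNotFoundException.
        return records.Select(r => studentsById[r.StudentId]).ToList();
    }

    public static void Main()
    {
        var records = new List<RemovalRecord>
        {
            new RemovalRecord { StudentId = 100, CourseId = 1 },
            new RemovalRecord { StudentId = 101, CourseId = 1 },
            new RemovalRecord { StudentId = 200, CourseId = 2 }
        };

        List<Student> toRemove = FindStudentsToRemove(records);
        Console.WriteLine(string.Join(", ", toRemove.Select(s => s.Id))); // 100, 101, 200
        Console.WriteLine(RepositoryCalls);                                // 2
    }
}
```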
Related
I'm working on a small app written in C# / .NET Core, and I'm populating one property in code because that information is not available in the database. The code looks like this:
public async Task<IEnumerable<ProductDTO>> GetData(Request request)
{
IQueryable<Product> query = _context.Products;
var products = await query.ToListAsync();
// WARNING - THIS SOLUTION LOOKS EXPENSIVE TO ME!
return MapDataAsDTO(products).Select(c =>
{
c.HasBrandStock = products.Any(cc => cc.ParentProductId == c.Id);
return c;
});
}
private IEnumerable<ProductDTO> MapDataAsDTO(IEnumerable<Product> products)
{
return products.Select(p => MapData(p)).ToList();
}
What is bothering me here is this code:
return MapDataAsDTO(products).Select(c =>
{
c.HasBrandStock = products.Any(cc => cc.ParentProductId == c.Id);
return c;
});
I've tested it on about 300k rows and it seems slow. Is there a better solution for this situation?
Thanks guys!
Cheers
First up, this method is loading all products, and generally that is a bad idea unless you can guarantee that both the total number of records and their total size will remain reasonable. If the system can grow, add support for server-side pagination now (page number and page size, leveraging Skip and Take). 300k products is not a reasonable number to load in one hit. Any way you skin this cat, without paging it will be slow, expensive, and error prone due to server load.
For a single request, the DB server has to allocate for and load up 300k rows and transmit that data over the wire to the app server, which allocates memory for those 300k rows and then transmits them over the wire to a client who does not need 300k rows at once. What do you think happens when 10 users hit this page? 100? And what happens when it's "too slow" and they start hammering the F5 key a few times? >:)
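A rough sketch of the Skip/Take paging idea follows. The helper name GetPage and its parameters are hypothetical, and the demo runs against an in-memory IQueryable; with EF the same shape would be applied to the DbSet so that the provider does the paging. An explicit ordering is included because paging over an unordered query does not give stable pages.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

public static class Paging
{
    // pageNumber is 1-based. The ordering key should be unique so pages do not overlap.
    public static List<T> GetPage<T, TKey>(
        IQueryable<T> source,
        Expression<Func<T, TKey>> orderBy,
        int pageNumber,
        int pageSize)
    {
        if (pageNumber < 1) throw new ArgumentOutOfRangeException(nameof(pageNumber));
        if (pageSize < 1) throw new ArgumentOutOfRangeException(nameof(pageSize));

        return source
            .OrderBy(orderBy)
            .Skip((pageNumber - 1) * pageSize) // skip the earlier pages
            .Take(pageSize)                    // take at most one page
            .ToList();                         // only this page is materialized
    }

    public static void Main()
    {
        // In-memory stand-in for a table of 25 product Ids.
        IQueryable<int> productIds = Enumerable.Range(1, 25).AsQueryable();

        List<int> thirdPage = GetPage(productIds, id => id, 3, 10);
        Console.WriteLine(string.Join(", ", thirdPage)); // 21, 22, 23, 24, 25
    }
}
```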
Second, async is not a silver bullet. It doesn't make queries faster, it actually makes them a bit slower. What it does do is allow your web server to be more responsive to other requests while those slower queries are running. Default to synchronous queries, get them running as efficiently as possible, then for the larger ones that are justified, switch them to asynchronous. MS made async extremely easy to implement, perhaps too easy to treat as a default. Keep it simple and synchronous to start, then re-factor methods to async as needed.
From what I can see you want to load all products into DTOs, and for products that are recognized as being a "parent" of at least one other product, you want to set their DTO's HasBrandStock to True. So given product IDs 1 and 2, where 2's parent ID is 1, the DTO for Product ID 1 would have a HasBrandStock True while Product ID 2 would have HasBrandStock = False.
One option would be to tackle this operation in 2 queries:
var parentProductIds = _context.Products
.Where(x => x.ParentProductId != null)
.Select(x => x.ParentProductId)
.Distinct()
.ToList();
var dtos = _context.Products
.Select(x => new ProductDTO
{
ProductId = x.ProductId,
ProductName = x.ProductName,
// ...
HasBrandStock = parentProductIds.Contains(x.ProductId)
}).ToList();
I'm using a manual Select here because I don't know what your MapDataAsDTO method is actually doing. I'd highly recommend using AutoMapper and its ProjectTo<T> method if you want to simplify the mapping code. Custom mapping functions can too easily hide expensive bugs like ToList calls when someone hits a scenario that EF cannot translate.
The first query gets a distinct list of just the Product IDs that are the parent ID of at least one other product. The second query maps out all products into DTOs, setting the HasBrandStock based on whether each product appears in the parentProductIds list or not.
This option will work if a relatively limited number of products are recognized as "parents". That first list can only get so big before it risks crapping out being too many items to translate into an IN clause.
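If the products do end up in memory as in the question's code, the same "collect the parent IDs first" idea can be sketched with a HashSet, which replaces the per-product Any scan (roughly n × n comparisons) with a single pass plus constant-time lookups. The Product and ProductDto shapes below are simplified assumptions, not the asker's real classes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified, hypothetical shapes for illustration only.
public sealed class Product
{
    public int Id { get; set; }
    public int? ParentProductId { get; set; }
}

public sealed class ProductDto
{
    public int Id { get; set; }
    public bool HasBrandStock { get; set; }
}

public static class BrandStockMapper
{
    public static List<ProductDto> Map(IReadOnlyCollection<Product> products)
    {
        // One pass to collect every Id that is the parent of at least one product.
        var parentIds = new HashSet<int>(
            products.Where(p => p.ParentProductId.HasValue)
                    .Select(p => p.ParentProductId.Value));

        // Second pass: each HasBrandStock check is a hash lookup, not a scan.
        return products
            .Select(p => new ProductDto
            {
                Id = p.Id,
                HasBrandStock = parentIds.Contains(p.Id)
            })
            .ToList();
    }

    public static void Main()
    {
        var products = new List<Product>
        {
            new Product { Id = 1 },
            new Product { Id = 2, ParentProductId = 1 },
            new Product { Id = 3 }
        };

        foreach (ProductDto dto in Map(products))
        {
            Console.WriteLine(dto.Id + ": " + dto.HasBrandStock); // 1: True, 2: False, 3: False
        }
    }
}
```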
The better option would be to look at your mapping. You have a ParentProductId, does a product entity have an associated ChildProducts collection?
public class Product
{
public int ProductId { get; set; }
public string ProductName { get; set; }
// ...
public virtual Product ParentProduct { get; set; }
public virtual ICollection<Product> ChildProducts { get; set; } = new List<Product>();
}
public class ProductConfiguration : EntityTypeConfiguration<Product>
{
public ProductConfiguration()
{
HasKey(x => x.ProductId);
HasOptional(x => x.ParentProduct)
.WithMany(x => x.ChildProducts)
.Map(x => x.MapKey("ParentProductId"));
}
}
This example maps the ParentProductId without exposing a field in the entity (recommended). Otherwise, if you do expose a ParentProductId, substitute the .Map(...) call with .HasForeignKey(x => x.ParentProductId).
This assumes EF6 as per your tags, if you're using EF Core then you use HasForeignKey("ParentProductId") in place of Map(...) to establish a shadow property for the FK without exposing a property. The entity configuration is a bit different with Core.
This allows your queries to leverage the relationship between parent products and any related children products. Populating the DTOs can be accomplished with one query:
var dtos = _context.Products
.Select(x => new ProductDTO
{
ProductId = x.ProductId,
ProductName = x.ProductName,
// ...
HasBrandStock = x.ChildProducts.Any()
}).ToList();
This leverages the relationship to populate your DTO and its flag in one pass. The caveat here is that there is now a cyclical relationship between Product and itself represented in the entity. This means you should not feed entities to something like a serializer, which includes not adding entities as members of DTOs/ViewModels.
What seemed that it should be a relatively straight-forward task has turned into something of a surprisingly complex issue. To the point that I'm starting to think that my methodology perhaps is simply out of scope with the capabilities of Linq.
What I'm trying to do is piece-together a Linq query and then invoke .Include() in order to pull-in values from a number of child entities. For example, let's say I have these entities:
public class Parent
{
public int Id { get; set; }
public string Name { get; set; }
public string Location { get; set; }
public ISet<Child> Children { get; set; }
}
public class Child
{
public int Id { get; set; }
public int ParentId { get; set; }
public Parent Parent { get; set; }
public string Name { get; set; }
}
And let's say I want to perform a query to retrieve records from Parent, where Name is some value and Location is some other value, and then include Child records, too. But for whatever reason I don't know the query values for Name and Location at the same time, so I have to take two separate queryables and combine them, like so:
MyDbContext C = new MyDbContext();
var queryOne = C.Parent.Where(p => p.Name == myName);
var queryTwo = C.Parent.Where(p => p.Location == myLocation);
var finalQuery = queryOne.Intersect(queryTwo);
That works fine, producing results exactly as if I had just done:
var query = C.Parent.Where(p => p.Name == myName && p.Location == myLocation);
And similarly, I can:
var finalQuery = queryOne.Union(queryTwo);
To give me results just as if I had:
var query = C.Parent.Where(p => p.Name == myName || p.Location == myLocation);
What I cannot do once the Intersect() or Union() is applied, however, is map the Child entities using Include(), as in:
finalQuery.Include(p => p.Children);
This code will compile, but produces results as follows:
In the case of a Union(), a result set will be produced, but no Child entities will be enumerated.
In the case of an Intersect(), a run-time error is generated upon attempt to apply Include(), as follows:
Expression of type
'System.Collections.Generic.IEnumerable`1[Microsoft.EntityFrameworkCore.Query.Internal.AnonymousObject]'
cannot be used for parameter of type
'System.Collections.Generic.IEnumerable`1[System.Object]' of method
'System.Collections.Generic.IEnumerable`1[System.Object]
Intersect[Object](System.Collections.Generic.IEnumerable`1[System.Object],
System.Collections.Generic.IEnumerable`1[System.Object])'
The thing that baffles me is that this code will work exactly as expected:
var query = C.Parent.Where(p => p.Name == myName).Where(p => p.Location == myLocation);
query.Include(p => p.Children);
I.e., with the results as desired, including the Child entities enumerated.
my methodology perhaps is simply out of scope with the capabilities of Linq
The problem is not LINQ, but EF Core query translation, and specifically the lack of Intersect / Union / Concat / Except method SQL translation, tracked by #6812 Query: Translate IQueryable.Concat/Union/Intersect/Except/etc. to server.
In short, such queries currently use client evaluation, which in combination with how EF Core handles Include leads to many unexpected runtime exceptions (like your case #2) or wrong behavior (like the ignored Include in your case #1).
So while your approach technically makes perfect sense, according to the EF Core team leader's response
Changing this to producing a single SQL query on the server isn't currently a top priority
so this currently is not even planned for 3.0 release, although there are plans to change (rewrite) the whole query translation pipeline, which might allow implementing that as well.
For now, you have no options. You may try processing the query expression trees yourself, but that's a complicated task and you'll probably find out why it is not implemented yet :) If you can convert your queries to the equivalent single query with a combined Where condition, then applying Include will be fine.
P.S. Note that even though your approach technically "works" without Include, performance-wise the way it is evaluated client side makes it far from equivalent to the corresponding single query.
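For the Intersect (AND) case, one way to arrive at a single query with a combined condition is to chain Where calls on the same IQueryable as each value becomes known. The sketch below is an assumption-laden illustration: it runs against an in-memory list via AsQueryable rather than EF Core, and ParentFilter and its parameters are hypothetical names. With EF Core, the Include would then be applied to the returned query. The Union (OR) case needs a single combined predicate instead, which this sketch does not cover.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class Parent
{
    public int Id { get; set; }
    public string Name { get; set; }
    public string Location { get; set; }
}

public static class ParentFilter
{
    // Applies each filter only when its value is known. Successive Where calls
    // combine as a logical AND, so the result is still one query definition.
    public static IQueryable<Parent> Filter(IQueryable<Parent> source, string name, string location)
    {
        IQueryable<Parent> query = source;
        if (name != null) query = query.Where(p => p.Name == name);
        if (location != null) query = query.Where(p => p.Location == location);
        return query;
    }

    public static List<Parent> SampleData()
    {
        return new List<Parent>
        {
            new Parent { Id = 1, Name = "Ann", Location = "Paris" },
            new Parent { Id = 2, Name = "Ann", Location = "Rome" },
            new Parent { Id = 3, Name = "Bob", Location = "Paris" }
        };
    }

    public static void Main()
    {
        IQueryable<Parent> parents = SampleData().AsQueryable();

        List<int> chained = Filter(parents, "Ann", "Paris").Select(p => p.Id).ToList();

        // Same rows as intersecting two separately filtered queries.
        List<int> intersected = parents.Where(p => p.Name == "Ann")
            .Intersect(parents.Where(p => p.Location == "Paris"))
            .Select(p => p.Id)
            .ToList();

        Console.WriteLine(string.Join(", ", chained));          // 1
        Console.WriteLine(chained.SequenceEqual(intersected)); // True
    }
}
```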
A long time has gone by, but this .Include problem still exists in EF 6. However, there is a workaround: append .Include to every sub-query before intersecting or unioning.
MyDbContext db = new MyDbContext();
var queryOne = db.Parents.Where(p => p.Name == parent.Name).Include("Children");
var queryTwo = db.Parents.Where(p => p.Location == parent.Location).Include("Children");
var finalQuery = queryOne.Intersect(queryTwo);
As stated by @Ivan Stoev, the Intersect/Union is performed on the data after it has been fetched, whereas .Include is fine at request time.
So, as of now, you have this one option available.
I'm currently trying to write some code that runs a query on two separate databases and returns the results as anonymous objects. Once I have the two collections of anonymous objects, I need to compare them: I need to retrieve all of the records that are in webOrders but not in foamOrders. Currently, I'm making the comparison with LINQ.
My major problem is that both of the original queries return about 30,000 records, and as my code is now, it takes far too long to complete. I'm new to LINQ, so I'm trying to understand whether using LINQ to compare the two collections of anonymous objects will actually cause the database queries to run over and over again due to deferred execution. This may have an obvious answer, but I don't yet have a very firm understanding of how LINQ and anonymous objects work with deferred execution. I'm hoping someone may be able to enlighten me. Below is the code that I have...
private DataTable GetData()
{
using (var foam = Databases.Foam(false))
{
using (MySqlConnection web = new MySqlConnection(Databases.ConnectionStrings.Web(true)))
{
var foamOrders = foam.DataTableEnumerable(@"
SELECT order_id
FROM Orders
WHERE order_id NOT LIKE 'R35%'
AND originpartner_code = 'VN000011'
AND orderDate > Getdate() - 7 ")
.Select(o => new
{
order = o[0].ToString().Trim()
}).ToList();
var webOrders = web.DataTableEnumerable(@"
SELECT ORDER_NUMBER FROM TRANSACTIONS AS T WHERE
(Str_to_date(T.ORDER_DATE, '%Y%m%d %k:%i:%s') >= DATE_SUB(Now(), INTERVAL 7 DAY))
AND (STR_TO_DATE(T.ORDER_DATE, '%Y%m%d %k:%i:%s') <= DATE_SUB(NOW(), INTERVAL 1 HOUR))")
.Select(o => new
{
order = o[0].ToString().Trim()
}).ToList();
return (from w in webOrders
where !(from f in foamOrders
select f.order).Contains(w.order)
select w
).ToDataTable();
}
}
}
Your LINQ query ceases to be deferred when you call
ToDataTable();
At that point the result is a snapshot, done and dusted forever.
The same is true of foamOrders and webOrders when you convert them with
ToList();
You could do it as one query. I don't have MySQL to check it on.
Regarding deferred execution:
The .ToList() method iterates over the IEnumerable, retrieves all the values, and fills a new List<T> object with them. So execution is definitely no longer deferred at that point.
It's most likely the same with .ToDataTable().
P.S.
But I'd recommend that you:
Use custom types rather than anonymous types.
Do not use LINQ to compare objects, because it's not very efficient (LINQ does extra work).
You can create a custom MyComparer class (which might implement the IComparer interface) with a method like Compare<T1, T2> that compares two entities. Then you can create another method to compare two sets of entities, for example T1[] CompareRange<T1, T2>(T1[] entities1, T2[] entities2), that reuses your Compare method in a loop and returns the result of the operation.
P.S.
Some other resource-intensive operations that can lead to significant performance losses (if you need to perform thousands of operations):
Use of an enumerator object (a foreach loop or some LINQ methods).
Possible solution: try to use a for loop where possible.
Extensive use of anonymous methods (each call goes through a delegate invocation, and a lambda that captures variables allocates a new delegate each time it is created);
Possible solution: store lambdas in delegates (like Func<T1, T2>) and reuse them.
In case it helps anyone in the future, my new code is pasted below. It runs much faster now. Thanks to everyone's help, I've learned that even though the deferred execution of my database queries ended and the results became static once I used .ToList(), using LINQ to compare the resulting collections was very inefficient. I went with a plain foreach loop instead.
private DataTable GetData()
{
//Needed to have both connections open in order to preserve the scope of var foamOrders and var webOrders, which are both needed in order to perform the comparison.
using (var foam = Databases.Foam(isDebug))
{
using (MySqlConnection web = new MySqlConnection(Databases.ConnectionStrings.Web(isDebug)))
{
var foamOrders = foam.DataTableEnumerable(@"
SELECT foreignID
FROM Orders
WHERE order_id NOT LIKE 'R35%'
AND originpartner_code = 'VN000011'
AND orderDate > Getdate() - 7 ")
.Select(o => new
{
order = o[0].ToString()
.Trim()
}).ToList();
var webOrders = web.DataTableEnumerable(@"
SELECT ORDER_NUMBER FROM transactions AS T WHERE
(Str_to_date(T.ORDER_DATE, '%Y%m%d %k:%i:%s') >= DATE_SUB(Now(), INTERVAL 7 DAY))
AND (STR_TO_DATE(T.ORDER_DATE, '%Y%m%d %k:%i:%s') <= DATE_SUB(NOW(), INTERVAL 1 HOUR))
", 300)
.Select(o => new
{
order = o[0].ToString()
.Trim()
}).ToList();
List<OrderNumber> on = new List<OrderNumber>();
foreach (var w in webOrders)
{
if (!foamOrders.Contains(w))
{
OrderNumber o = new OrderNumber();
o.orderNumber = w.order;
on.Add(o);
}
}
return on.ToDataTable();
}
}
}
public class OrderNumber
{
public string orderNumber { get; set; }
}
}
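One further possibility, offered only as a sketch: the loop above still calls foamOrders.Contains(w) on a List, which scans the list for every web order (up to about 30,000 × 30,000 comparisons). Putting the foam order numbers in a HashSet first turns each check into a hash lookup. The class and method names here are hypothetical, and orders are reduced to plain strings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OrderDiff
{
    // Returns the web order numbers that do not appear among the foam order numbers.
    public static List<string> MissingFromFoam(
        IEnumerable<string> webOrders,
        IEnumerable<string> foamOrders)
    {
        // Build the set once; each Contains call is then a hash lookup.
        var foam = new HashSet<string>(foamOrders, StringComparer.Ordinal);

        // Keeps duplicates and the original order of webOrders.
        return webOrders.Where(w => !foam.Contains(w)).ToList();
    }

    public static void Main()
    {
        var web = new[] { "A1", "A2", "A3", "A4" };
        var foamOrders = new[] { "A2", "A4" };

        Console.WriteLine(string.Join(", ", MissingFromFoam(web, foamOrders))); // A1, A3
    }
}
```

Enumerable.Except gives a similar result with less code, but it also removes duplicate values from the output, which may or may not be what you want.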
public List<Employee> GetEmployees(){
var employee = new ApplicationDBContext().Employee;
return employee.ToList();
}
//somewhere in other part of code.
//Use GetEmployees.
var employees = GetEmployees();
var importantEmployees = employees.Where(e => e.IsImportant == true);
In terms of performance, is this approach feasible?
Is there any way to make it faster?
Thanks!
As soon as GetEmployees() executes ToList(), you retrieve all the records from the database, not just the "important" ones. By the time you execute the Where clause later on, it's too late.
Create another method, where you filter with Where before calling ToList().
public List<Employee> GetImportantEmployees()
{
var employee = new ApplicationDBContext().Employee;
return employee.Where(e => e.IsImportant).ToList();
}
Other than that, I'm not sure what else you can do to make it faster from your C# code. Apply more filters if you only need a subset of the "important" employees (also before calling ToList()).
I want to insert into my table a column named 'S' that will get some string value based on a value it gets from a table column.
For example: for each ID (a.z) I want to get its string value, which is stored in another table. The string value is returned from another method that gets it through a LINQ query.
Is it possible to call a method from Linq?
Should I do everything in the same query?
This is the structure of the information I need to get:
a.z is the ID in the first square in table #1, from this ID I get another id in table #2, and from that I can get my string value that I need to display under column 'S'.
var q = (from a in v.A join b in v.B
on a.i equals b.j
where a.k == "aaa" && a.h == 0
select new {T = a.i, S = someMethod(a.z).ToString()});
return q;
The line S = someMethod(a.z).ToString() causes the following error:
Unable to cast object of type 'System.Data.Linq.SqlClient.SqlColumn'
to type 'System.Data.Linq.SqlClient.SqlMethodCall'.
You have to execute your method call in Linq-to-Objects context, because on the database side that method call will not make sense - you can do this using AsEnumerable() - basically the rest of the query will then be evaluated as an in memory collection using Linq-to-Objects and you can use method calls as expected:
var q = (from a in v.A join b in v.B
on a.i equals b.j
where a.k == "aaa" && a.h == 0
select new {T = a.i, Z = a.z })
.AsEnumerable()
.Select(x => new { T = x.T, S = someMethod(x.Z).ToString() });
You'll want to split it up into two statements. Return the results from the query (which is what will hit the database), and then enumerate the results a second time in a separate step to transform them into the new list of objects. This second "query" won't hit the database, so you'll be able to use someMethod() inside it.
Linq-to-Entities is a bit of a strange thing, because it makes the transition to querying the database from C# extremely seamless: but you always have to remind yourself, "This C# is going to get translated into some SQL." And as a result, you have to ask yourself, "Can all this C# actually get executed as SQL?" If it can't - if you're calling someMethod() inside it - your query is going to have problems. And the usual solution is to split it up.
(The other answer from @BrokenGlass, using .AsEnumerable(), is basically another way to do just that.)
This is an old question, but I see nobody has mentioned one "hack" that allows calling methods during the select without re-iterating. The idea is to use a constructor, since in a constructor you can call whatever you wish (at least it works fine in LINQ with NHibernate; I'm not sure about LINQ to SQL or EF, but I guess it should be the same).
Below is the source code for a benchmark program. In my case the re-iterating approach is about twice as slow as the constructor approach, which is no wonder: my business logic was minimal, so things like iteration and memory allocation matter.
I also wish there were a better way to say that this or that should not be executed on the database.
// Here are the results of selecting sum of 1 million ints on my machine:
// Name Iterations Percent
// reiterate 294 53.3575317604356%
// constructor 551 100%
public class A
{
public A()
{
}
public A(int b, int c)
{
Result = Sum(b, c);
}
public int Result { get; set; }
public static int Sum(int source1, int source2)
{
return source1 + source2;
}
}
class Program
{
static void Main(string[] args)
{
var range = Enumerable.Range(1, 1000000).ToList();
BenchmarkIt.Benchmark.This("reiterate", () =>
{
var tst = range
.Select(x => new { b = x, c = x })
.AsEnumerable()
.Select(x => new A
{
Result = A.Sum(x.b, x.c)
})
.ToList();
})
.Against.This("constructor", () =>
{
var tst = range
.Select(x => new A(x, x))
.ToList();
})
.For(60)
.Seconds()
.PrintComparison();
Console.ReadKey();
}
}