Using Include with Intersect/Union/Exclude in Linq

Using Include with Intersect/Union/Exclude in Linq - c#

What seemed that it should be a relatively straight-forward task has turned into something of a surprisingly complex issue. To the point that I'm starting to think that my methodology perhaps is simply out of scope with the capabilities of Linq.
What I'm trying to do is piece-together a Linq query and then invoke .Include() in order to pull-in values from a number of child entities. For example, let's say I have these entities:
public class Parent
{
public int Id { get; set; }
public string Name { get; set; }
public string Location { get; set; }
public ISet<Child> Children { get; set; }
}
public class Child
{
public int Id { get; set; }
public int ParentId { get; set; }
public Parent Parent { get; set; }
public string Name { get; set; }
}
And let's say I want to perform a query to retrieve records from Parent, where Name is some value and Location is some other value, and then include Child records, too. But for whatever reason I don't know the query values for Name and Location at the same time, so I have to take two separate queryables and join them, such:
MyDbContext C = new MyDbContext();
var queryOne = C.Parent.Where(p => p.Name == myName);
var queryTwo = C.Parent.Where(p => p.Location == myLocation);
var finalQuery = queryOne.Intersect(queryTwo);
That works fine, producing results exactly as if I had just done:
var query = C.Parent.Where(p => p.Name == myName && p.Location = myLocation);
And similarly, I can:
var finalQuery = queryOne.Union(queryTwo);
To give me results just as if I had:
var query = C.Parent.Where(p => p.Name == myName || p.Location = myLocation);
What I cannot do, however, once the Intersect() or Union() is applied, however, is then go about mapping the Child using Include(), as in:
finalQuery.Include(p => p.Children);
This code will compile, but produces results as follows:
In the case of a Union(), a result set will be produced, but no Child entities will be enumerated.
In the case of an Intersect(), a run-time error is generated upon attempt to apply Include(), as follows:
Expression of type
'System.Collections.Generic.IEnumerable`1[Microsoft.EntityFrameworkCore.Query.Internal.AnonymousObject]'
cannot be used for parameter of type
'System.Collections.Generic.IEnumerable`1[System.Object]' of method
'System.Collections.Generic.IEnumerable`1[System.Object]
Intersect[Object](System.Collections.Generic.IEnumerable`1[System.Object],
System.Collections.Generic.IEnumerable`1[System.Object])'
The thing that baffles me is that this code will work exactly as expected:
var query = C.Parent.Where(p => p.Name == myName).Where(p => p.Location == myLocation);
query.Include(p => p.Children);
I.e., with the results as desired, including the Child entities enumerated.

my methodology perhaps is simply out of scope with the capabilities of Linq
The problem is not LINQ, but EF Core query translation, and specifically the lack of Intersect / Union / Concat / Except method SQL translation, tracked by #6812 Query: Translate IQueryable.Concat/Union/Intersect/Except/etc. to server.
Shortly, such queries currently use client evaluation, which with combination of how the EF Core handles Include leads to many unexpected runtime exceptions (like your case #2) or wrong behaviors (like Ignored Includes in your case #1).
So while your approach technically perfectly makes sense, according to the EF Core team leader response
Changing this to producing a single SQL query on the server isn't currently a top priority
so this currently is not even planned for 3.0 release, although there are plans to change (rewrite) the whole query translation pipeline, which might allow implementing that as well.
For now, you have no options. You may try processing the query expression trees yourself, but that's a complicated task and you'll probably find why it is not implemented yet :) If you can convert your queries to the equivalent single query with combined Where condition, then applying Include will be fine.
P.S. Note that even now your approach technically "works" w/o Include, prefomance wise the way it is evaluated client side makes it absolutely non equivalent of the corresponding single query.

A long time has gone by, but this .Include problem still exists in EF 6. However, there is a workaround: Append every child request with .Include before intersecting/Unionizing.
MyDbContext C = new MyDbContext();
var queryOne = db.Parents.Where(p => p.Name == parent.Name).Include("Children");
var queryTwo = db.Parents.Where(p => p.Location == parent.Location).Include("Children");
var finalQuery = queryOne.Intersect(queryTwo);
As stated by #Ivan Stoev, Intersection/Union is done with after-fetched data, while .Include is ok at request time.
So, as of now, you have this one option available.

Related

C# LINQ executing the same work over and over

Came across some legacy code where the logic attempts to prevent un-necessary multiple calls to an expensive query GetStudentsOnCourse(), but fails due to a misunderstanding of deferred execution.
var students = studentsToRemoveRecords.Select(x => x.CourseId)
.Distinct()
.SelectMany(c => studentRepository.GetStudentsOnCourse(c.Value));
var studentsToRemove = new List<Student>();
foreach (var record in studentsToRemoveRecords)
{
studentsToRemove.Add(
students.Single(s => s.Id == record.StudentId));
}
Here, if there are 2 records for the same course in studentsToRemoveRecords, the query GetStudentsOnCourse() will needlessly be called twice (with the same course id) instead of once.
You can solve this by converting students to a list beforehand and forcing it to memory (preventing the execution from being deferred). Or by simply rewriting the logic into something a bit simpler.
But I then realised I actually struggle to put into words exactly why GetStudentsOnCourse() is called twice in the scenario above... is it that LINQ is repeating the same work everytime studentsToRemoveRecords is iterated over, even though the resulting input values are identical each time?

is it that LINQ is repeating the same work everytime studentsToRemoveRecords is iterated over, even though the resulting input values are identical each time?
Yes, that's the nature of LINQ. Some Visual Studio Extensions, like ReSharper, give you warnings when you create code that might lead to multiple iterations of a LINQ Query.
If you want to avoid it, do this:
var students = studentsToRemoveRecords.Select(x => x.CourseId)
.Distinct()
.SelectMany(c => studentRepository.GetStudentsOnCourse(c.Value))
.ToList();
With ToList() the Query is executed immediately and the resulting entities are stored in a List<T>. Now you can iterate several times over students without having performance issues.
Edit to include comments:
Here is a link to some good documentation about it (thank you Sergio): LINQ Documentation
And some thoughts about your question how to handle this in a large code base:
Well, there are reasons for both scenarios - direct execution and storing the result into a new list, and deferred execution.
If you are familiar with SQL databases, you can think of a LINQ Query like a View or a Stored Procedure. You define what filtering/altering you want to execute on a base table to get the resulting entities. And each time you query that View/execute that Stored Procedure, it runs based on the current data in the base table.
Same for LINQ. Your Query (without ToList()) was just like the definition of the View. And each time you iterate over it, that definition gets executed based on the current Entities in studentsToRemoveRecords at that moment.
And maybe that's your intetion. Maybe you know that this base list is altering and you want to execute your query several times, expecting different results. Then do it without ToList().
But when you want to execute your query only once and then expect an immutable result list over which you can iterate multiple times, do it with ToList().
So both Scenarios are valid. And when you iterate only once, both scenarios are equal (disclaimer: when you iterate directly after defining the query). Maybe that's why you saw it so many times like this. It depends what you want.

Unclear exactly how your classes are done, BUT:
public class Student
{
public int Id { get; set; }
}
public class StudentCourse
{
public int StudentId { get; set; }
public int? CourseId { get; set; }
}
public class StudentRepository
{
public StudentCourse[] StudentCourses = new[]
{
new StudentCourse { CourseId = 1, StudentId = 100 },
new StudentCourse { CourseId = 2, StudentId = 200 },
new StudentCourse { CourseId = 3, StudentId = 300 },
new StudentCourse { CourseId = 4, StudentId = 400 },
};
public Student[] GetStudentsOnCourse(int courseId)
{
Console.WriteLine($"{nameof(GetStudentsOnCourse)}({courseId})");
return StudentCourses.Where(x => x.CourseId == courseId).Select(x => new Student { Id = x.StudentId }).ToArray();
}
}
and then
static void Main(string[] args)
{
var studentRepository = new StudentRepository();
var studentsToRemoveRecords = studentRepository.StudentCourses.ToArray();
var students = studentsToRemoveRecords.Select(x => x.CourseId)
.Distinct()
.SelectMany(c => studentRepository.GetStudentsOnCourse(c.Value));
//.ToArray();
var studentsToRemove = new List<Student>();
foreach (var record in studentsToRemoveRecords)
{
studentsToRemove.Add(
students.Single(s => s.Id == record.StudentId));
}
}
the method is called 16 times, with .ToArray() it is called 4 times. Note that .Single() will parse the full students collection to check that there is a single student with the "right" Id. Compare it with First() that will break after finding one record with the right Id (10 total calls of the method). As I've said in my comment, the method is called studentsToRemoveRecords.Count() * studentsToRemoveRecords.Distinct().Count(), so something like x ^ 2. Doing a .ToArray() "memoizes" the result of the GetStudentsOnCourse.
Just out of curiosity, you can add this class to your code:
public static class Tools
{
public static IEnumerable<T> DebugEnumeration<T>(this IEnumerable<T> enu)
{
Console.WriteLine("Begin Enumeration");
foreach (var res in enu)
{
yield return res;
}
}
}
and then do:
.SelectMany(c => studentRepository.GetStudentsOnCourse(c.Value))
.DebugEnumeration();
This will show you when the SelectMany is enumerated.

Select() decline in performance

I'm working on small app which is written in c# .net core and I'm populating one prop in a code because that information is not available in database, code looks like this:
public async Task<IEnumerable<ProductDTO>> GetData(Request request)
{
IQueryable<Product> query = _context.Products;
var products = await query.ToListAsync();
// WARNING - THIS SOLUTION LOOKS EXPENCIVE TO ME!
return MapDataAsDTO(products).Select(c =>
{
c.HasBrandStock = products.Any(cc => cc.ParentProductId == c.Id);
return c;
});
}
}
private IEnumerable<ProductDTO> MapDataAsDTO(IEnumerable<Product> products)
{
return products.Select(p => MapData(p)).ToList();
}
What is bothering me here is this code:
return MapDataAsDTO(products).Select(c =>
{
c.HasBrandStock = data.Any(cc => cc.ParentProductId == c.Id);
return c;
});
}
I've tested it on like 300k rows and it seems slow, I'm wondering is there a better solutions in this situations?
Thanks guys!
Cheers

First up, this method is loading all products, and generally that is a bad idea unless you are guaranteeing that the total number of records will remain reasonable, and the total size of those records will be reasonable. If the system can grow, add support for server-side pagination now. (Page # and Page size, leveraging Skip & Take) 300k products is not a reasonable number to be loading all data in one hit. Any way you skin this cat it will be slow, expensive, and error prone due to server load without paging. One user making a request on the server will need to have the DB server allocate for and load up 300k rows, transmit that data over the wire to the app server, which will allocate memory for those 300k rows, then transmit that data over the wire to the client who literally does not need those 300k rows at once. What do you think happens when 10 users hit this page? 100? And what happens when it's "to slow" and they start hammering the F5 key a few times. >:)
Second, async is not a silver bullet. It doesn't make queries faster, it actually makes them a bit slower. What it does do is allow your web server to be more responsive to other requests while those slower queries are running. Default to synchronous queries, get them running as efficiently as possible, then for the larger ones that are justified, switch them to asynchronous. MS made async extremely easy to implement, perhaps too easy to treat as a default. Keep it simple and synchronous to start, then re-factor methods to async as needed.
From what I can see you want to load all products into DTOs, and for products that are recognized as being a "parent" of at least one other product, you want to set their DTO's HasBrandStock to True. So given product IDs 1 and 2, where 2's parent ID is 1, the DTO for Product ID 1 would have a HasBrandStock True while Product ID 2 would have HasBrandStock = False.
One option would be to tackle this operation in 2 queries:
var parentProductIds = _context.Products
.Where(x => x.ParentProductId != null)
.Select(x => x.ParentProductId)
.Distinct()
.ToList();
var dtos = _context.Products
.Select(x => new ProductDTO
{
ProductId = x.ProductId,
ProductName = x.ProductName,
// ...
HasBrandStock = parentProductIds.Contains(x.ProductId)
}).ToList();
I'm using a manual Select here because I don't know what your MapAsDto method is actually doing. I'd highly recommend using Automapper and it's ProjectTo<T> method if you want to simplify the mapping code. Custom mapping functions can too easily hide expensive bugs like ToList calls when someone hits a scenario that EF cannot translate.
The first query gets a distinct list of just the Product IDs that are the parent ID of at least one other product. The second query maps out all products into DTOs, setting the HasBrandStock based on whether each product appears in the parentProductIds list or not.
This option will work if a relatively limited number of products are recognized as "parents". That first list can only get so big before it risks crapping out being too many items to translate into an IN clause.
The better option would be to look at your mapping. You have a ParentProductId, does a product entity have an associated ChildProducts collection?
public class Product
{
public int ProductId { get; set; }
public string ProductName { get; set; }
// ...
public virtual Product ParentProduct { get; set; }
public virtual ICollection<Product> ChildProducts { get; set; } = new List<Product>();
}
public class ProductConfiguration : EntityTypeConfiguration<Product>
{
public ProductConfiguration()
{
HasKey(x => x.ProductId);
HasOptional(x => x.ParentProduct)
.WithMany(x => x.ChildProducts)
.Map(x => x.MapKey("ParentProductId"));
}
}
This example maps the ParentProductId without exposing a field in the entity (recommended). Otherwise, if you do expose a ParentProductId, substitute the .Map(...) call with .HasForeignKey(x => x.ParentProductId).
This assumes EF6 as per your tags, if you're using EF Core then you use HasForeignKey("ParentProductId") in place of Map(...) to establish a shadow property for the FK without exposing a property. The entity configuration is a bit different with Core.
This allows your queries to leverage the relationship between parent products and any related children products. Populating the DTOs can be accomplished with one query:
var dtos = _context.Products
.Select(x => new ProductDTO
{
ProductId = x.ProductId,
ProductName = x.ProductName,
// ...
HasBrandStock = x.ChildProducts.Any()
}).ToList();
This leverages the relationship to populate your DTO and it's flag in one pass. The caveat here is that there is now a cyclical relationship between product and itself represented in the entity. This means don't feed entities to something like a serializer. That includes avoiding adding entities as members of DTOs/ViewModels.

EF Core Cascading Deletes for Speed?

I'm working on an established (but changeable, assuming existing data survives any changes) code base and investigating some very slow deletes. So far I've only succeeded in making things worse, so here we are. I've backed out most of my attempted changes below to avoid adding extra unnecessary confusion.
There's a data class ProductDefinition which models a same-object hierarchy similar to e.g. a folder structure: every PD (except the root) will have one parent, but like a folder can have multiple children.
public class ProductDefinition
{
public int ID { get; set; }
// each tree of PDs should have a 'head' which will have no parent
// but most will have a ParentPDID and corresponding ParentPD
public virtual ProductDefinition ParentProductDefinition { get; set; }
public int? ParentProductDefinitionId { get; set; }
public virtual List<ProductDefinition> ProductDefinitions { get; set; }
= new List<ProductDefinition>();
[Required]
[StringLength(100)]
public string Name { get; set; }
// etc. Fields. Nothing so large you'd expect speed issues
}
The corresponding table has been specifically declared in the Context
public DbSet<ProductDefinition> ProductDefinitions { get; set; }
Along with a Fluent API relationship defined on Context.OnModelCreating
modelBuilder.Entity<ProductDefinition>()
.HasMany(productDefinition => productDefinition.ProductDefinitions)
.WithOne(childPd => childPd.ParentProductDefinition)
.HasForeignKey(childPd => childPd.ParentProductDefinitionId)
.HasPrincipalKey(productDefinition => productDefinition.ID);
It looks like an attempt has already been made to firm up deletion in the ProductDefinitionManager class
public static async Task ForceDelete(int ID, ProductContext context)
{
// wrap the recursion in a save so that it only happens once
await ForceDeleteNoSave(ID, context);
await context.SaveChangesAsync();
}
And
private static async Task ForceDeleteNoSave(int ID, ProductContext context)
{
var pd = await context.ProductDefinitions
.AsNoTracking()
.Include(x => x.ProductDefinitions)
.SingleAsync(x => x.ID == ID);
if (pd.ProductDefinitions != null && pd.ProductDefinitions.Count != 0)
{
var childIDs = pd.ProductDefinitions.Select(x => x.ID).ToList();
// delete the children recursively
foreach (var child in childIDs)
{
// EDITED HERE TO CORRECTLY REFLECT THE CURRENT CODE BASE
await ForceDeleteNoSave(child, context);
}
}
// delete the PD
// mark Supplier as edited
var supplier = await context.Suppliers.FindAsync(pd.SupplierID);
supplier.Edited = true;
// reload with tracking
pd = await context.ProductDefinitions.FirstOrDefaultAsync(x => x.ID == ID);
context.ProductDefinitions.Remove(pd);
}
At present, the above solution 'works', but:
a) takes over 2 minutes to complete
b) Seems to be giving the React front end a 502 error (but see above). Certainly the FE is claiming a 502
My primary question is: is there a way to improve the deletion speed, e.g. by defining a cascading delete in FluentAPI (my attempt hit an issue when trying to apply the migration)? But I welcome any discussion of what might be causing the FE to report Bad Gateway.

Unfortunately this is self refencing relationship and cascade delete cannot be used due to "multiple cascade paths" issue - limitation of SqlServer (and probably other) database (Oracle has no such issue).
The best way to handle in the databases which does not support "multiple cascade paths" is to use database trigger ("instead of delete").
But let say we want to handle it via client code in EF Core. The question is how to load effectively a recursive tree like structure (another not easy task in EF Core due to lack of recursive query support).
The problem with your code is that it uses depth first algorithm, which executes a lot of database queries. The more appropriate and performant way is to use breath first algorithm - in simple words, loading the items by level. This way the number of the database queries would be the maximum depth in the tree, which is way less than the number of the elements.
One way to implement that is to start with a query with initial filter applies, and then use SelectMany to get the next level (each SelectMany adds a join to the previous query). The process ends when the query does not return data:
public static async Task ForceDelete(int ID, ProductContext context)
{
var items = new List<ProductDefinition>();
// Collect the items by level
var query = context.ProductDefinitions.Where(e => e.ID == ID);
while (true)
{
var nextLevel = await query
.Include(e => e.Supplier)
.ToListAsync();
if (nextLevel.Count == 0) break;
items.AddRange(nextLevel);
query = query.SelectMany(e => e.ProductDefinitions);
}
foreach (var item in items)
item.Supplier.Edited = true;
context.RemoveRange(items);
await context.SaveChangesAsync();
}
Note that the executed queries eager load the related Supplier so it ca easily be updated.
Once the items are collected, they are simply marked for deleting via RemoveRange method. The order doesn't matter because EF Core will apply the commands by the dependency order anyway.
Another way to collect the items is to use the IDs from the previous level as a filter (SQL IN):
// Collect the items by level
Expression<Func<ProductDefinition, bool>> filter = e => e.ID == ID;
while (true)
{
var nextLevel = await context.ProductDefinitions
.Include(e => e.Supplier)
.Where(filter)
.ToListAsync();
if (nextLevel.Count == 0) break;
items.AddRange(nextLevel);
var parentIds = nextLevel.Select(e => e.ID);
filter = e => parentIds.Contains(e.ParentProductDefinitionId.Value);
}
I like more the former. The drawback is that EF Core generates a huge table name aliases, and also it could hit some SQL join number limitation in case of big depth. The later has no depth limitation, but might have issues with big IN clause. You should check which one is more appropriate for your case.

Ok. It is a bit hard to understand exactly why this is slow. How big is the data structure etc.
The first thing that springs to my eye when I look at the above code is the following:
public static async Task ForceDelete(int ID, ProductContext context)
{
// wrap the recursion in a save so that it only happens once
await ForceDeleteNoSave(ID, context);
await context.SaveChangesAsync();
}
This method is called recursively but every time you are done with a bunch of children it will call context.SaveChagesAsync(). Which means when you run the code you will get multiple saves and multiple calls to the database.
This seems like an anti-pattern, because if your program crashes half-way through it has already deleted some of the children.
Instead have an InitForceDelete() that in the end will call the context.SaveChangesAsync() so it is all done in one operation.
Something like this:
public static async Task InitForceDelete(int ID, ProductContext context)
{
// wrap the recursion in a save so that it only happens once
await ForceDeleteNoSave(ID, context);
await context.SaveChangesAsync();
}
private static async Task ForceDeleteNoSave(int ID, ProductContext context)
{
var pd = await context.ProductDefinitions
.AsNoTracking()
.Include(x => x.ProductDefinitions)
.SingleAsync(x => x.ID == ID);
if (pd.ProductDefinitions != null && pd.ProductDefinitions.Count != 0)
{
var childIDs = pd.ProductDefinitions.Select(x => x.ID).ToList();
// delete the children recursively
foreach (var child in childIDs)
{
await ForceDeleteNoSave(child, context);
}
}
var supplier = await context.Suppliers.FindAsync(pd.SupplierID);
supplier.Edited = true;
// reload with tracking
pd = await context.ProductDefinitions.FirstOrDefaultAsync(x => x.ID == ID);
context.ProductDefinitions.Remove(pd);
}
Now secondly you should try to inspect the sql that is being executed on your SQL server. You should be able to find the execution plans triggered by your LINQ statements and see if the SQL is completely crazy. Maybe your code is executing one call per ProductDefinition which would make it super slow.
I am sorry I cannot be more precise, but from the code you have presented it is hard to give direct pointers except for your constant call to context.SaveChagesAsync().

Entity Framework Include directive not getting all expected related rows

While debugging some performance issues I discovered that Entity framework was loading a lot of records via lazy loading (900 extra query calls ain't fast!) but I was sure I had the correct include. I've managed to get this down to quite a small test case to demonstrate the confusion I'm having, the actual use case is more complex so I don't have a lot of scope to re-work the signature of what I'm doing but hopefully this is a clear example of the issue I'm having.
Documents have Many MetaInfo rows related. I want to get all documents grouped by MetaInfo rows with a specific value, but I want all the MetaInfo rows included so I don't have to fire off a new request for all the Documents MetaInfo.
So I've got the following Query.
ctx.Configuration.LazyLoadingEnabled = false;
var DocsByCreator = ctx.Documents
.Include(d => d.MetaInfo) // Load all the metaInfo for each object
.SelectMany(d => d.MetaInfo.Where(m => m.Name == "Author") // For each Author
.Select(m => new { Doc = d, Creator = m })) // Create an object with the Author and the Document they authored.
.ToList(); // Actualize the collection
I expected this to have all the Document / Author pairs, and have all the Document MetatInfo property filled.
That's not what happens, I get the Document objects, and the Authors just fine, but the Documents MetaInfo property ONLY has MetaInfo objects with Name == "Author"
If I move the where clause out of the select many it does the same, unless I move it to after the actualisation (which while here might not be a big deal, it is in the real application as it means we're getting a huge amount more data than we want to deal with.)
After playing with a bunch of different ways to do this I think it really looks like the issue is when you do a select(...new...) as well as the where and the include. Doing the select, or the Where clause after actualisation makes the data appear the way I expected it to.
I figured it was an issue with the MetaInfo property of Document being filtered, so I rewrote it as follows to test the theory and was surprised for find that this also gives the same (I think wrong) result.
ctx.Configuration.LazyLoadingEnabled = false;
var DocsByCreator = ctx.Meta
.Where(m => m.Name == "Author")
.Include(m => m.Document.MetaInfo) // Load all the metaInfo for Document
.Select(m => new { Doc = m.Document, Creator = m })
.ToList(); // Actualize the collection
Since we're not putting the where on the Document.MetaInfo property I expected this to bypass the problem, but strangely it doesn't the documents still only appear to have "Author" MetaInfo object.
I've created a simple test project and uploaded it to github with a bunch of test cases in, as far as I can tell they should all pass, bug only the ones with premature actualisation pass.
https://github.com/Robert-Laverick/EFIncludeIssue
Anyone got any theories? Am I abusing EF / SQL in some way I'm missing? Is there anything I can do differently to get the same organisation of results? Is this a bug in EF that's just been hidden from view by the LazyLoad being on by default, and it being a bit of an odd group type operation?

This is a limitation in EF in that Includes will be ignored if the scope of the entities returned is changed from where the include was introduced.
I couldn't find the reference to this for EF6, but it is documented for EF Core. (https://learn.microsoft.com/en-us/ef/core/querying/related-data) (see "ignore includes") I suspect it is a limit in place to stop EF's SQL generation from going completely AWOL in certain scenarios.
So while var docs = context.Documents.Include(d => d.Metas) would return the metas eager loaded against the document; As soon as you .SelectMany() you are changing what EF is supposed to return, so the Include statement is ignored.
If you want to return all documents, and include a property that is their author:
var DocsByCreator = ctx.Documents
.Include(d => d.MetaInfo)
.ToList() // Materialize the documents and their Metas.
.SelectMany(d => d.MetaInfo.Where(m => m.Name == "Author") // For each Author
.Select(m => new { Doc = d, Creator = m })) // Create an object with the Author and the Document they authored.
.ToList(); // grab your collection of Doc and Author.
If you only want documents that have authors:
var DocsByCreator = ctx.Documents
.Include(d => d.MetaInfo)
.Where(d => d.MetaInfo.Any(m => m.Name == "Author")
.ToList() // Materialize the documents and their Metas.
.SelectMany(d => d.MetaInfo.Where(m => m.Name == "Author") // For each Author
.Select(m => new { Doc = d, Creator = m })) // Create an object with the Author and the Document they authored.
.ToList(); // grab your collection of Doc and Author.
This means you will want to be sure that all of your filtering logic is done above that first 'ToList() call. Alternatively you can consider resolving the Author meta after the query such as when view models are populated, or an unmapped "Author" property on Document that resolves it. Though I generally avoid unmapped properties because if their use slips into an EF query, you get a nasty error at runtime.
Edit: Based on the requirement to skip & take I would recommend utilizing view models to return data rather than returning entities. Using a view model you can instruct EF to return just the raw data you need, compose the view models with either simple filler code or utilizing Automapper which plays nicely with IQueryable and EF and can handle most deferred cases like this.
For example:
public class DocumentViewModel
{
public int DocumentId { get; set; }
public string Name { get; set; }
public ICollection<MetaViewModel> Metas { get; set; } = new List<MetaViewModel>();
[NotMapped]
public string Author // This could be update to be a Meta, or specialized view model.
{
get { return Metas.SingleOrDefault(x => x.Name == "Author")?.Value; }
}
}
public class MetaViewModel
{
public int MetaId { get; set; }
public string Name { get; set; }
public string Value { get; set; }
}
Then the query:
var viewModels = context.Documents
.Select(x => new DocumentViewModel
{
DocumentId = x.DocumentId,
Name = x.Name,
Metas = x.Metas.Select(m => new MetaViewModel
{
MetaId = m.MetaId,
Name = m.Name,
Value = m.Value
}).ToList()
}).Skip(pageNumber*pageSize)
.Take(PageSize)
.ToList();
The relationship of an "author" to a document is implied, not enforced, at the data level. This solution keeps the entity models "pure" to the data representation and lets the code handle transforming that implied relationship into exposing a document's author.
The .Select() population can be handled by Automapper using .ProjectTo<TViewModel>().
By returning view models rather than entities you can avoid issues like this where .Include() operations get invalidated, plus avoid issues due to the temptation of detaching and reattaching entities between different contexts, plus improve performance and resource usage by only selecting and transmitting the data needed, and avoiding lazy load serialization issues if you forget to disable lazy-load or unexpected #null data with it.

Return IQueryable containing 2 Subclasses

Can you return IQueryable which is composed of two or more different subclasses ? Here's my attempt to show what I mean. It throws the error:
System.NotSupportedException: Types in
Union or Concat have members assigned
in different order..
var a = from oi in db.OrderItems
where oi.OrderID == ID
&& oi.ListingID != null
select new OrderItemA {
// etc
} as OrderItem;
var b = from oi in db.OrderItems
where oi.OrderID == ID
&& oi.AdID != null
select new OrderItemB {
//etc
} as OrderItem;
return a.Concat<OrderItem>(b);

Try doing the concat on IEnumerable instead of IQueryable:
return a.AsEnumerable().Concat(b.AsEnumerable());
If you need an IQueryable result you could do this:
return a.AsEnumerable().Concat(b.AsEnumerable()).AsQueryable();
Doing this will force the concat to happen in-memory instead of in SQL, and any additional operations will also happen in-memory (LINQ To Objects).
However, unlike the .ToList() example, the execution should still be deferred (making your data lazy loaded).

My guess is that this is because you are using LINQ in an LINQ-to-SQL context.
So using Concat means that LINQ2SQL will need to join both query into a SQL UNION query which might be where the System.NotSupportedException originated from.
Can you try this:
return a.ToList().Concat<OrderItem>(b.ToList());
And see if it make any difference?
What the above does is that it executes the query twice and then concatenate them in-memory instead of hot-off-SQL as to avoid the query translation problem.
It might not be the ideal solution, but if this work, my assumption is probably correct, that it's a query translation problem:
More information about Union and Concat translation to SQL:
http://blog.benhall.me.uk/2007/08/linq-to-sql-difference-between-concat.html
http://msdn.microsoft.com/en-us/library/bb399342.aspx
http://msdn.microsoft.com/en-us/library/bb386979.aspx
Hope this helps.

Interestingly after reading your post and a bit of testing, I realized that what your actually doing does seem to work just fine for me given that the projection part you show as ellipsis in both of your queries match. You see, LINQ to SQL appears to construct the underlying projection for the SQL select command based off of the property assignment statements as opposed to the actual type being materialized so as long as both sides have the same number, type, and order (not sure about this) of member assignments the UNION query should be valid.
My solution that I've been working with is to create a property on my DataContext class which acts much like a SQL View in that it allows me to write a query (in my case a Union between two different tables) and then use that query as if it is itself like a table when composing read-only select statements.
public partial class MyDataContext
{
public IQueryable<MyView> MyView
{
get
{
var query1 =
from a in TableA
let default_ColumnFromB = (string)null
select new MyView()
{
ColumnFromA = a.ColumnFromA,
ColumnFromB = default_ColumnFromB,
ColumnSharedByAAndB = a.ColumnSharedByAAndB,
};
var query2 =
from a in TableB
let default_ColumnFromA = (decimal?)null
select new MyView()
{
ColumnFromA = default_ColumnFromA,
ColumnFromB = b.ColumnFromB,
ColumnSharedByAAndB = b.ColumnSharedByAAndB,
};
return query1.Union(query2);
}
}
}
public class MyView
{
public decimal? ColumnFromA { get; set; }
public string ColumnFromB { get; set; }
public int ColumnSharedByAAndB { get; set; }
}
Notice two key things:
First of all the projection formed by the queries which make up both halves of the Union have the same number, type, and order of columns. Now LINQ may require the order to be the same (not sure about this) but it is definitely true that SQL does for a UNION and we can be sure that LINQ will require at least the same type and number of columns and these "columns" are known by the member assignments and not from the properties of the type you are instantiating in your projection.
Secondly LINQ currently doesn't allow for multiple constants to be used within a projections for queries which formulate a Concat or Union and from my understanding this is mainly because these two separate queries are separately optimized before the Union operation is processed. Normally LINQ to SQL is smart enough to realize that if you have a constant value which is only being used in the projection, then why send it to SQL just to have it come right back the way it was instead of tacking it on as a post process after the raw data comes back from SQL Server. Unfortunately the problem here is that this is a case of LINQ to SQL being to smart for it's own good, as it optimizes each individual query too early in the process. The way I've found to work around this is to use the let keyword to form a range variable for each value in the projection which will be materialized by getting it's value from a constant. Somehow this tricks LINQ to SQL into carrying these constants through to the actual SQL command which keeps all expected columns in the resulting UNION. More on this technique can be found here.
Using this techinque I at least have something reusable so that no matter how complex or ugly the actual Union can get, especially with the range variables, that in your end queries you can write queries to these pseudo views such as MyView and deal with the complexity underneath.

Can you do the projection after the concat?
// construct the query
var a = from oi in db.OrderItems
where oi.OrderID == ID
&& oi.ListingID != null
select new {
type = "A"
item = oi
}
var b = from oi in db.OrderItems
where oi.OrderID == ID
&& oi.AdID != null
select new {
type = "B"
item = oi
}
var temp = a.Concat<OrderItem>(b);
// create concrete types after concatenation
// to avoid inheritance issue
var result = from oi in temp
select (oi.type == "A"
? (new OrderItemA {
// OrderItemA projection
} as OrderItem)
: (new OrderItemB {
// OrderItemB projection
} as OrderItem)
);
return result
Not sure if the ternary operator works in LINQ2SQL in the above scenario but that might help avoid the inheritance issue.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.