Dynamically generate Linq Select - c#

I have a database that users can run a variety of calculations on. The calculations run on 4 different columns each calculation does not necessarily use every column i.e. calculation1 might turn into sql like
SELECT SUM(Column1)
FROM TABLE
WHERE Column1 is not null
and calculation2 would be
SELECT SUM(Column2)
WHERE Column2 is null
I am trying to generate this via linq and I can get the correct data back by calculating everything every time such as
table.Where(x => x.Column1 != null)
.Where(x => x.Column2 == null)
.GroupBy(x => x.Date)
.Select(dateGroup => new
{
Calculation1 = dateGroup.Sum(x => x.Column1 != null),
Calculation2 = dateGroup.Sum(x => x.Column2 == null)
}
The problem is that my dataset is very large, and so I do not want to perform a calculation unless the user has requested it. I have looked into dynamically generating Linq queries. All I have found so far is PredicateBuilder and DynamicSQL, which appear to only be useful for dynamically generating the Where predicate, and hardcoding the sql query itself as a string with the Sum(Column1) or Sum(Column2) being inserted when necessary.
How would one go about dynamically adding the different parts of the Select query into an anonymous type like this? Or should I be looking at an entirely different way of handling this

You can return your query without executing it, which will allow you to dynamically choose what to return.
That said, you cannot dynamically modify an anonymous type at runtime. They are statically typed at compile time. However, you can use a different return object to allow for dynamic properties without needing an external library.
var query = table
.Where(x => x.Column1 != null)
.Where(x => x.Column2 == null)
.GroupBy(x => x.Date);
You can then dyamically resolve queries with any one of the following:
dynamic
dynamic returnObject = new ExpandoObject();
if (includeOne)
returnObject.Calculation1 = groupedQuery.Select (q => q.Sum(x => x.Column1));
if (includeTwo)
returnObject.Calculation2 = groupedQuery.Select (q => q.Sum (x => x.Column2));
Concrete Type
var returnObject = new StronglyTypedObject();
if (includeOne)
returnObject.Calculation1 = groupedQuery.Select (q => q.Sum(x => x.BrandId));
Dictionary<string, int>

I solved this and kept myself from having to lose type safety with Dynamic Linq by using a hacky workaround. I have a object containing bools that correspond to what calculations I want to do such as
public class CalculationChecks
{
bool doCalculation1 {get;set;}
bool doCalculation2 {get;set;}
}
and then do a check in my select for whether or not I should do the calculation or return a constant, like so
Select(x => new
{
Calculation1 = doCalculation1 ? DoCalculation1(x) : 0,
Calculation2 = doCalculation2 ? DoCalculation2(x) : 0
}
However, this appears to be an edge case with linq to sql or ef, that causes the generated sql to still do the calculations specified in DoCalculation1() and DoCalculation2 and then use a case statement to decide whether or not its going to return the data to me. It runs signficantly slower,40-60% in testing, and the execution plan shows that it uses a much more inefficient query.
The solution to this problem was to use an ExpressionVisitor to go through the expression and remove the calculations if the corresponding bool was false. The code showing how implement this ExpressionVisitor was provided by #StriplingWarrior on this question Have EF Linq Select statement Select a constant or a function
Using both of these solutions together is still not creating sql that runs at 100% the speed of plain sql. In testing it was within 10s of plain sql no matter the size of the test set, and the major portions of the execution plan were the same

Related

How to make a linq-query with multiple Contains()/Any() on possibly empty lists?

I am trying to make a query to a database view based on earlier user-choices. The choices are stored in lists of objects.
What I want to achieve is for a record to be added to the reportViewList if the stated value exists in one of the lists, but if for example the clientList is empty the query should overlook this statement and add all clients in the selected date-range. The user-choices are stored in temporary lists of objects.
The first condition is a time-range, this works fine. I understand why my current solution does not work, but I can not seem to wrap my head around how to fix it. This example works when both a client and a product is chosen. When the lists are empty the reportViewList is obviously also empty.
I have played with the idea of adding all the records in the date-range and then removing the ones that does not fit, but this would be a bad solution and not efficient.
Any help or feedback is much appreciated.
List<ReportView> reportViews = new List<ReportView>();
using(var dbc = new localContext())
{
reportViewList = dbc.ReportViews.AsEnumerable()
.Where(x => x.OrderDateTime >= from && x.OrderDateTime <= to)
.Where(y => clientList.Any(x2 => x2.Id == y.ClientId)
.Where(z => productList.Any(x3 => x3.Id == z.ProductId)).ToList();
}
You should not call AsEnumerable() before you have added eeverything to your query. Calling AsEnumerable() here will cause your complete data to be loaded in memory and then be filtered in your application.
Without AsEnumerable() and before calling calling ToList() (Better call ToListAsync()), you are working with an IQueryable<ReportView>. You can easily compose it and just call ToList() on your final query.
Entity Framework will then examinate your IQueryable<ReportView> and generate an SQL expression out of it.
For your problem, you just need to check if the user has selected any filters and only add them to the query if they are present.
using var dbc = new localContext();
var reportViewQuery = dbc.ReportViews.AsQueryable(); // You could also write IQuryable<ReportView> reportViewQuery = dbc.ReportViews; but I prefer it this way as it is a little more save when you are refactoring.
// Assuming from and to are nullable and are null if the user has not selected them.
if (from.HasValue)
reportViewQuery = reportViewQuery.Where(r => r.OrderDateTime >= from);
if (to.HasValue)
reportViewQuery = reportViewQuery.Where(r => r.OrderDateTime <= to);
if(clientList is not null && clientList.Any())
{
var clientIds = clientList.Select(c => c.Id).ToHashSet();
reportViewQuery = reportViewQuery.Where(r => clientIds.Contains(y.ClientId));
}
if(productList is not null && productList.Any())
{
var productIds = productList.Select(p => p.Id).ToHashSet();
reportViewQuery = reportViewQuery.Where(r => productIds .Contains(r.ProductId));
}
var reportViews = await reportViewQuery.ToListAsync(); // You can also use ToList(), if you absolutely must, but I would not recommend it as it will block your current thread.

Why is Entity Framework having performance issues when calculating a sum

I am using Entity Framework in a C# application and I am using lazy loading. I am experiencing performance issues when calculating the sum of a property in a collection of elements. Let me illustrate it with a simplified version of my code:
public decimal GetPortfolioValue(Guid portfolioId) {
var portfolio = DbContext.Portfolios.FirstOrDefault( x => x.Id.Equals( portfolioId ) );
if (portfolio == null) return 0m;
return portfolio.Items
.Where( i =>
i.Status == ItemStatus.Listed
&&
_activateStatuses.Contains( i.Category.Status )
)
.Sum( i => i.Amount );
}
So I want to fetch the value for all my items that have a certain status of which their parent has a specific status as well.
When logging the queries generated by EF I see it is first fetching my Portfolio (which is fine). Then it does a query to load all Item entities that are part of this portfolio. And then it starts fetching ALL Category entities for each Item one by one. So if I have a portfolio that contains 100 items (each with a category), it literally does 100 SELECT ... FROM categories WHERE id = ... queries.
So it seems like it's just fetching all info, storing it in its memory and then calculating the sum. Why does it not do a simple join between my tables and calculate it like that?
Instead of doing 102 queries to calculate the sum of 100 items I would expect something along the lines of:
SELECT
i.id, i.amount
FROM
items i
INNER JOIN categories c ON c.id = i.category_id
WHERE
i.portfolio_id = #portfolioId
AND
i.status = 'listed'
AND
c.status IN ('active', 'pending', ...);
on which it could then calculate the sum (if it is not able to use the SUM directly in the query).
What is the problem and how can I improve the performance other than writing a pure ADO query instead of using Entity Framework?
To be complete, here are my EF entities:
public class ItemConfiguration : EntityTypeConfiguration<Item> {
ToTable("items");
...
HasRequired(p => p.Portfolio);
}
public class CategoryConfiguration : EntityTypeConfiguration<Category> {
ToTable("categories");
...
HasMany(c => c.Products).WithRequired(p => p.Category);
}
EDIT based on comments:
I didn't think it was important but the _activeStatuses is a list of enums.
private CategoryStatus[] _activeStatuses = new[] { CategoryStatus.Active, ... };
But probably more important is that I left out that the status in the database is a string ("active", "pending", ...) but I map them to an enum used in the application. And that is probably why EF cannot evaluate it? The actual code is:
... && _activateStatuses.Contains(CategoryStatusMapper.MapToEnum(i.Category.Status)) ...
EDIT2
Indeed the mapping is a big part of the problem but the query itself seems to be the biggest issue. Why is the performance difference so big between these two queries?
// Slow query
var portfolio = DbContext.Portfolios.FirstOrDefault(p => p.Id.Equals(portfolioId));
var value = portfolio.Items.Where(i => i.Status == ItemStatusConstants.Listed &&
_activeStatuses.Contains(i.Category.Status))
.Select(i => i.Amount).Sum();
// Fast query
var value = DbContext.Portfolios.Where(p => p.Id.Equals(portfolioId))
.SelectMany(p => p.Items.Where(i =>
i.Status == ItemStatusConstants.Listed &&
_activeStatuses.Contains(i.Category.Status)))
.Select(i => i.Amount).Sum();
The first query does a LOT of small SQL queries whereas the second one just combines everything into one bigger query. I'd expect even the first query to run one query to get the portfolio value.
Calling portfolio.Items this will lazy load the collection in Items and then execute the subsequent calls including the Where and Sum expressions. See also Loading Related Entities article.
You need to execute the call directly on the DbContext the Sum expression can be evaluated database server side.
var portfolio = DbContext.Portfolios
.Where(x => x.Id.Equals(portfolioId))
.SelectMany(x => x.Items.Where(i => i.Status == ItemStatus.Listed && _activateStatuses.Contains( i.Category.Status )).Select(i => i.Amount))
.Sum();
You also have to use the appropriate type for _activateStatuses instance as the contained values must match the type persisted in the database. If the database persists string values then you need to pass a list of string values.
var _activateStatuses = new string[] {"Active", "etc"};
You could use a Linq expression to convert enums to their string representative.
Notes
I would recommend you turn off lazy loading on your DbContext type. As soon as you do that you will start to catch issues like this at run time via Exceptions and can then write more performant code.
I did not include error checking for if no portfolio was found but you could extend this code accordingly.
Yep CategoryStatusMapper.MapToEnum cannot be converted to SQL, forcing it to run the Where in .Net. Rather than mapping the status to the enum, _activeStatuses should contain the list of integer values from the enum so the mapping is not required.
private int[] _activeStatuses = new[] { (int)CategoryStatus.Active, ... };
So that the contains becomes
... && _activateStatuses.Contains(i.Category.Status) ...
and can all be converted to SQL
UPDATE
Given that i.Category.Status is a string in the database, then
private string[] _activeStatuses = new[] { CategoryStatus.Active.ToString(), ... };

Replacing Include() calls to Select()

Im trying to eliminate the use of the Include() calls in this IQueryable definition:
return ctx.timeDomainDataPoints.AsNoTracking()
.Include(dp => dp.timeData)
.Include(dp => dp.RecordValues.Select(rv => rv.RecordKind).Select(rk => rk.RecordAlias).Select(fma => fma.RecordAliasGroup))
.Include(dp => dp.RecordValues.Select(rv => rv.RecordKind).Select(rk => rk.RecordAlias).Select(fma => fma.RecordAliasUnit))
.Where(dp => dp.RecordValues.Any(rv => rv.RecordKind.RecordAlias != null))
.Where(dp => dp.Source == 235235)
.Where(dp => dp.timeData.time >= start && cd.timeData.time <= end)
.OrderByDescending(cd => cd.timeData.time);
I have been having issues with the database where the run times are far too long and the primary cause of this is the Include() calls are pulling everything.
This is evident in viewing the table that is returned from the resultant SQL query generated from this showing lots of unnecessary information being returned.
One of the things that you learn I guess.
The Database has a large collection of data points which there are many Recorded values.
Each Recorded value is mapped to a Record Kind which may have a Record Alias.
I have tried creating a Select() as an alternative but I just cant figure out how to construct the right Select and also keep the entity hierarchy correctly loaded. I.e. the related entities are loaded with unnecessary calls to the DB.
Does anyone has alternate solutions that may jump start me to solve this problem.
Ill add more detail if needed.
You are right. One of the slower parts of a database query is the transport of the selected data from the DBMS to your local process. Hence it is wise to limit this.
Every TimeDomainDataPoint has a primary key. All RecordValues of this TimeDomainDataPoint have a foreign key TimeDomainDataPointId with a value equal to this primary key.
So If TimeDomainDataPoint with Id 4 has a thousand RecordValues, then every RecordValue will have a foreign key with a value 4. It would be a waste to transfer this value 4 a 1001 times, while you only need it once.
When querying data, always use Select and select only the properties you actually plan to use. Only use Include if you plan to update the fetched included items.
The following will be much faster:
var result = dbContext.timeDomainDataPoints
// first limit the datapoints you want to select
.Where(datapoint => d.RecordValues.Any(rv => rv.RecordKind.RecordAlias != null))
.Where(datapoint => datapoint.Source == 235235)
.Where(datapoint => datapoint.timeData.time >= start
&& datapoint.timeData.time <= end)
.OrderByDescending(datapoint => datapoint.timeData.time)
// then select only the properties you actually plan to use
Select(dataPoint => new
{
Id = dataPoint.Id,
RecordValues = dataPoint.RecordValues
.Where(recordValues => ...) // if you don't want all RecordValues
.Select(recordValue => new
{
// again: select only the properties you actually plan to use:
Id = recordValue.Id,
// not needed, you know the value: DataPointId = recordValue.DataPointId,
RecordKinds = recordValues.RecordKinds
.Where(recordKind => ...) // if you don't want all recordKinds
.Select(recordKind => new
{
... // only the properties you really need!
})
.ToList(),
...
})
.ToList(),
TimeData = dataPoint.TimeData.Select(...),
...
});
Possible imporvement
The part:
.Where(datapoint => d.RecordValues.Any(rv => rv.RecordKind.RecordAlias != null))
is used to fetch only datapoints that have recordValues with a non-null RecordAlias. If you are selecting the RecordAlias anyway, consider doing this Where after your select:
.Select(...)
.Where(dataPoint => dataPoint
.Where(dataPoint.RecordValues.RecordKind.RecordAlias != null)
.Any());
I'm not really sure whether this is faster. If your database management system internally first creates a complete table with all columns of all joined tables and then throws away the columns that are not selected, then it won't make a difference. However, if it only creates a table with the columns it actually uses, then the internal table will be smaller. This could be faster.
your problem is hierarchy joins in your query.In order to decrease this problem create other query for get result from relation table as follows:
var items= ctx.timeDomainDataPoints.AsNoTracking().Include(dp =>dp.timeData).Include(dp => dp.RecordValues);
var ids=items.selectMany(item=>item.RecordValues).Select(i=>i.Id);
and on other request to db:
var otherItems= ctx.RecordAlias.AsNoTracking().select(dp =>dp.RecordAlias).where(s=>ids.Contains(s.RecordKindId)).selectMany(s=>s.RecordAliasGroup)
to this approach your query do not have internal joins.

ASP.NET MVC C# Select and Where Statements

I'm having trouble understanding .Select and .Where statements. What I want to do is select a specific column with "where" criteria based on another column.
For example, what I have is this:
var engineers = db.engineers;
var managers = db.ManagersToEngineers;
List<ManagerToEngineer> matchedManager = null;
Engineer matchedEngineer = null;
if (this.User.Identity.IsAuthenticated)
{
var userEmail = this.User.Identity.Name;
matchedEngineer = engineers.Where(x => x.email == userEmail).FirstOrDefault();
matchedManager = managers.Select(x => x.ManagerId).Where(x => x.EngineerId == matchedEngineer.PersonId).ToList();
}
if (matchedEngineer != null)
{
ViewBag.EngineerId = new SelectList(new List<Engineer> { matchedEngineer }, "PersonId", "FullName");
ViewBag.ManagerId = new SelectList(matchedManager, "PersonId", "FullName");
}
What I'm trying to do above is select from a table that matches Managers to Engineers and select a list of managers based on the engineer's id. This isn't working and when I go like:
matchedManager = managers.Where(x => x.EngineerId == matchedEngineer.PersonId).ToList();
I don't get any errors but I'm not selecting the right column. In fact the moment I'm not sure what I'm selecting. Plus I get the error:
Non-static method requires a target.
if you want to to select the manager, then you need to use FirstOrDefault() as you used one line above, but if it is expected to have multiple managers returned, then you will need List<Manager>, try like:
Update:
so matchedManager is already List<T>, in the case it should be like:
matchedManager = managers.Where(x => x.EngineerId == matchedEngineer.PersonId).ToList();
when you put Select(x=>x.ManagerId) after the Where() now it will return Collection of int not Collection of that type, and as Where() is self descriptive, it filters the collection as in sql, and Select() projects the collection on the column you specify:
List<int> managerIds = managers.Where(x => x.EngineerId == matchedEngineer.PersonId)
.Select(x=>x.ManagerId).ToList();
The easiest way to remember what the methods do is to remember that this is being translated to SQL.
A .Where() method will filter the rows returned.
A .Select() method will filter the columns returned.
However, there are a few ways to do that with the way you should have your objects set up.
First, you could get the Engineer, and access its Managers:
var engineer = context.Engineers.Find(engineerId);
return engineer.Managers;
However, that will first pull the Engineer out of the database, and then go back for all of the Managers. The other way would be to go directly through the Managers.
return context.Managers.Where(manager => manager.EngineerId == engineerId).ToList();
Although, by the look of the code in your question, you may have a cross-reference table (many to many relationship) between Managers and Engineers. In that case, my second example probably wouldn't work. In that case, I would use the first example.
You want to filter data by matching person Id and then selecting manager Id, you need to do following:
matchedManager = managers.Where(x => x.EngineerId == matchedEngineer.PersonId).Select(x => x.ManagerId).ToList();
In your case, you are selecting the ManagerId first and so you have list of ints, instead of managers from which you can filter data
Update:
You also need to check matchedEngineer is not null before retrieving the associated manager. This might be cause of your error
You use "Select" lambda expression to get the field you want, you use "where" to filter results

How to structure a partially dynamic LINQ statement?

I currently have a method on my repository like this:
public int GetMessageCountBy_Username(string username, bool sent)
{
var query = _dataContext.Messages.AsQueryable();
if (sent)
query = query.Where(x => x.Sender.ToLower() == username.ToLower());
else
query = query.Where(x => x.Recipient.ToLower() == username.ToLower());
return query.Count();
}
It currently builds one of two queries based on the sent boolean. Is this the best way to do this or is there a way to do this within the query itself? I want to check if x.Sender is equal to username if sent equals true. But I want to check if x.Recipient is equal to username if sent equals false.
I then want this LINQ expression to translate into SQL within Entity Framework, which I believe it is doing.
I just want to avoid repeating as much code as possible.
You could do something like this :
public int GetMessageCountBy_Username(string username, bool sent)
{
Func<Message, string> userSelector = m => sent ? m.Sender : m.Recipient;
var query =
_dataContext.Messages.AsQueryable()
.Where(x => userSelector(x).ToLower() == username.ToLower());
return query.Count();
}
Thus the choosing of the right user (the sender or the recipient) is done before the linq part, saving you from repeating it twice.
Yes, I believe this is correct way to do it. Because it is easy to create complex queries without repeating whole parts of queries.
And your thinking about translating to SQL is correct too. But beware, this is done at the moment, when data or agregation is requested. In your case, the SQL will be generated and executed when you call Count().

Categories