LINQ join efficiency question - C#

// Loop over each user's profile
using (DataClassesDataContext db = new DataClassesDataContext())
{
    var q = (from P in db.tblProfiles
             orderby P.UserID descending
             select new { LastUpdated = P.ProfileLastUpdated, UserID = P.UserID }).ToList();

    foreach (var Rec in q)
    {
        string Username = db.tblForumAuthors.SingleOrDefault(author => author.Author_ID == Rec.UserID).Username;
        AddURL("Users/" + Rec.UserID + "/" + Username, Rec.LastUpdated.Value, ChangeFrequency.daily, 0.4);
    }
}
This is for my sitemap, printing a URL for each user's profile on the system. But say we have 20,000 users: is the Username query going to slow this down significantly?
I'm used to having the join in the SQL query, but having it separated from the main query and inside the loop seems like it could be inefficient unless it is translated into efficient SQL.

It will probably be unbearably slow. In your case this will issue 20,000 separate SQL queries to the database. Since the queries run synchronously, you will incur the server communication overhead on each iteration. The delay will accumulate quite fast.
Go with a join.
from P in db.tblProfiles
join A in db.tblForumAuthors on P.UserID equals A.Author_ID
orderby P.UserID descending
select new { LastUpdated = P.ProfileLastUpdated, UserID = P.UserID, Username = A.Username };
By the way, SingleOrDefault(...).Username will throw a NullReferenceException if the author is missing. Better to use Single(), which fails with a more descriptive exception, or check your logic.
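Putting it together, the loop then works from the single result set (a sketch reusing the question's own AddURL call):

using (DataClassesDataContext db = new DataClassesDataContext())
{
    var q = (from P in db.tblProfiles
             join A in db.tblForumAuthors on P.UserID equals A.Author_ID
             orderby P.UserID descending
             select new { LastUpdated = P.ProfileLastUpdated, P.UserID, A.Username }).ToList();

    foreach (var Rec in q)
    {
        // One query total; the username arrives with the profile row.
        AddURL("Users/" + Rec.UserID + "/" + Rec.Username, Rec.LastUpdated.Value, ChangeFrequency.daily, 0.4);
    }
}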

If you have constraints set up correctly in your database when designing the DataContext, the designer should generate one-to-one association members on your Profile and Author classes.
If not, you can add them manually in the designer.
Then you will be able to do something like this:
var q = from profile in db.tblProfiles
        orderby profile.UserID descending
        select new
        {
            LastUpdated = profile.ProfileLastUpdated,
            profile.UserID,
            profile.Author.Username
        };

Do the JOIN!
Save yourself the unnecessary database access: join and get everything you need in a single shot!

Related

Entity Framework - slow query after adding group by

I have the following query, which runs very fast:
var query =
    (from art in ctx.Articles
     join phot in ctx.ArticlePhotos on art.Id equals phot.ArticleId
     join artCat in ctx.ArticleCategories on art.Id equals artCat.ArticleId
     join cat in ctx.Categories on artCat.CategoryId equals cat.Id
     where art.Active && art.ArticleCategories.Any(c => c.Category.MaterializedPath.StartsWith(categoryPath))
     orderby art.PublishDate descending
     select new ArticleSmallResponse
     {
         Id = art.Id,
         Title = art.Title,
         Active = art.Active,
         PublishDate = art.PublishDate ?? art.CreateDate,
         MainImage = phot.RelativePath,
         RootCategory = art.Category.Name,
         Summary = art.Summary
     })
    .AsNoTracking().Take(request.Take);
However, if I add a group by and change the query to the following, it runs much, much slower.
var query =
    (from art in ctx.Articles
     join phot in ctx.ArticlePhotos on art.Id equals phot.ArticleId
     join artCat in ctx.ArticleCategories on art.Id equals artCat.ArticleId
     join cat in ctx.Categories on artCat.CategoryId equals cat.Id
     where art.Active && art.ArticleCategories.Any(c => c.Category.MaterializedPath.StartsWith(categoryPath))
     orderby art.PublishDate descending
     select new ArticleSmallResponse
     {
         Id = art.Id,
         Title = art.Title,
         Active = art.Active,
         PublishDate = art.PublishDate ?? art.CreateDate,
         MainImage = phot.RelativePath,
         RootCategory = art.Category.Name,
         Summary = art.Summary
     })
    .GroupBy(m => m.Id)
    .Select(m => m.FirstOrDefault())
    .AsNoTracking().Take(request.Take);
The homepage calls the query 9 times, once for each category. With the first version of the query, without caching and connecting to SQL remotely, page load is around 1.5 seconds, which makes it almost instant when the application is on the server; with the second version, the homepage takes around 39 seconds to load against the remote SQL server.
Can it be fixed without rewriting the entire query as a view or stored procedure?
Grouping is an expensive operation on the database end. Without knowing what your database looks like and what indexes you've set up, it is difficult to say more. Why not just group on the client side after the data has arrived (assuming it's not an overwhelming amount)?
This question explains how.
Group by in LINQ
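A minimal sketch of that client-side approach, assuming baseQuery stands for the first (fast) version of the query without its trailing Take; note that this pulls the pre-Take rows across the wire, so it only pays off when that set is modest:

var rows = baseQuery.AsNoTracking().ToList();   // one round trip, no GROUP BY in SQL

var articles = rows
    .GroupBy(m => m.Id)        // LINQ to Objects grouping, cheap in memory
    .Select(g => g.First())
    .Take(request.Take)
    .ToList();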

How to join a list and large lists/tables using LINQ

Initially I have a list like this:
List<Car> cars = db.Car.Where(x => x.ProductionYear == 2005).ToList();
Then I'm trying to join this list with two large tables using LINQ, like this:
var joinedList = (from car in cars
                  join driver in db.Driver.ToList()
                      on car.Id equals driver.CarId
                  join building in db.Building.ToList()
                      on driver.BuildingId equals building.Id
                  select new Building
                  {
                      Name = building.Name,
                      Id = building.Id,
                      City = building.City
                  }).ToList();
Both the Driver and Building tables have about 1 million rows. When I run this join I get an out-of-memory exception. How can I make this join work? Should I do the join on the database? If so, how can I get the cars list to the database? Thanks in advance.
Even if you remove the .ToList() calls inside your join, your code will still pull all the data and perform the join in memory rather than in SQL Server. This is because you're using the local list cars in your join. The below should solve your problem:
var joinedList = (from car in db.Car.Where(x => x.ProductionYear == 2005)
                  join driver in db.Driver
                      on car.Id equals driver.CarId
                  join building in db.Building
                      on driver.BuildingId equals building.Id
                  select new Building
                  {
                      Name = building.Name,
                      Id = building.Id,
                      City = building.City
                  }).ToList();
You can remove the last .ToList() and do some paging if you expect to get too many records in the results.
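For example, a paging sketch (pageIndex and pageSize are hypothetical parameters; a stable orderby is needed before Skip/Take, and the anonymous-type projection sidesteps the provider restriction on constructing mapped entity types inside a query):

var page = (from car in db.Car.Where(x => x.ProductionYear == 2005)
            join driver in db.Driver on car.Id equals driver.CarId
            join building in db.Building on driver.BuildingId equals building.Id
            orderby building.Id
            select new { building.Name, building.Id, building.City })
           .Skip(pageIndex * pageSize)
           .Take(pageSize)
           .ToList();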
Even if you have removed .ToList(), replace it with .AsQueryable().
AsQueryable is faster than ToList and AsEnumerable:

If you create an IQueryable, then the query may be converted to SQL
and run on the database server.
If you create an IEnumerable, then all rows will be pulled into
memory as objects before running the query.
In both cases, if you don't call ToList() or ToArray(), the query
will be executed each time it is used. So, say you have an
IQueryable and you fill 4 list boxes from it; then the query will be
run against the database 4 times.

So use the following LINQ query:
var joinedList = (from car in db.Car.Where(x => x.ProductionYear == 2005).AsQueryable()
                  join driver in db.Driver.AsQueryable()
                      on car.Id equals driver.CarId
                  join building in db.Building.AsQueryable()
                      on driver.BuildingId equals building.Id
                  select new Building
                  {
                      Name = building.Name,
                      Id = building.Id,
                      City = building.City
                  }).ToList();
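To make the quoted point about repeated execution concrete (a hypothetical illustration; the list boxes stand in for any code that enumerates the same query more than once):

var q = db.Building.Where(b => b.City == "Oslo");   // IQueryable: nothing runs yet

listBox1.DataSource = q.ToList();   // SQL executes here
listBox2.DataSource = q.ToList();   // ...and executes again here

var cached = q.ToList();            // execute once
listBox3.DataSource = cached;       // reuse the in-memory list
listBox4.DataSource = cached;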
First, don't call ToList() in the middle of a LINQ query; use ToList() as little as possible, and only in the rare cases where you actually need the results in memory.
Otherwise you will get an OutOfMemoryException whenever the table contains many rows.
So, here is the code for your question:
var joinedList = (from car in db.Car.AsQueryable().Where(x => x.ProductionYear == 2005)
                  join driver in db.Driver.AsQueryable() on car.Id equals driver.CarId
                  join building in db.Building.AsQueryable() on driver.BuildingId equals building.Id
                  select new Building
                  {
                      Name = building.Name,
                      Id = building.Id,
                      City = building.City
                  }).ToList();
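If the result set is too large to materialize at all, you can also stream it row by row instead of calling ToList() (a sketch; Process is a hypothetical per-row handler):

var buildings = from car in db.Car.Where(x => x.ProductionYear == 2005)
                join driver in db.Driver on car.Id equals driver.CarId
                join building in db.Building on driver.BuildingId equals building.Id
                select new { building.Name, building.Id, building.City };

foreach (var b in buildings)   // rows stream from the server as you iterate
{
    Process(b);                // hypothetical handler; nothing accumulates in a list
}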

LINQ to SQL Slow Query

My ASP.NET application has the following LINQ to SQL function to get a distinct list of height values from the product table.
public static List<string> getHeightList(string catID)
{
    using (CategoriesClassesDataContext db = new CategoriesClassesDataContext())
    {
        var heightTable = (from p in db.Products
                           join cp in db.CatProducts on p.ProductID equals cp.ProductID
                           where p.Enabled == true && (p.CaseOnly == null || p.CaseOnly == false) && cp.CatID == catID
                           select new { Height = p.Height, sort = Convert.ToDecimal(p.Height.Replace("\"", "")) })
                          .Distinct().OrderBy(s => s.sort);

        List<string> heightList = new List<string>();
        foreach (var s in heightTable)
        {
            heightList.Add(s.Height.ToString());
        }
        return heightList;
    }
}
I ran Redgate SQL Monitor which shows that this query is using a lot of resources.
Redgate is also showing that I am running the following query:
select count(distinct [height]) from product p
join catproduct cp on p.productid = cp.productid
join cat c on cp.catid = c.catid
where p.enabled=1 and p.displayfilter = 1 and c.catid = 'C2-14'
My questions are:
Can you suggest a change to the function so that it uses fewer resources?
Also, how does LINQ to SQL generate the above query from my function? (I did not write select count(distinct [height]) from product anywhere in the code.)
There are 90,000 records in the Products table. The category for which I am trying to get the distinct list of heights has 50,000 product records.
Thank you in advance,
Nick
First of all, your posted SQL query and LINQ query don't match at all. It's not the LINQ query but the underlying SQL query itself that performs slowly. Make sure all the columns involved in the JOIN ON, WHERE, and ORDER BY clauses are indexed properly in order to get a better execution plan; otherwise you will end up with a full table scan plus a file sort, and the query will perform slowly.
The join multiplies the number of Products the query returns. To undo that, you apply Distinct at the end. It will certainly reduce db resources if you return unique Products right away:
var heightTable = (from p in db.Products
                   where p.CatProducts.Any(cp => cp.CatID == catID)
                         && p.Enabled == true && (p.CaseOnly == null || p.CaseOnly == false)
                   select new
                   {
                       Height = p.Height,
                       sort = Convert.ToDecimal(p.Height.Replace("\"", ""))
                   }).OrderBy(s => s.sort);
This changes the join into a where clause. It saves the db engine the trouble of deduplicating the result.
If that still performs poorly, you should try to do the conversion and ordering in memory, i.e. after receiving the raw results from the database.
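A sketch of that fallback, assuming Height is a string column: let the database do only the filtering and DISTINCT, then sort on the web server.

var heights = (from p in db.Products
               where p.CatProducts.Any(cp => cp.CatID == catID)
                     && p.Enabled == true && (p.CaseOnly == null || p.CaseOnly == false)
               select p.Height)
              .Distinct()
              .ToList();   // SQL does only the filter and DISTINCT

return heights
       .OrderBy(h => Convert.ToDecimal(h.Replace("\"", "")))   // numeric sort in memory
       .ToList();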
As for the count: I don't know where it comes from. Such queries are typically generated by paging libraries such as PagedList, but I see no trace of that in your code.
Side note: you can return
heightTable.Select(x => x.Height.ToString()).ToList()
instead of building the list yourself.

Return a list with members that belong to another list

I have 2 tables:
Users data with PK UserId
UsersOrders with FK UserID
I want to write a LINQ query that gives me a list of users, each with a member holding the list of that user's orders.
I've tried:
var myNestedData = (from ub in db.Users
                    join ah in db.UsersOrders on ub.UserId equals ah.UserId
                        into joined
                    from j in joined.DefaultIfEmpty()
                    group j by new { ub.UserId, ub.UserName, ub.UserPhone, ub.Approved } into grouped
                    where grouped.Key.Approved == true
                    select new
                    {
                        UserId = grouped.Key.UserId,
                        UserName = grouped.Key.UserName,
                        UserPhone = grouped.Key.UserPhone,
                        Orders = grouped
                    }).ToList();
The problem is that Orders comes back as a grouping (IGrouping<'a, UsersOrders>) rather than the plain list of orders I expect.
Is this the right way to approach a solution in terms of performance?
It seems to me you should be able to keep this simple:
var myNestedData = (from u in db.Users
                    where u.Approved == true
                    select new
                    {
                        User = u,
                        Orders = u.UserOrders
                    }).ToList();
You are trying to get a list of all the approved users and those users' orders. Am I missing something?
Also, the name of u.UserOrders will depend on how you have configured your mapping.
Change it to
Orders = grouped.ToList()
To see if it's a potential performance issue, look at the SQL that is generated to check whether it is an N+1 scenario (meaning it runs one query for the parent records and N queries for the child records).
I say potential because, while it may not be the fastest method, it may not be an issue. If there are no noticeable performance problems, I wouldn't worry about it; instead, focus on making the app better by improving the experience, which may include improving the performance of other areas of the app.
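One quick way to check with LINQ to SQL is the DataContext's Log property (a sketch; MyDataContext is a hypothetical context name, and Entity Framework has its own logging hooks):

using (var db = new MyDataContext())
{
    db.Log = Console.Out;   // echoes every generated SQL command

    var users = (from u in db.Users
                 where u.Approved == true
                 select new { User = u, Orders = u.UserOrders }).ToList();

    // One SELECT in the log means a single round trip;
    // a SELECT per user means you are in N+1 territory.
}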

Using DISTINCT on a subquery to remove duplicates in Entity Framework

I have a question about the use of Distinct with Entity Framework, using SQL Server 2005. In this example:
practitioners = from p in context.Practitioners
                join pn in context.ProviderNetworks on
                    p.ProviderId equals pn.ProviderId
                where notNetworkIds.Contains(pn.Network)
                select p;
practitioners = practitioners
.Distinct()
.OrderByDescending(p => p.UpdateDate);
data = practitioners.Skip(PageSize * (pageOffset ?? 0)).Take(PageSize).ToList();
It all works fine, but the use of distinct is very inefficient, and larger result sets incur unacceptable performance; the DISTINCT is killing me. The distinct is only needed because multiple networks can be queried, causing Provider records to be duplicated. In effect I need to ask the DB to "only return providers ONCE even if they're in multiple networks". If I could place the DISTINCT on the ProviderNetworks side, the query would run much faster.
How can I get EF to apply the DISTINCT only to the subquery, not to the entire result set?
The simplified SQL I DON'T want is:
select DISTINCT p.* from Providers p
inner join Networks pn on p.ProviderId = pn.ProviderId
where pn.NetworkName in ('abc','def')
The IDEAL SQL is:
select p.* from Providers p
inner join (select DISTINCT ProviderId from Networks
            where NetworkName in ('abc','def')) as pn
    on p.ProviderId = pn.ProviderId
Thanks
Dave
I don't think you need a Distinct here but an Exists (or Any, as it is called in LINQ).
Try this:
var q = (from p in context.Practitioners
         where context.ProviderNetworks.Any(pn => pn.ProviderId == p.ProviderId
                                                  && notNetworkIds.Contains(pn.Network))
         orderby p.UpdateDate descending
         select p)
        .Skip(PageSize * (pageOffset ?? 0))
        .Take(PageSize)
        .ToList();
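For what it's worth, Any translates to an EXISTS subquery, so the Practitioner rows are never multiplied by the join in the first place and no Distinct is needed, which is essentially the shape of the "ideal" SQL above.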
