Join in LINQ within a where clause

I have a database structure which has a set of users, each with a UserId.
I then have a table called 'Post' which consists of a text field and a CreatedBy field.
I then have a 'Follows' table which consists of 'WhoIsFollowing' and 'WhoTheyFollow' fields.
The idea is that the 'Follows' table maps which users another user 'Follows'.
If I am using the application as a particular user and I want to get all my relevant 'Posts', these would be posts of those users I follow, or my own posts.
I have been trying to get this into one LINQ statement but have been failing to get it right. Ultimately I need to query the 'Posts' table for all the posts that I have posted, combined with all the posts of the people I follow according to the 'Follows' table.
I have got it working with this statement:
postsWeWant = (from s in db.Posts
               join sa in db.Follows on s.CreatedBy equals sa.WhoTheyAreFollowing into joinTable1
               from x in joinTable1.DefaultIfEmpty()
               where (x.WhoIsFollowing == userId || s.CreatedBy == userId) && !s.Deleted
               orderby s.DateCreated descending
               select s).Take(25).ToList();
The issue is that it seems to come back with duplicates for all the posts posted by the user themselves. I have added .Distinct() to get around this, but instead of returning 25 posts each time, it comes back with far fewer when many of the latest 25 are by that user, because of the duplicates.
First off, why is the above coming back with duplicates? (It would help me understand the statement a bit more.) Secondly, how do I get around it?

It's difficult to say exactly without the data structure, but I would recommend investigating, and perhaps expanding, your join condition to eliminate the duplicate associations.
If that fails, I would use a group by clause to remove duplicates so there is no need for Distinct. You are probably ending up with fewer than 25 records because the duplicates are eliminated after the 25 are taken. But I would need to see more of your code to tell for sure.
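One way to see where the duplicates come from is sketched below with LINQ to Objects. The left join yields one row per pair of post and follow record whose followee is the post's author, so each of the user's own posts is repeated once for every follower the user has, and the s.CreatedBy == userId test lets all of those rows through. Filtering against the set of followed ids instead of joining keeps one row per post, so Take(25) applies to distinct posts. The Post and Follow classes are hypothetical stand-ins for the real entities, and the follow field is named WhoTheyFollow as in the table description (the question's query calls it WhoTheyAreFollowing).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the entities described in the question.
public class Post
{
    public int Id { get; set; }
    public int CreatedBy { get; set; }
    public bool Deleted { get; set; }
    public DateTime DateCreated { get; set; }
}

public class Follow
{
    public int WhoIsFollowing { get; set; }
    public int WhoTheyFollow { get; set; }
}

public static class Feed
{
    // Mirrors the join-based query from the question (in memory).
    public static List<Post> WithJoin(IEnumerable<Post> posts, IEnumerable<Follow> follows, int userId)
    {
        return (from s in posts
                join sa in follows on s.CreatedBy equals sa.WhoTheyFollow into joined
                from x in joined.DefaultIfEmpty()
                // In memory, x is null when nobody follows the post's author.
                where ((x != null && x.WhoIsFollowing == userId) || s.CreatedBy == userId) && !s.Deleted
                orderby s.DateCreated descending
                select s).Take(25).ToList();
    }

    // One row per post: test membership in the set of followed ids instead of joining.
    public static List<Post> WithoutJoin(IEnumerable<Post> posts, IEnumerable<Follow> follows, int userId)
    {
        var followedIds = new HashSet<int>(
            follows.Where(f => f.WhoIsFollowing == userId).Select(f => f.WhoTheyFollow));

        return posts
            .Where(p => !p.Deleted && (p.CreatedBy == userId || followedIds.Contains(p.CreatedBy)))
            .OrderByDescending(p => p.DateCreated)
            .Take(25)
            .ToList();
    }
}
```

With two followers of user 1, Feed.WithJoin returns each of user 1's own posts twice, while Feed.WithoutJoin returns each post once. Against a real database the same idea would normally be written as a subquery on db.Follows inside the where clause; whether a given provider translates that well is something to check in the generated SQL.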

Related

Linq slowness materializing complex queries

I have often found that if I have too many joins in a Linq query (whether using Entity Framework or NHibernate) and/or the shape of the resulting anonymous class is too complex, Linq takes a very long time to materialize the result set into objects.
This is a generic question, but here's a specific example using NHibernate:
var libraryBookIdsWithShelfAndBookTagQuery = (from shelf in session.Query<Shelf>()
    join sbttref in session.Query<ShelfBookTagTypeCrossReference>() on
        shelf.ShelfId equals sbttref.ShelfId
    join bookTag in session.Query<BookTag>() on
        sbttref.BookTagTypeId equals (byte)bookTag.BookTagType
    join btbref in session.Query<BookTagBookCrossReference>() on
        bookTag.BookTagId equals btbref.BookTagId
    join book in session.Query<Book>() on
        btbref.BookId equals book.BookId
    join libraryBook in session.Query<LibraryBook>() on
        book.BookId equals libraryBook.BookId
    join library in session.Query<LibraryCredential>() on
        libraryBook.LibraryCredentialId equals library.LibraryCredentialId
    join lcsg in session
        .Query<LibraryCredentialSalesforceGroupCrossReference>()
        on library.LibraryCredentialId equals lcsg.LibraryCredentialId
    join userGroup in session.Query<UserGroup>() on
        lcsg.UserGroupOrganizationId equals userGroup.UserGroupOrganizationId
    where
        shelf.ShelfId == shelfId &&
        userGroup.UserGroupId == userGroupId &&
        !book.IsDeleted &&
        book.IsDrm != null &&
        book.BookFormatTypeId != null
    select new
    {
        Book = book,
        LibraryBook = libraryBook,
        BookTag = bookTag
    });
// add a couple of where clauses, then...
var result = libraryBookIdsWithShelfAndBookTagQuery.ToList();
I know it's not the query execution, because I put a sniffer on the database and I can see that the query is taking 0ms, yet the code is taking about a second to execute that query and bring back all of 11 records.
So yeah, this is an overly complex query, having 8 joins between 9 tables, and I could probably restructure it into several smaller queries. Or I could turn it into a stored procedure - but would that help?
What I'm trying to understand is, where is that red line crossed between a query that is performant and one that starts to struggle with materialization? What's going on under the hood? And would it help if this were a SP whose flat results I subsequently manipulate in memory into the right shape?
EDIT: in response to a request in the comments, here's the SQL emitted:
SELECT DISTINCT book4_.bookid AS BookId12_0_,
libraryboo5_.librarybookid AS LibraryB1_35_1_,
booktag2_.booktagid AS BookTagId15_2_,
book4_.title AS Title12_0_,
book4_.isbn AS ISBN12_0_,
book4_.publicationdate AS Publicat4_12_0_,
book4_.classificationtypeid AS Classifi5_12_0_,
book4_.synopsis AS Synopsis12_0_,
book4_.thumbnailurl AS Thumbnai7_12_0_,
book4_.retinathumbnailurl AS RetinaTh8_12_0_,
book4_.totalpages AS TotalPages12_0_,
book4_.lastpage AS LastPage12_0_,
book4_.lastpagelocation AS LastPag11_12_0_,
book4_.lexilerating AS LexileR12_12_0_,
book4_.lastpageposition AS LastPag13_12_0_,
book4_.hidden AS Hidden12_0_,
book4_.teacherhidden AS Teacher15_12_0_,
book4_.modifieddatetime AS Modifie16_12_0_,
book4_.isdeleted AS IsDeleted12_0_,
book4_.importedwithlexile AS Importe18_12_0_,
book4_.bookformattypeid AS BookFor19_12_0_,
book4_.isdrm AS IsDrm12_0_,
book4_.lightsailready AS LightSa21_12_0_,
libraryboo5_.bookid AS BookId35_1_,
libraryboo5_.libraryid AS LibraryId35_1_,
libraryboo5_.externalid AS ExternalId35_1_,
libraryboo5_.totalcopies AS TotalCop5_35_1_,
libraryboo5_.availablecopies AS Availabl6_35_1_,
libraryboo5_.statuschangedate AS StatusCh7_35_1_,
booktag2_.booktagtypeid AS BookTagT2_15_2_,
booktag2_.booktagvalue AS BookTagV3_15_2_
FROM shelf shelf0_,
shelfbooktagtypecrossreference shelfbookt1_,
booktag booktag2_,
booktagbookcrossreference booktagboo3_,
book book4_,
librarybook libraryboo5_,
library librarycre6_,
librarycredentialsalesforcegroupcrossreference librarycre7_,
usergroup usergroup8_
WHERE shelfbookt1_.shelfid = shelf0_.shelfid
AND booktag2_.booktagtypeid = shelfbookt1_.booktagtypeid
AND booktagboo3_.booktagid = booktag2_.booktagid
AND book4_.bookid = booktagboo3_.bookid
AND libraryboo5_.bookid = book4_.bookid
AND librarycre6_.libraryid = libraryboo5_.libraryid
AND librarycre7_.librarycredentialid = librarycre6_.libraryid
AND usergroup8_.usergrouporganizationid =
librarycre7_.usergrouporganizationid
AND shelf0_.shelfid = @p0
AND usergroup8_.usergroupid = @p1
AND NOT ( book4_.isdeleted = 1 )
AND ( book4_.isdrm IS NOT NULL )
AND ( book4_.bookformattypeid IS NOT NULL )
AND book4_.lightsailready = 1
EDIT 2: Here's the performance analysis from ANTS Performance Profiler:

It is often database "good" practice to place lots of joins, or very common joins, into views. ORMs don't let you ignore these facts, nor do they supplant the decades of work spent fine-tuning databases to do these kinds of things efficiently. Refactor those joins into a single view, or a couple of views if that makes more sense in the greater perspective of your application.
NHibernate should be optimizing the query down and reducing the data so that .NET only has to deal with the important parts. However, if those domain objects are naturally large, that's still a lot of data. Also, if it's a really large result set in terms of rows returned, that's a lot of objects getting instantiated even if the DB is able to return the set quickly. Refactoring this query into a view that only returns the data you actually need would also reduce object instantiation overhead.
Another thought would be to not do a .ToList(). Return the enumerable and let your code lazily consume the data.

According to the profiling information, CreateQuery takes 45% of the total execution time, whereas, as you mentioned, the query took 0ms when you executed it directly. But this alone is not enough to say there is a performance problem, because:
You are running the query under the profiler, which has a significant impact on execution time.
A profiler slows down all the code being profiled but not the SQL execution (which happens in SQL Server), so everything else looks slower relative to the SQL statement.
So the ideal approach is to measure how long the entire code block takes, measure how long the SQL query takes, and compare the two; if you do that you will probably end up with different values.
However, I'm not saying that NHibernate's LINQ provider is optimized for any query you come up with; there are other ways in NHibernate to deal with those situations, such as the QueryOver API, Criteria queries, HQL and finally SQL.
"Where is that red line crossed between a query that is performant and one that starts to struggle with materialization? What's going on under the hood?"
This is a pretty hard question, and without detailed knowledge of NHibernate's LINQ provider it's hard to give an accurate answer. You can always try the different mechanisms provided and see which one is best for a given scenario.
"And would it help if this were a SP whose flat results I subsequently manipulate in memory into the right shape?"
Yes, using a SP would probably make things faster, but it would add more maintenance problems to your code base.

You have a generic question, so I'll give you a generic answer :)
If you query data for reading (not for update), try to project into anonymous classes. They are lighter to create, they have no navigation properties, and you select only the data you need, which is a very important rule. So try to replace your select with something like this:
select new
{
Book = new { book.Id, book.Name},
LibraryBook = new { libraryBook.Id, libraryBook.AnotherProperty},
BookTag = new { bookTag.Name}
}
Stored procedures are good when the query is complex and the LINQ provider generates inefficient SQL; you can then replace it with plain SQL or a stored procedure. That is not often the case and, I think, it's not your situation.
Run your SQL query. How many rows does it return? Is it the same number as in result? Sometimes the LINQ provider generates SQL that selects many more rows than the entities you get back. This happens when an entity has a one-to-many relationship with another entity being selected. For example:
class Book
{
    public int Id { get; set; }
    public string Name { get; set; }
    public ICollection<Tag> Tags { get; set; }
}
class Tag
{
    public string Name { get; set; }
    public Book Book { get; set; }
}
...
dbContext.Books.Where(o => o.Id == 1).Select(o => new { Book = o, Tags = o.Tags }).Single();
I select only one book, with Id = 1, but the provider will generate SQL that returns as many rows as the book has tags (Entity Framework does this).
Split a complex query into a set of simple ones and join on the client side. Sometimes you have a complex query with many conditions and the resulting SQL becomes terrible. In that case, split the big query into simpler ones, get the results of each, and join/filter them on the client side.
Finally, I advise you to use an anonymous class as the result of the select.
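As a minimal sketch of the "split and join on the client side" idea: assume two simple queries have already been run and materialized into lists, and combine them in memory with LINQ to Objects. The BookRow and TagRow classes and their properties are hypothetical, chosen only for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical flat shapes, each filled by its own simple query.
public class BookRow
{
    public int BookId { get; set; }
    public string Title { get; set; }
}

public class TagRow
{
    public int BookId { get; set; }
    public string Value { get; set; }
}

public static class ClientSideJoin
{
    // Joins the two already-loaded result sets in memory on BookId.
    public static List<string> Combine(IEnumerable<BookRow> books, IEnumerable<TagRow> tags)
    {
        return (from b in books
                join t in tags on b.BookId equals t.BookId
                select b.Title + ": " + t.Value).ToList();
    }
}
```

Whether this beats one large query depends on how much data each part returns, so it is worth measuring both variants.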

Don’t use Linq’s Join. Navigate!
From that post:
As long as there are proper foreign key constraints in the database, the navigation properties will be created automatically. It is also possible to manually add them in the ORM designer. As with all LINQ to SQL usage I think that it is best to focus on getting the database right and have the code exactly reflect the database structure. With the relations properly specified as foreign keys the code can safely make assumptions about referential integrity between the tables.

I agree 100% with the sentiments expressed by everyone else (with regard to there being two parts to the optimisation here, and the SQL execution being a big unknown and a likely cause of poor performance).
Another part of the solution that might help you get some speed is to pre-compile your LINQ statements. I remember this being a huge optimisation on a tiny (high-traffic) project I worked on ages and ages ago... it seems like it would contribute to the client-side slowness you're seeing. Having said all that, though, I've not found a need to use compiled queries since... so heed everyone else's warnings first! :)
https://msdn.microsoft.com/en-us/library/vstudio/bb896297(v=vs.100).aspx

Stuck on SQL query with multiple joins

Alright, the system I got is a pretty outdated ERP system based around an Ingres database. The database schema is ... well ... not very nice (not really normalized) but basically it works out. Please understand that I cannot change anything related to the database.
Consider the following SQL statement:
SELECT
-- some selected fields here
FROM
sta_artikelstamm s
left join sta_chargen c on c.artikel_nr = s.artikel_nr and c.lager != 93
left join sta_artikelbeschreib b on s.artikel_nr = b.artikel_nr and b.seite = 25 and b.zeilennr = 1
left join sta_einkaufskonditionen ek on s.artikel_nr = ek.artikel_nr AND s.lieferant_1 = ek.kunden_nr
left join sta_kundenstamm ks on ek.kunden_nr = ks.nummer AND ks.nummer = s.lieferant_1
left join tab_teilegruppe2 tg2 on s.teilegruppe_2 = tg2.teilegruppe
WHERE
(s.status = 0)
AND
(s.teilegruppe_2 IS NOT NULL) AND (s.teilegruppe_2 != '')
So far, this works as expected: I get exactly 40742 results back. The result set looks alright, the number is about what I would expect, and the statement has shown no duplicates. I explicitly use a LEFT JOIN since some fields in related tables may not contain entries, but I would like to keep the info from the main article table nonetheless.
Now, table tab_teilegruppe2 consists of 3 fields (bezeichnung = description, teilegruppe = part group == primary key, taricnr - please ignore this field, it may be null or contain some values but I don't need it).
I thought of adding the following SQL condition to only include rows in the result set which do NOT belong to specific part groups. I therefore added the following line at the very end of the SQL statement.
AND (s.teilegruppe_2 NOT IN (49,57,60,63,64,65,66,68,71,73,76,77,78,79,106,107))
I'm by no means an SQL expert (you have probably guessed that already), but shouldn't an additional WHERE condition remove rows instead of adding them? As soon as I add this simple additional condition to the WHERE clause, I get 85170 result rows.
Now I'm guessing it has to do with the NOT IN condition, but I don't understand why I suddenly get more rows than before. Can anyone give me a pointer on where to look for my error?

What is the type of the s.teilegruppe_2 column? Is it an integer or some sort of string (VARCHAR)?
The (s.teilegruppe_2 != '') suggests it is a string but your NOT IN is comparing it against a list of integers.
If the column involved is a string then the NOT IN list will match all the values since none of them are going to match an integer value.

Entity Framework 5 (Code First) Navigation Properties

Is it the correct behaviour of entity framework to load all items with the given foreign key for a navigation property before querying/filtering?
For example:
myUser.Apples.First(a => a.Id == 1 && !a.Expires.HasValue);
Will load all apples associated with that user. (The SQL query doesn't query the ID or Expires fields).
There are two other ways of doing it (which generate the correct SQL) but neither as clean as using the navigation properties:
myDbContext.Entry(myUser).Collection(u => u.Apples).Query().First(a => a.Id == 1 && !a.Expires.HasValue);
myDbContext.Apples.First(a => a.UserId == myUser.Id && a.Id == 1 && !a.Expires.HasValue);
Things I've Checked
Lazy load is enabled and is not disabled anywhere.
The navigation properties are virtual.

EDIT:
OK, based on your edit I think I had the wrong idea about what you were asking (which makes a lot more sense now). I'll leave the previous answer around as I think it's probably useful as an explanation, but it is much less relevant to your specific question as it stands.
From what you've posted, your user object is enabled for lazy loading. EF enables lazy loading by default; the one requirement is that navigation properties are marked as virtual (which you have done).
Lazy loading works by attaching to the get method of a navigation property and performing a SQL query at that point to retrieve the related entities. Navigation properties are also not queryable collections, which means that when you call the get method the query is executed immediately.
In your example above, the Apples collection on User is loaded before the .First call executes (which then runs as plain old LINQ to Objects). This means that SQL will return all of the apples associated with the user, and they are filtered in memory on the querying machine (as you have observed). It also means you need two queries to pull down the apples you are interested in (one for the user and one for the nav property), which may not be efficient if all you want is apples.
A perhaps better way of doing this is to keep the whole expression as a query for as long as possible. An example of this would be something like the following:
myDbContext.Users
    .Where(u => u.Id == userId)
    .SelectMany(u => u.Apples)
    .Where(a => a.Id == 1 && !a.Expires.HasValue);
This should execute as a single SQL statement and only pull down the apples you care about.
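A rough way to see the difference, using plain LINQ with no EF involved: on an IQueryable<T>, a Where call is recorded in the query's expression tree, which is what gives a provider the chance to translate the predicate into SQL. On an already-loaded collection the same call resolves to the in-memory Enumerable method and simply runs as a delegate. In the sketch below, a list wrapped with AsQueryable() stands in for a DbSet; the Apple class and the QueryShape helper are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

// Hypothetical entity with the two properties used in the question.
public class Apple
{
    public int Id { get; set; }
    public DateTime? Expires { get; set; }
}

public static class QueryShape
{
    // Composes the filter on the queryable, so it becomes part of the expression tree.
    public static IQueryable<Apple> Filtered(IQueryable<Apple> apples)
    {
        return apples.Where(a => a.Id == 1 && !a.Expires.HasValue);
    }

    // Names the outermost operator recorded in the query's expression tree,
    // or "(none)" when the query is just the bare source.
    public static string OutermostOperator(IQueryable<Apple> query)
    {
        var call = query.Expression as MethodCallExpression;
        return call == null ? "(none)" : call.Method.Name;
    }
}
```

Calling QueryShape.OutermostOperator on the filtered query reports the Where call, which is the information a LINQ provider works from; a collection that lazy loading has already filled carries no such tree, so the filter can only run in memory.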
HTH
OK, from what I can understand of your question, you are asking why EF appears to allow you to use navigation properties in a query even though they may be null in the result set.
In answer to your question: yes, this is expected behavior. Here's why:
When you write a query it is translated into SQL; for example, something like
myDbContext.Apples.Where(a=>a.IsRed)
will turn into something like
Select * from Apples
where [IsRed] = 1
Similarly, something like the following will also be translated directly to SQL:
myDbContext.Apples.Where(a=>a.Tree.Height > 100)
will turn into something like
Select a.* from Apples as a
inner join Tree as t on a.TreeId = t.Id
where t.Height > 100
However, it's a bit of a different story when we actually pull down the result sets.
To avoid pulling down too much data and making things slow, EF offers several mechanisms for specifying what comes back in the result set. One is lazy loading (which incidentally needs to be used carefully if you want to avoid performance issues) and the second is the Include syntax. These mechanisms restrict what we pull back so that queries are quick and don't consume unneeded resources.
For example, in the above you will note that only Apple fields are returned.
If we were to add an Include to that, as below, you could get a different result:
myDbContext.Apples.Include(a=>a.Tree).Where(a=>a.Tree.Height > 100)
will translate to SQL similar to:
Select a.*, t.* from Apples as a
inner join Tree as t on a.TreeId = t.Id
where t.Height > 100
In your example above (which I'm fairly sure isn't syntactically correct, as myContext.Users should be a collection and therefore shouldn't have a .Apples) you are creating a query, therefore all variables are available. When you enumerate that query you have to be explicit about what's returned.
For more details on navigation properties and how they work (and the .Include syntax) check out my blog: http://blog.staticvoid.co.nz/2012/07/entity-framework-navigation-property.html

Most efficient way to sum data in C#

I am trying to create a friendly report summing enrollment for number of students by time of day. I initially started with loops for campusname, then time, then day, but it was extremely inefficient and slow. I decided to take another approach and select all the data I need in one select and organize it using C#.
Raw Data View
My problem is that I am not sure whether to put this into arrays, lists, a dictionary or a DataTable to sum the enrollment and organize it as seen below (mockup, not calculated). Any guidance would be appreciated.
Friendly View

Well, if you only need to show the user some data (and not edit it) you may want to create a report.
Otherwise, if you only need sums, you could get all the data in an IEnumerable and call .Sum(). And as pointed out by colinsmith, you can use Linq in parallel.
But one thing is definite though... If you have a lot of data, you don't want to do many queries. You could either use a sum query in SQL (if the data is stored in a database) or do the sum from a collection you've fetched.
You don't want to fetch the data in a loop. Processing data in memory is way faster than querying the database multiple times and then processing the results.

Normally I would advise you to do this in the database, i.e. a select using group by etc. I'm having a bit of trouble figuring out how your first picture relates to the second with regard to the days, so I can't offer an example.
You could of course do this in C# as well using LINQ to Objects, but I would first try to solve it in the DB; you are better off performance- and bandwidth-wise that way.

I am not quite sure what exactly you are after, but from my understanding, I would suggest you create a class to represent your enrollment:
public class Enrollment
{
    public string CampusName { get; set; }
    public DateTime DateEnrolled { get; set; }
}
Then get all the enrollment details from the database into a collection of this class:
List<Enrollment> enrollments = db.GetEnrollments();
Now you can run many kinds of operations on this collection to get the data you want.
For example, if you want all enrollments that happened on Fridays:
var fridaysEnrollMent = enrollments
    .Where(x => x.DateEnrolled.DayOfWeek == DayOfWeek.Friday).ToList();
If you want the count of those Friday enrollments that happened at the AA campus:
var fridayCount = fridaysEnrollMent.Count(d => d.CampusName == "AA");
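For the summing itself, a GroupBy over the fetched rows is one option. The sketch below assumes a flat, hypothetical row shape (campus, begin time, day, enrolled count) rather than the real columns, which are only visible in the screenshots, and returns one total per campus, begin time and day.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical flat row; the real columns may differ.
public class MeetingRow
{
    public string CampusName { get; set; }
    public string BeginTime { get; set; }
    public DayOfWeek Day { get; set; }
    public int Enrolled { get; set; }
}

public static class EnrollmentReport
{
    // Sums Enrolled for every (campus, begin time, day) combination,
    // ordered the way a report would typically list them.
    public static List<MeetingRow> SumByCampusTimeDay(IEnumerable<MeetingRow> rows)
    {
        return rows
            .GroupBy(r => new { r.CampusName, r.BeginTime, r.Day })
            .Select(g => new MeetingRow
            {
                CampusName = g.Key.CampusName,
                BeginTime = g.Key.BeginTime,
                Day = g.Key.Day,
                Enrolled = g.Sum(r => r.Enrolled)
            })
            .OrderBy(r => r.CampusName)
            .ThenBy(r => r.BeginTime)
            .ThenBy(r => r.Day)
            .ToList();
    }
}
```

The grouped list can then be pivoted for display, for example one line per campus and time with a column per day.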

Something like
select campusname, ssrmeet_begin_time, count(ssrmeet_monday), count(ssrmeet_tue_day) ..
from the_table
group by campusname, ssrmeet_begin_time
order by campusname, ssrmeet_begin_time
/
should be close to what you want. The count only counts the non-NULL values, not the NULLs. It is also usually far faster than first fetching all the data to the client. Let the database do the analysis for you; it already has all the data.
BTW: instead of those pictures, it is smarter to give some DDL and insert statements with data to work on. That would invite more people to help answer the question.

SQL Server Architecture for specific problem - full-text search - with full join

I am building an application that searches candidates' resumes. I need to use full-text search in the application as there are a lot of records and the resume field is fairly large. The issue is that for advanced searches, I have another table, RelocationItems, that lists zips, states, etc. for the candidate's relocation preferences and is related through a candidateID in the RelocationItems table. The problem is that sometimes a candidate will have no RelocationItems, sometimes they will have one, and sometimes they will have more than one. So, simple enough, I created a View that uses a full outer join, and I can then select using DISTINCT on candidateID to find the candidates I need that will relocate to a certain area based on the search criteria.
The big problem with this view, though, is that since it uses a full join, I can't use full-text search any more! (Obviously so, because my full-text index field is now not a unique, not-null field.)
And my stored procedure contains the CONTAINS keyword, so it won't even compile.
Should I :
- Create a new table based on the view? (and then create another index identity field)
- Do something to store the relocation items in the candidate table (maybe an XML field)? (I don't think you can store a table-value parameter in 2008 can you?)
- Do some sort of Union of Tables (Queries)? (Run the search against the Candidates Table and then against the RelocationTable and then merge or union)?
Thanks for any suggestions on the best way to work around this problem!!!

I created a View that uses full outer join and then can select using DISTINCT on candidateID to
find the candidates I need that will relocate to a certain area based on the search criteria.
That is already a potential problem: a subselect with EXISTS would be better.
A properly set up query would have no problem. Do not use a join; go for a subselect with EXISTS.
