Large LINQ Grouping query, what's happening behind the scenes - c#

Take the following LINQ query as an example. Please don't comment on the code itself, as I've just typed it to help with this question.
The following LINQ query uses a 'group by' and calculates summary information. As you can see, there are numerous calculations being performed on the data, but how efficient is LINQ behind the scenes?
var NinjasGrouped = (from ninja in Ninjas
                     group ninja by new { ninja.NinjaClan, ninja.NinjaRank }
                     into con
                     select new NinjaGroupSummary
                     {
                         NinjaClan = con.Key.NinjaClan,
                         NinjaRank = con.Key.NinjaRank,
                         NumberOfShoes = con.Sum(x => x.Shoes),
                         MaxNinjaAge = con.Max(x => x.NinjaAge),
                         MinNinjaAge = con.Min(x => x.NinjaAge),
                         ComplicatedCalculation = con.Sum(x => x.NinjaGrade) != 0
                             ? con.Sum(x => x.NinjaRedBloodCellCount) / con.Sum(x => x.NinjaDoctorVisits)
                             : 0,
                         ListOfNinjas = con.ToList()
                     }).ToList();
How many times is the list of 'Ninjas' being iterated over in order to calculate each of the values?
Would it be faster to employ a foreach loop for such a query?
Would adding '.AsParallel()' after Ninjas result in any performance improvements?
Is there a better way of calculating summary information for a List&lt;T&gt;?
Any advice is appreciated, as we use this type of code throughout our software and I would really like to gain a better understanding of what LINQ is doing under the hood (so to speak). Perhaps there is a better way?

Assuming this is a LINQ to Objects query:
Ninjas is only iterated over once; the groups are built up into internal concrete lists, which you're then iterating over multiple times (once per aggregation).
Using a foreach loop almost certainly wouldn't speed things up - you might benefit from cache coherency a bit more (as each time you iterate over a group it'll probably have to fetch data from a higher level cache or main memory) but I very much doubt that it would be significant. The increase in pain in implementing it probably would be significant though :)
Using AsParallel might speed things up - it looks pretty easily parallelizable. Worth a try...
There's not a much better way for LINQ to Objects, to be honest. It would be nice to be able to perform the aggregation as you're grouping, and Reactive Extensions would allow you to do something like that, but for the moment this is probably the simplest approach.
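For illustration, a single-pass manual aggregation might look something like this (a sketch only; the Acc accumulator type and the assumption that NinjaClan and NinjaRank are strings are mine, not from the question):
// Hypothetical accumulator type, invented for this sketch.
class Acc
{
    public int Shoes;
    public int MaxAge = int.MinValue;
    public int MinAge = int.MaxValue;
    public List<Ninja> Members = new List<Ninja>();
}

// One pass over Ninjas, accumulating totals per (clan, rank) group.
var groups = new Dictionary<(string Clan, string Rank), Acc>();
foreach (var ninja in Ninjas)
{
    var key = (ninja.NinjaClan, ninja.NinjaRank);
    if (!groups.TryGetValue(key, out var acc))
        groups[key] = acc = new Acc();
    acc.Shoes += ninja.Shoes;
    acc.MaxAge = Math.Max(acc.MaxAge, ninja.NinjaAge);
    acc.MinAge = Math.Min(acc.MinAge, ninja.NinjaAge);
    acc.Members.Add(ninja);
}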
You might want to have a look at the GroupBy post in my Edulinq blog series for more details on a possible implementation.

Related

PLINQ slower than actual Linq for this snippet

Below is the code snippet. EF6 is used.
var itemNames = context.cam.AsParallel()
    .Where(x => x.cams == "edsfdf")
    .Select(item => item.fg)
    .FirstOrDefault();
Why is PLINQ slower?
If you look at the signature of .AsParallel(), it takes an IEnumerable<T> rather than an IQueryable<T>. A LINQ query is only converted to a SQL statement while it's kept as an IQueryable; once you enumerate it, the query executes and the records are returned.
So to break down the query you have:
context.cam.AsParallel()
This bit of code will essentially execute SELECT * FROM cam on the database and then start iterating through the results, which are fed into a ParallelQuery. In effect, it loads the entire table into memory.
.Where(x => x.cams == "edsfdf")
.Select(item => item.fg)
.FirstOrDefault()
After this point, all of those operations will happen in parallel. A simple string equality comparison is likely extremely inexpensive compared to the overhead of spinning up a lot of threads and managing locking and concurrency between them (which PLINQ will take care of for you). Parallel processing and the costs/benefits is a complicated topic, but it's usually best saved for CPU-intensive work.
If you skipped the AsParallel() call, everything remains as an IQueryable all the way through the linq statement, so EntityFramework will send a single SQL command that looks something like SELECT fg FROM cam WHERE cams = 'edsfdf' and return that single result, which SQL Server will optimize to a very fast lookup, especially if there's an index on cams.
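For comparison, dropping the AsParallel() call (identifiers taken from the snippet above) keeps the whole statement an IQueryable, so EF can translate it to SQL:
// Stays an IQueryable end to end; EF sends one SQL statement
// and only the single matching value comes back over the wire.
var itemName = context.cam
    .Where(x => x.cams == "edsfdf")
    .Select(item => item.fg)
    .FirstOrDefault();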
It's slower because you have used parallelism in the wrong place. Your Where clause doesn't do heavy work, so it's not a good scenario for PLINQ.
If PLINQ comes with associated complexity, then what is the real benefit of its existence?
You don't need a hundred people to find a missing child in a house, because gathering a hundred people takes longer than finding the child yourself. But if you had to search a forest, that overhead would be worth it: searching a big forest alone takes longer than gathering a hundred people to search it with you.
Parallelism is effective when used at the right time in the right place.
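For illustration, here's a hedged sketch of a "forest-sized", CPU-bound workload where AsParallel() typically does pay off (the IsPrime helper is invented for this example):
using System;
using System.Linq;

class Demo
{
    static bool IsPrime(int n)
    {
        if (n < 2) return false;
        for (int i = 2; i * i <= n; i++)
            if (n % i == 0) return false;
        return true;
    }

    static void Main()
    {
        // CPU-bound work over an in-memory sequence: the kind of
        // "forest search" where PLINQ's thread overhead pays off.
        var primeCount = Enumerable.Range(2, 10_000_000)
            .AsParallel()
            .Count(IsPrime);
        Console.WriteLine(primeCount);
    }
}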

Searching a collection effectively in c#

I have an AsyncObservable collection of some class, say "dashboard". Each item in the dashboard collection contains a collection of another class, say "chart". Each chart has various properties such as name, type, etc. I want to search this collection by chart name, type, and so on. Can anybody suggest a searching technique? Currently I search by traversing the whole collection with a foreach and comparing the entered input against each item (which is not efficient for large amounts of data). I want to make it more efficient. I am using C#.
My code is:
foreach (DashBoard item in this.DashBoards)
{
    Chart obj1 = item.CurrentCharts.ToList()
        .Find(chart => chart.ChartName.ToUpper().Contains(searchText.ToUpper()));
    if (obj1 != null)
    {
        if (obj1.IsHighlighted != Colors.Wheat)
            obj1.IsHighlighted = Colors.Wheat;
        item.IsExpanded = true;
        flagList.Add(1);
    }
    else
    {
        flagList.Add(0);
    }
}
You can use a LINQ query. For example, you can do something like this:
Dashboard.SelectMany(q => q.Chart).Where(a => a.Name == "SomeName")
Here is the reference linq question: querying nested collections
Edit: Foreach loops or LINQ
The answer is not really clear-cut. There are two sides to any code-cost argument: performance and maintainability. The first of these is obvious and quantifiable.
Under the hood, LINQ will iterate over the collection, just as foreach will. The difference is that LINQ defers execution until the iteration begins.
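A quick sketch of what that deferred execution means in practice:
var numbers = new List<int> { 1, 2, 3 };

// Nothing runs here; the query is only a description.
var evens = numbers.Where(n => n % 2 == 0);

numbers.Add(4);

// Execution happens now, so the 4 added above is included.
foreach (var n in evens)
    Console.WriteLine(n); // prints 2, then 4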
Performance-wise, take a look at this blog post: http://www.schnieds.com/2009/03/linq-vs-foreach-vs-for-loop-performance.html
In your case: if the collection is relatively small or medium-sized, I would suggest using foreach for better performance.
At the end of the day, LINQ is more elegant but usually a bit less efficient, while foreach clutters the code a bit but performs better. On large collections, or where parallel computing makes sense, I would choose LINQ, as the performance gap shrinks to a minimum.
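Applying the earlier SelectMany suggestion to the question's highlighting logic might look like this (a sketch using names from the question; IndexOf with StringComparison avoids the repeated ToUpper allocations):
// Flatten all charts once, then filter with a case-insensitive
// substring test, keeping the owning dashboard alongside each chart.
var matches = this.DashBoards
    .SelectMany(d => d.CurrentCharts, (d, c) => new { Dashboard = d, Chart = c })
    .Where(x => x.Chart.ChartName.IndexOf(searchText, StringComparison.OrdinalIgnoreCase) >= 0);

foreach (var m in matches)
{
    m.Chart.IsHighlighted = Colors.Wheat;
    m.Dashboard.IsExpanded = true;
}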

Caching Linq Query Question

I am creating a forum package for a CMS and am looking at caching some of the queries to help with performance, but I'm not sure if caching the code below will help or do what it should (BTW: CacheHelper is a simple helper class that just adds to and removes from the cache).
// Set cache variables
IEnumerable<ForumTopic> maintopics;
if (!CacheHelper.Get(topicCacheKey, out maintopics))
{
    // Now get topics
    maintopics = from t in u.ForumTopics
                 where t.ParentNodeId == CurrentNode.Id
                 orderby t.ForumTopicLastPost descending
                 select t;
    // Add to cache
    CacheHelper.Add(maintopics, topicCacheKey);
}
// End cache
// Pass to my pager helper
var pagedResults = new PaginatedList<ForumTopic>(maintopics, p ?? 0, Convert.ToInt32(Settings.ForumTopicsPerPage));
// Now bind
rptTopicList.DataSource = pagedResults;
rptTopicList.DataBind();
Doesn't LINQ only execute when it's enumerated? So the above won't work, will it? It's only enumerated when I pass it to the paging helper, which .Take()'s a certain number of records based on a querystring value 'p'.
You need to enumerate your results, for example by calling the ToList() method.
maintopics = from t in u.ForumTopics
             where t.ParentNodeId == CurrentNode.Id
             orderby t.ForumTopicLastPost descending
             select t;
// Add to cache
CacheHelper.Add(maintopics.ToList(), topicCacheKey);
My experience with Linq-to-Sql is that it's not super performant when you start getting into complex objects and/or joins.
The first step is to set up LoadOptions on the DataContext. This will force joins so that a complete record is retrieved. This was a problem in a ticket tracking system I wrote: I was displaying a list of 10 tickets and saw about 70 queries come across the wire. I had ticket->substatus->status, and due to L2S's lazy loading, each foreign key of each object referenced in the grid fired off a new query.
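A minimal sketch of what that might look like (the DataContext and entity names here are assumptions based on the ticket example):
using System.Data.Linq;

var options = new DataLoadOptions();
// Eagerly load the chain ticket -> substatus -> status in one query.
options.LoadWith<Ticket>(t => t.SubStatus);
options.LoadWith<SubStatus>(s => s.Status);

var context = new TicketsDataContext();
context.LoadOptions = options; // must be set before the first query runs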
Here's a blog post (not mine) about this subject (MSDN was weak): http://oakleafblog.blogspot.com/2007/08/linq-to-sql-query-execution-with.html
The next option is to create precompiled Linq queries. I had to do this with large joins. Here's another blog post on the subject: http://aspguy.wordpress.com/2008/08/15/speed-up-linq-to-sql-with-compiled-linq-queries/
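A hedged sketch of a compiled query (again, the context and entity names are assumed):
using System;
using System.Data.Linq;
using System.Linq;

static readonly Func<TicketsDataContext, int, IQueryable<Ticket>> TicketsForStatus =
    CompiledQuery.Compile((TicketsDataContext db, int statusId) =>
        db.Tickets.Where(t => t.StatusId == statusId));

// Usage: the expression tree is translated to SQL once, then reused.
var openTickets = TicketsForStatus(context, 1).ToList();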
The next option is to convert things over to using stored procedures. This makes programming and deployment harder for sure, but for complex queries where you only need a subset of data, they will be orders of magnitude faster.
The reason I bring this up is that the way you're talking about caching things (why not use the built-in Cache in ASP.NET?) is going to cause you lots of headaches in the long term. I'd recommend building your system and then running SQL traces to see where your database performance problems are, then building optimizations around that. You might find that your real issues aren't in the "top 10 topics" but in other areas that are much simpler to fix.
Yes, you need to enumerate your results. Linq will not evaluate your query until you enumerate the results.
If you want a general caching strategy for Linq, here is a great tutorial:
http://petemontgomery.wordpress.com/2008/08/07/caching-the-results-of-linq-queries/
The end goal is the ability to automatically generate unique cache keys for any Linq query.

Is this an efficient way of doing a linq query with dynamic order by fields?

I have the following method to apply a sort to a list of objects (simplified it for the example):
private IEnumerable<Something> SetupOrderSort(IEnumerable<Something> input,
                                              SORT_TYPE sort)
{
    IOrderedEnumerable<Something> output = input.OrderBy(s => s.FieldA)
                                                .ThenBy(s => s.FieldB);
    switch (sort)
    {
        case SORT_TYPE.FIELD1:
            output = output.ThenBy(s => s.Field1);
            break;
        case SORT_TYPE.FIELD2:
            output = output.ThenBy(s => s.Field2);
            break;
        case SORT_TYPE.UNDEFINED:
            break;
    }
    return output.ThenBy(s => s.FieldC).ThenBy(s => s.FieldD).AsEnumerable();
}
What I need is to be able to insert a specific field in the middle of the order-by clause. By default the ordering is: FieldA, FieldB, FieldC, FieldD.
When a sort field is specified, though, I need to insert it between FieldB and FieldC in the sort order.
Currently there are only two possible sort fields, but there could be up to eight. Performance-wise, is this a good approach? Is there a more efficient way of doing this?
EDIT: I saw the following thread as well: Dynamic LINQ OrderBy on IEnumerable<T>, but I thought it was overkill for what I needed. This is a snippet of code that executes a lot, so I just want to make sure I'm not missing something that could easily be done better.
Don't try and "optimize" stuff you haven't proved slow with a profiler.
It's highly unlikely that this will be slow enough to notice. I strongly suspect the overhead of actually sorting the list is higher than switching on one string.
The important question is: Is this code maintainable? Will you forget to add another case the next time you add a property to Something? If that will be a problem, consider using the MS Dynamic Query sample, from the VS 2008 C# samples page.
Otherwise, you're fine.
There's nothing inefficient about your method, but there is something unintuitive about it, which is that you can't sort by multiple columns - something that end users are almost sure to want to do.
I might hand-wave this concern away on the chance that both columns are unique, but the fact that you subsequently hard-code in another sort at the end leads me to believe that Field1 and Field2 are neither related nor unique, in which case you really should consider the possibility of having an arbitrary number of levels of sorting, perhaps by accepting an IEnumerable<SORT_TYPE> or params SORT_TYPE[] argument instead of a single SORT_TYPE.
Anyway, as far as performance goes, the OrderBy and ThenBy extensions have deferred execution, so each successive ThenBy in your code is probably no more than a few CPU instructions, it's just wrapping one function in another. It will be fine; the actual sorting will be far more expensive.
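Picking up that suggestion, a sketch of a multi-level version (type and field names taken from the question):
private IEnumerable<Something> SetupOrderSort(IEnumerable<Something> input,
                                              params SORT_TYPE[] sorts)
{
    IOrderedEnumerable<Something> output = input.OrderBy(s => s.FieldA)
                                                .ThenBy(s => s.FieldB);
    // Each requested sort is layered in between FieldB and FieldC.
    foreach (SORT_TYPE sort in sorts)
    {
        switch (sort)
        {
            case SORT_TYPE.FIELD1:
                output = output.ThenBy(s => s.Field1);
                break;
            case SORT_TYPE.FIELD2:
                output = output.ThenBy(s => s.Field2);
                break;
        }
    }
    return output.ThenBy(s => s.FieldC).ThenBy(s => s.FieldD);
}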

Why is the SQL produced by LINQ-to-Entities so inefficient?

The following (cut down) code excerpt is a Linq-To-Entities query that results in SQL (via ToTraceString) that is much slower than a hand crafted query. Am I doing anything stupid, or is Linq-to-Entities just bad at optimizing queries?
I have a ToList() at the end of the query as I need to execute it before using it to build an XML data structure (which was a whole other pain).
var result = (from mainEntity in entities.Main
              where (mainEntity.Date >= today) && (mainEntity.Date <= tomorrow) && (!mainEntity.IsEnabled)
              select new
              {
                  Id = mainEntity.Id,
                  Sub =
                      from subEntity in mainEntity.Sub
                      select new
                      {
                          Id = subEntity.Id,
                          FirstResults =
                              from firstResultEntity in subEntity.FirstResult
                              select new
                              {
                                  Value = firstResultEntity.Value,
                              },
                          SecondResults =
                              from secondResultEntity in subEntity.SecondResult
                              select new
                              {
                                  Value = secondResultEntity.Value,
                              },
                          SubSub =
                              from subSubEntity in entities.SubSub
                              where (subEntity.Id == subSubEntity.MainId) && (subEntity.Id == subSubEntity.SubId)
                              select new
                              {
                                  Name = (from name in entities.Name
                                          where subSubEntity.NameId == name.Id
                                          select name.Name).FirstOrDefault()
                              }
                      }
              }).ToList();
While working on this, I've also had some real problems with dates. When I just tried to include returned dates in my data structure, I got internal error "1005".
Just as a general observation and not based on any practical experience with Linq-To-Entities (yet): having four nested subqueries inside a single query doesn't look like it's awfully efficient and speedy to begin with.
I think your very broad statement about the (lack of) quality of the SQL generated by Linq-to-Entities is not warranted - and you don't really back it up by much evidence, either.
Several well respected folks including Rico Mariani (MS Performance guru) and Julie Lerman (author of "Programming EF") have been showing in various tests that in general and overall, the Linq-to-SQL and Linq-to-Entities "engines" aren't really all that bad - they achieve overall at least 80-95% of the possible peak performance. Not every .NET app dev can achieve this :-)
Is there any way for you to rewrite that query or change the way you retrieve the bits and pieces that make up its contents?
Marc
Have you tried not materializing the result immediately by calling .ToList()? I'm not sure it will make a difference, but you might see improved performance if you iterate over the result instead of calling .ToList() ...
foreach (var r in result)
{
    // build your XML
}
Also, you could try breaking up the one huge query into separate queries and then iterating over the results. Sending everything in one big gulp might be the issue.
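A rough sketch of that idea (assuming Sub exposes a MainId foreign key, which isn't shown in the question):
// First query: only the matching main entities.
var mains = entities.Main
    .Where(m => m.Date >= today && m.Date <= tomorrow && !m.IsEnabled)
    .ToList();

// Second query: all related subs in one batch, grouped in memory.
var mainIds = mains.Select(m => m.Id).ToList();
var subsByMain = entities.Sub
    .Where(s => mainIds.Contains(s.MainId))
    .ToList()
    .ToLookup(s => s.MainId);

foreach (var main in mains)
{
    // Build the XML fragment for this entity from subsByMain[main.Id].
}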
