Is it possible to do this as a single efficient LINQ query? - c#

I have a class like
public class Foo
{
public string X;
public string Y;
public int Z;
}
and the query I want to achieve is, given an IEnumerable<Foo> called foos,
"Group by X, then by Y, and choose the largest subgroup
from each supergroup; if there is a tie, choose the one with the
largest Z."
In other words, a not-so-compact solution would look like
var outer = foos.GroupBy(f => f.X);
foreach(var g1 in outer)
{
var inner = g1.GroupBy(g2 => g2.Y);
int maxCount = inner.Max(g3 => g3.Count());
var winners = inner.Where(g4 => g4.Count() == maxCount);
if(winners.Count() > 1)
{
yield return winners.MaxBy(w => w.Max(f => f.Z));
}
else
{
yield return winners.Single();
}
}
and a not-so-efficient solution would be like
from foo in foos
group foo by new { foo.X, foo.Y } into g
orderby g.Key.X, g.Count(), g.Max(f => f.Z)
. . . // can't figure the rest out
but ideally I'd like both compact and efficient.

You are enumerating the same sequences repeatedly: each Count(), Max(), and Where() call runs the underlying enumerable again, which can significantly hurt performance in some cases.
Your not-so-compact code can be simplified to this:
foreach (var byX in foos.GroupBy(f => f.X))
{
yield return byX.GroupBy(f => f.Y, f => f, (_, byY) => byY.ToList())
.MaxBy(l => l.Count)
.MaxBy(f => f.Z);
}
Here is how it works:
The items are grouped by X, hence the variable name byX: every item in a given byX group has the same X.
Each of those groups is then grouped by Y, so every item in a byY group has the same Y as well as the same X.
Finally, you select the largest list, i.e. the winner (MaxBy(l => l.Count)), and from that list you select the item with the highest Z (MaxBy(f => f.Z)).
I used byY.ToList() to prevent the duplicate enumeration that Count() and MaxBy() would otherwise cause.
Alternatively, you can turn the entire iterator into a single return statement:
return foos.GroupBy(f => f.X, f => f, (_, byX) =>
byX.GroupBy(f => f.Y, f => f,(__, byY) => byY.ToList())
.MaxBy(l => l.Count)
.MaxBy(f => f.Z));

Based on the wording of your question I assume that you want the result to be an IEnumerable<IEnumerable<Foo>>. Elements are grouped by both X and Y so all elements in a specific inner sequence will have the same value for X and Y. Furthermore, every inner sequence will have different (unique) values for X.
Given the following data
X Y Z
-----
A p 1
A p 2
A q 1
A r 3
B p 1
B q 2
the resulting sequence of sequences should consist of two sequences (for X = A and X = B)
X Y Z
-----
A p 1
A p 2
X Y Z
-----
B q 2
You can get this result using the following LINQ expression:
var result = foos
.GroupBy(
outerFoo => outerFoo.X,
(x, xFoos) => xFoos
.GroupBy(
innerFoo => innerFoo.Y,
(y, yFoos) => yFoos)
.OrderByDescending(yFoos => yFoos.Count())
.ThenByDescending(yFoos => yFoos.Select(foo => foo.Z).Max())
.First());
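As a quick check of the expression above, here is a minimal, self-contained sketch that runs the same ordering idea against the sample data. The names LargestSubgroupDemo and LargestSubgroups are made up for this illustration, and each subgroup is materialized with ToList() so that its count is not recomputed during sorting.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Foo
{
    public string X;
    public string Y;
    public int Z;
}

// Hypothetical helper class, not part of the original answer.
public static class LargestSubgroupDemo
{
    // For each X, return the Y-subgroup with the most elements;
    // ties are broken by the largest Z found in the subgroup.
    public static List<List<Foo>> LargestSubgroups(IEnumerable<Foo> foos)
    {
        return foos
            .GroupBy(f => f.X)
            .Select(byX => byX
                .GroupBy(f => f.Y)
                .Select(byY => byY.ToList())            // materialize each subgroup once
                .OrderByDescending(list => list.Count)
                .ThenByDescending(list => list.Max(f => f.Z))
                .First())
            .ToList();
    }

    public static void Main()
    {
        var foos = new List<Foo>
        {
            new Foo { X = "A", Y = "p", Z = 1 },
            new Foo { X = "A", Y = "p", Z = 2 },
            new Foo { X = "A", Y = "q", Z = 1 },
            new Foo { X = "A", Y = "r", Z = 3 },
            new Foo { X = "B", Y = "p", Z = 1 },
            new Foo { X = "B", Y = "q", Z = 2 },
        };

        // Prints one line per X value:
        //   A p 1, A p 2
        //   B q 2
        foreach (var group in LargestSubgroups(foos))
            Console.WriteLine(string.Join(", ", group.Select(f => $"{f.X} {f.Y} {f.Z}")));
    }
}
```

For X = B both subgroups have one element, so the tie-break on Z decides in favour of the q subgroup.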
If you really care about performance you can most likely improve it at the cost of some complexity:
When picking the group with most elements or highest Z value two passes are performed over the elements in each group. First the elements are counted using yFoos.Count() and then the maximum Z value is computed using yFoos.Select(foo => foo.Z).Max(). However, you can do the same in one pass by using Aggregate.
Also, it is not necessary to sort all the groups to find the "largest" group. Instead a single pass over all the groups can be done to find the "largest" group again using Aggregate.
result = foos
.GroupBy(
outerFoo => outerFoo.X,
(x, xFoos) => xFoos
.GroupBy(
innerFoo => innerFoo.Y,
(y, yFoos) => new
{
Foos = yFoos,
Aggregate = yFoos.Aggregate(
(Count: 0, MaxZ: int.MinValue),
(accumulator, foo) =>
(Count: accumulator.Count + 1,
MaxZ: Math.Max(accumulator.MaxZ, foo.Z)))
})
.Aggregate(
new
{
Foos = Enumerable.Empty<Foo>(),
Aggregate = (Count: 0, MaxZ: int.MinValue)
},
(accumulator, grouping) =>
grouping.Aggregate.Count > accumulator.Aggregate.Count
|| grouping.Aggregate.Count == accumulator.Aggregate.Count
&& grouping.Aggregate.MaxZ > accumulator.Aggregate.MaxZ
? grouping : accumulator)
.Foos);
I am using a ValueTuple as the accumulator in Aggregate because I expect it to perform well. However, if you really want to know, you should measure.

You can pretty much ignore the outer grouping; what is left is just a slightly more advanced MaxBy, somewhat like sorting by two keys. If you implement that (as MaxBy2 here), you end up with something like:
public IEnumerable<IGrouping<string, Foo>> GetFoo2(IEnumerable<Foo> foos)
{
return foos.GroupBy(f => f.X)
.Select(f => f.GroupBy(g => g.Y)
.MaxBy2(g => g.Count(), g => g.Max(m => m.Z)));
}
It is questionable how much you can call this a LINQ approach, since all the functionality has moved into a quite ordinary function. You can also implement the functionality with Aggregate. There are two options: with a seed and without one. I like the latter option:
public IEnumerable<IGrouping<string, Foo>> GetFoo3(IEnumerable<Foo> foos)
{
return foos.GroupBy(f => f.X)
.Select(f => f.GroupBy(g => g.Y)
.Aggregate((a, b) =>
a.Count() > b.Count() ? a :
a.Count() < b.Count() ? b :
a.Max(m => m.Z) >= b.Max(m => m.Z) ? a : b
));
}
Performance would suffer if Count() is not constant time, which is not guaranteed, but in my tests it worked fine. The variant with a seed would be more complicated, but may be faster if done right.

Thinking about this further, I realized your orderby could vastly simplify everything, still not sure it is that understandable.
var ans = foos.GroupBy(f => f.X, (_, gXfs) => gXfs.GroupBy(gXf => gXf.Y).Select(gXgYfs => gXgYfs.ToList())
.OrderByDescending(gXgYfs => gXgYfs.Count).ThenByDescending(gXgYfs => gXgYfs.Max(gXgYf => gXgYf.Z)).First());
While it is possible to do this in LINQ, I don't find it any more compact or understandable if you make it into one statement when using query comprehension syntax:
var ans = from foo in foos
group foo by foo.X into foogX
let foogYs = (from foo in foogX
group foo by foo.Y into rfoogY
select rfoogY)
let maxYCount = foogYs.Max(y => y.Count())
let foogYsmZ = from fooY in foogYs
where fooY.Count() == maxYCount
select new { maxZ = fooY.Max(f => f.Z), fooY = from f in fooY select f }
let maxMaxZ = foogYsmZ.Max(y => y.maxZ)
select (from foogY in foogYsmZ where foogY.maxZ == maxMaxZ select foogY.fooY).First();
If you are willing to use lambda syntax, some things become easier and shorter, though not necessarily more understandable:
var ans = from foogX in foos.GroupBy(f => f.X)
let foogYs = foogX.GroupBy(f => f.Y)
let maxYCount = foogYs.Max(foogY => foogY.Count())
let foogYmCmZs = foogYs.Where(fooY => fooY.Count() == maxYCount).Select(fooY => new { maxZ = fooY.Max(f => f.Z), fooY })
let maxMaxZ = foogYmCmZs.Max(foogYmZ => foogYmZ.maxZ)
select foogYmCmZs.Where(foogYmZ => foogYmZ.maxZ == maxMaxZ).First().fooY.Select(y => y);
With lots of lambda syntax, you can go completely incomprehensible:
var ans = foos.GroupBy(f => f.X, (_, gXfs) => gXfs.GroupBy(gXf => gXf.Y).Select(gXgYf => new { fCount = gXgYf.Count(), maxZ = gXgYf.Max(f => f.Z), gXgYfs = gXgYf.Select(f => f) }))
.Select(fC_mZ_gXgYfs_s => {
var maxfCount = fC_mZ_gXgYfs_s.Max(fC_mZ_gXgYfs => fC_mZ_gXgYfs.fCount);
var fC_mZ_gXgYfs_mCs = fC_mZ_gXgYfs_s.Where(fC_mZ_gXgYfs => fC_mZ_gXgYfs.fCount == maxfCount).ToList();
var maxMaxZ = fC_mZ_gXgYfs_mCs.Max(fC_mZ_gXgYfs => fC_mZ_gXgYfs.maxZ);
return fC_mZ_gXgYfs_mCs.Where(fC_mZ_gXgYfs => fC_mZ_gXgYfs.maxZ == maxMaxZ).First().gXgYfs;
});
(I modified this third possibility to reduce repetitive calculations and be more DRY, but that did make it a bit more verbose.)

Related

LINQ groupby search in contains

I have the following result:
var result = (from p1 in db.Table
select new ReportInform
{
DataValue = p1.DataValue,
SampleDate = p1.SampleDate
})
.Distinct()
.ToList();
// Next getting list of duplicate SampleDates
var duplicates = result.GroupBy(x => x.SampleDate)
.Where(g => g.Count() > 1)
.Select (x => x)
.ToList();
foreach (var r in result)
{
if (duplicates.Contains(r.SampleDate)) // get error here on incompatibility
{
r.SampleDate = r.SampleDate.Value.AddMilliseconds(index++);
}
}
Cannot convert from 'System.DateTime?' to 'System.Linq.IGrouping
That error is clear once you know how to read it, though it may not be at first glance. As a programmer, you need to learn how to read, understand and make sense of compiler and runtime errors.
It is complaining that it cannot convert DateTime? to System.Linq.IGrouping<System.DateTime?, ReportInform>. Why? Because this query returns a list of System.Linq.IGrouping<System.DateTime?, ReportInform>:
var duplicates = result.GroupBy(x => x.SampleDate)
.Where(g => g.Count() > 1)
.Select (x => x)
.ToList();
The GroupBy method returns a sequence of IGrouping<System.DateTime?, ReportInform>. Each grouping has a Key, which is the value you grouped by, and the items in that group. You are grouping by SampleDate, keeping the groups that contain more than one item, and then selecting the group itself. Thus duplicates is a list of IGrouping<System.DateTime?, ReportInform>, and you are asking whether it contains a DateTime?, which fails at this line:
duplicates.Contains(r.SampleDate)
To fix this, select the key of each group instead of the group itself:
.Select (x => x.Key)
If you are expecting duplicates to be of type List<DateTime?> then you meant to write this
.Select(x => x.Key)
instead of
.Select(x => x)
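Putting the fix together, a minimal sketch might look like the following. It assumes ReportInform has a nullable SampleDate and a numeric DataValue (the question does not show the class), that index starts at 0, and it collects the duplicate keys in a HashSet so that each Contains check is cheap. The DuplicateDates class name is made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed shape of the class; the real definition is not shown in the question.
public class ReportInform
{
    public double DataValue { get; set; }
    public DateTime? SampleDate { get; set; }
}

public static class DuplicateDates
{
    // Shifts every row whose SampleDate occurs more than once by a growing
    // number of milliseconds so that the dates become distinct.
    public static void Disambiguate(List<ReportInform> rows)
    {
        // Select the Key (a DateTime?) of each group, not the group itself.
        var duplicateDates = new HashSet<DateTime?>(
            rows.GroupBy(r => r.SampleDate)
                .Where(g => g.Key.HasValue && g.Count() > 1)
                .Select(g => g.Key));

        int index = 0; // assumption: the counter starts at zero
        foreach (var r in rows)
        {
            if (duplicateDates.Contains(r.SampleDate))
            {
                r.SampleDate = r.SampleDate.Value.AddMilliseconds(index++);
            }
        }
    }

    public static void Main()
    {
        var date = new DateTime(2020, 1, 1);
        var rows = new List<ReportInform>
        {
            new ReportInform { DataValue = 1.0, SampleDate = date },
            new ReportInform { DataValue = 2.0, SampleDate = date },
            new ReportInform { DataValue = 3.0, SampleDate = date.AddDays(1) },
        };

        Disambiguate(rows);

        foreach (var r in rows)
            Console.WriteLine($"{r.SampleDate:O} {r.DataValue}");
    }
}
```

Rows with a null SampleDate are left untouched, because null keys are filtered out before the set is built.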

List<T> extension method First, Second, Third....Nth

I want to access the first, second, third elements in a list. I can use built in .First() method for accessing first element.
My code is as follows:
Dictionary<int, Tuple<int, int>> pList = new Dictionary<int, Tuple<int, int>>();
var categoryGroups = pList.Values.GroupBy(t => t.Item1);
var highestCount = categoryGroups
.OrderByDescending(g => g.Count())
.Select(g => new { Category = g.Key, Count = g.Count() })
.First();
var 2ndHighestCount = categoryGroups
.OrderByDescending(g => g.Count())
.Select(g => new { Category = g.Key, Count = g.Count() })
.GetNth(1);
var 3rdHighestCount = categoryGroups
.OrderByDescending(g => g.Count())
.Select(g => new { Category = g.Key, Count = g.Count() })
.GetNth(2);
twObjClus.WriteLine("--------------------Cluster Label------------------");
twObjClus.WriteLine("\n");
twObjClus.WriteLine("Category:{0} Count:{1}",
highestCount.Category, highestCount.Count);
twObjClus.WriteLine("\n");
twObjClus.WriteLine("Category:{0} Count:{1}",
2ndHighestCount.Category, 2ndHighestCount.Count);
// Error here i.e. "Can't use 2ndHighestCount.Category here"
twObjClus.WriteLine("\n");
twObjClus.WriteLine("Category:{0} Count:{1}",
3rdHighestCount.Category, 3rdHighestCount.Count);
// Error here i.e. "Can't use 3rdHighestCount.Category here"
twObjClus.WriteLine("\n");
I have written extension method GetNth() as:
public static IEnumerable<T> GetNth<T>(this IEnumerable<T> list, int n)
{
if (n < 0)
throw new ArgumentOutOfRangeException("n");
if (n > 0){
int c = 0;
foreach (var e in list){
if (c % n == 0)
yield return e;
c++;
}
}
}
Can I write extension methods as .Second(), .Third() similar to
built in method .First() to access second and third indices?
If what you're looking for is a single object, you don't need to write it yourself, because a built-in method for that already exists.
foo.ElementAt(1)
will get you the second element, etc. It works similarly to First and returns a single object.
Your GetNth method seems to be returning every Nth element, instead of just the element at index N. I'm assuming that's not what you want since you said you wanted something similar to First.
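If you still prefer the readability of .Second() and .Third(), they can be written as thin wrappers over ElementAt. The sketch below is only an illustration: OrdinalExtensions, Second and Third are hypothetical names, not part of the framework.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helpers: thin wrappers over the built-in ElementAt.
public static class OrdinalExtensions
{
    // Both throw ArgumentOutOfRangeException if the sequence is too short.
    public static T Second<T>(this IEnumerable<T> source) => source.ElementAt(1);
    public static T Third<T>(this IEnumerable<T> source) => source.ElementAt(2);
}

public static class Program
{
    public static void Main()
    {
        var counts = new[] { 42, 17, 8 };

        Console.WriteLine(counts.First());   // 42
        Console.WriteLine(counts.Second());  // 17
        Console.WriteLine(counts.Third());   // 8

        // ElementAtOrDefault returns default(T) instead of throwing
        // when the index is out of range.
        Console.WriteLine(counts.ElementAtOrDefault(5)); // 0
    }
}
```

Each wrapper returns a single element, so properties such as Category and Count are available on the result, which is not the case when the method returns IEnumerable<T>.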
Since @Eser gave up and doesn't want to post the correct way as an answer, here goes:
You should rather do the transforms once, collect the results into an array, and then get the three elements from that. The way you're doing it right now results in code duplication as well as grouping and ordering being done multiple times, which is inefficient.
var highestCounts = pList.Values
.GroupBy(t => t.Item1)
.OrderByDescending(g => g.Count())
.Select(g => new { Category = g.Key, Count = g.Count() })
.Take(3)
.ToArray();
// highestCounts[0] is the first count
// highestCounts[1] is the second
// highestCounts[2] is the third
// make sure to handle cases where there are fewer than 3 items!
As an FYI, if you some day need just the Nth value and not the top three, you can use .ElementAt to access values at an arbitrary index.

Can this query about finding missing keys be improved? (either SQL or LINQ)

I am developing an ASP.NET MVC website and am looking for a way to improve this routine. It can be improved at either the LINQ level or the SQL Server level. Ideally, it would run as a single query call.
Here are the tables involved and some example data:
We have no constraint that every Key must have a value for each LanguageId, and indeed the business logic does not allow such a constraint. However, at the application level, we want to warn the admin when a key is missing one or more language values. So I have this class and query:
public class LocalizationKeyWithMissingCodes
{
public string Key { get; set; }
public IEnumerable<string> MissingCodes { get; set; }
}
This method gets the Key list, as well as any missing codes (for example, if we have the en + jp + ch language codes and the key only has values for en + ch, the list will contain jp):
public IEnumerable<LocalizationKeyWithMissingCodes> GetAllKeysWithMissingCodes()
{
var languageList = Utils.ResolveDependency<ILanguageRepository>().GetActive();
var languageIdList = languageList.Select(q => q.Id);
var languageIdDictionary = languageList.ToDictionary(q => q.Id);
var keyList = this.GetActive()
.Select(q => q.Key)
.Distinct();
var result = new List<LocalizationKeyWithMissingCodes>();
foreach (var key in keyList)
{
// Get missing codes
var existingCodes = this.Get(q => q.Active && q.Key == key)
.Select(q => q.LanguageId);
// ToList to make sure it is processed at application
var missingLangId = languageList.Where(q => !existingCodes.Contains(q.Id))
.ToList();
result.Add(new LocalizationKeyWithMissingCodes()
{
Key = key,
MissingCodes = missingLangId
.Select(q => languageIdDictionary[q.Id].Code),
});
}
result = result.OrderByDescending(q => q.MissingCodes.Count() > 0)
.ThenBy(q => q.Key)
.ToList();
return result;
}
I think my current solution is not good, because it makes a query call for each key. Is there a way to improve it, either by making it faster or by packing it into one query call?
EDIT: This is the final query of the answer:
public IQueryable<LocalizationKeyWithMissingCodes> GetAllKeysWithMissingCodes()
{
var languageList = Utils.ResolveDependency<ILanguageRepository>().GetActive();
var localizationList = this.GetActive();
return localizationList
.GroupBy(q => q.Key, (key, items) => new LocalizationKeyWithMissingCodes()
{
Key = key,
MissingCodes = languageList
.GroupJoin(
items,
lang => lang.Id,
loc => loc.LanguageId,
(lang, loc) => loc.Any() ? null : lang)
.Where(q => q != null)
.Select(q => q.Code)
}).OrderByDescending(q => q.MissingCodes.Count() > 0) // Show the missing keys on the top
.ThenBy(q => q.Key);
}
Another possibility, using LINQ:
public IEnumerable<LocalizationKeyWithMissingCodes> GetAllKeysWithMissingCodes(
List<Language> languages,
List<Localization> localizations)
{
return localizations
.GroupBy(x => x.Key, (key, items) => new LocalizationKeyWithMissingCodes
{
Key = key,
MissingCodes = languages
.GroupJoin( // check if there is one or more match for each language
items,
x => x.Id,
y => y.LanguageId,
(x, ys) => ys.Any() ? null : x)
.Where(x => x != null) // eliminate all languages with a match
.Select(x => x.Code) // grab the code
})
.Where(x => x.MissingCodes.Any()); // eliminate all complete keys
}
Here is the SQL logic to identify the keys that are missing "complete" language assignments:
-- ALL is a reserved keyword in T-SQL, so the derived table is aliased as pairs
SELECT
pairs.[Key],
pairs.LanguageId
FROM
(
SELECT
loc.[Key],
lang.LanguageId
FROM
Language lang
FULL OUTER JOIN
Localization loc
ON (1 = 1)
WHERE
lang.Active = 1
) pairs
LEFT JOIN
Localization loc
ON (loc.[Key] = pairs.[Key])
AND (loc.LanguageId = pairs.LanguageId)
WHERE
loc.[Key] IS NULL;
To see all keys (instead of filtering):
-- ALL is a reserved keyword in T-SQL, so the derived table is aliased as pairs
SELECT
pairs.[Key],
pairs.LanguageId,
CASE WHEN loc.[Key] IS NULL THEN 1 ELSE 0 END AS Flagged
FROM
(
SELECT
loc.[Key],
lang.LanguageId
FROM
Language lang
FULL OUTER JOIN
Localization loc
ON (1 = 1)
WHERE
lang.Active = 1
) pairs
LEFT JOIN
Localization loc
ON (loc.[Key] = pairs.[Key])
AND (loc.LanguageId = pairs.LanguageId);
Your code seems to be doing a lot of database querying and materialization.
In terms of LINQ, the single query would look like this.
We take the Cartesian product of the language and localization tables to get all combinations of (key, code), then subtract the (key, code) tuples that actually exist. This gives us the (key, code) combinations that don't exist.
var result = context.Languages.Join(context.Localizations, lang => true,
loc => true, (lang, loc) => new { Key = loc.Key, Code = lang.Code })
.Except(context.Languages.Join(context.Localizations, lang => lang.Id,
loc => loc.LanguageId, (lang, loc) => new { Key = loc.Key, Code = lang.Code }))
.GroupBy(r => r.Key).Select(r => new LocalizationKeyWithMissingCodes
{
Key = r.Key,
MissingCodes = r.Select(kc => kc.Code).ToList()
})
.ToList()
.OrderByDescending(lkmc => lkmc.MissingCodes.Count())
.ThenBy(lkmc => lkmc.Key).ToList();
P.S. I typed this LINQ query on the go, so let me know if it has syntax issues.
The gist of the query is that we take a Cartesian product and subtract the matching rows.
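To make the "Cartesian product minus existing pairs" idea concrete, here is a small in-memory sketch using LINQ to Objects. The Language and Localization classes are simplified assumptions (the real entities are not shown), and MissingCodes.Find is a made-up name. Whether a query of this shape translates well to SQL depends on the provider, so treat it as an illustration of the logic only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified, assumed shapes of the two tables.
public class Language
{
    public int Id;
    public string Code;
}

public class Localization
{
    public string Key;
    public int LanguageId;
}

public static class MissingCodes
{
    // Returns, for each key that is missing at least one language,
    // the list of missing language codes.
    public static Dictionary<string, List<string>> Find(
        List<Language> languages, List<Localization> localizations)
    {
        // The (key, language) pairs that actually exist.
        var existing = new HashSet<(string Key, int LanguageId)>(
            localizations.Select(l => (l.Key, l.LanguageId)));

        // Cartesian product of distinct keys and languages, minus existing pairs.
        return (from key in localizations.Select(l => l.Key).Distinct()
                from lang in languages
                where !existing.Contains((key, lang.Id))
                group lang.Code by key)
               .ToDictionary(g => g.Key, g => g.ToList());
    }

    public static void Main()
    {
        var languages = new List<Language>
        {
            new Language { Id = 1, Code = "en" },
            new Language { Id = 2, Code = "jp" },
            new Language { Id = 3, Code = "ch" },
        };
        var localizations = new List<Localization>
        {
            new Localization { Key = "hello", LanguageId = 1 },
            new Localization { Key = "hello", LanguageId = 3 },
            new Localization { Key = "bye", LanguageId = 1 },
            new Localization { Key = "bye", LanguageId = 2 },
            new Localization { Key = "bye", LanguageId = 3 },
        };

        // Prints: hello: jp
        foreach (var pair in Find(languages, localizations))
            Console.WriteLine($"{pair.Key}: {string.Join(", ", pair.Value)}");
    }
}
```

Keys that have every language (such as "bye" above) do not appear in the result at all, which matches the filtering variant rather than the "show all keys" variant.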

NRules: match a collection

I'm trying to figure out the BRE NRules and got some examples working, but I'm having a hard time matching a collection.
IEnumerable<Order> orders = null;
When()
.Match<IEnumerable<Order>>(o => o.Where(c => c.Cancelled).Count() >= 3)
.Collect<Order>(() => orders, o => o.Cancelled);
Then()
.Do(ctx => orders.ToList().ForEach(o => o.DoSomething()));
Basically, what I want is: if there are 3 cancelled orders, then do some action. But I can't seem to get a match on a collection; single variables do work.
The program:
var order3 = new Order(123458, customer, 2, 20.0);
var order4 = new Order(123459, customer, 1, 10.0);
var order5 = new Order(123460, customer, 1, 11.0);
order3.Cancelled = true;
order4.Cancelled = true;
order5.Cancelled = true;
session.Insert(order3);
session.Insert(order4);
session.Insert(order5);
session.Fire();
What am I doing wrong here?
With the 0.3.1 version of NRules, the following will activate the rule when you collected 3 or more canceled orders:
IEnumerable<Order> orders = null;
When()
.Collect<Order>(() => orders, o => o.Cancelled)
.Where(x => x.Count() >= 3);
Then()
.Do(ctx => orders.ToList().ForEach(o => o.DoSomething()));
Update:
For posterity, starting with version 0.4.x the right syntax is to use reactive LINQ. Matching a collection will look like this:
IEnumerable<Order> orders = null;
When()
.Query(() => orders, q => q
.Match<Order>(o => o.Cancelled)
.Collect()
.Where(x => x.Count() >= 3));
Then()
.Do(ctx => DoSomething(orders));
In your example, it should be pretty straightforward
IEnumerable<Order> orders = null;
When()
.Collect<Order>(() => orders, o => o.Cancelled == true);
Then()
.Do(ctx => orders.ToList().ForEach(o => o.DoSomething()));
I think the problem is using o.Cancelled on its own, without the == true. I know this sounds odd, but a bare property access that is not a full expression does not seem to be well supported in NRules.
I ran into this problem, and when I added the == true everything fell into place.
How can I join multiple collections based on some expression, like this?
IEnumerable<RawMsp> rawMsps = null;
IEnumerable<AsmMasterView> asmMasterViews = null;
IEnumerable<AsmInvestor> asmInvestors = null;
When()
.Match<AsmInvestor>(() => rawMsps)
.Match<AsmInvestor>(() => asmInvestor, i => i.InvestorId.ToString() == rawMsp.INVESTOR_CODE)
.Match<AsmMasterView>(() => asmMasterView, x => x.ChildAssumptionHistId == asmInvestor.AssumptionHistId);
Match applies to individual objects; I'm not sure how to apply an equality condition across enumerable collections.

The type of the arguments cannot be inferred usage Linq GroupJoin

I'm trying to make a LINQ GroupJoin, and I receive the aforementioned error. This is the code:
public Dictionary<string, List<QuoteOrderline>> GetOrderlines(List<string> quoteNrs)
{
var quoteHeadersIds = portalDb.nquote_orderheaders
.Where(f => quoteNrs.Contains(f.QuoteOrderNumber))
.Select(f => f.ID).ToList();
List<nquote_orderlines> orderlines = portalDb.nquote_orderlines
.Where(f => quoteHeadersIds.Contains(f.QuoteHeaderID))
.ToList();
var toRet = quoteNrs
.GroupJoin(orderlines, q => q, o => o.QuoteHeaderID, (q => o) => new
{
quoteId = q,
orderlines = o.Select(g => new QuoteOrderline()
{
Description = g.Description,
ExtPrice = g.UnitPrice * g.Qty,
IsInOrder = g.IsInOrder,
PartNumber = g.PartNo,
Price = g.UnitPrice,
ProgramId = g.ProgramId,
Quantity = (int)g.Qty,
SKU = g.SKU
}).ToList()
});
}
I suspect this is the immediate problem:
(q => o) => new { ... }
I suspect you meant:
(q, o) => new { ... }
In other words, "here's a function taking a query and an order, and returning an anonymous type". The first syntax simply doesn't make sense - even thinking about higher ordered functions, you'd normally have q => o => ... rather than (q => o) => ....
Now that won't be enough on its own, because GroupJoin doesn't return a dictionary. (Indeed, you don't even have a return statement yet.) You'll need a ToDictionary call after that. Alternatively, it may well be more appropriate to return an ILookup<string, QuoteOrderline> via ToLookup.
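As a rough sketch of the GroupJoin followed by ToDictionary shape, using simplified in-memory types: OrderLine, QuoteNumber and QuoteLines.ByQuote are hypothetical stand-ins for the question's entities, and the join keys are both strings here. Note that the two key selectors passed to GroupJoin must produce the same type; if they do not, type inference fails with an error like the one in the title.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-in for the order line entity.
public class OrderLine
{
    public string QuoteNumber;
    public string PartNumber;
}

public static class QuoteLines
{
    public static Dictionary<string, List<OrderLine>> ByQuote(
        List<string> quoteNrs, List<OrderLine> lines)
    {
        return quoteNrs
            .Distinct() // ToDictionary throws on duplicate keys
            .GroupJoin(
                lines,
                q => q,                 // outer key: the quote number itself
                o => o.QuoteNumber,     // inner key: must have the same type
                (q, os) => new { Quote = q, Lines = os.ToList() })
            .ToDictionary(x => x.Quote, x => x.Lines);
    }

    public static void Main()
    {
        var quoteNrs = new List<string> { "Q1", "Q2" };
        var lines = new List<OrderLine>
        {
            new OrderLine { QuoteNumber = "Q1", PartNumber = "P-100" },
            new OrderLine { QuoteNumber = "Q1", PartNumber = "P-200" },
        };

        // Prints:
        //   Q1: 2 line(s)
        //   Q2: 0 line(s)
        foreach (var pair in ByQuote(quoteNrs, lines))
            Console.WriteLine($"{pair.Key}: {pair.Value.Count} line(s)");
    }
}
```

Unlike a plain Join, GroupJoin keeps quotes that have no order lines, so "Q2" maps to an empty list.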
