Use Group By in order to remove duplicates - c#

I am looking for a simple way of removing duplicates without having to implement IComparable, override GetHashCode, etc.
I think this can be achieved with LINQ. I have the class:
class Person
{
public string Name;
public int Age;
}
I have a list of about 500 people: List<Person> someList = new List<Person>();
Now I want to remove people with the same name, and if there is a duplicate I want to keep the person with the greater age. In other words, if I have the list:
Name----Age---
Tom, 24 |
Alicia, 22 |
Alicia, 12 |
I would like to end up with:
Name----Age---
Tom, 24 |
Alicia, 22 |
How can I do this with a query? My list is not that long, so I don't want to create a hash set or implement the IComparable interface. It would be nice if I could do this with a LINQ query.
I think this can be done with the GroupBy extension method by doing something like:
var people = // the list of Person
people.GroupBy(x => x.Name).Where(x => x.Count() > 1)
... // select the person that has the greatest age...

people
.GroupBy(p => p.Name)
.Select(g => g.OrderByDescending(p => p.Age).First())
This will work across different LINQ providers. If this is just LINQ to Objects, and speed is important (usually, it isn't), consider using one of the many MaxBy extensions found on the web (here's Skeet's) and replacing
g.OrderByDescending(p => p.Age).First()
with
g.MaxBy(p => p.Age)
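As a minimal sketch, the query above can be run against the sample data from the question. The Person class here uses properties instead of fields, and the DedupeDemo wrapper and its method names are hypothetical, used only for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

public class Person
{
    public string Name { get; set; }
    public int Age { get; set; }
}

public static class DedupeDemo
{
    // Keeps one person per name: the one with the greatest age.
    public static List<Person> OldestPerName(IEnumerable<Person> people)
    {
        return people
            .GroupBy(p => p.Name)
            .Select(g => g.OrderByDescending(p => p.Age).First())
            .ToList();
    }

    // The three rows from the question.
    public static List<Person> Sample()
    {
        return new List<Person>
        {
            new Person { Name = "Tom", Age = 24 },
            new Person { Name = "Alicia", Age = 22 },
            new Person { Name = "Alicia", Age = 12 },
        };
    }
}
```

Calling DedupeDemo.OldestPerName(DedupeDemo.Sample()) should leave Tom (24) and Alicia (22). In LINQ to Objects, groups come back in the order their keys first appear.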

This can be trivially easy so long as you first create a helper function MaxBy that is capable of selecting the item from a sequence whose key is largest. Unfortunately the Max function in LINQ won't work here, as we want to select the item from the sequence, not the selected value.
var distinctPeople = people.GroupBy(person => person.Name)
.Select(group => group.MaxBy(person => person.Age));
And then the implementation of MaxBy:
public static TSource MaxBy<TSource, TKey>(this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector, IComparer<TKey> comparer = null)
{
comparer = comparer ?? Comparer<TKey>.Default;
using (var iterator = source.GetEnumerator())
{
if (!iterator.MoveNext())
throw new ArgumentException("Source must have at least one item");
var maxItem = iterator.Current;
var maxKey = keySelector(maxItem);
while (iterator.MoveNext())
{
var nextKey = keySelector(iterator.Current);
if (comparer.Compare(nextKey, maxKey) > 0)
{
maxItem = iterator.Current;
maxKey = nextKey;
}
}
return maxItem;
}
}
Note that while you can achieve the same result by sorting the sequence and then taking the first item, doing so is less efficient in general than doing just one pass with a max function.
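On .NET 6 and later, System.Linq ships its own Enumerable.MaxBy, so a hand-written helper may not be needed there. The sketch below assumes .NET 6 or later; the tuple-based sample shape and the BuiltInMaxByDemo name are illustrative only.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class BuiltInMaxByDemo
{
    // Requires .NET 6 or later, where Enumerable.MaxBy is part of System.Linq.
    public static List<(string Name, int Age)> OldestPerName(
        IEnumerable<(string Name, int Age)> people)
    {
        return people
            .GroupBy(p => p.Name)
            .Select(g => g.MaxBy(p => p.Age)) // single pass over each group, no sorting
            .ToList();
    }
}
```

For example, passing the tuples ("Tom", 24), ("Alicia", 22) and ("Alicia", 12) should keep the first two.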

I prefer to be simple:
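var retPeople = new List<Person>();
foreach (var p in people)
{
if (!retPeople.Contains(p))
{
retPeople.Add(p);
}
}
This requires Person to define equality (override Equals and GetHashCode, or implement IEquatable<Person>), because Contains compares with Equals rather than IComparable. Note that it keeps the first matching person it encounters, not the oldest one.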

I got rid of my last answer because I realized it was too slow and too complicated. Here is a solution that makes a little more sense:
var peopleWithLargestAgeByName =
from p in people
orderby p.Age descending
group p by p.Name into peopleByName
select peopleByName.First();
This is the same solution that @spender contributed, but with the LINQ query syntax.

Related

C# How to split a List in two using LINQ [duplicate]

I am trying to split a List into two Lists using LINQ without iterating the 'master' list twice. One List should contain the elements for which the LINQ condition is true, and the other should contain all the other elements. Is this at all possible?
Right now I just use two LINQ queries, thus iterating the (huge) master List twice.
Here's the (pseudo) code I am using right now:
List<EventModel> events = GetAllEvents();
List<EventModel> openEvents = events.Where(e => e.Closer_User_ID == null).ToList();
List<EventModel> closedEvents = events.Where(e => e.Closer_User_ID != null).ToList();
Is it possible to yield the same results without iterating the original List twice?
You can use ToLookup extension method as follows:
List<Foo> items = new List<Foo> { new Foo { Name="A",Condition=true},new Foo { Name = "B", Condition = true },new Foo { Name = "C", Condition = false } };
var lookupItems = items.ToLookup(item => item.Condition);
var lstTrueItems = lookupItems[true];
var lstFalseItems = lookupItems[false];
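Applied to the question's events, the ToLookup approach might look like the sketch below. The EventModel class here is a hypothetical stand-in (the question does not show its definition), and LookupSplitDemo is an illustrative name.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the question's EventModel.
public class EventModel
{
    public int Id { get; set; }
    public int? Closer_User_ID { get; set; }
}

public static class LookupSplitDemo
{
    public static (List<EventModel> Open, List<EventModel> Closed) Split(
        IEnumerable<EventModel> events)
    {
        // ToLookup enumerates the source once and buckets each element by key.
        var byOpen = events.ToLookup(e => e.Closer_User_ID == null);

        // Indexing a lookup with a key that has no elements yields an empty
        // sequence rather than throwing, so no null checks are needed.
        return (byOpen[true].ToList(), byOpen[false].ToList());
    }
}
```

One design point worth noting: unlike the result of GroupBy, a lookup can be indexed directly by key, which is why no SingleOrDefault or null handling is required here.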
You can do this in one statement by converting it into a lookup table:
var splitTables = events.ToLookup(e => e.Closer_User_ID == null);
This will return a sequence of at most two elements, where every element is an IGrouping<bool, EventModel>. The Key says whether the group is the one with a null Closer_User_ID or not.
However this looks rather mystical. My advice would be to extend LINQ with a new function.
This function takes a sequence of any kind, and a predicate that divides the sequence into two groups: the group that matches the predicate and the group that doesn't match the predicate.
This way you can use the function to divide all kinds of IEnumerable sequences into two sequences.
See Extension methods demystified
public static IEnumerable<IGrouping<bool, TSource>> Split<TSource>(
this IEnumerable<TSource> source,
Func<TSource,bool> predicate)
{
return source.ToLookup(predicate);
}
Usage:
IEnumerable<Person> persons = ...
// divide the persons into adults and non-adults:
var result = persons.Split(person => person.IsAdult);
Result has two elements: the one with Key true has all Adults.
Although usage has now become easier to read, you still have the problem that the complete sequence is processed, while in fact you might only want to use a few of the resulting items.
Let's return an IEnumerable<KeyValuePair<bool, TSource>>, where the Boolean value indicates whether the item matches or doesn't match:
public static IEnumerable<KeyValuePair<bool, TSource>> Audit<TSource>(
this IEnumerable<TSource> source,
Func<TSource,bool> predicate)
{
foreach (var sourceItem in source)
{
yield return new KeyValuePair<bool, TSource>(predicate(sourceItem), sourceItem);
}
}
Now you get a sequence, where every element says whether it matches or not. If you only need a few of them, the rest of the sequence is not processed:
IEnumerable<EventModel> eventModels = ...
EventModel firstOpenEvent = eventModels.Audit(e => e.Closer_User_ID == null)
.Where(splitEvent => splitEvent.Key)
.Select(splitEvent => splitEvent.Value)
.FirstOrDefault();
The Where says that you only want those audited items that passed auditing (Key is true).
Because you only need the first element, the rest of the sequence is not audited.
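The claim that the rest of the sequence is not audited can be checked by counting predicate calls. The following is a sketch only: it restates the Audit idea so the block compiles on its own, and AuditDemo and FirstEven are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AuditExtensions
{
    // Same idea as the Audit method above: pair each item with the predicate result.
    public static IEnumerable<KeyValuePair<bool, TSource>> Audit<TSource>(
        this IEnumerable<TSource> source, Func<TSource, bool> predicate)
    {
        foreach (var item in source)
            yield return new KeyValuePair<bool, TSource>(predicate(item), item);
    }
}

public static class AuditDemo
{
    // Returns the first even number and how many items were actually tested.
    public static (int FirstMatch, int PredicateCalls) FirstEven(IEnumerable<int> numbers)
    {
        int calls = 0;
        int first = numbers
            .Audit(n => { calls++; return n % 2 == 0; })
            .Where(pair => pair.Key)
            .Select(pair => pair.Value)
            .FirstOrDefault();
        return (first, calls);
    }
}
```

For the input 1, 3, 4, 5, 6 the predicate runs three times (for 1, 3 and 4) and enumeration stops at the first match.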
GroupBy and Single should accomplish what you're looking for:
var groups = events.GroupBy(e => e.Closer_User_ID == null).ToList(); // As others mentioned this needs to be materialized to prevent `events` from being iterated twice.
var openEvents = groups.SingleOrDefault(grp => grp.Key == true)?.ToList() ?? new List<EventModel>();
var closedEvents = groups.SingleOrDefault(grp => grp.Key == false)?.ToList() ?? new List<EventModel>();
One line solution by using ForEach method of List:
List<EventModel> events = GetAllEvents();
List<EventModel> openEvents = new List<EventModel>();
List<EventModel> closedEvents = new List<EventModel>();
events.ForEach(x => (x.Closer_User_ID == null ? openEvents : closedEvents).Add(x));
You can do without LINQ. Switch to conventional loop approach.
List<EventModel> openEvents = new List<EventModel>();
List<EventModel> closedEvents = new List<EventModel>();
foreach(var e in events)
{
if(e.Closer_User_ID == null)
{
openEvents.Add(e);
}
else
{
closedEvents.Add(e);
}
}
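The loop above can be wrapped in a reusable helper. This is a minimal sketch: Partition is a hypothetical extension method, not part of LINQ, and it assumes C# 7 or later for tuple return values.

```csharp
using System;
using System.Collections.Generic;

public static class PartitionExtensions
{
    // Hypothetical helper: splits a sequence into (matches, rest) in one pass.
    public static (List<T> Matches, List<T> Rest) Partition<T>(
        this IEnumerable<T> source, Func<T, bool> predicate)
    {
        var matches = new List<T>();
        var rest = new List<T>();
        foreach (var item in source)
        {
            // Each element is tested exactly once and lands in exactly one list.
            (predicate(item) ? matches : rest).Add(item);
        }
        return (matches, rest);
    }
}
```

With this in place, the question's case could be written as var (openEvents, closedEvents) = events.Partition(e => e.Closer_User_ID == null);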

How to return a specific item in Distinct using EqualityComparer in C#

I have defined a CustomListComparer which compares List<int> A and List<int> B and, if the union of the two lists equals at least one of the lists, considers them equal.
var distinctLists = MyLists.Distinct(new CustomListComparer()).ToList();
public bool Equals(Frame other)
{
var union = CustomList.Union(other.CustomList).ToList();
return union.SequenceEqual(CustomList) ||
union.SequenceEqual(other.CustomList);
}
For example, the below lists are equal:
ListA = {1,2,3}
ListB = {1,2,3,4}
And the below lists are NOT:
ListA = {1,5}
ListB = {1,2,3,4}
Now all this works fine. But here is my question: Which one of the Lists (A or B) gets into distinctLists? Do I have any say in that? Or is it all handled by compiler itself?
What I mean is say that the EqualityComparer considers both of the Lists equal. and adds one of them to distinctLists. Which one does it add? I want the list with more items to be added.
Distinct always adds the first element it sees, so the result depends on the order of the sequence you pass in.
The source is fairly simple and can be found here:
static IEnumerable<TSource> DistinctIterator<TSource>(IEnumerable<TSource> source, IEqualityComparer<TSource> comparer) {
Set<TSource> set = new Set<TSource>(comparer);
foreach (TSource element in source)
if (set.Add(element)) yield return element;
}
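The first-seen behaviour can be observed with any comparer that treats different values as equal. The sketch below uses the built-in StringComparer.OrdinalIgnoreCase in place of the question's list comparer; DistinctOrderDemo is a hypothetical name. It reflects the LINQ to Objects implementation shown above rather than a documented ordering guarantee.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DistinctOrderDemo
{
    // With a case-insensitive comparer, "apple" and "APPLE" are equal,
    // so whichever comes first in the input is the one that is kept.
    public static List<string> DistinctIgnoreCase(IEnumerable<string> words)
    {
        return words.Distinct(StringComparer.OrdinalIgnoreCase).ToList();
    }
}
```

Reordering the input changes which representative survives, which is why sorting by size before calling Distinct is one way to influence the outcome.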
If you need to return the list with more elements, you need to roll your own. It is worth noting that Distinct is lazy, but the behaviour you're asking for needs an eager implementation.
static class MyDistinctExtensions
{
public static IEnumerable<T> DistinctMaxElements<T>(this IEnumerable<T> source, IEqualityComparer<T> comparer) where T : ICollection
{
Dictionary<T, List<T>> dictionary = new Dictionary<T, List<T>>(comparer);
foreach (var item in source)
{
List<T> list;
if (!dictionary.TryGetValue(item, out list))
{
list = new List<T>();
dictionary.Add(item, list);
}
list.Add(item);
}
foreach (var list in dictionary.Values)
{
yield return list.Select(x => new { List = x, Count = x.Count })
.OrderByDescending(x => x.Count)
.First().List;
}
}
}
Updated the answer with a naive implementation; not tested, though.
Instead of Distinct you can use GroupBy with a MaxBy method:
var distinctLists = MyLists.GroupBy(x => x, new CustomListComparer())
.Select(g => g.MaxBy(x => x.Count))
.ToList();
This will group the lists using your comparer and select the list that has the most items from each group.
MaxBy is quite useful in this situation; you can find it in the MoreLINQ library.
Edit: Using pure LINQ:
var distinctLists = MyLists.GroupBy(x => x, new CustomListComparer())
.Select(g => g.First(x => x.Count == g.Max(l => l.Count)))
.ToList();
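The pure LINQ version recomputes g.Max for every element it tests. As an alternative sketch, Aggregate can pick the longest list of each group in a single pass. Grouping by the first element is only a stand-in here for the question's CustomListComparer, which is not shown in full; LongestPerGroupDemo is a hypothetical name, and the lists are assumed to be non-empty.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class LongestPerGroupDemo
{
    // Groups lists by their first element (a stand-in for the custom comparer)
    // and keeps the longest list of each group using one Aggregate pass.
    public static List<List<int>> LongestPerGroup(IEnumerable<List<int>> lists)
    {
        return lists
            .GroupBy(list => list[0])
            .Select(g => g.Aggregate((best, next) => next.Count > best.Count ? next : best))
            .ToList();
    }
}
```

Because every group produced by GroupBy has at least one element, the seedless Aggregate overload does not throw here. On ties it keeps the earlier list.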

Sort by name and then group by some boolean relationship? (.NET)

I have a list of named objects:
class NamedObject {
public string name;
public int value;
public NamedObject(string name, int value) {
this.name = name;
this.value = value;
}
}
...
public static bool HasRelationship(NamedObject a, NamedObject b) {
return a.value == b.value;
}
...
var objs = new List<NamedObject>();
objs.Add(new NamedObject("D", 1));
objs.Add(new NamedObject("Z", 2));
objs.Add(new NamedObject("Y", 3));
objs.Add(new NamedObject("A", 2));
objs.Add(new NamedObject("C", 1));
objs.Add(new NamedObject("Z", 1));
which I would like to sort by name, and then sub-sort by a boolean relationship. For the purposes of this example the boolean relationship is a.value == b.value.
Output List:
A (2)
Z (2)
C (1)
D (1)
Z (1)
Y (3)
So sort by name, group by boolean relationship, sort sub-group by name.
Edit:
The above is a simplification of the actual sorting, in my application the HasRelationship function determines whether two orientations have symmetry. The orientations are named so that they appear in a logical order within the editor interface.
Here is a visualisation:
http://pbrd.co/16okFxp
I'm confused by the question, so I will try to be as clear as possible.
First, it seems you want to sort NamedObjects by name. That's the clear and easy part: an OrderBy should do the work.
Then you want to partition the list by an arbitrary predicate based on a pair of NamedObjects. I think that's where the confusion arises.
The predicate you provide determines a property of pairs of NamedObjects, so now you are dealing with pairs. There's no unique answer to this question.
I understand you want to partition pairs by the predicate, but you must understand that with a boolean partition you will only have two partitions (the relation is true or not), and no guaranteed order of values within a partition.
So at most you could get (sorted by name on the first term of the pair):
pair(A,Z)(true)
pair(A,C)(false)
pair(A,D)(false)
...
pair(C,D)(true)
...
The point is that you can't order by a pair relationship without implicitly dealing with pairs. So to give you an answer I will assume:
The relation may not be symmetric
You want to sort by the name of the first term of the pair
Within this context, an answer could be as follows. First get the pairs:
var namedPairs = namedObjects.SelectMany(outerNamedObject =>
namedObjects.Select(innerNamedObject => new
{
First = outerNamedObject,
Second = innerNamedObject
}));
Then we do the grouping
var partitionedNamedPairs = namedPairs.GroupBy(pair =>
HasRelationship(pair.First, pair.Second));
After that, sort by the first term's name and then by the group key (the relation partition):
var result = partitionedNamedPairs.SelectMany(
grouping => grouping.Select(pair => new { pair, key = grouping.Key }))
.OrderBy(keyedPair => keyedPair.pair.First.name)
.ThenBy(keyedPair => keyedPair.key);
You could then use Select to remove the second term of the pair, but I don't see the point of that, because the predicate you provided is binary.
I think you should join your list by itself since your HasRelationship method needs two objects.
var result = objs.OrderBy(x => x.name)
.Join(objs, _ => true, _ => true, (l, r) => new { l, r, rel = HasRelationship(l, r) })
.Where(x => x.rel)
.SelectMany(x=>new []{x.l,x.r})
.Distinct()
.ToList();
Although this returns the list you expect, I can not say I understand your requirements clearly.
The following solution is fully commented and hopefully will help future readers of this question to understand the sorting process which was required.
The answer by @QtX was nice and concise, though it seems that people were having difficulty understanding what I was actually requesting, so sorry about that, guys!
Usage Example:
var sortedObjs = objs.SortAndGroupByRelationship(obj => obj.name, HasRelationship);
Extension method for sort and group:
public static IEnumerable<T> SortAndGroupByRelationship<T, TKey>(this IEnumerable<T> objs, Func<T, TKey> keySelector, Func<T, T, bool> relationship) where TKey : IComparable<TKey> {
// Group items which are related.
var groups = new List<List<T>>();
foreach (var obj in objs) {
bool grouped = false;
// Attempt to place named object into an existing group.
foreach (var group in groups)
if (relationship(obj, group[0])) {
group.Add(obj);
grouped = true;
break;
}
// Create new group for named object.
if (!grouped) {
var newGroup = new List<T>();
newGroup.Add(obj);
groups.Add(newGroup);
}
}
// Sort objects within each group by name.
foreach (var group in groups)
group.Sort( (a, b) => keySelector(a).CompareTo(keySelector(b)) );
// Sort groups by name.
groups.Sort( (a, b) => keySelector(a[0]).CompareTo(keySelector(b[0])) );
// Flatten groups into the resulting list.
var sortedList = new List<T>();
foreach (var group in groups)
sortedList.AddRange(group);
return sortedList;
}
var sorted = objs.GroupBy(x => x.value, (k, g) => g.OrderBy(x => x.name))
.OrderBy(g => g.First().name)
.SelectMany(g => g);
Returns exactly what you want, without using HasRelationship method.
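To check that claim, here is a self-contained sketch that runs the grouping query on the question's sample data. It materializes each group with ToList, a small change from the snippet above so that groups are not re-sorted on every access; GroupSortDemo and SortedLabels are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;

public class NamedObject
{
    public string name;
    public int value;
    public NamedObject(string name, int value)
    {
        this.name = name;
        this.value = value;
    }
}

public static class GroupSortDemo
{
    public static List<string> SortedLabels()
    {
        var objs = new List<NamedObject>
        {
            new NamedObject("D", 1), new NamedObject("Z", 2), new NamedObject("Y", 3),
            new NamedObject("A", 2), new NamedObject("C", 1), new NamedObject("Z", 1),
        };

        return objs
            // Group by value and sort the members of each group by name.
            .GroupBy(x => x.value, (k, g) => g.OrderBy(x => x.name).ToList())
            // Order the groups by the name of their first (smallest) member.
            .OrderBy(g => g[0].name)
            .SelectMany(g => g)
            .Select(x => x.name + " (" + x.value + ")")
            .ToList();
    }
}
```

The result is A (2), Z (2), C (1), D (1), Z (1), Y (3), matching the output list in the question. This works because the example relationship is plain equality on value; a relationship that cannot be expressed as a grouping key would still need a pairwise approach.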

LINQ how to query if a value is between a list of ranges?

Let's say I have a Person record in a database, and there's an Age field for the person.
Now I have a page that allows me to filter for people in certain age ranges.
For example, I can choose multiple range selections, such as "0-10", "11-20", "31-40".
So in this case, I'd get back a list of people between 0 and 20, as well as 30 to 40, but not 21-30.
I've taken the age ranges and populated a List of ranges that looks like this:
class AgeRange
{
public int Min { get; set; }
public int Max { get; set; }
}
List<AgeRange> ageRanges = GetAgeRanges();
I am using LINQ to SQL for my database access and queries, but I can't figure out how to query the ranges.
I want to do something like this, but of course, this won't work since I can't query my local values against the SQL values:
var query = from person in db.People
where ageRanges.Where(ages => person.Age >= ages.Min && person.Age <= ages.Max).Any()
select person;
You could build the predicate dynamically with PredicateBuilder:
static Expression<Func<Person, bool>> BuildAgePredicate(IEnumerable<AgeRange> ranges)
{
var predicate = PredicateBuilder.False<Person>();
foreach (var r in ranges)
{
// To avoid capturing the loop variable
var r2 = r;
predicate = predicate.Or (p => p.Age >= r2.Min && p.Age <= r2.Max);
}
return predicate;
}
You can then use this method as follows:
var agePredicate = BuildAgePredicate(ageRanges);
var query = db.People.Where(agePredicate);
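PredicateBuilder is not part of the .NET base class library (it is commonly taken from the LINQKit project). If adding it is not an option, the same OR-of-ranges predicate can be assembled directly with System.Linq.Expressions. The sketch below is untested against LINQ to SQL; the Person and AgeRange classes are minimal stand-ins and AgeFilter is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq.Expressions;

public class Person
{
    public string Name { get; set; }
    public int Age { get; set; }
}

public class AgeRange
{
    public int Min { get; set; }
    public int Max { get; set; }
}

public static class AgeFilter
{
    // Builds p => (p.Age >= min1 && p.Age <= max1) || (p.Age >= min2 && ...) ...
    public static Expression<Func<Person, bool>> Build(IEnumerable<AgeRange> ranges)
    {
        ParameterExpression p = Expression.Parameter(typeof(Person), "p");
        MemberExpression age = Expression.Property(p, nameof(Person.Age));

        Expression body = null;
        foreach (var r in ranges)
        {
            // One range: both bounds must hold.
            Expression inRange = Expression.AndAlso(
                Expression.GreaterThanOrEqual(age, Expression.Constant(r.Min)),
                Expression.LessThanOrEqual(age, Expression.Constant(r.Max)));

            // Ranges are alternatives, so they are combined with OR.
            body = body == null ? inRange : Expression.OrElse(body, inRange);
        }

        // With no ranges at all, nothing matches.
        return Expression.Lambda<Func<Person, bool>>(body ?? Expression.Constant(false), p);
    }
}
```

Usage would mirror the answer above: db.People.Where(AgeFilter.Build(ageRanges)). Because every range shares the same parameter expression, no Invoke nodes are needed.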
As one of your errors mentioned you can only use a local sequence with the 'Contains' method. One option would then be to create a list of all allowed ages like so:
var ages = ageRanges
.Aggregate(new List<int>() as IEnumerable<int>, (acc, x) =>
acc.Union(Enumerable.Range(x.Min,x.Max - (x.Min - 1)))
);
Then you can call:
People.Where(x => ages.Contains(x.Age))
A word of caution: if your ranges are large, this will fail!
(This works well for small ranges, since the number of accepted ages will probably never exceed 100, but beyond that both of the commands above become very expensive.)
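The same list of allowed ages can be built a little more directly with SelectMany and Distinct. This is a sketch under the assumption that every range has Max >= Min; AgeExpansion is a hypothetical name and AgeRange is restated so the block compiles on its own.

```csharp
using System.Collections.Generic;
using System.Linq;

public class AgeRange
{
    public int Min { get; set; }
    public int Max { get; set; }
}

public static class AgeExpansion
{
    // Expands each inclusive range into its individual ages and removes overlaps.
    public static List<int> AllowedAges(IEnumerable<AgeRange> ranges)
    {
        return ranges
            .SelectMany(r => Enumerable.Range(r.Min, r.Max - r.Min + 1))
            .Distinct()
            .ToList();
    }
}
```

The resulting list can then be used with Contains as shown above. The same size caveat applies, since each allowed age becomes one value in the generated query.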
Thanks to Thomas' answer, I was able to create this more generic version that seems to be working:
static IQueryable<T> Between<T>(this IQueryable<T> query, Expression<Func<T, decimal>> predicate, IEnumerable<NumberRange> ranges)
{
var exp = PredicateBuilder.False<T>();
foreach (var range in ranges)
{
// Both bounds of one range must hold, so combine them with AndAlso
// before OR-ing the range into the overall predicate.
var inRange = Expression.Lambda<Func<T, bool>>(
Expression.AndAlso(
Expression.GreaterThanOrEqual(predicate.Body, Expression.Constant(range.Min)),
Expression.LessThanOrEqual(predicate.Body, Expression.Constant(range.Max))),
predicate.Parameters);
exp = exp.Or(inRange);
}
return query.Where(exp);
}
A much simpler implementation is to use Age.CompareTo().
I had a similar problem and solved it using CompareTo.
In a database of houses, I wanted to find houses with a price between min and max:
var houses = from s in db.Homes.AsEnumerable()
select s;
houses = houses.Where(s => s.Price.CompareTo(max) <= 0 && s.Price.CompareTo(min) >= 0);

Get Non-Distinct elements from an IEnumerable

I have a class called Item. Item has an identifier property called ItemCode which is a string. I would like to get a list of all non-distinct Items in a list of Items.
Example:
List<Item> itemList = new List<Item>()
{
new Item("code1", "description1"),
new Item("code2", "description2"),
new Item("code2", "description3"),
};
I want a list containing the bottom two entries.
If I use
var distinctItems = itemList.Distinct();
I get the list of distinct items, which is great, but I want almost the opposite of that. I could subtract the distinct list from the original list, but that wouldn't contain ALL repeats, just one instance of each.
I've had a play and can't figure out an elegant solution. Any pointers or help would be much appreciated. Thanks!
I have 3.5 so LINQ is available
My take:
var distinctItems =
from list in itemList
group list by list.ItemCode into grouped
where grouped.Count() > 1
select grouped;
As an extension method:
public static IEnumerable<T> NonDistinct<T, TKey> (this IEnumerable<T> source, Func<T, TKey> keySelector)
{
return source.GroupBy(keySelector).Where(g => g.Count() > 1).SelectMany(r => r);
}
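Here is a short sketch of that extension method applied to the question's sample list. The Item class is a hypothetical reconstruction (the question only says it has an ItemCode and is built from two strings), and NonDistinctDemo is an illustrative name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical reconstruction of the question's Item class.
public class Item
{
    public string ItemCode { get; }
    public string Description { get; }

    public Item(string itemCode, string description)
    {
        ItemCode = itemCode;
        Description = description;
    }
}

public static class NonDistinctExtensions
{
    // Returns every element whose key occurs more than once in the source.
    public static IEnumerable<T> NonDistinct<T, TKey>(
        this IEnumerable<T> source, Func<T, TKey> keySelector)
    {
        return source.GroupBy(keySelector).Where(g => g.Count() > 1).SelectMany(g => g);
    }
}

public static class NonDistinctDemo
{
    public static List<string> RepeatedDescriptions()
    {
        var itemList = new List<Item>
        {
            new Item("code1", "description1"),
            new Item("code2", "description2"),
            new Item("code2", "description3"),
        };
        return itemList.NonDistinct(i => i.ItemCode).Select(i => i.Description).ToList();
    }
}
```

This returns the descriptions of the bottom two entries, description2 and description3. Since the key selector does the comparison, Item itself does not need to override Equals.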
You might want to try it with the group by operator. The idea is to group the items by ItemCode and take the groups with more than one member, something like:
var grouped = from i in itemList
group i by i.ItemCode into g
select new { Code = g.Key, Items = g };
var result = from g in grouped
where g.Items.Count() > 1
select g;
I'd suggest writing a custom extension method, something like this:
static class RepeatedExtension
{
public static IEnumerable<T> Repeated<T>(this IEnumerable<T> source)
{
var distinct = new Dictionary<T, int>();
foreach (var item in source)
{
if (!distinct.ContainsKey(item))
distinct.Add(item, 1);
else
{
if (distinct[item]++ == 1) // only yield items on first repeated occurrence
yield return item;
}
}
}
}
You also need to override the Equals() and GetHashCode() methods of your Item class, so that items are correctly compared (and hashed by the Dictionary) by their code.
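A minimal sketch of such an Item class follows. The property names and constructor are assumptions based on the question; equality and hashing use only ItemCode, so that a Dictionary or HashSet treats two items with the same code as the same key.

```csharp
using System;

// Hypothetical Item class whose identity is its ItemCode.
public class Item : IEquatable<Item>
{
    public string ItemCode { get; }
    public string Description { get; }

    public Item(string itemCode, string description)
    {
        ItemCode = itemCode;
        Description = description;
    }

    // Two items are equal when their codes match; the description is ignored.
    public bool Equals(Item other)
    {
        return !(other is null) && ItemCode == other.ItemCode;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as Item);
    }

    // Must agree with Equals: equal items produce equal hash codes.
    public override int GetHashCode()
    {
        return ItemCode == null ? 0 : ItemCode.GetHashCode();
    }
}
```

Without the GetHashCode override, a Dictionary<Item, int> would usually place equal items in different buckets and never detect the repeat.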
