How to find and remove duplicate objects in a collection using LINQ? - c#

I have a simple class representing an object. It has 5 properties (a date, 2 decimals, an integer and a string). I have a collection class, derived from CollectionBase, which is a container class for holding multiple objects from my first class.
My question is, I want to remove duplicate objects (i.e. objects that have the same date, same decimals, same integer and same string). Is there a LINQ query I can write to find and remove duplicates? Or find them at the very least?

You can remove duplicates using the Distinct operator.
There are two overloads. One uses the default equality comparer for your type (which, for a custom type, calls the type's Equals() and GetHashCode() methods). The second lets you supply your own equality comparer. Neither overload modifies your original collection; both return a new sequence that excludes duplicates.
If you want to just find the duplicates, you can use GroupBy to do so:
var groupsWithDups = list.GroupBy( x => new { A = x.A, B = x.B, ... }, x => x )
.Where( g => g.Count() > 1 );
To remove duplicates in place from a List<T> (RemoveAll takes a predicate and is defined on List<T>, not on IList<T>) you could do:
var seen = new HashSet<YourType>();
yourList.RemoveAll( x => !seen.Add( x ) );

If your simple class uses Equals in a manner that satisfies your requirements then you can use the Distinct method
var col = ...;
var noDupes = col.Distinct();
If not then you will need to provide an instance of IEqualityComparer<T> which compares values in the way you desire. For example (null problems ignored for brevity)
public class MyTypeComparer : IEqualityComparer<MyType> {
public bool Equals(MyType left, MyType right) {
return left.Name == right.Name;
}
public int GetHashCode(MyType type) {
return 42;
}
}
var noDupes = col.Distinct(new MyTypeComparer());
Note the use of a constant for GetHashCode is intentional. Without knowing intimate details about the semantics of MyType it is impossible to write an efficient and correct hashing function. In lieu of an efficient hashing function I used a constant, which is correct irrespective of the semantics of the type. The cost is that every element lands in the same hash bucket, so Distinct ends up comparing each element against all the elements it has already seen.
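When the properties that define equality are known, the hash can be derived from those same properties, which keeps the comparer correct while avoiding the single-bucket problem. A minimal sketch, assuming a hypothetical Reading class with the five properties described in the question (the class, property and helper names are illustrative and not from the original post):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type standing in for the asker's class (all names are assumptions).
public class Reading
{
    public DateTime Date { get; set; }
    public decimal Low { get; set; }
    public decimal High { get; set; }
    public int Count { get; set; }
    public string Label { get; set; }
}

public class ReadingComparer : IEqualityComparer<Reading>
{
    public bool Equals(Reading x, Reading y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        return x.Date == y.Date && x.Low == y.Low && x.High == y.High
            && x.Count == y.Count && x.Label == y.Label;
    }

    // Hash exactly the fields that Equals compares; a value tuple combines them.
    public int GetHashCode(Reading r) =>
        r is null ? 0 : (r.Date, r.Low, r.High, r.Count, r.Label).GetHashCode();
}

public static class ReadingDemo
{
    // Returns a new list; the input sequence is left untouched.
    public static List<Reading> Dedupe(IEnumerable<Reading> items) =>
        items.Distinct(new ReadingComparer()).ToList();
}
```

The important design point is that GetHashCode and Equals look at the same set of properties, so two objects that compare equal always produce the same hash.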

Related

Getting List.Join to compare properly

I am trying to create a list by joining two lists if a property matches correctly. I am using the following command:
FooList = TrackedStrings.Join (FooList,
str => str,
Foo => Foo.GetString (),
(str, Foo) => Foo,
new Comparer ())
.ToList ();
And the following class to compare:
public class Comparer : IEqualityComparer<string>
{
public bool Equals (string x, string y)
{
return y.Contains (x);
}
public int GetHashCode (string str)
{
return str.GetHashCode ();
}
}
Now, the idea is that I only want to keep the items that have a GetString () containing any one of the strings from TrackedStrings. However, it doesn't work: the comparer only returns true if the strings are equal. For example, let's say that we have two lists:
List<string> TrackedStrings = new List<string> { "Created", "Deleted" };
List<Foo> FooList = new List<Foo> { new Foo ("Created"), new Foo ("Deleted Something")};
With the current command, the second Foo is dropped from the list - instead of matching to TrackedStrings[1] and being kept.
Thus, my question is: Why is Comparer not working the way I expect it to?
You should not use IEqualityComparer<T> here, because an Equals implementation is required to be reflexive, symmetric, and transitive (see MSDN).
In your case it's not symmetric: Equals(a, b) != Equals(b, a).
Glorfindel's answer is not entirely complete either, because the relationship is also not transitive:
Equals("abcd","bc") == true
Equals("bcde", "bc") == true
Equals("abcd","bcde") == false
A custom comparer must make sure that the Equals relationship it defines is symmetric. This means that x.Equals(y) must hold whenever y.Equals(x) holds, and vice versa.
The reason for this is that you can never predict in which order the elements are compared, i.e. which one of these is called:
aStringFromLeftList.Equals(aStringFromRightList)
or
aStringFromRightList.Equals(aStringFromLeftList)
Because the relationship you need is neither symmetric nor transitive, you can't use a Comparer for your problem.
Your comparer fails because of its GetHashCode() implementation, regardless of what the right way to implement IEqualityComparer is.
The match is done in two steps:
1. Compare the hash codes of the two strings. In your case "Deleted Something" almost certainly has a different hash code from "Deleted".
2. If the hash codes from (1) are equal, call Equals() to confirm the match, because hash codes are fast to compare but can collide.
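Since "contains" is not an equality relationship, one alternative is to skip Join and filter directly with Where and Any. A minimal sketch, assuming a simplified Foo with a string constructor and GetString() as in the question (the FooFilter helper and its name are hypothetical):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal stand-in for the asker's Foo type (constructor and field are assumptions).
public class Foo
{
    private readonly string _text;
    public Foo(string text) { _text = text; }
    public string GetString() => _text;
}

public static class FooFilter
{
    // Keep every Foo whose string contains at least one of the tracked strings.
    public static List<Foo> KeepTracked(IEnumerable<Foo> foos, IReadOnlyCollection<string> tracked) =>
        foos.Where(foo => tracked.Any(t => foo.GetString().Contains(t))).ToList();
}
```

Called with the two lists from the question, this keeps both "Created" and "Deleted Something". It costs one Contains call per (Foo, tracked string) pair, which is usually acceptable for short lists of tracked strings.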

Remove duplicates in custom IComparable class

I have a table that has combo pairs identifiers, and I use that to go through CSV files looking for matches. I'm trapping the unidentified pairs in a List, and sending them to an output box for later addition. I would like the output to only have single occurrences of unique pairs. The class is declared as follows:
public class Unmatched:IComparable<Unmatched>
{
public string first_code { get; set; }
public string second_code { get; set; }
public int CompareTo(Unmatched other)
{
if (this.first_code == other.first_code)
{
return this.second_code.CompareTo(other.second_code);
}
return other.first_code.CompareTo(this.first_code);
}
}
One note on the above code: This returns it in reverse alphabetical order, to get it in alphabetical order use this line:
return this.first_code.CompareTo(other.first_code);
Here is the code that adds it. This is directly after the comparison against the datatable elements
unmatched.Add(new Unmatched()
{ first_code = fields[clients[global_index].first_match_column]
, second_code = fields[clients[global_index].second_match_column] });
I would like to remove the duplicate pairs from the list, i.e. entries where both the first code and the second code match an earlier entry. For example:
PTC,138A
PTC,138A
PTC,138A
MA9,5A
MA9,5A
MA9,5A
MA63,138A
MA63,138A
MA59,87BM
MA59,87BM
Should become:
PTC, 138A
MA9, 5A
MA63, 138A
MA59, 87BM
I have tried adding my own Equals and GetHashCode as outlined here:
http://www.morgantechspace.com/2014/01/Use-of-Distinct-with-Custom-Class-objects-in-C-Sharp.html
The SE links I have tried are here:
How would I distinct my list of key/value pairs
Get list of distinct values in List<T> in c#
Get a list of distinct values in List
All of them return a list that still has all the pairs. Here is the current code (Yes, I know there are two distinct lines, neither appears to be working) that outputs the list:
parser.Close();
List<Unmatched> noDupes = unmatched.Distinct().ToList();
noDupes.Sort();
noDupes.Select(x => x.first_code).Distinct();
foreach (var pair in noDupes)
{
txtUnmatchedList.AppendText(pair.first_code + "," + pair.second_code + Environment.NewLine);
}
Here is the Equals/GetHashCode code as requested:
public bool Equals(Unmatched notmatched)
{
//Check whether the compared object is null.
if (Object.ReferenceEquals(notmatched, null)) return false;
//Check whether the compared object references the same data.
if (Object.ReferenceEquals(this, notmatched)) return true;
//Check whether both codes are equal.
return first_code.Equals(notmatched.first_code) && second_code.Equals(notmatched.second_code);
}
// If Equals() returns true for a pair of objects
// then GetHashCode() must return the same value for these objects.
public override int GetHashCode()
{
//Get hash code for the first_code field if it is not null.
int hashfirst_code = first_code == null ? 0 : first_code.GetHashCode();
//Get hash code for the second_code field.
int hashsecond_code = second_code.GetHashCode();
//Combine the two hash codes.
return hashfirst_code ^ hashsecond_code;
}
I have also looked at a couple of answers that are using queries and Tuples, which I honestly don't understand. Can someone point me to a source or answer that will explain the how (And why) of getting distinct pairs out of a custom list?
(Side question-Can you declare a class as both IComparable and IEquatable?)
The problem is you are not implementing IEquatable<Unmatched>.
public class Unmatched : IComparable<Unmatched>, IEquatable<Unmatched>
EqualityComparer<T>.Default uses the Equals(T) method only if you implement IEquatable<T>. You are not doing this, so it will instead use Object.Equals(object) which uses reference equality.
The overload of Distinct you are calling uses EqualityComparer<T>.Default to compare different elements of the sequence for equality. As the documentation states, the returned comparer uses your implementation of GetHashCode to find potentially-equal elements. It then uses the Equals(T) method to check for equality, or Object.Equals(Object) if you have not implemented IEquatable<T>.
You have an Equals(Unmatched) method, but it will not be used since you are not implementing IEquatable<Unmatched>. Instead, the default Object.Equals method is used which uses reference equality.
Note your current Equals method is not overriding Object.Equals since that takes an Object parameter, and you would need to specify the override modifier.
For an example on using Distinct see here.
Distinct works with equality, not ordering, so IComparable<T> is not consulted. You have to either implement IEquatable<T> on the class or pass an IEqualityComparer<T> to Distinct.
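Putting the answers together (and answering the side question: yes, a class can implement both interfaces), a minimal sketch of the class might look like the following. The ordinal string comparison and the tuple-based hash are choices made for this sketch, not something the original code requires, and UnmatchedDemo is a hypothetical helper:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Unmatched : IComparable<Unmatched>, IEquatable<Unmatched>
{
    public string first_code { get; set; }
    public string second_code { get; set; }

    // Ordering used by Sort(): first_code, then second_code (ordinal comparison).
    public int CompareTo(Unmatched other)
    {
        if (other is null) return 1;
        int byFirst = string.CompareOrdinal(first_code, other.first_code);
        return byFirst != 0 ? byFirst : string.CompareOrdinal(second_code, other.second_code);
    }

    // Equality picked up by Distinct() through EqualityComparer<Unmatched>.Default.
    public bool Equals(Unmatched other) =>
        other != null && first_code == other.first_code && second_code == other.second_code;

    // Keep Object.Equals consistent with the typed overload.
    public override bool Equals(object obj) => Equals(obj as Unmatched);

    // Hash the same two fields that Equals compares.
    public override int GetHashCode() => (first_code, second_code).GetHashCode();
}

public static class UnmatchedDemo
{
    public static List<Unmatched> DistinctSorted(IEnumerable<Unmatched> pairs)
    {
        var result = pairs.Distinct().ToList();
        result.Sort();
        return result;
    }
}
```

With this in place, unmatched.Distinct().ToList() followed by Sort() yields each pair once, in ascending order.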

LINQ Except() Method Does Not Work

I have 2 IList<T> of the same type of object ItemsDTO. I want to exclude one list from another. However this does not seem to be working for me and I was wondering why?
IList<ItemsDTO> related = itemsbl.GetRelatedItems();
IList<ItemsDTO> relating = itemsbl.GetRelatingItems().Except(related).ToList();
I'm trying to remove items in related from the relating list.
Since ItemsDTO is a class (a reference type), it must override Equals and GetHashCode for that to work.
From MSDN:
Produces the set difference of two sequences by using the default
equality comparer to compare values.
The default equality comparer is going to be a reference comparison. So if those lists are populated independently of each other, they may contain the same objects from your point of view but different references.
When you use LINQ against SQL Server you have the benefit of LINQ translating your LINQ statement to a SQL query that can perform logical equality for you based on primary keys or value comparators. With LINQ to Objects you'll need to define what logical equality means to ItemsDTO. And that means overriding Equals() as well as GetHashCode().
Except works well for value types. However, since you are using reference types, you need to override Equals and GetHashCode on your ItemsDTO in order to get this to work.
I just ran into the same problem. Apparently .NET thinks the items in one list are different from the same items in the other list (even though they are actually the same). This is what I did to fix it:
Have your class implement IEqualityComparer<T> and pass an instance of it to Except, e.g.:
public class ItemsDTO: IEqualityComparer<ItemsDTO>
{
public bool Equals(ItemsDTO x, ItemsDTO y)
{
if (x == null || y == null) return false;
return ReferenceEquals(x, y) || (x.Id == y.Id); // In this example, treat the items as equal if they have the same Id
}
public int GetHashCode(ItemsDTO obj)
{
return obj.Id.GetHashCode(); // hash the argument, not the comparer instance
}
}
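An alternative that keeps the DTO free of comparison logic is a separate comparer class passed to Except. A minimal sketch, assuming ItemsDTO has an integer Id that identifies an item (the DTO shape, ItemsDtoIdComparer and ExceptDemo are all assumptions made for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of ItemsDTO; only Id is assumed to identify an item.
public class ItemsDTO
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public sealed class ItemsDtoIdComparer : IEqualityComparer<ItemsDTO>
{
    public bool Equals(ItemsDTO x, ItemsDTO y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        return x.Id == y.Id;
    }

    // Hash the same property that Equals compares.
    public int GetHashCode(ItemsDTO obj) => obj is null ? 0 : obj.Id.GetHashCode();
}

public static class ExceptDemo
{
    // Items of 'relating' whose Id does not appear in 'related'.
    public static List<ItemsDTO> Without(IEnumerable<ItemsDTO> relating, IEnumerable<ItemsDTO> related) =>
        relating.Except(related, new ItemsDtoIdComparer()).ToList();
}
```

Note that Except has set semantics, so duplicates within the first list are also collapsed in the result.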

This code returns distinct values. However, what I want is to return a strongly typed collection as opposed to an anonymous type

I have the following code:
var foo = (from data in pivotedData.AsEnumerable()
select new
{
Group = data.Field<string>("Group_Number"),
Study = data.Field<string>("Study_Name")
}).Distinct();
As expected this returns distinct values. However, what I want is to return a strongly-typed collection as opposed to an anonymous type, so when I do:
var foo = (from data in pivotedData.AsEnumerable()
select new BarObject
{
Group = data.Field<string>("Group_Number"),
Study = data.Field<string>("Study_Name")
}).Distinct();
This does not return the distinct values, it returns them all. Is there a way to do this with actual objects?
For Distinct() (and many other LINQ features) to work, the class being compared (BarObject in your example) must override Equals() and GetHashCode(), or alternatively you must provide a separate IEqualityComparer<T> as an argument to Distinct().
Many LINQ methods take advantage of GetHashCode() for performance because internally they will use things like a Set<T> to hold the unique items, which uses hashing for O(1) lookups. Also, GetHashCode() can quickly tell you if two objects may be equivalent and which ones are definitely not - as long as GetHashCode() is properly implemented of course.
So you should make all your classes you intend to compare in LINQ implement Equals() and GetHashCode() for completeness, or create a separate IEqualityComparer<T> implementation.
Either do as dlev suggested or use:
var foo = (from data in pivotedData.AsEnumerable()
select new BarObject
{
Group = data.Field<string>("Group_Number"),
Study = data.Field<string>("Study_Name")
}).GroupBy(x => new { x.Group, x.Study }).Select(g => g.First());
Check this out for more info http://blog.jordanterrell.com/post/LINQ-Distinct()-does-not-work-as-expected.aspx
You need to override Equals and GetHashCode for BarObject because EqualityComparer<BarObject>.Default falls back to reference equality unless you have provided overrides of Equals and GetHashCode (this is what Enumerable.Distinct<BarObject>(this IEnumerable<BarObject> source) uses). Alternatively, you can pass in an IEqualityComparer<BarObject> to Enumerable.Distinct<BarObject>(this IEnumerable<BarObject>, IEqualityComparer<BarObject>).
Looks like Distinct can not compare your BarObject objects. Therefore it compares their references, which of course are all different from each other, even if they have the same contents.
So either you override the Equals method, or you supply a custom EqualityComparer to Distinct. Remember to override GetHashCode when you implement Equals, otherwise it will produce strange results if you put your objects into a hash-based collection, for example as a dictionary key or in a HashSet<BarObject>. It might be (don't know exactly) that Distinct internally uses a hashset.
Here is a collection of good practices for GetHashCode.
You want to use the other overload for Distinct() that takes a comparer. You can then implement your own IEqualityComparer<BarObject>.
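Try this (it applies Distinct() to the rows before projecting; note that DataRow objects are compared by reference unless a comparer such as DataRowComparer.Default is supplied):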
var foo = (from data in pivotedData.AsEnumerable().Distinct()
select new BarObject
{
Group = data.Field<string>("Group_Number"),
Study = data.Field<string>("Study_Name")
});
Should be as simple as:
var foo = (from data in pivotedData.AsEnumerable()
select new
{
Group = data.Field<string>("Group_Number"),
Study = data.Field<string>("Study_Name")
}).Distinct().Select(x => new BarObject {
Group = x.Group,
Study = x.Study
});
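If the project can use C# 9 or later, another option is to declare BarObject as a record, which gets compiler-generated value-based Equals and GetHashCode. A minimal sketch under that assumption; the tuple sequence below stands in for the values read from pivotedData, and BarDemo is a hypothetical helper:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A record gets compiler-generated Equals/GetHashCode over its properties.
public record BarObject
{
    public string Group { get; init; }
    public string Study { get; init; }
}

public static class BarDemo
{
    // 'rows' stands in for the (Group_Number, Study_Name) values read from the DataTable.
    public static List<BarObject> DistinctBars(IEnumerable<(string Group, string Study)> rows) =>
        rows.Select(r => new BarObject { Group = r.Group, Study = r.Study })
            .Distinct()
            .ToList();
}
```

Because equality is now defined on the type itself, the original query with select new BarObject { ... } followed by Distinct() behaves like the anonymous-type version.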

How would I remove items from a List<T>?

I have a list of items.
The problem is that the call that returns the items (which I have no control over) returns the same items THREE times.
So while the actual things that should be in the list are:
A
B
C
I get
A
B
C
A
B
C
A
B
C
How can I cleanly and easily remove the duplicates? Maybe count the items, divide by three and delete anything from X to list.Count?
The quickest, simplest thing to do is to not remove the items but run a distinct query
var distinctItems = list.Distinct();
If it's a must that you have a list, you can always append .ToList() to the call. If it's a must that you continue to work with the same list, then you'd just have to iterate over it and keep track of what you already have and remove any duplicates.
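The "iterate and keep track of what you already have" approach can be sketched as below. This is one possible implementation, not the only one: it relies on the default equality of T, keeps the first occurrence of each item, and the ListDedupe helper name is hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class ListDedupe
{
    // Removes later duplicates in place, keeping the first occurrence of each item.
    public static void RemoveDuplicatesInPlace<T>(IList<T> list)
    {
        var seen = new HashSet<T>();
        for (int i = 0; i < list.Count; )
        {
            if (seen.Add(list[i]))
                i++;               // first time this item is seen: keep it and move on
            else
                list.RemoveAt(i);  // duplicate: remove it and stay on the same index
        }
    }
}
```

For a list containing A, B, C three times over, this leaves A, B, C in the same list instance. RemoveAt shifts the remaining elements, so for very large lists building a new list with Distinct() is likely cheaper.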
Edit: "But I'm working with a class"
If you have a list of a given class, to use Distinct you need to either (a) override Equals and GetHashCode inside your class so that appropriate equality comparisons can be made. If you do not have access to the source code (or simply don't want to override these methods for whatever reason), then you can (b) provide an IEqualityComparer<YourClass> implementation as an argument to the Distinct method. This will also allow you to specify the Equals and GetHashCode implementations without having to modify the source of the actual class.
public class MyObjectComparer : IEqualityComparer<MyObject>
{
public bool Equals(MyObject a, MyObject b)
{
// code to determine equality, usually based on one or more properties
}
public int GetHashCode(MyObject a)
{
// code to generate hash code, usually based on a property
}
}
// ...
var distinctItems = myList.Distinct(new MyObjectComparer());
If you are 100% sure that you receive everything exactly three times, in the same repeating order, then just take the first third:
var newList = oldList.Take(oldList.Count / 3).ToList();
LINQ has a Distinct() method which does exactly this. Or put the items in a HashSet if you want to avoid duplicates completely.
If you're using C# 3 or up:
var newList = dupList.Distinct().ToList();
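If not, sort the list and do the following (var is itself a C# 3 feature, so the types are written out; this assumes the items are strings and that newItems is a List<string>):
string lastItem = null;
foreach( string item in dupList )
{
if( item != lastItem )
{
newItems.Add(item);
}
lastItem = item;
}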
You could simply create a new list and add only the items that are not already in it.
