C# Distinct on IEnumerable<T> with custom IEqualityComparer - c#

Here's what I'm trying to do. I'm querying an XML file using LINQ to XML, which gives me an IEnumerable<T> object, where T is my "Village" class, filled with the results of this query. Some results are duplicated, so I would like to perform a Distinct() on the IEnumerable object, like so:
public IEnumerable<Village> GetAllAlliances()
{
try
{
IEnumerable<Village> alliances =
from alliance in xmlDoc.Elements("Village")
where alliance.Element("AllianceName").Value != String.Empty
orderby alliance.Element("AllianceName").Value
select new Village
{
AllianceName = alliance.Element("AllianceName").Value
};
// TODO: make it work...
return alliances.Distinct(new AllianceComparer());
}
catch (Exception ex)
{
throw new Exception("GetAllAlliances", ex);
}
}
As the default comparer would not work for the Village object, I implemented a custom one, as seen here in the AllianceComparer class:
public class AllianceComparer : IEqualityComparer<Village>
{
#region IEqualityComparer<Village> Members
bool IEqualityComparer<Village>.Equals(Village x, Village y)
{
// Check whether the compared objects reference the same data.
if (Object.ReferenceEquals(x, y))
return true;
// Check whether any of the compared objects is null.
if (Object.ReferenceEquals(x, null) || Object.ReferenceEquals(y, null))
return false;
return x.AllianceName == y.AllianceName;
}
int IEqualityComparer<Village>.GetHashCode(Village obj)
{
return obj.GetHashCode();
}
#endregion
}
The Distinct() method doesn't work, as I have exactly the same number of results with or without it. Another thing, and I don't know if it's usually possible, but I cannot step into AllianceComparer.Equals() to see what could be the problem.
I've found examples of this on the Internet, but I can't seem to make my implementation work.
Hopefully, someone here might see what could be wrong here!
Thanks in advance!

The problem is with your GetHashCode. You should alter it to return the hash code of AllianceName instead.
int IEqualityComparer<Village>.GetHashCode(Village obj)
{
return obj.AllianceName.GetHashCode();
}
The thing is, if Equals returns true, the objects should have the same hash code which is not the case for different Village objects with same AllianceName. Since Distinct works by building a hash table internally, you'll end up with equal objects that won't be matched at all due to different hash codes.
Similarly, to compare two files, if the hash of two files are not the same, you don't need to check the files themselves at all. They will be different. Otherwise, you'll continue to check to see if they are really the same or not. That's exactly what the hash table that Distinct uses behaves.

Or change the line
return alliances.Distinct(new AllianceComparer());
to
return alliances.Select(v => v.AllianceName).Distinct();

Related

Using Contains on an List in a different way

I have this code which i want to change:
foreach (DirectoryInfo path in currDirs) {
if (!newDirs.Contains(path)) {
MyLog.WriteToLog("Folder not Found: "+path.Name + "in New Folder. ",MyLog.Messages.Warning);
currNoPairs.Add(path);
}
}
In the If part i don't want to check the path i want to check the path.Name.
So how can i use the Contains method on the properties.
the goal is to sort out all folders that have not the same name in the list of Current Directory List and New Directory List.
See - IEnumerable<T>.Contains with predicate
Those functions that take "predicates" (boolean functions that signify a match) will let you do more complex checks. In this case, you can use them to compare sub-properties instead of the top-level objects.
The new code will look something like this:
foreach (DirectoryInfo path in currDirs) {
if (!newDirs.Any(newDir => newDir.Name == path.Name)) {
// TODO: print your error message here
currNoPairs.Add(path.Name);
}
}
In reply to your comment:
Okay i understood, but whats the diffrence between any and contains then?
List<T>.Contains
This method goes through each item in the list, seeing if that item is equal to the value you passed in.
The code for this method looks a little like this (simplified here for illustration):
for(var item in yourList) {
if(item.Equals(itemYouPassedIn) {
return true;
}
}
return false;
As you see, it can only compare top-level items. It doesn't check sub-properties, unless you are using a custom type that overrides the default Equals method. Since you're using the built in DirectoryInfo types, you can't override this Equals behavior without making a custom derived class. Since there's easier ways to do this, I wouldn't recommend this approach unless you need to do it in a ton of different places.
IEnumerable<T>.Any
This method goes through each item in the list, and then passes that item to the "predicate" function you passed in.
The code for this method looks a little like this (simplified for illustration):
for(var item in yourList) {
if(isAMatch(item)) { // Note that `isAMatch` is the function you pass in to `Any`
return true;
}
}
return false;
Your predicate function can be as complicated as you want it to be, but in this case, you'd just use it to check if the sub-properties are equal.
// This bit of code defines a function with no name (a "lambda" function).
// We call it a "predicate" because it returns a bool, and is used to find matches
newDir => newDir.Name == path.Name
// Here's how it might look like if it were defined as a normal function -
// this won't quite work in reality cause `path` is passed in by a different means,
// but hopefully it makes the lambda syntax slightly more clear
bool IsAMatch(DirectoryInfo newDir) {
return newDir.Name == path.Name;
}
Since you can customize this predicate every place that you use it, this could be a better tactic. I'd recommend this style until you are doing this exact check in a bunch of places in your code, in which case a custom class might be better.
Here is how you check for property Any
foreach (DirectoryInfo path in currDirs) {
if (!newDirs.Any(dir => dir.FullName == path.FullName)) {
MyLog.WriteToLog("Folder not Found: "+path.Name + "in New Folder. ",MyLog.Messages.Warning);
currNoPairs.Add(path);
}
}
And by the way, your code could be written in a better way like this
var currDirsConcrete = currDirs.ToArray();
var pathsNotFound = "Following paths were not found \r\n " + string.Join("\r\n", currDirsConcrete.Where(d => d.FullName != path.FullName).ToArray());
var pathsFound = currDirsConcrete.Where(d => d.FullName == path.FullName).ToArray();
MyLog.WriteToLog(pathsNotFound, MyLog.Messages.Warning);
Note: You can skip the first line currDirsConcrete if your currDirs is already an array or a list. I did this to avoid redetermining the enumerable.
I would use linq with except and implement a DirComparator
List<DirectoryInfo> resultExcept = currDirs.Except(newDirs, new DirComparator()).ToList();
Here the IEqualityComparer<DirectoryInfo>:
public class DirComparator : IEqualityComparer<DirectoryInfo> {
public bool Equals(DirectoryInfo x, DirectoryInfo y)
{
//Check whether the compared objects reference the same data.
if (Object.ReferenceEquals(x, y)) return true;
//Check whether any of the compared objects is null.
if (Object.ReferenceEquals(x, null) || Object.ReferenceEquals(y, null))
return false;
//Check whether the products' properties are equal.
return x.Name.equals(y.Name);
}
public int GetHashCode(DirectoryInfo dir)
{
//Check whether the object is null
if (Object.ReferenceEquals(dir, null)) return 0;
//Get hash code for the Name field if it is not null.
return dir.Name == null ? 0 : dir.Name.GetHashCode();
}
}
you could also use intersect if you want it the other way around.

C# List comparison between custom objects

Pair<BoardLocation, BoardLocation> loc = new Pair<BoardLocation, BoardLocation>( this.getLocation(), l );
if(!this.getPlayer().getMoves().Contains( loc )) {
this.getPlayer().addMove( loc );
}
I'm using a Type I have created called "Pair" but, I'm trying to use the contains function in C# that would compare the two types but, I have used override in the Type "Pair" itself to compare the "ToString()" of both Pair objects being compared. So there are 4 strings being compared. The two Keys and two value. If the two Keys are equal, then the two values are compared. The reason why this makes sense is the Key is the originating(key) location for the location(value) being attacked. If the key and value are the same then the object should not be added.
public override bool Equals( object obj ) {
Pair<K, V> objNode = (Pair<K, V>)obj;
if(this.value.ToString().CompareTo( objNode.value.ToString() ) == 0) {
if(this.key.ToString().CompareTo( objNode.key.ToString() ) == 0) {
return true;
} else
return false;
} else {
return false;
}
}
The question is, Is there a better way to do this that doesn't involve stupid amounts of code or creating new objects for dealing with this. Of course if any ideas involve these, I am all ears. The part that confuses me about this is, perhaps I dont understand what is going on but, I was hoping that C# offered a method that just equivalence of values and not the object memory locations and etc.
I've just ported this from Java as well, and it works exactly the same but, I'm asking this question for C# because I'm hoping there was a better way for me to compare these objects without using ToString() with generic Types.
You can definitely make this code a lot simpler by using && and just returning the value of equality comparisons, instead of all those if statements and return true; or return false; statements.
public override bool Equals (object obj) {
// Safety first: handle the case where the other object isn't
// of the same type, or obj is null. In both cases we should
// return false, rather than throwing an exception
Pair<K, V> otherPair = objNode as Pair<K, V>;
if (otherPair == null) {
return false;
}
return key.ToString() == otherPair.key.ToString() &&
value.ToString() == otherPair.value.ToString();
}
In Java you could use equals rather than compareTo.
Note that these aren't exactly the same as == (and Equals) use an ordinal comparison rather than a culture-sensitive one - but I suspect that's what you want anyway.
I would personally shy away from comparing the values by ToString() representations. I would use the natural equality comparisons of the key and value types instead:
public override bool Equals (object obj) {
// Safety first: handle the case where the other object isn't
// of the same type, or obj is null. In both cases we should
// return false, rather than throwing an exception
Pair<K, V> otherPair = objNode as Pair<K, V>;
if (otherPair == null) {
return false;
}
return EqualityComparer<K>.Default.Equals(key, otherPair.key) &&
EqualityComparer<K>.Default.Equals(value, otherPair.value);
}
(As Avner notes, you could just use Tuple of course...)
As noted in comments, I'd also strongly recommend that you start using properties and C# naming conventions, e.g.:
if (!Player.Moves.Contains(loc)) {
Player.AddMove(loc);
}
The simplest way to improve this is to use, instead of your custom Pair class, an instance of the built-in Tuple<T1,T2> class.
The Tuple class, in addition to giving you an easy way to bundle several values together, automatically implements structural equality, meaning that a Tuple object is equal to another if:
It is a Tuple object.
Its two components are of the same types as the current instance.
Its two components are equal to those of the current instance. Equality is determined by the default object equality comparer for each component.
(from MSDN)
This means that instead of your Pair having to compare its values, you're delegating the responsibility to the types held in the Tuple.

Remove duplicates in custom IComparable class

I have a table that has combo pairs identifiers, and I use that to go through CSV files looking for matches. I'm trapping the unidentified pairs in a List, and sending them to an output box for later addition. I would like the output to only have single occurrences of unique pairs. The class is declared as follows:
public class Unmatched:IComparable<Unmatched>
{
public string first_code { get; set; }
public string second_code { get; set; }
public int CompareTo(Unmatched other)
{
if (this.first_code == other.first_code)
{
return this.second_code.CompareTo(other.second_code);
}
return other.first_code.CompareTo(this.first_code);
}
}
One note on the above code: This returns it in reverse alphabetical order, to get it in alphabetical order use this line:
return this.first_code.CompareTo(other.first_code);
Here is the code that adds it. This is directly after the comparison against the datatable elements
unmatched.Add(new Unmatched()
{ first_code = fields[clients[global_index].first_match_column]
, second_code = fields[clients[global_index].second_match_column] });
I would like to remove all pairs from the list where both first code and second code are equal, i.e.;
PTC,138A
PTC,138A
PTC,138A
MA9,5A
MA9,5A
MA9,5A
MA63,138A
MA63,138A
MA59,87BM
MA59,87BM
Should become:
PTC, 138A
MA9, 5A
MA63, 138A
MA59, 87BM
I have tried adding my own Equate and GetHashCode as outlined here:
http://www.morgantechspace.com/2014/01/Use-of-Distinct-with-Custom-Class-objects-in-C-Sharp.html
The SE links I have tried are here:
How would I distinct my list of key/value pairs
Get list of distinct values in List<T> in c#
Get a list of distinct values in List
All of them return a list that still has all the pairs. Here is the current code (Yes, I know there are two distinct lines, neither appears to be working) that outputs the list:
parser.Close();
List<Unmatched> noDupes = unmatched.Distinct().ToList();
noDupes.Sort();
noDupes.Select(x => x.first_code).Distinct();
foreach (var pair in noDupes)
{
txtUnmatchedList.AppendText(pair.first_code + "," + pair.second_code + Environment.NewLine);
}
Here is the Equate/Hash code as requested:
public bool Equals(Unmatched notmatched)
{
//Check whether the compared object is null.
if (Object.ReferenceEquals(notmatched, null)) return false;
//Check whether the compared object references the same data.
if (Object.ReferenceEquals(this, notmatched)) return true;
//Check whether the UserDetails' properties are equal.
return first_code.Equals(notmatched.first_code) && second_code.Equals(notmatched.second_code);
}
// If Equals() returns true for a pair of objects
// then GetHashCode() must return the same value for these objects.
public override int GetHashCode()
{
//Get hash code for the UserName field if it is not null.
int hashfirst_code = first_code == null ? 0 : first_code.GetHashCode();
//Get hash code for the City field.
int hashsecond_code = second_code.GetHashCode();
//Calculate the hash code for the GPOPolicy.
return hashfirst_code ^ hashsecond_code;
}
I have also looked at a couple of answers that are using queries and Tuples, which I honestly don't understand. Can someone point me to a source or answer that will explain the how (And why) of getting distinct pairs out of a custom list?
(Side question-Can you declare a class as both IComparable and IEquatable?)
The problem is you are not implementing IEquatable<Unmatched>.
public class Unmatched : IComparable<Unmatched>, IEquatable<Unmatched>
EqualityComparer<T>.Default uses the Equals(T) method only if you implement IEquatable<T>. You are not doing this, so it will instead use Object.Equals(object) which uses reference equality.
The overload of Distinct you are calling uses EqualityComparer<T>.Default to compare different elements of the sequence for equality. As the documentation states, the returned comparer uses your implementation of GetHashCode to find potentially-equal elements. It then uses the Equals(T) method to check for equality, or Object.Equals(Object) if you have not implemented IEquatable<T>.
You have an Equals(Unmatched) method, but it will not be used since you are not implementing IEquatable<Unmatched>. Instead, the default Object.Equals method is used which uses reference equality.
Note your current Equals method is not overriding Object.Equals since that takes an Object parameter, and you would need to specify the override modifier.
For an example on using Distinct see here.
You have to implement the IEqualityComparer<TSource> and not IComparable<TSource>.

Implementing GetHashCode for IEqualityComparer<T> with conditional equality

I'm wondering if anyone as any suggestions for this problem.
I'm using intersect and except (Linq) with a custom IEqualityComparer in order to query the set differences and set intersections of two sequences of ISyncableUsers.
public interface ISyncableUser
{
string Guid { get; }
string UserPrincipalName { get; }
}
The logic behind whether two ISyncableUsers are equal is conditional. The conditions center around whether either of the two properties, Guid and UserPrincipalName, have values. The best way to explain this logic is with code. Below is my implementation of the Equals method of my customer IEqualityComparer.
public bool Equals(ISyncableUser userA, ISyncableUser userB)
{
if (userA == null && userB == null)
{
return true;
}
if (userA == null)
{
return false;
}
if (userB == null)
{
return false;
}
if ((!string.IsNullOrWhiteSpace(userA.Guid) && !string.IsNullOrWhiteSpace(userB.Guid)) &&
userA.Guid == userB.Guid)
{
return true;
}
if (UsersHaveUpn(userA, userB))
{
if (userB.UserPrincipalName.Equals(userA.UserPrincipalName, StringComparison.InvariantCultureIgnoreCase))
{
return true;
}
}
return false;
}
private bool UsersHaveUpn(ISyncableUser userA, ISyncableUser userB)
{
return !string.IsNullOrWhiteSpace(userA.UserPrincipalName)
&& !string.IsNullOrWhiteSpace(userB.UserPrincipalName);
}
The problem I'm having, is with implementing GetHashCode so that the above conditional equality, represented above, is respected. The only way I've been able to get the intersect and except calls to work as expected is to simple always return the same value from GetHashCode(), forcing a call to Equals.
public int GetHashCode(ISyncableUser obj)
{
return 0;
}
This works but the performance penalty is huge, as expected. (I've tested this with non-conditional equality. With two sets containing 50000 objects, a proper hashcode implementation allows execution of intercept and except in about 40ms. A hashcode implementation that always returns 0 takes approximately 144000ms (yes, 2.4 minutes!))
So, how would I go about implementing a GetHashCode() in the scenario above?
Any thoughts would be more than welcome!
If I'm reading this correctly, your equality relation is not transitive. Picture the following three ISyncableUsers:
A { Guid: "1", UserPrincipalName: "2" }
B { Guid: "2", UserPrincipalName: "2" }
C { Guid: "2", UserPrincipalName: "1" }
A == B because they have the same UserPrincipalName
B == C because they have the same Guid
A != C because they don't share either.
From the spec,
The Equals method is reflexive, symmetric, and transitive. That is, it returns true if used to compare an object with itself; true for two objects x and y if it is true for y and x; and true for two objects x and z if it is true for x and y and also true for y and z.
If your equality relation isn't consistent, there's no way you can implement a hash code that backs it up.
From another point of view: you're essentially looking for three functions:
G mapping GUIDs to ints (if you know the GUID but the UPN is blank)
U mapping UPNs to ints (if you know the UPN but the GUID is blank)
P mapping (guid, upn) pairs to ints (if you know both)
such that G(g) == U(u) == P(g, u) for all g and u. This is only possible if you ignore g and u completely.
If we suppose that your Equals implementation is correct, i.e. it's reflective, transitive and symmetric then the basic implementation for your GetHashCode function should look like this:
public int GetHashCode(ISyncableUser obj)
{
if (obj == null)
{
return SOME_CONSTANT;
}
if (!string.IsNullOrWhiteSpace(obj.UserPrincipalName) &&
<can have user object with different guid and the same name>)
{
return GetHashCode(obj.UserPrincipalName);
}
return GetHashCode(obj.Guid);
}
You should also understand that you've got rather intricate dependencies between your objects.
Indeed, let's take two ISyncableUser objects: 'u1' and 'u2', such that u1.Guid != u2.Guid, but u1.UserPrincipalName == u2.UserPrincipalName and names are not empty. Requirements for Equality imposes that for any 'ISyncableUser' object 'u' such that u.Guid == u1.Guid, the condition u.UserPrincipalName == u1.UserPrincipalName should be also true. This reasoning dictates GetHashCode implementation, for each user object it should be based either on it's name or guid.
One way would be to maintain a dictionary of hashcodes for usernames and GUIDS.
You could generate this dictionary at the start once for all users, which would probably the cleanest solution.
You could add or update an entry in the Constructor of each user.
Or, you could maintain that dictionary inside the GetHashCode function. This means your GetHashCode function has more work to do and is not free of side-effects. Getting this to work with multiple threads or parallel-linq will need some more carefull work. So I don't know whether I would recommend this approach.
Nevertheless, here is my attempt:
private Dictionary<string, int> _guidHash =
new Dictionary<string, int>();
private Dictionary<string, int> _nameHash =
new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
public int GetHashCode(ISyncableUser obj)
{
int hash = 0;
if (obj==null) return hash;
if (!String.IsNullOrWhiteSpace(obj.Guid)
&& _guidHash.TryGetValue(obj.Guid, out hash))
return hash;
if (!String.IsNullOrWhiteSpace(obj.UserPrincipalName)
&& _nameHash.TryGetValue(obj.UserPrincipalName, out hash))
return hash;
hash = RuntimeHelpers.GetHashCode(obj);
// or use some other method to generate an unique hashcode here
if (!String.IsNullOrWhiteSpace(obj.Guid))
_guidHash.Add(obj.Guid, hash);
if (!String.IsNullOrWhiteSpace(obj.UserPrincipalName))
_nameHash.Add(obj.UserPrincipalName, hash);
return hash;
}
Note that this will fail if the ISyncableUser objects do not play nice and exhibit cases like in Rawling's answer. I am assuming that users with the same GUID will have the same name or no name at all, and users with the same principalName have the same GUID or no GUID at all. (I think the given Equals implementation has the same limitations)

Best way to compare two Dictionary<T> for equality

Is this the best way to create a comparer for the equality of two dictionaries? This needs to be exact. Note that Entity.Columns is a dictionary of KeyValuePair(string, object) :
public class EntityColumnCompare : IEqualityComparer<Entity>
{
public bool Equals(Entity a, Entity b)
{
var aCol = a.Columns.OrderBy(KeyValuePair => KeyValuePair.Key);
var bCol = b.Columns.OrderBy(KeyValuePAir => KeyValuePAir.Key);
if (aCol.SequenceEqual(bCol))
return true;
else
return false;
}
public int GetHashCode(Entity obj)
{
return obj.Columns.GetHashCode();
}
}
Also not too sure about the GetHashCode implementation.
Thanks!
Here's what I would do:
public bool Equals(Entity a, Entity b)
{
if (a.Columns.Count != b.Columns.Count)
return false; // Different number of items
foreach(var kvp in a.Columns)
{
object bValue;
if (!b.Columns.TryGetValue(kvp.Key, out bValue))
return false; // key missing in b
if (!Equals(kvp.Value, bValue))
return false; // value is different
}
return true;
}
That way you don't need to order the entries (which is a O(n log n) operation) : you only need to enumerate the entries in the first dictionary (O(n)) and try to retrieve values by key in the second dictionary (O(1)), so the overall complexity is O(n).
Also, note that your GetHashCode method is incorrect: in most cases it will return different values for different dictionary instances, even if they have the same content. And if the hashcode is different, Equals will never be called... You have several options to implement it correctly, none of them ideal:
build the hashcode from the content of the dictionary: would be the best option, but it's slow, and GetHashCode needs to be fast
always return the same value, that way Equals will always be called: very bad if you want to use this comparer in a hashtable/dictionary/hashset, because all instances will fall in the same bucket, resulting in O(n) access instead of O(1)
return the Count of the dictionary (as suggested by digEmAll): it won't give a great distribution, but still better than always returning the same value, and it satisfies the constraint for GetHashCode (i.e. objects that are considered equal should have the same hashcode; two "equal" dictionaries have the same number of items, so it works)
Something like this comes to mind, but there might be something more efficient:
public static bool Equals<TKey, TValue>(IDictionary<TKey, TValue> x,
IDictionary<TKey, TValue> y)
{
return x.Keys.Intersect(y.Keys).Count == x.Keys.Count &&
x.Keys.All(key => Object.Equals(x[key], y[key]));
}
It seems good to me, perhaps not the fastest but working.
You just need to change the GetHashCode implementation that is wrong.
For example you could return obj.Columns.Count.GetHashCode()

Categories