Getting List.Join to compare properly

Getting List.Join to compare properly - c#

I am trying to create a list by joining two lists if a property matches correctly. I am using the following command:
FooList = TrackedStrings.Join (FooList,
str => str,
Foo => Foo.GetString (),
(str, Foo) => Foo,
new Comparer ())
.ToList ();
And the following class to compare:
public class Comparer : IEqualityComparer<string>
{
public bool Equals (string x, string y)
{
return y.Contains (x);
}
public int GetHashCode (string str)
{
return str.GetHashCode ();
}
}
Now, the idea is that I only want to keep the items that have a GetString () containing any one of the strings from TrackedStrings. However, it doesn't work: the comparer only returns true if the strings are equal. For example, let's say that we have two lists:
List<string> TrackedActions = new List<string> { "Created", "Deleted" };
List<Foo> FooList = new List<FooList> { new Foo ("Created"), new Foo ("Deleted Something")};
With the current command, the second Foo is dropped from the list - instead of matching to TrackedActions[1] and being kept.
Thus, my question is: Why is Comparer not working the way I expect it to?

You should not use IEqualityComparer because The Equals method is reflexive, symmetric, and transitive. MSDN
In your case its not symmetric Equals(a,b) != Equals (b,a)
Glorfindel's answer is not totally correct too, because it's not transitive:
Equals("abcd","bc") == true
Equals("bcde", "bc") == true
Equals("abcd","bcde") == false

A custom comparer must make sure that the Equals relationship it defines is symmetric. This means that whenever x.Equals(y), y.Equals(x) and vice versa.
The reason for this is that you can never predict in which order the elements are compared, i.e. which one of these is called:
aStringFromLeftList.Equals(aStringFromRightList)
or
aStringFromRightList.Equals(aStringFromLeftList)
Because the relationship you need is neither symmetric nor transitive, you can't use a Comparer for your problem.

Your comparer not working is due to the implementation of the GetHashCode()
regardless the right way to implement the IEqualityComparer.
The match is done by
Compare the hashcode of 2 strings. In your case Deleted Something definitely return different hashcode with Deleted
If (1) is equal, then use Equals() to compare again because HashCode may have collision and not accurate, but fast.

Related

Remove duplicates in custom IComparable class

I have a table that has combo pairs identifiers, and I use that to go through CSV files looking for matches. I'm trapping the unidentified pairs in a List, and sending them to an output box for later addition. I would like the output to only have single occurrences of unique pairs. The class is declared as follows:
public class Unmatched:IComparable<Unmatched>
{
public string first_code { get; set; }
public string second_code { get; set; }
public int CompareTo(Unmatched other)
{
if (this.first_code == other.first_code)
{
return this.second_code.CompareTo(other.second_code);
}
return other.first_code.CompareTo(this.first_code);
}
}
One note on the above code: This returns it in reverse alphabetical order, to get it in alphabetical order use this line:
return this.first_code.CompareTo(other.first_code);
Here is the code that adds it. This is directly after the comparison against the datatable elements
unmatched.Add(new Unmatched()
{ first_code = fields[clients[global_index].first_match_column]
, second_code = fields[clients[global_index].second_match_column] });
I would like to remove all pairs from the list where both first code and second code are equal, i.e.;
PTC,138A
PTC,138A
PTC,138A
MA9,5A
MA9,5A
MA9,5A
MA63,138A
MA63,138A
MA59,87BM
MA59,87BM
Should become:
PTC, 138A
MA9, 5A
MA63, 138A
MA59, 87BM
I have tried adding my own Equate and GetHashCode as outlined here:
http://www.morgantechspace.com/2014/01/Use-of-Distinct-with-Custom-Class-objects-in-C-Sharp.html
The SE links I have tried are here:
How would I distinct my list of key/value pairs
Get list of distinct values in List<T> in c#
Get a list of distinct values in List
All of them return a list that still has all the pairs. Here is the current code (Yes, I know there are two distinct lines, neither appears to be working) that outputs the list:
parser.Close();
List<Unmatched> noDupes = unmatched.Distinct().ToList();
noDupes.Sort();
noDupes.Select(x => x.first_code).Distinct();
foreach (var pair in noDupes)
{
txtUnmatchedList.AppendText(pair.first_code + "," + pair.second_code + Environment.NewLine);
}
Here is the Equate/Hash code as requested:
public bool Equals(Unmatched notmatched)
{
//Check whether the compared object is null.
if (Object.ReferenceEquals(notmatched, null)) return false;
//Check whether the compared object references the same data.
if (Object.ReferenceEquals(this, notmatched)) return true;
//Check whether the UserDetails' properties are equal.
return first_code.Equals(notmatched.first_code) && second_code.Equals(notmatched.second_code);
}
// If Equals() returns true for a pair of objects
// then GetHashCode() must return the same value for these objects.
public override int GetHashCode()
{
//Get hash code for the UserName field if it is not null.
int hashfirst_code = first_code == null ? 0 : first_code.GetHashCode();
//Get hash code for the City field.
int hashsecond_code = second_code.GetHashCode();
//Calculate the hash code for the GPOPolicy.
return hashfirst_code ^ hashsecond_code;
}
I have also looked at a couple of answers that are using queries and Tuples, which I honestly don't understand. Can someone point me to a source or answer that will explain the how (And why) of getting distinct pairs out of a custom list?
(Side question-Can you declare a class as both IComparable and IEquatable?)

The problem is you are not implementing IEquatable<Unmatched>.
public class Unmatched : IComparable<Unmatched>, IEquatable<Unmatched>
EqualityComparer<T>.Default uses the Equals(T) method only if you implement IEquatable<T>. You are not doing this, so it will instead use Object.Equals(object) which uses reference equality.
The overload of Distinct you are calling uses EqualityComparer<T>.Default to compare different elements of the sequence for equality. As the documentation states, the returned comparer uses your implementation of GetHashCode to find potentially-equal elements. It then uses the Equals(T) method to check for equality, or Object.Equals(Object) if you have not implemented IEquatable<T>.
You have an Equals(Unmatched) method, but it will not be used since you are not implementing IEquatable<Unmatched>. Instead, the default Object.Equals method is used which uses reference equality.
Note your current Equals method is not overriding Object.Equals since that takes an Object parameter, and you would need to specify the override modifier.

For an example on using Distinct see here.
You have to implement the IEqualityComparer<TSource> and not IComparable<TSource>.

Implementing GetHashCode for IEqualityComparer<T> with conditional equality

I'm wondering if anyone as any suggestions for this problem.
I'm using intersect and except (Linq) with a custom IEqualityComparer in order to query the set differences and set intersections of two sequences of ISyncableUsers.
public interface ISyncableUser
{
string Guid { get; }
string UserPrincipalName { get; }
}
The logic behind whether two ISyncableUsers are equal is conditional. The conditions center around whether either of the two properties, Guid and UserPrincipalName, have values. The best way to explain this logic is with code. Below is my implementation of the Equals method of my customer IEqualityComparer.
public bool Equals(ISyncableUser userA, ISyncableUser userB)
{
if (userA == null && userB == null)
{
return true;
}
if (userA == null)
{
return false;
}
if (userB == null)
{
return false;
}
if ((!string.IsNullOrWhiteSpace(userA.Guid) && !string.IsNullOrWhiteSpace(userB.Guid)) &&
userA.Guid == userB.Guid)
{
return true;
}
if (UsersHaveUpn(userA, userB))
{
if (userB.UserPrincipalName.Equals(userA.UserPrincipalName, StringComparison.InvariantCultureIgnoreCase))
{
return true;
}
}
return false;
}
private bool UsersHaveUpn(ISyncableUser userA, ISyncableUser userB)
{
return !string.IsNullOrWhiteSpace(userA.UserPrincipalName)
&& !string.IsNullOrWhiteSpace(userB.UserPrincipalName);
}
The problem I'm having, is with implementing GetHashCode so that the above conditional equality, represented above, is respected. The only way I've been able to get the intersect and except calls to work as expected is to simple always return the same value from GetHashCode(), forcing a call to Equals.
public int GetHashCode(ISyncableUser obj)
{
return 0;
}
This works but the performance penalty is huge, as expected. (I've tested this with non-conditional equality. With two sets containing 50000 objects, a proper hashcode implementation allows execution of intercept and except in about 40ms. A hashcode implementation that always returns 0 takes approximately 144000ms (yes, 2.4 minutes!))
So, how would I go about implementing a GetHashCode() in the scenario above?
Any thoughts would be more than welcome!

If I'm reading this correctly, your equality relation is not transitive. Picture the following three ISyncableUsers:
A { Guid: "1", UserPrincipalName: "2" }
B { Guid: "2", UserPrincipalName: "2" }
C { Guid: "2", UserPrincipalName: "1" }
A == B because they have the same UserPrincipalName
B == C because they have the same Guid
A != C because they don't share either.
From the spec,
The Equals method is reflexive, symmetric, and transitive. That is, it returns true if used to compare an object with itself; true for two objects x and y if it is true for y and x; and true for two objects x and z if it is true for x and y and also true for y and z.
If your equality relation isn't consistent, there's no way you can implement a hash code that backs it up.
From another point of view: you're essentially looking for three functions:
G mapping GUIDs to ints (if you know the GUID but the UPN is blank)
U mapping UPNs to ints (if you know the UPN but the GUID is blank)
P mapping (guid, upn) pairs to ints (if you know both)
such that G(g) == U(u) == P(g, u) for all g and u. This is only possible if you ignore g and u completely.

If we suppose that your Equals implementation is correct, i.e. it's reflective, transitive and symmetric then the basic implementation for your GetHashCode function should look like this:
public int GetHashCode(ISyncableUser obj)
{
if (obj == null)
{
return SOME_CONSTANT;
}
if (!string.IsNullOrWhiteSpace(obj.UserPrincipalName) &&
<can have user object with different guid and the same name>)
{
return GetHashCode(obj.UserPrincipalName);
}
return GetHashCode(obj.Guid);
}
You should also understand that you've got rather intricate dependencies between your objects.
Indeed, let's take two ISyncableUser objects: 'u1' and 'u2', such that u1.Guid != u2.Guid, but u1.UserPrincipalName == u2.UserPrincipalName and names are not empty. Requirements for Equality imposes that for any 'ISyncableUser' object 'u' such that u.Guid == u1.Guid, the condition u.UserPrincipalName == u1.UserPrincipalName should be also true. This reasoning dictates GetHashCode implementation, for each user object it should be based either on it's name or guid.

One way would be to maintain a dictionary of hashcodes for usernames and GUIDS.
You could generate this dictionary at the start once for all users, which would probably the cleanest solution.
You could add or update an entry in the Constructor of each user.
Or, you could maintain that dictionary inside the GetHashCode function. This means your GetHashCode function has more work to do and is not free of side-effects. Getting this to work with multiple threads or parallel-linq will need some more carefull work. So I don't know whether I would recommend this approach.
Nevertheless, here is my attempt:
private Dictionary<string, int> _guidHash =
new Dictionary<string, int>();
private Dictionary<string, int> _nameHash =
new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
public int GetHashCode(ISyncableUser obj)
{
int hash = 0;
if (obj==null) return hash;
if (!String.IsNullOrWhiteSpace(obj.Guid)
&& _guidHash.TryGetValue(obj.Guid, out hash))
return hash;
if (!String.IsNullOrWhiteSpace(obj.UserPrincipalName)
&& _nameHash.TryGetValue(obj.UserPrincipalName, out hash))
return hash;
hash = RuntimeHelpers.GetHashCode(obj);
// or use some other method to generate an unique hashcode here
if (!String.IsNullOrWhiteSpace(obj.Guid))
_guidHash.Add(obj.Guid, hash);
if (!String.IsNullOrWhiteSpace(obj.UserPrincipalName))
_nameHash.Add(obj.UserPrincipalName, hash);
return hash;
}
Note that this will fail if the ISyncableUser objects do not play nice and exhibit cases like in Rawling's answer. I am assuming that users with the same GUID will have the same name or no name at all, and users with the same principalName have the same GUID or no GUID at all. (I think the given Equals implementation has the same limitations)

Best way to compare two Dictionary<T> for equality

Is this the best way to create a comparer for the equality of two dictionaries? This needs to be exact. Note that Entity.Columns is a dictionary of KeyValuePair(string, object) :
public class EntityColumnCompare : IEqualityComparer<Entity>
{
public bool Equals(Entity a, Entity b)
{
var aCol = a.Columns.OrderBy(KeyValuePair => KeyValuePair.Key);
var bCol = b.Columns.OrderBy(KeyValuePAir => KeyValuePAir.Key);
if (aCol.SequenceEqual(bCol))
return true;
else
return false;
}
public int GetHashCode(Entity obj)
{
return obj.Columns.GetHashCode();
}
}
Also not too sure about the GetHashCode implementation.
Thanks!

Here's what I would do:
public bool Equals(Entity a, Entity b)
{
if (a.Columns.Count != b.Columns.Count)
return false; // Different number of items
foreach(var kvp in a.Columns)
{
object bValue;
if (!b.Columns.TryGetValue(kvp.Key, out bValue))
return false; // key missing in b
if (!Equals(kvp.Value, bValue))
return false; // value is different
}
return true;
}
That way you don't need to order the entries (which is a O(n log n) operation) : you only need to enumerate the entries in the first dictionary (O(n)) and try to retrieve values by key in the second dictionary (O(1)), so the overall complexity is O(n).
Also, note that your GetHashCode method is incorrect: in most cases it will return different values for different dictionary instances, even if they have the same content. And if the hashcode is different, Equals will never be called... You have several options to implement it correctly, none of them ideal:
build the hashcode from the content of the dictionary: would be the best option, but it's slow, and GetHashCode needs to be fast
always return the same value, that way Equals will always be called: very bad if you want to use this comparer in a hashtable/dictionary/hashset, because all instances will fall in the same bucket, resulting in O(n) access instead of O(1)
return the Count of the dictionary (as suggested by digEmAll): it won't give a great distribution, but still better than always returning the same value, and it satisfies the constraint for GetHashCode (i.e. objects that are considered equal should have the same hashcode; two "equal" dictionaries have the same number of items, so it works)

Something like this comes to mind, but there might be something more efficient:
public static bool Equals<TKey, TValue>(IDictionary<TKey, TValue> x,
IDictionary<TKey, TValue> y)
{
return x.Keys.Intersect(y.Keys).Count == x.Keys.Count &&
x.Keys.All(key => Object.Equals(x[key], y[key]));
}

It seems good to me, perhaps not the fastest but working.
You just need to change the GetHashCode implementation that is wrong.
For example you could return obj.Columns.Count.GetHashCode()

Union-ing two custom classes returns duplicates

I have two custom classes, ChangeRequest and ChangeRequests, where a ChangeRequests can contain many ChangeRequest instances.
public class ChangeRequests : IXmlSerializable, ICloneable, IEnumerable<ChangeRequest>,
IEquatable<ChangeRequests> { ... }
public class ChangeRequest : ICloneable, IXmlSerializable, IEquatable<ChangeRequest>
{ ... }
I am trying to do a union of two ChangeRequests instances. However, duplicates do not seem to be removed. My MSTest unit test is as follows:
var cr1 = new ChangeRequest { CRID = "12" };
var crs1 = new ChangeRequests { cr1 };
var crs2 = new ChangeRequests
{
cr1.Clone(),
new ChangeRequest { CRID = "34" }
};
Assert.AreEqual(crs1[0], crs2[0], "First CR in both ChangeRequests should be equal");
var unionedCRs = new ChangeRequests(crs1.Union<ChangeRequest>(crs2));
ChangeRequests expected = crs2.Clone();
Assert.AreEqual(expected, unionedCRs, "Duplicates should be removed from a Union");
The test fails in the last line, and unionedCRs contains two copies of cr1. When I tried to debug and step through each line, I had a breakpoint in ChangeRequest.Equals(object) on the first line, as well as in the first line of ChangeRequest.Equals(ChangeRequest), but neither were hit. Why does the union contain duplicate ChangeRequest instances?
Edit: as requested, here is ChangeRequests.Equals(ChangeRequests):
public bool Equals(ChangeRequests other)
{
if (ReferenceEquals(this, other))
{
return true;
}
return null != other && this.SequenceEqual<ChangeRequest>(other);
}
And here's ChangeRequests.Equals(object):
public override bool Equals(object obj)
{
return Equals(obj as ChangeRequests);
}
Edit: I overrode GetHashCode on both ChangeRequest and ChangeRequests but still in my test, if I do IEnumerable<ChangeRequest> unionedCRsIEnum = crs1.Union<ChangeRequest>(crs2);, unionedCRsIEnum ends up with two copies of the ChangeRequest with CRID 12.
Edit: something has to be up with my Equals or GetHashCode implementations somewhere, since Assert.AreEqual(expected, unionedCRs.Distinct(), "Distinct should remove duplicates"); fails, and the string representations of expected and unionedCRs.Distinct() show that unionedCRs.Distinct() definitely has two copies of CR 12.

Make sure your GetHashCode implementation is consistent with your Equals - the Enumerable.Union method does appear to use both.
You should get a warning from the compiler if you've implemented one but not the other; it's still up to you to make sure that both methods agree with each other. Here's a convenient summary of the rules: Why is it important to override GetHashCode when Equals method is overridden?

I don't believe that Assert.AreEqual() examines the contents of the sequence - it compares the sequence objects themselves, which are clearly not equal.
What you want is a SequenceEqual() method, that will actually examine the contents of two sequences. This answer may help you. It's a response to a similar question, that describes how to compare to IEnumerable<> sequences.
You could easily take the responder's answer, and create an extension method to make the calls look more like assertions:
public static class AssertionExt
{
public static bool AreSequencesEqual<T>( IEnumerable<T> expected,
IEnumerable<T> sequence )
{
Assert.AreEqual(expected.Count(), sequence .Count());
IEnumerator<Token> e1 = expected.GetEnumerator();
IEnumerator<Token> e2 = sequence .GetEnumerator();
while (e1.MoveNext() && e2.MoveNext())
{
Assert.AreEqual(e1.Current, e2.Current);
}
}
}
Alternatively you could use SequenceEqual(), to compare the sequences, realizing that it won't provide any information about which elements are not equal.

As LBushkin says, Assert.AreEqual will just call Equals on the sequences.
You can use the SequenceEqual extension method though:
Assert.IsTrue(expected.SequenceEqual(unionedCRs));
That won't give much information if it fails, however.
You may want to use the test code we wrote for MoreLINQ which was sequence-focused - if the sequences aren't equal, it will specify in what way they differ. (I'm trying to get a link to the source file in question, but my network connection is rubbish.)

How to find and remove duplicate objects in a collection using LINQ?

I have a simple class representing an object. It has 5 properties (a date, 2 decimals, an integer and a string). I have a collection class, derived from CollectionBase, which is a container class for holding multiple objects from my first class.
My question is, I want to remove duplicate objects (e.g. objects that have the same date, same decimals, same integers and same string). Is there a LINQ query I can write to find and remove duplicates? Or find them at the very least?

You can remove duplicates using the Distinct operator.
There are two overloads - one uses the default equality comparer for your type (which for a custom type will call the Equals() method on the type). The second allows you to supply your own equality comparer. They both return a new sequence representing your original set without duplicates. Neither overload actually modifies your initial collection - they both return a new sequence that excludes duplicates..
If you want to just find the duplicates, you can use GroupBy to do so:
var groupsWithDups = list.GroupBy( x => new { A = x.A, B = x.B, ... }, x => x )
.Where( g => g.Count() > 1 );
To remove duplicates from something like an IList<> you could do:
yourList.RemoveAll( yourList.Except( yourList.Distinct() ) );

If your simple class uses Equals in a manner that satisfies your requirements then you can use the Distinct method
var col = ...;
var noDupes = col.Distinct();
If not then you will need to provide an instance of IEqualityComparer<T> which compares values in the way you desire. For example (null problems ignored for brevity)
public class MyTypeComparer : IEqualityComparer<MyType> {
public bool Equals(MyType left, MyType right) {
return left.Name == right.Name;
}
public int GetHashCode(MyType type) {
return 42;
}
}
var noDupes = col.Distinct(new MyTypeComparer());
Note the use of a constant for GetHashCode is intentional. Without knowing intimate details about the semantics of MyType it is impossible to write an efficient and correct hashing function. In lieu of an efficient hashing function I used a constant which is correct irrespective of the semantics of the type.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Getting List.Join to compare properly - c#

Related

Remove duplicates in custom IComparable class

Implementing GetHashCode for IEqualityComparer<T> with conditional equality

Best way to compare two Dictionary<T> for equality

Union-ing two custom classes returns duplicates

How to find and remove duplicate objects in a collection using LINQ?

Categories

Resources