In this situation where one member is edited to become equal to another, what is the proper way to force the HashSet to recalculate hashes and thereby purge itself of duplicates?
I knew better than to expect this to happen automatically, so I tried such things as intersecting the HashSet with itself, then reassigning it to a constructor call which refers to itself and the same EqualityComparer. I thought for sure the latter would work, but no.
One thing which does succeed is reconstructing the HashSet from its conversion to some other container type such as List, rather than directly from itself.
Class defs:
public class Test {
public int N;
public override string ToString() { return this.N.ToString(); }
}
public class TestClassEquality: IEqualityComparer<Test> {
public bool Equals(Test x, Test y) { return x.N == y.N; }
public int GetHashCode(Test obj) { return obj.N.GetHashCode(); }
}
Test code:
TestClassEquality eq = new TestClassEquality();
HashSet<Test> hs = new HashSet<Test>(eq);
Test a = new Test { N = 1 }, b = new Test { N = 2 };
hs.Add(a);
hs.Add(b);
b.N = 1;
string fmt = "Count = {0}; Values = {1}";
Console.WriteLine(fmt, hs.Count, string.Join(",", hs));
hs.IntersectWith(hs);
Console.WriteLine(fmt, hs.Count, string.Join(",", hs));
hs = new HashSet<Test>(hs, eq);
Console.WriteLine(fmt, hs.Count, string.Join(",", hs));
hs = new HashSet<Test>(new List<Test>(hs), eq);
Console.WriteLine(fmt, hs.Count, string.Join(",", hs));
Output:
"Count: 2; Values: 1,1"
"Count: 2; Values: 1,1"
"Count: 2; Values: 1,1"
"Count: 1; Values: 1"
Based on the final approach succeeding, I could probably create an extension method in which the HashSet dumps itself into a local List, clears itself, and then repopulates from said list.
Is that really necessary or is there some simpler way to do this?
Lasse's comment is correct: you are required by the contract of HashSet to not do this, so asking what to do when you do this is a non-starter. If it hurts when you do that, stop doing that. A mutable object must not be put into a hash set if a mutation will cause its hash value to change while it is in the set. You're in a cleft stick of your own making.
To get out of that cleft stick, you could:
Stop mutating the objects while they are in a hash set. Remove them before you mutate them, put them back in later.
Fix the implementation of equality and hashing on the object so that it is consistent across mutations.
When you create the hash set, provide a custom hashing/equality algorithm that does not change its opinions when the object is mutated (see the sketch after this list).
Implement your own "set" class that has whatever behaviour you like in this scenario. That is extremely difficult, so be careful. (There is a reason why this restriction was created in the first place!)
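For illustration, the third option could be as simple as a reference-identity comparer (a sketch of our own, not part of the original answer): because it ignores the object's fields entirely, its verdicts never change when the object mutates.
using System.Collections.Generic;
using System.Runtime.CompilerServices;

// Equality and hashing based purely on reference identity:
// mutating an object never changes this comparer's answers.
public sealed class ReferenceComparer<T> : IEqualityComparer<T> where T : class
{
    public bool Equals(T x, T y) => ReferenceEquals(x, y);
    public int GetHashCode(T obj) => RuntimeHelpers.GetHashCode(obj);
}
Of course, under this comparer two distinct Test instances with the same N are no longer duplicates, which may or may not be what you want.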
There is no other way than recreating the HashSet<>. Unfortunately the HashSet<> constructor has an optimization: when it is created from another HashSet<> with the same comparer, it copies the stored hash codes instead of recomputing them... So we can cheat:
hs = new HashSet<Test>(hs.Skip(0), eq);
hs.Skip(0) is an IEnumerable<>, not a HashSet<>, which defeats that check.
Note that there is no guarantee that some future version of Skip() won't short-circuit when the count is 0, something like:
if (count == 0)
{
    return source; // hypothetical: hand the original enumerable (our HashSet<>!) straight back
}
else
{
    return SkipIterator(source, count); // skip "count" elements as today
}
(per Lippert's comment, this concern turned out to be a false problem)
The "manual" method to do it is:
var hs2 = new HashSet<Test>(eq);
foreach (var value in hs)
{
hs2.Add(value);
}
hs = hs2;
So enumerate "manually" and re-add.
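That loop is also exactly what the extension method contemplated in the question would do (a sketch; the name Rehash is our own invention):
using System.Collections.Generic;

public static class HashSetExtensions
{
    // Rebuild the set in place: entries whose hash values changed while
    // they were stored collapse into a single entry on re-insertion.
    public static void Rehash<T>(this HashSet<T> set)
    {
        var items = new List<T>(set); // snapshot the current contents
        set.Clear();                  // Clear keeps the original comparer
        foreach (var item in items)
            set.Add(item);            // hash codes are recomputed here
    }
}
In the question's example, hs.Rehash() would then collapse the two entries into one.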
As you saw, HashSets can't cope with mutable objects when modifying an object affects its hash code or its equality to other objects. Just remove the object, mutate it, then re-add it:
hs.Remove(b);
b.N = 1;
hs.Add(b);
Note that the Remove must come before the mutation: once b's hash value has changed, the set can no longer locate it.
Related
I have the following data type:
ISet<IEnumerable<Foo>>
So, I need to be able to create sets of sequences. E.g. this is ok:
ABC,AC,A
but this is not (since "AB" is repeated here):
AB,A,ABC,BCA,AB
But, in order to do this - for "set" to not contain duplicates, I need to wrap my IEnumerable in some kind of other data type:
ISet<Seq>
//where
Seq : IEnumerable<Foo>, IEquatable<Seq>
Thus, I will be able to compare two sequences, and provide the Set data structure with a way of eliminating duplicates.
My question is: is there a fast data structure that allows for comparing sequences? I am thinking that somehow when a Seq gets created, or added to, some kind of cumulative value is computed.
In other words, is it possible to implement Seq in such a way that I could do this:
var seq1 = new Seq( IList<Foo> );
var seq2 = new Seq( IList<Foo> )
seq1.equals(seq2) // O(1)
Thanks.
I have provided an implementation of your sequence below. There are several points to note:
This only works if the IEnumerable<T> returns the same items every time it is enumerated, and if those items are not mutated while this wrapper is in use.
The hash code is cached. The first time it is requested it is calculated (feel free to improve the hash algorithm if you know a better one) from a full iteration of the underlying sequence. Because it only needs to be calculated once, it can effectively be considered O(1) if you request it often. Adding to the set is likely to be a bit slower (first-time computation of the hash value), but searching or removing will be very quick.
The Equals method first compares the hash codes. If the hash codes are different then the objects cannot possibly be equal (assuming hash codes were properly implemented on all objects in the sequence, and nothing was mutated). As long as you have a low rate of collision, and usually compare items that aren't actually equal, equals checks will rarely get past that hash code check. If they do, an iteration of the sequence is needed (there is no way around that). Because of that, Equals is likely to average O(1), even though its worst case is still O(n).
using System.Collections;
using System.Collections.Generic;
using System.Linq;

public class Foo<T> : IEnumerable<T>
{
    private readonly IEnumerable<T> sequence;
    private int? myHashCode = null;

    public Foo(IEnumerable<T> sequence)
    {
        this.sequence = sequence;
    }

    public IEnumerator<T> GetEnumerator()
    {
        return sequence.GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return sequence.GetEnumerator();
    }

    public override bool Equals(object obj)
    {
        Foo<T> other = obj as Foo<T>;
        if (other == null)
            return false;
        //if the hash codes are different we don't need to bother doing a deep equals check
        //the hash code is cached, so it's fast
        if (GetHashCode() != other.GetHashCode())
            return false;
        return Enumerable.SequenceEqual(sequence, other.sequence);
    }

    public override int GetHashCode()
    {
        //note that the hash code is cached, so the underlying sequence
        //must not change
        return myHashCode ?? populateHashCode();
    }

    private int populateHashCode()
    {
        unchecked //overflow is expected and harmless here
        {
            int somePrimeNumber = 37;
            int hash = 1;
            foreach (T item in sequence)
            {
                hash = (hash * somePrimeNumber) + item.GetHashCode();
            }
            myHashCode = hash;
            return hash;
        }
    }
}
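A quick usage sketch (our own, assuming char elements):
var set = new HashSet<Foo<char>>();
set.Add(new Foo<char>(new List<char> { 'A', 'B' }));
set.Add(new Foo<char>(new List<char> { 'A', 'B' })); // equal sequence: rejected as a duplicate
set.Add(new Foo<char>(new List<char> { 'A' }));
Console.WriteLine(set.Count); // 2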
O(1) essentially means you are not allowed to compare the values of the elements. If you can represent each sequence as a chain of immutable nodes (with caching on add, so there are no duplicate nodes across all instances) you can achieve it, as you'd then only need to compare the first node references - similar to how string interning works; see the sketch below.
Insert will have to search all existing instances for the "current" + "with this next" element. Some sort of dictionary may be a reasonable approach...
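To make the interning idea concrete, here is a minimal hash-consing sketch (entirely our own illustration): Cons caches every (head, tail) pair, so structurally equal sequences end up as the same canonical instance and equality collapses to a single reference comparison.
using System.Collections.Generic;

public sealed class Seq
{
    public readonly char Head;
    public readonly Seq Tail; // null marks the empty sequence

    private static readonly Dictionary<(char, Seq), Seq> cache =
        new Dictionary<(char, Seq), Seq>();

    private Seq(char head, Seq tail) { Head = head; Tail = tail; }

    // Returns the one canonical node for this (head, tail) combination.
    public static Seq Cons(char head, Seq tail)
    {
        if (!cache.TryGetValue((head, tail), out var seq))
            cache[(head, tail)] = seq = new Seq(head, tail);
        return seq;
    }
}

// Building the same structure twice yields the very same instance, so:
// ReferenceEquals(Seq.Cons('A', Seq.Cons('B', null)),
//                 Seq.Cons('A', Seq.Cons('B', null)))   // true: O(1) equality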
EDIT: I think I simply tried to come up with a suffix tree.
[TestFixture]
class HashSetExample
{
[Test]
public void eg()
{
var comparer = new OddEvenBag();
var hs = new HashSet<int>(comparer);
hs.Add(1);
Assert.IsTrue(hs.Contains(3));
Assert.IsFalse(hs.Contains(0));
// THIS LINE HERE
var containedValue = hs.First(x => comparer.Equals(x, 3)); // i want something faster than this
Assert.AreEqual(1, containedValue);
}
public class OddEvenBag : IEqualityComparer<int>
{
public bool Equals(int x, int y)
{
return x % 2 == y % 2;
}
public int GetHashCode(int obj)
{
return obj % 2;
}
}
}
As well as checking whether hs contains an odd number, I want to know which odd number it contains. Obviously I want a method that scales reasonably, not one that simply iterates and searches over the entire collection.
Another way to rephrase the question is, I want to replace the line below THIS LINE HERE with something efficient (say O(1), instead of O(n)).
Towards what end? I'm trying to intern a laaaaaaaarge number of immutable reference objects similar in size to a Point3D. Seems like using a HashSet<Foo> instead of a Dictionary<Foo,Foo> saves about 10% in memory. No, obviously this isn't a game changer but I figured it would not hurt to try it for a quick win. Apologies if this has offended anybody.
Edit: Link to similar/identical post provided by Balazs Tihanyi in comments, put here for emphasis.
The simple answer is no, you can't.
If you want to retrieve the object you will need to use a Dictionary; there just isn't any suitable method in the HashSet API to do what you are asking for.
One optimization you could make, though, if you must use a set for this, is to first do a Contains check and only iterate over the set when it returns true. Still, you would almost certainly find that the extra overhead of a Dictionary is tiny (essentially it's just one more object reference per entry).
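(For readers on newer frameworks than this answer assumed: .NET Core 2.0 and .NET Framework 4.7.2 later added HashSet<T>.TryGetValue, which performs exactly this lookup through the hash buckets.)
var comparer = new OddEvenBag();
var hs = new HashSet<int>(comparer) { 1 };
// Finds the stored value the comparer considers equal to 3.
if (hs.TryGetValue(3, out int containedValue))
    Console.WriteLine(containedValue); // prints 1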
I know that the order of a dictionary is undefined, MSDN says so:
For purposes of enumeration, each item in the dictionary is treated as a KeyValuePair structure representing a value and its key. The order in which the items are returned is undefined.
That's fine, but if I have two instances of a dictionary, each with the same content, will the order be the same?
I'm guessing so because as I understand, the order is determined by the hash of the keys, and if the two dictionaries have the same keys, they have the same hashes, and therefore the same order...
... Right?
Thanks!
Andy.
No, it is not guaranteed to be the same order. Imagine the scenario where you had several items in the Dictionary<TKey, TValue> with the same hash code. If they are added to the two dictionaries in different orders, they will come back in different orders during enumeration.
Consider, for example, the following (equality-conforming) code:
class Example
{
public char Value;
public override int GetHashCode()
{
return 1;
}
public override bool Equals(object obj)
{
return obj is Example && ((Example)obj).Value == Value;
}
public override string ToString()
{
return Value.ToString();
}
}
class Program
{
static void Main(string[] args)
{
var e1 = new Example() { Value = 'a' };
var e2 = new Example() { Value = 'b' };
var map1 = new Dictionary<Example, string>();
map1.Add(e1, "1");
map1.Add(e2, "2");
var map2 = new Dictionary<Example, string>();
map2.Add(e2, "2");
map2.Add(e1, "1");
Console.WriteLine(map1.Values.Aggregate((x, y) => x + y));
Console.WriteLine(map2.Values.Aggregate((x, y) => x + y));
}
}
The output of running this program is
12
21
Short version: No.
Long version:
[TestMethod]
public void TestDictionary()
{
Dictionary<String, Int32> d1 = new Dictionary<string, int>();
Dictionary<String, Int32> d2 = new Dictionary<string, int>();
d1.Add("555", 1);
d1.Add("abc2", 2);
d1.Add("abc3", 3);
d1.Remove("abc2");
d1.Add("abc2", 2);
d1.Add("556", 1);
d2.Add("555", 1);
d2.Add("556", 1);
d2.Add("abc2", 2);
d2.Add("abc3", 3);
foreach (var i in d1)
{
Console.WriteLine(i);
}
Console.WriteLine();
foreach (var i in d2)
{
Console.WriteLine(i);
}
}
Output:
[555, 1]
[abc2, 2]
[abc3, 3]
[556, 1]
[555, 1]
[556, 1]
[abc2, 2]
[abc3, 3]
If MSDN says it's undefined, you have to rely on that. "Undefined" means the implementation of the dictionary is allowed to store items in whatever order it wants, so a programmer should never make any assumptions about the order. Personally, without looking, I would assume the order of the elements depends on the order they went in, but I could be wrong. Whatever the answer is, if you want the order to be the same for both dictionaries, you are doing it wrong.
"if the two dictionaries have the same
keys, they have the same hashes, and
therefore the same order..."
I do not think this is the case. Even if it happens to be true, I would not rely on it: it is an implementation detail that might change, or differ between implementations of the CLR and BCL (Mono comes to mind).
The Microsoft Dictionary implementation is a little complex, but from looking at the code for 5 minutes, I am willing to guess that the sequence of enumeration is based on how the dictionary got to its current state, including the number of resizes and the insertion order.
If the spec says the order is "undefined", you can't depend on the order without explicitly ordering it. The underlying implementation may be changed at any time with a new release or service pack, just for starters. Your dictionary may be upcast from any number of concrete implementations as well.
An underlying implementation may also be sensitive to the order of operations applied. Adding keys 'a', 'b' and 'c', in that order, may result in a different data structure than adding the same set of keys in a different order (say 'b', 'c', then 'a'). Deletions may likewise affect the data structure.
A plain binary search tree, for instance, if used as the data structure behind a dictionary, degenerates into what is essentially a linked list when the keys are added in sorted order; the tree will be more balanced if the nodes are inserted in random order.
And some data structures morph as operations are performed. If, for instance, a dictionary is implemented over a red/black tree, tree nodes will be split and rotated to keep the tree balanced as inserts and deletes occur. The actual data structure is then highly dependent on the order of operations, even if the final contents are the same.
I don't know the specifics of Microsoft's implementation, but in general your assumption holds only if there are no two items in the dictionary that hash to the same value or if those entries that do collide are added in the same order.
Testing the Equals method is pretty much straightforward (as far as I know). But how on earth do you test the GetHashCode method?
Test that two distinct objects which are equal have the same hash code (for various values). Check that non-equal objects give different hash codes, varying one aspect/property at a time. While the hash codes don't have to be different, you'd be really unlucky to pick different values for properties which happen to give the same hash code unless you've got a bug.
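For example (a minimal NUnit sketch; Point is a hypothetical immutable type whose Equals compares its X and Y values):
// Point is hypothetical: two instances with equal X and Y are Equals-equal.
[Test]
public void EqualValuesGiveEqualHashCodes()
{
    Assert.AreEqual(new Point(1, 2).GetHashCode(), new Point(1, 2).GetHashCode());
}

[Test]
public void VaryingOnePropertyShouldVaryTheHash()
{
    // Not strictly required by the contract, but a collision here would be unlucky.
    Assert.AreNotEqual(new Point(1, 2).GetHashCode(), new Point(1, 3).GetHashCode());
}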
Gallio/MbUnit v3.2 comes with convenient contract verifiers which can test your implementation of GetHashCode() and IEquatable<T>. More specifically, you may be interested in the EqualityContract and the HashCodeAcceptanceContract; see the Gallio documentation for more details. For example, given a type like this:
public class Spot
{
private readonly int x;
private readonly int y;
public Spot(int x, int y)
{
this.x = x;
this.y = y;
}
public override int GetHashCode()
{
int h = -2128831035;
h = (h * 16777619) ^ x;
h = (h * 16777619) ^ y;
return h;
}
}
Then you declare your contract verifier like this:
[TestFixture]
public class SpotTest
{
[VerifyContract]
public readonly IContract HashCodeAcceptanceTests = new HashCodeAcceptanceContract<Spot>()
{
CollisionProbabilityLimit = CollisionProbability.VeryLow,
UniformDistributionQuality = UniformDistributionQuality.Excellent,
DistinctInstances = DataGenerators.Join(Enumerable.Range(0, 1000), Enumerable.Range(0, 1000)).Select(o => new Spot(o.First, o.Second))
};
}
It would be fairly similar to testing Equals(). You'd want to make sure that two objects which are the "same" at least have the same hash code: if .Equals() returns true, the hash codes should be identical as well. As for what the proper hash code values are, that depends on how you're hashing.
From personal experience: aside from obvious things like equal objects giving you equal hash codes, you need to create a large enough array of unique objects and count the unique hash codes among them. If the unique hash codes make up less than, say, 50% of the overall object count, then you are in trouble, as your hash function is not good.
List<int> hashList = new List<int>(testObjectList.Count);
for (int i = 0; i < testObjectList.Count; i++)
{
    hashList.Add(testObjectList[i].GetHashCode()); // collect the hash codes, not the objects
}
hashList.Sort();
int differentValues = 1; // the first value is distinct by definition
int curValue = hashList[0];
for (int i = 1; i < hashList.Count; i++)
{
    if (hashList[i] != curValue)
    {
        differentValues++;
        curValue = hashList[i];
    }
}
Assert.Greater(differentValues, hashList.Count / 2);
In addition to checking that object equality implies equality of hashcodes, and the distribution of hashes is fairly flat as suggested by Yann Trevin (if performance is a concern), you may also wish to consider what happens if you change a property of the object.
Suppose your object changes while it's in a dictionary/hashset. Do you want the Contains(object) to still be true? If so then your GetHashCode had better not depend on the mutable property that was changed.
I would pre-supply a known/expected hash and compare it to the result of GetHashCode.
You create separate instances with the same value and check that the GetHashCode for the instances returns the same value, and that repeated calls on the same instance returns the same value.
That is the only requirement for a hash code to work. To work well the hash codes should of course have a good distribution, but testing for that requires a lot of testing...
Which is faster? This:
bool isEqual = (MyObject1 is MyObject2)
Or this:
bool isEqual = ("blah" == "blah1")
It would be helpful to figure out which one is faster. Obviously, if you apply .ToUpper() to each side of the string comparison like programmers often do, that would require reallocating memory which affects performance. But how about if .ToUpper() is out of the equation like in the above sample?
I'm a little confused here.
As other answers have noted, you're comparing apples and oranges. ::rimshot::
If you want to determine if an object is of a certain type use the is operator.
If you want to compare strings use the == operator (or other appropriate comparison method if you need something fancy like case-insensitive comparisons).
How fast one operation is compared to the other (no pun intended) doesn't seem to really matter.
After closer reading, I think you want to compare the speed of string comparisons with the speed of reference comparisons (the type of comparison used in the System.Object base type).
If that's the case, then the answer is that reference comparisons will never be slower than any other string comparison. Reference comparison in .NET is pretty much analogous to comparing pointers in C - about as fast as you can get.
However, how would you feel if a string variable s had the value "I'm a string", but the following comparison failed:
if (((object) s) == ((object) "I'm a string")) { ... }
If you simply compared references, that might happen depending on how the value of s was created. If it ended up not being interned, it would not have the same reference as the literal string, so the comparison would fail. So you might have a faster comparison that didn't always work. That seems to be a bad optimization.
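A quick sketch of that trap:
// Built at run time, so it is not automatically interned.
string s = new string(new[] { 'I', '\'', 'm', ' ', 'a', ' ', 's', 't', 'r', 'i', 'n', 'g' });
Console.WriteLine(s == "I'm a string");                               // True: value comparison
Console.WriteLine(ReferenceEquals(s, "I'm a string"));                // False: different objects
Console.WriteLine(ReferenceEquals(string.Intern(s), "I'm a string")); // True: interned copy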
According to the book Maximizing .NET Performance
the call
bool isEqual = String.Equals("test", "test");
is identical in performance to
bool isEqual = ("test" == "test");
The call
bool isEqual = "test".Equals("test");
is theoretically slower than the call to the static String.Equals method, but I think you'll need to compare several million strings in order to actually detect a speed difference.
My tip to you is this; don't worry about which string comparison method is slower or faster. In a normal application you'll never ever notice the difference. You should use the way which you think is most readable.
The first one compares types, not values.
If you want to compare strings case-insensitively you can use:
string toto = "toto";
string tata = "tata";
bool isEqual = string.Compare(toto, tata, StringComparison.InvariantCultureIgnoreCase) == 0;
Console.WriteLine(isEqual);
How about you tell me? :)
Take the code from this Coding Horror post, and insert your code to test in place of his algorithm.
Comparing strings with the "==" operator compares the contents of the strings rather than the object references. Comparing objects calls the "Equals" method of the object to determine whether they are equal. The default implementation of Equals does a reference comparison, returning true if both references point to the same physical object. This will likely be faster than a string comparison, but it depends on the type of object being compared.
I'd assume that comparing the objects in your first example is going to be about as fast as it gets, since it simply checks whether both objects point to the same address in memory.
As has been mentioned several times already, it is possible to compare addresses on strings as well, but this won't necessarily work if the two strings were allocated from different sources.
Lastly, it's usually good form to compare objects based on type whenever possible; it's typically the most concrete method of identification. If your objects need to be represented by something other than their address in memory, it's possible to use other attributes as identifiers.
If I understand the question correctly, and you really want to compare reference equality with the plain old "compare the contents": build a test case and call object.ReferenceEquals compared against a == b.
Note: you have to understand what the difference is, and that you probably cannot use a reference comparison in most scenarios. If you are sure that this is what you want, it might be a tiny bit faster. You have to try it yourself and evaluate whether it is worth the trouble at all.
I don't feel like any of these answers address the actual question. Let's say the string in this example is the type's name and we're trying to see if it's faster to compare a type name or the type to determine what it is.
I put this together and, to my surprise, it's about 10% faster to check the type name string than the type in every test I ran. I intentionally put the simplest strings and classes into play to see whether it was possible to be faster, and it turns out it is. I'm not sure about more complicated strings and type comparisons on heavily inherited classes. This is of course a micro-optimization and may change at some point in the evolution of the language, I suppose.
In my case, I was considering a value converter that switches based on this name, but it could also switch on the type, since each type specifies a unique type name. The value converter would figure out the Font Awesome icon to show based on the type of the item presented.
using System;
using System.Diagnostics;
using System.Linq;
namespace ConsoleApp1
{
public sealed class A
{
public const string TypeName = "A";
}
public sealed class B
{
public const string TypeName = "B";
}
public sealed class C
{
public const string TypeName = "C";
}
class Program
{
static void Main(string[] args)
{
var testlist = Enumerable.Repeat(0, 100).SelectMany(x => new object[] { new A(), new B(), new C() }).ToList();
int count = 0;
void checkTypeName()
{
foreach (var item in testlist)
{
switch (item.GetType().Name) // the runtime type names ("A", "B", "C") match the TypeName constants
{
case A.TypeName:
count++;
break;
case B.TypeName:
count++;
break;
case C.TypeName:
count++;
break;
default:
break;
}
}
}
void checkType()
{
foreach (var item in testlist)
{
switch (item)
{
case A _:
count++;
break;
case B _:
count++;
break;
case C _:
count++;
break;
default:
break;
}
}
}
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < 100000; i++)
{
checkTypeName();
}
sw.Stop();
Console.WriteLine(sw.Elapsed);
sw.Restart();
for (int i = 0; i < 100000; i++)
{
checkType();
}
sw.Stop();
Console.WriteLine(sw.Elapsed);
}
}
}