Hashcode to check uniqueness in a string array - c#

I am storing a large number of arrays of data in a List, but I don't want to store an array if its data already exists in my list; the order of the data doesn't matter. I figured that using GetHashCode to generate a hash code would be appropriate because I assumed it did not care about order. However, the simple test below shows that the first two arrays, a1 and a2, produce different hash codes.
Can I not utilize this method of checking? Can someone suggest a better way to check please?
string[] a1 = { "cat", "bird", "dog" };
string[] a2 = { "cat", "dog", "bird" };
string[] a3 = { "cat", "fish", "dog" };
Console.WriteLine(a1.GetHashCode());
Console.WriteLine(a2.GetHashCode());
Console.WriteLine(a3.GetHashCode());
The test above produces three different hash codes.
Ideally, I would have liked to see the same hash code for a1 and a2, so I am looking for something that would let me quickly check whether those strings already exist.

Your arrays aren't equal, by the standard used by arrays for determining equality. The standard used by arrays for determining equality is that two separately created arrays are never equal.
If you want separately created collections with equal elements to compare as equal, then use a collection type which supports that.
I recommend HashSet<T>, in your case HashSet<string>. It doesn't provide the GetHashCode() and Equals() behaviour you want directly, but it has a CreateSetComparer() method that provides you with a helper class that does give you hash code and comparer methods that do what you want.
Just remember that you cannot use this for a quick equality check. You can only use this for a quick inequality check. Two objects that are not equal may still have the same hash code, basically by random chance. It's only when the hash codes aren't equal that you can skip the equality check.
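A minimal sketch of the CreateSetComparer() approach, assuming each array is wrapped in a HashSet<string> before it is stored. The names UniqueArrays and KeepUnique are hypothetical, chosen only for illustration. Note that wrapping in a set also ignores repeated elements, so { "cat", "cat", "dog" } would be treated as a duplicate of { "cat", "dog" }.

```csharp
using System;
using System.Collections.Generic;

public static class UniqueArrays
{
    // Returns the arrays whose contents (ignoring order) were not seen before.
    public static List<string[]> KeepUnique(IEnumerable<string[]> arrays)
    {
        // The set comparer hashes and compares sets by their elements,
        // so two sets with the same elements count as the same key.
        var seen = new HashSet<HashSet<string>>(HashSet<string>.CreateSetComparer());
        var unique = new List<string[]>();

        foreach (var array in arrays)
        {
            // Add returns false when an equal set is already stored.
            if (seen.Add(new HashSet<string>(array)))
                unique.Add(array);
        }
        return unique;
    }

    public static void Main()
    {
        string[] a1 = { "cat", "bird", "dog" };
        string[] a2 = { "cat", "dog", "bird" };
        string[] a3 = { "cat", "fish", "dog" };

        var unique = KeepUnique(new[] { a1, a2, a3 });
        Console.WriteLine(unique.Count); // prints 2: a2 is rejected as a duplicate of a1
    }
}
```

The outer HashSet does the hash comparison first and only falls back to a full element comparison when the hash codes match, which is the "quick inequality check" described above.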

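Calling a1.GetHashCode() on an array does not look at the array's contents. Each separately created array instance gets its own hash code, so the following program will almost certainly print three different values: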
using System;
public class Program
{
    public static void Main()
    {
        string[] a1 = { "cat", "bird", "dog" };
        string[] a2 = { "cat", "dog", "bird" };
        string[] a3 = { "cat", "fish", "dog" };
        Console.WriteLine(a1.GetHashCode());
        Console.WriteLine(a2.GetHashCode());
        Console.WriteLine(a3.GetHashCode());
    }
}

Related

C# Multidimensional array with string[] and string

I basically want to make an array which contains one string[] and one normal string.
onion / {strawberry, banana, grape} in one array.
string[,] foodArray = new string[,] { onion, { strawberry, banana, grape } }
I'm wondering if it's even possible...
Thank you.
This data layout makes it unclear what you want to do. Use your types to communicate your intentions clearly.
If you plan to look up by the first string, I would recommend Dictionary<string, List<string>>. The Dictionary collection is an extremely useful collection type.
If you strictly want arrays, then you must use a jagged array, as this lets you constrain the first "column" to a length of 1 while the list (the second column) can have a variable length. This would mean string[][] foodArray = new string[2][]; where foodArray[0] holds the single string and foodArray[1] holds the list.
In either case, multidimensional arrays are not suited here: they waste space because they allocate every cell for the dimensions you set. As a rule of thumb, prefer jagged arrays over multidimensional arrays unless you are sure the entire multidimensional array will be filled to its maximum in all of its dimensions.
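Both suggestions can be sketched as follows. This is only an illustration under assumed names (FoodLookup and Build are hypothetical), using the food strings from the question.

```csharp
using System;
using System.Collections.Generic;

public static class FoodLookup
{
    // Dictionary version: look up the list by its key string.
    public static Dictionary<string, List<string>> Build()
    {
        return new Dictionary<string, List<string>>
        {
            { "onion", new List<string> { "strawberry", "banana", "grape" } }
        };
    }

    public static void Main()
    {
        var foods = Build();
        Console.WriteLine(string.Join(", ", foods["onion"])); // strawberry, banana, grape

        // Jagged-array version: each row is its own array, so row lengths can differ.
        string[][] jagged =
        {
            new[] { "onion" },
            new[] { "strawberry", "banana", "grape" }
        };
        Console.WriteLine(jagged[0].Length); // 1
        Console.WriteLine(jagged[1].Length); // 3
    }
}
```

The dictionary makes the relationship between the key and its list explicit, while the jagged array relies on the reader knowing that row 0 is the key and row 1 is the list.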
I think you do not really want a two dimensional array in this case.
What you really want is an array of tuples.
using System;
namespace tuple_array
{
    using Item = Tuple<String,String[]>;
    class Program
    {
        static void Main(string[] args)
        {
            Item[] foodarray = new Item[] {
                new Item("onion", new string[]
                    { "strawberry", "banana", "grape" })
            };
            foreach (var item in foodarray) {
                var (f,fs) = item;
                var foo = string.Join(",", fs);
                Console.WriteLine($"[{f},[{foo}]]");
            }
        }
    }
}
It looks somewhat clumsy in C#; the F# version is much more terse and pretty:
type A = (string * string []) []
let a : A = [| "onion", [| "strawberry"; "banana"; "grape" |] |]
printfn "%A" a

Why HashSets with same elements return different values when calling to GetHashCode()?

Why does HashSet<T>.GetHashCode() return different hash codes for sets that have the same elements?
For instance:
[Fact]
public void EqualSetsHaveSameHashCodes()
{
    var set1 = new HashSet<int>(new [] { 1, 2, 3 } );
    var set2 = new HashSet<int>(new [] { 1, 2, 3 } );
    Assert.Equal(set1.GetHashCode(), set2.GetHashCode());
}
This test fails. Why?
How can I get the result I need? "Equal sets give the same hashcode"
HashSet<T> by default does not have value equality semantics. It has reference equality semantics, so two distinct hash sets won't be equal or have the same hash code even if the contained elements are the same.
You need to use a special purpose IEqualityComparer<HashSet<int>> to get the behavior you want. You can roll your own or use the default one the framework provides for you:
var hashSetOfIntComparer = HashSet<int>.CreateSetComparer();
//will evaluate to true
var haveSameHash = hashSetOfIntComparer.GetHashCode(set1) ==
hashSetOfIntComparer.GetHashCode(set2);
So, to make a long story short:
How can I get the result I need? "Equal sets give the same hashcode"
You can't if you are planning on using the default implementation of HashSet<T>.GetHashCode(). You either use a special purpose comparer or you extend HashSet<T> and override Equals and GetHashCode to suit your needs.
By default (and unless otherwise specifically documented), reference types are only considered equal if they reference the same object. As a developer, you can override the Equals() and GetHashCode() methods so that objects that you consider equal return true for the Equals and the same int for GetHashCode.
Depending on which test framework you are using, there will be either CollectionAssert.AreEquivalent() or an overload of Assert.Equal that takes a comparer.
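Outside of a test framework, the same distinction can be shown with plain framework calls. The sketch below is illustrative only (the class and method names are hypothetical); it contrasts the default reference-based Equals with HashSet<T>.SetEquals and with the comparer returned by CreateSetComparer().

```csharp
using System;
using System.Collections.Generic;

public static class SetEqualityDemo
{
    // True when both sets contain exactly the same elements.
    public static bool SameElements(HashSet<int> x, HashSet<int> y)
    {
        return x.SetEquals(y);
    }

    // True when the set comparer gives both sets the same hash code.
    public static bool SameSetHash(HashSet<int> x, HashSet<int> y)
    {
        var comparer = HashSet<int>.CreateSetComparer();
        return comparer.GetHashCode(x) == comparer.GetHashCode(y);
    }

    public static void Main()
    {
        var set1 = new HashSet<int>(new[] { 1, 2, 3 });
        var set2 = new HashSet<int>(new[] { 3, 2, 1 });

        Console.WriteLine(set1.Equals(set2));         // False: different objects
        Console.WriteLine(SameElements(set1, set2));  // True
        Console.WriteLine(SameSetHash(set1, set2));   // True
    }
}
```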
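You could implement a custom HashSet that overrides GetHashCode to compute the hash code from all of the contents. The combination must not depend on enumeration order (here the item hashes are simply summed), and Equals should be overridden to match, like below:
public class HashSetWithGetHashCode<T> : HashSet<T>
{
    // Summing is order-independent, so equal sets always get equal hash codes.
    public override int GetHashCode()
    {
        unchecked // Overflow is fine, just wrap
        {
            int hash = 17;
            foreach (var item in this)
                hash += item == null ? 0 : item.GetHashCode();
            return hash;
        }
    }
    // Keep Equals consistent with GetHashCode: equal contents means equal sets.
    public override bool Equals(object obj)
    {
        return obj is HashSetWithGetHashCode<T> other && SetEquals(other);
    }
}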

Whether a Dictionary can have Array as Key?

I am facing a problem with dictionaries.
Can an array be used as the key of a Dictionary?
Dictionary<string[], int> di = new Dictionary<string[], int>();
di.Add(new string[]
{
"1","2"
}, 1);
di.Add(new string[]
{
"2","3"
}, 2);
MessageBox.Show(di[new string[] { "2", "3" }].ToString()); // Here KeyNotFoundException occurred.
Why is the exception thrown?
By default, only the references of the arrays are compared, so you have to either
provide a custom IEqualityComparer<string[]>, or
use a Tuple<string, string> as the key instead (since you only have two strings).
Here's a similar question's answer which shows how to create a custom comparer for the Dictionary constructor.
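A custom comparer along those lines might look like the sketch below. StringArrayComparer is a hypothetical name, and the hash combination (17 and 23 as multipliers) is just one common choice. This version treats element order as significant, so { "2", "3" } and { "3", "2" } are different keys.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class StringArrayComparer : IEqualityComparer<string[]>
{
    // Two arrays are equal when they hold the same strings in the same order.
    public bool Equals(string[] x, string[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.SequenceEqual(y);
    }

    // Must agree with Equals: equal arrays always produce the same hash code.
    public int GetHashCode(string[] obj)
    {
        if (obj == null) return 0;
        unchecked // Overflow is fine, just wrap
        {
            int hash = 17;
            foreach (string s in obj)
                hash = hash * 23 + (s == null ? 0 : s.GetHashCode());
            return hash;
        }
    }
}

public static class ArrayKeyDemo
{
    public static void Main()
    {
        // Pass the comparer to the Dictionary constructor.
        var di = new Dictionary<string[], int>(new StringArrayComparer());
        di.Add(new[] { "1", "2" }, 1);
        di.Add(new[] { "2", "3" }, 2);

        // A separately created array with the same contents now finds the entry.
        Console.WriteLine(di[new[] { "2", "3" }]); // prints 2
    }
}
```

Keys should not be modified after they are added, because changing an array's contents changes its hash code under this comparer and the entry can no longer be found.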
No, you should not use arrays as Dictionary<> keys directly. Dictionary<> locates keys by their hash codes, and an array's hash code is based on the identity of the array object (its reference), not on its contents:
String[] a = new[]{"1", "2"};
String[] b = new[]{"1", "2"};
a.GetHashCode() == b.GetHashCode(); // <- false
Arrays a and b are different objects and so have different hash codes, which is why
di.Add(a, 1);
di[b]; // <- error since a.GetHashCode() != b.GetHashCode()
Because the Equals and GetHashCode methods of an array don't compare the contents but the reference of the array itself.

How to find Complement of two HashSets

I have two HashSets – setA and setB.
How can we find the complement of setA and setB?
Is the code for intersection the best way to find intersection?
CODE
string stringA = "A,B,A,A";
string stringB = "C,A,B,D";
HashSet<string> setA = new HashSet<string>(stringA.Split(',').Select(t => t.Trim()));
HashSet<string> setB = new HashSet<string>(stringB.Split(',').Select(t => t.Trim()));
//Intersection - Present in A and B
HashSet<string> intersectedSet = new HashSet<string>( setA.Intersect(setB));
//Complement - Present in A; but not present in B
UPDATE:
Use OrdinalIgnoreCase to ignore case; see: How to use HashSet<string>.Contains() method in case-insensitive mode?
1 - How can we find the complement of setA and setB?
Use the Enumerable.Except extension method (LINQ), which works on HashSet<T> because it implements IEnumerable<T>:
//Complement - Present in A; but not present in B
HashSet<string> complementSet = new HashSet<string>(setA.Except(setB));
Try it with the following strings:
string stringA = "A,B,A,E";
string stringB = "C,A,B,D";
complementSet will contain E.
2 - Is the code for intersection the best way to find intersection?
Probably, yes.
You can use Except to get the complement of A or B. To get a symmetric complement, use SymmetricExceptWith.
setA.SymmetricExceptWith(setB);
Note that this modifies setA. To get the intersection, there are two methods: the LINQ Intersect extension method, which leaves both sets unchanged and returns an IEnumerable<string> that you can copy into a new HashSet, and IntersectWith, which modifies the first set:
// setA and setB unchanged
HashSet<string> intersection = new HashSet<string>(setA.Intersect(setB));
// setA gets modified and holds the result
setA.IntersectWith(setB);
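The difference between the mutating and non-mutating operations can be seen in a short sketch. It uses the sample strings from this thread; the names SetOperationsDemo and Show are hypothetical, and Show exists only to print each set in a stable order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SetOperationsDemo
{
    // Formats a set in ordinal order so the output is predictable.
    public static string Show(IEnumerable<string> items)
    {
        return string.Join(",", items.OrderBy(s => s, StringComparer.Ordinal));
    }

    public static void Main()
    {
        var setA = new HashSet<string>("A,B,A,E".Split(','));
        var setB = new HashSet<string>("C,A,B,D".Split(','));

        // Non-mutating LINQ versions: copy the result into a new set.
        var onlyInA = new HashSet<string>(setA.Except(setB));
        var inBoth = new HashSet<string>(setA.Intersect(setB));

        // Mutating version: work on a copy so setA itself is preserved.
        var inExactlyOne = new HashSet<string>(setA);
        inExactlyOne.SymmetricExceptWith(setB);

        Console.WriteLine(Show(onlyInA));      // E
        Console.WriteLine(Show(inBoth));       // A,B
        Console.WriteLine(Show(inExactlyOne)); // C,D,E
        Console.WriteLine(Show(setA));         // A,B,E (unchanged)
    }
}
```

For the case-insensitive variant mentioned in the question, passing StringComparer.OrdinalIgnoreCase to the HashSet<string> constructor should make the set operations ignore case as well.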

Difference of two lists C#

I have two lists of strings, each about 300,000 lines long. List 1 has a few lines more than List 2. What I'm trying to do is find the strings that are in List 1 but not in List 2.
Considering how many strings I have to compare, is Except() good enough or is there something better (faster)?
Internally, the Enumerable.Except extension method uses a hash-based set (Set<T>) to perform the computation. It's going to be at least as fast as any other method.
Go with list1.Except(list2).
It'll give you the best performance and the simplest code.
My suggestion:
HashSet<String> hash1 = new HashSet<String>(new string[] { "a", "b", "c", "d" });
HashSet<String> hash2 = new HashSet<String>(new string[] { "a", "b" });
List<String> result = hash1.Except(hash2).ToList();
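One detail worth knowing before relying on Except for lists: it has set semantics, so repeated strings in the first list appear only once in the result. The sketch below shows this, together with a hypothetical helper (KeepingDuplicates, not part of the framework) for the case where repeats should be kept. It still performs hash-based lookups, so it should scale similarly.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ListDifference
{
    // Items of 'first' that are not in 'second', keeping repeats and order.
    public static List<string> KeepingDuplicates(List<string> first, List<string> second)
    {
        var exclude = new HashSet<string>(second); // fast membership checks
        return first.Where(s => !exclude.Contains(s)).ToList();
    }

    public static void Main()
    {
        var list1 = new List<string> { "a", "b", "b", "c", "d", "d" };
        var list2 = new List<string> { "a", "c" };

        // Except returns each distinct remaining value once.
        Console.WriteLine(string.Join(",", list1.Except(list2)));              // b,d

        // The helper keeps every occurrence.
        Console.WriteLine(string.Join(",", KeepingDuplicates(list1, list2)));  // b,b,d,d
    }
}
```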
