Culture Sensitive GetHashCode - c#

I'm writing a C# application that will process some text and provide basic query functions. To ensure the best possible support for other languages, I am allowing the users of the application to specify the System.Globalization.CultureInfo (via the "en-GB" style code) and also the full range of collation options using the System.Globalization.CompareOptions flags enum.
For regular string comparison I'm then using a combination of:
a) String.Compare overload that accepts the culture and options
b) For some bulk processes I'm caching the byte data (KeyData) from CompareInfo.GetSortKey (overload that accepts the options) and using a byte-by-byte comparison of the KeyData.
This seemed fine (although please comment if you think these two methods shouldn't be mixed), but then I had reason to use the HashSet<> class which only has an overload for IEqualityComparer<>.
MS documentation seems to suggest that I should use StringComparer (which implements both IEqualityComparer<> and IComparer<>), but this only seems to support the "IgnoreCase" option from CompareOptions and not "IgnoreKanaType", "IgnoreSymbols", "IgnoreWidth" etc.
I'm assuming that a StringComparer that ignores these other options could produce different hashcodes for two strings that might be considered the same using my other comparison options. I'd therefore get incorrect results from my application.
My only thought at the moment is to create my own IEqualityComparer<> that generates a hashcode from the SortKey.KeyData and compares equality by using the String.Compare overload.
Any suggestions?

You will certainly need to implement your own IEqualityComparer<>, but I don't believe the hashcode necessarily has to play into it. Just use the string.Compare overload like you said.
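One caveat worth keeping in mind: HashSet<> buckets items by GetHashCode before it ever calls Equals, so a comparer built on string.Compare still needs a hash that agrees with it. Below is a minimal sketch of the approach proposed in the question, hashing the SortKey.KeyData bytes and comparing with the same culture and options. The class name SortKeyEqualityComparer and the FNV-1a byte hash are my own choices for illustration, not anything from the framework, and the sketch assumes that strings which compare as equal also produce identical sort keys for the same options.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical comparer (the name is an assumption): hashes the culture-aware
// sort key and compares with the same CompareInfo and CompareOptions.
public sealed class SortKeyEqualityComparer : IEqualityComparer<string>
{
    private readonly CompareInfo _compareInfo;
    private readonly CompareOptions _options;

    public SortKeyEqualityComparer(CultureInfo culture, CompareOptions options)
    {
        if (culture == null) throw new ArgumentNullException(nameof(culture));
        _compareInfo = culture.CompareInfo;
        // Note: GetSortKey rejects CompareOptions.Ordinal and OrdinalIgnoreCase.
        _options = options;
    }

    public bool Equals(string x, string y)
    {
        // CompareInfo.Compare accepts null arguments.
        return _compareInfo.Compare(x, y, _options) == 0;
    }

    public int GetHashCode(string s)
    {
        if (s == null) return 0;
        byte[] key = _compareInfo.GetSortKey(s, _options).KeyData;
        unchecked
        {
            // FNV-1a over the sort key bytes (an arbitrary choice of byte hash).
            int hash = (int)2166136261;
            foreach (byte b in key)
            {
                hash = (hash ^ b) * 16777619;
            }
            return hash;
        }
    }
}

public static class SortKeyComparerDemo
{
    public static void Main()
    {
        var comparer = new SortKeyEqualityComparer(
            new CultureInfo("en-GB"), CompareOptions.IgnoreCase);
        var set = new HashSet<string>(comparer) { "abc", "ABC" };
        // Both strings should land in the same bucket and compare as equal.
        Console.WriteLine(set.Count);
    }
}
```

Computing a sort key on every GetHashCode call is not cheap, so if the KeyData is already cached for bulk comparisons it may be worth reusing that cache here.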

Related

GetHashCode of System.Type returns different values

Why does GetHashCode return different values for the same type? If I execute this code:
Console.WriteLine(typeof(Guid).GetHashCode());
in different applications, I get different hash codes.
If I execute the following statement multiple times in different applications:
Console.WriteLine("ABC".GetHashCode());
I always get the same hash code. Why does the hash code change for System.Type, but not for System.String?
Thank you.
Neither System.String nor System.Type guarantees persistable hashcodes as a part of its contract. The fact that it so happens to work with System.String in your particular case is an implementation detail that cannot be relied upon. If you need to persist or export a hash of a string, use a different string hashing method. Persisting or exporting information about a type should probably use information like FullName, AssemblyQualifiedName, and others, depending on the exact requirement.
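As a rough sketch of "a different string hashing method", the snippet below derives a hash from the UTF-8 bytes of the string using SHA-256, which does not depend on the process or runtime version. The helper name StableHash.Sha256Hex is hypothetical, and SHA-256 is just one reasonable choice. It is also worth knowing that on .NET Core and later, string.GetHashCode is randomized per process by default, so even the "ABC" case above is not stable there.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: a hash that can be persisted because it depends only
// on the characters of the input, not on the runtime.
public static class StableHash
{
    public static string Sha256Hex(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        using (SHA256 sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(text));
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }
}

public static class StableHashDemo
{
    public static void Main()
    {
        // Same output on every run and on every machine.
        Console.WriteLine(StableHash.Sha256Hex("ABC"));
        // For a type, persist a name rather than a hash code.
        Console.WriteLine(typeof(Guid).FullName);
    }
}
```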

Comparator/Sorting/Equatable methodology and return value

Alright so I'm taking everything I've learned and trying to implement it in C#. Given that I have a background in Java my ride has been pretty smooth so far, but I'm running into issues using the Comparer object and functions etc. I don't care about direct implementation/translation, but I want to know how C# compares two generic values. What does it use to sort them? Hashcode, or maybe some C#-specific methodology?
So just to clarify, I know how to sort, search, etc. using methods in C#. What I want to know is what's going on under the hood - what are the Comparer and other functions using to compare two values of generics?
"I want to know how C# compares two generic values"
It doesn't and can't, which is why the IComparable and IComparer interfaces exist.
"What I want to know is what's going on under the hood"
If you're talking about types provided by .NET, then:
If you have an array of types (such as string or integer) that already support IComparer, you can sort that array without providing any explicit reference to IComparer. In that case, the elements of the array are cast to the default implementation of IComparer (Comparer.Default) for you.
How to use the IComparable and IComparer interfaces in Visual C# is probably the best article I've seen specific to your question.
The role of IComparable is to provide a method of comparing two objects of a particular type
The role of IComparer is to provide additional comparison mechanisms. For example, you may want to provide ordering of your class on several fields or properties, ascending and descending order on the same field, or both.
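To make the two roles concrete, here is a small sketch: a type with a natural ordering through IComparable<T>, plus a separate IComparer<T> for an alternative ordering. The Person and AgeDescendingComparer names are made up for the example. List<T>.Sort() with no arguments falls back to Comparer<T>.Default, which uses the type's own IComparable<T> implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical example type with a natural ordering by name.
public sealed class Person : IComparable<Person>
{
    public string Name { get; }
    public int Age { get; }

    public Person(string name, int age)
    {
        Name = name;
        Age = age;
    }

    // Natural ordering: ordinal comparison of names; null sorts first.
    public int CompareTo(Person other)
    {
        if (other == null) return 1;
        return string.CompareOrdinal(Name, other.Name);
    }
}

// Alternative ordering supplied from outside the type: oldest first.
public sealed class AgeDescendingComparer : IComparer<Person>
{
    public int Compare(Person x, Person y)
    {
        if (x == null) return y == null ? 0 : 1;
        if (y == null) return -1;
        return y.Age.CompareTo(x.Age);
    }
}

public static class ComparerDemo
{
    public static void Main()
    {
        var people = new List<Person>
        {
            new Person("Bob", 25),
            new Person("Alice", 40),
            new Person("Carol", 30)
        };

        people.Sort(); // uses Person.CompareTo via Comparer<Person>.Default
        Console.WriteLine(string.Join(", ", people.Select(p => p.Name))); // Alice, Bob, Carol

        people.Sort(new AgeDescendingComparer());
        Console.WriteLine(string.Join(", ", people.Select(p => p.Name))); // Alice, Carol, Bob
    }
}
```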

Globally set String.Compare/ CompareInfo.Compare to Ordinal

I'm searching for a strategy with which I can set the default sortorder of String.CompareTo to bytewise - ordinal. I need to do this without having to specify the sortorder in the call to the method.
I have tried out several strategies without satisfactory results. I got as far as this:
CultureAndRegionInfoBuilder crib =
new CultureAndRegionInfoBuilder("foo", CultureAndRegionModifiers.Neutral);
CompareInfo compareInfo = new CustomCompareInfo();
crib.Register();
In this CustomCompareInfo I try to override the default CompareInfo class, but unfortunately this does not compile:
The type 'System.Globalization.CompareInfo' has no constructors defined
I'm stuck here. Got the feeling that a custom implementation of CompareInfo is the solution to my problem.
Got any ideas on this?
Edit: context of my question:
This project I'm working on is quite unusual: a huge codebase has been converted from another programming language to .NET. In that language string comparison defaults to ordinal, and this difference from .NET is causing bugs in the converted codebase, so I figured the most elegant solution would be to configure .NET to the same default behavior.
Of course it is possible to reconvert the code using a comparison specifier. Or we could introduce an extension method which performs an ordinal (binary) comparison, and so on.
However, as far as I am concerned, from an architectural viewpoint, these solutions are less elegant. This is the reason why I am searching for a solution with which I can set this ordinal comparison globally on the framework.
Thanks in advance!
Sorry, you can't make this work. The CompareInfo class does have a constructor. But it is internal and takes a CultureInfo as an argument. The actual implementation involves private members of CultureInfo that reflect sorting tables built into mscorlib. They are not extensible.
This does actually work in VB.NET, presumably the reason you are pursuing this. It has an Option Compare statement that lets you select binary comparison. This is, however, not implemented with CultureInfo; it is done by the compiler, which recognizes a string comparison and replaces it with a custom VB.NET string comparison method that is aware of the selected Option Compare. Its name is Microsoft.VisualBasic.CompilerServices.Operators.CompareString()
You cannot coax the C# compiler into the same behavior. You'd have to painstakingly replace comparison expressions in converted vb.net code. A horrible job of course and very prone to mistakes. If the conversion was done by a converter program then you might be better off with a good decompiler, it won't hide the CompareString() calls.
There appears to be no means of setting the default comparison mode (here, to ordinal).
If what you want is always-consistent comparison results, you can set the culture to 'invariant' (a CultureInfo created with an empty string as its parameter) on each thread you create in your app:
Thread.CurrentThread.CurrentCulture = new CultureInfo("");
If you want to perform ordinal comparisons for performance, I really think that nothing can be done globally - you will need to pass this option explicitly each time you perform a string comparison.
Can you tell us what you need exactly?
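For reference, a short sketch of what passing the option explicitly looks like, including the extension-method route mentioned in the question. The name CompareOrdinalTo is hypothetical; string.CompareOrdinal and StringComparer.Ordinal are the standard framework members.

```csharp
using System;

public static class OrdinalStringExtensions
{
    // Hypothetical helper name: wraps the framework's ordinal comparison so
    // converted code can call it without repeating the option everywhere.
    public static int CompareOrdinalTo(this string left, string right)
    {
        return string.CompareOrdinal(left, right);
    }
}

public static class OrdinalDemo
{
    public static void Main()
    {
        string[] words = { "b", "a", "B", "A" };

        // Ordinal sort orders by UTF-16 code unit: uppercase before lowercase.
        Array.Sort(words, StringComparer.Ordinal);
        Console.WriteLine(string.Join(" ", words)); // A B a b

        // 'a' (97) is greater than 'B' (66) in an ordinal comparison.
        Console.WriteLine("a".CompareOrdinalTo("B") > 0); // True
    }
}
```

This does not change the framework default. It only gives the converted code a single, searchable call to use in place of String.CompareTo.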

Override ToString or provide non-ToString named extension method for an interface?

My question is about naming, design, and implementation choices. I can see myself going in two different directions with how to solve an issue and I'm interested to see where others who may have come across similar concerns would handle the issue. It's part aesthetics, part function.
A little background on the code... I created a type called ISlice<T> that provides a reference into a section of a source of items which can be a collection (e.g. array, list) or a string. The core support comes from a few implementation classes that support fast indexing using the Begin and End markers for the slice to get the item from the original source. The purpose is to provide slicing capabilities similar to what the Go language provides while using Python style indexing (i.e. both positive and negative indexes are supported).
To make creating slices (instances of ISlice<T>) easier and more "fluent", I created a set of extension methods. For example:
static public ISlice<T> Slice<T>(this IList<T> source, int begin, int end)
{
    return new ListSlice<T>(source, begin, end);
}
static public ISlice<char> Slice(this string source, int begin, int end)
{
    return new StringSlice(source, begin, end);
}
There are others, such as providing optional begin/end parameters, but the above will suffice for where I'm going with this.
These routines work well and make it easy to slice up a collection or a string. What I also need is a way to take a slice and create a copy of it as an array, a list, or a string. That's where things get "interesting". Originally, I thought I'd need to create ToArray and ToList extension methods, but then remembered that the LINQ variants perform optimizations if your collection implements ICollection<T>. In my case, ISlice<T> does inherit from it, though much to my chagrin as I dislike throwing NotSupportedExceptions from methods like Add. Regardless, I get those for free. Great.
What about converting back into a string as there's no built-in support for converting an IEnumerable<char> easily back into a string? Closest thing I found is one of the string.Concat overloads, but it would not handle chars as efficiently as it could. Just as important from a design stand point is that it doesn't jump out as a "conversion" routine.
The first thought was to create a ToString extension method, but that doesn't work as ToString is an instance method which means it trumps extension methods and would never be called. I could override ToString, but the behavior would be inconsistent as ListSlice<T> would need to special case its ToString for times where T is a char. I don't like that as the ToString will give something useful when the type parameter is a char, but the class name in other cases. Also, if there are other slice types created in the future I'd have to create a common base class to ensure the same behavior or each class would have to implement this same check. An extension method on the interface would handle that much more elegantly.
The extension method leads me to a naming convention issue. The obvious is to use ToString, but as stated earlier it's not allowed. I could name it something different, but what? ToNewString? NewString? CreateString? Something in the To-family of methods would let it fall in with the ToArray/ToList routines, but ToNewString sticks out as being 'odd' when seen in the intellisense and code editor. NewString/CreateString are not as discoverable as you'd have to know to look for them. It doesn't fit the "conversion method" pattern that the To-family methods provide.
Go with overriding ToString and accept the inconsistent behavior hardcoded into the ListSlice<T> implementation and other implementations? Go with the more flexible, but potentially more poorly named extension method route? Is there a third option I haven't considered?
My gut tells me to go with the ToString despite my reservations, though, it also occurred to me... Would you even consider ToString giving you a useful output on a collection/enumerable type? Would that violate the principle of least surprise?
Update
Most implementations of slicing operations provide a copy, albeit a subset, of the data from whatever source was used for the slice. This is perfectly acceptable in most use cases and leaves for a clean API as you can simply return the same data type back. If you slice a list, you return a list containing only the items in the range specified in the slice. If you slice a string, you return a string. And so on.
The slicing operations I'm describing above are solving an issue when working with constraints which make this behavior undesirable. For example, if you work with large data sets, the slice operations would lead to unnecessary additional memory allocations not to mention the performance impact of copying the data. This is especially true if the slices will have further processing done on them before getting to your final results. So, the goal of the slice implementation is to have references into larger data sets to avoid making unnecessary copies of the information until it becomes beneficial to do so.
The catch is that at the end of the processing, the desire is to turn the slice-based processed data back into a more API- and .NET-friendly type like lists, arrays, and strings. It makes the data easier to pass into other APIs. It also allows you to discard the slices and, with them, the large data set the slices referenced.
"Would you even consider ToString giving you a useful output on a collection/enumerable type? Would that violate the principle of least surprise?"
No, and yes. That would be completely unexpected behavior, since it would behave differently than every other collection type.
As for this:
"What about converting back into a string as there's no built-in support for converting an IEnumerable<char> easily back into a string?"
Personally, I would just use the string constructor taking an array:
string result = new string(mySlice.ToArray());
This is explicit, understood, and expected - I expect to create a new string by passing an object to a constructor.
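If the conversion is needed in many places, the same constructor call can sit behind an extension method on IEnumerable<char>. The sketch below is only one possible shape: the name ToText is a placeholder (the naming question above is still open), and ArraySegment<char> stands in for the ISlice<char> type, which isn't shown here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class CharSequenceExtensions
{
    // Hypothetical name; builds a new string from any sequence of chars.
    public static string ToText(this IEnumerable<char> source)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));

        // A string is already what we want.
        if (source is string s) return s;

        // Known size: copy once into a buffer and hand it to the constructor.
        if (source is ICollection<char> collection)
        {
            var buffer = new char[collection.Count];
            collection.CopyTo(buffer, 0);
            return new string(buffer);
        }

        // Unknown size: accumulate with a StringBuilder.
        var builder = new StringBuilder();
        foreach (char c in source) builder.Append(c);
        return builder.ToString();
    }
}

public static class CharSequenceDemo
{
    public static void Main()
    {
        // ArraySegment<char> is a stand-in for a slice over a larger buffer.
        char[] data = "hello world".ToCharArray();
        var segment = new ArraySegment<char>(data, 6, 5);
        Console.WriteLine(segment.ToText()); // world

        Console.WriteLine("hello".Skip(1).Take(3).ToText()); // ell
    }
}
```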
Perhaps the reason for your conundrum is the fact that you are treating string as an ICollection<char>. You haven't provided details about the problem that you are trying to solve, but maybe that's a wrong assumption.
It's true that a string is an IEnumerable<char>. But as you've noticed assuming a direct mapping to a collection of chars creates problems. Strings are just too "special" in the framework.
Looking at it from the other end, would it be obvious that the difference between an ISlice<char> and ISlice<byte> is that you can concatenate the former into a string? Would there be a concatenate operation on the latter that makes sense? What about ISlice<string>? Shouldn't I be able to concatenate those as well?
Sorry I'm not providing specific answers but maybe these questions will point you at the right solution for your problem.

How can I convert a C# list into something that's hashable?

I want something along the lines of Python's tuples (or, for sets, frozensets), which are hashable. I have a List<String> which is most certainly not hashing correctly (i.e. by value).
You will have to define your own container, possibly wrapping the List, to get useful semantics for equality-hash-equals (GetHashCode and Equals). You could even make the wrapper conform to IList if you like.
To avoid mutability issues and changing GetHashCode/Equals results (which would make use of your new object in a hashing Dictionary problematic!) you should also provide some kind of guard (perhaps make a copy of the input upon creation of your type) and/or document the constraints.
You can use SequenceEqual to implement Equals rather trivially, but you'll need to implement a GetHashCode in a relevant way -- a simple method is a shifting XOR of the GetHashCode of each element.
Alternatively, if this is just used in a single Dictionary you can supply a custom IEqualityComparer and avoid creating a wrapped type: Dictionary constructor overload.
It depends on what your final goals are, and there may very well already be such wrapping containers :-)
Note: In .NET4 there is a set of Tuple<...> classes which override GetHashCode and Equals. See cadenza as the 3rd party alternative for prior .NET versions.
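A minimal sketch of the wrapper idea described above, under a few assumptions of my own: the class name HashableList<T> is hypothetical, the input is copied on construction as the mutability guard, and the hash is combined with a multiply-and-add scheme rather than the shifting XOR mentioned earlier (either works, as long as equal sequences produce equal hashes).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical wrapper: an immutable snapshot of a sequence with
// value-based Equals and GetHashCode.
public sealed class HashableList<T> : IEquatable<HashableList<T>>
{
    private readonly T[] _items;
    private readonly int _hash;

    public HashableList(IEnumerable<T> items)
    {
        if (items == null) throw new ArgumentNullException(nameof(items));
        _items = items.ToArray(); // defensive copy, so later edits to the source don't matter

        var comparer = EqualityComparer<T>.Default;
        unchecked
        {
            int hash = 17;
            foreach (T item in _items)
            {
                hash = hash * 31 + (item == null ? 0 : comparer.GetHashCode(item));
            }
            _hash = hash; // computed once; the contents never change
        }
    }

    public int Count => _items.Length;

    public T this[int index] => _items[index];

    public bool Equals(HashableList<T> other)
    {
        return other != null && _items.SequenceEqual(other._items);
    }

    public override bool Equals(object obj) => Equals(obj as HashableList<T>);

    public override int GetHashCode() => _hash;
}

public static class HashableListDemo
{
    public static void Main()
    {
        var lookup = new Dictionary<HashableList<string>, int>();
        lookup[new HashableList<string>(new List<string> { "a", "b" })] = 42;

        // A different instance with the same contents finds the same entry.
        var key = new HashableList<string>(new[] { "a", "b" });
        Console.WriteLine(lookup.ContainsKey(key)); // True
    }
}
```

Unlike a Python frozenset, this wrapper is order-sensitive, so it corresponds to a tuple: { "a", "b" } and { "b", "a" } are different keys.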
