C# Overriding the GetHashCode method


In this example, the poster has overridden the GetHashCode method. I understand that this has been done in order to provide a better hash value for the returned object, to reduce the number of collisions, and therefore to reduce the number of occasions on which it will be necessary to call Equals().
What I would like to know is how this algorithm was arrived at:
return 17 + 31 * CurrentState.GetHashCode() + 31 * Command.GetHashCode();
Is there a particular reason that the numbers in question were chosen? Could I have simply picked my own numbers to put into it?

Generally you should choose primes. This helps you to avoid getting the same hash-value for different input parameters.

Prime numbers are usually used in hash code computation to minimize collisions. If you search for hashcode and prime numbers on this site, you will find some detailed explanations of this (note that it is not language specific):
What is a sensible prime for hashcode calculation ?
Why does Java's hashCode() in String use 31 as a multiplier ?

You typically want to use prime numbers (as is done above) because it reduces the chance of collisions (two instances yielding same result). For more info, see: http://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/
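As a minimal sketch of why the way the primes are combined matters as much as the primes themselves: the formula in the question multiplies both inputs by the same 31 and adds them, so swapping the two inputs gives the same result. A commonly used alternative chains the multiplication so that each input ends up with a different weight. The class and method names below are hypothetical, and the two int parameters stand in for CurrentState.GetHashCode() and Command.GetHashCode().

```csharp
public static class HashSketch
{
    // The formula from the question: symmetric in its two inputs,
    // so (a, b) and (b, a) always collide.
    public static int Symmetric(int a, int b)
    {
        unchecked { return 17 + 31 * a + 31 * b; }
    }

    // Chained form: the running value is multiplied by the prime before
    // each input is added, so the order of the inputs affects the result.
    public static int Chained(int a, int b)
    {
        unchecked // let the arithmetic wrap instead of throwing on overflow
        {
            int hash = 17;
            hash = hash * 31 + a;
            hash = hash * 31 + b;
            return hash;
        }
    }
}
```

With inputs 1 and 2, Symmetric gives the same value in either order, while Chained gives different values. Neither form guarantees uniqueness; Equals() is still what decides equality.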

Related

Which part of a GUID is most worth keeping?

I need to generate a unique ID and was considering Guid.NewGuid to do this, which generates something of the form:
0fe66778-c4a8-4f93-9bda-366224df6f11
This is a little long for the string-type database column that it will end up residing in, so I was planning on truncating it.
The question is: Is one end of a GUID more preferable than the rest in terms of uniqueness? Should I be lopping off the start, the end, or removing parts from the middle? Or does it just not matter?

You can save space by using a base64 string instead:
var g = Guid.NewGuid();
var s = Convert.ToBase64String(g.ToByteArray());
Console.WriteLine(g);
Console.WriteLine(s);
This will save you 12 characters (8 if you weren't using the hyphens).
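A small variation on the idea above, as a sketch: the Base64 form of 16 bytes always ends in two padding characters, so they can be dropped and restored later, leaving 22 characters. The ShortGuid name is hypothetical. Note that Base64 output can contain '+' and '/', which may or may not suit the target column or any URLs the ID ends up in.

```csharp
using System;

public static class ShortGuid
{
    // 16 bytes encode to 24 Base64 characters, the last two of which
    // are always "==" padding; trimming them leaves 22 characters.
    public static string Encode(Guid g)
    {
        return Convert.ToBase64String(g.ToByteArray()).TrimEnd('=');
    }

    // Restore the padding before decoding back to the original Guid.
    public static Guid Decode(string s)
    {
        return new Guid(Convert.FromBase64String(s + "=="));
    }
}
```

This keeps every bit of the GUID, so nothing is lost in terms of uniqueness; only the text representation gets shorter.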

Keep all of it.
From the above link:
* Four bits to encode the computer number,
* 56 bits for the timestamp, and
* four bits as a uniquifier.
You can redefine the GUID to right-size it to your needs.

If the GUID were simply a random number, you could keep an arbitrary subset of the bits and suffer a certain percent chance of collision that you can calculate with the "birthday algorithm":
double numBirthdays = 365; // set to e.g. 18446744073709551616d for 64 bits
double numPeople = 23; // set to the maximum number of GUIDs you intend to store
double probability = 1; // that all birthdays are different
for (int x = 1; x < numPeople; x++)
    probability *= (double)(numBirthdays - x) / numBirthdays;
Console.WriteLine("Probability that two people have the same birthday:");
Console.WriteLine((1 - probability).ToString());
However, the probability of a collision is often higher because GUIDs are, in general, NOT random. According to Wikipedia's GUID article there are five types of GUIDs. The 13th digit specifies which kind of GUID you have, so it tends not to vary much, and the top two bits of the 17th digit are always fixed at 10 (in the sample GUID in the question, that digit is 9, which is 1001 in binary).
For each type of GUID you'll get different degrees of randomness. Version 4 (13th digit = 4) is entirely random except for digits 13 and 17; versions 3 and 5 are effectively random, as they are cryptographic hashes; while versions 1 and 2 are mostly NOT random but certain parts are fairly random in practical cases. A "gotcha" for version 1 and 2 GUIDs is that many GUIDs could come from the same machine and in that case will have a large number of identical bits (in particular, the last 48 bits and many of the time bits will be identical). Or, if many GUIDs were created at the same time on different machines, you could have collisions between the time bits. So, good luck safely truncating that.
I had a situation where my software only supported 64 bits for unique IDs so I couldn't use GUIDs directly. Luckily all of the GUIDs were type 4, so I could get 64 bits that were random or nearly random. I had two million records to store, and the birthday algorithm indicated that the probability of a collision was 1.08420141198273 x 10^-07 for 64 bits and 0.007 (0.7%) for 48 bits. This should be assumed to be the best-case scenario, since a decrease in randomness will usually increase the probability of collision.
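The two figures quoted above can be cross-checked without looping two million times by using the usual closed-form approximation to the birthday problem, 1 - exp(-n(n-1) / 2N). This is only a sketch: it assumes the retained bits are uniformly random, and the class and method names are hypothetical.

```csharp
using System;

public static class BirthdayEstimate
{
    // Approximate chance of at least one collision among `count` values
    // drawn uniformly at random from a space of 2^bits possibilities.
    public static double CollisionProbability(double count, int bits)
    {
        double space = Math.Pow(2, bits);
        double exponent = count * (count - 1) / (2 * space);
        return 1 - Math.Exp(-exponent);
    }
}
```

Calling CollisionProbability(2000000, 64) and CollisionProbability(2000000, 48) gives values in line with the roughly 1.08 x 10^-07 and 0.7% mentioned above.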
I suppose that in theory, more GUID types could exist in the future than are defined now, so a future-proof truncation algorithm is not possible.

I agree with Rob - Keep all of it.
But since you said you're going into a database, I thought I'd point out that just using GUIDs doesn't necessarily mean that they will index well in a database. For that reason, the NHibernate developers created a Guid.Comb algorithm that's more DB friendly.
See NHibernate POID Generators revealed and documentation on the Guid Algorithms for more information.
NOTE: Guid.Comb is designed to improve performance on Microsoft SQL Server.

Truncating a GUID is a bad idea; please see this article for why.
You should consider generating a shorter GUID instead; a Google search turns up some solutions. These seem to involve taking a GUID and re-encoding it using the full range of 8-bit ASCII characters rather than only hexadecimal digits.

Working with numbers larger than max decimal value

I'm working with the product of the first 26 prime numbers. This requires more than 52 bits of precision, which I believe is the max a double can handle, and more than the 28-29 significant digits a decimal can provide. So what would be some strategies for performing multiplication and division on numbers this large?
Also, what would the performance impacts be of whatever hoops I'd have to jump through to make this happen?
The product of the first 22 prime numbers (the most I can multiply together on my calculator without dropping into scientific mode) is:
3,217,644,767,340,672,907,899,084,554,130
The product of the last four is
72,370,439
When multiplied together, I get:
2.3286236435849736090006331688051e+38
The performance impacts are especially important here, because we're essentially trying to resolve the question of whether a prime-number string comparison solution is faster in practice than a straight comparison of characters. The post which prompted this investigation is here. Processors are optimized for floating-point calculations; ideally I'd want to leverage as much of that optimization in whatever solution I end up with.
TIA!
James
PS: The code I do have is for a competing solution; I don't think the prime number solution can possibly be faster, but I'm trying to give it the fairest chance I can.

You can use System.Numerics.BigInteger, which was introduced in .NET Framework 4.0 (alongside C# 4.0). For older versions, I think you need an open source library such as this one.
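As a rough sketch of what that looks like for the product in the question, the following multiplies the first few primes together exactly with BigInteger. The PrimeProduct class and its trial-division helper are hypothetical names for illustration, and no claim is made here about how fast this is compared with the alternatives.

```csharp
using System.Numerics;

public static class PrimeProduct
{
    // Exact product of the first `count` primes, found by trial division.
    public static BigInteger OfFirstPrimes(int count)
    {
        BigInteger product = BigInteger.One;
        int found = 0;
        for (int candidate = 2; found < count; candidate++)
        {
            if (IsPrime(candidate))
            {
                product *= candidate;
                found++;
            }
        }
        return product;
    }

    // Trial division is plenty for the handful of small primes needed here.
    private static bool IsPrime(int n)
    {
        if (n < 2) return false;
        for (int d = 2; d * d <= n; d++)
            if (n % d == 0) return false;
        return true;
    }
}
```

PrimeProduct.OfFirstPrimes(26) returns the full integer, which can then be multiplied and divided with the ordinary * and / operators.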

I read the post you linked to, about the interview question. Since you're only multiplying and dividing these large integers, a huge optimization is to keep them in their prime-factorized form. Each large integer is an array [0..25] of ints, each element representing the exponent of the nth prime in the factorization. To multiply two large integers in this form, simply add the exponents element-by-element; to divide, subtract exponents.
But you will see this is equivalent to tabulating character counts on the two strings.
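A minimal sketch of that representation, assuming the strings contain only lowercase a-z and that the n-th prime stands for the n-th letter (the ExponentVector name and its methods are hypothetical). It makes the equivalence concrete: building the exponent array is exactly a character tally.

```csharp
using System;

public static class ExponentVector
{
    // Slot i holds the exponent of the i-th prime, which is also the
    // number of times the letter ('a' + i) occurs in the word.
    public static int[] FromWord(string word)
    {
        var exponents = new int[26];
        foreach (char c in word)
        {
            if (c < 'a' || c > 'z')
                throw new ArgumentException("This sketch only handles lowercase a-z.");
            exponents[c - 'a']++;
        }
        return exponents;
    }

    // Multiplying two factorized numbers adds exponents slot by slot.
    public static int[] Multiply(int[] x, int[] y)
    {
        var result = new int[26];
        for (int i = 0; i < 26; i++) result[i] = x[i] + y[i];
        return result;
    }

    // x divides y exactly when no exponent of x exceeds the one in y.
    public static bool Divides(int[] x, int[] y)
    {
        for (int i = 0; i < 26; i++)
            if (x[i] > y[i]) return false;
        return true;
    }
}
```

No big-number arithmetic is involved at all, which is the point of the answer above.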

Calculate factorials in C#

How can you calculate large factorials using C#? Windows calculator in Win 7 overflows at Factorial (3500). As a programming and mathematical question I am interested in knowing how you can calculate the factorial of a larger number (20000, maybe) in C#. Any pointers?
[Edit] I just checked with a calc on Win 2k3, since I could recall doing a bigger factorial on Win 2k3. I was surprised by the way things worked out.
Calc on Win2k3 worked even with big numbers. I tried 50000! and I got an answer, 3.3473205095971448369154760940715e+213236
It was very fast while I did all this.
The main question here is not only to find out the appropriate data type, but also a bit mathematical. If I try to write a simple factorial code in C# [recursive or loop], the performance is really bad. It takes multiple seconds to get an answer. How is the calc in Windows 2k3 (or XP) able to perform such a huge factorial in less than 10 seconds? Is there any other way of calculating factorial programmatically in C#?

Have a look at the BigInteger structure:
http://msdn.microsoft.com/en-us/library/system.numerics.biginteger.aspx
Maybe this can help you implement this functionality.
CodeProject has an implementation for older versions of the framework at http://www.codeproject.com/KB/cs/biginteger.aspx.

"If I try to write a simple factorial code in C# [recursive or loop], the performance is really bad. It takes multiple seconds to get an answer."
Let's do a quick order-of-magnitude calculation here for a naive implementation of factorial that performs n multiplications. Suppose we are on the last step. 19999! is about 2^18 bits. 20000 is about 2^5 bits; we'll assume that it is a 32 bit integer. The final multiplication therefore involves the addition of up to 2^5 partial results each roughly 2^18 bits long. The number of bit operations will therefore be on the order of 2^23.
That's for the last stage; there will be 20000 ≈ 2^16 such operations at each stage, so that is a total of about 2^39 operations. Some of them will of course be cheaper, but we're going for an order of magnitude here.
A modern processor does about 2^32 operations per second. Therefore it will take about 2^7 seconds to get the result.
Of course, the big integer library writers were not naive; they take advantage of the ability of the chip to do many bit operations in parallel. They're probably doing the math in 32 bit chunks, giving speedups of a factor of 2^5. So our total order-of-magnitude calculation is that it should take about 2^2 seconds to get a result.
2^2 is 4. So your observation that it takes a few seconds to get a result is expected.
"How is the calc in Windows 2k3 (or XP) able to perform such a huge factorial in less than 10 seconds?"
I don't know. Extreme cleverness in exploiting the math operations on the chip probably. Or, using a non-naive algorithm for calculating factorial. Or, possibly they are using Stirling's Approximation and getting an inexact result.
"Is there any other way of calculating factorial programmatically in C#?"
Sure. If all you care about is the order of magnitude then you can use Stirling's Approximation. If you care about the exact value then you're going to have to compute it.
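For the order-of-magnitude route, here is a sketch of Stirling's approximation worked in logarithms, so that nothing overflows a double. It uses the standard form ln n! ≈ n ln n - n + ½ ln(2πn); the Stirling class name and its methods are hypothetical. Because it is an approximation, the digit count could in principle be off by one when n! sits extremely close to a power of ten.

```csharp
using System;

public static class Stirling
{
    // Approximate log10(n!) via ln n! ~ n ln n - n + 0.5 ln(2 pi n).
    public static double Log10Factorial(int n)
    {
        if (n < 1) throw new ArgumentOutOfRangeException("n");
        double ln = n * Math.Log(n) - n + 0.5 * Math.Log(2 * Math.PI * n);
        return ln / Math.Log(10);
    }

    // Approximate number of decimal digits in n!.
    public static long DigitCount(int n)
    {
        return (long)Math.Floor(Log10Factorial(n)) + 1;
    }
}
```

For n = 50000 this gives an exponent of 213236, consistent with the calculator output quoted in the question.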

There exist sophisticated computational algorithms for efficiently computing the factorials of large, arbitrary precision numbers. The Schönhage–Strassen algorithm, for instance, allows you to perform asymptotically fast multiplication for arbitrarily large integers.
Case in point, Mathematica computes 22000! on my machine in less than 1 second. The Implementation Notes page at reference.wolfram.com states:
(Mathematica's) n! uses an O(log(n) M(n)) algorithm of Schönhage based on dynamic decomposition to prime powers.
Unfortunately, the implementation of such algorithms is both complicated and error prone. Rather than trying to roll your own implementation, it may be wiser for you to license a copy of Mathematica (or a similar product that meets your functional and performance needs) and either use it, or a .NET programming interface to it, to perform your computation.

Have you looked at System.Numerics.BigInteger?

Using System.Numerics.BigInteger:
using System.Numerics;

var bi = new BigInteger(1);
var factorial = 171;
for (var i = 1; i <= factorial; i++)
{
    bi *= i;
}
will be calculated to
1241018070217667823424840524103103992616605577501693185388951803611996075221691752992751978120487585576464959501670387052809889858690710767331242032218484364310473577889968548278290754541561964852153468318044293239598173696899657235903947616152278558180061176365108428800000000000000000000000000000000000000000
For 50000! it takes a couple seconds to calculate but it seems to work and the result is a 213237 digit number and that's also what Wolfram says.
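One "non-naive" variation that is sometimes suggested is to split the range in half recursively, so that most multiplications involve two operands of similar size rather than one huge running product and one small integer. The sketch below shows the idea; the SplitFactorial name is hypothetical, and whether it is actually faster depends on how the runtime's BigInteger implements multiplication, which has not been measured here.

```csharp
using System;
using System.Numerics;

public static class SplitFactorial
{
    public static BigInteger Factorial(int n)
    {
        if (n < 0) throw new ArgumentOutOfRangeException("n");
        return n < 2 ? BigInteger.One : ProductOfRange(2, n);
    }

    // Multiplies lo * (lo + 1) * ... * hi by halving the range, so the
    // recursion depth grows with log(hi - lo) rather than with n.
    private static BigInteger ProductOfRange(int lo, int hi)
    {
        if (lo == hi) return lo;
        if (hi - lo == 1) return (BigInteger)lo * hi;
        int mid = lo + (hi - lo) / 2;
        return ProductOfRange(lo, mid) * ProductOfRange(mid + 1, hi);
    }
}
```

The result is exact, the same as with the simple loop; only the order of the multiplications changes.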

You will probably have to implement your own arbitrary precision numeric type.
There are various approaches. Probably not the most efficient, but perhaps the simplest, is to have variable-length arrays of byte (unsigned char), where each element represents a digit. Ideally this would be wrapped in a class, and you can then add a method which lets you multiply the number by another arbitrary precision number. A multiply with a standard C# integer would probably also be a good idea, but a little trickier to implement.

Since they don't give you the result down to the last digit, they may be "cheating" using some approximation.
Check out http://mathworld.wolfram.com/StirlingsApproximation.html
Using Stirling's formula you can calculate (an approximation of) the factorial of n in O(log n) time. Of course, they might as well have a dictionary with pre-calculated values of factorial(n) for every n up to one million, making the calculator show the result extremely fast.

This answer covers the limits of the basic .NET types for computing and representing n!
Basic code to calculate factorial for "SomeType" that supports multiplication:
SomeType factorial = 1;
int n = 35;
for (int i = 1; i <= n; i++)
{
    factorial *= i;
}
Limits for built-in number types:
short - correct results up to 7!, incorrect results afterwards, code returns 0 starting at 18 (similar to int)
int - correct results up to 12!, incorrect results afterwards, code returns 0 starting at 34 (see: Why computing factorial of relatively small numbers (34+) returns 0)
float - precise results up to 13! (14! no longer fits in a 24-bit significand), correct but not precise afterwards, returns infinity starting at 35
long - correct results up to 20!, incorrect results afterwards, code returns 0 starting at 66 (similar to int)
double - precise results up to 22!, correct but not precise afterwards, returns infinity starting at 171
BigInteger - precise; the upper limit is set by memory usage only.
Note: integer types overflow pretty quickly and start producing incorrect results. Realistically, if you need factorials for any practical usage, long is the type to go with (up to 20!); if you can't count on the inputs being that small, BigInteger is the only type provided in the .NET Framework that gives precise results (albeit slowly for large numbers, as there is no built-in optimized n! method).
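Since the integer types wrap silently by default, one way to avoid the "incorrect results afterwards" cases is to do the multiplication in a checked context, so that going past the limit raises an exception instead. A minimal sketch for long, with a hypothetical CheckedFactorial name:

```csharp
using System;

public static class CheckedFactorial
{
    // Throws OverflowException instead of silently wrapping once the
    // product no longer fits in a long (that is, for n greater than 20).
    public static long Compute(int n)
    {
        if (n < 0) throw new ArgumentOutOfRangeException("n");
        long result = 1;
        for (int i = 2; i <= n; i++)
        {
            result = checked(result * i);
        }
        return result;
    }
}
```

Compute(20) returns 2432902008176640000, and Compute(21) throws rather than returning a wrong number.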

You need a special big-number library for this. This link introduces the System.Numerics.BigInteger struct, and incidentally has an example program that calculates factorials. But don't use the example! If you recurse like that, your stack will grow horribly. Just write a for-loop to do the multiplication.

I don't know how you could do this in a language without arbitrary precision arithmetic. I guess a start could be to count factors of 5 and 2, removing them from the product, and add on these zeroes at the end.
As you can see there are many.
>>> factorial(20000)
<<non-zeroes removed>>0000000000...0000000000L (4,999 trailing zeros)
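The number of trailing zeros can be counted without computing the factorial at all. Each trailing zero needs one factor of 2 and one factor of 5, and factors of 5 are the scarcer of the two, so it is enough to count how many multiples of 5, 25, 125 and so on there are up to n. A minimal sketch, with a hypothetical TrailingZeros name:

```csharp
public static class TrailingZeros
{
    // Counts the factors of 5 in n! by summing n/5 + n/25 + n/125 + ...
    public static int OfFactorial(int n)
    {
        int zeros = 0;
        for (long power = 5; power <= n; power *= 5)
        {
            zeros += (int)(n / power);
        }
        return zeros;
    }
}
```

TrailingZeros.OfFactorial(20000) returns 4999 (4000 + 800 + 160 + 32 + 6 + 1).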

crafting a class representing a fraction

What are some of the things I need to consider when designing this object?
This is all I can think of:
int numerator
int denominator
int sign
This object can be used in a math operation.

Unless you use unsigned int and you are sure you don't want the denominator and numerator to contain signs, you should probably get rid of the third member (sign), as it is redundant.
Then it depends on the language you are using, you might want to overload some operators for this class (C++), or implement methods to compute the behaviour, like Rohith said.

Consider the behavior of the class. What operations do you want to be able to perform on this fraction class?
Create it - this means a constructor. So what arguments should the constructor take? 2 ints? 2 ints and a boolean for sign? The ints could carry signs too.
Add two fractions, subtract, multiply, divide - do you want these to be static methods or object methods [ aFraction.Add(anotherFraction) or Fraction.Add(aFraction, anotherFraction) ]. What do these methods return - a Fraction object? a float?
How do you check whether two fractions are equal? If you want to do this without breaking encapsulation, then make sure you provide an equals method - Java and C# have a particular signature for the equals method.

Consider the following multiplication problem:
2/3 * 3/4. The answer is 6/12, naively. But 1/2, intelligently. You'll need to take this into account to deal with equality.
Now, what about 2000000000/3000000000 * 3/4? If you use 32-bit ints to represent your numerator and denominator, you'll overflow if you do the naive computation first. Of course, if your language supports bignums, this is not so big a problem.
When you're reducing to lowest terms, don't forget to consider the sign of the result -- in general, pick one of the numerator or the denominator to carry the sign when representing negative rationals, and stick with it.

You may also want a member function to output a decimal value, and a toString function so you can print numerator/denominator without extra effort.
There is a "boundary" case of sorts here, too - the denominator can't be zero or your value is undefined. Your constructor and any setters will need to respond to this possibility.
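Pulling the points from these answers together, here is one possible sketch in C#. The design choices are assumptions made for illustration, not the only reasonable ones: 64-bit fields, the sign always kept on the numerator, reduction to lowest terms in the constructor, and an exception for a zero denominator. Only multiplication is shown, and the extreme value long.MinValue is not handled.

```csharp
using System;

public sealed class Fraction : IEquatable<Fraction>
{
    public long Numerator { get; private set; }
    public long Denominator { get; private set; }

    public Fraction(long numerator, long denominator)
    {
        if (denominator == 0)
            throw new DivideByZeroException("Denominator must not be zero.");
        // Keep the sign on the numerator only.
        if (denominator < 0) { numerator = -numerator; denominator = -denominator; }
        long g = Gcd(Math.Abs(numerator), denominator);
        Numerator = numerator / g;
        Denominator = denominator / g;
    }

    // Euclid's algorithm; Gcd(0, d) is d, so 0/d normalizes to 0/1.
    private static long Gcd(long a, long b)
    {
        while (b != 0) { long t = a % b; a = b; b = t; }
        return a;
    }

    public static Fraction operator *(Fraction x, Fraction y)
    {
        // Cross-reduce before multiplying so intermediate values stay small.
        long g1 = Gcd(Math.Abs(x.Numerator), y.Denominator);
        long g2 = Gcd(Math.Abs(y.Numerator), x.Denominator);
        return new Fraction(
            checked((x.Numerator / g1) * (y.Numerator / g2)),
            checked((x.Denominator / g2) * (y.Denominator / g1)));
    }

    // Because every instance is stored in lowest terms, field-by-field
    // comparison is enough to decide equality.
    public bool Equals(Fraction other)
    {
        return !ReferenceEquals(other, null)
            && Numerator == other.Numerator
            && Denominator == other.Denominator;
    }

    public override bool Equals(object obj) { return Equals(obj as Fraction); }

    public override int GetHashCode()
    {
        unchecked { return (17 * 31 + Numerator.GetHashCode()) * 31 + Denominator.GetHashCode(); }
    }

    public double ToDouble() { return (double)Numerator / Denominator; }

    public override string ToString() { return Numerator + "/" + Denominator; }
}
```

With this sketch, new Fraction(2, 3) * new Fraction(3, 4) prints as 1/2, and new Fraction(1, -2) prints as -1/2.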

Chapter 2 in Timothy Budd's "Classic Data Structures In C++" has a very nice Rational class in C++. It includes all the points made here, including an implementation of GCD to normalize 6/12 into 1/2. Well worth reading.

Hashtable/Dictionary collisions

Using the standard English letters and underscore only, how many characters can be used at a maximum without causing a potential collision in a hashtable/dictionary?
So strings like:
blur
Blur
b
Blur_The_Shades_Slightly_With_A_Tint_Of_Blue
...

There's no guarantee that you won't get a collision between single letters.
You probably won't, but the algorithm used in string.GetHashCode isn't specified, and could change. (In particular it changed between .NET 1.1 and .NET 2.0, which burned people who assumed it wouldn't change.)
Note that hash code collisions won't stop well-designed hashtables from working - you should still be able to get the right values out, it'll just potentially need to check more than one key using equality if they've got the same hash code.
Any dictionary which relies on hash codes being unique is missing important information about hash codes, IMO :) (Unless it's operating under very specific conditions where it absolutely knows they'll be unique, i.e. it's using a perfect hash function.)

Given a perfect hashing function (which you're not typically going to have, as others have mentioned), you can find the maximum possible number of characters that guarantees no two strings will produce a collision, as follows:
No. of unique hash codes available = 2 ^ 32 = 4294967296 (assuming a 32-bit integer is used for hash codes)
Size of character set = 2 * 26 + 1 = 53 (26 lowercase and 26 uppercase letters in the Latin alphabet, plus underscore)
Then you must consider that a string of length l (or less) has a total of 54 ^ l representations. Note that the base is 54 rather than 53 because the string can terminate after any character, adding an extra possibility per char - not that it greatly affects the result.
Taking the no. of unique hash codes as your maximum number of string representations, you get the following simple equation:
54 ^ l = 2 ^ 32
And solving it:
log2 (54 ^ l) = 32
l * log2 54 = 32
l = 32 / log2 54 = 5.56
(Where log2 is the logarithm function of base 2.)
Since string lengths clearly can't be fractional, you take the integral part to give a maximum length of just 5. Very short indeed, but observe that this restriction would prevent even the remotest chance of a collision given a perfect hash function.
This is largely theoretical, however, as I've mentioned, and I'm not sure how much use it might be in the design of anything. That said, hopefully it helps you understand the matter from a theoretical viewpoint, on top of which you can add the practical considerations (e.g. non-perfect hash functions, non-uniformity of distribution).
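The same bound can be checked by brute-force counting instead of logarithms. The sketch below counts every string of length 0 up to l over a 53-symbol alphabet exactly (rather than using the 54 ^ l shortcut) and finds the largest l for which they all still fit into 2 ^ 32 hash codes. The PigeonholeBound name is hypothetical, and a perfect hash function is assumed, as in the answer above.

```csharp
using System;
using System.Numerics;

public static class PigeonholeBound
{
    // Largest length l such that the number of distinct strings of
    // length 0..l over the alphabet does not exceed 2^hashBits.
    public static int MaxLength(int alphabetSize, int hashBits)
    {
        if (alphabetSize < 2) throw new ArgumentOutOfRangeException("alphabetSize");
        BigInteger limit = BigInteger.Pow(2, hashBits);
        BigInteger total = 1;          // the empty string
        BigInteger ofThisLength = 1;   // strings of exactly the current length
        int length = 0;
        while (true)
        {
            ofThisLength *= alphabetSize;
            if (total + ofThisLength > limit) return length;
            total += ofThisLength;
            length++;
        }
    }
}
```

PigeonholeBound.MaxLength(53, 32) agrees with the figure of 5 derived above.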

Universal Hashing
To calculate the probability of collisions with S strings of length L with W bits per character to a hash of length H bits assuming an optimal universal hash (1) you could calculate the collision probability based on a hash table of size (number of buckets) N.
First things first: we can assume an ideal hashtable implementation (2) that splits the H bits in the hash perfectly into the available buckets N (3). This means H becomes meaningless except as a limit for N.
W and L are simply the basis for an upper bound for S. For simpler maths, assume that strings of length < L are simply padded to L with a special null character. If we are interested in the worst case, this is 54^L (26*2 + '_' + null). Plainly this is a ludicrous number; the actual number of entries is more useful than the character set and the length, so we will simply work as if S were a variable in its own right.
We are left trying to put S items into N buckets.
This then becomes a very well-known problem, the birthday paradox.
Solving this for various probabilities and numbers of buckets is instructive, but assuming we have 1 billion buckets (so about 4GB of memory in a 32-bit system), we would need only about 37K entries before we hit a 50% chance of there being at least one collision. Given that, trying to avoid any collisions in a hashtable becomes plainly absurd.
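The 37K figure can be reproduced with a short loop. This sketch accumulates the logarithm of the probability that all entries land in distinct buckets and stops once that probability falls to the target; it assumes entries are spread uniformly over the buckets, and the BirthdayThreshold name is hypothetical.

```csharp
using System;

public static class BirthdayThreshold
{
    // Smallest number of entries for which the chance of at least one
    // shared bucket reaches `target`, assuming a uniform spread.
    public static long EntriesForCollisionChance(double buckets, double target)
    {
        if (target <= 0 || target >= 1) throw new ArgumentOutOfRangeException("target");
        double logAllDistinct = 0;                 // log P(no collision) for `entries` items
        double logThreshold = Math.Log(1 - target);
        long entries = 1;
        while (logAllDistinct > logThreshold)
        {
            // Adding one more item: it must avoid the `entries` occupied buckets.
            logAllDistinct += Math.Log(1 - entries / buckets);
            entries++;
        }
        return entries;
    }
}
```

With 365 buckets and a 50% target this gives the classic 23; with one billion buckets it gives a little over 37,000.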
All this does not mean that we should not care about the behaviour of our hash functions. Clearly these numbers assume ideal implementations; they are an upper bound on how good we can get. A poor hash function can give far worse collisions in some areas, or waste some of the possible 'space' by never or rarely using it, all of which can cause hashes to be less than optimal and even degrade to a performance that looks like a list but with much worse constant factors.
The .NET framework's implementation of the string's hash function is not great (in that it could be better) but is probably acceptable for the vast majority of users and is reasonably efficient to calculate.
An Alternative Approach: Perfect Hashing
If you wish, you can generate what are known as perfect hashes. This requires full knowledge of the input values in advance, however, so it is not often useful. In a similar vein to the above maths, we can show that even perfect hashing has its limits:
Recall the limit of 54 ^ L strings of length L. However, we only have H bits (we shall assume 32), which is about 4 billion different numbers. So if you can have truly any string and any number of them, then you have to satisfy:
54 ^ L <= 2 ^ 32
And solving it:
log2 (54 ^ L) <= 32
L * log2 54 <= 32
L <= 32 / log2 54 ≈ 5.56
Since string lengths clearly can't be fractional, you are left with a maximum length of just 5. Very short indeed.
If you know that you will only ever have a set of strings well below 4 Billion in size then perfect hashing would let you handle any value of L, but restricting the set of values can be very hard in practice and you must know them all in advance or degrade to what amounts to a database of string -> hash and add to it as new strings are encountered.
(1) For this exercise the universal hash is optimal as we wish to reduce the probability of any collision i.e. for any input the probability of it having output x from a set of possibilities R is 1/R.
(2) Note that doing an optimal job on the hashing (and the internal bucketing) is quite hard but that you should expect the built in types to be reasonable if not always ideal.
(3) In this example I have avoided the question of closed and open addressing. This does have some bearing on the probabilities involved but not significantly.

A hash algorithm isn't supposed to guarantee uniqueness. Given that there are far more potential strings (26^n for n length, even ignoring special chars, spaces, capitalization, non-english chars, etc.) than there are places in your hashtable, there's no way such a guarantee could be fulfilled. It's only supposed to guarantee a good distribution.

If your key is a string (e.g., a Dictionary) then its GetHashCode() will be used. That's a 32-bit integer. Hashtable defaults to a 1 key to value load factor and increases the number of buckets to maintain that load factor. So if you do see collisions they should tend to occur around reallocation boundaries (and decrease shortly after reallocation).
