In this stackoverflow answer there is a piece of code to transform a char to lowercase:
// tricky way to convert to lowercase
sb.Append((char)(c | 32));
What is happening in (char)(c | 32) and how is it possible to do the opposite to transform to uppercase?
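This is a cheap ASCII trick that only works for that particular encoding, and only for the letters A-Z and a-z (other characters are either left unchanged or turned into a different character). It is not recommended. But to answer your question, the reverse operation involves masking instead of combining: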
sb.Append((char)(c & ~32));
Here, you take the bitwise inverse of 32 and use bitwise-AND. That will force that single bit off and leave others unchanged.
This works because the ASCII character set is laid out such that the lower 5 bits are the same for the upper- and lowercase forms of a letter, which differ only in the 6th bit (32, or 00100000b). When you use bitwise-OR, you add the bit in. When you mask with the inverse, you remove the bit.
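The same bit manipulation can be sketched in Python, assuming the input is a single ASCII letter. The function names to_lower_ascii and to_upper_ascii are made up for this illustration and are not part of the original answer:

```python
def to_lower_ascii(c: str) -> str:
    """Set bit 5 (value 32). Only meaningful for ASCII letters A-Z and a-z."""
    return chr(ord(c) | 32)


def to_upper_ascii(c: str) -> str:
    """Clear bit 5 (value 32). Only meaningful for ASCII letters A-Z and a-z."""
    return chr(ord(c) & ~32)


print(to_lower_ascii("A"), to_upper_ascii("a"))  # a A
print(to_lower_ascii("@"))                       # ` (a non-letter gets mangled)
```

The last line shows why the trick is fragile: '@' is not a letter, yet setting the bit still changes it into a different character.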
Looking at Wolfram: collision to bits - graph with 99 documents, I'd need a 25.5-bit hashing algorithm to have a 0.0001 chance of a collision.
I looked at CRC-24 and I was wondering if it could be improved to use even fewer characters. I have a big list of characters that can be used for the hash: basically all Unicode characters except for 4 or 5 characters.
Now how do you create your own hash algorithm based on a set of usable characters in C#?
EDIT:
To clarify the issue: I have 99 strings. I want to cut them to a maximum length of 64 chars. This can create duplicates, but the strings need to stay unique while keeping their meaning. The idea was to create a hash as small as possible and replace the last characters with the hash of the original string. The hash should of course have a low probability of collision and be as short as possible. As I understand it, the more symbols can be used in the hash (a-z0-9, a-zA-Z0-9, or even all Unicode characters), the fewer characters the hash needs for the same collision probability. I looked at SHA-1 and just trimming it, or CRC-32, but they do not use the "full potential" of Unicode characters.
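As a rough check of the numbers in this question, the sketch below uses the common birthday-bound approximation p ≈ n² / (2 · 2^bits) and then converts the required bits into a number of symbols for a given alphabet size. The helper names and the 65536-symbol alphabet are assumptions chosen for illustration, and an ideal, uniformly distributed hash is assumed throughout:

```python
import math


def bits_needed(n_items: int, p_collision: float) -> float:
    """Birthday-bound approximation: p ~ n^2 / (2 * 2^bits)."""
    return math.log2(n_items ** 2 / (2 * p_collision))


def symbols_needed(bits: float, alphabet_size: int) -> int:
    """Number of symbols from the alphabet needed to carry that many bits."""
    return math.ceil(bits / math.log2(alphabet_size))


bits = bits_needed(99, 0.0001)
print(round(bits, 1))                # 25.5
print(symbols_needed(bits, 36))      # 5 (a-z0-9)
print(symbols_needed(bits, 62))      # 5 (a-zA-Z0-9)
print(symbols_needed(bits, 65536))   # 2 (hypothetical 16-bit alphabet)
```

Under these assumptions, a larger alphabet does shorten the hash, but the gain grows only with the logarithm of the alphabet size.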
When implementing the Run-length encoding (RLE), can I assume that the Runs are going to be shorter than one byte?
So there will not be a situation where there is a run like this
WWWBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB...
Where there are 256 B's because you cannot represent that length in one byte whereas you can represent the W's as 3W
If not, should the Run be split into two Runs? How should this situation be handled? I couldn't find any information about this case.
To my understanding, you understand the situation correctly. The word length used for counting the repetition of a character is usually a byte, and the individual characters usually are also encoded as a byte. If in the input there is a repetition of e.g. 300 b, the encoding will be as follows.
255 (number of repetitions of the next character)
98 (ASCII value for b)
45 (number of repetitions of the next character)
98 (ASCII value for b)
In total, a run longer than 255 will have to be split into two or more runs. That being said, the actual encoding depends on the specific implementation; it is also possible to use types other than bytes for counting the repetitions of a character.
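A minimal sketch of this splitting behaviour in Python, assuming the simple (count, value) byte-pair layout described above. The function names rle_encode and rle_decode are illustrative, and real RLE formats differ in their details:

```python
def rle_encode(data: bytes, max_run: int = 255) -> bytes:
    """Encode as (count, value) byte pairs, splitting runs longer than max_run.

    max_run must not exceed 255, because the count is stored in one byte.
    """
    out = bytearray()
    i = 0
    while i < len(data):
        value = data[i]
        run = 1
        # Extend the run until the value changes or the count byte is full.
        while i + run < len(data) and data[i + run] == value and run < max_run:
            run += 1
        out += bytes((run, value))
        i += run
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Inverse of rle_encode: expand each (count, value) pair."""
    out = bytearray()
    for k in range(0, len(encoded), 2):
        out += bytes([encoded[k + 1]]) * encoded[k]
    return bytes(out)


encoded = rle_encode(b"b" * 300)
print(list(encoded))  # [255, 98, 45, 98]
```

The decoder needs no special handling for split runs: two consecutive pairs with the same value simply expand to one longer run.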
I'm converting a Java library over to C# as I rewrite a legacy application and I need some assistance. I need to understand what this line in Java is doing:
sb.append(Integer.toHexString((b & 0xFF) | 0x100).substring(1,3))
and if this C# line is equivalent
result += (Convert.ToInt32(b).ToString("x2") + " ").Substring(1,3);
In both cases b is a byte from a SHA-1 hash that the code is looping through.
The Java part I don't understand is ((b & 0xFF) | 0x100). It looks like it's padding it?
Ordinarily I would compare the output from the Java application to what my C# is generating but I am not in a position to do that right now (and it's frustrating me - trust me).
You don't need to change the original that drastically - the C# equivalent (assuming 'sb' is a StringBuilder) is just:
sb.Append(((b & 0xFF) | 0x100).ToString("x").Substring(1, 2));
b & 0xFF will mask the lowest byte. So whatever b is, you will get something between 0x00 and 0xFF.
In the resulting integer, 9th bit is set, no matter what it was before. So you'll have something between 0x0100 to 0x01FF.
From that string, the substring from index 1 to 3 is cropped. It will give you the last two digits, which will be something between 00 and FF. The |0x100 is a neat trick to have Integer.toHexString give you a leading zero for the last two digits, which it wouldn't do otherwise according to its Javadoc ...
If I remember correctly, your C# code does not do exactly the same thing. But I hope that with this explanation you can build it yourself :)
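The masking and padding steps described above can be traced in a small Python sketch. The function name hex_byte is an assumption for illustration; this is not the Java or C# code itself:

```python
def hex_byte(b: int) -> str:
    """Two-digit lowercase hex for a byte, using the OR-with-0x100 trick."""
    widened = (b & 0xFF) | 0x100      # always in 0x100..0x1FF, so three hex digits
    return format(widened, "x")[1:]   # drop the leading '1'


print(hex_byte(10))   # 0a
print(hex_byte(-1))   # ff (a signed byte of -1 is 0xFF after masking)
```

The result matches ordinary zero-padded formatting of the masked byte, which is why a two-digit hex format specifier is the more direct way to get the same output.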
I have a very simple problem that is giving me a really big headache: I am porting a bit of code from C++ into C#, and for a very simple operation I am getting totally different results:
C++
char OutBuff[25];
int i = 0;
unsigned int SumCheck = 46840;
OutBuff[i++] = SumCheck & 0xFF; //these 2 ANDED make 248
The value written to the char array is -8
C#
char[] OutBuff = new char[25];
int i = 0;
uint SumCheck = 46840;
OutBuff[i++] = (char)(SumCheck & 0xFF); //these 2 ANDED also make 248
The value written to the char array is 248.
Interestingly they are both the same characters, so this may be something to do with the format of a char array in C++ and C# - but ultimately I would be grateful if someone could give me a definitive answer.
Thanks in advance for any help.
David
It's overflow in C++, and no overflow in C#.
In C#, char is two bytes. In C++, char is one byte!
So in C#, there is no overflow, and the value is retained. In C++, there is integral overflow.
Change the data type from char to uint16_t or unsigned char (in C++), and you will see the same result. Note that unsigned char can hold a value of 248 without overflow. It can hold values up to 255, in fact.
Maybe you should be using byte or sbyte instead of char. (char is only to store text chars and the actual binary serialization for char is not the same as in c++. char allows us to store characters without worrying about character byte width.)
A C# char is actually 16 bits, while a C++ char is usually 8 bits (a char is exactly 8 bits on Visual C++). So you're actually overflowing the integer in the C++ code, but the C# code does not overflow, since it holds more bits, and therefore has a bigger integer range.
Notice that 248 is outside the range of a signed char (-128 to 127). That should give you a hint that C#'s char might be bigger than 8 bits.
You're probably meant to use C#'s sbyte (the closest equivalent to Visual C++'s char) if you want to preserve the behavior, although you may want to recheck the code since there's an overflow occurring in the C++ implementation.
As everyone has stated, in C# a char is 16 bits while in C++ it is usually 8 bits.
-8 and 248 in binary both (essentially) look like this:
11111000
Because a char in C++ is usually 8 bits (which is in fact your case), the result is -8. In C#, the value looks like this:
00000000 11111000
Which is 16 bits and becomes 248.
The 2's complement representation of -8 is the same as the binary representation of 248 (unsigned).
So the binary representation is the same in both cases. In C++ it is interpreted as an int8 result, and in C# it's simply interpreted as a positive integer (int is 32 bits, and truncating to 16 by casting to char doesn't affect the sign in this case).
The difference between -8 and 248 is all in how you interpret the data. They are stored exactly the same (0xF8). The C++ standard leaves the signedness of plain char to the implementation; in Visual C++ it is signed by default, so 0xF8 = -8. If you change the data type to 'unsigned char', it will be interpreted as 248. VS also has a compiler option to make 'char' default to 'unsigned'.
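To make the "same bits, different interpretation" point concrete, here is a small Python sketch that stores the low byte once and reads it back as both an unsigned and a signed 8-bit value. It only illustrates the interpretation and is not a model of either compiler; the variable names are made up:

```python
import struct

sum_check = 46840
low_byte = sum_check & 0xFF                 # 248, i.e. 0xF8

raw = bytes([low_byte])                     # the single stored byte
as_unsigned = struct.unpack("B", raw)[0]    # read as unsigned 8-bit
as_signed = struct.unpack("b", raw)[0]      # read as signed 8-bit (two's complement)

print(hex(low_byte), as_unsigned, as_signed)  # 0xf8 248 -8
```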
Using the standard English letters and underscore only, how many characters can be used at a maximum without causing a potential collision in a hashtable/dictionary?
So strings like:
blur
Blur
b
Blur_The_Shades_Slightly_With_A_Tint_Of_Blue
...
There's no guarantee that you won't get a collision between single letters.
You probably won't, but the algorithm used in string.GetHashCode isn't specified, and could change. (In particular it changed between .NET 1.1 and .NET 2.0, which burned people who assumed it wouldn't change.)
Note that hash code collisions won't stop well-designed hashtables from working - you should still be able to get the right values out, it'll just potentially need to check more than one key using equality if they've got the same hash code.
Any dictionary which relies on hash codes being unique is missing important information about hash codes, IMO :) (Unless it's operating under very specific conditions where it absolutely knows they'll be unique, i.e. it's using a perfect hash function.)
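The point that collisions do not break a well-designed hashtable can be demonstrated with Python's built-in dict standing in for a .NET dictionary. CollidingKey is a hypothetical key type written for this sketch, whose hash is deliberately identical for every instance:

```python
class CollidingKey:
    """Hypothetical key type whose hash is the same for every instance."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __hash__(self) -> int:
        return 42  # every key collides

    def __eq__(self, other: object) -> bool:
        return isinstance(other, CollidingKey) and self.name == other.name


table = {CollidingKey("blur"): 1, CollidingKey("Blur"): 2, CollidingKey("b"): 3}
print(table[CollidingKey("Blur")])  # 2
print(len(table))                   # 3
```

Lookups still return the right value because equality is checked after the hash match; the cost is only the extra comparisons.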
Given a perfect hashing function (which you're not typically going to have, as others have mentioned), you can find the maximum possible number of characters that guarantees no two strings will produce a collision, as follows:
No. of unique hash codes available = 2 ^ 32 = 4294967296 (assuming a 32-bit integer is used for hash codes)
Size of character set = 2 * 26 + 1 = 53 (26 lowercase and 26 uppercase letters in the Latin alphabet, plus underscore)
Then you must consider that a string of length l (or less) has a total of 54 ^ l representations. Note that the base is 54 rather than 53 because the string can terminate after any character, adding an extra possibility per char - not that it greatly affects the result.
Taking the no. of unique hash codes as your maximum number of string representations, you get the following simple equation:
54 ^ l = 2 ^ 32
And solving it:
log2 (54 ^ l) = 32
l * log2 54 = 32
l = 32 / log2 54 = 5.56
(Where log2 is the logarithm function of base 2.)
Since string lengths clearly can't be fractional, you take the integral part to give a maximum length of just 5. Very short indeed, but observe that this restriction would prevent even the remotest chance of a collision given a perfect hash function.
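The arithmetic above can be checked with a few lines of Python. This is only a numeric sketch of the derivation under the same assumptions as the answer (32-bit hash codes, 53 characters plus one extra "terminator" possibility); the variable names are made up:

```python
import math

hash_bits = 32
alphabet = 2 * 26 + 1          # letters in both cases plus underscore
symbols = alphabet + 1         # extra "terminator" possibility, as in the answer

max_len = hash_bits / math.log2(symbols)
print(round(max_len, 2))       # 5.56
print(math.floor(max_len))     # 5
print(symbols ** 5 <= 2 ** hash_bits, symbols ** 6 <= 2 ** hash_bits)  # True False
```

The last line confirms the rounding direction: 54^5 still fits into 2^32 distinct hash codes, while 54^6 does not.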
This is largely theoretical, however, as I've mentioned, and I'm not sure how much use it might be in the design of anything. That said, hopefully it helps you understand the matter from a theoretical viewpoint, on top of which you can add the practical considerations (e.g. non-perfect hash functions, non-uniformity of distribution).
Universal Hashing
To calculate the probability of collisions with S strings of length L with W bits per character to a hash of length H bits, assuming an optimal universal hash (1), you could calculate the collision probability based on a hash table of size (number of buckets) N.
First things first: we can assume an ideal hashtable implementation (2) that splits the H bits in the hash perfectly into the available buckets N (3). This means H becomes meaningless except as a limit for N.
W and L are simply the basis for an upper bound for S. For simpler maths, assume that strings of length < L are simply padded to L with a special null character. If we are interested in the worst case, this is 54^L (26*2 + '_' + null). Plainly this is a ludicrous number; the actual number of entries is more useful than the character set and the length, so we will simply work as if S were a variable in its own right.
We are left trying to put S items into N buckets.
This then becomes a very well known problem, the birthday paradox.
Solving this for various probabilities and numbers of buckets is instructive, but assuming we have 1 billion buckets (so about 4GB of memory in a 32 bit system), we would need only 37K entries before we hit a 50% chance of there being at least one collision. Given that, trying to avoid any collisions in a hashtable becomes plainly absurd.
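The 37K figure can be reproduced with the usual birthday-paradox approximation P ≈ 1 - exp(-n(n-1) / (2N)). The sketch below assumes perfectly uniform hashing into N buckets, and the function names are made up for illustration:

```python
import math


def entries_for_collision(buckets: int, probability: float = 0.5) -> float:
    """Approximate entry count at which P(at least one collision) reaches probability."""
    return math.sqrt(2 * buckets * math.log(1 / (1 - probability)))


def collision_probability(entries: int, buckets: int) -> float:
    """Approximate P(at least one collision): 1 - exp(-n(n-1) / (2N))."""
    return 1 - math.exp(-entries * (entries - 1) / (2 * buckets))


print(round(entries_for_collision(10 ** 9)))             # about 37,000
print(round(collision_probability(37_233, 10 ** 9), 2))  # 0.5
```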
All this does not mean that we should not care about the behaviour of our hash functions. Clearly these numbers assume ideal implementations; they are an upper bound on how good we can get. A poor hash function can give far worse collisions in some areas, or waste some of the possible 'space' by never or rarely using it, all of which can cause hashes to be less than optimal and even degrade to a performance that looks like a list but with much worse constant factors.
The .NET framework's implementation of the string's hash function is not great (in that it could be better) but is probably acceptable for the vast majority of users and is reasonably efficient to calculate.
An Alternative Approach: Perfect Hashing
If you wish, you can generate what are known as perfect hashes. This requires full knowledge of the input values in advance, however, so it is not often useful. In a similar vein to the above maths, we can show that even perfect hashing has its limits:
Recall the limit of 54 ^ L strings of length L. However, we only have H bits (we shall assume 32), which is about 4 billion different numbers. So if you can have truly any string and any number of them, then you have to satisfy:
54 ^ L <= 2 ^ 32
And solving it:
log2 (54 ^ L) <= 32
L * log2 54 <= 32
L <= 32 / log2 54 = 5.56
Since string lengths clearly can't be fractional, you are left with a maximum length of just 5. Very short indeed.
If you know that you will only ever have a set of strings well below 4 Billion in size then perfect hashing would let you handle any value of L, but restricting the set of values can be very hard in practice and you must know them all in advance or degrade to what amounts to a database of string -> hash and add to it as new strings are encountered.
(1) For this exercise the universal hash is optimal as we wish to reduce the probability of any collision, i.e. for any input the probability of it having output x from a set of possibilities R is 1/R.
(2) Note that doing an optimal job on the hashing (and the internal bucketing) is quite hard but that you should expect the built in types to be reasonable if not always ideal.
(3) In this example I have avoided the question of closed and open addressing. This does have some bearing on the probabilities involved but not significantly.
A hash algorithm isn't supposed to guarantee uniqueness. Given that there are far more potential strings (26^n for n length, even ignoring special chars, spaces, capitalization, non-english chars, etc.) than there are places in your hashtable, there's no way such a guarantee could be fulfilled. It's only supposed to guarantee a good distribution.
If your key is a string (e.g., a Dictionary) then its GetHashCode() will be used. That's a 32-bit integer. Hashtable defaults to a 1 key to value load factor and increases the number of buckets to maintain that load factor. So if you do see collisions they should tend to occur around reallocation boundaries (and decrease shortly after reallocation).