Can someone kindly point me to an explanation, if there is one, to this chunk of code and what it does and why? Specifically the bottom line...
protected uint uMask;
int nBits = (int)Math.Log(BlockSize, 2);
uMask = 0xffffffff << nBits;
For instance, on the first iteration BlockSize is 8, nBits is 3 and after the operation, the uMask is 4294967288.
I tried Googling the third line as I don't know how to put this into plain language, and I got examples of code and that is not what I was looking for.
This looks to be creating a mask to exclude bits from a larger structure. Some piece of data is probably stored in a larger value and has a maximum value Blocksize. This code determines how many bits are required for that item, given its maximum value in Blocksize. It then uses this number of bits to create a mask. After the last line, uMask will look something like this in binary (assuming Blocksize is 8 and nBits is 3:
1111111111111111111111111111111111111111111111111111111111111000
or in hex:
0xfffffffc
This would typically be used to remove one field stored in piece of data in order to isolate some other field. Conceptually, you might have bits used for value A and value B in a 64 bit value:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBB
Suppose you want to get the value for A. You could do something like this:
result = value & uMask; // Step 1: Mask off B
result = result >> nBits // Step 2:Align A
Data will look like this:
Step 1: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA000
Step 2: 000AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Unless you have a savant math ability, you're never going to be able to read masks in decimal.
Related
I can easily convert a character string into a Huffman-Tree then encode into a binary sequence.
How should I save these to be able to actually compress the original data and then recover back?
I searched the web but I only could find guides and answers showing until what I already did. How can I use huffman algorithm further to actually achieve lossless compression?
I am using C# for this project.
EDIT: I've achieved these so far, might need rethinking.
I am attempting to compress a text file. I use Huffman Algorithm but there are some key points I couldn't figure out:
"aaaabbbccdef" when compressed gives this encoding
Key = a, Value = 11
Key = b, Value = 01
Key = c, Value = 101
Key = d, Value = 000
Key = e, Value = 001
Key = f, Value = 100
11111111010101101101000001100 is the encoded version. It normally needs 12*8 bits but we've compressed it to be 29 bits. This example might be a litte unnecessary for a file this small but let me explain what I tried to do.
We have 29 bits here but we need 8*n bits so I fill the encodedString with zeros until it becomes a multiple of eight. Since I can add 1 to 7 zeros it is more than enough to use 1-byte to represent this. This case I've added 3 zeros
11111111010101101101000001100000 Then add as binary how many extra bits I've added to the front and the split into 8-bit pieces
00000011-11111111-01010110-11010000-01100000
Turn these into ASCII characters
ÿVÐ`
Now if I have the encoding table I can look to the first 8bits convert that to integer ignoreBits and by ignoring the last ignoreBits turn it back to the original form.
The problem is I also want to include uncompressed version of encoding table with this file to have a fully functional ZIP/UNZIP prpgram but I am having trouble deciding when my ignoreBits ends, my encodingTable startse/ends, encoded bits start/end.
I thought about using null character but there is no assurance that Values cannot produce a null character. "ddd" in this situation produces 00000000-0.....
Your representation of the code needs to be self-terminating. Then you know the next bit is the start of the Huffman codes. One way is to traverse the tree that resulted from the Huffman code, writing a 0 bit for each branch, or a 1 bit followed by the symbol for leaf. When the traverse is done, you know the next bit must be the codes.
You also need to make your data self terminating. Note that in the example you give, the added three zero bits will be decoded as another 'd'. So you will incorrectly get 'aaaabbbccdefd' as the result. You need to either precede the encoded data with a count of symbols expected, or you need to add a symbol to your encoded set, with frequency 1, that marks the end of the data.
I have rarely used bitmasks and am trying to get more familiar with them. I understand various basic usages of them. As I understand them this should be able to work, but it seems to not work.
I have a use case where I have four different ints that may come in different arrangements, and I need to check if the current arrangement of ints has already come before as a different arrangement.
So one iteration they might come as:
2, 5, 10, 8
Next iteration:
1, 0, 2, 5
Now on the next iteration if this comes:
0, 1, 2, 5
It needs to discern that last set has already come in a different arrangement and skip it.
I am wondering, can I create a mask out of these ints, put them in a HashSet, so then I have easy lookup for whether or not that set of ints has come before?
Basically I am doing this:
int mask = int0 & int1 & int2 & int3;
if (checkHashSet.Contains(mask))
return; // int set already came, skip
//int set has not been processed, add mask and process
checkHashSet.Add(mask);
But that seems to be producing a mask that ends up equal to all following masks generated. So this doesn't work.
Can this work like this somehow? What would be the most performant way to check if a set of ints, no matter their arrangement, has already been processed?
Bit mask is generated by shift
int mask = (1 << int0) & (1 << int1) & (1 << int2) & (1 << int3);
HashSet.Add will check whether the item exists, Contains is redundant.
if(checkHashSet.Add(mask))
//int set has not been processed, add mask and process
else
// int set already came, skip
If the integer is greater than 31, you can use long or ulong, if it is greater than 64, then use 2 longs or BigInteger
To your question of "is there a better way". I think probably yes.
1) Sort your inputs and keep a map of entries you've already seen as a string. This is close to what you proposed but would be easier to read & implement.
2) An map having a key that is int[1000] would be easier than bit masks. As you process your input, for each number you find, increment the location in the array ++array[n]. You can then add that to a map with array[] as your key. You can search the map to work out if you've seen a particular combination before.
3) 1<< n only goes so far so you'd need array of ints to do it as a proper mask. array[n / 64] & (1 << (n % 64) ) or something like that. A good solution for embedded perhaps, but generally harder to understand. Also, it won't work reliably if any number is repeated as bits can only only mark one occurrence. Or use BigInteger as per Shingo's answer.
Personally I'd go the first one.
I would like to generate coupon codes , e.g. AYB4ZZ2. However, I would also like to be able to mark the used coupons and limit their global number, let's say N. The naive approach would be something like "generate N unique alphanumeric codes, put them into database and perform a db search on every coupon operation."
However, as far as I realize, we can also attempt to find a function MakeCoupon(n), which converts the given number into a coupon-like string with predefined length.
As far as I understand, MakeCoupon should fullfill the following requirements:
Be bijective. It's inverse MakeNumber(coupon) should be effectively computable.
Output for MakeCoupon(n) should be alphanumeric and should have small and constant length - so that it could be called human readable. E.g. SHA1 digest wouldn't pass this requirement.
Practical uniqueness. Results of MakeCoupon(n) for every natural n <= N should be totally unique or unique in the same terms as, for example, MD5 is unique (with the same extremely small collision probability).
(this one is tricky to define) It shouldn't be obvious how to enumerate all remaining coupons from a single coupon code - let's say MakeCoupon(n) and MakeCoupon(n + 1) should visually differ.
E.g. MakeCoupon(n), which simply outputs n padded with zeroes would fail this requirement, because 000001 and 000002 don't actually differ "visually".
Q:
Does any function or function generator, which fullfills the following requirements, exist? My search attempts only lead me to [CPAN] CouponCode, but it does not fullfill the requirement of the corresponding function being bijective.
Basically you can split your operation into to parts:
Somehow "encrypt" your initial number n, so that two consecutive numbers yield (very) different results
Construct your "human-readable" code from the result of step 1
For step 1 I'd suggest to use a simple block cipher (e.g. a Feistel cipher with a round function of your choice). See also this question.
Feistel ciphers work in several rounds. During each round, some round function is applied to one half of the input, the result is xored with the other half and the two halves are swapped. The nice thing about Feistel ciphers is that the round function hasn't to be two-way (the input to the round function is retained unmodified after each round, so the result of the round function can be reconstructed during decryption). Therefore you can choose whatever crazy operation(s) you like :). Also Feistel ciphers are symmetric, which fulfills your first requirement.
A short example in C#
const int BITCOUNT = 30;
const int BITMASK = (1 << BITCOUNT/2) - 1;
static uint roundFunction(uint number) {
return (((number ^ 47894) + 25) << 1) & BITMASK;
}
static uint crypt(uint number) {
uint left = number >> (BITCOUNT/2);
uint right = number & BITMASK;
for (int round = 0; round < 10; ++round) {
left = left ^ roundFunction(right);
uint temp = left; left = right; right = temp;
}
return left | (right << (BITCOUNT/2));
}
(Note that after the last round there is no swapping, in the code the swapping is simply undone in the construction of the result)
Apart from fulfilling your requirements 3 and 4 (the function is total, so for different inputs you get different outputs and the input is "totally scrambled" according to your informal definition) it is also it's own inverse (thus implicitely fulfilling requirement 1), i.e. crypt(crypt(x))==x for each x in the input domain (0..2^30-1 in this implementation). Also it's cheap in terms of performance requirements.
For step 2 just encode the result to some base of your choice. For instance, to encode a 30-bit number, you could use 6 "digits" of an alphabet of 32 characters (so you can encode 6*5=30 bits).
An example for this step in C#:
const string ALPHABET= "AG8FOLE2WVTCPY5ZH3NIUDBXSMQK7946";
static string couponCode(uint number) {
StringBuilder b = new StringBuilder();
for (int i=0; i<6; ++i) {
b.Append(ALPHABET[(int)number&((1 << 5)-1)]);
number = number >> 5;
}
return b.ToString();
}
static uint codeFromCoupon(string coupon) {
uint n = 0;
for (int i = 0; i < 6; ++i)
n = n | (((uint)ALPHABET.IndexOf(coupon[i])) << (5 * i));
return n;
}
For inputs 0 - 9 this yields the following coupon codes
0 => 5VZNKB
1 => HL766Z
2 => TMGSEY
3 => P28L4W
4 => EM5EWD
5 => WIACCZ
6 => 8DEPDA
7 => OQE33A
8 => 4SEQ5A
9 => AVAXS5
Note, that this approach has two different internal "secrets": First, the round function together with the number of rounds used and second, the alphabet you use for encoding the encyrpted result. But also note, that the shown implementation is in no way secure in a cryptographical sense!
Also note, that the shown function is a total bijective function, in the sense, that every possible 6-character code (with characters out of your alphabet) will yield a unique number. To prevent anyone from entering just some random code, you should define some kind of restictions on the input number. E.g. only issue coupons for the first 10.000 numbers. Then, the probability of some random coupon code to be valid would be 10000/2^30=0.00001 (it would require about 50000 attempts to find a correct coupon code). If you need more "security", you can just increase the bit size/coupon code length (see below).
EDIT: Change Coupon code length
Changing the length of the resulting coupon code requires some math: The first (encrypting) step only works on a bit string with even bit count (this is required for the Feistel cipher to work).
In the the second step, the number of bits that can be encoded using a given alphabet depends on the "size" of chosen alphabet and the length of the coupon code. This "entropy", given in bits, is, in general, not an integer number, far less an even integer number. For example:
A 5-digit code using a 30 character alphabet results in 30^5 possible codes which means ld(30^5)=24.53 bits/Coupon code.
For a four-digit code, there is a simple solution: Given a 32-Character alphabet you can encode *ld(32^4)=5*4=20* Bits. So you can just set the BITCOUNT to 20 and change the for loop in the second part of the code to run until 4 (instead of 6)
Generating a five-digit code is a bit trickier and somhow "weakens" the algorithm: You can set the BITCOUNT to 24 and just generate a 5-digit code from an alphabet of 30 characters (remove two characters from the ALPHABET string and let the for loop run until 5).
But this will not generate all possible 5-digit-codes: with 24 bits you can only get 16,777,216 possible values from the encryption stage, the 5 digit codes could encode 24,300,000 possible numbers, so some possible codes will never be generated. More specifically, the last position of the code will never contain some characters of the alphabet. This can be seen as a drawback, because it narrows down the set of valid codes in an obvious way.
When decoding a coupon code, you'll first have to run the codeFromCoupon function and then check, if bit 25 of the result is set. This would mark an invalid code that you can immediately reject. Note that, in practise, this might even be an advantage, since it allows a quick check (e.g. on the client side) of the validity of a code without giving away all internals of the algorithm.
If bit 25 is not set you'll call the crypt function and get the original number.
Though I may get docked for this answer I feel like I need to respond - I really hope that you hear what I'm saying as it comes from a lot of painful experience.
While this task is very academically challenging, and software engineers tend to challenge their intelect vs. solving problems, I need to provide you with some direction on this if I may. There is no retail store in the world, that has any kind of success anyway, that doesn't keep very good track of each and every entity that is generated; from each piece of inventory to every single coupon or gift card they send out those doors. It's just not being a good steward if you are, because it's not if people are going to cheat you, it's when, and so if you have every possible item in your arsenal you'll be ready.
Now, let's talk about the process by which the coupon is used in your scenario.
When the customer redeems the coupon there is going to be some kind of POS system in front right? And that may even be an online business where they are then able to just enter their coupon code vs. a register where the cashier scans a barcode right (I'm assuming that's what we're dealing with here)? And so now, as the vendor, you're saying that if you have a valid coupon code I'm going to give you some kind of discount and because our goal was to generate coupon codes that were reversable we don't need a database to verify that code, we can just reverse it right! I mean it's just math right? Well, yes and no.
Yes, you're right, it's just math. In fact, that's also the problem because so is cracking SSL. But, I'm going to assume that we all realize the math used in SSL is just a bit more complex than anything used here and the key is substantially larger.
It does not behoove you, nor is it wise for you to try and come up with some kind of scheme that you're just sure nobody cares enough to break, especially when it comes to money. You are making your life very difficult trying to solve a problem you really shouldn't be trying to solve because you need to be protecting yourself from those using the coupon codes.
Therefore, this problem is unnecessarily complicated and could be solved like this.
// insert a record into the database for the coupon
// thus generating an auto-incrementing key
var id = [some code to insert into database and get back the key]
// base64 encode the resulting key value
var couponCode = Convert.ToBase64String(id);
// truncate the coupon code if you like
// update the database with the coupon code
Create a coupon table that has an auto-incrementing key.
Insert into that table and get the auto-incrementing key back.
Base64 encode that id into a coupon code.
Truncate that string if you want.
Store that string back in the database with the coupon just inserted.
What you want is called Format-preserving encryption.
Without loss of generality, by encoding in base 36 we can assume that we are talking about integers in 0..M-1 rather than strings of symbols. M should probably be a power of 2.
After choosing a secret key and specifying M, FPE gives you a pseudo-random permutation of 0..M-1 encrypt along with its inverse decrypt.
string GenerateCoupon(int n) {
Debug.Assert(0 <= n && n < N);
return Base36.Encode(encrypt(n));
}
boolean IsCoupon(string code) {
return decrypt(Base36.Decode(code)) < N;
}
If your FPE is secure, this scheme is secure: no attacker can generate other coupon codes with probability higher than O(N/M) given knowledge of arbitrarily many coupons, even if he manages to guess the number associated with each coupon that he knows.
This is still a relatively new field, so there are few implementations of such encryption schemes. This crypto.SE question only mentions Botan, a C++ library with Perl/Python bindings, but not C#.
Word of caution: in addition to the fact that there are no well-accepted standards for FPE yet, you must consider the possibility of a bug in the implementation. If there is a lot of money on the line, you need to weigh that risk against the relatively small benefit of avoiding a database.
You can use a base-36 number system. Assume that you want 6 characters in the coupen output.
pseudo code for MakeCoupon
MakeCoupon(n)
{
Have an byte array of fixed size, say 6. Initialize all the values to 0.
convert the number to base - 36 and store the 'digits' in an array
(using integer division and mod operations)
Now, for each 'digit' find the corresponding ascii code assuming the
digits to start from 0..9,A..Z
With this convension output six digits as a string.
}
Now the calculating the number back is the reverse of this operation.
This would work with very large numbers (35^6) with 6 allowed characters.
Choose a cryptographic function c. There are a few requirements on c, but for now let us take SHA1.
choose a secret key k.
Your coupon code generating function could be, for number n:
concatenate n and k as "n"+"k" (this is known as salting in password management)
compute c("n"+"k")
the result of SHA1 is 160bits, encode them (for instance with base64) as an ASCII string
if the result is too long (as you said it is the case for SHA1), truncate it to keep only the first 10 letters and name this string s
your coupon code is printf "%09d%s" n s, i.e. the concatenation of zero-padded n and the truncated hash s.
Yes, it is trivial to guess n the number of the coupon code (but see below). But it is hard to generate another valid code.
Your requirements are satisfied:
To compute the reverse function, just read the first 9 digits of the code
The length is always 19 (9 digits of n, plus 10 letters of hash)
It is unique, since the first 9 digits are unique. The last 10 chars are too, with high probability.
It is not obvious how to generate the hash, even if one guesses that you used SHA1.
Some comments:
If you're worried that reading n is too obvious, you can obfuscate it lightly, like base64 encoding, and alternating in the code the characters of n and s.
I am assuming that you won't need more than a billion codes, thus the printing of n on 9 digits, but you can of course adjust the parameters 9 and 10 to your desired coupon code length.
SHA1 is just an option, you could use another cryptographic function like private key encryption, but you need to check that this function remains strong when truncated and when the clear text is provided.
This is not optimal in code length, but has the advantage of simplicity and widely available libraries.
I have this peculiar piece of code that is bothering me,
// exbPtr points to 128-bit unsigned integer
// lgID is a "short" with 0xFFFF being the max value
int hash = (*exbPtr + (int)lgID * 9) & tlpLengthMask;
Initially this "hash table", which is really an array is initialized to 256 elements, and tlpLengthMask is set to 255.
Then there is this mysterious code .. with a comment right above it saying "if we reached here .. there has been a collision". And then it starts looping back again, so looks like this is a hash collision, and re-hashing?
hash = (hash + (int)lgID * 2 + 1) & tlpLengthMask;
In addition, there is a ton of debug code that says that the length of this array should be a power of 2 because we're using mask as a modulus.
Can someone explain what the authors intent was? What is the reasoning behind this?
EDIT -- what I'm trying to discern is why he multiplied by 9, and then why multiply by 2 to re-hash.
There are three possibilities:
1) The original author just constructed the hashing functions more or less randomly, saw that they worked well enough, and left it at that.
2) The original author had test data that well represented the actual data and saw that these functions worked extremely well for his exact application.
3) This code is performing very poorly and his hash table is not operating efficiently at all.
The only real requirement is that the output look evenly distributed over the hash table for whatever input he actually encounters and always produce the same output for the same input. While these kinds of functions generally perform poorly, they may be good enough for this specific application.
By the way, this type of open hashing doesn't work in the face of deletions. For example, say you add one record to the table. Then you go to add a second, but it collides with the first, so you skip forward to add the second. Everything's fine now -- you can find both the first record (directly) and the second record (by skipping over the first when you find it at the second record's hash location).
But if you delete the first record, how do you find the second? When you look at the second record's hash location, you find nothing. Do you try skipping? If so, how many times?
There are workarounds to these problems, but they tend to be very easy to do incorrectly.
Just to avoid inventing hot-water, I am asking here...
I have an application with lots of arrays, and it is running out of memory.
So the thought is to compress the List<int> to something else, that would have same interface (IList<T> for example), but instead of int I could use shorter integers.
For example, if my value range is 0 - 100.000.000 I need only ln2(1000000) = 20 bits. So instead of storing 32 bits, I can trim the excess and reduce memory requirements by 12/32 = 37.5%.
Do you know of an implementation of such array. c++ and java would be also OK, since I could easily convert them to c#.
Additional requirements (since everyone is starting to getting me OUT of the idea):
integers in the list ARE unique
they have no special property so they aren't compressible in any other way then reducing the bit count
if the value range is one million for example, lists would be from 2 to 1000 elements in size, but there will be plenty of them, so no BitSets
new data container should behave like re-sizable array (regarding method O()-ness)
EDIT:
Please don't tell me NOT to do it. The requirement for this is well thought-over, and it is the ONLY option that is left.
Also, 1M of value range and 20 bit for it is ONLY AN EXAMPLE. I have cases with all different ranges and integer sizes.
Also, I could have even shorter integers, for example 7 bit integers, then packing would be
00000001
11111122
22222333
33334444
444.....
for first 4 elements, packed into 5 bytes.
Almost done coding it - will be posted soon...
Since you can only allocate memory in byte quantums, you are essentially asking if/how you can fit the integers in 3 bytes instead of 4 (but see #3 below). This is not a good idea.
Since there is no 3-byte sized integer type, you would need to use something else (e.g. an opaque 3-byte buffer) in its place. This would require that you wrap all access to the contents of the list in code that performs the conversion so that you can still put "ints" in and pull "ints" out.
Depending on both the architecture and the memory allocator, requesting 3-byte chunks might not affect the memory footprint of your program at all (it might simply litter your heap with unusable 1-byte "holes").
Reimplementing the list from scratch to work with an opaque byte array as its backing store would avoid the two previous issues (and it can also let you squeeze every last bit of memory instead of just whole bytes), but it's a tall order and quite prone to error.
You might want instead to try something like:
Not keeping all this data in memory at the same time. At 4 bytes per int, you 'd need to reach hundreds of millions of integers before memory runs out. Why do you need all of them at the same time?
Compressing the dataset by not storing duplicates if possible. There are bound to be a few of them if you are up to hundreds of millions.
Changing your data structure so that it stores differences between successive values (deltas), if that is possible. This might be not very hard to achieve, but you can only realistically expect something at the ballpark of 50% improvement (which may not be enough) and it will totally destroy your ability to index into the list in constant time.
One option that will get your from 32 bits to 24bits is to create a custom struct that stores an integer inside of 3 bytes:
public struct Entry {
byte b1; // low
byte b2; // middle
byte b3; // high
public void Set(int x) {
b1 = (byte)x;
b2 = (byte)(x >> 8);
b3 = (byte)(x >> 16);
}
public int Get() {
return (b3 << 16) | (b2 << 8) | b1;
}
}
You can then just create a List<Entry>.
var list = new List<Entry>();
var e = new Entry();
e.Set(12312);
list.Add(e);
Console.WriteLine(list[0].Get()); // outputs 12312
This reminds me of base64 and similar kinds of binary-to-text encoding.
They take 8 bit bytes and do a bunch of bit-fiddling to pack them into 4-, 5-, or 6-bit printable characters.
This also reminds me of the Zork Standard Code for Information Interchange (ZSCII), which packs 3 letters into 2 bytes, where each letter occupies 5 bits.
It sounds like you want to taking a bunch of 10- or 20-bit integers and pack them into a buffer of 8-bit bytes.
The source code is available for many libraries that handle a packed array of single bits
(a
b
c
d
e).
Perhaps you could
(a) download that source code and modify the source (starting from some BitArray or other packed encoding), recompiling to create a new library that handles packing and unpacking 10- or 20-bit integers rather than single bits.
It may take less programming and testing time to
(b) write a library that, from the outside, appears to act just like (a), but internally it breaks up 20-bit integers into 20 separate bits, then stores them using an (unmodified) BitArray class.
Edit: Given that your integers are unique you could do the following: store unique integers until the number of integers you're storing is half the maximum number. Then switch to storing the integers you don't have. This will reduce the storage space by 50%.
Might be worth exploring other simplification techniques before trying to use 20-bit ints.
How do you treat duplicate integers? If have lots of duplicates you could reduce the storage size by storing the integers in a Dictionary<int, int> where keys are unique integers and values are corresponding counts. Note this assumes you don't care about the order of your integers.
Are your integers all unique? Perhaps you're storing lots of unique integers in the range 0 to 100 mil. In this case you could try storing the integers you don't have. Then when determining if you have an integer i just ask if it's not in your collection.