I am currently trying to create a custom Deflate implementation in C#.
I am now implementing the "pattern search" part, where I have (up to) 32k of data and am trying to find the longest possible match for my input.
RFC 1951, which defines Deflate, says the following about that process:
The compressor uses a chained hash table to find duplicated strings,
using a hash function that operates on 3-byte sequences. At any
given point during compression, let XYZ be the next 3 input bytes to
be examined (not necessarily all different, of course). First, the
compressor examines the hash chain for XYZ. If the chain is empty,
the compressor simply writes out X as a literal byte and advances one
byte in the input. If the hash chain is not empty, indicating that
the sequence XYZ (or, if we are unlucky, some other 3 bytes with the
same hash function value) has occurred recently, the compressor
compares all strings on the XYZ hash chain with the actual input data
sequence starting at the current point, and selects the longest
match.
I do know what a hash function is, and I do know what a HashTable is as well. But what is a "chained hash table", and how could such a structure be designed to handle a large amount of data efficiently (in C#)? Unfortunately I didn't understand how the structure described in the RFC works.
What kind of hash function could I choose (what would make sense)?
Thank you in advance!
A chained hash table is a hash table that stores every item you put in it, even if the key for 2 items hashes to the same value, or even if 2 items have exactly the same key.
A DEFLATE implementation needs to store a bunch of (key, data) items in no particular order, and rapidly look up a list of all the items with a given key.
In this case, the key is 3 consecutive bytes of uncompressed plaintext, and the data is some sort of pointer or offset to where that 3-byte substring occurs in the plaintext.
Many hashtable/dictionary implementations store both the key and the data for every item.
It's not necessary to store the key in the table for DEFLATE, but it doesn't hurt anything other than using slightly more memory during compression.
Some hashtable/dictionary implementations such as the C++ STL unordered_map insist that every (key, data) item they store must have a unique key. When you try to store another (key, data) item with the same key as some older item already in the table, these implementations delete the old item and replace it with the new item.
That does hurt -- if you accidentally use the C++ STL unordered_map or similar implementation, your compressed file will be larger than if you had used a more appropriate library such as the C++ STL hash_multimap.
Such an error may be difficult to detect, since the resulting (unnecessarily large) compressed files can be correctly decompressed by any standard DEFLATE decompressor to a file bit-for-bit identical to the original file.
A few implementations of DEFLATE and other compression algorithms deliberately use such an implementation anyway, sacrificing compressed file size in order to gain compression speed.
As Nick Johnson said, the default hash function used in your standard "hashtable" or "dictionary" implementation is probably more than adequate.
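For illustration, here is a minimal sketch of such a chained table in C#, built on a plain Dictionary whose values are the chains (lists of positions). The class and its methods are made up for this example, and packing the 3 bytes into a 24-bit key is just one reasonable choice:

using System.Collections.Generic;

// Chained table for DEFLATE-style matching: the key is derived from 3 bytes,
// and the value is the "chain": a list of positions where those 3 bytes occurred.
class MatchTable
{
    private readonly Dictionary<int, List<int>> chains = new Dictionary<int, List<int>>();

    // Illustrative key: pack the 3 bytes into 24 bits (a real hash function would work too).
    private static int Key(byte[] data, int pos)
    {
        return (data[pos] << 16) | (data[pos + 1] << 8) | data[pos + 2];
    }

    // Record that a 3-byte sequence starts at 'pos'.
    public void Insert(byte[] data, int pos)
    {
        int key = Key(data, pos);
        List<int> chain;
        if (!chains.TryGetValue(key, out chain))
        {
            chain = new List<int>();
            chains[key] = chain;
        }
        chain.Add(pos);
    }

    // Earlier positions whose 3 bytes match the 3 bytes at 'pos'; the compressor
    // then compares the data at each candidate to pick the longest match.
    public IList<int> Candidates(byte[] data, int pos)
    {
        List<int> chain;
        return chains.TryGetValue(Key(data, pos), out chain) ? (IList<int>)chain : new int[0];
    }
}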
http://en.wikipedia.org/wiki/Hashtable#Separate_chaining
In this case, they're describing a hash table where each element contains a list of strings - in this case, all the strings starting with the specified three-character prefix. You should simply be able to use the standard .NET Hashtable or Dictionary primitives - there's no need to replicate their exact implementation details.
32k is not a lot of data, so you don't have to worry about scaling your hashtable - and even if you did, the built-in primitives are likely to be more efficient than anything you could write yourself.
Related
We are trying to convert the text "HELLOWORLDTHISISALARGESTRINGCONTENT" into a smaller text. When we hash it with MD5 we get 16 bytes, but since hashing is one-way we cannot get the original back. Is there any other way to convert this large string to a smaller one and then revert back to the same data? If so, please let us know how to do it.
Thanks in advance.
Most compression algorithms won't be able to do much with a sequence that short (or may actually make it bigger), so no: there isn't much you can do to magically shrink it. Your best bet would probably be to just generate a GUID, store the full value keyed against the GUID (in a database or whatever), and then use the short value as a one-time usage key to look up the long value (and then erase the record).
It heavily depends on the input data. In general - in the worst case - you can't reduce the size of a string through compression if the input data is too short or has high entropy.
Hashing is the wrong approach: a hash function maps large input data to a short value, but it is not reversible, and it does not guarantee (by itself) that no second set of data maps to the same value.
What you can try to do is to implement a compression algorithm or a lookup table.
Compression can be done by ziplib or any other compression library (just google for it). The lookup approach requires a second place to store the lookup information. For example, when you get the first input string, you map it to the number 1 and save the mapping "1 -> {input data}" somewhere else. For every subsequent data set you add another mapping entry. If the set of possible input data is finite, this approach may save you space.
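A minimal sketch of that lookup-table approach in C# (the class is hypothetical and assumes you can persist the table somewhere; ids are handed out in insertion order):

using System.Collections.Generic;

// Maps each distinct input string to a small integer id and back again.
class LookupTable
{
    private readonly Dictionary<string, int> ids = new Dictionary<string, int>();
    private readonly List<string> values = new List<string>();

    // Returns the short id for 'text', adding a new entry the first time it is seen.
    public int GetId(string text)
    {
        int id;
        if (!ids.TryGetValue(text, out id))
        {
            id = values.Count;
            ids[text] = id;
            values.Add(text);
        }
        return id;
    }

    // Reverts an id back to the original string.
    public string GetValue(int id)
    {
        return values[id];
    }
}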
I have (what seems like) a large task at hand.
I need to go through different archive volumes of multiple folders (we're talking terabytes of data). Within each folder is a .pst file. Some of these folders (and therefore files) may be exactly the same (name or data within the file). I want to be able to compare more than 2 files at once (if possible) to see if any duplicates are found.
Once the duplicates are found, I need to delete them and keep the originals and then eventually extract all the unique emails.
I know there are programs out there that can find duplicates, but I'm not sure what arguments they would need to pass in these files and I don't know if they can handle such large volumes of data.
I'd like to program in either C# or VB. I'm at a loss on where I should start. Any suggestions??
Ex...
m:\mail\name1\name.pst
m:\mail\name2\name.pst (same exact data as the one above)
m:\mail\name3\anothername.pst (duplicate file to the other 2)
If you just want to remove entire duplicate files the task is very simple to implement.
You will have to go through all your folders and hash the contents of each file. The hash produced has some number of bits (e.g. 32 to 256 bits). If two file hashes are equal, there is an extremely high probability (depending on the collision resistance of your hash function, i.e. its number of bits) that the respective files are identical.
Of course, the implementation is up to you (I am not a C# or VB programmer), but I would suggest something like the following pseudo-code (next I explain each step):
do {
    file_byte_array = get_file_contents_into_byte_array(file);    // 1
    hash = get_hash_from_byte_array(file_byte_array);              // 2
    if (hashtable_has_elem(hashtable, hash))                       // 3
        remove_file(file);                                         // 4
    else                                                           // 5
        hashtable_insert_elem(hashtable, hash, file);              // 6
} while (there_are_files_to_evaluate);                             // 7
This logic should be executed over all of your .pst files. At line 1 (I assume you have your file opened) you read all the contents of your file into a byte array.
Once you have the byte array of your file, you must hash it using a hash function (line 2). You have plenty of hash function implementations to choose from. Some implementations require you to break the file into blocks and hash each block's contents; breaking your file into parts may be the only option if your files are really huge and do not fit in memory. On the other hand, there are many functions which accept a whole stream (the super-fast MurmurHash3, for example). If you have efficiency requirements, stay away from cryptographic hash functions, as they are much heavier and you do not need cryptographic properties to perform your task.
Finally, after computing the hash you just need some way to save the hashes and compare them, in order to find identical hashes (read: files) and delete them (lines 3-6). I propose using a hash table or a dictionary, where the key (the object you use to perform lookups) is the file hash and the File object is the entry value.
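A rough C# sketch of that pseudo-code might look like the following (MD5 is used only because it ships with .NET; swap in whatever hash function you settle on, and note that ComputeHash reads the stream itself, so the explicit byte array from line 1 is not strictly needed):

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

static class DuplicateRemover
{
    // Deletes every file whose content hash has already been seen, keeping the first occurrence.
    public static void RemoveDuplicates(IEnumerable<string> pstFiles)
    {
        // Key: hex string of the file hash. Value: path of the first file seen with that hash.
        var seen = new Dictionary<string, string>();

        using (var md5 = MD5.Create())
        {
            foreach (string path in pstFiles)
            {
                string hash;
                using (FileStream stream = File.OpenRead(path))            // 1: read the contents
                    hash = BitConverter.ToString(md5.ComputeHash(stream)); // 2: hash them

                if (seen.ContainsKey(hash))   // 3: seen this hash before?
                    File.Delete(path);        // 4: duplicate, remove it
                else
                    seen.Add(hash, path);     // 5/6: first occurrence, keep it
            }
        }
    }
}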
Notes:
Remember: the more bits the hash value has, the lower the probability of collisions. If you want to know more about collision probabilities in hash functions, read this excellent article. You must pay attention to this topic since your objective is to delete files: if you have a collision, you will delete a file which is not actually identical and lose it forever. There are many tactics to identify collisions, which you can combine and add to your algorithm (e.g. compare file sizes, compare file content at random positions, use more than one hash function). My advice is to use all of these tactics. If you use two hash functions, then for two files to be considered identical the values of both hash functions must be equal:
file1, file2;

file1_hash1 = hash_function1(file1);
file2_hash1 = hash_function1(file2);
file1_hash2 = hash_function2(file1);
file2_hash2 = hash_function2(file2);

if (file1_hash1 == file2_hash1 &&
    file1_hash2 == file2_hash2)
    // file1 is_duplicate_of file2;
else
    // file1 is_NOT_duplicate_of file2;
I would work through the process of finding duplicates by first recursively finding all of the PST files, then matching on file length, then filtering by a fixed prefix of bytes, and finally performing a full hash or byte comparison to get actual matches.
Recursively building the list and finding potential matches can be as simple as this:
Func<DirectoryInfo, IEnumerable<FileInfo>> recurse = null;
recurse = di => di.GetFiles("*.pst")
.Concat(di.GetDirectories()
.SelectMany(cdi => recurse(cdi)));
var potentialMatches =
recurse(new DirectoryInfo(@"m:\mail"))
.ToLookup(fi => fi.Length)
.Where(x => x.Skip(1).Any());
The potentialMatches query gives you a complete series of potential matches by file size.
I would then use the following functions (whose implementation I'll leave to you) to filter this list further.
Func<FileInfo, FileInfo, int, bool> prefixBytesMatch = /* your implementation */
Func<FileInfo, FileInfo, bool> hashMatch = /* your implementation */
By limiting the matches by file length and then by a prefix of bytes you will significantly reduce the computation of hashes required for your very large files.
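Purely as an illustration (the answer deliberately leaves these up to you), the two predicates could be filled in along these lines; the MD5-based hashMatch is just one possible choice:

// needs: using System; using System.IO; using System.Linq; using System.Security.Cryptography;

Func<FileInfo, FileInfo, int, bool> prefixBytesMatch = (a, b, count) =>
{
    using (FileStream sa = a.OpenRead())
    using (FileStream sb = b.OpenRead())
    {
        // Compare only the first 'count' bytes of both files.
        for (int i = 0; i < count; i++)
            if (sa.ReadByte() != sb.ReadByte())
                return false;
        return true;
    }
};

Func<FileInfo, FileInfo, bool> hashMatch = (a, b) =>
{
    using (var md5 = MD5.Create())
    {
        byte[] hashA, hashB;
        using (FileStream sa = a.OpenRead()) hashA = md5.ComputeHash(sa);
        using (FileStream sb = b.OpenRead()) hashB = md5.ComputeHash(sb);
        return hashA.SequenceEqual(hashB);
    }
};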
I hope this helps.
I have a string of arbitrary length (let's say 5 to 2000 characters) for which I would like to calculate a checksum.
Requirements
The same checksum must be returned each time a calculation is done for a string
The checksum must be unique (no collisions)
I can not store previous IDs to check for collisions
Which algorithm should I use?
Update:
Is there an approach which is reasonably unique? i.e. one where the likelihood of a collision is very small.
The checksum should be alphanumeric
The strings are unicode
The strings are actually texts that should be translated and the checksum is stored with each translation (so a translated text can be matched back to the original text).
The length of the checksum is not important for me (the shorter, the better)
Update2
Let's say that I got the following string "Welcome to this website. Navigate using the flashy but useless menu above".
The string is used in a view in a similar way to gettext on Linux, i.e. the user just writes (in a Razor view)
@T("Welcome to this website. Navigate using the flashy but useless menu above")
Now I need a way to identify that string so that I can fetch it from a data source (there are several implementations of the data source). Having to use the entire string as a key seems a bit inefficient, and I'm therefore looking for a way to generate a key from it.
That's not possible.
If you can't store previous values, it's not possible to create a unique checksum that is smaller than the information in the string.
Update:
The term "reasonably unique" doesn't make sense: either it's unique or it's not.
To get a reasonably low risk of hash collisions, you can use a reasonably large hash code.
The MD5 algorithm for example produces a 16 byte hash code. Convert the string to a byte array using some encoding that preserves all characters, for example UTF-8, calculate the hash code using the MD5 class, then convert the hash code byte array into a string using the BitConverter class:
string theString = "asdf";
string hash;
using (System.Security.Cryptography.MD5 md5 = System.Security.Cryptography.MD5.Create()) {
    hash = BitConverter.ToString(
        md5.ComputeHash(Encoding.UTF8.GetBytes(theString))
    ).Replace("-", String.Empty);
}
Console.WriteLine(hash);
Output:
912EC803B2CE49E4A541068D495AB570
You can use cryptographic hash functions for this. Most of them are available in .NET.
For example:
var sha1 = System.Security.Cryptography.SHA1.Create();
byte[] buf = System.Text.Encoding.UTF8.GetBytes("test");
byte[] hash= sha1.ComputeHash(buf, 0, buf.Length);
//var hashstr = Convert.ToBase64String(hash);
var hashstr = System.BitConverter.ToString(hash).Replace("-", "");
Note: This is an answer to the original question.
Assuming you want the checksum to be stored in a variable of fixed size (i.e. an integer), you cannot satisfy your second constraint.
The checksum must be unique (no collisions)
You cannot avoid collisions because there will be more distinct strings than there are possible checksum values.
I realize this post is practically ancient, but I stumbled upon it and have run into an almost identical issue in the past. We had an nvarchar(8000) field that we needed to lookup against.
Our solution was to create a persisted computed column using CHECKSUM of the nasty lookup field. We had an auto-incrementing ID field and keyed on (checksum, id)
When reading from the table, we wrote a proc that took the lookup text, computed the checksum, and then filtered on rows where both the checksum and the text were equal.
You could easily perform the checksum portions at the application level based on the answer above and store them manually instead of using our DB-centric solution. But the point is to get a reasonably sized key for indexing so that your text comparison runs against a bucket of collisions instead of the entire dataset.
Good luck!
To guarantee uniqueness for strings of almost unlimited size, treat the variable-length string as a set of concatenated substrings, each "x characters" in length. Your hash function then only needs to guarantee uniqueness for a maximum substring length, and you generate a series of checksum values, one per substring. Think of it as the equivalent of a network IP address made up of a set of checksum numbers.
Your issue with collisions is the assumption that a collision forces a slower search method to resolve each one. If there is an insignificant number of possible collisions compared to the number of hashed objects, then as a whole the extra overhead becomes nil. A collision occurs when the table is sized smaller than the maximum number of objects; this doesn't have to be the case, because the table may have "holes" and each slot within the table may keep a reference count of the objects that hash to it. Only if this count is greater than 1 is there a collision, or multiple instances of the same substring.
I need to generate a unique id for files of up to 200-300 MB. The condition is that the algorithm should be quick; it should not take much time. I am selecting the files from a desktop and calculating a hash value as such:
HMACSHA256 myhmacsha256 = new HMACSHA256(key);
byte[] hashValue = myhmacsha256.ComputeHash(fileStream);
fileStream is a handle to the file from which to read the content. This method is going to take a lot of time for obvious reasons.
Does windows generate a key for a file for its own book keeping that I could directly use ?
Is there any other way to identify if the files are the same, instead of matching file names, which is not very foolproof?
MD5.Create().ComputeHash(fileStream);
Alternatively, I'd suggest looking at this rather similar question.
How about generating a hash from the info that's readily available from the file itself? i.e. concatenate:
File Name
File Size
Created Date
Last Modified Date
and create your own?
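A minimal sketch of that suggestion (the helper name is made up; note this identifies the directory entry, not the file's contents, so renaming or touching the file changes the id):

using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

static class FileMetadataId
{
    // Builds an id from metadata only; no file content is read.
    public static string Compute(string path)
    {
        var fi = new FileInfo(path);
        string combined = fi.Name + "|" + fi.Length + "|"
                        + fi.CreationTimeUtc.Ticks + "|" + fi.LastWriteTimeUtc.Ticks;

        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(combined));
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }
}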
Computing and comparing hashes requires reading through both files completely. My suggestion is to first check the file sizes; if they are identical, then go through the files byte by byte.
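A small sketch of that size-first, byte-by-byte approach (the class name is made up; FileStream buffers internally, so ReadByte is less slow than it looks):

using System.IO;

static class FileComparer
{
    public static bool AreIdentical(string pathA, string pathB)
    {
        var a = new FileInfo(pathA);
        var b = new FileInfo(pathB);
        if (a.Length != b.Length)
            return false;               // different size, cannot be the same content

        using (FileStream sa = a.OpenRead())
        using (FileStream sb = b.OpenRead())
        {
            int byteA;
            while ((byteA = sa.ReadByte()) != -1)
                if (byteA != sb.ReadByte())
                    return false;       // first mismatching byte
        }
        return true;
    }
}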
If you want a "quick and dirty" check, I would suggest looking at CRC-32. It is extremely fast (the algorithm simply involves doing XOR with table lookups), and if you aren't too concerned about collision resistance, a combination of the file size and the CRC-32 checksum over the file data should be adequate. 28.5 bits are required to represent the file size (that gets you to 379M bytes), which means you get a checksum value of effectively just over 60 bits. I would use a 64-bit quantity to store the file size, for future proofing, but 32 bits would work too in your scenario.
If collision resistance is a consideration, then you pretty much have to use one of the tried-and-true-yet-unbroken cryptographic hash algorithms. I would still concur with what Devils child wrote and also include the file size as a separate (readily accessible) part of the hash, however; if the sizes don't match, there is no chance that the file content can be the same, so in that case the computationally intensive hash calculation can be skipped.
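As an illustration of the CRC-32-plus-size idea, here is a hand-rolled, table-driven CRC-32 combined with the file length into one 64-bit value (older .NET frameworks have no built-in CRC-32, so the table is built manually; the class name is made up):

using System.IO;

static class QuickFileId
{
    private static readonly uint[] Table = BuildTable();

    private static uint[] BuildTable()
    {
        var table = new uint[256];
        for (uint i = 0; i < 256; i++)
        {
            uint c = i;
            for (int k = 0; k < 8; k++)
                c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            table[i] = c;
        }
        return table;
    }

    // File length in the high 32 bits, CRC-32 of the content in the low 32 bits.
    public static ulong Compute(string path)
    {
        uint crc = 0xFFFFFFFFu;
        long length;
        using (FileStream fs = File.OpenRead(path))
        {
            length = fs.Length;
            var buffer = new byte[81920];
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
                for (int i = 0; i < read; i++)
                    crc = Table[(crc ^ buffer[i]) & 0xFF] ^ (crc >> 8);
        }
        crc ^= 0xFFFFFFFFu;
        return ((ulong)length << 32) | crc;
    }
}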
How is that integer hash generated by the GetHashCode() function? Is it a random value which is not unique?
In string, it is overridden to make sure that there exists only one hash code for a particular string.
How is that done?
How does a hash code speed up searching for a specific key in a hash table?
What are the advantages of using a hash code over using an index directly into the collection (like in arrays)?
Can someone help?
Basically, hash functions use some generic function to digest data and generate a fingerprint (an integer number here) for that data. Unlike an index, this fingerprint depends ONLY on the data, and should be free of any predictable ordering based on the data. Any change to a single bit of the data should also change the fingerprint considerably.
Notice that nowhere does this guarantee that different data won't give the same hash. In fact, quite the opposite: this happens very often, and is called a collision. But with an integer, the probability of this is roughly 1 in 4 billion (1 in 2^32). If a collision happens, you just compare the actual objects you are hashing to see if they match.
This fingerprint can then be used as an index to an array (or arraylist) of stored values. Because the fingerprint is dependent only on the data, you can compute a hash for something and just check the array element for that hash value to see if it has been stored already. Otherwise, you'd have to go through the whole array checking if it matches an item.
You can also VERY quickly do associative arrays by using 2 arrays, one with Key values (indexed by hash), and a second with values mapped to those keys. If you use a hash, you just need to know the key's hash to find the matching value for the key. This is much faster than doing a binary search on a sorted key list, or a scan of the whole array to find matching keys.
There are MANY ways to generate a hash, and all of them have various merits, but few are simple. I suggest consulting the wikipedia page on hash functions for more info.
A hash code IS an index, and a hash table, at its very lowest level, IS an array. But for a given key value, we determine the index into the hash table differently, to make for much faster data retrieval.
Example: You have 1,000 words and their definitions. You want to store them so that you can retrieve the definition for a word very, very quickly -- faster than a binary search, which is what you would have to do with an array.
So you create a hash table. You start with an array substantially bigger than 1,000 entries -- say 5,000 (the bigger, the more time-efficient).
The way you'll use your table is, you take the word to look up, and convert it to a number between 0 and 4,999. You choose the algorithm for doing this; that's the hashing algorithm. But you could doubtless write something that would be very fast.
Then you use the converted number as an index into your 5,000-element array, and insert/find your definition at that index. There's no searching at all: you've created the index directly from the search word.
All of the operations I've described are constant time; none of them takes longer when we increase the number of entries. We just need to make sure that there is sufficient space in the hash to minimize the chance of "collisions", that is, the chance that two different words will convert to the same integer index. Because that can happen with any hashing algorithm, we need to add checks to see if there is a collision, and do something special (if "hello" and "world" both hash to 1,234 and "hello" is already in the table, what will we do with "world"? Simplest is to put it in 1,235, and adjust our lookup logic to allow for this possibility.)
Edit: after re-reading your post: a hashing algorithm is most definitely not random, it must be deterministic. The index generated for "hello" in my example must be 1,234 every single time; that's the only way the lookup can work.
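A minimal sketch of the table just described: 5,000 slots, a simple illustrative hash to map a word to 0-4,999, and the "put it in the next slot" strategy (linear probing) for collisions. It assumes the table never fills up:

class WordTable
{
    private const int Size = 5000;
    private readonly string[] words = new string[Size];
    private readonly string[] definitions = new string[Size];

    // Illustrative hash: fold the characters into a number between 0 and 4,999.
    private static int Hash(string word)
    {
        int h = 0;
        foreach (char c in word)
            h = (h * 31 + c) % Size;
        return h;
    }

    public void Add(string word, string definition)
    {
        int i = Hash(word);
        while (words[i] != null && words[i] != word)
            i = (i + 1) % Size;         // collision: try the next slot
        words[i] = word;
        definitions[i] = definition;
    }

    public string Find(string word)
    {
        int i = Hash(word);
        while (words[i] != null)
        {
            if (words[i] == word)
                return definitions[i];
            i = (i + 1) % Size;         // keep probing past collisions
        }
        return null;                    // not present
    }
}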
Answering each one of your questions directly:
How is that integer hash generated by the GetHashCode() function? Is it a random value which is not unique?
An integer hash is generated by whatever method is appropriate for the object.
The generation method is not random but must follow consistent rules, ensuring that a hash generated for one particular object will equal the hash generated for an equivalent object. As an example, a hash function for an integer would be to simply return that integer.
In string, it is overridden to make sure that there exists only one hash code for a particular string. How is that done?
There are many ways this can be done. Here's an example I'm thinking of on the spot:
int hash = 0;
for (int i = 0; i < theString.Length; ++i)
{
    hash ^= theString[i];
}
This is a valid hash algorithm, because the same sequence of characters will always produce the same hash number. It's not a good hash algorithm (an extreme understatement), because many strings will produce the same hash. A valid hash algorithm doesn't have to guarantee uniqueness; a good hash algorithm makes the chance of two differing objects producing the same number extremely small.
How does a hash code speed up searching for a specific key in a hash table?
What are the advantages of using a hash code over using an index directly into the collection (like in arrays)?
A hash code is typically used in hash tables. A hash table is an array, but each entry in the array is a "bucket" of items, not just one item. If you have an object and you want to know which bucket it belongs in, calculate
hash_value MOD hash_table_size.
Then you simply have to compare the object with every item in the bucket. So a hash table lookup will most likely have a search time of O(1), as opposed to O(log(N)) for a sorted list or O(N) for an unsorted list.
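A minimal sketch of that bucket arrangement (the class is purely illustrative; each array slot holds a list of items, and the slot is chosen with hash_value MOD hash_table_size):

using System.Collections.Generic;

class BucketTable<T>
{
    private readonly List<T>[] buckets;

    public BucketTable(int size)
    {
        buckets = new List<T>[size];
        for (int i = 0; i < size; i++)
            buckets[i] = new List<T>();
    }

    // hash_value MOD hash_table_size, forced non-negative.
    private int Slot(T item)
    {
        return (item.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
    }

    public void Add(T item)
    {
        buckets[Slot(item)].Add(item);
    }

    // Only one bucket is scanned, not the whole table, so lookup is O(1) on average.
    public bool Contains(T item)
    {
        return buckets[Slot(item)].Contains(item);
    }
}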