I have what seems like a large task at hand.
I need to go through different archive volumes of multiple folders (we're talking terabytes of data). Within each folder is a .pst file. Some of these folders (and therefore files) may be exactly the same (by name or by the data within the file). I want to be able to compare more than 2 files at once (if possible) to see if any duplicates are found.
Once the duplicates are found, I need to delete them and keep the originals and then eventually extract all the unique emails.
I know there are programs out there that can find duplicates, but I'm not sure what arguments they would need for these files, and I don't know if they can handle such large volumes of data.
I'd like to program in either C# or VB. I'm at a loss on where I should start. Any suggestions??
Ex...
m:\mail\name1\name.pst
m:\mail\name2\name.pst (same exact data as the one above)
m:\mail\name3\anothername.pst (duplicate file to the other 2)
If you just want to remove entire duplicate files the task is very simple to implement.
You will have to go through all your folders and hash the contents of each file. The hash produced has some number of bits (e.g. 32 to 256 bits). If two file hashes are equal, there is an extremely high probability (depending on the collision resistance of your hash function, i.e. its number of bits) that the respective files are identical.
Of course, the implementation is up to you (I am not a C# or VB programmer), but I would suggest something like the following pseudo-code (below I explain each step and give you links demonstrating how to do it in C#):
do {
    file_byte_array = get_file_contents_into_byte_array(file);    // 1
    hash = get_hash_from_byte_array(file_byte_array);             // 2
    if (hashtable_has_elem(hashtable, hash))                      // 3
        remove_file(file);                                        // 4
    else                                                          // 5
        hashtable_insert_elem(hashtable, hash, file);             // 6
} while_there_are_files_to_evaluate                               // 7
This logic should be executed over all of your .pst files. At line 1 (I assume you have the file open) you write all of the file's contents into a byte array.
Once you have the byte array of your file, you must hash it using a hash function (line 2). You have plenty of hash function implementations to choose from. In some implementations you must break the file into blocks and hash each block's contents (e.g. here, here and here). Breaking your file into parts may be the only option if your files are really huge and do not fit in memory. On the other hand, there are many functions which accept a whole stream (e.g. here, here (an example very similar to your problem), here and here, but I would advise the super fast MurmurHash3). If you have efficiency requirements, stay away from cryptographic hash functions, as they are much heavier and you do not need cryptographic properties to perform your task.
Finally, after computing the hash you just need some way to save the hashes and compare them, in order to find identical hashes (i.e. identical files) and delete the duplicates (lines 3-6). I propose the use of a hash table or a dictionary, where the key (the object you use to perform lookups) is the file hash and the value is the File object.
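For illustration, here is a minimal C# sketch of that pseudo-code, assuming the m:\mail layout from the question and using SHA-256 simply because it ships with the framework (swap in a faster non-cryptographic hash such as MurmurHash3 if throughput matters, as discussed above). Treat it as a starting point rather than a finished tool; in particular you would want a byte-for-byte comparison before actually deleting anything:

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

class DuplicatePstFinder
{
    static void Main()
    {
        // hash -> first file seen with that hash (the "original" we keep)
        var seen = new Dictionary<string, string>();

        foreach (string path in Directory.EnumerateFiles(@"m:\mail", "*.pst", SearchOption.AllDirectories))
        {
            string hash = HashFile(path);
            if (seen.ContainsKey(hash))
            {
                Console.WriteLine("Duplicate: {0} (original: {1})", path, seen[hash]);
                // File.Delete(path);   // only after an extra byte-for-byte check!
            }
            else
            {
                seen.Add(hash, path);
            }
        }
    }

    static string HashFile(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))   // streamed, so a huge .pst never sits in memory at once
        {
            return BitConverter.ToString(sha.ComputeHash(stream));
        }
    }
}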
Notes:
Remember: the more bits the hash value has, the lower the probability of collisions. If you want to know more about collision probabilities in hash functions, read this excellent article. You must pay attention to this topic since your objective is to delete files: if you have a collision, you will delete a file which is not actually identical and you will lose it forever. There are many tactics to identify collisions, which you can combine and add to your algorithm (e.g. compare the sizes of the files, compare file content values at random positions, use more than one hash function). My advice would be to use all of these tactics. If you use two hash functions, then for two files to be considered identical they must have equal hash values for each hash function:
file1, file2;
file1_hash1 = hash_function1(file1);
file2_hash1 = hash_function1(file2);
file1_hash2 = hash_function2(file1);
file2_hash2 = hash_function2(file2);
if (file1_hash1 == file2_hash1 &&
    file1_hash2 == file2_hash2)
    // file1 is_duplicate_of file2;
else
    // file1 is_NOT_duplicate_of file2;
I would work through the process of finding duplicates by first recursively finding all of the PST files, then matching on file length, then filtering by a fixed prefix of bytes, and finally performing a full hash or byte comparison to get actual matches.
Recursively building the list and finding potential matches can be as simple as this:
Func<DirectoryInfo, IEnumerable<FileInfo>> recurse = null;
recurse = di => di.GetFiles("*.pst")
.Concat(di.GetDirectories()
.SelectMany(cdi => recurse(cdi)));
var potentialMatches =
recurse(new DirectoryInfo(@"m:\mail"))
.ToLookup(fi => fi.Length)
.Where(x => x.Skip(1).Any());
The potentialMatches query gives you a complete series of potential matches by file size.
I would then use the following functions (whose implementation I'll leave to you) to filter this list further.
Func<FileInfo, FileInfo, int, bool> prefixBytesMatch = /* your implementation */
Func<FileInfo, FileInfo, bool> hashMatch = /* your implementation */
By limiting the matches by file length and then by a prefix of bytes you will significantly reduce the computation of hashes required for your very large files.
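If it helps, a rough, untested sketch of those two delegates might look like the following (assuming the prefix length is passed in by the caller and that an MD5 hash is good enough for the final check; swap in whatever hash you prefer):

Func<FileInfo, FileInfo, int, bool> prefixBytesMatch = (a, b, prefixLength) =>
{
    using (FileStream sa = a.OpenRead())
    using (FileStream sb = b.OpenRead())
    {
        for (int i = 0; i < prefixLength; i++)
        {
            int ba = sa.ReadByte();
            int bb = sb.ReadByte();
            if (ba != bb) return false;   // files differ within the prefix
            if (ba == -1) return true;    // both ended early and matched so far
        }
        return true;
    }
};

Func<FileInfo, FileInfo, bool> hashMatch = (a, b) =>
{
    using (var md5 = System.Security.Cryptography.MD5.Create())
    {
        byte[] ha, hb;
        using (FileStream sa = a.OpenRead()) ha = md5.ComputeHash(sa);
        using (FileStream sb = b.OpenRead()) hb = md5.ComputeHash(sb);
        return ha.SequenceEqual(hb);      // System.Linq, already in scope for the query above
    }
};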
I hope this helps.
Related
I have a string of arbitrary length (let's say 5 to 2000 characters) for which I would like to calculate a checksum.
Requirements
The same checksum must be returned each time a calculation is done for a string
The checksum must be unique (no collisions)
I can not store previous IDs to check for collisions
Which algorithm should I use?
Update:
Is there an approach which is reasonably unique? i.e. the likelihood of a collision is very small.
The checksum should be alphanumeric
The strings are unicode
The strings are actually texts that should be translated and the checksum is stored with each translation (so a translated text can be matched back to the original text).
The length of the checksum is not important for me (the shorter, the better)
Update 2:
Let's say that I got the following string "Welcome to this website. Navigate using the flashy but useless menu above".
The string is used in a view in a similar way to gettext on Linux, i.e. the user just writes (in a Razor view)
@T("Welcome to this website. Navigate using the flashy but useless menu above")
Now I need a way to identify that string so that I can fetch it from a data source (there are several implementations of the data source). Having to use the entire string as a key seems a bit inefficient, and I'm therefore looking for a way to generate a key out of it.
That's not possible.
If you can't store previous values, it's not possible to create a unique checksum that is smaller than the information in the string.
Update:
The term "reasonably unique" doesn't make sense, either it's unique or it's not.
To get a reasonably low risk of hash collisions, you can use a reasonably large hash code.
The MD5 algorithm for example produces a 16 byte hash code. Convert the string to a byte array using some encoding that preserves all characters, for example UTF-8, calculate the hash code using the MD5 class, then convert the hash code byte array into a string using the BitConverter class:
string theString = "asdf";
string hash;
using (System.Security.Cryptography.MD5 md5 = System.Security.Cryptography.MD5.Create()) {
    hash = BitConverter.ToString(
        md5.ComputeHash(Encoding.UTF8.GetBytes(theString))
    ).Replace("-", String.Empty);
}
Console.WriteLine(hash);
Output:
912EC803B2CE49E4A541068D495AB570
You can use cryptographic hash functions for this. Most of them are available in .NET.
For example:
var sha1 = System.Security.Cryptography.SHA1.Create();
byte[] buf = System.Text.Encoding.UTF8.GetBytes("test");
byte[] hash = sha1.ComputeHash(buf, 0, buf.Length);
//var hashstr = Convert.ToBase64String(hash);
var hashstr = System.BitConverter.ToString(hash).Replace("-", "");
Note: This is an answer to the original question.
Assuming you want the checksum to be stored in a variable of fixed size (i.e. an integer), you cannot satisfy your second constraint.
The checksum must be unique (no collisions)
You cannot avoid collisions because there will be more distinct strings than there are possible checksum values.
I realize this post is practically ancient, but I stumbled upon it and have run into an almost identical issue in the past. We had an nvarchar(8000) field that we needed to lookup against.
Our solution was to create a persisted computed column using CHECKSUM of the nasty lookup field. We had an auto-incrementing ID field and keyed on (checksum, id).
When reading from the table, we wrote a proc that took the lookup text, computed the checksum and then took where the checksums were equal and the text was equal.
You could easily perform the checksum portions at the application level based on the answer above and store them manually instead of using our DB-centric solution. But the point is to get a reasonably sized key for indexing so that your text comparison runs against a bucket of collisions instead of the entire dataset.
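For example, a hedged application-level sketch (the Translations table and the TextHash/OriginalText column names are invented for illustration, and connectionString is whatever your application already uses):

// Compute a compact hash of the text to use as the indexed lookup key;
// the full-text comparison then resolves any hash collisions.
string text = "Welcome to this website. Navigate using the flashy but useless menu above";
string textHash;
using (var md5 = System.Security.Cryptography.MD5.Create())
{
    byte[] hashBytes = md5.ComputeHash(System.Text.Encoding.UTF8.GetBytes(text));
    textHash = BitConverter.ToString(hashBytes).Replace("-", "");
}

using (var conn = new System.Data.SqlClient.SqlConnection(connectionString))
using (var cmd = new System.Data.SqlClient.SqlCommand(
    "SELECT Translation FROM Translations WHERE TextHash = @hash AND OriginalText = @text", conn))
{
    cmd.Parameters.AddWithValue("@hash", textHash);
    cmd.Parameters.AddWithValue("@text", text);
    conn.Open();
    object translation = cmd.ExecuteScalar();   // null if no matching row
}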
Good luck!
To guarantee uniqueness for almost infinitely long strings, treat the variable-length string as a set of concatenated substrings, each "x characters" in length. Your hash function then only needs to guarantee uniqueness for a maximum substring length, and you generate a series of checksum values from the pieces. Think of it as the equivalent of a network IP address built from a set of checksum numbers.
The issue with collisions is the assumption that a collision forces a slower search method to resolve it. If there is an insignificant number of possible collisions compared to the number of hashed objects, then the extra overhead is negligible overall. A collision arises when the table is sized smaller than the maximum number of objects. This doesn't have to be the case, because the table may have "holes" and each entry in the table may carry a reference count of the objects at that slot. Only if this count is greater than 1 is there a collision, or multiple instances of the same substring.
I need to generate a unique id for files of up to 200-300 MB in size. The condition is that the algorithm should be quick; it should not take much time. I am selecting the files from a desktop and calculating a hash value like this:
HMACSHA256 myhmacsha256 = new HMACSHA256(key);
byte[] hashValue = myhmacsha256.ComputeHash(fileStream);
fileStream is a handle to the file to read content from. This method is going to take a lot of time for obvious reasons.
Does Windows generate a key for a file for its own bookkeeping that I could use directly?
Is there any other way to identify whether two files are the same, instead of matching file names, which is not very foolproof?
MD5.Create().ComputeHash(fileStream);
Alternatively, I'd suggest looking at this rather similar question.
How about generating a hash from the info that's readily available from the file itself? i.e. concatenate :
File Name
File Size
Created Date
Last Modified Date
and create your own?
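A hedged sketch of that idea (the field choice and separator are arbitrary; note this identifies a file by its metadata rather than its content, so copies with different names or dates will hash differently):

using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

static string MetadataHash(string path)
{
    FileInfo fi = new FileInfo(path);

    // Concatenate the readily available metadata instead of reading the content.
    string meta = string.Join("|", fi.Name, fi.Length.ToString(),
        fi.CreationTimeUtc.Ticks.ToString(), fi.LastWriteTimeUtc.Ticks.ToString());

    using (MD5 md5 = MD5.Create())
        return BitConverter.ToString(md5.ComputeHash(Encoding.UTF8.GetBytes(meta)));
}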
When you compute hashes and compare them, both files have to be read completely. My suggestion is to first check the file sizes; only if they are identical, go through the files byte by byte.
If you want a "quick and dirty" check, I would suggest looking at CRC-32. It is extremely fast (the algorithm simply involves doing XOR with table lookups), and if you aren't too concerned about collision resistance, a combination of the file size and the CRC-32 checksum over the file data should be adequate. 28.5 bits are required to represent the file size (that gets you to 379M bytes), which means you get a checksum value of effectively just over 60 bits. I would use a 64-bit quantity to store the file size, for future proofing, but 32 bits would work too in your scenario.
If collision resistance is a consideration, then you pretty much have to use one of the tried-and-true-yet-unbroken cryptographic hash algorithms. I would still concur with what Devils child wrote and also include the file size as a separate (readily accessible) part of the hash, however; if the sizes don't match, there is no chance that the file content can be the same, so in that case the computationally intensive hash calculation can be skipped.
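Since the framework of that era has no built-in CRC-32, here is a hedged sketch of a table-driven CRC-32 (the standard reflected polynomial 0xEDB88320) combined with the file size into a single identifier, along the lines described above:

using System;
using System.IO;

static class FileId
{
    static readonly uint[] Table = BuildTable();

    static uint[] BuildTable()
    {
        var table = new uint[256];
        for (uint i = 0; i < 256; i++)
        {
            uint c = i;
            for (int k = 0; k < 8; k++)
                c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            table[i] = c;
        }
        return table;
    }

    static uint Crc32(Stream stream)
    {
        uint crc = 0xFFFFFFFFu;
        var buffer = new byte[64 * 1024];
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            for (int i = 0; i < read; i++)
                crc = Table[(crc ^ buffer[i]) & 0xFF] ^ (crc >> 8);
        return crc ^ 0xFFFFFFFFu;
    }

    // "size:crc" -- two files with different identifiers cannot have the same content.
    public static string Compute(string path)
    {
        using (var fs = File.OpenRead(path))
            return fs.Length + ":" + Crc32(fs).ToString("X8");
    }
}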
I am currently trying to create a custom Deflate implementation in C#.
I am currently trying to implement the "pattern search" part where I have (up to) 32k of data and am trying to search the longest possible pattern for my input.
The RFC 1951 which defines Deflate says about that process:
The compressor uses a chained hash table to find duplicated strings,
using a hash function that operates on 3-byte sequences. At any
given point during compression, let XYZ be the next 3 input bytes to
be examined (not necessarily all different, of course). First, the
compressor examines the hash chain for XYZ. If the chain is empty,
the compressor simply writes out X as a literal byte and advances one
byte in the input. If the hash chain is not empty, indicating that
the sequence XYZ (or, if we are unlucky, some other 3 bytes with the
same hash function value) has occurred recently, the compressor
compares all strings on the XYZ hash chain with the actual input data
sequence starting at the current point, and selects the longest
match.
I do know what a hash function is, and I do know what a HashTable is as well. But what is a "chained hash table", and how could such a structure be designed to be efficient (in C#) when handling a large amount of data? Unfortunately I didn't understand how the structure described in the RFC works.
What kind of hash function could I choose (what would make sense)?
Thank you in advance!
A chained hash table is a hash table that stores every item you put in it, even if the key for 2 items hashes to the same value, or even if 2 items have exactly the same key.
A DEFLATE implementation needs to store a bunch of (key, data) items in no particular order and rapidly look up a list of all the items with a given key.
In this case, the key is 3 consecutive bytes of uncompressed plaintext, and the data is some sort of pointer or offset to where that 3-byte substring occurs in the plaintext.
Many hashtable/dictionary implementations store both the key and the data for every item.
It's not necessary to store the key in the table for DEFLATE, but it doesn't hurt anything other than using slightly more memory during compression.
Some hashtable/dictionary implementations such as the C++ STL unordered_map insist that every (key, data) item they store must have a unique key. When you try to store another (key, data) item with the same key as some older item already in the table, these implementations delete the old item and replace it with the new item.
That does hurt -- if you accidentally use the C++ STL unordered_map or similar implementation, your compressed file will be larger than if you had used a more appropriate library such as the C++ STL hash_multimap.
Such an error may be difficult to detect, since the resulting (unnecessarily large) compressed files can be correctly decompressed by any standard DEFLATE compressor to a file bit-for-bit identical to the original file.
A few implementations of DEFLATE and other compression algorithms deliberately use such an implementation, deliberately sacrificing compressed file size in order to gain compression speed.
As Nick Johnson said, the default hash function used in your standard "hashtable" or "dictionary" implementation is probably more than adequate.
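In C#, a minimal sketch of such a chained table for the DEFLATE case could simply map the packed 3-byte key to a list of positions (the names here are illustrative, not taken from any particular library):

using System.Collections.Generic;

class ChainedMatchTable
{
    // packed 3-byte key -> every position in the window where that sequence starts
    readonly Dictionary<int, List<int>> chains = new Dictionary<int, List<int>>();

    // Three bytes fit exactly in 24 bits, so this "hash" is collision-free.
    static int Key(byte[] data, int pos)
    {
        return (data[pos] << 16) | (data[pos + 1] << 8) | data[pos + 2];
    }

    public void Insert(byte[] data, int pos)
    {
        List<int> chain;
        int key = Key(data, pos);
        if (!chains.TryGetValue(key, out chain))
            chains[key] = chain = new List<int>();
        chain.Add(pos);                       // keep *all* positions: this is the chaining
    }

    public IList<int> Candidates(byte[] data, int pos)
    {
        List<int> chain;
        if (chains.TryGetValue(Key(data, pos), out chain))
            return chain;                     // caller compares each candidate to find the longest match
        return new int[0];
    }
}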
http://en.wikipedia.org/wiki/Hashtable#Separate_chaining
In this case, they're describing a hashtable where each element contains a list of strings - in this case, all the strings starting with the three character prefix specified. You should simply be able to use standard .net hashtable or dictionary primitives - there's no need to replicate their exact implementation details.
32k is not a lot of data, so you don't have to worry about scaling your hashtable - and even if you did, the built-in primitives are likely to be more efficient than anything you could write yourself.
Is there a library that I can use to perform binary search in a very big text file (it can be 10GB)?
The file is a sort of a log file - every row starts with a date and time. Therefore rows are ordered.
I started to write the pseudo-code on how to do it, but I gave up since it may seem condescending. You probably know how to write a binary search, it's really not complicated.
You won't find it in a library, for two reasons:
It's not really "binary search" - the line sizes are different, so you need to adapt the algorithm (e.g. look for the middle of the file, then look for the next "newline" and consider that to be the "middle").
Your datetime log format is most likely non-standard (ok, it may look "standard", but think a bit.... you probably use '[]' or something to separate the date from the log message, something like [10/02/2001 10:35:02] My message ).
In summary - I think your need is too specific, and simple enough to implement in custom code, for anyone to bother writing a library :)
As the line lengths are not guaranteed to be the same, you're going to need some form of recognisable line delimiter, e.g. carriage return or line feed.
The binary search pattern can then be pretty much your traditional algorithm. Seek to the 'middle' of the file (by length), seek backwards (byte by byte) to the start of the line you happen to land in, as identified by the line delimiter sequence, read that record and make your comparison. Depending on the comparison, seek halfway up or down (in bytes) and repeat.
When you identify the start index of a record, check whether it was the same as the last seek. You may find that, as you dial in on your target record, moving halfway won't get you to a different record. e.g. you have adjacent records of 100 bytes and 50 bytes respectively, so jumping in at 75 bytes always takes you back to the start of the first record. If that happens, read on to the next record before making your comparison.
You should find that you will reach your target pretty quickly.
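A rough C# sketch of that seek-back-to-the-line-start approach, assuming '\n' line endings, single-byte (ASCII) log text, and a caller-supplied comparison on the line's timestamp prefix; this is a sketch under those assumptions, not a hardened implementation:

using System;
using System.IO;
using System.Text;

static class LogBinarySearch
{
    // compare(line) should return a negative value if the line sorts before the target,
    // positive if it sorts after, and 0 when it matches (e.g. compare the timestamp prefix).
    // Returns the byte offset of a matching line, or -1 if none is found.
    public static long Find(string path, Func<string, int> compare)
    {
        using (FileStream fs = File.OpenRead(path))
        {
            long lo = 0, hi = fs.Length;
            while (lo < hi)
            {
                long mid = lo + (hi - lo) / 2;
                long lineStart = SeekLineStart(fs, mid);
                string line = ReadLineAt(fs, lineStart);

                int cmp = compare(line);
                if (cmp == 0) return lineStart;
                if (cmp < 0) lo = lineStart + line.Length + 1;  // target lies after this line (ASCII: chars == bytes)
                else hi = lineStart;                            // target lies before this line
            }
            return -1;
        }
    }

    // Walk backwards byte by byte to the previous '\n' (or the start of the file).
    static long SeekLineStart(FileStream fs, long pos)
    {
        while (pos > 0)
        {
            fs.Seek(pos - 1, SeekOrigin.Begin);
            if (fs.ReadByte() == '\n') break;
            pos--;
        }
        return pos;
    }

    static string ReadLineAt(FileStream fs, long offset)
    {
        fs.Seek(offset, SeekOrigin.Begin);
        var sb = new StringBuilder();
        int b;
        while ((b = fs.ReadByte()) != -1 && b != '\n')
            sb.Append((char)b);
        return sb.ToString();
    }
}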
You would need to be able to stream the file, but you would also need random access. I'm not sure how you accomplish this short of a guarantee that each line of the file contains the same number of bytes. If you had that, you could get a Stream of the object and use the Seek method to move around in the file, and from there you could conduct your binary search by reading in the number of bytes that constitute a line. But again, this is only valid if the lines are the same number of bytes. Otherwise, you would jump in and out of the middle of lines.
Something like
byte[] buffer = new byte[lineLength];
stream.Seek(lineLength * searchPosition, SeekOrigin.Begin);
stream.Read(buffer, 0, lineLength);
string line = Encoding.Default.GetString(buffer);
This shouldn't be too bad under the constraint that you hold an Int64 in memory for every line-feed in the file. That really depends upon how long a line of text is on average; given 1,000 bytes per line you'd be looking at around (10,000,000,000 / 1,000 * 8) = 80 MB. Very big, but possible.
So try this:
Scan the file and store the ordinal offset of each line-feed in a List<long>
Binary search the List with a custom comparer that scans to the file offset and reads the data.
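For a hedged sketch of that two-step idea, assuming '\n' line endings and ASCII log text, and using List<long>.BinarySearch with a comparer that reads the line at each probe (the second comparer argument is just a dummy search key):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class LogIndex
{
    // Pass 1: record the offset of every line start (one long per line).
    public static List<long> BuildLineIndex(string path)
    {
        var offsets = new List<long> { 0 };
        using (var fs = File.OpenRead(path))
        {
            int b;
            while ((b = fs.ReadByte()) != -1)
                if (b == '\n' && fs.Position < fs.Length)
                    offsets.Add(fs.Position);
        }
        return offsets;
    }

    // Comparer that, given a line-start offset, reads the line and compares its prefix to the target.
    class OffsetComparer : IComparer<long>
    {
        readonly FileStream fs;
        readonly string target;
        public OffsetComparer(FileStream fs, string target) { this.fs = fs; this.target = target; }

        public int Compare(long offset, long ignored)   // 'ignored' is the dummy search key
        {
            fs.Seek(offset, SeekOrigin.Begin);
            var sb = new StringBuilder();
            int b;
            while ((b = fs.ReadByte()) != -1 && b != '\n')
                sb.Append((char)b);
            string line = sb.ToString();
            string prefix = line.Length <= target.Length ? line : line.Substring(0, target.Length);
            return string.CompareOrdinal(prefix, target);
        }
    }

    // Returns the index of a matching line in 'offsets', or a negative number if not found.
    public static int Search(string path, List<long> offsets, string timestampPrefix)
    {
        using (var fs = File.OpenRead(path))
            return offsets.BinarySearch(-1, new OffsetComparer(fs, timestampPrefix));
    }
}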
If your file is static (or changes rarely) and you have to run "enough" queries against it, I believe the best approach will be creating an "index" file:
Scan the initial file and record the datetime parts plus their positions in the original file (this is why it has to be pretty static). Encode them somehow, for example: Unix time (full 10 digits) + nanoseconds (zero-filled 4 digits) + line position (zero-filled 10 digits). This way you will have a file with consistent "lines".
Perform binary search on that file (you may need to be a bit creative in order to achieve range search) and get the relevant location(s) in the original file.
read directly from the original file starting from the given location / read the given range
You've got range search with O(log(n)) run-time :) (and you've created primitive DB functionality)
Needless to say, if the data file is updated "too" frequently or you don't run "enough" queries against the index file, you may end up spending more time on creating the index file than you save on queries.
Btw, working with this index file doesn't require the data file to be sorted. As log files tend to be append-only and sorted, you may speed up the whole thing by simply creating an index file that only holds the locations of the EOL marks (zero-filled 10 digits) in the data file - this way you can perform the binary search directly on the data file (using the index file to determine the seek positions in the original file), and if lines are appended to the log file you can simply add (append) their EOL positions to the index file.
The List<T> object has a BinarySearch method.
http://msdn.microsoft.com/en-us/library/w4e7fxsh%28VS.80%29.aspx
I have a 1GB file containing pairs of string and long.
What's the best way of reading it into a Dictionary, and how much memory would you say it requires?
File has 62 million rows.
I've managed to read it using 5.5GB of ram.
Say 22 bytes overhead per Dictionary entry, that's 1.5GB.
long is 8 bytes, that's 500MB.
Average string length is 15 chars, each char 2 bytes, that's 2GB.
Total is about 4GB, where does the extra 1.5 GB go to?
The initial Dictionary allocation takes 256MB.
I've noticed that every 10 million rows I read consume about 580MB, which fits quite nicely with the above calculation, but somewhere around the 6000th line, memory usage grows from 260MB to 1.7GB; that's my missing 1.5GB, where does it go?
Thanks.
It's important to understand what's happening when you populate a Hashtable. (The Dictionary uses a Hashtable as its underlying data structure.)
When you create a new Hashtable, .NET makes an array containing 11 buckets, which are linked lists of dictionary entries. When you add an entry, its key gets hashed, the hash code gets mapped on to one of the 11 buckets, and the entry (key + value + hash code) gets appended to the linked list.
At a certain point (and this depends on the load factor used when the Hashtable is first constructed), the Hashtable determines, during an Add operation, that it's encountering too many collisions, and that the initial 11 buckets aren't enough. So it creates a new array of buckets that's twice the size of the old one (not exactly; the number of buckets is always prime), and then populates the new table from the old one.
So there are two things that come into play in terms of memory utilization.
The first is that, every so often, the Hashtable needs to use twice as much memory as it's presently using, so that it can copy the table during resizing. So if you've got a Hashtable that's using 1.8GB of memory and it needs to be resized, it's briefly going to need to use 3.6GB, and, well, now you have a problem.
The second is that every hash table entry has about 12 bytes of overhead: pointers to the key, the value, and the next entry in the list, plus the hash code. For most uses, that overhead is insignificant, but if you're building a Hashtable with 100 million entries in it, well, that's about 1.2GB of overhead.
You can overcome the first problem by using the overload of the Dictionary's constructor that lets you provide an initial capacity. If you specify a capacity big enough to hold all of the entries you're going to be adding, the Hashtable won't need to be rebuilt while you're populating it. There's pretty much nothing you can do about the second.
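For the 62-million-entry case in the question, that just means something like the following (a minimal illustration; the count only has to be a reasonable upper bound):

// Pre-sizing avoids the repeated grow-and-rehash cycles and the transient
// "old table + new table" memory spike described above.
var dictionary = new Dictionary<string, long>(62000000);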
Everyone here seems to be in agreement that the best way to handle this is to read only a portion of the file into memory at a time. Speed, of course, is determined by which portion is in memory and what parts must be read from disk when a particular piece of information is needed.
There is a simple method to handle deciding what's the best parts to keep in memory:
Put the data into a database.
A real one, like MSSQL Express, or MySql or Oracle XE (all are free).
Databases cache the most commonly used information, so it's just like reading from memory. And they give you a single access method for in-memory or on-disk data.
Maybe you can convert that 1 GB file into a SQLite database with two columns key and value. Then create an index on key column. After that you can query that database to get the values of the keys you provided.
Thinking about this, I'm wondering why you'd need to do it... (I know, I know... I shouldn't wonder why, but hear me out...)
The main problem is that there is a huge amount of data that needs to be presumably accessed quickly... The question is, will it essentially be random access, or is there some pattern that can be exploited to predict accesses?
In any case, I would implement this as a sliding cache. E.g. I would load as much as feasibly possible into memory to start with (with the selection of what to load based as much on my expected access pattern as possible) and then keep track of accesses to elements by time last accessed.
If I hit something that wasn't in the cache, then it would be loaded and replace the oldest item in the cache.
This would result in the most commonly used stuff being accessible in memory, but would incur additional work for cache misses.
In any case, without knowing a little more about the problem, this is merely a 'general solution'.
It may be that just keeping it in a local instance of a sql db would be sufficient :)
You'll need to specify the file format, but if it's just something like name=value, I'd do:
Dictionary<string,long> dictionary = new Dictionary<string,long>();
using (TextReader reader = File.OpenText(filename))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        string[] bits = line.Split('=');
        // Error checking would go here
        long value = long.Parse(bits[1]);
        dictionary[bits[0]] = value;
    }
}
Now, if that doesn't work we'll need to know more about the file - how many lines are there, etc?
Are you using 64 bit Windows? (If not, you won't be able to use more than 3GB per process anyway, IIRC.)
The amount of memory required will depend on the length of the strings, number of entries etc.
I am not familiar with C#, but if you're having memory problems you might need to roll your own memory container for this task.
Since you want to store it in a dict, I assume you need it for fast lookup?
You have not clarified which one should be the key, though.
Let's hope you want to use the long values for keys. Then try this:
Allocate a buffer that's as big as the file. Read the file into that buffer.
Then create a dictionary with the long values as keys and a 32-bit value (the buffer offset) as each entry's value.
Now browse the data in the buffer like this:
Find the next key-value pair. Calculate the offset of its value in the buffer. Now add this information to the dictionary, with the long as the key and the offset as its value.
That way, you end up with a dictionary which might take maybe 10-20 bytes per record, and one larger buffer which holds all your text data.
At least with C++, this would be a rather memory-efficient way, I think.
Can you convert the 1G file into a more efficient indexed format, but leave it as a file on disk? Then you can access it as needed and do efficient lookups.
Perhaps you can memory map the contents of this (more efficient format) file, then have minimum ram usage and demand-loading, which may be a good trade-off between accessing the file directly on disc all the time and loading the whole thing into a big byte array.
Loading a 1 GB file in memory at once doesn't sound like a good idea to me. I'd virtualize the access to the file by loading it in smaller chunks only when the specific chunk is needed. Of course, it'll be slower than having the whole file in memory, but 1 GB is a real mastodon...
Don't read 1GB of file into memory; even if you've got 8 GB of physical RAM, you can still run into many problems (based on personal experience).
I don't know exactly what you need to do, but find a workaround: read the file partially and process it in chunks. If that doesn't work for you, then consider using a database.
If you choose to use a database, you might be better served by a dbm-style tool, like Berkeley DB for .NET. They are specifically designed to represent disk-based hashtables.
Alternatively you may roll your own solution using some database techniques.
Suppose your original data file looks like this (dots indicate that string lengths vary):
[key2][value2...][key1][value1..][key3][value3....]
Split it into index file and values file.
Values file:
[value1..][value2...][value3....]
Index file:
[key1][value1-offset]
[key2][value2-offset]
[key3][value3-offset]
Records in index file are fixed-size key->value-offset pairs and are ordered by key.
Strings in values file are also ordered by key.
To get a value for key(N) you would binary-search for key(N) record in index, then read string from values file starting at value(N)-offset and ending before value(N+1)-offset.
The index file can be read into an in-memory array of structs (less overhead and much more predictable memory consumption than a Dictionary), or you can do the search directly on disk.
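A hedged sketch of that on-disk binary search over fixed-size index records, assuming for illustration an 8-byte long key followed by an 8-byte value offset per record:

using System;
using System.IO;

static class IndexSearch
{
    const int RecordSize = 16;   // 8-byte key + 8-byte value offset

    // Binary search the index file directly on disk; returns the value offset, or -1 if the key is absent.
    public static long FindValueOffset(FileStream index, long key)
    {
        var record = new byte[RecordSize];
        long lo = 0, hi = index.Length / RecordSize - 1;

        while (lo <= hi)
        {
            long mid = lo + (hi - lo) / 2;
            index.Seek(mid * RecordSize, SeekOrigin.Begin);
            index.Read(record, 0, RecordSize);

            long midKey = BitConverter.ToInt64(record, 0);
            if (midKey == key) return BitConverter.ToInt64(record, 8);
            if (midKey < key) lo = mid + 1;
            else hi = mid - 1;
        }
        return -1;
    }
}

Reading the actual value is then a single Seek to the returned offset in the values file, ending before the next record's offset.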