File comparison using checksums - c#

I'm writing a program to find duplicates of files.
I have two folders in which I have to find duplicates. In the worst-case scenario I would have to compare all the files with each other. I was thinking of generating a checksum for each file, comparing the checksums, and then, if the checksums are equal, performing a byte-by-byte check to ensure the files are exactly the same.
The question is: which checksum algorithm is fast enough that computing it is worth the time, rather than just comparing byte-by-byte from the start?

You can reduce the number of comparisons you have to make and also the amount of I/O by getting a full list of the files and then sorting by length. Two files can't be identical if they don't have identical lengths. So you can eliminate a large number of files without doing any I/O other than getting the directory information, which you have to get anyway.
If there are only two files with the same length, X, then you don't have to compute a checksum for those files. Just compare them directly.
If there are three or more files with the same length, then you're better off computing the checksums for all of those files, comparing the checksums, and then doing a byte-by-byte comparison only when the checksums match.
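As a minimal illustration of this, here is a hedged C# sketch of the length-based pre-filter; the two folder paths and the console output are placeholders, and only files that land in the same length group would go on to a checksum or byte-by-byte comparison.
// Sketch: group candidate files by length before doing any content I/O.
// The folder paths below are placeholders.
using System;
using System.IO;
using System.Linq;

class LengthGrouping
{
    static void Main()
    {
        var files = new[] { @"C:\folderA", @"C:\folderB" }
            .SelectMany(dir => Directory.EnumerateFiles(dir, "*", SearchOption.AllDirectories))
            .Select(path => new FileInfo(path));

        foreach (var group in files.GroupBy(f => f.Length).Where(g => g.Count() > 1))
        {
            Console.WriteLine($"Potential duplicates of length {group.Key}:");
            foreach (var file in group)
                Console.WriteLine($"  {file.FullName}");
            // Only files within this group need a checksum or byte-by-byte comparison.
        }
    }
}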

First, group files by length, as Jim Mischel says.
If the files to compare are large, it might be quicker to calculate your representative (which is all a checksum is) by taking the first n bytes of the file. Reading the entirety of a large file to calculate a checksum to compare it to another file that differs in the first n bytes is inefficient.
Theoretically, the first n bytes identify the file as uniquely as an n-byte checksum. (This holds if all possible files of a given length are equally likely.)
Of course, if the files to compare are small, it's as quick to read the whole file as a subset of it.
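As a rough sketch of that idea (the 64 KiB prefix size is an arbitrary assumption, not something from this answer), a prefix check in C# might look like this:
// Sketch: compare only the first n bytes of two files as a cheap pre-check.
using System;
using System.IO;

static class PrefixCompare
{
    public static bool PrefixesMatch(string pathA, string pathB, int n = 64 * 1024)
    {
        using var a = File.OpenRead(pathA);
        using var b = File.OpenRead(pathB);

        var bufA = new byte[n];
        var bufB = new byte[n];
        int readA = ReadFully(a, bufA);
        int readB = ReadFully(b, bufB);

        // Different prefix lengths (one file shorter than n) or different bytes
        // both mean the files cannot be identical.
        return readA == readB && bufA.AsSpan(0, readA).SequenceEqual(bufB.AsSpan(0, readB));
    }

    // Stream.Read may return fewer bytes than requested, so loop until n bytes or EOF.
    static int ReadFully(Stream s, byte[] buffer)
    {
        int total = 0, read;
        while (total < buffer.Length && (read = s.Read(buffer, total, buffer.Length - total)) > 0)
            total += read;
        return total;
    }
}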

Any checksum algorithm will do. You can use MD5 for example. You will hardly waste any time since I/O is way slower than the CPU time spent calculating the checksum. You can also use CRC32.
You said: "I have two folders, in which I have to find duplicates."
I would like to clarify something here. If the goal is to find duplicate files, then it does not matter whether the files are located in one, two or x number of folders. Assuming you have n files, you need on the order of n log n comparisons to find duplicates. It is indeed useful to read the n files once, calculate their checksums, then sort the checksums in n log n time to find the duplicates. Note, however, that you can avoid that by first comparing the file sizes and only resorting to checksums when comparing three or more files of the same size. That would greatly accelerate your search for duplicates.
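For reference, a minimal sketch of the checksum step in C# (MD5 used purely as a fast fingerprint; the candidate paths are assumed to come from an earlier size-based grouping, and Convert.ToHexString requires .NET 5 or later):
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class ChecksumGrouping
{
    // Compute an MD5 fingerprint of a file's contents.
    public static string Md5Of(string path)
    {
        using var md5 = MD5.Create();
        using var stream = File.OpenRead(path);
        return Convert.ToHexString(md5.ComputeHash(stream));   // .NET 5+
    }

    // Group same-sized candidates by checksum; matching groups still deserve
    // a byte-by-byte verification.
    public static void Report(string[] candidatePaths)
    {
        foreach (var group in candidatePaths.GroupBy(Md5Of).Where(g => g.Count() > 1))
        {
            Console.WriteLine($"Files with MD5 {group.Key}:");
            foreach (var path in group)
                Console.WriteLine($"  {path}");
        }
    }
}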

Related

Memory-constrained external sorting of strings, with duplicates combined&counted, on a critical server (billions of filenames)

Our server produces files like {c521c143-2a23-42ef-89d1-557915e2323a}-sign.xml in its log folder. The first part is a GUID; the second part is the name template.
I want to count the number of files with same name template. For instance, we have
{c521c143-2a23-42ef-89d1-557915e2323a}-sign.xml
{aa3718d1-98e2-4559-bab0-1c69f04eb7ec}-hero.xml
{0c7a50dc-972e-4062-a60c-062a51c7b32c}-sign.xml
The result should be
sign.xml,2
hero.xml,1
The total number of possible name templates is unknown and possibly exceeds int.MaxValue.
The total number of files on the server is unknown and possibly exceeds int.MaxValue.
Requirements:
The final result should be sorted by name template.
The server on which the tool is going to run is super critical. We should be able to tell memory usage (MB) and the number of temporary files generated, if any, before running the tool and without knowing any characteristic of the log folder.
We use C# language.
My idea:
For the first 5000 files, count the occurrences, write result to Group1.txt.
For the second 5000 files, count the occurrences, write result to Group2.txt.
Repeat until all files are processed. Now we have a bunch of group files.
Then I merge all these group files.
Group1.txt   Group2.txt     Group3.txt   Group4.txt
        \     /                     \     /
       Group1-2.txt             Group3-4.txt
                 \                  /
                    Group1-4.txt
Group1-4.txt is the final result.
The disagreement between me and my friend is how we count the occurrences.
I suggest using a dictionary, with the file name template as the key. Let m be the partition size (in this example, 5000). Then the time complexity is O(m) and the space complexity is O(m).
My friend suggests sorting the name templates and then counting the occurrences in one pass, since identical name templates are then adjacent. Time complexity O(m log m), space complexity O(m).
We cannot persuade each other. Do you see any issues with either method?
IDK if external sorting with count-merging of duplicates has been studied. I did find a 1983 paper (see below). Usually, sorting algorithms are designed and studied with the assumption of sorting objects by keys, so duplicate keys have different objects. There might be some existing literature on this, but it's a very interesting problem. Probably it's just considered an application of compact dictionaries combined with external merge-sorting.
Efficient dictionaries for storing large amounts of strings in little memory is a very well studied problem. Most of the useful data structures can include auxiliary data for each word (in our case, a dup count).
TL:DR summary of useful ideas, since I rambled on in way too much detail about many things in the main body of this answer:
Set batch boundaries when your dictionary size hits a threshold, not after a fixed number of input files. If there were a lot of duplicates in a group of 5000 strings, you still won't be using very much memory. You can find way more duplicates in the first pass this way.
Sorted batches make merging much faster. You can and should merge many->one instead of binary merging. Use a PriorityQueue to figure out which input file has the line you should take next.
To avoid a burst of memory usage when sorting the keys in a hash table, use a dictionary that can do an in-order traversal of keys. (i.e. sort on the fly.) There's SortedDictionary<TKey, TValue> (binary tree-based). This also interleaves the CPU usage of sorting with the I/O waiting to get the input strings.
Radix-sort each batch into outputs by first-character (a-z, non-alphabetic that sorts before A, and non-alphabetic that sorts after z). Or some other bucketing choice that distributes your keys well. Use separate dictionaries for each radix bucket, and empty only the biggest into a batch when you hit your memory ceiling. (fancier eviction heuristics than "biggest" may be worth it.)
throttle I/O (esp. when merging), and check system CPU load and memory pressure. Adapt behaviour accordingly to make sure you don't cause an impact when the server is most busy.
For smaller temp files at the cost of CPU time, use a common-prefix encoding, or maybe lz4.
A space-efficient dictionary will allow larger batch sizes (and thus a larger duplicate-finding window) for the same upper memory bound. A Trie (or better, Radix Trie) might be ideal, because it stores the characters within the tree nodes, with common prefixes only stored once. Directed Acyclic Word Graphs are even more compact (finding redundancy between common substrings that aren't prefixes). Using one as a Dictionary is tricky but probably possible (see below).
Take advantage of the fact that you don't need to delete any tree nodes or strings until you're going to empty the whole dictionary. Use a growable array of nodes, and another growable char array that packs strings head to tail. (Useful for a Radix Trie (multi-char nodes), but not a regular Trie where each node is a single char.)
Depending on how the duplicates are distributed, you might or might not be able to find very many on the first pass. This has some implications, but doesn't really change how you end up merging.
I'm assuming you have some directory traversal idea in mind, which can efficiently supply your code with a stream of strings to be uniquified and counted. So I'll just say "strings" or "keys", to talk about the inputs.
Trim off as many unnecessary characters as possible (e.g. lose the .xml if they're all .xml).
It might be useful to do the CPU/memory intensive work on a separate machine, depending on what other hardware you have with a fast network connection to your critical production server.
You could run a simple program on the server that sends filenames over a TCP connection to a program running on another machine, where it's safe to use much more memory. The program on the server could still do small dictionary batches, and just store them on a remote filesystem.
And now, since none of the other answers really put all the pieces together, here's my actual answer:
An upper bound on memory usage is easy. Write your program to use a constant memory ceiling, regardless of input size. Bigger inputs will lead to more merging phases, not more memory usage at any point.
The best estimate of temporary file storage space you can do without looking at the input is a very conservative upper bound that assumes every input string is unique. You need some way to estimate how many input strings there will be. (Most filesystems know how many separate files they contain, without having to walk the directory tree and count them.)
You can make some assumptions about the distribution of duplicates to make a better guess.
If number, rather than size, of scratch files is an issue, you can store multiple batches in the same output file, one after another. Either put length-headers at the start of each to allow skipping forward by batch, or write byte offsets to a separate data stream. If size is also important, see my paragraph about using frcode-style common-prefix compression.
As Ian Mercer points out in his answer, sorting your batches will make merging them much more efficient. If you don't, you either risk hitting a wall where your algorithm can't make forward progress, or you need to do something like load one batch, scan another batch for entries that are in the first, and rewrite the 2nd batch with just the potentially-few matching entries removed.
Not sorting your batches makes the time complexity of the first pass O(N), but either you have to sort at some point later, or your later stages have a worst-case bound that's dramatically worse. You want your output globally sorted, so other than RadixSort approaches, there's no avoiding an O(N log N) somewhere.
With limited batch size, O(log N) merge steps are expected, so your original analysis missed the O(N log N) complexity of your approach by ignoring what needs to happen after the phase1 batches are written.
The appropriate design choices change a lot depending on whether our memory ceiling is big enough to find many duplicates within one batch. If even a complex compact data structure like a Trie doesn't help much, putting the data into a Trie and getting it out again to write a batch is a waste of CPU time.
If you can't do much duplicate-elimination within each batch anyway, then you need to optimize for putting possibly-matching keys together for the next stage. Your first stage could group input strings by first byte, into up to 252 or so output files (not all 256 values are legal filename characters), or into 27 or so output files (alphabet + misc), or 26+26+1 for upper/lower case + non-alphabetic. Temp files can omit the common prefix from each string.
Then most of these first stage batches should have a much higher duplicate density. Actually, this Radix distribution of inputs into output buckets is useful in any case, see below.
You should still sort your first-stage outputs in chunks, to give the next pass a much wider dup-find window for the same RAM.
I'm going to spend more time on the domain where you can find a useful amount of duplicates in the initial stream, before using up ~100MiB of RAM, or whatever you choose as an upper limit.
Obviously we add strings to some sort of dictionary to find and count duplicates on the fly, while only requiring enough storage for the set of unique strings. Just storing strings and then sorting them would be significantly less efficient, because we'd hit our RAM limit much sooner without on-the-fly duplicate detection.
To minimize the phase2 work, phase1 should find and count as many duplicates as possible, reducing the total size of the p2 data. Reducing the amount of merging work for phase2 is good, too. Bigger batches helps with both factors, so it's very useful to come as close to your memory ceiling as you safely can in phase1. Instead of writing a batch after a constant number of input strings, do it when your memory consumption nears your chosen ceiling. Duplicates are counted and thrown away, and don't take any extra storage.
An alternative to accurate memory accounting is tracking the number of unique strings in your dictionary, which is easy (and done for you by the library implementation). Accumulating the length of strings added can give you a good estimate of memory used for storing the strings, too. Or just make an assumption about string length distribution. Make your hash table the right size initially so it doesn't have to grow while you add elements, and stop when it's 60% full (load factor) or something.
A space-efficient data structure for the dictionary increases our dup-finding window for a given memory limit. Hash tables get badly inefficient when their load factor is too high, but the hash table itself only has to store pointers to the strings. It's the most familiar dictionary and has library implementations.
We know we're going to want our batch sorted once we've seen enough unique keys, so it might make sense to use a dictionary that can be traversed in sorted order. Sorting on the fly makes sense because keys will come in slowly, limited by disk IO since we're reading from filesystem metadata. One downside is if most of the keys we see are duplicates, then we're doing a lot of O(log batchsize) lookups, rather than a lot of O(1) lookups. And it's more likely that a key will be a duplicate when the dictionary is big, so most of those O(log batchsize) queries will be with a batch size near max, not uniformly distributed between 0 and max. A tree pays the O(log n) overhead of sorting for every lookup, whether the key turned out to be unique or not. A hash table only pays the sorting cost at the end after removing duplicates. So for a tree it's O(total_keys * log unique_keys), hash table is O(unique_keys * log unique_keys) to sort a batch.
A hash table with max load factor set to 0.75 or something might be pretty dense, but having to sort the KeyValuePairs before writing out a batch probably puts a damper on using standard Dictionary. You don't need copies of the strings, but you'll probably end up copying all the pointers (refs) to scratch space for a non-in-place sort, and maybe also when getting them out of the hash table before sorting. (Or instead of just pointers, KeyValuePair, to avoid having to go back and look up each string in the hash table). If short spikes of big memory consumption are tolerable, and don't cause you to swap / page to disk, you could be fine. This is avoidable if you can do an in-place sort in the buffer used by the hash table, but I doubt that can happen with standard-library containers.
A constant trickle of CPU usage to maintain the sorted dictionary at the speed keys are available is probably better than infrequent bursts of CPU usage to sort all of a batch's keys, besides the burst of memory consumption.
The .NET standard library has SortedDictionary<TKey, TValue>, which the docs say is implemented with a binary tree. I didn't check if it has a rebalance function, or uses a red-black tree, to guarantee O(log n) worst case performance. I'm not sure how much memory overhead it would have. If this is a one-off task, then I'd absolutely recommend using this to implement it quickly and easily. And also for a first version of a more optimized design for repeated use. You'll probably find it's good enough, unless you can find a nice library implementation of Tries.
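As a concrete starting point, here is a hedged sketch of phase1 batching with SortedDictionary; the unique-key ceiling of 500,000 and the batch file naming are assumptions, and the '/' separator with hex counts follows the suggestion further down.
using System;
using System.Collections.Generic;
using System.IO;

class BatchCounter
{
    // Sorted on the fly, so each flushed batch file is already in key order.
    private readonly SortedDictionary<string, long> counts = new(StringComparer.Ordinal);
    private readonly int maxUniqueKeys;
    private int batchNumber;

    public BatchCounter(int maxUniqueKeys = 500_000) => this.maxUniqueKeys = maxUniqueKeys;

    public void Add(string nameTemplate)
    {
        counts.TryGetValue(nameTemplate, out long c);
        counts[nameTemplate] = c + 1;          // duplicates cost no extra memory

        if (counts.Count >= maxUniqueKeys)     // batch boundary = unique-key ceiling,
            Flush();                           // not a fixed number of input files
    }

    public void Flush()
    {
        if (counts.Count == 0) return;
        using var writer = new StreamWriter($"batch{batchNumber++:D3}.txt");
        foreach (var kv in counts)             // in-order traversal of the tree
            writer.WriteLine($"{kv.Key}/{kv.Value:x}");   // '/' separator, hex count
        counts.Clear();
    }
}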
Data structures for memory-efficient sorted dictionaries
The more memory-efficient our dictionary is, the more dups we can find before having to write out a batch and empty the dictionary. Also, if it's a sorted dictionary, our batches can be larger even when they don't contain many duplicates.
A secondary impact of data structure choice is how much memory traffic we generate while running on the critical server. A sorted array (with O(log n) lookup time (binary search), and O(n) insert time (shuffle elements to make room)) would be compact. However, it wouldn't just be slow, it would saturate memory bandwidth with memmove a lot of the time. 100% CPU usage doing this would have a bigger impact on the server's performance than 100% CPU usage searching a binary tree. It doesn't know where to load the next node from until it's loaded the current node, so it can't pipeline memory requests. The branch mispredicts of comparisons in the tree search also help moderate consumption of the memory bandwidth that's shared by all cores. (That's right, some 100%-CPU-usage programs are worse than others!)
It's nice if emptying our dictionary doesn't leave memory fragmented when we empty it. Tree nodes will be constant size, though, so a bunch of scattered holes will be usable for future tree node allocations. However, if we have separate dictionaries for multiple radix buckets (see below), key strings associated with other dictionaries might be mixed in with tree nodes. This could lead to malloc having a hard time reusing all the freed memory, potentially increasing actual OS-visible memory usage by some small factor. (Unless C# runtime garbage collection does compaction, in which case fragmentation is taken care of.)
Since you never need to delete nodes until you want to empty the dictionary and delete them all, you could store your Tree nodes in a growable array. So memory management only has to keep track of one big allocation, reducing bookkeeping overhead compared to malloc of each node separately. Instead of real pointers, the left / right child pointers could be array indices. This lets you use only 16 or 24 bits for them. (A Heap is another kind of binary tree stored in an array, but it can't be used efficiently as a dictionary. It's a tree, but not a search tree).
Storing the string keys for a dictionary would normally be done with each String as a separately-allocated object, with pointers to them in an array. Since again, you never need to delete, grow, or even modify one until you're ready to delete them all, you can pack them head to tail in a char array, with a terminating zero-byte at the end of each one. This again saves a lot of book-keeping, and also makes it easy to keep track of how much memory is in use for the key strings, letting you safely come closer to your chosen memory upper bound.
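A possible shape for that packed key storage, with made-up names (StringPool, Append, Get); it only ever appends, and CharsUsed gives exactly the memory bookkeeping described above.
using System;

class StringPool
{
    private char[] buffer = new char[1 << 20];
    private int used;

    // Easy, accurate tracking of how much memory the key strings occupy.
    public int CharsUsed => used;

    // Append a key and return its offset; existing keys are never moved or deleted.
    public int Append(ReadOnlySpan<char> key)
    {
        int needed = key.Length + 1;
        if (used + needed > buffer.Length)
            Array.Resize(ref buffer, Math.Max(buffer.Length * 2, used + needed));

        int offset = used;
        key.CopyTo(buffer.AsSpan(used));
        buffer[used + key.Length] = '\0';   // terminator marks the end of this key
        used += needed;
        return offset;
    }

    // Read a key back from its offset by scanning forward to the terminator.
    public string Get(int offset)
    {
        int end = offset;
        while (buffer[end] != '\0') end++;
        return new string(buffer, offset, end - offset);
    }
}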
Trie / DAWG for even more compact storage
For even denser storage of a set of strings, we can eliminate the redundancy of storing all the characters of every string, since there are probably a lot of common prefixes.
A Trie stores the strings in the tree structure, giving you common-prefix compression. It can be traversed in sorted order, so it sorts on the fly. Each node has as many children as there are different next-characters in the set, so it's not a binary tree. A C# Trie partial implementation (delete not written) can be found in this SO answer, to a question similar to this but not requiring batching / external sorting.
Trie nodes need to store potentially many child pointers, so each node can be large. Or each node could be variable-size, holding the list of nextchar:ref pairs inside the node, if C# makes that possible. Or as the Wikipedia article says, a node can actually be a linked-list or binary search tree, to avoid wasting space in nodes with few children. (The lower levels of a tree will have a lot of that.) End-of-word markers / nodes are needed to distinguish between substrings that aren't separate dictionary entries, and ones that are. Our count field can serve that purpose: count = 0 means the substring ending here isn't in the dictionary; count >= 1 means it is.
A more compact Trie is the Radix Tree, or PATRICIA Tree, which stores multiple characters per node.
Another extension of this idea is the Deterministic acyclic finite state automaton (DAFSA), sometimes called a Directed Acyclic Word Graph (DAWG), but note that the DAWG wikipedia article is about a different thing with the same name. I'm not sure a DAWG can be traversed in sorted order to get all the keys out at the end, and as wikipedia points out, storing associated data (like a duplicate count) requires a modification. I'm also not sure they can be built incrementally, but I think you can do lookups without having compacted. The newly added entries will be stored like a Trie, until a compaction step every 128 new keys merges them into the DAWG. (Or run the compaction less frequently for bigger DAWGs, so you aren't doing it too much, like doubling the size of a hash table when it has to grow, instead of growing linearly, to amortize the expensive op.)
You can make a DAWG more compact by storing multiple characters in a single node when there isn't any branching / converging. This page also mentions a Huffman-coding approach to compact DAWGs, and has some other links and article citations.
JohnPaul Adamovsky's DAWG implementation (in C) looks good, and describes some optimizations it uses. I haven't looked carefully to see if it can map strings to counts. It's optimized to store all the nodes in an array.
This answer to the dup-count words in 1TB of text question suggests DAWGs, and has a couple links, but I'm not sure how useful it is.
Writing batches: Radix on first character
You could get your RadixSort on, and keep separate dictionaries for every starting character (or for a-z, non-alphabetic that sorts before a, non-alphabetic that sorts after z). Each dictionary writes out to a different temp file. If you have multiple compute nodes available for a MapReduce approach, this would be the way to distribute merging work to the compute nodes.
This allows an interesting modification: instead of writing all radix buckets at once, only write the largest dictionary as a batch. This prevents tiny batches going into some buckets each time you write one out. This will reduce the width of the merging within each bucket, speeding up phase2.
With a binary tree, this reduces the depth of each tree by about log2(num_buckets), speeding up lookups. With a Trie, this is redundant (each node uses the next character as a radix to order the child trees). With a DAWG, this actually hurts your space-efficiency because you lose out on finding the redundancy between strings with different starts but later shared parts.
This has the potential to behave poorly if there are a few infrequently-touched buckets that keep growing, but don't usually end up being the largest. They could use up a big fraction of your total memory, making for small batches from the commonly-used buckets. You could implement a smarter eviction algorithm that records when a bucket (dictionary) was last emptied. The NeedsEmptying score for a bucket would be something like a product of size and age. Or maybe some function of age, like sqrt(age). Some way to record how many duplicates each bucket has found since last emptied would be useful, too. If you're in a spot in your input stream where there are a lot of repeats for one of the buckets, the last thing you want to do is empty it frequently. Maybe every time you find a duplicate in a bucket, increment a counter. Look at the ratio of age vs. dups-found. Low-use buckets sitting there taking RAM away from other buckets will be easy to find that way, when their size starts to creep up. Really-valuable buckets might be kept even when they're the current biggest, if they're finding a lot of duplicates.
If your data structure for tracking age and dups found is a struct-of-arrays, the (last_emptied[bucket] - current_pos) / (float)dups_found[bucket] division can be done efficiently with vector floating point. One integer division is slower than one FP division. One FP division is the same speed as 4 FP divisions, and compilers can hopefully auto-vectorize if you make it easy for them like this.
There's a lot of work to do between buckets filling up, so division would be a tiny hiccup unless you use a lot of buckets.
choosing how to bucket
With a good eviction algorithm, an ideal choice of bucketing will put keys that rarely have duplicates together in some buckets, and buckets that have many duplicates together in other buckets. If you're aware of any patterns in your data, this would be a way to exploit it. Having some buckets that are mostly low-dup means that all those unique keys don't wash away the valuable keys into an output batch. An eviction algorithm that looks at how valuable a bucket has been in terms of dups found per unique key will automatically figure out which buckets are valuable and worth keeping, even though their size is creeping up.
There are many ways to radix your strings into buckets. Some will ensure that every element in a bucket compares less than every element in every later bucket, so producing fully-sorted output is easy. Some won't, but have other advantages. There are going to be tradeoffs between bucketing choices, all of which are data-dependent:
good at finding a lot of duplicates in the first pass (e.g. by separating high-dup patterns from low-dup patterns)
distributes the number of batches uniformly between buckets (so no bucket has a huge number of batches requiring a multi-stage merge in phase2), and maybe other factors.
whether it produces bad behaviour when combined with your eviction algorithm on your data set.
amount of between-bucket merging needed to produce globally-sorted output. The importance of this scales with the total number of unique strings, not the number of input strings.
I'm sure clever people have thought about good ways to bucket strings before me, so this is probably worth searching on if the obvious approach of by-first-character isn't ideal. This special use-case (of sorting while eliminating/counting duplicates) is not typical. I think most work on sorting only considers sorts that preserve duplicates. So you might not find much that helps choose a good bucketing algorithm for a dup-counting external sort. In any case, it will be data-dependent.
Some concrete-options for bucketing are: Radix = first two bytes together (still combining upper/lowercase, and combining non-alphabetic characters). Or Radix = the first byte of the hash code. (Requires a global-merge to produce sorted output.) Or Radix = (str[0]>>2) << 6 + str[1]>>2. i.e. ignore the low 2 bits of the first 2 chars, to put [abcd][abcd].* together, [abcd][efgh].* together, etc. This would also require some merging of the sorted results between some sets of buckets. e.g. daxxx would be in the first bucket, but aexxx would be in the 2nd. But only buckets with the same first-char high-bits need to be merged with each other to produce the sorted final output.
An idea for handling a bucketing choice that gives great dup-finding but needs merge-sorting between buckets: When writing the phase2 output, bucket it with the first character as the radix to produce the sort order you want. Each phase1 bucket scatters output into phase2 buckets as part of the global sort. Once all the phase1 batches that can include strings starting with a have been processed, do the merge of the a phase2-bucket into the final output and delete those temp files.
Radix = first 2 bytes (combining non-alphabetic) would make for 28² = 784 buckets. With 200MiB of RAM, that's an average output file size of only ~256k. Emptying just one bucket at a time would make that the minimum, and you'd usually get larger batches, so this could work. (Your eviction algorithm could hit a pathological case that made it keep a lot of big buckets, and write a series of tiny batches for new buckets. There are dangers to clever heuristics if you don't test carefully).
Multiple batches packed into the same output file is probably most useful with many small buckets. You'll have e.g. 784 output files, each containing a series of batches. Hopefully your filesystem has enough contiguous free space, and is smart enough, to do a good job of not fragmenting too badly when scattering small-ish writes to many files.
Merging:
In the merging stages, with sorted batches we don't need a dictionary. Just take the next line from the batch whose next line sorts lowest, combining duplicates as you find them.
MergeSort typically merges pairs, but when doing external sorting (i.e. disk -> disk), a much wider input is common to avoid reading and re-writing the output a lot of times. Having 25 input files open to merge into one output file should be fine. Use the library implementation of PriorityQueue (typically implemented as a Heap) to choose the next input element from many sorted lists. Maybe add input lines with the string as the priority, and the count and input file number as payload.
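Here is a hedged sketch of such a many-to-one merge using .NET 6's PriorityQueue, assuming the "key/hexcount" line format from the batch-writing sketch earlier; equal keys from different inputs have their counts summed as they come off the queue.
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

static class KWayMerge
{
    public static void Merge(string[] batchFiles, string outputPath)
    {
        var readers = Array.ConvertAll(batchFiles, f => new StreamReader(f));
        // Element = reader index, priority = that reader's current key (ordinal order).
        var queue = new PriorityQueue<int, string>(StringComparer.Ordinal);
        var pendingKey = new string[readers.Length];
        var pendingCount = new long[readers.Length];

        for (int i = 0; i < readers.Length; i++)
            if (TryReadLine(readers[i], out pendingKey[i], out pendingCount[i]))
                queue.Enqueue(i, pendingKey[i]);

        using var writer = new StreamWriter(outputPath);
        string currentKey = null;
        long currentCount = 0;

        while (queue.TryDequeue(out int i, out string key))
        {
            if (key != currentKey)
            {
                if (currentKey != null)
                    writer.WriteLine($"{currentKey},{currentCount}");   // e.g. "sign.xml,2"
                currentKey = key;
                currentCount = 0;
            }
            currentCount += pendingCount[i];      // combine duplicate keys across batches

            if (TryReadLine(readers[i], out pendingKey[i], out pendingCount[i]))
                queue.Enqueue(i, pendingKey[i]);  // refill from the file we just consumed
        }
        if (currentKey != null)
            writer.WriteLine($"{currentKey},{currentCount}");

        foreach (var r in readers) r.Dispose();
    }

    static bool TryReadLine(StreamReader reader, out string key, out long count)
    {
        string line = reader.ReadLine();
        if (line == null) { key = null; count = 0; return false; }
        int sep = line.LastIndexOf('/');
        key = line.Substring(0, sep);
        count = long.Parse(line.Substring(sep + 1), NumberStyles.HexNumber);
        return true;
    }
}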
If you used radix distribute-by-first-character in the first pass, then merge all the a batches into the final output file (even if this process takes multiple merging stages), then all the b batches, etc. You don't need to check any of the batches from the starts-with-a bucket against batches from any other bucket, so this saves a lot of merging work, esp. if your keys are well distributed by first character.
Minimizing impact on the production server:
Throttle disk I/O during merging, to avoid bringing your server to its knees if disk prefetch generates a huge I/O queue depth of reads. Throttling your I/O, rather than a narrower merge, is probably a better choice. If the server is busy with its normal job, it prob. won't be doing many big sequential reads even if you're only reading a couple files.
Check the system load occasionally while running. If it's high, sleep for 1 sec before doing some more work and checking again. If it's really high, don't do any more work until the load average drops (sleeping 30sec between checks).
Check the system memory usage, too, and reduce your batch threshold if memory is tight on the production server. (Or if really tight, flush your partial batch and sleep until memory pressure reduces.)
If temp-file size is an issue, you could do common-prefix compression like frcode from updatedb/locate to significantly reduce the file size for sorted lists of strings. Probably use case-sensitive sorting within a batch, but case-insensitive radixing. So each batch in the a bucket will have all the As, then all the as. Or even LZ4 compress / decompress them on the fly. Use hex for the counts, not decimal. It's shorter, and faster to encode/decode.
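A small sketch of that frcode-style encoding for one sorted batch (keys only; a count could follow the suffix). The "sharedLen suffix" text format is an assumption.
using System;
using System.Collections.Generic;
using System.IO;

static class PrefixEncoding
{
    // Each output line stores how many leading chars the key shares with the
    // previous key, followed by the remaining suffix.
    public static void Write(IEnumerable<string> sortedKeys, TextWriter writer)
    {
        string previous = "";
        foreach (var key in sortedKeys)
        {
            int shared = 0;
            int limit = Math.Min(previous.Length, key.Length);
            while (shared < limit && previous[shared] == key[shared])
                shared++;

            writer.WriteLine($"{shared} {key.Substring(shared)}");
            previous = key;
        }
    }
}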
Use a separator that's not a legal filename character, like /, between key and count. String parsing might well take up a lot of the CPU time in the merge stage, so it's worth considering. If you can leave strings in per-file input buffers, and just point your PQueue at them, that might be good. (And tell you which input file a string came from, without storing that separately.)
performance tuning:
If the initial unsorted strings were available extremely fast, then a hash table with small batches that fit the dictionary in the CPU L3 cache might be a win, unless a larger window can include a much larger fraction of keys, and find more dups. It depends on how many repeats are typical in say 100k files. Build small sorted batches in RAM as you read, then merge them to a disk batch. This may be more efficient than doing a big in-memory quicksort, since you don't have random access to the input until you've initially read it.
Since I/O will probably be the limit, large batches that don't fit in the CPU's data cache are probably a win, to find more duplicates and (greatly?) reduce the amount of merging work to be done.
It might be convenient to check the hash table size / memory consumption after every chunk of filenames you get from the OS, or after every subdirectory or whatever. As long as you choose a conservative size bound, and you make sure you can't go for too long without checking, you don't need to go nuts checking every iteration.
This paper from 1983 examines external merge-sorting eliminating duplicates as they're encountered, and also suggests duplicate elimination with a hash function and a bitmap. With long input strings, storing MD5 or SHA1 hashes for duplicate-elimination saves a lot of space.
I'm not sure what they had in mind with their bitmap idea. Being collision-resistant enough to be usable without going back to check the original string would require a hash code of too many bits to index a reasonable-size bitmap. (e.g. MD5 is a 128bit hash).
How do you "merge the group files" in your approach? In worst case every line had a different name template so each group file had 5,000 lines in it and each merge doubles the number of lines until you overflow memory.
Your friend is closer to the answer, those intermediate files need to be sorted so you can read them line by line and merge them to create new files without having to hold them all in memory. This is a well-known problem, it's an external sort. Once sorted you can count the results.
A jolly good problem.
Considering that you intend to process the results in batches of 5000, I don't believe memory optimisations will be of particular importance, so we could probably ignore that aspect like a bad Adam Sandler film and move on to the more exciting stuff. Besides, just because some computation uses more RAM does not necessarily imply it's a bad algorithm. No one ever complained about look-up tables.
However, I do agree that computationally the dictionary approach is better because it's faster. With respect to the alternative, why perform an unnecessary sort even if it's quick? The latter, at O(m log m), is ultimately slower than O(m).
The Real Problem?
With RAM out of the equation, the problem is essentially that of computation. Any "performance problem" in the algorithm will arguably be insignificant to the time it takes to traverse the file system in the first place.
That's arguably where the real challenge will be. A problem for another time perhaps?
EDIT: displayName makes a good point about using Hadoop - quite ideal for concurrent jobs and compute
Good luck!
Your problem is a very good candidate for Map-Reduce. Great news: You don't need to move from C# to Java (Hadoop) as Map-Reduce is possible in .NET framework!
Through LINQ you already have the basic elements of execution in place for performing Map-Reduce in C#. This might be one advantage over going for an external sort, though there is no question about the observation behind the external sort. This link has the 'Hello World!' of Map-Reduce already implemented in C# using LINQ and should get you started.
If you do move to Java, one of the most comprehensive tutorials about it is here. Google for Hadoop and Map-Reduce and you will get plenty of information and numerous good online video tutorials.
Further, if you wish to move to Java, your requirements of:
Sorted results
critical RAM usage
will surely be met, as they come built in with a Map-Reduce job in Hadoop.
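For what it's worth, here is a minimal in-memory LINQ sketch of the map -> group -> reduce idea for this particular counting problem; it assumes the "{GUID}-template" file-name format and a hypothetical log folder path, and it ignores the memory ceiling that the external-sort answers are designed around.
using System;
using System.IO;
using System.Linq;

class LinqMapReduce
{
    static void Main()
    {
        var counts = Directory.EnumerateFiles(@"D:\logs")                 // assumed log folder
            .Select(path => Path.GetFileName(path))                       // map: file name
            .Select(name => name.Substring(                               // map: strip "{GUID}-"
                name.IndexOf("}-", StringComparison.Ordinal) + 2))
            .GroupBy(template => template)                                // group ("shuffle")
            .OrderBy(g => g.Key, StringComparer.Ordinal)                  // sorted output
            .Select(g => $"{g.Key},{g.Count()}");                         // reduce: count

        foreach (var line in counts)
            Console.WriteLine(line);   // e.g. "sign.xml,2"
    }
}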

C# Memory Mapped File - Better read or write sequentially?

I have two files of about 50GB each: an input file and an output file.
I am using Memory Mapped File to manage these two files.
The input file contains 3 million Web pages, and after I have decided on a permutation π of them, I have to write the Web pages into the output file in the new order.
So I can choose to read the input file sequentially and write the web pages to different locations in the output file, according to the permutation π.
Or I can do the opposite: read the input file randomly according to the permutation π and write sequentially into the output file.
Which option is faster? Why?
TL;DR: Due to caching, all file-append operations are sequential. Even writes to the middle of files will be elevator sorted and performed at block size, etc.
Random writing tends to be faster than random reading for several reasons:
When a file grows, the filesystem can choose where to put the new block.
Writes don't have to be performed immediately; the write buffer can ensure that an entire block is written at once, meaning that data won't have to be appended to an existing block, which already has a location.
Your processing can't take place until reads complete. And reading relies on a predictive cache. The OS is good at pre-caching sequential reads, horrible for random reads. If your reads are less than block sized, things are even worse -- the actual amount of data read from the disk will be greater than the size of the file.

Remove All Duplicates In A Large Text File

I am really stumped by this problem and as a result I have stopped working on it for a while. I work with really large pieces of data. I get approx. 200 GB of .txt data every week. The data can range up to 500 million lines, and a lot of these are duplicates; I would guess only 20 GB is unique. I have had several custom programs made, including hash-based duplicate removal and external duplicate removal, but none seem to work. The latest one used a temp database but took several days to remove the data.
The problem with all the programs is that they crash after a certain point, and after spending a large amount of money on them I thought I would come online and see if anyone can help. I understand this has been answered on here before, and I have spent the last 3 hours reading about 50 threads on here, but none seem to have the same problem as me, i.e. huge datasets.
Can anyone recommend anything for me? It needs to be super accurate and fast. Preferably not memory based as I only have 32gb of ram to work with.
The standard way to remove duplicates is to sort the file and then do a sequential pass to remove duplicates. Sorting 500 million lines isn't trivial, but it's certainly doable. A few years ago I had a daily process that would sort 50 to 100 gigabytes on a 16 gb machine.
By the way, you might be able to do this with an off-the-shelf program. Certainly the GNU sort utility can sort a file larger than memory. I've never tried it on a 500 GB file, but you might give it a shot. You can download it along with the rest of the GNU Core Utilities. That utility has a --unique option, so you should be able to just sort --unique input-file > output-file. It uses a technique similar to the one I describe below. I'd suggest trying it on a 100 megabyte file first, then slowly working up to larger files.
With GNU sort and the technique I describe below, it will perform a lot better if the input and temporary directories are on separate physical disks. Put the output either on a third physical disk, or on the same physical disk as the input. You want to reduce I/O contention as much as possible.
There might also be a commercial (i.e. pay) program that will do the sorting. Developing a program that will sort a huge text file efficiently is a non-trivial task. If you can buy something for a few hundred dollars, you're probably money ahead if your time is worth anything.
If you can't use a ready made program, then . . .
If your text is in multiple smaller files, the problem is easier to solve. You start by sorting each file, removing duplicates from those files, and writing the sorted temporary files that have the duplicates removed. Then run a simple n-way merge to merge the files into a single output file that has the duplicates removed.
If you have a single file, you start by reading as many lines as you can into memory, sorting those, removing duplicates, and writing a temporary file. You keep doing that for the entire large file. When you're done, you have some number of sorted temporary files that you can then merge.
In pseudocode, it looks something like this:
fileNumber = 0
while not end-of-input
    load as many lines as you can into a list
    sort the list
    filename = "file" + fileNumber
    write sorted list to filename, optionally removing duplicates
    fileNumber = fileNumber + 1
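A hedged C# rendering of that loop (the 10-million-line chunk size and the temp-file naming are assumptions):
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class ChunkSorter
{
    // Read the big file in chunks; sort each chunk, drop duplicates, write a temp file.
    public static List<string> SplitSortDedup(string inputPath, int chunkLines = 10_000_000)
    {
        var tempFiles = new List<string>();
        using var reader = new StreamReader(inputPath);
        var chunk = new List<string>(chunkLines);
        string line;
        int fileNumber = 0;

        while ((line = reader.ReadLine()) != null)
        {
            chunk.Add(line);
            if (chunk.Count >= chunkLines)
                tempFiles.Add(WriteChunk(chunk, fileNumber++));
        }
        if (chunk.Count > 0)
            tempFiles.Add(WriteChunk(chunk, fileNumber));

        return tempFiles;   // sorted, duplicate-free temp files, ready for a k-way merge
    }

    static string WriteChunk(List<string> chunk, int fileNumber)
    {
        chunk.Sort(StringComparer.Ordinal);
        string name = $"file{fileNumber}.tmp";
        File.WriteAllLines(name, chunk.Distinct());   // Distinct on a sorted list drops duplicates
        chunk.Clear();
        return name;
    }
}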
You don't really have to remove the duplicates from the temporary files, but if your unique data is really only 10% of the total, you'll save a huge amount of time by not outputting duplicates to the temporary files.
Once all of your temporary files are written, you need to merge them. From your description, I figure each chunk that you read from the file will contain somewhere around 20 million lines. So you'll have maybe 25 temporary files to work with.
You now need to do a k-way merge. That's done by creating a priority queue. You open each file, read the first line from each file and put it into the queue along with a reference to the file that it came from. Then, you take the smallest item from the queue and write it to the output file. To remove duplicates, you keep track of the previous line that you output, and you don't output the new line if it's identical to the previous one.
Once you've output the line, you read the next line from the file that the one you just output came from, and add that line to the priority queue. You continue this way until you've emptied all of the files.
I published a series of articles some time back about sorting a very large text file. It uses the technique I described above. The only thing it doesn't do is remove duplicates, but that's a simple modification to the methods that output the temporary files and the final output method. Even without optimizations, the program performs quite well. It won't set any speed records, but it should be able to sort and remove duplicates from 500 million lines in less than 12 hours. Probably much less, considering that the second pass is only working with a small percentage of the total data (because you removed duplicates from the temporary files).
One thing you can do to speed the program is operate on smaller chunks and be sorting one chunk in a background thread while you're loading the next chunk into memory. You end up having to deal with more temporary files, but that's really not a problem. The heap operations are slightly slower, but that extra time is more than recaptured by overlapping the input and output with the sorting. You end up getting the I/O essentially for free. At typical hard drive speeds, loading 500 gigabytes will take somewhere in the neighborhood of two and a half to three hours.
Take a look at the article series. It's many different, mostly small, articles that take you through the entire process that I describe, and it presents working code. I'm happy to answer any questions you might have about it.
I am no specialist in such algorithms, but if it is textual data (or numbers, it doesn't matter), you can try to read your big file and write it out into several files keyed by the first two or three symbols: all lines starting with "aaa" go to aaa.txt, all lines starting with "aab" go to aab.txt, etc. You'll get lots of files within which the data are in an equivalence relation: any duplicate of a line is in the same file as the line itself. Now just parse each file in memory and you're done.
Again, I'm not sure that it will work, but I'd try this approach first...
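A sketch of that split-by-prefix pass (the 3-character prefix and the hex escaping of characters that aren't safe in file names are assumptions):
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class PrefixBuckets
{
    public static void Split(string inputPath, string bucketDir, int prefixLen = 3)
    {
        Directory.CreateDirectory(bucketDir);
        // Note: one open writer per distinct prefix; many distinct prefixes means
        // many simultaneously open files.
        var writers = new Dictionary<string, StreamWriter>();

        foreach (var line in File.ReadLines(inputPath))
        {
            string prefix = line.Length >= prefixLen ? line.Substring(0, prefixLen) : line;
            string safe = string.Concat(prefix.Select(c =>
                char.IsLetterOrDigit(c) ? c.ToString() : ((int)c).ToString("x2")));

            if (!writers.TryGetValue(safe, out var writer))
                writers[safe] = writer = new StreamWriter(Path.Combine(bucketDir, safe + ".txt"));
            writer.WriteLine(line);
        }
        foreach (var w in writers.Values) w.Dispose();
    }
}
Each bucket file can then be de-duplicated independently, e.g. with a HashSet<string>, since any duplicate of a line necessarily lands in the same bucket.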

Sorting gigantic binary files with C#

I have a large file, roughly 400 GB in size, generated daily by an external closed system. It is a binary file with the following format:
byte[8]byte[4]byte[n]
Where n is equal to the int32 value of byte[4].
This file has no delimiters and to read the whole file you would just repeat until EOF. With each "item" represented as byte[8]byte[4]byte[n].
The file looks like
byte[8]byte[4]byte[n]byte[8]byte[4]byte[n]...EOF
byte[8] is a 64-bit number representing a period of time represented by .NET Ticks. I need to sort this file but can't seem to figure out the quickest way to do so.
Presently, I read to the end of the file, loading the Ticks and the byte[n] start and end positions into a struct. After this, I sort the List in memory by the Ticks property and then open a BinaryReader and seek to each position in Ticks order, read the byte[n] value, and write to an external file.
At the end of the process I end up with a sorted binary file, but it takes FOREVER. I am using C# .NET and a pretty beefy server, but disk IO seems to be an issue.
Server Specs:
2x 2.6 GHz Intel Xeon (Hex-Core with HT) (24-threads)
32GB RAM
500GB RAID 1+0
2TB RAID 5
I've looked all over the internet and can only find examples where a huge file is 1GB (makes me chuckle).
Does anyone have any advice?
A great way to speed up this kind of file access is to memory-map the entire file into address space and let the OS take care of reading whatever bits from the file it needs to. So do the same thing as you're doing right now, except read from memory instead of using a BinaryReader/seek/read.
You've got lots of main memory, so this should provide pretty good performance (as long as you're using a 64-bit OS).
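As a sketch of what that could look like (the names and the index structure are mine, not the poster's): memory-map the input, walk the byte[8]byte[4]byte[n] records to build an index of (ticks, offset, length), and sort that index by ticks; the payload bytes can then be copied out through the same view in sorted order.
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;

static class MappedIndexer
{
    public readonly record struct Record(long Ticks, long Offset, int Length);

    public static List<Record> BuildIndex(string path)
    {
        long fileLength = new FileInfo(path).Length;
        var index = new List<Record>();

        using var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open);
        using var view = mmf.CreateViewAccessor(0, fileLength, MemoryMappedFileAccess.Read);

        long pos = 0;
        while (pos < fileLength)
        {
            long ticks = view.ReadInt64(pos);        // byte[8]: the .NET Ticks timestamp
            int length = view.ReadInt32(pos + 8);    // byte[4]: payload length n
            index.Add(new Record(ticks, pos + 12, length));
            pos += 12 + length;                      // next record: 8 + 4 + n bytes
        }

        index.Sort((a, b) => a.Ticks.CompareTo(b.Ticks));
        return index;   // read each payload back via the view, in Ticks order, to write the output
    }
}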
Use merge sort.
It's online and parallelizes well.
http://en.wikipedia.org/wiki/Merge_sort
If you can learn Erlang or Go, they could be very powerful and scale extremely well, as you have 24 threads. Utilize Async I/O. Merge Sort.
And since you have 32GB of Ram, try to load as much as you can into RAM and sort it there then write back to disk.
I would do this in several passes. On the first pass, I would create a list of ticks, then distribute them evenly into many (hundreds?) of buckets. If you know ahead of time that the ticks are evenly distributed, you can skip this initial pass. On a second pass, I would split the records into these few hundred separate files of about the same size (these much smaller files represent groups of ticks in the order that you want). Then I would sort each file separately in memory. Then concatenate the files.
It is somewhat similar to the hashsort (I think).

Compare files byte by byte or read all bytes?

I came across this code http://support.microsoft.com/kb/320348 which made me wonder what would be the best way to compare 2 files in order to figure out if they differ.
The main idea is to optimize my program, which needs to verify whether files are equal, in order to build a list of changed files and/or files to delete/create.
Currently I compare the sizes of the files; if they match, I compute an MD5 checksum of the two files. But after looking at the code linked at the beginning of this question, I started wondering whether that byte-by-byte approach is really worth using over creating a checksum of the two files (which you can only get after reading all the bytes anyway)?
Also, what other checks should I make to reduce the work of comparing each file?
Read both files into a small buffer (4K or 8K), which is optimised for reading, and then compare the buffers in memory (byte by byte), which is optimised for comparing.
This will give you optimum performance in all cases (whether the difference is at the start, the middle, or the end).
Of course, the first step is to check whether the file lengths differ; if they do, the files are indeed different.
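A hedged sketch of that buffered comparison: check the lengths first, then read both files in 8 KB blocks and compare the blocks in memory, stopping at the first difference.
using System;
using System.IO;

static class FileCompare
{
    public static bool AreEqual(string pathA, string pathB)
    {
        if (new FileInfo(pathA).Length != new FileInfo(pathB).Length)
            return false;                      // different lengths: definitely different

        const int BufferSize = 8 * 1024;
        using var a = File.OpenRead(pathA);
        using var b = File.OpenRead(pathB);
        var bufA = new byte[BufferSize];
        var bufB = new byte[BufferSize];

        int readA;
        while ((readA = ReadBlock(a, bufA)) > 0)
        {
            int readB = ReadBlock(b, bufB);
            if (readA != readB || !bufA.AsSpan(0, readA).SequenceEqual(bufB.AsSpan(0, readB)))
                return false;                  // stop at the first differing block
        }
        return true;
    }

    // Stream.Read may return fewer bytes than asked for, so fill the buffer in a loop.
    static int ReadBlock(Stream s, byte[] buffer)
    {
        int total = 0, read;
        while (total < buffer.Length && (read = s.Read(buffer, total, buffer.Length - total)) > 0)
            total += read;
        return total;
    }
}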
If you haven't already computed hashes of the files, then you might as well do a proper comparison (instead of looking at hashes), because if the files are the same it's the same amount of work, but if they're different you can stop much earlier.
Of course, comparing a byte at a time is probably a bit wasteful - it's a good idea to read whole blocks at a time and compare them.
