Alternative to Dictionary for lookup - c#

I have requirement to build lookup table. I use Dictionary It's contain 45M long and 45M int . long as key and int as value . the size of collection is (45M*12) where long is 8 byte and int is 4 byte
The size about 515 Mbyte . But in fact the size of process is 1.3 Gbyte . The process contains only this lookup table.
Mat be, is there alternative to Dictionary ??
Thx

How much effort are you willing to spend?
You could use a
KeyValuePair<long,int>[] table = new KeyValuePair<long,int> [45 M];
then sort this on the first column (long Key) and use a binary search to find your values.

You could use a SortedList instead of a Dictionary which will be more memory efficient but may be marginally less CPU efficient, ignoring issues about measuring memory and why you need to load so much data in 1 go in the first place :)

Dictionaries have an underlying array that holds onto the data, but the size of the array must be larger than the number of items you have, this is where the lookup speed of a dictionary comes from. In fact, the size of the underlying array should be quite a bit larger than the number of items (25+%). Combine this with the fact that as you're adding items this underlying array is being de-allocated and recreated (to make it larger) you probably have a fair amount of memory ready to be garbage collected (meaning if you actually need more memory the GC will reclaim it, but since you currently have enough it's not bothering to).
Is this Dictionary consuming more memory than you can possibly allow it to, or are you just curious why it's more than you thought it would be? There are other options available to you (other answers and comments have listed some) that will use less memory but also be slower. Are you running into out of memory issues?

if your range is limited to max long values of 10^12, then a problem in regards to space is that you must use longs because you only need a few bits more than an int can hold. If that's the case you could do something like this:
Store your data in an array of 512 Dictionary
var myData = new Dictionary<int,int>[512];
to reference the int associated with a long value (which I'll call "key" for this example), you would do the following:
myData[key & 511].Add((int) (key >> 9), intValue);
int result = myData[(int) (key & 511)][(int) (key >> 9)];
Just how many dictionaries you create and the number of bits used in the bit fiddling might need to be adjusted to fit the true contraints of your data. Using this approach would reduce your memory usage by about a third

Another approach, assuming that the data is static: use two sorted arrays- one of long and one of int. Make sure that item at index N in one is the value for the key at index N in the other. Use Array.BinarySearch to find the key values that you are looking for.

Related

.Net Dictionary<int,int> out of memory exception at around 6,000,000 entries

I am using a Dictionary<Int,Int> to store the frequency of colors in an image, where the key is the the color (as an int), and the value is the number of times the color has been found in the image.
When I process larger / more colorful images, this dictionary grows very large. I get an out of memory exception at just around 6,000,000 entries. Is this the expected capacity when running in 32-bit mode? If so, is there anything I can do about it? And what might be some alternative methods of keeping track of this data that won't run out of memory?
For reference, here is the code that loops through the pixels in a bitmap and saves the frequency in the Dictionary<int,int>:
Bitmap b; // = something...
Dictionary<int, int> count = new Dictionary<int, int>();
System.Drawing.Color color;
for (int i = 0; i < b.Width; i++)
{
for (int j = 0; j < b.Height; j++)
{
color = b.GetPixel(i, j);
int colorString = color.ToArgb();
if (!count.Keys.Contains(color.ToArgb()))
{
count.Add(colorString, 0);
}
count[colorString] = count[colorString] + 1;
}
}
Edit: In case you were wondering what image has that many different colors in it: http://allrgb.com/images/mandelbrot.png
Edit: I also should mention that this is running inside an asp.net web application using .Net 4.0. So there may be additional memory restrictions.
Edit: I just ran the same code inside a console application and had no problems. The problem only happens in ASP.Net.
Update: Given the OP's sample image, it seems that the maximum number of items would be over 16 million, and apparently even that is too much to allocate when instantiating the dictionary. I see three options here:
Resize the image down to a manageable size and work from that.
Try to convert to a color scheme with fewer color possibilities.
Go for an array of fixed size as others have suggested.
Previous answer: the problem is that you don't allocate enough space for your dictionary. At some point, when it is expanding, you just run out of memory for the expansion, but not necessarily for the new dictionary.
Example: this code runs out of memory at nearly 24 million entries (in my machine, running in 32-bit mode):
Dictionary<int, int> count = new Dictionary<int, int>();
for (int i = 0; ; i++)
count.Add(i, i);
because with the last expansion it is currently using space for the entries already there, and tries to allocate new space for another so many million more, and that is too much.
Now, if we initially allocate space for, say, 40 million entries, it runs without problem:
Dictionary<int, int> count = new Dictionary<int, int>(40000000);
So try to indicate how many entries there will be when creating the dictionary.
From MSDN:
The capacity of a Dictionary is the number of elements that can be added to the Dictionary before resizing is necessary. As elements are added to a Dictionary, the capacity is automatically increased as required by reallocating the internal array.
If the size of the collection can be estimated, specifying the initial capacity eliminates the need to perform a number of resizing operations while adding elements to the Dictionary.
Each dictionary entry holds two 4-byte integers: 8 bytes total. 8 bytes * 6 millions entries is only about 48MB, +/- some space for object overhead, alignment, etc. There's plenty of space in memory for this. .Net provides virtual address space of up to 2 GB per process. 48MB or so shouldn't cause a problem.
I expect what's actually happening here is related to how the dictionary auto-expands and how the garbage collector handles (or doesn't handle) compaction.
First, the auto-expanding part. Last time I checked (back around .Net 2.0*), collections in .Net tended to use arrays internally. They would allocated a reasonably-sized array in the collection constructor (say, 10 items), and then use a doubling algorithm to create additional space whenever the array filled up. All the existing items would have to be copied to the new array, but then the old array could be garbage collected. The garbage collector is pretty reliable about this, and so it means you're left using space for at most 2n - 1 items in the collection.
Now the Garbage Collector compaction part. After a certain size, these arrays end up in a section of memory called the Large Object Heap. Garbage Collection still works here (though less often). What doesn't really work here well is compaction (think memory defragmentation). The physical memory used by the old object will be released, returned to the operating system, and available for other processes. However, the virtual address space in your process... the table that maps program memory offsets to physical memory addresses, will still have the (empty) space reserved.
This is important, because remember: we're working with a rapidly growing object. It's possible for such an object to take up address space far larger than the final size of the object itself. An object grows enough, fast enough, and suddenly you get an OutOfMemoryException, even though your app isn't really using all that much RAM.
The first solution here is allocate enough space in the initial collection for all of your data. This allows you to skip all those re-allocations and copying. Your data will live in a single array, and use only the space you actually asked for. Most collections, including the Dictionary, have an overload for the constructor that allows you to give it the number of items you want the first array to use. Be careful here: you don't need to allocate an item for every pixel in your image. There will be a lot of repeated colors. You only need to allocate enough to have space for each color in your image. If it's only large images that give you problems, and you're almost handling them with six millions records, you might find that 8 million is plenty.
My next suggestion is to group your pixel colors. A human can't tell and doesn't care if two colors are just one bit apart in any of the rgb components. You might go as far as to look at the separate RGB values for each pixel and normalize the pixel so that you only care about changes of more than 5 or so for an R,G,or B value. That would get you from 16.5 million potential colors all the way down to only about 132,000, and the data will likely be more useful, too. That might look something like this:
var colorCounts = new Dictionary<Color, int>(132651);
foreach(Color c in GetImagePixels().Select( c=> Color.FromArgb( (c.R/5) * 5, (c.G/5) * 5, (c.B/5) * 5) )
{
colorCounts[c] += 1;
}
* IIRC, somewhere in a recent or upcoming version of .Net both of these issues are being addressed. One by allowing you to force compaction of the LOH, and the other by using a set of arrays for collection backing stores, rather than trying to keep everything in one big array
The maximum size limit provided by CLR is 2GB
When you run a 64-bit managed application on a 64-bit Windows
operating system, you can create an object of no more than 2 gigabytes
(GB).
You may better use an array.
You may also check this BigArray<T>, getting around the 2GB array size limit
In the 32 bit runtime, the maximum number of items you can have in a Dictionary<int, int> is in the neighborhood of 61.7 million. See my old article for more info.
If you're running in 32 bit mode, then your entire application plus whatever bits of ASP.NET and the underlying machinery is required all have to fit within the memory available to your process: normally 2 GB in the 32-bit runtime.
By the way, a really wacky way to solve your problem (but one I wouldn't recommend unless you're really hurting for memory), would be the following (assuming a 24-bit image):
Call LockBits to get a pointer to the raw image data
Compress the per-scan-line padding by moving the data for each scan line to fill the previous row's padding. You end up with an array of 3-byte values followed by a bunch of empty space (to equal the padding).
Sort the image data. That is, sort the 3-byte values. You'd have to write a custom sort, but it wouldn't be too bad.
Go sequentially through the array and count the number of unique values.
Allocate a 2-dimensional array: int[count,2] to hold the values and their occurrence counts.
Go sequentially through the array again to count occurrences of each unique value and populate the counts array.
I wouldn't honestly suggest using this method. Just got a little laugh when I thought of it.
Try using an array instead. I doubt it will run out of memory. 6 million int array elements is not a big deal.

How much capacity should a static hash table have to minimize collisions?

My program retrieves a finite and complete list of elements I want to refer to by a string ID. I'm using a .Net Dictionary<string, MyClass> to store these elements. I personally have no idea how many elements there will be. It could be a few. It could be thousands.
Given the program know exactly how many elements it will be putting in the hash table, what should it specify as the table's capacity. Clearly it should be at least the number of elements it will contain, but using only that number will likely lead to numerous collisions.
Is there a guide to selecting the capacity of a hash table for a known number of elements to balance hash collisions and memory wastage?
EDIT: I'm aware a hash table's size can change. What I'm avoiding first and foremost is leaving it with the default allocation, then immediately adding thousands of elements causing countless resize operations. I won't be adding or removing elements once it's populated. If I know what's going in, I can ensure there's sufficient space upfront. My question relates to the balance of hash collisions versus memory wastage.
Your question seems to imply a false assumption, namely that the dictionary's capacity is fixed. It isn't.
If you know in any given case that a dictionary will hold at least some number of elements, then you can specify that number as the dictionary's initial capacity. The dictionary's capacity is always at least as large as its item count (this is true for .NET 2 through 4, at least; I believe this is an undocumented implementation detail that's subject to change).
Specifying the initial capacity reduces the number of memory allocations by eliminating those that would occurred as the dictionary grows from its default initial capacity to the capacity you have chosen.
If the hash function in use is well chosen, the number of collisions should be relatively small and should have a minimal impact on performance. Specifying an over-large capacity might help in some contrived situations, but I would definitely not give this any thought unless profiling showed that the dictionary's lookups were having a significant impact on performance.
(As an example of a contrived situation, consider a dictionary with int keys with a capacity of 10007, all of whose keys are a multiple of 10007. With the current implementation, all of the items would be stored in a single bucket, because the bucket is chosen by dividing the hash code by the capacity and taking the remainder. In this case, the dictionary would function as a linked list, and forcing it to use a different capacity would fix that.)
This is bit of a subjective question but let me try my best to answer this (from perspective of CLR 2.0. only as I have not yet explored if there have been any changes in dictionary for CLR 4.0).
Your are using a dictionary keyed on string. Since there can be infinite possible strings, it is reasonable to assume that every possible hash code is 'equally likely'. Or in other words each of the 2^32 hash codes (range of int) are equally likely for the string class. Current version of Dictionary in BCL drops off 32nd bit from any 32 bit hash code thus obtained, to essentially get a 31 bit hash code. Hence the range we are dealing with is 2^31 unique equally likely hash codes.
Note that the range of the hash codes is not dependent on the number of elements dictionary contains or can contain.
Dictionary class will use this hash code to allocate a bucket to the 'Myclass' object. So essentially if two different strings return same 31 bits of hash code (assuming BCL designers have chosen the string hash function highly wisely, such instances should be fairly spread out) both will be allocated same bucket. In such a hash collision, nothing can be done.
Now, in current implementation of the Dictionary class, it may happen that even different hash codes (again 31 bit) still end up in the same bucket. The bucket index is identified as follows:
hash = <31 bit hash code>
pr = <least prime number greater than or equal to current dictionary capacity>
bucket_index = hash modulus pr
Hence every hash code of the form (pr*factor + bucket_index) will end up in same bucket irrespective of the factor part.
If you want to be absolutely sure that all different possible 31 bit hash codes end up in different buckets only way is to force the pr to be greater than or equal to the largest possible 31 bit hash code. Or in other words, ensure that every hash code is of the form (pr*0 + hash_code) i.e. pr should be greater than 2^31. This by extension means that the dictionary capacity should be at-least 2^31.
Note that the capacity required to minimize hash collisions is not at all dependent on the number of elements you want to store in the dictionary but on the range of the possible hash codes.
As you can imagine 2^31 is huge huge memory allocation. In fact if you try to specify 2^31 as the capacity, there will be two arrays of 2^31 length. Consider that on a 32 bit machine highest possible address on RAM is 2^32!!!!!
If, for some reason, default behavior of the dictionary is not acceptable to you and it is critical for you to minimize hash collisions (or rather I would say bucket collisions) only hope you have is to provide your own hash code (i.e. you can not use string as key). Such a hash code should keep the formula to obtain bucket index in mind and strive to minimize the range of possible hash codes. Simplest approach is to incrementally assign a number/index to your unique MyClass instances and use this number as your hash code. Then you can specify the total number of MyClass instances as dictionary capacity. Though, in such a case an array can easily be maintained instead of dictionary as you know the 'index' of the object and index is incremental.
In the end, I would like to re-iterate what others have said, 'there will not be countless resizes'. Dictionary doubles its capacity (rounded off to nearest prime number greater than or equal to the new capacity) each time it finds itself short of space. In order to save some processing, you can very well set capacity to number of MyClass instances you have as in any case dictionary will require this much capacity to store the instances but this will not minimize 'hash-collisions' and for normal circumstances will be fast enough.
Datastructure like HashTable are meant for dynamic memory allocation. You can however mention the initial size in some structures. But , when you add new elements , they will expand in size. There is in no way you can restrict the size implicitly.
There are many datastructures available , with their own advantages and disadvantages. You need to select the best one. Limiting the size does not affect the performance. You need to take care of Add, Delete and Search which makes the difference in performance.

Efficiently storing a set of numbers

I am looking for the most efficient way to store a collection of integers. Right now they're being stored in a HashSet<T>, but profiling has shown that these collections weigh heavily on some performance-critical code and I suspect there's a better option.
Some more details:
Random lookups must be O(1) or close to it.
The collections can grow large, so space efficiency is desirable.
The values are uniformly distributed in a 64-bit space.
Mutability is not needed.
There's no clear upper bound on size, but tens of millions of elements is not uncommon.
The most painful performance hit right now is creating them. That seems to be allocation-related - clearing and reusing HashSets helps a lot in benchmarks, but unfortunately that is not a feasible option in the application code.
(added) Implementing a data structure that's tailored to the task is fine. Is a hash table still the way to go? A trie also seems like a possibility at first glance, but I don't have any practical experience with them.
HashSet is usually the best general purpose collection in this case.
If you have any specific information about your collection you may have better options.
If you have a fixed upper bound that is not incredibly large you can use a bit vector of suitable size.
If you have a very dense collection you can instead store the missing values.
If you have very small collections, <= 4 items or so, you can store them in a regular array. A full scan of such small array may be faster than the hashing required to use the hash-set.
If you don't have any more specific characteristics of your data than "large collections of int" HashSet is the way to go.
If the size of the values is bounded you could use a bitset. It stores one bit per integer. In total the memory use would be log n bits with n being the greatest integer.
Another option is a bloom filter. Bloom filters are very compact but you have to be prepared for an occasional false positive in lookups. You can find more about them in wikipedia.
A third option is using a simle sorted array. Lookups are log n with n being the number of integers. It may be fast enough.
I decided to try and implement a special purpose hash-based set class that uses linear probing to handle collisions:
Backing store is a simple array of longs
The array is sized to be larger than the expected number of elements to be stored.
For a value's hash code, use the least-significant 31 bits.
Searching for the position of a value in the backing store is done using a basic linear probe, like so:
int FindIndex(long value)
{
var index = ((int)(value & 0x7FFFFFFF) % _storage.Length;
var slotValue = _storage[index];
if(slotValue == 0x0 || slotValue == value) return index;
for(++index; ; index++)
{
if (index == _storage.Length) index = 0;
slotValue = _storage[index];
if(slotValue == 0x0 || slotValue == value) return index;
}
}
(I was able to determine that the data being stored will never include 0, so that number is safe to use for empty slots.)
The array needs to be larger than the number of elements stored. (Load factor less than 1.) If the set is ever completely filled then FindIndex() will go into an infinite loop if it's used to search for a value that isn't already in the set. In fact, it will want to have quite a lot of empty space, otherwise search and retrieval may suffer as the data starts to form large clumps.
I'm sure there's still room for optimization, and I will may get stuck using some sort of BigArray<T> or sharding for the backing store on large sets. But initial results are promising. It performs over twice as fast as HashSet<T> at a load factor of 0.5, nearly twice as fast with a load factor of 0.8, and even at 0.9 it's still working 40% faster in my tests.
Overhead is 1 / load factor, so if those performance figures hold out in the real world then I believe it will also be more memory-efficient than HashSet<T>. I haven't done a formal analysis, but judging by the internal structure of HashSet<T> I'm pretty sure its overhead is well above 10%.
--
So I'm pretty happy with this solution, but I'm still curious if there are other possibilities. Maybe some sort of trie?
--
Epilogue: Finally got around to doing some competitive benchmarks of this vs. HashSet<T> on live data. (Before I was using synthetic test sets.) It's even beating my optimistic expectations from before. Real-world performance is turning out to be as much as 6x faster than HashSet<T>, depending on collection size.
What I would do is just create an array of integers with a sufficient enough size to handle how ever many integers you need. Is there any reason from staying away from the generic List<T>? http://msdn.microsoft.com/en-us/library/6sh2ey19.aspx
The most painful performance hit right now is creating them...
As you've obviously observed, HashSet<T> does not have a constructor that takes a capacity argument to initialize its capacity.
One trick which I believe would work is the following:
int capacity = ... some appropriate number;
int[] items = new int[capacity];
HashSet<int> hashSet = new HashSet<int>(items);
hashSet.Clear();
...
Looking at the implementation with reflector, this will initialize the capacity to the size of the items array, ignoring the fact that this array contains duplicates. It will, however, only actually add one value (zero), so I'd assume that initializing and clearing should be reasonably efficient.
I haven't tested this so you'd have to benchmark it. And be willing to take the risk of depending on an undocumented internal implementation detail.
It would be interesting to know why Microsoft didn't provide a constructor with a capacity argument like they do for other collection types.

i need to use large size of array

My requirement is to find a duplicate number in an array of integers of length 10 ^ 15.
I need to find a duplicate in one pass. I know the method (logic) to find a duplicate number from an array, but how can I handle such a large size.
An array of 10^15 of integers would require more than a petabyte to store. You said it can be done in a single pass, so there's no need to store all the data. But even reading this amount of data would take a lot of time.
But wait, if the numbers are integers, they fall into a certain range, let's say N = 2^32. So you only need to search at most N+1 numbers to find a duplicate. Now that's feasible.
You can use a BitVector array with length = 2^(32-5) = 0x0800000
This has a bit for each posible int32 number.
Note: easy solution (BitArray) do´nt support adecuate constructor.
BitVector32[] bv = new BitVector32[0x8000000];
int[] ARR = ....; // Your array
foreach (int I in ARR)
{
int Element = I >> 5;
int Bit = I & 0x1f;
if (bv[Element ][Bit])
{
// Option for First Duplicate Found
}
else
{
bv[I][Bit] = true;
}
}
You'll need a different data structure. I suspect the requirement isn't really to use an array - I'd hope not, as arrays can only hold up to Int32.MaxValue elements, i.e. 2,147,483,647... much less than 10^15. Even on a 64-bit machine, I believe the CLR requires that arrays have at most that many elements. (See the docs for Array.CreateInstance for example - even though you can specify the bounds as 64-bit integers, it will throw an exception if they're actually that big.)
Now, if you can explain what the real requirement is, we may well be able to suggest alternative data structures.
If this is a theoretical problem rather than a practical one, it would be helpful if you could tell us those constraints, too.
For example, if you've got enough memory for the array itself, then asking for 2^24 bytes to store which numbers you've seen already (one bit per value) isn't asking for much. This is assuming the values themselves are 32-bit integers, of course. Start with an empty array, and set the relevant bit for each number you find. If you find you're about to set one that's already set, then you've found the first duplicate.
You can declare it in the normal way: new int[1000000000000000]. However this will only work on a 64-bit machine; the largest you can expect to store on a 32-bit machine is a little over 2GB.
Realistically you won't be able to store the whole array in memory. You'll need to come up with a way of generating it in smaller chunks, and checking those chunks individually.
What's the data in the array? Maybe you don't need to generate it all in one go. Or maybe you can store the data in a file.
You cannot declare an array of size greater than Int32.MaxValue (2^31, or approx. 2*10^9), so you will have to either chain arrays together or use a List<int> to hold all of the values.
Your algorithm should really be the same regardless of the array size. The best time complexity you'll get has got to be (ideally) O(n) of course.
Consider the following pseudo-code for the algorithm:
Create a HashSet<int> of capacity equal to the range of numbers in your array.
Loop over each number in the array and check if it already exists in the hashset.
if no, add it to the hashset now.
if yes, you've found a duplicate.
Memory usage here is far from trivial, but if you want speed, it will do the job.
You don't need to do anything. By definition, there will be a duplicate because 2^32 < 10^15 - there aren't enough numbers to fill a 10^15 array uniquely. :)
Now if there is an additional requirement that you know where the duplicates are... thats another story, but it wasn't in the original problem.
question,
1) is the number of items in the array 10^15
2) or can the value of the items be 10^15?
if it is #1:
where are you pulling the nummbers from? if its a file you can step through it.
are there more than 2,147,483,647 unique numbers?
if its #2:
a int64 can handle the number
if its #1 and #2:
are there more than 2,147,483,647 unique numbers?
if there are less then 2,147,483,647 unique numbers you can use a List<bigint>

Performance when checking for duplicates

I've been working on a project where I need to iterate through a collection of data and remove entries where the "primary key" is duplicated. I have tried using a
List<int>
and
Dictionary<int, bool>
With the dictionary I found slightly better performance, even though I never need the Boolean tagged with each entry. My expectation is that this is because a List allows for indexed access and a Dictionary does not. What I was wondering is, is there a better solution to this problem. I do not need to access the entries again, I only need to track what "primary keys" I have seen and make sure I only perform addition work on entries that have a new primary key. I'm using C# and .NET 2.0. And I have no control over fixing the input data to remove the duplicates from the source (unfortunately!). And so you can have a feel for scaling, overall I'm checking for duplicates about 1,000,000 times in the application, but in subsets of no more than about 64,000 that need to be unique.
They have added the HashSet class in .NET 3.5. But I guess it will be on par with the Dictionary. If you have less than say a 100 elements a List will probably perform better.
Edit: Nevermind my comment. I thought you're talking about C++. I have no idea if my post is relevant in the C# world..
A hash-table could be a tad faster. Binary trees (that's what used in the dictionary) tend to be relative slow because of the way the memory gets accessed. This is especially true if your tree becomes very large.
However, before you change your data-structure, have you tried to use a custom pool allocator for your dictionary? I bet the time is not spent traversing the tree itself but in the millions of allocations and deallocations the dictionary will do for you.
You may see a factor 10 speed-boost just plugging a simple pool allocator into the dictionary template. Afaik boost has a component that can be directly used.
Another option: If you know only 64.000 entries in your integers exist you can write those to a file and create a perfect hash function for it. That way you can just use the hash function to map your integers into the 0 to 64.000 range and index a bit-array.
Probably the fastest way, but less flexible. You have to redo your perfect hash function (can be done automatically) each time your set of integers changes.
I don't really get what you are asking.
Firstly is just the opposite of what you say. The dictionary has indexed access (is a hash table) while de List hasn't.
If you already have the data in a dictionary then all keys are unique, there can be no duplicates.
I susspect you have the data stored in another data type and you're storing it into the dictionary. If that's the case the inserting the data will work with two dictionarys.
foreach (int key in keys)
{
if (!MyDataDict.ContainsKey(key))
{
if (!MyDuplicatesDict.ContainsKey(key))
MyDuplicatesDict.Add(key);
}
else
MyDataDict.Add(key);
}
If you are checking for uniqueness of integers, and the range of integers is constrained enough then you could just use an array.
For better packing you could implement a bitmap data structure (basically an array, but each int in the array represents 32 ints in the key space by using 1 bit per key). That way if you maximum number is 1,000,000 you only need ~30.5KB of memory for the data structure.
Performs of a bitmap would be O(1) (per check) which is hard to beat.
There was a question awhile back on removing duplicates from an array. For the purpose of the question performance wasn't much of a consideration, but you might want to take a look at the answers as they might give you some ideas. Also, I might be off base here, but if you are trying to remove duplicates from the array then a LINQ command like Enumerable.Distinct might give you better performance than something that you write yourself. As it turns out there is a way to get LINQ working on .NET 2.0 so this might be a route worth investigating.
If you're going to use a List, use the BinarySearch:
// initailize to a size if you know your set size
List<int> FoundKeys = new List<int>( 64000 );
Dictionary<int,int> FoundDuplicates = new Dictionary<int,int>();
foreach ( int Key in MyKeys )
{
// this is an O(log N) operation
int index = FoundKeys.BinarySearch( Key );
if ( index < 0 )
{
// if the Key is not in our list,
// index is the two's compliment of the next value that is in the list
// i.e. the position it should occupy, and we maintain sorted-ness!
FoundKeys.Insert( ~index, Key );
}
else
{
if ( DuplicateKeys.ContainsKey( Key ) )
{
DuplicateKeys[Key]++;
}
else
{
DuplicateKeys.Add( Key, 1 );
}
}
}
You can also use this for any type for which you can define an IComparer by using an overload: BinarySearch( T item, IComparer< T > );

Categories