Data structure question

Data structure question - c#

I have a database table with a large number of rows and one numeric column, and I want to represent this data in memory. I could just use one big integer array and this would be very fast, but the number of rows could be too large for this.
Most of the rows (more than 99%) have a value of zero. Is there an effective data structure I could use that would only allocate memory for rows with non-zero values and would be nearly as fast as an array?
Update: as an example, one thing I tried was a Hashtable, reading the original table and adding any non-zero values, keyed by the row number in the original table. I got the value with a function that returned 0 if the requested index wasn't found, or else the value in the Hashtable. This works but is slow as dirt compared to a regular array - I might not be doing it right.
Update 2: here is sample code.
private Hashtable _rowStates;
private void SetRowState(int rowIndex, int state)
{
if (_rowStates.ContainsKey(rowIndex))
{
if (state == 0)
{
_rowStates.Remove(rowIndex);
}
else
{
_rowStates[rowIndex] = state;
}
}
else
{
if (state != 0)
{
_rowStates.Add(rowIndex, state);
}
}
}
private int GetRowState(int rowIndex)
{
if (_rowStates.ContainsKey(rowIndex))
{
return (int)_rowStates[rowIndex];
}
else
{
return 0;
}
}

This is an example of a sparse data structure and there are multiple ways to implement such sparse arrays (or matrices) - it all depends on how you intend to use it. Two possible strategies are:
Store only non-zero values. For each element different than zero store a pair (index, value), all other values are known to be zero by default. You would also need to store the total number of elements.
Compress consecutive zero values. Store a number of (count, value) pairs. For example if you have 12 zeros in a row followed by 200 and another 22 zeros, then store (12, 0), (1, 200), (22, 0).

I would expect that the map/dictionary/hashtable of the non-zero values should be a fast and economical solution.
In Java, using the Hashtable class would introduce locking because it is supposed to be thread-safe. Perhaps something similar has slowed down your implementation.
--- update: using Google-fu suggests that C# Hashtable does incur an overhead for thread safety. Try a Dictionary instead.

How exactly you wan't to implement it depends on what your requirements are, it's a tradeoff between memory and speed. A pure integer array is the fastest, with constant complexity lookups.
Using a hash-based collection such as Hashtable or Dictionary (Hashtable seems to be slower but thread-safe - as others have pointed out) will give you a very low memory usage for a sparse data structure as yours but can be somewhat more expensive when performing lookups. You store a key-value pair for each index and non-zero value.
You can use ContainsKey to find out whether the key exists but it is significantly faster to use TryGetValue to make the check and fetch the data in one go. For dense data it can be worth it to catch exceptions for missing elements as this will only incur a cost in the exceptional case and not each lookup.
Edited again as I got myself confused - that'll teach me to post when I ought to be sleeping.

You're paying a boxing penealty by using Hashtable. Try switching to a Dictionary<int, int>. Also, how many rows are we talking - and how fast do you need it?

Create integer array for non-zero values and bit array holding indicators if particular row contains non-zero value.
You can find then necessary element in first array summing up bits in second array starting from 0 up to row index position.

I am not sure about efficiency of this solution but you can try. So it depends at which scenario you will use it but I will write here two of them that I have in mind. First solution is if you have just one field of integers you can simply use generic list of integers:
List<int> myList = new List<int>();
The second one is almost the same, but you can create a list of your own type for example if you have two fields, count and non-zero value you can create a class which will have two properties and then you can create a list of your class and store information in it. But also you can try generic linked lists. So the code for the solution two can be like this:
public class MyDbFields
{
public MyDbFields(int count, int nonzero)
{
Count = count;
NonZero = nonzero;
}
public int Count { get; set; }
public int NonZero { get; set; }
}
Then you can create a list like this:
List<MyDbFields> fields_list = new List<MyDbFields>();
and then fill it with data:
fields_list.Add(new MyDbFields(100, 11));
I am not sure if this will fully help you solve your problem, but just my suggestion.

If I understand correctly, you cannot just select non-zero rows, because for each row index (aka PK value) your Data Structure will have to be able to report not only the value, but also whether or not it is there at all. So assuming 0 if you don't find it in your Data Structure might not be a good idea.
Just to make sure - exactly how many rows are we talking about here? Millions? A million integers would take up only 4MB RAM as an array. Not much really. I guess it must be at least 100'000'000 rows.
Basically I would suggest a sorted array of integer-pairs for storing non-zero values. The first element in each pair would be the PK value, and this is what the array would be sorted by. The second element would be the value. You can make a DB select that returns only these non-zero values, of course. Since the array will be sorted, you'll be able to use binary search to find your values.
If there are no "holes" in the PK values, then the only thing you would need besides this would be the minimum and maximum PK values so that you can determine whether a given index belongs to your data set.
If there are unused PK values between the used ones, then you need some other mechanism to determine which PK values are valid. Perhaps a bitmask or another array of valid (or invalid, whichever are fewer) PK values.
If you choose the bitmask way, there is another idea. Use two bits for every PK value. First bit will show if the PK value is valid or not. Second bit will show if it is zero or not. Store all non-zero values in another array. This however will have the drawback that you won't know which array item corresponds to which bitmask entry. You'd have to count all the way from the start to find out. This can be mitigated with some indexes. Say, for every 1000 entries in the value array you store another integer which tells you where this entry is in the bitmask.

Perhaps you are looking in the wrong area - all you are storing for each value is the row number of the database row, which suggests that perhaps you are just using this to retrieve the row?
Why not try indexing your table on the numeric column - this will provide lightning fast access to the table rows for any given numeric value (which appears to be the ultimate objective here?) If it is still too slow you can move the index itself into memory etc.
My point here is that your database may solve this problem more elegantly than you can.

Related

Alternative to Dictionary for lookup

I have requirement to build lookup table. I use Dictionary It's contain 45M long and 45M int . long as key and int as value . the size of collection is (45M*12) where long is 8 byte and int is 4 byte
The size about 515 Mbyte . But in fact the size of process is 1.3 Gbyte . The process contains only this lookup table.
Mat be, is there alternative to Dictionary ??
Thx

How much effort are you willing to spend?
You could use a
KeyValuePair<long,int>[] table = new KeyValuePair<long,int> [45 M];
then sort this on the first column (long Key) and use a binary search to find your values.

You could use a SortedList instead of a Dictionary which will be more memory efficient but may be marginally less CPU efficient, ignoring issues about measuring memory and why you need to load so much data in 1 go in the first place :)

Dictionaries have an underlying array that holds onto the data, but the size of the array must be larger than the number of items you have, this is where the lookup speed of a dictionary comes from. In fact, the size of the underlying array should be quite a bit larger than the number of items (25+%). Combine this with the fact that as you're adding items this underlying array is being de-allocated and recreated (to make it larger) you probably have a fair amount of memory ready to be garbage collected (meaning if you actually need more memory the GC will reclaim it, but since you currently have enough it's not bothering to).
Is this Dictionary consuming more memory than you can possibly allow it to, or are you just curious why it's more than you thought it would be? There are other options available to you (other answers and comments have listed some) that will use less memory but also be slower. Are you running into out of memory issues?

if your range is limited to max long values of 10^12, then a problem in regards to space is that you must use longs because you only need a few bits more than an int can hold. If that's the case you could do something like this:
Store your data in an array of 512 Dictionary
var myData = new Dictionary<int,int>[512];
to reference the int associated with a long value (which I'll call "key" for this example), you would do the following:
myData[key & 511].Add((int) (key >> 9), intValue);
int result = myData[(int) (key & 511)][(int) (key >> 9)];
Just how many dictionaries you create and the number of bits used in the bit fiddling might need to be adjusted to fit the true contraints of your data. Using this approach would reduce your memory usage by about a third

Another approach, assuming that the data is static: use two sorted arrays- one of long and one of int. Make sure that item at index N in one is the value for the key at index N in the other. Use Array.BinarySearch to find the key values that you are looking for.

The use of LinkedList<BigInteger>

I've been struggling for a while now with an error I can't fix.
I searched the Internet without any success and started wandering if it is possible what I want to accomplish.
I want the create an array with the a huge amount of nodes, so huge that it I need BigInteger.
I founded that LinkedList would fit my solution the best, so I started with this code.
BigInteger[] numberlist = { 0, 1 };
LinkedList<BigInteger> number = new LinkedList<BigInteger>(numberlist);
for(c = 2; c <= b; c++)
{
numberlist [b] = 1; /* flag numbers to 1 */
}
The meaning of this is to set all nodes in the linkedlist to active (1).
The vars c and b are bigintegers too.
The error I get from VS10 is :
Cannot implicitly convert type 'System.Numerics.BigInteger' to 'int'. An explicit conversion exists (are you missing a cast?)
The questions:
Is it possible to accomplish?
How can I flag all nodes in number with the use of BigInteger (not int)?
Is there an other better method for accomplishing the thing?
UPDATE
In the example I use c++ as the counter. This is variable though...
The node list could look like this:
numberlist[2]
numberlist[3]
numberlist[200]
numberlist[20034759044900]
numberlist[23847982344986350]
I'll remove processed nodes. At the maximum I'll use 1,5gb of memory.
Please reply on this update, I want to know whether my ideas are correct or not.
Also I would like too learn from my mistakes!

The generic argument of LinkedList<T> describes the element type and has nothing to do with the number of elements you can put in the collection.
Indexing into a linked list is a bad idea too. It's an O(n) operation.
And I can't imagine how you can have more elements than what fits into an Int64. There is simply not enough memory to back that.
You can have more than 2^31-1 elements in a 64bit process, but most likely you need to create your own collection type for that, since most built in collections have lower limits.
If you need more than 2^31 flags I'd create my own collection type that's backed by multiple arrays and bitpacks the flags. That way you get about 8*2^31 = 16 billion flags into a 2GB array.
If your data is sparse you could consider using a HashSet<Int64> or Dictionary<Int64,Node>.
If your data has long sequences with the same value you could use some tree structure or perhaps some variant of run-length-encoding.
If you don't need the indexes at all, you could just use a Queue<T> and dequeue from the beginning.

From your update it seems you don't want to have huge amount of data, you just want to index them using huge numbers. If that's the case, you can use Dictionary<BigInteger, int> or Dictionary<BigInteger, bool> if you only want true/false values. Alternatively, you could use HashSet<BigInteger>, if you don't need to distinguish between false and “not in collection”.

LinkedList<BigInteger> is a small number of elements, where each element is a BigInteger.
.NET doesn't allow any single array to be larger than 2GB (even on 64-bit), so there's no point in having an index larger than an int.
Try breaking your big array into smaller segments, where each segment can be addressed by an int.

If I may read your mind, it sounds like what you want a sparse array which is indexed by a BigInteger. As others have mentioned, LinkedList<BigInteger> is entirely the wrong data structure for this. I suggest something entirely different, namely a Dictionary<BigInteger, int>. This allows you to do the following:
Dictionary<BigInteger, int> data = new Dictionary<BigInteger, int>();
BigInteger b = GetBigInteger();
data[b] = 1; // the BigInteger is the *index*, and the integer is the *value*

Dictionary of Primes

I was trying to create this helper function in C# that returns the first n prime numbers. I decided to store the numbers in a dictionary in the <int,bool> format. The key is the number in question and the bool represents whether the int is a prime or not. There are a ton of resources out there calculating/generating the prime numbers(SO included), so I thought of joining the masses by crafting another trivial prime number generator.
My logic goes as follows:
public static Dictionary<int,bool> GetAllPrimes(int number)
{
Dictionary<int, bool> numberArray = new Dictionary<int, bool>();
int current = 2;
while (current <= number)
{
//If current has not been marked as prime in previous iterations,mark it as prime
if (!numberArray.ContainsKey(current))
numberArray.Add(current, true);
int i = 2;
while (current * i <= number)
{
if (!numberArray.ContainsKey(current * i))
numberArray.Add(current * i, false);
else if (numberArray[current * i])//current*i cannot be a prime
numberArray[current * i] = false;
i++;
}
current++;
}
return numberArray;
}
It will be great if the wise provide me with suggestions,optimizations, with possible refactorings. I was also wondering if the inclusion of the Dictionary helps with the run-time of this snippet.

Storing integers explicitly needs at least 32 bits per prime number, with some overhead for the container structure.
At around 231, the maximal value a signed 32 bit integer can take, about every 21.5th number is prime. Smaller primes are more dense, about 1 in ln(n) numbers is prime around n.
This means it is more memory efficient to use an array of bits than to store numbers explicitly. It will also be much faster to look up if a number is prime, and reasonably fast to iterate through the primes.
It seems this is called a BitArray in C# (in Java it is BitSet).

The first thing that bothers is that, why are you storing the number itself ?
Can't you just use the index itself which will represent the number?
PS: I'm not a c# developer so maybe it is not possible with a dictionary, but it can be done with the appropriate structure.

First, you only have to loop untill the square root of the number. Make all numbers false by default and have a simple flag that you set true at the beginning of every iteration.
Further, don't store it in a dictionary. Make it a bool array and have the index be the number you're looking for. Only 0 won't make any sense, but that doesn't matter. You don't have to init either; bools are false by default. Just declare an bool[] of number length.
Then, I would init like this:
primes[2] = true;
for(int i = 3; i < sqrtNumber; i += 2) {
}
So you skip all the even numbers automatically.
By the way, never declare a variable (i) in a loop, it makes it slower.
So that's about it. For more info see this page.

I'm pretty sure the Dictionary actually hurts performance, since it doesn't enable you to perform the trial divisions in an optimal order. Traditionally, you would store the known primes so that they could be iterated from smallest to largest, since smaller primes are factors of more composite numbers than larger primes. Additionally, you never need to try division with any prime larger than the square root of the candidate prime.
Many other optimizations are possible (as you yourself point out, this problem has been studied to death) but those are the ones that I can see off the top of my head.

The dictionary really doesn't make sense here -- just store all primes up to a given number in a list. Then follow these steps:
Is given number in the list?
Yes - it's prime. Done.
Not in list
Is given number larger than the list maximum?
No - it's not prime. Done.
Bigger than maximum; need to fill list up to maximum.
Run a sieve up to given number.
Repeat.

1) From the perspective of the client to this function, wouldn't it be better if the return type was bool[] (from 0 to number perhaps)? Internally, you have three states (KnownPrime, KnownComposite, Unknown), which could be represented by an enumeration. Storing an an array of this enumeration internally, prepopulated with Unknown, will be faster than a dictionary.
2) If you stick with the dictionary, the part of the sieve that marks multiples of the current number as composite could be replaced with a numberArray.TryGetValue() pattern rather than multiple checks for ContainsKey and subsequent retrieval of the value by key.

The trouble with returning an object that holds the primes is that unless you're careful to make it immutable, client code is free to mess up the values, in turn meaning you're not able to cache the primes you've already calculated.
How about having a method such as:
bool IsPrime(int primeTest);
in your helper class that can hide the primes it's already calculated, meaning you don't have to re-calculate them every time.

i need to use large size of array

My requirement is to find a duplicate number in an array of integers of length 10 ^ 15.
I need to find a duplicate in one pass. I know the method (logic) to find a duplicate number from an array, but how can I handle such a large size.

An array of 10^15 of integers would require more than a petabyte to store. You said it can be done in a single pass, so there's no need to store all the data. But even reading this amount of data would take a lot of time.
But wait, if the numbers are integers, they fall into a certain range, let's say N = 2^32. So you only need to search at most N+1 numbers to find a duplicate. Now that's feasible.

You can use a BitVector array with length = 2^(32-5) = 0x0800000
This has a bit for each posible int32 number.
Note: easy solution (BitArray) do´nt support adecuate constructor.
BitVector32[] bv = new BitVector32[0x8000000];
int[] ARR = ....; // Your array
foreach (int I in ARR)
{
int Element = I >> 5;
int Bit = I & 0x1f;
if (bv[Element ][Bit])
{
// Option for First Duplicate Found
}
else
{
bv[I][Bit] = true;
}
}

You'll need a different data structure. I suspect the requirement isn't really to use an array - I'd hope not, as arrays can only hold up to Int32.MaxValue elements, i.e. 2,147,483,647... much less than 10^15. Even on a 64-bit machine, I believe the CLR requires that arrays have at most that many elements. (See the docs for Array.CreateInstance for example - even though you can specify the bounds as 64-bit integers, it will throw an exception if they're actually that big.)
Now, if you can explain what the real requirement is, we may well be able to suggest alternative data structures.
If this is a theoretical problem rather than a practical one, it would be helpful if you could tell us those constraints, too.
For example, if you've got enough memory for the array itself, then asking for 2^24 bytes to store which numbers you've seen already (one bit per value) isn't asking for much. This is assuming the values themselves are 32-bit integers, of course. Start with an empty array, and set the relevant bit for each number you find. If you find you're about to set one that's already set, then you've found the first duplicate.

You can declare it in the normal way: new int[1000000000000000]. However this will only work on a 64-bit machine; the largest you can expect to store on a 32-bit machine is a little over 2GB.
Realistically you won't be able to store the whole array in memory. You'll need to come up with a way of generating it in smaller chunks, and checking those chunks individually.
What's the data in the array? Maybe you don't need to generate it all in one go. Or maybe you can store the data in a file.

You cannot declare an array of size greater than Int32.MaxValue (2^31, or approx. 2*10^9), so you will have to either chain arrays together or use a List<int> to hold all of the values.

Your algorithm should really be the same regardless of the array size. The best time complexity you'll get has got to be (ideally) O(n) of course.
Consider the following pseudo-code for the algorithm:
Create a HashSet<int> of capacity equal to the range of numbers in your array.
Loop over each number in the array and check if it already exists in the hashset.
if no, add it to the hashset now.
if yes, you've found a duplicate.
Memory usage here is far from trivial, but if you want speed, it will do the job.

You don't need to do anything. By definition, there will be a duplicate because 2^32 < 10^15 - there aren't enough numbers to fill a 10^15 array uniquely. :)
Now if there is an additional requirement that you know where the duplicates are... thats another story, but it wasn't in the original problem.

question,
1) is the number of items in the array 10^15
2) or can the value of the items be 10^15?
if it is #1:
where are you pulling the nummbers from? if its a file you can step through it.
are there more than 2,147,483,647 unique numbers?
if its #2:
a int64 can handle the number
if its #1 and #2:
are there more than 2,147,483,647 unique numbers?
if there are less then 2,147,483,647 unique numbers you can use a List<bigint>

Performance when checking for duplicates

I've been working on a project where I need to iterate through a collection of data and remove entries where the "primary key" is duplicated. I have tried using a
List<int>
and
Dictionary<int, bool>
With the dictionary I found slightly better performance, even though I never need the Boolean tagged with each entry. My expectation is that this is because a List allows for indexed access and a Dictionary does not. What I was wondering is, is there a better solution to this problem. I do not need to access the entries again, I only need to track what "primary keys" I have seen and make sure I only perform addition work on entries that have a new primary key. I'm using C# and .NET 2.0. And I have no control over fixing the input data to remove the duplicates from the source (unfortunately!). And so you can have a feel for scaling, overall I'm checking for duplicates about 1,000,000 times in the application, but in subsets of no more than about 64,000 that need to be unique.

They have added the HashSet class in .NET 3.5. But I guess it will be on par with the Dictionary. If you have less than say a 100 elements a List will probably perform better.

Edit: Nevermind my comment. I thought you're talking about C++. I have no idea if my post is relevant in the C# world..
A hash-table could be a tad faster. Binary trees (that's what used in the dictionary) tend to be relative slow because of the way the memory gets accessed. This is especially true if your tree becomes very large.
However, before you change your data-structure, have you tried to use a custom pool allocator for your dictionary? I bet the time is not spent traversing the tree itself but in the millions of allocations and deallocations the dictionary will do for you.
You may see a factor 10 speed-boost just plugging a simple pool allocator into the dictionary template. Afaik boost has a component that can be directly used.
Another option: If you know only 64.000 entries in your integers exist you can write those to a file and create a perfect hash function for it. That way you can just use the hash function to map your integers into the 0 to 64.000 range and index a bit-array.
Probably the fastest way, but less flexible. You have to redo your perfect hash function (can be done automatically) each time your set of integers changes.

I don't really get what you are asking.
Firstly is just the opposite of what you say. The dictionary has indexed access (is a hash table) while de List hasn't.
If you already have the data in a dictionary then all keys are unique, there can be no duplicates.
I susspect you have the data stored in another data type and you're storing it into the dictionary. If that's the case the inserting the data will work with two dictionarys.
foreach (int key in keys)
{
if (!MyDataDict.ContainsKey(key))
{
if (!MyDuplicatesDict.ContainsKey(key))
MyDuplicatesDict.Add(key);
}
else
MyDataDict.Add(key);
}

If you are checking for uniqueness of integers, and the range of integers is constrained enough then you could just use an array.
For better packing you could implement a bitmap data structure (basically an array, but each int in the array represents 32 ints in the key space by using 1 bit per key). That way if you maximum number is 1,000,000 you only need ~30.5KB of memory for the data structure.
Performs of a bitmap would be O(1) (per check) which is hard to beat.

There was a question awhile back on removing duplicates from an array. For the purpose of the question performance wasn't much of a consideration, but you might want to take a look at the answers as they might give you some ideas. Also, I might be off base here, but if you are trying to remove duplicates from the array then a LINQ command like Enumerable.Distinct might give you better performance than something that you write yourself. As it turns out there is a way to get LINQ working on .NET 2.0 so this might be a route worth investigating.

If you're going to use a List, use the BinarySearch:
// initailize to a size if you know your set size
List<int> FoundKeys = new List<int>( 64000 );
Dictionary<int,int> FoundDuplicates = new Dictionary<int,int>();
foreach ( int Key in MyKeys )
{
// this is an O(log N) operation
int index = FoundKeys.BinarySearch( Key );
if ( index < 0 )
{
// if the Key is not in our list,
// index is the two's compliment of the next value that is in the list
// i.e. the position it should occupy, and we maintain sorted-ness!
FoundKeys.Insert( ~index, Key );
}
else
{
if ( DuplicateKeys.ContainsKey( Key ) )
{
DuplicateKeys[Key]++;
}
else
{
DuplicateKeys.Add( Key, 1 );
}
}
}
You can also use this for any type for which you can define an IComparer by using an overload: BinarySearch( T item, IComparer< T > );

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.