Can a HashSet contain more than 2^31 -1 elements? - c#

In C# code dealing with any HashSet, we need to know whether it is necessary to use the LongCount() extension method of IEnumerable to get the number of elements in the set, or whether one should always use the Count property, which is of type Int32 and has a maximum value of 2^31 - 1.
If there is such a limit on the maximum number of elements in a HashSet, then what type should we use when we need to deal with a set that has numberOfElements >= 2^31?
Update:
The exact answer can be found by running this as part of your code:
var set = new HashSet<bool>(Int32.MaxValue -1);
Then this exception is thrown -- note what the message says:
But anyway, one shouldn't be forced to experiment just to find a simple fact, whose place is in the language documentation!

Related

How to handle null when overloading operator + for a class value object?

I want to have a value object that represents length. I would prefer to use a struct given that it is a value type, but since zero length does not make sense I am forced to use a class. Adding two lengths together seems like a reasonable operation, so I want to overload the + operator. I am curious though, how should I handle adding null?
Adding null to an existing string returns a string with the same content as the existing string. Adding null to an int? that has a value returns null.
I can see a case where adding null to an existing length simply returns a new length with the same value as the existing length. At the same time, I can see a case where adding null would be considered a bug. I have been trying to find some guidance but have not been able to find any. Is there a common guideline for this or is it different for each application?
I would highly recommend using struct for your length, and treating the default representation as zero length.
"since zero length does not make sense I am forced to use a class"
It is up to your code to treat the default representation of length struct as a representation of some specific length. In addition to treating it as zero length, you have at least two options:
You can treat default length as an unknown, in which case any operation with it would produce an unknown, or
You can treat it as a "trap representation" of length, in which case any operation with it would produce an exception.
It is probably a design mistake not to treat zero in a uniform way with all other numbers. Specifically, zero length becomes handy when you subtract length values, because subtracting two equal lengths would otherwise have nothing to produce.
As far as "unknown" length is concerned, using a struct gives you a convenient standard representation, Nullable<length>, that is immediately familiar to users of your length structure.
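A minimal sketch of the struct approach described in this answer. The type name Length, the Meters property, and the choice to reject negative values are assumptions made for illustration; they are not from the question.

```csharp
using System;

// Hypothetical value type; the name Length and the Meters unit are assumptions.
public readonly struct Length
{
    public double Meters { get; }

    public Length(double meters)
    {
        if (meters < 0) throw new ArgumentOutOfRangeException(nameof(meters));
        Meters = meters;
    }

    // default(Length) has Meters == 0, so zero needs no special handling.
    public static Length operator +(Length a, Length b) => new Length(a.Meters + b.Meters);

    // Throws if the result would be negative, because the constructor rejects it.
    public static Length operator -(Length a, Length b) => new Length(a.Meters - b.Meters);
}
```

Because Length is a struct, C# lifts these operators to Length? automatically: adding a null Length? to a Length? that has a value yields null, the same behaviour as int?.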
Simple Answer:
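If you're allowed to add nulls in your system, then you should probably keep the existing value and treat null like 0, like so:
public static NullNumber operator +(NullNumber b, NullNumber c)
{
    // Treat a null operand as zero. Value and the constructor are assumed members of NullNumber.
    return new NullNumber((b?.Value ?? 0) + (c?.Value ?? 0));
}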
Advanced Answer:
You are probably correct that a length of 0 does not make sense, and you are right that adding nulls looks like a bug.
I can't see where the field is populated, but I suspect one of the following:
You don't have a constructor that requires a length to be passed in, even though it's required.
Or you have a faulty class that sometimes has a length and sometimes doesn't, which sounds closer to two classes.
Strictly speaking, a null length doesn't exist in reality, everything has length. Getting a null return or a NullReferenceException when working with your struct would lead me to think I messed up the constructor or instantiation. In other words, the null reference would be employed in the scope of the application and not exposed to the client.
MyStruct length = new MyStruct(); // no!
MyStruct length = new MyStruct(feet, inches); // better... (two doubles)
MyStruct length = 34.5; // ok... (needs an implicit conversion from double)

What is the difference between `HashSet<T>.IsSubsetOf()` and `HashSet<T>.IsProperSubsetOf()`

What is the difference between these two method calls?
HashSet<T>.IsSubsetOf()
HashSet<T>.IsProperSubsetOf()
From the documentation of IsProperSubsetOf:
If the current set is a proper subset of other, other must have at least one element that the current set does not have.
versus the documentation of IsSubsetOf:
If other contains the same elements as the current set, the current set is still considered a subset of other.
The difference is set.IsSubsetOf(set) == true, whereas set.IsProperSubsetOf(set) == false
This comes from the set theory:
S = {e,s,t}, T = {e,s,t}
T is a subset of S because every element in T is also in S. However, it is not a proper subset: a proper subset, like any subset, contains only elements of the superset, but it must also have fewer elements than the superset. Example:
S = {e,s,t}, T = {e,t}
T is a proper subset of S.
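The two cases above can be checked directly with HashSet<char>. This is a small sketch; the SubsetDemo class and its Classify method are hypothetical helpers, while IsSubsetOf and IsProperSubsetOf are the real HashSet<T> methods.

```csharp
using System.Collections.Generic;

// Hypothetical helper that names the relationship between two sets.
public static class SubsetDemo
{
    public static string Classify(HashSet<char> t, HashSet<char> s)
    {
        if (t.IsProperSubsetOf(s)) return "proper subset";
        // A subset that is not proper must contain exactly the same elements.
        if (t.IsSubsetOf(s)) return "subset (same elements)";
        return "not a subset";
    }
}
```

With S = {e,s,t}, Classify returns "subset (same elements)" for T = {e,s,t} and "proper subset" for T = {e,t}.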
IsProperSubsetOf is true only when the current set is part of the other HashSet, not the whole of it.
IsSubsetOf is true for any subset, including the full HashSet.
From the "Examples" section of the documentation:
"The following example creates two disparate HashSet objects and compares them to each other. In this example, lowNumbers is both a subset and a proper subset of allNumbers until allNumbers is modified, using the IntersectWith method, to contain only values that are present in both sets. Once allNumbers and lowNumbers are identical, lowNumbers is still a subset of allNumbers but is no longer a proper subset."

The use of LinkedList<BigInteger>

I've been struggling for a while now with an error I can't fix.
I searched the Internet without any success and started wondering whether what I want to accomplish is possible.
I want to create an array with a huge number of nodes, so huge that I need BigInteger.
I found that LinkedList would fit my solution best, so I started with this code.
BigInteger[] numberlist = { 0, 1 };
LinkedList<BigInteger> number = new LinkedList<BigInteger>(numberlist);
for(c = 2; c <= b; c++)
{
numberlist [b] = 1; /* flag numbers to 1 */
}
The meaning of this is to set all nodes in the linkedlist to active (1).
The vars c and b are bigintegers too.
The error I get from VS10 is :
Cannot implicitly convert type 'System.Numerics.BigInteger' to 'int'. An explicit conversion exists (are you missing a cast?)
The questions:
Is it possible to accomplish?
How can I flag all nodes in number with the use of BigInteger (not int)?
Is there another, better method for accomplishing this?
UPDATE
In the example I use c++ as the counter. This is variable though...
The node list could look like this:
numberlist[2]
numberlist[3]
numberlist[200]
numberlist[20034759044900]
numberlist[23847982344986350]
I'll remove processed nodes. At the maximum I'll use 1.5 GB of memory.
Please reply on this update, I want to know whether my ideas are correct or not.
Also, I would like to learn from my mistakes!
The generic argument of LinkedList<T> describes the element type and has nothing to do with the number of elements you can put in the collection.
Indexing into a linked list is a bad idea too. It's an O(n) operation.
And I can't imagine how you can have more elements than what fits into an Int64. There is simply not enough memory to back that.
You can have more than 2^31-1 elements in a 64-bit process, but most likely you need to create your own collection type for that, since most built-in collections have lower limits.
If you need more than 2^31 flags I'd create my own collection type that's backed by multiple arrays and bit-packs the flags. That way you get about 8 * 2^31 = 2^34, roughly 17 billion, flags into a 2GB array.
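A minimal sketch of such a collection: bit-packed flags addressed by a long index and backed by several smaller arrays. The class name BigBitSet and the chunk size are arbitrary choices for illustration, not an existing .NET type.

```csharp
using System;

// Hypothetical collection: bit-packed flags addressed by a long index,
// stored in several smaller arrays instead of one huge one.
public sealed class BigBitSet
{
    private const int ChunkShift = 24;                    // 2^24 bits (2 MiB) per chunk; an arbitrary choice
    private const long ChunkMask = (1L << ChunkShift) - 1;
    private readonly ulong[][] _chunks;

    public long Length { get; }

    public BigBitSet(long length)
    {
        if (length < 0) throw new ArgumentOutOfRangeException(nameof(length));
        Length = length;
        // Only the table of chunks is created here; each chunk is allocated on first write.
        _chunks = new ulong[(length + ChunkMask) >> ChunkShift][];
    }

    public bool this[long index]
    {
        get
        {
            CheckIndex(index);
            ulong[] chunk = _chunks[index >> ChunkShift];
            long bit = index & ChunkMask;
            return chunk != null && (chunk[bit >> 6] & (1UL << (int)(bit & 63))) != 0;
        }
        set
        {
            CheckIndex(index);
            long c = index >> ChunkShift;
            if (_chunks[c] == null)
            {
                if (!value) return;                       // clearing an unset bit needs no storage
                _chunks[c] = new ulong[1 << (ChunkShift - 6)];
            }
            long bit = index & ChunkMask;
            ulong mask = 1UL << (int)(bit & 63);
            if (value) _chunks[c][bit >> 6] |= mask;
            else _chunks[c][bit >> 6] &= ~mask;
        }
    }

    private void CheckIndex(long index)
    {
        if (index < 0 || index >= Length) throw new ArgumentOutOfRangeException(nameof(index));
    }
}
```

Usage looks like an ordinary indexer, for example bits[5000000000L] = true on a set created with new BigBitSet(1L << 33). Allocating chunks lazily means a mostly empty set costs little memory.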
If your data is sparse you could consider using a HashSet<Int64> or Dictionary<Int64,Node>.
If your data has long sequences with the same value you could use some tree structure or perhaps some variant of run-length-encoding.
If you don't need the indexes at all, you could just use a Queue<T> and dequeue from the beginning.
From your update it seems you don't want to have huge amount of data, you just want to index them using huge numbers. If that's the case, you can use Dictionary<BigInteger, int> or Dictionary<BigInteger, bool> if you only want true/false values. Alternatively, you could use HashSet<BigInteger>, if you don't need to distinguish between false and “not in collection”.
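As a rough sketch of the HashSet<BigInteger> variant, using a few of the index values from the question's update. The SparseFlags class and its Demo method are hypothetical; membership in the set stands for "active", and removing an entry stands for "processed".

```csharp
using System.Collections.Generic;
using System.Numerics;

// Hypothetical demo: only active indexes are stored, however large they are.
public static class SparseFlags
{
    public static HashSet<BigInteger> Demo()
    {
        var active = new HashSet<BigInteger>();
        active.Add(2);                                        // int converts implicitly to BigInteger
        active.Add(200);
        active.Add(BigInteger.Parse("23847982344986350"));
        active.Remove(200);                                   // a processed node is simply removed
        return active;
    }
}
```

Memory use then depends on how many indexes are active at once, not on how large the index values are.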
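LinkedList<BigInteger> is a list with a modest number of elements, where each element is a BigInteger.
.NET has traditionally not allowed any single array to be larger than 2GB (even on 64-bit), and the number of elements in any one array dimension is still limited to the int range, so there's no point in having an index larger than an int.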
Try breaking your big array into smaller segments, where each segment can be addressed by an int.
If I may read your mind, it sounds like what you want is a sparse array which is indexed by a BigInteger. As others have mentioned, LinkedList<BigInteger> is entirely the wrong data structure for this. I suggest something entirely different, namely a Dictionary<BigInteger, int>. This allows you to do the following:
Dictionary<BigInteger, int> data = new Dictionary<BigInteger, int>();
BigInteger b = GetBigInteger();
data[b] = 1; // the BigInteger is the *index*, and the integer is the *value*

i need to use large size of array

My requirement is to find a duplicate number in an array of integers of length 10 ^ 15.
I need to find a duplicate in one pass. I know the method (logic) to find a duplicate number from an array, but how can I handle such a large size.
An array of 10^15 integers would require more than a petabyte to store. You said it can be done in a single pass, so there's no need to store all the data. But even reading this amount of data would take a lot of time.
But wait, if the numbers are integers, they fall into a certain range, let's say N = 2^32. So you only need to search at most N+1 numbers to find a duplicate. Now that's feasible.
You can use a BitVector32 array with length = 2^(32-5) = 0x8000000.
This has a bit for each possible Int32 number.
Note: the easy solution (BitArray) doesn't have an adequate constructor; its length is an int, so it cannot hold 2^32 bits.
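// Requires: using System.Collections.Specialized;
BitVector32[] bv = new BitVector32[0x8000000]; // 2^27 vectors * 32 bits = one bit per int value
int[] ARR = ....; // Your array
foreach (int I in ARR)
{
    uint U = unchecked((uint)I);      // negative values map to the upper half of the range
    int Element = (int)(U >> 5);      // which BitVector32 holds this value's bit
    int Mask = 1 << (int)(U & 0x1f);  // the BitVector32 indexer takes a bit mask, not a bit position
    if (bv[Element][Mask])
    {
        // Option for First Duplicate Found
    }
    else
    {
        bv[Element][Mask] = true;
    }
}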
You'll need a different data structure. I suspect the requirement isn't really to use an array - I'd hope not, as arrays can only hold up to Int32.MaxValue elements, i.e. 2,147,483,647... much less than 10^15. Even on a 64-bit machine, I believe the CLR requires that arrays have at most that many elements. (See the docs for Array.CreateInstance for example - even though you can specify the bounds as 64-bit integers, it will throw an exception if they're actually that big.)
Now, if you can explain what the real requirement is, we may well be able to suggest alternative data structures.
If this is a theoretical problem rather than a practical one, it would be helpful if you could tell us those constraints, too.
For example, if you've got enough memory for the array itself, then asking for 2^29 bytes (512 MB) to store which numbers you've seen already (one bit per value) isn't asking for much. This is assuming the values themselves are 32-bit integers, of course. Start with an empty array, and set the relevant bit for each number you find. If you find you're about to set one that's already set, then you've found the first duplicate.
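A sketch of that one-bit-per-value idea, written against a streamed IEnumerable<int> so the input never has to sit in one array. The DuplicateFinder class and FirstDuplicate method are hypothetical names; note that the bitmap is allocated up front at its full 512 MB size.

```csharp
using System.Collections.Generic;

// Hypothetical helper: one pass over the input, one bit per possible Int32 value.
public static class DuplicateFinder
{
    // Returns the first value seen twice, or null if the sequence has no duplicate.
    public static int? FirstDuplicate(IEnumerable<int> values)
    {
        var seen = new ulong[1 << 26];            // 2^26 * 64 bits = 2^32 bits = 512 MB
        foreach (int value in values)
        {
            uint u = unchecked((uint)value);      // negative values map to the upper half
            ulong mask = 1UL << (int)(u & 63);
            if ((seen[u >> 6] & mask) != 0) return value;
            seen[u >> 6] |= mask;
        }
        return null;
    }
}
```

Since there are only 2^32 possible values, the loop is guaranteed to return within the first 2^32 + 1 items of the input.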
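You can write it in the normal way, new int[1000000000000000], and it will compile, but the allocation will fail at run time: a CLR array cannot hold more than about Int32.MaxValue elements, even on a 64-bit machine, and on a 32-bit machine the largest you can expect to store is a little over 2GB.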
Realistically you won't be able to store the whole array in memory. You'll need to come up with a way of generating it in smaller chunks, and checking those chunks individually.
What's the data in the array? Maybe you don't need to generate it all in one go. Or maybe you can store the data in a file.
You cannot declare an array of size greater than Int32.MaxValue (2^31 - 1, or approx. 2*10^9), so you will have to chain arrays together to hold all of the values; a List<int> is backed by a single array, so it has the same limit.
Your algorithm should really be the same regardless of the array size. The best time complexity you'll get has got to be (ideally) O(n) of course.
Consider the following pseudo-code for the algorithm:
Create a HashSet<int> of capacity equal to the range of numbers in your array.
Loop over each number in the array and check if it already exists in the hashset.
if no, add it to the hashset now.
if yes, you've found a duplicate.
Memory usage here is far from trivial, but if you want speed, it will do the job.
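The pseudo-code above could look roughly like this in C#. HashSetDuplicateFinder is a hypothetical name; the sketch relies on HashSet<T>.Add returning false when the value is already present, which combines the "check" and "add" steps into one call.

```csharp
using System.Collections.Generic;

// Hypothetical helper implementing the HashSet-based duplicate check.
public static class HashSetDuplicateFinder
{
    // Returns the first value seen twice, or null if there is no duplicate.
    public static int? FirstDuplicate(IEnumerable<int> values)
    {
        var seen = new HashSet<int>();
        foreach (int value in values)
        {
            // Add returns false when the value is already in the set.
            if (!seen.Add(value)) return value;
        }
        return null;
    }
}
```

This only works while the distinct values seen so far fit in memory as a HashSet<int>, which is the memory cost mentioned above.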
You don't need to do anything. By definition, there will be a duplicate because 2^32 < 10^15 - there aren't enough numbers to fill a 10^15 array uniquely. :)
Now if there is an additional requirement that you know where the duplicates are... that's another story, but it wasn't in the original problem.
Questions:
1) Is the number of items in the array 10^15,
2) or can the value of the items be 10^15?
If it is #1:
Where are you pulling the numbers from? If it's a file, you can step through it.
Are there more than 2,147,483,647 unique numbers?
If it's #2:
An Int64 can handle the number.
If it's #1 and #2:
Are there more than 2,147,483,647 unique numbers?
If there are fewer than 2,147,483,647 unique numbers, you can use a List<bigint>.

Data structure question

I have a database table with a large number of rows and one numeric column, and I want to represent this data in memory. I could just use one big integer array and this would be very fast, but the number of rows could be too large for this.
Most of the rows (more than 99%) have a value of zero. Is there an effective data structure I could use that would only allocate memory for rows with non-zero values and would be nearly as fast as an array?
Update: as an example, one thing I tried was a Hashtable, reading the original table and adding any non-zero values, keyed by the row number in the original table. I got the value with a function that returned 0 if the requested index wasn't found, or else the value in the Hashtable. This works but is slow as dirt compared to a regular array - I might not be doing it right.
Update 2: here is sample code.
private Hashtable _rowStates;
private void SetRowState(int rowIndex, int state)
{
if (_rowStates.ContainsKey(rowIndex))
{
if (state == 0)
{
_rowStates.Remove(rowIndex);
}
else
{
_rowStates[rowIndex] = state;
}
}
else
{
if (state != 0)
{
_rowStates.Add(rowIndex, state);
}
}
}
private int GetRowState(int rowIndex)
{
if (_rowStates.ContainsKey(rowIndex))
{
return (int)_rowStates[rowIndex];
}
else
{
return 0;
}
}
This is an example of a sparse data structure and there are multiple ways to implement such sparse arrays (or matrices) - it all depends on how you intend to use it. Two possible strategies are:
Store only non-zero values. For each element different than zero store a pair (index, value), all other values are known to be zero by default. You would also need to store the total number of elements.
Compress consecutive zero values. Store a number of (count, value) pairs. For example if you have 12 zeros in a row followed by 200 and another 22 zeros, then store (12, 0), (1, 200), (22, 0).
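A sketch of the second strategy, using the (count, value) runs from the example above. RunLengthArray is a hypothetical type; it assumes every run has a count of at least 1 and finds the run for an index with a binary search over the run start positions.

```csharp
using System;

// Hypothetical read-only array stored as (count, value) runs.
public sealed class RunLengthArray
{
    private readonly long[] _runStarts;  // first index covered by each run, ascending
    private readonly int[] _runValues;   // value repeated throughout each run

    public long Length { get; }

    // runs: (count, value) pairs in order, e.g. (12, 0), (1, 200), (22, 0); counts must be >= 1
    public RunLengthArray(params (long Count, int Value)[] runs)
    {
        _runStarts = new long[runs.Length];
        _runValues = new int[runs.Length];
        long start = 0;
        for (int i = 0; i < runs.Length; i++)
        {
            _runStarts[i] = start;
            _runValues[i] = runs[i].Value;
            start += runs[i].Count;
        }
        Length = start;
    }

    public int this[long index]
    {
        get
        {
            if (index < 0 || index >= Length) throw new ArgumentOutOfRangeException(nameof(index));
            int pos = Array.BinarySearch(_runStarts, index);
            if (pos < 0) pos = ~pos - 1;  // not a run start: take the run that begins before it
            return _runValues[pos];
        }
    }
}
```

Lookups cost O(log r) for r runs rather than O(1), which is the price of not storing the zeros.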
I would expect that the map/dictionary/hashtable of the non-zero values should be a fast and economical solution.
In Java, using the Hashtable class would introduce locking because it is supposed to be thread-safe. Perhaps something similar has slowed down your implementation.
Update: a bit of Google-fu suggests that the C# Hashtable does incur an overhead for thread safety. Try a Dictionary instead.
How exactly you want to implement it depends on what your requirements are; it's a tradeoff between memory and speed. A pure integer array is the fastest, with constant-time lookups.
Using a hash-based collection such as Hashtable or Dictionary (Hashtable seems to be slower but thread-safe, as others have pointed out) will give you very low memory usage for a sparse data structure such as yours, but lookups can be somewhat more expensive. You store a key-value pair for each index with a non-zero value.
You can use ContainsKey to find out whether the key exists but it is significantly faster to use TryGetValue to make the check and fetch the data in one go. For dense data it can be worth it to catch exceptions for missing elements as this will only incur a cost in the exceptional case and not each lookup.
Edited again as I got myself confused - that'll teach me to post when I ought to be sleeping.
You're paying a boxing penalty by using Hashtable. Try switching to a Dictionary<int, int>. Also, how many rows are we talking about, and how fast do you need it?
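For illustration, here is how the question's SetRowState and GetRowState might look with a Dictionary<int, int> and TryGetValue, as suggested in the answers above. The wrapper class name RowStates is an assumption; the method names come from the question.

```csharp
using System.Collections.Generic;

// Hypothetical wrapper around the question's two methods, without boxing.
public sealed class RowStates
{
    private readonly Dictionary<int, int> _rowStates = new Dictionary<int, int>();

    public void SetRowState(int rowIndex, int state)
    {
        if (state == 0) _rowStates.Remove(rowIndex);  // Remove does nothing when the key is absent
        else _rowStates[rowIndex] = state;            // the indexer adds or overwrites
    }

    public int GetRowState(int rowIndex)
    {
        // One lookup instead of ContainsKey followed by the indexer.
        return _rowStates.TryGetValue(rowIndex, out int state) ? state : 0;
    }
}
```

Keys and values stay as plain ints here, so there is no boxing and no cast on the way out.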
Create an integer array for the non-zero values and a bit array holding an indicator of whether each row contains a non-zero value.
You can then find the required element in the first array by summing up the bits in the second array from 0 up to the row index position.
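A small sketch of that layout, assuming the rows are available as a list while the structure is being built. BitmapSparseArray is a hypothetical name; the lookup counts the set bits before the requested row, so it costs time proportional to the row index unless extra index data is added.

```csharp
using System.Collections;
using System.Collections.Generic;

// Hypothetical type illustrating the bit array + value array layout.
public sealed class BitmapSparseArray
{
    private readonly BitArray _nonZero;  // bit i is set when row i holds a non-zero value
    private readonly int[] _values;      // only the non-zero values, in row order

    public BitmapSparseArray(IReadOnlyList<int> rows)
    {
        _nonZero = new BitArray(rows.Count);
        var values = new List<int>();
        for (int i = 0; i < rows.Count; i++)
        {
            if (rows[i] == 0) continue;
            _nonZero[i] = true;
            values.Add(rows[i]);
        }
        _values = values.ToArray();
    }

    public int this[int row]
    {
        get
        {
            if (!_nonZero[row]) return 0;
            int rank = 0;                // number of non-zero rows before this one
            for (int i = 0; i < row; i++)
            {
                if (_nonZero[i]) rank++;
            }
            return _values[rank];
        }
    }
}
```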
I am not sure about the efficiency of this solution, but you can try it. It depends on the scenario in which you will use it, but here are two options that I have in mind. The first: if you have just one field of integers, you can simply use a generic list of integers:
List<int> myList = new List<int>();
The second is almost the same, but you create a list of your own type. For example, if you have two fields, a count and a non-zero value, you can create a class with two properties and then store instances of it in a list. You can also try generic linked lists. The code for the second option can look like this:
public class MyDbFields
{
public MyDbFields(int count, int nonzero)
{
Count = count;
NonZero = nonzero;
}
public int Count { get; set; }
public int NonZero { get; set; }
}
Then you can create a list like this:
List<MyDbFields> fields_list = new List<MyDbFields>();
and then fill it with data:
fields_list.Add(new MyDbFields(100, 11));
I am not sure if this will fully help you solve your problem, but just my suggestion.
If I understand correctly, you cannot just select non-zero rows, because for each row index (aka PK value) your Data Structure will have to be able to report not only the value, but also whether or not it is there at all. So assuming 0 if you don't find it in your Data Structure might not be a good idea.
Just to make sure - exactly how many rows are we talking about here? Millions? A million integers would take up only 4MB RAM as an array. Not much really. I guess it must be at least 100'000'000 rows.
Basically I would suggest a sorted array of integer-pairs for storing non-zero values. The first element in each pair would be the PK value, and this is what the array would be sorted by. The second element would be the value. You can make a DB select that returns only these non-zero values, of course. Since the array will be sorted, you'll be able to use binary search to find your values.
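A minimal sketch of the sorted-array idea. It uses two parallel arrays rather than an array of pairs, which is a simplification chosen here; SortedSparseArray is a hypothetical name, and the keys are assumed to arrive already sorted (for example from an ORDER BY on the PK).

```csharp
using System;

// Hypothetical lookup structure: sorted PK values plus their non-zero values.
public sealed class SortedSparseArray
{
    private readonly int[] _keys;    // PK values that have non-zero data, sorted ascending
    private readonly int[] _values;  // _values[i] belongs to _keys[i]

    public SortedSparseArray(int[] sortedKeys, int[] values)
    {
        if (sortedKeys.Length != values.Length)
            throw new ArgumentException("Key and value arrays must have the same length.");
        _keys = sortedKeys;
        _values = values;
    }

    public int this[int key]
    {
        get
        {
            // BinarySearch returns a negative number when the key is absent.
            int pos = Array.BinarySearch(_keys, key);
            return pos >= 0 ? _values[pos] : 0;
        }
    }
}
```

Each lookup is O(log n) in the number of non-zero rows, and the storage is two plain int arrays with no per-entry overhead.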
If there are no "holes" in the PK values, then the only thing you would need besides this would be the minimum and maximum PK values so that you can determine whether a given index belongs to your data set.
If there are unused PK values between the used ones, then you need some other mechanism to determine which PK values are valid. Perhaps a bitmask or another array of valid (or invalid, whichever are fewer) PK values.
If you choose the bitmask way, there is another idea. Use two bits for every PK value. First bit will show if the PK value is valid or not. Second bit will show if it is zero or not. Store all non-zero values in another array. This however will have the drawback that you won't know which array item corresponds to which bitmask entry. You'd have to count all the way from the start to find out. This can be mitigated with some indexes. Say, for every 1000 entries in the value array you store another integer which tells you where this entry is in the bitmask.
Perhaps you are looking in the wrong area - all you are storing for each value is the row number of the database row, which suggests that perhaps you are just using this to retrieve the row?
Why not try indexing your table on the numeric column - this will provide lightning fast access to the table rows for any given numeric value (which appears to be the ultimate objective here?) If it is still too slow you can move the index itself into memory etc.
My point here is that your database may solve this problem more elegantly than you can.
