Maximum length for string ActorId - c#

What is the maximum length for a string-based ActorId? And if there is a difference between the maximum possible length and the maximum recommended length, what is the difference and why?

ActorId does not, in itself, specify a limit on the length of a string-based id. However, when choosing a length for a string-based ActorId, keep the following points in mind:
1) The ActorStateProvider (an implementation of IActorStateProvider) stores the named states and reminders for an actor. Depending on the implementation, it may impose a specific limit on the length of a string-based ActorId, because internally it uses a combination of the ActorId, the actor state name and the reminder name (and possibly some internal metadata tags) to uniquely identify the persisted named states and reminders of a given actor.
2) The default ActorStateProvider for actors is KvsActorStateProvider. It is implemented on top of a key-value store, which has a key length limit of 872 characters. I would recommend leaving 50 characters for internal metadata tagging; you can distribute the remaining characters between your string-based ActorIds and actor state names/reminder names based on your naming scheme.
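The budget described above can be checked up front. The following is only a sketch: ActorIdBudget and Fits are hypothetical names, and it assumes the persisted key is roughly the ActorId plus the state or reminder name plus the reserved metadata, which is an internal detail that may differ.

```csharp
using System;

public static class ActorIdBudget
{
    // Key length limit quoted above for the key-value store behind KvsActorStateProvider.
    public const int KvsKeyLimit = 872;

    // Headroom recommended above for internal metadata tags.
    public const int ReservedForMetadata = 50;

    // Hypothetical helper: true if the id plus the longest state/reminder name
    // fits into the remaining budget (872 - 50 = 822 characters).
    public static bool Fits(string actorId, string longestStateOrReminderName)
    {
        if (actorId == null) throw new ArgumentNullException(nameof(actorId));
        if (longestStateOrReminderName == null) throw new ArgumentNullException(nameof(longestStateOrReminderName));

        int budget = KvsKeyLimit - ReservedForMetadata;
        return actorId.Length + longestStateOrReminderName.Length <= budget;
    }
}
```

For example, ActorIdBudget.Fits(myId, "longest-state-name") could be asserted once at startup for the longest name your naming scheme produces.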

Related

How can hashset.contains be O(1) with this implementation?

The HashSet.Contains implementation in .NET is:
/// <summary>
/// Checks if this hashset contains the item
/// </summary>
/// <param name="item">item to check for containment</param>
/// <returns>true if item contained; false if not</returns>
public bool Contains(T item) {
    if (m_buckets != null) {
        int hashCode = InternalGetHashCode(item);
        // see note at "HashSet" level describing why "- 1" appears in for loop
        for (int i = m_buckets[hashCode % m_buckets.Length] - 1; i >= 0; i = m_slots[i].next) {
            if (m_slots[i].hashCode == hashCode && m_comparer.Equals(m_slots[i].value, item)) {
                return true;
            }
        }
    }
    // either m_buckets is null or wasn't found
    return false;
}
And I read in a lot of places "search complexity in hashset is O(1)". How?
Then why does that for-loop exist?
Edit: .net reference link: https://github.com/microsoft/referencesource/blob/master/System.Core/System/Collections/Generic/HashSet.cs
The classic implementation of a hash table works by assigning elements to one of a number of buckets, based on the hash of the element. If the hashing was perfect, i.e. no two elements had the same hash, then we'd be living in a perfectly perfect world where we wouldn't need to care about anything - any lookup would be O(1) always, because we'd only need to compute the hash, get the bucket and say if something is inside.
We're not living in a perfectly perfect world. First off, consider string hashing. In .NET, there are (2^16)^n possible strings of length n; GetHashCode returns an int, and there are 2^32 possible values of int. That's exactly enough to hash every string of length 2 to a unique int, but if we want strings longer than that, there must exist two different values that give the same hash - this is called a collision. Also, we don't want to maintain 2^32 buckets at all times anyway. The usual way of dealing with that is to take the hash code and compute its value modulo the number of buckets to determine the bucket's number [1]. So, the takeaway is - we need to allow for collisions.
The referenced .NET Framework implementation uses the simplest way of dealing with collisions - every bucket holds a linked list of all objects that result in the particular hash. You add object A, it's assigned to a bucket i. You add object B, it has the same hash, so it's added to the list in bucket i right after A. Now if you look up any element, you need to traverse the list of all objects and call the actual Equals method to find out if that thing is actually the one you're looking for. That explains the for loop - in the worst case you have to go through the entire list.
Okay, so how is "search complexity in hashset O(1)"? It's not. The worst case complexity is proportional to the number of items. It's O(1) on average. [2] If all objects fall into the same bucket, asking for the elements at the end of the list (or for ones that are not in the structure but would fall into the same bucket) will be O(n).
So what do people mean by "it's O(1) on average"? The structure monitors how many objects there are relative to the number of buckets, and if that ratio exceeds some threshold, called the load factor, it resizes. It's easy to see that this makes the average lookup time proportional to the load factor.
That's why it's important for hash functions to be uniform, meaning that the probability that two randomly chosen different objects get the same int assigned is 1/2^32 [3]. That keeps the distribution of objects in a hash table uniform, so we avoid pathological cases where one bucket contains a huge number of items.
Note that if you know the hash function and the algorithm used by the hash table, you can force such a pathological case and O(n) lookups. If a server takes inputs from a user and stores them in a hash table, an attacker knowing the hash function and the hash table implementation could use this as a vector for a DoS attack. There are ways of dealing with that too. Treat this as a demonstration that yes, the worst case can be O(n) and that people are generally aware of that.
There are dozens of other, more complicated ways hash tables can be implemented. If you're interested, you need to research on your own. Since lookup structures are so commonplace in computer science, people have come up with all sorts of crazy optimisations that minimise not only the theoretical number of operations, but also things like CPU cache misses.
[1] That's exactly what's happening in the statement int i = m_buckets[hashCode % m_buckets.Length] - 1
[2] At least the ones using naive chaining are not. There exist hash tables with worst-case constant time complexity. But usually they're worse in practice compared to the theoretically (in regards to time complexity) slower implementations, mainly due to CPU cache misses.
[3] I'm assuming the domain of possible hashes is the set of all ints, so there are 2^32 of them, but everything I wrote generalises to any other non-empty, finite set of values.
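The worst case described in this answer can be provoked on purpose. The sketch below is an illustration, not part of the framework: CollidingComparer and WorstCaseDemo are made-up names, and the comparer deliberately hashes every value to 0 so that all items end up chained in one bucket. It counts how many Equals calls a single failed lookup needs.

```csharp
using System;
using System.Collections.Generic;

// Deliberately bad comparer: every value gets hash code 0,
// so all items collide and share a single bucket chain.
public sealed class CollidingComparer : IEqualityComparer<int>
{
    public int EqualsCalls;

    public bool Equals(int x, int y)
    {
        EqualsCalls++;
        return x == y;
    }

    public int GetHashCode(int obj) => 0;
}

public static class WorstCaseDemo
{
    // Fills a set with 0..n-1 and returns how many Equals calls
    // one lookup of an absent value performs.
    public static int EqualsCallsForMiss(int n)
    {
        var comparer = new CollidingComparer();
        var set = new HashSet<int>(comparer);
        for (int i = 0; i < n; i++)
        {
            set.Add(i);
        }

        comparer.EqualsCalls = 0;   // ignore the comparisons made while adding
        set.Contains(-1);           // absent value: the whole chain is walked
        return comparer.EqualsCalls;
    }
}
```

With a chained implementation like the one quoted in the question, WorstCaseDemo.EqualsCallsForMiss(n) should grow linearly with n, whereas a well-distributed hash keeps the count near zero for a miss.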

In SQLite what is the maximum capacity of TEXT?

According to this answer, TEXT has a maximum capacity of 65535 characters (or 64 KB).
However, I just built a test in which I stored a JSON string taken from a 305 KB JSON file into a TEXT column without problems.
I am wondering if there is some property of TEXT that allows this.
The limit is, by default, 1 billion bytes rather than characters, so the encoding has to be considered. However, it can be changed.
The complete section regarding this is :-
Maximum length of a string or BLOB
The maximum number of bytes in a string or BLOB in SQLite is defined
by the preprocessor macro SQLITE_MAX_LENGTH. The default value of this
macro is 1 billion (1 thousand million or 1,000,000,000). You can
raise or lower this value at compile-time using a command-line option
like this:
-DSQLITE_MAX_LENGTH=123456789
The current implementation will only support a string or BLOB length up to 2^31-1 or 2147483647. And some
built-in functions such as hex() might fail well before that point. In
security-sensitive applications it is best not to try to increase the
maximum string and blob length. In fact, you might do well to lower
the maximum string and blob length to something more in the range of a
few million if that is possible.
During part of SQLite's INSERT and SELECT processing, the complete
content of each row in the database is encoded as a single BLOB. So
the SQLITE_MAX_LENGTH parameter also determines the maximum number of
bytes in a row.
The maximum string or BLOB length can be lowered at run-time using the
sqlite3_limit(db,SQLITE_LIMIT_LENGTH,size) interface.
Limits In SQLite
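Because the limit is counted in bytes, the encoded size of a string is what matters, not its character count. A small sketch, assuming the database uses UTF-8 (SqliteTextSize is a hypothetical helper, and the constant is the default quoted above, which a custom build may change):

```csharp
using System;
using System.Text;

public static class SqliteTextSize
{
    // Default SQLITE_MAX_LENGTH quoted above; a custom build may use another value.
    public const long DefaultMaxLength = 1_000_000_000;

    // Size of the text in bytes when encoded as UTF-8 (assumed database encoding).
    public static int Utf8Bytes(string text) => Encoding.UTF8.GetByteCount(text);

    // True if the encoded text alone stays within the default limit.
    // The limit also applies to the whole row, so other columns count too.
    public static bool FitsDefaultLimit(string text) => Utf8Bytes(text) <= DefaultMaxLength;
}
```

A 305 KB JSON file is far below this default, which matches the observation in the question.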

What is the max number of indexes lucene.net can handle in a document

Lucene does not document the limitations of the storage engine. Does anyone know the max number of indexes allowed per document?
When referring to term numbers, Lucene's current implementation uses a Java int to hold the term index, which means the maximum number of unique terms in any single index segment is ~2.1 billion times the term index interval (default 128) = ~274 billion. This is technically not a limitation of the index file format, just of Lucene's current implementation.
Similarly, Lucene uses a Java int to refer to document numbers, and the index file format uses an Int32 on-disk to store document numbers. This is a limitation of both the index file format and the current implementation. Eventually these should be replaced with either UInt64 values, or better yet, VInt values which have no limit.
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/lucene40/package-summary.html#Limitations
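The term limit quoted above is plain arithmetic, and it needs a 64-bit type to avoid overflow. A minimal sketch (LuceneTermLimit is a hypothetical name, not a Lucene.NET API):

```csharp
public static class LuceneTermLimit
{
    // Default term index interval mentioned in the quoted documentation.
    public const int DefaultTermIndexInterval = 128;

    // Largest Java/C# int times the term index interval, computed as a long.
    public static long MaxUniqueTermsPerSegment(int termIndexInterval = DefaultTermIndexInterval)
    {
        return (long)int.MaxValue * termIndexInterval;
    }
}
```

With the default interval this gives 274,877,906,816, which is the "~274 billion" in the quote.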
As with all types of indexes (Lucene, RDBMS, or otherwise), it is best to index the lowest possible number of fields, because that keeps your index size small and reduces the run-time overhead of reading from the index.
That said, the field count is limited only by your system resources. Fields are identified by their name (case-sensitive) rather than by an arbitrary numeric ID, which typically becomes the limiting factor in these sorts of systems. Theoretical field count limitations are also hard to predict in a system without strict maximum field name lengths, like Lucene.
I've personally used more than 200 analyzed fields across more than 2 billion documents without issue. At the same time, performance for the same index was not what I have come to expect with smaller indexes on a medium-sized Azure VM.

Guid.NewGuid() VS a random string generator from Random.Next()

My colleague and I are debating which of these methods to use for auto-generating user IDs and post IDs for identification in the database:
One option uses a single instance of Random, and takes some useful parameters so it can be reused for all sorts of string-gen cases (e.g. from 4-digit numeric PINs to 20-digit alphanumeric ids). Here's the code:
// This is created once for the lifetime of the server instance
class RandomStringGenerator
{
    public const string ALPHANUMERIC_CAPS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";
    public const string ALPHA_CAPS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    public const string NUMERIC = "1234567890";
    Random rand = new Random();
    public string GetRandomString(int length, params char[] chars)
    {
        string s = "";
        for (int i = 0; i < length; i++)
            s += chars[rand.Next() % chars.Length];
        return s;
    }
}
and the other option is simply to use:
Guid.NewGuid();
see Guid.NewGuid on MSDN
We're both aware that Guid.NewGuid() would work for our needs, but I would rather use the custom method. It does the same thing but with more control.
My colleague thinks that because the custom method has been cooked up ourselves, it's more likely to generate collisions. I'll admit I'm not fully aware of the implementation of Random, but I presume it is just as random as Guid.NewGuid(). A typical usage of the custom method might be:
RandomStringGenerator stringGen = new RandomStringGenerator();
string id = stringGen.GetRandomString(20, RandomStringGenerator.ALPHANUMERIC_CAPS.ToCharArray());
Edit 1:
We are using Azure Tables which doesn't have an auto increment (or similar) feature for generating keys.
Some answers here just tell me to use NewGuid() "because that's what it's made for". I'm looking for a more in depth reason as to why the cooked up method may be more likely to generate collisions given the same degrees of freedom as a Guid.
Edit 2:
We were also using the cooked up method to generate post ID's which, unlike session tokens, need to look pretty for display in the url of our website (like http://mywebsite.com/14983336), so guids are not an option here, however collisions are still to be avoided.
I am looking for a more in depth reason as to why the cooked up method may be more likely to generate collisions given the same degrees of freedom as a Guid.
First, as others have noted, Random is not thread-safe; using it from multiple threads can cause it to corrupt its internal data structures so that it always produces the same sequence.
Second, Random is seeded based on the current time. Two instances of Random created within the same millisecond (recall that a millisecond is several million processor cycles on modern hardware) will have the same seed, and therefore will produce the same sequence.
Third, I lied. Random is not seeded based on the current time; it is seeded based on the amount of time the machine has been active. The seed is a 32 bit number, and since the granularity is in milliseconds, that's only a few weeks until it wraps around. But that's not the problem; the problem is: the time period in which you create that instance of Random is highly likely to be within a few minutes of the machine booting up. Every time you power-cycle a machine, or bring a new machine online in a cluster, there is a small window in which instances of Random are created, and the more that happens, the greater the odds are that you'll get a seed that you had before.
(UPDATE: Newer versions of the .NET framework have mitigated some of these problems; in those versions you no longer have every Random created within the same millisecond have the same seed. However there are still many problems with Random; always remember that it is only pseudo-random, not crypto-strength random. Random is actually very predictable, so if you are relying on unpredictability, it is not suitable.)
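The core of the seeding problem is that a Random instance is fully determined by its seed. A short sketch to illustrate (SeedDemo is a made-up name; the seed is passed explicitly here to stand in for two instances that happened to receive the same time-based seed):

```csharp
using System;
using System.Linq;

public static class SeedDemo
{
    // Returns the first `count` values produced by a Random created with `seed`.
    public static int[] FirstValues(int seed, int count)
    {
        var rng = new Random(seed);
        return Enumerable.Range(0, count).Select(_ => rng.Next()).ToArray();
    }
}
```

SeedDemo.FirstValues(12345, 5) and a second call with the same arguments return identical arrays, so any ids derived from two such instances would collide no matter how long they are.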
As others have said: if you want a primary key for your database then have the database generate you a primary key; let the database do its job. If you want a globally unique identifier then use a guid; that's what they're for.
And finally, if you are interested in learning more about the uses and abuses of guids then you might want to read my "guid guide" series; part one is here:
https://ericlippert.com/2012/04/24/guid-guide-part-one/
As written in other answers, my implementation had a few severe problems:
Thread safety: Random is not thread safe.
Predictability: the method couldn't be used for security critical identifiers like session tokens due to the nature of the Random class.
Collisions: Even though the method created 20 'random' numbers, the probability of a collision is not (number of possible chars)^20 due to the seed value only being 31 bits, and coming from a bad source. Given the same seed, any length of sequence will be the same.
Guid.NewGuid() would be fine, except we don't want to use ugly GUIDs in URLs, and .NET's NewGuid() algorithm is not known to be cryptographically secure for use in session tokens - it might give predictable results if a little information is known.
Here is the code we're using now. It is secure and flexible, and as far as I know it's very unlikely to create collisions if given enough length and character choice:
class RandomStringGenerator
{
    RNGCryptoServiceProvider rand = new RNGCryptoServiceProvider();
    public string GetRandomString(int length, params char[] chars)
    {
        string s = "";
        for (int i = 0; i < length; i++)
        {
            byte[] intBytes = new byte[4];
            rand.GetBytes(intBytes);
            uint randomInt = BitConverter.ToUInt32(intBytes, 0);
            s += chars[randomInt % chars.Length];
        }
        return s;
    }
}
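One caveat with the code above: taking a random 32-bit value modulo chars.Length is very slightly biased whenever the alphabet size does not divide 2^32 evenly. The following is a possible variant, not the author's code. It assumes a runtime where the static RandomNumberGenerator.GetInt32 method is available (.NET Core 3.0 or later), and SecureRandomString is a hypothetical name.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class SecureRandomString
{
    // Builds a string of `length` characters drawn uniformly from `alphabet`
    // using the framework's cryptographic random number generator.
    public static string Generate(int length, string alphabet)
    {
        if (length < 0) throw new ArgumentOutOfRangeException(nameof(length));
        if (string.IsNullOrEmpty(alphabet)) throw new ArgumentException("Alphabet must not be empty.", nameof(alphabet));

        var sb = new StringBuilder(length);
        for (int i = 0; i < length; i++)
        {
            // GetInt32(n) returns a value in [0, n) without modulo bias.
            sb.Append(alphabet[RandomNumberGenerator.GetInt32(alphabet.Length)]);
        }
        return sb.ToString();
    }
}
```

A call such as SecureRandomString.Generate(20, "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890") mirrors the usage shown in the question. The StringBuilder also avoids repeated string concatenation in the loop.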
"Auto generating user ids and post ids for identification in the database"...why not use a database sequence or identity to generate keys?
To me your question is really, "What is the best way to generate a primary key in my database?" If that is the case, you should use the conventional tool of the database which will either be a sequence or identity. These have benefits over generated strings.
Sequences/identity index better. There are numerous articles and blog posts that explain why GUIDs and so forth make poor indexes.
They are guaranteed to be unique within the table
They can be safely generated by concurrent inserts without collision
They are simple to implement
I guess my next question is, for what reasons are you considering GUIDs or generated strings? Will you be integrating across distributed databases? If not, you should ask yourself if you are solving a problem that doesn't exist.
Your custom method has two problems:
It uses a global instance of Random, but doesn't use locking. => Multi threaded access can corrupt its state. After which the output will suck even more than it already does.
It uses a predictable 31 bit seed. This has two consequences:
You can't use it for anything security related where unguessability is important
The small seed (31 bits) can reduce the quality of your numbers. For example, if you create multiple instances of Random at the same time (since system startup), they'll probably create the same sequence of random numbers.
This means you cannot rely on the output of Random being unique, no matter how long it is.
I recommend using a CSPRNG (RNGCryptoServiceProvider) even if you don't need security. Its performance is still acceptable for most uses, and I'd trust the quality of its random numbers over Random. If you want uniqueness, I recommend getting numbers with around 128 bits.
To generate random strings using RNGCryptoServiceProvider you can take a look at my answer to How can I generate random 8 character, alphanumeric strings in C#?.
Nowadays GUIDs returned by Guid.NewGuid() are version 4 GUIDs. They are generated from a PRNG, so they have pretty similar properties to generating a random 122 bit number (the remaining 6 bits are fixed). Its entropy source has much higher quality than what Random uses, but it's not guaranteed to be cryptographically secure.
But the generation algorithm can change at any time, so you can't rely on that. For example in the past the Windows GUID generation algorithm changed from v1 (based on MAC + timestamp) to v4 (random).
Use System.Guid as it:
...can be used across all computers and networks wherever a unique identifier is required.
Note that Random is a pseudo-random number generator. It is not truly random, nor unique. It has only 32-bits of value to work with, compared to the 128-bit GUID.
However, even GUIDs can have collisions (although the chances are really slim), so you should use the database's own features to give you a unique identifier (e.g. the autoincrement ID column). Also, you cannot easily turn a GUID into a 4 or 20 (alpha)numeric number.
Contrary to what some people have said in the comment, a GUID generated by Guid.NewGuid() is NOT dependent on any machine-specific identifier (only type 1 GUIDs are, Guid.NewGuid() returns a type 4 GUID, which is mostly random).
As long as you don't need cryptographic security, the Random class should be good enough, but if you want to be extra safe, use System.Security.Cryptography.RandomNumberGenerator. For the Guid approach, note that not all digits in a GUID are random. Quote from wikipedia:
In the canonical representation, xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx, the most significant bits of N indicates the variant (depending on the variant; one, two or three bits are used). The variant covered by the UUID specification is indicated by the two most significant bits of N being 1 0 (i.e. the hexadecimal N will always be 8, 9, A, or B).
In the variant covered by the UUID specification, there are five versions. For this variant, the four bits of M indicates the UUID version (i.e. the hexadecimal M will either be 1, 2, 3, 4, or 5).
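The fixed digits described in the quote can be read straight from the canonical string form. A small sketch (GuidBits is a hypothetical helper) that extracts the M and N positions:

```csharp
using System;

public static class GuidBits
{
    // Canonical "D" format: xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx (lowercase hex).
    // M is at index 14 and N is at index 19.
    public static char VersionDigit(Guid g) => g.ToString("D")[14];

    public static char VariantDigit(Guid g) => g.ToString("D")[19];
}
```

For a value from Guid.NewGuid(), VersionDigit is expected to be '4' and VariantDigit one of '8', '9', 'a' or 'b', in line with the answers above that describe it as a version 4 GUID.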
Regarding your edit, here is one reason to prefer a GUID over a generated string:
The native storage for a GUID (uniqueidentifier) in SQL Server is 16 bytes. To store an equivalent-length varchar (string), where each "digit" in the id is stored as a character, would require somewhere between 32 and 38 bytes, depending on formatting.
Because of this compact storage, SQL Server can also index a uniqueidentifier column more efficiently than a varchar column.
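The size difference is easy to confirm on the .NET side. This sketch only measures the in-memory representations (GuidSizes is a made-up name); the SQL Server storage figures are as stated above.

```csharp
using System;

public static class GuidSizes
{
    // Raw binary size of a GUID in bytes.
    public static int BinaryLength(Guid g) => g.ToByteArray().Length;

    // Character count of the GUID in a given standard format:
    // "N" = digits only, "D" = with hyphens, "B" = hyphens plus braces.
    public static int TextLength(Guid g, string format) => g.ToString(format).Length;
}
```

For any GUID this gives 16 bytes in binary form, and 32, 36 and 38 characters for the "N", "D" and "B" formats respectively, which is where the 32 to 38 range comes from.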

How much capacity should a static hash table have to minimize collisions?

My program retrieves a finite and complete list of elements I want to refer to by a string ID. I'm using a .Net Dictionary<string, MyClass> to store these elements. I personally have no idea how many elements there will be. It could be a few. It could be thousands.
Given that the program knows exactly how many elements it will be putting in the hash table, what should it specify as the table's capacity? Clearly it should be at least the number of elements it will contain, but using only that number will likely lead to numerous collisions.
Is there a guide to selecting the capacity of a hash table for a known number of elements to balance hash collisions and memory wastage?
EDIT: I'm aware a hash table's size can change. What I'm avoiding first and foremost is leaving it with the default allocation, then immediately adding thousands of elements causing countless resize operations. I won't be adding or removing elements once it's populated. If I know what's going in, I can ensure there's sufficient space upfront. My question relates to the balance of hash collisions versus memory wastage.
Your question seems to imply a false assumption, namely that the dictionary's capacity is fixed. It isn't.
If you know in any given case that a dictionary will hold at least some number of elements, then you can specify that number as the dictionary's initial capacity. The dictionary's capacity is always at least as large as its item count (this is true for .NET 2 through 4, at least; I believe this is an undocumented implementation detail that's subject to change).
Specifying the initial capacity reduces the number of memory allocations by eliminating those that would have occurred as the dictionary grows from its default initial capacity to the capacity you have chosen.
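In code, pre-sizing is just a matter of passing the known count to the constructor. A minimal sketch, assuming the items are already available as a collection with a known Count (PreSized and Build are hypothetical names):

```csharp
using System.Collections.Generic;

public static class PreSized
{
    // Creates the dictionary with room for every item up front,
    // so it does not have to grow while being filled.
    public static Dictionary<string, TValue> Build<TValue>(
        IReadOnlyCollection<KeyValuePair<string, TValue>> items)
    {
        var map = new Dictionary<string, TValue>(items.Count);
        foreach (var pair in items)
        {
            map.Add(pair.Key, pair.Value);
        }
        return map;
    }
}
```

This only avoids the intermediate resizes. It does not change how keys are distributed across buckets.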
If the hash function in use is well chosen, the number of collisions should be relatively small and should have a minimal impact on performance. Specifying an over-large capacity might help in some contrived situations, but I would definitely not give this any thought unless profiling showed that the dictionary's lookups were having a significant impact on performance.
(As an example of a contrived situation, consider a dictionary with int keys with a capacity of 10007, all of whose keys are a multiple of 10007. With the current implementation, all of the items would be stored in a single bucket, because the bucket is chosen by dividing the hash code by the capacity and taking the remainder. In this case, the dictionary would function as a linked list, and forcing it to use a different capacity would fix that.)
This is a bit of a subjective question, but let me try my best to answer it (from the perspective of CLR 2.0 only, as I have not yet explored whether there have been any changes to the dictionary in CLR 4.0).
You are using a dictionary keyed on string. Since there are infinitely many possible strings, it is reasonable to assume that every possible hash code is 'equally likely'. In other words, each of the 2^32 hash codes (the range of int) is equally likely for the string class. The current version of Dictionary in the BCL drops the 32nd bit from any 32-bit hash code thus obtained, to essentially get a 31-bit hash code. Hence the range we are dealing with is 2^31 unique, equally likely hash codes.
Note that the range of the hash codes does not depend on the number of elements the dictionary contains or can contain.
The Dictionary class will use this hash code to allocate a bucket to the 'MyClass' object. So essentially, if two different strings return the same 31 bits of hash code (assuming the BCL designers have chosen the string hash function wisely, such instances should be fairly spread out), both will be allocated the same bucket. In such a hash collision, nothing can be done.
Now, in the current implementation of the Dictionary class, it may happen that even different hash codes (again 31-bit) still end up in the same bucket. The bucket index is identified as follows:
hash = <31 bit hash code>
pr = <least prime number greater than or equal to current dictionary capacity>
bucket_index = hash modulus pr
Hence every hash code of the form (pr*factor + bucket_index) will end up in the same bucket, irrespective of the factor part.
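The bucket formula above can be turned into a few lines of code. This is only a sketch of the description in this answer, not the BCL source: BucketIndexSketch is a hypothetical name, and the prime is found by trial division purely for illustration.

```csharp
using System;

public static class BucketIndexSketch
{
    // Smallest prime greater than or equal to n (trial division, illustration only).
    public static int NextPrime(int n)
    {
        for (int candidate = Math.Max(n, 2); ; candidate++)
        {
            if (IsPrime(candidate)) return candidate;
        }
    }

    private static bool IsPrime(int value)
    {
        if (value < 2) return false;
        for (int d = 2; (long)d * d <= value; d++)
        {
            if (value % d == 0) return false;
        }
        return true;
    }

    // hash = 31-bit hash code, pr = least prime >= capacity, bucket_index = hash mod pr.
    public static int BucketIndex(int hashCode, int capacity)
    {
        int hash = hashCode & 0x7FFFFFFF; // drop the top bit to get a 31-bit hash code
        int pr = NextPrime(capacity);
        return hash % pr;
    }
}
```

With a capacity of 10 the prime is 11, so the hash codes 3, 14 and 25 all map to bucket 3, which is the (pr*factor + bucket_index) pattern described above.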
If you want to be absolutely sure that all different possible 31-bit hash codes end up in different buckets, the only way is to force pr to be greater than or equal to the largest possible 31-bit hash code. In other words, ensure that every hash code is of the form (pr*0 + hash_code), i.e. pr should be greater than 2^31. By extension, this means that the dictionary capacity should be at least 2^31.
Note that the capacity required to minimize hash collisions does not depend at all on the number of elements you want to store in the dictionary, but on the range of the possible hash codes.
As you can imagine, 2^31 is a huge memory allocation. In fact, if you try to specify 2^31 as the capacity, there will be two arrays of length 2^31. Consider that on a 32-bit machine the highest possible RAM address is 2^32!
If, for some reason, the default behavior of the dictionary is not acceptable to you and it is critical for you to minimize hash collisions (or rather, I would say, bucket collisions), the only hope you have is to provide your own hash code (i.e. you cannot use string as the key). Such a hash code should keep the formula for obtaining the bucket index in mind and strive to minimize the range of possible hash codes. The simplest approach is to incrementally assign a number/index to your unique MyClass instances and use this number as your hash code. Then you can specify the total number of MyClass instances as the dictionary capacity. In such a case, though, an array can easily be maintained instead of a dictionary, as you know the 'index' of the object and the index is incremental.
In the end, I would like to reiterate what others have said: 'there will not be countless resizes'. Dictionary doubles its capacity (rounded up to the nearest prime number greater than or equal to the new capacity) each time it finds itself short of space. In order to save some processing, you can very well set the capacity to the number of MyClass instances you have, as in any case the dictionary will require this much capacity to store the instances. This will not minimize 'hash collisions', but under normal circumstances it will be fast enough.
Data structures like Hashtable are meant for dynamic memory allocation. You can, however, specify the initial size for some structures. But when you add new elements, they will expand in size. There is no way to restrict the size implicitly.
There are many data structures available, each with its own advantages and disadvantages. You need to select the best one. Limiting the size does not affect the performance. It is the Add, Delete and Search operations that make the difference in performance, so those are what you need to take care of.
