I have about 100,000 sentences in a List<string>.
I'm trying to split each of these sentences into words and add everything into a List<List<string>>, where each inner List contains the words of one sentence. I'm doing that because I have to do different work on each individual word. What would be the size difference in memory between just a List<string> of sentences and a List<List<string>> of words?
One of these will be stored in memory eventually, so I'm looking for the memory impact of splitting each sentence vs. just keeping the string.
We'll start with your List<string>. I'm going to assume the 64-bit runtime. Numbers for the 32-bit runtime are slightly smaller.
The List itself requires about 32 bytes (allocation overhead, plus internal variables), plus the backing array of strings. The array overhead is 50 bytes, and you need 8 bytes per string for the references. So if you have 100,000 sentences, you'll need at minimum 800,000 bytes for the array.
The strings themselves require something like 26 bytes each, plus two bytes per character. So if your average sentence is 80 characters, you need 186 bytes per string. Multiplied by 100K strings, that's about 18.6 megabytes. Altogether, your list of sentences will take around 20 MB (round number).
If you split the sentences into words, you now have 100,000 List<string> instances. That's about 5 megabytes just for the List<List<string>>. If we assume 10 words per sentence, then each sentence's list will require about 80 bytes for the backing array, plus 26 bytes per string (a total of about 260 bytes), plus the string data itself (10 words of 8 characters at 2 bytes each, or 160 bytes total). So each sentence costs you (again, round numbers) 80 + 260 + 160, or 500 bytes. Multiplied by 100,000 sentences, that's 50 MB.
So, very rough numbers, splitting your sentences into a List<List<string>> will occupy 55 or 60 megabytes.
So, first off, we'll compare the difference in memory between a single string and two strings which, if concatenated together, would result in the first:
string first = "ab";
string second = "a";
string third = "b";
How much memory does first use compared to second and third together? Well, the actual characters that they need to reference are the same, but every single string object has a small overhead (14 bytes on a 32-bit system, 26 bytes on a 64-bit system).
So for each string that you break up into a List<string> of smaller strings, there is a 14 * (wordsPerSentence - 1) byte overhead.
Then there is the overhead for the list itself. The list will consume one word of memory (32 bits on a 32 bit system, 64 on a 64 bit system, etc.) for each item added to the list plus the overhead of a List<string> itself (which is 24 bytes on a 32 bit system).
So for that you need to add (on a 32-bit system) (24 + (4 * averageWordsPerSentence)) * numberOfSentences bytes of memory.
Unfortunately, this isn't a question that can be answered very easily -- it depends on the particular strings, and what lengths you're willing to go to in order to optimize.
For example, take a look at the String.Intern() method. If you intern all the words, it's possible that the collection of words will require less memory than the collection of sentences. It would depend on the contents. There are other implications to interning, though, so that might not be the best idea. Again, it would depend on the particulars of the situation -- check the "Performance Considerations" section of the doc page I linked.
I think the best thing to do is to use GC.GetTotalMemory(true) before and after your operation to get a rough idea of how much memory is actually being used.
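For example, a minimal sketch of that measurement approach (the sentence data here is synthetic, just to make the snippet self-contained):

using System;
using System.Collections.Generic;
using System.Linq;

class MeasureSplit
{
    static void Main()
    {
        // Synthetic stand-in for the real 100,000 sentences.
        List<string> sentences = Enumerable.Range(0, 100000)
            .Select(i => string.Join(" ", Enumerable.Repeat("word" + (i % 10), 10)))
            .ToList();

        long before = GC.GetTotalMemory(true);

        // The operation whose footprint we want to estimate.
        List<List<string>> words = sentences.Select(s => s.Split(' ').ToList()).ToList();

        long after = GC.GetTotalMemory(true);
        Console.WriteLine("Approx. extra memory: {0:F1} MB", (after - before) / (1024.0 * 1024.0));

        GC.KeepAlive(words); // make sure the result isn't collected before we measure
    }
}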
When implementing run-length encoding (RLE), can I assume that the run lengths will always fit in one byte?
So there will not be a situation where there is a run like this
WWWBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB...
Where there are 256 B's because you cannot represent that length in one byte whereas you can represent the W's as 3W
If not, should the Run be split into two Runs? How should this situation be handled? I couldn't find any information about this case.
As I understand it, you have the situation right. The word length used for counting the repetition of a character is usually a byte, and the individual characters are usually also encoded as bytes. If the input contains a repetition of e.g. 300 b's, the encoding will be as follows.
255 (number of repetitions of the next character)
98 (ASCII value for b)
45 (number of repetitions of the next character)
98 (ASCII value for b)
In total, a run of length larger than 255 will have to be split into two (or more) runs. That being said, the actual encoding depends on the specific implementation; it is also possible to use types other than bytes for counting the repetition of characters.
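A minimal encoder sketch along those lines (byte counts and byte values, with each run capped at 255) might look like this:

using System;
using System.Collections.Generic;
using System.Text;

static class Rle
{
    // Encodes data as (count, value) byte pairs, capping each run at 255
    // so the count always fits in a single byte.
    public static byte[] Encode(byte[] input)
    {
        var output = new List<byte>();
        int i = 0;
        while (i < input.Length)
        {
            byte value = input[i];
            int run = 1;
            while (i + run < input.Length && input[i + run] == value && run < 255)
                run++;
            output.Add((byte)run);
            output.Add(value);
            i += run;
        }
        return output.ToArray();
    }

    static void Main()
    {
        // 3 W's followed by 300 b's encodes as (3, 'W'), (255, 'b'), (45, 'b'),
        // exactly as in the example above.
        byte[] data = Encoding.ASCII.GetBytes(new string('W', 3) + new string('b', 300));
        Console.WriteLine(string.Join(" ", Encode(data)));
    }
}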
Given a list of 256 numbers in order (0-255), I want to express a subset of 128 numbers from that list. Every number will be unique and not repeated.
What is the most compact way to express this subset?
What I've come up with so far is having a 256 length bit-array and setting the appropriate indexes to 1. This method obviously requires 256 bits to represent the 128 values but is there a different, more space-saving way?
Thanks!
There are 256! / (128! * (256 - 128)!) unique combinations of 128 elements from a set of 256 items, when order does not matter (see the wiki article about combinations).
If you calculate that number and take its base-2 logarithm, you will find that it's about 251.6. That means you need at least 252 bits to represent a unique selection of 128 items out of 256. Since .NET cannot address individual bits anyway (only whole bytes), there is little reason to actually work out such an encoding: 252 bits rounds up to the same 32 bytes that your 256-bit array already occupies.
128 is the worst case in that regard. If you were selecting, say, 5 elements (or 251) out of 256, that could be represented in 34 bits, and it would have been worthwhile to look for that kind of compact representation.
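If you want to check the 251.6 figure yourself, one way is to compute the binomial coefficient with BigInteger and take its base-2 logarithm (a quick sketch):

using System;
using System.Numerics;

class SubsetBits
{
    static void Main()
    {
        // C(256, 128), computed with the usual multiplicative formula.
        BigInteger combinations = 1;
        for (int k = 0; k < 128; k++)
            combinations = combinations * (256 - k) / (k + 1);

        // Minimum number of bits needed to index one of those combinations.
        Console.WriteLine(BigInteger.Log(combinations, 2)); // prints roughly 251.67
    }
}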
Since you don't care about the order of the subset nor do you care about restoring each element to its position in the original array, this is simply a case of producing a random subset of an array, which is similar to drawing cards from a deck.
To take unique elements from an array, you can simply shuffle the source array and then take a number of elements at the first X indices:
int[] srcArray = Enumerable.Range(0, 256).ToArray();
Random r = new Random();
var subset = srcArray.OrderBy(i => r.Next()).Take(128).ToArray();
Note: I use the above randomizing method to keep the example concise. For a more robust shuffling approach, I recommend the Fisher-Yates algorithm as described in this post.
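For reference, a Fisher-Yates version of the same idea might look like the following sketch (a partial shuffle is enough, since only the first 128 positions are used):

using System;
using System.Linq;

class ShuffleSubset
{
    static void Main()
    {
        int[] srcArray = Enumerable.Range(0, 256).ToArray();
        Random r = new Random();

        // Partial Fisher-Yates shuffle: fix the first 128 positions,
        // each time swapping in a random element from the unshuffled tail.
        for (int i = 0; i < 128; i++)
        {
            int j = r.Next(i, srcArray.Length);
            int tmp = srcArray[i];
            srcArray[i] = srcArray[j];
            srcArray[j] = tmp;
        }

        int[] subset = srcArray.Take(128).ToArray();
        Console.WriteLine(string.Join(", ", subset));
    }
}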
Any ideas or implementations floating about for encoding the current date including the milliseconds into the shortest possible string length?
e.g. I want 31/10/2011 10:41:45 in the shortest string possible (ideally 5 characters) - obviously decodable.
If it is impossible to get down to 5 characters, then the year is optional.
edit: it doesn't actually need to be decodable. It just needs to be a unique string.
A time_t is 31 bits. Add 10 bits for up to 1000 milliseconds: that's 41 bits. You want 5 characters: that's 8 bits for each of the first 4 characters + 9 bits for the last one.
Using Chinese ideograms, you should easily be able to find a range of 256 consecutive chars for each of the 1st 4 chars and a range of 512 for the last one.
Needless to say your encoded date will look... chinese! But it should do the trick ;-)
BTW, you don't have to stick to Chinese. You might even want to choose a different Unicode 256 chars range for each character. Of course, you'll want to find sequences of 256/512 printable chars.
Now let's say we skip the year. We're down to 86400 x 366 seconds per year = 31622400 seconds. Including milliseconds: 31622400000. That's 35 bits. Great: we're down to 7 bits per character. Easy! :-)
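As a rough sketch of that idea (dropping the year, and arbitrarily using 128 consecutive CJK code points starting at U+4E00 so that each of the 5 characters carries 7 bits):

using System;

class FiveCharTimestamp
{
    const int Base = 0x4E00; // arbitrary block of 128 consecutive printable code points

    // Milliseconds since the start of the year fit in 35 bits = 5 chars x 7 bits.
    public static string Encode(DateTime dt)
    {
        long msOfYear = (long)(dt - new DateTime(dt.Year, 1, 1)).TotalMilliseconds;
        var chars = new char[5];
        for (int i = 0; i < 5; i++)
            chars[i] = (char)(Base + (int)((msOfYear >> (7 * i)) & 0x7F));
        return new string(chars);
    }

    public static DateTime Decode(string s, int year)
    {
        long msOfYear = 0;
        for (int i = 0; i < 5; i++)
            msOfYear |= (long)(s[i] - Base) << (7 * i);
        return new DateTime(year, 1, 1).AddMilliseconds(msOfYear);
    }

    static void Main()
    {
        DateTime now = DateTime.Now;
        string code = Encode(now);
        Console.WriteLine(code + " -> " + Decode(code, now.Year));
    }
}

The year has to be supplied (or assumed) when decoding, which matches the "year is optional" relaxation in the question.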
You can use the Ticks:
var ticks = System.DateTime.Now.Ticks;
This is a 64-bit number. You get the time back by calling:
var timeBack = new System.DateTime(ticks);
Of course, this is 8 bytes, but I don't think you can get it more compact (easily).
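If you do need those 8 bytes as text, one option (just an illustration, not part of the original answer) is to Base64-encode them, which gives a 12-character string:

var ticks = System.DateTime.Now.Ticks;

// 8 bytes -> 12 Base64 characters (the last one is "=" padding).
string encoded = System.Convert.ToBase64String(System.BitConverter.GetBytes(ticks));

long ticksBack = System.BitConverter.ToInt64(System.Convert.FromBase64String(encoded), 0);
var timeBack = new System.DateTime(ticksBack);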
No can do: the total ms in a year (365 days) is 31,536,000,000 (= 365*24*60*60*1000). You need 34.87628063 bits of information to store that value (log2 31,536,000,000). You probably meant "printable characters", BUT you would need 7 bits/character to store 35 bits in 5 characters. As an example, Base64 carries 6 bits/character of information, so you would need 6 characters. Ascii85 would be a little better, but you would still need around 5.5 characters, so 6 characters.
Clearly if you meant 5 BYTES, everything changes. You can store 34.84 years (in ms) in that space.
And if you meant 5 C# PRINTABLE AND UNPRINTABLE CHARACTERS (each C# character is 16 bits), then it's even better: 10 bytes! DateTime in C# is only 8 bytes, and it uses ticks (one tick is 100 nanoseconds, a very small fraction of a second)!
BUT if you meant 5 C# PRINTABLE CHARACTERS, then use Serge's response. It's very good and shows us that the world is a big place (and why good questions are so important: they let us see the world in new ways).
You can use ASCII characters to represent the numbers and drop the formatting, for example:
31/10/2011 10:41:45
*/*/** *:*:*
*******
That's 7; you can drop 2 if you don't want to include the full year. Obviously the * are actual characters standing in for a number: A could be 1, etc., or you could even use the raw ASCII codes.
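A sketch of that scheme, mapping each component to one character (two for the year, since 2011 does not fit in a single byte); note the resulting characters are mostly unprintable control codes:

using System;

class PackedDate
{
    // Seven characters: day, month, year (2 chars), hour, minute, second.
    public static string Encode(DateTime dt)
    {
        return new string(new[]
        {
            (char)dt.Day,
            (char)dt.Month,
            (char)(dt.Year / 100),
            (char)(dt.Year % 100),
            (char)dt.Hour,
            (char)dt.Minute,
            (char)dt.Second
        });
    }

    public static DateTime Decode(string s)
    {
        return new DateTime(s[2] * 100 + s[3], s[1], s[0], s[4], s[5], s[6]);
    }
}

Dropping the two year characters gets you to 5, as noted above.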
How many bits is a .NET string that's 10 characters in length? (.NET strings are UTF-16, right?)
On 32-bit systems:
4 bytes = Type pointer (Every object has one of these)
4 bytes = Lock (One of these too!)
4 bytes = Length (Need the length)
2 * Length bytes = Data (And the chars themselves)
=======================
12 + 2*Length bytes
=======================
96 + 16*Length bits
So 10 chars would = 256 bits = 32 bytes
I am not sure if the Lock grows to 64-bit on 64-bit systems. I kinda hope not, but you never know. The 64-bit structure overhead is therefore anywhere from 16-20 bytes (as opposed to the 12 bytes on 32-bit).
Every char in the string is two bytes in size, so if you are just converting the chars directly and not using any particular encoding, the answer is string.Length * 2 * 8
otherwise the result depends on the encoding, you can write:
int numbits = System.Text.Encoding.UTF8.GetByteCount(str)*8; //returns 80
or
int numbits = System.Text.Encoding.Unicode.GetByteCount(str)*8; //returns 160
If you are talking pure UTF-16, then:
10 characters = 20 bytes = 160 bits
This really needs context in order to be answered properly.
It all comes down to how you define character and how you store the data.
For example, if you define character as a single letter from the user's point of view, it can be more than 2 bytes: this character, Å, can be two Unicode code points (U+0041 U+030A, Latin Capital A + Combining Ring Above), so it will require two .NET chars, or 4 bytes in UTF-16.
Now, even if you are talking about 10 .NET Char elements, then if the string is in memory you have some object overhead (that was already mentioned) and a bit of alignment overhead (on a 32-bit system everything has to be aligned to a 4-byte boundary; in 64-bit the rules are more complicated), so you may have some empty padding bytes at the end.
If you are talking about a database or files, then each database and file system has its own overhead.
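A quick way to see the combining-character point in practice (the Å built from U+0041 + U+030A mentioned above):

using System;
using System.Text;

class CharSizes
{
    static void Main()
    {
        string a = "A\u030A"; // LATIN CAPITAL LETTER A + COMBINING RING ABOVE, renders as Å

        Console.WriteLine(a.Length);                         // 2 .NET chars
        Console.WriteLine(Encoding.Unicode.GetByteCount(a)); // 4 bytes in UTF-16
        Console.WriteLine(Encoding.UTF8.GetByteCount(a));    // 3 bytes in UTF-8
    }
}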
Using only the standard English letters and underscore, how many characters can a string have, at maximum, without causing a potential collision in a hashtable/dictionary?
So strings like:
blur
Blur
b
Blur_The_Shades_Slightly_With_A_Tint_Of_Blue
...
There's no guarantee that you won't get a collision between single letters.
You probably won't, but the algorithm used in string.GetHashCode isn't specified, and could change. (In particular it changed between .NET 1.1 and .NET 2.0, which burned people who assumed it wouldn't change.)
Note that hash code collisions won't stop well-designed hashtables from working - you should still be able to get the right values out, it'll just potentially need to check more than one key using equality if they've got the same hash code.
Any dictionary which relies on hash codes being unique is missing important information about hash codes, IMO :) (Unless it's operating under very specific conditions where it absolutely knows they'll be unique, i.e. it's using a perfect hash function.)
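To illustrate that point, here is a contrived key type whose GetHashCode always returns the same value; a Dictionary still returns the right results, it just has to fall back on Equals within the single bucket (purely an illustrative sketch):

using System;
using System.Collections.Generic;

class CollidingKey : IEquatable<CollidingKey>
{
    public string Value { get; private set; }
    public CollidingKey(string value) { Value = value; }

    public override int GetHashCode() { return 42; } // every key collides
    public bool Equals(CollidingKey other) { return other != null && Value == other.Value; }
    public override bool Equals(object obj) { return Equals(obj as CollidingKey); }
}

class Program
{
    static void Main()
    {
        var dict = new Dictionary<CollidingKey, int>();
        dict[new CollidingKey("blur")] = 1;
        dict[new CollidingKey("Blur")] = 2;
        dict[new CollidingKey("b")] = 3;

        // Correct value despite every hash code being identical.
        Console.WriteLine(dict[new CollidingKey("Blur")]); // prints 2
    }
}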
Given a perfect hashing function (which you're not typically going to have, as others have mentioned), you can find the maximum possible number of characters that guarantees no two strings will produce a collision, as follows:
No. of unique hash codes available = 2 ^ 32 = 4294967296 (assuming a 32-bit integer is used for hash codes)
Size of character set = 2 * 26 + 1 = 53 (26 lower case plus 26 upper case letters in the Latin alphabet, plus underscore)
Then you must consider that a string of length l (or less) has a total of 54 ^ l representations. Note that the base is 54 rather than 53 because the string can terminate after any character, adding an extra possibility per char - not that it greatly affects the result.
Taking the no. of unique hash codes as your maximum number of string representations, you get the following simple equation:
54 ^ l = 2 ^ 32
And solving it:
log2 (54 ^ l) = 32
l * log2 54 = 32
l = 32 / log2 54 = 5.56
(Where log2 is the logarithm function of base 2.)
Since string lengths clearly can't be fractional, you take the integral part to give a maximum length of just 5. Very short indeed, but observe that this restriction would prevent even the remotest chance of a collision given a perfect hash function.
This is largely theoretical, however, as I've mentioned, and I'm not sure how much use it might be in the design of anything. That said, hopefully it helps you understand the matter from a theoretical viewpoint, on top of which you can add the practical considerations (e.g. non-perfect hash functions, non-uniform distributions).
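The arithmetic above is trivial to check in code, for what it's worth:

double maxLength = 32 / Math.Log(54, 2); // log2(54) is about 5.75
Console.WriteLine(maxLength);            // about 5.56, truncated to 5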
Universal Hashing
To calculate the probability of collisions when hashing S strings of length L, with W bits per character, to a hash of length H bits, assuming an optimal universal hash (1), you can calculate the collision probability based on a hash table of size (number of buckets) N.
First things first, we can assume an ideal hashtable implementation (2) that splits the H bits of the hash perfectly across the available buckets N (3). This means H becomes meaningless except as a limit on N.
W and L are simply the basis for an upper bound on S. For simpler maths, assume that strings shorter than L are padded to L with a special null character. If we were interested in the worst case, this would be 54^L strings (26*2 + '_' + the null character), which is plainly a ludicrous number; the actual number of entries is more useful than the character set and the length, so we will simply treat S as a variable in its own right.
We are left trying to put S items into N buckets.
This then becomes a very well known problem: the birthday paradox.
Solving this for various probabilities and numbers of buckets is instructive, but assuming we have 1 billion buckets (so about 4 GB of memory on a 32-bit system), we would need only about 37K entries before we hit a 50% chance of there being at least one collision. Given that, trying to avoid any collisions in a hashtable becomes plainly absurd.
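That 37K figure is just the usual birthday-bound approximation, n ≈ sqrt(2 N ln 2); a quick check (sketch):

double buckets = 1e9;

// Entries at which the probability of at least one collision reaches ~50%.
double entriesFor50Percent = Math.Sqrt(2 * buckets * Math.Log(2));
Console.WriteLine(entriesFor50Percent); // roughly 37,000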
All this does not mean that we should not care about the behaviour of our hash functions. Clearly these numbers assume ideal implementations; they are an upper bound on how good we can get. A poor hash function can produce far more collisions in some areas and waste some of the possible 'space' by never or rarely using it, all of which can make the hashing less than optimal and even degrade performance to something that looks like a list, but with much worse constant factors.
The .NET framework's implementation of the string's hash function is not great (in that it could be better) but is probably acceptable for the vast majority of users and is reasonably efficient to calculate.
An Alternative Approach: Perfect Hashing
If you wish, you can generate what are known as perfect hashes. This requires full knowledge of the input values in advance, however, so it is not often useful. In a similar vein to the maths above, we can show that even perfect hashing has its limits:
Recall the limit of 54 ^ L strings of length L. However, we only have H bits (we shall assume 32), which is about 4 billion different numbers. So if you can have truly any string, and any number of them, then you have to satisfy:
54 ^ L <= 2 ^ 32
And solving it:
log2 (54 ^ L) <= 32
L * log2 54 <= 32
L <= 32 / log2 54 = 5.56
Since string lengths clearly can't be fractional, you are left with a maximum length of just 5. Very short indeed.
If you know that you will only ever have a set of strings well below 4 billion in size, then perfect hashing would let you handle any value of L. However, restricting the set of values can be very hard in practice, and you must know them all in advance, or else degrade to what amounts to a database of string -> hash and add to it as new strings are encountered.
For this exercise the universal hash is optimal, since we wish to minimise the probability of any collision; i.e. for any input, the probability of it hashing to a particular output x out of R possibilities is 1/R.
Note that doing an optimal job on the hashing (and the internal bucketing) is quite hard but that you should expect the built in types to be reasonable if not always ideal.
In this example I have avoided the question of closed vs. open addressing. This does have some bearing on the probabilities involved, but not a significant one.
A hash algorithm isn't supposed to guarantee uniqueness. Given that there are far more potential strings (26^n for length n, even ignoring special chars, spaces, capitalization, non-English chars, etc.) than there are places in your hashtable, there's no way such a guarantee could be fulfilled. It's only supposed to guarantee a good distribution.
If your key is a string (e.g., in a Dictionary with string keys) then its GetHashCode() will be used. That's a 32-bit integer. Hashtable defaults to a load factor of 1.0 (one entry per bucket on average) and increases the number of buckets to maintain that load factor. So if you do see collisions, they should tend to occur around reallocation boundaries (and decrease shortly after reallocation).