C# Dictionary Memory Management

I have a Dictionary<string,int> that may contain upwards of 10 million unique keys. I am trying to reduce the amount of memory this takes, while still maintaining the functionality of the dictionary.
I had the idea of storing a hash of the string as a long instead; this decreases the app's memory usage to an acceptable amount (~1.5 GB down to ~0.5 GB), but I don't feel very good about my method for doing this.
long longKey = BitConverter.ToInt64(cryptoTransformSHA1.ComputeHash(enc.GetBytes(strKey)), 0);
Basically this chops off the end of a SHA1 hash and puts the first chunk of it into a long, which I then use as a key. While this works, at least for the data I'm testing with, I don't feel like this is a very reliable solution due to the increased possibility of key collisions.
Are there any other ways of reducing the Dictionary's memory footprint, or is the method I have above not as horrible as I think it is?
[edit]
To clarify, I need to maintain the ability to look up a value in the Dictionary using a string. Storing the actual string in the dictionary takes way too much memory. What I would like to do instead is use a Dictionary<long,int> where the long is the result of a hashing function on the string.
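For illustration, a minimal sketch of the Dictionary<long,int> pattern being described (Hash64 here is just a placeholder for whatever 64-bit string hash ends up being chosen, and collisions still have to be accepted or handled):

    using System.Collections.Generic;

    Dictionary<long, int> map = new Dictionary<long, int>();
    string strKey = "example";

    // Store: key the entry by a 64-bit hash of the string, not the string itself.
    map[Hash64(strKey)] = 42;

    // Lookup: re-hash the incoming string and probe the dictionary.
    int found;
    if (map.TryGetValue(Hash64(strKey), out found))
    {
        // 'found' holds the stored value (barring a hash collision)
    }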

So I have done something similar recently, and for a certain set of reasons that are fairly unique to my application I did not use a database. In fact I was trying to stop using a database. I have found that GetHashCode is significantly improved in 3.5. One important note: NEVER PERSIST THE RESULTS OF GetHashCode. NEVER EVER. They are not guaranteed to be consistent between versions of the framework.
So you really need to conduct an analysis of your data, since different hash functions might work better or worse on it. You also need to account for speed. As a general rule, cryptographic hash functions should not have many collisions even as the number of hashes moves into the billions. For things that I need to be unique I typically use SHA1Managed. In general the CryptoAPI has terrible performance, even when the underlying hash functions perform well.
For a 64-bit hash I currently use Lookup3 and FNV1, which are both 32-bit hashes, together. For a collision to occur, both would need to collide, which is mathematically improbable and which I have not seen happen over about 100 million hashes. Code for both is publicly available on the web.
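A minimal sketch of the idea of packing two independent 32-bit hashes into one 64-bit key. FNV-1a is shown because it is short; Lookup3 is omitted for brevity, and a second simple hash (djb2) stands in for it here purely for illustration - substitute whichever pair you have actually validated on your own data:

    static class StringHash64
    {
        // FNV-1a, 32-bit (standard offset basis and prime).
        static uint Fnv1a32(string s)
        {
            uint hash = 2166136261u;
            foreach (char c in s)
            {
                hash ^= c;
                hash *= 16777619u;
            }
            return hash;
        }

        // djb2, 32-bit - used here only as an illustrative second hash.
        static uint Djb2(string s)
        {
            uint hash = 5381;
            foreach (char c in s)
                hash = hash * 33u + c;
            return hash;
        }

        // Pack the two 32-bit results into a single 64-bit dictionary key.
        public static long Hash64(string s)
        {
            return ((long)Fnv1a32(s) << 32) | Djb2(s);
        }
    }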
Still, conduct your own analysis. What has worked for me may not work for you. Inside my office, different applications with different requirements use different hash functions or combinations of hash functions.
I would avoid any unproven hash functions. There are as many hash functions as people who think they should be writing them. Do your research and test, test, test.

With 10 million-odd records, have you considered using a database with a non-clustered index? Databases have a lot more tricks up their sleeve for this type of thing.
Hashing, by definition, and under any algorithm, has the potential of collisions - especially with high volumes. Depending on the scenario, I'd be very cautious of this.
Using the strings might take space, but it is reliable... if you are on x64 this needn't be too large (although it definitely counts as "big" ;-p)

By the way, cryptographic hashes / hash functions are exceptionally bad for dictionaries. They’re big and slow. By solving the one problem (size) you’ve only introduced another, more severe problem: the function won’t spread the input evenly any longer, thus destroying the single most important property of a good hash for approaching collision-free addressing (as you seem to have noticed yourself).
/EDIT: As Andrew has noted, GetHashCode is the solution for this problem, since that's its intended use. And as in a real dictionary, you will have to work around collisions. One of the best schemes for that is double hashing. Unfortunately, the only 100% reliable way is to actually store the original values; otherwise you'd have created infinite compression, which we know can't exist.

Why don't you just use GetHashCode() to get a hash of the string?

With hashtable implementations I have worked with in the past, the hash brings you to a bucket, which is often a linked list of other objects that have the same hash. Hashes are not unique, but they are good enough to split your data up into very manageable lists (sometimes only 2 or 3 long) that you can then search through to find your actual item.
The key to a good hash is not its uniqueness, but its speed and distribution capabilities... you want it to distribute as evenly as possible.
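A bare-bones illustration of that bucket-and-chain idea (a sketch only, not how Dictionary<TKey,TValue> is actually implemented internally): the hash merely picks the bucket, and colliding keys end up in the same short list, which is scanned linearly using full-key equality.

    using System.Collections.Generic;

    class ToyHashTable<TKey, TValue>
    {
        private readonly List<KeyValuePair<TKey, TValue>>[] _buckets;

        public ToyHashTable(int bucketCount)
        {
            _buckets = new List<KeyValuePair<TKey, TValue>>[bucketCount];
        }

        private List<KeyValuePair<TKey, TValue>> Bucket(TKey key)
        {
            int index = (key.GetHashCode() & 0x7FFFFFFF) % _buckets.Length;
            return _buckets[index] ?? (_buckets[index] = new List<KeyValuePair<TKey, TValue>>());
        }

        public void Add(TKey key, TValue value)
        {
            Bucket(key).Add(new KeyValuePair<TKey, TValue>(key, value));
        }

        public bool TryGetValue(TKey key, out TValue value)
        {
            foreach (var pair in Bucket(key))          // usually only a handful of entries
            {
                if (EqualityComparer<TKey>.Default.Equals(pair.Key, key))
                {
                    value = pair.Value;
                    return true;
                }
            }
            value = default(TValue);
            return false;
        }
    }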

Just go get SQLite. You're not likely to beat it, and even if you do, it probably won't be worth the time/effort/complexity.
SQLite.
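For reference, a hedged sketch of what that could look like with the Microsoft.Data.Sqlite package (one of several SQLite wrappers for .NET; the table and file names here are made up). The string keys never have to live in managed memory, and the primary-key index handles the lookup:

    using Microsoft.Data.Sqlite;

    using (var conn = new SqliteConnection("Data Source=keys.db"))
    {
        conn.Open();

        var create = conn.CreateCommand();
        create.CommandText = "CREATE TABLE IF NOT EXISTS map (key TEXT PRIMARY KEY, value INTEGER)";
        create.ExecuteNonQuery();

        var insert = conn.CreateCommand();
        insert.CommandText = "INSERT OR REPLACE INTO map (key, value) VALUES ($k, $v)";
        insert.Parameters.AddWithValue("$k", "some-long-string-key");
        insert.Parameters.AddWithValue("$v", 42);
        insert.ExecuteNonQuery();

        var select = conn.CreateCommand();
        select.CommandText = "SELECT value FROM map WHERE key = $k";
        select.Parameters.AddWithValue("$k", "some-long-string-key");
        object result = select.ExecuteScalar();   // null if the key is absent
    }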

Related

How to generate identifiers in a distributed system with low probability of duplicates?

I need to generate identifiers in a distributed system.
Duplicates will be detected by the system and will cause the operation that created that identifier to fail. I need to minimize the probability of failing operations by generating identifiers with low collision probability.
I'd also like to be able to describe mathematically how likely it is that a duplicate number is generated. I'm not sure what such a description would look like, preferably I'd like to know the X in something like:
When generating 1000 random numbers per second for 10 years no more than X duplicates should have been generated.
These random numbers can only have 35 significant bits. The system is written in C# and runs on top of Microsoft's .NET platform.
So this is actually two questions in one (but I guess they depend on each other):
What component/pattern should I use to generate identifiers?
How can I compute the X value?
For (1) I see the following candidates:
System.Random
System.Guid
System.Security.Cryptography.RNGCryptoServiceProvider
The fact that I need numbers to have 35 significant bits is not a problem when it comes to generating values, as it is fine to generate a larger number and then just extract 35 of those bits. However, it does affect the mathematical computation, I presume.
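For illustration, one hedged way to do the "generate more bits, keep 35" step with the cryptographic RNG from the candidate list (the helper name is made up):

    using System;
    using System.Security.Cryptography;

    static class IdGenerator
    {
        // Draws 64 random bits from the crypto RNG and masks down to the low 35.
        public static long NextId35()
        {
            var bytes = new byte[8];
            using (var rng = new RNGCryptoServiceProvider())
            {
                rng.GetBytes(bytes);
            }
            long value = BitConverter.ToInt64(bytes, 0);
            return value & ((1L << 35) - 1);   // keep only 35 significant bits
        }
    }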
UPDATE
I can see now that 35 bits aren't nearly enough for my description above. I don't really need 1 number per millisecond for 10 years. That was an overstatement.
What I really need is a way to distributively generate identifiers that have 35 significant bits with as low probability of a conflict as possible. As time goes by the system will "clean up" identifiers so that it is possible for the same number to be used again without it causing a failure.
I understand that I could of course implement some kind of centralized counter. But I would like to be able to avoid that if possible. I want to minimize the number of network operations needed to maintain the identifiers.
Any suggestions are welcome!
You want to generate 1000 numbers each second for 10 years. So you will generate
1000*60*60*24*365*10 = 315360000000
numbers. You want to use numbers with 35 bits, and there are only
2**35 = 34359738368
possible 35-bit values. The minimum number of duplicates that you will generate is 315360000000 - 34359738368, which equals 281000261632. That is a lower bound on X, and it is self-evident: suppose by some amazing freak you manage to sample each and every possible value from the 2**35 available; then every other sample you make is a duplicate.
I guess we can safely conclude that 35 bits is not enough.
As far as generating good-quality pseudo-random numbers goes, it should be fairly obvious that System.Security.Cryptography.RNGCryptoServiceProvider is the best choice of the three that you present.
If you really want uniqueness, then I suggest you do the following:
Allocate to each distributed node a unique range of IDs.
Have each node allocate values uniquely from that pool of IDs. For instance, the node starts at the first value and increments the ID by one every time it is asked to generate a new one.
This is really the best strategy if uniqueness matters. But you will likely need to dedicate more bits for your IDs.
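A minimal sketch of that per-node range strategy, under the assumption of a fixed, known number of nodes (class and member names are illustrative):

    using System;

    // Each node owns a disjoint slice of the 35-bit space and hands out IDs
    // by incrementing a local counter - no network traffic per ID.
    class RangeIdAllocator
    {
        private readonly long _end;   // exclusive upper bound of this node's slice
        private long _next;

        public RangeIdAllocator(int nodeIndex, int totalNodes)
        {
            long space = 1L << 35;
            long slice = space / totalNodes;
            _next = nodeIndex * slice;
            _end = _next + slice;
        }

        public long NextId()
        {
            if (_next >= _end)
                throw new InvalidOperationException("This node has exhausted its ID range.");
            return _next++;
        }
    }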
Since the probability of collisions steadily increases with a random allocation as you use up more addresses, the system steadily degrades in performance. There is also the looming specter of a non-zero probability of your random selection never terminating because it never chooses a non-conflicting id (PRNGs have cycle lengths for any given seed much smaller than their theoretical full range of output.) Whether this is a problem in practice of course depends on how saturated you expect your address space to be in the long run.
If the IDs don't need to be random, then you almost certainly want to rely on some form of coordination to assign IDs (such as partitioning the address space or using a coordinating manager of some sort to assign IDs) rather than creating random numbers and reconciling collisions after they happen. It will be easier to implement, probably more performant and will allow better saturation of your address space.
In response to comment:
The design of a specific coordination mechanism depends on a lot of factors, such as how many nodes you expect to have, how flexible you need to be with regard to adding/dropping nodes, how long the IDs need to remain unique (i.e. what is your strategy for managing ID lifetime), etc. It's a complex problem that warrants a careful analysis of your expected use cases, including a look at your future scalability requirements. A simple partitioning scheme is sufficient if your number of nodes and/or number of IDs is small, but if you need to scale to larger volumes, it's a much more challenging problem, perhaps requiring more complex allocation strategies.
One possible partitioning design is that you have a centralized manager that allocates IDs in blocks. Each node can then freely increment IDs within that block and only needs to request a new block when it runs out. This can scale well if you expect your ID lifetime to correlate with age, as that generally means that whole blocks will be freed up over time. If ID lifetime is more randomly distributed, though, this could potentially lead to fragmentation and exhaustion of available blocks. So, again, it's a matter of understanding your requirements so that you can design for the scale and usage patterns your application requires.
You can't use random numbers in your case: the birthday paradox says the first collision is expected after roughly
sqrt(2 * N)
samples. In your case:
sqrt(2 * 2^35) = sqrt(2^36) = 2^18 = 262144 items before the first collision.
So a GUID-based value is the best choice.
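For the "compute X" half of the question, a common hedged approximation is the birthday bound 1 - e^(-k(k-1)/(2n)) for the probability of at least one collision among k uniform random draws from n possible values. A quick sketch of evaluating it:

    using System;

    static class BirthdayBound
    {
        // Approximate probability of at least one collision after k uniform
        // random draws from n possible values.
        public static double CollisionProbability(double k, double n)
        {
            return 1.0 - Math.Exp(-k * (k - 1.0) / (2.0 * n));
        }
    }

    // Example: 100,000 random 35-bit IDs alive at once gives roughly a 13-14%
    // chance that at least two of them collide:
    //   BirthdayBound.CollisionProbability(1e5, Math.Pow(2, 35))  ~= 0.135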
I think for your particular problem all those random numbers providers will work relatively the same - all should generate nearly ideal even distribution of values.
I heard GUID generation includes the MAC address as part of the generation, so it might influence some parts more than others, but I'm not sure. Most likely it is evenly distributed as well, but you should check that before relying on it.
The main question you should answer is: do you really need random numbers, or are consecutive ones fine? Maybe consecutive addresses will work better and have better performance because of caching. So it might be good to distribute the address space among your machines and have a full guarantee of when a collision will occur, and handle it appropriately.

What limitations are there when computing a Singular Value Decomposition with a numeric library (computational)?

I am using Math.NET's singular value decomposition to do PCA analysis on a database. Depending on the number of columns and rows, the algorithm keeps running indefinitely (so I am assuming it is not converging).
I think Math.NET's implementation of SVD is based on LAPACK.
So I'm wondering whether there is any kind of limitation in this algorithm, or any characteristics of my data set, that could cause this.
PS: the data doesn't appear to have much covariance between the attributes.
With most (if not all) algorithms for computing the singular value decomposition, there is no guarantee that the algorithm will terminate, though it is extremely rare that it does not. Good implementations, like LAPACK, will stop after a certain number of iterations and return an error.
In your case, with matrices of size around 100 (I assume when you say more than approx. 70 you mean not very much more), it should take at most a few seconds to compute the SVD. If it takes longer, your matrix is perhaps one of the extremely rare cases where the algorithm that the library uses does not converge. I'd say it's more likely that you have found a bug, in which case you should probably contact the maintainers of the library.
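As a sanity check of the scale involved, a hedged Math.NET Numerics sketch (assuming the MathNet.Numerics package; the matrix dimensions are illustrative). A well-conditioned 100 x 70 matrix should decompose almost instantly; if this finishes but your real data hangs, suspect the data (NaN/Infinity values, wild scaling) or a library bug rather than the SVD algorithm itself:

    using MathNet.Numerics.LinearAlgebra;

    // Decompose a small random matrix of roughly the size described above.
    var matrix = Matrix<double>.Build.Random(100, 70);
    var svd = matrix.Svd(true);            // true = also compute U and VT

    var singularValues = svd.S;            // PCA variances derive from these
    var u = svd.U;
    var vt = svd.VT;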

List vs. Dictionary (Maximum Size, Number of Elements)

I am attempting to ascertain the maximum sizes (in RAM) of a List and a Dictionary. I am also curious as to the maximum number of elements / entries each can hold, and their memory footprint per entry.
My reasons are simple: I, like most programmers, am somewhat lazy (this is a virtue). When I write a program, I like to write it once, and try to future-proof it as much as possible. I am currently writing a program that uses Lists, but noticed that the iterator wants an integer. Since the capabilities of my program are only limited by available memory / coding style, I'd like to write it so I can use a List with Int64s or possibly BigInts (as the iterators). I've seen IEnumerable as a possibility here, but would like to find out if I can just stuff an Int64 into a Dictionary object as the key, instead of rewriting everything. If I can, I'd like to know what the cost of that might be compared to rewriting it.
My hope is that should my program prove useful, I need only hit recompile in 5 years time to take advantage of the increase in memory.
Is it specified in the documentation for the class? No, then it's unspecified.
In terms of current implementations, there's no maximum size in RAM in the classes themselves; if you create a value type that's 2 MB in size, push a few thousand of them into a list, and get an out-of-memory exception, that has nothing to do with List<T>.
Internally, List<T>'s workings would prevent it from ever having more than 2 billion items. It's harder to come to a quick answer with Dictionary<TKey, TValue>, since the way things are positioned within it is more complicated, but really, if I were looking at dealing with a billion items (at a 32-bit value each, for example, that's 4 GB), I'd be looking to store them in a database and retrieve them using data-access code.
At the very least, once you're dealing with a single data structure that's 4GB in size, rolling your own custom collection class no longer counts as reinventing the wheel.
I am using a ConcurrentDictionary to rank 3x3 patterns in half a million games of Go. Obviously there are a lot of possible patterns. With C# 4.0 the ConcurrentDictionary goes out of memory at around 120 million objects. It is using 8 GB at that point (on a 32 GB machine), but wants to grow way too much, I think (table growths happen in large chunks with ConcurrentDictionary). Using a database would slow me down at least a hundredfold, I think. And the process already takes 10 hours.
My solution was to use a multi-phase approach, actually doing multiple passes, one for each subset of patterns, for example one pass for odd patterns and one for even patterns. Once using more objects no longer fails, I can reduce the number of passes.
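A hedged sketch of that multi-pass idea (EnumeratePatterns and WriteResults are hypothetical stand-ins for the pattern source and the result sink; the number of passes must be a power of two for the mask trick used here):

    using System.Collections.Concurrent;

    const int passes = 4;                       // raise this if a pass still runs out of memory
    for (int pass = 0; pass < passes; pass++)
    {
        var counts = new ConcurrentDictionary<long, int>();

        foreach (long pattern in EnumeratePatterns())        // hypothetical pattern source
        {
            if ((pattern & (passes - 1)) != pass)            // not this pass's subset
                continue;
            counts.AddOrUpdate(pattern, 1, (key, count) => count + 1);
        }

        WriteResults(counts);                                // hypothetical result sink
    }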
.NET 4.5 adds support for larger arrays in 64-bit processes by using unsigned 32-bit indexes for arrays (the limit mentioned goes from 2 billion to 4 billion elements). See also http://msdn.microsoft.com/en-us/library/hh285054(v=vs.110).aspx. Not sure which objects will benefit from this; List<> might.
I think you have bigger issues to solve before even wondering if a Dictionary with an int64 key will be useful in 5 or 10 years.
Having a List or Dictionary of 2e+9 elements in memory (int32) doesn't seem to be a good idea, never mind 9e+18 elements (int64). Anyhow, the framework won't allow you to create a monster of that size (not even close), and probably never will. (Keep in mind that a simple int[int.MaxValue] array already far exceeds the framework's limit for memory allocation of any given object.)
And the question remains: why would you ever want your application to hold a list of so many items in memory? You are better off using a specialized data storage backend (a database) if you have to manage that amount of information.

Using different numeric variable types

I'm still pretty new, so bear with me on this one; my question(s) are not meant to be argumentative or petty, but during some reading something struck me as odd.
I'm under the impression that when computers were slow and memory was expensive, using the correct variable type was much more of a necessity than it is today. Now that memory is a bit easier to come by, people seem to have relaxed a bit. For example, you see this sample code everywhere:
for (int i = 0; i < length; i++)
int? (-2,147,483,648 to 2,147,483,647) for length? Isn't byte (0-255) a better choice?
So I'm curious about your opinions and what you believe to be best practice. I hate to think this is used only because "int" is more intuitive for a beginner... or has memory just become so cheap that we really don't need to concern ourselves with such petty things, and therefore we should just use long so we can be sure any other numbers/types (within reason) can be cast automagically?
...or am I just being silly by concerning myself with such things?
Luca Bolognese posted this in his blog.
Here's the relevant part:
Use int whenever your values can fit in an int, even for values which can never be negative.
Use long when your values can't fit in an int.
Byte, sbyte, short, ushort, uint, and ulong should only ever be used for interop with C code. Otherwise they're not worth the hassle.
Using a variable that is smaller than the CPU's native register size can actually result in more code being emitted.
As another poster said, don't worry about micro-optimisations. If you have a performance problem, profile first. Nine times out of ten your performance problem won't be where you thought it was.
No, I don't think you are being silly, this is a great question!
My opinion is that choosing the type deliberately is a best practice. In your example, the variable i is never negative, so it could be an unsigned int.
When developing programs we need to consider: 1) size, 2) speed, and 3) the cost of the programmer. These are not mutually exclusive; sometimes we trade off size for speed, and of course those best able to do this (great programmers) cost more than beginners.
Also remember that what is fastest on computer A may be slower on computer B. Is it a 16-bit, 32-bit, 64-bit, etc. operating system? In many cases we want a variable to be aligned on word boundaries for speed, so using variables smaller than a word does not necessarily end up saving any space.
So it is not necessarily best to use the smallest possible variable, but it is always best practice to make an informed choice of the best type to use.
Generally memory is cheap these days; since you don't have to worry about it, you can concern yourself with more important programming details. Think of it like memory management: the less of it you have to do, the more productive you are overall. Ultimately it comes down to the level of abstraction; if you don't need to control how all the cogs and wheels work, it is best not to tinker with them.
In practice you will generally always use either ints or longs (if you need the extra size). I wouldn't concern myself with anything smaller these days unless I was optimizing. And remember the golden rule of optimization: don't optimize unless you have to. Write your code first, then optimize if needed.
Another similar situation is designing a database schema. I often see people design a schema and allow only what they need on NVARCHAR columns. To me this is ridiculous, because it is a variable-length column so you are not wasting space, and by giving yourself plenty of room you avoid problems down the road. I once worked for a company that had internal logging on the website; once I upgraded to IE8, the website started crashing. After some investigation I found that the logging schema only allowed 32 characters for the browser ID string, but when using IE8 (with Visual Studio and other extensions) the browser ID string grew beyond 32 characters and caused a problem which prevented the website from working at all. Sure, there could have been more stringent length checking and better error handling on the part of the developer in charge of that, but allowing 256 characters instead of 32 would not only have prevented the crash, it would also have meant we weren't truncating the data in the db.
I am not suggesting that you use strings and Int64 for all your data types (any more than I suggest you set all your SQL columns to NVARCHAR(4000)), because you lose readability. But choose an appropriate type and give yourself lots of padding.
Local variables like loop indexes are cheap. How many frames are you going to have on the stack at a time? Fifty? A hundred? A thousand? What's the overhead of using a thousand int counters instead of a thousand byte counters? 3K? Is saving 3K worth the rework when it turns out that a couple of those arrays need more than 255 elements?
If you're allocating tens of millions of these things, then squeezing the bit count may make sense. For locals, it's a false economy.
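To make the "needs more than 255 elements" failure mode concrete: in C#'s default unchecked context a byte counter silently wraps from 255 back to 0, so a loop like the following never terminates once the length exceeds 255 (illustrative sketch):

    int length = 300;

    // This loop never ends: 'i' wraps from 255 back to 0 before reaching 300.
    for (byte i = 0; i < length; i++)
    {
        // ...
    }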
Another factor is what the type communicates to your readers. For better or for worse, people put very little interpretation on an int; but when they see a byte they'll tend to interpret it as something specifically byte-oriented, such as binary data coming off a stream, maybe pixels, or a stream that needs to be run through an encoder to turn it into a string. Using a byte as a loop counter will throw many readers of your code off their stride as they stop and wonder, "Wait, why is this a byte instead of an int?"
Considering this note may help you:
The runtime optimizes the performance of 32-bit integer types (Int32 and UInt32), so use those types for counters and other frequently accessed integral variables. For floating-point operations, Double is the most efficient type because those operations are optimized by hardware.
source: MCTS Self-Paced Training Kit (Exam 70-536): Microsoft® .NET Framework Application Development Foundation, Second edition
Note: I think this is OK for x86 machines, but for x64 I don't know.

Is there any scenario where the Rope data structure is more efficient than a string builder

Related to this question, based on a comment by user Eric Lippert.
Is there any scenario where the Rope data structure is more efficient than a string builder? It is some people's opinion that rope data structures are almost never better in terms of speed than the native string or string builder operations in typical cases, so I am curious to see realistic scenarios where indeed ropes are better.
The documentation for the SGI C++ implementation goes into some detail on the big-O behaviours versus the constant factors, which is instructive.
Their documentation assumes very long strings are involved; the examples given for reference talk about 10 MB strings. Very few programs deal with such things, and for many classes of problems with such requirements, reworking them to be stream-based rather than requiring the full string to be available will lead to significantly better results. As such, ropes are for non-streaming manipulation of multi-megabyte character sequences, when you are able to treat the rope as sections (themselves ropes) rather than just a sequence of characters.
Significant Pros:
Concatenation/Insertion become nearly constant time operations
Certain operations may reuse the previous rope sections to allow sharing in memory.
Note that .NET strings, unlike Java strings, do not share the character buffer for substrings - a choice with pros and cons in terms of memory footprint. Ropes tend to avoid this sort of issue.
Ropes allow deferred loading of substrings until required
Note that this is hard to get right, very easy to render pointless due to excessive eagerness of access and requires consuming code to treat it as a rope, not as a sequence of characters.
Significant Cons:
Random read access becomes O(log n)
The constant factors on sequential read access seem to be between 5 and 10
Efficient use of the API requires treating it as a rope, not just dropping in a rope as a backing implementation behind the 'normal' string API.
This leads to a few 'obvious' uses (the first mentioned explicitly by SGI).
Edit buffers on large files allowing easy undo/redo
Note that at some point you may need to write the changes to disk, which involves streaming through the entire string, so this is only useful if most edits primarily reside in memory rather than requiring frequent persistence (say, through an autosave function).
Manipulation of DNA segments where significant manipulation occurs, but very little output actually happens
Multi threaded Algorithms which mutate local subsections of string. In theory such cases can be parcelled off to separate threads and cores without needing to take local copies of the subsections and then recombine them, saving considerable memory as well as avoiding a costly serial combining operation at the end.
There are cases where domain specific behaviour in the string can be coupled with relatively simple augmentations to the Rope implementation to allow:
Read only strings with significant numbers of common substrings are amenable to simple interning for significant memory savings.
Strings with sparse structures, or significant local repetition are amenable to run length encoding while still allowing reasonable levels of random access.
Where the substring boundaries are themselves 'nodes' where information may be stored, though such structures are quite possibly better done as a radix trie if they are rarely modified but often read.
As you can see from the examples listed, all fall well into the 'niche' category. Further, several may well have superior alternatives if you are willing/able to rewrite the algorithm as a stream processing operation instead.
The short answer to this question is yes, and that requires little explanation. Of course there are situations where the rope data structure is more efficient than a string builder. They work differently, so they are suited to different purposes.
(From a C# perspective)
The rope data structure as a binary tree is better in certain situations. When you're looking at extremely large string values (think 100+ MB of XML coming in from SQL), the rope data structure could keep the entire process off the large object heap, where a string object lands once it passes 85,000 bytes.
If you're looking at strings of 5-1000 characters, it probably doesn't improve performance enough to be worth it. This is another case of a data structure designed for the 5% of people who have an extreme situation.
The 10th ICFP Programming Contest relied, basically, on people using the rope data structure for efficient solving. That was the big trick to get a VM that ran in reasonable time.
Rope is excellent if there is a lot of prefixing (apparently the word "prepending" is made up by IT folks and isn't a proper word!) and potentially better for insertions; StringBuilders use contiguous memory, so they only work efficiently for appending.
Therefore, StringBuilder is great for building strings by appending fragments - a very normal use-case. As developers need to do this a lot, StringBuilders are a very mainstream technology.
Ropes are great for edit buffers, e.g. the data structure behind, say, an enterprise-strength TextArea. So a relaxation of ropes (e.g. a linked list of lines rather than a binary tree) is very common in the UI-controls world, but that's not often exposed to the developers and users of those controls.
You need really, really big amounts of data and churn to make the rope pay off - processors are very good at stream operations, and if you have the RAM, then simply reallocating for prefixing works acceptably for normal use cases. The competition mentioned above was the only time I've seen it needed.
Most advanced text editors represent the text body as a "kind of rope" (though in implementation the leaves aren't usually individual characters, but text runs), mainly to make frequent inserts and deletes on large texts cheap.
Generally, StringBuilder is optimized for appending and tries to minimize the total number of reallocations without overallocating too much. The typical guarantee is log2(N) allocations and less than 2.5x the memory. Normally the string is built once and may then be used for quite a while without being modified.
Rope is optimized for frequent inserts and removals, and tries to minimize the amount of data copied (at the cost of a larger number of allocations). In a linear buffer implementation, each insert and delete becomes O(N), and you usually have to handle single-character inserts.
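To make the structural difference concrete, here is a minimal, hedged sketch of an immutable rope node (not a production rope - no balancing and no substring/insert operations): concatenation only allocates a new parent node, while indexing walks down the tree, which is O(log n) when the tree stays balanced.

    // Minimal immutable rope sketch: Concat allocates one new node (near O(1)),
    // while indexing descends the tree.
    abstract class Rope
    {
        public abstract int Length { get; }
        public abstract char this[int index] { get; }

        public static Rope FromString(string s) { return new Leaf(s); }
        public Rope Concat(Rope other) { return new Node(this, other); }

        private sealed class Leaf : Rope
        {
            private readonly string _text;
            public Leaf(string text) { _text = text; }
            public override int Length { get { return _text.Length; } }
            public override char this[int index] { get { return _text[index]; } }
        }

        private sealed class Node : Rope
        {
            private readonly Rope _left, _right;
            private readonly int _length;
            public Node(Rope left, Rope right)
            {
                _left = left;
                _right = right;
                _length = left.Length + right.Length;
            }
            public override int Length { get { return _length; } }
            public override char this[int index]
            {
                get
                {
                    return index < _left.Length
                        ? _left[index]
                        : _right[index - _left.Length];
                }
            }
        }
    }

Concatenating two ropes built this way touches none of the existing character data, which is exactly the property a contiguous StringBuilder cannot offer for inserts at the front or middle.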
JavaScript VMs often use ropes for strings.
Maxime Chevalier-Boisvert, developer of the Higgs JavaScript VM, says:
In JavaScript, you can use arrays of strings and eventually Array.prototype.join to make string concatenation reasonably fast, O(n), but the "natural" way JS programmers tend to build strings is to just append using the += operator to incrementally build them. JS strings are immutable, so if this isn't optimized internally, incremental appending is O(n^2). I think it's probable that ropes were implemented in JS engines specifically because of the SunSpider benchmarks which do string appending. JS engine implementers used ropes to gain an edge over others by making something that was previously slow faster. If it wasn't for those benchmarks, I think that cries from the community about string appending performing poorly may have been met with "use Array.prototype.join, dummy!".
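The same trade-off is easy to see in C# (an illustrative sketch; the loop count is arbitrary): naive += copies the whole string on every iteration, which is O(n^2) total work, while StringBuilder appends amortize to O(n).

    using System.Text;

    string s = "";
    for (int i = 0; i < 100000; i++)
        s += "x";                      // each += copies everything built so far

    var sb = new StringBuilder();
    for (int i = 0; i < 100000; i++)
        sb.Append("x");                // amortized constant-time append
    string t = sb.ToString();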
