I was attempting to use ObjectIDGenerator in C# to generate a unique ID during serialization; however, this class is not available in the Xbox 360 or Windows Phone 7 .NET frameworks (they use a compact version of .NET). I implemented a version using a dictionary of Object to Int64 and got a fully working version up, but the performance is unsatisfactory. I'm serializing on the order of tens of thousands of objects, and currently this is the greatest bottleneck in save/load performance. Using the actual .NET implementation on PC, it takes about 0.3 seconds to serialize about 20,000 objects. Using my implementation, it takes about 6 seconds.
In profiling, I found that the heavy hitters were .TryGetValue and .Add on the dictionary (which makes sense since it's both indexing and adding to the hash map). More importantly, the virtual equality operator was being called instead of simply comparing references, so I implemented an IEqualityComparer that only used ReferenceEquals (this resulted in a speed increase).
Does anyone have an insight into a better implementation of ObjectIDGenerator? Thanks for your help!
My Implementation: http://pastebin.com/H1skZwNK
[Edit]
Another note: the profiling results say that the object comparison / ReferenceEquals is still the bottleneck, with a hit count of 43,000,000. I'm wondering if there's a way to store data alongside this object without having to look it up in a hash map...
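For reference, the ReferenceEquals-only comparer mentioned above looks essentially like this (a simplified sketch, not the exact pastebin code; ReferenceComparer is an illustrative name):

using System.Collections.Generic;
using System.Runtime.CompilerServices;

// Compares purely by reference and hashes by object identity, so the
// dictionary never dispatches a virtual Equals/GetHashCode on the
// derived type.
sealed class ReferenceComparer : IEqualityComparer<object>
{
    public new bool Equals(object x, object y)
    {
        return ReferenceEquals(x, y);
    }

    public int GetHashCode(object obj)
    {
        return RuntimeHelpers.GetHashCode(obj);
    }
}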
Is it possible to use an Int32 Id property / handle for each object rather than Object? That may help things. It looks like you're assigning an Id type number to each object anyway, only you're then looking up based on the Object reference instead of the Id. Can you persist the object id (your Int64) within each object and make your dictionary into Dictionary<Int64, Object> instead?
You might also want to see if SortedDictionary<TKey, TValue> or SortedList<TKey, TValue> perform better or worse. But if your main bottleneck is in your IEqualityComparer, these might not help very much.
UPDATE
After looking at the ObjectIDGenerator class API, I can see why you can't do what I advised at first; you're creating the ids!
ObjectIDGenerator seems to be manually implementing its own hash table (it allocates an object[] and a parallel long[] and resizes them as objects are added). It also uses RuntimeHelpers.GetHashCode(Object) to calculate its hash rather than an IEqualityComparer, which may be a big boost to your perf: it always gets the default identity hash code directly and never makes a virtual call on the derived type (or, in your case, the interface call through IEqualityComparer).
You can see the source for yourself via the Microsoft Shared Source Initiative.
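A rough sketch of that approach, with parallel arrays probed by identity hash and linear probing, might look like the following. This is a simplification for illustration (FastIdGenerator is an illustrative name, and the real class handles more cases), not the actual BCL code:

using System;
using System.Runtime.CompilerServices;

sealed class FastIdGenerator
{
    // Parallel arrays, as described above; power-of-two capacity so the
    // hash can be masked instead of taken modulo.
    private object[] _objects = new object[1 << 16];
    private long[] _ids = new long[1 << 16];
    private int _count;
    private long _nextId = 1;

    public long GetId(object obj, out bool firstTime)
    {
        // Keep the load factor at or below 0.5 so probing terminates.
        if (_count * 2 >= _objects.Length) Resize();
        int mask = _objects.Length - 1;
        int i = RuntimeHelpers.GetHashCode(obj) & mask;
        while (_objects[i] != null)
        {
            if (ReferenceEquals(_objects[i], obj))
            {
                firstTime = false;
                return _ids[i];
            }
            i = (i + 1) & mask; // linear probing
        }
        _objects[i] = obj;
        _ids[i] = _nextId++;
        _count++;
        firstTime = true;
        return _ids[i];
    }

    private void Resize()
    {
        object[] oldObjects = _objects;
        long[] oldIds = _ids;
        _objects = new object[oldObjects.Length * 2];
        _ids = new long[oldIds.Length * 2];
        int mask = _objects.Length - 1;
        for (int j = 0; j < oldObjects.Length; j++)
        {
            object o = oldObjects[j];
            if (o == null) continue;
            int i = RuntimeHelpers.GetHashCode(o) & mask;
            while (_objects[i] != null) i = (i + 1) & mask;
            _objects[i] = o;
            _ids[i] = oldIds[j];
        }
    }
}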
Related
I want to take any object and get a guid that represents that object.
I know that entails a lot of things. I am looking for a good-enough solution for common applications.
My specific use case is for caching, I want to know that the object used to create the thing I am caching has already made one in the past. There would be 2 different types of objects. Each type contains only public properties, and may contain a list/ienumable.
Assuming the object is serializable, my first idea was to serialize it to JSON (via the native JSON serializer or Newtonsoft) and then convert the JSON string to a version 5 UUID, as detailed in the gist linked from How can I generate a GUID for a string?
My second approach, if it's not serializable (for example, if it contains a dictionary), would be to use reflection on the public properties to generate a unique string of some sort and then convert that to a version 5 UUID.
Both approaches use UUID version 5 to turn a string into a GUID. Is there a proven C# class that makes valid version 5 UUIDs? The gist looks good, but I want to be sure.
I was thinking of using the C# namespace and type name as the namespace for the version 5 UUID. Is that a valid use of a namespace?
My first approach is good enough for my simple use case but I wanted to explore the second approach as it's more flexible.
If creating the GUID can't guarantee reasonable uniqueness, it should throw an error. Surely very complicated objects would fail; how might I detect that case when using reflection?
I am looking for new approaches, or for concerns about / implementations of the second approach.
Edit: The reason I bountied/reopened this almost 3 years later is that I need this again (and for caching again), but also because of the introduction of the generic unmanaged constraint in C# 7.3. The blog post at http://devblogs.microsoft.com/premier-developer/dissecting-new-generics-constraints-in-c-7-3/ seems to suggest that if the object obeys the unmanaged spec, you can find a suitable key for a key-value store. Am I misunderstanding something?
This is still limited because the object (generic) must obey the unmanaged type constraint, which is very limiting (no strings, no arrays, etc.), but it's one step closer. I don't completely understand why the method of taking the underlying memory and computing a SHA-1 hash can't be applied to types that aren't unmanaged.
I understand that reference types point to places in memory, and it's not as easy to get at the memory that represents the whole object, but it feels doable. After all, objects are ultimately made up of unmanaged types (a string is an array of chars, etc.).
PS: The GUID requirement is loose; any integer/string at or under 512 bits would suffice.
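For what it's worth, here is a minimal sketch of what the unmanaged constraint enables, assuming the .NET Core 2.1+ MemoryMarshal APIs (UnmanagedKey and ToGuid are illustrative names):

using System;
using System.Runtime.InteropServices;
using System.Security.Cryptography;

static class UnmanagedKey
{
    // Any unmanaged value is just a fixed-size run of bytes, so we can
    // view it as a span and hash those bytes directly.
    public static Guid ToGuid<T>(T value) where T : unmanaged
    {
        ReadOnlySpan<byte> bytes =
            MemoryMarshal.AsBytes(MemoryMarshal.CreateReadOnlySpan(ref value, 1));
        using (var md5 = MD5.Create())
        {
            // MD5 yields 16 bytes, exactly the size of a Guid.
            return new Guid(md5.ComputeHash(bytes.ToArray()));
        }
    }
}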
The problem of equality is a difficult one.
Here are some thoughts on how you could solve your problem.
Hashing a serialized object
One method would be to serialize an object and then hash the result as proposed by Georg.
Using an MD5 checksum gives you a strong fingerprint, given the right input.
But getting it right is the problem.
You might have trouble using a common serialization framework, because:
They don't care whether a float is 1.0 or 1.000000000000001.
They might have a different understanding about what is equal than you / your employer.
They bloat the serialized text with unneeded symbols. (performance)
Just a little deviation in the serialized text causes a large deviation in the hashed GUID/UUID.
That's why you should carefully test any serialization you do.
Otherwise you might get false positives/negatives for objects (mostly false negatives).
Some points to think about:
Floats & Doubles:
Always write them the same way, preferably with the same number of digits to prevent something like 1.000000000000001 vs 1.0 from interfering.
DateTime, TimeStamp, etc.:
Apply a fixed format that won't change and is unambiguous.
Unordered collections:
Sort the data before serializing it; the order must be unambiguous.
Strings:
Is the equality case-sensitive? If not, make all the strings lower or upper case.
If necessary, make them culture-invariant.
More:
For every type, think carefully what is equal and what is not. Think especially about edge cases. (float.NaN, -0 vs 0, null, etc.)
It's up to you whether you use an existing serializer or do it yourself.
Doing it yourself is more work and error prone, but you have full control over all aspects of equality and serialization.
Using an existing serializer is also error prone, because you need to test or prove that the results are always what you want.
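To make the points above concrete, a hand-rolled canonical serialization for a hypothetical type (one double, one DateTime, one unordered collection) might look like this sketch (CanonicalHash and ComputeId are illustrative names):

using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

static class CanonicalHash
{
    public static Guid ComputeId(double weight, DateTime created, IEnumerable<string> tags)
    {
        var sb = new StringBuilder();
        // "R" round-trips a double, so 1.0 and 1.000000000000001 stay
        // distinct while equal values always produce the same text.
        sb.Append(weight.ToString("R", CultureInfo.InvariantCulture)).Append('|');
        // A fixed, unambiguous date format ("o" = ISO 8601 round-trip).
        sb.Append(created.ToString("o", CultureInfo.InvariantCulture)).Append('|');
        // Sort the unordered collection so element order cannot change the hash.
        foreach (var tag in tags.OrderBy(t => t, StringComparer.Ordinal))
            sb.Append(tag).Append('|');
        using (var md5 = MD5.Create())
            return new Guid(md5.ComputeHash(Encoding.UTF8.GetBytes(sb.ToString())));
    }
}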
Introducing an unambiguous order and using a tree
If you have control over the source code, you can introduce a custom order function.
The order must take all properties, sub objects, lists, etc. into account.
Then you can create a binary tree and use the order to insert and look up objects.
The same problems as mentioned by the first approach still apply, you need to make sure that equal values are detected as such.
The big-O performance is also worse than using hashing. But in most real-life examples, the actual performance should be comparable or at least fast enough.
The good thing is, you can stop comparing two objects, as soon as you found a property or value that is not equal. Thus no need to always look at the whole object.
A binary tree needs O(log2(n)) comparisons for a lookup, thus that would be quite fast.
The bad thing is, you need access to all actual objects, thus keep them in memory.
A hashtable needs only O(1) comparisons for a lookup, thus would even be faster (theoretically at least).
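A sketch of that idea using the framework's built-in red-black tree (SortedDictionary) and a hypothetical Item type whose properties define its identity (Item and ItemComparer are illustrative names):

using System;
using System.Collections.Generic;

// Hypothetical type whose Name and Value define equality.
class Item
{
    public string Name;
    public double Value;
}

// A total order over Item: compare property by property and stop at the
// first difference, as described above.
class ItemComparer : IComparer<Item>
{
    public int Compare(Item x, Item y)
    {
        int byName = string.CompareOrdinal(x.Name, y.Name);
        return byName != 0 ? byName : x.Value.CompareTo(y.Value);
    }
}

class TreeCacheDemo
{
    static void Main()
    {
        // SortedDictionary is a binary search tree, so a lookup costs
        // O(log n) comparisons, each stopping at the first unequal property.
        var cache = new SortedDictionary<Item, Guid>(new ItemComparer());
        cache[new Item { Name = "a", Value = 1.0 }] = Guid.NewGuid();
        Console.WriteLine(cache.ContainsKey(new Item { Name = "a", Value = 1.0 })); // True
    }
}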
Put them in a database
If you store all your objects in a database, then the database can do the lookup for you.
Databases are quite good in comparing objects and they have built in mechanisms to handle the equality/near equality problem.
I'm not a database expert, so for this option, someone else might have more insight on how good this solution is.
As others have said in comments, it sounds like GetHashCode might do the trick for you if you're willing to settle for int as your key. If not, there is a Guid constructor that takes a byte[] of length 16. You could try something like the following:
using System;
using System.Linq;
using System.Text;

class Foo
{
    public int A { get; set; }
    public char B { get; set; }
    public string C { get; set; }

    public Guid GetGuid()
    {
        byte[] aBytes = BitConverter.GetBytes(A);
        byte[] bBytes = BitConverter.GetBytes(B);
        // BitConverter has no overload for string; encode it instead.
        byte[] cBytes = Encoding.UTF8.GetBytes(C ?? string.Empty);
        // Pad so short inputs still reach 16 bytes, then truncate to the
        // 16 bytes the Guid constructor requires.
        byte[] padding = new byte[16];
        byte[] allBytes =
            aBytes
            .Concat(bBytes)
            .Concat(cBytes)
            .Concat(padding)
            .Take(16)
            .ToArray();
        return new Guid(allBytes);
    }
}
As said in the comments, there is no bullet entirely out of silver here, but a few that come quite close. Which of them to use depends on the types you want to use your class with and your context, e.g. when do you consider two objects to be equal. However, be aware that you will always face possible conflicts, a single GUID will not be sufficient to guarantee collision avoidance. All you can do is to decrease the probability of a collision.
In your case,
already made one in the past
sounds like you don't want to refer to reference equality but want to use a notion of value equality. The simplest way to do so is to trust that the classes implement equality using value equality, because in that case you would already be done using GetHashCode, but that has a higher probability of collisions because it is only 32 bits. Further, you would assume that whoever wrote the class did a good job, which is not always a good assumption to be made, particularly since people tend to blame you rather than themselves.
Otherwise, your best chances are serialization combined with a hashing algorithm of your choice. I would recommend MD5 because it is the fastest and produces the 128 bits you need for a GUID. If you say your types consist of public properties only, I would suggest using an XmlSerializer, like so:
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Xml.Serialization;

private MD5 _md5 = new MD5CryptoServiceProvider();
// XmlSerializer construction is expensive, so cache one per type.
private Dictionary<Type, XmlSerializer> _serializers = new Dictionary<Type, XmlSerializer>();

public Guid CreateID(object obj)
{
    if (obj == null) return Guid.Empty;
    var type = obj.GetType();
    if (!_serializers.TryGetValue(type, out var serializer))
    {
        serializer = new XmlSerializer(type);
        _serializers.Add(type, serializer);
    }
    using (var stream = new MemoryStream())
    {
        serializer.Serialize(stream, obj);
        stream.Position = 0;
        // MD5 produces exactly the 16 bytes a Guid requires.
        return new Guid(_md5.ComputeHash(stream));
    }
}
Just about all serializers have their drawbacks. XmlSerializer is not capable of serializing cyclic object graphs, DataContractSerializer requires your types to have dedicated attributes, and the old serializers based on SerializableAttribute likewise require that attribute to be set. You somehow have to make assumptions.
I am looking for a pre-existing .NET 'hash set'-type implementation suitable for atomizing a general type T. We have a large number of identical objects coming in from serialized sources that need to be atomized to conserve memory.
A Dictionary<T,T> with the value == key works perfectly, but the objects in these collections can run into the millions across the app, so it seems very wasteful to store two references to every object.
HashSet cannot be used, as it only has Contains; there is (as far as I can tell) no way to get at the actual member instance.
Obviously I could roll my own, but wanted to check if there was anything pre-existing. A scan of C5 didn't turn up anything, but then their 250+ page documentation does make me wonder if I've missed something.
EDIT The fundamental idea is I need to be able to GET THE UNIQUE OBJECT BACK, i.e. HashSet has Contains(T obj) but not Get(T obj) /EDIT
The collection at worst only needs to implement:
T GetOrAdd(T candidate)
void Clear()
And take an arbitrary IComparer
And GetOrAdd is ~O(1) and would ideally be atomic, i.e. it doesn't waste time hashing twice.
EDIT Failing an existing implementation, any recommendations on sources for the basic hashing/bucketing mechanics would be appreciated. The Mono HashSet source has been pointed out for this, and thus this section is answered /EDIT
You can take the source code of HashSet<T> from Reference Source and write your own GetOrAdd method.
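As a starting point, here is a minimal sketch of such a GetOrAdd: one stored reference per entry, one hash computation per call, chained buckets (InternSet is an illustrative name, and this is far less polished than the Reference Source code):

using System;
using System.Collections.Generic;

sealed class InternSet<T> where T : class
{
    private struct Slot { public int HashCode; public int Next; public T Value; }

    private int[] _buckets = new int[64];   // 1-based indices into _slots, 0 = empty
    private Slot[] _slots = new Slot[64];
    private int _count;
    private readonly IEqualityComparer<T> _comparer;

    public InternSet(IEqualityComparer<T> comparer)
    {
        _comparer = comparer ?? EqualityComparer<T>.Default;
    }

    public T GetOrAdd(T candidate)
    {
        // Hash exactly once; reuse it for lookup, insert and resize.
        int hash = _comparer.GetHashCode(candidate) & 0x7FFFFFFF;
        int bucket = hash % _buckets.Length;
        for (int i = _buckets[bucket]; i != 0; i = _slots[i - 1].Next)
        {
            Slot s = _slots[i - 1];
            if (s.HashCode == hash && _comparer.Equals(s.Value, candidate))
                return s.Value;                     // already atomized
        }
        if (_count == _slots.Length) { Resize(); bucket = hash % _buckets.Length; }
        _slots[_count] = new Slot { HashCode = hash, Next = _buckets[bucket], Value = candidate };
        _buckets[bucket] = ++_count;
        return candidate;
    }

    public void Clear()
    {
        Array.Clear(_buckets, 0, _buckets.Length);
        Array.Clear(_slots, 0, _slots.Length);      // drop object references
        _count = 0;
    }

    private void Resize()
    {
        var newSlots = new Slot[_slots.Length * 2];
        Array.Copy(_slots, newSlots, _count);
        _slots = newSlots;
        _buckets = new int[_buckets.Length * 2];
        for (int i = 0; i < _count; i++)
        {
            int bucket = _slots[i].HashCode % _buckets.Length;
            _slots[i].Next = _buckets[bucket];
            _buckets[bucket] = i + 1;
        }
    }
}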
I'm implementing a special case of an immutable dictionary, which for convenience implements IEnumerable<KeyValuePair<Foo, Bar>>. Operations that would ordinarily modify the dictionary should instead return a new instance.
So far so good. But when I try to write a fluent-style unit test for the class, I find that neither of the two fluent assertion libraries I've tried (Should and Fluent Assertions) supports the NotBeSameAs() operation on objects that implement IEnumerable -- not unless you first cast them to Object.
When I first ran into this, with Should, I assumed that it was just a hole in the framework, but when I saw that Fluent Assertions had the same hole, it made me think that (since I'm a relative newcomer to C#) I might be missing something conceptual about C# collections; the author of Should implied as much when I filed an issue.
Obviously there are other ways to test this -- cast to Object and use NotBeSameAs(), just use Object.ReferenceEquals, whatever -- but if there's a good reason not to, I'd like to know what that is.
An IEnumerable<T> is not necessarily a real object. IEnumerable<T> guarantees that you can enumerate through its elements. In simple cases you have a container class like a List<T> that is already materialized; then you could compare both lists' references. However, your IEnumerable<T> might also point to a sequence of commands that will be executed once you enumerate. Basically a state machine:
public IEnumerable<int> GetInts()
{
yield return 10;
yield return 20;
yield return 30;
}
If you save this in a variable, you don't have a comparable object (everything is an object, so you do... but it's not meaningful):
var x = GetInts();
Your comparison only works for materialized (.ToList() or .ToArray()) IEnumerables, because those state machines have been evaluated and their results saved to a collection. So yes, the library actually makes sense: if you know you have materialized IEnumerables, you need to make this knowledge public by casting them to Object and calling the desired function on that object "manually".
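A small demonstration of the difference, reusing the GetInts iterator from above:

using System;
using System.Collections.Generic;
using System.Linq;

class Demo
{
    static IEnumerable<int> GetInts()
    {
        yield return 10;
        yield return 20;
        yield return 30;
    }

    static void Main()
    {
        // Each call creates a fresh state machine, so references differ
        // even though both sequences produce identical values.
        IEnumerable<int> first = GetInts();
        IEnumerable<int> second = GetInts();
        Console.WriteLine(ReferenceEquals(first, second)); // False

        // Materialized results are ordinary objects, so reference
        // comparison is meaningful again.
        List<int> a = first.ToList();
        List<int> b = second.ToList();
        Console.WriteLine(ReferenceEquals(a, b)); // False: two distinct lists
        Console.WriteLine(ReferenceEquals(a, a)); // True: same object
    }
}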
In addition to what Jon Skeet suggested, take a look at this February 2013 MSDN article from Ted Neward:
.NET Collections, Part 2: Working with C5
Immutable (Guarded) Collections
With the rise of functional concepts
and programming styles, a lot of emphasis has swung to immutable data
and immutable objects, largely because immutable objects offer a lot
of benefits vis-à-vis concurrency and parallel programming, but also
because many developers find immutable objects easier to understand
and reason about. Corollary to that concept, then, follows the concept
of immutable collections—the idea that regardless of whether the
objects inside the collection are immutable, the collection itself is
fixed and unable to change (add or remove) the elements in the
collection. (Note: You can see a preview of immutable collections
released on NuGet in the MSDN Base Class Library (BCL) blog at
bit.ly/12AXD78.)
It describes the use of an open source library of collection goodness called C5.
Look at http://itu.dk/research/c5/
For a small set of key/value pairs (default 2, max 5), a Dictionary<TKey, TValue> seems like overkill. Is there a much simpler data structure that could be used in my case ? I'm caching computed values for certain objects (i.e. <MyClass, double>), so retrieval speed is important.
Thanks
A List<KeyValuePair<TKey, TValue>> (created with an appropriate capacity) would probably work just as well in this case... but it wouldn't be terribly idiomatic. (Just to be clear, you'd simply call Equals on each key element, ignoring the hash code completely.) If List<T> feels a bit heavy to you, you could even go down to KeyValuePair<TKey, TValue>[] if you wanted. Ick, but hey... it's your code.
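If it helps to visualize, a minimal sketch of that list-based approach (TinyMap is an illustrative name):

using System.Collections.Generic;

// A linear scan calling Equals on each key, no hashing at all.
// For 2-5 entries this is cheap.
sealed class TinyMap<TKey, TValue>
{
    private readonly List<KeyValuePair<TKey, TValue>> _items =
        new List<KeyValuePair<TKey, TValue>>(5);

    public void Add(TKey key, TValue value)
    {
        _items.Add(new KeyValuePair<TKey, TValue>(key, value));
    }

    public bool TryGetValue(TKey key, out TValue value)
    {
        var cmp = EqualityComparer<TKey>.Default;
        foreach (var pair in _items)
        {
            if (cmp.Equals(pair.Key, key)) { value = pair.Value; return true; }
        }
        value = default(TValue);
        return false;
    }
}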
Have you actually tried Dictionary<TKey, TValue> and found it to be too slow? "Seems like overkill" doesn't seem nearly as good an argument as "I've tried it, profiled it, and found an unacceptable amount of my application's time is spent creating dictionaries and looking up entries in them. I need my application to have performance characteristic X and at the moment I only have Y."
If your key type has a particular ordering (and if you were going to perform more lookups on the data structure than you were going to create instances) you could sort the list, meaning you would have a maximum of 3 comparisons for any particular lookup. With only 5 entries you could even hard-code all the potential paths, if you were looking to optimize to the hilt. (You might even have different implementations for 2, 3, 4 and 5 elements. It's getting a little silly at that point though.) This is basically a SortedList<TKey, TValue> implementation, but you may be able to optimize it a little for your scenario of only having a few entries. Again, it's worth trying the built-in types first.
What's vital is that you know how important this part of your code really is to your overall performance - and when it will be "good enough" so you can stop appropriately.
If the set of keys is known at compile time, then you could simply create a class (or struct) with nullable properties that hold the values.
If you use an array like KeyValuePair<TKey, TValue>[], you could sort it and then search using a binary search. But this is only fast if you have to sort once and then retrieve many times.
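For illustration, a sketch of that sort-once / binary-search approach with string keys (the type names here are illustrative):

using System;
using System.Collections.Generic;

class SortedPairsDemo
{
    // Orders pairs by key only; the value plays no part in the comparison.
    sealed class ByKey : IComparer<KeyValuePair<string, double>>
    {
        public int Compare(KeyValuePair<string, double> x, KeyValuePair<string, double> y)
        {
            return string.CompareOrdinal(x.Key, y.Key);
        }
    }

    static void Main()
    {
        var pairs = new[]
        {
            new KeyValuePair<string, double>("beta", 2.0),
            new KeyValuePair<string, double>("alpha", 1.0),
        };
        var byKey = new ByKey();
        Array.Sort(pairs, byKey);          // pay the sort cost once

        // O(log n) lookup; the value in the probe pair is ignored.
        int i = Array.BinarySearch(pairs, new KeyValuePair<string, double>("alpha", 0), byKey);
        if (i >= 0) Console.WriteLine(pairs[i].Value); // 1.0
    }
}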
I often like to use the Hashtable class (http://msdn.microsoft.com/en-us/library/system.collections.hashtable.aspx).
Another thing you can do, though, is not worry about managing any cache yourself and just use the caching from ASP.NET. All you have to do is reference the System.Web assembly, and then you can use it even in non-web applications. Here's an article on using the .NET cache code in a Windows Forms app. It's really simple.
http://www.codeproject.com/KB/cs/cacheinwinformapps.aspx
D.
I'm using a HashSet<T> to store a collection of objects. These objects already have a unique ID of System.Guid, so I'd rather the HashSet<> just use that existing ID rather than trying to figure out for itself how to hash the object. How do I override the built-in hashing and force my program to use the built-in ID value as the hash value?
Also, say I know the Guid of an object in my HashSet<>: is there a way to get an object from a HashSet<T> based on this Guid alone? Or should I use a dictionary instead?
A HashSet<> is not based on key/value pairs and provides no "by key" access; it is just a set of unique values, using the hash to check containment very quickly.
To use a key/value pair (to fetch by Guid later), the simplest option would be a Dictionary<Guid, SomeType>. The existing hash code on Guid should be fine (although if you needed to (you don't here), you could provide an IEqualityComparer<T> to use for hashing).
Override the GetHashCode() method for your object.
Of course, there's a slight wrinkle here... GUIDs are larger than Int32s, which .NET uses for hash codes.
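A sketch of that override with a hypothetical Entity type; note how the Guid's 128 bits get folded down to the 32-bit hash, so equality must still be checked separately:

using System;

sealed class Entity
{
    public Guid Id { get; private set; }

    public Entity(Guid id) { Id = id; }

    public override int GetHashCode()
    {
        // Delegates identity entirely to the existing Guid; Guid.GetHashCode
        // folds 128 bits into the 32 bits .NET expects.
        return Id.GetHashCode();
    }

    public override bool Equals(object obj)
    {
        Entity other = obj as Entity;
        return other != null && other.Id == Id;
    }
}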
Why do you need to override this? It seems like perhaps a premature optimization.
Yeah, just use a dictionary. Once you develop your application, go through a performance-tuning phase where you measure the performance of all your code. If and only if this hashing function shows up as your largest drain should you consider a more performant data structure (if there is one anyway) :-)
Try looking into KeyedCollection<TKey, TItem> (in System.Collections.ObjectModel). It allows you to embed the knowledge of the key field into your collection implementation.
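A minimal sketch of that approach, with a hypothetical Entity type keyed by its Guid:

using System;
using System.Collections.ObjectModel;

// Hypothetical item type carrying its own Guid identity.
sealed class Entity
{
    public Guid Id { get; private set; }
    public Entity(Guid id) { Id = id; }
}

// KeyedCollection stores each item once and maintains an internal
// dictionary from the key that GetKeyForItem extracts.
sealed class EntityCollection : KeyedCollection<Guid, Entity>
{
    protected override Guid GetKeyForItem(Entity item)
    {
        return item.Id;
    }
}

// Usage: var c = new EntityCollection();
//        var e = new Entity(Guid.NewGuid());
//        c.Add(e);
//        Entity found = c[e.Id];   // lookup by Guid, item stored once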