Fast CRC of a Drawing.Bitmap - C#

On application startup I build a cache of icons (24x24, 32bpp, pre-multiplied ARGB bitmaps). This cache contains roughly a thousand items. I don't want to store the same image multiple times in this cache, for both memory and performance reasons. I figured the best way would be to create some sort of CRC from each bitmap as it goes into the cache, and to compare new bitmaps against this list of CRCs.
What is a good (and fast) way to create a crc from a bitmap which is only loaded in memory?
Or am I totally on the wrong track and is there a better way to build a bitmap-cache?

While I would echo what Hans has said, I believe you can do this, but CRC is a bad algorithm to use.
You can instead create an MD5 hash of the bytes of the generated bitmap. By my calculations your images must be a minimum of about 2KB in size. To generate a hash you can either calculate it across the whole bitmap, or you can be sneaky and do it on every nth byte - which would be faster on the hash side but probably heavier on memory usage, as you'll have to extract those bytes into a new array.
If you were going to skip every nth byte, I would use 4 or 2 - using 4 means you read one component from each consecutive pixel, using 2 means you read two components from each consecutive pixel.
However, MD5 is very fast and you might find (and I would benchmark this in a unit test) that just hashing across the whole bitmap will be faster.
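A minimal sketch of the whole-bitmap approach, assuming System.Drawing and the standard MD5 class (the LockBits/Marshal.Copy pattern is just one way of getting at the raw pixel bytes):

using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.Runtime.InteropServices;
using System.Security.Cryptography;

static string HashBitmap(Bitmap bitmap)
{
    // Lock the pixel data so it can be copied out in one go.
    Rectangle rect = new Rectangle(0, 0, bitmap.Width, bitmap.Height);
    BitmapData data = bitmap.LockBits(rect, ImageLockMode.ReadOnly, bitmap.PixelFormat);
    try
    {
        byte[] pixels = new byte[data.Stride * data.Height];
        Marshal.Copy(data.Scan0, pixels, 0, pixels.Length);

        using (MD5 md5 = MD5.Create())
            return Convert.ToBase64String(md5.ComputeHash(pixels));
    }
    finally
    {
        bitmap.UnlockBits(data);
    }
}

The Base64 string then serves as the cache key: two bitmaps of the same dimensions with identical pixel data produce the same hash.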
The only thing is, I can't see how you can check in advance whether you should generate a given bitmap without already knowing its hash - and the only way you can know its hash is to generate the bitmap. In which case, by that point, you might as well just save the new image out. An extra element in your image cache array isn't going to break the universe.
What you really need to be able to do here, to actually save space and startup time, is to know in advance of generating an image whether it's going to be the same as another. Given that these images are generated dynamically, is it the case that, when two identical images are generated, they are generated by the same method call with the same parameters?
If so, you could instead look at tagging each generated image with one or more hash codes (using object.GetHashCode()): one for the MethodInfo of the method that generates the image (you can get that inside the method itself by calling MethodBase.GetCurrentMethod()), along with one for each parameter that was passed in. The hash code for the method is quite reliable, as it uses the runtime's method handle (which is unique for each method) - the only hash code compression that can occur there is on 64-bit machines, where the handle is 64 bits but the hash code is 32. In practice, though, such a collision rarely occurs, since you'd have to have a huge amount of code in the application for the first 32 bits of two separate method handles to be the same.
The hash codes of the individual parameters, of course, are far less reliable unless those parameter types have good hash code functions.
This solution would by no means be perfect (at worst you'd still get some duplicates), but I reckon it would speed things up. Like I say, though, it relies on your duplicated images always being generated by the same calls.
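A rough sketch of that idea, with hypothetical names (CreateIcon, RenderIcon, IconCache) - the key is built from the generating method's hash code combined with the arguments' hash codes, and is checked before any rendering happens:

using System.Collections.Generic;
using System.Drawing;
using System.Drawing.Imaging;
using System.Reflection;

static readonly Dictionary<int, Bitmap> IconCache = new Dictionary<int, Bitmap>();

static Bitmap CreateIcon(Color tint, int size)
{
    // Combine the method's hash code with each parameter's hash code.
    int key = MethodBase.GetCurrentMethod().GetHashCode();
    key = (key * 397) ^ tint.GetHashCode();
    key = (key * 397) ^ size.GetHashCode();

    Bitmap cached;
    if (IconCache.TryGetValue(key, out cached))
        return cached;

    Bitmap bitmap = RenderIcon(tint, size);   // the (hypothetical) expensive generation step
    IconCache[key] = bitmap;
    return bitmap;
}

static Bitmap RenderIcon(Color tint, int size)
{
    Bitmap bmp = new Bitmap(size, size, PixelFormat.Format32bppPArgb);
    using (Graphics g = Graphics.FromImage(bmp))
        g.Clear(tint);
    return bmp;
}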

A CRC has the same flaw as any hashing function: an equal CRC value does not prove that the images are identical. Your program will randomly, but infrequently, display the wrong image.
You need something else. Like the filename from which you retrieved the bitmap.

Related

Most efficient way to handle large arrays of data in C#?

Currently I am using XNA Game Studio 4.0 with C# Visual Studio 2010. I want to use a versatile method for handling triangles. I am using a preset array of VertexPositionColor items passed through the GraphicsDevice.DrawUserPrimitives() method, which only handles arrays. Because arrays are fixed in size, and I wanted a very large space to arbitrarily add new triangles, my original idea was to make a large array, specifically
VertexPositionColor[] vertices = new VertexPositionColor[int.MaxValue];
but that ran my application out of memory. So what I'm wondering is how to approach this memory/performance issue best.
Is there an easy way to increase the amount of memory allocated to the stack whenever my program runs?
Would it be beneficial to store the array on the heap instead? And would I have to build my own allocator if I wanted to do that?
Or is my best approach simply to use a LinkedList and deal with the extra processing required to copy it to an array every frame?
I hit this building my voxel engine code.
Consider the problem I had:
Given an unknown volume size that would clearly be bigger than the amount of memory the computer had, how do I manage that volume of data?
My solution was to use sparse chunking. For example:
In my case, instead of using an array, I used a dictionary.
This way I could look up the values based on a key - say, the hash code of a voxel's position - and the value was the voxel itself.
This meant that the voxels were fast to pull out, and self-organised by the language/compiler into an indexed set.
It also means that when pulling data back out I could default to Voxel.Empty for voxels that hadn't yet been assigned.
In your case you might not need a default value but using a dictionary might prove more helpful than an array.
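As a rough sketch of the sparse approach (the Voxel type and the coordinate-packing key here are just illustrative placeholders):

using System.Collections.Generic;

struct Voxel
{
    public byte Material;
    public static readonly Voxel Empty = new Voxel();
}

class SparseVolume
{
    private readonly Dictionary<long, Voxel> voxels = new Dictionary<long, Voxel>();

    // Pack the three coordinates into one 64-bit key (21 bits each).
    private static long Key(int x, int y, int z)
    {
        return ((long)(x & 0x1FFFFF) << 42) | ((long)(y & 0x1FFFFF) << 21) | (long)(z & 0x1FFFFF);
    }

    public Voxel Get(int x, int y, int z)
    {
        Voxel value;
        // Only assigned voxels are stored; everything else defaults to Empty.
        return voxels.TryGetValue(Key(x, y, z), out value) ? value : Voxel.Empty;
    }

    public void Set(int x, int y, int z, Voxel value)
    {
        voxels[Key(x, y, z)] = value;
    }
}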
The upshot: arrays are a tad faster for some things, but when you consider all of your usage scenarios for the data, you may find that overall the gains of using a dictionary are worth a slight allocation cost.
In testing I found that if I was prepared to drop from something like 100ms per thousand to, say, 120ms per thousand on allocations, I could then retrieve the data 100% faster for most of the queries I was performing on the set.
Reason for my suggestion here:
It looks like you don't know the size of your data set, and an array only makes sense if you do know the size. Otherwise you tie up needless pre-allocated chunks of RAM just to make your code ready for any eventuality you want to throw at it.
Hope this helps.
You may try List<T> and the ToArray() method associated with it. It's supported by the XNA framework too (MSDN).
List<T> is the successor to ArrayList; it provides more features and is strongly typed (a good comparison).
About performance, List<T>.ToArray is an O(n) operation. I suggest you break your lengthy array into portions that you can name with a key (some sort of unique identifier for a region or so on), store the relevant vertices in a List, and use a Dictionary like Dictionary<Key, List<T>>, which reduces the amount of data involved in each operation. You can also process the required models with a priority-based approach, which would give a performance gain over processing the complete array at once.
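Something along these lines, assuming XNA's VertexPositionColor and a purely illustrative string region key:

using System.Collections.Generic;
using Microsoft.Xna.Framework.Graphics;

Dictionary<string, List<VertexPositionColor>> regions =
    new Dictionary<string, List<VertexPositionColor>>();

void AddTriangle(string regionKey, VertexPositionColor a, VertexPositionColor b, VertexPositionColor c)
{
    List<VertexPositionColor> list;
    if (!regions.TryGetValue(regionKey, out list))
    {
        list = new List<VertexPositionColor>();
        regions[regionKey] = list;
    }
    list.Add(a);
    list.Add(b);
    list.Add(c);
}

void DrawRegion(GraphicsDevice device, string regionKey)
{
    List<VertexPositionColor> list;
    if (!regions.TryGetValue(regionKey, out list) || list.Count == 0)
        return;

    // The O(n) copy only covers the region being drawn, not the whole data set.
    VertexPositionColor[] vertices = list.ToArray();
    device.DrawUserPrimitives(PrimitiveType.TriangleList, vertices, 0, vertices.Length / 3);
}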

Translating C to C# and HLSL: will this be possible?

I've taken on quite a daunting challenge for myself. In my XNA game, I want to implement Blargg's NTSC filter. This is a C library that transforms a bitmap to make it look like it was output on a CRT TV with the NTSC standard. It's quite accurate, really.
The first thing I tried, a while back, was to just use the C library itself by calling it as a DLL. Here I had two problems: 1. I couldn't get some of the data to copy correctly, so the image was messed up, but more importantly, 2. it was extremely slow. It required getting the XNA Texture2D bitmap data, passing it through the filter, and then setting the data again to the texture. The framerate was ruined, so I couldn't go down this route.
Now I'm trying to translate the filter into a pixel shader. The problem here (if you're adventurous enough to look at the code - I'm using the SNES one because it's the simplest) is that it handles very large arrays, and relies on interesting pointer operations. I've done a lot of work rewriting the algorithm to work independently per pixel, as a pixel shader will require. But I don't know if this will ever work. I've come to you to see if finishing this is even possible.
There's a precalculated array involved containing 1,048,576 integers. Is this alone beyond any limits for the pixel shader? It only needs to be set once, not once per frame.
Even if that's ok, I know that HLSL cannot index arrays by a variable. It has to unroll it into a million if statements to get the correct array element. Will this kill the performance and make it a fruitless endeavor again? There are multiple array accesses per pixel.
Is there any chance that my original plan to use the library as is could work? I just need it to be fast.
I've never written a shader before. Is there anything else I should be aware of?
edit: Addendum to #2. I just read somewhere that not only can HLSL not access arrays by variable, but even to unroll it, the index has to be calculable at compile time. Is this true, or does the "unrolling" solve this? If it's true I think I'm screwed. Any way around that? My algorithm is basically a glorified version of "the input pixel is this color, so look up my output pixel values in this giant array."
From my limited understanding of shader languages, your problem can easily be solved by using a texture instead of an array.
Pregenerate it on the CPU and then save it as a texture - 1024x1024 in your case.
Use standard texture access functions as if the texture were the array, possibly with nearest-neighbour sampling to prevent blending of individual pixels.
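A sketch of that, assuming XNA: build the 1024x1024 lookup texture once at load time, packing each 32-bit table entry across the four colour channels:

using Microsoft.Xna.Framework;
using Microsoft.Xna.Framework.Graphics;

Texture2D BuildLookupTexture(GraphicsDevice device, int[] table)   // table.Length == 1024 * 1024
{
    Texture2D texture = new Texture2D(device, 1024, 1024, false, SurfaceFormat.Color);
    Color[] pixels = new Color[table.Length];
    for (int i = 0; i < table.Length; i++)
    {
        // Spread each 32-bit entry across the four 8-bit channels.
        uint v = (uint)table[i];
        pixels[i] = new Color((int)(v & 0xFF), (int)((v >> 8) & 0xFF),
                              (int)((v >> 16) & 0xFF), (int)(v >> 24));
    }
    texture.SetData(pixels);
    return texture;
}

In the shader the lookup index then becomes a texture coordinate (index % 1024, index / 1024), and binding the texture with SamplerState.PointClamp keeps the values from being interpolated.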
I don't think this is possible if you want speed.

Compare files byte by byte or read all bytes?

I came across this code http://support.microsoft.com/kb/320348 which made me wonder what would be the best way to compare 2 files in order to figure out if they differ.
The main idea is to optimize my program, which needs to verify whether files are equal or not in order to create a list of changed files and/or files to delete/create.
Currently I compare the sizes of the files and, if they match, I go on to an MD5 checksum of the 2 files. But after looking at the code linked at the beginning of this question, I started wondering whether it is really worth using that over just creating a checksum of the 2 files (which basically happens after you've read all the bytes anyway)?
Also, what other checks should I make to reduce the work of checking each file?
Read both files into a small buffer (4K or 8K), which is optimised for reading, and then compare the buffers in memory (byte by byte), which is optimised for comparing.
This will give you optimum performance for all cases (where difference is at the start, middle or the end).
Of course, the first step is to check whether the file lengths differ; if that's the case, the files are indeed different.
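A minimal sketch of that approach (it assumes local FileStreams, which in practice fill the buffer on each read until end of file):

using System.IO;

static bool FilesAreEqual(string pathA, string pathB)
{
    const int BufferSize = 8192;

    using (FileStream a = new FileStream(pathA, FileMode.Open, FileAccess.Read))
    using (FileStream b = new FileStream(pathB, FileMode.Open, FileAccess.Read))
    {
        // Cheap check first: different lengths means different files.
        if (a.Length != b.Length)
            return false;

        byte[] bufferA = new byte[BufferSize];
        byte[] bufferB = new byte[BufferSize];

        int read;
        while ((read = a.Read(bufferA, 0, BufferSize)) > 0)
        {
            b.Read(bufferB, 0, read);
            for (int i = 0; i < read; i++)
                if (bufferA[i] != bufferB[i])
                    return false;     // bail out at the first differing block
        }
        return true;
    }
}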
If you haven't already computed hashes of the files, then you might as well do a proper comparison (instead of looking at hashes), because if the files are the same it's the same amount of work, but if they're different you can stop much earlier.
Of course, comparing a byte at a time is probably a bit wasteful - it's a good idea to read whole blocks at a time and compare them.

Simple 2-color differential image compression

Is there an efficient, quick and simple example of doing differential b/w image compression? Or even better, some simple (but lossless - jagged 1bpp images don't look very convincing when compressed using lossy compression) streaming technique which could accept a number of frames as input?
I have a simple b/w image (320x200) stream, displaying something similar to a LED display, which is updated about once a second using AJAX. Images are pretty similar most of the time, so if I subtracted them, the result would compress pretty well (even with simple RLE). Is something like this available?
I don't know of any library that already exists that can do what you're asking other than just running it through gzip or some other lossless compression algorithm. However, since you know that the frames are highly correlated, you could XOR the frames like Conspicuous Compiler suggested and then run gzip on that. If there are few changes between frames, the result of the XOR should have a great deal less entropy than the original frame. This will allow gzip or another lossless compression algorithm to achieve a higher compression ratio.
You would also want to send a key (non-differential) frame every once in a while so you can resynchronize in the event of errors.
If you are just interested in learning about compression, you could try implementing the RLE after XORing the frames. Check out the bit-level RLE discussed here, about halfway down the page. It should be pretty easy to implement, as it just stores a 7-bit length and a 1-bit value in each byte, so it could achieve a best-case compression ratio of 128/8 = 16 if there are no changes between frames.
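A sketch of the XOR-then-RLE idea, using the 1-bit-value / 7-bit-length codes described above (frame buffers are assumed to be 1bpp, so 320x200 / 8 = 8000 bytes each):

using System.Collections.Generic;

static byte[] XorFrames(byte[] previous, byte[] current)
{
    byte[] diff = new byte[current.Length];
    for (int i = 0; i < current.Length; i++)
        diff[i] = (byte)(previous[i] ^ current[i]);   // unchanged pixels become 0 bits
    return diff;
}

static List<byte> RunLengthEncodeBits(byte[] data)
{
    List<byte> output = new List<byte>();
    int runValue = (data[0] >> 7) & 1;
    int runLength = 0;

    for (int i = 0; i < data.Length * 8; i++)
    {
        int bit = (data[i / 8] >> (7 - (i % 8))) & 1;
        if (bit == runValue && runLength < 127)
        {
            runLength++;
        }
        else
        {
            // Emit one byte: high bit is the value, low 7 bits are the run length.
            output.Add((byte)((runValue << 7) | runLength));
            runValue = bit;
            runLength = 1;
        }
    }
    output.Add((byte)((runValue << 7) | runLength));
    return output;
}

A completely unchanged frame XORs to all zeros and encodes to roughly 64000/127 ≈ 504 bytes, close to the best-case 16:1 ratio mentioned above.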
Another thought is that if there are very few changes, you may want to just encode the bit positions that flipped between frames. You can address any pixel of the 320x200 image with a 16-bit integer. For instance, if only 100 pixels change, you can just store 100 16-bit integers representing those positions (1600 bits), whereas the RLE discussed above would take 64000/16 = 4000 bits at the minimum (it would probably be quite a bit higher). You could actually switch between this method and RLE depending on the frame content.
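A sketch of the position-list alternative (same namespaces and 1bpp frame layout as the sketch above); the index y * 320 + x always fits in 16 bits for a 320x200 image:

static List<ushort> ChangedPositions(byte[] previous, byte[] current)
{
    // Assumes 1bpp frames of 320x200 = 64000 pixels (8000 bytes).
    List<ushort> changed = new List<ushort>();
    for (int i = 0; i < 64000; i++)
    {
        int bitA = (previous[i / 8] >> (7 - (i % 8))) & 1;
        int bitB = (current[i / 8] >> (7 - (i % 8))) & 1;
        if (bitA != bitB)
            changed.Add((ushort)i);   // 16 bits per changed pixel
    }
    return changed;
}

The decoder just flips those bits in its copy of the previous frame; comparing the encoded sizes per frame tells you which of the two methods to send.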
If you wanted to go beyond simple methods, I would suggest using variable-length codes to represent the possible runs during the run-length encoding. You could then assign shorter codes to the runs with the highest probability. This would be similar to the RLE used in JPEG or MPEG after the lossy part of the compression is performed (DCT and quantization).

C# Dictionary Memory Management

I have a Dictionary<string,int> that has the potential to contain upwards of 10+ million unique keys. I am trying to reduce the amount of memory that this takes, while still maintaining the functionality of the dictionary.
I had the idea of storing a hash of the string as a long instead; this decreases the app's memory usage to an acceptable amount (~1.5 gig to ~.5 gig), but I don't feel very good about my method for doing this.
long longKey = BitConverter.ToInt64(cryptoTransformSHA1.ComputeHash(enc.GetBytes(strKey)), 0);
Basically this chops off the end of a SHA1 hash, and puts the first chunk of it into a long, which I then use as a key. While this works, at least for the data I'm testing with, I don't feel like this is a very reliable solution due to the increased possibility for key collisions.
Are there any other ways of reducing the Dictionary's memory footprint, or is the method I have above not as horrible as I think it is?
[edit]
To clarify, I need to maintain the ability to look up a value contained in the Dictionary using a string. Storing the actual string in the dictionary takes way too much memory. What I would like to do instead is to use a Dictionary<long,int> where the long is the result of a hashing function on the string.
I have done something similar recently and, for a certain set of reasons that are fairly unique to my application, did not use a database. In fact I was trying to stop using a database. I have found that GetHashCode is significantly improved in 3.5. One important note: NEVER PERSIST THE RESULTS OF GetHashCode. NEVER EVER. They are not guaranteed to be consistent between versions of the framework.
So you really need to conduct an analysis of your data, since different hash functions might work better or worse on it. You also need to account for speed. As a general rule, cryptographic hash functions should not have many collisions even as the number of hashes moves into the billions. For things that I need to be unique I typically use SHA1Managed. In general the CryptoAPI has terrible performance, even if the underlying hash functions perform well.
For a 64-bit hash I currently use Lookup3 and FNV1, which are both 32-bit hashes, together. For a collision to occur both would need to collide, which is mathematically improbable and which I have not seen happen over about 100 million hashes. You can find the code for both publicly available on the web.
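For illustration only, this is the shape of that approach: FNV-1 is tiny (shown below in its common FNV-1a variant), while lookup3 is longer, so a simple stand-in second hash is used here just to show how two 32-bit results combine into one 64-bit key - substitute the real lookup3 in practice.

using System.Text;

static uint Fnv1a32(byte[] data)
{
    const uint offsetBasis = 2166136261;
    const uint prime = 16777619;
    uint hash = offsetBasis;
    foreach (byte b in data)
    {
        hash ^= b;
        hash *= prime;       // unchecked 32-bit overflow is intentional
    }
    return hash;
}

// Stand-in for the second, independent 32-bit hash (use lookup3 here in practice).
static uint SecondHash32(byte[] data)
{
    uint hash = 5381;
    foreach (byte b in data)
        hash = hash * 33 + b;
    return hash;
}

static long MakeKey(string key)
{
    byte[] bytes = Encoding.UTF8.GetBytes(key);
    ulong combined = ((ulong)Fnv1a32(bytes) << 32) | SecondHash32(bytes);
    return (long)combined;
}

The dictionary then becomes a Dictionary<long, int> keyed on MakeKey(strKey).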
Still, conduct your own analysis. What has worked for me may not work for you. Inside my office, different applications with different requirements actually use different hash functions or combinations of hash functions.
I would avoid any unproven hash functions. There are as many hash functions as there are people who think they should be writing them. Do your research and test, test, test.
With 10 million-odd records, have you considered using a database with a non-clustered index? Databases have a lot more tricks up their sleeve for this type of thing.
Hashing, by definition, and under any algorithm, has the potential of collisions - especially with high volumes. Depending on the scenario, I'd be very cautious of this.
Using the strings might take space, but it is reliable... if you are on x64 this needn't be too large (although it definitely counts as "big" ;-p)
By the way, cryptographic hashes / hash functions are exceptionally bad for dictionaries. They’re big and slow. By solving the one problem (size) you’ve only introduced another, more severe problem: the function won’t spread the input evenly any longer, thus destroying the single most important property of a good hash for approaching collision-free addressing (as you seem to have noticed yourself).
/EDIT: As Andrew has noted, GetHashCode is the solution for this problem, since that's its intended use. And as in a true dictionary, you will have to work around collisions. One of the best schemes for that is double hashing. Unfortunately, the only 100% reliable way will be to actually store the original values. Otherwise, you'd have created infinite compression, which we know can't exist.
Why don't you just use GetHashCode() to get a hash of the string?
With hashtable implementations I have worked with in the past, the hash brings you to a bucket, which is often a linked list of other objects that have the same hash. Hashes are not unique, but they are good enough to split your data up into very manageable lists (sometimes only 2 or 3 long) that you can then search through to find your actual item.
The key to a good hash is not its uniqueness, but its speed and distribution capabilities... you want it to distribute as evenly as possible.
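Purely as an illustration of the bucket idea (this is essentially what Dictionary<TKey, TValue> already does internally - you wouldn't normally write it yourself):

using System.Collections.Generic;

Dictionary<int, List<KeyValuePair<string, int>>> buckets =
    new Dictionary<int, List<KeyValuePair<string, int>>>();

void Add(string key, int value)
{
    int hash = key.GetHashCode();
    List<KeyValuePair<string, int>> bucket;
    if (!buckets.TryGetValue(hash, out bucket))
    {
        bucket = new List<KeyValuePair<string, int>>();
        buckets[hash] = bucket;
    }
    bucket.Add(new KeyValuePair<string, int>(key, value));
}

bool TryGetValue(string key, out int value)
{
    List<KeyValuePair<string, int>> bucket;
    if (buckets.TryGetValue(key.GetHashCode(), out bucket))
    {
        // Colliding keys share a bucket; a short linear search finds the exact one.
        foreach (KeyValuePair<string, int> pair in bucket)
        {
            if (pair.Key == key)
            {
                value = pair.Value;
                return true;
            }
        }
    }
    value = 0;
    return false;
}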
Just go get SQLite. You're not likely to beat it, and even if you do, it probably won't be worth the time/effort/complexity.
SQLite.
