Sparse matrix compression with fast access time - C#

I'm writing a lexer generator as a spare-time project, and I'm wondering how to go about table compression. The tables in question are 2D arrays of short values and are very sparse. One dimension is always 256 (one entry per character); the other varies in size with the number of states in the lexer.
The basic requirements for the compression are that:
The data should be accessible without decompressing the full data set, and accessible in constant O(1) time.
The compressed table should be reasonably fast to compute.
I understand the row displacement method, which is what I currently have implemented. It might be my naive implementation, but what I have is horrendously slow to generate, although quite fast to access. I suppose I could make this go faster using some established algorithm for string searching such as one of the algorithms found here.
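For context, the access side of row displacement is roughly this (a simplified sketch, not my actual code; packed, rowOffset and check are illustrative names for the generated short[]/int[] tables):
// All state rows are overlaid into one packed array. rowOffset[state] says where
// that state's row begins, and check[] records which state owns each slot so that
// overlapping rows don't read each other's entries.
short Lookup(int state, byte ch)
{
    int index = rowOffset[state] + ch;
    return check[index] == state ? packed[index] : (short)0; // 0 = "no transition"
}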
I suppose an option would be to use a Dictionary, but that feels like cheating, and I would like the fast access times I'd get from straight arrays combined with some established algorithm. Perhaps I'm worrying needlessly about this.
From what I can gather, flex does not use this algorithm for its lexing tables. Instead it seems to use something called row/column equivalence, which I haven't really been able to find any explanation of.
I would really like to know how this row/column equivalence algorithm that flex uses works, or if there is any other good option that I should consider for this task.
Edit: To clarify more about what this data actually is: it is state information for state transitions in the lexer. The data needs to be stored in a compressed format in memory, since the state tables can potentially be huge. It's also from this memory that the actual values will be accessed directly, without decompressing the tables. I have a working solution using row displacement, but it's murderously slow to compute - in part due to my silly implementation.
Perhaps my implementation of the row displacement method will make it clearer how this data is accessed. It's a bit verbose and I hope it's OK that I've put it on pastebin instead of here.
The data is very sparse. It is usually a big run of zeroes followed by a few shorts for each state. It would be trivial to, for instance, run-length encode it, but that would spoil the constant-time access.
Flex apparently has two pairs of tables: base and default for the first pair, and next and check for the second. These tables seem to index one another in ways I don't understand. The dragon book attempts to explain this, but as is often the case with that tome of arcane knowledge, what it says is lost on lesser minds such as mine.
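As far as I can tell, the lookup over those four tables is supposed to be something like the sketch below, with default acting as a fall-back chain when a slot in next/check belongs to some other state - but this is my best guess, so I may well have it wrong:
// baseTable[s]   : offset of state s's row within next/check
// next[i]        : transition target stored in slot i
// check[i]       : the state that actually owns slot i
// defaultTable[s]: state to fall back to when the slot isn't ours
int NextState(int state, byte ch)
{
    while (check[baseTable[state] + ch] != state)
        state = defaultTable[state];    // follow the default chain
    return next[baseTable[state] + ch];
}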

This paper, http://www.syst.cs.kumamoto-u.ac.jp/~masato/cgi-bin/rp/files/p606-tarjan.pdf, describes a method for compressing sparse tables, and might be of interest.
Are your tables known beforehand, so that you just need an efficient way to store and access them?
I'm not really familiar with the problem domain, but if your table has a fixed size along one axis (256), would an array of size 256, where each element is a vector of variable length, work? Do you want to be able to pick out an element given an (x, y) pair?
Another cool solution that I've always wanted to use for something is a perfect hash table, http://burtleburtle.net/bob/hash/perfect.html, where you generate a hash function from your data, so you get minimal space requirements and O(1) lookups (i.e. no collisions).
None of these solutions employ any type of compression, though; they just minimize the amount of space wasted.
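To sketch the perfect hash idea a bit more concretely: since the table is known ahead of time, a generator can search offline for a multiplier that puts every occupied (state, character) pair into its own slot, and the runtime lookup stays O(1). All names below are made up and the hash function is only a placeholder:
// keys[] (int), values[] (short) and seed (uint) are produced offline; the generator
// keeps trying seeds until no two occupied keys collide, so the lookup never probes.
short Lookup(int state, byte ch)
{
    int key = (state << 8) | ch;
    int slot = (int)(((uint)key * seed) % (uint)keys.Length);
    return keys[slot] == key ? values[slot] : (short)0; // 0 = "no transition"
}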

What's unclear is whether your table has the "sequence property" in one dimension or the other.
The sequence property occurs naturally in human language, since a word is composed of many letters and the same sequence of letters is likely to appear again later. It's also very common in binary programs, source code, etc.
On the other hand, sampled data, such as raw audio, seismic values, etc., does not exhibit the sequence property. Such data can still be compressed, but using another model (such as a simple "delta model" followed by entropy coding).
If your data has the sequence property in either of the 2 dimensions, then you can use a common compression algorithm, which will give you both speed and reliability. You just need to provide it with an input which is "sequence friendly" (i.e. select your dimension).
If speed is a concern for you, you can have a look at this C# implementation of a fast compressor which is also a very fast decompressor : https://github.com/stangelandcl/LZ4Sharp
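If you go that route, making the input "sequence friendly" is mostly a matter of flattening the table along the dimension that has the long runs of zeroes; for example, something like this sketch:
// Flatten the table state-by-state so each state's run of zeroes stays contiguous.
// C# rectangular arrays are stored row-major, so Buffer.BlockCopy preserves that order.
static byte[] Flatten(short[,] table)
{
    var bytes = new byte[table.Length * sizeof(short)];
    Buffer.BlockCopy(table, 0, bytes, 0, bytes.Length);
    return bytes; // feed this to the compressor of your choice
}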


Perceptual image hashing

OK. This is part of a (non-English) OCR project. I have already completed preprocessing steps like deskewing, grayscaling, segmentation of glyphs, etc., and am now stuck at the most important step: identification of a glyph by comparing it against a database of glyph images. I therefore need to devise a robust and efficient perceptual image hashing algorithm.
For many reasons, the function I require won't be as complicated as required by the generic image comparison problem. For one, my images are always grayscale (or even B&W if that makes the task of identification easier). For another, those glyphs are more "stroke-oriented" and have simpler structure than photographs.
I have tried some of my own and some borrowed ideas for defining a good similarity metric. One method was to divide the image into a grid of M x N cells and take average "blackness" of each cell to create a hash for that image, and then take Euclidean distance of the hashes to compare the images. Another was to find "corners" in each glyph and then compare their spatial positions. None of them have proven to be very robust.
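For reference, the grid approach I tried was essentially the following (a simplified sketch of the idea, not my actual code; it assumes a grayscale image at least gridW x gridH pixels in size):
// Average "blackness" per grid cell; two hashes are compared by Euclidean distance.
static double[] GridHash(byte[,] pixels, int gridW, int gridH)
{
    int h = pixels.GetLength(0), w = pixels.GetLength(1);
    var sum = new double[gridW * gridH];
    var count = new int[gridW * gridH];
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
        {
            int cell = (y * gridH / h) * gridW + (x * gridW / w);
            sum[cell] += (255 - pixels[y, x]) / 255.0; // 1.0 = black, 0.0 = white
            count[cell]++;
        }
    for (int i = 0; i < sum.Length; i++)
        sum[i] /= count[i]; // average blackness of each cell
    return sum;
}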
I know there are stronger candidates like SIFT and SURF out there, but I have 3 good reasons not to use them. One is that I guess they are proprietary (or somehow patented) and cannot be used in commercial apps. Second is that they are very general purpose and would probably be overkill for my somewhat simpler domain of images. Third is that there are no implementations available (I'm using C#). I have even tried to convert the pHash library to C# but remained unsuccessful.
So I'm finally here. Does anyone know of code (C#, C++, Java or VB.NET, but it shouldn't require any dependencies that cannot be used in the .NET world), a library, algorithm, method or idea for creating a robust and efficient hashing algorithm that can survive minor visual defects like translation, rotation, scaling, blur, spots, etc.?
It looks like you've already tried something similar to this, but it may still be of some use:
https://www.memonic.com/user/aengus/folder/coding/id/1qVeq

Word game mechanics in XNA

I'm planning on making a casual word game for WP7 using XNA. The game mechanics are easy enough for me to implement; what I'm stuck on is checking whether the word the player makes is actually a word or not.
I thought about having a text file and loading that into memory at the start, but surely that wouldn't be possible to keep in memory on a phone? Also, how slow would it be to read from it to see if something is a word? How would the words be stored in memory? Would it be best to use a dictionary/hashmap where each key is a word and I just check whether that key exists? Or should I put them in an array?
Stuck on the best way to implement this, so any input is appreciated. Thanks
Depending on your phone's hardware, you could probably just load a text file into memory. The English language probably has only a couple hundred thousand words. Assuming your average word is around 5 characters or so, that's roughly a meg of data. You will have overhead managing that file in memory, but that's where the specifics of the hardware matter. BTW, it's not uncommon for the current generation of phones to have a gig of RAM.
Please see the following related SO questions which require a text file for a dictionary of words.
Dictionary text file
Putting a text file into memory, even of a whole dictionary, shouldn't be too bad as seth flowers has said. Choosing an appropriate data structure to hold the words will be important.
I would not recommend a dictionary using words as keys... that's kind of silly, honestly. If you only have keys and no values, what good is a dictionary? However, you may be on a good track with the Dictionary idea. The first thing I would try would be a Dictionary<char, string[]>, where the key is the first letter, and the value is a list of all words beginning with that letter. Of course, those arrays will be very long, and a linear search through them slow (though lookup on the key should be zippy, as char hashes are unique). The advantage is that, if you use a proper .txt dictionary file and load each word in order, you will know that each list is in alphabetical order. So you can use efficient search techniques like binary search, or any number of searches formulated for pre-sorted lists. It may not be that slow in the end.
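A rough sketch of that first-letter Dictionary idea (assuming the word list is already sorted in ordinal order and loaded line by line however works on the phone; uses System, System.Linq and System.Collections.Generic):
class WordList
{
    private readonly Dictionary<char, string[]> index;

    public WordList(IEnumerable<string> sortedWords)
    {
        // Group the (already sorted) word list by first letter;
        // each bucket stays sorted, so it can be binary searched later.
        index = sortedWords.GroupBy(w => w[0])
                           .ToDictionary(g => g.Key, g => g.ToArray());
    }

    public bool IsWord(string candidate)
    {
        // Hash to the right bucket, then binary search the sorted array.
        string[] bucket;
        return index.TryGetValue(candidate[0], out bucket)
            && Array.BinarySearch(bucket, candidate, StringComparer.Ordinal) >= 0;
    }
}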
If you want to go further, though, you can use the structure which underlies predictive text. It's called a Patricia Trie, or Radix Trie (Wikipedia). Starting with the first letter, you work your way through all possible branches until you either:
assemble the word the user entered, so it is a valid word
reach the end of the branch; this word does not exist.
'Tries' were made to address this sort of problem. I've never represented one in code, so I'm afraid I can't give you any pointers (ba dum tsh!), but there's likely a wealth of information on how to do it available on the internet. Using a Trie will likely be the most efficient solution, but if you find that an alphabet Dictionary like I mentioned above is sufficiently fast using binary search, you might just want to stick with that for now while you develop the actual gameplay. Getting bogged down with finding the best solution when just starting your game tends to bleed off your passion for getting it done. If you run into performance issues, then you make improvements-- at least that's my philosophy when designing games.
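To give a rough idea of the shape, here's an untested, bare-bones sketch (note it's a plain trie; a real Patricia/radix trie collapses chains of single-child nodes into one node):
class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public bool IsWord; // true if the path from the root to here spells a valid word
}

class Trie
{
    private readonly TrieNode root = new TrieNode();

    public void Add(string word)
    {
        TrieNode node = root;
        foreach (char c in word)
        {
            TrieNode next;
            if (!node.Children.TryGetValue(c, out next))
                node.Children[c] = next = new TrieNode();
            node = next;
        }
        node.IsWord = true;
    }

    public bool Contains(string word)
    {
        TrieNode node = root;
        foreach (char c in word)
            if (!node.Children.TryGetValue(c, out node))
                return false; // fell off the end of a branch: not a word
        return node.IsWord;
    }
}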
The nice thing is, since Windows Phone supports only essentially 2 different specs, once you test the app and see it runs smoothly on them, you really don't have to worry about optimizing for any worse conditions. So use what works!
P.S.: on Windows Phone, loading text files is tricky. Here is a post on the issue which should help you.

Generate a Hashcode for a string that is platform independent

We have an application that
Generates a hash code on a string
Saves that hash code into a DB along with associated data
Later queries the DB using the string's hash code to retrieve the data
This is obviously a bug, because the value returned from string.GetHashCode() varies across .NET versions and architectures (32/64 bit). To complicate matters, we're too close to a release to refactor our application to stop serializing hash codes and just query on the strings instead. What we'd like to do is come up with a quick and dirty fix for now, and refactor the code later to do it the right way.
The quick and dirty fix seems like creating a static GetInvariantHashCode(string s) helper method that is consistent across architectures.
Can anyone suggest an algorithm for generating a hash code on a string that is the same on 32-bit and 64-bit architectures?
A few more notes:
I'm aware that HashCodes are not unique. If a hashcode returns a match on two different strings, we post process the results to find the exact match. It is not used as a primary key.
I believe the architect's intent was to speed up the searches by querying on a long instead of an NVarChar
If the intent is just to speed up the searches by querying on a long instead of an NVarChar, then just let the database index the strings for you!
Look, I have no idea how large your domain is, but you're going to get collisions very rapidly with very high likelihood if it's of any decent size at all. It's the birthday problem with a lot of people relative to the number of birthdays. You're going to have collisions, and lose any gain in speed you might think you're gaining by not just indexing the strings in the first place.
Anyway, you don't need us if you're stuck a few days away from release and you really need an invariant hash code across platforms. There are really dumb, really fast implementations of hash codes out there that you can use. Hell, you could come up with one yourself in the blink of an eye:
string s = "Hello, world!";
int hash = 17;
foreach (char c in s) {
    unchecked { hash = hash * 23 + c.GetHashCode(); }
}
Or you could use the old Bernstein hash. And on and on. Are they going to give you the performance gain you're looking for? I don't know, they weren't meant to be used for this purpose. They were meant to be used for balancing hash tables. You're not balancing a hash table. You're using the wrong concept.
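(For reference, the Bernstein hash mentioned above is about as simple as it gets; a sketch:)
// Classic Bernstein (djb2) string hash: platform-independent by construction,
// since it only does integer arithmetic on the characters.
static int BernsteinHash(string s)
{
    unchecked
    {
        int hash = 5381;
        foreach (char c in s)
            hash = hash * 33 + c;
        return hash;
    }
}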
Edit (the below was written before the question was edited with new salient information):
You can't do this, at all, theoretically, without some kind of restriction on your input space. Your problem is far more severe than String.GetHashCode differing from platform to platform.
There are a lot of instances of string. In fact, way more instances than there are instances of Int32. So, because of the pigeonhole principle, you will have collisions. You can't avoid this: your strings are pigeons and your Int32 hash codes are pigeonholes, and there are too many pigeons to go in the pigeonholes without some pigeonhole getting more than one pigeon. Because of collision problems, you can't use hash codes as unique keys for strings. It doesn't work. Period.
The only way you can make your current proposed design work (using Int32 as an identifier for instances of string) is if you restrict your input space of strings to a set whose size is less than or equal to the number of Int32s. Even then, you'll have difficulty coming up with an algorithm that maps your input space of strings to Int32 in a unique way.
Even if you try to increase the number of pigeonholes by using SHA-512 or whatever, you still have the possibility of collisions. I doubt you considered that possibility previously in your design; this design path is DOA. And that's not what SHA-512 is for anyway, it's not to be used for unique identification of messages. It's just to reduce the likelihood of message forgery.
To complicate matters, we're too close to a release to refactor our application to stop serializing hash codes and just query on the strings instead.
Well, then you have a tremendous amount of work ahead of you. I'm sorry you discovered this so late in the game.
I note the documentation for String.GetHashCode:
The behavior of GetHashCode is dependent on its implementation, which might change from one version of the common language runtime to another. A reason why this might happen is to improve the performance of GetHashCode.
And from Object.GetHashCode:
The GetHashCode method is suitable for use in hashing algorithms and data structures such as a hash table.
Hash codes are for balancing hash tables. They are not for identifying objects. You could have caught this sooner if you had used the concept for what it is meant to be used for.
You should just use SHA512.
Note that hashes are not (and cannot be) unique.
If you need it to be unique, just use the identity function as your hash.
You can use one of the managed cryptography classes (such as SHA512Managed) to compute a platform-independent hash via ComputeHash. This will require converting the string to a byte array (i.e. using Encoding.GetBytes or some other method), and it will be slow, but it will be consistent.
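Something along these lines (a sketch; it assumes UTF-8 is acceptable as the canonical encoding, and truncating the digest to a long is only there because the existing schema stores a number):
// The same input string always yields the same bytes, regardless of runtime or bitness.
static long GetInvariantHashCode(string s)
{
    using (var sha = new System.Security.Cryptography.SHA512Managed())
    {
        byte[] digest = sha.ComputeHash(System.Text.Encoding.UTF8.GetBytes(s));
        return BitConverter.ToInt64(digest, 0); // keep only the first 8 bytes
    }
}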
That being said, a hash is not guaranteed unique, and is really not a proper mechanism for uniqueness in a database. Using a hash to store data is likely to cause data to get lost, as the first hash collision will overwrite old data (or throw away new data).

Which XML structure allows faster Add/Del/Update

Which XML structure allows me faster adding, deleting and updating of a node?
My assumption is the first one, as its XML hierarchy is not as deep.
What do you think ?
<Departments>
  <Department Id="a Guid" IsVisible="True" />
</Departments>
OR
<Departments>
  <Department>
    <Id>a Guid</Id>
    <IsVisible>True</IsVisible>
  </Department>
</Departments>
It doesn't matter.
You have to read the entire file and parse it into a document structure, do the updates, then write the entire file. Updating the object structure is so little work compared to the file I/O that the structure doesn't matter.
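For example, with LINQ to XML and the first (attribute-based) layout, an update is just load, tweak, save; a sketch, where departments.xml and someGuid are placeholders (uses System.Linq and System.Xml.Linq):
string someGuid = "...";                                     // placeholder value
XDocument doc = XDocument.Load("departments.xml");          // read and parse the whole file
XElement dept = doc.Root.Elements("Department")
    .First(d => (string)d.Attribute("Id") == someGuid);     // find the node to change
dept.SetAttributeValue("IsVisible", false);                  // the actual update is trivial
doc.Save("departments.xml");                                 // rewrite the whole file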
In general
In general terms I tend to agree with the other answers here, but I'd like to add a few remarks. Performance is normally most hindered by its slowest factor, which is the network, the database connection, the file system or even the internal memory when I/O is part of the issue. If we take that as a given, a possible conclusion is that the smaller the size, the bigger the performance improvement is.
Other factors
But there's another factor. Attributes and elements are implemented differently. Attributes are implemented something like key/value pairs with a uniqueness constraint and roughly take the size of chars * 2 + sizeof(int). Elements require a much larger structure in-memory and for the sake of brevity, I like to use one simple factor that's some average between several implementations: 3.5 * chars. I use chars here, because whether you store it as UTF8 or as UTF16 makes a storage difference, but not an in-memory difference.
The former paragraph implies that attributes are faster. But still this isn't a simple fact, because attributes are not implemented as normal nodes and searching for their data is generally slower than searching for data in nodes. This is hard to measure in general terms and requires profiling for every particular situation to find out.
LINQ
Then there's LINQ. If you use LINQ, reading and writing is done with streaming XML which is relatively fast. The in-memory representation is usually much smaller and much faster than with XmlDocument parsing.
Names
The size of the names of the fields (elements and attributes) does not matter much in memory: internally they are keyed and given a unique ID. The contents of the elements and the attributes, however, will add to the overall memory footprint.
If the size of the names is very large compared to their content, minifying the names will make your XML less readable, but also requires less I/O or network bandwidth. As such, in some cases, it may improve performance to use small names.
UTF-8 or UTF-16
Finally, I should add a note on the way you store it. Common sense says, store it as UTF-8. But that requires the parser to read each character and transform it in-memory to UTF-16. This costs time. Sometimes, a larger size of the file (for using UTF-16) can outperform a smaller size (with UTF-8) because the processor overhead is too big. Again, measuring your performance in several scenarios can help. Oh, and if you use a lot of (very) high characters, UTF-16 should be the preferred choice, because UTF-8 may use 3, 4 or even 6 bytes per character.
Summary
To sum it up, if speed is imperative and you cannot resort to a binary format:
Prefer attributes over elements, but only if DOM use is anticipated and searching / keying is not too important;
Prefer UTF-8 over UTF-16 only when the files are very large and you use few (very) high characters, measure to find out;
Prefer streaming over DOM for all your uses (LINQ typically uses streaming);
Don't bother using small names unless your I/O is really a bottleneck and the data-to-overhead ratio is very large;
Define a few typical usage scenarios and measure;
PS: the above is what comes to mind when thinking about XML; there may, of course, be many other factors that improve or degrade performance, the largest perhaps being your own skill in writing good procedures for your CRUD operations.
I doubt very much that you'd see a difference. XML parsing is very fast.
You'd have to test with hundreds of thousands, if not millions of records to measure the difference, which I think would be tiny.
The only way to find out which one is faster is to create some sample queries and run them a bunch of times while profiling and averaging. I doubt you'll find a difference.
I would go with whichever schema is more expressive and meets your requirements. To me that's the first one, since I doubt you'd ever want more than one Id or IsVisible type.
It would depend on what you were using to do the adding, updating and deleting. All things being equal, I would expect the first one, but by a truly very, very negligible amount. I would also not be even slightly amazed to find that there were some libraries that worked faster with the second (due to differences in in-memory model representations, which are completely implementation-defined).
Assuming there will only be one Id and one IsVisible on each Department, I'd go for the first (with the bug of the attribute not being quoted fixed), as it helps restrict the format in itself and is a clear fit. I wouldn't be upset at having to use the latter, though.

In-memory search index for application takes up too much memory - any suggestions?

In our desktop application, we have implemented a simple search engine using an inverted index.
Unfortunately, some of our users' datasets can get very large, e.g. taking up ~1GB of memory before the inverted index has been created. The inverted index itself takes up a lot of memory, almost as much as the data being indexed (another 1GB of RAM).
Obviously this creates problems with out-of-memory errors, as the 32-bit Windows limit of 2GB of memory per application is hit, or users with lower-spec computers struggle to cope with the memory demand.
Our inverted index is stored as a:
Dictionary<string, List<ApplicationObject>>
This is created during data load as each object is processed, so that the ApplicationObject's key string and description words end up stored in the inverted index.
So, my question is: is it possible to store the search index more efficiently space-wise? Perhaps a different structure or strategy needs to be used? Alternatively is it possible to create a kind of CompressedDictionary? As it is storing lots of strings I would expect it to be highly compressible.
If it's going to be 1GB... put it on disk. Use something like Berkeley DB. It will still be very fast.
Here is a project that provides a .net interface to it:
http://sourceforge.net/projects/libdb-dotnet
I see a few solutions:
If you have the ApplicationObjects in an array, store just the index - might be smaller.
You could use a bit of C++/CLI to store the dictionary, using UTF-8.
Don't bother storing all the different strings, use a Trie
I suspect you may find you've got a lot of very small lists.
I suggest you find out roughly what the frequency distribution is like - how many of your dictionary entries have single-element lists, how many have two-element lists, etc. You could potentially store several separate dictionaries - one for "I've only got one element" (a direct mapping), then "I've got two elements" (mapping to a Pair struct with the two references in it), and so on until it becomes silly - quite possibly at about 3 entries - at which point you go back to normal lists. Encapsulate the whole lot behind a simple interface (add entry / retrieve entries). That way you'll have a lot less wasted space (mostly empty buffers, counts, etc.).
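Very roughly, the shape of that would be something like this untested sketch (ApplicationObject is your existing type; the tier boundaries are arbitrary; uses System.Collections.Generic):
class CompactIndex
{
    // Tier 1: terms with exactly one posting; tier 2: exactly two; tier 3: everything else.
    private readonly Dictionary<string, ApplicationObject> single =
        new Dictionary<string, ApplicationObject>();
    private readonly Dictionary<string, KeyValuePair<ApplicationObject, ApplicationObject>> pair =
        new Dictionary<string, KeyValuePair<ApplicationObject, ApplicationObject>>();
    private readonly Dictionary<string, List<ApplicationObject>> many =
        new Dictionary<string, List<ApplicationObject>>();

    public void Add(string term, ApplicationObject item)
    {
        List<ApplicationObject> list;
        KeyValuePair<ApplicationObject, ApplicationObject> two;
        ApplicationObject one;
        if (many.TryGetValue(term, out list))
            list.Add(item);                          // already a long list, just append
        else if (pair.TryGetValue(term, out two))
        {
            pair.Remove(term);                       // promote 2 -> list of 3
            many[term] = new List<ApplicationObject> { two.Key, two.Value, item };
        }
        else if (single.TryGetValue(term, out one))
        {
            single.Remove(term);                     // promote 1 -> pair
            pair[term] = new KeyValuePair<ApplicationObject, ApplicationObject>(one, item);
        }
        else
            single[term] = item;                     // first posting for this term
    }

    public IEnumerable<ApplicationObject> Get(string term)
    {
        ApplicationObject one;
        KeyValuePair<ApplicationObject, ApplicationObject> two;
        List<ApplicationObject> list;
        if (single.TryGetValue(term, out one)) return new[] { one };
        if (pair.TryGetValue(term, out two)) return new[] { two.Key, two.Value };
        if (many.TryGetValue(term, out list)) return list;
        return new ApplicationObject[0];
    }
}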
If none of this (or the rough sketch above) makes much sense, let me know and I'll try to come up with more complete code.
I agree with bobwienholt, but if you are indexing datasets, I assume they came from a database somewhere. Would it make sense to just search that with a search engine like DTSearch or Lucene.net?
You could take the approach Lucene did. First, you create a random-access in-memory stream (System.IO.MemoryStream); this stream mirrors an on-disk one, but holds only a portion of it (if you have the wrong portion, load another one from disk). This does cause one headache: you need a file-mappable format for your dictionary. Wikipedia has a description of the paging technique.
On the file-mappable scenario: if you open up Reflector and reflect the Dictionary class, you will see that it is composed of buckets. You can probably use each of these buckets as a page in a physical file (that way inserts are faster). You can then also loosely delete values by simply appending an "item x deleted" record to the file and cleaning the file up every so often.
By the way, buckets hold values with identical hashes. It is very important that the values you store override the GetHashCode() method (and the compiler will warn you about Equals(), so override that as well). You will get a significant speed increase in lookups if you do this.
How about using the memory-mapped file Win32 API to transparently back your memory structure?
http://www.eggheadcafe.com/articles/20050116.asp has the PInvokes necessary to enable it.
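If you can target .NET 4 or later, the managed wrapper in System.IO.MemoryMappedFiles does the same thing without the PInvoke; a sketch (index.bin, recordSize and recordIndex are all made up):
using (var mmf = System.IO.MemoryMappedFiles.MemoryMappedFile.CreateFromFile("index.bin"))
using (var view = mmf.CreateViewAccessor())
{
    // The OS pages the file in and out on demand, so the whole index never has to
    // sit in your process's address space at once.
    long recordSize = 16, recordIndex = 42;               // made-up layout, for illustration
    int value = view.ReadInt32(recordIndex * recordSize); // read one record by byte offset
}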
Is the index only added to or do you remove keys from it as well?
