How to minify a large finite collection of strings? - c#

I am creating a Trie in memory. Each node contains a word. It is extremely good performance-wise, but the catch is the memory consumption: it is 6GB! I serialized it with protobuf and wrote it to a file, which came out to 150MB; as JSON it is 250MB. I was hoping there is a way to minify the strings. For example:
As you can see, there are duplicates in the first column. Also, the mapping should be reversible.
All the properties/columns are string.
So let's say the table gets converted to:
I think that would save a lot of space. Of course, I could do this by inserting each cell into a dictionary first and then assigning it an integer, but I do not want to reinvent the wheel unless I have to.

The idea is to create a dictionary of all the distinct strings first, and then change the actual values to their dictionary keys (which will be smaller).
This approach is used in Zip and other compression algorithms.
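Here is a minimal sketch of that dictionary-encoding idea; the class and method names (StringTable, Encode, Decode) are illustrative, not from any library:

using System.Collections.Generic;

class StringTable
{
    private readonly Dictionary<string, int> _ids = new Dictionary<string, int>();
    private readonly List<string> _strings = new List<string>();

    // Returns the integer id for a string, adding it on first sight.
    public int Encode(string s)
    {
        int id;
        if (!_ids.TryGetValue(s, out id))
        {
            id = _strings.Count;
            _ids[s] = id;
            _strings.Add(s);
        }
        return id;
    }

    // Reverses the mapping, which keeps the scheme lossless as required.
    public string Decode(int id)
    {
        return _strings[id];
    }
}

You would then serialize the list of unique strings once, plus the integer ids in place of each cell.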

Related

Geocode lookup in C

I want to do a super-fast geocode lookup, returning co-ordinates for an input of Town, City or Country. My knowledge is basic, but from what I understand, writing it in C is a good start. I was thinking it makes sense to have a tree structure like this:
England
    Kent
        Orpington
        Chatham
        Rochester
        Dover
        Edenbridge
    Wiltshire
        Swindon
        Malmesbury
In my file / database I will have the co-ordinates and the town/city name. If I give my program the name "Kent", I want it to return the co-ordinates associated with "Kent" in the fastest way possible.
Should I store the data in a binary file or a SQL database for performance reasons?
What is the best method of searching this data? Perhaps binary tree searching?
How should the data be stored?
Here's a little advice, but not much more than that:
If you want to find places by name, or name prefix, as you indicate you wish to, then you would be ill-advised to set up a data structure which stores the data in a hierarchy of country, region, town, as you suggest you might. If you have an operation that dominates the use of your data structure, you are generally best off picking the data structure to suit that operation.
In this case an alphabetical list of places would be better suited to your queries. To each place not at the topmost level you would want to add some kind of reference to the name of its 'parent'. If you have an alphabetical list of places you might also want to consider an index, perhaps one which points directly to the first place in the list which starts with each letter of the alphabet (see the sketch at the end of this answer).
As you describe your problem it seems to have much more in common with storing words in a dictionary (I mean the sort of thing in which you look up words rather than any particular collection data-type in any specific programming language which goes under the same name) than with most of what goes under the guise of geo-coding.
My guess would be that a gazetteer including the names of all the world's towns, cities, regions and countries (and their coordinates) which have a population over, say, 1000, could be stored in a very simple data structure (basically a list) with an index or two for rapid location of the first A place-name, the first B, and so on. With a little compression you could probably hold this in the memory of most modern desktop PCs.
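To make the list-plus-index idea concrete, here is a rough sketch, written in C# for consistency with the rest of this page (the asker only said C is a candidate); Place, Gazetteer, and the field names are all illustrative:

using System;
using System.Collections.Generic;
using System.Linq;

class Place
{
    public string Name;
    public string ParentName;   // e.g. "Kent" for "Orpington"
    public double Lat, Lon;
}

class Gazetteer
{
    private readonly List<Place> _places;          // sorted alphabetically by Name
    private readonly Dictionary<char, int> _first; // first list index per initial letter

    public Gazetteer(IEnumerable<Place> places)
    {
        _places = places.OrderBy(p => p.Name, StringComparer.OrdinalIgnoreCase).ToList();
        _first = new Dictionary<char, int>();
        for (int i = 0; i < _places.Count; i++)
        {
            char c = char.ToUpperInvariant(_places[i].Name[0]);
            if (!_first.ContainsKey(c)) _first[c] = i;
        }
    }

    // Scan from the first entry with the right initial; a binary search
    // over the whole sorted list would work just as well.
    public Place Find(string name)
    {
        int start;
        char initial = char.ToUpperInvariant(name[0]);
        if (!_first.TryGetValue(initial, out start)) return null;
        for (int i = start; i < _places.Count; i++)
        {
            if (char.ToUpperInvariant(_places[i].Name[0]) != initial) break;
            if (string.Equals(_places[i].Name, name, StringComparison.OrdinalIgnoreCase))
                return _places[i];
        }
        return null;
    }
}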
I think the best advice I can give is to use whatever language you are familiar with to get the results you want. Worry about performance once your code works. Then you can look at translating very specific pieces of functionality into C or C++ one at a time until you have the results you want.
You should not worry about how the information is stored, except not to duplicate data.
You should create one or more indices for the data. The indices are associative arrays / map data structures that contain a key (the item you want to search on) and a value (such as the record and other information associated with the key). This will give you fast lookups without altering your data for each type of search.
On the other hand, your case is an excellent fit for a database. I suggest you let the database manage your data (including efficient lookups). After all, that is what databases live for.
See also: At what point is it worth using a database?

Representing a giant matrix/table

I need to perform calculations and manipulation on an extremely large table or matrix, which will have roughly 7500 rows and 30000 columns.
The matrix data will look like this:
Document ID | word1 | word2 | word3 | ... | word30000 | Document Class
0032        |   1   |   0   |   0   | ... |     1     |       P
In other words, the vast majority of the cells will contain boolean values (0s and 1s).
The calculations that need to be done include word stemming or feature selection (reducing the number of words by using reduction techniques), as well as calculations per class or per word, etc.
What I have in mind is designing an OOP model for representing the matrix, and then serializing the objects to disk so I may reuse them later on. For instance, I could have an object for each row or each column, or perhaps an object for each intersection that is contained within another class.
I have thought about representing it in XML, but file sizes may prove problematic.
I may be wide of the mark with my approach here.
Am I on the right path, or are there better-performing approaches to manipulating such large data collections?
Key issues here will be performance (response time, etc.), as well as redundancy and integrity of the data, and obviously I would need to save the data to disk.
You haven't explained the nature of the calculations you're needing to do on the table/matrix, so I'm having to make assumptions there, but if I read your question correctly, this may be a poster-child case for the use of a relational database -- even if you don't have any actual relations in your database. If you can't use a full server, use SQL Server Compact Edition as an embedded database, which would allow you to control the .SDF file programmatically if you chose.
Edit:
On second consideration, I withdraw my suggestion of a database. This is entirely because of the number of columns in the table: any relational database you use will have hard limits on column count, and I don't see a way around that which isn't amazingly complicated.
Based on your edit, I would say that there are three things you are interested in:
A way to analyze the presence of words in documents. This is the bulk of your sample data file, primarily being boolean values indicating the presence or lack of a word in a document.
The words themselves. This is primarily contained in the first row of your sample data file.
A means of identifying documents and their classification. This is the first and last column of your data file.
After thinking about it for a little bit, this is how I would model your data:
With the case of word presence, I feel it's best to avoid a complex object model. You're wanting to do pure calculation in both directions (by column and by row), and the most flexible and potentially performant structure for that in my opinion is a simple two-dimensional array of bool fields, like so:
var wordMatrix = new bool[numDocuments,numWords];
The words themselves should be in an array or list of strings that is index-linked to the second dimension of the word matrix -- the one defined by numWords in the example above. If you ever needed to quickly search for a particular word, you could use a Dictionary<string, int>, with the key as the word and the value as the index, to quickly find the index of a particular word.
The document identification would similarly be in an array or list of ints index-linked to the first dimension. I'm assuming the document ids are integer values there. The classification would be a similar array or list, although I'd use a list of enums representing each possible value of the classification. As with the word search, if you needed to search for documents by id, you could have a Dictionary<int, int> act as your search index.
I've made several assumptions with this model, particularly that you want to do pure calculation on the word presence in all directions rather than "per document". If I'm wrong, a simpler approach might be to drop the two-dimensional array and model by document, i.e. a single C# Document class, with a DocumentId and DocumentClassification field as well as a simple array of booleans that is index-linked to the word list. You could then work with a list of these Document objects along with a separate list of words.
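As a hedged sketch of that per-document alternative (all names here -- Document, DocumentClassification, Corpus -- are illustrative, and the enum values are unknown beyond the P in the sample row):

using System.Collections.Generic;

enum DocumentClassification { P, N } // whatever the real class labels are

class Document
{
    public int DocumentId;
    public DocumentClassification Classification;
    public bool[] Words; // index-linked to Corpus.Words below
}

class Corpus
{
    public List<string> Words = new List<string>();         // ~30,000 entries
    public Dictionary<string, int> WordIndex =
        new Dictionary<string, int>();                      // word -> column index
    public List<Document> Documents = new List<Document>(); // ~7,500 entries
}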
Once you have a data model you like, saving it to disk is the easiest part. Just use C# serialization. You can save it via XML or binary, your choice. Binary would give you the smallest file size, naturally (I figure a little more than 200MB plus the size of a list of 30000 words). If you include the Dictionary lookup indexes, perhaps an additional 120kB.

String pairs (multidimensional array?) C#

I basically want to keep track of old filenames and each file's new filename.
I could use a Dictionary I suppose, but I want to keep it really lightweight.
Would a multidimensional string array work just as well?
Use the Dictionary<string,string> class:
It manages its own size for you
It uses hashes to speed up searching for elements
You can just go fileNames[oldName] to get the name instead of using a loop or LINQ
Really, the Dictionary is the lightweight solution here.
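For example (the file names here are made up):

using System.Collections.Generic;

var fileNames = new Dictionary<string, string>();
fileNames["report_old.txt"] = "report_2013.txt"; // old name -> new name
string newName = fileNames["report_old.txt"];    // O(1) lookup, no loop needed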
The problem with using Dictionary<string,string> in your scenario is that all keys need to be unique, so if you are using the old file name as a key, you cannot have two entries where the old file name is the same.
The best solution (in my opinion) is
List<Tuple<string,string>>
which will also be ordered, with the latest entries added last, etc.
a new entry would look like
List<Tuple<string,string>> list = new List<Tuple<string, string>>();
list.Add(Tuple.Create(oldfilename,newfilename));
and then to find all new files for a particular old file name you could do the following:
var files = list.Where(t => t.Item1 == oldfilename);
You could also consider using StringDictionary, but its kinda old!
It was used in the land before generics, when dinosaurs ruled the earth.
What's your definition of lightweight? Are you worried about memory use or runtime?
With a multidimensional string array, you'd have to search through the array yourself (you could sort the array and do a binary search, but that's still not as nice as a hash table in the general case), so you would lose out on runtime cost on access.
With a multidimensional string array, you also need to know in advance how many entries you're going to need, in order to allocate memory. If you don't, you lose time and churn memory reallocating ever-larger arrays. A Dictionary also grows its internal storage as it fills, but it manages that for you (and you can pass an initial capacity to the constructor if you know the count up front).
Finally, on the memory front, keep in mind that a string is a reference. The difference in memory use you'll see will only be related to the lookup structure, which is probably small compared to the data you want to keep track of. If the concern is to keep contiguous blocks of memory for cache efficiency, neither solution is better than the other here, as the strings are stored outside of the data structure (being references), so any string reads lose that contiguity.
So to conclude, there's really no reason to use a multidimensional array over a dictionary.

C#: What is the best collection class to store very similar string items for efficient serialization to a file

I would like to store a list of entityIDs of outlook emails to a file. The entityIDs are strings like:
"000000005F776F08B736B442BCF7B6A7060B509A64002000"
"000000005F776F08B736B442BCF7B6A7060B509A84002000"
"000000005F776F08B736B442BCF7B6A7060B509AA4002000"
As you can see, the strings are very similar. I would like to save these strings in a collection class that will be stored as efficiently as possible when I serialize it to a file. Do you know of any collection class that could be used for this?
Thank you in advance for any information...
Gregor
No pre-existing collection class from the framework will suit your needs, because they are generic: by definition, they know nothing about the type they are storing (e.g. string), so they cannot exploit its redundancy.
If efficient serialization is your only concern, I suggest that you simply compress the serialized file. Data like this is a feast for compression algorithms. .NET offers the gzip and deflate algorithms in System.IO.Compression; better algorithms (if you need them) can easily be found through Google.
If in-memory efficiency is also an issue, you could store your strings in a trie or a radix tree.
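A minimal sketch of the compress-the-file suggestion, using GZipStream from System.IO.Compression (the file name is made up; the sample IDs are taken from the question):

using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

var entryIds = new List<string>
{
    "000000005F776F08B736B442BCF7B6A7060B509A64002000",
    "000000005F776F08B736B442BCF7B6A7060B509A84002000",
    "000000005F776F08B736B442BCF7B6A7060B509AA4002000",
};

// Write each ID on its own line, gzip-compressed on the way out.
using (FileStream file = File.Create("entryIds.txt.gz"))
using (var gzip = new GZipStream(file, CompressionMode.Compress))
using (var writer = new StreamWriter(gzip))
{
    foreach (string id in entryIds)
        writer.WriteLine(id);
}

// Read them back, decompressing on the way in.
var ids = new List<string>();
using (FileStream file = File.OpenRead("entryIds.txt.gz"))
using (var gzip = new GZipStream(file, CompressionMode.Decompress))
using (var reader = new StreamReader(gzip))
{
    string line;
    while ((line = reader.ReadLine()) != null)
        ids.Add(line);
}

Highly repetitive prefixes like these compress extremely well, so the on-disk size should be a small fraction of the raw text.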
You may want to take a look at the radix trie data structure, as this would be able to store your keys efficiently.
As far as serialising to a file, you could, perhaps, walk the trie and write down each node. (In the following example I have used indentation to signify the level in the tree, but you could come up with something a bit more efficient, such as using control characters to signify a descent or ascent.)
00000000
    5F776F08B736B442BCF7B6A7060B509A
        64002000
        84002000
        A4002000
    6F776F08B736B442BCF7B6A7060B509A
        32100000
The example above is the set of:
000000005F776F08B736B442BCF7B6A7060B509A64002000
000000005F776F08B736B442BCF7B6A7060B509A84002000
000000005F776F08B736B442BCF7B6A7060B509AA4002000
000000006F776F08B736B442BCF7B6A7060B509A32100000
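A short sketch of that walk, under the assumption of a simple node type (TrieNode, Fragment, and TrieWriter are illustrative names):

using System.Collections.Generic;
using System.IO;

class TrieNode
{
    public string Fragment = "";                           // substring stored at this node
    public List<TrieNode> Children = new List<TrieNode>();
}

static class TrieWriter
{
    // One line per node, with depth encoded as leading spaces, as in the
    // example above; control characters for descend/ascend would be denser.
    public static void Write(TrieNode node, TextWriter writer, int depth)
    {
        writer.WriteLine(new string(' ', depth * 4) + node.Fragment);
        foreach (TrieNode child in node.Children)
            Write(child, writer, depth + 1);
    }
}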
Why is efficiency an issue? Do you want to use as little disk space as possible? (Disk space is cheap.)
The two most commonly used serializers in C# are:
Binary
XML
If you want the user to be able to adjust the file with Notepad, for example, use XML. If not, use binary.

Reading a large file into a Dictionary

I have a 1GB file containing pairs of string and long.
What's the best way of reading it into a Dictionary, and how much memory would you say it requires?
File has 62 million rows.
I've managed to read it using 5.5GB of RAM.
Say 22 bytes overhead per Dictionary entry, that's 1.5GB.
long is 8 bytes, that's 500MB.
Average string length is 15 chars, each char 2 bytes, that's 2GB.
Total is about 4GB; where does the extra 1.5GB go?
The initial Dictionary allocation takes 256MB.
I've noticed that every 10 million rows I read consume about 580MB, which fits quite nicely with the above calculation, but somewhere around the 6000th line, memory usage grows from 260MB to 1.7GB; that's my missing 1.5GB. Where does it go?
Thanks.
It's important to understand what's happening when you populate a Hashtable. (The generic Dictionary uses essentially the same hash table mechanism internally.)
When you create a new Hashtable, .NET makes an array containing 11 buckets, which are linked lists of dictionary entries. When you add an entry, its key gets hashed, the hash code gets mapped on to one of the 11 buckets, and the entry (key + value + hash code) gets appended to the linked list.
At a certain point (and this depends on the load factor used when the Hashtable is first constructed), the Hashtable determines, during an Add operation, that it's encountering too many collisions, and that the initial 11 buckets aren't enough. So it creates a new array of buckets that's twice the size of the old one (not exactly; the number of buckets is always prime), and then populates the new table from the old one.
So there are two things that come into play in terms of memory utilization.
The first is that, every so often, the Hashtable needs to use twice as much memory as it's presently using, so that it can copy the table during resizing. So if you've got a Hashtable that's using 1.8GB of memory and it needs to be resized, it's briefly going to need to use 3.6GB, and, well, now you have a problem.
The second is that every hash table entry has about 12 bytes of overhead: pointers to the key, the value, and the next entry in the list, plus the hash code. For most uses, that overhead is insignificant, but if you're building a Hashtable with 100 million entries in it, well, that's about 1.2GB of overhead.
You can overcome the first problem by using the overload of the Dictionary's constructor that lets you provide an initial capacity. If you specify a capacity big enough to hold all of the entries you're going to add, the Hashtable won't need to be rebuilt while you're populating it. There's pretty much nothing you can do about the second.
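For the 62 million rows mentioned in the question, that would look something like:

// Pre-sizing avoids the repeated double-and-copy resizes described above,
// including the brief 2x memory spike during each resize.
var dictionary = new Dictionary<string, long>(62000000);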
Everyone here seems to be in agreement that the best way to handle this is to read only a portion of the file into memory at a time. Speed, of course, is determined by which portion is in memory and what parts must be read from disk when a particular piece of information is needed.
There is a simple way to handle deciding which parts are best kept in memory:
Put the data into a database.
A real one, like MSSQL Express, MySQL, or Oracle XE (all are free).
Databases cache the most commonly used information, so it's just like reading from memory. And they give you a single access method for in-memory or on-disk data.
Maybe you can convert that 1GB file into a SQLite database with two columns, key and value. Then create an index on the key column. After that you can query the database to get the values of the keys you provide.
Thinking about this, I'm wondering why you'd need to do it... (I know, I know... I shouldn't wonder why, but hear me out...)
The main problem is that there is a huge amount of data that needs to be presumably accessed quickly... The question is, will it essentially be random access, or is there some pattern that can be exploited to predict accesses?
In any case, I would implement this as a sliding cache. E.g. I would load as much as feasibly possible into memory to start with (with the selection of what to load based as much on my expected access pattern as possible) and then keep track of accesses to elements by time last accessed.
If I hit something that wasn't in the cache, then it would be loaded and replace the oldest item in the cache.
This would result in the most commonly used stuff being accessible in memory, but would incur additional work for cache misses.
In any case, without knowing a little more about the problem, this is merely a 'general solution'.
It may be that just keeping it in a local instance of a SQL database would be sufficient :)
You'll need to specify the file format, but if it's just something like name=value, I'd do:
Dictionary<string,long> dictionary = new Dictionary<string,long>();
using (TextReader reader = File.OpenText(filename))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        string[] bits = line.Split('=');
        // Error checking would go here
        long value = long.Parse(bits[1]);
        dictionary[bits[0]] = value;
    }
}
Now, if that doesn't work we'll need to know more about the file - how many lines are there, etc?
Are you using 64 bit Windows? (If not, you won't be able to use more than 3GB per process anyway, IIRC.)
The amount of memory required will depend on the length of the strings, number of entries etc.
I am not familiar with C#, but if you're having memory problems you might need to roll your own memory container for this task.
Since you want to store it in a dict, I assume you need it for fast lookup?
You have not clarified which one should be the key, though.
Let's hope you want to use the long values for keys. Then try this:
Allocate a buffer that's as big as the file. Read the file into that buffer.
Then create a dictionary with the long values (which are 64-bit in C#) as keys, with their values being a 32-bit offset.
Now browse the data in the buffer like this:
Find the next key-value pair. Calculate the offset of its value in the buffer. Now add this information to the dictionary, with the long as the key and the offset as its value.
That way, you end up with a dictionary which might take maybe 10-20 bytes per record, and one larger buffer which holds all your text data.
At least with C++, this would be a rather memory-efficient way, I think.
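A rough C# rendering of that idea, assuming lines look like string=long (as in the earlier answer) and a single-byte encoding so byte and character offsets coincide; all names here are illustrative:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Read the whole file into one byte buffer, then store only
// (offset, length) pairs in the dictionary instead of the strings.
byte[] buffer = File.ReadAllBytes("data.txt");
var offsets = new Dictionary<long, KeyValuePair<int, int>>(); // key -> (offset, length)

int lineStart = 0;
for (int i = 0; i <= buffer.Length; i++)
{
    if (i == buffer.Length || buffer[i] == (byte)'\n')
    {
        // Safe because a single-byte encoding keeps byte and char offsets equal.
        string line = Encoding.ASCII.GetString(buffer, lineStart, i - lineStart).TrimEnd('\r');
        int eq = line.IndexOf('=');
        if (eq > 0)
        {
            long key = long.Parse(line.Substring(eq + 1));
            offsets[key] = new KeyValuePair<int, int>(lineStart, eq);
        }
        lineStart = i + 1;
    }
}

// Materialize a string only when it is actually needed:
// var loc = offsets[someKey];
// string value = Encoding.ASCII.GetString(buffer, loc.Key, loc.Value);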
Can you convert the 1G file into a more efficient indexed format, but leave it as a file on disk? Then you can access it as needed and do efficient lookups.
Perhaps you can memory map the contents of this (more efficient format) file, then have minimum ram usage and demand-loading, which may be a good trade-off between accessing the file directly on disc all the time and loading the whole thing into a big byte array.
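For instance, with System.IO.MemoryMappedFiles (this assumes you have already converted the data to a fixed-width binary format; the file name and the 16-byte record layout are made up):

using System.IO.MemoryMappedFiles;

// Map the converted file and read records on demand; the OS pages data
// in and out, so RAM usage stays proportional to what you actually touch.
using (var mmf = MemoryMappedFile.CreateFromFile("data.indexed"))
using (var accessor = mmf.CreateViewAccessor())
{
    long recordIndex = 12345;       // whichever record a lookup lands on
    long offset = recordIndex * 16; // 8-byte key + 8-byte value per record
    long key = accessor.ReadInt64(offset);
    long value = accessor.ReadInt64(offset + 8);
}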
Loading a 1 GB file in memory at once doesn't sound like a good idea to me. I'd virtualize the access to the file by loading it in smaller chunks only when the specific chunk is needed. Of course, it'll be slower than having the whole file in memory, but 1 GB is a real mastodon...
Don't read a 1GB file into memory, even if you have 8GB of physical RAM; you can still run into many problems. (Based on personal experience.)
I don't know what you need to do, but find a workaround: read the file partially and process it in pieces. If that doesn't work, consider using a database.
If you choose to use a database, you might be better served by a dbm-style tool, like Berkeley DB for .NET. They are specifically designed to represent disk-based hashtables.
Alternatively you may roll your own solution using some database techniques.
Suppose your original data file looks like this (dots indicate that string lengths vary):
[key2][value2...][key1][value1..][key3][value3....]
Split it into index file and values file.
Values file:
[value1..][value2...][value3....]
Index file:
[key1][value1-offset]
[key2][value2-offset]
[key3][value3-offset]
Records in index file are fixed-size key->value-offset pairs and are ordered by key.
Strings in values file are also ordered by key.
To get a value for key(N) you would binary-search for key(N) record in index, then read string from values file starting at value(N)-offset and ending before value(N+1)-offset.
Index file can be read into in-memory array of structs (less overhead and much more predictable memory consumption than Dictionary), or you can do the search directly on disk.
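A hedged sketch of that in-memory index search, assuming the keys can be represented as longs (IndexRecord and FindKey are illustrative names):

using System;

// One fixed-size record per key, kept sorted by key.
struct IndexRecord
{
    public long Key;         // assumes keys fit in a long
    public long ValueOffset; // start of the value in the values file
}

static class IndexSearch
{
    // Classic binary search; returns the record's position, or -1 if absent.
    public static int FindKey(IndexRecord[] index, long key)
    {
        int lo = 0, hi = index.Length - 1;
        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (index[mid].Key == key) return mid;
            if (index[mid].Key < key) lo = mid + 1;
            else hi = mid - 1;
        }
        return -1;
    }
}

// The value's length is the gap between ValueOffset and the next record's
// ValueOffset (or the end of the values file for the last record), so no
// lengths need to be stored.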
