I need to perform calculations and manipulation on an extremely large table or matrix, which will have roughly 7500 rows and 30000 columns.
The matrix data will look like this:
Document ID | word1 | word2 | word3 | ... | word30000 | Document Class
0032        |   1   |   0   |   0   | ... |     1     |       P
In other words, the vast majority of the cells will contain boolean values (0s and 1s).
The calculations that need to be done involve word stemming and feature selection (reducing the number of words via reduction techniques), as well as per-class and per-word calculations, etc.
What I have in mind is designing an OOP model to represent the matrix, and then serializing the objects to disk so I can reuse them later. For instance, I would have an object for each row or each column, or perhaps an object for each intersection, contained within another class.
I have thought about representing it in XML, but file sizes may prove problematic.
I may be missing the mark with my approach here -
Am I on the right path, or are there better-performing approaches to manipulating such large data collections?
Key issues here will be performance (response time, etc.), as well as redundancy and integrity of the data, and obviously I will need to save the data to disk.
You haven't explained the nature of the calculations you're needing to do on the table/matrix, so I'm having to make assumptions there, but if I read your question correctly, this may be a poster-child case for the use of a relational database -- even if you don't have any actual relations in your database. If you can't use a full server, use SQL Server Compact Edition as an embedded database, which would allow you to control the .SDF file programmatically if you chose.
Edit:
After a second consideration, I withdraw my suggestion of a database. This is entirely because of the number of columns in the table: any relational database you use will have hard limits on column count, and I don't see a way around that which isn't amazingly complicated.
Based on your edit, I would say that there are three things you are interested in:
A way to analyze the presence of words in documents. This is the bulk of your sample data file, primarily being boolean values indicating the presence or lack of a word in a document.
The words themselves. This is primarily contained in the first row of your sample data file.
A means of identifying documents and their classification. This is the first and last column of your data file.
After thinking about it for a little bit, this is how I would model your data:
With the case of word presence, I feel it's best to avoid a complex object model. You're wanting to do pure calculation in both directions (by column and by row), and the most flexible and potentially performant structure for that in my opinion is a simple two-dimensional array of bool fields, like so:
var wordMatrix = new bool[numDocuments,numWords];
The words themselves should be in an array or list of strings that is index-linked to the second dimension of the word matrix -- the one defined by numWords in the example above. If you ever need to look up a particular word, you could use a Dictionary<string, int>, with the word as the key and the index as the value, to quickly find the index of that word.
The document identification would similarly be in an array or list of ints index-linked to the first dimension. I'm assuming the document ids are integer values there. The classification would be a similar array or list, although I'd use a list of enums representing each possible value of the classification. As with the word search, if you need to search for documents by id, you could have a Dictionary<int, int> act as your search index.
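To make that concrete, here is a rough sketch of how those index-linked structures might hang together; all of the type and member names here are my own placeholders, not something you have to use:
using System;
using System.Collections.Generic;

// Assumed classification values; adjust to whatever your classes really are.
enum DocumentClass { P, N }

[Serializable]
class WordMatrixModel
{
    public bool[,] WordMatrix;            // [document index, word index]
    public string[] Words;                // index-linked to the word dimension
    public int[] DocumentIds;             // index-linked to the document dimension
    public DocumentClass[] Classes;       // index-linked to the document dimension

    public Dictionary<string, int> WordIndex;    // word -> word index
    public Dictionary<int, int> DocumentIndex;   // document id -> document index

    public WordMatrixModel(int numDocuments, int numWords)
    {
        WordMatrix = new bool[numDocuments, numWords];
        Words = new string[numWords];
        DocumentIds = new int[numDocuments];
        Classes = new DocumentClass[numDocuments];
        WordIndex = new Dictionary<string, int>(numWords);
        DocumentIndex = new Dictionary<int, int>(numDocuments);
    }

    // Example per-word/per-class calculation: how many documents of a given
    // class contain the given word?
    public int CountDocumentsContaining(string word, DocumentClass cls)
    {
        int column = WordIndex[word];
        int count = 0;
        for (int row = 0; row < DocumentIds.Length; row++)
            if (Classes[row] == cls && WordMatrix[row, column])
                count++;
        return count;
    }
}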
I've made several assumptions with this model, particularly that you want to do pure calculation on the word presence in all directions rather than "per document". If I'm wrong, a simpler approach might be to drop the two-dimensional array and model by document, i.e. a single C# Document class with a DocumentId and a DocumentClassification field, as well as a simple array of booleans index-linked to the word list. You could then work with a list of these Document objects along with a separate list of words.
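A sketch of that per-document alternative could be as simple as this (again, the names are placeholders, and it reuses the DocumentClass enum from the sketch above):
[Serializable]
class Document
{
    public int DocumentId;
    public DocumentClass DocumentClassification;  // reuses the enum from the sketch above
    public bool[] WordPresence;                   // WordPresence[i] lines up with words[i]
}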
Once you have a data model you like, saving it to disk is the easiest part. Just use C# serialization. You can save it via XML or binary, your choice. Binary would give you the smallest file size, naturally (I figure a little more than 200MB plus the size of a list of 30000 words). If you include the Dictionary lookup indexes, perhaps an additional 120kB.
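A minimal sketch of the binary route, assuming the model class (such as the WordMatrixModel above) is marked [Serializable]; BinaryFormatter was the standard choice at the time, though newer .NET versions discourage it:
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

var model = new WordMatrixModel(7500, 30000);
// ... populate the model ...

// Save it to disk.
using (var stream = File.Create("model.bin"))
{
    new BinaryFormatter().Serialize(stream, model);
}

// Load it back later.
WordMatrixModel loaded;
using (var stream = File.OpenRead("model.bin"))
{
    loaded = (WordMatrixModel)new BinaryFormatter().Deserialize(stream);
}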
I am creating a Trie in memory. Each node contains a word. It is extremely good performance-wise, but the catch is the memory consumption.
It is 6GB! I serialized it with protobuf and wrote it to a file that came out to be 150MB.
The JSON version is 250MB. I was hoping there is a way to minify the strings. For example:
As you can see, there are duplicates in the first column. Also, it should be reversible.
All the properties/columns are strings.
So let's say the table gets converted to:
I think that would save a lot of space. Of course I can do this by inserting each cell into a dictionary first and then assigning it an integer, but I do not want to reinvent the wheel unless I have to.
The idea is to create a dictionary first with all the values, and then replace the actual values with their dictionary keys (which will be smaller).
This approach is used in Zip and other compression algorithms.
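Here is a rough C# sketch of that idea; the sample rows are made up, since the original table isn't shown in the question:
using System;
using System.Collections.Generic;

// Invented sample data: every cell is a string, with duplicates across rows.
var table = new List<string[]>
{
    new[] { "apple", "red"   },
    new[] { "apple", "green" },
    new[] { "pear",  "green" },
};

var idByValue = new Dictionary<string, int>();  // string -> small integer id
var valueById = new List<string>();             // id -> string, so the encoding is reversible

var encoded = new List<int[]>();
foreach (var row in table)
{
    var encodedRow = new int[row.Length];
    for (int i = 0; i < row.Length; i++)
    {
        if (!idByValue.TryGetValue(row[i], out int id))
        {
            id = valueById.Count;        // first time we see this string: assign the next id
            idByValue[row[i]] = id;
            valueById.Add(row[i]);
        }
        encodedRow[i] = id;
    }
    encoded.Add(encodedRow);
}

// Serialize 'encoded' plus 'valueById' (e.g. with protobuf); to decode,
// replace each id with valueById[id].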
I've got a list of approximately 100k pairs of text strings (sentences) which could mean the same thing, even if the values are different. A lot of the issues are simply due to abbreviations and different punctuation being used on one half of the pair:
Source 1     Source 2
TEMP.IND.    TEMPERATURE INDICATOR
My initial thought on how to solve this would be to split the strings by word, and then look each word up in a table containing abbreviations and their full-length equivalents, similar to the one below:
Abbreviation:    Meaning:
TEMP.            TEMPERATURE
IND.             INDICATOR
If a match is found, I generate a replacement string using the new word before comparing it with the other source. If they don't match, I repeat the process for each abbreviated word found in the lookup table.
Would this be very complex to do in Oracle, compared to e.g. C# (which I'm fluent in)? Keeping it in the DB would be preferred, but not if it takes too much time to implement. Are there any better options? The alternative is to check everything manually.
Apologies if this is the wrong site.
It should be just as easy to do in the DB (Oracle) as to do in C#, and probably much, much faster. Writing the code isn't that much of a problem, once you create your equivalence tables (with the Abbreviation and Meaning columns).
The difficulty is in the specification. Why should Temp. Ind. mean "temperature indicator" and not "temporary index" or "temple in India"? This will work (in Oracle or C# or whatever, that is irrelevant) only if each abbreviation corresponds to a unique meaning.
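For what it's worth, a rough C# sketch of the word-by-word expansion described in the question could look like this (the splitting rule and the lookup table contents are my assumptions):
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var abbreviations = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
{
    { "TEMP.", "TEMPERATURE" },
    { "IND.",  "INDICATOR"   },
};

// Split on whitespace, and also between a "." and the following letter,
// so that "TEMP.IND." becomes "TEMP." and "IND.".
string Expand(string source)
{
    var tokens = Regex.Split(source, @"\s+|(?<=\.)(?=\w)");
    var expanded = new List<string>();
    foreach (var token in tokens)
    {
        if (token.Length == 0) continue;
        expanded.Add(abbreviations.TryGetValue(token, out var full) ? full : token);
    }
    return string.Join(" ", expanded);
}

bool Matches(string a, string b) =>
    string.Equals(Expand(a), Expand(b), StringComparison.OrdinalIgnoreCase);

Console.WriteLine(Matches("TEMP.IND.", "TEMPERATURE INDICATOR"));  // True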
Best of luck!
I have read some documentation about Lucene, including the document at this link
(http://lucene.sourceforge.net/talks/pisa).
I don't really understand how Lucene indexes documents, or which algorithms Lucene uses for indexing.
The link above says Lucene uses this algorithm for indexing:
incremental algorithm:
  maintain a stack of segment indices
  create index for each incoming document
  push new indexes onto the stack
  let b=10 be the merge factor; M=8
  for (size = 1; size < M; size *= b) {
      if (there are b indexes with size docs on top of the stack) {
          pop them off the stack;
          merge them into a single index;
          push the merged index onto the stack;
      } else {
          break;
      }
  }
How does this algorithm provide optimized indexing?
Does Lucene use a B-tree or some other similar algorithm for indexing, or does it have a particular algorithm of its own?
In a nutshell, Lucene builds an inverted index using Skip-Lists on disk, and then loads a mapping for the indexed terms into memory using a Finite State Transducer (FST). Note, however, that Lucene does not (necessarily) load all indexed terms into RAM, as described by Michael McCandless, the author of Lucene's indexing system himself. Note that by using Skip-Lists, the index can be traversed from one hit to another, making things like set and, particularly, range queries possible (much like B-Trees). The Wikipedia entry on Skip-Lists also explains why Lucene's Skip-List implementation is called a multi-level Skip-List - essentially, to make O(log n) look-ups possible (again, much like B-Trees).
So once the inverted (term) index - which is based on a Skip-List data structure - is built from the documents, the index is stored on disk. Lucene then loads (as already said, possibly only some of) those terms into a Finite State Transducer, in an FST implementation loosely inspired by Morfologik.
Michael McCandless (also) does a pretty good and terse job of explaining how and why Lucene uses a (minimal acyclic) FST to index the terms Lucene stores in memory, essentially as a SortedMap<ByteSequence,SomeOutput>, and gives a basic idea of how FSTs work (i.e., how the FST compacts the byte sequences [i.e., the indexed terms] to make the memory use of this mapping grow sub-linearly). He also points to the paper that describes the particular FST algorithm Lucene uses.
For those curious why Lucene uses Skip-Lists while most databases use (B+)- and/or (B)-Trees, take a look at the right SO answer regarding this question (Skip-Lists vs. B-Trees). That answer gives a pretty good, deep explanation - essentially, it is not so much that Skip-Lists make concurrent updates of the index "more amenable" (because you can decide not to re-balance a B-Tree immediately, thereby gaining about the same concurrent performance as a Skip-List), but rather that Skip-Lists save you from having to work on the (delayed or not) balancing operation that B-Trees ultimately need. (In fact, as that answer shows/references, there is probably very little performance difference between B-Trees and [multi-level] Skip-Lists, if either is "done right.")
There's a fairly good article here: https://web.archive.org/web/20130904073403/http://www.ibm.com/developerworks/library/wa-lucene/
Edit 12/2014: Updated to an archived version because the original was deleted; probably the best more recent alternative is http://lucene.apache.org/core/3_6_2/fileformats.html
There's an even more recent version at http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/codecs/lucene410/package-summary.html#package_description, but it seems to have less information in it than the older one.
In a nutshell, when Lucene indexes a document it breaks it down into a number of terms. It then stores the terms in an index file where each term is associated with the documents that contain it. You could think of it as a bit like a hashtable.
Terms are generated using an analyzer which stems each word to its root form. The most popular stemming algorithm for the English language is the Porter stemming algorithm: http://tartarus.org/~martin/PorterStemmer/
When a query is issued it is processed through the same analyzer that was used to build the index and then used to look up the matching term(s) in the index. That provides a list of documents that match the query.
It seems your question is more about index merging than about indexing itself.
The indexing process is quite simple if you ignore low-level details. Lucene forms what is called an "inverted index" from documents. So if a document with the text "To be or not to be" and id=1 comes in, the inverted index would look like:
[to] → 1
[be] → 1
[or] → 1
[not] → 1
This is basically it – an index from each word to the list of documents containing that word. Each line of this index (one per word) is called a posting list. The index is then persisted to long-term storage.
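Conceptually (this is just an illustration, not Lucene's actual code), that structure is little more than a map from each term to its posting list:
using System.Collections.Generic;

var invertedIndex = new Dictionary<string, List<int>>();

void Index(int docId, string text)
{
    foreach (var term in text.ToLowerInvariant().Split(' '))
    {
        if (!invertedIndex.TryGetValue(term, out var postings))
            invertedIndex[term] = postings = new List<int>();
        if (postings.Count == 0 || postings[postings.Count - 1] != docId)
            postings.Add(docId);   // record each document only once per term
    }
}

Index(1, "To be or not to be");
// invertedIndex["be"] is now [1], and likewise for "to", "or" and "not".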
In reality of course things are more complicated:
Lucene may skip some words based on the particular Analyzer given;
words can be preprocessed using a stemming algorithm to reduce the inflected forms of the language;
a posting list can contain not only the identifiers of the documents, but also the offsets of the given word inside each document (potentially several occurrences) and some other additional information.
There are many more complications which are not so important for basic understanding.
It's important to understand, though, that a Lucene index is append-only. At some point in time the application decides to commit (publish) all the changes to the index. Lucene finishes all service operations on the index and closes it, so it's available for searching. After a commit, the index is basically immutable. This index (or index part) is called a segment. When Lucene executes a search for a query, it searches in all available segments.
So the question arises – how can we change an already indexed document?
New documents, or new versions of already indexed documents, are indexed into new segments, and old versions are invalidated in previous segments using a so-called kill list. The kill list is the only part of a committed index which can change. As you might guess, index efficiency drops over time, because old segments might contain mostly removed documents.
This is where merging comes in. Merging is the process of combining several indexes to make a more efficient index overall. What basically happens during a merge is that live documents are copied to a new segment and the old segments are removed entirely.
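A merge of two such segments (using the same term-to-posting-list shape as in the earlier sketch) with a kill list could be illustrated like this - again, a conceptual sketch rather than Lucene's real implementation:
using System.Collections.Generic;
using System.Linq;

// Combine the posting lists of two segments into a new one, dropping any
// documents that appear on the kill list. (In Lucene a live document exists in
// exactly one segment, so there is no duplication to worry about here.)
Dictionary<string, List<int>> Merge(
    Dictionary<string, List<int>> segmentA,
    Dictionary<string, List<int>> segmentB,
    HashSet<int> killedDocIds)
{
    var merged = new Dictionary<string, List<int>>();
    foreach (var segment in new[] { segmentA, segmentB })
    {
        foreach (var entry in segment)
        {
            if (!merged.TryGetValue(entry.Key, out var postings))
                merged[entry.Key] = postings = new List<int>();
            postings.AddRange(entry.Value.Where(id => !killedDocIds.Contains(id)));
        }
    }
    return merged;
}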
Using this simple process, Lucene is able to keep the index in good shape in terms of search performance.
Hope it helps.
It is an inverted index, but that does not specify which structure it uses.
The index format documentation in Lucene has complete information.
Start with 'Summary of File Extensions'.
You'll first notice that it talks about various different indexes.
As far as I can tell, none of these uses, strictly speaking, a B-tree, but there are similarities - the above structures do resemble trees.
I want to do a super fast geocode lookup, returning co-ordinates for an input of Town, City or Country. My knowledge is basic, but from what I understand, writing it in C is a good start. I was thinking it makes sense to have a tree structure like this:
England
    Kent
        Orpington
        Chatham
        Rochester
        Dover
        Edenbridge
    Wiltshire
        Swindon
        Malmesbury
In my file/database I will have the co-ordinate and the town/city name. If I give my program the name "Kent", I want it to return the co-ordinate associated with "Kent" in the fastest way possible.
Should I store the data in a binary file or a SQL database for performance reasons?
What is the best method of searching this data? Perhaps binary tree searching?
How should the data be stored?
Here's a little advice, but not much more than that:
If you want to find places by name, or name prefix, as you indicate you wish to, then you would be ill-advised to set up a data structure which stores the data in a hierarchy of country, region and town, as you suggest you might. If you have an operation that dominates the use of your data structure, you are generally best off picking the data structure to suit that operation.
In this case an alphabetical list of places would be better suited to your queries. To each place not at the topmost level you would want to add some kind of reference to the name of its 'parent'. If you have an alphabetical list of places, you might also want to consider an index, perhaps one which points directly to the first place in the list which starts with each letter of the alphabet.
As you describe your problem it seems to have much more in common with storing words in a dictionary (I mean the sort of thing in which you look up words rather than any particular collection data-type in any specific programming language which goes under the same name) than with most of what goes under the guise of geo-coding.
My guess would be that a gazetteer including the names of all the world's towns, cities, regions and countries (and their coordinates) which have a population over, say, 1000, could be stored in a very simple data structure (basically a list) with an index or two for rapid location of the first A place-name, the first B, and so on. With a little compression you could probably hold this in the memory of most modern desktop PCs.
I think the best advice I can give is to use whatever language you are familiar with to get the results you want. Worry about performance once your code works. Then you can look at translating very specific pieces of functionality into C or C++ one at a time until you have the results you want.
You should not worry about how the information is stored, except not to duplicate data.
You should create one or more indices for the data. The indices are associative-array / map data structures that contain a key (the item you want to search on) and a value (such as the record and other information associated with the key). This gives you fast lookups without altering your data for each type of search.
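As a small illustration in C# (the Place type and the sample values are invented):
using System;
using System.Collections.Generic;

class Place
{
    public string Name;
    public string Parent;      // e.g. "Kent" has parent "England"
    public double Latitude;
    public double Longitude;
}

class Gazetteer
{
    // The index: key is the place name, value is the full record.
    private readonly Dictionary<string, Place> byName =
        new Dictionary<string, Place>(StringComparer.OrdinalIgnoreCase);

    public void Add(Place place) => byName[place.Name] = place;

    public Place Find(string name) =>
        byName.TryGetValue(name, out var place) ? place : null;
}

// Usage:
//   var g = new Gazetteer();
//   g.Add(new Place { Name = "Kent", Parent = "England", Latitude = 51.2, Longitude = 0.7 });
//   Place kent = g.Find("Kent");   // constant-time lookup by name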
On the other hand, your case is an excellent fit for a database. I suggest you let the database manage your data (including efficient lookups). After all, that is what databases live for.
See also: At what point is it worth using a database?
What I’m trying to do:
Build a mobile web application where the user can get help finding words to play when playing scrabble
Users get word suggestions by typing in any amount of letters and 0 or more wildcards
How I’m trying to do this:
Using MySQL database with a dictionary containing over 400k words
Using ASP.NET with C# as server-side programming language
Using HTML5, CSS and Javascript
My current plan:
Building a Trie with all the words from the database so I can do a fast and accurate search for words depending on user letter/wildcard input
Having a plan is no good if you can't execute it, so this is what I need help with:
How do I build a Trie from the database? (UPDATE: I want to generate a Trie using the words already in my database, after that's done I'm not going to use the database for word matching any more)
How do I store the Trie for fast and easy access? (UPDATE: So I can trash my database)
How do I use C# to search for words using the Trie depending on letters and wildcards?
Finally:
Any help is very much appreciated, I’m still a beginner with C# and MySQL so please be gentle
Thank you a lot!
First off, let's look at the constraints on the problem. You want to store a word list for a game in a data structure that efficiently supports the "anagram" problem. That is, given a "rack" of n letters, what are all the n-or-fewer-letter words in the word list that can be made from that rack? The word list will be about 400K words, and so is probably about one to ten megs of string data when uncompressed.
A trie is the classic data structure used to solve this problem because it combines memory efficiency with search efficiency. With a word list of about 400K words of reasonable length you should be able to keep the trie in memory. (As opposed to going with a b-tree sort of solution where you keep most of the tree on disk because it is too big to fit in memory all at once.)
A trie is basically nothing more than a 26-ary tree (assuming you're using the Roman alphabet) where every node has a letter plus one additional bit that says whether it is the end of a word.
So let's sketch the data structure:
class TrieNode
{
    char Letter;
    bool IsEndOfWord;
    List<TrieNode> children;
}
This of course is just a sketch; you'd probably want to make these have proper property accessors and constructors and whatnot. Also, maybe a flat list is not the best data structure; maybe some sort of dictionary is better. My advice is to get it working first, and then measure its performance, and if it is unacceptable, then experiment with making changes to improve its performance.
You can start with an empty trie:
TrieNode root = new TrieNode('^', false, new List<TrieNode>());
That is, this is the "root" trie node that represents the beginning of a word.
How do you add the word "AA", the first word in the Scrabble dictionary? Well, first make a node for the first letter:
root.Children.Add(new TrieNode('A', false, new List<TrieNode>()));
OK, our trie is now
^
|
A
Now add a node for the second letter:
root.Children[0].Children.Add(new TrieNode('A', true, new List<TrieNode>()));
Our trie is now
^
|
A
|
A$ -- we notate the end of word flag with $
Great. Now suppose we want to add AB. We already have a node for "A", so add to it the "B$" node:
root.Children[0].Children.Add(new TrieNode('B', true, new List<TrieNode>()));
and now we have
  ^
  |
  A
 / \
A$  B$
Keep on going like that. Of course, rather than writing "root.Children[0]..." you'll write a loop that searches the trie to see if the node you want exists, and if not, create it.
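That loop might be sketched like this, building on the TrieNode sketch above and assuming it gets a constructor plus public Letter, IsEndOfWord and Children members:
static void AddWord(TrieNode root, string word)
{
    TrieNode current = root;
    foreach (char letter in word)
    {
        // Look for an existing child node with this letter...
        TrieNode child = current.Children.Find(n => n.Letter == letter);
        if (child == null)
        {
            // ...and create it if it isn't there yet.
            child = new TrieNode(letter, false, new List<TrieNode>());
            current.Children.Add(child);
        }
        current = child;
    }
    current.IsEndOfWord = true;   // the final node marks the end of a word
}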
To store your trie on disk -- frankly, I would just store the word list as a plain text file and rebuild the trie when you need to. It shouldn't take more than 30 seconds or so, and then you can re-use the trie in memory. If you do want to store the trie in some format that is more like a trie, it shouldn't be hard to come up with a serialization format.
To search the trie for matching a rack, the idea is to explore every part of the trie, but to prune out the areas where the rack cannot possibly match. If you haven't got any "A"s on the rack, there is no need to go down any "A" node. I sketched out the search algorithm in your previous question.
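For completeness, here is a rough version of that pruned search, with wildcards left out to keep it short; the rack here is a letter-to-count dictionary, and all the names are mine:
static void FindWords(TrieNode node, Dictionary<char, int> rack, string prefix, List<string> results)
{
    if (node.IsEndOfWord)
        results.Add(prefix);   // the path walked so far spells a word

    foreach (TrieNode child in node.Children)
    {
        // Prune: skip the whole branch if no copies of this letter remain on the rack.
        if (!rack.TryGetValue(child.Letter, out int count) || count == 0)
            continue;

        rack[child.Letter] = count - 1;                          // use the letter
        FindWords(child, rack, prefix + child.Letter, results);
        rack[child.Letter] = count;                              // put it back (backtrack)
    }
}

// Usage: for the rack "AAB", pass new Dictionary<char, int> { { 'A', 2 }, { 'B', 1 } }
// and call FindWords(root, rack, "", results).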
I've got an implementation of a functional-style persistent trie that I've been meaning to blog about for a while but never got around to it. If I do eventually post that I'll update this question.