I have a text input field on a web page where the user types in item names for purchase. I'd like to provide a dropdown with possible names, based on the letters typed so far.
The question is how to implement the search on the server (ASP.NET MVC). I'll probably load the whole collection of item names (there are over 100,000) into a static variable on app start. How should I implement an efficient search for names starting with one or more given characters?
TIA
You can sort the collection by name, then write a modified binary search that returns a range of items.
However, I would recommend first trying a simple sequential search and seeing how it behaves under load.
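A minimal sketch of the binary-search idea, assuming the names live in a sorted string array (the lower-bound helper, the class name and the 101-item cap are illustrative choices, not requirements):

    using System;
    using System.Collections.Generic;

    static class PrefixSearch
    {
        // Returns up to 'max' names starting with 'prefix' from a sorted array.
        public static List<string> Find(string[] sortedNames, string prefix, int max)
        {
            var results = new List<string>();
            int i = LowerBound(sortedNames, prefix);
            while (i < sortedNames.Length && results.Count < max &&
                   sortedNames[i].StartsWith(prefix, StringComparison.Ordinal))
            {
                results.Add(sortedNames[i++]);
            }
            return results;
        }

        // Index of the first element >= prefix (classic lower-bound binary search).
        // Every name starting with the prefix sorts at or after this point,
        // so the matches form one contiguous range.
        static int LowerBound(string[] a, string prefix)
        {
            int low = 0, high = a.Length;
            while (low < high)
            {
                int mid = low + (high - low) / 2;
                if (string.Compare(a[mid], prefix, StringComparison.Ordinal) < 0)
                    low = mid + 1;
                else
                    high = mid;
            }
            return low;
        }
    }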
"I'll probably load the whole collection of item names (there are over 100,000) into a static variable on app start. How should I implement an efficient search for names starting with one or more given characters?"
By NOT (!) loading them into a static variable. Hit the db server on every request with a "top 101" clause. Finished.
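For instance, a minimal sketch of that query (assuming Entity Framework or LINQ to SQL; db.Items, Name and prefix are hypothetical names, and StartsWith on a mapped column translates to a SQL LIKE 'prefix%' clause):

    // Fetch at most 101 matching names straight from the database,
    // rather than caching 100,000+ names in a static variable.
    var suggestions = db.Items
        .Where(i => i.Name.StartsWith(prefix))
        .OrderBy(i => i.Name)
        .Select(i => i.Name)
        .Take(101)   // the "top 101" cap from the answer above
        .ToList();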
I'm currently looking for a way to implement a partial word pattern algorithm in C#. The situation I'm in is as follows:
I have a text field for the search pattern. Every time the user enters or deletes a character in this field, an event fires which re-runs the search algorithm. So in case I want to search for the word "face" in strings like
"Facebook", "Facelifting", "Faceless Face" (whatever that should be) or generally ANY real-life sentences as strings,
the algorithm first starts running when "f" is typed in the field. It then shows the most relevant string at the top of the list the strings are in. The second time it runs when "fa" is typed, and the list is sorted again. This goes on until "face" is completely typed in the text field and the list is sorted again.
However, I don't know what algorithm could be used. I tried the answer from Alain (Getting the closest string match), a simple Levenshtein-distance algorithm, as well as a self-made algorithm which calculates the priority via
priority = (length_of_typed_pattern) * (amount_of_substr_matches)
In C#, the latter looks like this:
count = Regex.Matches(title, Regex.Escape(pattern)).Count; // escape the pattern, not the input string
priority = pattern.Length * count;
The pattern as well as the title are composed of only lowercase letters.
My conclusions so far:
Hamming distance won't make any sense, since the strings are not the same length most of the time
The answer from Alain works fine, but only if at least one word matches completely (you only find a most relevant string/sentence when at least one word equals the pattern), so if "face" is typed and there's a string containing the word "facebook", the string containing "facebook" is almost never a top priority
What other ideas could I try? The goal would be to sort the list of strings in the best possible way at the earliest moment (i.e. with the fewest letters typed).
You can look at my implementations in the search-* branches of my repository (http://github.com/croemheld/sprung), in Sprung/WindowMatcher.cs and Sprung/Window.cs.
Thanks for your help.
First of all, you need to store a frequency for each string (the number of times a particular string has been searched) somewhere, so the most relevant one can be shown first. If you need to show, say, the k most relevant entries, a min-heap of size k can be used.
Case 1 - A letter is pressed for the first time:
Step (a): Read all the strings from a database or dictionary and store them in some data structure (say DS1), each with a FLAG_VALID (set to 1 initially) which shows it is a valid string for the present search characters (for the first letter, all the strings will be valid).
As you read the strings, fill the min-heap according to their frequencies; once the heap is full, an element is inserted only when its frequency is greater than the minimum one (i.e. the frequency at the root of the min-heap, which is then evicted).
Step (b) (this step is the same in every case, to show results): To show results you need to output the elements in reverse heap order, i.e. the first element of the min-heap has the least priority, so basically we delete the elements one by one and show them from last to first.
NOTE: The min-heap contains references to the strings, so a string and its frequency can be accessed at the same time.
Case 2 - Typing further letters in the search box:
Step (a): Go through DS1, in which all the strings are present, and check FLAG_VALID first. If it is a valid string, then compare the string from the search box with the string from DS1. Set the flag accordingly (1 for a match, 0 otherwise) and refill the k-sized min-heap, which is empty after the last search, as in Case 1.
Step (b) is as usual.
Case 3 - Deleting a letter in the search box:
It is similar to the above cases, but this time we also need to check the strings whose FLAG_VALID is 0 (i.e. strings which are currently invalid).
This is a crude searching method; it can be improved by choosing a more suitable data structure and tweaking the algorithm.
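A minimal sketch of the top-k idea, assuming the frequencies live in a Dictionary and using a SortedSet of (frequency, name) pairs as the min-heap (the class and method names are illustrative):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class SuggestionSearch
    {
        // Returns the k most frequently searched strings matching the prefix.
        public static List<string> TopK(Dictionary<string, int> frequency,
                                        string prefix, int k)
        {
            // (freq, name) tuples compare by frequency first, so Min is the
            // least frequent entry -- our min-heap root.
            var heap = new SortedSet<(int Freq, string Name)>();
            foreach (var entry in frequency)
            {
                if (!entry.Key.StartsWith(prefix, StringComparison.Ordinal))
                    continue; // corresponds to FLAG_VALID = 0 above
                heap.Add((entry.Value, entry.Key));
                if (heap.Count > k)
                    heap.Remove(heap.Min); // evict the least frequent
            }
            // Step (b): output in reverse order, most frequent first.
            return heap.Reverse().Select(t => t.Name).ToList();
        }
    }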
I am working with WPF in C#. I am using the GetNextControl method to store all the child controls in a Control.ControlCollection. I want to loop through the results and fill in only the text boxes. I have thought of two ways to do this, but which would be more efficient:
Search once and store the results in an Control.ControlCollection.
Use a foreach loop to go through the collection and use multiple if/else statements to find the TextBox I am looking for and fill in the box with some text.
Or,
Search and store all the controls in a Control.ControlCollection.
Use the Find method of the collection to find a TextBox with a certain name and fill in some text in the TextBox.
I think the first way would be slower because there are more comparisons to make, while the second method uses searching only.
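For illustration, the two approaches might look roughly like this (panel1 and "txtCustomer" are hypothetical names; Find is the Windows Forms ControlCollection.Find(key, searchAllChildren) method):

    // Approach 1: loop over the collection with type and name checks.
    foreach (Control c in panel1.Controls)
    {
        if (c is TextBox && c.Name == "txtCustomer")
            c.Text = "some text";
    }

    // Approach 2: let the collection search by key.
    Control[] matches = panel1.Controls.Find("txtCustomer", searchAllChildren: true);
    if (matches.Length > 0 && matches[0] is TextBox box)
        box.Text = "some text";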
Implement the easiest one. Do not worry about optimization until you have metrics to support the need.
If it is not fast enough/efficient enough, then get some good time measurements. Now it is time to consider alternate implementations.
Implement and time each of the alternates, picking the fastest/most efficient one.
NOTE: Assignment help; no code required.
I have a list of file names in a list-box. As part of the assignment I want to search the file names using a binary search implementation.
Can someone help me understand how to implement a binary search without using the built-in List<T>.BinarySearch(...) method?
You have to begin with a sorted list of values. Then you just search like you would if you were playing a number guessing game (and were a computer). Pick the middle element of your list. If the value you're searching for isn't equal to the middle element, do the same thing again, but this time on a sub-list that's half the size (since the list is sorted, you know which side of the list your target is on). Keep doing that until you find the value you're looking for, or until the sub-list is empty, which means the value isn't there.
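In code form, the guessing-game loop might look like this (a minimal sketch over a sorted List&lt;string&gt; of file names; ordinal comparison is assumed to match the sort order):

    using System;
    using System.Collections.Generic;

    static class FileNameSearch
    {
        // Returns the index of 'target' in the sorted list, or -1 if absent.
        public static int BinarySearch(List<string> sorted, string target)
        {
            int low = 0, high = sorted.Count - 1;
            while (low <= high)
            {
                int mid = low + (high - low) / 2;  // middle element
                int cmp = string.Compare(sorted[mid], target, StringComparison.Ordinal);
                if (cmp == 0) return mid;          // found it
                if (cmp < 0) low = mid + 1;        // target is in the upper half
                else high = mid - 1;               // target is in the lower half
            }
            return -1; // sub-list is empty: not present
        }
    }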
I want to do a super fast geocode lookup, returning co-ordinates for an input of town, city or country. My knowledge is basic, but from what I understand writing it in C is a good start. I was thinking it makes sense to have a tree structure like this:
England
    Kent
        Orpington
        Chatam
        Rochester
        Dover
        Edenbridge
    Wiltshire
        Swindon
        Malmsbury
In my file / database I will have the co-ordinate and the town/city name. If I give my program the name "Kent", I want it to return the co-ordinate associated with "Kent" in the fastest way possible.
Should I store the data in a binary file or a SQL database for performance reasons?
What is the best method of searching this data? Perhaps binary tree searching?
How should the data be stored?
Here's a little advice, but not much more than that:
If you want to find places by name or name prefix, as you indicate you wish to, then you would be ill-advised to set up a data structure which stores the data in a hierarchy of country, region, town, as you suggest you might. If one operation dominates the use of your data structure, you are generally best off picking the data structure to suit that operation.
In this case an alphabetical list of places would be better suited to your queries. To each place not at the topmost level you would want to add some kind of reference to the name of its 'parent'. If you have an alphabetical list of places, you might also want to consider an index, perhaps one which points directly to the first place in the list starting with each letter of the alphabet.
As you describe your problem it seems to have much more in common with storing words in a dictionary (I mean the sort of thing in which you look up words rather than any particular collection data-type in any specific programming language which goes under the same name) than with most of what goes under the guise of geo-coding.
My guess would be that a gazetteer including the names of all the world's towns, cities, regions and countries (and their coordinates) which have a population over, say, 1000, could be stored in a very simple data structure (basically a list) with an index or two for rapid location of the first A place-name, the first B, and so on. With a little compression you could probably hold this in the memory of most modern desktop PCs.
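A minimal sketch of that layout, assuming lower-case ASCII names (the Place class, its fields and the constructor argument are all illustrative):

    using System;
    using System.Collections.Generic;

    class Place
    {
        public string Name;    // e.g. "kent"
        public string Parent;  // e.g. "england"; null for a country
        public double Lat, Lon;
    }

    class Gazetteer
    {
        List<Place> places;     // sorted alphabetically by Name
        int[] firstWithLetter;  // index of the first place per initial letter

        public Gazetteer(List<Place> sortedPlaces)
        {
            places = sortedPlaces;
            firstWithLetter = new int[26];
            Array.Fill(firstWithLetter, -1);
            // Walk backwards so the earliest entry for each letter wins.
            for (int i = places.Count - 1; i >= 0; i--)
                firstWithLetter[places[i].Name[0] - 'a'] = i;
        }

        // Scan from the first place sharing the query's initial letter.
        public Place Find(string name)
        {
            int i = firstWithLetter[name[0] - 'a'];
            if (i < 0) return null;
            for (; i < places.Count && places[i].Name[0] == name[0]; i++)
                if (places[i].Name == name) return places[i];
            return null;
        }
    }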
I think the best advice I can give is to use whatever language you are familiar with to get the results you want. Worry about performance once your code works. Then you can look at translating very specific pieces of functionality into C or C++ one at a time until you have the results you want.
You should not worry about how the information is stored, except not to duplicate data.
You should create one or more indices for the data. The indices are associative array / map data structures that contain a key (the item you want to search on) and a value (such as the record and other information associated with the key). This enables fast lookups without altering your data for each type of search.
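As a sketch, reusing the Place class from the example in the previous answer (variable names again illustrative), such an index could be a Dictionary from name to the matching records:

    using System;
    using System.Collections.Generic;

    // Build the index once; a name such as "rochester" can occur in more
    // than one region, hence a list of records per key.
    var byName = new Dictionary<string, List<Place>>(StringComparer.OrdinalIgnoreCase);
    foreach (var p in places)
    {
        if (!byName.TryGetValue(p.Name, out var bucket))
            byName[p.Name] = bucket = new List<Place>();
        bucket.Add(p);
    }

    // Lookups are then O(1) on average, regardless of how the rows are stored:
    if (byName.TryGetValue("kent", out var hits))
        Console.WriteLine($"{hits[0].Lat}, {hits[0].Lon}");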
On the other hand, your case is an excellent fit for a database. I suggest you let the database manage your data (it gives you efficient lookups, among other things). After all, that is what databases live for.
See also: At what point is it worth using a database?
I need to perform calculations and manipulation on an extremely large table or matrix, which will have roughly 7500 rows and 30000 columns.
The matrix data will look like this:
    Document ID | word1 | word2 | word3 | ... | word30000 | Document Class
    0032        |   1   |   0   |   0   | ... |     1     |       P
In other words, the vast majority of the cells will contain boolean values (0s and 1s).
The calculations that need to be done involve word stemming or feature selection (reducing the number of words using reduction techniques), as well as calculations per class or per word, etc.
What I have in mind is designing an OOP model for representing the matrix, and then serializing the objects to disk so I may reuse them later on. For instance, I could have an object for each row or each column, or perhaps an object for each intersection that is contained within another class.
I have thought about representing it in XML, but file sizes may prove problematic.
I may be completely off the mark with my approach here -
Am I on the right path, or would there be better performing approaches to manipulating such large data collections?
Key issues here will be performance (reaction time etc.) as well as redundancy and integrity of the data; obviously I would need to save the data on disk.
You haven't explained the nature of the calculations you're needing to do on the table/matrix, so I'm having to make assumptions there, but if I read your question correctly, this may be a poster-child case for the use of a relational database -- even if you don't have any actual relations in your database. If you can't use a full server, use SQL Server Compact Edition as an embedded database, which would allow you to control the .SDF file programmatically if you chose.
Edit:
After a second consideration, I withdraw my suggestion of a database. This is entirely because of the number of columns in the table: any relational database you use will have hard limits on this, and I don't see a way around it that isn't amazingly complicated.
Based on your edit, I would say that there are three things you are interested in:
A way to analyze the presence of words in documents. This is the bulk of your sample data file, primarily being boolean values indicating the presence or lack of a word in a document.
The words themselves. This is primarily contained in the first row of your sample data file.
A means of identifying documents and their classification. This is the first and last column of your data file.
After thinking about it for a little bit, this is how I would model your data:
With the case of word presence, I feel it's best to avoid a complex object model. You're wanting to do pure calculation in both directions (by column and by row), and the most flexible and potentially performant structure for that in my opinion is a simple two-dimensional array of bool fields, like so:
var wordMatrix = new bool[numDocuments, numWords];
The words themselves should be in an array or list of strings that are index-linked to the second dimension of the word matrix -- the one defined by numWords in the example above. If you ever needed to quickly search for a particular word, you could use a Dictionary<string, int>, with the key as the word and the value as the index, to quickly find the index of a particular word.
The document identification would similarly be in an array or list of ints index-linked to the first dimension. I'm assuming the document ids are integer values there. The classification would be a similar array or list, although I'd use a list of enums representing each possible value of the classification. As with the word search, if you needed to search for documents by id, you could have a Dictionary<int, int> act as your search index.
I've made several assumptions with this model, particularly that you want to do pure calculation on the word presence in all directions rather than "per document". If I'm wrong, a simpler approach might be to drop the two-dimensional array and model by document, i.e. a single C# Document class with a DocumentId and DocumentClassification field as well as a simple array of booleans that are index-linked to the word list. You could then work with a list of these Document objects along with a separate list of words.
Once you have a data model you like, saving it to disk is the easiest part: just use C# serialization. You can save it via XML or binary, your choice. Binary would give you the smallest file size, naturally (I figure a little more than 200 MB, plus the size of a list of 30000 words). If you include the Dictionary lookup indexes, perhaps an additional 120 kB.
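A minimal sketch of that model plus binary serialization (the type and member names are illustrative, not prescribed by the question; BinaryFormatter was the standard choice at the time, though it is obsolete in current .NET):

    using System;
    using System.IO;
    using System.Runtime.Serialization.Formatters.Binary;

    [Serializable]
    enum DocumentClass { P, N } // whatever the real classes are

    [Serializable]
    class WordMatrixModel
    {
        public bool[,] WordMatrix;              // [document, word] 0/1 cells
        public string[] Words;                  // index-linked to the word dimension
        public int[] DocumentIds;               // index-linked to the document dimension
        public DocumentClass[] Classifications; // one entry per document
    }

    static class ModelStore
    {
        public static void Save(WordMatrixModel model, string path)
        {
            using (var stream = File.Create(path))
                new BinaryFormatter().Serialize(stream, model);
        }

        public static WordMatrixModel Load(string path)
        {
            using (var stream = File.OpenRead(path))
                return (WordMatrixModel)new BinaryFormatter().Deserialize(stream);
        }
    }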