Efficient algorithm for finding related submissions

Efficient algorithm for finding related submissions - c#

I recently launched my humble side project and would like to add a "related submissions" section when viewing a submission. Exactly like what SO is doing here - see right column, titled "Related"
Considering that each submission has a title and a set of tags, what is most effective (optimum result), most efficient (fast, memory friendly) way to query the database for related submissions?
I can think of one way to do this (which I'll post as an answer) but I'm very interested to see what others have to say. Or perhaps there's already a standard way of achieving this?

Here's my two cent solution:
To achieve the best output, we need to put “weight” on the query results.
To start with, each submission in the database is assumed to have a weight of zero.
Then, if a submission in the "pool" shares one tag with the current submission, we'd add +3 to the found submission. Hence, if another submission is found that shares two tags with the current submission, we add +6 to the weight.
Next, we split/tokenize the title of the current submission and remove “stop words”.
I’ve seen a list of stop words from google, but for now I’ll define my stop words to be: [“of”, “a”, “the”, “in”]
Example:
Title “The Best Submission of All Times”
Result the array: ["The", “Best”, “Submission”, “of”, “All”, “Times”]
Remove stop words: [“Best”, “Submission”, “All”, “Times”]
Then we query the database for submissions containing any of the mentioned titles, and for each result we add the weight: +2
And finally sort the list descending by weight and take the top N results.
What do you think? (be gentle!)

If I understand well, you need a technique to find whether two posts are "similar" one to each other. You may want to use a probabilistic model for that:
http://en.wikipedia.org/wiki/Mutual_information
The idea would be to say that if two posts share a lot of "uncommon" words, they are probably speaking on the same topic. For detecting uncommon words, depending on your application, you may use a general table of frequencies, or maybe better, build it yourself on the universe of the words of your posts (but you will need to have enough of them to have something relevant).
I would not limit myself on title and tags, but I would overweight them in the research.
This kind of ideas is very common in spam filtering. I unfortunately the time to make a full review, but a quick google search gives:
http://www.aclweb.org/anthology/P/P04/P04-3024.pdf
karlmicha.googlepages.com/acl2004_poster.pdf

Related

Filter elements in Revit and set parameter

Hello everybody,
around two or three month ago I started to learn Dynamo for Revit... finally :)
After learning and testing a lot, I got a few own scripts working. Then I learned Python, because I couldn't create the next script only with Dynamo-Nodes.
Then I thought "Let's see how difficult it is to get something done as a PlugIn".
I watched some Videos and read a lot of stuff.
Finally I got the Revit-AddIn-Wizard installed and made my first small Test-PlugIn.
Great...
Now I have a few problems which I do not understand... so I thought I will try my luck here... because I got so much information and help, reading through this site.
My goal was/is the following: (I tell you what I have now)
A form with a few buttons, comboboxes and a DataGridView.
I can load an Excelfile, click on "Show" to show it in the DataGridView.
The header of each row will be automatically put into 3 comboboxes.
In the first combobox you select the first search-parameter, in the second you CAN select another search-parameter and in the third combobox you select the parameter you want to set.
I have a checkbox to switch from type- to instance-parameter for the search- and the set-operation.
There is also a button which shows another small form with a list of categories (I won't search for ALL, only nearly all modelcategories).
PlugIn
I took me a lot of "watching Videos, reading through the internet, testing, testing and testing".
Thanks to this site here and a few others... I managed to get this whole PlugIn nearly 100% working.
But now I have a few strange issues and I have absolutely no clue on how to fix them or if it is possible. And I really hope that someone can help me.
First... I just tell you my problems and perhaps someone can say "this really IS an issue!" or that it is possible to get it done. Then I would post some code.
So... what do I do?!
1. I have a FilteredElementCollector which filters ALL elements.
2. Depending on my "Type/Instance-Checkboxes" I do .WhereElementIsElementType OR .WhereElementIsNotElementType.
3. Then it passes a MultiCategoryFilter to get the big list down to only the modelcategories.
4. Next, the collection passes one of ten different "methods" depending on all settings. There I filter this collection depending on the searchlists-comboboxes. When the combobox says "Familie" or "Typ" then it filters for ".BuiltInParameter.SymbolFamilyName" or ".Name" otherwise it just uses ".LookupParameter".
After that I have a collection with only the elements of selected categories which contains the values from the Excellist.
5. Depending on what my search- and set-settings are (e.g. search for type and set instance) I have to get the instances from the collected types or the other way around.
6. Then I pass it down to another method where I finally set the parameter.
So... Excelheader goes into comboboxes, depending on what you select in there it creates lists with the values of the selected rows.
I hope you all understand.
Now... where are my problems?
When I search for type-familynames or instance-parameter and set a typeparameter it works for ALL categories without any error.
1. When I try to set an instanceparameter (doesn't matter what my search-setting are) it works for all "normal" families but not for the systemfamilies (e.g. walls, floors, pipes etc.). No error, just nothing happens WHY? It seems that I cannot set an instance-parameter for system-families.
2. Roofs, Stairs, CurtainPanels and GenericModel make problems when I search for a typeparameter Error is something like "The object reference was not set to an object instance". Only with these 4 categories and it doesn't matter what I want to set... but when I search for family-/typeNAME or Instance-Parameter, then I can set type or instance and it works (except instance for sysfam).
3. When I try to search AND set an instance-parameter it works for ALL categories EXCEPT if one wall does not contain a search value... it really is enough that ONE wall does not have a search-param-value that everything will be cancelled.
I have a few other small problems... but I hope someone can help me with these problems... I would be extremely thankfull
greetings and have a nice day or night :)
Philipp

Tl; dr.
The three problems you describe sound like your own. I have no heard anybody else runAsk three separate questions and provide three separate minimal code snippets describing how they arise,. into those. I suggest that you create three separate independent minimal reproducible cases to demonstrate all three issues. Chances are, when you simplify and minimalise your code, the problem will go away. If it does not, it might just possibly be in a small and manageable enough state for other people to help you take a look at it. Given the long-winded description above, nobody in the world can help you.

Thank you for your answer Jeremy,
as I said, as a first start it is ok for me if you don't say "With theses categories, there are indeed some issues!"
I think I've managed to create 3 small examples of my problems.
For each problem I made a zip-file containing the complete visual-studio folder, a small exampleproject and a readme.txt with (I hope) enough information to understand everything in detail.
Problem1
Problem3
You only need to compile them or copy the .addin and .ddl files into the Revit AddIn folder. Then you get the new ribbons.
Short problem summary = I get problems when searching for parametervalues and setting values to another parameter.
Edit: I just solved the 2. problem when searching for familynames and setting system-families-parameter.
I used:
ElementClassFilter ecf = new ElementClassFilter(typeof(FamilyInstance));
FilteredElementColletor colle2 = new FilteredElementCollector(doc);
colle2.WherePasses(ecf);
I simply deleted the ClassFilter and do it now like in the other cases where I need instances.
FilteredElementCollector colle2 = new FilteredElementCollector(doc);
colle2.WhereElementIsNotElementType();
The 1. and 3. problem still exist :/
I would be thankful for any help someone can provide :)

Performant search in listview

I am using Linq to search and show a specific string in a ListView on WinRT (Windows Phone 8.1).
This is my current method:
string value = (string) sender.Text.ToLower();
if (value == "")
{
AirportsList.ItemsSource = _airports.CountryList;
}
else
{
List<Airport> queryList = _airports.AirportList
.Where(airport => airport.IcaoId.ToLower().Contains(value)
|| airport.IcaoName.ToLower().Contains(value)
|| airport.City.ToLower().Contains(value)
|| airport.Country.ToLower().Contains(value)).ToList();
AirportsList.ItemsSource = queryList.ToList();
}
This is quite slow and laggy since it has to create the itemssource each time. Is there a more performant way of doing it?

I imagine creating the queryList isn't necessarily the slow part. That's just a List<Airport>, a class any hardware will capably instantiate with minimal overhead. I imagine the slow part is having to check your value against four different properties on all possible airports, of which there might be many. Further, for each property, you're searching if it Contains the string, rather than it just being at the beginning (via StartsWith). This means that each property has to be linearly searched for your result.
Luckily, searching through items is a pretty common computing issue; consequently, there's many solutions developers cleverer than myself have devised for speeding searches up, or at least making them seem snappy. I'd suggest you look pick a combination of the items listed below (and search for more!). I can't provide implementations them without writing an epically long answer (maybe another SO'er will oblige) but hopefully they'll help your search!
Return results to the UI as they're gathered - This one requires no fundamental change to your approach, nor does it make it faster, but immediate UI feedback is psychologically speaking very important and keeps the user's attention.
Narrow down the query - Are they searching for LHR, TPE, or any three-lettered phrase? It's likely that they are searching for an ICAO id. Perform your search through the ICAO ID's first and echo them to the UI asap. Checking their IP, are they seaching from a computer based in England? If they are, you could prioritize airports in England, etc. For many cases, this will complete most people's searches without having to go through all possible permutatitons. Of course, step to full-permutation searches after these shortcuts incase they're doing non-standard searches.
Subset previous query answers - If they searched for Lon and the search retrieved London Heathrow, London Gatwick, and Londrina, then a follow-up search for Londo is only ever going to return a subset of those responses. This requires that you hold the state of previous requests (and responses), and could be complicated to implement in practice.
Presort and index _airports.AirportList and use a different search algorythm - At the moment, you're linearly searching through the airports, which has O(n) complexity. Provided you're OK with initially searching with StartsWith rather than Contains, using a binary search would nock the algorythm down to O(log n) which, if there's alot of airports, will end up faster. Presort your _airports.AirportList by IcaoId, IcaoName, City, Country into separate indexes / lists (if you have the memory to spare) and binary search against those.
If that doesn't work, take up drinking (and micro-optimize) - You could pre-emptively ToLower your _airports.AirportList so that you're not having to constantly lowercaps them when searching. You could take advantage of cache optimization and flatten _airports.AirportList into parallel List<IcaoId>, List<IcaoName>, etc. You could statistically analyze your clients' search habits and optimize your searches around them (e.g 90 % are using two airports, always have those airports top of the search pile). You could offload the search onto a 3rd party service that does have the computing power to do it almost instantaneously (I'D SERIOUSLY CONSIDER THIS ONE). You could multithread the entire process. You could rewrite it in C++ and get annoyed when that one class you wrote 2 months ago memory leaked all over your phone. You could find evidence that implicates your boss in a sex scandal and blackmail him/her, avoiding this problem all over, etc.

Data structure for searching strings

I am looking for the best data structure for the following case:
In my case I will have thousands of strings, however for this example I am gonna use two for obvious reasons. So let's say I have the strings "Water" and "Walter", what I need is when the letter "W" is entered both strings to be found, and when "Wat" is entered "Water" to be the only result. I did a research however I am still not quite sure which is the correct data structure for this case and I don't want to implement it if I am not sure as this will waste time. So basically what I am thinking right now is either "Trie" or "Suffix Tree". It seems that the "Trie" will do the trick but as I said I need to be sure. Additionally the implementation should not be a problem so I just need to know the correct structure. Also feel free to let me know if there is a better choice. As you can guess normal structures such as Dictionary/MultiDictionary would not work as that will be a memory killer. I am also planning to implement cache to limit the memory consumption. I am sorry there is no code but I hope I will get a answer. Thank you in advance.

You should user Trie. Tries are the foundation for one of the fastest known sorting algorithms (burstsort), it is also used for spell checking, and is used in applications that use text completion. You can see details here.

Practically, if you want to do auto suggest, then storing upto 3-4 chars should suffice.
I mean suggest as and when user types "a" or "ab" or "abc" and the moment he types "abcd" or more characters, you can use map.keys starting with "abcd" using c# language support lamda expressions.
Hence, I suggest, create a map like:
Map<char, <Map<char, Map<char, Set<string>>>>> map;
So, if user enters "a", you look for map[a] and finds all children.

How to configure tolkenizers with indexing and searching with Lucene and Nhibernate

This is a question for using Lucene via the NHibernate.Search namespace, which works in conjunction with Lucene.
I'm indexing a Title in the Index: Grey's Anatomy
Title : "Grey's Anatomy"
By using Luke, I see that that title is getting Tokenized into:
Title: anatomy
Title: grey
Now, I get a result if I search for:
"grey" or "grey's"
However, if I search for "greys" then I get nothing.
I would like "greys" to return a result. And I guess this could be an issue with any word with an apostrophe.
So, here are some questions:
Am I right in thinking I could fix this issue either by changing something on the time of index (so, changing the tolkenizer..??) or changing it a query time (query parser?)
If there is a solution, could someone provide a small code sample?
thanks

If you make a classic Term search using Lucene, then greys it's most likely not to show on the results, except that you make a nice tokenizing work when saving, so from where I see it, you have 2 choices or a 3rd beign a combination of them:
Use a Stemmer for indexed data and query. Stemmers are fast, and you can always find an implementation of Porter's stemmer somewhere in Google. Problem is when you look for different languages.
Use Fuzzy queries. Using a Fuzzy Query you can set the edit distance that you want to get "away" from the word being search. The thing is that because 2 words are "close" using an edition distance (i.e, Lehvenstein) doesn't mean that they're the same, but the problem of Grey and Grey's and Greys should be solved with setting an edit distance of 2.
I think you will be able to find a decent implementation of the Porter Stemmer, which is nice right here.
Hope I can help!

Techniques to make autocomplete on website more responsive

In my website's advanced search screen there are about 15 fields that need an autocomplete field.
Their content is all depending on each other's value (so if one is filled in, the other's content will change depending on the first's value).
Most of the fields have a huge amount of possibilities (1000's of entries at least).
Currently make an ajax call if the user stops typing for half a second. This ajax call makes a quick call to my Lucene index and returns a bunch of JSon objects. The method itself is really fast, but it's the connection and transferring of data that is too slow.
If I look at other sites (say facebook), their autocomplete is instant. I figure they put the possible values in their HTML, so they don't have to do a round trip. But I fear with the amounts of data I'm handling, this is not an option.
Any ideas?

Return only top x results.
Get some trends about what users are picking,
and order based on that, preferably
automatically.
Cache results for every URL & keystroke combination,
so that you don't have to round-trip
if you've already fetched the result
before.
Share this cache with all
autocompletes that use the same URL
& keystroke combination.
Of course,
enable gzip compression for the
JSON, and ensure you're setting your
cache headers to cache for some
time. The time depends on your rate
of change of autocomplete response.
Optimize the JSON to send down the
bare minimum. Don't send down
anything you don't need.

Are you returning ALL results for the possibilities or just the top 10 as json objects.
I notice a lot of people send large numbers of results back to the screen, but then only show the first few. By sending back small numbers of results, you can reduce the data transfer.

Return the top "X" results, rather than the whole list, to cut back on the number of options? You might also want to try and put in some trending to track what users pick from the list so you can try and make the top "X" the most used/most relvant. You could always return your most relevant list first, then return the full list if they are still struggling.

In addition to limiting the set of results to a top X set consider enabling caching on the responses of the AJAX requests (which means using GET and keeping the URL simple).
Its amazing how often users will backspace then end up retyping exactly the same content. Also by allowing public and server-side caching your could speed up the overall round-trup time.

Cache the results in System.Web.Cache
Use a Lucene cache
Use GET not POST as IE caches this
Only grab a subset of results (10 as people suggest)
Try a decent 3rd party autocomplete widget like the YUI one

Returning the top-N entries is a good approach. But if you want/have to return all the data, I would try and limit the data being sent and the JSON object itself.
For instance:
"This Here Company With a Long Name" becomes "This Here Company..." (you put the dots in the name client side--again; transfer a minimum of data).
And as far as the JSON object goes:
{n: "This Here Company", v: "1"}
... Where "n" would be the name and "v" would be the value.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.