How to configure tolkenizers with indexing and searching with Lucene and Nhibernate

How to configure tolkenizers with indexing and searching with Lucene and Nhibernate - c#

This is a question for using Lucene via the NHibernate.Search namespace, which works in conjunction with Lucene.
I'm indexing a Title in the Index: Grey's Anatomy
Title : "Grey's Anatomy"
By using Luke, I see that that title is getting Tokenized into:
Title: anatomy
Title: grey
Now, I get a result if I search for:
"grey" or "grey's"
However, if I search for "greys" then I get nothing.
I would like "greys" to return a result. And I guess this could be an issue with any word with an apostrophe.
So, here are some questions:
Am I right in thinking I could fix this issue either by changing something on the time of index (so, changing the tolkenizer..??) or changing it a query time (query parser?)
If there is a solution, could someone provide a small code sample?
thanks

If you make a classic Term search using Lucene, then greys it's most likely not to show on the results, except that you make a nice tokenizing work when saving, so from where I see it, you have 2 choices or a 3rd beign a combination of them:
Use a Stemmer for indexed data and query. Stemmers are fast, and you can always find an implementation of Porter's stemmer somewhere in Google. Problem is when you look for different languages.
Use Fuzzy queries. Using a Fuzzy Query you can set the edit distance that you want to get "away" from the word being search. The thing is that because 2 words are "close" using an edition distance (i.e, Lehvenstein) doesn't mean that they're the same, but the problem of Grey and Grey's and Greys should be solved with setting an edit distance of 2.
I think you will be able to find a decent implementation of the Porter Stemmer, which is nice right here.
Hope I can help!

Related

Filter elements in Revit and set parameter

Hello everybody,
around two or three month ago I started to learn Dynamo for Revit... finally :)
After learning and testing a lot, I got a few own scripts working. Then I learned Python, because I couldn't create the next script only with Dynamo-Nodes.
Then I thought "Let's see how difficult it is to get something done as a PlugIn".
I watched some Videos and read a lot of stuff.
Finally I got the Revit-AddIn-Wizard installed and made my first small Test-PlugIn.
Great...
Now I have a few problems which I do not understand... so I thought I will try my luck here... because I got so much information and help, reading through this site.
My goal was/is the following: (I tell you what I have now)
A form with a few buttons, comboboxes and a DataGridView.
I can load an Excelfile, click on "Show" to show it in the DataGridView.
The header of each row will be automatically put into 3 comboboxes.
In the first combobox you select the first search-parameter, in the second you CAN select another search-parameter and in the third combobox you select the parameter you want to set.
I have a checkbox to switch from type- to instance-parameter for the search- and the set-operation.
There is also a button which shows another small form with a list of categories (I won't search for ALL, only nearly all modelcategories).
PlugIn
I took me a lot of "watching Videos, reading through the internet, testing, testing and testing".
Thanks to this site here and a few others... I managed to get this whole PlugIn nearly 100% working.
But now I have a few strange issues and I have absolutely no clue on how to fix them or if it is possible. And I really hope that someone can help me.
First... I just tell you my problems and perhaps someone can say "this really IS an issue!" or that it is possible to get it done. Then I would post some code.
So... what do I do?!
1. I have a FilteredElementCollector which filters ALL elements.
2. Depending on my "Type/Instance-Checkboxes" I do .WhereElementIsElementType OR .WhereElementIsNotElementType.
3. Then it passes a MultiCategoryFilter to get the big list down to only the modelcategories.
4. Next, the collection passes one of ten different "methods" depending on all settings. There I filter this collection depending on the searchlists-comboboxes. When the combobox says "Familie" or "Typ" then it filters for ".BuiltInParameter.SymbolFamilyName" or ".Name" otherwise it just uses ".LookupParameter".
After that I have a collection with only the elements of selected categories which contains the values from the Excellist.
5. Depending on what my search- and set-settings are (e.g. search for type and set instance) I have to get the instances from the collected types or the other way around.
6. Then I pass it down to another method where I finally set the parameter.
So... Excelheader goes into comboboxes, depending on what you select in there it creates lists with the values of the selected rows.
I hope you all understand.
Now... where are my problems?
When I search for type-familynames or instance-parameter and set a typeparameter it works for ALL categories without any error.
1. When I try to set an instanceparameter (doesn't matter what my search-setting are) it works for all "normal" families but not for the systemfamilies (e.g. walls, floors, pipes etc.). No error, just nothing happens WHY? It seems that I cannot set an instance-parameter for system-families.
2. Roofs, Stairs, CurtainPanels and GenericModel make problems when I search for a typeparameter Error is something like "The object reference was not set to an object instance". Only with these 4 categories and it doesn't matter what I want to set... but when I search for family-/typeNAME or Instance-Parameter, then I can set type or instance and it works (except instance for sysfam).
3. When I try to search AND set an instance-parameter it works for ALL categories EXCEPT if one wall does not contain a search value... it really is enough that ONE wall does not have a search-param-value that everything will be cancelled.
I have a few other small problems... but I hope someone can help me with these problems... I would be extremely thankfull
greetings and have a nice day or night :)
Philipp

Tl; dr.
The three problems you describe sound like your own. I have no heard anybody else runAsk three separate questions and provide three separate minimal code snippets describing how they arise,. into those. I suggest that you create three separate independent minimal reproducible cases to demonstrate all three issues. Chances are, when you simplify and minimalise your code, the problem will go away. If it does not, it might just possibly be in a small and manageable enough state for other people to help you take a look at it. Given the long-winded description above, nobody in the world can help you.

Thank you for your answer Jeremy,
as I said, as a first start it is ok for me if you don't say "With theses categories, there are indeed some issues!"
I think I've managed to create 3 small examples of my problems.
For each problem I made a zip-file containing the complete visual-studio folder, a small exampleproject and a readme.txt with (I hope) enough information to understand everything in detail.
Problem1
Problem3
You only need to compile them or copy the .addin and .ddl files into the Revit AddIn folder. Then you get the new ribbons.
Short problem summary = I get problems when searching for parametervalues and setting values to another parameter.
Edit: I just solved the 2. problem when searching for familynames and setting system-families-parameter.
I used:
ElementClassFilter ecf = new ElementClassFilter(typeof(FamilyInstance));
FilteredElementColletor colle2 = new FilteredElementCollector(doc);
colle2.WherePasses(ecf);
I simply deleted the ClassFilter and do it now like in the other cases where I need instances.
FilteredElementCollector colle2 = new FilteredElementCollector(doc);
colle2.WhereElementIsNotElementType();
The 1. and 3. problem still exist :/
I would be thankful for any help someone can provide :)

natural language query processing

I have a NLP (natural language processing application) running that gives me a tree of the parsed sentence, the questions is then how should I proceed with that.
What is the time
\-SBAR - Suborginate clause
|-WHNP - Wh-noun phrase
| \-WP - Wh-pronoun
| \-What
\-S - Simple declarative clause
\-VP - Verb phrase
|-VBZ - Verb, 3rd person singular present
| \-is
\-NP - Noun phrase
|-DT - Determiner
| \-the
\-NN - Noun, singular or mass
\-time
the application has a build in javascript interpreter, and was trying to make the phrase in to a simple function such as
function getReply() {
return Resource.Time();
}
in basic terms, what = request = create function, is would be the returned object, and the time would reference the time, now it would be easy just to make a simple parser for that but then we also have what is the time now, or do you know what time it is. I need it to be able to be further developed based on the english language as the project will grow.
the source is C# .Net 4.5
thanks in advance.

As far as I can see, using dependency parse trees will be more helpful. Often, the number of ways a question is asked is limited (I mean statistically significant variations are limited ... there will probably be corner cases that people ordinarily do not use), and are expressed through words like who, what, when, where, why and how.
Dependency parsing will enable you to extract the nominal subject and the direct as well as indirect objects in a query. Typically, these will express the basic intent of the query. Consider the example of tow equivalent queries:
What is the time?
Do you know what the time is?
Their dependency parse structures are as follows:
root(ROOT-0, What-1)
cop(What-1, is-2)
det(time-4, the-3)
nsubj(What-1, time-4)
and
aux(know-3, Do-1)
nsubj(know-3, you-2)
root(ROOT-0, know-3)
dobj(is-7, what-4)
det(time-6, the-5)
nsubj(is-7, time-6)
ccomp(know-3, is-7)
Both are what-queries, and both contain "time" as a nominal subject. The latter also contains "you" as a nominal subject, but I think expressions like "do you know", "can you please tell me", etc. can be removed based on heuristics.
You will find the Stanford Parser helpful for this approach. They also have this online demo, if you want to see some more examples at work.

Slow iteration through elements in WatiN

I'm writing an application with Watin. Its great, but running a performance analysis on my program, over 50% of execution time is spent looping through lists of elements.
For example:
foreach (TextField bT in browser.TextFields)
{
Is very slow.
I seem to remember seeing somewhere there is a faster way of doing this in WatiN, but unfortunately I can't find the page again.
Accessing the number of elements also seems to be slow, eg;
browser.CheckBoxes.Count
Thanks for any tips,
Chris

I think I could answer you better if I had a better idea of what you were trying to do, but I can share some observations on what I've learned with WatiN so far.
The more specific your selectors are, the faster things will go. Avoid using "browser.Elements" as that is really generic. I'm not sure that it saves much, but doing something like browser.Body.Elements throws the header elements out of the scope of things to check and may save a few calculations.
When I say "scope", consider that WatiN always starts with the entire DOM. Can you think of ways to limit the scope of elements perhaps to the text fields within the main div on your page? WatiN returns Elements and ElementCollections, each of which may have its own ElementCollection. That div probably has a specific ID, so you can do something like
var textFields = ie.Div("divId").TextFields;
Look for opportunities to be more specific, and you can use LINQ to describe what you want more clearly. For example, can you write something like:
ie.Body.TextFields.
Where(tf => !string.IsNullOrWhiteSpace(tf.ClassName) && tf.ClassName.Contains("classname")).ToList().
Foreach(tf => tf.Value = "Your Text");
I would refine that further by reducing the number of times I scan the collection by doing something like:
ie.Body.TextFields.ToList().
Foreach(tf => {
if(!string.IsNullOrWhiteSpace(tf.ClassName) && tf.ClassName.Contains("classname")) {
tf => tf.Value = "Your Text"
}
});
The "Find.By*" specifiers also help WatiN operate on the collections you want faster and are a more elegant short-hand for what I wrote above:
ie.Body.TextFields.Filter(Find.ByClass("class")).ToList().ForEach(tf => tf.Value = "Your Text");
And as a last piece of advice, this project lets you find elements using jQuery/CSS style selectors.
So, tl;dr: Narrow down the scope of what you're looking for, and be specific.
Hope that helps. I'm looking for ways to speed up my own tests.

If you really need to iterate through all text fields, there is no other way. As #Xaqron pointed out, it depends on IE. But maybe you just need to iterate through text fields of eg. specified <div/>? Finding it first, and then iterating through it's text fields would be faster.

Thanks Dahv for a really detailed answer. In my case I've sped up my tests by about 10x using a number of tricks, some similar to yours:
Refining scope as you and prostynick (in my case using Form1.TextField etc.)
First checking if browser.html matches my regex before seeing if
fields do
Using the GehSoft.PRCE RegEx wrapper - its native code regex
matching is far faster than .NET's for small haystacks. So to find a TextField I'd do:
Gehtsoft.PCRE.Regex regexString = new Gehtsoft.PCRE.Regex("[Nn]ame");
foreach (TextField bT in browser.TextFields)
{
//Skip if no match
if (!regexString.Execute(bT.Name).Success) continue;
Before I was looping on a list of regexes, then inside that i was looping on TextFields. Making the TextFields loop the top loop improved speed about 3x.

Efficient algorithm for finding related submissions

I recently launched my humble side project and would like to add a "related submissions" section when viewing a submission. Exactly like what SO is doing here - see right column, titled "Related"
Considering that each submission has a title and a set of tags, what is most effective (optimum result), most efficient (fast, memory friendly) way to query the database for related submissions?
I can think of one way to do this (which I'll post as an answer) but I'm very interested to see what others have to say. Or perhaps there's already a standard way of achieving this?

Here's my two cent solution:
To achieve the best output, we need to put “weight” on the query results.
To start with, each submission in the database is assumed to have a weight of zero.
Then, if a submission in the "pool" shares one tag with the current submission, we'd add +3 to the found submission. Hence, if another submission is found that shares two tags with the current submission, we add +6 to the weight.
Next, we split/tokenize the title of the current submission and remove “stop words”.
I’ve seen a list of stop words from google, but for now I’ll define my stop words to be: [“of”, “a”, “the”, “in”]
Example:
Title “The Best Submission of All Times”
Result the array: ["The", “Best”, “Submission”, “of”, “All”, “Times”]
Remove stop words: [“Best”, “Submission”, “All”, “Times”]
Then we query the database for submissions containing any of the mentioned titles, and for each result we add the weight: +2
And finally sort the list descending by weight and take the top N results.
What do you think? (be gentle!)

If I understand well, you need a technique to find whether two posts are "similar" one to each other. You may want to use a probabilistic model for that:
http://en.wikipedia.org/wiki/Mutual_information
The idea would be to say that if two posts share a lot of "uncommon" words, they are probably speaking on the same topic. For detecting uncommon words, depending on your application, you may use a general table of frequencies, or maybe better, build it yourself on the universe of the words of your posts (but you will need to have enough of them to have something relevant).
I would not limit myself on title and tags, but I would overweight them in the research.
This kind of ideas is very common in spam filtering. I unfortunately the time to make a full review, but a quick google search gives:
http://www.aclweb.org/anthology/P/P04/P04-3024.pdf
karlmicha.googlepages.com/acl2004_poster.pdf

Adding words to SQL Server Full Text Stemmer

I've dug around for a few hours now and cannot find an option to do this. What I would like to do is to add words to the stemmer used by Full Text in SQL Server. I work for an agency that would like to search on variations of names. In other words, if an officer enters the name of "Bill" I would also get a hit on "Will" or "William". Anyone know if this is possible?
I did look at implementing a custom IStemmable interface but that seems a bit of an overblow solution to this problem. Does anyone know of an easier way or have an off the shelf solution that will do this?
Thanks...

In SQL Server 2K5 or 2K8 it is called the "Thesaurus". Well doced in MSDN etc
It handles things like these
<expansion>
<sub>Internet Explorer</sub>
<sub>IE</sub>
<sub>IE5</sub>
</expansion>
<replacement>
<pat>NT5</pat>
<pat>W2K</pat>
<sub>Windows 2000</sub>
</replacement>
<expansion>
<sub>run</sub>
<sub>jog</sub>
</expansion>

Sigh....
Thanks. One of those times I was definitely trying to make it more difficult then it needed to be. I think I found stemmer early on and kept using it in my searches.
Thanks again.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.