Sitecore Lucene index search term with space match same word without space

This seems so simple that I'm convinced I must be overlooking something. I cannot establish how to do the following in Lucene:
The problem
I'm searching for place names.
I have a field called Name
It is using Lucene.Net.Analysis.Standard.StandardAnalyzer
It is TOKENIZED
The value of Name contains 1 space in the value: halong bay.
The search term may or may not contain an extra space due to culturally different spellings or genuine spelling mistakes. E.g. ha long bay instead of halong bay.
If I use the term halong bay I get a hit.
If I use the term ha long bay I do not get a hit.
The attempted solution
Here's the code I'm using to build my predicate using LINQ to Lucene from Sitecore:
var searchContext = ContentSearchManager.GetIndex("my_index").CreateSearchContext();
var term = "ha long bay";
var predicate = PredicateBuilder.Create<MySearchResultItemClass>(sri => sri.Name == term);
var results = searchContext.GetQueryable<MySearchResultItemClass>().Where(predicate);
I have also tried a fuzzy match using the .Like() extension:
var predicate = PredicateBuilder.Create<MySearchResultItemClass>(sri => sri.Like(term));
This also yields no results for ha long bay.
How do I configure Lucene in Sitecore to return a hit for both halong bay and ha long bay search terms, ideally without having to do anything fancy with the input term (e.g. stripping space, adding wildcards, etc)?
Note: I recognise that this would also allow the term h a l o n g b a y to produce a hit, but I don't think I have a problem with this.

A TOKENIZED field means that the analyzer splits the field value into tokens (on whitespace in this case) and adds the resulting terms to the index dictionary. If you index "halong bay" in such a field, it will create the "halong" and "bay" terms.
It's normal for the search engine to fail to retrieve this result for the "ha long" search query, because the index contains no "ha" or "long" terms.
A manual approach would be to define all the other ways to write the place name in another multi-value computed index field named AlternateNames. Then you could issue this kind of query: Name==query OR AlternateNames==query.
An automatic approach would be to also index the place names without spaces in a separate computed index field named CompactName. Then you could issue this kind of query: Name==query OR CompactName==compactedQueryWithoutSpaces.
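A minimal sketch of the CompactName idea, using an in-memory dictionary as a stand-in for the index rather than Sitecore's computed-field API. The helper names Compact and Find and the sample place names are assumptions made for illustration only:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: lower-cases the text and drops all whitespace,
// so "Ha Long Bay" and "halong bay" both become "halongbay".
string Compact(string text) =>
    new string(text.Where(c => !char.IsWhiteSpace(c)).ToArray()).ToLowerInvariant();

// Stand-in for the index: maps each compacted name to the original Name value.
var compactIndex = new[] { "halong bay", "hoi an", "da nang" }
    .ToDictionary(name => Compact(name), name => name);

// Look up a user query by its compacted form.
string Find(string query) =>
    compactIndex.TryGetValue(Compact(query), out var name) ? name : "(no hit)";

Console.WriteLine(Find("ha long bay")); // prints "halong bay"
Console.WriteLine(Find("halong bay"));  // prints "halong bay"
```

In a real index the same normalisation would have to be applied both when the computed field is built and when the query term is prepared, otherwise the two sides will not line up.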
I hope this helps
Jeff

Something like this might do the trick:
var predicate = PredicateBuilder.False<MySearchResultItemClass>();
foreach (var t in term.Split(' '))
{
    var tempTerm = t;
    predicate = predicate.Or(p => p.Name.Contains(tempTerm));
}
var results = searchContext.GetQueryable<MySearchResultItemClass>().Where(predicate);
It does split your input string, but I guess that is not 'fancy' ;)

Related

Combining fuzzy search with synonym expansion in Azure search

I'm using the Microsoft.Azure.Search SDK to run an Azure Cognitive Services search that includes synonym expansion. My SynonymMap is as follows:
private async Task UploadSynonyms()
{
    var synonymMap = new SynonymMap()
    {
        Name = "desc-synonymmap",
        Synonyms = "\"dog\", \"cat\", \"rabbit\"\n "
    };
    await m_SearchServiceClient.SynonymMaps.CreateOrUpdateAsync(synonymMap);
}
This is mapped to Animal.Name as follows:
index.Fields.First(f => f.Name == nameof(Animal.Name)).SynonymMaps = new[] { "desc-synonymmap" };
I am trying to use both fuzzy matching and synonym matching, so that, for example:
If I search for 'dog' it returns any Animal with a Name of 'dog', 'cat' or 'rabbit'
If I search for 'dob' it fuzzy matches to 'dog' and returns any Animal with a Name of 'dog', 'cat' or 'rabbit', as they are all synonyms for 'dog'
My search method is as follows:
private async Task RunSearch()
{
    var parameters = new SearchParameters
    {
        SearchFields = new[] { nameof(Animal.Name) },
        QueryType = QueryType.Full
    };
    var results = await m_IndexClientForQueries.Documents.SearchAsync<Animal>("dog OR dog~", parameters);
}
When I search for 'dog' it correctly returns any result with dog/cat/rabbit as its Name. But when I search for 'dob' it only returns matches for 'dog', and not any synonyms.
This answer from January 2019 states that "Synonym expansions do not apply to wildcard search terms; prefix, fuzzy, and regex terms aren't expanded." but this answer was posted over a year ago and things may have changed since then.
Is it possible to both fuzzy match and then match on synonyms in Azure Cognitive Search, or is there any workaround to achieve this?

@spaceplane
Synonym expansions do not apply to wildcard search terms; prefix, fuzzy, and regex terms aren't expanded
Unfortunately, this still holds true. Reference : https://learn.microsoft.com/en-us/azure/search/search-synonyms
The reason is that the words/graphs obtained are passed directly to the index (as per this doc).
Having said that, I'm thinking of two possible options that may meet your requirement:
Option 1
Have a local fuzzy matcher, with which you can get the possible matching words for a typed word.
Sharing a reference that I found: Link 1. I came across a lot of packages that do similar tasks.
From the words you obtain, you can build an OR query combining all the matching words and issue it to Azure Cognitive Search.
For instance, when dob~ is fired, assume "dot, dog" are the words generated by the fuzzy logic code.
We take these two words and issue a "dog OR dot" query to Azure. Synonyms will then take effect because of the search term "dog", and the results will be retrieved based on the synonym map.
Option 2
You could consider handling this with a synonym map, for example by mapping "dog" to "dob, dgo, dot" along with the other synonyms.
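Option 1 could be sketched roughly as follows. This is not a specific package: it assumes a plain Levenshtein edit distance as the local fuzzy matcher and a small known vocabulary, and the names EditDistance, BuildOrQuery and maxEdits are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Classic dynamic-programming Levenshtein edit distance.
int EditDistance(string a, string b)
{
    var d = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
    for (int j = 0; j <= b.Length; j++) d[0, j] = j;
    for (int i = 1; i <= a.Length; i++)
    {
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                               d[i - 1, j - 1] + cost);
        }
    }
    return d[a.Length, b.Length];
}

// Hypothetical helper: expands a typed word into an OR query over the
// vocabulary words that are within maxEdits of it.
string BuildOrQuery(string typed, IEnumerable<string> vocabulary, int maxEdits = 1)
{
    var candidates = vocabulary.Where(w => EditDistance(typed, w) <= maxEdits);
    return string.Join(" OR ", candidates);
}

var vocabulary = new[] { "dog", "dot", "cat", "rabbit" };
Console.WriteLine(BuildOrQuery("dob", vocabulary)); // prints "dog OR dot"
```

The resulting string contains only plain terms, so it could be sent as the search text in place of "dob~", letting the synonym map apply to "dog".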

How can I get the Regex Groups for a given Capture?

I'm parsing CSS3 selectors using a regex. For example, the selector a>b,c+d is broken down into:
Selector:
    a>b
    c+d
SOSS:
    a
    b
    c
    d
TypeSelector:
    a
    b
    c
    d
Identifier:
    a
    b
    c
    d
Combinator:
    >
    +
The problem is, for example, I don't know which selector the > combinator belongs to. The Selector Group has 2 captures (as shown above), each containing 1 combinator. I want to know what that combinator is for that capture.
Groups have lists of Captures, but Captures don't have lists of Groups found in that Capture. Is there a way around this, or should I just re-parse each selector?
Edit: Each capture does give you the index of where the match occurred though... maybe I could use that information to determine what belongs to what?
So you don't think I'm insane, the syntax is actually quite simple, using my special dict class:
var flex = new FlexDict
{
    {"GOS"/*Group of Selectors*/, @"^\s*{Selector}(\s*,\s*{Selector})*\s*$"},
    {"Selector", @"{SOSS}(\s*{Combinator}\s*{SOSS})*{PseudoElement}?"},
    {"SOSS"/*Sequence of Simple Selectors*/, @"({TypeSelector}|{UniversalSelector}){SimpleSelector}*|{SimpleSelector}+"},
    {"SimpleSelector", @"{AttributeSelector}|{ClassSelector}|{IDSelector}|{PseudoSelector}"},
    {"TypeSelector", @"{Identifier}"},
    {"UniversalSelector", @"\*"},
    {"AttributeSelector", @"\[\s*{Identifier}(\s*{ComparisonOperator}\s*{AttributeValue})?\s*\]"},
    {"ClassSelector", @"\.{Identifier}"},
    {"IDSelector", @"#{Identifier}"},
    {"PseudoSelector", @":{Identifier}{PseudoArgs}?"},
    {"PseudoElement", @"::{Identifier}"},
    {"PseudoArgs", @"\([^)]*\)"},
    {"ComparisonOperator", @"[~^$*|]?="},
    {"Combinator", @"[ >+~]"},
    {"Identifier", @"-?[a-zA-Z\u00A0-\uFFFF_][a-zA-Z\u00A0-\uFFFF_0-9-]*"},
    {"AttributeValue", @"{Identifier}|{String}"},
    {"String", @""".*?(?<!\\)""|'.*?(?<!\\)'"},
};

You shouldn't write one regex to parse the whole thing. Instead, first get the selectors, then get the combinator for each of them. (At least that's how you would parse your example; real CSS is going to be more complicated.)

Each capture does give you the index of where the match occurred though... maybe I could use that information to determine what belongs to what?
Just thinking aloud here; you could pick out each match in the Selector group, get its starting and ending indices relative to the entire match and see if the index of each combinator falls within the start and end index range. If the combinator's index falls within the range, it occurs in that selector.
I'm not sure how this would fare in terms of performance though. But I think you could make it work.
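The index-range idea might look something like this. The pattern is a heavily simplified stand-in for the question's grammar (word characters joined by > or +, selectors separated by commas), and the helper name CombinatorsBySelector is made up for the example:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Simplified stand-in grammar. Both Selector groups share one name,
// so their captures accumulate in the same group.
var pattern =
    @"^(?<Selector>\w+(?:(?<Combinator>[>+])\w+)*)" +
    @"(?:,(?<Selector>\w+(?:(?<Combinator>[>+])\w+)*))*$";

// For each Selector capture, collect the Combinator captures whose
// index range lies inside that selector's index range.
List<string> CombinatorsBySelector(string input)
{
    var match = Regex.Match(input, pattern);
    var lines = new List<string>();
    foreach (Capture selector in match.Groups["Selector"].Captures)
    {
        int start = selector.Index;
        int end = selector.Index + selector.Length;
        var owned = match.Groups["Combinator"].Captures
            .Cast<Capture>()
            .Where(c => c.Index >= start && c.Index + c.Length <= end)
            .Select(c => c.Value);
        lines.Add(selector.Value + ": " + string.Join(" ", owned));
    }
    return lines;
}

foreach (var line in CombinatorsBySelector("a>b,c+d"))
    Console.WriteLine(line);
```

Each selector is scanned against every combinator capture, so the cost grows with selectors times combinators; for typical selector lists that is unlikely to matter.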

I wouldn't recommend using regex for parsing anything. Except for very simple cases parsers are almost always a better choice. Take a look at this question.
Is there a CSS parser for C#?

How to enumerate Linq results?

Referring to this topic: Minimize LINQ string token counter
And using the following provided code:
string src = "for each character in the string, take the rest of the " +
             "string starting from that character " +
             "as a substring; count it if it starts with the target string";
var results = src.Split()            // default split by whitespace
    .GroupBy(str => str)             // group words by the value
    .Select(g => new
    {
        str = g.Key,                 // the value
        count = g.Count()            // the count of that value
    });
I need to enumerate all values (keywords and number of occurrences) and load them into NameValueCollection. Due to my very limited Linq knowledge, can't figure out how to make it. Please advise. Thanks.

I won't guess why you want to put anything in a NameValueCollection, but is there some reason why
foreach (var result in results)
    collection.Add(result.str, result.count.ToString());
is not sufficient?
(EDIT: Changed accessor to Add, which may be better for your use case.)
If the answer is "no, that works" you should probably stop and figure out what the hell the above code is doing before using it in your project.

Looks like your particular problem could just as easily use a Dictionary instead of a NameValueCollection. I forget if this is the correct ToDictionary syntax, but just google the ToDictionary() method:
Dictionary<string, int> useADictionary = results.ToDictionary(x => x.str, x => x.count);

You certainly want a Dictionary instead of NameValueCollection. The whole point is to show unique tokens (strings) with each token's occurrence count (an int), yes?
NameValueCollection is a special-purpose collection that requires both key and value to be strings. Dictionary<string, int> is the mainstream .NET way to associate a unique string key with its corresponding int value.
Take a look at the various System.Collections namespaces to understand what each is intended to achieve. Typically these days, System.Collections.Generic is the most widely-seen, with System.Collections.Concurrent for multithreaded programs.
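Putting the suggestions from these answers together, a small end-to-end sketch might look like this. The sample sentence is invented for the example; the NameValueCollection part is only there in case that type really is required:

```csharp
using System;
using System.Collections.Generic;
using System.Collections.Specialized;
using System.Linq;

string src = "the cat and the dog and the bird";

// Count occurrences of each whitespace-separated token.
Dictionary<string, int> counts = src.Split()
    .GroupBy(word => word)
    .ToDictionary(g => g.Key, g => g.Count());

// If a NameValueCollection really is required, the counts must be stored as strings.
var collection = new NameValueCollection();
foreach (var pair in counts)
    collection.Add(pair.Key, pair.Value.ToString());

Console.WriteLine(counts["the"]);      // prints 3
Console.WriteLine(collection["and"]);  // prints 2
```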

How to do partial word searches in Lucene.NET?

I have a relatively small index containing around 4,000 locations. Among other things, I'm using it to populate an autocomplete field on a search form.
My index contains documents with a Location field containing values like
Ohio
Dayton, Ohio
Dublin, Ohio
Columbus, Ohio
I want to be able to type in "ohi" and have all of these results appear and right now nothing shows up until I type the full word "ohio".
I'm using Lucene.NET v2.3.2.1 and the relevant portion of my code is as follows for setting up my query....
BooleanQuery keywords = new BooleanQuery();
QueryParser parser = new QueryParser("location", new StandardAnalyzer());
parser.SetAllowLeadingWildcard(true);
keywords.Add(parser.Parse("\"*" + location + "*\""), BooleanClause.Occur.SHOULD);
luceneQuery.Add(keywords, BooleanClause.Occur.MUST);
In short, I'd like to get this working like a LIKE clause similar to
SELECT * from Location where Name LIKE '%ohi%'
Can I do this with Lucene?

Try this query:
parser.Parse(query.Keywords.ToLower() + "*")

Yes, this can be done. But a leading wildcard can result in slow queries. Check the documentation. Also, if you are indexing the entire string (e.g. "Dayton, Ohio") as a single token, most of the queries will degenerate to leading prefix queries. Using a tokenizing analyzer like StandardAnalyzer (which I suppose you are already doing) will lessen the need for a leading wildcard.
If you don't want leading prefixes for performance reasons, you can try out indexing ngrams. That way, there will not be any leading wildcard queries. The ngram (assuming only of length 4) tokenizer will create tokens for "Dayton Ohio" as "dayt", "ayto", "yton" and so on.
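To make the n-gram idea concrete, here is a rough sketch of what such a tokenizer produces. It is plain string handling, not Lucene.NET's own n-gram tokenizer, and the helper name NGrams is an assumption:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: yields every substring of length n from each word.
// Words shorter than n produce no grams at all.
IEnumerable<string> NGrams(string text, int n)
{
    var words = text.ToLowerInvariant()
        .Split(new[] { ' ', ',' }, StringSplitOptions.RemoveEmptyEntries);
    foreach (var word in words)
    {
        for (int i = 0; i + n <= word.Length; i++)
            yield return word.Substring(i, n);
    }
}

Console.WriteLine(string.Join(", ", NGrams("Dayton, Ohio", 4))); // prints "dayt, ayto, yton, ohio"
Console.WriteLine(NGrams("Dayton, Ohio", 3).Contains("ohi"));    // prints True
```

Note that a three-letter query such as "ohi" only matches if grams of length 3 are indexed, so the gram length (or range of lengths) has to cover the shortest input you want to autocomplete on.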

It's more a matter of populating your index with partial words in the first place. Your analyzer needs to put the partial keywords into the index as it analyzes (and hopefully weight them lower than full keywords as it does).
Lucene index lookup trees work from left to right. If you want to search in the middle of a keyword, you have to break it up as you analyze. The problem is that partial keywords will usually explode your index size.
People usually use really creative analyzers that break words up into root words (stripping prefixes and suffixes).
Dig deep into understanding Lucene. It's good stuff. :-)

Is it possible to negate a regular expression search?

I'm building a lexical analysis engine in c#. For the most part it is done and works quite well. One of the features of my lexer is that it allows any user to input their own regular expressions. This allows the engine to lex all sort of fun and interesting things and output a tokenised file.
One of the issues I'm having is that I want the user to have everything contained in this tokenised file, i.e. the parts they are looking for and the parts they are not (partial highlighting would be a good example of this).
Based on the way my lexer highlights I found the best way to do this would be to negate the regular expressions given by the user.
So if the user wanted to lex a string for every occurrence of "T" the negated version would find everything except "T".
Now the above is easy to do but what if a user supplies 8 different expressions of a complex nature, is there a way to put all these expressions into one and negate the lot?

You could combine several regexes into one by using (pattern1)|(pattern2)|...
To negate it, you just check for !IsMatch.
var matches = Regex.Matches("aa bb cc dd", @"(?<token>a{2})|(?<token>d{2})");
would return in fact 2 tokens (note that I've used the same name twice.. that's ok)
Also explore Regex.Split. For instance:
var split = Regex.Split("aa bb cc dd", @"(?<token>aa bb)|(?:\s+)");
returns the words as tokens, except for "aa bb" which is returned as one token because I defined it as so with (?...).
You can also use the Index and Length properties to calculate the middle parts that have not been recognized by the Regex:
var matches = Regex.Matches("aa bb cc dd", @"(?<token>a{2})|(?<token>d{2})");
for (int i = 0; i < matches.Count; i++)
{
    var group = matches[i].Groups["token"];
    Console.WriteLine("Token={0}, Index={1}, Length={2}", group.Value, group.Index, group.Length);
}
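Building on the Index and Length idea, the unmatched middle parts could be collected along these lines. This is only a sketch: the helper name Segment and the tuple shape are assumptions, and the pattern is the combined alternation without the named groups:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Splits the input into consecutive segments, flagging each one as
// matched by the pattern or not, so that nothing is dropped.
List<(string Text, bool IsMatch)> Segment(string input, string pattern)
{
    var segments = new List<(string Text, bool IsMatch)>();
    int position = 0;
    foreach (Match m in Regex.Matches(input, pattern))
    {
        if (m.Length == 0) continue; // ignore empty matches
        if (m.Index > position)      // unmatched gap before this match
            segments.Add((input.Substring(position, m.Index - position), false));
        segments.Add((m.Value, true));
        position = m.Index + m.Length;
    }
    if (position < input.Length)     // unmatched tail after the last match
        segments.Add((input.Substring(position), false));
    return segments;
}

foreach (var (text, isMatch) in Segment("aa bb cc dd", @"a{2}|d{2}"))
{
    var label = isMatch ? "token" : "other";
    Console.WriteLine(label + ": '" + text + "'");
}
```

Concatenating the Text values in order reproduces the original input, which is what a tokenised file containing both the wanted and unwanted parts would need.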
