Combining fuzzy search with synonym expansion in Azure search - c#

I'm using the Microsoft.Azure.Search SDK to run an Azure Cognitive Search query that includes synonym expansion. My SynonymMap is as follows:
private async Task UploadSynonyms()
{
    var synonymMap = new SynonymMap()
    {
        Name = "desc-synonymmap",
        Synonyms = "\"dog\", \"cat\", \"rabbit\"\n "
    };
    await m_SearchServiceClient.SynonymMaps.CreateOrUpdateAsync(synonymMap);
}
This is mapped to Animal.Name as follows:
index.Fields.First(f => f.Name == nameof(Animal.Name)).SynonymMaps = new[] { "desc-synonymmap" };
I am trying to use both fuzzy matching and synonym matching, so that, for example:
If I search for 'dog' it returns any Animal with a Name of 'dog', 'cat' or 'rabbit'
If I search for 'dob' it fuzzy matches to 'dog' and returns any Animal with a Name of 'dog', 'cat' or 'rabbit', as they are all synonyms for 'dog'
My search method is as follows:
private async Task RunSearch()
{
    var parameters = new SearchParameters
    {
        SearchFields = new[] { nameof(Animal.Name) },
        QueryType = QueryType.Full
    };
    var results = await m_IndexClientForQueries.Documents.SearchAsync<Animal>("dog OR dog~", parameters);
}
When I search for 'dog' it correctly returns any result with dog/cat/rabbit as its Name. But when I search for 'dob' it only returns matches for 'dog', and not any synonyms.
This answer from January 2019 states that "Synonym expansions do not apply to wildcard search terms; prefix, fuzzy, and regex terms aren't expanded." but this answer was posted over a year ago and things may have changed since then.
Is it possible to both fuzzy match and then match on synonyms in Azure Cognitive Search, or is there any workaround to achieve this?

@spaceplane
Synonym expansions do not apply to wildcard search terms; prefix, fuzzy, and regex terms aren't expanded
Unfortunately, this still holds true. Reference: https://learn.microsoft.com/en-us/azure/search/search-synonyms
The reason is that the terms (or term graphs) produced for these queries are passed directly to the index (as per this doc).
Having said that, I can think of two possible options that may meet your requirement:
Option 1
Use a local fuzzy matcher to get the possible matching words for a typed word.
Sharing a reference that I found: Link 1. I came across several packages that do similar tasks.
From the words you obtain, build an OR query that combines all of the matches and issue it to Azure Cognitive Search.
For instance, when dob~ is fired, assume the fuzzy logic code generates "dot" and "dog".
We take these two words and issue the query "dog OR dot" to Azure. Synonyms then take effect because of the search term "dog", and the results are retrieved according to the synonym map.
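As a rough illustration of Option 1, here is a minimal sketch of a local fuzzy matcher that turns a typed term into an OR query. It assumes you can supply a vocabulary of known index terms yourself; the class name FuzzyQueryBuilder, the vocabulary list, and the maxDistance threshold are all hypothetical, and plain Levenshtein distance stands in for whichever fuzzy-matching package you choose.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FuzzyQueryBuilder
{
    // Classic Levenshtein edit distance (insertions, deletions, substitutions).
    public static int EditDistance(string a, string b)
    {
        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1),
                    previous[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            var swap = previous; previous = current; current = swap;
        }
        return previous[b.Length];
    }

    // Joins every vocabulary word within maxDistance edits of the typed term
    // with OR, so each word is a plain term and can be synonym-expanded.
    // Assumption: "vocabulary" is a list of known terms that you maintain yourself.
    public static string BuildOrQuery(string typed, IEnumerable<string> vocabulary, int maxDistance = 1)
    {
        var matches = vocabulary
            .Where(word => EditDistance(typed.ToLowerInvariant(), word.ToLowerInvariant()) <= maxDistance)
            .Distinct()
            .ToList();

        // Fall back to the raw input when nothing in the vocabulary is close enough.
        return matches.Count == 0 ? typed : string.Join(" OR ", matches);
    }
}
```

With the vocabulary { "dog", "dot", "cat", "rabbit" }, BuildOrQuery("dob", vocabulary) gives dog OR dot, which can then be passed to SearchAsync with QueryType.Full as in the question.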
Option 2
You could handle common misspellings in the synonym map itself. For example, map "dog" to "dob, dgo, dot" along with the other synonyms.
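One way to read Option 2 is to put the misspellings into the same comma-separated equivalency rule as the real synonyms. The helper below is only a sketch: the class name SynonymRules and its method are hypothetical, and it simply builds the string you would assign to SynonymMap.Synonyms.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class SynonymRules
{
    // Builds one comma-separated equivalency rule, e.g. "dog, cat, rabbit, dob, dgo, dot".
    // Assumption: the misspellings are a hand-maintained list; nothing generates them here.
    public static string BuildEquivalencyRule(IEnumerable<string> synonyms, IEnumerable<string> misspellings)
    {
        return string.Join(", ", synonyms.Concat(misspellings).Distinct());
    }
}
```

The trade-off is that every listed misspelling becomes a synonym in its own right, so a search for "dot" would also return dogs, cats and rabbits.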

Related

Create a collection of Windows Services matching a regular expression

I want to create a collection of Windows Services that will match a regular expression using a Where clause.
For example, I have 3 Windows Services called:
RCLoad1
RCLoad2
RCLoad3
my Regex would be something like: "^RCLoad*"
I'd like to use something like:
ServiceController[] myServices = ServiceController.GetServices(ServerName)
.Where Regex.IsMatch(....)
But I can't get it to work.
You are not clear about the exact failure.
Does GetServices(ServerName) actually return a list of items?
You also don't mention which property holds the name of the service. Is it Name? The code you have now uses the object's ToString(), which most likely defaults to the type name, hence the failure. (?)
Find the right name property (on ServiceController it is ServiceName) and use the pattern RCLoad, which will find it anywhere in the string, and then put the results into a list with ToList(), such as
Regex rgx = new Regex(@"RCLoad"); // RCLoad can be anywhere in the string.
var controllers = ServiceController.GetServices(ServerName)
    .Where(sc => rgx.IsMatch(sc.ServiceName)) // ServiceController exposes ServiceName, not Name
    .ToList();
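To see how the pattern choice changes the result, here is a small sketch that filters plain strings so it runs without any Windows services installed. The class name ServiceNameFilter and the sample names are made up for illustration; with real services you would pass in services.Select(s => s.ServiceName).

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ServiceNameFilter
{
    // Keeps only the names that match the given regular expression.
    public static List<string> Filter(IEnumerable<string> names, string pattern)
    {
        var rgx = new Regex(pattern);
        return names.Where(name => rgx.IsMatch(name)).ToList();
    }
}
```

Note that in "^RCLoad*" the * applies only to the preceding d, so the pattern means "starts with RCLoa followed by zero or more d characters" and would also match a name such as RCLoa. The unanchored RCLoad also matches MyRCLoader, while @"^RCLoad\d+$" matches only RCLoad followed by digits.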

Sitecore Lucene index search term with space match same word without space

This seems so simple that I'm convinced I must be overlooking something. I cannot establish how to do the following in Lucene:
The problem
I'm searching for place names.
I have a field called Name
It is using Lucene.Net.Analysis.Standard.StandardAnalyzer
It is TOKENIZED
The value of Name contains 1 space in the value: halong bay.
The search term may or may not contain an extra space due to culturally different spellings or genuine spelling mistakes. E.g. ha long bay instead of halong bay.
If I use the term halong bay I get a hit.
If I use the term ha long bay I do not get a hit.
The attempted solution
Here's the code I'm using to build my predicate using LINQ to Lucene from Sitecore:
var searchContext = ContentSearchManager.GetIndex("my_index").CreateSearchContext();
var term = "ha long bay";
var predicate = PredicateBuilder.Create<MySearchResultItemClass>(sri => sri.Name == term);
var results = searchContext.GetQueryable<MySearchResultItemClass>().Where(predicate);
I have also tried a fuzzy match using the .Like() extension:
var predicate = PredicateBuilder.Create<MySearchResultItemClass>(sri => sri.Like(term));
This also yields no results for ha long bay.
How do I configure Lucene in Sitecore to return a hit for both halong bay and ha long bay search terms, ideally without having to do anything fancy with the input term (e.g. stripping space, adding wildcards, etc)?
Note: I recognise that this would also allow the term h a l o n g b a y to produce a hit, but I don't think I have a problem with this.
A TOKENIZED field means that the field value is split into tokens (on whitespace in this case) and the resulting terms are added to the index dictionary. If you index "halong bay" in such a field, it will create the "halong" and "bay" terms.
It's normal for the search engine to fail to retrieve this result for the "ha long" search query, because the index contains no "ha" or "long" terms.
A manual approach would be to define all the other ways to write the place name in another multi-value computed index field named AlternateNames. Then you could issue this kind of query: Name==query OR AlternateNames==query.
An automatic approach would be to also index the place names without spaces in a separate computed index field named CompactName. Then you could issue this kind of query: Name==query OR CompactName==compactedQueryWithoutSpaces
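The compacting step for the automatic approach could look like the sketch below. The class name PlaceNameNormalizer is hypothetical, and it assumes the same function is applied both when computing the CompactName field and when preparing the query term.

```csharp
using System.Linq;

public static class PlaceNameNormalizer
{
    // Removes every whitespace character and lower-cases the rest,
    // so "Ha Long Bay" and "halong bay" both become "halongbay".
    public static string Compact(string value)
    {
        if (string.IsNullOrEmpty(value)) return string.Empty;
        return new string(value.Where(c => !char.IsWhiteSpace(c)).ToArray()).ToLowerInvariant();
    }
}
```

Because the compacted value has no spaces, it is indexed as a single term, and the compacted query can be compared against it directly.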
I hope this helps
Jeff
Something like this might do the trick:
var predicate = PredicateBuilder.False<MySearchResultItemClass>();
foreach (var t in term.Split(' '))
{
    var tempTerm = t;
    predicate = predicate.Or(p => p.Name.Contains(tempTerm));
}
var results = searchContext.GetQueryable<MySearchResultItemClass>().Where(predicate);
It does split your input string, but I guess that is not 'fancy' ;)

How to deal with text input containing a last name with a space or a first name [space] last name combination

I'm dealing with a problem that I can't wrap my head around and could use your help and expertise.
I have a textbox that allows the user to search for another user by any of the name criteria listed below:
<first name><space><last name> (John Smith)
<last name><comma><space|nospace><first name> (Smith, John) or (Smith,John)
Either starting portion of first name or last name (in this case, I do a search against both the first and last name columns) (Smith), (John), (Sm), or (Jo)
Issue:
There are quite a few users who have a space in their last name. If someone searches for them, they may enter only "de la".
Now in this scenario, since there is a space between the words, the system will assume that the search criteria is first name starts with "de" and last name with "la".
The system will work as expected if the user typed "de la," because now the input contains a comma, and the system will know for sure that this search is for a last name but I have to assume that not everyone will enter a comma at the end.
However the user probably intended only to search for someone with last name starting with "de la".
Current options
I have a few options in mind and could use your help in deciding which one would you recommend. And PLEASE, feel free to add your suggestions.
User training. We can always create help guides/training material to advise the users to enter a comma at the end if they're searching for a last name containing a space. I don't like this approach because the user experience isn't smart/intuitive anymore and most of the users won't read the help guides.
Create 2 different text boxes (for first name and last name). I'm not a fan of this approach either; the UI just won't look and feel the same and will prove inconvenient to the users who just want to copy/paste a name from either Outlook or elsewhere (without having to copy/paste first/last name separately).
Run the search as a first name/last name search first, then additionally run a search for people with a space in their last name, and append both result sets to the return value. This might work, but it'll create a lot of false positives and cause extra load on the server. E.g. a search for "de la" will return Lance, Devon (...) and "De La Cruz, John" (...).
I'd appreciate any feedback you can offer on this issue: your experiences, best practices, or, best of all, some code snippets from something you've worked on related to this scenario.
Application background: It's an ASP.NET (4.0) WebAPI service written in C#; it's consumed by a client sitting on a different server.
I've used this technique for a number of years and I like it.
Lose the comma, no one will use it. If there is not a space, search for first OR last. If there is a space, search for first AND last. This code works very well for partial name searches, i.e. "J S" finds Jane Smith and John Smith. "John" will find "John Smith" and "Anne Johnson". This should give you a pretty good starting point to get as fancy as you want with your supported queries.
public IEnumerable<People> Search(string query, int maxResults = 20)
{
    if (string.IsNullOrWhiteSpace(query))
    {
        return new List<People>();
    }
    IEnumerable<People> results;
    var split = query.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    if (split.Length > 1)
    {
        var firstName = split[0];
        var lastName = string.Join(" ", split.Skip(1));
        results = PeopleRepository.Where(x =>
            x.FirstName.StartsWith(firstName, StringComparison.OrdinalIgnoreCase) &&
            x.LastName.StartsWith(lastName, StringComparison.OrdinalIgnoreCase));
    }
    else
    {
        var search = split[0];
        results = PeopleRepository.Where(x =>
            x.FirstName.StartsWith(search, StringComparison.OrdinalIgnoreCase) ||
            x.LastName.StartsWith(search, StringComparison.OrdinalIgnoreCase));
    }
    return results.Take(maxResults);
}
Maybe the point is that you should index your user data in order to look for it efficiently.
For example, you should index first and last names without caring whether they're first or last names. You want to search for people, so why should the end user care about the order of the search terms?
The whole index can store user ids on sets specialized by names (either first or last names). If user ids are integers, it would be something like this:
John => 12, 19, 1929, 349, 1, 29
Smith => 12, 349, 11, 4
Matias => 931, 45
Fidemraizer => 931
This way the user can type the names in any order: if the user types "John", you show all users whose ids are in the John set. If they type both John and Smith, you intersect the John and Smith sets to find which user ids are in both, and so on.
I don't know what database technology you're currently using. Both SQL and NoSQL products can store this, though a NoSQL store is probably a better fit.
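A minimal in-memory sketch of that idea follows. The class name NameIndex and its methods are hypothetical, it matches whole name tokens only (no prefix search), and a real system would keep the sets in a database rather than a Dictionary.

```csharp
using System;
using System.Collections.Generic;

public class NameIndex
{
    // Maps a lower-cased name token to the ids of users who have that token
    // anywhere in their first or last name.
    private readonly Dictionary<string, HashSet<int>> _idsByToken =
        new Dictionary<string, HashSet<int>>();

    private static string[] Tokenize(string text)
    {
        return text.ToLowerInvariant()
            .Split(new[] { ' ', ',' }, StringSplitOptions.RemoveEmptyEntries);
    }

    public void Add(int userId, string firstName, string lastName)
    {
        foreach (var token in Tokenize(firstName + " " + lastName))
        {
            HashSet<int> ids;
            if (!_idsByToken.TryGetValue(token, out ids))
            {
                ids = new HashSet<int>();
                _idsByToken[token] = ids;
            }
            ids.Add(userId);
        }
    }

    // Returns the ids that appear in the set of every query token (set intersection).
    public ISet<int> Search(string query)
    {
        HashSet<int> result = null;
        foreach (var token in Tokenize(query))
        {
            HashSet<int> ids;
            if (!_idsByToken.TryGetValue(token, out ids))
                return new HashSet<int>(); // unknown token: nothing can match
            if (result == null)
                result = new HashSet<int>(ids); // copy so the index itself is not modified
            else
                result.IntersectWith(ids);
        }
        return result ?? new HashSet<int>();
    }
}
```

Since the order of tokens no longer matters, "Smith, John" and "John Smith" return the same ids, and "de la" finds users with De La in their name without any comma convention.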

How to implement a simple String search

I want to implement a simple search in my application, based on search query I have.
Let's say I have an array containing 2 paragraphs or articles and I want to search in these articles for related subject or related keywords I enter.
For example:
//this is my search query
string mySearchQuery = "how to play with matches";
//these are my articles
string[] myarticles = new string[] {"article 1: this article will teach newbies how to start fire by playing with the awesome matches..", "article 2: this article doesn't contain anything"};
How can I get the first article based on the search query I provided above? Any idea?
This would return any string in myarticles that contains all of the words in mySearchQuery:
var tokens = mySearchQuery.Split(' ');
var matches = myarticles.Where(m => tokens.All(t => m.Contains(t)));
foreach (var match in matches)
{
    // do whatever you wish with them here
}
I'm sure you can find a nice framework for string search, because it's a wide subject with many search rules.
But for this simple sample, try splitting the search query on " "; for each word do a simple string search, and if you find it, add 1 point to that paragraph's match score. At the end, return the paragraph with the most points.
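The point-scoring idea could be sketched like this. The class name ArticleSearch is hypothetical, and the match is a plain case-insensitive substring test, so "play" also counts as found inside "playing".

```csharp
using System;
using System.Linq;

public static class ArticleSearch
{
    // Returns the article containing the most query words, or null if none match.
    public static string BestMatch(string query, string[] articles)
    {
        var tokens = query.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        var best = articles
            .Select(article => new
            {
                Article = article,
                // One point per query word found anywhere in the article.
                Score = tokens.Count(t => article.IndexOf(t, StringComparison.OrdinalIgnoreCase) >= 0)
            })
            .OrderByDescending(x => x.Score)
            .FirstOrDefault();

        return best != null && best.Score > 0 ? best.Article : null;
    }
}
```

Calling ArticleSearch.BestMatch(mySearchQuery, myarticles) with the data from the question returns the first article, since the second one contains none of the query words.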

Proximity Search example Lucene.Net

I want to make a Proximity Search with Lucene.Net. I saw this question where it looks like that was the answer for him, but no code was supplied. The Java documentation says to use the ~ character with the number of words in between, but I don't see where this character would go in the code. Can anyone give me an example of a Proximity Search using Lucene.Net?
Edit:
What I have so far:
IndexSearcher searcher = new IndexSearcher(this.Directory, true);
string[] fieldList = new string[] { "Name", "Description" };
List<BooleanClause.Occur> occurs = new List<BooleanClause.Occur>();
foreach (string field in fieldList)
{
    occurs.Add(BooleanClause.Occur.SHOULD);
}
Query searchQuery = MultiFieldQueryParser.Parse(this.LuceneVersion, query, fieldList, occurs.ToArray(), this.Analyzer);
If I try to add the "~" with any number to the query passed to MultiFieldQueryParser, it errors out saying that for a FuzzySearch the values should be between 0.0 and 1.0. But I want a Proximity Search with up to 3 words of separation, e.g. "my search"~3
The tilde means either a fuzzy search if you apply it on a single term, or a proximity search if you apply it on a phrase. The error you're receiving sounds like you're applying it on a single term (term~10) instead of using a phrase ("term term"~10).
To do a proximity search use the tilde, "~", symbol at the end of a Phrase.
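In code, the tilde is simply part of the query string handed to the parser. Below is a small sketch of building such a string; the class name ProximityQuery is hypothetical, and the result would be passed as the query argument of the MultiFieldQueryParser.Parse call shown in the question.

```csharp
public static class ProximityQuery
{
    // Wraps the phrase in double quotes and appends ~N, e.g. "my search"~3.
    public static string Build(string phrase, int maxDistance)
    {
        // Assumption: embedded quotes are dropped rather than escaped, to keep the sketch simple.
        var cleaned = phrase.Replace("\"", string.Empty).Trim();
        return "\"" + cleaned + "\"~" + maxDistance;
    }
}
```

The important part is the surrounding double quotes: without them the parser sees a single term followed by ~3 and treats it as a fuzzy search, which is what triggers the 0.0 to 1.0 error.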
The only differences between Lucene.NET and classic Java Lucene of the same version should be internal, not external. The operational goal is to have a very compatible project, especially on the input (queries) and output (index files) side. So it should work however it works for Java Lucene. If it doesn't, it is a bug.
