How to deal with text input containing a last name with a space or a first name [space] last name combination - c#

I'm dealing with a problem that I can't wrap my head around and could use your help and expertise.
I have a textbox that allows the user to search for another user by any of the name criteria listed below:
<first name><space><last name> (John Smith)
<last name><comma><space|nospace><first name> (Smith, John) or (Smith,John)
Either starting portion of first name or last name (in this case, I do a search against both the first and last name columns) (Smith), (John), (Sm), or (Jo)
Issue:
There are quite a few users who have a space in their last name. If someone searches for them, they may only enter "de la".
Now in this scenario, since there is a space between the words, the system will assume that the search criteria is first name starts with "de" and last name with "la".
The system will work as expected if the user typed "de la," because now the input contains a comma, and the system will know for sure that this search is for a last name but I have to assume that not everyone will enter a comma at the end.
However the user probably intended only to search for someone with last name starting with "de la".
Current options
I have a few options in mind and could use your help in deciding which one would you recommend. And PLEASE, feel free to add your suggestions.
User training. We can always create help guides/training material to advise the users to enter a comma at the end if they're searching for a last name containing a space. I don't like this approach because the user experience isn't smart/intuitive anymore and most of the users won't read the help guides.
Create 2 different text boxes (for first name and last name). I'm not a fan of this approach either; the UI just won't look and feel the same and will prove inconvenient to the users who just want to copy/paste a name from either Outlook or elsewhere (without having to copy/paste first/last name separately).
Run the search with the current criteria first, then additionally run a search for people with a space in their last name, and append both result sets to the return value. This might work, but it'll create a lot of false positives and cause extra load on the server. E.g. search for "de la" will return Lance, Devon (...) and "De La Cruz, John" (...).
I'd appreciate any type of feedback you can shed on this issue; your experiences, best practices, or the best one, some code snippets of something you've worked with related to this scenario.
Application background: It's an ASP.NET (4.0) WebAPI service written in C#; it's consumed by a client sitting on a different server.

I've used this technique for a number of years and I like it.
Lose the comma, no one will use it. If there is not a space, search for first OR last. If there is a space, search for first AND last. This code works very well for partial name searches, i.e. "J S" finds Jane Smith and John Smith. "John" will find "John Smith" and "Anne Johnson". This should give you a pretty good starting point to get as fancy as you want with your supported queries.
// Requires: using System; using System.Collections.Generic; using System.Linq;
public IEnumerable<People> Search(string query, int maxResults = 20)
{
    if (string.IsNullOrWhiteSpace(query))
    {
        return new List<People>();
    }
    IEnumerable<People> results;
    var split = query.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    if (split.Length > 1)
    {
        // With a space: the first token is the first name, everything after it is the last name.
        var firstName = split[0];
        var lastName = string.Join(" ", split.Skip(1));
        results = PeopleRepository.Where(x =>
            x.FirstName.StartsWith(firstName, StringComparison.OrdinalIgnoreCase) &&
            x.LastName.StartsWith(lastName, StringComparison.OrdinalIgnoreCase));
    }
    else
    {
        // Without a space: match the single token against either name.
        var search = split[0];
        results = PeopleRepository.Where(x =>
            x.FirstName.StartsWith(search, StringComparison.OrdinalIgnoreCase) ||
            x.LastName.StartsWith(search, StringComparison.OrdinalIgnoreCase));
    }
    return results.Take(maxResults);
}

Maybe the point is that you should index your user data in order to look for it efficiently.
For example, you should index first and last names without caring whether they're first or last names. You want to search for people, so why should the end user care about search term order?
The index can store sets of user ids keyed by name (either first or last name). If user ids are integers, it would be something like this:
John => 12, 19, 1929, 349, 1, 29
Smith => 12, 349, 11, 4
Matias => 931, 45
Fidemraizer => 931
This way user inputs whatever and you don't care anymore about ordering: if user types "John", you will show all users where their ids are in the John set. If they type both John Smith, you'll need to intersect both John and Smith sets to find out which user ids are in both sets, and so on.
I don't know what database technology you're currently using. Both SQL and NoSQL products can be a good store for this, though in my opinion NoSQL will work better.
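The set-based lookup described above can be sketched in memory as follows. This is only an illustration under assumptions: the sample users, the `index` dictionary and the `Search` helper are hypothetical names, the code is written as a C# 9+ top-level program, and it matches whole name words only (prefix matching would need extra work).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data: (user id, first name, last name).
var users = new[]
{
    (Id: 12, First: "John", Last: "Smith"),
    (Id: 19, First: "John", Last: "De La Cruz"),
    (Id: 931, First: "Matias", Last: "Fidemraizer"),
};

// Build the index: every word of a first or last name maps to a set of user ids.
var index = new Dictionary<string, HashSet<int>>(StringComparer.OrdinalIgnoreCase);
foreach (var user in users)
{
    var nameWords = (user.First + " " + user.Last)
        .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    foreach (var word in nameWords)
    {
        if (!index.TryGetValue(word, out var idsForWord))
        {
            idsForWord = new HashSet<int>();
            index[word] = idsForWord;
        }
        idsForWord.Add(user.Id);
    }
}

// Returns the id set for one term, or an empty set when the term is unknown.
HashSet<int> Lookup(string term) =>
    index.TryGetValue(term, out var found) ? found : new HashSet<int>();

// Intersects the id sets of all terms; the order of the terms does not matter.
HashSet<int> Search(string query)
{
    var terms = query.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    if (terms.Length == 0)
    {
        return new HashSet<int>();
    }
    var result = new HashSet<int>(Lookup(terms[0]));
    foreach (var term in terms.Skip(1))
    {
        result.IntersectWith(Lookup(term));
    }
    return result;
}

Console.WriteLine(string.Join(", ", Search("la de")));
```

Because "De La Cruz" is indexed under each of its words, a query such as "de la" or "la de" resolves to that user without the search having to guess which part is the first name.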

Related

Combining fuzzy search with synonym expansion in Azure search

I'm using the Microsoft.Azure.Search SDK to run an Azure Cognitive Services search that includes synonym expansion. My SynonymMap is as follows:
private async Task UploadSynonyms()
{
    var synonymMap = new SynonymMap()
    {
        Name = "desc-synonymmap",
        Synonyms = "\"dog\", \"cat\", \"rabbit\"\n "
    };
    await m_SearchServiceClient.SynonymMaps.CreateOrUpdateAsync(synonymMap);
}
This is mapped to Animal.Name as follows:
index.Fields.First(f => f.Name == nameof(Animal.Name)).SynonymMaps = new[] { "desc-synonymmap" };
I am trying to use both fuzzy matching and synonym matching, so that, for example:
If I search for 'dog' it returns any Animal with a Name of 'dog', 'cat' or 'rabbit'
If I search for 'dob' it fuzzy matches to 'dog' and returns any Animal with a Name of 'dog', 'cat' or 'rabbit', as they are all synonyms for 'dog'
My search method is as follows:
private async Task RunSearch()
{
    var parameters = new SearchParameters
    {
        SearchFields = new[] { nameof(Animal.Name) },
        QueryType = QueryType.Full
    };
    var results = await m_IndexClientForQueries.Documents.SearchAsync<Animal>("dog OR dog~", parameters);
}
When I search for 'dog' it correctly returns any result with dog/cat/rabbit as its Name. But when I search for 'dob' it only returns matches for 'dog', and not any synonyms.
This answer from January 2019 states that "Synonym expansions do not apply to wildcard search terms; prefix, fuzzy, and regex terms aren't expanded." but this answer was posted over a year ago and things may have changed since then.
Is it possible to both fuzzy match and then match on synonyms in Azure Cognitive Search, or is there any workaround to achieve this?
@spaceplane
Synonym expansions do not apply to wildcard search terms; prefix, fuzzy, and regex terms aren't expanded
Unfortunately, this still holds true. Reference: https://learn.microsoft.com/en-us/azure/search/search-synonyms
The reason is that the words/graphs obtained for these queries are passed directly to the index (as per this doc).
Having said that, I'm thinking of two possible options that may meet your requirement:
Option 1
Have a local fuzzy matcher that gives you the possible matching words for a typed word.
Sharing a reference that I found: Link 1. I came across a lot of packages that do similar tasks.
From the words you obtain, you can build an OR query combining all the matching words and issue it to Azure Cognitive Search.
For instance, when dob~ is fired, assume the fuzzy-matching code generates the words "dot" and "dog".
We take these two words and issue the query "dog OR dot" to Azure. Synonyms then take effect because "dog" is now a plain search term, and the results are retrieved according to the synonym map.
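A minimal sketch of Option 1, assuming a small in-memory vocabulary and a hand-rolled Levenshtein distance rather than any particular fuzzy-matching package. `EditDistance`, `BuildExpandedQuery` and `knownWords` are hypothetical names, and the code is written as a C# 9+ top-level program.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Classic dynamic-programming Levenshtein edit distance between two strings.
int EditDistance(string a, string b)
{
    var previous = Enumerable.Range(0, b.Length + 1).ToArray();
    for (var i = 1; i <= a.Length; i++)
    {
        var current = new int[b.Length + 1];
        current[0] = i;
        for (var j = 1; j <= b.Length; j++)
        {
            var cost = a[i - 1] == b[j - 1] ? 0 : 1;
            current[j] = Math.Min(
                Math.Min(current[j - 1] + 1, previous[j] + 1),
                previous[j - 1] + cost);
        }
        previous = current;
    }
    return previous[b.Length];
}

// Expands a possibly misspelled term into an OR query over nearby vocabulary words.
string BuildExpandedQuery(string term, IEnumerable<string> candidates, int maxDistance = 1)
{
    var matches = candidates
        .Where(word => EditDistance(term.ToLowerInvariant(), word.ToLowerInvariant()) <= maxDistance)
        .ToList();
    // Fall back to the original term when nothing in the vocabulary is close enough.
    return matches.Count == 0 ? term : string.Join(" OR ", matches);
}

// Hypothetical vocabulary; in practice it might be the words used in the synonym map.
var knownWords = new[] { "dog", "dot", "cat", "rabbit" };
var searchText = BuildExpandedQuery("dob", knownWords);
Console.WriteLine(searchText);
```

The resulting string would then be passed as the search text with `QueryType.Full`, in place of the `dog~` fuzzy term, so that every term reaching the service is a plain term eligible for synonym expansion.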
Option 2
You could handle this in the synonym map itself, for example by mapping "dog" to "dob, dgo, dot" along with its other synonyms.

Sitecore Lucene index search term with space match same word without space

This seems so simple that I'm convinced I must be overlooking something. I cannot establish how to do the following in Lucene:
The problem
I'm searching for place names.
I have a field called Name
It is using Lucene.Net.Analysis.Standard.StandardAnalyzer
It is TOKENIZED
The value of Name contains 1 space in the value: halong bay.
The search term may or may not contain an extra space due to culturally different spellings or genuine spelling mistakes. E.g. ha long bay instead of halong bay.
If I use the term halong bay I get a hit.
If I use the term ha long bay I do not get a hit.
The attempted solution
Here's the code I'm using to build my predicate using LINQ to Lucene from Sitecore:
var searchContext = ContentSearchManager.GetIndex("my_index").CreateSearchContext();
var term = "ha long bay";
var predicate = PredicateBuilder.Create<MySearchResultItemClass>(sri => sri.Name == term);
var results = searchContext.GetQueryable<MySearchResultItemClass>().Where(predicate);
I have also tried a fuzzy match using the .Like() extension:
var predicate = PredicateBuilder.Create<MySearchResultItemClass>(sri => sri.Like(term));
This also yields no results for ha long bay.
How do I configure Lucene in Sitecore to return a hit for both halong bay and ha long bay search terms, ideally without having to do anything fancy with the input term (e.g. stripping space, adding wildcards, etc)?
Note: I recognise that this would also allow the term h a l o n g b a y to produce a hit, but I don't think I have a problem with this.
A TOKENIZED field means that the field value is split on a delimiter (a space in this case) and the resulting terms are added to the index dictionary. If you index "halong bay" in such a field, it will create the "halong" and "bay" terms.
It's normal for the search engine to fail to retrieve this result for the "ha long" search query because it doesn't know any result with the "ha" or "long" terms.
A manual approach would be to define all the other ways to write the place name in another multi-value computed index field named AlternateNames. Then you could issue this kind of query: Name==query OR AlternateNames==query.
An automatic approach would be to also index the place names without spaces in a separate computed index field named CompactName. Then you could issue this kind of query: Name==query OR CompactName==compactedQueryWithoutSpaces
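The compacting step of the automatic approach could look roughly like the sketch below. It only shows the string normalization; `Compact` is a hypothetical helper, wiring it into a Sitecore computed index field is not shown, and the approach assumes CompactName is indexed as a single untokenized value.

```csharp
using System;
using System.Linq;

// Removes all whitespace and lowercases, so differently spaced spellings compare equal.
string Compact(string value) =>
    new string(value.Where(c => !char.IsWhiteSpace(c)).ToArray()).ToLowerInvariant();

// At index time: the value the CompactName field would hold for this place.
var compactName = Compact("halong bay");

// At query time: apply the same normalization to the user's input.
var compactedQueryWithoutSpaces = Compact("ha long bay");

Console.WriteLine(compactName == compactedQueryWithoutSpaces);
```

The same function has to be applied on both sides; if the index and the query are normalized differently, the comparison silently stops matching.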
I hope this helps
Jeff
Something like this might do the trick:
var predicate = PredicateBuilder.False<MySearchResultItemClass>();
foreach (var t in term.Split(' '))
{
    var tempTerm = t;
    predicate = predicate.Or(p => p.Name.Contains(tempTerm));
}
var results = searchContext.GetQueryable<MySearchResultItemClass>().Where(predicate);
It does split your input string, but I guess that is not 'fancy' ;)

C# Dynamic where for a collection

I want to search a collection based on a text box. The user should be allowed to type in multiple words, in any order. For example, if the string in the collection is "What a happy day" and the user types in "day What", the string should appear. I know how to do this by hard-coding the number of words allowed (for example, only 3 words) with something like this:
nc = oc.Where(X => X.SearchData.IndexOf(words[0]) > -1 || X.SearchData.IndexOf(words[1]) > -1 || X.SearchData.IndexOf(words[2]) > -1);
Note: yes, I know I would have to check that there are actually 3 values in the words array, but that is not shown.
The problem with this is that it limits the user and I don't want to do that. If the user wants to search off 10 or 20 things then that is fine with me.
Is there a way to dynamically create the Where statement for collection oc?
thanks
You need more LINQ:
oc.Where(x => words.Any(w => x.SearchData.IndexOf(w) > -1))
IndexOf(w) > -1 is true even if w only matches part of a word. For instance, in your example, if the user enters Wha then it gets matched with What. As I understand it, that is not what you want. So you can simply split SearchData into words and search over those:
return oc.Where(p => p.SearchData.Split().Any(q => words.Contains(q)));
I think the answer of @Slaks will match on partial words, as per my comment and the answer given by @Alireza
You could try
oc.Where(phrase => phrase.Split().Intersect(SearchData.Split()).Count() > 0);
There are always various ways with LINQ...
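For comparison, here is a small self-contained sketch of whole-word matching that requires every entered word to be present, in any order, and ignores case. `SplitWords` and `MatchesAllWords` are hypothetical helpers, and whether you want "all words" or "any word" semantics is an assumption you would need to settle for your own search box.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Splits on spaces, dropping empty entries caused by repeated spaces.
string[] SplitWords(string text) =>
    text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

// True when every entered word appears as a whole word in searchData, in any order.
// Note: an empty input matches every item, because All() is true for an empty sequence.
bool MatchesAllWords(string searchData, string input)
{
    var itemWords = new HashSet<string>(SplitWords(searchData), StringComparer.OrdinalIgnoreCase);
    return SplitWords(input).All(w => itemWords.Contains(w));
}

var phrases = new[] { "What a happy day", "A rainy day" };
var hits = phrases.Where(p => MatchesAllWords(p, "day What")).ToList();
Console.WriteLine(string.Join("; ", hits));
```

Swapping `All` for `Any` gives the looser behaviour of the hard-coded `||` version in the question, still without a limit on the number of words.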

How to do partial word searches in Lucene.NET?

I have a relatively small index containing around 4,000 locations. Among other things, I'm using it to populate an autocomplete field on a search form.
My index contains documents with a Location field containing values like
Ohio
Dayton, Ohio
Dublin, Ohio
Columbus, Ohio
I want to be able to type in "ohi" and have all of these results appear and right now nothing shows up until I type the full word "ohio".
I'm using Lucene.NET v2.3.2.1 and the relevant portion of my code is as follows for setting up my query....
BooleanQuery keywords = new BooleanQuery();
QueryParser parser = new QueryParser("location", new StandardAnalyzer());
parser.SetAllowLeadingWildcard(true);
keywords.Add(parser.Parse("\"*" + location + "*\""), BooleanClause.Occur.SHOULD);
luceneQuery.Add(keywords, BooleanClause.Occur.MUST);
In short, I'd like to get this working like a LIKE clause similar to
SELECT * from Location where Name LIKE '%ohi%'
Can I do this with Lucene?
Try this query:
parser.Parse(query.Keywords.ToLower() + "*")
Yes, this can be done. But a leading wildcard can result in slow queries; check the documentation. Also, if you are indexing the entire string (e.g. "Dayton, Ohio") as a single token, most queries will degenerate into leading-wildcard queries. Using a tokenizer like StandardAnalyzer (which I suppose you are already doing) will lessen the need for a leading wildcard.
If you don't want leading wildcards for performance reasons, you can try indexing ngrams. That way, there will not be any leading wildcard queries. An ngram tokenizer (assuming ngrams of length 4 only) will create tokens for "Dayton Ohio" such as "dayt", "ayto", "yton" and so on.
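To make the ngram idea concrete, the sketch below generates the length-4 character ngrams for a value. It is a plain C# illustration rather than a Lucene.NET tokenizer, and `NGrams` is a hypothetical helper; words shorter than the ngram size are kept whole here, which is one possible choice among several.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Yields lowercase character ngrams of the given size for each word in the text.
IEnumerable<string> NGrams(string text, int size = 4)
{
    var wordsInText = text.ToLowerInvariant()
        .Split(new[] { ' ', ',' }, StringSplitOptions.RemoveEmptyEntries);
    foreach (var word in wordsInText)
    {
        if (word.Length <= size)
        {
            // Words no longer than the ngram size are emitted as a single token.
            yield return word;
            continue;
        }
        for (var i = 0; i + size <= word.Length; i++)
        {
            yield return word.Substring(i, size);
        }
    }
}

Console.WriteLine(string.Join(", ", NGrams("Dayton Ohio")));
// prints "dayt, ayto, yton, ohio"
```

With a fixed size of 4, a three-letter input such as "ohi" still has no matching token, so an autocomplete field would probably need a range of ngram sizes (for example 2 to 4) rather than a single size.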
It's more a matter of populating your index with partial words in the first place. Your analyzer needs to put the partial keywords into the index as it analyzes (and hopefully weight them lower than full keywords as it does).
Lucene index lookup trees work from left to right. If you want to search in the middle of a keyword, you have to break it up as you analyze. The problem is that partial keywords will usually explode your index size.
People usually use really creative analyzers that break words up into root words (taking off prefixes and suffixes).
Dig deep into understanding Lucene. It's good stuff. :-)

validating user input tags

I know this question might sound a little cheesy but this is the first time I am implementing a "tagging" feature to one of my project sites and I want to make sure I do everything right.
Right now, I am using the very same tagging system as SO: space-separated tags, with dashes (-) joining multi-word tags. So when I am validating a user-input tag field I am checking for:
Empty string (cannot be empty)
Make sure the string doesn't contain particular characters (suggestions are welcome here..)
At least one word
If there is a space (there is more than one word), split the string
For each part, insert into the db
Am I missing something here? Or is this roughly ok?
Split the string at " ", iterate over the parts, make sure that they comply with your expectations. If they do, put them into the DB.
For example, you can use this regex to check the individual parts:
^[-\w]{2,25}$
This limits each tag to 2 to 25 consecutive word characters or dashes ("\w" covers alphanumerics and "_"; the "-" is included because you asked for it). That excludes the characters typically used in code injection, though you should still use parameterized queries when writing to the DB.
EDIT: In place of the "\w", you are free to take any more closely defined range of characters, I chose it for simplicity only.
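In C#, the split-then-validate flow with this regex might look like the sketch below. `ParseTags` and `tagPattern` are hypothetical names, the code is written as a C# 9+ top-level program, and lowercasing plus de-duplicating the tags is an extra assumption not stated above. In .NET, "\w" also matches non-ASCII letters and digits unless `RegexOptions.ECMAScript` is used.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// The pattern from the answer: 2 to 25 word characters or dashes.
var tagPattern = new Regex(@"^[-\w]{2,25}$");

// Splits the raw input on whitespace and sorts the parts into valid and rejected tags.
(List<string> Valid, List<string> Rejected) ParseTags(string input)
{
    var valid = new List<string>();
    var rejected = new List<string>();
    var parts = (input ?? string.Empty)
        .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
    foreach (var part in parts)
    {
        var tag = part.ToLowerInvariant();
        if (!tagPattern.IsMatch(tag))
        {
            rejected.Add(part);
        }
        else if (!valid.Contains(tag))
        {
            // Keep each valid tag once, in first-seen order.
            valid.Add(tag);
        }
    }
    return (valid, rejected);
}

var parsed = ParseTags("  c-sharp  Web-API a <script> web-api ");
Console.WriteLine("valid: " + string.Join(", ", parsed.Valid));
Console.WriteLine("rejected: " + string.Join(", ", parsed.Rejected));
```

Splitting with `RemoveEmptyEntries` also takes care of leading, trailing and repeated spaces, and an empty `Valid` list covers the "cannot be empty" and "at least one word" checks from the question.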
I've never implemented a tagging system, but am likely to do so soon for a project I'm working on. I'm primarily a database guy and it occurs to me that for performance reasons it may be best to relate your tagged entities with the tag keywords via a resolution table. So, for instance, with example tables such as:
TechQuestion
    TechQuestionID (pk)
    SubjectLine
    QuestionBody
TechQuestionTag
    TechQuestionID (pk)
    TagID (pk)
    Active (indexed)
Tag
    TagID (pk)
    TagText (indexed)
... you'd only add new Tag table entries when never-before-used tags were used. You'd re-associate previously provided tags via the TechQuestionTag table entry. And your query to pull TechQuestions related to a given tag would look like:
SELECT
    q.TechQuestionID,
    q.SubjectLine,
    q.QuestionBody
FROM
    Tag t INNER JOIN TechQuestionTag qt
        ON t.TagID = qt.TagID AND qt.Active = 1
    INNER JOIN TechQuestion q
        ON qt.TechQuestionID = q.TechQuestionID
WHERE
    t.TagText = @tagText
... or what have you. I don't know, perhaps this was obvious to everyone already, but I thought I'd put it out there, because I don't believe the alternative (redundant, indexed, text-tag entries) would query as efficiently.
Be sure your algorithm can handle leading/trailing/extra spaces with no trouble = )
Also worth thinking about might be a tag blacklist for inappropriate tags (profanity for example).
I hope you're doing the usual protection against injection attacks - maybe that's included under #2.
At the very least, you're going to want to escape quote characters and make embedded HTML harmless - in PHP, functions like addslashes and htmlentities can help you with that. Given that it's for a tagging system, my guess is you'll only want to allow alphanumeric characters. I'm not sure what the best way to accomplish that is, maybe using regular expressions.
