Handle singular and plural search terms in Azure Cognitive Search - c#

We're using Azure Cognitive Search as the search engine for our images. The analyzer is the Lucene standard analyzer, and when a user searches for "scottish landscapes", some of our users report that their image is missing from the results. They then have to add the keyword "landscapes" to their images so that the search engine can find them.
Changing the analyzer to "en.lucene" or "en.microsoft" only seemed to return far fewer search results, which we didn't want for our users.
Azure Cognitive Search does not seem to match the singular and plural forms of a word with each other. To work around the issue, I created a dictionary in the database, used inflection, and tried manipulating the search terms:
foreach (var term in terms)
{
    if (ps.IsSingular(term))
    {
        // check with db
        var singular = noun.GetSingularWord(term);
        if (!string.IsNullOrEmpty(singular))
        {
            var plural = ps.Pluralize(term);
            keywords = keywords + " " + plural;
        }
    }
    else
    {
        // check with db
        var plural = noun.GetPluralWord(term);
        if (!string.IsNullOrEmpty(plural))
        {
            var singular = ps.Singularize(term);
            keywords = keywords + " " + singular;
        }
    }
}
My solution is not ideal; it would be nicer if Azure Cognitive Search could match singular and plural words itself.
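The expansion idea above can be sketched independently of the C# helpers. This is a minimal Python illustration, not the code from the question: the `PLURALS` dictionary is a hypothetical stand-in for the database table, and `expand_terms` is an invented name.

```python
# Hypothetical singular -> plural dictionary standing in for the database table.
PLURALS = {"landscape": "landscapes", "apple": "apples"}
# Reverse lookup (plural -> singular) derived from the same data.
SINGULARS = {plural: singular for singular, plural in PLURALS.items()}


def expand_terms(query: str) -> str:
    """Append the other inflection of every known term to the query."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        other = PLURALS.get(term) or SINGULARS.get(term)
        if other and other not in expanded:
            expanded.append(other)
    return " ".join(expanded)


print(expand_terms("scottish landscape"))
```

Terms that are not in the dictionary pass through unchanged, so the expansion can only widen the query, never narrow it.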
UPDATE:
Custom analyzers may be the answer to my problem; I just need to find the right token filters.
UPDATE:
Below is my custom analyzer. It removes HTML constructs, apostrophes, and stopwords, and converts the text to lowercase. The tokenizer is MicrosoftLanguageStemmingTokenizer, which reduces words to their root forms, so it suits the plural-to-singular scenario (searching for "landscapes" returns "landscapes" and "landscape").
"analyzers": [
{
"name": "p4m_custom_analyzer",
"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"charFilters": [
"html_strip",
"remove_apostrophe"
],
"tokenizer": "custom_tokenizer",
"tokenFilters": [
"lowercase",
"remove_stopwords"
]
}
],
"charFilters": [
{
"name": "remove_apostrophe",
"@odata.type": "#Microsoft.Azure.Search.MappingCharFilter",
"mappings": ["'=>"]
}
],
"tokenizers": [
{
"name": "custom_tokenizer",
"@odata.type": "#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer",
"isSearchTokenizer": false
}
],
"tokenFilters": [
{
"name": "remove_stopwords",
"@odata.type": "#Microsoft.Azure.Search.StopwordsTokenFilter"
}
]
I have yet to figure out the other way around: if the user searches for "apple", it should return "apple" and "apples".

Both en.lucene and en.microsoft should have helped with this; you shouldn't need to manually expand inflections on your side. I'm surprised to hear that you see lower recall with them. Generally speaking, I would expect higher recall with those than with the standard analyzer. Do you by any chance have multiple searchable fields with different analyzers? That could interfere. Otherwise, it would be great to see a specific case (a query/document pair along with the index definition) to investigate further.
As a quick test, I used this small index definition:
{
"name": "inflections",
"fields": [
{
"name": "id",
"type": "Edm.String",
"searchable": false,
"filterable": true,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": true
},
{
"name": "en_ms",
"type": "Edm.String",
"searchable": true,
"filterable": false,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": false,
"analyzer": "en.microsoft"
}
]
}
These docs:
{
"id": "1",
"en_ms": "example with scottish landscape as part of the sentence"
},
{
"id": "2",
"en_ms": "this doc has one apple word"
},
{
"id": "3",
"en_ms": "this doc has two apples in it"
}
For the query search=landscapes I see these results:
{
"value": [
{
"@search.score": 0.9631388,
"id": "1",
"en_ms": "example with scottish landscape as part of the sentence"
}
]
}
And for search=apple I see:
{
"value": [
{
"@search.score": 0.51188517,
"id": "3",
"en_ms": "this doc has two apples in it"
},
{
"@search.score": 0.46152657,
"id": "2",
"en_ms": "this doc has one apple word"
}
]
}
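To see why both directions work ("landscapes" finds "landscape", and "apple" finds "apples"), it helps to remember that a language analyzer is applied to the documents at indexing time and to the query at search time, so both sides are reduced to the same term. The Python sketch below illustrates only that principle. `toy_stem` is a deliberately naive, invented stand-in and is not how en.microsoft or en.lucene actually stem words.

```python
def toy_stem(token: str) -> str:
    """Naive stand-in for a real stemmer: strip a trailing 's' from longer words."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def analyze(text: str) -> list[str]:
    """Lowercase, split on whitespace, and stem every token."""
    return [toy_stem(token) for token in text.lower().split()]


# The three test documents from the answer above.
DOCS = {
    "1": "example with scottish landscape as part of the sentence",
    "2": "this doc has one apple word",
    "3": "this doc has two apples in it",
}


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Inverted index: analyzed term -> ids of the documents containing it."""
    index: dict[str, set[str]] = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> list[str]:
    """Analyze the query the same way as the documents, then look the terms up."""
    hits: set[str] = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return sorted(hits)


index = build_index(DOCS)
print(search(index, "landscapes"))
print(search(index, "apple"))
```

Because the query and the documents go through the same `analyze` function, no manual expansion of singular and plural forms is needed on the client side.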

Related

Bot Framework Composer: Dynamic multiple choice action (from API)

I'm building a bot with Bot Framework Composer (V2).
I want to create a multiple choice action with choices that I get from an API call.
API choices:
[
{
"id": 0,
"name": "One",
"active": true
},
{
"id": 1,
"name": "Two",
"active": true
},
{
"id": 2,
"name": "Three",
"active": true
},
{
"id": 3,
"name": "Four",
"active": true
},
{
"id": 4,
"name": "Five",
"active": true
}
]
How do I bind these choices in the multiple choice action?
I assume that you are able to call the API and get the data in array format; suppose it is stored in dialog.response.
What you need to do is:
Add a "For each item" loop and configure it as shown in the screenshot.
Next, add "Edit an Array Property" inside the loop and configure it as shown in the screenshot.
Finally, add the Multi-Choice action (which you have already added) and set "Array of choices" to dialog.choices.
I have tested this flow up to the point where the bot sends the card with the multiple choice.
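The logic of the steps above (loop over the response and push one value per item into an array) can be sketched outside Composer. The Python below is only an illustration of that data flow. It assumes the value pushed in the "Edit an Array Property" step is each item's `name`, which the answer does not state explicitly because the configuration was only shown in a screenshot.

```python
# Stand-in for dialog.response: the array returned by the API call.
api_response = [
    {"id": 0, "name": "One", "active": True},
    {"id": 1, "name": "Two", "active": True},
    {"id": 2, "name": "Three", "active": True},
    {"id": 3, "name": "Four", "active": True},
    {"id": 4, "name": "Five", "active": True},
]

# Stand-in for dialog.choices: starts empty and is filled inside the loop.
choices = []
for item in api_response:      # "For each item" loop
    choices.append(item["name"])  # "Edit an Array Property" with a push operation

print(choices)
```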

How to customize a Word template in C# for document generation

I have a Word template in .docx format.
My questions:
Can I get a list of the tag names of the rich text content controls that I have declared in the template, for example inside tables and elsewhere? If so, how?
(this is the example of template)
Can I look up a parameter in the response returned from the database using a string? And how do I retrieve deeper-level data like this? (example below)
{
"id": "293",
"user": "315",
"userNavigation": {
"id": "314",
"name": "insomnia"
},
"department": [
{
"id": "2",
"name": "Tech"
},
{
"id": "1",
"name": "Bio"
}
]
}
I've used two libraries:
OpenXml
TemplateEngine.Docx: https://bitbucket.org/unit6ru/templateengine/src/master/
I am not using third-party services because they are not free.

Linq query to Json string

Starting from a JObject, I can get the array that interests me:
JArray partial = (JArray)rssAlbumMetadata["tracks"]["items"];
First question: "partial" contains a lot of attributes I'm not interested in.
How can I get only what I need?
Second question: once I have succeeded in the first task, I'll get a JArray of duplicated items. How can I get only the unique ones?
The result should be something like:
{
'composer': [
{
'id': '51523',
'name': 'Modest Mussorgsky'
},
{
'id': '228918',
'name': 'Sergey Prokofiev'
},
]
}
Let me start from something like:
[
{
"id": 32837732,
"composer": {
"id": 245,
"name": "George Gershwin"
},
"title": "Of Thee I Sing: Overture (radio version)"
},
{
"id": 32837735,
"composer": {
"id": 245,
"name": "George Gershwin"
},
"title": "Concerto in F : I. Allegro"
},
{
"id": 32837739,
"composer": {
"id": 245,
"name": "George Gershwin"
},
"title": "Concerto in F : II. Adagio"
}
]
First question:
How can I get only what I need?
There is no magic: you need to read the whole JSON string and then query the object to find what you are looking for. It is not possible to read only part of the JSON, if that is what you need. You have not provided an example of what the data looks like, so it is not possible to specify how to query it.
Second question, which I guess is: how do I de-duplicate the contents of an array of objects?
Again, I do not have a full view of your objects, but this example should show you how, using LINQ as you requested:
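var items = new[] { new { id = 1, name = "ali" }, new { id = 2, name = "ostad" }, new { id = 1, name = "ali" } };
// Group by id and keep the first element of each group to drop duplicates.
var dedup = items.GroupBy(x => x.id).Select(g => g.First()).ToList();
// Print each element; writing the list itself would only print its type name.
foreach (var item in dedup) Console.WriteLine(item);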
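Applied to the sample tracks from the question, the same two steps (project only the composer, then keep one entry per composer id) look roughly like this. The sketch is in Python for brevity rather than Json.NET, and `unique_composers` is an invented helper name.

```python
# Sample tracks copied from the question.
tracks = [
    {"id": 32837732, "composer": {"id": 245, "name": "George Gershwin"},
     "title": "Of Thee I Sing: Overture (radio version)"},
    {"id": 32837735, "composer": {"id": 245, "name": "George Gershwin"},
     "title": "Concerto in F : I. Allegro"},
    {"id": 32837739, "composer": {"id": 245, "name": "George Gershwin"},
     "title": "Concerto in F : II. Adagio"},
]


def unique_composers(items: list[dict]) -> dict:
    """Keep only id and name of each composer, one entry per composer id."""
    seen: dict[int, dict] = {}
    for item in items:
        composer = item["composer"]
        # setdefault keeps the first occurrence and ignores later duplicates.
        seen.setdefault(composer["id"], {"id": composer["id"], "name": composer["name"]})
    return {"composer": list(seen.values())}


print(unique_composers(tracks))
```

In C# the equivalent shape would be a `Select` on the composer object followed by the `GroupBy(...).Select(g => g.First())` pattern shown above.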

Querying JSON with JSON .NET

{
"kind": "folderTree",
"data":
[
{
"id": "IEAAALNZI7777777",
"title": "Root",
"childIds":
[
"IEAAALNZI4ADAKBQ",
"IEAAALNZI4ADAMBQ",
"IEAAALNZI4ADAMBR"
],
"scope": "WsRoot"
},
{
"id": "IEAAANE7I7777777",
"title": "Root",
"childIds":
[
"IEAAANE7I4AC2NTX"
],
"scope": "WsRoot"
},
{
"id": "IEAAALNZI7777776",
"title": "Recycle Bin",
"childIds":
[
"IEAAALNZI4ADALZ2",
"IEAAALNZI4ADAL52",
"IEAAALNZI4ADALR3"
],
"scope": "RbRoot"
}
]
}
I'm trying to query the JSON structure above: searching the child items, I want to return the id for a given title.
I am trying something like this:
var folder = json["data"].Children().Where(x => x["Title"] == "Root");
But I'm not sure of the correct syntax.
You can use SelectTokens to query LINQ to JSON objects. It supports JSONPath query syntax including wildcards. You can then further narrow down the search with a Where clause:
var folders = json.SelectTokens("data[*]").Where(t => (string)t["title"] == "Root").ToList();
It also supports filtering of array entries based on property values if you don't want the extra Where clause:
var folders = json.SelectTokens("data[?(@.title == 'Root')]").ToList();
Both of the above do the same thing. Incidentally, you've got two folders whose title is "Root" in your JSON, so your query will return multiple results.
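The filter itself is simple enough to check by hand. The Python sketch below runs the same "entries of data whose title equals a given value" query over a trimmed copy of the JSON from the question (the childIds arrays are omitted) and returns the matching ids. `ids_for_title` is an invented helper name, not part of Json.NET.

```python
# Trimmed copy of the folder tree from the question (childIds omitted).
folder_tree = {
    "kind": "folderTree",
    "data": [
        {"id": "IEAAALNZI7777777", "title": "Root", "scope": "WsRoot"},
        {"id": "IEAAANE7I7777777", "title": "Root", "scope": "WsRoot"},
        {"id": "IEAAALNZI7777776", "title": "Recycle Bin", "scope": "RbRoot"},
    ],
}


def ids_for_title(tree: dict, title: str) -> list[str]:
    """Return the id of every entry under "data" whose title matches exactly."""
    return [folder["id"] for folder in tree["data"] if folder["title"] == title]


print(ids_for_title(folder_tree, "Root"))
print(ids_for_title(folder_tree, "Recycle Bin"))
```

Note that the comparison is case-sensitive on both the property name and the value, which is why the property must be spelled "title" and not "Title".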

Getting distinct values using NEST ElasticSearch client

I'm building a product search engine with Elasticsearch in my .NET application, using the NEST client, and there is one thing I'm having trouble with: getting a distinct set of values.
I'm searching for products, of which there are many thousands, but of course I can only return 10 or 20 at a time to the user. For this, paging works fine. But besides this primary result, I want to show my users a list of the brands found within the complete search, so that they can be used for filtering.
I have read that I should use a terms aggregation for this, but I couldn't get anything better than the code below. It still doesn't really give me what I want, because it splits values like "20th Century Fox" into 3 separate values.
var brandResults = client.Search<Product>(s => s
    .Query(query)
    .Aggregations(a => a.Terms("my_terms_agg", t => t.Field(p => p.BrandName).Size(250))
    )
);
var agg = brandResults.Aggs.Terms("my_terms_agg");
Is this even the right approach, or should I use something totally different? And how can I get the correct, complete values? (Not split by space, but I guess that is what you get when you ask for a list of 'Terms'?)
What I'm looking for is what you would get if you did this in MS SQL:
SELECT DISTINCT BrandName FROM [Table To Search] WHERE [Where clause without paging]
You are correct that what you want is a terms aggregation. The problem you're running into is that ES is splitting the field "BrandName" in the results it is returning. This is the expected default behavior of a field in ES.
What I recommend is that you change BrandName into a "multi-field". This will allow you to search on all the various parts, as well as to do a terms aggregation on the "not analyzed" (i.e. the full "20th Century Fox") term.
Here is the documentation from ES.
https://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/mapping-multi-field-type.html
[UPDATE]
If you are using ES version 1.4 or newer the syntax for multi-fields is a little different now.
https://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_multi_fields.html#_multi_fields
Here is a full working sample that illustrates the point in ES 1.4.4. Note that the mapping specifies a "not_analyzed" version of the field.
PUT hilden1
PUT hilden1/type1/_mapping
{
"properties": {
"brandName": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
POST hilden1/type1
{
"brandName": "foo"
}
POST hilden1/type1
{
"brandName": "bar"
}
POST hilden1/type1
{
"brandName": "20th Century Fox"
}
POST hilden1/type1
{
"brandName": "20th Century Fox"
}
POST hilden1/type1
{
"brandName": "foo bar"
}
GET hilden1/type1/_search
{
"size": 0,
"aggs": {
"analyzed_field": {
"terms": {
"field": "brandName",
"size": 10
}
},
"non_analyzed_field": {
"terms": {
"field": "brandName.raw",
"size": 10
}
}
}
}
Results of the last query:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 0,
"hits": []
},
"aggregations": {
"non_analyzed_field": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "20th Century Fox",
"doc_count": 2
},
{
"key": "bar",
"doc_count": 1
},
{
"key": "foo",
"doc_count": 1
},
{
"key": "foo bar",
"doc_count": 1
}
]
},
"analyzed_field": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "20th",
"doc_count": 2
},
{
"key": "bar",
"doc_count": 2
},
{
"key": "century",
"doc_count": 2
},
{
"key": "foo",
"doc_count": 2
},
{
"key": "fox",
"doc_count": 2
}
]
}
}
}
Notice that the not-analyzed field keeps "20th Century Fox" and "foo bar" together, whereas the analyzed field breaks them up.
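The two bucket lists can be reproduced by hand to make the difference concrete. The Python sketch below counts documents per term for the five sample brand names, once after a simplified lowercase-and-split analysis and once on the raw values. It is only an approximation of what Elasticsearch does: the real standard analyzer has more rules than whitespace splitting.

```python
from collections import Counter

# The five brandName values indexed in the sample above.
brands = ["foo", "bar", "20th Century Fox", "20th Century Fox", "foo bar"]


def analyzed_buckets(values: list[str]) -> dict[str, int]:
    """Document count per token after lowercasing and splitting on whitespace."""
    counts: Counter[str] = Counter()
    for value in values:
        # A set ensures a document is counted at most once per token.
        counts.update(set(value.lower().split()))
    return dict(counts)


def raw_buckets(values: list[str]) -> dict[str, int]:
    """Document count per complete, untouched value (the not_analyzed view)."""
    return dict(Counter(values))


print(analyzed_buckets(brands))
print(raw_buckets(brands))
```

The first result corresponds to the "analyzed_field" buckets and the second to the "non_analyzed_field" buckets in the response above.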
I had a similar issue. I was displaying search results and wanted to show counts on the category and sub category.
You're right to use aggregations. I also had the issue with the strings being tokenised (i.e. "20th Century Fox" being split); this happens because the fields are analysed. In my case, I added the following mappings (i.e. I told ES not to analyse those fields):
"category": {
"type": "nested",
"properties": {
"CategoryNameAndSlug": {
"type": "string",
"index": "not_analyzed"
},
"SubCategoryNameAndSlug": {
"type": "string",
"index": "not_analyzed"
}
}
}
As jhilden suggested, if you use this field for more than one purpose (e.g. search and aggregation), you can set it up as a multi-field. That way, one version of the field is analysed and used for searching, while the other is left unanalysed and used for aggregation.
