Getting distinct values using NEST ElasticSearch client - c#

I'm building a product search engine with Elasticsearch in my .NET application, using the NEST client, and there is one thing I'm having trouble with: getting a distinct set of values.
I'm searching for products, of which there are many thousands, but of course I can only return 10 or 20 at a time to the user, and paging works fine for this. Besides this primary result, I want to show my users a list of the brands found within the complete search, so they can use them for filtering.
I have read that I should use a terms aggregation for this, but I couldn't get anything better than the code below. It still doesn't give me what I want, because it splits values like "20th Century Fox" into 3 separate values.
var brandResults = client.Search<Product>(s => s
.Query(query)
.Aggregations(a => a.Terms("my_terms_agg", t => t.Field(p => p.BrandName).Size(250))
)
);
var agg = brandResults.Aggs.Terms("my_terms_agg");
Is this even the right approach, or should I use something totally different? And how can I get the correct, complete values (not split by space, although I guess that is what you get when you ask for a list of "terms")?
What I'm looking for is what you would get if you did this in MS SQL:
SELECT DISTINCT BrandName FROM [Table To Search] WHERE [Where clause without paging]

You are correct that what you want is a terms aggregation. The problem you're running into is that ES is splitting the field "BrandName" in the results it is returning. This is the expected default behavior of a field in ES.
What I recommend is that you change BrandName into a "multi-field". This will allow you to search on all the various parts, as well as do a terms aggregation on the "not analyzed" (i.e. full "20th Century Fox") term.
Here is the documentation from ES.
https://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/mapping-multi-field-type.html
[UPDATE]
If you are using ES version 1.4 or newer the syntax for multi-fields is a little different now.
https://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_multi_fields.html#_multi_fields
Here is a full working sample that illustrates the point in ES 1.4.4. Note that the mapping specifies a "not_analyzed" version of the field.
PUT hilden1
PUT hilden1/type1/_mapping
{
"properties": {
"brandName": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
POST hilden1/type1
{
"brandName": "foo"
}
POST hilden1/type1
{
"brandName": "bar"
}
POST hilden1/type1
{
"brandName": "20th Century Fox"
}
POST hilden1/type1
{
"brandName": "20th Century Fox"
}
POST hilden1/type1
{
"brandName": "foo bar"
}
GET hilden1/type1/_search
{
"size": 0,
"aggs": {
"analyzed_field": {
"terms": {
"field": "brandName",
"size": 10
}
},
"non_analyzed_field": {
"terms": {
"field": "brandName.raw",
"size": 10
}
}
}
}
Results of the last query:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 0,
"hits": []
},
"aggregations": {
"non_analyzed_field": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "20th Century Fox",
"doc_count": 2
},
{
"key": "bar",
"doc_count": 1
},
{
"key": "foo",
"doc_count": 1
},
{
"key": "foo bar",
"doc_count": 1
}
]
},
"analyzed_field": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "20th",
"doc_count": 2
},
{
"key": "bar",
"doc_count": 2
},
{
"key": "century",
"doc_count": 2
},
{
"key": "foo",
"doc_count": 2
},
{
"key": "fox",
"doc_count": 2
}
]
}
}
}
Notice that the not-analyzed field keeps "20th Century Fox" and "foo bar" together, whereas the analyzed field breaks them up into lowercased tokens.
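To make the difference concrete, here is a minimal Python sketch that imitates what the two aggregations count. It is not Elasticsearch code: the `analyze` function is only a rough stand-in for the standard analyzer (lowercase, split on non-alphanumeric characters), and `terms_agg` is a hypothetical helper that counts how many documents contain each term.

```python
import re
from collections import Counter

# Same five brandName values as in the sample above.
DOCS = ["foo", "bar", "20th Century Fox", "20th Century Fox", "foo bar"]

def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]

def terms_agg(docs, analyzed):
    """Count, per term, how many documents contain it (a bucket's doc_count)."""
    counts = Counter()
    for doc in docs:
        # Analyzed field: one term per token. Not-analyzed field: the whole value is one term.
        terms = set(analyze(doc)) if analyzed else {doc}
        counts.update(terms)
    return dict(counts)

analyzed_buckets = terms_agg(DOCS, analyzed=True)
raw_buckets = terms_agg(DOCS, analyzed=False)
print(analyzed_buckets)
print(raw_buckets)
```

The counts come out the same as the buckets in the response above: every token of the analyzed field appears in 2 documents, while the raw field keeps "20th Century Fox" as a single key with a count of 2.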

I had a similar issue. I was displaying search results and wanted to show counts on the category and sub category.
You're right to use aggregations. I also had the issue with the strings being tokenised (i.e. 20th century fox being split) - this happens because the fields are analysed. For me, I added the following mappings (i.e. tell ES not to analyse that field):
"category": {
"type": "nested",
"properties": {
"CategoryNameAndSlug": {
"type": "string",
"index": "not_analyzed"
},
"SubCategoryNameAndSlug": {
"type": "string",
"index": "not_analyzed"
}
}
}
As jhilden suggested, if you use this field for more than one purpose (e.g. search and aggregation), you can set it up as a multi-field. That way it is analysed for searching and also kept as a non-analysed version for aggregation.

Related

Handle singular and plural search terms in Azure Cognitive Search

We're using Azure Cognitive Search as our search engine for searching for images. The analyzer is standard Lucene, and when a user searches for "scottish landscapes", some of our users claim that their image is missing. They then have to add the keyword "landscapes" to their images so that the search engine can find them.
Changing the analyzer to "en.lucene" or "en.microsoft" only seemed to produce far fewer search results, which we didn't like for our users.
Azure Cognitive Search does not seem to treat singular and plural forms of a word as equivalent. To work around the issue, I created a dictionary in the database, used inflection and tried manipulating the search terms:
foreach (var term in terms)
{
if (ps.IsSingular(term))
{
// check with db
var singular = noun.GetSingularWord(term);
if (!string.IsNullOrEmpty(singular))
{
var plural = ps.Pluralize(term);
keywords = keywords + " " + plural;
}
}
else
{
// check with db
var plural = noun.GetPluralWord(term);
if (!string.IsNullOrEmpty(plural))
{
var singular = ps.Singularize(term);
keywords = keywords + " " + singular;
}
}
}
My solution is not ideal; it would be nicer if Azure Cognitive Search could handle singular and plural words itself.
UPDATE:
Custom Analyzers may be the answer to my problem, I just need to find the right token filters.
UPDATE:
Below is my custom analyzer. It removes HTML constructs, apostrophes and stopwords, and converts the text to lowercase. The tokenizer is MicrosoftLanguageStemmingTokenizer, which reduces words to their root forms, so it suits the plural-to-singular scenario (searching for "landscapes" returns "landscapes" and "landscape").
"analyzers": [
{
"name": "p4m_custom_analyzer",
"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"charFilters": [
"html_strip",
"remove_apostrophe"
],
"tokenizer": "custom_tokenizer",
"tokenFilters": [
"lowercase",
"remove_stopwords"
]
}
],
"charFilters": [
{
"name": "remove_apostrophe",
"@odata.type": "#Microsoft.Azure.Search.MappingCharFilter",
"mappings": ["'=>"]
}
],
"tokenizers": [
{
"name": "custom_tokenizer",
"@odata.type": "#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer",
"isSearchTokenizer": "false"
}
],
"tokenFilters": [
{
"name": "remove_stopwords",
"@odata.type": "#Microsoft.Azure.Search.StopwordsTokenFilter"
}
]
I have yet to figure out the other way around. If the user searches for "apple" it should return "apple" and "apples".
Both en.lucene and en.microsoft should have helped with this, you shouldn't need to manually expand inflections on your side. I'm surprised to hear you see less recall with them. Generally speaking I would expect higher recall with those than the standard analyzer. Do you by any chance have multiple searchable fields with different analyzers? That could interfere. Otherwise, it would be great to see a specific case (a query/document pair along with the index definition) to investigate further.
As a quick test, I used this small index definition:
{
"name": "inflections",
"fields": [
{
"name": "id",
"type": "Edm.String",
"searchable": false,
"filterable": true,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": true
},
{
"name": "en_ms",
"type": "Edm.String",
"searchable": true,
"filterable": false,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": false,
"analyzer": "en.microsoft"
}
]
}
These docs:
{
"id": "1",
"en_ms": "example with scottish landscape as part of the sentence"
},
{
"id": "2",
"en_ms": "this doc has one apple word"
},
{
"id": "3",
"en_ms": "this doc has two apples in it"
}
For the query search=landscapes I see these results:
{
"value": [
{
"@search.score": 0.9631388,
"id": "1",
"en_ms": "example with scottish landscape as part of the sentence"
}
]
}
And for search=apple I see:
{
"value": [
{
"@search.score": 0.51188517,
"id": "3",
"en_ms": "this doc has two apples in it"
},
{
"@search.score": 0.46152657,
"id": "2",
"en_ms": "this doc has one apple word"
}
]
}
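The reason a language analyzer handles both directions (singular query against plural text, and the reverse) is that the same normalization is applied to the documents at index time and to the query at search time. The Python sketch below illustrates only that idea. `naive_stem` is a deliberately crude, hypothetical plural stripper and is not how en.microsoft or en.lucene work internally.

```python
def naive_stem(token):
    """Very rough plural stripping; a real language analyzer does far more."""
    token = token.lower()
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def analyze(text):
    """Apply the same normalization to documents and to queries."""
    return {naive_stem(tok) for tok in text.split()}

# The three test documents from the answer above.
DOCS = {
    "1": "example with scottish landscape as part of the sentence",
    "2": "this doc has one apple word",
    "3": "this doc has two apples in it",
}

# "Index time": store the normalized terms of every document.
INDEX = {doc_id: analyze(text) for doc_id, text in DOCS.items()}

def search(query):
    """Return ids of documents sharing at least one normalized term with the query."""
    query_terms = analyze(query)
    return sorted(doc_id for doc_id, terms in INDEX.items() if terms & query_terms)

print(search("landscapes"))
print(search("apple"))
print(search("apples"))
```

Because "apples" and "apple" both normalize to the same term, either form of the query finds documents 2 and 3, and "landscapes" finds document 1.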

Elasticsearch terms query problem for "text" or "keyword" type fields

"hits" : [
{
"id": 1,
"sampleArrayData": ["x"]
},
{
"id": 2,
"sampleArrayData": ["y"]
},
{
"id": 3,
"sampleArrayData": ["z"]
},
{
"id": 4,
"sampleArrayData": ["x", "y", "z"]
},
{
"id": 5,
"sampleArrayData": ["z", "w"]
}
]
This is sample data from the index.
I want to search this index by the sampleArrayData field, matching one or more values that I supply dynamically.
For example, if I search this index with the parameters ["x","y"], I must get the documents that include "x", "y", or both (records 1, 2 and 4 from this index).
We could do this with a terms query on older Elasticsearch versions, like below.
{
"query": {
"bool": {
"must": [
{
"terms": {
"sampleArrayData": [ "x", "y"]
}
}
]
}
}
}
But we cannot use the terms query for "text" or "keyword" type fields in current Elasticsearch versions.
How can I dynamically search this index with an unknown number of parameters using the C# NEST library?
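As an illustration of the behaviour being asked for, the following Python sketch builds the same request body for an arbitrary list of values and mimics the "contains at least one of these values" semantics against the sample data. `build_terms_query` and `matches_terms` are hypothetical helper names, and nothing here calls Elasticsearch or NEST.

```python
import json

def build_terms_query(field, values):
    """Build a bool/must/terms request body for any number of values."""
    return {"query": {"bool": {"must": [{"terms": {field: list(values)}}]}}}

def matches_terms(doc, field, values):
    """Mimic terms semantics: the field holds at least one of the given values."""
    return bool(set(doc.get(field, [])) & set(values))

# Sample documents from the question.
DOCS = [
    {"id": 1, "sampleArrayData": ["x"]},
    {"id": 2, "sampleArrayData": ["y"]},
    {"id": 3, "sampleArrayData": ["z"]},
    {"id": 4, "sampleArrayData": ["x", "y", "z"]},
    {"id": 5, "sampleArrayData": ["z", "w"]},
]

wanted = ["x", "y"]  # supplied at runtime, any length
body = build_terms_query("sampleArrayData", wanted)
matched_ids = [d["id"] for d in DOCS if matches_terms(d, "sampleArrayData", wanted)]

print(json.dumps(body))
print(matched_ids)
```

With ["x", "y"] the matching ids are 1, 2 and 4, since each of those documents contains "x", "y", or both.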

What is the added value, if any, of a composite entity in the Microsoft Bot framework regarding a single marked entity

In the Microsoft Bot tutorial, a LUIS service is started that can deconstruct a sentence about booking a flight.
The entities used within the intent utterances include 2 composite entities named To and From, each of which has a child list entity named Airport.
This produces the following JSON:
"entities": {
"From": [
{
"Airport": [
[
"Berlin"
]
],
"$instance": {
"Airport": [
{
"type": "Airport",
"text": "berlin",
"startIndex": 19,
"length": 6,
"modelTypeId": 5,
"modelType": "List Entity Extractor",
"recognitionSources": [
"model"
]
}
]
}
}
],
"To": [
{
"Airport": [
[
"Paris"
]
],
"$instance": {
"Airport": [
{
"type": "Airport",
"text": "paris",
"startIndex": 29,
"length": 5,
"modelTypeId": 5,
"modelType": "List Entity Extractor",
"recognitionSources": [
"model"
]
}
]
}
}
],
Two things about this seem not very efficient, but since this is machine training functionality rather than machine learning, I want to know if there is a difference.
Why not make Airport the parent and have 2 child entities named ToCity and FromCity? This would allow Airport to be the city, with 2 nested objects of ToCity and FromCity if they are extracted from the utterance.
Why is a composite used here at all? Is there some added benefit? With the above abstraction one could simply make 2 simple entities or list entities, ToCity and FromCity. I am not seeing why the organization of the composite is fitting here, but I may have a misunderstanding.
Here is an example of what I mean regarding question 1. To me this seems like a better, more organized way of doing this, but it is not 100% clear whether, for example, it is easier to access or whether the scores are higher one way or the other. I can tell you that, empirically, in the test below this approach produced a higher score for each of the 2 entities.
"entities": {
"Color": [
{
"CarColor": [
"blue"
],
"$instance": {
"CarColor": [
{
"type": "CarColor",
"text": "blue",
"startIndex": 6,
"length": 4,
"score": 0.9977741,
"modelTypeId": 1,
"modelType": "Entity Extractor",
"recognitionSources": [
"model"
]
}
]
}
},
{
"InteriorColor": [
"red"
],
"$instance": {
"InteriorColor": [
{
"type": "InteriorColor",
"text": "red",
"startIndex": 20,
"length": 3,
"score": 0.883398235,
"modelTypeId": 1,
"modelType": "Entity Extractor",
"recognitionSources": [
"model"
]
}
]
}
}
],
It's a fair question; however, the purpose of the tutorial, and any related example code, is to demonstrate how composite entities can be used. It's true that their usefulness isn't fully realized in this tutorial, as there are potentially less complicated ways of achieving the same goal. But developing an overly complex tutorial would be counter-productive for anyone trying to learn.

ElasticSearch search getting bad results

I am fairly new to ElasticSearch and am having issues getting search results that I perceive to be good. My objective is to be able to search an index of medications (6 fields) against a phrase that the user enters. It could be one or more words. I've tried a few approaches, but I'll outline the best one I've found so far below. Let me know what I'm doing wrong. I'm guessing that I'm missing something fundamental.
Here is a subset of the fields that I'm working with
...
"hits": [
{
"_index": "indexus2",
"_type": "Medication",
"_id": "17471",
"_score": 8.829264,
"_source": {
"SearchContents": " chew chewable oral po tylenol",
"MedShortDesc": "Tylenol PO Chew",
"MedLongDesc": "Tylenol Oral Chewable",
"GenericDesc": "ACETAMINOPHEN ORAL"
...
}
}
...
The fields that I'm searching against use an edge n-gram analyzer. I'm using the C# NEST library for the indexing:
settings.Analysis.Tokenizers.Add("edgeNGram", new EdgeNGramTokenizer()
{
MaxGram = 50,
MinGram = 2,
TokenChars = new List<string>() { "letter", "digit" }
});
settings.Analysis.Analyzers.Add("edgeNGramAnalyzer", new CustomAnalyzer()
{
Filter = new string[] { "lowercase" },
Tokenizer = "edgeNGram"
});
I am using a more_like_this query against the fields in question
GET indexus2/Medication/_search
{
"query": {
"more_like_this" : {
"fields" : ["MedShortDesc",
"MedLongDesc",
"GenericDesc",
"SearchContents"],
"like_text" : "vicodin",
"min_term_freq" : 1,
"max_query_terms" : 25,
"min_word_len": 2
}
}
}
The problem is that for this search for 'vicodin', I'd expect to see matches with the full word first, but I don't. Here is a subset of the results from this query. Vicodin doesn't show up until the 7th result.
"hits": [
{
"_index": "indexus2",
"_type": "Medication",
"_id": "31192",
"_score": 4.567309,
"_source": {
"SearchContents": " oral po victrelis",
"MedShortDesc": "Victrelis PO",
"MedLongDesc": "Victrelis Oral",
"RepresentativeRoutedGenericDesc": "BOCEPREVIR ORAL",
...
}
}
<5 more similar results>
{
"_index": "indexus2",
"_type": "Medication",
"_id": "26198",
"_score": 2.2836545,
"_source": {
"SearchContents": " (original 5 500 feeding mg strength) tube via vicodin",
"MedShortDesc": "Vicodin 5 mg-500 mg (Original Strength) via feeding tube",
"MedLongDesc": "Vicodin 5 mg-500 mg (Original Strength) via feeding tube",
"GenericDesc": "HYDROCODONE BITARTRATE/ACETAMINOPHEN ORAL",
...
}
}
Field Mappings
"OrderableMedLongDesc": {
"type": "string",
"analyzer": "edgeNGramAnalyzer"
},
"OrderableMedShortDesc": {
"type": "string",
"analyzer": "edgeNGramAnalyzer"
},
"RepresentativeRoutedGenericDesc": {
"type": "string",
"analyzer": "edgeNGramAnalyzer"
},
"SearchContents": {
"type": "string",
"analyzer": "edgeNGramAnalyzer"
},
Here is what ES shows for my _settings for analyzers
"analyzer": {
"edgeNGramAnalyzer": {
"type": "custom",
"filter": [
"lowercase"
],
"tokenizer": "edgeNGram"
}
},
"tokenizer": {
"edgeNGram": {
"min_gram": "2",
"type": "edgeNGram",
"max_gram": "50"
}
}
As per the above mapping, edgeNGramAnalyzer is also the search analyzer for these fields, so the search query gets "edge n-grammed" as well. You probably do not want this.
Change the mapping to set only the index_analyzer option to edgeNGramAnalyzer.
The search_analyzer would then default to standard.
Example:
"SearchContents": {
"type": "string",
"index_analyzer": "edgeNGramAnalyzer"
},
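To see why n-gramming the query hurts, here is a small Python sketch of the matching logic only, not of Elasticsearch scoring. `edge_ngrams` is a hypothetical helper that imitates the edgeNGram tokenizer settings from the question (min_gram 2, max_gram 50) on a single lowercased word.

```python
def edge_ngrams(word, min_gram=2, max_gram=50):
    """Prefixes of the lowercased word, from min_gram up to max_gram characters."""
    word = word.lower()
    return {word[:n] for n in range(min_gram, min(len(word), max_gram) + 1)}

# Index-time terms for two medication names from the question.
INDEXED = {
    "Victrelis": edge_ngrams("Victrelis"),
    "Vicodin": edge_ngrams("Vicodin"),
}

def matching_docs(query_terms):
    """Names whose indexed terms share at least one term with the query."""
    return sorted(name for name, grams in INDEXED.items() if grams & query_terms)

# Query analyzed with the edge n-gram analyzer: "vi" and "vic" also hit Victrelis.
ngrammed_query = matching_docs(edge_ngrams("vicodin"))

# Query kept as one whole term: only the document that contains "vicodin" matches.
plain_query = matching_docs({"vicodin"})

print(ngrammed_query)
print(plain_query)
```

When the query itself is split into prefixes, the short grams "vi" and "vic" are shared with Victrelis, so it becomes a candidate. With the query kept as the single term "vicodin", only the Vicodin entry matches.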

Linq query to d3.js chart

I am looking for a good way to feed a d3.js bubble chart with data from my MVC application. For example the standard bubble chart expects nested data in the form:
{
"name": "flare",
"children": [
{
"name": "analytics",
"children": [
{
"name": "cluster",
"children": [
{
"name": "CNN",
"size": 3938
}
]
},
{
"name": "graph",
"children": [
{
"name": "MTV",
"size": 3534
}
]
}
]
}
]
}
What I have on the server side is this linq query to a SQL database:
var results = from a in db.Durations
where a.Category == "watch"
group a by a.Description
into g
select new
{
name = g.Key,
size = g.Select(d => new{d.Begin, d.End}).Sum(d => SqlFunctions.DateDiff("hh", d.Begin, d.End))
};
return Json(results, JsonRequestBehavior.AllowGet);
The query result, parsed as Json, looks like this:
[{"name":"CNN","size":1950},{"name":"MTV","size":1680}]
I'm stuck on what would be a good way to achieve the correct formatting and to create the nested structure from my query results:
1. server-side, using anonymous types
2. server-side, adjusting the LINQ query
3. client-side, using d3.js nest
4. use a simpler bubble model, since for my purpose the nested structure with children is not really needed
5. something totally different and much, much cooler than 1-4
Thank you for any input.
Replace your return statement with the following one.
return Json(new
{
name = "Sites",
children = results
},
JsonRequestBehavior.AllowGet);
That will give you the following:
{
"name": "Sites",
"children": [
{
"name": "CNN",
"size": 1950
},
{
"name": "MTV",
"size": 1680
}
]
}
To serve as an example, suppose each website had an additional string Type property, with values such as "News" or "Music". Then you could do the following.
return Json(new
{
name = "Sites",
children = results.GroupBy(site => site.Type).Select(group => new
{
name = group.Key,
children = group
})
},
JsonRequestBehavior.AllowGet);
This would give you something like the following.
{
"name": "Sites",
"children": [
{
"name": "News",
"children": [
{
"name": "CNN",
"size": 1950
},
{
"name": "The Verge",
"size": 1600
}
]
},
{
"name": "Music",
"children": [
{
"name": "MTV",
"size": 1680
},
{
"name": "Pandora",
"size": 2000
}
]
}
]
}
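The grouping step is independent of LINQ, so as a language-neutral illustration here is the same reshaping in a few lines of Python. The `type` values and the extra two sites are the hypothetical data from the example above, and `nest_by_type` is an assumed helper name. Unlike the C# version, this sketch drops the type field from the leaf objects.

```python
import json
from collections import defaultdict

def nest_by_type(sites, root_name="Sites"):
    """Group flat {name, size, type} rows into the name/children tree d3 expects."""
    groups = defaultdict(list)
    for site in sites:
        groups[site["type"]].append({"name": site["name"], "size": site["size"]})
    # Dicts keep insertion order, so groups appear in order of first occurrence.
    children = [{"name": t, "children": rows} for t, rows in groups.items()]
    return {"name": root_name, "children": children}

SITES = [
    {"name": "CNN", "size": 1950, "type": "News"},
    {"name": "MTV", "size": 1680, "type": "Music"},
    {"name": "The Verge", "size": 1600, "type": "News"},
    {"name": "Pandora", "size": 2000, "type": "Music"},
]

tree = nest_by_type(SITES)
print(json.dumps(tree, indent=2))
```

The printed tree has a "News" node containing CNN and The Verge, followed by a "Music" node containing MTV and Pandora.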
