NEST for ElasticSearch - JSONSerializationException when retrieving data - c#

I am using the NEST API (v0.12.0.0) to interface with an ElasticSearch (v1.0.1) index and I just started receiving a JsonSerializationException when retrieving my data. I'm not sure if this is a NEST issue or otherwise, but it just randomly started happening and we haven't made any major changes to our implementation or infrastructure.
I am attempting to retrieve the Ids of my data (stored as a Guid) with a typed Search<>() and I am getting an exception when the data is processed by JSON.NET.
client.Search<ESEventItem>(s =>
s.Index("dev-events004")
.Fields(f => f.Id).Size(100000)
.Type("event").MatchAll()).Documents.ToList()
Running this same query manually in Sense produces no noticeable issues:
POST /dev-events004/event/_search
{
"size": 100000,
"query": {
"match_all": {}
},
"fields": [
"id"
]
}
{
"took": 2088,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 19257,
"max_score": 1,
"hits": [
{
"_index": "dev-events004",
"_type": "event",
"_id": "670a1055-cbe3-480e-b807-a2b500f9dfb3",
"_score": 1,
"fields": {
"id": [
"670a1055-cbe3-480e-b807-a2b500f9dfb3"
]
}
},
/* ... additional results ... */
]
}
}
If I perform a raw, untyped query Fields(new[] { "Id" }) it does not throw an exception. Likewise, if I return the whole ESEventItem object, rather than just the Id fields, it also works without an exception.
To the NEST developer: this question is mirrored as an issue on the github project.

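This is due to the fact that Elasticsearch 1.0 changed how fields are returned: each requested field now comes back as an array, as the "id" value in the Sense output above shows. The upcoming NEST 1.0 will support this.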
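To make the shape change concrete, here is a minimal sketch in Python (not NEST) that collapses the array-wrapped field values from the response shown in the question. The helper name `unwrap_fields` is hypothetical and only illustrates what a consumer has to account for; it is not how NEST handles the response.

```python
def unwrap_fields(hit):
    """Return a hit's "fields" with single-element arrays collapsed to scalars."""
    fields = hit.get("fields", {})
    return {
        name: values[0] if isinstance(values, list) and len(values) == 1 else values
        for name, values in fields.items()
    }


# One hit, copied from the Sense output in the question.
hit = {
    "_index": "dev-events004",
    "_type": "event",
    "_id": "670a1055-cbe3-480e-b807-a2b500f9dfb3",
    "_score": 1,
    "fields": {"id": ["670a1055-cbe3-480e-b807-a2b500f9dfb3"]},
}

unwrapped = unwrap_fields(hit)
print(unwrapped["id"])
```

A deserializer that expects a single Guid where the server now sends a one-element array would fail in the way the question describes.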

Related

Nest Elasticsearch match_phrase query throws parsing exception

I am using Nest elasticSearch as a client library to interact with the elasticSearch indices.
I am trying to send match_phrase query using the following code:
var searchResponse = elasticClient.Search<ProductType>(s => s
.Index(indices)
.Type(Types.Type(typeof(ProductType)))
.From(0)
.Size(5)
.Query(q =>
q.MatchPhrase(m => m
.Field(Infer.Field<ProductType>(ff => ff.Title))
.Slop(5)
.Query("my query")
)
)
);
It's generating the following query :
GET /product/_search
{
"from": 0,
"size": 5,
"sort": [
{
"_score": {
"order": "desc"
}
}
],
"query": {
"match": {
"title": {
"type": "phrase",
"query": "my query",
"slop": 5
}
}
}
}
When I execute the above query it returns parsing_exception:
[match] query does not support [type]
I was expecting the above code to return query like the following:
GET /product/_search
{
"from": 0,
"size": 5,
"sort": [
{
"_score": {
"order": "desc"
}
}
],
"query": {
"match_phrase": {
"title": {
"query": "my query",
"slop": 5
}
}
}
}
So is there anything wrong with my code and how can I get rid of it?
After some investigation, and based on the question "match query does not support type", I found that the server hosting my cluster had been upgraded to Elasticsearch v6.5.0. It turned out that I needed to upgrade my NEST NuGet package as well; after the upgrade it generates the match_phrase query as expected.
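The difference between the two request bodies can be sketched in a few lines of Python. This is only an illustration of the two shapes shown in the question, with a hypothetical helper `to_match_phrase`; the actual fix is the client upgrade described above, not rewriting queries by hand.

```python
def to_match_phrase(query):
    """Rewrite a legacy {"match": {field: {"type": "phrase", ...}}} clause
    into the {"match_phrase": {field: {...}}} form; return anything else unchanged."""
    match = query.get("match", {})
    if len(match) != 1:
        return query
    field, params = next(iter(match.items()))
    if not isinstance(params, dict) or params.get("type") != "phrase":
        return query
    # Drop the unsupported "type" key and keep the remaining options.
    rewritten = {key: value for key, value in params.items() if key != "type"}
    return {"match_phrase": {field: rewritten}}


# The query clause generated by the older client, as shown in the question.
legacy = {"match": {"title": {"type": "phrase", "query": "my query", "slop": 5}}}
converted = to_match_phrase(legacy)
print(converted)
```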

Index JsonObject with NEST has empty values

I want to index JsonObjects with NEST. After posting, the properties are in the index, but the values are empty ("[]"). When I post the same JSON with Postman, the result is correct.
Index:
string indexName = "testindex";
IIndexResponse response = client.Index<JObject>(docItem, i => i.Type("my_type").Index(indexName));
json in docItem:
{
"Source":"test",
"CreatedAt": "2018-05-26 12:23:33",
"SessionId":"1234",
"ResponseParam":{
"ItemA":"bla",
"ItemB": 123
}
}
search query:
http://[IP]:9200/testindex/_search
search result
{
"took": 8,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 1,
"hits": [
{
"_index": "testindex",
"_type": "my_type",
"_id": "u44ucmMB687Uyj7O8xKY",
"_score": 1,
"_source": {
"Source": [],
"CreatedAt": [],
"SessionId": [],
"ResponseParam": {
"ItemA": [],
"ItemB": []
}
}
},
If you're using JObject as the document type, or your document contains a JObject, you will also need to reference the NEST.JsonNetSerializer NuGet package and hook up the JsonNetSerializer as follows:
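using Elasticsearch.Net;          // SingleNodeConnectionPool
using Nest;                       // ConnectionSettings, ElasticClient
using Nest.JsonNetSerializer;     // JsonNetSerializer
var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
var connectionSettings =
new ConnectionSettings(pool, sourceSerializer: JsonNetSerializer.Default);
var client = new ElasticClient(connectionSettings);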
This is required because NEST 6.x removed the direct dependency on Json.NET by IL-merging, internalizing and re-namespacing it. One consequence is that NEST no longer knows how to handle Newtonsoft.Json.Linq.JObject specially, so a dependency on NEST.JsonNetSerializer, which does know how to handle that type, is needed.
Source: https://discuss.elastic.co/t/elasticsearch-net-nest-issue-with-api-after-upgrade-to-6-2-3/127690

DocumentDB Stored Procedure Continuation

I'm trying out DocumentDB as a possible data store for a new application. The app has to handle a lot of data so I used the Data Migration tool to put a lot of documents into a collection.
Most of the queries from my app will be aggregating and summing. So I'm using documentdb-lumenize. The code sample for calling that stored procedure from C# has me doing something like this:
var configString = @"{
cubeConfig: {
groupBy: 'year',
field: 'Amount',
f: 'sum'
},
filterQuery: 'SELECT * FROM TestLargeData t'
}";
var config = JsonConvert.DeserializeObject<object>(configString);
var result = await _client.ExecuteStoredProcedureAsync<dynamic>("my/sproc/link", config);
The result I get back looks like this:
{
"cubeConfig": {
"groupBy": "year",
"field": "Amount",
"f": "sum"
},
"filterQuery": "SELECT * FROM TestLargeData t",
"continuation": "-RID:rOtjAPc4TgBxFwAAAAAAAA==#RT:6#TRC:6000",
"stillQueueing": false,
"savedCube": {
"config": {
"groupBy": "year",
"field": "Amount",
"f": "sum"
},
"cellsAsCSVStyleArray": [
[
"year",
"_count",
"Amount_sum"
],
[
2006,
4825,
1391399555.74
],
[
2007,
1175,
693886378
]
],
"summaryMetrics": {}
},
"example": {
"year": 2007,
"SomeOtherField1": "SomeOtherValue1",
"SomeOtherField2": "SomeOtherValue2",
"Amount": 12000,
"id": "0ee80b66-7fa7-40c1-9124-292c01059562",
"_rid": "...",
"_self": "...",
"_etag": "\"...\"",
"_attachments": "attachments/",
"_ts": ...
}
}
The _count values indicate that I got back 6,000 documents worth of aggregated data. There are a million documents in the collection (I wanted to test big!)
I see the "continuation" value in the result. But StoredProcedureResponse doesn't have an ExecuteNextAsync method like the DocumentQuery class does. How would I use the DocumentDB API to request the next part of the data?
I'm the author of documentdb-lumenize. If you just send back in what's returned as the only parameter, then the documentdb-lumenize sproc will know how to deal with the continuation token. You'll have to keep calling it until the continuation token comes back empty.
That said, I'm really surprised it only did 6000 in one round trip. I generally get 20-50K per round trip. Maybe you have a lower spec'd collection? Maybe it's doing an index-less full-scan?
Submit an issue in the GitHub repo if you want more 1:1 help with this.
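The "keep calling until the continuation token comes back empty" loop from the answer can be sketched as follows. This is a Python sketch with a stand-in for the stored procedure: `make_fake_sproc`, `run_to_completion` and the `processed` counter are hypothetical and only simulate paging; the real sproc keeps its own state in the object it returns (for example `savedCube`), and in C# the call would go through `ExecuteStoredProcedureAsync`.

```python
def make_fake_sproc(total_docs, page_size):
    """Build a stand-in sproc that handles page_size documents per call."""
    def fake_sproc(state):
        done = state.get("processed", 0)
        now = min(done + page_size, total_docs)
        result = dict(state, processed=now)
        # An empty continuation signals that everything has been processed.
        result["continuation"] = f"token-{now}" if now < total_docs else None
        return result
    return fake_sproc


def run_to_completion(execute_sproc, config):
    """Feed each response back in as the only parameter until no continuation remains."""
    state = execute_sproc(config)
    round_trips = 1
    while state.get("continuation"):
        state = execute_sproc(state)
        round_trips += 1
    return state, round_trips


config = {"filterQuery": "SELECT * FROM TestLargeData t"}
final_state, round_trips = run_to_completion(make_fake_sproc(20000, 6000), config)
print(final_state["processed"], round_trips)
```

The key point is that the whole response object, not just the token, is passed back in on each call.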

ElasticSearch search getting bad results

I am fairly new to ElasticSearch and am having issues getting search results that I perceive to be good. My objective is to be able to search an index of medications (6 fields) against a phrase that the user enters. It could be one or more words. I've tried a few approaches, but I'll outline the best one I've found so far below. Let me know what I'm doing wrong. I'm guessing that I'm missing something fundamental.
Here is a subset of the fields that I'm working with
...
"hits": [
{
"_index": "indexus2",
"_type": "Medication",
"_id": "17471",
"_score": 8.829264,
"_source": {
"SearchContents": " chew chewable oral po tylenol",
"MedShortDesc": "Tylenol PO Chew",
"MedLongDesc": "Tylenol Oral Chewable"
"GenericDesc": "ACETAMINOPHEN ORAL"
...
}
}
...
The fields that I'm searching against used an Edge NGram Analyzer. I'm using the C# Nest library for the indexing
settings.Analysis.Tokenizers.Add("edgeNGram", new EdgeNGramTokenizer()
{
MaxGram = 50,
MinGram = 2,
TokenChars = new List<string>() { "letter", "digit" }
});
settings.Analysis.Analyzers.Add("edgeNGramAnalyzer", new CustomAnalyzer()
{
Filter = new string[] { "lowercase" },
Tokenizer = "edgeNGram"
});
I am using a more_like_this query against the fields in question
GET indexus2/Medication/_search
{
"query": {
"more_like_this" : {
"fields" : ["MedShortDesc",
"MedLongDesc",
"GenericDesc",
"SearchContents"],
"like_text" : "vicodin",
"min_term_freq" : 1,
"max_query_terms" : 25,
"min_word_len": 2
}
}
}
The problem is that for this search for 'vicodin', I'd expect to see matches with the full word first, but I don't. Here is a subset of the results from this query. Vicodin doesn't show up until the 7th result.
"hits": [
{
"_index": "indexus2",
"_type": "Medication",
"_id": "31192",
"_score": 4.567309,
"_source": {
"SearchContents": " oral po victrelis",
"MedShortDesc": "Victrelis PO",
"MedLongDesc": "Victrelis Oral",
"RepresentativeRoutedGenericDesc": "BOCEPREVIR ORAL",
...
}
}
<5 more similar results>
{
"_index": "indexus2",
"_type": "Medication",
"_id": "26198",
"_score": 2.2836545,
"_source": {
"SearchContents": " (original 5 500 feeding mg strength) tube via vicodin",
"MedShortDesc": "Vicodin 5 mg-500 mg (Original Strength) via feeding tube",
"MedLongDesc": "Vicodin 5 mg-500 mg (Original Strength) via feeding tube",
"GenericDesc": "HYDROCODONE BITARTRATE/ACETAMINOPHEN ORAL",
...
}
}
Field Mappings
"OrderableMedLongDesc": {
"type": "string",
"analyzer": "edgeNGramAnalyzer"
},
"OrderableMedShortDesc": {
"type": "string",
"analyzer": "edgeNGramAnalyzer"
},
"RepresentativeRoutedGenericDesc": {
"type": "string",
"analyzer": "edgeNGramAnalyzer"
},
"SearchContents": {
"type": "string",
"analyzer": "edgeNGramAnalyzer"
},
Here is what ES shows for my _settings for analyzers
"analyzer": {
"edgeNGramAnalyzer": {
"type": "custom",
"filter": [
"lowercase"
],
"tokenizer": "edgeNGram"
}
},
"tokenizer": {
"edgeNGram": {
"min_gram": "2",
"type": "edgeNGram",
"max_gram": "50"
}
}
As per the above mapping, edgeNGramAnalyzer is also the search analyzer for these fields, so the search query itself gets "edge ngrammed" as well. You probably do not want this.
Change the mapping to set only the index_analyzer option to edgeNGramAnalyzer.
The search_analyzer will then default to standard.
Example:
"SearchContents": {
"type": "string",
"index_analyzer": "edgeNGramAnalyzer"
},
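To see why an edge-ngrammed query pulls in "Victrelis", here is a rough Python sketch of the term overlap. It is a simplification under stated assumptions: `edge_ngrams` and `index_tokens` are hypothetical helpers that approximate the analyzer in the question (lowercase, split on non-alphanumerics, min_gram 2, max_gram 50), and the sketch ignores scoring entirely.

```python
import re


def edge_ngrams(word, min_gram=2, max_gram=50):
    """Return the prefixes of word with lengths min_gram..max_gram."""
    return [word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)]


def index_tokens(text):
    """Approximate the index-time analyzer: lowercase, split, edge-ngram each word."""
    tokens = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        tokens.update(edge_ngrams(word))
    return tokens


victrelis = index_tokens("Victrelis PO")
vicodin = index_tokens("Vicodin 5 mg-500 mg")

ngrammed_query = set(edge_ngrams("vicodin"))  # query analyzed with edge n-grams
plain_query = {"vicodin"}                     # query analyzed as a whole word

print(sorted(ngrammed_query & victrelis))  # ['vi', 'vic']
print(sorted(plain_query & victrelis))     # []
print(sorted(plain_query & vicodin))       # ['vicodin']
```

With the n-grammed query, the short prefixes "vi" and "vic" match the Victrelis document; with a whole-word query, only the Vicodin document matches.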

Getting distinct values using NEST ElasticSearch client

I'm building a product search engine with Elasticsearch in my .NET application, using the NEST client, and there is one thing I'm having trouble with: getting a distinct set of values.
I'm searching for products, of which there are many thousands, but of course I can only return 10 or 20 at a time to the user, and paging works fine for this. But besides this primary result, I want to show my users a list of the brands found within the complete search, to present these for filtering.
I have read that I should use Terms Aggregations for this, but I couldn't get anything better than the code below. It still doesn't really give me what I want, because it splits values like "20th Century Fox" into 3 separate values.
var brandResults = client.Search<Product>(s => s
.Query(query)
.Aggregations(a => a.Terms("my_terms_agg", t => t.Field(p => p.BrandName).Size(250))
)
);
var agg = brandResults.Aggs.Terms("my_terms_agg");
Is this even the right approach, or should I use something totally different? And how can I get the correct, complete values? (Not split by space, but I guess that is what you get when you ask for a list of 'Terms'?)
What I'm looking for is what you would get if you did this in MS SQL:
SELECT DISTINCT BrandName FROM [Table To Search] WHERE [Where clause without paging]
You are correct that what you want is a terms aggregation. The problem you're running into is that ES is splitting the field "BrandName" in the results it is returning. This is the expected default behavior of a field in ES.
What I recommend is that you change BrandName into a "Multifield". This will allow you to search on all the various parts, as well as do a terms aggregation on the "Not Analyzed" (i.e. the full "20th Century Fox") term.
Here is the documentation from ES.
https://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/mapping-multi-field-type.html
[UPDATE]
If you are using ES version 1.4 or newer, the syntax for multi-fields is a little different.
https://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_multi_fields.html#_multi_fields
Here is a full working sample that illustrates the point in ES 1.4.4. Note that the mapping specifies a "not_analyzed" version of the field.
PUT hilden1
PUT hilden1/type1/_mapping
{
"properties": {
"brandName": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
POST hilden1/type1
{
"brandName": "foo"
}
POST hilden1/type1
{
"brandName": "bar"
}
POST hilden1/type1
{
"brandName": "20th Century Fox"
}
POST hilden1/type1
{
"brandName": "20th Century Fox"
}
POST hilden1/type1
{
"brandName": "foo bar"
}
GET hilden1/type1/_search
{
"size": 0,
"aggs": {
"analyzed_field": {
"terms": {
"field": "brandName",
"size": 10
}
},
"non_analyzed_field": {
"terms": {
"field": "brandName.raw",
"size": 10
}
}
}
}
Results of the last query:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 0,
"hits": []
},
"aggregations": {
"non_analyzed_field": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "20th Century Fox",
"doc_count": 2
},
{
"key": "bar",
"doc_count": 1
},
{
"key": "foo",
"doc_count": 1
},
{
"key": "foo bar",
"doc_count": 1
}
]
},
"analyzed_field": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "20th",
"doc_count": 2
},
{
"key": "bar",
"doc_count": 2
},
{
"key": "century",
"doc_count": 2
},
{
"key": "foo",
"doc_count": 2
},
{
"key": "fox",
"doc_count": 2
}
]
}
}
}
Notice that the not-analyzed field keeps "20th Century Fox" and "foo bar" together, whereas the analyzed field breaks them up.
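The bucket counts above can be reproduced with a small Python sketch that mimics a terms aggregation over the same five documents. This is an approximation under stated assumptions: `analyzed_terms` and `terms_agg` are hypothetical helpers, and the lowercase word split only roughly stands in for the standard analyzer.

```python
import re
from collections import Counter

# The five brandName values indexed in the sample above.
docs = ["foo", "bar", "20th Century Fox", "20th Century Fox", "foo bar"]


def analyzed_terms(value):
    """Rough stand-in for the standard analyzer: lowercase word tokens."""
    return set(re.findall(r"\w+", value.lower()))


def terms_agg(values, analyze):
    """Count documents per term; each document counts at most once per term."""
    counts = Counter()
    for value in values:
        counts.update(analyzed_terms(value) if analyze else {value})
    return dict(counts)


analyzed_field = terms_agg(docs, analyze=True)
non_analyzed_field = terms_agg(docs, analyze=False)
print(analyzed_field)
print(non_analyzed_field)
```

The analyzed variant yields one bucket per token, while the non-analyzed variant yields one bucket per distinct original value, which is the SELECT DISTINCT behaviour the question asks for.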
I had a similar issue. I was displaying search results and wanted to show counts on the category and sub category.
You're right to use aggregations. I also had the issue with the strings being tokenised (i.e. 20th century fox being split) - this happens because the fields are analysed. For me, I added the following mappings (i.e. tell ES not to analyse that field):
"category": {
"type": "nested",
"properties": {
"CategoryNameAndSlug": {
"type": "string",
"index": "not_analyzed"
},
"SubCategoryNameAndSlug": {
"type": "string",
"index": "not_analyzed"
}
}
}
As jhilden suggested, if you use this field for more than one purpose (e.g. search and aggregation) you can set it up as a multifield: one version is analysed and used for searching, and the other is left unanalysed and used for aggregation.
