ElasticSearch C# client (NEST): access nested aggregation with Spaces

Assuming my two values are "Red Square" and "Green circle", when I run the aggregation in Elasticsearch I get four values instead of two, split on the spaces: Red, Square, Green, circle.
Is there a way to get the two original values back?
The code is below:
var result = this.client.Search<MyClass>(s => s
    .Size(int.MaxValue)
    .Aggregations(a => a
        .Terms("field1", t => t.Field(k => k.MyField))
    )
);
var agBucket = (Bucket)result.Aggregations["field1"];
var myAgg = result.Aggs.Terms("field1");
IList<KeyItem> list = myAgg.Items;
foreach (KeyItem i in list)
{
    string data = i.Key;
}

In your mapping, you need to set the field1 string as not_analyzed, like this:
{
"your_type": {
"properties": {
"field1": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
You can also make field1 a multi-field and make it both analyzed and not_analyzed to get the best of both worlds (i.e. text matching on the analyzed field + aggregation on the exact value of the not_analyzed raw sub-field).
{
"your_type": {
"properties": {
"field1": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
If you choose this second option, you'll need to run your aggregation on field1.raw instead of field1.
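To see why an analyzed field yields four buckets, here is a minimal sketch in plain JavaScript that imitates the behaviour. The `analyzeText` function is only a rough stand-in for the standard analyzer (lowercase, then split on whitespace), and `termsBuckets` is a hypothetical helper, not part of Elasticsearch or NEST. Note that a real analyzed field would also lowercase the terms.

```javascript
// Rough stand-in for the standard analyzer (an assumption for illustration):
// lowercase the value and split it on whitespace.
function analyzeText(value) {
  return value.toLowerCase().split(/\s+/).filter(Boolean);
}

// Collect the distinct bucket keys a terms aggregation would produce,
// given a function that turns a field value into indexed terms.
function termsBuckets(values, toTerms) {
  const counts = new Map();
  for (const value of values) {
    for (const term of new Set(toTerms(value))) {
      counts.set(term, (counts.get(term) || 0) + 1);
    }
  }
  return [...counts.keys()].sort();
}

const colorValues = ["Red Square", "Green circle"];

// Analyzed field: every token becomes its own bucket.
const analyzedKeys = termsBuckets(colorValues, analyzeText);

// not_analyzed field: the whole value is indexed as a single term.
const rawKeys = termsBuckets(colorValues, (value) => [value]);

console.log(analyzedKeys);
console.log(rawKeys);
```

The second call corresponds to aggregating on the not_analyzed field (or on field1.raw in the multi-field variant).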

Related

MongoDB, C#. Update multiple array elements with different conditions

There is an entity with custom fields that can change their value. The fields are stored as an array.
public class SomeEntity :...
{
...
public List<CustomField> FieldList { get; set; } = new List<CustomField>();
...
}
I use code like this to update one field:
var filterBuilder = Builders<SomeEntity>.Filter;
var updateFilter =
filterBuilder.Eq(e => e.EntityId, entityToUpdate.EntityId)
& filterBuilder.ElemMatch(e => e.FieldList, f => f.FieldId == updateField.FieldId);
var updateDef = Builders<SomeEntity>.Update.Set(e => e.FieldList[-1].Value, updateField.Value);
var updateRes = await GetEntityCollection().UpdateOneAsync(updateFilter, updateDef);
Sometimes it becomes necessary to update the values of several fields in the database at the same time, but I cannot find a way to do this in one action.
Is it possible in MongoDB to first search for a document using a filter, and then update/delete several elements of its array by their Id's (changes for each element may be different) in one action?
Is it possible to do the same for several documents (for example, remove fields with certain identifiers from all objects matching the filter)?
An example of an object before the change:
{
...
"fieldList": [
{
"id": "3bf2c235-82c3-40e4-91dc-46dc4c1ed177",
"type": 0,
"value": 10
},
{
"id": "5909dabd-fe8f-4edb-a642-c052e23082d8",
"type": 1,
"value": "some value"
},
{
"id": "66805403-d508-4b99-82f3-fa2ed828c19e",
"type": 3,
"value": "2019-08-01T12:00:00"
}
]
}
An example of an object after modification (only one document is updated):
{
...
"fieldList": [
{
"id": "3bf2c235-82c3-40e4-91dc-46dc4c1ed177",
"type": 0,
"value": 500
},
{
"id": "5909dabd-fe8f-4edb-a642-c052e23082d8",
"type": 1,
"value": "new value"
},
{
"id": "66805403-d508-4b99-82f3-fa2ed828c19e",
"type": 3,
"value": "2020-09-10T10:00:00"
}
]
}
An example of an object after a bulk change (this document matched the filter; by analogy, every document matching the filter should be updated the same way: two fields deleted and a default value added to the remaining field):
{
...
"fieldList": [
{
"id": "3bf2c235-82c3-40e4-91dc-46dc4c1ed177",
"type": 0,
"value": 500,
"defaultValue": 150
}
]
}
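One approach that may fit, assuming MongoDB 3.6 or later, is the filtered positional operator `$[<identifier>]` together with the `arrayFilters` option. The sketch below only builds the update document and options as plain JavaScript objects. `buildFieldUpdates` and the `f0`, `f1` identifiers are hypothetical names chosen for illustration, and the stored field names (`fieldList`, `id`, `value`) are taken from the sample documents above.

```javascript
// Build a $set update plus matching arrayFilters so that each array element,
// selected by its id, receives its own new value in a single update call.
function buildFieldUpdates(changes) {
  const set = {};
  const arrayFilters = [];
  changes.forEach((change, index) => {
    // Identifiers must start with a lowercase letter and be alphanumeric.
    const identifier = `f${index}`;
    set[`fieldList.$[${identifier}].value`] = change.value;
    arrayFilters.push({ [`${identifier}.id`]: change.fieldId });
  });
  return { update: { $set: set }, options: { arrayFilters } };
}

const fieldChanges = [
  { fieldId: "3bf2c235-82c3-40e4-91dc-46dc4c1ed177", value: 500 },
  { fieldId: "5909dabd-fe8f-4edb-a642-c052e23082d8", value: "new value" },
];
const fieldUpdate = buildFieldUpdates(fieldChanges);
console.log(JSON.stringify(fieldUpdate, null, 2));
```

The result could then be passed to something like `db.collection.updateOne(filter, fieldUpdate.update, fieldUpdate.options)`, or to `updateMany` for several documents. Removing elements by id from every matching document should be possible with `$pull`, for example `{ $pull: { fieldList: { id: { $in: [...] } } } }`.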

Mongo DB aggregation group performance

I am pretty new to MongoDB and am experimenting with it for one of our applications. We are trying to implement CQRS: the query part uses node.js and the command part is implemented in C#.
One of my collections might have millions of documents in it. Each document has a scenarioId field, and each scenario can have around two million records.
Our use case is to compare the data of two scenarios and perform some mathematical operation on each field of the scenarios.
For example, each scenario can have a property avgMiles, and I would like to compute the difference of this property between scenarios; users should be able to filter on this difference value. As my design keeps the data of both scenarios in a single collection, I am trying to group by scenario id and then project the result.
My sample structure of a document would look like below.
{
"_id" : ObjectId("5ac05dc58ff6cd3054d5654c"),
"origin" : {
"code" : "0000",
},
"destination" : {
"code" : "0001",
},
"currentOutput" : {
"avgMiles" : 0.15093020854848138,
},
"scenarioId" : NumberInt(0),
"serviceType" : "ECON"
}
When I group I would group it based on origin.code and destination.code and serviceType properties.
My aggregate pipeline query looks like this:
db.servicestats.aggregate([{$match:{$or:[{scenarioId:0}, {scenarioId:1}]}},
{$sort:{'origin.code':1,'destination.code':1,serviceType:1}},
{$group:{
_id:{originCode:'$origin.code',destinationCode:'$destination.code',serviceType:'$serviceType'},
baseScenarioId:{$sum:{$switch: {
branches: [
{
case: { $eq: [ '$scenarioId', 1] },
then: '$scenarioId'
}],
default: 0
}
}},
compareScenarioId:{$sum:{$switch: {
branches: [
{
case: { $eq: [ '$scenarioId', 0] },
then: '$scenarioId'
}],
default: 0
}
}},
baseavgMiles:{$max:{$switch: {
branches: [
{
case: { $eq: [ '$scenarioId', 1] },
then: '$currentOutput.avgMiles'
}],
default: null
}
}},
compareavgMiles:{$sum:{$switch: {
branches: [
{
case: { $eq: [ '$scenarioId', 0] },
then: '$currentOutput.avgMiles'
}],
default: null
}
}}
}
},
{$project:{scenarioId:
{ base:'$baseScenarioId',
compare:'$compareScenarioId'
},
avgMiles:{base:'$baseavgMiles', compare:'$compareavgMiles',diff:{$subtract :['$baseavgMiles','$compareavgMiles']}}
}
},
{$match:{'avgMiles.diff':{$eq:0.5}}},
{$limit:100}
],{allowDiskUse: true} )
My group pipeline stage would have 4 million documents going in it. Can you please suggest how I can improve the performance of this query?
I have an index on the fields used in my group by condition and I have added a sort pipeline stage to help group by to perform better.
Any suggestions are most welcome.
As group by is not working in my case, I have implemented a left outer join using $lookup, and the query looks like this:
db.servicestats.aggregate([
{$match:{$and :[ {'scenarioId':0}
//,{'origin.code':'0000'},{'destination.code':'0001'}
]}},
//{$limit:1000000},
{$lookup: { from:'servicestats',
let: {ocode:'$origin.code',dcode:'$destination.code',stype:'$serviceType'},
pipeline:[
{$match: {
$expr: { $and:
[
{ $eq: [ "$scenarioId", 1 ] },
{ $eq: [ "$origin.code", "$$ocode" ] },
{ $eq: [ "$destination.code", "$$dcode" ] },
{ $eq: [ "$serviceType", "$$stype" ] },
]
}
}
},
{$project: {_id:0, comp :{compavgmiles :'$currentOutput.avgMiles'}}},
{ $replaceRoot: { newRoot: "$comp" } }
],
as : "compoutputs"
}},
{
$replaceRoot: {
newRoot: {
$mergeObjects:[
{
$arrayElemAt: [
"$$ROOT.compoutputs",
0
]
},
{
origin: "$$ROOT.origin",
destination: "$$ROOT.destination",
serviceType: "$$ROOT.serviceType",
baseavgmiles: "$$ROOT.currentOutput.avgMiles",
output: '$$ROOT'
}
]
}
}
},
{$limit:100}
])
The above query performs well and returns in 70 ms.
But in my scenario I need a full outer join, which as I understand it MongoDB does not support as of now, so I implemented it using a $facet pipeline as below:
db.servicestats.aggregate([
{$limit:1000},
{$facet: {output1:[
{$match:{$and :[ {'scenarioId':0}
]}},
{$lookup: { from:'servicestats',
let: {ocode:'$origin.code',dcode:'$destination.code',stype:'$serviceType'},
pipeline:[
{$match: {
$expr: { $and:
[
{ $eq: [ "$scenarioId", 1 ] },
{ $eq: [ "$origin.code", "$$ocode" ] },
{ $eq: [ "$destination.code", "$$dcode" ] },
{ $eq: [ "$serviceType", "$$stype" ] },
]
}
}
},
{$project: {_id:0, comp :{compavgmiles :'$currentOutput.avgMiles'}}},
{ $replaceRoot: { newRoot: "$comp" } }
],
as : "compoutputs"
}},
//{
// $replaceRoot: {
// newRoot: {
// $mergeObjects:[
// {
// $arrayElemAt: [
// "$$ROOT.compoutputs",
// 0
// ]
// },
// {
// origin: "$$ROOT.origin",
// destination: "$$ROOT.destination",
// serviceType: "$$ROOT.serviceType",
// baseavgmiles: "$$ROOT.currentOutput.avgMiles",
// output: '$$ROOT'
// }
// ]
// }
// }
// }
],
output2:[
{$match:{$and :[ {'scenarioId':1}
]}},
{$lookup: { from:'servicestats',
let: {ocode:'$origin.code',dcode:'$destination.code',stype:'$serviceType'},
pipeline:[
{$match: {
$expr: { $and:
[
{ $eq: [ "$scenarioId", 0 ] },
{ $eq: [ "$origin.code", "$$ocode" ] },
{ $eq: [ "$destination.code", "$$dcode" ] },
{ $eq: [ "$serviceType", "$$stype" ] },
]
}
}
},
{$project: {_id:0, comp :{compavgmiles :'$currentOutput.avgMiles'}}},
{ $replaceRoot: { newRoot: "$comp" } }
],
as : "compoutputs"
}},
//{
// $replaceRoot: {
// newRoot: {
// $mergeObjects:[
// {
// $arrayElemAt: [
// "$$ROOT.compoutputs",
// 0
// ]
// },
// {
// origin: "$$ROOT.origin",
// destination: "$$ROOT.destination",
// serviceType: "$$ROOT.serviceType",
// baseavgmiles: "$$ROOT.currentOutput.avgMiles",
// output: '$$ROOT'
// }
// ]
// }
// }
// },
{$match :{'compoutputs':{$eq:[]}}}
]
}
}
///{$limit:100}
])
But facet performance is very bad. Any further ideas to improve this are most welcome.

In general, there are three things that can cause slow queries:
The query is not indexed, cannot use indexes efficiently, or the schema design is not optimal (e.g. highly nested arrays or subdocuments) which means that MongoDB must do some extra work to arrive at the relevant data.
The query is waiting for something slow (e.g. fetching data from disk, writing data to disk).
Underprovisioned hardware.
In terms of your query, there may be some general suggestions regarding query performance:
Using allowDiskUse in an aggregation pipeline means that the query may use disk for some of its stages. Disk is frequently the slowest part of a machine, so if it's possible for you to avoid this, it will speed up the query.
Note that each stage of an aggregation pipeline is limited to 100MB of RAM unless allowDiskUse is set. This limit applies irrespective of the amount of memory you have.
The $group stage cannot use indexes, because an index is tied to a document's location on disk. Once the aggregation pipeline enters a stage where the document's physical location is irrelevant (e.g. the $group stage), an index cannot be used anymore.
By default, the WiredTiger cache is ~50% of RAM, so a 64GB machine would have a ~32GB WiredTiger cache. If you find that the query is very slow, it is possible that MongoDB needed to go to disk to fetch the relevant documents. Monitoring iostat and checking disk utilization % during the query would provide hints as to whether enough RAM is provisioned.
Some possible solutions are:
Provision more RAM so that MongoDB doesn't have to go to disk very often.
Rework the schema design to avoid heavily nested fields, or multiple arrays in the document.
Tailor the document schema to make it easier for you to query the data in it, instead of tailoring the schema to how you think the data should be stored (e.g. avoid heavy normalization inherent in relational database design model).
If you find that you're hitting the performance limit of a single machine, consider sharding to horizontally scale the query. However, please note that sharding is a solution that would require careful design and consideration.

You say above that you'd like to group by scenarioId, but your pipeline doesn't actually do that. Grouping by it is probably what you should do, since it avoids all the $switch statements. Something like this might get you going:
db.servicestats.aggregate([{
$match: {
scenarioId: { $in: [ 0, 1 ] }
}
}, {
$sort: { // not sure if that stage even helps - try to run with and without
'origin.code': 1,
'destination.code': 1,
serviceType: 1
}
}, {
$group: { // first group by scenarioId AND the other fields
_id: {
scenarioId: '$scenarioId',
originCode: '$origin.code',
destinationCode: '$destination.code',
serviceType: '$serviceType'
},
avgMiles: { $max: '$currentOutput.avgMiles' } // no switches needed
},
}, {
$group: { // group by the other fields only so without scenarioId
_id: {
originCode: '$_id.originCode',
destinationCode: '$_id.destinationCode',
serviceType: '$_id.serviceType'
},
baseScenarioAvgMiles: {
$max: {
$cond: {
if: { $eq: [ '$_id.scenarioId', 1 ] },
then: '$avgMiles',
else: 0
}
}
},
compareScenarioAvgMiles: {
$max: {
$cond: {
if: { $eq: [ '$_id.scenarioId', 0 ] },
then: '$avgMiles',
else: 0
}
}
}
},
}, {
$addFields: { // compute the difference
diff: {
$subtract :[ '$baseScenarioAvgMiles', '$compareScenarioAvgMiles']
}
}
}, {
$match: {
diff: { $eq: 0.5 } // $addFields above puts diff at the top level
}
}, {
$limit:100
}], { allowDiskUse: true })
Beyond that I would suggest you use the power of db.collection.explain().aggregate(...) to find the right indexing and tune your query.
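To make the two-stage grouping easier to follow, here is a small in-memory sketch of the same idea in plain JavaScript. It is not MongoDB code: `compareScenarios` is a hypothetical helper, the sample documents are made up and flattened for brevity, and scenario 1 is treated as the base scenario as in the pipeline above.

```javascript
// Made-up, flattened sample data (an assumption for illustration).
const statsDocs = [
  { origin: "0000", destination: "0001", serviceType: "ECON", scenarioId: 0, avgMiles: 1.0 },
  { origin: "0000", destination: "0001", serviceType: "ECON", scenarioId: 1, avgMiles: 1.5 },
  { origin: "0000", destination: "0002", serviceType: "ECON", scenarioId: 1, avgMiles: 2.0 },
];

function compareScenarios(docs) {
  // Stage 1: group by scenarioId AND route, keeping the max avgMiles.
  const perScenario = new Map();
  for (const d of docs) {
    const key = JSON.stringify([d.scenarioId, d.origin, d.destination, d.serviceType]);
    const previous = perScenario.get(key);
    perScenario.set(key, previous === undefined ? d.avgMiles : Math.max(previous, d.avgMiles));
  }

  // Stage 2: group by route only and pick each scenario's value (default 0).
  const perRoute = new Map();
  for (const [key, avgMiles] of perScenario) {
    const [scenarioId, ...route] = JSON.parse(key);
    const routeKey = JSON.stringify(route);
    const row = perRoute.get(routeKey) || { route, base: 0, compare: 0 };
    if (scenarioId === 1) row.base = Math.max(row.base, avgMiles);
    if (scenarioId === 0) row.compare = Math.max(row.compare, avgMiles);
    perRoute.set(routeKey, row);
  }

  // Final step: compute the difference, like the $addFields stage.
  return [...perRoute.values()].map((row) => ({ ...row, diff: row.base - row.compare }));
}

const scenarioRows = compareScenarios(statsDocs);
console.log(scenarioRows);
```

A route that exists in only one scenario still produces a row, with the missing side defaulting to 0, which is the full outer join behaviour the question asks about.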

Elasticsearch proper way to escape spaces, ? doesn't work in all scenarios

I'm trying to get searching with spaces to work properly in elasticsearch but having a ton of trouble getting it to behave the same way as it does on another field.
I have two fields, Name and Addresses.First().Line1 that I want to be able to search and preserve spaces in the search. For instance, searching for Bob Smi* would return Bob Smith but not just Bob.
This is working for my Name field by doing a query string search with the space replaced with ?. I'm also doing a wildcard so my final query is *bob?smi*.
However, when I try to also search by line1, I get no results. E.g. *4800* returns a record with line1 like 4800 Street, but when I do the same transformation with 4800 street to get *4800?street*, I get no results.
Below is my query
{
"from": 0,
"size": 50,
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "*4800?Street*",
"fields": [
"name",
"addresses.line1"
]
}
}
]
}
}
}
returns no result.
Why would *bob?smi* return result with name Bob Smith but *4800?street* not return result with line item 4800 street?
Below is how both fields are set up in the index:
.Text(smd => smd.Name(c => c.Name).Analyzer(ElasticIndexCreator.SortAnalyzer).Fielddata())
.Nested<Address>(nomd => nomd.Name(p => p.PrimaryAddress).Properties(MapAddressProperties))
//from MapAddressProperties()
.Text(smd2 => smd2.Name(x => x.Line1).Analyzer(ElasticIndexCreator.SortAnalyzer).Fielddata())
Mappings in elastic:
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
"addresses": {
"line1": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
}
Is there some other, better way to escape a space in an elasticsearch querystring? I've also tried \\ and \\\\ (in C# evaluates to \\) instead of the ? to no avail.

Finally found the correct setup after tons of time experimenting. The configuration that worked for me was as follows:
Use Text with Field Data in the columns
Search using QueryString with wildcard placeholders, replacing spaces with ? e.g. bob smith is entered, query elastic with *bob?smith*
Use Nested queries for child objects. Oddly, addresses.line1 will return data when searching for, say, 4800 but not when trying *4800?street*. Using a nested query allows this to work properly.
From what I hear, having to use field data is very memory intensive, and having to use wildcards is very time intensive, so this is probably not an optimal solution but it's the only one I've found. If there is another better way to do this, please enlighten me.
Example queries in C# using Nest:
var query = Query<Student>.QueryString(qs => qs
.Fields(f => f
.Field(c => c.Name)
//.Field(c => c.PrimaryAddress.Line1) //this doesn't work
)
.Query(testCase.Term)
);
query |= Query<Student>.Nested(n => n
.Path(p => p.Addresses)
.Query(q => q
.QueryString(qs => qs
.Fields(f => f.Field(c => c.Addresses.First().Line1))
.Query(testCase.Term)
)
)
);
Example Mapping:
.Map<Student>(s => s.Properties(p => p
    .Text(t => t.Name(z => z.Name).Fielddata())
    .Nested<StudentAddress>(n => n
        .Name(ap => ap.Addresses)
        .Properties(ap => ap.Text(t => t.Name(z => z.Line1).Fielddata()))
    )
))

Try using addresses.line1.keyword (that is, the keyword multi-field that you defined for addresses.line1) in a term-level wildcard query:
{
"query": {
"wildcard": {
"addresses.line1.keyword": {
"wildcard": "*4800 street*"
}
}
}
}
Per the Elasticsearch documentation on full-text wildcard searches, if you search against addresses.line1 (whose type is text, so full-text search rules apply), the pattern is tested against each term analyzed out of the field separately, that is, once against 4800 and once against street, neither of which matches your *4800?street* wildcard. The addresses.line1.keyword multi-field contains the original 4800 street value as a single term, and should match your query pattern using a term-level wildcard query.
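The difference can be imitated with a short plain-JavaScript sketch. It is only a simulation under simplifying assumptions: `tokenize` is a rough stand-in for the standard analyzer, `wildcardToRegExp`, `matchesTextField` and `matchesKeywordField` are hypothetical helpers, and the keyword comparison is case-sensitive because a keyword field without a normalizer stores the value unchanged.

```javascript
// Translate a wildcard pattern (* = any sequence, ? = any single character)
// into an anchored regular expression.
function wildcardToRegExp(pattern) {
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, ".*").replace(/\?/g, ".") + "$");
}

// Rough stand-in for the standard analyzer: lowercase, split on whitespace.
function tokenize(value) {
  return value.toLowerCase().split(/\s+/).filter(Boolean);
}

// text field: the pattern is tested against each analyzed term on its own.
function matchesTextField(value, pattern) {
  const re = wildcardToRegExp(pattern.toLowerCase());
  return tokenize(value).some((term) => re.test(term));
}

// keyword field: the pattern is tested against the whole, unmodified value.
function matchesKeywordField(value, pattern) {
  return wildcardToRegExp(pattern).test(value);
}

const line1 = "4800 Street";
console.log(matchesTextField(line1, "*4800*"));             // a single term matches
console.log(matchesTextField(line1, "*4800?street*"));      // no single term spans both words
console.log(matchesKeywordField(line1, "*4800?Street*"));   // the whole value matches
console.log(matchesKeywordField(line1, "*4800 street*"));   // case differs from the stored value
```

If the stored value is 4800 Street with a capital S, the pattern sent to the keyword sub-field would presumably need the same casing.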
By the way, a minor nit: the mapping type definition itself seems incomplete for the addresses field. You said it is:
"addresses": {
"line1": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
}
But IMHO it should instead be:
"addresses": {
"properties": {
"line1": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}

How to mimic URI query

This may be too basic of a question for SO, but I thought I would ask anyway.
I am getting my feet wet with ElasticSearch and am trying to return a single document that has an exact match to my field of interest.
I have the field "StoryText" which is mapped as type "string" and indexed as "not_analyzed".
When I search using the basic URI query:
123.456.0.789:9200/stories/storyphrases/_search?q=StoryText:"The boy sat quietly"
I return an exact matched document as I expected with a single hit.
However, when I use the search functionality:
GET 123.456.0.789:9200/stories/storyphrases/_search
{
"query" : {
"filtered" : {
"filter" : {
"term" : {
"StoryText" : "The boy sat quietly"
}
}
}
}
}
I get multiple documents returned with many hits (i.e. "The boy sat loudly", "The boy stood quietly" etc. etc.)
Could somebody help me to understand how I need to restructure my search request to mimic the result I get using the basic query parameter?
At present I am using NEST in C# to generate my search request which looks like this
var searchresults = client.Search<stories>(p => p
.Query(q => q
.Filtered(r => r
.Filter(s => s.Term("StoryText", inputtext))
)
)
);
Thanks very much for any and all reads and or thoughts!
UPDATE: Mappings are listed below
GET /stories/storyphrases/_mappings
{
"stories": {
"mappings": {
"storyphrases": {
"dynamic": "strict",
"properties": {
"@timestamp": {
"type": "date",
"format": "date_optional_time"
},
"@version": {
"type": "string"
},
"SubjectCode": {
"type": "string"
},
"VerbCode": {
"type": "string"
},
"LocationCode": {
"type": "string"
},
"BookCode": {
"type": "string"
},
"Route": {
"type": "string"
},
"StoryText": {
"type": "string",
"index": "not_analyzed"
},
"SourceType": {
"type": "string"
},
"host": {
"type": "string"
},
"message": {
"type": "string"
},
"path": {
"type": "string"
}
}
}
}
}
}
Mick

Well, first off you are executing two different queries here. The first is running in a query context whilst the second is essentially a match_all query executing in a filtered context. If your objective is simply to emulate the first query but by passing a JSON body you will need something like
GET 123.456.0.789:9200/stories/storyphrases/_search
{
"query" : {
"query_string" : {
"query" : "StoryText:\"The boy sat quietly\""
}
}
}
To write this simple query using Nest you would use
var searchresults = client.Search<stories>(p => p.QueryString("StoryText:\"" + inputtext + "\""));
or in longer form
var searchresults = client.Search<stories>(p => p
.Query(q => q
.QueryString(qs => qs
.Query("StoryText:\"" + inputtext + "\"")
)
)
);
These both produce the same JSON body and send it to the _search endpoint. Assuming that storyphrases is your Elasticsearch type then you may also wish to include this in your C#.
var searchresults = client.Search<stories>(p => p
.Index("stories")
.Type("storyphrases")
.Query(q => q
.QueryString(qs => qs
.Query("StoryText:\"" + inputtext + "\"")
)
)
);
Having said all that and looking at your filtered query it should do what you expect according to my testing. Is your field definitely not analyzed? Can you post your mapping?
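For reference, a phrase in query_string syntax is delimited by double quotes, so the quotes have to be escaped inside the JSON body. A minimal sketch of building such a body in plain JavaScript follows; `phraseQueryBody` is a hypothetical helper written for illustration, not part of any Elasticsearch client.

```javascript
// Build a query_string request body that searches one field for an exact phrase.
// Backslashes and double quotes in the input are escaped so they cannot end
// the phrase early.
function phraseQueryBody(field, text) {
  const escaped = text.replace(/(["\\])/g, "\\$1");
  return { query: { query_string: { query: `${field}:"${escaped}"` } } };
}

const storyBody = phraseQueryBody("StoryText", "The boy sat quietly");
console.log(JSON.stringify(storyBody));
```

JSON.stringify takes care of the JSON-level escaping, so only the query_string-level escaping has to be done by hand.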

MongoDB: Build query in C# driver

I'm stuck trying to build this MongoDB query with the C# driver:
{
Location: { "$within": { "$center": [ [1, 1], 5 ] } },
Properties: {
$all: [
{ $elemMatch: { Type: 1, Value: "a" } },
{ $elemMatch: { Type: 2, Value: "b" } }
]
}
}
Something like this:
var geoQuery = Query.WithinCircle("Location", x, y, radius);
var propertiesQuery = ?;
var query = Query.And(geoQuery, propertiesQuery);
Addition:
The above query is taken from another question of mine:
MongoDB: Match multiple array elements
You are welcome to take part in its solution.

Here's how if you want to get that exact query:
// create the $elemMatch with Type and Value
// as we're just trying to make an expression here,
// we'll use $elemMatch as the property name
var qType1 = Query.EQ("$elemMatch",
BsonValue.Create(Query.And(Query.EQ("Type", 1),
Query.EQ("Value", "a"))));
// again
var qType2 = Query.EQ("$elemMatch",
BsonValue.Create(Query.And(Query.EQ("Type", 2),
Query.EQ("Value", "b"))));
// then, put it all together, with $all connecting the two queries
// for the Properties field
var query = Query.All("Properties",
new List<BsonValue> {
BsonValue.Create(qType1),
BsonValue.Create(qType2)
});
The sneaky part is that while many of the parameters to the various Query methods expect BsonValues rather than queries, you can create a BsonValue instance from a Query instance by doing something like:
// very cool/handy that this works
var bv = BsonValue.Create(Query.EQ("Type", 1));
The actual query sent matches your original request exactly:
query = {
"Properties": {
"$all": [
{ "$elemMatch": { "Type": 1, "Value": "a" }},
{ "$elemMatch": { "Type": 2, "Value": "b" }}
]
}
}
(I'd never seen that style of $all usage either; apparently it's just not documented yet.)

While I can confirm that the query you posted works on my machine, the documentation of $all seems to indicate that it shouldn't accept expressions or queries, but only values:
Syntax: { field: { $all: [ <value> , <value1> ... ] } }
(The documentation uses <expression> if queries are allowed, compare to $and). Accordingly, the C# driver accepts only an array of BsonValue instead of IMongoQuery.
However, the following query should be equivalent:
{
$and: [
{ "Location": { "$within": { "$center": [ [1, 1], 5 ] } } },
{ "Properties" : { $elemMatch: { "Type": 1, "Value": "a" } } },
{ "Properties" : { $elemMatch: { "Type": 2, "Value": "b" } } }
]
}
Which translates to the C# driver as
var query =
Query.And(Query.WithinCircle("Location", centerX, centerY, radius),
Query.ElemMatch("Properties", Query.And(Query.EQ("Type", 1), Query.EQ("Value", "a"))),
Query.ElemMatch("Properties", Query.And(Query.EQ("Type", 2), Query.EQ("Value", "b"))));
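What both forms of the query express for the Properties part can be sketched in plain JavaScript: each condition must be satisfied by some array element, and Type and Value must match within the same element. This is only an in-memory illustration with made-up documents. `hasElement` and `matchesBoth` are hypothetical helpers, and the geo condition is left out.

```javascript
// True if at least one element of doc.Properties satisfies every key/value
// pair in criteria (the behaviour of a single $elemMatch).
function hasElement(doc, criteria) {
  return (doc.Properties || []).some((element) =>
    Object.entries(criteria).every(([key, value]) => element[key] === value));
}

// Both $elemMatch conditions must hold, possibly on different elements.
function matchesBoth(doc) {
  return hasElement(doc, { Type: 1, Value: "a" }) &&
         hasElement(doc, { Type: 2, Value: "b" });
}

const candidates = [
  { name: "both", Properties: [{ Type: 1, Value: "a" }, { Type: 2, Value: "b" }] },
  { name: "crossed", Properties: [{ Type: 1, Value: "b" }, { Type: 2, Value: "a" }] },
  { name: "partial", Properties: [{ Type: 1, Value: "a" }] },
];

const matchedNames = candidates.filter(matchesBoth).map((doc) => doc.name);
console.log(matchedNames);
```

The "crossed" document is rejected because no single element has Type 1 together with Value "a", which is the reason for using $elemMatch rather than plain dotted-field conditions.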
