MongoDB aggregation $group performance - C#

I am pretty new to MongoDB and experimenting with it for one of our applications. We are trying to implement CQRS: the query side is built in Node.js and the command side in C#.
One of my collections might have millions of documents in it. Documents carry a scenarioId field, and each scenario can have around two million records.
Our use case is to compare the data of two scenarios and do some mathematical operations on each field of the scenarios.
For example, each scenario has a property avgMiles, and I would like to compute the difference of this property between the two scenarios; users should be able to filter on this difference value. Since my design keeps both scenarios' data in a single collection, I am trying to group by scenario id and then project.
A sample document looks like this:
{
    "_id" : ObjectId("5ac05dc58ff6cd3054d5654c"),
    "origin" : {
        "code" : "0000"
    },
    "destination" : {
        "code" : "0001"
    },
    "currentOutput" : {
        "avgMiles" : 0.15093020854848138
    },
    "scenarioId" : NumberInt(0),
    "serviceType" : "ECON"
}
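(On the C# command side, a document like this might map to classes along the following lines. This is an illustrative sketch only; the class and property names are assumptions, not our actual model.)

using MongoDB.Bson;
using MongoDB.Bson.Serialization.Attributes;

// Illustrative POCOs for the sample document above.
public class ServiceStat
{
    public ObjectId Id { get; set; }              // maps to _id by convention

    [BsonElement("origin")]
    public Location Origin { get; set; }

    [BsonElement("destination")]
    public Location Destination { get; set; }

    [BsonElement("currentOutput")]
    public CurrentOutput CurrentOutput { get; set; }

    [BsonElement("scenarioId")]
    public int ScenarioId { get; set; }

    [BsonElement("serviceType")]
    public string ServiceType { get; set; }
}

public class Location
{
    [BsonElement("code")]
    public string Code { get; set; }
}

public class CurrentOutput
{
    [BsonElement("avgMiles")]
    public double AvgMiles { get; set; }
}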
When I group, I group on the origin.code, destination.code and serviceType properties.
My aggregation pipeline query looks like this:
db.servicestats.aggregate([
    { $match: { $or: [ { scenarioId: 0 }, { scenarioId: 1 } ] } },
    { $sort: { 'origin.code': 1, 'destination.code': 1, serviceType: 1 } },
    { $group: {
        _id: { originCode: '$origin.code', destinationCode: '$destination.code', serviceType: '$serviceType' },
        baseScenarioId: { $sum: { $switch: {
            branches: [
                { case: { $eq: [ '$scenarioId', 1 ] }, then: '$scenarioId' }
            ],
            default: 0
        } } },
        compareScenarioId: { $sum: { $switch: {
            branches: [
                { case: { $eq: [ '$scenarioId', 0 ] }, then: '$scenarioId' }
            ],
            default: 0
        } } },
        baseavgMiles: { $max: { $switch: {
            branches: [
                { case: { $eq: [ '$scenarioId', 1 ] }, then: '$currentOutput.avgMiles' }
            ],
            default: null
        } } },
        compareavgMiles: { $sum: { $switch: {
            branches: [
                { case: { $eq: [ '$scenarioId', 0 ] }, then: '$currentOutput.avgMiles' }
            ],
            default: null
        } } }
    } },
    { $project: {
        scenarioId: { base: '$baseScenarioId', compare: '$compareScenarioId' },
        avgMiles: {
            base: '$baseavgMiles',
            compare: '$compareavgMiles',
            diff: { $subtract: [ '$baseavgMiles', '$compareavgMiles' ] }
        }
    } },
    { $match: { 'avgMiles.diff': { $eq: 0.5 } } },
    { $limit: 100 }
], { allowDiskUse: true })
About 4 million documents would flow into my $group stage. Can you please suggest how I can improve the performance of this query?
I have an index on the fields used in my group condition, and I have added a $sort stage to help the $group perform better.
Any suggestions are most welcome.
Update: as grouping did not work out in my case, I implemented a left outer join using $lookup; the query looks like this:
db.servicestats.aggregate([
    { $match: { $and: [ { 'scenarioId': 0 }
        //, { 'origin.code': '0000' }, { 'destination.code': '0001' }
    ] } },
    //{ $limit: 1000000 },
    { $lookup: {
        from: 'servicestats',
        let: { ocode: '$origin.code', dcode: '$destination.code', stype: '$serviceType' },
        pipeline: [
            { $match: {
                $expr: { $and: [
                    { $eq: [ "$scenarioId", 1 ] },
                    { $eq: [ "$origin.code", "$$ocode" ] },
                    { $eq: [ "$destination.code", "$$dcode" ] },
                    { $eq: [ "$serviceType", "$$stype" ] }
                ] }
            } },
            { $project: { _id: 0, comp: { compavgmiles: '$currentOutput.avgMiles' } } },
            { $replaceRoot: { newRoot: "$comp" } }
        ],
        as: "compoutputs"
    } },
    { $replaceRoot: {
        newRoot: {
            $mergeObjects: [
                { $arrayElemAt: [ "$$ROOT.compoutputs", 0 ] },
                {
                    origin: "$$ROOT.origin",
                    destination: "$$ROOT.destination",
                    serviceType: "$$ROOT.serviceType",
                    baseavgmiles: "$$ROOT.currentOutput.avgMiles",
                    output: '$$ROOT'
                }
            ]
        }
    } },
    { $limit: 100 }
])
The above query performs well and returns in 70 ms.
But in my scenario I need a full outer join, which I understand MongoDB does not support as of now, so I implemented it using a $facet pipeline as below:
db.servicestats.aggregate([
    { $limit: 1000 },
    { $facet: {
        output1: [
            { $match: { $and: [ { 'scenarioId': 0 } ] } },
            { $lookup: {
                from: 'servicestats',
                let: { ocode: '$origin.code', dcode: '$destination.code', stype: '$serviceType' },
                pipeline: [
                    { $match: {
                        $expr: { $and: [
                            { $eq: [ "$scenarioId", 1 ] },
                            { $eq: [ "$origin.code", "$$ocode" ] },
                            { $eq: [ "$destination.code", "$$dcode" ] },
                            { $eq: [ "$serviceType", "$$stype" ] }
                        ] }
                    } },
                    { $project: { _id: 0, comp: { compavgmiles: '$currentOutput.avgMiles' } } },
                    { $replaceRoot: { newRoot: "$comp" } }
                ],
                as: "compoutputs"
            } }
            // ($replaceRoot/$mergeObjects stage, same as in the previous query,
            // commented out here while testing)
        ],
        output2: [
            { $match: { $and: [ { 'scenarioId': 1 } ] } },
            { $lookup: {
                from: 'servicestats',
                let: { ocode: '$origin.code', dcode: '$destination.code', stype: '$serviceType' },
                pipeline: [
                    { $match: {
                        $expr: { $and: [
                            { $eq: [ "$scenarioId", 0 ] },
                            { $eq: [ "$origin.code", "$$ocode" ] },
                            { $eq: [ "$destination.code", "$$dcode" ] },
                            { $eq: [ "$serviceType", "$$stype" ] }
                        ] }
                    } },
                    { $project: { _id: 0, comp: { compavgmiles: '$currentOutput.avgMiles' } } },
                    { $replaceRoot: { newRoot: "$comp" } }
                ],
                as: "compoutputs"
            } },
            // ($replaceRoot/$mergeObjects stage commented out here while testing)
            { $match: { 'compoutputs': { $eq: [] } } }
        ]
    } }
    //, { $limit: 100 }
])
But the $facet performance is very bad. Any further ideas to improve this are most welcome.

In general, there are three things that can cause slow queries:
The query is not indexed, cannot use indexes efficiently, or the schema design is not optimal (e.g. highly nested arrays or subdocuments), meaning MongoDB must do extra work to arrive at the relevant data.
The query is waiting for something slow (e.g. fetching data from disk, writing data to disk).
Underprovisioned hardware.
In terms of your query, there are some general observations regarding query performance:
Using allowDiskUse in an aggregation pipeline means that it is possible the query will use disk for some of its stages. Disk is frequently the slowest part of a machine, so if you can avoid this, it will speed up the query.
Note that an aggregation query is limited to 100MB memory use. This is irrespective of the amount of memory you have.
The $group stage cannot use indexes, because an index is tied to a document's location on disk. Once the aggregation pipeline enters a stage where the document's physical location is irrelevant (e.g. the $group stage), an index cannot be used anymore.
By default, the WiredTiger cache is ~50% of RAM, so a 64GB machine would have a ~32GB WiredTiger cache. If you find that the query is very slow, it is possible that MongoDB needed to go to disk to fetch the relevant documents. Monitoring iostats and checking disk utilization % during the query would provide hints toward whether enough RAM is provisioned.
Some possible solutions are:
Provision more RAM so that MongoDB doesn't have to go to disk very often.
Rework the schema design to avoid heavily nested fields, or multiple arrays in the document.
Tailor the document schema to make it easier for you to query the data, instead of tailoring the schema to how you think the data should be stored (e.g. avoid the heavy normalization inherent in relational database design).
If you find that you're hitting the performance limit of a single machine, consider sharding to horizontally scale the query. However, please note that sharding is a solution that would require careful design and consideration.
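On the indexing point above, a compound index matching the initial $match and $sort of the pipeline is one concrete thing to test. A minimal sketch with the C# driver (the database name is an assumption, and whether the planner actually uses the index should be confirmed with explain()):

using MongoDB.Bson;
using MongoDB.Driver;

var collection = new MongoClient()
    .GetDatabase("test")                        // assumed database name
    .GetCollection<BsonDocument>("servicestats");

// Equality match on scenarioId first, then the three sort keys in sort order.
var keys = Builders<BsonDocument>.IndexKeys
    .Ascending("scenarioId")
    .Ascending("origin.code")
    .Ascending("destination.code")
    .Ascending("serviceType");

collection.Indexes.CreateOne(new CreateIndexModel<BsonDocument>(keys));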

You say above that you'd like to group by scenarioId, which, however, you don't actually do. But that is probably what you should be doing to avoid all the switch statements. Something like this might get you going:
db.servicestats.aggregate([{
    $match: {
        scenarioId: { $in: [ 0, 1 ] }
    }
}, {
    $sort: { // not sure if that stage even helps - try to run with and without
        'origin.code': 1,
        'destination.code': 1,
        serviceType: 1
    }
}, {
    $group: { // first group by scenarioId AND the other fields
        _id: {
            scenarioId: '$scenarioId',
            originCode: '$origin.code',
            destinationCode: '$destination.code',
            serviceType: '$serviceType'
        },
        avgMiles: { $max: '$currentOutput.avgMiles' } // no switches needed
    }
}, {
    $group: { // then group by the other fields only, without scenarioId
        _id: {
            originCode: '$_id.originCode',
            destinationCode: '$_id.destinationCode',
            serviceType: '$_id.serviceType'
        },
        baseScenarioAvgMiles: {
            $max: {
                $cond: {
                    if: { $eq: [ '$_id.scenarioId', 1 ] },
                    then: '$avgMiles',
                    else: 0
                }
            }
        },
        compareScenarioAvgMiles: {
            $max: {
                $cond: {
                    if: { $eq: [ '$_id.scenarioId', 0 ] },
                    then: '$avgMiles',
                    else: 0
                }
            }
        }
    }
}, {
    $addFields: { // compute the difference
        diff: {
            $subtract: [ '$baseScenarioAvgMiles', '$compareScenarioAvgMiles' ]
        }
    }
}, {
    $match: { // "diff" is a top-level field at this point, not nested under avgMiles
        diff: { $eq: 0.5 }
    }
}, {
    $limit: 100
}], { allowDiskUse: true })
Beyond that I would suggest you use the power of db.collection.explain().aggregate(...) to find the right indexing and tune your query.
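From C#, the equivalent of that shell call is to wrap the aggregate command in the database-level explain command. A minimal sketch (the database name is an assumption; the pipeline array is a placeholder for the stages above):

using MongoDB.Bson;
using MongoDB.Driver;

var db = new MongoClient().GetDatabase("test");   // assumed database name

var explainCommand = new BsonDocument
{
    { "explain", new BsonDocument
        {
            { "aggregate", "servicestats" },
            { "pipeline", new BsonArray { /* the stages above as BsonDocuments */ } },
            { "cursor", new BsonDocument() }
        }
    },
    { "verbosity", "executionStats" }
};

// The result contains the winning plan and per-stage execution statistics.
BsonDocument plan = db.RunCommand<BsonDocument>(explainCommand);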

Related

C# MongoDB query: filter based on the last item of array

I have a MongoDB collection with documents like these:
{
    _id: "abc",
    history: [
        { status: 1, reason: "confirmed" },
        { status: 2, reason: "accepted" }
    ]
}
{
    _id: "xyz",
    history: [
        { status: 2, reason: "accepted" },
        { status: 10, reason: "cancelled" }
    ]
}
I want to write a query in C# to return the documents whose last history item has status 2 (accepted). So in my result I should not see "xyz", because its state has changed from 2, but I should see "abc", since its last status is 2. The problem is that getting the last item of an array is not easy with MongoDB's C# driver, or at least I don't know how.
I tried LINQ's LastOrDefault but got a System.InvalidOperationException: {document}{History}.LastOrDefault().Status is not supported error.
I know there is a workaround to fetch the documents first (load them into memory) and then filter, but that is client-side and slow (it consumes a lot of network bandwidth). I want to do the filtering on the server.
Option 1) Find() -> expected to be faster
db.collection.find({
    $expr: {
        $eq: [
            { $arrayElemAt: [ "$history.status", -1 ] },
            2
        ]
    }
})
Option 2) Aggregation
db.collection.aggregate([
    {
        $addFields: {
            last: { $arrayElemAt: [ "$history", -1 ] }
        }
    },
    {
        $match: { "last.status": 2 }
    },
    {
        $project: { "history": 1 }
    }
])
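Since the question asks for C#, note that Option 1 can be sent through the driver as a raw BsonDocument filter (the LINQ provider cannot translate LastOrDefault(), but a raw $expr passes through unchanged). A sketch, assuming collection is the IMongoCollection<Process> from the question:

using MongoDB.Bson;
using MongoDB.Driver;

// $expr comparing the last element of history.status with 2.
var filter = new BsonDocument("$expr",
    new BsonDocument("$eq", new BsonArray
    {
        new BsonDocument("$arrayElemAt", new BsonArray { "$history.status", -1 }),
        2
    }));

var accepted = collection.Find(filter).ToList();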
I found a workaround: override the history array with the last history document, then apply the filter as if there were no array. This is possible through the $addFields aggregation stage.
PipelineDefinition<Process, BsonDocument> pipeline = new BsonDocument[]
{
    new BsonDocument("$addFields",
        new BsonDocument("history",
            new BsonDocument("$slice",
                new BsonArray { "$history", -1 }))),
    new BsonDocument("$match",
        new BsonDocument
        {
            { "history.status", 2 }
        })
};
var result = collection.Aggregate(pipeline).ToList();
result will contain the documents whose last history status is 2.

MongoDB C# 2.0 upserting sub item in collection [duplicate]

I have documents that look something like this, with a unique index on bars.name:
{ name: 'foo', bars: [ { name: 'qux', somefield: 1 } ] }
I want to either update the sub-document where { name: 'foo', 'bars.name': 'qux' } and $set: { 'bars.$.somefield': 2 }, or create a new sub-document with { name: 'qux', somefield: 2 } under { name: 'foo' }.
Is it possible to do this using a single query with upsert, or will I have to issue two separate ones?
Related: 'upsert' in an embedded document (suggests changing the schema so the sub-document identifier is the key, but that answer is from two years ago and I'm wondering if there are better solutions now).
No, there isn't really a better solution to this, so perhaps it deserves an explanation.
Suppose you have a document in place that has the structure as you show:
{
    "name": "foo",
    "bars": [{
        "name": "qux",
        "somefield": 1
    }]
}
If you do an update like this:
db.foo.update(
    { "name": "foo", "bars.name": "qux" },
    { "$set": { "bars.$.somefield": 2 } },
    { "upsert": true }
)
Then all is fine, because a matching document was found. But if you change the value of "bars.name":
db.foo.update(
    { "name": "foo", "bars.name": "xyz" },
    { "$set": { "bars.$.somefield": 2 } },
    { "upsert": true }
)
Then you will get a failure. The only thing that has really changed here is that in MongoDB 2.6 and above the error is a little more succinct:
WriteResult({
    "nMatched" : 0,
    "nUpserted" : 0,
    "nModified" : 0,
    "writeError" : {
        "code" : 16836,
        "errmsg" : "The positional operator did not find the match needed from the query. Unexpanded update: bars.$.somefield"
    }
})
That is better in some ways, but you really do not want to "upsert" anyway. What you want to do is add the element to the array where the "name" does not currently exist.
So what you really want is the "result" from the update attempt without the "upsert" flag to see if any documents were affected:
db.foo.update(
    { "name": "foo", "bars.name": "xyz" },
    { "$set": { "bars.$.somefield": 2 } }
)
Yielding in response:
WriteResult({ "nMatched" : 0, "nUpserted" : 0, "nModified" : 0 })
So when the count of modified documents is 0, you know you need to issue the following update:
db.foo.update(
    { "name": "foo" },
    { "$push": { "bars": {
        "name": "xyz",
        "somefield": 2
    } } }
)
There really is no other way to do exactly what you want. Since the additions to the array are not strictly a "set" type of operation, you cannot use $addToSet combined with the "bulk update" functionality to "cascade" your update requests.
In this case it seems like you need to check the result, or otherwise accept reading the whole document and checking whether to update or insert a new array element in code.
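In C# (driver 2.x), that check-the-result flow might look like the following sketch, assuming an IMongoCollection<BsonDocument> named collection:

using MongoDB.Bson;
using MongoDB.Driver;

// Try the positional update first.
var result = collection.UpdateOne(
    Builders<BsonDocument>.Filter.Eq("name", "foo") &
    Builders<BsonDocument>.Filter.Eq("bars.name", "xyz"),
    Builders<BsonDocument>.Update.Set("bars.$.somefield", 2));

// Nothing matched, so the array element does not exist yet: push it.
if (result.MatchedCount == 0)
{
    collection.UpdateOne(
        Builders<BsonDocument>.Filter.Eq("name", "foo"),
        Builders<BsonDocument>.Update.Push("bars",
            new BsonDocument { { "name", "xyz" }, { "somefield", 2 } }));
}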
If you don't mind changing the schema a bit and having a structure like so:
{
    "name": "foo",
    "bars": {
        "qux": { "somefield": 1 },
        "xyz": { "somefield": 2 }
    }
}
You can perform your operations in one go.
Reiterating 'upsert' in an embedded document for completeness
I was digging for the same feature, and found that in version 4.2 or above, MongoDB provides a new feature called Update with aggregation pipeline.
This feature, if used with some other techniques, makes it possible to achieve an upsert of a subdocument with a single query.
It's a very verbose query, but I believe if you know that you won't have too many records in the sub-collection, it's viable. Here's an example of how to achieve this:
const documentQuery = { _id: '123' }
const subDocumentToUpsert = { name: 'xyz', id: '1' }

collection.update(documentQuery, [
    {
        $set: {
            sub_documents: {
                $cond: {
                    if: { $not: ['$sub_documents'] },
                    then: [subDocumentToUpsert],
                    else: {
                        $cond: {
                            if: { $in: [subDocumentToUpsert.id, '$sub_documents.id'] },
                            then: {
                                $map: {
                                    input: '$sub_documents',
                                    as: 'sub_document',
                                    in: {
                                        $cond: {
                                            if: { $eq: ['$$sub_document.id', subDocumentToUpsert.id] },
                                            then: subDocumentToUpsert,
                                            else: '$$sub_document',
                                        },
                                    },
                                },
                            },
                            else: { $concatArrays: ['$sub_documents', [subDocumentToUpsert]] },
                        },
                    },
                },
            },
        },
    },
])
There's a way to do it in two queries - but it will still work in a bulkWrite.
This is relevant because in my case not being able to batch it is the biggest hangup. With this solution, you don't need to collect the result of the first query, which allows you to do bulk operations if you need to.
Here are the two successive queries to run for your example:
// Update subdocument if existing
collection.updateMany({
    name: 'foo', 'bars.name': 'qux'
}, {
    $set: {
        'bars.$.somefield': 2
    }
})

// Insert subdocument otherwise
// (note: $ne rather than a top-level $not, which is not a valid query operator)
collection.updateMany({
    name: 'foo', 'bars.name': { $ne: 'qux' }
}, {
    $push: {
        bars: {
            somefield: 2, name: 'qux'
        }
    }
})
This also has the added benefit of not having corrupted data / race conditions if multiple applications are writing to the database concurrently. You won't risk ending up with two bars: {somefield: 2, name: 'qux'} subdocuments in your document if two applications run the same queries at the same time.
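For reference, here is how those two updates might be batched into a single BulkWrite with the C# driver, a sketch assuming an IMongoCollection<BsonDocument> named collection:

using MongoDB.Bson;
using MongoDB.Driver;

var models = new WriteModel<BsonDocument>[]
{
    // 1) Update the subdocument if it already exists.
    new UpdateManyModel<BsonDocument>(
        new BsonDocument { { "name", "foo" }, { "bars.name", "qux" } },
        new BsonDocument("$set", new BsonDocument("bars.$.somefield", 2))),

    // 2) Otherwise insert it.
    new UpdateManyModel<BsonDocument>(
        new BsonDocument { { "name", "foo" },
                           { "bars.name", new BsonDocument("$ne", "qux") } },
        new BsonDocument("$push", new BsonDocument("bars",
            new BsonDocument { { "somefield", 2 }, { "name", "qux" } })))
};

collection.BulkWrite(models);  // both operations go to the server in one batch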

How to merge the results of two queries to different indices in Elasticsearch?

I'm searching an index main-kittens for docs of type Kitty. Now, I want to run an experiment: for some of the users, I want to search experiment-kittens instead. The type is the same (Kitty), and all the fields have the same values as in the main index, but while the field Bio is always empty in the main index, in the experimental one it stores huge strings.
Now, the problem is that I can't store that Bio for all kittens due to memory/disk limitations. So experiment-kittens holds only the most recent kittens (say, from the last month).
I want the search to be left intact for most users (i.e., always use the main index). For the picked ones, I want to merge the results. The logic should be:
search userquery + date_created within the last month in experiment-kittens
search userquery + date_created more than a month ago in main-kittens
The results should be sorted by date_created, and there are too many of them to sort them in my app.
Is there a way to ask elastic to execute two different queries on two indices and merge the results?
(I'm also sure there could be more optimal solutions to the problem, please tell me if you have some).
You can search across multiple indices with a single Elasticsearch request by separating the index names with a comma. Then you can use the missing filter to differentiate between the two indices (one having Bio field and the other not). Then you can use the range filter to filter based on the value of date_created field. Finally you can use the sort API to sort based on the values of date_created field.
Putting all of this together, the Elasticsearch query that you need is as follows:
POST main-kittens,experiment-kittens/Kitty/_search
{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "bool": {
          "should": [
            {
              "bool": {
                "must": [
                  {
                    "missing": {
                      "field": "Bio"
                    }
                  },
                  {
                    "range": {
                      "date_created": {
                        "to": "now-1M"
                      }
                    }
                  }
                ]
              }
            },
            {
              "bool": {
                "must_not": [
                  {
                    "missing": {
                      "field": "Bio"
                    }
                  }
                ],
                "must": [
                  {
                    "range": {
                      "date_created": {
                        "from": "now-1M"
                      }
                    }
                  }
                ]
              }
            }
          ]
        }
      }
    }
  },
  "sort": [
    {
      "date_created": {
        "order": "desc"
      }
    }
  ]
}
You can replace "match_all": {} with any custom query that you may have.

ElasticSearch C# client (NEST): access nested aggregation with Spaces

Assuming my 2 values are "Red Square" and "Green circle", when I run the aggregation using Elasticsearch I get 4 values instead of 2, split on spaces: Red, Square, Green, circle.
Is there a way to get the 2 original values? The code is below:
var result = this.client.Search<MyClass>(s => s
    .Size(int.MaxValue)
    .Aggregations(a => a
        .Terms("field1", t => t.Field(k => k.MyField))
    )
);

var agBucket = (Bucket)result.Aggregations["field1"];
var myAgg = result.Aggs.Terms("field1");
IList<KeyItem> list = myAgg.Items;
foreach (KeyItem i in list)
{
    string data = i.Key;
}
In your mapping, you need to set the field1 string as not_analyzed, like this:
{
  "your_type": {
    "properties": {
      "field1": {
        "type": "string",
        "index": "not_analyzed"
      }
    }
  }
}
You can also make field1 a multi-field and make it both analyzed and not_analyzed to get the best of both worlds (i.e. text matching on the analyzed field + aggregation on the exact value of the not_analyzed raw sub-field).
{
  "your_type": {
    "properties": {
      "field1": {
        "type": "string",
        "fields": {
          "raw": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}
If you choose this second option, you'll need to run your aggregation on field1.raw instead of field1.
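On the NEST side (using the same 1.x-era syntax as the question's snippet), the aggregation would then simply target the raw sub-field. A sketch:

// Aggregate on the not_analyzed sub-field instead of the analyzed one.
var result = this.client.Search<MyClass>(s => s
    .Size(0)  // only the aggregation is needed, not the hits
    .Aggregations(a => a
        .Terms("field1", t => t.Field("field1.raw"))
    )
);

var myAgg = result.Aggs.Terms("field1");
foreach (var item in myAgg.Items)
{
    string data = item.Key;  // "Red Square", "Green circle"
}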

Getting distinct values using NEST ElasticSearch client

I'm building a product search engine with Elasticsearch in my .NET application using the NEST client, and there is one thing I'm having trouble with: getting a distinct set of values.
I'm searching for products, of which there are many thousands, but of course I can only return 10 or 20 at a time to the user. For this, paging works fine. But besides this primary result, I want to show my users a list of brands found within the complete result set, to present these for filtering.
I have read that I should use a Terms Aggregation for this, but I couldn't get anything better than the code below, and it still doesn't give me what I want, because it splits values like "20th Century Fox" into 3 separate values.
var brandResults = client.Search<Product>(s => s
    .Query(query)
    .Aggregations(a => a
        .Terms("my_terms_agg", t => t.Field(p => p.BrandName).Size(250))
    )
);

var agg = brandResults.Aggs.Terms("my_terms_agg");
Is this even the right approach? Or should I use something totally different? And how can I get the correct, complete values? (Not split by space, but I guess that is what you get when you ask for a list of 'Terms'?)
What I'm looking for is what you would get if you did this in MS SQL:
SELECT DISTINCT BrandName FROM [Table To Search] WHERE [Where clause without paging]
You are correct that what you want is a terms aggregation. The problem you're running into is that ES is splitting the field "BrandName" in the results it is returning. This is the expected default behavior of a field in ES.
What I recommend is that you change BrandName into a "Multifield", this will allow you to search on all the various parts, as well as doing a terms aggregation on the "Not Analyzed" (aka full "20th Century Fox") term.
Here is the documentation from ES.
https://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/mapping-multi-field-type.html
[UPDATE]
If you are using ES version 1.4 or newer the syntax for multi-fields is a little different now.
https://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_multi_fields.html#_multi_fields
Here is a full working sample that illustrates the point in ES 1.4.4. Note the mapping specifies a "not_analyzed" version of the field.
PUT hilden1

PUT hilden1/type1/_mapping
{
  "properties": {
    "brandName": {
      "type": "string",
      "fields": {
        "raw": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}

POST hilden1/type1
{ "brandName": "foo" }

POST hilden1/type1
{ "brandName": "bar" }

POST hilden1/type1
{ "brandName": "20th Century Fox" }

POST hilden1/type1
{ "brandName": "20th Century Fox" }

POST hilden1/type1
{ "brandName": "foo bar" }

GET hilden1/type1/_search
{
  "size": 0,
  "aggs": {
    "analyzed_field": {
      "terms": {
        "field": "brandName",
        "size": 10
      }
    },
    "non_analyzed_field": {
      "terms": {
        "field": "brandName.raw",
        "size": 10
      }
    }
  }
}
Results of the last query:
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "non_analyzed_field": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        { "key": "20th Century Fox", "doc_count": 2 },
        { "key": "bar", "doc_count": 1 },
        { "key": "foo", "doc_count": 1 },
        { "key": "foo bar", "doc_count": 1 }
      ]
    },
    "analyzed_field": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        { "key": "20th", "doc_count": 2 },
        { "key": "bar", "doc_count": 2 },
        { "key": "century", "doc_count": 2 },
        { "key": "foo", "doc_count": 2 },
        { "key": "fox", "doc_count": 2 }
      ]
    }
  }
}
Notice that the not-analyzed field keeps "20th Century Fox" and "foo bar" together, whereas the analyzed field breaks them up.
I had a similar issue. I was displaying search results and wanted to show counts on the category and sub category.
You're right to use aggregations. I also had the issue with the strings being tokenised (i.e. 20th century fox being split) - this happens because the fields are analysed. For me, I added the following mappings (i.e. tell ES not to analyse that field):
"category": {
"type": "nested",
"properties": {
"CategoryNameAndSlug": {
"type": "string",
"index": "not_analyzed"
},
"SubCategoryNameAndSlug": {
"type": "string",
"index": "not_analyzed"
}
}
}
As jhilden suggested, if you use this field for more than one purpose (e.g. search and aggregation) you can set it up as a multi-field: on one hand it gets analysed and used for searching, and on the other it stays not-analysed for aggregation.
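Putting this together with NEST (1.x-era syntax, matching the question's code), the distinct values and their counts can then be read from the raw sub-field's buckets. A sketch:

var brandResults = client.Search<Product>(s => s
    .Query(query)
    .Size(0)  // hits are not needed for the brand list
    .Aggregations(a => a
        .Terms("my_terms_agg", t => t.Field("brandName.raw").Size(250))
    )
);

var agg = brandResults.Aggs.Terms("my_terms_agg");
foreach (var bucket in agg.Items)
{
    // bucket.Key is the full brand name, e.g. "20th Century Fox"
    var brand = bucket.Key;
    var count = bucket.DocCount;
}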
