I have documents that look something like this, with a unique index on bars.name:
{ name: 'foo', bars: [ { name: 'qux', somefield: 1 } ] }
I want to either update the sub-document matched by { name: 'foo', 'bars.name': 'qux' } with $set: { 'bars.$.somefield': 2 }, or create a new sub-document { name: 'qux', somefield: 2 } under { name: 'foo' }.
Is it possible to do this using a single query with upsert, or will I have to issue two separate ones?
Related: 'upsert' in an embedded document (suggests changing the schema so the sub-document identifier is the key, but that is from two years ago and I'm wondering if there are better solutions now.)
No, there isn't really a better solution to this, so perhaps an explanation will help.
Suppose you have a document in place with the structure you show:
{
"name": "foo",
"bars": [{
"name": "qux",
"somefield": 1
}]
}
If you do an update like this
db.foo.update(
{ "name": "foo", "bars.name": "qux" },
{ "$set": { "bars.$.somefield": 2 } },
{ "upsert": true }
)
Then all is fine, because a matching document was found. But if you change the value of "bars.name":
db.foo.update(
{ "name": "foo", "bars.name": "xyz" },
{ "$set": { "bars.$.somefield": 2 } },
{ "upsert": true }
)
Then you will get a failure. The only thing that has really changed here is that in MongoDB 2.6 and above the error is a little more succinct:
WriteResult({
"nMatched" : 0,
"nUpserted" : 0,
"nModified" : 0,
"writeError" : {
"code" : 16836,
"errmsg" : "The positional operator did not find the match needed from the query. Unexpanded update: bars.$.somefield"
}
})
That is better in some ways, but you really do not want to "upsert" anyway. What you want to do is add the element to the array where the "name" does not currently exist.
So what you really want is the "result" from the update attempt without the "upsert" flag to see if any documents were affected:
db.foo.update(
{ "name": "foo", "bars.name": "xyz" },
{ "$set": { "bars.$.somefield": 2 } }
)
Yielding in response:
WriteResult({ "nMatched" : 0, "nUpserted" : 0, "nModified" : 0 })
So when the count of modified documents is 0, you know you need to issue the following update:
db.foo.update(
{ "name": "foo" },
{ "$push": { "bars": {
"name": "xyz",
"somefield": 2
}}}
)
There really is no other way to do exactly what you want. Since the additions to the array are not strictly a "set" type of operation, you cannot use $addToSet combined with the "bulk update" functionality there to "cascade" your update requests.
In this case it seems like you need to check the result, or otherwise accept reading the whole document and deciding in code whether to update or insert a new array element.
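For example, a minimal shell sketch of that check-then-push flow (collection and field names are taken from the example above; the exact wiring is just one way to do it):
var res = db.foo.update(
    { "name": "foo", "bars.name": "xyz" },
    { "$set": { "bars.$.somefield": 2 } }
);
// nothing matched, so the array element does not exist yet: push it instead
if (res.nMatched === 0) {
    db.foo.update(
        { "name": "foo" },
        { "$push": { "bars": { "name": "xyz", "somefield": 2 } } }
    );
}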
If you don't mind changing the schema a bit and having a structure like this:
{ "name": "foo", "bars": { "qux": { "somefield": 1 },
"xyz": { "somefield": 2 },
}
}
You can perform your operations in one go.
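For instance, with that key-based layout a single update covers both cases; a sketch, assuming the same collection name foo as above:
// sets (or creates) the "xyz" entry whether or not it exists yet
db.foo.update(
    { "name": "foo" },
    { "$set": { "bars.xyz.somefield": 2 } },
    { "upsert": true }
)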
Reiterating 'upsert' in an embedded document for completeness
I was digging for the same feature, and found that in version 4.2 and above, MongoDB provides a new feature called Update with Aggregation Pipeline.
This feature, combined with some other techniques, makes it possible to achieve a sub-document upsert with a single query.
It's a very verbose query, but I believe it's viable if you know you won't have too many records in the sub-collection. Here's an example of how to achieve this:
const documentQuery = { _id: '123' }
const subDocumentToUpsert = { name: 'xyz', id: '1' }
collection.update(documentQuery, [
{
$set: {
sub_documents: {
$cond: {
if: { $not: ['$sub_documents'] },
then: [subDocumentToUpsert],
else: {
$cond: {
if: { $in: [subDocumentToUpsert.id, '$sub_documents.id'] },
then: {
$map: {
input: '$sub_documents',
as: 'sub_document',
in: {
$cond: {
if: { $eq: ['$$sub_document.id', subDocumentToUpsert.id] },
then: subDocumentToUpsert,
else: '$$sub_document',
},
},
},
},
else: { $concatArrays: ['$sub_documents', [subDocumentToUpsert]] },
},
},
},
},
},
},
])
There's a way to do it in two queries - but it will still work in a bulkWrite.
This is relevant because in my case not being able to batch it is the biggest hangup. With this solution, you don't need to collect the result of the first query, which allows you to do bulk operations if you need to.
Here are the two successive queries to run for your example:
// Update subdocument if existing
collection.updateMany({
name: 'foo', 'bars.name': 'qux'
}, {
$set: {
'bars.$.somefield': 2
}
})
// Insert subdocument otherwise
collection.updateMany({
name: 'foo', 'bars.name': { $ne: 'qux' }
}, {
$push: {
bars: {
somefield: 2, name: 'qux'
}
}
})
This also has the added benefit of not having corrupted data / race conditions if multiple applications are writing to the database concurrently. You won't risk ending up with two bars: {somefield: 2, name: 'qux'} subdocuments in your document if two applications run the same queries at the same time.
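Since the whole point is batching, the same pair of operations can also be packed into a single bulkWrite call. This is a sketch in Node.js driver style, assuming collection is a driver Collection object as in the snippets above:
collection.bulkWrite([
    // update the sub-document if it already exists
    { updateMany: {
        filter: { name: 'foo', 'bars.name': 'qux' },
        update: { $set: { 'bars.$.somefield': 2 } }
    } },
    // otherwise push a new sub-document
    { updateMany: {
        filter: { name: 'foo', 'bars.name': { $ne: 'qux' } },
        update: { $push: { bars: { name: 'qux', somefield: 2 } } }
    } }
], { ordered: true })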
Related
I have a MongoDB collection like this:
{
    _id: "abc",
    history: [
        {
            status: 1,
            reason: "confirmed"
        },
        {
            status: 2,
            reason: "accepted"
        }
    ]
},
{
    _id: "xyz",
    history: [
        {
            status: 2,
            reason: "accepted"
        },
        {
            status: 10,
            reason: "cancelled"
        }
    ]
}
I want to write a query in C# to return the documents whose last history item has status 2 (accepted). So in my result I should not see "xyz", because its status has changed from 2, but I should see "abc", since its last status is 2. The problem is that getting the last array item is not easy with MongoDB's C# driver - or I don't know how.
I tried LINQ's LastOrDefault but got a System.InvalidOperationException: {document}{History}.LastOrDefault().Status is not supported error.
I know there is a workaround: get the documents first (load them into memory) and then filter, but that is client-side and slow (it consumes a lot of network bandwidth). I want to do the filtering on the server.
Option 1) Find() -> expected to be faster
db.collection.find({
$expr: {
$eq: [
{
$arrayElemAt: [
"$history.status",
-1
]
},
2
]
}
})
Option 2) Aggregation
db.collection.aggregate([
{
"$addFields": {
last: {
$arrayElemAt: [
"$history",
-1
]
}
}
},
{
$match: {
"last.status": 2
}
},
{
$project: {
"history": 1
}
}
])
I found a workaround: override the history array with just its last item, then apply the filter as if there were no array. This is possible through the $addFields aggregation stage.
PipelineDefinition<Process, BsonDocument> pipeline = new BsonDocument[]
{
new BsonDocument("$addFields",
new BsonDocument("history",
new BsonDocument ( "$slice",
new BsonArray { "$history", -1 }
)
)
),
new BsonDocument("$match",
new BsonDocument
{
{ "history.status", 2 }
}
)
};
var result = collection.Aggregate(pipeline).ToList();
result will contain the documents whose last history status is 2.
I need to update several thousand items every few minutes in Elasticsearch, and unfortunately reindexing is not an option for me. From my research, the best way to update an item is using _update_by_query - I have had success updating single documents like so:
{
"query": {
"match": {
"itemId": {
"query": "1"
}
}
},
"script": {
"source": "ctx._source.field = params.updateValue",
"lang": "painless",
"params": {
"updateValue": "test",
}
}
}
var response = await Client.UpdateByQueryAsync<dynamic>(q => q
.Index("masterproducts")
.Query(qd => x.MatchQuery)
.Script(s => s.Source(x.Script).Lang("painless").Params(x.Params))
.Conflicts(Elasticsearch.Net.Conflicts.Proceed)
);
Although this works, it is extremely inefficient as it generates thousands of requests - is there a way I can update multiple documents, each matched by its ID, in a single request? I have already tried the Multi Search API, which it seems cannot be used for this purpose. Any help would be appreciated!
If possible, try to generalize your query.
Instead of targeting a single itemId, perhaps try using a terms query:
{
"query": {
"terms": {
"itemId": [
"1", "2", ...
]
}
},
"script": {
...
}
}
From the looks of it, your (seemingly simplified) script sets the same value regardless of the document ID / itemId. So that's that.
If the script does indeed set different values based on the doc IDs / itemIds, you could make the params multi-value:
"params": {
"updateValue1": "test1",
"updateValue2": "test2",
...
}
and then dynamically access them:
...
def value_to_set = params['updateValue' + ctx._source['itemId']];
...
so the target doc is updated with the corresponding value.
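Putting the two together, a single _update_by_query request might look like the sketch below (the index name masterproducts comes from your snippet; the assumption that itemId is stored as a string is mine):
POST masterproducts/_update_by_query?conflicts=proceed
{
  "query": {
    "terms": { "itemId": [ "1", "2", "3" ] }
  },
  "script": {
    "lang": "painless",
    "source": "ctx._source.field = params['updateValue' + ctx._source.itemId]",
    "params": {
      "updateValue1": "test1",
      "updateValue2": "test2",
      "updateValue3": "test3"
    }
  }
}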
I'm trying to do some analytics on my MongoDB collection from my .NET Core project (C# driver).
My problem is with aggregating and grouping by multiple fields, including a nested array element field.
For starters, here's a document example, simplified as much as I can:
{
"_id": ".....",
"CampaignId": 1,
"IsTest":false
"Events": [
{
"EventId": 1,
"IsFake": false
},
{
"EventId": 1,
"IsFake": true
},
{
"EventId": 2,
"IsFake": false
}
]
}
My end goal is to generate an analytics report that will look like this, for example:
[
{
"CampaignId": 1,
"DocumentCountReal":17824,
"DocumentCountTest":321,
"EventCountReal":100,
"EventCountFake":5,
"Events": [
{
"EventId": 1,
"IsFake": false,
"Count": 50
},
{
"EventId": 1,
"IsFake": true,
"Count": 5
},
{
"EventId": 2,
"IsFake": false,
"Count": 50
}
]
},
{
"CampaignId": 2,
"DocumentCountReal":1314,
"DocumentCountTest":57,
"EventCountReal":50,
"EventCountFake":0,
"Events": [
{
"EventId": 1,
"IsFake": false,
"Count": 25
},
{
"EventId": 2,
"IsFake": false,
"Count": 25
}
]
}
]
Just to show you where I currently stand, I found out how to group by one field lol...
Example -
var result = collection.Aggregate().Group(c => c.CampaignId, r => new { CampaignId= r.Key, Count = r.Count()}).ToList();
I couldn't find how to group by a nested array element field (in the example, the IsFake property of an Event), or in general how to build a result like the one I shared above.
To my surprise I couldn't find many related questions on Google (specifically for the C# driver).
Thanks a lot for reading
So, I found a partial solution eventually and thought I'd share it.
The biggest problem was grouping by a property of a nested array (in my initial question, the Events array).
In order to achieve that I used Unwind (https://docs.mongodb.com/manual/reference/operator/aggregation/unwind/)
FYI - I needed to use the PreserveNullAndEmptyArrays option in order to not miss documents that had no event.
Also, I made an extra class where the Events property is a single Event instead of a list; it is used to deserialize the documents after Unwind is applied.
My second question was how to perform several aggregations in one trip to the database; as Valijon commented, it looks like $facet is required, but I couldn't find an example that fits my question.
Example of my solution -
IMongoDatabase db = MongoDBDataLayer.NewConnection();
IMongoCollection<Click> collection = db.GetCollection<Click>("click");
var unwindOption = new AggregateUnwindOptions<ClickUnwinded>() { PreserveNullAndEmptyArrays = true };
var result = collection.Aggregate().Unwind<Click, ClickUnwinded>(c => c.Events, unwindOption)
.Group(
//group by
c => new GroupByClass()
{
CampaignId = c.CampaignId,
IsTest = c.IsTest,
EventId = c.Events.EventId,
IsRobot = c.UserAgent.IsRobot
},
r => new GroupResultClass()
{
CampaignId = r.First().CampaignId,
IsTest = r.First().IsTest,
EventId = r.First().Events.EventId,
IsRobot = r.First().UserAgent.IsRobot,
Count = r.Count()
}
).ToList();
If anyone has an example of how to utilize $facet using the C# driver, that would be great.
edit -
So, the problem is that the approach above will count the same document multiple times if the document has multiple events... so I either have to find a way to run multiple aggregations (the first aggregation must happen before the unwind) or make two trips to the db.
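At the shell level, the $facet idea could look roughly like the sketch below: document-level counts are grouped before any unwind and event-level counts after it, all in one round trip (collection and field names are taken from the question; translating this to the C# driver is left open):
db.click.aggregate([
    { $facet: {
        // counts per campaign, grouped before unwinding so each document is counted once
        documentCounts: [
            { $group: { _id: { campaignId: "$CampaignId", isTest: "$IsTest" }, count: { $sum: 1 } } }
        ],
        // counts per event, grouped after unwinding the Events array
        eventCounts: [
            { $unwind: { path: "$Events", preserveNullAndEmptyArrays: true } },
            { $group: { _id: { campaignId: "$CampaignId", eventId: "$Events.EventId", isFake: "$Events.IsFake" }, count: { $sum: 1 } } }
        ]
    } }
])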
I am pretty new to MongoDB and experimenting with it for one of our applications. We are trying to implement CQRS; for the query part we are using Node.js, and the command part we are implementing in C#.
One of my collections might have millions of documents in it. We would have a scenarioId field and each scenario can have around two million records.
Our use case is to compare the data of two scenarios and do some mathematical operations on each field of the scenarios.
For example, each scenario can have a property avgMiles, and I would like to compute the difference of this property; users should be able to filter on this difference value. As my design keeps both scenarios' data in a single collection, I am trying to group by scenario id and then project it.
A sample document structure looks like below.
{
"_id" : ObjectId("5ac05dc58ff6cd3054d5654c"),
"origin" : {
"code" : "0000",
},
"destination" : {
"code" : "0001",
},
"currentOutput" : {
"avgMiles" : 0.15093020854848138,
},
"scenarioId" : NumberInt(0),
"serviceType" : "ECON"
}
When I group, I group based on the origin.code, destination.code, and serviceType properties.
My aggregate pipeline query looks like this:
db.servicestats.aggregate([{$match:{$or:[{scenarioId:0}, {scenarioId:1}]}},
{$sort:{'origin.code':1,'destination.code':1,serviceType:1}},
{$group:{
_id:{originCode:'$origin.code',destinationCode:'$destination.code',serviceType:'$serviceType'},
baseScenarioId:{$sum:{$switch: {
branches: [
{
case: { $eq: [ '$scenarioId', 1] },
then: '$scenarioId'
}],
default: 0
}
}},
compareScenarioId:{$sum:{$switch: {
branches: [
{
case: { $eq: [ '$scenarioId', 0] },
then: '$scenarioId'
}],
default: 0
}
}},
baseavgMiles:{$max:{$switch: {
branches: [
{
case: { $eq: [ '$scenarioId', 1] },
then: '$currentOutput.avgMiles'
}],
default: null
}
}},
compareavgMiles:{$sum:{$switch: {
branches: [
{
case: { $eq: [ '$scenarioId', 0] },
then: '$currentOutput.avgMiles'
}],
default: null
}
}}
}
},
{$project:{scenarioId:
{ base:'$baseScenarioId',
compare:'$compareScenarioId'
},
avgMiles:{base:'$baseavgMiles', compare:'$compareavgMiles', diff:{$subtract:['$baseavgMiles','$compareavgMiles']}}
}
},
{$match:{'avgMiles.diff':{$eq:0.5}}},
{$limit:100}
],{allowDiskUse: true} )
My group pipeline stage would have 4 million documents going in it. Can you please suggest how I can improve the performance of this query?
I have an index on the fields used in my group-by condition, and I have added a sort pipeline stage to help the group stage perform better.
Any suggestions are most welcome.
As group by is not working in my case, I have implemented a left outer join using $lookup, and the query looks like below.
db.servicestats.aggregate([
{$match:{$and :[ {'scenarioId':0}
//,{'origin.code':'0000'},{'destination.code':'0001'}
]}},
//{$limit:1000000},
{$lookup: { from:'servicestats',
let: {ocode:'$origin.code',dcode:'$destination.code',stype:'$serviceType'},
pipeline:[
{$match: {
$expr: { $and:
[
{ $eq: [ "$scenarioId", 1 ] },
{ $eq: [ "$origin.code", "$$ocode" ] },
{ $eq: [ "$destination.code", "$$dcode" ] },
{ $eq: [ "$serviceType", "$$stype" ] },
]
}
}
},
{$project: {_id:0, comp :{compavgmiles :'$currentOutput.avgMiles'}}},
{ $replaceRoot: { newRoot: "$comp" } }
],
as : "compoutputs"
}},
{
$replaceRoot: {
newRoot: {
$mergeObjects:[
{
$arrayElemAt: [
"$$ROOT.compoutputs",
0
]
},
{
origin: "$$ROOT.origin",
destination: "$$ROOT.destination",
serviceType: "$$ROOT.serviceType",
baseavgmiles: "$$ROOT.currentOutput.avgMiles",
output: '$$ROOT'
}
]
}
}
},
{$limit:100}
])
The above query's performance is good and it returns in 70 ms.
But in my scenario I need a full outer join, which I understand MongoDB does not support as of now, so I implemented it using a $facet pipeline as below:
db.servicestats.aggregate([
{$limit:1000},
{$facet: {output1:[
{$match:{$and :[ {'scenarioId':0}
]}},
{$lookup: { from:'servicestats',
let: {ocode:'$origin.code',dcode:'$destination.code',stype:'$serviceType'},
pipeline:[
{$match: {
$expr: { $and:
[
{ $eq: [ "$scenarioId", 1 ] },
{ $eq: [ "$origin.code", "$$ocode" ] },
{ $eq: [ "$destination.code", "$$dcode" ] },
{ $eq: [ "$serviceType", "$$stype" ] },
]
}
}
},
{$project: {_id:0, comp :{compavgmiles :'$currentOutput.avgMiles'}}},
{ $replaceRoot: { newRoot: "$comp" } }
],
as : "compoutputs"
}},
//{
// $replaceRoot: {
// newRoot: {
// $mergeObjects:[
// {
// $arrayElemAt: [
// "$$ROOT.compoutputs",
// 0
// ]
// },
// {
// origin: "$$ROOT.origin",
// destination: "$$ROOT.destination",
// serviceType: "$$ROOT.serviceType",
// baseavgmiles: "$$ROOT.currentOutput.avgMiles",
// output: '$$ROOT'
// }
// ]
// }
// }
// }
],
output2:[
{$match:{$and :[ {'scenarioId':1}
]}},
{$lookup: { from:'servicestats',
let: {ocode:'$origin.code',dcode:'$destination.code',stype:'$serviceType'},
pipeline:[
{$match: {
$expr: { $and:
[
{ $eq: [ "$scenarioId", 0 ] },
{ $eq: [ "$origin.code", "$$ocode" ] },
{ $eq: [ "$destination.code", "$$dcode" ] },
{ $eq: [ "$serviceType", "$$stype" ] },
]
}
}
},
{$project: {_id:0, comp :{compavgmiles :'$currentOutput.avgMiles'}}},
{ $replaceRoot: { newRoot: "$comp" } }
],
as : "compoutputs"
}},
//{
// $replaceRoot: {
// newRoot: {
// $mergeObjects:[
// {
// $arrayElemAt: [
// "$$ROOT.compoutputs",
// 0
// ]
// },
// {
// origin: "$$ROOT.origin",
// destination: "$$ROOT.destination",
// serviceType: "$$ROOT.serviceType",
// baseavgmiles: "$$ROOT.currentOutput.avgMiles",
// output: '$$ROOT'
// }
// ]
// }
// }
// },
{$match :{'compoutputs':{$eq:[]}}}
]
}
}
///{$limit:100}
])
But facet performance is very bad. Any further ideas to improve this are most welcome.
In general, there are three things that can cause slow queries:
The query is not indexed, cannot use indexes efficiently, or the schema design is not optimal (e.g. highly nested arrays or subdocuments) which means that MongoDB must do some extra work to arrive at the relevant data.
The query is waiting for something slow (e.g. fetching data from disk, writing data to disk).
Underprovisioned hardware.
In terms of your query, there may be some general suggestions regarding query performance:
Using allowDiskUse in an aggregation pipeline means that it is possible that the query will be using disk for some of its stages. Disk is frequently the slowest part of a machine, so if it's possible for you to avoid this, it will speed up the query.
Note that each stage of an aggregation pipeline is limited to 100MB of memory use. This is irrespective of the amount of memory you have.
The $group stage cannot use indexes, because an index is tied to a document's location on disk. Once the aggregation pipeline enters a stage where the document's physical location is irrelevant (e.g. the $group stage), an index cannot be used anymore.
By default, the WiredTiger cache is ~50% of RAM, so a 64GB machine would have a ~32GB WiredTiger cache. If you find that the query is very slow, it is possible that MongoDB needed to go to disk to fetch the relevant documents. Monitoring iostats and checking disk utilization % during the query would provide hints toward whether enough RAM is provisioned.
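As a quick way to compare your working set with the cache, a shell sketch (the exact counter names can vary slightly between MongoDB versions):
// configured cache size vs. what is currently held in it
var cache = db.serverStatus().wiredTiger.cache;
print("cache max bytes:     " + cache["maximum bytes configured"]);
print("cache current bytes: " + cache["bytes currently in the cache"]);
// uncompressed size of the collection the aggregation has to scan
print("collection size:     " + db.servicestats.stats().size);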
Some possible solutions are:
Provision more RAM so that MongoDB doesn't have to go to disk very often.
Rework the schema design to avoid heavily nested fields, or multiple arrays in the document.
Tailor the document schema to make it easier for you to query the data in it, instead of tailoring the schema to how you think the data should be stored (e.g. avoid heavy normalization inherent in relational database design model).
If you find that you're hitting the performance limit of a single machine, consider sharding to horizontally scale the query. However, please note that sharding is a solution that would require careful design and consideration.
You say above that you'd like to group by scenarioId, which, however, you don't do. But that is probably what you should be doing to avoid all the switch statements. Something like this might get you going:
db.servicestats.aggregate([{
$match: {
scenarioId: { $in: [ 0, 1 ] }
}
}, {
$sort: { // not sure if that stage even helps - try to run with and without
'origin.code': 1,
'destination.code': 1,
serviceType: 1
}
}, {
$group: { // first group by scenarioId AND the other fields
_id: {
scenarioId: '$scenarioId',
originCode: '$origin.code',
destinationCode: '$destination.code',
serviceType: '$serviceType'
},
avgMiles: { $max: '$currentOutput.avgMiles' } // no switches needed
},
}, {
$group: { // group by the other fields only so without scenarioId
_id: {
originCode: '$_id.originCode',
destinationCode: '$_id.destinationCode',
serviceType: '$_id.serviceType'
},
baseScenarioAvgMiles: {
$max: {
$cond: {
if: { $eq: [ '$_id.scenarioId', 1 ] },
then: '$avgMiles',
else: 0
}
}
},
compareScenarioAvgMiles: {
$max: {
$cond: {
if: { $eq: [ '$_id.scenarioId', 0 ] },
then: '$avgMiles',
else: 0
}
}
}
},
}, {
$addFields: { // compute the difference
diff: {
$subtract :[ '$baseScenarioAvgMiles', '$compareScenarioAvgMiles']
}
}
}, {
$match: {
diff: { $eq: 0.5 }
}
}, {
$limit:100
}], { allowDiskUse: true })
Beyond that, I would suggest you use the power of db.collection.explain().aggregate(...) to find the right indexing and tune your query.
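For example (a sketch using the collection name from the question), you can check which index, if any, the initial $match and $sort stages are able to use:
// shows the winning plan for the early stages of the pipeline
db.servicestats.explain().aggregate([
    { $match: { scenarioId: { $in: [ 0, 1 ] } } },
    { $sort: { 'origin.code': 1, 'destination.code': 1, serviceType: 1 } }
])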
I have a collection whose elements can be simplified to this:
{tags : [1, 5, 8]}
where there will be at least one element in the array and all of them should be different. I want to substitute one tag for another, and I thought that would not be a problem. So I came up with the following query:
db.colll.update({
tags : 1
},{
$pull: { tags: 1 },
$addToSet: { tags: 2 }
}, {
multi: true
})
Cool, so it will find all documents that have the tag I do not need (1), remove it, and add another (2) if it is not already there. The problem is that I get an error:
"Cannot update 'tags' and 'tags' at the same time"
which basically means that I cannot do $pull and $addToSet at the same time. Is there any other way I can do this?
Of course I could remember all the IDs of the matching documents and then remove the tag and add the new one in separate queries, but this does not sound nice.
The error means pretty much what it says: you cannot act on the same "path" twice in the same update operation. The two operators you are using do not process sequentially as you might think they do.
You can get as close to "sequential" as you possibly can with the "bulk" operations API or another form of "bulk" update, though. Within reason of course, and also in reverse order:
var bulk = db.coll.initializeOrderedBulkOp();
bulk.find({ "tags": 1 }).updateOne({ "$addToSet": { "tags": 2 } });
bulk.find({ "tags": 1 }).updateOne({ "$pull": { "tags": 1 } });
bulk.execute();
This is not a guarantee that nothing else will try to modify the document in between, but it is as close as you will currently get.
Also see the raw "update" command with multiple documents.
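For reference, a minimal sketch of that raw update command form, sending both statements in one command (the collection name coll is taken from the question):
db.runCommand({
    update: "coll",
    updates: [
        // add the new tag first, then pull the old one, mirroring the bulk example above
        { q: { tags: 1 }, u: { $addToSet: { tags: 2 } }, multi: true },
        { q: { tags: 1 }, u: { $pull: { tags: 1 } }, multi: true }
    ],
    ordered: true
})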
If you're removing and adding at the same time, you may be modeling a 'map', instead of a 'set'. If so, an object may be less work than an array.
Instead of data as an array:
{ _id: 'myobjectwithdata',
data: [{ id: 'data1', important: 'stuff'},
{ id: 'data2', important: 'more'}]
}
Use data as an object:
{ _id: 'myobjectwithdata',
data: { data1: { important: 'stuff'},
data2: { important: 'more'} }
}
The one-command update is then:
db.coll.update(
{ _id: 'myobjectwithdata' },
{ $set: { 'data.data1': { important: 'treasure' } } }
);
Starting in Mongo 4.4, the $function aggregation operator allows applying a custom javascript function to implement behaviour not supported by the MongoDB Query Language.
Coupled with the improvement made to db.collection.update() in Mongo 4.2, which can now accept an aggregation pipeline (allowing a field to be updated based on its own value), we can manipulate and update an array in ways the query language doesn't otherwise easily permit:
// { "tags" : [ 1, 5, 8 ] }
db.collection.updateMany(
{ tags: 1 },
[{ $set:
{ "tags":
{ $function: {
body: function(tags) { tags.push(2); return tags.filter(x => x != 1); },
args: ["$tags"],
lang: "js"
}}
}
}]
)
// { "tags" : [ 5, 8, 2 ] }
$function takes 3 parameters:
body, which is the function to apply, whose parameter is the array to modify. The function here simply consists in pushing 2 to the array and filtering out 1.
args, which contains the fields from the record that the body function takes as parameters. In our case, "$tags".
lang, which is the language in which the body function is written. Only js is currently available.
In case you need to replace one value in an array with another, check this answer:
Replace array value using arrayFilters
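For completeness, a minimal sketch of that arrayFilters approach applied to the tags example above (note that, unlike $addToSet, this does not deduplicate if 2 is already present):
// replace every occurrence of tag 1 with tag 2, across all matching documents
db.coll.updateMany(
    { tags: 1 },
    { $set: { "tags.$[element]": 2 } },
    { arrayFilters: [ { element: { $eq: 1 } } ] }
)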