I'm trying to do some analytics on my MongoDB collection from my .NET Core project (C# driver).
My problem is with aggregating and grouping by multiple fields, including a field of a nested array element.
For starters, here's a simplified example document -
{
"_id": ".....",
"CampaignId": 1,
"IsTest": false,
"Events": [
{
"EventId": 1,
"IsFake": false
},
{
"EventId": 1,
"IsFake": true
},
{
"EventId": 2,
"IsFake": false
}
]
}
My end goal is to generate an analytics report that will look like this, for example -
[
{
"CampaignId": 1,
"DocumentCountReal":17824,
"DocumentCountTest":321,
"EventCountReal":100,
"EventCountFake":5,
"Events": [
{
"EventId": 1,
"IsFake": false,
"Count": 50
},
{
"EventId": 1,
"IsFake": true,
"Count": 5
},
{
"EventId": 2,
"IsFake": false,
"Count": 50
}
]
},
{
"CampaignId": 2,
"DocumentCountReal":1314,
"DocumentCountTest":57,
"EventCountReal":50,
"EventCountFake":0,
"Events": [
{
"EventId": 1,
"IsFake": false,
"Count": 25
},
{
"EventId": 2,
"IsFake": false,
"Count": 25
}
]
}
]
Just to show you where I currently stand: so far I've only found out how to group by one field...
Example -
var result = collection.Aggregate().Group(c => c.CampaignId, r => new { CampaignId = r.Key, Count = r.Count() }).ToList();
I couldn't find how to group by a field of a nested array element (in the example, the IsFake property of Events), nor how to build a result like the one I shared above in general.
To my surprise I couldn't find many related questions on Google (specifically ones using the C# driver).
Thanks a lot for reading
So, I eventually found a partial solution, and thought I'd share -
The biggest problem was grouping by a property of a nested array element (in my initial question, the Events array).
In order to achieve that I used Unwind (https://docs.mongodb.com/manual/reference/operator/aggregation/unwind/)
FYI - I needed to use the PreserveNullAndEmptyArrays option in order not to miss documents that had no events.
Also, I made an extra class where the Events property is a single Event instead of a list, in order to deserialize the documents after Unwind is applied.
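For reference, here's a rough sketch of that extra class (a sketch only - the property names are assumed from the example document above; the real class has more fields):

// Sketch: same shape as the original document, but Events is a single Event
// rather than a list, so the post-$unwind documents deserialize cleanly.
public class ClickUnwinded
{
    public ObjectId Id { get; set; }
    public int CampaignId { get; set; }
    public bool IsTest { get; set; }
    public Event Events { get; set; }
}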
My second question was how to perform several aggregations in one trip to the database; as Valijon commented, it looks like $facet is required, but I couldn't find an example that fits my question.
Example of my solution -
IMongoDatabase db = MongoDBDataLayer.NewConnection();
IMongoCollection<Click> collection = db.GetCollection<Click>("click");
var unwindOption = new AggregateUnwindOptions<ClickUnwinded>() { PreserveNullAndEmptyArrays = true };
var result = collection.Aggregate().Unwind<Click, ClickUnwinded>(c => c.Events, unwindOption)
.Group(
//group by
c => new GroupByClass()
{
CampaignId = c.CampaignId,
IsTest = c.IsTest,
EventId = c.Events.EventId,
IsRobot = c.UserAgent.IsRobot
},
r => new GroupResultClass()
{
CampaignId = r.First().CampaignId,
IsTest = r.First().IsTest,
EventId = r.First().Events.EventId,
IsRobot = r.First().UserAgent.IsRobot,
Count = r.Count()
}
).ToList();
If anyone has an example of how to utilize $facet using the C# driver, that would be great.
edit -
So, the problem is that the approach above will count the same document multiple times if the document has multiple events... so I have to either find a way to run multiple aggregations in one pipeline (the first aggregation must happen before the unwind) or make two trips to the DB.
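In case it helps anyone, here is a rough sketch of what a $facet pipeline might look like with the C# driver, using AggregateFacet.Create and the Facet extension. The stage contents and facet names are illustrative only - I haven't tested this against my real schema:

// Sketch: document counts per campaign (before the unwind) and event counts
// (after it), computed in a single round trip via $facet.
var perCampaign = AggregateFacet.Create("perCampaign",
    PipelineDefinition<Click, BsonDocument>.Create(new[]
    {
        new BsonDocument("$group", new BsonDocument
        {
            { "_id", new BsonDocument { { "CampaignId", "$CampaignId" }, { "IsTest", "$IsTest" } } },
            { "Count", new BsonDocument("$sum", 1) }
        })
    }));

var perEvent = AggregateFacet.Create("perEvent",
    PipelineDefinition<Click, BsonDocument>.Create(new[]
    {
        new BsonDocument("$unwind", new BsonDocument
        {
            { "path", "$Events" },
            { "preserveNullAndEmptyArrays", true }
        }),
        new BsonDocument("$group", new BsonDocument
        {
            { "_id", new BsonDocument
                {
                    { "CampaignId", "$CampaignId" },
                    { "EventId", "$Events.EventId" },
                    { "IsFake", "$Events.IsFake" }
                }
            },
            { "Count", new BsonDocument("$sum", 1) }
        })
    }));

var facetResult = collection.Aggregate().Facet(perCampaign, perEvent).Single();
var campaignCounts = facetResult.Facets.First(f => f.Name == "perCampaign").Output<BsonDocument>();
var eventCounts = facetResult.Facets.First(f => f.Name == "perEvent").Output<BsonDocument>();

Since each facet runs its own sub-pipeline over the same input documents, the per-campaign counts are computed before the unwind, which avoids the double-counting problem described above.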
Related
I have a MongoDB collection like this:
{
    _id: "abc",
    history: [
        {
            status: 1,
            reason: "confirmed"
        },
        {
            status: 2,
            reason: "accepted"
        }
    ]
},
{
    _id: "xyz",
    history: [
        {
            status: 2,
            reason: "accepted"
        },
        {
            status: 10,
            reason: "cancelled"
        }
    ]
}
I want to write a query in C# to return the documents whose last history item has status 2 (accepted). So in my result I should not see "xyz", because its status has changed since 2, but I should see "abc", since its last status is 2. The problem is that getting the last array item is not easy with MongoDB's C# driver - or I don't know how to do it.
I tried LINQ's LastOrDefault but got a System.InvalidOperationException: {document}{History}.LastOrDefault().Status is not supported error.
I know there is a workaround: get the documents first (load them into memory) and then filter, but that is client-side and slow (it consumes a lot of network bandwidth). I want to do the filtering on the server.
Option 1) Find() -> expected to be faster
db.collection.find({
$expr: {
$eq: [
{
$arrayElemAt: [
"$history.status",
-1
]
},
2
]
}
})
Playground1
Option 2) Aggregation
db.collection.aggregate([
{
"$addFields": {
last: {
$arrayElemAt: [
"$history",
-1
]
}
}
},
{
$match: {
"last.status": 2
}
},
{
$project: {
"history": 1
}
}
])
Playground2
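If you want to run Option 1 through the C# driver, one way is to pass the $expr filter as a raw BsonDocument (a sketch only; it assumes a BsonDocument-typed collection, though the implicit conversion to FilterDefinition also works for typed collections):

// Sketch: Option 1's $expr filter handed to Find() as a raw document.
var filter = new BsonDocument("$expr", new BsonDocument("$eq", new BsonArray
{
    new BsonDocument("$arrayElemAt", new BsonArray { "$history.status", -1 }),
    2
}));

var accepted = collection.Find(filter).ToList();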
I found a hackaround: override the history array so it contains only its last element, then apply the filter as if there were no array. This is possible through the $addFields aggregation stage.
PipelineDefinition<Process, BsonDocument> pipeline = new BsonDocument[]
{
new BsonDocument("$addFields",
new BsonDocument("history",
new BsonDocument ( "$slice",
new BsonArray { "$history", -1 }
)
)
),
new BsonDocument("$match",
new BsonDocument
{
{ "history.status", 2 }
}
)
};
var result = collection.Aggregate(pipeline).ToList();
result will contain the documents whose last history item has status 2.
I am fairly new to C# and am trying to parse an API response. My goal is to check each SKU present to see if it contains all three of the following tags: Dot, Low, and Default. The only thing is the API is set up a bit oddly, so even if the "RSkuName" is the same, it's listed under a different SkuId. I need to make sure each RSkuName has all 3 of the tag types. Here is an example of the API response below (parts of it have been omitted since it's a huge amount of data; I'm just showing the pieces important to this question):
"Skus": [
    {
        "SkuId": "DH786HY",
        "Name": "Stand_D3_v19 Dot",
        "Attributes": {
            "RSkuName": "Stand_D3_v19",
            "Category": "General",
            "DiskSize": "0 Gb",
            "Low": "False"
        },
        "Tags": [
            "Dot"
        ]
    },
    {
        "SkuId": "DU70PLL1",
        "Name": "Stand_D3_v19",
        "Attributes": {
            "RSkuName": "Stand_D3_v19",
            "Category": "General",
            "DiskSize": "0 Gb",
            "Low": "False"
        },
        "Tags": [
            "Default"
        ]
    },
    {
        "SkuId": "DPOK65R4",
        "Name": "Stand_D3_v19 Low",
        "Attributes": {
            "RSkuName": "Stand_D3_v19",
            "Category": "General",
            "DiskSize": "0 Gb",
            "Low": "True"
        },
        "Tags": [
            "Low"
        ]
    },
    {
        "SkuId": "DPOK65R4",
        "Name": "Stand_D6_v22 Low",
        "Attributes": {
            "RSkuName": "Stand_D6_v22",
            "Category": "General",
            "DiskSize": "0 Gb",
            "Low": "True"
        },
        "Tags": [
            "Low"
        ]
    }
]
Originally I tried to iterate through each SKU, however since the SkuIds are different even when the RSkuName is the same, that doesn't work. I was thinking of possibly using a Dictionary<string, HashSet<string>> so it would map SkuName to Tags, but I'm not sure that will work either. Any ideas would be much appreciated. Apologies if this question isn't phrased well; once again, I'm a beginner. I have included what I tried originally below:
foreach (Sku sku in skus)
{
    string skuName = sku.Attributes["RSkuName"];
    var count = 0;
    if (sku.Tags.Equals("Default"))
    {
        count++;
    }
    if (sku.Tags.Equals("Low"))
    {
        count++;
    }
    if (sku.Tags.Equals("Dot"))
    {
        count++;
    }
    if (count < 3)
    {
        traceSource.TraceInformation($"There are not 3 tags present for {skuName}");
    }
}
Seems simple:
group by RSkuName
group by Tag element value
make sure there is a group for each of the required tag values.
Yes, you could use a hashset in this scenario to formulate the groups. If we had to do it from first principles, that's not a bad idea.
However, we can use the LINQ fluent GroupBy function (which uses a hashset internally) to iterate over the groups.
There is one complicating factor that even your first attempt does not take into account: Tags is an array of strings. To group values across multiple arrays, we can use the SelectMany function to flatten the arrays from multiple SKUs into a single sequence; then GroupBy becomes viable again.
Finally, if the only possible values for Tags elements are Dot, Low, and Default, then we only need to count the groups and make sure there are 3 for the SKU to be valid.
bool notValid = skus.GroupBy(x => x.Attributes["RSkuName"])
                    .Any(sku => sku.SelectMany(x => x.Tags)
                                   .GroupBy(x => x)
                                   .Count() < 3);
I call this a fail-fast approach: instead of making sure ALL items satisfy the criteria, we only try to detect the first time that the criteria are not met and stop processing the list.
If other tags might be provided then we can still use similar syntax by filtering the tags first:
string[] requiredTags = new string[] { "Dot", "Low", "Default" };
bool notValid = skus.GroupBy(x => x.Attributes["RSkuName"])
                    .Any(sku => sku.SelectMany(x => x.Tags)
                                   .Where(x => requiredTags.Contains(x))
                                   .GroupBy(x => x)
                                   .Count() < 3);
If you need to list out all the SKUs that have failed, and perhaps why they were not valid, then we can do that with similar syntax. Instead of using LINQ though, let's look at how you might do this with your current iterator approach...
Start by creating a class to hold the tags that we have seen for each sku:
public class SkuTracker
{
public string Sku { get; set; }
public List<string> Tags { get;set; } = new List<string>();
public override string ToString() => $"{Sku} - ({Tags.Count()}) {String.Join(",", Tags)}";
}
Then we maintain a dictionary of these SkuTracker objects and record the tags as we see them:
var trackedSkus = new Dictionary<string, SkuTracker>();
...
foreach (Sku sku in skus)
{
    string skuName = sku.Attributes["RSkuName"];
    if (!trackedSkus.ContainsKey(skuName))
        trackedSkus.Add(skuName, new SkuTracker { Sku = skuName });
    trackedSkus[skuName].Tags.AddRange(sku.Tags);
}
...
var missingSkus = trackedSkus.Values.Where(x => x.Tags.Count() < 3)
.ToList();
foreach(var sku in missingSkus)
{
traceSource.TraceInformation($"There are not 3 tags present for {sku.Sku}" );
}
Looking at the JSON fragment you have provided, I suspect that you can only verify the validation after the entire list has been processed, so we cannot trace out the failure messages inside the first iteration; this code will still produce the same output as if we had, though.
NOTE: In SkuTracker we have defined a ToString() override; this is so that you can easily view the content of the tracked SKUs in the debugger using the native viewers, especially if you try to inspect the content of missingSkus.
I need to update several thousand items every few minutes in Elasticsearch, and unfortunately reindexing is not an option for me. From my research, the best way to update an item is using _update_by_query - I have had success updating single documents like so -
{
  "query": {
    "match": {
      "itemId": {
        "query": "1"
      }
    }
  },
  "script": {
    "source": "ctx._source.field = params.updateValue",
    "lang": "painless",
    "params": {
      "updateValue": "test"
    }
  }
}
var response = await Client.UpdateByQueryAsync<dynamic>(q => q
.Index("masterproducts")
.Query(q => x.MatchQuery)
.Script(s => s.Source(x.Script).Lang("painless").Params(x.Params))
.Conflicts(Elasticsearch.Net.Conflicts.Proceed)
);
Although this works, it is extremely inefficient as it generates thousands of requests - is there a way in which I can update multiple documents with matching IDs in a single request? I have already tried the Multi Search API, which it would seem cannot be used for this purpose. Any help would be appreciated!
If possible, try to generalize your query.
Instead of targeting a single itemId, perhaps try using a terms query:
{
"query": {
"terms": {
"itemId": [
"1", "2", ...
]
}
},
"script": {
...
}
}
From the looks of it, your (seemingly simplified) script sets the same value regardless of the document ID / itemId. So that's that.
If the script does indeed set different values based on the doc IDs / itemIds, you could make the params multi-value:
"params": {
"updateValue1": "test1",
"updateValue2": "test2",
...
}
and then dynamically access them:
...
def value_to_set = params['updateValue' + ctx._source['itemId']];
...
so the target doc is updated with the corresponding value.
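In NEST that could look something like the sketch below; the field name, index name, and the params dictionary are assumptions based on the question:

// Sketch: one _update_by_query covering many itemIds, with per-item values
// resolved inside the script from the params dictionary.
var updateParams = new Dictionary<string, object>
{
    ["updateValue1"] = "test1",
    ["updateValue2"] = "test2"
};

var response = await Client.UpdateByQueryAsync<dynamic>(u => u
    .Index("masterproducts")
    .Query(q => q.Terms(t => t.Field("itemId").Terms("1", "2")))
    .Script(s => s
        .Source("ctx._source.field = params['updateValue' + ctx._source['itemId']]")
        .Lang("painless")
        .Params(updateParams))
    .Conflicts(Elasticsearch.Net.Conflicts.Proceed)
);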
I have documents that look something like this, with a unique index on bars.name:
{ name: 'foo', bars: [ { name: 'qux', somefield: 1 } ] }
I want to either update the sub-document where { name: 'foo', 'bars.name': 'qux' } and $set: { 'bars.$.somefield': 2 }, or create a new sub-document with { name: 'qux', somefield: 2 } under { name: 'foo' }.
Is it possible to do this using a single query with upsert, or will I have to issue two separate ones?
Related: 'upsert' in an embedded document (it suggests changing the schema to have the sub-document identifier as the key, but that answer is from two years ago and I'm wondering if there are better solutions now.)
No, there isn't really a better solution to this, so perhaps an explanation will help.
Suppose you have a document in place that has the structure as you show:
{
"name": "foo",
"bars": [{
"name": "qux",
"somefield": 1
}]
}
If you do an update like this
db.foo.update(
{ "name": "foo", "bars.name": "qux" },
{ "$set": { "bars.$.somefield": 2 } },
{ "upsert": true }
)
Then all is fine because a matching document was found. But if you change the value of "bars.name":
db.foo.update(
{ "name": "foo", "bars.name": "xyz" },
{ "$set": { "bars.$.somefield": 2 } },
{ "upsert": true }
)
Then you will get a failure. The only thing that has really changed here is that in MongoDB 2.6 and above the error is a little more succinct:
WriteResult({
"nMatched" : 0,
"nUpserted" : 0,
"nModified" : 0,
"writeError" : {
"code" : 16836,
"errmsg" : "The positional operator did not find the match needed from the query. Unexpanded update: bars.$.somefield"
}
})
That is better in some ways, but you really do not want to "upsert" anyway. What you want to do is add the element to the array where the "name" does not currently exist.
So what you really want is the "result" from the update attempt without the "upsert" flag to see if any documents were affected:
db.foo.update(
{ "name": "foo", "bars.name": "xyz" },
{ "$set": { "bars.$.somefield": 2 } }
)
Yielding in response:
WriteResult({ "nMatched" : 0, "nUpserted" : 0, "nModified" : 0 })
So when the modified documents are 0 then you know you want to issue the following update:
db.foo.update(
{ "name": "foo" },
{ "$push": { "bars": {
"name": "xyz",
"somefield": 2
}}}
)
There really is no other way to do exactly what you want. Since the additions to the array are not strictly a "set" type of operation, you cannot use $addToSet combined with the "bulk update" functionality to "cascade" your update requests.
In this case it seems like you need to check the result, or otherwise accept reading the whole document and checking whether to update or insert a new array element in code.
If you don't mind changing the schema a bit and having a structure like so:
{ "name": "foo", "bars": { "qux": { "somefield": 1 },
                           "xyz": { "somefield": 2 }
                         }
}
You can perform your operations in one go.
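For example, with the C# driver (a sketch, assuming a BsonDocument-typed collection), a single upsert then covers both the update and the insert case:

// Sketch: with bar names as object keys, one upsert handles both cases.
var result = collection.UpdateOne(
    Builders<BsonDocument>.Filter.Eq("name", "foo"),
    Builders<BsonDocument>.Update.Set("bars.xyz.somefield", 2),
    new UpdateOptions { Upsert = true });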
Reiterating 'upsert' in an embedded document for completeness
I was digging for the same feature, and found that in version 4.2 or above, MongoDB provides a new feature called Update with aggregation pipeline.
This feature, if used with some other techniques, makes it possible to achieve an upsert-subdocument operation with a single query.
It's a very verbose query, but I believe that if you know you won't have too many records in the subCollection, it's viable. Here's an example of how to achieve this:
const documentQuery = { _id: '123' }
const subDocumentToUpsert = { name: 'xyz', id: '1' }
collection.update(documentQuery, [
{
$set: {
sub_documents: {
$cond: {
if: { $not: ['$sub_documents'] },
then: [subDocumentToUpsert],
else: {
$cond: {
if: { $in: [subDocumentToUpsert.id, '$sub_documents.id'] },
then: {
$map: {
input: '$sub_documents',
as: 'sub_document',
in: {
$cond: {
if: { $eq: ['$$sub_document.id', subDocumentToUpsert.id] },
then: subDocumentToUpsert,
else: '$$sub_document',
},
},
},
},
else: { $concatArrays: ['$sub_documents', [subDocumentToUpsert]] },
},
},
},
},
},
},
])
There's a way to do it in two queries - but it will still work in a bulkWrite.
This is relevant because in my case not being able to batch it is the biggest hangup. With this solution, you don't need to collect the result of the first query, which allows you to do bulk operations if you need to.
Here are the two successive queries to run for your example:
// Update subdocument if existing
collection.updateMany({
name: 'foo', 'bars.name': 'qux'
}, {
$set: {
'bars.$.somefield': 2
}
})
// Insert subdocument otherwise
collection.updateMany({
name: 'foo', 'bars.name': { $ne: 'qux' }
}, {
$push: {
bars: {
somefield: 2, name: 'qux'
}
}
})
This also has the added benefit of not having corrupted data / race conditions if multiple applications are writing to the database concurrently. You won't risk ending up with two bars: {somefield: 2, name: 'qux'} subdocuments in your document if two applications run the same queries at the same time.
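For reference, a sketch of the same pair of updates batched into a single BulkWrite with the C# driver (assuming a BsonDocument-typed collection):

// Sketch: both updates in one round trip; at most one of them modifies the doc.
var models = new WriteModel<BsonDocument>[]
{
    // Update the subdocument if an element named 'qux' exists...
    new UpdateManyModel<BsonDocument>(
        Builders<BsonDocument>.Filter.Eq("name", "foo")
            & Builders<BsonDocument>.Filter.Eq("bars.name", "qux"),
        Builders<BsonDocument>.Update.Set("bars.$.somefield", 2)),
    // ...otherwise push a new element.
    new UpdateManyModel<BsonDocument>(
        Builders<BsonDocument>.Filter.Eq("name", "foo")
            & Builders<BsonDocument>.Filter.Ne("bars.name", "qux"),
        Builders<BsonDocument>.Update.Push("bars",
            new BsonDocument { { "name", "qux" }, { "somefield", 2 } }))
};

var bulkResult = collection.BulkWrite(models);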
Starting from a JObject, I can get the array that interests me:
JArray partial = (JArray)rssAlbumMetadata["tracks"]["items"];
First question: "partial" contains a lot of attributes I'm not interested in.
How can I get only what I need?
Second question: once I've succeeded in the first task, I'll get a JArray of duplicated items. How can I get only the unique ones?
The result should be something like
{
'composer': [
{
'id': '51523',
'name': 'Modest Mussorgsky'
},
{
'id': '228918',
'name': 'Sergey Prokofiev'
}
]
}
Let me start from something like:
[
{
"id": 32837732,
"composer": {
"id": 245,
"name": "George Gershwin"
},
"title": "Of Thee I Sing: Overture (radio version)"
},
{
"id": 32837735,
"composer": {
"id": 245,
"name": "George Gershwin"
},
"title": "Concerto in F : I. Allegro"
},
{
"id": 32837739,
"composer": {
"id": 245,
"name": "George Gershwin"
},
"title": "Concerto in F : II. Adagio"
}
]
First question:
How can I get only what I need?
There is no magic; you need to read the whole JSON string and then query the object to find what you are looking for. It is not possible to read only part of the JSON, if that is what you were hoping for. You have not provided an example of what the data looks like, so it is not possible to specify how to query it.
Second question, which I guess is: how do you de-duplicate the contents of an array of objects?
Again, I do not have a full view of your objects, but this example should show you how - using LINQ as you requested:
var items = new[] { new { id = 1, name = "ali" }, new { id = 2, name = "ostad" }, new { id = 1, name = "ali" } };
var dedup = items.GroupBy(x => x.id).Select(y => y.First()).ToList();
dedup.ForEach(Console.WriteLine);
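Applied to your album data, the same idea with LINQ to JSON might look like the sketch below (it assumes the structure you posted, with a composer object on each track):

// Sketch: project each track down to its composer, then keep one per composer id.
JArray partial = (JArray)rssAlbumMetadata["tracks"]["items"];

var composers = partial
    .Select(track => track["composer"])
    .GroupBy(c => (int)c["id"])
    .Select(g => g.First());

var result = new JObject
{
    ["composer"] = new JArray(composers)
};

Console.WriteLine(result.ToString());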