I need to update several thousand items every few minutes in Elasticsearch, and unfortunately reindexing is not an option for me. From my research, the best way to update an item is using _update_by_query - I have had success updating single documents like so -
{
"query": {
"match": {
"itemId": {
"query": "1"
}
}
},
"script": {
"source": "ctx._source.field = params.updateValue",
"lang": "painless",
"params": {
"updateValue": "test",
}
}
}
var response = await Client.UpdateByQueryAsync<dynamic>(q => q
.Index("masterproducts")
.Query(qd => x.MatchQuery)
.Script(s => s.Source(x.Script).Lang("painless").Params(x.Params))
.Conflicts(Elasticsearch.Net.Conflicts.Proceed)
);
Although this works, it is extremely inefficient as it generates thousands of requests - is there a way I can update multiple documents with a matching ID in a single request? I have already tried the Multi Search API, which it would seem cannot be used for this purpose. Any help would be appreciated!
If possible, try to generalize your query.
Instead of targeting a single itemId, perhaps try using a terms query:
{
"query": {
"terms": {
"itemId": [
"1", "2", ...
]
}
},
"script": {
...
}
}
From the looks of it, your (seemingly simplified) script sets the same value regardless of the document ID / itemId. So that's that.
If the script does indeed set different values based on the doc IDs / itemIds, you could make the params multi-value:
"params": {
"updateValue1": "test1",
"updateValue2": "test2",
...
}
and then dynamically access them:
...
def value_to_set = params['updateValue' + ctx._source['itemId']];
...
so the target doc is updated with the corresponding value.
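Since you are calling this through NEST anyway, the whole thing can go out as one _update_by_query request. A rough sketch (the item IDs, param names, and target field are only illustrative, and the exact fluent overloads can vary a little between NEST versions):

var itemIds = new[] { "1", "2", "3" };

var response = await Client.UpdateByQueryAsync<dynamic>(u => u
    .Index("masterproducts")
    // One terms query instead of thousands of individual match queries.
    .Query(q => q.Terms(t => t.Field("itemId").Terms(itemIds)))
    // The script picks the right param based on each document's itemId.
    .Script(s => s
        .Source("ctx._source.field = params['updateValue' + ctx._source['itemId']]")
        .Lang("painless")
        .Params(p => p
            .Add("updateValue1", "test1")
            .Add("updateValue2", "test2")
            .Add("updateValue3", "test3")))
    .Conflicts(Elasticsearch.Net.Conflicts.Proceed)
);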
I am fairly new to C# and am trying to parse an API response. My goal is to check each SKU present to see if it contains all three of the following tags: Dot, Low, and Default. The only thing is the API is set up a bit oddly, so even if the "RSkuName" is the same, it's listed under a different SkuId. I need to make sure each RSkuName contains all 3 of the types; here is an example of the API response below (parts of it have been omitted since it's a huge amount of data, just showing the pieces important to this question)
"Skus": [
{
"SkuId": "DH786HY",
"Name": "Stand_D3_v19 Dot",
"Attributes": {
"RSkuName": "Stand_D3_v19",
"Category": "General",
"DiskSize": "0 Gb",
"Low": "False",
},
"Tags": [
"Dot"
{
"SkuId": "DU70PLL1",
"Name": "Stand_D3_v19",
"Attributes": {
"Attributes": {
"RSkuName": "Stand_D3_v19",
"Category": "General",
"DiskSize": "0 Gb",
"Low": "False",
},
"Tags": [
"Default"
]
{
"SkuId": "DPOK65R4",
"Name": "Stand_D3_v19 Low",
"Attributes": {
"Attributes": {
"RSkuName": "Stand_D3_v19",
"Category": "General",
"DiskSize": "0 Gb",
"Low": "True",
},
"Tags": [
"Low"
],
{
"SkuId": "DPOK65R4",
"Name": "Stand_D6_v22 Low",
"Attributes": {
"Attributes": {
"RSkuName": "Stand_D6_v22",
"Category": "General",
"DiskSize": "0 Gb",
"Low": "True",
},
"Tags": [
"Low"
],
Originally I tried to iterate through each SKU, however since the SkuIds are different even though the name is the same, that doesn't work. I was thinking of possibly using a Dictionary<string, HashSet<string>> so it would map skuName to Tags, but I'm not sure that will work either. Any ideas would be much appreciated. Apologies if this question isn't phrased well; once again, I'm a beginner. I have included what I tried originally below:
foreach (Sku sku in skus)
{
    string skuName = sku.Attributes["RSkuName"];
    var count = 0;
    if (sku.Tags.Equals("Default"))
    {
        count++;
    }
    if (sku.Tags.Equals("Low"))
    {
        count++;
    }
    if (sku.Tags.Equals("Dot"))
    {
        count++;
    }
    if (count < 3)
    {
        traceSource.TraceInformation($"There are not 3 tags present for {skuName}");
    }
}
Seems simple:
group by RSkuName
group by Tag element value
make sure there is a group for each of the required tag values.
Yes, you could use a HashSet in this scenario to formulate the groups; if we had to do it from first principles, that's not a bad idea.
However, we can use the LINQ fluent GroupBy function (which uses a hash-based lookup internally) to iterate over the groups.
There is one complicating factor that even your first attempt does not take into account: Tags is an array of strings. To properly group the values across multiple arrays, we can use the SelectMany function to merge the arrays from multiple SKUs into a single sequence, at which point GroupBy becomes viable again.
Finally, if the only possible values for Tag elements are Dot, Low, and Default, then we only need to count the groups and make sure there are 3 for the SKU to be valid.
bool notValid = skus.GroupBy(x => x.Attributes["RSkuName"])
                    .Any(group => group.SelectMany(x => x.Tags)
                                       .GroupBy(x => x)
                                       .Count() < 3);
I call this a fail-fast approach: instead of making sure ALL items satisfy the criteria, we only try to detect the first time the criteria are not met and stop processing the list.
If other tags might be provided then we can still use similar syntax by filtering the tags first:
string[] requiredTags = new string [] { "Dot", "Low", "Default" };
bool notValid = skus.GroupBy(x => x.Attributes["RSkuName"])
                    .Any(group => group.SelectMany(x => x.Tags)
                                       .Where(x => requiredTags.Contains(x))
                                       .GroupBy(x => x)
                                       .Count() < 3);
If you need to list out all the SKUs that have failed, and perhaps why they were not valid, then we can do that with similar syntax. Instead of using LINQ though, let's look at how you might do this with your current iterator approach...
Start by creating a class to hold the tags that we have seen for each sku:
public class SkuTracker
{
public string Sku { get; set; }
public List<string> Tags { get;set; } = new List<string>();
public override string ToString() => $"{Sku} - ({Tags.Count()}) {String.Join(",", Tags)}";
}
Then we maintain a dictionary of these SkuTracker objects and record the tags as we see them:
var trackedSkus = new Dictionary<string, SkuTracker>();
...
foreach (Sku sku in skus)
{
    string skuName = sku.Attributes["RSkuName"];
    if (!trackedSkus.ContainsKey(skuName))
        trackedSkus.Add(skuName, new SkuTracker { Sku = skuName });
    trackedSkus[skuName].Tags.AddRange(sku.Tags);
}
...
var missingSkus = trackedSkus.Values.Where(x => x.Tags.Count() < 3)
.ToList();
foreach(var sku in missingSkus)
{
traceSource.TraceInformation($"There are not 3 tags present for {sku.Sku}" );
}
Looking at the JSON fragment you have provided, I suspect that you can only verify the validation after the entire list has been processed, so we cannot trace out the failure messages inside the first iteration; this code will still produce the same output as if we had, though.
NOTE: In the SkuTracker we have defined the ToString() override; this is so that you can easily view the content of the tracked SKUs in the debugger using the native viewers, especially if you try to inspect the content of missingSkus.
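For completeness, the reporting can also stay in LINQ; a sketch, assuming the same Sku shape and the requiredTags array from earlier:

var failures = skus
    .GroupBy(x => x.Attributes["RSkuName"])
    .Select(g => new
    {
        Sku = g.Key,
        // Which of the required tags never appeared across this group's SKUs.
        MissingTags = requiredTags.Except(g.SelectMany(x => x.Tags)).ToList()
    })
    .Where(x => x.MissingTags.Any())
    .ToList();

foreach (var failure in failures)
{
    traceSource.TraceInformation(
        $"There are not 3 tags present for {failure.Sku}, missing: {string.Join(",", failure.MissingTags)}");
}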
I have just started with DynamoDB. I have a background in MongoDB and relational databases, and I am structuring my JSON more like a graph structure than a flat one. For example,
[
{
"id": "1",
"title": "Castle on the hill",
"lyrics": "when i was six years old I broke my leg",
"artists": [
{
"name": "Ed Sheeran",
"sex": "male"
}
]
}
]
For example, I would like to search for the item by 'Ed Sheeran'. The closest I have got is this, and it is not even matching any value.
var request = new ScanRequest
{
TableName = "Table",
ProjectionExpression = "Id, Title, Artists",
ExpressionAttributeValues = new Dictionary<string,AttributeValue>
{
{ ":artist", new AttributeValue { M = new Dictionary<string, AttributeValue>
{
{ "Name", new AttributeValue { S = "Ed Sheeran" }}
}
}
}
},
ExpressionAttributeNames = new Dictionary<string, string>
{
{ "#artist", "Artists" },
},
FilterExpression = "#artist = :artist",
};
var result = await client.ScanAsync(request);
Most of the examples and tutorials I have watched so far have treated DynamoDB as a table in a normal relational database with a very flat design. Am I doing it wrong to structure the JSON as above? Should Artists be in a separate table?
And if it can be done, how do I search by some value in a complex type like in the above example?
First of all, you should not be using the scan operation in DynamoDB. I would strongly recommend using query instead. Have a look at this Stack Overflow question first.
If you want to search on an attribute, you can either make it part of the primary key (either hash_key, or hash_key + sort_key) or create an index on the field you want to query on.
Depending on the use case of the id attribute in your schema, if you never query on the id attribute, I would recommend a structure something like this:
[
{
"artist_name" : "Ed Sheeran" // Hash Key
"id": "1", // Sort Key (Assuming id is unique and combination of HK+SK is unique)
"title": "Castle on the hill",
"lyrics": "when i was six years old I broke my leg",
"artists": [
{
"name": "Ed Sheeran",
"sex": "male"
}
]
}
]
Alternatively, if you also need to query on id and it has to be the hash key, you can add an index on the artist_name attribute and then query that.
[
{
"artist_name" : "Ed Sheeran" // GSI Hash key
"id": "1", // Table Hash key
"title": "Castle on the hill",
"lyrics": "when i was six years old I broke my leg",
"artists": [
{
"name": "Ed Sheeran",
"sex": "male"
}
]
}
]
In either case, it is not possible to query inside a nested object without using the scan operation and then iterating over the results in code, which is what you have already tried.
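For the GSI route above, the query from the .NET SDK would look roughly like this (the index name artist_name-index is only an assumption here; use whatever name you give the index):

var request = new QueryRequest
{
    TableName = "Table",
    IndexName = "artist_name-index", // hypothetical GSI name
    KeyConditionExpression = "artist_name = :artist",
    ExpressionAttributeValues = new Dictionary<string, AttributeValue>
    {
        { ":artist", new AttributeValue { S = "Ed Sheeran" } }
    }
};

var result = await client.QueryAsync(request);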
I have documents that look something like this, with a unique index on bars.name:
{ name: 'foo', bars: [ { name: 'qux', somefield: 1 } ] }
I want to either update the sub-document where { name: 'foo', 'bars.name': 'qux' } and $set: { 'bars.$.somefield': 2 }, or create a new sub-document { name: 'qux', somefield: 2 } under { name: 'foo' }.
Is it possible to do this using a single query with upsert, or will I have to issue two separate ones?
Related: 'upsert' in an embedded document (suggests changing the schema to have the sub-document identifier as the key, but this is from two years ago and I'm wondering if there are better solutions now.)
No, there isn't really a better solution to this, so perhaps an explanation will help.
Suppose you have a document in place that has the structure as you show:
{
"name": "foo",
"bars": [{
"name": "qux",
"somefield": 1
}]
}
If you do an update like this:
db.foo.update(
{ "name": "foo", "bars.name": "qux" },
{ "$set": { "bars.$.somefield": 2 } },
{ "upsert": true }
)
Then all is fine because a matching document was found. But if you change the value of "bars.name":
db.foo.update(
{ "name": "foo", "bars.name": "xyz" },
{ "$set": { "bars.$.somefield": 2 } },
{ "upsert": true }
)
Then you will get a failure. The only thing that has really changed here is that in MongoDB 2.6 and above the error is a little more succinct:
WriteResult({
"nMatched" : 0,
"nUpserted" : 0,
"nModified" : 0,
"writeError" : {
"code" : 16836,
"errmsg" : "The positional operator did not find the match needed from the query. Unexpanded update: bars.$.somefield"
}
})
That is better in some ways, but you really do not want to "upsert" anyway. What you want to do is add the element to the array where the "name" does not currently exist.
So what you really want is the "result" from the update attempt without the "upsert" flag to see if any documents were affected:
db.foo.update(
{ "name": "foo", "bars.name": "xyz" },
{ "$set": { "bars.$.somefield": 2 } }
)
Yielding in response:
WriteResult({ "nMatched" : 0, "nUpserted" : 0, "nModified" : 0 })
So when the number of modified documents is 0, you know you need to issue the following update:
db.foo.update(
{ "name": "foo" },
{ "$push": { "bars": {
"name": "xyz",
"somefield": 2
}}}
)
There really is no other way to do exactly what you want. Since the additions to the array are not strictly a "set" type of operation, you cannot use $addToSet combined with the "bulk update" functionality there in order to "cascade" your update requests.
In this case it seems like you need to check the result, or otherwise accept reading the whole document and checking whether to update or insert a new array element in code.
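If you happen to be doing this from the .NET driver, the same check-then-push flow might be sketched like this (names taken from the example above; this is an illustration, not the only way to write it):

var collection = database.GetCollection<BsonDocument>("foo");

// Try the positional $set first.
var matchExisting = Builders<BsonDocument>.Filter.Eq("name", "foo")
                  & Builders<BsonDocument>.Filter.Eq("bars.name", "xyz");
var setExisting = Builders<BsonDocument>.Update.Set("bars.$.somefield", 2);

var result = collection.UpdateOne(matchExisting, setExisting);

// Nothing matched (the "nMatched: 0" case above), so append a new array element instead.
if (result.MatchedCount == 0)
{
    var matchParent = Builders<BsonDocument>.Filter.Eq("name", "foo");
    var pushNew = Builders<BsonDocument>.Update.Push("bars",
        new BsonDocument { { "name", "xyz" }, { "somefield", 2 } });
    collection.UpdateOne(matchParent, pushNew);
}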
If you don't mind changing the schema a bit and having a structure like so:
{ "name": "foo", "bars": { "qux": { "somefield": 1 },
                           "xyz": { "somefield": 2 } } }
You can perform your operations in one go.
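For instance, with that name-keyed shape a single update covers both the "already exists" and "not there yet" cases. A sketch with the .NET driver, reusing the field names above (collection is an IMongoCollection<BsonDocument> for these documents):

var filter = Builders<BsonDocument>.Filter.Eq("name", "foo");
// A dot-path $set creates bars.qux if it is missing, or overwrites somefield if it exists.
var update = Builders<BsonDocument>.Update.Set("bars.qux.somefield", 2);

collection.UpdateOne(filter, update, new UpdateOptions { IsUpsert = true });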
Reiterating 'upsert' in an embedded document for completeness
I was digging for the same feature, and found that in version 4.2 or above, MongoDB provides a new feature called Update with aggregation pipeline.
This feature, if used with some other techniques, makes it possible to achieve an upsert subdocument operation with a single query.
It's a very verbose query, but I believe that if you know you won't have too many records in the sub-collection, it's viable. Here's an example of how to achieve this:
const documentQuery = { _id: '123' }
const subDocumentToUpsert = { name: 'xyz', id: '1' }
collection.update(documentQuery, [
{
$set: {
sub_documents: {
$cond: {
if: { $not: ['$sub_documents'] },
then: [subDocumentToUpsert],
else: {
$cond: {
if: { $in: [subDocumentToUpsert.id, '$sub_documents.id'] },
then: {
$map: {
input: '$sub_documents',
as: 'sub_document',
in: {
$cond: {
if: { $eq: ['$$sub_document.id', subDocumentToUpsert.id] },
then: subDocumentToUpsert,
else: '$$sub_document',
},
},
},
},
else: { $concatArrays: ['$sub_documents', [subDocumentToUpsert]] },
},
},
},
},
},
},
])
There's a way to do it in two queries - but it will still work in a bulkWrite.
This is relevant because in my case not being able to batch it is the biggest hangup. With this solution, you don't need to collect the result of the first query, which allows you to do bulk operations if you need to.
Here are the two successive queries to run for your example:
// Update subdocument if existing
collection.updateMany({
name: 'foo', 'bars.name': 'qux'
}, {
$set: {
'bars.$.somefield': 2
}
})
// Insert subdocument otherwise
collection.updateMany({
  name: 'foo', 'bars.name': { $ne: 'qux' }
}, {
$push: {
bars: {
somefield: 2, name: 'qux'
}
}
})
This also has the added benefit of not having corrupted data / race conditions if multiple applications are writing to the database concurrently. You won't risk ending up with two bars: {somefield: 2, name: 'qux'} subdocuments in your document if two applications run the same queries at the same time.
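With the .NET driver, for example, the two updates can go out as a single bulk request; a sketch reusing the filters above (collection is an IMongoCollection<BsonDocument>):

var filter = Builders<BsonDocument>.Filter;
var update = Builders<BsonDocument>.Update;

var operations = new List<WriteModel<BsonDocument>>
{
    // Update the subdocument if it already exists...
    new UpdateManyModel<BsonDocument>(
        filter.Eq("name", "foo") & filter.Eq("bars.name", "qux"),
        update.Set("bars.$.somefield", 2)),

    // ...otherwise push a new one.
    new UpdateManyModel<BsonDocument>(
        filter.Eq("name", "foo") & filter.Ne("bars.name", "qux"),
        update.Push("bars", new BsonDocument { { "somefield", 2 }, { "name", "qux" } }))
};

collection.BulkWrite(operations);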
I'm searching an index main-kittens for docs of type Kitty. Now I want to run an experiment: for some of the users, I want to search experiment-kittens instead. The type is the same (Kitty), and all the fields have the same values as in the main index, but while the field Bio is always empty in the main index, in the experimental one it stores huge strings.
Now, the problem is that I can't store that Bio for all kittens due to memory/disk limitations. So experiment-kittens has only the most recent kittens (say, the last month).
I want the search to be left intact for the most users (i.e., always use the main index). For the picked ones, I want to merge the results. The logic should be:
search userquery + date_created < 1 month ago in experiment-kittens
search userquery + date_created > 1 month ago in main-kittens
The results should be sorted by date_created, and there are too many of them to sort in my app.
Is there a way to ask elastic to execute two different queries on two indices and merge the results?
(I'm also sure there could be more optimal solutions to the problem, please tell me if you have some).
You can search across multiple indices with a single Elasticsearch request by separating the index names with a comma. Then you can use the missing filter to differentiate between the two indices (one having Bio field and the other not). Then you can use the range filter to filter based on the value of date_created field. Finally you can use the sort API to sort based on the values of date_created field.
Putting all of these together, the Elasticsearch query that you need is as follows:
POST main-kittens,experiment-kittens/Kitty/_search
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"bool": {
"should": [
{
"bool": {
"must": [
{
"missing": {
"field": "Bio"
}
},
{
"range": {
"date_created": {
"to": "now-1M"
}
}
}
]
}
},
{
"bool": {
"must_not": [
{
"missing": {
"field": "Bio"
}
}
],
"must": [
{
"range": {
"date_created": {
"from": "now-1M"
}
}
}
]
}
}
]
}
}
}
},
"sort": [
{
"date_created": {
"order": "desc"
}
}
]
}
You can replace "match_all": {} with any custom query that you may have.
I have a NEST query:
var descriptor = new SearchDescriptor<SomePoco>()
.TrackScores()
.From(request.Page == 1 ? 0 : (request.Page - 1) * request.PageSize)
.Size(request.PageSize)
.MatchAll()
.FacetFilter("some_name", a => new FilterContainer(new AndFilter { Filters = CreatePocoSearchFilter(request) }))
.SortDescending("_score");
var results = _client.Search<SomePoco>(x => descriptor);
The FacetFilter is returning the total number of HITS from my query. I want to split these hits out using a property on the search request. So, in the search request I have a list of ints. I want to know how many hits were returned for each int in that list.
I hope this makes sense.
I've tried adding a FacetTerm; this gives me the total number of hits for every value of the int field instead of just the ones that pertain to the search. I understand the query and filter stages, and have tried to change the descriptor accordingly with no luck.
Thanks.
There are several ways to do this. My suggestion would be to use a filtered query, and then use a Terms aggregation or facet (facets are deprecated so I recommend moving away from those) on the results.
With an Aggregation:
POST /_search
{
"query": {
"filtered": {
"query": { "match_all": {}},
"filter": {
"terms": {
"<FIELD_NAME>": [1, 2, 3, 42]
}
}
}
},
"aggs": {
"countOfInts": {
"terms": {
"field": "<FIELD_NAME>",
"size": 10
}
}
}
}
With a Facet:
POST /_search
{
"query": {
"filtered": {
"query": { "match_all": {}},
"filter": {
"terms": {
"<FIELD_NAME>": [1, 2, 3, 42]
}
}
}
},
"facets": {
"countOfInts": {
"terms": {
"field": "<FIELD_NAME>",
"size": 10
}
}
}
}
You could also do the same thing with a plain match_all query and the filter inside the facet or aggregation. The way I listed it above will perform a little better because it reduces the working set before building the agg/facet.
I did not include the NEST code because, depending on the version of the DLLs you are using, the format can be somewhat different.