I'm trying out DocumentDB as a possible data store for a new application. The app has to handle a lot of data, so I used the Data Migration tool to load a large number of documents into a collection.
Most of the queries from my app will be aggregating and summing, so I'm using documentdb-lumenize. The code sample for calling that stored procedure from C# has me doing something like this:
var configString = #"{
cubeConfig: {
groupBy: 'year',
field: 'Amount',
f: 'sum'
},
filterQuery: 'SELECT * FROM TestLargeData t'
}";
var config = JsonConvert.DeserializeObject<object>(configString);
var result = await _client.ExecuteStoredProcedureAsync<dynamic>("my/sproc/link", config);
The result I get back looks like this:
{
  "cubeConfig": {
    "groupBy": "year",
    "field": "Amount",
    "f": "sum"
  },
  "filterQuery": "SELECT * FROM TestLargeData t",
  "continuation": "-RID:rOtjAPc4TgBxFwAAAAAAAA==#RT:6#TRC:6000",
  "stillQueueing": false,
  "savedCube": {
    "config": {
      "groupBy": "year",
      "field": "Amount",
      "f": "sum"
    },
    "cellsAsCSVStyleArray": [
      ["year", "_count", "Amount_sum"],
      [2006, 4825, 1391399555.74],
      [2007, 1175, 693886378]
    ],
    "summaryMetrics": {}
  },
  "example": {
    "year": 2007,
    "SomeOtherField1": "SomeOtherValue1",
    "SomeOtherField2": "SomeOtherValue2",
    "Amount": 12000,
    "id": "0ee80b66-7fa7-40c1-9124-292c01059562",
    "_rid": "...",
    "_self": "...",
    "_etag": "\"...\"",
    "_attachments": "attachments/",
    "_ts": ...
  }
}
The _count values indicate that I got back 6,000 documents' worth of aggregated data. There are a million documents in the collection (I wanted to test big!).
I see the "continuation" value in the result. But StoredProcedureResponse doesn't have an ExecuteNextAsync method like the DocumentQuery class does. How would I use the DocumentDB API to request the next part of the data?
I'm the author of documentdb-lumenize. If you send what's returned back in as the sole parameter, the documentdb-lumenize sproc will know how to deal with the continuation token. You'll have to keep calling it until the continuation token comes back empty.
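For example, a minimal sketch of that loop in C#, continuing the code from the question (the sproc link and the "continuation" field name follow the question and the response shown above; this is illustrative, not tested):

// Requires Microsoft.Azure.Documents.Client for StoredProcedureResponse<T>.
// Keep re-executing the sproc, feeding the entire previous result back in
// as the sole parameter, until the continuation token comes back empty.
object state = config;
dynamic result;
do
{
    StoredProcedureResponse<dynamic> response =
        await _client.ExecuteStoredProcedureAsync<dynamic>("my/sproc/link", state);
    result = response.Response;
    state = result;
} while (!string.IsNullOrEmpty((string)result.continuation));
// result.savedCube now holds the fully aggregated cube.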
That said, I'm really surprised it only did 6,000 in one round trip; I generally get 20-50K per round trip. Maybe you have a lower-spec'd collection? Maybe it's doing an index-less full scan?
Submit an issue in the GitHub repo if you want more 1:1 help with this.
Problem Statement:
I need to get Customers and all their corresponding Orders stored in the following format in MongoDB.
The current data resides in two different tables, Customers and Orders, in an MSSQL database.
Expected Collection in MongoDB:
{
  "_data": [
    {
      "customer": {
        "customerID": "1240014902244000006",
        "firstName": "Jerry",
        "surname": "Galoski",
        "orders": [
          {
            "_data": {
              "orderId": "1240014912244000016",
              "suburb": "Delhi",
              "postcode": "110051",
              "country": "India"
            },
            "_links": [
              { "rel": "self", "href": "/customer/1240014902244000006/orders", "action": "GET" },
              { "rel": "customer", "href": "/customer/1240014902244000006", "action": "GET" }
            ]
          },
          {
            "_data": {
              "orderId": "1240014912244000024",
              "suburb": "Delhi",
              "postcode": "110051",
              "country": "India"
            },
            "_links": [
              { "rel": "customer", "href": "/customer/1240014902244000006/orders", "action": "GET" }
            ]
          }
        ]
      },
      "_links": [
        { "rel": "self", "href": "/customer/1240014902244000006", "action": "GET" },
        { "rel": "orders", "href": "/customer/1240014902244000006/orders", "action": "GET" }
      ]
    }
  ]
}
Steps taken to achieve this:
The first step is to collate all order data, which I'm doing with the SQL query below:
SELECT DISTINCT cor.CustomerId,
       AllOrderIds = STUFF(
           (
               SELECT DISTINCT ', ' + CAST(e2.OrderId AS VARCHAR(MAX))
               FROM CustomerOrderRelation e2
               WHERE e2.CustomerId = cor.CustomerId
               FOR XML PATH('')
           ), 1, 2, ''
       )
FROM CustomerOrderRelation cor;
Then I use the above subquery in the customers SQL query to get the complete output.
Then, in the C# code, I process the Customer records one by one to generate the hyperlinks for each.
I need to generate hyperlinks for the Order records as well.
Once I have the hyperlinks generated in C#, I call MongoDB's InsertMany command to insert the data (see the sketch below). As a result, I have two collections in MongoDB, one each for Customers and Orders.
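As a rough illustration, here is a hedged sketch of that step using the MongoDB .NET driver; the connection string, database name, and collection name are assumptions, and the field names follow the expected collection shown above:

using MongoDB.Bson;
using MongoDB.Driver;

// One customer document in the expected shape, with its HATEOAS-style
// _links array generated from the customer id.
var customerId = "1240014902244000006";
var doc = new BsonDocument
{
    { "customer", new BsonDocument
        {
            { "customerID", customerId },
            { "firstName", "Jerry" },
            { "surname", "Galoski" }
        } },
    { "_links", new BsonArray
        {
            new BsonDocument { { "rel", "self" }, { "href", $"/customer/{customerId}" }, { "action", "GET" } },
            new BsonDocument { { "rel", "orders" }, { "href", $"/customer/{customerId}/orders" }, { "action", "GET" } }
        } }
};

// Connection string, database, and collection names are placeholders.
var client = new MongoClient("mongodb://localhost:27017");
var customers = client.GetDatabase("shop").GetCollection<BsonDocument>("customers");
customers.InsertMany(new[] { doc });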
Finally, I'll use MongoDB's MapReduce functionality to merge the two collections:
db.customers.mapReduce(mapCustomers, reduce, { "out": { "reduce": "customerAndOrders" } });
db.orders.mapReduce(mapOrders, reduce, { "out": { "reduce": "customerAndOrders" } });
db.customerAndOrders.find().pretty();
I need to do the merge because I can't find a clever way to do the following operations:
Get customers and their consolidated orders in one single entity
Generate Hyperlinks for customers and all their orders
Push this data into MongoDB with customer details and their order details
Suggestion required:
Is there a better way to accomplish the above?
I have documents like this in my CosmosDB database:
{
  "id": "12345",
  "filename": "foo.txt",
  "versions": {
    "1": {
      "storageAccount": "blob123",
      "size": 33
    },
    "2": {
      "storageAccount": "blob123",
      "size": 42
    }
  }
}
(this is a simplified sample)
I need to query on the "storageAccount" property to check whether there are files stored on a given storage account, but I can't find a way to express "for each version".
I tried this, but of course it doesn't work:
SELECT TOP 1 *
FROM c
JOIN v IN c.versions
WHERE v.storageAccount = 'blob123'
Apparently JOIN only works on arrays, not dictionaries. Is there a way to query items in a dictionary?
As a workaround, I can use a UDF, but the performance and cost are terrible (1,200 RUs for just 2,000 documents when there is no matching document...)
EDIT: updated to more closely reflect the actual use case
Unfortunately, this isn't possible today. You cannot iterate over object keys in Cosmos's SQL.
I'd recommend changing the schema to something like:
{
  "id": "12345",
  "filename": "foo.txt",
  "versions": [
    {
      "id": "1",
      "storageAccount": "blob123",
      "size": 33
    },
    {
      "id": "2",
      "storageAccount": "blob123",
      "size": 42
    }
  ]
}
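With the array schema in place, a JOIN like the one from the question works. A minimal sketch with the .NET SDK (the collectionLink variable is an assumption; note that with a JOIN you project c rather than *):

// JOIN iterates the versions array, so this finds a document that has
// at least one version stored on the given account.
var matches = _client.CreateDocumentQuery<dynamic>(
        collectionLink,
        "SELECT TOP 1 VALUE c FROM c JOIN v IN c.versions WHERE v.storageAccount = 'blob123'")
    .ToList();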
Additionally, you could write a User Defined Function that returns the keys of an object for you, but that will increase your RU cost, though possibly by less than a sproc would.
I'm building a product search engine with Elasticsearch in my .NET application, using the NEST client, and there is one thing I'm having trouble with: getting a distinct set of values.
I'm searching for products, of which there are many thousands, but of course I can only return 10 or 20 at a time to the user; for this, paging works fine. But besides this primary result, I want to show my users a list of the brands found within the complete search result, to present these for filtering.
I have read that I should use a Terms Aggregation for this, but I couldn't get anything better than the code below, and it still doesn't really give me what I want, because it splits values like "20th Century Fox" into three separate values.
var brandResults = client.Search<Product>(s => s
    .Query(query)
    .Aggregations(a => a
        .Terms("my_terms_agg", t => t
            .Field(p => p.BrandName)
            .Size(250))));

var agg = brandResults.Aggs.Terms("my_terms_agg");
Is this even the right approach, or should I use something totally different? And how can I get the correct, complete values, not split by spaces? (Though I guess that is what you get when you ask for a list of 'terms'?)
What I'm looking for is what you would get if you did this in MS SQL:
SELECT DISTINCT BrandName FROM [Table To Search] WHERE [Where clause without paging]
You are correct that what you want is a terms aggregation. The problem you're running into is that ES is splitting (analyzing) the BrandName field as it indexes it; this is the expected default behavior for a string field in ES.
What I recommend is that you change BrandName into a "multi-field". This will allow you to search on all the various parts, as well as run a terms aggregation on the "not_analyzed" (i.e. the full "20th Century Fox") version.
Here is the documentation from ES.
https://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/mapping-multi-field-type.html
[UPDATE]
If you are using ES version 1.4 or newer, the syntax for multi-fields is a little different now.
https://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_multi_fields.html#_multi_fields
Here is a full working sample that illustrates the point in ES 1.4.4. Note that the mapping specifies a "not_analyzed" version of the field.
PUT hilden1

PUT hilden1/type1/_mapping
{
  "properties": {
    "brandName": {
      "type": "string",
      "fields": {
        "raw": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}

POST hilden1/type1
{
  "brandName": "foo"
}

POST hilden1/type1
{
  "brandName": "bar"
}

POST hilden1/type1
{
  "brandName": "20th Century Fox"
}

POST hilden1/type1
{
  "brandName": "20th Century Fox"
}

POST hilden1/type1
{
  "brandName": "foo bar"
}

GET hilden1/type1/_search
{
  "size": 0,
  "aggs": {
    "analyzed_field": {
      "terms": {
        "field": "brandName",
        "size": 10
      }
    },
    "non_analyzed_field": {
      "terms": {
        "field": "brandName.raw",
        "size": 10
      }
    }
  }
}
Results of the last query:
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "non_analyzed_field": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        { "key": "20th Century Fox", "doc_count": 2 },
        { "key": "bar", "doc_count": 1 },
        { "key": "foo", "doc_count": 1 },
        { "key": "foo bar", "doc_count": 1 }
      ]
    },
    "analyzed_field": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        { "key": "20th", "doc_count": 2 },
        { "key": "bar", "doc_count": 2 },
        { "key": "century", "doc_count": 2 },
        { "key": "foo", "doc_count": 2 },
        { "key": "fox", "doc_count": 2 }
      ]
    }
  }
}
Notice that the not-analyzed field keeps "20th Century Fox" and "foo bar" together, whereas the analyzed field breaks them up.
I had a similar issue. I was displaying search results and wanted to show counts on the category and sub-category.
You're right to use aggregations. I also had the issue of strings being tokenised (i.e. "20th Century Fox" being split); this happens because the fields are analysed. For me, the fix was adding the following mappings (i.e. telling ES not to analyse that field):
"category": {
"type": "nested",
"properties": {
"CategoryNameAndSlug": {
"type": "string",
"index": "not_analyzed"
},
"SubCategoryNameAndSlug": {
"type": "string",
"index": "not_analyzed"
}
}
}
As jhilden suggested, if you use this field for more than one purpose (e.g. search and aggregation), you can set it up as a multi-field: one version gets analysed and used for searching, while the other is left not-analysed for aggregation.
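Putting that together in NEST, a sketch against the question's code (the field name "brandName.raw" is an assumption that combines NEST's default camel-casing with the multi-field mapping above, and the bucket API shown matches NEST 1.x):

var brandResults = client.Search<Product>(s => s
    .Size(0) // skip hits; we only want the aggregation buckets
    .Query(query)
    .Aggregations(a => a
        .Terms("my_terms_agg", t => t
            .Field("brandName.raw") // the not_analyzed sub-field
            .Size(250))));

var agg = brandResults.Aggs.Terms("my_terms_agg");
foreach (var bucket in agg.Items)
    Console.WriteLine($"{bucket.Key}: {bucket.DocCount}");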
I am using the NEST API (v0.12.0.0) to interface with an Elasticsearch (v1.0.1) index, and I just started receiving a JsonSerializationException when retrieving my data. I'm not sure if this is a NEST issue or something else, but it started happening at random and we haven't made any major changes to our implementation or infrastructure.
I am attempting to retrieve the Ids of my data (stored as a Guid) with a typed Search<>(), and I get an exception when the data is processed by JSON.NET.
client.Search<ESEventItem>(s => s
    .Index("dev-events004")
    .Fields(f => f.Id)
    .Size(100000)
    .Type("event")
    .MatchAll())
    .Documents.ToList()
Running this same query manually in Sense produces no noticeable issues:
POST /dev-events004/event/_search
{
  "size": 100000,
  "query": {
    "match_all": {}
  },
  "fields": [
    "id"
  ]
}

The response:
{
  "took": 2088,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 19257,
    "max_score": 1,
    "hits": [
      {
        "_index": "dev-events004",
        "_type": "event",
        "_id": "670a1055-cbe3-480e-b807-a2b500f9dfb3",
        "_score": 1,
        "fields": {
          "id": [
            "670a1055-cbe3-480e-b807-a2b500f9dfb3"
          ]
        }
      },
      /* ... additional results ... */
    ]
  }
}
If I perform a raw, untyped query with Fields(new[] { "Id" }), it does not throw an exception. Likewise, if I return the whole ESEventItem object, rather than just the Id field, it also works without an exception.
To the NEST developer: this question is mirrored as an issue on the GitHub project.
This is due to the fact that Elasticsearch 1.0 changed how fields are returned. The upcoming NEST 1.0 will support this.
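The Sense output above hints at the root cause: in ES 1.0, fields come back as arrays ("id": ["670a..."]), so deserializing them into the scalar Guid on ESEventItem throws. One possible interim workaround (untested against NEST 0.12; the projection type is made up for illustration) is to deserialize into a collection-typed projection instead:

// Hypothetical projection type whose Id matches the array shape ES 1.0 returns.
public class ESEventItemIdOnly
{
    public List<Guid> Id { get; set; }
}

var ids = client.Search<ESEventItemIdOnly>(s => s
        .Index("dev-events004")
        .Type("event")
        .Fields("id")
        .Size(100000)
        .MatchAll())
    .Documents
    .SelectMany(d => d.Id)
    .ToList();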
I'm trying to count the number of 'comments' related to a product in a Couchbase bucket. That part is easy for a "full" set of data; it's just a simple map/reduce. Things get tricky when I want to limit it to only products that have had changes within a date range. I can do this as two different views in Couchbase: one that gets the product IDs where dateCreated falls within my range, and one that I pass those IDs to so that it calculates my stats. The performance of this approach is horrible, though: the keys for the second query aren't necessarily contiguous, so I can't do a start/end range on them. I'm using the .NET 2.2 client with Couchbase 4.x.
I'm open to any options; i.e. a super-awesome-do-it-all-in-one-call view, or following the two-view approach if the client has some capacity for bulk gets against non-contiguous keys in a view (I can't find anything on this topic).
Here's my simplified example schema:
{
  "comment": {
    "key": "key1",
    "title": "yay",
    "productId": "product1",
    "dateCreated": "2016,11,30"
  },
  "comment": {
    "key": "key2",
    "title": "booo",
    "productId": "product1",
    "dateCreated": "2016,12,30"
  }
}
Not sure if this is what you want (I'm also not sure how to translate this to C#), but say you have two documents with the IDs comment::1 and comment::2, each in this format:
{
  "key": "key2",
  "title": "booo",
  "productId": "product1",
  "dateCreated": "2016,12,30"
}
You can define a view (let's call it comments_by_time):
Map
function (doc, meta) {
  if (doc.dateCreated) {
    var dateParts = doc.dateCreated.split(",");
    dateParts = dateParts.map(Number);
    emit(dateParts, doc.productId);
  }
}
Reduce
_count
Then you can use the View Query API to do a startkey/endkey range over your documents.
Endpoint:
http://<couchbase>:8092/<bucket>/_design/<view>/_view/comments_by_time
Get count of all comments
?reduce=true
{"rows":[ {"key":null,"value":2} ] }
Get documents before a date
?reduce=false&endkey=[2016,12,1]
{"total_rows":2,"rows":[
{"id":"comment::1","key":[2016,11,30],"value":"product1"}
]
}
Between dates
?reduce=false&startkey=[2016,12,1]&endkey=[2017,1,1]
{"total_rows":2,"rows":[
{"id":"comment::2","key":[2016,12,30],"value":"product1"}
]
}
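For the C# side, which I left open above, here is a hedged sketch with the Couchbase .NET SDK 2.x that the asker mentions; the bucket name and the design-document name ("comments") are assumptions:

using System;
using Couchbase;
using Couchbase.Views;

var cluster = new Cluster(); // default config connects to localhost
using (var bucket = cluster.OpenBucket("default"))
{
    // Same "between dates" range as the last REST example above;
    // the view emits [year, month, day] keys, so arrays work as bounds.
    var query = bucket.CreateQuery("comments", "comments_by_time")
        .StartKey(new object[] { 2016, 12, 1 })
        .EndKey(new object[] { 2017, 1, 1 })
        .Reduce(false); // set to true to get the _count instead of rows

    var result = bucket.Query<string>(query);
    foreach (var row in result.Rows)
        Console.WriteLine($"{row.Id}: {row.Value}"); // e.g. comment::2: product1
}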