I'm using ElasticLowLevelClient to index Elasticsearch data, because the data needs to be indexed as a raw string and I don't have access to the POCO objects. I can successfully index an individual object by calling:
client.Index<object>(indexName, message.MessageType, message.Id,
new Elasticsearch.Net.PostData<object>(message.MessageJson));
How can I do a bulk insert into the index using ElasticLowLevelClient? The bulk insert APIs all require a POCO for the document being indexed, which I don't have, e.g.:
ElasticsearchResponse<T> Bulk<T>(string index, PostData<object> body,
Func<BulkRequestParameters, BulkRequestParameters> requestParameters = null)
I could make the API calls in parallel for each object but that seems inefficient.
The generic type parameter on the low level client is the type of the expected response.
If you're using the low level client exposed on the high level client through the .LowLevel property, you can send a bulk request where your documents are JSON strings, as follows in 5.x:
var client = new ElasticClient(settings);
var messages = new []
{
new Message
{
Id = "1",
MessageType = "foo",
MessageJson = "{\"name\":\"message 1\",\"content\":\"foo\"}"
},
new Message
{
Id = "2",
MessageType = "bar",
MessageJson = "{\"name\":\"message 2\",\"content\":\"bar\"}"
}
};
var indexName = "my-index";
var bulkRequest = messages.SelectMany(m =>
new[]
{
client.Serializer.SerializeToString(new
{
index = new
{
_index = indexName,
_type = m.MessageType,
_id = m.Id
}
}, SerializationFormatting.None),
m.MessageJson
});
var bulkResponse = client.LowLevel.Bulk<BulkResponse>(string.Join("\n", bulkRequest) + "\n");
which sends the following bulk request
POST http://localhost:9200/_bulk
{"index":{"_index":"my-index","_type":"foo","_id":"1"}}
{"name":"message 1","content":"foo"}
{"index":{"_index":"my-index","_type":"bar","_id":"2"}}
{"name":"message 2","content":"bar"}
A few important points
We need to build the bulk request ourselves to use the low level bulk API call. Since our documents are already strings, it makes sense to build a string request.
We serialize an anonymous type with no indenting for the action and metadata for each bulk item.
The MessageJson cannot contain any newline characters in it, as this will break the bulk API; newline characters are the delimiters for JSON objects within the body.
Because we're using the low level client exposed on the high level client, we can still take advantage of the high level requests, responses and serializer. The bulk request returns a BulkResponse, which you can work with as you normally do when sending a bulk request with the high level client.
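For example, a minimal sketch of inspecting that strongly typed response for failures (using the bulkResponse from above):
if (!bulkResponse.IsValid || bulkResponse.Errors)
{
    foreach (var itemWithError in bulkResponse.ItemsWithErrors)
    {
        // log the id and failure reason for each bulk item that failed
        Console.WriteLine($"Failed to index document {itemWithError.Id}: {itemWithError.Error?.Reason}");
    }
}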
Related
My goal is to get the data from the database, serialize it into JSON format, and send it to the API. The problem is that I don't know how to produce the right JSON format for the API.
It's a C# worker service collecting data from a database.
From the database I get:
1|John|Wick|Action|101
My API needs this JSON:
{
"Name":"John",
"Surname":"Wick",
"Type":"Action",
"Length":"101"
}
When I serialize to JSON in C#:
var jsonString = Newtonsoft.Json.JsonConvert.SerializeObject(values);
I get:
[John,Wick,Action,101]
Is there any way to add the names of the values to the JSON?
First, split the database result based on the delimiter:
string dbResult = ...; //1|John|Wick|Action|101
string[] dbResults = dbResult.Split("|");
Second, create an anonymous object (if you don't want to introduce a data model class/struct/record):
var result = new
{
Name = dbResults[1], // dbResults[0] holds the leading id ("1"), so the mapped fields start at index 1
Surname = dbResults[2],
Type = dbResults[3],
Length = dbResults[4],
};
Third, serialize the anonymous object:
var jsonString = Newtonsoft.Json.JsonConvert.SerializeObject(result);
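For reference, jsonString will then contain the compact form below; if you want it pretty-printed like in the question, you can pass Newtonsoft.Json.Formatting.Indented as the second argument to SerializeObject:
{"Name":"John","Surname":"Wick","Type":"Action","Length":"101"}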
I am using .NET 6.0.1 with ASP.NET MVC.
var node = new Uri("http://localhost:9200");
var setting = new ConnectionSettings(node);
var client = new ElasticClient(setting);
var news = new News
{
NewsTitle = "TestTitle"
};
client.Index(news, idx => idx.Index("NewsTitle"));
var response = client.Get<News>(1, idx => idx.Index("NewsTitle"));
Elasticsearch is installed and running, but when I run these lines of code, nothing happens. No index is created.
There are a few things to consider:
client.Index(news, idx => idx.Index("NewsTitle")); doesn't check the response of the index document request. It's a good idea to check the response to see if it succeeded - NEST has a convenient .IsValid property on all responses that indicates whether the response is considered valid.
var response = client.Get<News>(1, idx => idx.Index("NewsTitle")); attempts to get a document from the "NewsTitle" index with an id of 1, but the document just indexed does not have an id, so Elasticsearch will generate a random one when it indexes the document.
NEST has a convention where, if the POCO to index has an Id property, it will use this as the id of the document. So, if the News class is modified to contain an int Id property, and the instance indexed is assigned an Id value of 1, the get request will return the indexed document.
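A rough sketch of both points, assuming a News class you can modify (the "news" index name is illustrative):
public class News
{
    // NEST will infer this as the document _id
    public int Id { get; set; }
    public string NewsTitle { get; set; }
}

var indexResponse = client.Index(new News { Id = 1, NewsTitle = "TestTitle" }, idx => idx.Index("news"));
if (!indexResponse.IsValid)
{
    // the response carries details about what went wrong
    Console.WriteLine(indexResponse.DebugInformation);
}

var getResponse = client.Get<News>(1, idx => idx.Index("news"));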
I am using NEST (C#) and the ingest attachment plugin to ingest 10s of thousands of documents into an Elastic search instance. Unfortunately, after a while everything just stands still - i.e. no more documents are ingested. The log shows:
[2019-02-20T17:35:07,528][INFO ][o.e.m.j.JvmGcMonitorService] [BwAAiDl] [gc][7412] overhead, spent [326ms] collecting in the last [1s]
I'm not sure if this tells anyone anything? By the way, are there more efficient ways to ingest many documents (rather than using thousands of REST requests)?
I am using this kind of code:
client.Index(new Document
{
Id = Guid.NewGuid(),
Path = somePath,
Content = Convert.ToBase64String(File.ReadAllBytes(somePath))
}, i => i.Pipeline("attachments"));
Define the pipeline:
client.PutPipeline("attachments", p => p
.Description("Document attachment pipeline")
.Processors(pr => pr
.Attachment<Document>(a => a
.Field(f => f.Content)
.TargetField(f => f.Attachment)
)
.Remove<Document>(r => r
.Field(f => f.Content)
)
)
);
The log indicates that a considerable amount of time is being spent performing garbage collection on the Elasticsearch server side; this is very likely to be the cause of the long pauses you are seeing. If you have monitoring enabled on the cluster (ideally exporting such data to a separate cluster), I would analyse that data to see if it sheds some light on why large GCs are happening.
are there more efficient ways to ingest many documents (rather than using thousands of REST requests)?
Yes; you are currently indexing each attachment in a separate index request. Depending on the size of each attachment once base64 encoded, you may want to send several in one bulk request:
// Your collection of documents
var documents = new[]
{
new Document
{
Id = Guid.NewGuid(),
Path = "path",
Content = "content"
},
new Document
{
Id = Guid.NewGuid(),
Path = "path",
Content = "content" // base64 encoded bytes
}
};
var client = new ElasticClient();
var bulkResponse = client.Bulk(b => b
.Pipeline("attachments")
.IndexMany(documents)
);
If you're reading documents from the filesystem, you probably want to lazily enumerate them and send bulk requests. Here, you can make use of the BulkAll helper method too.
First have some lazily enumerated collection of documents
public static IEnumerable<Document> GetDocuments()
{
var count = 0;
while (count++ < 20)
{
yield return new Document
{
Id = Guid.NewGuid(),
Path = "path",
Content = "content" // base64 encoded bytes
};
}
}
Then configure the BulkAll call
var client = new ElasticClient();
// set up the observable configuration
var bulkAllObservable = client.BulkAll(GetDocuments(), ba => ba
.Pipeline("attachments")
.Size(10)
);
var waitHandle = new ManualResetEvent(false);
Exception exception = null;
// set up what to do in response to next bulk call, exception and completion
var bulkAllObserver = new BulkAllObserver(
onNext: response =>
{
// perform some action e.g. incrementing counter
// to indicate how many have been indexed
},
onError: e =>
{
exception = e;
waitHandle.Set();
},
onCompleted: () =>
{
waitHandle.Set();
});
// start the observable process
bulkAllObservable.Subscribe(bulkAllObserver);
// wait for indexing to finish, either forever,
// or set a max timeout as here.
waitHandle.WaitOne(TimeSpan.FromHours(1));
if (exception != null)
throw exception;
Size dictates how many documents to send in each request. There are no hard and fast rules for how big this can be for your cluster, because it can depend on a number of factors including the ingest pipeline, the mapping of the documents, the byte size of the documents, the cluster hardware, etc. You can configure the observable to retry documents that fail to be indexed, and if you see es_rejected_execution_exception, you are at the limits of what your cluster can concurrently handle.
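As an illustration of that retry configuration, a sketch (the values and the predicate are illustrative, not recommendations):
var bulkAllObservable = client.BulkAll(GetDocuments(), ba => ba
    .Pipeline("attachments")
    .Size(10)
    // retry a failed bulk request up to 2 times, waiting 30 seconds between retries
    .BackOffRetries(2)
    .BackOffTime("30s")
    // only retry documents that were rejected because the cluster was overloaded
    .RetryDocumentPredicate((item, document) =>
        item.Error != null && item.Error.Type == "es_rejected_execution_exception")
);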
Another recommendation concerns document ids. I see you're using new Guids for the ids of documents, which implies to me that you don't care what the value is for each document. If that is the case, I would recommend not sending an Id value, and instead allowing Elasticsearch to generate an id for each document. This is very likely to result in an improvement in performance (I believe the implementation has changed slightly in Elasticsearch and Lucene since this was written, but the point still stands).
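A rough sketch of that last point, assuming the Id property on the Document POCO can be made nullable; as far as I recall, when the inferred id is null, NEST omits the _id from the bulk metadata and Elasticsearch generates one:
public class Document
{
    // leave unset so Elasticsearch auto-generates the id
    public Guid? Id { get; set; }
    public string Path { get; set; }
    public string Content { get; set; }
    public Attachment Attachment { get; set; }
}

var bulkResponse = client.Bulk(b => b
    .Pipeline("attachments")
    .IndexMany(documents) // documents created without setting Id
);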
I am trying to bulk index a collection of elements into an Elasticsearch index using NEST inside a .NET Core application.
Currently what I have is working and the elements are saved, but they are not saved where I intend them to be.
My client creation:
protected ElasticClient GetClient()
{
var node = new Uri("http://localhost:9200/");
var settings = new ConnectionSettings(node)
.DefaultIndex("TestIndex")
.PrettyJson(true);
return new ElasticClient(settings);
}
Here is how I create the descriptor to bulk index all the data:
protected BulkDescriptor GenerateBulkDescriptor<T>(IEnumerable<T> elements, string indexName) where T: class, IIndexable
{
var bulkIndexer = new BulkDescriptor();
foreach (var element in elements)
bulkIndexer.Index<T>(i => i
.Document(element)
.Id(element.Id)
.Index(indexName));
return bulkIndexer;
}
Finally, once I have this, here is how I index the data
var descriptor = GenerateBulkDescriptor(indexedElements, "indexed_elements");
var response = GetClient().Bulk(descriptor);
But if I look at how it's stored in the Elasticsearch index, this is what I have:
How can I know whether it was created under the TestIndex index? Because as far as I can see, there is just one index created.
Thanks a lot in advance.
When defining the index operations on the BulkDescriptor, you are explicitly setting the index to use for each operation
foreach (var element in elements)
bulkIndexer.Index<T>(i => i
.Document(element)
.Id(element.Id)
.Index(indexName));
where indexName is "indexed_elements". This is why all documents are indexed into this index and you do not see any in "TestIndex".
The Bulk API allows multiple operations to be defined, which may include indexing documents into different indices. When the index is specified directly on an operation, that will be the index used. If all index operations on a Bulk API call are to take place against the same index, you can omit the index on each operation and instead, specify the index to use on the Bulk API call directly
var defaultIndex = "default_index";
var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
var settings = new ConnectionSettings(pool)
.DefaultIndex(defaultIndex);
var client = new ElasticClient(settings);
var people = new []
{
new Person { Id = 1, Name = "Paul" },
new Person { Id = 2, Name = "John" },
new Person { Id = 3, Name = "George" },
new Person { Id = 4, Name = "Ringo" },
};
var bulkResponse = client.Bulk(b => b
.Index("people")
.IndexMany(people)
);
which sends the following request
POST http://localhost:9200/people/_bulk
{"index":{"_id":"1","_type":"person"}}
{"id":1,"name":"Paul"}
{"index":{"_id":"2","_type":"person"}}
{"id":2,"name":"John"}
{"index":{"_id":"3","_type":"person"}}
{"id":3,"name":"George"}
{"index":{"_id":"4","_type":"person"}}
{"id":4,"name":"Ringo"}
Note that the URI is /people/_bulk and that the JSON object representing each operation does not contain an "_index".
If you omit the .Index() on the Bulk API call, it will use the DefaultIndex configured on ConnectionSettings:
var bulkResponse = client.Bulk(b => b
.IndexMany(people)
);
which yields
POST http://localhost:9200/_bulk
{"index":{"_id":"1","_index":"default_index","_type":"person"}}
{"id":1,"name":"Paul"}
{"index":{"_id":"2","_index":"default_index","_type":"person"}}
{"id":2,"name":"John"}
{"index":{"_id":"3","_index":"default_index","_type":"person"}}
{"id":3,"name":"George"}
{"index":{"_id":"4","_index":"default_index","_type":"person"}}
{"id":4,"name":"Ringo"}
You can also specify a default index to use for a given POCO type on ConnectionSettings with DefaultMappingFor<T>(), where T is your POCO type.
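For example, a minimal sketch of that last option (the index name is illustrative):
var settings = new ConnectionSettings(pool)
    .DefaultIndex(defaultIndex)
    .DefaultMappingFor<Person>(m => m
        .IndexName("people")
    );

var client = new ElasticClient(settings);

// with no index specified on the bulk call or the operations,
// Person documents are now indexed into "people"
var bulkResponse = client.Bulk(b => b.IndexMany(people));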
After some tests and attempts, I have found a solution.
First of all, it was a problem with the configured index: once I set it to lowercase, the index worked fine and data was indexed into it.
Then, I had the problem of indexing data under a specific "path" inside the same index; finally I found the Type solution from NEST, also taking advantage of the DefaultMappingFor suggested by Russ in the previous answer.
Client definition:
var node = new Uri(_elasticSearchConfiguration.Node);
var settings = new ConnectionSettings(node)
.DefaultMappingFor<IndexedElement>(m => m
.IndexName(_elasticSearchConfiguration.Index)
.TypeName(nameof(IndexedElement).ToLower()))
.PrettyJson(true)
.DisableDirectStreaming();
var client = new ElasticClient(settings);
Then, the BulkDescriptor creation:
var bulkIndexer = new BulkDescriptor();
foreach (var element in elements)
bulkIndexer.Index<IndexedElement>(i => i
.Document(element)
.Type(nameof(IndexedElement).ToLower())
.Id(element.Id)
);
And finally, the bulk call:
client.Bulk(bulkIndexer);
Now, if I perform a call to the index, I can see this:
{
"testindex": {
"aliases": {},
"mappings": {
"indexedelement": {
[...]
}
Thank you Russ for your help, and thanks to everyone who has had a look at the post.
UPDATE
Finally, it seems that the only problem was with the default index, which must be lowercase, so specifying the type with the name of the POCO itself is not necessary, as @RussCam correctly pointed out in the comments above. After changing the default index to lowercase, all the different approaches worked fine.
Thank you all again
I'm using the RetrieveEntityRequest to get an entity's attributes' metadata:
RetrieveEntityRequest entityRequest = new RetrieveEntityRequest
{
EntityFilters = EntityFilters.Attributes,
LogicalName = joinedEntityName.Value,
};
RetrieveEntityResponse joinedEntityMetadata = (RetrieveEntityResponse)_service.Execute(entityRequest);
Now, consider I need to execute this request for multiple entities. Is it possible to do this in one execution (maybe not with RetrieveEntityRequest), instead of one request for each entity?
You can't do it with RetrieveEntityRequest. However, you can do a RetrieveMetadataChangesRequest to get what you want. It's misleadingly named for your purposes, but if you don't provide a ClientVersionStamp property, it will simply retrieve everything you've specified in the Query property.
Here's a simple example where you'd retrieve the metadata for account and contact, and only retrieve the LogicalName and DisplayName properties:
var customFilterExpression = new[]
{
new MetadataConditionExpression("LogicalName", MetadataConditionOperator.Equals, "account"),
new MetadataConditionExpression("LogicalName", MetadataConditionOperator.Equals, "contact")
};
var customFilter = new MetadataFilterExpression(LogicalOperator.Or);
customFilter.Conditions.AddRange(customFilterExpression);
var entityProperties = new MetadataPropertiesExpression
{
AllProperties = false
};
entityProperties.PropertyNames.AddRange("LogicalName", "DisplayName");
var request = new RetrieveMetadataChangesRequest
{
Query = new EntityQueryExpression
{
Properties = entityProperties,
Criteria = customFilter,
}
};
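You can then execute the request and read the results (a sketch; _service is the IOrganizationService from the question):
var response = (RetrieveMetadataChangesResponse)_service.Execute(request);

foreach (var entityMetadata in response.EntityMetadata)
{
    // e.g. print the two properties requested above
    Console.WriteLine(entityMetadata.LogicalName);
    Console.WriteLine(entityMetadata.DisplayName?.UserLocalizedLabel?.Label);
}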
This method also has the benefit of retrieving only the specific properties you need, which makes the request faster and the payload smaller. It's specifically designed for mobile scenarios where you want to retrieve only the metadata you need, and only what has changed since the last time you retrieved it, but it works nicely in a lot of scenarios.
You have to use RetrieveAllEntitiesRequest. Sample below:
RetrieveAllEntitiesRequest retrieveAllEntityRequest = new RetrieveAllEntitiesRequest
{
RetrieveAsIfPublished = true,
EntityFilters = EntityFilters.Attributes
};
RetrieveAllEntitiesResponse retrieveAllEntityResponse = (RetrieveAllEntitiesResponse)serviceProxy.Execute(retrieveAllEntityRequest);
The CRM SDK only offers an all-entities or one-by-one approach.
You have to keep your list of entities ready and issue a RetrieveEntityRequest for each item, as sketched below.
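A rough sketch of the one-by-one approach (the entity names are illustrative):
var logicalNames = new[] { "account", "contact" };
var results = new Dictionary<string, EntityMetadata>();

foreach (var logicalName in logicalNames)
{
    var entityRequest = new RetrieveEntityRequest
    {
        EntityFilters = EntityFilters.Attributes,
        LogicalName = logicalName
    };
    var entityResponse = (RetrieveEntityResponse)serviceProxy.Execute(entityRequest);
    results[logicalName] = entityResponse.EntityMetadata;
}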