Elasticsearch ingest attachment plugin blocks - C#

I am using NEST (C#) and the ingest attachment plugin to ingest tens of thousands of documents into an Elasticsearch instance. Unfortunately, after a while everything just stands still, i.e. no more documents are ingested. The log shows:
[2019-02-20T17:35:07,528][INFO ][o.e.m.j.JvmGcMonitorService] [BwAAiDl] [gc][7412] overhead, spent [326ms] collecting in the last [1s]
Not sure if this tells anyone anything? Btw, are there more efficient ways to ingest many documents (rather than using thousands of REST requests)?
I am using this kind of code:
client.Index(new Document
{
    Id = Guid.NewGuid(),
    Path = somePath,
    Content = Convert.ToBase64String(File.ReadAllBytes(somePath))
}, i => i.Pipeline("attachments"));
And this is how I define the pipeline:
client.PutPipeline("attachments", p => p
    .Description("Document attachment pipeline")
    .Processors(pr => pr
        .Attachment<Document>(a => a
            .Field(f => f.Content)
            .TargetField(f => f.Attachment)
        )
        .Remove<Document>(r => r
            .Field(f => f.Content)
        )
    )
);

The log indicates that a considerable amount of time is being spent performing garbage collection on the Elasticsearch server side; this is very likely the cause of the long pauses you are seeing. If you have monitoring enabled on the cluster (ideally exporting the data to a separate cluster), I would analyse it to see if it sheds some light on why large GCs are happening.
are there more efficient ways to ingest many documents (rather than using thousands of REST requests)?
Yes, you are indexing each attachment in a separate index request. Depending on the size of each attachment, base64 encoded, you may want to send several in one bulk request:
// Your collection of documents
var documents = new[]
{
    new Document
    {
        Id = Guid.NewGuid(),
        Path = "path",
        Content = "content"
    },
    new Document
    {
        Id = Guid.NewGuid(),
        Path = "path",
        Content = "content" // base64 encoded bytes
    }
};
var client = new ElasticClient();

var bulkResponse = client.Bulk(b => b
    .Pipeline("attachments")
    .IndexMany(documents)
);
If you're reading documents from the filesystem, you probably want to lazily enumerate them and send bulk requests. Here, you can make use of the BulkAll helper method too.
First have some lazily enumerated collection of documents
public static IEnumerable<Document> GetDocuments()
{
    var count = 0;
    while (count++ < 20)
    {
        yield return new Document
        {
            Id = Guid.NewGuid(),
            Path = "path",
            Content = "content" // base64 encoded bytes
        };
    }
}
Then configure the BulkAll call
var client = new ElasticClient();

// set up the observable configuration
var bulkAllObservable = client.BulkAll(GetDocuments(), ba => ba
    .Pipeline("attachments")
    .Size(10)
);

var waitHandle = new ManualResetEvent(false);
Exception exception = null;

// set up what to do in response to next bulk call, exception and completion
var bulkAllObserver = new BulkAllObserver(
    onNext: response =>
    {
        // perform some action e.g. incrementing counter
        // to indicate how many have been indexed
    },
    onError: e =>
    {
        exception = e;
        waitHandle.Set();
    },
    onCompleted: () =>
    {
        waitHandle.Set();
    });

// start the observable process
bulkAllObservable.Subscribe(bulkAllObserver);

// wait for indexing to finish, either forever,
// or set a max timeout as here.
waitHandle.WaitOne(TimeSpan.FromHours(1));

if (exception != null)
    throw exception;
Size dictates how many documents to send in each request. There are no hard and fast rules for how big this can be for your cluster, because it can depend on a number of factors including ingest pipeline, the mapping of documents, the byte size of documents, the cluster hardware etc. You can configure the observable to retry documents that fail to be indexed, and if you see es_rejected_execution_exception, you are at the limits of what your cluster can concurrently handle.
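For example, the BulkAll observable can be configured to back off and retry documents that are rejected. Here is a minimal sketch, assuming a recent NEST client, so the exact method names may vary slightly between versions:

// Sketch: configure back-off and per-document retries on the BulkAll observable.
var bulkAllObservable = client.BulkAll(GetDocuments(), ba => ba
    .Pipeline("attachments")
    .Size(10)
    .BackOffRetries(3)                      // retry a failed bulk call up to 3 times
    .BackOffTime(TimeSpan.FromSeconds(30))  // wait between retries
    .RetryDocumentPredicate((item, doc) =>
        // retry only documents rejected because the cluster was saturated
        item.Error != null && item.Error.Type == "es_rejected_execution_exception")
    .DroppedDocumentCallback((item, doc) =>
        // log documents that could not be indexed after all retries
        Console.WriteLine($"Dropped {doc.Path}: {item.Error?.Reason}"))
);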
Another recommendation concerns document ids. I see you're using new Guids for the document ids, which implies that you don't care what the value is for each document. If that is the case, I would recommend not sending an Id value and instead letting Elasticsearch generate an id for each document. This is very likely to improve performance (I believe the implementation details have changed slightly in Elasticsearch and Lucene since that was written, but the point still stands).
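As a rough sketch of what that looks like, assuming you change the Id property to a nullable type such as string and simply leave it unset; NEST should then omit _id and Elasticsearch generates one:

// Sketch: let Elasticsearch generate the ids. Assumes the Id property is changed
// to a string (or other nullable type) and left null.
public class Document
{
    public string Id { get; set; }          // leave null for an auto-generated id
    public string Path { get; set; }
    public string Content { get; set; }
    public Attachment Attachment { get; set; }
}

var bulkResponse = client.Bulk(b => b
    .Pipeline("attachments")
    .IndexMany(documents)                   // documents with Id == null
);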

Related

How to design a system that deals with incrementing registration number in MongoDB

How do I create a registration number in MongoDB?
I need an auto-incrementing value, but MongoDB doesn't have auto-increment by default (and there is a risk of concurrency failures),
so how can I do it?
E.g. the current registration number is 1; when I insert a new record it must become 1 + 1 = 2.
I do it like this; FindOneAndUpdate is atomic:
static public async Task<long> NextInt64Async(IMongoDatabase db, string seqCollection, long q = 1, CancellationToken cancel = default)
{
    long result = 1;
    BsonDocument r = await db.GetCollection<BsonDocument>("seq").FindOneAndUpdateAsync<BsonDocument>(
        filter: Builders<BsonDocument>.Filter.Eq("_id", seqCollection),
        update: Builders<BsonDocument>.Update.Inc("seq", q),
        options: new FindOneAndUpdateOptions<BsonDocument, BsonDocument>() { ReturnDocument = ReturnDocument.After, IsUpsert = true },
        cancellationToken: cancel
    );
    if (r != null)
        result = r["seq"].AsInt64;
    return result;
}
....
await collection.InsertOneAsync(new Person() { Id = await NextInt64Async(db, "person"), Name = "Person" + i });
You can find a full example here:
https://github.com/iso8859/learn-mongodb-by-example/blob/main/dotnet/02%20-%20Intermediate/InsertLongId.cs
If you need to avoid gaps, you can use the following approach that involves only updates to a single document that are atomic:
First, you pre-fill the invoice collection with a reasonable number of documents (if you expect 1000 invoices per day, you could create the documents for a year in advance) that each have a unique, increasing and gap-less number, e.g.
[
{ _id: "abcde1", InvoiceId: 1, HasInvoice: false, Invoice: null },
{ _id: "abcde2", InvoiceId: 2, HasInvoice: false, Invoice: null },
{ _id: "abcde3", InvoiceId: 3, HasInvoice: false, Invoice: null },
...
]
There should be a unique index on InvoiceId and for efficient querying/sorting during the updates another one on HasInvoice and InvoiceId. You'd need to insert new documents if you are about to run out of prepared documents.
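As a sketch with the .NET driver, assuming an Invoice POCO with the InvoiceId, HasInvoice and Invoice properties shown above stored in an "invoices" collection, the indexes could be created like this:

// Sketch: create the supporting indexes for the pre-filled invoice collection.
var invoices = db.GetCollection<Invoice>("invoices");

await invoices.Indexes.CreateManyAsync(new[]
{
    // unique, gap-less invoice numbers
    new CreateIndexModel<Invoice>(
        Builders<Invoice>.IndexKeys.Ascending(x => x.InvoiceId),
        new CreateIndexOptions { Unique = true }),

    // efficient lookup of the lowest InvoiceId without an assigned invoice
    new CreateIndexModel<Invoice>(
        Builders<Invoice>.IndexKeys
            .Ascending(x => x.HasInvoice)
            .Ascending(x => x.InvoiceId))
});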
When creating an invoice, you perform a FindOneAndUpdate operation that gets the document with the lowest InvoiceId that does not have an invoice yet and, in the update, assigns the invoice, e.g.:
var filter = Builders<Invoice>.Filter.Eq(x => x.HasInvoice, false);
var update = Builders<Invoice>.Update
    .Set(x => x.HasInvoice, true)
    .Set(x => x.Invoice, invoiceDetails);
var options = new FindOneAndUpdateOptions<Invoice>()
{
    Sort = Builders<Invoice>.Sort.Ascending(x => x.InvoiceId),
    ReturnDocument = ReturnDocument.After,
};
var updatedInvoice = await invoices.FindOneAndUpdateAsync(filter, update, options);
FindOneAndUpdate returns the updated document, so you can access the assigned invoice id afterwards.
Because FindOneAndUpdate executes atomically, there is no need for transactions; gaps are not possible because the lowest unassigned InvoiceId is always picked.
There is no built-in way to achieve this, but there are a few solutions for certain situations.
For example, if you're using Mongo Realm you can define a database trigger; I recommend following this guide.
If you're using mongoose in your app, there are plugins like mongoose-auto-increment that do it for you.
The way they work is by creating an additional collection that contains a counter to be used for every insert. This is not perfect, as your db is still vulnerable to manual updates and human error, but it is the only viable solution that doesn't require some sort of preprocessing. I also recommend creating a unique index on that field to at least guarantee uniqueness.

MongoDb & C# : Using a Cursor with a sort on a large Index

So I would like to de-duplicate my dataset, which has 2 billion records in it. I have an index on url, and I want to iterate through each record and see if it's a duplicate.
The index is 110 GB.
MongoDB.Driver.MongoCommandException: 'Command find failed: Executor error during find command :: caused by :: Sort operation used more than the maximum 33554432 bytes of RAM. Add an index, or specify a smaller limit.'
My current method won't run because the index is huge.
var filter = Builders<Page>.Filter.Empty;
var sort = Builders<Page>.Sort.Ascending("url");

await collection.Find(filter).Sort(sort)
    .ForEachAsync(async document =>
    {
        Console.WriteLine(document.Url);
        //_ = await collection.DeleteOneAsync(a => a.Id == document.Id);
    }
);
If the goal is to delete duplicate pages with the same url, why not use an aggregation like the following:
db.Page.aggregate(
    [
        {
            $sort: { url: 1 }
        },
        {
            $group: {
                _id: "$url",
                doc: { $first: "$$ROOT" }
            }
        },
        {
            $replaceWith: "$doc"
        },
        {
            $out: "UniquePages"
        }
    ],
    {
        allowDiskUse: true
    })
It will create a new collection called UniquePages. After inspecting that collection to see if the data is correct, you can simply drop the old Page collection and rename the new one to Page.
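If you'd rather drive the same pipeline from C#, a rough equivalent using BsonDocument stages might look like this (note that $replaceWith requires MongoDB 4.2+; the collection names mirror the shell example above):

// Sketch: the same de-duplication pipeline run from the C# driver. The $out
// stage writes the de-duplicated documents into UniquePages on the server.
var pipeline = new[]
{
    new BsonDocument("$sort", new BsonDocument("url", 1)),
    new BsonDocument("$group", new BsonDocument
    {
        { "_id", "$url" },
        { "doc", new BsonDocument("$first", "$$ROOT") }
    }),
    new BsonDocument("$replaceWith", "$doc"),
    new BsonDocument("$out", "UniquePages")
};

await collection.AggregateAsync<BsonDocument>(
    pipeline,
    new AggregateOptions { AllowDiskUse = true });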
Make sure that the sort operation uses an index. This will generally perform much better, and an in-memory sort is restricted to a maximum of 32 MB; refer to the documentation on using indexes to sort query results.
Use the query plan (explain) to identify the selected/winning plan, and analyse query performance details with executionStats.
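As a sketch, you can ask for the plan and execution statistics from C# by running the explain command directly (the collection and field names here match the question):

// Sketch: run the explain command to see the winning plan and execution
// statistics for the sorted find on the Page collection.
var explainCommand = new BsonDocument
{
    { "explain", new BsonDocument
        {
            { "find", "Page" },
            { "sort", new BsonDocument("url", 1) }
        }
    },
    { "verbosity", "executionStats" }
};

var explainResult = await db.RunCommandAsync<BsonDocument>(explainCommand);
Console.WriteLine(explainResult["queryPlanner"]["winningPlan"]);
Console.WriteLine(explainResult["executionStats"]["totalDocsExamined"]);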

Bulk a collection into the right index path in Elasticsearch using NEST in a .NET Core application

I am trying to bulk index a collection of elements into an Elasticsearch index using NEST in a .NET Core application.
Currently what I have is working and the elements are saved, but they are not saved where I intended.
My client creation:
protected ElasticClient GetClient()
{
    var node = new Uri("http://localhost:9200/");
    var settings = new ConnectionSettings(node)
        .DefaultIndex("TestIndex")
        .PrettyJson(true);

    return new ElasticClient(settings);
}
Here is how I create the descriptor to bulk index all the data:
protected BulkDescriptor GenerateBulkDescriptor<T>(IEnumerable<T> elements, string indexName) where T : class, IIndexable
{
    var bulkIndexer = new BulkDescriptor();

    foreach (var element in elements)
        bulkIndexer.Index<T>(i => i
            .Document(element)
            .Id(element.Id)
            .Index(indexName));

    return bulkIndexer;
}
Finally, once I have this, here is how I index the data
var descriptor = GenerateBulkDescriptor(indexedElements, "indexed_elements");
var response = GetClient().Bulk(descriptor);
But if I look at how the data is stored in Elasticsearch, the documents end up in the indexed_elements index.
How can I know whether anything was created under the TestIndex index? As far as I can see, there is just one index created.
Thank you a lot in advance.
When defining the index operations on the BulkDescriptor, you are explicitly setting the index to use for each operation
foreach (var element in elements)
    bulkIndexer.Index<T>(i => i
        .Document(element)
        .Id(element.Id)
        .Index(indexName));
where indexName is "indexed_elements". This is why all documents are indexed into this index and you do not see any in "TestIndex".
The Bulk API allows multiple operations to be defined, which may include indexing documents into different indices. When the index is specified directly on an operation, that will be the index used. If all index operations on a Bulk API call are to take place against the same index, you can omit the index on each operation and instead, specify the index to use on the Bulk API call directly
var defaultIndex = "default_index";
var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));

var settings = new ConnectionSettings(pool)
    .DefaultIndex(defaultIndex);

var client = new ElasticClient(settings);

var people = new[]
{
    new Person { Id = 1, Name = "Paul" },
    new Person { Id = 2, Name = "John" },
    new Person { Id = 3, Name = "George" },
    new Person { Id = 4, Name = "Ringo" },
};

var bulkResponse = client.Bulk(b => b
    .Index("people")
    .IndexMany(people)
);
which sends the following request
POST http://localhost:9200/people/_bulk
{"index":{"_id":"1","_type":"person"}}
{"id":1,"name":"Paul"}
{"index":{"_id":"2","_type":"person"}}
{"id":2,"name":"John"}
{"index":{"_id":"3","_type":"person"}}
{"id":3,"name":"George"}
{"index":{"_id":"4","_type":"person"}}
{"id":4,"name":"Ringo"}
Note that the URI is /people/_bulk and that each JSON object representing an operation does not contain an "_index".
If you omit the .Index() on Bulk API call, it will use the DefaultIndex configured on ConnectionSettings:
var bulkResponse = client.Bulk(b => b
    .IndexMany(people)
);
which yields
POST http://localhost:9200/_bulk
{"index":{"_id":"1","_index":"default_index","_type":"person"}}
{"id":1,"name":"Paul"}
{"index":{"_id":"2","_index":"default_index","_type":"person"}}
{"id":2,"name":"John"}
{"index":{"_id":"3","_index":"default_index","_type":"person"}}
{"id":3,"name":"George"}
{"index":{"_id":"4","_index":"default_index","_type":"person"}}
{"id":4,"name":"Ringo"}
You can also specify a default index to use for a given POCO type on ConnectionSettings with DefaultMappingFor<T>(), where T is your POCO type.
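For example, here is a sketch reusing the settings above; the "people" index name is just illustrative:

// Sketch: all Person documents go to the "people" index by default,
// so .Index() can be omitted on individual requests.
var settings = new ConnectionSettings(pool)
    .DefaultIndex(defaultIndex)
    .DefaultMappingFor<Person>(m => m
        .IndexName("people")
    );

var client = new ElasticClient(settings);

// indexes into "people" even though no index is specified on the call
var bulkResponse = client.Bulk(b => b.IndexMany(people));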
After some tests and attempts, I have found a solution.
First of all, it was a problem with the configured index: once I set it to lowercase, the index worked fine and data was indexed into it.
Then, I had the problem of indexing data under a specific "path" inside the same index; finally I found the Type solution from NEST, also taking advantage of the DefaultMappingFor suggested by Russ in the previous answer.
Client definition:
var node = new Uri(_elasticSearchConfiguration.Node);

var settings = new ConnectionSettings(node)
    .DefaultMappingFor<IndexedElement>(m => m
        .IndexName(_elasticSearchConfiguration.Index)
        .TypeName(nameof(IndexedElement).ToLower()))
    .PrettyJson(true)
    .DisableDirectStreaming();

var client = new ElasticClient(settings);
Then, the BulkDescriptor creation:
var bulkIndexer = new BulkDescriptor();

foreach (var element in elements)
    bulkIndexer.Index<IndexedElement>(i => i
        .Document(element)
        .Type(nameof(IndexedElement).ToLower())
        .Id(element.Id));
And finally, data bulk:
client.Bulk(bulkIndexer);
Now, if I perform a call to the index, I can see this:
{
    "testindex": {
        "aliases": {},
        "mappings": {
            "indexedelement": {
                [...]
            }
        }
    }
}
Thank you Russ for your help, and thanks to everyone who has had a look at the post.
UPDATE
Finally, it turned out that the only problem was the default index, which must be lowercase, so specifying the type with the name of the POCO itself is not necessary, as @RussCam correctly pointed out in the comments above. After changing the default index to lowercase, all the different approaches worked fine.
Thank you all again

Bulk Indexing in Elasticsearch using the ElasticLowLevelClient client

I'm using the ElasticLowLevelClient client to index Elasticsearch data because it needs to be indexed as a raw string, since I don't have access to the POCO objects. I can successfully index an individual object by calling:
client.Index<object>(indexName, message.MessageType, message.Id,
    new Elasticsearch.Net.PostData<object>(message.MessageJson));
How can I do a bulk insert into the index using the ElasticLowLevelClient client? The bulk insert APIs all seem to require a POCO of the document being indexed, which I don't have, e.g.:
ElasticsearchResponse<T> Bulk<T>(string index, PostData<object> body,
    Func<BulkRequestParameters, BulkRequestParameters> requestParameters = null)
I could make the API calls in parallel for each object but that seems inefficient.
The generic type parameter on the low level client calls is the type of the expected response.
If you're using the low level client exposed on the high level client, through the .LowLevel property, you can send a bulk request where your documents are JSON strings as follows in 5.x
var client = new ElasticClient(settings);

var messages = new[]
{
    new Message
    {
        Id = "1",
        MessageType = "foo",
        MessageJson = "{\"name\":\"message 1\",\"content\":\"foo\"}"
    },
    new Message
    {
        Id = "2",
        MessageType = "bar",
        MessageJson = "{\"name\":\"message 2\",\"content\":\"bar\"}"
    }
};

var indexName = "my-index";

var bulkRequest = messages.SelectMany(m =>
    new[]
    {
        client.Serializer.SerializeToString(new
        {
            index = new
            {
                _index = indexName,
                _type = m.MessageType,
                _id = m.Id
            }
        }, SerializationFormatting.None),
        m.MessageJson
    });

var bulkResponse = client.LowLevel.Bulk<BulkResponse>(string.Join("\n", bulkRequest) + "\n");
which sends the following bulk request
POST http://localhost:9200/_bulk
{"index":{"_index":"my-index","_type":"foo","_id":"1"}}
{"name":"message 1","content":"foo"}
{"index":{"_index":"my-index","_type":"bar","_id":"2"}}
{"name":"message 2","content":"bar"}
A few important points
We need to build the bulk request ourselves to use the low level bulk API call. Since our documents are already strings, it makes sense to build a string request.
We serialize an anonymous type with no indenting for the action and metadata for each bulk item.
The MessageJson cannot contain any newline characters in it as this will break the bulk API; newline characters are the delimiters for json objects within the body.
Because we're using the low level client exposed on the high level client, we can still take advantage of the high level requests, responses and serializer. The bulk request returns a BulkResponse, which you can work with as you normally do when sending a bulk request with the high level client.
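As a sketch of that last point (in 5.x the strongly typed BulkResponse is exposed via the .Body property of the low level response; later client versions return the response type directly):

// Sketch: check the strongly typed bulk response for per-item failures.
var bulk = bulkResponse.Body;

if (bulk.Errors)
{
    foreach (var item in bulk.ItemsWithErrors)
    {
        // item.Status and item.Error describe each failed operation
        Console.WriteLine($"Failed to index document {item.Id}: {item.Error?.Reason}");
    }
}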

Read Azure DocumentDB document that might not exist

I can query a single document from the Azure DocumentDB like this:
var response = await client.ReadDocumentAsync( documentUri );
If the document does not exist, this will throw a DocumentClientException. In my program I have a situation where the document may or may not exist. Is there any way to query for the document without using try-catch and without doing two round trips to the server, first to query for the document and second to retrieve the document should it exist?
Sadly there is no other way: either you handle the exception or you make two calls. If you pick the second path, here is one performance-driven way of checking for document existence:
public bool ExistsDocument(string id)
{
    var client = new DocumentClient(DatabaseUri, DatabaseKey);
    var collectionUri = UriFactory.CreateDocumentCollectionUri("dbName", "collectionName");
    var query = client.CreateDocumentQuery<Microsoft.Azure.Documents.Document>(collectionUri, new FeedOptions() { MaxItemCount = 1 });
    return query.Where(x => x.Id == id).Select(x => x.Id).AsEnumerable().Any(); // using LINQ
}
The client should be shared among all your DB-accessing methods, but I created it there to have a self-sufficient example.
The new FeedOptions() { MaxItemCount = 1 } will make sure the query is optimized for one result (we don't really need more).
The Select(x => x.Id) will make sure no other data is returned; if you don't specify it and the document exists, the query will return all of its data.
You're specifically querying for a given document, and ReadDocumentAsync will throw that DocumentClientException when it can't find the specific document (returning a 404 in the status code). This is documented here. By catching the exception (and seeing that it's a 404), you wouldn't need two round trips.
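If you do go the exception route, a minimal sketch looks like this (DocumentClientException exposes the HTTP status code, so you can treat 404 as "not found"):

// Sketch: read the document and treat a 404 as "not found" instead of an error
try
{
    var response = await client.ReadDocumentAsync(documentUri);
    return response.Resource;
}
catch (DocumentClientException ex) when (ex.StatusCode == HttpStatusCode.NotFound)
{
    return null; // the document does not exist
}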
To get around dealing with this exception, you'd need to make a query instead of a discrete read, by using CreateDocumentQuery(). Then, you'll simply get a result set you can enumerate through (even if that result set is empty). For example:
var collLink = UriFactory.CreateDocumentCollectionUri(databaseId, collectionId);
var querySpec = new SqlQuerySpec { <querytext> };

var itr = client.CreateDocumentQuery(collLink, querySpec).AsDocumentQuery();
var response = await itr.ExecuteNextAsync<Document>();

foreach (var doc in response.AsEnumerable())
{
    // ...
}
With this approach, you'll just get no responses. In your specific case, where you'll be adding a WHERE clause to query a specific document by its id, you'll either get zero results or one result.
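For example, a parameterized query spec for the id lookup might look like this (a sketch; the Document type and id variable are placeholders):

// Sketch: parameterized query for a single document by id
var querySpec = new SqlQuerySpec(
    "SELECT * FROM c WHERE c.id = @id",
    new SqlParameterCollection { new SqlParameter("@id", id) });

var query = client.CreateDocumentQuery<Document>(collLink, querySpec).AsDocumentQuery();
var page = await query.ExecuteNextAsync<Document>();
var document = page.FirstOrDefault(); // null when the document does not exist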
With the Cosmos DB SDK v3 it's possible. You can check whether an item exists in a container, and get it, by using Container.ReadItemStreamAsync(string id, PartitionKey key) and checking response.StatusCode:
using var response = await container.ReadItemStreamAsync(id, new PartitionKey(key));

if (response.StatusCode == HttpStatusCode.NotFound)
{
    return null;
}

if (!response.IsSuccessStatusCode)
{
    throw new Exception(response.ErrorMessage);
}

using var streamReader = new StreamReader(response.Content);
var content = await streamReader.ReadToEndAsync();
var item = JsonConvert.DeserializeObject(content, stateType);
This approach has a drawback, however. You need to deserialize the item by hand.
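If you do this in more than one place, a small generic helper keeps the deserialization in one spot (a sketch; TryReadItemAsync is a hypothetical helper, not part of the SDK):

// Sketch: hypothetical helper that returns default(T) when the item is missing,
// keeping the manual deserialization in one place.
public static async Task<T> TryReadItemAsync<T>(Container container, string id, PartitionKey key)
{
    using var response = await container.ReadItemStreamAsync(id, key);

    if (response.StatusCode == HttpStatusCode.NotFound)
        return default;

    if (!response.IsSuccessStatusCode)
        throw new Exception(response.ErrorMessage);

    using var streamReader = new StreamReader(response.Content);
    var content = await streamReader.ReadToEndAsync();
    return JsonConvert.DeserializeObject<T>(content);
}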
