Index, IndexMany, IndexAsync, IndexManyAsync with NEST - c#

I am trying to understand the indexing options in NEST for Elasticsearch. I ran each of them, and here are my results:
var node = new Uri("http://localhost:9200");
var settings = new ConnectionSettings(node, defaultIndex: "mydatabase");
settings.SetTimeout(1800000);
var client = new ElasticClient(settings);
var createIndexResult = client.CreateIndex("mydatabase");
var mapResult = client.Map<Product>(c => c.MapFromAttributes().SourceField(s => s.Enabled(true)));
1) Index: When I use the Index option and iterate over each object, it works smoothly, although it is slow.
foreach (var item in Items)
{
elasticClient.Index(item);
}
2) IndexAsync: This ran without any exception, but it was not faster than the sync iteration and fewer documents were indexed.
foreach (var item in Items)
{
elasticClient.IndexAsync(item);
}
3) IndexMany: I tried elasticClient.IndexMany(items); without the foreach, of course. It runs faster than calling Index in a foreach, but with a lot of data (500,000 objects in my case) it threw an exception saying:
"System.Net.WebException: The underlying connection was closed: A
connection that its continuation was expected, has been closed by the
server ..
    at System.Net.HttpWebRequest.GetResponse ()"
When I check the log file, I can see only:
"2016-01-14
10:21:49,567][WARN ][http.netty ] [Microchip] Caught
exception while handling client http traffic, closing connection [id:
0x68398975, /0:0:0:0:0:0:0:1:57860 => /0:0:0:0:0:0:0:1:9200]"
4) IndexManyAsync: elasticClient.IndexManyAsync(Items); throws a similar exception to the sync version, but I can see more information in the log file:
[2016-01-14 11:00:16,086][WARN ][http.netty ]
[Microchip] Caught exception while handling client http traffic,
closing connection [id: 0x43bca172, /0:0:0:0:0:0:0:1:59314 =>
/0:0:0:0:0:0:0:1:9200]
org.elasticsearch.common.netty.handler.codec.frame.TooLongFrameException:
HTTP content length exceeded 104857600 bytes.
My questions are: what are the exact differences? In which cases might we need async? Why do both the IndexMany and IndexManyAsync options throw such an exception?
It looks like the Index option is the safest one. Is it OK to just use it like that?

Using sync or async will not have any impact on Elasticsearch indexing performance. You would want to use async if you do not want to block your client code on completion of indexing, that's all.
Coming to Index vs IndexMany, it is always recommended to use the latter to take advantage of batching and avoiding too many request/response cycles between your client and Elasticsearch. That said, you cannot simply index such a huge number of documents in a single request. The exception message is pretty clear in saying that your batch index request has exceeded the HTTP content length limit of 100MB. What you need to do is reduce the number of documents you want to index using IndexMany so that you do not hit this limit and then invoke IndexMany multiple times till you complete indexing all of 500,000 documents.
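The batching described above can be sketched as follows. This is only an illustration: the Batch helper is hypothetical (it is not part of NEST), the batch size of 1000 is an arbitrary assumption that should be tuned so each request stays well under the 100MB limit, and the IndexMany call appears only as a comment so the snippet compiles without a cluster.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BatchExtensions
{
    // Hypothetical helper (not part of NEST): lazily splits a sequence into
    // consecutive lists of at most `size` items.
    public static IEnumerable<List<T>> Batch<T>(this IEnumerable<T> source, int size)
    {
        if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
        var batch = new List<T>(size);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == size)
            {
                yield return batch;
                batch = new List<T>(size);
            }
        }
        if (batch.Count > 0) yield return batch; // final, possibly smaller batch
    }
}

public static class BatchDemo
{
    public static void Run()
    {
        var items = Enumerable.Range(1, 2500).ToList(); // stand-in for the documents
        foreach (var batch in items.Batch(1000))
        {
            // With a real client this would be: elasticClient.IndexMany(batch);
            Console.WriteLine($"Would index {batch.Count} documents");
        }
    }
}
```

Each IndexMany call then carries a bounded payload, and a failure affects only one batch rather than the whole run.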

The problem with IndexMany and IndexManyAsync is that you are indexing too much data in one request.
This can be solved by making multiple IndexMany calls on subsets of your list, but there is now an easier way to deal with this, called BulkAll:
var bulkAllObservable = client.BulkAll(items, b => b
.Index("myindex")
// how long to wait between retries
.BackOffTime("30s")
// how many retries are attempted if a failure occurs
.BackOffRetries(2)
// refresh the index once the bulk operation completes
.RefreshOnCompleted()
// how many concurrent bulk requests to make
.MaxDegreeOfParallelism(Environment.ProcessorCount)
// number of items per bulk request
.Size(1000)
)
// Perform the indexing, waiting up to 15 minutes.
// Whilst the BulkAll calls are asynchronous this is a blocking operation
.Wait(TimeSpan.FromMinutes(15), next =>
{
// do something on each response e.g. write number of batches indexed to console
});
This will index your whole list in chunks of 1000 items at a time.

Related

How to crawl XML(s) very fast — considering the below networking limitations?

I have a .Net crawler that's running when the user makes a request (so, it needs to be fast). It crawls 400+ links in real time. (This is the business ask.)
The problem: I need to detect if a link is xml (think of rss or atom feeds) or html. If the link is xml then I continue with processing, but if the link is html I can skip it. Usually, I have 2 xml(s) and 398+ html(s). Currently, I have multiple threads going but the processing is still slow, usually 75 seconds running with 10 threads for 400+ links, or 280 seconds running with 1 thread. (I want to add more threads but see below..)
The challenge that I am facing is that I read the streams as follows:
var request = WebRequest.Create(requestUriString: uri.AbsoluteUri);
// ....
var response = await request.GetResponseAsync();
//....
using (var reader = new StreamReader(stream: response.GetResponseStream(), encoding: encoding)) {
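char[] buffer = new char[1024];
// keep the number of characters actually read so the string has no trailing '\0' padding
int read = await reader.ReadAsync(buffer: buffer, index: 0, count: buffer.Length);
responseText = new string(buffer, 0, read);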
}
// parse the first bytes of responseText to check whether it is XML
The problem is that my optimization of reading only 1024 characters is quite useless because, as far as I can see, GetResponseAsync downloads the entire stream anyway.
(The other option I have is to look at the Content-Type header, but AFAIK that is much the same because I get the content anyway, unless you recommend using OPTIONS, which I have not tried so far. In addition, XML might be served with an incorrect Content-Type (?), in which case I would miss some content.)
If there is any optimization that I am missing please help, as I am running out of ideas.
(I am considering optimizing this design by spreading the load across multiple servers, so that I balance the network with the parallelism, but that is a bigger change to the current architecture than I can afford at this point in time.)
Using HEAD requests could speed up the requests significantly, IF you can rely on the Content-Type.
For example:
HttpClient client = new HttpClient();
HttpResponseMessage response = await client.SendAsync(new HttpRequestMessage() { Method = HttpMethod.Head});
This only shows basic usage. Obviously you need to add the URI and anything else the request requires.
Also note that even with 10 threads, 400 requests will likely always take quite a while. 400/10 means 40 requests run sequentially on each thread. Unless the requests go to nearby servers, 200ms would be a good response time, which means a minimum of 8 seconds (40 x 200ms). Overseas servers that may be slow could easily push this out to 30-40 seconds of unavoidable delay, unless you increase the number of threads to run more of the requests in parallel.
Dataflow (Task Parallel Library) can be very helpful for writing parallel pipelines, with a convenient MaxDegreeOfParallelism property for easily adjusting how many parallel instances can run.
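As a rough sketch of the two ideas above (HEAD requests plus a configurable degree of parallelism), something like the following could work. It relies on the Content-Type being trustworthy, as noted above. The class and method names, the 10-second timeout, and the list of accepted media types are all assumptions made for illustration. A SemaphoreSlim is used here instead of Dataflow only to keep the example dependency-free.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class FeedProbe
{
    // One shared HttpClient; the 10-second timeout is an arbitrary assumption.
    private static readonly HttpClient Client = new HttpClient { Timeout = TimeSpan.FromSeconds(10) };

    // True when a Content-Type media type suggests an XML document or feed.
    // XHTML is excluded because it is an HTML page, not a feed.
    public static bool IsXmlMediaType(string mediaType)
    {
        if (string.IsNullOrWhiteSpace(mediaType)) return false;
        var mt = mediaType.Trim().ToLowerInvariant();
        if (mt == "application/xhtml+xml") return false;
        return mt == "text/xml" || mt == "application/xml" || mt.EndsWith("+xml");
    }

    // Sends a HEAD request so that only the headers are transferred.
    public static async Task<bool> LooksLikeXmlAsync(Uri uri)
    {
        try
        {
            using (var request = new HttpRequestMessage(HttpMethod.Head, uri))
            using (var response = await Client.SendAsync(request))
            {
                return IsXmlMediaType(response.Content?.Headers.ContentType?.MediaType);
            }
        }
        catch (HttpRequestException) { return false; } // network or protocol failure
        catch (TaskCanceledException) { return false; } // timeout
    }

    // Probes all links, with at most `maxParallel` requests in flight at once.
    public static async Task<List<Uri>> FindXmlLinksAsync(IEnumerable<Uri> uris, int maxParallel)
    {
        using (var gate = new SemaphoreSlim(maxParallel))
        {
            var probes = uris.Select(async uri =>
            {
                await gate.WaitAsync();
                try { return (uri: uri, isXml: await LooksLikeXmlAsync(uri)); }
                finally { gate.Release(); }
            }).ToList();

            var results = await Task.WhenAll(probes);
            return results.Where(r => r.isXml).Select(r => r.uri).ToList();
        }
    }
}
```

Because the requests are asynchronous, maxParallel can be raised well beyond 10 without tying up threads. Servers that reject HEAD or report a wrong Content-Type would still need a fallback GET.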

Last batch never uploads to Solr when uploading batches of data from json file stream

This might be a long shot, but I might as well try here. There is a block of C# code that rebuilds a Solr core. The steps are as follows:
1) Delete all the existing documents
2) Get the core entities
3) Split the entities into batches of 1000
4) Spin off threads to perform the next set of steps for each batch:
   - Serialize the batch to JSON and write the JSON to a file on the server hosting the core
   - Send a command to the core to upload that file using System.Net.WebClient: solrurl/corename/update/json?stream.file=myfile.json&stream.contentType=application/json;charset=utf-8
   - Delete the file. I've also tried deleting the files after all the batches are done, as well as not deleting the files at all
5) After all batches are done, commit. I've also tried committing after each batch is done.
My problem is the last batch will not upload if it's much less than the batch size. It flows through like the command was called but nothing happens. It throws no exceptions and I see no errors in the solr logs. My questions are Why? and How can I ensure the last batch always gets uploaded? We think it's a timing issue, but we've added Thread.Sleep(30000) in many parts of the code to test that theory and it still happens.
The only time it doesn't happen is:
if the batch is full or almost full
we don't run multiple threads
we put a break point at the File.Delete line on the last batch and wait for 30 seconds or so, then continue
Here is the code for writing the file and calling the update command. This is called for each batch.
private const string
FileUpdateCommand = "{1}/update/json?stream.file={0}&stream.contentType=application/json;charset=utf-8",
SolrFilesDir = @"\\MYSERVER\SolrFiles",
SolrFileNameFormat = SolrFilesDir + @"\{0}-{1}.json",
_solrUrl = "http://MYSERVER:8983/solr/",
CoreName = "MyCore";
public void UpdateCoreByFile(List<CoreModel> items)
{
if (items.Count == 0)
return;
var settings = new JsonSerializerSettings { DateTimeZoneHandling = DateTimeZoneHandling.Utc };
var dir = new DirectoryInfo(SolrFilesDir);
if (!dir.Exists)
dir.Create();
var filename = string.Format(SolrFileNameFormat, Guid.NewGuid(), CoreName);
using (var sw = new StreamWriter(filename))
{
sw.Write(JsonConvert.SerializeObject(items, settings));
}
var file = HttpUtility.UrlEncode(filename);
var command = string.Format(FileUpdateCommand, file, CoreName);
using (var client = _clientFactory.GetClient())//System.Net.WebClient
{
client.DownloadData(new Uri(_solrUrl + command));
}
//Thread.Sleep(30000);//doesn't work if I add this
File.Delete(filename);//works here if add breakpoint and wait 30 sec or so
}
I'm just trying to figure out why this is happening and how to address it. I hope this makes sense, and I have provided enough information and code. Thanks for any help.
Since changing the size of the data set and adding a breakpoint both "fix" it, this is almost certainly a race condition. Since you haven't added the code that actually indexes the content, it's impossible to say what the issue really is, but my guess is that the last commit happens before all the threads have finished, and it only works when all threads are done (if you sleep the threads, the issue will still be there, since all threads sleep for the same time).
The easy fix: use commitWithin instead, and never issue explicit commits. The commitWithin parameter makes sure that the documents become available in the index within the given time frame (given in milliseconds). To make sure that the documents you submit become available within ten seconds, append &commitWithin=10000 to your URL.
If there are already documents pending a commit, the documents added will be committed before the ten seconds have elapsed, but even if there's just one last document being submitted as the last batch, it'll never be more than ten seconds before it becomes visible (and no documents will be left forever in a non-committed limbo).
That way you won't have to keep your threads synchronized or issue a final commit, as long as you wait until all threads have finished before exiting your application (if it's an application that actually terminates).
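A small sketch of how the update URL from the question might look with commitWithin appended. The SolrCommands class and BuildUpdateUri method are hypothetical names, and Uri.EscapeDataString is used in place of the question's HttpUtility.UrlEncode only so the snippet does not depend on System.Web.

```csharp
using System;

public static class SolrCommands
{
    // Same template as FileUpdateCommand in the question, with commitWithin added.
    private const string FileUpdateCommand =
        "{1}/update/json?stream.file={0}&stream.contentType=application/json;charset=utf-8&commitWithin={2}";

    // Builds the update URI for one batch file; commitWithinMs is in milliseconds.
    public static Uri BuildUpdateUri(string solrUrl, string coreName, string fileName, int commitWithinMs)
    {
        var file = Uri.EscapeDataString(fileName);
        return new Uri(solrUrl + string.Format(FileUpdateCommand, file, coreName, commitWithinMs));
    }
}
```

In UpdateCoreByFile this would replace the string.Format call, for example client.DownloadData(SolrCommands.BuildUpdateUri(_solrUrl, CoreName, filename, 10000)), and the explicit commit after the batches could then be dropped.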

C# batch processing of async web responses hangs just before finishing

Here is the scenario.
I want to call 2 versions of an API (hosted on different servers), then cast their responses (they come as a JSON) to C# objects and compare them.
An important note here is that I need to query the APIs many times (~3000). The reason is that I query an endpoint that takes an id and returns a specific object from the DB. So my queries look like http://myapi/v1/endpoint/id, and I basically use a loop to go through all of the ids.
Here is the issue
I start querying the API, and for the first 90% of all requests it is blazing fast (I get the response and I process it); all of that happens in under 5 seconds.
Then however, I start to come to a stop. The next 50-100 requests can take between 1 - 5 seconds to process and after that I come to a stop. No CPU-usage, network activity is low (and I am pretty sure that activity is from other apps). And my app just hangs.
UPDATE: Around 50% of the times I tested this, it does finally resume after quite a bit of time. But the other 50% it still just hangs
Here is what I am doing in-code
I have a list of Ids that I iterate to query the endpoint.
This is the main piece of code that queries the APIs and processes the responses.
var endPointIds = await GetIds(); // this queries a different endpoint to get all ids, however there are no issues with it
var tasks = endPointIds.Select(async id =>
{
var response1 = await _data.GetData($"{Consts.ApiEndpoint1}/{id}");
var response2 = await _data.GetData($"{Consts.ApiEndpoint2}/{id}");
return ProcessResponces(response1, response2);
});
var res = await Task.WhenAll(tasks);
var result = res.Where(r => r != null).ToList();
return result; // I never get to return the result, the app hangs before this is reached
This is the GetData() method
private async Task<string> GetAsync(string serviceUri)
{
try
{
var request = WebRequest.CreateHttp(serviceUri);
request.ContentType = "application/json";
request.Method = WebRequestMethods.Http.Get;
using (var response = await request.GetResponseAsync())
using (var responseStream = response.GetResponseStream())
using (var streamReader = new StreamReader(responseStream, Encoding.UTF8))
{
return await streamReader.ReadToEndAsync();
}
}
catch
{
return string.Empty;
}
}
I would link the ProcessResponces method as well, however I tried mocking it to return a string like so:
private string ProcessResponces(string responseJson1, string responseJson2)
{
// usually I would have 2 lines here that deserialize responseJson1 and responseJson2 using Newtonsoft.Json's DeserializeObject<>
return "Fake success";
}
Even with this implementation my issue did not go away (the only difference was that about 97% of my requests were fast, but my code still stalled on the last few requests), so I am guessing the main issue is not related to that method. What it more or less does is deserialize both responses to C# objects, compare them, and return information about their equality.
Here are my observations after 4 hours of debugging
If I manually reduce the number of queries to my API (I used the .Take() method on the list of ids), the issue still persists. For example, with 1000 total requests I start hanging around the 900th, with 1500 around the 1400th, and so on. I believe the issue goes away at around 100-200 requests, but I am not sure, since it might just be too fast for me to notice.
Since this is currently a console app, I tried adding WriteLine() calls to some of my methods, and the issue seemed to go away (I am guessing that the delay from writing to the console leaves some time between requests, and that helps).
Lastly, I did a concurrency profiling of my app, and it reported a lot of contention at the point where my app hangs. Opening the contention tab showed that it mainly happens in System.IO.StreamReader.ReadToEndAsync()
Thoughts and Questions
Obviously, what can I do to resolve the issue?
Is my GetAsync() method wrong, should I be using something else instead of responseStream and streamReader?
I am not super proficient in asynchronous operations, maybe my flow of async/await operations is wrong.
Lastly, could it be something with the API controllers themselves? They are standard ASP.NET MVC 5 WebAPI controllers (version 5.2.3.0)
After long hours of tracking my requests with Fiddler, and finally mocking my DataProvider (_data) to retrieve data locally from disk, it turned out that some responses were taking 30s+ to arrive (or were not arriving at all).
Since my .Select() is async, it always displayed info for the quick responses first and then came to a halt while it waited for the slow ones. This gave the illusion that I was somehow loading the first X requests quickly and then stopping, when in reality I was simply being shown the fastest X requests and then waiting for the slow ones.
And to kind of answer my questions...
What can I do to resolve the issue - set a timeout that allows a maximum number of milliseconds/seconds for a request to finish.
The GetAsync() method is alright.
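Async/await operations are also correct; just keep in mind that tasks started by an async Select complete in the order their responses arrive, not in the order of the ids (Task.WhenAll still returns the results in the original order once every task has finished).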
The ASP.NET Framework controllers are perfectly fine and do not contribute to the issue.
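One possible way to apply such a timeout is sketched below. The OrDefaultAfter extension method is a hypothetical helper, not a framework API. It is written as a generic wrapper because, as far as I know, HttpWebRequest.Timeout is not honored by the asynchronous GetResponseAsync path. Note that the slow request is not cancelled here; it is only no longer awaited.

```csharp
using System;
using System.Threading.Tasks;

public static class TaskTimeout
{
    // Returns the task's result, or `fallback` if the task has not finished
    // within `timeout`. The underlying operation keeps running in the background.
    public static async Task<T> OrDefaultAfter<T>(this Task<T> task, TimeSpan timeout, T fallback)
    {
        var finished = await Task.WhenAny(task, Task.Delay(timeout));
        return finished == task ? await task : fallback;
    }
}
```

In the Select lambda this could be used as var response1 = await _data.GetData($"{Consts.ApiEndpoint1}/{id}").OrDefaultAfter(TimeSpan.FromSeconds(10), string.Empty); where the 10-second limit is an arbitrary choice. Switching to HttpClient and setting its Timeout property would be an alternative that also aborts the request.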

Why is this eating memory?

I wrote an application whose purpose is to read logs from a large table (90 million) and process them into easily understandable stats, how many, how long etc.
The first run took 7.5 hours and only had to process 27 of the 90 million. I would like to speed this up. So I am trying to run the queries in parallel. But when I run the below code, within a couple minutes I crash with an Out of Memory exception.
Environments:
Sync
Test : 26 Applications, 15 million logs, 5 million retrieved, < 20mb, takes 20 seconds
Production: 56 Applications, 90 million logs, 27 million retrieved, < 30mb, takes 7.5 hours
Async
Test : 26 Applications, 15 million logs, 5 million retrieved, < 20mb, takes 3 seconds
Production: 56 Applications, 90 million logs, 27 million retrieved, Memory Exception
public void Run()
{
List<Application> apps;
//Query for apps
using (var ctx = new MyContext())
{
apps = ctx.Applications.Where(x => x.Type == "TypeIWant").ToList();
}
var tasks = new Task[apps.Count];
for (int i = 0; i < apps.Count; i++)
{
var app = apps[i];
tasks[i] = Task.Run(() => Process(app));
}
//try catch
Task.WaitAll(tasks);
}
public void Process(Application app)
{
//Query for logs for time period
using (var ctx = new MyContext())
{
var logs = ctx.Logs.Where(l => l.Id == app.Id).AsNoTracking();
foreach (var log in logs)
{
Interlocked.Increment(ref _totalLogsRead);
var l = log;
Task.Run(() => ProcessLog(l, app.Id));
}
}
}
Is it ill advised to create 56 contexts?
Do I need to dispose and re-create contexts after a certain number of logs retrieved?
Perhaps I'm misunderstanding how the IQueryable is working? <-- My Guess
My understanding is that it will retrieve logs as needed. I guess that means the foreach loop behaves like a yield? Or is my issue that 56 'threads' call the database and I end up storing 27 million logs in memory?
Side question
The results don't really scale together. Based on the Test environment results, I would expect Production to take only a few minutes. I assume the increase is directly related to the number of records in the table.
With 27 million rows, the problem is one of stream processing, not parallel execution. You need to approach the problem as you would with SQL Server's SSIS or any other ETL tool: each processing step is a transformation that processes its input and sends its output to the next step.
Parallel processing is achieved by using a separate thread to run each step. Some steps could also use multiple threads to process multiple inputs up to a limit. Setting limits to each step's thread count and input buffer ensures you can achieve maximum throughput without flooding your machine with waiting tasks.
.NET's TPL Dataflow addresses exactly this scenario. It provides blocks to transform inputs to outputs (TransformBlock), split collections into individual messages (TransformManyBlock), execute actions without transformations (ActionBlock), combine data in batches (BatchBlock), etc.
You can also specify the maximum degree of parallelism for each step so that, e.g., only 1 log query executes at a time while 10 tasks are used for log processing.
In your case, you could:
1) Start with a TransformManyBlock that receives an application type and returns a list of app IDs
2) Use a TransformBlock that reads the logs for a specific ID and sends them downstream
3) Use an ActionBlock that processes the batch.
Step #3 could be broken into many other steps. E.g. if you don't need to process all of an app's log entries together, you can use a step to process individual entries. Or you could first group them by date.
Another option is to create a custom block that reads data from the database using a DbDataReader and posts each entry to the next step immediately, instead of waiting for all rows to return. This would allow you to process each entry as it arrives, instead of waiting to receive all entries.
If each app log contains many entries, this could be a huge memory and time saver.
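A minimal sketch of the "post each entry as it arrives" variant, under stated assumptions: ReadLogs is a hypothetical stand-in for a streaming database query, plain ints stand in for log entries, and the capacity and parallelism values are arbitrary. On .NET Framework the System.Threading.Tasks.Dataflow NuGet package would be needed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class LogPipeline
{
    // Hypothetical stand-in for a streaming query: lazily yields 1000 fake
    // log entries per application without materializing them in a list.
    public static IEnumerable<int> ReadLogs(int appId) =>
        Enumerable.Range(0, 1000).Select(i => appId * 1000 + i);

    public static async Task<long> RunAsync(IEnumerable<int> appIds)
    {
        long processed = 0;

        // Up to 10 entries are processed concurrently; at most 1000 entries
        // wait in the block's input queue at any time.
        var processLog = new ActionBlock<int>(
            log => { Interlocked.Increment(ref processed); },
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = 10,
                BoundedCapacity = 1000
            });

        foreach (var appId in appIds)
        {
            foreach (var log in ReadLogs(appId))
            {
                // SendAsync waits while the queue is full, so the reader
                // cannot run ahead of the processing step.
                await processLog.SendAsync(log);
            }
        }

        processLog.Complete();        // no more input
        await processLog.Completion;  // wait for queued entries to finish
        return Interlocked.Read(ref processed);
    }
}
```

The bounded queue is what keeps memory flat here. By contrast, the Task.Run call per log entry in the question queues work as fast as the rows can be read.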

SubscriptionClient.ReceiveBatch not retrieving all the brokered messages

I have a console application to read all the brokered messages present in the subscription on the Azure Service Bus. I have around 3500 messages in there. This is my code to read the messages:
SubscriptionClient client = messagingFactory.CreateSubscriptionClient(topic, subscription);
long count = namespaceManager.GetSubscription(topic, subscription).MessageCountDetails.ActiveMessageCount;
Console.WriteLine("Total messages to process : {0}", count.ToString()); //Here the number is showing correctly
IEnumerable<BrokeredMessage> dlIE = null;
dlIE = client.ReceiveBatch(Convert.ToInt32(count));
When I execute the code, I can see only 256 messages in dlIE. I have also tried setting the prefetch count via client.PrefetchCount, but it still returns only 256 messages.
I think there is some limit on the number of messages that can be retrieved at a time. However, no such limit is mentioned on the MSDN page for the ReceiveBatch method. What can I do to retrieve all messages at once?
Note:
I only want to read the messages and then let them remain on the Service Bus. Therefore I do not call message.Complete().
I cannot remove and re-create the topic/subscription from the Service Bus.
Edit:
I used PeekBatch instead of ReceiveBatch like this:
IEnumerable<BrokeredMessage> dlIE = null;
List<BrokeredMessage> bmList = new List<BrokeredMessage>();
long i = 0;
dlIE = subsciptionClient.PeekBatch(Convert.ToInt32(count)); // count is the total number of messages in the subscription.
bmList.AddRange(dlIE);
i = dlIE.Count();
if(i < count)
{
while(i < count)
{
IEnumerable<BrokeredMessage> dlTemp = null;
dlTemp = subsciptionClient.PeekBatch(i, Convert.ToInt32(count));
bmList.AddRange(dlTemp);
i = i + dlTemp.Count();
}
}
I have 3255 messages in the subscription. The first time PeekBatch is called it gets 250 messages, so it goes into the while loop with PeekBatch(250,3225). Only 250 messages are received each time. The final number of messages in the output list is 3500, including duplicates. I am not able to understand how this is happening.
I have figured it out. The subscription client remembers the last batch it retrieved and when called again, retrieves the next batch.
So the code would be :
IEnumerable<BrokeredMessage> dlIE = null;
List<BrokeredMessage> bmList = new List<BrokeredMessage>();
long i = 0;
while (i < count)
{
dlIE = subsciptionClient.PeekBatch(Convert.ToInt32(count));
bmList.AddRange(dlIE);
i = i + dlIE.Count();
}
Thanks to MikeWo for guidance
Note: There seems to be some kind of a size limit on the number of messages you can peek at a time. I tried with different subscriptions and the number of messages fetched were different for each.
Is the topic you are writing to partitioned by chance? When you receive messages from a partitioned entity it will only fetch from one of the partitions at a time. From MSDN:
"When a client wants to receive a message from a partitioned queue, or from a subscription of a partitioned topic, Service Bus queries all fragments for messages, then returns the first message that is returned from any of the messaging stores to the receiver. Service Bus caches the other messages and returns them when it receives additional receive requests. A receiving client is not aware of the partitioning; the client-facing behavior of a partitioned queue or topic (for example, read, complete, defer, deadletter, prefetching) is identical to the behavior of a regular entity."
Even with a non-partitioned entity, it's probably not a good idea to assume that you'd get all messages in one go with either the Receive or Peek methods. It would be much more efficient to loop through the messages in much smaller batches, especially if your messages have any decent size to them or are indeterminate in size.
Since you don't actually want to remove the messages from the queue, I'd suggest using PeekBatch instead of ReceiveBatch. This lets you get a copy of each message and doesn't lock it. I'd highly suggest a loop that uses the same SubscriptionClient in conjunction with PeekBatch. When the same SubscriptionClient is used with PeekBatch, the last pulled sequence number is kept under the hood, so as you loop it should keep track of its position and go through the whole queue. This would essentially let you read through the entire queue.
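The loop suggested above might be sketched like this. BatchReader.ReadAll is a hypothetical helper that takes the peek operation as a delegate, so it can be tried without a Service Bus connection. It assumes that an empty batch means the end of the subscription has been reached, and the batch size of 100 in the usage note is an arbitrary choice.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BatchReader
{
    // Calls peekBatch(batchSize) repeatedly until it returns no items and
    // collects everything that was returned along the way.
    public static List<T> ReadAll<T>(Func<int, IEnumerable<T>> peekBatch, int batchSize)
    {
        var all = new List<T>();
        while (true)
        {
            // Materialize once so the batch is not enumerated twice.
            var batch = peekBatch(batchSize).ToList();
            if (batch.Count == 0) break; // assumption: empty batch means we are done
            all.AddRange(batch);
        }
        return all;
    }
}
```

With the real client this could be called as BatchReader.ReadAll(n => subscriptionClient.PeekBatch(n), 100), always on the same SubscriptionClient instance so that its internal position advances. Stopping on an empty batch rather than on a precomputed count also avoids looping forever if messages disappear while reading.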
I came across a similar issue where client.ReceiveBatchAsync(....) would not retrieve any data from the subscription in Azure Service Bus.
After some digging around, I found out that each subscription has a flag that enables batch operations. This can only be enabled through PowerShell. Below is the command I used:
$subObject = Get-AzureRmServiceBusSubscription -ResourceGroup '#resourceName' -NamespaceName '#namespaceName' -Topic '#topicName' -SubscriptionName '#subscriptionName'
$subObject.EnableBatchedOperations = $True
Set-AzureRmServiceBusSubscription -ResourceGroup '#resourceName' -NamespaceName '#namespaceName' -Topic '#topicName' -SubscriptionObj $subObject
More details can be found here. While it still didn't load all the messages at least it started to clear the queue. As far as I'm aware, the batch size parameter is only there as a suggestion to the service bus but not a rule.
Hope it helps!
