How to avoid throughput exceeded error in DynamoDB [duplicate] - c#

This question already has answers here:
Parallel.ForEach and async-await [duplicate]
(4 answers)
How to limit the amount of concurrent async I/O operations?
(11 answers)
Closed 11 months ago.
I am trying to perform multiple delete operations on a DynamoDB table. Because of DynamoDB's limit of 25 items per batch, I cannot delete more than 25 items per request. I have a list of DeleteWriteOperation batches (25 items each) and I am trying to run the batches in parallel. Any suggestion on how I can avoid this error, or how I can add a delay so DynamoDB can autoscale while the task waits?
Here is my code:
// batches is a list of lists, each holding up to 25 DeleteWriteOperation items
var opts = new ParallelOptions { MaxDegreeOfParallelism = Convert.ToInt32(Math.Ceiling((Environment.ProcessorCount * 0.75) * 1.0)) }; // limiting number of concurrent threads
try
{
    Parallel.ForEach(
        batches,
        opts,
        async batch =>
        {
            await processDelete(batch, clientId);
        });
}
catch (Exception ex)
{
    _logger.LogDebug(ex, "Batch delete failed");
}
Here is the error that I received using the above code:
Amazon.DynamoDBv2.AmazonDynamoDBException: 'Throughput exceeds the current capacity for one or more global secondary indexes. DynamoDB is automatically scaling your index so please try again shortly.'
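One way to approach this (a minimal sketch, not from the original thread, assuming processDelete wraps BatchWriteItemAsync as above): replace Parallel.ForEach, which never awaits async lambdas (this is why the duplicate links above are relevant), with explicitly throttled tasks, and retry with exponential backoff when DynamoDB reports a throughput error so autoscaling has time to catch up.

// requires System.Linq and System.Threading
var throttler = new SemaphoreSlim(4); // assumed concurrency limit; tune for your table
var tasks = batches.Select(async batch =>
{
    await throttler.WaitAsync();
    try
    {
        for (var attempt = 0; ; attempt++)
        {
            try
            {
                await processDelete(batch, clientId);
                return;
            }
            catch (AmazonDynamoDBException) when (attempt < 4)
            {
                // back off 1s, 2s, 4s, 8s so the table/index can scale up
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
            }
        }
    }
    finally
    {
        throttler.Release();
    }
});
await Task.WhenAll(tasks);

With this shape, exceptions also surface at await Task.WhenAll, unlike the async lambda inside Parallel.ForEach, which the original try/catch never observes.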

Related

Limit number of Threads without waiting for them [duplicate]

This question already has answers here:
How to limit the Maximum number of parallel tasks in c#
(11 answers)
Closed 4 days ago.
I have a foreach loop that sends data to a gRPC API. I want my loop to send multiple requests at the same time, but limit the number of concurrent requests to e.g. 10. My current code is the following:
foreach (var element in elements)
{
    var x = new Thread(() => SendOverGrpc(element));
    x.Start();
}
But with that code, the software "immediately" sends all requests. How can I limit the number of requests to e.g. 10? As soon as one of my 10 requests is finished, I want to send the next one.
The easiest way is Parallel.ForEach, e.g.
Parallel.ForEach(elements, new ParallelOptions() { MaxDegreeOfParallelism = 10 }, element =>
{
    SendOverGrpc(element);
});
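Parallel.ForEach is a good fit while SendOverGrpc is a blocking call. If an awaitable variant exists (the SendOverGrpcAsync below is hypothetical), a SemaphoreSlim gives the same cap of 10 without tying up a thread per in-flight request; a sketch:

var throttler = new SemaphoreSlim(10); // at most 10 requests in flight
var tasks = elements.Select(async element =>
{
    await throttler.WaitAsync();
    try
    {
        await SendOverGrpcAsync(element); // hypothetical async counterpart
    }
    finally
    {
        throttler.Release(); // a finished request frees a slot for the next one
    }
});
await Task.WhenAll(tasks);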

C# Multithreading and pooling

Hello fellow developers,
I have a question about implementing multi-threading on my .NET (Framework 4.0) Windows Service.
Basically, what the service should be doing is the following:
Scans the filesystem (a specific directory) to see if there are files to process
If there are files that need to be processed, it should be using a thread pooling mechanism to issue threads up to a predetermined amount.
Each thread will perform an upload operation of a single file
As soon as one thread completes, the filesystem is scanned again to see if there are other files to process (I want to avoid having two threads perform the operation on the same file)
I am struggling to find a way that will allow me to do just that last step.
Right now, I have a function, run in the main thread, that retrieves the maximum number of concurrent threads:
int maximumNumberOfConcurrentThreads = getMaxThreads(databaseConnection);
Then, still in the main thread, I have a function that scans the directory and returns a list of the files to process:
List<FileToUploadInfo> filesToUpload = getFilesToUploadFromFS(directory);
After this, I call the following function:
generateThreads(maximumNumberOfConcurrentThreads, filesToUpload);
Each thread should be calling the below function (returns void):
uploadFile(fileToUpload, databaseConnection, currentThread);
Right now, the way the program is structured, if the maximum number of threads is set to, say, 5, I grab 5 elements from the list and upload them.
As soon as all 5 are done, I grab 5 more and do the same until I don't have any left, as per the code below.
for (int index = 0; index < filesToUpload.Count; index = index + maximumNumberOfConcurrentThreads)
{
    try
    {
        Parallel.For(0, maximumNumberOfConcurrentThreads, iteration =>
        {
            if (index + iteration < filesToUpload.Count)
            {
                uploadFile(filesToUpload[index + iteration], databaseConnection, iteration);
            }
        });
    }
    catch (System.ArgumentOutOfRangeException outOfRange)
    {
        debug("Exception in Parallel.For [" + outOfRange.Message + "]");
    }
}
However, if 4 files are small and each takes 5 seconds to upload, while the remaining one is big and takes 30 minutes, then once the 4 small files are done I will have only one file uploading, and I need to wait for it to finish before starting any others in the list.
After finishing uploading all the files in the list, my service goes to sleep, and then, when it wakes up again, it scans the file system again.
What is the strategy that best fits my needs? Is it advisable to go this route or will it create concurrency nightmares? I need to avoid uploading any file twice.
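One strategy that fits (a sketch using the question's own helpers, not a definitive implementation; SemaphoreSlim's synchronous Wait is available on .NET 4.0): queue every file as its own work item, but gate the uploads with a semaphore so a slot is freed the moment any upload finishes. Because the list is built once before any worker starts, no file is handed to two threads.

int maximumNumberOfConcurrentThreads = getMaxThreads(databaseConnection);
var slots = new SemaphoreSlim(maximumNumberOfConcurrentThreads, maximumNumberOfConcurrentThreads);
var pending = new List<Task>();

foreach (FileToUploadInfo file in getFilesToUploadFromFS(directory))
{
    slots.Wait(); // blocks only while all upload slots are busy
    FileToUploadInfo current = file; // avoid capturing the loop variable
    pending.Add(Task.Factory.StartNew(() =>
    {
        // third argument assumed to be a thread/worker identifier, as in the question's Parallel.For
        try { uploadFile(current, databaseConnection, Thread.CurrentThread.ManagedThreadId); }
        finally { slots.Release(); } // finished: let the next queued file start
    }));
}

Task.WaitAll(pending.ToArray()); // then sleep and rescan the directory

With this shape, the four quick uploads free their slots immediately, and other files start while the 30-minute upload is still running.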

Why is this eating memory?

I wrote an application whose purpose is to read logs from a large table (90 million rows) and process them into easily understandable stats: how many, how long, etc.
The first run took 7.5 hours and only processed 27 million of the 90 million rows. I would like to speed this up, so I am trying to run the queries in parallel. But when I run the code below, within a couple of minutes I crash with an Out of Memory exception.
Environments:
Sync
Test: 26 Applications, 15 million logs, 5 million retrieved, < 20 MB, takes 20 seconds
Production: 56 Applications, 90 million logs, 27 million retrieved, < 30 MB, takes 7.5 hours
Async
Test: 26 Applications, 15 million logs, 5 million retrieved, < 20 MB, takes 3 seconds
Production: 56 Applications, 90 million logs, 27 million retrieved, Out of Memory exception
public void Run()
{
    List<Application> apps;
    // Query for apps
    using (var ctx = new MyContext())
    {
        apps = ctx.Applications.Where(x => x.Type == "TypeIWant").ToList();
    }
    var tasks = new Task[apps.Count];
    for (int i = 0; i < apps.Count; i++)
    {
        var app = apps[i];
        tasks[i] = Task.Run(() => Process(app));
    }
    // try catch
    Task.WaitAll(tasks);
}

public void Process(Application app)
{
    // Query for logs for time period
    using (var ctx = new MyContext())
    {
        var logs = ctx.Logs.Where(l => l.Id == app.Id).AsNoTracking();
        foreach (var log in logs)
        {
            Interlocked.Increment(ref _totalLogsRead);
            var l = log;
            Task.Run(() => ProcessLog(l, app.Id));
        }
    }
}
Is it ill advised to create 56 contexts?
Do I need to dispose and re-create contexts after a certain number of logs retrieved?
Perhaps I'm misunderstanding how the IQueryable is working? <-- My Guess
My understanding is that it will retrieve logs as needed. I guess that means the loop works like a yield? Or is my issue that 56 'threads' are querying the database and I am storing 27 million logs in memory?
Side question
The results don't really scale together. Based on the Test environment results I would expect Production to only take a few minutes. I assume the increase is directly related to the number of records in the table.
With 27 million rows the problem is one of stream processing, not parallel execution. You need to approach the problem as you would with SQL Server's SSIS or any other ETL tool: each processing step is a transformation that processes its input and sends its output to the next step.
Parallel processing is achieved by using a separate thread to run each step. Some steps could also use multiple threads to process multiple inputs, up to a limit. Setting limits on each step's thread count and input buffer ensures you can achieve maximum throughput without flooding your machine with waiting tasks.
.NET's TPL Dataflow addresses exactly this scenario. It provides blocks to transform inputs to outputs (TransformBlock), split collections into individual messages (TransformManyBlock), execute actions without transformations (ActionBlock), combine data in batches (BatchBlock), etc.
You can also specify the maximum degree of parallelism for each step so that, e.g., only one log query executes at a time while 10 tasks handle log processing.
In your case, you could:
Start with a TransformManyBlock that receives an application type and returns a list of app IDs
A TransformBlock reads the logs for a specific ID and sends them downstream
An ActionBlock processes the batch.
Step #3 could be broken into several further steps. E.g. if you don't need to process all of an app's log entries together, you can use a step that processes individual entries. Or you could first group them by date.
Another option is to create a custom block that reads data from the database using a DbDataReader and posts each entry to the next step immediately, instead of waiting for all rows to return. This would allow you to process each entry as it arrives, instead of waiting to receive all entries.
If each app's log contains many entries, this could be a huge memory and time saver.
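A rough sketch of that pipeline (illustrative only; it reuses MyContext, ProcessLog and the l.Id filter from the question, the Log type name and all block settings are assumptions to tune):

// requires the System.Threading.Tasks.Dataflow package
var getAppIds = new TransformManyBlock<string, int>(type =>
{
    using (var ctx = new MyContext())
        return ctx.Applications.Where(x => x.Type == type).Select(x => x.Id).ToList();
});

// one query at a time; BoundedCapacity keeps the read-ahead small, so logs
// are pulled from the database roughly as fast as they are processed
var readLogs = new TransformManyBlock<int, Log>(appId => ReadLogs(appId),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1, BoundedCapacity = 1000 });

// the question filters on l.Id == app.Id, so the app id is recoverable from the log
var processLogs = new ActionBlock<Log>(log => ProcessLog(log, log.Id),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 10, BoundedCapacity = 1000 });

var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
getAppIds.LinkTo(readLogs, linkOptions);
readLogs.LinkTo(processLogs, linkOptions);

getAppIds.Post("TypeIWant");
getAppIds.Complete();
processLogs.Completion.Wait();

// lazy iterator: the context stays open while the block streams entries out,
// so logs are never materialized in memory all at once
IEnumerable<Log> ReadLogs(int appId)
{
    using (var ctx = new MyContext())
        foreach (var log in ctx.Logs.Where(l => l.Id == appId).AsNoTracking())
            yield return log;
}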

Index, IndexMany, IndexAsync, IndexManyAsync with NEST

I am trying to understand the indexing options in NEST for Elasticsearch. I executed each of them and here are my results:
var node = new Uri("http://localhost:9200");
var settings = new ConnectionSettings(node, defaultIndex: "mydatabase");
settings.SetTimeout(1800000);
var client = new ElasticClient(settings);
var createIndexResult = client.CreateIndex("mydatabase");
var mapResult = client.Map<Product>(c => c.MapFromAttributes().SourceField(s => s.Enabled(true)));
1) Index: When I use the Index option, iterating over each object, it works smoothly although it is slow.
foreach (var item in Items)
{
    elasticClient.Index(item);
}
2) IndexAsync: This worked without any exception, but it was not faster than the sync iteration and fewer documents were indexed.
foreach (var item in Items)
{
    elasticClient.IndexAsync(item);
}
3) IndexMany: I tried elasticClient.IndexMany(items); (without the foreach, of course). It runs faster than the foreach-with-Index option, but when I have a lot of data (in my case 500,000 objects) it threw an exception saying:
"System.Net.WebException: The underlying connection was closed: A
connection that its continuation was expected, has been closed by the
server ..
    at System.Net.HttpWebRequest.GetResponse ()"
When I check the log file, I can see only:
"[2016-01-14 10:21:49,567][WARN ][http.netty] [Microchip] Caught exception while handling client http traffic, closing connection [id: 0x68398975, /0:0:0:0:0:0:0:1:57860 => /0:0:0:0:0:0:0:1:9200]"
4) IndexManyAsync: elasticClient.IndexManyAsync(Items); the async version throws a similar exception to the sync one, but I can see more information in the log file:
[2016-01-14 11:00:16,086][WARN ][http.netty] [Microchip] Caught exception while handling client http traffic, closing connection [id: 0x43bca172, /0:0:0:0:0:0:0:1:59314 => /0:0:0:0:0:0:0:1:9200]
org.elasticsearch.common.netty.handler.codec.frame.TooLongFrameException: HTTP content length exceeded 104857600 bytes.
My questions are: what are the exact differences? In which cases might we need async? Why do both the IndexMany and IndexManyAsync options throw such an exception?
It looks like the Index option is the safest one. Is it OK to just use it like that?
Using sync or async will not have any impact on Elasticsearch indexing performance. You would want to use async if you do not want to block your client code on completion of indexing, that's all.
Coming to Index vs IndexMany, it is always recommended to use the latter to take advantage of batching and to avoid too many request/response cycles between your client and Elasticsearch. That said, you cannot simply index such a huge number of documents in a single request. The exception message is pretty clear in saying that your batch index request exceeded the HTTP content length limit of 100 MB. What you need to do is reduce the number of documents per IndexMany call so that you do not hit this limit, and then invoke IndexMany multiple times until all 500,000 documents are indexed.
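A sketch of that chunking (illustrative; it reuses items and elasticClient from the question, and batchSize is an assumption to tune against your average document size):

const int batchSize = 1000; // keep each request well under the 100 MB limit
for (int i = 0; i < items.Count; i += batchSize)
{
    var batch = items.Skip(i).Take(batchSize).ToList();
    var response = elasticClient.IndexMany(batch);
    if (!response.IsValid)
    {
        // inspect response.ItemsWithErrors / response.ServerError, then retry or log
    }
}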
The problem with IndexMany and IndexManyAsync is that you are indexing too much data in one request.
This can be solved by making multiple IndexMany calls on subsets of your list, but there is now an easier way to deal with this, called BulkAllObservable:
var bulkAllObservable = client.BulkAll(items, b => b
    .Index("myindex")
    // how long to wait between retries
    .BackOffTime("30s")
    // how many retries are attempted if a failure occurs
    .BackOffRetries(2)
    // refresh the index once the bulk operation completes
    .RefreshOnCompleted()
    // how many concurrent bulk requests to make
    .MaxDegreeOfParallelism(Environment.ProcessorCount)
    // number of items per bulk request
    .Size(1000)
)
// Perform the indexing, waiting up to 15 minutes.
// Whilst the BulkAll calls are asynchronous this is a blocking operation
.Wait(TimeSpan.FromMinutes(15), next =>
{
    // do something on each response, e.g. write number of batches indexed to console
});
This will index your whole list in chunks of 1000 items at a time.

Azure Table Storage QueryAll(), Improve Throughput

I have some data (approximately 5 million items in 1,500 tables, 10 GB) in Azure tables. The entities can be large and contain some serialized binary data in the protobuf format.
I have to process all of them and transform them into another structure. This processing is not thread safe. I also process some data from a MongoDB replica set using the same code (the MongoDB is hosted in another datacenter).
For debugging purposes I log the throughput and realized that it is very low: with MongoDB I get a throughput of 5,000 items/sec, with Azure Table Storage only 30 items/sec.
To improve the performance I tried to use TPL Dataflow, but it doesn't help:
public async Task QueryAllAsync(Action<StoredConnectionSetModel> handler)
{
    List<CloudTable> tables = await QueryAllTablesAsync(companies, minDate);
    ActionBlock<StoredConnectionSetModel> handlerBlock = new ActionBlock<StoredConnectionSetModel>(handler,
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1 });
    ActionBlock<CloudTable> downloaderBlock = new ActionBlock<CloudTable>(x => QueryTableAsync(x, s => handlerBlock.Post(s)),
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20 });
    foreach (CloudTable table in tables)
    {
        downloaderBlock.Post(table);
    }
}

private static async Task QueryTableAsync(CloudTable table, Action<StoredConnectionSetModel> handler)
{
    TableQuery<AzureTableEntity<StoredConnectionSetModel>> query = new TableQuery<AzureTableEntity<StoredConnectionSetModel>>();
    TableContinuationToken token = null;
    do
    {
        TableQuerySegment<AzureTableEntity<StoredConnectionSetModel>> segment =
            await table.ExecuteQuerySegmentedAsync<AzureTableEntity<StoredConnectionSetModel>>(query, token);
        foreach (var entity in segment.Results)
        {
            handler(entity.Entity);
        }
        token = segment.ContinuationToken;
    }
    while (token != null);
}
I run the batch process on my local machine (with a 100 Mbit connection) and in Azure (as a worker role), and strangely the throughput on my machine is higher (100 items/sec) than in Azure. Locally I reach the maximum capacity of my internet connection, but the worker role should not have this 100 Mbit limitation, I hope.
How can I increase the throughput? I have no idea what is going wrong here.
EDIT: I realized that I was wrong about the 30 items per second. It is often higher (100/sec), depending on the size of the items I guess. According to the documentation (http://azure.microsoft.com/en-us/documentation/articles/storage-performance-checklist/#subheading10) there is a limit: the scalability target for accessing tables is up to 20,000 entities (1 KB each) per second for an account. That is only 19 MB/sec, not so impressive if you keep in mind that there are also normal requests from the production system. I will probably test using multiple accounts.
EDIT #2: I made two simple tests, starting with a list of 500 keys [1...500] (pseudo code):
Test #1, old approach (TABLE 1):
foreach (key1 in keys)
    foreach (key2 in keys)
        insert new Entity { partitionKey = key1, rowKey = key2 }
Test #2, new approach (TABLE 2):
numPartitions = 100
foreach (key1 in keys)
    foreach (key2 in keys)
        insert new Entity { partitionKey = (key1 + key2).GetHashCode() % numPartitions, rowKey = key1 + key2 }
Each entity gets another property with 10 KB of random text data.
Then I ran the query tests. In the first case I query all entities from Table 1 in one thread (sequential).
In the next test I create one task for each partition key and query all entities from Table 2 (parallel). I know the test is not that good, because in my production environment I have a lot more than 500 partitions per table, but it doesn't matter; at least the second attempt should perform well.
It makes no difference. My max throughput is 600 entities/sec, varying between 200 and 400 most of the time. The documentation says that I can query 20,000 entities/sec (1 KB each), so I should get at least 1,500 or so on average, I think. I tested it on a machine with a 500 Mbit internet connection and I only reached about 30 Mbit, so the connection should not be the problem.
You should also check out the Table Storage Design Guide. Hope this helps.
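One concrete problem in the posted code, independent of storage throughput: QueryAllAsync posts the tables and returns immediately, it never signals or awaits completion, so there is no way to know when the pipeline has actually finished. A sketch of the missing plumbing, keeping the question's block names:

foreach (CloudTable table in tables)
{
    downloaderBlock.Post(table);
}
downloaderBlock.Complete();        // no more tables will be posted
await downloaderBlock.Completion;  // wait until every table has been queried
handlerBlock.Complete();           // now no more entities can arrive
await handlerBlock.Completion;     // wait until every entity has been handled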
