As the title suggests, I need to insert 100,000+ records into a DocumentDB collection programmatically. The data will be used for creating reports later on. I am using the Azure Documents SDK and a stored procedure for bulk inserting documents (see the question Azure documentdb bulk insert using stored procedure).
The following console application shows how I'm inserting documents.
InsertDocuments generates 500 test documents to pass to the stored procedure. The main function calls InsertDocuments 10 times, inserting 5,000 documents overall. Running this application results in 500 documents getting inserted every few seconds. If I increase the number of documents per call I start to get errors and lost documents.
Can anyone recommend a faster way to insert documents?
static void Main(string[] args)
{
Console.WriteLine("Starting...");
MainAsync().Wait();
}
static async Task MainAsync()
{
int campaignId = 1001,
count = 500;
for (int i = 0; i < 10; i++)
{
await InsertDocuments(campaignId, (count * i) + 1, (count * i) + count);
}
}
static async Task InsertDocuments(int campaignId, int startId, int endId)
{
using (DocumentClient client = new DocumentClient(new Uri(documentDbUrl), documentDbKey))
{
List<dynamic> items = new List<dynamic>();
// Create x number of documents to insert
for (int i = startId; i <= endId; i++)
{
var item = new
{
id = Guid.NewGuid(),
campaignId = campaignId,
userId = i,
status = "Pending"
};
items.Add(item);
}
var task = client.ExecuteStoredProcedureAsync<dynamic>("/dbs/default/colls/campaignusers/sprocs/bulkImport", new RequestOptions()
{
PartitionKey = new PartitionKey(campaignId)
},
new
{
items = items
});
try
{
await task;
int insertCount = (int)task.Result.Response;
Console.WriteLine("{0} documents inserted...", insertCount);
}
catch (Exception e)
{
Console.WriteLine("Error: {0}", e.Message);
}
}
}
The fastest way to insert documents into Azure DocumentDB is demonstrated in a benchmark sample on GitHub: https://github.com/Azure/azure-documentdb-dotnet/tree/master/samples/documentdb-benchmark
The following tips will help you achieve the best throughput using the .NET SDK (a configuration sketch follows the list below):
Initialize a singleton DocumentClient
Use Direct connectivity and TCP protocol (ConnectionMode.Direct and ConnectionProtocol.Tcp)
Use 100s of Tasks in parallel (depends on your hardware)
Increase the MaxConnectionLimit in the DocumentClient constructor to a high value, say 1000 connections
Turn gcServer on
Make sure your collection has the appropriate provisioned throughput (and a good partition key)
Running in the same Azure region will also help
With 10,000 RU/s, you can insert 100,000 documents in about 50 seconds (approximately 5 request units per write).
With 100,000 RU/s, you can insert in about 5 seconds. You can make this as fast as you want to, by configuring throughput (and for very high # of inserts, spread inserts across multiple VMs/workers)
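Putting those tips together, a minimal client setup might look like the sketch below. It assumes the v1 Microsoft.Azure.Documents.Client SDK; the endpoint, key, and collection link are placeholders, and the degree of parallelism should be tuned to your hardware (gcServer is enabled in app.config rather than in code).
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Documents.Client;

static class CosmosWriter
{
    private const string EndpointUrl = "<your-account-endpoint>"; // placeholder
    private const string AuthKey = "<your-account-key>";          // placeholder

    // Singleton client: Direct mode, TCP, and a high connection limit.
    private static readonly DocumentClient Client = new DocumentClient(
        new Uri(EndpointUrl),
        AuthKey,
        new ConnectionPolicy
        {
            ConnectionMode = ConnectionMode.Direct,
            ConnectionProtocol = Protocol.Tcp,
            MaxConnectionLimit = 1000
        });

    public static Task InsertAsync(IEnumerable<object> docs, string collectionLink)
    {
        // Fan the writes out over many concurrent tasks; the SDK multiplexes them over TCP.
        var tasks = docs.Select(d => Client.CreateDocumentAsync(collectionLink, d));
        return Task.WhenAll(tasks);
    }
}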
EDIT (7/12/19): You can now use the bulk executor library: https://learn.microsoft.com/en-us/azure/cosmos-db/bulk-executor-overview
The Cosmos DB team has just released a bulk import and update SDK. Unfortunately it is only available in .NET Framework 4.5.1, but it apparently does a lot of the heavy lifting for you and maximizes use of the provisioned throughput. See:
https://learn.microsoft.com/en-us/azure/cosmos-db/bulk-executor-overview
https://learn.microsoft.com/en-us/azure/cosmos-db/sql-api-sdk-bulk-executor-dot-net
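Usage looks roughly like the sketch below. The class and method names (BulkExecutor, InitializeAsync, BulkImportAsync) and the retry tweak follow the linked docs, but treat this as an outline and verify the exact signatures against the package version you install.
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;
using Microsoft.Azure.CosmosDB.BulkExecutor;

static async Task ImportWithBulkExecutorAsync(
    DocumentClient client, DocumentCollection collection, IEnumerable<object> documents)
{
    var bulkExecutor = new BulkExecutor(client, collection);
    await bulkExecutor.InitializeAsync();

    // The docs suggest disabling client-side throttling retries so the library can manage them itself.
    client.ConnectionPolicy.RetryOptions.MaxRetryWaitTimeInSeconds = 0;
    client.ConnectionPolicy.RetryOptions.MaxRetryAttemptsOnThrottledRequests = 0;

    var response = await bulkExecutor.BulkImportAsync(documents, enableUpsert: true);
    Console.WriteLine("Imported {0} documents ({1} RU) in {2}",
        response.NumberOfDocumentsImported,
        response.TotalRequestUnitsConsumed,
        response.TotalTimeTaken);
}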
The Cosmos DB SDK has since been updated to allow bulk insert via the AllowBulkExecution option: https://learn.microsoft.com/en-us/azure/cosmos-db/tutorial-sql-api-dotnet-bulk-import. A minimal sketch is shown below.
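This is a rough sketch assuming the v3 Microsoft.Azure.Cosmos SDK; the endpoint, key, database/container names, and the item class are placeholders.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public class CampaignUser
{
    public string id { get; set; }
    public string campaignId { get; set; }
    public int userId { get; set; }
    public string status { get; set; }
}

static async Task BulkInsertAsync(IReadOnlyList<CampaignUser> items)
{
    // With AllowBulkExecution = true the SDK groups concurrent point operations into batches.
    using var client = new CosmosClient("<endpoint>", "<key>",
        new CosmosClientOptions { AllowBulkExecution = true });
    Container container = client.GetContainer("<database>", "<container>");

    var tasks = items.Select(item =>
        container.CreateItemAsync(item, new PartitionKey(item.campaignId)));
    await Task.WhenAll(tasks);
}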
Another approach is a stored procedure, as mentioned by others. A stored procedure requires a partition key, and according to the documentation it should finish within 4 seconds, otherwise all records are rolled back. Below is code using the Python Azure DocumentDB SDK and a JavaScript-based stored procedure. I have modified the script and resolved a lot of errors; the code below works fine:
function bulkimport2(docObject) {
var collection = getContext().getCollection();
var collectionLink = collection.getSelfLink();
// The count of imported docs, also used as current doc index.
var count = 0;
// Validate input.
if (!docObject.items) {
getContext().getResponse().setBody(0);
return;
}
// Strip escaped CR/LF sequences, then parse back into an array of documents.
docObject.items = JSON.stringify(docObject.items);
docObject.items = docObject.items.replace(/\\\\r/g, "");
docObject.items = docObject.items.replace(/\\\\n/g, "");
var docs = JSON.parse(docObject.items);
var docsLength = docs.length;
if (docsLength == 0) {
getContext().getResponse().setBody(0);
return;
}
// Call the CRUD API to create a document.
tryCreate(docs[count], callback, collectionLink,count);
// Note that there are 2 exit conditions:
// 1) The createDocument request was not accepted.
// In this case the callback will not be called, we just call setBody and we are done.
// 2) The callback was called docs.length times.
// In this case all documents were created and we don't need to call tryCreate anymore. Just call setBody and we are done.
function tryCreate(doc, callback, collectionLink,count ) {
doc=JSON.stringify(doc);
if (typeof doc == "undefined") {
getContext().getResponse().setBody(count);
return ;
} else {
doc = doc.replace("\\r", "");
doc = doc.replace("\\n", "");
doc=JSON.parse(doc);
}
var isAccepted = collection.upsertDocument(collectionLink, doc, callback);
// If the request was accepted, callback will be called.
// Otherwise report current count back to the client,
// which will call the script again with remaining set of docs.
// This condition will happen when this stored procedure has been running too long
// and is about to get cancelled by the server. This will allow the calling client
// to resume this batch from the point we got to before isAccepted was set to false
if (!isAccepted) {
getContext().getResponse().setBody(count);
}
}
// This is called when collection.createDocument is done and the document has been persisted.
function callback(err, doc, options) {
if (err) throw getContext().getResponse().setBody(err + doc);
// One more document has been inserted, increment the count.
count++;
if (count >= docsLength) {
// If we have created all documents, we are done. Just set the response.
getContext().getResponse().setBody(count);
return ;
} else {
// Create next document.
tryCreate(docs[count], callback, collectionLink,count);
}
}
}
EDIT: once all records have been processed, the stored procedure calls getContext().getResponse().setBody(count); and returns.
Below is the Python script that registers the stored procedure and performs the batch import:
# Initialize the Python DocumentDB client
client = document_client.DocumentClient(config['ENDPOINT'], {'masterKey': config['MASTERKEY'] ,'DisableSSLVerification' : 'true' })
# Create a database
#db = client.CreateDatabase({ 'id': config['DOCUMENTDB_DATABASE'] })
db=client.ReadDatabases({ 'id': 'db2' })
print(db)
# Create collection options
options = {
'offerEnableRUPerMinuteThroughput': True,
'offerVersion': "V2",
'offerThroughput': 400
}
# Create a collection
#collection = client.CreateCollection('dbs/db2' , { 'id': 'coll2'}, options)
#collection = client.CreateCollection({ 'id':'db2'},{ 'id': 'coll2'}, options)
database_link = 'dbs/db2'
collection_link = database_link + '/colls/coll2'
"""
#List collections
collection = client.ReadCollection(collection_link)
print(collection)
print('Databases:')
databases = list(client.ReadDatabases())
if not databases:
print('No Databases:')
for database in databases:
print(database['id'])
"""
# Create some documents
"""
document1 = client.CreateDocument(collection['_self'],
{
'Web Site': 0,
'Cloud Service': 0,
'Virtual Machine': 0,
'name': 'some'
})
document2 = client.CreateDocument(collection['_self'],
{
'Web Site': 1,
'Cloud Service': 0,
'Virtual Machine': 0,
'name': 'some'
})
"""
# Query them in SQL
"""
query = { 'query': 'SELECT * FROM server s' }
options = {}
options['enableCrossPartitionQuery'] = True
options['maxItemCount'] = 20
#result_iterable = client.QueryDocuments(collection['_self'], query, options)
result_iterable = client.QueryDocuments(collection_link, query, options)
results = list(result_iterable);
print(results)
"""
##How to store procedure and use it
"""
sproc3 = {
'id': 'storedProcedure2',
'body': (
'function (input) {' +
' getContext().getResponse().setBody(' +
' \'a\' + input.temp);' +
'}')
}
retrieved_sproc3 = client.CreateStoredProcedure(collection_link,sproc3)
result = client.ExecuteStoredProcedure('dbs/db2/colls/coll2/sprocs/storedProcedure3',{'temp': 'so'})
"""
## delete all records in collection
"""
result = client.ExecuteStoredProcedure('dbs/db2/colls/coll2/sprocs/bulkDeleteSproc',"SELECT * FROM c ORDER BY c._ts DESC ")
print(result)
"""
multiplerecords="""[{
"Virtual Machine": 0,
"name": "some",
"Web Site": 0,
"Cloud Service": 0
},
{
"Virtual Machine": 0,
"name": "some",
"Web Site": 1,
"Cloud Service": 0
}]"""
multiplerecords=json.loads(multiplerecords)
print(multiplerecords)
print(str(json.dumps(json.dumps(multiplerecords).encode('utf8'))))
#bulkloadresult = client.ExecuteStoredProcedure('dbs/db2/colls/coll2/sprocs/bulkImport',json.dumps(multiplerecords).encode('utf8'))
#bulkloadresult = client.ExecuteStoredProcedure('dbs/db2/colls/coll2/sprocs/bulkImport',json.dumps(json.loads(r'{"items": [{"name":"John","age":30,"city":"New York"},{"name":"John","age":30,"city":"New York"}]}')).encode('utf8'))
str1='{"name":"John","age":30,"city":"New York","PartitionKey" : "Morisplane"}'
str2='{"name":"John","age":30,"city":"New York","partitionKey" : "Morisplane"}'
key1=base64.b64encode(str1.encode("utf-8"))
key2=base64.b64encode(str2.encode("utf-8"))
data= {"items":[{"id": key1 ,"name":"John","age":30,"city":"Morisplane","PartitionKey" : "Morisplane" },{"id": key2,"name":"John","age":30,"city":"Morisplane","partitionKey" : "Morisplane"}] , "city": "Morisplane", "partitionKey" : "Morisplane"}
print(repr(data))
#retrieved_sproc3 =client.DeleteStoredProcedure('dbs/db2/colls/coll2/sprocs/bulkimport2')
sproc3 = {
'id': 'bulkimport2',
'body': (
"""function bulkimport2(docObject) {
var collection = getContext().getCollection();
var collectionLink = collection.getSelfLink();
// The count of imported docs, also used as current doc index.
var count = 0;
// Validate input.
if (!docObject.items) {
getContext().getResponse().setBody(0);
return;
}
// Strip escaped CR/LF sequences, then parse back into an array of documents.
docObject.items = JSON.stringify(docObject.items);
docObject.items = docObject.items.replace(/\\\\r/g, "");
docObject.items = docObject.items.replace(/\\\\n/g, "");
var docs = JSON.parse(docObject.items);
var docsLength = docs.length;
if (docsLength == 0) {
getContext().getResponse().setBody(0);
return;
}
// Call the CRUD API to create a document.
tryCreate(docs[count], callback, collectionLink,count);
// Note that there are 2 exit conditions:
// 1) The createDocument request was not accepted.
// In this case the callback will not be called, we just call setBody and we are done.
// 2) The callback was called docs.length times.
// In this case all documents were created and we don't need to call tryCreate anymore. Just call setBody and we are done.
function tryCreate(doc, callback, collectionLink,count ) {
doc=JSON.stringify(doc);
if (typeof doc == "undefined") {
getContext().getResponse().setBody(count);
return ;
} else {
doc = doc.replace("\\r", "");
doc = doc.replace("\\n", "");
doc=JSON.parse(doc);
}
var isAccepted = collection.upsertDocument(collectionLink, doc, callback);
// If the request was accepted, callback will be called.
// Otherwise report current count back to the client,
// which will call the script again with remaining set of docs.
// This condition will happen when this stored procedure has been running too long
// and is about to get cancelled by the server. This will allow the calling client
// to resume this batch from the point we got to before isAccepted was set to false
if (!isAccepted) {
getContext().getResponse().setBody(count);
}
}
// This is called when collection.createDocument is done and the document has been persisted.
function callback(err, doc, options) {
if (err) throw getContext().getResponse().setBody(err + doc);
// One more document has been inserted, increment the count.
count++;
if (count >= docsLength) {
// If we have created all documents, we are done. Just set the response.
getContext().getResponse().setBody(count);
return ;
} else {
// Create next document.
tryCreate(docs[count], callback, collectionLink,count);
}
}
}"""
)
}
#retrieved_sproc3 = client.CreateStoredProcedure(collection_link,sproc3)
bulkloadresult = client.ExecuteStoredProcedure('dbs/db2/colls/coll2/sprocs/bulkimport2', data , {"partitionKey" : "Morisplane"} )
print(repr(bulkloadresult))
private async Task<T> ExecuteDataUpload<T>(IEnumerable<object> data,PartitionKey partitionKey)
{
using (var client = new DocumentClient(m_endPointUrl, m_authKey, connPol))
{
while (true)
{
try
{
var result = await client.ExecuteStoredProcedureAsync<T>(m_spSelfLink, new RequestOptions { PartitionKey = partitionKey }, data);
return result;
}
catch (DocumentClientException ex)
{
if (429 == (int)ex.StatusCode)
{
Thread.Sleep(ex.RetryAfter);
continue;
}
if (HttpStatusCode.RequestTimeout == ex.StatusCode)
{
Thread.Sleep(ex.RetryAfter);
continue;
}
throw; // rethrow, preserving the original stack trace
}
catch (Exception)
{
Thread.Sleep(TimeSpan.FromSeconds(1));
continue;
}
}
}
}
public async Task uploadData(IEnumerable<object> data, string partitionKey)
{
int groupSize = 600;
int dataSize = data.Count();
int chunkSize = dataSize > groupSize ? groupSize : dataSize;
List<Task> uploadTasks = new List<Task>();
while (dataSize > 0)
{
IEnumerable<object> chunkData = data.Take(chunkSize);
object[] taskData = new object[3];
taskData[0] = chunkData;
taskData[1] = chunkSize;
taskData[2] = partitionKey;
uploadTasks.Add(Task.Factory.StartNew(async (arg) =>
{
object[] reqdData = (object[])arg;
int chunkSizes = (int)reqdData[1];
IEnumerable<object> chunkDatas = (IEnumerable<object>)reqdData[0];
var partKey = new PartitionKey((string)reqdData[2]);
int chunkDatasCount = chunkDatas.Count();
while (chunkDatasCount > 0)
{
int insertedCount = await ExecuteDataUpload<int>(chunkDatas, partKey);
chunkDatas = chunkDatas.Skip(insertedCount);
chunkDatasCount = chunkDatasCount - insertedCount;
}
}, taskData).Unwrap()); // Unwrap so Task.WhenAll waits for the async body, not just for it to start
data = data.Skip(chunkSize);
dataSize = dataSize - chunkSize;
chunkSize = dataSize > groupSize ? groupSize : dataSize;
}
await Task.WhenAll(uploadTasks);
}
Now call uploadData with the list of objects you want to upload, in parallel if you like. Just keep one thing in mind: each call should only be given data belonging to the same partition key. A hypothetical caller is sketched below.
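This sketch assumes uploadData lives on some hosting class (here called Uploader, which is hypothetical) and that every object in the list targets the same partition key value:
using System;
using System.Linq;
using System.Threading.Tasks;

static async Task RunUploadAsync(Uploader uploader) // "Uploader" is a placeholder for the class hosting uploadData
{
    var campaignId = "1001";
    var items = Enumerable.Range(1, 100000)
        .Select(i => (object)new
        {
            id = Guid.NewGuid().ToString(),
            campaignId = campaignId,
            userId = i,
            status = "Pending"
        })
        .ToList();

    // One partition key per call; split the source per key if you have several.
    await uploader.uploadData(items, campaignId);
}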
Related
There is an API that accepts an entity with a previously unknown ID. I need to configure the rate limiter so that entities with the same ID go into the same queue. I figured out how to create a window and a queue, but how do I make a separate queue for each ID?
The entity is a JSON file; the ID is inside the file.
This is what I have written so far, but it creates a single shared queue:
services.AddRateLimiter(options => options
.AddFixedWindowLimiter(policyName: "UserPolicy", options =>
{
options.PermitLimit = 1;
options.Window = TimeSpan.FromSeconds(10);
options.QueueProcessingOrder = QueueProcessingOrder.OldestFirst;
options.QueueLimit = 3;
}));
You can try using PartitionedRateLimiter. Something along these lines (not tested):
builder.Services.AddRateLimiter(options =>
{
options.AddPolicy("myRateLimiter1", context =>
{
var request = context.Request;
var partitionKey = "";
if (request.Method == HttpMethods.Post && request.ContentLength > 0)
{
request.EnableBuffering();
var buffer = new byte[Convert.ToInt32(request.ContentLength)];
request.Body.Read(buffer, 0, buffer.Length);
//get body string here...
var requestContent = Encoding.UTF8.GetString(buffer);
// get partition key here... partitionKey = ...
request.Body.Position = 0; //rewinding the stream to 0
}
return RateLimitPartition.GetFixedWindowLimiter(
partitionKey: partitionKey,
factory: partition => new FixedWindowRateLimiterOptions
{
PermitLimit = 1,
Window = TimeSpan.FromSeconds(10),
QueueProcessingOrder = QueueProcessingOrder.OldestFirst,
QueueLimit = 3
});
});
});
Though I would suggest considering passing the Id in some other way (for example, a header) or resolving the limiter at the handler/BL level. A header-based variant is sketched below.
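For instance, if the client can send the Id in a request header, the policy no longer has to re-read the body (the header name X-Entity-Id is an assumption to be agreed with the client):
builder.Services.AddRateLimiter(options =>
{
    options.AddPolicy("myRateLimiter1", context =>
    {
        // "X-Entity-Id" is an assumed header carrying the entity ID.
        var partitionKey = context.Request.Headers["X-Entity-Id"].ToString();
        return RateLimitPartition.GetFixedWindowLimiter(
            partitionKey,
            _ => new FixedWindowRateLimiterOptions
            {
                PermitLimit = 1,
                Window = TimeSpan.FromSeconds(10),
                QueueProcessingOrder = QueueProcessingOrder.OldestFirst,
                QueueLimit = 3
            });
    });
});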
I created a C# Lambda function that is triggered by a DynamoDB stream. It gets executed just fine. However, the NewImage of the StreamRecord contains no values; the count is 0. What am I doing wrong? I have checked all the AWS documentation, but this seems more and more like a bug. My Lambda function below should work and should return at least one StreamRecord attribute in my example, yet attributeMap.Count always returns 0 when it should return 1.
public void FunctionHandler(DynamoDBEvent dynamoDbEvent, ILambdaContext context)
{
Console.WriteLine($"Beginning to process {dynamoDbEvent.Records.Count} records...");
foreach (var record in dynamoDbEvent.Records)
{
Console.WriteLine($"Event ID: {record.EventID}");
Console.WriteLine($"Event Name: {record.EventName}");
var attributeMap = record.Dynamodb.NewImage;
if (attributeMap.Count > 0) // If item does not exist, attributeMap.Count will be 0
{
Console.WriteLine(attributeMap["AccountId"].S);
}
}
Console.WriteLine("Stream processing complete.");
}
UPDATE (October 4th, 2018): I no longer use an admin app. I now use CloudFormation exclusively to create and maintain everything, including full CI/CD pipelines with CodePipeline. This includes all serverless Lambda functions in .NET Core 2.1.
I figured it out. The sucky thing is that the AWS docs do not say anything about this; it was a pain to find out. For anyone else who might need this information, here it is: you have to set the stream view type when you create the DynamoDB stream for the table (in the AWS Console you choose it when enabling the stream).
However, since I set up all tables via an admin console app (in .NET Core 2.0), here is how I set up the table, including the stream specification and the event source mapping request to the Lambda function:
var request = new CreateTableRequest
{
TableName = TABLE_CREATE_ACCOUNT,
AttributeDefinitions = new List<AttributeDefinition>()
{
new AttributeDefinition
{
AttributeName = "CommandId",
AttributeType = ScalarAttributeType.S
}
},
KeySchema = new List<KeySchemaElement>()
{
new KeySchemaElement
{
AttributeName = "CommandId",
KeyType = KeyType.HASH
}
},
ProvisionedThroughput = new ProvisionedThroughput
{
ReadCapacityUnits = 1,
WriteCapacityUnits = 1
},
StreamSpecification = new StreamSpecification
{
StreamEnabled = true,
StreamViewType = StreamViewType.NEW_IMAGE
}
};
try
{
var response = _db.CreateTableAsync(request);
var tableDescription = response.Result.TableDescription;
Console.WriteLine("{1}: {0} ReadCapacityUnits: {2} WriteCapacityUnits: {3}",
tableDescription.TableStatus,
tableDescription.TableName,
tableDescription.ProvisionedThroughput.ReadCapacityUnits,
tableDescription.ProvisionedThroughput.WriteCapacityUnits);
string status = tableDescription.TableStatus;
Console.WriteLine(TABLE_CREATE_ACCOUNT + " - " + status);
WaitUntilTableReady(TABLE_CREATE_ACCOUNT);
// This connects the DynamoDB stream to a lambda function
Console.WriteLine("Creating event source mapping between table stream '"+ TABLE_CREATE_ACCOUNT + "' and lambda 'ProcessCreateAccount'");
var req = new CreateEventSourceMappingRequest
{
BatchSize = 100,
Enabled = true,
EventSourceArn = tableDescription.LatestStreamArn,
FunctionName = "ProcessCreateAccount",
StartingPosition = EventSourcePosition.LATEST
};
var reqResponse =_lambda.CreateEventSourceMappingAsync(req);
Console.WriteLine("Event source mapping state: " + reqResponse.Result.State);
}
catch (AmazonDynamoDBException e)
{
Console.WriteLine("Error creating table '" + TABLE_CREATE_ACCOUNT + "'");
Console.WriteLine("Amazon error code: {0}", string.IsNullOrEmpty(e.ErrorCode) ? "None" : e.ErrorCode);
Console.WriteLine("Exception message: {0}", e.Message);
}
catch (Exception e)
{
Console.WriteLine("Error creating table '" + TABLE_CREATE_ACCOUNT + "'");
Console.WriteLine("Exception message: {0}", e.Message);
}
The key is
StreamViewType = StreamViewType.NEW_IMAGE
That was it.
I have an SSIS package that's launching another SSIS package in a Foreach container; because the container reports completion as soon as it launched all the packages it had to launch, I need a way to make it wait until all "child" packages have completed.
So I implemented a little sleep-wait loop that basically pulls the Execution objects off the SSISDB for the ID's I'm interested in.
The problem I'm facing is that a grand total of 0 Dts.Events.FireProgress events get fired, and if I uncomment the Dts.Events.FireInformation call in the do loop, then every second I get a message saying 23 packages are still running... except that if I check SSISDB's Active Operations window, I see that most have already completed and only 3 or 4 are actually running.
What am I doing wrong, why wouldn't runningCount contain the number of actually running executions?
using ssis = Microsoft.SqlServer.Management.IntegrationServices;
public void Main()
{
const string serverName = "REDACTED";
const string catalogName = "SSISDB";
var ssisConnectionString = $"Data Source={serverName};Initial Catalog=msdb;Integrated Security=SSPI;";
var ids = GetExecutionIDs(serverName);
var idCount = ids.Count();
var previousCount = -1;
var iterations = 0;
try
{
var fireAgain = true;
const int secondsToSleep = 1;
var sleepTime = TimeSpan.FromSeconds(secondsToSleep);
var maxIterations = TimeSpan.FromHours(1).TotalSeconds / sleepTime.TotalSeconds;
IDictionary<long, ssis.Operation.ServerOperationStatus> catalogExecutions;
using (var connection = new SqlConnection(ssisConnectionString))
{
var server = new ssis.IntegrationServices(connection);
var catalog = server.Catalogs[catalogName];
do
{
catalogExecutions = catalog.Executions
.Where(execution => ids.Contains(execution.Id))
.ToDictionary(execution => execution.Id, execution => execution.Status);
var runningCount = catalogExecutions.Count(kvp => kvp.Value == ssis.Operation.ServerOperationStatus.Running);
System.Threading.Thread.Sleep(sleepTime);
//Dts.Events.FireInformation(0, "ScriptMain", $"{runningCount} packages still running.", string.Empty, 0, ref fireAgain);
if (runningCount != previousCount)
{
previousCount = runningCount;
decimal completed = idCount - runningCount;
decimal percentCompleted = completed / idCount;
Dts.Events.FireProgress($"Waiting... {completed}/{idCount} completed", Convert.ToInt32(100 * percentCompleted), 0, 0, "", ref fireAgain);
}
iterations++;
if (iterations >= maxIterations)
{
Dts.Events.FireWarning(0, "ScriptMain", $"Timeout expired, requesting cancellation.", string.Empty, 0);
Dts.Events.FireQueryCancel();
Dts.TaskResult = (int)Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Canceled;
return;
}
}
while (catalogExecutions.Any(kvp => kvp.Value == ssis.Operation.ServerOperationStatus.Running));
}
}
catch (Exception exception)
{
if (exception.InnerException != null)
{
Dts.Events.FireError(0, "ScriptMain", exception.InnerException.ToString(), string.Empty, 0);
}
Dts.Events.FireError(0, "ScriptMain", exception.ToString(), string.Empty, 0);
Dts.Log(exception.ToString(), 0, new byte[0]);
Dts.TaskResult = (int)ScriptResults.Failure;
return;
}
Dts.TaskResult = (int)ScriptResults.Success;
}
The GetExecutionIDs function simply returns all execution IDs for the child packages, from my metadata database.
The problem is that you're re-using the same connection at every iteration. Turn this:
using (var connection = new SqlConnection(ssisConnectionString))
{
var server = new ssis.IntegrationServices(connection);
var catalog = server.Catalogs[catalogName];
do
{
catalogExecutions = catalog.Executions
.Where(execution => ids.Contains(execution.Id))
.ToDictionary(execution => execution.Id, execution => execution.Status);
Into this:
do
{
using (var connection = new SqlConnection(ssisConnectionString))
{
var server = new ssis.IntegrationServices(connection);
var catalog = server.Catalogs[catalogName];
catalogExecutions = catalog.Executions
.Where(execution => ids.Contains(execution.Id))
.ToDictionary(execution => execution.Id, execution => execution.Status);
}
And you'll get the correct execution status every time. I'm not sure why the connection can't be reused, but keeping connections as short-lived as possible is always a good idea, and this is more evidence of that.
I am trying to insert 10,000 records into Azure Table storage using ExecuteAsync(), but only approximately 7,500 records get inserted and the rest are lost. I am purposely not using the await keyword because I don't want to wait for the result, just store the records in the table. Below is my code snippet.
private static async void ConfigureAzureStorageTable()
{
CloudStorageAccount storageAccount =
CloudStorageAccount.Parse(CloudConfigurationManager.GetSetting("StorageConnectionString"));
CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
TableResult result = new TableResult();
CloudTable table = tableClient.GetTableReference("test");
table.CreateIfNotExists();
for (int i = 0; i < 10000; i++)
{
var verifyVariableEntityObject = new VerifyVariableEntity()
{
ConsumerId = String.Format("{0}", i),
Score = String.Format("{0}", i * 2 + 2),
PartitionKey = String.Format("{0}", i),
RowKey = String.Format("{0}", i * 2 + 2)
};
TableOperation insertOperation = TableOperation.Insert(verifyVariableEntityObject);
try
{
table.ExecuteAsync(insertOperation);
}
catch (Exception e)
{
Console.WriteLine(e.Message);
}
}
}
Is anything incorrect with the usage of the method?
You still want to await table.ExecuteAsync(). That will mean that ConfigureAzureStorageTable() returns control to the caller at that point, which can continue executing.
The way you have it in the question, ConfigureAzureStorageTable() is going to continue past the call to table.ExecuteAsync() and exit, and things like table will go out of scope, while the table.ExecuteAsync() task is still not complete.
There are plenty of caveats about using async void on SO and elsewhere that you will also need to consider. You could just as easily make your method async Task and not await it in the caller immediately, keeping the returned Task around for clean termination, etc.
Edit: one addition - you almost certainly want to use ConfigureAwait(false) on your await there, as you don't appear to need to preserve any context. This blog post has some guidelines on that and async in general.
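Putting that together, a minimal sketch of the awaited shape (same entity and table as the question; the caller can then await, or at least keep hold of, the returned Task):
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure; // CloudConfigurationManager
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

private static async Task ConfigureAzureStorageTableAsync()
{
    CloudStorageAccount storageAccount =
        CloudStorageAccount.Parse(CloudConfigurationManager.GetSetting("StorageConnectionString"));
    CloudTable table = storageAccount.CreateCloudTableClient().GetTableReference("test");
    table.CreateIfNotExists();

    var insertTasks = new List<Task>();
    for (int i = 0; i < 10000; i++)
    {
        var entity = new VerifyVariableEntity
        {
            ConsumerId = i.ToString(),
            Score = (i * 2 + 2).ToString(),
            PartitionKey = i.ToString(),
            RowKey = (i * 2 + 2).ToString()
        };
        insertTasks.Add(table.ExecuteAsync(TableOperation.Insert(entity)));
    }

    // Nothing goes out of scope until every pending insert has completed.
    await Task.WhenAll(insertTasks).ConfigureAwait(false);
}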
Based on your requirement, I have successfully tested your scenario on my side using CloudTable.ExecuteAsync and CloudTable.ExecuteBatchAsync. Here is my code snippet for using batch operations to insert records into Azure Table storage; you could refer to it.
Program.cs Main
class Program
{
static void Main(string[] args)
{
CloudStorageAccount storageAccount =
CloudStorageAccount.Parse(CloudConfigurationManager.GetSetting("StorageConnectionString"));
CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
TableResult result = new TableResult();
CloudTable table = tableClient.GetTableReference("test");
table.CreateIfNotExists();
//Generate records to be inserted into Azure Table Storage
var entities = Enumerable.Range(1, 10000).Select(i => new VerifyVariableEntity()
{
ConsumerId = String.Format("{0}", i),
Score = String.Format("{0}", i * 2 + 2),
PartitionKey = String.Format("{0}", i),
RowKey = String.Format("{0}", i * 2 + 2)
});
//Group records by PartitionKey and prepare for executing batch operations
var batches = TableBatchHelper<VerifyVariableEntity>.GetBatches(entities);
//Execute batch operations in parallel
Parallel.ForEach(batches, new ParallelOptions()
{
MaxDegreeOfParallelism = 5
}, (batchOperation) =>
{
try
{
table.ExecuteBatch(batchOperation);
Console.WriteLine("Writing {0} records", batchOperation.Count);
}
catch (Exception ex)
{
Console.WriteLine("ExecuteBatch throw a exception:" + ex.Message);
}
});
Console.WriteLine("Done!");
Console.WriteLine("Press any key to exit...");
Console.ReadKey();
}
}
TableBatchHelper.cs
public class TableBatchHelper<T> where T : ITableEntity
{
const int batchMaxSize = 100;
public static IEnumerable<TableBatchOperation> GetBatches(IEnumerable<T> items)
{
var list = new List<TableBatchOperation>();
var partitionGroups = items.GroupBy(arg => arg.PartitionKey).ToArray();
foreach (var group in partitionGroups)
{
T[] groupList = group.ToArray();
int offSet = batchMaxSize;
T[] entities = groupList.Take(offSet).ToArray();
while (entities.Any())
{
var tableBatchOperation = new TableBatchOperation();
foreach (var entity in entities)
{
tableBatchOperation.Add(TableOperation.InsertOrReplace(entity));
}
list.Add(tableBatchOperation);
entities = groupList.Skip(offSet).Take(batchMaxSize).ToArray();
offSet += batchMaxSize;
}
}
return list;
}
}
Note: As mentioned in the official document about inserting a batch of entities:
A single batch operation can include up to 100 entities.
All entities in a single batch operation must have the same partition key.
In summary, please check whether this works on your side. Also, you could log the detailed exception within your console application and capture the HTTP traffic via Fiddler to catch the failing requests while inserting records into Azure Table storage.
How about using a TableBatchOperation to run batches of N inserts at once?
private const int BatchSize = 100;
private static async void ConfigureAzureStorageTable()
{
CloudStorageAccount storageAccount =
CloudStorageAccount.Parse(CloudConfigurationManager.GetSetting("StorageConnectionString"));
CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
TableResult result = new TableResult();
CloudTable table = tableClient.GetTableReference("test");
table.CreateIfNotExists();
var batchOperation = new TableBatchOperation();
for (int i = 0; i < 10000; i++)
{
var verifyVariableEntityObject = new VerifyVariableEntity()
{
ConsumerId = String.Format("{0}", i),
Score = String.Format("{0}", i * 2 + 2),
PartitionKey = String.Format("{0}", i),
RowKey = String.Format("{0}", i * 2 + 2)
};
TableOperation insertOperation = TableOperation.Insert(verifyVariableEntityObject);
batchOperation.Add(insertOperation);
if (batchOperation.Count >= BatchSize)
{
try
{
await table.ExecuteBatchAsync(batchOperation);
batchOperation = new TableBatchOperation();
}
catch (Exception e)
{
Console.WriteLine(e.Message);
}
}
}
if(batchOperation.Count > 0)
{
try
{
await table.ExecuteBatchAsync(batchOperation);
}
catch (Exception e)
{
Console.WriteLine(e.Message);
}
}
}
You can adjust BatchSize to what you need. Small disclaimer: I didn't try to run this, though it should work.
But I can't help wondering why your function is async void. That should be reserved for event handlers and similar cases where you cannot choose the signature. In most cases you want to return a Task; as written, the caller cannot catch exceptions that occur in this function.
async void is not good practice unless it is an event handler.
https://msdn.microsoft.com/en-us/magazine/jj991977.aspx
If you plan to insert many records into Azure Table storage, batch insert is your best bet.
https://msdn.microsoft.com/en-us/library/azure/microsoft.windowsazure.storage.table.tablebatchoperation.aspx
Keep in mind that it has a limit of 100 table operations per batch.
I had the same issue and fixed it by forcing ExecuteAsync to wait for the result before the method exits:
table.ExecuteAsync(insertOperation).GetAwaiter().GetResult()
I've read conflicting information as to whether or not the WADLogsTable table used by the DiagnosticMonitor in Windows Azure will automatically prune old log entries.
I'm guessing it doesn't, and will instead grow forever - costing me money. :)
If that's the case, does anybody have a good code sample as to how to clear out old log entries from this table manually? Perhaps based on timestamp? I'd run this code from a worker role periodically.
The data in tables created by Windows Azure Diagnostics isn't deleted automatically.
However, Windows Azure PowerShell Cmdlets contain cmdlets specifically for this case.
PS D:\> help Clear-WindowsAzureLog
NAME
Clear-WindowsAzureLog
SYNOPSIS
Removes Windows Azure trace log data from a storage account.
SYNTAX
Clear-WindowsAzureLog [-DeploymentId <String>] [-From <DateTime>] [-To <DateTime>]
[-StorageAccountName <String>] [-StorageAccountKey <String>] [-UseDevelopmentStorage]
[-StorageAccountCredentials <StorageCredentialsAccountAndKey>] [<CommonParameters>]

Clear-WindowsAzureLog [-DeploymentId <String>] [-FromUtc <DateTime>] [-ToUtc <DateTime>]
[-StorageAccountName <String>] [-StorageAccountKey <String>] [-UseDevelopmentStorage]
[-StorageAccountCredentials <StorageCredentialsAccountAndKey>] [<CommonParameters>]
You need to specify the -ToUtc parameter, and all logs before that date will be deleted.
If the cleanup task needs to be performed on Azure from within the worker role, the cmdlets' C# code can be reused. The PowerShell cmdlets are published under the permissive MS Public License.
Basically, only 3 files are needed, with no other external dependencies: DiagnosticsOperationException.cs, WadTableExtensions.cs, WadTableServiceEntity.cs.
An updated version of Chriseyre2000's function. It performs much better when you need to delete many thousands of records: it searches by PartitionKey and processes the results in chunks, step by step. And remember that the best choice is to run it near the storage account (in a cloud service).
public static void TruncateDiagnostics(CloudStorageAccount storageAccount,
DateTime startDateTime, DateTime finishDateTime, Func<DateTime,DateTime> stepFunction)
{
var cloudTable = storageAccount.CreateCloudTableClient().GetTableReference("WADLogsTable");
var query = new TableQuery();
var dt = startDateTime;
while (true)
{
dt = stepFunction(dt);
if (dt>finishDateTime)
break;
var l = dt.Ticks;
string partitionKey = "0" + l;
query.FilterString = TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.LessThan, partitionKey);
query.Select(new string[] {});
var items = cloudTable.ExecuteQuery(query).ToList();
const int chunkSize = 200;
var chunkedList = new List<List<DynamicTableEntity>>();
int index = 0;
while (index < items.Count)
{
var count = items.Count - index > chunkSize ? chunkSize : items.Count - index;
chunkedList.Add(items.GetRange(index, count));
index += chunkSize;
}
foreach (var chunk in chunkedList)
{
var batches = new Dictionary<string, TableBatchOperation>();
foreach (var entity in chunk)
{
var tableOperation = TableOperation.Delete(entity);
if (batches.ContainsKey(entity.PartitionKey))
batches[entity.PartitionKey].Add(tableOperation);
else
batches.Add(entity.PartitionKey, new TableBatchOperation {tableOperation});
}
foreach (var batch in batches.Values)
cloudTable.ExecuteBatch(batch);
}
}
}
You could just do it based on the timestamp but that would be very inefficient since the whole table would need to be scanned. Here is a code sample that might help where the partition key is generated to prevent a "full" table scan. http://blogs.msdn.com/b/avkashchauhan/archive/2011/06/24/linq-code-to-query-windows-azure-wadlogstable-to-get-rows-which-are-stored-after-a-specific-datetime.aspx
Here is a solution that truncates based upon a timestamp (tested against SDK 2.0).
It does use a table scan to get the data, but if run, say, once per day it would not be too painful:
/// <summary>
/// TruncateDiagnostics(storageAccount, DateTime.Now.AddHours(-1));
/// </summary>
/// <param name="storageAccount"></param>
/// <param name="keepThreshold"></param>
public void TruncateDiagnostics(CloudStorageAccount storageAccount, DateTime keepThreshold)
{
try
{
CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
CloudTable cloudTable = tableClient.GetTableReference("WADLogsTable");
TableQuery query = new TableQuery();
query.FilterString = string.Format("Timestamp lt datetime'{0:yyyy-MM-ddTHH:mm:ss}'", keepThreshold);
var items = cloudTable.ExecuteQuery(query).ToList();
Dictionary<string, TableBatchOperation> batches = new Dictionary<string, TableBatchOperation>();
foreach (var entity in items)
{
TableOperation tableOperation = TableOperation.Delete(entity);
if (!batches.ContainsKey(entity.PartitionKey))
{
batches.Add(entity.PartitionKey, new TableBatchOperation());
}
batches[entity.PartitionKey].Add(tableOperation);
}
foreach (var batch in batches.Values)
{
cloudTable.ExecuteBatch(batch);
}
}
catch (Exception ex)
{
Trace.TraceError(string.Format("Truncate WADLogsTable exception {0}", ex), "Error");
}
}
Here's my slightly different version of @Chriseyre2000's solution, using asynchronous operations and PartitionKey querying. It's designed to run continuously within a Worker Role in my case. This one may be a bit easier on memory if you have a lot of entries to clean up.
static class LogHelper
{
/// <summary>
/// Periodically run a cleanup task for log data, asynchronously
/// </summary>
public static async void TruncateDiagnosticsAsync()
{
while ( true )
{
try
{
// Retrieve storage account from connection-string
CloudStorageAccount storageAccount = CloudStorageAccount.Parse(
CloudConfigurationManager.GetSetting( "CloudStorageConnectionString" ) );
CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
CloudTable cloudTable = tableClient.GetTableReference( "WADLogsTable" );
// keep a weeks worth of logs
DateTime keepThreshold = DateTime.UtcNow.AddDays( -7 );
// do this until we run out of items
while ( true )
{
TableQuery query = new TableQuery();
query.FilterString = string.Format( "PartitionKey lt '0{0}'", keepThreshold.Ticks );
var items = cloudTable.ExecuteQuery( query ).Take( 1000 );
if ( items.Count() == 0 )
break;
Dictionary<string, TableBatchOperation> batches = new Dictionary<string, TableBatchOperation>();
foreach ( var entity in items )
{
TableOperation tableOperation = TableOperation.Delete( entity );
// need a new batch?
if ( !batches.ContainsKey( entity.PartitionKey ) )
batches.Add( entity.PartitionKey, new TableBatchOperation() );
// can have only 100 per batch
if ( batches[entity.PartitionKey].Count < 100)
batches[entity.PartitionKey].Add( tableOperation );
}
// execute!
foreach ( var batch in batches.Values )
await cloudTable.ExecuteBatchAsync( batch );
Trace.TraceInformation( "WADLogsTable truncated: " + query.FilterString );
}
}
catch ( Exception ex )
{
Trace.TraceError( "Truncate WADLogsTable exception {0}", ex.Message );
}
// run this once per day
await Task.Delay( TimeSpan.FromDays( 1 ) );
}
}
}
To start the process, just call this from the OnStart method in your worker role.
// start the periodic cleanup
LogHelper.TruncateDiagnosticsAsync();
If you don't care about any of the contents, just delete the table. Azure Diagnostics will just recreate it.
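For example, with the same storage SDK used in the snippets above (a one-off sketch; the connection string setting name matches the earlier example, and diagnostics will recreate the table the next time it writes):
// Sketch: drop the whole WADLogsTable instead of deleting it row by row.
CloudStorageAccount storageAccount = CloudStorageAccount.Parse(
    CloudConfigurationManager.GetSetting("CloudStorageConnectionString"));
CloudTable wadLogsTable = storageAccount.CreateCloudTableClient().GetTableReference("WADLogsTable");
wadLogsTable.DeleteIfExists();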
A slightly updated version of Chriseyre2000's code:
using ExecuteQuerySegmented instead of ExecuteQuery
observing TableBatchOperation limit of 100 operations
purging all Azure tables
public static void TruncateAllAzureTables(CloudStorageAccount storageAccount, DateTime keepThreshold)
{
TruncateAzureTable(storageAccount, "WADLogsTable", keepThreshold);
TruncateAzureTable(storageAccount, "WADCrashDump", keepThreshold);
TruncateAzureTable(storageAccount, "WADDiagnosticInfrastructureLogsTable", keepThreshold);
TruncateAzureTable(storageAccount, "WADPerformanceCountersTable", keepThreshold);
TruncateAzureTable(storageAccount, "WADWindowsEventLogsTable", keepThreshold);
}
public static void TruncateAzureTable(CloudStorageAccount storageAccount, string aTableName, DateTime keepThreshold)
{
const int maxOperationsInBatch = 100;
var tableClient = storageAccount.CreateCloudTableClient();
var cloudTable = tableClient.GetTableReference(aTableName);
var query = new TableQuery { FilterString = $"Timestamp lt datetime'{keepThreshold:yyyy-MM-ddTHH:mm:ss}'" };
TableContinuationToken continuationToken = null;
do
{
var queryResult = cloudTable.ExecuteQuerySegmented(query, continuationToken);
continuationToken = queryResult.ContinuationToken;
var items = queryResult.ToList();
var batches = new Dictionary<string, List<TableBatchOperation>>();
foreach (var entity in items)
{
var tableOperation = TableOperation.Delete(entity);
if (!batches.TryGetValue(entity.PartitionKey, out var batchOperationList))
{
batchOperationList = new List<TableBatchOperation>();
batches.Add(entity.PartitionKey, batchOperationList);
}
var batchOperation = batchOperationList.FirstOrDefault(bo => bo.Count < maxOperationsInBatch);
if (batchOperation == null)
{
batchOperation = new TableBatchOperation();
batchOperationList.Add(batchOperation);
}
batchOperation.Add(tableOperation);
}
foreach (var batch in batches.Values.SelectMany(l => l))
{
cloudTable.ExecuteBatch(batch);
}
} while (continuationToken != null);
}