Windows Azure - Cleaning Up The WADLogsTable - c#

I've read conflicting information as to whether or not the WADLogsTable table used by the DiagnosticMonitor in Windows Azure will automatically prune old log entries.
I'm guessing it doesn't, and will instead grow forever - costing me money. :)
If that's the case, does anybody have a good code sample as to how to clear out old log entries from this table manually? Perhaps based on timestamp? I'd run this code from a worker role periodically.

The data in tables created by Windows Azure Diagnostics isn't deleted automatically.
However, the Windows Azure PowerShell Cmdlets include cmdlets specifically for this case.
PS D:\> help Clear-WindowsAzureLog
NAME
Clear-WindowsAzureLog
SYNOPSIS
Removes Windows Azure trace log data from a storage account.
SYNTAX
Clear-WindowsAzureLog [-DeploymentId <String>] [-From <DateTime>] [-To <DateTime>] [-StorageAccountName <String>] [-StorageAccountKey <String>] [-UseDevelopmentStorage] [-StorageAccountCredentials <StorageCredentialsAccountAndKey>] [<CommonParameters>]
Clear-WindowsAzureLog [-DeploymentId <String>] [-FromUtc <DateTime>] [-ToUtc <DateTime>] [-StorageAccountName <String>] [-StorageAccountKey <String>] [-UseDevelopmentStorage] [-StorageAccountCredentials <StorageCredentialsAccountAndKey>] [<CommonParameters>]
You need to specify the -ToUtc parameter; all logs before that date will be deleted.
If the cleanup task needs to be performed on Azure within the worker role, the cmdlets' C# code can be reused. The PowerShell Cmdlets are published under the permissive MS Public License.
Basically, only three files are needed, with no other external dependencies: DiagnosticsOperationException.cs, WadTableExtensions.cs, WadTableServiceEntity.cs.

An updated version of Chriseyre2000's function. It performs much better in cases where you need to delete many thousands of records: it searches by PartitionKey and processes the work in chunked steps. And remember that the best choice is to run it near the storage (in a cloud service).
public static void TruncateDiagnostics(CloudStorageAccount storageAccount,
    DateTime startDateTime, DateTime finishDateTime, Func<DateTime,DateTime> stepFunction)
{
    var cloudTable = storageAccount.CreateCloudTableClient().GetTableReference("WADLogsTable");
    var query = new TableQuery();
    var dt = startDateTime;
    while (true)
    {
        dt = stepFunction(dt);
        if (dt > finishDateTime)
            break;
        var l = dt.Ticks;
        string partitionKey = "0" + l;
        query.FilterString = TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.LessThan, partitionKey);
        query.Select(new string[] {});
        var items = cloudTable.ExecuteQuery(query).ToList();
        const int chunkSize = 200;
        var chunkedList = new List<List<DynamicTableEntity>>();
        int index = 0;
        while (index < items.Count)
        {
            var count = items.Count - index > chunkSize ? chunkSize : items.Count - index;
            chunkedList.Add(items.GetRange(index, count));
            index += chunkSize;
        }
        foreach (var chunk in chunkedList)
        {
            var batches = new Dictionary<string, TableBatchOperation>();
            foreach (var entity in chunk)
            {
                var tableOperation = TableOperation.Delete(entity);
                if (batches.ContainsKey(entity.PartitionKey))
                    batches[entity.PartitionKey].Add(tableOperation);
                else
                    batches.Add(entity.PartitionKey, new TableBatchOperation {tableOperation});
            }
            foreach (var batch in batches.Values)
                cloudTable.ExecuteBatch(batch);
        }
    }
}

You could just do it based on the timestamp, but that would be very inefficient since the whole table would need to be scanned. Here is a code sample that might help, where the partition key is generated to prevent a "full" table scan: http://blogs.msdn.com/b/avkashchauhan/archive/2011/06/24/linq-code-to-query-windows-azure-wadlogstable-to-get-rows-which-are-stored-after-a-specific-datetime.aspx
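For context, the WAD tables' PartitionKey is "0" followed by a tick count (as the answers below rely on), so a cutoff time can be turned into a PartitionKey filter instead of a timestamp scan. A minimal sketch, assuming the same WindowsAzure.Storage table client used elsewhere on this page; the retention window is illustrative:
// Build a PartitionKey upper bound from a cutoff time: WAD partition keys are "0" + ticks.
DateTime cutoffUtc = DateTime.UtcNow.AddDays(-7);   // illustrative retention window
string cutoffPartitionKey = "0" + cutoffUtc.Ticks;
// Rows whose PartitionKey sorts below the cutoff are older than the window,
// so the query becomes a partition range scan rather than a full table scan.
var query = new TableQuery
{
    FilterString = TableQuery.GenerateFilterCondition(
        "PartitionKey", QueryComparisons.LessThan, cutoffPartitionKey)
};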

Here is a solution that truncates based upon a timestamp. (Tested against SDK 2.0.)
It does use a table scan to get the data, but if run, say, once per day it would not be too painful:
/// <summary>
/// TruncateDiagnostics(storageAccount, DateTime.Now.AddHours(-1));
/// </summary>
/// <param name="storageAccount"></param>
/// <param name="keepThreshold"></param>
public void TruncateDiagnostics(CloudStorageAccount storageAccount, DateTime keepThreshold)
{
    try
    {
        CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
        CloudTable cloudTable = tableClient.GetTableReference("WADLogsTable");
        TableQuery query = new TableQuery();
        query.FilterString = string.Format("Timestamp lt datetime'{0:yyyy-MM-ddTHH:mm:ss}'", keepThreshold);
        var items = cloudTable.ExecuteQuery(query).ToList();
        Dictionary<string, TableBatchOperation> batches = new Dictionary<string, TableBatchOperation>();
        foreach (var entity in items)
        {
            TableOperation tableOperation = TableOperation.Delete(entity);
            if (!batches.ContainsKey(entity.PartitionKey))
            {
                batches.Add(entity.PartitionKey, new TableBatchOperation());
            }
            batches[entity.PartitionKey].Add(tableOperation);
        }
        foreach (var batch in batches.Values)
        {
            cloudTable.ExecuteBatch(batch);
        }
    }
    catch (Exception ex)
    {
        Trace.TraceError(string.Format("Truncate WADLogsTable exception {0}", ex), "Error");
    }
}

Here's my slightly different version of @Chriseyre2000's solution, using asynchronous operations and a PartitionKey query. In my case it's designed to run continuously within a Worker Role. This one may be a bit easier on memory if you have a lot of entries to clean up.
static class LogHelper
{
    /// <summary>
    /// Periodically run a cleanup task for log data, asynchronously
    /// </summary>
    public static async void TruncateDiagnosticsAsync()
    {
        while ( true )
        {
            try
            {
                // Retrieve storage account from connection-string
                CloudStorageAccount storageAccount = CloudStorageAccount.Parse(
                    CloudConfigurationManager.GetSetting( "CloudStorageConnectionString" ) );
                CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
                CloudTable cloudTable = tableClient.GetTableReference( "WADLogsTable" );
                // keep a weeks worth of logs
                DateTime keepThreshold = DateTime.UtcNow.AddDays( -7 );
                // do this until we run out of items
                while ( true )
                {
                    TableQuery query = new TableQuery();
                    query.FilterString = string.Format( "PartitionKey lt '0{0}'", keepThreshold.Ticks );
                    var items = cloudTable.ExecuteQuery( query ).Take( 1000 );
                    if ( items.Count() == 0 )
                        break;
                    Dictionary<string, TableBatchOperation> batches = new Dictionary<string, TableBatchOperation>();
                    foreach ( var entity in items )
                    {
                        TableOperation tableOperation = TableOperation.Delete( entity );
                        // need a new batch?
                        if ( !batches.ContainsKey( entity.PartitionKey ) )
                            batches.Add( entity.PartitionKey, new TableBatchOperation() );
                        // can have only 100 per batch
                        if ( batches[entity.PartitionKey].Count < 100)
                            batches[entity.PartitionKey].Add( tableOperation );
                    }
                    // execute!
                    foreach ( var batch in batches.Values )
                        await cloudTable.ExecuteBatchAsync( batch );
                    Trace.TraceInformation( "WADLogsTable truncated: " + query.FilterString );
                }
            }
            catch ( Exception ex )
            {
                Trace.TraceError( "Truncate WADLogsTable exception {0}", ex.Message );
            }
            // run this once per day
            await Task.Delay( TimeSpan.FromDays( 1 ) );
        }
    }
}
To start the process, just call this from the OnStart method in your worker role.
// start the periodic cleanup
LogHelper.TruncateDiagnosticsAsync();

If you don't care about any of the contents, just delete the table. Azure Diagnostics will just recreate it.
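A minimal sketch of that approach, assuming the same WindowsAzure.Storage SDK as the answers above (one caveat: after a delete, Azure Storage may refuse to recreate a table of the same name for a short while, so diagnostics writes can fail briefly):
// Drop the entire WADLogsTable; Azure Diagnostics recreates it the next time it writes.
CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
CloudTable logsTable = tableClient.GetTableReference("WADLogsTable");
logsTable.DeleteIfExists();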

A slightly updated version of Chriseyre2000's code:
using ExecuteQuerySegmented instead of ExecuteQuery
observing the TableBatchOperation limit of 100 operations
purging all of the Azure diagnostics (WAD*) tables
public static void TruncateAllAzureTables(CloudStorageAccount storageAccount, DateTime keepThreshold)
{
    TruncateAzureTable(storageAccount, "WADLogsTable", keepThreshold);
    TruncateAzureTable(storageAccount, "WADCrashDump", keepThreshold);
    TruncateAzureTable(storageAccount, "WADDiagnosticInfrastructureLogsTable", keepThreshold);
    TruncateAzureTable(storageAccount, "WADPerformanceCountersTable", keepThreshold);
    TruncateAzureTable(storageAccount, "WADWindowsEventLogsTable", keepThreshold);
}

public static void TruncateAzureTable(CloudStorageAccount storageAccount, string aTableName, DateTime keepThreshold)
{
    const int maxOperationsInBatch = 100;
    var tableClient = storageAccount.CreateCloudTableClient();
    var cloudTable = tableClient.GetTableReference(aTableName);
    var query = new TableQuery { FilterString = $"Timestamp lt datetime'{keepThreshold:yyyy-MM-ddTHH:mm:ss}'" };
    TableContinuationToken continuationToken = null;
    do
    {
        var queryResult = cloudTable.ExecuteQuerySegmented(query, continuationToken);
        continuationToken = queryResult.ContinuationToken;
        var items = queryResult.ToList();
        var batches = new Dictionary<string, List<TableBatchOperation>>();
        foreach (var entity in items)
        {
            var tableOperation = TableOperation.Delete(entity);
            if (!batches.TryGetValue(entity.PartitionKey, out var batchOperationList))
            {
                batchOperationList = new List<TableBatchOperation>();
                batches.Add(entity.PartitionKey, batchOperationList);
            }
            var batchOperation = batchOperationList.FirstOrDefault(bo => bo.Count < maxOperationsInBatch);
            if (batchOperation == null)
            {
                batchOperation = new TableBatchOperation();
                batchOperationList.Add(batchOperation);
            }
            batchOperation.Add(tableOperation);
        }
        foreach (var batch in batches.Values.SelectMany(l => l))
        {
            cloudTable.ExecuteBatch(batch);
        }
    } while (continuationToken != null);
}

Related

ExecuteStoredProcedureAsync() returning different results

Long story short: a stored procedure in Cosmos DB returns 2 when executed inside the portal, but returns 0 when called from ExecuteStoredProcedureAsync() in my C# console app.
The correct response is 2.
Here's the stored procedure:
JS:
function countItems() {
var context = getContext();
var collection = context.getCollection();
var collectionLink = collection.getSelfLink();
var response = context.getResponse();
var query = "SELECT * FROM c";
var isAccepted = collection.queryDocuments(
collectionLink,
query,
function(err, documents, responseOptions) {
if (err) {
throw err;
}
response.setBody(documents.length);
}
);
}
When I run this from the Azure portal, it returns the correct result: 2.
/////////////////////
Here's the C# call:
C#
private static async Task ExecuteStoredProc(string spId, CosmosContext cosmosContext)
{
using (var client = new CosmosClient(cosmosContext.Endpoint, cosmosContext.MasterKey))
{
var container = client.GetContainer(cosmosContext.DbId, cosmosContext.ContainerId);
var scripts = container.Scripts;
var pk = new PartitionKey(cosmosContext.DbId);
var result = await scripts.ExecuteStoredProcedureAsync<string>(spId, pk, null);
var message = result.Resource;
Console.WriteLine(message);
}
}
When I run this from the C# console app, it returns 0.
What's the deal?
Based on my test, you may not be setting the PartitionKey correctly.
If you have set a partition key on the container, you need to pass the correct partition key value.
static void Main(string[] args)
{
using (var client = new CosmosClient(Endpoint, Key))
{
// With Partition Key
var container = client.GetContainer("TestDB", "Demo");
var scripts = container.Scripts;
// With Partition Key
var pk = new PartitionKey("B");
var result =scripts.ExecuteStoredProcedureAsync<string>("length", pk, null).GetAwaiter().GetResult();
var message = result.Resource;
Console.WriteLine(message);
}
Console.ReadLine();
}
If there is no partition key, then you need to pass PartitionKey.None
static void Main(string[] args)
{
using (var client = new CosmosClient(Endpoint, Key))
{
// Without Partition Key
var container = client.GetContainer("ToDoList", "Items");
var scripts = container.Scripts;
//Without Partition Key
var result = scripts.ExecuteStoredProcedureAsync<string>("length", PartitionKey.None, null).GetAwaiter().GetResult();
var message = result.Resource;
Console.WriteLine(message);
}
Console.ReadLine();
}

Fastest way to insert 100,000+ records into DocumentDB

As the title suggests, I need to insert 100,000+ records into a DocumentDb collection programmatically. The data will be used for creating reports later on. I am using the Azure Documents SDK and a stored procedure for bulk inserting documents (see the question Azure documentdb bulk insert using stored procedure).
The following console application shows how I'm inserting documents.
InsertDocuments generates 500 test documents to pass to the stored procedure. The main function calls InsertDocuments 10 times, inserting 5,000 documents overall. Running this application results in 500 documents getting inserted every few seconds. If I increase the number of documents per call I start to get errors and lost documents.
Can anyone recommend a faster way to insert documents?
static void Main(string[] args)
{
Console.WriteLine("Starting...");
MainAsync().Wait();
}
static async Task MainAsync()
{
int campaignId = 1001,
count = 500;
for (int i = 0; i < 10; i++)
{
await InsertDocuments(campaignId, (count * i) + 1, (count * i) + count);
}
}
static async Task InsertDocuments(int campaignId, int startId, int endId)
{
using (DocumentClient client = new DocumentClient(new Uri(documentDbUrl), documentDbKey))
{
List<dynamic> items = new List<dynamic>();
// Create x number of documents to insert
for (int i = startId; i <= endId; i++)
{
var item = new
{
id = Guid.NewGuid(),
campaignId = campaignId,
userId = i,
status = "Pending"
};
items.Add(item);
}
var task = client.ExecuteStoredProcedureAsync<dynamic>("/dbs/default/colls/campaignusers/sprocs/bulkImport", new RequestOptions()
{
PartitionKey = new PartitionKey(campaignId)
},
new
{
items = items
});
try
{
await task;
int insertCount = (int)task.Result.Response;
Console.WriteLine("{0} documents inserted...", insertCount);
}
catch (Exception e)
{
Console.WriteLine("Error: {0}", e.Message);
}
}
}
The fastest way to insert documents into Azure DocumentDB is available as a sample on GitHub: https://github.com/Azure/azure-documentdb-dotnet/tree/master/samples/documentdb-benchmark
The following tips will help you achieve the best throughput using the .NET SDK:
Initialize a singleton DocumentClient
Use Direct connectivity and TCP protocol (ConnectionMode.Direct and ConnectionProtocol.Tcp)
Use 100s of Tasks in parallel (depends on your hardware)
Increase the MaxConnectionLimit in the DocumentClient constructor to a high value, say 1000 connections
Turn gcServer on
Make sure your collection has the appropriate provisioned throughput (and a good partition key)
Running in the same Azure region will also help
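As a rough illustration of the connection-related tips above, here is a minimal sketch; the endpoint, key and limit are placeholders, and gcServer is enabled in configuration rather than in code:
// A single, shared DocumentClient configured for direct TCP connectivity.
ConnectionPolicy connectionPolicy = new ConnectionPolicy
{
    ConnectionMode = ConnectionMode.Direct,   // bypass the HTTPS gateway
    ConnectionProtocol = Protocol.Tcp,        // TCP instead of HTTPS
    MaxConnectionLimit = 1000                 // allow many concurrent connections
};
DocumentClient client = new DocumentClient(new Uri(documentDbUrl), documentDbKey, connectionPolicy);
// gcServer is turned on via <gcServer enabled="true" /> in the <runtime> section of app.config.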
With 10,000 RU/s, you can insert 100,000 documents in about 50 seconds (approximately 5 request units per write).
With 100,000 RU/s, you can insert in about 5 seconds. You can make this as fast as you want to, by configuring throughput (and for very high # of inserts, spread inserts across multiple VMs/workers)
EDIT (7/12/19): You can now use the bulk executor library at https://learn.microsoft.com/en-us/azure/cosmos-db/bulk-executor-overview
The Cosmos DB team has just released a bulk import and update SDK. Unfortunately it is only available for .NET Framework 4.5.1, but it apparently does a lot of the heavy lifting for you and maximizes use of throughput. See:
https://learn.microsoft.com/en-us/azure/cosmos-db/bulk-executor-overview
https://learn.microsoft.com/en-us/azure/cosmos-db/sql-api-sdk-bulk-executor-dot-net
The Cosmos DB SDK has been updated to allow bulk insert via the AllowBulkExecution option: https://learn.microsoft.com/en-us/azure/cosmos-db/tutorial-sql-api-dotnet-bulk-import
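For example, a minimal sketch against the v3 .NET SDK (Microsoft.Azure.Cosmos); the endpoint, key, database, container and item type are placeholders:
// Bulk mode: the SDK groups concurrent point operations into batched service requests.
CosmosClient client = new CosmosClient(endpoint, authKey,
    new CosmosClientOptions { AllowBulkExecution = true });
Container container = client.GetContainer("databaseId", "containerId");
List<Task> tasks = new List<Task>();
foreach (CampaignUser item in items)   // CampaignUser and its CampaignId property are placeholders
{
    tasks.Add(container.CreateItemAsync(item, new PartitionKey(item.CampaignId)));
}
await Task.WhenAll(tasks);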
Another approach is a stored procedure, as mentioned by other people. A stored procedure requires a partition key. Also, as per the documentation, a stored procedure should finish within 4 seconds, otherwise all records are rolled back. See the code below, which uses the Python Azure DocumentDB SDK and a JavaScript-based stored procedure. I have modified the script and resolved a lot of errors; the code below is working fine:
function bulkimport2(docObject) {
var collection = getContext().getCollection();
var collectionLink = collection.getSelfLink();
// The count of imported docs, also used as current doc index.
var count = 0;
getContext().getResponse().setBody(docObject.items);
//return
// Validate input.
//if (!docObject.items || !docObject.items.length) getContext().getResponse().setBody(docObject);
docObject.items=JSON.stringify(docObject.items)
docObject.items = docObject.items.replace("\\\\r", "");
docObject.items = docObject.items.replace("\\\\n", "");
var docs = JSON.parse(docObject.items);
var docsLength = docObject.items.length;
if (docsLength == 0) {
getContext().getResponse().setBody(0);
return;
}
// Call the CRUD API to create a document.
tryCreate(docs[count], callback, collectionLink,count);
// Note that there are 2 exit conditions:
// 1) The createDocument request was not accepted.
// In this case the callback will not be called, we just call setBody and we are done.
// 2) The callback was called docs.length times.
// In this case all documents were created and we don't need to call tryCreate anymore. Just call setBody and we are done.
function tryCreate(doc, callback, collectionLink,count ) {
doc=JSON.stringify(doc);
if (typeof doc == "undefined") {
getContext().getResponse().setBody(count);
return ;
} else {
doc = doc.replace("\\r", "");
doc = doc.replace("\\n", "");
doc=JSON.parse(doc);
}
getContext().getResponse().setBody(doc);
var isAccepted = collection.upsertDocument(collectionLink, doc, callback);
// If the request was accepted, callback will be called.
// Otherwise report current count back to the client,
// which will call the script again with remaining set of docs.
// This condition will happen when this stored procedure has been running too long
// and is about to get cancelled by the server. This will allow the calling client
// to resume this batch from the point we got to before isAccepted was set to false
if (!isAccepted) {
getContext().getResponse().setBody(count);
}
}
// This is called when collection.createDocument is done and the document has been persisted.
function callback(err, doc, options) {
if (err) throw getContext().getResponse().setBody(err + doc);
// One more document has been inserted, increment the count.
count++;
if (count >= docsLength) {
// If we have created all documents, we are done. Just set the response.
getContext().getResponse().setBody(count);
return ;
} else {
// Create next document.
tryCreate(docs[count], callback, collectionLink,count);
}
}
}
EDIT: getContext().getResponse().setBody(count);
return; // when all records are processed completely.
Python script to load the stored procedure and do the batch import:
# Initialize the Python DocumentDB client
client = document_client.DocumentClient(config['ENDPOINT'], {'masterKey': config['MASTERKEY'] ,'DisableSSLVerification' : 'true' })
# Create a database
#db = client.CreateDatabase({ 'id': config['DOCUMENTDB_DATABASE'] })
db=client.ReadDatabases({ 'id': 'db2' })
print(db)
# Create collection options
options = {
'offerEnableRUPerMinuteThroughput': True,
'offerVersion': "V2",
'offerThroughput': 400
}
# Create a collection
#collection = client.CreateCollection('dbs/db2' , { 'id': 'coll2'}, options)
#collection = client.CreateCollection({ 'id':'db2'},{ 'id': 'coll2'}, options)
database_link = 'dbs/db2'
collection_link = database_link + '/colls/coll2'
"""
#List collections
collection = client.ReadCollection(collection_link)
print(collection)
print('Databases:')
databases = list(client.ReadDatabases())
if not databases:
print('No Databases:')
for database in databases:
print(database['id'])
"""
# Create some documents
"""
document1 = client.CreateDocument(collection['_self'],
{
'Web Site': 0,
'Cloud Service': 0,
'Virtual Machine': 0,
'name': 'some'
})
document2 = client.CreateDocument(collection['_self'],
{
'Web Site': 1,
'Cloud Service': 0,
'Virtual Machine': 0,
'name': 'some'
})
"""
# Query them in SQL
"""
query = { 'query': 'SELECT * FROM server s' }
options = {}
options['enableCrossPartitionQuery'] = True
options['maxItemCount'] = 20
#result_iterable = client.QueryDocuments(collection['_self'], query, options)
result_iterable = client.QueryDocuments(collection_link, query, options)
results = list(result_iterable);
print(results)
"""
##How to store procedure and use it
"""
sproc3 = {
'id': 'storedProcedure2',
'body': (
'function (input) {' +
' getContext().getResponse().setBody(' +
' \'a\' + input.temp);' +
'}')
}
retrieved_sproc3 = client.CreateStoredProcedure(collection_link,sproc3)
result = client.ExecuteStoredProcedure('dbs/db2/colls/coll2/sprocs/storedProcedure3',{'temp': 'so'})
"""
## delete all records in collection
"""
result = client.ExecuteStoredProcedure('dbs/db2/colls/coll2/sprocs/bulkDeleteSproc',"SELECT * FROM c ORDER BY c._ts DESC ")
print(result)
"""
multiplerecords="""[{
"Virtual Machine": 0,
"name": "some",
"Web Site": 0,
"Cloud Service": 0
},
{
"Virtual Machine": 0,
"name": "some",
"Web Site": 1,
"Cloud Service": 0
}]"""
multiplerecords=json.loads(multiplerecords)
print(multiplerecords)
print(str(json.dumps(json.dumps(multiplerecords).encode('utf8'))))
#bulkloadresult = client.ExecuteStoredProcedure('dbs/db2/colls/coll2/sprocs/bulkImport',json.dumps(multiplerecords).encode('utf8'))
#bulkloadresult = client.ExecuteStoredProcedure('dbs/db2/colls/coll2/sprocs/bulkImport',json.dumps(json.loads(r'{"items": [{"name":"John","age":30,"city":"New York"},{"name":"John","age":30,"city":"New York"}]}')).encode('utf8'))
str1='{name":"John","age":30,"city":"New York","PartitionKey" : "Morisplane"}'
str2='{name":"John","age":30,"city":"New York","partitionKey" : "Morisplane"}'
key1=base64.b64encode(str1.encode("utf-8"))
key2=base64.b64encode(str2.encode("utf-8"))
data= {"items":[{"id": key1 ,"name":"John","age":30,"city":"Morisplane","PartitionKey" : "Morisplane" },{"id": key2,"name":"John","age":30,"city":"Morisplane","partitionKey" : "Morisplane"}] , "city": "Morisplane", "partitionKey" : "Morisplane"}
print(repr(data))
#retrieved_sproc3 =client.DeleteStoredProcedure('dbs/db2/colls/coll2/sprocs/bulkimport2')
sproc3 = {
'id': 'bulkimport2',
'body': (
"""function bulkimport2(docObject) {
var collection = getContext().getCollection();
var collectionLink = collection.getSelfLink();
// The count of imported docs, also used as current doc index.
var count = 0;
getContext().getResponse().setBody(docObject.items);
//return
// Validate input.
//if (!docObject.items || !docObject.items.length) getContext().getResponse().setBody(docObject);
docObject.items=JSON.stringify(docObject.items)
docObject.items = docObject.items.replace("\\\\r", "");
docObject.items = docObject.items.replace("\\\\n", "");
var docs = JSON.parse(docObject.items);
var docsLength = docObject.items.length;
if (docsLength == 0) {
getContext().getResponse().setBody(0);
return;
}
// Call the CRUD API to create a document.
tryCreate(docs[count], callback, collectionLink,count);
// Note that there are 2 exit conditions:
// 1) The createDocument request was not accepted.
// In this case the callback will not be called, we just call setBody and we are done.
// 2) The callback was called docs.length times.
// In this case all documents were created and we don't need to call tryCreate anymore. Just call setBody and we are done.
function tryCreate(doc, callback, collectionLink,count ) {
doc=JSON.stringify(doc);
if (typeof doc == "undefined") {
getContext().getResponse().setBody(count);
return ;
} else {
doc = doc.replace("\\r", "");
doc = doc.replace("\\n", "");
doc=JSON.parse(doc);
}
getContext().getResponse().setBody(doc);
return
var isAccepted = collection.upsertDocument(collectionLink, doc, callback);
// If the request was accepted, callback will be called.
// Otherwise report current count back to the client,
// which will call the script again with remaining set of docs.
// This condition will happen when this stored procedure has been running too long
// and is about to get cancelled by the server. This will allow the calling client
// to resume this batch from the point we got to before isAccepted was set to false
if (!isAccepted) {
getContext().getResponse().setBody(count);
}
}
// This is called when collection.createDocument is done and the document has been persisted.
function callback(err, doc, options) {
if (err) throw getContext().getResponse().setBody(err + doc);
// One more document has been inserted, increment the count.
count++;
if (count >= docsLength) {
// If we have created all documents, we are done. Just set the response.
getContext().getResponse().setBody(count);
return ;
} else {
// Create next document.
tryCreate(docs[count], callback, collectionLink,count);
}
}
}"""
)
}
#retrieved_sproc3 = client.CreateStoredProcedure(collection_link,sproc3)
bulkloadresult = client.ExecuteStoredProcedure('dbs/db2/colls/coll2/sprocs/bulkimport2', data , {"partitionKey" : "Morisplane"} )
print(repr(bulkloadresult))
private async Task<T> ExecuteDataUpload<T>(IEnumerable<object> data,PartitionKey partitionKey)
{
using (var client = new DocumentClient(m_endPointUrl, m_authKey, connPol))
{
while (true)
{
try
{
var result = await client.ExecuteStoredProcedureAsync<T>(m_spSelfLink, new RequestOptions { PartitionKey = partitionKey }, data);
return result;
}
catch (DocumentClientException ex)
{
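// 429 ("Too Many Requests") means the request was throttled; back off for the RetryAfter interval suggested by the service, then retry.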
if (429 == (int)ex.StatusCode)
{
Thread.Sleep(ex.RetryAfter);
continue;
}
if (HttpStatusCode.RequestTimeout == ex.StatusCode)
{
Thread.Sleep(ex.RetryAfter);
continue;
}
throw ex;
}
catch (Exception)
{
Thread.Sleep(TimeSpan.FromSeconds(1));
continue;
}
}
}
}
public async Task uploadData(IEnumerable<object> data, string partitionKey)
{
int groupSize = 600;
int dataSize = data.Count();
int chunkSize = dataSize > groupSize ? groupSize : dataSize;
List<Task> uploadTasks = new List<Task>();
while (dataSize > 0)
{
IEnumerable<object> chunkData = data.Take(chunkSize);
object[] taskData = new object[3];
taskData[0] = chunkData;
taskData[1] = chunkSize;
taskData[2] = partitionKey;
uploadTasks.Add(Task.Factory.StartNew(async (arg) =>
{
object[] reqdData = (object[])arg;
int chunkSizes = (int)reqdData[1];
IEnumerable<object> chunkDatas = (IEnumerable<object>)reqdData[0];
var partKey = new PartitionKey((string)reqdData[2]);
int chunkDatasCount = chunkDatas.Count();
while (chunkDatasCount > 0)
{
int insertedCount = await ExecuteDataUpload<int>(chunkDatas, partKey);
chunkDatas = chunkDatas.Skip(insertedCount);
chunkDatasCount = chunkDatasCount - insertedCount;
}
}, taskData));
data = data.Skip(chunkSize);
dataSize = dataSize - chunkSize;
chunkSize = dataSize > groupSize ? groupSize : dataSize;
}
await Task.WhenAll(uploadTasks);
}
Now call uploadData in parallel with the list of objects you want to upload. Just keep one thing in mind: send data with the same PartitionKey only.
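For instance, a minimal usage sketch (the item shape is illustrative; the partition key value matches the one used above):
// Every item passed in a single uploadData call must carry the same partition key value.
List<object> items = Enumerable.Range(1, 5000)
    .Select(i => (object)new { id = Guid.NewGuid().ToString(), userId = i, partitionKey = "Morisplane" })
    .ToList();
await uploadData(items, "Morisplane");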

ExecuteAsync() of Azure Table Storage failing to insert all the records

I am trying to insert 10,000 records into Azure Table Storage. I am using ExecuteAsync() to achieve it, but somehow only approximately 7,500 records are inserted and the rest are lost. I am purposely not using the await keyword because I don't want to wait for the result; I just want to store the records in the table. Below is my code snippet.
private static async void ConfigureAzureStorageTable()
{
CloudStorageAccount storageAccount =
CloudStorageAccount.Parse(CloudConfigurationManager.GetSetting("StorageConnectionString"));
CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
TableResult result = new TableResult();
CloudTable table = tableClient.GetTableReference("test");
table.CreateIfNotExists();
for (int i = 0; i < 10000; i++)
{
var verifyVariableEntityObject = new VerifyVariableEntity()
{
ConsumerId = String.Format("{0}", i),
Score = String.Format("{0}", i * 2 + 2),
PartitionKey = String.Format("{0}", i),
RowKey = String.Format("{0}", i * 2 + 2)
};
TableOperation insertOperation = TableOperation.Insert(verifyVariableEntityObject);
try
{
table.ExecuteAsync(insertOperation);
}
catch (Exception e)
{
Console.WriteLine(e.Message);
}
}
}
Is anything incorrect with the usage of the method?
You still want to await table.ExecuteAsync(). That will mean that ConfigureAzureStorageTable() returns control to the caller at that point, which can continue executing.
The way you have it in the question, ConfigureAzureStorageTable() is going to continue past the call to table.ExecuteAsync() and exit, and things like table will go out of scope, while the table.ExecuteAsync() task is still not complete.
There are plenty of caveats about using async void on SO and elsewhere that you will also need to consider. You could just as easily have your method as async Task but not await it in the caller yet, but keep the returned Task around for clean termination, etc.
Edit: one addition - you almost certainly want to use ConfigureAwait(false) on your await there, as you don't appear to need to preserve any context. This blog post has some guidelines on that and async in general.
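A minimal sketch of that shape, reusing the question's entity and table (the method name is illustrative):
// Return a Task instead of async void so the caller can await it and observe exceptions.
private static async Task InsertEntitiesAsync(CloudTable table)
{
    for (int i = 0; i < 10000; i++)
    {
        var entity = new VerifyVariableEntity
        {
            ConsumerId = String.Format("{0}", i),
            Score = String.Format("{0}", i * 2 + 2),
            PartitionKey = String.Format("{0}", i),
            RowKey = String.Format("{0}", i * 2 + 2)
        };
        // Await each insert; no synchronization context needs preserving, hence ConfigureAwait(false).
        await table.ExecuteAsync(TableOperation.Insert(entity)).ConfigureAwait(false);
    }
}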
According to your requirement, I have tested your scenario on my side by using CloudTable.ExecuteAsync and CloudTable.ExecuteBatchAsync successfully. Here is my code snippet about using CloudTable.ExecuteBatchAsync to insert records to Azure Table Storage, you could refer to it.
Program.cs Main
class Program
{
static void Main(string[] args)
{
CloudStorageAccount storageAccount =
CloudStorageAccount.Parse(CloudConfigurationManager.GetSetting("StorageConnectionString"));
CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
TableResult result = new TableResult();
CloudTable table = tableClient.GetTableReference("test");
table.CreateIfNotExists();
//Generate records to be inserted into Azure Table Storage
var entities = Enumerable.Range(1, 10000).Select(i => new VerifyVariableEntity()
{
ConsumerId = String.Format("{0}", i),
Score = String.Format("{0}", i * 2 + 2),
PartitionKey = String.Format("{0}", i),
RowKey = String.Format("{0}", i * 2 + 2)
});
//Group records by PartitionKey and prepare for executing batch operations
var batches = TableBatchHelper<VerifyVariableEntity>.GetBatches(entities);
//Execute batch operations in parallel
Parallel.ForEach(batches, new ParallelOptions()
{
MaxDegreeOfParallelism = 5
}, (batchOperation) =>
{
try
{
table.ExecuteBatch(batchOperation);
Console.WriteLine("Writing {0} records", batchOperation.Count);
}
catch (Exception ex)
{
Console.WriteLine("ExecuteBatch throw a exception:" + ex.Message);
}
});
Console.WriteLine("Done!");
Console.WriteLine("Press any key to exit...");
Console.ReadKey();
}
}
TableBatchHelper.cs
public class TableBatchHelper<T> where T : ITableEntity
{
const int batchMaxSize = 100;
public static IEnumerable<TableBatchOperation> GetBatches(IEnumerable<T> items)
{
var list = new List<TableBatchOperation>();
var partitionGroups = items.GroupBy(arg => arg.PartitionKey).ToArray();
foreach (var group in partitionGroups)
{
T[] groupList = group.ToArray();
int offSet = batchMaxSize;
T[] entities = groupList.Take(offSet).ToArray();
while (entities.Any())
{
var tableBatchOperation = new TableBatchOperation();
foreach (var entity in entities)
{
tableBatchOperation.Add(TableOperation.InsertOrReplace(entity));
}
list.Add(tableBatchOperation);
entities = groupList.Skip(offSet).Take(batchMaxSize).ToArray();
offSet += batchMaxSize;
}
}
return list;
}
}
Note: As mentioned in the official document about inserting a batch of entities:
A single batch operation can include up to 100 entities.
All entities in a single batch operation must have the same partition key.
In summary, please check whether this works on your side. You could also capture the detailed exception within your console application, and capture the HTTP traffic via Fiddler to catch the failing requests when you insert records into Azure Table Storage.
How about using a TableBatchOperation to run batches of N inserts at once?
private const int BatchSize = 100;
private static async void ConfigureAzureStorageTable()
{
CloudStorageAccount storageAccount =
CloudStorageAccount.Parse(CloudConfigurationManager.GetSetting("StorageConnectionString"));
CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
TableResult result = new TableResult();
CloudTable table = tableClient.GetTableReference("test");
table.CreateIfNotExists();
var batchOperation = new TableBatchOperation();
for (int i = 0; i < 10000; i++)
{
var verifyVariableEntityObject = new VerifyVariableEntity()
{
ConsumerId = String.Format("{0}", i),
Score = String.Format("{0}", i * 2 + 2),
PartitionKey = String.Format("{0}", i),
RowKey = String.Format("{0}", i * 2 + 2)
};
TableOperation insertOperation = TableOperation.Insert(verifyVariableEntityObject);
batchOperation.Add(insertOperation);
if (batchOperation.Count >= BatchSize)
{
try
{
await table.ExecuteBatchAsync(batchOperation);
batchOperation = new TableBatchOperation();
}
catch (Exception e)
{
Console.WriteLine(e.Message);
}
}
}
if(batchOperation.Count > 0)
{
try
{
await table.ExecuteBatchAsync(batchOperation);
}
catch (Exception e)
{
Console.WriteLine(e.Message);
}
}
}
You can adjust BatchSize to what you need. Small disclaimer: I didn't try to run this, though it should work.
But I can't help but wonder why your function is async void. That should be reserved for event handlers and similar cases where you cannot decide the interface. In most cases you want to return a Task, because right now the caller cannot catch exceptions that occur in this function.
async void is not a good practice unless it is an event handler.
https://msdn.microsoft.com/en-us/magazine/jj991977.aspx
If you plan to insert many records into Azure Table Storage, batch insert is your best bet.
https://msdn.microsoft.com/en-us/library/azure/microsoft.windowsazure.storage.table.tablebatchoperation.aspx
Keep in mind that it has a limit of 100 table operations per batch.
I had the same issue and fixed it by forcing ExecuteAsync to wait for the result before it exits:
table.ExecuteAsync(insertOperation).GetAwaiter().GetResult()

SqlDependency triggers on uncommitted data

We have an import / bulk copy batch that runs in the middle of the night, something like:
using (var tx = new TransactionScope(TransactionScopeOption.Required, TimeSpan.FromMinutes(1)))
{
//Delete existing
...
//Bulk copy to temp table
...
//Insert
...
tx.Complete();
}
We also have a cache that uses SqlDependency. But it already triggers on the Delete statement, making the cache empty for a few seconds between the delete and the reinsert. Can I configure SqlDependency to only listen to committed data?
SqlDependency code:
private IEnumerable<TEntity> RegisterAndFetch(IBusinessContext context)
{
var dependency = new SqlDependency();
dependency.OnChange += OnDependencyChanged;
try
{
CallContext.SetData("MS.SqlDependencyCookie", dependency.Id);
var refreshed = OnCacheRefresh(context, data.AsEnumerable());
var result = refreshed.ToArray();
System.Diagnostics.Debug.WriteLine("{0} - CacheChanged<{1}>: {2}", DateTime.Now, typeof(TEntity).Name, result.Length);
return result;
}
finally
{
CallContext.SetData("MS.SqlDependencyCookie", null);
}
}
OnDependencyChanged basically calls the RegisterAndFetch method above.
Query:
protected override IEnumerable<BankHoliday> OnCacheRefresh(IBusinessContext context, IEnumerable<BankHoliday> currentData)
{
var firstOfMonth = DateTime.Now.FirstDayOfMonth(); // <-- Must be a constant in the query
return context.QueryOn<BankHoliday>().Where(bh => bh.Holiday >= firstOfMonth);
}

Azure Table Storage continuation

So, Microsoft decided to send diagnostic data to Azure Table Storage. I'm trying to query this storage and send it to another location for analytics via the C# SDK. I can query just fine and pull hundreds of thousands of records, but it appears that the last continuation token they send always receives a null response. Even if more data gets sent into table storage, my continuation token doesn't work: I still get a null continuation token and null data back.
Has anyone done anything like this? How can I continue "syncing" Azure table data if the continuation tokens they send are broken?
public static List<PerfMonEntity> GetEventData(ref TableContinuationToken contToken)
{
CloudStorageAccount storageAccount = CloudStorageAccount.Parse(ConfigurationManager.AppSettings["StorageConnectionString"]);
CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
CloudTable eventLogsTable = tableClient.GetTableReference("WADPerformanceCountersTable");
TableQuery<PerfMonEntity> query = new TableQuery<PerfMonEntity>();
var l = new List<PerfMonEntity>();
var segment = eventLogsTable.ExecuteQuerySegmented(query, contToken ?? new TableContinuationToken());
foreach (PerfMonEntity wadCounter in segment)
{
l.Add(wadCounter);
}
contToken = segment.ContinuationToken;
if (contToken == null)
{
Console.WriteLine("contToken is NULL!");
return null;
}
Console.WriteLine("partkey: {0}", contToken.NextPartitionKey ?? "");
Console.WriteLine("rowkey: {0}", contToken.NextRowKey ?? "");
return l;
}
-=-=-=-=-=-
while (num < loop)
{
List<PerfMonEntity> eleList = AzurePerfTable.GetEventData(ref contToken);
if (eleList != null)
returnedList.AddRange(eleList);
else
num = loop;
num += 1;
if (contToken != null)
AZContinuationToken.SetContToken(contToken);
Console.WriteLine("returnedlistsize: {0}", returnedList.Count<PerfMonEntity>());
}
The continuation token is null when there is no more data to return. When it's non-null, it means that there are additional entities to return in the next page. You can check for null to determine when you've retrieved the last page and then exit the loop.
Try writing your logic along these lines:
CloudStorageAccount storageAccount = CloudStorageAccount.Parse(CloudConfigurationManager.GetSetting("StorageConnectionString"));
CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
CloudTable eventLogsTable = tableClient.GetTableReference("WADPerformanceCountersTable");
TableQuery query = new TableQuery();
Console.WriteLine("List perf counter results in pages:");
TableContinuationToken token = null;
do
{
var segment = eventLogsTable.ExecuteQuerySegmented(query, token, null, null);
foreach (var wadCounter in segment)
{
Console.WriteLine(wadCounter.PartitionKey);
Console.WriteLine(wadCounter.RowKey);
Console.WriteLine(wadCounter.Timestamp);
}
token = segment.ContinuationToken;
}
while (token != null);
