Count rows within partition in Azure table storage - c#

I've seen various questions around SO about how to get the total row count of an Azure storage table, but I want to know how to get the number of rows within a single partition.
How can I do this while loading a minimal amount of entity data into memory?

As you may already know, there's no Count-like functionality available in Azure Tables. To get the total number of entities (rows) in a partition (or a table), you have to fetch all of them.
You can reduce the response payload by using a technique called query projection. A query projection lets you specify the list of entity attributes (columns) that you want the table service to return. Since you're only interested in the total count of entities, I would recommend fetching only the PartitionKey. You may find this blog post helpful for understanding query projection: https://blogs.msdn.microsoft.com/windowsazurestorage/2011/09/15/windows-azure-tables-introducing-upsert-and-query-projection/.
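For example, a projection that asks the table service to return only the PartitionKey column could look like this (a sketch assuming the classic WindowsAzure.Storage / Microsoft.Azure.Cosmos.Table SDK, with partitionKey holding the partition you want to count):
// Sketch: project only the PartitionKey column so each returned entity is as small as possible.
TableQuery<DynamicTableEntity> query = new TableQuery<DynamicTableEntity>()
    .Where(TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, partitionKey))
    .Select(new[] { "PartitionKey" });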

https://azure.microsoft.com/en-gb/features/storage-explorer/ lets you define a query, and you can use the Table Statistics toolbar item to get the total row count for the whole table or for your query.

I tested the speed using Stopwatch, fetching and counting 100,000 entities in a partition whose entities have three fields in addition to the standard TableEntity properties.
I select just the PartitionKey and use a resolver to end up with a list of strings, which I count once the entire partition has been retrieved.
The fastest I have got it is around 6000 ms - 6500 ms. Here is the function:
public static async Task<int> GetCountOfEntitiesInPartition(string tableName, string partitionKey)
{
    // tableClient is a class-level CloudTableClient
    CloudTable table = tableClient.GetTableReference(tableName);

    // Project only the PartitionKey column to keep the payload small
    TableQuery<DynamicTableEntity> tableQuery = new TableQuery<DynamicTableEntity>()
        .Where(TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, partitionKey))
        .Select(new string[] { "PartitionKey" });

    // Resolve each entity to just its PartitionKey string
    EntityResolver<string> resolver = (pk, rk, ts, props, etag) =>
        props.ContainsKey("PartitionKey") ? props["PartitionKey"].StringValue : null;

    List<string> entities = new List<string>();
    TableContinuationToken continuationToken = null;
    do
    {
        TableQuerySegment<string> tableQueryResult =
            await table.ExecuteQuerySegmentedAsync(tableQuery, resolver, continuationToken);
        continuationToken = tableQueryResult.ContinuationToken;
        entities.AddRange(tableQueryResult.Results);
    } while (continuationToken != null);

    return entities.Count;
}
This is a general-purpose function; all you need is the tableName and the partitionKey.
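A hypothetical call, assuming the class-level tableClient has already been initialised:
int rowCount = await GetCountOfEntitiesInPartition("MyTable", "MyPartitionKey");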

You could achieve this pretty efficiently by leveraging the atomic batch operations of the Azure Table storage service. For every partition, keep an additional entity with the same partition key and a well-known row key such as "PartitionCount". That entity has a single int (or long) property, Count.
Every time you insert a new entity, perform an atomic batch operation that also increments the Count property of your partition counter entity. Because the counter entity shares the partition key with your data entities, the batch operation is atomic and gives you guaranteed consistency.
Every time you delete an entity, decrement the Count property of the partition counter entity, again inside a batch operation so the two operations stay consistent.
If you just want to read the partition count, make a single point query for the partition counter entity; its Count property tells you the current count for that partition.
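To illustrate the idea, here is a rough sketch using the Azure.Data.Tables SDK; the "PartitionCount" row key, the Count property name, and the tableClient and newEntity variables are assumptions rather than part of the original answer:
// Sketch: insert a data entity and bump the per-partition counter in one atomic batch.
var counter = await tableClient.GetEntityAsync<TableEntity>(partitionKey, "PartitionCount");
counter.Value["Count"] = counter.Value.GetInt64("Count").GetValueOrDefault() + 1;

var batch = new List<TableTransactionAction>
{
    // The new data entity shares the partition key, so both operations fit in one transaction.
    new TableTransactionAction(TableTransactionActionType.Add, newEntity),
    // The ETag check makes the increment safe against concurrent writers.
    new TableTransactionAction(TableTransactionActionType.UpdateReplace, counter.Value, counter.Value.ETag)
};
await tableClient.SubmitTransactionAsync(batch);
Reading the current count is then a single point query on (partitionKey, "PartitionCount").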

This can be done a bit more concisely than in @NickBrooks' answer.
public static async Task<int> GetCountOfEntitiesInPartition<T>(
    string tableName, string partitionKey)
    where T : ITableEntity, new()
{
    // tableServiceClient is assumed to be a class-level TableServiceClient (Azure.Data.Tables)
    var tableClient = tableServiceClient.GetTableClient(tableName);
    var results = tableClient.QueryAsync<T>(t => t.PartitionKey == partitionKey,
        select: new[] { "PartitionKey" });
    return await results.CountAsync();
}
The results.CountAsync() comes from System.Linq.Async, a NuGet package that is officially supported by the .NET team.
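A hypothetical call, assuming a class-level TableServiceClient named tableServiceClient as in the snippet above:
int count = await GetCountOfEntitiesInPartition<TableEntity>("MyTable", "MyPartitionKey");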

I think you can use .Count() directly in C#. You can use either this technique:
var tableStorageData = await table.ExecuteQuerySegmentedAsync(azQuery, null);
int count = tableStorageData.Count();
or
TableQuery<UserDetails> tableQuery = new TableQuery<UserDetails>();
var tableStorageData = table.ExecuteQuery(tableQuery, null);
count = tableStorageData.Count();
The count variable will then hold the total number of rows matching the query. (Note that ExecuteQuerySegmentedAsync returns only a single segment of results, so for large partitions you would still need to follow the continuation token to count everything.)

Related

Query Azure table storage for faster retrieval of data in C#

I have an ATS (Azure Table Storage) table whose PartitionKey and RowKey look like:
PartitionKey RowKey
US_W|000000001 0000200325|0184921077191606273
US_W|000000004 0000200328|0184921077191606277
US_W|000000005 XXXXXXXXXX|XX(somenumbers)XXXX
To be clear, I only have the PartitionKey to query this table with; the RowKey is unknown.
I am retrieving the result from table using the following method:
public async Task<IList<T>> FetchSelectedDataByPartitionKey<T>(string partitionKey, List<string> columns, QueryComparisonEnums partitionKeyQueryCompareEnums = QueryComparisonEnums.Equal) where T : class, ITableEntity, new()
{
    var tableClient = await GetTableClient<T>();

    // Build the OData filter, e.g. "PartitionKey eq 'US_W|000000001'"
    string query = $"PartitionKey {partitionKeyQueryCompareEnums.GetAttribute<EnmDecriptionAttribute>()?.Value} '{partitionKey}'";

    AsyncPageable<T> queryResultsFilter = tableClient.QueryAsync<T>(filter: query, select: columns);

    List<T> result = new List<T>();
    await foreach (Page<T> page in queryResultsFilter.AsPages())
    {
        foreach (var qEntity in page.Values)
        {
            result.Add(qEntity);
        }
    }
    return result;
}
This function works fine, but it takes around 60 seconds to scan a huge set of data in this table and filter and fetch 75,000 entities from it. To get results faster I am already using the select parameter to fetch only selected fields of an entity instead of the entire entity.
I read a few blog posts, such as one on distributed scans of Azure Table Storage, but I believe that only helps when the PartitionKey values are more scattered.
How can I retrieve the data faster? Any help is appreciated :)
In Azure Table Storage, a point query, which combines the PartitionKey and the RowKey, acts like a clustered index and is the most efficient kind of lookup. With both keys, storage immediately knows which partition to query and can look up the RowKey within that partition.
But as you mentioned, the RowKey is unknown to you, so you are currently doing a partition scan, which uses the PartitionKey value plus some other filters.
As I understand it, you can make use of pagination and continuation tokens by setting maxPerPage in the QueryAsync method, then passing the continuation token to AsPages() and getting the data one page at a time along with the token for the next page.
Below is sample code similar to what you used. Note the maxPerPage and continuationToken parameters passed to the QueryAsync() and AsPages() methods respectively:
var customers = _tableClient.QueryAsync<CustomerModel>(filter: "", maxPerPage: 5);
await foreach (var page in customers.AsPages(continuationToken))
{
    return Tuple.Create<string, IEnumerable<CustomerModel>>(page.ContinuationToken, page.Values);
}
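For completeness, here is a rough sketch of how a caller could fetch one page at a time and resume from the returned token; the method name and CustomerModel are illustrative, not from the original post, and an existing Azure.Data.Tables TableClient is assumed:
// Sketch: return one page of entities plus the continuation token for the next page.
public static async Task<(string NextToken, IReadOnlyList<CustomerModel> Items)> GetPageAsync(
    TableClient tableClient, string partitionKey, string continuationToken)
{
    var query = tableClient.QueryAsync<CustomerModel>(
        filter: $"PartitionKey eq '{partitionKey}'", maxPerPage: 1000);

    // AsPages(continuationToken) resumes where the previous call stopped.
    await foreach (var page in query.AsPages(continuationToken))
    {
        // Return after the first page so the caller controls how much is pulled per request.
        return (page.ContinuationToken, page.Values);
    }

    return (null, Array.Empty<CustomerModel>());
}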
References:
queryasync() with pagination
azure-table-storage-query-performance
efficiently-retrieving-large-numbers-of-entities-from-azure-table-storage

Fetch row number MongoDB c#

I am fetching a single candidate's exam result details after the examination, which are stored in MongoDB using the C# driver. The collection has a TotalMarks field that stores the marks obtained in that exam.
Unfortunately the collection does not have a Rank field, because the mark calculation is not done in order.
What I want to do is order the collection by TotalMarks and get the position (rank) of the candidate I am selecting.
public ExamCandidateResult ExaminationGetCandidateResultStatus(Guid examinationId, Guid candidateId)
{
    var con = new MongoClient(DBConnection.ExamConnectionString);
    var db = con.GetDatabase(ExamDB);
    var collection = db.GetCollection<ExamCandidateResult>("Examination");

    var filter = Builders<ExamCandidateResult>.Filter.Eq("ExaminationID", examinationId.ToString())
               & Builders<ExamCandidateResult>.Filter.Eq("CandidateID", candidateId.ToString());

    var data = collection.Find(filter).FirstOrDefault();
    return data;
}
With this code I am only fetching the candidate details; how can I fetch the rank (row number) along with it?
I don't think you can get the row number directly, but you can use two queries: one to get the candidate and one to count the candidates who have more TotalMarks than that candidate, then add one to that count to get the candidate's rank.
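A rough sketch of that two-query approach, continuing from the method in the question (the TotalMarks property name on ExamCandidateResult is an assumption):
// Sketch: fetch the candidate, then count how many candidates in the same exam scored higher.
// Rank = (number of candidates with more TotalMarks) + 1.
var candidateFilter = Builders<ExamCandidateResult>.Filter.Eq("ExaminationID", examinationId.ToString())
                    & Builders<ExamCandidateResult>.Filter.Eq("CandidateID", candidateId.ToString());
var candidate = collection.Find(candidateFilter).FirstOrDefault();

var higherFilter = Builders<ExamCandidateResult>.Filter.Eq("ExaminationID", examinationId.ToString())
                 & Builders<ExamCandidateResult>.Filter.Gt("TotalMarks", candidate.TotalMarks);
long rank = collection.CountDocuments(higherFilter) + 1;
Ties are not handled here; candidates with equal TotalMarks would need an additional tie-breaking rule.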

Is a MongoDB bulk upsert possible? C# Driver

I'd like to do a bulk upsert in Mongo. Basically I'm getting a list of objects from a vendor, but I don't know which ones I've gotten before (and need to be updated) vs which ones are new. One by one I could do an upsert, but UpdateMany doesn't work with upsert options.
So I've resorted to selecting the documents, updating in C#, and doing a bulk insert.
public async Task BulkUpsertData(List<MyObject> newUpsertDatas)
{
    var usernames = newUpsertDatas.Select(p => p.Username);
    var filter = Builders<MyObject>.Filter.In(p => p.Username, usernames);

    // Find all records that are in the list of newUpsertDatas (these need to be updated)
    var collection = Db.GetCollection<MyObject>("MyCollection");
    var existingDatas = await collection.Find(filter).ToListAsync();

    // Loop through all of the new data,
    foreach (var newUpsertData in newUpsertDatas)
    {
        // and find the matching existing data
        var existingData = existingDatas.FirstOrDefault(p => p.Id == newUpsertData.Id);

        // If there is existing data, preserve the date created (there are other fields I preserve)
        if (existingData == null)
        {
            newUpsertData.DateCreated = DateTime.Now;
        }
        else
        {
            newUpsertData.Id = existingData.Id;
            newUpsertData.DateCreated = existingData.DateCreated;
        }
    }

    await collection.DeleteManyAsync(filter);
    await collection.InsertManyAsync(newUpsertDatas);
}
Is there a more efficient way to do this?
EDIT:
I did some speed tests.
In preparation I inserted 100,000 records of a pretty simple object. Then I upserted 200,000 records into the collection.
Method 1 is as outlined in the question: select the existing documents, update in code, DeleteMany, then InsertMany. This took approximately 5 seconds.
Method 2 was making a list of UpdateOneModel with Upsert = true and then doing one BulkWriteAsync. This was super slow. I could see the count in the mongo collection increasing, so I know it was working, but after about 5 minutes it had only climbed to 107,000, so I cancelled it.
I'm still interested if anyone else has a potential solution.
Given that you've said you could do a one-by-one upsert, you can achieve what you want with BulkWriteAsync. This allows you to create one or more instances of the abstract WriteModel, which in your case would be instances of UpdateOneModel.
In order to achieve this, you could do something like the following:
var listOfUpdateModels = new List<UpdateOneModel<T>>();
// ...
var updateOneModel = new UpdateOneModel<T>(
    Builders<T>.Filter. /* etc. */,
    Builders<T>.Update. /* etc. */)
{
    IsUpsert = true
};
listOfUpdateModels.Add(updateOneModel);
// ...
await mongoCollection.BulkWriteAsync(listOfUpdateModels);
The key to all of this is the IsUpsert property on UpdateOneModel.
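Applied to the question's MyObject, building the models could look roughly like this; it is only a sketch that assumes Username uniquely identifies a document, and SomeField stands in for whatever vendor fields you overwrite:
// Sketch: one UpdateOneModel per incoming object, matched on Username, inserted if missing.
var models = newUpsertDatas.Select(d =>
    new UpdateOneModel<MyObject>(
        Builders<MyObject>.Filter.Eq(p => p.Username, d.Username),
        Builders<MyObject>.Update
            .Set(p => p.SomeField, d.SomeField)                 // fields to overwrite on every upsert
            .SetOnInsert(p => p.DateCreated, DateTime.Now))     // only set when the document is new
    { IsUpsert = true })
    .ToList();

await collection.BulkWriteAsync(models);
SetOnInsert lets MongoDB keep DateCreated for existing documents without first reading them back into the application.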

Inserting many rows with Entity Framework is extremely slow

I'm using Entity Framework to build a database. There are two models, Worker and Skill; each Worker has zero or more Skills. I initially read this data into memory from a CSV file and store it in a dictionary called allWorkers. Next, I write the data to the database like this:
// Populate database
using (var db = new SolverDbContext())
{
    // Add all distinct skills to database
    db.Skills.AddRange(allSkills
        .Distinct(StringComparer.InvariantCultureIgnoreCase)
        .Select(s => new Skill
        {
            Reference = s
        }));
    db.SaveChanges(); // Very quick

    var dbSkills = db.Skills.ToDictionary(k => k.Reference, v => v);

    // Add all workers to database
    var workforce = allWorkers.Values
        .Select(i => new Worker
        {
            Reference = i.EMPLOYEE_REF,
            Skills = i.GetSkills().Select(s => dbSkills[s]).ToArray(),
            DefaultRegion = "wa",
            DefaultEfficiency = i.TECH_EFFICIENCY
        });

    db.Workers.AddRange(workforce);
    db.SaveChanges(); // This call takes 00:05:00.0482197
}
The last db.SaveChanges(); takes over five minutes to execute, which I feel is far too long. I ran SQL Server Profiler while the call was executing, and essentially what I found was thousands of calls to:
INSERT [dbo].[SkillWorkers]([Skill_SkillId], [Worker_WorkerId])
VALUES (#0, #1)
There are 16,027 rows being added to SkillWorkers, which is a fair amount of data but not huge by any means. Is there any way to optimize this code so it doesn't take 5min to run?
Update: I've looked at other possible duplicates, such as this one, but I don't think they apply. First, I'm not bulk adding anything in a loop. I'm doing a single call to db.SaveChanges(); after every row has been added to db.Workers. This should be the fastest way to bulk insert. Second, I've set db.Configuration.AutoDetectChangesEnabled to false. The SaveChanges() call now takes 00:05:11.2273888 (In other words, about the same). I don't think this really matters since every row is new, thus there are no changes to detect.
I think what I'm looking for is a way to issue a single INSERT statement containing all 16,000 skill-worker rows.
One easy method is by using the EntityFramework.BulkInsert extension.
You can then do:
// Add all workers to database
var workforce = allWorkers.Values
    .Select(i => new Worker
    {
        Reference = i.EMPLOYEE_REF,
        Skills = i.GetSkills().Select(s => dbSkills[s]).ToArray(),
        DefaultRegion = "wa",
        DefaultEfficiency = i.TECH_EFFICIENCY
    });

db.BulkInsert(workforce);

How to retrieve more than 4000 records from RavenDB in a single session [duplicate]

I know variants of this question have been asked before (even by me), but I still don't understand a thing or two about this...
It was my understanding that one could retrieve more documents than the 128 default setting by doing this:
session.Advanced.MaxNumberOfRequestsPerSession = int.MaxValue;
And I've learned that a WHERE clause should be an ExpressionTree instead of a Func, so that it's treated as Queryable instead of Enumerable. So I thought this should work:
public static List<T> GetObjectList<T>(Expression<Func<T, bool>> whereClause)
{
    using (IDocumentSession session = GetRavenSession())
    {
        return session.Query<T>().Where(whereClause).ToList();
    }
}
However, that only returns 128 documents. Why?
Note, here is the code that calls the above method:
RavenDataAccessComponent.GetObjectList<Ccm>(x => x.TimeStamp > lastReadTime);
If I add Take(n), then I can get as many documents as I like. For example, this returns 200 documents:
return session.Query<T>().Where(whereClause).Take(200).ToList();
Based on all of this, it would seem that the appropriate way to retrieve thousands of documents is to set MaxNumberOfRequestsPerSession and use Take() in the query. Is that right? If not, how should it be done?
For my app, I need to retrieve thousands of documents (that have very little data in them). We keep these documents in memory and use them as the data source for charts.
** EDIT **
I tried using int.MaxValue in my Take():
return session.Query<T>().Where(whereClause).Take(int.MaxValue).ToList();
And that returns 1024. Argh. How do I get more than 1024?
** EDIT 2 - Sample document showing data **
{
    "Header_ID": 3525880,
    "Sub_ID": "120403261139",
    "TimeStamp": "2012-04-05T15:14:13.9870000",
    "Equipment_ID": "PBG11A-CCM",
    "AverageAbsorber1": "284.451",
    "AverageAbsorber2": "108.442",
    "AverageAbsorber3": "886.523",
    "AverageAbsorber4": "176.773"
}
It is worth noting that since version 2.5, RavenDB has an "unbounded results API" to allow streaming. The example from the docs shows how to use this:
var query = session.Query<User>("Users/ByActive").Where(x => x.Active);
using (var enumerator = session.Advanced.Stream(query))
{
    while (enumerator.MoveNext())
    {
        User activeUser = enumerator.Current.Document;
    }
}
There is support for standard RavenDB queries and Lucene queries, and there is also async support.
The documentation can be found here. Ayende's introductory blog article can be found here.
The Take(n) function will only give you up to 1024 by default. However, you can change this default in Raven.Server.exe.config:
<add key="Raven/MaxPageSize" value="5000"/>
For more info, see: http://ravendb.net/docs/intro/safe-by-default
The Take(n) function will only give you up to 1024 by default. However, you can use it together with Skip(n) to retrieve everything:
RavenQueryStatistics stats;
var points = new List<T>();
var nextGroupOfPoints = new List<T>();
const int ElementTakeCount = 1024;
int i = 0;
int skipResults = 0;

do
{
    nextGroupOfPoints = session.Query<T>()
        .Statistics(out stats)
        .Where(whereClause)
        .Skip(i * ElementTakeCount + skipResults)
        .Take(ElementTakeCount)
        .ToList();
    i++;
    skipResults += stats.SkippedResults;
    points = points.Concat(nextGroupOfPoints).ToList();
}
while (nextGroupOfPoints.Count == ElementTakeCount);

return points;
RavenDB Paging
The number of requests per session is a separate concept from the number of documents retrieved per call. Sessions are short-lived and are expected to have only a few calls issued over them.
If you are retrieving more than 10 of anything from the store (even fewer than the default 128) for human consumption, then either something is wrong or your problem requires different thinking than a truckload of documents coming from the data store.
RavenDB indexing is quite sophisticated. There is a good article about indexing here and about facets here.
If you need to perform data aggregation, create a map/reduce index that produces the aggregated data, e.g.:
Map:
from post in docs.Posts
select new { post.Author, Count = 1 }

Reduce:
from result in results
group result by result.Author into g
select new
{
    Author = g.Key,
    Count = g.Sum(x => x.Count)
}
Query:
session.Query<AuthorPostStats>("Posts/ByUser/Count").Where(x => x.Author == author).ToList();
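The same map/reduce index can also be defined in code; a sketch assuming a Post document class and an AuthorPostStats result class:
// Sketch: the map/reduce above expressed as a strongly typed index definition.
public class Posts_ByUser_Count : AbstractIndexCreationTask<Post, AuthorPostStats>
{
    public Posts_ByUser_Count()
    {
        Map = posts => from post in posts
                       select new { post.Author, Count = 1 };

        Reduce = results => from result in results
                            group result by result.Author into g
                            select new
                            {
                                Author = g.Key,
                                Count = g.Sum(x => x.Count)
                            };
    }
}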
You can also use a predefined index with the Stream method. You may use a Where clause on indexed fields.
// Query the index directly:
var query = session.Query<User, MyUserIndex>();
// ...or filter on an indexed field:
var query = session.Query<User, MyUserIndex>().Where(x => !x.IsDeleted);
using (var enumerator = session.Advanced.Stream<User>(query))
{
    while (enumerator.MoveNext())
    {
        var user = enumerator.Current.Document;
        // do something
    }
}
Example index:
public class MyUserIndex : AbstractIndexCreationTask<User>
{
    public MyUserIndex()
    {
        this.Map = users =>
            from u in users
            select new
            {
                u.IsDeleted,
                u.Username,
            };
    }
}
Documentation: What are indexes?
Session : Querying : How to stream query results?
Important note: the Stream method will NOT track objects. If you change objects obtained from this method, SaveChanges() will not be aware of any change.
Other note: you may get the following exception if you do not specify the index to use.
InvalidOperationException: StreamQuery does not support querying dynamic indexes. It is designed to be used with large data-sets and is unlikely to return all data-set after 15 sec of indexing, like Query() does.
