Retrieve 1+ million records from Azure Table Storage - c#

My table storage has approximately 1-2 million records, and I have a daily job that needs to retrieve all the records that do not have a property A and do some further processing.
It is expected that there are about 1-1.5 million records without property A. I understand there are two approaches:
Query all records then filter results after
Do a table scan
Currently, we are using the approach of querying all records and filtering in C#. However, the task runs in an Azure Function App, and the query to retrieve all the results sometimes takes over 10 minutes, which is the time limit for Azure Functions.
I'm trying to understand why retrieving 1 million records is taking so long and how to optimise the query. The existing design of the table is that the partition key and row key are identical and are a GUID - this leads me to believe that there is one entity per partition.
Looking at Microsoft docs, here are some key Table Storage limits (https://learn.microsoft.com/en-us/azure/storage/common/storage-scalability-targets#azure-table-storage-scale-targets):
Maximum request rate per storage account: 20,000 transactions per second, which assumes a 1-KiB entity size
Target throughput for a single table partition (1 KiB-entities): Up to 2,000 entities per second.
My initial guess is that I should use another partition key to group 2,000 entities per partition to achieve the target throughput of 2,000 per second per partition. Would this mean that 2,000,000 records could in theory be returned in 1 second?
Any thoughts or advice appreciated.

I found this question after blogging on the very topic. I have a project where I am using the Azure Functions Consumption plan and have a big Azure Storage Table (3.5 million records).
Here's my blog post:
https://www.joelverhagen.com/blog/2020/12/distributed-scan-of-azure-tables
I have mentioned a couple of options in this blog post, but I think the fastest is distributing the "table scan" work into smaller work items that can easily be completed within the 10-minute limit. I have an implementation linked in the blog post if you want to try it out. It will likely take some adapting to your Azure Function, but most of the clever part (finding the partition key ranges) is implemented and tested.
This looks to be essentially what user3603467 is suggesting in his answer.

I see two approaches to retrieving 1+ million records in a batch process where the result must be saved to a single medium - like a file.
First) You identify/select all primary ids/keys of the related data. Then you spawn parallel jobs with chunks of these primary ids/keys, where each job reads the actual data and processes it. Each job then reports its result to the single medium.
Second) You identify/select (for update) the top n of the related data and mark it with a "being processed" state. Use concurrency locking here; that should prevent others from picking that data up if this is done in parallel.
I would go for the first solution if possible, since it is the simplest and cleanest. The second solution works best if you can use "select for update"; I don't know whether that is supported on Azure Table Storage.

You'll need to parallelise the task. As you don't know the partition keys, run separate queries whose PK ranges start and end at each letter of the alphabet: write a query where PK > A && PK < B, another where PK > B && PK < C, etc. Then join the results in memory. Super easy to do in a single function. In JS just use Promise.all([]).
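In C#, the fan-out/fan-in counterpart of Promise.all is Task.WhenAll. Here is a minimal sketch of that idea, assuming the Azure.Data.Tables SDK, a placeholder connection string and table name, and that the PartitionKeys are lowercase GUID strings (so the key space is split on the 16 hex characters rather than letters):

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Azure.Data.Tables;

var table = new TableClient("<connection-string>", "<table-name>"); // placeholders

// One range query per leading hex character of the GUID keys, kicked off in parallel.
List<Task<List<TableEntity>>> tasks = "0123456789abcdef".Select(async c =>
{
    string lower = c.ToString();
    string upper = ((char)(c + 1)).ToString();
    string filter = $"PartitionKey ge '{lower}' and PartitionKey lt '{upper}'";

    var results = new List<TableEntity>();
    await foreach (TableEntity entity in table.QueryAsync<TableEntity>(filter))
    {
        results.Add(entity);
    }
    return results;
}).ToList();

// Fan-in: wait for all range queries, then flatten into a single list.
List<TableEntity> all = (await Task.WhenAll(tasks)).SelectMany(x => x).ToList();

Each range query only touches its slice of the key space, and the results are merged in memory at the end, much like Promise.all on the JS side.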

BigQuery quota limits from query table append apply or not?

In my C# app I use BigQueryClient.CreateQueryJobAsync to regularly append to a partitioned table, table1. Currently this happens only up to 50-100 times per day.
await BqClient.CreateQueryJobAsync(
    ...,
    new QueryOptions
    {
        DestinationTable = "table1",
        WriteDisposition = WriteDisposition.WriteAppend
    })
I understand that BigQuery has many limits and these are usually well documented. But for this particular scenario I am not sure whether the limits apply or not. The quotas page says there are "1,000 updates per table per day", but the documentation also explicitly lists which operations are affected by the quota. Given that there is an explicit list, there must also be an implicit "everything else" to which the quota does not apply. For instance, the "classic UI" is under the quota, which should imply that the "new UI" is not. Similarly, the page states that the jobs.query API is affected by the quota, but since I am using the official C# driver, I am left wondering whether this applies to my scenario or not.
Apparently, I could write a script that tries the append operation 1,001 times in 24 hours and see whether I hit the quota, but I wish I could simply read this from the documentation and understand it without any ambiguity.
Does anyone know from first-hand experience how this actually works?
The BigQuery C# library internally uses the jobs.query API, so the limit/quota applies in your case.
One more thing to be aware of: since you're writing to a partitioned table, the following quota also applies:
Maximum number of partition modifications per day per table — 5,000
You are limited to a total of 5,000 partition modifications per day for a partitioned table. A partition can be modified by using an operation that appends to or overwrites data in the partition. Operations that modify partitions include: a load job, a query that writes results to a partition, or a DML statement (INSERT, DELETE, UPDATE, or MERGE) that modifies data in a partition.
More than one partition may be affected by a single job. For example, a DML statement can update data in multiple partitions (for both ingestion-time and partitioned tables). Query jobs and load jobs can also write to multiple partitions but only for partitioned tables. BigQuery uses the number of partitions affected by a job when determining how much of the quota the job consumes. Streaming inserts do not affect this quota.

How to Minimize Data Transfer Out From Azure Query in C# .NET

I have a small table (23 rows, 2 int columns), just a basic user-activity monitor. The first column represents the user id. The second column holds a value that should be unique for every user, but I must alert the users if two values are the same. I'm using an Azure SQL database to hold this table, and LINQ to SQL in C# to run the query.
The problem: Microsoft will bill me based on data transferred out of their data centers. I would like all of my users to be aware of the current state of this table at all times, second by second, and to keep data transfer under 5 GB per month. I'm thinking along the lines of a LINQ-to-SQL expression such as
UserActivity.Where(x => x.Val == myVal).Count() > 1;
But this would download the table to the client, which cannot happen. Should I be implementing a LINQ solution? Or would SqlDataReader download less metadata from the server? Am I taking the right approach by using a database at all? Gimme thoughts!
If it is data transfer you are worried about, you need to do your processing on the server and return only the results. A SqlDataReader solution can return a smaller, already-processed set of data to minimise the traffic.
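For example, here is a minimal sketch of that idea, assuming a table named UserActivity with an int column Val (names taken from the question) and the Microsoft.Data.SqlClient package. The GROUP BY/HAVING runs on the server, so only the duplicated values (if any) cross the wire:

using Microsoft.Data.SqlClient;

const string sql = "SELECT Val FROM UserActivity GROUP BY Val HAVING COUNT(*) > 1;";

using var conn = new SqlConnection("<connection-string>");
await conn.OpenAsync();

using var cmd = new SqlCommand(sql, conn);
using SqlDataReader reader = await cmd.ExecuteReaderAsync();
while (await reader.ReadAsync())
{
    int duplicatedValue = reader.GetInt32(0);
    // alert the users who share this value
}

A LINQ-to-SQL equivalent such as db.UserActivities.GroupBy(x => x.Val).Where(g => g.Count() > 1).Select(g => g.Key) (db and UserActivities being your data context and table) should also be translated into a server-side GROUP BY, so either way the transfer per check is a handful of bytes.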
A couple of thoughts here:
First, I strongly encourage you to profile the SQL generated by your LINQ-to-SQL queries. There are several tools available for this; here's one at random (I have no particular preference or affiliation):
LINQ Profiler from Devart
Your prior experience with LINQ query inefficiency notwithstanding, the LINQ sample you quote in your question isn't particularly complex, so I would expect you could make it (or something similar) work efficiently, given a good feedback mechanism like the tool above.
Second, you don't explicitly mention whether your query client is running inside or outside Azure, but I gather from your concern about data egress costs that it's running outside Azure. So the data egress costs are going to be query results sent over the TDS protocol (the low-level protocol for SQL Server), which is pretty efficient. Some quick back-of-the-napkin math shows that you should be fine staying below your monthly 5 GB limit:
23 users
10 hours/day
30 days/month (less if only weekdays)
3600 requests/hour/user
32 bits of raw data per response
= about 95 MB of raw response data per month
Even if you assume 10x overhead of TDS for header metadata, etc. (and if my math is right :-) ) then you've still got plenty of room underneath 5 GB. The point isn't that you should stop thinking about it and assume it's fine... but don't assume it isn't fine, either. In fact, don't assume anything. Test, and measure, and make an informed choice. I suspect you'll find a way to stay well under 5 GB without much trouble, even with LINQ.
One other thought... perhaps you could consider running your query inside Azure, and weigh the cost of that vs. the cost of data egress under the "query running outside Azure" scenario? This could (for example) take the form of a small Azure Web Job that runs the query every second and notifies the 23 users if the count goes above 1.
Azure Web Jobs
In essence, you wouldn't notify them if the condition is false, only when it's true. As for the notification mechanism, there are various cloud-friendly options:
Azure mobile push notifications
SMS messaging
SignalR notifications
The key here is to determine whether it's more cost-effective and in line with any bigger-picture technology or business goals to have each user issue the query continuously, or to use some separate process in Azure to notify users asynchronously if the "trigger condition" is met.
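If you did go the in-Azure route, a minimal sketch of that polling job might look like the following. This assumes a WebJob-style worker loop, the UserActivity table and Val column from the question, and a hypothetical NotifyUsersAsync placeholder standing in for whichever notification channel (push, SMS, SignalR) you pick:

using System;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

while (true)
{
    using (var conn = new SqlConnection("<connection-string>"))
    {
        await conn.OpenAsync();

        using var cmd = new SqlCommand(
            "SELECT COUNT(*) FROM (SELECT Val FROM UserActivity GROUP BY Val HAVING COUNT(*) > 1) AS dupes;",
            conn);
        int duplicateCount = (int)await cmd.ExecuteScalarAsync();

        if (duplicateCount > 0)
        {
            await NotifyUsersAsync(); // only push when the trigger condition is met
        }
    }

    await Task.Delay(TimeSpan.FromSeconds(1)); // run the check roughly once per second
}

// Hypothetical stub: replace with your chosen notification mechanism.
static Task NotifyUsersAsync() => Task.CompletedTask;

It is the same duplicate check as before, just moved next to the database, so only notifications (never table data) leave the data center.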
Best of luck!

How to get table entities as batches in Microsoft Azure Table Storage

I would like to fetch table entities from cloud storage using Microsoft Azure Table Storage, and it takes too long to fetch a large amount of data, such as 100,000 entities. Is there any way to get the entities in batches with a count of 1,000?
Thanks in Advance,
Paul
TableQuery can pull records for queries in batches of 1000.
https://azure.microsoft.com/en-us/documentation/articles/storage-dotnet-how-to-use-tables/#retrieve-all-entities-in-a-partition
However, Table Storage was designed to pull records by either partition key or row key. If you have any other filter it will be slow, as there are no indexes to help, so it has to scan each row and see if it meets your filter. So I would loop through your partition keys and pull the data that way.
After each batch of 1,000 you get a continuation token to go and get the next batch for that query. I have had some luck setting up a blocking collection and dataflow so I can kick off the next query and let that I/O happen while I am still processing the previous batch.
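A minimal sketch of that segmented loop, assuming the classic Microsoft.Azure.Cosmos.Table package (where TableQuery lives; the older WindowsAzure.Storage client has the same shape) and placeholder connection, table, and partition key values:

using Microsoft.Azure.Cosmos.Table;

CloudStorageAccount account = CloudStorageAccount.Parse("<connection-string>");
CloudTable table = account.CreateCloudTableClient().GetTableReference("<table-name>");

TableQuery<DynamicTableEntity> query = new TableQuery<DynamicTableEntity>()
    .Where(TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "<partition-key>"))
    .Take(1000);

TableContinuationToken token = null;
do
{
    TableQuerySegment<DynamicTableEntity> segment =
        await table.ExecuteQuerySegmentedAsync(query, token);
    token = segment.ContinuationToken;

    foreach (DynamicTableEntity entity in segment.Results)
    {
        // process the batch here (the answer above overlaps this work with the
        // next fetch by handing entities off to a blocking collection)
    }
} while (token != null);

Each ExecuteQuerySegmentedAsync call returns at most 1,000 entities plus a continuation token; looping until the token is null walks the whole partition.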

Azure Table Storage key latency very variable

We're seeing some very variable latencies when querying our Azure Table Storage data. We have a number of items, each fetching time series data, which is broken up by day as follows:
Partition key: {DATA_TYPE}_{yyyyMMdd} - 4 different data types with about 2 years of data in total
Row key: {DataObjectId} - about 3,000-4,000 records per day.
A record itself is a JSON-encoded array of DateTime values spaced every 15 minutes.
I want to retrieve time series data for a specific object for the last few days, so I constructed the following query:
string.Format("(PartitionKey ge '{0}') and (PartitionKey le '{1}') and (RowKey eq '{2}')", lowDate, highDate, DataObjectId);
As above, we have records going back 2-3 years now.
On the whole the query time is fairly speedy, 600-800 ms. However, once or twice we get cases where it seems to take a very long time to retrieve data from these partitions, i.e. one or two queries have taken 50+ seconds to return data.
We are not aware of the system being under dramatic load. In fact, frustratingly, all the graphs we've found in the portal suggest no real problems.
Some suggestions that come to mind:
1.) Add the year component first, making the partition keys immediately more selective.
However, the most frustrating thing is the variation in the time taken to run the queries.
The Azure Storage latency shown in the Azure portal averages about 117.2 ms and the maximum reported is 294 ms. I have interpreted this as network latency.
Of course, any suggestions are gratefully received. The most vexing thing is that the execution time is so variable. In a very small number of cases we see our application resorting to the use of continuation tokens because the query has taken over 5 seconds to complete.
https://msdn.microsoft.com/en-us/library/azure/dd179421.aspx
I have been looking at this for a while.
I've not found an answer to why querying across partitions suffered such variable latency. I had assumed that it would work well with the indexes.
However, the solution seems to be to simply request data from the 6 different partitions separately. That way, every query takes advantage of both the PartitionKey and RowKey indexing. Once this was implemented, our queries began returning much faster.
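A minimal sketch of that per-partition approach, assuming the Azure.Data.Tables SDK, the {DATA_TYPE}_{yyyyMMdd} partition key scheme described above, and placeholder data type / object id values: one point query (exact PartitionKey + RowKey) per day partition, issued in parallel, instead of one range query spanning several partitions.

using System;
using System.Linq;
using System.Threading.Tasks;
using Azure;
using Azure.Data.Tables;

var table = new TableClient("<connection-string>", "<table-name>");

string dataType = "<data-type>";          // placeholder for one of the 4 data types
string dataObjectId = "<data-object-id>"; // placeholder
DateTime start = DateTime.UtcNow.Date.AddDays(-5);

// One point query per day partition for the last 6 days, run in parallel.
var tasks = Enumerable.Range(0, 6).Select(async offset =>
{
    string partitionKey = $"{dataType}_{start.AddDays(offset):yyyyMMdd}";
    try
    {
        Response<TableEntity> response =
            await table.GetEntityAsync<TableEntity>(partitionKey, dataObjectId);
        return response.Value;
    }
    catch (RequestFailedException ex) when (ex.Status == 404)
    {
        return null; // no record stored for that day
    }
});

TableEntity[] days = await Task.WhenAll(tasks);

Each GetEntityAsync call is an exact-key lookup, the cheapest operation Table Storage offers, which keeps the latency far more predictable than a partition-range query.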
I would still like to understand why querying across partitions seemed so slow, but I can only assume the query resulted in a table scan, which has variable latency.

Azure Drive vs Block Blob vs Table

I couldn't decide on the best approach to handle the following scenario via Azure Storage.
~1,500+ CSV files between ~1 MB and ~500 MB, ~20 GB of data overall
Each file uses exactly the same model, and each model.ToString() is ~50 characters / ~400 bytes
Every business day, during a 6-hour period, ~8,000+ new rows come in per minute
Based on a property value, each row goes to the correct file
Writing from multiple instances is not necessary as long as reading from multiple instances is supported; even a few seconds of delay for a snapshot period is OK.
I would like to use block blobs, but downloading a ~400 MB file just to add a single line and upload it back doesn't make sense, and I couldn't find another way around it.
There is the Drive option, which uses page blobs; unfortunately it is not supported by SDK v2, which makes me nervous about possible discontinuation of support.
And the final option is Table Storage, which looks OK apart from the fact that reading a few hundred thousand rows continuously may become an issue.
Basically, I would prefer to write the data out immediately as I retrieve it. But if it is worth giving that up, I can live with a single update at the end of the day, which means ~300-1,000 lines per file.
What would be the best approach to handle this scenario?
Based on your requirements above, Azure Tables are the optimal option. With a single Azure Storage account you get the following:
Storage Transactions – up to 20,000 entities/messages/blobs per second
Single Table Partition – a table partition is all of the entities in a table with the same partition key value, and most tables have many partitions. The throughput target for a single partition is:
Up to 2,000 entities per second
Note, this is for a single partition, and not a single table. Therefore, a table with good partitioning can process up to a few thousand requests per second (up to the storage account target of 20,000).
Tables – use a more finely grained PartitionKey for the table in order to allow the service to automatically spread the table partitions across more servers.
As for reading "a few hundred thousand rows" continuously, your main obstacle is the storage-account-level limit of 20,000 transactions/sec; however, if you design your partitions granularly enough that they are spread across many servers, you should be able to read "hundreds of thousands" of rows in minutes.
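A minimal sketch of what a write might look like under such a design, assuming the Azure.Data.Tables SDK, a hypothetical Category routing property (the property that currently picks the CSV file), and a placeholder payload. Combining the routing property with a date bucket keeps the PartitionKey finely grained, so writes spread across many partitions:

using System;
using Azure.Data.Tables;

var table = new TableClient("<connection-string>", "<table-name>");

string category = "<routing-property-value>"; // hypothetical: the property that currently selects the file
var entity = new TableEntity(
    partitionKey: $"{category}_{DateTime.UtcNow:yyyyMMdd}",
    rowKey: Guid.NewGuid().ToString())
{
    ["Payload"] = "<the ~400-byte row as a string>"
};

await table.AddEntityAsync(entity);

Readers can then query a single day's partition for a given category with an exact PartitionKey filter, which avoids cross-partition scans.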
Source:
Windows Azure Storage Abstractions and their Scalability Targets
Windows Azure’s Flat Network Storage and 2012 Scalability Targets
