Azure table storage querying partitionkey - c#

I am using Azure Table Storage and retrieving data through a timestamp filter. I see that execution is very slow because the timestamp is not a partition key or row key. I researched on Stack Overflow and found that the timestamp should be converted to ticks and stored in the partition key. I did that, and while inserting data I built the tick string below and inserted it as the partition key.
string currentDateTimeTick = ConvertDateTimeToTicks(DateTime.Now.ToUniversalTime()).ToString();
public static long ConvertDateTimeToTicks(DateTime dtInput)
{
    long ticks = 0;
    ticks = dtInput.Ticks;
    return ticks;
}
This works fine up to that point. But when I try to retrieve the last 5 days of data, I am unable to query the ticks against the partition key. What is my mistake in the code below?
int days = 5;
TableQuery<MyEntity> query = new TableQuery<MyEntity>()
    .Where(TableQuery.GenerateFilterConditionForDate("PartitionKey", QueryComparisons.GreaterThanOrEqual, "0" + DateTimeOffset.Now.AddDays(days).Date.Ticks));

Are you sure you want to use ticks as a partition key? That means every measurable 100 ns instant becomes its own partition. With time-based data you can use the partition key to specify an interval such as every hour, minute or even second, and then use a row key with the actual timestamp.
That problem aside, let me show you how to do the query. First, let me comment on how you generate the partition key. I suggest you do it like this:
var partitionKey = DateTime.UtcNow.Ticks.ToString("D18");
Don't use DateTime.Now.ToUniversalTime() to get the current UTC time: DateTime.Now internally uses DateTime.UtcNow and converts it to the local time zone, and ToUniversalTime() then converts it back to UTC, which is just wasteful (and more time consuming than you may think).
And your ConvertDateTimeToTicks() method serves no other purpose than to get the Ticks property so it is just making your code more complex without adding any value.
Here is how to perform the query:
var days = 5;
var partitionKey = DateTime.UtcNow.AddDays(-days).Ticks.ToString("D18");
var query = new TableQuery<MyEntity>().Where(
    TableQuery.GenerateFilterCondition(
        "PartitionKey",
        QueryComparisons.GreaterThanOrEqual,
        partitionKey
    )
);
The partition key is formatted as an 18-character string so that a straightforward string comparison matches the numeric ordering of the ticks.
I suggest that you move the code to generate the partition key (and row key) into a function to make sure that the keys are generated the same way throughout your code.
The reason 18 characters are used is that the Ticks value of a DateTime today, as well as many thousands of years into the future, uses 18 decimal digits. If you decide to base your partition key on hours, minutes or seconds instead of 100 ns ticks, then you can shorten the length of the partition key accordingly.
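For illustration, a minimal sketch of such a helper (the class and method names are just placeholders; the GUID row key is only one reasonable default):

using System;

public static class TableKeys
{
    // Partition key: UTC ticks zero-padded to 18 digits, so that lexicographic
    // ordering of the strings matches chronological ordering.
    public static string PartitionKeyFor(DateTime utcTimestamp) =>
        utcTimestamp.Ticks.ToString("D18");

    // Row key: anything unique within the partition; a GUID is a safe default.
    public static string NewRowKey() =>
        Guid.NewGuid().ToString("N");
}

// Usage: entity.PartitionKey = TableKeys.PartitionKeyFor(DateTime.UtcNow);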

As Martin suggests, using a timestamp as your partition key is almost certainly not what you want to do.
Partitions are the unit of scale in Azure Table Storage and more or less represent physical segmentation of your data. They're a scalability optimization that allows you to "throw hardware" at the problem of storing more and more data, while maintaining acceptable response times (something which is traditionally hard in data storage). You define the partitions in your data by assigning partition keys to each row. It's almost never desirable that each row lives in its own partition.
In ATS, the row key becomes your unique key within a given partition. So the combination of partition key + row key is the true unique key across the entire ATS table.
There's lots of advice out there for choosing a valid partition key and row key... none of which is generalized. It depends on the nature of your data, your anticipated query patterns, etc.
Choose a partition key that will aggregate your data into a reasonably well-distributed set of "buckets". All things being equal, if you anticipate having 1 million rows in your table, it's often useful to have, say, 10 buckets with 100,000 rows each... or maybe 100 buckets with 10,000 rows each. At query time you'll need to pick the partition(s) you're querying, so the number of buckets may matter to you. "Buckets" often correspond to a natural segmentation concept in your domain... a bucket to represent each US state, or a bucket to represent each department in your company, etc. Note that it's not necessary (or often possible) to have perfectly distributed buckets... get as close as you can with reasonable effort.
One example of where you might intentionally have an uneven distribution is if you intend to vary query patterns by bucket... bucket A will receive lots of cheap, fast queries, bucket B fewer, more expensive queries, etc. Or perhaps bucket A data will remain static while bucket B data changes frequently. This can be accomplished with multiple tables, too... so there's no "one size fits all" answer.
Given the limited knowledge we have of your problem, I like Martin's advice of using a time span as your partition key. Small spans will result in many partitions, and (among other things) make queries that utilize multiple time spans relatively expensive. Larger spans will result in fewer aggregation costs across spans, but will result in bigger partitions and thus more expensive queries within a partition (it will also make identifying a suitable row key potentially more challenging).
Ultimately you'll likely need to experiment with a few options to find the most suitable one for your data and intended queries.
One other piece of advice... don't be afraid to consider duplicating data in multiple data stores to suit widely varying query types. Not every query will work effectively against a single schema or storage configuration. The effort needed to synchronize data across stores may be less than that needed to bend query technology X to your will.
Best of luck!

One thing not mentioned in the answers above is that Azure will detect if you are using sequential, always-increasing or always-decreasing values for your partition key and create "range partitions". Range partitions group entities that have sequential unique PartitionKey values to improve the performance of range queries. Without range partitions, as mentioned above, a range query will need to cross partition or server boundaries, which can decrease query performance. Range partitions happen under the hood and are decided by Azure, not you.
Now, if you want to do bulk inserts, let's say once a minute, you will still need to flatten out your timestamp partition keys to, say, ticks rounded up to the nearest minute. You can only do bulk inserts with the same partition key.
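For illustration, a minimal sketch of that flattening, assuming the classic TableBatchOperation/CloudTable API that matches the TableQuery usage above (entitiesToInsert and table are hypothetical variables, and MyEntity is the entity type from the question):

var now = DateTime.UtcNow;
var minuteTicks = now.Ticks - (now.Ticks % TimeSpan.TicksPerMinute); // floor to the minute boundary
var partitionKey = minuteTicks.ToString("D18");

var batch = new TableBatchOperation();
foreach (var entity in entitiesToInsert)          // entitiesToInsert: your IEnumerable<MyEntity>
{
    entity.PartitionKey = partitionKey;           // same key for the whole batch
    entity.RowKey = Guid.NewGuid().ToString("N"); // any unique row key scheme
    batch.Insert(entity);
}
table.ExecuteBatch(batch);                        // table: your CloudTable reference

Keep in mind that a single batch is limited to 100 entities, so larger sets still have to be split into multiple batches.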

Related

Recommended pattern/strategy to compare two sets of data (new vs existing)... remaining with new data that's not in existing

I have an ETL process/job that fetches database data from a source to a destination in a scheduled way.
[Source data] is updated regularly with new data from some external source. [Destination data] is a subset of [Source data] that is used downstream by the business.
The constraint on [Destination data] is that it must not contain duplicates (these may occur, for example, when a job fails and a new extraction is run after some data has possibly already been imported).
The job imports 1000 records at a time
The Scheduler/Job has other responsibilities and other data it works on
One of my "feasible" options involve:
fetching ALL the projected composite/key columns from the destination,
doing a comparison with the new 1000 loaded records (still alot of
records).
Then saving the new [Source data] that is not in the
[Destination Data].
I would imagine that the data structure containing the existing [Destination data] would be a HashSet, for example a HashSet<(int, string, string)>, where the 3 data items uniquely identify a record.
I would then take the 1000 records, loop through them, and compare each one with the HashSet.
I fear working with too much data in-memory.
Any advice on a better approach, or would this be the most efficient way to do it?
Just to share: I found a similar question with a comprehensive answer. It's in Java but translates easily to C#.
Still open to any alternatives. Otherwise I will mark this one as the answer and indicate it as a duplicate.
...we could sort all elements by their ID (a one-time O(n log n) cost) in ascending order, and iterate over them using an O(n) algorithm that skips elements as long as they are larger than the current element from the other sequence. This is better, but still not optimal.
The optimal solution is to create a hash set of IDs of the bs set. This does not require sorting of both sets, and allows linear-time membership test. There is a one-time O(n) cost to assemble the set of IDs.
HashSet<Integer> bIds = new HashSet<>(bs.size());
for (B b : bs)
    bIds.add(b.getId());

for (A a : as)
    if (bIds.contains(a.getId()))
        cs.add(a);
The total complexity of this solution is O(|as| + |bs|).
https://softwareengineering.stackexchange.com/a/258325/132218
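For reference, a rough C# translation of the same idea (the Row record and its (Id, Code, Name) composite key are hypothetical stand-ins for your three key columns):

using System.Collections.Generic;
using System.Linq;

public record Row(int Id, string Code, string Name);

public static class Dedup
{
    public static List<Row> NewRowsOnly(IEnumerable<Row> destinationKeys, IEnumerable<Row> sourceBatch)
    {
        // One-time O(n) build of the membership set from the destination keys.
        var existing = new HashSet<(int, string, string)>(
            destinationKeys.Select(d => (d.Id, d.Code, d.Name)));

        // Linear scan of the incoming batch; keep only rows not already present.
        return sourceBatch
            .Where(s => !existing.Contains((s.Id, s.Code, s.Name)))
            .ToList();
    }
}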

check existence of an implementation for a "string search method in database"

I am developing an application which receives packets from the network and stores them in a database. In one part, I save DNS records to the DB in this format:
IP Address(unsigned 32bit integer)
DNS record(unlimited string)
The rate of DNS records is about 10-100 per second. Since it's real-time, I don't have enough time to check for duplicates with a string search in the database. I was thinking of a good way to get a unique short integer (say, 64-bit) per unique string, so that my search becomes a number search instead of a string search and lets me check for duplicates faster. Any ideas about implementations of this, or better approaches, are appreciated. Samples in C# are preferred, but any good idea is welcome.
I would read through this, which talks about hashing strings into integers, and since the addresses are pretty long (letter-wise), I would use some modulo function to keep the result within integer limits.
The results would be checked against a hash table for duplicates.
This could be done for the first 20 letters, and then the next 20 with a nested hash table if required, and so on.
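A minimal sketch of that idea in C#, using FNV-1a purely as one example of a stable 64-bit hash (collisions are possible, so a positive hit should still be confirmed against the stored string before discarding a record):

using System.Collections.Generic;
using System.Text;

public static class DnsDedup
{
    private static readonly HashSet<ulong> Seen = new();

    // Stable 64-bit FNV-1a hash of the record string.
    public static ulong Fnv1a64(string value)
    {
        ulong hash = 14695981039346656037UL;      // FNV offset basis
        foreach (var b in Encoding.UTF8.GetBytes(value))
        {
            hash ^= b;
            hash *= 1099511628211UL;              // FNV prime
        }
        return hash;
    }

    // True if this record's hash has not been seen before (i.e. it is probably new).
    public static bool IsProbablyNew(string dnsRecord) => Seen.Add(Fnv1a64(dnsRecord));
}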
Make sure you set up your table indexes and primary keys correctly.
Load the table contents asynchronously every couple of seconds and populate a generic Dictionary<long, string> with it.
Perform the search on the dictionary, as it is optimized for lookups. If you need it even faster, use a Hashtable.
Flush the newly added entries asynchronously into the DB in a transaction.
P.S. Your scenario is too vague to create a decent code example.

How to store a sparse boolean vector in a database?

Let's say I have a book with ~2^40 pages. Each day, I read a random chunk of contiguous pages (sometimes including some pages I've already read). What's the smartest way to store and update the information about "which pages I've read" in a (SQLite) database?
My current idea is to store [firstChunkPage, lastChunkPage] entries in a table, but I'm not sure about how to update this efficiently.
Should I first check for every possible overlap and then update?
Should I just insert my new range and then merge overlapping entries (perhaps multiple times, because multiple overlaps can occur)? I'm not sure how to build such a SQL query.
This looks like a pretty common problem, so I'm wondering if anyone knows a 'recognized' solution for it.
Any help or idea is welcome!
EDIT: The reading isn't actually random; the number of chunks is expected to be pretty much constant and very small compared to the number of pages.
Your idea to store ranges of (firstChunkPage, lastChunkPage) pairs should work if data is relatively sparse.
Unfortunately, queries like the one you mentioned:
SELECT count(*) FROM table
WHERE firstChunkPage <= page AND page <= lastChunkPage
cannot work efficiently unless you use a spatial index.
For SQLite, you should use the R-Tree module, which implements support for this kind of index. Quote:
An R-Tree is a special index that is designed for doing range queries. R-Trees are most commonly used in geospatial systems where each entry is a rectangle with minimum and maximum X and Y coordinates. ... For example, suppose a database records the starting and ending times for a large number of events. A R-Tree is able to quickly find all events, for example, that were active at any time during a given time interval, or all events that started during a particular time interval, or all events that both started and ended within a given time interval.
With R-Tree, you can very quickly identify all overlaps before inserting new range and replace them with new combined entry.
To create your RTree index, use something like this:
CREATE VIRTUAL TABLE demo_index USING rtree(
    id, firstChunkPage, lastChunkPage
);
For more information, read the documentation.
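For illustration, here is one way the identify-overlaps-then-replace step could look from C# with Microsoft.Data.Sqlite (the table and column names follow the demo_index example above; the id scheme is only a placeholder):

using System;
using System.Collections.Generic;
using Microsoft.Data.Sqlite;

public static class PageRanges
{
    public static void AddRange(SqliteConnection conn, long first, long last)
    {
        using var tx = conn.BeginTransaction();

        // Find every stored range that overlaps the new one.
        var find = conn.CreateCommand();
        find.Transaction = tx;
        find.CommandText = "SELECT id, firstChunkPage, lastChunkPage FROM demo_index " +
                           "WHERE firstChunkPage <= $last AND lastChunkPage >= $first";
        find.Parameters.AddWithValue("$first", first);
        find.Parameters.AddWithValue("$last", last);

        var overlappingIds = new List<long>();
        using (var reader = find.ExecuteReader())
        {
            while (reader.Read())
            {
                overlappingIds.Add(reader.GetInt64(0));
                first = Math.Min(first, (long)reader.GetDouble(1)); // widen the new range to cover the overlap
                last = Math.Max(last, (long)reader.GetDouble(2));   // (rtree coordinates come back as floating point)
            }
        }

        // Delete the overlapped rows and insert the single merged range.
        foreach (var id in overlappingIds)
        {
            var del = conn.CreateCommand();
            del.Transaction = tx;
            del.CommandText = "DELETE FROM demo_index WHERE id = $id";
            del.Parameters.AddWithValue("$id", id);
            del.ExecuteNonQuery();
        }

        var insert = conn.CreateCommand();
        insert.Transaction = tx;
        insert.CommandText = "INSERT INTO demo_index (id, firstChunkPage, lastChunkPage) " +
                             "VALUES ($id, $first, $last)";
        insert.Parameters.AddWithValue("$id", DateTime.UtcNow.Ticks); // placeholder id scheme
        insert.Parameters.AddWithValue("$first", first);
        insert.Parameters.AddWithValue("$last", last);
        insert.ExecuteNonQuery();

        tx.Commit();
    }
}

Note that adjacent-but-not-overlapping ranges (e.g. 10-20 and 21-30) are not merged by this condition; widen the WHERE clause by one page on each side if you want that behaviour too.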

how to check if a value is present in a very big data record or a big list efficiently

Hi guys, I have this doubt...
If I have a record of username and password details for logging in to a website, I'll most probably get the username and password from the form and check whether the given username is present in the database using a Contains() Boolean operation, and if it is, check that the password is the same as the one saved in the database.
But for websites like Gmail and Facebook there are millions of records, and the authentication is very quick.
How do they do it? What method do they follow?
How do they check whether a value is present in such a large record set that quickly?
Does the process just involve adding more servers for processing speed?
Thanks for the answers.
Sorry, I posted this question without knowing about indexes. (I just came to know that by creating indexes on one or more columns, the full table scan is minimized and the index path is used instead, which is a less costly and more efficient operation.)
You just need one SQL query:
select 1 from user u
where u.login = :theEnteredLogin
and u.hashed_password = :theHashedEnteredPassword
(where :xxx are parameters of the query).
If you have an index on the login column, or even better on (login, hashed_password), the query should not take more than a few milliseconds to execute.
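For completeness, a sketch of running that check from C# with ADO.NET parameters (System.Data.SqlClient is just one option; the table and column names follow the query above, and the password is assumed to be hashed the same way it was when stored):

using System.Data.SqlClient;

public static bool CredentialsValid(SqlConnection connection, string login, string hashedPassword)
{
    using var cmd = new SqlCommand(
        "SELECT 1 FROM [user] u WHERE u.login = @login AND u.hashed_password = @hash",
        connection);
    cmd.Parameters.AddWithValue("@login", login);
    cmd.Parameters.AddWithValue("@hash", hashedPassword);
    return cmd.ExecuteScalar() != null;   // a row comes back only if the pair matches
}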
Well, they have lots of servers and high-performance databases. At a low level, the table is probably indexed by a hash for fast lookup, binary-search style.
For medium to large data sets, indexing, combined with proper sizing of disk, memory and CPUs, is the most widely adopted approach.
For very large data sets, the database can be distributed and the data partitioned.
For very, very large data sets, aside from the above scenarios, the technologies used usually involve the map-reduce model.

How to generate a transaction number?

I was thinking of formatting it like this
TYYYYMMDDNNNNNNNNNNX
(1 character + 19 digits)
Where
T is type
YYYY is year
MM is month
DD is day
N is sequential number
X is check digit
The problem is, how do I generate the sequential number? My primary key is not an auto-increment integer value; if it were, I would use that, but it's not.
EDIT: Can I have the sequential number reset itself after 1 day (24 hours)?
P201012080000000001X <-- first transaction of 2010/12/08
P201012080000000002X <-- second transaction of 2010/12/08
P201012090000000001X <-- first transaction of 2010/12/09
(X is the check digit)
The question is meaningless without context. Others have commented on your question; please answer the comments. What is the "transaction number" for, where is it used, and what is the "transaction" that you need an external identifier for?
Identity or auto-increment columns may have some use internally, but they are quite useless outside the database.
If we had the full schema, knowing which components are PKs that will not change, etc, we could provide a more meaningful answer.
At first glance, without the info requested, I see no point in recording the date in the transaction number (the date is already stored in the transaction row).
You seem to have the formula for your transaction number; the only question you really have is how to generate a sequence number that resets each day.
You can consider the following options:
Use a database sequence and a scheduled job that resets it.
Use a sequence from outside the database (for instance, a file or memory structure).
With the proper isolation level, you should be able to include the (SELECT (MAX(Seq) + 1) FROM Table WHERE DateCol = CURRENT_DATE) as a value expression in your INSERT statement.
Also note that there's probably no real reason to actually store the transaction number in the database as it's easy to derive it from the information it encodes. All you need to store is the sequential number.
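For illustration, a minimal sketch of deriving the display number from the stored fields (the check digit here uses Luhn purely as a placeholder, since the question does not specify a scheme):

using System;

public static class TransactionNumber
{
    // Builds TYYYYMMDDNNNNNNNNNNX from the stored type, date and daily sequence number.
    public static string Format(char type, DateTime date, long dailySequence)
    {
        var body = $"{type}{date:yyyyMMdd}{dailySequence:D10}";
        return body + CheckDigit(body);
    }

    // Placeholder check digit: Luhn over the 18 digits (skips the leading type character).
    private static char CheckDigit(string body)
    {
        int sum = 0, pos = 0;
        for (int i = body.Length - 1; i >= 1; i--)
        {
            int d = body[i] - '0';
            if (pos++ % 2 == 0) { d *= 2; if (d > 9) d -= 9; }
            sum += d;
        }
        return (char)('0' + (10 - sum % 10) % 10);
    }
}

// Usage: TransactionNumber.Format('P', new DateTime(2010, 12, 8), 1)
//        => "P201012080000000001" plus the check digit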
You can track the auto-incs separately.
Or, as you get ready to add a new transaction, first poll the DB for the newest transaction, break that apart to find the number, and increase it.
Or add an auto-inc field, but don't use it as a key.
You can use a UUID generator so that you don't have to worry about a sequence, and you are sure not to have collisions between transactions.
For example, in Java:
java.util.UUID.randomUUID()
05f4c168-083a-4107-84ef-10346fad6f58
5fb202f1-5d2a-4d59-bbeb-5bcabd513520
31836df6-d4ee-457b-a47a-d491d5960530
3aaaa3c2-c1a0-4978-9ca8-be1c7a0798cf
In PHP:
echo uniqid()
4d00fe31232b6
4d00fe4eeefc2
4d00fe575c262
There is a UUID generator in nearly every language.
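In C#, the equivalent is Guid.NewGuid(), which produces a random (version 4) UUID like the Java examples above:

var transactionId = Guid.NewGuid().ToString();   // a 36-character string such as the ones shown above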
A primary key that big is a very, very bad idea. You will waste huge amounts of table space unnecessarily and make your table very slow to query and manage. Make your primary key a small, simple incrementing int and store the transaction date in a separate field. When necessary, you can select a transaction number for that day in a query with:
SELECT ROW_NUMBER() OVER (PARTITION BY TxnDate ORDER BY TxnID), TxnDate, ...
Please read this regarding good primary key selection criteria. http://www.sqlskills.com/BLOGS/KIMBERLY/category/Indexes.aspx
