How to generate a transaction number? - c#

I was thinking of formatting it like this
TYYYYMMDDNNNNNNNNNNX
(1 character + 19 digits)
Where
T is type
YYYY is year
MM is month
DD is day
N is sequential number
X is check digit
The problem is, how do I generate the sequential number? If my primary key were an auto-increment integer I would use that, but it's not.
EDIT: Can the sequential number reset itself after 1 day (24 hours)?
P201012080000000001X <-- first transaction of 2010/12/08
P201012080000000002X <-- second transaction of 2010/12/08
P201012090000000001X <-- first transaction of 2010/12/09
(X is the check digit)

The question is meaningless without context. Others have commented on your question; please answer the comments. What is the "transaction number" for, where is it used, and what is the "transaction" that you need an external identifier for?
Identity or auto-increment columns may have some use internally, but they are quite useless outside the database.
If we had the full schema, knowing which components are PKs that will not change, etc, we could provide a more meaningful answer.
At first glance, without the info requested, I see no point in recording the date in the "transaction number" (the date is already stored in the transaction row).

You seem to have the formula for your transaction number; the only question you really have is how to generate a sequence number that resets each day.
You can consider the following options:
Use a database sequence and a scheduled job that resets it.
Use a sequence from outside the database (for instance, a file or memory structure).
With the proper isolation level, you should be able to include the (SELECT (MAX(Seq) + 1) FROM Table WHERE DateCol = CURRENT_DATE) as a value expression in your INSERT statement.
Also note that there's probably no real reason to actually store the transaction number in the database as it's easy to derive it from the information it encodes. All you need to store is the sequential number.
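A minimal sketch of the third option above, assuming SQL Server, ADO.NET, and an illustrative Transactions table with TxnDate and Seq columns (the table, column, and connection string names here are assumptions, not your real schema); the daily sequence comes from the database and the display number is derived from it:

using System;
using System.Data.SqlClient;

// Assumed schema: Transactions(TxnDate date, Seq int, Amount decimal, ...).
// The locking hints stop two clients grabbing the same Seq for the same day.
const string sql = @"
INSERT INTO Transactions (TxnDate, Seq, Amount)
OUTPUT INSERTED.Seq
SELECT CAST(GETUTCDATE() AS date),
       ISNULL(MAX(Seq), 0) + 1,
       @amount
FROM   Transactions WITH (UPDLOCK, HOLDLOCK)
WHERE  TxnDate = CAST(GETUTCDATE() AS date);";

var connectionString = "<your connection string>";
int seq;
using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(sql, connection))
{
    command.Parameters.AddWithValue("@amount", 100m);
    connection.Open();
    seq = (int)command.ExecuteScalar();   // the Seq value just inserted
}

// Derive the display number instead of storing it
// (check-digit calculation not shown; 'X' is a placeholder).
string txnNumber = $"P{DateTime.UtcNow:yyyyMMdd}{seq:D10}X";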

You can track the auto-incs separately.
Or, as you get ready to add a new transaction, first poll the DB for the newest transaction, break that apart to find the number, and increase it.
Or add an auto-inc field, but don't use it as a key.
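A rough sketch of that poll-and-parse idea, assuming the 20-character format from the question (T + YYYYMMDD + 10-digit sequence + check digit); GetNewestTransactionNumberForToday() is a hypothetical data-access call, not an existing method:

// Hypothetical call: returns e.g. "P201012080000000001X", or null if there is
// no transaction yet today.
string latest = GetNewestTransactionNumberForToday();

// Characters 9-18 (0-based) hold the 10-digit sequence.
long nextSeq = latest == null
    ? 1
    : long.Parse(latest.Substring(9, 10)) + 1;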

You can use a UUID generator so that you don't have to worry about a sequence and you are sure not to have collisions between transactions.
e.g.,
in Java:
java.util.UUID.randomUUID()
05f4c168-083a-4107-84ef-10346fad6f58
5fb202f1-5d2a-4d59-bbeb-5bcabd513520
31836df6-d4ee-457b-a47a-d491d5960530
3aaaa3c2-c1a0-4978-9ca8-be1c7a0798cf
in PHP:
echo uniqid()
4d00fe31232b6
4d00fe4eeefc2
4d00fe575c262
There is a UUID generator in nearly all languages.
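Since the question is tagged C#, the equivalent there is Guid.NewGuid():

using System;

var id = Guid.NewGuid();            // e.g. 0f8fad5b-d9cb-469f-a165-70867728950e
string compact = id.ToString("N");  // the same value without dashes, if you prefer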

A primary key that big is a very, very bad idea. You will waste huge amounts of table space unnecessarily and make your table very slow to query and manage. Make your primary key a small, simple incrementing int and store the transaction date in a separate field. When necessary, you can select a transaction number for that day in a query with:
SELECT ROW_NUMBER() OVER (PARTITION BY TxnDate ORDER BY TxnID), TxnDate, ...
Please read this regarding good primary key selection criteria: http://www.sqlskills.com/BLOGS/KIMBERLY/category/Indexes.aspx

Related

Manage a huge live stream viewer log

I've developed a .NET Core application for live streaming, which has a lot of functionality. One of those is to show our clients how many people were watching in every 5-minute interval.
Right now I'm saving, in a SQL Server database, a log for each viewer with ViewerID and TimeStamp at a 5-minute interval. It seems to be a bad approach, since in the first couple of days I've reached 100k rows in that table. I need that data, because we have a "Time Peek Chart" that shows how many people, and who, was watching in a 5-minute interval.
Anyway, does anyone have a suggestion of how I can handle this? I was thinking about a .txt file with the same data, but it also seems that I/O on the server could be a problem...
I also thought about a NoSQL database, maybe using an existing MongoDB-as-a-service like scalegrid.io or mlab.com.
Can someone help me with this, please? Thanks in advance!
I presume this is related to one of your previous questions, Filter SQL GROUP by a filter that is not in GROUP, and is an expansion of the question in the comments, 'how to make this better'.
This answer below is definitely not the only way to do this - but I think it's a good start.
As you're using SQL Server for the initial data storage (minute-by-minute) I would suggest continuing to use SQL Server for the next stage of data storage. I think you'd need a compelling argument to use something else for the next stage, as you then need to maintain both of them (e.g., keeping software up-to-date, backups, etc), as well as having all the fun of transferring data properly between the two pieces of software.
My suggested approach is to keep the most detailed/granular data that you need, but no more.
In the previous question, you were keeping data by the minute, then calculating up to the 5-minute bracket. In this answer I'd summarise (and store) the data for the 5-minute brackets then discard your minute-by-minute data once it has been summarised.
For example, you could have a table called 'StreamViewerHistory' that has the Viewer's ID and a timestamp (much like the original table).
This only has 1 row per viewer per 5 minute interval. You could make the timestamp field a smalldatetime (as you don't care about seconds) or even have it as an ID value pointing to another table that references each timeframe. I think smalldatetime is easier to start with.
Depending exactly on how it's used, I would suggest having the Primary Key (or at least the Clustered index) being the timestamp before the ViewerID - this means new rows get added to the end. It also assumes that most queries of data are filtered by timeframes first (e.g., last week's worth of data).
I would consider having an index on ViewerId then the timestamp, for when people want to view an individual's history.
e.g.,
CREATE TABLE [dbo].[StreamViewerHistory](
    [TrackDate] smalldatetime NOT NULL,
    [StreamViewerID] int NOT NULL,
    CONSTRAINT [PK_StreamViewerHistory] PRIMARY KEY CLUSTERED
    (
        [TrackDate] ASC,
        [StreamViewerID] ASC
    )
)
GO
CREATE NONCLUSTERED INDEX [IX_StreamViewerHistory_StreamViewerID] ON [dbo].[StreamViewerHistory]
(
[StreamViewerID] ASC,
[TrackDate] ASC
)
GO
Now, on some sort of interval (either as part of your ping process, or a separate process run regularly) interrogate the data in your source table LiveStreamViewerTracks, crunch the data as per the previous question, and save the results in this new table. Then delete the rows from LiveStreamViewerTracks to keep it smaller and usable. Ensure you delete the relevant rows only though (e.g., the ones that have been processed).
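For illustration, here is a rough sketch of that summarise-then-purge step using plain ADO.NET. The LiveStreamViewerTracks column names (TrackDate, StreamViewerID) and the connection string are assumptions; adjust them to your actual schema:

using System;
using System.Data.SqlClient;

// Both statements run in one transaction, so only rows that have just been
// rolled up into StreamViewerHistory get deleted.
const string summariseSql = @"
BEGIN TRANSACTION;

INSERT INTO StreamViewerHistory (TrackDate, StreamViewerID)
SELECT DISTINCT
       DATEADD(MINUTE, (DATEDIFF(MINUTE, 0, TrackDate) / 5) * 5, 0),  -- floor to the 5-minute bracket
       StreamViewerID
FROM   LiveStreamViewerTracks
WHERE  TrackDate < @cutoff;

DELETE FROM LiveStreamViewerTracks
WHERE  TrackDate < @cutoff;

COMMIT TRANSACTION;";

var nowUtc = DateTime.UtcNow;
// Cut off at the start of the current bracket so a still-filling bracket is left alone.
var cutoff = new DateTime(nowUtc.Year, nowUtc.Month, nowUtc.Day,
                          nowUtc.Hour, (nowUtc.Minute / 5) * 5, 0, DateTimeKind.Utc);

var connectionString = "<your connection string>";
using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(summariseSql, connection))
{
    command.Parameters.AddWithValue("@cutoff", cutoff);
    connection.Open();
    command.ExecuteNonQuery();
}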
The advantage of the above process is that the data in this new table is very usable by SQL Server. Whenever you need a graph (e.g., of the last 14 days) it doesn't need to read the whole table; instead it just starts at the relevant day and only reads the relevant rows. Note: make sure your queries are SARGable though, e.g.,
-- This is SARGable and can use the index
SELECT TrackDate, StreamViewerID
FROM StreamViewerHistory
WHERE TrackDate >= '20201001'
-- These are non-SARGable and will read the whole table
SELECT TrackDate, StreamViewerID
FROM StreamViewerHistory
WHERE CAST(TrackDate as date) >= '20201001'
SELECT TrackDate, StreamViewerID
FROM StreamViewerHistory
WHERE DATEDIFF(day, TrackDate, '20201001') <= 0
Typically, if you want counts of users for every 5 minutes within a given timeframe, you'd have something like
SELECT TrackDate, COUNT(*) AS NumViewers
FROM StreamViewerHistory
WHERE TrackDate >= '20201001 00:00:00' AND TrackDate < '20201002 00:00:00'
GROUP BY TrackDate
This should be good enough for quite a while. If your views etc. do slow down a lot, you could consider other things to help, e.g., further calculations/other reporting tables. For example, you could also have a table with TrackDate and NumViewers, with one row per TrackDate. This should be very fast when reporting the overall number of viewers, but will not allow you to drill down to a specific user.

Why is the first insert winning over the second one in Cassandra?

There are two data centers with 3 nodes each. I'm doing two simple inserts (very fast back to back) to the same table with a consistency level of local quorum. The table has one partitioning key and no clustering columns.
Sometimes the first insert wins over the second one. The data produced by the first insert statement is what gets saved in the database even though I do an insert right after that.
C# Code
var statement = "Insert Into customer (id,name) Values (1, "foo")";
statement.SetConsistencyLevel(ConsistencyLevel.LocalQuorum);
session.Execute(statement);
Set the timestamp on the client. In most new drivers this is done automatically to better ensure ordering is preserved. However, with older drivers or pre-Cassandra 2.1 it's not supported and needs to be in the query. I don't know what driver or version you are using, but you can also put it in the CQL. It's supported at the protocol level though, so the driver should have a better mechanism.
Something like: var statement = "INSERT INTO customer (id,name) VALUES (1, 'foo') USING TIMESTAMP {microsecond timestamp}";
The best approach is to use a monotonic timestamp so that each call is always higher than the last (i.e. use the current milliseconds and add a counter). I don't know C# well enough to tell you how best to approach that. Look at https://docs.datastax.com/en/developer/csharp-driver/3.3/features/query-timestamps/#using-a-timestamp-generator
If you don't have a timestamp set on the mutation, the coordinator will assign one after it parses the query. Since networks and Netty queues can do funny things, order is not a sure thing, especially as the writes can end up on different coordinator nodes that may have some clock drift.
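Since the answer mentions not knowing C#, here is a rough sketch of a client-side monotonic timestamp fed into USING TIMESTAMP, assuming the DataStax C# driver (SimpleStatement/ISession); newer driver versions can also do this for you via a timestamp generator, per the link above:

using System;
using System.Threading;
using Cassandra;   // DataStax C# driver

public static class MonotonicTimestamp
{
    private static readonly DateTime Epoch = new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);
    private static long _last;

    // Microseconds since the Unix epoch, strictly increasing on every call from
    // this process even if the system clock stalls or steps backwards.
    public static long NextMicroseconds()
    {
        while (true)
        {
            long candidate = (DateTime.UtcNow - Epoch).Ticks / 10;   // 1 tick = 100 ns
            long previous = Interlocked.Read(ref _last);
            if (candidate <= previous)
            {
                candidate = previous + 1;
            }
            if (Interlocked.CompareExchange(ref _last, candidate, previous) == previous)
            {
                return candidate;
            }
        }
    }
}

public static class CustomerWriter
{
    public static void Insert(ISession session)
    {
        var statement = new SimpleStatement(
            "INSERT INTO customer (id, name) VALUES (1, 'foo') " +
            $"USING TIMESTAMP {MonotonicTimestamp.NextMicroseconds()}");
        statement.SetConsistencyLevel(ConsistencyLevel.LocalQuorum);
        session.Execute(statement);
    }
}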

Azure table storage querying partitionkey

I am using Azure Table Storage and retrieving data through a timestamp filter. I see the execution is very slow, as the timestamp is not a partition key or row key. I researched on Stack Overflow and found that the timestamp should be converted to ticks and stored in the partition key. I did the same, and while inserting data I took the string below and inserted the tick string into the partition key.
string currentDateTimeTick = ConvertDateTimeToTicks(DateTime.Now.ToUniversalTime()).ToString();
public static long ConvertDateTimeToTicks(DateTime dtInput)
{
long ticks = 0;
ticks = dtInput.Ticks;
return ticks;
}
This is fine up to here. But when I am trying to retrieve the last 5 days' data, I am unable to query the ticks against the partition key. What was my mistake in the code below?
int days = 5;
TableQuery<MyEntity> query = new TableQuery<MyEntity>()
.Where(TableQuery.GenerateFilterConditionForDate("PartitionKey", QueryComparisons.GreaterThanOrEqual, "0"+DateTimeOffset.Now.AddDays(days).Date.Ticks));
Are you sure you want to use ticks as a partition key? This means that every measurable 100 ns instant becomes its own partition. With time-based data you can use the partition key to specify an interval like every hour, minute or even second and then a row key with the actual timestamp.
That problem aside let me show you how to do the query. First let me comment on how you generate the partition key. I suggest you do it like this:
var partitionKey = DateTime.UtcNow.Ticks.ToString("D18");
Don't use DateTime.Now.ToUniversalTime() to get the current UTC time. It will internally use DateTime.UtcNow, then convert it to the local time zone and ToUniversalTime() will convert back to UTC which is just wasteful (and more time consuming than you may think).
And your ConvertDateTimeToTicks() method serves no other purpose than to get the Ticks property so it is just making your code more complex without adding any value.
Here is how to perform the query:
var days = 5;
var partitionKey = DateTime.UtcNow.AddDays(-days).Ticks.ToString("D18");
var query = new TableQuery<MyEntity>().Where(
TableQuery.GenerateFilterCondition(
"PartitionKey",
QueryComparisons.GreaterThanOrEqual,
partitionKey
)
);
The partition key is formatted as an 18 characters string allowing you to use a straightforward comparison.
I suggest that you move the code to generate the partition key (and row key) into a function to make sure that the keys are generated the same way throughout your code.
The reason 18 characters are used is because the Ticks value of a DateTime today as well as many thousands of years in the future uses 18 decimal digits. If you decide to base your partition key on hours, minutes or seconds instead of 100 ns ticks then you can shorten the length of the partition key accordingly.
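For example, here is a minimal sketch of the bucketing idea from the first paragraph (one partition per hour, exact timestamp in the row key), using the classic Microsoft.WindowsAzure.Storage SDK that the question's TableQuery code implies; the entity and property names are illustrative only:

using System;
using Microsoft.WindowsAzure.Storage.Table;

public class MeasurementEntity : TableEntity
{
    public MeasurementEntity() { }   // parameterless constructor required by the table client

    public MeasurementEntity(DateTime utcTimestamp)
    {
        // One partition per hour, e.g. "2020100113" for 2020-10-01 13:00 UTC,
        // so a range query only touches the hours it actually needs.
        PartitionKey = utcTimestamp.ToString("yyyyMMddHH");

        // The exact instant, zero-padded so lexical order equals chronological order.
        RowKey = utcTimestamp.Ticks.ToString("D18");
    }

    public double Value { get; set; }   // hypothetical payload
}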
As Martin suggests, using a timestamp as your partition key is almost certainly not what you want to do.
Partitions are the unit of scale in Azure Table Storage and more or less represent physical segmentation of your data. They're a scalability optimization that allows you to "throw hardware" at the problem of storing more and more data, while maintaining acceptable response times (something which is traditionally hard in data storage). You define the partitions in your data by assigning partition keys to each row. It's almost never desirable that each row lives in its own partition.
In ATS, the row key becomes your unique key within a given partition. So the combination of partition key + row key is the true unique key across the entire ATS table.
There's lots of advice out there for choosing a valid partition key and row key... none of which is generalized. It depends on the nature of your data, your anticipated query patterns, etc.
Choose a partition key that will aggregate your data into a reasonably well-distributed set of "buckets". All things being equal, if you anticipate having 1 million rows in your table, it's often useful to have, say, 10 buckets with 100,000 rows each... or maybe 100 buckets with 10,000 rows each. At query time you'll need to pick the partition(s) you're querying, so the number of buckets may matter to you. "Buckets" often correspond to a natural segmentation concept in your domain... a bucket to represent each US state, or a bucket to represent each department in your company, etc. Note that it's not necessary (or often possible) to have perfectly distributed buckets... get as close as you can, with reasonable effort.
One example of where you might intentionally have an uneven distribution is if you intend to vary query patterns by bucket... bucket A will receive lots of cheap, fast queries, bucket B fewer, more expensive queries, etc. Or perhaps bucket A data will remain static while bucket B data changes frequently. This can be accomplished with multiple tables, too... so there's no "one size fits all" answer.
Given the limited knowledge we have of your problem, I like Martin's advice of using a time span as your partition key. Small spans will result in many partitions, and (among other things) make queries that utilize multiple time spans relatively expensive. Larger spans will result in fewer aggregation costs across spans, but will result in bigger partitions and thus more expensive queries within a partition (it will also make identifying a suitable row key potentially more challenging).
Ultimately you'll likely need to experiment with a few options to find the most suitable one for your data and intended queries.
One other piece of advice... don't be afraid to consider duplicating data in multiple data stores to suit widely varying query types. Not every query will work effectively against a single schema or storage configuration. The effort needed to synchronize data across stores may be less than that needed to bend query technology X to your will.
more on Partition and Row key choices
also here
Best of luck!
One thing that was not mentioned in the answers above is that Azure will detect if you are using sequential, always-increasing or always-decreasing values for your partition key and create "range partitions". Range partitions group entities that have sequential unique PartitionKey values to improve the performance of range queries. Without range partitions, as mentioned above, a range query will need to cross partition boundaries or server boundaries, which can decrease the query performance. Range partitions happen under-the-hood and are decided by Azure, not you.
Now, if you want to do bulk inserts, let's say once a minute, you will still need to flatten out your timestamp partition keys to, say, ticks rounded up to the nearest minute. You can only do bulk inserts with the same partition key.
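As a small sketch of that constraint, an entity group transaction (TableBatchOperation in the classic SDK) only accepts entities that share one partition key, so flattening the keys to the minute lets you batch a whole minute's worth of rows; entitiesForThisMinute and table are assumed to exist in your code:

using Microsoft.WindowsAzure.Storage.Table;

// All entities in 'entitiesForThisMinute' must share the same PartitionKey
// (e.g. ticks rounded to the minute); a batch also tops out at 100 operations.
var batch = new TableBatchOperation();
foreach (var entity in entitiesForThisMinute)
{
    batch.Insert(entity);
}
table.ExecuteBatch(batch);   // 'table' is a CloudTable instance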

How to retrieve newest row in an Azure Table?

I am trying to retrieve the newest row created in the Primary Minute Metrics table that is automatically created by Azure. Is there any way to do this without scanning through the whole table? The partition key is basically the timestamp in a different format. For example:
20150811T1250
However, there is no way for me to tell what the latest partitionkey is, so I can't just query by partition. Also, the row key is useless since all the rows have the same rowkey. I am completely stumped on how I would do this even though it seems like a really basic thing to do. Any ideas?
An example of a few partition keys of rows in the table:
20150813T0623
20150813T0629
20150813T0632
20150813T0637
20150813T0641
20150813T0646
20150813T0650
20150813T0654
EDIT: As a follow-up question, is there a way to scan the table backwards? That would allow me to just get the first row scanned, since that would be the latest row.
When it comes to querying data, Azure Tables offer very limited choices. Given that you know how the PartitionKey gets assigned (YYYYMMDDTHHmm format), one possible solution would be to query from current date/time (in UTC) minus some offset to current date/time and go from there.
For example, assuming start time is 03-Dec-2015 00:00:00. What you could do is try to fetch data from 02-Dec-2015 23:00:00 to 03-Dec-2015 00:00:00 and see if any records are returned. If the records are returned, you can simply take the last entry in the resultset and that would be your latest entry. If no records are found, then you move back by 1 hour (i.e. from 02-Dec-2015 22:00:00 to 02-Dec-2015 23:00:00) and fetch records again and repeat this till the time you find matching result.
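A rough sketch of that back-off search, assuming the classic Microsoft.WindowsAzure.Storage SDK and a CloudTable called table (both assumptions); it walks back one hour at a time until a window returns rows, then takes the last entity in that window:

using System;
using System.Linq;
using Microsoft.WindowsAzure.Storage.Table;

// Beware: this loops forever on a completely empty table, so cap the number of
// iterations in real code.
DateTime windowEnd = DateTime.UtcNow;
DynamicTableEntity newest = null;
while (newest == null)
{
    DateTime windowStart = windowEnd.AddHours(-1);

    var query = new TableQuery<DynamicTableEntity>().Where(
        TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.GreaterThanOrEqual,
                windowStart.ToString("yyyyMMdd'T'HHmm")),
            TableOperators.And,
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.LessThanOrEqual,
                windowEnd.ToString("yyyyMMdd'T'HHmm"))));

    newest = table.ExecuteQuery(query).LastOrDefault();
    windowEnd = windowStart;
}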
Yet another idea (though a bit convoluted) is to create another table and periodically copy the data from the main table to this new table. When you copy the data, what you would need to do is take the PartitionKey value, create a date/time object out of it, and subtract that from DateTime.MaxValue. Calculate the ticks for this new value and use that as the PartitionKey for your new entity (you would need to convert those ticks into a string and do some prepadding so that all values are of the same length). Now the latest entries will always be on the top.
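A small sketch of that "subtract from DateTime.MaxValue" trick (names are illustrative): because the inverted value shrinks as time moves forward, the newest entries sort first and the first row returned is the latest one:

using System;

// Invert the ticks so newer timestamps produce smaller keys; zero-pad so the
// string comparison used by Table Storage matches numeric order.
DateTime utc = DateTime.UtcNow;   // or the value parsed from the original PartitionKey
string descendingKey = (DateTime.MaxValue.Ticks - utc.Ticks).ToString("D19");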

Concurrency issue

We are creating a client-server application using WPF/C# with SQL. We are generating a unique number by checking the DB (to get the last maximum number), incrementing that max value by 1, and storing the value in the DB. When another user is also working on the same screen and creating unique numbers at the same time, in some cases the unique numbers get duplicated and an exception is thrown.
We found this is a concurrency issue.
Indeed, fetching a number out, adding one, and hoping it still isn't in use is a thread-race and a race between multiple clients - and should be avoided.
Options:
use an IDENTITY column in the database, and let the database generate the value itself during INSERT; the database server knows how to do this safely and reliably
if that isn't possible, you might want to delay this code until you are ready to INSERT so it is all part of a single database operation - and even then, if it isn't in a "serializable transaction" (with key-range read locks, etc), then you would have to loop on "get the max, increment, try to insert but note that we might have lost a race, so only insert if the value doesn't exist - which it might; repeat from start if unsuccessful"
alternatively, you could create the new record when you first need the number (even though the rest of the data isn't available), noting that you might still need the "loop until successful" approach
Frankly, the IDENTITY column approach is the simplest.
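A minimal sketch of the IDENTITY option, assuming SQL Server and an illustrative Orders table (names and connection string are placeholders); the database hands out the number atomically, so there is no client-side race to manage:

using System.Data.SqlClient;

// Assumed schema:
//   CREATE TABLE Orders (
//       OrderID   int IDENTITY(1,1) PRIMARY KEY,
//       CreatedAt datetime2 NOT NULL
//   );
const string sql =
    "INSERT INTO Orders (CreatedAt) OUTPUT INSERTED.OrderID VALUES (SYSUTCDATETIME());";

var connectionString = "<your connection string>";
using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(sql, connection))
{
    connection.Open();
    int newNumber = (int)command.ExecuteScalar();   // unique even under concurrent inserts
}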
Finally, we followed the Singleton pattern with a lock to resolve this issue.
Thanks.
