I am trying to retrieve the newest row in the Primary Minute Metrics table that Azure creates automatically. Is there any way to do this without scanning through the whole table? The partition key is basically the timestamp in a different format. For example:
20150811T1250
However, there is no way for me to tell what the latest PartitionKey is, so I can't just query by partition. Also, the RowKey is useless, since all the rows have the same RowKey. I am completely stumped on how I would do this, even though it seems like a really basic thing to do. Any ideas?
An example of a few partition keys of rows in the table:
20150813T0623
20150813T0629
20150813T0632
20150813T0637
20150813T0641
20150813T0646
20150813T0650
20150813T0654
EDIT: As a follow-up question: is there a way to scan the table backwards? That would allow me to just take the first row scanned, since that would be the latest row.
When it comes to querying data, Azure Tables offer very limited choices. Given that you know how the PartitionKey gets assigned (YYYYMMDDTHHmm format), one possible solution would be to query from the current date/time (in UTC) minus some offset up to the current date/time and go from there.
For example, assume the current time is 03-Dec-2015 00:00:00. You could try to fetch data from 02-Dec-2015 23:00:00 to 03-Dec-2015 00:00:00 and see if any records are returned. If records are returned, simply take the last entry in the result set; that is your latest entry. If no records are found, move back by 1 hour (i.e., from 02-Dec-2015 22:00:00 to 02-Dec-2015 23:00:00), fetch again, and repeat until you find a matching result.
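As an illustration, here is a minimal sketch of that back-off loop, assuming the legacy WindowsAzure.Storage SDK; the metrics table name ($MetricsMinutePrimaryTransactionsBlob) and the 24-hour give-up limit are assumptions you would adjust for your account and metrics type.

using System;
using System.Linq;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

static DynamicTableEntity FindLatestMetricRow(string connectionString)
{
    var account = CloudStorageAccount.Parse(connectionString);
    var table = account.CreateCloudTableClient()
                       .GetTableReference("$MetricsMinutePrimaryTransactionsBlob");

    var windowEnd = DateTime.UtcNow;
    for (int attempt = 0; attempt < 24; attempt++)  // give up after searching 24 hours back
    {
        var windowStart = windowEnd.AddHours(-1);
        string filter = TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("PartitionKey",
                QueryComparisons.GreaterThanOrEqual, windowStart.ToString("yyyyMMdd'T'HHmm")),
            TableOperators.And,
            TableQuery.GenerateFilterCondition("PartitionKey",
                QueryComparisons.LessThanOrEqual, windowEnd.ToString("yyyyMMdd'T'HHmm")));

        var rows = table.ExecuteQuery(new TableQuery<DynamicTableEntity>().Where(filter)).ToList();
        if (rows.Count > 0)
            return rows.Last();  // results come back sorted by PartitionKey, so last = newest

        windowEnd = windowStart;  // nothing in this window; slide back another hour
    }
    return null;  // no data found in the last 24 hours
}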
Yet another idea (though a bit convoluted) is to create another table and periodically copy the data from the main table into it. When you copy the data, take the PartitionKey value, create a date/time object out of it, and subtract that from DateTime.MaxValue. Compute the ticks for this new value and use that as the PartitionKey for your new entity (you would need to convert the ticks to a string and left-pad it so that all values are the same length). Now the latest entries will always be at the top.
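A quick sketch of that key inversion, assuming the PartitionKey format shown in the question:

using System;
using System.Globalization;

// Parse the original PartitionKey (e.g. "20150813T0623") back into a DateTime.
DateTime parsed = DateTime.ParseExact("20150813T0623", "yyyyMMdd'T'HHmm",
    CultureInfo.InvariantCulture);

// Subtract from DateTime.MaxValue so that newer times produce smaller keys, then
// left-pad to a fixed 19 digits so the strings sort in the same order as the numbers.
long invertedTicks = DateTime.MaxValue.Ticks - parsed.Ticks;
string newPartitionKey = invertedTicks.ToString("D19");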
I've developed a .NET Core application for live streaming, which has a lot of functionalities. One of those is to show our clients how many people were watching in every 5-minute interval.
For now, I'm saving a log in a SQL Server database for each viewer, with ViewerID and TimeStamp, at a 5-minute interval. It seems to be a bad approach, since in the first couple of days I've already reached 100k rows in that table. I need that data, because we have a "Time Peek Chart" that shows how many people were watching, and who, in each 5-minute interval.
Anyway, does anyone have a suggestion for how I can handle this? I was thinking about a .txt file with the same data, but it also seems that I/O on the server could be a problem...
I also thought about a NoSQL database, maybe an existing MongoDB-as-a-service like scalegrid.io or mlab.com.
Can someone help me with this, please? Thanks in advance!
I presume this is related to one of your previous questions, Filter SQL GROUP by a filter that is not in GROUP, and is an expansion of the question in the comments, 'how to make this better'.
The answer below is definitely not the only way to do this, but I think it's a good start.
As you're using SQL Server for the initial (minute-by-minute) data storage, I would suggest continuing to use SQL Server for the next stage of data storage. I think you'd need a compelling argument to use something else for the next stage, as you would then need to maintain both systems (e.g., keeping software up to date, backups, etc.), as well as having all the fun of transferring data properly between the two pieces of software.
My suggested approach is to keep the most detailed/granular data that you need, but no more.
In the previous question, you were keeping data by the minute, then calculating up to the 5-minute bracket. In this answer I'd summarise (and store) the data for the 5-minute brackets then discard your minute-by-minute data once it has been summarised.
For example, you could have a table called 'StreamViewerHistory' that has the Viewer's ID and a timestamp (much like the original table).
This only has one row per viewer per 5-minute interval. You could make the timestamp field a smalldatetime (as you don't care about seconds), or even have it be an ID value pointing to another table that references each timeframe. I think smalldatetime is easier to start with.
Depending exactly on how it's used, I would suggest having the Primary Key (or at least the clustered index) put the timestamp before the ViewerID; this means new rows get added to the end. It also assumes that most queries filter the data by timeframe first (e.g., last week's worth of data).
I would consider having an index on ViewerId then the timestamp, for when people want to view an individual's history.
e.g.,
CREATE TABLE [dbo].[StreamViewerHistory](
    [TrackDate] smalldatetime NOT NULL,
    [StreamViewerID] int NOT NULL,
    CONSTRAINT [PK_StreamViewerHistory] PRIMARY KEY CLUSTERED
    (
        [TrackDate] ASC,
        [StreamViewerID] ASC
    )
)
GO
CREATE NONCLUSTERED INDEX [IX_StreamViewerHistory_StreamViewerID] ON [dbo].[StreamViewerHistory]
(
    [StreamViewerID] ASC,
    [TrackDate] ASC
)
GO
Now, on some sort of interval (either as part of your ping process, or a separate process run regularly), interrogate the data in your source table LiveStreamViewerTracks, crunch the data as per the previous question, and save the results in this new table. Then delete the processed rows from LiveStreamViewerTracks to keep it small and usable. Ensure you delete only the relevant rows though (i.e., the ones that have actually been summarised).
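A rough sketch of that crunch step in C#, assuming Microsoft.Data.SqlClient and the ViewerID/TimeStamp columns described in the question; pick a @cutoff that falls on a 5-minute boundary so no bracket is ever summarised twice:

using System;
using Microsoft.Data.SqlClient;

static void CrunchViewerHistory(string connectionString, DateTime cutoff)
{
    const string crunchSql = @"
BEGIN TRANSACTION;

-- One row per viewer per 5-minute bracket (TimeStamp rounded down to 5 minutes).
INSERT INTO StreamViewerHistory (TrackDate, StreamViewerID)
SELECT DISTINCT
       DATEADD(MINUTE, (DATEDIFF(MINUTE, 0, [TimeStamp]) / 5) * 5, 0),
       ViewerID
FROM   LiveStreamViewerTracks
WHERE  [TimeStamp] < @cutoff;

-- Delete only the raw rows that have just been summarised.
DELETE FROM LiveStreamViewerTracks
WHERE  [TimeStamp] < @cutoff;

COMMIT TRANSACTION;";

    using var connection = new SqlConnection(connectionString);
    connection.Open();
    using var command = new SqlCommand(crunchSql, connection);
    command.Parameters.AddWithValue("@cutoff", cutoff);
    command.ExecuteNonQuery();
}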
The advantage of the above process is that the data in this new table is very usable by SQL Server. Whenever you need a graph (e.g., of the last 14 days), it doesn't need to read the whole table; instead it just starts at the relevant day and reads only the relevant rows. Do make sure your queries are SARGable though, e.g.,
-- This is SARGable and can use the index
SELECT TrackDate, StreamViewerID
FROM StreamViewerHistory
WHERE TrackDate >= '20201001'
-- These are non-SARGable and will read the whole table
SELECT TrackDate, StreamViewerID
FROM StreamViewerHistory
WHERE CAST(TrackDate as date) >= '20201001'
SELECT TrackDate, StreamViewerID
FROM StreamViewerHistory
WHERE DATEDIFF(day, TrackDate, '20201001') <= 0
Typically, if you want counts of users for every 5 minutes within a given timeframe, you'd have something like
SELECT TrackDate, COUNT(*) AS NumViewers
FROM StreamViewerHistory
WHERE TrackDate >= '20201001 00:00:00' AND TrackDate < '20201002 00:00:00'
GROUP BY TrackDate
This should be good enough for quite a while. If your views etc. do slow down a lot, you could consider other things to help, e.g., further calculations/other reporting tables. For example, you could also have a table with TrackDate and NumViewers, with one row per TrackDate. This would be very fast when reporting the overall number of viewers, but would not allow you to drill down to a specific user.
I want to emphasize that I'm looking for ideas, not necessarily a concrete answer since it's difficult to show what my queries look like, but I don't believe that's needed.
The process looks like this:
Table A keeps filling up like a bucket: a SQL job keeps calling SP_Proc1 every minute or less, and it inserts multiple records into table A.
At the same time, a C# process keeps calling another procedure, SP_Proc2, every minute or less, which does an ordered TOP 5 select from table A and returns the results to the C# method. After the C# code finishes processing the results, it deletes the selected 5 records from table A.
The ordered TOP 5 select is the problematic part. It is necessary that the records from table A be processed 5 at a time in the specified order, but a few times a month SP_Proc2 selects the ordered TOP 5 records in the wrong order, even though all the records are present in table A and have correct values in the columns used for ordering.
Something to note:
I'm ordering by integers, not varchar.
The C# part is using 1 thread.
Both SP_Proc1 and SP_Proc2 use a transaction with the READ COMMITTED or READ COMMITTED SNAPSHOT transaction isolation level.
One column that is used for ordering is a computed value, but a very simple one. It just checks if another column in table A is not null and sets the computed column to either 1 or 0.
There's a unique nonclustered index on the primary key Id, and a clustered index composed of the same columns used for ordering in SP_Proc2.
I'm using SQL Server 2012 (v11.0.3000)
I'm beginning to think that this might be a SQL bug, or that maybe the records or index in table A get corrupted and then deleted by the C# process, and that's why I can't catch it.
Edit:
To clarify: SP_Proc1 commits a big batch of N records to table A at once, and SP_Proc2 pulls the records from table A in batches of 5; it orders the records in the table and selects the TOP 5. Sometimes the wrong batch is selected: the batch itself is ordered correctly, but a different batch was supposed to be selected according to the ORDER BY. I believe Rob Farley might have the right idea.
My guess is that your “out of order TOP 5” is ordered, but that a later five overlaps. Like, one time you get 1231, 1232, 1233, 1234, and 1236, and the next batch is 1235, 1237, and so on.
This can be an issue with locking and blocking. You’ve indicated your processes use transactions, so it wouldn’t surprise me if your 1235 hasn’t been committed yet, but can just be ignored by your snapshot isolation, and your 1236 can get picked up.
It doesn’t sound like there’s a bug here. What I’m describing above is a definite feature of snapshot isolation. If you must have 1235 picked up in an earlier batch than 1236, then don’t use snapshot isolation, and force your table to be locked until each block of inserts is finished.
An alternative suggestion would be to use a table lock (TABLOCK) in the reading and writing procedures.
Though this is expensive, if you desire absolute consistency then this may be the way to go.
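For illustration, a sketch of the hinted select the reading procedure could run; TableA and the sort columns are placeholders for your schema:

const string readBatchSql = @"
BEGIN TRANSACTION;

-- TABLOCKX takes an exclusive table lock; HOLDLOCK keeps it until COMMIT,
-- so a half-committed batch of inserts can never be read around.
SELECT TOP (5) *
FROM   TableA WITH (TABLOCKX, HOLDLOCK)
ORDER  BY ComputedCol, OtherSortCol;

-- ... process and delete the selected 5 rows inside the same transaction ...

COMMIT TRANSACTION;";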
There are two data centers with 3 nodes each. I'm doing two simple inserts (very fast, back to back) to the same table with a consistency level of local quorum. The table has one partition key and no clustering columns.
Sometimes the first insert wins over the second one: the data produced by the first insert statement is what gets saved in the database, even though I do an insert right after it.
C# Code
var statement = new SimpleStatement("INSERT INTO customer (id, name) VALUES (1, 'foo')");
statement.SetConsistencyLevel(ConsistencyLevel.LocalQuorum);
session.Execute(statement);
Set the timestamp on the client. In most newer drivers this is done automatically, to better ensure order is preserved. However, with older drivers, or pre-Cassandra 2.1, it's not supported and needs to be in the query. I don't know what driver or version you are using, but you can also put it in the CQL. It's supported at the protocol level, though, so the driver should have a better mechanism.
Something like: var statement = "INSERT INTO customer (id,name) VALUES (1, 'foo') USING TIMESTAMP {microsecond timestamp}";
The best approach is to use a monotonic timestamp so that each call is always higher than the last (i.e., use the current milliseconds and add a counter). I don't know C# well enough to tell you how best to approach that. Look at https://docs.datastax.com/en/developer/csharp-driver/3.3/features/query-timestamps/#using-a-timestamp-generator
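For illustration, with the DataStax C# driver 3.x the built-in monotonic generator can be plugged in when building the cluster; the contact point and keyspace below are placeholders:

using Cassandra;

var cluster = Cluster.Builder()
    .AddContactPoint("127.0.0.1")  // placeholder contact point
    .WithTimestampGenerator(new AtomicMonotonicTimestampGenerator())
    .Build();
var session = cluster.Connect("my_keyspace");  // placeholder keyspace

// Every statement now gets a client-side timestamp that is strictly increasing,
// so back-to-back inserts can no longer be reordered by the server.
session.Execute(new SimpleStatement("INSERT INTO customer (id, name) VALUES (1, 'foo')")
    .SetConsistencyLevel(ConsistencyLevel.LocalQuorum));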
If you don't have a timestamp set on the mutation, the coordinator will assign one after it parses the query. Since networks and netty queues can do funny things, order is not a sure thing, especially as the writes may end up on different nodes that have some clock drift.
I have large tables containing DateTime columns that store the exact time of events and actions in my application. In some cases, users can enter the date of an event.
I want to validate event sequences, find events by the time of happening, and such things.
If I order events by DateTime, it's time-consuming on large data. If I order by Id, there's no guarantee that the users' data entry is ordered; also, users are not responsible for determining the sequence (they just enter a date and time). I would prefer to order by a numeric field instead of a DateTime.
What do you suggest?
Ordering by DateTime columns should not be slow, even on large data, provided your database has that column indexed.
I would, personally, do the ordering directly on the DateTime (with an index on the db), but make sure that your LINQ queries limit the results to the appropriate date window.
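Something like this illustrative LINQ shape (the db context, Events set, and EventTime property are hypothetical names):

// Filter to the date window first, then order; with an index on the
// DateTime column this becomes an index seek plus an ordered range scan.
var cutoff = DateTime.UtcNow.AddDays(-7);
var recentEvents = db.Events
    .Where(e => e.EventTime >= cutoff)
    .OrderBy(e => e.EventTime)
    .ToList();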
Keep in mind that a datetime will be stored as an integer number of ticks since a fixed point in time (generally Jan 1st, 1970). Comparing datetimes is just an integer comparison of this single value; it doesn't need to compare year, month, day, etc. That is, unless you're storing the date as a string rather than as a datetime.
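For example, in C#:

var a = new DateTime(2015, 8, 13, 6, 23, 0);
var b = new DateTime(2015, 8, 13, 6, 29, 0);
Console.WriteLine(a.Ticks < b.Ticks);  // True: one integer comparison
Console.WriteLine(a < b);              // True: the same comparison under the hood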
My guess is that your database is internally storing the data sorted by ID, and that the ID is also indexed, which is why that's so quick. Your problem isn't sorting on a datetime; it's simply sorting on a non-ID column. As Reed suggested, you probably just need to index the column. It's also possible that you're doing something, somewhere, in a way that you shouldn't. It's hard to say what that might be without seeing the code, the DB configuration, etc.
I was thinking of formatting it like this:
TYYYYMMDDNNNNNNNNNNX
(1 character + 19 digits)
Where
T is type
YYYY is year
MM is month
DD is day
N is the sequential number
X is check digit
The problem is, how do I generate the sequential number? My primary key is not an auto-increment integer value; if it were, I would use that, but it's not.
EDIT: Can I have the sequential number reset itself after 1 day (24 hours)?
P201012080000000001X <-- first transaction of 2010/12/08
P201012080000000002X <-- second transaction of 2010/12/08
P201012090000000001X <-- first transaction of 2010/12/09
(X is the check digit)
The question is meaningless without context. Others have commented on your question; please answer the comments. What is the "transaction number" for? Where is it used? What is the "transaction" that you need an external identifier for?
Identity or auto-increment columns may have some use internally, but they are quite useless outside the database.
If we had the full schema, knowing which components are PKs that will not change, etc., we could provide a more meaningful answer.
At first glance, without the info requested, I see no point in recording the date in the "transaction number" (the date is already stored in the transaction row).
You seem to have the formula for your transaction number, the only question you really have is how to generate a sequence number that resets each day.
You can consider the following options:
Use a database sequence and a scheduled job that resets it.
Use a sequence from outside the database (for instance, a file or memory structure).
With the proper isolation level, you should be able to include the (SELECT (MAX(Seq) + 1) FROM Table WHERE DateCol = CURRENT_DATE) as a value expression in your INSERT statement.
Also note that there's probably no real reason to actually store the transaction number in the database as it's easy to derive it from the information it encodes. All you need to store is the sequential number.
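For instance, a hypothetical C# formatter that rebuilds the display number from the stored pieces (the check-digit routine is a simple mod-10 stand-in; substitute whatever scheme you actually use):

static string FormatTransactionNumber(char type, DateTime date, long dailySeq)
{
    string body = $"{type}{date:yyyyMMdd}{dailySeq:D10}";  // e.g. "P" + "20101208" + "0000000001"
    return body + ComputeCheckDigit(body);
}

// Stand-in check digit: sum of the digit characters, mod 10.
static char ComputeCheckDigit(string body)
{
    int sum = 0;
    foreach (char c in body)
        if (char.IsDigit(c)) sum += c - '0';
    return (char)('0' + sum % 10);
}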
You can track the auto-incs separately.
Or, as you get ready to add a new transaction, first poll the DB for the newest transaction, break it apart to find the number, and increase that.
Or add an auto-inc field, but don't use it as a key.
You can use a UUID generator so that you don't have to worry about a sequence, and you are sure not to have collisions between transactions.
For example, in Java:
java.util.UUID.randomUUID()
05f4c168-083a-4107-84ef-10346fad6f58
5fb202f1-5d2a-4d59-bbeb-5bcabd513520
31836df6-d4ee-457b-a47a-d491d5960530
3aaaa3c2-c1a0-4978-9ca8-be1c7a0798cf
In PHP:
echo uniqid()
4d00fe31232b6
4d00fe4eeefc2
4d00fe575c262
There is a UUID generator in nearly all languages.
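In C#, for example:

using System;

Console.WriteLine(Guid.NewGuid());  // e.g. 0f8fad5b-d9cb-469f-a165-70867728950e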
A primary key that big is a very, very bad idea. You will waste huge amounts of table space unnecessarily and make your table very slow to query and manage. Make your primary key a small, simple incrementing int and store the transaction date in a separate field. When necessary, you can select a transaction number for that day in a query with:
SELECT ROW_NUMBER() OVER (PARTITION BY TxnDate ORDER BY TxnID) AS TxnNumber, TxnDate, ...
Please read this regarding good primary key selection criteria. http://www.sqlskills.com/BLOGS/KIMBERLY/category/Indexes.aspx