We're in the process of evaluating Cassandra for use with financial time series data and are trying to understand the best way to store and retrieve the data we need in the most performant way. We are running Cassandra on a virtual machine to which 8 cores and 8 GB of RAM have been allocated; the remaining resources of the host machine (another 8 cores and 12 GB of RAM) are used to develop the test client application. Our data is currently stored in flat files and is of the order of 100-150 GB per day (uncompressed). In terms of retrieving the data from Cassandra, we need to be able to stream one of the following:
All of the data - i.e. stream data for all securities for an entire day, ordered by timestamp
All of the data for a particular time period (a subset of the entire day), ordered by timestamp
Data for a subset of the securities over a particular time period (a subset of the entire day), ordered by timestamp.
We have so far experimented with partitioning the data based on security and day with a table that has the following schema:
create table MarketData (
Security text
,Date date
,Timestamp timestamp
...
other columns
...
primary key((Security,Date),timestamp));
However, when we perform a simple paged query from a C# client application, as shown below, it takes roughly 8 seconds to retrieve 50K records, which is very poor. We've experimented with different page sizes, and a page size of approximately 450 seems to give the least bad results.
var ps = client.Session.Prepare("select security, date, timestamp, toUnixTimestamp(timestamp) from marketdata where security = ? and date = ?");
int pageSize = 450;
var statement = ps.Bind("AAPL_O", new LocalDate(2016, 01, 12)).SetPageSize(pageSize);
var stopwatch = new Stopwatch();
stopwatch.Start();
var rowSet = client.Session.Execute(statement);
foreach (Row row in rowSet)
{
    // iterate every row so that all pages are fetched; no per-row work is done here
}
stopwatch.Stop();
Furthermore, this kind of schema would also be problematic when selecting SORTED data across partitions (i.e. for multiple securities), since that involves sorting across partitions, which Cassandra doesn't seem to be well suited to.
We have also considered partitioning based on minute with the following schema:
create table MarketData (
Year int,
Month int,
Day int,
Hour int,
Minute int,
Security text
,Timestamp timestamp
...
other columns
...
primary key((Year,Month,Day,Hour,Minute),timestamp));
However, our concern is that the results of our preliminary test of paging through a straightforward 'select' statement are so poor.
Are we approaching things in the wrong way? Could our configuration be incorrect? Or is Cassandra perhaps not the appropriate big data solution for what we are trying to achieve?
Thanks
".... poor performance...."
"We are running Cassandra on a Virtual machine "
I think those two highlighted phrases are related :). Out of curiosity, what is the nature of your hard drive? Shared storage? A SAN? Spinning disk? SSD? A drive shared with other VMs?
Furthermore, this kind of schema would also be problematic when selecting SORTED data across partitions (i.e. for multiple securities)
Exactly - Cassandra does not sort by partition key. You'll probably need to create another table (or a materialized view, a new Cassandra 3.0 feature) with PRIMARY KEY ((time_period), security, timestamp) so that you can order by security.
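For example, a minimal sketch of that kind of table, created and queried through the DataStax C# driver (the time_period bucket granularity, the table name and the column list are assumptions for illustration, not something given in the question):

// Created once at setup time; the hour-sized bucket is only an assumption -
// pick whatever bucket keeps each partition to a manageable number of rows.
client.Session.Execute(@"
    CREATE TABLE IF NOT EXISTS MarketDataByTime (
        Time_Period text,     -- e.g. '2016-01-12:09'
        Security text,
        Timestamp timestamp,
        PRIMARY KEY ((Time_Period), Security, Timestamp));");

// All securities for one time bucket, ordered by security then timestamp,
// served from a single partition:
var byTime = client.Session.Prepare(
    "SELECT security, timestamp FROM MarketDataByTime WHERE time_period = ?");
var bucketRows = client.Session.Execute(byTime.Bind("2016-01-12:09"));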
Are we approaching things in the wrong way?
Yes - why do you want to run a performance benchmark on a virtual machine? Those two ideas are pretty much at odds. The general recommendation with Cassandra is to use dedicated hard drives (spinning disks at least, preferably SSDs). Cassandra read performance is strongly bound to your disk I/O.
With virtual machines and virtualized storage, you defeat all of Cassandra's optimizations for disk throughput. Writing a sequential block of data to a virtualized disk does not guarantee that the data is actually written sequentially, because the hypervisor/virtual disk controller can reorder it and split it across several blocks on the actual physical disks.
Cassandra deployments on virtual machines are only suited for proof-of-concept work to validate a data model and queries. You'll need dedicated physical hard drives to benchmark the actual performance of your data model with Cassandra.
My table storage has approximately 1-2 million records, and I have a daily job that needs to retrieve all the records that do not have property A and do some further processing.
It is expected that there are about 1-1.5 million records without property A. I understand there are two approaches:
Query all records then filter results after
Do a table scan
Currently, we use the approach where we query all records and filter in C#. However, the task is running in an Azure Function App, and the query to retrieve all the results sometimes takes over 10 minutes, which is the limit for Azure Functions.
I'm trying to understand why retrieving 1 million records is taking so long and how to optimise the query. The existing design of the table is that the partition key and row key are identical and are a GUID - this leads me to believe that there is one entity per partition.
Looking at Microsoft docs, here are some key Table Storage limits (https://learn.microsoft.com/en-us/azure/storage/common/storage-scalability-targets#azure-table-storage-scale-targets):
Maximum request rate per storage account: 20,000 transactions per second, which assumes a 1-KiB entity size
Target throughput for a single table partition (1 KiB-entities): Up to 2,000 entities per second.
My initial guess is that I should use another partition key to group 2,000 entities per partition to achieve the target throughput of 2,000 per second per partition. Would this mean that 2,000,000 records could in theory be returned in 1 second?
Any thoughts or advice appreciated.
I found this question after blogging on the very topic. I have a project where I am using the Azure Functions Consumption plan and have a big Azure Storage Table (3.5 million records).
Here's my blog post:
https://www.joelverhagen.com/blog/2020/12/distributed-scan-of-azure-tables
I have mentioned a couple of options in this blog post but I think the fastest is distributing the "table scan" work into smaller work items that can be easily completed in the 10-minute limit. I have an implementation linked in the blog post if you want to try it out. It will likely take some adapting to your Azure Function but most of the clever part (finding the partition key ranges) is implemented and tested.
This looks to be essentially what user3603467 is suggesting in his answer.
I see two approaches to retrieving that many records in a batch process where the result must be saved to a single medium - like a file.
First: identify/select all the primary ids/keys of the related data. Then spawn parallel jobs with chunks of these primary ids/keys, each of which reads the actual data and processes it. Each job then reports its result to the single medium.
Second: identify/select (for update) the top n rows of related data and mark them with a "being processed" state. Use concurrency locking here; that should prevent others from picking the same data up if this is done in parallel.
I would go for the first solution if possible, since it is the simplest and cleanest. The second solution is best if you can use "select for update"; I don't know whether that is supported on Azure Table Storage.
You'll need to parallelize the task. As you don't know the partition keys, run separate range queries, one per leading character of the partition key (e.g. per letter of the alphabet): one query where PK >= 'A' && PK < 'B', another where PK >= 'B' && PK < 'C', and so on. Then join the results in memory. This is super easy to do in a single function - in JS you would just use Promise.all([]).
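In C# (the language of the function in the question), the analogue of Promise.all is Task.WhenAll. A rough sketch of the same idea, assuming the Azure.Data.Tables SDK and a hypothetical table name, and splitting the GUID partition keys on their leading hex character rather than on letters:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Azure.Data.Tables;

class ParallelRangeScan
{
    static async Task<List<TableEntity>> ScanAsync(string connectionString)
    {
        var table = new TableClient(connectionString, "MyTable"); // hypothetical table name

        // GUID strings start with 0-9 or a-f, so split the key space on the first hex character;
        // 'g' only serves as the exclusive upper bound of the last range.
        var bounds = "0123456789abcdefg".ToCharArray();

        var queries = new List<Task<List<TableEntity>>>();
        for (int i = 0; i < bounds.Length - 1; i++)
        {
            string filter = $"PartitionKey ge '{bounds[i]}' and PartitionKey lt '{bounds[i + 1]}'";
            queries.Add(Task.Run(() => table.Query<TableEntity>(filter).ToList()));
        }

        // Run the range queries concurrently and join the results in memory.
        var results = await Task.WhenAll(queries);
        return results.SelectMany(r => r).ToList();
    }
}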
We're seeing some very variable latencies when querying our Azure Table Storage data. We have a number of items each fetching time series data which is broken up by day as follows:
Partition key: {DATA_TYPE}_{yyyyMMdd} - 4 different data types with about 2 years of data in total
Row key: {DataObjectId} - about 3,000-4,000 records per day.
A record itself is a JSON-encoded array of dateTime objects at 15-minute intervals.
I want to retrieve time series data for a specific object for the last few days, so I constructed the following query:
string.Format("(PartitionKey ge '{0}') and (PartitionKey le '{1}') and (RowKey eq '{2}')", lowDate, highDate, DataObjectId);
As noted above, we have records going back 2-3 years now.
On the whole the query time is fairly speedy (600-800 ms). However, occasionally it takes a very long time to retrieve data from these partitions - one or two queries have taken 50+ seconds to return data.
We are not aware of the system being under any dramatic load; in fact, frustratingly, all the graphs we've found in the portal suggest no real problems.
Some suggestions that come to mind:
1.) Add the year component first, making the partition keys immediately more selective.
However the most frustrating thing is the variation in time taken to do the queries.
The Azure Storage latency in the Azure portal is averaging about 117.2 ms, and the maximum reported is 294 ms. I have interpreted this as network latency.
Of course, any suggestions are gratefully received. The most vexing thing is that the execution time is so variable; in a very small number of cases we see our application resorting to continuation tokens because the query has taken over 5 seconds to complete.
https://msdn.microsoft.com/en-us/library/azure/dd179421.aspx
I've been looking at this for a while.
I've not found an answer to why querying across partitions suffered such variable latency. I had assumed that it would work well with the indexes.
However, the solution seems to be to simply request the data from the 6 individual partitions, so that every query takes advantage of both the PartitionKey and RowKey indexing. Once this was implemented, our queries began returning much faster.
I would still like to understand why querying across partitions seemed so slow, but I can only assume the query resulted in a table scan, which has variable latency.
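For what it's worth, a rough C# sketch of that per-partition approach, assuming the Azure.Data.Tables SDK and the {DATA_TYPE}_{yyyyMMdd} / {DataObjectId} key scheme described in the question (names are illustrative):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Azure.Data.Tables;

class PerPartitionReader
{
    // One point query (PartitionKey eq ... and RowKey eq ...) per day in the range,
    // instead of a single PartitionKey range query that spans partitions.
    static async Task<List<TableEntity>> GetRecentDaysAsync(
        TableClient table, string dataType, string dataObjectId, int days)
    {
        var tasks = Enumerable.Range(0, days)
            .Select(offset =>
            {
                string pk = $"{dataType}_{DateTime.UtcNow.AddDays(-offset):yyyyMMdd}";
                string filter = $"PartitionKey eq '{pk}' and RowKey eq '{dataObjectId}'";
                return Task.Run(() => table.Query<TableEntity>(filter).ToList());
            })
            .ToList();

        var perDay = await Task.WhenAll(tasks);
        return perDay.SelectMany(r => r).ToList();
    }
}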
I couldn't decide on the best approach to handle the following scenario with Azure storage.
~1,500+ CSV files, each between ~1 MB and ~500 MB, roughly ~20 GB of data overall
Each file uses exactly the same model, and each model.toString() is ~50 characters / ~400 bytes
Every business day, during a 6-hour period, ~8,000+ new rows arrive per minute
Based on a property value, each row goes into the correct file
Multi-instance writing is not necessary as long as multi-instance reading is supported; even a few seconds of delay for a snapshot period is OK
I would like to use block blobs, but downloading a single ~400 MB file to the machine just to add one line and upload it back doesn't make sense, and I couldn't find another way around it.
There is the Drive option, which uses page blobs; unfortunately it is not supported by SDK v2, which makes me nervous about possible discontinuation of support.
And the final option is Table storage, which looks OK except that continuously reading a few hundred thousand rows may become an issue.
Basically, I would prefer to write the data as soon as I retrieve it, but if it's worth giving that up, I can live with a single update at the end of the day, which means ~300-1,000 lines per file.
What would be the best approach to handle this scenario?
Based on your requirements above, Azure Tables are the optimal option. With a single Azure Storage account you get the following:
Storage Transactions – Up to 20,000 entities/messages/blobs per second
Single Table Partition – a table partition is all of the entities in a table with the same partition key value, and most tables have many partitions. The throughput target for a single partition is:
Up to 2,000 entities per second
Note that this is for a single partition, not a single table. Therefore, a table with good partitioning can process up to a few thousand requests per second (up to the storage account target of 20,000).
Tables – use a more finely grained PartitionKey for the table so that the table partitions can automatically be spread across more servers.
As for reading a "few hundred thousand rows" continuously, your main obstacle is the storage-level limit of 20,000 transactions/sec; however, if you design your partitions granularly enough that they are segmented across hundreds of servers, you should be able to read "hundreds of thousands" of rows in minutes.
Source:
Windows Azure Storage Abstractions and their Scalability Targets
Windows Azure’s Flat Network Storage and 2012 Scalability Targets
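As a rough illustration of a finer-grained PartitionKey combined with batched writes (a sketch only: the entity shape, the instrument-per-minute key format and the table name are assumptions, and it uses the current Azure.Data.Tables SDK rather than the SDK v2 mentioned in the question):

using System;
using System.Collections.Generic;
using System.Linq;
using Azure.Data.Tables;

class MarketRowWriter
{
    // A partition per instrument per minute keeps partitions small and spreads write load,
    // while a reader can still fetch one instrument's rows with single-partition queries.
    static void WriteRows(TableClient table, string instrument,
                          IEnumerable<(DateTime ts, string payload)> rows)
    {
        foreach (var minuteGroup in rows.GroupBy(r => $"{instrument}_{r.ts:yyyyMMddHHmm}"))
        {
            // Simplified: an entity group transaction holds at most 100 entities from one
            // partition, so a real implementation would chunk larger groups into several batches.
            var batch = minuteGroup
                .Take(100)
                .Select(r => new TableTransactionAction(
                    TableTransactionActionType.Add,
                    new TableEntity(minuteGroup.Key, r.ts.ToString("o")) { { "Payload", r.payload } }))
                .ToList();

            table.SubmitTransaction(batch);
        }
    }
}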
We have an application (written in C#) that stores live stock market prices in a database (SQL Server 2005). It inserts about 1 million records in a single day. Now we are adding some more market segments to it, and the number of records will double (2 million/day).
Currently the average insert rate is about 50 records per second, the maximum is 450 and the minimum is 0.
To check certain conditions I have used Service Broker (an asynchronous trigger) on my price table. It is running fine at this time (about 35% CPU utilization).
Now I am planning to create an in-memory dataset of the current stock prices; we would like to do some simple calculations on it.
Currently I am using an XML batch insertion method (OPENXML in a stored proc).
I want to know members' different views on this.
Please share how you would deal with such a situation.
Your question is about reading, but the title implies writing?
When reading, consider (but don't blindly use) temporary tables to cache data if you're going to do some processing. By simple calculations, I assume you mean aggregates like AVG, MAX etc.?
It would generally be inane to drag data around, cache it in the client and aggregate it there.
If batch uploads:
SqlBulkCopy or similar into a staging table
A single set-based write from the staging table to the final table
If it's a single upload, just insert it
A million rows a day is a rounding error for what SQL Server (or Oracle, MySQL, DB2, etc.) is capable of.
Example: 35k transactions (not rows) per second
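A minimal sketch of that staging-table path (the table and column names are made up for illustration, they are not from the question):

using System.Data;
using System.Data.SqlClient;

class PriceLoader
{
    // Bulk-copy a batch of ticks into a staging table, then move them to the final
    // table with one set-based statement instead of row-by-row inserts.
    static void LoadBatch(string connectionString, DataTable ticks)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();

            using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.PriceStaging" })
            {
                bulk.WriteToServer(ticks); // the DataTable columns must match the staging table
            }

            using (var move = new SqlCommand(
                @"INSERT INTO dbo.Price (Symbol, TradeTime, Price)
                  SELECT Symbol, TradeTime, Price FROM dbo.PriceStaging;
                  TRUNCATE TABLE dbo.PriceStaging;", conn))
            {
                move.ExecuteNonQuery();
            }
        }
    }
}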
The Windows app I am constructing is for very low-end machines (a Celeron with at most 128 MB of RAM). Of the following two approaches, which is best? (I don't want the application to become a memory hog on low-end machines.)
Approach One:
Query the database with Select GUID from Table1 where DateTime <= @givendate, which returns more than 300 thousand records (but only one field, i.e. GUID - 300 thousand GUIDs). Then run a loop over those GUIDs to carry out the next step of the process.
Second Approach:
Query the database with Select Top 1 GUID from Table1 where DateTime <= @givendate again and again until all 300 thousand records are done. It returns only one GUID at a time, and I can do the next step of the operation for each.
Which approach do you suggest will use less memory? (Speed/performance is not the issue here.)
PS: The database is also on the local machine (MSDE or SQL Server 2005 Express).
I would go with a hybrid approach and select maybe 50 records at a time instead of just one. That way you aren't loading the entire set of records, but you also drastically reduce the number of calls to the database.
Go with approach 1 and use a SqlDataReader to iterate through the data without eating up memory.
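A rough sketch of that, using the table and parameter names from the question and assuming the GUID column is a uniqueidentifier; the per-GUID processing is left as a placeholder:

using System;
using System.Data.SqlClient;

class GuidStreamer
{
    // Stream GUIDs one row at a time; SqlDataReader keeps only the current row in memory.
    static void ProcessGuids(string connectionString, DateTime givenDate)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            "SELECT GUID FROM Table1 WHERE DateTime <= @givendate", conn))
        {
            cmd.Parameters.AddWithValue("@givendate", givenDate);
            conn.Open();

            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    Guid id = reader.GetGuid(0);
                    // ... do the next processing step for this GUID ...
                }
            }
        }
    }
}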
If you only have 128 MB of RAM, I think number 2 would be your best approach... that said, couldn't you do this set-based with a stored procedure? That way all the processing would happen on the server.
If memory use is a concern, I would consider caching the data to disk locally. You can then read the data from the files using a FileStream object.
Your number 2 solution will be really slow, and put a lot of burden on the db server.
I would use a paging-enabled stored procedure.
I would do it in chunks of 1k rows and test upward from there until I get the best performance.
usp_GetGUIDS @from = 1, @to = 1000
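For illustration, a client-side loop over such a proc could look roughly like this (usp_GetGUIDS is hypothetical here, assumed to take row-number bounds and return one GUID column):

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

class PagedGuidReader
{
    // Pull GUIDs in fixed-size chunks so that only one page is ever held in memory.
    static void ProcessInPages(string connectionString, int pageSize)
    {
        int from = 1;
        while (true)
        {
            var page = new List<Guid>();
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand("usp_GetGUIDS", conn) { CommandType = CommandType.StoredProcedure })
            {
                cmd.Parameters.AddWithValue("@from", from);
                cmd.Parameters.AddWithValue("@to", from + pageSize - 1);
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        page.Add(reader.GetGuid(0));
                }
            }

            if (page.Count == 0)
                break; // no more rows

            foreach (var id in page)
            {
                // ... process each GUID in this chunk ...
            }

            from += pageSize;
        }
    }
}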
This may be a totally inappropriate approach for you, but if you're that worried about performance and your machine is low-spec, I'd try the following:
Move your SQL Server to another machine, as it eats up a lot of resources.
Alternatively, if you don't have that many records, store them as XML or in SQLite and get rid of SQL Server altogether?