MongoDb fx rates data aggregation, c#

I have a data series in MongoDb like this:
{
    "_id": {
        "$binary": {
            "base64": "MwTY5Cd5MUyZ1q+WAOLUFg==",
            "subType": "03"
        }
    },
    "Readings": {
        "BTCUSD": "23157",
        "ETHUSD": "1674",
        "SOLUSD": "96",
        "BNBUSD": "398"
    },
    "ExternalAPI": "Chain Oracle",
    "TimeStamp": {
        "$date": {
            "$numberLong": "1674740700000"
        }
    }
}
I need to write several queries: min, max and average, by time and by currency pair.
I will run the queries from C# code.
The storage format is not set in stone, so if this is not a good format for storing FX rates, I can switch to a different database type or format.
I tried using a relational database to store the data, but I think a time-series-like approach may be a better fit for the problem.
My aim is to store FX rates that arrive from different sources at 5-second intervals, and to query them for min, max and average by time and by currency pair.
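For example, for one currency pair grouped by hour, I imagine a query roughly like this (just a sketch using the MongoDB C# driver; the connection string, database and collection names are placeholders, the rates would need to be numeric or converted with $toDouble, and $dateTrunc requires MongoDB 5.0+):
using MongoDB.Bson;
using MongoDB.Driver;

// Placeholder connection, database and collection names.
var collection = new MongoClient("mongodb://localhost:27017")
    .GetDatabase("fx")
    .GetCollection<BsonDocument>("rates");

// Hourly min/max/avg for BTCUSD ($toDouble because the sample stores rates as strings).
var pipeline = new[]
{
    new BsonDocument("$group", new BsonDocument
    {
        { "_id", new BsonDocument("$dateTrunc",
            new BsonDocument { { "date", "$TimeStamp" }, { "unit", "hour" } }) },
        { "min", new BsonDocument("$min", new BsonDocument("$toDouble", "$Readings.BTCUSD")) },
        { "max", new BsonDocument("$max", new BsonDocument("$toDouble", "$Readings.BTCUSD")) },
        { "avg", new BsonDocument("$avg", new BsonDocument("$toDouble", "$Readings.BTCUSD")) }
    })
};

var hourlyStats = collection.Aggregate<BsonDocument>(pipeline).ToList();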
What is the best approach to reach my aim?
Thanks

Related

Data structure to represent statistical table lookup data

I'm performing some statistical calculations where I need to lookup values from various tables dynamically.
I've tried representing the following data as JSON, and querying the relevant value:
// Look up a value by factor and sample count in the parsed JSON table.
using JsonDocument doc = JsonDocument.Parse(json);
JsonElement w_test_table = doc.RootElement;
w_test_table.GetProperty(factor).GetProperty(sampleCount.ToString()).GetDouble();
However this feels like I'm going down a bad path.
I have multiple tables to look up for each calculation, so I'm starting to think there's a better way to do this that I'm unaware of.
My concern with storing this in the DB is that I'd need multiple round trips to the DB to resolve the values.
The calculations are performed on a collection of sample sets. Since I run the statistics for multiple sample sets, I need to resolve values from the various lookup tables many times, and the input values differ each time.
Any ideas on how I can represent these kinds of lookup tables in C# would be appreciated.
You could use a DataTable or a multi-dimensional array.
However, since you probably have a specific use case, you may want a custom class that holds the data internally in one of those structures and exposes methods that abstract your specific logic.
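For example, a thin wrapper along these lines (a sketch with hypothetical class and property names; it parses the JSON once and keeps the lookups in memory, which also avoids the round trips you mentioned):
using System.Collections.Generic;
using System.Text.Json;

public sealed class StatLookupTable
{
    // Keyed by (factor, sample count), filled once from the JSON document.
    private readonly Dictionary<(string Factor, int SampleCount), double> _values = new();

    public StatLookupTable(string json)
    {
        using JsonDocument doc = JsonDocument.Parse(json);
        foreach (JsonProperty factor in doc.RootElement.EnumerateObject())
        {
            foreach (JsonProperty sample in factor.Value.EnumerateObject())
            {
                _values[(factor.Name, int.Parse(sample.Name))] = sample.Value.GetDouble();
            }
        }
    }

    public double Lookup(string factor, int sampleCount) => _values[(factor, sampleCount)];
}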

Asp.net storing DateTime/long to smallest comparable value

I am working on a rather large SQL database project that requires some history tracking. I am aware that SQL Server has features like Change Data Capture, but I need more control than just storing backup copies of the data, along with other requirements.
Here is what I am trying to do. I found some similar questions on here, like
get-the-smallest-datetime-value-for-each-day-in-sql-database
and
how-can-i-truncate-a-datetime-in-sql-server
What I am trying to avoid is exactly the kind of thing mentioned in those answers; by that I mean they require some sort of conversion during the query, i.e. truncating the DateTime value. What I would like to do instead is take DateTime.Ticks (or some other DateTime property) and do a one-way conversion to a smaller type, creating a "Version" of each record that can be queried quickly without any conversion after the database update/insert.
The concern I have about simply storing the long from DateTime.Ticks as a version, or using something like a Base36 string, is the size of the field and the time required to compare it during queries.
Can anyone point me in the right direction on how to SAFELY convert a DateTime.Ticks or a long into something that can be directly compared during queries? By this I mean I would like to be able to locate the history record using something like:
int versionToFind = GetVersion(DateTime.Now);
var result = from rec in db.Records
where rec.version <= versionToFind
select rec;
or
int versionToFind = record.version;
var result = from rec in db.Records
where rec.version >= versionToFind
select rec;
One thing to mention here is that I am not opposed to using some other method of quickly tracking the History of the data. I just need to end up with the quickest and smallest solution to be able to generate and compare Versions for each record.
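For illustration, this is the kind of one-way conversion I mean by GetVersion (just a sketch; the epoch is arbitrary, and seconds since an epoch fit in a signed int for roughly 68 years):
// Sketch: a one-way conversion from DateTime to an int that stays directly
// comparable in queries. The epoch below is a hypothetical starting point.
private static readonly DateTime Epoch = new DateTime(2020, 1, 1, 0, 0, 0, DateTimeKind.Utc);

public static int GetVersion(DateTime value)
{
    return (int)(value.ToUniversalTime() - Epoch).TotalSeconds;
}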

Azure table storage querying partitionkey

I am using Azure table storage and retrieving data through a timestamp filter. I see that execution is very slow because the timestamp is not a partition key or row key. I researched on Stack Overflow and found that the timestamp should be converted to ticks and stored in the partition key. I did that, and while inserting data I used the code below and stored the tick string in the partition key.
string currentDateTimeTick = ConvertDateTimeToTicks(DateTime.Now.ToUniversalTime()).ToString();
public static long ConvertDateTimeToTicks(DateTime dtInput)
{
long ticks = 0;
ticks = dtInput.Ticks;
return ticks;
}
This is fine up to here. But when I try to retrieve the last 5 days of data, I am unable to query the ticks against the partition key. What is my mistake in the code below?
int days = 5;
TableQuery<MyEntity> query = new TableQuery<MyEntity>()
.Where(TableQuery.GenerateFilterConditionForDate("PartitionKey", QueryComparisons.GreaterThanOrEqual, "0"+DateTimeOffset.Now.AddDays(days).Date.Ticks));
Are you sure you want to use ticks as a partition key? That means every measurable 100 ns instant becomes its own partition. With time-based data you can use the partition key to specify an interval, such as every hour, minute or even second, and then a row key with the actual timestamp.
That problem aside, let me show you how to do the query. First, a comment on how you generate the partition key. I suggest you do it like this:
var partitionKey = DateTime.UtcNow.Ticks.ToString("D18");
Don't use DateTime.Now.ToUniversalTime() to get the current UTC time. It will internally use DateTime.UtcNow, convert it to the local time zone, and then ToUniversalTime() will convert it back to UTC, which is just wasteful (and more time-consuming than you may think).
And your ConvertDateTimeToTicks() method serves no other purpose than to get the Ticks property so it is just making your code more complex without adding any value.
Here is how to perform the query:
var days = 5;
var partitionKey = DateTime.UtcNow.AddDays(-days).Ticks.ToString("D18");
var query = new TableQuery<MyEntity>().Where(
    TableQuery.GenerateFilterCondition(
        "PartitionKey",
        QueryComparisons.GreaterThanOrEqual,
        partitionKey
    )
);
The partition key is formatted as an 18 characters string allowing you to use a straightforward comparison.
I suggest that you move the code to generate the partition key (and row key) into a function to make sure that the keys are generated the same way throughout your code.
The reason 18 characters are used is because the Ticks value of a DateTime today as well as many thousands of years in the future uses 18 decimal digits. If you decide to base your partition key on hours, minutes or seconds instead of 100 ns ticks then you can shorten the length of the partition key accordingly.
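For example, a shared helper could look roughly like this (a sketch using one partition per hour and the full-precision timestamp as the row key; adjust the interval to your data):
// Sketch of key-generation helpers (hypothetical class name) so partition and
// row keys are always built the same way throughout the code.
public static class KeyFactory
{
    public static string GetPartitionKey(DateTime utc)
    {
        // Truncate to the hour so all entities from the same hour share a partition.
        var hour = new DateTime(utc.Year, utc.Month, utc.Day, utc.Hour, 0, 0, DateTimeKind.Utc);
        return hour.Ticks.ToString("D18");
    }

    public static string GetRowKey(DateTime utc)
    {
        // Full-precision timestamp within the partition.
        return utc.Ticks.ToString("D18");
    }
}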
As Martin suggests, using a timestamp as your partition key is almost certainly not what you want to do.
Partitions are the unit of scale in Azure Table Storage and more or less represent physical segmentation of your data. They're a scalability optimization that allows you to "throw hardware" at the problem of storing more and more data while maintaining acceptable response times (something that is traditionally hard in data storage). You define the partitions in your data by assigning partition keys to each row. It's almost never desirable for each row to live in its own partition.
In ATS, the row key becomes your unique key within a given partition. So the combination of partition key + row key is the true unique key across the entire ATS table.
There's lots of advice out there for choosing a valid partition key and row key... none of which is generalized. It depends on the nature of your data, your anticipated query patterns, etc.
Choose a partition key that will aggregate your data into a reasonably well-distributed set of "buckets". All things being equal, if you anticipate having 1 million rows in your table, it's often useful to have, say, 10 buckets with 100,000 rows each... or maybe 100 buckets with 10,000 rows each. At query time you'll need to pick the partition(s) you're querying, so the number of buckets may matter to you. "Buckets" often correspond to a natural segmentation concept in your domain... a bucket to represent each US state, or a bucket to represent each department in your company, etc. Note that it's not necessary (or often possible) to have perfectly distributed buckets... get as close as you can with reasonable effort.
One example of where you might intentionally have an uneven distribution is if you intend to vary query patterns by bucket... bucket A will receive lots of cheap, fast queries, bucket B fewer, more expensive queries, etc. Or perhaps bucket A data will remain static while bucket B data changes frequently. This can be accomplished with multiple tables, too... so there's no "one size fits all" answer.
Given the limited knowledge we have of your problem, I like Martin's advice of using a time span as your partition key. Small spans will result in many partitions, and (among other things) make queries that utilize multiple time spans relatively expensive. Larger spans will result in fewer aggregation costs across spans, but will result in bigger partitions and thus more expensive queries within a partition (it will also make identifying a suitable row key potentially more challenging).
Ultimately you'll likely need to experiment with a few options to find the most suitable one for your data and intended queries.
One other piece of advice... don't be afraid to consider duplicating data in multiple data stores to suit widely varying query types. Not every query will work effectively against a single schema or storage configuration. The effort needed to synchronize data across stores may be less than that needed to bend query technology X to your will.
Best of luck!
One thing that was not mentioned in the answers above is that Azure will detect if you are using sequential, always-increasing or always-decreasing values for your partition key and create "range partitions". Range partitions group entities that have sequential unique PartitionKey values to improve the performance of range queries. Without range partitions, as mentioned above, a range query will need to cross partition boundaries or server boundaries, which can decrease the query performance. Range partitions happen under-the-hood and are decided by Azure, not you.
Now, if you want to do bulk inserts, let's say once a minute, you will still need to flatten out your timestamp partition keys to, say, ticks rounded up to the nearest minute. You can only do bulk inserts with the same partition key.
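For example (a sketch; table, MyEntity and entitiesForThisMinute stand in for your own objects, and a single batch is limited to 100 operations):
// Sketch: entities whose timestamps fall in the same minute share a partition
// key, so they can be inserted together in one batch operation.
var batch = new TableBatchOperation();
foreach (var entity in entitiesForThisMinute)
{
    batch.Insert(entity);
}
table.ExecuteBatch(batch);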

How to store a sparse boolean vector in a database?

Let's say I have a book with ~2^40 pages. Each day, I read a random chunk of contiguous pages (sometimes including some pages I've already read). What's the smartest way to store and update the information of "which pages I've read" in a (SQLite) database?
My current idea is to store [firstChunkPage, lastChunkPage] entries in a table, but I'm not sure about how to update this efficiently.
Should I first check for every possible overlap and then update?
Should I just insert my new range and then merge overlapping entries (perhaps multiple times, since multiple overlaps can occur)? I'm not sure how to build such a SQL query.
This looks like a pretty common problem, so I'm wondering if anyone knows a 'recognized' solution for this.
Any help or idea is welcome!
EDIT: The reading isn't actually random; the number of chunks is expected to be pretty much constant and very small compared to the number of pages.
Your idea of storing (firstChunkPage, lastChunkPage) range pairs should work if the data is relatively sparse.
Unfortunately, queries like the one you mentioned:
SELECT count(*) FROM table
WHERE firstChunkPage <= page AND page <= lastChunkPage
cannot work efficiently unless you use a spatial index.
For SQLite, you should use the R-Tree module, which implements support for this kind of index. Quote:
An R-Tree is a special index that is designed for doing range queries. R-Trees are most commonly used in geospatial systems where each entry is a rectangle with minimum and maximum X and Y coordinates. ... For example, suppose a database records the starting and ending times for a large number of events. A R-Tree is able to quickly find all events, for example, that were active at any time during a given time interval, or all events that started during a particular time interval, or all events that both started and ended within a given time interval.
With an R-Tree, you can very quickly identify all overlaps before inserting a new range and replace them with a single combined entry.
To create your R-Tree index, use something like this:
CREATE VIRTUAL TABLE demo_index USING rtree(
id, firstChunkPage, lastChunkPage
);
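Once the index exists, identifying and merging overlaps could look roughly like this (a sketch that assumes Microsoft.Data.Sqlite as the client library; error handling omitted):
using System;
using Microsoft.Data.Sqlite;

static void AddReadChunk(SqliteConnection conn, long chunkFirst, long chunkLast)
{
    using var tx = conn.BeginTransaction();

    // 1. Find the combined bounds of every stored range that overlaps the new chunk.
    var find = conn.CreateCommand();
    find.Transaction = tx;
    find.CommandText =
        "SELECT MIN(firstChunkPage), MAX(lastChunkPage) FROM demo_index " +
        "WHERE firstChunkPage <= $last AND lastChunkPage >= $first";
    find.Parameters.AddWithValue("$first", chunkFirst);
    find.Parameters.AddWithValue("$last", chunkLast);

    long mergedFirst = chunkFirst, mergedLast = chunkLast;
    using (var reader = find.ExecuteReader())
    {
        if (reader.Read() && !reader.IsDBNull(0))
        {
            mergedFirst = Math.Min(mergedFirst, (long)reader.GetDouble(0));
            mergedLast = Math.Max(mergedLast, (long)reader.GetDouble(1));
        }
    }

    // 2. Remove the overlapping ranges and insert the single merged range.
    var delete = conn.CreateCommand();
    delete.Transaction = tx;
    delete.CommandText =
        "DELETE FROM demo_index WHERE firstChunkPage <= $last AND lastChunkPage >= $first";
    delete.Parameters.AddWithValue("$first", chunkFirst);
    delete.Parameters.AddWithValue("$last", chunkLast);
    delete.ExecuteNonQuery();

    var insert = conn.CreateCommand();
    insert.Transaction = tx;
    insert.CommandText =
        // NULL lets SQLite pick an unused id for the new row.
        "INSERT INTO demo_index (id, firstChunkPage, lastChunkPage) VALUES (NULL, $first, $last)";
    insert.Parameters.AddWithValue("$first", mergedFirst);
    insert.Parameters.AddWithValue("$last", mergedLast);
    insert.ExecuteNonQuery();

    tx.Commit();
}
Note that the default rtree module stores coordinates as 32-bit floating point numbers, so with ~2^40 pages it is worth checking whether that precision is acceptable for your page numbers.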
For more information, read the documentation.

MongoDB key-value DB alternative with compression

I am currently using MongoDB to store lots of real-time signals of some sensors. The stored information includes a timestamp, a numeric value and a flag that indicates the quality of the signal.
Queries are very fast, but the amount of disk space used is exorbitant, and I would like to try another non-relational database more suitable for my purposes.
I've been looking at http://nosql-database.org/ but I don't know which database is the best for my needs.
Thank you very much :)
http://www.mongodb.org/display/DOCS/Excessive+Disk+Space
MongoDB stores field names inside every document, which is great because it allows all documents to have different fields, but it creates storage overhead when the fields are always the same.
To reduce disk consumption, try shortening the field names. So instead of:
{
    _id: "47cc67093475061e3d95369d",
    timestamp: "2011-06-09T17:46:21",
    value: 314159,
    quality: 3
}
Try this:
{
    _id: "47cc67093475061e3d95369d",
    t: "2011-06-09T17:46:21",
    v: 314159,
    q: 3
}
Then you can map these field names to something more meaningful inside your application.
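For example, with the MongoDB C# driver the mapping can live on the POCO (a sketch; class and property names are illustrative, and it assumes _id is stored as a real ObjectId):
using MongoDB.Bson;
using MongoDB.Bson.Serialization.Attributes;

public class SensorReading
{
    public ObjectId Id { get; set; }

    [BsonElement("t")]
    public DateTime Timestamp { get; set; }

    [BsonElement("v")]
    public long Value { get; set; }

    [BsonElement("q")]
    public int Quality { get; set; }
}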
Also, if you're storing separate _id and timestamp fields then you might be doubling up.
The ObjectId type has a timestamp embedded in it, so depending on how you query and use your data, you may be able to do without a separate timestamp field altogether.
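For example (a sketch; note that the embedded timestamp only has one-second precision):
// Recover the UTC creation time embedded in the ObjectId.
DateTime created = ObjectId.Parse("47cc67093475061e3d95369d").CreationTime;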
Disk space is cheap, don't worry about it; development with a new database will cost much more... If you are on Windows you can try RavenDB.
Also, maybe take a look at this answer about reducing MongoDB database size:
You can do this "compression" by running mongod --repair or by
connecting directly and running db.repairDatabase().
