Let's say I have a book with ~2^40 pages. Each day, I read a random chunk of contiguous pages (sometimes including some pages I've already read). What's the smartest way to store and update the "which pages I've read" information in a (SQLite) database?
My current idea is to store [firstChunkPage, lastChunkPage] entries in a table, but I'm not sure about how to update this efficiently.
Should I first check for every possible overlap and then update?
Should I just insert my new range and then merge overlapping entries (perhaps multiple times, because multiple overlaps can occur)? I'm not sure how to build such a SQL query.
This looks like a pretty common problem, so I'm wondering if anyone knows a 'recognized' solution for this.
Any help or idea is welcome!
EDIT: The reading isn't actually random; the number of chunks is expected to be pretty much constant and very small compared to the number of pages.
Your idea to store (firstChunkPage, lastChunkPage) ranges should work if the data is relatively sparse.
Unfortunately, queries like the one you mentioned:
SELECT count(*) FROM table
WHERE firstChunkPage <= page AND page <= lastChunkPage
cannot work efficiently unless you use a spatial index.
For SQLite, you should use the R-Tree module, which implements support for this kind of index. Quote:
An R-Tree is a special index that is designed for doing range queries. R-Trees are most commonly used in geospatial systems where each entry is a rectangle with minimum and maximum X and Y coordinates. ... For example, suppose a database records the starting and ending times for a large number of events. An R-Tree is able to quickly find all events, for example, that were active at any time during a given time interval, or all events that started during a particular time interval, or all events that both started and ended within a given time interval.
With an R-Tree, you can very quickly identify all overlaps before inserting a new range and replace them with a single combined entry.
To create your R-Tree index, use something like this:
CREATE VIRTUAL TABLE demo_index USING rtree(
    id, firstChunkPage, lastChunkPage
);
For more information, read the documentation.
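A minimal sketch of the merge-on-insert step, assuming the demo_index table above and C# with Microsoft.Data.Sqlite (the method, parameter names, and the dollar-prefixed SQL parameters are illustrative, not from the original post):

using System;
using Microsoft.Data.Sqlite;

// Insert [first, last], absorbing any stored ranges that overlap it.
static void AddRange(SqliteConnection conn, long first, long last)
{
    using var tx = conn.BeginTransaction();

    // 1. Find the combined extent of the new range plus all overlapping ranges.
    var find = conn.CreateCommand();
    find.Transaction = tx;
    find.CommandText =
        "SELECT MIN(firstChunkPage), MAX(lastChunkPage) FROM demo_index " +
        "WHERE firstChunkPage <= $last AND lastChunkPage >= $first";
    find.Parameters.AddWithValue("$first", first);
    find.Parameters.AddWithValue("$last", last);
    using (var reader = find.ExecuteReader())
    {
        if (reader.Read() && !reader.IsDBNull(0))
        {
            first = Math.Min(first, reader.GetInt64(0));
            last = Math.Max(last, reader.GetInt64(1));
        }
    }

    // 2. Delete the ranges that were absorbed.
    var delete = conn.CreateCommand();
    delete.Transaction = tx;
    delete.CommandText =
        "DELETE FROM demo_index WHERE firstChunkPage <= $last AND lastChunkPage >= $first";
    delete.Parameters.AddWithValue("$first", first);
    delete.Parameters.AddWithValue("$last", last);
    delete.ExecuteNonQuery();

    // 3. Insert the single merged range (the rtree id column is assigned automatically).
    var insert = conn.CreateCommand();
    insert.Transaction = tx;
    insert.CommandText =
        "INSERT INTO demo_index(firstChunkPage, lastChunkPage) VALUES ($first, $last)";
    insert.Parameters.AddWithValue("$first", first);
    insert.Parameters.AddWithValue("$last", last);
    insert.ExecuteNonQuery();

    tx.Commit();
}

One caveat: the standard rtree module stores coordinates as 32-bit floats, so with ~2^40 pages you may want to check that the loss of precision is acceptable for your page numbers.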
Related
I am using Azure Table Storage and retrieving data through a timestamp filter. I see the execution is very slow because the timestamp is not a partition key or row key. I researched on Stack Overflow and found that the timestamp should be converted to ticks and stored in the partition key. I did the same, and while inserting data I used the code below to build the tick string for the partition key.
string currentDateTimeTick = ConvertDateTimeToTicks(DateTime.Now.ToUniversalTime()).ToString();
public static long ConvertDateTimeToTicks(DateTime dtInput)
{
    long ticks = 0;
    ticks = dtInput.Ticks;
    return ticks;
}
This is fine up to here. But when I try to retrieve the last 5 days of data, I am unable to query the ticks against the partition key. What was my mistake in the code below?
int days = 5;
TableQuery<MyEntity> query = new TableQuery<MyEntity>()
.Where(TableQuery.GenerateFilterConditionForDate("PartitionKey", QueryComparisons.GreaterThanOrEqual, "0"+DateTimeOffset.Now.AddDays(days).Date.Ticks));
Are you sure you want to use ticks as a partition key? This means that every measurable 100 ns instant becomes its own partition. With time-based data you can use the partition key to specify an interval like every hour, minute or even second, and then a row key with the actual timestamp.
That problem aside, let me show you how to do the query. First, let me comment on how you generate the partition key. I suggest you do it like this:
var partitionKey = DateTime.UtcNow.Ticks.ToString("D18");
Don't use DateTime.Now.ToUniversalTime() to get the current UTC time. It will internally use DateTime.UtcNow, then convert it to the local time zone, and ToUniversalTime() will convert back to UTC, which is just wasteful (and more time-consuming than you may think).
And your ConvertDateTimeToTicks() method serves no other purpose than to get the Ticks property so it is just making your code more complex without adding any value.
Here is how to perform the query:
var days = 5;
var partitionKey = DateTime.UtcNow.AddDays(-days).Ticks.ToString("D18");
var query = new TableQuery<MyEntity>().Where(
    TableQuery.GenerateFilterCondition(
        "PartitionKey",
        QueryComparisons.GreaterThanOrEqual,
        partitionKey
    )
);
The partition key is formatted as an 18-character string, allowing you to use a straightforward string comparison.
I suggest that you move the code to generate the partition key (and row key) into a function to make sure that the keys are generated the same way throughout your code.
The reason 18 characters are used is that the Ticks value of a DateTime today, and for roughly the next thousand years, uses 18 decimal digits. If you decide to base your partition key on hours, minutes or seconds instead of 100 ns ticks, you can shorten the length of the partition key accordingly.
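For example, key generation could be centralized in a small helper as suggested above (the class and method names, and the GUID-based row key, are illustrative assumptions):

using System;

public static class EntityKeys
{
    // 18 digits covers DateTime.Ticks for any realistic date.
    public static string PartitionKeyFor(DateTime utc) => utc.Ticks.ToString("D18");

    // Row key: the full timestamp plus a GUID so two entities in the same tick do not collide.
    public static string RowKeyFor(DateTime utc) => $"{utc.Ticks:D18}_{Guid.NewGuid():N}";
}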
As Martin suggests, using a timestamp as your partition key is almost certainly not what you want to do.
Partitions are the unit of scale in Azure Table Storage and more or less represent physical segmentation of your data. They're a scalability optimization that allows you to "throw hardware" at the problem of storing more and more data, while maintaining acceptable response times (something which is traditionally hard in data storage). You define the partitions in your data by assigning partition keys to each row. It's almost never desirable that each row lives in its own partition.
In ATS, the row key becomes your unique key within a given partition. So the combination of partition key + row key is the true unique key across the entire ATS table.
There's lots of advice out there for choosing a valid partition key and row key... none of which is generalized. It depends on the nature of your data, your anticipated query patterns, etc.
Choose a partition key that will aggregate your data into a reasonably well-distributed set of "buckets". All things being equal, if you anticipate having 1 million rows in your table, it's often useful to have, say, 10 buckets with 100,000 rows each... or maybe 100 buckets with 10,000 rows each. At query time you'll need to pick the partition(s) you're querying, so the number of buckets may matter to you. "Buckets" often correspond to a natural segmentation concept in your domain... a bucket to represent each US state, or a bucket to represent each department in your company, etc. Note that it's not necessary (or often possible) to have perfectly distributed buckets... get as close as you can, with reasonable effort.
One example of where you might intentionally have an uneven distribution is if you intend to vary query patterns by bucket... bucket A will receive lots of cheap, fast queries, bucket B fewer, more expensive queries, etc. Or perhaps bucket A data will remain static while bucket B data changes frequently. This can be accomplished with multiple tables, too... so there's no "one size fits all" answer.
Given the limited knowledge we have of your problem, I like Martin's advice of using a time span as your partition key. Small spans will result in many partitions, and (among other things) make queries that utilize multiple time spans relatively expensive. Larger spans will result in lower aggregation costs across spans, but will result in bigger partitions and thus more expensive queries within a partition (it will also make identifying a suitable row key potentially more challenging).
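As a rough illustration of the time-span idea (the hourly granularity and the function name are assumptions, not something from the original answer):

using System;

// Bucket rows by hour instead of by individual tick: all rows from the same
// hour share one partition, and the key still sorts chronologically.
static string HourBucketPartitionKey(DateTime utc)
{
    var hourStart = new DateTime(utc.Year, utc.Month, utc.Day, utc.Hour, 0, 0, DateTimeKind.Utc);
    return hourStart.Ticks.ToString("D18");
}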
Ultimately you'll likely need to experiment with a few options to find the most suitable one for your data and intended queries.
One other piece of advice... don't be afraid to consider duplicating data in multiple data stores to suit widely varying query types. Not every query will work effectively against a single schema or storage configuration. The effort needed to synchronize data across stores may be less than that needed to bend query technology X to your will.
Best of luck!
One thing that was not mentioned in the answers above is that Azure will detect if you are using sequential, always-increasing or always-decreasing values for your partition key and create "range partitions". Range partitions group entities that have sequential unique PartitionKey values to improve the performance of range queries. Without range partitions, as mentioned above, a range query will need to cross partition boundaries or server boundaries, which can decrease the query performance. Range partitions happen under-the-hood and are decided by Azure, not you.
Now, if you want to do bulk inserts, let's say once a minute, you will still need to flatten out your timestamp partition keys to, say, ticks rounded up to the nearest minute. You can only do bulk inserts with the same partition key.
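A sketch of such a bulk insert, assuming the classic WindowsAzure.Storage table SDK, an existing CloudTable named table, and a collection entitiesToInsert (all of these names are assumptions for illustration):

// Round the timestamp down to the minute so every entity in the batch shares a partition key.
var utcNow = DateTime.UtcNow;
var minuteTicks = utcNow.Ticks - (utcNow.Ticks % TimeSpan.TicksPerMinute);
var partitionKey = minuteTicks.ToString("D18");

var batch = new TableBatchOperation();
foreach (var entity in entitiesToInsert)
{
    entity.PartitionKey = partitionKey;
    entity.RowKey = Guid.NewGuid().ToString("N");  // any key unique within the partition
    batch.Insert(entity);
}
table.ExecuteBatch(batch);  // a single batch accepts at most 100 operations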
I am developing an application which receives packets from the network and stores them in a database. In one part, I save DNS records to the DB, in this format:
IP Address(unsigned 32bit integer)
DNS record(unlimited string)
The rate of DNS records is about 10-100 records per second. As it's real-time, I don't have enough time to check for duplicates by string search in the database. I was thinking of a good method to get a unique short integer (say, 64-bit) per unique string, so my search goes from a string search to a number search and lets me check for duplicates faster. Any idea about implementations of this, or better approaches, is appreciated. Samples in C# are preferred, but any good idea is welcome.
I would read through this, which talks about hashing strings into integers, and since addresses are pretty long (letter-wise), I would use some modulo function to keep the result within integer limits.
The results would be checked with a hash table for duplicates.
This could be done for the first 20 letters, and then the next 20 for a nested hash table if required and so on.
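A rough C# sketch of this idea (the 64-bit fold and the case-folding are my assumptions; note that any hash can collide, so a strict duplicate check should still confirm against the stored string):

using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

static long HashDnsName(string name)
{
    // Hash the (lower-cased) name and fold the first 8 bytes into a 64-bit integer.
    using var sha = SHA256.Create();
    var bytes = sha.ComputeHash(Encoding.UTF8.GetBytes(name.ToLowerInvariant()));
    return BitConverter.ToInt64(bytes, 0);
}

// The duplicate check then becomes a number lookup instead of a string search:
var seen = new HashSet<long>();
bool isNew = seen.Add(HashDnsName("example.com"));  // false if the key was already present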
Make sure you set up your table-indexes and Primary Keys in the table correctly.
Load the table contents asynchronously every couple of seconds and populate a generic Dictionary<long, string> with it.
Perform the search on the dictionary as it is optimized for searches. If you need it even faster, use a hashtable.
Flush the newly added entries asynchronously into the DB in a transaction.
P.S. Your scenario is too vague to create a decent code example.
Hi guys, I have this doubt...
If I have a record of username and password details for logging in to a website, I'll most probably get the username and password from the form and check if the given username is present in the database using a contains() Boolean operation, and if it is, check that the password is the same as the one saved in the database.
But for websites like Gmail and Facebook there are millions of records, and the authentication is very quick...
How do they do it? What method do they follow for this?
How do they check if a value is present in a large record set that quickly?
Does the process involve just adding more servers for processing speed?
Thanks for the answers.
(Sorry, I posted this question without knowing about indexes. I just came to know that by creating indexes on one or more columns, the full table scan is minimized and the index path is used instead, which is a cheaper and more efficient operation.)
You just need one SQL query:
select 1 from user u
where u.login = :theEnteredLogin
and u.hashed_password = :theHashedEnteredPassword
(where :xxx are parameters of the query).
If you have an index on the login column or, even better, a composite index on (login, hashed_password), the query should not take more than a few milliseconds to execute.
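For illustration, executing that query from C# with ADO.NET might look like this (the open connection, the HashPassword helper, and the variable names are assumptions):

using System.Data.SqlClient;

// connection is an open SqlConnection; the entered password is hashed before comparing.
using var cmd = new SqlCommand(
    "select 1 from [user] u where u.login = @login and u.hashed_password = @hash",
    connection);
cmd.Parameters.AddWithValue("@login", enteredLogin);
cmd.Parameters.AddWithValue("@hash", HashPassword(enteredPassword));  // HashPassword is hypothetical
bool credentialsValid = cmd.ExecuteScalar() != null;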
Well, they have lots of servers and high-performance databases. At a low level, the table holding the hashes is probably indexed by the hash for fast lookup, binary-search style.
For medium to large data sets, indexing, combined with proper sizing of disk, memory and CPUs, is the most widely adopted approach.
For very large data sets, the database can be distributed and data partitioned.
For very, very large data sets, aside from the above scenarios, the technologies used usually involve a map-reduce model.
I have a Windows application written in C# that needs to load 250,000 rows from a database and provide a "search as you type" feature, which means that as soon as the user types something in a text box, the application needs to search all 250,000 records (which are, by the way, a single column with 1000 characters per row) using a LIKE-style search and display the found records.
The approach I followed was:
1- The application loads all the records into a typed List<EmployeesData>
while (objSQLReader.Read())
{
    lstEmployees.Add(new EmployeesData(
        Convert.ToInt32(objSQLReader.GetString(0)),
        objSQLReader.GetString(1),
        objSQLReader.GetString(2)));
}
2- In the TextChanged event, using LINQ, I search (with a combination of regular expressions) and attach the IEnumerable<EmployeesData> to a ListView which is in virtual mode.
String strPattern = "(?=.*wood*)(?=.*james*)";
IEnumerable<EmployeesData> lstFoundItems = from objEmployee in lstEmployees
                                           where Regex.IsMatch(objEmployee.SearchStr, strPattern, RegexOptions.IgnoreCase)
                                           select objEmployee;
lstFoundEmployees = lstFoundItems;
3- The RetrieveVirtualItem event is handled to supply the items the ListView displays.
e.Item = new ListViewItem(new String[] {
    lstFoundEmployees.ElementAt(e.ItemIndex).DateProjectTaskClient,
    e.ItemIndex.ToString() });
Though lstEmployees loads relatively fast from SQL Server (1.5 seconds), the search on TextChanged takes more than 7 minutes using LINQ. Searching SQL Server directly with a LIKE query takes less than 7 seconds.
What am I doing wrong here? How can I make this search faster (no more than 2 seconds)? This is a requirement from my client, so any help is highly appreciated. Please help...
Does the database column that stores the text data have an index on it? If so, something similar to the trie structure that Nicholas described is already in use. Indexes in SQL Server are implemented using B+ trees, which have an average search time on the order of log base 2 of n, where n is the number of records. This means that if you have 250,000 records in the table, the number of operations required to search is log base 2 of 250,000, or approximately 18 operations.
When you load all of the information into a data reader and then use a LINQ expression, it's a linear operation, O(n), where n is the length of the list. So, worst case, it's going to be 250,000 operations. If you use a DataView, there will be indexes that can be used to help with searching, which will drastically improve performance.
At the end of the day, if there will not be too many requests submitted against the database server, leverage the query optimizer to do this. As long as the LIKE operation isn't performed with a wildcard at the front of the string (i.e. LIKE '%some_string'), which negates the use of an index, and there is an index on the table, you will have really fast performance. If there are just too many requests submitted to the database server, either put all of the information into a DataView so an index can be used, or use a dictionary as Tim suggested above, which has a search time of O(1), assuming the dictionary is implemented using a hash table.
You'd be wanting to preload things and build yourself a data structure called a trie.
It's memory-intensive, but it's what the doctor ordered in this case.
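A minimal prefix-trie sketch in C# (this assumes "search as you type" means prefix matching; the class names are illustrative):

using System;
using System.Collections.Generic;

class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public List<int> RecordIds = new List<int>();  // ids of all records reachable through this prefix
}

class Trie
{
    private readonly TrieNode _root = new TrieNode();

    public void Insert(string text, int recordId)
    {
        var node = _root;
        foreach (var ch in text.ToLowerInvariant())
        {
            if (!node.Children.TryGetValue(ch, out var next))
                node.Children[ch] = next = new TrieNode();
            node = next;
            node.RecordIds.Add(recordId);  // memory-hungry, but lookups are instant
        }
    }

    // All record ids whose text starts with the typed prefix.
    public IReadOnlyList<int> Search(string prefix)
    {
        var node = _root;
        foreach (var ch in prefix.ToLowerInvariant())
        {
            if (!node.Children.TryGetValue(ch, out node))
                return Array.Empty<int>();
        }
        return node.RecordIds;
    }
}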
See my answer to this question. If you need instant response (i.e. as fast as a user types), loading the data into memory can be a very attractive option. It may use a bit of memory, but it is very fast.
Even though there are many characters (250K records * 1000), how many unique values are there? An in-memory structure based off of keys with pointers to records matching those keys really doesn't have to be that big, even accounting for permutations of those keys.
If the data truly won't fit into memory or changes frequently, keep it in the database and use SQL Server Full-Text Indexing, which will handle searches such as this much better than a LIKE. This assumes a fast connection from the application to the database.
Full-Text Indexing offers a powerful set of operators/expressions which can be used to make searches more intelligent. It's available with the free SQL Server Express Edition, which will handle up to 10 GB of data.
If the records can be sorted, you may want to go with a binary search, which is much, much faster for large data sets. There are several implementations in .NET collections, like List<T> and Array.
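As a hedged sketch of that suggestion (assuming prefix matching over a pre-sorted list; the names are made up):

using System;
using System.Collections.Generic;

// sortedTexts is a List<string> sorted with StringComparer.OrdinalIgnoreCase.
static List<string> PrefixMatches(List<string> sortedTexts, string prefix)
{
    var matches = new List<string>();
    int index = sortedTexts.BinarySearch(prefix, StringComparer.OrdinalIgnoreCase);
    if (index < 0) index = ~index;  // index of the first element >= prefix
    while (index < sortedTexts.Count &&
           sortedTexts[index].StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
    {
        matches.Add(sortedTexts[index]);
        index++;
    }
    return matches;
}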
I want a data structure that will allow querying how many items arrived in the last X minutes. An item may just be a simple identifier or a more complex data structure; preferably the timestamp of the item will be in the item rather than stored outside (as in a hash or similar, since I wouldn't want problems with multiple items having the same timestamp).
So far it seems that with LINQ I could easily filter items with a timestamp greater than a given time and aggregate a count, though I'm hesitant to work .NET 3.5-specific stuff into my production environment yet. Are there any other suggestions for a similar data structure?
The other part that I'm interested in is aging old data out. If I'm only going to be asking for counts of items less than 6 hours old, I would like anything older than that to be removed from my data structure, because this may be a long-running program.
A simple linked list can be used for this.
Basically you add new items to the end and remove items that are too old from the start; it is a cheap data structure.
example-code:
list.push_end(new_data)
while list.head.age >= age_limit:
list.pop_head()
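A C# sketch of the same idea (the type and member names are illustrative; it assumes items arrive roughly in timestamp order):

using System;
using System.Collections.Generic;
using System.Linq;

class SlidingWindowCounter<T>
{
    private readonly LinkedList<(DateTime Time, T Item)> _items = new LinkedList<(DateTime Time, T Item)>();
    private readonly TimeSpan _maxAge;

    public SlidingWindowCounter(TimeSpan maxAge) { _maxAge = maxAge; }

    public void Add(T item, DateTime timestampUtc)
    {
        _items.AddLast((timestampUtc, item));  // newest at the tail
        Prune();
    }

    // How many items fall within the last 'window' (e.g. the last 10 minutes)?
    public int CountSince(TimeSpan window)
    {
        Prune();
        var cutoff = DateTime.UtcNow - window;
        return _items.Count(e => e.Time >= cutoff);
    }

    // Age out anything older than the retention limit (e.g. 6 hours) from the head.
    private void Prune()
    {
        var cutoff = DateTime.UtcNow - _maxAge;
        while (_items.First != null && _items.First.Value.Time < cutoff)
            _items.RemoveFirst();
    }
}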
If the list will be busy enough to warrant chopping off larger pieces than one at a time, then I agree with dmo: use a tree structure or something similar that allows pruning at a higher level.
I think that an important consideration will be the frequency of querying vs. adding/removing. If you will do frequent querying (especially if you'll have a large collection) a B-tree may be the way to go:
http://en.wikipedia.org/wiki/B-tree
You could have some thread go through and clean up this tree periodically or make it part of the search (again, depending on the usage). Basically, you'll do a tree search to find the spot "x minutes ago", then count the number of children on the nodes with newer times. If you keep the number of children under the nodes up to date, this sum can be done quickly.
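.NET has no built-in counted B-tree, but a sorted list plus binary search gives the same "find the spot for x minutes ago, then count what follows" behavior; a rough sketch (the names are made up):

using System;
using System.Collections.Generic;

class TimestampIndex
{
    private readonly List<DateTime> _timestamps = new List<DateTime>();  // kept sorted ascending

    public void Add(DateTime t)
    {
        int i = _timestamps.BinarySearch(t);
        _timestamps.Insert(i < 0 ? ~i : i, t);  // cheap if items mostly arrive in order (insert lands at the end)
    }

    // Count of items newer than the given window, via one binary search.
    public int CountSince(TimeSpan window)
    {
        var cutoff = DateTime.UtcNow - window;
        int i = _timestamps.BinarySearch(cutoff);
        if (i < 0) i = ~i;  // index of the first timestamp >= cutoff
        return _timestamps.Count - i;
    }
}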
A cache with sliding expiration will do the job... stuff your items in and the cache handles the aging.
http://www.sharedcache.com/cms/