I need a binary tree or another structure in which I can store objects with a timestamp and then QUICKLY look them up, not just by a timestamp I know is there, but also by a range
(timestamp > min && timestamp < max).
I found SortedDictionary and SortedSet, both of which are implemented as binary trees. What I am missing is the ability to look up by a range (> && <) without forcing the collection to internally iterate over more elements than it needs to.
What I mean is when I call
SortedDictionary.TryGetValue(DateTime.Now, ...
it should take logarithmic time.
I want to be able to get all items between Min and Max in logarithmic time as well. What is missing is something like:
SortedDictionary.TryGetValueBetween(DateTime.Now-SomeInterval, DateTime.Now+SomeInterval,...
If I were implementing the binary tree myself, this would not be a problem. But I do not see a mechanism for doing it with SortedDictionary or SortedSet, and I don't want to resort to linear time.
Am I just not finding the right methods or do I really need to implement the binary tree myself to get the benefits I am looking for?
Other options are also welcome. Is there a different structure that would give me insert, delete and "range lookup" in log time or better?
Found 2 solutions:
On closer inspection, SortedSet does have the method I need: GetViewBetween (sketched below).
In a free third-party library called Wintellect.PowerCollections there is an OrderedMultiDictionary class (see here) that does what I need and also allows duplicates in the collection (unlike SortedSet). The method for getting a range between two values is called Range().
As far as I can tell, both do inserts, deletes and lookups in O(log n) time.
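For illustration, a minimal sketch of the GetViewBetween approach (the sample timestamps and the interval are placeholder values):

    using System;
    using System.Collections.Generic;

    class ViewBetweenDemo
    {
        static void Main()
        {
            var set = new SortedSet<DateTime>();
            for (int i = 0; i < 10; i++)
                set.Add(DateTime.Now.AddMinutes(-i));        // sample timestamps

            TimeSpan someInterval = TimeSpan.FromMinutes(3); // placeholder range
            // GetViewBetween returns a live view backed by the original tree,
            // so building it does not copy or scan the whole set; enumerating
            // it only touches the items that fall inside the bounds.
            SortedSet<DateTime> view = set.GetViewBetween(
                DateTime.Now - someInterval, DateTime.Now + someInterval);

            foreach (DateTime t in view)
                Console.WriteLine(t);
        }
    }

Note that GetViewBetween treats both bounds as inclusive, so it matches timestamp >= min && timestamp <= max rather than the strict inequalities above.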
Given a binary tree, each of its nodes contains an item with a range; for instance, one particular node may contain the range (1, 1.23456].
If the query value is less than or greater than a node's range, the search inspects the respective child. For example, suppose the query value is 1.3.
In that case we descend into the right branch, performing two "if" checks at each node to see whether the value fits its range.
Even though a balanced binary search tree (BST) is an elegant way of searching through a dataset quickly, the number of "if" checks grows significantly as the tree gets deeper. This becomes even more of a problem when the lookup has to be done several million times per second.
Is there an elegant way of storing objects such that a value (1.3, for example) can simply be fed into something like a Dictionary, which would quickly retrieve the element whose range the value fits, or null if it fits none?
However, a dictionary doesn't check against ranges; it expects a single value. So, is there a data structure which can return an item when the supplied key falls within that item's range?
Here a person has a similar problem, but he finds that memory is wasted. He is advised to take the BST approach, but is that the only solution?
Sorry if there is an obvious answer; I may have missed it.
Are you asking about interval trees? Interval trees allow you to get all the elements on the interval x..y in O(log n) time. For a C# implementation I have used the library called IntervalTreeLib and it worked nicely.
In computer science, an interval tree is an ordered tree data structure to hold intervals. Specifically, it allows one to efficiently find all intervals that overlap with any given interval or point. It is often used for windowing queries, for instance, to find all roads on a computerized map inside a rectangular viewport, or to find all visible elements inside a three-dimensional scene. A similar data structure is the segment tree.
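If the ranges never overlap (as the (1, 1.23456] example suggests), a full interval tree may not even be needed: keeping the ranges sorted by their start value and binary-searching on the query point gives the O(log n) lookup with one pair of comparisons per step. A minimal sketch, with the Range record and the null-on-miss behavior invented for illustration:

    using System;
    using System.Collections.Generic;

    // Hypothetical half-open range (Start, End] carrying a payload.
    record Range(double Start, double End, string Item);

    static class RangeLookup
    {
        // Assumes the ranges are non-overlapping and sorted by Start.
        public static string Find(List<Range> ranges, double value)
        {
            int lo = 0, hi = ranges.Count - 1;
            while (lo <= hi)
            {
                int mid = (lo + hi) / 2;
                Range r = ranges[mid];
                if (value <= r.Start) hi = mid - 1;   // left of this range
                else if (value > r.End) lo = mid + 1; // right of this range
                else return r.Item;                   // Start < value <= End
            }
            return null;                              // fits no range
        }
    }

For overlapping ranges this breaks down, and an interval tree (or segment tree) is the right tool.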
I have a collection of about 8,000 test scores in an XML file.
Using LINQ and C#, what is one of the most efficient ways to calculate the percentile of a particular test score?
My emphasis is on efficiency, so what is the recommended approach? I am also looking for the appropriate built-in LINQ or C# functions for this calculation. Is there something called Percentile() or TopPercent?
It sounds like you're worrying about efficiency before you've verified that you need to worry about it.
I would take the following approach:
Load the XML file into memory with LINQ to XML (as the simplest XML API in .NET)
Convert the scores into a list of integers (or whatever the score type is)
You can now find out the total count easily
Use Count with a predicate to find out how many scores are less than your "target" score
If you need to check multiple scores, you obviously only need to repeat the final step.
My first attempt at optimizing this (for multiple checks) would be to sort the list, so you can then just do a binary search to find the rank of each score. I'd only go that far after benchmarking, though.
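A minimal sketch of those steps; the file name and the <score> element name are assumptions about the XML layout:

    using System;
    using System.Linq;
    using System.Xml.Linq;

    class PercentileDemo
    {
        static void Main()
        {
            // Hypothetical layout: <scores><score>87</score>...</scores>
            var scores = XDocument.Load("scores.xml")
                .Descendants("score")
                .Select(e => (int)e)
                .ToList();

            int target = 85;
            // Percentile rank: the fraction of scores strictly below the target.
            double percentile = 100.0 * scores.Count(s => s < target) / scores.Count;
            Console.WriteLine($"{target} is at the {percentile:F1}th percentile");
        }
    }

With 8,000 scores, the single Count pass is cheap; sorting plus binary search only pays off once many different targets are being checked.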
How do I construct/obtain a data structure with the following capabilities:
Stores (key,value) nodes, keys implement IComparable.
Fast (log N) insertion and retrieval.
Fast (log N) method to retrieve the next higher/next lower node from any node. [EXAMPLE: if the key/value pairs inserted are (7,cat), (4,dog), (12,ostrich), (13,goldfish), then if keyVal referred to (7,cat), keyVal.Next() should return a reference to (12,ostrich).]
A solution with an enumerator starting from an arbitrary key would of course also suffice. Note that standard SortedDictionary functionality will not suffice, since it can only return an enumerator over the entire set, which makes finding keyVal.Next() require N operations at worst.
Could a self-implemented balanced binary search tree (a red-black tree, say) be fitted with node.Next() functionality? Any good references for doing this? Any solutions that consume less coding time?
I once had similar requirements and was unable to find something suitable, so I implemented an AVL tree. Here is some advice for doing it with performance in mind:
Do not use recursion for walking the tree (insert, update, delete, next). Instead, use a stack array to store the path up to the root, which is needed for the balancing operations.
Do not store parent nodes. All operations start from the root node and walk further down; parent pointers are not needed if this is implemented carefully.
To find the Next() node of an existing one, Find() is usually called first. The stack produced by Find() should then be reused by Next().
By following these rules, I was able to implement the AVL tree. It works very efficiently even for very large data sets. I would be willing to share it, but it would need some modifications, since it does not store values (very easy to add) and does not rely on IComparable but on a fixed key type of int.
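To make the stack idea concrete, here is a minimal sketch of an in-order iterator that uses an explicit stack instead of parent pointers, over a plain (unbalanced) BST node for brevity; an AVL tree adds balancing, but Next() works the same way:

    using System;
    using System.Collections.Generic;

    class Node
    {
        public int Key;
        public Node Left, Right;
    }

    class InOrderIterator
    {
        // The stack holds the path back toward the root,
        // replacing parent pointers entirely.
        private readonly Stack<Node> _path = new Stack<Node>();

        public InOrderIterator(Node root) => PushLeftSpine(root);

        // Walk down the left spine, remembering the way back up.
        private void PushLeftSpine(Node n)
        {
            while (n != null) { _path.Push(n); n = n.Left; }
        }

        // Amortized O(1): each node is pushed and popped exactly once
        // over a full traversal.
        public bool TryNext(out int key)
        {
            if (_path.Count == 0) { key = default; return false; }
            Node n = _path.Pop();
            key = n.Key;
            PushLeftSpine(n.Right);
            return true;
        }
    }

A Find() that pushes every node it passes onto the same kind of stack leaves exactly the state this iterator needs, which is the reuse the advice above describes.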
The OrderedDictionary in PowerCollections provides a "get iterator starting at or before key" function that takes O(log N) time to return the first value. That makes it very fast to, say, scan the 1,000 items near the middle of a 50-million-item set (which with SortedDictionary would require starting from either the beginning or the end, both equally bad choices that would mean iterating over around 25 million items). OrderedDictionary can do it with just 1,000 items iterated.
There is a problem in OrderedDictionary, though: it uses yield, which causes O(n^2) performance and out-of-memory conditions when iterating a 50-million-item set in a 32-bit process. There is a quite simple fix for that, which I will document later.
I have a large dataset with possibly over a million entries. All items have an assigned timestamp, and items are added to the set at runtime (usually, but not always, with a newer timestamp).
I need to show a subset of this data given a certain time range. This time range is usually quite small compared to the total data set: of the 1,000,000+ items, no more than about 1,000 fall within a given time range. The time range moves at a constant pace, e.g. every second it moves forward by one second.
Additionally, the user may adjust the time range at any time ("move" through the data set) or set additional filters (e.g. filter by some text).
So far I wasn't worried about performance, trying to get the other things right first, and I only worked with smaller test sets. I am not quite sure how to tackle this problem efficiently and would be glad for any input. Thanks.
Edit: Used language is C# 4.
Update: I am now using an interval tree; the implementation can be found here:
https://github.com/mbuchetics/RangeTree
It also comes with an asynchronous version which rebuilds the tree using the Task Parallel Library (TPL).
We had a similar problem in our development: we had to collect several million items sorted by some key and then export one page on demand from it. I see that your problem is somewhat similar.
For that purpose, we adapted the red-black tree structure, in the following ways:
we added an iterator to it, so we could get the 'next' item in O(1)
we added a way to find the iterator starting from an 'index', and managed to do that in O(log n)
An RB tree has O(log n) insertion complexity, so your insertions should fit in there nicely.
next() on the iterator was implemented by adding and maintaining a linked list of all leaf nodes - the RB tree implementation we started from didn't include this.
An RB tree is also cool because it allows you to fine-tune the node size according to your needs. By experimenting you'll be able to figure out the right numbers for your problem.
Use a SortedList keyed by timestamp.
All you have to do is implement a binary search on the sorted keys inside the SortedList to find the boundaries of your selection, which is pretty easy.
Insert new items into the sorted list; this lets you select a range pretty easily. You could potentially use LINQ as well if you're familiar with it.
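A minimal sketch of that binary search, written as a hypothetical Range extension method over SortedList<DateTime, TValue> (the method names are mine, not part of the framework):

    using System;
    using System.Collections.Generic;

    static class SortedListRange
    {
        // Returns all values whose timestamp falls in [min, max], using a
        // lower-bound binary search so only the matching range is touched.
        public static IEnumerable<TValue> Range<TValue>(
            this SortedList<DateTime, TValue> list, DateTime min, DateTime max)
        {
            for (int i = LowerBound(list.Keys, min);
                 i < list.Count && list.Keys[i] <= max; i++)
            {
                yield return list.Values[i];
            }
        }

        // Index of the first key >= target (classic lower-bound search).
        private static int LowerBound(IList<DateTime> keys, DateTime target)
        {
            int lo = 0, hi = keys.Count;
            while (lo < hi)
            {
                int mid = (lo + hi) / 2;
                if (keys[mid] < target) lo = mid + 1; else hi = mid;
            }
            return lo;
        }
    }

One caveat: SortedList requires unique keys, so two items with the identical timestamp would need a composite key or a list as the value.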
An SO post about generating all the permutations got me thinking about a few alternative approaches. I was thinking about using space/run-time trade-offs and was wondering if people could critique this approach, and possible hiccups, while trying to implement it in C#.
The steps go as follows:
Given a data-structure of homogeneous elements, count the number of elements in the structure.
Assuming the permutation consists of all the elements of the structure, calculate the factorial of the count from step 1.
Instantiate a new structure (a Dictionary) of type <key (some hash of the collection), Collection<data structure of homogeneous elements>> and initialize a counter.
Hash (???) the seed structure from step 1, insert the hash/collection key/value pair into the Dictionary, and increment the counter by 1.
Randomly shuffle (???) the order of the seed structure, hash it, and then try to insert it into the Dictionary from step 3.
If there is a conflict in hashes, repeat step 5 to get a new order and hash and check for a conflict again. Upon successful insertion, increment the counter by 1.
Repeat steps 5 & 6 until the counter equals the factorial calculated in step 2.
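A minimal sketch of steps 3-7, assuming a Fisher-Yates shuffle as the randomizer and using the joined element sequence as a collision-free stand-in for the hash (both are my choices, not part of the original proposal):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class ShuffleUntilComplete
    {
        static readonly Random Rng = new Random();

        // In-place Fisher-Yates shuffle: one possible "randomly shuffle" step.
        static void Shuffle<T>(IList<T> items)
        {
            for (int i = items.Count - 1; i > 0; i--)
            {
                int j = Rng.Next(i + 1);
                (items[i], items[j]) = (items[j], items[i]);
            }
        }

        static void Main()
        {
            var seed = new List<int> { 1, 2, 3, 4 };
            long factorial = Enumerable.Range(1, seed.Count)
                                       .Aggregate(1L, (acc, k) => acc * k);

            // Joining the elements gives a key that is unique per ordering,
            // sidestepping the hash-conflict question in steps 5 and 6.
            var seen = new Dictionary<string, List<int>>();
            while (seen.Count < factorial)
            {
                Shuffle(seed);
                string key = string.Join(",", seed);
                if (!seen.ContainsKey(key))
                    seen[key] = new List<int>(seed);
            }
            Console.WriteLine($"Collected all {seen.Count} permutations");
        }
    }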
It seems like doing it this way, using some sort of randomizer (which is a black box to me at the moment), might help with getting all the permutations within a decent timeframe for datasets of obscene sizes.
It would be great to get some feedback from the great minds of SO to further analyze this approach, whose objective is to deviate from the traditional brute-force approach prevalent in algorithms of this nature, and the repercussions of implementing such an algorithm in C#.
Thanks
This method of generating all permutations does not fare well compared to the standard known methods.
Say you have n items and M = n! permutations.
This method of generation is expected to produce about M ln M random permutations before discovering all M (the coupon collector's problem).
(See this answer for a possible explanation: Programming Pearls - Random Select algorithm)
Also, what would the hash function be? For a reasonable hash function, we would have to start dealing with very large integer issues pretty soon (for any n > 50 for sure; I don't remember the exact cut-off point).
This method uses up a lot of memory too (the hashtable of all permutations).
Even assuming the hash is perfect, this method would take expected Omega(nM log M) operations and guaranteed Omega(nM) space, while standard well-known methods can do it in O(M) time and O(n) space.
As a starting point I suggest reading: Systematic Generation of All Permutations, which I believe is O(nM) time and O(n) space and still much better than this method.
Note that if one has to generate all permutations, any algorithm will necessarily take Omega(M) steps, so the method I refer to above is optimal!
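For contrast, one standard well-known method (not necessarily the one in the paper cited above) is Heap's algorithm, which produces each successive permutation with a single swap and uses only O(n) space for the recursion; a minimal sketch:

    using System;

    class HeapsAlgorithm
    {
        // Heap's algorithm: visits all k! orderings of a[0..k-1],
        // performing exactly one swap between consecutive permutations.
        static void Permute(int[] a, int k, Action<int[]> visit)
        {
            if (k == 1) { visit(a); return; }
            for (int i = 0; i < k - 1; i++)
            {
                Permute(a, k - 1, visit);
                int j = (k % 2 == 0) ? i : 0; // even k: swap a[i]; odd k: a[0]
                (a[j], a[k - 1]) = (a[k - 1], a[j]);
            }
            Permute(a, k - 1, visit);
        }

        static void Main()
        {
            var items = new[] { 1, 2, 3 };
            Permute(items, items.Length,
                    p => Console.WriteLine(string.Join(" ", p)));
        }
    }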
It seems like a complicated way to randomise the order of the generated permutations. In terms of time efficiency, you can't do much better than the 'brute force' approach.