TA-Lib : Technical Analysis Library, Lookback and unstablePeriod - c#

TA-Lib is a financial/market/OHLC technical analysis library for Java, C++, .NET, etc. It contains ~158 technical functions (EMA, MAMA, MACD, SMA, etc.), each of which has an associated Lookback function:
public static int EmaLookback(int optInTimePeriod)
The Lookback for each function seems to return the minimum length of processing required to compute each function accurately, with the span from startIdx to endIdx equal to the Lookback.
Core.RetCode retcode = Core.Ema(startIdx, endIdx, double[] inReal, optInTimePeriod, ref outBegIdx, ref outNBElement, double[] outReal)
Some of these functions use an array called
Globals.unstablePeriod[0x17]
If this is incorrect in any way please correct me. Now the questions ...
The array unstablePeriod[] initializes to 0 for all entries. Is this what is supposed to occur, if not where in TA-Lib do I find the code or data that it is initialized with?
The code we are writing only requires the single most recent element in the array outReal[0] (or any other "outArray[]"). Is there a way to return a single element (a), or does the spread between the startIdx and the endIdx have to equal the Lookback (b)?
a)
int startIdx = this.ohlcArray.IdxCurrent;
int endIdx = startIdx;
// call to TA Routine goes here
b)
int lookBack = Core.EmaLookback(optInTimePeriod) - 1;
int startIdx = this.ohlcArray.IdxCurrent;
int endIdx = startIdx + lookBack;
// call to TA Routine goes here
retcode = Core.Ema(startIdx, endIdx, inReal, optInTimePeriod, ref outBegIdx, ref outNBElement, outReal);
Why do these routines return 0 for the first outArray[0] element when startIdx is equal to 0?
Since I am getting such odd results: should the startIdx be at the oldest date or the newest date? Meaning, should you process from the past (startIdx) towards now (endIdx), or from now (startIdx) towards the oldest date (endIdx) in time? I am guessing I am computing backwards (b).
a) 2000 (startIdx) - 2003 (endIdx),
or
b) 2003 (startIdx) - 2000 (endIdx)

I already forget C# so might be wrong, but:
The Lookback for each function seems to return the minimum length of processing required to compute each function accurately, with the span from startIdx to endIdx equal to the Lookback.
No, it returns the number of input elements consumed before the first output element can be calculated, which is usually about equal to or more than the timePeriod value. That's all. So if you input 10000 elements (startIdx == 0 and endIdx == 9999) while the Lookback function gives you 25, you'll get 10000-25 = 9975 == outNBElement resulting elements back, and outBegIdx will be 25.
Note: no one guarantees the accuracy of the function. Lookback just lets you calculate the size of the resulting array beforehand, which is critical for C/C++, where fixed-size arrays might be allocated.
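The arithmetic above can be sketched as a small helper. This is only an illustration: LookbackMath and OutputRange are hypothetical names, not part of TA-Lib, and the sketch assumes the first output sits at max(startIdx, lookback).

```csharp
using System;

public static class LookbackMath
{
    // Hypothetical helper (not a TA-Lib API): predicts where output begins and
    // how many output elements a call over inputs [startIdx, endIdx] produces.
    public static (int outBegIdx, int outNBElement) OutputRange(int startIdx, int endIdx, int lookback)
    {
        // Assumption: no output can exist before `lookback` inputs are available.
        int first = Math.Max(startIdx, lookback);
        // If the requested range ends before that point, there is no output at all.
        int count = Math.Max(0, endIdx - first + 1);
        return (first, count);
    }
}
```

For example, LookbackMath.OutputRange(0, 99, 9) describes a 10-period EMA over 100 inputs: output begins at input index 9 and holds 91 elements. A range with startIdx == endIdx == 0 yields a count of 0 under the same assumption.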
The array unstablePeriod[] initializes to 0 for all entries. Is this what is supposed to occur, if not where in TA-Lib do I find the code or data that it is initialized with?
It seems so. It happens in the Core::GlobalsType constructor in TA-Lib-Core.h.
Globals.unstablePeriod is an array that holds the unstable-period settings for some of the TA functions. The values are addressed via the enum class FuncUnstId, which is declared in ta_defs.h. The value 0x17 would correspond to the T3 technical indicator.
In the case of the T3 indicator, the unstable period simply adds a value to the lookback result, so T3's lookback is 6 * (timePeriod-1) + TA_GLOBALS_UNSTABLE_PERIOD[TA_FUNC_UNST_T3]. That's why it's 0 by default. It also shows that function accuracy isn't that simple.
Consider EMA. Its lookback is timePeriod-1 + TA_GLOBALS_UNSTABLE_PERIOD[TA_FUNC_UNST_EMA]. Assume the unstable period is 0, so EMA's lookback is just timePeriod-1. (I would recommend not touching the unstable period without a reason.) According to the code I see, the first EMA result is by default calculated as a simple average of the first timePeriod elements. There is a global compatibility setting, one of {CLASSIC, METASTOCK, TRADESTATION}, that affects the calculation of the first element, but this doesn't change much. Your first element is an average, and the others are calculated as EMA_today = (value_today - old_EMA)*coefficient + old_EMA.
That's the reason you can't just pass a "lookback count" of elements and get an "accurate function result". It won't be accurate: it'll be the first one, not the right one. In the case of EMA it'll always be a simple average, because a simple average is used as the seed for this function. The following results are calculated not only from the first N input elements but also from the previous EMA value, and that previous EMA includes its own previous EMA, and so on. So you can't just pass a lookback count of elements and expect an accurate result. You would need to pass the previous function value too.
Moreover, most rolling indicators behave like that. Their first N values depend heavily on the point from which you started calculating them. This can be addressed with the unstable period, but you'd do better not to limit the input data.
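To make the seeding effect concrete, here is a minimal sketch of an EMA that is seeded with a simple average and then applies the recursion quoted above. It is not TA-Lib's code: EmaSketch is a hypothetical name, and the coefficient 2/(period+1) is an assumption (the common EMA smoothing factor).

```csharp
using System;

public static class EmaSketch
{
    // Illustrative EMA, not TA-Lib's implementation.
    // Returns one value per input from index (period - 1) onward.
    public static double[] Ema(double[] input, int period)
    {
        if (period < 1 || input.Length < period)
            return Array.Empty<double>();

        double k = 2.0 / (period + 1);   // assumed smoothing coefficient
        var output = new double[input.Length - period + 1];

        // Seed: simple average of the first `period` inputs.
        double sum = 0.0;
        for (int i = 0; i < period; i++) sum += input[i];
        double prev = sum / period;
        output[0] = prev;

        // Recursion: every later value depends on the previous EMA value.
        for (int i = period; i < input.Length; i++)
        {
            prev = (input[i] - prev) * k + prev;
            output[i - period + 1] = prev;
        }
        return output;
    }
}
```

With period 3, the full series {10, 2, 3, 4, 5} ends at 4.75, while feeding only its last three values {3, 4, 5} gives 4 (just their average). The "same" most recent EMA differs depending on where the calculation started.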
Why do these routines return 0 for the first outArray[0] element
My guess is that it's because of the -1 in your lookback calculation. Also, 0 is returned for the outBegIdx'th element, not the 0th element.
Is there a way to return a single element(a)
With regular TA-Lib, no; you would need to process a big enough piece of data every time to make sure your rolling results "converge". But I've made a TA-Lib fork for myself here which is designed for this purpose. The main idea is described in the readme. It should be almost as fast as the original, and you can pass just a single value and a state object to get a single result back without recalculating all the data. So the calculation can be paused and continued when new data arrives, without losing the previous computational results.
The problem is that TA-Lib is written in C, and the code of all its C#/Java etc. wrappers is actually generated by its internal tool (ta-gen). No one has ever tried to work with my fork via those wrapper interfaces, so they may be broken. Also, I don't provide precompiled binaries, and as TA-Lib is very old and its infrastructure is quite fancy, it might take some skill and effort to build it on the target platform.

Related

Some clarification about arrays for a project

Or someone to explain how to do this or direct me to some resources. Here's the assignment info:
dataCaptured: integer array - This field will store a history of a limited set of recently captured measurements. Once the array is full, the class should start overwriting the oldest elements while continuing to record the newest captures. (You may need some helper fields/variables to go with this one).
mostRecentMeasure: integer - This field will store the most recent measurement captured for convenience of display.
GetRawData. Return the contents of the dataCaptured array. Which element is the oldest one in your history? Do you need to manipulate these values? How will they be presented to the user?
Add a button to display the measurement history (GetRawData). Where/how will you display this list? What did you make GetRawData() return? Does the list start with the oldest value collected? What happens when your history array fills up and older values are overwritten? What happens when your history has not been filled up yet? Does "0" count as an actual history entry? Does your history display the units in which the data was collected?
So, I want to pass the entire array to a Textbox? How do I do that? Atm, I do have the device up and running and I have a textbox ready for the array, I think. Can someone direct me to some articles on where to pass this information? Thanks for any help.
currently i have:
private double[] dataCaptured;
public double[] GetRawData()
{
    return dataCaptured;
}
very little :(
I'm not going to do your homework for you, so this isn't really an answer; it's more of a tutorial.
First, the problem you have been assigned is a common one. Consider some kind of chemical process. You build a system that periodically measures some attribute of the system (say temperature). Depending on the nature of the process, you may measure it every 200 milliseconds, every 2 seconds, every 2 minutes, whatever. But, you are going to do that forever. You want to keep the data for some time, but not forever - memory is cheap these days, but not that cheap.
So you decide to keep the last N measurements. You can keep that in an N-element array. When you get the first measurement, you put it in the 0-element of the array, the second in the 1-element, and so on. When you get to the Nth measurement, you stick it in the last element. But, with the (N+1)th element, you have a problem. The solution is to put it in the 0-element of the array. The next one goes into the 1-element of the array, and so on.
At that point, you need some bookkeeping. You need to track where you put the most recent element. You also need to keep track of where the oldest element is. You also need a way to get all the measurements in some sort of time order.
This is a circular buffer. The reason it's circular is that you count 0, 1, 2, ..., N-1, 0, 1, 2 and so on, around and around forever.
When you first start filling the buffer with measurements, the oldest is in element-0, and the most recent is wherever you put the last measurement. It continues like that until you have filled the buffer. Once you do that, then the oldest is no longer at 0, it's at the element logically after the most recent one (which is either the next element, or, if you are at the end of buffer, at the 0 position).
You could track the oldest and the newest indexes in the array/buffer. An easier way is to just track whether you are in the initial haven't-filled-buffer-yet phase (a boolean) and the latest index. The oldest is either index 0 (if you haven't filled buffer yet) or the one logically after the newest one. The bookkeeping is a little easier that way.
To do this, you need a way to find the next logical index in the buffer (either index+1 or 0 if you've hit the end of the buffer). You also need a way to get all the data (start at the index logically after the current entry, and then get the next N logical entries).
By the way, I'd use float rather than double to track my measurement. In general, measurements have enough error in themselves that the extra precision offered by double is of no use. By halving the data size, you can make your buffer twice as long (this, by the way, is a consideration in a measurement and control system).
For your measurement source, I'd initially use a number generator that starts at 1.0f and just increments by 1.0f every cycle. It will make debugging much simpler.
private float tempValue = 1.0f;
private float GetNextValue()
{
    // return the current value, then advance, so the first reading is 1.0f
    var current = tempValue;
    tempValue += 1.0f;
    return current;
}
Once you do this for real, you may want random numbers, something like:
private readonly Random _random = new Random();
private const float MinValue = 4.0f;
private const float MaxValue = 20.0f;
private float GetNextValue()
{
    var nextRandom = _random.NextDouble();
    var nextValue = (float) ((MaxValue - MinValue) * nextRandom + MinValue);
    return nextValue;
}
By the way, the choice of 4.0 and 20.0 is intentional - google for "4 20 measurement".
If you want to get fancy, generate a random number stream and run it through a smoothing filter (but that gets more complicated).
You asked about how to get all the values into a single text box. This should do the trick:
// concatenate a collection of float values (rendered with one decimal place)
// into a string (with some spaces separating the values)
private string FormatValuesAsString(IEnumerable<float> values)
{
    var buffer = new StringBuilder();
    foreach (var val in values)
    {
        buffer.Append($"{val:F1} ");
    }
    return buffer.ToString();
}
An IEnumerable<float> represents a collection of float values. An array of floats implements that interface, as does just about every other collection of floats (which means you can pass an array for that parameter: FormatValuesAsString(myArray)). You can also create a function that dynamically generates values (look up yield return). The FormatValuesAsString function will return a string you can shove into a text box. If you do that on each calculation cycle, you'll see your values shift through the text box.
You will get to know the debugger very well doing this exercise. It is very hard to get this right in the first one, or two, or three, or... tries (you'll learn why programmers curse off-by-one errors). Don't give up. It's a good assignment.
Finally, if you get confused, you can post questions as comments to this answer (within reason). Tag me (to make sure I see them).

Named numbers as variables [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I've seen this a couple of times recently in high profile code, where constant values are defined as variables, named after the value, then used only once. I wondered why it gets done?
E.g. Linux Source (resize.c)
unsigned five = 5;
unsigned seven = 7;
E.g. C#.NET Source (Quaternion.cs)
double zero = 0;
double one = 1;
Naming numbers is terrible practice: one day something will need to change, and you'll end up with unsigned five = 7.
If it has some meaning, give it a meaningful name. The 'magic number' five is no improvement over the magic number 5, it's worse because it might not actually equal 5.
This kind of thing generally arises from cargo-cult programming style guidelines, where someone heard that "magic numbers are bad" and forbade their use without fully understanding why.
Well named variables
Giving proper names to variables can dramatically clarify code, such as
constant int MAXIMUM_PRESSURE_VALUE=2;
This gives two key advantages:
The value MAXIMUM_PRESSURE_VALUE may be used in many different places, if for whatever reason that value changes you need to change it in only one place.
Where used it immediately shows what the function is doing, for example the following code obviously checks if the pressure is dangerously high:
if (pressure>MAXIMUM_PRESSURE_VALUE){
    //without me telling you you can guess there'll be some safety protection in here
}
Poorly named variables
However, everything has a counter argument, and what you have shown looks very much like a good idea taken so far that it makes no sense. Defining TWO as 2 doesn't add any value:
constant int TWO=2;
The value TWO may be used in many different places, perhaps to double things, perhaps to access an index. If in the future you need to change the index you cannot just change to int TWO=3; because that would affect all the other (completely unrelated) ways you've used TWO, now you'd be tripling instead of doubling etc
Where used it gives you no more information than if you just used "2". Compare the following two pieces of code:
if (pressure>2){
    //2 might be good, I have no idea what happens here
}
or
if (pressure>TWO){
    //TWO means 2, 2 might be good, I still have no idea what happens here
}
Worse still (as seems to be the case here) TWO may not equal 2, if so this is a form of obfuscation where the intention is to make the code less clear: obviously it achieves that.
The usual reason for this is a coding standard which forbids magic numbers but doesn't count TWO as a magic number; which of course it is! 99% of the time you want to use a meaningful variable name but in that 1% of the time using TWO instead of 2 gains you nothing (Sorry, I mean ZERO).
this code is inspired by Java but is intended to be language agnostic
Short version:
A constant five that just holds the number five is pretty useless. Don't go around making these for no reason (sometimes you have to because of syntax or typing rules, though).
The named variables in Quaternion.cs aren't strictly necessary, but you can make the case for the code being significantly more readable with them than without.
The named variables in ext4/resize.c aren't constants at all. They're tersely-named counters. Their names obscure their function a bit, but this code actually does correctly follow the project's specialized coding standards.
What's going on with Quaternion.cs?
This one's pretty easy.
Right after this:
double zero = 0;
double one = 1;
The code does this:
return zero.GetHashCode() ^ one.GetHashCode();
Without the local variables, what does the alternative look like?
return 0.0.GetHashCode() ^ 1.0.GetHashCode(); // doubles, not ints!
What a mess! Readability is definitely on the side of creating the locals here. Moreover, I think explicitly naming the variables indicates "We've thought about this carefully" much more clearly than just writing a single confusing return statement would.
What's going on with resize.c?
In the case of ext4/resize.c, these numbers aren't actually constants at all. If you follow the code, you'll see that they're counters and their values actually change over multiple iterations of a while loop.
Note how they're initialized:
unsigned three = 1;
unsigned five = 5;
unsigned seven = 7;
Three equals one, huh? What's that about?
See, what actually happens is that update_backups passes these variables by reference to the function ext4_list_backups:
/*
* Iterate through the groups which hold BACKUP superblock/GDT copies in an
* ext4 filesystem. The counters should be initialized to 1, 5, and 7 before
* calling this for the first time. In a sparse filesystem it will be the
* sequence of powers of 3, 5, and 7: 1, 3, 5, 7, 9, 25, 27, 49, 81, ...
* For a non-sparse filesystem it will be every group: 1, 2, 3, 4, ...
*/
static unsigned ext4_list_backups(struct super_block *sb, unsigned *three,
unsigned *five, unsigned *seven)
They're counters that are preserved over the course of multiple calls. If you look at the function body, you'll see that it's juggling the counters to find the next power of 3, 5, or 7, creating the sequence you see in the comment: 1, 3, 5, 7, 9, 25, 27, &c.
Now, for the weirdest part: the variable three is initialized to 1 because 3^0 = 1. The power 0 is a special case, though, because it's the only time 3^x = 5^x = 7^x. Try your hand at rewriting ext4_list_backups to work with all three counters initialized to 1 (3^0, 5^0, 7^0) and you'll see how much more cumbersome the code becomes. Sometimes it's easier to just tell the caller to do something funky (initialize the list to 1, 5, 7) in the comments.
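The counter juggling described above can be paraphrased as follows. This is a C# sketch of the sparse-filesystem case only, not the kernel's code; BackupGroups and Next are hypothetical names, and the counters are assumed to start at 1, 5 and 7 as the quoted comment requires.

```csharp
public static class BackupGroups
{
    // Sketch of the idea, not the kernel function: return the smallest of the
    // three counters and advance that counter to its next power.
    public static uint Next(ref uint three, ref uint five, ref uint seven)
    {
        uint ret;
        if (three <= five && three <= seven)
        {
            ret = three;
            three *= 3;   // 1 -> 3 -> 9 -> 27 ...
        }
        else if (five <= seven)
        {
            ret = five;
            five *= 5;    // 5 -> 25 -> 125 ...
        }
        else
        {
            ret = seven;
            seven *= 7;   // 7 -> 49 -> 343 ...
        }
        return ret;
    }
}
```

Calling Next nine times with the counters initialized to 1, 5, 7 yields 1, 3, 5, 7, 9, 25, 27, 49, 81, the sequence from the kernel comment. Starting three at 1 is what lets the single shared value 1 be emitted exactly once.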
So, is five = 5 good coding style?
Is "five" a good name for the thing that the variable five represents in resize.c? In my opinion, it's not a style you should emulate in just any random project you take on. The simple name five doesn't communicate much about the purpose of the variable. If you're working on a web application or rapidly prototyping a video chat client or something and decide to name a variable five, you're probably going to create headaches and annoyance for anyone else who needs to maintain and modify your code.
However, this is one example where generalities about programming don't paint the full picture. Take a look at the kernel's coding style document, particularly the chapter on naming.
GLOBAL variables (to be used only if you really need them) need to
have descriptive names, as do global functions. If you have a function
that counts the number of active users, you should call that
"count_active_users()" or similar, you should not call it "cntusr()".
...
LOCAL variable names should be short, and to the point. If you have
some random integer loop counter, it should probably be called "i".
Calling it "loop_counter" is non-productive, if there is no chance of it
being mis-understood. Similarly, "tmp" can be just about any type of
variable that is used to hold a temporary value.
If you are afraid to mix up your local variable names, you have another
problem, which is called the function-growth-hormone-imbalance syndrome.
See chapter 6 (Functions).
Part of this is C-style coding tradition. Part of it is purposeful social engineering. A lot of kernel code is sensitive stuff, and it's been revised and tested many times. Since Linux is a big open-source project, it's not really hurting for contributions — in most ways, the bigger challenge is checking those contributions for quality.
Calling that variable five instead of something like nextPowerOfFive is a way to discourage contributors from meddling in code they don't understand. It's an attempt to force you to really read the code you're modifying in detail, line by line, before you try to make any changes.
Did the kernel maintainers make the right decision? I can't say. But it's clearly a purposeful move.
My organisation has certain programming guidelines, one of which concerns the use of magic numbers...
eg:
if (input == 3) //3 what? Elephants?....3 really is the magic number here...
This would be changed to:
#define INPUT_1_VOLTAGE_THRESHOLD 3u
if (input == INPUT_1_VOLTAGE_THRESHOLD) //Not elephants :(
We also have a source file with -200,000 -> 200,000 #defined in the format:
#define MINUS_TWO_ZERO_ZERO_ZERO_ZERO_ZERO -200000
which can be used in place of magic numbers, for example when referencing a specific index of an array.
I imagine this has been done for "Readability".
The numbers 0, 1, ... are integers. Here, the 'named variables' give the integer a different type. It might be more reasonable to declare these as constants (const unsigned five = 5;).
I've used something akin to that a couple times to write values to files:
const int32_t zero = 0 ;
fwrite( &zero, sizeof(zero), 1, myfile );
fwrite accepts a const pointer, but if some function needs a non const pointer, you'll end up using a non const variable.
P.S.: That always keeps me wondering what may be the sizeof zero .
How did you come to the conclusion that it is used only once? It is public; it could be used any number of times from any assembly.
public static readonly Quaternion Zero = new Quaternion();
public static readonly Quaternion One = new Quaternion(1.0f, 1.0f, 1.0f, 1.0f);
The same thing applies to the .NET Framework decimal class, which also exposes public constants like this:
public const decimal One = 1m;
public const decimal Zero = 0m;
Numbers are often given a name when these numbers have special meaning.
For example in the Quaternion case the identity quaternion and unit length quaternion have special meaning and are frequently used in a special context. Namely Quaternion with (0,0,0,1) is an identity quaternion so it's a common practice to define them instead of using magic numbers.
For example
// define as static
static Quaternion Identity = new Quaternion(0,0,0,1);
Quaternion Q1 = Quaternion.Identity;
//or
if ( Q1.Length == Unit ) // not considering floating point error
One of my first programming jobs was on a PDP 11 using Basic. The Basic interpreter allocated memory to every number required, so every time the program mentioned 0, a byte or two would be used to store the number 0. Of course back in those days memory was a lot more limited than today and so it was important to conserve.
Every program in that work place started with:
10 U0%=0
20 U1%=1
That is, for those who have forgotten their Basic:
Line number 10: create an integer variable called U0 and assign it the number 0
Line number 20: create an integer variable called U1 and assign it the number 1
These variables, by local convention, never held any other value, so they were effectively constants. They allowed 0 and 1 to be used throughout the program without wasting any memory.
Aaaaah, the good old days!
Sometimes it's more readable to write:
double pi=3.14; //Constant or even not constant
...
CircleArea=pi*r*r;
instead of:
CircleArea=3.14*r*r;
Maybe you will use pi again (you are not sure, but you think it's possible later, or in other classes if it is public),
and then if you want to change pi=3.14 into pi=3.141593 it's easier.
The same goes for others like e=2.71, Avogadro's number, etc.

Deduce a downward trend from a list of values

In one of the functions in my program that is called every second, I get a float value that represents some X strength.
These values keep coming at intervals, and I am looking to store a history of the last 30 values and check whether there's a downward/decreasing trend in the values (there might be 2 or 3 false positives as well, so those have to be neglected). If there's a downward trend and (the most recent value minus the first value in the history) passes a threshold of 50 (say), I want to call another function. How can such a thing be implemented in C#, with a structure that stores a history of 30 values and then analyses/deduces the downward trend?
You have several choices. If you only need to call this once per second, you can use a Queue<float>, like this:
Queue<float> theQueue = new Queue<float>(30);
// once per second:
// if the queue is full, remove an item
if (theQueue.Count >= 30)
{
    theQueue.Dequeue();
}
// add the new item to the queue
theQueue.Enqueue(newValue);
// now analyze the items in the queue to detect a downward trend
foreach (float f in theQueue)
{
    // do your analysis
}
That's easy to implement and will be plenty fast enough to run once per second.
How you analyze the downward trend really depends on your definition.
It occurs to me that the Queue<float> enumerator might not be guaranteed to return things in the order that they were inserted. If it doesn't, then you'll have to implement your own circular buffer.
I don't know C#, but I'd probably store the values as a List of some sort. Here's some pseudo code for the trend checking:
if last value - first value < threshold
    return
counter = 0
for int i = 1; i < 30; i++
    if val[i] > val[i-1]
        counter++
if counter < false_positive_threshold
    //do stuff
A Circular List is the best data structure to store last X values. There doesn't seem to be one in the standard library but there are several questions on SO how to build one.
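As one possible shape for such a circular list, here is a minimal fixed-capacity sketch. RingBuffer and its members are hypothetical names, not a standard library type; it tracks only the next write slot and the number of valid entries.

```csharp
using System;
using System.Collections.Generic;

public sealed class RingBuffer
{
    private readonly float[] _items;
    private int _next;    // slot the next value will be written to
    private int _count;   // number of valid entries (never exceeds capacity)

    public RingBuffer(int capacity)
    {
        if (capacity < 1) throw new ArgumentOutOfRangeException(nameof(capacity));
        _items = new float[capacity];
    }

    public int Count => _count;

    // Writes the value, overwriting the oldest entry once the buffer is full.
    public void Add(float value)
    {
        _items[_next] = value;
        _next = (_next + 1) % _items.Length;
        if (_count < _items.Length) _count++;
    }

    // Enumerates the stored values from oldest to newest.
    public IEnumerable<float> OldestToNewest()
    {
        // Until the buffer has filled, the oldest entry is at index 0;
        // afterwards it is the slot that will be overwritten next.
        int start = _count < _items.Length ? 0 : _next;
        for (int i = 0; i < _count; i++)
            yield return _items[(start + i) % _items.Length];
    }
}
```

With a capacity of 3, adding 1, 2, 3, 4 leaves the history 2, 3, 4 in oldest-to-newest order. For the question's case, a capacity of 30 would hold the last 30 readings.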
You need to define "downward trend". It seems to me that, according to your current definition (the most recent value minus the first value in the history), the sequence "100, 150, 155, 175, 180, 182" is a downward trend. With that definition you only need the latest and the first value in the history from the circular list, which simplifies things somewhat.
But you probably need a more elaborate algorithm to identify a downward trend.
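One candidate for a more elaborate check is the least-squares slope of the values against their position, combined with the drop threshold from the question. This is a swapped-in technique offered as a sketch, not something proposed in the thread; TrendSketch, Slope, IsDownward and minDrop are hypothetical names.

```csharp
using System.Collections.Generic;

public static class TrendSketch
{
    // Least-squares slope of the values plotted against their index 0, 1, 2, ...
    public static double Slope(IReadOnlyList<float> values)
    {
        int n = values.Count;
        if (n < 2) return 0.0;

        double meanX = (n - 1) / 2.0;
        double meanY = 0.0;
        foreach (float v in values) meanY += v;
        meanY /= n;

        double num = 0.0, den = 0.0;
        for (int i = 0; i < n; i++)
        {
            num += (i - meanX) * (values[i] - meanY);
            den += (i - meanX) * (i - meanX);
        }
        return num / den;
    }

    // Downward if the fitted line falls and oldest minus newest is at least minDrop.
    public static bool IsDownward(IReadOnlyList<float> values, float minDrop)
    {
        if (values.Count < 2) return false;
        float drop = values[0] - values[values.Count - 1];
        return Slope(values) < 0 && drop >= minDrop;
    }
}
```

The values are expected oldest first; a Queue<float> can be passed as theQueue.ToArray(). Because the slope uses every point, a couple of upward blips do not flip the result: {100, 90, 95, 60, 40} has slope -15 and a drop of 60, while {100, 150, 155, 175, 180, 182} is rejected.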

make sure array is sequential in C#

I've got an array of integers we're getting from a third party provider. These are meant to be sequential, but for some reason they miss a number (something throws an exception, it's eaten, and the loop continues, missing that index). This causes our system some grief, and I'm trying to ensure that the array we're getting is indeed sequential.
The numbers start from varying offsets (sometimes 1000, sometimes 5820, others 0), but whatever the start, it's meant to go up from there.
What's the fastest method to verify the array is sequential? Even though it now seems to be a required step, I also have to make sure it doesn't take too long to verify. I am currently starting at the first index, picking up the number, adding one, and making sure the next index contains that, etc.
EDIT:
The reason the system fails is that, because of the way people use the system, it may not always return the tokens the way they were picked initially - long story. Unfortunately, the data can't be corrected until it gets to our layer.
If you're sure that the array is sorted and has no duplicates, you can just check:
array[array.Length - 1] == array[0] + array.Length - 1
I think it's worth addressing the bigger issue here: what are you going to do if the data doesn't meet your requirements (sequential, no gaps)?
If you're still going to process the data, then you should probably invest your time in making your system more resilient to gaps or missing entries in the data.
If you need to process the data and it must be clean, you should work with the vendor to make sure they send you well-formed data.
If you're going to skip processing and report an error, then asserting the precondition of no gaps may be the way to go. In C# there's a number of different things you could do:
If the data is sorted and has no dups, just check if LastValue == FirstValue + ArraySize - 1.
If the data is not sorted but dup free, just sort it and do the above.
If the data is not sorted, has dups and you actually want to detect the gaps, I would use LINQ.
List<int> gaps = Enumerable.Range(array.Min(), array.Length).Except(array).ToList();
or better yet (since the high-end value may be out of range):
int minVal = array.Min();
int maxVal = array.Max();
List<int> gaps = Enumerable.Range(minVal, maxVal-minVal+1).Except(array).ToList();
By the way, the whole concept of being passed a dense, gapless, array of integers is a bit odd for an interface between two parties, unless there's some additional data that associated with them. If there's no other data, why not just send a range {min,max} instead?
for (int i = a.Length - 2; 0 <= i; --i)
{
    if (a[i] + 1 != a[i + 1]) return false; // gap, duplicate, or out of order
}
return true; // in sequence
Gabe's way is definitely the fastest if the array is sorted. If the array is not sorted, then it would probably be best to sort the array (with merge/shell sort (or something of similar speed)) and then use Gabe's way.

Fast Algorithm for computing percentiles to remove outliers

I have a program that needs to repeatedly compute the approximate percentile (order statistic) of a dataset in order to remove outliers before further processing. I'm currently doing so by sorting the array of values and picking the appropriate element; this is doable, but it's a noticeable blip on the profiles despite being a fairly minor part of the program.
More info:
The data set contains on the order of up to 100000 floating point numbers, and assumed to be "reasonably" distributed - there are unlikely to be duplicates nor huge spikes in density near particular values; and if for some odd reason the distribution is odd, it's OK for an approximation to be less accurate since the data is probably messed up anyhow and further processing dubious. However, the data isn't necessarily uniformly or normally distributed; it's just very unlikely to be degenerate.
An approximate solution would be fine, but I do need to understand how the approximation introduces error to ensure it's valid.
Since the aim is to remove outliers, I'm computing two percentiles over the same data at all times: e.g. one at 95% and one at 5%.
The app is in C# with bits of heavy lifting in C++; pseudocode or a preexisting library in either would be fine.
An entirely different way of removing outliers would be fine too, as long as it's reasonable.
Update: It seems I'm looking for an approximate selection algorithm.
Although this is all done in a loop, the data is (slightly) different every time, so it's not easy to reuse a datastructure as was done for this question.
Implemented Solution
Using the wikipedia selection algorithm as suggested by Gronim reduced this part of the run-time by about a factor 20.
Since I couldn't find a C# implementation, here's what I came up with. It's faster even for small inputs than Array.Sort; and at 1000 elements it's 25 times faster.
public static double QuickSelect(double[] list, int k) {
    return QuickSelect(list, k, 0, list.Length);
}
public static double QuickSelect(double[] list, int k, int startI, int endI) {
    while (true) {
        // Assume startI <= k < endI
        int pivotI = (startI + endI) / 2; //arbitrary, but good if sorted
        int splitI = partition(list, startI, endI, pivotI);
        if (k < splitI)
            endI = splitI;
        else if (k > splitI)
            startI = splitI + 1;
        else //if (k == splitI)
            return list[k];
    }
    //when this returns, all elements of list[i] <= list[k] iff i <= k
}
static int partition(double[] list, int startI, int endI, int pivotI) {
    double pivotValue = list[pivotI];
    list[pivotI] = list[startI];
    list[startI] = pivotValue;
    int storeI = startI + 1;//no need to store # pivot item, it's good already.
    //Invariant: startI < storeI <= endI
    while (storeI < endI && list[storeI] <= pivotValue) ++storeI; //fast if sorted
    //now storeI == endI || list[storeI] > pivotValue
    //so elem #storeI is either irrelevant or too large.
    for (int i = storeI + 1; i < endI; ++i)
        if (list[i] <= pivotValue) {
            list.swap_elems(i, storeI);
            ++storeI;
        }
    int newPivotI = storeI - 1;
    list[startI] = list[newPivotI];
    list[newPivotI] = pivotValue;
    //now [startI, newPivotI] are <= to pivotValue && list[newPivotI] == pivotValue.
    return newPivotI;
}
//extension method: must be declared inside a static class
static void swap_elems(this double[] list, int i, int j) {
    double tmp = list[i];
    list[i] = list[j];
    list[j] = tmp;
}
Thanks, Gronim, for pointing me in the right direction!
The histogram solution from Henrik will work. You can also use a selection algorithm to efficiently find the k largest or smallest elements in an array of n elements in O(n). To use this for the 95th percentile, set k = 0.05n and find the k largest elements; the smallest of those is the 95th percentile.
Reference:
http://en.wikipedia.org/wiki/Selection_algorithm#Selecting_k_smallest_or_largest_elements
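For the C++ side of the app, a minimal sketch of this idea can lean on std::nth_element, the standard library's selection routine (linear time on average). The helper names kth_smallest and percentile_bounds are made up for illustration, and the rank convention used here (sorted index floor(p * (n - 1))) is an assumption; other percentile definitions interpolate between neighbours.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Returns the value at sorted position k (0-based) without fully sorting.
// The vector is taken by value, so the caller's data is left untouched.
double kth_smallest(std::vector<double> data, std::size_t k) {
    assert(k < data.size());
    std::nth_element(data.begin(), data.begin() + k, data.end());
    return data[k];
}

// Hypothetical helper: lower and upper cut-offs for outlier removal.
// p is the tail fraction, e.g. 0.05 for the 5th and 95th percentiles.
std::pair<double, double> percentile_bounds(std::vector<double> data, double p) {
    assert(!data.empty() && p >= 0.0 && p <= 0.5);
    const std::size_t n = data.size();
    const std::size_t lo = static_cast<std::size_t>(p * static_cast<double>(n - 1));
    const std::size_t hi = (n - 1) - lo;

    // Select the upper rank first. Afterwards the prefix [0, hi) holds the
    // hi smallest values, so the lower rank only needs a search in that prefix.
    std::nth_element(data.begin(), data.begin() + hi, data.end());
    const double upper = data[hi];
    if (lo < hi) {
        std::nth_element(data.begin(), data.begin() + lo, data.begin() + hi);
    }
    return std::make_pair(data[lo], upper);
}
```

Selecting the upper rank first means the second selection runs on an already partitioned prefix, so both cut-offs cost roughly one and a bit passes of selection rather than two full ones.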
According to its creator, a SoftHeap can be used to:
compute exact or approximate medians and percentiles optimally. It is also useful for approximate sorting...
I used to identify outliers by calculating the standard deviation. Everything more than 2 (or 3) times the standard deviation away from the average is an outlier. For normally distributed data, 2 times covers about 95%.
Since you are calculating the average anyway, the standard deviation is also easy and fast to calculate.
You could also use only a subset of your data to calculate the numbers.
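As a rough sketch of the standard-deviation approach, the mean and standard deviation can be accumulated in a single pass with Welford's update, which avoids the cancellation problems of the naive sum-of-squares formula. The Limits struct and the sigma_limits name are invented for this example, and it uses the population standard deviation (dividing by n), which is an assumption; dividing by n - 1 makes little difference at this data size.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type: the statistics plus the outlier cut-offs.
struct Limits {
    double mean;
    double stddev;
    double lower;
    double upper;
};

// One pass over the data using Welford's running update.
// k is the number of standard deviations to allow on each side (e.g. 2 or 3).
Limits sigma_limits(const std::vector<double>& data, double k) {
    assert(data.size() >= 2);
    double mean = 0.0;
    double m2 = 0.0;  // running sum of squared deviations from the current mean
    std::size_t n = 0;
    for (double x : data) {
        ++n;
        const double delta = x - mean;
        mean += delta / static_cast<double>(n);
        m2 += delta * (x - mean);
    }
    const double stddev = std::sqrt(m2 / static_cast<double>(n));  // population form
    return Limits{mean, stddev, mean - k * stddev, mean + k * stddev};
}
```

Anything outside [lower, upper] would then be treated as an outlier. Note that the "about 95%" figure only holds if the data is close to normal.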
You could estimate your percentiles from just a part of your dataset, like the first few thousand points.
The Glivenko–Cantelli theorem ensures that this would be a fairly good estimate, if you can assume your data points to be independent.
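To put a number on "fairly good": the Dvoretzky-Kiefer-Wolfowitz inequality bounds the probability that the empirical CDF of n independent draws is off by more than eps anywhere by 2*exp(-2*n*eps^2). The sketch below turns that into a sample size and estimates a percentile from a random sample. The function names are made up, and drawing indices at random with replacement is my assumption for getting independent draws; simply taking the first few thousand points only works if the order of the data is unrelated to its values.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Smallest n for which 2*exp(-2*n*eps^2) <= delta.
// eps is the tolerated error in rank (as a fraction), delta the failure probability.
std::size_t dkw_sample_size(double eps, double delta) {
    assert(eps > 0.0 && delta > 0.0 && delta < 1.0);
    return static_cast<std::size_t>(
        std::ceil(std::log(2.0 / delta) / (2.0 * eps * eps)));
}

// Estimates the value at fraction p (0..1) from m points drawn uniformly
// at random, with replacement, from data.
double sampled_percentile(const std::vector<double>& data, std::size_t m,
                          double p, unsigned seed) {
    assert(!data.empty() && m > 0 && p >= 0.0 && p <= 1.0);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, data.size() - 1);
    std::vector<double> sample(m);
    for (double& s : sample) s = data[pick(rng)];

    const std::size_t k = static_cast<std::size_t>(p * static_cast<double>(m - 1));
    std::nth_element(sample.begin(), sample.begin() + k, sample.end());
    return sample[k];
}
```

For example, dkw_sample_size(0.02, 0.05) asks for a sample large enough that, with probability at least 95%, the value reported as the 95th percentile really lies between roughly the 93rd and 97th percentiles of the full data. The bound is on the rank, not on the value itself, so how far off the value can be still depends on how dense the data is near that percentile.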
Divide the interval between minimum and maximum of your data into (say) 1000 bins and calculate a histogram. Then build partial sums and see where they first exceed 5000 or 95000 (that is, 5% and 95% of 100000 points).
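A minimal sketch of the histogram approach, assuming equal-width bins; the name histogram_percentile is invented here. It returns the upper edge of the bin in which the running count first reaches the target, so (up to floating-point rounding at bin edges) the answer is off by at most one bin width, (max - min) / bins. That makes the error easy to reason about, but it also means a single extreme value stretches every bin.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Approximates the value at fraction p (0..1) using `bins` equal-width buckets.
double histogram_percentile(const std::vector<double>& data, double p,
                            std::size_t bins) {
    assert(!data.empty() && bins > 0 && p >= 0.0 && p <= 1.0);
    const auto mm = std::minmax_element(data.begin(), data.end());
    const double lo = *mm.first;
    const double hi = *mm.second;
    if (hi == lo) return lo;  // all values identical

    const double width = (hi - lo) / static_cast<double>(bins);
    std::vector<std::size_t> counts(bins, 0);
    for (double x : data) {
        std::size_t b = static_cast<std::size_t>((x - lo) / width);
        if (b >= bins) b = bins - 1;  // x == hi (or rounding) goes in the last bin
        ++counts[b];
    }

    // Walk the partial sums until they reach the target count.
    const double target = p * static_cast<double>(data.size());
    std::size_t running = 0;
    for (std::size_t b = 0; b < bins; ++b) {
        running += counts[b];
        if (static_cast<double>(running) >= target) {
            return lo + width * static_cast<double>(b + 1);
        }
    }
    return hi;
}
```

Calling it twice (p = 0.05 and p = 0.95) rebuilds the histogram each time; in real code the counts would be built once and scanned for both targets.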
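There are a couple basic approaches I can think of. First is to compute the range (by finding the highest and lowest values), project each element onto that range ((x - min) / range) and throw out any that evaluate to lower than .05 or higher than .95. Note that this value is a position within the range rather than a rank-based percentile, so the two only coincide for roughly uniform data.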
The second is to compute the mean and standard deviation. A span of 2 standard deviations from the mean (in both directions) will enclose about 95% of a normally-distributed sample space, meaning your outliers would be in the <2.5 and >97.5 percentiles. Calculating the mean of a series is linear, as is the standard dev (square root of the mean of the squared differences between each element and the mean). Then, subtract 2 sigmas from the mean, and add 2 sigmas to the mean, and you've got your outlier limits.
Both of these will compute in roughly linear time; the first one requires two passes, the second one takes three (once you have your limits you still have to discard the outliers). Since this is a list-based operation, I do not think you will find anything with logarithmic or constant complexity; any further performance gains would require either optimizing the iteration and calculation, or introducing error by performing the calculations on a sub-sample (such as every third element).
A good general answer to your problem seems to be RANSAC.
Given a model, and some noisy data, the algorithm efficiently recovers the parameters of the model.
You will have to choose a simple model that can map your data. Anything smooth should be fine. Let's say a mixture of a few Gaussians. RANSAC will set the parameters of your model and estimate a set of inliers at the same time. Then throw away whatever doesn't fit the model properly.
You could filter out 2 or 3 standard deviations even if the data is not normally distributed; at least it will be done in a consistent manner, which should be important.
As you remove the outliers, the std dev will change, so you could do this in a loop until the change in std dev is minimal. Whether or not you want to do this depends upon why you are manipulating the data this way. Some statisticians have major reservations about removing outliers. But some remove the outliers to prove that the data is fairly normally distributed.
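The loop described above could look roughly like the following sketch. The name sigma_clip, the relative tolerance used as the "change is minimal" test, and the iteration cap are all assumptions made for the example, not part of the suggestion itself.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Repeatedly drops points farther than k standard deviations from the mean.
// Stops when the standard deviation changes by less than rel_tol (relative),
// when a pass removes nothing, or after max_iter passes.
std::vector<double> sigma_clip(std::vector<double> data, double k,
                               double rel_tol, int max_iter) {
    assert(k > 0.0 && rel_tol >= 0.0);
    double prev_sd = -1.0;  // negative means "no previous pass yet"
    for (int iter = 0; iter < max_iter && data.size() >= 2; ++iter) {
        const double n = static_cast<double>(data.size());
        double sum = 0.0;
        for (double x : data) sum += x;
        const double mean = sum / n;

        double sq = 0.0;
        for (double x : data) sq += (x - mean) * (x - mean);
        const double sd = std::sqrt(sq / n);

        if (prev_sd >= 0.0 && std::fabs(prev_sd - sd) <= rel_tol * prev_sd) break;

        std::vector<double> kept;
        kept.reserve(data.size());
        for (double x : data) {
            if (std::fabs(x - mean) <= k * sd) kept.push_back(x);
        }
        if (kept.size() == data.size()) break;  // nothing was removed
        data.swap(kept);
        prev_sd = sd;
    }
    return data;
}
```

Each pass is linear, and in practice only a handful of passes are needed, but on heavy-tailed data the loop can keep trimming legitimate points, which is one reason for the reservations mentioned above.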
Not an expert, but my memory suggests:
to determine percentile points exactly you need to sort and count
taking a sample from the data and calculating the percentile values sounds like a good plan for decent approximation if you can get a good sample
if not, as suggested by Henrik, you can avoid the full sort if you do the buckets and count them
One set of data of 100k elements takes almost no time to sort, so I assume you have to do this repeatedly. If the data set is the same set just updated slightly, you're best off building a tree (O(N log N)) and then removing and adding new points as they come in (O(K log N) where K is the number of points changed). Otherwise, the kth largest element solution already mentioned gives you O(N) for each dataset.
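If the data really does change only slightly between iterations, the incremental idea can be sketched as below. To keep it short, this deliberately swaps the balanced tree for a sorted std::vector: lookups are O(log N) but each update shifts elements, so it is O(N) per changed point rather than O(log N). For 100k doubles that shift is a contiguous memory move and tends to be cheap, but that is an assumption worth measuring. The class name SortedSample is made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Keeps the data sorted across iterations so percentiles are a direct index.
class SortedSample {
public:
    explicit SortedSample(std::vector<double> values) : v_(std::move(values)) {
        std::sort(v_.begin(), v_.end());  // O(N log N), done once
    }

    // Replaces one occurrence of old_value with new_value.
    // Returns false (and changes nothing) if old_value is not present.
    bool replace(double old_value, double new_value) {
        auto it = std::lower_bound(v_.begin(), v_.end(), old_value);
        if (it == v_.end() || *it != old_value) return false;
        v_.erase(it);
        v_.insert(std::upper_bound(v_.begin(), v_.end(), new_value), new_value);
        return true;
    }

    // Value at fraction p (0..1), using sorted index floor(p * (n - 1)).
    double percentile(double p) const {
        assert(!v_.empty() && p >= 0.0 && p <= 1.0);
        return v_[static_cast<std::size_t>(p * static_cast<double>(v_.size() - 1))];
    }

private:
    std::vector<double> v_;
};
```

If most of the points change on every iteration, this loses to simply running a selection algorithm on the fresh data.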
