Some clarification about arrays for a project - c#

Or someone to explain how to do this or direct me to some resources. Here's the assignment info:
dataCaptured: integer array - This field will store a history of a limited set of recently captured measurements. Once the array is full, the class should start overwriting the oldest elements while continuing to record the newest captures. (You may need some helper fields/variables to go with this one).
mostRecentMeasure: integer - This field will store the most recent measurement captured for convenience of display.
GetRawData. Return the contents of the dataCaptured array. Which element is the oldest one in your history? Do you need to manipulate these values? How will they be presented to the user?
Add a button to display the measurement history (GetRawData). Where/how will you display this list? What did you make GetRawData() return? Does the list start with the oldest value collected? What happens when your history array fills up and older values are overwritten? What happens when your history has not been filled up yet? Does "0" count as an actual history entry? Does your history display the units in which the data was collected?
So, I want to pass the entire array to a Textbox? How do I do that? Atm, I do have the device up and running and I have a textbox ready for the array, I think. Can someone direct me to some articles on where to pass this information? Thanks for any help.
currently i have:
private double[] dataCaptured;
public double[] GetRawData()
{
return dataCaptured;
}
very little :(

I'm not going to do your homework for you, so this isn't really an answer; it's more of a tutorial.
First, the problem you have been assigned is a common one. Consider some kind of chemical process. You build a system that periodically measures some attribute of the system (say temperature). Depending on the nature of the process, you may measure it every 200 milliseconds, every 2 seconds, every 2 minutes, whatever. But, you are going to do that forever. You want to keep the data for some time, but not forever - memory is cheap these days, but not that cheap.
So you decide to keep the last N measurements. You can keep that in an N-element array. When you get the first measurement, you put it in the 0-element of the array, the second in the 1-element, and so on. When you get to the Nth measurement, you stick it in the last element. But, with the (N+1)th element, you have a problem. The solution is to put it in the 0-element of the array. The next one goes into the 1-element of the array, and so on.
At that point, you need some bookkeeping. You need to track where you put the most recent element. You also need to keep track of where the oldest element is. You also need a way to get all the measurements in some sort of time order.
This is a circular buffer. The reason it's circular is because you count 0, 1, 2 ... N-1, 0, 1, 2 and so on, around and around forever.
When you first start filling the buffer with measurements, the oldest is in element-0, and the most recent is wherever you put the last measurement. It continues like that until you have filled the buffer. Once you do that, then the oldest is no longer at 0, it's at the element logically after the most recent one (which is either the next element, or, if you are at the end of buffer, at the 0 position).
You could track the oldest and the newest indexes in the array/buffer. An easier way is to just track whether you are in the initial haven't-filled-buffer-yet phase (a boolean) and the latest index. The oldest is either index 0 (if you haven't filled buffer yet) or the one logically after the newest one. The bookkeeping is a little easier that way.
To do this, you need a way to find the next logical index in the buffer (either index+1 or 0 if you've hit the end of the buffer). You also need a way to get all the data (start at the index logically after the current entry, and then get the next N logical entries).
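If you want something to check your own attempt against, here is a minimal sketch of the bookkeeping described above: one "latest index" field plus a "has the buffer filled yet" flag. The class and member names (MeasurementHistory, Add, OldestToNewest) are made up for illustration and are not part of the assignment; it also assumes a capacity of at least 1.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class; names are illustrative only.
public class MeasurementHistory
{
    private readonly float[] _buffer;
    private int _latest = -1;   // index of the most recent entry (-1 = nothing stored yet)
    private bool _filled;       // true once every slot has been written at least once

    public MeasurementHistory(int capacity)
    {
        if (capacity < 1) throw new ArgumentOutOfRangeException(nameof(capacity));
        _buffer = new float[capacity];
    }

    // Next logical index: index + 1, or 0 when we run off the end.
    private int Next(int index) => (index + 1) % _buffer.Length;

    public void Add(float value)
    {
        _latest = Next(_latest);
        _buffer[_latest] = value;
        if (_latest == _buffer.Length - 1) _filled = true;
    }

    // Yields the stored values from oldest to newest.
    public IEnumerable<float> OldestToNewest()
    {
        if (_latest < 0) yield break;                       // nothing recorded yet
        int count = _filled ? _buffer.Length : _latest + 1; // unfilled slots are not history
        int index = _filled ? Next(_latest) : 0;            // oldest entry
        for (int i = 0; i < count; i++)
        {
            yield return _buffer[index];
            index = Next(index);
        }
    }
}
```

With a capacity of 3, adding 1, 2, 3, 4 and then enumerating gives 2, 3, 4: the 1 has been overwritten, and the oldest surviving value comes first.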
By the way, I'd use float rather than double to track my measurement. In general, measurements have enough error in themselves that the extra precision offered by double is of no use. By halving the data size, you can make your buffer twice as long (this, by the way, is a consideration in a measurement and control system).
For your measurement source, I'd initially use a number generator that starts at 1.0f and just increments by 1.0f every cycle. It will make debugging much simpler.
private float tempValue = 0.0f; // incremented before it is returned, so the first value is 1.0f
private float GetNextValue()
{
tempValue += 1.0f;
return tempValue;
}
Once you do this for real, you may want random numbers, something like:
private readonly Random _random = new Random();
private const float MinValue = 4.0f;
private const float MaxValue = 20.0f;
private float GetNextValue()
{
var nextRandom = _random.NextDouble();
var nextValue = (float) ((MaxValue - MinValue) * nextRandom + MinValue);
return nextValue;
}
By the way, the choice of 4.0 and 20.0 is intentional - google for "4 20 measurement".
If you want to get fancy, generate a random number stream and run it through a smoothing filter (but that gets more complicated).
You asked about how to get all the values into a single text box. This should do the trick:
// concatenate a collection of float values (rendered with one decimal place)
// into a string (with some spaces separating the values)
private string FormatValuesAsString(IEnumerable<float> values)
{
var buffer = new StringBuilder();
foreach (var val in values)
{
buffer.Append($"{val:F1} ");
}
return buffer.ToString();
}
An IEnumerable<float> represents a collection of float values. An array of floats implements that interface, as does just about every other collection of floats, which means you can pass an array for that parameter: FormatValuesAsString(myArray). You can also create a function that dynamically generates values (look up yield return). The FormatValuesAsString function (which needs using System.Text and using System.Collections.Generic) will return a string you can shove into a text box. If you do that on each calculation cycle, you'll see your values shift through the text box.
You will get to know the debugger very well doing this exercise. It is very hard to get this right in the first one, or two, or three, or... tries (you'll learn why programmers curse off-by-one errors). Don't give up. It's a good assignment.
Finally, if you get confused, you can post questions as comments to this answer (within reason). Tag me (to make sure I see them).

Related

TA-Lib : Technical Analysis Library, Lookback and unstablePeriod

TA-Lib is a financial/market/OHLC technical analysis library for Java, C++, .Net, etc. In it are ~158 technical functions (EMA, MAMA, MACD, SMA, etc.), each of which has an associated Lookback function
public static int EmaLookback(int optInTimePeriod)
The Lookback for each function seems to return the minimum length of processing required to compute each function accurately. With the startIdx to the endIdx equal to the Lookback.
Core.RetCode retcode = Core.Ema(startIdx, endIdx, double inReal, optInTimePeriod, ref outBegIdx, ref outNBElement, double outReal)
Some of these functions use an array called
Globals.unstablePeriod[0x17]
If this is incorrect in any way please correct me. Now the questions ...
The array unstablePeriod[] initializes to 0 for all entries. Is this what is supposed to occur, if not where in TA-Lib do I find the code or data that it is initialized with?
The code we are writing only requires the single most recent element in the array outReal[0] (or any other "outArray[]"). Is there a way to return a single element (a), or does the spread between the startIdx and the endIdx have to equal the Lookback (b)?
a)
int startIdx = this.ohlcArray.IdxCurrent;
int endIdx = startIdx;
// call to TA Routine goes here
b)
int lookBack = Core.EmaLookback(optInTimePeriod) - 1;
int startIdx = this.ohlcArray.IdxCurrent;
int endIdx = startIdx + lookBack;
// call to TA Routine goes here
retcode = Core.Ema(startIdx, endIdx, inReal, optInTimePeriod, ref outBegIdx, ref outNBElement, outReal);
Why do these routines return 0 for the first outArray[0] element when startIdx is equal to 0?
Since I am getting such odd results. Should the startIdx be at the oldest date or the newest date? Meaning should you process from the past (startIdx) towards now (endIdx), or from now (startIdx) towards the oldest date(endIdx) in time? I am guessing I am computing backwards (b)
a) 2000 (startIdx) - 2003 (endIdx),
or
b) 2003 (startIdx) - 2000 (endIdx)
I already forget C# so might be wrong, but:
The Lookback for each function seems to return the minimum length of processing required to compute each function accurately. With the startIdx to the endIdx equal to the Lookback.
No, it returns the number of input elements consumed before the first output element can be calculated, which is usually close to the timePeriod value (timePeriod-1 for EMA). That's all. So if you input 10000 elements (startIdx == 0 and endIdx == 9999) while the Lookback function gives you 25, you'll get 10000 - 25 = 9975 == outNBElement resulting elements back, and outBegIdx will be 25.
Note: no one guarantees the accuracy of the function. Lookback just lets you calculate the size of the resulting array beforehand, which is critical for C/C++ where fixed-size arrays might be allocated.
The array unstablePeriod[] initializes to 0 for all entries. Is this what is supposed to occur, if not where in TA-Lib do I find the code or data that it is initialized with?
Seems like that. It happens in Core::GlobalsType constructor in TA-Lib-Core.h
Globals.unstablePeriod is an array that holds the unstable-period settings for some of the TA functions. The values are addressed via the enum class FuncUnstId, which is declared in ta_defs.h. The 0x17 value would correspond to the T3 technical indicator.
In the case of the T3 indicator, this unstable period just adds a value to the lookback result. So T3's lookback is 6 * (timePeriod-1) + TA_GLOBALS_UNSTABLE_PERIOD[TA_FUNC_UNST_T3]. That's why it's 0 by default. It also shows that function accuracy isn't that simple.
Consider EMA. Its lookback is timePeriod-1 + TA_GLOBALS_UNSTABLE_PERIOD[TA_FUNC_UNST_EMA]. Assume the unstable period is 0, so EMA's lookback is just timePeriod-1. (I would recommend not touching the unstable period without a reason.) According to the code I see, its first EMA result is calculated as a simple average of the first "lookback count" of elements by default. There is a global compatibility setting that can be one of {CLASSIC, METASTOCK, TRADESTATION} and affects how the first element is calculated, but this doesn't change a lot. Your first element is an average and the others are calculated as EMA_today = (value_today - old_EMA)*coefficient + old_EMA.
That's the reason you can't just pass "lookback count" of elements and get an "accurate function result". It won't be accurate: it'll be the first one, not the right one. In the case of EMA it'll always be a simple average, as a simple average is used as the seed for this function. The following results are calculated not only over the first N input elements but also include the previous EMA value. And this previous EMA includes its previous EMA, etc. So you can't just pass a lookback count of elements and expect an accurate result. You would need to pass the previous function value too.
Moreover, most rolling indicators behave like that. Their first N values depend heavily on the point from which you started to calculate them. This might be addressed with the unstable period, but you'd better not limit the input data.
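To make the seeding effect concrete, here is a small sketch. It is not TA-Lib's code: it assumes the seed is the simple average of the first `period` inputs, an unstable period of 0, and the common smoothing coefficient 2 / (period + 1). The class and method names are made up for illustration.

```csharp
using System;

public static class EmaSketch
{
    // Simplified EMA: seed with a simple average, then apply
    // ema = (value - previousEma) * k + previousEma for each later input.
    public static double[] Ema(double[] input, int startIdx, int period)
    {
        int lookback = period - 1;                          // assumes unstable period == 0
        int count = input.Length - startIdx - lookback;     // number of outputs
        if (count <= 0) return Array.Empty<double>();

        double k = 2.0 / (period + 1);                      // assumed smoothing coefficient

        double seed = 0.0;
        for (int i = startIdx; i < startIdx + period; i++) seed += input[i];
        seed /= period;

        var output = new double[count];
        output[0] = seed;                                   // first output is just an average
        for (int i = 1; i < count; i++)
        {
            double value = input[startIdx + lookback + i];
            output[i] = (value - output[i - 1]) * k + output[i - 1];
        }
        return output;
    }
}
```

For the input { 1, 2, 3, 10, 2, 6 } with period 3, starting at index 0 gives 5 as the value for the last element, while starting at index 3 (only the last three inputs) gives 6 for that same element, because in the second case the result is only the seed average.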
Why do these routines return 0 for the first outArray[0] element
My guess is that it's because of the -1 in your lookback calculation. Also, 0 is returned for the outBegIdx'th element, not the 0 element.
Is there a way to return a single element(a)
With regular TA-Lib - no, or you need to process big enough piece of data every time to make sure your rolling results do "converge". But I've made a TA-Lib fork for myself here which is designed for this purpose. The main idea is described in readme. It should be almost as fast as original and you can just pass single value and state object to get single result back without recalculation of all data. So calculation might be paused and continued when new data arrives without loss of previous computational results.
The problem is that TA-Lib is written in C, and the code of all its C#/Java etc. wrappers is actually generated by its internal tool (ta-gen). No one has ever tried to work with my fork via those wrapper interfaces, so they may be broken. Also, I don't provide precompiled binaries, and as TA-Lib is very old and its infrastructure is quite fancy, it might require some skill and effort to build it on the target platform.

Finding insertion points in a sorted array faster than O(n)?

This is for game programming. Let's say I have a Unit that can track 10 enemies within its range. Each enemy has a priority between 0-100. So the array currently looks like this (numbers represent priority):
Enemy - 96
Enemy - 78
Enemy - 77
Enemy - 73
Enemy - 61
Enemy - 49
Enemy - 42
Enemy - 36
Enemy - 22
Enemy - 17
Say a new enemy wanders within range and has a priority of 69, this will be inserted between 73 and 61, and 17 will be removed from the array (Well, the 17 would be removed before the insertion, I believe).
Is there any way to figure out that it needs to be inserted between 73 and 61 without an O(n) operation?
I feel you're asking the wrong question here. You have to first find the spot to insert into and then insert the element. These two operations are tied together, and I feel you shouldn't be asking how to make one faster without the other. It'll make sense why towards the end of this answer. But I'm addressing the question of actually inserting faster.
Short Answer: No
Answer you'll get from someone that's too smart for themselves:
The only way to accomplish this is to not use an array. In an array, unless you are inserting at the end, the insert will be O(n). This is because the array's elements occupy contiguous space in memory. That is how you are able to reference a particular element in O(1) time: you know exactly where that element is. The cost is that to insert in the middle you need to move, on average, half the elements in the array. So while you can look up the position with a binary search in O(log n) time, you cannot insert in that time.
So if you're going to do anything, you'll need a different data structure. A simple binary tree may be the solution; it will do the insertion in O(log n) time. On the other hand, if you're feeding it sorted data you have to worry about tree balancing, so you might need a red-black tree. Or, if you are always popping the element that is the closest or the furthest, you can use a heap (the structure behind heap sort). A heap is the standard way to implement a priority queue. It has the additional advantage of fitting a tree structure in an array, so it has far better spatial locality (more on this later).
The truth:
You'll most likely have a dozen, maybe a few dozen, enemies in the vicinity at most. At that level asymptotic performance does not matter, because it describes behavior for large values of 'n'. What you're looking at is a religious adherence to your CS 201 professor's calls about Big O. Linear search and insertion will be the fastest method, and the answer to "will it scale?" is "who the hell cares". If you try to implement a complicated algorithm to make it scale, you will almost always be slower, since what determines your speed is not the software, it is the hardware, and you're better off sticking to things that the hardware knows how to deal with well: linearly going down memory. In fact, after the prefetchers do their thing, it would be faster to go linearly through each element, even if there were a couple of thousand elements, than to implement a red-black tree, because a data structure like a tree allocates memory all over the place without any regard for spatial locality. And the calls to allocate more memory for a node are in themselves more expensive than the time it takes to read through a thousand elements. Which is why graphics cards use insertion sort all over the place.
Heap Sort
Heap sort might actually be faster depending on the input data, since it uses a linear array, although it may confuse the prefetchers, so it's hard to say. The only limitation is that you can only pop the highest priority element. Obviously you can define highest priority to be either the lowest or the largest element. Heap sort is too fancy for me to try to describe here; just Google it. It does separate insertion and removal into two O(log(n)) operations. The biggest downside of heap sort is that it will seriously decrease the debuggability of the code. A heap is not a sorted array. It has an order to it, but besides heap sort being a complicated, unintuitive algorithm, it is not readily apparent to a human being whether a heap is set up correctly. So you would introduce more bugs for, in the best case, little benefit. Hell, the last time I had to do a heap sort I copied the code for it and that had bugs in it.
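If you do want heap behavior without writing one by hand, newer .NET versions (6 and later) ship a heap-backed System.Collections.Generic.PriorityQueue<TElement, TPriority>. The sketch below is only an illustration: the tuple shape and the EnemyHeapSketch/DrainByPriority names are made up, and, as noted above, a heap only ever hands you the top element, so it does not give you a sorted array to look at.

```csharp
using System;
using System.Collections.Generic;

public static class EnemyHeapSketch
{
    // Returns enemy names from highest to lowest priority by draining a heap.
    public static List<string> DrainByPriority(IEnumerable<(string Name, int Priority)> enemies)
    {
        // PriorityQueue dequeues the smallest priority first, so invert the comparison.
        var queue = new PriorityQueue<string, int>(
            Comparer<int>.Create((a, b) => b.CompareTo(a)));

        foreach (var (name, priority) in enemies)
            queue.Enqueue(name, priority);      // O(log n) per insertion

        var ordered = new List<string>();
        while (queue.Count > 0)
            ordered.Add(queue.Dequeue());       // O(log n) per removal, highest first
        return ordered;
    }
}
```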
Insertion Sort With Binary Search
So this is what it seems like you're trying to do. The truth is this is a very bad idea. On average insertion sort takes O(n). And we know this is a hard limit for inserting a random element into a sorted array. Yes we can find the element we want to insert into faster by using a binary search. But then the average insertion still takes O(n). Alternatively, in the best case, if you are inserting and the element goes into the last position insertion sort takes O(1) time because when you inserted, it is already in the correct place. However, if you do a binary search to find the insertion location, then finding out you're supposed to insert in the last position takes O(log(n)) time. And the insertion itself takes O(1) time. So in trying to optimize it, you've severely degraded the best case performance. Looking at your use case, this queue holds the enemies with their priorities. The priority of an enemy is likely a function of their strength and their distance. Which means when an enemy enters into the priority queue, it will likely have a very low priority. This plays very well into the best case of insertion of O(1) performance. If you decrease the best case performance you will do more harm than good because it is also your most general case.
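For comparison, the plain linear insertion being argued for here might look like the sketch below. It assumes a full array of int priorities sorted in descending order, as in the question, and drops the lowest entry to make room; the LinearInsertSketch/Insert names are made up for illustration.

```csharp
public static class LinearInsertSketch
{
    // Inserts value into a full, descending-sorted array, dropping the lowest entry.
    // Returns false (array untouched) if value does not beat the current lowest.
    public static bool Insert(int[] sorted, int value)
    {
        int i = sorted.Length - 1;
        if (i < 0 || value <= sorted[i]) return false;  // rejected after one comparison

        // Walk up from the end, shifting smaller entries down one slot.
        while (i > 0 && sorted[i - 1] < value)
        {
            sorted[i] = sorted[i - 1];
            i--;
        }
        sorted[i] = value;
        return true;
    }
}
```

A new enemy that only just beats the lowest priority costs two comparisons and no shifting, which is the common case described above. Inserting 69 into the question's array walks past 22, 36, 42, 49 and 61, and lands between 73 and 61.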
Premature optimization is the root of all evil -- Donald Knuth
Since you are maintaining a sorted search pool at all times, you can use binary search. First check the middle element, then check the element halfway between the middle and whichever end of the array the new value lies toward, and so on, until you find the location. This will give you O(log2 n) time.
Sure, assuming you are using an Array type to house the list, this is really easy.
I will assume Enemy is your class name, and that it has a property called Priority to perform the sort. We will need an IComparer<Enemy> that looks like the following:
public class EnemyComparer : IComparer<Enemy>
{
int IComparer<Enemy>.Compare(Enemy x, Enemy y)
{
return y.Priority.CompareTo(x.Priority); // reverse operand to invert ordering
}
}
Then we can write a simple InsertEnemy routine as follows:
public static bool InsertEnemy(Enemy[] enemies, Enemy newEnemy)
{
// binary search in O(logN)
var ix = Array.BinarySearch(enemies, newEnemy, new EnemyComparer());
// If not found, the bitwise complement is the insertion index
if (ix < 0)
ix = ~ix;
// If the insertion index is after the list we bail out...
if (ix >= enemies.Length)
return false;// Insert is after last item...
//Move enemies down the list to make room for the insertion...
if (ix + 1 < enemies.Length)
Array.ConstrainedCopy(enemies, ix, enemies, ix + 1, enemies.Length - (ix + 1));
//Now insert the newEnemy into the position
enemies[ix] = newEnemy;
return true;
}
There are other data structures that would make this a bit faster, but this should prove efficient enough. A B-Tree or binary tree would be ok if the list will get large, but for 10 items it's doubtful it would be faster.
The method above was tested with the addition of the following:
public class Enemy
{
public int Priority;
}
public static void Main()
{
var rand = new Random();
// Start with a sorted list of 10
// Descending order, to match EnemyComparer (requires using System.Linq)
var enemies = Enumerable.Range(0, 10).Select(i => new Enemy() {Priority = rand.Next(0, 100)}).OrderByDescending(e => e.Priority).ToArray();
// Insert random entries
for (int i = 0; i < 100; i++)
InsertEnemy(enemies, new Enemy() {Priority = rand.Next(100)});
}

Deduce a downward trend from a list of values

In one of the functions in my program, which is called every second, I get a float value that represents some X strength.
These values keep on coming at intervals, and I am looking to store a history of the last 30 values and check if there's a downward/decreasing trend in the values (there might be 2 or 3 false positives as well, so those have to be neglected). If there's a downward trend and (the most recent value minus the first value in the history) passes a threshold of 50 (say), I want to call another function. How can such a thing be implemented in C#, with a structure to store a history of 30 values and then analyse/deduce the downward trend?
You have several choices. If you only need to call this once per second, you can use a Queue<float>, like this:
Queue<float> theQueue = new Queue<float>(30);
// once per second:
// if the queue is full, remove an item
if (theQueue.Count >= 30)
{
theQueue.Dequeue();
}
// add the new item to the queue
theQueue.Enqueue(newValue);
// now analyze the items in the queue to detect a downward trend
foreach (float f in theQueue)
{
// do your analysis
}
That's easy to implement and will be plenty fast enough to run once per second.
How you analyze the downward trend really depends on your definition.
It occurs to me that the Queue<float> enumerator might not be guaranteed to return things in the order that they were inserted. If it doesn't, then you'll have to implement your own circular buffer.
I don't know C#, but I'd probably store the values as a List of some sort. Here's some pseudocode for the trend checking:
if last value - first value < threshold
return
counter = 0
for int i = 1; i < 30; i++
if val[i] > val[i-1]
counter++
if counter < false_positive_threshold
//do stuff
A Circular List is the best data structure to store last X values. There doesn't seem to be one in the standard library but there are several questions on SO how to build one.
You need to define "downward trend". It seems to me that according to your current definition (the most recent value minus the first value in the history), the sequence "100, 150, 155, 175, 180, 182" is a downward trend. With that definition you only need the latest and first values in the history from the circular list, simplifying it somewhat.
But you probably need a more elaborate algorithm to identify a downward trend.
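As a starting point, the rise-counting check from the pseudocode above could be written in C# roughly as follows. The definition used here is an assumption, not something from the question: the history is ordered oldest first, the "drop" is the oldest value minus the newest, and up to allowedRises upward steps are tolerated as false positives. The TrendSketch/IsDownwardTrend names are made up.

```csharp
using System.Collections.Generic;

public static class TrendSketch
{
    // history is ordered oldest first.
    public static bool IsDownwardTrend(IReadOnlyList<float> history, float minDrop, int allowedRises)
    {
        if (history.Count < 2) return false;

        // Assumed definition: overall drop = oldest value minus newest value.
        float drop = history[0] - history[history.Count - 1];
        if (drop < minDrop) return false;

        // Count the steps that went up instead of down.
        int rises = 0;
        for (int i = 1; i < history.Count; i++)
        {
            if (history[i] > history[i - 1]) rises++;
        }
        return rises <= allowedRises;
    }
}
```

A float[] or List<float> can be passed directly; a Queue<float> would need ToArray() first. With minDrop = 50 and allowedRises = 1, the sequence 100, 90, 95, 60, 40 counts as a downward trend, while 100, 150, 155, 175, 180, 182 does not.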

make sure array is sequential in C#

I've got an array of integers we're getting from a third party provider. These are meant to be sequential, but for some reason they miss a number (something throws an exception, it's eaten, and the loop continues, missing that index). This causes our system some grief, and I'm trying to ensure that the array we're getting is indeed sequential.
The numbers start from varying offsets (sometimes 1000, sometimes 5820, others 0), but whatever the start, it's meant to go from there.
What's the fastest method to verify the array is sequential? Even though it seems it's now a required step, I also have to make sure it doesn't take too long to verify. I am currently starting at the first index, picking up the number, adding one, and making sure the next index contains that, etc.
EDIT:
The reason why the system fails is because of the way people use the system it may not always be returning the tokens the way it was picked initially - long story. The data can't be corrected until it gets to our layer unfortunately.
If you're sure that the array is sorted and has no duplicates, you can just check:
array[array.Length - 1] == array[0] + array.Length - 1
I think it's worth addressing the bigger issue here: what are you going to do if the data doesn't meet your requirements (sequential, no gaps)?
If you're still going to process the data, then you should probably invest your time in making your system more resilient to gaps or missing entries in the data.
If you need to process the data and it must be clean, you should work with the vendor to make sure they send you well-formed data.
If you're going to skip processing and report an error, then asserting the precondition of no gaps may be the way to go. In C# there's a number of different things you could do:
If the data is sorted and has no dups, just check if LastValue == FirstValue + ArraySize - 1.
If the data is not sorted but dup free, just sort it and do the above.
If the data is not sorted, has dups and you actually want to detect the gaps, I would use LINQ.
List<int> gaps = Enumerable.Range(array.Min(), array.Length).Except(array).ToList();
or better yet (since the high-end value may be out of range):
int minVal = array.Min();
int maxVal = array.Max();
List<int> gaps = Enumerable.Range(minVal, maxVal-minVal+1).Except(array).ToList();
By the way, the whole concept of being passed a dense, gapless, array of integers is a bit odd for an interface between two parties, unless there's some additional data that associated with them. If there's no other data, why not just send a range {min,max} instead?
for (int i = a.Length - 2; 0 <= i; --i)
{
if (a[i] + 1 != a[i+1]) return false; // gap, duplicate, or out of order: not in sequence
}
return true; // in sequence
Gabe's way is definitely the fastest if the array is sorted. If the array is not sorted, then it would probably be best to sort the array (with merge/shell sort (or something of similar speed)) and then use Gabe's way.

Fast Algorithm for computing percentiles to remove outliers

I have a program that needs to repeatedly compute the approximate percentile (order statistic) of a dataset in order to remove outliers before further processing. I'm currently doing so by sorting the array of values and picking the appropriate element; this is doable, but it's a noticeable blip on the profiles despite being a fairly minor part of the program.
More info:
The data set contains on the order of up to 100000 floating point numbers, and assumed to be "reasonably" distributed - there are unlikely to be duplicates nor huge spikes in density near particular values; and if for some odd reason the distribution is odd, it's OK for an approximation to be less accurate since the data is probably messed up anyhow and further processing dubious. However, the data isn't necessarily uniformly or normally distributed; it's just very unlikely to be degenerate.
An approximate solution would be fine, but I do need to understand how the approximation introduces error to ensure it's valid.
Since the aim is to remove outliers, I'm computing two percentiles over the same data at all times: e.g. one at 95% and one at 5%.
The app is in C# with bits of heavy lifting in C++; pseudocode or a preexisting library in either would be fine.
An entirely different way of removing outliers would be fine too, as long as it's reasonable.
Update: It seems I'm looking for an approximate selection algorithm.
Although this is all done in a loop, the data is (slightly) different every time, so it's not easy to reuse a datastructure as was done for this question.
Implemented Solution
Using the Wikipedia selection algorithm as suggested by Gronim reduced this part of the run-time by about a factor of 20.
Since I couldn't find a C# implementation, here's what I came up with. It's faster even for small inputs than Array.Sort; and at 1000 elements it's 25 times faster.
public static double QuickSelect(double[] list, int k) {
return QuickSelect(list, k, 0, list.Length);
}
public static double QuickSelect(double[] list, int k, int startI, int endI) {
while (true) {
// Assume startI <= k < endI
int pivotI = (startI + endI) / 2; //arbitrary, but good if sorted
int splitI = partition(list, startI, endI, pivotI);
if (k < splitI)
endI = splitI;
else if (k > splitI)
startI = splitI + 1;
else //if (k == splitI)
return list[k];
}
//when this returns, list[i] <= list[k] iff i <= k (assuming no duplicates)
}
static int partition(double[] list, int startI, int endI, int pivotI) {
double pivotValue = list[pivotI];
list[pivotI] = list[startI];
list[startI] = pivotValue;
int storeI = startI + 1;//no need to store # pivot item, it's good already.
//Invariant: startI < storeI <= endI
while (storeI < endI && list[storeI] <= pivotValue) ++storeI; //fast if sorted
//now storeI == endI || list[storeI] > pivotValue
//so elem #storeI is either irrelevant or too large.
for (int i = storeI + 1; i < endI; ++i)
if (list[i] <= pivotValue) {
list.swap_elems(i, storeI);
++storeI;
}
int newPivotI = storeI - 1;
list[startI] = list[newPivotI];
list[newPivotI] = pivotValue;
//now [startI, newPivotI] are <= to pivotValue && list[newPivotI] == pivotValue.
return newPivotI;
}
//extension method: this (and the methods above) must live in a static class
static void swap_elems(this double[] list, int i, int j) {
double tmp = list[i];
list[i] = list[j];
list[j] = tmp;
}
Thanks, Gronim, for pointing me in the right direction!
The histogram solution from Henrik will work. You can also use a selection algorithm to efficiently find the k largest or smallest elements in an array of n elements in O(n). To use this for the 95th percentile set k=0.05n and find the k largest elements.
Reference:
http://en.wikipedia.org/wiki/Selection_algorithm#Selecting_k_smallest_or_largest_elements
According to its creator a SoftHeap can be used to:
"compute exact or approximate medians and percentiles optimally. It is also useful for approximate sorting..."
I used to identify outliers by calculating the standard deviation. Everything more than 2 (or 3) times the standard deviation away from the average is an outlier. 2 times = about 95%.
Since you are calculating the average anyway, it's also easy and fast to calculate the standard deviation.
You could also use only a subset of your data to calculate the numbers.
You could estimate your percentiles from just a part of your dataset, like the first few thousand points.
The Glivenko–Cantelli theorem ensures that this would be a fairly good estimate, if you can assume your data points to be independent.
Divide the interval between minimum and maximum of your data into (say) 1000 bins and calculate a histogram. Then build partial sums and see where they first exceed 5000 or 95000.
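A rough sketch of that histogram approach is below. The names (HistogramPercentileSketch, Approximate) are made up, and returning the upper edge of the bin where the running total reaches the target is just one possible choice; the error is then bounded by about one bin width, (max - min) / bins, which is the kind of error bound the question asks for.

```csharp
using System;

public static class HistogramPercentileSketch
{
    // Approximates the value below which `fraction` (e.g. 0.95) of the data lies.
    public static double Approximate(double[] data, double fraction, int bins = 1000)
    {
        if (data.Length == 0) throw new ArgumentException("data must not be empty");

        // Pass 1: find the range.
        double min = data[0], max = data[0];
        foreach (double x in data)
        {
            if (x < min) min = x;
            if (x > max) max = x;
        }
        if (max == min) return min;             // all values identical

        // Pass 2: count how many values fall into each bin.
        var counts = new int[bins];
        double scale = bins / (max - min);
        foreach (double x in data)
        {
            int bin = Math.Min((int)((x - min) * scale), bins - 1); // max goes in the last bin
            counts[bin]++;
        }

        // Partial sums: stop at the first bin where the running total reaches the target.
        double target = fraction * data.Length;
        int runningTotal = 0;
        for (int b = 0; b < bins; b++)
        {
            runningTotal += counts[b];
            if (runningTotal >= target)
                return min + (b + 1) / scale;   // upper edge of that bin
        }
        return max;
    }
}
```

Calling it twice (with 0.05 and 0.95) repeats the two passes; if that matters, the counting could be factored out and shared.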
There are a couple basic approaches I can think of. First is to compute the range (by finding the highest and lowest values), project each element to a percentile ((x - min) / range) and throw out any that evaluate to lower than .05 or higher than .95.
The second is to compute the mean and standard deviation. A span of 2 standard deviations from the mean (in both directions) will enclose about 95% of a normally-distributed sample space, meaning your outliers would be in the <2.5 and >97.5 percentiles. Calculating the mean of a series is linear, as is the standard deviation (the square root of the mean of the squared differences between each element and the mean). Then, subtract 2 sigmas from the mean, and add 2 sigmas to the mean, and you've got your outlier limits.
Both of these will compute in roughly linear time; the first one requires two passes, the second one takes three (once you have your limits you still have to discard the outliers). Since this is a list-based operation, I do not think you will find anything with logarithmic or constant complexity; any further performance gains would require either optimizing the iteration and calculation, or introducing error by performing the calculations on a sub-sample (such as every third element).
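The second approach can be sketched as follows. It uses the population standard deviation (dividing by n) and two passes over the data to get the limits; the SigmaLimitsSketch/Limits names are made up for illustration.

```csharp
using System;

public static class SigmaLimitsSketch
{
    // Returns (mean - sigmas * stdDev, mean + sigmas * stdDev).
    public static (double Low, double High) Limits(double[] data, double sigmas = 2.0)
    {
        if (data.Length == 0) throw new ArgumentException("data must not be empty");

        // Pass 1: mean.
        double sum = 0.0;
        foreach (double x in data) sum += x;
        double mean = sum / data.Length;

        // Pass 2: population standard deviation.
        double sumOfSquares = 0.0;
        foreach (double x in data) sumOfSquares += (x - mean) * (x - mean);
        double stdDev = Math.Sqrt(sumOfSquares / data.Length);

        return (mean - sigmas * stdDev, mean + sigmas * stdDev);
    }
}
```

For { 2, 4, 4, 4, 5, 5, 7, 9 } the mean is 5 and the standard deviation is 2, so the 2-sigma limits are 1 and 9. The third pass is then keeping only the values inside those limits.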
A good general answer to your problem seems to be RANSAC.
Given a model, and some noisy data, the algorithm efficiently recovers the parameters of the model.
You will have to choose a simple model that can map your data. Anything smooth should be fine. Let's say a mixture of a few Gaussians. RANSAC will set the parameters of your model and estimate a set of inliers at the same time. Then throw away whatever doesn't fit the model properly.
You could filter out 2 or 3 standard deviation even if the data is not normally distributed; at least, it will be done in a consistent manner, that should be important.
As you remove the outliers, the std dev will change; you could do this in a loop until the change in std dev is minimal. Whether or not you want to do this depends upon why you are manipulating the data this way. Some statisticians have major reservations about removing outliers. But some remove the outliers to prove that the data is fairly normally distributed.
Not an expert, but my memory suggests:
to determine percentile points exactly you need to sort and count
taking a sample from the data and calculating the percentile values sounds like a good plan for decent approximation if you can get a good sample
if not, as suggested by Henrik, you can avoid the full sort if you do the buckets and count them
One set of data of 100k elements takes almost no time to sort, so I assume you have to do this repeatedly. If the data set is the same set just updated slightly, you're best off building a tree (O(N log N)) and then removing and adding new points as they come in (O(K log N) where K is the number of points changed). Otherwise, the kth largest element solution already mentioned gives you O(N) for each dataset.
