Deduce a downward trend from a list of values - C#

In one of the functions in my program, which is called every second, I get a float value that represents some strength X.
These values keep coming in at that interval, and I want to store a history of the last 30 values and check whether there is a downward/decreasing trend in them (there might be 2 or 3 false positives as well, which have to be ignored). If there is a downward trend and (the most recent value minus the first value in the history) passes a threshold of 50 (say), I want to call another function. How can such a thing be implemented in C#? Which structure should store the history of 30 values, and how do I analyse it to deduce the downward trend?

You have several choices. If you only need to call this once per second, you can use a Queue<float>, like this:
Queue<float> theQueue = new Queue<float>(30);

// once per second:
// if the queue is full, remove an item
if (theQueue.Count >= 30)
{
    theQueue.Dequeue();
}
// add the new item to the queue
theQueue.Enqueue(newValue);
// now analyze the items in the queue to detect a downward trend
foreach (float f in theQueue)
{
    // do your analysis
}
That's easy to implement and will be plenty fast enough to run once per second.
How you analyze the downward trend really depends on your definition.
It occurs to me that the Queue<float> enumerator might not be guaranteed to return things in the order that they were inserted. If it doesn't, then you'll have to implement your own circular buffer.
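One possible analysis, as a rough sketch (this assumes "downward trend" means at most a few increases between consecutive samples plus a minimum total drop from the oldest to the newest value; both constants are placeholders to tune):
private const int MaxIncreases = 3;       // tolerated "false positive" upticks
private const float DropThreshold = 50f;  // required total drop, oldest minus newest

private bool HasDownwardTrend(Queue<float> history)
{
    if (history.Count < 2)
        return false;

    float[] values = history.ToArray();   // Queue<T>.ToArray copies oldest-first
    int increases = 0;
    for (int i = 1; i < values.Length; i++)
    {
        if (values[i] > values[i - 1])
            increases++;
    }

    float totalDrop = values[0] - values[values.Length - 1];
    return increases <= MaxIncreases && totalDrop >= DropThreshold;
}
If HasDownwardTrend(theQueue) returns true after an Enqueue, that is the point to call your other function.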

I don't know C#, but I'd probably store the values in a List of some sort. Here's some pseudocode for the trend checking:
if last value - first value < threshold
    return

counter = 0
for int i = 1; i < 30; i++
    if val[i] > val[i-1]
        counter++

if counter < false_positive_threshold
    //do stuff

A circular list is the best data structure to store the last X values. There doesn't seem to be one in the standard library, but there are several questions on SO about how to build one.
You need to define "downward trend". It seems to me that, by your current definition (the most recent value minus the first value in the history), the sequence "100, 150, 155, 175, 180, 182" counts as a downward trend. With that definition you only need the latest and first values in the history from the circular list, which simplifies things somewhat.
But you probably need a more elaborate algorithm to identify a downward trend.

Some clarification about arrays for a project

Or maybe someone can explain how to do this or direct me to some resources. Here's the assignment info:
dataCaptured: integer array - This field will store a history of a limited set of recently captured measurements. Once the array is full, the class should start overwriting the oldest elements while continuing to record the newest captures. (You may need some helper fields/variables to go with this one).
mostRecentMeasure: integer - This field will store the most recent measurement captured for convenience of display.
GetRawData: Return the contents of the dataCaptured array. Which element is the oldest one in your history? Do you need to manipulate these values? How will they be presented to the user?
Add a button to display the measurement history (GetRawData). Where/how will you display this list? What did you make GetRawData() return? Does the list start with the oldest value collected? What happens when your history array fills up and older values are overwritten? What happens when your history has not been filled up yet? Does "0" count as an actual history entry? Does your history display the units in which the data was collected?
So, I want to pass the entire array to a TextBox? How do I do that? At the moment, I do have the device up and running and I have a text box ready for the array, I think. Can someone direct me to some articles on how to pass this information? Thanks for any help.
Currently I have:
private double[] dataCaptured;

public double[] GetRawData()
{
    return dataCaptured;
}
very little :(
I'm not going to do your homework for you, so this isn't really an answer; it's more of a tutorial.
First, the problem you have been assigned is a common one. Consider some kind of chemical process. You build a system that periodically measures some attribute of the system (say temperature). Depending on the nature of the process, you may measure it every 200 milliseconds, every 2 seconds, every 2 minutes, whatever. But, you are going to do that forever. You want to keep the data for some time, but not forever - memory is cheap these days, but not that cheap.
So you decide to keep the last N measurements. You can keep that in an N-element array. When you get the first measurement, you put it in the 0-element of the array, the second in the 1-element, and so on. When you get to the Nth measurement, you stick it in the last element. But, with the (N+1)th element, you have a problem. The solution is to put it in the 0-element of the array. The next one goes into the 1-element of the array, and so on.
At that point, you need some bookkeeping. You need to track where you put the most recent element. You also need to keep track of where the oldest element is. And you need a way to get all the measurements in some sort of time order.
This is a circular buffer. The reason it's circular is because you count 0, 1, 2...N-1, N, 0, 1, 2 and so on, around and around forever.
When you first start filling the buffer with measurements, the oldest is in element-0, and the most recent is wherever you put the last measurement. It continues like that until you have filled the buffer. Once you do that, then the oldest is no longer at 0, it's at the element logically after the most recent one (which is either the next element, or, if you are at the end of buffer, at the 0 position).
You could track the oldest and the newest indexes in the array/buffer. An easier way is to just track whether you are in the initial haven't-filled-buffer-yet phase (a boolean) and the latest index. The oldest is either index 0 (if you haven't filled buffer yet) or the one logically after the newest one. The bookkeeping is a little easier that way.
To do this, you need a way to find the next logical index in the buffer (either index+1 or 0 if you've hit the end of the buffer). You also need a way to get all the data (start at the index logically after the current entry, and then get the next N logical entries).
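As a rough sketch of that bookkeeping (the names and the float type here are just illustrative, not the assignment's):
private float[] dataCaptured = new float[30]; // the circular buffer
private int newestIndex = -1;                 // index of the most recent measurement
private bool bufferFilled = false;            // true once every slot has been written

// The next logical slot: one past the newest, wrapping back to 0 at the end.
private int NextIndex(int index)
{
    return (index + 1) % dataCaptured.Length;
}

private void Capture(float measurement)
{
    newestIndex = NextIndex(newestIndex);
    dataCaptured[newestIndex] = measurement;
    if (newestIndex == dataCaptured.Length - 1)
        bufferFilled = true;                  // the last slot has now been written once
}

// Copy of the history so far, oldest first.
private float[] GetRawData()
{
    int count = bufferFilled ? dataCaptured.Length : newestIndex + 1;
    int oldest = bufferFilled ? NextIndex(newestIndex) : 0;
    var result = new float[count];
    for (int i = 0; i < count; i++)
        result[i] = dataCaptured[(oldest + i) % dataCaptured.Length];
    return result;
}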
By the way, I'd use float rather than double to track my measurement. In general, measurements have enough error in themselves that the extra precision offered by double is of no use. By halving the data size, you can make your buffer twice as long (this, by the way, is a consideration in a measurement and control system).
For your measurement source, I'd initially use a number generator that starts at 1.0f and just increments by 1.0f every cycle. It will make debugging much simpler.
private float tempValue = 0.0f;

private float GetNextValue()
{
    tempValue += 1.0f;   // first call returns 1.0f, then 2.0f, 3.0f, ...
    return tempValue;
}
Once you do this for real, you may want random numbers, something like:
private readonly Random _random = new Random();
private const float MinValue = 4.0f;
private const float MaxValue = 20.0f;

private float GetNextValue()
{
    var nextRandom = _random.NextDouble();
    var nextValue = (float)((MaxValue - MinValue) * nextRandom + MinValue);
    return nextValue;
}
By the way, the choice of 4.0 and 20.0 is intentional - google for "4 20 measurement".
If you want to get fancy, generate a random number stream and run it through a smoothing filter (but that gets more complicated).
You asked about how to get all the values into a single text box. This should do the trick:
// concatenate a collection of float values (rendered with one decimal place)
// into a string (with some spaces separating the values)
private string FormatValuesAsString(IEnumerable<float> values)
{
    var buffer = new StringBuilder();
    foreach (var val in values)
    {
        buffer.Append($"{val:F1} ");
    }
    return buffer.ToString();
}
An IEnumerable<float> represents a collection of float values. An array of floats implements that interface, as does just about every other collection of floats, which means you can pass an array for that parameter (FormatValuesAsString(myArray)). You can also create a function that dynamically generates values (look up yield return). The FormatValuesAsString function will return a string you can shove into a text box. If you do that on each calculation cycle, you'll see your values shift through the text box.
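As a small illustration of the yield return idea (GenerateValues and textBox1 are just placeholder names):
// Produces `count` rising measurements on demand, without building an array first.
private IEnumerable<float> GenerateValues(int count)
{
    float value = 0f;
    for (int i = 0; i < count; i++)
    {
        value += 1.0f;
        yield return value;
    }
}

// Both calls work, because both arguments are IEnumerable<float>:
// textBox1.Text = FormatValuesAsString(myArray);
// textBox1.Text = FormatValuesAsString(GenerateValues(30));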
You will get to know the debugger very well doing this exercise. It is very hard to get this right in the first one, or two, or three, or... tries (you'll learn why programmers curse off-by-one errors). Don't give up. It's a good assignment.
Finally, if you get confused, you can post questions as comments to this answer (within reason). Tag me (to make sure I see them).

TA-Lib : Technical Analysis Library, Lookback and unstablePeriod

TA-Lib is a financial/market/OHLC technical analysis library for Java, C++, .NET, etc. It contains ~158 technical functions (EMA, MAMA, MACD, SMA, etc.), and each has an associated Lookback function:
public static int EmaLookback(int optInTimePeriod)
The Lookback for each function seems to return the minimum amount of data required to compute each function accurately, with the span from startIdx to endIdx equal to the Lookback.
Core.RetCode retcode = Core.Ema(startIdx, endIdx, double inReal, optInTimePeriod, ref outBegIdx, ref outNBElement, double outReal)
Some of these functions use an array called
Globals.unstablePeriod[0x17]
If this is incorrect in any way please correct me. Now the questions ...
The array unstablePeriod[] initializes to 0 for all entries. Is this what is supposed to occur, if not where in TA-Lib do I find the code or data that it is initialized with?
The code we are writing only requires the single most recent element in the array, outReal[0] (or any other "outArray[]"). Is there a way to return a single element (a), or does the spread between the startIdx and the endIdx have to equal the Lookback (b)?
a)
int startIdx = this.ohlcArray.IdxCurrent;
int endIdx = startIdx;
// call to TA Routine goes here
b)
int lookBack = Core.EmaLookback(optInTimePeriod) - 1;
int startIdx = this.ohlcArray.IdxCurrent;
int endIdx = startIdx + lookBack;
// call to TA Routine goes here
retcode = Core.Ema(startIdx, endIdx, inReal, optInTimePeriod, ref outBegIdx, ref outNBElement, outReal);
Why do these routines return 0 for the first outArray[0] element when startIdx is equal to 0?
Since I am getting such odd results: should the startIdx be at the oldest date or the newest date? Meaning, should you process from the past (startIdx) towards now (endIdx), or from now (startIdx) towards the oldest date (endIdx)? I am guessing I am computing backwards (b):
a) 2000 (startIdx) - 2003 (endIdx),
or
b) 2003 (startIdx) - 2000 (endIdx)
I've already forgotten C#, so I might be wrong, but:
The Lookback for each function seems to return the minimum amount of data required to compute each function accurately, with the span from startIdx to endIdx equal to the Lookback.
No, it returns the number of input elements required to calculate the first output element, which is usually equal to or greater than the timePeriod value. That's all. So if you input 1000 elements (startIdx == 0 and endIdx == 999) while the Lookback function gives you 25, you'll get 1000 - 25 = 975 == outNBElement resulting elements back, and outBegIdx will be 25.
Note: no one guarantees the accuracy of the function. Lookback just lets you calculate the size of the resulting array beforehand, which is critical for C/C++ where fixed-size arrays might be allocated.
The array unstablePeriod[] initializes to 0 for all entries. Is this what is supposed to occur, if not where in TA-Lib do I find the code or data that it is initialized with?
Seems like that. It happens in Core::GlobalsType constructor in TA-Lib-Core.h
Globals.unstablePeriod is an array that keeps unstability settings for some of the TA functions. The values are addressed via the enum class FuncUnstId, which is declared in ta_defs.h. The 0x17 value corresponds to the T3 technical indicator.
In the case of the T3 indicator, this unstability period just adds a value to the lookback result. So T3's lookback is 6 * (timePeriod - 1) + TA_GLOBALS_UNSTABLE_PERIOD[TA_FUNC_UNST_T3]. That's why it's 0 by default. And that shows that function accuracy isn't that simple.
Consider EMA. Its lookback is timePeriod - 1 + TA_GLOBALS_UNSTABLE_PERIOD[TA_FUNC_UNST_EMA]. Assume the unstability value is 0, so EMA's lookback is just timePeriod - 1. (I would recommend not touching unstability without a reason.) According to the code, the first EMA result is calculated as a simple average of the first "lookback count" of elements by default. There is a global compatibility setting that can be {CLASSIC, METASTOCK, TRADESTATION} and affects the first element calculation, but this doesn't change a lot. Your first element is an average, and the others are calculated as EMA_today = (value_today - old_EMA) * coefficient + old_EMA.
That's the reason you can't just pass a "lookback count" of elements and get an "accurate function result". It won't be accurate - it'll be the first one, not the right one. In the case of EMA it'll always be a simple average, since a simple average is used as the seed for this function. And the following results are calculated not only over the first N input elements; they also include the previous EMA value. And that previous EMA includes its previous EMA, etc. So you can't just pass a lookback count of elements and expect an accurate result. You would need to pass the previous function value too.
Moreover, most rolling indicators behave like that. Their first N values depend heavily on the point from which you started to calculate them. This can be addressed with the unstability period, but you'd better not limit the input data.
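To see why the starting point matters, here is a rough sketch of the EMA recurrence described above (plain C#, not TA-Lib code; the seed is a simple average, as in the CLASSIC setting):
// Sketch: classic EMA with a simple-average seed.
// k = 2 / (period + 1); ema = (value - ema) * k + ema, matching the formula above.
static double[] SimpleEma(double[] input, int period)
{
    var output = new double[input.Length];   // first (period - 1) slots stay 0
    double k = 2.0 / (period + 1);

    double sum = 0;
    for (int i = 0; i < period; i++)
        sum += input[i];
    double ema = sum / period;               // the seed
    output[period - 1] = ema;

    for (int i = period; i < input.Length; i++)
    {
        ema = (input[i] - ema) * k + ema;    // every later value folds in the seed
        output[i] = ema;
    }
    return output;
}
Because each output folds in the previous EMA, which folds in the one before it, trimming the input down to just the lookback window changes every value after the seed - that is the "convergence" problem described above.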
Why do these routines return 0 for the first outArray[0] element
My guess is that it's because of the -1 in your lookback calculation. Also, 0 is returned for the outBegIdx'th element, not for element 0.
Is there a way to return a single element (a)
With regular TA-Lib - no; you need to process a big enough piece of data every time to make sure your rolling results do "converge". But I've made a TA-Lib fork for myself here which is designed for this purpose. The main idea is described in the readme. It should be almost as fast as the original, and you can just pass a single value and a state object to get a single result back, without recalculating all the data. So calculation can be paused and continued when new data arrives, with no loss of previous computational results.
The problem is that TA-Lib is written in C, and the code of all its C#/Java etc. wrappers is actually generated by its internal tool (ta-gen). No one has ever tried to work with my fork via those wrapper interfaces, so they may be broken. Also, I don't provide precompiled binaries, and since TA-Lib is very old and its infrastructure is quite fancy, it might require some skill and effort to build it on the target platform.

Finding insertion points in a sorted array faster than O(n)?

This is for game programming. Let's say I have a Unit that can track 10 enemies within its range. Each enemy has a priority between 0-100. So the array currently looks like this (numbers represent priority):
Enemy - 96
Enemy - 78
Enemy - 77
Enemy - 73
Enemy - 61
Enemy - 49
Enemy - 42
Enemy - 36
Enemy - 22
Enemy - 17
Say a new enemy wanders within range and has a priority of 69; it will be inserted between 73 and 61, and 17 will be removed from the array (well, the 17 would be removed before the insertion, I believe).
Is there any way to figure out that it needs to be inserted between 73 and 61 without an O(n) operation?
I feel you're asking the wrong question here. You have to first find the spot to insert into and then insert the element. These are two operations that are tied together, and I feel you shouldn't be asking how to do one faster without the other; it'll make sense why towards the end of this answer. But I'm addressing the question of actually inserting faster.
Short Answer: No
Answer you'll get from someone that's too smart for themselves:
The only way to accomplish this is to not use an array. In an array, unless you are inserting into the first or last position, the insert will be O(n). This is because the array's elements occupy contiguous space in memory. That is how you are able to reference a particular element in O(1) time: you know exactly where that element is. The cost is that to insert in the middle you need to move half the elements in the array. So while you can look up with a binary search in log(n) time, you cannot insert in that time.
So if you're going to do anything, you'll need a different data structure. A simple binary tree may be the solution; it will do the insertion in log(n) time. On the other hand, if you're feeding it a sorted array you have to worry about tree balancing, so you might need a red-black tree. Or, if you are always popping the element that is the closest or the furthest, you can use a heap (the structure behind heap sort). A heap is the best fit for a priority queue. It has the additional advantage of fitting a tree structure into an array, so it has far better spatial locality (more on this later).
The truth:
You'll most likely have a dozen, maybe a few dozen, enemies in the vicinity at most. At that level the asymptotic performance does not matter, because it is designed especially for large values of n. What you're looking at is a religious adherence to your CS 201 professor's lectures about Big O. Linear search and insertion will be the fastest method, and the answer to "will it scale?" is: who the hell cares. If you try to implement a complicated algorithm to scale it, you will almost always be slower, since what determines your speed is not the software, it is the hardware, and you're better off sticking to doing things the hardware knows how to deal with well: linearly going down memory. In fact, after the prefetchers do their thing, it would be faster to linearly go through each element, even if there were a couple of thousand of them, than to implement a red-black tree. A data structure like a tree would allocate memory all over the place without any regard to spatial locality, and the calls to allocate memory for a node are in themselves more expensive than the time it takes to read through a thousand elements. Which is why graphics cards use insertion sort all over the place.
Heap Sort
A heap might actually be faster depending on the input data, since it uses a linear array, although it may confuse the prefetchers, so it's hard to say. The only limitation is that you can only pop the highest-priority element. Obviously you can define "highest priority" to be either the lowest or the largest element. Heaps are too fancy for me to describe here; just Google them. A heap separates insertion and removal into two O(log(n)) operations. The biggest downside is that it will seriously decrease the debuggability of the code. A heap is not a sorted array; it has an order to it, but other than heap maintenance being a complicated, unintuitive algorithm, it is not readily visible to a human being whether a heap is set up correctly. So you would introduce more bugs for, in the best case, little benefit. Hell, the last time I had to write a heap I copied the code for it, and that had bugs in it.
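If you do go the heap route in modern C#, here is a minimal sketch assuming .NET 6's built-in PriorityQueue<TElement, TPriority> and an Enemy class with an int Priority:
// Keep only the 10 highest-priority enemies.
// PriorityQueue is a min-heap: the lowest priority is always the next to dequeue.
var tracked = new PriorityQueue<Enemy, int>();
const int Capacity = 10;

void Track(Enemy enemy)
{
    if (tracked.Count < Capacity)
    {
        tracked.Enqueue(enemy, enemy.Priority);
    }
    else
    {
        // Enqueue the newcomer and immediately drop whichever enemy now has
        // the lowest priority (possibly the newcomer itself).
        tracked.EnqueueDequeue(enemy, enemy.Priority);
    }
}
Note the trade-off mentioned above: the heap gives you cheap access to the current minimum (or maximum), not a sorted view of all ten enemies.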
Insertion Sort With Binary Search
So this is what it seems like you're trying to do. The truth is, it's a very bad idea. On average, insertion sort takes O(n), and we know this is a hard limit for inserting a random element into a sorted array. Yes, we can find the position to insert into faster by using a binary search, but the average insertion still takes O(n). Alternatively, in the best case, if the element goes into the last position, insertion sort takes O(1) time, because the element is already in the correct place. However, if you do a binary search to find the insertion location, then finding out you're supposed to insert in the last position takes O(log(n)) time, while the insertion itself takes O(1) time. So in trying to optimize it, you've severely degraded the best-case performance. Looking at your use case, this queue holds the enemies with their priorities. The priority of an enemy is likely a function of its strength and its distance, which means that when an enemy enters the priority queue, it will likely have a very low priority. This plays very well into the O(1) best case of insertion. If you degrade the best-case performance you will do more harm than good, because it is also your most common case.
"Premature optimization is the root of all evil" -- Donald Knuth
Since you are maintaining a sorted search pool at all times, you can use binary search. First check the middle element, then check the element halfway between the middle element and whichever end of the array is closer, and so on until you find the location. This will give you O(log₂ n) time.
Sure; assuming you are using an array type to house the list, this is really easy.
I will assume Enemy is your class name and that it has a property called Priority to perform the sort on. We will need an IComparer<Enemy> that looks like the following:
public class EnemyComparer : IComparer<Enemy>
{
    int IComparer<Enemy>.Compare(Enemy x, Enemy y)
    {
        return y.Priority.CompareTo(x.Priority); // reverse operands to invert ordering
    }
}
Then we can write a simple InsertEnemy routine as follows:
public static bool InsertEnemy(Enemy[] enemies, Enemy newEnemy)
{
    // binary search in O(logN)
    var ix = Array.BinarySearch(enemies, newEnemy, new EnemyComparer());
    // If not found, the bitwise complement is the insertion index
    if (ix < 0)
        ix = ~ix;
    // If the insertion index is after the list we bail out...
    if (ix >= enemies.Length)
        return false; // Insert is after last item...
    // Move enemies down the list to make room for the insertion...
    if (ix + 1 < enemies.Length)
        Array.ConstrainedCopy(enemies, ix, enemies, ix + 1, enemies.Length - (ix + 1));
    // Now insert the newEnemy into the position
    enemies[ix] = newEnemy;
    return true;
}
There are other data structures that would make this a bit faster, but this should prove efficient enough. A B-tree or binary tree would be OK if the list gets large, but for 10 items it's doubtful it would be faster.
The method above was tested with the addition of the following:
public class Enemy
{
    public int Priority;
}

public static void Main()
{
    var rand = new Random();
    // Start with a sorted list of 10 (descending, to match the comparer)
    var enemies = Enumerable.Range(0, 10)
        .Select(i => new Enemy() { Priority = rand.Next(0, 100) })
        .OrderByDescending(e => e.Priority)
        .ToArray();
    // Insert random entries
    for (int i = 0; i < 100; i++)
        InsertEnemy(enemies, new Enemy() { Priority = rand.Next(100) });
}

Fast Algorithm for computing percentiles to remove outliers

I have a program that needs to repeatedly compute the approximate percentile (order statistic) of a dataset in order to remove outliers before further processing. I'm currently doing so by sorting the array of values and picking the appropriate element; this is doable, but it's a noticeable blip on the profiles despite being a fairly minor part of the program.
More info:
The data set contains on the order of up to 100,000 floating-point numbers, and is assumed to be "reasonably" distributed - there are unlikely to be duplicates or huge spikes in density near particular values; and if for some odd reason the distribution is odd, it's OK for the approximation to be less accurate, since the data is probably messed up anyhow and further processing dubious. However, the data isn't necessarily uniformly or normally distributed; it's just very unlikely to be degenerate.
An approximate solution would be fine, but I do need to understand how the approximation introduces error to ensure it's valid.
Since the aim is to remove outliers, I'm computing two percentiles over the same data at all times: e.g. one at 95% and one at 5%.
The app is in C# with bits of heavy lifting in C++; pseudocode or a preexisting library in either would be fine.
An entirely different way of removing outliers would be fine too, as long as it's reasonable.
Update: It seems I'm looking for an approximate selection algorithm.
Although this is all done in a loop, the data is (slightly) different every time, so it's not easy to reuse a datastructure as was done for this question.
Implemented Solution
Using the Wikipedia selection algorithm, as suggested by Gronim, reduced this part of the run-time by about a factor of 20.
Since I couldn't find a C# implementation, here's what I came up with. It's faster even for small inputs than Array.Sort; and at 1000 elements it's 25 times faster.
public static double QuickSelect(double[] list, int k) {
    return QuickSelect(list, k, 0, list.Length);
}

public static double QuickSelect(double[] list, int k, int startI, int endI) {
    while (true) {
        // Assume startI <= k < endI
        int pivotI = (startI + endI) / 2; // arbitrary, but good if sorted
        int splitI = partition(list, startI, endI, pivotI);
        if (k < splitI)
            endI = splitI;
        else if (k > splitI)
            startI = splitI + 1;
        else // if (k == splitI)
            return list[k];
    }
    // when this returns, all elements of list[i] <= list[k] iff i <= k
}

static int partition(double[] list, int startI, int endI, int pivotI) {
    double pivotValue = list[pivotI];
    list[pivotI] = list[startI];
    list[startI] = pivotValue;
    int storeI = startI + 1; // no need to store the pivot item, it's good already.
    // Invariant: startI < storeI <= endI
    while (storeI < endI && list[storeI] <= pivotValue) ++storeI; // fast if sorted
    // now storeI == endI || list[storeI] > pivotValue
    // so the element at storeI is either irrelevant or too large.
    for (int i = storeI + 1; i < endI; ++i)
        if (list[i] <= pivotValue) {
            list.swap_elems(i, storeI);
            ++storeI;
        }
    int newPivotI = storeI - 1;
    list[startI] = list[newPivotI];
    list[newPivotI] = pivotValue;
    // now [startI, newPivotI] are <= pivotValue && list[newPivotI] == pivotValue.
    return newPivotI;
}

static void swap_elems(this double[] list, int i, int j) {
    double tmp = list[i];
    list[i] = list[j];
    list[j] = tmp;
}
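For reference, trimming the outliers then comes down to two calls (a usage sketch: data is the double[] being processed, Where needs using System.Linq, and QuickSelect reorders the array in place):
int n = data.Length;
double lower = QuickSelect(data, (int)(0.05 * n));   // ~5th percentile
double upper = QuickSelect(data, (int)(0.95 * n));   // ~95th percentile
var trimmed = data.Where(x => x >= lower && x <= upper).ToArray();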
Thanks, Gronim, for pointing me in the right direction!
The histogram solution from Henrik will work. You can also use a selection algorithm to efficiently find the k largest or smallest elements in an array of n elements in O(n). To use this for the 95th percentile set k=0.05n and find the k largest elements.
Reference:
http://en.wikipedia.org/wiki/Selection_algorithm#Selecting_k_smallest_or_largest_elements
According to its creator, a SoftHeap can be used to:
"compute exact or approximate medians and percentiles optimally. It is also useful for approximate sorting..."
I used to identify outliers by calculating the standard deviation. Everything with a distance of more than 2 (or 3) times the standard deviation from the average is an outlier (2 times covers about 95%).
Since you are already calculating the average, the standard deviation is also very easy and fast to calculate.
You could also use only a subset of your data to compute the numbers.
You could estimate your percentiles from just a part of your dataset, like the first few thousand points.
The Glivenko–Cantelli theorem ensures that this would be a fairly good estimate, if you can assume your data points to be independent.
Divide the interval between minimum and maximum of your data into (say) 1000 bins and calculate a histogram. Then build partial sums and see where they first exceed 5000 or 95000.
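A sketch of that histogram approach (data is the double[] of samples; needs using System and System.Linq):
// Approximate 5%/95% cut-offs via a fixed-bin histogram and its partial sums.
static (double lower, double upper) ApproxCutoffs(double[] data, int binCount = 1000)
{
    double min = data.Min(), max = data.Max();
    if (max == min) return (min, max);               // degenerate data
    double width = (max - min) / binCount;

    var bins = new int[binCount];
    foreach (double x in data)
        bins[Math.Min((int)((x - min) / width), binCount - 1)]++;

    double lower = min, upper = max;
    bool lowerSet = false;
    long partial = 0;
    for (int b = 0; b < binCount; b++)
    {
        partial += bins[b];
        if (!lowerSet && partial >= 0.05 * data.Length) { lower = min + b * width; lowerSet = true; }
        if (partial >= 0.95 * data.Length) { upper = min + (b + 1) * width; break; }
    }
    return (lower, upper);
}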
There are a couple of basic approaches I can think of. The first is to compute the range (by finding the highest and lowest values), project each element to a percentile ((x - min) / range), and throw out any that evaluate to lower than .05 or higher than .95.
The second is to compute the mean and standard deviation. A span of 2 standard deviations from the mean (in both directions) will enclose 95% of a normally-distributed sample space, meaning your outliers would be below the 2.5th and above the 97.5th percentiles. Calculating the mean of a series is linear, as is the standard deviation (square root of the mean of the squared differences between each element and the mean). Then subtract 2 sigmas from the mean, and add 2 sigmas to the mean, and you've got your outlier limits.
Both of these will compute in roughly linear time; the first one requires two passes, the second one takes three (once you have your limits you still have to discard the outliers). Since this is a list-based operation, I do not think you will find anything with logarithmic or constant complexity; any further performance gains would require either optimizing the iteration and calculation, or introducing error by performing the calculations on a sub-sample (such as every third element).
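A sketch of the second (mean ± 2 sigma) approach, again with data as the sample array and using System.Linq:
double mean = data.Average();
double variance = data.Sum(x => (x - mean) * (x - mean)) / data.Length;
double sigma = Math.Sqrt(variance);

double lower = mean - 2 * sigma;
double upper = mean + 2 * sigma;
var kept = data.Where(x => x >= lower && x <= upper).ToArray(); // third pass: drop the outliers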
A good general answer to your problem seems to be RANSAC.
Given a model, and some noisy data, the algorithm efficiently recovers the parameters of the model.
You will have to choose a simple model that can map your data. Anything smooth should be fine - say, a mixture of a few Gaussians. RANSAC will fit the parameters of your model and estimate a set of inliers at the same time. Then throw away whatever doesn't fit the model properly.
You could filter out points that are 2 or 3 standard deviations out even if the data is not normally distributed; at least it will be done in a consistent manner, which should be important.
As you remove the outliers, the std dev will change, so you could do this in a loop until the change in std dev is minimal. Whether or not you want to do this depends on why you are manipulating the data this way. Some statisticians have major reservations about removing outliers. But some remove the outliers to prove that the data is fairly normally distributed.
Not an expert, but my memory suggests:
to determine percentile points exactly you need to sort and count
taking a sample from the data and calculating the percentile values sounds like a good plan for decent approximation if you can get a good sample
if not, as suggested by Henrik, you can avoid the full sort if you do the buckets and count them
One set of data of 100k elements takes almost no time to sort, so I assume you have to do this repeatedly. If the data set is the same set just updated slightly, you're best off building a tree (O(N log N)) and then removing and adding new points as they come in (O(K log N) where K is the number of points changed). Otherwise, the kth largest element solution already mentioned gives you O(N) for each dataset.

smart way to generate unique random number

I want to generate a sequence of unique random numbers in the range of 00000001 to 99999999.
So the first one might be 00001010, the second 40002928 etc.
The easy way is to generate a random number and store it in the database, and every subsequent time do it again and check in the database whether the number already exists; if so, generate a new one, check it again, etc.
But that doesn't look right - I could be regenerating a number maybe 100 times if the number of generated items gets large.
Is there a smarter way?
EDIT
As always, I forgot to say WHY I wanted this; it will probably make things clearer and maybe suggest an alternative. Here it is:
We want to generate an order number for a booking, so we could just use 000001, 000002, etc. But we don't want to give competitors a clue about how many orders are created (it's not a high-volume market, and we don't want them to know whether we are at order 30 after two months or at order 100). So we want an order number which is random (yet unique).
You can use either a Linear Congruential Generator (LCG) or a Linear Feedback Shift Register (LFSR). Google or Wikipedia for more info.
Both can, with the right parameters, operate on a 'full-cycle' (or 'full-period') basis, so that they generate each 'pseudo-random number' only once in a single period and cover every number in the range. Both are 'weak' generators, so no good for cryptography, but perhaps 'good enough' for apparent randomness. You may have to constrain the output to your 'decimal' maximum, as the natural periods are 'binary'.
Update: I should add that it is not necessary to pre-calculate or pre-store previous values in any way; you only need to keep the previous seed value (a single int) and calculate the next number in the sequence 'on demand'. Of course you can save a chain of pre-calculated numbers to your DB if desired, but it isn't necessary.
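As a sketch of the full-cycle LCG idea for this particular range (the constants below satisfy the Hull-Dobell full-period conditions for modulus 10^8 - c coprime to the modulus, a - 1 divisible by 2, 4 and 5 - but they are illustrative and not vetted for statistical quality):
// Full-period LCG over 0..99,999,999: each value appears exactly once per cycle.
// Persist `state` (e.g. in the database) between calls; that's the only bookkeeping.
const long Modulus = 100_000_000;
const long A = 20_000_001;
const long C = 7;
long state = 12_345_678;          // seed: any value in range

long NextOrderNumber()
{
    do
    {
        state = (A * state + C) % Modulus;
    } while (state == 0);         // skip 0 so output stays within 00000001..99999999
    return state;
}

// e.g. NextOrderNumber().ToString("D8") gives an 8-digit, zero-padded order number.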
How about creating a set of all possible numbers and simply randomising the order? You could then just pick the next number from the tail.
Each number appears only once in the set, and when you want a new one it has already been generated, so the overhead is tiny at the point at which you want one. You could do this in memory or the database of your choice. You'll just need a sensible locking strategy for pulling the next available number.
You could build a table with all the possible numbers in it and give each record a 'used' field.
Select all records that have not been 'used'
Pick a random number (r) between 1 and record count
Take record number r
Get your 'random value' from the record
Set the 'used' flag and update the db.
That should be more efficient than picking random numbers, querying the database, and repeating until not found, as that approach is just begging for an eternity on the last few values.
Use Pseudo-random Number Generators.
For example - Linear Congruential Random Number Generator
(if increment and n are coprime, the code will generate all numbers from 0 to n-1):
int seed = 1, increment = 3;
int n = 10;
int x = seed;
for (int i = 0; i < n; i++)
{
    x = (x + increment) % n;
    Console.WriteLine(x);
}
Output:
4
7
0
3
6
9
2
5
8
1
Basic Random Number Generators
Mersenne Twister
Using this algorithm might be suitable, though it's memory consuming:
http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle
Put the numbers from 1 to 99999999 in an array and do the shuffle.
For the extremely limited size of your numbers, no, you cannot expect uniqueness from any type of purely random generation.
You are generating a 32-bit integer, whereas to rely on uniqueness you need a much larger number, around 128 bits, which is the size GUIDs use; GUIDs are guaranteed to always be globally unique.
In case you happen to have access to a library and you want to dig into and understand the issue well, take a look at
The Art of Computer Programming, Volume 2: Seminumerical Algorithms
by Donald E. Knuth. Chapter 3 is all about random numbers.
You could just place your numbers in a set. If the size of the set after generation of your N numbers is too small, generate some more.
Do some trial runs. How many numbers do you have to generate on average? Try to find out an optimal solution to the tradeoff "generate too many numbers" / "check too often for duplicates". This optimal is a number M, so that after generating M numbers, your set will likely hold N unique numbers.
Oh, and M can also be calculated: if you need one extra number (your set contains N-1), then the chance of a random number already being in the set is (N-1)/R, with R being the range. I'm going cross-eyed here, so you'll have to figure this out yourself (but this kind of stuff is what makes programming fun, no?).
You could put a unique constraint on the column that contains the random number, then handle any constraint violations by regenerating the number. I think this normally indexes the column as well, so lookups would be faster.
You've tagged the question with C#, so I'm guessing you're using C# to generate the random number. Maybe think about getting the database to generate the random number in a stored proc, and return it.
You could try generating the ids using a starting number and an incremental number. You start at a number (say, 12000); then, for each account created, the number goes up by the incremental value.
id = startValue + (totalNumberOfAccounts * incrementalNumber)
If incrementalNumber is a prime value, you should be able to loop around the max account value and not hit another value. This creates the illusion of a random id, but should also produce very few conflicts. In the case of a conflict, you could add a number that increases whenever there is a conflict, so the code above becomes the following. We want to handle this case because, if we encounter one account value that is identical, we will bump into another conflict the next time we increment.
id = startValue + (totalNumberOfAccounts * incrementalNumber) + totalConflicts
With the following lines we can get, for example, 6 non-repeating random numbers from the range 1 to 100:
var randomNumbers = Enumerable.Range(1, 100)
.OrderBy(n => Guid.NewGuid())
.Take(6)
.OrderBy(n => n);
I've had to do something like this before (create a "random-looking" number for part of a URL). What I did was create a list of randomly generated keys. Each time it needed a new number, it simply picked a random index below keys.Count, XORed that key with the given sequence number, then output the XORed value (in base 62) prefixed with the key's index (in base 62).
I also check the output to ensure it does not contain any naughty words. If it does, I simply take the next key and have a second go.
Decrypting the number is equally simple (the first digit is the index of the key to use; a simple XOR and you are done).
I like andora's answer if you are generating new numbers and might have used it had I known. However, if I were to do this again I would simply use UUIDs. Most (if not every) platform has a method for generating them, and the length is just not an issue for URLs.
You could try shuffling the set of possible values then using them sequentially.
I like Lazarus's solution, but if you want to avoid effectively pre-allocating the space for every possible number, just store the used numbers in the table, and build an "unused numbers" list in memory by adding all possible numbers to a collection and then deleting every one that's present in the database. Then select one of the remaining numbers and use it, adding it to the list in the database, obviously.
But, like I say, I like Lazarus's solution - I think that's your best bet for most scenarios.
function getShuffledNumbers(count) {
    var shuffledNumbers = new Array();
    var choices = new Array();

    for (var i = 0; i < count; i++) {
        // choose a number between 1 and amount of numbers remaining
        choices[i] = selectedNumber = Math.ceil(Math.random() * (99999999 - i));

        // Now to figure out the number based on this selection, work backwards until
        // you figure out which choice this number WOULD have been on the first step
        for (var j = 0; j < i; j++) {
            if (choices[i - 1 - j] >= selectedNumber) {
                // This basically says "it was choice number (selectedNumber) on the last step,
                // but if it's greater than or equal to this, it must have been choice number
                // (selectedNumber + 1) on THIS step."
                selectedNumber++;
            }
        }
        shuffledNumbers[i] = selectedNumber;
    }
    return shuffledNumbers;
}
This is as fast a way as I could think of, and it only uses memory as it needs it. However, if you run it all the way through it will use double the memory, because it has two arrays, choices and shuffledNumbers.
Running a linear congruential generator once to generate each number is apt to produce rather feeble results. Running it through a number of iterations which is relatively prime to your base (100,000,000 in this case) will improve it considerably. If before reporting each output from the generator, you run it through one or more additional permutation functions, the final output will still be a duplicate-free permutation of as many numbers as you want (up to 100,000,000) but if the proper functions are chosen the result can be cryptographically strong.
Create and store in the db two shuffled versions (SHUFFLE_1 and SHUFFLE_2) of the interval [0..N), where N = 10'000;
whenever a new order is created, assign its id like this:
ORDER_FAKE_INDEX = N*SHUFFLE_1[ORDER_REAL_INDEX / N] + SHUFFLE_2[ORDER_REAL_INDEX % N]
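A sketch of that scheme in C# (the two permutations would be generated once and persisted, e.g. in the database, so the mapping stays stable across restarts):
const int N = 10_000;
// Two independent permutations of 0..N-1; together they cover N*N = 100,000,000 ids.
int[] shuffle1 = Enumerable.Range(0, N).OrderBy(_ => Guid.NewGuid()).ToArray();
int[] shuffle2 = Enumerable.Range(0, N).OrderBy(_ => Guid.NewGuid()).ToArray();

// Maps the real, sequential order index (0..N*N-1) to a scrambled but still unique fake index.
int FakeIndex(int realIndex)
{
    return N * shuffle1[realIndex / N] + shuffle2[realIndex % N];
}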
I also ran into the same kind of problem, but in C#. I finally solved it; hope it works for you too.
Suppose I need random numbers between 0 and some MaxValue, and I have a Random object called random:
int n = 0;
while (n < MaxValue)
{
    int i = random.Next(n, MaxValue);
    n++;
    Console.WriteLine(i);
}
The stupid way: build a table, store all the numbers in it first, and then, every time a number is used, flag it as "used".
System.Random rnd = new System.Random();
IEnumerable<int> numbers = Enumerable.Range(0, 99999999).OrderBy(r => rnd.Next());
This gives a randomly shuffled collection of ints in your range. You can then iterate through the collection in order.
The nice part about this is that you're not actually creating the entire collection in memory.
See comments below - this will generate the entire collection in memory when you iterate to the first element.
You can generate the numbers as below if you are OK with the memory consumption:
import java.util.ArrayList;
import java.util.Collections;

public class UniqueRandomNumbers {
    public static void main(String[] args) {
        ArrayList<Integer> list = new ArrayList<Integer>();
        for (int i = 1; i < 11; i++) {
            list.add(i);
        }
        Collections.shuffle(list);
        for (int i = 0; i < list.size(); i++) {
            System.out.println(list.get(i));
        }
    }
}
