I have an Array in C# which contains numbers (e.g. int, float or double); I have another array of ranges (each defined as a lower and upper bound). My current implementation is something like this.
foreach (var v in data)
{
    foreach (var row in ranges)
    {
        if (v >= row.lower && v <= row.high)
        {
            statistics[row]++;
            break;
        }
    }
}
So the algorithm is O(mn), where m is the number of ranges and n is the number of data values.
Can this be improved? In practice n is big, and I want this to be as fast as possible.
Sort the data array, then for each interval find the first index in data that falls inside the range and the last one (both using binary search). The number of elements in this interval is then easily computed as lastIdx - firstIdx (or +1, depending on whether lastIdx is inclusive or not).
This is done in O(m log m + n log m), where m is the number of data elements and n the number of intervals (note that m and n are swapped relative to the question).
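A minimal C# sketch of this approach, assuming a ValueRange type with lower/high fields like the question's row and a statistics dictionary keyed by range (all these names are illustrative); hand-rolled lower/upper bound searches are used because Array.BinarySearch doesn't guarantee the first match when duplicates exist:
static void CountPerRange(double[] data, ValueRange[] ranges, Dictionary<ValueRange, int> statistics)
{
    Array.Sort(data);
    foreach (var row in ranges)
        statistics[row] = UpperBound(data, row.high) - LowerBound(data, row.lower);
}
// First index i with a[i] >= x in the sorted array a (lower bound).
static int LowerBound(double[] a, double x)
{
    int lo = 0, hi = a.Length;
    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] < x) lo = mid + 1; else hi = mid;
    }
    return lo;
}
// First index i with a[i] > x in the sorted array a (upper bound).
static int UpperBound(double[] a, double x)
{
    int lo = 0, hi = a.Length;
    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] <= x) lo = mid + 1; else hi = mid;
    }
    return lo;
}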
Bonus: If the data changes constantly, you can use an order statistic tree with the same approach (this tree lets you find the index of each element easily, and it supports modifying the data).
Bonus2: Optimality proof
With comparison-based algorithms this cannot be done better, since if we could, we could also solve the element distinctness problem faster.
Element Distinctness Problem:
Given an array a1, a2, ..., an, find out whether there are indices i, j such that i != j and ai = aj.
This problem is known to have an Omega(n log n) lower bound for comparison-based algorithms.
Reduction:
Given an instance of the element distinctness problem a1, ..., an, create data = a1, ..., an and the intervals [a1,a1], [a2,a2], ..., [an,an], and run the range-counting algorithm.
If the total number of matches is more than n, there are duplicates; otherwise there are none.
The complexity of the above is O(n + f(n)), where n is the number of elements and f(n) is the complexity of the range-counting algorithm. Since the whole procedure has to be Omega(n log n), so does f(n), and we can conclude there is no more efficient algorithm.
Assuming the ranges are ordered, you always take the first range that fits, right?
This means that you could easily build a binary tree of the lower bounds. You find the highest lower bound that's lower than your number, and check if it fits the higher bound. If the tree is properly balanced, this can get you quite close to O(n log m). Of course, if you don't need to change the ranges frequently, a simple ordered array will do - just use the usual binary search methods.
Using a hashtable instead could get you pretty close to O(n), depending on how the ranges are structured. If data is also ordered, you could get even better results.
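For the ordered-array variant, a rough sketch (assuming the ranges don't overlap and their lower bounds are distinct and pre-sorted into the parallel arrays lowerBounds/upperBounds, which are illustrative names):
static int[] CountValuesPerRange(double[] data, double[] lowerBounds, double[] upperBounds)
{
    var counts = new int[lowerBounds.Length];
    foreach (double v in data)
    {
        int idx = Array.BinarySearch(lowerBounds, v);
        if (idx < 0) idx = ~idx - 1;              // index of the largest lower bound <= v
        if (idx >= 0 && v <= upperBounds[idx])    // v falls inside that candidate range
            counts[idx]++;
    }
    return counts;
}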
An alternate solution that doesn't involve sorting the data:
var dictionary = new Dictionary<int, int>();
foreach (var v in data) {
    if (dictionary.ContainsKey(v)) {
        dictionary[v]++;
    } else {
        dictionary[v] = 1;
    }
}
foreach (var row in ranges) {
    for (var i = row.lower; i <= row.higher; i++) {
        if (dictionary.TryGetValue(i, out var count)) {
            statistics[row] += count;   // values absent from data contribute nothing
        }
    }
}
Get a count of the number of occurrences of each value in data, and then sum the counts between the bounds of your range.
Related
Given an array of size 5 with five numbers in it, sort them from smallest to largest without comparing. (Hint: access time O(n).)
I searched a lot but couldn't figure out how this can be done. O(n) means which algorithm/data structure? I'm unaware of one.
I suppose you need counting sort; it has linear time, but takes some memory and depends on the min/max values of your initial array.
The Counting Sort will do this for you, although if I were in an interview and on the spot I'd probably do something like the below which is vaguely similar as I can never remember these "classic" algorithms off the top of my head!
The key idea here is to use each actual unsorted integer value as an index into a target array that contains N+1 elements, where N is the max of the values to be sorted.
I am using a simple class to record both the value and the number of times it occurred so you can reconstruct an actual array from it if you need to keep discrete values that occurred multiple times in the original array.
So all you need to do is walk the unsorted array once, putting each value into the corresponding index in the target array, and (ignoring empty elements) your values are already sorted from smallest to largest without having ever compared them to one another.
(I personally am not a fan of interview questions like this where the answer is "oh, use Counting Sort" or whatever - I would hope that the interviewer asking this question would be genuinely interested to see what approach you took to solving a new problem, regardless of if you got a strictly correct answer or not)
The performance of the below is O(n), meaning it runs in linear time (1 element takes X amount of time, 10 elements take 10X, etc.), but it can use a lot of memory if the max element is large, cannot sort in place, will only work with primitives, and it's not something I'd hope to ever see in production code :)
void Main()
{
    //create unsorted list of random numbers
    var unsorted = new List<int>();
    Random rand = new Random();
    for (int x = 0; x < 10; x++)
    {
        unsorted.Add(rand.Next(1, 10));
    }
    //create array big enough to hold unsorted.Max() elements
    //note this is indirectly performing a comparison of the elements of the array,
    //but not for the sorting, so I guess that is allowable :)
    var sorted = new NumberCount[unsorted.Max() + 1];
    //loop the unsorted array
    for (int index = 0; index < unsorted.Count; index++)
    {
        //get the value at the current index and use it as an index into the target array
        var value = unsorted[index];
        //if the target array already has an entry at that index, just increment the count
        if (sorted[value] != null)
        {
            sorted[value].Count++;
        }
        else
        {
            //insert the current value at its index position
            sorted[value] = new NumberCount { Value = value, Count = 1 };
        }
    }
    //ignore all elements in sorted that are null because they were not part of the original list of numbers.
    foreach (var r in sorted.Where(r => r != null))
    {
        Console.WriteLine("{0}, occurs {1} times", r.Value, r.Count);
    }
}
//just a poco to hold the numbers and the number of times they occurred.
public class NumberCount
{
    public int Value { get; set; }
    public int Count { get; set; }
}
Let's say I have a set of numbers. I need to calculate how many numbers are in a given range.
For example: for given set: {3, 4, 7, 10, 15, 30}:
numbers in range (0, 6) = 2
numbers in range (8, 40) = 3
numbers in range (0, 50) = 6
What kind of structure would be best for that purpose? By best I mean the structure with the fastest execution of said operation. Fast insertion and removal would also be appreciated...
If the set of numbers you are given never changes, one simple option would be to sort the numbers into ascending order, then use binary search on the endpoints of the range to determine where the first element contained in the range lies and where the first element beyond the range lies. You can then take the difference of these two positions to count how many elements are in the range, or just iterate over that span to enumerate all the numbers in it. Using a fast sorting algorithm like quicksort or heapsort, the sorting can be done in O(n log n) time, and each query takes only O(log n) time to do two binary searches.
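For the static case, a minimal sketch of that count-by-two-binary-searches idea (assuming distinct values, as in a set, and exclusive bounds as in the examples above):
// Counts elements strictly between low and high in an ascending array of distinct values.
static int CountInRange(int[] sorted, int low, int high)
{
    int lo = Array.BinarySearch(sorted, low);
    lo = lo >= 0 ? lo + 1 : ~lo;        // first index with a value > low
    int hi = Array.BinarySearch(sorted, high);
    hi = hi >= 0 ? hi : ~hi;            // first index with a value >= high
    return hi - lo;
}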
You could potentially speed this up in a variety of ways. For example, if you know that the numbers are more or less evenly distributed, you could use interpolation search instead of binary search to do the lookups. This takes expected time O(log log n) to do each query, which is exponentially faster than before. If you know that the numbers are all in the range [0, N), you could use a more advanced data structure like a van Emde Boas tree to speed up all operations to O(log log N) in the worst case.
If, on the other hand, the set of numbers can grow and shrink, then you might want to consider using a balanced binary search tree to store your numbers. You can then do efficient searches on the tree (in time O(log n)) to determine the first number in the range and first number not in the range.
Hope this helps!
This is a well-studied problem in computational geometry called range searching, although you have the 1-D version. The question is how common each operation is; if insertions and deletions are seldom, you can just tabulate the answers. That gives you O(n^2) storage and constant-time querying.
templatetypedef's answer is fine if your dataset won't change over time, but you mention a need for fast insertion and removal. [EDIT: David Eisenstat has explained how two O(log n) searches of a balanced binary tree augmented with per-node counts can efficiently count elements in a given range.]
In any case, if fast updates are required, the ideal data structure for your problem is the Fenwick tree (also known as a binary indexed tree, or BIT). This data structure provides O(log n) guarantees for both of the following operations:
Query: Count the number of elements between 0 and any given number.
Update: Insert or remove any given number of copies of some given number to/from the multiset.
Two query calls allow you to count the number of elements in any given range [i, j) using count(j) - count(i).
Both queries and updates on Fenwick trees involve only simple bitwise operations and lookups on a single array, so using this data structure will yield a very competitive constant on the O(log n) -- I expect it will be much faster than maintaining a balanced binary tree under updates, which requires pointer manipulations and tree rebalancing.
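A minimal Fenwick tree sketch along these lines, assuming the stored values are small non-negative integers in [0, maxValue] (for arbitrary values you would first coordinate-compress them; names are illustrative):
class Fenwick
{
    private readonly int[] tree;

    public Fenwick(int maxValue) { tree = new int[maxValue + 2]; }

    // Add delta copies of value to the multiset (use a negative delta to remove).
    public void Update(int value, int delta)
    {
        for (int i = value + 1; i < tree.Length; i += i & -i)
            tree[i] += delta;
    }

    // Number of stored elements <= value.
    public int CountUpTo(int value)
    {
        int sum = 0;
        for (int i = value + 1; i > 0; i -= i & -i)
            sum += tree[i];
        return sum;
    }

    // Number of stored elements in the half-open range [low, high).
    public int CountInRange(int low, int high)
    {
        return CountUpTo(high - 1) - CountUpTo(low - 1);
    }
}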
What is wrong with this?
static int Count(IList<int> set, int min, int max)
{
    int count = 0;
    foreach (int i in set)
        if (i < max && i > min)
            count++;
    return count;
}
I have a sorted dictionary that contains measured data points as key/value pairs. To determine the value for a non-measured data point, I want to interpolate between two known keys using a linear interpolation of their corresponding values. I understand how to calculate the non-measured data point once I have the two key/value pairs it lies between. What I don't know is how to find out which keys it lies between. Is there a more elegant way than a "for" loop (I'm thinking function/LINQ query) to figure out which two keys my data point lies between?
Something like this would work:
dic.Keys.Zip(dic.Keys.Skip(1),
             (a, b) => new { a, b })
        .Where(x => x.a <= datapoint && x.b >= datapoint)
        .FirstOrDefault();
This traverses the keys, using the fact that they are ordered, and compares each pair of consecutive keys - since LINQ is lazy, the traversal stops as soon as the first match is found.
The standard C# answers are all O(N) complexity.
Sometimes you just need a small subset of a rather large sorted collection (so you're not iterating all the keys).
The standard C# collections won't help you here. A solution is as follows:
http://www.itu.dk/research/c5/ Use the IntervalHeap in the C5 collections library. This class supports a GetRange() method and will look up the start key with O(log N) complexity and iterate the range with O(N) complexity, which is definitely useful for big datasets when performance is critical, e.g. spatial partitioning in games.
Possibly you're asking about the following:
myDictionary.Keys.Where(w => w > start && w < end)
A regular loop should be OK here:
IEnumerable<double> keys = ...; //ordered sequence of keys
double interpolatedKey = ...;
// I'm assuming here that the keys collection doesn't contain interpolatedKey
double? lowerFoundKey = null;
double? upperFoundKey = null;
foreach (double key in keys)
{
    if (key > interpolatedKey)
    {
        upperFoundKey = key;
        break;
    }
    else
        lowerFoundKey = key;
}
You can do it in C# with LINQ with shorter but less efficient code:
double lowerFoundKey = keys.LastOrDefault(k => k < interpolatedKey);
double upperFoundKey = keys.FirstOrDefault(k => k > interpolatedKey);
To do this efficiently with LINQ, it would need a method like the one F# calls windowed (with parameter 2), which returns an IEnumerable of adjacent pairs from the keys collection. Since that function is missing from LINQ, a regular foreach loop should be fine; a sketch of such a helper follows.
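Such a pairwise helper is easy to write yourself; a sketch (the names Pairwise and EnumerableExtensions are illustrative, and value tuples need C# 7 or later):
static class EnumerableExtensions
{
    public static IEnumerable<(T Previous, T Current)> Pairwise<T>(this IEnumerable<T> source)
    {
        using (var e = source.GetEnumerator())
        {
            if (!e.MoveNext()) yield break;
            T previous = e.Current;
            while (e.MoveNext())
            {
                yield return (previous, e.Current);     // emit each adjacent pair exactly once
                previous = e.Current;
            }
        }
    }
}
With it, the bracketing keys become keys.Pairwise().First(p => p.Previous <= interpolatedKey && p.Current >= interpolatedKey), assuming the key actually lies inside the measured span.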
I don't think there is a function on SortedDictionary that lets you find the elements around the one you need faster than iterating the elements (+1 to BrokenGlass's solution).
To be able to find items faster you need to switch to some other structure. For example, SortedList provides similar functionality but lets you index its Keys collection, so you can use binary search to find the range.
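A sketch of that binary search over the Keys collection of a SortedList (the helper name is illustrative):
// Returns the index of the last key <= target, or -1 if every key is larger.
static int IndexOfLastKeyAtMost(IList<double> sortedKeys, double target)
{
    int lo = 0, hi = sortedKeys.Count - 1, result = -1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (sortedKeys[mid] <= target) { result = mid; lo = mid + 1; }
        else hi = mid - 1;
    }
    return result;   // the bracketing keys are then sortedKeys[result] and sortedKeys[result + 1]
}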
I have a program that needs to repeatedly compute the approximate percentile (order statistic) of a dataset in order to remove outliers before further processing. I'm currently doing so by sorting the array of values and picking the appropriate element; this is doable, but it's a noticeable blip on the profiles despite being a fairly minor part of the program.
More info:
The data set contains on the order of up to 100000 floating point numbers, and is assumed to be "reasonably" distributed - there are unlikely to be duplicates or huge spikes in density near particular values; and if for some odd reason the distribution is odd, it's OK for an approximation to be less accurate, since the data is probably messed up anyhow and further processing dubious. However, the data isn't necessarily uniformly or normally distributed; it's just very unlikely to be degenerate.
An approximate solution would be fine, but I do need to understand how the approximation introduces error to ensure it's valid.
Since the aim is to remove outliers, I'm computing two percentiles over the same data at all times: e.g. one at 95% and one at 5%.
The app is in C# with bits of heavy lifting in C++; pseudocode or a preexisting library in either would be fine.
An entirely different way of removing outliers would be fine too, as long as it's reasonable.
Update: It seems I'm looking for an approximate selection algorithm.
Although this is all done in a loop, the data is (slightly) different every time, so it's not easy to reuse a datastructure as was done for this question.
Implemented Solution
Using the Wikipedia selection algorithm as suggested by Gronim reduced this part of the run-time by about a factor of 20.
Since I couldn't find a C# implementation, here's what I came up with. It's faster than Array.Sort even for small inputs; at 1000 elements it's 25 times faster.
public static double QuickSelect(double[] list, int k) {
    return QuickSelect(list, k, 0, list.Length);
}
public static double QuickSelect(double[] list, int k, int startI, int endI) {
    while (true) {
        // Assume startI <= k < endI
        int pivotI = (startI + endI) / 2; //arbitrary, but good if sorted
        int splitI = partition(list, startI, endI, pivotI);
        if (k < splitI)
            endI = splitI;
        else if (k > splitI)
            startI = splitI + 1;
        else //if (k == splitI)
            return list[k];
    }
    //when this returns, all elements of list[i] <= list[k] iff i <= k
}
static int partition(double[] list, int startI, int endI, int pivotI) {
    double pivotValue = list[pivotI];
    list[pivotI] = list[startI];
    list[startI] = pivotValue;
    int storeI = startI + 1; //no need to store the pivot item, it's good already.
    //Invariant: startI < storeI <= endI
    while (storeI < endI && list[storeI] <= pivotValue) ++storeI; //fast if sorted
    //now storeI == endI || list[storeI] > pivotValue
    //so the element at storeI is either irrelevant or too large.
    for (int i = storeI + 1; i < endI; ++i)
        if (list[i] <= pivotValue) {
            list.swap_elems(i, storeI);
            ++storeI;
        }
    int newPivotI = storeI - 1;
    list[startI] = list[newPivotI];
    list[newPivotI] = pivotValue;
    //now all elements in [startI, newPivotI] are <= pivotValue && list[newPivotI] == pivotValue.
    return newPivotI;
}
static void swap_elems(this double[] list, int i, int j) {
    double tmp = list[i];
    list[i] = list[j];
    list[j] = tmp;
}
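For example, to get both cut-offs (values is an assumed double[]; QuickSelect permutes its input, so work on a copy if you still need the original order):
var copy = (double[])values.Clone();
double lowCutoff = QuickSelect(copy, (int)(0.05 * copy.Length));
double highCutoff = QuickSelect(copy, (int)(0.95 * copy.Length));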
Thanks, Gronim, for pointing me in the right direction!
The histogram solution from Henrik will work. You can also use a selection algorithm to efficiently find the k largest or smallest elements in an array of n elements in O(n). To use this for the 95th percentile set k=0.05n and find the k largest elements.
Reference:
http://en.wikipedia.org/wiki/Selection_algorithm#Selecting_k_smallest_or_largest_elements
According to its creator, a SoftHeap can be used to "compute exact or approximate medians and percentiles optimally. It is also useful for approximate sorting..."
I used to identify outliers by calculating the standard deviation. Everything farther than 2 (or 3) times the standard deviation from the average is an outlier. 2 times = about 95%.
Since you are calculating the average anyway, it's also very easy and fast to calculate the standard deviation.
You could also use only a subset of your data to calculate the numbers.
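A minimal sketch of this filter, assuming the samples are in a double[] called data (and using the population standard deviation for simplicity):
double mean = data.Average();
double stdDev = Math.Sqrt(data.Select(v => (v - mean) * (v - mean)).Average());
double[] kept = data.Where(v => Math.Abs(v - mean) <= 2 * stdDev).ToArray();   // keep everything within 2 sigma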
You could estimate your percentiles from just a part of your dataset, like the first few thousand points.
The Glivenko–Cantelli theorem ensures that this would be a fairly good estimate, if you can assume your data points to be independent.
Divide the interval between minimum and maximum of your data into (say) 1000 bins and calculate a histogram. Then build partial sums and see where they first exceed 5000 or 95000.
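A sketch of that histogram pass, assuming roughly 100000 samples in a double[] called data (bin count and thresholds as suggested above; the degenerate all-equal case is ignored):
const int bins = 1000;
double min = data.Min(), max = data.Max();
double width = (max - min) / bins;
var counts = new int[bins];
foreach (double v in data)
{
    int b = (int)((v - min) / width);
    counts[Math.Min(b, bins - 1)]++;            // the maximum itself lands in the last bin
}
int lowTarget = (int)(0.05 * data.Length);      // ~5000 for 100000 points
int highTarget = (int)(0.95 * data.Length);     // ~95000 for 100000 points
int running = 0;
double lowerCut = min, upperCut = max;
bool lowerFound = false;
for (int b = 0; b < bins; b++)
{
    running += counts[b];
    if (!lowerFound && running >= lowTarget) { lowerCut = min + (b + 1) * width; lowerFound = true; }
    if (running >= highTarget) { upperCut = min + (b + 1) * width; break; }
}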
There are a couple of basic approaches I can think of. The first is to compute the range (by finding the highest and lowest values), project each element to a fraction of that range ((x - min) / range), and throw out any that evaluate to lower than .05 or higher than .95.
The second is to compute the mean and standard deviation. A span of 2 standard deviations from the mean (in both directions) will enclose 95% of a normally-distributed sample space, meaning your outliers would be below the 2.5th and above the 97.5th percentiles. Calculating the mean of a series is linear, as is the standard deviation (the square root of the mean of the squared differences between each element and the mean). Then subtract 2 sigmas from the mean and add 2 sigmas to the mean, and you've got your outlier limits.
Both of these will compute in roughly linear time; the first one requires two passes, the second one takes three (once you have your limits you still have to discard the outliers). Since this is a list-based operation, I do not think you will find anything with logarithmic or constant complexity; any further performance gains would require either optimizing the iteration and calculation, or introducing error by performing the calculations on a sub-sample (such as every third element).
A good general answer to your problem seems to be RANSAC.
Given a model, and some noisy data, the algorithm efficiently recovers the parameters of the model.
You will have to choose a simple model that can map your data. Anything smooth should be fine - say, a mixture of a few Gaussians. RANSAC will fit the parameters of your model and estimate a set of inliers at the same time. Then throw away whatever doesn't fit the model properly.
You could filter at 2 or 3 standard deviations even if the data is not normally distributed; at least it will be done in a consistent manner, which should be important.
As you remove the outliers the standard deviation will change, so you could do this in a loop until the change in standard deviation is minimal. Whether or not you want to do this depends upon why you are manipulating the data this way. Some statisticians have major reservations about removing outliers, but others remove them to show that the data is fairly normally distributed.
Not an expert, but my memory suggests:
to determine percentile points exactly you need to sort and count
taking a sample from the data and calculating the percentile values sounds like a good plan for decent approximation if you can get a good sample
if not, as suggested by Henrik, you can avoid the full sort by bucketing the data and counting the buckets
One set of data of 100k elements takes almost no time to sort, so I assume you have to do this repeatedly. If the data set is the same set just updated slightly, you're best off building a tree (O(N log N)) and then removing and adding new points as they come in (O(K log N) where K is the number of points changed). Otherwise, the kth largest element solution already mentioned gives you O(N) for each dataset.
I was trying to create this helper function in C# that returns the first n prime numbers. I decided to store the numbers in a dictionary in the <int, bool> format. The key is the number in question and the bool represents whether the int is a prime or not. There are a ton of resources out there for calculating/generating prime numbers (SO included), so I thought of joining the masses by crafting another trivial prime number generator.
My logic goes as follows:
public static Dictionary<int, bool> GetAllPrimes(int number)
{
    Dictionary<int, bool> numberArray = new Dictionary<int, bool>();
    int current = 2;
    while (current <= number)
    {
        //If current has not been marked as composite in previous iterations, mark it as prime
        if (!numberArray.ContainsKey(current))
            numberArray.Add(current, true);
        int i = 2;
        while (current * i <= number)
        {
            if (!numberArray.ContainsKey(current * i))
                numberArray.Add(current * i, false);
            else if (numberArray[current * i]) //current*i cannot be a prime
                numberArray[current * i] = false;
            i++;
        }
        current++;
    }
    return numberArray;
}
It would be great if the wise could provide me with suggestions, optimizations, and possible refactorings. I was also wondering whether the inclusion of the Dictionary helps the run-time of this snippet.
Storing integers explicitly needs at least 32 bits per prime number, with some overhead for the container structure.
Around 2^31, the maximum value a signed 32-bit integer can take, about every 21.5th number is prime. Smaller primes are denser: about 1 in ln(n) numbers around n is prime, and ln(2^31) ≈ 21.5.
This means it is more memory efficient to use an array of bits than to store numbers explicitly. It will also be much faster to look up if a number is prime, and reasonably fast to iterate through the primes.
It seems this is called a BitArray in C# (in Java it is BitSet).
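A sketch of a sieve over a BitArray; a set bit means "known composite", so no initialisation pass is needed (a BitArray starts out all false):
using System.Collections;

static BitArray SieveComposites(int limit)
{
    var composite = new BitArray(limit + 1);
    for (int p = 2; (long)p * p <= limit; p++)
    {
        if (composite[p]) continue;
        for (long m = (long)p * p; m <= limit; m += p)
            composite[(int)m] = true;           // cross off multiples of the prime p
    }
    return composite;                           // n >= 2 is prime iff !composite[n]
}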
The first thing that bothers me is: why are you storing the number itself?
Can't you just use the index itself which will represent the number?
PS: I'm not a C# developer, so maybe it is not possible with a dictionary, but it can be done with the appropriate structure.
First, you only have to loop until the square root of the number. Make all numbers false by default, and use a simple flag that you set to true at the beginning of every iteration.
Further, don't store it in a dictionary. Make it a bool array and have the index be the number you're looking for. Only index 0 won't make any sense, but that doesn't matter. You don't have to init either; bools are false by default. Just declare a bool[] of length number.
Then, I would init like this:
primes[2] = true;
for(int i = 3; i < sqrtNumber; i += 2) {
}
So you skip all the even numbers automatically.
By the way, declaring the loop variable (i) inside the loop does not make it slower in C#; the compiler takes care of that, so put it wherever reads best.
So that's about it. For more info see this page.
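One way to complete that skeleton; a sketch that keeps a bool[] indexed by the number itself and never touches the even entries above 2:
static bool[] GetPrimeFlags(int number)
{
    var isPrime = new bool[number + 1];
    if (number >= 2) isPrime[2] = true;
    for (int n = 3; n <= number; n += 2)
        isPrime[n] = true;                      // assume every odd number is prime until crossed off
    for (int i = 3; (long)i * i <= number; i += 2)
    {
        if (!isPrime[i]) continue;
        for (long m = (long)i * i; m <= number; m += 2L * i)
            isPrime[(int)m] = false;            // odd multiples only; even ones are already false
    }
    return isPrime;
}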
I'm pretty sure the Dictionary actually hurts performance, since it doesn't enable you to perform the trial divisions in an optimal order. Traditionally, you would store the known primes so that they could be iterated from smallest to largest, since smaller primes are factors of more composite numbers than larger primes. Additionally, you never need to try division with any prime larger than the square root of the candidate prime.
Many other optimizations are possible (as you yourself point out, this problem has been studied to death) but those are the ones that I can see off the top of my head.
The dictionary really doesn't make sense here -- just store all primes up to a given number in a list. Then follow these steps:
Is given number in the list?
Yes - it's prime. Done.
Not in list
Is given number larger than the list maximum?
No - it's not prime. Done.
Bigger than the maximum; the list needs to be filled up to the given number.
Run a sieve up to given number.
Repeat.
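A rough sketch of those steps (names are illustrative; for brevity the sieve simply recomputes the cached list from scratch whenever it needs to grow):
static readonly List<int> knownPrimes = new List<int> { 2, 3 };

static bool IsPrime(int n)
{
    if (n < 2) return false;
    if (n > knownPrimes[knownPrimes.Count - 1])
        ExtendSieve(n);                          // list too short: run a sieve up to n
    return knownPrimes.BinarySearch(n) >= 0;     // in the list => prime, otherwise composite
}

static void ExtendSieve(int limit)
{
    var composite = new bool[limit + 1];
    knownPrimes.Clear();
    for (int p = 2; p <= limit; p++)
    {
        if (composite[p]) continue;
        knownPrimes.Add(p);
        for (long m = (long)p * p; m <= limit; m += p)
            composite[(int)m] = true;
    }
}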
1) From the perspective of the client of this function, wouldn't it be better if the return type were bool[] (indexed from 0 to number, perhaps)? Internally, you have three states (KnownPrime, KnownComposite, Unknown), which could be represented by an enumeration. Storing an array of this enumeration internally, prepopulated with Unknown, will be faster than a dictionary.
2) If you stick with the dictionary, the part of the sieve that marks multiples of the current number as composite could be replaced with a numberArray.TryGetValue() pattern rather than multiple checks for ContainsKey and subsequent retrieval of the value by key.
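For point 2, the inner marking step could look like this with TryGetValue (a sketch with the same behaviour as the original checks):
if (numberArray.TryGetValue(current * i, out bool wasMarkedPrime))
{
    if (wasMarkedPrime)
        numberArray[current * i] = false;       // current * i cannot be prime
}
else
{
    numberArray.Add(current * i, false);
}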
The trouble with returning an object that holds the primes is that unless you're careful to make it immutable, client code is free to mess up the values, in turn meaning you're not able to cache the primes you've already calculated.
How about having a method such as:
bool IsPrime(int primeTest);
in your helper class that can hide the primes it's already calculated, meaning you don't have to re-calculate them every time.