Bell Curve algorithm to adjust set of scores - c#

I am faced with a challenge whereby the business user would like a "Bell curve" applied to their scoring.
This system scores people on a 1-5 point scale. The complaint is that most scorers are too generous, and the business would like the scores within a group of people to be adjusted down (or up) based on a bell curve.
I would assume then that they are trying to make the majority of people sit at the median level i.e. 3 in this case. I am not sure that the client is correct in their terminology wrt Bell Curve but the requirement is that the scores are leveled out to the 3 level.
What would be the best algorithm to achieve this?
For example, one group might have the scores 3, 4, 4, 3, 5. In this case the scoring is on average higher than 3. What would be a fair way to adjust all these scores so that the "bell curve" is applied?

The bell curve is the probability density function (PDF) of the normal distribution, so that's your goal.
The key to this transformation is the cumulative distribution function (CDF). In words, "y% of the values are less than or equal to x". You can easily tabulate the CDF of your input data. The CDF of the normal distribution is also known (it is the integral of the bell curve).
Together, this gives you: "y% of the scores are less than x, but according to the normal distribution, y% of the scores should be less than x', therefore the correction is x -> x'".
Mathematically, this is done via the probit function.
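A minimal sketch of that CDF-to-CDF mapping, written in Python for brevity (the standard library's `NormalDist.inv_cdf` plays the role of the probit function). The function name, the target standard deviation of 1.0, the clamping to the 1-5 scale and the mid-rank estimate of the empirical CDF are all assumptions made for illustration, not part of the answer above.

```python
from statistics import NormalDist


def bell_curve_adjust(scores, target_mean=3.0, target_sd=1.0, lo=1.0, hi=5.0):
    """Map each score to the normal quantile at its empirical CDF position."""
    n = len(scores)
    target = NormalDist(mu=target_mean, sigma=target_sd)
    adjusted = []
    for s in scores:
        below = sum(1 for v in scores if v < s)
        equal = sum(1 for v in scores if v == s)
        # Mid-rank CDF estimate (assumption): always strictly inside (0, 1),
        # so the inverse CDF is defined even for the top score.
        p = (below + 0.5 * equal) / n
        x = target.inv_cdf(p)
        # Clamp to the scoring scale (assumption: scores must stay in 1..5).
        adjusted.append(min(hi, max(lo, x)))
    return adjusted


if __name__ == "__main__":
    print(bell_curve_adjust([3, 4, 4, 3, 5]))
```

Tied scores receive the same adjusted value and the ordering of people is preserved; only the spacing between scores changes.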

You usually assume that your data fit a distribution instead of transforming your data into a given distribution.
If your input data fit a normal distribution ("bell curve"), then you can adjust them by simply adding the same value to (or subtracting it from) every sample.
The distribution will be preserved, only the mean will change.
If you want to center your distribution on a given mean, just add the difference between your target mean and the actual one.
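As a rough illustration of this mean shift (the function name and the default target of 3.0 are assumptions for the example):

```python
def shift_to_mean(scores, target_mean=3.0):
    """Add the same offset to every score so the group mean hits the target."""
    offset = target_mean - sum(scores) / len(scores)
    return [s + offset for s in scores]


if __name__ == "__main__":
    # The example group has mean 3.8, so every score moves down by about 0.8.
    print(shift_to_mean([3, 4, 4, 3, 5]))
```

The shape and spread of the group are untouched, so this only helps if the scores already look roughly bell-shaped.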

Related

Understanding Gabor filter

In Accord.net framework, two classes are used to construct a Gabor filter:
Accord.Math.Gabor
Accord.Imaging.Filters.GaborFilter
There are various implementations of Gabor filter elsewhere:
How to convolve an image with different Gabor filters adjusted according to the local orientation and density using FFT?
Gabor Filter – Image processing for scientists and engineers, Part 6
https://github.com/clumsy/gabor-filter/blob/master/src/main/java/GaborFilter.java
https://github.com/dominiklessel/opencv-gabor-filter/blob/master/gaborFilter.cpp
https://github.com/adriaant/Gabor-API
but the source code in Accord.net looks very strange to me. It discusses 4 types of kernels:
Real
Imaginary
Magnitude
SquaredMagnitude
Can anyone either explain the latter 3 (Real is self-explanatory) types or refer me to some materials where I can study them?
The Gabor kernel g(t) is complex-valued. It is a quadrature filter, meaning that, in the frequency domain (G(f)), it has no negative frequencies. Thus, the even and odd parts of this frequency response are related by even(G(f)) = odd(G(f)) * sign(f). That is, the even and odd parts have the same values for positive frequencies, but opposite (negated) values for negative frequencies. Adding up the even and odd parts thus leads to the negative frequencies canceling out, and the positive frequencies reinforcing each other.
The even part of the (real-valued) frequency response corresponds to an even and real-valued kernel. The odd part corresponds to an odd and imaginary-valued kernel. The even kernel is a windowed cosine, the odd kernel is a windowed sine.
The Gabor filter is applied by convolving the image with these two components, then taking the magnitude of the result.
The magnitude of the filter itself is just a Gaussian smoothing kernel (it's the window over the sine and cosine). Note that cos^2+sin^2=1, so the magnitude doesn't show the wave component of the kernel. The code you linked that computes the magnitude of the Gabor kernel does a whole lot of pointless computations... :)
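To make the four kernel kinds concrete, here is a small 1D sketch in Python. It is not Accord.NET's formula; the parameterisation (a Gaussian window of width `sigma` times a complex exponential at `frequency`) and all names are assumptions chosen for illustration.

```python
import cmath
import math


def gabor_kernel_1d(sigma, frequency, half_width):
    """Complex 1D Gabor kernel: Gaussian window times a complex exponential."""
    kernel = []
    for t in range(-half_width, half_width + 1):
        envelope = math.exp(-(t * t) / (2.0 * sigma * sigma))
        carrier = cmath.exp(2j * math.pi * frequency * t)
        kernel.append(envelope * carrier)
    return kernel


kernel = gabor_kernel_1d(sigma=2.0, frequency=0.25, half_width=6)
real_part = [g.real for g in kernel]            # windowed cosine (even)
imag_part = [g.imag for g in kernel]            # windowed sine (odd)
magnitude = [abs(g) for g in kernel]            # just the Gaussian window
squared_magnitude = [g.real ** 2 + g.imag ** 2 for g in kernel]
```

Because cos^2 + sin^2 = 1, `magnitude` contains only the Gaussian window, which is why computing it through the full kernel is wasted work.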

Detecting significant changes in data

I have a graph input where the X axis is time (going forwards). The Y axis is generally stable but has large drops and raises at different points (marked as the red arrows below)
Visually it's obvious but how do I efficiently detect this from within code? I'm not sure which algorithms I should be using but I would like to keep it as simple as possible.
A simple way is to calculate the difference between every two neighbouring samples, e.g. diff = abs(y[x+1] - y[x]), and then calculate the standard deviation of all the differences. Measuring each difference in units of that standard deviation ranks the differences for you and also helps eliminate the random noise you would pick up if you simply took the largest diff values.
If your up/down changes span several x periods (e.g. temperature plotted every minute), then calculate the diff over N samples, taking the max and min of those N samples. If you want 5 samples to be the detection period, take samples 0,1,2,3,4, extract the min and max, and use their difference as the diff. Repeat for samples 1,2,3,4,5 and so on. You may need to experiment with N, as too many samples starts to affect the standard deviation.
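A minimal sketch of this windowed-difference approach in Python. The function name, the `samples` parameter and the cut-off of 3 standard deviations above the mean difference are assumptions for illustration; the threshold in particular needs tuning against real data.

```python
from statistics import mean, pstdev


def significant_changes(y, samples=2, threshold=3.0):
    """Return (start_index, diff) for windows whose max-min range stands out.

    samples=2 compares neighbouring points; samples=5 uses windows 0..4, 1..5, ...
    """
    diffs = []
    for i in range(len(y) - samples + 1):
        window = y[i:i + samples]
        diffs.append(max(window) - min(window))
    mu = mean(diffs)
    sd = pstdev(diffs)
    if sd == 0:
        return []  # perfectly flat (or perfectly regular) series
    # Keep windows whose diff lies more than `threshold` stddevs above the mean.
    return [(i, d) for i, d in enumerate(diffs) if (d - mu) / sd > threshold]
```

For a series that sits at one level and then jumps to another, the only reported window is the one that straddles the jump.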
An alternative method is to calculate the slope of up/down parts of the chart by subsampling and selecting slopes and lengths that are interesting. While this can be more accurate for automated detection it is much harder to describe the algorithm in depth.
I've worked on similar issues and built a chart categoriser, but would really love references to research in this area.
When you get this going, you may also want to look at 'control charts' from operations research; they identify several patterns that might also be worth detecting, depending on what your charts represent.

Desiring jagged results from simplex noise or another algorithm just as fast

I'm wanting to do some placement of objects like trees and the like based on noise for the terrain of a game/tech demo.
I've used value noise previously and I believe I understand perlin noise well enough. Simplex noise, however, escapes me quite well (just a tad over my head at present).
I have an implementation in C# of simplex noise, however, it's almost completely stolen from here. It works beautifully, but I just don't understand it well enough to modify it for my own purposes.
It is quite fast, but it also gives rather smooth results. I'm actually wanting something that is a little more jagged, like simple linear interpolation would give when I was doing value noise. My issue here is that due to the amount of calls I'd be doing for these object placements and using fractal Brownian motion, the speed of the algorithm becomes quite important.
Any suggestions on how to get more 'jagged' results like linear interpolation gives with value noise using a faster algorithm than value noise is?
If you are using a complex noise function to do a simple task like the placement of trees, you're using completely the wrong type of maths function. It is a very specific function which is great for making textures, 3D shapes and irregular curves. Placing trees in 2D certainly doesn't need irregular curves, unless you want to place trees along lines that are irregular and curved!
Or unless you mean that you want to place trees in areas where the noise is at a certain level, for example where the noise is larger than 0.98, which will give you nicely randomised zones that you can use as central points around which some trees will be placed.
It will be a lot faster and a lot easier to vary if you just use any normal noise function and program your placement code around it. I mean a predictable pseudo-random noise function which gives the same result every time you use it.
Use integers 0 to 10 and 20 to 30, multiplied by your level number, to select 10 X and 10 Y points on the same pseudo-random noise curve. This will give you 10 random spots on your map from where to do stuff, using almost no calculations.
Once you have the central points where trees will be, use another 10 random points from the function to say how many trees will be there, and another 10 to say how far apart they will be, for the distribution around the tree seed.
The other option, if you want to change the curve, is to read this paper, http://webstaff.itn.liu.se/~stegu/simplexnoise/simplexnoise.pdf, and look at the polynomial function, or whatever gradient function is used in your code. Going by the comments for that function, comment it out and use X equals Y instead, which should give you a straight interpolation curve.
If you vote this answer up, I should have enough points to comment on this forum :]
I realise this is a very old question, but I felt that the previous answer was entirely wrong, so I wanted to clarify how you should use a noise function to determine the placement of things like trees / rocks / bushes.
Basically, if you want to globally place items across a terrain, you're going to need some function which tells you where those are likely to occur. For instance, you might say "trees need to be on slopes of 45 degrees or less, and below 2000 meters". This gives you a map of possible places for trees. But now you need to choose random, but clustered locations for them.
The best way of doing this is to multiply your map of zeroes and ones by a fractal function (e.g. a Simplex noise function or one generated through subdivision and displacement - see https://fractal-landscapes.co.uk/maths).
This then gives you a probability density function, where the value at a point represents the relative probability of placing a tree at that location. Now you store the partial sum of that function for every location on the map. To place a new tree:
Choose a random number between 0 and the maximum of the summed function.
Do a binary search to find the location on the map in this range.
Place the tree there.
Rinse and repeat.
This allows you to place objects where they belong, according to their natural ranges and so on.
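The placement steps above can be sketched in Python as follows. The function names and the row-major flattening of the map are assumptions, and the noise values are assumed to have been rescaled to be non-negative (e.g. to [0, 1]) so that the products can act as relative probabilities.

```python
import bisect
import itertools
import random


def build_cumulative(mask, noise):
    """Row-major running sum of mask * noise over a 2D map."""
    weights = [m * n
               for mask_row, noise_row in zip(mask, noise)
               for m, n in zip(mask_row, noise_row)]
    return list(itertools.accumulate(weights))


def place_tree(cumulative, width, rng):
    """Pick a (row, col) cell with probability proportional to its weight."""
    total = cumulative[-1]
    if total <= 0:
        raise ValueError("no cell allows a tree")
    r = rng.random() * total                    # random point in [0, total)
    index = bisect.bisect_right(cumulative, r)  # binary search in the sums
    return divmod(index, width)


if __name__ == "__main__":
    mask = [[0, 1], [1, 0]]            # 1 = trees allowed here
    noise = [[0.9, 0.2], [0.6, 0.9]]   # assumed already rescaled to [0, 1]
    sums = build_cumulative(mask, noise)
    print(place_tree(sums, width=2, rng=random.Random(42)))
```

Using `bisect_right` means a cell with zero weight can never be selected, because its running sum equals that of the cell before it.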

Uniform distribution from a fractal Perlin noise function in C#

My Perlin noise function (which adds up 6 octaves of 3D simplex at 0.75 persistence) generates a 2D array of doubles.
These numbers each come out normalized to [-1, 1], with mean at 0. I clamp them to avoid exceptions, which I think are due to floating-point accuracy issues, but I am fairly sure my scaling factor is good enough for restricting the noise output to exactly this neighborhood in the ideal case.
Anyway, that's all details. The point is, here is a 256-by-256 array of noise:
The histogram with a normal fit looks like this:
Matlab's lillietest is a function which applies the Lilliefors test to determine if a set of numbers comes from a normal distribution. My result was, repeatedly, 1, which means that these numbers are not normally distributed.
I would like a function f(x) such that, when applied to the list of values from my noise function, the results appear uniformly distributed.
I would like this function to be implementable in C# and not take minutes to run.
Once again, it shouldn't matter where the numbers come from (the question is about transforming one distribution into another, specifically a normal-like one to uniform). Nevertheless, my noise function implementation is based on this and this. You can find the above array of values here.
Oddly enough I just wrote an article on your very question:
http://ericlippert.com/2012/02/21/generating-random-non-uniform-data/
There I discuss how to turn a uniform distribution into some other distribution, but of course you can use similar techniques to transform other distributions.
You will probably be interested in one of the following (related) techniques:
Probability integral transform
Histogram equalization
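Both techniques boil down to pushing each value through the CDF of its own distribution. A minimal sketch in Python, using the empirical CDF of the sample itself (the function name is an assumption, and the 2D noise array is assumed to be flattened into a list first):

```python
import bisect


def make_equalizer(sample):
    """Build f(x) = empirical CDF of `sample`, which maps the sample to ~uniform."""
    ordered = sorted(sample)
    n = len(ordered)

    def f(x):
        # Fraction of sample values <= x, found by binary search.
        return bisect.bisect_right(ordered, x) / n

    return f
```

The result lies in [0, 1]; use `2 * f(x) - 1` to get back to [-1, 1]. Sorting 256 x 256 values once and doing one binary search per value is cheap, so a port of this idea should run far faster than the minutes the question wants to avoid.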

How to match SURF interest points to a database of images

I am using the SURF algorithm in C# (OpenSurf) to get a list of interest points from an image. Each of these interest points contains a vector of descriptors, an x coordinate (int), a y coordinate (int), the scale (float) and the orientation (float).
Now, I want to compare the interest points from one image to a list of images in a database which also have a list of interest points, to find the most similar image. That is: [Image(I.P.)] COMPARETO [List of Images(I.P.)]. => Best match. Comparing the images on an individual basis yields unsatisfactory results.
When searching Stack Overflow or other sites, the best solution I have found is to build a FLANN index while at the same time keeping track of where the interest points come from. But before implementing it, I have some questions which puzzle me:
1) When matching images based on their SURF interest points, an algorithm I have found does the matching by comparing their distance (x1,y1->x2,y2) with each other and finding the image with the lowest total distance. Are the descriptors or orientation never used when comparing interest points?
2) If the descriptors are used, then how do I compare them? I can't figure out how to compare X vectors of 64 points (1 image) with Y vectors of 64 points (several images) using an indexed tree.
I would really appreciate some help. All the places I have searched and the APIs I have found only support matching one picture to another, not matching one picture efficiently to a list of pictures.
There are multiple things here.
In order to know whether two images are (almost) equal, you have to find the homographic projection between the two that results in a minimal error between the projected feature locations. Brute-forcing that is possible but not efficient, so a trick is to assume that similar images tend to have their feature locations in the same spots as well (give or take a bit). For example, when stitching images, the images to stitch are usually taken from only a slightly different angle and/or location; even if not, the distances will likely grow ("proportionally") with the difference in orientation.
This means that you can - as a broad phase - select candidate images by finding k pairs of points with minimum spatial distance (the k nearest neighbors) between all pairs of images and perform homography only on these points. Only then do you compare the projected point-pairwise spatial distance and sort the images by said distance; the lowest distance implies the best possible match (given the circumstances).
If I'm not mistaken, the descriptors are oriented by the strongest angle in the angle histogram. That means you may also decide to take the Euclidean (L2) distance of the 64- or 128-dimensional feature descriptors directly to obtain the actual feature-space similarity of two given features and perform homography on the best k candidates. (You will not compare the scale in which the descriptors were found though, because that would defeat the purpose of scale invariance.)
Both options are time consuming and directly depend on the number of images and features; in other words: stupid idea.
Approximate Nearest Neighbors
A neat trick is to not use actual distances at all, but approximate distances instead. In other words, you want an approximate nearest neighbor algorithm, and FLANN (although not for .NET) would be one of them.
One key point here is the projection search algorithm. It works like this:
Assume you want to compare the descriptors in 64-dimensional feature space. You generate a random 64-dimensional vector and normalize it, resulting in an arbitrary unit vector in feature space; let's call it A. Now (during indexing) you form the dot product of each descriptor with this vector. This projects each 64-d vector onto A, resulting in a single real number a_n. (This value a_n represents the (signed) distance of the descriptor along A in relation to A's origin.)
This image I borrowed from this answer on CrossValidated regarding PCA demonstrates it visually; think about the rotation as the result of different random choices of A, where the red dots correspond to the projections (and thus, scalars a_n). The red lines show the error you make by using that approach; this is what makes the search approximate.
You will need A again for search, so you store it. You also keep track of each projected value a_n and the descriptor it came from; furthermore you arrange each a_n (with a link to its descriptor) in a list, sorted by a_n.
To clarify using another image from here, we're interested in the location of the projected points along the axis A:
The values a_0 .. a_3 of the 4 projected points in the image are approximately sqrt(0.5²+1.5²)=1.58, sqrt(0.4²+1.1²)=1.17, -0.84 and -0.95, corresponding to their distance to A's origin.
If you now want to find similar images, you do the same: Project each descriptor onto A, resulting in a scalar q (query). Now you go to the position of q in the list and take the k surrounding entries. These are your approximate nearest neighbors. Now take the feature-space distance of these k values and sort by lowest distance - the top ones are your best candidates.
Coming back to the last picture, assume the topmost point is our query. Its projection is 1.58 and its approximate nearest neighbor (of the four projected points) is the one at 1.17. They're not really close in feature space, but given that we just compared two 64-dimensional vectors using only two values, it's not that bad either.
You see the limits there: similar projections do not at all require the original values to be close, so this will of course result in rather creative matches. To accommodate for this, you simply generate more base vectors B, C, etc. - say n of them - and keep track of a separate list for each. Take the k best matches on all of them, sort that list of k*n 64-dimensional vectors according to their Euclidean distance to the query vector, perform homography on the best ones and select the one with the lowest projection error.
The neat part about this is that if you have n (random, normalized) projection axes and want to search in 64-dimensional space, you are simply multiplying each descriptor with an n x 64 matrix, resulting in n scalars.
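A rough sketch of the projection search described above, in plain Python. The class and function names are hypothetical, the candidate window takes up to k entries on each side of the query position (one reading of "the k surrounding entries"), and the final homography check is left out.

```python
import bisect
import math
import random


def random_unit_vector(dim, rng):
    """Random direction in feature space, normalized to length 1."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class ProjectionIndex:
    def __init__(self, descriptors, n_axes, rng):
        self.descriptors = descriptors
        dim = len(descriptors[0])
        self.axes = [random_unit_vector(dim, rng) for _ in range(n_axes)]
        # One list per axis of (projection a_n, descriptor id), sorted by a_n.
        self.lists = [sorted((dot(d, axis), i) for i, d in enumerate(descriptors))
                      for axis in self.axes]

    def query(self, q, k):
        """Descriptor ids near q on any axis, ordered by true L2 distance."""
        candidates = set()
        for axis, entries in zip(self.axes, self.lists):
            pos = bisect.bisect_left(entries, (dot(q, axis), -1))
            lo, hi = max(0, pos - k), min(len(entries), pos + k)
            candidates.update(i for _, i in entries[lo:hi])
        return sorted(candidates, key=lambda i: math.dist(q, self.descriptors[i]))
```

To map matches back to images, store an image id alongside each descriptor id; the best candidates returned here are the ones you would then verify with homography.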
I am pretty sure that the distance is calculated between the descriptors and not their coordinates (x,y). You can directly compare only one descriptor against another. I propose the following possible solution (surely not the optimal one):
For each descriptor in the query image, you can find the top-k nearest neighbors in your dataset, then take all the top-k lists and find the most common image in them.
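A minimal sketch of that voting scheme in Python. The function name and the `(image_id, descriptor)` database layout are assumptions, and the nearest-neighbour search is a brute-force scan purely for clarity; in practice an approximate index such as FLANN would replace it.

```python
import math
from collections import Counter


def best_matching_image(query_descriptors, database, k=2):
    """database: list of (image_id, descriptor) pairs from all stored images."""
    votes = Counter()
    for q in query_descriptors:
        # Brute-force top-k by Euclidean distance between descriptors.
        nearest = sorted(database, key=lambda entry: math.dist(q, entry[1]))[:k]
        votes.update(image_id for image_id, _ in nearest)
    # The image that appears most often across all top-k lists wins.
    return votes.most_common(1)[0][0]
```

Each query descriptor casts k votes, so an image that matches many descriptors moderately well can beat one that matches a single descriptor perfectly.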
