I'm trying to tackle the classic handwritten digit recognition problem with a feedforward neural network and backpropagation, using the MNIST dataset. I'm using Michael Nielsen's book to learn the essentials and 3Blue1Brown's YouTube video for the backpropagation algorithm.
I finished writing it some time ago and have been debugging it since, because the results are quite bad. At its best the network can recognize ~4000/10000 samples after 1 epoch, and that number only drops over the following epochs, which led me to believe there's some issue with the backpropagation algorithm. I've been drowning in index hell trying to debug this for the last few days and can't figure out where the issue is; I'd appreciate any help in pointing it out.
A bit of background: 1) I'm not using any matrix multiplication and no external frameworks; I'm doing everything with for loops, because that's how I learned it from the video. 2) Unlike the book, I'm storing both weights and biases in the same array. The biases for every layer are a column at the end of the weight matrix for that layer.
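To illustrate the layout (with made-up names, not the actual fields in my code):
// Illustrative only: combined weight + bias matrix for a layer with 3 inputs and 2 neurons.
// Row i holds the weights into neuron i; the last column holds that neuron's bias.
//
//            in0   in1   in2   bias
// neuron 0 [ w00,  w01,  w02,  b0 ]
// neuron 1 [ w10,  w11,  w12,  b1 ]
double[,] layer = new double[2, 3 + 1];
// The weighted input of neuron i is then:
// z[i] = sum over j < 3 of layer[i, j] * a[j], plus layer[i, 3]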
And finally for the code, this is the Backpropagate method of the NeuralNetwork class, which is called in UpdateMiniBatch, which itself is called in SGD:
/// <summary>
/// Returns the partial derivative of the cost function on one sample with respect to every weight in the network.
/// </summary>
public List<double[,]> Backpropagate(ITrainingSample sample)
{
// Forwards pass
var (weightedInputs, activations) = GetWeightedInputsAndActivations(sample.Input);
// The derivative with respect to the activation of the last layer is simple to compute: activation - expectedActivation
var errors = activations.Last().Select((a, i) => a - sample.Output[i]).ToArray();
// Backwards pass
List<double[,]> delCostOverDelWeights = Weights.Select(x => new double[x.GetLength(0), x.GetLength(1)]).ToList();
List<double[]> delCostOverDelActivations = Weights.Select(x => new double[x.GetLength(0)]).ToList();
delCostOverDelActivations[delCostOverDelActivations.Count - 1] = errors;
// Comment notation:
// Cost function: C
// Weight connecting the i-th neuron on the (l + 1)-th layer to the j-th neuron on the l-th layer: w[l][i, j]
// Bias of the i-th neuron on the (l + 1)-th layer: b[l][i]
// Activation of the i-th neuron on the l-th layer: a[l][i]
// Weighted input of the i-th neuron on the l-th layer: z[l][i] // which doesn't make sense on layer 0, but is left for index convenience
// Notice that weights, biases, delCostOverDelWeights and delCostOverDelActivation all start at layer 1 (the 0-th layer is irrelevant to their meanings) while activations and weightedInputs start at the 0-th layer
for (int l = Weights.Count - 1; l >= 0; l--)
{
//Calculate ∂C/∂w for the current layer:
for (int i = 0; i < Weights[l].GetLength(0); i++)
for (int j = 0; j < Weights[l].GetLength(1); j++)
delCostOverDelWeights[l][i, j] = // ∂C/∂w[l][i, j]
delCostOverDelActivations[l][i] * // ∂C/∂a[l + 1][i]
SigmoidPrime(weightedInputs[l + 1][i]) * // ∂a[l + 1][i]/∂z[l + 1][i] = ∂(σ(z[l + 1][i]))/∂z[l + 1][i] = σ′(z[l + 1][i])
(j < Weights[l].GetLength(1) - 1 ? activations[l][j] : 1); // ∂z[l + 1][i]/∂w[l][i, j] = a[l][j] ||OR|| ∂z[l + 1][i]/∂b[l][i] = 1
// Calculate ∂C/∂a for the previous layer(a[l]):
if (l != 0)
for (int i = 0; i < Weights[l - 1].GetLength(0); i++)
for (int j = 0; j < Weights[l].GetLength(0); j++)
delCostOverDelActivations[l - 1][i] += // ∂C/∂a[l][i] = sum over j:
delCostOverDelActivations[l][j] * // ∂C/∂a[l + 1][j]
SigmoidPrime(weightedInputs[l + 1][j]) * // ∂a[l + 1][j]/∂z[l + 1][j] = ∂(σ(z[l + 1][j]))/∂z[l + 1][j] = σ′(z[l + 1][j])
Weights[l][j, i]; // ∂z[l + 1][j]/∂a[l][i] = w[l][j, i]
}
return delCostOverDelWeights;
}
GetWeightedInputsAndActivations:
public (List<double[]>, List<double[]>) GetWeightedInputsAndActivations(double[] input)
{
List<double[]> activations = new List<double[]>() { input }.Concat(Weights.Select(x => new double[x.GetLength(0)])).ToList();
List<double[]> weightedInputs = activations.Select(x => new double[x.Length]).ToList();
for (int l = 0; l < Weights.Count; l++)
for (int i = 0; i < Weights[l].GetLength(0); i++)
{
double value = 0;
for (int j = 0; j < Weights[l].GetLength(1) - 1; j++)
value += Weights[l][i, j] * activations[l][j];// weights
weightedInputs[l + 1][i] = value + Weights[l][i, Weights[l].GetLength(1) - 1];// bias
activations[l + 1][i] = Sigmoid(weightedInputs[l + 1][i]);
}
return (weightedInputs, activations);
}
The entire NeuralNetwork as well as everything else can be found here.
EDIT: after many significant changes to the repo the above link might no longer be functional, but should hopefully be irrelevant considering the answer. For completeness' sake this is a functional link to the changed repository.
Fixed. The issue was: I didn't divide the pixel inputs by 255. Everything else seems to work correctly, and I'm now getting 9000+/10000 on the first epoch.
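In code, the fix boiled down to normalizing the raw pixel values before feeding them in; roughly this (a sketch, where pixelBytes is just an illustrative name for the raw MNIST pixel values):
// Sketch: scale raw MNIST pixel values (0..255) into the 0..1 range before
// using them as network inputs. "pixelBytes" is an illustrative name.
double[] input = new double[pixelBytes.Length];
for (int i = 0; i < pixelBytes.Length; i++)
    input[i] = pixelBytes[i] / 255.0;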
I am trying to build a custom machine learning library in C#, and I have done a fair amount of research on the topic. My first example (an XOR estimator) was a success: I was able to lower the average loss to almost zero. Then I tried to build a model to classify handwritten digits (using the MNIST text database). The problem is, no matter how I configure the model, I always get stuck at a certain average loss over the dataset. A second problem: because the MNIST dataset is very large, the model takes a lot of time to compute, so maybe I could use some advice on how to speed up the slowest parts of the algorithm (I am using stochastic gradient descent). I am going to show the main method that performs most of the work.
I have tried using the MSE and CrossEntropy loss functions, and the tanh, sigmoid, ReLU and softplus activation functions. The model I am trying to build has 4 layers: a first layer of 784 input neurons; a second layer of 16 neurons with sigmoid; a third layer of 16 neurons with sigmoid; and an output layer of 10 neurons (one-hot encoded digits) with sigmoid. I am aware that the code below may not be a minimal reproducible example, but it represents the algorithm I am trying to figure out. I have also uploaded the solution to GitHub; maybe somebody can give me a hand figuring out the problem. This is the link: https://github.com/juan-carvajal/MachineLearningFramework
Running the Main method of the app will first execute the XOR classifier, which runs fine, and then the MNIST classifier.
The model is best represented here:
DataSet dataSet = new DataSet("mnist2.txt", ' ', 10, false);
//This creates a model with batching=128 , learningRate=0.5 and
//CrossEntropy loss function
var p = new Perceptron(128, 0.5, ErrorFunction.CrossEntropy())
.Layer(784, ActivationFunction.Sigmoid())
.Layer(16, ActivationFunction.Sigmoid())
.Layer(16, ActivationFunction.Sigmoid())
.Layer(10, ActivationFunction.Sigmoid());
//1000 is the number of epochs
p.Train2(dataSet, 1000);
Actual algorithm (stochastic gradient descent):
Console.WriteLine("Initial Loss:"+ CalculateMeanErrorOverDataSet(dataSet));
for (int i = 0; i < epochs; i++)
{
//Shuffle the data in every step
dataSet.Shuffle();
List<DataRow> batch = dataSet.NextBatch(this.Batching);
//Gets random batch from the dataSet
int count = 0;
foreach (DataRow example in batch)
{
count++;
double[] result = this.FeedForward(example.GetFeatures());
double[] labels = example.GetLabels();
if (result.Length != labels.Length)
{
throw new Exception("Inconsistent array size, Incorrect implementation.");
}
else
{
//What follows is the calculation of the gradient for this example; every example affects the current gradient, then all those changes are averaged and every parameter is updated.
double error = CalculateExampleLost(example);
for (int l = this.Layers.Count - 1; l > 0; l--)
{
if (l == this.Layers.Count - 1)
{
for (int j = 0; j < this.Layers[l].CostDerivatives.Length; j++)
{
this.Layers[l].CostDerivatives[j] = ErrorFunction.GetDerivativeValue(labels[j], this.Layers[l].Activations[j]);
}
}
else
{
for (int j = 0; j < this.Layers[l].CostDerivatives.Length; j++)
{
double acum = 0;
for (int j2 = 0; j2 < Layers[l + 1].Size; j2++)
{
acum += Layers[l + 1].WeightMatrix[j2, j] * this.Layers[l+1].ActivationFunction.GetDerivativeValue(Layers[l + 1].WeightedSum[j2]) * Layers[l + 1].CostDerivatives[j2];
}
this.Layers[l].CostDerivatives[j] = acum;
}
}
for (int j = 0; j < this.Layers[l].Activations.Length; j++)
{
this.Layers[l].BiasVectorChangeRecord[j] += this.Layers[l].ActivationFunction.GetDerivativeValue(Layers[l].WeightedSum[j]) * Layers[l].CostDerivatives[j];
for (int k = 0; k < Layers[l].WeightMatrix.GetLength(1); k++)
{
this.Layers[l].WeightMatrixChangeRecord[j, k] += Layers[l - 1].Activations[k]
* this.Layers[l].ActivationFunction.GetDerivativeValue(Layers[l].WeightedSum[j])
* Layers[l].CostDerivatives[j];
}
}
}
}
}
TakeGradientDescentStep(batch.Count);
if ((i + 1) % (epochs / 10) == 0)
{
Console.WriteLine("Epoch " + (i + 1) + ", Avg.Loss:" + CalculateMeanErrorOverDataSet(dataSet));
}
}
This is an example of what the local minimum looks like in the current model.
In my research I found out that similar models may achieve accuracies of up to 90%. My model barely got 10%.
I am trying to make a backpropagation neural network.
It is based upon the tutorials I found here: an MSDN article by James McCaffrey. He gives many examples, but all his networks are based upon the same problem to solve, so his networks look like 4:7:3 >> 4 input - 7 hidden - 3 output.
His output is always binary, 0 or 1: one output gets a 1, to classify an Iris flower into one of the three categories.
I would like to solve another problem with a neural network, and that would require two neural networks, where one needs an output between 0..255 and the other an output between 0 and 2 times Pi (a full turn, a circle). Essentially I think I need an output that ranges from 0.0 to 1.0, or from -1 to 1, and anything in between, so that I can multiply it to become 0..255 or 0..2Pi.
I think his network behaves the way it does because of his ComputeOutputs method,
which I show below:
private double[] ComputeOutputs(double[] xValues)
{
if (xValues.Length != numInput)
throw new Exception("Bad xValues array length");
double[] hSums = new double[numHidden]; // hidden nodes sums scratch array
double[] oSums = new double[numOutput]; // output nodes sums
for (int i = 0; i < xValues.Length; ++i) // copy x-values to inputs
this.inputs[i] = xValues[i];
for (int j = 0; j < numHidden; ++j) // compute i-h sum of weights * inputs
for (int i = 0; i < numInput; ++i)
hSums[j] += this.inputs[i] * this.ihWeights[i][j]; // note +=
for (int i = 0; i < numHidden; ++i) // add biases to input-to-hidden sums
hSums[i] += this.hBiases[i];
for (int i = 0; i < numHidden; ++i) // apply activation
this.hOutputs[i] = HyperTanFunction(hSums[i]); // hard-coded
for (int j = 0; j < numOutput; ++j) // compute h-o sum of weights * hOutputs
for (int i = 0; i < numHidden; ++i)
oSums[j] += hOutputs[i] * hoWeights[i][j];
for (int i = 0; i < numOutput; ++i) // add biases to hidden-to-output sums
oSums[i] += oBiases[i];
double[] softOut = Softmax(oSums); // softmax activation does all outputs at once for efficiency
Array.Copy(softOut, outputs, softOut.Length);
double[] retResult = new double[numOutput]; // could define a GetOutputs method instead
Array.Copy(this.outputs, retResult, retResult.Length);
return retResult;
}
The network uses the following HyperTanFunction:
private static double HyperTanFunction(double x)
{
if (x < -20.0) return -1.0; // approximation is correct to 30 decimals
else if (x > 20.0) return 1.0;
else return Math.Tanh(x);
}
In the above, the function uses Softmax() for the output layer, and I think that is critical to the problem here, in that it makes his output effectively binary. It looks like this:
private static double[] Softmax(double[] oSums)
{
// determine max output sum
// does all output nodes at once so scale doesn't have to be re-computed each time
double max = oSums[0];
for (int i = 0; i < oSums.Length; ++i)
if (oSums[i] > max) max = oSums[i];
// determine scaling factor -- sum of exp(each val - max)
double scale = 0.0;
for (int i = 0; i < oSums.Length; ++i)
scale += Math.Exp(oSums[i] - max);
double[] result = new double[oSums.Length];
for (int i = 0; i < oSums.Length; ++i)
result[i] = Math.Exp(oSums[i] - max) / scale;
return result; // now scaled so that xi sum to 1.0
}
How can I rewrite Softmax so the network will be able to give non-binary answers?
Note that the full code of the network is here, if you would like to try it out.
Also, the following accuracy function is used to test the network; maybe the binary behaviour emerges from it:
public double Accuracy(double[][] testData)
{
// percentage correct using winner-takes all
int numCorrect = 0;
int numWrong = 0;
double[] xValues = new double[numInput]; // inputs
double[] tValues = new double[numOutput]; // targets
double[] yValues; // computed Y
for (int i = 0; i < testData.Length; ++i)
{
Array.Copy(testData[i], xValues, numInput); // parse test data into x-values and t-values
Array.Copy(testData[i], numInput, tValues, 0, numOutput);
yValues = this.ComputeOutputs(xValues);
int maxIndex = MaxIndex(yValues); // which cell in yValues has largest value?
int tMaxIndex = MaxIndex(tValues);
if (maxIndex == tMaxIndex)
++numCorrect;
else
++numWrong;
}
return (numCorrect * 1.0) / (double)testData.Length;
}
Just in case someone gets into the same situation:
if you need some example code of neural network regression
(an NNR), which is what they are called.
Here is a link to sample code in C#, and here is a good article about it. Note that the author writes more articles there; you won't find everything, but there's a lot. Even though I had been following him for a while, I missed this specific article, as I didn't know what they were called when I asked the question here on Stack Overflow.
I'm a bit rusty at neural networks, but I think that if you want a range of values from your output, then you need to make sure the activation functions on your output layer are linear (or something that has a similar effect).
Try adding this method:
private static double[] Linear(double[] oSums)
{
double sum = oSums.Sum(d => Math.Abs(d));
double[] result = new double[oSums.Length];
for (int i = 0; i < oSums.Length; ++i)
result[i] = Math.Abs(oSums[i]) / sum;
// scaled so that xi sum to 1.0
return result;
}
And then in the ComputeOutputs method you need to use this new activation function for the output (rather than Softmax):
...
//double[] softOut = Softmax(oSums); // all outputs at once for efficiency
double[] softOut = Linear(oSums); // all outputs at once for efficiency
Array.Copy(softOut, outputs, softOut.Length);
...
This should now output linear values.
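Once the output is no longer squashed through Softmax, you can rescale it yourself to the ranges mentioned in the question; a rough sketch (oValue is just a placeholder for a single network output assumed to lie in [0, 1]):
// Sketch: rescale an output assumed to lie in [0, 1] to the target ranges.
// "oValue" is a placeholder name for one network output value.
double byteValue = oValue * 255.0;          // 0..255
double angle     = oValue * 2.0 * Math.PI;  // 0..2*Pi (a full turn)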
I am attempting to use Lomont FFT in order to return complex numbers to build a spectrogram / spectral density chart using C#.
I am having trouble understanding how to return values from the class.
Here is the code I have put together thus far which appears to be working.
int read = 0;
Double[] data;
byte[] buffer = new byte[1024];
FileStream wave = new FileStream(args[0], FileMode.Open, FileAccess.Read);
read = wave.Read(buffer, 0, 44);
read = wave.Read(buffer, 0, 1024);
data = new Double[read];
for (int i = 0; i < read; i+=2)
{
data[i] = BitConverter.ToInt16(buffer, i) / 32768.0;
Console.WriteLine(data[i]);
}
LomontFFT LFFT = new LomontFFT();
LFFT.FFT(data, true);
What I am not clear on is how to return/access the values from the Lomont FFT implementation back in my application (a console app).
Being pretty new to C# development, I'm thinking I am perhaps missing a fundamental aspect of how to retrieve the processed values from the instance of the Lomont class, or perhaps I am even calling it incorrectly.
Console.WriteLine(LFFT.A); // Returns 0
Console.WriteLine(LFFT.B); // Returns 1
I have been searching for a code snippet or explanation of how to do this, but so far have come up with nothing that I understand or explains this particular aspect of the issue I am facing. Any guidance would be greatly appreciated.
A subset of the results held in the data array noted in the code above can be found below; based on my current understanding, they appear to be valid:
0.00531005859375
0.0238037109375
0.041473388671875
0.0576171875
0.07183837890625
0.083465576171875
0.092193603515625
0.097625732421875
0.099639892578125
0.098114013671875
0.0931396484375
0.0848388671875
0.07354736328125
0.05963134765625
0.043609619140625
0.026031494140625
0.007476806640625
-0.011260986328125
-0.0296630859375
-0.047027587890625
-0.062713623046875
-0.076141357421875
-0.086883544921875
-0.09454345703125
-0.098785400390625
-0.0994873046875
-0.0966796875
-0.090362548828125
-0.080810546875
-0.06842041015625
-0.05352783203125
-0.036712646484375
-0.0185546875
What am I actually attempting to do? (perspective)
I am looking to load a wave file into a console application and return a spectrogram / spectral density chart as a JPG/PNG image for further processing.
The wave files I am reading are mono.
UPDATE 1
I receive slightly different results depending on which FFT is used.
Using RealFFT
for (int i = 0; i < read; i+=2)
{
data[i] = BitConverter.ToInt16(buffer, i) / 32768.0;
//Console.WriteLine(data[i]);
}
LomontFFT LFFT = new LomontFFT();
LFFT.RealFFT(data, true);
for (int i = 0; i < buffer.Length / 2; i++)
{
System.Console.WriteLine("{0}",
Math.Sqrt(data[2 * i] * data[2 * i] + data[2 * i + 1] * data[2 * i + 1]));
}
Partial Result of RealFFT
0.314566983321381
0.625242818210924
0.30314888696868
0.118468857708093
0.0587697011760449
0.0369034115568654
0.0265842582236275
0.0207195964060356
0.0169601273233317
0.0143745438577886
0.012528799609089
0.0111831275153128
0.0102313284519146
0.00960198279358434
0.00920236001619566
Using FFT
for (int i = 0; i < read; i+=2)
{
data[i] = BitConverter.ToInt16(buffer, i) / 32768.0;
//Console.WriteLine(data[i]);
}
double[] bufferB = new double[2 * data.Length];
for (int i = 0; i < data.Length; i++)
{
bufferB[2 * i] = data[i];
bufferB[2 * i + 1] = 0;
}
LomontFFT LFFT = new LomontFFT();
LFFT.FFT(bufferB, true);
for (int i = 0; i < bufferB.Length / 2; i++)
{
System.Console.WriteLine("{0}",
Math.Sqrt(bufferB[2 * i] * bufferB[2 * i] + bufferB[2 * i + 1] * bufferB[2 * i + 1]));
}
Partial Result of FFT:
0.31456698332138
0.625242818210923
0.303148886968679
0.118468857708092
0.0587697011760447
0.0369034115568653
0.0265842582236274
0.0207195964060355
0.0169601273233317
0.0143745438577886
0.012528799609089
0.0111831275153127
0.0102313284519146
0.00960198279358439
0.00920236001619564
Looking at the LomontFFT.FFT documentation:
Compute the forward or inverse Fourier Transform of data, with
data containing complex valued data as alternating real and
imaginary parts. The length must be a power of 2. The data is
modified in place.
This tells us a few things. First the function is expecting complex-valued data whereas your data is real. A quick fix for this is to create another buffer of twice the size and setting all the imaginary parts to 0:
double[] buffer = new double[2*data.Length];
for (int i=0; i<data.Length; i++)
{
buffer[2*i] = data[i];
buffer[2*i+1] = 0;
}
The documentation also tells us that the computation is done in place. That means that after the call to FFT returns, the input array is replaced with the computed result. You could thus print the spectrum with:
LomontFFT LFFT = new LomontFFT();
LFFT.FFT(buffer, true);
for (int i = 0; i < buffer.Length/2; i++)
{
System.Console.WriteLine("{0}",
Math.Sqrt(buffer[2*i]*buffer[2*i]+buffer[2*i+1]*buffer[2*i+1]));
}
Note since your input data is real valued you could also use LomontFFT.RealFFT. In that case, given a slightly different packing rule, you would obtain the FFT results using:
LomontFFT LFFT = new LomontFFT();
LFFT.RealFFT(data, true);
System.Console.WriteLine("{0}", Math.Abs(data[0]);
for (int i = 1; i < data.Length/2; i++)
{
System.Console.WriteLine("{0}",
Math.Sqrt(data[2*i]*data[2*i]+data[2*i+1]*data[2*i+1]));
}
System.Console.WriteLine("{0}", Math.Abs(data[1]);
This would give you the non-redundant lower half of the spectrum (Unlike LomontFFT.FFT which provides the entire spectrum). Also, numerical differences on the order of double precision (around 1e-16 times the spectrum peak value) with respect to LomontFFT.FFT can be expected.
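If the end goal is a spectrogram image, a common follow-up step is converting those magnitudes to a dB scale before mapping them to pixel intensities; a minimal sketch, assuming buffer holds the interleaved FFT result as above:
// Sketch: convert the FFT magnitudes to decibels for display.
// Assumes "buffer" holds interleaved real/imaginary values as in the code above.
double epsilon = 1e-12; // avoids taking the log of zero
for (int i = 0; i < buffer.Length / 2; i++)
{
    double re = buffer[2 * i];
    double im = buffer[2 * i + 1];
    double magnitude = Math.Sqrt(re * re + im * im);
    double dB = 20.0 * Math.Log10(magnitude + epsilon);
    System.Console.WriteLine("{0:F2} dB", dB);
}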
Given a point collection defined by x and y coordinates.
In this collection I get the start point, the end point and all the other n-2 points.
I have to find the shortest way between the start point and the end point that goes through all the other points. The result should be the length of the shortest way and, if possible, the order in which the points are crossed.
At first look this seems to be a graph problem, but I am not so sure about that right now. Anyway, I am trying to find this shortest way by using only geometric relations, since currently the only information I have is the x and y coordinates of the points, and which point is the start point and which is the end point.
My question is: can this way be found by using only geometric relations?
I am trying to implement this in C#, so if any helpful packages are available, please let me know.
The simplest heuristic with reasonable performance is 2-opt. Put the points in an array, with the start point first and the end point last, and repeatedly attempt to improve the solution as follows. Choose a starting index i and an ending index j and reverse the subarray from i to j. If the total cost is less, then keep this change, otherwise undo it. Note that the total cost will be less if and only if d(p[i - 1], p[i]) + d(p[j], p[j + 1]) > d(p[i - 1], p[j]) + d(p[i], p[j + 1]), so you can avoid performing the swap unless it's an improvement.
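A minimal C# sketch of that improvement step, assuming the points are plain (X, Y) tuples and Dist is a small Euclidean-distance helper (both just for illustration):
// Sketch of one 2-opt pass over a path whose first and last points are fixed.
// Returns true if any improving reversal was applied; call repeatedly until false.
static double Dist((double X, double Y) a, (double X, double Y) b)
    => Math.Sqrt((a.X - b.X) * (a.X - b.X) + (a.Y - b.Y) * (a.Y - b.Y));

static bool TwoOptPass((double X, double Y)[] points)
{
    bool improved = false;
    for (int i = 1; i < points.Length - 1; i++)
    {
        for (int j = i + 1; j < points.Length - 1; j++)
        {
            // Cost of the two edges removed vs. the two edges added by reversing p[i..j].
            double removed = Dist(points[i - 1], points[i]) + Dist(points[j], points[j + 1]);
            double added   = Dist(points[i - 1], points[j]) + Dist(points[i], points[j + 1]);
            if (removed > added)
            {
                Array.Reverse(points, i, j - i + 1); // keep the improving reversal
                improved = true;
            }
        }
    }
    return improved;
}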
There are a number of possible improvements to this method. 3-opt and k-opt consider more possible moves, resulting in better solution quality. Data structures for geometric search, kd-trees for example, decrease the time to find improving moves. As far as I know, the state of the art in local search algorithms for TSP is Keld Helsgaun's LKH.
Another family of algorithms is branch and bound. These return optimal solutions. Concorde (as far as I know) is the state of the art here.
Here's a Java implementation of the O(n^2 2^n) DP that Niklas described. There are many possible improvements, e.g., cache the distances between points, switch to floats (maybe), reorganize the iteration so that subsets are enumerated in increasing order of size (to allow only the most recent layer of minTable to be retained, resulting in a significant space saving).
class Point {
private final double x, y;
Point(double x, double y) {
this.x = x;
this.y = y;
}
double distanceTo(Point that) {
return Math.hypot(x - that.x, y - that.y);
}
public String toString() {
return x + " " + y;
}
}
public class TSP {
public static int[] minLengthPath(Point[] points) {
if (points.length < 2) {
throw new IllegalArgumentException();
}
int n = points.length - 2;
if ((1 << n) <= 0) {
throw new IllegalArgumentException();
}
byte[][] argMinTable = new byte[1 << n][n];
double[][] minTable = new double[1 << n][n];
for (int s = 0; s < (1 << n); s++) {
for (int i = 0; i < n; i++) {
int sMinusI = s & ~(1 << i);
if (sMinusI == s) {
continue;
}
int argMin = -1;
double min = points[0].distanceTo(points[1 + i]);
for (int j = 0; j < n; j++) {
if ((sMinusI & (1 << j)) == 0) {
continue;
}
double cost =
minTable[sMinusI][j] +
points[1 + j].distanceTo(points[1 + i]);
if (argMin < 0 || cost < min) {
argMin = j;
min = cost;
}
}
argMinTable[s][i] = (byte)argMin;
minTable[s][i] = min;
}
}
int s = (1 << n) - 1;
int argMin = -1;
double min = points[0].distanceTo(points[1 + n]);
for (int i = 0; i < n; i++) {
double cost =
minTable[s][i] +
points[1 + i].distanceTo(points[1 + n]);
if (argMin < 0 || cost < min) {
argMin = i;
min = cost;
}
}
int[] path = new int[1 + n + 1];
path[1 + n] = 1 + n;
int k = n;
while (argMin >= 0) {
path[k] = 1 + argMin;
k--;
int temp = s;
s &= ~(1 << argMin);
argMin = argMinTable[temp][argMin];
}
path[0] = 0;
return path;
}
public static void main(String[] args) {
Point[] points = new Point[20];
for (int i = 0; i < points.length; i++) {
points[i] = new Point(Math.random(), Math.random());
}
int[] path = minLengthPath(points);
for (int i = 0; i < points.length; i++) {
System.out.println(points[path[i]]);
System.err.println(points[i]);
}
}
}
The Euclidean travelling salesman problem can be reduced to this and it's NP-hard. So unless your point set is small or you have a very particular structure, you should probably look out for an approximation. Note that the Wikipedia article mentions the existence of a PTAS for the problem, which could turn out to be quite effective in practice.
UPDATE: Since your instances seem to have only few nodes, you can use a simple exponential-time dynamic programming approach. Let f(S, p) be the minimum cost to connect all the points in the set S, ending at the point p. We have f({start}, start) = 0 and we are looking for f(P, end), where P is the set of all points. To compute f(S, p), we can check all potential predecessors of p in the tour, so we have
f(S, p) = MIN(q in S \ {p}, f(S \ {p}, q) + distance(p, q))
You can represent S as a bitvector to save space (just use a single-word integer for maximum simplicity). Also use memoization to avoid recomputing subproblem results.
The runtime will be O(2^n * n^2) and the algorithm can be implemented with a rather low constant factor, so I predict it will be able to solve instances with n = 25 within a reasonable amount of time.
This can be solved using an evolutionary algorithm.
Look at this: http://johnnewcombe.net/blog/post/23
You might want to look at TPL (Task Parallel Library) to speed up the application.
EDIT
I found this link, which has a Traveling Salesman algorithm:
http://msdn.microsoft.com/en-us/magazine/gg983491.aspx
The Source Code is said to be at:
http://archive.msdn.microsoft.com/mag201104BeeColony
I need to calculate the similarity between 2 strings. So what exactly do I mean? Let me explain with an example:
The real word: hospital
Mistaken word: haspita
Now my aim is to determine how many characters I need to modify in the mistaken word to obtain the real word. In this example, I need to modify 2 letters. So what would the percentage be? I always take the length of the real word. So it becomes 2 / 8 = 25%, so these 2 given strings' DSM is 75%.
How can I achieve this with performance being a key consideration?
I just addressed this exact same issue a few weeks ago. Since someone is asking now, I'll share the code. In my exhaustive tests my code is about 10x faster than the C# example on Wikipedia even when no maximum distance is supplied. When a maximum distance is supplied, this performance gain increases to 30x - 100x +. Note a couple key points for performance:
If you need to compare the same words over and over, first convert the words to arrays of integers. The Damerau-Levenshtein algorithm includes many >, <, == comparisons, and ints compare much faster than chars.
It includes a short-circuiting mechanism to quit if the distance exceeds a provided maximum
Use a rotating set of three arrays rather than a massive matrix as in all the implementations I've seen elsewhere
Make sure your arrays slice across the shorter word width.
Code (it works exactly the same if you replace int[] with String in the parameter declarations):
/// <summary>
/// Computes the Damerau-Levenshtein Distance between two strings, represented as arrays of
/// integers, where each integer represents the code point of a character in the source string.
/// Includes an optional threshold which can be used to indicate the maximum allowable distance.
/// </summary>
/// <param name="source">An array of the code points of the first string</param>
/// <param name="target">An array of the code points of the second string</param>
/// <param name="threshold">Maximum allowable distance</param>
/// <returns>Int.MaxValue if threshhold exceeded; otherwise the Damerau-Leveshteim distance between the strings</returns>
public static int DamerauLevenshteinDistance(int[] source, int[] target, int threshold) {
int length1 = source.Length;
int length2 = target.Length;
// Return trivial case - difference in string lengths exceeds threshold
if (Math.Abs(length1 - length2) > threshold) { return int.MaxValue; }
// Ensure arrays [i] / length1 use shorter length
if (length1 > length2) {
Swap(ref target, ref source);
Swap(ref length1, ref length2);
}
int maxi = length1;
int maxj = length2;
int[] dCurrent = new int[maxi + 1];
int[] dMinus1 = new int[maxi + 1];
int[] dMinus2 = new int[maxi + 1];
int[] dSwap;
for (int i = 0; i <= maxi; i++) { dCurrent[i] = i; }
int jm1 = 0, im1 = 0, im2 = -1;
for (int j = 1; j <= maxj; j++) {
// Rotate
dSwap = dMinus2;
dMinus2 = dMinus1;
dMinus1 = dCurrent;
dCurrent = dSwap;
// Initialize
int minDistance = int.MaxValue;
dCurrent[0] = j;
im1 = 0;
im2 = -1;
for (int i = 1; i <= maxi; i++) {
int cost = source[im1] == target[jm1] ? 0 : 1;
int del = dCurrent[im1] + 1;
int ins = dMinus1[i] + 1;
int sub = dMinus1[im1] + cost;
//Fastest execution for min value of 3 integers
int min = (del > ins) ? (ins > sub ? sub : ins) : (del > sub ? sub : del);
if (i > 1 && j > 1 && source[im2] == target[jm1] && source[im1] == target[j - 2])
min = Math.Min(min, dMinus2[im2] + cost);
dCurrent[i] = min;
if (min < minDistance) { minDistance = min; }
im1++;
im2++;
}
jm1++;
if (minDistance > threshold) { return int.MaxValue; }
}
int result = dCurrent[maxi];
return (result > threshold) ? int.MaxValue : result;
}
Where Swap is:
static void Swap<T>(ref T arg1,ref T arg2) {
T temp = arg1;
arg1 = arg2;
arg2 = temp;
}
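A quick usage sketch (the ToCodePoints helper is just for illustration): convert each string to an int array once and reuse those arrays across comparisons:
// Sketch: map strings to arrays of character codes once, then compare them.
static int[] ToCodePoints(string s) {
    int[] result = new int[s.Length];
    for (int i = 0; i < s.Length; i++) { result[i] = s[i]; }
    return result;
}

// "hospital" vs "haspita": one substitution (o -> a) and one deletion (the final l), i.e. 2.
int distance = DamerauLevenshteinDistance(
    ToCodePoints("hospital"), ToCodePoints("haspita"), threshold: 10);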
What you are looking for is called edit distance or Levenshtein distance. The wikipedia article explains how it is calculated, and has a nice piece of pseudocode at the bottom to help you code this algorithm in C# very easily.
Here's an implementation from the first site linked below:
private static int CalcLevenshteinDistance(string a, string b)
{
if (String.IsNullOrEmpty(a) && String.IsNullOrEmpty(b)) {
return 0;
}
if (String.IsNullOrEmpty(a)) {
return b.Length;
}
if (String.IsNullOrEmpty(b)) {
return a.Length;
}
int lengthA = a.Length;
int lengthB = b.Length;
var distances = new int[lengthA + 1, lengthB + 1];
for (int i = 0; i <= lengthA; distances[i, 0] = i++);
for (int j = 0; j <= lengthB; distances[0, j] = j++);
for (int i = 1; i <= lengthA; i++)
for (int j = 1; j <= lengthB; j++)
{
int cost = b[j - 1] == a[i - 1] ? 0 : 1;
distances[i, j] = Math.Min
(
Math.Min(distances[i - 1, j] + 1, distances[i, j - 1] + 1),
distances[i - 1, j - 1] + cost
);
}
return distances[lengthA, lengthB];
}
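To get the percentage figure from the question, divide the distance by the length of the "real" word (following the question's own definition):
// Sketch: similarity percentage as defined in the question.
int distance = CalcLevenshteinDistance("hospital", "haspita");     // 2 edits
double similarity = 1.0 - (double)distance / "hospital".Length;    // 1 - 2/8 = 0.75, i.e. 75%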
There is a large number of string similarity distance algorithms that can be used. Some of them (not an exhaustive list) are:
Levenshtein
Needleman Wunch
Smith Waterman
Smith Waterman Gotoh
Jaro, Jaro Winkler
Jaccard Similarity
Euclidean Distance
Dice Similarity
Cosine Similarity
Monge Elkan
A library that contains implementations of all of these is called SimMetrics,
which has both Java and C# implementations.
I have found that Levenshtein and Jaro Winkler are great for small differences between strings such as:
Spelling mistakes; or
ö instead of o in a persons name.
However, when comparing something like article titles, where significant chunks of the text would be the same but with "noise" around the edges, Smith-Waterman-Gotoh has been fantastic.
Compare these 2 titles (which are the same article, but worded differently by different sources):
An endonuclease from Escherichia coli that introduces single polynucleotide chain scissions in ultraviolet-irradiated DNA
Endonuclease III: An Endonuclease from Escherichia coli That Introduces Single Polynucleotide Chain Scissions in Ultraviolet-Irradiated DNA
This site, which provides algorithm comparisons of the strings, shows:
Levenshtein: 81
Smith-Waterman Gotoh 94
Jaro Winkler 78
Jaro Winkler and Levenshtein are not as competent as Smith Waterman Gotoh in detecting the similarity. If we compare two titles that are not the same article, but have some matching text:
Fat metabolism in higher plants. The function of acyl thioesterases in the metabolism of acyl-coenzymes A and acyl-acyl carrier proteins
Fat metabolism in higher plants. The determination of acyl-acyl carrier protein and acyl coenzyme A in a complex lipid mixture
Jaro Winkler gives a false positive, but Smith Waterman Gotoh does not:
Levenshtein: 54
Smith-Waterman Gotoh 49
Jaro Winkler 89
As Anastasiosyal pointed out, SimMetrics has the java code for these algorithms. I had success using the SmithWatermanGotoh java code from SimMetrics.
Here is my implementation of the Damerau-Levenshtein distance, which returns not only the similarity coefficient, but also the error locations in the corrected word (this feature can be used in text editors). My implementation also supports different weights for the errors (substitution, deletion, insertion, transposition).
public static List<Mistake> OptimalStringAlignmentDistance(
string word, string correctedWord,
bool transposition = true,
int substitutionCost = 1,
int insertionCost = 1,
int deletionCost = 1,
int transpositionCost = 1)
{
int w_length = word.Length;
int cw_length = correctedWord.Length;
var d = new KeyValuePair<int, CharMistakeType>[w_length + 1, cw_length + 1];
var result = new List<Mistake>(Math.Max(w_length, cw_length));
if (w_length == 0)
{
for (int i = 0; i < cw_length; i++)
result.Add(new Mistake(i, CharMistakeType.Insertion));
return result;
}
for (int i = 0; i <= w_length; i++)
d[i, 0] = new KeyValuePair<int, CharMistakeType>(i, CharMistakeType.None);
for (int j = 0; j <= cw_length; j++)
d[0, j] = new KeyValuePair<int, CharMistakeType>(j, CharMistakeType.None);
for (int i = 1; i <= w_length; i++)
{
for (int j = 1; j <= cw_length; j++)
{
bool equal = correctedWord[j - 1] == word[i - 1];
int delCost = d[i - 1, j].Key + deletionCost;
int insCost = d[i, j - 1].Key + insertionCost;
int subCost = d[i - 1, j - 1].Key;
if (!equal)
subCost += substitutionCost;
int transCost = int.MaxValue;
if (transposition && i > 1 && j > 1 && word[i - 1] == correctedWord[j - 2] && word[i - 2] == correctedWord[j - 1])
{
transCost = d[i - 2, j - 2].Key;
if (!equal)
transCost += transpositionCost;
}
int min = delCost;
CharMistakeType mistakeType = CharMistakeType.Deletion;
if (insCost < min)
{
min = insCost;
mistakeType = CharMistakeType.Insertion;
}
if (subCost < min)
{
min = subCost;
mistakeType = equal ? CharMistakeType.None : CharMistakeType.Substitution;
}
if (transCost < min)
{
min = transCost;
mistakeType = CharMistakeType.Transposition;
}
d[i, j] = new KeyValuePair<int, CharMistakeType>(min, mistakeType);
}
}
int w_ind = w_length;
int cw_ind = cw_length;
while (w_ind >= 0 && cw_ind >= 0)
{
switch (d[w_ind, cw_ind].Value)
{
case CharMistakeType.None:
w_ind--;
cw_ind--;
break;
case CharMistakeType.Substitution:
result.Add(new Mistake(cw_ind - 1, CharMistakeType.Substitution));
w_ind--;
cw_ind--;
break;
case CharMistakeType.Deletion:
result.Add(new Mistake(cw_ind, CharMistakeType.Deletion));
w_ind--;
break;
case CharMistakeType.Insertion:
result.Add(new Mistake(cw_ind - 1, CharMistakeType.Insertion));
cw_ind--;
break;
case CharMistakeType.Transposition:
result.Add(new Mistake(cw_ind - 2, CharMistakeType.Transposition));
w_ind -= 2;
cw_ind -= 2;
break;
}
}
if (d[w_length, cw_length].Key > result.Count)
{
int delMistakesCount = d[w_length, cw_length].Key - result.Count;
for (int i = 0; i < delMistakesCount; i++)
result.Add(new Mistake(0, CharMistakeType.Deletion));
}
result.Reverse();
return result;
}
public struct Mistake
{
public int Position;
public CharMistakeType Type;
public Mistake(int position, CharMistakeType type)
{
Position = position;
Type = type;
}
public override string ToString()
{
return Position + ", " + Type;
}
}
public enum CharMistakeType
{
None,
Substitution,
Insertion,
Deletion,
Transposition
}
This code is a part of my project: Yandex-Linguistics.NET.
I wrote some tests and it seems to me that the method is working, but comments and remarks are welcome.
Here is an alternative approach:
A typical method for finding similarity is Levenshtein distance, and there is no doubt a library with code available.
Unfortunately, this requires comparing to every string. You might be able to write a specialized version of the code to short-circuit the calculation if the distance is greater than some threshold, but you would still have to do all the comparisons.
Another idea is to use some variant of trigrams or n-grams. These are sequences of n characters (or n words or n genomic sequences or n whatever). Keep a mapping of trigrams to strings and choose the ones that have the biggest overlap. A typical choice of n is "3", hence the name.
For instance, English would have these trigrams:
Eng
ngl
gli
lis
ish
And England would have:
Eng
ngl
gla
lan
and
Well, 2 out of 7 (or 4 out of 10) match. If this works for you, you can index the trigram/string table and get a faster search.
You can also combine this with Levenshtein to reduce the set of comparison to those that have some minimum number of n-grams in common.
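A small sketch of extracting trigrams and counting the overlap between two strings (the helper names are illustrative):
// Sketch: collect the set of trigrams of a string, then count shared trigrams.
// Requires System.Collections.Generic.
static HashSet<string> Trigrams(string s)
{
    var result = new HashSet<string>();
    for (int i = 0; i + 3 <= s.Length; i++)
        result.Add(s.Substring(i, 3));
    return result;
}

static int SharedTrigrams(string a, string b)
{
    var shared = Trigrams(a);
    shared.IntersectWith(Trigrams(b));
    return shared.Count; // e.g. "English" and "England" share "Eng" and "ngl"
}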
Here's a VB.net implementation:
Public Shared Function LevenshteinDistance(ByVal v1 As String, ByVal v2 As String) As Integer
Dim cost(v1.Length, v2.Length) As Integer
If v1.Length = 0 Then
Return v2.Length 'if string 1 is empty, the number of edits will be the insertion of all characters in string 2
ElseIf v2.Length = 0 Then
Return v1.Length 'if string 2 is empty, the number of edits will be the insertion of all characters in string 1
Else
'setup the base costs for inserting the correct characters
For v1Count As Integer = 0 To v1.Length
cost(v1Count, 0) = v1Count
Next v1Count
For v2Count As Integer = 0 To v2.Length
cost(0, v2Count) = v2Count
Next v2Count
'now work out the cheapest route to having the correct characters
For v1Count As Integer = 1 To v1.Length
For v2Count As Integer = 1 To v2.Length
'the first min term is the cost of editing the character in place (which will be the cost-to-date or the cost-to-date + 1 (depending on whether a change is required)
'the second min term is the cost of inserting the correct character into string 1 (cost-to-date + 1),
'the third min term is the cost of inserting the correct character into string 2 (cost-to-date + 1) and
cost(v1Count, v2Count) = Math.Min(
cost(v1Count - 1, v2Count - 1) + If(v1.Chars(v1Count - 1) = v2.Chars(v2Count - 1), 0, 1),
Math.Min(
cost(v1Count - 1, v2Count) + 1,
cost(v1Count, v2Count - 1) + 1
)
)
Next v2Count
Next v1Count
'the final result is the cheapest cost to get the two strings to match, which is the bottom right cell in the matrix
'in the event of strings being equal, this will be the result of zipping diagonally down the matrix (which will be square as the strings are the same length)
Return cost(v1.Length, v2.Length)
End If
End Function