Linear regression with constraints with Math.NET - C#

I'm performing simple linear regression with Math.NET.
I provided a common code sample below. As an alternative to this example, one can use the Fit class for simple linear regression.
What I additionally want is to specify constraints, like a fixed y-intercept, or to force the fit to run through a fixed point, e.g. (2, 2). How can I achieve this in Math.NET?
// requires: using MathNet.Numerics.LinearAlgebra.Double;
var xdata = new double[] { 10, 20, 30 };
var ydata = new double[] { 15, 20, 25 };

// design matrix [1, x] for the model y = a + b*x
var X = DenseMatrix.CreateFromColumns(new[] { new DenseVector(xdata.Length, 1), new DenseVector(xdata) });
var y = new DenseVector(ydata);

// least-squares solution via QR decomposition
var p = X.QR().Solve(y);
var a = p[0]; // intercept
var b = p[1]; // slope

You can modify your data set to reflect the constraint, and then use the standard Math.NET linear regression.
If (x0, y0) is the point through which the regression line must pass, fit the model y − y0 = β(x − x0) + ε, i.e., a linear regression with "no intercept" on a translated data set.
See here: https://stats.stackexchange.com/questions/12484/constrained-linear-regression-through-a-specified-point
and here: http://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)#Constrained_linear_least_squares
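A minimal sketch of that approach in Math.NET, assuming the fixed point (x0, y0) = (2, 2) from the question; Fit.LineThroughOrigin does the no-intercept fit on the translated data:

using System.Linq;
using MathNet.Numerics;

double x0 = 2, y0 = 2;
var xdata = new double[] { 10, 20, 30 };
var ydata = new double[] { 15, 20, 25 };

// translate the data so the fixed point becomes the origin,
// then fit y - y0 = b * (x - x0), i.e. a regression with no intercept
double b = Fit.LineThroughOrigin(
    xdata.Select(x => x - x0).ToArray(),
    ydata.Select(y => y - y0).ToArray());

// back-substitute to the original coordinates: y = b*x + (y0 - b*x0)
double a = y0 - b * x0;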

First of all, if you want to force the regression through the origin, you can use LineThroughOrigin, or alternatively LineThroughOriginFunc if what you want is the function itself.
To force the regression to have a desired intercept, I would perform a normal linear regression and get the intercept and slope (knowing these, you know everything about your linear function).
With this information, you can compensate the intercept, for example:
If you made your regression in which
intercept = 2
slope = 1
Then you know that your equation would be y = x + 2.
If you want the same function to cross the y axis at 3 (y = x + 3), you would just need to add 1 to the intercept, so that
intercept = 3
slope = 1
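A small sketch of that idea (note: Fit.Line is Math.NET's ordinary least-squares line fit; the desired intercept is applied manually afterwards, it is not a re-fit):

using System;
using MathNet.Numerics;

var xdata = new double[] { 10, 20, 30 };
var ydata = new double[] { 15, 20, 25 };

// ordinary least-squares fit: Item1 = intercept, Item2 = slope
var fit = Fit.Line(xdata, ydata);
double slope = fit.Item2;

// keep the fitted slope, but force the desired intercept, e.g. 3
double desiredIntercept = 3;
Func<double, double> f = x => slope * x + desiredIntercept;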

Related

ML.NET plotting K-means clustering results?

I'm experimenting with ML.NET in an unsupervised clustering scenario. My starting data set is fewer than 30 records with 5 features each, in a TSV file, e.g. (of course the label will be ignored):
Label S1 S2 S3 S4 S5
alpha 0.274167987321712 0.483359746434231 0.0855784469096672 0.297939778129952 0.0332805071315372
beta 0.378208470054279 0.405409549510871 0.162317151706584 0.292342604802355 0.0551994848048085
...
My starting point was the iris tutorial, a sample of K-means clustering. In my case I want 3 clusters. As I'm just learning, once the model is created I'd like to use it to add the clustering data to each record in a copy of the original file, so I can examine the records and plot scatter graphs.
I started with this training code (say MyModel is the POCO class representing its model, with properties for S1-S5):
// load data
MLContext mlContext = new MLContext(seed: 0);
IDataView dataView = mlContext.Data.LoadFromTextFile<MyModel>(
    dataPath, hasHeader: true, separatorChar: '\t');

// train model
const string featuresColumnName = "Features";
EstimatorChain<ClusteringPredictionTransformer<KMeansModelParameters>> pipeline =
    mlContext.Transforms
        .Concatenate(featuresColumnName, "S1", "S2", "S3", "S4", "S5")
        .Append(mlContext.Clustering.Trainers.KMeans(featuresColumnName,
            numberOfClusters: 3));
TransformerChain<ClusteringPredictionTransformer<KMeansModelParameters>> model =
    pipeline.Fit(dataView);

// save model
using (FileStream fileStream = new FileStream(modelPath,
    FileMode.Create, FileAccess.Write, FileShare.Write))
{
    mlContext.Model.Save(model, dataView.Schema, fileStream);
}
Then, I load the saved model, read every record from the original data, and get its cluster ID. This sounds a bit convoluted, but my learning intent here is inspecting the results before playing with them. The results should be saved in a new file, together with the centroid coordinates and the point coordinates.
Yet, it does not seem that this API is transparent enough to easily access the centroids; I found only one post, which is rather old, and its code no longer compiles. I rather used it as a hint for recovering the data via reflection, but this is a hack.
Also, I'm not sure about the details of the data provided by the framework. I can see that every centroid has 3 vectors (named cx, cy, cz in the sample code), each with 5 elements (the 5 features, in their concatenated input order, I presume, i.e. from S1 to S5); also, each prediction provides a 3-fold distance (dx, dy, dz). If these assumptions are OK, I could assign a cluster ID to each record like this:
// for each record in the original data
foreach (MyModel record in csvReader.GetRecords<MyModel>())
{
    // get its cluster ID
    MyPrediction prediction = predictor.Predict(record);

    // get the centroids just once, as of course they are the same
    // for all the records referring their distances to them
    if (cx == null)
    {
        // get centroids (via reflection...):
        // https://github.com/dotnet/machinelearning/blob/master/docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/Clustering/KMeansWithOptions.cs#L49
        // https://social.msdn.microsoft.com/Forums/azure/en-US/c09171c0-d9c8-4426-83a9-36ed72a32fe7/kmeans-output-centroids-and-cluster-size?forum=MachineLearning
        VBuffer<float>[] centroids = default;
        var last = ((TransformerChain<ITransformer>)model).LastTransformer;
        KMeansModelParameters kparams = (KMeansModelParameters)
            last.GetType().GetProperty("Model").GetValue(last);
        kparams.GetClusterCentroids(ref centroids, out int k);
        cx = centroids[0].GetValues().ToArray();
        cy = centroids[1].GetValues().ToArray();
        cz = centroids[2].GetValues().ToArray();
    }

    float dx = prediction.Distances[0];
    float dy = prediction.Distances[1];
    float dz = prediction.Distances[2];

    // ... calculate and save full details for the record ...
}
Given this scenario, I suppose I can get all the details about each record's position in a pretrained model in the following way:
dx, dy, dz are the distances.
cx[0], cy[0], cz[0] + the distances (dx, dy, and dz respectively) should be the position of the S1 point; cx[1], cy[1], cz[1] + the distances the position of S2; and so forth up to S5 (cx[4], etc.).
In this case, I could plot these data in a 3D scatter graph. Yet, I'm totally new to ML.NET, and thus I'm not sure about these assumptions; it's quite possible I'm on the wrong path. Could anyone point me in the right direction?
I just figured this out myself - it took a bit of digging, so for those interested, here's some good info:
The centroids can now be retrieved right off the fitted model via
VBuffer<float>[] centroids = default;
var modelParams = trainedModel.Model;
modelParams.GetClusterCentroids(ref centroids, out var k);
However, the documentation here is annoyingly misleading, because the centroids it claims are "coordinates" are not coordinates, but rather the mean values of the feature columns for the cluster.
Depending on your pipeline, this probably makes them pretty useless if, like me, you have 700 feature columns and half a dozen transformation steps. As far as I can tell (please correct me if I'm wrong, anyone!), there is no way to transform the centroids back into Cartesian coordinates for charting.
But we can still use them.
What I ended up doing was, after training my model on my data, running all my data through the model's prediction function. This gives me the predicted cluster id and the euclidean distances to all the cluster centroids.
Using the predicted cluster id and the centroid means for that cluster, you can map your data point's features over the means to get a "weighted value" of your data row based on the predicted cluster. Basically, a centroid records that it contains a certain column at 0.6533, another column at 0.211, and another column at 0. Running your data point's features, let's say (5, 3, 1), through the centroid yields (3.2665, 0.633, 0), which is a representation of the data row as included in the predicted cluster.
This is still just a row of data, however - to turn it into Cartesian coordinates for a point graph, I simply use the sum of the first half as X and the sum of the second half as Y. For the example data the coordinate would be (3.8995, 0).
Doing this, we can finally get pretty charts.
And here's a mostly complete code example:
VBuffer<float>[] centroids = default;
var modelParams = trainedModel.Model;
modelParams.GetClusterCentroids(ref centroids, out var k);

// extract from the VBuffer for ease
var cleanCentroids = Enumerable.Range(1, 5).ToDictionary(x => (uint)x, x =>
{
    var values = centroids[x - 1].GetValues().ToArray();
    return values;
});

var points = new Dictionary<uint, List<(double X, double Y)>>();
foreach (var dp in featuresDataset)
{
    var prediction = predictor.Predict(dp);

    // scale the data point's features by the predicted cluster's centroid means
    var weightedCentroid = cleanCentroids[prediction.PredictedClusterId]
        .Zip(dp.Features, (x, y) => x * y);

    // first half of the weighted row sums to X, second half to Y
    var point = (X: weightedCentroid.Take(weightedCentroid.Count() / 2).Sum(),
                 Y: weightedCentroid.Skip(weightedCentroid.Count() / 2).Sum());

    if (!points.ContainsKey(prediction.PredictedClusterId))
        points[prediction.PredictedClusterId] = new List<(double X, double Y)>();
    points[prediction.PredictedClusterId].Add(point);
}
Where featuresDataset is an array of objects that contain the feature columns being fed to the KMeans trainer. See the Microsoft docs link above for an example - featuresDataset would be testData in their sample.

What parameters should I be using for the LogNormal and Normal Distribution in MATH.NET

I've tried several combinations of Mathdotnet's LogNormal and Normal classes: https://numerics.mathdotnet.com/api/MathNet.Numerics.Distributions/LogNormal.
I seem to get a lot closer to the result I'm looking for using the mean and standard deviation as parameters. However, I notice that when I use larger numbers, like numberOfMinutes, my results do not deviate past the mean the way they do with smaller numbers like numberOfDays. I know I'm not thinking about this right and could use some help.
Also, I'd like to use the geometric mean instead of the mean, but I don't know what parameter to use for the variance, given that I couldn't even pinpoint how to use it for the mean.
Finally, I hope the answer to this also answers the same issue I'm having with the Normal distribution.
List<double> numberOfDays = new List<double> { 10, 12, 18, 30 };
double mean = numberOfDays.Mean(); // 17.5
double geometricMean = numberOfDays.GeometricMean(); // 15.954
double variance = numberOfDays.Variance(); // 81
double standardDeviation = numberOfDays.StandardDeviation(); // 9
// Do I need a Geometric Standard Deviation or Variance
double numberOfDaysSampleMV = LogNormal.WithMeanVariance(mean, variance).Sample(); // One example sample yielded 40.23
double numberOfDaysSampleMSD = LogNormal.WithMeanVariance(mean, standardDeviation).Sample(); // One example sample yielded 17.33
I believe you are confused about the parameters required. Using conventional notation, you have a set X which you believe is LogNormal:
X = { 10, 12, 18, 30 }
mean: m = 17.5
standard deviation: sd = 9
from this you derive a set Y which is Normal:
Y = {2.30,2.48,2.89,3.4}
mean: mu = 2.77
standard deviation: sigma = 0.487
Note that mu and sigma are computed from Y, not X. To create a sample of the LogNormal data, you use mu and sigma, not m and sd.
double[] sample = new double[100];
LogNormal.Samples(sample, mu, sigma);
This is consistent with the Wikipedia article on the LogNormal distribution. The Numerics documentation is not clear.
Here is my test program which might be useful:
List<double> X = new List<double> { 10, 12, 18, 30 }; // assumed to be LogNormal
double m = X.Mean(); // mean of the log normal values = 17.5
double sd = X.StandardDeviation(); // standard deviation of the log normal values = 9

List<double> Y = new List<double>();
for (int i = 0; i < 4; i++)
{
    Y.Add(Math.Log(X[i]));
}
// Y = { 2.30, 2.48, 2.89, 3.4 }

double mu = Y.Mean(); // mean of the normal values = 2.77
double sigma = Y.StandardDeviation(); // standard deviation of the normal values = 0.487
double[] sample = new double[100];
LogNormal.Samples(sample, mu, sigma); // get sample
double sample_m = sample.Mean(); // 17.93, approximates m
double sample_sd = sample.StandardDeviation(); // 8.98, approximates sd
sample = new double[100];
Normal.Samples(sample, mu, sigma); // get sample
double sample_mu = sample.Mean(); //2.77, approximates mu
double sample_sigma = sample.StandardDeviation(); //0.517 approximates sigma
Using your test program above, my samples came out like this.
Using LogNormal(mu, sigma)
I'm ultimately concerned about the values greater than 30 and less than 10.
However, by trial and error (accidentally), when I use the following method to get the samples, using the original m and sd variables in your test program, I get the results I'm looking for. I do not want to go forward with something I arrived at accidentally.
sample = new double[100];
for (int i = 0; i < 100; i++)
{
    sample[i] = LogNormal.WithMeanVariance(m, sd).Sample();
}
Using LogNormal.WithMeanVariance(m, sd)
My values are consistently between the Min and Max and concentrated around the Mean.
My example shows pretty clearly how to get a LogNormal sample that has the mean and standard deviation of the original data.
The min/max of 10/30 is unrealistic if you are going to create your samples based on the mean and standard deviation of the sample. Suppose you took a random sample of the weights of 4 people out of a population of 1000 people. Would you expect your sample to include both the lightest and heaviest of the population?
LogNormal.WithMeanVariance(m, sd) is wrong: the units are wrong. It expects a variance, which would have units of days², while sd has units of days.
I suggest you (a) use LogNormal(mu, sigma) and discard any values that are outside your min/max range, or (b) use LogNormal(mu, c*sigma) for some value of c less than one, to reduce the variance enough that all the values are in your min/max range. The choice depends on the nature of your project.
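A minimal sketch of option (a), assuming the 10/30 min/max from the example:

var dist = new LogNormal(mu, sigma);
var kept = new List<double>();
while (kept.Count < 100)
{
    // sample, and discard anything outside the desired range
    double s = dist.Sample();
    if (s >= 10 && s <= 30)
        kept.Add(s);
}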
The Wikipedia entry on the LogNormal distribution has formulas for computing mu and sigma from m and sd, which might be better than calculating them from the Y data.
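For reference, those formulas are sigma² = ln(1 + sd²/m²) and mu = ln(m) − sigma²/2. A quick sketch using the example's values:

// derive the underlying normal parameters directly from the
// log-normal mean m = 17.5 and standard deviation sd = 9
double lnVariance = Math.Log(1.0 + (sd * sd) / (m * m)); // sigma^2, ≈ 0.235
double muM = Math.Log(m) - lnVariance / 2.0;             // ≈ 2.745, close to mu above
double sigmaM = Math.Sqrt(lnVariance);                   // ≈ 0.484, close to sigma above
var distM = new LogNormal(muM, sigmaM);                  // mean ≈ 17.5, stddev ≈ 9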

Looking for a sample how to do weighted linear regression

I'm trying to use MathNet to calculate weighted linear regression of my data.
The documentation is here.
I'm trying to find a*x + b = y such that it best fits a list of (x, y, w), where w is the weight of each point.
var r = WeightedRegression.Weighted(
    weightedPoints.Select(p => new Tuple<double[], double>(new[] { p.LogAvgAmount }, p.Frequency)),
    weightedPoints.Select(p => Convert.ToDouble(p.Weight)).ToArray(), false);
As result, in r I'm getting a single point. What I'm expecting is values of a and b.
What am I doing wrong?
WeightedRegression.Weighted expects a predictor matrix as the first parameter, and only the LogAvgAmount is being passed. Try adding a 1 to each predictor array, or invoke WeightedRegression.Weighted with intercept: true:
var x = weightedPoints.Select(p => new[] { p.LogAvgAmount }).ToArray();
var y = weightedPoints.Select(p => p.Frequency).ToArray();
var w = weightedPoints.Select(p => Convert.ToDouble(p.Weight)).ToArray();

// r1 == r2
var r1 = WeightedRegression.Weighted(
    weightedPoints.Select(p => new[] { 1, p.LogAvgAmount }).ToArray(), y, w);
var r2 = WeightedRegression.Weighted(x, y, w, intercept: true);
Using Math.NET Numerics might be a good idea.
Weighted Regression
Sometimes the regression error can be reduced by dampening specific data points. We can achieve this by introducing a weight matrix W into the normal equations Xᵀy = XᵀXp. Such weight matrices are often diagonal, with a separate weight for each data point on the diagonal.
var p = WeightedRegression.Weighted(X, y, W);
Weighted regression becomes interesting if we can adapt the weights to the point of interest and e.g. dampen all data points far away. Unfortunately, this way the model parameters become dependent on the point of interest t.
// warning: preliminary api
var p = WeightedRegression.Local(X, y, t, radius, kernel);
You can find more info at:
https://numerics.mathdotnet.com/regression.html

Fitting an Akima Spline curve

I'm trying to fit an Akima Spline curve in C# using the same method as this tool: https://www.mycurvefit.com/share/4ab90a5f-af5e-435e-9ce4-652c95c3d9a7
This curve gives me the exact shape I'm after (the curve peaking at X = 30M, the highest point in the sample data).
But when I use MathNet's Akima function, and plot 52 points from the same data set:
var x = new List<double> { 0, 15000000, 30000000, 40000000, 60000000 };
var y = new List<double> { 0, 93279805, 108560423, 105689254, 90130257 };
var curveY = new List<double>();
var interpolation = MathNet.Numerics.Interpolation.CubicSpline.InterpolateAkima(x.ToArray(), y.ToArray());
for (int i = 1; i <= 52; i++)
{
    var cY = interpolation.Interpolate((60000000 / 52) * i);
    curveY.Add(cY);
}
I don't get the same curve at all, I get a curve which peaks around X = 26M, and looks much more like a Natural Spline: https://www.mycurvefit.com/share/faec5545-abf1-4768-b180-3e615dc60e3a
What is the reason the Akimas look so different? (especially in terms of peak)
The Interpolate method expects a double parameter, but (60000000 / 52) * i is evaluated with integer arithmetic, so the division is truncated before the value is ever converted to double.
Change (60000000 / 52) * i to (60000000d / 52d) * i.
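Applied to the loop from the question, the fix looks like this:

for (int i = 1; i <= 52; i++)
{
    // 60000000d / 52d forces floating-point division,
    // so the sample positions are no longer truncated
    var cY = interpolation.Interpolate((60000000d / 52d) * i);
    curveY.Add(cY);
}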
I gave up on the MathNet functions and used the CubicSpline.FitParametric() function from this implementation instead: https://www.codeproject.com/Articles/560163/Csharp-Cubic-Spline-Interpolation
This successfully gave me the desired fit (one which fully respects the peak in the sample data).

Exponential based Curve-Fit using Math.Net

I'm very new to the Math.NET library and I'm having problems trying to do curve fitting based on an exponential function. More specifically, I intend to use this function:
f(x) = a*exp(b*x) + c*exp(d*x)
Using MATLAB I get pretty good results; it calculates the following parameters:
f(x) = a*exp(b*x) + c*exp(d*x)
Coefficients (with 95% confidence bounds):
a = 29.6 ( 29.49 , 29.71)
b = 0.000408 ( 0.0003838, 0.0004322)
c = -6.634 ( -6.747 , -6.521)
d = -0.03818 ( -0.03968 , -0.03667)
Is it possible to achieve these results using Math.Net?
Looking at Math.NET, it seems that it offers various types of regression, whereas your function requires some type of iterative method, for instance Gauss-Newton, where you would use linear regression in each iteration to solve an (overdetermined) system of linear equations. This would still require some "manual" work to write the method.
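As a rough illustration of that idea - a hand-rolled Gauss-Newton sketch, not a Math.NET API; the function name, fixed iteration count, and the need for a sensible starting guess p0 are all assumptions:

using System;
using MathNet.Numerics.LinearAlgebra;

static double[] FitTwoExponentials(double[] x, double[] y, double[] p0, int iterations = 50)
{
    // p = { a, b, c, d } for f(x) = a*exp(b*x) + c*exp(d*x)
    var p = (double[])p0.Clone();
    for (int iter = 0; iter < iterations; iter++)
    {
        var J = Matrix<double>.Build.Dense(x.Length, 4); // Jacobian
        var r = Vector<double>.Build.Dense(x.Length);    // residuals
        for (int i = 0; i < x.Length; i++)
        {
            double e1 = Math.Exp(p[1] * x[i]);
            double e2 = Math.Exp(p[3] * x[i]);
            r[i] = y[i] - (p[0] * e1 + p[2] * e2);
            J[i, 0] = e1;               // df/da
            J[i, 1] = p[0] * x[i] * e1; // df/db
            J[i, 2] = e2;               // df/dc
            J[i, 3] = p[2] * x[i] * e2; // df/dd
        }
        // each iteration solves an overdetermined linear system in least squares
        var delta = J.QR().Solve(r);
        for (int k = 0; k < 4; k++)
            p[k] += delta[k];
    }
    return p;
}

Plain Gauss-Newton can diverge from a poor starting guess, so in practice you would add damping (Levenberg-Marquardt style) or at least a convergence check.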
No, it appears there is no exponential-fit support at this time. However, there's a discussion on the Math.NET forums where a maintainer proposes a workaround:
https://discuss.mathdotnet.com/t/exponential-fit/131
Contents duplicated in case the link gets broken:
You can, by transforming it, similar to linearizing non-linear models by transformation. Something along the lines of the following should work:
double[] Exponential(double[] x, double[] y,
    DirectRegressionMethod method = DirectRegressionMethod.QR)
{
    double[] y_hat = Generate.Map(y, Math.Log);
    double[] p_hat = Fit.LinearCombination(x, y_hat, method, t => 1.0, t => t);
    return new[] { Math.Exp(p_hat[0]), p_hat[1] };
}
Example usage:
double[] x = new[] { 1.0, 2.0, 3.0 };
double[] y = new[] { 2.0, 4.1, 7.9 };
double[] p = Exponential(x, y); // a=1.017, r=0.687
double[] yh = Generate.Map(x, k => p[0] * Math.Exp(p[1] * k)); // 2.02, 4.02, 7.98
The answer is: not yet, I believe. Basically, there is a contribution of the whole csmpfit package, but it has yet to be integrated into Math.NET. You could use it as a separate library for now and move to Math.NET after full integration. Link: http://csmpfit.codeplex.com
