ML.NET plotting K-means clustering results?

I'm experimenting with ML.NET in an unsupervised clustering scenario. My starting data are fewer than 30 records with 5 features each in a TSV file, e.g. (the label is of course ignored):
Label S1 S2 S3 S4 S5
alpha 0.274167987321712 0.483359746434231 0.0855784469096672 0.297939778129952 0.0332805071315372
beta 0.378208470054279 0.405409549510871 0.162317151706584 0.292342604802355 0.0551994848048085
...
My starting point was the iris tutorial, a sample of K-means clustering. In my case I want 3 clusters. As I'm just learning, once the model is created I'd like to use it to add the clustering data to each record in a copy of the original file, so I can examine them and plot scatter graphs.
I started with this training code (say MyModel is the POCO class representing a record, with properties for S1-S5):
// load data
MLContext mlContext = new MLContext(seed: 0);
IDataView dataView = mlContext.Data.LoadFromTextFile<MyModel>
(dataPath, hasHeader: true, separatorChar: '\t');
// train model
const string featuresColumnName = "Features";
EstimatorChain<ClusteringPredictionTransformer<KMeansModelParameters>>
pipeline = mlContext.Transforms
.Concatenate(featuresColumnName, "S1", "S2", "S3", "S4", "S5")
.Append(mlContext.Clustering.Trainers.KMeans(featuresColumnName,
numberOfClusters: 3));
TransformerChain<ClusteringPredictionTransformer<KMeansModelParameters>>
model = pipeline.Fit(dataView);
// save model
using (FileStream fileStream = new FileStream(modelPath,
FileMode.Create, FileAccess.Write, FileShare.Write))
{
mlContext.Model.Save(model, dataView.Schema, fileStream);
}
Then I load the saved model, read every record from the original data, and get its cluster ID. This sounds a bit convoluted, but my aim here is to inspect the results before playing with them. The results should be saved in a new file, together with the coordinates of the centroids and of the points.
Yet this API does not seem transparent enough to give easy access to the centroids; I found only one post, which is rather old, and its code no longer compiles. I used it as a hint to recover the data via reflection, but this is a hack.
Also, I'm not sure about the details of the data provided by the framework. I can see that there are 3 centroid vectors (named cx cy cz in the sample code), each with 5 elements (the 5 features, in their concatenated input order, I presume, i.e. from S1 to S5); also, each prediction provides 3 distances (dx dy dz). If these assumptions are OK, I could assign a cluster ID to each record like this:
// for each record in the original data
foreach (MyModel record in csvReader.GetRecords<MyModel>())
{
// get its cluster ID
MyPrediction prediction = predictor.Predict(record);
// get the centroids just once, as of course they are the same
// for all the records referring their distances to them
if (cx == null)
{
// get centroids (via reflection...):
// https://github.com/dotnet/machinelearning/blob/master/docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/Clustering/KMeansWithOptions.cs#L49
// https://social.msdn.microsoft.com/Forums/azure/en-US/c09171c0-d9c8-4426-83a9-36ed72a32fe7/kmeans-output-centroids-and-cluster-size?forum=MachineLearning
VBuffer<float>[] centroids = default;
var last = ((TransformerChain<ITransformer>)model)
.LastTransformer;
KMeansModelParameters kparams = (KMeansModelParameters)
last.GetType().GetProperty("Model").GetValue(last);
kparams.GetClusterCentroids(ref centroids, out int k);
cx = centroids[0].GetValues().ToArray();
cy = centroids[1].GetValues().ToArray();
cz = centroids[2].GetValues().ToArray();
}
float dx = prediction.Distances[0];
float dy = prediction.Distances[1];
float dz = prediction.Distances[2];
// ... calculate and save full details for the record ...
}
Given this scenario, I suppose I can get all the details about each record's position in a pretrained model in the following way:
dx, dy, dz are the distances.
cx[0] cy[0] cz[0] + the distances (dx, dy, and dz respectively) should be the position of the S1 point; cx[1] cy[1] cz[1] + the distances should be the position of S2; and so forth up to S5 (cx[4] etc.).
In this case, I could plot these data in a 3D scatter graph. Yet I'm totally new to ML.NET, so I'm not sure about these assumptions, and it's quite possible I'm on the wrong path. Could anyone point me in the right direction?

I just figured this out myself. It took a bit of digging, so for those interested here's some good info:
The centroids can now be retrieved directly from the fitted model via
VBuffer<float>[] centroids = default;
var modelParams = trainedModel.Model;
modelParams.GetClusterCentroids(ref centroids, out var k);
However, the documentation here is annoyingly misleading, because the centroids it calls "coordinates" are not plottable coordinates but rather the mean values of the feature columns for each cluster.
Depending on your pipeline this makes them pretty useless for charting if, like me, you have 700 feature columns and half a dozen transformation steps. As far as I can tell (please correct me if I'm wrong, anyone!) there is no built-in way to transform the centroids into Cartesian coordinates for charting.
But we can still use them.
What I ended up doing was this: after training the model on my data, I ran all my data through the model's prediction function. This gives me the predicted cluster ID and the Euclidean distances to all the cluster centroids.
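To make the prediction output concrete, here is a minimal sketch in plain C# (no ML.NET types) of what a cluster assignment amounts to: compute the distance from a feature vector to every centroid and pick the smallest. The class `CentroidDistance` and its method names are made up for illustration. The sketch assumes the reported distances are squared Euclidean distances and that cluster IDs are 1-based, as the `centroids[x - 1]` lookup in the code further down implies; check both against your ML.NET version.

```csharp
using System;

public static class CentroidDistance
{
    // Squared Euclidean distance between a feature vector and a centroid.
    public static double SquaredDistance(float[] features, float[] centroid)
    {
        if (features.Length != centroid.Length)
            throw new ArgumentException("Vectors must have the same length.");
        double sum = 0;
        for (int i = 0; i < features.Length; i++)
        {
            double d = features[i] - centroid[i];
            sum += d * d;
        }
        return sum;
    }

    // 1-based ID of the centroid closest to the feature vector
    // (assumption: IDs start at 1, matching the dictionary keys used below).
    public static uint NearestClusterId(float[] features, float[][] centroids)
    {
        if (centroids.Length == 0)
            throw new ArgumentException("At least one centroid is required.");
        int best = 0;
        double bestDistance = double.MaxValue;
        for (int c = 0; c < centroids.Length; c++)
        {
            double d = SquaredDistance(features, centroids[c]);
            if (d < bestDistance)
            {
                bestDistance = d;
                best = c;
            }
        }
        return (uint)(best + 1);
    }
}
```

Comparing `NearestClusterId` with the predicted cluster ID for a few rows is a quick way to check that the centroids you extracted are in the same feature order as your input.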
Using the predicted cluster ID and the centroid means for that cluster, you can multiply your data point's features by the means to get a "weighted value" of your data row based on the predicted cluster. Basically, a centroid records that it holds 0.6533 of one column, 0.211 of another, and 0 of a third. By running your data point's features, let's say (5, 3, 1), through the centroid you'll get (3.2665, 0.633, 0), which is a representation of the data row as included in the predicted cluster.
This is still just a row of data, however. To turn it into Cartesian coordinates for a point graph I simply use the sum of the first half as X and the sum of the second half as Y. For the example data, the code below splits the three values as 1 + 2 (integer division), so the coordinate would be (3.2665, 0.633).
Doing this we can finally get pretty charts.
And here's a mostly complete code example:
VBuffer<float>[] centroids = default;
var modelParams = trainedModel.Model;
modelParams.GetClusterCentroids(ref centroids, out var k);
// extract from the VBuffer for ease
// cluster IDs are 1-based; k is the number of clusters reported by the model
var cleanCentroids = Enumerable.Range(1, k).ToDictionary(x => (uint)x, x =>
{
var values = centroids[x - 1].GetValues().ToArray();
return values;
});
var points = new Dictionary<uint, List<(double X, double Y)>>();
foreach (var dp in featuresDataset)
{
var prediction = predictor.Predict(dp);
// multiply each centroid mean by the matching feature; materialize once so it is not re-enumerated
var weightedCentroid = cleanCentroids[prediction.PredictedClusterId].Zip(dp.Features, (x, y) => x * y).ToArray();
int half = weightedCentroid.Length / 2;
var point = (X: weightedCentroid.Take(half).Sum(), Y: weightedCentroid.Skip(half).Sum());
if (!points.ContainsKey(prediction.PredictedClusterId))
points[prediction.PredictedClusterId] = new List<(double X, double Y)>();
points[prediction.PredictedClusterId].Add(point);
}
Here featuresDataset is an array of objects containing the feature columns fed to the K-means trainer. See the Microsoft docs link above for an example; featuresDataset would be testData in their sample.
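The weighting and the collapse to a 2D point can also be tried in isolation, without a trained model. The following is a minimal sketch using the example numbers from this answer; `CentroidProjection`, `Weight` and `ToPoint` are hypothetical names, and the half-and-half split is just the heuristic described above, not anything ML.NET provides.

```csharp
using System;
using System.Linq;

public static class CentroidProjection
{
    // Multiply each feature by the matching centroid mean.
    public static float[] Weight(float[] centroid, float[] features) =>
        centroid.Zip(features, (c, f) => c * f).ToArray();

    // Collapse a weighted row into a 2D point:
    // the sum of the first half is X, the sum of the remainder is Y.
    public static (double X, double Y) ToPoint(float[] weighted)
    {
        int half = weighted.Length / 2; // integer division: 3 elements -> 1
        double x = weighted.Take(half).Sum(v => (double)v);
        double y = weighted.Skip(half).Sum(v => (double)v);
        return (x, y);
    }
}
```

For the centroid (0.6533, 0.211, 0) and the features (5, 3, 1), `Weight` gives roughly (3.2665, 0.633, 0), and `ToPoint` splits it as 1 + 2 elements, giving roughly (3.2665, 0.633). Splitting it as 2 + 1 instead would give (3.8995, 0), so with an odd number of features it is worth deciding explicitly where the split falls.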

Related

How to properly plot 3D surface with ZXYPositions in ilNumerics?

What I want to achieve?
I'm working on an evolutionary algorithm that finds the min/max of non-linear functions. I have a fully functional WPF application, but one feature is missing: 3D plots.
What is the problem?
To accomplish this I started with the free trial of ILNumerics, which provides 3D data visualisation. It works fine with the examples from the documentation, but something prevents me from properly plotting my own 3D graphs.
Visualising problem:
So, here is how it behaves at the moment:
These are graphs of the non-linear function x1^4+x2^4-0.62*x1^2-0.62*x2^2
Left side: Contour achieved with OxyPlot
Right side: 3D graph achieved with ILNumerics
As you can see, the OxyPlot contour is fine, while the 3D graph, which I'm trying to plot with exactly the same data, is not right at all.
How is the current (not working) solution done?
I'm trying to visualise a 3D surface using points in space. ILNumerics has a class called Surface, which I have to instantiate in order to plot my graph. It has the following constructor:
public Surface(InArray<float> ZXYPositions, InArray<float> C = null, Tuple<float, float> colorsDataRange = null, Colormap colormap = null, object tag = null);
where, as you can see, ZXYPositions is what I actually have a problem with. Before instantiating the Surface object I create an array like this:
int m = 0;
for (int i = 0; i < p; ++i)
{
for (int j = 0; j < p; ++j)
{
sigma[m, 0] = (float)data[i, j];
sigma[m, 1] = (float)xy[0][i];
sigma[m, 2] = (float)xy[1][j];
m++;
}
}
where sigma[m, 0] = Z; sigma[m, 1] = X; sigma[m, 2] = Y;
And here's the problem: I cannot find any logical error in this approach.
Here is the code responsible for creating the object which I pass to the ILNumerics plot panel:
var scene = new PlotCube(twoDMode: false) {
// add a surface
new Surface(sigma) {
// make thin transparent wireframes
Wireframe = { Color = Color.FromArgb(50, Color.LightGray) },
// choose a different colormap
Colormap = Colormaps.Jet,
}
};
Additionally, the sigma array is constructed properly: I've printed out its values and they're definitely correct.
Plot only data points.
Finally, I should add that when I don't create a Surface object and plot only the data points, it looks much more reasonable:
But sadly that's not what I'm looking for. I want to create a surface from this data.

Good News!
I found the answer. Oddly, almost everything was fine; I misunderstood just one thing. When I pass the ZXYPositions argument to Surface, it needs only the Z data from me to plot the graph correctly.
What I changed to make it work
The two nested for loops now look like this:
sigma = data;
As you can see they're no longer loops: sigma now contains only the "solution" coordinates (the Z coords), so I just need to assign the data array to sigma.
The second part, where I create the Surface, now looks like this:
var B = ILMath.tosingle(sigma);
var scene = new PlotCube(twoDMode: false) {
// add a surface
new Surface(B) {
// make thin transparent wireframes
Wireframe = { Color = Color.FromArgb(50, Color.LightGray) },
// choose a different colormap
Colormap = Colormaps.Jet,
}
};
scene.Axes.XAxis.Max = (float)arguments[0].Maximum;
scene.Axes.XAxis.Min = (float)arguments[0].Minimum;
scene.Axes.YAxis.Max = (float)arguments[1].Maximum;
scene.Axes.YAxis.Min = (float)arguments[1].Minimum;
scene.First<PlotCube>().Rotation = Matrix4.Rotation(new Vector3(1f, 0.23f, 1), 0.7f);
Basically, the one other thing that changed is scaling the X and Y axes to the proper values.
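For readers who want to reproduce the Z-only input, here is a minimal sketch of sampling the function from the question on a regular p-by-p grid. It uses plain C# arrays only; `SurfaceGrid`, `F` and `SampleZ` are hypothetical names. How the resulting matrix is handed to ILNumerics (the answer uses ILMath.tosingle), and which array index ILNumerics treats as X and which as Y, is not covered here and should be checked against the ILNumerics documentation.

```csharp
using System;

public static class SurfaceGrid
{
    // The function from the question: x1^4 + x2^4 - 0.62*x1^2 - 0.62*x2^2.
    public static double F(double x1, double x2) =>
        Math.Pow(x1, 4) + Math.Pow(x2, 4) - 0.62 * x1 * x1 - 0.62 * x2 * x2;

    // Sample f on an evenly spaced p-by-p grid over [min, max] on both axes.
    // Only the Z values are stored; the X/Y extent is conveyed separately
    // (in the answer above, via the axis Min/Max settings).
    public static float[,] SampleZ(Func<double, double, double> f,
                                   double min, double max, int p)
    {
        if (p < 2)
            throw new ArgumentOutOfRangeException(nameof(p), "Need at least 2 samples per axis.");
        var z = new float[p, p];
        double step = (max - min) / (p - 1);
        for (int i = 0; i < p; i++)
        {
            for (int j = 0; j < p; j++)
            {
                z[i, j] = (float)f(min + i * step, min + j * step);
            }
        }
        return z;
    }
}
```

Calling `SurfaceGrid.SampleZ(SurfaceGrid.F, -1, 1, 50)` yields a 50-by-50 matrix that plays the role of `data` in the answer.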
Final results
Here you have final results:

Looking for a sample how to do weighted linear regression

I'm trying to use MathNet to calculate weighted linear regression of my data.
The documentation is here.
I'm trying to find a and b in a x + b = y that best fit a list of (x, y, w), where w is the weight of each point.
var r = WeightedRegression.Weighted(
weightedPoints.Select(p=>new Tuple<double[],double>(new [] { p.LogAvgAmount}, p.Frequency)),
weightedPoints.Select(p=>Convert.ToDouble(p.Weight)).ToArray(), false);
As a result, r contains a single value. What I'm expecting is the values of a and b.
What am I doing wrong?

WeightedRegression.Weighted expects a predictor matrix as its first parameter, and only LogAvgAmount is being passed, so only the slope is fitted. Either add a constant 1 to each predictor row or call WeightedRegression.Weighted with intercept: true:
var x = weightedPoints.Select(p => new[] {p.LogAvgAmount}).ToArray();
var y = weightedPoints.Select(p => p.Frequency).ToArray();
var w = weightedPoints.Select(p => Convert.ToDouble(p.Weight)).ToArray();
// r1 == r2
var r1 = WeightedRegression.Weighted(weightedPoints.Select(p =>
new[] {1, p.LogAvgAmount}).ToArray(), y, w);
var r2 = WeightedRegression.Weighted(x, y, w, intercept: true);
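To cross-check what the library returns, the weighted fit of a single predictor can also be computed in closed form. The sketch below is plain C# and does not call Math.NET; `WeightedLineFit` is a hypothetical helper that minimizes the weighted sum of squared residuals sum(w * (y - (a*x + b))^2) and returns a and b. It assumes non-negative weights with a positive total.

```csharp
using System;

public static class WeightedLineFit
{
    // Returns (a, b) minimizing sum(w[i] * (y[i] - (a * x[i] + b))^2).
    public static (double Slope, double Intercept) Fit(double[] x, double[] y, double[] w)
    {
        if (x.Length != y.Length || x.Length != w.Length)
            throw new ArgumentException("x, y and w must have the same length.");

        // Weighted means of x and y.
        double sw = 0, swx = 0, swy = 0;
        for (int i = 0; i < x.Length; i++)
        {
            sw += w[i];
            swx += w[i] * x[i];
            swy += w[i] * y[i];
        }
        if (sw <= 0)
            throw new ArgumentException("Weights must sum to a positive value.");
        double meanX = swx / sw;
        double meanY = swy / sw;

        // Weighted covariance of (x, y) and weighted variance of x.
        double sxx = 0, sxy = 0;
        for (int i = 0; i < x.Length; i++)
        {
            double dx = x[i] - meanX;
            sxx += w[i] * dx * dx;
            sxy += w[i] * dx * (y[i] - meanY);
        }
        if (sxx == 0)
            throw new ArgumentException("x values must not all be identical.");

        double slope = sxy / sxx;
        return (slope, meanY - slope * meanX);
    }
}
```

Since r1 above is built with the constant 1 as the first predictor column and r1 == r2, the first element of the returned array corresponds to the intercept b and the second to the slope a, which is the pair this helper returns as (Slope, Intercept).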

Using Math.Net Numerics might be a good idea.
Weighted Regression
Sometimes the regression error can be reduced by dampening specific data points. We can achieve this by introducing a weight matrix W into the normal equations X^T y = X^T X p. Such weight matrices are often diagonal, with a separate weight for each data point on the diagonal.
var p = WeightedRegression.Weighted(X,y,W);
Weighted regression becomes interesting if we can adapt the weights to the point of interest and, e.g., dampen all data points far away. Unfortunately, this way the model parameters depend on the point of interest t.
// warning: preliminary api
var p = WeightedRegression.Local(X,y,t,radius,kernel);
You can find more info at:
https://numerics.mathdotnet.com/regression.html

Conversion of points in one Projected Coordinate System to Another

How can I convert points from one projected coordinate system to another using ArcObjects in C#?
//Coordinates in feet
double feetLong = 2007816.711;
double feetLat = 393153.895;
//Coordinates in decimal degrees (Should be the resulting coordinates)
//long: -97.474575;
//lat: 32.747352;
double[] feetPair = new double[] { feetLong, feetLat };
//Our projection used in GIS
string epsg32038 = "PROJCS[\"NAD27 / Texas North Central\",GEOGCS[\"GCS_North_American_1927\",DATUM[\"D_North_American_1927\",SPHEROID[\"Clarke_1866\",6378206.4,294.9786982138982]],PRIMEM[\"Greenwich\",0],UNIT[\"Degree\",0.017453292519943295]],PROJECTION[\"Lambert_Conformal_Conic\"],PARAMETER[\"standard_parallel_1\",32.13333333333333],PARAMETER[\"standard_parallel_2\",33.96666666666667],PARAMETER[\"latitude_of_origin\",31.66666666666667],PARAMETER[\"central_meridian\",-97.5],PARAMETER[\"false_easting\",2000000],PARAMETER[\"false_northing\",0],UNIT[\"Foot_US\",0.30480060960121924]]";
//Google Maps projection
string epsg3785 = "PROJCS[\"Popular Visualisation CRS / Mercator\",GEOGCS[\"Popular Visualisation CRS\",DATUM[\"D_Popular_Visualisation_Datum\",SPHEROID[\"Popular_Visualisation_Sphere\",6378137,0]],PRIMEM[\"Greenwich\",0],UNIT[\"Degree\",0.017453292519943295]],PROJECTION[\"Mercator\"],PARAMETER[\"central_meridian\",0],PARAMETER[\"scale_factor\",1],PARAMETER[\"false_easting\",0],PARAMETER[\"false_northing\",0],UNIT[\"Meter\",1]]";
This is the beginning of my code. I've tried using the CoordinateSystemFactory but never got anything to work. I intend to use ProjNet to solve this although I am open to any other way. I am really new to using ArcObjects to create custom tools and have been stuck on this for a while.

Linear regression with constraints with Math.NET

I'm performing simple linear regression with Math.NET.
I've provided a typical code sample below. As an alternative to this example, one can use the Fit class for simple linear regression.
What I additionally want is to specify constraints, like a fixed y-intercept, or to force the fit to run through a fixed point, e.g. (2, 2). How can I achieve this in Math.NET?
var xdata = new double[] { 10, 20, 30 };
var ydata = new double[] { 15, 20, 25 };
var X = DenseMatrix.CreateFromColumns(new[] {new DenseVector(xdata.Length, 1), new DenseVector(xdata)});
var y = new DenseVector(ydata);
var p = X.QR().Solve(y);
var a = p[0];
var b = p[1];

You can modify your data set to reflect the constraint, and then use the standard Math.NET linear regression.
If (x0, y0) is the point through which the regression line must pass, fit the model y − y0 = β(x − x0) + ε, i.e., a linear regression with "no intercept" on a translated data set.
See here: https://stats.stackexchange.com/questions/12484/constrained-linear-regression-through-a-specified-point
and here: http://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)#Constrained_linear_least_squares
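The translated no-intercept fit has a simple closed form: β = Σ(x − x0)(y − y0) / Σ(x − x0)², and the intercept follows from forcing the line through (x0, y0). Here is a minimal sketch in plain C# that does not depend on Math.NET; `ConstrainedLineFit` and `ThroughPoint` are hypothetical names.

```csharp
using System;

public static class ConstrainedLineFit
{
    // Least-squares line forced through (x0, y0):
    // fit y - y0 = slope * (x - x0) with no intercept on the translated data.
    public static (double Slope, double Intercept) ThroughPoint(
        double[] x, double[] y, double x0, double y0)
    {
        if (x.Length != y.Length)
            throw new ArgumentException("x and y must have the same length.");

        double sxx = 0, sxy = 0;
        for (int i = 0; i < x.Length; i++)
        {
            double dx = x[i] - x0;
            double dy = y[i] - y0;
            sxx += dx * dx;
            sxy += dx * dy;
        }
        if (sxx == 0)
            throw new ArgumentException("At least one x must differ from x0.");

        double slope = sxy / sxx;
        // Translate back: the line passes through (x0, y0) by construction.
        return (slope, y0 - slope * x0);
    }
}
```

A fixed y-intercept b0 is the special case x0 = 0, y0 = b0. With the data from the question and the point (2, 2), the slope comes out as 1072 / 1172, roughly 0.915.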

First of all, if you want to force the regression through the origin, you can use LineThroughOrigin, or alternatively LineThroughOriginFunc if what you want is the function itself.
To force the regression to have a desired intercept, I would perform a normal linear regression and get the intercept and slope (knowing these, you know everything about your linear function).
With this information, you can adjust the intercept. For example:
Suppose your regression gave
intercept = 2
slope = 1
Then you know that your equation would be y = x + 2.
If you want the same function to cross the y axis at 3 (y = x + 3), you just need to add 1 to the intercept, so that
intercept = 3
slope = 1

How to get levels for Fry Graph readability formula?

I'm working on an application (C#) that applies some readability formulas to a text, like Gunning-Fog, Precise SMOG, and Flesch-Kincaid.
Now I need to implement the Fry-based grade formula in my program. I understand the formula's logic: you take three 100-word samples, calculate the average number of sentences per 100 words and of syllables per 100 words, and then use a graph to plot the values.
Here is a more detailed explanation on how this formula works.
I already have the averages, but I have no idea how to tell my program to "go check the graph, plot the values and give me a level." I don't have to show the graph to the user; I only have to show the level.
I was thinking that maybe I can have all the values in memory, divided into levels, for example:
Level 1: values whose sentence average is between 10.0 and 25+, and whose syllable average is between 108 and 132.
Level 2: values whose sentence average is between 7.7 and 10.0, and so on.
But the problem is that, so far, the only place where I have found the values that define a level is the graph itself, and they aren't very accurate. If I apply the approach described above, taking the values from the graph, my level estimates would be too imprecise, and thus the Fry-based grade would not be accurate.
So, maybe one of you knows a place where I can find exact values for the different levels of the Fry-based grade, or can help me think of a way to work around this.
Thanks

Well, I'm not sure this is the most efficient solution, or the best one, but at least it does the job.
I gave up on the idea of a math formula to get the levels; maybe there is such a formula, but I couldn't find it.
So I took Fry's graph, with all the levels, and painted each level a different color. Then I loaded the image in my program using:
Bitmap image = new Bitmap(@"C:\FryGraph.png");
// x, y: pixel coordinates computed from the syllable and sentence averages
Color levelColor = image.GetPixel(x, y);
As you can see, after loading the image I use the GetPixel method to get the color at the specified coordinates. I had to do some conversion to get the equivalent pixels for a given value on the graph, since the scale of the graph does not match the pixels of the image.
In the end, I compare the color returned by GetPixel against the level colors to see which Fry readability level the text falls in.
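The value-to-pixel conversion mentioned above could look roughly like the following sketch, which maps a value on a linear axis to a pixel position. `GraphPixelMap` and `ToPixel` are hypothetical names, and the axis ranges and pixel bounds in the usage note are made-up numbers that depend entirely on your image. Note that the sentence axis of the Fry graph is not linear, so for that axis a single linear map like this is only an approximation; a piecewise mapping between the printed grid lines would be closer.

```csharp
using System;

public static class GraphPixelMap
{
    // Linearly map a value in [valueMin, valueMax] to a pixel in [pixelMin, pixelMax].
    // pixelMin may be larger than pixelMax, which handles image Y axes that grow downwards.
    public static int ToPixel(double value, double valueMin, double valueMax,
                              int pixelMin, int pixelMax)
    {
        if (valueMax == valueMin)
            throw new ArgumentException("The value range must not be empty.");
        double t = (value - valueMin) / (valueMax - valueMin);
        t = Math.Max(0.0, Math.Min(1.0, t)); // clamp values that fall outside the graph
        return (int)Math.Round(pixelMin + t * (pixelMax - pixelMin));
    }
}
```

For example, if the syllable axis ran from 108 to 172 and occupied pixels 50 to 690 of the image, `GraphPixelMap.ToPixel(140, 108, 172, 50, 690)` would give 370, which could then be passed as x to GetPixel.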
I hope this helps someone who faces the same problem.
Cheers.

You simply need to determine the formula for the graph. That is, a formula that accepts the number of sentences and number of syllables, and returns the level.
If you can't find the formula, you can determine it yourself. Estimate the linear equation for each of the lines on the graph. Also estimate the 'out-of-bounds' areas in the 'long words' and 'long sentences' areas.
Now for each point, just determine the region in which it resides: which lines it is above and which lines it is below. This is fairly simple algebra; unfortunately this is the best link I can find to describe how to do that.

I have made a first pass at solving this that I thought I would share in case someone else is looking in the future. I built on the answer above and created a generic list of linear equations that one can use to determine an approximate grade level. First I had to correct the axis values to make the graph more linear. This does not take the invalid areas into account, but I may revisit that.
The equation class:
public class GradeLineEquation
{
// using the form y = mx + b,
// i.e. y = Slope * x + yIntercept
public int GradeLevel { get; set; }
public float Slope { get; set; }
public float yIntercept { get; set; }
public float GetYGivenX(float x)
{
float result = 0;
result = (Slope * x) + yIntercept;
return result;
}
public GradeLineEquation(int gradelevel,float slope,float yintercept)
{
this.GradeLevel = gradelevel;
this.Slope = slope;
this.yIntercept = yintercept;
}
}
Here is the FryCalculator:
public class FryCalculator
{
//this class normalizes the plot on the Fry readability graph the same way a person would, by choosing points on the graph based on values even though
//the y-axis is non-linear and neither axis starts at 0. Just picking a relative point on each axis to plot the intercept of the zero and infinite scope lines
private List<GradeLineEquation> linedefs = new List<GradeLineEquation>();
public FryCalculator()
{
LoadLevelEquations();
}
private void LoadLevelEquations()
{
// load the estimated linear equations for each line with the
// grade level, Slope, and y-intercept
linedefs.Add(new NLPTest.GradeLineEquation(1, (float)0.5, (float)22.5));
linedefs.Add(new NLPTest.GradeLineEquation(2, (float)0.5, (float)20.5));
linedefs.Add(new NLPTest.GradeLineEquation(3, (float)0.6, (float)17.4));
linedefs.Add(new NLPTest.GradeLineEquation(4, (float)0.6, (float)15.4));
linedefs.Add(new NLPTest.GradeLineEquation(5, (float)0.625, (float)13.125));
linedefs.Add(new NLPTest.GradeLineEquation(6, (float)0.833, (float)7.333));
linedefs.Add(new NLPTest.GradeLineEquation(7, (float)1.05, (float)-1.15));
linedefs.Add(new NLPTest.GradeLineEquation(8, (float)1.25, (float)-8.75));
linedefs.Add(new NLPTest.GradeLineEquation(9, (float)1.75, (float)-24.25));
linedefs.Add(new NLPTest.GradeLineEquation(10, (float)2, (float)-35));
linedefs.Add(new NLPTest.GradeLineEquation(11, (float)2, (float)-40));
linedefs.Add(new NLPTest.GradeLineEquation(12, (float)2.5, (float)-58.5));
linedefs.Add(new NLPTest.GradeLineEquation(13, (float)3.5, (float)-93));
linedefs.Add(new NLPTest.GradeLineEquation(14, (float)5.5, (float)-163));
}
public int GetGradeLevel(float avgSylls,float avgSentences)
{
// first normalize the given values to Cartesian positions on the graph
float x = NormalizeX(avgSylls);
float y = NormalizeY(avgSentences);
// given x find the first grade level equation that produces a lower y at that x
return linedefs.Find(a => a.GetYGivenX(x) < y).GradeLevel;
}
private float NormalizeY(float avgSentenceCount)
{
float result = 0;
int lower = -1;
int upper = -1;
// load the list of y-axis line intervals
List<double> intervals = new List<double> {2.0, 2.5, 3.0, 3.3, 3.5, 3.6, 3.7, 3.8, 4.0, 4.2, 4.3, 4.5, 4.8, 5.0, 5.2, 5.6, 5.9, 6.3, 6.7, 7.1, 7.7, 8.3, 9.1, 10.0, 11.1, 12.5, 14.3, 16.7, 20.0, 25.0 };
// find the highest line that is lower than or equal to the number we have
lower = intervals.FindLastIndex(a => ((double)avgSentenceCount) >= a);
// if we are not over the top or on the line grab the next higher line value
if(lower > -1 && lower < intervals.Count-1 && ((float) intervals[lower] != avgSentenceCount))
upper = lower + 1;
// set the integer portion of the response
result = (float)lower;
// if we have an upper limit calculate the percentage above the lower line (to two decimal places) and add it to the result
if(upper != -1)
result += (float)Math.Round((((avgSentenceCount - intervals[lower])/(intervals[upper] - intervals[lower]))),2);
return result;
}
private float NormalizeX(float avgSyllableCount)
{
// the x axis is MUCH simpler. Subtract 108 and divide by 2 to get the x position relative to a 0 origin.
float result = (avgSyllableCount - 108) / 2;
return result;
}
}
