ML.NET: Schema mismatch for feature column 'Features' - c#

I'm trying to learn ML.NET and get into machine learning, but I'm stuck on an issue.
My goal is to create a Trained Model that can be used to predict a city based on input.
This code:
var dataPath = "cities.csv";
var mlContext = new MLContext();
var loader = mlContext.Data.CreateTextLoader<CityData>(hasHeader: false, separatorChar: ',');
var data = loader.Load(dataPath);
string featuresColumnName = "Features";
var pipeline = mlContext.Transforms.Concatenate(featuresColumnName, "PostalCode", "CityName")
.Append(mlContext.Clustering.Trainers.KMeans(featuresColumnName, clustersCount: 3));
var model = pipeline.Fit(data);
It should take a CSV as input (a list of cities: column 0 = PostalCode, column 1 = CityName) and concatenate those columns into the features column, but it gives the following error:
Unhandled Exception: System.ArgumentOutOfRangeException: Schema mismatch for feature column 'Features': expected Vector<R4>, got Vector<Text>
The exception is thrown by the Fit call.
I've done a bit of digging in the GitHub repo, but I can't seem to find a solution. I'm working from the Iris clustering example (https://learn.microsoft.com/en-us/dotnet/machine-learning/tutorials/iris-clustering), with my own modifications.
Any ideas?

Use FeaturizeText to transform the string feature columns into float-vector ones before concatenating them:
var pipeline = mlContext.Transforms
.Text.FeaturizeText("PostalCodeF", "PostalCode")
.Append(mlContext.Transforms.Text.FeaturizeText("CityNameF", "CityName"))
.Append(mlContext.Transforms.Concatenate(featuresColumnName, "PostalCodeF", "CityNameF"))
.Append(mlContext.Clustering.Trainers.KMeans(featuresColumnName, clustersCount: 3));
var model = pipeline.Fit(data);
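For intuition: FeaturizeText turns each string into a numeric vector that KMeans can consume. A toy stand-in for that idea, a plain one-hot encoding over a vocabulary (not ML.NET's actual n-gram/hashing featurization), can be sketched with nothing but the BCL:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class OneHotSketch
{
    // Toy stand-in for FeaturizeText: map each distinct string to a one-hot float vector.
    public static float[] Encode(string value, IReadOnlyList<string> vocabulary)
    {
        var vec = new float[vocabulary.Count];
        for (int i = 0; i < vocabulary.Count; i++)
            if (vocabulary[i] == value) vec[i] = 1f;
        return vec;
    }

    static void Main()
    {
        var cities = new[] { "Amsterdam", "Berlin", "Amsterdam" };
        var vocab = cities.Distinct().ToList();
        foreach (var city in cities)
            Console.WriteLine($"{city} -> [{string.Join(", ", Encode(city, vocab))}]");
        // prints:
        // Amsterdam -> [1, 0]
        // Berlin -> [0, 1]
        // Amsterdam -> [1, 0]
    }
}
```

Once every column is numeric like this, Concatenate can produce the Vector&lt;R4&gt; that the trainer expects.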

Related

Is there a way to use VarVector to represent raw data in ML.NET K-means clustering

I would like to use ML.Net K-means clustering on some 'raw' vectors which I've generated in-memory by processing another dataset. I would like to be able to select the length of the vectors at runtime. All vectors within a given model will be the same length but that length may vary from model to model as I try out different clustering approaches.
I use the following code:
public class MyVector
{
[VectorType]
public float[] Values;
}
void Train()
{
var vectorSize = GetVectorSizeFromUser();
var vectors = /* ... process dataset to create an array of MyVector instances, each with vectorSize values ... */
var mlContext = new MLContext();
string featuresColumnName = "Features";
var pipeline = mlContext
.Transforms
.Concatenate(featuresColumnName, nameof(MyVector.Values))
.Append(mlContext.Clustering.Trainers.KMeans(featuresColumnName, numberOfClusters: 3));
var trainingData = mlContext.Data.LoadFromEnumerable(vectors);
Console.WriteLine("Training...");
var model = pipeline.Fit(trainingData);
}
The problem is that when I try to run the training, I get this exception:
Schema mismatch for feature column 'Features': expected Vector, got VarVector (Parameter 'inputSchema')
I can avoid this for any given value of vectorSize (say 20) by using [VectorType(20)], but the key thing here is I would like not to rely on a specific compile-time value. Is there a recipe to allow for dynamically sized data to be used for this kind of training?
I can imagine various nasty workarounds involving dynamically constructing dataviews with dummy columns but was hoping there would be a simpler approach.
Thanks to Jon for finding the link to this issue which contains the required information. The trick is to override the SchemaDefinition at run-time....
public class MyVector
{
// specifying the type here isn't required, since we override it in our custom schema definition
public float[] Values;
}
void Train()
{
var vectorSize = GetVectorSizeFromUser();
var vectors = /* ... process dataset to create an array of MyVector instances, each with vectorSize values ... */
var mlContext = new MLContext();
string featuresColumnName = "Features";
var pipeline = mlContext
.Transforms
.Concatenate(featuresColumnName, nameof(MyVector.Values))
.Append(mlContext.Clustering.Trainers.KMeans(featuresColumnName, numberOfClusters: 3));
//create a custom schema-definition that overrides the type for the Values field...
var schemaDef = SchemaDefinition.Create(typeof(MyVector));
schemaDef[nameof(MyVector.Values)].ColumnType
= new VectorDataViewType(NumberDataViewType.Single, vectorSize);
//use that schema definition when creating the training dataview
var trainingData = mlContext.Data.LoadFromEnumerable(vectors, schemaDef);
Console.WriteLine("Training...");
var model = pipeline.Fit(trainingData);
//Note that the schema-definition must also be supplied when creating the prediction engine...
var predictor = mlContext
.Model
.CreatePredictionEngine<MyVector, ClusterPrediction>(model,
inputSchemaDefinition: schemaDef);
//now we can use the engine to predict which cluster a vector belongs to...
var prediction = predictor.Predict(/* ...some MyVector... */);
}
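The ClusterPrediction output class used above isn't shown. A minimal sketch, following the shape of the output class in the Iris clustering tutorial (PredictedLabel and Score are the column names the KMeans trainer emits; this fragment assumes the Microsoft.ML package is referenced):

```csharp
using Microsoft.ML.Data;

public class ClusterPrediction
{
    // Id of the predicted cluster.
    [ColumnName("PredictedLabel")]
    public uint PredictedClusterId;

    // Distances from the input to each cluster centroid.
    [ColumnName("Score")]
    public float[] Distances;
}
```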

How to insert the column to predict (label)

I am building a multiclass classification program and I want to dynamically load training data from a CSV.
I have tried:
var loader = context.Data.CreateTextLoader(
new[]
{
new TextLoader.Column("sentiment", DataKind.String, 0),
new TextLoader.Column("content", DataKind.String, 1),
},
// First line of the file is a header, not a data row.
hasHeader: true);
var trainData = loader.Load(_filePath);
var experiment = context.Auto().CreateMulticlassClassificationExperiment(240);
//find best model
var result = experiment.Execute(trainData);
Console.WriteLine(Environment.NewLine);
Console.WriteLine("Best run:");
Console.WriteLine($"Trainer name - {result.BestRun.TrainerName}");
When I run the program I get this error:
System.ArgumentException: 'Provided label column 'Label' not found in training data.'
I know there is a way to create a class on runtime and pass it as a schema in LoadFromText but I haven't been able to make it work yet.
I think I see what you need. The Execute method has an overload that takes a ColumnInformation. Create an instance of it and set the property that specifies the label column name:
var labelColumnInfo = new ColumnInformation()
{
LabelColumnName = "sentiment"
};
Then, you can pass that into the Execute method.
var result = experiment.Execute(trainData, labelColumnInfo);

How execute/run code from string variable at runtime

I am trying to execute code that's in a string variable to get an item from a dictionary
I have tried using CSharpCodeProvider like this:
var text = "IconDescription";
text = "\"" + text + "\"";
var field = "Outcome[" + text + "].Value";
field = "\"" + field + "\"";
CSharpCodeProvider codeProvider = new CSharpCodeProvider();
ICodeCompiler icc = codeProvider.CreateCompiler();
var parameters = new CompilerParameters { GenerateExecutable = true };
CompilerResults results = icc.CompileAssemblyFromSource(parameters, field);
var dataTest = JsonConvert.DeserializeObject<DictionaryOutcome.Rootobject>(jsonText);
var t = new List<Outcome>();
var defaultOutcome = dataTest.Response.Outcome.KeyValueOfstringOutcomepQnxSKQu.Select(item => new Outcome
{
DataType = item.Value.DataType,
Value = item.Value.Value1,
Field = item.Key
}).ToList();
defaultOutcome.ToList().ForEach(item =>
{
t.Add(item);
});
The field variable's value is Outcome["IconDescription"].Value; I want to execute this code to read the value from the Outcome dictionary using the "IconDescription" key.
Is this possible?
I have tried the following:
var scriptOptions = ScriptOptions.Default
.WithReferences(typeof(Dictionary<string, Outcome>).FullName)
.WithImports("Outcome");
var scripts = CSharpScript.Create<object>(field, scriptOptions);
var resultsx = scripts.RunAsync(null, CancellationToken.None).Result;
And I am getting this error:
error CS0246: The type or namespace name 'Outcome' could not be found (are you missing a using directive or an assembly reference?)
I am struggling to even guess what you are trying to do, but as a starter, consider what you are actually doing by trying to compile that string you are constructing.
Outcome["SomeValue"].Value is not even close to being valid C# code:
it has no scope
it has no entry point
it imports no namespaces
it isn't terminated with a ;
the symbol Outcome is not defined
You're compiling this into a standalone executable, so it can have no knowledge of the results you deserialized from the JSON content; and you haven't shown where that JSON comes from.
You haven't explained why you need such an elaborate solution merely to extract some values from JSON; a straightforward approach is to use the built-in indexing features of Newtonsoft.Json:
dataTest[0] selects the first element in the array, when the JSON root is an array;
dataTest[0]["Outcome"] selects the Outcome property of that first object, which may itself be an object;
dataTest[0]["Outcome"]["Value"] selects the Value property of Outcome.
All of the string indexes here can be known only at runtime and held in variables. I don't see why you need to do any scripting at all to do this.
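As a concrete sketch of that runtime-key lookup (shown here with the BCL's System.Text.Json rather than Newtonsoft.Json, and with a made-up JSON shape standing in for the deserialized Outcome data):

```csharp
using System;
using System.Text.Json;

class JsonLookup
{
    static void Main()
    {
        // Hypothetical payload shaped like the Outcome dictionary in the question.
        string json = "{\"Outcome\":{\"IconDescription\":{\"Value\":\"star.png\"}}}";
        using JsonDocument doc = JsonDocument.Parse(json);

        // The key is an ordinary runtime string -- no code generation or scripting needed.
        string key = "IconDescription";
        string value = doc.RootElement
            .GetProperty("Outcome")
            .GetProperty(key)
            .GetProperty("Value")
            .GetString();

        Console.WriteLine(value); // prints "star.png"
    }
}
```

The same indexing works in Newtonsoft.Json via JObject.Parse(json)["Outcome"][key]["Value"].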

How to create and use a HMM Dynamic Bayesian Network in Bayes Server?

I'm trying to build a prediction module implementing a Hidden Markov Model-type DBN in Bayes Server 7 with C#. I managed to create the network structure, but I'm not sure if it's correct because their documentation and examples are not very comprehensive, and I also don't fully understand how prediction is meant to be done in code after training is complete.
Here is how my network creation and training code looks:
var Feature1 = new Variable("Feature1", VariableValueType.Continuous);
var Feature2 = new Variable("Feature2", VariableValueType.Continuous);
var Feature3 = new Variable("Feature3", VariableValueType.Continuous);
var nodeFeatures = new Node("Features", new Variable[] { Feature1, Feature2, Feature3 });
nodeFeatures.TemporalType = TemporalType.Temporal;
var nodeHypothesis = new Node(new Variable("Hypothesis", new string[] { "state1", "state2", "state3" }));
nodeHypothesis.TemporalType = TemporalType.Temporal;
// create network and add nodes
var network = new Network();
network.Nodes.Add(nodeHypothesis);
network.Nodes.Add(nodeFeatures);
// link the Hypothesis node to the Features node within each time slice
network.Links.Add(new Link(nodeHypothesis, nodeFeatures));
// add temporal links of orders 1 through 5, linking the Hypothesis node to itself across later time slices
for (int order = 1; order <= 5; order++)
{
network.Links.Add(new Link(nodeHypothesis, nodeHypothesis, order));
}
var temporalDataReaderCommand = new DataTableDataReaderCommand(evidenceDataTable);
var temporalReaderOptions = new TemporalReaderOptions("CaseId", "Index", TimeValueType.Value);
// here we map variables to database columns
// in this case the variables and database columns have the same name
var temporalVariableReferences = new VariableReference[]
{
new VariableReference(Feature1, ColumnValueType.Value, Feature1.Name),
new VariableReference(Feature2, ColumnValueType.Value, Feature2.Name),
new VariableReference(Feature3, ColumnValueType.Value, Feature3.Name)
};
var evidenceReaderCommand = new EvidenceReaderCommand(
temporalDataReaderCommand,
temporalVariableReferences,
temporalReaderOptions);
// We will use the RelevanceTree algorithm here, as it is optimized for parameter learning
var learning = new ParameterLearning(network, new RelevanceTreeInferenceFactory());
var learningOptions = new ParameterLearningOptions();
// Run the learning algorithm
var result = learning.Learn(evidenceReaderCommand, learningOptions);
And this is my attempt at prediction:
// we will now perform some queries on the network
var inference = new RelevanceTreeInference(network);
var queryOptions = new RelevanceTreeQueryOptions();
var queryOutput = new RelevanceTreeQueryOutput();
int time = 0;
// query a probability variable
var queryHypothesis = new Table(nodeHypothesis, time);
inference.QueryDistributions.Add(queryHypothesis);
double[] inputRow = GetInput();
// set some temporal evidence
inference.Evidence.Set(Feature1, inputRow[0], time);
inference.Evidence.Set(Feature2, inputRow[1], time);
inference.Evidence.Set(Feature3, inputRow[2], time);
inference.Query(queryOptions, queryOutput);
int hypothesizedClassId;
var probability = queryHypothesis.GetMaxValue(out hypothesizedClassId);
Console.WriteLine("hypothesizedClassId = {0}, score = {1}", hypothesizedClassId, probability);
Here I'm not even sure how to unroll the network properly to get the prediction, or what value to assign to the time variable. If someone can shed some light on how this toolkit works, I would greatly appreciate it. Thanks.
The code looks fine except for the network structure, which should look something like this for an HMM (the only change to your code is the links):
var Feature1 = new Variable("Feature1", VariableValueType.Continuous);
var Feature2 = new Variable("Feature2", VariableValueType.Continuous);
var Feature3 = new Variable("Feature3", VariableValueType.Continuous);
var nodeFeatures = new Node("Features", new Variable[] { Feature1, Feature2, Feature3 });
nodeFeatures.TemporalType = TemporalType.Temporal;
var nodeHypothesis = new Node(new Variable("Hypothesis", new string[] { "state1", "state2", "state3" }));
nodeHypothesis.TemporalType = TemporalType.Temporal;
// create network and add nodes
var network = new Network();
network.Nodes.Add(nodeHypothesis);
network.Nodes.Add(nodeFeatures);
// link the Hypothesis node to the Features node within each time slice
network.Links.Add(new Link(nodeHypothesis, nodeFeatures));
// An HMM also has an order 1 link on the latent node
network.Links.Add(new Link(nodeHypothesis, nodeHypothesis, 1));
It is also worth noting the following:
You can add multiple distributions to 'inference.QueryDistributions' and query them all at once
Setting evidence manually and then querying is perfectly valid, but see EvidenceReader, DataReader and either DatabaseDataReader or DataTableDataReader if you want to execute the query over multiple records.
Check out the TimeSeriesMode on ParameterLearningOptions
If you want the 'most probable explanation', set queryOptions.Propagation = PropagationMethod.Max; (an extension of the Viterbi algorithm for HMMs)
Check out the following link:
https://www.bayesserver.com/docs/modeling/time-series-model-types
A Hidden Markov model (as a Bayesian network) has a discrete latent variable and a number of child nodes. In Bayes Server you can combine multiple variables in a child node, much like a standard HMM. In Bayes Server you can also mix and match discrete/continuous nodes, handle missing data, and add additional structure (e.g. a mixture of HMMs, and many other exotic models).
Regarding prediction, once you have built the structure from the link above, there is a DBN prediction example at https://www.bayesserver.com/code/
(Note that you can predict an individual variable in the future (even if you have missing data), you can predict multiple variables (joint probability) in the future, you can predict how anomalous the time series is (log-likelihood) and for discrete (sequence) predictions you can predict the most probable sequence.)
If it is not clear, ping Bayes Server support and they will add an example for you.

Using Accord.Net's Codification Object to Codify second data set

I am trying to figure out how to use the Accord.Net Framework to make a bayesian prediction using the machine learning NaiveBayes class. I have followed the example code listed in the documentation and have been able to create the model from the example.
What I can't figure out is how to make a prediction based on that model.
The way the Accord.Net framework works is that it translates a table of strings into a numeric, symbolic representation of those strings using a class called Codification. Here is how I create the inputs and outputs DataTable to train the model (90% of this code is straight from the example):
var dt = new DataTable("Categorizer");
dt.Columns.Add("Word");
dt.Columns.Add("Category");
foreach (string category in categories)
{
rep.LoadTrainingDataForCategory(category,dt);
}
var codebook = new Codification(dt);
DataTable symbols = codebook.Apply(dt);
double[][] inputs = symbols.ToArray("Word");
int[] outputs = symbols.ToIntArray("Category").GetColumn(0);
IUnivariateDistribution[] priors = {new GeneralDiscreteDistribution(codebook["Word"].Symbols)};
int inputCount = 1;
int classCount = codebook["Category"].Symbols;
var target = new NaiveBayes<IUnivariateDistribution>(classCount, inputCount, priors);
target.Estimate(inputs, outputs);
And this all works successfully. Now, I have new input that I want to test against the trained data model I just built. So I try to do this:
var testDt = new DataTable("Test Data");
testDt.Columns.Add("Word");
foreach (string token in tokens)
{
testDt.Rows.Add(token);
}
DataTable testDataSymbols = codebook.Apply(testDt);
double[] testData = testDataSymbols.ToArray("Word").GetColumn(0);
double logLikelihood = 0;
double[] responses;
int cat = target.Compute(testData, out logLikelihood, out responses);
Notice that I am using the same codebook object as when I built the model. I want the new data to be codified with the same codebook as the original; otherwise the same word might be encoded with two completely different values (the word "bob" might map to 23 in the original model and 43 in the new one; that would never work).
However, I am getting a NullReferenceException error on this line:
DataTable testDataSymbols = codebook.Apply(testDt);
Here is the error:
System.NullReferenceException: Object reference not set to an instance of an object.
at Accord.Statistics.Filters.Codification.ProcessFilter(DataTable data)
at Accord.Statistics.Filters.BaseFilter`1.Apply(DataTable data)
at Agent.Business.BayesianClassifier.Categorize(String[] categories, String testText)
None of the objects I am passing in are null, so this must be something happening deeper in the code. But I am not sure what.
Thanks for any help. And if anyone knows of an example where a prediction is actually made from the bayesian example for Accord.Net, I would be much obliged if you shared it.
Sorry about the lack of documentation on the final part. In order to obtain the same integer codification for a new word, you could use the Translate method of the codebook:
// Compute the result for a sunny, cool, humid and windy day:
double[] input = codebook.Translate("Sunny", "Cool", "High", "Strong").ToDouble();
int answer = target.Compute(input);
string result = codebook.Translate("PlayTennis", answer); // result should be "no"
but it should also have been possible to call codebook.Apply to apply the same transformation to a new dataset. If you feel this is a bug, would you like to file a bug report in the issue tracker?
