Using nonlinear square fit in C#


I'm trying to find a fit function that has the form:
f(x) = P / (1 + e^((x + m) / s))
where P is a known constant. I'm fitting this function to a list of measured doubles (between 20 and 100 elements), and each value has a corresponding x-value. I'm relatively new to C# and not very into the maths either, so I find it hard to read the available documentation.
I have tried using AlgLib, but don't know where to start or what function to use.
Edit: To clarify what I'm looking for: I'd like to find a C# method to which I can pass the function's form as well as some coordinates (x- and y-values), and have the method return the two unknown variables (s and m above).

I use AlgLib daily for exactly this purpose. If you go to the link http://www.alglib.net/docs.php and scroll all the way down, you'll find the documentation with code examples in a number of languages (including C#) that I think will help you immensely: http://www.alglib.net/translator/man/manual.csharp.html
For your problem, you should consider all the constraints you need, but a simple example of obtaining a nonlinear least-squares fit given your input function and data would look something like this:
public SomeReturnObject Optimize(SortedDictionary<double, double> dataToFitTo, double p, double initialGuessM, double initialGuessS)
{
var x = new double[dataToFitTo.Count,1];
for(int i=0; i < dataToFitTo.Count; i++)
{
x[i, 0] = dataToFitTo.Keys.ElementAt(i);
}
var y = dataToFitTo.Values.ToArray();
var c = new[] {initialGuessM, initialGuessS};
int info;
alglib.lsfitstate state;
alglib.lsfitreport rep;
alglib.lsfitcreatef(x, y, c, 0.0001, out state);
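// epsf was not defined in the original snippet; the value below is an assumed stopping tolerance.
// Some newer ALGLIB releases drop the epsf argument, so check the manual for your version.
double epsf = 1e-10;
alglib.lsfitsetcond(state, epsf, 0, 0);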
alglib.lsfitfit(state, MyFunc, null, p);
alglib.lsfitresults(state, out info, out c, out rep);
/* When you get here, the c[] array should have the optimized values
for m and s, so you'll want to handle accordingly depending on your
needs. I'm not sure if you want out parameters for m and s or an
object that has m and s as properties. */
}
private void MyFunc(double[] c, double[] x, ref double func, object obj)
{
var xPt = x[0];
var m = c[0];
var s = c[1];
var P = (double)obj;
func = P / (1 + Math.Exp((xPt + m) / s));
}
Mind you, this is just a quick and dirty example. Alglib has a lot of built-in functionality, so you'll need to adjust the code here to suit your needs: boundary constraints, weighting, step size, variable scaling, etc. It should be clear how to do all that from the examples and documentation in the second link.
Also note that Alglib is very particular about the method signature of MyFunc, so I would avoid moving around those inputs or adding any more.
Alternatively, you can write your own Levenberg-Marquardt algorithm if Alglib doesn't satisfy all your needs.
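The fit above needs initial guesses for m and s. One way to get them, sketched here under the assumption that every measured y lies strictly between 0 and P, is to linearise the model: P / y - 1 = e^((x + m) / s), so ln(P / y - 1) = x / s + m / s. A straight-line fit of z = ln(P / y - 1) against x then has slope 1 / s and intercept m / s. The class and method names below are hypothetical and not part of Alglib, and the result minimises the error in the transformed space, so treat it as a starting point rather than the final fit.

```csharp
using System;
using System.Linq;

public static class LogisticInitialGuess
{
    // Hypothetical helper (not part of ALGLIB): estimates m and s for
    // f(x) = P / (1 + exp((x + m) / s)) by fitting a straight line to
    // z = ln(P / y - 1) = x / s + m / s. Assumes 0 < y < P for every point.
    public static (double M, double S) Estimate(double[] x, double[] y, double p)
    {
        if (x.Length != y.Length || x.Length < 2)
            throw new ArgumentException("Need at least two (x, y) pairs of equal length.");

        // Transform the measurements so that the model becomes linear in x.
        double[] z = y.Select(v => Math.Log(p / v - 1.0)).ToArray();

        double meanX = x.Average();
        double meanZ = z.Average();

        // Ordinary least-squares line through (x, z).
        double sxx = 0.0, sxz = 0.0;
        for (int i = 0; i < x.Length; i++)
        {
            sxx += (x[i] - meanX) * (x[i] - meanX);
            sxz += (x[i] - meanX) * (z[i] - meanZ);
        }

        double slope = sxz / sxx;                  // equals 1 / s
        double intercept = meanZ - slope * meanX;  // equals m / s

        return (intercept / slope, 1.0 / slope);
    }
}
```

The returned pair could be passed as initialGuessM and initialGuessS to the Optimize method above, for example via var (m0, s0) = LogisticInitialGuess.Estimate(xs, ys, P);.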

Related

ILNumerics using subarrays and matfiles with structs

First question: Can ILNumerics read mat files with structs? I couldn't make it work.
I then split the file in MATLAB and would like to use it for calculations, but I have problems with the subarray function. I would like to do this:
using (ILMatFile matRead = new ILMatFile(@"C:\Temp\Dates.mat"))
{
ILArray<double> Dates = matRead.GetArray<double>("Dates");
double x = 736055-1;
double y = 736237+1;
ILArray<ILLogical> LogDates = (Dates > x && Dates < y);
}
using (ILMatFile matRead = new ILMatFile(@"C:\Temp\Power.mat"))
{
ILArray<double> power = matRead.GetArray<double>("Power");
ILArray<double> tpower = power[LogDates, full];
double avgpower = tpower.Average();
Console.WriteLine(avgpower.ToString());
Console.ReadKey();
}
This doesn't work for a number of reasons. The logical doesn't accept my syntax and I don't really get why. The subarray in the second block doesn't work either: it doesn't know the full keyword (even though the documentation says it should), and it doesn't like the logical. Obviously I want to average tpower over all columns and only those rows where the logical condition is one.
thanks.
nik

ILLogical is an array itself. You use it like that:
ILLogical LogDates = ILMath.and(Dates > x, Dates < y);
If you are still experiencing problems with the subarray, try:
ILArray<double> tpower = power[ILMath.find(LogDates), ILMath.full];
You can omit the ILMath. qualifier only if your class is derived from ILMath. Otherwise, string subarray definitions are sometimes shorter:
ILArray<double> tpower = power[ILMath.find(LogDates), ":"];
In order to take the average over selected rows, reducing to one:
double avgpower = tpower.Average(); // Linq version
double avgpower = (double)ILMath.sumall(tpower) / tpower.S.NumberOfElements; // prob. faster on large data

Extract x,y values from deldir object using RDotNet

Background
I am using RDotNet to run an R script that performs a voronoi tessellation using the deldir package. After
R:tiles = tile.list(voro) I wish to extract R:tiles[[i]][c("x","y")] for the each tile i into a C#:List<Tuple<double,double>>.
Issue 1
I can extract the R:tiles object into C#-world using var tiles = engine.Evaluate("tiles").AsVector().ToList(); but I am struggling to understand how to use RDotNet to extract the x, y values for each tile from this point:
I don't know how to iterate over this object to extract the x, y values that I desire.
Issue 2
Alternatively, I attempted to create a new simpler object in R, i.e. values and attempt to extract a string and parse values from that. So far I have only created this object for one of the points:
R: e.g.
values <- tiles[[1]][c("x","y")]
C#: e.g.
var xvalues = engine.Evaluate("values[\"x\"]").AsCharacter();
var yvalues = engine.Evaluate("values[\"y\"]").AsCharacter();
// Some boring code that parses the strings, casts to double and populates the Tuple
However I can only extract one string at a time and have to split the string to obtain the values I'm after. This does not seem like the way I should be doing things.
Question
How can I extract the x,y coordinates for every tile from R:tiles[[i]][c("x","y")] into a C#:List<Tuple<double,double>>?

I think you are after something like the following if I got what you seek correctly. The full code I tested is committed to a git repo I've just set up for SO queries. I've tested against the NuGet package for 1.5.5; note to later readers that subsequent versions of R.NET may let you use other idioms.
var res = new List<List<Tuple<double, double>>>();
// w is the result of tile.list as per the sample in ?tile.list
var n = engine.Evaluate("length(w)").AsInteger()[0];
for (int i = 1; i <= n; i++)
{
var x = engine.Evaluate("w[[" + i + "]]$x").AsNumeric().ToArray();
var y = engine.Evaluate("w[[" + i + "]]$y").AsNumeric().ToArray();
var t = x.Zip(y, (first, second) => Tuple.Create(first, second)).ToList();
res.Add(t);
}

Converting string to function

I want to build in my application the possibility of drawing mathematical functions. In the plotting library that I'm using (OxyPlot) there is a great support for that. See this example:
y = ax³ + bx² + cx + d = 0
is being plotted this way:
new FunctionSeries( x => a*x*x*x + b*x*x + c*x + d, /* other stuff, spacing, number of points, etc */ )
Trigonometrical functions are done the same way:
y = sin(3x) + 5cos(x)
is
new FunctionSeries(x => Math.Sin(3*x) + 5*Math.Cos(x) , ....);
I would be very happy if someone could guide me in the conversion between a string (written in a textbox for example) and a call of a method that has inside the syntax shown.
EDIT: the first parameter in the FunctionSeries(a, ....) a is Func<double, double>
EDIT2: Is there a way to say to the compiler, hey, believe me "x => 5*x*x" is a Func, take it literally
something like :
Func<double, double> f = (Func<double, double>)myString;

Here I have a partial solution:
var expresionData = new List<DataPoint>();
Regex pattern = new Regex("[x]");
for (int i = 0; i < 100; i++)
{
string a = pattern.Replace(ExpresionString, i.ToString());
NCalc.Expression exp = new NCalc.Expression(a);
expresionData.Add(new DataPoint(i,Double.Parse(exp.Evaluate().ToString())));
}
I'm doing a little trick here: I transform each 'x' in the typed string to i, then I evaluate the expression and add the point. It's pretty slow. I'm still very interested in the original question:
How to transform a string to Func<double, double> (or just make the compiler take it literally).
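The compiler cannot take a runtime string literally, but a string can be parsed once and compiled into a delegate, which avoids re-evaluating the text for every point. The following is a minimal sketch using System.Linq.Expressions and a small hand-written recursive-descent parser. FormulaCompiler is a hypothetical name, and the grammar is an assumption: it handles +, -, *, /, parentheses, numbers, the variable x, and one-argument System.Math functions such as sin or cos. It does not handle ^ or implicit multiplication such as 3x.

```csharp
using System;
using System.Globalization;
using System.Linq.Expressions;

// Hypothetical helper: compiles strings such as "3*x*x + sin(x)" into a delegate.
public sealed class FormulaCompiler
{
    private readonly string _text;
    private int _pos;
    private readonly ParameterExpression _x = Expression.Parameter(typeof(double), "x");

    private FormulaCompiler(string text) { _text = text.Replace(" ", ""); }

    public static Func<double, double> Compile(string text)
    {
        var c = new FormulaCompiler(text);
        Expression body = c.ParseSum();
        if (c._pos != c._text.Length)
            throw new FormatException("Unexpected character at position " + c._pos);
        return Expression.Lambda<Func<double, double>>(body, c._x).Compile();
    }

    // sum := product (('+' | '-') product)*
    private Expression ParseSum()
    {
        Expression left = ParseProduct();
        while (_pos < _text.Length && (_text[_pos] == '+' || _text[_pos] == '-'))
        {
            char op = _text[_pos++];
            Expression right = ParseProduct();
            left = op == '+' ? Expression.Add(left, right) : Expression.Subtract(left, right);
        }
        return left;
    }

    // product := factor (('*' | '/') factor)*
    private Expression ParseProduct()
    {
        Expression left = ParseFactor();
        while (_pos < _text.Length && (_text[_pos] == '*' || _text[_pos] == '/'))
        {
            char op = _text[_pos++];
            Expression right = ParseFactor();
            left = op == '*' ? Expression.Multiply(left, right) : Expression.Divide(left, right);
        }
        return left;
    }

    // factor := '-' factor | '(' sum ')' | number | 'x' | name '(' sum ')'
    private Expression ParseFactor()
    {
        if (_pos >= _text.Length) throw new FormatException("Unexpected end of input");
        char ch = _text[_pos];
        if (ch == '-') { _pos++; return Expression.Negate(ParseFactor()); }
        if (ch == '(')
        {
            _pos++;
            Expression inner = ParseSum();
            Expect(')');
            return inner;
        }
        if (char.IsDigit(ch) || ch == '.')
        {
            int start = _pos;
            while (_pos < _text.Length && (char.IsDigit(_text[_pos]) || _text[_pos] == '.')) _pos++;
            string number = _text.Substring(start, _pos - start);
            return Expression.Constant(double.Parse(number, CultureInfo.InvariantCulture));
        }
        if (char.IsLetter(ch))
        {
            int start = _pos;
            while (_pos < _text.Length && char.IsLetter(_text[_pos])) _pos++;
            string name = _text.Substring(start, _pos - start);
            if (name == "x") return _x;
            // Any other name is looked up as a one-argument System.Math method (sin -> Math.Sin).
            Expect('(');
            Expression arg = ParseSum();
            Expect(')');
            string methodName = char.ToUpperInvariant(name[0]) + name.Substring(1).ToLowerInvariant();
            var method = typeof(Math).GetMethod(methodName, new[] { typeof(double) });
            if (method == null) throw new FormatException("Unknown function: " + name);
            return Expression.Call(method, arg);
        }
        throw new FormatException("Unexpected character '" + ch + "'");
    }

    private void Expect(char c)
    {
        if (_pos >= _text.Length || _text[_pos] != c) throw new FormatException("Expected '" + c + "'");
        _pos++;
    }
}
```

The delegate returned by FormulaCompiler.Compile("sin(3*x) + 5*cos(x)") can then be used wherever a Func<double, double> is expected. Parsing happens once, so evaluating many points only costs a delegate call each.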

Looking for a way to optimize this algorithm for parsing a very large string

The following class parses through a very large string (an entire novel of text) and breaks it into consecutive 4-character strings that are stored as a Tuple. Then each tuple can be assigned a probability based on a calculation. I am using this as part of a monte carlo/ genetic algorithm to train the program to recognize a language based on syntax only (just the character transitions).
I am wondering if there is a faster way of doing this. It takes about 400ms to look up the probability of any given 4-character tuple. The relevant method _Probability() is at the end of the class.
This is a computationally intensive problem related to another post of mine: Algorithm for computing the plausibility of a function / Monte Carlo Method
Ultimately I'd like to store these values in a 4d-matrix. But given that there are 26 letters in the alphabet that would be a HUGE task. (26x26x26x26). If I take only the first 15000 characters of the novel then performance improves a ton, but my data isn't as useful.
Here is the method that parses the text 'source':
private List<Tuple<char, char, char, char>> _Parse(string src)
{
var _map = new List<Tuple<char, char, char, char>>();
for (int i = 0; i < src.Length - 3; i++)
{
int j = i + 1;
int k = i + 2;
int l = i + 3;
_map.Add
(new Tuple<char, char, char, char>(src[i], src[j], src[k], src[l]));
}
return _map;
}
And here is the _Probability method:
private double _Probability(char x0, char x1, char x2, char x3)
{
var subset_x0 = map.Where(x => x.Item1 == x0);
var subset_x0_x1_following = subset_x0.Where(x => x.Item2 == x1);
var subset_x0_x2_following = subset_x0_x1_following.Where(x => x.Item3 == x2);
var subset_x0_x3_following = subset_x0_x2_following.Where(x => x.Item4 == x3);
int count_of_x0 = subset_x0.Count();
int count_of_x1_following = subset_x0_x1_following.Count();
int count_of_x2_following = subset_x0_x2_following.Count();
int count_of_x3_following = subset_x0_x3_following.Count();
decimal p1;
decimal p2;
decimal p3;
if (count_of_x0 <= 0 || count_of_x1_following <= 0 || count_of_x2_following <= 0 || count_of_x3_following <= 0)
{
p1 = e;
p2 = e;
p3 = e;
}
else
{
p1 = (decimal)count_of_x1_following / (decimal)count_of_x0;
p2 = (decimal)count_of_x2_following / (decimal)count_of_x1_following;
p3 = (decimal)count_of_x3_following / (decimal)count_of_x2_following;
p1 = (p1 * 100) + e;
p2 = (p2 * 100) + e;
p3 = (p3 * 100) + e;
}
//more calculations omitted
return _final;
}
}
EDIT - I'm providing more details to clear things up,
1) Strictly speaking I've only worked with English so far, but it's true that different alphabets will have to be considered. Currently I only want the program to recognize English, similar to what's described in this paper: http://www-stat.stanford.edu/~cgates/PERSI/papers/MCMCRev.pdf
2) I am calculating the probabilities of n-tuples of characters where n <= 4. For instance if I am calculating the total probability of the string "that", I would break it down into these independent tuples and calculate the probability of each individually first:
[t][h]
[t][h][a]
[t][h][a][t]
[t][h] is given the most weight, then [t][h][a], then [t][h][a][t]. Since I am not just looking at the 4-character tuple as a single unit, I wouldn't be able to just divide the instances of [t][h][a][t] in the text by the total number of 4-tuples in the text.
The value assigned to each 4-tuple can't overfit to the text, because by chance many real English words may never appear in the text and they shouldn't get disproportionately low scores. Emphasizing first-order character transitions (2-tuples) ameliorates this issue. Moving to the 3-tuple and then the 4-tuple just refines the calculation.
I came up with a Dictionary that simply tallies the count of how often the tuple occurs in the text (similar to what Vilx suggested), rather than repeating identical tuples which is a waste of memory. That got me from about ~400ms per lookup to about ~40ms per, which is a pretty great improvement. I still have to look into some of the other suggestions, however.
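A minimal sketch of such a tally, assuming string keys for simplicity, could look like the following. NGramCounter and its members are hypothetical names. It counts every sequence of 1 to 4 characters in one pass, so each transition probability becomes a ratio of two dictionary lookups. Note that sequences at the very end of the text are counted slightly differently from the 4-tuple windows in the code above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical tally of all character sequences of length 1..maxLength.
public sealed class NGramCounter
{
    private readonly Dictionary<string, int> _counts = new Dictionary<string, int>();

    public NGramCounter(string src, int maxLength = 4)
    {
        // One pass over the text; each position contributes up to maxLength keys.
        for (int i = 0; i < src.Length; i++)
        {
            for (int len = 1; len <= maxLength && i + len <= src.Length; len++)
            {
                string key = src.Substring(i, len);
                _counts.TryGetValue(key, out int current);
                _counts[key] = current + 1;
            }
        }
    }

    public int Count(string gram)
    {
        return _counts.TryGetValue(gram, out int n) ? n : 0;
    }

    // Estimate of P(last character | preceding characters) = count(gram) / count(prefix).
    // Expects a gram of at least two characters.
    public double Conditional(string gram)
    {
        int prefix = Count(gram.Substring(0, gram.Length - 1));
        return prefix == 0 ? 0.0 : (double)Count(gram) / prefix;
    }
}
```

With this structure, the three ratios p1, p2 and p3 from _Probability would correspond to Conditional on the first two, three and four characters respectively.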

In your probability method you are iterating the map 8 times. Each of your Where calls iterates the entire list, and so does each Count. Adding a .ToList() at the end of each Where would (potentially) speed things up. That said, I think your main problem is that the structure you've chosen to store the data in is not suited to the purpose of the probability method. You could create a one-pass version where the structure you store your data in calculates the tentative distribution on insert. That way, when you're done with the insert (which shouldn't be slowed down too much), you're done; or you could do as the code below does and have a cheap calculation of the probability when you need it.
As an aside, you might want to take punctuation and whitespace into account. The first letter or word of a sentence and the first letter of a word give a clear indication of what language a given text is written in; by taking punctuation characters and whitespace as part of your distribution you include those characteristics of the sample data. We did that some years back, and showed that using just three characters was almost as exact as using more (we tested up to 7), but the speed of three letters made that the best case. (We had no failures with three on our test data; "almost as exact" is an assumption, given that there must be some weird text where the lack of information would yield an incorrect result.)
EDIT
Here's an example of how I think I would do it in C#
class TextParser{
private Node Parse(string src){
var top = new Node(null);
for (int i = 0; i < src.Length - 3; i++){
var first = src[i];
var second = src[i+1];
var third = src[i+2];
var fourth = src[i+3];
var firstLevelNode = top.AddChild(first);
var secondLevelNode = firstLevelNode.AddChild(second);
var thirdLevelNode = secondLevelNode.AddChild(third);
thirdLevelNode.AddChild(fourth);
}
return top;
}
}
public class Node{
private readonly Node _parent;
private readonly Dictionary<char,Node> _children
= new Dictionary<char, Node>();
private int _count;
public Node(Node parent){
_parent = parent;
}
public Node AddChild(char value){
if (!_children.ContainsKey(value))
{
_children.Add(value, new Node(this));
}
var levelNode = _children[value];
levelNode._count++;
return levelNode;
}
public decimal Probability(string substring){
var node = this;
foreach (var c in substring){
if(!node.Contains(c))
return 0m;
node = node[c];
}
return ((decimal) node._count)/node._parent._children.Count;
}
public Node this[char value]{
get { return _children[value]; }
}
private bool Contains(char c){
return _children.ContainsKey(c);
}
}
the usage would then be:
var top = Parse(src);
top.Probability("test");

I would suggest changing the data structure to make that faster...
I think a Dictionary<char,Dictionary<char,Dictionary<char,Dictionary<char,double>>>> would be much more efficient since you would be accessing each "level" (Item1...Item4) when calculating... and you would cache the result in the innermost Dictionary so next time you don't have to calculate at all..

Ok, I don't have time to work out details, but this really calls for:
neural classifier nets (just take any off the shelf; even the Controllable Regex Mutilator would do the job, with way more scalability): heuristics over brute force
tries (Patricia tries, a.k.a. radix trees), which you could use to make a space-optimized version of your data structure that can be sparse (the Dictionary of Dictionaries of Dictionaries of Dictionaries... looks like an approximation of this to me)

There's not much you can do with the parse function as it stands. However, the tuples appear to be four consecutive characters from a large body of text. Why not just replace the tuple with an int and then use the int to index the large body of text when you need the character values. Your tuple based method is effectively consuming four times the memory the original text would use, and since memory is usually the bottleneck to performance, it's best to use as little as possible.
You then try to find the number of matches in the body of text against a set of characters. I wonder how a straightforward linear search over the original body of text would compare with the linq statements you're using? The .Where will be doing memory allocation (which is a slow operation) and the linq statement will have parsing overhead (but the compiler might do something clever here). Having a good understanding of the search space will make it easier to find an optimal algorithm.
But then, as has been mentioned in the comments, using a 26^4 matrix would be the most efficient. Parse the input text once and create the matrix as you parse. You'd probably want a set of dictionaries:
SortedDictionary <int,int> count_of_single_letters; // key = single character
SortedDictionary <int,int> count_of_double_letters; // key = char1 + char2 * 32
SortedDictionary <int,int> count_of_triple_letters; // key = char1 + char2 * 32 + char3 * 32 * 32
SortedDictionary <int,int> count_of_quad_letters; // key = char1 + char2 * 32 + char3 * 32 * 32 + char4 * 32 * 32 * 32
Finally, a note on data types. You're using the decimal type. This is not an efficient type as there is no direct mapping to CPU native type and there is overhead in processing the data. Use a double instead, I think the precision will be sufficient. The most precise way will be to store the probability as two integers, the numerator and denominator and then do the division as late as possible.
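The key scheme in the dictionary comments above can be sketched as follows. This is a hedged illustration rather than a drop-in implementation: PackedNGrams is a hypothetical name, it assumes lowercase a-z only (windows containing any other character are skipped), and it uses Dictionary instead of SortedDictionary since ordering is not needed for lookups. Storing integer counts also fits the advice to keep numerator and denominator separate and divide as late as possible.

```csharp
using System;
using System.Collections.Generic;

public static class PackedNGrams
{
    // Packs `length` lowercase letters starting at `start` into one int, 5 bits per letter:
    // key = char1 + char2 * 32 + char3 * 32 * 32 + char4 * 32 * 32 * 32.
    // Returns -1 if any character is outside 'a'..'z'.
    public static int Pack(string text, int start, int length)
    {
        int key = 0;
        for (int i = 0; i < length; i++)
        {
            int code = text[start + i] - 'a';
            if (code < 0 || code > 25) return -1;
            key += code << (5 * i);
        }
        return key;
    }

    // counts[n - 1] holds the tallies for n-letter sequences, n = 1..4.
    public static Dictionary<int, int>[] Count(string text)
    {
        var counts = new Dictionary<int, int>[4];
        for (int n = 0; n < 4; n++) counts[n] = new Dictionary<int, int>();

        for (int i = 0; i < text.Length; i++)
        {
            for (int n = 1; n <= 4 && i + n <= text.Length; n++)
            {
                int key = Pack(text, i, n);
                if (key < 0) break; // longer windows from i contain the same invalid character
                counts[n - 1].TryGetValue(key, out int current);
                counts[n - 1][key] = current + 1;
            }
        }
        return counts;
    }
}
```

A probability is then a ratio of two looked-up counts, for example counts[1][Pack("th", 0, 2)] divided by counts[0][Pack("t", 0, 1)], computed as a double at the point of use.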

The best approach here is to use sparse storage and to prune after, for example, every 10000 characters. The best storage structure in this case is a prefix tree; it allows fast calculation of probability, updating, and sparse storage. You can find more theory in this javadoc http://alias-i.com/lingpipe/docs/api/com/aliasi/lm/NGramProcessLM.html

How to prevent creating intermediate objects in cascading operators?

I use a custom Matrix class in my application, and I frequently add multiple matrices:
Matrix result = a + b + c + d; // a, b, c and d are also Matrices
However, this creates an intermediate matrix for each addition operation. Since this is simple addition, it is possible to avoid the intermediate objects and create the result by adding the elements of all 4 matrices at once. How can I accomplish this?
NOTE: I know I can define multiple functions like Add3Matrices(a, b, c), Add4Matrices(a, b, c, d), etc. but I want to keep the elegance of result = a + b + c + d.

You could limit yourself to a single small intermediate by using lazy evaluation. Something like
public class LazyMatrix
{
public static implicit operator Matrix(LazyMatrix l)
{
Matrix m = new Matrix();
foreach (Matrix x in l.Pending)
{
for (int i = 0; i < 2; ++i)
for (int j = 0; j < 2; ++j)
m.Contents[i, j] += x.Contents[i, j];
}
return m;
}
public List<Matrix> Pending = new List<Matrix>();
}
public class Matrix
{
public int[,] Contents = { { 0, 0 }, { 0, 0 } };
public static LazyMatrix operator+(Matrix a, Matrix b)
{
LazyMatrix l = new LazyMatrix();
l.Pending.Add(a);
l.Pending.Add(b);
return l;
}
// a + b + c + d is evaluated left to right, so the pending sum is the LEFT operand.
public static LazyMatrix operator+(LazyMatrix a, Matrix b)
{
a.Pending.Add(b);
return a;
}
}
class Program
{
static void Main(string[] args)
{
Matrix a = new Matrix();
Matrix b = new Matrix();
Matrix c = new Matrix();
Matrix d = new Matrix();
a.Contents[0, 0] = 1;
b.Contents[1, 0] = 4;
c.Contents[0, 1] = 9;
d.Contents[1, 1] = 16;
Matrix m = a + b + c + d;
for (int i = 0; i < 2; ++i)
{
for (int j = 0; j < 2; ++j)
{
System.Console.Write(m.Contents[i, j]);
System.Console.Write(" ");
}
System.Console.WriteLine();
}
System.Console.ReadLine();
}
}

Something that would at least avoid the pain of
Matrix Add3Matrices(a,b,c) //and so on
would be
Matrix AddMatrices(Matrix[] matrices)

In C++ it is possible to use template metaprogramming to do exactly this. However, template programming is non-trivial. I don't know if a similar technique is available in C#, quite possibly not.
This technique, in C++, does exactly what you want. The disadvantage is that if something is not quite right then the compiler error messages tend to run to several pages and are almost impossible to decipher.
Without such techniques I suspect you are limited to functions such as Add3Matrices.
But for C# this link might be exactly what you need: Efficient Matrix Programming in C# although it seems to work slightly differently to C++ template expressions.

You can't avoid creating intermediate objects.
However, you can use expression templates to minimise them and do fancy lazy evaluation.
At the simplest level, the expression template could be an object that stores references to several matrices and calls an appropriate function like Add3Matrices() upon assignment. At the most advanced level, the expression templates will do things like calculate the minimum amount of information in a lazy fashion upon request.

This is not the cleanest solution, but if you know the evaluation order, you could do something like this:
result = MatrixAdditionCollector() << a + b + c + d
(or the same thing with different names). The MatrixCollector then implements + as +=, that is, starts with a 0-matrix of undefined size, takes a size once the first + is evaluated and adds everything together (or, copies the first matrix). This reduces the amount of intermediate objects to 1 (or even 0, if you implement assignment in a good way, because the MatrixCollector might be/contain the result immediately.)
I am not entirely sure if this is ugly as hell or one of the nicer hacks one might do. A certain advantage is that it is kind of obvious what's happening.

Might I suggest a MatrixAdder that behaves much like a StringBuilder. You add matrixes to the MatrixAdder and then call a ToMatrix() method that would do the additions for you in a lazy implementation. This would get you the result you want, could be expandable to any sort of LazyEvaluation, but also wouldn't introduce any clever implementations that could confuse other maintainers of the code.
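A minimal sketch of that idea follows. The Matrix class here is a hypothetical stand-in, since the question does not show its own, and MatrixAdder's members are assumed names. The adder only stores references until ToMatrix() is called, and then allocates exactly one result matrix.

```csharp
using System;
using System.Collections.Generic;

// Minimal stand-in for the question's Matrix class (hypothetical shape).
public sealed class Matrix
{
    public readonly double[,] Values;
    public Matrix(int rows, int cols) { Values = new double[rows, cols]; }
    public int Rows { get { return Values.GetLength(0); } }
    public int Cols { get { return Values.GetLength(1); } }
}

// Collects operands like a StringBuilder and sums them in one pass on demand.
public sealed class MatrixAdder
{
    private readonly List<Matrix> _pending = new List<Matrix>();

    public MatrixAdder Add(Matrix m)
    {
        if (_pending.Count > 0 && (m.Rows != _pending[0].Rows || m.Cols != _pending[0].Cols))
            throw new ArgumentException("All matrices must have the same dimensions.");
        _pending.Add(m); // stores a reference only; nothing is computed yet
        return this;     // allows chaining: adder.Add(a).Add(b)
    }

    public Matrix ToMatrix()
    {
        if (_pending.Count == 0) throw new InvalidOperationException("Nothing to add.");
        var result = new Matrix(_pending[0].Rows, _pending[0].Cols);
        for (int i = 0; i < result.Rows; i++)
        {
            for (int j = 0; j < result.Cols; j++)
            {
                double sum = 0.0;
                foreach (Matrix m in _pending) sum += m.Values[i, j];
                result.Values[i, j] = sum;
            }
        }
        return result;
    }
}
```

Usage would be along the lines of Matrix result = new MatrixAdder().Add(a).Add(b).Add(c).Add(d).ToMatrix();, which gives up the operator syntax but makes the single allocation explicit.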

I thought that you could just make the desired add-in-place behavior explicit:
Matrix result = a;
result += b;
result += c;
result += d;
But as pointed out by Doug in the Comments on this post, this code is treated by the compiler as if I had written:
Matrix result = a;
result = result + b;
result = result + c;
result = result + d;
so temporaries are still created.
I'd just delete this answer, but it seems others might have the same misconception, so consider this a counter example.

Bjarne Stroustrup has a short paper called Abstraction, libraries, and efficiency in C++ where he mentions techniques used to achieve what you're looking for. Specifically, he mentions the library Blitz++, a library for scientific calculations that also has efficient operations for matrices, along with some other interesting libraries. Also, I recommend reading a conversation with Bjarne Stroustrup on artima.com on that subject.

It is not possible, using operators.

My first solution would be something along this lines (to add in the Matrix class if possible) :
static Matrix AddMatrices(Matrix[] lMatrices) // or List<Matrix> lMatrices
{
// Check consistency of matrices: all must be n x p, where n is the row count
// and p the column count (how these are obtained depends on your Matrix class)
Matrix m = new Matrix(n, p);
for (int i = 0; i < n; i++)
for (int j = 0; j < p; j++)
foreach (Matrix mat in lMatrices)
m[i, j] += mat[i, j];
return m;
}
I'd add it to the Matrix class because you can rely on the private methods and properties, which could be useful for your function in case the implementation of the matrix changes (a linked list of non-empty nodes instead of a big double array, for example).
Of course, you would lose the elegance of result = a + b + c + d. But you would have something along the lines of result = Matrix.AddMatrices(new Matrix[] { a, b, c, d });.

There are several ways to implement lazy evaluation to achieve that. But it's important to remember that your compiler will not always be able to generate the best code for all of them.
I have made implementations that worked great in GCC and even outperformed the traditional, unreadable code with several for loops, because they led the compiler to observe that there were no aliases between the data segments (something hard to grasp with arrays coming from nowhere). But some of those failed completely with MSVC, and vice versa for other implementations. Unfortunately those are too long to post here (I don't think several thousand lines of code fit here).
A very complex library with a lot of embedded knowledge in the area is the Blitz++ library for scientific computation.
