Segmenting GPS path data - C#

My Problem
I have a data stream coming from a program that connects to a GPS device and an inclinometer (they are actually both stand alone devices, not a cellphone) and logs the data while the user drives around in a car. The essential data that I receive are:
Latitude/Longitude - from GPS, with a resolution of about +-5 feet,
Vehicle land-speed - from GPS, in knots, which I convert to MPH
Sequential record index - from the database, it's an auto-incrementing integer and nothing ever gets deleted,
some other stuff that isn't pertinent to my current problem.
This data gets stored in a database and read back from the database into an array. From start to finish, the recording order is properly maintained, so even though the timestamp recorded from the GPS device only has 1-second precision and we sample at 5 Hz, the absolute value of the time is of no interest and the insertion order suffices.
In order to aid in analyzing the data, a user performs a very basic data input task of selecting the "start" and "end" of curves on the road from the collected path data. I get a map image from Google and I draw the curve data on top of it. The user zooms into a curve of interest, based on their own knowledge of the area, and clicks two points on the map. Google is actually very nice and reports where the user clicked in Latitude/Longitude rather than me having to try to backtrack it from pixel values, so the issue of where the user clicked in relation to the data is covered.
The zooming in on the curve clips the data: I only retrieve data that falls in the Lat/Lng window defined by the zoom level. Most of the time, I'm dealing with fewer than 300 data points, when a single driving session could result in over 100k data points.
I need to find the subsegment of the curve data that falls between those two click points.
What I've Tried
Originally, I took the two points that are closest to each click point, and the curve was anything that fell between them. That worked until we started letting the drivers make multiple passes over the road. Typically, a driver will make 2 back-and-forth runs over an interesting piece of road, giving us 4 total passes. If you take the two closest points to the two click points, you might end up with the first point corresponding to a datum on one pass and the second point corresponding to a datum on a completely different pass. The points in the sequence between these two points would then extend far beyond the curve. And even if you got lucky and both points found were on the same pass, that would only give you one of the passes, and we need to collect all passes.
For a while, I had a solution that worked much better. I calculated two new sequences representing the distance from each data point to each of the click points, then the approximate second derivative of that distance, looking for the inflection points of the distance from the click point over the data points. I reasoned that the inflection point meant that the points previous to the inflection were getting closer to the click point and the points after the inflection were getting further away from the click point. Doing this iteratively over the data points, I could group the curves as I came to them.
Perhaps some code is in order (this is C#, but don't worry about replying in kind, I'm capable of reading most languages):
static List<List<LatLngPoint>> GroupCurveSegments(List<LatLngPoint> dataPoints, LatLngPoint start, LatLngPoint end)
{
    var withDistances = dataPoints.Select(p => new
    {
        ToStart = p.Distance(start),
        ToEnd = p.Distance(end),
        DataPoint = p
    }).ToArray();

    var set = new List<List<LatLngPoint>>();
    var currentSegment = new List<LatLngPoint>();

    for (int i = 0; i < withDistances.Length - 2; ++i)
    {
        var a = withDistances[i];
        var b = withDistances[i + 1];
        var c = withDistances[i + 2];

        // the edge of the map can clip the data, so the continuity of
        // the data is not exactly mapped to the continuity of the array.
        var ab = b.DataPoint.RecordID - a.DataPoint.RecordID;
        var bc = c.DataPoint.RecordID - b.DataPoint.RecordID;

        var inflectStart = Math.Sign(a.ToStart - b.ToStart) * Math.Sign(b.ToStart - c.ToStart);
        var inflectEnd = Math.Sign(a.ToEnd - b.ToEnd) * Math.Sign(b.ToEnd - c.ToEnd);

        // if we haven't started a segment yet and we aren't obviously between segments
        if ((currentSegment.Count == 0 && (inflectStart == -1 || inflectEnd == -1)
            // if we have started a segment but we haven't changed directions away from it
            || currentSegment.Count > 0 && (inflectStart == 1 && inflectEnd == 1))
            // and we're continuous on the data collection path
            && ab == 1
            && bc == 1)
        {
            // extend the segment
            currentSegment.Add(b.DataPoint);
        }
        else if (
            // if we have a segment collected
            currentSegment.Count > 0
            // and we changed directions away from one of the points
            && (inflectStart == -1
                || inflectEnd == -1
                // or we lost data continuity
                || ab > 1
                || bc > 1))
        {
            // clip the segment and start a new one
            set.Add(currentSegment);
            currentSegment = new List<LatLngPoint>();
        }
    }
    return set;
}
This worked great until we started advising the drivers to drive at around 15 MPH through turns (supposedly, it helps reduce sensor error; I'm personally not entirely convinced that what we're seeing at higher speed is error, but I'm probably not going to win that argument). A car traveling at 15 MPH covers about 22 feet per second. Sampling at 5 Hz means each data point is about four and a half feet apart. However, our GPS unit's precision is only about 5 feet, so the jitter of the GPS data alone can cause an inflection point in the data at such low speeds and high sample rates (technically, at this sample rate you'd have to go at least 35 MPH to avoid this problem, though it seems to work okay at 25 MPH in practice).
Also, we're probably bumping the sampling rate up to 10-15 Hz pretty soon. You'd need to drive at about 45 MPH to avoid my inflection problem, which isn't safe on most of the curves of interest. My current procedure ends up splitting the data into dozens of subsegments over road sections that I know had only 4 passes. One section with only 300 data points came out to 35 subsegments. The rendering of the start and end markers for each pass (a small icon) showed quite clearly that each real pass was getting chopped up into several pieces.
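For reference, the spacing between samples is just speed over sample rate; a throwaway helper makes the numbers above easy to check (the helper itself is illustrative, not part of the logging code):
// feet per sample = (mph * 5280 ft/mi / 3600 s/hr) / samples per second
static double FeetPerSample(double speedMph, double sampleRateHz)
{
    return speedMph * 5280.0 / 3600.0 / sampleRateHz;
}
// FeetPerSample(15, 5)  ~= 4.4 ft  - below the ~5 ft GPS jitter
// FeetPerSample(35, 5)  ~= 10.3 ft - comfortably above it
// FeetPerSample(45, 10) ~= 6.6 ft  - roughly where 10 Hz sampling gets clear of the jitter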
Where I'm Thinking of Going
1. Find the minimum distance of all points to both the start and end click points.
2. Find all points that are within +10 feet of that minimum distance.
3. Group each set of points by data continuity, i.e. each group should be continuous in the database, because more than one point on a particular pass could fall within the distance radius.
4. Take the data mid-point of each of those groups for each click point as the representative start and end for each pass.
5. Pair up points in the two sets per click point by those that would minimize the record index distance between each "start" and "end".
Halp?!
But I had tried this once before and it didn't work very well. Step #2 can return an unreasonably large number of points if the user doesn't click particularly close to where they intend, and too few points if the user clicks very close to where they intend. I'm not sure just how computationally intensive step #3 will be. And step #5 will fail if the driver drives over a particularly long curve and immediately turns around just after the start and end to perform the subsequent passes. We might be able to train the drivers not to do this, but I don't like taking chances on such things. So I could use some help figuring out how to clip and group this path that doubles back over itself into subsegments for passes over the curve.

Okay, so here is what I ended up doing, and it seems to work well for now. I like that it is a little simpler to follow than before. I decided that Step #4 from my question was not necessary. The exact point used as the start and end isn't critical, so I just take the first point that is within the desired radius of the first click point and the last point within the desired radius of the second point and take everything in the middle.
protected static List<List<T>> GroupCurveSegments<T>(List<T> dbpoints, LatLngPoint start, LatLngPoint end) where T : BBIDataPoint
{
    var withDistances = dbpoints.Select(p => new
    {
        ToStart = p.Distance(start),
        ToEnd = p.Distance(end),
        DataPoint = p
    }).ToArray();

    var minToStart = withDistances.Min(p => p.ToStart) + 10;
    var minToEnd = withDistances.Min(p => p.ToEnd) + 10;

    bool startFound = false,
         endFound = false,
         oldStartFound = false,
         oldEndFound = false;

    var set = new List<List<T>>();
    var cur = new List<T>();

    foreach (var a in withDistances)
    {
        // save the previous values, because they
        // impact the future values.
        oldStartFound = startFound;
        oldEndFound = endFound;

        startFound =
            !oldStartFound && a.ToStart <= minToStart
            || oldStartFound && !oldEndFound
            || oldStartFound && oldEndFound
                && (a.ToStart <= minToStart || a.ToEnd <= minToEnd);

        endFound =
            !oldEndFound && a.ToEnd <= minToEnd
            || !oldStartFound && oldEndFound
            || oldStartFound && oldEndFound
                && (a.ToStart <= minToStart || a.ToEnd <= minToEnd);

        if (startFound || endFound)
        {
            cur.Add(a.DataPoint);
        }
        else if (cur.Count > 0)
        {
            set.Add(cur);
            cur = new List<T>();
        }
    }

    // if a data stream ended near the end of the curve,
    // then the loop will not have saved the pass yet.
    if (cur.Count > 0)
    {
        set.Add(cur);
    }
    return set;
}
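For context, a hypothetical call site (dbpoints, clickStart, clickEnd and DrawPass are illustrative names; dbpoints is the clipped list of points and the click points are the Lat/Lng values reported by the map click handler):
// Hypothetical usage sketch
var passes = GroupCurveSegments(dbpoints, clickStart, clickEnd);
foreach (var pass in passes)
{
    // each inner list is one pass over the curve, in record order
    DrawPass(pass);
}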

Related

Working with micro changes in floats/doubles

The last couple of days have been full of calculations and formulas, and I'm beginning to lose my mind (a little bit). So now I'm turning to you for some insight/help.
Here's the problem: I'm working with Bluetooth beacons which are placed all over an entire floor of a building to make an indoor GPS showcase. You can use your phone to connect to these beacons, which results in receiving your longitude and latitude location from them. These numbers are long float/double values, looking like this:
lat: 52.501288451787076
lng: 6.079107635606511
The actual changes happen at the 4th and 5th position after the decimal point. I'm converting these numbers to the Cartesian coordinate system using:
x = R * cos(lat) * cos(lon)
z = R * sin(lat)
Now the coordinates from this conversion are fairly solid; they are numbers I can work with. I use them in a 3D engine (Unity3D) to make a real-time map where you can see where someone is walking.
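(For what it's worth, a minimal sketch of that conversion, assuming the values arrive in degrees and so need converting to radians before Math.Cos/Math.Sin; R is an assumed sphere radius, and any consistent scale works for a local map:)
const double R = 6371000.0; // assumed Earth radius in metres
static double ToRadians(double degrees) => degrees * Math.PI / 180.0;
static double ToX(double lat, double lng) => R * Math.Cos(ToRadians(lat)) * Math.Cos(ToRadians(lng));
static double ToZ(double lat) => R * Math.Sin(ToRadians(lat));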
Now for the actual problem! These beacons are not entirely accurate. These numbers 'jump' up and down even when you lay your phone down, ranging (taking the same latitude as mentioned above) from 52.501280 to 52.501296. If we convert this and use it as coordinates in a 3D engine, the 'avatar' for a user jumps from one position to another (more small jumps than large jumps).
What is a good way to cope with these jumping numbers? I've tried to check for big jumps and ignore those, but the jumps are still too big. A broader check will result in almost no movement, even when a phone is moving. Or is there a better way to convert the lat and long variables for use in a 3d engine?
If there is someone who has had the same problem as me, some mathematical wonder who can give a good conversion/formula to start with or someone who knows what I'm possibly doing wrong then please, help a fellow programmer out.
Moving Average
You could use this (taken from here: https://stackoverflow.com/a/1305/5089204):
Attention: please read the comments on this class, as this implementation has some flaws... It's just for a quick test and show...
public class LimitedQueue<T> : Queue<T>
{
    private int limit = -1;

    public int Limit
    {
        get { return limit; }
        set { limit = value; }
    }

    public LimitedQueue(int limit)
        : base(limit)
    {
        this.Limit = limit;
    }

    public new void Enqueue(T item)
    {
        if (this.Count >= this.Limit)
        {
            this.Dequeue();
        }
        base.Enqueue(item);
    }
}
Just test it like this:
var queue = new LimitedQueue<float>(4);
queue.Enqueue(52.501280f);
var avg1 = queue.Average(); //52.50128
queue.Enqueue(52.501350f);
var avg2 = queue.Average(); //52.5013161
queue.Enqueue(52.501140f);
var avg3 = queue.Average(); //52.50126
queue.Enqueue(52.501022f);
var avg4 = queue.Average(); //52.5011978
queue.Enqueue(52.501635f);
var avg5 = queue.Average(); //52.50129
queue.Enqueue(52.501500f);
var avg6 = queue.Average(); //52.5013237
queue.Enqueue(52.501505f);
var avg7 = queue.Average(); //52.5014153
queue.Enqueue(52.501230f);
var avg8 = queue.Average(); //52.50147
The limited queue will not grow... You just define the count of elements you want to use (in this case I specified 4). The 5th element pushes the first out and so on...
The average will always be a smooth, sliding one :-)
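Applied to the beacon question above, you would keep one queue per axis and feed the smoothed values into the Cartesian conversion (just a sketch; rawLat and rawLng stand in for whatever the beacon callback delivers):
var latQueue = new LimitedQueue<double>(4);
var lngQueue = new LimitedQueue<double>(4);

// on every beacon update:
latQueue.Enqueue(rawLat);
lngQueue.Enqueue(rawLng);
var smoothedLat = latQueue.Average();
var smoothedLng = lngQueue.Average();
// convert smoothedLat/smoothedLng to x/z for the Unity avatar as before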

Video rate image construction from binary data performance

First things first:
I have a git repo over here that holds the code of my current efforts and an example data set
Background
The example data set holds a bunch of records in Int32 format. Each record is composed of several bit fields that basically hold info on events where an event is either:
The detection of a photon
The arrival of a synchronizing signal
Each Int32 record can be treated like the following C-style struct:
struct {
    unsigned TimeTag  : 16;
    unsigned Channel  : 12;
    unsigned Route    : 2;
    unsigned Valid    : 1;
    unsigned Reserved : 1;
} TTTRrecord;
Whether we are dealing with a photon record or a sync event, the time tag will always hold the time of the event relative to the start of the experiment (macro-time).
If a record is a photon, valid == 1.
If a record is a sync signal or something else, valid == 0.
If a record is a sync signal, sync type = channel & 7 will give a value indicating either the start of a frame or the end of a scan line in a frame.
The last relevant bit of info is that TimeTag is 16 bit and thus obviously limited. If the TimeTag counter rolls over, the rollover counter is incremented. This rollover (overflow) count can easily be obtained from the channel: overflow = Channel & 2048.
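(To make the layout concrete, here is a hedged sketch of decoding a single record using the masks implied by that struct; the field names simply follow the description above:)
static void DecodeRecord(int record, out int timeTag, out int channel,
                         out bool valid, out bool overflow, out int syncType)
{
    timeTag  = record & 0xFFFF;         // bits 0-15  : TimeTag
    channel  = (record >> 16) & 0xFFF;  // bits 16-27 : Channel
    valid    = ((record >> 30) & 1) == 1;
    overflow = (channel & 2048) != 0;   // Channel & 2048, as described above
    syncType = channel & 7;             // only meaningful for sync records (valid == 0)
}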
My Goal
These records come in from a high speed scanning microscope and I would like to use these records to reconstruct images from the recorded photon data, preferably at 60 FPS.
To do so, I obviously have all the info:
I can look over all available data, find all overflows, which allows me to reconstruct the sequential macro time for each record (photon or sync).
I also know when the frame started and when each line composing the frame ended (and thus also how many lines there are).
Therefore, to reconstruct a bitmap of size noOfLines * noOfLines I can process the bulk array of records line by line where each time I basically make a "histogram" of the photon events with edges at the time boundary of each pixel in the line.
Put another way, if I know Tstart and Tend of a line, and I know the number of pixels I want to spread my photons over, I can walk through all records of the line and check if the macro time of my photons falls within the time boundary of the current pixel. If so, I add one to the value of that pixel.
This approach works; the current code in the repo gives me the image I expect, but it is too slow (several tens of ms to calculate a frame).
What I tried already:
The magic happens in the function int[] Renderline (see repo).
public static int[] RenderlineV(int[] someRecords, int pixelduration, int pixelCount)
{
    // Will hold the pixels obviously
    int[] linePixels = new int[pixelCount];

    // Calculate everything (sync, overflow, ...) from the raw records
    int[] timeTag = someRecords.Select(x => Convert.ToInt32(x & 65535)).ToArray();
    int[] channel = someRecords.Select(x => Convert.ToInt32((x >> 16) & 4095)).ToArray();
    int[] valid = someRecords.Select(x => Convert.ToInt32((x >> 30) & 1)).ToArray();
    int[] overflow = channel.Select(x => (x & 2048) >> 11).ToArray();

    int[] absTime = new int[overflow.Length];
    absTime[0] = 0;
    Buffer.BlockCopy(overflow, 0, absTime, 4, (overflow.Length - 1) * 4);
    absTime = absTime.Cumsum(0, (prev, next) => prev * 65536 + next).Zip(timeTag, (o, tt) => o + tt).ToArray();

    long lineStartTime = absTime[0];
    int tempIdx = 0;

    for (int j = 0; j < linePixels.Length; j++)
    {
        int count = 0;
        for (int i = tempIdx; i < someRecords.Length; i++)
        {
            if (valid[i] == 1 && lineStartTime + (j + 1) * pixelduration >= absTime[i])
            {
                count++;
            }
        }
        // Avoid checking records in the raw data that were already binned to a pixel.
        linePixels[j] = count;
        tempIdx += count;
    }
    return linePixels;
}
Treating photon records in my data set as an array of structs and addressing members of my struct in an iteration was a bad idea. I could increase speed significantly (2X) by dumping all bitfields into an array and addressing these. This version of the render function is already in the repo.
I also realised I could improve the loop speed by making sure I refer to the .Length property of the array I am running through as this supposedly eliminates bounds checking.
The major speed loss is in the inner loop of this nested set of loops:
for (int j = 0; j < linePixels.Length; j++)
{
    int count = 0;
    lineStartTime += pixelduration;

    for (int i = tempIdx; i < absTime.Length; i++)
    {
        //if (lineStartTime + (j + 1) * pixelduration >= absTime[i] && valid[i] == 1)
        // Seems quicker to calculate the boundary before...
        //if (valid[i] == 1 && lineStartTime >= absTime[i] )
        // Quicker still...
        if (lineStartTime > absTime[i] && valid[i] == 1)
        {
            // Slow... looking into linePixels[] each iteration is a bad idea.
            //linePixels[j]++;
            count++;
        }
    }
    // Doing it here is faster.
    linePixels[j] = count;
    tempIdx += count;
}
Rendering 400 lines like this in a for loop takes roughly 150 ms in a VM (I do not have a dedicated Windows machine right now and I run a Mac myself, I know I know...).
I just installed Win10CTP on a 6 core machine and replacing the normal loops by Parallel.For() increases the speed by almost exactly 6X.
Oddly enough, the non-parallel for loop runs almost at the same speed in the VM or the physical 6 core machine...
Regardless, I cannot imagine that this function cannot be made quicker. I would first like to eke out every bit of efficiency from the line render before I start thinking about other things.
I would like to optimise the function that generates the line to the maximum.
Outlook
Until now, my programming has dealt with rather trivial things, so I lack some experience, but here are some things I think I might consider:
MATLAB is/seems very efficient with vectorized operations. Could I achieve similar things in C#, e.g. by using Microsoft.Bcl.Simd? Is my case suited for something like this? Would I see gains even in my VM, or should I definitely move to real HW?
Could I gain from pointer arithmetic/unsafe code to run through my arrays?
...
Any help would be greatly, greatly appreciated.
I apologize beforehand for the quality of the code in the repo, I am still in the quick and dirty testing stage... Nonetheless, criticism is welcomed if it is constructive :)
Update
As some mentioned, absTime is ordered already. Therefore, once a record is hit that is no longer in the current pixel or bin, there is no need to continue the inner loop.
5X speed gain by adding a break...
for (int i = tempIdx; i < absTime.Length; i++)
{
    //if (lineStartTime + (j + 1) * pixelduration >= absTime[i] && valid[i] == 1)
    // Seems quicker to calculate the boundary before...
    //if (valid[i] == 1 && lineStartTime >= absTime[i] )
    // Quicker still...
    if (lineStartTime > absTime[i] && valid[i] == 1)
    {
        // Slow... looking into linePixels[] each iteration is a bad idea.
        //linePixels[j]++;
        count++;
    }
    else
    {
        break;
    }
}
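Taking the same observation one step further (an untested sketch, not something from the repo): since absTime is ascending, the boundary of each pixel could also be located with Array.BinarySearch instead of a linear scan.
// Untested sketch: find the first record at or past the pixel boundary.
// absTime is an int[], so the long boundary is cast; records landing exactly
// on the boundary would need the same strict '>' treatment as the loop above.
int boundary = Array.BinarySearch(absTime, tempIdx, absTime.Length - tempIdx, (int)lineStartTime);
if (boundary < 0)
{
    boundary = ~boundary; // bitwise complement of the insertion point when not found
}
int count = 0;
for (int i = tempIdx; i < boundary; i++)
{
    if (valid[i] == 1)
    {
        count++; // still count only photon records within the pixel
    }
}
linePixels[j] = count;
tempIdx = boundary; // skip past everything already binned, valid or not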

How to best implement K-nearest neighbours in C# for large number of dimensions?

I'm implementing the K-nearest neighbours classification algorithm in C# for a training and testing set of about 20,000 samples each, and 25 dimensions.
There are only two classes, represented by '0' and '1' in my implementation. For now, I have the following simple implementation :
// testSamples and trainSamples consists of about 20k vectors each with 25 dimensions
// trainClasses contains 0 or 1 signifying the corresponding class for each sample in trainSamples
static int[] TestKnnCase(IList<double[]> trainSamples, IList<double[]> testSamples, IList<int> trainClasses, int K)
{
    Console.WriteLine("Performing KNN with K = " + K);

    var testResults = new int[testSamples.Count()];
    var testNumber = testSamples.Count();
    var trainNumber = trainSamples.Count();

    // Declaring these here so that I don't have to 'new' them over and over again in the main loop,
    // just to save some overhead
    var distances = new double[trainNumber][];
    for (var i = 0; i < trainNumber; i++)
    {
        distances[i] = new double[2]; // Will store both distance and index in here
    }

    // Performing KNN ...
    for (var tst = 0; tst < testNumber; tst++)
    {
        // For every test sample, calculate distance from every training sample
        Parallel.For(0, trainNumber, trn =>
        {
            var dist = GetDistance(testSamples[tst], trainSamples[trn]);
            // Storing distance as well as index
            distances[trn][0] = dist;
            distances[trn][1] = trn;
        });

        // Sort distances and take top K (?What happens in case of multiple points at the same distance?)
        var votingDistances = distances.AsParallel().OrderBy(t => t[0]).Take(K);

        // Do a 'majority vote' to classify test sample
        var yea = 0.0;
        var nay = 0.0;
        foreach (var voter in votingDistances)
        {
            if (trainClasses[(int)voter[1]] == 1)
                yea++;
            else
                nay++;
        }
        if (yea > nay)
            testResults[tst] = 1;
        else
            testResults[tst] = 0;
    }
    return testResults;
}
// Calculates and returns square of Euclidean distance between two vectors
static double GetDistance(IList<double> sample1, IList<double> sample2)
{
    var distance = 0.0;
    // assume sample1 and sample2 are valid i.e. same length
    for (var i = 0; i < sample1.Count; i++)
    {
        var temp = sample1[i] - sample2[i];
        distance += temp * temp;
    }
    return distance;
}
This takes quite a bit of time to execute. On my system it takes about 80 seconds to complete. How can I optimize this, while ensuring that it would also scale to a larger number of data samples? As you can see, I've tried using PLINQ and parallel for loops, which did help (without these, it was taking about 120 seconds). What else can I do?
I've read about KD-trees being efficient for KNN in general, but every source I read stated that they're not efficient for higher dimensions.
I also found this stackoverflow discussion about this, but it seems like this is 3 years old, and I was hoping that someone would know about better solutions to this problem by now.
I've looked at machine learning libraries in C#, but for various reasons I don't want to call R or C code from my C# program, and some other libraries I saw were no more efficient than the code I've written. Now I'm just trying to figure out how I could write the most optimized code for this myself.
Edited to add - I cannot reduce the number of dimensions using PCA or something. For this particular model, 25 dimensions are required.
Whenever you are attempting to improve the performance of code, the first step is to analyze the current performance to see exactly where it is spending its time. A good profiler is crucial for this. In my previous job I was able to use the dotTrace profiler to good effect; Visual Studio also has a built-in profiler. A good profiler will tell you exactly where your code is spending time, method-by-method or even line-by-line.
That being said, a few things come to mind in reading your implementation:
You are parallelizing some inner loops. Could you parallelize the outer loop instead? There is a small but nonzero cost associated to a delegate call (see here or here) which may be hitting you in the "Parallel.For" callback.
Similarly there is a small performance penalty for indexing through an array using its IList interface. You might consider declaring the array arguments to "GetDistance()" explicitly.
How large is K as compared to the size of the training array? You are completely sorting the "distances" array and taking the top K, but if K is much smaller than the array size it might make sense to use a partial sort / selection algorithm, for instance by keeping a bounded SortedSet and removing the largest element whenever the set size exceeds K; a sketch follows.
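For instance (illustrative only, reusing the names from the code above):
// Keep the K smallest (distance, index) pairs; evict the current largest
// whenever the set grows past K. SortedSet needs distinct elements, so the
// training index is included as a tie-breaker.
var nearest = new SortedSet<(double Dist, int Index)>();
for (int trn = 0; trn < trainNumber; trn++)
{
    var dist = GetDistance(testSamples[tst], trainSamples[trn]);
    nearest.Add((dist, trn));
    if (nearest.Count > K)
    {
        nearest.Remove(nearest.Max); // drop the farthest of the K + 1
    }
}
// nearest now holds the K closest training samples for this test sample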

Problems with coordinate measurement with C# LINQ

I'm currently developing the server side of an application.
In the app I have a lot of points of interest in a big area (over 1000 points)
and I want to find the points nearest to the user's device.
I've tried to use:
.GetDistanceTo(GeoCoordinate);
from the library:
System.Device.Location;
example query:
from point in db.Points
where (new GeoCoordinate(point.lat, point.lng)).GetDistanceTo(new GeoCoordinate(coordinates[0], coordinates[1])) < 1000
select point
but in the LINQ query it's not supported, and if I try to use it on a List<> or an array it takes too long...
How can I do it better and faster?
Thanks
I assume that you are performing this computation often, otherwise iterating over a thousand points should not take a large amount of time - certainly under a second.
Consider caching the points in memory as GeoCoordinates, since I am guessing the bulk of the time may be spent allocating memory and instantiating the objects, rather than computing the distance. From the existing list of GeoCoordinates, you could then do a computation against an existing Geocoordinate that is already instantiated.
Ex:
On the application load, store all points into memory, possibly on a background thread.
List<GeoCoordinate> points = (from point in db.Points select new GeoCoordinate(point.lat, point.lng)).ToList();
Then, take the point you are searching from and loop over points
var gcSearch = new GeoCoordinate(coordinates[0], coordinates[1]);
var searchDistance = 1000;
var results = from pSearch in points
where pSearch.GetDistanceTo(gcSearch) < searchDistance
select pSearch;
If that still isn't fast enough, consider caching the last searched point and returning a known list if the new search is within the same bounds.
// in class definition
static GeoCoordinate lastSearchedPoint = null;
static List<GeoCoordinate> lastSearchedResults = null;
const int searchFudgeDistance = 100;

//in search method
var gcSearch = new GeoCoordinate(coordinates[0], coordinates[1]);
if (lastSearchedPoint != null && gcSearch.GetDistanceTo(lastSearchedPoint) < searchFudgeDistance)
    return lastSearchedResults;

lastSearchedPoint = gcSearch;
var searchDistance = 1000;
var results = (from pSearch in points
               where pSearch.GetDistanceTo(gcSearch) < searchDistance
               select pSearch).ToList();

//store the results for future searches
lastSearchedResults = results;

calculate average function of several functions

I have several ordered lists of X/Y pairs and I want to calculate an ordered list of X/Y pairs representing the average of these lists.
All these Lists (including the "average list") will then be drawn onto a chart (see example picture below).
I have several problems:
The different lists don't have the same amount of values
The X and Y values can increase and decrease and increase (and so on) (see example picture below)
I need to implement this in C#, although I guess that's not really important for the algorithm itself.
Sorry, that I can't explain my problem in a more formal or mathematical way.
EDIT: I replaced the term "function" with "List of X/Y Pairs" which is less confusing.
I would use the method Justin proposes, with one adjustment. He suggests using a mapping table with fractional indices, though I would suggest integer indices. This might sound a little mathematical, but it's no shame to have to read the following twice (I'd have to, too). Suppose the point at index i in a list of pairs A has searched for the closest point in another list B, and that closest point is at index j. To find the closest point in B to A[i+1] you should only consider points in B with an index equal to or larger than j. It will probably be j + 1, but it could be j, or j + 2, j + 3, etc., though never below j. Even if the point closest to A[i+1] has an index smaller than j, you still shouldn't use that point to interpolate with, since that would result in an unexpected average and graph. I'll take a moment now to create some sample code for you. I hope you see that this optimization makes sense.
EDIT: While implementing this, I realised that j is not only bounded from below (by the method described above), but also bounded from above. When you try the distance from A[i+1] to B[j], B[j+1], B[j+2] etc., you can stop comparing when the distance from A[i+1] to B[j+...] stops decreasing. There's no point in searching further in B. The same reasoning applies as when j was bounded from below: even if some point elsewhere in B were closer, that's probably not the point you want to interpolate with. Doing so would result in an unexpected graph, probably less smooth than you'd expect. An extra bonus of this second bound is the improved performance. I've created the following code:
IEnumerable<Tuple<double, double>> Average(List<Tuple<double, double>> A, List<Tuple<double, double>> B)
{
    if (A == null || B == null || A.Any(p => p == null) || B.Any(p => p == null)) throw new ArgumentException();

    Func<double, double> square = d => d * d; // squares its argument
    Func<int, int, double> euclidianDistance = (a, b) =>
        Math.Sqrt(square(A[a].Item1 - B[b].Item1) + square(A[a].Item2 - B[b].Item2)); // computes the distance from A[first argument] to B[second argument]

    int previousIndexInB = 0;
    for (int i = 0; i < A.Count; i++)
    {
        double distance = euclidianDistance(i, previousIndexInB); // distance between A[i] and B[j - 1], initially
        for (int j = previousIndexInB + 1; j < B.Count; j++)
        {
            var distance2 = euclidianDistance(i, j); // distance between A[i] and B[j]
            if (distance2 < distance) // if it's closer than the previously checked point, keep searching. Otherwise stop the search and return an interpolated point.
            {
                distance = distance2;
                previousIndexInB = j;
            }
            else
            {
                break; // don't place the yield return statement here, because that could go wrong at the end of B.
            }
        }
        yield return LinearInterpolation(A[i], B[previousIndexInB]);
    }
}
Tuple<double, double> LinearInterpolation(Tuple<double, double> a, Tuple<double, double> b)
{
    return new Tuple<double, double>((a.Item1 + b.Item1) / 2, (a.Item2 + b.Item2) / 2);
}
For your information, the function Average returns the same number of interpolated points as the list A contains, which is probably fine, but you should think about this for your specific application. I've added some comments to clarify some details, and I've described all aspects of this code in the text above. I hope it's clear; otherwise, feel free to ask questions.
SECOND EDIT: I misread and thought you had only two lists of points. I have created a generalised version of the function above that accepts multiple lists. It still uses only the principles explained above.
IEnumerable<Tuple<double, double>> Average(List<List<Tuple<double, double>>> data)
{
    if (data == null || data.Count < 2 || data.Any(list => list == null || list.Any(p => p == null))) throw new ArgumentException();

    Func<double, double> square = d => d * d;
    Func<Tuple<double, double>, Tuple<double, double>, double> euclidianDistance =
        (a, b) => Math.Sqrt(square(a.Item1 - b.Item1) + square(a.Item2 - b.Item2));

    var firstList = data[0];
    int[] previousIndices = new int[data.Count]; // the indices of points which are closest to the previous point firstList[i - 1]
                                                 // (or zero if i == 0). This is kept track of per list, except the first list.
    for (int i = 0; i < firstList.Count; i++)
    {
        var closests = new Tuple<double, double>[data.Count]; // an array of points used for caching, of which the average will be yielded.
        closests[0] = firstList[i];

        for (int listIndex = 1; listIndex < data.Count; listIndex++)
        {
            var list = data[listIndex];
            double distance = euclidianDistance(firstList[i], list[previousIndices[listIndex]]);
            for (int j = previousIndices[listIndex] + 1; j < list.Count; j++)
            {
                var distance2 = euclidianDistance(firstList[i], list[j]);
                if (distance2 < distance) // if it's closer than the previously checked point, keep searching. Otherwise stop the search and return an interpolated point.
                {
                    distance = distance2;
                    previousIndices[listIndex] = j;
                }
                else
                {
                    break;
                }
            }
            closests[listIndex] = list[previousIndices[listIndex]];
        }
        yield return new Tuple<double, double>(closests.Select(p => p.Item1).Average(), closests.Select(p => p.Item2).Average());
    }
}
Actually, doing the specific case for 2 lists separately might have been a good thing: it is easily explained and offers a stepping stone to understanding the generalised version. Furthermore, the square root could be taken out, since it doesn't change the order of the distances when sorted, just the lengths.
THIRD EDIT: In the comments it became clear there might be a bug. I think there are none, aside from the mentioned small bug, which shouldn't make any difference except at the ends of the graphs. As proof that it actually works, this is the result (the dotted line is the average):
I'll use a metaphor of your functions being cars racing down a curvy racetrack, where you want to extract the center-line of the track given the cars' positions. Each car's position can be described as a function of time:
p1(t) = (x1(t), y1(t))
p2(t) = (x2(t), y2(t))
p3(t) = (x3(t), y3(t))
The crucial problem is that the cars are racing at different speeds, which means that p1(10) could be twice as far down the race track as p2(10). If you took a naive average of these two points, and there was a sharp curve in the track between the cars, the average may be far from the track.
If you could just transform your functions to no longer be a function of time, but a function of the distance along the track, then you would be able to do what you want.
One way you could do this would be to choose the slowest car (i.e., the one with the greatest number of samples). Then, for each sample of the slowest car's position, look at all of the other cars' paths, find the two closest points, and choose the point on the interpolated line which is closest to the slowest car's position. Then average these points together. Once you do this for all of the slow car's samples, you have an average path.
I'm assuming that all of the cars start and end in roughly the same places; if any of the cars just race a small portion of the track, you will need to add some more logic to detect that.
A possible improvement (for both performance and accuracy), is to keep track of the most recent sample you are using for each car and the speed of each car (the relative sampling rate). For your slowest car, it would be a simple map: 1 => 1, 2 => 2, 3 => 3, ... For the other cars, though, it could be more like: 1 => 0.3, 2 => 0.7, 3 => 1.6 (fractional values are due to interpolation). The speed would be the inverse of the change in sample number (e.g., the slow car would have speed 1, and the other car would have speed 1/(1.6-0.7)=1.11). You could then ensure that you don't accidentally backtrack on any of the cars. You could also improve the calculation speed because you don't have to search through the whole set of all points on each path; instead, you can assume that the next sample will be somewhere close to the current sample plus 1/speed.
As these are not y=f(x) functions, are they perhaps something like (x,y)=f(t)?
If so, you could interpolate along t, and calculate avg(x) and avg(y) for each t.
EDIT This of course assumes that t can be made available to your code - so that you have an ordered list of T/X/Y triples.
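A rough sketch of that idea, assuming each series is an ordered List<(double T, double X, double Y)>; series and LerpAt are illustrative names, and the first series' t values are used as the common sampling grid:
static (double X, double Y) LerpAt(List<(double T, double X, double Y)> s, double t)
{
    // assumes s is ordered by T; clamps t outside the series' range
    int i = s.FindIndex(p => p.T >= t);
    if (i < 0) return (s[s.Count - 1].X, s[s.Count - 1].Y); // t is past the end
    if (i == 0) return (s[0].X, s[0].Y);                    // t is at or before the start
    var a = s[i - 1];
    var b = s[i];
    double f = (t - a.T) / (b.T - a.T);
    return (a.X + f * (b.X - a.X), a.Y + f * (b.Y - a.Y));
}

// average all series at the t values of the first series
var averaged = series[0]
    .Select(p => p.T)
    .Select(t =>
    {
        var samples = series.Select(s => LerpAt(s, t)).ToList();
        return (T: t, X: samples.Average(q => q.X), Y: samples.Average(q => q.Y));
    })
    .ToList();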
There are several ways this can be done. One is to combine all of your data into one single set of points, and do a best-fit curve through the combined set.
You have e.g. 2 "functions" with
fc1 = { {1, 0.3} {2, 0.5} {3, 0.1} }
fc2 = { {1, 0.1} {2, 0.8} {3, 0.4} }
You want the arithmetic mean (slang: "average") of the two functions. To do this you just calculate the pointwise arithmetic mean:
fc3 = { {1, (0.3+0.1)/2} ... }
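In C#, assuming fc1 and fc2 are already aligned on the same x values and stored as List<Tuple<double, double>>, that pointwise mean could be written with Zip:
// keep x from the first list, average the y values pairwise
var fc3 = fc1.Zip(fc2, (a, b) => Tuple.Create(a.Item1, (a.Item2 + b.Item2) / 2)).ToList();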
Optimization:
If you have large numbers of points you should first convert your "ordered List of X/Y Pairs" into a Matrix OR at least store the points column-wise like so:
{0.3, 0.1}, {0.5, 0.8}, {0.1, 0.4}
