Optimal algorithm for merging connected parallel 3D line segments - c#

I have a sketch with 3D line segments. Each segment has a start and an end 3D point. My task is to merge segments if they are parallel and connected. I've implemented it in C#. This algorithm is recursive. Can this code be optimized? Can it be made non-recursive?
/// <summary>
/// Merges segments if they compose a straight line.
/// </summary>
/// <param name="segments">List of segments.</param>
/// <returns>Merged list of segments.</returns>
internal static List<Segment3d> MergeSegments(List<Segment3d> segments)
{
    var result = new List<Segment3d>(segments);
    for (var i = 0; i < result.Count - 1; i++)
    {
        var firstLine = result[i];
        for (int j = i + 1; j < result.Count; j++)
        {
            var secondLine = result[j];
            var startToStartConnected = firstLine.P1.Equals(secondLine.P1);
            var startToEndConnected = firstLine.P1.Equals(secondLine.P2);
            var endToStartConnected = firstLine.P2.Equals(secondLine.P1);
            var endToEndConnected = firstLine.P2.Equals(secondLine.P2);
            if (firstLine.IsParallel(secondLine) && (startToStartConnected || startToEndConnected || endToStartConnected || endToEndConnected))
            {
                Segment3d mergedLine = null;
                if (startToStartConnected)
                {
                    mergedLine = new Segment3d(firstLine.P2, secondLine.P2);
                }
                if (startToEndConnected)
                {
                    mergedLine = new Segment3d(firstLine.P2, secondLine.P1);
                }
                if (endToStartConnected)
                {
                    mergedLine = new Segment3d(firstLine.P1, secondLine.P2);
                }
                if (endToEndConnected)
                {
                    mergedLine = new Segment3d(firstLine.P1, secondLine.P1);
                }
                // Remove duplicate.
                if (firstLine == secondLine)
                {
                    mergedLine = new Segment3d(firstLine.P1, firstLine.P2);
                }
                result.Remove(firstLine);
                result.Remove(secondLine);
                result.Add(mergedLine);
                result = MergeSegments(result);
                return result;
            }
        }
    }
    return result;
}
The classes Segment3D and Point3D are pretty simple:
Class Segment3D:
internal class Segment3d
{
    public Segment3d(Point3D p1, Point3D p2)
    {
        this.P1 = p1;
        this.P2 = p2;
    }

    public bool IsParallel(Segment3d segment)
    {
        // Check if segments are parallel.
        return true;
    }

    public Point3D P1 { get; }
    public Point3D P2 { get; }
}

internal class Point3D
{
    public double X { get; set; }
    public double Y { get; set; }
    public double Z { get; set; }

    public override bool Equals(object obj)
    {
        // Implement equality logic.
        return true;
    }
}

The optimizations
You are asking about a way to remove the recursion. However, that isn't the only, nor the largest, problem of your current solution, so I will try to give you an outline of possible directions for the optimization. Unfortunately, since your code still isn't a self-contained minimal reproducible example, it is rather tricky to debug. If I have time in the future, I might revisit.
First step: Limiting the number of comparisons.
Currently, you perform an unnecessary number of comparisons, since you compare every possible pair of line segments and every possible alignment they can have.
The first step to lower the number of comparisons is to separate the line segments by their direction. Currently, when you compare two segments, you check whether their directions align and, if they do, you proceed with the rest of the comparisons.
If we sort the segments by direction, we naturally group them into buckets of a sort. Sorting the segments might sound weird, since there are three axes to sort by. That is fine, though, because the only thing we actually care about is that if two normalized direction vectors (x, y, z) and (a, b, c) differ in at least one coordinate, the segments are not parallel.
This sorting can be done by implementing the IComparable interface and then calling the sort method as you normally would; this has already been described, for example, here. The sorting has complexity O(N log N), which is quite a lot better than your current O(N^2) approach. (Actually, your algorithm is even worse than that, since each recursion step performs another O(N^2) scan and there can be up to N merges, so you are closer to O(N^3) territory. There be dragons.)
Note: If you are feeling adventurous and O(N log N) is not enough, it may be possible to get into O(N) territory by implementing GetHashCode for your direction vector (for example via System.HashCode.Combine) and finding the groups of parallel lines using sets or dictionaries. However, that can be rather tricky, since floating-point numbers and comparison for equality aren't exactly a friendly combination. Caution is required; sorting should be enough for most use cases.
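For illustration, here is a minimal sketch of the direction sort, assuming the Segment3d and Point3D shapes shown above; it uses OrderBy with a canonical (sign-normalized) unit direction as the key instead of implementing IComparable directly, and it deliberately ignores floating-point tolerance and degenerate (zero-length) segments:
internal static List<Segment3d> SortByDirection(List<Segment3d> segments)
{
    // Parallel segments end up adjacent after this sort, so a single pass can
    // cut the result into groups.
    return segments.OrderBy(CanonicalDirection).ToList();
}

private static (double X, double Y, double Z) CanonicalDirection(Segment3d s)
{
    var dx = s.P2.X - s.P1.X;
    var dy = s.P2.Y - s.P1.Y;
    var dz = s.P2.Z - s.P1.Z;

    // Normalize (assumes the segment is not degenerate).
    var length = Math.Sqrt(dx * dx + dy * dy + dz * dz);
    dx /= length; dy /= length; dz /= length;

    // Flip the sign so that, e.g., (1, 0, 0) and (-1, 0, 0) compare equal.
    if (dx < 0 || (dx == 0 && (dy < 0 || (dy == 0 && dz < 0))))
    {
        dx = -dx; dy = -dy; dz = -dz;
    }
    return (dx, dy, dz);
}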
So now we have the line segments separated by their direction vectors. Even if you proceed with your current approach from here, the results should be much better, since we have lowered the size of the groups to be compared. However, we can go further.
Second step: Smart comparison of the segment ends.
Since you want to merge the segments when any of their endpoints coincide (you covered the start-start, start-end, end-start and end-end combinations in your question), we can most likely start merging at any two identical points we find. The easiest way to do that is, once more, either sorting or hashing. Since we won't be normalizing here (normalization is what introduces the marginal differences), hashing should be viable; however, sorting is still the more straightforward and "safer" way to go.
One of the many ways this could be done:
1. Put all of the endpoints into a single linked list as tuples (endpoint, parent segment). Arrays are not suitable, since deletion would be costly.
2. Sort the list by the endpoints. (Once again, you will need to implement the IComparable interface for your points.)
3. Go through the list. Whenever two neighboring points are equal, remove them and merge their siblings into a new line segment (or delete one of them and the closer sibling if the neighbors are both start points or both end points of their segments).
4. When you have passed through the whole list, all the segments are either merged or have no more neighbors.
5. Pick up your survivors and you are done.
The complexity of this step is once again O(N log N), as in the previous case. No recursion necessary, and the speedup should be nice.
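To make the whole idea concrete, here is one possible non-recursive sketch of the merge step for a single group of parallel segments. It uses the hashing variant (a dictionary keyed by endpoint, so Point3D would need a consistent Equals/GetHashCode pair, with the floating-point caveat above); the sort-based variant follows the same pattern. Like the original code, it assumes a point is shared by at most two segments of the group:
internal static List<Segment3d> MergeParallelGroup(IEnumerable<Segment3d> group)
{
    // Maps an open (unmatched) endpoint to the merged chain that currently ends there.
    var openEnds = new Dictionary<Point3D, Segment3d>();

    foreach (var segment in group)
    {
        var p1 = segment.P1;
        var p2 = segment.P2;

        // If either endpoint is already open, extend that chain instead of starting a new one.
        if (openEnds.TryGetValue(p1, out var left))
        {
            openEnds.Remove(left.P1);
            openEnds.Remove(left.P2);
            p1 = left.P1.Equals(p1) ? left.P2 : left.P1;
        }
        if (openEnds.TryGetValue(p2, out var right))
        {
            openEnds.Remove(right.P1);
            openEnds.Remove(right.P2);
            p2 = right.P1.Equals(p2) ? right.P2 : right.P1;
        }

        var merged = new Segment3d(p1, p2);
        openEnds[p1] = merged;
        openEnds[p2] = merged;
    }

    // Every surviving chain is stored under both of its endpoints; deduplicate by reference.
    return openEnds.Values.Distinct().ToList();
}
Each segment either starts a new chain or extends the chain(s) whose open end it touches, so there is no recursion and no repeated rescanning of the list.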

Related

How to speed up my pathfinding algorithm?

How can I speed up my code? And what am I doing wrong? I have a two-dimensional array of cells that stores data about what is in them. The map is only 100x100 with 10 colonists, and this already causes freezes, even though my game is still quite raw.
And when should I build a route for a colonist? On every step he takes? Because if an unexpectedly built wall appears in his way, he will have to change the route immediately.
using System.Collections;
using System.Collections.Generic;
using UnityEngine;

public class PathFinding : MonoBehaviour
{
    public MainWorld MainWorld;
    public MainWorld.Cell CurrentWorldCell;
    public MainWorld.Cell NeighborWorldCell;
    public List<MainWorld.Cell> UnvisitedWorldCells;
    public List<MainWorld.Cell> VisitedWorldCells;
    public Dictionary<MainWorld.Cell, MainWorld.Cell> PathTraversed;
    public List<MainWorld.Cell> PathToObject;

    public List<MainWorld.Cell> FindPath(MainWorld.Cell StartWorldCell, string ObjectID)
    {
        CurrentWorldCell = new MainWorld.Cell();
        NeighborWorldCell = new MainWorld.Cell();
        UnvisitedWorldCells = new List<MainWorld.Cell>();
        VisitedWorldCells = new List<MainWorld.Cell>();
        PathTraversed = new Dictionary<MainWorld.Cell, MainWorld.Cell>();
        UnvisitedWorldCells.Add(StartWorldCell);
        while (UnvisitedWorldCells.Count > 0)
        {
            CurrentWorldCell = UnvisitedWorldCells[0];
            NeighborWorldCell = MainWorld.Data[CurrentWorldCell.Position.x, CurrentWorldCell.Position.y + 1];
            CheckWorldCell(CurrentWorldCell, NeighborWorldCell, ObjectID);
            NeighborWorldCell = MainWorld.Data[CurrentWorldCell.Position.x + 1, CurrentWorldCell.Position.y];
            CheckWorldCell(CurrentWorldCell, NeighborWorldCell, ObjectID);
            NeighborWorldCell = MainWorld.Data[CurrentWorldCell.Position.x, CurrentWorldCell.Position.y - 1];
            CheckWorldCell(CurrentWorldCell, NeighborWorldCell, ObjectID);
            NeighborWorldCell = MainWorld.Data[CurrentWorldCell.Position.x - 1, CurrentWorldCell.Position.y];
            CheckWorldCell(CurrentWorldCell, NeighborWorldCell, ObjectID);
            UnvisitedWorldCells.Remove(CurrentWorldCell);
            if (CurrentWorldCell.ObjectID == ObjectID)
            {
                return CreatePath(StartWorldCell, CurrentWorldCell);
            }
        }
        return null;
    }

    public void CheckWorldCell(MainWorld.Cell CurrentWorldCell, MainWorld.Cell NeighborWorldCell, string ObjectID)
    {
        if (VisitedWorldCells.Contains(NeighborWorldCell) == false)
        {
            if (NeighborWorldCell.IsPassable == true ||
                NeighborWorldCell.ObjectID == ObjectID)
            {
                UnvisitedWorldCells.Add(NeighborWorldCell);
                VisitedWorldCells.Add(NeighborWorldCell);
                PathTraversed.Add(NeighborWorldCell, CurrentWorldCell);
            }
            else
            {
                VisitedWorldCells.Add(NeighborWorldCell);
            }
        }
    }

    public List<MainWorld.Cell> CreatePath(MainWorld.Cell StartWorldCell, MainWorld.Cell EndWorldCell)
    {
        PathToObject = new List<MainWorld.Cell>();
        PathToObject.Add(EndWorldCell);
        while (PathToObject[PathToObject.Count - 1] != StartWorldCell)
        {
            PathToObject.Add(PathTraversed[PathToObject[PathToObject.Count - 1]]);
        }
        PathToObject.Reverse();
        return PathToObject;
    }
}
On each iteration of your pathfinding loop, there is a check for each of the four neighboring cells. Within each check, you are searching the VisitedWorldCells List. If there are many cells in that list, it could be your first issue. Next, you either add an element to one or three lists. Manipulating lists is slow. To create a path, you are making a list and adding a bunch of elements. Doing all of this 10 times per frame is unreasonable.
Use preallocated 2d boolean arrays. You know how large the map is.
bool[,] array = new bool[mapSizeX, mapSizeY];
Using this, you can check a cell directly with something like
if (!VisitedWorldCells[14, 16])
instead of searching the entire VisitedWorldCells list (which could also involve a slow comparison, depending on what the MainWorld.Cell type is).
Preallocated arrays and directly checking/setting the values will speed this up by orders of magnitude.
As far as I can tell, your algorithm is a simple breadth-first search. The first step should be to replace the UnvisitedWorldCells list (often called the 'open set') with a Queue, and the VisitedWorldCells list (often called the closed set) with a HashSet, or just use a 2D array for constant-time lookups. You might also consider a more compact representation of your node, such as a simple pair of x/y coordinates.
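To make that concrete, here is a rough sketch of the same breadth-first search using a Queue, a preallocated bool[,] visited array and plain (x, y) pairs for the nodes; mapSizeX, mapSizeY and isWalkable are placeholders for whatever your MainWorld actually exposes:
public List<(int x, int y)> FindPathBfs((int x, int y) start, (int x, int y) target,
                                        int mapSizeX, int mapSizeY,
                                        Func<int, int, bool> isWalkable)
{
    var visited = new bool[mapSizeX, mapSizeY];
    var cameFrom = new Dictionary<(int, int), (int, int)>();
    var frontier = new Queue<(int x, int y)>();

    frontier.Enqueue(start);
    visited[start.x, start.y] = true;

    var offsets = new (int dx, int dy)[] { (0, 1), (1, 0), (0, -1), (-1, 0) };

    while (frontier.Count > 0)
    {
        var current = frontier.Dequeue();
        if (current == target)
        {
            // Walk the cameFrom chain backwards to rebuild the path.
            var path = new List<(int x, int y)> { current };
            while (current != start)
            {
                current = cameFrom[current];
                path.Add(current);
            }
            path.Reverse();
            return path;
        }

        foreach (var (dx, dy) in offsets)
        {
            var next = (x: current.x + dx, y: current.y + dy);
            if (next.x < 0 || next.y < 0 || next.x >= mapSizeX || next.y >= mapSizeY) continue;
            if (visited[next.x, next.y] || !isWalkable(next.x, next.y)) continue;

            visited[next.x, next.y] = true;
            cameFrom[next] = current;
            frontier.Enqueue(next);
        }
    }
    return null; // No reachable target.
}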
The next improvement would be Dijkstra's shortest path algorithm; there are plenty of examples of how it works if you search around a bit. It operates in a similar way, but uses a priority queue (usually a min-heap) for the unvisited cells, which allows different traversal 'costs' to be specified.
The step after that would probably be A*, an 'optimal' algorithm for pathfinding, though with only 100x100 nodes I would not expect a huge difference. It uses a heuristic to decide which nodes to visit first, so the priority queue is ordered not just by the cost to reach a node but also by the estimated remaining cost.
As always when it comes to performance you should measure and profile your code to find bottlenecks.
When to pathfind is a more complicated issue. A fairly simple and probably effective solution is to create a path once and just follow it; if a node on it becomes blocked, you can redo the pathfinding then. You could also redo pathfinding every n seconds in case a better path has appeared. If you have multiple units, you might introduce additional requirements, such as "pushing" blocking units out of the way or keeping groups of units compact as they navigate tight passages. You can probably spend years taking every possible feature into consideration.

How to best implement K-nearest neighbours in C# for large number of dimensions?

I'm implementing the K-nearest neighbours classification algorithm in C# for a training and testing set of about 20,000 samples each, and 25 dimensions.
There are only two classes, represented by '0' and '1' in my implementation. For now, I have the following simple implementation:
// testSamples and trainSamples consist of about 20k vectors each, with 25 dimensions.
// trainClasses contains 0 or 1, signifying the corresponding class for each sample in trainSamples.
static int[] TestKnnCase(IList<double[]> trainSamples, IList<double[]> testSamples, IList<int> trainClasses, int K)
{
    Console.WriteLine("Performing KNN with K = " + K);

    var testResults = new int[testSamples.Count()];
    var testNumber = testSamples.Count();
    var trainNumber = trainSamples.Count();
    // Declaring these here so that I don't have to 'new' them over and over again in the main loop,
    // just to save some overhead.
    var distances = new double[trainNumber][];
    for (var i = 0; i < trainNumber; i++)
    {
        distances[i] = new double[2]; // Will store both distance and index in here.
    }

    // Performing KNN ...
    for (var tst = 0; tst < testNumber; tst++)
    {
        // For every test sample, calculate distance from every training sample.
        Parallel.For(0, trainNumber, trn =>
        {
            var dist = GetDistance(testSamples[tst], trainSamples[trn]);
            // Storing distance as well as index.
            distances[trn][0] = dist;
            distances[trn][1] = trn;
        });

        // Sort distances and take top K (?What happens in case of multiple points at the same distance?)
        var votingDistances = distances.AsParallel().OrderBy(t => t[0]).Take(K);

        // Do a 'majority vote' to classify the test sample.
        var yea = 0.0;
        var nay = 0.0;
        foreach (var voter in votingDistances)
        {
            if (trainClasses[(int)voter[1]] == 1)
                yea++;
            else
                nay++;
        }
        if (yea > nay)
            testResults[tst] = 1;
        else
            testResults[tst] = 0;
    }

    return testResults;
}

// Calculates and returns the square of the Euclidean distance between two vectors.
static double GetDistance(IList<double> sample1, IList<double> sample2)
{
    var distance = 0.0;
    // Assume sample1 and sample2 are valid, i.e. the same length.
    for (var i = 0; i < sample1.Count; i++)
    {
        var temp = sample1[i] - sample2[i];
        distance += temp * temp;
    }
    return distance;
}
This takes quite a bit of time to execute; on my system it takes about 80 seconds to complete. How can I optimize this, while ensuring that it will also scale to a larger number of data samples? As you can see, I've tried using PLINQ and parallel for loops, which did help (without these, it was taking about 120 seconds). What else can I do?
I've read about KD-trees being efficient for KNN in general, but every source I read stated that they're not efficient for higher dimensions.
I also found this stackoverflow discussion about this, but it seems like this is 3 years old, and I was hoping that someone would know about better solutions to this problem by now.
I've looked at machine learning libraries in C#, but for various reasons I don't want to call R or C code from my C# program, and some other libraries I saw were no more efficient than the code I've written. Now I'm just trying to figure out how I could write the most optimized code for this myself.
Edited to add - I cannot reduce the number of dimensions using PCA or something. For this particular model, 25 dimensions are required.
Whenever you are attempting to improve the performance of code, the first step is to analyze the current performance to see exactly where it is spending its time. A good profiler is crucial for this. In my previous job I was able to use the dotTrace profiler to good effect; Visual Studio also has a built-in profiler. A good profiler will tell you exactly where your code is spending time, method by method or even line by line.
That being said, a few things come to mind in reading your implementation:
You are parallelizing some inner loops. Could you parallelize the outer loop instead? There is a small but nonzero cost associated with a delegate call (see here or here), which may be hitting you in the Parallel.For callback.
Similarly, there is a small performance penalty for indexing through an array via its IList interface. You might consider declaring the parameters of GetDistance() as arrays (double[]) explicitly.
How large is K compared to the size of the training array? You are completely sorting the distances array and taking the top K, but if K is much smaller than the array size it might make sense to use a partial sort / selection algorithm, for instance a SortedSet from which you remove the largest element whenever its size exceeds K.
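As a rough illustration of that last point (using the distances[i] = { distance, index } layout from the question, and a hypothetical TopKNearest helper), the selection could look something like this; the comparer breaks ties on the index so that equal distances don't collapse into a single SortedSet entry:
static IEnumerable<(double Distance, int Index)> TopKNearest(double[][] distances, int k)
{
    var best = new SortedSet<(double Distance, int Index)>(
        Comparer<(double Distance, int Index)>.Create((a, b) =>
        {
            var byDistance = a.Distance.CompareTo(b.Distance);
            return byDistance != 0 ? byDistance : a.Index.CompareTo(b.Index);
        }));

    for (var i = 0; i < distances.Length; i++)
    {
        var candidate = (Distance: distances[i][0], Index: (int)distances[i][1]);
        if (best.Count < k)
        {
            best.Add(candidate);
        }
        else if (candidate.Distance < best.Max.Distance)
        {
            best.Remove(best.Max); // Drop the current worst of the K kept so far.
            best.Add(candidate);
        }
    }
    return best;
}
This is roughly O(N log K) per test sample instead of O(N log N) for the full sort, which matters when K is small.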

Compare Two Ordered Lists in C#

The problem is that I have two lists of strings. One list is an approximation of the other list, and I need some way of measuring the accuracy of the approximation.
As a makeshift way of scoring the approximation, I have bucketed each list (the approximation and the answer) into 3 partitions (high, medium, low) after sorting by a numeric value that corresponds to each string. I then check every element of the approximation to see whether the string sits in the same partition of the correct list.
I sum the number of correctly classified strings and divide it by the total number of strings. I understand that this is a very crude way of measuring the accuracy of the estimate, and I was hoping that better alternatives were available. This is a very small component of a larger piece of work, and I was hoping not to have to reinvent the wheel.
EDIT:
I think I wasn't clear enough. I don't need the two lists to be exactly equal; I need some measure that shows the lists are similar. For example, the High-Medium-Low (H-M-L) approach we have taken shows that the estimated list is sufficiently similar. The downside of this approach is that if the estimated list has an item at the bottom of the "High" bracket while, in the actual list, that item sits at the top of the "Medium" bracket, the scoring algorithm gives no credit.
It could potentially be that, in addition to the H-M-L approach, the bottom 20% of each partition is compared to the top 20% of the next partition, or something along those lines.
Thanks all for your help!!
So, we're taking a sequence of items and grouping it into partitions with three categories, high, medium, and low. Let's first make an object to represent those three partitions:
public class Partitions<T>
{
    public IEnumerable<T> High { get; set; }
    public IEnumerable<T> Medium { get; set; }
    public IEnumerable<T> Low { get; set; }
}
Next, to score an estimate, we take two of these objects: one for the actual values and one for the estimate. For each priority level we want to see how many of the items are in both collections; this is an intersection. We then sum up the counts of the intersections of each level.
Then just divide that count by the total:
public static double EstimateAccuracy<T>(Partitions<T> actual,
                                         Partitions<T> estimate)
{
    int correctlyCategorized =
        actual.High.Intersect(estimate.High).Count() +
        actual.Medium.Intersect(estimate.Medium).Count() +
        actual.Low.Intersect(estimate.Low).Count();

    double total = actual.High.Count() +
                   actual.Medium.Count() +
                   actual.Low.Count();

    return correctlyCategorized / total;
}
Of course, if we generalize this to not 3 priorities, but rather a sequence of sequences, in which each sequence corresponds to some bucket (i.e. there are N buckets, not just 3) the code actually gets easier:
public static double EstimateAccuracy<T>(
    IEnumerable<IEnumerable<T>> actual,
    IEnumerable<IEnumerable<T>> estimate)
{
    var query = actual.Zip(estimate, (a, b) => new
    {
        valid = a.Intersect(b).Count(),
        total = a.Count()
    }).ToList();

    return query.Sum(pair => pair.valid) /
           (double)query.Sum(pair => pair.total);
}
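For example (with made-up data), scoring two bucketings of six strings with the generalized version:
var actualBuckets = new List<List<string>>
{
    new List<string> { "A", "B" },   // high
    new List<string> { "C", "D" },   // medium
    new List<string> { "E", "F" }    // low
};
var estimatedBuckets = new List<List<string>>
{
    new List<string> { "A", "C" },
    new List<string> { "B", "D" },
    new List<string> { "E", "F" }
};
var accuracy = EstimateAccuracy<string>(actualBuckets, estimatedBuckets); // 4 of 6 correct, so about 0.67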
Nice question. Well, I think you could use the following method to compare your lists:
public double DetermineAccuracyPercentage(int numberOfEqualElements, int yourListsLength)
{
    return ((double)numberOfEqualElements / (double)yourListsLength) * 100.0;
}
The number returned indicates how much the two lists have in common. If numberOfEqualElements equals yourLists.Length (Count), the lists are completely equal.
The accuracy of the approximation = numberOfEqualElements / yourLists.Length.
1 = completely equal, 0 = completely different, and values between 0 and 1 indicate the degree of equality; in my sample, expressed as a percentage.
If you compare these two lists, you get 75% equality, which is the same as 3 of 4 elements matching (3/4).
IList<string> list1 = new List<string>();
IList<string> list2 = new List<string>();

list1.Add("Dog");
list1.Add("Cat");
list1.Add("Fish");
list1.Add("Bird");

list2.Add("Dog");
list2.Add("Cat");
list2.Add("Fish");
list2.Add("Frog");

int resultOfComparing = list1.Intersect(list2).Count();
double accuracyPercentage = DetermineAccuracyPercentage(resultOfComparing, list1.Count);
I hope it helps.
I would take both List<String>s and combine each pair of elements into an IEnumerable<Boolean>:
public IEnumerable<Boolean> Combine<Ta, Tb>(List<Ta> seqA, List<Tb> seqB)
{
    if (seqA.Count != seqB.Count)
        throw new ArgumentException("Lists must be the same size...");

    for (int i = 0; i < seqA.Count; i++)
        yield return seqA[i].Equals(seqB[i]);
}
And then use Aggregate() to verify which strings match and keep a running total:
var result = Combine(a, b).Aggregate(0, (acc, t) => t ? acc + 1 : acc) / (double)a.Count;

How can I get a random x number of decimals from a list of unique decimals that total up to y?

Say I have a sorted list of 1000 or so unique decimals, arranged by value.
List<decimal> decList
How can I get a random x number of decimals from a list of unique decimals that total up to y?
private List<decimal> getWinningValues(int xNumberToGet, decimal yTotalValue)
{
}
Is there any way to avoid a long processing time on this? My idea so far is to take xNumberToGet random numbers from the pool. Something like this (a cool way to get a random selection from a list):
foreach (decimal d in decList.OrderBy(x => randomInstance.Next()).Take(xNumberToGet))
{
}
Then I might check the total of those and, if the total is less, slowly shift the numbers up (to the next available number); if the total is more, shift them down. I'm still not sure how to implement this, or whether a better design is readily available. Any help would be much appreciated.
Ok, start with a little extension I got from this answer,
public static IEnumerable<IEnumerable<T>> Combinations<T>(
    this IEnumerable<T> source,
    int k)
{
    if (k == 0)
    {
        return new[] { Enumerable.Empty<T>() };
    }

    return source.SelectMany((e, i) =>
        source.Skip(i + 1).Combinations(k - 1)
            .Select(c => (new[] { e }).Concat(c)));
}
this gives you a pretty efficient way to yield all the combinations of k members, without repetition, from a given IEnumerable. You could make good use of this in your implementation.
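For example (illustrative values only), enumerating all 3-element combinations of a small list:
var values = new List<decimal> { 1m, 2m, 3m, 4m };
foreach (var combo in values.Combinations(3))
{
    // Prints: 1, 2, 3 / 1, 2, 4 / 1, 3, 4 / 2, 3, 4
    Console.WriteLine(string.Join(", ", combo));
}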
Bear in mind that if the IEnumerable and k are sufficiently large, this could take some time, i.e. much longer than you have. So, I've modified your function to take a CancellationToken.
private static IEnumerable<decimal> GetWinningValues(
    IEnumerable<decimal> allValues,
    int numberToGet,
    decimal targetValue,
    CancellationToken canceller)
{
    IList<decimal> currentBest = null;
    var currentBestGap = decimal.MaxValue;
    var locker = new object();

    allValues.Combinations(numberToGet)
        .AsParallel()
        .WithCancellation(canceller)
        .TakeWhile(c => currentBestGap != decimal.Zero)
        .ForAll(c =>
        {
            var gap = Math.Abs(c.Sum() - targetValue);
            if (gap < currentBestGap)
            {
                lock (locker)
                {
                    currentBestGap = gap;
                    currentBest = c.ToList();
                }
            }
        });

    return currentBest;
}
I had an idea that you could sort the initial list and stop iterating the combinations at the point where the sum must exceed the target. After some consideration, it's not trivial to identify that point, and the cost of checking may exceed the benefit; the benefit would have to be balanced against some function of the target value and the mean of the set.
I still think further optimization is possible, but I also suspect this work has already been done and I'd just need to look it up in the right place.
There are k such subsets of decList (k might be 0).
Assuming that you want to select each one with uniform probability 1/k, I think you basically need to do the following:
1. Iterate over all the matching subsets.
2. Select one.
Step 1 is potentially a big task. You can look into the various ways of solving the "subset sum problem" for a fixed subset size and adapt them to generate each solution in turn.
Step 2 can be done either by making a list of all the solutions and choosing one, or (if that might take too much memory) by using a clever streaming random selection algorithm (reservoir sampling).
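For step 2, a minimal sketch of that streaming selection (reservoir sampling), picking a single solution uniformly without storing them all:
static T PickUniformly<T>(IEnumerable<T> solutions, Random random)
{
    T chosen = default(T);
    var seen = 0;
    foreach (var solution in solutions)
    {
        seen++;
        // Keep the new item with probability 1/seen; after the loop every item
        // has been retained with probability 1/total.
        if (random.Next(seen) == 0)
        {
            chosen = solution;
        }
    }
    return chosen;
}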
If your data is likely to have lots of such subsets, then generating them all might be incredibly slow. In that case you might try to identify groups of them at a time. You would have to know the size of a group without visiting its members one by one; then you can choose which group to use, weighted by its size, and you have reduced the problem to selecting one member of that group at random.
If you don't need to select with uniform probability, the problem might become easier. In the best case, if you don't care about the distribution at all, you can return the first subset-sum solution you find -- whether you'd call that "at random" is another matter...

C# fastest intersection of 2 sets of sorted numbers

I'm calculating intersection of 2 sets of sorted numbers in a time-critical part of my application. This calculation is the biggest bottleneck of the whole application so I need to speed it up.
I've tried a bunch of simple options and am currently using this:
foreach (var index in firstSet)
{
if (secondSet.BinarySearch(index) < 0)
continue;
//do stuff
}
Both firstSet and secondSet are of type List<int>.
I've also tried using LINQ:
var intersection = firstSet.Where(t => secondSet.BinarySearch(t) >= 0).ToList();
and then looping through intersection.
But as both of these sets are sorted I feel there's a better way to do it. Note that I can't remove items from sets to make them smaller. Both sets usually consist of about 50 items each.
Please help me guys as I don't have a lot of time to get this thing done. Thanks.
NOTE: I'm doing this about 5.3 million times. So every microsecond counts.
If you have two sets which are both sorted, you can implement a faster intersection than anything provided out of the box with LINQ.
Basically, keep two IEnumerator<T> cursors open, one for each set. At any point, advance whichever has the smaller value. If they match at any point, advance them both, and so on until you reach the end of either iterator.
The nice thing about this is that you only need to iterate over each set once, and you can do it in O(1) memory.
Here's a sample implementation - untested, but it does compile :) It assumes that both of the incoming sequences are duplicate-free and sorted, both according to the comparer provided (pass in Comparer<T>.Default):
(There's more text at the end of the answer!)
static IEnumerable<T> IntersectSorted<T>(this IEnumerable<T> sequence1,
                                         IEnumerable<T> sequence2,
                                         IComparer<T> comparer)
{
    using (var cursor1 = sequence1.GetEnumerator())
    using (var cursor2 = sequence2.GetEnumerator())
    {
        if (!cursor1.MoveNext() || !cursor2.MoveNext())
        {
            yield break;
        }
        var value1 = cursor1.Current;
        var value2 = cursor2.Current;

        while (true)
        {
            int comparison = comparer.Compare(value1, value2);
            if (comparison < 0)
            {
                if (!cursor1.MoveNext())
                {
                    yield break;
                }
                value1 = cursor1.Current;
            }
            else if (comparison > 0)
            {
                if (!cursor2.MoveNext())
                {
                    yield break;
                }
                value2 = cursor2.Current;
            }
            else
            {
                yield return value1;
                if (!cursor1.MoveNext() || !cursor2.MoveNext())
                {
                    yield break;
                }
                value1 = cursor1.Current;
                value2 = cursor2.Current;
            }
        }
    }
}
EDIT: As noted in comments, in some cases you may have one input which is much larger than the other, in which case you could potentially save a lot of time using a binary search for each element from the smaller set within the larger set. This requires random access to the larger set, however (it's just a prerequisite of binary search). You can even make it slightly better than a naive binary search by using the match from the previous result to give a lower bound to the binary search. So suppose you were looking for values 1000, 2000 and 3000 in a set with every integer from 0 to 19,999. In the first iteration, you'd need to look across the whole set - your starting lower/upper indexes would be 0 and 19,999 respectively. After you'd found a match at index 1000, however, the next step (where you're looking for 2000) can start with a lower index of 2000. As you progress, the range in which you need to search gradually narrows. Whether or not this is worth the extra implementation cost or not is a different matter, however.
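For reference, here is a rough sketch of that idea (not part of the original answer): binary-search the larger, random-access list for each element of the smaller one, reusing the previous result as a lower bound. It assumes both lists are sorted ascending and duplicate-free:
static IEnumerable<T> IntersectBySearch<T>(List<T> small, List<T> large, IComparer<T> comparer)
{
    var lowerBound = 0;
    foreach (var item in small)
    {
        // Only search the part of the large list that has not been ruled out yet.
        var index = large.BinarySearch(lowerBound, large.Count - lowerBound, item, comparer);
        if (index >= 0)
        {
            yield return item;
            lowerBound = index + 1;  // The next match must lie further right.
        }
        else
        {
            lowerBound = ~index;     // The insertion point also bounds later searches.
        }
    }
}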
Since both lists are sorted, you can arrive at the solution by iterating over them at most once (you may also get to skip part of one list, depending on the actual values they contain).
This solution keeps a "pointer" to the part of list we have not yet examined, and compares the first not-examined number of each list between them. If one is smaller than the other, the pointer to the list it belongs to is incremented to point to the next number. If they are equal, the number is added to the intersection result and both pointers are incremented.
var firstCount = firstSet.Count;
var secondCount = secondSet.Count;
int firstIndex = 0, secondIndex = 0;
var intersection = new List<int>();

while (firstIndex < firstCount && secondIndex < secondCount)
{
    var comp = firstSet[firstIndex].CompareTo(secondSet[secondIndex]);
    if (comp < 0)
    {
        ++firstIndex;
    }
    else if (comp > 0)
    {
        ++secondIndex;
    }
    else
    {
        intersection.Add(firstSet[firstIndex]);
        ++firstIndex;
        ++secondIndex;
    }
}
The above is a textbook C-style approach of solving this particular problem, and given the simplicity of the code I would be surprised to see a faster solution.
You're using a rather inefficient LINQ method for this sort of task; you should opt for Intersect as a starting point.
var intersection = firstSet.Intersect(secondSet);
Try this. If you measure it for performance and still find it unwieldy, cry for further help (or perhaps follow Jon Skeet's approach).
I was using Jon's approach, but needed to execute this intersect hundreds of thousands of times for a bulk operation on very large sets and needed more performance. The case I was running into was heavily imbalanced list sizes (e.g. 5 and 80,000), and I wanted to avoid iterating the entire large list.
I found that detecting the imbalance and switching to an alternate algorithm gave me huge benefits on specific data sets:
public static IEnumerable<T> IntersectSorted<T>(this List<T> sequence1,
                                                List<T> sequence2,
                                                IComparer<T> comparer)
{
    List<T> smallList = null;
    List<T> largeList = null;

    if (sequence1.Count() < Math.Log(sequence2.Count(), 2))
    {
        smallList = sequence1;
        largeList = sequence2;
    }
    else if (sequence2.Count() < Math.Log(sequence1.Count(), 2))
    {
        smallList = sequence2;
        largeList = sequence1;
    }

    if (smallList != null)
    {
        foreach (var item in smallList)
        {
            if (largeList.BinarySearch(item, comparer) >= 0)
            {
                yield return item;
            }
        }
    }
    else
    {
        // Use Jon's method.
    }
}
I am still unsure about the point at which you break even; I need to do some more testing.
try
firstSet.Intersect(secondSet).ToList()
or
firstSet.Join(secondSet, o => o, id => id, (o, id) => o)
