I have two lists of coordinates in C#: one holds the coordinates of drivers and the other the coordinates of cafes. I am looking for an efficient way of populating a static Dictionary whose key is a driver from the first list and whose value is the list of all cafes within a 500 meter radius.
public void ManageList()
{
    GlobalList.Clear();
    foreach (var driver in driverList)
    {
        var driverCoords = new GeoCoordinate(driver.Latitude, driver.Longitude);
        List<Cafe> matchedCafes = new List<Cafe>();
        foreach (var cafe in cafeList)
        {
            var cafeCoords = new GeoCoordinate(cafe.Latitude, cafe.Longitude);
            if (cafeCoords.GetDistanceTo(driverCoords) <= 500)
            {
                // add the cafe itself, not its GeoCoordinate, so the element type matches List<Cafe>
                matchedCafes.Add(cafe);
            }
        }
        GlobalList.Add(driverCoords, matchedCafes);
    }
}
The above works fine as long as drivers are not moving objects. If I want to send each driver's coordinates every 5 seconds and update GlobalList per driver, the above algorithm falls short, because I am basically clearing the whole list and populating it again.
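For clarity, this is roughly what a per-driver refresh would look like (a sketch only, assuming GlobalList is re-keyed by a stable driver id rather than by a coordinate value, which is a change from the code above), and it still has to scan every cafe on every update:

// Sketch, not the code above: refresh just one driver's entry instead of rebuilding the
// whole dictionary. Driver and driver.Id are assumed names for illustration.
public void UpdateDriver(Driver driver)
{
    var driverCoords = new GeoCoordinate(driver.Latitude, driver.Longitude);
    var matchedCafes = new List<Cafe>();
    foreach (var cafe in cafeList)
    {
        var cafeCoords = new GeoCoordinate(cafe.Latitude, cafe.Longitude);
        if (cafeCoords.GetDistanceTo(driverCoords) <= 500)
        {
            matchedCafes.Add(cafe);
        }
    }
    GlobalList[driver.Id] = matchedCafes; // replaces only this driver's entry
}

Note that this still scans every cafe on every update; the answers below are about avoiding that full scan.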
More of a pointer than an answer. It's unclear how many items you are talking about.
But really what you describe is a spatial hashing problem.
This is a staple of game-engine and physics programming.
It is a big topic, but these links will get you started:
https://gamedevelopment.tutsplus.com/tutorials/redesign-your-display-list-with-spatial-hashes--cms-27586
http://zufallsgenerator.github.io/2014/01/26/visually-comparing-algorithms/
https://gamedev.stackexchange.com/a/69794/86883
As a matter of fact, you could probably ask your question on gamedev, since, it is exactly that type of question.
I'll try to make an extremely simple explanation:
Say your system performs perfectly fine (no performance problems) when you have, for example, 20 cafes.
But, in fact you have 2000 cafes.
So break down the map into about 100 "boxes".
When you process a driver, only check the cafes in that driver's box.
You've immediately eliminated 1980 of the cafes, which are so far away they are not even in the box in question. (Naturally this is a simplification; there are a huge number of details to address in the basic approach.)
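To make the "boxes" idea concrete, here is a minimal sketch of a grid-based spatial hash for the cafes. It assumes the Cafe class from the question, and the cell size (roughly 500 m expressed in degrees of latitude) is an assumption you would tune:

// Minimal spatial-hash sketch (an illustration, not production code).
// Cafes are bucketed into grid cells keyed by rounded lat/lon; a driver query only
// looks at its own cell and the 8 neighbouring cells.
const double CellSize = 0.005; // ~500 m in latitude; an assumed, tunable value

static (int, int) CellOf(double lat, double lon) =>
    ((int)Math.Floor(lat / CellSize), (int)Math.Floor(lon / CellSize));

static Dictionary<(int, int), List<Cafe>> BuildIndex(IEnumerable<Cafe> cafes)
{
    var index = new Dictionary<(int, int), List<Cafe>>();
    foreach (var cafe in cafes)
    {
        var key = CellOf(cafe.Latitude, cafe.Longitude);
        if (!index.TryGetValue(key, out var bucket))
            index[key] = bucket = new List<Cafe>();
        bucket.Add(cafe);
    }
    return index;
}

static IEnumerable<Cafe> NearbyCandidates(Dictionary<(int, int), List<Cafe>> index,
                                          double lat, double lon)
{
    var (cx, cy) = CellOf(lat, lon);
    for (int dx = -1; dx <= 1; dx++)
        for (int dy = -1; dy <= 1; dy++)
            if (index.TryGetValue((cx + dx, cy + dy), out var bucket))
                foreach (var cafe in bucket)
                    yield return cafe; // still apply the exact 500 m distance check

Only the handful of candidates returned by NearbyCandidates need the exact GetDistanceTo check, so a 5-second per-driver update touches a few nearby cafes instead of all of them.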
Actually this article -
https://dzone.com/articles/algorithm-week-spatial
very nicely explains both quadtrees and geohashing for you.
I'm new to machine learning and am working on my master's thesis using ML.net. I'm trying to use a GloVe model to vectorise CV text, but I'm finding it hard to wrap my head around the process. I have the pipeline set up as below:
var pipeline = context.Transforms.Text.NormalizeText("Text", null,
keepDiacritics: false, keepNumbers: false, keepPunctuations: false)
.Append(context.Transforms.Text.TokenizeIntoWords("Tokens", "Text"))
.Append(context.Transforms.Text.RemoveDefaultStopWords("WordsWithoutStopWords", "Tokens", Microsoft.ML.Transforms.Text.StopWordsRemovingEstimator.Language.English))
.Append(context.Transforms.Text.ApplyWordEmbedding("Features", "WordsWithoutStopWords",
Microsoft.ML.Transforms.Text.WordEmbeddingEstimator.PretrainedModelKind.GloVe300D));
var embeddingTransformer = pipeline.Fit(emptyData);
var predictionEngine = context.Model.CreatePredictionEngine<Input,Output>(embeddingTransformer);
var data = new Input { Text = TextExtractor.Extract("/attachments/CV6.docx")};
var prediction = predictionEngine.Predict(data);
Console.WriteLine($"Number of features: {prediction.Features.Length}");
Console.WriteLine("Features: ");
foreach (var feature in prediction.Features)
{
    Console.Write($"{feature} ");
}
Console.WriteLine(Environment.NewLine);
From what I've studied about vectorization, each word in the document should be converted into a vector, but when I print the features I see 900 of them. Can someone explain how this works? There are very few examples and tutorials about ML.net available on the internet.
The vector of 900 features coming out of the WordEmbeddingEstimator is the min/max/average of the individual word embeddings in your phrase. Each of the min, max, and average is 300-dimensional for the GloVe 300D model, giving 900 in total.
The min/max gives the bounding hyper-rectangle for the words in your phrase. The average gives the standard phrase embedding.
See: https://github.com/dotnet/machinelearning/blob/d1bf42551f0f47b220102f02de6b6c702e90b2e1/src/Microsoft.ML.Transforms/Text/WordEmbeddingsExtractor.cs#L748-L752
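As a rough illustration of what the transform does (a sketch, not the actual ML.NET code; see the linked source for that), the 900 values are three 300-dimensional vectors concatenated:

// Illustrative sketch only: element-wise min / average / max of the word vectors
// in a phrase, concatenated into one feature vector.
static float[] PhraseEmbedding(IReadOnlyList<float[]> wordVectors, int dim = 300)
{
    if (wordVectors.Count == 0) return new float[3 * dim];
    var min = new float[dim];
    var max = new float[dim];
    var avg = new float[dim];
    for (int d = 0; d < dim; d++) { min[d] = float.MaxValue; max[d] = float.MinValue; }
    foreach (var w in wordVectors)
    {
        for (int d = 0; d < dim; d++)
        {
            min[d] = Math.Min(min[d], w[d]);
            max[d] = Math.Max(max[d], w[d]);
            avg[d] += w[d] / wordVectors.Count;
        }
    }
    // The exact block order inside ML.NET may differ; the point is 3 * 300 = 900 features.
    return min.Concat(avg).Concat(max).ToArray();   // requires System.Linq
}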
GloVe is short for Global Vectors (for Word Representation).
GloVe is an unsupervised (no human labeling of the training set) learning method. The vectors associated with each word are generally derived from each word's proximity to others in sentences.
Once you have trained your network (presumably on a much larger data set than a single CV/resume) then you can make interesting comparisons between words based on their absolute and relative "positions" in the vector space. A much less computationally expensive way of developing a network to analyze e.g. documents is to download a pre-trained dataset. I'm sure you've found this page (https://nlp.stanford.edu/projects/glove/) which, among other things, will allow you to access pre-trained word embeddings/vectorizations.
Final thoughts: I'm sorry if all of this is redundant information for you, especially if this really turns out to be an ML.net framework syntax question. I don't know exactly what your goal is, but:
900 dimensions seems like an awful lot for looking at CVs. Maybe this is an ML.net default? I suspect that 300-500 would be more than adequate. See what the pre-trained data sets provide.
If you only intend to train your network from scratch on a single CV, then this method is going to be wholly inadequate.
Your best approach is likely to be a sort of transfer learning approach where you obtain a liberally licensed network that has been pre-trained on a massive data set in your language of interest (usually easy for academic work). Then perform additional training using a smaller, targeted group of training-only CV's to add any specialized words to the 'vocabulary' of your network. Then perform your experimentation and analysis on a set of test CV's, which have never been used to train the network.
I'm fairly new to programming, I started learning C#, and this is my first project. I'm struggling to figure out why a strange and seemingly random issue is occurring. This is a fairly simple trading application: it connects to a websocket stream, receives live price data from the exchange, evaluates the price in real time, and performs some actions. The price is updated hundreds of times per second and everything operates without issue, then all of a sudden I get a price value that is thousands of dollars off the actual price sent by the exchange. I finally caught this occurring in real time: the app had been running for 11 hours or so without issue, then the bad value came through.
Here is the code in question:
public static decimal CurrentPrice;
// ...
if (BitmexTickerStreamIsConnected)
{
    bitmexApiSocketService.Subscribe(BitmetSocketSubscriptions.CreateInstrumentSubsription(
        message =>
        {
            foreach (var instrumentDto in message.Data)
            {
                if (instrumentDto.Symbol == "XBTUSD")
                {
                    BitmexTickerStreamLastMessageReceived = DateTime.Now;
                    decimal LastPrice = instrumentDto.LastPrice.HasValue
                        ? Convert.ToDecimal(instrumentDto.LastPrice)
                        : CurrentPrice;
                    CurrentPrice = LastPrice;
                }
            }
        }));
}
These are the values from the debug after a breakpoint was hit further down:
instrumentDto.LastPrice = 7769.5
LastPrice = 7769.5
CurrentPrice = 776.9
The issue is that CurrentPrice seems, for some reason, to be shifting the decimal point to the left by one place. The values coming in from the websocket are fine; it's just when CurrentPrice is set to LastPrice that the issue happens.
I have no idea why this is happening, and it seems to be totally random.
Anyone have any idea why this might be happening or how?
Thank you for your help!
There are two common causes:
1) Market data, due to how fast it's updated, will occasionally give you straight-up bad data, depending on the provider. If you're consuming directly from an exchange you need to code for this. Some providers (like OPRA) will filter or mark bad ticks for you.
2) If you see this issue consistently, it's due to things like tick size or scale. Some exchanges do this differently, but effectively you need to multiply certain price fields by a certain scale. Consult the data provider's documentation for details.
If this is seen very rarely, you likely just got a bad price. Yes, this absolutely will happen on occasion and you need to be prepared for it, unless you want to become the next Knight Capital.
In all handlers I've written (or contributed to) there's a "sanity check" to see if the data is good. Depending on what you're trying to accomplish, just dropping the bad tick is fine.
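A minimal sketch of such a sanity check (the 1% tolerance is an arbitrary assumption; pick whatever fits the instrument and your update rate):

// Illustrative sanity check (not from the question's code base): drop any tick that
// deviates more than a chosen tolerance from the last accepted price.
static bool LooksSane(decimal lastAccepted, decimal candidate, decimal tolerance = 0.01m)
{
    if (lastAccepted == 0m) return true;   // nothing to compare against yet
    return Math.Abs(candidate - lastAccepted) / lastAccepted <= tolerance;
}

// usage inside the subscription callback (names taken from the question):
// if (instrumentDto.LastPrice.HasValue && LooksSane(CurrentPrice, Convert.ToDecimal(instrumentDto.LastPrice)))
//     CurrentPrice = Convert.ToDecimal(instrumentDto.LastPrice);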
Another solution that I've commonly used is alternate streams of data (usually called "A" and "B" streams or similar). If you get a bad tick on one stream, use the other.
That said, this is not directly related to the programming language; at its core it is about handling quirks of the API/data.
Edit
Also beware of threading issues here. Be sure CurrentPrice isn't updated by multiple threads at once. decimal is a 128-bit base-10 floating point type, which is larger than the current word size (32 or 64 bits), so reads and writes of it are not atomic.
You may need to synchronize the reads and writes to it which you can do in a variety of ways. The above information still applies, though.
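As a sketch of one such synchronization approach (Interlocked has no decimal overloads, so a plain lock is the usual fallback):

// Sketch: guard the shared decimal behind a lock so readers never see a torn value.
private static readonly object PriceLock = new object();
private static decimal currentPrice;

public static decimal CurrentPrice
{
    get { lock (PriceLock) { return currentPrice; } }
    set { lock (PriceLock) { currentPrice = value; } }
}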
EDIT: To give an idea of what type of mesh I have:
Imagine a LEGO brick with four knobs on top. I read an STL file containing the surface of the whole brick. After identifying all nodes with unique coordinates (and saving their next neighbours in a list), I cut away most of the brick, so that only the four knobs remain. Unfortunately, these four knobs still live in the same big lists (one for the nodes, one for the neighbours). I want the fastest way to get all nodes of one knob if I specify one node which I know belongs to that knob.
I have a (relatively) big List<cNode> nodes where
class cNode
{
    int nodeNumber;
    cVector vector;
}
and an even bigger (ca. 14e6 entries) List<cNodeCoincidence> coincidences where
class cNodeCoincidence
{
    cNode node1;
    cNode node2;
}
My nodes represent points in 3D, and my coincidences represent what was formerly a mesh of triangles, condensed from an STL file. I know for a fact (and the user made his input accordingly) that my node mesh is actually 4 separate meshes in one node/coincidence list. My goal is to extract the nodes of each sub-mesh into its own node list. To achieve this, I start with one node per sub-mesh which I know to be part of that sub-mesh. Cue a recursive function:
private void AssembleSubMesh(ReadOnlyCollection<cNode> in_nodesToRead, List<cNode> in_nodesAlreadyRead)
{
    List<cNode> newNodesToRead = new List<cNode>();
    List<cNodeCoincidence> foundCoincidences = coincidences.Where(x => (in_nodesToRead.Any(y => y == x.node1)) || in_nodesToRead.Any(z => z == x.node2)).ToList();
    in_nodesAlreadyRead.AddRange(in_nodesToRead);
    List<cNode> allRemainingNodes = new List<cNode>();
    foreach (cNodeCoincidence nc in foundCoincidences)
    {
        allRemainingNodes.Add(nc.node1);
        allRemainingNodes.Add(nc.node2);
    }
    allRemainingNodes = allRemainingNodes.Distinct().ToList();
    allRemainingNodes.RemoveAll(x => in_nodesAlreadyRead.Contains(x));
    if (allRemainingNodes.Count != 0)
        AssembleSubMesh(new ReadOnlyCollection<cNode>(allRemainingNodes), in_nodesAlreadyRead);
}
which is called by: AssembleSubMesh(new ReadOnlyCollection<cNode>(firstNodeIKnow), globalResultListForSubmesh); thus writing the results of the recursion to a more global list.
This procedure works (tested with a small mesh), but it is painfully slow (over 15 hours before I aborted the process).
Is there a faster and perhaps more elegant way to separate the mesh?
I found this SO post and had a look at this lecture, and it seems there is a chance that this (especially weighted quick-union with path compression, WQUPC) is what I need, but I don't understand exactly how it would help, because they only have a node list, whereas I additionally have the coincidence list, which it would be a shame not to use.
Could a database help (because of indexing)?
You need to be able to identify an edge, which is a single connection between 2 vertices (with no other connections found). I assume that all vertices of all triangles are present, so they are duplicated. It depends on the size of your mesh, sure, but it shouldn't take this much time.
You need to define dictionaries, which will increase your app's memory usage but will also dramatically increase speed thanks to guaranteed O(1) lookups.
In short:
1) Load the data.
2) Scan it and construct appropriate data structures. (If you look at any CAD modelling software, it takes longer than you would expect to load a mesh, for exactly this reason: it scans the loaded data and builds data structures so it can process that data as fast as possible afterwards.)
3) Use those data structures to get the information you need as fast as possible.
So choose your data structures and keys wisely, to meet the requirements of your application.
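As a hedged sketch of what that could look like with the cNode and cNodeCoincidence classes from the question: build an adjacency dictionary in one pass over the coincidence list, then flood-fill one sub-mesh iteratively instead of re-filtering the 14-million-entry list on every recursion step.

// Sketch only: one pass over the coincidence list builds an adjacency map, then an
// iterative breadth-first flood fill collects every node reachable from the seed,
// i.e. one sub-mesh ("knob"). Reference equality on cNode is assumed, as in the question.
static Dictionary<cNode, List<cNode>> BuildAdjacency(List<cNodeCoincidence> coincidences)
{
    var adjacency = new Dictionary<cNode, List<cNode>>();
    void Add(cNode a, cNode b)
    {
        if (!adjacency.TryGetValue(a, out var list))
            adjacency[a] = list = new List<cNode>();
        list.Add(b);
    }
    foreach (var c in coincidences)
    {
        Add(c.node1, c.node2);
        Add(c.node2, c.node1);
    }
    return adjacency;
}

static List<cNode> CollectSubMesh(cNode seed, Dictionary<cNode, List<cNode>> adjacency)
{
    var visited = new HashSet<cNode> { seed };
    var queue = new Queue<cNode>();
    queue.Enqueue(seed);
    while (queue.Count > 0)
    {
        var current = queue.Dequeue();
        if (!adjacency.TryGetValue(current, out var neighbours)) continue;
        foreach (var n in neighbours)
            if (visited.Add(n))          // Add returns false if n was already visited
                queue.Enqueue(n);
    }
    return visited.ToList();             // requires System.Linq
}

With dictionary and hash-set lookups this is roughly linear in the number of coincidences, instead of repeatedly running Contains and Where over ever-growing lists.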
Problem background
I am currently developing a framework of Ant Colony System algorithms. I thought I'd start out by trying them on the first problem they were applied to: Travelling Salesman Problem (TSP). I will be using C# for the task.
All TSP instances will consist of a complete undirected graph with 2 different weights associated with each edge.
Question
Until now I've only used adjacency-list representations, but I've read that they are recommended only for sparse graphs. As I am not the most knowledgeable person when it comes to data structures, I was wondering what would be the most efficient way to implement an undirected complete graph?
I can provide additional details if required.
Thank you for your time.
UPDATE
Weight clarification. Each edge will have the two values associated with them:
distance between two cities (d(i,j) = d(j,i), i.e. the same distance in both directions)
amount of pheromone deposited by ants on that particular edge
Operations. Small summary of the operations I will be doing on the graph:
for each node, the ant on that particular node will have to iterate through the values associated with all incident edges
Problem clarification
Ant Colony Optimization algorithms can "solve" TSP, as this is the problem they were first applied to. I say "solve" because they are part of a family of algorithms called metaheuristics, so they never guarantee to return the optimal solution.
Regarding the problem at hand:
ants will know how to complete a tour because each ant will have a memory.
each time an ant visits a city it will store that city in its memory.
each time an ant considers visiting a new city it will search in its memory and pick an outgoing edge only if that edge will not lead it to an already visited city.
when there are no more edges the ant can choose, it has completed a tour; at this point we can retrace the tour created by the ant by backtracking through its memory.
Research article details: Ant Colony System article
Efficiency considerations
I am more worried about run time (speed) than memory.
First, there's a general guide to adjacency lists vs matrices here. That's a pretty low-level, non-specific discussion, though, so it might not tell you anything you don't already know.
The takeaway, I think, is this: If you often find yourself needing to answer the question, "I need to know the distance or the pheromone level of the edge between exactly node i and node j," then you probably want the matrix form, as that question can be answered in O(1) time.
You do mention needing to iterate over the edges adjacent to a node-- here is where some cleverness and subtlety may come in. If you don't care about the order of the iteration, then you don't care about the data structure. If you care deeply about the order, and you know the order up front, and it never changes, you can probably code this directly into an adjacency list. If you find yourself always wanting, e.g., the largest or smallest concentration of pheromones, you may want to try something even more structured, like a priority queue. It really depends on what kind of operations you're doing.
Finally, I know you mention that you're more interested in speed than memory, but it's not clear to me how many graph representations you'll be using. If only one, then you truly don't care. But, if each ant is building up its own representation of the graph as it goes along, you might care more than you think, and the adjacency list will let you carry around incomplete graph representations; the flip side of that is that it will take time to build the representations up when the ant is exploring on its frontier.
Finally finally, I know you say you're dealing with complete graphs and TSP, but it is worth thinking about whether you will ever need to adapt these routines to some other problem on possibly sparse graphs, and if so, what then.
I lean toward adjacency lists and/or even more structure, but I don't think you will find a clean, crisp answer.
Since you have a complete graph I would think that the best representation would be a 2D array.
public class Edge
{
    // change types as appropriate
    public int Distance { get; set; }
    public int Pheromone { get; set; }
}

int numNodes; // set this to the number of cities in your instance
Edge[,] graph = new Edge[numNodes, numNodes];

for (int i = 0; i < numNodes; i++)
{
    for (int j = 0; j < numNodes; j++)
    {
        // a rectangular array is indexed as graph[i, j], not graph[i][j]
        graph[i, j] = new Edge();
        // initialize Edge
    }
}
If you have a LOT of nodes, and don't "remember" nodes by index into this graph, then it may be beneficial to have a Dictionary that maps a Node to its index in the graph. It may also be helpful to have the reverse lookup (a List would be the appropriate data structure here). This would give you the ability to get a Node object (if you have a lot of information to store about each node) based on the index of that node in the graph.
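A minimal sketch of that pairing, assuming a hypothetical Node class for per-city data:

// Sketch: two-way lookup between Node objects and their index in the Edge[,] matrix.
// Node is a hypothetical class holding whatever per-city data you need.
List<Node> nodesByIndex = new List<Node>();                 // index -> Node
Dictionary<Node, int> indexByNode = new Dictionary<Node, int>();

void Register(Node node)
{
    indexByNode[node] = nodesByIndex.Count;
    nodesByIndex.Add(node);
}

// usage: look up the edge between two known nodes in O(1)
// Edge e = graph[indexByNode[a], indexByNode[b]];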
I've got two geometrical data sets to match, both containing tens of thousands of PathGeometries. To be exact, I need to find areas from one set that overlap areas from the other, so I have a loop like:
foreach (var p1 in firstGeometries)
{
    foreach (var p2 in secondGeometries)
    {
        PathGeometry sharedArea = PathGeometry.Combine(p1, p2, GeometryCombineMode.Intersect, null);
        if (sharedArea.GetArea() > 0) // only true 0.01% of the time
        {
            [...]
        }
    }
}
Now, due to the nature of my data, 99.99% of the time the combinations do not intersect at all. Profiling told me this is the most 'expensive' part of the calculation.
Is there any way to speed up or get a faster collision detection between two PathGeometries?
Adding a new answer since I'm more familiar with the Geometry class now. First, I'd test for intersection using their bounding boxes. Though honestly, PathGeometry.Combine probably already does this. So what's the real problem? That testing the boundary of each object against the boundary of every other object is quadratic time. If you instead found intersections (or collisions in some areas of CS) using a quadtree, you could have significant performance gains. Hard to say without testing and fine tuning though. http://gamedev.tutsplus.com/tutorials/implementation/quick-tip-use-quadtrees-to-detect-likely-collisions-in-2d-space/
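As a hedged sketch of that bounding-box pre-check, reusing the loop from the question (Geometry.Bounds and Rect.IntersectsWith are standard WPF members):

// Sketch: skip the expensive Combine call whenever the axis-aligned bounding
// boxes of the two geometries do not even touch.
foreach (var p1 in firstGeometries)
{
    var bounds1 = p1.Bounds;                       // cached once per outer geometry
    foreach (var p2 in secondGeometries)
    {
        if (!bounds1.IntersectsWith(p2.Bounds))
            continue;                              // the vast majority of pairs stop here
        PathGeometry sharedArea = PathGeometry.Combine(p1, p2, GeometryCombineMode.Intersect, null);
        if (sharedArea.GetArea() > 0)
        {
            // [...]
        }
    }
}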
Maybe you can use the Parallel.ForEach method, if you have more than one CPU core available.
I am not sure about the exact nature of each path geometry, but assuming that they are polygons:
You can sort the objects based on their bounds. This way you are assured that, once the condition if (sharedArea.GetArea() > 0) fails, the remaining elements in the inner loop will not produce an area greater than 0, so you can break out of the loop.
It will significantly improve the running time, since the condition is likely to fail most of the time.
I haven't tested it, but it may be helpful to call GetFlattenedPathGeometry up front and combine the results of that instead. Depending on the type of geometry you're combining, it's likely getting converted to a polygonal approximation each time; using GetFlattenedPathGeometry ahead of time will hopefully eliminate the redundant computation.
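For instance (a sketch along the lines of this suggestion; Geometry.GetFlattenedPathGeometry is a standard WPF method, and the list names come from the question):

// Sketch: flatten every geometry once up front so Combine works on the polygonal
// approximations instead of re-flattening curved segments for every pair.
// (requires System.Linq)
List<PathGeometry> flattenedFirst =
    firstGeometries.Select(g => g.GetFlattenedPathGeometry()).ToList();
List<PathGeometry> flattenedSecond =
    secondGeometries.Select(g => g.GetFlattenedPathGeometry()).ToList();
// ...then run the pairwise bounding-box check and Combine/GetArea over these lists.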
You definitely need a "broad phase" and a "narrow phase" to do this.
Bounding-Box checks are a must for something like this.
A much simpler alternative to a quadtree would be to use "spatial hashing" (sometimes also called "spatial indexing"). This technique should reduce the needed time a thousandfold. For a reference, see http://www.playchilla.com/as3-spatial-hash. It's in AS3, but it's trivial to convert it to C#.