Get all variations of a string using Levenshtein distance - c#

I have found many implementations that calculate the Levenshtein distance between two strings, but is there an implementation that can generate all variations of a given string within a Levenshtein distance of at most 2?
The reason is that I'm using ElasticSearch to run fuzzy searches, but with my query load I'm having performance issues because ElasticSearch recalculates those possibilities on every query; I want to compute and store those values once.

The most commonly cited reference implementation for generating all edits of a string within a given distance is in Python; you can see it in this answer.
The original author links later implementations in other languages at the bottom of his blog post, under the heading Other Computer Languages. There are four implementations in C#; this one in particular works (I'm unsure under what license those implementations are published, so I won't transcribe them into this thread).
That said, using wildcard searches with ElasticSearch is the correct approach. The engine will implement approximate string matching as efficiently as possible; there are a number of different algorithms this can be based on, and the optimal choice depends on your data structure, etc.
You can simplify its work by generating the edit-distance variations yourself, but in most cases, if you're using a database or a search engine, its built-in implementation will have better performance. (This is a computationally expensive task; there's no way around that.)
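If you do decide to precompute the variations yourself, a minimal C# sketch of that generation step might look like the following. The lowercase alphabet, the Damerau-style transpose, and the helper names are assumptions to adapt to your data; this is an illustrative sketch, not the reference implementation linked above.

```csharp
using System.Collections.Generic;
using System.Linq;

static class EditVariations
{
    const string Alphabet = "abcdefghijklmnopqrstuvwxyz"; // assumption: lowercase ASCII only

    // All strings one edit away from `word` (deletes, transposes, replaces, inserts).
    static IEnumerable<string> Edits1(string word)
    {
        var splits = Enumerable.Range(0, word.Length + 1)
                               .Select(i => (Left: word.Substring(0, i), Right: word.Substring(i)));

        foreach (var (left, right) in splits)
        {
            if (right.Length > 0)
                yield return left + right.Substring(1);                       // delete
            if (right.Length > 1)
                yield return left + right[1] + right[0] + right.Substring(2); // transpose (Damerau-style; drop for strict Levenshtein)
            foreach (char c in Alphabet)
            {
                if (right.Length > 0)
                    yield return left + c + right.Substring(1);               // replace
                yield return left + c + right;                                // insert
            }
        }
    }

    // All strings within edit distance 2: apply the single-edit step twice.
    public static HashSet<string> WithinDistance2(string word)
    {
        var results = new HashSet<string>(Edits1(word));
        foreach (var e1 in results.ToList())
            foreach (var e2 in Edits1(e1))
                results.Add(e2);
        results.Remove(word); // keep only true variations, if desired
        return results;
    }
}
```

Note that the distance-2 set grows quickly with word length and alphabet size, so precomputing and storing it is a space-for-time trade-off worth benchmarking against the engine-side fuzzy matching.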

Related

How can I use SharpNLP to detect the possibility that a line of text is a sentence?

I've written a small C# program that compiles a bunch of words into a line of text, and I want to use NLP only to give me the percentage probability that the bunch of words is a sentence. I don't need tokens or tagging; all of that can happen in the background if it needs to be done. I have OpenNLP and SharpEntropy referenced in my project, but I'm getting the error "Array dimensions exceeded supported range." when using them, so I've also attempted using the IKVM-created OpenNLP without SharpEntropy, but without documentation I can't seem to wrap my head around the proper steps to get only the percentage probability.
Any help or direction would be appreciated.
I'll recommend 2 relatively simple measures that might help you classify a word sequence as sentence/non-sentence. Unfortunately, I don't know how well SharpNLP will handle either. More complete toolkits exist in Java, Python, and C++ (LingPipe, Stanford CoreNLP, GATE, NLTK, OpenGRM, ...)
Language-model probability: Train a language model on sentences with start and stop tokens at the beginning/end of the sentence. Compute the probability of your target sequence per that language model. Grammatical and/or semantically sensible word sequences will score much higher than random word sequences. This approach should work with a standard n-gram model, a discriminative conditional probability model, or pretty much any other language modeling approach. But definitely start with a basic n-gram model.
Parse tree probability: Similarly, you can measure the inside probability of recovered constituency structure (e.g. via a probabilistic context free grammar parse). More grammatical sequences (i.e., more likely to be a complete sentence) will be reflected in higher inside probabilities. You will probably get better results if you normalize by the sequence length (the same may apply to a language-modeling approach as well).
I've seen preliminary (but unpublished) results on tweets that seem to indicate a bimodal distribution of normalized probabilities: tweets that were judged more grammatical by human annotators often fell within a higher peak, and those judged less grammatical clustered into a lower one. But I don't know how well those results would hold up in a larger or more formal study.
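To make the language-model suggestion concrete, here is a minimal self-contained bigram sketch in C# (plain code, not SharpNLP or any of the toolkits mentioned above). The class and method names are illustrative, and a real system would use proper tokenization and better smoothing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative bigram language model: train on tokenized sentences wrapped in
// start/stop markers, then score a candidate word sequence by its average
// per-transition log probability with add-one smoothing.
class BigramLanguageModel
{
    private readonly Dictionary<string, int> _unigrams = new Dictionary<string, int>();
    private readonly Dictionary<(string, string), int> _bigrams = new Dictionary<(string, string), int>();

    public void Train(IEnumerable<string[]> sentences)
    {
        foreach (var sentence in sentences)
        {
            var tokens = Wrap(sentence);
            for (int i = 0; i < tokens.Length; i++)
            {
                Bump(_unigrams, tokens[i]);
                if (i > 0) Bump(_bigrams, (tokens[i - 1], tokens[i]));
            }
        }
    }

    // Average log probability per bigram transition; add-one smoothing keeps
    // unseen transitions from zeroing out the whole score.
    public double NormalizedLogProbability(string[] words)
    {
        var tokens = Wrap(words);
        int vocab = _unigrams.Count;
        double logProb = 0;
        for (int i = 1; i < tokens.Length; i++)
        {
            int bigram = Count(_bigrams, (tokens[i - 1], tokens[i]));
            int context = Count(_unigrams, tokens[i - 1]);
            logProb += Math.Log((bigram + 1.0) / (context + vocab));
        }
        return logProb / (tokens.Length - 1);
    }

    private static string[] Wrap(string[] words) =>
        new[] { "<s>" }.Concat(words).Concat(new[] { "</s>" }).ToArray();

    private static void Bump<TKey>(Dictionary<TKey, int> d, TKey key) =>
        d[key] = Count(d, key) + 1;

    private static int Count<TKey>(Dictionary<TKey, int> d, TKey key) =>
        d.TryGetValue(key, out var c) ? c : 0;
}
```

Word sequences that read like sentences should score noticeably higher than shuffled bags of words; the threshold separating the two is something you would have to pick empirically.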

Fuzzy Text Matching C#

I'm writing a desktop UI (.NET WinForms) to assist a photographer in cleaning up his image metadata. There is a list of 66k+ phrases. Can anyone suggest a good open-source/free .NET component that employs some sort of algorithm to identify potential candidates for consolidation? For example, there may be two or more entries that are actually the same word or phrase, differing only by whitespace, punctuation, or even a slight misspelling. The application will ultimately rely on the user to action the consolidation of phrases, but having an effective way to automatically find potential candidates will prove invaluable.
Let me introduce you to the Levenshtein distance formula. It is awesome:
http://en.wikipedia.org/wiki/Levenshtein_distance
In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences. The term edit distance is often used to refer specifically to Levenshtein distance.
Personally I used this in a healthcare setting, where Provider names were checked for duplicates. Using the Levenshtein process, we gave them a confidence rating and allowed them to determine if it was a true duplicate or something unique.
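For reference, a plain C# implementation of the distance itself, plus a simple length-normalized similarity score of the kind you could surface as a confidence rating (the normalization chosen here is just one common option, not necessarily the one used in that project):

```csharp
using System;

static class Levenshtein
{
    // Classic dynamic-programming edit distance between two strings.
    public static int Distance(string a, string b)
    {
        if (string.IsNullOrEmpty(a)) return b?.Length ?? 0;
        if (string.IsNullOrEmpty(b)) return a.Length;

        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(
                    d[i - 1, j] + 1,         // deletion
                    d[i, j - 1] + 1),        // insertion
                    d[i - 1, j - 1] + cost); // substitution
            }
        }
        return d[a.Length, b.Length];
    }

    // Normalized similarity in [0, 1]; 1.0 means identical strings.
    public static double Similarity(string a, string b)
    {
        int maxLen = Math.Max(a?.Length ?? 0, b?.Length ?? 0);
        return maxLen == 0 ? 1.0 : 1.0 - (double)Distance(a, b) / maxLen;
    }
}
```

Something like `Levenshtein.Similarity("John Smith MD", "Jon Smith M.D.")` then gives a score you can show to the user, who makes the final duplicate/unique call.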
I know this is an old question, but I feel this answer can help people who are dealing with the same issue today.
Please have a look at https://github.com/JakeBayer/FuzzySharp
It is a C# NuGet package with multiple methods implementing different approaches to fuzzy search. I'm not sure, but perhaps Fosco's answer is used in one of them as well.
Edit:
I just noticed a comment mentioning this package, but I think it deserves a more prominent place in this question.
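A minimal usage sketch, assuming the package exposes the Fuzz.Ratio-style scorers of the fuzzywuzzy family it is ported from; check the repository README for the exact API of the version you install:

```csharp
using FuzzySharp; // NuGet: FuzzySharp

class FuzzyDemo
{
    static void Main()
    {
        // Simple similarity score in the 0-100 range.
        int score = Fuzz.Ratio("mysql has mistakes", "mysql has mistake");

        // Token-based scoring is more forgiving about word order.
        int tokenScore = Fuzz.TokenSortRatio("fuzzy was a bear", "a bear fuzzy was");

        System.Console.WriteLine($"{score} {tokenScore}");
    }
}
```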

Queryable Language Dictionary and Word Searching Functionality

In a future project I will need to implement functionality for searching words (either by length, or given a set of characters and their positions in the word) that returns all words meeting the given criteria.
In order to do so, I will need language dictionaries that can be easily queried with LINQ. The first thing I'd like to ask is whether anyone knows of good dictionaries to use in this kind of application and environment.
I'd also like to ask about good ways to search such a dictionary for a word. Would a hash table help speed up queries? The thing is that a language dictionary can be quite huge, and knowing I will have plenty of search criteria, what would be a good way to implement this functionality without hindering search speed?
Without knowing the exact set of things you are likely to need to optimize for, it's hard to say. The standard data structure for efficiently organizing a large corpus of words for fast retrieval is the "trie", or, if space efficiency is important (say, because you're writing a program for a phone or another memory-constrained environment), a DAWG: a Directed Acyclic Word Graph. (A DAWG is essentially a trie that merges common paths to leaves.)
Other interesting questions that I'd want to know an answer to before designing a data structure are things like: will the dictionary ever change? If it does change, are there performance constraints on how fast the new data needs to be integrated into the structure? Will the structure be used only as a fast lookup device, or would you like to store summary information about the words in it as well? (If the latter then a DAWG is unsuitable, since two words may share the same prefix and suffix nodes.) And so on.
I would search the literature for information on tries, DAWGs and ways to optimize Scrabble programs; clearly Scrabble requires all kinds of clever searching of a corpus of strings, and as a result there have been some very fast variants on DAWG data structures built by Scrabble enthusiasts.
I have recently written an immutable trie data structure in C# which I'm planning on blogging about at some point. I'll update this answer in the coming months if I do end up doing that.
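As a rough illustration of the data structure being discussed, here is a bare-bones mutable trie in C#; an immutable version like the one mentioned above would have the same shape but would return new nodes instead of modifying existing ones.

```csharp
using System.Collections.Generic;

// Minimal trie: each node maps a character to a child node, and marks
// whether the path from the root to it spells a complete word.
class TrieNode
{
    public Dictionary<char, TrieNode> Children { get; } = new Dictionary<char, TrieNode>();
    public bool IsWord { get; set; }
}

class Trie
{
    private readonly TrieNode _root = new TrieNode();

    public void Add(string word)
    {
        var node = _root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out var child))
                node.Children[c] = child = new TrieNode();
            node = child;
        }
        node.IsWord = true;
    }

    public bool Contains(string word)
    {
        var node = _root;
        foreach (char c in word)
            if (!node.Children.TryGetValue(c, out node))
                return false;
        return node.IsWord;
    }
}
```

The question's search-by-length and characters-at-known-positions queries map naturally onto a depth-first walk of this structure: follow the single matching child where a position is constrained, and branch over all children where it is not.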

C# generic graph search framework

I have now coded up various graph search algorithms (A*, DFS, BFS, etc.) many times over. Every time, the only real differences are the actual search states I am searching over and how new states are generated from existing ones.
I am now faced with yet another search-heavy project and would like to avoid having to code and debug a general search algorithm again. It would be really nice if I could define a search-state class, including the information needed for generating successor states, the heuristic cost, and so on, and just plug it into some kind of existing search framework that does all of the heavy lifting for me. I know the algorithms aren't particularly difficult to code, but there are always enough tricks involved to make it annoying.
Does anything like this exist? I couldn't find anything.
Perhaps QuickGraph will be of interest.
QuickGraph provides generic directed/undirected graph data structures and algorithms for .NET 2.0 and up. QuickGraph comes with algorithms such as depth-first search, breadth-first search, A* search, shortest path, k-shortest path, maximum flow, minimum spanning tree, least common ancestors, etc.
This sounds like a perfect use case for either a Delegate or a Lambda Expression.
Using Lambda Expressions for Tree Traversal – C#
http://blog.aggregatedintelligence.com/2010/05/using-lambda-expressions-for-tree.html
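To sketch the delegate idea concretely: the only problem-specific pieces are the successor function and the goal test (plus a cost/heuristic for informed search), so they can be passed in as delegates and the traversal written once. A minimal generic breadth-first search along those lines, with illustrative names, might look like this:

```csharp
using System;
using System.Collections.Generic;

static class GraphSearch
{
    // Generic breadth-first search: the caller supplies the successor function
    // and the goal test, so the same code works for any search-state type.
    public static IReadOnlyList<TState> BreadthFirst<TState>(
        TState start,
        Func<TState, IEnumerable<TState>> successors,
        Func<TState, bool> isGoal)
    {
        var frontier = new Queue<TState>();
        var cameFrom = new Dictionary<TState, TState>();
        var visited = new HashSet<TState> { start };
        frontier.Enqueue(start);

        while (frontier.Count > 0)
        {
            var current = frontier.Dequeue();
            if (isGoal(current))
            {
                // Reconstruct the path back to the start state.
                var path = new List<TState> { current };
                while (cameFrom.TryGetValue(current, out var parent))
                {
                    path.Add(parent);
                    current = parent;
                }
                path.Reverse();
                return path;
            }
            foreach (var next in successors(current))
            {
                if (visited.Add(next)) // skip states we have already expanded
                {
                    cameFrom[next] = current;
                    frontier.Enqueue(next);
                }
            }
        }
        return Array.Empty<TState>(); // no path found
    }
}
```

You would call it as `GraphSearch.BreadthFirst(start, s => Expand(s), s => IsSolved(s))`, where `Expand` and `IsSolved` are whatever your search state provides; A* follows the same pattern with a priority queue plus cost and heuristic delegates.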

How do text differencing applications work?

How do applications like DiffMerge detect differences in text files, and how do they determine when a line is genuinely new rather than merely appearing on a different line than in the file it is being compared against?
Is this something that is fairly easy to implement? Are there already libraries to do this?
Here's the paper that served as the basis for the UNIX command-line tool diff.
That's a complex question. Performing a diff means finding the minimum edit distance between the two files, that is, the minimum number of changes you must make to transform one file into the other. This is equivalent to finding the longest common subsequence of lines between the two files, and this is the basis for the various diff programs. The longest common subsequence problem is well known, and you should be able to find the dynamic-programming solution on Google.
The trouble with the dynamic-programming approach is that it's O(n^2). It's thus very slow on large files and unusable for large binary strings. The hard part of writing a diff program is optimizing the algorithm for your problem domain so that you get reasonable performance (and reasonable results). The paper "An Algorithm for Differential File Comparison" by Hunt and McIlroy gives a good description of an early version of the Unix diff utility.
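For reference, the textbook O(n·m) dynamic-programming LCS over lines, which is the starting point that the optimized diff algorithms improve on; this is an illustrative sketch, not the Hunt and McIlroy algorithm itself:

```csharp
using System.Collections.Generic;

static class Lcs
{
    // Classic dynamic-programming LCS over lines: dp[i, j] is the length of
    // the LCS of the first i lines of `a` and the first j lines of `b`.
    // Lines outside the LCS are the ones a diff reports as added or removed.
    public static IList<string> LongestCommonSubsequence(IList<string> a, IList<string> b)
    {
        var dp = new int[a.Count + 1, b.Count + 1];
        for (int i = 1; i <= a.Count; i++)
            for (int j = 1; j <= b.Count; j++)
                dp[i, j] = a[i - 1] == b[j - 1]
                    ? dp[i - 1, j - 1] + 1
                    : System.Math.Max(dp[i - 1, j], dp[i, j - 1]);

        // Walk the table backwards to recover the common lines themselves.
        var common = new List<string>();
        for (int i = a.Count, j = b.Count; i > 0 && j > 0; )
        {
            if (a[i - 1] == b[j - 1]) { common.Add(a[i - 1]); i--; j--; }
            else if (dp[i - 1, j] >= dp[i, j - 1]) i--;
            else j--;
        }
        common.Reverse();
        return common;
    }
}
```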
There are libraries. Here's one: http://code.google.com/p/google-diff-match-patch/
StackOverflow uses Beyond Compare for its diff. I believe it works by calling Beyond Compare from the command line.
It actually is pretty simple: diff programs are, most of the time, based on the Longest Common Subsequence, which can be solved using a graph algorithm.
This web page gives example implementations in C#.
