I'm writing a desktop UI (.NET WinForms) to assist a photographer in cleaning up his image metadata. There is a list of 66k+ phrases. Can anyone suggest a good open-source/free .NET component that employs some sort of algorithm to identify potential candidates for consolidation? For example, there may be two or more entries that are actually the same word or phrase and differ only by whitespace, punctuation, or a slight misspelling. The application will ultimately rely on the user to action the consolidation of phrases, but having an effective way to automatically find potential candidates will prove invaluable.
Let me introduce you to the Levenshtein distance formula. It is awesome:
http://en.wikipedia.org/wiki/Levenshtein_distance
In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences. The term edit distance is often used to refer specifically to Levenshtein distance.
Personally I used this in a healthcare setting, where Provider names were checked for duplicates. Using the Levenshtein process, we gave them a confidence rating and allowed them to determine if it was a true duplicate or something unique.
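For concreteness, here is a minimal C# sketch of the standard dynamic-programming formulation (the class and method names are just illustrative):

```csharp
using System;

public static class FuzzyText
{
    // Classic dynamic-programming Levenshtein distance: the number of
    // single-character insertions, deletions, and substitutions needed
    // to turn 'source' into 'target'. Uses two rolling rows of the table.
    public static int LevenshteinDistance(string source, string target)
    {
        if (string.IsNullOrEmpty(source)) return target?.Length ?? 0;
        if (string.IsNullOrEmpty(target)) return source.Length;

        var previous = new int[target.Length + 1];
        var current = new int[target.Length + 1];

        for (int j = 0; j <= target.Length; j++) previous[j] = j;

        for (int i = 1; i <= source.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= target.Length; j++)
            {
                int cost = source[i - 1] == target[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1,   // insertion
                             previous[j] + 1),     // deletion
                    previous[j - 1] + cost);       // substitution
            }
            (previous, current) = (current, previous);
        }
        return previous[target.Length];
    }
}
```

You can turn the raw distance into a rough confidence score, e.g. `1.0 - distance / (double)Math.Max(a.Length, b.Length)`, and only surface candidate pairs above some threshold for the user to confirm.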
I know this is an old question, but I feel like this answer can help people who are dealing with the same issue in current time.
Please have a look at https://github.com/JakeBayer/FuzzySharp
It is a C# NuGet package with multiple methods, each implementing a particular approach to fuzzy search. I'm not sure, but perhaps Fosco's answer is also used in one of them.
Edit:
I just noticed a comment about this package, but I think it deserves a better place inside this question
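For reference, usage looks roughly like this (the method names follow the project's README; double-check them against the version you install):

```csharp
using System;
using FuzzySharp;

class Demo
{
    static void Main()
    {
        // Similarity score between two strings, on a 0-100 scale.
        int score = Fuzz.Ratio("mysmilarstring", "myawfullysimilarstirng");
        Console.WriteLine(score);

        // Find the closest matches to a query within a list of candidates.
        var candidates = new[] { "Dallas Cowboys", "New York Jets", "new york mets" };
        foreach (var match in Process.ExtractTop("new york jets", candidates, limit: 2))
            Console.WriteLine($"{match.Value}: {match.Score}");
    }
}
```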
I found a lot of implementations that calculate the Levenshtein distance between two strings, but is there any implementation that can generate all variations within a Levenshtein distance (max 2) of one given string?
The reason is that I'm using Elasticsearch to execute some fuzzy searches, but with the query load I have there are performance issues, because Elasticsearch recalculates those possibilities on every query; I want to compute and store those values once.
The most commonly cited reference implementation for generating edit-distance variations is in Python; you can see it in this answer.
The original author linked subsequent implementations in other languages at the bottom of his blog under the heading Other Computer Languages. There are 4 implementations in C#; this one in particular is functional (I'm unsure under what license those implementations are published, so I won't transcribe them into this thread).
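As an illustration of the general technique (not a transcription of any of the linked implementations), generating all distance-1 variations in C# looks roughly like this; apply it again to each result to reach distance 2:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class EditGenerator
{
    const string Alphabet = "abcdefghijklmnopqrstuvwxyz";

    // All strings within Levenshtein distance 1 of 'word'.
    public static IEnumerable<string> Edits1(string word)
    {
        var splits = Enumerable.Range(0, word.Length + 1)
            .Select(i => (Left: word.Substring(0, i), Right: word.Substring(i)));

        foreach (var (left, right) in splits)
        {
            if (right.Length > 0)
                yield return left + right.Substring(1);                       // deletion
            if (right.Length > 1)
                yield return left + right[1] + right[0] + right.Substring(2); // transposition
            foreach (char c in Alphabet)
            {
                if (right.Length > 0)
                    yield return left + c + right.Substring(1);               // replacement
                yield return left + c + right;                                // insertion
            }
        }
    }

    // Everything within distance <= 2: apply Edits1 to each distance-1 edit.
    public static IEnumerable<string> Edits2(string word) =>
        Edits1(word).SelectMany(Edits1).Distinct();
}
```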
But using wildcard searches with Elasticsearch is the correct approach. The engine will implement approximate string matching as efficiently as possible; there are a number of different algorithms this can be based on, and the optimal choice depends on your data structure, etc.
You can simplify the work by generating the edit variations yourself, but in most cases, if you're using a database or search engine, its implementation will have better performance. (This is a computationally expensive task; there's no way around that.)
A little personal project of mine is to blindly produce a search engine from scratch without using any outside sources. This is mostly for a learning experience and I haven't had much trouble up until now, where I have both a dilemma and a tough problem.
Observe this case:
Suzy wants to search for "fuzzy bears". This is fine, and it functions as well as it can. However, Suzy screws up and types "fuzzybears". Now my search algorithm breaks down, since this is interpreted as a single token rather than multiple tokens. Any case or combination of words with even one occurrence of such a run-on term, or glued tokens, produces a poor search result.
For scope, this is something I am writing using a combination of C# and T-SQL.
I've tried multiple solutions, but nothing has really come of them. First, I used a List to take the terms and create variations, but this was much too slow for my liking and required far more memory than I feel it should.
I wanted to save search queries to a database for statistics and maybe to learn more about organically growing the algorithm, so maybe a way to handle these glued tokens in SQL could be a solution, but I have no clue how to start with something like that unless I used a cursor or some other slow solution.
I could take searches, save them to my database, create different combinations where some tokens are glued, and then have those glued tokens as terms to hit on? The issue with this solution is that it takes up quite a bit of space, and I won't always need these strings, since spelling errors like this aren't all that common.
Mainly, what I need is speed. It doesn't really have to be pretty, but if it's fast and accurate then I'm happy even if it takes up a lot of disk space.
I'm not asking for complete solutions here, but if anyone can point me in a direction I could go, it would be greatly appreciated.
Consider this approach: since spaces, punctuation, and anything similar would screw up a search like this, remove all of those, convert to a common case (I prefer lowercase, but pick what you prefer), and then tokenize based on syllables, using roughly the same set of division rules as for hyphenating English words.
So, to search for answers that contain "Consider this approach:", you reduce the phrase to "considerthisapproach" and then tokenize as "con","sid","er","this","ap","proach". If con and sid and er appear next to each other, and in that order, you've found the word "consider".
This approach can be adapted for statistical matching too, so e.g. if at least 85% of syllables are found in the correct order, you consider it a close match, and maybe order the results by match % so more meaningful matches are at the top.
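A rough sketch of the normalization step, plus a deliberately crude vowel-group splitter standing in for real hyphenation rules (which are considerably more involved):

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GluedTokenizer
{
    // Strip everything except letters and digits and lowercase the result,
    // so "Fuzzy Bears!" and "fuzzybears" normalize to the same string.
    public static string Normalize(string input) =>
        Regex.Replace(input, @"[^A-Za-z0-9]", "").ToLowerInvariant();

    // Very naive syllable-ish split: cut after each vowel group, with any
    // trailing consonants as a final chunk. Real hyphenation rules do much better.
    public static IEnumerable<string> Tokenize(string normalized)
    {
        foreach (Match m in Regex.Matches(normalized, "[^aeiou]*[aeiou]+|[^aeiou]+"))
            yield return m.Value;
    }
}

// Normalize("Consider this approach:") -> "considerthisapproach"
// Tokenize("considerthisapproach")     -> "co", "nsi", "de", "rthi", "sa", "pproa", "ch"
```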
I have a website where I give users the opportunity to share their status. How can I detect whether any abusive or slang words are used, so that I can block such comments?
Is there any library or trick to detect these kinds of comments in .NET?
It is not a trick: use a dictionary of bad words, and add some logic to detect "bad words" appearing in good places. Add the ability for users to post complaints about mis-corrections by your logic (so you can fine-tune it), and that's it.
Implementation is pretty easy, and as for a dictionary of "bad words", either look one up or write your own.
(I used to collect bad words from customer complaints on a chat service; after a year it was almost bulletproof.)
This is actually quite difficult to automate and do accurately without unintended side effects. You can maintain a dictionary of bad words, and use regular expressions to replace occurrences of those bad words. Please see my answer to the following question for example code, plus some of the issues:
Replace Bad words using Regex
Automated approaches have a number of shortcomings: false positives, missing bad words that are not in the dictionary, and minor variations of bad words that go undetected. Involvement from users can be used to bolster, or as an alternative to, the automated approach; e.g. SO has the ability to flag comments, and moderators can delete or censor them.
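A minimal sketch of the dictionary-plus-regex idea (the word list below is obviously a stand-in; in practice it would come from a maintained list plus user reports):

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class ProfanityFilter
{
    // Stand-in list; load the real one from a file or database.
    static readonly string[] BadWords = { "badword", "swear", "slur" };

    // Word-boundary match so "class" is not flagged for containing "ass", etc.
    static readonly Regex BadWordPattern = new Regex(
        @"\b(" + string.Join("|", BadWords.Select(Regex.Escape)) + @")\b",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Replace each banned word with asterisks of the same length.
    public static string Censor(string input) =>
        BadWordPattern.Replace(input, m => new string('*', m.Length));

    // Or simply flag the comment for moderation instead of rewriting it.
    public static bool ContainsBadWord(string input) =>
        BadWordPattern.IsMatch(input);
}
```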
There are some bad word lists around which you can download and use.
e.g. http://urbanoalvarez.es/blog/2008/04/04/bad-words-list/
The best thing to do is to start with a small list and add to it based on the real comments made on your site. You can put a report link on the comments so other visitors can notify you if there are bad comments made.
What I need is: plot creation, tools for interpolation, and tools for computing things such as

max_{x in (-a, a)} |f(x) - L(x)|

and

\int_{-a}^{a} |f(x) - L(x)| dx

where L(x) is an interpolant built from data points generated from the original, known function f(x). That is, we know the original function and we have a known range (-a, a). We need a library to help us compute data points in that range and to build the interpolating polynomial L(x) from that data over that range.
I need this library to be free and open source.
Perhaps Math.NET can help you.
Check this other answer, https://stackoverflow.com/questions/1387430/recommended-math-library-for-c-net; in particular, several people think that MathDotNet is nice.
For plot creation, you may want Excel interop (why not?), or ILNumerics.NET.
But I don't understand the other requirements. You want to measure interpolation errors (in the max and L1 norms) of a function you don't know? This is not a programming question; it is a math question.
I suggest you look at interpolation libraries (Math.NET contains one for instance, but many others also do) and see if they provide such things as "error estimation".
Otherwise, what you need is a math book that explains the assumptions on f that you need in order to estimate the interpolation error. It depends on what you know about the regularity of f and on the interpolation method.
Edit, regarding the additional information provided: there are closed-form formulas for interpolation errors (see here as a starting point). But any numerical integration routine (which Math.NET does not provide) will get you what you want. Have a look at the libraries other people pointed out; this link will get you started.
Since you seem to have regular functions (you are doing polynomial interpolation, after all), I'd go with simple Romberg integration, which is quite easy to implement in case you don't find a library that suits your needs (though I doubt that will happen). Have a look at Numerical Recipes, 3rd edition, for sample code.
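If you do end up rolling it yourself, a bare-bones Romberg routine is short. Here is a sketch (the function and parameter names are mine) that could be used, for example, to estimate the L1 interpolation error by integrating |f(x) - L(x)| over (-a, a):

```csharp
using System;

public static class Quadrature
{
    // Romberg integration of f over [a, b]: a triangular table of
    // Richardson-extrapolated trapezoid estimates.
    public static double Romberg(Func<double, double> f, double a, double b,
                                 int maxLevels = 16, double tolerance = 1e-10)
    {
        var r = new double[maxLevels, maxLevels];
        double h = b - a;
        r[0, 0] = 0.5 * h * (f(a) + f(b));

        for (int i = 1; i < maxLevels; i++)
        {
            h *= 0.5;
            // Refined trapezoid rule: only the new midpoints are evaluated.
            double sum = 0.0;
            for (int k = 1; k <= 1 << (i - 1); k++)
                sum += f(a + (2 * k - 1) * h);
            r[i, 0] = 0.5 * r[i - 1, 0] + h * sum;

            // Richardson extrapolation across the row.
            for (int j = 1; j <= i; j++)
            {
                double factor = Math.Pow(4, j);
                r[i, j] = (factor * r[i, j - 1] - r[i - 1, j - 1]) / (factor - 1);
            }

            if (i > 2 && Math.Abs(r[i, i] - r[i - 1, i - 1]) < tolerance)
                return r[i, i];
        }
        return r[maxLevels - 1, maxLevels - 1];
    }
}

// Example: L1 error of the interpolant L against f on (-a, a):
// double l1Error = Quadrature.Romberg(x => Math.Abs(f(x) - L(x)), -a, a);
```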
What about using Mathematica?
Math.NET and ILNumerics.Net are both open source and will both solve your equations.
I am creating an application in .NET.
I found an existing application, http://www.spinnerchief.com/. It does what I need, but I could not find any help from Google on how to build something similar.
I need this functionality in my application: a user enters one sentence and then gets back the same sentence, worded differently.
Here is an example of what I want.
Suppose I enter the sentence "Pankaj is a good man." The output should be similar to the following:
Pankaj is a great person.
Pankaj is a superb man.
Pankaj is an acceptable guy.
Pankaj is a wonderful dude.
Pankaj is a superb male.
Pankaj is a good human.
Pankaj is a splendid gentleman.
To do this correctly for any arbitrary sentence you would need to perform natural language analysis of the source sentence. You may want to look into the SharpNLP library - it's a free library of natural language processing tools for C#/.NET.
If you're looking for a simpler approach, you have to be willing to sacrifice correctness to some degree. For instance, you could create a dictionary of trigger words, which - when they appear in a sentence - are replaced with synonyms from a thesaurus. The problem with this approach is making sure that you replace a word with an equivalent part of speech. In English, it's possible for certain words to be different parts of speech (verb, adjective, adverb, etc) based on their contextual usage in a sentence.
An additional consideration you'll need to address (if you're not using an NLP library) is stemming. In most languages, certain parts of speech are conjugated/modified (verbs in English) based on the subject they apply to (or the object, speaker, or tense of the sentence).
If all you want to do is replace adjectives (as in your example), the trigger-word approach may work, but it won't be readily extensible. Before you do anything, I would suggest that you clearly define the requirements and rules for your problem domain ... and use that to decide which route to take.
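To make the trigger-word idea concrete, here is a minimal sketch. The synonym table is a toy stand-in for a real thesaurus, and it does nothing about part-of-speech ambiguity or conjugation:

```csharp
using System;
using System.Collections.Generic;

public static class NaiveSpinner
{
    // Toy synonym table; a real application would load these from a thesaurus
    // and would need part-of-speech checks before substituting.
    static readonly Dictionary<string, string[]> Synonyms =
        new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase)
        {
            ["good"] = new[] { "great", "superb", "wonderful" },
            ["man"]  = new[] { "person", "guy", "gentleman" },
        };

    static readonly Random Rng = new Random();

    public static string Spin(string sentence)
    {
        var words = sentence.Split(' ');
        for (int i = 0; i < words.Length; i++)
        {
            // Strip trailing punctuation so "man." still matches "man".
            string core = words[i].TrimEnd('.', ',', '!', '?');
            if (Synonyms.TryGetValue(core, out var options))
            {
                string replacement = options[Rng.Next(options.Length)];
                words[i] = replacement + words[i].Substring(core.Length);
            }
        }
        return string.Join(" ", words);
    }
}

// NaiveSpinner.Spin("Pankaj is a good man.") -> e.g. "Pankaj is a superb person."
```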
For this, the best thing for you to use is WordNet and its hyponym/hypernym relations. There is a WordNet .Net library. For each word you want to alternate, you can either get its hypernym (i.e. for person, a hypernym means "person is a kind of...") or a hyponym ("X is a kind of person"). Then just replace the word you are alternating.
You will want to make sure you have the correct part of speech (i.e. noun, adjective, verb...), and there is also the issue of senses, which may introduce some undesired alternations (sense #1 is the most common).
I don't know anything about .Net, but you should look into using a dictionary function (I'm sure there is one, or at least a library that streamlines the process if there isn't).
Then you'd have to go through the string and omit words like "is" or "a", taking only the words you want synonyms for.
After this, it's pretty simple to have a loop spit out your sentences.
Good luck.