Levenshtein algorithm with custom character mapping

Levenshtein algorithm with custom character mapping - c#

I want to use Levenshtein algorithm to search in a list of strings. I want to implement a custom character mapping in order to type latin characters and searching in items in greek.
mapping example:
a = α, ά
b = β
i = ι,ί,ΐ,ϊ
... (etc)
u = ου, ού
So searching using abu in a list with
αbu
abού
αού (all greek characters)
will result with all items in the list. (item order is not a problem)
How do I apply a mapping in the algorithm? (this is where I start)

I think the best way would be to preprocess your symbols to one definite form (e.g. all in latin) and then use Levenshtein as you would do normaly.
In pseudocode:
int func(String latinStr, String greekStr) {
String mappedStr = convertToLatin(greekStr); // e.g. now αβ would be ab
return Levenstein(latinStr, mappedStr);
}
And in convertToLatin you may symbol-by-symbol ask Dictionary with mappings for a replace and construct new string

Related

c# compare string irrespective of language

I have a routine that tries to find a specific term in a list of strings.
int FindString(string term, List<string> stringList)
{
for (int i = 0; i < stringList.Count; i++)
{
if (stringList[i].Contains(term))
{
return i;
}
}
return -1;
}
The term is always a Unicode string in English -for example "B4"- while the list of strings contains strings that may be written in other languages. A string might contain "B4" for example but since it was written in Greek, the Contains method returns false when comparing the English and Greek version of basically the same characters.
Is there a way to transform the non-English string so the Contains method will properly return true?
Example term and string (filename in reality):
term: B4
string: 19-299-12-Β4.txt

Basically you need to "normalize" string based on your custom rules and than perform search.
Since there is no generally accepted mapping that include at least "Latin B" equals "Greek B" you have to build your own - basic dictionary Dictionary<char,char> may be enough.
As part of that "normalization" you may also consider digit mapping - for that there is actually official Unicode information available - GetDigitValue.
So overall code to normalize would look like:
var source = "А9"; // Cyrilic A9 - "\u0410\u0039"
var map = new Dictionary<char,char> { { 'А', 'A' } }; // Cyrillic to Latin
var chars = source.Select( c =>
CharUnicodeInfo.GetUnicodeCategory(c)==UnicodeCategory.DecimalDigitNumber?
CharUnicodeInfo.GetDigitValue(c).ToString()[0] :
map.ContainsKey(c) ? map[c] :
c);
var result = String.Join("", chars);
var term = "\u0041\u0039"; // Latin A9
Console.WriteLine(source.Contains(term));
Console.WriteLine(result.Contains(term));

Convert string into three letter Abbreviation

I've recently been given a new project by work to convert Any given string into 1-3 letter abbreviations.
An example of something similar to what I must produce is below however the strings given could be anything:
switch (string.Name)
{
case "Emotional, Social & Personal": return "ESP";
case "Speech & Language": return "SL";
case "Physical Development": return "PD";
case "Understanding the World": return "UW";
case "English": return "E";
case "Expressive Art & Design": return "EAD";
case "Science": return "S";
case "Understanding The World And It's People"; return "UTW";
}
I figured that I could use string.Split & count the number of words in the array. Then add conditions for handling particular length strings as generally these sentences wont be longer than 4 words however problems I will encounter are.
If a string is longer than I expected it wouldn't be handled
Symbols must be excluded from the abbreviation
Any suggestions as to the logic I could apply would be very appreciated.
Thanks

Something like the following should work with the examples you have given.
string abbreviation = new string(
input.Split()
.Where(s => s.Length > 0 && char.IsLetter(s[0]) && char.IsUpper(s[0]))
.Take(3)
.Select(s => s[0])
.ToArray());
You may need to adjust the filter based on your expected input. Possibly adding a list of words to ignore.

It seems that if it doesn't matter, you could just go for the simplest thing. If the string is shorter than 4 words, take the first letter of each string.
If the string is longer than 4, eliminate all "ands", and "ors", then do the same.
To be better, you could have a lookup dictionary of words that you wouldn't care about - like "the" or "so".
You could also keep an 3D char array, in alphabetical order for quick lookup. That way, you wouldn't have any repeating abbreviations.
However, there are only a finite number of abbreviations. Therefore, it might be better to keep the 'useless' words stored in another string. That way, if the abbreviation your program does by default is already taken, you can use the useless words to make a new one.
If all of the above fail, you could start to linearly move through string to get a different 3 letter word abbreviation - sort of like codons on DNA.

Perfect place to use a dictionary
Dictionary<string, string> dict = new Dictionary<string, string>() {
{"Emotional, Social & Personal", "ESP"},
{"Speech & Language","SL"},
{"Physical Development", "PD"},
{"Understanding the World","UW"},
{"English","E"},
{"Expressive Art & Design","EAD"},
{"Science","S"},
{"Understanding The World And It's People","UTW"}
};
string results = dict["English"];

Following snippet may help you:
string input = "Emotional, Social & Personal"; // an example from the question
string plainText = CultureInfo.CurrentCulture.TextInfo.ToTitleCase(Regex.Replace(input, #"[^0-9A-Za-z ,]", "").ToLower()); // will produce a text without special charactors
string abbreviation = String.Join("",plainText.Split(" ".ToCharArray(),StringSplitOptions.RemoveEmptyEntries).Select(y =>y[0]).ToArray());// get first character from each word

How to do a wildcard search in C# on ASP.NET?

I am using MVC3, C#, .net4.0
I have objects that contain a search string with which I can use to search for the relevant objects ie for 4 objects:
[car:vw:engine:1800]
[car:vw:engine:Diesel 1800]
[car:vw:engine:1600]
[car:ford:engine:1800]
I would like to search for objects that have a make of "vw" and "1800" engine.
I could try Contains():
SearchString.Contains("vw:engine:1800")
Which will return just one object.
I need something like:
SearchString.Contains("vw:engine:*1800")
Where * is a wildcard and would pick up :
[car:vw:engine:1800]
[car:vw:engine:Diesel 1800]
The only way around this, at present, would be:
SearchString.Contains("vw:engine:1800") or
SearchString.Contains("vw:engine:Diesel 1800")
Is there a simple way to do this using a mainstream .net function like Contains(), if not Contains() itself.
There is a good reason for me using a search string like this, but this is not part of the question.

You can use regular expressions to check if SearchString is a match. .* means zero or more of any characters and is used in place of your wildcard.
string pattern = #"^\[car:vw:engine:.*1800]$";
bool matches = Regex.IsMatch(SearchString, pattern);

Generally I'd prefer the regular expressions.
In your particular case you could use something like this:
string car1 = "[car:vw:engine:Diesel 1800]";
string car2 = "[car:vw:engine:1800]";
var tokens1 = car1.Substring(1, car1.Length - 2).Split(':');
var tokens2 = car2.Substring(1, car2.Length - 2).Split(':');
bool IsMatch1 = tokens1[3].EndsWith("1800");
bool IsMatch2 = tokens2[3].EndsWith("1800");

How to compare 'μ' and 'µ' in C# [duplicate]

I fall into a surprising issue.
I loaded a text file in my application and I have some logic which compares the value having µ.
And I realized that even if the texts are same the compare value is false.
Console.WriteLine("μ".Equals("µ")); // returns false
Console.WriteLine("µ".Equals("µ")); // return true
In later line the character µ is copy pasted.
However, these might not be the only characters that are like this.
Is there any way in C# to compare the characters which look the same but are actually different?

Because it is really different symbols even they look the same, first is the actual letter and has char code = 956 (0x3BC) and the second is the micro sign and has 181 (0xB5).
References:
Unicode Character 'GREEK SMALL LETTER MU' (U+03BC)
Unicode Character 'MICRO SIGN' (U+00B5)
So if you want to compare them and you need them to be equal, you need to handle it manually, or replace one char with another before comparison. Or use the following code:
public void Main()
{
var s1 = "μ";
var s2 = "µ";
Console.WriteLine(s1.Equals(s2)); // false
Console.WriteLine(RemoveDiacritics(s1).Equals(RemoveDiacritics(s2))); // true
}
static string RemoveDiacritics(string text)
{
var normalizedString = text.Normalize(NormalizationForm.FormKC);
var stringBuilder = new StringBuilder();
foreach (var c in normalizedString)
{
var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
if (unicodeCategory != UnicodeCategory.NonSpacingMark)
{
stringBuilder.Append(c);
}
}
return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}
And the Demo

In many cases, you can normalize both of the Unicode characters to a certain normalization form before comparing them, and they should be able to match. Of course, which normalization form you need to use depends on the characters themselves; just because they look alike doesn't necessarily mean they represent the same character. You also need to consider if it's appropriate for your use case — see Jukka K. Korpela's comment.
For this particular situation, if you refer to the links in Tony's answer, you'll see that the table for U+00B5 says:
Decomposition <compat> GREEK SMALL LETTER MU (U+03BC)
This means U+00B5, the second character in your original comparison, can be decomposed to U+03BC, the first character.
So you'll normalize the characters using full compatibility decomposition, with the normalization forms KC or KD. Here's a quick example I wrote up to demonstrate:
using System;
using System.Text;
class Program
{
static void Main(string[] args)
{
char first = 'μ';
char second = 'µ';
// Technically you only need to normalize U+00B5 to obtain U+03BC, but
// if you're unsure which character is which, you can safely normalize both
string firstNormalized = first.ToString().Normalize(NormalizationForm.FormKD);
string secondNormalized = second.ToString().Normalize(NormalizationForm.FormKD);
Console.WriteLine(first.Equals(second)); // False
Console.WriteLine(firstNormalized.Equals(secondNormalized)); // True
}
}
For details on Unicode normalization and the different normalization forms refer to System.Text.NormalizationForm and the Unicode spec.

They both have different character codes: Refer this for more details
Console.WriteLine((int)'μ'); //956
Console.WriteLine((int)'µ'); //181
Where, 1st one is:
Display Friendly Code Decimal Code Hex Code Description
====================================================================
μ μ μ μ Lowercase Mu
µ µ µ µ micro sign Mu

For the specific example of μ (mu) and µ (micro sign), the latter has a compatibility decomposition to the former, so you can normalize the string to FormKC or FormKD to convert the micro signs to mus.
However, there are lots of sets of characters that look alike but aren't equivalent under any Unicode normalization form. For example, A (Latin), Α (Greek), and А (Cyrillic). The Unicode website has a confusables.txt file with a list of these, intended to help developers guard against homograph attacks. If necessary, you could parse this file and build a table for “visual normalization” of strings.

Search both characters in a Unicode database and see the difference.
One is the Greek small Letter µ and the other is the Micro Sign µ.
Name : MICRO SIGN
Block : Latin-1 Supplement
Category : Letter, Lowercase [Ll]
Combine : 0
BIDI : Left-to-Right [L]
Decomposition : <compat> GREEK SMALL LETTER MU (U+03BC)
Mirror : N
Index entries : MICRO SIGN
Upper case : U+039C
Title case : U+039C
Version : Unicode 1.1.0 (June, 1993)
Name : GREEK SMALL LETTER MU
Block : Greek and Coptic
Category : Letter, Lowercase [Ll]
Combine : 0
BIDI : Left-to-Right [L]
Mirror : N
Upper case : U+039C
Title case : U+039C
See Also : micro sign U+00B5
Version : Unicode 1.1.0 (June, 1993)

EDIT After the merge of this question with How to compare 'μ' and 'µ' in C#
Original answer posted:
"μ".ToUpper().Equals("µ".ToUpper()); //This always return true.
EDIT
After reading the comments, yes it is not good to use the above method because it may provide wrong results for some other type of inputs, for this we should use normalize using full compatibility decomposition as mentioned in wiki. (Thanks to the answer posted by BoltClock)
static string GREEK_SMALL_LETTER_MU = new String(new char[] { '\u03BC' });
static string MICRO_SIGN = new String(new char[] { '\u00B5' });
public static void Main()
{
string Mus = "µμ";
string NormalizedString = null;
int i = 0;
do
{
string OriginalUnicodeString = Mus[i].ToString();
if (OriginalUnicodeString.Equals(GREEK_SMALL_LETTER_MU))
Console.WriteLine(" INFORMATIO ABOUT GREEK_SMALL_LETTER_MU");
else if (OriginalUnicodeString.Equals(MICRO_SIGN))
Console.WriteLine(" INFORMATIO ABOUT MICRO_SIGN");
Console.WriteLine();
ShowHexaDecimal(OriginalUnicodeString);
Console.WriteLine("Unicode character category " + CharUnicodeInfo.GetUnicodeCategory(Mus[i]));
NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormC);
Console.Write("Form C Normalized: ");
ShowHexaDecimal(NormalizedString);
NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormD);
Console.Write("Form D Normalized: ");
ShowHexaDecimal(NormalizedString);
NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormKC);
Console.Write("Form KC Normalized: ");
ShowHexaDecimal(NormalizedString);
NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormKD);
Console.Write("Form KD Normalized: ");
ShowHexaDecimal(NormalizedString);
Console.WriteLine("_______________________________________________________________");
i++;
} while (i < 2);
Console.ReadLine();
}
private static void ShowHexaDecimal(string UnicodeString)
{
Console.Write("Hexa-Decimal Characters of " + UnicodeString + " are ");
foreach (short x in UnicodeString.ToCharArray())
{
Console.Write("{0:X4} ", x);
}
Console.WriteLine();
}
Output
INFORMATIO ABOUT MICRO_SIGN
Hexa-Decimal Characters of µ are 00B5
Unicode character category LowercaseLetter
Form C Normalized: Hexa-Decimal Characters of µ are 00B5
Form D Normalized: Hexa-Decimal Characters of µ are 00B5
Form KC Normalized: Hexa-Decimal Characters of µ are 03BC
Form KD Normalized: Hexa-Decimal Characters of µ are 03BC
________________________________________________________________
INFORMATIO ABOUT GREEK_SMALL_LETTER_MU
Hexa-Decimal Characters of µ are 03BC
Unicode character category LowercaseLetter
Form C Normalized: Hexa-Decimal Characters of µ are 03BC
Form D Normalized: Hexa-Decimal Characters of µ are 03BC
Form KC Normalized: Hexa-Decimal Characters of µ are 03BC
Form KD Normalized: Hexa-Decimal Characters of µ are 03BC
________________________________________________________________
While reading information in Unicode_equivalence I found
The choice of equivalence criteria can affect search results. For instance some typographic ligatures like U+FB03 (ﬃ), ..... so a search for U+0066 (f) as substring would succeed in an NFKC normalization of U+FB03 but not in NFC normalization of U+FB03.
So to compare equivalence we should normally use FormKC i.e. NFKC normalization or FormKD i.e NFKD normalization.
I was little curious to know more about all the Unicode characters so I made sample which would iterate over all the Unicode character in UTF-16 and I got some results I want to discuss
Information about characters whose FormC and FormD normalized values were not equivalent
Total: 12,118
Character (int value): 192-197, 199-207, 209-214, 217-221, 224-253, ..... 44032-55203
Information about characters whose FormKC and FormKD normalized values were not equivalent
Total: 12,245
Character (int value): 192-197, 199-207, 209-214, 217-221, 224-228, ..... 44032-55203, 64420-64421, 64432-64433, 64490-64507, 64512-64516, 64612-64617, 64663-64667, 64735-64736, 65153-65164, 65269-65274
All the character whose FormC and FormD normalized value were not equivalent, there FormKC and FormKD normalized values were also not equivalent except these characters
Characters: 901 '΅', 8129 '῁', 8141 '῍', 8142 '῎', 8143 '῏', 8157 '῝', 8158 '῞'
, 8159 '῟', 8173 '῭', 8174 '΅'
Extra character whose FormKC and FormKD normalized value were not equivalent, but there FormC and FormD normalized values were equivalent
Total: 119
Characters: 452 'Ǆ' 453 'ǅ' 454 'ǆ' 12814 '㈎' 12815 '㈏' 12816 '㈐' 12817 '㈑' 12818 '㈒'
12819 '㈓' 12820 '㈔' 12821 '㈕', 12822 '㈖' 12823 '㈗' 12824 '㈘' 12825 '㈙' 12826 '㈚'
12827 '㈛' 12828 '㈜' 12829 '㈝' 12830 '㈞' 12910 '㉮' 12911 '㉯' 12912 '㉰' 12913 '㉱'
12914 '㉲' 12915 '㉳' 12916 '㉴' 12917 '㉵' 12918 '㉶' 12919 '㉷' 12920 '㉸' 12921 '㉹' 12922 '㉺' 12923 '㉻' 12924 '㉼' 12925 '㉽' 12926 '㉾' 13056 '㌀' 13058 '㌂' 13060 '㌄' 13063 '㌇' 13070 '㌎' 13071 '㌏' 13072 '㌐' 13073 '㌑' 13075 '㌓' 13077 '㌕' 13080 '㌘' 13081 '㌙' 13082 '㌚' 13086 '㌞' 13089 '㌡' 13092 '㌤' 13093 '㌥' 13094 '㌦' 13099 '㌫' 13100 '㌬' 13101 '㌭' 13102 '㌮' 13103 '㌯' 13104 '㌰' 13105 '㌱' 13106 '㌲' 13108 '㌴' 13111 '㌷' 13112 '㌸' 13114 '㌺' 13115 '㌻' 13116 '㌼' 13117 '㌽' 13118 '㌾' 13120 '㍀' 13130 '㍊' 13131 '㍋' 13132 '㍌' 13134 '㍎' 13139 '㍓' 13140 '㍔' 13142 '㍖' .......... ﺋ' 65164 'ﺌ' 65269 'ﻵ' 65270 'ﻶ' 65271 'ﻷ' 65272 'ﻸ' 65273 'ﻹ' 65274'
There are some characters which can not be normalized, they throw ArgumentException if tried
Total:2081
Characters(int value): 55296-57343, 64976-65007, 65534
This links can be really helpful to understand what rules govern for Unicode equivalence
Unicode_equivalence
Unicode_compatibility_characters

Most likely, there are two different character codes that make (visibly) the same character. While technically not equal, they look equal. Have a look at the character table and see whether there are multiple instances of that character. Or print out the character code of the two chars in your code.

You ask "how to compare them" but you don't tell us what you want to do.
There are at least two main ways to compare them:
Either you compare them directly as you are and they are different
Or you use Unicode Compatibility Normalization if your need is for a comparison that finds them to match.
There could be a problem though because Unicode compatibility normalization will make many other characters compare equal. If you want only these two characters to be treated as alike you should roll your own normalization or comparison functions.
For a more specific solution we need to know your specific problem. What is the context under which you came across this problem?

If I would like to be pedantic, I would say that your question doesn't make sense, but since we are approaching christmas and the birds are singing, I'll proceed with this.
First off, the 2 entities that you are trying to compare are glyphs, a glyph is part of a set of glyphs provided by what is usually know as a "font", the thing that usually comes in a ttf, otf or whatever file format you are using.
The glyphs are a representation of a given symbol, and since they are a representation that depends on a specific set, you can't just expect to have 2 similar or even "better" identical symbols, it's a phrase that doesn't make sense if you consider the context, you should at least specify what font or set of glyphs you are considering when you formulate a question like this.
What is usually used to solve a problem similar to the one that you are encountering, it's an OCR, essentially a software that recognize and compares glyphs, If C# provides an OCR by default I don't know that, but it's generally a really bad idea if you don't really need an OCR and you know what to do with it.
You can possibly end up interpreting a physics book as an ancient greek book without mentioning the fact that OCR are generally expensive in terms of resources.
There is a reason why those characters are localized the way they are localized, just don't do that.

It's possible to draw both of chars with the same font style and size with DrawString method. After two bitmaps with symbols has been generated, it's possible to compare them pixel by pixel.
Advantage of this method is that you can compare not only absolute equal charcters, but similar too (with definite tolerance).

string handling

I would like to know that if I have an english dictionary in a text file what is the best way to check whether a given string is a proper and correct english word. My dictionary contains about 100000 english words and I have to check on an average of 60000 words in one go. I am just looking for the most efficient way. Also should I store all the strings first or I just process them as they are generated.
Thanx

100k is not too great a number, so you can just pop everything in a Hashset<string>.
Hashset lookup is key-based, so it will be lightning fast.
example how this might look in code is:
string[] lines = File.ReadAllLines(#"C:\MyDictionary.txt");
HashSet<string> myDictionary = new HashSet<string>();
foreach (string line in lines)
{
myDictionary.Add(line);
}
string word = "aadvark";
if (myDictionary.Contains(word))
{
Console.WriteLine("There is an aadvark");
}
else
{
Console.WriteLine("The aadvark is a lie");
}

You should probably use HashSet<string> if you're using .NET 3.5 or higher.
Just load the dictionary of valid words into a HashSet<string> and then either use Contains on each candidate string, or use some of the set operators to find all words which aren't valid.
For example:
// There are loads of ways of loading words from a file, of course
var valid = new HashSet<string>(File.ReadAllLines("dictionary.txt"));
var candidates = new HashSet<string>(File.ReadAllLines("candidate.txt"));
var validCandidates = candidates.Intersect(valid);
var invalidCandidates = candidates.Except(valid);
You may also wish to use case-insensitive comparisons or something similar - use the StringComparer static properties to get at appropriate instances of StringComparer which you can pass to the HashSet constructor.
If you're using .NET 2, you can use a Dictionary<string, whatever> as a poor-man's set - basically use whatever you like as the value, and just check for keys.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Levenshtein algorithm with custom character mapping - c#

Related

c# compare string irrespective of language

Convert string into three letter Abbreviation

How to do a wildcard search in C# on ASP.NET?

How to compare 'μ' and 'µ' in C# [duplicate]

string handling

Categories

Resources