I ran into a surprising issue.
I loaded a text file in my application, and I have some logic that compares values containing µ.
I realized that even though the texts look the same, the comparison returns false.
Console.WriteLine("μ".Equals("µ")); // returns false
Console.WriteLine("µ".Equals("µ")); // returns true
In the latter line the character µ is copy-pasted.
However, these might not be the only characters that are like this.
Is there any way in C# to compare the characters which look the same but are actually different?
They really are different symbols even though they look the same: the first is the actual Greek letter and has char code 956 (0x3BC), and the second is the micro sign and has 181 (0xB5).
References:
Unicode Character 'GREEK SMALL LETTER MU' (U+03BC)
Unicode Character 'MICRO SIGN' (U+00B5)
So if you want to compare them and need them to be equal, you have to handle it manually and replace one character with the other before comparison, or use the following code:
using System;
using System.Globalization;
using System.Text;

public void Main()
{
    var s1 = "μ";
    var s2 = "µ";
    Console.WriteLine(s1.Equals(s2));                                     // false
    Console.WriteLine(RemoveDiacritics(s1).Equals(RemoveDiacritics(s2))); // true
}

static string RemoveDiacritics(string text)
{
    // FormKC applies compatibility decomposition (which maps U+00B5 to U+03BC)
    // and recomposes; non-spacing marks are then dropped below.
    var normalizedString = text.Normalize(NormalizationForm.FormKC);
    var stringBuilder = new StringBuilder();
    foreach (var c in normalizedString)
    {
        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
        if (unicodeCategory != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }
    return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}
In many cases, you can normalize both of the Unicode characters to a certain normalization form before comparing them, and they should be able to match. Of course, which normalization form you need to use depends on the characters themselves; just because they look alike doesn't necessarily mean they represent the same character. You also need to consider if it's appropriate for your use case — see Jukka K. Korpela's comment.
For this particular situation, if you refer to the links in Tony's answer, you'll see that the table for U+00B5 says:
Decomposition <compat> GREEK SMALL LETTER MU (U+03BC)
This means U+00B5, the second character in your original comparison, can be decomposed to U+03BC, the first character.
So you'll normalize the characters using full compatibility decomposition, with the normalization forms KC or KD. Here's a quick example I wrote up to demonstrate:
using System;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        char first = 'μ';
        char second = 'µ';

        // Technically you only need to normalize U+00B5 to obtain U+03BC, but
        // if you're unsure which character is which, you can safely normalize both
        string firstNormalized = first.ToString().Normalize(NormalizationForm.FormKD);
        string secondNormalized = second.ToString().Normalize(NormalizationForm.FormKD);

        Console.WriteLine(first.Equals(second));                     // False
        Console.WriteLine(firstNormalized.Equals(secondNormalized)); // True
    }
}
For details on Unicode normalization and the different normalization forms refer to System.Text.NormalizationForm and the Unicode spec.
They both have different character codes. Refer to the following for more details:
Console.WriteLine((int)'μ'); //956
Console.WriteLine((int)'µ'); //181
Where the first one is the lowercase Greek letter and the second is the micro sign:

Display   Friendly Code   Decimal Code   Hex Code    Description
====================================================================
μ         &mu;            &#956;         &#x3BC;     Lowercase Mu
µ         &micro;         &#181;         &#xB5;      Micro Sign
For the specific example of μ (mu) and µ (micro sign), the latter has a compatibility decomposition to the former, so you can normalize the string to FormKC or FormKD to convert the micro signs to mus.
However, there are lots of sets of characters that look alike but aren't equivalent under any Unicode normalization form. For example, A (Latin), Α (Greek), and А (Cyrillic). The Unicode website has a confusables.txt file with a list of these, intended to help developers guard against homograph attacks. If necessary, you could parse this file and build a table for “visual normalization” of strings.
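If you decide to go that route, here is a rough sketch of what parsing confusables.txt and applying the resulting map could look like. It assumes the published file layout (semicolon-separated fields, code points in hex, # comments), only handles single-code-point sources, and skips the NFD steps of the full "skeleton" algorithm, so treat it as a starting point rather than a complete implementation:

using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Text;

static class Confusables
{
    // Parse lines of the form "0430 ; 0061 ; MA # ( а → a ) ..." into a map
    // from the source character to its "prototype" string.
    public static Dictionary<string, string> Load(string path)
    {
        var map = new Dictionary<string, string>();
        foreach (var raw in File.ReadLines(path))
        {
            var line = raw.Split('#')[0].Trim();   // strip comments
            if (line.Length == 0) continue;

            var fields = line.Split(';');
            if (fields.Length < 2) continue;

            string source = CodePointsToString(fields[0]);
            string target = CodePointsToString(fields[1]);
            if (source.Length > 0) map[source] = target;
        }
        return map;
    }

    static string CodePointsToString(string field)
    {
        var sb = new StringBuilder();
        foreach (var cp in field.Trim().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            sb.Append(char.ConvertFromUtf32(int.Parse(cp, NumberStyles.HexNumber)));
        return sb.ToString();
    }

    // "Visual normalization": replace each code point that has a confusable
    // mapping with its prototype, and leave everything else alone.
    public static string VisualNormalize(string text, Dictionary<string, string> map)
    {
        var sb = new StringBuilder();
        int i = 0;
        while (i < text.Length)
        {
            string key = char.ConvertFromUtf32(char.ConvertToUtf32(text, i));
            sb.Append(map.TryGetValue(key, out var mapped) ? mapped : key);
            i += char.IsSurrogatePair(text, i) ? 2 : 1;
        }
        return sb.ToString();
    }
}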
Search both characters in a Unicode database and see the difference.
One is the Greek Small Letter Mu (μ) and the other is the Micro Sign (µ).
Name : MICRO SIGN
Block : Latin-1 Supplement
Category : Letter, Lowercase [Ll]
Combine : 0
BIDI : Left-to-Right [L]
Decomposition : <compat> GREEK SMALL LETTER MU (U+03BC)
Mirror : N
Index entries : MICRO SIGN
Upper case : U+039C
Title case : U+039C
Version : Unicode 1.1.0 (June, 1993)
Name : GREEK SMALL LETTER MU
Block : Greek and Coptic
Category : Letter, Lowercase [Ll]
Combine : 0
BIDI : Left-to-Right [L]
Mirror : N
Upper case : U+039C
Title case : U+039C
See Also : micro sign U+00B5
Version : Unicode 1.1.0 (June, 1993)
EDIT After the merge of this question with How to compare 'μ' and 'µ' in C#
Original answer posted:
"μ".ToUpper().Equals("µ".ToUpper()); // This always returns true.
EDIT
After reading the comments: yes, it is not good to use the above method, because it may give wrong results for some other types of input; instead we should normalize using full compatibility decomposition, as mentioned in the wiki. (Thanks to the answer posted by BoltClock.)
using System;
using System.Globalization;
using System.Text;

static string GREEK_SMALL_LETTER_MU = new String(new char[] { '\u03BC' });
static string MICRO_SIGN = new String(new char[] { '\u00B5' });

public static void Main()
{
    string Mus = "µμ";
    string NormalizedString = null;
    int i = 0;
    do
    {
        string OriginalUnicodeString = Mus[i].ToString();
        if (OriginalUnicodeString.Equals(GREEK_SMALL_LETTER_MU))
            Console.WriteLine(" INFORMATION ABOUT GREEK_SMALL_LETTER_MU");
        else if (OriginalUnicodeString.Equals(MICRO_SIGN))
            Console.WriteLine(" INFORMATION ABOUT MICRO_SIGN");
        Console.WriteLine();

        ShowHexaDecimal(OriginalUnicodeString);
        Console.WriteLine("Unicode character category " + CharUnicodeInfo.GetUnicodeCategory(Mus[i]));

        NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormC);
        Console.Write("Form C Normalized: ");
        ShowHexaDecimal(NormalizedString);

        NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormD);
        Console.Write("Form D Normalized: ");
        ShowHexaDecimal(NormalizedString);

        NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormKC);
        Console.Write("Form KC Normalized: ");
        ShowHexaDecimal(NormalizedString);

        NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormKD);
        Console.Write("Form KD Normalized: ");
        ShowHexaDecimal(NormalizedString);

        Console.WriteLine("_______________________________________________________________");
        i++;
    } while (i < 2);
    Console.ReadLine();
}

private static void ShowHexaDecimal(string UnicodeString)
{
    Console.Write("Hexa-Decimal Characters of " + UnicodeString + " are ");
    foreach (int x in UnicodeString.ToCharArray())
    {
        Console.Write("{0:X4} ", x);
    }
    Console.WriteLine();
}
Output
 INFORMATION ABOUT MICRO_SIGN

Hexa-Decimal Characters of µ are 00B5
Unicode character category LowercaseLetter
Form C Normalized: Hexa-Decimal Characters of µ are 00B5
Form D Normalized: Hexa-Decimal Characters of µ are 00B5
Form KC Normalized: Hexa-Decimal Characters of μ are 03BC
Form KD Normalized: Hexa-Decimal Characters of μ are 03BC
_______________________________________________________________
 INFORMATION ABOUT GREEK_SMALL_LETTER_MU

Hexa-Decimal Characters of μ are 03BC
Unicode character category LowercaseLetter
Form C Normalized: Hexa-Decimal Characters of μ are 03BC
Form D Normalized: Hexa-Decimal Characters of μ are 03BC
Form KC Normalized: Hexa-Decimal Characters of μ are 03BC
Form KD Normalized: Hexa-Decimal Characters of μ are 03BC
_______________________________________________________________
While reading about Unicode equivalence I found this:
The choice of equivalence criteria can affect search results. For instance some typographic ligatures like U+FB03 (ffi), ..... so a search for U+0066 (f) as substring would succeed in an NFKC normalization of U+FB03 but not in NFC normalization of U+FB03.
So to compare for equivalence we should normally use FormKC (NFKC normalization) or FormKD (NFKD normalization).
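A quick way to check that ligature example in C# (a small sketch, assuming a console app with using System; and using System.Text;):

string ffi = "\uFB03"; // the ffi ligature

Console.WriteLine(ffi.Normalize(NormalizationForm.FormC).Contains("f"));  // False: NFC keeps the ligature
Console.WriteLine(ffi.Normalize(NormalizationForm.FormKC).Contains("f")); // True: NFKC decomposes it to "ffi"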
I was a little curious to know more about all the Unicode characters, so I wrote a sample that iterates over every UTF-16 code unit, and I got some results I want to discuss:
Information about characters whose FormC and FormD normalized values were not equivalent
Total: 12,118
Character (int value): 192-197, 199-207, 209-214, 217-221, 224-253, ..... 44032-55203
Information about characters whose FormKC and FormKD normalized values were not equivalent
Total: 12,245
Character (int value): 192-197, 199-207, 209-214, 217-221, 224-228, ..... 44032-55203, 64420-64421, 64432-64433, 64490-64507, 64512-64516, 64612-64617, 64663-64667, 64735-64736, 65153-65164, 65269-65274
All the characters whose FormC and FormD normalized values were not equivalent also had FormKC and FormKD normalized values that were not equivalent, except for these characters:
Characters: 901 '΅', 8129 '῁', 8141 '῍', 8142 '῎', 8143 '῏', 8157 '῝', 8158 '῞', 8159 '῟', 8173 '῭', 8174 '΅'
Extra characters whose FormKC and FormKD normalized values were not equivalent, but whose FormC and FormD normalized values were equivalent:
Total: 119
Characters: 452 'DŽ' 453 'Dž' 454 'dž' 12814 '㈎' 12815 '㈏' 12816 '㈐' 12817 '㈑' 12818 '㈒'
12819 '㈓' 12820 '㈔' 12821 '㈕', 12822 '㈖' 12823 '㈗' 12824 '㈘' 12825 '㈙' 12826 '㈚'
12827 '㈛' 12828 '㈜' 12829 '㈝' 12830 '㈞' 12910 '㉮' 12911 '㉯' 12912 '㉰' 12913 '㉱'
12914 '㉲' 12915 '㉳' 12916 '㉴' 12917 '㉵' 12918 '㉶' 12919 '㉷' 12920 '㉸' 12921 '㉹' 12922 '㉺' 12923 '㉻' 12924 '㉼' 12925 '㉽' 12926 '㉾' 13056 '㌀' 13058 '㌂' 13060 '㌄' 13063 '㌇' 13070 '㌎' 13071 '㌏' 13072 '㌐' 13073 '㌑' 13075 '㌓' 13077 '㌕' 13080 '㌘' 13081 '㌙' 13082 '㌚' 13086 '㌞' 13089 '㌡' 13092 '㌤' 13093 '㌥' 13094 '㌦' 13099 '㌫' 13100 '㌬' 13101 '㌭' 13102 '㌮' 13103 '㌯' 13104 '㌰' 13105 '㌱' 13106 '㌲' 13108 '㌴' 13111 '㌷' 13112 '㌸' 13114 '㌺' 13115 '㌻' 13116 '㌼' 13117 '㌽' 13118 '㌾' 13120 '㍀' 13130 '㍊' 13131 '㍋' 13132 '㍌' 13134 '㍎' 13139 '㍓' 13140 '㍔' 13142 '㍖' .......... ﺋ' 65164 'ﺌ' 65269 'ﻵ' 65270 'ﻶ' 65271 'ﻷ' 65272 'ﻸ' 65273 'ﻹ' 65274'
There are some characters which cannot be normalized; they throw an ArgumentException if you try.
Total: 2,081
Characters (int value): 55296-57343, 64976-65007, 65534
These links can be really helpful for understanding the rules that govern Unicode equivalence:
Unicode_equivalence
Unicode_compatibility_characters
Most likely, there are two different character codes that produce (visibly) the same character. While technically not equal, they look equal. Have a look at the character table and see whether there are multiple instances of that character, or print out the character codes of the two chars in your code.
You ask "how to compare them" but you don't tell us what you want to do.
There are at least two main ways to compare them:
Either you compare them directly as you are and they are different
Or you use Unicode Compatibility Normalization if your need is for a comparison that finds them to match.
There could be a problem though because Unicode compatibility normalization will make many other characters compare equal. If you want only these two characters to be treated as alike you should roll your own normalization or comparison functions.
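For example, a minimal sketch of such a hand-rolled comparison that treats only these two characters as alike (the helper names are made up for illustration):

// Map MICRO SIGN (U+00B5) to GREEK SMALL LETTER MU (U+03BC) and nothing else.
static string MapMicroToMu(string s) => s.Replace('\u00B5', '\u03BC');

static bool MuAwareEquals(string a, string b) =>
    string.Equals(MapMicroToMu(a), MapMicroToMu(b), StringComparison.Ordinal);

// MuAwareEquals("µ", "μ") -> true, while other compatibility characters stay distinct.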
For a more specific solution we need to know your specific problem. What is the context under which you came across this problem?
If I wanted to be pedantic, I would say that your question doesn't make sense, but since we are approaching Christmas and the birds are singing, I'll proceed.
First off, the two entities that you are trying to compare are glyphs. A glyph is part of a set of glyphs provided by what is usually known as a "font", the thing that usually comes in a ttf, otf or whatever file format you are using.
Glyphs are a representation of a given symbol, and since that representation depends on a specific set, you can't just expect two symbols to be similar or even identical; that expectation doesn't make sense if you consider the context. You should at least specify what font or set of glyphs you are considering when you formulate a question like this.
What is usually used to solve a problem similar to the one you are encountering is an OCR, essentially software that recognizes and compares glyphs. Whether C# provides an OCR by default I don't know, but it's generally a really bad idea if you don't really need an OCR and you know what to do with it.
You can possibly end up interpreting a physics book as an ancient Greek book, not to mention that OCR is generally expensive in terms of resources.
There is a reason why those characters are localized the way they are localized: just don't do that.
It's possible to draw both chars with the same font style and size using the DrawString method. After the two bitmaps with the symbols have been generated, you can compare them pixel by pixel.
The advantage of this method is that you can compare not only absolutely equal characters, but similar ones too (with a definite tolerance).
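A rough sketch of that idea with System.Drawing (Windows-only; no tolerance handling here, and antialiasing or glyph placement can easily break exact pixel equality, so treat it as a starting point):

using System.Drawing;

static Bitmap Render(char c, Font font)
{
    var bmp = new Bitmap(32, 32);
    using (var g = Graphics.FromImage(bmp))
    {
        g.Clear(Color.White);
        g.DrawString(c.ToString(), font, Brushes.Black, 0, 0);
    }
    return bmp;
}

static bool LookAlike(char a, char b, Font font)
{
    using (Bitmap ba = Render(a, font), bb = Render(b, font))
    {
        for (int x = 0; x < ba.Width; x++)
            for (int y = 0; y < ba.Height; y++)
                if (ba.GetPixel(x, y) != bb.GetPixel(x, y))
                    return false;
        return true;
    }
}

// e.g. LookAlike('μ', 'µ', new Font("Arial", 20)) compares the rendered glyphs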
Related
I am trying to parse out and identify some values from strings that I have in a list.
I am using string.Contains to identify the value I'm looking for, but I am getting hits even if the value is surrounded by other text. How can I make sure I only get a hit if the value is isolated?
Example parse:
Looking for value = "302"
string sale =
"199708. (30), italiano, delim fabricata modella, serialNumber302. tnr F18529302E.";
var result = sale.ToLower().Contains("302");
In this example I will get a hit for "serialNumber302" and "F18529302E", which in this context is incorrect, since I only want a hit if "302" is found isolated, as in "dontfind302 shouldfind 302".
Any ideas on how to do this?
If you try Regex, you can define a word boundary using \b:
// requires using System.Text.RegularExpressions;
string sale =
    "199708. (30), italiano, delim fabricata modella, serialNumber302. tnr F18529302E.";

bool result = Regex.IsMatch(sale, @"\b302\b"); // false

sale = "A string with 302 isolated";
result = Regex.IsMatch(sale, @"\b302\b"); // true
So 302 will only be found if it is at the start of the string, at the end of the string, or surrounded by non-word characters, i.e. anything other than a-z, A-Z, 0-9 or _.
EDIT: From the comments I realised that it wasn't clear whether or not "serialNum302" should get a hit. I assumed so in this answer.
I see a few easy ways you could do this:
1) If the input is always a number as in the example, one option would be to only search for substrings not surrounded by more numbers, by examining all the results of an initial search and comparing their neighboring characters against the string "0123456789". I really don't think this is the best option though, because sooner or later it's going to break when it misinterprets one of the other bits of data.
2) If the string sale always has the serial number in the format "serialNumber[Num]", instead of just looking for Num, look for "serialNumber" + Num, as this is less likely to be messed up with the other data.
3) From your string, it looks like you have a standardized format that's being introduced to the system. In this case, parse it in a standardized way, e.g. by splitting it into substrings at the commas, then parsing each substring differently as it requires (see the sketch below).
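For example, a rough sketch of option 3, assuming the record really is comma-separated (the field handling here is illustrative only):

// requires using System.Linq;
string sale = "199708. (30), italiano, delim fabricata modella, serialNumber302. tnr F18529302E.";

bool found = sale.Split(',')
                 .Select(f => f.Trim().TrimEnd('.'))
                 .Any(f => f == "302");   // false here; true only for a field that is exactly "302"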
I've recently been given a new project at work: converting any given string into a 1-3 letter abbreviation.
An example of something similar to what I must produce is below; however, the strings given could be anything:
switch (string.Name)
{
    case "Emotional, Social & Personal": return "ESP";
    case "Speech & Language": return "SL";
    case "Physical Development": return "PD";
    case "Understanding the World": return "UW";
    case "English": return "E";
    case "Expressive Art & Design": return "EAD";
    case "Science": return "S";
    case "Understanding The World And It's People": return "UTW";
}
I figured that I could use string.Split and count the number of words in the array, then add conditions for handling strings of particular lengths, since generally these sentences won't be longer than 4 words. However, the problems I will encounter are:
If a string is longer than I expected, it wouldn't be handled
Symbols must be excluded from the abbreviation
Any suggestions as to the logic I could apply would be very appreciated.
Thanks
Something like the following should work with the examples you have given.
string abbreviation = new string(
input.Split()
.Where(s => s.Length > 0 && char.IsLetter(s[0]) && char.IsUpper(s[0]))
.Take(3)
.Select(s => s[0])
.ToArray());
You may need to adjust the filter based on your expected input. Possibly adding a list of words to ignore.
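For instance, a variant with an ignore list could look like this (the split characters and the word list are assumptions to adjust for your data; requires using System.Linq; and using System.Collections.Generic;):

var ignore = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "the", "and", "of" };

string abbreviation = new string(
    input.Split(new[] { ' ', ',', '&' }, StringSplitOptions.RemoveEmptyEntries)
         .Where(s => char.IsLetter(s[0]) && !ignore.Contains(s))
         .Take(3)
         .Select(s => char.ToUpper(s[0]))
         .ToArray());

// "Understanding the World" -> "UW", "Emotional, Social & Personal" -> "ESP"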
It seems that if it doesn't matter, you could just go for the simplest thing: if the string is shorter than 4 words, take the first letter of each word.
If the string is longer than 4, eliminate all the "and"s and "or"s, then do the same.
To be better, you could have a lookup dictionary of words that you wouldn't care about - like "the" or "so".
You could also keep a 3D char array, in alphabetical order, for quick lookup. That way, you wouldn't have any repeating abbreviations.
However, there are only a finite number of abbreviations. Therefore, it might be better to keep the 'useless' words stored in another string. That way, if the abbreviation your program does by default is already taken, you can use the useless words to make a new one.
If all of the above fail, you could start to move linearly through the string to get a different 3-letter abbreviation, sort of like codons on DNA.
This is a perfect place to use a dictionary:
Dictionary<string, string> dict = new Dictionary<string, string>() {
{"Emotional, Social & Personal", "ESP"},
{"Speech & Language","SL"},
{"Physical Development", "PD"},
{"Understanding the World","UW"},
{"English","E"},
{"Expressive Art & Design","EAD"},
{"Science","S"},
{"Understanding The World And It's People","UTW"}
};
string results = dict["English"];
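One caution: indexing with dict[...] throws a KeyNotFoundException when the key is missing, so for arbitrary input a lookup along these lines may be safer (input here stands for whatever string you are abbreviating):

if (dict.TryGetValue(input, out string abbr))
    Console.WriteLine(abbr);
else
    Console.WriteLine("No abbreviation defined for: " + input);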
The following snippet may help you:
// requires using System.Globalization;, System.Linq; and System.Text.RegularExpressions;
string input = "Emotional, Social & Personal"; // an example from the question

// produce text without special characters
string plainText = CultureInfo.CurrentCulture.TextInfo.ToTitleCase(
    Regex.Replace(input, @"[^0-9A-Za-z ,]", "").ToLower());

// take the first character of each word
string abbreviation = String.Join("",
    plainText.Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries)
             .Select(y => y[0]).ToArray());
I'm binding a textbox to a member of a class, and I need to tweak the appearance of the phone number so that it's easier to read (the user doesn't want to see values such as "1234567890" or "+01234567890"). So, I've got this code:
var bindingPhone = new Binding("Text", platypusInfo, "Phone1", true);
bindingPhone.Format += phoneBinding_Format;
textBoxPhoneNum1.DataBindings.Add(bindingPhone);
...
private void phoneBinding_Format(object sender, ConvertEventArgs e) {
e.Value = ??How can I deal with this??
}
But the phone values, although usually either "NNNNNNNNNN" (such as "1234567890") or "+NNNNNNNNNNN" (such as "+01234567890") can also appear in a number of other permutations, such as:
(NN) NNNN NNNN
++NNNNNNNNNNNNN
+NNNNNNNNNNNNN
+NN NNNNNNNNNNN
NNNNNNNNNNNN
Is there anything I can do in phoneBinding_Format() that will make these phone numbers easier to read without breaking them into nonsensical parts, such as "43-4859-4365" instead of "434-859-4365"?
UPDATE
Due to these factors:
1) I'm working on several projects simultaneously and need to get back to another one
2) Our two most common formats comprise the lion's share of our phone numbers
3) This is just a "nice feature" not a "must-have" feature
...I've settled on the following for now, based on a Jon Skeet answer:
private void phoneBinding_Format(object sender, ConvertEventArgs e)
{
    const int UK_PHONE_LEN = 9;                      // +NNNNNNNN
    const int US_PHONE_FORMAT_LEN = 10;              // NNNNNNNNNN
    const int COMMON_INTERNATIONAL_FORMAT_LEN = 12;  // +NNNNNNNNNNN

    string phone;
    string area;
    string major;
    string minor;
    string intl_firstsegment;
    string intl_secondsegment;
    string intl_thirdsegment;
    string intl_fourthsegment;
    string intl_fifthsegment;

    if (e.Value.ToString().Length == US_PHONE_FORMAT_LEN)
    {
        phone = e.Value.ToString();
        area = phone.Substring(0, 3);
        major = phone.Substring(3, 3);
        minor = phone.Substring(6);
        e.Value = string.Format("{0}-{1}-{2}", area, major, minor);
    }
    else if ((e.Value.ToString().Length == UK_PHONE_LEN) && (e.Value.ToString()[0] == '+'))
    {
        phone = e.Value.ToString();
        intl_firstsegment = phone.Substring(0, 2);
        intl_secondsegment = phone.Substring(2, 3);
        intl_thirdsegment = phone.Substring(5);
        e.Value = string.Format("+{0}-{1}-{2}", intl_firstsegment, intl_secondsegment, intl_thirdsegment);
    }
    else if ((e.Value.ToString().Length == COMMON_INTERNATIONAL_FORMAT_LEN) && (e.Value.ToString()[0] == '+'))
    {
        phone = e.Value.ToString();
        intl_firstsegment = phone.Substring(0, 2);
        intl_secondsegment = phone.Substring(2, 2);
        intl_thirdsegment = phone.Substring(4, 3);
        intl_fourthsegment = phone.Substring(7, 2);
        intl_fifthsegment = phone.Substring(9);
        e.Value = string.Format("+{0}-{1}-{2}-{3}-{4}", intl_firstsegment, intl_secondsegment, intl_thirdsegment, intl_fourthsegment, intl_fifthsegment);
    }
}
BTW, an interesting thing happened on the way to breakpoint nirvana: I originally had these tests (first character is a plus sign, and the length is the expected value) reversed, and got: System.IndexOutOfRangeException was unhandled by user code
Message=Index was outside the bounds of the array.
Reversing the conditions so that the length is checked first (which naturally doesn't fail when the length is 0 / the string is empty) fixed it, since then no attempt is made to access char 0.
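In other words, with && the left-hand test guards the right-hand one. A small sketch of the safe ordering:

string s = e.Value.ToString();

// The length is tested first, so s[0] is only evaluated when the string is
// long enough; an empty string can no longer cause an out-of-range access.
if (s.Length == COMMON_INTERNATIONAL_FORMAT_LEN && s[0] == '+')
{
    // format the number here
}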
Google's LibPhoneNumber could be just what you need if you want to support international phone numbers in addition to US (and Canadian, which are 100% compatible with US) numbers.
Google's common Java, C++ and Javascript library for parsing, formatting, storing and validating international phone numbers. The Java version is optimized for running on smartphones, and is used by the Android framework since 4.0 (Ice Cream Sandwich).
Using it from C#:
http://blog.thekieners.com/2011/06/06/using-googles-libphonenumber-in-microsoft-net-with-c/
C# port
https://bitbucket.org/pmezard/libphonenumber-csharp/wiki/Home
The easiest approach would be to just strip out all non-numeric characters and whitespace from the string before applying your formatting.
http://msdn.microsoft.com/en-us/library/844skk0h.aspx
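For example (rawNumber is a placeholder for whatever string comes out of the binding; requires using System.Text.RegularExpressions;):

// keep digits only; handle a leading '+' separately if you need to preserve it
string digitsOnly = Regex.Replace(rawNumber, @"[^0-9]", "");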
Interesting code snippet here. Can be used to pretty-up phone numbers before displaying them.
Input: xxxxxxxxxx or xxx-xxx-xxxx or (xxx) xxx-xxxx, Output: (xxx) xxx-xxxx
Code:
private string formatPhoneNumber(string number)
{
    // requires using System.Text.RegularExpressions;
    Regex pattern = new Regex("^\\(?([1-9]\\d{2})\\)?\\D*?([1-9]\\d{2})\\D*?(\\d{4})$");
    Match re = pattern.Match(number);
    return "(" + re.Groups[1].Value + ") " + re.Groups[2].Value + "-" + re.Groups[3].Value;
}
I would personally specify that only numeric phone numbers are allowed (meaning the user may not enter phone numbers like 1-800-FLOWERS) and then strip all non numeric characters, before formatting.
What I'm getting is that you have numbers stored in ten-digit character format ("1234567890"), without formatting, but now you need to add formatting to make the number more readable without making the number nonsensical for the country the number is used in. As different countries/regions have different default formats for numbers, the NANPA system of (ACD) COX-SUBS for area code, central office and subscriber doesn't always apply.
My suggestion would be to maintain a table or dictionary of phone number masks, then use a MaskedTextBox and bind not only the number, but the mask, to data in the contact object.
For instance, phone mask ID 1 might be for NANPA numbers: "000-000-0000". Phone mask ID 2 might be for London-metro numbers and would be "\000 0000 0000" (the leading digit is always a zero, and should be omitted when calling from outside the country). ID 3 might be for French phone numbers: "00 00 00 00 00". You can specify a "get-only" property on the object that will provide the actual mask string to the TextBox, and bind a different control (a drop-down maybe) that the user can choose to display the number in the correct format (which you can save for later use with that contact). Most of the time you'll be able to guess based on country, but this isn't always the case.
Understand that you will need a LOT of masks, and they aren't always ten digits. While the NANPA system is relatively consistent, UK phone numbers are a mess; geographic area codes are variable-length, and the total number can be ten or eleven digits, so there are six different masks just for UK numbers based on geographic area code. In Mexico, area codes can be two or three digits, and the total number is ten digits. French phone numbers are ten digits in groups of two. In addition, the actual combination of digits to be dialed depends on where you're calling from; if the call is always from the US, many European number systems drop their leading zero used for in-country calls and you instead dial the country code.
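A rough sketch of the mask-table idea; the keys and mask strings below are illustrative, not a complete set:

// requires using System.Collections.Generic; and using System.ComponentModel;
var masks = new Dictionary<int, string>
{
    { 1, "000-000-0000" },     // NANPA
    { 3, "00 00 00 00 00" }    // France
};

var provider = new MaskedTextProvider(masks[1]);
if (provider.Set("4348594365"))
    Console.WriteLine(provider.ToDisplayString());   // 434-859-4365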
I want to use the Levenshtein algorithm to search in a list of strings. I want to implement a custom character mapping so that I can type Latin characters and search items that are in Greek.
mapping example:
a = α, ά
b = β
i = ι,ί,ΐ,ϊ
... (etc)
u = ου, ού
So searching using abu in a list with
αbu
abού
αού (all greek characters)
will match all the items in the list. (Item order is not a problem.)
How do I apply a mapping in the algorithm? (this is where I start)
I think the best way would be to preprocess your symbols into one definite form (e.g. all Latin) and then use Levenshtein as you normally would.
In pseudocode:
int func(String latinStr, String greekStr) {
String mappedStr = convertToLatin(greekStr); // e.g. now αβ would be ab
return Levenstein(latinStr, mappedStr);
}
And in convertToLatin you can go symbol by symbol, ask a Dictionary with the mappings for a replacement, and construct the new string.
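A rough sketch of that convertToLatin step; the mapping entries are illustrative and incomplete, and multi-character cases like "ου" -> "u" would need an extra pass over pairs rather than single characters:

// requires using System.Collections.Generic; and using System.Text;
static readonly Dictionary<char, string> GreekToLatin = new Dictionary<char, string>
{
    { 'α', "a" }, { 'ά', "a" },
    { 'β', "b" },
    { 'ι', "i" }, { 'ί', "i" }, { 'ΐ', "i" }, { 'ϊ', "i" }
    // ... etc
};

static string ConvertToLatin(string greek)
{
    var sb = new StringBuilder();
    foreach (char c in greek)
        sb.Append(GreekToLatin.TryGetValue(c, out var latin) ? latin : c.ToString());
    return sb.ToString();
}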
I need to be able to convert user input to [a-z] roman characters ONLY (not case sensitive). So, there are only 26 characters that I am interested in.
However, the user can type in any "form" of those characters that they wish. The Spanish "n", the French "e", and the German "u" can all have accents from the user input (which are removed by the program).
I've gotten pretty close with these two extension methods:
public static string LettersOnly(this string Instring)
{
    char[] aChar = Instring.ToCharArray();
    int intCount = 0;
    string strTemp = "";

    for (intCount = 0; intCount <= Instring.Length - 1; intCount++)
    {
        if (char.IsLetter(aChar[intCount]))
        {
            strTemp += aChar[intCount];
        }
    }
    return strTemp;
}

public static string RemoveAccentMarks(this string s)
{
    string normalizedString = s.Normalize(NormalizationForm.FormD);
    StringBuilder sb = new StringBuilder();

    char c;
    for (int i = 0; i <= normalizedString.Length - 1; i++)
    {
        c = normalizedString[i];
        if (System.Globalization.CharUnicodeInfo.GetUnicodeCategory(c) != System.Globalization.UnicodeCategory.NonSpacingMark)
        {
            sb.Append(c);
        }
    }
    return sb.ToString();
}
Here is an example test:
string input = "Àlièñ451";
input = input.LettersOnly().RemoveAccentMarks().ToLower();
Console.WriteLine(input);
Result: "alien" (as expected)
This works for 99.9% of the cases. However, a few characters seem to pass all of the checks.
For instance, "ß" (a German double-s, I think). This is considered by .NET to be a letter. It is not considered by the function above to have any accent marks... but it STILL isn't in the range a-z, like I need it to be. Ideally, I could convert this to a "B" or an "ss" (whichever is appropriate), but I need to convert it to SOMETHING in the range a-z.
Another example: the diphthong "æ". Again, .NET considers this a "letter". The function above doesn't see any accent, but again, it isn't in the 26-character Roman alphabet. In this case, I need to convert it to the two letters "ae" (I think).
Is there an easy way to convert ANY worldwide input to the closest Roman-alphabet equivalent? It is expected that this probably won't be a perfectly clean translation, but I need to trust that the inputs at FlipScript.com are ONLY getting the characters a-z... and nothing else.
Any and all help appreciated.
If I were you, I'd create a Dictionary which would contain the mappings from foreign letters to Roman letters. I'd use this for two reasons:
It will make it easier for someone reading your code to understand what you want to do.
There are a small, finite number of these special letters, so you don't need to worry about maintenance of the data structure.
I'd put the mappings into an XML file, then load them into the data structure at run-time. That way, you do not need to modify any code which uses the characters; you only need to specify the mappings themselves.
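One possible shape for such a file and its loader (the element and attribute names are made up for illustration):

<!-- mappings.xml -->
<mappings>
  <map from="ß" to="ss" />
  <map from="æ" to="ae" />
</mappings>

// requires using System.Collections.Generic;, System.Linq; and System.Xml.Linq;
static Dictionary<string, string> LoadMappings(string path) =>
    XDocument.Load(path)
             .Descendants("map")
             .ToDictionary(m => (string)m.Attribute("from"),
                           m => (string)m.Attribute("to"));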