c# compare string irrespective of language

I have a routine that tries to find a specific term in a list of strings.
int FindString(string term, List<string> stringList)
{
    for (int i = 0; i < stringList.Count; i++)
    {
        if (stringList[i].Contains(term))
        {
            return i;
        }
    }
    return -1;
}
The term is always a Unicode string in English -for example "B4"- while the list of strings contains strings that may be written in other languages. A string might contain "B4" for example but since it was written in Greek, the Contains method returns false when comparing the English and Greek version of basically the same characters.
Is there a way to transform the non-English string so the Contains method will properly return true?
Example term and string (filename in reality):
term: B4
string: 19-299-12-Β4.txt

Basically, you need to "normalize" the strings according to your own rules and then perform the search.
There is no generally accepted mapping that says, for instance, that "Latin B" equals "Greek Β", so you have to build your own; a basic Dictionary<char,char> may be enough.
As part of that normalization you may also want to map digits. For that, official Unicode information is available through CharUnicodeInfo.GetDigitValue.
Overall, the code to normalize would look like this:
// Needs: using System; using System.Collections.Generic;
//        using System.Globalization; using System.Linq;
var source = "А9"; // Cyrillic A9 - "\u0410\u0039"
var map = new Dictionary<char,char> { { 'А', 'A' } }; // Cyrillic to Latin
// Digits become their ASCII form, mapped letters are replaced, everything else is kept
var chars = source.Select(c =>
    CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.DecimalDigitNumber ?
        CharUnicodeInfo.GetDigitValue(c).ToString()[0] :
    map.ContainsKey(c) ? map[c] :
    c);
var result = String.Join("", chars);
var term = "\u0041\u0039"; // Latin A9
Console.WriteLine(source.Contains(term)); // False
Console.WriteLine(result.Contains(term)); // True
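Applied to the filename from the question, the same idea could be sketched as below. The class and method names (LookalikeSearch, Fold) and the contents of the map are assumptions made for illustration; the map only lists a few Greek capitals that resemble Latin ones and would need to be extended for real data.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class LookalikeSearch
{
    // Hypothetical, deliberately incomplete map of Greek capitals
    // that look like Latin capitals.
    private static readonly Dictionary<char, char> GreekToLatin =
        new Dictionary<char, char>
        {
            { '\u0391', 'A' }, // Greek Alpha
            { '\u0392', 'B' }, // Greek Beta
            { '\u0395', 'E' }, // Greek Epsilon
            { '\u0397', 'H' }, // Greek Eta
            { '\u039A', 'K' }, // Greek Kappa
        };

    // Replaces every mapped character; all other characters are kept as-is.
    public static string Fold(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            char latin;
            sb.Append(GreekToLatin.TryGetValue(c, out latin) ? latin : c);
        }
        return sb.ToString();
    }

    // Same contract as the FindString in the question, but compares folded text.
    public static int FindString(string term, List<string> stringList)
    {
        return stringList.FindIndex(s => Fold(s).Contains(term));
    }
}
```

With this in place, LookalikeSearch.FindString("B4", files) would find "19-299-12-Β4.txt" even though its "Β" is the Greek Beta (U+0392).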

Related

I need help working with <string, char> format dictionaries in C#

This post is about a difficulty that I am having in C# (Windows Forms App in Visual Studio 2017, .NET Framework) regarding the use of a Dictionary in a "string, char" format.
Here is what I have:
First - a dictionary in a format
Dictionary<String, char> bintoascii = new Dictionary<String, char>()
{
    { "01000001" , 'A'},
    { "01000010" , 'B'},
    //..................
    { "01111010" , 'z'},
    { "00100000" , ' ' },
};
And the actual conversion code:
AsciiOutput.Text = "";
String binninput = Input.Text;
for (int i = 0; i < binninput.Length; i++)
{
    if (i > 0)
        AsciiOutput.Text = AsciiOutput.Text + " ";
    else if (i == 0)
    {
        AsciiOutput.Text = AsciiOutput.Text + " ";
    }
    string b = binninput[i].ToString();
    if (bintoascii.ContainsKey(b))
        AsciiOutput.Text = AsciiOutput.Text + (bintoascii[b]);
}
The function of this code is to convert from binary to ASCII via input and output textboxes (which have already been successfully set up on my GUI).
Essentially, it first declares a dictionary of Binary values (represented as strings) and their corresponding ASCII values (represented as chars).
The textbox that binary is inputted to is Input.Text and the textbox that ASCII is outputted from is AsciiOutput.Text (Note: the string binninput represents Input.Text)
There is a loop based on the length of the Input (binninput/Input.Text) that places spaces between each letter of binary. So it would be 01000001 01000010 instead of 0100000101000010, for example.
The latter part of the loop inserts the 8-digit binary representation of each letter individually (hence why it is repeated based on the length of the input).
Visual Studio displays no errors, but the output textbox (AsciiOutput.Text) is blank on my GUI. I'm not sure about this, but I think that the issue lies within the
string b = binninput[i].ToString();
line of code. Removing the .ToString() call causes conversion errors. I've spent hours substituting chars, ints, and strings, thinking it's a basic mistake, but to no avail, which is why I came here.
(Using a char, string format dictionary I got ASCII to binary conversion working great and the code looks very similar; if someone wants, I can post that here too.)
It sounds like what you're trying to do is look up a Key (the binary string) from a given Value (a character from some input string).
One way to do this is to use the System.Linq extension method FirstOrDefault (which returns the first match, or the default for the type if no match is found), use Value == character as the match criterion, and get the Key from the result:
// This would return "01000001" in your example above
var result = bintoascii.FirstOrDefault(x => x.Value == 'A').Key;
Here's a brief code sample that populates a dictionary with the string representation of the binary values for the characters representing all the numbers, uppercase and lowercase letters, and the space character, then parses an input string ("Hello # World") and returns the Key for each character found in a Value in the dictionary (if the value is missing, then it displays the character in square brackets - I added the '#' character to the test string to show what it would look like):
static void Main(string[] args)
{
    // Populate dictionary with string binary key and associated char value
    Dictionary<String, char> binToAscii =
        Enumerable.Range(32, 1)                 // Space character
            .Concat(Enumerable.Range(48, 10))   // Numbers
            .Concat(Enumerable.Range(65, 26))   // Uppercase Letters
            .Concat(Enumerable.Range(97, 26))   // Lowercase Letters
            .Select(intVal => (char) intVal)    // Convert int to char
            // And finally, set the key to the binary string, padded to 8 characters
            .ToDictionary(dataChar => Convert.ToString(dataChar, 2).PadLeft(8, '0'));
    var testString = "Hello # World";
    // Display the binary representation of each character or [--char--] if missing
    var resultString = string.Join(" ", testString.Select(chr =>
        binToAscii.FirstOrDefault(x => x.Value == chr).Key ?? $"[--'{chr}'--]"));
    Console.WriteLine($"{testString} ==\n{resultString}");
    // Wait for a key press before exiting
    Console.WriteLine("\nDone! Press any key to exit...");
    Console.ReadKey();
}
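For the direction described in the question (binary text in, ASCII out), note that binninput[i] is a single character, so b is only ever "0" or "1" and can never equal an 8-character key such as "01000001"; that would explain the blank output. A minimal sketch that reads the input eight digits at a time is shown below. It swaps the dictionary lookup for Convert.ToInt32 with base 2, and the names BinaryText and Decode are made up for this example.

```csharp
using System;
using System.Text;

public static class BinaryText
{
    // Decodes 8-bit groups, with or without separating spaces,
    // e.g. "01000001 01000010" or "0100000101000010".
    public static string Decode(string bits)
    {
        string compact = bits.Replace(" ", "");
        if (compact.Length % 8 != 0)
        {
            throw new ArgumentException(
                "Expected a multiple of 8 binary digits.", nameof(bits));
        }

        var sb = new StringBuilder(compact.Length / 8);
        for (int i = 0; i < compact.Length; i += 8)
        {
            // Take one 8-character group and parse it as a base-2 number.
            string chunk = compact.Substring(i, 8);
            sb.Append((char)Convert.ToInt32(chunk, 2));
        }
        return sb.ToString();
    }
}
```

In the form from the question this would be called as AsciiOutput.Text = BinaryText.Decode(Input.Text). Keeping the dictionary is also possible: look up each 8-character chunk with bintoascii.TryGetValue instead of converting it.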

how to match rules using regex in C#

I am new to regex stuff in C#. I am not sure how to use the regex to validate client reference number. This client reference number has 3 different types : id, mobile number, and serial number.
C#:
string client = "ABC 1234567891233";
//do code stuff here:
if Regex matches 3-4 digits to client, return value = client id
else if Regex matches 8 digits to client, return value = ref no
else if Regex matches 13 digits to client, return value = phone no
I don't know how to count digits using Regex for the different types. Something like Regex("{![\d.....}").
I don't understand why you're bent on using regular expressions here. A simple one-liner would do, e.g. an extension method like this:
static int NumbersCount(this string str)
{
    return str.ToCharArray().Where(c => Char.IsNumber(c)).Count();
}
It's clearer and more maintainable in my opinion.
You could probably give it a go with group matching and something along the lines of
"(?<client>[0-9]{5,9}?)|(?<serial>[0-9]{10}?)|(?<mobile>[0-9]{13,}?)"
Then you'd check whether you have a match for "client", "serial", "mobile" and interpret the string input on that basis. But is it easier to understand?
Does it express your intentions more clearly for those reading your code later on?
If the requirement is such that these numbers must be consecutive (as @Corak points out)... I'd still write that iteratively, like so:
/// <summary>
/// returns lengths of all the numeric sequences encountered in the string
/// </summary>
static IEnumerable<int> Lengths(string str)
{
    var count = 0;
    for (var i = 0; i < str.Length; i++)
    {
        if (Char.IsNumber(str[i]))
        {
            count++;
        }
        if ((!Char.IsNumber(str[i]) || i == str.Length - 1) && count > 0)
        {
            yield return count;
            count = 0;
        }
    }
}
And then you could simply:
bool IsClientID(string str)
{
    var lengths = Lengths(str);
    return lengths.Count() == 1 && lengths.Single() == 5;
}
Is it more verbose? Yes, but chances are that people will still like you more than if you make them fiddle with regex every time the validation rules happen to change, or some debugging is required : ) This includes your future self.
I'm not sure if I understood your question. But if you want to get the number of numeric characters in a string, you can strip everything that is not a digit and check the length of what remains:
Regex regex = new Regex(@"[^0-9]");
string digits = regex.Replace(client, ""); // keep only the digits
if (digits.Length > 4 && digits.Length < 10)
    //this is a customer id
....
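Putting the digit counts from the question together (3-4 digits for a client id, 8 for a ref no, 13 for a phone no), a classification could be sketched as follows. The names ClientReference and Classify, the "unknown" fallback, and the choice to look only at the first run of digits are assumptions made for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ClientReference
{
    // Classifies the input by the length of its first run of ASCII digits.
    public static string Classify(string client)
    {
        Match run = Regex.Match(client, "[0-9]+");
        if (!run.Success)
        {
            return "unknown";
        }

        int digits = run.Length;
        if (digits == 3 || digits == 4) return "client id";
        if (digits == 8) return "ref no";
        if (digits == 13) return "phone no";
        return "unknown"; // assumed fallback; the question does not define one
    }
}
```

For the sample value "ABC 1234567891233" the digit run is 13 characters long, so this sketch would report a phone no.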

Is there a way to check if text is in cyrillics or latin using C#?

Is there a way to check if text is in cyrillics or latin using C#?
Use a Regex and check for \p{IsCyrillic}, for example:
if (Regex.IsMatch(stringToCheck, @"\p{IsCyrillic}"))
{
    // there is at least one cyrillic character in the string
}
This would be true for the string "abcабв" because it contains at least one cyrillic character. If you want it to be false if there are non cyrillic characters in the string, use:
if (!Regex.IsMatch(stringToCheck, @"\P{IsCyrillic}"))
{
    // there are only cyrillic characters in the string
}
This would be false for the string "abcабв", but true for "абв".
To check what the IsCyrillic named block or other named blocks contain, have a look at this http://msdn.microsoft.com/en-us/library/20bw873z.aspx#SupportedNamedBlocks
How about this?
string pattern = @"\p{IsCyrillic}";
if (Regex.Matches(textInput, pattern).Count > 0)
{
    // contains Cyrillic characters.
}
If you want to check whether the text contains more than x Cyrillic characters, change the numeric value on the right-hand side.
Our system received spam email in which Cyrillic characters made up roughly 30% of the text, so an all-or-nothing check (100% or 0%) could not decide.
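For mixed text like that spam example, a share-based check may be more useful than a fixed count. The sketch below is one possible way to compute it; the names ScriptRatio and CyrillicShare are hypothetical, and instead of a regex it tests the range U+0400 to U+04FF directly, which is the block that \p{IsCyrillic} refers to.

```csharp
using System;
using System.Linq;

public static class ScriptRatio
{
    // Returns the fraction of letters that fall in the Cyrillic block
    // (U+0400..U+04FF). Digits, spaces and punctuation are ignored.
    public static double CyrillicShare(string text)
    {
        int letters = text.Count(c => char.IsLetter(c));
        if (letters == 0)
        {
            return 0.0;
        }

        int cyrillic = text.Count(c =>
            char.IsLetter(c) && c >= '\u0400' && c <= '\u04FF');
        return (double)cyrillic / letters;
    }
}
```

A caller could then pick its own threshold, for example treating a message as "mostly Cyrillic" when CyrillicShare(body) > 0.3; the threshold value is a judgment call, not something the answers above prescribe.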
Here is another solution for this problem
public bool isCyrillic(string textInput)
{
    bool rezultat = true;
    string pattern = @"[абвгдѓежзѕијклљмнњопрстќуфхцчџш]";
    char[] textArray = textInput.ToCharArray();
    for (int i = 0; i < textArray.Length; i++)
    {
        if (!Regex.IsMatch(textArray[i].ToString(), pattern))
        {
            rezultat = false;
            break;
        }
    }
    return rezultat;
}

Is there an easy way to trim the last three characters off a string

I have strings like this:
var a = "abcdefg";
var b = "xxxxxxxx";
The strings are always longer than five characters.
Now I need to trim off the last 3 characters. Is there some simple way that I can do this with C#?
In the trivial case you can just use
result = s.Substring(0, s.Length-3);
to remove the last three characters from the string.
Or as Jason suggested Remove is an alternative:
result = s.Remove(s.Length-3)
Unfortunately, for Unicode strings there can be a few problems:
A Unicode code point can consist of multiple chars, since string is encoded as UTF-16 (see surrogate pairs). This happens only for characters outside the Basic Multilingual Plane, i.e. those with a code point of 2^16 (U+10000) or higher. This is relevant if you want to support Chinese.
A glyph (graphical symbol) can consist of multiple code points. For example, ä can be written as a followed by a combining ¨.
Behavior with right-to-left writing might not be what you want either.
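The first two problems can be reduced by counting text elements instead of chars. The sketch below uses System.Globalization.StringInfo for that; the names TextElementTrim and RemoveLastTextElements are made up for this example, and it does not attempt to address right-to-left text.

```csharp
using System;
using System.Globalization;

public static class TextElementTrim
{
    // Removes the last `count` text elements, so a surrogate pair or a base
    // letter plus its combining marks is removed as one unit.
    public static string RemoveLastTextElements(string s, int count)
    {
        var info = new StringInfo(s);
        int keep = info.LengthInTextElements - count;
        return keep <= 0
            ? string.Empty
            : info.SubstringByTextElements(0, keep);
    }
}
```

For plain ASCII input this behaves like s.Substring(0, s.Length - 3); the difference only shows up when one of the trailing characters is stored as more than one char.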
You want String.Remove(Int32)
Deletes all the characters from this string beginning at a specified
position and continuing through the last position.
If you want to perform validation, along the lines of druttka's answer, I would suggest creating an extension method
public static class MyStringExtensions
{
    public static string SafeRemove(this string s, int numCharactersToRemove)
    {
        if (numCharactersToRemove > s.Length)
        {
            throw new ArgumentException("numCharactersToRemove");
        }
        // other validation here
        return s.Remove(s.Length - numCharactersToRemove);
    }
}
var s = "123456";
var r = s.SafeRemove(3); //r = "123"
var t = s.SafeRemove(7); //throws ArgumentException
string a = "abcdefg";
a = a.Remove(a.Length - 3);
string newString = oldString.Substring(0, oldString.Length - 3);
If you really only need to trim off the last 3 characters, you can do this
string a = "abcdefg";
if (a.Length > 3)
{
    a = a.Substring(0, a.Length-3);
}
else
{
    a = String.Empty;
}

Converting "Bizarre" Chars in String to Roman Chars

I need to be able to convert user input to [a-z] roman characters ONLY (not case sensitive). So, there are only 26 characters that I am interested in.
However, the user can type in any "form" of those characters that they wish. The Spanish "n", the French "e", and the German "u" can all have accents from the user input (which are removed by the program).
I've gotten pretty close with these two extension methods:
public static string LettersOnly(this string Instring)
{
    char[] aChar = Instring.ToCharArray();
    int intCount = 0;
    string strTemp = "";
    for (intCount = 0; intCount <= Instring.Length - 1; intCount++)
    {
        if (char.IsLetter(aChar[intCount]) )
        {
            strTemp += aChar[intCount];
        }
    }
    return strTemp;
}
public static string RemoveAccentMarks(this string s)
{
    string normalizedString = s.Normalize(NormalizationForm.FormD);
    StringBuilder sb = new StringBuilder();
    char c;
    for (int i = 0; i <= normalizedString.Length - 1; i++)
    {
        c = normalizedString[i];
        if (System.Globalization.CharUnicodeInfo.GetUnicodeCategory(c) != System.Globalization.UnicodeCategory.NonSpacingMark)
        {
            sb.Append(c);
        }
    }
    return sb.ToString();
}
Here is an example test:
string input = "Àlièñ451";
input = input.LettersOnly().RemoveAccentMarks().ToLower();
Console.WriteLine(input);
Result: "alien" (as expected)
This works for 99.9% of the cases. However, a few characters seem to pass all of the checks.
For instance, "ß" (a German double-s, I think). This is considered by .Net to be a letter. This is not considered by the function above to have any accent marks... but it STILL isn't in the range of a-z, like I need it to be. Ideally, I could convert this to a "B" or an "ss" (whichever is appropriate), but I need to convert it to SOMETHING in the range of a-z.
Another example is the diphthong ("æ"). Again, .Net considers this a "letter". The function above doesn't see any accent, but again, it isn't in the roman 26 character alphabet. In this case, I need to convert to the two letters "ae" (I think).
Is there an easy way to convert ANY worldwide input to the closest roman alphabet equivalent? It is expected that this probably won't be a perfectly clean translation, but I need to trust that the inputs at FlipScript.com are ONLY getting the characters a-z... and nothing else.
Any and all help appreciated.
If I were you, I'd create a Dictionary which would contain the mappings from foreign letters to Roman letters. I'd use this for two reasons:
It will make understanding what you want to do easier to someone who is reading your code.
There are a small, finite, number of these special letters so you don't need to worry about maintenance of the data structure.
I'd put the mappings into an xml file then load them into the data structure at run-time. That way, you do not need to modify any code which uses the characters, you only need to specify the mappings themselves.
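A minimal sketch of that dictionary idea, combined with the accent stripping from the question, might look like this. The names RomanLetters and ToAtoZ are hypothetical, the map is hard-coded here rather than loaded from XML, and it lists only a few letters as a starting point; anything that is neither a-z nor in the map is dropped.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class RomanLetters
{
    // Hypothetical starter map for letters that have no accent to strip.
    private static readonly Dictionary<char, string> Special =
        new Dictionary<char, string>
        {
            { '\u00DF', "ss" }, // German sharp s
            { '\u00E6', "ae" }, // ae ligature
            { '\u0153', "oe" }, // oe ligature
            { '\u00F8', "o" },  // o with stroke
        };

    public static string ToAtoZ(string input)
    {
        var sb = new StringBuilder();
        // FormD splits accented letters into a base letter plus combining marks.
        string decomposed =
            input.ToLowerInvariant().Normalize(NormalizationForm.FormD);
        foreach (char c in decomposed)
        {
            string mapped;
            if (c >= 'a' && c <= 'z')
            {
                sb.Append(c);
            }
            else if (Special.TryGetValue(c, out mapped))
            {
                sb.Append(mapped);
            }
            // Everything else (digits, marks, unmapped letters) is dropped.
        }
        return sb.ToString();
    }
}
```

With this sketch the test input "Àlièñ451" from the question still comes out as "alien", and "Straße" becomes "strasse". Because unmapped characters are dropped, the result is always within a-z, at the cost of silently losing letters the map does not know about.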
