Converting "Bizarre" Chars in String to Roman Chars

I need to be able to convert user input to [a-z] roman characters ONLY (not case sensitive). So, there are only 26 characters that I am interested in.
However, the user can type in any "form" of those characters that they wish. The Spanish "n", the French "e", and the German "u" can all have accents from the user input (which are removed by the program).
I've gotten pretty close with these two extension methods:
public static string LettersOnly(this string Instring)
{
    char[] aChar = Instring.ToCharArray();
    int intCount = 0;
    string strTemp = "";
    for (intCount = 0; intCount <= Instring.Length - 1; intCount++)
    {
        if (char.IsLetter(aChar[intCount]))
        {
            strTemp += aChar[intCount];
        }
    }
    return strTemp;
}

public static string RemoveAccentMarks(this string s)
{
    string normalizedString = s.Normalize(NormalizationForm.FormD);
    StringBuilder sb = new StringBuilder();
    char c;
    for (int i = 0; i <= normalizedString.Length - 1; i++)
    {
        c = normalizedString[i];
        if (System.Globalization.CharUnicodeInfo.GetUnicodeCategory(c) != System.Globalization.UnicodeCategory.NonSpacingMark)
        {
            sb.Append(c);
        }
    }
    return sb.ToString();
}
Here is an example test:
string input = "Àlièñ451";
input = input.LettersOnly().RemoveAccentMarks().ToLower();
Console.WriteLine(input);
Result: "alien" (as expected)
This works for 99.9% of the cases. However, a few characters seem to pass all of the checks.
For instance, "ß" (a German double-s, I think). This is considered by .Net to be a letter. This is not considered by the function above to have any accent marks... but it STILL isn't in the range of a-z, like I need it to be. Ideally, I could convert this to a "B" or an "ss" (whichever is appropriate), but I need to convert it to SOMETHING in the range of a-z.
Another example is the diphthong ("æ"). Again, .Net considers this a "letter". The function above doesn't see any accent, but again, it isn't in the 26-character roman alphabet. In this case, I need to convert it to the two letters "ae" (I think).
Is there an easy way to convert ANY worldwide input to the closest roman alphabet equivalent? It is expected that this probably won't be a perfectly clean translation, but I need to trust that the inputs at FlipScript.com are ONLY getting the characters a-z... and nothing else.
Any and all help appreciated.

If I were you, I'd create a Dictionary which would contain the mappings from foreign letters to Roman letters. I'd use this for two reasons:
It will make understanding what you want to do easier to someone who is reading your code.
There are a small, finite, number of these special letters so you don't need to worry about maintenance of the data structure.
I'd put the mappings into an xml file then load them into the data structure at run-time. That way, you do not need to modify any code which uses the characters, you only need to specify the mappings themselves.
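A minimal sketch of the dictionary idea, combined with the accent stripping from the question. The class name RomanLetters, the method ToRomanLetters, and the contents of the table are assumptions made for illustration; the table is hard-coded here for brevity rather than loaded from XML, and it is deliberately incomplete.

```csharp
using System.Collections.Generic;
using System.Text;

public static class RomanLetters
{
    // Hypothetical, incomplete table: letters that have no accent to strip
    // but still fall outside a-z. Extend it as new cases turn up.
    private static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        { 'ß', "ss" }, { 'æ', "ae" }, { 'œ', "oe" }, { 'ø', "o" },
        { 'đ', "d" },  { 'ł', "l" },  { 'þ', "th" }, { 'ð', "d" }
    };

    public static string ToRomanLetters(string input)
    {
        var sb = new StringBuilder();
        // Lower-case first so the table only needs lower-case keys,
        // then decompose so that accents become separate combining marks.
        foreach (char c in input.ToLowerInvariant().Normalize(NormalizationForm.FormD))
        {
            if (c >= 'a' && c <= 'z')
            {
                sb.Append(c);
            }
            else if (Map.TryGetValue(c, out string replacement))
            {
                sb.Append(replacement);
            }
            // Anything else (digits, combining marks, unmapped letters) is dropped.
        }
        return sb.ToString();
    }
}
```

Because the final filter only ever appends a-z or a table entry, the output is guaranteed to be in range even for characters the table does not know about; those are silently dropped rather than transliterated.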

Related

In c# How to convert back unicoded characters to UTF-8?

I have this text Grou00dfbeerenstrau00dfe and I need to convert it to Großbeerenstraße
also Eichstu00e4tt to Eichstätt
But I can't work out how to solve this, for these reasons:
ONLY some characters (special characters) are converted, not the whole text
Unicoded texts usually have Escape characters like \u00df instead of u00df
Could you please help me to convert correctly back to its original states?
Basically, how can I convert when there is no escape character?
NOTE: If you must know, I'm sending some special charactered strings into some system. I cannot touch this system but when I request back the same string from that system, it converts Großbeerenstraße to Grou00dfbeerenstrau00dfe and so on.

Based on David's idea of looking for u and checking if the following 4 characters are valid hex numbers, it would look something like this:
public string FixGermanUnicode(string input) {
    var output = new StringBuilder();
    for (var i = 0; i < input.Length; i++) {
        if (i < input.Length - 4 && input[i] == 'u' && input[i + 1] == '0'
            && int.TryParse(input.Substring(i + 1, 4), NumberStyles.HexNumber, null, out var code)) {
            try {
                output.Append(char.ConvertFromUtf32(code));
                i += 4;
            } catch (ArgumentOutOfRangeException) {
                //not a valid unicode character
                output.Append(input[i]);
            }
        } else {
            output.Append(input[i]);
        }
    }
    return output.ToString();
}
Console.WriteLine(FixGermanUnicode("Grou00dfbeerenstrau00dfe"));
Really, it checks for u0 to prevent cases where the next 4 characters are valid unicode, but should not have been replaced. That will work for German at least, since all the special characters in German have unicode codes starting with 0.
This will also catch scenarios where the following 4 digits are valid hex numbers, but the resulting hex number is not a valid unicode character.

While I completely agree with @Gabriel Luci's answer, I would like to point out a more concise implementation of the same idea (it needs the System.Text.RegularExpressions namespace):
readonly static string unicodePattern = @"u0[0-9a-fA-F]{3}";

public static string FixGermanUnicode(string input)
{
    return Regex.Replace(input, unicodePattern, match =>
    {
        var digits = match.Value.Substring(1);
        try
        {
            return char.ConvertFromUtf32(int.Parse(digits, System.Globalization.NumberStyles.AllowHexSpecifier)).ToString();
        }
        catch (ArgumentOutOfRangeException)
        {
            //not a valid unicode character
            return match.Value;
        }
    });
}
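Since the question mentions that such escapes normally carry a backslash, here is a hedged variant of the same regex idea that accepts both forms. The names UnicodeFixer and Fix are made up for this sketch, and the "u0" restriction from the answers above is kept.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class UnicodeFixer
{
    // Optional backslash, then 'u', then four hex digits of which the first is 0.
    private static readonly Regex Pattern = new Regex(@"\\?u(0[0-9a-fA-F]{3})");

    public static string Fix(string input)
    {
        return Pattern.Replace(input, m =>
        {
            // The captured value is at most 0x0FFF, so it is a single
            // UTF-16 code unit and can never be a surrogate.
            int code = int.Parse(m.Groups[1].Value, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            return ((char)code).ToString();
        });
    }
}
```

Like the versions above, this cannot tell a stripped escape from ordinary text: a value such as "menu0123" would also be rewritten, so it is only safe when the input is known not to contain such sequences.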

c# compare string irrespective of language

I have a routine that tries to find a specific term in a list of strings.
int FindString(string term, List<string> stringList)
{
    for (int i = 0; i < stringList.Count; i++)
    {
        if (stringList[i].Contains(term))
        {
            return i;
        }
    }
    return -1;
}
The term is always a Unicode string in English -for example "B4"- while the list of strings contains strings that may be written in other languages. A string might contain "B4" for example but since it was written in Greek, the Contains method returns false when comparing the English and Greek version of basically the same characters.
Is there a way to transform the non-English string so the Contains method will properly return true?
Example term and string (filename in reality):
term: B4
string: 19-299-12-Β4.txt

Basically, you need to "normalize" the string based on your own custom rules and then perform the search.
Since there is no generally accepted mapping that says, for example, that "Latin B" equals "Greek Β", you have to build your own; a basic Dictionary<char,char> may be enough.
As part of that "normalization" you may also consider digit mapping. For that there is official Unicode information available through CharUnicodeInfo.GetDigitValue.
So the overall code to normalize would look like this (it needs the System.Linq and System.Globalization namespaces):
var source = "А9"; // Cyrillic A9 - "\u0410\u0039"
var map = new Dictionary<char,char> { { 'А', 'A' } }; // Cyrillic to Latin
var chars = source.Select(c =>
    CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.DecimalDigitNumber ?
        CharUnicodeInfo.GetDigitValue(c).ToString()[0] :
    map.ContainsKey(c) ? map[c] :
    c);
var result = String.Join("", chars);
var term = "\u0041\u0039"; // Latin A9
Console.WriteLine(source.Contains(term)); // False
Console.WriteLine(result.Contains(term)); // True
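Wired into the FindString routine from the question, the idea might look roughly like this. The class name LookalikeSearch and the (very incomplete) lookalike table are assumptions for the sketch; a real table would need every Greek and Cyrillic letter you expect to meet.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class LookalikeSearch
{
    // Hypothetical, incomplete table of capitals that look like Latin ones.
    private static readonly Dictionary<char, char> Map = new Dictionary<char, char>
    {
        { '\u0391', 'A' }, { '\u0392', 'B' }, { '\u0395', 'E' }, // Greek Alpha, Beta, Epsilon
        { '\u0410', 'A' }, { '\u0412', 'B' }, { '\u0415', 'E' }  // Cyrillic A, Ve, Ie
    };

    public static string Normalize(string s)
    {
        return new string(s.Select(c =>
            char.IsDigit(c) ? (char)('0' + CharUnicodeInfo.GetDecimalDigitValue(c)) // any decimal digit -> '0'..'9'
            : Map.TryGetValue(c, out char latin) ? latin
            : c).ToArray());
    }

    public static int FindString(string term, List<string> stringList)
    {
        // Normalize both sides so the comparison happens in one alphabet.
        string needle = Normalize(term);
        return stringList.FindIndex(s => Normalize(s).Contains(needle));
    }
}
```

With this, searching for the Latin term "B4" finds a file name written with a Greek Beta, and FindIndex returns -1 when nothing matches, as the original loop did.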

Convert any string to ASCII, Remove Backslash

This question may reveal my ignorance regarding character encoding, so if it does, I would greatly appreciate information to correct that.
I am relaying strings from new applications to an old application. The old application only accepts ASCII characters (http://www.asciitable.com/). The old application also does not support certain characters such as backslashes. The new applications support more or less anything.
Let's say I have the string:
"Whatever - 1_夜_💦💦💦"
I need to convert that to something with only ASCII characters. For example, maybe something like:
"Whatever - 1_\u001cY_=???=???=???"
Then I want to replace the remaining illegal characters with substitution strings.
Ideally, any character that is encoded to ASCII should be able to be de-coded. That is, any unique input string will have a unique output string (no arbitrary inputs "abc" and "xyz" which are different produce the same result). An algorithm could convert the output string back to the input string.
This is what I've tried:
static string ConvertToAscii(string str)
{
    var return_string = "";
    foreach (var c in str)
    {
        if ((int)c < 128)
        {
            return_string += c;
        }
        else
        {
            var charBytes = BitConverter.GetBytes(c);
            var ascii = Encoding.ASCII.GetString(charBytes);
            return_string += ascii;
        }
    }
    return return_string;
}
When I use this with the string I mentioned above, I get:
"Whatever - 1_\u001cY_=???=???=???"
That seems great - however, the "\u001cY" is apparently a single character, rather than a collection of ASCII characters. So my target database rejects it, and I am not able to figure out how to remove the "\" while leaving the remaining characters.
How can I convert any string into a collection of ASCII characters?

The easiest approach is to Base64-encode all the bytes, since you don't seem to care how the strings are represented:
Convert.ToBase64String( Encoding.Unicode.GetBytes("Whatever - 1_夜_💦💦💦"))
This produces a result that is guaranteed to be ASCII (even printable ASCII). For your string the result would be "VwBoAGEAdABlAHYAZQByACAALQAgADEAXwAcWV8APdim3D3Yptw92Kbc".
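The question also asks for the encoding to be reversible. A small sketch of the round trip, with the helper names AsciiSafe, Encode and Decode invented for the example:

```csharp
using System;
using System.Text;

public static class AsciiSafe
{
    // UTF-16 bytes of the text, written out as Base64.
    public static string Encode(string text)
    {
        return Convert.ToBase64String(Encoding.Unicode.GetBytes(text));
    }

    // Exact inverse of Encode.
    public static string Decode(string base64)
    {
        return Encoding.Unicode.GetString(Convert.FromBase64String(base64));
    }
}
```

The Base64 alphabet is A-Z, a-z, 0-9, '+', '/' and '=', so the output never contains a backslash. The cost is that the stored value is no longer human-readable and is longer than the input.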

Here is similar code to what I ended up using to convert everything to Ascii:
internal static string ConvertToAscii(string str)
{
    var returnStringBuilder = new StringBuilder();
    foreach (var c in str)
    {
        if (char.IsControl(c))
        {
            // Control character
            continue;
        }
        if (c < 127)
        {
            // ASCII Character
            returnStringBuilder.Append(c);
        }
        else
        {
            returnStringBuilder.Append("U+" + ((int) c).ToString("X4"));
        }
    }
    return returnStringBuilder.ToString();
}

Substring Issue: Substring converting to char

I am making a typing game and I need to make a list of each character in a string so I can define what input the code should be expecting.
I tried:
static List<char> chars = "This is my string".ToCharArray().ToList();
But because char does not contain capitalization information it throws this error:
ArgumentException: InputKey named: T is unknown.
I knew char was not going to work, I needed each letter to be a string, not a char. So, I created a method using Substring:
static List<string> ToStringArray(string input)
{
    List<string> strings = new List<string>();
    for (int i = 0; i < input.Length; i++)
    {
        strings.Add(input.Substring(i, 1));
    }
    return strings;
}
static List<string> strings = ToStringArray("This is my string");
But apparently Substring is converting to a char because my code is still throwing the same error, and if I change the length of the substring to 2 my code works again. How can I force Substring to not convert to char? Or should I be approaching this problem in a completely different way?

I think you may be approaching this from a more complex angle than it needs to be.
If you have a:
string testString = "This is my string";
Then you can already access each individual character by index, such as testString[1] (which would be 'h')
If you're worried about case, then you can reference the string with
testString.ToLower();
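Putting those two points together, a possible sketch follows. It assumes that the input API in the question only recognizes lower-case key names, which the "InputKey named: T is unknown" error suggests but does not prove. The names TypingKeys and ToKeyNames are made up here.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class TypingKeys
{
    // One lower-case, single-character string per character of the input.
    public static List<string> ToKeyNames(string input)
    {
        return input.ToLowerInvariant().Select(c => c.ToString()).ToList();
    }
}
```

Characters such as the space may still need their own key name, depending on what the input API expects.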

Parse without string split

This is a spin-off from the discussion in some other question.
Suppose I've got to parse a huge number of very long strings. Each string contains a sequence of doubles (in text representation, of course) separated by whitespace. I need to parse the doubles into a List<double>.
The standard parsing technique (using string.Split + double.TryParse) seems to be quite slow: for each of the numbers we need to allocate a string.
I tried to do it the old C-like way: compute the indices of the beginning and the end of the substrings containing the numbers, and parse them "in place", without creating an additional string. (See http://ideone.com/Op6h0; the relevant part is shown below.)
int startIdx, endIdx = 0;
while (true)
{
    startIdx = endIdx;
    // no find_first_not_of in C#
    while (startIdx < s.Length && s[startIdx] == ' ') startIdx++;
    if (startIdx == s.Length) break;
    endIdx = s.IndexOf(' ', startIdx);
    if (endIdx == -1) endIdx = s.Length;
    // how to extract a double here?
}
There is an overload of string.IndexOf, searching only within a given substring, but I failed to find a method for parsing a double from substring, without actually extracting that substring first.
Does anyone have an idea?

There is no managed API to parse a double from a substring. My guess is that allocating the string will be insignificant compared to all the floating point operations in double.Parse.
Anyway, you can save the allocation by creating a "buffer" string once, of length 100 and consisting of whitespace only. Then, for every number you want to parse, you copy its chars into this buffer string using unsafe code and pad the rest of the buffer with whitespace. For parsing you can use NumberStyles.AllowTrailingWhite, which causes trailing whitespace to be ignored.
Getting a pointer to string is actually a fully supported operation:
string l_pos = new string(' ', 100); //don't write to a shared string!
unsafe
{
    fixed (char* l_pSrc = l_pos)
    {
        // do some work
    }
}
C# has special syntax to bind a string to a char*.
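On newer runtimes the picture is different: .NET Core 2.1 and later (but not the classic .NET Framework) provide double.Parse overloads that take a ReadOnlySpan<char>, so a slice of the string can be parsed without extracting it. A sketch of the question's loop on that basis follows. The class name DoubleListParser is made up, and single spaces as the only separator and invariant-culture number formatting are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class DoubleListParser
{
    public static List<double> Parse(string s)
    {
        var result = new List<double>();
        ReadOnlySpan<char> rest = s.AsSpan();
        while (true)
        {
            rest = rest.TrimStart(' ');          // skip leading separators
            if (rest.IsEmpty) break;
            int end = rest.IndexOf(' ');         // end of the current number
            if (end == -1) end = rest.Length;
            // Parses the slice in place; no substring is allocated.
            result.Add(double.Parse(rest.Slice(0, end), NumberStyles.Float, CultureInfo.InvariantCulture));
            rest = rest.Slice(end);
        }
        return result;
    }
}
```

Whether this is measurably faster than Split plus TryParse depends on the input, so it is worth profiling before committing to it.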

If you want it to be really fast, I would use a state machine.
It could look something like this (an outline only, GotNewDouble is left to implement):
enum State
{
    Separator, Sign, Mantissa // etc.
}
State currentState = State.Separator;
int prefix = 0, exponent = 0, mantissa = 0;
foreach (var ch in inputString)
{
    switch (currentState)
    {   // set the new currentState depending on ch and currentState
        case State.Separator:
            GotNewDouble(prefix, exponent, mantissa);
            break;
    }
}
