In C#, how do I convert Unicode-escaped characters back to the original text?

I have the text Grou00dfbeerenstrau00dfe and need to convert it to Großbeerenstraße, and likewise Eichstu00e4tt to Eichstätt.
I can't work out how to do this, for two reasons:
Only some characters (the special ones) are escaped, not the whole text.
Unicode escapes normally start with a backslash, as in \u00df, but here I only get u00df.
How can I convert such text back to its original form when the escape character is missing?
NOTE: I send strings containing special characters to a system that I cannot change. When I request the same string back, that system returns Grou00dfbeerenstrau00dfe instead of Großbeerenstraße, and so on.

Based on David's idea of looking for u and checking if the following 4 characters are valid hex numbers, it would look something like this:
using System;
using System.Globalization;
using System.Text;

public string FixGermanUnicode(string input) {
    var output = new StringBuilder();
    for (var i = 0; i < input.Length; i++) {
        // Look for "u0" followed by three more characters that parse as hex together with the 0.
        if (i < input.Length - 4 && input[i] == 'u' && input[i + 1] == '0'
            && int.TryParse(input.Substring(i + 1, 4), NumberStyles.HexNumber, null, out var code)) {
            try {
                output.Append(char.ConvertFromUtf32(code));
                i += 4; // skip the four hex digits; the loop increment skips the 'u'
            } catch (ArgumentOutOfRangeException) {
                // not a valid Unicode character
                output.Append(input[i]);
            }
        } else {
            output.Append(input[i]);
        }
    }
    return output.ToString();
}
Console.WriteLine(FixGermanUnicode("Grou00dfbeerenstrau00dfe"));
Strictly speaking, it checks for u0 rather than just u, to reduce the number of cases where u happens to be followed by four valid hex digits but was never an escape. That works for German at least, since the German special characters (ä, ö, ü, ß and their capitals) all have code points starting with 0.
It also handles the case where the following four characters are valid hex digits but the resulting number is not a valid Unicode character.

While I completely agree with @Gabriel Luci's answer, I would like to point out a more concise implementation of the same idea (it needs the System.Text.RegularExpressions namespace):
static readonly string unicodePattern = @"u0[0-9a-fA-F]{3}";
public static string FixGermanUnicode(string input)
{
    return Regex.Replace(input, unicodePattern, match =>
    {
        // Drop the leading 'u' and keep the four hex digits.
        var digits = match.Value.Substring(1);
        try
        {
            return char.ConvertFromUtf32(int.Parse(digits, System.Globalization.NumberStyles.AllowHexSpecifier));
        }
        catch (ArgumentOutOfRangeException)
        {
            // not a valid Unicode character
            return match.Value;
        }
    });
}
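
Both answers rely on the same "u0 plus three hex digits" heuristic, so both share its limitation. The following is only a minimal sketch of that heuristic with a few sample inputs, not a replacement for the code above; the class name UnescapeDemo and the input "Menu0123" are made up for illustration.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class UnescapeDemo
{
    // 'u' followed by exactly four hex digits, the first of which is '0'.
    private static readonly Regex Pattern = new Regex("u(0[0-9a-fA-F]{3})");

    public static string Fix(string input)
    {
        return Pattern.Replace(input, m =>
        {
            // Four hex digits starting with 0 give a value in 0x0000..0x0FFF,
            // which always fits in a single char and is never a surrogate.
            int code = int.Parse(m.Groups[1].Value, NumberStyles.AllowHexSpecifier);
            return ((char)code).ToString();
        });
    }
}
```

UnescapeDemo.Fix("Grou00dfbeerenstrau00dfe") returns "Großbeerenstraße" and UnescapeDemo.Fix("Eichstu00e4tt") returns "Eichstätt". However, UnescapeDemo.Fix("Menu0123") returns "Menģ", because "u0123" looks exactly like an escape. Without the backslash there is no way to tell the two cases apart, so the heuristic is only safe when such sequences cannot occur in the real data.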

Related

Convert any string to ASCII, Remove Backslash

This question may reveal my ignorance regarding character encoding, so if it does, I would greatly appreciate information to correct that.
I am relaying strings from new applications to an old application. The old application only accepts ASCII characters (http://www.asciitable.com/). The old application also does not support certain characters such as backslashes. The new applications support more or less anything.
Let's say I have the string:
"Whatever - 1_夜_💦💦💦"
I need to convert that to something with only ASCII characters. For example, maybe something like:
"Whatever - 1_\u001cY_=???=???=???"
Then I want to replace the remaining illegal characters with substitution strings.
Ideally, any character that is encoded to ASCII should be able to be de-coded. That is, any unique input string will have a unique output string (no arbitrary inputs "abc" and "xyz" which are different produce the same result). An algorithm could convert the output string back to the input string.
This is what I've tried:
static string ConvertToAscii(string str)
{
var return_string = "";
foreach (var c in str)
{
if ((int)c < 128)
{
return_string += c;
}
else
{
var charBytes = BitConverter.GetBytes(c);
var ascii = Encoding.ASCII.GetString(charBytes);
return_string += ascii;
}
}
return return_string;
}
When I use this with the string I mentioned above, I get:
"Whatever - 1_\u001cY_=???=???=???"
That seems great - however, the "\u001cY" is apparently a single character, rather than a collection of ASCII characters. So my target database rejects it, and I am not able to figure out how to remove the "\" while leaving the remaining characters.
How can I convert any string into a collection of ASCII characters?

The easiest approach is to Base64-encode all the bytes, since you don't seem to care how the strings are represented:
Convert.ToBase64String(Encoding.Unicode.GetBytes("Whatever - 1_夜_💦💦💦"))
This produces a result that is guaranteed to be ASCII (printable ASCII, even). For your string the result is "VwBoAGEAdABlAHYAZQByACAALQAgADEAXwAcWV8APdim3D3Yptw92Kbc".
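
Since the question also asks for the conversion to be reversible, here is a minimal round-trip sketch of the Base64 idea. The class and method names (Base64Ascii, Encode, Decode) are hypothetical; the framework calls are Convert.ToBase64String, Convert.FromBase64String and Encoding.Unicode.

```csharp
using System;
using System.Text;

public static class Base64Ascii
{
    // Encode the UTF-16 bytes of the string as Base64 (letters, digits, '+', '/', '=').
    public static string Encode(string text)
    {
        return Convert.ToBase64String(Encoding.Unicode.GetBytes(text));
    }

    // Reverse of Encode: Base64 text back to bytes, bytes back to a string.
    public static string Decode(string ascii)
    {
        return Encoding.Unicode.GetString(Convert.FromBase64String(ascii));
    }
}
```

Base64Ascii.Decode(Base64Ascii.Encode(s)) gives back s for any well-formed string, and the encoded form never contains a backslash. The trade-off is that the encoded text is no longer human-readable, even for the parts that were plain ASCII to begin with.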

Here is code similar to what I ended up using to convert everything to ASCII:
internal static string ConvertToAscii(string str)
{
    var returnStringBuilder = new StringBuilder();
    foreach (var c in str)
    {
        if (char.IsControl(c))
        {
            // Control character: drop it
            continue;
        }
        if (c < 127)
        {
            // ASCII character
            returnStringBuilder.Append(c);
        }
        else
        {
            // Anything else is written as U+ followed by four hex digits
            returnStringBuilder.Append("U+" + ((int) c).ToString("X4"));
        }
    }
    return returnStringBuilder.ToString();
}
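
The question asked for output that can be decoded again. A possible decoder for the U+XXXX format produced above could look like the sketch below; the names UPlusDecoder and Decode are hypothetical and not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UPlusDecoder
{
    // "U+" followed by exactly four uppercase hex digits, as written by ToString("X4").
    private static readonly Regex Escape = new Regex(@"U\+([0-9A-F]{4})");

    public static string Decode(string ascii)
    {
        return Escape.Replace(ascii, m =>
            ((char)Convert.ToInt32(m.Groups[1].Value, 16)).ToString());
    }
}
```

UPlusDecoder.Decode("1_U+591C_") returns "1_夜_". Note that this format is not strictly reversible: control characters are dropped by the encoder, and an input that already contains literal text such as "U+0041" cannot be told apart from an escape when decoding. If that matters, the "U" (or "+") of literal text would have to be escaped as well.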

Splitting a multi-lingual string

I receive a reply text from an AS/400 as a multilingual string, as shown below. It is 28872 characters long.
2012021920120219000000000300000D000000010146208D22ﻑﻳﺭﺎﺻﻣ
I have to split the text into blocks of 240 characters, but because there are Arabic characters in between, my logic fails to extract blocks of exactly 240 characters.
My question is: how can I split a multilingual text without losing the original format?

You should write your code so that, depending on the text encoding, it extracts exactly 240 characters. A character can take several bytes depending on the encoding. A common encoding is UTF-8; the Wikipedia article on UTF-8 describes how it works, which will let you write correct code.
So, you need to find out how many bytes the current character takes.
Of course, before you start, make sure you know the encoding of the input text.
Note that Java uses UTF-16 to store characters, so a character whose code point is 2^16 or above is made up of two chars (a surrogate pair). Arabic letters lie below that limit and fit in a single char, but each takes two or three bytes in UTF-8. To work with this correctly, I would convert the whole string to a byte buffer:
String longStringToSplit = ...;
byte[] stringUTF8 = longStringToSplit.getBytes("UTF-8");
// now, split it manually and correct, using the utf-8 specifications you
// can find in the link I gave you to wiki.

Here is some simple code that can do this:
List<string> SplitString(string input, int length)
{
    var splitedList = new List<string>();
    string block = "";
    var arabicBlock = "";
    foreach (char c in input)
    {
        if (block.Length + arabicBlock.Length > length - 1)
        {
            splitedList.Add(block);
            block = "";
        }
        // Collect consecutive Arabic characters so that a run is never cut in half.
        // This assumes that no Arabic run is longer than 'length'.
        if (IsArabicChar(c))
        {
            arabicBlock += c;
        }
        else
        {
            block += arabicBlock + c;
            arabicBlock = "";
        }
    }
    // Flush the final (possibly shorter) block.
    block += arabicBlock;
    if (block.Length > 0)
    {
        splitedList.Add(block);
    }
    return splitedList;
}
The IsArabicChar method used above can look like this:
internal static bool IsArabicChar(Char character)
{
    if (character >= 0x600 && character <= 0x6ff)
        return true;
    if (character >= 0x750 && character <= 0x77f)
        return true;
    if (character >= 0xfb50 && character <= 0xfc3f)
        return true;
    if (character >= 0xfe70 && character <= 0xfefc)
        return true;
    return false;
}
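
If the 240-character blocks are counted in UTF-16 chars (which is what string.Length counts in C#), every Arabic letter in the ranges above is a single char, and a plain fixed-size split may already be enough. The sketch below makes that assumption; FixedSplitter and SplitFixed are hypothetical names. The only special case it handles is not cutting a surrogate pair in two.

```csharp
using System;
using System.Collections.Generic;

public static class FixedSplitter
{
    // Splits input into blocks of at most 'length' UTF-16 chars.
    public static List<string> SplitFixed(string input, int length)
    {
        if (length < 2) throw new ArgumentOutOfRangeException(nameof(length));
        var blocks = new List<string>();
        int start = 0;
        while (start < input.Length)
        {
            int take = Math.Min(length, input.Length - start);
            // If the block would end on the first half of a surrogate pair,
            // end it one char earlier so the pair stays together.
            if (start + take < input.Length && char.IsHighSurrogate(input[start + take - 1]))
            {
                take--;
            }
            blocks.Add(input.Substring(start, take));
            start += take;
        }
        return blocks;
    }
}
```

For example, FixedSplitter.SplitFixed("abcdef", 4) yields "abcd" and "ef". If the AS/400 side instead counts the 240 positions in bytes of some other encoding, the split has to be done on the byte array of that encoding, as the previous answer suggests.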

Is there a way to check whether text is in Cyrillic or Latin using C#?

Is there a way to check whether text is in Cyrillic or Latin using C#?

Use a Regex and check for \p{IsCyrillic}, for example:
if (Regex.IsMatch(stringToCheck, @"\p{IsCyrillic}"))
{
    // there is at least one Cyrillic character in the string
}
This would be true for the string "abcабв" because it contains at least one Cyrillic character. If you want it to be false when there are non-Cyrillic characters in the string, use:
if (!Regex.IsMatch(stringToCheck, @"\P{IsCyrillic}"))
{
    // there are only Cyrillic characters in the string
}
This would be false for the string "abcабв", but true for "абв".
To see what the IsCyrillic named block and the other named blocks contain, have a look at http://msdn.microsoft.com/en-us/library/20bw873z.aspx#SupportedNamedBlocks

How about this?
string pattern = @"\p{IsCyrillic}";
if (Regex.Matches(textInput, pattern).Count > 0)
{
    // contains Cyrillic characters
}
If you want to check whether the text contains more than x Cyrillic characters, change the number on the right-hand side.
Our system received spam email in which roughly 30% of the text was Cyrillic, so a check for 100% or 0% could not decide.
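
For the "roughly 30%" case, a ratio may be more convenient than an absolute count. The following is a sketch of that idea; ScriptStats and CyrillicRatio are hypothetical names, and the range U+0400 to U+04FF is the Cyrillic block that \p{IsCyrillic} refers to.

```csharp
using System;
using System.Linq;

public static class ScriptStats
{
    // Fraction of the letters in 'text' that lie in the Cyrillic block.
    // Returns 0 when the text contains no letters at all.
    public static double CyrillicRatio(string text)
    {
        int letters = text.Count(c => char.IsLetter(c));
        if (letters == 0)
        {
            return 0.0;
        }
        int cyrillic = text.Count(c => char.IsLetter(c) && c >= '\u0400' && c <= '\u04FF');
        return (double)cyrillic / letters;
    }
}
```

ScriptStats.CyrillicRatio("abcабв") is 0.5 and ScriptStats.CyrillicRatio("абв") is 1.0, so a spam filter could compare the result against whatever threshold suits the data, for example 0.3.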

Here is another solution for this problem:
public bool isCyrillic(string textInput)
{
    bool rezultat = true;
    // Lowercase letters of the Macedonian Cyrillic alphabet only;
    // extend the set for uppercase letters or other Cyrillic alphabets.
    string pattern = @"[абвгдѓежзѕијклљмнњопрстќуфхцчџш]";
    char[] textArray = textInput.ToCharArray();
    for (int i = 0; i < textArray.Length; i++)
    {
        if (!Regex.IsMatch(textArray[i].ToString(), pattern))
        {
            rezultat = false;
            break;
        }
    }
    return rezultat;
}

Is there an easy way to trim the last three characters off a string

I have strings like this:
var a = "abcdefg";
var b = "xxxxxxxx";
The strings are always longer than five characters.
Now I need to trim off the last 3 characters. Is there some simple way that I can do this with C#?

In the trivial case you can just use
result = s.Substring(0, s.Length - 3);
to remove the last three characters from the string.
Or, as Jason suggested, Remove is an alternative:
result = s.Remove(s.Length - 3);
Unfortunately, for Unicode strings there can be a few problems:
A Unicode code point can consist of multiple chars, since string is encoded as UTF-16 (see surrogate pairs). This happens only for characters outside the Basic Multilingual Plane, i.e. with a code point of 2^16 or above. This is relevant if you want to support emoji or the rarer Chinese characters.
A glyph (graphical symbol) can consist of multiple code points. For example, ä can be written as a followed by a combining ¨.
Behavior with right-to-left writing might not be what you want either.
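
For the first two problems, System.Globalization.StringInfo can count and cut by "text elements" instead of chars. This is a sketch only; TextElementTrim and RemoveLastTextElements are hypothetical names, and what exactly counts as one text element differs between .NET versions (newer versions follow Unicode grapheme clusters more closely).

```csharp
using System;
using System.Globalization;

public static class TextElementTrim
{
    // Removes the last 'count' text elements (not chars) from s.
    public static string RemoveLastTextElements(string s, int count)
    {
        if (count < 0) throw new ArgumentOutOfRangeException(nameof(count));
        var info = new StringInfo(s);
        int keep = info.LengthInTextElements - count;
        // Nothing left to keep: return an empty string instead of throwing.
        return keep <= 0 ? string.Empty : info.SubstringByTextElements(0, keep);
    }
}
```

With this, TextElementTrim.RemoveLastTextElements("abcdefg", 3) returns "abcd", and a string ending in three emoji loses exactly those three emoji rather than one and a half of them. It does not address the right-to-left concern.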

You want String.Remove(Int32)
Deletes all the characters from this string beginning at a specified
position and continuing through the last position.
If you want to perform validation, along the lines of druttka's answer, I would suggest creating an extension method
public static class MyStringExtensions
{
    public static string SafeRemove(this string s, int numCharactersToRemove)
    {
        if (numCharactersToRemove > s.Length)
        {
            throw new ArgumentException("numCharactersToRemove");
        }
        // other validation here
        return s.Remove(s.Length - numCharactersToRemove);
    }
}
var s = "123456";
var r = s.SafeRemove(3); //r = "123"
var t = s.SafeRemove(7); //throws ArgumentException

string a = "abcdefg";
a = a.Remove(a.Length - 3);

string newString = oldString.Substring(0, oldString.Length - 3);

If you really only need to trim off the last 3 characters, you can do this:
string a = "abcdefg";
if (a.Length > 3)
{
    a = a.Substring(0, a.Length - 3);
}
else
{
    a = String.Empty;
}

Converting "Bizarre" Chars in String to Roman Chars

I need to be able to convert user input to [a-z] roman characters ONLY (not case sensitive). So, there are only 26 characters that I am interested in.
However, the user can type in any "form" of those characters that they wish. The Spanish "n", the French "e", and the German "u" can all have accents from the user input (which are removed by the program).
I've gotten pretty close with these two extension methods:
public static string LettersOnly(this string Instring)
{
    char[] aChar = Instring.ToCharArray();
    int intCount = 0;
    string strTemp = "";
    for (intCount = 0; intCount <= Instring.Length - 1; intCount++)
    {
        if (char.IsLetter(aChar[intCount]))
        {
            strTemp += aChar[intCount];
        }
    }
    return strTemp;
}
public static string RemoveAccentMarks(this string s)
{
    // FormD splits an accented letter into the base letter plus combining marks.
    string normalizedString = s.Normalize(NormalizationForm.FormD);
    StringBuilder sb = new StringBuilder();
    char c;
    for (int i = 0; i <= normalizedString.Length - 1; i++)
    {
        c = normalizedString[i];
        if (System.Globalization.CharUnicodeInfo.GetUnicodeCategory(c) != System.Globalization.UnicodeCategory.NonSpacingMark)
        {
            sb.Append(c);
        }
    }
    return sb.ToString();
}
Here is an example test:
string input = "Àlièñ451";
input = input.LettersOnly().RemoveAccentMarks().ToLower();
Console.WriteLine(input);
Result: "alien" (as expected)
This works for 99.9% of the cases. However, a few characters seem to pass all of the checks.
For instance, "ß" (a German double-s, I think). .NET considers this a letter, and the function above does not see any accent marks on it, but it STILL isn't in the range a-z, as I need it to be. Ideally, I could convert this to a "B" or an "ss" (whichever is appropriate), but I need to convert it to SOMETHING in the range a-z.
Another example is the diphthong "æ". Again, .NET considers this a letter. The function above doesn't see any accent, but again, it isn't in the 26-character roman alphabet. In this case, I need to convert it to the two letters "ae" (I think).
Is there an easy way to convert ANY worldwide input to the closest roman alphabet equivalent? It is expected that this probably won't be a perfectly clean translation, but I need to trust that the inputs at FlipScript.com are ONLY getting the characters a-z... and nothing else.
Any and all help appreciated.

If I were you, I'd create a Dictionary containing the mappings from foreign letters to Roman letters. I'd do this for two reasons:
It makes what you want to do easier to understand for someone reading your code.
There is a small, finite number of these special letters, so you don't need to worry about maintaining the data structure.
I'd put the mappings into an XML file and load them into the data structure at run time. That way, you do not need to modify any code that uses the characters; you only need to specify the mappings themselves.
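
A minimal sketch of the dictionary idea, combined with the accent removal from the question, could look like this. The class name RomanLetters, the method ToAtoZ and the four mappings are assumptions chosen for illustration; a real table would be longer and, as suggested above, could be loaded from a file instead of being hard-coded.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class RomanLetters
{
    // Illustrative mappings only; letters that do not decompose under FormD go here.
    private static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        { 'ß', "ss" },
        { 'æ', "ae" },
        { 'œ', "oe" },
        { 'ø', "o" },
    };

    public static string ToAtoZ(string input)
    {
        var sb = new StringBuilder();
        // Lowercase first, then decompose accented letters into base letter + marks.
        foreach (char c in input.ToLowerInvariant().Normalize(NormalizationForm.FormD))
        {
            if (c >= 'a' && c <= 'z')
            {
                sb.Append(c);
            }
            else if (Map.TryGetValue(c, out string replacement))
            {
                sb.Append(replacement);
            }
            // Everything else (digits, combining marks, unmapped letters) is dropped,
            // so the result contains only a-z.
        }
        return sb.ToString();
    }
}
```

RomanLetters.ToAtoZ("Àlièñ451") returns "alien" and RomanLetters.ToAtoZ("Straße") returns "strasse". Because unmapped characters are dropped rather than passed through, the output is guaranteed to stay within a-z even for input the table does not cover, at the cost of silently losing those letters.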
