Unicode to ASCII with character translations for umlauts

I have a client that sends Unicode input files and demands only ASCII-encoded files in return; why is unimportant.
Does anyone know of a routine that translates a Unicode string to its closest approximation in ASCII? I'm looking to replace common Unicode characters such as 'ä' with their best ASCII representation.
For example: 'ä' -> 'a'
The data resides in SQL Server, but I can also work in C#, either as a downstream step or as a CLR procedure.

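Just loop through the string and translate each character with a switch, appending the result to a StringBuilder:
var sb = new StringBuilder();
foreach (char inputCharacter in input)
{
    switch (inputCharacter)
    {
        case 'ä':
            sb.Append("ae");
            break;
        case 'ö':
            sb.Append("oe");
            break;
        ...
        default:
            sb.Append(inputCharacter); // leave unmapped characters unchanged
            break;
    }
}
string outputString = sb.ToString();
(These translations are common in German text that is restricted to ASCII.)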

I think you really mean extended ASCII to ASCII.
A simple dictionary will do:
Dictionary<char, char> trans = new Dictionary<char, char>() {...};
StringBuilder sb = new StringBuilder();
foreach (char c in input)
{
    if (c <= 127)
        sb.Append(c); // already ASCII
    else if (trans.TryGetValue(c, out char replacement))
        sb.Append(replacement);
    else
        sb.Append('?'); // no mapping defined for this character
}
string ascii = sb.ToString();
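As a complement to the hand-written tables above, here is a minimal sketch of the 'ä' -> 'a' style of translation using Unicode normalization. The class and method names (AsciiFolding, ToAsciiApproximation) and the '?' fallback are assumptions made for illustration, not part of any answer above. Characters without a canonical decomposition, such as 'ß' or 'ø', still need a lookup table like the ones shown earlier.

```csharp
using System.Globalization;
using System.Text;

public static class AsciiFolding
{
    // Hypothetical helper: removes combining marks after decomposition and
    // replaces anything that is still non-ASCII with '?'.
    public static string ToAsciiApproximation(string input)
    {
        // FormD splits 'ä' into 'a' followed by U+0308 (combining diaeresis).
        string decomposed = input.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Skip the accent itself and keep the base letter.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;
            // Characters with no decomposition (e.g. 'ß') fall back to '?'.
            sb.Append(c <= 127 ? c : '?');
        }
        return sb.ToString();
    }
}
```

Calling AsciiFolding.ToAsciiApproximation("Filipović") should give "Filipovic". If the German convention ('ä' -> "ae") is wanted instead, apply that table first and use this only for the remaining characters.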

Related

UTF-16 error, how to solve the unrecognized escape sequence?

This is a translator program that takes normal letters and converts them to sequences of symbols.
The problem is that when I try to add symbol strings such as allAlphabets.Add("[]/[]"); or allAlphabets.Add("//"); I get an error about UTF-16:
static void Main(string[] args)
{
string input = ""; // string input
List<string> allAlphabets = new List<string>(); // storing to a list
input = Console.ReadLine();
char[] word = input.ToCharArray();
for (int i = 0; i < word.Length; i++)
{
switch (word[i]) // switch case
{
// normal letters
case 'm':
allAlphabets.Add("[]\/[]"); // represents text as a sequence of utf-16 code units
break;
case 'n':
allAlphabets.Add("[]\[]"); // represents text as a sequence of utf-16 code units
case 'v':
allAlphabets.Add("\/"); // represents text as a sequence of utf-16 code units
break;
case 'w':
allAlphabets.Add("\/\/"); // represents text as a sequence of utf-16 code units
}
}
}
}
Does someone know a way of encoding the unrecognized escape sequence?
Thank you!

You need to use the verbatim identifier (@):
To indicate that a string literal is to be interpreted verbatim. The @
character in this instance defines a verbatim string literal. Simple
escape sequences (such as "\\" for a backslash), hexadecimal escape
sequences (such as "\x0041" for an uppercase A), and Unicode escape
sequences (such as "\u0041" for an uppercase A) are interpreted
literally. Only a quote escape sequence ("") is not interpreted
literally; it produces one double quotation mark. Additionally, in case
of a verbatim interpolated string brace escape sequences ({{ and }})
are not interpreted literally; they produce single brace characters.
allAlphabets.Add(@"[]\/[]");
or escape the backslash:
allAlphabets.Add("[]\\/[]");
Additional Resources
Strings (C# Programming Guide)
Regular and Verbatim String Literals
String Escape Sequences
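To make the two fixes concrete, the following small sketch checks that the verbatim literal and the escaped literal denote the same six-character string. The class and member names (EscapeDemo, Verbatim, Escaped, AreEquivalent) are invented for this illustration.

```csharp
public static class EscapeDemo
{
    // Verbatim literal: the backslash is taken as-is.
    public static readonly string Verbatim = @"[]\/[]";

    // Regular literal: the backslash must be doubled to escape it.
    public static readonly string Escaped = "[]\\/[]";

    // Both spellings produce the same characters: [ ] \ / [ ]
    public static bool AreEquivalent()
    {
        return Verbatim == Escaped && Verbatim.Length == 6;
    }
}
```

Which form to prefer is mostly a matter of readability; verbatim literals tend to be easier to scan when a string contains many backslashes.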

HTML Encode ISO-8859-2 (Latin-2) characters in C#

Does anyone know how to HTML-encode ISO-8859-2 characters in C#? The following example does not work:
String name = "Filipović";
String encoded = WebUtility.HtmlEncode(name);
The resulting string should be
"Filipovi&#263;"
Thanks

After reading your comments (you also need to support Chinese names using only ASCII characters), I think you shouldn't stick to the ISO-8859-2 encoding.
Solution 1
Use UTF-7 encoding for such names. UTF-7 is designed to use only ASCII characters for any Unicode string. Note that Encoding.UTF7 is marked obsolete as of .NET 5.
string value = "Filipović with Unicode symbol: 🏯";
var encoded = Encoding.ASCII.GetString(Encoding.UTF7.GetBytes(value));
Console.WriteLine(encoded); // Filipovi+AQc- with Unicode symbol: +2Dzf7w-
var decoded = Encoding.UTF7.GetString(Encoding.ASCII.GetBytes(encoded));
Solution 2
Alternatively, you can use base64 encoding, too. But in this case the pure ASCII strings will not be human-readable anymore.
string value = "Filipović with Unicode symbol: 🏯";
var encoded = Convert.ToBase64String(Encoding.UTF8.GetBytes(value));
Console.WriteLine(encoded); // RmlsaXBvdmnEhyB3aXRoIFVuaWNvZGUgc3ltYm9sOiDwn4+v
var decoded = Encoding.UTF8.GetString(Convert.FromBase64String(encoded));
Solution 3
If you really want to stick with HTML entity encoding, you can achieve it like this:
string value = "Filipović with Unicode symbol: 🏯";
var result = new StringBuilder();
for (int i = 0; i < value.Length; i++)
{
if (Char.IsHighSurrogate(value[i]))
{
result.Append($"&#{Char.ConvertToUtf32(value[i], value[i + 1])};");
i++;
}
else if (value[i] > 127)
result.Append($"&#{(int)value[i]};");
else
result.Append(value[i]);
}
Console.WriteLine(result); // Filipovi&#263; with Unicode symbol: &#127983;

If you don't have a strict requirement for HTML encoding, I'd recommend URL (%) encoding, which encodes all non-ASCII characters:
String name = "Filipović";
String encoded = WebUtility.UrlEncode(name); // Filipovi%C4%87
If you must have all non-ASCII characters HTML-encoded consistently, your best bet is to use the &#xNNNN; or &#NNNN; format for every character above 127. Unfortunately there is no way to convince HtmlEncode to encode all characters, so you need to do it yourself, similarly to how it is done in "Convert a Unicode string to an escaped ASCII string". You can continue using HtmlDecode to read the values back, as it handles &#xNNNN; just fine.
Non-optimal sample:
var name = "Filipović";
var result = String.Join("",
name.Select(x => x <= 127 ? x.ToString() : String.Format("&#x{0:X4};", (int)x))
);

Encode - C# convert ISO-8859-1 entities number to characters

I found a question about how to convert ISO-8859-1 characters to entity numbers:
C# convert ISO-8859-1 characters to entity number
Code:
string input = "Steel Décor";
StringBuilder output = new StringBuilder();
foreach (char ch in input)
{
if (ch > 0x7F)
output.AppendFormat("&#{0};", (int) ch);
else
output.Append(ch);
}
// output.ToString() == "Steel D&#233;cor"
But I couldn't figure out how to do the opposite, converting from entity numbers back to characters, for example from
// "Steel D&#233;cor" to "Steel Décor"
PS: all accented characters in my string are entity codes.
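One possible way to do the reverse conversion, sketched here under the assumption that the input contains ordinary HTML numeric entities, is WebUtility.HtmlDecode from System.Net. The wrapper class and method names (EntityDecoding, Decode) are hypothetical and exist only to keep the example self-contained.

```csharp
using System.Net;

public static class EntityDecoding
{
    // Turns numeric character references such as "&#233;" back into characters.
    // Note: HtmlDecode also decodes named entities such as "&amp;".
    public static string Decode(string encoded)
    {
        return WebUtility.HtmlDecode(encoded);
    }
}
```

For example, EntityDecoding.Decode("Steel D&#233;cor") should return "Steel Décor". If named entities must be left untouched, a custom pass over the "&#NNN;" patterns would be needed instead.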

How do I get a list of all the printable characters in C#?

I'd like to be able to get a char array of all the printable characters in C#, does anybody know how to do this?
edit:
By printable I mean the visible European characters, so yes, umlauts, tildes, accents etc.

This will give you a list with all characters that are not considered control characters:
List<Char> printableChars = new List<char>();
for (int i = char.MinValue; i <= char.MaxValue; i++)
{
char c = Convert.ToChar(i);
if (!char.IsControl(c))
{
printableChars.Add(c);
}
}
You may want to investigate the other Char.IsXxxx methods to find a combination that suits your requirements.

Here's a LINQ version of Fredrik's solution. Note that Enumerable.Range yields an IEnumerable<int> so you have to convert to chars first. Cast<char> would have worked in 3.5SP0 I believe, but as of 3.5SP1 you have to do a "proper" conversion:
var chars = Enumerable.Range(0, char.MaxValue+1)
.Select(i => (char) i)
.Where(c => !char.IsControl(c))
.ToArray();
I've created the result as an array as that's what the question asked for - it's not necessarily the best idea though. It depends on the use case.
Note that this also doesn't consider full Unicode characters, only those in the basic multilingual plane. I don't know what it returns for high/low surrogates, but it's worth at least knowing that a single char doesn't really let you represent everything :(

A LINQ solution (based on Fredrik Mörk's):
// The second argument of Range is a count, so add 1 to include char.MaxValue itself.
Enumerable.Range(char.MinValue, char.MaxValue + 1).Select(c => (char)c).Where(
c => !char.IsControl(c)).ToArray();

TLDR Answer
Use this Regex...
var regex = new Regex(@"[^\p{Cc}\p{Cn}\p{Cs}]");
TLDR Explanation
[^...] : Match any single character that is not in the listed categories.
\p{Cc} : Control characters (excluded).
\p{Cn} : Unassigned code points (excluded).
\p{Cs} : Surrogate code units (excluded).
Working Demo
I test two strings in this demo: "Hello, World!" and "Hello, World!" + (char)4. (char)4 is the END OF TRANSMISSION control character.
using System;
using System.Text.RegularExpressions;
public class Test {
public static MatchCollection getPrintableChars(string haystack) {
var regex = new Regex(@"[^\p{Cc}\p{Cn}\p{Cs}]");
var matches = regex.Matches(haystack);
return matches;
}
public static void Main() {
var teststring1 = "Hello, World!";
var teststring2 = "Hello, World!" + (char)4;
var teststring1unprintablechars = getPrintableChars(teststring1);
var teststring2unprintablechars = getPrintableChars(teststring2);
Console.WriteLine("Testing a Printable String: " + teststring1unprintablechars.Count + " Printable Chars Detected");
Console.WriteLine("Testing a String With 1-Unprintable Char: " + teststring2unprintablechars.Count + " Printable Chars Detected");
foreach (Match unprintablechar in teststring1unprintablechars) {
Console.WriteLine("String 1 Printable Char:" + unprintablechar);
}
foreach (Match unprintablechar in teststring2unprintablechars) {
Console.WriteLine("String 2 Printable Char:" + unprintablechar);
}
}
}
Full Working Demo at IDEOne.com
Alternatives
\P{C} : Match only characters outside the "Other" category, that is, no control, format, surrogate, private-use, or unassigned characters.
\P{Cc} : Match only non-control characters. Do not match any control characters.
[^\p{Cc}\p{Cn}] : Match only non-control characters that have been assigned. Do not match any control or unassigned characters.
[^\p{Cc}\p{Cn}\p{Cs}] : Match only assigned, non-control characters that are not surrogates. Do not match any control, unassigned, or surrogate characters.
[^\p{Cc}\p{Cn}\p{Cs}\p{Cf}] : Match only assigned, non-control, non-formatting characters that are not surrogates. Do not match any control, unassigned, formatting, or surrogate characters.
Source and Explanation
Take a look at the Unicode Character Properties available that can be used to test within a regex. You should be able to use these regexes in Microsoft .NET, JavaScript, Python, Java, PHP, Ruby, Perl, Golang, and even Adobe. Knowing Unicode character classes is very transferable knowledge, so I recommend using it!

I know ASCII wasn't specifically requested but this is a quick way to get a list of all the printable ASCII characters.
for (Int32 i = 0x20; i <= 0x7e; i++)
{
printableChars.Add(Convert.ToChar(i));
}
See this ASCII table.
Edit:
As stated by Péter Szilvási, the 0x20 and 0x7e in the loop are hexadecimal representations of the base-10 numbers 32 and 126, which are the first and last printable ASCII characters.

public bool IsPrintableASCII(char c)
{
return c >= '\x20' && c <= '\x7e';
}

How do I convert C# characters to their hexadecimal code representation

What I need to do is convert a C# character to an escaped Unicode string.
So, 'A' -> "\x0041".
Is there a better way to do this than:
char ch = 'A';
string strOut = String.Format("\\x{0}", Convert.ToUInt16(ch).ToString("x4"));

Cast and use composite formatting:
char ch = 'A';
string strOut = String.Format(@"\x{0:x4}", (ushort)ch);
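The same formatting can also be written with a verbatim interpolated string, assuming a compiler that supports C# 6 or later. This is only a sketch; the class and method names (HexEscape, ToEscape) are made up for the example.

```csharp
public static class HexEscape
{
    // Formats a char as a backslash, 'x', and four lowercase hex digits.
    // The $@ prefix keeps the backslash literal while still allowing {…} holes.
    public static string ToEscape(char ch)
    {
        return $@"\x{(int)ch:x4}";
    }
}
```

HexEscape.ToEscape('A') should produce the six characters \x0041, the same result as the String.Format call above.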
