C# convert ISO-8859-1 entity numbers to characters

I found a question about how to convert ISO-8859-1 characters to entity numbers:
C# convert ISO-8859-1 characters to entity number
Code:
string input = "Steel Décor";
StringBuilder output = new StringBuilder();
foreach (char ch in input)
{
    if (ch > 0x7F)
        output.AppendFormat("&#{0};", (int) ch);
    else
        output.Append(ch);
}
// output.ToString() == "Steel D&#233;cor"
But I couldn't figure out how to do the opposite: converting from entity numbers back to characters, for example
// from "Steel D&#233;cor" to "Steel Décor"
PS: all accented characters in my string are entity codes.
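One possible way to go in the opposite direction, as a minimal sketch. It assumes the entities are decimal numeric references of the form &#NNN;. WebUtility.HtmlDecode (in System.Net) decodes these along with named entities; the regex variant decodes only the numeric ones. The names EntityDecoder, DecodeAll and DecodeNumericEntities are made up for illustration.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class EntityDecoder
{
    // Decodes every HTML entity (numeric and named) using the framework.
    public static string DecodeAll(string input)
    {
        return WebUtility.HtmlDecode(input);
    }

    // Decodes only decimal numeric entities such as &#233; and leaves
    // everything else (for example &amp;) untouched.
    public static string DecodeNumericEntities(string input)
    {
        return Regex.Replace(input, @"&#([0-9]+);", m =>
        {
            int codePoint;
            // Leave the entity as-is if the number overflows an int.
            if (!int.TryParse(m.Groups[1].Value, out codePoint))
                return m.Value;
            // Leave the entity as-is if it is not a valid Unicode scalar value.
            if (codePoint > 0x10FFFF || (codePoint >= 0xD800 && codePoint <= 0xDFFF))
                return m.Value;
            return char.ConvertFromUtf32(codePoint);
        });
    }
}
```

With this sketch, EntityDecoder.DecodeNumericEntities("Steel D&#233;cor") gives "Steel Décor".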

Related

HTML Encode ISO-8859-2 (Latin-2) characters in C#

Does anyone know how to encode the ISO-8859-2 charset in C#? The following example does not work:
String name = "Filipović";
String encoded = WebUtility.HtmlEncode(name);
The resulting string should be
"Filipovi&#263;"
Thanks
After reading your comments (you should also support Chinese names using ASCII chars only), I think you shouldn't stick to the ISO-8859-2 encoding.
Solution 1
Use UTF-7 encoding for such names. UTF-7 is designed to represent any Unicode string using only ASCII characters. Note that Encoding.UTF7 is marked obsolete in .NET 5 and later.
string value = "Filipović with Unicode symbol: 🏯";
var encoded = Encoding.ASCII.GetString(Encoding.UTF7.GetBytes(value));
Console.WriteLine(encoded); // Filipovi+AQc- with Unicode symbol: +2Dzf7w-
var decoded = Encoding.UTF7.GetString(Encoding.ASCII.GetBytes(encoded));
Solution 2
Alternatively, you can use Base64 encoding. In this case, however, even pure ASCII strings will no longer be human-readable.
string value = "Filipović with Unicode symbol: 🏯";
var encoded = Convert.ToBase64String(Encoding.UTF8.GetBytes(value));
Console.WriteLine(encoded); // RmlsaXBvdmnEhyB3aXRoIFVuaWNvZGUgc3ltYm9sOiDwn4+v
var decoded = Encoding.UTF8.GetString(Convert.FromBase64String(encoded));
Solution 3
If you really want to stick to HTML entity encoding, you can achieve it like this:
string value = "Filipović with Unicode symbol: 🏯";
var result = new StringBuilder();
for (int i = 0; i < value.Length; i++)
{
    // Combine a valid surrogate pair into a single code point
    if (Char.IsHighSurrogate(value[i]) && i + 1 < value.Length && Char.IsLowSurrogate(value[i + 1]))
    {
        result.Append($"&#{Char.ConvertToUtf32(value[i], value[i + 1])};");
        i++;
    }
    else if (value[i] > 127)
        result.Append($"&#{(int)value[i]};");
    else
        result.Append(value[i]);
}
Console.WriteLine(result); // Filipovi&#263; with Unicode symbol: &#127983;
If you don't have a strict requirement for HTML encoding, I'd recommend URL (percent) encoding, which encodes all non-ASCII characters:
String name = "Filipović";
String encoded = WebUtility.UrlEncode(name); // Filipovi%C4%87
If you must have a string with all non-ASCII characters HTML-encoded consistently, your best bet is to use the &#xNNNN; or &#NNNN; format to encode all characters above 127. Unfortunately, there is no way to convince HtmlEncode to encode all characters, so you need to do it yourself, similar to how it is done in "Convert a Unicode string to an escaped ASCII string". You can continue using HtmlDecode to read the values back, as it handles &#xNNNN; just fine.
Non-optimal sample:
var name = "Filipović";
var result = String.Join("",
    name.Select(x => x <= 127 ? x.ToString() : String.Format("&#x{0:X4};", (int)x))
);
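A slightly fuller sketch of the same idea, under a few assumptions: it emits hexadecimal entities with the terminating semicolon, combines surrogate pairs into one code point, and does not escape markup characters such as & or <. HtmlAsciiEncoder and EncodeNonAscii are hypothetical names, not framework APIs.

```csharp
using System;
using System.Net;
using System.Text;

public static class HtmlAsciiEncoder
{
    // Replaces every character above 127 with a hexadecimal numeric entity.
    public static string EncodeNonAscii(string value)
    {
        var result = new StringBuilder(value.Length);
        for (int i = 0; i < value.Length; i++)
        {
            char c = value[i];
            if (c <= 127)
            {
                result.Append(c);
            }
            else if (char.IsHighSurrogate(c) && i + 1 < value.Length
                     && char.IsLowSurrogate(value[i + 1]))
            {
                // Valid surrogate pair: emit one entity for the whole code point.
                result.AppendFormat("&#x{0:X};", char.ConvertToUtf32(c, value[i + 1]));
                i++;
            }
            else
            {
                // BMP character (or an unpaired surrogate, emitted as-is).
                result.AppendFormat("&#x{0:X};", (int)c);
            }
        }
        return result.ToString();
    }
}
```

For example, HtmlAsciiEncoder.EncodeNonAscii("Filipović") gives "Filipovi&#x107;", and WebUtility.HtmlDecode turns that back into "Filipović".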

String Conversion - remove some characters and replace non-digits with ASCII code

I need to take the value CS5999-1 and convert it to 678359991. Basically replace any alpha character with the equivalent ASCII value and strip the dash. I need to get rid of non-numeric characters and make the value unique (some of the data coming in is all numeric and I determined this will make the records unique).
I have played around with regular expressions and can replace the characters with an empty string, but can't figure out how to replace the character with an ASCII value.
Code is still stuck in .NET 2.0 (Corporate America) in case that matters for any ideas.
I have tried several different methods to do this and no I don't expect SO members to write the code for me. I am looking for ideas.
To replace the alpha characters with an empty string, I have used:
strResults = Regex.Replace(strResults, @"[A-Za-z\s]", string.Empty);
This loop will replace each character with itself. Basically, if I could find a way to substitute the replacement value with the ASCII value I would have it, but I have tried converting the char value to int and several other methods I found, and they all come up with an error.
foreach (char c in strMapResults)
{
    strMapResults = strMapResults.Replace(c, c);
}
Check if each character is in the a-z or A-Z range. If so, add its ASCII value to the result, and if it is in the 0-9 range, just add the digit.
public static string AlphaToAscii(string str)
{
    var result = string.Empty;
    foreach (char c in str)
    {
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
            result += (int)c;
        else if (c >= '0' && c <= '9')
            result += c;
    }
    return result;
}
All characters outside of the alpha-numeric range (such as -) will be ignored.
If you are running this function on particularly large strings or want better performance you may want to use a StringBuilder instead of +=.
For all characters in the ASCII range, the encoded value is the same as the Unicode code point. This is also true of ISO/IEC 8859-1, and UCS-2, but not of other legacy encodings.
And since UCS-2 is the same as UTF-16 for the values in UCS-2 (which includes all ASCII characters, as per the above), and since .NET char is a UTF-16 unit, all you need to do is just cast to int.
var builder = new StringBuilder(str.Length * 3); // Pre-allocate for the worst-case scenario
foreach (char c in str)
{
    if (c >= '0' && c <= '9')
        builder.Append(c);
    else if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
        builder.Append((int)c);
}
string result = builder.ToString();
If you want to know how you might do this with a regular expression (you mentioned regex in your question), here's one way to do it.
The code below filters all non-digit characters, converting letters to their ASCII representation, and dumping anything else, including all non-ASCII alphabetical characters. Note that treating (int)char as the equivalent of a character's ASCII value is only valid where the character is genuinely available in the ASCII character set, which is clearly the case for A-Za-z.
MatchEvaluator filter = match =>
{
    var alpha = match.Groups["asciialpha"].Value;
    return alpha != "" ? ((int) alpha[0]).ToString() : "";
};
var filtered = Regex.Replace("CS5999-1", @"(?<asciialpha>[A-Za-z])|\D", filter);
Try this
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            string input = "CS5999-1";
            MatchEvaluator evaluator = new MatchEvaluator(Replace);
            string results = Regex.Replace(input, "[A-Za-z\\-]", evaluator);
        }
        static string Replace(Match match)
        {
            if (match.Value == "-")
            {
                return "";
            }
            else
            {
                byte[] ascii = Encoding.UTF8.GetBytes(match.Value);
                return ascii[0].ToString();
            }
        }
    }
}

Unicode to ASCII with character translations for umlauts

I have a client that sends Unicode input files and demands only ASCII-encoded files in return. Why is unimportant.
Does anyone know of a routine to translate a Unicode string to the closest approximation of an ASCII string? I'm looking to replace common Unicode characters like 'ä' with the best ASCII representation.
For example: 'ä' -> 'a'
Data resides in SQL Server however I can also work in C# as a downstream mechanism or as a CLR procedure.
Just loop through the string. For each character do a switch:
switch(inputCharacter)
{
    case 'ä':
        outputString = "ae";
        break;
    case 'ö':
        outputString = "oe";
        break;
    ...
(These translations are common in German when only ASCII is available.)
Then combine all outputStrings with a StringBuilder.
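The loop described above could look roughly like the sketch below. The mapping is an assumption that covers only the German umlauts and ß; any other non-ASCII character is replaced with '?', which is an arbitrary choice for illustration. GermanTransliterator is a hypothetical name.

```csharp
using System;
using System.Text;

public static class GermanTransliterator
{
    // Returns the ASCII replacement for one character.
    private static string Translate(char c)
    {
        switch (c)
        {
            case 'ä': return "ae";
            case 'ö': return "oe";
            case 'ü': return "ue";
            case 'Ä': return "Ae";
            case 'Ö': return "Oe";
            case 'Ü': return "Ue";
            case 'ß': return "ss";
            default:
                // ASCII passes through; unmapped characters become '?'.
                return c <= 127 ? c.ToString() : "?";
        }
    }

    // Translates a whole string, combining the pieces with a StringBuilder.
    public static string ToAscii(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
            sb.Append(Translate(c));
        return sb.ToString();
    }
}
```

For example, GermanTransliterator.ToAscii("Köln") gives "Koeln". The input is assumed to use precomposed characters (a single 'ö' rather than 'o' followed by a combining mark).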
I think you really mean extended ASCII to ASCII.
Just use a simple dictionary:
Dictionary<char, char> trans = new Dictionary<char, char>() {...};
StringBuilder sb = new StringBuilder();
foreach (char c in input) // input is the string to convert
{
    if (c <= 127)
        sb.Append(c);
    else
        sb.Append(trans[c]); // throws KeyNotFoundException if c has no mapping
}
string ascii = sb.ToString();
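For the 'ä' -> 'a' case from the question, a different technique that is not taken from the answers above is Unicode decomposition: normalize to form D, which splits an accented letter into a base letter plus combining marks, then drop the marks. This is only a sketch. DiacriticStripper is a hypothetical name, and characters without a decomposition (for example 'ø' or 'ß') pass through unchanged, so the result is not guaranteed to be pure ASCII.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class DiacriticStripper
{
    public static string RemoveDiacritics(string input)
    {
        // FormD splits 'ä' into 'a' followed by U+0308 (combining diaeresis).
        string decomposed = input.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Keep everything except the combining marks.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        // Recompose whatever is left.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

For example, DiacriticStripper.RemoveDiacritics("Décor") gives "Decor". A dictionary like the one above can still be applied afterwards for the characters this leaves behind.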

Why does an RTF string applied to Arabic text give "????" instead of applying formatting to it?

I am trying to apply heading styles like those in MS Word by extracting the RTF strings of its heading styles. The RTF string works well for English text and applies the formatting, but when it is applied to Urdu text, it gives formatted "????".
Let me explain with an example:
I select the text "اللغة العربية", and I already have an RTF string containing the MS Word heading style:
{\rtf1\ansi\ansicpg1252... "اللغة العربية"...}, into which I insert the selected text to get a formatted string.
But instead of the formatted اللغة العربية, I get formatted question marks "????", which I think is an encoding or font problem. How can I apply the RTF string to Urdu text to get formatted text?
You need to use a function to convert Unicode characters in the string to their corresponding RTF codes:
static string GetRtfUnicodeEscapedString(string s)
{
    var sb = new StringBuilder();
    foreach (var c in s)
    {
        if (c == '\\' || c == '{' || c == '}')
            sb.Append(@"\" + c);
        else if (c <= 0x7f)
            sb.Append(c);
        else
            sb.Append("\\u" + Convert.ToUInt32(c) + "?");
    }
    return sb.ToString();
}
found here: https://stackoverflow.com/a/9988686/1543816
Characters whose integer value is more than 127 (7f hex) will be converted to \uN?, where N is the decimal value of the character's UTF-16 code unit and ? is the fallback shown by readers that do not understand \u.
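A variant of the function above, as a sketch. The RTF \uN control word takes a signed 16-bit decimal number, so code units above 32767 are written as negative values here; the Arabic block is well below that limit, so for the text in the question both versions produce the same output. RtfEscaper is a hypothetical name.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class RtfEscaper
{
    public static string Escape(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (c == '\\' || c == '{' || c == '}')
            {
                // RTF control characters must be backslash-escaped.
                sb.Append('\\').Append(c);
            }
            else if (c <= 0x7f)
            {
                sb.Append(c);
            }
            else
            {
                // Reinterpret the UTF-16 code unit as a signed 16-bit value.
                short signed = unchecked((short)c);
                sb.Append("\\u")
                  .Append(signed.ToString(CultureInfo.InvariantCulture))
                  .Append('?');
            }
        }
        return sb.ToString();
    }
}
```

For example, the two letters U+0627 and U+0644 come out as \u1575?\u1604?, which can then be placed inside the existing {\rtf1 ...} group in place of the raw text.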

How do I convert C# characters to their hexadecimal code representation

What I need to do is convert a C# character to an escaped Unicode string:
So, 'A' -> "\x0041".
Is there a better way to do this than:
char ch = 'A';
string strOut = String.Format("\\x{0}", Convert.ToUInt16(ch).ToString("x4"));
Cast and use composite formatting:
char ch = 'A';
string strOut = String.Format(@"\x{0:x4}", (ushort)ch);
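The same composite format can be applied to every character of a string. A minimal sketch, with HexEscaper as a hypothetical name; each UTF-16 code unit is escaped separately, so a character outside the BMP produces two escapes.

```csharp
using System;
using System.Text;

public static class HexEscaper
{
    public static string Escape(string text)
    {
        // Every code unit becomes six characters: \x plus four hex digits.
        var sb = new StringBuilder(text.Length * 6);
        foreach (char ch in text)
            sb.AppendFormat(@"\x{0:x4}", (ushort)ch);
        return sb.ToString();
    }
}
```

For example, HexEscaper.Escape("AB") gives \x0041\x0042.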
