HTML Encode ISO-8859-2 (Latin-2) characters in C#

Does anyone know how to HTML-encode ISO-8859-2 characters in C#? The following example does not work:
String name = "Filipović";
String encoded = WebUtility.HtmlEncode(name);
The resulting string should be
"Filipovi&#263;"
Thanks

After reading your comments (you also need to support Chinese names using ASCII characters only), I don't think you should stick to ISO-8859-2 encoding.
Solution 1
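Use UTF-7 encoding for such names. UTF-7 is designed to represent any Unicode string using only ASCII characters. Note that Encoding.UTF7 is marked obsolete in .NET 5 and later, so the compiler warns about it there.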
string value = "Filipović with Unicode symbol: 🏯";
var encoded = Encoding.ASCII.GetString(Encoding.UTF7.GetBytes(value));
Console.WriteLine(encoded); // Filipovi+AQc- with Unicode symbol: +2Dzf7w-
var decoded = Encoding.UTF7.GetString(Encoding.ASCII.GetBytes(encoded));
Solution 2
Alternatively, you can use Base64 encoding, but then even pure ASCII strings are no longer human-readable.
string value = "Filipović with Unicode symbol: 🏯";
var encoded = Convert.ToBase64String(Encoding.UTF8.GetBytes(value));
Console.WriteLine(encoded); // RmlsaXBvdmnEhyB3aXRoIFVuaWNvZGUgc3ltYm9sOiDwn4+v
var decoded = Encoding.UTF8.GetString(Convert.FromBase64String(encoded));
Solution 3
If you really must use HTML entity encoding, you can achieve it like this:
string value = "Filipović with Unicode symbol: 🏯";
var result = new StringBuilder();
for (int i = 0; i < value.Length; i++)
{
    // A valid surrogate pair is one code point, so emit a single entity for it.
    if (Char.IsHighSurrogate(value[i]) && i + 1 < value.Length && Char.IsLowSurrogate(value[i + 1]))
    {
        result.Append($"&#{Char.ConvertToUtf32(value[i], value[i + 1])};");
        i++;
    }
    else if (value[i] > 127)
        result.Append($"&#{(int)value[i]};");
    else
        result.Append(value[i]);
}
Console.WriteLine(result); // Filipovi&#263; with Unicode symbol: &#127983;

If you don't have a strict requirement for HTML encoding, I'd recommend URL (percent) encoding, which encodes all non-ASCII characters:
String name = "Filipović";
String encoded = WebUtility.UrlEncode(name); // Filipovi%C4%87
If you must have all non-ASCII characters HTML-encoded consistently, your best bet is to use the &#xNNNN; or &#NNNN; format for every character above 127. Unfortunately, there is no way to convince HtmlEncode to encode all characters, so you need to do it yourself, similarly to how it is done in "Convert a Unicode string to an escaped ASCII string". You can continue using HtmlDecode to read the values back, as it handles &#xNNNN; just fine.
A non-optimal sample (requires System.Linq):
var name = "Filipović";
var result = String.Join("",
    name.Select(x => x <= 127 ? x.ToString() : String.Format("&#x{0:X4};", (int)x))
);
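To illustrate the round trip described above, here is a minimal sketch. EncodeNonAscii is a hypothetical helper name, not a framework method; it emits decimal entities and treats a surrogate pair as one code point. It deliberately leaves ASCII characters such as & and < untouched, so if the input may contain markup you would probably run WebUtility.HtmlEncode on it first.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical helper: replaces every non-ASCII code point with a decimal entity.
static string EncodeNonAscii(string value)
{
    var result = new StringBuilder(value.Length);
    for (int i = 0; i < value.Length; i++)
    {
        char c = value[i];
        if (char.IsHighSurrogate(c) && i + 1 < value.Length && char.IsLowSurrogate(value[i + 1]))
        {
            // One entity for the whole surrogate pair.
            result.Append($"&#{char.ConvertToUtf32(c, value[i + 1])};");
            i++;
        }
        else if (c > 127)
            result.Append($"&#{(int)c};");
        else
            result.Append(c);
    }
    return result.ToString();
}

string original = "Filipović 🏯";
string encoded = EncodeNonAscii(original);
string decoded = WebUtility.HtmlDecode(encoded);

Console.WriteLine(encoded);             // Filipovi&#263; &#127983;
Console.WriteLine(decoded == original); // compares the decoded text with the input
```

The sketch uses top-level statements, so it assumes C# 9 or later; on older compilers the same code can be placed inside a Main method.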

Related

Splitting a string on an "undefined" variable

I have a piece of text that is in multiple formats, and I want to try and create a method that encompasses all of them. I know where I can split these lines, however, I am uncertain of how to define this.
An example of the text:
.0 index .0.label unicode "Area" .0.value unicode "6WAY DB" .1 index .1.label unicode "SubStation" .1.value unicode "E782DB257" .2 (etc...)
I want to split these lines on the ".0", ".1", etc, so that my list will look like:
.0 index
.0.label unicode "Area"
.0.value unicode "6WAY DB"
.1 index
.1.label unicode "SubStation"
This will make the data easier to manipulate. However, since the value changes depending on the line, I can't simply state the value as a regular string. Instead, I was thinking of stating it more like
string Split = "." + n.IsInt();
Or something similar. However, I can't find anything that has worked yet.
If I understand you correctly, you can do the following with a regex replace:
var input = ".0 index .0.label unicode \"Area\" .0.value unicode \"6WAY DB\" .1 index .1.label unicode \"SubStation\" .1.value unicode \"E782DB257\" .2 (etc...)";
var result = Regex.Replace(input, @"\.\d", $"{Environment.NewLine}$&");
Console.WriteLine(result);
or, to actually split:
var lines = result.Split(new[]{Environment.NewLine},StringSplitOptions.None);
foreach (var line in lines)
    Console.WriteLine(line);
Output
.0 index
.0.label unicode "Area"
.0.value unicode "6WAY DB"
.1 index
.1.label unicode "SubStation"
.1.value unicode "E782DB257"
.2 (etc...)
Explanation
\. matches a literal dot (an unescaped . would match any character except line terminators)
\d matches a digit (equal to [0-9])
$& inserts the entire match into the replacement string
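As a variation on the same idea, the string can be split in one step with a lookahead, so no newline has to be inserted first. This is only a sketch: the pattern below is my own assumption about the format (whitespace followed by a dot and a digit starts a new entry), and it would split incorrectly if a quoted value itself contained something like " .5".

```csharp
using System;
using System.Text.RegularExpressions;

string input = ".0 index .0.label unicode \"Area\" .0.value unicode \"6WAY DB\" " +
               ".1 index .1.label unicode \"SubStation\" .1.value unicode \"E782DB257\"";

// Split on the whitespace that precedes ".<digit>"; the lookahead (?=...) is
// zero-width, so the ".<digit>" marker stays at the start of the next entry.
string[] parts = Regex.Split(input, @"\s+(?=\.\d)");

foreach (string part in parts)
    Console.WriteLine(part);
```

Because the separator whitespace is consumed by the split, the entries come out without leading or trailing spaces.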
If your string follows a fixed format and you want to extract the values from it, you can implement a custom function for that, something like this (JavaScript):
function splitCustom(str){
    var retVal=[];
    str = str.split('.0 index')[1].trim();
    var totalRecord=str[str.lastIndexOf(' index')-1];
    for(var i=0;i<=totalRecord;i++){
        var obj={};
        var substr=str.split("." + (i+1) + ' index');
        var curRecord="";
        if(substr.length>1){
            curRecord=substr[0].trim();
            str = substr[1].trim();
        }
        else{
            curRecord=str;
        }
        obj.index=i;
        var labelString=curRecord.split("." + i + ".")[1].trim();
        obj.label=labelString.substr(labelString.indexOf('"')+1, labelString.lastIndexOf('"')-labelString.indexOf('"')-1);
        var valueString=curRecord.split("." + i + ".")[2].trim();
        obj.value=valueString.substr(valueString.indexOf('"')+1, valueString.lastIndexOf('"')-valueString.indexOf('"')-1);
        retVal.push(obj);
    }
    return retVal;
}
var str='.0 index .0.label unicode "Area" .0.value unicode "6WAY DB" .1 index .1.label unicode "SubStation" .1.value unicode "E782DB257"';
var response = splitCustom(str);
Output
[
{"index":0,"label":"Area","value":"6WAY DB"},
{"index":1,"label":"SubStation","value":"E782DB257"}
]

Unicode string to binary string and binary string to unicode c#

I have a Unicode text containing some Unicode characters, say, "Hello, world! this paragraph has some unicode characters."
I want to convert this paragraph to a binary string, i.e. binary digits stored in a string. After converting, I also want to convert that binary string back to the Unicode string.
If you're simply looking for a way to encode a string into a byte[] and decode it back, rather than actual binary digits, then I would use System.Text.
The actual example from MSDN:
string unicodeString = "This string contains the unicode character Pi (\u03a0)";
// Create two different encodings.
Encoding ascii = Encoding.ASCII;
Encoding unicode = Encoding.Unicode;
// Convert the string into a byte array.
byte[] unicodeBytes = unicode.GetBytes(unicodeString);
// Perform the conversion from one encoding to the other.
byte[] asciiBytes = Encoding.Convert(unicode, ascii, unicodeBytes);
// Convert the new byte[] into a char[] and then into a string.
char[] asciiChars = new char[ascii.GetCharCount(asciiBytes, 0, asciiBytes.Length)];
ascii.GetChars(asciiBytes, 0, asciiBytes.Length, asciiChars, 0);
string asciiString = new string(asciiChars);
// Display the strings created before and after the conversion.
Console.WriteLine("Original string: {0}", unicodeString);
Console.WriteLine("Ascii converted string: {0}", asciiString);
Don't forget the using directives:
using System;
using System.Text;
Since there are several encodings for the Unicode character set, you have to pick: UTF-8, UTF-16, UTF-32, etc. Say you picked UTF-8. You have to use the same encoding going both ways.
To convert to a binary string:
String.Join(
String.Empty, // running them all together makes it tricky.
Encoding.UTF8
.GetBytes("Hello, world! this paragraph has some unicode characters.")
.Select(byt => Convert.ToString(byt, 2).PadLeft(8, '0'))) // must ensure 8 digits.
And back again:
Encoding.UTF8.GetString(
Regex.Split(
"010010000110010101101100011011000110111100101100001000000111011101101111011100100110110001100100001000010010000001110100011010000110100101110011001000000111000001100001011100100110000101100111011100100110000101110000011010000010000001101000011000010111001100100000011100110110111101101101011001010010000001110101011011100110100101100011011011110110010001100101001000000110001101101000011000010111001001100001011000110111010001100101011100100111001100101110"
,"(.{8})") // this is the consequence of running them all together.
.Where(binary => !String.IsNullOrEmpty(binary)) // keeps the matches; drops empty parts
.Select(binary => Convert.ToByte(binary, 2))
.ToArray())
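The two directions can also be wrapped into a pair of helpers. The sketch below is one possible arrangement, not the only one: ToBinary and FromBinary are hypothetical names, UTF-8 is assumed in both directions, and the decoder walks the string in fixed 8-character chunks instead of using Regex.Split.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper: UTF-8 bytes rendered as 8 binary digits each.
static string ToBinary(string text) =>
    string.Concat(Encoding.UTF8.GetBytes(text)
        .Select(b => Convert.ToString(b, 2).PadLeft(8, '0')));

// Hypothetical helper: the inverse, reading 8 digits per byte.
static string FromBinary(string bits)
{
    if (bits.Length % 8 != 0)
        throw new ArgumentException("Length must be a multiple of 8.", nameof(bits));

    var bytes = new byte[bits.Length / 8];
    for (int i = 0; i < bytes.Length; i++)
        bytes[i] = Convert.ToByte(bits.Substring(i * 8, 8), 2);
    return Encoding.UTF8.GetString(bytes);
}

string binary = ToBinary("Hi");
Console.WriteLine(binary);             // 0100100001101001
Console.WriteLine(FromBinary(binary)); // Hi
```

Padding every byte to exactly eight digits is what makes the fixed-width decoding possible.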

Unicode to ASCII with character translations for umlauts

I have a client that sends unicode input files and demands only ASCII encoded files in return - why is unimportant.
Does anyone know of a routine to translate unicode string to a closest approximation of an ASCII string? I'm looking to replace common unicode characters like 'ä' to a best ASCII representation.
For example: 'ä' -> 'a'
Data resides in SQL Server however I can also work in C# as a downstream mechanism or as a CLR procedure.
Just loop through the string. For each character do a switch:
switch(inputCharacter)
{
    case 'ä':
        outputString = "ae";
        break;
    case 'ö':
        outputString = "oe";
        break;
    ...
(These translations are common in German when only ASCII is available.)
Then combine all the outputStrings with a StringBuilder.
I think you really mean extended ASCII to ASCII.
Just use a simple dictionary:
Dictionary<char, char> trans = new Dictionary<char, char>() {...};
StringBuilder sb = new StringBuilder();
foreach (char c in input) // input is the string to convert
{
    if (c <= 127)
        sb.Append(c);
    else
        sb.Append(trans[c]); // throws KeyNotFoundException if c has no mapping
}
string ascii = sb.ToString();
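For the 'ä' -> 'a' style of approximation asked about in the question, a different technique from the lookup tables above is Unicode decomposition: normalize to form D, then drop the combining marks. The sketch below assumes that behaviour is acceptable. StripDiacritics is a hypothetical name, and characters with no decomposition (for example 'ß') are replaced with '?' here, so a lookup table is still needed for those.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper: decompose, drop combining marks, replace leftovers with '?'.
static string StripDiacritics(string input)
{
    // Form D splits 'ä' into 'a' followed by a combining diaeresis (U+0308).
    string decomposed = input.Normalize(NormalizationForm.FormD);
    var sb = new StringBuilder(decomposed.Length);
    foreach (char c in decomposed)
    {
        if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
            continue; // skip the accent itself
        sb.Append(c <= 127 ? c : '?');
    }
    return sb.ToString();
}

Console.WriteLine(StripDiacritics("Käse, Filipović, façade")); // Kase, Filipovic, facade
```

Note that this gives 'a' for 'ä', not the German-style "ae" from the switch example.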

Decode a Javascript hex literal in C#

I have the following string:
string s = @"a=q\x26T=1";
I want to unescape this to:
"a=q&T=1"
How do I do this in C# other than by just replacing the characters? There are various other escaped characters, so I'm not sure what encoding to use.
This works:
var decodedString = Regex.Unescape(@"source=s_q\x26hl=en");
but this works even better:
var regex = new Regex(@"\\x([a-fA-F0-9]{2})");
json = regex.Replace(json, match => char.ConvertFromUtf32(Int32.Parse(match.Groups[1].Value, System.Globalization.NumberStyles.HexNumber)));
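Here is a small self-contained sketch that applies both approaches to the string from the question. The variable names are mine. The second variant only touches \xNN sequences and leaves every other backslash alone, which may or may not be what you want.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

string s = @"a=q\x26T=1";

// Variant 1: let the regex engine interpret the escape sequences.
string viaUnescape = Regex.Unescape(s);

// Variant 2: replace only two-digit \xNN escapes with the character they denote.
string viaReplace = Regex.Replace(
    s,
    @"\\x([0-9a-fA-F]{2})",
    m => ((char)int.Parse(m.Groups[1].Value, NumberStyles.HexNumber)).ToString());

Console.WriteLine(viaUnescape); // a=q&T=1
Console.WriteLine(viaReplace);  // a=q&T=1
```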

C# UTF7Encoding for first bracket ' { '

While reading bytes from a file containing UTF-7 encoded characters, the opening brace '{' is supposed to be encoded as 123 (0x007B), but that is not happening. All other characters are encoded correctly, but not '{'. The code I am using is given below.
StreamReader _HistoryLocation = new StreamReader("abc.txt");
String _ftpInformation = _HistoryLocation.ReadLine();
UTF7Encoding utf7 = new UTF7Encoding();
Byte[] encodedBytes = utf7.GetBytes(_ftpInformation);
What might be the problem?
As per RFC 2152, which you reference, '{' and similar characters may only optionally be encoded directly; they may instead be encoded in the modified Base64 form.
Notice that UTF7Encoding has an overloaded constructor with an allowOptionals flag that makes it encode the RFC 2152 optional characters directly.
