Can't decode UTF-8 umlaut in C# [duplicate]

How do I decode the string 'Sch\u00f6nen' (@"Sch\u00f6nen") in C#? I've tried HttpUtility, but it doesn't give me the result I need, which is "Schönen".

Regex.Unescape did the trick:
System.Text.RegularExpressions.Regex.Unescape(@"Sch\u00f6nen");
Note that you need to be careful when testing your variants or writing unit tests: "Sch\u00f6nen" is already "Schönen". You need @ in front of the string (a verbatim string literal) to treat \u00f6 as part of the string.
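The difference between the two kinds of literal can be checked directly. This is a minimal sketch; the variable names are only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// Verbatim literal: the backslash, 'u' and four hex digits stay as six characters.
string escaped = @"Sch\u00f6nen";
// Regular literal: the compiler has already replaced \u00f6 with 'ö'.
string literal = "Sch\u00f6nen";

string decoded = Regex.Unescape(escaped);

Console.WriteLine(escaped.Length);      // 12
Console.WriteLine(literal.Length);      // 7
Console.WriteLine(decoded == literal);  // True
```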

If you landed on this question because you see "Sch\u00f6nen" (or similar \uXXXX values in a string constant): it is not an encoding. It is a way to represent Unicode characters as escape sequences, similar to how a string represents a new line by \n and a carriage return by \r.
I don't think you have to decode.
string unicodestring = "Sch\u00f6nen";
Console.WriteLine(unicodestring);
This outputs Schönen.

Wrote some code that converts Unicode escape sequences to actual chars. (But the best answer in this topic works fine and is less complex.)
string stringWithUnicodeSymbols = @"{""id"": 10440119, ""photo"": 10945418, ""first_name"": ""\u0415\u0432\u0433\u0435\u043d\u0438\u0439""}";
// The pattern has one capture group, so Regex.Split returns the plain text
// at even indexes and the captured hex digits at odd indexes.
var splitted = Regex.Split(stringWithUnicodeSymbols, @"\\u([0-9a-fA-F]{4})");
string outString = "";
for (int i = 0; i < splitted.Length; i++)
{
    if (i % 2 == 1)
    {
        // Four hex digits captured from a \uXXXX escape.
        outString += (char) Convert.ToUInt16(splitted[i], 16);
    }
    else
    {
        outString += splitted[i];
    }
}

Related

Convert any string to ASCII, Remove Backslash

This question may reveal my ignorance regarding character encoding, so if it does, I would greatly appreciate information to correct that.
I am relaying strings from new applications to an old application. The old application only accepts ASCII characters (http://www.asciitable.com/). The old application also does not support certain characters such as backslashes. The new applications support more or less anything.
Let's say I have the string:
"Whatever - 1_夜_💦💦💦"
I need to convert that to something with only ASCII characters. For example, maybe something like:
"Whatever - 1_\u001cY_=???=???=???"
Then I want to replace the remaining illegal characters with substitution strings.
Ideally, any character that is encoded to ASCII should be decodable again. That is, every unique input string has a unique output string (no two different inputs, such as "abc" and "xyz", produce the same result), so that an algorithm could convert the output string back to the input string.
This is what I've tried:
static string ConvertToAscii(string str)
{
    var return_string = "";
    foreach (var c in str)
    {
        if ((int)c < 128)
        {
            return_string += c;
        }
        else
        {
            var charBytes = BitConverter.GetBytes(c);
            var ascii = Encoding.ASCII.GetString(charBytes);
            return_string += ascii;
        }
    }
    return return_string;
}
When I use this with the string I mentioned above, I get:
"Whatever - 1_\u001cY_=???=???=???"
That seems great - however, the "\u001cY" is apparently a single character, rather than a collection of ASCII characters. So my target database rejects it, and I am not able to figure out how to remove the "\" while leaving the remaining characters.
How can I convert any string into a collection of ASCII characters?

The easiest approach is to Base64-encode all the bytes, since you don't seem to care how the strings are represented:
Convert.ToBase64String( Encoding.Unicode.GetBytes("Whatever - 1_夜_💦💦💦"))
This produces a result that is guaranteed to be ASCII (even printable ASCII). For your string, the result is "VwBoAGEAdABlAHYAZQByACAALQAgADEAXwAcWV8APdim3D3Yptw92Kbc".
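Since the question also asks for the conversion to be reversible, here is a minimal round-trip sketch. The variable names are illustrative. Note that the Base64 alphabet consists of letters, digits, '+', '/' and '=', so it contains no backslash.

```csharp
using System;
using System.Text;

string original = "Whatever - 1_夜_💦💦💦";

// Encode: UTF-16LE bytes -> Base64 text.
string encoded = Convert.ToBase64String(Encoding.Unicode.GetBytes(original));

// Decode: undo the two steps in the opposite order.
string roundTripped = Encoding.Unicode.GetString(Convert.FromBase64String(encoded));

Console.WriteLine(encoded);
Console.WriteLine(roundTripped == original); // True
```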

Here is similar code to what I ended up using to convert everything to ASCII:
internal static string ConvertToAscii(string str)
{
    var returnStringBuilder = new StringBuilder();
    foreach (var c in str)
    {
        if (char.IsControl(c))
        {
            // Control character
            continue;
        }
        if (c < 127)
        {
            // ASCII Character
            returnStringBuilder.Append(c);
        }
        else
        {
            returnStringBuilder.Append("U+" + ((int) c).ToString("X4"));
        }
    }
    return returnStringBuilder.ToString();
}
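Going the other way is possible only with caveats. The sketch below uses a hypothetical helper, DecodeFromAscii, which is not part of the answer above. It turns each "U+XXXX" marker back into a char. It cannot restore control characters that the encoder dropped, and input that already contained literal text such as "U+0041" would be decoded as well, so the scheme is not strictly reversible.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper: replaces every "U+" followed by four uppercase
// hex digits with the UTF-16 code unit it names.
string DecodeFromAscii(string encoded) =>
    Regex.Replace(
        encoded,
        @"U\+([0-9A-F]{4})",
        m => ((char)int.Parse(m.Groups[1].Value, NumberStyles.HexNumber)).ToString());

Console.WriteLine(DecodeFromAscii("1_U+591C_") == "1_夜_"); // True
```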

Replace Unicode escape sequences in a string [duplicate]

We have one text file which has the following text
"\u5b89\u5fbd\u5b5f\u5143"
When we read the file content in C# .NET it shows like:
"\\u5b89\\u5fbd\\u5b5f\\u5143"
Our decoder method is
public string Decoder(string value)
{
    Encoding enc = new UTF8Encoding();
    byte[] bytes = enc.GetBytes(value);
    return enc.GetString(bytes);
}
When I pass a hard coded value,
string Output=Decoder("\u5b89\u5fbd\u5b5f\u5143");
it works well, but when we use a variable value it is not working.
When we use the string this is what we get from the text file:
value=(text file content)
string Output=Decoder(value);
It returns the wrong output.
How can I fix this?

Use the code below. It unescapes any escaped characters in the input string:
Regex.Unescape(value);
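A short sketch of what that call does with the file content from the question. The variable names are illustrative. Keep in mind that Regex.Unescape is meant for regular-expression escapes in general, so it also rewrites sequences such as \t. If the text can contain backslashes used for other purposes, the result may not be what you expect.

```csharp
using System;
using System.Text.RegularExpressions;

// As read from the file: literal backslash-u sequences.
string value = @"\u5b89\u5fbd\u5b5f\u5143";
string decoded = Regex.Unescape(value);
Console.WriteLine(decoded.Length); // 4

// Other escapes are converted too: \t becomes a tab character.
string other = Regex.Unescape(@"a\tb");
Console.WriteLine(other == "a\tb"); // True
```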

You could use a regular expression to parse the file:
private static Regex _regex = new Regex(@"\\u(?<Value>[a-fA-F0-9]{4})", RegexOptions.Compiled);
public string Decoder(string value)
{
    return _regex.Replace(
        value,
        m => ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString()
    );
}
And then:
string data = Decoder(File.ReadAllText("test.txt"));

So your file contains the verbatim string
\u5b89\u5fbd\u5b5f\u5143
in ASCII and not the string represented by those four Unicode codepoints in some given encoding?
As it happens, I just wrote some code in C# that can parse strings in this format for a JSON parser project -- here's a variant that only handles \uXXXX escapes:
private static string ReadSlashedString(TextReader reader) {
    var sb = new StringBuilder(32);
    bool q = false;
    while (true) {
        int chrR = reader.Read();
        if (chrR == -1) break;
        var chr = (char) chrR;
        if (!q) {
            if (chr == '\\') {
                q = true;
                continue;
            }
            sb.Append(chr);
        }
        else {
            switch (chr) {
                case 'u':
                case 'U':
                    var hexb = new char[4];
                    reader.Read(hexb, 0, 4);
                    chr = (char) Convert.ToInt32(new string(hexb), 16);
                    sb.Append(chr);
                    break;
                default:
                    throw new Exception("Invalid backslash escape (\\ + charcode " + (int) chr + ")");
            }
            q = false;
        }
    }
    return sb.ToString();
}
And you could use it like:
var str = ReadSlashedString(new StringReader("\\u5b89\\u5fbd\\u5b5f\\u5143"));
(or using a StreamReader to read from a file).
Darin Dimitrov's regexp-utilizing answer is probably faster, but I happened to have this code at hand. :)

UTF8Encoding (or any other encoding) won't translate escape sequences like \u5b89 into the corresponding character.
The reason it works when you pass a string constant is that the C# compiler interprets the escape sequences and translates them into the corresponding characters at compile time, before the decoder is ever called.
You have to write code that recognizes the escape sequences and converts them into the corresponding characters.
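To see why the Decoder method from the question has no effect, here is a minimal sketch with illustrative variable names. Encoding a string to bytes and decoding those bytes with the same encoding simply returns the original text, backslashes included.

```csharp
using System;
using System.Text;

var enc = new UTF8Encoding();
// What reading the file returns: a backslash, 'u' and four hex digits, twice.
string fromFile = @"\u5b89\u5fbd";

// GetBytes followed by GetString is a round trip, not an unescape step.
string roundTripped = enc.GetString(enc.GetBytes(fromFile));

Console.WriteLine(roundTripped == fromFile); // True
Console.WriteLine(roundTripped.Length);      // 12
```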

When you are reading "\u5b89\u5fbd\u5b5f\u5143" you get exactly what you read. The debugger escapes your strings before displaying them. The double backslashes in the string are actually single backslashes that have been escaped.
When you pass your hard-coded value, you are not actually passing in what you see on the screen. You are passing in four Unicode characters, since the C# string is unescaped by the compiler.
Darin already posted a way to unescape Unicode characters from the file, so I won't repeat it.

I think this will give you some idea.
string str = "ivandro\u0020";
str = str.Trim();
If you try to print the string, you will notice that the space, which is \u0020, is removed.

Is there an easy way to trim the last three characters off a string

I have strings like this:
var a = "abcdefg";
var b = "xxxxxxxx";
The strings are always longer than five characters.
Now I need to trim off the last 3 characters. Is there some simple way that I can do this with C#?

In the trivial case you can just use
result = s.Substring(0, s.Length-3);
to remove the last three characters from the string.
Or, as Jason suggested, Remove is an alternative:
result = s.Remove(s.Length-3);
Unfortunately, for Unicode strings there can be a few problems:
A Unicode code point can consist of multiple chars, since a string is encoded as UTF-16 (see surrogate pairs). This happens only for characters outside the Basic Multilingual Plane, i.e. those with a code point of 2^16 (U+10000) or above. This can be relevant if you want to support Chinese.
A glyph (graphical symbol) can consist of multiple code points. For example, ä can be written as a followed by a combining ¨.
Behavior with right-to-left writing might not be what you want either.
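For the first two problems, one option is to count text elements rather than chars using System.Globalization.StringInfo. The sketch below uses a hypothetical helper, TrimTextElements, which does not come from the answers here. It does not address right-to-left text.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: removes the last `count` text elements
// instead of the last `count` UTF-16 code units.
string TrimTextElements(string s, int count)
{
    var info = new StringInfo(s);
    int keep = Math.Max(0, info.LengthInTextElements - count);
    return keep == 0 ? string.Empty : info.SubstringByTextElements(0, keep);
}

Console.WriteLine(TrimTextElements("abcdefg", 3)); // abcd
// 'a' followed by U+0308 (combining diaeresis) counts as one text element.
Console.WriteLine(TrimTextElements("xyza\u0308", 1)); // xyz
```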

You want String.Remove(Int32)
Deletes all the characters from this string beginning at a specified
position and continuing through the last position.
If you want to perform validation, along the lines of druttka's answer, I would suggest creating an extension method
public static class MyStringExtensions
{
    public static string SafeRemove(this string s, int numCharactersToRemove)
    {
        if (numCharactersToRemove > s.Length)
        {
            throw new ArgumentException("numCharactersToRemove");
        }
        // other validation here
        return s.Remove(s.Length - numCharactersToRemove);
    }
}
var s = "123456";
var r = s.SafeRemove(3); //r = "123"
var t = s.SafeRemove(7); //throws ArgumentException

string a = "abcdefg";
a = a.Remove(a.Length - 3);

string newString = oldString.Substring(0, oldString.Length - 3);

If you really only need to trim off the last 3 characters, you can do this
string a = "abcdefg";
if (a.Length > 3)
{
    a = a.Substring(0, a.Length-3);
}
else
{
    a = String.Empty;
}

How to encode Japanese characters

I have to develop a program that implements an encoding system.
I have these Japanese characters:
つれづれなるまゝに、日暮らし、硯にむかひて、心にうつりゆくよしなし事を、そこはかとなく書きつくれば、あやしうこそものぐるほしけれ
I want to convert this string to an encoding like this:
%26%2312388%3B%26%2312428%3B%26%2312389%3B%26%2312428%3B%26%2312394%3B%26%2312427%3B%26%2312414%3B%26%2312445%3B%26%2312395%3B%26%2312289%3B%26%2326085%3B%26%2326286%3B%26%2312425%3B%26%2312375%3B%26%2312289%3B%26%2330831%3B%26%2312395%3B%26%2312416%3B%26%2312363%3B%26%2312402%3B%26%2312390%3B%26%2312289%3B%26%2324515%3B%26%2312395%3B%26%2312358%3B%26%2312388%3B%26%2312426%3B%26%2312422%3B%26%2312367%3B%26%2312424%3B%26%2312375%3B%26%2312394%3B%26%2312375%3B%26%2320107%3B%26%2312434%3B%26%2312289%3B%26%2312381%3B%26%2312371%3B%26%2312399%3B%26%2312363%3B%26%2312392%3B%26%2312394%3B%26%2312367%3B%26%2326360%3B%26%2312365%3B%26%2312388%3B%26%2312367%3B%26%2312428%3B%26%2312400%3B%26%2312289%3B%26%2312354%3B%26%2312420%3B%26%2312375%3B%26%2312358%3B%26%2312371%3B%26%2312381%3B%26%2312418%3B%26%2312398%3B%26%2312368%3B%26%2312427%3B%26%2312411%3B%26%2312375%3B%26%2312369%3B%26%2312428%3B%26%2312290%3B.
How can I do that?

I believe you are looking for HttpUtility.UrlEncode, although I can't figure out which encoding gives exactly the output that you show.
var testString = "つれづれなるまゝに、日暮らし、硯にむかひて、心にうつりゆくよしなし事を、そこはかとなく書きつくれば、あやしうこそものぐるほしけれ。";
var encodedUrl = HttpUtility.UrlEncode(testString, Encoding.UTF8);
You might want to change your question, as you don't really need to convert Unicode to ASCII, which is impossible. What you need is percent-encoding (also called URL encoding).
[EDIT]
I figured it out:
var testString = "つれづれなるまゝに、日暮らし、硯にむかひて、心にうつりゆくよしなし事を、そこはかとなく書きつくれば、あやしうこそものぐるほしけれ。";
var htmlEncoded = string.Concat(testString.Select(arg => string.Format("&#{0};", (int)arg)));
var result = HttpUtility.UrlEncode(htmlEncoded);
The result matches the encoding that you provided, apart from the case of the hex digits (HttpUtility.UrlEncode emits lowercase, e.g. %3b).
Step by step:
var inputChar = 'つ';
var charValue = (int)inputChar; // 12388
var htmlEncoded = "&#" + charValue + ";"; // &#12388;
var urlEncoded = HttpUtility.UrlEncode(htmlEncoded); // %26%2312388%3b
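The same two steps, plus the reverse direction, can be sketched with System.Net.WebUtility instead of HttpUtility. That swap is an assumption made here so the example does not need a System.Web reference. The case of the hex digits in the output may differ between the two classes. Each char is encoded separately, so a character outside the Basic Multilingual Plane would come out as two numeric references, one per surrogate.

```csharp
using System;
using System.Linq;
using System.Net;

string input = "つれ";

// Step 1: one decimal numeric character reference per UTF-16 code unit.
string htmlEncoded = string.Concat(input.Select(c => $"&#{(int)c};"));
// Step 2: percent-encode the '&', '#' and ';' characters.
string encoded = WebUtility.UrlEncode(htmlEncoded);

// Reverse: percent-decode first, then resolve the character references.
string decoded = WebUtility.HtmlDecode(WebUtility.UrlDecode(encoded));

Console.WriteLine(encoded);
Console.WriteLine(decoded == input); // True
```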

This is impossible. Unicode is much larger than ASCII, so you can't map every Unicode character to an ASCII character. ASCII defines only 128 characters (including control characters), while Unicode defines well over 100,000.

Here is a function that seems to work:
public static string UrlDoubleEncode(string text)
{
    if (text == null)
        return null;
    StringBuilder sb = new StringBuilder();
    foreach (int i in text)
    {
        sb.Append('&');
        sb.Append('#');
        sb.Append(i);
        sb.Append(';');
    }
    return HttpUtility.UrlEncode(sb.ToString());
}

C#, function to replace all html special characters with normal text characters

I have characters incoming from an xml template for example:
&amp; &gt;
Does a generic function exist in the framework to replace these with their normal equivalents?

You want to use HttpUtility.HtmlDecode:
Converts a string that has been HTML-encoded for HTTP transmission into a decoded string.

Sometimes the text has parts that have been doubly encoded.
For example: "Lorem Ipsum&#x0D;
- Blah"
This may help with that:
public static string RecursiveHtmlDecode(string str) {
    if (string.IsNullOrWhiteSpace(str)) return str;
    var tmp = HttpUtility.HtmlDecode(str);
    while (tmp != str)
    {
        str = tmp;
        tmp = HttpUtility.HtmlDecode(str);
    }
    return str; //completely decoded string
}

Maybe this helps: WebUtility.HtmlDecode("");
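A minimal sketch of that call with a sample input. The input string is made up for illustration.

```csharp
using System;
using System.Net;

// Named character references are replaced with the characters they stand for.
string decoded = WebUtility.HtmlDecode("&amp; &gt; &lt;b&gt;");
Console.WriteLine(decoded); // & > <b>
```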
