I have the following string:
string s = @"a=q\x26T=1";
I want to unescape this to:
"a=q&T=1"
How do I do this in C# other than just replacing the characters? There are various other escaped characters, so I'm not sure what encoding to use.
This works:
var decodedString = Regex.Unescape(@"source=s_q\x26hl=en");
but this is more targeted, because it only converts \xNN sequences and leaves every other backslash alone:
var regex = new Regex(@"\\x([a-fA-F0-9]{2})");
json = regex.Replace(json, match => char.ConvertFromUtf32(Int32.Parse(match.Groups[1].Value, System.Globalization.NumberStyles.HexNumber)));
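To make the difference between the two approaches concrete, here is a minimal sketch. The names HexEscapes and UnescapeHex are made up for illustration and are not part of .NET. The idea is that Regex.Unescape converts every escape it recognises, while the targeted regex only touches \xNN.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class HexEscapes
{
    // Matches a backslash, 'x', and exactly two hex digits.
    private static readonly Regex HexEscape = new Regex(@"\\x([0-9A-Fa-f]{2})");

    // Hypothetical helper: converts only \xNN sequences and
    // leaves all other backslashes in the input untouched.
    public static string UnescapeHex(string input) =>
        HexEscape.Replace(input, m =>
            char.ConvertFromUtf32(
                int.Parse(m.Groups[1].Value, NumberStyles.HexNumber)));
}
```

With this helper, HexEscapes.UnescapeHex(@"C:\temp\x26") keeps the path intact and only turns \x26 into &, whereas Regex.Unescape on the same input would also turn the \t into a tab character.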
I have a Unicode-escaped string from a text file, and I want to display the real characters.
For example:
\u8ba1\u7b97\u673a\u2022\u7f51\u7edc\u2022\u6280\u672f\u7c7b
When I read this string from the text file using StreamReader.ReadToLine(), the \ is escaped to '\\', giving "\\u8ba1", which is not what I want.
The string is displayed exactly as it appears in the text file, but I want to display the real characters.
How can I change "\\u8ba1" to "\u8ba1" in the resulting string?
Or should I use a different reader to read the string?
If you have a string like
var input1 = "\u8ba1\u7b97\u673a\u2022\u7f51\u7edc\u2022\u6280\u672f\u7c7b";
// input1 == "计算机•网络•技术类"
you don't need to unescape anything. It's just the string literal that contains the escape sequences, not the string itself.
If you have a string like
var input2 = @"\u8ba1\u7b97\u673a\u2022\u7f51\u7edc\u2022\u6280\u672f\u7c7b";
you can unescape it using the following regex:
var result = Regex.Replace(
    input2,
    @"\\[Uu]([0-9A-Fa-f]{4})",
    m => char.ToString(
        (char)ushort.Parse(m.Groups[1].Value, NumberStyles.AllowHexSpecifier)));
// result == "计算机•网络•技术类"
This question came up as the first result when googling, but I thought there should be a simpler way... this is what I ended up using:
using System.Text.RegularExpressions;
//...
var str = "Ingl\\u00e9s";
var converted = Regex.Unescape(str);
Console.WriteLine($"{converted} {str != converted}"); // Inglés True
I have this string:
string specialCharacterString = @"\n";
where "\n" is the new line special character.
Is it possible to convert that string (of two characters) into a single char? How do I do something like:
char specialCharacter = Parse(specialCharacterString);
where the value of specialCharacter would be equal to '\n'?
Is there anything in .NET that would parse the string for me, or must I use an if or a switch on the string (the string can contain any special character) to accomplish what I want? Note that char.Parse(string) cannot handle special characters and treats the string above as two characters.
Maybe I am oversimplifying but can't you just do the following:
txtString.Replace("\n", "$");
It is technically a string to string replacement but would be string to char...
You can always cast it to a char since you know what char you are replacing the string with.
I'm not sure what the business need is, but if you need to parse C# in C#, you can use a tool like ANTLR, which has a C# grammar (https://github.com/antlr/grammars-v4/).
I don't think there is any ready-made tool designed just for strings.
Try using Regex.Unescape(specialCharacterString);
It returns a new string in which the escape sequences have been converted to the characters they stand for.
For example:
var literalStringWithEscapeCharacters = @"Hello\tWorld";
var stringWithEscapeCharacters = Regex.Unescape(literalStringWithEscapeCharacters);
Console.WriteLine(stringWithEscapeCharacters);
Will print: Hello World
Instead of: Hello\tWorld
Then you can find escape characters in stringWithEscapeCharacters like this:
var escapeChars= new [] { '\n' };
var characters = stringWithEscapeCharacters.Where(c => escapeChars.Contains(c)).ToList();
All escape sequences are described here:
https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/strings/#string-escape-sequences
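Building on Regex.Unescape, the original question (turning a two-character string such as \n into a single char) could be sketched as below. EscapedChar.Parse is a made-up helper name, not a .NET API. Note also that Regex.Unescape follows regular-expression escape rules, which overlap with C# escapes for the common cases (\n, \t, \r, \uXXXX) but are not guaranteed to be identical in every corner case.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapedChar
{
    // Hypothetical helper: unescapes the input and insists that
    // the result is exactly one char, e.g. @"\n" -> '\n'.
    public static char Parse(string escaped)
    {
        string unescaped = Regex.Unescape(escaped);
        if (unescaped.Length != 1)
        {
            throw new FormatException(
                $"'{escaped}' does not represent a single character.");
        }
        return unescaped[0];
    }
}
```

Usage: char specialCharacter = EscapedChar.Parse(@"\n"); gives the newline character, and an input such as "ab" throws a FormatException instead of silently returning the first character.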
I am converting hex to a UTF8 string using the line below.
var obInstruction = Encoding.UTF8.GetString(ob.Bits);
In the result I got � between every character as shown in the picture below.
what is � ?
So I added replace to the line and changed it to
var obInstruction = Encoding.UTF8.GetString(ob.Bits).Replace("�", "");
but � won't go away.
Replace works fine when I try it on other characters, but not on �.
What is � and how can I remove it?
In Power Query, Text.Clean removes such strange characters, but I am not sure how to do this in C#.
*Edit: added a picture for the result with UTF32
Empty boxes with UTF32:
This should do the job:
var input = new byte[5];
var encoding = Encoding.GetEncoding(Encoding.ASCII.EncodingName,
    new EncoderReplacementFallback(""),
    new DecoderReplacementFallback(""));
var converted = Encoding.Convert(Encoding.UTF8, encoding, input);
var output = encoding.GetString(converted);
This replaces every non-ASCII character with an empty string, which effectively removes it.
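As for what � is: a decoder emits U+FFFD (the Unicode replacement character) when it meets bytes that are not valid in the chosen encoding. The sketch below uses made-up helper names (ReplacementCharDemo, DescribeCodePoints, DecodeDroppingInvalid) and rests on two assumptions: that it is useful to first print the code points to see which character is really in the string, and that writing the character as "\uFFFD" rather than pasting a literal � avoids any dependence on the source file's encoding.

```csharp
using System;
using System.Linq;
using System.Text;

public static class ReplacementCharDemo
{
    // Lists each UTF-16 code unit as U+XXXX so that the "strange"
    // character can be identified instead of guessed at.
    public static string DescribeCodePoints(string text) =>
        string.Join(" ", text.Select(c => $"U+{(int)c:X4}"));

    // Decodes as UTF-8, then drops every U+FFFD that the decoder
    // inserted in place of invalid byte sequences.
    public static string DecodeDroppingInvalid(byte[] utf8Bytes) =>
        Encoding.UTF8.GetString(utf8Bytes).Replace("\uFFFD", string.Empty);
}
```

If DescribeCodePoints shows something other than U+FFFD between the characters (for example U+0000), the bytes are probably not UTF-8 in the first place, and choosing the right encoding is a better fix than removing characters afterwards.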
Does anyone know how to encode the ISO-8859-2 charset in C#? The following example does not work:
String name = "Filipović";
String encoded = WebUtility.HtmlEncode(name);
The resulting string should be
"Filipovi&#263;"
Thanks
After reading your comments (you also need to support Chinese names using ASCII characters only), I think you shouldn't stick to ISO-8859-2 encoding.
Solution 1
Use UTF-7 encoding for such names. UTF-7 is designed to represent any Unicode string using only ASCII characters. Note that Encoding.UTF7 is marked obsolete in .NET 5 and later.
string value = "Filipović with Unicode symbol: 🏯";
var encoded = Encoding.ASCII.GetString(Encoding.UTF7.GetBytes(value));
Console.WriteLine(encoded); // Filipovi+AQc- with Unicode symbol: +2Dzf7w-
var decoded = Encoding.UTF7.GetString(Encoding.ASCII.GetBytes(encoded));
Solution 2
Alternatively, you can use base64 encoding, too. But in this case the pure ASCII strings will not be human-readable anymore.
string value = "Filipović with Unicode symbol: 🏯";
var encoded = Convert.ToBase64String(Encoding.UTF8.GetBytes(value));
Console.WriteLine(encoded); // RmlsaXBvdmnEhyB3aXRoIFVuaWNvZGUgc3ltYm9sOiDwn4+v
var decoded = Encoding.UTF8.GetString(Convert.FromBase64String(encoded));
Solution 3
If you really must stick to HTML entity encoding, you can achieve it like this:
string value = "Filipović with Unicode symbol: 🏯";
var result = new StringBuilder();
for (int i = 0; i < value.Length; i++)
{
    if (Char.IsHighSurrogate(value[i]))
    {
        result.Append($"&#{Char.ConvertToUtf32(value[i], value[i + 1])};");
        i++;
    }
    else if (value[i] > 127)
        result.Append($"&#{(int)value[i]};");
    else
        result.Append(value[i]);
}
Console.WriteLine(result); // Filipovi&#263; with Unicode symbol: &#127983;
If you don't have a strict requirement for HTML encoding, I'd recommend using URL (%) encoding, which encodes all non-ASCII characters:
String name = "Filipović";
String encoded = WebUtility.UrlEncode(name); // Filipovi%C4%87
If you must have a string in which all non-ASCII characters are HTML encoded consistently, your best bet is to use the &#xNNNN; or &#NNNN; format to encode all characters above 127. Unfortunately there is no way to convince HtmlEncode to encode all characters, so you need to do it yourself, similarly to how it is done in "Convert a Unicode string to an escaped ASCII string". You can continue using HtmlDecode to read the values back, as it handles &#xNNNN; just fine.
Non optimal sample:
var name = "Filipović";
var result = String.Join("",
    name.Select(x => x <= 127 ? x.ToString() : String.Format("&#x{0:X4};", (int)x))
);
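A quick round trip can confirm that hand-made numeric entities are read back by the framework's decoder. This is only a sketch: HtmlEntities.EncodeNonAscii is a made-up name, it works char by char and so is only meant for characters in the Basic Multilingual Plane (surrogate pairs would need the ConvertToUtf32 handling shown in Solution 3), and it assumes each entity is terminated with a semicolon.

```csharp
using System;
using System.Net;
using System.Text;

public static class HtmlEntities
{
    // Hypothetical helper: keeps ASCII as-is and writes every other
    // char as a hexadecimal numeric entity such as &#x0107;
    public static string EncodeNonAscii(string s)
    {
        var sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            if (c <= 127)
                sb.Append(c);
            else
                sb.Append("&#x").Append(((int)c).ToString("X4")).Append(';');
        }
        return sb.ToString();
    }
}
```

Passing the encoded value to WebUtility.HtmlDecode should give back the original string, for example for "Filipović".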
I want to convert a string like "123" to a string like "\u0031\u0032\u0033".
How can I do this in .NET?
For example, this is the reverse conversion:
Encoding enc = Encoding.GetEncoding("us-ascii",
    new EncoderExceptionFallback(),
    new DecoderExceptionFallback());
byte[] by = enc.GetBytes(s);
string ans = enc.GetString(by);
return ans;
Strings in .NET already are Unicode, so there's no need to convert them from Unicode to Unicode.
If you want to output a unicode escaped string, then try this:
string ans = string.Concat(s.Select(c => string.Format("\\u{0:x4}", (int)c)).ToArray());
Result:
\u0031\u0032\u0033
See it working online: ideone
In .NET 4.0 you can omit the call to ToArray.
string ans = Regex.Replace(s, ".", m => String.Format(@"\u{0:x4}", (int)m.Value[0]));
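For completeness, here is a sketch that ties the escaping direction to the unescaping shown earlier. UnicodeEscapes.ToUnicodeEscapes is a made-up helper name. It uses a StringBuilder instead of LINQ or a regex, which is just one possible choice, and it escapes every UTF-16 code unit, so a character outside the BMP comes out as two \uXXXX sequences.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class UnicodeEscapes
{
    // Hypothetical helper: writes each UTF-16 code unit of the
    // input as \uXXXX with four lowercase hex digits.
    public static string ToUnicodeEscapes(string s)
    {
        var sb = new StringBuilder(s.Length * 6);
        foreach (char c in s)
        {
            sb.Append("\\u").Append(((int)c).ToString("x4"));
        }
        return sb.ToString();
    }
}
```

Regex.Unescape(UnicodeEscapes.ToUnicodeEscapes("123")) should return "123" again, which makes the pair usable as a round trip for plain text.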