Convert UTF-8 literals to readable string, C#?


I have a string as follows:
const string nameString = @"\xda\xa9\xd8\xa7\xd8\xb1\xd8\xa8\xd8\xb1";
I tried:
var name = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(nameString));
It does not work.
You can find real characters here:
https://utf8-chartable.de/unicode-utf8-table.pl?start=1536&number=128&names=-&utf8=string-literal
e.g.:
U+0631 ر \xd8\xb1
How can we convert it to a readable string in C#?

Well, we have to parse: each "\xa9" should be converted into the byte 0xa9
const string nameString = @"\xda\xa9\xd8\xa7\xd8\xb1\xd8\xa8\xd8\xb1";
We can do it with the help of regular expressions:
byte[] data = Regex
    .Matches(nameString, @"\\x(?<value>[0-9a-fA-F]{1,2})")
    .Cast<Match>()
    .Select(match => Convert.ToByte(match.Groups["value"].Value, 16))
    .ToArray();
Let's have a look at the data:
// da a9 d8 a7 d8 b1 d8 a8 d8 b1
Console.WriteLine(string.Join(" ", data.Select(b => b.ToString("x2"))));
Finally, we want to decode data to a string; assuming that we should use UTF8:
string name = Encoding.UTF8.GetString(data);
Console.WriteLine(name);
Outcome:
کاربر

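Using the @ sign causes escape sequences to be interpreted literally, so the string contains the backslashes and hex digits exactly as written. Removing the @ sign makes the compiler process the \x escapes, although each escape then yields a char with that code point rather than a raw byte, so the result still has to be converted to bytes before it can be decoded as UTF-8.
For more information see @ (C# Reference).
The @ character in this instance defines a verbatim string literal. Simple escape sequences (such as "\\" for a backslash), hexadecimal escape sequences (such as "\x0041" for an uppercase A), and Unicode escape sequences (such as "\u0041" for an uppercase A) are interpreted literally.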
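A minimal sketch of that approach, assuming every escape in the literal is followed by a character that is not a hex digit (true for the string in the question). The class and method names are illustrative only. Latin-1 (ISO-8859-1) maps the chars U+0000 to U+00FF one-to-one onto bytes, which recovers the intended byte sequence:

```csharp
using System;
using System.Text;

public static class EscapedLiteralDemo
{
    // Treats each char (U+0000..U+00FF) as one raw byte, then decodes as UTF-8.
    public static string DecodeLatin1AsUtf8(string s)
    {
        byte[] bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(s);
        return Encoding.UTF8.GetString(bytes);
    }

    public static void Main()
    {
        // Regular (non-verbatim) literal: the compiler turns each \xNN into one char.
        const string nameString = "\xda\xa9\xd8\xa7\xd8\xb1\xd8\xa8\xd8\xb1";
        Console.WriteLine(nameString.Length);              // 10
        Console.WriteLine(DecodeLatin1AsUtf8(nameString)); // کاربر
    }
}
```

If the text arrives at run time (from a file or user input), the backslashes are real characters and the regular-expression approach is the one that applies.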

Related

C# - regex finding all ascii enclosed in ' ' and convert them to hex ascii output

I am writing a converter for user input data, which converts number value strings and ASCII characters enclosed in ' ' to their hex representation. Entering numbers works fine with:
string TestText = "lorem, 'C', 127, 0x06, '#' ipsum";
TestText = Regex.Replace(
TestText,
" +\\d{1,3}",
(MatchEvaluator)(match => Convert.ToByte(match.Value).ToString("X2")));
Out.Text = TestText;
But how can I detect ASCII chars enclosed in ' ' and convert them to a hex string, so that 'C' becomes 43 and '+' becomes 2B?

Basically, you want to match the regular expression '[^']'. This matches any single character other than ' that is enclosed in a pair of '.
Then, in your match evaluator, you get the character in the middle (index 1 of the match) and convert it to a hexadecimal string. To do that, first cast the char to an int and then you can use ToString("x2"):
TestText = Regex.Replace(TestText, "'[^']'",
(MatchEvaluator)(match => ((int)match.Value[1]).ToString("x2")));

First, you need a RegEx to capture the character inside the 's: "'(.)'"
Then you need to convert that character to its hex equivalent, like so: Encoding.ASCII.GetBytes(match.Groups[1].Value).First().ToString("X2")
so your final code would look like this:
string TestText = "lorem, 'C', 127, 0x06, '#' ipsum '+'";
TestText = Regex.Replace(TestText, @" +\d{1,3}", match => Convert.ToByte(match.Value).ToString("X2"));
TestText = Regex.Replace(TestText, "'(.)'", match => Encoding.ASCII.GetBytes(match.Groups[1].Value).First().ToString("X2"));
Out.Text = TestText;
Note that, as pointed out in the comments, your RegEx is currently matching the 0 at the beginning of 0x06, which may not be what you want.
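Both conversions can also be done in a single pass, so the result does not depend on the order of the two replacements. A minimal sketch, assuming a number should only be converted when it stands alone (so the 0 in 0x06 is left untouched) and that values above 255 are left as they are. The class and method names are illustrative:

```csharp
using System;
using System.Text.RegularExpressions;

public static class HexConverter
{
    // Either one quoted character, or a 1-3 digit decimal number that is not
    // adjacent to a letter, digit, underscore or quote.
    private static readonly Regex Token =
        new Regex(@"'([^'])'|(?<![\w'])[0-9]{1,3}(?![\w'])");

    public static string ToHex(string input)
    {
        return Token.Replace(input, m =>
        {
            if (m.Groups[1].Success)
            {
                // Quoted character: use its code point (two digits for ASCII).
                return ((int)m.Groups[1].Value[0]).ToString("X2");
            }
            byte value;
            // Decimal number: convert only if it fits in a byte.
            return byte.TryParse(m.Value, out value) ? value.ToString("X2") : m.Value;
        });
    }

    public static void Main()
    {
        Console.WriteLine(ToHex("lorem, 'C', 127, 0x06, '#' ipsum '+'"));
        // lorem, 43, 7F, 0x06, 23 ipsum 2B
    }
}
```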

Replacing doubleslash to single slash

In my C# application I want to convert escaped characters in a string to the special characters they represent.
My input string is "G\u00f6teborg" and I want the output to be Göteborg.
I am using the code below:
string name = "G\\u00f6teborg";
StringBuilder sb = new StringBuilder(name);
sb = sb.Replace(@"\\", @"\");
string name1 = System.Web.HttpUtility.HtmlDecode(sb.ToString());
Console.WriteLine(name1);
In the above code the double slash remains the same; it is not replaced by a single slash, so after decoding I get the output G\u00f6teborg.
Please help to find a solution for this.
Thanks in advance.

string name = "G\\u00f6teborg";
Just remove one of the backslashes:
string name = "G\u00f6teborg";
If you got the input from a user then you need to do more: it’s not enough to replace a backslash because that’s not how the characters are stored internally, the \uXXXX is an escape sequence representing a Unicode code point.
If you want to replace a user input escape sequence by a Unicode code point you need to parse the user input properly. You can use a regular expression for that:
MatchEvaluator replacer = m => ((char) int.Parse(m.Groups[1].Value, NumberStyles.AllowHexSpecifier)).ToString();
string result = Regex.Replace(name, @"\\u([a-fA-F0-9]{4})", replacer);
This matches each escape group (\u followed by four hex digits), extracts the hex digits, parses them and translates them to a character.
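Put together as a self-contained sketch (the class and method names are illustrative; the input is written as "G\\u00f6teborg" so that the string really contains a backslash, as user input would):

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class UnicodeUnescaper
{
    // Replaces every \uXXXX sequence in the input with the char it denotes.
    public static string Unescape(string input)
    {
        return Regex.Replace(
            input,
            @"\\u([a-fA-F0-9]{4})",
            m => ((char)int.Parse(m.Groups[1].Value, NumberStyles.AllowHexSpecifier)).ToString());
    }

    public static void Main()
    {
        Console.WriteLine(Unescape("G\\u00f6teborg")); // Göteborg
    }
}
```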

How do I get the STX character of hex 02

I have a device to which I'm trying to connect via a socket, and according to the manual, I need the "STX character of hex 02".
How can I do this using C#?

Just a comment to GeoffM's answer (I don't have enough points to comment the proper way).
You should never embed STX (or other characters) that way using only two digits.
If the next character (after "\x02") was a valid hex digit, that would also be parsed and it would be a mess.
string s1 = "\x02End";
string s2 = "\x02" + "End";
string s3 = "\x0002End";
Here, s1 equals ".nd", since 2E is the dot character, while s2 and s3 equal STX + "End".
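The difference is easy to check, because \x consumes up to four hex digits. A small sketch (the class and field names are illustrative):

```csharp
using System;

public static class HexEscapeDemo
{
    public static readonly string S1 = "\x02End";        // parsed as \x02E + "nd"
    public static readonly string S2 = "\x02" + "End";   // STX + "End"
    public static readonly string S3 = "\x0002End";      // four digits: STX + "End"

    public static void Main()
    {
        Console.WriteLine(S1 == ".nd");   // True
        Console.WriteLine(S1.Length);     // 3
        Console.WriteLine(S2 == S3);      // True
        Console.WriteLine((int)S2[0]);    // 2
    }
}
```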

You can use a Unicode character escape: \u0002

Cast the Integer value of 2 to a char:
char cChar = (char)2;

\x02 is the STX code; you can check the ASCII table.
checkFinal = checkFinal.Replace("\x02", "End").ToString().Trim();

Within a string, clearly the Unicode format is best, but for use as a byte, this approach works:
byte chrSTX = 0x02; // Start of Text
byte chrETX = 0x03; // End of Text
// etc...

You can embed the STX within a string like so:
byte[] myBytes = System.Text.Encoding.ASCII.GetBytes("\x02Hello, world!");
socket.Send(myBytes);
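To avoid the \x escape altogether, the frame can be built directly as bytes. A minimal sketch, assuming an ASCII payload and assuming the device also expects a trailing ETX (the question only mentions STX, so check the manual). The Frame helper is hypothetical:

```csharp
using System;
using System.Text;

public static class StxFraming
{
    public const byte Stx = 0x02; // Start of Text
    public const byte Etx = 0x03; // End of Text (assumed terminator)

    // Returns STX + ASCII payload + ETX as one byte array.
    public static byte[] Frame(string payload)
    {
        byte[] body = Encoding.ASCII.GetBytes(payload);
        byte[] frame = new byte[body.Length + 2];
        frame[0] = Stx;
        Buffer.BlockCopy(body, 0, frame, 1, body.Length);
        frame[frame.Length - 1] = Etx;
        return frame;
    }

    public static void Main()
    {
        Console.WriteLine(BitConverter.ToString(Frame("Hi"))); // 02-48-69-03
    }
}
```

The resulting array can be passed to socket.Send in the same way as myBytes above.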

What does it mean when I enclose a C# string in @" "? [duplicate]

Possible Duplicate:
What does @ mean at the start of a string in C#?
Sorry, but I can't find this on Google. I guess it may not be accepting my search string.
Can someone tell me what this means in C#?
var a = @"abc";
What is the meaning of the @?

It is a verbatim string literal, which basically means it will take any character except ", including new lines. To write out a ", use "".

The advantage of @-quoting is that escape sequences are not processed,
which makes it easy to write, for example, a fully qualified file
name:
@"c:\Docs\Source\a.txt" // rather than "c:\\Docs\\Source\\a.txt"

It means it's a verbatim string literal.
Without it, a \ in the string starts an escape sequence together with the next character, such as \n for a new line. With an @ in front, the \ is treated literally.
In the example you've given, there is no difference in the output.

This says that the characters inside the double quotation marks should be interpreted exactly as they are.
You can see that the backslash is treated as a character and not an
escape sequence when the @ is used. The C# compiler also allows you to
use real newlines in verbatim literals. A quotation mark inside a
verbatim literal must be written as two double quotes ("").
string fileLocation = "C:\\CSharpProjects";  // regular literal
string fileLocation = @"C:\CSharpProjects";  // verbatim literal, same value
See here for examples.

C# supports two forms of string literals: regular string literals and verbatim string literals.
A regular string literal consists of zero or more characters enclosed
in double quotes, as in "hello", and may include both simple escape
sequences (such as \t for the tab character) and hexadecimal and
Unicode escape sequences.
A verbatim string literal consists of an @ character followed by a
double-quote character, zero or more characters, and a closing
double-quote character. A simple example is @"hello". In a verbatim
string literal, the characters between the delimiters are interpreted
verbatim, the only exception being a quote-escape-sequence. In
particular, simple escape sequences and hexadecimal and Unicode
escape sequences are not processed in verbatim string literals. A
verbatim string literal may span multiple lines.
Code Example
string a = "hello, world"; // hello, world
string b = @"hello, world"; // hello, world
string c = "hello \t world"; // hello world
string d = @"hello \t world"; // hello \t world
string e = "Joe said \"Hello\" to me"; // Joe said "Hello" to me
string f = @"Joe said ""Hello"" to me"; // Joe said "Hello" to me
string g = "\\\\server\\share\\file.txt"; // \\server\share\file.txt
string h = @"\\server\share\file.txt"; // \\server\share\file.txt
string i = "one\r\ntwo\r\nthree";
string j = @"one
two
three";
Reference link: MSDN

How do I get a list of all the printable characters in C#?

I'd like to be able to get a char array of all the printable characters in C#, does anybody know how to do this?
edit:
By printable I mean the visible European characters, so yes, umlauts, tildes, accents etc.

This will give you a list with all characters that are not considered control characters:
List<char> printableChars = new List<char>();
for (int i = char.MinValue; i <= char.MaxValue; i++)
{
    char c = Convert.ToChar(i);
    if (!char.IsControl(c))
    {
        printableChars.Add(c);
    }
}
You may want to investigate the other Char.IsXxxx methods to find a combination that suits your requirements.

Here's a LINQ version of Fredrik's solution. Note that Enumerable.Range yields an IEnumerable<int> so you have to convert to chars first. Cast<char> would have worked in 3.5SP0 I believe, but as of 3.5SP1 you have to do a "proper" conversion:
var chars = Enumerable.Range(0, char.MaxValue+1)
.Select(i => (char) i)
.Where(c => !char.IsControl(c))
.ToArray();
I've created the result as an array as that's what the question asked for - it's not necessarily the best idea though. It depends on the use case.
Note that this also doesn't consider full Unicode characters, only those in the basic multilingual plane. I don't know what it returns for high/low surrogates, but it's worth at least knowing that a single char doesn't really let you represent everything :(
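For what it's worth, char.IsControl returns false for surrogates, because they belong to the Surrogate (Cs) category rather than Control (Cc), so the filters above keep them. A sketch that filters by Unicode category instead, assuming that surrogates and unassigned code points should also be excluded (the class and method names are illustrative):

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class PrintableChars
{
    // All BMP chars except control characters, surrogates and unassigned code points.
    public static char[] GetBmpPrintable()
    {
        return Enumerable.Range(char.MinValue, char.MaxValue + 1)
            .Select(i => (char)i)
            .Where(c =>
            {
                UnicodeCategory category = char.GetUnicodeCategory(c);
                return category != UnicodeCategory.Control
                    && category != UnicodeCategory.Surrogate
                    && category != UnicodeCategory.OtherNotAssigned;
            })
            .ToArray();
    }

    public static void Main()
    {
        char[] chars = GetBmpPrintable();
        Console.WriteLine(chars.Contains('ö'));      // True
        Console.WriteLine(chars.Contains('\uD800')); // False (high surrogate)
    }
}
```

Which categories count as "printable" depends on the use case; UnicodeCategory.Format and UnicodeCategory.PrivateUse are other candidates for exclusion.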

A LINQ solution (based on Fredrik Mörk's):
Enumerable.Range(char.MinValue, char.MaxValue + 1).Select(c => (char)c).Where(
    c => !char.IsControl(c)).ToArray();

TLDR Answer
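Use this Regex...
var regex = new Regex(@"[^\p{Cc}\p{Cn}\p{Cs}]");
TLDR Explanation
The leading ^ negates the whole character class, so the regex matches any character that is in none of the listed categories. (A ^ anywhere else inside the brackets would be a literal caret.)
\p{Cc} : control characters.
\p{Cn} : unassigned code points.
\p{Cs} : surrogate code units (the halves of UTF-16 surrogate pairs).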
Working Demo
I test two strings in this demo: "Hello, World!" and "Hello, World!" + (char)4. (char)4 is the END OF TRANSMISSION (EOT) control character.
using System;
using System.Text.RegularExpressions;
public class Test {
    public static MatchCollection getPrintableChars(string haystack) {
        // Matches every char that is not a control, unassigned or surrogate char.
        var regex = new Regex(@"[^\p{Cc}\p{Cn}\p{Cs}]");
        var matches = regex.Matches(haystack);
        return matches;
    }
    public static void Main() {
        var teststring1 = "Hello, World!";
        var teststring2 = "Hello, World!" + (char)4;
        var teststring1printablechars = getPrintableChars(teststring1);
        var teststring2printablechars = getPrintableChars(teststring2);
        Console.WriteLine("Testing a Printable String: " + teststring1printablechars.Count + " Printable Chars Detected");
        Console.WriteLine("Testing a String With 1-Unprintable Char: " + teststring2printablechars.Count + " Printable Chars Detected");
        foreach (Match printablechar in teststring1printablechars) {
            Console.WriteLine("String 1 Printable Char:" + printablechar);
        }
        foreach (Match printablechar in teststring2printablechars) {
            Console.WriteLine("String 2 Printable Char:" + printablechar);
        }
    }
}
Full Working Demo at IDEOne.com
Alternatives
\P{C} : Match any character outside the "Other" categories (control, format, surrogate, private use, unassigned).
\P{Cc} : Match only non-control characters. Do not match any control characters.
[^\p{Cc}\p{Cn}] : Match only non-control characters that have been assigned. Do not match any control or unassigned characters.
[^\p{Cc}\p{Cn}\p{Cs}] : Match only non-control, non-surrogate characters that have been assigned. Do not match any control, unassigned, or surrogate characters.
[^\p{Cc}\p{Cn}\p{Cs}\p{Cf}] : Match only non-control, non-formatting, non-surrogate characters that have been assigned. Do not match any control, unassigned, formatting, or surrogate characters.
Source and Explanation
Take a look at the Unicode Character Properties available that can be used to test within a regex. You should be able to use these regexes in Microsoft .NET, JavaScript, Python, Java, PHP, Ruby, Perl, Golang, and even Adobe. Knowing Unicode character classes is very transferable knowledge, so I recommend using it!
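The same character class can be used to count or strip characters. A minimal sketch, assuming .NET's Regex and the three categories from the TLDR answer; the class and method names are illustrative. One caveat worth knowing: .NET regexes work on UTF-16 code units, so excluding \p{Cs} also removes characters outside the basic multilingual plane, which are stored as surrogate pairs.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrintableFilter
{
    // Any char that is not a control, unassigned or surrogate char.
    private static readonly Regex Printable = new Regex(@"[^\p{Cc}\p{Cn}\p{Cs}]");

    // The complementary class: chars to remove.
    private static readonly Regex NonPrintable = new Regex(@"[\p{Cc}\p{Cn}\p{Cs}]");

    public static int CountPrintable(string s)
    {
        return Printable.Matches(s).Count;
    }

    public static string StripNonPrintable(string s)
    {
        return NonPrintable.Replace(s, "");
    }

    public static void Main()
    {
        string input = "Hello, World!" + (char)4;
        Console.WriteLine(CountPrintable(input));    // 13
        Console.WriteLine(StripNonPrintable(input)); // Hello, World!
    }
}
```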

I know ASCII wasn't specifically requested but this is a quick way to get a list of all the printable ASCII characters.
List<char> printableChars = new List<char>();
for (int i = 0x20; i <= 0x7e; i++)
{
    printableChars.Add(Convert.ToChar(i));
}
See this ASCII table.
Edit:
As stated by Péter Szilvási, the 0x20 and 0x7e in the loop are hexadecimal representations of the base 10 numbers 32 and 126, which are the first and last printable ASCII characters (space and tilde).

public bool IsPrintableASCII(char c)
{
return c >= '\x20' && c <= '\x7e';
}
