The \U Escape Sequence in C#

I am experimenting with escape sequences and cannot get the \U (UTF-32) sequence to work.
It does not compile; the compiler does not recognize the sequence for some reason.
It only seems to recognize the \u (UTF-16) sequence.
Could you please help me?
Console.WriteLine("\U00HHHHHH");

Your problem is that you copied \U00HHHHHH from the documentation page Strings (C# Programming Guide): String Escape Sequences.
But \U00HHHHHH is not itself a valid UTF-32 escape sequence; it's a mask where each H marks a position where a hex digit must be typed. It isn't valid because hexadecimal numbers consist of the digits 0-9 and the letters A-F or a-f, and H is not one of those characters. And the literal mentioned in the comments, "\U001effff", does not work because it falls outside the range of valid UTF-32 character values specified immediately thereafter in the docs:
(range: 000000 - 10FFFF; example: \U0001F47D = "👽")
The C# compiler actually checks that the specified UTF-32 character is valid according to these rules:
// These compile because they're valid Hex numbers in the range 000000 - 10FFFF padded to 8 digits with leading zeros:
Console.WriteLine("\U0001F47D");
Console.WriteLine("\U00000000");
Console.WriteLine("\U0010FFFF");
// But these don't.
// H is not a valid Hex character:
// Compilation error (line 16, col 22): Unrecognized escape sequence
Console.WriteLine("\U00HHHHHH");
// This is outside the range of 000000 - 10FFFF:
// Compilation error (line 19, col 22): Unrecognized escape sequence
Console.WriteLine("\U001effff");
See https://dotnetfiddle.net/KezdTG.
As an aside, to properly display Unicode characters in the Windows console, see How to write Unicode characters to the console?.
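As a minimal sketch of that last point (assuming a console whose default code page cannot display the character; behavior varies by OS and .NET version), switching the output encoding to UTF-8 before writing is usually enough:

```csharp
using System;
using System.Text;

class Program
{
    static void Main()
    {
        // Without this, some Windows console code pages print '?' for
        // characters they cannot represent.
        Console.OutputEncoding = Encoding.UTF8;

        // U+1F47D is outside the BMP, so the string literal is stored
        // as a surrogate pair (two UTF-16 code units).
        string alien = "\U0001F47D";
        Console.WriteLine(alien);
        Console.WriteLine(alien.Length); // 2
    }
}
```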

Related

Char representation of UNICODE character doesn't always return a character

I have a routine that, given an int, returns the equivalent Unicode character. For some values, though, it doesn't return a character, but (I presume) its hex value.
For example:
17664 ---> '䔀' // CORRECT!
BUT
56384 ---> '\udc40' // WRONG!!!
Why is that?
The character with code 0xDC40 is a low surrogate.
That means it is one half of a surrogate pair: in UTF-16, a character outside the Basic Multilingual Plane is represented as a 16-bit high surrogate followed by a 16-bit low surrogate. A low surrogate on its own does not correspond to an actual character.
That's why the output shows '\udc40' rather than a single character.
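You can verify this, and build a full character from a code point instead, with the surrogate helpers on char; a minimal sketch (the code point 0x20840 is just an arbitrary supplementary-plane example):

```csharp
using System;

class Program
{
    static void Main()
    {
        char c = (char)56384; // 0xDC40
        Console.WriteLine(char.IsLowSurrogate(c)); // True: not a character on its own

        // To get a real character from a code point, build the whole
        // surrogate pair at once instead of casting a single int to char:
        string s = char.ConvertFromUtf32(0x20840);
        Console.WriteLine(s.Length); // 2: stored as high + low surrogate
        Console.WriteLine(char.IsHighSurrogate(s[0]) && char.IsLowSurrogate(s[1])); // True
    }
}
```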

Naming variables with utf characters

What determines which Unicode characters can be used in code?
var süßigkeit = new Candy(); // works
var süßigkeit∆ = süßigkeit + 1; // doesn't work
Taken from the Microsoft docs:
Identifiers must start with a letter, or _.
Identifiers may contain Unicode letter characters, decimal digit characters, Unicode connecting characters, Unicode combining characters, or Unicode formatting characters.
https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/inside-a-program/identifier-names
Char.GetUnicodeCategory('∆') // MathSymbol category
∆ falls in the MathSymbol category, which is not in that list, so it cannot appear in an identifier.
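A quick way to check why one identifier compiles and the other doesn't is to ask Char.GetUnicodeCategory which category each character falls in; a minimal sketch:

```csharp
using System;

class Program
{
    static void Main()
    {
        // 'ß' is a Unicode letter, so it is allowed in identifiers.
        Console.WriteLine(Char.GetUnicodeCategory('ß')); // LowercaseLetter

        // '∆' (U+2206) is a math symbol, which is not a permitted category.
        Console.WriteLine(Char.GetUnicodeCategory('∆')); // MathSymbol

        var süßigkeit = "ok"; // compiles
        // var süßigkeit∆ = "no"; // would not compile: MathSymbol not allowed
        Console.WriteLine(süßigkeit);
    }
}
```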

regex issue c# numbers are underscores now

My regex is removing all the digits (0-9) in my string.
I don't get why all the numbers are replaced by _.
EDIT: I understand that my regex pattern replaces matched characters with underscores. But why does it match the numbers?
Can anyone help me out? I only need to remove all the special characters.
See regex here:
string symbolPattern = "[!##$%^&*()-=+`~{}'|]";
Regex.Replace("input here 12341234" , symbolPattern, "_");
Output: "input here ________"
The problem is that your pattern uses a dash in the middle of the character class, which acts as a range of the ASCII characters from ) to =. Here's a breakdown:
): 41
1: 49
=: 61
The digits 0-9 occupy codes 48-57, which fall within the range 41-61, so they're matched and replaced.
You need to place the - at either the beginning or end of the character class for it to be matched literally rather than act as a range:
"[-!##$%^&*()=+`~{}'|]"
Alternatively, you can escape the -, because the sequence )-= forms a range that contains the digits. Note the verbatim string: \- is not a valid escape in an ordinary C# string literal.
string symbolPattern = @"[!##$%^&*()\-=+`~{}'|]";
Move the - to the end of the list so it is seen as a literal:
"[!##$%^&*()=+`~{}'|-]"
Or, to the front:
"[-!##$%^&*()=+`~{}'|]"
As it stands, it will match all characters in the range )-=, which includes all numerals.
You need to escape the special characters in your regex. For instance, * is a quantifier meaning "zero or more of the preceding element". Look at what some of those special characters mean in a pattern.
I've not used C#, but typically the "*" character is a metacharacter that would need escaping.
The following matches a whole line of any characters, although the "^" and "$" are somewhat redundant:
^.*$
This matches any number of "A" characters that appear in a string:
A*
The "Owl" book (Mastering Regular Expressions) from O'Reilly is what you really need to research this:
http://shop.oreilly.com/product/9780596528126.do

Is the format of GUID always the same?

From a GUID you get something like aaaef973-d8ce-4c92-95b4-3635bb2d42d5.
Is it always the same? Is it always going to have the following format:
8 chars "-" 4 chars "-" 4 chars "-" 4 chars "-" 12 chars
I'm asking because I need to convert a GUID without "-" to a GUID with "-" and vice versa.
No; there are other formats, such as the format you listed except with braces. There are also more complex formats. Here are some of the formats MSDN lists:
UUID formats
32 digits: 00000000000000000000000000000000 (N)
32 digits separated by hyphens: 00000000-0000-0000-0000-000000000000 (D)
32 digits separated by hyphens, enclosed in braces: {00000000-0000-0000-0000-000000000000} (B)
32 digits separated by hyphens, enclosed in parentheses: (00000000-0000-0000-0000-000000000000) (P)
Four hexadecimal values enclosed in braces, where the fourth value is a subset of eight hexadecimal values that is also enclosed in braces: {0x00000000,0x0000,0x0000,{0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00}} (X)
—MSDN
You should simply rely upon it being 32 hexadecimal characters; there can be a variety of ways to present it. Check the Wikipedia article for more information, including a description of how they are commonly written.
For your conversion you should really rely on the static Guid.Parse() methods. Using a mix of your example and the ones in icktoofay's answer, this works nicely:
var z = Guid.Parse("aaaef973-d8ce-4c92-95b4-3635bb2d42d5");
z = Guid.Parse("{aaaef973-d8ce-4c92-95b4-3635bb2d42d5}");
z = Guid.Parse("{0x00000000,0x0000,0x0000,{0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00}}");
then for outputting them with or without hyphens etc you can use the Guid.ToString() method with one of the established format codes.
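For the hyphen conversion specifically, a minimal sketch: Guid.Parse accepts the hyphen-less "N" form as well, so round-tripping is just Parse plus ToString with a format code:

```csharp
using System;

class Program
{
    static void Main()
    {
        // Parse accepts the 32-digit form without hyphens ("N") too.
        Guid g = Guid.Parse("aaaef973d8ce4c9295b43635bb2d42d5");

        Console.WriteLine(g.ToString("D")); // aaaef973-d8ce-4c92-95b4-3635bb2d42d5
        Console.WriteLine(g.ToString("N")); // aaaef973d8ce4c9295b43635bb2d42d5
        Console.WriteLine(g.ToString("B")); // {aaaef973-d8ce-4c92-95b4-3635bb2d42d5}
    }
}
```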
Most of the time, GUIDs are written as 32 hexadecimal digits, often with hyphens and braces, such as {21EC2020-3AEA-1069-A2DD-08002B30309D} (unless they're encoded in Base64), and are usually stored as 128-bit integers. They won't always have hyphens, though.

Is there a situation where we should prefer using hex escape sequence over unicode escape sequence or vice versa?

1) Escape sequences are mostly used for character constants that either have a special meaning (such as " or \) or can't be represented graphically. Any character literal could be represented using a hex ('\xhhhh') or Unicode ('\uhhhh') escape sequence. Is there a situation where we should prefer the hex escape sequence over the Unicode escape sequence, or vice versa?
2) When should we specify integer literals in hexadecimal form?
thank you
They are not interchangeable. You can only use a Unicode escape in an identifier name:
var on\u0065 = 1;
var tw\x006f = 2; // doesn't compile: \x is not allowed in identifiers
But in a string or char literal it doesn't make a heck of a lot of difference. I prefer \u myself, because the escape code has a fixed number of digits while \x is variable, but it is easy enough to avoid mistakes either way. Also note \U to pick code points from the upper planes.
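The variable length of \x is the classic pitfall: it consumes up to four hex digits, so any hex-looking text after the escape gets swallowed into it. A minimal sketch (the trailing "Bad" is just an illustrative choice):

```csharp
using System;

class Program
{
    static void Main()
    {
        // \u is always exactly 4 hex digits, so the escape ends unambiguously:
        // TAB + "Bad" = 4 characters.
        Console.WriteLine("\u0009Bad".Length); // 4

        // \x takes 1 to 4 hex digits, and 'B', 'a', 'd' are all hex digits,
        // so this is the single character U+9BAD, not TAB + "Bad".
        Console.WriteLine("\x9Bad".Length); // 1
    }
}
```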
