What determines which Unicode characters can be used in code?
var süßigkeit = new Candy(); // works
var süßigkeit∆ = süßigkeit + 1; // doesn't work
Taken from Microsoft docs:
Identifiers must start with a letter, or _.
Identifiers may contain Unicode letter characters, decimal digit characters, Unicode
connecting characters, Unicode combining characters, or Unicode
formatting characters.
https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/inside-a-program/identifier-names
Char.GetUnicodeCategory('∆') // MathSymbol category
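Putting those category rules into code, a rough check for whether a character may appear in an identifier (a sketch; IsIdentifierPartChar is a made-up name, not a framework API) could look like:

using System;
using System.Globalization;

static bool IsIdentifierPartChar(char c)
{
    UnicodeCategory cat = char.GetUnicodeCategory(c);
    return cat == UnicodeCategory.UppercaseLetter      // Unicode letters
        || cat == UnicodeCategory.LowercaseLetter
        || cat == UnicodeCategory.TitlecaseLetter
        || cat == UnicodeCategory.ModifierLetter
        || cat == UnicodeCategory.OtherLetter
        || cat == UnicodeCategory.LetterNumber
        || cat == UnicodeCategory.DecimalDigitNumber   // decimal digits
        || cat == UnicodeCategory.ConnectorPunctuation // connecting characters (like _)
        || cat == UnicodeCategory.NonSpacingMark       // combining characters
        || cat == UnicodeCategory.SpacingCombiningMark
        || cat == UnicodeCategory.Format;              // formatting characters
}

Console.WriteLine(IsIdentifierPartChar('ß')); // True  (LowercaseLetter)
Console.WriteLine(IsIdentifierPartChar('∆')); // False (MathSymbol)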
I am experimenting with escape sequences and cannot really use the \U sequence (UTF-32).
It does not compile, as it cannot recognize the sequence for some reason.
It recognizes it as UTF-16.
Could you please help me?
Console.WriteLine("\U00HHHHHH");
Your problem is that you copied \U00HHHHHH from the documentation page Strings (C# Programming Guide): String Escape Sequences.
But \U00HHHHHH is not itself a valid UTF-32 escape sequence -- it's a mask where each H indicates where a hex digit must be typed. The reason it's not valid is that hexadecimal numbers consist of the digits 0-9 and the letters A-F or a-f -- and H is not one of these characters. And the literal mentioned in comments, "\U001effff", does not work because it falls outside the range of valid UTF-32 character values specified immediately thereafter in the docs:
(range: 000000 - 10FFFF; example: \U0001F47D = "👽")
The C# compiler actually checks that the specified UTF-32 character is valid according to these rules:
// These compile because they're valid Hex numbers in the range 000000 - 10FFFF padded to 8 digits with leading zeros:
Console.WriteLine("\U0001F47D");
Console.WriteLine("\U00000000");
Console.WriteLine("\U0010FFFF");
// But these don't.
// H is not a valid Hex character:
// Compilation error (line 16, col 22): Unrecognized escape sequence
Console.WriteLine("\U00HHHHHH");
// This is outside the range of 000000 - 10FFFF:
// Compilation error (line 19, col 22): Unrecognized escape sequence
Console.WriteLine("\U001effff");
See https://dotnetfiddle.net/KezdTG.
As an aside, to properly display Unicode characters in the Windows console, see How to write Unicode characters to the console?.
Microsoft uses this rule as one of its password-complexity rules:
Any Unicode character that is categorized as an alphabetic character but is not uppercase or lowercase. This includes Unicode characters from Asian languages.
Testing for the usual rules, like uppercase, can be as simple as password.Any(char.IsUpper).
What test could I use in C# to test for alphabetic Unicode characters that are not uppercase or lowercase?
How about the literal translation of the rule:
password.Any(c => Char.IsLetter(c) &&
                  !Char.IsUpper(c) &&
                  !Char.IsLower(c))
When you convert the ASCII characters a and A to Unicode, you get a and A, so obviously they are not the same.
Update:
Here's an example of what I think you're asking:
var c = 'א';
c.Dump(); // Dump() is LINQPad's output helper
char.IsUpper(c).Dump("is upper"); // False
char.IsLower(c).Dump("is lower"); // False
char.IsLetterOrDigit(c).Dump("is letter or digit"); // True
char.IsNumber(c).Dump("is Number"); // False
Is there some algorithm in C# to encode a URL so that its symbols display correctly in a web browser?
Something like Base64.
The Standard (RFC 3986, aka STD 66) lays it out for you. In particular, §2 and §2.1:
2. Characters
The URI syntax provides a method of encoding data, presumably for the
sake of identifying a resource, as a sequence of characters. The URI
characters are, in turn, frequently encoded as octets for transport
or presentation. This specification does not mandate any particular
character encoding for mapping between URI characters and the octets
used to store or transmit those characters. When a URI appears in a
protocol element, the character encoding is defined by that protocol;
without such a definition, a URI is assumed to be in the same
character encoding as the surrounding text.
The ABNF notation defines its terminal values to be non-negative
integers (codepoints) based on the US-ASCII coded character set
[ASCII]. Because a URI is a sequence of characters, we must invert
that relation in order to understand the URI syntax. Therefore, the
integer values used by the ABNF must be mapped back to their
corresponding characters via US-ASCII in order to complete the syntax
rules.
A URI is composed from a limited set of characters consisting of
digits, letters, and a few graphic symbols. A reserved subset of
those characters may be used to delimit syntax components within a
URI while the remaining characters, including both the unreserved set
and those reserved characters not acting as delimiters, define each
component's identifying data.
2.1. Percent-Encoding
A percent-encoding mechanism is used to represent a data octet in a
component when that octet's corresponding character is outside the
allowed set or is being used as a delimiter of, or within, the
component. A percent-encoded octet is encoded as a character
triplet, consisting of the percent character "%" followed by the two
hexadecimal digits representing that octet's numeric value. For
example, "%20" is the percent-encoding for the binary octet
"00100000" (ABNF: %x20), which in US-ASCII corresponds to the space
character (SP). Section 2.4 describes when percent-encoding and
decoding is applied.
pct-encoded = "%" HEXDIG HEXDIG
The uppercase hexadecimal digits 'A' through 'F' are equivalent to
the lowercase digits 'a' through 'f', respectively. If two URIs
differ only in the case of hexadecimal digits used in percent-encoded
octets, they are equivalent. For consistency, URI producers and
normalizers should use uppercase hexadecimal digits for all percent-
encodings.
In general, the only characters that may freely be represented in a URL without being percent-encoded are
The unreserved characters. These are the US-ASCII (7-bit) characters
A-Z
a-z
0-9
-._~
The reserved characters ... when used within their role in the grammar of a URL and its scheme. These reserved characters are:
:/?#[]@!$&'()*+,;=
Any other characters must, per the standard, be properly percent-encoded.
Further note that a URL may only contain characters drawn from the US-ASCII character set (0x00-0x7F): if your URL contains characters outside that range of code points, those characters will need to be suitably encoded for representation in US-ASCII (e.g., via HTML/XML entity references). Further, your application is responsible for interpreting such encodings.
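One built-in way to produce this percent-encoding in C# is Uri.EscapeDataString, which escapes everything except the unreserved characters; a minimal sketch:

using System;

// Unreserved characters pass through; everything else becomes uppercase %XX
// triplets over the UTF-8 octets of the character:
string data = "a/b c&d=ü";
Console.WriteLine(Uri.EscapeDataString(data)); // a%2Fb%20c%26d%3D%C3%BC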
I have a string and I want to know if it has Unicode characters inside or not
(i.e., whether it fully consists of ASCII or not).
How can I achieve that?
Thanks!
If my assumptions are correct, you wish to know if your string contains any "non-ANSI" characters. You can derive this as follows.
using System.Linq; // needed for .Any()

public void Test()
{
    const string WithUnicodeCharacter = "a hebrew character:\uFB2F";
    const string WithoutUnicodeCharacter = "an ANSI character:Æ";

    bool hasUnicode;

    // true
    hasUnicode = ContainsUnicodeCharacter(WithUnicodeCharacter);
    Console.WriteLine(hasUnicode);

    // false
    hasUnicode = ContainsUnicodeCharacter(WithoutUnicodeCharacter);
    Console.WriteLine(hasUnicode);
}

public bool ContainsUnicodeCharacter(string input)
{
    const int MaxAnsiCode = 255;
    return input.Any(c => c > MaxAnsiCode);
}
Update
This check allows for extended ASCII. If you only test for the true ASCII character range (up to 127), you could get false positives for extended ASCII characters, which do not denote Unicode. I have alluded to this in my sample.
If a string contains only ASCII characters, a serialization + deserialization step using ASCII encoding should get back the same string (the ASCII encoder replaces any non-ASCII character with '?', so the round trip is only lossless for pure ASCII). So a one-liner check in C# could look like:
string s1 = "testभारत";
bool isUnicode = System.Text.Encoding.ASCII.GetString(System.Text.Encoding.ASCII.GetBytes(s1)) != s1;
ASCII defines only character codes in the range 0-127. Unicode is explicitly defined so as to overlap with ASCII in that same range. Thus, if you look at the character codes in your string and it contains anything higher than 127, the string contains Unicode characters that are not ASCII characters.
Note that ASCII covers only the English alphabet. Thus, if you (for whatever reason) need to apply the same approach to strings that might contain accented characters (Spanish text, for example), ASCII is not sufficient and you need to look for another differentiator.
The ANSI character set [*] does extend the ASCII characters with the aforementioned accented Latin characters in the range 128-255. However, Unicode does not overlap with ANSI in that range, so technically a Unicode string might contain characters that are not part of ANSI but have the same character code (specifically in the range 128-159, as you can see from the table I linked to).
As for the actual code to do this, @chibacity's answer should work, although you should modify it to cover strict ASCII, because it won't work for ANSI.
[*] Also known as Latin 1 Windows (Windows-1252)
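For example, a strict-ASCII variant of that check (a sketch; ContainsNonAscii is my own name) only needs to compare against 127 instead of 255:

using System.Linq;

static bool ContainsNonAscii(string input) => input.Any(c => c > 127);

Console.WriteLine(ContainsNonAscii("café")); // True ('é' is in ANSI but not ASCII)
Console.WriteLine(ContainsNonAscii("cafe")); // False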
As long as it contains characters, it contains Unicode characters.
From System.String:
Represents text as a series of Unicode characters.

public static bool ContainsUnicodeChars(string text)
{
    return !string.IsNullOrEmpty(text);
}
You normally have to worry about different Unicode encodings when you have to:
Encode a string into a stream of bytes with a particular encoding.
Decode a string from a stream of bytes with a particular encoding.
Once you're into string land though, the encoding that the string was originally represented with, if any, is irrelevant.
Each character in a string is defined by a Unicode scalar value, also called a Unicode code point or the ordinal (numeric) value of the Unicode character. Each code point is encoded by using UTF-16 encoding, and the numeric value of each element of the encoding is represented by a Char object.
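To make the encode/decode points above concrete, here is a minimal round-trip sketch:

using System;
using System.Text;

string s = "héllo";
byte[] utf8 = Encoding.UTF8.GetBytes(s);     // encode: string -> bytes
string back = Encoding.UTF8.GetString(utf8); // decode: bytes -> string
Console.WriteLine(back == s);                // True: the round trip is lossless
Console.WriteLine(utf8.Length);              // 6: 'é' takes two bytes in UTF-8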
Perhaps you might also find these questions relevant:
How can you strip non-ASCII characters from a string? (in C#)
C# Ensure string contains only ASCII
And this article by Jon Skeet: Unicode and .NET
This is another solution without using lambda expressions. It is in VB.NET, but you can convert it easily to C#:
Public Function ContainsUnicode(ByVal inputstr As String) As Boolean
    Dim inputCharArray() As Char = inputstr.ToCharArray
    For i As Integer = 0 To inputCharArray.Length - 1
        If CInt(AscW(inputCharArray(i))) > 255 Then Return True
    Next
    Return False
End Function
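For reference, a direct C# conversion of that function could look like this (a sketch preserving the original logic, including its 255 threshold):

static bool ContainsUnicode(string inputstr)
{
    char[] inputCharArray = inputstr.ToCharArray();
    for (int i = 0; i < inputCharArray.Length; i++)
    {
        // AscW in VB.NET returns the UTF-16 code unit, which in C# is just the char value
        if (inputCharArray[i] > 255) return true;
    }
    return false;
}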
1) Escape sequences are mostly used for character constants that either have a special meaning (such as " or \) or can't be represented graphically. Any character literal can be represented using a hex ('\xhhhh') or Unicode ('\uhhhh') escape sequence. Is there a situation where we should prefer the hex escape sequence over the Unicode escape sequence, or vice versa?
2) When should we specify integer literals in hexadecimal form?
Thank you.
They are not interchangeable. You can only use a Unicode escape in an identifier name:
var on\u0065 = 1;
var tw\x006f = 2; // bad
But in a string or char literal it doesn't make a heck of a lot of difference. I prefer \u myself because the escape code has a fixed number of digits; \x is variable, but it's easy enough to avoid mistakes. Also note \U to pick code points from the upper planes.
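To see why the variable length of \x can bite: \x greedily consumes one to four hex digits, while \u always takes exactly four and \U exactly eight. A small illustrative sketch:

Console.WriteLine("\u0009Bad coffee"); // a tab, then "Bad coffee"
Console.WriteLine("\x9Bad coffee");    // one char U+9BAD, then " coffee" -- 'B', 'a', 'd' were swallowed as hex digits
Console.WriteLine("\U0001F47D");       // 👽 -- a code point from the upper planes needs \U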