Encoding char in C# [duplicate] - c#

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
To which character encoding (Unicode version) set does a char object correspond?
I'm a little afraid to ask this, as I'm sure it's been asked before, but I can't find it. It's probably something obvious, but I've never studied encoding before.
int Convert(char c)
{
    return (int)c;
}
What encoding is produced by that method? I thought it might be ASCII (at least for <128), but running the code below produced... smiley faces as the first characters? What? Definitely not ASCII...
for (int i = 0; i < 128; i++)
    Console.WriteLine(i + ": " + (char)i);

C# char uses the UTF-16 encoding. The language specification, 1.3 Types and variables, says:
Character and string processing in C# uses Unicode encoding. The char type represents a UTF-16 code unit, and the string type represents a sequence of UTF-16 code units.
UTF-16 overlaps with ASCII in that the character codes in the ASCII range 0-127 mean the same thing in UTF-16 as in ASCII. The smiley faces in your program's output are presumably how your console interprets the non-printable characters in the range 0-31.
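As a rough illustration of the point above, here is a minimal sketch showing that the cast yields the UTF-16 code unit value, which matches ASCII only in the range 0-127. The class name CharCodeDemo and the method name ToCodeUnit are made up for this example.

```csharp
using System;

public static class CharCodeDemo
{
    // Returns the UTF-16 code unit value of a char (same idea as the question's Convert method).
    public static int ToCodeUnit(char c) => c;

    public static void Main()
    {
        Console.WriteLine(ToCodeUnit('A'));         // 65, same value as in ASCII
        Console.WriteLine(ToCodeUnit('\u03B1'));    // 945 (U+03B1, Greek small alpha), outside ASCII
        Console.WriteLine(char.IsControl((char)1)); // True: values 0-31 are control characters
    }
}
```

How a console draws the control characters 0-31 depends on its code page and font, so the glyphs you see for them may differ from machine to machine.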

Each char is a UTF-16 code unit, which is not necessarily a complete code point. However, when you need the bytes of a specific encoding, you should use the proper Encoding class rather than a cast. See
C# and UTF-16 characters

Related

Parse chars in escape sequence form [duplicate]

This question already has answers here:
Can I expand a string that contains C# literal expressions at runtime
(5 answers)
Closed 3 years ago.
I'm implementing a compiler, and I need to convert from an escape character literal written in source code file to an actual value.
For example, I might have a line of source code char = '\\'. When I parse that, I get the string "'\\'", which I need to turn into the actual char '\\'.
char.Parse and char.TryParse both fail when parsing a char in escape sequence form. For example:
char.Parse(@"\\");
Will throw "String must be exactly one character long."
Is there any way to parse everything on this list as a char (ignoring those that are too big to fit in a UTF-16 char)?
The @ makes it a verbatim string literal. Drop that and \ will be treated as the escape character for the next \.
@"\\".Length // 2
"\\".Length // 1

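For the compiler scenario in the question, where the escape sequence arrives as text at runtime, one possible approach is a small hand-written parser. The sketch below is only an illustration under stated assumptions: CharLiteralParser and its Parse method are hypothetical names, the input is assumed to include the surrounding single quotes, and only the simple escapes plus \uXXXX and \x (one to four hex digits) are handled. The eight-digit \U form is left out.

```csharp
using System;
using System.Globalization;

public static class CharLiteralParser
{
    // Hypothetical helper: converts the source text of a C# char literal,
    // quotes included, into the char it denotes.
    public static char Parse(string literal)
    {
        if (literal.Length < 3 || literal[0] != '\'' || literal[literal.Length - 1] != '\'')
            throw new FormatException("Expected a quoted char literal.");

        string body = literal.Substring(1, literal.Length - 2);
        if (body.Length == 1 && body[0] != '\\')
            return body[0]; // plain character, no escape

        if (body.Length < 2 || body[0] != '\\')
            throw new FormatException("Not a single character or escape sequence.");

        if (body.Length == 2)
        {
            switch (body[1])
            {
                case '\'': return '\'';
                case '"': return '"';
                case '\\': return '\\';
                case '0': return '\0';
                case 'a': return '\a';
                case 'b': return '\b';
                case 'f': return '\f';
                case 'n': return '\n';
                case 'r': return '\r';
                case 't': return '\t';
                case 'v': return '\v';
            }
        }

        // \u takes exactly four hex digits, \x takes one to four.
        string hex = body.Substring(2);
        bool validLength = body[1] == 'u'
            ? hex.Length == 4
            : body[1] == 'x' && hex.Length >= 1 && hex.Length <= 4;
        if (validLength && ushort.TryParse(hex, NumberStyles.AllowHexSpecifier,
                                           CultureInfo.InvariantCulture, out ushort code))
            return (char)code;

        throw new FormatException("Unsupported escape sequence: " + body);
    }

    public static void Main()
    {
        // The verbatim string @"'\\'" holds four characters: quote, backslash, backslash, quote.
        Console.WriteLine((int)Parse(@"'\\'"));     // 92, the backslash
        Console.WriteLine((int)Parse(@"'\n'"));     // 10, line feed
        Console.WriteLine((int)Parse(@"'\u0041'")); // 65, 'A'
    }
}
```
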
How to get string's byte length? [duplicate]

This question already has answers here:
How to know the size of the string in bytes?
(4 answers)
Closed 5 years ago.
I have a string like this:
string a1 = "{`name`:`санкт_петербург`,`shortName`:`питер`,`hideByDefault`:false}";
a1.Length shows that the string length is 68, which is not true: Cyrillic symbols are twice as big (because of UTF-16 encoding, I presume), therefore the real length of this string is 87.
I need to either get the number of Cyrillic symbols in the string or get real string length in any other way.
From the MSDN:
The .NET Framework uses the UTF-16 encoding (represented by the UnicodeEncoding class) to represent characters and strings.
So a1.Length is in UTF-16 code units (What's the difference between a character, a code point, a glyph and a grapheme?). Cyrillic characters, being in the BMP (Basic Multilingual Plane), each use a single code unit (so a single char). Many emoji, for example, use TWO code units (two chars, 4 bytes!) because they aren't in the BMP. See for example https://ideone.com/ASDORp.
If you want the size IN BYTES as held in memory, a1.Length * 2 clearly is the length :-) If you want to know how many bytes it would be in UTF-8 (a very common encoding, NOT USED INTERNALLY BY .NET, but widely used by the web, XML, ...), use Encoding.UTF8.GetByteCount(a1).
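To make the different sizes concrete, here is a small sketch applied to the string from the question. StringSizeDemo and CountCyrillic are names invented for this example, and "Cyrillic" is assumed to mean the Unicode block U+0400 to U+04FF. The 87 the asker expected appears to be the UTF-8 byte count.

```csharp
using System;
using System.Linq;
using System.Text;

public static class StringSizeDemo
{
    // Counts chars in the Unicode Cyrillic block (U+0400 to U+04FF).
    public static int CountCyrillic(string s) => s.Count(c => c >= '\u0400' && c <= '\u04FF');

    public static void Main()
    {
        string a1 = "{`name`:`санкт_петербург`,`shortName`:`питер`,`hideByDefault`:false}";

        Console.WriteLine(a1.Length);                         // 68 UTF-16 code units
        Console.WriteLine(Encoding.Unicode.GetByteCount(a1)); // 136 bytes as UTF-16
        Console.WriteLine(Encoding.UTF8.GetByteCount(a1));    // 87 bytes as UTF-8
        Console.WriteLine(CountCyrillic(a1));                 // 19 Cyrillic letters
    }
}
```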

Showing Octal characters in c# [duplicate]

This question already has answers here:
Octal equivalent in C#
(5 answers)
Closed 6 years ago.
According to MSDN:
\ooo ASCII character in octal notation
The following code is supposed to show the octal character ($) in C#:
char character36 = '\o44';
Console.Write(character36);
but it doesn't work.
In Escape Sequences, which I suspect you were reading, the "ooo" is italicised to indicate that it should not be included verbatim. It should be replaced by the appropriate octal digits. "o44" aren't octal digits, you included a literal "o". The appropriate octal digits would be "044", or just "44".
But as pointed out by Andrew Savinykh, this is a documentation page about C, not about C#, and C# does not use the same syntax. It doesn't have octal escape sequences at all. The escape sequences for C# are documented on Strings (C# Programming Guide), and do not include any octal escape sequences unless you want to include the special exception of \0. You can use hexadecimal escape sequences instead, either \u0024 or \x24.
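A short sketch of the alternatives mentioned above: the two hexadecimal escapes, plus a runtime conversion for cases where the octal digits are only available as text. OctalDemo and FromOctal are hypothetical names; Convert.ToInt32 with base 8 does the actual parsing.

```csharp
using System;

public static class OctalDemo
{
    // Converts octal digits such as "44" to the char with that code.
    public static char FromOctal(string octalDigits) => (char)Convert.ToInt32(octalDigits, 8);

    public static void Main()
    {
        char viaUnicodeEscape = '\u0024';
        char viaHexEscape = '\x24';
        char viaOctalDigits = FromOctal("44"); // octal 44 = decimal 36

        Console.WriteLine(viaUnicodeEscape); // $
        Console.WriteLine(viaHexEscape);     // $
        Console.WriteLine(viaOctalDigits);   // $
    }
}
```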

Print the next letter by increasing their ascii code in c# [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
Can I have an algorithm for solving this problem?
Write a program in which the user gives a string as input; increment every letter of the string using its ASCII value and print the output in a C# console application.
(Like if user enters Abcd it will print Bcde.)
Strings in .NET are sequences of UTF-16 code units. (This is also true for Java, JavaScript, HTML, XML, XSL, Windows API, …) The .NET datatype for a string is String and the .NET datatype for a UTF-16 code unit is Char. In C#, you can use them or their keyword aliases string and char.
UTF-16 is one of several encodings of the Unicode character set. It encodes each Unicode codepoint in one or two code units. When two code units are used, they are called a surrogate pair and appear in the order high surrogate, then low surrogate.
ASCII is a subset of Unicode. The Unicode characters in common with ASCII are exactly U+0000 to U+007F. UTF-16 encodes them each as one code unit: '\u0000' to '\u007F'. The single encoding for the ASCII character set encodes each in one byte: 0x00 to 0x7f.
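The one-or-two code unit rule can be checked directly. This is a minimal sketch; SurrogateDemo and CodeUnitCount are names made up for the example, and U+1F600 is just one convenient codepoint outside the range that fits in a single code unit.

```csharp
using System;

public static class SurrogateDemo
{
    // Number of UTF-16 code units (chars) needed for one Unicode codepoint.
    public static int CodeUnitCount(int codePoint) => char.ConvertFromUtf32(codePoint).Length;

    public static void Main()
    {
        string emoji = char.ConvertFromUtf32(0x1F600); // needs a surrogate pair

        Console.WriteLine(CodeUnitCount(0x0041));                    // 1
        Console.WriteLine(CodeUnitCount(0x1F600));                   // 2
        Console.WriteLine(char.IsHighSurrogate(emoji[0]));           // True
        Console.WriteLine(char.IsLowSurrogate(emoji[1]));            // True
        Console.WriteLine(((int)emoji[0]).ToString("X4"));           // D83D
        Console.WriteLine(((int)emoji[1]).ToString("X4"));           // DE00
        Console.WriteLine(char.ConvertToUtf32(emoji, 0) == 0x1F600); // True
    }
}
```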
The problem statement refers to "string", "alphabet", "ASCII", "C#" and "console". Each of these has to be understood.
Some people, unfortunately, use "ASCII" to mean "character code". In the context of "C#" "string", character code means either Unicode codepoint or UTF-16 code unit. You could make a simplifying assumption that the input characters do not require UTF-16 surrogate pairs and even that the range is U+0000 to U+007F.
"Alphabet" is both a mathematical and a linguistic term. In mathematics (and computer science), an "alphabet" is a set of tokens, often given an ordering. In linguistics, an "alphabet" is a sequence of basic letters used in the writing system for a particular language. It is often defined by a "language academy" and may or may not include all forms of letters used in the language. For example, the accepted English alphabet does not include letters with any diacritics or accents or ligatures even though they are used in English words. Also, in a writing system, some or all letters can have uppercase, lowercase, title case or other forms. For this problem, with ASCII being mentioned and ASCII containing the Basic Latin letters in uppercase and lowercase, you could define two technical alphabets A-Z and a-z, ordered by their UTF-16 code unit values.
Since you want to increment the character code for each character in your alphabet, you have to decide what happens if the result is no longer in your alphabet. Is the result even a character? After all, there is a last possible character (or several, for the distinct UTF-16 code unit ranges). You might consider wrapping around to the first character in your alphabet: Z->A, z->a.
The final thing is "console". A console has a setting for character encoding. (Run chcp to find out what yours is.) Your program will read from and write to the console. When you write to it, it uses a font to generate an image of the characters received. If everything lines up, great. If not, you can ask questions about it. Bottom line: When the program reads and writes, the input/output functions do the encoding conversion. So, within your program, for String and Char, the encoding is UTF-16.
Now, a String is a sequence of Char so you can use several decomposition techniques including foreach:
foreach (Char c in inputString)
{
    if (Char.IsSurrogate(c)) throw new ArgumentOutOfRangeException();
    if (c > '\u007f') throw new ArgumentOutOfRangeException();
    // add your logic to increment c and save or output it
}
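One way the loop body could be filled in, using the wrap-around idea (Z->A, z->a), is sketched below. LetterShifter and Increment are hypothetical names, and copying non-letter ASCII characters unchanged is an assumption, since the task does not say what should happen to them.

```csharp
using System;
using System.Text;

public static class LetterShifter
{
    // Shifts A-Z and a-z to the next letter, wrapping Z->A and z->a.
    // Other ASCII characters are copied unchanged (an assumption; the task does not say).
    public static string Increment(string input)
    {
        var result = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            // Rejects everything outside ASCII, which also covers surrogates.
            if (c > '\u007f') throw new ArgumentOutOfRangeException(nameof(input));

            if (c == 'Z') result.Append('A');
            else if (c == 'z') result.Append('a');
            else if ((c >= 'A' && c < 'Z') || (c >= 'a' && c < 'z')) result.Append((char)(c + 1));
            else result.Append(c);
        }
        return result.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Increment("Abcd")); // Bcde
        Console.WriteLine(Increment("xyZ"));  // yzA
    }
}
```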
One easy and comprehensible way to do it:
byte[] bytes = System.Text.Encoding.ASCII.GetBytes("abcd".ToCharArray());
for (int i = 0; i <= bytes.GetUpperBound(0); i++)
{
    bytes[i]++;
}
Console.WriteLine(System.Text.Encoding.ASCII.GetString(bytes));
A possible solution would be to transform the string into a character array and then iterate over that array.
For each element in the array, increment the value by one and cast the result back to char. (This works because a char is a 16-bit numeric value, a UTF-16 code unit, that converts implicitly to int; for the ASCII range the values match the ASCII table, and the console maps each value to a glyph.)
I hope this answered your question.

How to convert a char to its full Unicode name? [duplicate]

This question already has answers here:
Finding out Unicode character name in .Net
(7 answers)
Closed 9 years ago.
I need functions to convert between a character (e.g. 'α') and its full Unicode name (e.g. "GREEK SMALL LETTER ALPHA") in both directions.
The solution I came up with is to perform a lookup in the official Unicode Standard available online: http://www.unicode.org/Public/6.2.0/ucd/UnicodeData.txt (or, rather, in a cached local copy, possibly converted to a suitable collection beforehand to improve the lookup performance).
Is there a simpler way to do these conversions?
I would prefer a solution in C#, but solutions in other languages that can be adapted to C# / .NET are also welcome. Thanks!
If you do not want to keep the Unicode name table in memory, prepare a text file of fixed-width records, so that the code point value multiplied by the maximum name length gives the file offset of that character's name. For code points of at most 4 bytes, it won't be more than a few megabytes. If you want a more compact implementation, put an index at the start of the file, indexed by code point value, that holds the file offset of each name; the name table itself can then be stored compactly. You have to prepare such a file yourself, but that is not difficult.
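For comparison, the in-memory lookup described in the question could be sketched as follows. UnicodeNames and Load are hypothetical names. The sketch relies only on the first two semicolon-separated fields of each UnicodeData.txt line (hex code point, then name), and the two lines in Main are illustrative samples in that layout rather than a real file read. Entries that share a placeholder name such as <control>, and the range markers in the real file, would need extra handling that is not shown here.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class UnicodeNames
{
    // Builds code point -> name and name -> code point maps from lines in the
    // semicolon-separated UnicodeData.txt layout (field 0: hex code point, field 1: name).
    public static (Dictionary<int, string> ByCode, Dictionary<string, int> ByName) Load(
        IEnumerable<string> lines)
    {
        var byCode = new Dictionary<int, string>();
        var byName = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (string line in lines)
        {
            string[] fields = line.Split(';');
            if (fields.Length < 2) continue; // skip malformed or empty lines
            int code = int.Parse(fields[0], NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            byCode[code] = fields[1];
            byName[fields[1]] = code; // duplicate names overwrite earlier entries
        }
        return (byCode, byName);
    }

    public static void Main()
    {
        // Sample lines only; real use would read the cached local copy of the file.
        string[] sample =
        {
            "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
            "03B1;GREEK SMALL LETTER ALPHA;Ll;0;L;;;;;N;;;0391;;0391",
        };
        var (byCode, byName) = Load(sample);

        Console.WriteLine(byCode[0x03B1]);                                      // GREEK SMALL LETTER ALPHA
        Console.WriteLine(byName["GREEK SMALL LETTER ALPHA"].ToString("X4"));   // 03B1
        Console.WriteLine(char.ConvertFromUtf32(byName["LATIN CAPITAL LETTER A"])); // A
    }
}
```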
