This question already has answers here:
How to know the size of the string in bytes?
I have a string like this:
string a1 = "{`name`:`санкт_петербург`,`shortName`:`питер`,`hideByDefault`:false}";
a1.Length shows that the string length is 68, which is not what I need: Cyrillic symbols take two bytes each (because of the encoding, I presume), therefore the real length of this string in bytes is 87.
I need to either get the number of Cyrillic symbols in the string or get real string length in any other way.
From the MSDN:
The .NET Framework uses the UTF-16 encoding (represented by the UnicodeEncoding class) to represent characters and strings.
So a1.Length is in UTF-16 code units (What's the difference between a character, a code point, a glyph and a grapheme?). Cyrillic characters, being in the BMP (Basic Multilingual Plane), all use a single code unit (so a single char). Many emoji, for example, use TWO code units (two char, 4 bytes!)... They aren't in the BMP. See for example https://ideone.com/ASDORp.
If you want the size IN BYTES as stored by .NET, a1.Length * 2 clearly is the length :-) If you want to know how many bytes it would take in UTF-8 (a very common encoding, NOT USED INTERNALLY BY .NET, but widely used on the web, in XML, ...), use Encoding.UTF8.GetByteCount(a1).
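For example, a minimal sketch putting those together (the counts follow from the a1 string above):

using System;
using System.Text;

string a1 = "{`name`:`санкт_петербург`,`shortName`:`питер`,`hideByDefault`:false}";

Console.WriteLine(a1.Length);                         // 68  UTF-16 code units
Console.WriteLine(Encoding.Unicode.GetByteCount(a1)); // 136 bytes as UTF-16 (a1.Length * 2)
Console.WriteLine(Encoding.UTF8.GetByteCount(a1));    // 87  bytes as UTF-8 (each Cyrillic letter takes 2 bytes)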
Closed. This question needs to be more focused. It is not currently accepting answers.
Can I have an algorithm for how to solve this problem?
Write a C# console application in which the user gives a string as input; increment every letter of the string using its ASCII value and print the output.
(For example, if the user enters Abcd, it will print Bcde.)
Strings in .NET are sequences of UTF-16 code units. (This is also true for Java, JavaScript, HTML, XML, XSL, the Windows API, ...) The .NET datatype for a string is String and the .NET datatype for a UTF-16 code unit is Char. In C#, you can use them or their keyword aliases string and char.
UTF-16 is one of several encodings of the Unicode character set. It encodes each Unicode codepoint in one or two code units. When two code units are used, they are called a surrogate pair and appear in the order high surrogate then low surrogate.
ASCII is a subset of Unicode. The Unicode characters in common with ASCII are exactly U+0000 to U+007F. UTF-16 encodes them each as one code unit: '\u0000' to '\u007F'. The single encoding for the ASCII character set encodes each in one byte: 0x00 to 0x7f.
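For instance, a short illustration of these points (the numeric values come from the Unicode and ASCII tables):

string ascii = "A";          // U+0041, in the ASCII range
string emoji = "\U0001F600"; // U+1F600 GRINNING FACE, outside the BMP

Console.WriteLine((int)ascii[0]);                  // 65, the same value as in ASCII
Console.WriteLine(ascii.Length);                   // 1  (one UTF-16 code unit)
Console.WriteLine(emoji.Length);                   // 2  (a surrogate pair)
Console.WriteLine(char.IsHighSurrogate(emoji[0])); // True
Console.WriteLine(char.IsLowSurrogate(emoji[1]));  // True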
The problem statement refers to "string", "alphabet", "ASCII", "C#" and "console". Each of these has to be understood.
Some people, unfortunately, use "ASCII" to mean "character code". In the context of "C#" "string", character code means either Unicode codepoint or UTF-16 code unit. You could make a simplifying assumption that the input characters do not require UTF-16 surrogate pairs and even that the range is U+0000 to U+007F.
"Alphabet" is both a mathematical and a linguistical term. In mathematics (and computer science), an "alphabet" is a set of token things, often given an ordering. In linguistics, an "alphabet" is a sequence of basic letters used in the writing system for a particular language. It is often defined by a "language academy" and may or may not include all forms of letters used in the language. For example, the accepted English alphabet does not include letters with any diacritics or accents or ligatures even though they are used in English words. Also, in a writing system, some or all letters can have uppercase, lowercase, title case or other forms. For this problem, with ASCII being mentioned and ASCII containing the Basic Latin letters in uppercase and lowercase, you could define two technical alphabets A-Z and a-z, ordered by their UTF-16 code unit values.
Since you want to increment the character code for each character in your alphabet, you have to decide what happens if the result is no longer in your alphabet. Is the result even a character at all? After all, there is a last possible character (or several, given the distinct UTF-16 code unit ranges). You might consider wrapping around to the first character in your alphabet: Z -> A, z -> a.
The final thing is "console". A console has a setting for character encoding. (Run chcp to find out what yours is.) Your program will read from and write to the console. When you write to it, it uses a font to generate an image of the characters received. If everything lines up, great. If not, you can ask questions about it. Bottom line: When the program reads and writes, the input/output functions do the encoding conversion. So, within your program, for String and Char, the encoding is UTF-16.
Now, a String is a sequence of Char so you can use several decomposition techniques including foreach:
foreach (Char c in inputString)
{
    if (Char.IsSurrogate(c)) throw new ArgumentOutOfRangeException();
    if (c > '\u007f') throw new ArgumentOutOfRangeException();
    // add your logic to increment c and save or output it
}
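For example, the increment step mentioned in the comment above might be filled in like this (a sketch assuming the A-Z / a-z alphabets defined earlier, with wrap-around; the method name is illustrative):

using System;
using System.Text;

static string IncrementLetters(string inputString)
{
    var result = new StringBuilder(inputString.Length);
    foreach (char c in inputString)
    {
        if (c >= 'A' && c <= 'Z')
            result.Append(c == 'Z' ? 'A' : (char)(c + 1));  // wrap Z -> A
        else if (c >= 'a' && c <= 'z')
            result.Append(c == 'z' ? 'a' : (char)(c + 1));  // wrap z -> a
        else
            result.Append(c);  // leave anything outside the alphabets unchanged
    }
    return result.ToString();
}

// IncrementLetters("Abcd") == "Bcde"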
One easy and comprehensible way to do it:
byte[] bytes = System.Text.Encoding.ASCII.GetBytes("abcd".ToCharArray());
for (int i = 0; i <= bytes.GetUpperBound(0); i++)
{
    bytes[i]++;
}
Console.WriteLine(System.Text.Encoding.ASCII.GetString(bytes));
A possible solution would be to transform the string into a character array and then iterate over that array.
For each element in the array, increment the value by one and cast back to the character type. (Basically this works because a char is just a numeric value, much like an int, except that a char has a limited range of values and the console can map that value to a glyph for display.)
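A minimal sketch of that idea (assuming plain letters only and ignoring wrap-around):

char[] chars = "Abcd".ToCharArray();
for (int i = 0; i < chars.Length; i++)
{
    chars[i] = (char)(chars[i] + 1);  // increment the character code by one
}
Console.WriteLine(new string(chars)); // prints "Bcde"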
I hope this answered your question.
This question already has answers here:
Determine a string's encoding in C#
I have a string read as UTF-8 (not from a file, so I can't check for a BOM).
The problem is that sometimes the original text was produced with another encoding but was converted to UTF-8 anyway, so the string is not readable - it is gibberish.
Is it possible to detect that this string is not actually UTF-8?
Thanks!
No. They're just bytes. You could try to guess, if you wanted, by trying different conversions and seeing whether there are valid dictionary words, etc., but in a theoretical sense it's impossible without knowing something about the data itself, e.g. knowing that it never uses certain characters, or always uses certain characters, or that it contains mostly words found in a given dictionary, etc. It might look like gibberish to a person, but the computer has no way of quantifying "gibberish".
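If you do have the raw bytes, one heuristic (only a sketch, and it can only rule UTF-8 out, never confirm it; the method name is illustrative) is to decode them with a strict UTF-8 decoder and see whether it throws:

using System;
using System.Text;

static bool IsWellFormedUtf8(byte[] bytes)
{
    // A strict decoder throws DecoderFallbackException on invalid byte sequences
    // instead of silently substituting U+FFFD.
    var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                      throwOnInvalidBytes: true);
    try
    {
        strictUtf8.GetString(bytes);
        return true;   // well-formed UTF-8 (which still doesn't prove it *is* UTF-8)
    }
    catch (DecoderFallbackException)
    {
        return false;  // definitely not valid UTF-8
    }
}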
This question already has answers here:
Finding out Unicode character name in .Net
I need functions to convert between a character (e.g. 'α') and its full Unicode name (e.g. "GREEK SMALL LETTER ALPHA") in both directions.
The solution I came up with is to perform a lookup in the official Unicode character database available online (http://www.unicode.org/Public/6.2.0/ucd/UnicodeData.txt), or, rather, in a cached local copy of it, possibly converted to a suitable collection beforehand to improve lookup performance.
Is there a simpler way to do these conversions?
I would prefer a solution in C#, but solutions in other languages that can be adapted to C# / .NET are also welcome. Thanks!
If you do not want to keep the Unicode name table in memory, prepare a text file with fixed-length records, so that the Unicode value multiplied by the maximum name length gives the offset of that character's name; such a file will not be more than a few megabytes. If you want a more compact layout, store a table of offsets to the names (indexed by Unicode value) at the start of the file and pack the names after it. You do have to prepare such a file, but that is not difficult.
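Alternatively, a minimal sketch of the in-memory lookup approach from the question, assuming a local copy of UnicodeData.txt (semicolon-separated fields: codepoint;name;...); range entries such as "<control>" and characters outside the BMP are skipped here for simplicity, and the names in the snippet are illustrative:

using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

static Dictionary<string, char> LoadNameToChar(string pathToUnicodeData)
{
    var nameToChar = new Dictionary<string, char>();
    foreach (string line in File.ReadLines(pathToUnicodeData))
    {
        string[] fields = line.Split(';');
        int codepoint = int.Parse(fields[0], NumberStyles.HexNumber);
        string name = fields[1];
        if (codepoint <= 0xFFFF && !name.StartsWith("<"))  // skip "<control>", range markers, etc.
            nameToChar[name] = (char)codepoint;
    }
    return nameToChar;
}

// The reverse direction (char -> name) is the same data keyed the other way around.
// var map = LoadNameToChar("UnicodeData.txt");
// map["GREEK SMALL LETTER ALPHA"] == 'α'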
Possible Duplicate:
To which character encoding (Unicode version) set does a char object correspond?
I'm a little afraid to ask this, as I'm sure it's been asked before, but I can't find it. It's probably something obvious, but I've never studied encoding before.
int Convert(char c)
{
    return (int)c;
}
What encoding is produced by that method? I thought it might be ASCII (at least for values < 128), but running the code below produced... smiley faces as the first characters? What? Definitely not ASCII...
for (int i = 0; i < 128; i++)
    Console.WriteLine(i + ": " + (char)i);
C# char uses the UTF-16 encoding. The language specification, 1.3 Types and variables, says:
Character and string processing in C# uses Unicode encoding. The char type represents a UTF-16 code unit, and the string type represents a sequence of UTF-16 code units.
UTF-16 overlaps with ASCII in that the character codes in the ASCII range 0-127 mean the same thing in UTF-16 as in ASCII. The smiley faces in your program's output are presumably how your console interprets the non-printable characters in the range 0-31.
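A quick way to see both points in code (the numeric values come from the Unicode tables; how the control character renders depends on your console font):

Console.WriteLine((int)'A');  // 65, the same value as in ASCII
Console.WriteLine((int)'α');  // 945 (U+03B1), outside the ASCII range
Console.WriteLine((char)1);   // a non-printable control character; many console fonts draw it as a smiley face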
Each char is a UTF-16 code unit. However, you should use the proper Encoding class to ensure that the Unicode is normalized. See
C# and UTF-16 characters
Possible Duplicate:
.NET String to byte Array C#
How do I convert a String to a byte[] array and vice versa? I need strings to be stored in some binary storage. Please show an example in both directions. And one more thing: each string may be bigger than 90 KB.
If you want to use UTF-8 encoding:
using System.Text;

// string to byte[]
byte[] bytes = Encoding.UTF8.GetBytes(someString);
// byte[] to string
string anotherString = Encoding.UTF8.GetString(bytes);
Before you march off and use one of the examples someone's already given, you should be aware that there is, in general, no unique mapping between a string and a sequence of bytes. How the string is mapped to binary (and vice versa) is determined by the encoding that you use. Joel Spolsky wrote an awesome article on this subject.
When decoding binary to get a string, you need to use the same encoding as was used to produce the binary in the first place, otherwise you'll run into problems.
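For example (a sketch; the exact mangled output depends on the encodings involved):

using System;
using System.Text;

string original = "питер";
byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);

// Decoding with the same encoding round-trips correctly.
Console.WriteLine(Encoding.UTF8.GetString(utf8Bytes));                       // питер

// Decoding with a different encoding produces gibberish (mojibake).
Console.WriteLine(Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes));  // Ð¿Ð¸ÑÐµÑ...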
Use the Encoding class.
How do I get a consistent byte representation of strings in C# without manually specifying an encoding?