This question already has answers here:
Finding out Unicode character name in .Net
(7 answers)
Closed 9 years ago.
I need functions to convert between a character (e.g. 'α') and its full Unicode name (e.g. "GREEK SMALL LETTER ALPHA") in both directions.
The solution I came up with is to perform a lookup in the official Unicode Character Database file available online (http://www.unicode.org/Public/6.2.0/ucd/UnicodeData.txt), or rather in a cached local copy of it, possibly converted to a suitable collection beforehand to improve lookup performance.
Is there a simpler way to do these conversions?
I would prefer a solution in C#, but solutions in other languages that can be adapted to C# / .NET are also welcome. Thanks!
If you do not want to keep the Unicode name table in memory, just prepare a text file in which the names are stored as fixed-length records, so that the code point multiplied by the maximum name length gives the offset of that character's name; such a file is only a few megabytes. If you want a more compact layout, put a table of offsets at the start of the file, indexed by code point, with each entry pointing to a variable-length name, and enjoy a more compact name table. You do have to prepare such a file first, but that is not difficult.
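For comparison, a minimal sketch of the simpler in-memory variant the question already describes -- parsing UnicodeData.txt into two dictionaries so the lookup works in both directions (the file path is illustrative):

using System;
using System.Collections.Generic;
using System.IO;

// Parse a cached copy of UnicodeData.txt (semicolon-separated: field 0 is the hex code
// point, field 1 the character name) into two dictionaries for lookups in both directions.
var nameByCodePoint = new Dictionary<int, string>();
var codePointByName = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

foreach (var line in File.ReadLines("UnicodeData.txt"))
{
    var fields = line.Split(';');
    int codePoint = int.Parse(fields[0], System.Globalization.NumberStyles.HexNumber);
    string name = fields[1];
    if (name.StartsWith("<")) continue;   // skip placeholders such as <control> and range markers
    nameByCodePoint[codePoint] = name;
    codePointByName[name] = codePoint;
}

Console.WriteLine(nameByCodePoint['α']);                                // GREEK SMALL LETTER ALPHA
Console.WriteLine((char)codePointByName["GREEK SMALL LETTER ALPHA"]);   // α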
This question already has answers here:
How to know the size of the string in bytes?
(4 answers)
Closed 5 years ago.
I have a string like this:
string a1 = "{`name`:`санкт_петербург`,`shortName`:`питер`,`hideByDefault`:false}";
a1.Length shows that the string length is 68, which is not true: Cyrillic symbols are twice as big (because of UTF-16 encoding, I presume), therefore the real length of this string is 87.
I need to either get the number of Cyrillic symbols in the string or get the real string length in some other way.
From the MSDN:
The .NET Framework uses the UTF-16 encoding (represented by the UnicodeEncoding class) to represent characters and strings
So a1.Length is in UTF-16 code units (see What's the difference between a character, a code point, a glyph and a grapheme?). Cyrillic characters, being in the BMP (Basic Multilingual Plane), each use a single code unit (so a single char). Many emoji, for example, use TWO code units (two char, 4 bytes!)... they aren't in the BMP. See for example https://ideone.com/ASDORp.
If you want the size in bytes as .NET stores the string (UTF-16), then a1.Length * 2 clearly is the answer :-) If you want to know how many bytes it would take in UTF-8 (a very common encoding, NOT used internally by .NET, but heavily used on the web, in XML, ...), use Encoding.UTF8.GetByteCount(a1).
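A small sketch of the counts discussed above, using the string from the question:

using System;
using System.Text;

string a1 = "{`name`:`санкт_петербург`,`shortName`:`питер`,`hideByDefault`:false}";
Console.WriteLine(a1.Length);                      // 68  -- UTF-16 code units (char values)
Console.WriteLine(a1.Length * 2);                  // 136 -- bytes as stored in memory (UTF-16)
Console.WriteLine(Encoding.UTF8.GetByteCount(a1)); // 87  -- bytes if encoded as UTF-8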
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
Can I have an algorithm for how to solve this problem?
Write a program in which the user gives a string as input; increment every letter of the string using its ASCII value and print the output, in a C# console application.
(Like if user enters Abcd it will print Bcde.)
Strings in .NET are sequences of UTF-16 code units. (This is also true for Java, JavaScript, HTML, XML, XSL, the Windows API, …) The .NET datatype for a string is String and the .NET datatype for a UTF-16 code unit is Char. In C#, you can use them or their keyword aliases string and char.
UTF-16 is one of several encodings of the Unicode character set. It encodes each Unicode codepoint in one or two code units. When two code units are used, they are called a surrogate pair and appear in the order high surrogate then low surrogate.
ASCII is a subset of Unicode. The Unicode characters in common with ASCII are exactly U+0000 to U+007F. UTF-16 encodes each of them as one code unit: '\u0000' to '\u007F'. ASCII itself has a single encoding, which encodes each of them in one byte: 0x00 to 0x7F.
The problem statement refers to "string", "alphabet", "ASCII", "C#" and "console". Each of these has to be understood.
Some people, unfortunately, use "ASCII" to mean "character code". In the context of "C#" "string", character code means either Unicode codepoint or UTF-16 code unit. You could make a simplifying assumption that the input characters do not require UTF-16 surrogate pairs and even that the range is U+0000 to U+007F.
"Alphabet" is both a mathematical and a linguistical term. In mathematics (and computer science), an "alphabet" is a set of token things, often given an ordering. In linguistics, an "alphabet" is a sequence of basic letters used in the writing system for a particular language. It is often defined by a "language academy" and may or may not include all forms of letters used in the language. For example, the accepted English alphabet does not include letters with any diacritics or accents or ligatures even though they are used in English words. Also, in a writing system, some or all letters can have uppercase, lowercase, title case or other forms. For this problem, with ASCII being mentioned and ASCII containing the Basic Latin letters in uppercase and lowercase, you could define two technical alphabets A-Z and a-z, ordered by their UTF-16 code unit values.
Since you want to increment the character code for each character in your alphabet, you have to decide what happens when the result is no longer in your alphabet. (Is the result even a valid character? There is, after all, a last possible character, or several of them given the distinct UTF-16 code unit ranges.) You might consider wrapping around to the first character in your alphabet: Z->A, z->a.
The final thing is "console". A console has a setting for character encoding. (Go chcp to find out what yours is.) Your program will read from and write to the console. When you write to it, it uses a font to generate an image of the characters received. If everything lines up, great. If not, you can ask questions about it. Bottom line: When the program reads and writes, the input/output functions do the encoding conversion. So, within your program, for String and Char, the encoding is UTF-16.
Now, a String is a sequence of Char so you can use several decomposition techniques including foreach:
var output = new System.Text.StringBuilder();
foreach (Char c in inputString)
{
    if (Char.IsSurrogate(c)) throw new ArgumentOutOfRangeException();
    if (c > '\u007f') throw new ArgumentOutOfRangeException();
    // Increment within the alphabets defined above, wrapping Z->A and z->a;
    // characters that are not letters are passed through unchanged.
    if (c == 'Z') output.Append('A');
    else if (c == 'z') output.Append('a');
    else output.Append(Char.IsLetter(c) ? (Char)(c + 1) : c);
}
Console.WriteLine(output);
One easy and comprehensible way to do it:
// Note: Encoding.ASCII only works for ASCII input; anything outside ASCII becomes '?' before the increment.
byte[] bytes = System.Text.Encoding.ASCII.GetBytes("abcd".ToCharArray());
for (int i = 0; i < bytes.Length; i++)
{
    bytes[i]++;   // shift each byte value up by one
}
Console.WriteLine(System.Text.Encoding.ASCII.GetString(bytes));   // prints "bcde"
A possible solution would be to transform the string into a character array and then iterate over that array.
For each element in the array, increment the value by one and cast back to the character type. (Basically this works because a char is just a 16-bit numeric value under the hood, so it can be incremented like an integer and cast back to char, and the console then renders that value as a glyph.)
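A minimal sketch of that approach (no wrap-around handling for 'Z'/'z', to keep it short):

using System;

char[] chars = "Abcd".ToCharArray();   // work on a mutable copy of the string
for (int i = 0; i < chars.Length; i++)
{
    chars[i] = (char)(chars[i] + 1);   // shift each code unit up by one
}
Console.WriteLine(new string(chars));  // prints "Bcde"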
I hope this answered your question.
This question already has answers here:
Mysql server does not support 4-byte encoded utf8 characters
(8 answers)
Closed 7 years ago.
I have an integration with Facebook and have noticed that they send, for example, U+1F600, which is called GRINNING FACE. When I try to store this in a MySQL TEXT field I get "Server does not support 4-byte encoded", so a fast solution is to remove all these special characters from the string.
The question is: how? I know about U+1F600, but I suspect there could be more of them.
Consider switching to MySQL utf8mb4 encoding... https://mathiasbynens.be/notes/mysql-utf8mb4
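If switching to utf8mb4 is not an option and you really do want to strip these characters before storing, here is a minimal sketch (the method name is illustrative): everything that MySQL's 3-byte utf8 charset rejects lies outside the BMP, and in a .NET string those characters are exactly the ones encoded as surrogate pairs.

using System.Text;

// Removes every character outside the Basic Multilingual Plane (e.g. U+1F600),
// i.e. exactly the characters MySQL's 3-byte utf8 charset cannot store.
static string RemoveNonBmpCharacters(string input)
{
    var sb = new StringBuilder(input.Length);
    foreach (char c in input)
    {
        if (!char.IsSurrogate(c))   // surrogate pairs encode code points above U+FFFF
            sb.Append(c);
    }
    return sb.ToString();
}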
This question already has answers here:
Determine a string's encoding in C#
(10 answers)
Closed 9 years ago.
I have a string read as UTF-8 (not from a file, so I can't check for a BOM).
The problem is that sometimes the original text was produced in another encoding but was then converted to UTF-8, so the string is unreadable, sort of gibberish.
Is it possible to detect that this string is not actually UTF-8?
Thanks!
No. They're just bytes. You could try to guess, if you wanted, by trying different conversions and seeing whether there are valid dictionary words, etc., but in a theoretical sense it's impossible without knowing something about the data itself, i.e. knowing that it never uses certain characters, or always uses certain characters, or that it contains mostly words found in a given dictionary, etc. It might look like gibberish to a person, but the computer has no way of quantifying "gibberish".
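If you can get at the raw bytes rather than the already-decoded string, one "try a conversion and see" check is at least mechanical: a strict UTF-8 decoder can tell you whether the bytes are well-formed UTF-8, though not whether well-formed bytes were actually meant as UTF-8. A minimal sketch:

using System.Text;

static bool IsWellFormedUtf8(byte[] bytes)
{
    try
    {
        // A strict UTF8Encoding throws on any invalid byte sequence instead of
        // silently substituting U+FFFD replacement characters.
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true)
            .GetString(bytes);
        return true;
    }
    catch (DecoderFallbackException)
    {
        return false;
    }
}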
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
Can somebody please point out some important aspects I should be aware of while handling Unicode strings in C#?
Keep in mind that C# strings are sequences of Char, i.e. UTF-16 code units. They are not Unicode code points. Some Unicode code points require two Chars (a surrogate pair), and you should not split strings between those Chars.
In addition, Unicode code points may combine to form a single language 'character' -- for instance, a 'u' Char followed by a combining umlaut (diaeresis) Char. So you can't split strings between arbitrary code points either.
Basically, it's a mess of issues, where any given issue may in practice only affect languages you don't know.
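A small sketch of both pitfalls, using System.Globalization.StringInfo to walk 'characters' as a reader sees them (text elements) rather than as Chars:

using System;
using System.Globalization;

string s = "u\u0308";                         // 'u' followed by a combining diaeresis: renders as ü
Console.WriteLine(s.Length);                  // 2 -- two UTF-16 code units (Chars)

var elements = StringInfo.GetTextElementEnumerator(s);
while (elements.MoveNext())
    Console.WriteLine((string)elements.Current);  // a single text element: "ü"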
C# (and .NET in general) handles Unicode strings transparently, and you won't have to do anything special unless your application needs to read/write files with specific encodings. In those cases, you can convert managed strings to byte arrays of the encoding of your choice by using the Encoding classes in the System.Text namespace.
System.String already handles Unicode internally, so you are covered there. Best practice would be to use System.Text.Encoding.UTF8 when reading and writing files. It's more than just reading/writing files, however; anything that streams data out, including network connections, is going to depend upon the encoding. If you're using WCF, it's going to default to UTF-8 for most of the bindings (in fact most don't allow ASCII at all).
UTF-8 is a good choice because, while it still supports the entire Unicode character set, it is byte-for-byte identical to ASCII for the ASCII range. Thus naive applications that don't support Unicode have some chance of reading/writing your application's data. Those applications will only begin to fail when you start using extended characters.
System.Text.Encoding.Unicode will write UTF-16, which is a minimum of two bytes per character, making it both larger and fully incompatible with ASCII. And System.Text.Encoding.UTF32, as you can guess, is larger still. I'm not sure of the real-world use case of UTF-16 and UTF-32, but perhaps they perform better when you have large numbers of extended characters. That's just a theory, but if it is true, then Japanese/Chinese developers making a product that will be used primarily in those languages might find UTF-16/32 a better choice.
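A quick sketch of the size differences described above, for a pure-ASCII sample string:

using System;
using System.Text;

string text = "hello";
Console.WriteLine(Encoding.UTF8.GetByteCount(text));    // 5  -- identical to ASCII, one byte per character
Console.WriteLine(Encoding.Unicode.GetByteCount(text)); // 10 -- UTF-16, two bytes per character here
Console.WriteLine(Encoding.UTF32.GetByteCount(text));   // 20 -- four bytes per character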
Only think about encoding when reading and writing streams. Use TextReaders and TextWriters to read and write text in different encodings. Always use UTF-8 if you have a choice.
Don't get confused by languages and cultures - that's a completely separate issue from Unicode.
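For example, a small sketch of choosing the encoding at the stream boundary (the file name is illustrative):

using System;
using System.IO;
using System.Text;

// Inside the program the string is always UTF-16; the encoding only matters
// at the boundary, when the text is written to or read from the stream.
using (TextWriter writer = new StreamWriter("sample.txt", false, Encoding.UTF8))
{
    writer.WriteLine("Grüße, мир");
}

using (TextReader reader = new StreamReader("sample.txt", Encoding.UTF8))
{
    string text = reader.ReadToEnd();
    Console.WriteLine(text);
}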
.NET has relatively good i18n support. You don't really need to think about Unicode that much, as all .NET strings and built-in string functions do the right thing with Unicode. The only thing to bear in mind is that most of the string functions, for example DateTime.ToString(), use the thread's culture by default, which in turn defaults to the Windows culture. You can specify a different culture for formatting, either on the current thread or on each method call.
The only time Unicode is an issue is when encoding/decoding strings to and from bytes.
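A small sketch of the culture point: the same value formatted with the thread's default culture and with explicitly chosen ones.

using System;
using System.Globalization;

DateTime now = DateTime.Now;
Console.WriteLine(now.ToString("d"));                                      // thread's CurrentCulture
Console.WriteLine(now.ToString("d", CultureInfo.GetCultureInfo("de-DE"))); // German formatting (dd.MM.yyyy)
Console.WriteLine(now.ToString("d", CultureInfo.InvariantCulture));        // culture-independent (MM/dd/yyyy)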
As mentioned, .NET strings handle Unicode transparently. Besides file I/O, the other consideration is the database layer. SQL Server, for instance, distinguishes between VARCHAR (non-Unicode) and NVARCHAR (which handles Unicode). You also need to pay attention to stored procedure parameters.
More details can be found on this thread:
http://discuss.joelonsoftware.com/default.asp?dotnet.12.189999.12
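A small sketch of the NVARCHAR point above, assuming an open SqlConnection and an illustrative table and column:

using System.Data;
using System.Data.SqlClient;

// An NVarChar parameter keeps the Unicode string intact; a VarChar parameter would be
// converted to the database code page and could mangle characters outside it.
static void InsertName(SqlConnection connection, string name)
{
    using (var cmd = new SqlCommand("INSERT INTO People (Name) VALUES (@name)", connection))
    {
        cmd.Parameters.Add("@name", SqlDbType.NVarChar, 100).Value = name;
        cmd.ExecuteNonQuery();
    }
}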