String.length() vs national letters - c#

So i have a program that works on text and i need to get the length of string.
BUT if in my word i have a national letter the output of length method is not correct. It gets additional +1 for each national letter, so it returns 6 from "qwerty", but 7 if i use "e with a little tail" instead of regular 'e'.
Any ideas how could i fix that?
Also, sorry for descriptions of letters, but i think stackoverflow takes my national symbols as grammar errors and doesn't allow me to post a question :/

It tells you on the page for string.Length what to do (emphasis mine):
The Length property returns the number of Char objects in this
instance, not the number of Unicode characters. The reason is that a
Unicode character might be represented by more than one Char. Use the
System.Globalization.StringInfo class to work with each Unicode
character instead of each Char.

Related

How to Determine Unicode Characters from a UTF-16 String?

I have string that contains an odd Unicode space character, but I'm not sure what character that is. I understand that in C# a string in memory is encoded using the UTF-16 format. What is a good way to determine which Unicode characters make up the string?
This question was marked as a possible duplicate to
Determine a string's encoding in C#
It's not a duplicate of this question because I'm not asking about what the encoding is. I already know that a string in C# is encoded as UTF-16. I'm just asking for an easy way to determine what the Unicode values are in the string.
The BMP characters are up to 2 bytes in length (values 0x0000-0xffff), so there's a good bit of coverage there. Characters from the Chinese, Thai, even Mongolian alphabets are there, so if you're not an encoding expert, you might be forgiven if your code only handles BMP characters. But all the same, characters like present here http://www.fileformat.info/info/unicode/char/10330/index.htm won't be correctly handled by code that assumes it'll fit into two bytes.
Unicode seems to identify characters as numeric code points. Not all code points actually refer to characters, however, because Unicode has the concept of combining characters (which I don’t know much about). However, each Unicode string, even some invalid ones (e.g., illegal sequence of combining characters), can be thought of as a list of code points (numbers).
In the UTF-16 encoding, each code point is encoded as a 2 or 4 byte sequence. In .net, Char might roughly correspond to either a 2 byte UTF-16 sequence or half of a 4 byte UTF-16 sequence. When Char contains half of a 4 byte sequence, it is considered a “surrogate” because it only has meaning when combined with another Char which it must be kept with. To get started with inspecting your .net string, you can get .net to tell you the code points contained in the string, automatically combining surrogate pairs together if necessary. .net provides Char.ConvertToUtf32 which is described the following way:
Converts the value of a UTF-16 encoded character or surrogate pair at a specified position in a string into a Unicode code point.
The documentation for Char.ConvertToUtf32(String s, Int32 index) states that an ArgumentException is thrown for the following case:
The specified index position contains a surrogate pair, and either the first character in the pair is not a valid high surrogate or the second character in the pair is not a valid low surrogate.
Thus, you can go character by character in a string and find all of the Unicode code points with the help of Char.IsHighSurrogate() and Char.ConvertToUtf32(). When you don’t encounter a high surrogate, the current character fits in one Char and you only need to advance one Char in your string. If you do encounter a high surrogate, the character requires two Char and you need to advance by two:
static IEnumerable<int> GetCodePoints(string s)
{
for (var i = 0; i < s.Length; i += char.IsHighSurrogate(s[i]) ? 2 : 1)
{
yield return char.ConvertToUtf32(s, i);
}
}
When you say “from a UTF-16 String”, that might imply that you have read in a series of bytes formatted as UTF-16. If that is the case, you would need to convert that to a .net string before passing to the above method:
GetCodePoints(Encoding.UTF16.GetString(myUtf16Blob));
Another note: depending on how you build your String instance, it is possible that it contains an illegal sequence of Char with regards to surrogate pairs. For such strings, Char.ConvertToUtf32() will throw an exception when encountered. However, I think that Encoding.GetString() will always either return a valid string or throw an exception. So, generally, as long as your String instances are from “good” sources, you needn’t worry about Char.ConvertToUtf32() throwing (unless you pass in random values for the index offset because your offset might be in the middle of a surrogate pair).

How to identify zero-width character?

Visual Studio 2015 found an unexpected character in my code (error CS1056)
How can I identify what the character is? It's a zero-width character so I can't see it. I'd like to know exactly what it is so I can work out where it comes from and how to fix it with a find-and-replace (I have many similar errors).
Here's an example. There's a zero-width character between x and y in the quote below:
x​y
It would be helpful just to tell me the name of the character in my example, but I'd also like to know generally how to identify characters myself.
I have a little bit of Javascript embedded within my explanation of Unicode which allows you to see the Unicode characters you copy/paste into a textbox. Your example looks like this:
Here you can see that the character is U+200B. Just searching for that will normally lead you to http://www.fileformat.info, in this case this page which can give you details of the character.
If you have the characters yourself within an application, Char.GetUnicodeCategory is your friend. (Oddly enough, there's no Char.GetUnicodeCategory(int) for non-BMP characters as far as I can see...)
According to similar question: Remove zero-width space characters from a JavaScript string
I'd hit ctrl+f (or ctrl+h) and turn on Regexp option, then search (or search-replace) for:
[\u200B-\u200D\uFEFF]
I've just tried your example and successfully replaced that zero-width space with "X" mark.
Just please note that this range covers only a few specific characters as explained in that post, not all invisible characters.
edit - thanks to this page I've found a better expression that seems nicely supported in the "find/replace" when Regexp option is turned on:
\p{Cf}
which seems to matches invisible characters, it successfully hit that one in your example, though I'm not exactly sure if it covers all you'd need. It may be worth playing with whole {C}-class or searching for whitespace|nonprintable plus negative match for {Z}-class (or {Zs}) negation.
Aha, use this website http://www.fileformat.info/info/unicode/char/search.htm?q=%E2%80%8B&preview=entity
Are you looking for Unicode character U+200B: ZERO WIDTH SPACE?
http://www.fileformat.info/info/unicode/char/200b/index.htm
You can ask the built-in Unicode table:
var category = char.GetUnicodeCategory(s[1]);
The specific character in your example is in the Format category and here is what MSDN has to say about it:
Format character that affects the layout of text or the operation of text processes, but is not normally rendered. Signified by the Unicode designation "Cf" (other, format). The value is 15.
To get the character code, simply extract it:
char c = s[1];
int codepoint = (int)c; // gives you 0x200B
The unicode codepoint 0x200b is known as "zero width space".

Unicode SMP "character" in C# char [duplicate]

This question already has answers here:
C# and UTF-16 characters
(3 answers)
Closed 9 years ago.
I am trying to determine the implications of character encoding for a software system I am planning, and I found something odd while doing a test.
To my knowledge C# internally uses UTF-16 which (to my knowledge) encompasses every Unicode code point using two 16-bit fields. So I wanted to make some character literals and intentionally chose 𝛃 and 얤, because the former is from the SMP plane and the latter is from the BMP plane. The results are:
char ch1 = '얤'; // No problem
char ch2 = '𝛃'; // Compilation error "Too many characters in character literal"
What's going on?
A corollary of this question is, if I have the string "얤𝛃얤" it is displayed correctly in a MessageBox, however when I convert it to a char[] using ToCharArray I get an array with four elements rather than three. Also the String.Length is reported as four rather than three.
Am I missing something here?
MSDN says that the char type can represent Unicode 16-bit character (thus only character form BMP).
If you use a character outside BMP (in UTF-16: supplementary pair - 2x16 bit) compiler treats that as two characters.
Your source file may not be saved in UTF-8 (which is recommended when using special characters in the source), so the compiler may actually see a sequence of bytes that confuses it. You can verify that by opening your source file in a hex editor - the byte(s) you'll see in place of your character will likely be different.
If it's not already on, you can turn on that setting in Tools->Options->Documents in Visual Studio (I use 2008) - the option is Save documents as Unicode when data cannot be saved in codepage.
Typically, it's better to specify special characters using a character sequence.
This MSDN article describes how to use \uxxxx sequences to specify the Unicode character code you want. This blog entry has all the various C# escape sequences listed - the reason I'm including it is because it mentions using \xnnn - avoid using this format: it's a variable length version of \u and it can cause issues in some situations (not in yours, though).
The MSDN article points out why the character assignment is no good: the code point for the character in question is > FFFF which is outside the range for the char type.
As for the string part of the question, the answer is that the SMP character is represented as two char values. This SO question includes some code showing how to get the code points out of a string, it involves the use of StringInfo.GetTextElementEnumerator

Counting special UTF-8 character

I'm finding a way to count special character that form by more than one character but found no solution online!
For e.g. I want to count the string "வாழைப்பழம". It actually consist of 6 tamil character but its 9 character in this case when we use the normal way to find the length. I am wondering is tamil the only kind of encoding that will cause this problem and if there is a solution to this. I'm currently trying to find a solution in C#.
Thank you in advance =)
Use StringInfo.LengthInTextElements:
var text = "வாழைப்பழம";
Console.WriteLine(text.Length); // 9
Console.WriteLine(new StringInfo(text).LengthInTextElements); // 6
The explanation for this behaviour can be found in the documentation of String.Length:
The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.
A minor nitpick: strings in .NET use UTF-16, not UTF-8
When you're talking about the length of a string, there are several different things you could mean:
Length in bytes.  This is the old C way of looking at things, usually.
Length in Unicode code points.  This gets you closer to the modern times and should be the way how string lengths are treated, except it isn't.
Length in UTF-8/UTF-16 code units.  This is the most common interpretation, deriving from 1. Certain characters take more than one code unit in those encodings which complicates things if you don't expect it.
Count of visible “characters” (graphemes). This is usually what people mean when they say characters or length of a string.
In your case your confusion stems from the difference between 4. and 3. 3. is what C# uses, 4. is what you expect. Complex scripts such as Tamil use ligatures and diacritics. Ligatures are contractions of two or more adjacent characters into a single glyph – in your case ழை is a ligature of ழ and ை – the latter of which changes the appearance of the former; வா is also such a ligature. Diacritics are ornaments around a letter, e.g. the accent in à or the dot above ப்.
The two cases I mentioned both result in a single grapheme (what you perceive as a single character), yet they both need two actual characters each. So you end up with three code points more in the string.
One thing to note: For your case the distinction between 2. and 3. is irrelevant, but generally you should keep it in mind.

What is the regular expression for the following strings and would the expression change if the number rolled over?

What would be the following regular expressions for the following strings?
56AAA71064D6
56AAA7105A25
Would the regular expression change if the numbers rolled over? What I mean by this is that the above numbers happen to contain hexadecimal values and I don't know how the value changes one it reaches F. Using the first one as an example: 56AAA71064D6, if this went up to
56AAA71064F6 and then the following one would become 56AAA7106406, this would create a different regular expression because where a letter was allowed, now their is a digit, so does this make the regular expression even more difficult. Suggestions?
A manufacturer is going to enter a range of serial numbers. The problems are that different manufacturers have different formats for serial numbers (some are just numbers, some are alpha numeric, some contain extra characters like dashes, some contain hexadacimal values which makes it more difficult because I don't know how the roll over to the next serial number). The roll over issue is the biggest problem because the serial numbers are entered as a range like 5A1B - 6F12 and without knowing how the roll over, it seems to me that storing them in the database is not as easy. I was going to have the option of giving the user the option to input the pattern (expression) and storing that in the databse, but if a character or characters changes from a digit to a letter or vice versa, then the regular expression is no longer valid for certain serial numbers.
Also, the above example I gave is with just one case. There are multitude of serial numbers that would contain different expressions.
There's no single regular expression which is "the" expression to match both of those strings. Instead, there are infinitely many which will do so. Here are two options at opposite ends of the spectrum:
(56AAA71064D6)|(56AAA7105A25)
.*
The first will only match those two strings. The second will match anything. Both satisfy all the criteria you've given.
Now, if you specify more criteria, then we'd be able to give a more reasonable idea of the regular expression to provide - and that will drive the answers to the other questions. (At the moment, the only answer that makes sense is "It depends on what regex you use.")
I think you could do it this way for 12 characters. This will search for a 12 character phrase where each of the characters must be a capital (A or B or C or D or E or F or 1 or 2 or 3 or 4 or 5 or 6 or 7 or 8 or 9 or 0)
[A-F0-9]{12}
If you're wanting to include the possibility of dashes then do this.
[A-F0-9\-]{12}
Or you're wanting to include the possibility of dashes plus the 12 characters then do this. But that would pick up any 12-15 character item that fit the criteria though.
[A-F0-9\-]{12,15}
Or if it's surrounded by spaces (AAAAHHHh...SO is stripping out my spaces!!!)
[A-F0-9\-]{12}
Or if it's surrounded by tabs
\t[A-F0-9\-]{12}\t
This match a string that contains 12 hexa
[0-9A-F]{12}
Assuming these are all 12-digit hexadecimal numbers, which it looks like they are, the following regex should work:
[0-9A-Fa-f]{12}
Here I'm using a character class to say that I want any digit, OR A-F, OR a-f. As a bonus I'm allowing lowercase letters; if you don't want those just get them out of the regex.
As Jon Skeet and others have said, you really didn't provide enough information, so if you don't like this answer please understand that I was doing the best I can with what information you provided.
So, how about this:
[0-9A-F]{12}
Well it sounds like you're describing a 12 digit hexadecimal number:
^[A-F0-9]{12}$

Categories