I noticed that C# 'Char.IsControl' method doesn't recognize some characters as control. For example, the following code outputs false for both values:
char pilcrow = '\u00B6';
char softHyphen = '\u00AD';
Console.Write("{0},{1}", char.IsControl(pilcrow), char.IsControl(softHyphen)); // -> 'False,False'
Is this an expected behavior? I need to escape such characters in my code.
Those aren't control characters. One is the pilcrow sign ¶, which belongs to the Punctuation, Other [Po] category; the other is the soft hyphen, a non-visible formatting character that affects how text gets hyphenated.
There's nothing special about them. In fact, you probably use the soft hyphen when writing a paragraph in Word and want to control the hyphenation of some words. Word uses ¶ as the paragraph mark, a visualization of a paragraph's end. It doesn't affect formatting; it's just the common way to denote the end of a paragraph. In that respect it's no different from ², ³, §, ¶, ¤, ¦, °, ±, ½, ¬ (typed just by holding Right Alt and hitting keys).
.NET strings use Unicode so there's no need to escape these characters. You can just type them directly.
There's no problem with printing either - those characters are used in document processing after all. The soft hyphen controls how the UI or the print engine lays out text during rendering to the screen or paper.
If someone doesn't want those characters to be printed, a simple string.Replace will do the job (string.Remove deletes by position, not by character). Removing the hyphen can affect how text is printed though, with long words moving to the next line. I added that hyphen to Removing in the previous sentence to force hyphenation. Without it, Removing would have moved to the next line.
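A minimal sketch of the two points above: neither character is reported as a control character, and a soft hyphen can be stripped with string.Replace. The class and method names (SoftHyphenDemo, StripSoftHyphens) are made up for illustration. The category names printed can differ between .NET versions, so no expected output is shown for them.

```csharp
using System;
using System.Globalization;

public static class SoftHyphenDemo
{
    // Hypothetical helper: removes every U+00AD SOFT HYPHEN from the input.
    public static string StripSoftHyphens(string text)
    {
        return text.Replace("\u00AD", string.Empty);
    }

    public static void Main()
    {
        char pilcrow = '\u00B6';
        char softHyphen = '\u00AD';

        // Neither character is in the Control category.
        Console.WriteLine(char.IsControl(pilcrow));
        Console.WriteLine(char.IsControl(softHyphen));

        // Print whatever category the runtime's Unicode data assigns.
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory(pilcrow));
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory(softHyphen));

        // "Re" + soft hyphen + "moving" is 9 chars; 8 after stripping.
        string hyphenated = "Re\u00ADmoving";
        Console.WriteLine(StripSoftHyphens(hyphenated).Length);
    }
}
```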
I have an Urdu word "لاعلم" and more similar words. How can I split the word so that I get "لا" and "علم" separately in an array? I have tried converting the words to Unicode characters, but I can't detect the break between "لا" and "علم".
English words can be easily separated based on spaces, but I am stuck on separating Urdu words, where there are no spaces.
There is no space because it's a single word meaning "ignorant." As a matter of fact, "لا" and "علم" separated wouldn't mean anything.
Space is inserted in Urdu (and Arabic script) for a practical need to demarcate words when the font would automatically ligature it with adjoining characters. The only way one can undo the ligature is by inserting a superfluous space between characters. Technically, the ZERO WIDTH NON-JOINER (U+200C) is precisely for this purpose but human beings are slow to learn and space is easy to insert.
There are some characters that don't join with following letters. For example, "ا" wouldn't join with any following character but can join with a preceding character like "ل" to form the ligature "لا." You can use this list of characters (same rules for Arabic) and write a custom tokenizer that ends a word after "Right Joining" characters, ZWNJ or a space.
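A rough sketch of such a tokenizer, assuming a small hand-picked set of right-joining letters. The set is illustrative and incomplete, not the full Unicode joining data, and JoiningSplitter and Split are hypothetical names. It ends a segment after any right-joining letter, a space, or U+200C ZERO WIDTH NON-JOINER.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class JoiningSplitter
{
    // Assumption: partial list of letters that never join to the following letter.
    private static readonly HashSet<char> RightJoining = new HashSet<char>
    {
        '\u0622', '\u0623', '\u0625', '\u0627', // alef and its variants
        '\u062F', '\u0630', '\u0688',           // dal, thal, ddal
        '\u0631', '\u0632', '\u0691', '\u0698', // reh, zain, rreh, jeh
        '\u0648', '\u0624',                     // waw, waw with hamza
        '\u06D2'                                // yeh barree
    };

    public static List<string> Split(string text)
    {
        var parts = new List<string>();
        var current = new StringBuilder();
        foreach (char c in text)
        {
            // Spaces and ZWNJ end a segment and are not kept.
            if (c == ' ' || c == '\u200C')
            {
                Flush(parts, current);
                continue;
            }
            current.Append(c);
            // A right-joining letter cannot connect to what follows.
            if (RightJoining.Contains(c))
            {
                Flush(parts, current);
            }
        }
        Flush(parts, current);
        return parts;
    }

    private static void Flush(List<string> parts, StringBuilder current)
    {
        if (current.Length > 0)
        {
            parts.Add(current.ToString());
            current.Clear();
        }
    }

    public static void Main()
    {
        // The word from the question, written with escapes: lam, alef, ain, lam, meem.
        List<string> parts = Split("\u0644\u0627\u0639\u0644\u0645");
        Console.WriteLine(parts.Count); // 2
    }
}
```

The segments this produces are visually connected runs, not dictionary words, and a combining mark that follows a right-joining letter would land in the next segment, so real text would need more care.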
Visual Studio 2015 found an unexpected character in my code (error CS1056)
How can I identify what the character is? It's a zero-width character so I can't see it. I'd like to know exactly what it is so I can work out where it comes from and how to fix it with a find-and-replace (I have many similar errors).
Here's an example. There's a zero-width character between x and y in the quote below:
xy
It would be helpful just to tell me the name of the character in my example, but I'd also like to know generally how to identify characters myself.
I have a little bit of Javascript embedded within my explanation of Unicode which allows you to see the Unicode characters you copy/paste into a textbox. Your example looks like this:
Here you can see that the character is U+200B. Just searching for that will normally lead you to http://www.fileformat.info, in this case this page which can give you details of the character.
If you have the characters yourself within an application, Char.GetUnicodeCategory is your friend. (Oddly enough, there's no Char.GetUnicodeCategory(int) for non-BMP characters as far as I can see...)
According to a similar question, "Remove zero-width space characters from a JavaScript string":
I'd hit ctrl+f (or ctrl+h) and turn on Regexp option, then search (or search-replace) for:
[\u200B-\u200D\uFEFF]
I've just tried your example and successfully replaced that zero-width space with "X" mark.
Just please note that this range covers only a few specific characters as explained in that post, not all invisible characters.
edit - thanks to this page I've found a better expression that seems nicely supported in the "find/replace" when Regexp option is turned on:
\p{Cf}
which seems to match invisible characters. It successfully hit the one in your example, though I'm not exactly sure it covers everything you'd need. It may be worth playing with the whole {C} class, or searching for whitespace|nonprintable while excluding the {Z} class (or {Zs}).
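The same \p{Cf} pattern can also be applied from code rather than from the find/replace dialog. This is a small sketch using System.Text.RegularExpressions, with FormatCharStripper and Strip as made-up names. Note that it removes every format character, including U+200C and U+200D, which carry meaning in Arabic-script text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FormatCharStripper
{
    // \p{Cf} matches characters in the Unicode "Other, Format" category.
    private static readonly Regex FormatChars = new Regex(@"\p{Cf}");

    // Hypothetical helper: returns the text with all format characters removed.
    public static string Strip(string text)
    {
        return FormatChars.Replace(text, string.Empty);
    }

    public static void Main()
    {
        // 'x', ZERO WIDTH SPACE, 'y' as in the question's example.
        string sample = "x\u200By";
        Console.WriteLine(Strip(sample).Length); // 2
    }
}
```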
Aha, use this website http://www.fileformat.info/info/unicode/char/search.htm?q=%E2%80%8B&preview=entity
Are you looking for Unicode character U+200B: ZERO WIDTH SPACE?
http://www.fileformat.info/info/unicode/char/200b/index.htm
You can ask the built-in Unicode table:
var category = char.GetUnicodeCategory(s[1]);
The specific character in your example is in the Format category and here is what MSDN has to say about it:
Format character that affects the layout of text or the operation of text processes, but is not normally rendered. Signified by the Unicode designation "Cf" (other, format). The value is 15.
To get the character code, simply extract it:
char c = s[1];
int codepoint = (int)c; // gives you 0x200B
The Unicode code point U+200B is known as "zero width space".
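To inspect a whole string rather than one index, something along these lines could list every code point with its category. CharInspector and Describe are illustrative names. It uses CharUnicodeInfo.GetUnicodeCategory(string, int) and char.ConvertToUtf32, so surrogate pairs (non-BMP characters) are reported as one entry.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class CharInspector
{
    // Hypothetical helper: one "U+XXXX Category" line per code point.
    public static List<string> Describe(string text)
    {
        var lines = new List<string>();
        for (int i = 0; i < text.Length; i++)
        {
            bool isPair = char.IsSurrogatePair(text, i);
            // Lone surrogates are reported by their raw UTF-16 value.
            int codePoint = isPair ? char.ConvertToUtf32(text, i) : text[i];
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(text, i);
            lines.Add(string.Format("U+{0:X4} {1}", codePoint, category));
            if (isPair)
            {
                i++; // skip the low surrogate already consumed
            }
        }
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Describe("x\u200By"))
        {
            Console.WriteLine(line);
        }
    }
}
```

Pasting a suspicious snippet into a call like Describe makes hidden characters show up as an extra line between the visible ones.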
I'm having a problem with a generated Word document (coming from the Crystal Reports engine via a C#/.NET application). Initially hyphens are visible, but if the text is copied and pasted with the "keep text only" option or the "remove formatting" option, the hyphen character gets changed to a blank space " ".
I'm quite sure this is an issue with the encoding of the character; it is probably encoded as a soft hyphen. Is there any way to resolve this via the Crystal Reports engine?
I have already checked and confirmed that the source text is an actual hyphen character in the database.
It seems that the common ASCII hyphen, U+002D HYPHEN-MINUS in Unicode, gets converted to a code that is treated as Nonbreaking Hyphen in Word. A comment says that in “Show All” mode in Word, it looks like a hyphen, but longer. This means that it looks like an en dash “–”. Internally, it is byte 1E hexadecimal, so we could say that it is the control character U+001E. But it is unaffected by the use of Alt+X. And if you copy text containing it and paste it as plain text, it gets replaced by U+0020 SPACE, so it’s really treated as a special code and not as a character.
It is not at all the same as NON-BREAKING HYPHEN U+2011 in Unicode; instead, it is Microsoft’s own idea of handling a situation where you want a hyphen to appear but don’t want Word to break a string into two lines after a hyphen (e.g., in the string “F-1”, where such a break would look ridiculous).
So you could try to find out how this happens in the report engine and prevent it. But it may be something more complicated than just replacing “-” by the byte 1E.
If you need to deal with the issue in Word, you can use Find and Replace, where the special characters menu contains “Nonbreaking Hyphen”. You could replace it by the common hyphen “-”, losing the non-breakability.
You could alternatively replace it by NON-BREAKING HYPHEN U+2011, which would preserve that property. But this might cause problems if transferred to other programs, or due to font problems (not all fonts contain that character). The font problem can be tricky: Word automatically switches to another font when needed and does not inform about this, so your text might contain characters in different fonts, which may cause problems of many kinds (such as uneven line spacing). Moreover, when U+2011 is present, it may look different from the common Ascii hyphen; it more or less should. Typographically, if you use U+2011, your normal (breaking) hyphens should be U+2010 HYPHEN.
I have a text file that includes various control characters, including backspace \b and carriage return \r. For example:
100\r101\r102\r103
£\b$\b%
I'd like to 'collapse' these control characters to leave me with the text one would see in a console that understood these control characters:
103
%
I don't know what this is called. If it has a name, please share it, so I can search for it.
I guess you can simply loop over the text and handle the control characters (backspace \b and carriage return \r) the way a console would.
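One way such a loop could look, as a sketch only: keep a line buffer and a cursor position, move the cursor back on \b, move it to column 0 on \r, and overwrite in place for anything else. ControlCollapser and CollapseLine are made-up names, and the input is assumed to be a single line (split the file on \n first).

```csharp
using System;
using System.Text;

public static class ControlCollapser
{
    // Hypothetical helper: replays one line the way a simple console would.
    public static string CollapseLine(string line)
    {
        var buffer = new StringBuilder();
        int cursor = 0;
        foreach (char c in line)
        {
            if (c == '\r')
            {
                cursor = 0; // carriage return: back to the start of the line
            }
            else if (c == '\b')
            {
                if (cursor > 0) cursor--; // backspace: one column left, no erase
            }
            else
            {
                // Overwrite an existing column or extend the line.
                if (cursor < buffer.Length) buffer[cursor] = c;
                else buffer.Append(c);
                cursor++;
            }
        }
        return buffer.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(CollapseLine("100\r101\r102\r103")); // 103
        Console.WriteLine(CollapseLine("ab\bc"));              // ac
    }
}
```

Like a real console, this only overwrites: "100\r10" collapses to "100", because the trailing 0 from the first write is never erased.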
In written Arabic, characters look different depending on where they stand in a word. For example, the letter tha might look like this: ـثـ inside a word but look like this: ﺙ if it stands by itself. I have some Arabic text, for example:
string word = "والتفويض";
When I render word as a whole word it renders correctly. Now, I want to parse the string and print out each letter in the word one at a time. However, if I do this:
foreach(char c in word.ToCharArray())
{
    Debug.Print(c.ToString());
}
The char c doesn't print out the original representation of the letter as it was rendered in the context of a word, instead it prints out the same Arabic letter as if it were rendered by itself. How can I parse my string of Arabic text so that the letters returned look the same as when they were displayed as a whole word?
I'm trying to do this in c#.
There are characters in the UCS that represent particular forms of Arabic characters. However, these do not work well when moving from one context to another.
In general, if you want to indicate that a letter is joined to another when there is no such letter to join it to, you should use U+200D ZERO WIDTH JOINER at the appropriate place (before the character to place the joiner to the right, after the character to place it to the left, or one on either side).
Conversely, placing U+200C ZERO WIDTH NON-JOINER between characters will break their joining.
Just how well that works in practice will depend on the rendering engine processing the characters.
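A simplified sketch of the joiner approach, with ContextualForms and WithJoiners as illustrative names. It wraps each letter in U+200D depending only on its position in the word. It assumes every neighbouring letter would join, which is not true after right-joining letters such as alef, so a fuller version would consult the Unicode joining data before adding a joiner.

```csharp
using System;
using System.Collections.Generic;

public static class ContextualForms
{
    private const string Zwj = "\u200D"; // ZERO WIDTH JOINER

    // Hypothetical helper: one string per letter, padded with joiners
    // on the sides where the letter had a neighbour in the word.
    public static List<string> WithJoiners(string word)
    {
        var result = new List<string>();
        for (int i = 0; i < word.Length; i++)
        {
            string before = i > 0 ? Zwj : "";
            string after = i < word.Length - 1 ? Zwj : "";
            result.Add(before + word[i] + after);
        }
        return result;
    }

    public static void Main()
    {
        // beh, teh, theh: three letters that join on both sides.
        foreach (string piece in WithJoiners("\u0628\u062A\u062B"))
        {
            Console.WriteLine(piece);
        }
    }
}
```

Whether the pieces actually display in their initial, medial and final forms still depends on the rendering engine and font, as noted above.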