Crystal Reports Improperly Encoding Hyphens On Export to Word - c#

I'm having a problem with a generated Word document (coming from the Crystal Reports engine via a C#/.NET application). Initially the hyphens are visible, but if the text is copied and pasted with the "Keep Text Only" or "remove formatting" option, the hyphen character gets changed to a blank space (" ").
I'm quite sure this is an issue with the encoding of the character; it is probably encoded as a soft hyphen. Is there any way to resolve this via the Crystal Reports engine?
I have already checked and confirmed that the source text is an actual hyphen character in the database.

It seems that the common ASCII hyphen, U+002D HYPHEN-MINUS in Unicode, gets converted to a code that is treated as Nonbreaking Hyphen in Word. A comment says that in “Show All” mode in Word, it looks like a hyphen, but longer. This means that it looks like an en dash “–”. Internally, it is byte 1E hexadecimal, so we could say that it is the control character U+001E. But it is unaffected by the use of Alt+X. And if you copy text containing it and paste it as plain text, it gets replaced by U+0020 SPACE, so it’s really treated as a special code and not as a character.
It is not at all the same as NON-BREAKING HYPHEN U+2011 in Unicode; instead, it is Microsoft’s own idea of handling a situation where you want a hyphen to appear but don’t want Word to break a string into two lines after a hyphen (e.g., in the string “F-1”, where such a break would look ridiculous).
So you could try to find out how this happens in the report engine and prevent it. But it may be something more complicated than just replacing “-” by the byte 1E.
If you need to deal with the issue in Word, you can use Find and Replace, where the special characters menu contains “Nonbreaking Hyphen”. You could replace it by the common hyphen “-”, losing the non-breakability.
You could alternatively replace it by NON-BREAKING HYPHEN U+2011, which would preserve that property. But this might cause problems if the text is transferred to other programs, or due to font problems (not all fonts contain that character). The font problem can be tricky: Word automatically switches to another font when needed and does not tell you about this, so your text might contain characters in different fonts, which may cause problems of many kinds (such as uneven line spacing). Moreover, when U+2011 is present, it may look different from the common ASCII hyphen; it more or less should. Typographically, if you use U+2011, your normal (breaking) hyphens should be U+2010 HYPHEN.
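If the exported text passes through your own C# code as a string, the replacement described above could be sketched roughly as follows. This assumes the special hyphen shows up as the character U+001E in the string, as it does internally in Word; how Crystal Reports actually writes it into the exported file is not confirmed here. HyphenFixer and NormalizeWordHyphens are hypothetical names used only for illustration.

```csharp
using System;

public static class HyphenFixer
{
    // Word's internal code for its nonbreaking hyphen.
    // Assumption: it surfaces as U+001E in the string being processed.
    private const char WordNonbreakingHyphen = '\u001E';

    // Replaces Word's special code with a real character:
    // U+002D HYPHEN-MINUS by default, or U+2011 NON-BREAKING HYPHEN
    // when keepNonBreaking is true.
    public static string NormalizeWordHyphens(string text, bool keepNonBreaking = false)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        char replacement = keepNonBreaking ? '\u2011' : '-';
        return text.Replace(WordNonbreakingHyphen, replacement);
    }
}
```

Calling HyphenFixer.NormalizeWordHyphens(text) gives plain hyphens; passing true as the second argument keeps the non-breaking behaviour at the cost of the font caveats mentioned above.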

Related

Char.IsControl method does not recognize some characters as control

I noticed that C# 'Char.IsControl' method doesn't recognize some characters as control. For example, the following code outputs false for both values:
char pilcrow = '\u00B6';
char softHyphen = '\u00AD';
Console.Write("{0},{1}",char.IsControl(pilcrow), char.IsControl(softHyphen)); // -> 'false,false'
Is this an expected behavior? I need to escape such characters in my code.
Those aren't control characters. One is the pilcrow sign ¶, which belongs to the Punctuation, Other [Po] category; the other is the soft hyphen, a non-visible formatting character that affects how text gets hyphenated.
There's nothing special about them, in fact you probably use the soft hyphen when writing a paragraph in Word and want to control hyphenation of some words. Word uses ¶ as the paragraph mark - a visualization of a paragraph's end. It doesn't affect formatting, it's just the common way to denote the end of paragraph. In that respect it's no different than ², ³, §, ¶, ¤, ¦, °, ±, ½, ¬ (just holding Right Alt and hitting keys)
.NET strings use Unicode so there's no need to escape these characters. You can just type them directly.
There's no problem with printing either - those characters are used in document processing after all. The soft hyphen controls how the UI or the print engine lays out text during rendering to the screen or paper.
If someone doesn't want those characters to be printed, a simple string.Replace will do the job (string.Remove works on index ranges, not on specific characters). Re­moving the hyphen can affect how text is printed though, with long words moving to the next line. I added that hyphen to Removing in the previous sentence to force hyphenation. Without it, Removing would have moved to the next line.
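As a minimal sketch of stripping soft hyphens before output, something like the following could be used. SoftHyphenDemo and StripSoftHyphens are hypothetical names; the only framework calls involved are char.IsControl and string.Replace.

```csharp
using System;

public static class SoftHyphenDemo
{
    public const char SoftHyphen = '\u00AD';

    // Neither the pilcrow nor the soft hyphen is a control character,
    // so char.IsControl returns false for both.
    public static bool IsEitherControl()
    {
        return char.IsControl('\u00B6') || char.IsControl(SoftHyphen);
    }

    // Removes every soft hyphen; all other characters are left untouched.
    public static string StripSoftHyphens(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        return text.Replace(SoftHyphen.ToString(), string.Empty);
    }
}
```

Keep in mind that text stripped this way loses its hyphenation hints, so long words may wrap differently.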

Split Urdu words based on nonexistent space

I have an Urdu word "لاعلم" and more similar words. How can I split the word so that I get "لا" and "علم" separately in an array? I have tried converting the words to Unicode characters, but I can't detect the break between "لا" and "علم".
English words can be easily separated based on spaces, but I am stuck on separating Urdu words, where there are no spaces.
There is no space because it's a single word meaning "ignorant." As a matter of fact, "لا" and "علم" separated wouldn't mean anything.
Space is inserted in Urdu (and Arabic script) for a practical need to demarcate words when the font would automatically ligature it with adjoining characters. The only way one can undo the ligature is by inserting a superfluous space between characters. Technically, the ZERO WIDTH NON-JOINER (U+200C) is precisely for this purpose but human beings are slow to learn and space is easy to insert.
There are some characters that don't join with following letters, for example, "ا" wouldn't join with any following character but can with a preceding character like "ل" to form the ligature "لا." You can use this list of characters (same rules for Arabic) and write a custom tokenizer that ends a word after "Right Joining" characters, ZWNJ or a space.
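A rough sketch of such a tokenizer is shown below. It assumes a small, illustrative subset of right-joining letters rather than the full list, and UrduSplitter is a hypothetical name. Note that it splits at visual joining breaks, not at real word boundaries, so ordinary words containing these letters get split as well.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class UrduSplitter
{
    // Illustrative subset of right-joining letters (they never connect to
    // the following letter): alef, alef with madda, dal, ddal, thal,
    // reh, rreh, zain, jeh, waw. The full list is longer.
    private static readonly HashSet<char> RightJoining = new HashSet<char>
    {
        '\u0627', '\u0622', '\u062F', '\u0688', '\u0630',
        '\u0631', '\u0691', '\u0632', '\u0698', '\u0648'
    };

    private const char Zwnj = '\u200C'; // ZERO WIDTH NON-JOINER

    // Ends a segment after a right-joining letter, a ZWNJ, or whitespace.
    // Separators themselves are dropped from the output.
    public static List<string> Split(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        var parts = new List<string>();
        var current = new StringBuilder();
        foreach (char c in text)
        {
            bool isSeparator = c == Zwnj || char.IsWhiteSpace(c);
            if (!isSeparator) current.Append(c);
            if ((isSeparator || RightJoining.Contains(c)) && current.Length > 0)
            {
                parts.Add(current.ToString());
                current.Clear();
            }
        }
        if (current.Length > 0) parts.Add(current.ToString());
        return parts;
    }
}
```

For the word in the question, UrduSplitter.Split("لاعلم") ends the first segment after the alef and yields the two pieces "لا" and "علم".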

How to identify zero-width character?

Visual Studio 2015 found an unexpected character in my code (error CS1056)
How can I identify what the character is? It's a zero-width character so I can't see it. I'd like to know exactly what it is so I can work out where it comes from and how to fix it with a find-and-replace (I have many similar errors).
Here's an example. There's a zero-width character between x and y in the quote below:
x​y
It would be helpful just to tell me the name of the character in my example, but I'd also like to know generally how to identify characters myself.
I have a little bit of Javascript embedded within my explanation of Unicode which allows you to see the Unicode characters you copy/paste into a textbox. Your example looks like this:
Here you can see that the character is U+200B. Just searching for that will normally lead you to http://www.fileformat.info, in this case this page which can give you details of the character.
If you have the characters yourself within an application, Char.GetUnicodeCategory is your friend. (Oddly enough, there's no Char.GetUnicodeCategory(int) for non-BMP characters as far as I can see...)
According to a similar question, "Remove zero-width space characters from a JavaScript string":
I'd hit ctrl+f (or ctrl+h) and turn on Regexp option, then search (or search-replace) for:
[\u200B-\u200D\uFEFF]
I've just tried your example and successfully replaced that zero-width space with "X" mark.
Just please note that this range covers only a few specific characters as explained in that post, not all invisible characters.
edit - thanks to this page I've found a better expression that seems nicely supported in the "find/replace" when Regexp option is turned on:
\p{Cf}
which seems to match invisible characters. It successfully hit the one in your example, though I'm not exactly sure if it covers all you'd need. It may be worth playing with the whole {C} class, or searching for whitespace|nonprintable plus a negative match for the {Z} class (or {Zs}).
Aha, use this website http://www.fileformat.info/info/unicode/char/search.htm?q=%E2%80%8B&preview=entity
Are you looking for Unicode character U+200B: ZERO WIDTH SPACE?
http://www.fileformat.info/info/unicode/char/200b/index.htm
You can ask the built-in Unicode table:
var category = char.GetUnicodeCategory(s[1]);
The specific character in your example is in the Format category and here is what MSDN has to say about it:
Format character that affects the layout of text or the operation of text processes, but is not normally rendered. Signified by the Unicode designation "Cf" (other, format). The value is 15.
To get the character code, simply extract it:
char c = s[1];
int codepoint = (int)c; // gives you 0x200B
The Unicode code point 0x200B is known as "zero width space".
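Putting these pieces together, a small helper along the following lines could list the code point and category of every character in a string, and strip format characters with the \p{Cf} expression mentioned earlier. CharInspector, Describe and StripFormatCharacters are hypothetical names; the sketch only looks at UTF-16 code units, so characters outside the BMP would show up as two surrogates.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class CharInspector
{
    // Returns one line per UTF-16 code unit, e.g. "U+200B Format".
    public static List<string> Describe(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        var lines = new List<string>();
        foreach (char c in text)
        {
            UnicodeCategory category = char.GetUnicodeCategory(c);
            lines.Add(string.Format("U+{0:X4} {1}", (int)c, category));
        }
        return lines;
    }

    // Removes every character in the Unicode "Cf" (other, format) category.
    public static string StripFormatCharacters(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        return Regex.Replace(text, @"\p{Cf}", string.Empty);
    }
}
```

Running Describe on the string from the question makes the hidden character visible in the output, and StripFormatCharacters removes it, along with any other Cf characters that may have been intentional.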

clean up word pasted apostrophe and other special characters

I've looked around and have not found a solid solution to this. Is there a way to replace Word-pasted apostrophes and other special characters coming from a text editor (RadEditor)? When I attempt to send them in an email, these characters are replaced with a question mark (?). I would prefer not to manually replace every possible special character, unless there is a known list somewhere.

Invisible Character

Does anyone know if there is any invisible character in Unicode strings other than the space? In Windows 98, for example, there were some tricks using ALT+some integer (in fact bugs: http://forums.techarena.in/customize-desktop/1121437.htm).
Is it possible to programmatically add some such characters that are not shown by any editor?
They are normally called Control Characters:
The control characters U+0000–U+001F and U+007F come from ASCII. Additionally, U+0080–U+009F were used in conjunction with ISO 8859 character sets (among others). They are specified in ISO 6429 and often referred to as C0 and C1 control codes respectively.
Most of these characters play no explicit role in Unicode text handling. The characters U+0000, U+0009 (HT), U+000A (LF), U+000D (CR), and U+0085 (NEL, next line) are commonly used in text processing as formatting characters.
Unicode provides a set of different space characters that might be used for steganography
http://en.wikipedia.org/wiki/Whitespace_character#Unicode
http://en.wikipedia.org/wiki/Space_%28punctuation%29#Spaces_in_Unicode
"somestring" + new string(' ',20);
