Invisible Character - C#

Does anyone know whether there are any invisible characters in Unicode strings other than the space? In Windows 98 there were some tricks using ALT + some integer (in fact bugs: http://forums.techarena.in/customize-desktop/1121437.htm).
Is it possible to programmatically add such characters so that they are not shown by any editor?

They are normally called Control Characters:
The control characters U+0000–U+001F and U+007F come from ASCII. Additionally, U+0080–U+009F were used in conjunction with ISO 8859 character sets (among others). They are specified in ISO 6429 and often referred to as C0 and C1 control codes respectively.
Most of these characters play no explicit role in Unicode text handling. The characters U+0000 (NUL), U+0009 (HT), U+000A (LF), U+000D (CR), and U+0085 (NEL) are commonly used in text processing as formatting characters.

Unicode provides a set of different space characters that might be used for steganography:
http://en.wikipedia.org/wiki/Whitespace_character#Unicode
http://en.wikipedia.org/wiki/Space_%28punctuation%29#Spaces_in_Unicode

"somestring" + new string(' ',20);

Related

Why do some vendors map Unicode characters to another character set (code page)?

I'm reading a book which talks about text encoding in .NET:
There are two categories of text encoding in .NET:
• Those that map Unicode characters to another character set
• Those that use standard Unicode encoding schemes
The first category contains legacy encodings such as IBM's EBCDIC and 8-bit character sets with extended characters in the upper-128 region that were popular prior to Unicode (identified by a code page). In the second category are UTF-8, UTF-16, and UTF-32.
I'm confused about the first category, the code page part. I have read some questions on Stack Overflow, but none of them is the same as the question I'm going to ask. My question is:
Why do some vendors need to map Unicode characters to another character set? From my understanding, Unicode can cover the characters of almost every language in the world, so why reinvent the wheel by mapping Unicode characters to another character set? For example, line feed in Unicode is U+000A; why would you want to map it to another character? Just stick to the Unicode standard and you can represent every kind of character with a binary code.
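For concreteness, a minimal sketch of both categories in .NET (assuming the IBM037/EBCDIC code page is available on your runtime; on .NET Core / .NET 5+ the legacy code pages normally have to be registered via the System.Text.Encoding.CodePages package first):

using System;
using System.Text;

class TwoCategories
{
    static void Main()
    {
        // On .NET Core / .NET 5+, uncomment to make legacy code pages available:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Category 2: a standard Unicode encoding scheme.
        byte[] utf8 = Encoding.UTF8.GetBytes("A\n");               // 41-0A

        // Category 1: a legacy code-page encoding that maps Unicode characters
        // onto another character set. Code page 37 is IBM EBCDIC (US/Canada),
        // where 'A' and LF get completely different byte values.
        byte[] ebcdic = Encoding.GetEncoding(37).GetBytes("A\n");  // C1-25

        Console.WriteLine(BitConverter.ToString(utf8));
        Console.WriteLine(BitConverter.ToString(ebcdic));
    }
}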

WriteAllText, Character Encoding, £ and ?

Take the following example:
string testfile1 = Path.Combine(HttpRuntime.AppDomainAppPath, "folder\\test1.txt");
if (!System.IO.File.Exists(testfile1))
{
    System.IO.File.WriteAllText(testfile1, "£100", System.Text.Encoding.ASCII);
}
string testfile2 = Path.Combine(HttpRuntime.AppDomainAppPath, "folder\\test2.txt");
if (!System.IO.File.Exists(testfile2))
{
    System.IO.File.WriteAllText(testfile2, "£100", System.Text.Encoding.UTF8);
}
Note the encoding. The first outputs ?100. The second outputs £100.
I know the encoding is different, but can somebody explain why ASCII encoding can't write the £ sign?
ASCII doesn't include the "£" character. That is, there is no byte value (nor a multi-byte value; those don't exist in ASCII) that denotes that symbol, so the encoder writes a "?" to tell you that. UTF-8, on the other hand, does include it.
See here for a list of all the printable characters in ASCII.
If you must use ASCII, consider using "GBP" as mentioned here for Pound sterling. (Also might be relevant: Extended ASCII.)
Dealing with certain characters in ASCII depends largely on what code page you're using. £ isn't a character that is required or used universally within the Latin alphabet, so it didn't appear in the standard ASCII set.
Look at this article or this one on code pages to see how the character limitation was resolved and for an idea as to why it won't show up everywhere.
As Hans pointed out, ASCII was designed for Americans and uses only code points 0-127; the negligible rest of the English-speaking world can live with that unless they try to use obscure symbols like £, whose code points fall outside the range 0-127. I presume you live in the UK and aim only at customers from the UK or Western Europe. Don't use Encoding.ASCII but Encoding.Default, which is code page 1252 in the UK (not in Turkey, of course). You get real ASCII for every character in the range 0-127, but you can also use characters in the range 128-255, where the pound symbol lives. Note, however, that if someone later reads the file assuming it is encoded in UTF-8, the £ sign will garble the content, because its single byte is not a valid UTF-8 sequence. This shows up as a weird replacement glyph like �.
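A small sketch of all three behaviours described above (assuming code page 1252 is available on your runtime; the byte values in the comments are what I'd expect on such a machine):

using System;
using System.Text;

class PoundSignDemo
{
    static void Main()
    {
        // ASCII has no code point for £, so the encoder substitutes '?' (0x3F).
        byte[] ascii = Encoding.ASCII.GetBytes("£100");
        Console.WriteLine(BitConverter.ToString(ascii));    // 3F-31-30-30

        // Windows-1252 (Encoding.Default on a UK machine) maps £ to the single byte 0xA3.
        byte[] cp1252 = Encoding.GetEncoding(1252).GetBytes("£100");
        Console.WriteLine(BitConverter.ToString(cp1252));   // A3-31-30-30

        // Decoding those 1252 bytes as UTF-8 fails for 0xA3, which is not a valid
        // UTF-8 sequence, so the decoder emits the replacement character U+FFFD.
        Console.WriteLine(Encoding.UTF8.GetString(cp1252)); // �100
    }
}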

How to identify zero-width character?

Visual Studio 2015 found an unexpected character in my code (error CS1056)
How can I identify what the character is? It's a zero-width character so I can't see it. I'd like to know exactly what it is so I can work out where it comes from and how to fix it with a find-and-replace (I have many similar errors).
Here's an example. There's a zero-width character between x and y in the quote below:
x​y
It would be helpful just to tell me the name of the character in my example, but I'd also like to know generally how to identify characters myself.
I have a little bit of JavaScript embedded within my explanation of Unicode that allows you to see the Unicode characters you copy/paste into a textbox. Your example looks like this:
Here you can see that the character is U+200B. Just searching for that will normally lead you to http://www.fileformat.info, in this case this page which can give you details of the character.
If you have the characters yourself within an application, Char.GetUnicodeCategory is your friend. (Oddly enough, there's no Char.GetUnicodeCategory(int) for non-BMP characters as far as I can see...)
According to similar question: Remove zero-width space characters from a JavaScript string
I'd hit Ctrl+F (or Ctrl+H), turn on the regex option, then search (or search and replace) for:
[\u200B-\u200D\uFEFF]
I've just tried your example and successfully replaced that zero-width space with an "X" mark.
Just note that this range covers only a few specific characters, as explained in that post, not all invisible characters.
Edit: thanks to this page I've found a better expression that seems to be nicely supported in find/replace when the regex option is turned on:
\p{Cf}
which seems to match invisible characters; it successfully hit the one in your example, though I'm not exactly sure it covers everything you'd need. It may be worth playing with the whole {C} class, or searching for whitespace/non-printable characters plus a negative match for the {Z} class (or {Zs}).
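If you would rather do the same cleanup in code than in the editor, a minimal C# sketch of the equivalent replacement (assuming stripping format-category characters is really what you want) could look like this:

using System;
using System.Text.RegularExpressions;

class StripInvisible
{
    static void Main()
    {
        string s = "x\u200By";

        // Same idea as the editor search above: remove format-category (Cf) characters.
        string cleaned = Regex.Replace(s, @"\p{Cf}", "");

        Console.WriteLine(s.Length);        // 3
        Console.WriteLine(cleaned.Length);  // 2 - the zero-width space is gone
    }
}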
Aha, use this website http://www.fileformat.info/info/unicode/char/search.htm?q=%E2%80%8B&preview=entity
Are you looking for Unicode character U+200B: ZERO WIDTH SPACE?
http://www.fileformat.info/info/unicode/char/200b/index.htm
You can ask the built-in Unicode table:
var category = char.GetUnicodeCategory(s[1]);
The specific character in your example is in the Format category and here is what MSDN has to say about it:
Format character that affects the layout of text or the operation of text processes, but is not normally rendered. Signified by the Unicode designation "Cf" (other, format). The value is 15.
To get the character code, simply extract it:
char c = s[1];
int codepoint = (int)c; // gives you 0x200B
The Unicode code point U+200B is known as ZERO WIDTH SPACE.
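Putting those pieces together, a short sketch that dumps the code point and category of every character in a suspect string (this treats the string as UTF-16 code units, which is fine for BMP characters such as U+200B; surrogate pairs would need extra handling):

using System;
using System.Globalization;

class IdentifyCharacters
{
    static void Main()
    {
        string s = "x\u200By";

        foreach (char c in s)
        {
            UnicodeCategory category = char.GetUnicodeCategory(c);
            Console.WriteLine($"U+{(int)c:X4}  {category}");
        }
        // Expected output:
        // U+0078  LowercaseLetter
        // U+200B  Format
        // U+0079  LowercaseLetter
    }
}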

Crystal Reports Improperly Encoding Hyphens On Export to Word

I'm having a problem with a generated Word document (coming from the Crystal Reports engine via a C# .NET application). Initially the hyphens are visible, but if the text is copied and pasted with the "keep text only" or "remove formatting" option, the hyphen character gets changed to a blank space " ".
I'm quite sure this is an issue with the encoding of the character; probably it is being encoded as a soft hyphen. Is there any way to resolve this via the Crystal Reports engine?
I have already checked and confirmed that the source text is an actual hyphen character in the database.
It seems that the common ASCII hyphen, U+002D HYPHEN-MINUS in Unicode, gets converted to a code that is treated as a Nonbreaking Hyphen in Word. A comment says that in "Show All" mode in Word, it looks like a hyphen, but longer. This means that it looks like an en dash "–". Internally, it is byte 1E hexadecimal, so we could say that it is the control character U+001E. But it is unaffected by the use of Alt+X. And if you copy text containing it and paste it as plain text, it gets replaced by U+0020 SPACE, so it's really treated as a special code and not as a character.
It is not at all the same as NON-BREAKING HYPHEN U+2011 in Unicode; instead, it is Microsoft’s own idea of handling a situation where you want a hyphen to appear but don’t want Word to break a string into two lines after a hyphen (e.g., in the string “F-1”, where such a break would look ridiculous).
So you could try to find out how this happens in the report engine and prevent it. But it may be something more complicated than just replacing "-" by the byte 1E.
If you need to deal with the issue in Word, you can use Find and Replace, where the special characters menu contains “Nonbreaking Hyphen”. You could replace it by the common hyphen “-”, losing the non-breakability.
You could alternatively replace it with NON-BREAKING HYPHEN U+2011, which would preserve that property. But this might cause problems if the text is transferred to other programs, or due to font problems (not all fonts contain that character). The font problem can be tricky: Word automatically switches to another font when needed and does not inform you about this, so your text might contain characters in different fonts, which may cause problems of many kinds (such as uneven line spacing). Moreover, when U+2011 is present, it may look different from the common ASCII hyphen; it more or less should. Typographically, if you use U+2011, your normal (breaking) hyphens should be U+2010 HYPHEN.
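If you end up post-processing the exported text in C# rather than in Word, a hedged sketch of such a replacement (assuming the exported text really does surface the code as U+001E, which may not hold for every export path) could be:

using System;

class FixWordHyphens
{
    static void Main()
    {
        // Hypothetical example string in which Word's internal nonbreaking-hyphen
        // code shows up as the control character U+001E.
        string exported = "F\u001E1";

        string breakable    = exported.Replace('\u001E', '-');      // plain hyphen, breakable
        string nonBreakable = exported.Replace('\u001E', '\u2011'); // NON-BREAKING HYPHEN, keeps the property

        Console.WriteLine(breakable);
        Console.WriteLine(nonBreakable);
    }
}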

Extended ASCII question

I read Wikipedia, but I do not understand whether extended ASCII is still just ASCII and is available on any computer that would run my console application.
Also, if I understand it correctly, I can write an ASCII char only by using its Unicode code in VB or C#.
Thank you
ASCII only covers the characters with value 0-127, and those are the same on all computers. (Well, almost, although this is mostly a matter of glyphs rather than semantics.)
Extended ASCII is a term for various single-byte code pages that assign various characters to the range 128-255. There is no single "extended ASCII" set of characters.
In C# and VB.NET, all strings are Unicode, so by default there's no need to worry about this - whether or not a character can be displayed in a console app is a matter of the fonts being used, not a limitation of any specific single-byte code page.
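If the goal is to get such characters to show up in a console application, the console's output encoding usually matters more than any "extended ASCII" code page. A minimal sketch (whether a glyph actually appears still depends on the console font):

using System;
using System.Text;

class ConsoleUnicode
{
    static void Main()
    {
        // Emit UTF-8 to the console; the font, not a code page, decides
        // whether each glyph can actually be rendered.
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine("£ € ñ");
    }
}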
As others have said, true ASCII is always the lower 7 bits of each byte. Before the advent (and ubiquity) of Unicode standards, various extensions to the ASCII character set that utilized the eighth bit were released. The most common in the Windows world is Windows code page 1252.
If you're looking to use this encoding in .NET, you can get it like this:
Encoding windows1252 = Encoding.GetEncoding("windows-1252");
As Wikipedia says, ASCII is only 0-127. "Extended ASCII" is a misnomer that should be avoided; it is used loosely to mean "some other character set based on ASCII which uses only single bytes" (meaning not multi-byte like UTF-8). Sometimes the term means the 128-255 code points of that specific character set, but again, it's vague and you shouldn't count on it meaning anything specific.
The use of the term is sometimes criticized, because it can be mistakenly interpreted that the ASCII standard has been updated to include more than 128 characters or that the term unambiguously identifies a single encoding, both of which are untrue.
Source: http://en.wikipedia.org/wiki/Extended_ASCII
