I read Wikipedia, but I don't understand: is extended ASCII still just ASCII, and is it available on any computer that would run my console application?
Also, if I understand correctly, the only way I can write an ASCII char in VB or C# is by using its Unicode code. Is that right?
Thank you
ASCII only covers the characters with value 0-127, and those are the same on all computers. (Well, almost, although this is mostly a matter of glyphs rather than semantics.)
Extended ASCII is a term for various single-byte code pages that assign various characters to the range 128-255. There is no single "extended ASCII" set of characters.
In C# and VB.NET, all strings are Unicode, so by default there's no need to worry about this - whether or not a character can be displayed in a console app is a matter of the fonts being used, not a limitation of any specific single-byte code page.
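For example, here's a minimal sketch of writing a non-ASCII character from a C# console app - whether the glyph actually renders still depends on the console's font and output encoding:

using System;
using System.Text;

class Program
{
    static void Main()
    {
        // Ask the console for UTF-8 output; rendering still depends on the font.
        Console.OutputEncoding = Encoding.UTF8;

        Console.WriteLine("£");      // the character itself
        Console.WriteLine('\u00A3'); // the same character via its Unicode escape
    }
}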
As others have said, true ASCII is always the lower 7 bits of each byte. Before the advent (and ubiquity) of Unicode standards, various extensions to the ASCII character set that utilized the eighth bit were released. The most common in the Windows world is Windows code page 1252.
If you're looking to use this encoding in .NET, you can get it like this:
Encoding windows1252 = Encoding.GetEncoding("windows-1252");
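One caveat: this works out of the box on .NET Framework, but as far as I know, on .NET Core and later the legacy code pages live in the System.Text.Encoding.CodePages package and have to be registered first. A rough sketch:

using System;
using System.Text;

class Demo
{
    static void Main()
    {
        // Needed on .NET Core / .NET 5+ only (System.Text.Encoding.CodePages package);
        // .NET Framework ships with the legacy code pages built in.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding windows1252 = Encoding.GetEncoding("windows-1252");
        byte[] bytes = windows1252.GetBytes("£100");     // £ is a single byte (0xA3) in CP1252
        Console.WriteLine(BitConverter.ToString(bytes)); // A3-31-30-30
    }
}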
As Wikipedia says, ASCII is only 0-127. "Extended ASCII" is a misnomer that should be avoided; it is used loosely to mean "some other character set based on ASCII which only uses single bytes" (meaning not multibyte like UTF-8). Sometimes the term means the 128-255 code points of one specific character set - but again, it's vague and you shouldn't count on it meaning anything specific.
The use of the term is sometimes criticized, because it can be mistakenly interpreted that the ASCII standard has been updated to include more than 128 characters or that the term unambiguously identifies a single encoding, both of which are untrue.
Source: http://en.wikipedia.org/wiki/Extended_ASCII
Related
I'm reading a book which talks about text encoding in .NET:
There are two categories of text encoding in .NET:
• Those that map Unicode characters to another character set
• Those that use standard Unicode encoding schemes
The first category contains legacy encodings such as IBM's EBCDIC and 8-bit character sets with extended characters in the upper-128 region that were popular prior to Unicode (identified by a code page). In the second category are UTF-8, UTF-16, and UTF-32.
I'm confused about the first category, the code page part. I have read some questions on Stack Overflow, but none of them ask quite what I'm about to ask. My question is:
Why do some vendors need to map Unicode characters to another character set? From my understanding, Unicode can cover the characters of almost every language in the world, so why reinvent the wheel and map Unicode characters to another character set? For example, line feed in Unicode is U+000A - why would you want to map it to some other character? Just stick to the Unicode standard, and you can use binary code to represent every kind of character.
Take the following example:
string testfile1 = Path.Combine(HttpRuntime.AppDomainAppPath, "folder", "test1.txt");
if (!System.IO.File.Exists(testfile1))
{
System.IO.File.WriteAllText(testfile1, "£100", System.Text.Encoding.ASCII);
}
string testfile2 = Path.Combine(HttpRuntime.AppDomainAppPath, "folder", "test2.txt");
if (!System.IO.File.Exists(testfile2))
{
System.IO.File.WriteAllText(testfile2, "£100", System.Text.Encoding.UTF8);
}
Note the encoding. The first outputs ?100. The second outputs £100.
I know the encoding is different, but can somebody explain why ASCII encoding can't write the £ sign?
ASCII doesn't include the "£" character. That is, there is no byte value (nor multi-byte value - those don't exist in ASCII) that denotes that symbol, so the encoder emits a "?" to tell you so. UTF-8, on the other hand, does include it.
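You can see the substitution directly - a quick sketch (by default Encoding.ASCII replaces anything it can't encode with '?'):

byte[] bytes = System.Text.Encoding.ASCII.GetBytes("£100");
Console.WriteLine(System.Text.Encoding.ASCII.GetString(bytes)); // ?100 - the £ was encoded as 0x3F ('?')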
See here for a list of all the printable characters in ASCII.
If you must use ASCII, consider using "GBP" as mentioned here for Pound sterling. (Also might be relevant: Extended ASCII.)
Dealing with ASCII and certain characters depends largely on what code page you're using. £ isn't a character that is required or used universally within the Latin alphabet, so it didn't appear in the standard ASCII set.
Look at this article or this one on code pages to see how the character limitation was resolved and for an idea as to why it won't show up everywhere.
As Hans pointed out, ASCII was designed for Americans using only code points 0-127; the rest of the English-speaking world can mostly live with that, unless they try to use obscure symbols like £ with code points outside the range 0-127. I presume you live in the UK and aim only at customers from the UK or Western Europe. Don't use Encoding.ASCII but Encoding.Default, which would be code page 1252 in the UK (not in Turkey, of course). You get real ASCII for every character in the ASCII range 0-127, but you can also use characters in the range 128-255, where the pound symbol lives. But note: if someone tries to read the file assuming it is encoded in UTF-8, the £ sign will garble the content, since its byte is not a valid UTF-8 sequence. This typically shows up as a weird replacement glyph like �.
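To illustrate that last point, a rough sketch - note I'm using Encoding.GetEncoding(1252) explicitly rather than Encoding.Default, since on newer .NET versions Encoding.Default is UTF-8 rather than the machine's ANSI code page:

using System;
using System.IO;
using System.Text;

class Demo
{
    static void Main()
    {
        var cp1252 = Encoding.GetEncoding(1252);
        File.WriteAllText("pounds.txt", "£100", cp1252); // £ is written as the single byte 0xA3

        // 0xA3 is not a valid UTF-8 sequence, so a UTF-8 reader
        // decodes it as the replacement character.
        Console.WriteLine(File.ReadAllText("pounds.txt", Encoding.UTF8)); // �100
    }
}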
I am trying to determine the implications of character encoding for a software system I am planning, and I found something odd while doing a test.
To my knowledge, C# internally uses UTF-16, which (I thought) encompasses every Unicode code point using two 16-bit fields. So I wanted to make some character literals, and I intentionally chose 𝛃 and 얤 because the former is from the SMP and the latter is from the BMP. The results are:
char ch1 = '얤'; // No problem
char ch2 = '𝛃'; // Compilation error "Too many characters in character literal"
What's going on?
A corollary of this question: if I have the string "얤𝛃얤", it is displayed correctly in a MessageBox; however, when I convert it to a char[] using ToCharArray, I get an array with four elements rather than three. Also, String.Length is reported as four rather than three.
Am I missing something here?
MSDN says that the char type can represent a 16-bit Unicode character (thus only characters from the BMP).
If you use a character outside the BMP (in UTF-16: a surrogate pair, 2×16 bits), the compiler treats it as two characters.
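A small sketch of what that surrogate pair looks like from code - char.ConvertFromUtf32 builds the two-char string for you (U+1D6C3 is the 𝛃 from the question):

string beta = char.ConvertFromUtf32(0x1D6C3);     // "𝛃" as a surrogate pair
Console.WriteLine(beta.Length);                   // 2 - two chars, one code point
Console.WriteLine(char.IsHighSurrogate(beta[0])); // True
Console.WriteLine(char.IsLowSurrogate(beta[1]));  // True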
Your source file may not be saved in UTF-8 (which is recommended when using special characters in the source), so the compiler may actually see a sequence of bytes that confuses it. You can verify that by opening your source file in a hex editor - the byte(s) you'll see in place of your character will likely be different.
If it's not already on, you can turn on that setting in Tools->Options->Documents in Visual Studio (I use 2008) - the option is Save documents as Unicode when data cannot be saved in codepage.
Typically, it's better to specify special characters using an escape sequence.
This MSDN article describes how to use \uxxxx sequences to specify the Unicode character code you want. This blog entry lists all the various C# escape sequences - the reason I'm including it is that it mentions \xnnn. Avoid that format: it's a variable-length version of \u and it can cause issues in some situations (not in yours, though).
The MSDN article also points out why the character assignment fails: the code point for the character in question is above 0xFFFF, which is outside the range of the char type.
As for the string part of the question, the answer is that the SMP character is represented as two char values. This SO question includes some code showing how to get the code points out of a string; it involves the use of StringInfo.GetTextElementEnumerator.
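Putting both halves of the question together, a sketch: the \U escape takes the full 32-bit code point, so it works in string literals where the char literal fails, and the loop below walks code points rather than chars:

string s = "얤\U0001D6C3얤";
Console.WriteLine(s.Length); // 4 - UTF-16 code units, not code points

for (int i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
{
    Console.WriteLine($"U+{char.ConvertToUtf32(s, i):X4}");
}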
I'm looking for a way to count "characters" that are formed by more than one char, but I have found no solution online!
For example, I want to count the string "வாழைப்பழம". It actually consists of 6 Tamil characters, but it is 9 characters long when we find the length the normal way. I am wondering whether Tamil is the only script that causes this problem, and whether there is a solution. I'm currently looking for a solution in C#.
Thank you in advance =)
Use StringInfo.LengthInTextElements:
using System.Globalization; // StringInfo lives here

var text = "வாழைப்பழம";
Console.WriteLine(text.Length);                               // 9 (UTF-16 code units)
Console.WriteLine(new StringInfo(text).LengthInTextElements); // 6 (graphemes)
The explanation for this behaviour can be found in the documentation of String.Length:
The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.
A minor nitpick: strings in .NET use UTF-16, not UTF-8.
When you're talking about the length of a string, there are several different things you could mean:
Length in bytes. This is the old C way of looking at things, usually.
Length in Unicode code points. This gets you closer to modern times and should be how string lengths are treated, except it isn't.
Length in UTF-8/UTF-16 code units. This is the most common interpretation, deriving from 1. Certain characters take more than one code unit in those encodings which complicates things if you don't expect it.
Count of visible “characters” (graphemes). This is usually what people mean when they say characters or length of a string.
In your case your confusion stems from the difference between 4. and 3. 3. is what C# uses, 4. is what you expect. Complex scripts such as Tamil use ligatures and diacritics. Ligatures are contractions of two or more adjacent characters into a single glyph – in your case ழை is a ligature of ழ and ை – the latter of which changes the appearance of the former; வா is also such a ligature. Diacritics are ornaments around a letter, e.g. the accent in à or the dot above ப்.
The two cases I mentioned both result in a single grapheme (what you perceive as a single character), yet each needs two actual chars. So you end up with three more code points in the string than graphemes.
One thing to note: For your case the distinction between 2. and 3. is irrelevant, but generally you should keep it in mind.
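A sketch showing all four counts on the string from the question (string.EnumerateRunes needs .NET Core 3.0 or later; the rest is available everywhere):

using System;
using System.Globalization;
using System.Linq;
using System.Text;

class Demo
{
    static void Main()
    {
        string s = "வாழைப்பழம";
        Console.WriteLine(Encoding.UTF8.GetBytes(s).Length);       // 1. bytes in UTF-8: 27
        Console.WriteLine(s.EnumerateRunes().Count());             // 2. code points: 9
        Console.WriteLine(s.Length);                               // 3. UTF-16 code units: 9
        Console.WriteLine(new StringInfo(s).LengthInTextElements); // 4. graphemes: 6
    }
}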
I have a requirement to produce text files with ASCII encoding. I have a database full of Greek, French, and German characters with umlauts and accents. Is this even possible?
string reportString = report.makeReport();
Dictionary<string, string> replaceCharacters = new Dictionary<string, string>();
byte[] encodedReport = Encoding.ASCII.GetBytes(reportString);
Response.BufferOutput = false;
Response.ContentType = "text/plain";
Response.AddHeader("Content-Disposition", "attachment;filename=" + reportName + ".txt");
Response.OutputStream.Write(encodedReport, 0, encodedReport.Length);
Response.End();
When I get the reportString back the characters are represented faithfully. When I save the text file I have ? in place of the special characters.
As I understand it, the ASCII standard is for American English only, and something like UTF-8 would be for an international audience. Is this correct?
I'm going to make the statement that if the requirement is ASCII encoding, we can't have the accents and umlauts represented correctly.
Or am I way off and doing/saying something stupid?
You cannot represent accents and umlauts in an ASCII encoded file simply because these characters are not defined in the standard ASCII charset.
Before Unicode this was handled by "code pages", you can think of a code page as a mapping between Unicode characters and the 256 values that can fit into a single byte (obviously, in every code page most of the Unicode characters are missing).
The original ASCII code page includes only English letters - but it's unlikely someone really wants the original 7-bit code page; people tend to call any 8-bit character set "ASCII".
The Western European code page known as Latin-1 is ISO-8859-1 or Windows-1252 (the first is the ISO standard, the second is the closest code page supported by Windows).
To support characters not in Latin-1, you need to encode using different code pages - for example (see the sketch after this list):
874 — Thai
932 — Japanese
936 — Chinese (simplified) (PRC, Singapore)
949 — Korean
950 — Chinese (traditional) (Taiwan, Hong Kong)
1250 — Latin (Central European languages)
1251 — Cyrillic
1252 — Latin (Western European languages)
1253 — Greek
1254 — Turkish
1255 — Hebrew
1256 — Arabic
1257 — Latin (Baltic languages)
1258 — Vietnamese
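.NET can produce any of these encodings by code page number - a quick sketch (on .NET Core and later this needs the System.Text.Encoding.CodePages package registered first, as far as I know):

// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); // .NET Core / .NET 5+ only
var greek = System.Text.Encoding.GetEncoding(1253); // Windows-1253, Greek
byte[] bytes = greek.GetBytes("αβγ");               // one byte per character
Console.WriteLine(bytes.Length);                    // 3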
UTF-8 is something completely different: it encodes the entire Unicode character set using a variable number of bytes per character. Digits and English letters are encoded the same as in ASCII (and Windows-1252); most other languages are encoded at 2 to 4 bytes per character.
UTF-8 is mostly compatible with ASCII systems because English is encoded the same as ASCII and there are no embedded nulls in the strings.
Converting between .NET strings (UTF-16LE) and other encodings is done by the System.Text.Encoding class.
IMPORTANT NOTE: the most important thing is that the system on the receiving end uses the same code page as the system on the sending end - otherwise you will get gibberish.
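A sketch of exactly that failure mode - the same bytes decoded with two different code pages (Cyrillic 1251 is just an arbitrary wrong choice here):

var cp1252 = System.Text.Encoding.GetEncoding(1252);
var cp1251 = System.Text.Encoding.GetEncoding(1251);

byte[] sent = cp1252.GetBytes("£100");     // sender encodes with Windows-1252
Console.WriteLine(cp1252.GetString(sent)); // £100 - receiver uses the right code page
Console.WriteLine(cp1251.GetString(sent)); // Ј100 - same bytes, wrong code page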
The ASCII character set only contains A-Z in upper and lower case, digits, and some punctuation. No Greek characters, no umlauts, no accents.
You can use a character set from the group that is sometimes referred to as "extended ASCII", which uses 256 characters instead of 128.
The problem with using a character set other than ASCII is that you have to use the correct one, i.e. the one that the receiving party is expecting, or it will fail to interpret any of the extended characters correctly.
You can use Encoding.GetEncoding(...) to create an extended encoding. See the reference for the Encoding class for a list of possible encodings.
You are correct.
Pure US ASCII is a 7-bit encoding, featuring English characters only.
You need a different encoding to capture characters from other alphabets. UTF-8 is a good choice.
UTF-8 is backward compatible with ASCII, so if you encode your files as UTF-8, then ASCII clients can read whatever is in their character set, and Unicode clients can read all the extended characters.
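That backward compatibility is easy to demonstrate - pure ASCII text produces byte-for-byte identical output in both encodings:

using System;
using System.Linq;
using System.Text;

byte[] ascii = Encoding.ASCII.GetBytes("Hello");
byte[] utf8  = Encoding.UTF8.GetBytes("Hello");
Console.WriteLine(ascii.SequenceEqual(utf8)); // True - identical bytes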
There's no way to get all the accents you want in ASCII; some accented characters (like ü) are, however, available in the "extended ASCII" (8-bit) character sets.
Various of the encodings mentioned by other answers can be loosely described as extended ASCII.
When your users are asking for ASCII encoding, they are probably asking for one of these.
A statement like "if the requirement is ASCII encoding we can't have the accents and umlauts represented correctly" risks sounding pedantic to a non-technical user. An alternative is to get a sample of what they want (probably either the ANSI or OEM code page of their PC), determine the appropriate code page, and specify that.
The above is only partially correct. While it's true that you can't encode those characters in ASCII, you can still represent them. These transliterations exist because some typewriters and early computers couldn't handle those characters:
Ä=Ae
ä=ae
ö=oe
Ö=Oe
ü=ue
Ü=Ue
ß=sz
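A minimal sketch of applying such a transliteration table before ASCII-encoding - the table here is just the German subset listed above; a real one would be much larger:

using System;
using System.Collections.Generic;
using System.Text;

class Demo
{
    static readonly Dictionary<string, string> Translit = new Dictionary<string, string>
    {
        ["Ä"] = "Ae", ["ä"] = "ae", ["Ö"] = "Oe", ["ö"] = "oe",
        ["Ü"] = "Ue", ["ü"] = "ue", ["ß"] = "sz",
    };

    static void Main()
    {
        string text = "Müßiggang";
        foreach (var pair in Translit)
            text = text.Replace(pair.Key, pair.Value);

        byte[] bytes = Encoding.ASCII.GetBytes(text);
        Console.WriteLine(Encoding.ASCII.GetString(bytes)); // Muesziggang - no '?' fallbacks
    }
}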
Edit:
andyraddatz has already written code that replaces lots of Unicode characters with ASCII representations. They might not be correct for some languages/cultures, but at least you won't have encoding errors.
https://gist.github.com/andyraddatz/e6a396fb91856174d4e3f1bf2e10951c