Where does (char)int get its symbols from? - c#

Being a computer programming rookie, I was given homework involving the use of the playing card suit symbols. In the course of my research I came across an easy way to retrieve the symbols:
Console.Write((char)6);
gives you ♠
Console.Write((char)3);
gives you ♥
and so on...
However, I still don't understand what logic C# uses to retrieve those symbols. I mean, the ♠ symbol in the Unicode table is U+2660, yet I didn't use it. The ASCII table doesn't even contain these symbols.
So my question is, what is the logic behind (char)int?

For these low numbers (below 32), this is an aspect of the console rather than C#, and it comes from Code page 437 - though it won't include the ones that have other meanings that the console actually uses, such as tab, carriage return, and bell. This isn't really portable to any context where you're not running directly in a console window, and you should use e.g. 0x2660 instead, or just '\u2660'.
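For example, a minimal sketch of the portable approach (setting OutputEncoding may be needed so the console can actually render the glyphs):

using System;
using System.Text;

class Program
{
    static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8; // let the console render characters outside its code page
        Console.Write('\u2660');                // ♠ written by its Unicode code point
        Console.Write((char)0x2665);            // ♥ via an explicit cast from the code point
    }
}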

The logic behind (char)int is that char is a UTF-16 code unit, one or two of which encode a Unicode codepoint. Codepoints are naturally ordinal numbers, being an identifier for a member of a character set. They are often written in hexadecimal, and specifically for Unicode, preceded by U+, for example U+2660.
UTF-16 is a mapping between codepoints and code units. Code units, being 16 bits, can be operated on as integers. Since a char holds one code unit, you can convert a short to a char, and since the different integer types can interoperate, you can convert an int to a char.
So, your short (or int) has meaning as text only when it represents a UTF-16 code unit for a codepoint that only has one code unit. (You could also convert an int holding a whole codepoint to a string.)
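As a small sketch of both conversions mentioned above (the code points are just examples):

int codeUnit = 0x2660;                         // a UTF-16 code unit held in an integer
char spade = (char)codeUnit;                   // '♠': U+2660 fits in a single code unit
string clef = char.ConvertFromUtf32(0x1D11E);  // "𝄞": a whole codepoint that needs two code units (a surrogate pair)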
Of course, you could let the compiler figure it out for you and make it easier for your readers, too, with:
Console.Write('♥');
Also, forget ASCII. It's never the right encoding (except when it is). In case it's not clear, a string is a counted sequence of UTF-16 code units.

Related

C#'s StringInfo and TextElementEnumerator can't recognize graphemes properly

In C#, the StringInfo and TextElementEnumerator classes provide methods and properties for working with text elements.
And here, we can find the definition of the Text Element.
The .NET Framework defines a text element as a unit of text that is displayed as a single character, that is, a grapheme. A text element can be any of the following:
Yes, it says a text element is a grapheme in .NET. I also tested with some Unicode characters myself, and it really seemed true until I tested the Korean letter '가'.
As we all know, some characters as the user perceives them consist of multiple code points. Since we may face such code point sequences, I'm using StringInfo and TextElementEnumerator instead of plain String.
StringInfo and TextElementEnumerator could correctly tell whether chars were surrogate pairs. And "\u0061\u0308", a character which consists of multiple code points, was recognized as one text element just as expected. But as for "\u1100\u1161", they failed to say that it was also one text element.
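Here's a small repro of what I'm seeing (the numbers in the comments are the values I get):

using System;
using System.Globalization;

class Program
{
    static void Main()
    {
        // "a" + combining diaeresis: reported as one text element, as expected.
        Console.WriteLine(new StringInfo("\u0061\u0308").LengthInTextElements); // 1

        // Korean leading consonant + vowel: reported as two text elements,
        // even though they render as the single syllable 가.
        Console.WriteLine(new StringInfo("\u1100\u1161").LengthInTextElements); // 2
    }
}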
"\u1100" is a leading letter "ㄱ", and "\u1161" is a vowel letter "ㅏ". They can be individual characters and shown to the users just as I write here and you can see them now. But if they are used together, they are rendered as one character "가" instead of "ㄱㅏ".
There are two ways to represent the Korean character "가":
Using a single code point, U+AC00, from the Hangul Syllables block.
Using two code points, U+1100 and U+1161, from the Hangul Jamo block.
Most of the time the former is used. The latter is rarely used; to be honest, I can't imagine when it's used at all.
Anyway, the first one is just one precomposed letter, and the second is a sequence of a leading consonant and a vowel that is treated as one character. When rendered they look exactly the same, and the two forms are canonically equivalent.
Also, the following expression returns true in C#:
"\u1100\u1161".Normalize() == "\uAC00"
I wonder why Normalize() works just fine here when C# doesn't think they are one complete text element.
I thought it had something to do with my .NET version, but it turns out that's not the case. This happens in Mono too.
I tested this with ICU as well, and it could treat "\u1100\u1161" as one grapheme correctly!
I initially thought StringInfo and TextElementEnumerator could eliminate the need for ICU4C in some simple cases, so I'm very disappointed now.
Here's my question :
Am I doing something wrong here?
or
Is a text element in .NET not a user-perceived character, unlike in ICU?
The basic issue here is that per the Korean standard KS X 1026, the two jamos ㄱ and ㅏ are distinct from their combined form 가. In fact, this exact example is used in the official standard (see section 6.2).
Long story short, Microsoft attempted to follow the standard but other operating systems and applications don't necessarily do so. Hence you can get "malformed" content from other software / platforms that appears to be parsed incorrectly on Windows / in .NET, even though it is parsed "correctly" on those platforms.
You will either need to ensure your data is correctly formed in the first place (unlikely, given that the de-facto standard is to completely ignore the official standard) or you will need to use ICU (or a similar library) to deal with these cases.
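If normalization is acceptable for your data, one partial workaround (just a sketch, and nowhere near a full replacement for ICU's grapheme segmentation) is to normalize to form C before enumerating text elements, so that conjoining jamo collapse into precomposed syllables first:

using System.Globalization;
using System.Text;

static class TextElements
{
    // Counts text elements after NFC normalization,
    // so "\u1100\u1161" becomes "\uAC00" before being segmented.
    public static int CountNormalized(string s)
    {
        return new StringInfo(s.Normalize(NormalizationForm.FormC)).LengthInTextElements;
    }
}

With that, CountNormalized("\u1100\u1161") returns 1, since the pair normalizes to "\uAC00" as the question already observed.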

Huffman Coding. Decode from binary file

This is a Huffman coding task.
What I'm doing:
Read a string from a file, prepare the Huffman structure, encode the string to bits, and save those bits to a binary file.
What I need:
Decode the string from the binary file, but encoding and decoding must be independent (e.g. decoding must still work after the app has been closed and reopened).
I'm saving to the binary file like this:
A:000;l:001;a:10; :110;m:010;k:011;o:1110;t:1111;
00000110110010101100111110111110;
And I need to read that back and decode it. So I think I need to build the Huffman structure again from that data, but how?
I see these options:
The encoder and decoder always use the same tree; it never changes, so the decoder already knows that 000 means A.
The tree is stored before the message in binary format. The encoder and decoder have to agree on the exact format for storing the tree; there are many ways to do this. In the simplest case there would be the number of encoded characters and, for every character, its ASCII code, the length of its Huffman code, and the code itself.
The tree is built on the fly using adaptive Huffman coding, but that does not seem to be your case.
Since you know A:000;l:001;a:10; :110;m:010;k:011;o:1110;t:1111;, you can try to traverse the string 00000110110010101100111110111110 one character at a time, accumulating bits, and have a switch statement for each of the characters and their codes. Whenever the accumulated bits match a case, for example 000, you output A and start accumulating again. This is one way I can see of getting back to the string; I am sure there is a better way out there.
Hope this helps.
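To make that concrete, here is a minimal C# sketch (the header parsing assumes the exact A:000;l:001;... format shown in the question; a dictionary lookup stands in for the switch, but the idea is the same):

using System;
using System.Collections.Generic;

class HuffmanTableDecode
{
    static void Main()
    {
        string header = "A:000;l:001;a:10; :110;m:010;k:011;o:1110;t:1111;";
        string bits = "00000110110010101100111110111110";

        // Rebuild the code -> character table from the header.
        var codes = new Dictionary<string, char>();
        foreach (string entry in header.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Each entry is "<char>:<code>"; the character may be a space.
            codes[entry.Substring(2)] = entry[0];
        }

        // Accumulate bits until they match a known code, then emit the character.
        string current = "", output = "";
        foreach (char bit in bits)
        {
            current += bit;
            if (codes.TryGetValue(current, out char c))
            {
                output += c;
                current = "";
            }
        }

        Console.WriteLine(output); // prints the originally encoded text
    }
}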
Assuming "Adaptive Huffman", it's not usual to decide yourself what code to use for each character.
The usual sequence is
Analyze the text to be encoded. That means counting the occurrences of each character. In the English language 'e' would be more frequent than 'x', 'y' or 'z' for example.
Sort the array of char/occurrence pairs in ascending order of occurrence count.
Build a binary tree: combine the two lowest counts, add their counts together, and make a new tree node. Ignore those two and look for the next pair of lowest occurrences (which might include the node you just made). This continues until you end up with a tree with a single root. (There are lots of helpful images of this.) I can explain this in more detailed steps if necessary.
From the root of the tree you "walk" to each leaf. For each "left" add a '0' and for each "right" a '1'. When you reach the leaf, you have the code for that letter. If your text has many e's, 'e' will have the shortest code and no other code will start with the same sequence of bits. This is the idea: the most frequent characters get the shortest codes, and that gives the biggest savings.
Now, by walking the tree you have the code (varying lengths) for each character.
Encode your text to a string of bits.
To decode you use the same tree. You say it must work "after closing the app", so you will have to store the tree in some form along with the encoded data.
In your comment you mention the problem with having varying-length codes. There is no ambiguity, because no code is a prefix of any other code (the characters sit only at the leaves of the tree). In an extreme case, if you had more e's than all other characters combined, the tree would be very lopsided: 'e' would be encoded as '1' and all other letters would have codes of varying lengths, beginning with 0.
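As a rough sketch of the counting, merging, and tree-walking steps in C# (the type and method names here are just placeholders, not a reference implementation):

using System.Collections.Generic;
using System.Linq;

class Node
{
    public char? Symbol;   // set only on leaf nodes
    public int Count;
    public Node Left, Right;
}

static class HuffmanSketch
{
    // Count occurrences, then repeatedly merge the two lowest-count nodes into a parent.
    public static Node BuildTree(string text)
    {
        var nodes = text.GroupBy(c => c)
                        .Select(g => new Node { Symbol = g.Key, Count = g.Count() })
                        .ToList();
        while (nodes.Count > 1)
        {
            nodes.Sort((a, b) => a.Count.CompareTo(b.Count));
            var parent = new Node { Count = nodes[0].Count + nodes[1].Count, Left = nodes[0], Right = nodes[1] };
            nodes.RemoveRange(0, 2);
            nodes.Add(parent);
        }
        return nodes[0];
    }

    // Walk the tree: '0' for left, '1' for right; leaves carry the symbols.
    public static void CollectCodes(Node node, string prefix, IDictionary<char, string> codes)
    {
        if (node.Symbol.HasValue)
        {
            codes[node.Symbol.Value] = prefix.Length > 0 ? prefix : "0"; // single-symbol edge case
            return;
        }
        CollectCodes(node.Left, prefix + "0", codes);
        CollectCodes(node.Right, prefix + "1", codes);
    }
}

Encoding is then just appending codes[ch] for each character of the text, and decoding walks the same tree bit by bit (left on '0', right on '1'), emitting the symbol whenever a leaf is reached.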

Why do some character literals cause Syntax Errors in Java?

In the latest edition of the JavaSpecialists newsletter, the author mentions a piece of code that will not compile in Java:
public class A1 {
    Character aChar = '\u000d';
}
Try to compile it, and you will get an error such as:
A1.java:2: illegal line end in character literal
Character aChar = '\u000d';
^
Why does an equivalent piece of C# code not show this problem?
public class CharacterFixture
{
    char aChar = '\u000d';
}
Am I missing anything?
EDIT: My original intention with this question was to ask how the C# compiler got Unicode file parsing right (if it did) and why Java should still stick with the incorrect (if so) parsing.
EDIT: Also, I want my original question title to be restored. Why such heavy editing? I strongly suspect it heavily modified my intentions.
Java's compiler translates \uxxxx escape sequences as one of the very first steps, even before the tokenizer gets a crack at the code. By the time it actually starts tokenizing, there are no \uxxxx sequences anymore; they're already turned into the chars they represent, so to the compiler your Java example looks the same as if you'd actually typed a carriage return in there somehow. It does this in order to provide a way to use Unicode within the source, regardless of the source file's encoding. Even ASCII text can still fully represent Unicode chars if necessary (at the cost of readability), and since it's done so early, you can have them almost anywhere in the code. (You could say \u0063\u006c\u0061\u0073\u0073\u0020\u0053\u0074\u0075\u0066\u0066\u0020\u007b\u007d, and the compiler would read it as class Stuff {}, if you wanted to be annoying or torture yourself.)
C# doesn't do that. \uxxxx is translated later, with the rest of the program, and is only valid in certain types of tokens (namely, identifiers and string/char literals). This means it can't be used in certain places where it can be used in Java. cl\u0061ss is not a keyword, for example.

C# string equality operator returns false, but I'm pretty sure it should be true... What?

I'm trying to write a unit test for a piece of code that generates a large amount of text. I've run into an issue where the "expected" and "actual" strings appear to be equal, but Assert.AreEqual throws, and both the equality operator and Equals() return false. The result of GetHashCode() is different for both values as well.
However, putting both strings into text files and comparing with DiffMerge tells me they're the same.
Additionally, using Encoding.ASCII.GetBytes() on both values and then using SequenceEqual to compare the resulting byte arrays returns true.
The values are 34KB each, so I'll hold off putting them here for now. Any ideas? I'm completely stumped.
Loop through char by char and find which one it thinks is different. The fact that writing to disk and comparing the text works tells me that the difference is probably either carriage-return / line-feed related (somehow normalized during the save), or involves some non-ASCII character (maybe a high-Unicode whitespace) that is stripped when saving as ASCII.
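Something like this helper (the name is just a placeholder; drop it into your test class, with expected and actual being your two strings) will point at the first offending character and show its code point:

// Walks both strings and reports the first index where they differ.
static void ReportFirstDifference(string expected, string actual)
{
    int len = Math.Min(expected.Length, actual.Length);
    for (int i = 0; i < len; i++)
    {
        if (expected[i] != actual[i])
        {
            Console.WriteLine("First difference at index {0}: U+{1:X4} vs U+{2:X4}",
                              i, (int)expected[i], (int)actual[i]);
            return;
        }
    }
    if (expected.Length != actual.Length)
        Console.WriteLine("Strings match up to index {0} but lengths differ: {1} vs {2}",
                          len, expected.Length, actual.Length);
}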
What are the encoding types of the files you are feeding into DiffMerge? If you have characters that don't match the encoding type, then there is a chance they won't show up in DiffMerge.
The string that is being generated and the expected result probably have different character encodings. When you are doing ASCII.GetBytes, you are converting everything into ASCII. So, your strings are being converted to ASCII and are equal in terms of the ASCII character set. However, they can still be unequal in other character sets (and still "look" the same to you).
Also, try doing a string.Compare(str1, str2, StringComparison.XXXX) and let us know what happens.
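For example (expected and actual being the two strings from the question, and Ordinal comparing raw char values):

int ordinal = string.Compare(expected, actual, StringComparison.Ordinal);
Console.WriteLine(ordinal == 0 ? "equal" : "not equal under ordinal comparison");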

How do I find out if my string contains the "micro" Unicode character?

I have an Excel spreadsheet with lab data that looks like this:
µg/L (ppb)
I want to test for the presence of the Greek letter "µ" and if found I need to do something special.
Normally, I would write something like this:
if ( cell.StartsWith(matchSequence) ) {
//.. <-- universal symbol for "magic" :)
}
I know there is an Encoding API in the Framework, but should I use it for just this one edge-case or just copy the Greek micro symbol from the character map?
How would I test for the presence of this Unicode character? The character map seems like a "cheap" fix that will bite me later (I work for a multinational company).
I want to do something that is maintainable and not just some crazy math-voodoo conversion that only works for this edge case.
I guess I'm asking for best practice advice here.
Thanks!
You need to work out the Unicode character you're interested in; then you can represent it in code with an escape sequence.
For example, µ is U+00B5, so you just need:
if (text.Contains("\u00b5"))
You can find out the Unicode value from charmap or from the Unicode code charts.
The Unicode code point for micro µ is U+00B5 and is different from the "Greek letter mu" µ, which is at U+03BC. So you can use "\u00b5" to find it, and possibly also look for "\u03bc" as well - they look the same, so whoever created the spreadsheet could have used the wrong one!
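So a check that covers both look-alikes could be as simple as this (cell being the value from the question):

// U+00B5 MICRO SIGN and U+03BC GREEK SMALL LETTER MU render identically,
// so check for either one.
if (cell.Contains("\u00b5") || cell.Contains("\u03bc"))
{
    // handle the µ unit prefix
}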
You can create a char from the numeric equivalent shown to you in the Character Map (displayed as U+0050 for 'P'). To do this, simply check Contains:
string value = "Example P";
if (value.Contains(Char.ConvertFromUtf32(0x0050)))
{
    // found 'P'
}
C# source files are usually encoded in UTF-8. All strings and string literals in C# (and other .NET languages) are encoded in UTF-16, so you can safely copy the micro character from the character map.
You can also use its integer value as a Unicode literal, like 0x1234.
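For example, all of these refer to the same char value (the variable names are just for illustration):

char micro = (char)0x00B5;                 // same value as 'µ' and '\u00B5'
bool hasMicro = cell.IndexOf(micro) >= 0;  // cell is the spreadsheet value from the question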
