How to compare text: ANSI vs Unicode

I'm trying to find a certificate by its serial number.
The two strings look equal, but the debugger says they are not.
What am I missing here? The IDE warned me when I copied and pasted
the serial number into the text editor, but I ignored the message
(something about Unicode).
UPDATE
I copied the serial number from the certificate window.
I opened a new file in Notepad2, which defaults to ANSI, pasted the text, copied it again, and pasted it into VS. The values are not equal.
I then set the Notepad2 file to Unicode, pasted the serial number, copied it from Notepad2, and pasted it into VS. Now they are equal.
It seems to be a Unicode vs. ANSI problem.

I can't see anything wrong with how you create the string or how you compare it. That should work just fine. Strings in .NET are always Unicode (UTF-16), so the comparison itself cannot suffer from an encoding mismatch either.
Retype the serial string manually. It seems that you got some unusual characters in there when pasting it, for example non-breaking spaces instead of regular spaces.

When I copied the serial number, my selection included an "extra space". Oddly, it doesn't appear in the IDE. If I select the serial number carefully, everything is OK.
In Notepad it appears as below:
?00 c4 aa b9 b1 08 90 5d
In ANSI mode the extra character becomes visible as ? (hex 3F, a question mark); if the file is Unicode, nothing is shown in Notepad.
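One way to see what was actually pasted is to dump the code of every character that is neither a hex digit nor a space. The following is only a sketch: SerialInspector, DescribeOddChars, and NormalizeSerial are hypothetical names, and the U+200E (left-to-right mark) in the sample is an assumption about what the certificate window copied, chosen because such a character is invisible in a Unicode editor and has no ANSI equivalent.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SerialInspector
{
    // Lists every character that is neither a hex digit nor a space,
    // together with its index and UTF-16 code unit.
    public static List<string> DescribeOddChars(string s)
    {
        var found = new List<string>();
        for (int i = 0; i < s.Length; i++)
        {
            char c = s[i];
            if (!Uri.IsHexDigit(c) && c != ' ')
                found.Add($"[{i}] U+{(int)c:X4}");
        }
        return found;
    }

    // Keeps only the hex digits and upper-cases them.
    public static string NormalizeSerial(string s) =>
        new string(s.Where(c => Uri.IsHexDigit(c)).ToArray()).ToUpperInvariant();

    static void Main()
    {
        // Assumed sample: an invisible U+200E in front of the serial number.
        string pasted = "\u200E00 c4 aa b9 b1 08 90 5d";
        string typed = "00 c4 aa b9 b1 08 90 5d";

        Console.WriteLine(pasted == typed); // False
        foreach (string hit in DescribeOddChars(pasted))
            Console.WriteLine(hit);         // [0] U+200E
        Console.WriteLine(NormalizeSerial(pasted) == NormalizeSerial(typed)); // True
    }
}
```

Normalizing both sides before comparing avoids depending on how carefully the text was selected.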

As far as I know, comparison in Unicode works flawlessly when English alphanumerics are used. For other languages, there are caveats.
Will you please provide "copy-able" code, so that I can check it on my machine? (PS: I am feeling too lazy to type it..:o)
Also, note that == on two string operands compares values, because System.String overloads the operator. It falls back to reference comparison only when the operands are typed as object; in that case, use Equals() so that you compare values, not references.

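Instead of
operator==
you can use the
string.Equals(object obj)
method. Note that for operands typed as string, "==" compares contents, just as Equals does; it tests reference equality only when an operand is typed as object, which is where calling Equals makes a difference.
See MSDN for further reference.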
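The difference between the operator and the method can be checked with a small experiment. This is a sketch with hypothetical names (StringEqualityDemo and its helpers); the two strings are built at run time so that they are distinct instances, since identical literals would be interned into one object.

```csharp
using System;

static class StringEqualityDemo
{
    // Both operands are typed as string: the overloaded operator compares contents.
    public static bool OperatorOnStrings(string a, string b) => a == b;

    // Both operands are typed as object: the operator compares references.
    public static bool OperatorOnObjects(object a, object b) => a == b;

    // Equals is virtual, so it compares contents even through object references.
    public static bool EqualsOnObjects(object a, object b) => a.Equals(b);

    static void Main()
    {
        string s1 = new string('a', 3); // "aaa", first instance
        string s2 = new string('a', 3); // "aaa", second instance

        Console.WriteLine(OperatorOnStrings(s1, s2)); // True
        Console.WriteLine(OperatorOnObjects(s1, s2)); // False
        Console.WriteLine(EqualsOnObjects(s1, s2));   // True
        Console.WriteLine(string.Equals(s1, s2, StringComparison.Ordinal)); // True
    }
}
```

Passing an explicit StringComparison documents whether an ordinal or a culture-aware comparison is intended.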

Related

Weird behavior C#

Somehow I'm getting a weird result from GetString(). In my project I have this code:
byte[] arrayBytes = System.Convert.FromBase64String(n["spo_fdat"].InnerText);
string str = System.Text.Encoding.UTF8.GetString(arrayBytes);
The InnerText value and the code are at: https://dotnetfiddle.net/mMUlti
My problem is that somehow I'm getting this result in Visual Studio:
In the online compiler linked above, however, the output is as expected.
The output is sent to a printer, and the \0 characters are destroying the format.
Does anyone have a clue what is going on and what I should do or try?

It looks like for some reason every other byte in your input is null. If you strip those out you get something that looks much more plausible as printer commands (though I am no expert). Hopefully you can verify things...
To do this, all I did was add this line:
arrayBytes = arrayBytes.Where((x,i)=>i%2==0).ToArray();
The Where call receives the value (x) and the index (i). If the index modulo 2 is 0 (i.e. it is even), the byte is kept; if the index is odd, the byte is thrown away.
The output I get from this starts:
CT~~CD,~CC^~CT~
^XA~TA000~JSN^LT0^MNW^MTT^PON^PMN^LH0,0^JMA^PR2,2~SD15^JUS^LRN^CI0^XZ
^XA
^MMT
^PW607
^LL0406
There are some non-printing characters in there too that look like possible printer commands (e.g. the first character is 16, the "data link escape" character).
Edited afterthought:
The problem here is evidently one of specification: your input appears to be wrong. You need to talk to whoever generated it, find out the specification they are using, make sure their code matches that spec, and then write your code to accept that spec. With a solid specification, you should both be writing compatible code.

Try inspecting the bytes instead. You'll see that what you have encoded in the base-64 string is much closer to what Visual Studio shows to you in comparison to the output from dotnetfiddle. Consoles usually don't escape non-printables (such as \0 - the null character) whereas Visual Studio string inspector does so in attempt to provide as much value to its user as possible.
Looking at your base-64 encoded data, it looks way more like UTF-16 than UTF-8. If you decode it like so, you'll perhaps get rid of the null characters in Visual Studio inspector as well.
Regardless of that, the base-64 data don't make much sense on their own. More context about what they are meant to contain is required to figure out what the issue is.
According to the inspection by Chris, it looks like the data is UTF-8 text whose bytes were then encoded as UTF-16.
You should be able to get proper results with the following:
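using System;
using System.Linq;
using System.Text;

string xml = "..."; // placeholder: your base-64 input goes here
byte[] arrayBytes = Convert.FromBase64String(xml);
// Undo the outer UTF-16 layer: every pair of bytes becomes one char.
string utf16 = Encoding.Unicode.GetString(arrayBytes);
// Each char now holds a single UTF-8 byte value; narrow it back to a byte.
byte[] utf8Bytes = utf16.Select(c => (byte)c).ToArray();
string utf8 = Encoding.UTF8.GetString(utf8Bytes);
Console.WriteLine(utf8);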
The reverse of these steps is probably how your input was created. You could also use Chris' solution of ignoring every odd byte, which gives basically the same result with fewer encoding steps, although the version above is more explicit about what is really going on: UTF-8 inside UTF-16.
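The effect is easy to reproduce on a tiny input. The sketch below uses a made-up three-character sample ("^XA" stored with a zero byte after each character) rather than the real printer data, and InterleavedNullDemo is a hypothetical name. It shows why a UTF-8 decode yields \0 characters while a UTF-16 decode, or dropping the odd bytes, does not.

```csharp
using System;
using System.Linq;
using System.Text;

static class InterleavedNullDemo
{
    // Assumed sample: "^XA" as UTF-16LE, so each ASCII byte is followed by 0x00.
    public static readonly byte[] Sample = { 0x5E, 0x00, 0x58, 0x00, 0x41, 0x00 };

    // Wrong decoder for this data: every 0x00 byte becomes a '\0' character.
    public static string DecodeAsUtf8(byte[] bytes) => Encoding.UTF8.GetString(bytes);

    // Matching decoder: each pair of bytes becomes one character.
    public static string DecodeAsUtf16(byte[] bytes) => Encoding.Unicode.GetString(bytes);

    // Keeps the bytes at even indexes only, then reads them as ASCII.
    public static string DropOddBytes(byte[] bytes) =>
        Encoding.ASCII.GetString(bytes.Where((b, i) => i % 2 == 0).ToArray());

    static void Main()
    {
        Console.WriteLine(DecodeAsUtf8(Sample).Length); // 6
        Console.WriteLine(DecodeAsUtf16(Sample));        // ^XA
        Console.WriteLine(DropOddBytes(Sample));         // ^XA
    }
}
```

For pure ASCII content the last two approaches agree; they would differ only if the odd bytes carried real data.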

Where does (char)int get its symbols from?

Being a computer programming rookie, I was given homework involving the use of the playing card suit symbols. In the course of my research I came across an easy way to retrieve the symbols:
Console.Write((char)6);
gives you ♠
Console.Write((char)3);
gives you ♥
and so on...
However, I still don't understand what logic C# uses to retrieve those symbols. I mean, the ♠ symbol in the Unicode table is U+2660, yet I didn't use it. The ASCII table doesn't even contain these symbols.
So my question is, what is the logic behind (char)int?

For these low numbers (below 32), this is an aspect of the console rather than C#, and it comes from Code page 437 - though it won't include the ones that have other meanings that the console actually uses, such as tab, carriage return, and bell. This isn't really portable to any context where you're not running directly in a console window, and you should use e.g. 0x2660 instead, or just '\u2660'.
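A portable version might look like the sketch below, which spells the four suits as Unicode escapes instead of relying on the console's handling of low character codes. SuitDemo is a hypothetical name, and setting Console.OutputEncoding to UTF-8 is an assumption about what the terminal needs in order to show the symbols.

```csharp
using System;
using System.Text;

static class SuitDemo
{
    public const char Spade = '\u2660';   // BLACK SPADE SUIT
    public const char Heart = '\u2665';   // BLACK HEART SUIT
    public const char Diamond = '\u2666'; // BLACK DIAMOND SUIT
    public const char Club = '\u2663';    // BLACK CLUB SUIT

    public static string AllSuits() =>
        new string(new[] { Spade, Heart, Diamond, Club });

    static void Main()
    {
        // Ask the console to emit UTF-8 so the symbols are not replaced by '?'.
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(AllSuits());
    }
}
```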

The logic behind (char)int is that char is a UTF-16 code unit, one or two of which encode a Unicode codepoint. Codepoints are naturally ordinal numbers, being an identifier for a member of a character set. They are often written in hexadecimal, and specifically for Unicode, preceded by U+, for example U+2660.
UTF-16 is a mapping between codepoints and code units. Code units, being 16 bits, can be operated on as integers. Since a char holds one code unit, you can convert a short to a char. Since the different integer types can interoperate, you can convert an int to a char.
So, your short (or int) has meaning as text only when it represents a UTF-16 code unit for a codepoint that only has one code unit. (You could also convert an int holding a whole codepoint to a string.)
Of course, you could let the compiler figure it out for you and make it easier for your readers, too, with:
Console.Write('♥');
Also, forget ASCII. It's never the right encoding (except when it is). In case it's not clear, a string is a counted sequence of UTF-16 code units.
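The relationship between codepoints and code units can be illustrated with char.ConvertFromUtf32, which turns a whole codepoint into a string. This is a sketch; CodePointDemo and FromCodePoint are hypothetical names, and U+1F0A1 (the playing card ace of spades) is picked only as an example of a codepoint that needs two code units.

```csharp
using System;

static class CodePointDemo
{
    // Converts a whole Unicode codepoint into a string of one or two chars.
    public static string FromCodePoint(int codePoint) =>
        char.ConvertFromUtf32(codePoint);

    static void Main()
    {
        // U+2660 fits in one code unit, so a plain cast is enough.
        Console.WriteLine((char)0x2660 == '\u2660');      // True
        Console.WriteLine(FromCodePoint(0x2660).Length);  // 1

        // U+1F0A1 lies outside the 16-bit range and becomes a surrogate pair.
        string ace = FromCodePoint(0x1F0A1);
        Console.WriteLine(ace.Length);                    // 2
        Console.WriteLine(char.IsHighSurrogate(ace[0]));  // True
        Console.WriteLine(char.ConvertToUtf32(ace, 0) == 0x1F0A1); // True
    }
}
```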

How to fix encoding for Dictionary<char, char>?

I created a simple Dictionary<char, char> that maps local characters to ASCII characters (ē -> e), but it does not work. When I inspect the dictionary in debug mode, the local (Latvian) characters are wrong: I see different characters instead.
I suspect it has something to do with encoding, although I don't know why this is happening or how to fix it.
If I write a simple string text = "with some local characters ā ē ū"; and check it in debug mode, the encoding seems to be correct.
Here is the instantiation of my dictionary:
And here are the values that appear in the dictionary after instantiation:

Check that your source file is encoded as the C# language specification requires. It must use one of the allowed Unicode encodings. UTF-8 is always allowed. (I'd say preferred.) Your editor should be able to tell you which one you're using and/or allow you to re-save the file with a specific encoding.
In Visual Studio, you can re-save the file with a specific encoding using File » Advanced Save Options…, then File » Save.
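If re-saving the file is not an option, one workaround is to write the keys as Unicode escapes, which consist of ASCII characters only and so survive any source-file encoding. The sketch below is not the asker's dictionary (that code is not shown); Translit, Map, and ToAscii are hypothetical names, and only three Latvian letters are included.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class Translit
{
    // Keys are written as \u escapes so the compiler sees the intended
    // characters regardless of how the source file is encoded.
    static readonly Dictionary<char, char> Map = new Dictionary<char, char>
    {
        { '\u0101', 'a' }, // a with macron
        { '\u0113', 'e' }, // e with macron
        { '\u016B', 'u' }, // u with macron
    };

    // Replaces mapped characters and leaves everything else untouched.
    public static string ToAscii(string text) =>
        new string(text.Select(c => Map.TryGetValue(c, out char r) ? r : c).ToArray());

    static void Main()
    {
        Console.WriteLine(ToAscii("\u0101 \u0113 \u016B")); // a e u
    }
}
```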

Why do I get a CS1056 Unexpected character '' on this code

I'm getting this unexpected character '' error and I don't understand why.
var list = new List<MyModel>();
list.Add(new MyModel() {
    variable1 = 942,
    variable2 = 2001,
    variable3 = "my text",
    variable4 = 123
}); // CS1056 Unexpected character '' on this line

Judging from the error message and the error code I got from an online compiler after copying and pasting your code, that line contains a character that is not visible but that the compiler still tries to interpret. Try deleting every character from the closing bracket back to the number 3 on the previous line, then retype them and press Enter again. It should then work (it did for me).

I just deleted the file Version=v4.0.AssemblyAttributes.cs(1,1,1,1) located in my temp folder C:\Users\MyUser\AppData\Local\Temp and then it works perfectly.
For .NET Core you have to delete .NETCoreApp,Version=v2.1.AssemblyAttributes.cs

As mentioned by Daneau in the accepted answer, the problem is caused by a character that is not visible in the IDE.
Here are several ways to find the invisible character with Notepad++.
Solution 1: Show Symbol
Copy the code to Notepad++,
Select View -> Show Symbol -> Show All Characters
This can show invisible control characters.
Solution 2: Convert to ANSI
Copy the code to Notepad++,
Select Encoding -> Convert to ANSI
This will convert the invisible character to ? if it is a non-ANSI character.
Solution 3: Remove non-ASCII characters
Copy the code to Notepad++,
Open the Find window (Ctrl+F)
Select the Replace tab
in "Find what" write: [^\x00-\x7F]
Leave "Replace with" empty
In "Search Mode" select "Regular expression"
Find and remove the non-ASCII characters
This will remove non-ASCII characters.
Note: This can remove valid non-ASCII characters (in strings and comments), so try to skip those if you have any.
Tip: Use the HEX-Editor plugin
Use the Notepad++ HEX-Editor plugin to see the binary code of the text. Any character outside the range 0x00 - 0x7F (0 - 127) is a non-ASCII character and a suspect.
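The same search can be sketched in C# for cases where Notepad++ is not at hand. SuspectCharFinder and Find are hypothetical names, and the rule used here (flag everything except printable ASCII, tab, and line breaks) is an assumption that is deliberately stricter than the regular expression above, so that ASCII control characters are reported as well.

```csharp
using System;
using System.Collections.Generic;

static class SuspectCharFinder
{
    // Reports the line, column, and code unit of every character that is
    // not printable ASCII, a tab, or a line break.
    public static List<string> Find(string source)
    {
        var hits = new List<string>();
        int line = 1, col = 0;
        foreach (char c in source)
        {
            if (c == '\n') { line++; col = 0; continue; }
            col++;
            bool allowedControl = c == '\t' || c == '\r';
            bool printableAscii = c >= 0x20 && c <= 0x7E;
            if (!printableAscii && !allowedControl)
                hits.Add($"line {line}, col {col}: U+{(int)c:X4}");
        }
        return hits;
    }

    static void Main()
    {
        // Assumed sample: a zero-width space hiding after the closing bracket.
        string code = "var x = 1;\n});\u200B\n";
        foreach (string hit in Find(code))
            Console.WriteLine(hit); // line 2, col 4: U+200B
    }
}
```

Like the regular expression, this also flags legitimate non-ASCII characters in strings and comments, so the hits still need to be reviewed by hand.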

Just reporting my direct experience.
As Daneau wrote, I had a character (ASCII DLE, which I copied while working with a Zebra printer) hiding in the text. I could not afford to rewrite everything, so I used the Notepad++ "View -> Show Symbol -> Show All Characters" feature.
I apologize for not commenting on Daneau's entry, but I don't have enough reputation.

Write the code again without copying it. That worked for me.

Go to C:\Users\UserName\AppData\Local\Temp\ and clear its contents, or remove the file named in the error. That will solve the issue.
VS will recreate the required file automatically, so there is nothing to worry about.

I got this error when I moved my application from one folder to another. I resolved it by deleting the Debug folder inside the obj folder.

It does indeed have to do with copying and pasting code that contains characters you cannot see. The easiest fix is to paste the copied code into a notes application or simple plain-text editor, which will typically drop these invisible characters. Then copy the code from that editor and paste it into your IDE.

For some reason this happened to me on every project in my solution. My fix was to delete all bin and obj folders in my solution.

Correct Hebrew character sequence in C# and searchable PDFs

I'm testing an SDK that extracts text from a searchable PDF. One of the SDK's dependencies was recently updated, and it's causing an existing test on Hebrew text to fail. I know neither Hebrew nor enough about how the technologies involved represent right-to-left languages.
The NUnit test asserts that the extracted text matches the C# string "מנבוצץז ".
string hebrewText = reader.ReadToEnd();
Assert.AreEqual("מנבוצץז ", hebrewText);
The rasterized PDF has what I believe are the same characters, but in the opposite order.
The unit test fails with this message:
Expected: "מנבוצץז "
But was: " זץצובנמ"
Although the actual result more closely matches what I see in the rasterized PDF, I'm not completely sure the original test is wrong.
Are Hebrew characters in a C# string supposed to be read right-to-left like printed Hebrew text?
Does any part of the .NET stack tamper with the direction of Hebrew strings?
What about NUnit?
Are Hebrew characters embedded in a searchable PDF normally supposed to go in the same direction as the rasterized text?
Anything else I should know before deciding whether to "fix" this unit test?

There are various ways to encode RTL languages. The most common way (and Windows' default) is to use logical ordering, which means the first letter is encoded as the first character in a string (or file). So whether the first letter visually appears on the left or the right side of the screen doesn't affect the order in which the letters are stored.
Now as for the text appearing in Visual Studio, it depends on the version. As far as I remember, prior to Visual Studio 2010 the code editor displayed Hebrew backwards, which became apparent when you tried to select Hebrew text: it reversed in an odd way (which was visually confusing). It appears this issue no longer exists in Visual Studio 2010 (at least with SP1, which I just tested).
Let's take a Hebrew word for which the direction is more clear to non-Hebrew speakers than the string specified in your text:
יון
The word happens to be the Hebrew word for an ion, and on your screen, it should appear as three letters where the tallest letter is on the left and the shortest is on the right. In a .NET string, the expression "יון".Substring(0, 1) will produce the short letter, since it's the first letter in the string. The string can also be written as "\u05D9\u05D5\u05DF" where the leftmost Unicode character \u05D9 represents the short letter displayed on the right, which clearly demonstrates the order in which the letters are stored.
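The storage order can be verified without relying on how an editor renders Hebrew. In this sketch, LogicalOrderDemo and Reverse are hypothetical names; the reversal is a naive char-by-char one, which is adequate for these three letters but would break surrogate pairs or combining marks in general text.

```csharp
using System;

static class LogicalOrderDemo
{
    // The word from the text above, written as escapes: yod, vav, final nun.
    public const string Ion = "\u05D9\u05D5\u05DF";

    // Naive reversal of UTF-16 code units.
    public static string Reverse(string s)
    {
        char[] chars = s.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }

    static void Main()
    {
        // Index 0 is the first letter read, although it is displayed rightmost.
        Console.WriteLine(Ion[0] == '\u05D9');                     // True
        Console.WriteLine(Ion.Substring(0, 1) == "\u05D9");        // True

        // A visually reversed extraction is a different string to .NET.
        Console.WriteLine(Ion == Reverse(Ion));                    // False
        Console.WriteLine(Ion == Reverse(Reverse(Ion)));           // True
    }
}
```

An assertion such as Assert.AreEqual therefore fails whenever the extractor returns the letters in visual rather than logical order.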
Since the string in your test case is nonsensical, I can't tell you whether it was a wrong test all along or whether it is a correct test that should pass. If the image you uploaded has been rendered correctly, then it appears that the actual result of your test is correct and the expected value is incorrect, and so you should fix the test.

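I believe that all strings in C# are stored internally in logical order, i.e. the first character read is the first one in memory, whatever the script. The display direction is decided at render time from the characters' Unicode bidirectional properties, sometimes with the help of non-printable marks such as U+200F (right-to-left mark).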
More than likely. RTL GUIs and rendered text for example need certain properties (specifically RightToLeft and RightToLeftLayout) to be set in order to display correctly.
NUnit shouldn't. Nor should it care. IMHO a reversed string != the original string.
I couldn't comment. I'd assume that they should be whatever the test is expecting though, assuming it was passing at first.
Don't do half measures with RTL, it really doesn't like it. Either have full RTL support, or nothing. It can be pretty nasty, I wish you the best of luck!
