Can anyone tell me why I am getting different values for the ‰ symbol?
My Delphi 7 code is:
k := Ord('‰'); // k is of type Longint; k comes out as 137
My C# code is:
k = (long)('‰'); // k is of type long; k comes out as 8240
How can I get the same value in C#? The Delphi code gives me the value I need.
C# encodes text as UTF-16. Delphi 7 encodes text as ANSI. Therein lies the difference. You are not comparing like with like.
You ask how to get the ANSI ordinal value for that character. Use Encoding.GetEncoding(codePage) to get an Encoding instance that matches your Delphi ANSI encoding. You'll need to know which code page you are using in order to do that. Then call GetBytes on that encoding instance.
For instance if your code page is 1252 then you'd write:
Encoding enc = Encoding.GetEncoding(1252);
byte[] bytes = enc.GetBytes("‰");
Or if you want to use the default system ANSI encoding you would use
byte[] bytes = Encoding.Default.GetBytes("‰");
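Putting that together, a minimal sketch (assuming the Delphi side uses code page 1252; substitute your actual ANSI code page):

using System;
using System.Text;

class PerMilleOrdinal
{
    static void Main()
    {
        // On .NET Core / .NET 5+ you would first need to register the code-page encodings:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);  (System.Text.Encoding.CodePages package)
        Encoding ansi = Encoding.GetEncoding(1252);
        byte[] bytes = ansi.GetBytes("‰");
        long k = bytes[0];
        Console.WriteLine(k); // 137, matching Ord('‰') in the Delphi 7 code
    }
}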
One wonders why you are asking this question. Having answered many similar questions here, I suspect you are performing some sort of encryption, hashing or encoding, and are using as your reference some Delphi code that uses AnsiString variables as byte arrays. In which case, whilst Encoding.GetBytes is what you asked for, it's not what you need. What you need is to stop using strings to hold binary data. Use byte arrays, in both Delphi and C#.
I have some code in Java that uses
String.getBytes()
(without an encoding argument) on a generated string to obtain a byte[], which I later use as a key for AES encryption.
I then take the encrypted message, and in C# (a WP7/WP8 environment) I need to decrypt it.
I can easily generate the same string that I used in the Java application; however, I need to convert it to a byte[] in such a way that it produces exactly the same byte array as in Java.
Question 1:
Can I do this without altering Java code?
Question 2:
If not, how should I implement both versions so that they always return the same byte[], no matter what?
Basically you should specify an encoding in the Java code. Currently, your code will produce different outputs on different systems as it uses the platform-default encoding (e.g. Windows-1252 or UTF-8).
I would encourage you to use UTF-8 in both cases:
// Java 7 onwards
byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
// Java pre-7
byte[] bytes = text.getBytes("UTF-8");
// .NET
byte[] bytes = Encoding.UTF8.GetBytes(text);
Using UTF-8 allows for all valid Unicode strings to be encoded into bytes. You could consider using UTF-16, but then you need to make sure you specify the same endianness in each case. That does have the benefit of having exactly two bytes per char regardless of content though (as a char is a UTF-16 code unit in both Java and .NET).
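As a quick sanity check (a hypothetical snippet, not from the original code), you can dump the bytes as hex on the C# side and compare them with what the Java side prints for the same string:

using System;
using System.Text;

class KeyBytesCheck
{
    static void Main()
    {
        string text = "whatever string both sides generate"; // placeholder
        byte[] keyBytes = Encoding.UTF8.GetBytes(text);

        // Print as hyphen-separated hex so it can be diffed against the Java output.
        Console.WriteLine(BitConverter.ToString(keyBytes));
    }
}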
I have a problem.
Unicode code point 2019 (U+2019) is this character:
’
It is a right single quote.
It gets encoded as UTF8.
But I fear it gets double-encoded.
>>> u'\u2019'.encode('utf-8')
'\xe2\x80\x99'
>>> u'\xe2\x80\x99'.encode('utf-8')
'\xc3\xa2\xc2\x80\xc2\x99'
>>> u'\xc3\xa2\xc2\x80\xc2\x99'.encode('utf-8')
'\xc3\x83\xc2\xa2\xc3\x82\xc2\x80\xc3\x82\xc2\x99'
>>> print(u'\u2019')
’
>>> print('\xe2\x80\x99')
’
>>> print('\xc3\xa2\xc2\x80\xc2\x99')
’
>>> '\xc3\xa2\xc2\x80\xc2\x99'.decode('utf-8')
u'\xe2\x80\x99'
>>> '\xe2\x80\x99'.decode('utf-8')
u'\u2019'
This is the principle used above.
How can I do the last two steps (the decode calls) in C#?
How can I take a UTF-8 encoded string, convert it to a byte array, convert that back to a string, and then decode it again?
I tried this method, but the output is not suitable in ISO-8859-1, it seems...
string firstLevel = "\u00c3\u00a2\u00c2\u0080\u00c2\u0099"; // the double-encoded text, escaped to avoid source-file encoding issues
byte[] decodedBytes = Encoding.UTF8.GetBytes(firstLevel);
Console.WriteLine(Encoding.UTF8.GetChars(decodedBytes));
// still the double-encoded text
Console.WriteLine(decodeUTF8String(firstLevel));
// â�,��"� (garbage)
// I was hoping for this:
// ’
Understanding Update:
Jon has helped me with my most basic question: getting from the double-encoded text back to "’". But I want to honor the recommendations at the heart of his answer:
understand what is happening
fix the original sin
I made an effort at number 1.
Encoding/Decoding
I get so confused with terms like these.
I confuse them with terms like Encrypting/Decrypting, simply because of "En..." and "De..."
I forget what they translate from, and what they translate to.
I confuse the start points and end points, perhaps because they get tangled up with other vague terms like hex, character entities, code points, and character maps.
I wanted to settle the definition at a basic level.
Encoding and Decoding in the context of this question is:
Decode
Corresponds to C# {Encoding}.GetString(bytesArray)
Corresponds to Python stringObject.decode({Encoding})
Takes bytes as input, and converts to string representation as output, according to some conversion scheme called an "encoding", represented by {Encoding} above.
Bytes -> String
Encode
Corresponds to C# {Encoding}.GetBytes(stringObject)
Corresponds to Python stringObject.encode({Encoding})
The reverse of Decode.
String -> Bytes (except in Python 2, where strings and bytes blur together, as discussed below)
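To make those two definitions concrete, here is a tiny C# round trip (my own illustration, using UTF-8 and the ’ character from this question):

using System;
using System.Text;

class EncodeDecodeRoundTrip
{
    static void Main()
    {
        Encoding utf8 = Encoding.UTF8;

        // Encode: String -> Bytes
        byte[] bytes = utf8.GetBytes("\u2019");          // ’  ->  { 0xE2, 0x80, 0x99 }

        // Decode: Bytes -> String
        string text = utf8.GetString(bytes);             // back to "’"

        Console.WriteLine(BitConverter.ToString(bytes)); // E2-80-99
        Console.WriteLine(text);                         // ’
    }
}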
Bytes vs Strings in Python
So Encode and Decode take us back and forth between bytes and strings.
While Python helped me understand what was going wrong, it could also confuse my understanding of the "fundamentals" of Encoding/Decoding.
Jon said:
It's a shame that Python hides [the difference between binary data and text data] to a large extent
I think this is what the PEP means when it says:
Python's current string objects are overloaded. They serve to hold both sequences of characters and sequences of bytes. This overloading of purpose leads to confusion and bugs.
Python 3.* does not overload strings in this way:
Python 2.7
>>> #Encoding example. As a generalization, "Encoding" produce bytes.
>>> #In Python 2.7, strings are overloaded to serve as bytes
>>> type(u'\u2019'.encode('utf-8'))
<type 'str'>
Python 3.*
>>> #In Python 3.*, bytes and strings are distinct
>>> type('\u2019'.encode('utf-8'))
<class 'bytes'>
Another important (related) difference between Python 2 and 3 is their default encoding:
>>> import sys
>>> sys.getdefaultencoding()
Python 2
'ascii'
Python 3
'utf-8'
And while Python 2 says 'ascii', I think it means a specific type of ASCII:
It does not mean ISO-8859-1, which covers range(256) and is what Jon uses to decode (discussed below).
It means plain ASCII, which covers only range(128).
And while Python 3 no longer overloads strings to double as bytes, the interpreter still makes it easy to gloss over what's happening and move between the types, i.e.
just put a 'u' before a string literal in Python 2.* and it's a Unicode literal
just put a 'b' before a string literal in Python 3.* and it's a bytes literal
Encoding and C#
Jon points out that C# uses UTF-16, correcting my "UTF-8 encoded string" comment above:
Every string is effectively UTF-16.
My understanding of it is: if C# has a string object "s", the computer memory actually has the bytes corresponding to that character in the UTF-16 map. That is, (including byte-order-mark??) feff0073.
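A quick way to check this from C# itself (a small sketch I'm adding; note that an in-memory string has no byte-order mark, the BOM only appears when you serialize with an encoding whose preamble includes one):

using System;
using System.Text;

class Utf16Peek
{
    static void Main()
    {
        string s = "s";

        // The UTF-16 code unit for 's'
        Console.WriteLine(((int)s[0]).ToString("x4"));                                    // 0073

        // Serialized forms: big-endian vs little-endian
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(s)));  // 00-73
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(s)));           // 73-00

        // The BOM is only a preamble written when serializing data, not part of the string itself.
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetPreamble()));         // FF-FE
    }
}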
He also uses ISO-8859-1 in the hack method I requested.
I'm not sure why.
My head is hurting at the moment, so I'll return when I have some perspective.
I'll return to this post. I hope I'm explaining properly. I'll make it a Wiki?
You need to understand that fundamentally this is due to someone misunderstanding the difference between binary data and text data. It's a shame that Python hides that difference to a large extent - it's quite hard to accidentally perform this particular form of double-encoding in C#. Still, this code should work for you:
using System;
using System.Text;

class Test
{
    static void Main()
    {
        // Avoid encoding issues in the source file itself...
        string firstLevel = "\u00c3\u00a2\u00c2\u0080\u00c2\u0099";
        string secondLevel = HackDecode(firstLevel);
        string thirdLevel = HackDecode(secondLevel);
        Console.WriteLine("{0:x}", (int) thirdLevel[0]); // 2019
    }

    // Converts a string to a byte array using ISO-8859-1, then *decodes*
    // it using UTF-8. Any use of this method indicates broken data to start
    // with. Ideally, the source of the error should be fixed.
    static string HackDecode(string input)
    {
        byte[] bytes = Encoding.GetEncoding(28591)
                               .GetBytes(input);
        return Encoding.UTF8.GetString(bytes);
    }
}
image is a string containing the contents of an image file.
I have code as follows in C#:
Convert.ToBase64String(image);
and code as follows in Java:
org.apache.commons.codec.binary.Base64.encodeBase64(image.getBytes())
The result is different.
Somebody says it's because:
Java byte : -128 to 127
C# byte : 0 to 255
But how can I fix this? How can I implement C#'s Convert.ToBase64String() in Java?
I need the same result as in C# by using Java.
First you need to realise that a byte stores 256 distinct values whether it's signed or unsigned. If you want the unsigned value of a signed byte (which is what Java has), you can use & 0xFF,
e.g.
byte[] bytes = { 0, 127, -128, -1 };
for (byte b : bytes) {
    int unsigned = b & 0xFF;
    System.out.println(unsigned);
}
prints
0
127
128
255
The simple answer is you don't need a byte[] which has the same values. ;)
You're base64 encoding a string? What do you want that to do? You first need to convert the string to a sequence of bytes, choosing an encoding such as UTF-8 or UTF-16.
My guess is that you managed to use different encodings on the two sides. Java's String.getBytes() uses the platform default charset (probably something like Latin-1 on Western Windows versions). For C# you didn't post the relevant code.
To fix this, choose an encoding and use it explicitly on both sides. I recommend using UTF-8.
On the Java side you should use the correct method for encoding, so you don't end up with "modified UTF-8"; since I'm not a Java programmer, I don't know which methods output modified UTF-8, but I think it only happens if you abuse some internal serialization method.
Signed vs. unsigned bytes should not be relevant here. The intermediate byte buffer will be different, but the original string and the Base64 string should be identical on both sides.
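For example, the C# side might look like the sketch below (illustrative names only); the Java side would pair it with image.getBytes(StandardCharsets.UTF_8) and commons-codec's Base64 (or java.util.Base64):

using System;
using System.Text;

class Base64Match
{
    static void Main()
    {
        string image = "..."; // placeholder for the same string the Java side has
        byte[] bytes = Encoding.UTF8.GetBytes(image);    // explicit encoding on both sides
        string base64 = Convert.ToBase64String(bytes);
        Console.WriteLine(base64);
    }
}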
I also encountered the same problem. There is a claim circulating online that:
Java byte : -128 to 127 | C# byte : 0 to 255
I looked into how Java's Base64 encoding and decoding work, implemented the same algorithm in C#, and ran it: the result was identical to
Convert.ToBase64String(byteArray).
What finally solved the problem for me was:
Uri.EscapeDataString(Convert.ToBase64String(byteArray)).
The real issue was the special characters that Base64 produces being mangled in the URL.
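A small illustration of that point (my own example bytes): Base64 output can contain '+', '/', and '=', and '+' in particular turns into a space when a URL is decoded, so the Base64 text has to be percent-escaped before it travels in a URL:

using System;

class Base64InUrl
{
    static void Main()
    {
        byte[] bytes = { 0xFB, 0xEF };
        string base64 = Convert.ToBase64String(bytes);
        Console.WriteLine(base64);                        // ++8=
        Console.WriteLine(Uri.EscapeDataString(base64));  // %2B%2B8%3D
    }
}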
The project I'm currently working on needs to interface with a client system that we don't make, so we have no control over how data is sent in either direction. The problem is that we're working in C#, which doesn't seem to have any support for UCS-2 and very little support for big-endian (as far as I can tell).
What I would like to know is whether there's anything I overlooked in .NET, or something someone else has made and released that we can use. If not, I will take a crack at encoding/decoding it in a custom method, if that's even possible.
But thanks for your time either way.
EDIT:
BigEndianUnicode does correctly decode the string; the problem was in receiving the other data as big-endian. So far, using IPAddress.HostToNetworkOrder() as suggested elsewhere has allowed me to decode half of the string ("Merli?" is what comes up, and it should be "Merlin33069").
I'm combing through the code to see if there's another length variable I missed.
RESOLUTION:
After working out that the big-endian variables were the main problem, I went back through and reviewed the details. It turns out the lengths of the strings were sent as character counts, not byte counts (in UTF-16 a char is two bytes), so all I needed to do was double the value, and it worked out. Thank you all for your help.
string x = "abc";
byte[] data = Encoding.BigEndianUnicode.GetBytes(x);
In the other direction:
string decodedX = Encoding.BigEndianUnicode.GetString(data);
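For example (an added check), dumping the bytes for "abc" from the snippet above shows the big-endian layout, two bytes per character with the high byte first:

Console.WriteLine(BitConverter.ToString(data)); // 00-61-00-62-00-63
Console.WriteLine(decodedX);                    // abc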
It is not exactly UCS-2 but it is enough for most cases.
UPD: Unicode FAQ
Q: What is the difference between UCS-2 and UTF-16?
A: UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided.
UCS-2 does not define a distinct data format, because UTF-16 and UCS-2 are identical for purposes of data exchange. Both are 16-bit, and have exactly the same code unit representation.
Sometimes in the past an implementation has been labeled "UCS-2" to indicate that it does not support supplementary characters and doesn't interpret pairs of surrogate code points as characters. Such an implementation would not handle processing of character properties, code point boundaries, collation, etc. for supplementary characters.
EDIT: Now we know that the problem isn't in the encoding of the text data but in the encoding of the length. There are a few options:
Reverse the bytes and then use the built-in BitConverter code (which I assume is what you're using now; that or BinaryReader)
Perform the conversion yourself using repeated "add and shift" operations
Use my EndianBitConverter or EndianBinaryReader classes from MiscUtil, which are like BitConverter and BinaryReader, but let you specify the endianness.
You may be looking for Encoding.BigEndianUnicode. That's the big-endian UTF-16 encoding, which isn't strictly speaking the same as UCS-2 (as pointed out by Marc) but should be fine unless you give it strings including characters outside the BMP (i.e. above U+FFFF), which can't be represented in UCS-2 but are represented in UTF-16.
From the Wikipedia page:
The older UCS-2 (2-byte Universal Character Set) is a similar character encoding that was superseded by UTF-16 in version 2.0 of the Unicode standard in July 1996. It produces a fixed-length format by simply using the code point as the 16-bit code unit and produces exactly the same result as UTF-16 for 96.9% of all the code points in the range 0-0xFFFF, including all characters that had been assigned a value at that time.
I find it highly unlikely that the client system is sending you characters where there's a difference (which is basically the surrogate pairs, which are permanently reserved for that use anyway).
UCS-2 is so close to UTF-16 that Encoding.BigEndianUnicode will almost always suffice.
The issue (raised in the comments) around reading the length prefix as big-endian is more correctly resolved via shift operations, which will do the right thing on all systems. For example:
Read4BytesIntoBuffer(buffer);
int len =(buffer[0] << 24) | (buffer[1] << 16) | (buffer[2] << 8) | (buffer[3]);
This will then work the same (at parsing a big-endian 4 byte int) on any system, regardless of local endianness.
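Pulling the pieces of this thread together, here is a rough sketch of reading one such string (my own illustration, assuming a 4-byte big-endian length prefix that counts characters, as in the resolution above; the helper names are made up):

using System;
using System.IO;
using System.Text;

static class BigEndianStrings
{
    public static string ReadPrefixedString(Stream stream)
    {
        byte[] prefix = new byte[4];
        ReadExactly(stream, prefix, 4);

        // Big-endian length prefix, counted in characters (per the resolution above).
        int charCount = (prefix[0] << 24) | (prefix[1] << 16) | (prefix[2] << 8) | prefix[3];
        int byteCount = charCount * 2;                  // UTF-16/UCS-2: two bytes per character

        byte[] data = new byte[byteCount];
        ReadExactly(stream, data, byteCount);
        return Encoding.BigEndianUnicode.GetString(data);
    }

    static void ReadExactly(Stream stream, byte[] buffer, int count)
    {
        int offset = 0;
        while (offset < count)
        {
            int read = stream.Read(buffer, offset, count - offset);
            if (read <= 0) throw new EndOfStreamException();
            offset += read;
        }
    }
}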
We recently came across some sample code from a vendor for hashing a secret key for a web service call. Their sample was in VB.NET, which we converted to C#, and the hashing then produced different results. It turns out they were generating the key for the encryption by converting a char array to a string and back to a byte array. This led me to the discovery that VB.NET's and C#'s default encoders handle some characters differently.
C#:
Console.Write(Encoding.Default.GetBytes(new char[] { (char)149 })[0]);
VB:
Dim b As Char() = {Chr(149)}
Console.WriteLine(Encoding.Default.GetBytes(b)(0))
The C# output is 63, while VB gives the correct byte value of 149.
If you use other values, like 145, the outputs match.
Stepping through in the debugger, the default encoder is SBCSCodePageEncoding for both VB and C#.
Does anyone know why this is?
I have corrected the sample code by directly initializing a byte array, which it should have been in the first place, but I still want to know why the encoder, which should not be language specific, appears to be just that.
If you use ChrW(149) you will get a different result: 63, the same as the C# code.
Dim b As Char() = {ChrW(149)}
Console.WriteLine(Encoding.Default.GetBytes(b)(0))
Read the documentation to see the difference; that will explain the answer.
The VB Chr function takes an argument in the range 0 to 255, and converts it to a character using the current default code page. It will throw an exception if you pass an argument outside this range.
ChrW will take a 16-bit value and return the corresponding System.Char value without using an encoding - hence will give the same result as the C# code you posted.
The approximate equivalent of your VB code in C# without using the VB Strings class (that's the class that contains Chr and ChrW) would be:
char[] chars = Encoding.Default.GetChars(new byte[] { 149 });
Console.Write(Encoding.Default.GetBytes(chars)[0]);
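To spell out what is going on with 149 specifically (assuming the machine's ANSI code page is 1252, which matches the behaviour described above): (char)149 is U+0095, a C1 control character that code page 1252 has no byte for, so the encoder substitutes '?' (byte 63), whereas Chr(149) first maps byte 149 through the code page to '•' (U+2022), which encodes back to 149:

using System;
using System.Text;

class ChrVersusCast
{
    static void Main()
    {
        Encoding cp1252 = Encoding.GetEncoding(1252);    // assuming 1252 is the default ANSI code page here

        // (char)149 is U+0095, which code page 1252 cannot encode, so you get '?' (63).
        Console.WriteLine(cp1252.GetBytes(new[] { (char)149 })[0]);   // 63

        // VB's Chr(149) decodes byte 149 via the code page first: byte 149 -> '•' (U+2022).
        char bullet = cp1252.GetChars(new byte[] { 149 })[0];
        Console.WriteLine((int)bullet);                               // 8226, i.e. U+2022
        Console.WriteLine(cp1252.GetBytes(new[] { bullet })[0]);      // 149
    }
}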
The default encoding is machine dependent as well as thread dependent, because it uses the current code page. You generally should use something like Encoding.UTF8 so that you don't have to worry about what happens when one machine is using Unicode and another is using the 1252 ANSI code page.
Different operating systems might use different encodings as the default. Therefore, data streamed from one operating system to another might be translated incorrectly. To ensure that the encoded bytes are decoded properly, your application should use a Unicode encoding, that is, UTF8Encoding, UnicodeEncoding, or UTF32Encoding, with a preamble. Another option is to use a higher-level protocol to ensure that the same format is used for encoding and decoding.
from http://msdn.microsoft.com/en-us/library/system.text.encoding.default.aspx
Can you check what each language produces when you explicitly encode using UTF-8?
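As a quick illustration of that check (my own snippet): with UTF-8 no code page is involved, so (char)149 / ChrW(149) produces the same bytes on every machine, while Chr(149) starts from a different character ('•') and therefore produces different bytes:

using System;
using System.Text;

class Utf8Check
{
    static void Main()
    {
        // (char)149 is U+0095; its UTF-8 form is the same regardless of machine settings.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(new[] { (char)149 })));  // C2-95

        // VB's Chr(149) would give '\u2022' ('•') instead, which is a different character entirely.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("\u2022")));             // E2-80-A2
    }
}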
I believe the equivalent in VB is ChrW(149).
So, this VB code...
Dim c As Char() = New Char() { Chr(149) }
'Dim c As Char() = New Char() { ChrW(149) }
Dim b As Byte() = System.Text.Encoding.Default.GetBytes(c)
Console.WriteLine("{0}", Convert.ToInt32(c(0)))
Console.WriteLine("{0}", CInt(b(0)))
produces the same output as this C# code...
var c = new char[] { (char)149 };
var b = System.Text.Encoding.Default.GetBytes(c);
Console.WriteLine("{0}", (int)c[0]);
Console.WriteLine("{0}", (int) b[0]);