I am converting my string to a byte array using ASCII encoding with the code below.
string data = "<?xml version=\"1.0\" encoding=\"utf-8\"?><ns0:ReceivedPayment Amount=\"1.01\"/>";
byte[] buffer = Encoding.ASCII.GetBytes(data);
The problem I am facing is that a "?" is being added to my string. If I convert the byte array back to a string:
var str = System.Text.Encoding.Default.GetString(buffer);
my string becomes
string str = "?<?xml version=\"1.0\" encoding=\"utf-8\"?><ns0:ReceivedPayment Amount=\"1.01\"/>";
Does anyone know why the "?" is added to my string and how to remove it?
It seems that you showed only simplified code. Am I right that you read the data from a file? If yes, check for a BOM (byte order mark) at the beginning of the file. It is used to indicate the encoding: UTF-8, UTF-16, or UTF-32.
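To illustrate (a minimal sketch; the temp-file path and sample XML are arbitrary): a UTF-8 BOM is the three bytes EF BB BF, and decoding them as ASCII turns each one into a '?':

```csharp
using System;
using System.IO;
using System.Text;

// Write a small file with a UTF-8 BOM: new UTF8Encoding(true) emits EF BB BF.
string path = Path.GetTempFileName();
File.WriteAllText(path, "<?xml version=\"1.0\"?>", new UTF8Encoding(true));

// Decode the raw bytes as ASCII: the three BOM bytes are outside the 7-bit
// range, so each becomes a '?' replacement character.
byte[] raw = File.ReadAllBytes(path);
Console.WriteLine(Encoding.ASCII.GetString(raw));   // ???<?xml version="1.0"?>

// Decoding as UTF-8 interprets the bytes correctly, though the BOM still
// arrives as U+FEFF at the start of the string unless you strip it.
Console.WriteLine(Encoding.UTF8.GetString(raw).TrimStart('\uFEFF'));
File.Delete(path);
```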
There are several things wrong here. One is not showing the relevant code.
Nonetheless, if you use valid methods to read text from a UTF-8, UTF-32, etc file, you won't have a BOM in your string because the string will hold the text and the BOM is not part of the text.
On the other hand, if you are reading an XML file, it is not a "text" file. You should use an XML reader. That would take care to use the encoding that is (most likely) indicated in the file.
And, when you write an XML file (which I presume you'll be doing with the byte array), you should use an XML writer. That would take care to use the encoding you specify and write it into the file.
Keep in mind, though, that conversion from Unicode (for which UTF-8 is one encoding) to some other character set can silently corrupt your data with a replacement character (typically '?') for those that are not in the target character set.
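For example (a minimal sketch): encoding a string containing 'é' with Encoding.ASCII silently substitutes '?':

```csharp
using System;
using System.Text;

// Encoding non-ASCII text with Encoding.ASCII silently replaces every
// character outside the 7-bit range with '?' (0x3F); the loss is permanent.
byte[] ascii = Encoding.ASCII.GetBytes("café");
Console.WriteLine(Encoding.ASCII.GetString(ascii));   // caf?

// UTF-8 covers all of Unicode, so the same text round-trips intact.
byte[] utf8 = Encoding.UTF8.GetBytes("café");
Console.WriteLine(Encoding.UTF8.GetString(utf8));     // café
```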
Here is my extension method:
public static byte[] ToByteArray(this string str)
{
    // Copies the raw UTF-16 (little-endian) code units of the string.
    var bytes = new byte[str.Length * sizeof(char)];
    Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
    return bytes;
}
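One caveat worth adding, sketched below with the method reproduced so the snippet is self-contained: the resulting bytes are the string's raw UTF-16 little-endian code units, so Encoding.Unicode is the decoder that reverses them.

```csharp
using System;
using System.Text;

// The bytes are the string's in-memory UTF-16LE representation,
// so Encoding.Unicode (UTF-16LE) decodes them back exactly.
string original = "héllo";
byte[] bytes = original.ToByteArray();
string back = Encoding.Unicode.GetString(bytes);
Console.WriteLine(back == original);   // True

static class StringBytesExtensions
{
    // Same ToByteArray as above: raw UTF-16LE code units of the string.
    public static byte[] ToByteArray(this string str)
    {
        var result = new byte[str.Length * sizeof(char)];
        Buffer.BlockCopy(str.ToCharArray(), 0, result, 0, result.Length);
        return result;
    }
}
```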
string (System.String) is UTF-16, but if I convert a string to UTF-8, the Encoding.UTF8.GetString() method AGAIN returns a string (UTF-16), and that seems impossible, because a string isn't UTF-8.
var foo = Encoding.UTF8.GetString(Encoding.Unicode.GetBytes("hello"));
Console.WriteLine(foo.GetType()); // Prints "System.String"
Yes, String is always UTF-16. If you convert a String to a String you'll either get the same string or data loss.
You can convert a String to a byte array using any available or custom encoding. In most cases, especially writing a file, you can just tell the writer or stream which encoding you want it to use.
In case there is any confusion about UTF-16 and UTF-8, they are both encodings for the same character set: Unicode. There is no data loss between them; you'd just use the most appropriate one, typically UTF-16 in memory and UTF-8 for files and streams.
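A quick round-trip demonstrates the losslessness (the sample text is arbitrary):

```csharp
using System;
using System.Text;

// UTF-8 and UTF-16 are two encodings of the same character set, so
// converting between them never loses data: encode the (UTF-16) string
// as UTF-8 bytes, decode back, and the strings are identical.
string original = "Unicode: é, ש, 漢, 🎉";
byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);
string roundTripped = Encoding.UTF8.GetString(utf8Bytes);
Console.WriteLine(roundTripped == original);   // True
```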
Good day!
I convert a binary file into a char array:
var bytes = File.ReadAllBytes(@"file.wav");
char[] outArr = new char[(int)(Math.Ceiling((double)bytes.Length / 3) * 4)];
var result = Convert.ToBase64CharArray(bytes, 0, bytes.Length, outArr, 0, Base64FormattingOptions.None);
string resStr = new string(outArr);
So, is it little endian?
And does it convert to UTF-8?
Thank you!
You don't have any UTF-8 here - and UTF-8 doesn't have an endianness anyway, as its code unit size is just a single byte.
Your code would be simpler as:
var bytes = File.ReadAllBytes(@"file.wav");
string base64 = Convert.ToBase64String(bytes);
If you then write the string to a file, that would have an encoding, which could easily be UTF-8 (and will be by default), but again there's no endianness to worry about.
Note that as base64 text is always in ASCII, each character within a base64 string will take up a single byte in UTF-8 anyway. Even if UTF-8 did have different representations for multi-byte values, it wouldn't be an issue here.
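That claim is easy to check: each base64 character encodes to exactly one UTF-8 byte (the sample bytes below are arbitrary):

```csharp
using System;
using System.Text;

// Base64 output draws only on A-Z, a-z, 0-9, '+', '/' and '=', all ASCII,
// so encoding a base64 string as UTF-8 produces exactly one byte per
// character.
byte[] data = { 0xDE, 0xAD, 0xBE, 0xEF };
string base64 = Convert.ToBase64String(data);
byte[] asUtf8 = Encoding.UTF8.GetBytes(base64);
Console.WriteLine(base64);                         // 3q2+7w==
Console.WriteLine(base64.Length == asUtf8.Length); // True
```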
A C# char represents a UTF-16 code unit, so there is no UTF-8 here.
Since .NET runs little-endian on all platforms you are likely to target, and since char is two bytes wide, the char array and the string are both stored in little-endian byte order.
If you want to convert your byte array to base64 and then encode as UTF-8 do it like this:
byte[] base64utf8 = Encoding.UTF8.GetBytes(Convert.ToBase64String(bytes));
If you wish to save the base64 text to a file, encoded as UTF-8, you could do that like so:
File.WriteAllText(filename, Convert.ToBase64String(bytes), Encoding.UTF8);
Since UTF-8 is a byte oriented encoding, endianness is not an issue.
I have a text file, and I am converting it to Unicode; then I want to save the content to a file. I want to save the file in two formats:
In unicode
In English like characters (as file.doc)
UnicodeEncoding u = new UnicodeEncoding();
byte[] filebytes = u.GetBytes("C:/file.doc");
File.WriteAllBytes(@"C:/uni.doc", filebytes); // unicode
File.WriteAllBytes(@"C:/ori.doc", filebytes); // As the Original file
Bytes are bytes: just 8-bit binary numbers.
Encodings apply only to text, which you've not got if you've done a binary read.
If you want to read a text file in one encoding and write it in another, you can do so something like so:
Encoding sourceEncoding = Encoding.UTF8;   // or whatever encoding the source file is encoded with
Encoding targetEncoding = Encoding.UTF32;  // or whatever destination encoding you desire
string data = File.ReadAllText(@"C:\original.txt", sourceEncoding);
File.WriteAllText(@"C:\different-encoding.txt", data, targetEncoding);
You should bear in mind that strings are internally represented in the CLR infrastructure as a UTF-16 encoding of Unicode text.
GetBytes converts a string to bytes; it does not take a file path as input. You have to use a StreamReader (or File.ReadAllText) to read the file's text. To get the encoded bytes, pass the text you read to System.Text.Encoding.Unicode.GetBytes(stringIJustReadFromFile). Note that UTF-16 is exposed as Encoding.Unicode; there is no Encoding.UTF16 property.
For ASCII, use System.Text.Encoding.ASCII.GetBytes(stringIJustReadFromFile); then you can use a StreamWriter to write the result to other files.
My question is very simple, but at the moment I don't know how to do this. I have a string in ISO-8859-1 format and I need to convert it to UTF-8. I need to do it in C# with the Windows Phone 7 SDK. How can I do it? Thanks.
The MSDN page for the Encoding class lists the recognized encodings.
28591 iso-8859-1 Western European (ISO)
For your question the correct choice is iso-8859-1 which you can pass to Encoding.GetEncoding.
var inputEncoding = Encoding.GetEncoding("iso-8859-1");
var text = inputEncoding.GetString(input);
var output = Encoding.UTF8.GetBytes(text);
Two clarifications on the previous answers:
There is no Encoding.GetText method (unless it was introduced specifically for the WP7 framework). The method should presumably be Encoding.GetString.
The Encoding.GetString method takes a byte[] parameter, not a string. All strings in .NET are internally represented as UTF-16; there is no way of having a “string in ISO-8859-1 format”. Thus, you must be careful how you read your source (file, network), rather than how you process your string.
For example, to read from a text file encoded in ISO-8859-1, you could use:
string text = File.ReadAllText(path, Encoding.GetEncoding("iso-8859-1"));
To save to a text file encoded in UTF-8, you could use:
File.WriteAllText(path, text, Encoding.UTF8);
Reply to comment:
Yes. You can use Encoding.GetString to decode your byte array (assuming it contains character values for text under a particular encoding) into a string, and Encoding.GetBytes to convert your string back into a byte array (possibly of a different encoding), as demonstrated in the other answers.
The concept of “encoding” relates to how byte sequences (be they a byte[] array in memory or the content of a file on disk) are to be interpreted. The string class is oblivious to the encoding that the text was read from, or should be saved to.
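To make that concrete, here is a small sketch (the byte value is chosen for illustration) showing the same byte interpreted under two encodings:

```csharp
using System;
using System.Text;

// The same byte sequence means different text under different encodings.
// 0xE5 is 'å' in ISO-8859-1; on its own it is an invalid UTF-8 sequence,
// so the UTF-8 decoder substitutes the replacement character U+FFFD.
byte[] bytes = { 0xE5 };
Console.WriteLine(Encoding.GetEncoding("iso-8859-1").GetString(bytes)); // å
Console.WriteLine(Encoding.UTF8.GetString(bytes));                      // � (U+FFFD)
```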
You can use Convert which works pretty well, especially when you have byte array:
var latinString = "år";
Encoding latinEncoding = Encoding.GetEncoding("iso-8859-1");
Encoding utf8Encoding = Encoding.UTF8;
byte[] latinBytes = latinEncoding.GetBytes(latinString);
byte[] utf8Bytes = Encoding.Convert(latinEncoding, utf8Encoding, latinBytes);
var utf8String = Encoding.UTF8.GetString(utf8Bytes);
I have to create some sort of text file in which there are numbers and Hebrew letters encoded as ASCII.
This is file creation method which triggers on ButtonClick
protected void ToFile(object sender, EventArgs e)
{
    filename = Transactions.generateDateYMDHMS();
    string path = string.Format("{0}{1}.001", Server.MapPath("~/transactions/"), filename);
    StreamWriter sw = new StreamWriter(path, false, Encoding.ASCII);
    sw.WriteLine("hello");
    sw.WriteLine(Transactions.convertUTF8ASCII("שלום"));
    sw.WriteLine("bye");
    sw.Close();
}
As you can see, I use the Transactions.convertUTF8ASCII() static method to convert what is presumably a Unicode string from .NET to an ASCII representation of it. I use it on the Hebrew word 'shalom' and get back '????' instead of the result I need.
Here is the method.
public static string convertUTF8ASCII(string initialString)
{
    byte[] unicodeBytes = Encoding.Unicode.GetBytes(initialString);
    byte[] asciiBytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, unicodeBytes);
    return Encoding.ASCII.GetString(asciiBytes);
}
Instead of having the initial word encoded to ASCII, I get '????' in the file I create; even when I run the debugger I get the same result.
What am I doing wrong?
You can't simply translate arbitrary unicode characters to ASCII. The best it can do is discard the unsupportable characters, hence ????. Obviously the basic 7-bit characters will work, but not much else. I'm curious as to what the expected result is?
If you need this for transfer (rather than representation) you might consider base-64 encoding of the underlying UTF8 bytes.
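A minimal sketch of that transfer approach (the sample text is arbitrary):

```csharp
using System;
using System.Text;

// One way to move arbitrary Unicode text over an ASCII-only channel:
// base64-encode the UTF-8 bytes, and reverse both steps on the other end.
string text = "שלום";
string wire = Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
string back = Encoding.UTF8.GetString(Convert.FromBase64String(wire));
Console.WriteLine(wire);           // the ASCII-safe form
Console.WriteLine(back == text);   // True
```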
Do you perhaps mean ANSI, not ASCII?
ASCII doesn't define any Hebrew characters. There are, however, some ANSI code pages which do, such as "windows-1255".
In which case, you may want to consider looking at:
http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx
In short, where you have:
Encoding.ASCII
You would replace it with:
Encoding.GetEncoding(1255)
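For illustration, here is a sketch of the ToFile idea above rewritten to use windows-1255 instead of ASCII (the file path is arbitrary; the provider registration is an assumption about running on modern .NET):

```csharp
using System;
using System.IO;
using System.Text;

// On .NET Core / .NET 5+, code-page encodings such as windows-1255 must be
// registered first; on .NET Framework, Encoding.GetEncoding(1255) works
// directly and this provider type does not exist.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Same shape as the ToFile method above, but with windows-1255.
Encoding hebrew = Encoding.GetEncoding(1255);
string path = Path.GetTempFileName();
using (var sw = new StreamWriter(path, false, hebrew))
{
    sw.WriteLine("hello");
    sw.WriteLine("שלום");   // survives: windows-1255 defines the Hebrew letters
    sw.WriteLine("bye");
}

// Reading back with the same encoding recovers the Hebrew text.
string[] lines = File.ReadAllLines(path, hebrew);
Console.WriteLine(lines[1]);   // שלום
File.Delete(path);
```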
Are you perhaps asking about transliteration (as in "Romanization") instead of encoding conversion, if you really are talking about ASCII?
I just faced the same issue when the original XML file was in ASCII encoding.
As Userx suggested, use Encoding.GetEncoding(1255):
XDocument.Parse(System.IO.File.ReadAllText(xmlPath, Encoding.GetEncoding(1255)));
So now my XDocument can read Hebrew even if the XML file was saved as ASCII.