I am in the process of porting data from a legacy system, but I am unsure what encoding it uses internally to store the data. I have noticed that the data corresponds to extended ASCII code values (e.g. the character ë, small letter e with diaeresis, is stored as byte value 137, as per this chart).
I need to encode the data as ISO-8859-1 for the destination system, but obviously using the data as-is yields incorrect results (byte 137 is not ë in ISO-8859-1; in the closely related Windows-1252 it is the per mille sign, as per this chart).
I need some advice on which encoding to use when reading the data, i.e. an encoding whose code points correspond to these byte values.
I found my answer in this SO post. It turns out that code page 437 corresponds to the extended ASCII character codes. I was thus able to re-encode the data as follows:
var output = Encoding.Convert(Encoding.GetEncoding(437), Encoding.GetEncoding("ISO-8859-1"), input);
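For illustration, here is a minimal, self-contained sketch of the round trip (assuming the legacy byte 137 really is code page 437's ë; on .NET Core the 437 code page additionally requires registering the code pages encoding provider):
using System.Text;

byte[] input = { 137 };                        // ë in code page 437 ("extended ASCII")
byte[] output = Encoding.Convert(
    Encoding.GetEncoding(437),                 // source: the legacy OEM code page
    Encoding.GetEncoding("ISO-8859-1"),        // destination encoding
    input);
// output[0] is now 235 (0xEB), the ISO-8859-1 code for ë.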
I have a string that I receive from a third party app and I would like to display it correctly in any language using C# on my Windows Surface.
Due to incorrect encoding, a piece of my string looks like this in Farsi (Persian-Arabic):
Ù…Ø¯Ù„-Ø±Ù†Ú¯-Ù…ÙˆÛŒ-Ø¬Ø¯ÛŒØ¯-5-436x500
whereas it should look like this:
مدل-رنگ-موی-جدید-5-436x500
This link converts it correctly:
http://www.ltg.ed.ac.uk/~richard/utf-8.html
How can I do it in C#?
It is very hard to tell exactly what is going on from the description in your question. We would all be much better off if you provided an example of what is happening with a single character instead of a whole string, and if you chose an example character that does not belong to an exotic character set, for example the bullet character (U+2022) or something like that.
Anyhow, what is probably happening is this:
The letter "ر" is represented in UTF-8 as the byte sequence D8 B1, but what you see is "Ø±", and that's because in UTF-16 Ø is U+00D8 and ± is U+00B1. So the incoming text was originally in UTF-8, but in the process of importing it into a .NET Unicode string in your application it was incorrectly interpreted as being in some 8-bit character set such as ANSI or Latin-1. That's why you now have a Unicode string which appears to contain garbage.
However, the process of converting 8-bit characters to Unicode is for the most part not destructive, so all of the information is still there; that's why the UTF-8 tool that you linked to can still more or less make sense of it.
What you need to do is convert the string back to an array of ANSI (or Latin-1, whatever) bytes, and then reconstruct the string the right way, this time decoding the bytes as UTF-8.
I cannot easily reproduce your situation, so here are some things to try:
byte[] bytes = System.Text.Encoding.GetEncoding(1252).GetBytes( garbledUnicodeString ); // Windows-1252; there is no Encoding.Ansi in .NET
followed by
string properUnicodeString = System.Text.Encoding.UTF8.GetString( bytes );
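Putting the two steps together, a minimal sketch (assuming the mis-decoding happened via Windows-1252, the usual "ANSI" code page on Western systems, and using a hypothetical garbled input) might look like this:
using System.Text;

string garbled = "Ø±";                                        // what arrived; should have been "ر"
byte[] raw = Encoding.GetEncoding(1252).GetBytes(garbled);    // recover the original bytes: D8 B1
string repaired = Encoding.UTF8.GetString(raw);               // decode them as UTF-8 this time -> "ر"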
I want to convert the ASCII-encoded text input by my users into UTF-8 encoding, so that I can display it using any Unicode font. For example, I want the English letter 'l' in ASCII to be displayed as 'ക' in Unicode. I think I would require a mapping system too, so that I can map 'l' to 'ക'. Please help me solve this issue.
Your text is in ISCII (Indian Script Code for Information Interchange). You need to convert the ISCII bytes, using the proper code page, to Unicode. The following methods should do the job: Convert will convert a given text from one encoding to another, and GetEncoding will provide you with the Encoding objects to be used by the Convert method.
Example code can be found here: http://www.dotnetframework.org/default.aspx/Net/Net/3#5#50727#3053/DEVDIV/depot/DevDiv/releases/whidbey/netfxsp/ndp/clr/src/BCL/System/Text/ISCIIEncoding#cs/1/ISCIIEncoding#cs
Code page identifiers can be found here:
http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx
public static byte[] Convert(System.Text.Encoding srcEncoding, System.Text.Encoding dstEncoding, byte[] bytes)
Member of System.Text.Encoding
Summary:
Converts an entire byte array from one encoding to another.
Parameters:
srcEncoding: The encoding format of bytes.
dstEncoding: The target encoding format.
bytes: The array of bytes to convert.
Returns:
An array of type System.Byte containing the results of converting bytes from srcEncoding to dstEncoding.
and this
public static System.Text.Encoding GetEncoding(int codepage)
Member of System.Text.Encoding
Summary:
Returns the encoding associated with the specified code page identifier.
Parameters:
codepage: The code page identifier of the preferred encoding. -or- 0, to use the default encoding.
Returns:
The System.Text.Encoding associated with the specified code page.
As per the Wikipedia article, the code page identifier for ISCII Malayalam is 57009.
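A minimal sketch under that assumption (the input bytes are whatever your legacy source provides; code page 57009 is available on the full .NET Framework, or on .NET Core after registering the code pages encoding provider):
using System.Text;

// Convert raw ISCII-Malayalam bytes to a .NET Unicode string.
static string IsciiMalayalamToUnicode(byte[] isciiBytes)
{
    byte[] utf8Bytes = Encoding.Convert(
        Encoding.GetEncoding(57009),    // ISCII Malayalam code page
        Encoding.UTF8,
        isciiBytes);
    return Encoding.UTF8.GetString(utf8Bytes);
}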
Encoding.UTF8.GetString(Encoding.ASCII.GetBytes(input))
Your question makes no sense. Changing the encoding from ASCII to UTF-8 does not magically turn an 'l' into a 'ക'; it only changes the byte representation of the 'l' (actually, since ASCII is a subset of UTF-8, it does not even do that here; it does nothing).
What you probably want is some kind of transliteration between the Latin and Malayalam alphabet, but that is something completely different.
I read some data from a device, then I send this data to a web server via XML. The data has to be represented in XML, so I need to convert the characters with decimal values 0-31, because these characters cannot appear literally in XML.
The question is: how can I convert the characters between 0 and 31 (decimal) in a string into something like [00]abcde[01]fgh[02]...?
Are there any built-in functions in the .NET Framework, or any accepted pattern?
Thanks
You should use standard XML encoding:
Your XML API will do that for you, so you don't need to worry about anything.
You can simply encode the number as an XML numeric character reference: you write &# followed by the number and a semicolon,
so 1 becomes &#1; and 13 becomes &#13;,
and so on and so forth.
However, as noted by dan04, you can't represent 0 as a numeric character reference, so in the case where your data might include 0 you will have to use a different encoding. You could encode the entire binary data as Base64.
Most XML toolboxes will do the encoding to NCRs for you, though, so you really shouldn't have to worry about that.
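If you do end up writing the escaping yourself, a minimal sketch (assuming the consumer accepts numeric character references for control characters, which strict XML 1.0 parsers will not; in that case fall back to Base64 as suggested above):
using System.Text;

static string EscapeControlChars(string input)
{
    var sb = new StringBuilder(input.Length);
    foreach (char c in input)
    {
        if (c < 32)
            sb.Append("&#").Append((int)c).Append(';');  // e.g. '\u0001' -> "&#1;"
        else
            sb.Append(c);
    }
    return sb.ToString();
}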
I'm not entirely sure if the question even makes sense. I'm taking a byte array from an ID3 tag and converting it to a string. Most text frames in an ID3 tag use ISO-8859-1 encoding, but it depends on the frame. In any case, if you look up what 0x00 is in the ISO-8859-1 codes, it is invalid.
To further complicate things, either due to programmer error or just poor formatting, some of the strings end in 0x00 and some do not.
When converting a series of bytes into a string using ISO-8859-1 encoding, do you have to manually check the end of the string to see whether it is a null? Or will the encoding object, through whatever method it uses to convert in the first place, deal with the null properly? Furthermore, is there some sort of function that could normalize or "fix" a null-terminated string?
When you try to display these strings they do not display properly.
I am using C# for this particular project.
Some extra info here about ID3 Tags: ID3 Specs
Or am I completely misunderstanding the whole thing? Is a null terminator simply a way a particular language handles strings and it has nothing to do with encoding?
Edit: I used System.Text.Encoding.GetEncoding("iso-8859-1") followed by a GetString call
If you use Encoding.GetEncoding(28591), it just converts a byte 0 to the Unicode U+0000. Encodings generally assume that they have to convert all the bytes - they don't look for terminators.
This treatment of 0 as Unicode 0 is in line with the Wikipedia description:
In 1992, the IANA registered the character map ISO_8859-1:1987, more commonly known by its preferred MIME name of ISO-8859-1 (note the extra hyphen over ISO 8859-1), a superset of ISO 8859-1, for use on the Internet. This map assigns the C0 and C1 control characters to the unassigned code values thus provides for 256 characters via every possible 8-bit value.
The C0 and C1 control characters page includes:
0: Originally used to allow gaps to be left on paper tape for edits. Later used for padding after a code that might take a terminal some time to process (e.g. a carriage return or line feed on a printing terminal). Now often used as a string terminator, especially in the C programming language.
Sample code:
using System;
using System.Text;
class Program
{
static void Main(string[] args)
{
byte[] data = { 0, 0 };
Encoding latin1 = Encoding.GetEncoding(28591);
string text = latin1.GetString(data);
Console.WriteLine(text.Length); // 2
Console.WriteLine((int) text[0]); // 0
Console.WriteLine((int) text[1]); // 0
}
}
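If you do want to strip a trailing terminator yourself, a minimal sketch (assuming the frame data only ever has trailing NULs, not embedded ones; the byte array here is illustrative):
using System.Text;

byte[] frameData = { (byte)'A', (byte)'b', (byte)'c', 0 };     // hypothetical ID3 text frame payload
string raw = Encoding.GetEncoding(28591).GetString(frameData); // "Abc\0"
string text = raw.TrimEnd('\0');                               // "Abc"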
Happily, ASCII, ISO-8859-1 and Unicode all agree on codepoints in the range 0..127. Thus your character '\0' will be encoded identically in ASCII, ISO-8859-1 and UTF-8.
If your program assigns special semantics to the zero byte, you have to take care of that appropriately.
Additional information: Unable to translate Unicode character \uDFFF at index 195 to specified code page.
I made an algorithm whose results are binary values (of different lengths). I transformed them into uint values, then into chars, and saved them into a StringBuilder, as you can see below:
uint n = Convert.ToUInt16(tmp_chars, 2);
_koded_text.Append(Convert.ToChar(n));
My problem is that when I try to save those values into a .txt file, I get the previously mentioned error.
StreamWriter file = new StreamWriter(filename);
file.WriteLine(_koded_text);
file.Close();
What I am saving is this: "忿췾᷿]볯褟ﶞ痢ﳻ��伞ﳴ㿯ﹽ翼蛿㐻ﰻ筹��﷿₩マ랿鳿⏟麞펿"... which are some weird signs.
What I need is to convert those binary values into some kind of string of characters and save it to a .txt file. I saw somewhere that converting to UTF-8 should help, but I don't know how. Would changing the file's encoding help too?
You cannot transform binary data to a string directly. The Unicode characters in a string are encoded using UTF-16 in .NET. That encoding uses two bytes per character, providing 65536 distinct values. Unicode, however, has over one million code points. To make that work, the Unicode code points above \uffff (above the BMP, the Basic Multilingual Plane) are encoded with a surrogate pair. The first one has a value between 0xD800 and 0xDBFF, the second between 0xDC00 and 0xDFFF. That provides 2 ^ (10 + 10) = 1 million additional codes.
You can perhaps see where this leads: in your case the encoder encounters a low surrogate value (0xDFFF) that isn't preceded by a high surrogate. That's illegal. There are lots more possible mishaps: several code points are unassigned, and several are diacritics that get mangled when the string is normalized.
You just can't make this work. Base64 encoding is the standard way to carry binary data across a text stream. It uses 6 bits per character, 3 bytes require 4 characters. The character set is ASCII so the odds of the receiving program decoding the character back to binary incorrectly are minimal. Only a decades old IBM mainframe that uses EBCDIC could get you into trouble. Or just plain avoid encoding to text and keep it binary.
Since you're trying to encode binary data to a text stream this SO question already contains an answer to the question: "How do I encode something as base64?" From there plain ASCII/ANSI text is fine for the output encoding.
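For completeness, a minimal sketch of the Base64 round trip (the file name and sample bytes are illustrative, not taken from the question):
using System;
using System.IO;

byte[] binaryData = { 0xDF, 0xFF, 0x00, 0x42 };                 // arbitrary sample bytes
string base64 = Convert.ToBase64String(binaryData);             // safe ASCII text
File.WriteAllText("output.txt", base64);                        // plain ASCII/ANSI output is fine
byte[] roundTripped = Convert.FromBase64String(File.ReadAllText("output.txt"));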