How can I reverse this text encoding?

How can I reverse this text encoding? - c#

We have the encoding method below that accepts an input string from a legacy system file (in the Unicode format of a VB6 string) and the name of an encoding. It applies the encoding and returns a string that displays correctly in our newer web applications. As our newer applications have a reporting backend that still depends on the old formats I have a need to reverse the encoding to allow newly translated strings to be stored in the legacy files. Here are two examples of conversions done by the Encode method.
Encode("µn¤J¦WºÙ", "BIG5") returns 登入名稱
Encode("Çàðåãèñòðèðîâàííîå èìÿ", "windows-1251") returns Зарегистрированное имя
To reverse these encodings I have been trying various encoding steps based on questions found here and elsewhere but have thus far succeeded only in producing output that appears identical to the input, is composed entirely of question marks or is composed of a mixture of ASCII characters and question marks different to the original input.
The encoding method was written by a departed colleague and I must admit that I don't fully understand why it has the coded loops, while all other examples I find simply get the characters from the string, then the bytes from those using the input encoding and finally get the string from the bytes using the output encoding. If I try removing the coded loops and just doing those three steps the method no longer returns the expected result.
Here is the encoding method, and my question is, how can I create a corresponding Decode method that reverses what it does?
private static string Encode(string src, string encoding)
{
if (String.IsNullOrWhiteSpace(encoding)) return src;
Encoding unicode = Encoding.Unicode;
Encoding sourceEncoding = Encoding.GetEncoding(encoding);
char[] srcChars = src.ToCharArray();
byte[] srcBytes = sourceEncoding.GetBytes(srcChars);
if (srcChars.Length == srcBytes.Length)
{
for (int i = 0; i < srcChars.Length; i++)
if ((int)srcChars[i] < 256)
srcBytes[i] = (byte)srcChars[i];
}
else
{
srcBytes = new byte[srcChars.Length];
for (int i = 0; i < srcChars.Length; i++)
srcBytes[i] = (byte)srcChars[i];
}
byte[] unicodeBytes = Encoding.Convert(sourceEncoding, unicode, srcBytes);
return unicode.GetString(unicodeBytes);
}

Related

Converting unknown characters to Greek characters

I have a file which contains the following characters:
ÇËÅÊÔÑÏÖÏÑÇÓÇ ÁÉÌÏÓÖÁÉÑÉÍÇÓ
I am trying to convert that to Greek words and the result should be:
ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΑΙΜΟΣΦΑΙΡΙΝΗΣ
The file that the above value is stored in in Unicode format.
I am applying all possible encodings but no luck in the conversion.
private void Convert()
{
string textFilePhysicalPath = (#"C:\Users\Nec\Desktop\a.txt");
string contents = File.ReadAllText(textFilePhysicalPath);
List<string> sLines = new List<string>();
// For every encoding, get the property values.
EncodingInfo ei;
foreach (var ei in Encoding.GetEncodings())
{
Encoding e = ei.GetEncoding();
Encoding iso = Encoding.GetEncoding(ei.Name);
Encoding utfx = Encoding.Unicode;
byte[] utfBytes = utfx.GetBytes(contents);
byte[] isoBytes = Encoding.Convert(utfx, iso, utfBytes);
string msg = iso.GetString(isoBytes);
string xx = (ei.Name + " " + msg);
sLines.Add(xx);
}
using (StreamWriter file = new StreamWriter(#"C:\Users\Nec\Desktop\result.txt"))
{
foreach (var line in sLines)
file.WriteLine(line);
}
}
A website that converts it correctly is http://www.online-decoder.com/el
but even when I use the ISO-8859-1 to ISO-8859-7 it still doesn't work in .NET.

This code converts the string from the C# which is UTF-16 to an 8-bit representation using the common ISO-8859-1 codepage. Then it converts it back to UTF-16 using the greek codepage windows-1253. The result is ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΑΙΜΟΣΦΑΙΡΙΝΗΣ as you want.
string errorneousString = "ÇËÅÊÔÑÏÖÏÑÇÓÇ ÁÉÌÏÓÖÁÉÑÉÍÇÓ";
byte[] asIso88591Bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(errorneousString);
string asGreekString = Encoding.GetEncoding("windows-1253").GetString(asIso88591Bytes);
Console.OutputEncoding = System.Text.Encoding.UTF8;
Console.WriteLine(asGreekString);
Edit: Since your file is encoded in an 8-bit format, you need to specify the codepage when reading it. Use this:
string fileContents = File.ReadAllText("189.dat", Encoding.GetEncoding("windows-1253"));
Console.OutputEncoding = System.Text.Encoding.UTF8;
Console.WriteLine(fileContents);
That reads the content as
'CS','C.S.F. EXAMINATION','ΕΞΕΤΑΣΗ Ε.Ν.Υ.' 'EH','Hb
ELECTROPHORESIS','ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΑΙΜΟΣΦΑΙΡΙΝΗΣ' 'EP','PROTEIN
ELECTROPHORESIS','ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΠΡΩΤΕΙΝΩΝ' 'FB','HAEMATOLOGY -
FBC','ΓΕΝΙΚΗ ΕΞΕΤΑΣΗ ΑΙΜΑΤΟΣ - FBC' 'FR','FREE TEXT', 'GT','GLUCOSE
TOLERANCE TEST','ΔΟΚΙΜΑΣΙΑ ΑΝΟΧΗΣ ΓΛΥΚΟΖΗΣ'
'MI','MICROBIOLOGY','ΜΙΚΡΟΒΙΟΛΟΓΙΑ' 'NO','NORMAL FORM','ΚΑΝΟΝΙΚΟ
ΔΕΛΤΙΟ' 'RE','RENAL CALCULUS','ΧΗΜΙΚΗ ΑΝΑΛΥΣΗ ΟΥΡΟΛΙΘΟΥ' 'SE','SEMEN
ANALYSIS','ΣΠΕΡΜΟΔΙΑΓΡΑΜΜΑ' 'SP','SPECIAL PATHOLOGY','SPECIAL
PATHOLOGY' 'ST','STOOL EXAMINATION
','ΕΞΕΤΑΣΗ ΚΟΠΡΑΝΩΝ' 'SW','SEMEN WASH','SEMEN WASH'
'TH','THROMBOPHILIA PANEL','THROMBOPHILIA PANEL' 'UR','URINE
ANALYSIS','ΓΕΝΙΚΗ ΕΞΕΤΑΣΗ ΟΥΡΩΝ' 'WA','WATER CULTURE REPORT','ΑΝΑΛΥΣΗ
ΝΕΡΟΥ' 'WI','WIDAL ','ΑΝΟΣΟΒΙΟΛΟΓΙΑ'

This is an ASCII file stored using the Greek (1253) codepage which was read using a different codepage.
File.ReadAllText tries to detect whether the file is UTF16 or UTF8 by checking the BOM bytes and falls back to UTF8 by default. UTF8 is essentially the 7-bit ANSI codepage for single-byte text, which means that trying to read a nonUnicode, nonANSI file like this will result in garbled text.
To load a file using a specific encoding/codepage, just pass the encoding as the Encoding parametter, eg :
var enc = Encoding.GetEncoding(1253);
var text=File.ReadAllText(#"189.dat",enc);
Strings in .NET are Unicode, specifically UTF16. This means that text doesn't need any conversions. Its contents will be :
'CS','C.S.F. EXAMINATION','ΕΞΕΤΑΣΗ Ε.Ν.Υ.'
'EH','Hb ELECTROPHORESIS','ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΑΙΜΟΣΦΑΙΡΙΝΗΣ'
'EP','PROTEIN ELECTROPHORESIS','ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΠΡΩΤΕΙΝΩΝ'
'FB','HAEMATOLOGY - FBC','ΓΕΝΙΚΗ ΕΞΕΤΑΣΗ ΑΙΜΑΤΟΣ - FBC'
'FR','FREE TEXT',
'GT','GLUCOSE TOLERANCE TEST','ΔΟΚΙΜΑΣΙΑ ΑΝΟΧΗΣ ΓΛΥΚΟΖΗΣ'
'MI','MICROBIOLOGY','ΜΙΚΡΟΒΙΟΛΟΓΙΑ'
'NO','NORMAL FORM','ΚΑΝΟΝΙΚΟ ΔΕΛΤΙΟ'
'RE','RENAL CALCULUS','ΧΗΜΙΚΗ ΑΝΑΛΥΣΗ ΟΥΡΟΛΙΘΟΥ'
'SE','SEMEN ANALYSIS','ΣΠΕΡΜΟΔΙΑΓΡΑΜΜΑ'
'SP','SPECIAL PATHOLOGY','SPECIAL PATHOLOGY'
'ST','STOOL EXAMINATION ','ΕΞΕΤΑΣΗ ΚΟΠΡΑΝΩΝ'
'SW','SEMEN WASH','SEMEN WASH'
'TH','THROMBOPHILIA PANEL','THROMBOPHILIA PANEL'
'UR','URINE ANALYSIS','ΓΕΝΙΚΗ ΕΞΕΤΑΣΗ ΟΥΡΩΝ'
'WA','WATER CULTURE REPORT','ΑΝΑΛΥΣΗ ΝΕΡΟΥ'
'WI','WIDAL ','ΑΝΟΣΟΒΙΟΛΟΓΙΑ'
UTF16 uses two bytes for every character. If a UTF16 file was opened in a hex browser, every other character would be a NUL (0x00). It's not UTF8 either - outside the 7-bit ANSI range each character uses two or more bytes that always have the high bit set. Instead of one garbled character there would be two at least.
File and stream methods that could be affected by encoding or culture in .NET always have an overload that accepts an Encoding or CultureInfo parameter.
Console
Writing the output to the Console may display in garbled text. The text isn't really converted, just displayed the wrong way.
While the console can display Unicode text it assumes that the system's codepage is used by default. In the past it couldn't even support UTF8 as a codepage - there was no such option in the settings. After all, the label for the system locale settings is Language used for non-Unicode programs.
The latest Windows 10 Insider releases offer UTF8 as the system codepage as a beta option.
To ensure Unicode text appears properly in the console one would have to set its encoding to UTF8, eg :
var text=File.ReadAllText(#"189.dat",enc);
Console.OutputEncoding = Encoding.UTF8;
Console.WriteLine(text);

I don't know what codepage this is, but it seems to be simply offset by some values. You can convert the source string to the target string by adding 11 to the first byte and 16 to the second byte:
var input = Encoding.Default.GetBytes("ÇËÅÊÔÑÏÖÏÑÇÓÇ ÁÉÌÏÓÖÁÉÑÉÍÇÓ");
for (var i = 0; i < input.Length; i++)
{
if (input[i] == 32) continue;
input[i++] += 11;
input[i] += 16;
}
var output = Encoding.UTF8.GetString(input);
Result: ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΑΙΜΟΣΦΑΙΡΙΝΗΣ
Not sure if this is a solution, but it may give you a hint.

Just it c#
HtmlWeb web = new HtmlWeb();
web.OverrideEncoding = Encoding.GetEncoding(65001);

Encoding not converting

An ASP.NET page (ashx) receives a GET request with a UTF8 string. It reads a SqlServer database with Windows-1255 data.
I can't seem to get them to work together. I've used information gathered on SO (mainly Convert a string's character encoding from windows-1252 to utf-8) as well as msdn on the subject.
When I run anything through the functions below - it always ends up the same as it started - not converted at all.
Is something done wrong?
EDIT
What I'm specifically trying to do (getData returns a Dictionary<int, string>):
getData().Where(a => a.Value.Contains(context.Request.QueryString["q"]))
Result is empty, unless I send a "neutral" character such as "'" or ",".
CODE
string windows1255FromUTF8(string p)
{
Encoding win = Encoding.GetEncoding(1255);
Encoding utf8 = Encoding.UTF8;
byte[] utfBytes = utf8.GetBytes(p);
byte[] winBytes = Encoding.Convert(utf8, win, utfBytes);
return win.GetString(winBytes);
}
string UTF8FromWindows1255(string p)
{
Encoding win = Encoding.GetEncoding(1255);
Encoding utf8 = Encoding.UTF8;
byte[] winBytes = win.GetBytes(p);
byte[] utfBytes = Encoding.Convert(win, utf8, winBytes);
return utf8.GetString(utfBytes);
}

There is nothing wrong with the functions, they are simply useless.
What the functions do is to encode the strings into bytes, convert the data from one encoding to another, then decode the bytes back to a string. Unless the string contains a character that is not possible to encode using the windows-1255 encoding, the returned value should be identical to the input.
Strings in .NET doesn't have an encoding. If you get a string from a source where the text was encoded using for example UTF-8, once it's decoded into a string it doesn't have that encoding any more. You don't have to do anyting to a string to use it when the destination has a specific encoding, whatever library you are using that takes the string will take care of the encoding.

For some reason this worked:
byte[] fromBytes = (fromEncoding.UTF8).GetBytes(myString);
string finalString = (Encoding.GetEncoding(1255)).GetString(fromBytes);
Switching encoding without the conversion...

Strip the byte order mark from string in C#

In C#, I have a string that I'm obtaining from WebClient.DownloadString. I've tried setting client.Encoding to new UTF8Encoding(false), but that's made no difference - I still end up with a byte order mark for UTF-8 at the beginning of the result string. I need to remove this (to parse the resulting XML with LINQ), and want to do so in memory.
So I have a string that starts with \x00EF\x00BB\x00BF, and I want to remove that if it exists. Right now I'm using
if (xml.StartsWith(ByteOrderMarkUtf8))
{
xml = xml.Remove(0, ByteOrderMarkUtf8.Length);
}
but that just feels wrong. I've tried all sorts of code with streams, GetBytes, and encodings, and nothing works. Can anyone provide the "right" algorithm to strip a BOM from a string?

I recently had issues with the .NET 4 upgrade, but until then the simple answer is
String.Trim()
removes the BOM up until .NET 3.5.
However, in .NET 4 you need to change it slightly:
String.Trim(new char[]{'\uFEFF'});
That will also get rid of the byte order mark, though you may also want to remove the ZERO WIDTH SPACE (U+200B):
String.Trim(new char[]{'\uFEFF','\u200B'});
This you could also use to remove other unwanted characters.
Some further information is from
String.Trim Method:
The .NET Framework 3.5 SP1 and earlier versions maintain an internal list of white-space characters that this method trims. Starting with the .NET Framework 4, the method trims all Unicode white-space characters (that is, characters that produce a true return value when they are passed to the Char.IsWhiteSpace method). Because of this change, the Trim method in the .NET Framework 3.5 SP1 and earlier versions removes two characters, ZERO WIDTH SPACE (U+200B) and ZERO WIDTH NO-BREAK SPACE (U+FEFF), that the Trim method in the .NET Framework 4 and later versions does not remove. In addition, the Trim method in the .NET Framework 3.5 SP1 and earlier versions does not trim three Unicode white-space characters: MONGOLIAN VOWEL SEPARATOR (U+180E), NARROW NO-BREAK SPACE (U+202F), and MEDIUM MATHEMATICAL SPACE (U+205F).

I had some incorrect test data, which caused me some confusion. Based on How to avoid tripping over UTF-8 BOM when reading files I found that this worked:
private readonly string _byteOrderMarkUtf8 =
Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
public string GetXmlResponse(Uri resource)
{
string xml;
using (var client = new WebClient())
{
client.Encoding = Encoding.UTF8;
xml = client.DownloadString(resource);
}
if (xml.StartsWith(_byteOrderMarkUtf8, StringComparison.Ordinal))
{
xml = xml.Remove(0, _byteOrderMarkUtf8.Length);
}
return xml;
}
Setting the client Encoding property correctly reduces the BOM to a single character. However, XDocument.Parse still will not read that string. This is the cleanest version I've come up with to date.

This works as well
int index = xmlResponse.IndexOf('<');
if (index > 0)
{
xmlResponse = xmlResponse.Substring(index, xmlResponse.Length - index);
}

A quick and simple method to remove it directly from a string:
private static string RemoveBom(string p)
{
string BOMMarkUtf8 = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
if (p.StartsWith(BOMMarkUtf8))
p = p.Remove(0, BOMMarkUtf8.Length);
return p.Replace("\0", "");
}
How to use it:
string yourCleanString=RemoveBom(yourBOMString);

If the variable xml is of type string, you did something wrong already - in a character string, the BOM should not be represented as three separate characters, but as a single code point.
Instead of using DownloadString, use DownloadData, and parse byte arrays instead. The XML parser should recognize the BOM itself, and skip it (except for auto-detecting the document encoding as UTF-8).

I had a very similar problem (I needed to parse an XML document represented as a byte array that had a byte order mark at the beginning of it). I used one of Martin's comments on his answer to come to a solution. I took the byte array I had (instead of converting it to a string) and created a MemoryStream object with it. Then I passed it to XDocument.Load, which worked like a charm. For example, let's say that xmlBytes contains your XML in UTF-8 encoding with a byte mark at the beginning of it. Then, this would be the code to solve the problem:
var stream = new MemoryStream(xmlBytes);
var document = XDocument.Load(stream);
It's that simple.
If starting out with a string, it should still be easy to do (assume xml is your string containing the XML with the byte order mark):
var bytes = Encoding.UTF8.GetBytes(xml);
var stream = new MemoryStream(bytes);
var document = XDocument.Load(stream);

I wrote the following post after coming across this issue.
Essentially instead of reading in the raw bytes of the file's contents using the BinaryReader class, I use the StreamReader class with a specific constructor which automatically removes the byte order mark character from the textual data I am trying to retrieve.

It's of course best if you can strip it out while still on the byte array level to avoid unwanted substrings / allocs. But if you already have a string, this is perhaps the easiest and most performant way to handle this.
Usage:
string feed = ""; // input
bool hadBOM = FixBOMIfNeeded(ref feed);
var xElem = XElement.Parse(feed); // now does not fail
/// <summary>
/// You can get this or test it originally with: Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble())[0];
/// But no need, this way we have a constant. As these three bytes `[239, 187, 191]` (a BOM) evaluate to a single C# char.
/// </summary>
public const char BOMChar = (char)65279;
public static bool FixBOMIfNeeded(ref string str)
{
if (string.IsNullOrEmpty(str))
return false;
bool hasBom = str[0] == BOMChar;
if (hasBom)
str = str.Substring(1);
return hasBom;
}

Pass the byte buffer (via DownloadData) to string Encoding.UTF8.GetString(byte[]) to get the string rather than download the buffer as a string. You probably have more problems with your current method than just trimming the byte order mark. Unless you're properly decoding it as I suggest here, Unicode characters will probably be misinterpreted, resulting in a corrupted string.
Martin's answer is better, since it avoids allocating an entire string for XML that still needs to be parsed anyway. The answer I gave best applies to general strings that don't need to be parsed as XML.

I ran into this when I had a Base64 encoded file to transform into the string. While I could have saved it to a file and then read it correctly, here's the best solution I could think of to get from the byte[] of the file to the string (based lightly on TrueWill's answer):
public static string GetUTF8String(byte[] data)
{
byte[] utf8Preamble = Encoding.UTF8.GetPreamble();
if (data.StartsWith(utf8Preamble))
{
return Encoding.UTF8.GetString(data, utf8Preamble.Length, data.Length - utf8Preamble.Length);
}
else
{
return Encoding.UTF8.GetString(data);
}
}
Where StartsWith(byte[]) is the logical extension:
public static bool StartsWith(this byte[] thisArray, byte[] otherArray)
{
// Handle invalid/unexpected input
// (nulls, thisArray.Length < otherArray.Length, etc.)
for (int i = 0; i < otherArray.Length; ++i)
{
if (thisArray[i] != otherArray[i])
{
return false;
}
}
return true;
}

StreamReader sr = new StreamReader(strFile, true);
XmlDocument xdoc = new XmlDocument();
xdoc.Load(sr);

Yet another generic variation to get rid of the UTF-8 BOM preamble:
var preamble = Encoding.UTF8.GetPreamble();
if (!functionBytes.Take(preamble.Length).SequenceEqual(preamble))
preamble = Array.Empty<Byte>();
return Encoding.UTF8.GetString(functionBytes, preamble.Length, functionBytes.Length - preamble.Length);

Use a regex replace to filter out any other characters other than the alphanumeric characters and spaces that are contained in a normal certificate thumbprint value:
certficateThumbprint = Regex.Replace(certficateThumbprint, #"[^a-zA-Z0-9\-\s*]", "");
And there you go. Voila!! It worked for me.

I solved the issue with the following code
using System.Xml.Linq;
void method()
{
byte[] bytes = GetXmlBytes();
XDocument doc;
using (var stream = new MemoryStream(docBytes))
{
doc = XDocument.Load(stream);
}
}

How to deal with ISO-2022-JP ( and other character sets ) in a Twitter update?

Part of my application accepts arbitrary text and posts it as an Update to Twitter. Everything works fine, until it comes to posting foreign ( non ASCII/UTF7/8 ) character sets, then things no longer work.
For example, if someone posts:
に投稿できる
It ( within my code in Visual Studio debugger ) becomes:
=?ISO-2022-JP?B?GyRCJEtFajlGJEckLSRrGyhC?=
Googling has told me that this represents ( minus ? as delimiters )
=?ISO-2022-JP is the text encoding
?B means it is base64 encoded
?GyRCJEtFajlGJEckLSRrGyhC? Is the encoded string
For the life of me, I can't figure out how to get this string posted as an update to Twitter in it's original Japanese characters. As it stands now, sending '=?ISO-2022-JP?B?GyRCJEtFajlGJEckLSRrGyhC?=' to Twitter will result in exactly that getting posted. Ive also tried breaking the string up into pieces as above, using System.Text.Encoding to convert to UTF8 from ISO-2022-JP and vice versa, base64 decoded and not. Additionally, ive played around with the URL Encoding of the status update like this:
string[] bits = tweetText.Split(new char[] { '?' });
if (bits.Length >= 4)
{
textEncoding = System.Text.Encoding.GetEncoding(bits[1]);
xml = oAuth.oAuthWebRequest(TwitterLibrary.oAuthTwitter.Method.POST, url, "status=" + System.Web.HttpUtility.UrlEncode(decodedText, textEncoding));
}
No matter what I do, the results never end up back to normal.
EDIT:
Got it in the end. For those following at home, it was pretty close to the answer listed below in the end. It was just Visual Studios debugger was steering me the wrong way and a bug in the Twitter Library I was using. End result was this:
decodedText = textEncoding.GetString(System.Convert.FromBase64String(bits[3]));
byte[] originalBytes = textEncoding.GetBytes(decodedText);
byte[] utfBytes = System.Text.Encoding.Convert(textEncoding, System.Text.Encoding.UTF8, originalBytes);
// now, back to string form
decodedText = System.Text.Encoding.UTF8.GetString(utfBytes);
Thanks all.

This produced the output you are looking for:
using System;
using System.Text;
class Program {
static void Main(string[] args) {
string input = "に投稿できる";
Console.WriteLine(EncodeTwit(input));
Console.ReadLine();
}
public static string EncodeTwit(string txt) {
var enc = Encoding.GetEncoding("iso-2022-jp");
byte[] bytes = enc.GetBytes(txt);
char[] chars = new char[(bytes.Length * 3 + 1) / 2];
int len = Convert.ToBase64CharArray(bytes, 0, bytes.Length, chars, 0);
return "=?ISO-2022-JP?B?" + new string(chars, 0, len) + "?=";
}
}
Standards are great, there are so many to choose from. ISO never disappoints, there are no less than 3 ISO-2022-JP encodings. If you have trouble then also try encodings 50221 and 50222.

Your understanding of how the text is encoded seems correct. In python
'GyRCJEtFajlGJEckLSRrGyhC'.decode('base64').decode('ISO-2022-JP')
returns the correct unicode string. Note that you need to decode base64 first in order to get the ISO-2022-JP-encoded text.

Encoding Conversion problem

I've got a little problem changing the ecoding of a string. Actually I read from a DB strings that are encoded using the codepage 850 and I have to prepare them in order to be suitable for an interoperable WCF service.
From the DB I read characters \x10 and \x11 (triangular shapes) and i want to convert them to the Unicode format in order to prevent serialization/deserialization problem during WCF call. (Chars
and are not valid according of the XML specs even if WCF serialize them).
Now, I use following code in order to covert string encoding, but nothing happens. Result string is in fact identical to the original one.
I'm probably missing something...
Please help me!!!
Emanuele
static class UnicodeEncodingExtension
{
public static string Convert(this Encoding sourceEncoding, Encoding targetEncoding, string value)
{
string reEncodedString = null;
byte[] sourceBytes = sourceEncoding.GetBytes(value);
byte[] targetBytes = Encoding.Convert(sourceEncoding, targetEncoding, sourceBytes);
reEncodedString = sourceEncoding.GetString(targetBytes);
return reEncodedString;
}
}
class Program
{
private static Encoding Cp850Encoding = Encoding.GetEncoding(850);
private static Encoding UnicodeEncoding = Encoding.UTF8;
static void Main(string[] args)
{
string value;
string resultValue;
value = "\x10";
resultValue = Cp850Encoding.Convert(UnicodeEncoding, value);
value = "\x11";
resultValue = Cp850Encoding.Convert(UnicodeEncoding, value);
value = "\u25b6";
resultValue = UnicodeEncoding.Convert(Cp850Encoding, value);
value = "\u25c0";
resultValue = UnicodeEncoding.Convert(Cp850Encoding, value);
}
}

It seems you think there is a problem based on an incorrect understanding. But jmservera is correct - all strings in .NET are encoded internally as unicode.
You didn't say exactly what you want to accomplish. Are you experiencing a problem at the other end of the wire?
Just FYI, you can set the text encoding on a WCF binding with the textMessageEncoding element in the config file.

I suspect this line may be your culprit
reEncodedString = sourceEncoding.GetString(targetBytes);
which seems to take your target encoded string of bytes and asks your sourceEncoding to make a string out of them. I've not had a chance to verify it but I suspect the following might be better
reEncodedString = targetEncoding.GetString(targetBytes);

All the strings stored in string are in fact Unicode.Unicode. Read: Strings in .Net and C# and The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Edit: I suppose that you want the Convert function to automatically change \x11 to \u25c0, but the problem here is that \x11 is valid in almost any encoding, the differences usually start in character \x80, so the Convert function will maintain it even if you do that:
string reEncodedString = null;
byte[] unicodeBytes = UnicodeEncoding.Unicode.GetBytes(value);
byte[] sourceBytes = Encoding.Convert(Encoding.Unicode,
sourceEncoding, unicodeBytes);
You can see in unicode.org the mappings from CP850 to Unicode. So, for this conversion to happen you will have to change these characters manually.

byte[] sourceBytes =Encoding.Default.GetBytes(value)
Encoding.UTF8.GetString(sourceBytes)
this sequence usefull for download unicode file from service(for example xml file that contain persian character)

You should try this:
byte[] sourceBytes = sourceEncoding.GetBytes(value);
var convertedString = Encoding.UTF8.GetString(sourceBytes);

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

How can I reverse this text encoding? - c#

Related

Converting unknown characters to Greek characters

Encoding not converting

Strip the byte order mark from string in C#

How to deal with ISO-2022-JP ( and other character sets ) in a Twitter update?

Encoding Conversion problem

Categories

Resources