Encoding Conversion problem - c#

I've got a little problem changing the ecoding of a string. Actually I read from a DB strings that are encoded using the codepage 850 and I have to prepare them in order to be suitable for an interoperable WCF service.
From the DB I read characters \x10 and \x11 (triangular shapes) and i want to convert them to the Unicode format in order to prevent serialization/deserialization problem during WCF call. (Chars
and are not valid according of the XML specs even if WCF serialize them).
Now, I use following code in order to covert string encoding, but nothing happens. Result string is in fact identical to the original one.
I'm probably missing something...
Please help me!!!
Emanuele
static class UnicodeEncodingExtension
{
public static string Convert(this Encoding sourceEncoding, Encoding targetEncoding, string value)
{
string reEncodedString = null;
byte[] sourceBytes = sourceEncoding.GetBytes(value);
byte[] targetBytes = Encoding.Convert(sourceEncoding, targetEncoding, sourceBytes);
reEncodedString = sourceEncoding.GetString(targetBytes);
return reEncodedString;
}
}
class Program
{
private static Encoding Cp850Encoding = Encoding.GetEncoding(850);
private static Encoding UnicodeEncoding = Encoding.UTF8;
static void Main(string[] args)
{
string value;
string resultValue;
value = "\x10";
resultValue = Cp850Encoding.Convert(UnicodeEncoding, value);
value = "\x11";
resultValue = Cp850Encoding.Convert(UnicodeEncoding, value);
value = "\u25b6";
resultValue = UnicodeEncoding.Convert(Cp850Encoding, value);
value = "\u25c0";
resultValue = UnicodeEncoding.Convert(Cp850Encoding, value);
}
}

It seems you think there is a problem based on an incorrect understanding. But jmservera is correct - all strings in .NET are encoded internally as unicode.
You didn't say exactly what you want to accomplish. Are you experiencing a problem at the other end of the wire?
Just FYI, you can set the text encoding on a WCF binding with the textMessageEncoding element in the config file.

I suspect this line may be your culprit
reEncodedString = sourceEncoding.GetString(targetBytes);
which seems to take your target encoded string of bytes and asks your sourceEncoding to make a string out of them. I've not had a chance to verify it but I suspect the following might be better
reEncodedString = targetEncoding.GetString(targetBytes);

All the strings stored in string are in fact Unicode.Unicode. Read: Strings in .Net and C# and The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Edit: I suppose that you want the Convert function to automatically change \x11 to \u25c0, but the problem here is that \x11 is valid in almost any encoding, the differences usually start in character \x80, so the Convert function will maintain it even if you do that:
string reEncodedString = null;
byte[] unicodeBytes = UnicodeEncoding.Unicode.GetBytes(value);
byte[] sourceBytes = Encoding.Convert(Encoding.Unicode,
sourceEncoding, unicodeBytes);
You can see in unicode.org the mappings from CP850 to Unicode. So, for this conversion to happen you will have to change these characters manually.

byte[] sourceBytes =Encoding.Default.GetBytes(value)
Encoding.UTF8.GetString(sourceBytes)
this sequence usefull for download unicode file from service(for example xml file that contain persian character)

You should try this:
byte[] sourceBytes = sourceEncoding.GetBytes(value);
var convertedString = Encoding.UTF8.GetString(sourceBytes);

Related

Convert special characters to normal

I need a way to convert special characters like this:
Helloæ
To normal characters. So this word would end up being Helloae. So far I have tried HttpUtility.Decode, or a method that would convert UTF8 to win1252, but nothing worked. Is there something simple and generic that would do this job?
Thank you.
EDIT
I have tried implementing those two methods using posts here on OC. Here's the methods:
public static string ConvertUTF8ToWin1252(string _source)
{
Encoding utf8 = new UTF8Encoding();
Encoding win1252 = Encoding.GetEncoding(1252);
byte[] input = _source.ToUTF8ByteArray();
byte[] output = Encoding.Convert(utf8, win1252, input);
return win1252.GetString(output);
}
// It should be noted that this method is expecting UTF-8 input only,
// so you probably should give it a more fitting name.
private static byte[] ToUTF8ByteArray(this string _str)
{
Encoding encoding = new UTF8Encoding();
return encoding.GetBytes(_str);
}
But it did not worked. The string remains the same way.
See: Does .NET transliteration library exists?
UnidecodeSharpFork
Usage:
var result = "Helloæ".Unidecode();
Console.WriteLine(result) // Prints Helloae
There is no direct mapping between æ and ae they are completely different unicode code points. If you need to do this you'll most likely need to write a function that maps the offending code points to the strings that you desire.
Per the comments you may need to take a two stage approach to this:
Remove the diacritics and combining characters per the link to the possible duplicate
Map any characters left that are not combining to alternate strings
switch(badChar){
case 'æ':
return "ae";
case 'ø':
return "oe";
// and so on
}

Encoding not converting

An ASP.NET page (ashx) receives a GET request with a UTF8 string. It reads a SqlServer database with Windows-1255 data.
I can't seem to get them to work together. I've used information gathered on SO (mainly Convert a string's character encoding from windows-1252 to utf-8) as well as msdn on the subject.
When I run anything through the functions below - it always ends up the same as it started - not converted at all.
Is something done wrong?
EDIT
What I'm specifically trying to do (getData returns a Dictionary<int, string>):
getData().Where(a => a.Value.Contains(context.Request.QueryString["q"]))
Result is empty, unless I send a "neutral" character such as "'" or ",".
CODE
string windows1255FromUTF8(string p)
{
Encoding win = Encoding.GetEncoding(1255);
Encoding utf8 = Encoding.UTF8;
byte[] utfBytes = utf8.GetBytes(p);
byte[] winBytes = Encoding.Convert(utf8, win, utfBytes);
return win.GetString(winBytes);
}
string UTF8FromWindows1255(string p)
{
Encoding win = Encoding.GetEncoding(1255);
Encoding utf8 = Encoding.UTF8;
byte[] winBytes = win.GetBytes(p);
byte[] utfBytes = Encoding.Convert(win, utf8, winBytes);
return utf8.GetString(utfBytes);
}
There is nothing wrong with the functions, they are simply useless.
What the functions do is to encode the strings into bytes, convert the data from one encoding to another, then decode the bytes back to a string. Unless the string contains a character that is not possible to encode using the windows-1255 encoding, the returned value should be identical to the input.
Strings in .NET doesn't have an encoding. If you get a string from a source where the text was encoded using for example UTF-8, once it's decoded into a string it doesn't have that encoding any more. You don't have to do anyting to a string to use it when the destination has a specific encoding, whatever library you are using that takes the string will take care of the encoding.
For some reason this worked:
byte[] fromBytes = (fromEncoding.UTF8).GetBytes(myString);
string finalString = (Encoding.GetEncoding(1255)).GetString(fromBytes);
Switching encoding without the conversion...

Error in converting from Encrypted value

I have a encryption method GetDecryptedSSN(). I tested it’s correctness by the following test. It works fine
//////////TEST 2//////////
byte[] encryptedByteWithIBMEncoding2 = DecryptionServiceHelper.GetEncryptedSSN("123456789");
string clearTextSSN2 = DecryptionServiceHelper.GetDecryptedSSN(encryptedByteWithIBMEncoding2);
But when I do a conversion to ASCII String and then back, it is not working correctly. What is the problem in the conversion logic?
//////////TEST 1//////////
//String -- > ASCII Byte --> IBM Byte -- > encryptedByteWithIBMEncoding
byte[] encryptedByteWithIBMEncoding = DecryptionServiceHelper.GetEncryptedSSN("123456789");
//encryptedByteWithIBMEncoding --> Encrypted Byte ASCII
string EncodingFormat = "IBM037";
byte[] encryptedByteWithASCIIEncoding = Encoding.Convert(Encoding.GetEncoding(EncodingFormat), Encoding.ASCII,
encryptedByteWithIBMEncoding);
//Encrypted Byte ASCII - ASCII Encoded string
string encodedEncryptedStringInASCII = System.Text.ASCIIEncoding.ASCII.GetString(encryptedByteWithASCIIEncoding);
//UpdateSSN(encodedEncryptedStringInASCII);
byte[] dataInBytesASCII = System.Text.ASCIIEncoding.ASCII.GetBytes(encodedEncryptedStringInASCII);
byte[] bytesInIBM = Encoding.Convert(Encoding.ASCII, Encoding.GetEncoding(EncodingFormat),
dataInBytesASCII);
string clearTextSSN = DecryptionServiceHelper.GetDecryptedSSN(bytesInIBM);
Helper Class
public static class DecryptionServiceHelper
{
public const string EncodingFormat = "IBM037";
public const string SSNPrefix = "0000000";
public const string Encryption = "E";
public const string Decryption = "D";
public static byte[] GetEncryptedSSN(string clearTextSSN)
{
return GetEncryptedID(SSNPrefix + clearTextSSN);
}
public static string GetDecryptedSSN(byte[] encryptedSSN)
{
return GetDecryptedID(encryptedSSN);
}
private static byte[] GetEncryptedID(string id)
{
ServiceProgram input = new ServiceProgram();
input.RequestText = Encodeto64(id);
input.RequestType = Encryption;
ProgramInterface inputRequest = new ProgramInterface();
inputRequest.Test__Request = input;
using (MY_Service operation = new MY_Service())
{
return ((operation.MY_Operation(inputRequest)).Test__Response.ResponseText);
}
}
private static string GetDecryptedID(byte[] id)
{
ServiceProgram input = new ServiceProgram();
input.RequestText = id;
input.RequestType = Decryption;
ProgramInterface request = new ProgramInterface();
request.Test__Request = input;
using (MY_Service operationD = new MY_Service())
{
ProgramInterface1 response = operationD.MY_Operation(request);
byte[] encodedBytes = Encoding.Convert(Encoding.GetEncoding(EncodingFormat), Encoding.ASCII,
response.Test__Response.ResponseText);
return System.Text.ASCIIEncoding.ASCII.GetString(encodedBytes);
}
}
private static byte[] Encodeto64(string toEncode)
{
byte[] dataInBytes = System.Text.ASCIIEncoding.ASCII.GetBytes(toEncode);
Encoding encoding = Encoding.GetEncoding(EncodingFormat);
return Encoding.Convert(Encoding.ASCII, encoding, dataInBytes);
}
}
REFERENCE:
Getting incorrect decryption value using AesCryptoServiceProvider
This is the problem, I suspect:
string encodedEncryptedStringInASCII =
System.Text.ASCIIEncoding.ASCII.GetString(encryptedByteWithASCIIEncoding);
(It's not entirely clear because of all the messing around with encodings beforehand, which seems pointless to me, but...)
The result of encryption is not "text encoded in ASCII" - so you shouldn't try to treat it that way. (You haven't said what kind of encryption you're using, but it would be very odd for it to produce ASCII text.)
It's just an arbitrary byte array. In order to represent that in text only using the ASCII character set, the most common approach is to use base64. So the above code would become:
string encryptedText = Convert.ToBase64(encryptedByteWithIBMEncoding);
Then later, you'd convert it back to a byte array ready for decryption as:
encryptedByteWithIBMEncoding = Convert.FromBase64String(encryptedText);
I would strongly advise you to avoid messing around with the encodings like this if you can help it though. It's not clear why ASCII needs to get involved at all. If you really want to encode your original text as IBM037 before encryption, you should just use:
Encoding encoding = Encoding.GetEncoding("IBM037");
string unencryptedBinary = encoding.GetBytes(textInput);
Personally I'd usually use UTF-8, as an encoding which can handle any character data rather than just a limited subset, but that's up to you. I think you're making the whole thing much more complicated than it needs to be though.
A typical "encrypt a string, getting a string result" workflow is:
Convert input text to bytes using UTF-8. The result is a byte array.
Encrypt result of step 1. The result is a byte array.
Convert result of step 2 into base64. The result is a string.
To decrypt:
Convert the string from base64. The result is a byte array.
Decrypt the result of step 1. The result is a byte array.
Convert the result of step 2 back to a string using the same encoding as step 1 of the encryption process.
In DecryptionServiceHelper.GetEncryptedSSN you are encoding the text in IBM037 format BEFORE encrypting.
So the following piece of code is not correct as you are converting the encrypted bytes to ASCII assuming that its in the IBM037 format. That's wrong as the encrypted bytes is not in IBM037 format (the text was encoded before encryption)
//encryptedByteWithIBMEncoding --> Encrypted Byte ASCII
string EncodingFormat = "IBM037";
byte[] encryptedByteWithASCIIEncoding = Encoding.Convert(Encoding.GetEncoding(EncodingFormat), Encoding.ASCII,
encryptedByteWithIBMEncoding);
One possible solution is to encode the encrypted text using IBM037 format, that should fix the issue I guess.

Strip the byte order mark from string in C#

In C#, I have a string that I'm obtaining from WebClient.DownloadString. I've tried setting client.Encoding to new UTF8Encoding(false), but that's made no difference - I still end up with a byte order mark for UTF-8 at the beginning of the result string. I need to remove this (to parse the resulting XML with LINQ), and want to do so in memory.
So I have a string that starts with \x00EF\x00BB\x00BF, and I want to remove that if it exists. Right now I'm using
if (xml.StartsWith(ByteOrderMarkUtf8))
{
xml = xml.Remove(0, ByteOrderMarkUtf8.Length);
}
but that just feels wrong. I've tried all sorts of code with streams, GetBytes, and encodings, and nothing works. Can anyone provide the "right" algorithm to strip a BOM from a string?
I recently had issues with the .NET 4 upgrade, but until then the simple answer is
String.Trim()
removes the BOM up until .NET 3.5.
However, in .NET 4 you need to change it slightly:
String.Trim(new char[]{'\uFEFF'});
That will also get rid of the byte order mark, though you may also want to remove the ZERO WIDTH SPACE (U+200B):
String.Trim(new char[]{'\uFEFF','\u200B'});
This you could also use to remove other unwanted characters.
Some further information is from
String.Trim Method:
The .NET Framework 3.5 SP1 and earlier versions maintain an internal list of white-space characters that this method trims. Starting with the .NET Framework 4, the method trims all Unicode white-space characters (that is, characters that produce a true return value when they are passed to the Char.IsWhiteSpace method). Because of this change, the Trim method in the .NET Framework 3.5 SP1 and earlier versions removes two characters, ZERO WIDTH SPACE (U+200B) and ZERO WIDTH NO-BREAK SPACE (U+FEFF), that the Trim method in the .NET Framework 4 and later versions does not remove. In addition, the Trim method in the .NET Framework 3.5 SP1 and earlier versions does not trim three Unicode white-space characters: MONGOLIAN VOWEL SEPARATOR (U+180E), NARROW NO-BREAK SPACE (U+202F), and MEDIUM MATHEMATICAL SPACE (U+205F).
I had some incorrect test data, which caused me some confusion. Based on How to avoid tripping over UTF-8 BOM when reading files I found that this worked:
private readonly string _byteOrderMarkUtf8 =
Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
public string GetXmlResponse(Uri resource)
{
string xml;
using (var client = new WebClient())
{
client.Encoding = Encoding.UTF8;
xml = client.DownloadString(resource);
}
if (xml.StartsWith(_byteOrderMarkUtf8, StringComparison.Ordinal))
{
xml = xml.Remove(0, _byteOrderMarkUtf8.Length);
}
return xml;
}
Setting the client Encoding property correctly reduces the BOM to a single character. However, XDocument.Parse still will not read that string. This is the cleanest version I've come up with to date.
This works as well
int index = xmlResponse.IndexOf('<');
if (index > 0)
{
xmlResponse = xmlResponse.Substring(index, xmlResponse.Length - index);
}
A quick and simple method to remove it directly from a string:
private static string RemoveBom(string p)
{
string BOMMarkUtf8 = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
if (p.StartsWith(BOMMarkUtf8))
p = p.Remove(0, BOMMarkUtf8.Length);
return p.Replace("\0", "");
}
How to use it:
string yourCleanString=RemoveBom(yourBOMString);
If the variable xml is of type string, you did something wrong already - in a character string, the BOM should not be represented as three separate characters, but as a single code point.
Instead of using DownloadString, use DownloadData, and parse byte arrays instead. The XML parser should recognize the BOM itself, and skip it (except for auto-detecting the document encoding as UTF-8).
I had a very similar problem (I needed to parse an XML document represented as a byte array that had a byte order mark at the beginning of it). I used one of Martin's comments on his answer to come to a solution. I took the byte array I had (instead of converting it to a string) and created a MemoryStream object with it. Then I passed it to XDocument.Load, which worked like a charm. For example, let's say that xmlBytes contains your XML in UTF-8 encoding with a byte mark at the beginning of it. Then, this would be the code to solve the problem:
var stream = new MemoryStream(xmlBytes);
var document = XDocument.Load(stream);
It's that simple.
If starting out with a string, it should still be easy to do (assume xml is your string containing the XML with the byte order mark):
var bytes = Encoding.UTF8.GetBytes(xml);
var stream = new MemoryStream(bytes);
var document = XDocument.Load(stream);
I wrote the following post after coming across this issue.
Essentially instead of reading in the raw bytes of the file's contents using the BinaryReader class, I use the StreamReader class with a specific constructor which automatically removes the byte order mark character from the textual data I am trying to retrieve.
It's of course best if you can strip it out while still on the byte array level to avoid unwanted substrings / allocs. But if you already have a string, this is perhaps the easiest and most performant way to handle this.
Usage:
string feed = ""; // input
bool hadBOM = FixBOMIfNeeded(ref feed);
var xElem = XElement.Parse(feed); // now does not fail
/// <summary>
/// You can get this or test it originally with: Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble())[0];
/// But no need, this way we have a constant. As these three bytes `[239, 187, 191]` (a BOM) evaluate to a single C# char.
/// </summary>
public const char BOMChar = (char)65279;
public static bool FixBOMIfNeeded(ref string str)
{
if (string.IsNullOrEmpty(str))
return false;
bool hasBom = str[0] == BOMChar;
if (hasBom)
str = str.Substring(1);
return hasBom;
}
Pass the byte buffer (via DownloadData) to string Encoding.UTF8.GetString(byte[]) to get the string rather than download the buffer as a string. You probably have more problems with your current method than just trimming the byte order mark. Unless you're properly decoding it as I suggest here, Unicode characters will probably be misinterpreted, resulting in a corrupted string.
Martin's answer is better, since it avoids allocating an entire string for XML that still needs to be parsed anyway. The answer I gave best applies to general strings that don't need to be parsed as XML.
I ran into this when I had a Base64 encoded file to transform into the string. While I could have saved it to a file and then read it correctly, here's the best solution I could think of to get from the byte[] of the file to the string (based lightly on TrueWill's answer):
public static string GetUTF8String(byte[] data)
{
byte[] utf8Preamble = Encoding.UTF8.GetPreamble();
if (data.StartsWith(utf8Preamble))
{
return Encoding.UTF8.GetString(data, utf8Preamble.Length, data.Length - utf8Preamble.Length);
}
else
{
return Encoding.UTF8.GetString(data);
}
}
Where StartsWith(byte[]) is the logical extension:
public static bool StartsWith(this byte[] thisArray, byte[] otherArray)
{
// Handle invalid/unexpected input
// (nulls, thisArray.Length < otherArray.Length, etc.)
for (int i = 0; i < otherArray.Length; ++i)
{
if (thisArray[i] != otherArray[i])
{
return false;
}
}
return true;
}
StreamReader sr = new StreamReader(strFile, true);
XmlDocument xdoc = new XmlDocument();
xdoc.Load(sr);
Yet another generic variation to get rid of the UTF-8 BOM preamble:
var preamble = Encoding.UTF8.GetPreamble();
if (!functionBytes.Take(preamble.Length).SequenceEqual(preamble))
preamble = Array.Empty<Byte>();
return Encoding.UTF8.GetString(functionBytes, preamble.Length, functionBytes.Length - preamble.Length);
Use a regex replace to filter out any other characters other than the alphanumeric characters and spaces that are contained in a normal certificate thumbprint value:
certficateThumbprint = Regex.Replace(certficateThumbprint, #"[^a-zA-Z0-9\-\s*]", "");
And there you go. Voila!! It worked for me.
I solved the issue with the following code
using System.Xml.Linq;
void method()
{
byte[] bytes = GetXmlBytes();
XDocument doc;
using (var stream = new MemoryStream(docBytes))
{
doc = XDocument.Load(stream);
}
}

How to convert (transliterate) a string from utf8 to ASCII (single byte) in c#?

I have a string object
"with multiple characters and even special characters"
I am trying to use
UTF8Encoding utf8 = new UTF8Encoding();
ASCIIEncoding ascii = new ASCIIEncoding();
objects in order to convert that string to ascii. May I ask someone to bring some light to this simple task, that is hunting my afternoon.
EDIT 1:
What we are trying to accomplish is getting rid of special characters like some of the special windows apostrophes. The code that I posted below as an answer will not take care of that. Basically
O'Brian will become O?Brian. where ' is one of the special apostrophes
This was in response to your other question, that looks like it's been deleted....the point still stands.
Looks like a classic Unicode to ASCII issue. The trick would be to find where it's happening.
.NET works fine with Unicode, assuming it's told it's Unicode to begin with (or left at the default).
My guess is that your receiving app can't handle it. So, I'd probably use the ASCIIEncoder with an EncoderReplacementFallback with String.Empty:
using System.Text;
string inputString = GetInput();
var encoder = ASCIIEncoding.GetEncoder();
encoder.Fallback = new EncoderReplacementFallback(string.Empty);
byte[] bAsciiString = encoder.GetBytes(inputString);
// Do something with bytes...
// can write to a file as is
File.WriteAllBytes(FILE_NAME, bAsciiString);
// or turn back into a "clean" string
string cleanString = ASCIIEncoding.GetString(bAsciiString);
// since the offending bytes have been removed, can use default encoding as well
Assert.AreEqual(cleanString, Default.GetString(bAsciiString));
Of course, in the old days, we'd just loop though and remove any chars greater than 127...well, those of us in the US at least. ;)
I was able to figure it out. In case someone wants to know below the code that worked for me:
ASCIIEncoding ascii = new ASCIIEncoding();
byte[] byteArray = Encoding.UTF8.GetBytes(sOriginal);
byte[] asciiArray = Encoding.Convert(Encoding.UTF8, Encoding.ASCII, byteArray);
string finalString = ascii.GetString(asciiArray);
Let me know if there is a simpler way o doing it.
For anyone who likes Extension methods, this one does the trick for us.
using System.Text;
namespace System
{
public static class StringExtension
{
private static readonly ASCIIEncoding asciiEncoding = new ASCIIEncoding();
public static string ToAscii(this string dirty)
{
byte[] bytes = asciiEncoding.GetBytes(dirty);
string clean = asciiEncoding.GetString(bytes);
return clean;
}
}
}
(System namespace so it's available pretty much automatically for all of our strings.)
Based on Mark's answer above (and Geo's comment), I created a two liner version to remove all ASCII exception cases from a string. Provided for people searching for this answer (as I did).
using System.Text;
// Create encoder with a replacing encoder fallback
var encoder = ASCIIEncoding.GetEncoding("us-ascii",
new EncoderReplacementFallback(string.Empty),
new DecoderExceptionFallback());
string cleanString = encoder.GetString(encoder.GetBytes(dirtyString));
If you want 8 bit representation of characters that used in many encoding, this may help you.
You must change variable targetEncoding to whatever encoding you want.
Encoding targetEncoding = Encoding.GetEncoding(874); // Your target encoding
Encoding utf8 = Encoding.UTF8;
var stringBytes = utf8.GetBytes(Name);
var stringTargetBytes = Encoding.Convert(utf8, targetEncoding, stringBytes);
var ascii8BitRepresentAsCsString = Encoding.GetEncoding("Latin1").GetString(stringTargetBytes);

Categories