Convert a string's character encoding from windows-1252 to utf-8 - c#

I converted a Word document (docx) to HTML; the converted HTML uses windows-1252 as its character encoding. In .NET, with this 1252 encoding, all the special characters are displayed as '�'. The HTML is shown in a Rad Editor, which displays it correctly if the HTML is in UTF-8.
I tried the following code, but in vain:
Encoding wind1252 = Encoding.GetEncoding(1252);
Encoding utf8 = Encoding.UTF8;
byte[] wind1252Bytes = wind1252.GetBytes(strHtml);
byte[] utf8Bytes = Encoding.Convert(wind1252, utf8, wind1252Bytes);
char[] utf8Chars = new char[utf8.GetCharCount(utf8Bytes, 0, utf8Bytes.Length)];
utf8.GetChars(utf8Bytes, 0, utf8Bytes.Length, utf8Chars, 0);
string utf8String = new string(utf8Chars);
Any suggestions on how to convert the html into UTF-8?

This should do it:
Encoding wind1252 = Encoding.GetEncoding(1252);
Encoding utf8 = Encoding.UTF8;
byte[] wind1252Bytes = wind1252.GetBytes(strHtml);
byte[] utf8Bytes = Encoding.Convert(wind1252, utf8, wind1252Bytes);
string utf8String = Encoding.UTF8.GetString(utf8Bytes);

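Actually, the problem lies here:
byte[] wind1252Bytes = wind1252.GetBytes(strHtml);
We should not get the bytes from the HTML string, because by then the file has already been decoded into a .NET string, possibly with the wrong encoding. Reading the raw bytes from the file instead worked: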
Encoding wind1252 = Encoding.GetEncoding(1252);
Encoding utf8 = Encoding.UTF8;
byte[] wind1252Bytes = ReadFile(Server.MapPath(HtmlFile));
byte[] utf8Bytes = Encoding.Convert(wind1252, utf8, wind1252Bytes);
string utf8String = Encoding.UTF8.GetString(utf8Bytes);
public static byte[] ReadFile(string filePath)
{
    byte[] buffer;
    FileStream fileStream = new FileStream(filePath, FileMode.Open, FileAccess.Read);
    try
    {
        int length = (int)fileStream.Length; // get file length
        buffer = new byte[length]; // create buffer
        int count; // actual number of bytes read
        int sum = 0; // total number of bytes read
        // read until Read method returns 0 (end of the stream has been reached)
        while ((count = fileStream.Read(buffer, sum, length - sum)) > 0)
            sum += count; // sum is a buffer offset for next reading
    }
    finally
    {
        fileStream.Close();
    }
    return buffer;
}
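The key point above is that the conversion has to start from the file's raw bytes. A shorter sketch of the same idea follows, assuming the file really is Windows-1252. The class and method names are made up for illustration, and on .NET Core / .NET 5+ code page 1252 is only available after registering CodePagesEncodingProvider:

```csharp
using System.IO;
using System.Text;

public static class Cp1252Reader
{
    // Hypothetical helper: decodes Windows-1252 bytes into a .NET (UTF-16) string.
    public static string Decode(byte[] cp1252Bytes)
    {
        // Needed on .NET Core / .NET 5+; on .NET Framework the code pages
        // are built in and this line can be removed.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252).GetString(cp1252Bytes);
    }

    // Hypothetical helper: reads a Windows-1252 file and returns its text as UTF-8 bytes.
    public static byte[] ReadAsUtf8Bytes(string path)
    {
        string text = Decode(File.ReadAllBytes(path));
        return Encoding.UTF8.GetBytes(text);
    }
}
```

If only the string is needed (for example to hand it to an editor control), Decode alone is enough; the UTF-8 bytes matter only when the result is written to a file or a response.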

How are you planning to use the resulting HTML? In my opinion, the most appropriate way to solve your problem would be to add a meta tag that specifies the encoding. Something like:
<meta http-equiv="content-type" content="text/html;charset=UTF-8" />

Use the Encoding.Convert method. Details are in the MSDN article on Encoding.Convert.

Related

C# equivalent to parse cryptojs

I'm trying to create C# that does this in CryptoJS
var hash = CryptoJS.HmacSHA512(msg, key);
var crypt = CryptoJS.enc.Utf8.parse(hash.toString());
var base64 = CryptoJS.enc.Base64.stringify(crypt);
My question is about the second statement, where the hash variable is converted to a string and then parsed.
Is there an equivalent in C#? Once parsed, how do you encode the result as UTF-8?
Thanks
I'm not 100% sure I understand exactly which piece you are looking for here, but there is no such thing as a UTF-8 System.String in C#. However, when you write a string to a stream you can choose UTF-8 as the encoding of the bytes in the stream, for example by passing that encoding as an option to a StreamWriter:
using (StreamWriter writer = new StreamWriter(stream, Encoding.UTF8)) {
writer.Write(text);
}
My boss found the answer to this. The difference is that in C#, before you produce the base64 string, you have to convert the hash bytes into a hexadecimal string.
var encoder = new UTF8Encoding();
byte[] keyBytes = encoder.GetBytes(key);
var newlinemsg = action + "\n" + msg;
byte[] messageBytes = encoder.GetBytes(newlinemsg);
byte[] hashBytes = new HMACSHA512(keyBytes).ComputeHash(messageBytes);
var hexString = ToHexString(hashBytes);
var base64 = Convert.ToBase64String(encoder.GetBytes(hexString));
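The snippet above calls a ToHexString helper that is not shown. A self-contained sketch of the same steps follows, assuming the helper produces lowercase hex (which is what CryptoJS's hash.toString() yields by default). The class name and the HmacSha512HexBase64 method are hypothetical names chosen for this example:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class CryptoJsCompat
{
    // Assumed implementation of the missing helper: lowercase hex, two digits per byte.
    public static string ToHexString(byte[] bytes)
    {
        var sb = new StringBuilder(bytes.Length * 2);
        foreach (byte b in bytes)
            sb.Append(b.ToString("x2"));
        return sb.ToString();
    }

    // Mirrors: Base64.stringify(Utf8.parse(HmacSHA512(msg, key).toString()))
    public static string HmacSha512HexBase64(string msg, string key)
    {
        byte[] keyBytes = Encoding.UTF8.GetBytes(key);
        byte[] msgBytes = Encoding.UTF8.GetBytes(msg);
        using (var hmac = new HMACSHA512(keyBytes))
        {
            // Hash -> hex text -> UTF-8 bytes of that text -> base64.
            string hex = ToHexString(hmac.ComputeHash(msgBytes));
            return Convert.ToBase64String(Encoding.UTF8.GetBytes(hex));
        }
    }
}
```

Note that the base64 string encodes the 128 hex characters, not the 64 raw hash bytes, which is why calling Convert.ToBase64String directly on the hash gives a different result from the CryptoJS code.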

C# MD5 hash function return weird result?

I just tried to create an MD5 hash program in C#. My friend gave me some sample code for this, but when I run a test with "123456", instead of returning the correct hash result
e10adc3949ba59abbe56e057f20f883e
it returns the result
ce0bfd15059b68d67688884d7a3d3e8c
I read through the main code but still cannot figure out what is wrong:
string value = textBox1.Text;
byte[] valueBytes = new byte[value.Length * 2];
Encoder encoder = Encoding.Unicode.GetEncoder();
encoder.GetBytes(value.ToCharArray(), 0, value.Length, valueBytes, 0, true);
MD5 md5 = new MD5CryptoServiceProvider();
byte[] hashBytes = md5.ComputeHash(valueBytes);
StringBuilder stringBuilder = new StringBuilder();
for (int i = 0; i < hashBytes.Length; i++)
{
stringBuilder.Append(hashBytes[i].ToString("x2"));
}
textBox2.Text = stringBuilder.ToString();
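Looks like your friend's code uses Encoding.Unicode, while the expected hash was computed from single-byte (ASCII-compatible) bytes such as those produced by Encoding.Default.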
Strings in .NET are UTF-16. Hashing works on bytes, though, not strings, so the string first has to be converted to bytes using a specific encoding.
If the .NET native encoding is used, i.e. UTF-16, the byte buffer will be 12 bytes long and the hex representation of the hash will be ce0bfd15059b68d67688884d7a3d3e8c:
var valueBytes=Encoding.Unicode.GetBytes("123456");
Debug.Assert(valueBytes.Length==12);
var md5=System.Security.Cryptography.MD5.Create();
byte[] hashBytes = md5.ComputeHash(valueBytes);
var hexText=String.Join("",hashBytes.Select(c=>c.ToString("x2")));
If the 7-bit US-ASCII encoding is used, though, the array will be 6 bytes long and the hex representation will be e10adc3949ba59abbe56e057f20f883e:
var valueBytes=Encoding.ASCII.GetBytes("123456");
Debug.Assert(valueBytes.Length==6);
var md5=System.Security.Cryptography.MD5.Create();
byte[] hashBytes = md5.ComputeHash(valueBytes);
var hexText=String.Join("",hashBytes.Select(c=>c.ToString("x2")));
The first 128 byte values (0 to 127) of most code pages match the 7-bit US-ASCII characters, so most encodings, including UTF-8, would return e10adc3949ba59abbe56e057f20f883e. For example, Encoding.GetEncoding(1251) (Cyrillic) and Encoding.GetEncoding(20000) (Chinese Traditional) would produce the same hash.
The Encoding.Default value returns the encoding that corresponds to the computer's system locale. It's the encoding used by non-Unicode applications like C++ applications compiled with ANSI string types.
Encoding.GetEncoding(20273), though, would return a different value: it is an IBM EBCDIC code page that uses different bytes even for the English alphabet and digits. It returns 73e00d17ee63efb9ae91d274baae2459.
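To compare encodings side by side, the two snippets above can be folded into one small helper. This is only a sketch; Md5Demo and Md5Hex are hypothetical names, not part of any library:

```csharp
using System.Security.Cryptography;
using System.Text;

public static class Md5Demo
{
    // Hypothetical helper: MD5 of the bytes that `encoding` produces for `text`,
    // returned as lowercase hex.
    public static string Md5Hex(string text, Encoding encoding)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(encoding.GetBytes(text));
            var sb = new StringBuilder(hash.Length * 2);
            foreach (byte b in hash)
                sb.Append(b.ToString("x2"));
            return sb.ToString();
        }
    }
}
```

Md5Demo.Md5Hex("123456", Encoding.ASCII) and Md5Demo.Md5Hex("123456", Encoding.UTF8) give e10adc3949ba59abbe56e057f20f883e, while passing Encoding.Unicode hashes 12 bytes instead of 6 and should reproduce the other value quoted in the question.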
You're expecting the hash of the UTF-8 bytes, so why use the Unicode (UTF-16) encoding? Use UTF8 and you'll get the result you expect:
string value = "123456";
// Let the encoding size the array; value.Length bytes would only be enough for ASCII text.
byte[] valueBytes = Encoding.UTF8.GetBytes(value); // <-- UTF8 here, not Unicode
MD5 md5 = new MD5CryptoServiceProvider();
byte[] hashBytes = md5.ComputeHash(valueBytes);
StringBuilder stringBuilder = new StringBuilder();
for (int i = 0; i < hashBytes.Length; i++)
{
stringBuilder.Append(hashBytes[i].ToString("x2"));
}
Console.WriteLine(stringBuilder.ToString()); // "e10adc3949ba59abbe56e057f20f883e"

c# converting a .csv file from Windows UTF-8 to w1252

I need to convert a .csv file from UTF-8 to W1252 (West European).
I have tried the example from the MSDN page and the following code, without success:
Encoding utf8 = Encoding.UTF8;
//Encoding utf8 = new UTF8Encoding();
Encoding win1252 = Encoding.GetEncoding(1252);
string src = today.ToString("dd-MM-yyyy") + "-ups.csv";
string source = File.ReadAllText(src);
byte[] input = source.ToUTF8ByteArray();
byte[] output = Encoding.Convert(utf8, win1252, input);
File.WriteAllText(src + "w1252", win1252.GetString(output));
with the extension method
public static class StringHelper
{
public static byte[] ToUTF8ByteArray(this string str)
{
Encoding encoding = new UTF8Encoding();
return encoding.GetBytes(str);
}
}
After this, the file still shows broken characters when opened as W1252 and reads perfectly when opened as UTF-8, which confirms that the conversion did not take effect.
Thanks!
Why not read in the initial encoding (Encoding.UTF8) and write in the target one (Encoding.GetEncoding(1252))?
string fileName = @"C:\MyFile.csv";
File.WriteAllText(fileName,
    File.ReadAllText(fileName, Encoding.UTF8),
    Encoding.GetEncoding(1252));
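The code in the question fails because win1252.GetString(output) turns the converted bytes back into a .NET string, and File.WriteAllText without an encoding argument then writes that string as UTF-8 again. A minimal sketch of the read-then-write approach with separate source and target files follows; CsvReencoder and Utf8ToWin1252 are hypothetical names, and the RegisterProvider call is an assumption that only applies to .NET Core / .NET 5+:

```csharp
using System.IO;
using System.Text;

public static class CsvReencoder
{
    // Hypothetical helper: rewrites a UTF-8 text file as Windows-1252.
    public static void Utf8ToWin1252(string sourcePath, string targetPath)
    {
        // Needed on .NET Core / .NET 5+; on .NET Framework the code pages
        // are built in and this line can be removed.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Decode with the source encoding, then encode with the target encoding.
        string text = File.ReadAllText(sourcePath, Encoding.UTF8);
        File.WriteAllText(targetPath, text, Encoding.GetEncoding(1252));
    }
}
```

Characters that have no Windows-1252 equivalent cannot be represented in the target file and will be substituted during the write, so the conversion is lossy for such input.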

convert a string from ISO-8859-5 to UTF8

I'm writing an application for Windows Mobile. I use a scanner and get a string encoded in ISO-8859-5. How do I convert the string to UTF-8?
Here is my code:
var str_source = "³¿±2";
Console.WriteLine(str_source);
Encoding iso = Encoding.GetEncoding("iso-8859-5");
Encoding utf8 = Encoding.UTF32;
byte[] utfBytes = utf8.GetBytes(str_source);
byte[] isoBytes = Encoding.Convert(utf8, iso, utfBytes);
var str_result = iso.GetString(isoBytes, 0, isoBytes.Length);
Console.WriteLine(str_result);
You should never start your test code from string literals when dealing with encoding issues. Always start with bytes.
Encoding iso = Encoding.GetEncoding("iso-8859-5");
Encoding utf = Encoding.UTF8;
var isoBytes = new byte[] { 228, 232 }; // фш
// iso to utf8
var utfBytes = Encoding.Convert(iso, utf, isoBytes);
// utf8 to iso
var isoBytes2 = Encoding.Convert(utf, iso, utfBytes);
// get all strings (with the correct encoding)
// all 3 strings will contain фш
string s1 = iso.GetString(isoBytes);
string s2 = utf.GetString(utfBytes);
string s3 = iso.GetString(isoBytes2);
Edit: If you do want to use string literals to get you started, you can use the code below to convert them from their in-memory encoding (Encoding.Unicode) to bytes in the expected 'incoming text' encoding:
string stringLiteral = "фш";
Encoding.Convert(Encoding.Unicode, Encoding.GetEncoding("iso-8859-5"),
Encoding.Unicode.GetBytes(stringLiteral)); // { 228, 232 }
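The byte-first approach can be wrapped in two small helpers. This is a sketch only: the names are hypothetical, and the second method rests on the assumption that the garbled string in the question ("³¿±2") came from ISO-8859-5 bytes that were decoded as ISO-8859-1 (Latin-1) somewhere upstream, which the question does not confirm:

```csharp
using System.Text;

public static class Iso88595Demo
{
    // Converts ISO-8859-5 bytes to the UTF-8 bytes for the same text.
    public static byte[] ToUtf8(byte[] isoBytes)
    {
        // Needed on .NET Core / .NET 5+; on .NET Framework the code pages
        // are built in and this line can be removed.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding iso = Encoding.GetEncoding("iso-8859-5");
        return Encoding.Convert(iso, Encoding.UTF8, isoBytes);
    }

    // Assumption: `garbled` holds ISO-8859-5 bytes that were wrongly decoded as Latin-1.
    // Re-encoding as Latin-1 recovers the original bytes, which are then decoded correctly.
    public static string RepairLatin1Mojibake(string garbled)
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        byte[] original = Encoding.GetEncoding("iso-8859-1").GetBytes(garbled);
        return Encoding.GetEncoding("iso-8859-5").GetString(original);
    }
}
```

With the bytes { 228, 232 } from the answer, ToUtf8 yields D1 84 D1 88, the UTF-8 form of фш. Under the Latin-1 assumption, RepairLatin1Mojibake("³¿±2") gives "ГПБ2"; if the scanner data was mangled some other way, this repair does not apply.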

Convert UTF-16 text to another encoding (Windows-1250)

I have a text in a variable, text, encoded in the default (UTF-16) encoding. I would like to change it to Windows-1250. I have:
public static string EncodeToWin1250(string text)
{
Encoding unicode = Encoding.Unicode;
Encoding win1250 = Encoding.GetEncoding(1250);
byte[] unicodeBytes = unicode.GetBytes(text);
byte[] win1250Bytes = Encoding.Convert(unicode, win1250, unicodeBytes);
char[] win1250Chars = new char[win1250.GetCharCount(win1250Bytes, 0, win1250Bytes.Length)];
win1250.GetChars(win1250Bytes, 0, win1250Bytes.Length, win1250Chars, 0);
text = new string(win1250Chars);
return text;
}
but so far it doesn't work.
How do I fix this problem?
I am returning the string as a file:
[...]
result = BLL.DataExchange.MoneyS3.MoneyS3Export.EncodeToWin1250(result);
Context.Response.Clear();
Context.Response.AddHeader("Content-Disposition", "attachment; filename=invoicesIssued.xml");
Context.Response.ContentType = "application/octet-stream";
Context.Response.BufferOutput = false;
Context.Response.Write(result);
Context.Response.Flush();
Context.Response.Close();
All strings are stored internally as Unicode in .NET.
You can convert a string to a byte stream using a code page, as your code does. But you can't change the internal representation of the string: it's Unicode (encoded as UTF-16), period.
You may dump your encoded byte stream to a file or wherever you want. But you can't change the internal encoding of .NET string objects.
Your function should return a byte[] (the win1250Bytes array) instead of a string built from win1250Chars.
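A sketch of what that could look like, assuming the caller is able to write raw bytes. The class name is hypothetical, and the RegisterProvider call is only needed on .NET Core / .NET 5+:

```csharp
using System.Text;

public static class Win1250Export
{
    // Returns the Windows-1250 bytes for `text`; the string itself stays UTF-16.
    public static byte[] EncodeToWin1250(string text)
    {
        // Needed on .NET Core / .NET 5+; on .NET Framework the code pages
        // are built in and this line can be removed.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1250).GetBytes(text);
    }
}
```

In the ASP.NET code from the question, the returned bytes could then be sent with Context.Response.BinaryWrite(bytes) in place of Context.Response.Write(result). That substitution is a suggestion here, not something the original answer spells out.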
