Convert UTF-16 text to another encoding (Windows-1250)

I have a text in a variable, text, encoded in the default (UTF-16) encoding. I would like to change it to Windows-1250. I have:
public static string EncodeToWin1250(string text)
{
Encoding unicode = Encoding.Unicode;
Encoding win1250 = Encoding.GetEncoding(1250);
byte[] unicodeBytes = unicode.GetBytes(text);
byte[] win1250Bytes = Encoding.Convert(unicode, win1250, unicodeBytes);
char[] win1250Chars = new char[win1250.GetCharCount(win1250Bytes, 0, win1250Bytes.Length)];
win1250.GetChars(win1250Bytes, 0, win1250Bytes.Length, win1250Chars, 0);
text = new string(win1250Chars);
return text;
}
but so far it doesn't work.
How do I fix this problem?
I am returning the string as a file:
[...]
result = BLL.DataExchange.MoneyS3.MoneyS3Export.EncodeToWin1250(result);
Context.Response.Clear();
Context.Response.AddHeader("Content-Disposition", "attachment; filename=invoicesIssued.xml");
Context.Response.ContentType = "application/octet-stream";
Context.Response.BufferOutput = false;
Context.Response.Write(result);
Context.Response.Flush();
Context.Response.Close();

All strings are stored internally as Unicode in .NET.
You can convert a string to a byte stream using a code page, as your code does. But you can't change the internal representation of the string: it's Unicode (encoded as UTF-16), period.
You may dump your encoded byte stream to a file or wherever you want, but you can't change the internal encoding of .NET string objects. Decoding the Windows-1250 bytes back into a char[] and a new string, as your function does at the end, simply gives you the original UTF-16 text again.
Your function should return a byte[] instead of a string (win1250Bytes, in fact).
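A minimal sketch of that suggestion follows. It assumes a modern .NET runtime (.NET Core 3.0 or later), where legacy code pages must be registered before use; on the classic .NET Framework the registration line is not needed. The class name Win1250Export is illustrative and not part of the original code.

```csharp
using System;
using System.Text;

public static class Win1250Export
{
    // Hypothetical helper: returns the Windows-1250 bytes for a .NET string.
    public static byte[] EncodeToWin1250(string text)
    {
        // Assumption: running on .NET Core / .NET 5+, where code page 1250
        // is only available after the code pages provider is registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding win1250 = Encoding.GetEncoding(1250);

        // The string itself stays UTF-16; only the returned bytes are Windows-1250.
        // Characters that do not exist in Windows-1250 are replaced with '?'.
        return win1250.GetBytes(text);
    }
}
```

In the handler from the question, the bytes would then be sent with Context.Response.BinaryWrite(bytes) rather than Context.Response.Write(result), because Write encodes the string again using the response's own content encoding.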

Related

ASP.NET SOAP Webservice ,Encode Problem in Exception

Here is my problem: I'm trying to encode the response of my web service with the following code.
public static string ConvertToUTF8(string Cadena)
{
string mensajeex = Cadena;
Encoding utf8 = Encoding.UTF8;
Encoding unicode = Encoding.Unicode;
// Convert the string into a byte array.
byte[] unicodeBytes = unicode.GetBytes(mensajeex);
// Perform the conversion from one encoding to the other.
byte[] asciiBytes = Encoding.Convert(unicode, utf8, unicodeBytes);
// Convert the new byte[] into a char[] and then into a string.
char[] asciiChars = new char[utf8.GetCharCount(asciiBytes, 0, asciiBytes.Length)];
utf8.GetChars(asciiBytes, 0, asciiBytes.Length, asciiChars, 0);
string Utf8string = new string(asciiChars);
// Display the strings created before and after the conversion.
Console.WriteLine("Original string: {0}", mensajeex);
Console.WriteLine("Ascii converted string: {0}", Utf8string);
return Utf8string;
}
And it actually works! But when I encode a string and then pass it as the Message of an exception, like this
throw new Exception(XMLHelper.ConvertToUTF8(Message));
the response comes back garbled, like this:
El valor 'R' no es válido seg&#250
Any ideas? Thanks

c# converting a .csv file from Windows UTF-8 to w1252

I need to convert a .csv file from UTF-8 to Windows-1252 (Western European).
I have tried the example from the MSDN page and the following code, without success:
Encoding utf8 = Encoding.UTF8;
//Encoding utf8 = new UTF8Encoding();
Encoding win1252 = Encoding.GetEncoding(1252);
string src = today.ToString("dd-MM-yyyy") + "-ups.csv";
string source = File.ReadAllText(src);
byte[] input = source.ToUTF8ByteArray();
byte[] output = Encoding.Convert(utf8, win1252, input);
File.WriteAllText(src + "w1252", win1252.GetString(output));
with the extension method
public static class StringHelper
{
public static byte[] ToUTF8ByteArray(this string str)
{
Encoding encoding = new UTF8Encoding();
return encoding.GetBytes(str);
}
}
After this, the file still shows broken characters when opened as Windows-1252 and reads perfectly when opened as UTF-8, which confirms that the conversion did not happen.
Thanks!

Why not read in the source encoding (Encoding.UTF8) and write in the target one (Encoding.GetEncoding(1252))?
string fileName = @"C:\MyFile.csv";
File.WriteAllText(fileName,
File.ReadAllText(fileName, Encoding.UTF8), Encoding.GetEncoding(1252));
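The reason the code in the question has no effect is that File.WriteAllText(path, text) without an encoding argument writes UTF-8, so the string produced by win1252.GetString(output) is encoded as UTF-8 again on the way out. The sketch below spells out the read-then-write approach with separate source and target paths. It assumes a modern .NET runtime where code page 1252 must be registered first; the class and method names are hypothetical.

```csharp
using System.IO;
using System.Text;

public static class CsvTranscoder
{
    // Hypothetical helper: rewrites a UTF-8 text file as Windows-1252.
    public static void Utf8ToWin1252(string sourcePath, string targetPath)
    {
        // Assumption: .NET Core / .NET 5+; not needed on the .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding win1252 = Encoding.GetEncoding(1252);

        // Decode the file as UTF-8 into an ordinary (UTF-16) .NET string.
        string content = File.ReadAllText(sourcePath, Encoding.UTF8);

        // Encode on output: the encoding argument decides the bytes on disk.
        File.WriteAllText(targetPath, content, win1252);
    }
}
```

Characters that have no Windows-1252 equivalent are written as '?', so the conversion is only lossless for text that fits in that code page.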

convert a string from ISO-8859-5 to UTF8

I'm writing an application for Windows Mobile. I use a scanner, and the string I get from it is encoded in ISO-8859-5. How do I convert that string to UTF-8?
Here is my code:
var str_source = "³¿±2";
Console.WriteLine(str_source);
Encoding iso = Encoding.GetEncoding("iso-8859-5");
Encoding utf8 = Encoding.UTF32;
byte[] utfBytes = utf8.GetBytes(str_source);
byte[] isoBytes = Encoding.Convert(utf8, iso, utfBytes);
var str_result = iso.GetString(isoBytes, 0, isoBytes.Length);
Console.WriteLine(str_result);

You should never start off your testing code with using string literals when dealing with encoding issues. Always use bytes to start with.
Encoding iso = Encoding.GetEncoding("iso-8859-5");
Encoding utf = Encoding.UTF8;
var isoBytes = new byte[] { 228, 232 }; // фш
// iso to utf8
var utfBytes = Encoding.Convert(iso, utf, isoBytes);
// utf8 to iso
var isoBytes2 = Encoding.Convert(utf, iso, utfBytes);
// get all strings (with the correct encoding)
// all 3 strings will contain фш
string s1 = iso.GetString(isoBytes);
string s2 = utf.GetString(utfBytes);
string s3 = iso.GetString(isoBytes2);
Edit: If you do want to use string literals to get you started, then you can use the code below to change their encoding (Encoding.Unicode) to the expected 'incoming text' encoding:
string stringLiteral = "фш";
Encoding.Convert(Encoding.Unicode, Encoding.GetEncoding("iso-8859-5"),
Encoding.Unicode.GetBytes(stringLiteral)); // { 228, 232 }
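If the scanner text has already arrived as a string such as "³¿±2", one plausible explanation is that ISO-8859-5 bytes were decoded as ISO-8859-1 somewhere along the way. That is an assumption about the question, not something it confirms. Under that assumption, the original bytes can be recovered and decoded again, as in this sketch (hypothetical names, modern .NET with the code pages provider registered):

```csharp
using System.Text;

public static class ScannerText
{
    // Hypothetical helper: re-decodes text that was read as ISO-8859-1
    // although the underlying bytes were ISO-8859-5.
    public static string ReinterpretLatin1AsIso88595(string misread)
    {
        // Assumption: .NET Core / .NET 5+, where ISO-8859-5 needs the provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        Encoding iso5 = Encoding.GetEncoding("iso-8859-5");

        // ISO-8859-1 maps every byte to the code point with the same value,
        // so encoding the misread string gives back the original bytes.
        byte[] rawBytes = latin1.GetBytes(misread);

        // Decode those bytes with the encoding they were really written in.
        return iso5.GetString(rawBytes);
    }
}
```

The result is an ordinary .NET string; Encoding.UTF8.GetBytes(result) then yields the UTF-8 bytes if a byte stream is what is needed.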

Convert a string's character encoding from windows-1252 to utf-8

I converted a Word document (docx) to HTML, and the converted HTML uses windows-1252 as its character encoding. In .NET, with this 1252 encoding, all the special characters are displayed as '�'. The HTML is shown in a Rad Editor, which displays it correctly if the HTML is in UTF-8.
I tried the following code, but in vain:
Encoding wind1252 = Encoding.GetEncoding(1252);
Encoding utf8 = Encoding.UTF8;
byte[] wind1252Bytes = wind1252.GetBytes(strHtml);
byte[] utf8Bytes = Encoding.Convert(wind1252, utf8, wind1252Bytes);
char[] utf8Chars = new char[utf8.GetCharCount(utf8Bytes, 0, utf8Bytes.Length)];
utf8.GetChars(utf8Bytes, 0, utf8Bytes.Length, utf8Chars, 0);
string utf8String = new string(utf8Chars);
Any suggestions on how to convert the html into UTF-8?

This should do it:
Encoding wind1252 = Encoding.GetEncoding(1252);
Encoding utf8 = Encoding.UTF8;
byte[] wind1252Bytes = wind1252.GetBytes(strHtml);
byte[] utf8Bytes = Encoding.Convert(wind1252, utf8, wind1252Bytes);
string utf8String = Encoding.UTF8.GetString(utf8Bytes);

Actually, the problem lies here:
byte[] wind1252Bytes = wind1252.GetBytes(strHtml);
We should not get the bytes from the HTML string, because by then the file has already been decoded with the wrong encoding. Reading the raw bytes from the file instead worked:
Encoding wind1252 = Encoding.GetEncoding(1252);
Encoding utf8 = Encoding.UTF8;
byte[] wind1252Bytes = ReadFile(Server.MapPath(HtmlFile));
byte[] utf8Bytes = Encoding.Convert(wind1252, utf8, wind1252Bytes);
string utf8String = Encoding.UTF8.GetString(utf8Bytes);
public static byte[] ReadFile(string filePath)
{
byte[] buffer;
FileStream fileStream = new FileStream(filePath, FileMode.Open, FileAccess.Read);
try
{
int length = (int)fileStream.Length; // get file length
buffer = new byte[length]; // create buffer
int count; // actual number of bytes read
int sum = 0; // total number of bytes read
// read until Read method returns 0 (end of the stream has been reached)
while ((count = fileStream.Read(buffer, sum, length - sum)) > 0)
sum += count; // sum is a buffer offset for next reading
}
finally
{
fileStream.Close();
}
return buffer;
}
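The same idea can be written more compactly by letting File.ReadAllText decode the file with the right encoding in one step. This is a sketch under the assumption that the file really is Windows-1252 and has no byte order mark; the class and method names are hypothetical, and the provider registration is only required on modern .NET.

```csharp
using System.IO;
using System.Text;

public static class HtmlLoader
{
    // Hypothetical helper: reads a Windows-1252 HTML file into a .NET string.
    public static string ReadWin1252Html(string path)
    {
        // Assumption: .NET Core / .NET 5+; not needed on the .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Decoding happens here, at the file boundary. Once the text is a
        // string, it has no code page any more.
        return File.ReadAllText(path, Encoding.GetEncoding(1252));
    }

    // UTF-8 only comes into play when the string is turned back into bytes.
    public static byte[] ToUtf8Bytes(string html)
    {
        return Encoding.UTF8.GetBytes(html);
    }
}
```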

How are you planning to use the resulting HTML? In my opinion, the most appropriate way to solve your problem would be to add a meta tag that specifies the encoding. Something like:
<meta http-equiv="content-type" content="text/html;charset=UTF-8" />

Use Encoding.Convert method. Details are in the Encoding.Convert method MSDN article.

ï»¿ characters appended to the beginning of each file

I've downloaded an HttpHandler class that concatenates JS files into one file, and it keeps adding the ï»¿ characters at the start of each file it concatenates.
Any ideas on what is causing this? Could it be that once the files are processed they are written to the cache, and that's how the cache is storing/rendering them?
Any inputs would be greatly appreciated.
using System;
using System.Net;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Configuration;
using System.Web;
public class HttpCombiner : IHttpHandler {
private const bool DO_GZIP = false;
private readonly static TimeSpan CACHE_DURATION = TimeSpan.FromDays(30);
public void ProcessRequest (HttpContext context) {
HttpRequest request = context.Request;
// Read setName, contentType and version. All are required. They are
// used as cache key
string setName = request["s"] ?? string.Empty;
string contentType = request["t"] ?? string.Empty;
string version = request["v"] ?? string.Empty;
// Decide if browser supports compressed response
bool isCompressed = DO_GZIP && this.CanGZip(context.Request);
// Response is written as UTF8 encoding. If you are using languages
// like Arabic, you should change this to proper encoding
UTF8Encoding encoding = new UTF8Encoding(false);
// If the set has already been cached, write the response directly
// from cache. Otherwise generate the response and cache it
if (!this.WriteFromCache(context, setName, version, isCompressed,
contentType))
{
using (MemoryStream memoryStream = new MemoryStream(5000))
{
// Decide regular stream or GZipStream based on whether the
// response can be cached or not
using (Stream writer = isCompressed
? (Stream)(new GZipStream(memoryStream,
CompressionMode.Compress))
: memoryStream)
{
// Load the files defined in <appSettings> and process
// each file
string setDefinition = System.Configuration
.ConfigurationManager.AppSettings[setName] ?? "";
string[] fileNames = setDefinition.Split(
new char[] { ',' },
StringSplitOptions.RemoveEmptyEntries);
foreach (string fileName in fileNames)
{
byte[] fileBytes = this.GetFileBytes(
context, fileName.Trim(), encoding);
writer.Write(fileBytes, 0, fileBytes.Length);
}
writer.Close();
}
// Cache the combined response so that it can be directly
// written in subsequent calls
byte[] responseBytes = memoryStream.ToArray();
context.Cache.Insert(
GetCacheKey(setName, version, isCompressed),
responseBytes, null,
System.Web.Caching.Cache.NoAbsoluteExpiration,
CACHE_DURATION);
// Generate the response
this.WriteBytes(responseBytes, context, isCompressed,
contentType);
}
}
}
private byte[] GetFileBytes(HttpContext context, string virtualPath,
Encoding encoding)
{
if (virtualPath.StartsWith("http://",
StringComparison.InvariantCultureIgnoreCase))
{
using (WebClient client = new WebClient())
{
return client.DownloadData(virtualPath);
}
}
else
{
string physicalPath = context.Server.MapPath(virtualPath);
byte[] bytes = File.ReadAllBytes(physicalPath);
// TODO: Convert unicode files to specified encoding.
// For now, assuming files are either ASCII or UTF8
return bytes;
}
}
private bool WriteFromCache(HttpContext context, string setName,
string version, bool isCompressed, string contentType)
{
byte[] responseBytes = context.Cache[GetCacheKey(setName, version,
isCompressed)] as byte[];
if (null == responseBytes || 0 == responseBytes.Length) return false;
this.WriteBytes(responseBytes, context, isCompressed, contentType);
return true;
}
private void WriteBytes(byte[] bytes, HttpContext context,
bool isCompressed, string contentType)
{
HttpResponse response = context.Response;
response.AppendHeader("Content-Length", bytes.Length.ToString());
response.ContentType = contentType;
if (isCompressed)
response.AppendHeader("Content-Encoding", "gzip");
context.Response.Cache.SetCacheability(HttpCacheability.Public);
context.Response.Cache.SetExpires(DateTime.Now.Add(CACHE_DURATION));
context.Response.Cache.SetMaxAge(CACHE_DURATION);
context.Response.Cache.AppendCacheExtension(
"must-revalidate, proxy-revalidate");
response.OutputStream.Write(bytes, 0, bytes.Length);
response.Flush();
}
private bool CanGZip(HttpRequest request)
{
string acceptEncoding = request.Headers["Accept-Encoding"];
if (!string.IsNullOrEmpty(acceptEncoding) &&
(acceptEncoding.Contains("gzip")
|| acceptEncoding.Contains("deflate")))
return true;
return false;
}
private string GetCacheKey(string setName, string version,
bool isCompressed)
{
return "HttpCombiner." + setName + "." + version + "." + isCompressed;
}
public bool IsReusable
{
get { return true; }
}
}

The ï»¿ characters are the UTF-8 byte order mark (BOM): the bytes EF BB BF, displayed as if they were Windows-1252 text.

It's the UTF-8 Byte Order Mark (BOM).
It will be at the start of each file, but your editor will ignore them there. When concatenated they end up in the middle, so you see them.

OK, I've debugged your code.
BOM marks appear in the source stream when the files are being read from the disk:
byte[] bytes = File.ReadAllBytes(physicalPath);
// TODO: Convert unicode files to specified encoding. For now, assuming
// files are either ASCII or UTF8
If you read the files properly, you can get rid of the marks.
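One way to do that while staying at the byte level, sketched under the assumption that the files are UTF-8 (as the handler's own comment says): drop the three BOM bytes EF BB BF from the start of each file before it is written to the combined stream. The class and method names below are illustrative.

```csharp
using System;

public static class BomStripper
{
    // Hypothetical helper: returns the input without a leading UTF-8 BOM.
    public static byte[] StripUtf8Bom(byte[] bytes)
    {
        bool hasBom = bytes.Length >= 3
            && bytes[0] == 0xEF
            && bytes[1] == 0xBB
            && bytes[2] == 0xBF;

        if (!hasBom)
        {
            // Nothing to remove; hand back the original array.
            return bytes;
        }

        // Copy everything after the three BOM bytes into a new array.
        byte[] result = new byte[bytes.Length - 3];
        Buffer.BlockCopy(bytes, 3, result, 0, result.Length);
        return result;
    }
}
```

In GetFileBytes, this would be applied to the result of File.ReadAllBytes(physicalPath) before returning it.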

I think this is the Byte Order Mark (BOM) for files with UTF-8 encoding. This mark makes it possible to determine which encoding the file is stored in.

You didn't post what the actual solution was. Here's my solution. On the line where it reads the file into memory, I found a somewhat strange way to remove the BOM:
byte[] bytes = File.ReadAllBytes(physicalPath);
String ss = new StreamReader(new MemoryStream(bytes), true).ReadToEnd();
byte[] b = StrToByteArray(ss);
return b;
And you also need this function:
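public static byte[] StrToByteArray(string str)
{
// Encode as UTF-8 to match the handler's response encoding.
// ASCIIEncoding would replace every non-ASCII character with '?'.
// GetBytes never emits a BOM, so none is reintroduced here.
return System.Text.Encoding.UTF8.GetBytes(str);
}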
Nitech

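If you have the file's contents in a string, .Trim() will lop off the "BOM" quite handily on .NET Framework 3.5 and earlier. From .NET Framework 4 on, Trim() no longer removes U+FEFF, so use .TrimStart('\uFEFF') there instead.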
You may not be able to do that, or you may want the whitespace at the ends of the file, but it's certainly an option.
For .js whitespace isn't significant, so this could work.

Check how your JS files are encoded and specify the same encoding in the code that does the reading and concatenation. These three characters usually point to a Unicode (UTF-8) file.

Those characters are UTF-8 BOM. It doesn't seem like they're coming from the gzipped stream. It's more likely they are inserted to the response stream, so I would suggest clearing the response before working with it:
context.Response.Clear();
