Base64 to UTF-8 string decoding - Arabic text - C#

I'm trying to decode Base64 data which contains a mixture of English and Arabic characters. I'm using the following code to decode it.
var bytes = Convert.FromBase64String(data); //data contains base64 data
string text = Encoding.UTF8.GetString(bytes);
After decoding, I'm displaying it on the ASP page. My problem is that the English text is displayed properly, whereas in place of the Arabic text I'm getting empty boxes and question marks like this: ����� ���
Please suggest where I'm going wrong.

After searching for a few days, I came up with this, and it is working:
byte[] plain = Convert.FromBase64String(data);
Encoding iso = Encoding.GetEncoding("ISO-8859-6");
newData = iso.GetString(plain);
return newData;

You should run this under debugger and see whether you get the correct Arabic text in string text:
If text is incorrect, then the bytes (after Base64 decode) are not encoded as UTF-8, but in some other encoding: UTF-16, Windows-1256, etc.
If text is correct, then it gets corrupted when displayed on the ASP.NET page. In that case, you should set the page's encoding to one that supports Arabic - best is UTF-8, as Shekhar suggests.
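One way to run the first check in code rather than in the debugger is to decode with a strict UTF-8 decoder, which throws on invalid bytes instead of silently producing � characters. This is only a minimal sketch: the class and method names are made up for illustration, Windows-1256 as the fallback is an assumption about the data, and the RegisterProvider call assumes .NET Core / .NET 5+ (or the System.Text.Encoding.CodePages package on .NET Framework).

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the original question or answers.
public static class Base64TextDiagnostics
{
    // Decodes Base64, tries strict UTF-8 first, then falls back to a legacy code page.
    // fallbackCodePage = 1256 (Windows-1256, Arabic) is an assumption about the data.
    public static string DecodeBase64Text(string base64, int fallbackCodePage = 1256)
    {
        byte[] bytes = Convert.FromBase64String(base64);
        try
        {
            // throwOnInvalidBytes: true raises an exception on invalid UTF-8
            // instead of substituting U+FFFD.
            var strictUtf8 = new UTF8Encoding(
                encoderShouldEmitUTF8Identifier: false,
                throwOnInvalidBytes: true);
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            // The bytes are not valid UTF-8, so treat them as the legacy code page.
            // Legacy code pages need this provider on .NET Core / .NET 5+.
            Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
            return Encoding.GetEncoding(fallbackCodePage).GetString(bytes);
        }
    }
}
```

Calling Base64TextDiagnostics.DecodeBase64Text(data) and inspecting the result in the debugger shows whether the problem is in the decoding step or in the page output. Valid UTF-8 passing the strict check does not prove the data was meant as UTF-8, but Arabic text in a single-byte code page will almost never pass it.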

try this
byte[] dec1_byte = Base64.decodeBase64(data.getBytes());
String dec1 = new String(dec1_byte);
byte[] newBytes = Base64.encodeBase64(dec1_byte);
String newStr = new String(newBytes);
hope this will work

Try setting the encoding on the page that displays the Arabic characters:
<%@ Page RequestEncoding="utf-8" ResponseEncoding="utf-8" %>

Related

Text File Wrong Encoding issue

I have a text file that contains strangely encoded characters; the original content of the file was Arabic.
As a sample: the file contains the string ÝíæáÇ ãÍÝæÑ, which is equivalent to فيولا محفور
Some other examples:
ÈÇÑíÜÜÜÜÜÒ = باريـــــز
ÏíäÇ ÔÇÌ = دينا شاج
ßíÑãÇäì ãÍÝæÑ = كيرمانى محفور
ÇäÌì ÈÇáÝæã ãßãáÇÊ = انجى بالفوم مكملات
ÓÈÔíÇá ÑæíÇá 35 ãáã = سبشيال رويال 35 ملم
Is there any way to revert the file content to its original Arabic characters?
Note: I am using C# programming language.
I'm not too familiar with Arabic encodings, but I assume that your text file is encoded using a Windows-1256 code page.
So you need to specify this codepage when reading the file:
var text = File.ReadAllText(pathToFile, Encoding.GetEncoding(1256));
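If the garbled text is already in a string (for example, it was read earlier with the wrong encoding), the same idea can be applied in memory. A minimal sketch, assuming the text was wrongly decoded as Windows-1252 and is really Windows-1256; the class and method names are hypothetical, and the RegisterProvider call assumes .NET Core / .NET 5+ (or the System.Text.Encoding.CodePages package).

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only.
public static class MojibakeRepair
{
    // Turns text that was wrongly decoded as Windows-1252 back into Arabic
    // by recovering the original bytes and decoding them as Windows-1256.
    public static string RepairArabic(string garbled)
    {
        // Needed for code pages 1252 and 1256 on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Step 1: undo the wrong decoding to get the raw bytes back.
        byte[] raw = Encoding.GetEncoding(1252).GetBytes(garbled);

        // Step 2: decode those bytes with the code page the file was written in.
        return Encoding.GetEncoding(1256).GetString(raw);
    }
}
```

With the sample from the question, MojibakeRepair.RepairArabic("ÝíæáÇ ãÍÝæÑ") gives back فيولا محفور. This only works while the wrong decoding was lossless; once characters have been replaced by ? or �, the original bytes are gone.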

Encoding issue when handling a string that contains "question mark" (�)

I am parsing some web content in a response from an HttpWebRequest.
This web content is using charset ISO-8859-1 and when parsing it and finally getting the word needed from the response, I am receiving a string with a question mark like this � and I want to know which is the right way to transform it back into a readable string.
So, what I've tried is to convert the current word encoding into UTF-8 like this:
(I am wondering if UTF-8 could solve my problem)
string word = "ESPA�OL";
Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf = Encoding.GetEncoding("UTF-8");
byte[] isoBytes = iso.GetBytes(word);
byte[] utfBytes = Encoding.Convert(iso, utf, isoBytes);
string utfWord = utf.GetString(utfBytes);
Console.WriteLine(utfWord);
However, utfWord variable outputs ESPA?OL which is still wrong. The correct output is supposed to be ESPAÑOL.
Can someone please give me the right directions to solve this, if possible?
The word in question is "ESPAÑOL". This can be encoded correctly in ISO-8859-1 since all characters in the word are represented in ISO-8859-1.
You can see this for yourself using the following simple program:
using System;
using System.Diagnostics;
using System.Text;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            Encoding enc = Encoding.GetEncoding("ISO-8859-1");
            string original = "ESPAÑOL";
            byte[] iso_8859_1 = enc.GetBytes(original);
            string roundTripped = enc.GetString(iso_8859_1);
            Debug.Assert(original == roundTripped);
            Console.WriteLine(roundTripped);
        }
    }
}
What this tells you is that you need to properly diagnose where the erroneous character comes from. By the time that you have a � character, it is too late. The information has been lost. The presence of the � character indicates that, at some point, a conversion was performed into a character set that did not contain the character Ñ.
A conversion from ISO-8859-1 to a Unicode encoding will correctly handle "ESPAÑOL" because that word can be encoded in ISO-8859-1.
The most likely explanation is that somewhere along the way, the text "ESPAÑOL" is being converted to a character set that does not contain the letter Ñ.
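For illustration, here are two ways the information can get lost, which match the two symptoms in the question (� and ?). This is a sketch of plausible causes, not a diagnosis of the actual code; the class and method names are hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical demo class for illustration only.
public static class LossyConversionDemo
{
    // Encoding into a character set without the letter (here ASCII)
    // replaces it with '?'.
    public static string ViaAscii(string s)
    {
        byte[] ascii = Encoding.ASCII.GetBytes(s);
        return Encoding.ASCII.GetString(ascii);
    }

    // Decoding ISO-8859-1 bytes as if they were UTF-8 turns the
    // non-ASCII byte into U+FFFD, the replacement character.
    public static string Latin1BytesReadAsUtf8(string s)
    {
        byte[] latin1 = Encoding.GetEncoding("ISO-8859-1").GetBytes(s);
        return Encoding.UTF8.GetString(latin1);
    }
}
```

LossyConversionDemo.ViaAscii("ESPAÑOL") returns "ESPA?OL", and LossyConversionDemo.Latin1BytesReadAsUtf8("ESPAÑOL") returns "ESPA�OL". In both cases the Ñ cannot be recovered from the resulting string, so the fix has to be made at the point where the response bytes are first decoded.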

How to read Swedish characters properly from a txt file

I am reading a file (line by line) full of Swedish characters like äåö, but how can I read and save the strings with the Swedish characters intact? Here is my code; I am using UTF8 encoding:
TextReader tr = new StreamReader(@"c:\testfile.txt", System.Text.Encoding.UTF8, true);
tr.ReadLine(); //returns a string but Swedish characters are not appearing correctly...
You need to change System.Text.Encoding.UTF8 to System.Text.Encoding.GetEncoding(1252). See below:
System.IO.TextReader tr = new System.IO.StreamReader(@"c:\testfile.txt", System.Text.Encoding.GetEncoding(1252), true);
tr.ReadLine(); //Swedish characters should now decode correctly if the file is Windows-1252
I figured it out myself, i.e. System.Text.Encoding.Default will support Swedish characters.
TextReader tr = new StreamReader(@"c:\testfile.txt", System.Text.Encoding.Default, true);
System.Text.Encoding.UTF8 should be enough and it is supported both on .NET Framework and .NET Core https://learn.microsoft.com/en-us/dotnet/api/system.text.encoding?redirectedfrom=MSDN&view=netframework-4.8
If you still have issues with ��� characters (instead of having ÅÖÄ) then check the source file - what kind of encoding does it have? Maybe it's ANSI, then you have to convert to UTF8.
You can do it in Notepad++. You can open text file and go to Encoding - Convert to UTF-8.
Alternatively in the source code (C#):
var myString = Encoding.UTF8.GetString(File.ReadAllBytes(pathToTheTextFile));
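Note that decoding with Encoding.UTF8 only works once the file really is UTF-8. If the file is ANSI, the conversion that Notepad++ does can be sketched in C# as reading with the source code page and writing back as UTF-8. This assumes the source is Windows-1252; the class name, method name and parameters are hypothetical, and the RegisterProvider call assumes .NET Core / .NET 5+ (or the System.Text.Encoding.CodePages package).

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration only.
public static class FileEncodingConverter
{
    // Reads a file in a legacy code page (assumed: Windows-1252)
    // and writes the same text to targetPath as UTF-8 without a BOM.
    public static void ConvertToUtf8(string sourcePath, string targetPath, int sourceCodePage = 1252)
    {
        // Needed for legacy code pages on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Decode the bytes with the encoding the file was actually saved in.
        string text = File.ReadAllText(sourcePath, Encoding.GetEncoding(sourceCodePage));

        // Re-encode the decoded text as UTF-8.
        File.WriteAllText(targetPath, text, new UTF8Encoding(encoderShouldEmitUTF8Identifier: false));
    }
}
```

After FileEncodingConverter.ConvertToUtf8(pathToTheTextFile, convertedPath), reading convertedPath with Encoding.UTF8 should return the Swedish characters intact, provided the guess about the source code page was right.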

C# - Korean Encoding

This might be different with other Korean encoding questions.
There is this site I have to scrape and it's Korean.
An example sentence in their site is this
"개인정보보호를 위해 뒤로가기 버튼 대신 검색결과 화면 상단과 하단의 이전 버튼을 사용하시기 바랍니다."
I am using HttpWebRequest and HttpWebResponse to scrape the site.
This is how I retrieve the HTML:
-- partial code --
using (Stream data = resp.GetResponseStream())
{
response.Append(new StreamReader(data, Encoding.GetEncoding(code), true).ReadToEnd());
}
Now my problem is that I am not getting the correct Korean characters. For my "code" variable, I'm taking the code pages from this MSDN page: http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx (let me narrow it down).
Here are the Korean code pages:
51949, 50225, 20949, 20833, 10003, 949
But I am still not getting the correct Korean characters. What do you think the problem is?
It is very likely that the page is not in a specific Korean encoding, but one of the Unicode encodings.
Try Encoding.UTF8 or Encoding.Unicode (UTF-16) instead of the specific code pages. There are also Encoding.UTF7 and Encoding.UTF32, but they are not as common.
To be certain, examine the meta tags and headers for the content-type returned by the server.
Update (gleaned from comments):
Since the content-type header is EUC-KR, the corresponding codepage is 51949 and this is what you need to use to retrieve the page.
It was not clear that you are writing this out to a file - you need to use the same encoding when writing the file out, or convert the byte[] from the original to the output file encoding (using Encoding.Convert).
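Instead of trying code pages one by one, the encoding can be derived from the charset in the Content-Type header. A minimal sketch, assuming the header value is available as a string (for example from resp.ContentType); the class and method names are hypothetical, and the RegisterProvider call assumes .NET Core / .NET 5+ (or the System.Text.Encoding.CodePages package).

```csharp
using System;
using System.Net.Mime;
using System.Text;

// Hypothetical helper for illustration only.
public static class ResponseEncoding
{
    // Picks the Encoding named in a Content-Type header value such as
    // "text/html; charset=EUC-KR"; returns fallback if none is usable.
    public static Encoding FromContentType(string contentType, Encoding fallback)
    {
        // Needed for legacy encodings such as EUC-KR on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // ContentType parses the header; CharSet is null when no charset is given.
        string charset = new ContentType(contentType).CharSet;
        if (string.IsNullOrEmpty(charset))
        {
            return fallback;
        }

        try
        {
            return Encoding.GetEncoding(charset);
        }
        catch (ArgumentException)
        {
            // Unknown charset name: fall back rather than fail.
            return fallback;
        }
    }
}
```

The result can then be passed to the StreamReader in place of Encoding.GetEncoding(code). This ignores any charset declared only in the page's meta tags, so the headers and the meta tags still need to be checked if they disagree.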
I had exactly the same issue and ended up with the code below:
Encoding.UTF8.GetString(DownloadData(URL));
This directly decodes the output of the WebClient GET request as UTF-8.

Encoding issue in .NET

I have a requirement to encode and decode Japanese characters. I tried it in Java and it worked fine with the "Cp939" encoding, but I am unable to find that encoding in .NET. The 932 encoding doesn't encode all the characters, so I need to find a way of implementing the 939 encoding in .NET.
Java Code :
convStr = new String(str8859_1.getBytes("Cp037"), "Cp939");
.NET :
bytesConverted = Encoding.Convert(Encoding.GetEncoding(37),
Encoding.GetEncoding(932), bytesConverted);
// This result is a junk of characters and is totally different
// from the expected output 'ニツポンバ'
convStr = Encoding.GetEncoding(1252).GetString(bytesConverted);
The encoded bytes are in the encoding 932, so why are you using the encoding 1252 when you convert the encoded bytes to a string?
The following should work:
bytesConverted = Encoding.Convert(Encoding.GetEncoding(37),
Encoding.GetEncoding(932), bytesConverted);
// Decode with the same code page (932) that the bytes were converted to
convStr = Encoding.GetEncoding(932).GetString(bytesConverted);
Is this an error, or just how you typed it?
bytesConverted = Encoding.Convert(Encoding.GetEncoding(37),
Encoding.GetEncoding(932), bytesConverted);
should be:
bytesConverted = Encoding.Convert(Encoding.GetEncoding(37),
Encoding.GetEncoding(939), bytesConverted);
Surely?
