I am parsing some web content in a response from a HttpWebRequest.
This web content uses charset ISO-8859-1, and when parsing it and finally extracting the word I need from the response, I get a string with a question mark like this: �. I want to know the right way to transform it back into a readable string.
So, what I've tried is to convert the word's current encoding to UTF-8, like this (I am wondering if UTF-8 could solve my problem):
string word = "ESPA�OL";
Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf = Encoding.GetEncoding("UTF-8");
byte[] isoBytes = iso.GetBytes(word);
byte[] utfBytes = Encoding.Convert(iso, utf, isoBytes);
string utfWord = utf.GetString(utfBytes);
Console.WriteLine(utfWord);
However, the utfWord variable outputs ESPA?OL, which is still wrong. The correct output is supposed to be ESPAÑOL.
Can someone please point me in the right direction to solve this?
The word in question is "ESPAÑOL". This can be encoded correctly in ISO-8859-1 since all characters in the word are represented in ISO-8859-1.
You can see this for yourself using the following simple program:
using System;
using System.Diagnostics;
using System.Text;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            Encoding enc = Encoding.GetEncoding("ISO-8859-1");
            string original = "ESPAÑOL";
            byte[] iso_8859_1 = enc.GetBytes(original);
            string roundTripped = enc.GetString(iso_8859_1);
            Debug.Assert(original == roundTripped);
            Console.WriteLine(roundTripped);
        }
    }
}
What this tells you is that you need to properly diagnose where the erroneous character comes from. By the time that you have a � character, it is too late. The information has been lost. The presence of the � character indicates that, at some point, a conversion was performed into a character set that did not contain the character Ñ.
A conversion from ISO-8859-1 to a Unicode encoding will correctly handle "ESPAÑOL" because that word can be encoded in ISO-8859-1.
The most likely explanation is that somewhere along the way, the text "ESPAÑOL" is being converted to a character set that does not contain the letter Ñ.
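If that conversion is happening when you read the HttpWebRequest response, the fix is to decode the response stream with the charset the server declares instead of a default. A minimal sketch, assuming a plain HttpWebRequest (the URL is a placeholder):

using System;
using System.IO;
using System.Net;
using System.Text;

class Program
{
    static void Main()
    {
        // Placeholder URL - substitute the page you are actually requesting.
        var request = (HttpWebRequest)WebRequest.Create("http://example.com/page");
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var stream = response.GetResponseStream())
        {
            // Use the charset declared by the server; fall back to ISO-8859-1.
            string charset = string.IsNullOrEmpty(response.CharacterSet)
                ? "ISO-8859-1" : response.CharacterSet;
            using (var reader = new StreamReader(stream, Encoding.GetEncoding(charset)))
            {
                Console.WriteLine(reader.ReadToEnd()); // "ESPAÑOL" arrives intact
            }
        }
    }
}

This way the bytes are decoded exactly once, with the declared encoding, and no � is ever produced.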
Related
I'm trying to decode some Base64 data which contains a mixture of English and Arabic characters. I'm using the following code to decode it:
var bytes = Convert.FromBase64String(data); //data contains base64 data
string text = Encoding.UTF8.GetString(bytes);
After decoding, I'm displaying it on the ASP page. My problem is that the English text is displayed properly, whereas in place of the Arabic text I'm getting empty boxes and question marks, like this: ����� ���
Please suggest where I'm going wrong.
After searching for a few days, I came up with this, and it is working:
byte[] plain = Convert.FromBase64String(data);
Encoding iso = Encoding.GetEncoding("ISO-8859-6");
string newData = iso.GetString(plain);
return newData;
You should run this under a debugger and see whether you get the correct Arabic text in the string text:
If text is incorrect, then the bytes (after the Base64 decode) are not encoded as UTF-8 but in some other encoding - UTF-16, Windows-1256, etc.
If text is correct, then it gets corrupted when displayed on the ASP.NET page. In that case, you should set the page's encoding to one that supports Arabic - best is UTF-8, as Shekhar suggests.
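If you want to narrow down the first case quickly, decode the same bytes with a few candidate encodings and compare the results in the debugger; Windows-1256 is the usual legacy codepage for Arabic (the candidate list here is just a guess, and data is the Base64 string from your snippet; assumes using System and using System.Text):

byte[] bytes = Convert.FromBase64String(data);
foreach (string name in new[] { "utf-8", "utf-16", "windows-1256", "iso-8859-6" })
{
    // Print the same payload decoded with each candidate encoding.
    Console.WriteLine("{0}: {1}", name, Encoding.GetEncoding(name).GetString(bytes));
}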
try this (note: this snippet is Java, using Apache Commons Codec's Base64 and java.nio.charset.StandardCharsets)
byte[] dec1_byte = Base64.decodeBase64(data.getBytes(StandardCharsets.UTF_8));
String dec1 = new String(dec1_byte, StandardCharsets.UTF_8); // decode with an explicit charset, not the platform default
byte[] newBytes = Base64.encodeBase64(dec1_byte);
String newStr = new String(newBytes, StandardCharsets.UTF_8); // the payload re-encoded back to Base64
hope this will work
Try setting the encoding on the page where you are displaying the Arabic characters:
<%@ Page RequestEncoding="utf-8" ResponseEncoding="utf-8" %>
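If you would rather not repeat the directive on every page, the same thing can be set site-wide in web.config (standard ASP.NET configuration):

<configuration>
  <system.web>
    <globalization requestEncoding="utf-8" responseEncoding="utf-8" />
  </system.web>
</configuration>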
I'm working with the ICQ protocol and I found a problem with special letters (for example, diacritics). I read that ICQ uses another encoding (CP-1251, if I remember correctly).
How can I decode the text into the correct encoding?
I've tried using the UTF8Encoding class, but without success.
I am using the ICQ-sharp library.
private void ParseMessage(string uin, byte[] data)
{
    ushort capabilities_length = LittleEndianBitConverter.Big.ToUInt16(data, 2);
    ushort msg_tlv_length = LittleEndianBitConverter.Big.ToUInt16(data, 6 + capabilities_length);
    string message = Encoding.UTF8.GetString(data, 12 + capabilities_length, msg_tlv_length - 4);
    Debug.WriteLine(message);
}
If the contact uses the same client it's OK, but if not, incoming and outgoing messages with diacritics are just unreadable.
I've determined (using this -> https://stackoverflow.com/a/12853721/846232) that the text is in BigEndianUnicode encoding. But if the string doesn't contain diacritics, decoding it that way gives unreadable output (Chinese letters), while UTF8 works fine on text without diacritics. I don't know how to decode it correctly in all cases.
If UTF-8 kinda works (i.e. it works for "english", or any US-ASCII characters), then you don't have UTF-16. Latin1 (or Windows-1252, Microsoft's variant), or e.g. Windows-1251 or Windows-1250 are perfectly possible though, since in these encodings the first part, containing the Latin letters without diacritics, is the same.
Decode like this:
var encoding = Encoding.GetEncoding("Windows-1250");
string message = encoding.GetString(data, 12 + capabilities_length, msg_tlv_length - 4);
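If some contacts send UCS-2/UTF-16BE while others send a legacy codepage, one rough heuristic (my own assumption, not anything from the ICQ spec or the icq-sharp API; assumes using System.Text) is to look for the zero high bytes that big-endian UTF-16 produces for mostly-Latin text:

// Sketch of a guess between UTF-16BE and a single-byte codepage:
// mostly-Latin UTF-16BE text has a zero byte at most even offsets.
static string DecodeIcqText(byte[] data, int offset, int count)
{
    int zeroHighBytes = 0;
    for (int i = offset; i + 1 < offset + count; i += 2)
        if (data[i] == 0) zeroHighBytes++;

    bool looksLikeUtf16Be = count >= 4 && zeroHighBytes > count / 4;
    Encoding enc = looksLikeUtf16Be
        ? Encoding.BigEndianUnicode
        : Encoding.GetEncoding("Windows-1250");
    return enc.GetString(data, offset, count);
}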
Environment: Visual Studio 2008 SP1
I have the following line in my text file:
using (var reader = File.OpenText(@"c:\temp\DATA.txt"))
{
    ...
    string textLine = "ist where [name]='Curaçao')";
}
Please notice the non-English character.
Whenever reader.ReadLine gets to this point, it turns the character into a question mark in my console application.
Any ideas how to preserve it?
You should specify the charset when constructing the reader. Note that the console may also fail to display non-ASCII characters unless its output encoding and font support them.
This is most likely an encoding issue - the reader is using a different encoding to the one the file is in.
Make sure both are using the same encoding.
File.OpenText will use the UTF8Encoding - if your file is in a different encoding, this may very well be the issue.
To specify an encoding, construct StreamReader with a constructor that takes an Encoding parameter:
using (var reader = new StreamReader(@"c:\temp\DATA.txt",
                                     Encoding.GetEncoding(860)))
{
    ...
    string textLine = "ist where [name]='Curaçao')";
}
In the above example, I am using the Portuguese encoding.
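Regarding the console half of the problem: the console replaces characters its output encoding cannot represent with '?'. Depending on the .NET version and the console font, switching the output encoding can help; a small sketch:

using System;
using System.Text;

class Program
{
    static void Main()
    {
        // Ask the console to emit UTF-8 so 'ç' is not replaced by '?'.
        // (The console font must also contain the glyph.)
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine("ist where [name]='Curaçao')");
    }
}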
I have a requirement to encode and decode Japanese characters. I tried this in Java and it worked fine with the "Cp939" encoding, but I am unable to find that encoding in .NET. The 932 encoding doesn't encode all the characters, so I need to find a way of implementing the 939 encoding in .NET.
Java code:
convStr = new String(str8859_1.getBytes("Cp037"), "Cp939");
.NET:
bytesConverted = Encoding.Convert(Encoding.GetEncoding(37),
                                  Encoding.GetEncoding(932), bytesConverted);
// This result is a jumble of characters, totally different
// from the expected output 'ニツポンバ'
convStr = Encoding.GetEncoding(1252).GetString(bytesConverted);
The encoded bytes are in the encoding 932, so why are you using the encoding 1252 when you convert the encoded bytes to a string?
The following should work:
bytesConverted = Encoding.Convert(Encoding.GetEncoding(37),
                                  Encoding.GetEncoding(932), bytesConverted);
// Decode with the same codepage the bytes were converted to.
convStr = Encoding.GetEncoding(932).GetString(bytesConverted);
Is this an error, or just how you typed it?
bytesConverted = Encoding.Convert(Encoding.GetEncoding(37),
                                  Encoding.GetEncoding(932), bytesConverted);
should be:
bytesConverted = Encoding.Convert(Encoding.GetEncoding(37),
                                  Encoding.GetEncoding(939), bytesConverted);
Surely?
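For what it's worth, there is no plain 939 code page in .NET, but if I remember correctly the .NET Framework encoding table includes 50939 ("Japanese (Latin) Extended and Japanese"), which is the closest analogue to Java's Cp939 - treat this as an assumption to verify against your data rather than a confirmed mapping:

// 50939 is assumed here to correspond to Java's Cp939; verify on your data.
bytesConverted = Encoding.Convert(Encoding.GetEncoding(37),
                                  Encoding.GetEncoding(50939), bytesConverted);
convStr = Encoding.GetEncoding(50939).GetString(bytesConverted);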
I wrote a small program for iterating through a lot of files and applying some changes where a certain string match is found. The problem I have is that different files have different encodings, so what I would like to do is check the encoding, then overwrite each file in its original encoding.
What would be the prettiest way of doing that in C# .net 2.0?
My code is very simple as of now:
String f1 = File.ReadAllText(fileList[i]).ToLower();
if (f1.Contains(oPath))
{
    f1 = f1.Replace(oPath, nPath);
    File.WriteAllText(fileList[i], f1, Encoding.Unicode);
}
I took a look at Auto encoding detect in C# which made me realize how I could detect encoding, but I am not sure how I could use that information to write in the same encoding.
Would greatly appreciate any help here.
Unfortunately encoding is one of those subjects where there is not always a definitive answer. In many cases it's much closer to guessing the encoding than detecting it. Raymond Chen wrote an excellent blog post on this subject that is worth reading:
http://blogs.msdn.com/b/oldnewthing/archive/2007/04/17/2158334.aspx
The gist of the article is:
If the BOM (byte order mark) exists, then you're golden.
Otherwise, it's guesswork and heuristics.
However, I still think the best approach is the one Darin mentioned in the question you linked: let StreamReader guess for you rather than reinventing the wheel. It only requires a very slight modification to your sample.
String f1;
Encoding encoding;
using (var reader = new StreamReader(fileList[i]))
{
    f1 = reader.ReadToEnd().ToLower();
    encoding = reader.CurrentEncoding;
}

if (f1.Contains(oPath))
{
    f1 = f1.Replace(oPath, nPath);
    File.WriteAllText(fileList[i], f1, encoding);
}
By default, .NET uses UTF-8. It is hard to detect the character encoding because most of the time .NET will read the file as UTF-8; I always have problems with ANSI.
My trick is to read the file as a stream, force it to be read as UTF-8, and check for the usual characters that should appear in the text. If they are found, it's UTF-8; otherwise it's ANSI. Then tell the user they can only use two encodings, either ANSI or UTF-8 - auto-detection doesn't quite work for my language. :p
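That trick can be made deterministic: construct a UTF8Encoding that throws on invalid byte sequences and fall back to the ANSI codepage when decoding fails. A minimal sketch (the class and method names are mine):

using System;
using System.IO;
using System.Text;

static class EncodingGuesser
{
    public static Encoding GuessUtf8OrAnsi(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);
        try
        {
            // Strict decoder: throws on any byte sequence that is not valid UTF-8.
            new UTF8Encoding(false, true).GetString(bytes);
            return Encoding.UTF8; // plain ASCII also lands here, which is harmless
        }
        catch (DecoderFallbackException)
        {
            return Encoding.Default; // the system ANSI codepage
        }
    }
}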
I am afraid you will have to know the encoding. For UTF-based encodings, though, you can use StreamReader's built-in functionality.
Taken from here:
With regard to encodings - you will need to have identified the encoding in order to use the StreamReader. However, the StreamReader itself can help if you create it with one of the constructor overloads that allows you to supply the flag detectEncodingFromByteOrderMarks as true (or you can use Encoding.GetPreamble and look at the byte preamble yourself).

Both these methods will only help auto-detect UTF-based encodings though - so any ANSI encodings with a specified codepage will probably not be parsed correctly.
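For completeness, a sketch of the Encoding.GetPreamble route mentioned above - compare the file's first bytes against each UTF preamble (the class and method names are mine):

using System;
using System.IO;
using System.Text;

static class BomDetector
{
    public static Encoding DetectByBom(string path, Encoding fallback)
    {
        byte[] head = new byte[4];
        int read;
        using (FileStream fs = File.OpenRead(path))
            read = fs.Read(head, 0, head.Length);

        // Order matters: the UTF-32 LE BOM (FF FE 00 00) starts with
        // the UTF-16 LE BOM (FF FE), so test UTF-32 first.
        foreach (Encoding enc in new[] { Encoding.UTF8, Encoding.UTF32,
                                         Encoding.Unicode, Encoding.BigEndianUnicode })
        {
            byte[] bom = enc.GetPreamble();
            if (read < bom.Length) continue;
            bool match = true;
            for (int i = 0; i < bom.Length; i++)
                if (head[i] != bom[i]) { match = false; break; }
            if (match) return enc;
        }
        return fallback; // no BOM - probably an ANSI codepage
    }
}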
Probably a bit late, but I encountered the same problem myself. Using the previous answers, I found a solution that works for me: it reads in the text using StreamReader's default encoding, extracts the encoding used by that file, and uses StreamWriter to write it back with the changes, using the found encoding. It also removes and re-adds the ReadOnly flag.
string file = "File to open";
string text;
Encoding encoding;
string oldValue = "string to be replaced";
string replacementValue = "New string";

// Temporarily clear the ReadOnly flag so the file can be rewritten.
var attributes = File.GetAttributes(file);
File.SetAttributes(file, attributes & ~FileAttributes.ReadOnly);

using (StreamReader reader = new StreamReader(file, Encoding.Default))
{
    text = reader.ReadToEnd();
    encoding = reader.CurrentEncoding; // the encoding the StreamReader detected
}

bool changedValue = false;
if (text.Contains(oldValue))
{
    text = text.Replace(oldValue, replacementValue);
    changedValue = true;
}

if (changedValue)
{
    using (StreamWriter writer = new StreamWriter(file, false, encoding))
    {
        writer.Write(text);
    }
    File.SetAttributes(file, attributes | FileAttributes.ReadOnly);
}
The solution for all Germans => ÄÖÜäöüß
This function opens the file and determines the encoding by the BOM.
If the BOM is missing, the file will be interpreted as ANSI, but if it contains UTF-8 encoded German umlauts, it will be detected as UTF-8.
https://stackoverflow.com/a/69312696/9134997
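The linked answer is not quoted here, but a sketch of the idea as described - trust the BOM when present, otherwise probe whether the bytes are valid UTF-8 containing multi-byte sequences (such as UTF-8 encoded umlauts); the class and method names are mine:

using System;
using System.IO;
using System.Text;

static class GermanEncodingGuess
{
    public static Encoding Detect(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);

        // 1. Trust an explicit BOM.
        if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
            return Encoding.UTF8;
        if (bytes.Length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
            return Encoding.Unicode;
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
            return Encoding.BigEndianUnicode;

        // 2. No BOM: assume ANSI unless the bytes are valid UTF-8
        //    and actually contain multi-byte sequences (e.g. ÄÖÜäöüß).
        try
        {
            string probe = new UTF8Encoding(false, true).GetString(bytes);
            bool hasMultiByte = Encoding.UTF8.GetByteCount(probe) != probe.Length;
            return hasMultiByte ? Encoding.UTF8 : Encoding.Default;
        }
        catch (DecoderFallbackException)
        {
            return Encoding.Default; // not valid UTF-8, so ANSI
        }
    }
}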