C# decoding non-ASCII characters? [closed] - c#

I am reading the meta description from a couple of sites using HtmlAgilityPack.
I noticed that if the text is not in English letters, the special characters are not decoded (for example, Japanese letters).
I am using Encoding.UTF8 - should I be using something else?
byte[] bytes = Encoding.Default.GetBytes(item.Attributes["content"].Value);
return Encoding.UTF8.GetString(bytes);

WebClient.DownloadString is a limited, high-level method which makes it awkward and error-prone to do fundamentally simple things.
Fetching a web page via HTTP is simple. You give a URL and some request headers; the server responds with some response headers and a byte stream of the body. The response headers typically state the character encoding of a text body; if not, the text body might say so itself. HtmlAgilityPack understands this and provides the HtmlWeb class to create an HtmlAgilityPack.HtmlDocument from that interaction.
var document = new HtmlWeb().Load("http://www3.nhk.or.jp/news/");
var keywords = document.DocumentNode
    .SelectSingleNode("//meta[@name='keywords']")
    .Attributes["content"]?.Value;
Console.WriteLine(keywords);
Console.WriteLine($@"
StreamEncoding: {document.StreamEncoding?.EncodingName}
DeclaredEncoding: {document.DeclaredEncoding?.EncodingName}
Encoding: {document.Encoding?.EncodingName}");
NHK,ニュース,NHK NEWS WEB
StreamEncoding: Unicode (UTF-8)
DeclaredEncoding:
Encoding: Unicode (UTF-8)

As per your comment, it seems your website is using Shift_JIS encoding, not UTF-8. I've added two samples, one for UTF-8 and one for Shift_JIS.
using (var client = new WebClient())
{
    // UTF-8
    var content = client.DownloadString("http://www3.nhk.or.jp/news/");
    var doc = new HtmlDocument();
    doc.LoadHtml(content);
    var metaDescNode = doc.DocumentNode.SelectSingleNode("//meta[@name=\"description\"]");
    var bytes = Encoding.Default.GetBytes(metaDescNode.Attributes["content"].Value);
    var decodedMetaDesc = Encoding.UTF8.GetString(bytes); // This string has decoded characters

    // Shift_JIS
    var japaneseEncoding = Encoding.GetEncoding(932);
    var content2 = client.DownloadString("http://www.toronto-electricians.com/");
    var doc2 = new HtmlDocument();
    doc2.LoadHtml(content2);
    var metaDescNode2 = doc2.DocumentNode.SelectSingleNode("//meta[@name=\"description\"]");
    var bytes2 = Encoding.Default.GetBytes(metaDescNode2.Attributes["content"].Value);
    var decodedMetaDesc2 = japaneseEncoding.GetString(bytes2); // This string has decoded characters
}
Screenshot #1 from debugger.
Screenshot #2 from debugger.
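As an aside (not part of the original answer), a simpler route is usually to tell WebClient the page's encoding up front, so DownloadString decodes the bytes correctly and no Default/UTF-8 round-trip is needed. A minimal sketch, assuming the same Shift_JIS site as above:
using (var client = new WebClient())
{
    client.Encoding = Encoding.GetEncoding(932); // Shift_JIS
    var html = client.DownloadString("http://www.toronto-electricians.com/");
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    var description = doc.DocumentNode
        .SelectSingleNode("//meta[@name='description']")
        ?.Attributes["content"]?.Value; // already correctly decoded, no byte round-trip
    Console.WriteLine(description);
}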

Related

encode and decode strange .shm file data to and from base64 c#

First, a depressing fact: https://www.base64decode.org/ can do what I want to do.
I'm trying to encode and decode (to and from base64) a model file (.shm) generated by the image processing tool MVTec Halcon, because I want to store it in an XML file.
If I open it, it has this strange form:
HSTF ÿÿÿÿ¿€ Q¿ÙG®záH?Üä4©±w?­Eè}‰#?ð ................
I'm using these methods to encode and decode it:
public static string Base64Encode(string text)
{
    byte[] textBytes = Encoding.Default.GetBytes(text);
    return Convert.ToBase64String(textBytes);
}

public static string Base64Decode(string base64EncodedData)
{
    byte[] base64EncodedBytes = Convert.FromBase64String(base64EncodedData);
    return Encoding.Default.GetString(base64EncodedBytes);
}
and calling the methods from a GUI like this:
var model = File.ReadAllText(@"C:\Users\\Desktop\model_region_nut.txt");
var base64 = ImageConverter.Base64Encode(model);
File.WriteAllText(@"C:\Users\\Desktop\base64.txt", base64);
var modelneu = ImageConverter.Base64Decode(File.ReadAllText(@"C:\Users\\Desktop\base64.txt"));
File.WriteAllText(@"C:\Users\\Desktop\modelneu.txt", modelneu);
my result for modelneu is:
HSTF ?????? Q??G?z?H???4??w??E?}??#??
so you can see that there are lots of missing characters. I guess the problem is caused by using Encoding.Default.
Thanks for your help,
Michel
If you're working with binary data, there is no reason at all to go through text decoding and encoding. Doing so only risks corrupting the data in various ways, even if you're using a consistent character encoding.
Just use File.ReadAllBytes() instead of File.ReadAllText() and skip the unnecessary Encoding step.
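A minimal sketch of that byte-based approach, reusing the paths from the question (the file content never passes through an Encoding, so nothing can be corrupted):
byte[] modelBytes = File.ReadAllBytes(@"C:\Users\\Desktop\model_region_nut.txt");
File.WriteAllText(@"C:\Users\\Desktop\base64.txt", Convert.ToBase64String(modelBytes));

byte[] roundTripped = Convert.FromBase64String(File.ReadAllText(@"C:\Users\\Desktop\base64.txt"));
File.WriteAllBytes(@"C:\Users\\Desktop\modelneu.txt", roundTripped); // byte-for-byte identical to the original file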
The problem is with reading the file with an unspecified encoding; check this question.
As mentioned there, you can use the overload of ReadAllText that specifies an encoding, and for writing you must also specify the encoding for WriteAllText. I suggest using UTF-8 encoding, so:
var model = File.ReadAllText(@"C:\Users\pichlerm\Desktop\model_region_nut.txt", Encoding.UTF8);
var base64 = ImageConverter.Base64Encode(model);
File.WriteAllText(@"C:\Users\\Desktop\base64.txt", base64, Encoding.UTF8);
var modelneu = ImageConverter.Base64Decode(File.ReadAllText(@"C:\Users\\Desktop\base64.txt", Encoding.UTF8));
File.WriteAllText(@"C:\Users\pichlerm\Desktop\modelneu.txt", modelneu, Encoding.UTF8);

HttpClient: Correct order to detect encoding

I'm using HttpClient to fetch some files. I put the content into a byte array (bytes). Now I need to detect the encoding. The content type will be HTML, CSS, JavaScript or XML.
Currently I check the charset from headers, then check for a BOM (byte order mark) before I finally check the first part of the file for a charset meta tag.
Normally this works fine, because there are no conflicts.
But: Is that order correct (in case of conflict)?
The code I currently use:
Encoding encoding;
try
{
    encoding = Encoding.GetEncoding(responseMessage.Content.Headers.ContentType.CharSet);
}
catch
{
    using (MemoryStream ms = new MemoryStream(bytes))
    {
        using (StreamReader sr = new StreamReader(ms, Encoding.Default, true))
        {
            char[] chars = new char[1024];
            sr.Read(chars, 0, 1024);
            string textDefault = new string(chars);
            if (sr.CurrentEncoding == Encoding.Default)
            {
                encoding = Global.EncodingFraContentType(textDefault);
            }
            else
            {
                encoding = sr.CurrentEncoding;
            }
        }
    }
}
responseInfo.Text = encoding.GetString(bytes);
Global.EncodingFraContentType is a regular expression that finds the charset defined either in XML declaration, or in a meta tag.
What order is the correct to detect charset/encoding?
The correct answer depends not on order, but on which actually gives the correct result, and there's no perfect answer here.
If there is a conflict, then the server has given you something incorrect. Since it's incorrect there can't be a "correct" order because there isn't a correct way of being incorrect. And, maybe the header and the embedded metadata are both wrong!
No even slightly commonly-used encoding can have something at the beginning that looks like a BOM would look in UTF-8 or UTF-16 and still be a valid example of the content types you mention, so if there's a BOM then that wins.
(The one exception to that is if the document is so badly edited as to switch encoding part-way through, which is not unheard of, but then the buggy content is so very buggy as to have no real meaning.)
If the content contains no octet greater than 0x7F, and the header and metadata claim it as different examples of US-ASCII, UTF-8, any of the ISO-8859 family of encodings, or any of the other encodings for which those octets all map to the same code points, then it doesn't really matter which you consider it to be, as the net result is the same. Consider it to be whatever the metadata says, as then you don't need to rewrite it to match correctly.
If it's in UTF-16 without a BOM, that is likely going to become clear very soon, as all of the formats you mention have a lot of characters with special meaning in the range U+0000 to U+00FF (indeed, generally U+0020 to U+007F), so you'll have lots of ranges with a zero byte every other character.
If it has octets above 0x7F and is valid UTF-8, then it's almost certainly UTF-8. (By the same token, if it's not UTF-8 and has octets above 0x7F, then it almost certainly can't be mistaken for UTF-8.)
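A small sketch of that validity test (the helper name is mine, not from the answer): a strict UTF8Encoding throws DecoderFallbackException on any malformed byte sequence, so a successful decode of content containing octets above 0x7F is strong evidence of UTF-8.
// requires: using System.Text;
static bool IsValidUtf8(byte[] bytes)
{
    var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
    try
    {
        strictUtf8.GetString(bytes); // throws DecoderFallbackException on invalid sequences
        return true;
    }
    catch (DecoderFallbackException)
    {
        return false;
    }
}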
The trickiest reasonably common case is when you have conflicting claims about two different single-octet-per-character encodings and an octet in the range 0x80-0xFF is present. This is the case you can't be sure about. If one encoding is a subset of the other (especially when C1 controls are excluded) then you could go for the superset, but that requires storing knowledge about those encodings and a considerable amount of work. Most of the time I'd be inclined to just throw an exception, and when it's found in the logs see if I can get the source to fix their bug, or special-case that source; but that doesn't work if you are dealing with a very large number of disparate sources that you may not have a relationship with. Alas, there is no perfect answer here.
It's also worth noting that sometimes both the header and the embedded metadata will agree with each other, incorrectly. A common case is content in CP-1252 that is claimed to be in ISO-8859-1.
According to the W3C FAQ:
If you have a UTF-8 byte-order mark (BOM) at the start of your file then recent browser versions other than Internet Explorer 10 or 11 will use that to determine that the encoding of your page is UTF-8. It has a higher precedence than any other declaration, including the HTTP header.
When it comes to the HTTP header vs. the meta declaration, the BOM takes precedence; the meta declaration can take precedence over the header as long as it appears within the first 1024 bytes, though there is no strict rule on that.
Conclusion, in order of importance:

1. Byte order mark (BOM): If present, this is authoritative, since it was added by the editor that actually saved the file (it can only be present in Unicode encodings).
2. Content-Type charset (in the header set by the server): For dynamically created/processed files it should be present (since the server knows), but it might not be for static files (the server just sends those).
3. Inline charset: For XML, HTML and CSS the encoding can be specified inside the document, in either the XML prolog, an HTML meta tag or @charset in CSS. To read that, you need to decode the first part of the document using, for instance, the 'Windows-1252' encoding.
4. Assume UTF-8. This is the standard of the web and is by far the most used today.

If the found encoding equals 'ISO-8859-1', use 'Windows-1252' instead (required in HTML5 - read more at Wikipedia).
Now try to decode the document using the found encoding. If error handling is turned on, that might fail! In that case, use 'Windows-1252'. This was the standard in old Windows files and works fine as a last try (there are still a lot of old files out there). This will never throw errors, though it might of course be wrong.
I have made a method that implements this. The regex I use is able to find encodings specified as:
XML: <?xml version="1.0" encoding="utf-8"?> OR <?xml encoding="utf-8"?>
HTML: <meta charset="utf-8" /> OR <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
CSS: @charset "utf-8";
(It works with both single and double quotes.)
You will need:
using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
Here's the method that returns the decoded string (parameters are the HttpClient and the Uri):
public static async Task<string> GetString(HttpClient httpClient, Uri url)
{
    byte[] bytes;
    Encoding encoding = null;
    Regex charsetRegex = new Regex(@"(?<=(<meta.*?charset=|^\<\?xml.*?encoding=|^@charset[ ]?)[""']?)[\w-]+?(?=[""';\r\n])",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture);
    using (HttpResponseMessage responseMessage = await httpClient.GetAsync(url).ConfigureAwait(false))
    {
        responseMessage.EnsureSuccessStatusCode();
        bytes = await responseMessage.Content.ReadAsByteArrayAsync().ConfigureAwait(false);
        string headerCharset = responseMessage?.Content?.Headers?.ContentType?.CharSet;
        byte[] buffer = new byte[0x1000];
        Array.Copy(bytes, buffer, Math.Min(bytes.Length, buffer.Length));
        using (MemoryStream ms = new MemoryStream(buffer))
        {
            using (StreamReader sr = new StreamReader(ms, Encoding.GetEncoding("Windows-1252"), true, buffer.Length, true))
            {
                string testString = await sr.ReadToEndAsync().ConfigureAwait(false);
                if (!sr.CurrentEncoding.Equals(Encoding.GetEncoding("Windows-1252")))
                {
                    encoding = sr.CurrentEncoding;
                }
                else if (headerCharset != null)
                {
                    encoding = Encoding.GetEncoding(headerCharset, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
                }
                else
                {
                    string inlineCharset = charsetRegex.Match(testString).Value;
                    if (!string.IsNullOrEmpty(inlineCharset))
                    {
                        encoding = Encoding.GetEncoding(inlineCharset, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
                    }
                    else
                    {
                        encoding = new UTF8Encoding(false, true);
                    }
                }
                if (encoding.Equals(Encoding.GetEncoding("iso-8859-1")))
                {
                    encoding = Encoding.GetEncoding("Windows-1252", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
                }
            }
        }
        using (MemoryStream ms = new MemoryStream(bytes))
        {
            try
            {
                using (StreamReader sr = new StreamReader(ms, encoding, false, 0x8000, true))
                {
                    return await sr.ReadToEndAsync().ConfigureAwait(false);
                }
            }
            catch (DecoderFallbackException)
            {
                ms.Position = 0;
                using (StreamReader sr = new StreamReader(ms, Encoding.GetEncoding("Windows-1252"), false, 0x8000, true))
                {
                    return await sr.ReadToEndAsync().ConfigureAwait(false);
                }
            }
        }
    }
}
You should wrap the method call in a try/catch, since HttpClient can throw errors if the request fails.
Update:
In .NET Core, you don't have the 'Windows-1252' encoding (a big mistake IMHO), so there you must settle for 'ISO-8859-1'.
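That said, if your project can reference the System.Text.Encoding.CodePages package (an assumption about your setup), registering its provider makes Windows-1252 available on .NET Core as well. A minimal sketch:
// One-time registration, e.g. at application startup (needs the System.Text.Encoding.CodePages package):
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding win1252 = Encoding.GetEncoding("Windows-1252"); // now resolves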

C# Encoding: Getting special characters from their codes

I am using a C# WinForms app to scrape some data from a webpage that uses charset ISO-8859-1. It works well for many special characters, but not all.
(* Below I use colons instead of semi-colons so that you will see the code that I see, and not the value of it)
I looked at the Page Source and I noticed that for the ones that won't display correctly, the actual code (e.g. &#363:) is in the Page Source, instead of the value. For example, in the Page Source I see Ry&#363: Murakami, but I expect to see Ryū Murakami. Also, there are many other codes that appear as codes, such as &#350: &#333: &#353: &#269: &#259: &#537: and many more.
I have tried using WebClient.DownloadString and WebClient.DownloadData.
Try #1 Code:
using (WebClient wc = new WebClient())
{
    wc.Encoding = Encoding.GetEncoding("ISO-8859-1");
    string WebPageText = wc.DownloadString("http://www.[removed].htm");
    // Scrape WebPageText here
}
Try #2 Code:
Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;
using (WebClient wc = new WebClient())
{
    wc.Encoding = iso;
    byte[] AllData = wc.DownloadData("http://www.[removed].htm");
    byte[] utfBytes = Encoding.Convert(iso, utf8, AllData);
    string WebPageText = utf8.GetString(utfBytes);
    // Scrape WebPageText here
}
I want to keep the special characters, so please don't suggest any RemoveDiacritics examples. Am I missing something?
Consider Decoding your HTML input.
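In other words, sequences like &#363; are HTML numeric character references, not an encoding problem, so decode them after downloading. A minimal sketch using System.Net.WebUtility (with real semicolons, unlike the colons shown above):
string raw = "Ry&#363; Murakami";
string decoded = System.Net.WebUtility.HtmlDecode(raw); // "Ryū Murakami"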

How to read string from HttpRequest form data in correct encoding

Today I built a service to receive emails from SendGrid and finally sent an email with the text "At long last" - the first time with non-English text during testing. Unfortunately, the encoding has become a problem that I cannot fix.
In a ServiceStack service I have a string property (in an input object that is posted to the service from SendGrid) in an encoding that is different from UTF8 or Unicode (KOI8-R in my case).
public class SengGridEmail : IReturn<SengGridEmailResponse>
{
    public string Text { get; set; }
}
When I try to convert this string to UTF8 I get ????s, probably because when I access the Text property it is already converted into Unicode (.NET's internal string representation). This question and answer illustrate the issue.
My question is how to get the original KOI8-R bytes within a ServiceStack service or ASP.NET MVC controller, so that I could convert them to UTF-8 text.
Update:
Accessing base.Request.FormData["text"] doesn't help
var originalEncoding = Encoding.GetEncoding("KOI8-R");
var originalBytes = originalEncoding.GetBytes(base.Request.FormData["text"]);
But if I take base64 string from the original sent mail and convert it to byte[], and then convert those bytes to UTF8 string - it works. Either base.Request.FormData["text"] is already in Unicode .NET string format, or (less likely) it is something on SendGrid side.
Update 2:
Here is a unit test that shows what is happening:
[Test]
public void EncodingTest()
{
    const string originalString = "наконец-то\r\n";
    const string base64Koi = "zsHLz87Fwy3Uzw0K";
    const string charset = "KOI8-R";
    var originalBytes = base64Koi.FromBase64String(); // KOI bytes
    var originalEncoding = Encoding.GetEncoding(charset); // KOI encoding
    var originalText = originalEncoding.GetString(originalBytes); // the initial string, correctly converted to .NET representation
    Assert.AreEqual(originalString, originalText);
    var unicodeEncoding = Encoding.UTF8;
    var originalWrongString = unicodeEncoding.GetString(originalBytes); // this is how the KOI string is represented in .NET, equals base.Request.FormData["text"]
    var originalWrongBytes = originalEncoding.GetBytes(originalWrongString);
    var unicodeBytes = Encoding.Convert(originalEncoding, unicodeEncoding, originalBytes);
    var result = unicodeEncoding.GetString(unicodeBytes);
    var unicodeWrongBytes = Encoding.Convert(originalEncoding, unicodeEncoding, originalWrongBytes);
    var wrongResult = unicodeEncoding.GetString(unicodeWrongBytes); // this is what I see in the DB
    Assert.AreEqual(originalString, result);
    Assert.AreEqual(originalString, wrongResult); // I want this to pass!
}
I discovered two underlying causes of my problem.
The first is from SendGrid - they post multi-part data without specifying a content-type for non-Unicode elements.
The second is from ServiceStack - currently it doesn't support encodings other than UTF-8 for multi-part data.
Update:
The SendGrid helpdesk promised to look into the issue, and ServiceStack now fully supports custom charsets in multi-part data.
As for the initial question itself, one could access the buffered request stream in ServiceStack as described here: Can ServiceStack Runner Get Request Body?

Converting a string encoded in utf8 to unicode in C#

I've got a string returned via HTTP POST from a URL in a C# application that contains some Chinese characters, e.g.:
Gelatos® Colors Gift Set中文
Problem is I want to convert it to
Gelatos® Colors Gift Set中文
Both strings are actually the same data, just decoded differently. I understand that in C# everything is UTF-16 internally. I've tried reading a lot of postings here regarding converting from one encoding to the other, but with no luck.
Hope someone can help.
Here's the C# code:
WebClient wc = new WebClient();
json = wc.DownloadString("http://mysite.com/ext/export.asp");
textBox2.Text = "Receiving orders....";
//convert the string to UTF16
Encoding ascii = Encoding.ASCII;
Encoding unicode = Encoding.Unicode;
Encoding utf8 = Encoding.UTF8;
byte[] asciiBytes = ascii.GetBytes(json);
byte[] utf8Bytes = utf8.GetBytes(json);
byte[] unicodeBytes = Encoding.Convert(utf8, unicode, utf8Bytes);
string sOut = unicode.GetString(unicodeBytes);
System.Windows.Forms.MessageBox.Show(sOut); //doesn't work...
Here's the code from the server:
<%@ CodePage = 65001 %>
<%option explicit%>
<%
Session.CodePage = 65001
Response.charset ="utf-8"
Session.LCID = 1033 'en-US
.....
response.write (strJSON)
%>
The output in the browser is correct, but I was wondering whether some change is applied to the HTTP stream on its way to the C# application.
Thanks.
Download the web pages as bytes in the first place. Then, convert the bytes to the correct encoding.
By first converting it using the wrong encoding you are probably losing data, especially when using ASCII.
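A minimal sketch of that approach, assuming the server really does send UTF-8 as its headers claim:
using (WebClient wc = new WebClient())
{
    byte[] raw = wc.DownloadData("http://mysite.com/ext/export.asp");
    string json = Encoding.UTF8.GetString(raw); // decode once, with the correct encoding
}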
If the server is really returning UTF-8 text, you can configure your WebClient by setting its Encoding property. This would eliminate any need for subsequent conversions.
using (WebClient wc = new WebClient())
{
    wc.Encoding = Encoding.UTF8;
    json = wc.DownloadString("http://mysite.com/ext/export.asp");
}
