Reading email body with encoding ISO-8859-1 - c#

I'm using Mailkit for reading some email's body content by using IMAP.
Some of these emails come with content-type text/plain and charset ISO-8859-1 which causes that my code replaces some Latin characters á é í ó ú and apparently also CR and LF by weird chars such as =E1 =FA =F3 =...
var body = message.BodyParts.OfType<BodyPart>().FirstOrDefault(x => x.ContentType.IsMimeType("text", "plain"));
var bodyText = (TextPart)folder.GetBodyPart(message.UniqueId, body);
var bodyContent = bodyText.Text;
There is no problem when opening these emails with email clients such as Thunderbird or Outlook. They are showing these chars as they are. I want to be able to retrieve these Latin chars.
I've tried with some encoding options with no success.
var bodyContent = bodyText.GetText(System.Text.Encoding.ASCII);
var bodyContent = bodyText.GetText(System.Text.Encoding.UTF-8);

Normally you don't need to decode quoted-printable encoded content yourself, but my guess is that the client that sent this message encoded the content using the quoted-printable encoding but did not set the Content-Transfer-Encoding header properly.
I would probably change your code to something more like this:
// figure out which body part we need
var body = message.BodyParts.OfType<BodyPartText>().FirstOrDefault(x => x.ContentType.IsMimeType("text", "plain"));
// download the body part we need
var bodyText = (TextPart)folder.GetBodyPart(message.UniqueId, body);
// If it's encoded using quoted-printable we'll need to decode it first.
// To do so, we'll need the charset.
//
// The reason I would get it from the `bodyText.ContentType` is because
// this will work even if you used MessageSummaryItems.Body instead of
// MessageSummaryItems.BodyStructure.
var charset = bodyText.ContentType.Charset;
// Decodes the content by using QuotedPrintableDecoder from MimeKit library.
var bodyContent = DecodeQuotedPrintable(bodyText.Content, charset);
// The main changes I'm making to this function compared to what you have is
// using the stream/filter interfaces rather than using the low-level decoder
// directly. You can do it either way, but if you continue using your
// method - I would recommend using Encoding.UTF8.GetBytes() rather than
// Encoding.ASCII.GetBytes() because UTF-8 can handle all strings while
// ASCII cannot.
static string DecodeQuotedPrintable (IMimeContent content, string charset)
{
using (var output = new MemoryStream ()) {
using (filtered = new FilteredStream (output)) {
// add a quoted-printable decoder
filtered.Add (DecoderFilter.Create (ContentEncoding.QuotedPrintable));
// pump the content through the decoder
content.DecodeTo (filtered);
// flush the filtered stream
filtered.Flush ();
}
var encoding = Encoding.GetEncoding (charset);
return encoding.GetString (output.GetBuffer (), 0, (int) output.Length);
}
}

The message body is encoded using quoted printable.
You have to decode it first.
In MailKit it should be the DecodeTo method

I could finally get it working by using QuotedPrintableDecoder from MimeKit library.
var body = message.BodyParts.OfType<BodyPart>().FirstOrDefault(x => x.ContentType.IsMimeType("text", "plain"));
// If it's encoded using quoted-printable we'll need to decode it first. To do so, we'll need the charset.
var charset = body.ContentType.Charset;
var bodyText = (TextPart)folder.GetBodyPart(message.UniqueId, body);
// Decodes the content by using QuotedPrintableDecoder from MimeKit library.
var bodyContent = DecodeQuotedPrintable(bodyText.Text, charset);
static string DecodeQuotedPrintable (string input, string charset)
{
var decoder = new QuotedPrintableDecoder ();
var buffer = Encoding.ASCII.GetBytes (input);
var output = new byte[decoder.EstimateOutputLength (buffer.Length)];
int used = decoder.Decode (buffer, 0, buffer.Length, output);
var encoding = Encoding.GetEncoding (charset);
return encoding.GetString (output, 0, used);
}

Related

How to remove BOM from an encoded base64 UTF string?

I have a file encoded in base64 using openssl base64 -in en -out en1 in a command line in MacOS and I am reading this file using the following code:
string fileContent = File.ReadAllText(Path.Combine(AppContext.BaseDirectory, MConst.BASE_DIR, "en1"));
var b1 = Convert.FromBase64String(fileContent);
var str1 = System.Text.Encoding.UTF8.GetString(b1);
The string I am getting has a ? before the actual file content. I am not sure what's causing this, any help will be appreciated.
Example Input:
import pandas
import json
Encoded file example:
77u/DQppbXBvcnQgY29ubmVjdG9yX2FwaQ0KaW1wb3J0IGpzb24NCg0K
Output based on the C# code:
?import pandas
import json
Normally, when you read UTF (with BOM) from a text file, the decoding is handled for you behind the scene. For example, both of the following lines will read UTF text correctly regardless of whether or not the text file has a BOM:
File.ReadAllText(path, Encoding.UTF8);
File.ReadAllText(path); // UTF8 is the default.
The problem is that you're dealing with UTF text that has been encoded to a Base64 string. So, ReadAllText() can no longer handle the BOM for you. You can either do it yourself by (checking and) removing the first 3 bytes from the byte array or delegate that job to a StreamReader, which is exactly what ReadAllText() does:
var bytes = Convert.FromBase64String(fileContent);
string finalString = null;
using (var ms = new MemoryStream(bytes))
using (var reader = new StreamReader(ms)) // Or:
// using (var reader = new StreamReader(ms, Encoding.UTF8))
{
finalString = reader.ReadToEnd();
}
// Proceed to using finalString.

Flurl AddFile fileName Encoding

I try to use flurl to send a file like this:
public ImportResponse Import(ImportRequest request, string fileName, Stream stream)
{
request).PostAsync(content).Result<ImportTariffResponse>();
return FlurlClient(Routes.Import, request).PostMultipartAsync(mp => mp.AddJson("json", request).AddFile("file", stream, ConvertToAcsii(fileName))).Result<ImportResponse>();
}
fileName = "Файл импорта тарифов (1).xlsx"
But in post method I get this:
Request.Files.FirstOrDefault().FileName =
"=?utf-8?B?0KTQsNC50Lsg0LjQvNC/0L7RgNGC0LAg0YLQsNGA0LjRhNC+0LIgKDEpLnhsc3g=?="
Any suggestions?
The filename appears to be encoded using MIME encoded-word syntax. (Flurl doesn't do this directly, it presumably happens deeper down in the HttpClient libraries when non-ASCII characters are detected.) .NET doesn't directly support decoding this format, but you can do it yourself fairly easily. If you strip the =?utf-8?B? from the beginning and ?= from the end, what you're left with is your filename base64 encoded.
Here's one way you could do it:
var base64 = Request.Files.FirstOrDefault().FileName.Split('?')[3];
var bytes = Convert.FromBase64String(base64);
var filename = Encoding.UTF8.GetString(bytes);

Dynamic loaded XML and XmlTextReader character encoding issue

I load a XML like this:
var url = Application.dataPath + #"/config.xml";
var www = new WWW(url);
while (!www.isDone)
{
yield return new WaitForSeconds(0.2f);
}
After that I create a XmlTextReader in order to parse that XML:
GameSettings.ParseXML(new XmlTextReader(new StringReader(www.text)));
But I'm having problem with character encoding (é,ç,ã,ê, etc). What can I do make it works?
If you use WWW.text, the function expects the web page contents encoded in UTF-8 or ASCII but your customer uses Windows-1252.
Like Bart already suggested, the best way would be to request that the customer just uses UTF-8. If that is not possible und you are sure that the customer always uses Windows-1252 you can convert the encoding inside your application.
Encoding windows1252 = Encoding.GetEncoding("Windows-1252");
Encoding utf8 = Encoding.UTF8;
byte[] windowsBytes = www.bytes;
byte[] utf8Bytes = Encoding.Convert(windows1252, utf8, windowsBytes);
string converted_xml = utf8.GetString(utf8Bytes);

Bytes read as UTF8 string and converted to Base64

Forgive the lengthy setup here but I thought it may help to have the context...
I am implementing a custom digital signature validation method in as part of a WCF service. We're using a custom method because various differing interpretations of some industry standards but the details there aren't all that relevant.
In this particular scenario, I am receiving an MTOM/XOP encoded request where the root MIME part contains a digital signature and the signature DigestValue and SignatureValue pieces are split up into separate MIME parts.
The MIME parts that contain the signature DigestValue and SignatureValue data is binary encoded so it is literally a bunch of raw bytes in the web request like this:
Content-Id: <c18605af-18ec-4fcb-bec7-e3767ef6fe53#example.jaxws.sun.com>
Content-Type: application/octet-stream
Content-Transfer-Encoding: binary
[non-printable-binary-data-goes-here]
--uuid:eda4d7f2-4647-4632-8ecb-5ba44f1a076d
I am reading the contents of the message in as a string (using the default UTF8 encoding) like this (see the requestAsString parameter below):
MessageBuffer buffer = request.CreateBufferedCopy(int.MaxValue);
try
{
using (MemoryStream mstream = new MemoryStream())
{
buffer.WriteMessage(mstream);
mstream.Position = 0;
using (StreamReader sr = new StreamReader(mstream))
{
requestAsString = sr.ReadToEnd();
}
request = buffer.CreateMessage();
}
}
After I read the MTOM/XOP message in, I am attempting to re-organize the multiple MIME parts into one SOAP message where the signature DigestValue and SignatureValue elements are restored to the original SOAP envelope (and not as attachments). So basically I am taking decoding the MTOM/XOP request.
Unfortunately, I am having trouble reading the DigestValue and SignatureValue pieces correctly. I need to read the bytes out of the message and get the base64 string representation of that data.
Despite all the context above, it seems the core problem is reading the binary data in as a string (UTF8 encoded) and then converting it to a proper base64 representation.
Here is what I am seeing in my test code:
This is my example base64 string:
string base64String = "mowXMw68eLSv9J1W7f43MvNgCrc=";
I can then get the byte representation of that string. This yields an array of 20 bytes:
byte[] base64Bytes = Convert.FromBase64String(base64String);
I then get the UTF8 encoded version of those bytes:
string decodedString = UTF8Encoding.UTF8.GetString(base64Bytes);
Now the strange part... if I convert the string back to bytes as follows, I get an array of bytes that is 39 bytes long:
byte[] base64BytesBack = UTF8Encoding.UTF8.GetBytes(decodedString);
So obviously at this point, when I convert back into a base64 string, it doesn't match the original value:
string base64StringBack = Convert.ToBase64String(base64BytesBack);
base64StringBack is set to "77+977+9FzMO77+9eO+/ve+/ve+/vVbvv73vv703Mu+/vWAK77+9"
What am I doing wrong here? If I switch to using UTF8Encoding.Unicode.GetString() and UTF8Encoding.Unicode.GetBytes(), it works as expected:
string base64String = "mowXMw68eLSv9J1W7f43MvNgCrc=";
// First get an array of bytes from the base64 string
byte[] base64Bytes = Convert.FromBase64String(base64String);
// Get the Unicode representation of the base64 bytes.
string decodedString = UTF8Encoding.Unicode.GetString(base64Bytes);
byte[] base64BytesBack = UTF8Encoding.Unicode.GetBytes(decodedString);
string base64StringBack = Convert.ToBase64String(base64BytesBack);
Now base64StringBack is set to "mowXMw68eLSv9J1W7f43MvNgCrc=" so it seems I am mis-using the UTF8 encoding somehow or it is behaving differently than I would expect.
Arbitrary binary data cannot be decoded into an UTF8 encoded string and then encoded back to the same binary data. The paragraph "Invalid byte sequences" in http://en.wikipedia.org/wiki/UTF-8 points that out.
I am a bit confused as to why you want the data encoded/decoded as UTF8.
Ok, I took a different approach to reading the MTOM/XOP message:
Instead of relying on my own code to parse the MIME parts by hand, I just used XmlDictionaryReader.CreateMtomReader() to get an XmlDictionaryReader and read the message into an XmlDocument (being careful to preserve whitespace on the XmlDocument so digital signatures aren't broken):
MessageBuffer buffer = request.CreateBufferedCopy(int.MaxValue);
messageContentType = WebOperationContext.Current.IncomingRequest.ContentType;
try
{
using (MemoryStream mstream = new MemoryStream())
{
buffer.WriteMessage(mstream);
mstream.Position = 0;
if (messageContentType.Contains("multipart/related;"))
{
Encoding[] encodings = new Encoding[1];
encodings[0] = Encoding.UTF8;
// MTOM
using (XmlDictionaryReader reader = XmlDictionaryReader.CreateMtomReader(mstream, encodings, messageContentType, XmlDictionaryReaderQuotas.Max))
{
XmlDocument msgDoc = new XmlDocument();
msgDoc.PreserveWhitespace = true;
msgDoc.Load(reader);
requestAsString = msgDoc.OuterXml;
reader.Close();
}
}
else
{
// Text
using (StreamReader sr = new StreamReader(mstream))
{
requestAsString = sr.ReadToEnd();
}
}
request = buffer.CreateMessage();
}
}
finally
{
buffer.Close();
}

How to disable base64-encoded filenames in HttpClient/MultipartFormDataContent

I'm using HttpClient to POST MultipartFormDataContent to a Java web application. I'm uploading several StringContents and one file which I add as a StreamContent using MultipartFormDataContent.Add(HttpContent content, String name, String fileName) using the method HttpClient.PostAsync(String, HttpContent).
This works fine, except when I provide a fileName that contains german umlauts (I haven't tested other non-ASCII characters yet). In this case, fileName is being base64-encoded. The result for a file named 99 2 LD 353 Temp Äüöß-1.txt
looks like this:
__utf-8_B_VGVtcCDvv73vv73vv73vv71cOTkgMiBMRCAzNTMgVGVtcCDvv73vv73vv73vv70tMS50eHQ___
The Java server shows this encoded file name in its UI, which confuses the users. I cannot make any server-side changes.
How do I disable this behavior? Any help would be highly appreciated.
Thanks in advance!
I just found the same limitation as StrezzOr, as the server that I was consuming didn't respect the filename* standard.
I converted the filename to a byte array of the UTF-8 representation, and the re-armed the bytes as chars of "simple" string (non UTF-8).
This code creates a content stream and add it to a multipart content:
FileStream fs = File.OpenRead(_fullPath);
StreamContent streamContent = new StreamContent(fs);
streamContent.Headers.Add("Content-Type", "application/octet-stream");
String headerValue = "form-data; name=\"Filedata\"; filename=\"" + _Filename + "\"";
byte[] bytes = Encoding.UTF8.GetBytes(headerValue);
headerValue="";
foreach (byte b in bytes)
{
headerValue += (Char)b;
}
streamContent.Headers.Add("Content-Disposition", headerValue);
multipart.Add(streamContent, "Filedata", _Filename);
This is working with spanish accents.
Hope this helps.
I recently found this issue and I use a workaround here:
At server side:
private static readonly Regex _regexEncodedFileName = new Regex(#"^=\?utf-8\?B\?([a-zA-Z0-9/+]+={0,2})\?=$");
private static string TryToGetOriginalFileName(string fileNameInput) {
Match match = _regexEncodedFileName.Match(fileNameInput);
if (match.Success && match.Groups.Count > 1) {
string base64 = match.Groups[1].Value;
try {
byte[] data = Convert.FromBase64String(base64);
return Encoding.UTF8.GetString(data);
}
catch (Exception) {
//ignored
return fileNameInput;
}
}
return fileNameInput;
}
And then use this function like this:
string correctedFileName = TryToGetOriginalFileName(fileRequest.FileName);
It works.
In order to pass non-ascii characters in the Content-Disposition header filename attribute it is necessary to use the filename* attribute instead of the regular filename. See spec here.
To do this with HttpClient you can do the following,
var streamcontent = new StreamContent(stream);
streamcontent.Headers.ContentDisposition = new ContentDispositionHeaderValue("attachment") {
FileNameStar = "99 2 LD 353 Temp Äüöß-1.txt"
};
multipartContent.Add(streamcontent);
The header will then end up looking like this,
Content-Disposition: attachment; filename*=utf-8''99%202%20LD%20353%20Temp%20%C3%84%C3%BC%C3%B6%C3%9F-1.txt
I finally gave up and solved the task using HttpWebRequest instead of HttpClient. I had to build headers and content manually, but this allowed me to ignore the standards for sending non-ASCII filenames. I ended up cramming unencoded UTF-8 filenames into the filename header, which was the only way the server would accept my request.

Categories