Avoid "3D" near = on Exchange 2010 - c#

I have a question related to encoding on Microsoft Exchange servers. I have built an app that processes messages on Exchange, and one of its options is to force the encoding to "US-ASCII".
As long as the mail goes directly through Exchange protocols, there is no problem. I have noticed the issue with messages sent by third-party mail clients (e.g. Thunderbird) over SMTP.
Although the charset visible in the source is US-ASCII, I can find "3D" next to every = character, so the source is corrupted and some parts of the message (e.g. images) do not display correctly.
To resolve this problem I tried to force 7-bit content transfer encoding, but the issue still persists.
Content-Type: text/html; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
<html><head>
<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Dus-ascii"=
>
</head>
<body bgcolor=3D"#FFFFFF" text=3D"#000000">
dsadsadsadsdsdsadasdsadasdsad<b>dsa</b>
<p style=3D"FONT-FAMILY: Arial" id=3D"c1-id-6">Some signature with image.=
</p><p style=3D"FONT-FAMILY: Arial" id=3D"c1-id-7"><img alt=3D"" src=3D"cid=
:img1.jpg" id=3D"c1-id-8"></p><p style=3D"FONT-FAMILY: Arial" id=3D"c1-id-9=
"> </p></body>
</html>
As long as the message is processed by my app, the "3D" does not appear, even after changing the charset.

Your choice of content transfer encoding is causing this: Content-Transfer-Encoding: quoted-printable
Quoted-printable uses the equals sign as an escape character, so the mail server has dutifully escaped all the 'raw' equals signs for you.
Quoted-Printable, or QP encoding, is an encoding using printable ASCII
characters (alphanumeric and the equals sign "=") to transmit 8-bit
data over a 7-bit data path or, generally, over a medium which is not
8-bit clean.[1] It is defined as a MIME content transfer encoding for
use in e-mail.
QP works by using the equals sign "=" as an escape character.
If you want to decode this yourself, look for every '=' character in the content (not the headers), read the next two characters as hexadecimal digits, and replace the '=XX' triple with the byte that hex value denotes. "=3D" decodes back to "=" under this scheme.
For more information on Content-Transfer-Encoding, refer to section 5 of RFC 1341 and to RFC 1521 at least; also consider reading the RFCs that obsolete them.
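The decoding rule described above can be sketched in C#. This is a minimal illustration, not production code; for real mail processing, prefer a MIME library rather than hand-rolling the decoder:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class QuotedPrintable
{
    // Decodes a quoted-printable body: each "=XX" hex escape becomes the raw
    // byte 0xXX, and a "=" at the end of a line (a soft line break) is dropped.
    public static string Decode(string input, Encoding charset)
    {
        var bytes = new List<byte>();
        for (int i = 0; i < input.Length; i++)
        {
            if (input[i] == '=')
            {
                if (i + 2 < input.Length &&
                    Uri.IsHexDigit(input[i + 1]) && Uri.IsHexDigit(input[i + 2]))
                {
                    // "=XX" triple: emit the byte the hex digits denote.
                    bytes.Add(Convert.ToByte(input.Substring(i + 1, 2), 16));
                    i += 2;
                }
                else
                {
                    // Soft line break: "=" followed by CRLF or LF, emit nothing.
                    if (i + 1 < input.Length && input[i + 1] == '\r') i++;
                    if (i + 1 < input.Length && input[i + 1] == '\n') i++;
                }
            }
            else
            {
                bytes.Add((byte)input[i]);
            }
        }
        return charset.GetString(bytes.ToArray());
    }
}
```

For example, `QuotedPrintable.Decode("charset=3Dus-ascii", Encoding.ASCII)` yields `charset=us-ascii`.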

Related

SendGrid inbound parse nordic chars

Completely stuck on a problem related to the inbound parse webhook functionality offered by SendGrid: https://sendgrid.com/docs/for-developers/parsing-email/setting-up-the-inbound-parse-webhook/
First off everything is working just fine with retrieving the mail sent to my application endpoint. Using Request.Form I'm able to retrieve the data and work with it.
The problem is that we started noticing question-mark symbols instead of letters when receiving some mails (written in Swedish, using Å, Ä and Ö). This occurred both with plain-text mails and with mails that have an HTML body.
However, this only happens every now and then. After a lot of searching I found that if the mail is sent from e.g. Postbox or Outlook (or the like), and the sending application has its charset set to iso-8859-1, that is when Å, Ä and Ö are replaced by question marks.
To replicate the error and be able to debug it, I set up an HTML page with a form using the iso-8859-1 encoding, sending a payload similar to the default one shown in the link above, and have since tested a multitude of things trying to get it to work.
As of now I'm trying to recode the input, without success. Code I'm testing:
Encoding wind1252 = Encoding.GetEncoding(1252);
Encoding utf8 = Encoding.UTF8;
byte[] wind1252Bytes = wind1252.GetBytes(Request.Form["html"]);
byte[] utf8Bytes = Encoding.Convert(wind1252, utf8, wind1252Bytes);
string utf8String = Encoding.UTF8.GetString(utf8Bytes);
This only results in utf8String containing the same "???" where Å, Ä and Ö should be. My guess is that Request.Form["html"] returns a UTF-16 string whose content was already decoded with the wrong encoding, iso-8859-1.
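If that guess is right, no conversion after the fact can help: the letters were destroyed at the moment the bytes were first decoded with the wrong charset, before the application code ever ran. A small sketch (assuming the body arrived as ISO-8859-1 bytes) shows why:

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Simulates the suspected pipeline: ISO-8859-1 bytes decoded as UTF-8.
    public static string DecodeWithWrongCharset(string original)
    {
        byte[] wire = Encoding.GetEncoding("iso-8859-1").GetBytes(original);
        // 0xC5 ("Å") is not a valid UTF-8 sequence, so it decodes to the
        // replacement character U+FFFD -- the original letter is lost here.
        return Encoding.UTF8.GetString(wire);
    }

    // Re-encoding the already-damaged string afterwards cannot restore it;
    // Encoding.Convert just carries the replacement character along.
    public static string TryRepair(string damaged)
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        byte[] converted = Encoding.Convert(Encoding.UTF8, latin1,
                                            Encoding.UTF8.GetBytes(damaged));
        return latin1.GetString(converted);
    }
}
```

`DecodeWithWrongCharset("Å")` returns "\uFFFD", and `TryRepair` of that yields "?", not "Å". The repair therefore has to happen where the raw request bytes are first decoded, not on the string afterwards.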
The method for fetching the POST is as follows
public async Task<InboundParseModel> FetchMail(IFormCollection form)
{
    InboundParseModel _em = new InboundParseModel
    {
        To = form["to"].SingleOrDefault(),
        From = form["from"].SingleOrDefault(),
        Subject = form["subject"].SingleOrDefault(),
        Html = form["html"].SingleOrDefault(),
        Text = System.Net.WebUtility.HtmlEncode(form["text"].SingleOrDefault()),
        Envelope = form["envelope"].SingleOrDefault()
    };
    return _em;
}
It is called from the method that receives the POST, via FetchMail(Request.Form);
Project info: ASP.NET Core 2.2, C#
So as stated earlier, I am completely stuck and don't really have any ideas on how to solve this. Any help would be much appreciated!

How to set encoding in SLUG Header using WebClient

I'm working on a project that needs to upload a file to a web application using WebClient.
I tried with the following code but the server doesn't recognize the special characters defined in the slug header and replaces them with other not printable characters.
WebClient.Headers.Clear();
WebClient.Headers.Add("Content-Type",GetMimeType(Path.GetExtension("aáñÑ.pdf")));
WebClient.Headers.Add("Accept", "*/*");
WebClient.Headers.Add("Referer", myRefererURL);
WebClient.Headers.Add("x-csrf-token", "securityTokenFromModel");
WebClient.Headers.Add("slug", "aáñÑ.pdf");
Also, after reading RFC 2047 (http://www.ietf.org/rfc/rfc2047.txt) I replaced the last line with the following code, but the server doesn't recognize the request and returns an error.
WebClient.Headers.Add("slug", "(=?ISO-8859-1?q?" + "aáñÑ.pdf" + "?=)");
Is there another way to set the encoding charset so that special characters (accents, Spanish characters) can be used in the slug header?
Edit:
After reading @Julian's answer, I tried changing the slug header to look like this:
WebClient.Headers.Add("slug", "The Beach at S%C3%A8te");
But the web application sets the filename to exactly that literal string: "The Beach at S%C3%A8te".
In another test, this is how Fiddler shows the request using filename "Documentación Ññ.docx":
Request made by Internet Explorer 11: OK
Request made by .NET WebClient and Google Chrome: ERROR
The answer is in the specification:
"The field value is the percent-encoded value of the UTF-8 encoding of the character sequence to be included (see Section 2.1 of [RFC3986] for the definition of percent encoding, and [RFC3629] for the definition of the UTF-8 encoding).
Implementation note: to produce the field value from a character sequence, first encode it using the UTF-8 encoding, then encode all octets outside the ranges %20-24 and %26-7E using percent encoding (%25 is the ASCII encoding of "%", thus it needs to be escaped). To consume the field value, first reverse the percent encoding, then run the resulting octet sequence through a UTF-8 decoding process."
https://greenbytes.de/tech/webdav/rfc5023.html#rfc.section.9.7.1
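Following that implementation note, producing the Slug value can be sketched as below (a hand-rolled helper, not a library API; the server must implement the decoding side of the spec for this to round-trip):

```csharp
using System.Text;

static class Slug
{
    // Per the implementation note: UTF-8 encode the name, then percent-encode
    // every octet outside the ranges %20-24 and %26-7E (so "%" itself, 0x25,
    // is always escaped).
    public static string Encode(string value)
    {
        var sb = new StringBuilder();
        foreach (byte b in Encoding.UTF8.GetBytes(value))
        {
            bool safe = (b >= 0x20 && b <= 0x24) || (b >= 0x26 && b <= 0x7E);
            if (safe)
                sb.Append((char)b);            // printable ASCII passes through
            else
                sb.AppendFormat("%{0:X2}", b); // everything else is %XX-escaped
        }
        return sb.ToString();
    }
}
```

For example, `Slug.Encode("The Beach at Sète")` produces the spec's own example value "The Beach at S%C3%A8te", and `Slug.Encode("aáñÑ.pdf")` produces "a%C3%A1%C3%B1%C3%91.pdf", which is what would go into `WebClient.Headers.Add("slug", ...)`.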

C# - Korean Encoding

This might be different with other Korean encoding questions.
There is this site I have to scrape and it's Korean.
An example sentence in their site is this
"개인정보보호를 위해 뒤로가기 버튼 대신 검색결과 화면 상단과 하단의 이전 버튼을 사용하시기 바랍니다."
I am using HttpWebRequest and HttpWebResponse to scrape the site.
This is how I retrieve the HTML:
-- partial code --
using (Stream data = resp.GetResponseStream())
{
response.Append(new StreamReader(data, Encoding.GetEncoding(code), true).ReadToEnd());
}
Now my problem is that I am not getting the correct Korean characters. For my "code" variable, I'm basing the code page on this MSDN list: http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx (let me narrow it down).
here are the Korean code pages:
51949, 50225, 20949, 20833, 10003, 949
but I am still not getting the correct Korean characters. What do you think is the problem?
It is very likely that the page is not in a specific Korean encoding, but one of the Unicode encodings.
Try Encoding.UTF8 or Encoding.Unicode (UTF-16) instead of the specific code pages. There are also Encoding.UTF7 and Encoding.UTF32, but they are not as common.
To be certain, examine the meta tags and headers for the content-type returned by the server.
Update (gleaned from comments):
Since the content-type header is EUC-KR, the corresponding codepage is 51949 and this is what you need to use to retrieve the page.
It was not clear that you are writing this out to a file - you need to use the same encoding when writing the file out, or convert the byte[] from the original to the output file encoding (using Encoding.Convert).
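A sketch of reading the stream with code page 51949, mirroring the snippet in the question (the RegisterProvider step is an assumption that applies only when running on .NET Core / .NET 5+, where the legacy code pages live in the System.Text.Encoding.CodePages package; on .NET Framework, GetEncoding(51949) works out of the box):

```csharp
using System.IO;
using System.Text;

static class EucKrFetch
{
    public static string ReadEucKr(Stream data)
    {
        // On .NET Core / .NET 5+ only, register the legacy code pages first:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding eucKr = Encoding.GetEncoding(51949); // EUC-KR

        // Decode the response bytes with EUC-KR, not the platform default:
        using (var reader = new StreamReader(data, eucKr))
            return reader.ReadToEnd();
    }
}
```

In the original snippet this corresponds to passing `Encoding.GetEncoding(51949)` as the `code`-based encoding to the StreamReader over `resp.GetResponseStream()`.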
While having the exact same issue, I finished with the code below:
Encoding.UTF8.GetString(DownloadData(URL));
This directly decodes the output of the WebClient GET request as UTF-8.

encoding when get page from net

I need to download a webpage. I have the following code to determine the encoding:
System.IO.StreamReader sr = null;
mFrm.InfoShotcut("Henter webside....");
try
{
    if (response.ContentEncoding != null && response.ContentEncoding != "")
    {
        sr = new System.IO.StreamReader(srm, System.Text.Encoding.GetEncoding(response.ContentEncoding));
    }
    else
    {
        sr = new System.IO.StreamReader(srm, System.Text.Encoding.GetEncoding(response.CharacterSet));
    }
    if (sr != null)
    {
        result = sr.ReadToEnd();
        if (response.CharacterSet != GetCharatset(result))
        {
            System.Text.Encoding CorrectEncoding = System.Text.Encoding.GetEncoding(GetCharatset(result));
            HttpWebRequest client2 = (HttpWebRequest)HttpWebRequest.Create(Helper.value1);
            HttpWebResponse response2 = (HttpWebResponse)client2.GetResponse();
            System.IO.Stream srm2 = response2.GetResponseStream();
            sr = new System.IO.StreamReader(srm2, CorrectEncoding);
            result = sr.ReadToEnd();
        }
    }
    mFrm.InfoShotcut("Henter webside......");
}
catch (Exception ex)
{
    // handle error
    MessageBox.Show(ex.Message);
}
It had worked great, but now I have tried it with a site that states it uses
<META http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
but really is in UTF-8. How do I detect that, so I can save the file with the right encoding?
First off, the Content-Encoding header does not describe the character set being used. As the RFC says:
Content codings are primarily used to allow a document to be compressed or otherwise usefully transformed without losing the identity of its underlying media type and without loss of information.
The character set used is described in the Content-Type header. For example:
Content-Type: text/html; charset=UTF-8
Your code above that uses the Content-Encoding header will not correctly identify the character set. You have to look at the Content-Type header, find the semicolon if it's there, and then parse the charset parameter.
And, as you've discovered, it can also be described in an HTML META tag.
Or, there might not be a character set definition at all, in which case you have to default to something. My experience has been that defaulting to UTF-8 is a good choice. It's not 100% reliable, but it seems that sites that don't include the charset parameter with the Content-Type field usually default to UTF-8. I've also found that META tags, when they exist, are wrong almost half the time.
As L.B mentioned in his comment, it's possible to download the bytes and examine them to determine the encoding. That can be done with a surprising degree of accuracy, but it requires a lot of code.
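The Content-Type parsing plus the UTF-8 fallback described above can be sketched as below. This is a hand-rolled helper, not a framework API, and real-world Content-Type values can be messier than it handles:

```csharp
using System;
using System.Text;

static class CharsetPicker
{
    // Picks a decoder from a Content-Type header value such as
    // "text/html; charset=UTF-8", defaulting to UTF-8 when no usable
    // charset parameter is present.
    public static Encoding FromContentType(string contentType)
    {
        if (!string.IsNullOrEmpty(contentType))
        {
            foreach (string part in contentType.Split(';'))
            {
                string p = part.Trim();
                if (p.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
                {
                    string name = p.Substring("charset=".Length).Trim('"', '\'');
                    try
                    {
                        return Encoding.GetEncoding(name);
                    }
                    catch (ArgumentException)
                    {
                        break; // unknown charset name: fall through to default
                    }
                }
            }
        }
        return Encoding.UTF8; // pragmatic default, as discussed above
    }
}
```

For example, `CharsetPicker.FromContentType("text/html; charset=iso-8859-1")` returns the Latin-1 encoding, while a bare "text/html" falls back to UTF-8.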

Weird characters in email

I have written a mail-processing program, which basically slaps a template on incoming mail and forwards it on. Incoming mail goes to a Gmail account, which I download using POP, then I read the mail (both html and plain text multipart-MIME), make whatever changes I need to the template, then create a new mail with the appropriate plain+html text and send it on to another address.
Trouble is, when the mail gets to the other side, some of the mails have been mangled, with weird characters like à and  magically getting inserted. They weren't in the original mails, they're not in my template, and I can't find any sort of predictable pattern as to when these characters appear. I'm sure it's got something to do with the encoding properties of the mails, but I am making sure to set both the charset and the transfer encoding of the outgoing mail to be the same as the incoming mail. So what else do I need to do?
EDIT: Here's a snipped sample of an incoming mail:
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable
=0A=0ASafari Special:=0A=0A=A0=0A=0ASafari in Thornybush Priv=
ate Game Reserve 9-12=0AJanuary 2012 (3nights)
After processing, this comes out as:
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
=0D=0A=0D=0ASafari Special:=0D=0A=0D=0A=C2=A0=0D=0A=0D=0A=
Safari in Thornybush Private Game Reserve 9-12=0D=0AJanuary=
2012 (3nights)
Notice the insertion of the =0D and =C2 characters (aside from a few =0A's that weren't in the original).
So what do you think is happening here?
ANOTHER CLUE: Here's my code that creates the alternate view:
var htmlView = AlternateView.CreateAlternateViewFromString(htmlBody, null, "text/html");
htmlView.ContentType.CharSet = charSet;
htmlView.TransferEncoding = transferEncoding;
m.AlternateViews.Add(htmlView);
Along the lines of what @mjwills suggested, perhaps the CreateAlternateViewFromString() method already assumes UTF-8, and changing it later to iso-8859-1 doesn't make a difference?
So every =0A is becoming =0D=0A.
And every =A0 is becoming =C2=A0.
The former looks like it might be related to Carriage Return / Line Feeds.
The latter looks like it might be related to What is "=C2=A0" in MIME encoded, quoted-printable text?.
My guess is that even though you have specified the charset, something along the line is treating it as UTF-8.
You may want to try using this form of CreateAlternateViewFromString, where the ContentType.CharSet is set appropriately.
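That overload takes a ContentType, so the charset is known at the moment the view (and its underlying encoded stream) is created, instead of being patched onto it afterwards. A sketch, with a hypothetical helper method wrapping the question's snippet:

```csharp
using System.Net.Mail;
using System.Net.Mime;

static class MailBuild
{
    // Builds the HTML alternate view with the charset supplied up front,
    // rather than mutating htmlView.ContentType.CharSet after creation.
    public static AlternateView BuildHtmlView(string htmlBody, string charSet)
    {
        var contentType = new ContentType("text/html") { CharSet = charSet };
        var view = AlternateView.CreateAlternateViewFromString(htmlBody, contentType);
        view.TransferEncoding = TransferEncoding.QuotedPrintable; // match the source mail
        return view;
    }
}
```

The original code would then become `m.AlternateViews.Add(MailBuild.BuildHtmlView(htmlBody, charSet));`, with no post-hoc CharSet assignment.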
