I'm working on a project that needs to upload a file into a web application using WebClient.
I tried the following code, but the server doesn't recognize the special characters defined in the slug header and replaces them with non-printable characters.
WebClient.Headers.Clear();
WebClient.Headers.Add("Content-Type",GetMimeType(Path.GetExtension("aáñÑ.pdf")));
WebClient.Headers.Add("Accept", "*/*");
WebClient.Headers.Add("Referer", myRefererURL);
WebClient.Headers.Add("x-csrf-token", "securityTokenFromModel");
WebClient.Headers.Add("slug", "aáñÑ.pdf");
Also, after reading RFC 2047 (http://www.ietf.org/rfc/rfc2047.txt), I replaced the last line with the following code, but the server doesn't recognize the request and returns an error.
WebClient.Headers.Add("slug", "(=?ISO-8859-1?q?" + "aáñÑ.pdf" + "?=)");
Is there another way to set the encoding charset to allow special characters (accents, Spanish characters) in the slug header?
Edit:
After reading Julian's answer, I tried changing the slug header to look like this:
WebClient.Headers.Add("slug", "The Beach at S%C3%A8te");
But the web application sets the filename to exactly "The Beach at S%C3%A8te".
In another test, this is how Fiddler shows the request using filename "Documentación Ññ.docx":
Request made by Internet Explorer 11: OK
Request made by .NET WebClient and Google Chrome: ERROR
The answer is in the specification:
"The field value is the percent-encoded value of the UTF-8 encoding of the character sequence to be included (see Section 2.1 of [RFC3986] for the definition of percent encoding, and [RFC3629] for the definition of the UTF-8 encoding).
Implementation note: to produce the field value from a character sequence, first encode it using the UTF-8 encoding, then encode all octets outside the ranges %20-24 and %26-7E using percent encoding (%25 is the ASCII encoding of "%", thus it needs to be escaped). To consume the field value, first reverse the percent encoding, then run the resulting octet sequence through a UTF-8 decoding process."
https://greenbytes.de/tech/webdav/rfc5023.html#rfc.section.9.7.1
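Following that implementation note, a minimal sketch of producing the slug value in C# could look like this (the helper name and the example filename are mine, not from the spec; whether the receiving application decodes the value again is up to the server, as the edit above shows):

```csharp
using System;
using System.Text;

static class SlugEncoder
{
    // Sketch of the RFC 5023 implementation note: UTF-8 encode the name,
    // then percent-encode every octet outside the ranges %20-24 and %26-7E
    // (this includes "%" itself, which is %25 and therefore gets escaped).
    public static string Encode(string name)
    {
        var sb = new StringBuilder();
        foreach (byte b in Encoding.UTF8.GetBytes(name))
        {
            bool safe = (b >= 0x20 && b <= 0x24) || (b >= 0x26 && b <= 0x7E);
            if (safe) sb.Append((char)b);
            else sb.AppendFormat("%{0:X2}", b);
        }
        return sb.ToString();
    }
}
```

Used as `WebClient.Headers.Add("slug", SlugEncoder.Encode("aáñÑ.pdf"));`, this sends `a%C3%A1%C3%B1%C3%91.pdf` on the wire.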
Related
Completely stuck on a problem related to the inbound parse webhook functionality offered by SendGrid: https://sendgrid.com/docs/for-developers/parsing-email/setting-up-the-inbound-parse-webhook/
First off everything is working just fine with retrieving the mail sent to my application endpoint. Using Request.Form I'm able to retrieve the data and work with it.
The problem is that we started noticing question mark symbols instead of letters when receiving some mails (written in Swedish using Å, Ä and Ö). This occurred both with plain-text mails and mails with an HTML body.
However, this only happens every now and then. After a lot of searching I found out that if the mail is sent from e.g. Postbox or Outlook (or the like), and the sending application has the charset set to ISO-8859-1, that's when Å, Ä and Ö are replaced by question marks.
To replicate the error and be able to debug it, I set up an HTML page with a form using the ISO-8859-1 encoding, sending a payload similar to the default one shown in the link above. After that I've been testing a multitude of things trying to get it to work.
As of now I'm trying to re-encode the input, without success. The code I'm testing:
Encoding wind1252 = Encoding.GetEncoding(1252);
Encoding utf8 = Encoding.UTF8;
byte[] wind1252Bytes = wind1252.GetBytes(Request.Form["html"]); // fixed: Request.Form.["html"] is a syntax error
byte[] utf8Bytes = Encoding.Convert(wind1252, utf8, wind1252Bytes);
string utf8String = Encoding.UTF8.GetString(utf8Bytes);
This only results in utf8String containing the same "???" where Å, Ä and Ö should be. My guess is that Request.Form["html"] returns a UTF-16 string whose content was already decoded with the wrong encoding (ISO-8859-1).
The method for fetching the POST is as follows:
public async Task<InboundParseModel> FetchMail(IFormCollection form)
{
    InboundParseModel _em = new InboundParseModel
    {
        To = form["to"].SingleOrDefault(),
        From = form["from"].SingleOrDefault(),
        Subject = form["subject"].SingleOrDefault(),
        Html = form["html"].SingleOrDefault(),
        Text = System.Net.WebUtility.HtmlEncode(form["text"].SingleOrDefault()),
        Envelope = form["envelope"].SingleOrDefault()
    };
    return _em; // the original snippet was missing this return (the method declares Task<InboundParseModel>)
}
It is called from another method (the one the POST goes to) via FetchMail(Request.Form);
Project info: ASP.NET Core 2.2, C#
So as stated earlier, I am completely stuck and don't really have any ideas on how to solve this. Any help would be much appreciated!
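One thing worth checking (an assumption on my part, not a confirmed fix): by the time Request.Form hands you a string containing '?', the original bytes are usually already gone, because the form decoder replaced the invalid UTF-8 sequences during decoding, so Encoding.Convert on the resulting string cannot recover anything. Recovery has to happen at the byte level, before the wrong decode, e.g. by decoding the raw field bytes with the charset SendGrid advertises in its "charsets" form field. A minimal illustration of the byte-level difference (the sample word "håll" is mine):

```csharp
using System;
using System.Text;

class CharsetDemo
{
    static void Main()
    {
        // "håll" encoded as ISO-8859-1 bytes, as an ISO-8859-1 mail client would send it.
        byte[] raw = { 0x68, 0xE5, 0x6C, 0x6C };

        // Decoding those bytes as UTF-8 destroys the data: 0xE5 is not valid
        // UTF-8 here, so it becomes U+FFFD and the "å" is unrecoverable.
        string wrong = Encoding.UTF8.GetString(raw);

        // Decoding with the correct charset first preserves it.
        string right = Encoding.GetEncoding("ISO-8859-1").GetString(raw); // "håll"

        Console.WriteLine(wrong);
        Console.WriteLine(right);
    }
}
```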
EDIT: The characters come through correctly until, in the middle of the page, this line appears: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "http://www.w3.org/TR/REC-html40/strict.dtd">. After it, special characters such as é arrive as HTML entities (which browsers render fine), but are represented as eacute; (without the &) when downloaded via WebClient. END EDIT
I am extracting an excerpt from a web page using WebClient + RegEx.
But even after setting the encoding correctly, é still comes out as eacute;, ç as ccedil;, í as iacute;, etc.
I followed the "DownloadString and Special Characters" example to set the charset correctly (ISO-8859-1):
System.Net.WebClient wc = new System.Net.WebClient();
wc.DownloadString("https://myurl"); // first request, only to populate ResponseHeaders
var contentType = wc.ResponseHeaders["Content-Type"];
var charset = Regex.Match(contentType, "charset=([^;]+)").Groups[1].Value;
wc.Encoding = Encoding.GetEncoding(charset);
It does set the charset to match the document's (ISO-8859-1), but when I do the follow-up DownloadString (I know I could set the encoding beforehand and make just one wc.DownloadString call, but I wanted to follow the accepted answer's example):
string result = wc.DownloadString("https://myurl");
The special characters still come out wrong.
NOTE: I am using a non-English Windows 10 (if it's relevant)
NOTE 2: The page's special characters appear correctly in any browser
My question is: why doesn't WebClient download the page correctly even with the correct charset set?
using System.Text;
wc.Encoding = Encoding.UTF8;
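If the charset detection from the question is worth keeping, an alternative is to download the bytes once and decode them yourself after reading the Content-Type header, so the body isn't fetched twice ("https://myurl" is kept as the placeholder from the question; the UTF-8 fallback is my assumption):

```csharp
using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

class SingleRequestDownload
{
    static void Main()
    {
        using (var wc = new WebClient())
        {
            // One request: fetch raw bytes, then pick the charset from the headers.
            byte[] raw = wc.DownloadData("https://myurl");
            var contentType = wc.ResponseHeaders["Content-Type"] ?? "";
            var m = Regex.Match(contentType, "charset=([^;]+)");
            Encoding enc = m.Success
                ? Encoding.GetEncoding(m.Groups[1].Value.Trim('"'))
                : Encoding.UTF8; // assumption: fall back to UTF-8 when no charset is declared
            string result = enc.GetString(raw);
            Console.WriteLine(result.Length);
        }
    }
}
```

Note that HTML entities such as &eacute; are unrelated to the transport encoding; losing the leading & suggests the RegEx (or some later processing step) is stripping it.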
I have a question related to encoding on Microsoft Exchange servers. I have built an app that processes messages on Exchange, and one of its options is to force the encoding to "US-ASCII".
As long as the mail goes directly through Exchange protocols, there is no problem. I have noticed the issue with messages sent by third-party mail clients (e.g. Thunderbird) over SMTP.
Although the charset is shown in the source as US-ASCII, I can find "3D" next to every = character, so the source looks corrupted and some parts of the message are not displayed correctly (e.g. images).
To resolve this I tried forcing 7-bit content transfer encoding, but the issue still persists.
Content-Type: text/html; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
<html><head>
<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Dus-ascii"=
>
</head>
<body bgcolor=3D"#FFFFFF" text=3D"#000000">
dsadsadsadsdsdsadasdsadasdsad<b>dsa</b>
<p style=3D"FONT-FAMILY: Arial" id=3D"c1-id-6">Some signature with image.=
</p><p style=3D"FONT-FAMILY: Arial" id=3D"c1-id-7"><img alt=3D"" src=3D"cid=
:img1.jpg" id=3D"c1-id-8"></p><p style=3D"FONT-FAMILY: Arial" id=3D"c1-id-9=
"> </p></body>
</html>
As long as the message is processed by my app, the "3D" does not appear, even after changing the charset.
Your choice of content transfer encoding is causing this: Content-Transfer-Encoding: quoted-printable
Quoted-printable uses the equals sign as an escape character, so the mail server has dutifully escaped all the 'raw' equals signs for you.
Quoted-Printable, or QP encoding, is an encoding using printable ASCII
characters (alphanumeric and the equals sign "=") to transmit 8-bit
data over a 7-bit data path or, generally, over a medium which is not
8-bit clean.[1] It is defined as a MIME content transfer encoding for
use in e-mail.
QP works by using the equals sign "=" as an escape character.
If you wanted to process this properly, look for every '=' character in the content (not the headers), read the next two characters, and replace the '=XX' triple with the ASCII character for the hex value you read. Under this scheme, "=3D" decodes back to "=".
For more information on Content-Transfer-Encoding refer to section 5 of RFC 1341, and RFC 1521 at least; consider reading the RFCs that obsolete the above RFCs.
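The decoding procedure described above can be sketched in C# as follows. This is a minimal illustration, not a replacement for a real MIME library: besides =XX escapes it only handles soft line breaks (a bare "=" at end of line, visible in the sample source above), and the class and method names are mine:

```csharp
using System;
using System.IO;
using System.Text;

static class Qp
{
    // Minimal quoted-printable decoder sketch: =XX escapes plus soft line
    // breaks ("=" immediately followed by a line ending is dropped).
    public static string Decode(string input, Encoding charset)
    {
        var bytes = new MemoryStream();
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c == '=' && i + 2 < input.Length
                && Uri.IsHexDigit(input[i + 1]) && Uri.IsHexDigit(input[i + 2]))
            {
                // "=XX" -> the byte with hex value XX
                bytes.WriteByte(Convert.ToByte(input.Substring(i + 1, 2), 16));
                i += 2;
            }
            else if (c == '=' && i + 1 < input.Length
                && (input[i + 1] == '\r' || input[i + 1] == '\n'))
            {
                // soft line break: skip the "=" and the CRLF/LF after it
                if (input[i + 1] == '\r' && i + 2 < input.Length && input[i + 2] == '\n') i += 2;
                else i += 1;
            }
            else
            {
                bytes.WriteByte((byte)c);
            }
        }
        return charset.GetString(bytes.ToArray());
    }
}
```

For example, `Qp.Decode("content=3D\"text/html\"", Encoding.ASCII)` yields `content="text/html"`.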
This might be different from other Korean encoding questions.
There is this site I have to scrape and it's Korean.
An example sentence in their site is this
"개인정보보호를 위해 뒤로가기 버튼 대신 검색결과 화면 상단과 하단의 이전 버튼을 사용하시기 바랍니다."
I am using HttpWebRequest and HttpWebResponse to scrape the site.
This is how I retrieve the HTML:
-- partial code --
using (Stream data = resp.GetResponseStream())
{
response.Append(new StreamReader(data, Encoding.GetEncoding(code), true).ReadToEnd());
}
Now my problem is, I'm not getting the correct Korean characters. For my "code" variable, I'm basing the code page on this MSDN list: http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx (let me narrow it down).
here are the Korean code pages:
51949, 50225, 20949, 20833, 10003, 949
but I'm still not getting the correct Korean characters. What do you think is the problem?
It is very likely that the page is not in a specific Korean encoding, but one of the Unicode encodings.
Try Encoding.UTF8 or Encoding.Unicode (UTF-16) instead of the specific code pages. There are also Encoding.UTF7 and Encoding.UTF32, but they are not as common. (Note that Encoding.Default is the system's ANSI code page, not a Unicode encoding.)
To be certain, examine the meta tags and headers for the content-type returned by the server.
Update (gleaned from comments):
Since the content-type header is EUC-KR, the corresponding codepage is 51949 and this is what you need to use to retrieve the page.
It was not clear that you are writing this out to a file: you need to use the same encoding when writing the file, or convert the byte[] from the original encoding to the output file's encoding (using Encoding.Convert).
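Putting the comment findings together, a sketch of fetching an EUC-KR page might look like this (the URL is a placeholder; on .NET Core/5+ the code-page encodings additionally require the System.Text.Encoding.CodePages package, while on .NET Framework they are built in):

```csharp
using System;
using System.Net;
using System.Text;

class KoreanScrape
{
    static void Main()
    {
        // Required on .NET Core/5+ so that GetEncoding(51949) resolves;
        // harmless assumption here, and unnecessary on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        using (var wc = new WebClient())
        {
            wc.Encoding = Encoding.GetEncoding(51949); // EUC-KR, per the content-type header
            string html = wc.DownloadString("https://example.kr"); // placeholder URL
            Console.WriteLine(html.Length);
        }
    }
}
```

When saving the result, open the output file with the same encoding (or convert first), otherwise the round trip reintroduces mojibake.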
While having the exact same issue, I fixed it with the code below:
Encoding.UTF8.GetString(DownloadData(URL));
This directly decodes the output of the WebClient GET request as UTF-8.
I'm using automatic conversion from WSDL to C#. Everything works apart from encoding: whenever I have native characters (like 'ł' or 'ó') I get '??' instead of them in string fields ('G????wny' instead of 'Główny'). How do I deal with it? The server sends the document with the correct encoding, with header .
EDIT: I noticed in Wireshark that packets sent FROM me have a BOM, but packets sent TO me don't. Maybe that's the root of the problem?
So maybe the following will help:
What I am sure I did is:
In the webservice PHP file, after connecting to the Mysql Database I call:
mysql_query("SET CHARSET utf8");
mysql_query("SET NAMES utf8 COLLATE utf8_polish_ci");
The second thing I did:
In the same PHP file,
I added utf8_encode to the service on the $POST_DATA variable:
$server->service(utf8_encode($POST_DATA));
in the class.nusoap_base.php I changed:
//var $soap_defencoding = 'ISO-8859-1';
var $soap_defencoding = 'UTF-8';
and also in nusoap.php, the same as above:
//var $soap_defencoding = 'ISO-8859-1';
var $soap_defencoding = 'UTF-8';
and in the nusoap.php file again:
var $decode_utf8 = true;
Now I can send and receive properly encoded data.
Hope this helps.
Regards,
The problem was on the server side, with the Content-Type parameter sent in the header (it was set to "text/xml"). It turns out that for UTF-8 it has to be "text/xml; charset=utf-8"; other methods, such as placing a BOM, aren't correct (according to RFC 3023). More info here: http://annevankesteren.nl/2005/03/text-xml