Download a webpage in UTF-8 - c#

I'm using the code below to download this XML file:
private async static Task<string> DownloadPageAsync(string url)
{
    try
    {
        HttpClientHandler handler = new HttpClientHandler();
        handler.UseDefaultCredentials = true;
        handler.AllowAutoRedirect = true;
        handler.UseCookies = true;
        HttpClient client = new HttpClient(handler);
        client.MaxResponseContentBufferSize = 10000000;
        HttpResponseMessage response = await client.GetAsync(url);
        response.EnsureSuccessStatusCode();
        // Read the body as a string; the read is asynchronous and must be awaited.
        string responseBody = await response.Content.ReadAsStringAsync();
        return responseBody;
    }
    catch (Exception ex)
    {
        return "error" + ex.Message;
    }
}
but the document I'm getting seems to have encoding problems. Although the document is not well formatted, I'm guessing my downloaded webpage is not in UTF-8 either. How can I return a UTF-8 string? Thanks.

The document at your link is encoded as ISO-8859-1.
Use
new XmlDocument().Load(uriString)
or
XDocument.Load(uriString)
Both take the encoding from the XML declaration instead of assuming UTF-8.
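As a rough illustration of why the XML loaders help, the sketch below feeds XDocument.Load a stream of ISO-8859-1 bytes whose XML declaration names that encoding. The class name, method name, and sample document are made up for the example; the real file from the question is not used.

```csharp
using System.IO;
using System.Xml.Linq;

public static class XmlEncodingDemo
{
    // Hypothetical helper: parses raw XML bytes and returns the root element's text.
    // The XML reader picks up the encoding named in the XML declaration.
    public static string LoadRootValue(byte[] xmlBytes)
    {
        using var stream = new MemoryStream(xmlBytes);
        XDocument doc = XDocument.Load(stream);
        return doc.Root?.Value ?? string.Empty;
    }
}
```

Once parsed, the text is an ordinary .NET string, so it can be written out as UTF-8 without further conversion.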

I suggest using the HTML Agility Pack to download and parse the document for you - it will automatically detect the encoding (where possible), so this shouldn't be a problem for you.
If this is not an option, you need to know which encoding the document uses and then convert it to UTF-8 with the Encoding classes.
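A minimal sketch of that manual conversion, assuming the source encoding is already known (here ISO-8859-1, as the other answer suggests). Transcoder and ToUtf8 are illustrative names, not part of any library.

```csharp
using System.Text;

public static class Transcoder
{
    // Hypothetical helper: re-encodes bytes from a known source encoding to UTF-8.
    public static byte[] ToUtf8(byte[] sourceBytes, string sourceEncodingName)
    {
        Encoding source = Encoding.GetEncoding(sourceEncodingName);
        // Encoding.Convert decodes with the source encoding and re-encodes as UTF-8.
        return Encoding.Convert(source, Encoding.UTF8, sourceBytes);
    }
}
```

Note that a .NET string itself is always UTF-16 in memory; "UTF-8" only applies to the bytes you read or write, so the important step is decoding the downloaded bytes with the right source encoding.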

Related

C# HttpClient save response with MIME "text/plain" as an UTF-8 string

I'm sending a request with HttpClient to a remote endpoint. I want to download the content and save it to a file as a UTF-8 string.
If the server responds with the proper Content-Type text/plain; charset=utf-8, then the following code processes it just fine:
HttpClient client = new();
HttpResponseMessage res = await client.GetAsync(url);
string text = await res.Content.ReadAsStringAsync();
File.WriteAllText("file.txt", text);
However, the server always returns the basic Content-Type text/plain and I'm unable to get that as a UTF-8 string.
HttpClient cl = new();
HttpResponseMessage res = await cl.GetAsync(url);
string attempt1 = await res.Content.ReadAsStringAsync();
string attempt2 = Encoding.UTF8.GetString(await res.Content.ReadAsByteArrayAsync());
Stream stream = await res.Content.ReadAsStreamAsync();
byte[] bytes = ((MemoryStream)stream).ToArray();
string attempt3 = Encoding.UTF8.GetString(bytes);
I tried all three of these approaches, and all of them produced scrambled characters that looked like an encoding mismatch. I don't have control over the server, so I can't change the headers.
Is there any way to force HttpClient to parse it as UTF-8? Why are the manual approaches not working?
I've built a Cloudflare worker to demonstrate this behavior and allow you to easily debug:
https://headers.briganreiz.workers.dev/charset-in-header
https://headers.briganreiz.workers.dev/no-charset
Edit: Turns out it was the GZip compression on the main server which I didn't notice. This question solved it for me: Decompressing GZip Stream from HTTPClient Response
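For readers who hit the same symptom, here is a sketch of how transparent decompression can be enabled on the handler. If the body arrives gzip-compressed and is not decompressed, decoding the raw bytes as UTF-8 yields garbage regardless of the charset, which would explain why all three attempts failed. DecompressingClient is a made-up name for the example.

```csharp
using System.Net;
using System.Net.Http;

public static class DecompressingClient
{
    // Hypothetical factory: a handler that transparently inflates gzip/deflate bodies.
    public static HttpClientHandler CreateHandler() => new HttpClientHandler
    {
        AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
    };

    // A client built on that handler; ReadAsStringAsync then sees the decompressed bytes.
    public static HttpClient Create() => new HttpClient(CreateHandler());
}
```

With such a client, the original `await res.Content.ReadAsStringAsync()` call should work unchanged, assuming the decompressed payload really is UTF-8.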
I find this works well with a different pair of classes, WebRequest and HttpWebResponse. I have not added plumbing for resp.StatusCode and so on, but obviously presuming all went well is a tad naive.
Give it a go; I think you'll find WebRequest and HttpWebResponse more capable for dynamic requests.
var req = WebRequest.CreateHttp(url);
var getResponse = req.GetResponseAsync();
getResponse.Wait(ResponseTimeoutMilliseconds);
var resp = (HttpWebResponse)getResponse.Result;
using (Stream responseStream = resp.GetResponseStream())
{
    // Decode the body explicitly as UTF-8, regardless of the response headers.
    var reader = new StreamReader(responseStream, Encoding.UTF8);
    string content = reader.ReadToEnd();
}
Once you have things working, you should absolutely use the ...Async versions throughout. For debugging, though, since we have already waited for the response, I find it more convenient to simply step through the synchronous calls. Feel free to skip that middle step :)

UTF-8 URL Encode

I am having trouble encoding my query parameters with HttpUtility.UrlEncode(); the result is not the UTF-8 percent-encoding I expect.
query["agent"] = HttpUtility.UrlEncode("{\"mbox\":\"mailto:UserName@company.com\"}");
I tried the overload that takes an encoding and passed UTF-8, but it still does not work.
expected result:
?agent=%7B%22mbox%22%3A%22mailto%3AUserName%40company.com%22%7D
Actual Result:
?agent=%257b%2522mbox%2522%253a%2522mailto%253aUserName%2540company.com%2522%257d
public StatementService(HttpClient client, IConfiguration conf)
{
    configuration = conf;
    var BaseAddress = "https://someurl.com/statements?";
    client.BaseAddress = new Uri(BaseAddress);
    client.DefaultRequestHeaders.Add("Custom-Header",
        "customheadervalue");
    Client = client;
}
public async Task<Object> GetStatements(){
    var query = HttpUtility.ParseQueryString(Client.BaseAddress.Query);
    query["agent"] = HttpUtility.UrlEncode("{\"mbox\":\"mailto:UserName@company.com\"}");
    var longuri = new Uri(Client.BaseAddress + query.ToString());
    var response = await Client.GetAsync(longuri);
    response.EnsureSuccessStatusCode();
    using var responseStream = await response.Content.ReadAsStreamAsync();
    dynamic statement = JsonSerializer.DeserializeAsync<object>(responseStream);
    //Convert stream reader to string
    StreamReader JsonStream = new StreamReader(statement);
    string JsonString = JsonStream.ReadToEnd();
    //convert Json String to Object.
    JObject JsonLinq = JObject.Parse(JsonString);
    // Linq to Json
    dynamic res = JsonLinq["statements"][0].Select(res => res).FirstOrDefault();
    return await res;
}
The method HttpUtility.ParseQueryString internally returns an HttpValueCollection. HttpValueCollection.ToString() already performs URL encoding, so you don't need to do that yourself. If you do it yourself, the value is encoded twice and you get the wrong result that you see.
I don't see the relation to UTF-8. The value you use ({"mbox":"mailto:UserName@company.com"}) doesn't contain any characters that would be encoded differently in UTF-8.
References:
HttpValueCollection and NameValueCollection
ParseQueryString source
HttpValueCollection source
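The double encoding described in this answer can be reproduced with a small sketch like the one below. The class and method names are invented for illustration, and the address uses "@", matching the %40 in the question's expected result.

```csharp
using System.Collections.Specialized;
using System.Web;

public static class QueryEncodingDemo
{
    private const string Agent = "{\"mbox\":\"mailto:UserName@company.com\"}";

    // Assign the raw value; ToString() on the parsed collection encodes it once.
    public static string EncodedOnce()
    {
        NameValueCollection query = HttpUtility.ParseQueryString(string.Empty);
        query["agent"] = Agent;
        return query.ToString();
    }

    // Pre-encoding the value means ToString() encodes the '%' signs again ("%25...").
    public static string EncodedTwice()
    {
        NameValueCollection query = HttpUtility.ParseQueryString(string.Empty);
        query["agent"] = HttpUtility.UrlEncode(Agent);
        return query.ToString();
    }
}
```

Comparing the two return values should show the "%25" sequences only in the second one, which matches the "Actual Result" in the question.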
I strongly suggest another approach: the Uri.EscapeDataString method. It lives in the System namespace rather than in System.Web, which is a heavy DLL. In addition, HttpUtility.UrlEncode emits lowercase hex digits in its percent-escapes, which can be an issue in certain cases when implementing HTTP protocols.
Uri.EscapeDataString("{\"mbox\":\"mailto: UserName@company.com\"}")
"%7B%22mbox%22%3A%22mailto%3A%20UserName%40company.com%22%7D"

How should I do the encoding to not get an error with Chinese characters using WebClient?

I have the following function to download files from our server. Some customers name their file with Chinese characters and then I get the following error in Wc_DownloadFileCompleted: "The remote server returned an error: (404) Not Found.". I have tried HttpUtility.UrlEncode to encode the URL but that gives me an error on the Uri constructor or if I just encode the last part I get the same 404 error.
This is the URL giving me the problems:
http://example.com/Uploads/-463941/480630/1802+201830030210+孟万青.CNC.cloudfile
I have double-checked that the file is at that location and with the same filename.
private void DownloadCloudFile(string url)
{
    WebClient wc = new WebClient();
    wc.DownloadFileCompleted += Wc_DownloadFileCompleted;
    string tmpfile = Path.GetTempFileName();
    wc.DownloadFileAsync(new Uri(url), tmpfile, tmpfile);
}
HttpClient can download the file:
private static async Task DownloadCloudFile(string url)
{
    string tmpfile = Path.GetTempFileName();
    using (HttpClient client = new HttpClient())
    using (HttpResponseMessage response = await client.GetAsync(url))
    using (Stream streamToReadFrom = await response.Content.ReadAsStreamAsync())
    using (var fileStream = File.Create(tmpfile))
    {
        // Copy the response body to the temp file; the using blocks close both streams.
        await streamToReadFrom.CopyToAsync(fileStream);
    }
}
It looks like a bug in WebClient.
The proper way to encode Unicode characters in URL is to convert them to UTF-8 and then percent-encode them, as described here.
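A small sketch of that rule, using Uri.EscapeDataString, which converts to UTF-8 and percent-encodes the result. UrlPathEncoding and its methods are hypothetical helpers; whether the server then resolves the file correctly is not something the thread confirms.

```csharp
using System;

public static class UrlPathEncoding
{
    // Percent-encodes a single path segment as UTF-8.
    // Reserved characters such as '+' are escaped too ('+' becomes "%2B").
    public static string EncodeFileName(string fileName) => Uri.EscapeDataString(fileName);

    // Hypothetical helper: joins a base URL and an encoded file name.
    public static Uri BuildFileUri(string baseUrl, string fileName) =>
        new Uri(baseUrl.TrimEnd('/') + "/" + EncodeFileName(fileName));
}
```

For example, `UrlPathEncoding.EncodeFileName("孟万青")` returns `%E5%AD%9F%E4%B8%87%E9%9D%92`. Only the file name is encoded here; encoding the whole URL would also escape the slashes and the scheme separator.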
Thanks for the comments. I could still not get it to work and decided on another route. I simply force the user to type in ASCII characters when naming the files.

HttpClient throws System.ArgumentException: 'windows-1251' is not a supported encoding name

I am writing a Windows Phone 8.1 app.
The code is very simple and works in most cases:
string htmlContent;
using (var client = new HttpClient())
{
    htmlContent = await client.GetStringAsync(GenerateUri());
}
_htmlDocument.LoadHtml(htmlContent);
But sometimes an exception is thrown at
htmlContent = await client.GetStringAsync(GenerateUri());
InnerException {System.ArgumentException: 'windows-1251' is not a supported encoding name.
Parameter name: name
at System.Globalization.EncodingTable.internalGetCodePageFromName(String name)
at System.Globalization.EncodingTable.GetCodePageFromName(String name)
at System.Net.Http.HttpContent.<>c__DisplayClass1.b__0(Task task)} System.Exception {System.ArgumentException}
Does HttpClient support the windows-1251 encoding? If it doesn't, how can I avoid this problem? Or is this a problem with the target page, or am I doing something wrong?
Get the response as an IBuffer and then convert it using the .NET encoding classes:
HttpClient client = new HttpClient();
HttpResponseMessage response = await client.GetAsync(uri);
IBuffer buffer = await response.Content.ReadAsBufferAsync();
byte[] bytes = buffer.ToArray();
Encoding encoding = Encoding.GetEncoding("windows-1251");
string responseString = encoding.GetString(bytes, 0, bytes.Length);
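As a side note for other runtimes: on .NET Core and .NET 5+ (not the Windows Phone 8.1 runtime from the question), code page encodings such as windows-1251 are not registered by default, and a similar "not a supported encoding name" error appears until the code pages provider is registered. A minimal sketch under that assumption, with Cp1251Decoder as an illustrative name:

```csharp
using System.Text;

public static class Cp1251Decoder
{
    // Hypothetical helper: decodes windows-1251 bytes on .NET Core / .NET 5+.
    public static string Decode(byte[] bytes)
    {
        // Makes legacy code pages (including 1251) available to Encoding.GetEncoding.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("windows-1251").GetString(bytes);
    }
}
```

In a real application the provider would normally be registered once at startup rather than on every call.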

Windows 8: Download string with encoding (WinRT)

I use this code to download a string from the Internet:
public static async Task<string> DownloadPageAsync(string url)
{
    HttpClientHandler handler = new HttpClientHandler {UseDefaultCredentials = true, AllowAutoRedirect = true};
    HttpClient client = new HttpClient(handler);
    client.MaxResponseContentBufferSize = 196608;
    HttpResponseMessage response = await client.GetAsync(url);
    response.EnsureSuccessStatusCode();
    string responseBody = await response.Content.ReadAsStringAsync();
    return responseBody;
}
but it only works for UTF8 documents. Where do I set the Encoding?
Change ReadAsStringAsync to ReadAsBufferAsync and decode the result with the required encoding:
var buffer = await response.Content.ReadAsBufferAsync();
byte[] rawBytes = new byte[buffer.Length];
using (var reader = DataReader.FromBuffer(buffer))
{
    reader.ReadBytes(rawBytes);
}
// Replace Encoding.UTF8 with the encoding the document actually uses.
var res = Encoding.UTF8.GetString(rawBytes, 0, rawBytes.Length);
In WinRT, HttpContent reads the encoding from the Headers property. If the HTTP response from the server doesn't set a Content-Type header with an encoding, it looks for a BOM in the stream, and if there is no BOM it defaults to UTF-8.
If the server is not sending the right Content-Type header, you can use the HttpContent.ReadAsStreamAsync() method with your own instance of the Encoding class to decode the data correctly.
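One way to sketch the stream-based approach is a small helper that wraps the stream in a StreamReader with an explicit encoding. StreamDecoding and ReadAllTextAsync are illustrative names; the stream would come from ReadAsStreamAsync() in practice, and the fallback encoding is whatever you know the server really uses.

```csharp
using System.IO;
using System.Text;
using System.Threading.Tasks;

public static class StreamDecoding
{
    // Hypothetical helper: decodes a stream with a caller-supplied encoding.
    // A BOM, if present, still takes precedence over the supplied encoding.
    public static async Task<string> ReadAllTextAsync(Stream stream, Encoding encoding)
    {
        using var reader = new StreamReader(stream, encoding, detectEncodingFromByteOrderMarks: true);
        return await reader.ReadToEndAsync();
    }
}
```

A call might look like `await StreamDecoding.ReadAllTextAsync(await response.Content.ReadAsStreamAsync(), Encoding.GetEncoding("iso-8859-1"))`, assuming the page is ISO-8859-1.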
Set the "ContentEncoding" property of your HttpResponse object:
http://msdn.microsoft.com/en-us/library/system.web.httpresponse.contentencoding%28v=vs.71%29.aspx
Values include:
http://msdn.microsoft.com/en-us/library/system.text.encoding%28v=vs.71%29.aspx
System.Text.ASCIIEncoding
System.Text.UnicodeEncoding
System.Text.UTF7Encoding
System.Text.UTF8Encoding
PS: This really isn't "Metro" per se, just C#/.NET (albeit .NET 4.x).
