C# WebClient Strange Characters - c#

I am trying to download this webpage using the C# WebClient.
It works perfectly with Python's urllib2, but with the C# WebClient the output file contains strange characters.
I have tried setting the Encoding on the WebClient as well, but that makes no difference.
public static string GetWebURL()
{
    string url = "http://bet.hkjc.com";
    WebClient webClient = new WebClient();
    webClient.Encoding = Encoding.UTF8;
    string html = webClient.DownloadString(url);
    File.WriteAllText("page.html", html);
    // the method is declared as returning string, so hand the HTML back
    return html;
}
This is the output with those strange characters:
‹âå²Qtñw‰pUðñõQuòñtVPÒÕ×7vÖ×w qÂH˜è*„%æg–dæç%æèë»ú)ÙñrÂ(N.Ê,(Q(©,HµU*I­(ÑÃJ,K„ˆ*Ùq)((â€U*TÆ’e‰E ©y‰I9©ŽÉÉ©ÅÅÎùy%Eù9 ¶i‰9Å©Ö %â„¢i Xâ€h"(É-P°U(ÃÃŒKÉ/×ËÉON¹H/£(5M¯¸4©¸¤HÃ\SlHu°kPËœkP¼Ÿ£¯+PP/L‘ÂËœ4&µÂ?MCI_IS®+%?713Ã/17¨ ɘfd!¸ zJšÚ†P«Sò“KsSóJô &MA V¨ŸKòô’RK‚s2ÜŠ€ªô2‹}òÓóó445¡ÊÃ=­Wâ€Z“˜œ t|zj^jQbN<Ø1z䁚9‰y鶩yJ_ÂP-ˆÔšœchˆe¦‚ µ\H&[×rÙèC’€0ÂJ%à „ ÷‚üüP9Ud¦MÃÃÔÌØÈÖM×ÃÈ25² ÷ô³V·†(ÃŽM-JOM
What should I do to see the HTML that is being sent?

You're looking at a compressed byte stream. You can tell by inspecting the headers of the HTTP response, for example with curl:
curl -I http://bet.hkjc.com/
The developer console of your browser will reveal the same:
HTTP/1.1 200 OK
Cache-Control: public, max-age=120, must-revalidate
Content-Length: 3615
Content-Type: text/html; charset=utf-8
Content-Encoding: gzip
Expires: Wed, 29 Jun 2016 08:01:06 GMT
Vary: Accept-Encoding
Server: Microsoft-IIS/7.0
X-AspNet-Version: 2.0.50727
X-Powered-By: ASP.NET
Date: Wed, 29 Jun 2016 08:00:14 GMT
Via: 1.1 stjbwbwa52
Accept-Ranges: bytes
Notice that the Content-Encoding header says gzip. This means the result you just got is compressed with the gzip algorithm. The standard WebClient does not decompress it for you, but with a simple subclass the WebClient can learn new tricks:
public class DecompressWebClient : WebClient
{
    // moved common logic here
    public DecompressWebClient()
    {
        this.Encoding = Encoding.UTF8;
    }

    // This is the factory to create the webrequest
    protected override WebRequest GetWebRequest(Uri address)
    {
        // get the default one
        var request = base.GetWebRequest(address);
        // see if it is a HttpWebRequest
        var httpReq = request as HttpWebRequest;
        if (httpReq != null)
        {
            // add extra capabilities, like decompression
            httpReq.AutomaticDecompression = DecompressionMethods.GZip;
        }
        return request;
    }
}
HttpWebRequest has a property AutomaticDecompression that takes a DecompressionMethods value. When it is set to DecompressionMethods.GZip, the request takes care of the decompression for us.
When you put the subclassed WebClient to use, your code will look like this:
string url = "http://bet.hkjc.com";
using (WebClient webClient = new DecompressWebClient())
{
    string html = webClient.DownloadString(url);
    File.WriteAllText("page.html", html);
}
The encoding UTF8 is correct, as you can also see in the header for the Content-Type setting.
The top of the html file will look like this:
<html>
<head>
<meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7; IE=EmulateIE10"/>
<meta name="application-name" content="香港賽馬會"/>
<title>香港賽馬會</title>
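To see for yourself why the raw download looked like garbage, here is a small in-memory sketch. It is not part of the answer above; GzipDemo, Compress and Decompress are made-up names for illustration. It gzips some UTF-8 text the way a server does for Content-Encoding: gzip, and then decompresses it again. A gzip stream always starts with the bytes 0x1F 0x8B, which is consistent with the "‹" at the start of the output in the question.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper, for illustration only.
public static class GzipDemo
{
    // Compress UTF-8 text into a gzip byte stream.
    public static byte[] Compress(string text)
    {
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
            {
                byte[] raw = Encoding.UTF8.GetBytes(text);
                gzip.Write(raw, 0, raw.Length);
            } // disposing the GZipStream flushes the gzip trailer
            return buffer.ToArray();
        }
    }

    // Decompress a gzip byte stream and decode it as UTF-8 text.
    public static string Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Decoding the output of Compress directly with Encoding.UTF8.GetString gives the same kind of unreadable text as in the question; only Decompress returns the original markup.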

Related

C# HttpListener. Sending EMPTY response (without any artefacts)

I am using the C# HttpListener class to implement a small server, and I have run into a problem. I just want to send the client an empty response to its request, like
HTTP/1.1 200 OK
or
HTTP/1.1 400 Bad Request
without any additional text. So I set the status code and status description and write no bytes to the response OutputStream, since I don't need a body. Then I call response.Close() to send the response to the client. What Fiddler shows on the client side is:
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Server: Microsoft-HTTPAPI/2.0
Date: Sun, 25 Oct 2015 10:42:12 GMT
0
There is a workaround for the Server and Date fields:
HttpListener Server Header c#.
But how do I remove the "Transfer-Encoding: chunked" header and the "0" body from this response?
Thanks to all in advance!
The code:
private void ProcessContext(HttpListenerContext aContext)
{
    HttpListenerResponse response = aContext.Response;
    response.StatusCode = (int)HttpStatusCode.OK;
    response.StatusDescription = "OK";
    response.Close();
}
This will get rid of everything but the status and the Content-Length header:
HttpListener listener = new HttpListener();
listener.Prefixes.Add("http://*:5555/");
listener.Start();
listener.BeginGetContext(ar =>
{
    HttpListener l = (HttpListener)ar.AsyncState;
    HttpListenerContext context = l.EndGetContext(ar);
    context.Response.Headers.Clear();
    context.Response.SendChunked = false;
    context.Response.StatusCode = 200;
    context.Response.Headers.Add("Server", String.Empty);
    context.Response.Headers.Add("Date", String.Empty);
    context.Response.Close();
}, listener);
In Fiddler the response then shows only the status line and the Content-Length header.
Just set ContentLength64 to zero before closing response stream in order to transmit data the regular way:
response.ContentLength64 = 0;
response.OutputStream.Close();
If you flush or close the response stream without setting the content length to any value, the data will be transmitted in chunks. The 0\r\n in your response body is actually the terminating (zero-length) chunk.
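Putting the ContentLength64 advice into a reusable form could look like the sketch below. EmptyResponder and SendEmpty are hypothetical names, not part of HttpListener; the only calls used are the StatusCode, ContentLength64 and Close members already mentioned in this thread.

```csharp
using System.Net;

// Hypothetical helper, for illustration only.
public static class EmptyResponder
{
    // Sends a response that carries only a status line and headers, no body.
    public static void SendEmpty(HttpListenerResponse response, HttpStatusCode status)
    {
        response.StatusCode = (int)status;
        // Announce an empty body explicitly so the listener does not
        // fall back to Transfer-Encoding: chunked.
        response.ContentLength64 = 0;
        // Close sends the headers and completes the response.
        response.Close();
    }
}
```

Inside ProcessContext from the question this would be called as EmptyResponder.SendEmpty(aContext.Response, HttpStatusCode.OK). The Server and Date headers are a separate matter and are not touched by this sketch.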

WebClient download string is different than WebBrowser View source

I am creating a C# 4.0 application that downloads webpage content using WebClient.
WebClient function:
public static string GetDocText(string url)
{
    string html = string.Empty;
    try
    {
        using (ConfigurableWebClient client = new ConfigurableWebClient())
        {
            /* Set timeout for webclient */
            client.Timeout = 600000;
            /* Build url */
            Uri innUri = null;
            if (!url.StartsWith("http://"))
                url = "http://" + url;
            Uri.TryCreate(url, UriKind.RelativeOrAbsolute, out innUri);
            try
            {
                client.Headers.Add("User-Agent", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR " + "3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; InfoPath.2; AskTbFXTV5/5.15.4.23821; BRI/2)");
                client.Headers.Add("Vary", "Accept-Encoding");
                client.Encoding = Encoding.UTF8;
                html = client.DownloadString(innUri);
                if (html.Contains("Pagina non disponibile"))
                {
                    string str = "site blocked";
                    str = "";
                }
                if (string.IsNullOrEmpty(html))
                {
                    return string.Empty;
                }
                else
                {
                    return html;
                }
            }
            catch (Exception ex)
            {
                return "";
            }
            finally
            {
                client.Dispose();
            }
        }
    }
    catch (Exception ex)
    {
        return "";
    }
}
public class ConfigurableWebClient : WebClient
{
    public int? Timeout { get; set; }
    public int? ConnectionLimit { get; set; }

    protected override WebRequest GetWebRequest(Uri address)
    {
        var baseRequest = base.GetWebRequest(address);
        var webRequest = baseRequest as HttpWebRequest;
        if (webRequest == null)
            return baseRequest;
        if (Timeout.HasValue)
            webRequest.Timeout = Timeout.Value;
        if (ConnectionLimit.HasValue)
            webRequest.ServicePoint.ConnectionLimit = ConnectionLimit.Value;
        return webRequest;
    }
}
When I examine the content downloaded by the C# WebClient, it is different from what the browser shows. I give the same URL to the browser (Mozilla Firefox) and to my WebClient function. The browser shows the page correctly, but WebClient.DownloadString returns different HTML. Please see the WebClient response below.
WebClient downloaded HTML:
<!DOCTYPE html>
<head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<meta http-equiv="cache-control" content="max-age=0" />
<meta http-equiv="cache-control" content="no-cache" />
<meta http-equiv="expires" content="0" />
<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />
<meta http-equiv="pragma" content="no-cache" />
<meta http-equiv="refresh" content="10; url=/distil_r_captcha.html?Ref=/pgol/4-abbigliamento/3-Roma%20%28RM%29/p-7&distil_RID=A8D2F8B6-B314-11E3-A5E9-E04C5DBA1712" />
<script type="text/javascript" src="/ga.280243267228712.js?PID=6D4E4D1D-7094-375D-A439-0568A6A70836" defer></script><style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#glance7ca96c1b,#hiredf795fe70,#target01a7c05a,#hiredf795fe70{display:none!important}</style></head>
<body>
<div id="distil_ident_block"> </div>
<div id="d__fFH"><OBJECT id="d_dlg" CLASSID="clsid:3050f819-98b5-11cf-bb82-00aa00bdce0b" width="0px" height="0px"></OBJECT><span id="d__fF"></span></div></body>
</html>
My problem is that my WebClient function does not return the actual webpage content.
Some web applications respond differently depending on the HTTP request headers.
So if you want the same HTML that the web browser gets, send the same HTTP request that your browser sends.
How?
Use the Firefox or Chrome developer tools to inspect the browser's request and copy its HTTP request headers.
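As a rough sketch of what copying the request headers can look like in code: the helper below sets a few browser-style headers on a WebClient. BrowserLikeHeaders and Apply are made-up names, and the header values are only example values of the kind you would copy from your own browser's network tab, not values known to satisfy any particular site.

```csharp
using System.Net;

// Hypothetical helper, for illustration only.
public static class BrowserLikeHeaders
{
    // Sets browser-style request headers on the given client.
    // The values are placeholders; replace them with what your browser sends.
    public static void Apply(WebClient client)
    {
        client.Headers["User-Agent"] =
            "Mozilla/5.0 (Windows NT 6.1; rv:28.0) Gecko/20100101 Firefox/28.0";
        client.Headers["Accept"] =
            "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
        client.Headers["Accept-Language"] = "en-US,en;q=0.5";
    }
}
```

It is probably safest to call Apply right before each download, because WebClient may not keep headers such as User-Agent from one request to the next. If you also copy an Accept-Encoding: gzip header, the response will come back compressed and has to be decompressed. Sites behind bot protection may check more than headers, so this may still not be enough.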
In my case WebClient's DownloadData/DownloadFile/DownloadString methods showed different results than when downloading the file from a browser, like Chrome. First I thought it was an encoding problem and looped through all the encodings from Encoding.GetEncodings(), but the output data showed nonsense characters. Then after much searching I ended up here.
I looked at the response headers in the Chrome browser Network tab, as @han058 suggested, and they read:
Cache-Control: public, max-age=900
content-disposition: attachment;filename=FILENAME.csv
Content-Encoding: gzip
Content-Length: 29310
Content-Type: text/plain; charset=utf-8
Date: Sat, 04 Jan 2020 20:20:13 GMT
Expires: Sat, 04 Jan 2020 20:35:14 GMT
Last-Modified: Sat, 04 Jan 2020 20:20:14 GMT
Server: Microsoft-IIS/10.0
Vary: *
X-Powered-By: ASP.NET
X-Powered-By: ARR/3.0
X-Powered-By: ASP.NET
So the response was compressed (Content-Encoding: gzip). In other words, I had to decompress the file before I could read it.
using System;
using System.IO;
using System.IO.Compression;
using System.Net;

public class Program
{
    static void Main(string[] args)
    {
        var url = new Uri("http://www.url.com/FILENAME.csv");
        var path = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
        var fileName = "File.csv";
        using (WebClient wc = new WebClient())
        using (Stream s = File.Create(Path.Combine(path, fileName)))
        using (GZipStream gs = new GZipStream(wc.OpenRead(url), CompressionMode.Decompress))
        {
            //Saves to C:\Users\[YourUser]\Desktop\File.csv
            gs.CopyTo(s);
        }
    }
}

Reading HTTP file attachment as string

I am using a WebRequest to make a GET, and the response includes an attachment.
The attachment is an HTML file, and I want to extract the content between its tags. I have managed to get the call working with the following code:
string URI = "http://www.sample.com/ReportServer/Pages/ReportViewer.aspx?%2fReports&rs:Command=Render&rs:Format=MHTML&OrganisationID=" + organisationID;
CredentialCache cc = new CredentialCache();
cc.Add(new Uri(URI), "NTLM", new NetworkCredential(userName, userPassword, userDomain));
WebRequest req = WebRequest.Create(URI);
req.Credentials = cc;
WebResponse resp = req.GetResponse();
StreamReader reader = new StreamReader(resp.GetResponseStream());
string response = reader.ReadToEnd().Trim();
The response, when I look at it in Fiddler, is:
HTTP/1.1 200 OK
Cache-Control: private
Content-Type: multipart/related
Expires: Wed, 02 Apr 2014 14:35:15 GMT
Set-Cookie: RSExecutionSession%3a%2fPuborts%2fSecreal%2fClub+Meip+Ret=0yu4f1455xnmznu55; path=/
Server: Microsoft-HTTPAPI/2.0
X-AspNet-Version: 2.0.50727
FileExtension: mhtml
Content-Disposition: attachment; filename="Blah Report.mhtml"
Date: Wed, 02 Apr 2014 14:36:15 GMT
Content-Length: 84215
MIME-Version: 1.0
Content-Type: multipart/related;
boundary="----=_NextPart_01C35DB7.4B204430"
X-MSSQLRS-ProducerVersion: V10.50.4000.0
This is a multi-part message in MIME format.
------=_NextPart_01C35DB7.4B204430
Content-Disposition: inline; filename="Blah Membership Report"
Content-Type: text/html;
name="Club Membership Report";
charset="utf-8"
Content-Transfer-Encoding: base64
PCFET0NUWVBFIEhUTUwgUFVCTElDICItLy9XM0MvL0RURCBIVE1MIDQuMDEgVHJhbnNpdGlvbmFs
------=_NextPart_01C35DB7.4B204430--
How can I get hold of just the attachment and read the contents into a string please?
The "attachment" disposition only makes the browser show a Save As dialog; it does not change how the body is delivered.
You can use a normal WebClient:
using (var client = new WebClient())
{
    client.Credentials = cc; // the CredentialCache built in the question
    var result = client.DownloadString("http://blablabla.com");
}
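The downloaded string is still the MIME wrapper shown in the question, with the HTML stored as a base64 part. A minimal sketch for pulling that part out follows. MhtmlSketch and DecodeFirstBase64Part are hypothetical names, and the code assumes a single base64 part whose charset is UTF-8, as in the sample response. A real MIME parser would be more robust.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class MhtmlSketch
{
    // Returns the decoded text of the first base64-encoded MIME part,
    // or null if no such part is found. Assumes the part is UTF-8 text.
    public static string DecodeFirstBase64Part(string mime)
    {
        string[] lines = mime.Replace("\r\n", "\n").Split('\n');
        var base64 = new StringBuilder();
        bool sawBase64Header = false; // current part declares base64 encoding
        bool inBody = false;          // past the blank line that ends the part headers

        foreach (string line in lines)
        {
            if (line.StartsWith("--", StringComparison.Ordinal))
            {
                // Boundary line: finish if a body was being collected,
                // otherwise start looking at the next part's headers.
                if (inBody) break;
                sawBase64Header = false;
                continue;
            }
            if (!inBody)
            {
                if (line.StartsWith("Content-Transfer-Encoding:", StringComparison.OrdinalIgnoreCase)
                    && line.IndexOf("base64", StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    sawBase64Header = true;
                }
                else if (sawBase64Header && line.Trim().Length == 0)
                {
                    inBody = true;
                }
                continue;
            }
            // Body lines are base64 text wrapped across lines; join them.
            base64.Append(line.Trim());
        }

        if (base64.Length == 0) return null;
        return Encoding.UTF8.GetString(Convert.FromBase64String(base64.ToString()));
    }
}
```

Applied to the base64 line in the question's sample, this yields text beginning with <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional, which can then be searched for the tags of interest.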

Malformed string exception while sending "&" in Json to AppEngine

I am trying to send a Facebook Graph link to the AppEngine server, but I receive a "Malformed string exception". Here is my method for sending JSON to the server:
public async Task<string> SendJSONData(string urlToCall, string JSONData)
{
    // server to POST to
    string url = urlToCall;
    // HTTP web request
    var httpWebRequest = (HttpWebRequest)WebRequest.Create(url);
    httpWebRequest.ContentType = "application/x-www-form-urlencoded";
    httpWebRequest.Method = "POST";
    // Write the request Asynchronously
    using (var stream = await Task.Factory.FromAsync<Stream>(httpWebRequest.BeginGetRequestStream,
        httpWebRequest.EndGetRequestStream, null))
    {
        //create some json string
        string json = "action=" + JSONData;
        // convert json to byte array
        byte[] jsonAsBytes = Encoding.UTF8.GetBytes(json);
        // Write the bytes to the stream
        await stream.WriteAsync(jsonAsBytes, 0, jsonAsBytes.Length);
    }
    WebResponse response = await httpWebRequest.GetResponseAsync();
    StreamReader requestReader = new StreamReader(response.GetResponseStream());
    String webResponse = requestReader.ReadToEnd();
    return webResponse;
}
Here is what I sniff using Fiddler:
POST http://x.appspot.com/register HTTP/1.1
Accept: */*
Content-Length: 376
Accept-Encoding: identity
Content-Type: application/x-www-form-urlencoded
User-Agent: NativeHost
Host: x.appspot.com
Connection: Keep-Alive
Pragma: no-cache
action={
"mailFb": "mail@gmail.com",
"userName": "Michael",
"userSurname": "w00t",
"nickname": "Michael w00t",
"userSex": "male",
"userAvatar": "https://graph.facebook.com/myperfectid/picture?type=large&access_token=BlahblahblahblahToken"
}
So everything looks fine, but the problem is that I receive the following error in the AppEngine log:
2013-03-02 17:52:10.431 /register 500 56ms 0kb NativeHost
W 2013-03-02 17:52:10.427 /register com.google.gson.JsonSyntaxException: com.google.gson.stream.MalformedJsonException: Unterminated string at line 7 column 79 at com.google.g
C 2013-03-02 17:52:10.429 Uncaught exception from servlet com.google.gson.JsonSyntaxException: com.google.gson.stream.MalformedJsonException: Unterminated string at line 7 colu
I managed to narrow the problem down to its source, which is the "&" character. So my question is: how do I fix the code so that it works with AppEngine?
Oh, and here is how I read the received data on the server:
gson.fromJson(reader, User.class);
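The problem is that you declare "Content-Type: application/x-www-form-urlencoded" but the body you send is not actually URL-encoded.
Hence the error: most likely the server treats the raw & inside the JSON as a form-field separator, so the "action" value is cut off in the middle of a string.
The correct encoding for & in a form-urlencoded body is %26.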
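One way to get a properly encoded body is to percent-encode the JSON before appending it to "action=", so that & becomes %26. The sketch below uses Uri.EscapeDataString for that. FormBody and BuildActionBody are hypothetical names chosen for this example.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class FormBody
{
    // Builds an application/x-www-form-urlencoded body with a single
    // "action" field whose value is the percent-encoded JSON.
    public static string BuildActionBody(string json)
    {
        return "action=" + Uri.EscapeDataString(json);
    }
}
```

In SendJSONData from the question, this would replace the line that builds the string as "action=" + JSONData. A server that parses the form field in the usual way should then see the complete JSON again.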

HttpWebRequest and Set-Cookie header in response not parsed (WP7)

I am trying to get the header "Set-Cookie" or access the cookie container, but the Set-Cookie header is not available.
The cookie is in the response header, but it's not there in the client request object.
I am registering the ClientHttp stack using
bool httpResult = WebRequest.RegisterPrefix("http://", WebRequestCreator.ClientHttp);
Here's the response:
HTTP/1.1 200 OK
Content-Type: application/xml; charset=utf-8
Connection: keep-alive
Status: 200
X-Powered-By: Phusion Passenger (mod_rails/mod_rack) 3.0.0.pre4
ETag: "39030a9c5a45a24e485e4d2fb06c6389"
Client-Version: 312, 105, 0, 0
X-Runtime: 44
Content-Length: 1232
Set-Cookie: _CWFServer_session=[This is the session data]; path=/; HttpOnly
Cache-Control: private, max-age=0, must-revalidate
Server: nginx/0.7.67 + Phusion Passenger 3.0.0.pre4 (mod_rails/mod_rack)
<?xml version="1.0" encoding="UTF-8"?>
<user>
...
</user>
My callback code contains something like:
var webRequest = (HttpWebRequest)result.AsyncState;
raw = webRequest.EndGetResponse(result) as HttpWebResponse;
foreach (Cookie c in webRequest.CookieContainer.GetCookies(webRequest.RequestUri))
{
Console.WriteLine("Cookie['" + c.Name + "']: " + c.Value);
}
I've also tried looking at the headers but Set-Cookie header isn't present in the response either.
Any suggestions on what may be the problem?
Try explicitly passing a new CookieContainer:
CookieContainer container = new CookieContainer();
container.Add(new Uri("http://yoursite"), new Cookie("name", "value"));
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://yoursite");
request.CookieContainer = container;
request.BeginGetResponse(new AsyncCallback(GetData), request);
You are receiving HttpOnly cookies:
Set-Cookie: _CWFServer_session=[This is the session data]; path=/; HttpOnly
For security reasons, those cookies can't be accessed from code, but you still can use them in your next calls to HttpWebRequest. More on this here : Reading HttpOnly Cookies from Headers of HttpWebResponse in Windows Phone
With WP7.1, I also had problems reading non HttpOnly cookies. I found out that they are not available if the response of the HttpWebRequest comes from the cache. Making the query unique with a random number solved the cache problem :
// The Request
Random random = new Random();
// UniqueQuery is used to defeat the cache system that destroys the cookie.
_uniqueQuery = "http://my-site.somewhere?someparameters=XXX"
+ ";test="+ random.Next();
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(_uniqueQuery);
request.BeginGetResponse(Response_Completed, request);
Once you get the response, you can fetch the cookie from the response headers:
void Response_Completed(IAsyncResult result)
{
    HttpWebRequest request = (HttpWebRequest)result.AsyncState;
    HttpWebResponse response = (HttpWebResponse)request.EndGetResponse(result);
    String header = response.Headers["Set-Cookie"];
}
I never managed to get the CookieContainer.GetCookies() method to work.
Is the cookie httponly? If so, you won't be able to see it, but if you use the same CookieContainer for your second request, the request will contain the cookie, even though your program won't be able to see it.
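Reusing one CookieContainer for every request could be sketched like this. SharedCookies and its Create method are made-up names; the only API used is the CookieContainer property of HttpWebRequest that already appears in this thread.

```csharp
using System.Net;

// Hypothetical helper, for illustration only.
public static class SharedCookies
{
    // One container for the whole session, so cookies received by one
    // request are sent again by later requests to the same site.
    public static readonly CookieContainer Container = new CookieContainer();

    // Creates a request that is wired to the shared container.
    public static HttpWebRequest Create(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = Container;
        return request;
    }
}
```

Every request built through Create shares the same cookies, so a session cookie set by the login response should be sent with the follow-up calls even if your code cannot read its value.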
You must edit the headers collection directly. Something like this:
request.Headers["Set-Cookie"] = "name=value";
request.BeginGetResponse(myCallback, request);
