I have been trying to use the HtmlAgilityPack through a proxy, and I am seeing some unpredictable behavior.
How do you add credentials to the HtmlAgilityPack so that it can scrape web pages through a proxy?
Here is what I usually do:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
...
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
using (var reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
{
    var doc = new HtmlDocument();
    doc.Load(reader); // load via the reader so the UTF-8 encoding is actually used
    // Use (or return) the HtmlDocument 'doc' here.
}
You could encapsulate this code in a method that, given a URL, returns an HtmlDocument object.
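For the proxy and credentials part of the question, a minimal sketch of such a method might look like this; the proxy host, port, and credential values are placeholders you would substitute with your own:

using System.Net;
using System.Text;
using HtmlAgilityPack;

public static HtmlDocument LoadDocument(string url)
{
    var request = (HttpWebRequest)WebRequest.Create(url);

    // Route the request through a proxy; host, port, and credentials are placeholders.
    request.Proxy = new WebProxy("proxy.example.com", 8080)
    {
        Credentials = new NetworkCredential("yourUserId", "yourPassword")
    };

    using (var response = (HttpWebResponse)request.GetResponse())
    using (var stream = response.GetResponseStream())
    {
        var doc = new HtmlDocument();
        doc.Load(stream, Encoding.UTF8);
        return doc;
    }
}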
There is a similar question that has already been answered.
You use it in your code as below, where url, proxyHost, proxyPort, yourUserId, and yourPassword are your own values:
HtmlWeb web = new HtmlWeb();
web.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6";
var doc = web.Load(url, proxyHost, proxyPort, yourUserId, yourPassword);
Related
This is a pretty standard use of HttpWebRequest, but whenever I pass a certain URL to get the HTML, it comes back as nothing but special characters. An example of what comes back is below.
Now, this site uses SSL, so I'm wondering if that has something to do with it, but I've never had this problem before and I've used this code with other SSL sites.
�
// AcceptAllCertifications is a user-defined callback that simply returns true.
ServicePointManager.ServerCertificateValidationCallback =
    new System.Net.Security.RemoteCertificateValidationCallback(AcceptAllCertifications);
var request = (HttpWebRequest)WebRequest.Create(url);
using (var response = (HttpWebResponse)request.GetResponse())
{
    Stream data = response.GetResponseStream();
    HtmlDocument hDoc = new HtmlDocument();
    using (StreamReader readURLContent = new StreamReader(data))
    {
        string html = readURLContent.ReadToEnd();
        hDoc.LoadHtml(html);
    }
}
I can't really find anything on this specific issue, so I'm kind of lost. If anybody could point me in the right direction, that would be awesome.
Edit: here's an image of what it looks like, since I can't copy-paste it.
My guess is that the response is compressed. If you use a web debugger like Charles or Fiddler, you can see how the requests are structured and what data they contain, which makes it a lot easier to replicate the HTTP requests later when programming them. Try the following code.
var doc = new HtmlDocument();
try
{
    var httpWebRequest = (HttpWebRequest)WebRequest.Create(url);
    httpWebRequest.ContentType = "text/html; charset=utf-8";
    httpWebRequest.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0";
    httpWebRequest.AllowAutoRedirect = true;
    httpWebRequest.Method = "GET";
    // Transparently decompress gzip/deflate responses.
    httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
    var httpResponse = (HttpWebResponse)httpWebRequest.GetResponse();
    using (var streamReader = new StreamReader(httpResponse.GetResponseStream(), Encoding.UTF8))
    {
        var responseText = streamReader.ReadToEnd();
        doc.LoadHtml(responseText);
    }
}
catch (WebException ex)
{
    Console.WriteLine(ex.Message);
}
The code sets the encoding on the request, and the encoding is also set on the StreamReader when reading the response. Automatic decompression is enabled as well.
I have tried several methods in C# using WebClient and WebResponse, and they all return
<html><head><meta http-equiv=\"REFRESH\" content=\"0; URL=http://www.windowsphone.com/en-US/games?list=xbox\"><script type=\"text/javascript\">function OnBack(){}</script></head></html>"
instead of the actual rendered page that you see when you use a browser to go to http://www.windowsphone.com/en-US/games?list=xbox
How would you go about grabbing the HTML from that location?
Thanks!
/edit: examples added:
Tried:
string inputUrl = "http://www.windowsphone.com/en-US/games?list=xbox";
string resultHTML = String.Empty;
Uri inputUri = new Uri(inputUrl);
WebRequest request = WebRequest.CreateDefault(inputUri);
request.Method = "GET";
WebResponse response;
try
{
    response = request.GetResponse();
    using (StreamReader reader = new StreamReader(response.GetResponseStream()))
    {
        resultHTML = reader.ReadToEnd();
    }
}
catch { }
Tried:
string inputUrl = "http://www.windowsphone.com/en-US/games?list=xbox";
string resultHTML = String.Empty;
WebClient webClient = new WebClient();
try
{
    resultHTML = webClient.DownloadString(inputUrl);
}
catch { }
Tried:
string inputUrl = "http://www.windowsphone.com/en-US/games?list=xbox";
string resultHTML = String.Empty;
WebResponse objResponse;
WebRequest objRequest = WebRequest.Create(inputUrl);
try
{
    objResponse = objRequest.GetResponse();
    using (StreamReader sr = new StreamReader(objResponse.GetResponseStream()))
    {
        resultHTML = sr.ReadToEnd();
    }
}
catch { }
I checked this URL, and you need to handle the cookies.
When you try to access the page for the first time, you are redirected to an https URL on login.live.com and then redirected back to the original URL. The https page sets a cookie called MSPRequ for the domain login.live.com. If you do not have this cookie, you cannot access the site.
I tried disabling cookies in my browser and it ends up looping infinitely back to the URL https://login.live.com/login.srf?wa=wsignin1.0&rpsnv=11&checkda=1&ct=1328303901&rver=6.1.6195.0&wp=MBI&wreply=http:%2F%2Fwww.windowsphone.com%2Fen-US%2Fgames%3Flist%3Dxbox&lc=1033&id=268289. It's been going on for several minutes now and doesn't appear it will ever stop.
So you will have to grab the cookie from the https page when it is set, and persist that cookie for your subsequent requests.
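A minimal sketch of that idea, assuming the cookie is set during the normal HTTP redirect chain (if it is set by script instead, this will not be enough): attach a CookieContainer to the request so cookies picked up along the redirects, including MSPRequ, are carried into the subsequent requests automatically.

var cookies = new CookieContainer();
var request = (HttpWebRequest)WebRequest.Create("http://www.windowsphone.com/en-US/games?list=xbox");
request.CookieContainer = cookies; // cookies set by login.live.com survive the redirects
request.AllowAutoRedirect = true;
string resultHTML;
using (var response = (HttpWebResponse)request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    resultHTML = reader.ReadToEnd();
}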
This might be because the server you are requesting the HTML from returns different HTML depending on the User-Agent string. You might try something like this:
webClient.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
That particular header may not work, but you could try others that would mimic standard browsers.
I am making a desktop yellow pages application. I can access every country's yellow pages site except the Australian one, and I don't know why.
Here is the code
class Program
{
static void Main(string[] args)
{
WebClient wb = new WebClient();
wb.Headers.Add("user-agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US)");
string html = wb.DownloadString("http://www.yellowpages.com.au");
Console.WriteLine(html);
}
}
For all other sites I get the HTML of the website, but for the Australian site I get null. I even tried HttpWebRequest as well.
Here is the yellowpage australian site: http://www.yellowpages.com.au
Thanks in advance
It looks like that website will only send over gzipped data. Try switching to HttpWebRequest and using automatic decompression:
var request = (HttpWebRequest)WebRequest.Create("http://www.yellowpages.com.au");
request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705;)";
request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");
request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
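For completeness, reading the response from that request would then look something like the sketch below; with AutomaticDecompression set, the stream is already decompressed by the time you read it.

string html;
using (var response = (HttpWebResponse)request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    html = reader.ReadToEnd(); // already decompressed by AutomaticDecompression
}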
In addition to @bkaid's correct (and upvoted) answer, you can use your own class inherited from WebClient to uncompress/handle gzip-compressed HTML:
public class GZipWebClient : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        HttpWebRequest request = (HttpWebRequest)base.GetWebRequest(address);
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
        return request;
    }
}
Having done this, the following works just fine:
WebClient wb = new GZipWebClient();
string html = wb.DownloadString("http://www.yellowpages.com.au");
When I view the transfer from that website in Wireshark, it reports a malformed HTTP packet: the response claims chunked transfer encoding, declares the next chunk as 0 bytes, and then sends the content of the website anyway. That's why WebClient returns an empty string (not null), and I think that is correct behavior.
It seems browsers ignore this error and so they can display the page properly.
EDIT:
As bkaid pointed out, the server does send a correct gzipped response when asked for one. The following code works for me:
WebClient wb = new WebClient();
wb.Headers.Add("Accept-Encoding", "gzip");
string html;
// GZipStream is in the System.IO.Compression namespace.
using (var webStream = wb.OpenRead("http://www.yellowpages.com.au"))
using (var gzipStream = new GZipStream(webStream, CompressionMode.Decompress))
using (var streamReader = new StreamReader(gzipStream))
    html = streamReader.ReadToEnd();
var uri = new Uri("http://store.scrapbook.com/cos-pad825.html?t12-13=cosmo%20cricket&date=20110309");
var request = (HttpWebRequest)WebRequest.Create(uri);
var cookieContainer = new CookieContainer();
request.CookieContainer = cookieContainer;
request.UserAgent = @"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5";
request.Method = "GET";
request.AllowAutoRedirect = true;
request.Timeout = 15000;
var response = (HttpWebResponse)request.GetResponse();
var page = new HtmlDocument();
var stream = response.GetResponseStream();
page.Load(stream);
This causes an error on the Load(stream) call. Any ideas?
The error I get when I run your code is:
System.ArgumentException: 'ISO-8559-1' is not a supported encoding name.
It's thrown by the standard .NET Framework encoding classes. It means the page declares an encoding not supported by .NET. I fixed it like this:
var page = new HtmlDocument();
page.OptionReadEncoding = false; // don't try to parse the page's (invalid) declared encoding
page.Load(stream);
PS: I'm using the Html Agility Pack version 1.3
Maybe not the answer you need, but the stack trace indicates the exception occurred after you passed the response stream to the page's Load method.
It might be worth adding a TextReader before the HtmlDocument assignment, passing the stream object to that, and then passing the TextReader to the HtmlDocument's Load method, as sketched below.
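Something like this, assuming the page is really ISO-8859-1 (the supported encoding closest to the "ISO-8559-1" name the page declares):

using (var stream = response.GetResponseStream())
using (var reader = new StreamReader(stream, Encoding.GetEncoding("ISO-8859-1")))
{
    var page = new HtmlDocument();
    page.Load(reader); // HtmlDocument.Load accepts a TextReader
}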
Before you debug the source code of the latest HtmlAgilityPack, I think you should first edit your question to include the states of all the types/properties of interest, for clarity.
I have a strange problem:
I am getting the html source from url using this:
string html;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(Url);
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
    {
        html = reader.ReadToEnd();
    }
}
The page that I am requesting has inline CSS like this:
<span class="VL" style="display:inline-block;height:20px;width:0px;"></span>
But the html var value has only:
<span class="VL" style="display:inline-block;"></span>
Does anyone know why? I have tested with many encodings, and with WebRequest and WebClient too, but that doesn't work either.
You might need to send a User-Agent header so that the site doesn't think you are a bot; some sites don't bother with CSS when the request comes from a bot. Also, reading the remote HTML can be simplified by using a WebClient:
using (var client = new WebClient())
{
    client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4";
    string html = client.DownloadString(url);
}
Are you viewing the source through a browser development tool, by clicking Inspect Element? Is it possible you are viewing the source in a browser which is adding the height and width attributes on the client side through JavaScript, showing you the modified style?