C# HtmlAgilityPack: timeout before downloading page - c#

I want to parse the site https://russiarunning.com/events?d=run in C# with HtmlAgilityPack.
I tried this:
string url = "https://russiarunning.com/events?d=run";
var web = new HtmlWeb();
var doc = web.Load(url);
But I have a problem: the site's content loads with a delay of ~1000 ms, so web.Load(url) downloads the page without the content.
How can I set a timeout before downloading the page with HtmlAgilityPack?

Try this...
Create a class as below:
public class WebClientHelper : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        HttpWebRequest request = base.GetWebRequest(address) as HttpWebRequest;
        request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
        return request;
    }
}
and use it as below:
var data = new Helpers.WebClientHelper().DownloadString(Url);
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(data);
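Since the question was specifically about timeouts, the same WebClient-based approach can also be extended to set one. A minimal sketch, assuming you want a request timeout on top of the helper above; TimeoutWebClient is a hypothetical name, but the Timeout property on WebRequest is standard .NET:

```csharp
using System;
using System.Net;

// Hypothetical variant of the helper above that also sets a request timeout.
public class TimeoutWebClient : WebClient
{
    // Timeout in milliseconds applied to every request this client makes.
    public int TimeoutMs { get; set; } = 30000;

    protected override WebRequest GetWebRequest(Uri address)
    {
        var request = base.GetWebRequest(address);
        if (request != null)
        {
            request.Timeout = TimeoutMs; // fail the request if it takes longer than this
        }
        return request;
    }
}
```

Note that this only bounds how long the HTTP request itself may run; it does not make HtmlAgilityPack wait for content the page renders later via JavaScript, which HtmlAgilityPack cannot execute.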

You can simply do this:
string url = "https://russiarunning.com/events?d=run";
var web = new HtmlWeb();
web.PreRequest = delegate(HttpWebRequest webReq)
{
    webReq.Timeout = 4000; // number of milliseconds
    return true;
};
var doc = web.Load(url);
More on the Timeout property: https://learn.microsoft.com/en-us/dotnet/api/system.net.httpwebrequest.timeout?view=netframework-4.7.2

Related

Character encoding wrong in HtmlAgilityPack [duplicate]

I have some problems with Gzip in HtmlAgilityPack.
Error: 'gzip' is not a supported encoding name.
Code:
var url = "http://poe.trade/search/arokazugetohar";
var web = new HtmlWeb();
var htmldoc = web.Load(url);
You can add gzip encoding using the method below.
var url = "http://poe.trade/search/arokazugetohar";
HtmlWeb webClient = new HtmlWeb();
HtmlAgilityPack.HtmlWeb.PreRequestHandler handler = delegate (HttpWebRequest request)
{
    request.Headers[HttpRequestHeader.AcceptEncoding] = "gzip, deflate";
    request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
    request.CookieContainer = new System.Net.CookieContainer();
    return true;
};
webClient.PreRequest += handler;
HtmlDocument doc = webClient.Load(url);
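For comparison, a sketch of the same idea without HtmlWeb's PreRequest hook, using HttpClient, whose handler decompresses gzip/deflate transparently (the URL is the one from the question; this is an alternative, not the answer's method):

```csharp
using System.Net;
using System.Net.Http;
using HtmlAgilityPack;

class GzipExample
{
    static void Run()
    {
        // HttpClientHandler can decompress gzip/deflate responses automatically.
        var handler = new HttpClientHandler
        {
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
        };
        using (var client = new HttpClient(handler))
        {
            string html = client.GetStringAsync("http://poe.trade/search/arokazugetohar").Result;
            var doc = new HtmlDocument();
            doc.LoadHtml(html); // parse the already-decompressed HTML
        }
    }
}
```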


WebRequest not returning HTML

I want to load the URL http://www.yellowpages.ae/categories-by-alphabet/h.html, but it returns null.
In some questions I have read about adding a cookie container, but it is already there in my code.
var MainUrl = "http://www.yellowpages.ae/categories-by-alphabet/h.html";
HtmlWeb web = new HtmlWeb();
web.PreRequest += request =>
{
    request.CookieContainer = new System.Net.CookieContainer();
    return true;
};
web.CacheOnly = false;
var doc = web.Load(MainUrl);
The website opens perfectly fine in a browser.
You need a CookieCollection to capture the cookies, and UseCookies must be set to true on HtmlWeb.
CookieCollection cookieCollection = null;
var web = new HtmlWeb
{
    //AutoDetectEncoding = true,
    UseCookies = true,
    CacheOnly = false,
    PreRequest = request =>
    {
        if (cookieCollection != null && cookieCollection.Count > 0)
            request.CookieContainer.Add(cookieCollection);
        return true;
    },
    PostResponse = (request, response) => { cookieCollection = response.Cookies; }
};
var doc = web.Load("https://www.google.com");
I doubt it is a cookie issue. It looks like gzip compression, since I got nothing but gibberish when I tried to fetch the page; if it were a cookie issue, the response would return an error saying so. Anyhow, here is my solution to your problem.
public static void Main(string[] args)
{
    HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.yellowpages.ae/categories-by-alphabet/h.html");
        request.Method = "GET";
        request.ContentType = "text/html;charset=utf-8";
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            using (var stream = response.GetResponseStream())
            {
                doc.Load(stream, Encoding.GetEncoding("utf-8"));
            }
        }
    }
    catch (WebException ex)
    {
        Console.WriteLine(ex.Message);
    }
    Console.WriteLine(doc.DocumentNode.InnerHtml);
    Console.ReadKey();
}
All it does is decompress the gzip response that we receive.
How did I know it was gzip, you ask? The response stream in the debugger showed that ContentEncoding was gzip.
Basically, just add:
request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
to your code and you're good.

HtmlAgilityPack doesn't get XPath in C#

This code used to get XPath results from the website, but when I debugged it today I saw that it no longer gets any HTML data from webtruyen.com. I checked the site's robots.txt and found nothing suspicious. I also tried adding a proxy to get the data, but it returned null. I don't know how to get XPath data from webtruyen.com. Who can help me? I want to know how to read data from the website http://webtruyen.com.
My code:
string url = "http://webtruyen.com";
var web = new HtmlWeb();
var doc = web.Load(url);
String temps = "";
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
    temps = node.InnerHtml;
}
When I debug, it returns:
InnerHtml 'doc.DocumentNode.InnerHtml' threw an exception of type 'System.NullReferenceException' string {System.NullReferenceException}
My code using a user agent:
string url = "http://webtruyen.com";
var web = new HtmlWeb();
web.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Speedy Spider (http://www.entireweb.com/about/search_tech/speedy_spider/)";
var doc = web.Load(url);
String temps = "";
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
    temps = node.InnerHtml;
}
I get the same error using HtmlWeb.Load(), but your issue can easily be solved with HttpWebRequest (TL;DR: see Step 3 for the working code).
Step 1) Using the following code:
HttpWebRequest hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
using (Stream s = hwr.GetResponse().GetResponseStream())
{ }
you see that you actually get a 403 Forbidden error (a WebException).
Step 2)
HttpWebRequest hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
HtmlDocument doc = new HtmlDocument();
try
{
    using (Stream s = hwr.GetResponse().GetResponseStream())
    { }
}
catch (WebException wx)
{
    doc.LoadHtml(new StreamReader(wx.Response.GetResponseStream()).ReadToEnd());
}
In doc.DocumentNode.OuterHtml, you see the HTML of the forbidden error, with the JavaScript that sets the cookie in your browser and refreshes the page.
Step 3) So, in order to load the page outside of a real browser, you have to set that cookie manually and re-request the page. That is:
string cookie = string.Empty;
HttpWebRequest hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
try
{
    using (Stream s = hwr.GetResponse().GetResponseStream())
    { }
}
catch (WebException wx)
{
    cookie = Regex.Match(new StreamReader(wx.Response.GetResponseStream()).ReadToEnd(), "document.cookie = '(.*?)';").Groups[1].Value;
}
hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
hwr.Headers.Add("Cookie", cookie);
HtmlDocument doc = new HtmlDocument();
using (Stream s = hwr.GetResponse().GetResponseStream())
using (StreamReader sr = new StreamReader(s))
{
    doc.LoadHtml(sr.ReadToEnd());
}
You get the page :)
Moral of the story, if your browser can do it, so can you.
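As a side note on the NullReferenceException in the question: HtmlAgilityPack's SelectNodes returns null, not an empty collection, when no nodes match, so guarding before iterating avoids the crash. A minimal sketch:

```csharp
using HtmlAgilityPack;

class NullSafeSelect
{
    static void Run(HtmlDocument doc)
    {
        // SelectNodes returns null when the XPath matches nothing.
        var links = doc.DocumentNode.SelectNodes("//a");
        if (links != null) // null means "no matches", not an error
        {
            foreach (HtmlNode node in links)
            {
                string inner = node.InnerHtml;
            }
        }
    }
}
```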

WinRT web page parsing / DocumentNode.InnerHtml is the URI rather than the page HTML

I'm trying to create a Metro application with a schedule of subjects for my university. I use HAP + Fizzler to parse the page and get the data.
The schedule link gives me a "Too many automatic redirections" error.
I found out that a CookieContainer can help me, but I don't know how to implement it.
CookieContainer cc = new CookieContainer();
request.CookieContainer = cc;
My code:
public static HttpWebRequest request;
public string Url = "http://cist.kture.kharkov.ua/ias/app/tt/f?p=778:201:9421608126858:::201:P201_FIRST_DATE,P201_LAST_DATE,P201_GROUP,P201_POTOK:01.09.2012,31.01.2013,2423447,0:";
public SampleDataSource()
{
    HtmlDocument html = new HtmlDocument();
    request = (HttpWebRequest)WebRequest.Create(Url);
    request.Proxy = null;
    request.UseDefaultCredentials = true;
    CookieContainer cc = new CookieContainer();
    request.CookieContainer = cc;
    html.LoadHtml(request.RequestUri.ToString());
    var page = html.DocumentNode;
    String ITEM_CONTENT = null;
    foreach (var item in page.QuerySelectorAll(".MainTT"))
    {
        ITEM_CONTENT = item.InnerHtml;
    }
}
With the CookieContainer I don't get the error, but DocumentNode.InnerHtml for some reason contains my URI, not the page HTML.
You just need to change one line.
Replace
html.LoadHtml(request.RequestUri.ToString());
with
html.LoadHtml(new StreamReader(request.GetResponse().GetResponseStream()).ReadToEnd());
EDIT
First mark your method as async, then:
request.CookieContainer = cc;
var resp = await request.GetResponseAsync();
html.LoadHtml(new StreamReader(resp.GetResponseStream()).ReadToEnd());
If you want to download a web page's HTML, try this method (using HttpClient):
public async Task<string> DownloadHtmlCode(string url)
{
    HttpClientHandler handler = new HttpClientHandler { UseDefaultCredentials = true, AllowAutoRedirect = true };
    HttpClient client = new HttpClient(handler);
    HttpResponseMessage response = await client.GetAsync(url);
    response.EnsureSuccessStatusCode();
    string responseBody = await response.Content.ReadAsStringAsync();
    return responseBody;
}
If you want to parse the downloaded HTML, you can use Regex or LINQ. Here is an example using LINQ, but first you should load the code into an HtmlDocument using the HtmlAgilityPack library, like this: html.LoadHtml(temphtml);
When you've done that, you can parse your HtmlDocument:
//This is an img-link parsing example:
IEnumerable<HtmlNode> imghrefNodes = html.DocumentNode.Descendants().Where(n => n.Name == "img");
foreach (HtmlNode img in imghrefNodes)
{
    HtmlAttribute att = img.Attributes["src"];
    // att.Value contains the img URL;
    // here you can do whatever you want with each img link by editing att.Value
}
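A related convenience: HtmlAgilityPack's GetAttributeValue returns a supplied default instead of requiring a null check on Attributes["src"] when the attribute is missing. A small sketch (the markup string here is made up for illustration):

```csharp
using System;
using HtmlAgilityPack;

class ImgSrcExample
{
    static void Run()
    {
        var html = new HtmlDocument();
        // Illustrative markup: one img with a src attribute, one without.
        html.LoadHtml("<img src='a.png'><img>");
        foreach (HtmlNode img in html.DocumentNode.Descendants("img"))
        {
            // Returns "" when the src attribute is absent, avoiding a null check.
            string src = img.GetAttributeValue("src", "");
            Console.WriteLine(src);
        }
    }
}
```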
