Character encoding wrong in HtmlAgilityPack

I have a problem with gzip in HtmlAgilityPack.
Error: 'gzip' is not a supported encoding name
Code:
var url = "http://poe.trade/search/arokazugetohar";
var web = new HtmlWeb();
var htmldoc = web.Load(url);

You can enable gzip decompression using the method below:
var url = "http://poe.trade/search/arokazugetohar";
HtmlWeb webClient = new HtmlWeb();
HtmlAgilityPack.HtmlWeb.PreRequestHandler handler = delegate (HttpWebRequest request)
{
    request.Headers[HttpRequestHeader.AcceptEncoding] = "gzip, deflate";
    request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
    request.CookieContainer = new System.Net.CookieContainer();
    return true;
};
webClient.PreRequest += handler;
HtmlDocument doc = webClient.Load(url);
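For reference, the same fix can also be expressed with HttpClient, whose handler supports automatic decompression as well. This is a minimal sketch, not part of the original answer; the method name is made up:
// requires System.Net, System.Net.Http, System.Threading.Tasks, HtmlAgilityPack
static async Task<HtmlDocument> LoadWithHttpClientAsync(string url)
{
    var handler = new HttpClientHandler
    {
        // let the handler transparently decompress gzip/deflate responses
        AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
    };
    using (var client = new HttpClient(handler))
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(await client.GetStringAsync(url));
        return doc;
    }
}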

C# HtmlAgilityPack timeout before download page

I want to parse the site https://russiarunning.com/events?d=run in C# with HtmlAgilityPack.
I tried this:
string url = "https://russiarunning.com/events?d=run";
var web = new HtmlWeb();
var doc = web.Load(url);
But there is a problem: the content on the site loads with a delay of roughly 1000 ms, so when I use web.Load(url) I download the page without its content.
How can I make the download wait before fetching the page with HtmlAgilityPack?
Try this...
Create a class as below:
public class WebClientHelper : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        HttpWebRequest request = base.GetWebRequest(address) as HttpWebRequest;
        request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
        return request;
    }
}
and use it as below:
var data = new WebClientHelper().DownloadString(url);
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(data);
You can simply do this:
string url = "https://russiarunning.com/events?d=run";
var web = new HtmlWeb();
web.PreRequest = delegate(HttpWebRequest webReq)
{
    webReq.Timeout = 4000; // number of milliseconds
    return true;
};
var doc = web.Load(url);
More on the Timeout property: https://learn.microsoft.com/en-us/dotnet/api/system.net.httpwebrequest.timeout?view=netframework-4.7.2


WebRequest not returning HTML

I want to load the URL http://www.yellowpages.ae/categories-by-alphabet/h.html, but it returns null.
In another question I read about adding a CookieContainer, but it is already there in my code.
var MainUrl = "http://www.yellowpages.ae/categories-by-alphabet/h.html";
HtmlWeb web = new HtmlWeb();
web.PreRequest += request =>
{
    request.CookieContainer = new System.Net.CookieContainer();
    return true;
};
web.CacheOnly = false;
var doc = web.Load(MainUrl);
The website opens perfectly fine in a browser.
You need a CookieCollection to capture the cookies, and UseCookies must be set to true on HtmlWeb.
CookieCollection cookieCollection = null;
var web = new HtmlWeb
{
    //AutoDetectEncoding = true,
    UseCookies = true,
    CacheOnly = false,
    PreRequest = request =>
    {
        if (cookieCollection != null && cookieCollection.Count > 0)
            request.CookieContainer.Add(cookieCollection);
        return true;
    },
    PostResponse = (request, response) => { cookieCollection = response.Cookies; }
};
var doc = web.Load("https://www.google.com");
I doubt it is a cookie issue. It looks like gzip compression, since I got nothing but gibberish when I tried to fetch the page. If it were a cookie issue, the response would return an error saying so. Anyhow, here is my solution to your problem.
public static void Main(string[] args)
{
    HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.yellowpages.ae/categories-by-alphabet/h.html");
        request.Method = "GET";
        request.ContentType = "text/html;charset=utf-8";
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            using (var stream = response.GetResponseStream())
            {
                doc.Load(stream, Encoding.GetEncoding("utf-8"));
            }
        }
    }
    catch (WebException ex)
    {
        Console.WriteLine(ex.Message);
    }
    Console.WriteLine(doc.DocumentNode.InnerHtml);
    Console.ReadKey();
}
All it does is decompress the gzip response that we receive.
How did I know it was gzip, you ask? The response stream in the debugger showed that ContentEncoding was gzip.
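If you want to check this in code rather than in the debugger, HttpWebResponse exposes the header directly (a quick sketch using the request from the code above):
using (var response = (HttpWebResponse)request.GetResponse())
{
    // prints "gzip" when the server compressed the response body
    Console.WriteLine(response.ContentEncoding);
}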
Basically just add:
request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
to your code and you're good.
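If you would rather keep the HtmlWeb-based code from the question, the same property can be set from a PreRequest handler, just like the gzip answer earlier on this page (a sketch, not part of the original answer):
HtmlWeb web = new HtmlWeb();
web.PreRequest += request =>
{
    // enable transparent gzip/deflate decompression on the underlying request
    request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
    request.CookieContainer = new System.Net.CookieContainer();
    return true;
};
var doc = web.Load("http://www.yellowpages.ae/categories-by-alphabet/h.html");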

WinRT web page parse / DocumentNode.InnerHtml = "URI" rather than page html

I'm trying to create a Metro application with a schedule of subjects for my university. I use HAP + Fizzler to parse the page and get the data.
The schedule link gives me a 'Too many automatic redirections' error.
I found out that a CookieContainer can help me, but I don't know how to implement it.
CookieContainer cc = new CookieContainer();
request.CookieContainer = cc;
My code:
public static HttpWebRequest request;
public string Url = "http://cist.kture.kharkov.ua/ias/app/tt/f?p=778:201:9421608126858:::201:P201_FIRST_DATE,P201_LAST_DATE,P201_GROUP,P201_POTOK:01.09.2012,31.01.2013,2423447,0:";
public SampleDataSource()
{
    HtmlDocument html = new HtmlDocument();
    request = (HttpWebRequest)WebRequest.Create(Url);
    request.Proxy = null;
    request.UseDefaultCredentials = true;
    CookieContainer cc = new CookieContainer();
    request.CookieContainer = cc;
    html.LoadHtml(request.RequestUri.ToString());
    var page = html.DocumentNode;
    String ITEM_CONTENT = null;
    foreach (var item in page.QuerySelectorAll(".MainTT"))
    {
        ITEM_CONTENT = item.InnerHtml;
    }
}
With the CookieContainer I don't get the error, but DocumentNode.InnerHtml for some reason contains my URI, not the page HTML.
You just need to change one line.
Replace
html.LoadHtml(request.RequestUri.ToString());
with
html.LoadHtml(new StreamReader(request.GetResponse().GetResponseStream()).ReadToEnd());
EDIT:
First mark your method as async, then:
request.CookieContainer = cc;
var resp = await request.GetResponseAsync();
html.LoadHtml(new StreamReader(resp.GetResponseStream()).ReadToEnd());
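Put together, a minimal async version might look like this; the method name and return type are placeholders, not from the original code:
public static async Task<HtmlDocument> LoadPageAsync(string url)
{
    HtmlDocument html = new HtmlDocument();
    var request = (HttpWebRequest)WebRequest.Create(url);
    request.CookieContainer = new CookieContainer(); // avoids the redirect loop
    var resp = await request.GetResponseAsync();
    using (var reader = new StreamReader(resp.GetResponseStream()))
    {
        html.LoadHtml(reader.ReadToEnd());
    }
    return html;
}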
If you want to download the web page source, try this method (using HttpClient):
public async Task<string> DownloadHtmlCode(string url)
{
    HttpClientHandler handler = new HttpClientHandler { UseDefaultCredentials = true, AllowAutoRedirect = true };
    HttpClient client = new HttpClient(handler);
    HttpResponseMessage response = await client.GetAsync(url);
    response.EnsureSuccessStatusCode();
    string responseBody = await response.Content.ReadAsStringAsync();
    return responseBody;
}
If you want to parse the downloaded HTML code, you can use Regex or LINQ. Here is an example using LINQ; first load the code into an HtmlDocument using the HtmlAgilityPack library, like this: html.LoadHtml(temphtml);
Once you have done that, you can parse the HtmlDocument:
// This is an example of parsing img links:
IEnumerable<HtmlNode> imghrefNodes = html.DocumentNode.Descendants().Where(n => n.Name == "img");
foreach (HtmlNode img in imghrefNodes)
{
    HtmlAttribute att = img.Attributes["src"];
    // att.Value holds the img URL;
    // here you can do whatever you want with each img link by editing att.Value
}
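Putting the two pieces together, usage might look like this; the URL is a placeholder, and temphtml matches the variable name used above:
string temphtml = await DownloadHtmlCode("http://example.com"); // placeholder URL
HtmlDocument html = new HtmlDocument();
html.LoadHtml(temphtml);
foreach (HtmlNode img in html.DocumentNode.Descendants("img"))
{
    // the second argument is the fallback when the src attribute is missing
    string src = img.GetAttributeValue("src", "");
}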

Downloading a website into a string using C# WebClient or HttpWebRequest

I am trying to download the contents of a website. However, for one particular webpage the returned string contains jumbled data with many � characters.
Here is the code I was originally using.
HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(url);
req.Method = "GET";
req.UserAgent = "Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))";
string source;
using (StreamReader reader = new StreamReader(req.GetResponse().GetResponseStream()))
{
    source = reader.ReadToEnd();
}
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(source);
I also tried alternative implementations with WebClient, but got the same result:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
using (WebClient client = new WebClient())
using (var read = client.OpenRead(url))
{
    doc.Load(read, true);
}
From searching, I guess this might be an encoding issue, so I tried both of the solutions below, but I still cannot get this to work.
http://blogs.msdn.com/b/feroze_daud/archive/2004/03/30/104440.aspx
http://bytes.com/topic/c-sharp/answers/653250-webclient-encoding
The offending page that I cannot seem to download is the United_States article on the English version of Wikipedia (en.wikipedia.org/wiki/United_States).
I have tried a number of other Wikipedia articles and have not seen this issue.
Using the built-in loader in HtmlAgilityPack worked for me:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://en.wikipedia.org/wiki/United_States");
string html = doc.DocumentNode.OuterHtml; // no jumbled data here
Edit:
Using a standard WebClient with your user-agent string will result in an HTTP 403 (Forbidden); using this instead worked for me:
using (WebClient wc = new WebClient())
{
    wc.Headers.Add("user-agent", "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
    string html = wc.DownloadString("http://en.wikipedia.org/wiki/United_States");
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
}
Also see this SO thread: WebClient forbids opening wikipedia page?
The response is gzip encoded.
Try the following to decode the stream:
UPDATE:
Based on the comment by BrokenGlass, setting the following properties should solve your problem (it worked for me):
req.Headers[HttpRequestHeader.AcceptEncoding] = "gzip, deflate";
req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
Old/Manual solution:
string source;
var response = req.GetResponse();
var stream = response.GetResponseStream();
try
{
    if (response.Headers.AllKeys.Contains("Content-Encoding")
        && response.Headers["Content-Encoding"].Contains("gzip"))
    {
        stream = new System.IO.Compression.GZipStream(stream, System.IO.Compression.CompressionMode.Decompress);
    }
    using (StreamReader reader = new StreamReader(stream))
    {
        source = reader.ReadToEnd();
    }
}
finally
{
    if (stream != null)
        stream.Dispose();
}
This is how I usually grab a page into a string (it's VB, but should translate easily):
Dim req As Net.WebRequest = Net.WebRequest.Create("http://www.cnn.com")
Dim resp As Net.HttpWebResponse = req.GetResponse()
Dim sr As New IO.StreamReader(resp.GetResponseStream())
Dim lcResults As String = sr.ReadToEnd()
and I haven't had the problems you are having.
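One caveat, given the rest of this thread: a plain StreamReader over the response stream will produce the same � garbage if the server compresses the response. Enabling automatic decompression first avoids that; a C# sketch of the same grab:
var req = (HttpWebRequest)WebRequest.Create("http://www.cnn.com");
req.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
using (var resp = (HttpWebResponse)req.GetResponse())
using (var sr = new StreamReader(resp.GetResponseStream()))
{
    string lcResults = sr.ReadToEnd();
}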
