WinRT web page parse / DocumentNode.InnerHtml = "URI" rather than page html - c#

I'm trying to create a Metro application with a schedule of subjects for my university. I use HtmlAgilityPack (HAP) + Fizzler to parse the page and extract data.
The schedule link gives me a "Too many automatic redirections" error.
I found out that a CookieContainer can help, but I don't know how to implement it:
CookieContainer cc = new CookieContainer();
request.CookieContainer = cc;
My code:
public static HttpWebRequest request;
public string Url = "http://cist.kture.kharkov.ua/ias/app/tt/f?p=778:201:9421608126858:::201:P201_FIRST_DATE,P201_LAST_DATE,P201_GROUP,P201_POTOK:01.09.2012,31.01.2013,2423447,0:";

public SampleDataSource()
{
    HtmlDocument html = new HtmlDocument();
    request = (HttpWebRequest)WebRequest.Create(Url);
    request.Proxy = null;
    request.UseDefaultCredentials = true;
    CookieContainer cc = new CookieContainer();
    request.CookieContainer = cc;
    html.LoadHtml(request.RequestUri.ToString());
    var page = html.DocumentNode;
    String ITEM_CONTENT = null;
    foreach (var item in page.QuerySelectorAll(".MainTT"))
    {
        ITEM_CONTENT = item.InnerHtml;
    }
}
With the CookieContainer I don't get the error, but DocumentNode.InnerHtml ends up containing my URI, not the page HTML.

You just need to change one line. LoadHtml parses the string you pass as HTML, and you are currently passing the URL itself rather than the downloaded page. Replace
html.LoadHtml(request.RequestUri.ToString());
with
html.LoadHtml(new StreamReader(request.GetResponse().GetResponseStream()).ReadToEnd());
EDIT
First mark your method as async, then:
request.CookieContainer = cc;
var resp = await request.GetResponseAsync();
html.LoadHtml(new StreamReader(resp.GetResponseStream()).ReadToEnd());
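Putting the pieces together, a minimal sketch of the whole flow (the method name and Task-returning shape are illustrative; a constructor cannot be marked async, so this would live in a separate method):
public async Task LoadPageAsync()
{
    HtmlDocument html = new HtmlDocument();
    var request = (HttpWebRequest)WebRequest.Create(Url);
    request.CookieContainer = new CookieContainer();
    var resp = await request.GetResponseAsync();
    using (var reader = new StreamReader(resp.GetResponseStream()))
    {
        // LoadHtml parses the string itself, so pass the page markup, not the URL
        html.LoadHtml(reader.ReadToEnd());
    }
}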

If you want to download the web page's code, try this method (using HttpClient):
public async Task<string> DownloadHtmlCode(string url)
{
    HttpClientHandler handler = new HttpClientHandler { UseDefaultCredentials = true, AllowAutoRedirect = true };
    HttpClient client = new HttpClient(handler);
    HttpResponseMessage response = await client.GetAsync(url);
    response.EnsureSuccessStatusCode();
    string responseBody = await response.Content.ReadAsStringAsync();
    return responseBody;
}
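A hypothetical call site, feeding the result straight into HtmlAgilityPack as described next (the URL is a placeholder):
// Hypothetical usage (inside an async method); URL is a placeholder.
string temphtml = await DownloadHtmlCode("http://example.com");
var html = new HtmlDocument();
html.LoadHtml(temphtml);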

If you want to parse the downloaded HTML code you can use Regex or LINQ. I have an example using LINQ to parse HTML code, but first you should load your code into an HtmlDocument using the HtmlAgilityPack library, like this: html.LoadHtml(temphtml);
Once you have done that, you can parse your HtmlDocument:
// This is an img-link parsing example:
IEnumerable<HtmlNode> imghrefNodes = html.DocumentNode.Descendants().Where(n => n.Name == "img");
foreach (HtmlNode img in imghrefNodes)
{
    HtmlAttribute att = img.Attributes["src"];
    // att.Value holds the img URL;
    // here you can do whatever you want with each img link by editing att.Value
}

Related

C# HtmlAgilityPack timeout before download page

I want to parse the site https://russiarunning.com/events?d=run in C# with HtmlAgilityPack.
I tried this:
string url = "https://russiarunning.com/events?d=run";
var web = new HtmlWeb();
var doc = web.Load(url);
But I have a problem: the site's content loads with a delay of about 1000 ms, so web.Load(url) downloads the page before the content is there.
How can I make HtmlAgilityPack wait before downloading the page?
Try this...
Create one class as below :
public class WebClientHelper : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        HttpWebRequest request = base.GetWebRequest(address) as HttpWebRequest;
        request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
        return request;
    }
}
and use it as below:
var data = new Helpers.WebClientHelper().DownloadString(Url);
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(data);
You can simply do this:
string url = "https://russiarunning.com/events?d=run";
var web = new HtmlWeb();
web.PreRequest = delegate(HttpWebRequest webReq)
{
    webReq.Timeout = 4000; // number of milliseconds
    return true;
};
var doc = web.Load(url);
More on Timeout property: https://learn.microsoft.com/en-us/dotnet/api/system.net.httpwebrequest.timeout?view=netframework-4.7.2

WebRequest not returning HTML

I want to load the URL http://www.yellowpages.ae/categories-by-alphabet/h.html, but it returns null.
In some question I read about adding a CookieContainer, but it is already there in my code.
var MainUrl = "http://www.yellowpages.ae/categories-by-alphabet/h.html";
HtmlWeb web = new HtmlWeb();
web.PreRequest += request =>
{
    request.CookieContainer = new System.Net.CookieContainer();
    return true;
};
web.CacheOnly = false;
var doc = web.Load(MainUrl);
The website opens perfectly fine in a browser.
You need a CookieCollection to capture the cookies, and you must set UseCookies to true on HtmlWeb.
CookieCollection cookieCollection = null;
var web = new HtmlWeb
{
    //AutoDetectEncoding = true,
    UseCookies = true,
    CacheOnly = false,
    PreRequest = request =>
    {
        if (cookieCollection != null && cookieCollection.Count > 0)
            request.CookieContainer.Add(cookieCollection);
        return true;
    },
    PostResponse = (request, response) => { cookieCollection = response.Cookies; }
};
var doc = web.Load("https://www.google.com");
I doubt it is a cookie issue. It looks like gzip compression, since I got nothing but gibberish when I tried to fetch the page. If it were a cookie issue, the response would return an error saying so. Anyhow, here is my solution to your problem.
public static void Main(string[] args)
{
    HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.yellowpages.ae/categories-by-alphabet/h.html");
        request.Method = "GET";
        request.ContentType = "text/html;charset=utf-8";
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            using (var stream = response.GetResponseStream())
            {
                doc.Load(stream, Encoding.GetEncoding("utf-8"));
            }
        }
    }
    catch (WebException ex)
    {
        Console.WriteLine(ex.Message);
    }
    Console.WriteLine(doc.DocumentNode.InnerHtml);
    Console.ReadKey();
}
All it does is decompress the gzip response we receive.
How did I know it was gzip, you ask? The response stream in the debugger showed that the ContentEncoding was gzip.
Basically just add:
request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
To your code and you're good.

Htmlagilitypack after login to Https Website with HttpWebRequest?

I want to parse an HTML site such as Pluralsight, for example (https://app.pluralsight.com/id?). How can I first log in to the site programmatically (without using the WebBrowser control), then call another URL (for example, Pluralsight), get the response, and parse it with HtmlAgilityPack?
I've written the login code, but I do not know the next step.
public class Login
{
    private CookieContainer Cookies = new CookieContainer();

    public void SiteLogin(string username, string password)
    {
        Uri site = new Uri("https://app.pluralsight.com/id?");
        HttpWebRequest wr = (HttpWebRequest)WebRequest.Create(site);
        wr.Method = "POST";
        wr.ContentType = "application/x-www-form-urlencoded";
        wr.Referer = "https://app.pluralsight.com/id?";
        wr.CookieContainer = Cookies;
        var parameters = new Dictionary<string, string>{
            {"realm", "vzw"},
            {"goto", ""},
            {"gotoOnFail", ""},
            {"gx_charset", "UTF-8"},
            {"rememberUserNameCheckBoxExists", "Y"},
            {"IDToken1", username},
            {"IDToken2", password}
        };
        using (var requestStream = wr.GetRequestStream())
        using (var writer = new StreamWriter(requestStream, Encoding.UTF8))
            writer.Write(ParamsToFormEncoded(parameters));
        using (var response = (HttpWebResponse)wr.GetResponse())
        {
            if (response.StatusCode == HttpStatusCode.OK)
            {
                //but I do not know the next step.
            }
        }
    }

    private string ParamsToFormEncoded(Dictionary<string, string> parameters)
    {
        return string.Join("&", parameters.Select(kvp => Uri.EscapeDataString(kvp.Key).Replace("%20", "+")
            + "=" + Uri.EscapeDataString(kvp.Value).Replace("%20", "+")
        ).ToArray());
    }
}
You have to get the stream for the response via HttpWebResponse.GetResponseStream and then load the document via the Load method of the HtmlDocument.
var doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(response.GetResponseStream());
//further processing...
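The next step, then, is to issue a second request that reuses the same Cookies container the login populated, and feed that response to HtmlAgilityPack. A minimal sketch (the URL and XPath below are placeholders, not Pluralsight's real ones):
// Sketch: reuse the login cookies for a second request.
HttpWebRequest next = (HttpWebRequest)WebRequest.Create("https://app.pluralsight.com/some-page"); // placeholder URL
next.CookieContainer = Cookies; // the same container the login filled
using (var nextResponse = (HttpWebResponse)next.GetResponse())
{
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.Load(nextResponse.GetResponseStream());
    var links = doc.DocumentNode.SelectNodes("//a[@href]"); // placeholder XPath
}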

How to add cookies to WebRequest?

I am trying to unit test some code, and I need to replace this:
HttpWebRequest httpWebRequest = (HttpWebRequest)WebRequest.Create( uri );
httpWebRequest.CookieContainer = new CookieContainer();
with
WebRequest webRequest = WebRequest.Create( uri );
webRequest.CookieContainer = new CookieContainer();
Basically, how do I get cookies into the request without using a HttpWebRequest?
Based on your comments, you might consider writing an extension method:
public static bool TryAddCookie(this WebRequest webRequest, Cookie cookie)
{
    HttpWebRequest httpRequest = webRequest as HttpWebRequest;
    if (httpRequest == null)
    {
        return false;
    }
    if (httpRequest.CookieContainer == null)
    {
        httpRequest.CookieContainer = new CookieContainer();
    }
    httpRequest.CookieContainer.Add(cookie);
    return true;
}
Then you can have code like:
WebRequest webRequest = WebRequest.Create( uri );
webRequest.TryAddCookie(new Cookie("someName","someValue"));
Try with something like this:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.contoso.com/default.html");
request.CookieContainer = new CookieContainer();
request.CookieContainer.Add(new Cookie("ConstoCookie", "Chocolate Flavour"));
WebRequest is an abstract class that does not have a CookieContainer property. In addition, you can't use its Headers collection (it throws a not-implemented exception), so any attempt like webRequest.Headers.Add("Cookie", "...") will fail.
Sorry, but you have no chance to use cookies with a plain WebRequest.
Stick with HttpWebRequest and add/edit as many cookies as you like via its CookieContainer property!
dlev's answer ended up working, but I had problems implementing the solution ("The parameter '{0}' cannot be an empty string."), so I decided to write the full code in case anybody else has similar problems.
My goal was to get the HTML as a string, but I needed to add the cookies to the web request. This is the function that downloads the string using the cookies:
public static string DownloadString(string url, Encoding encoding, IDictionary<string, string> cookieNameValues)
{
    var uri = new Uri(url);
    var webRequest = WebRequest.Create(uri);
    foreach (var nameValue in cookieNameValues)
    {
        // The path and domain are required, or the Cookie constructor throws
        // "The parameter '{0}' cannot be an empty string."
        webRequest.TryAddCookie(new Cookie(nameValue.Key, nameValue.Value, "/", uri.Host));
    }
    using (var response = webRequest.GetResponse())
    using (var receiveStream = response.GetResponseStream())
    using (var readStream = new StreamReader(receiveStream, encoding))
    {
        var htmlCode = readStream.ReadToEnd();
        return htmlCode;
    }
}
We are using the code from dlev's answer:
public static bool TryAddCookie(this WebRequest webRequest, Cookie cookie)
{
    HttpWebRequest httpRequest = webRequest as HttpWebRequest;
    if (httpRequest == null)
    {
        return false;
    }
    if (httpRequest.CookieContainer == null)
    {
        httpRequest.CookieContainer = new CookieContainer();
    }
    httpRequest.CookieContainer.Add(cookie);
    return true;
}
This is how you use the full code:
var cookieNameValues = new Dictionary<string, string>();
cookieNameValues.Add("varName", "varValue");
var htmlResult = DownloadString(url, Encoding.UTF8, cookieNameValues);

Getting the Redirected URL from the Original URL

I have a table in my database which contains the URLs of some websites. I have to open those URLs and verify some links on those pages. The problem is that some URLs get redirected to other URLs. My logic is failing for such URLs.
Is there some way through which I can pass my original URL string and get the redirected URL back?
Example: I am trying with this URL:
http://individual.troweprice.com/public/Retail/xStaticFiles/FormsAndLiterature/CollegeSavings/trp529Disclosure.pdf
It gets redirected to this one:
http://individual.troweprice.com/staticFiles/Retail/Shared/PDFs/trp529Disclosure.pdf
I tried to use the following code:
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(Uris);
req.Proxy = proxy;
req.Method = "HEAD";
req.AllowAutoRedirect = false;
HttpWebResponse myResp = (HttpWebResponse)req.GetResponse();
if (myResp.StatusCode == HttpStatusCode.Redirect)
{
    MessageBox.Show("redirected to:" + myResp.GetResponseHeader("Location"));
}
When I execute the code above it gives me HttpStatusCode.OK. I am surprised it does not treat this as a redirection. If I open the link in Internet Explorer, it redirects to the other URL and opens the PDF file.
Can someone help me understand why it is not working properly for the example URL?
By the way, I checked with Hotmail's URL (http://www.hotmail.com) and it correctly returns the redirected URL.
This function will return the final destination of a link, even if there are multiple redirects. It doesn't account for JavaScript-based or META redirects. Note that the previous solution didn't deal with absolute vs. relative URLs: since the Location header can return something like "/newhome", you need to combine it with the URL that served that response to identify the full destination URL.
public static string GetFinalRedirect(string url)
{
    if (string.IsNullOrWhiteSpace(url))
        return url;

    int maxRedirCount = 8;  // prevent infinite loops
    string newUrl = url;
    do
    {
        HttpWebRequest req = null;
        HttpWebResponse resp = null;
        try
        {
            req = (HttpWebRequest)WebRequest.Create(url);
            req.Method = "HEAD";
            req.AllowAutoRedirect = false;
            resp = (HttpWebResponse)req.GetResponse();
            switch (resp.StatusCode)
            {
                case HttpStatusCode.OK:
                    return newUrl;
                case HttpStatusCode.Redirect:
                case HttpStatusCode.MovedPermanently:
                case HttpStatusCode.RedirectKeepVerb:
                case HttpStatusCode.RedirectMethod:
                    newUrl = resp.Headers["Location"];
                    if (newUrl == null)
                        return url;
                    if (newUrl.IndexOf("://", System.StringComparison.Ordinal) == -1)
                    {
                        // Doesn't have a URL schema, meaning it's a relative or absolute path;
                        // resolve it against the URL that served this response
                        Uri u = new Uri(new Uri(url), newUrl);
                        newUrl = u.ToString();
                    }
                    break;
                default:
                    return newUrl;
            }
            url = newUrl;
        }
        catch (WebException)
        {
            // Return the last known good URL
            return newUrl;
        }
        catch (Exception)
        {
            return null;
        }
        finally
        {
            if (resp != null)
                resp.Close();
        }
    } while (maxRedirCount-- > 0);
    return newUrl;
}
The URL you mentioned uses a JavaScript redirect, which will only redirect a browser. So there's no easy way to detect the redirect.
For proper (HTTP Status Code and Location:) redirects, you might want to remove
req.AllowAutoRedirect = false;
and get the final URL using
myResp.ResponseUri
as there can be more than one redirect.
UPDATE: More clarification regarding redirects:
There's more than one way to redirect a browser to another URL.
The first way is to use a 3xx HTTP status code, and the Location: header. This is the way the gods intended HTTP redirects to work, and is also known as "the one true way." This method will work on all browsers and crawlers.
And then there are the devil's ways. These include meta refresh, the Refresh: header, and JavaScript. Although these methods work in most browsers, they are definitely not guaranteed to work, and occasionally result in strange behavior (aka. breaking the back button).
Most web crawlers, including the Googlebot, ignore these redirection methods, and so should you. If you absolutely have to detect all redirects, then you would have to parse the HTML for META tags, look for Refresh: headers in the response, and evaluate Javascript. Good luck with the last one.
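If you do decide to detect META refresh redirects, here is a rough sketch using HtmlAgilityPack. It assumes the common content="5; url=..." format and is not exhaustive; the helper name is illustrative:
// Sketch: pull the target out of <meta http-equiv="refresh" content="5; url=...">
static string GetMetaRefreshUrl(HtmlAgilityPack.HtmlDocument doc)
{
    var meta = doc.DocumentNode.SelectSingleNode("//meta[@http-equiv='refresh' or @http-equiv='Refresh']");
    string content = meta == null ? null : meta.GetAttributeValue("content", null);
    if (content == null) return null;
    int idx = content.IndexOf("url=", StringComparison.OrdinalIgnoreCase);
    return idx >= 0 ? content.Substring(idx + 4).Trim() : null;
}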
Use this code to get the redirect URL:
public void GrtUrl(string url)
{
    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
    webRequest.AllowAutoRedirect = false;  // IMPORTANT
    webRequest.Timeout = 10000;            // timeout 10s
    webRequest.Method = "HEAD";
    // Get the response ...
    HttpWebResponse webResponse;
    using (webResponse = (HttpWebResponse)webRequest.GetResponse())
    {
        // Now look to see if it's a redirect
        if ((int)webResponse.StatusCode >= 300 &&
            (int)webResponse.StatusCode <= 399)
        {
            string uriString = webResponse.Headers["Location"];
            Console.WriteLine("Redirect to " + (uriString ?? "NULL"));
            webResponse.Close();  // don't forget to close it - or bad things happen
        }
    }
}
Here are two async HttpClient versions.
Works in .NET Framework and .NET Core:
public static async Task<Uri> GetRedirectedUrlAsync(Uri uri, CancellationToken cancellationToken = default)
{
    using var client = new HttpClient(new HttpClientHandler
    {
        AllowAutoRedirect = false,
    }, true);
    using var response = await client.GetAsync(uri, cancellationToken);
    return new Uri(response.Headers.GetValues("Location").First());
}
Works in .NET Core:
public static async Task<Uri> GetRedirectedUrlAsync(Uri uri, CancellationToken cancellationToken = default)
{
    using var client = new HttpClient();
    using var response = await client.GetAsync(uri, cancellationToken);
    return response.RequestMessage.RequestUri;
}
P.S. handler.MaxAutomaticRedirections = 1 can be used if you need to limit the number of attempts.
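For example, a small variation on the first version above (illustrative values):
// Sketch: follow at most one automatic redirect.
var handler = new HttpClientHandler
{
    AllowAutoRedirect = true,
    MaxAutomaticRedirections = 1,
};
using var client = new HttpClient(handler, true);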
After reviewing everyone's suggestions I figured this out, at least for my case, which involved three hops: first to https, then on to the actual final location.
This is a recursive function:
public static string GrtUrl(string url, int counter)
{
    ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls
                                         | SecurityProtocolType.Tls11
                                         | SecurityProtocolType.Tls12
                                         | SecurityProtocolType.Ssl3;
    string ReturnURL = url;
    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
    webRequest.AllowAutoRedirect = false;  // IMPORTANT
    webRequest.Timeout = 10000;            // timeout 10s
    webRequest.Method = "HEAD";
    // Get the response ...
    HttpWebResponse webResponse;
    using (webResponse = (HttpWebResponse)webRequest.GetResponse())
    {
        // Now look to see if it's a redirect
        if ((int)webResponse.StatusCode >= 300 && (int)webResponse.StatusCode <= 399)
        {
            string uriString = webResponse.Headers["Location"];
            ReturnURL = uriString;
            if (ReturnURL == url)
            {
                webResponse.Close();  // don't forget to close it - or bad things happen!
                return ReturnURL;
            }
            else
            {
                webResponse.Close();  // don't forget to close it - or bad things happen!
                if (counter > 50)
                    return ReturnURL;
                else
                    return GrtUrl(ReturnURL, counter + 1);  // counter++ would pass the old value
            }
        }
    }
    return ReturnURL;
}
You could check Request.UrlReferrer.AbsoluteUri to see where it came from. If that doesn't work, can you pass the old URL as a query string parameter?
This code works for me
var request = (HttpWebRequest)HttpWebRequest.Create(url);
request.Method = "POST";
request.AllowAutoRedirect = true;
request.ContentType = "application/x-www-form-urlencoded";
var response = request.GetResponse();
//After sending the request and the request is expected to redirect to some page of your website, The response.ResponseUri.AbsoluteUri contains that url including the query strings
//(www.yourwebsite.com/returnulr?r=""... and so on)
Redirect(response.ResponseUri.AbsoluteUri); //then just do your own redirect.
Hope this helps
I had the same problem, and after trying a lot I couldn't get what I wanted with HttpWebRequest, so I used the WebBrowser class to navigate to the first URL, and then I could get the redirected URL!
WebBrowser browser = new WebBrowser();
browser.Navigating += new System.Windows.Forms.WebBrowserNavigatingEventHandler(this.browser_Navigating);
string urlToNavigate = "your url";
browser.Navigate(new Uri(urlToNavigate));
Then, while navigating, you can get the redirected URL. Be careful: the first time the browser_Navigating event handler fires, e.Url is the same URL you used to start browsing, so you can get the redirected URL on the second call.
private void browser_Navigating(object sender, WebBrowserNavigatingEventArgs e)
{
    // Fires with the original URL first; the redirect target arrives on a later event
    Uri uri = e.Url;
}
This code worked for me with Unicode support:
public static string GetFinalRedirect(string url)
{
    try
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.AllowAutoRedirect = true;
        request.ContentType = "application/x-www-form-urlencoded";
        var response = request.GetResponse();
        return response.ResponseUri.AbsoluteUri.ToString();
    }
    catch (Exception)
    {
        return "";
    }
}
string url = ".......";
var request = (HttpWebRequest)WebRequest.Create(url);
var response = (HttpWebResponse)request.GetResponse();
string redirectUrl = response.ResponseUri.ToString();
A way to deal with a JavaScript redirect is to view the source code of the initial page that loads and then extract the new (final) domain directly from the source code. Since it is a JavaScript redirect, the final domain should be in there. Cheers.
Code to extract the URL address from page source:
string href = "";
string pageSrc = "get page source using web client download string method and place output here";
Match m = Regex.Match(pageSrc, @"href=\""(.*?)\""", RegexOptions.Singleline);
if (m.Success)
{
    href = m.Groups[1].Value; /* will result in http://finalurl.com */
}
I made this method using your code and it returns the final redirected URL.
public string GetFinalRedirectedUrl(string url)
{
    string result = string.Empty;
    Uri Uris = new Uri(url);
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(Uris);
    //req.Proxy = proxy;
    req.Method = "HEAD";
    req.AllowAutoRedirect = false;
    HttpWebResponse myResp = (HttpWebResponse)req.GetResponse();
    if (myResp.StatusCode == HttpStatusCode.Redirect)
    {
        string temp = myResp.GetResponseHeader("Location");
        // Recursive call
        result = GetFinalRedirectedUrl(temp);
    }
    else
    {
        result = url;
    }
    return result;
}
Note: myResp.ResponseUri does not return the final URL here; with AllowAutoRedirect set to false, the response describes only the first hop.
