Is there a way to detect 404 pages using HtmlAgilityPack?

I am parsing a forum where some threads are already deleted. So opening them still shows a page but with a message that says "Thread no longer exists". Is there a way to query this using the HtmlAgilityPack in a special way?
Or do I have to compare the InnerHtml or something along those lines?

A 404 is not actually being returned. If it were, you could just check the response status code.
Instead, you are getting a 200 response with an error message in the HTML, so you will have to parse the HTML (traverse the DOM, whatever you want to call it) and determine from the content whether the request failed.
There could potentially be several different error messages, so I would try to make your comparison generic by looking for the "notify administrator" link, or by checking for class="blockrow restore" if that class is only used on the error page.
Hope that helps.
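
A minimal sketch of the class-based check, assuming the HtmlAgilityPack NuGet package is referenced and that the error page really does carry an element with both the blockrow and restore classes. The sample markup and the IsDeletedThread name are hypothetical; the real forum page may be structured differently.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical sample of the error page; the real forum markup may differ.
string sample = "<html><body><div class=\"blockrow restore\">Thread no longer exists</div></body></html>";
Console.WriteLine(IsDeletedThread(sample) ? "deleted" : "ok");

// Returns true when the page contains an element carrying both the
// "blockrow" and "restore" classes (an assumption taken from the answer above).
static bool IsDeletedThread(string pageHtml)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(pageHtml);

    // Match whole class tokens so that e.g. "blockrowx" does not count.
    // SelectSingleNode returns null when nothing matches the XPath.
    var marker = doc.DocumentNode.SelectSingleNode(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' blockrow ')" +
        " and contains(concat(' ', normalize-space(@class), ' '), ' restore ')]");

    return marker != null;
}
```

Checking for a structural marker like this tends to be less brittle than comparing InnerHtml against one specific message, since the wording of the error can vary.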

Related

Clear-Site-Data header error in Chrome console

I'm trying to implement a cache clearing button for our website that will append the Clear-Site-Data header on a specific route so we can be relatively sure that the users are getting the latest javascript, css, etc. after a release. I'm assigning the header in my ActionMethod like so:
According to developer tools, I'm getting the header on the client:
So, Chrome is trying to do what I'm asking but it's throwing an error saying the types I'm passing it are unrecognized:
Am I missing something with how I'm creating the header? Is this a bug?

Ok, so here we are a few months down the road and I finally remembered to come back and post the working solution. What I didn't understand at the time I asked the question was that the quotes are expected to be treated as literal in the response header examples I found. So the code I posted in my question was missing a few \ characters in the strings. What ended up working was this:
The headers on the client now look like this (note the quotes around cache and storage):
And the cache and localStorage are cleared as desired. Hope this helps someone else as well!
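
A minimal sketch of the escaping described above, not the original poster's code. The BuildClearSiteData helper is hypothetical; the point is only that each directive is wrapped in literal double quotes, which in a C# string literal means writing \" around it.

```csharp
using System;
using System.Linq;

// Prints: Clear-Site-Data: "cache", "storage"
Console.WriteLine("Clear-Site-Data: " + BuildClearSiteData("cache", "storage"));

// Wraps each directive in literal double quotes and joins them with commas.
static string BuildClearSiteData(params string[] directives) =>
    string.Join(", ", directives.Select(d => "\"" + d + "\""));
```

In an action method the resulting value would then be attached to the response, for example with something like Response.Headers.Add("Clear-Site-Data", value), assuming a classic ASP.NET MVC controller.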

Kentico "This action is not allowed in current context"

I have a Kentico 6 installation. If I go to CMSDesk, edit one particular content item (a document, I suppose) and try to save it, I get the following error in a JavaScript alert: "This action is not allowed in current context". There is not much information about this error on the internet; what there is says that one of the parts of the page is broken. Is there any way to determine which one? I get this error for ONLY ONE item; all the others are fine. Any ideas are welcome, and I will provide any info needed.

You can check the Event log for any errors:
CMSSitemanager > Administration > Event log

Have you recently done any upgrades to the system? This typically happens after an upgrade and is specifically related to JS files. I've experienced it with upgrades between major versions (6->7, 7->8, etc.). You might try clearing the cache on the server and in your browser.
It can also be related to bad markup. You might check to see if you have a <form> tag or other invalid markup, correct it and see if this resolves your issue.

Inconsistent POSTing between Web Browser and HttpWebRequest

I’m working on Web Scraping using C# HttpWebRequest/HttpWebResponse. For the most part this process has gone smoothly. But after POSTing my way through several pages, I have gotten stuck with what seems to be an inconsistency between testing with the Web Browser and the HttpWebRequest/HttpWebResponse calls.
The problem occurs when I land on a page containing an input element that has a name similar to this: “RidiculouslyLongInputName.RidiculouslyLongInputName.RidiculouslyLongInputName.#RidiculouslyLong”
POSTing a value for this input element causes a 500 error when using HttpWebRequest but works fine when POSTing through the browser. If I remove this input value from the post data, the HttpWebRequest does not get the 500 error, but then I'm stuck with a data validation issue from the website.
Any idea on why HttpWebRequest is failing?

It's times like these when packet sniffers come in extremely useful for seeing exactly what data is flowing through and what the difference is.
Wireshark (http://www.wireshark.org/) is a great tool for things like this.
Filter down to only the domains you're interested in, then send the request with HttpWebRequest and save the packet data somewhere. Repeat, but make the request through the browser, and compare the two captures.
If it is indeed an issue with POST variables, it should be evident in the HTTP payload.
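
One difference worth looking for when comparing the two payloads is how the field name is encoded, since characters such as # are percent-encoded in a form body. This is only a guess at what the capture might reveal, not a confirmed cause. The sketch below shows one way to build a form-encoded body with every name and value escaped; the field name is a hypothetical stand-in for the one in the question and BuildFormBody is an illustrative helper.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var fields = new Dictionary<string, string>
{
    // Hypothetical field name modeled on the one in the question.
    ["Long.Input.Name.#Long"] = "some value"
};

// Prints: Long.Input.Name.%23Long=some%20value
Console.WriteLine(BuildFormBody(fields));

// Percent-encodes every name and value, then joins the pairs with '&'.
static string BuildFormBody(IEnumerable<KeyValuePair<string, string>> pairs) =>
    string.Join("&", pairs.Select(p =>
        Uri.EscapeDataString(p.Key) + "=" + Uri.EscapeDataString(p.Value)));
```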

Not sure why you are running into the problem, but I would recommend grabbing a copy of Fiddler and taking a look at what the browser is sending in the POST request. It is possible there is something less than obvious going on.

You can also use the Firebug extension with Firefox. With this extension installed and enabled, go through the entire scenario in Firefox. Firebug will show you the exact requests and responses sent and received by the browser. You can then duplicate them as closely as possible using HttpWebRequest.

First, thanks for the MEF response. That case was a personal mistake, so I deleted the question.
I think the best tool for your case is Fiddler, but I suspect there is some JavaScript attached to that button, or something similar, that you are failing to mimic. WebRequest cannot run it for you, whereas WebBrowser can, since it works on the DOM.
To use WebRequest correctly, you really need to reverse engineer every request with something like Fiddler. It's very hard to work out exactly what is going on just by looking at the page's source (and its referenced JavaScript, CSS, and so on).

How to get an image file extension from the web when it has been stripped?

The link below is an image URL where the extension has been stripped. I assume this is being done with content negotiation tools. I know that it's a GIF, having viewed the HTML metadata with Firebug. What I would like to know is a simple way, in C# on .NET, to get the file type of this URL.
http://ep.yimg.com/ca/I/yhst-20493720720238_2066_63220718
With most image URLs it's easy. One can use string functions to find the file type in the URL.
Ex. /imageEx.png

You're going to have to make an HTTP HEAD request, and then check the Content-Type on the response. I can't recall whether System.Net.HttpWebRequest supports HEAD requests, but that would be the place to start.
Alternatively, you could perform a full GET request, but that could have performance implications if all you need to know is Content-Type.
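
A minimal sketch of the HEAD approach. HttpWebRequest does accept HEAD through its Method property, although newer code would more likely use HttpClient. The GetContentType and ExtensionFor helpers are hypothetical names, and the mapping covers only a few common image types. The demo line does not touch the network.

```csharp
using System;
using System.Net;

Console.WriteLine(ExtensionFor("image/gif")); // prints ".gif"

// Sends a HEAD request and returns the Content-Type header value.
static string GetContentType(string url)
{
    var request = (HttpWebRequest)WebRequest.Create(url);
    request.Method = "HEAD"; // ask for headers only; no body is downloaded
    using (var response = (HttpWebResponse)request.GetResponse())
    {
        return response.ContentType;
    }
}

// Maps a few common image media types to extensions; returns null if unknown.
static string ExtensionFor(string contentType)
{
    // Drop parameters such as "; charset=..." before comparing.
    string mediaType = (contentType ?? "").Split(';')[0].Trim().ToLowerInvariant();
    switch (mediaType)
    {
        case "image/gif": return ".gif";
        case "image/png": return ".png";
        case "image/jpeg": return ".jpg";
        default: return null;
    }
}
```

A call such as ExtensionFor(GetContentType(url)) would combine the two steps. Some servers answer HEAD differently from GET, in which case the full GET mentioned above is the fallback.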

You would have to read in the image and look for 'magic numbers' which can tell you what the file type really is. Here is an incomplete example of what I am talking about:
http://www.garykessler.net/library/file_sigs.html
EDIT: OK, you don't have to do it this way in this context. I am not a web guy, so this is how I would have approached it :-)
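
For completeness, the magic-number approach can be sketched as below, assuming the first few bytes of the file have already been downloaded. Only the well-known GIF, PNG, and JPEG signatures are checked, and the SniffImageType and StartsWith helper names are hypothetical.

```csharp
using System;

byte[] gifHeader = { 0x47, 0x49, 0x46, 0x38, 0x39, 0x61 }; // ASCII "GIF89a"
Console.WriteLine(SniffImageType(gifHeader)); // prints "gif"

// Identifies a few image formats from their leading signature bytes.
static string SniffImageType(byte[] data)
{
    if (StartsWith(data, 0x47, 0x49, 0x46, 0x38)) return "gif";  // "GIF8"
    if (StartsWith(data, 0x89, 0x50, 0x4E, 0x47)) return "png";  // 0x89 "PNG"
    if (StartsWith(data, 0xFF, 0xD8, 0xFF)) return "jpeg";       // JPEG start-of-image marker
    return "unknown";
}

// True when data begins with exactly the given signature bytes.
static bool StartsWith(byte[] data, params byte[] signature)
{
    if (data == null || data.Length < signature.Length) return false;
    for (int i = 0; i < signature.Length; i++)
    {
        if (data[i] != signature[i]) return false;
    }
    return true;
}
```

Unlike the Content-Type header, this inspects the content itself, so it still works when the server reports a generic or incorrect type.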

See Content-Type. You might also want to read up on content type spoofing.

Is it possible to modify Request.UrlReferrer using a URL argument?

I'm trying to get around an issue for a customer who has a site built by a previous developer. The following line of code is causing the issue:
args.AddParam("REFERER","",Request.UrlReferrer.ToString());
As a result, if you navigate directly to this page by URL, it throws a null reference exception. I know that the proper fix is for the code to first check whether UrlReferrer is null, but I am trying to find a way around this problem without having to change any source. Any help would be appreciated.

No, there is no URL (query string) argument that can set the Referer HTTP header. The header is sent by the browser, so the only way to populate it is to link to the page from another one and only navigate to it that way.
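
If a source change does become possible, the null check mentioned in the question could look roughly like the sketch below. ReferrerOrEmpty is a hypothetical helper name, and falling back to an empty string is an assumption about what the downstream code can accept.

```csharp
using System;

Console.WriteLine(ReferrerOrEmpty(new Uri("http://example.com/page"))); // prints "http://example.com/page"
Console.WriteLine(ReferrerOrEmpty(null).Length);                        // prints 0

// Returns the referrer as a string, or an empty string when the request
// carried no Referer header (for example, on direct navigation).
static string ReferrerOrEmpty(Uri referrer) =>
    referrer == null ? string.Empty : referrer.ToString();

// Possible use in the page from the question:
//   args.AddParam("REFERER", "", ReferrerOrEmpty(Request.UrlReferrer));
```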
