C# WebClient, Only Support HTML 3.2

I need to call a web page, but have my WebClient act as if it doesn't support HTML 4.0, only HTML 3.2.
Is it possible to do this? Perhaps with a different user-agent or some header I'm unaware of?
Thanks.
This is related to this problem:
SSRS 2008, Force HTML3.2

The WebClient Class implements HTTP. It contains nothing related to HTML.
If the website you're retrieving serves different content depending on the HTTP "User-Agent" header, you can set this header as follows:
WebClient client = new WebClient();
client.Headers.Add("user-agent",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
Which value you need to specify depends, of course, on the website.

WebClient has no notion of what kind of HTML it is downloading. If the site you're accessing is doing some sort of user-agent sniffing, use HttpWebRequest and set the UserAgent property to some really old browser.
You can set the User-Agent header using WebClient as well, but you have to set the header directly as there's no associated property.
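A rough sketch of the HttpWebRequest route mentioned above might look like this (the URL and the old-browser user-agent string are placeholders, not values from the question):
using System.IO;
using System.Net;

// Hypothetical report URL; substitute the page you actually need to call.
var request = (HttpWebRequest)WebRequest.Create("http://example.com/report");
// Identify as a very old browser so user-agent sniffing falls back to HTML 3.2-era output.
request.UserAgent = "Mozilla/2.0 (compatible; MSIE 3.02; Windows NT)";

using (var response = (HttpWebResponse)request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    string html = reader.ReadToEnd();
}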

Related

Why does the HTML downloaded by WebClient differ from Chrome's "View Source" page?

I'm using the following code to download the given web page.
using (WebClient client = new WebClient())
{
    client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
    using (Stream data = client.OpenRead("https://www.yellowpages.com/south-jordan-ut/plumbers?page=5"))
    using (StreamReader reader = new StreamReader(data))
    {
        string html = reader.ReadToEnd();
    }
}
It works but html appears to contain only a small portion of the final page. I understand that this site has dynamic content, but when I view the source code in Chrome (by selecting the View page source command), it appears to download everything.
Unless Chrome actually runs scripts when you run View page source, it seems to be using some other trick to get all the HTML. Does anyone know what that trick might be?
So if you read the HTML that the WebClient is returning, you can see some text:
"We can provide you with the best experience on Yellowpages.com, if you upgrade to the latest version of your browser"
If you change your user-agent to something that Chrome would send, you get the results as expected:
client.Headers.Add("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36");
There's probably an AJAX call or something similar that loads the page data. It's a modern software paradigm; previously the page would already contain all the data. What everyone else is saying is that if there IS JavaScript loading the content, then WebClient will not be able to retrieve that content. This is why you see it in your browser and not with WebClient.
The solution is to use another tool like Selenium to download the page into a rendering engine and then scrape what you need.
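As a minimal sketch of that approach (assuming the Selenium.WebDriver and Selenium.WebDriver.ChromeDriver NuGet packages; the headless flag is optional):
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

var options = new ChromeOptions();
options.AddArgument("--headless"); // render the page without opening a visible browser window

using (IWebDriver driver = new ChromeDriver(options))
{
    driver.Navigate().GoToUrl("https://www.yellowpages.com/south-jordan-ut/plumbers?page=5");

    // PageSource is the DOM after Chrome has executed the page's scripts,
    // unlike the raw HTML that WebClient downloads.
    string renderedHtml = driver.PageSource;
}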

C# - User Agent for Web in WP8.1 Apps

One question that's been confusing me and could really do with some insight.
I need to retrieve JSON objects from an HTTP service. When I tested this in a console application, I kept receiving an "Internal Server Error: 500" until I set the user-agent header on the WebClient object.
Example:
WebClient client = new WebClient();
client.Headers.Add("user-agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.94 Safari/537.36");
content = client.DownloadString(url);
Now, if I need to do the same in a WP8.1 app, how would I determine the user agent (if I even need to set one), set it, and retrieve the data?
Thank you all.
A Windows Phone 8.1 app will use HttpClient. By default, no user agent is set. The default user agent for the phone's web browser is:
"Mozilla/5.0 (Mobile; Windows Phone 8.1; Android 4.0; ARM; Trident/7.0; Touch; rv:11.0; IEMobile/11.0; NOKIA; Lumia 520) like iPhone OS 7_0_3 Mac OS X AppleWebKit/537 (KHTML, like Gecko) Mobile Safari/537"
You can manually set the user-agent on the HttpRequestMessage.Headers.UserAgent property.
References:
HttpClient
https://msdn.microsoft.com/en-us/library/windows/apps/xaml/windows.web.http.headers.httprequestheadercollection.aspx
User-Agent
https://msdn.microsoft.com/en-us/library/ie/hh869301(v=vs.85).aspx#ie11
The class libraries for HTTP do not add any user agent by default. See these lines from the MSDN page:
By default, no user-agent header is sent with the HTTP request to the web service by the HttpClient object. Some HTTP servers, including some Microsoft web servers, require that a user-agent header be included with the HTTP request sent from the client. The user-agent header is used by the HTTP server to determine how to format some HTTP pages so they render better on the client for different web browsers and form factors (mobile phones, for example). Some HTTP servers return an error if no user-agent header is present on the client request. We need to add a user-agent header to avoid these errors using classes in the Windows.Web.Http.Headers namespace. We add this header to the HttpClient.DefaultRequestHeaders property.
For more details, refer the link below:
How to connect to an HTTP server using Windows.Web.Http.HttpClient (XAML)
Also look at the answer below (by Bret Bentzinger) for the exact user agent string.
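As a rough sketch (assuming Windows.Web.Http on WP8.1, inside an async method; the user-agent string and URL are placeholders), adding the header to DefaultRequestHeaders might look like this:
using System;
using Windows.Web.Http;

var httpClient = new HttpClient();

// Applies to every request made with this HttpClient instance;
// TryParseAdd returns false if the string cannot be parsed as product tokens.
httpClient.DefaultRequestHeaders.UserAgent.TryParseAdd("Mozilla/5.0 (compatible; MSIE 10.0; Windows Phone 8.1)");

// A per-request alternative is HttpRequestMessage.Headers.UserAgent, as mentioned above.
string json = await httpClient.GetStringAsync(new Uri("http://example.com/api/items"));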

getting HTML source of the web page using c# for different browsers

I want to get the HTML source of a web page using C#, as if it were visited by different browsers like IE9, Chrome, or Firefox. Is there a way to do that?
You can get the HTML source in a number of ways. My preferred method is HTML Agility Pack
// HtmlWeb downloads from a URL; HtmlDocument.Load only reads files and streams.
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://domain.com/resource/page.html");
doc.Save("file.htm");
The WebClient in .NET works well too.
WebClient myWebClient = new WebClient();
myWebClient.Headers.Add ("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)"); // If you need to simulate a specific browser
byte[] myDataBuffer = myWebClient.DownloadData (remoteUri);
string download = Encoding.ASCII.GetString(myDataBuffer);
// This is verbatim from MSDN... unfortunately their example does not dispose
// of myWebClient (it implements IDisposable). You should wrap use of a WebClient
// in a using statement.
http://msdn.microsoft.com/en-us/library/xz398a3f.aspx
The HTML you get is what you get. A given browser decides what to make of it (unless, that is, the server renders different HTML for different user agents).
If you do need to explicitly set the user agent (to simulate different browsers), the following post shows how to do that:
http://blog.abodit.com/2010/03/a-simple-web-crawler-in-c-using-htmlagilitypack/
(this link also implements a simple web crawler using HTML Agility Pack)
I'm no C# expert, but assuming the HTML will be the same regardless of which "browser" visits the URL, you can use System.Net.WebClient (if you only need simple control) or HttpWebRequest (if you need more advanced control).
For WebClient, just create an instance and call one of its Download* methods:
var cli = new WebClient();
string data = cli.DownloadString("http://www.stackoverflow.com");
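For the HttpWebRequest route, a minimal sketch with a bit more control might look like the following (the URL, user-agent string, and timeout are only examples):
using System.IO;
using System.Net;

var request = (HttpWebRequest)WebRequest.Create("http://www.stackoverflow.com");
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; rv:9.0) Gecko/20100101 Firefox/9.0"; // pretend to be Firefox
request.Accept = "text/html,application/xhtml+xml";
request.Timeout = 10000; // milliseconds
request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;

using (var response = (HttpWebResponse)request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    string html = reader.ReadToEnd();
}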

Return Javascript managed cookies programmatically using C#

I'm trying to programmatically ping a website (through a console application) and return details of all the cookies being used by that site.
The following approach I'm using only captures those cookies managed through the header request and misses the ones set using Javascript:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.CookieContainer = new CookieContainer();
request.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)";
request.Method = "GET";

CookieCollection cookies = new CookieCollection();
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
foreach (Cookie c in response.Cookies)
{
    cookies.Add(c);
}
Can someone possibly provide suggestions to how this can be extended to include javascript configured cookies?
Thanks!
Well, not sure how much help this will be but...
You are asking if there is an easy way to get cookies that are created dynamically by client-side javascript and the answer is no, there isn't (unless I'm missing something).
Is there a harder way, maybe, like wrapping the .NET browser control, letting the javascript execute through automated web scripts and then scraping the DOM... Doesn't sound like a good idea to me though.
Any other thoughts welcome.
Just managed to achieve something close to what I wanted (still testing the solution, and it seems to be missing some third-party tracking cookies!). What I did was use Selenium with the Chrome Driver executable. It opens up an instance of Chrome, navigates to the URL, and pulls back all the dynamically generated information, which can then be interrogated using C#. :)
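A minimal sketch of reading those cookies through Selenium's ChromeDriver (same NuGet packages as a typical Selenium setup; the URL is a placeholder):
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

using (IWebDriver driver = new ChromeDriver())
{
    driver.Navigate().GoToUrl("http://example.com");

    // Cookies set by JavaScript are visible here because the page's scripts
    // have actually been executed in the browser.
    foreach (Cookie cookie in driver.Manage().Cookies.AllCookies)
    {
        Console.WriteLine("{0} = {1}", cookie.Name, cookie.Value);
    }
}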

Loading an xml from an http url

I am using XML in a .NET web application.
When I try to load XML from an HTTP internet URL using:
xmlDoc.Load("http://....")
I get the error: "connected host has failed to respond".
Anyone knows the fix for this?
Thanks
"Connected host has failed to respond" means either you haven't got the URI right, you're not allowed to access it, it's not responding to you, or it's down. HTTP doesn't really care what it transmits.
It probably means exactly what it says: the web server responsible for requests at the URL you specify isn't sending back responses. Something's going wrong on the web server, and if so, you can't do anything about someone's web server out there in the cloud not functioning properly.
You can, however, accept the fact that not every URL will work, and that you'll have to catch the Exception that the XmlDocument or XDocument throws. It's reasonable to expect that this scenario may occur. Thus, you need to program defensively and include the appropriate exception handling for such cases.
EDIT: So you can access it from outside the .NET framework eh? Perhaps try using an HTTP debugger, like Fiddler, and compare the request your XML document object makes to the request your browser makes. What header fields are different? Is there a header that the browser includes that the XML document object doesn't? Or are there different header values between the two, that may be causing the .NET request not to be responded to? Go figure.
If the page is accessible through a web browser but not through the Load method, it sounds like the method isn't making a proper HTTP request to the web server for the wanted page.
You can try using an HTTPWebRequest with a standard GET method to make a proper HTTP request for the webpage. You can then pass the response to the XMLDocument.Load method as a stream and it should then load up fine.
HTTPWebRequest Class MSDN.com
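A rough sketch of that suggestion (the URL is only a placeholder):
using System.Net;
using System.Xml;

var request = (HttpWebRequest)WebRequest.Create("http://www.xmlserver.com/file.xml");
request.Method = "GET";
request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)";

var xmlDoc = new XmlDocument();
using (var response = (HttpWebResponse)request.GetResponse())
using (var stream = response.GetResponseStream())
{
    // Feed the response stream straight into XmlDocument.Load.
    xmlDoc.Load(stream);
}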
Try making a WebRequest to the URL and set its UserAgent property to something like "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)". If it works, load the text you get into the XmlDocument.
I tried loading the XML using .NET HttpWebRequest and also tried setting the UserAgent property.
But it's still giving me the error message:
"Unable to connect to the remote server"
The XML is, however, accessible through the browser.
Here is the code:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(URL);
request.UserAgent ="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)";
string result = string.Empty;
using (HttpWebResponse response = request.GetResponse() as HttpWebResponse)
{
    // Get the response stream
    StreamReader reader = new StreamReader(response.GetResponseStream());
    // Read the whole contents and return as a string
    result = reader.ReadToEnd();
}
Thanks.
Is there any proxy being used by your browser?
Just try telnet to see if you are able to connect to the web server from an application other than the browser.
So if you are using a URL like http://www.xmlserver.com/file.xml then try the following at a command prompt:
telnet xmlserver.com 80
A big difference between your request and a browser request could be bridged with the following line:
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
