Getting HTML source from URL, inline CSS problem - C#

I have a strange problem:
I am getting the HTML source from a URL using this:
string html;
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(Url);
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
    {
        html = reader.ReadToEnd();
        reader.Close();
    }
    response.Close();
}
The page that I am requesting has inline CSS like this:
<span class="VL" style="display:inline-block;height:20px;width:0px;"></span>
But the html variable contains only:
<span class="VL" style="display:inline-block;"></span>
Does anyone know why? I have tested with several encodings, and with WebRequest and WebClient as well, but that doesn't work either.

You might need to send a User-Agent header so that the site doesn't think you are a bot. Some sites don't bother with CSS when the request comes from a bot. Also, reading the remote HTML can be simplified with a WebClient:
using (var client = new WebClient())
{
    client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4";
    string html = client.DownloadString(url);
}

Are you viewing the source through a browser development tool, by clicking Inspect Element? Is it possible you are viewing the source in a browser that adds the height and width properties on the client side through JavaScript, so you are seeing the modified style?
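One way to rule that out is to check the raw markup you downloaded directly, rather than through the inspector. A small sketch using the html string from the question's code:
// Does the raw markup actually contain the width/height declarations?
Console.WriteLine("Inline width present: " + html.Contains("width:0px"));

// Save the markup so it can be diffed against the browser's "View Source"
// output (the element inspector shows the live DOM, not the raw HTML).
System.IO.File.WriteAllText("downloaded.html", html, Encoding.UTF8);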

Related

Error when trying to parse HTML

I am trying to parse the site https://www.crunchbase.com, but this site has anti-bot protection and I don't know how to get any HTML element from the page.
First I set up the SSL/TLS security protocols:
ServicePointManager.Expect100Continue = true;
ServicePointManager.SecurityProtocol = SecurityProtocolType.Ssl3 | SecurityProtocolType.Tls | SecurityProtocolType.Tls11 | SecurityProtocolType.Tls12;
Then I made an HttpWebRequest with my browser's user agent string:
var request = (HttpWebRequest)WebRequest.Create("https://www.crunchbase.com");
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0";
request.Timeout = 10000;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Console.WriteLine("Server status code: " + response.StatusCode);
And used a StreamReader to read the page:
using (StreamReader sr = new StreamReader(response.GetResponseStream()))
{
    string result = sr.ReadToEnd();
    Console.WriteLine(result);
}
But the result is:
[screenshot of the console output: an almost empty HTML page produced by the anti-bot protection]
And finally I tried to get all URLs from the page:
HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load(response.ResponseUri.AbsoluteUri);
string respUri = response.ResponseUri.ToString();
HtmlNode[] nodes = document.DocumentNode.SelectNodes("//a").ToArray();
foreach (var item in nodes)
{
    Console.WriteLine(item.InnerHtml);
}
But the application throws an unhandled exception.
I assume the upper part in your console window is the output from Console.WriteLine(result), and this shows pretty much the antibot protection. Whatever you see when browsing this site, it's not in this HTML which has an almost empty body (when this is rendered, it gives... nothing). The actual content of the web page is probably loaded dynamically by one of the Javascript code pieces referred by the HTML content. On the other hand, the HtmlWeb parser (from HTML Agility Pack, I presume) does not execute this Javascript code, and thus does not reach the actual content that includes the elements you are looking for. In other words, the protection works...
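As a side note on the unhandled exception: HTML Agility Pack's SelectNodes returns null, not an empty collection, when the XPath matches nothing, so the .ToArray() call blows up on a page like this that contains no <a> elements. A defensive sketch, assuming the same document variable as above:
// SelectNodes returns null when nothing matches, so guard before enumerating.
HtmlNodeCollection anchors = document.DocumentNode.SelectNodes("//a");
if (anchors == null)
{
    Console.WriteLine("No <a> elements found - probably the anti-bot placeholder page.");
}
else
{
    foreach (HtmlNode item in anchors)
    {
        Console.WriteLine(item.InnerHtml);
    }
}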

C# WebClient DownloadString returns gibberish

I am attempting to view the source of http://simpledesktops.com/browse/desktops/2012/may/17/where-the-wild-things-are/ using the code:
String URL = "http://simpledesktops.com/browse/desktops/2012/may/17/where-the-wild-things-are/";
WebClient webClient = new WebClient();
webClient.Headers.Add("user-agent", "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
webClient.Encoding = Encoding.GetEncoding("Windows-1255");
string download = webClient.DownloadString(URL);
webClient.Dispose();
Console.WriteLine(download);
When I run this, the console returns a bunch of nonsense that looks like it's been decoded incorrectly.
I've also attempted adding headers, to no avail:
webClient.Headers.Add("user-agent", "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
webClient.Headers.Add("Accept-Encoding", "gzip,deflate");
Other websites all returned the proper html source. I can also view the page's source through Chrome. What's going on here?
The response from that URL is gzipped; you should either decompress it or set an empty Accept-Encoding header. You don't need the user-agent field.
String URL = "http://simpledesktops.com/browse/desktops/2012/may/17/where-the-wild-things-are/";
WebClient webClient = new WebClient();
webClient.Headers.Add("Accept-Encoding", "");
string download = webClient.DownloadString(URL);
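If you would rather keep requesting a gzipped response, a rough sketch of the decompress-it-yourself route using HttpWebRequest with automatic decompression (setting AutomaticDecompression makes the framework send the Accept-Encoding header and unzip the body for you):
// Alternative: let HttpWebRequest handle the gzip/deflate decompression.
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(URL);
req.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
using (StreamReader reader = new StreamReader(resp.GetResponseStream()))
{
    string page = reader.ReadToEnd();
}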
I've had the same thing bug me today, using a WebClient object to check whether a URL is returning something. But my experience was different: I tried removing the Accept-Encoding, basically using the code @Antonio Bakula gave in his answer, but I kept getting the same error every time (an InvalidOperationException).
So this did not work:
WebClient wc = new WebClient();
wc.Headers.Add("Accept-Encoding", "");
string result = wc.DownloadString(url);
But adding 'any' text as a User Agent instead did do the trick. This worked fine:
WebClient wc = new WebClient();
wc.Headers.Add(HttpRequestHeader.UserAgent, "My User Agent String");
System.IO.Stream stream = wc.OpenRead(url);
Your mileage may vary, obviously. Also of note: I'm using ASP.NET 4.0.30319.
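For completeness, a small sketch of actually reading the stream opened above (same wc and url as in the snippet):
using (System.IO.Stream stream = wc.OpenRead(url))
using (var reader = new System.IO.StreamReader(stream))
{
    // The page body, once the request has gone through with the custom User-Agent.
    string result = reader.ReadToEnd();
}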

Grabbing HTML from URL doesn't work - any tips?

I have tried several methods in C# using WebClient and WebResponse, and they all return
<html><head><meta http-equiv=\"REFRESH\" content=\"0; URL=http://www.windowsphone.com/en-US/games?list=xbox\"><script type=\"text/javascript\">function OnBack(){}</script></head></html>
instead of the actual rendered page you see when you use a browser to go to http://www.windowsphone.com/en-US/games?list=xbox
How would you go about grabbing the HTML from that location?
http://www.windowsphone.com/en-US/games?list=xbox
Thanks!
/edit: examples added:
Tried:
string inputUrl = "http://www.windowsphone.com/en-US/games?list=xbox";
string resultHTML = String.Empty;
Uri inputUri = new Uri(inputUrl);
WebRequest request = WebRequest.CreateDefault(inputUri);
request.Method = "GET";
WebResponse response;
try
{
    response = request.GetResponse();
    using (StreamReader reader = new StreamReader(response.GetResponseStream()))
    {
        resultHTML = reader.ReadToEnd();
    }
}
catch { }
Tried:
string inputUrl = "http://www.windowsphone.com/en-US/games?list=xbox";
string resultHTML = String.Empty;
WebClient webClient = new WebClient();
try
{
    resultHTML = webClient.DownloadString(inputUrl);
}
catch { }
Tried:
string inputUrl = "http://www.windowsphone.com/en-US/games?list=xbox";
string resultHTML = String.Empty;
WebResponse objResponse;
WebRequest objRequest = HttpWebRequest.Create(inputUrl);
try
{
    objResponse = objRequest.GetResponse();
    using (StreamReader sr = new StreamReader(objResponse.GetResponseStream()))
    {
        resultHTML = sr.ReadToEnd();
        sr.Close();
    }
}
catch { }
I checked this URL, and you need to handle the cookies.
When you try to access the page for the first time, you are redirected to an https URL on login.live.com and then redirected back to the original URL. The https page sets a cookie called MSPRequ for the domain login.live.com. If you do not have this cookie, you cannot access the site.
I tried disabling cookies in my browser and it ends up looping infinitely back to the URL https://login.live.com/login.srf?wa=wsignin1.0&rpsnv=11&checkda=1&ct=1328303901&rver=6.1.6195.0&wp=MBI&wreply=http:%2F%2Fwww.windowsphone.com%2Fen-US%2Fgames%3Flist%3Dxbox&lc=1033&id=268289. It's been going on for several minutes now and doesn't appear it will ever stop.
So you will have to grab the cookie from the https page when it is set, and persist that cookie for your subsequent requests.
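A rough sketch of the cookie-persistence part with HttpWebRequest and a shared CookieContainer. Note this alone may not be enough: HttpWebRequest follows HTTP-level redirects, but not the meta refresh shown in the question, so you may still have to request the refresh URL yourself.
// Shared cookie jar so the MSPRequ cookie set during the login.live.com
// round trip is remembered and sent on later requests to that domain.
CookieContainer cookies = new CookieContainer();

HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.windowsphone.com/en-US/games?list=xbox");
request.CookieContainer = cookies;
request.AllowAutoRedirect = true; // follow HTTP redirects automatically

string resultHTML;
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
    resultHTML = reader.ReadToEnd();
}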
This might be because the server you are requesting HTML from returns different HTML depending on the User Agent string. You might try something like this
webClient.Headers.Add ("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
That particular header may not work, but you could try others that would mimic standard browsers.

Can't access site using WebClient method?

I am making a desktop yellow pages application. I can access every other country's yellow pages site, but not the Australian one, and I don't know why.
Here is the code:
class Program
{
    static void Main(string[] args)
    {
        WebClient wb = new WebClient();
        wb.Headers.Add("user-agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US)");
        string html = wb.DownloadString("http://www.yellowpages.com.au");
        Console.WriteLine(html);
    }
}
For all other sites I get the HTML of the website; for the Australian site I get null. I even tried HttpWebRequest as well.
Here is the Australian yellow pages site: http://www.yellowpages.com.au
Thanks in advance
It looks like that website will only send gzipped data. Try switching to HttpWebRequest and using automatic decompression:
var request = (HttpWebRequest)WebRequest.Create("http://www.yellowpages.com.au");
request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705;)";
request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");
request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
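The snippet above only configures the request; a minimal sketch of reading the decompressed body from it with standard HttpWebResponse/StreamReader usage:
// With AutomaticDecompression set, GetResponseStream() already yields plain HTML.
using (var response = (HttpWebResponse)request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    string html = reader.ReadToEnd();
    Console.WriteLine(html);
}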
In addition to @bkaid's correct (and upvoted) answer, you can use your own class inherited from WebClient to decompress/handle gzip-compressed HTML:
public class GZipWebClient : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        HttpWebRequest request = (HttpWebRequest)base.GetWebRequest(address);
        request.AutomaticDecompression = DecompressionMethods.GZip |
                                         DecompressionMethods.Deflate;
        return request;
    }
}
Having done this, the following works just fine:
WebClient wb = new GZipWebClient();
string html = wb.DownloadString("http://www.yellowpages.com.au");
When I view the transfer from that website in Wireshark, it says it's a malformed HTTP packet. It says it uses chunked transfer, then says the following chunk has 0 bytes and then sends the code of the website. That's why WebClient returns an empty string (not null). And I think it's correct behavior.
It seems browsers ignore this error and so they can display the page properly.
EDIT:
As bkaid pointed out, the server does seem to send a correct gzipped response. The following code works for me:
WebClient wb = new WebClient();
wb.Headers.Add("Accept-Encoding", "gzip");
string html;
using (var webStream = wb.OpenRead("http://www.yellowpages.com.au"))
using (var gzipStream = new GZipStream(webStream, CompressionMode.Decompress))
using (var streamReader = new StreamReader(gzipStream))
    html = streamReader.ReadToEnd();

Screen Scrape a page of a web app - Internal Server Error

I am trying to screen scrape a page of a web app that just contains text and is hosted by a 3rd party. It's not a properly formed HTML page, but the text that is displayed will tell us whether the web app is up or down.
When I try to scrape the screen, it returns an error when it makes the WebRequest. The error is "The remote server returned an error: (500) Internal Server Error."
public void ScrapeScreen()
{
    try
    {
        var url = textBox1.Text;
        var request = WebRequest.Create(url);
        var response = request.GetResponse();
        var stream = response.GetResponseStream();
        var reader = new StreamReader(stream);
        var result = reader.ReadToEnd();
        stream.Dispose();
        reader.Dispose();
        richTextBox1.Text = result;
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}
Any ideas how I can get the text from the page?
Some sites don't like the default UserAgent. Consider changing it to something real, like:
((HttpWebRequest)request).UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4";
First, try this:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
However, if you're just looking for text and not having to POST any data to the server, you may want to look at the WebClient class. It more closely resembles a real browser and takes care of a lot of HTTP header details that you may end up having to tweak if you stick with the HttpWebRequest class.
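A minimal WebClient sketch of the same scrape, reusing the textBox1/richTextBox1 controls from the question and a browser-like User-Agent as suggested above:
// WebClient variant: browser-like User-Agent, result shown in the rich text box.
using (var client = new WebClient())
{
    client.Headers[HttpRequestHeader.UserAgent] =
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4";
    richTextBox1.Text = client.DownloadString(textBox1.Text);
}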
