Error when trying to parse HTML - c#

I trying to parse site"https://www.crunchbase.com". But this site has an "Antibot protection". And i don't know how to get any html element from the page.
First i made a "ssl" security channel.
ServicePointManager.Expect100Continue = true;
ServicePointManager.SecurityProtocol = SecurityProtocolType.Ssl3 | SecurityProtocolType.Tls | SecurityProtocolType.Tls11 | SecurityProtocolType.Tls12;
Then I made an HttpWebRequest with my browser's user agent string:
var request = (HttpWebRequest)WebRequest.Create("https://www.crunchbase.com");
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0";
request.Timeout = 10000;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Console.WriteLine("Server status code: " + response.StatusCode);
And used a StreamReader to read the page:
using (StreamReader sr = new StreamReader(response.GetResponseStream()))
{
string result = sr.ReadToEnd();
Console.WriteLine(result);
}
But the result is:
[screenshot of the console output: the returned HTML consists mostly of script references and an almost empty body]
And finally I tried to get all URLs from the page:
HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load(response.ResponseUri.AbsoluteUri);
string respUri = response.ResponseUri.ToString();
HtmlNode[] nodes = document.DocumentNode.SelectNodes("//a").ToArray();
foreach (var item in nodes)
{
Console.WriteLine(item.InnerHtml);
}
But the application throws an unhandled exception.

I assume the upper part of your console window is the output from Console.WriteLine(result), and it shows pretty much the anti-bot protection. Whatever you see when browsing this site, it's not in this HTML, which has an almost empty body (when it is rendered, it gives... nothing). The actual content of the web page is probably loaded dynamically by one of the JavaScript code pieces referenced by the HTML. On the other hand, the HtmlWeb parser (from HTML Agility Pack, I presume) does not execute this JavaScript code, and thus never reaches the actual content that includes the elements you are looking for. In other words, the protection works...
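As for the unhandled exception: in HTML Agility Pack, SelectNodes returns null (not an empty collection) when the XPath matches nothing, so calling ToArray() on the result throws a NullReferenceException - which is exactly what happens on a nearly empty page. A minimal guard, reusing the response object from the question, might look like this:
HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load(response.ResponseUri.AbsoluteUri);
// SelectNodes returns null when "//a" matches nothing, so check before iterating.
HtmlNodeCollection anchors = document.DocumentNode.SelectNodes("//a");
if (anchors == null)
{
    Console.WriteLine("No links found - the served HTML body is effectively empty.");
}
else
{
    foreach (HtmlNode item in anchors)
    {
        Console.WriteLine(item.InnerHtml);
    }
}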

Related

WebRequest loading HTMLDocument coming back with all special characters for SSL site

This is a pretty standard implementation of HttpWebRequest, but whenever I pass a certain URL to get the HTML, it comes back with nothing but special characters. An example of what comes back is below.
Now, this site is SSL, so I'm wondering if that has something to do with it, but I've never had this problem before and I've used this with other SSL sites.
�
ServicePointManager.ServerCertificateValidationCallback = new System.Net.Security.RemoteCertificateValidationCallback(AcceptAllCertifications);
var request = (HttpWebRequest)WebRequest.Create(url);
using (var response = (HttpWebResponse)request.GetResponse())
{
Stream data = response.GetResponseStream();
HtmlDocument hDoc = new HtmlDocument();
using (StreamReader readURLContent = new StreamReader(data))
{
html = readURLContent.ReadToEnd();
hDoc.LoadHtml(html);
}
}
I can't really find anything on this specific issue, so I'm kind of lost. If anybody could point me in the right direction, that would be awesome.
Edit: here's an image of what it looks like, since I can't copy-paste it.
My guess is that the response is compressed. If you use a web debugger like Charles or Fiddler, you can see how the requests are structured and what data they contain - that makes it a lot easier to replicate the HTTP requests later on when programming them. Try the following code.
HtmlDocument doc = new HtmlDocument(); // target document (declared outside the try block in the original snippet)
try
{
string webAddr = url;
var httpWebRequest = (HttpWebRequest)WebRequest.Create(webAddr);
httpWebRequest.ContentType = "text/html; charset=utf-8";
httpWebRequest.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0";
httpWebRequest.AllowAutoRedirect = true;
httpWebRequest.Method = "GET";
httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
var httpResponse = (HttpWebResponse)httpWebRequest.GetResponse();
using (var streamReader = new StreamReader(httpResponse.GetResponseStream(), Encoding.UTF8))
{
var responseText = streamReader.ReadToEnd();
doc.LoadHtml(responseText);
}
}
catch (WebException ex)
{
Console.WriteLine(ex.Message);
}
The code sets the encoding on the request. You can also set the encoding on the StreamReader when reading the response, and automatic decompression is enabled.
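If the page's charset is not known up front, one option (a sketch, not part of the original answer) is to take the encoding from the response headers instead of hard-coding UTF-8:
var httpResponse = (HttpWebResponse)httpWebRequest.GetResponse();
// Use the charset advertised in the Content-Type header, falling back to UTF-8 when it is missing or unknown.
Encoding encoding = Encoding.UTF8;
if (!string.IsNullOrEmpty(httpResponse.CharacterSet))
{
    try { encoding = Encoding.GetEncoding(httpResponse.CharacterSet); }
    catch (ArgumentException) { /* unrecognized charset name - keep UTF-8 */ }
}
using (var streamReader = new StreamReader(httpResponse.GetResponseStream(), encoding))
{
    doc.LoadHtml(streamReader.ReadToEnd());
}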

proxy and htmlagilitypack issues

I have been trying to use the HtmlAgilityPack through a proxy and I get some unpredictable behavior.
How do you add credentials to htmlagilitypack so that it will be able to scrape web pages through a proxy?
Here is what I usually do:
HttpWebRequest request = (HttpWebRequest) HttpWebRequest.Create(url);
...
HttpWebResponse response = (HttpWebResponse) request.GetResponse();
using (var reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8)) {
var doc = new HtmlDocument();
doc.Load(reader.BaseStream);
//Use (or return) the HtmlDocument 'doc' here.
}
You could encapsulate this code in a method that, given a URL, returns an HtmlDocument object.
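The proxy configuration itself isn't shown; a minimal sketch of what could go in the "..." part, with a hypothetical host, port, and account, is:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
// Route the request through the proxy and attach the proxy credentials.
var proxy = new WebProxy("proxy.example.com", 8080);
proxy.Credentials = new NetworkCredential("yourUserId", "yourPassword");
request.Proxy = proxy;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();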
There is a similar question that has already been answered. You would use it in your code as below:
HtmlWeb web = new HtmlWeb();
web.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6";
var doc = web.Load(url, proxyHost, proxyPort, yourUserId, yourPassword);

Trouble with getting web page's HTML code from my C# program

The problem:
I want to scrape some data from a certain webpage (I have administrative access) and store some information in a database for later analysis.
Sounds easy, right?
I've decided to make a simple console prototype, and the code looks something like this:
string uri = #"http://s7.iqstreaming.com:8044/admin.cgi";
HttpWebRequest request = WebRequest.Create(uri) as HttpWebRequest;
if(request == null)
{
Console.WriteLine(":( This shouldn't happen!");
Console.ReadKey();
}
request.ContentType = @"text/html";
request.Accept = @"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
request.Credentials = new NetworkCredential("myID", "myPass");
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
StreamReader reader = new StreamReader( response.GetResponseStream());
while (!reader.EndOfStream)
{
Console.WriteLine(reader.ReadLine());
}
reader.Close();
response.Close();
}
This code works on most other sites, but here I get errors 404 (most of the time), 502 or timeout.
I've consulted Firebug (I took the Accept and compression info from there), but to no avail.
Using Win-forms and webBrowser control as an alternative is not an option (at least for now).
P.S.
The same thing happens when I try to get the HTML from http://s7.iqstreaming.com:8044/index.html (which doesn't need credentials).
I think the problem is related to the User-Agent. This may solve it:
request.UserAgent="Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11";

Can't get HTML code through HttpWebRequest

I am trying to parse the HTML code of the page at http://odds.bestbetting.com/horse-racing/today in order to have a list of races, etc.
The problem is that I am not able to retrieve the HTML code of the page. Here is the C# code of the function:
public static string Http(string url) {
Uri myUri = new Uri(url);
// Create a 'HttpWebRequest' object for the specified url.
HttpWebRequest myHttpWebRequest = (HttpWebRequest)WebRequest.Create(myUri);
myHttpWebRequest.AllowAutoRedirect = true;
// Send the request and wait for response.
HttpWebResponse myHttpWebResponse = (HttpWebResponse)myHttpWebRequest.GetResponse();
var stream = myHttpWebResponse.GetResponseStream();
var reader = new StreamReader(stream);
var html = reader.ReadToEnd();
// Release resources of response object.
myHttpWebResponse.Close();
return html;
}
When I execute the program and call the function, it throws an exception on
HttpWebResponse myHttpWebResponse =
(HttpWebResponse)myHttpWebRequest.GetResponse();
which is:
Cannot handle redirect from HTTP/HTTPS protocols to other dissimilar ones.
I have read this question but I don't seem to have the same problem.
I've also tried figuring something out by sniffing the traffic with Fiddler, but I can't see where it redirects to or anything similar. I have just extracted these two possible redirections: odds.bestbetting.com/horse-racing/2011-06-10/byCourse
and odds.bestbetting.com/horse-racing/2011-06-10/byTime, but querying them produces the same result as above.
It's not the first time I've done something like this, but I'm really lost on this one. Any help?
Thanks!
I finally found the solution... it was indeed a problem with the headers, specifically the User-Agent one.
After lots of searching, I found a guy who had the same problem with the same site. Although his code was different, the important bit was that he manually set the UserAgent property of the request to that of a browser. I think I had tried this before, but I may have done it badly... sorry.
The final code, in case it is of interest to anyone, is this:
public static string Http(string url) {
if (url.Length > 0)
{
Uri myUri = new Uri(url);
// Create a 'HttpWebRequest' object for the specified url.
HttpWebRequest myHttpWebRequest = (HttpWebRequest)WebRequest.Create(myUri);
// Set the user agent as if we were a web browser
myHttpWebRequest.UserAgent = @"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4";
HttpWebResponse myHttpWebResponse = (HttpWebResponse)myHttpWebRequest.GetResponse();
var stream = myHttpWebResponse.GetResponseStream();
var reader = new StreamReader(stream);
var html = reader.ReadToEnd();
// Release resources of response object.
myHttpWebResponse.Close();
return html;
}
else { return "NO URL"; }
}
Thank you very much for helping.
There can be a dozen probable causes for your problem.
One of them is that the redirect from the server is pointing to an FTP site, or something like that.
It can also be that the server requires some headers in the request that you're failing to provide.
Check what a browser would send to the site and try to replicate it.
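One way to see where the redirect actually points (a diagnostic sketch, not part of the original answer) is to turn off automatic redirects and read the Location header yourself:
HttpWebRequest probe = (HttpWebRequest)WebRequest.Create("http://odds.bestbetting.com/horse-racing/today");
probe.UserAgent = @"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4";
probe.AllowAutoRedirect = false; // do not follow the redirect, just report it
using (HttpWebResponse probeResponse = (HttpWebResponse)probe.GetResponse())
{
    // For a 3xx response the redirect target is carried in the Location header.
    Console.WriteLine((int)probeResponse.StatusCode + " -> " + probeResponse.Headers["Location"]);
}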

Getting html source from url, css inline problem!

I have a strange problem:
I am getting the HTML source from a URL using this:
string html;
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(Url);
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
{
html = reader.ReadToEnd();
reader.Close();
}
response.Close();
}
The page that I am requesting has inline CSS like this:
<span class="VL" style="display:inline-block;height:20px;width:0px;"></span>
But the html var value has only:
<span class="VL" style="display:inline-block;"></span>
Does anyone know why? I have tested with many encodings and with both WebRequest and WebClient, but it doesn't work either way.
You might need to send a User-Agent so that the site doesn't think you are a bot. Some sites don't bother with CSS when the request comes from a bot. Also, reading the remote HTML could be simplified using a WebClient:
using (var client = new WebClient())
{
client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4";
string html = client.DownloadString(url);
}
Are you viewing the source through a browser development tool, by clicking Inspect Element? Is it possible you are viewing the source in a browser that adds the height and width attributes on the client side through JavaScript and shows you the modified style?
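A quick way to check is to dump the raw HTML to a file and search it for the span, rather than comparing against the element inspector (which shows the live, script-modified DOM). A small sketch, reusing the WebClient example above:
using (var client = new WebClient())
{
    client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4";
    // Save the untouched server response so it can be diffed against the browser's "View Source".
    File.WriteAllText("raw.html", client.DownloadString(url));
}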
