I am trying to scrape the NSE website, but when I try it using this method:
static async void DownloadPageAsync(string url)
{
HttpClient client = new HttpClient();
client.DefaultRequestHeaders.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml,application/xml");
client.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0");
client.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Charset", "ISO-8859-1");
HttpResponseMessage response = await client.GetAsync(url);
Thread.Sleep(30000);
response.EnsureSuccessStatusCode();
var responseStream = await response.Content.ReadAsStreamAsync();
var streamReader = new StreamReader(responseStream);
var str = streamReader.ReadToEnd();
}
I am getting this response
but when I try the same link via Chrome, my response is this.
Where am I going wrong? How can I get the Chrome response via code? Please help.
regards
Srivastava
So, first off: crawling web pages is not a trivial task. Correct HTML parsing in particular is quite tricky.
There is also some netiquette regarding web crawling that you should be aware of before you start writing your web crawler. One rule in particular is to include details on how to find more information about your web crawler in your User-Agent string. In other words, don't do the following, but send something more descriptive: even if you need the 'Gecko' token because of browser detection, it's proper to put your own information between the '(' and ')'.
client.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0");
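As a rough illustration of a more descriptive header, here is a minimal sketch. The class name, the crawler name, the version and the info URL are all made up for the example, and the "compatible; name/version; +URL" layout is just a common convention, not something this site requires.

```csharp
using System;

public static class CrawlerUserAgent
{
    // Builds a User-Agent that keeps the "Mozilla/5.0" and "Gecko" tokens that
    // browser detection may look for, but names the crawler and points to a
    // page describing it. All three arguments are supplied by the caller.
    public static string Build(string crawlerName, string version, string infoUrl)
    {
        if (string.IsNullOrWhiteSpace(crawlerName))
            throw new ArgumentException("Crawler name is required.", nameof(crawlerName));
        if (string.IsNullOrWhiteSpace(version))
            throw new ArgumentException("Version is required.", nameof(version));
        if (string.IsNullOrWhiteSpace(infoUrl))
            throw new ArgumentException("Info URL is required.", nameof(infoUrl));

        return $"Mozilla/5.0 (compatible; {crawlerName}/{version}; +{infoUrl}) Gecko/20100101";
    }
}
```

It could then be used in place of the hard-coded Firefox string, for example client.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", CrawlerUserAgent.Build("ExampleCrawler", "1.0", "https://example.com/crawler-info")).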
One thing that's notoriously difficult to handle in a web crawler is AJAX calls. Having an incorrect user agent might even make this worse, because some web sites decide whether or not to use AJAX based on the browser's capabilities. For the context of this question, it's best to simply assume that you cannot properly handle JavaScript or AJAX in your crawler (the truth is more complex, but it would take too long to describe here).
Knowing some stock websites, I think this is also your problem. These numbers are often refreshed using AJAX 'in real time'.
I am in the process of migrating one of my company's web services to a new server, and unfortunately the previous developers left us no way to test the migration of this service prior to migrating the production version. This leaves us in a harsh situation where I have to formulate a backup plan in case things go wrong when we migrate to the new server.
To understand my plan, you must first understand that the flow of execution for this web service is currently:
Customer calls platform.
Platform calls web service.
Web service responds to platform.
Platform responds to customer.
Simple enough, but the platform's changes are already in place for deployment at the flip of a switch and the developer will not be in house for the migration. Thus, they will flip the switch and leave me hoping the migration works.
I have a simple rollback plan that does not require the platform's developer in order to roll back. I simply inject a middleman into the chain above, which acts as a conduit to the web service for the platform:
Customer calls platform.
Platform calls conduit service.
Conduit service calls web service.
Web service responds to conduit.
Conduit responds to platform.
Platform responds to customer.
This way, if for some reason the migrated version of the web service fails, I can fall back to the original version hosted on the old server until we can investigate what's missing and why it all went wrong (currently we have no way to do this).
Now that you have an understanding of the issue, I have a simple issue with writing the conduit to the underlying web service. I encountered a method in the web service that returns HttpResponseMessage and expects HttpRequestMessage as a request. This is rather confusing since the platform calls this method via the following URI:
test.domain.com:port/api/route/methodname
I have no access to the code under this URI assignment (which is in RPG code), so I have no idea how they are passing the data over. Currently my code is simple:
[Route("MethodName")]
[HttpPost]
public HttpResponseMessage MethodName(HttpRequestMessage request) {
try {
HttpWebRequest request = (HttpWebRequest)WebRequest.Create($"{ServiceRoute}/api/route/methodname");
request.Method = "GET";
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36";
request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
return response; // There is a type mismatch, I know.
} catch (Exception e) {
// Log exception.
return null;
}
}
How can I call a restful web service and pass on the request message to the service?
NOTE: I understand, the snippet I've supplied will not work and has an error. I DO NOT expect anyone to just hand out code. References and explanations as to what needs to be done and why are what I'm looking for.
I'm not sure I totally understand the question, so apologies if this isn't helpful, but if your conduit truly just forwards each request as-is, you should be able to reuse the incoming HttpRequestMessage by changing the RequestUri property to the web service URI and forwarding it to the web service with an instance of HttpClient. Something like this:
// An async method has to return Task<T>, not the bare result type
[Route("MethodName")]
[HttpPost]
public async Task<HttpResponseMessage> MethodName(HttpRequestMessage request) {
// RequestUri is a Uri, so the string has to be wrapped
request.RequestUri = new Uri($"{ServiceRoute}/api/route/methodname");
request.Method = HttpMethod.Get;
// Headers.UserAgent is a read-only collection: replace its contents instead of assigning to it
request.Headers.UserAgent.Clear();
request.Headers.UserAgent.ParseAdd("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36");
// Clear the incoming Host header so that it is derived from the new RequestUri
request.Headers.Host = null;
//add any additional headers, etc...
try
{
//best practice is to reuse the same HttpClient instance instead of reinstantiating per request, but this will do for an example
var httpClient = new HttpClient();
var response = await httpClient.SendAsync(request);
//perform any validation or modification of the response here, or fall back to the old web service on a failure
return response;
}
catch (Exception e)
{
// Log exception.
return null;
}
}
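The "fall back to the old web service on a failure" step mentioned in the comment could look roughly like the sketch below. It is only an illustration under assumptions: ConduitForwarder, SendWithFallbackAsync and both base URIs are hypothetical names, and treating any 5xx status as a failure is a choice made for this example, not something the question specifies.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class ConduitForwarder
{
    // Tries the migrated service first and falls back to the old one.
    // send:         how a request is actually sent, e.g. req => httpClient.SendAsync(req)
    // buildRequest: creates a NEW request for the given base URI; a new instance is
    //               needed per attempt because a request message cannot be sent twice.
    public static async Task<HttpResponseMessage> SendWithFallbackAsync(
        Func<HttpRequestMessage, Task<HttpResponseMessage>> send,
        Func<Uri, HttpRequestMessage> buildRequest,
        Uri newServiceBase,
        Uri oldServiceBase)
    {
        try
        {
            var response = await send(buildRequest(newServiceBase));
            if ((int)response.StatusCode < 500)
            {
                // Anything below 5xx is treated as an answer from a working service.
                return response;
            }
            // Server error from the migrated service: discard it and fall back.
            response.Dispose();
        }
        catch (HttpRequestException)
        {
            // Connection-level failure: fall through to the old service.
            // Timeouts surface as TaskCanceledException and are not handled here.
        }
        return await send(buildRequest(oldServiceBase));
    }
}
```

Inside the controller this might be called with a shared HttpClient as the sender and a buildRequest delegate that copies the method, headers and body of the incoming request onto the new message.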
I'm trying to obtain JSON from a REST API, targeting .NET 4.5.
I've tried various methods in code, but they all end up with me getting:
"Authentication failed because the remote party has closed the transport stream".
The exact same URL works via browser and Postman.
So far, I've tried using .NET's WebClient, HttpClient and HttpWebRequest with identical results. I've tried comparing requests between Postman and my code (via RequestBin), but even when they were identical, I still kept getting back:
"Authentication failed because the remote party has closed the transport stream".
My current code is using HttpWebRequest, but any solution will do.
I've played around with all of the security protocols; some of them cause the API to return 404 and some cause the server to return
"Authentication failed because the remote party has closed the transport stream".
Here's my current code:
public string GetCityStreets()
{
var url = "https://data.gov.il/api/action/datastore_search?resource_id=a7296d1a-f8c9-4b70-96c2-6ebb4352f8e3&q=26";
var request = (HttpWebRequest)WebRequest.Create(url);
ServicePointManager.SecurityProtocol = SecurityProtocolType.Ssl3 |
SecurityProtocolType.Tls12 |
SecurityProtocolType.Tls11 |
SecurityProtocolType.Tls;
var response = (HttpWebResponse)request.GetResponse();
string jsonResponse;
using (var reader = new StreamReader(response.GetResponseStream()))
{
jsonResponse = reader.ReadToEnd();
}
return jsonResponse;
}
In my current code, the exception is thrown when the request is actually made: request.GetResponse().
What I need, essentially, is to get the JSON from the API.
Set SecurityProtocolType.Tls12 before you initialize the request:
ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
var request = WebRequest.CreateHttp(url);
This is needed if you're on Windows 7. On Windows 10, you should only need SecurityProtocolType.SystemDefault (available from .NET Framework 4.7).
Note: to enable TLS 1.3 on a system whose TLS stack supports it, if you are not using .NET Framework 4.8 or .NET Core 3.0 (or later), there is no enum member for it, but you can set it by its numeric value:
var secProtoTls13 = (SecurityProtocolType)12288;
Remove all the other SecurityProtocolType values you have set there.
Setting the User-Agent header is also mandatory, otherwise you will receive a 404 (Not Found). Here's the Firefox header:
request.UserAgent = "Mozilla/5.0 (Windows NT 10; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0";
A note on the User-Agent header: this specific site doesn't activate HTTP Strict Transport Security (HSTS), but some sites do when they see that the web browser supports it. HttpWebRequest doesn't understand it, so it will simply wait for a response that never comes, since the site is waiting for interaction.
You may want to use the IE11 header instead.
Also add this other header:
request.Headers.Add(HttpRequestHeader.CacheControl, "no-cache");
The server side appears to be checking the user agent (presumably to stop bots and other code (like yours!) from hitting the endpoint). To bypass this, you will need to set the user agent to a value such that it thinks you are a web browser.
request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36";
works, as an example.
You may wish to consider setting the ServicePointManager.SecurityProtocol just once (at app startup) rather than on each request.
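A minimal sketch of that "once at startup" idea is below. The class and method names are hypothetical, and it assumes TLS 1.2 is the only protocol the application needs.

```csharp
using System.Net;

public static class TlsSetup
{
    // Call once when the application starts, before the first request is created.
    // ServicePointManager.SecurityProtocol is a process-wide setting, so there is
    // no need to assign it again for every request.
    public static void UseTls12()
    {
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
    }
}
```

In a console application this would typically be the first call in Main; in a web application, somewhere in the startup code.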
This works using HttpClient (.net 4.5 and up)
var url = "https://data.gov.il/api/action/datastore_search?resource_id=a7296d1a-f8c9-4b70-96c2-6ebb4352f8e3&q=26";
var client = new HttpClient();
client.DefaultRequestHeaders.Add("User-Agent", "C# App");
Task<string> response = client.GetStringAsync(url);
Console.WriteLine(response.Result);
I think the server requires a user agent.
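For what it's worth, the same call can be written with one shared HttpClient and await instead of blocking on .Result. This is only a sketch: JsonFetcher and GetJsonAsync are made-up names, and the User-Agent value is the same placeholder as in the snippet above.

```csharp
using System.Net.Http;
using System.Threading.Tasks;

public static class JsonFetcher
{
    // A single HttpClient is created once and reused for every request.
    public static readonly HttpClient Client = CreateClient();

    private static HttpClient CreateClient()
    {
        var client = new HttpClient();
        // Sent with every request made through this client.
        client.DefaultRequestHeaders.Add("User-Agent", "C# App");
        return client;
    }

    // Downloads the response body as a string without blocking the calling thread.
    public static async Task<string> GetJsonAsync(string url)
    {
        return await Client.GetStringAsync(url);
    }
}
```

A caller would then write something like string json = await JsonFetcher.GetJsonAsync(url); from an async method.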
I'm using the following code to download the given web page.
using (WebClient client = new WebClient())
{
client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
using (Stream data = client.OpenRead("https://www.yellowpages.com/south-jordan-ut/plumbers?page=5"))
using (StreamReader reader = new StreamReader(data))
{
string html = reader.ReadToEnd();
}
}
It works but html appears to contain only a small portion of the final page. I understand that this site has dynamic content, but when I view the source code in Chrome (by selecting the View page source command), it appears to download everything.
Unless Chrome actually runs scripts when you run View page source, it seems to be using some other trick to get all the HTML. Does anyone know what that trick might be?
So if you read the HTML that the webClient is returning, you can see some text:
"We can provide you with the best experience on Yellowpages.com, if you upgrade to the latest version of your browser"
If you change your user-agent to something that Chrome would send, you get the results as expected:
client.Headers.Add("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36");
There's probably an AJAX call or something similar that loads the page data. This is the modern approach, whereas previously the page itself would already contain all of the data. What everyone else is saying is that if there IS JavaScript that loads the content, then WebClient will not be able to load that content, because it does not run scripts. This is why you see it in your browser and not in WebClient.
The solution is to use another tool like Selenium to download the page into a rendering engine & then scrape what you need.
I am writing a C# application whereby I formulate the POST strings in C# but the website I am POSTing recognizes that I am not using IE, Chrome or Firefox. Is there a way that I can "use" Internet Explorer (or either of the other two browsers) to make the POST request and then retrieve the response back in the C# (to parse the HTML)?
I have this currently:
using (var wb = new WebClient())
{
var data = new NameValueCollection();
//Any key-value arguments for the POST are stored in data
var response = wb.UploadValues(url, "POST", data);
}
Yes. Forge the User-Agent HTTP header. The User-Agent header basically tells the receiving server what program is sending the packets to it.
See this StackOverflow answer on how to do just that.
client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2";
Set this header on the WebClient instance before making the request, that is, before the call whose result is assigned to response.
I need to call a web page from a different domain. When I call this page from a browser, it responds normally. But when I call it from server-side code or from a jQuery AJAX script, it responds with empty XML.
I am trying to call a page or service like this:
http://www.otherdomain.com/oddsData.jsp?odds_flash_id=11&odds_s_type=1&odds_league=all&odds_period=all&me_select_string=&q=93801
This responds normally from the browser. But when I write C# code like this:
WebClient wc = new WebClient();
wc.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.56 Safari/536.5";
wc.Headers[HttpRequestHeader.Accept] = "*/*";
wc.Headers[HttpRequestHeader.AcceptCharset] = "ISO-8859-1,utf-8;q=0.7,*;q=0.3";
wc.Headers[HttpRequestHeader.AcceptEncoding] = "gzip,deflate,sdch";
wc.Headers[HttpRequestHeader.AcceptLanguage] = "en-US,en;q=0.8";
wc.Headers[HttpRequestHeader.Host] = "otherdomain.com";
var response = wc.DownloadString("http://www.otherdomain.com/oddsData.jsp?odds_flash_id=11&odds_s_type=1&odds_league=all&odds_period=all&me_select_string=&q=93801");
Response.Write(response);
I get empty XML as the response:
<xml></xml>
How can I get same response from server side code or client side which I got from browser?
I tried the solution here: Calling Cross Domain WCF service using Jquery
But I didn't understand what to do, so I couldn't apply the solution described.
How can I get same response from server side code or client side which I got from browser?
Due to the same-origin policy restriction, you cannot send cross-domain AJAX requests from browsers unless the target server explicitly allows it (for example via CORS or JSONP).
From .NET on the other hand you could perfectly fine send this request. But probably the web server that you are trying to send the request to expects some HTTP headers such as the User-Agent header for example. So make sure that you have provided all the headers in your request that the server needs. For example to add the User-Agent header:
using (WebClient wc = new WebClient())
{
wc.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.56 Safari/536.5";
var response = wc.DownloadString("http://www.otherdomain.com/oddsData.jsp?odds_flash_id=11&odds_s_type=1&odds_league=all&odds_period=all&me_select_string=&q=93801");
Response.Write(response);
}
You could use Firebug or the Chrome developer tools to inspect all the HTTP request headers that your browser sends along with the request that works, and simply add those headers.
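One way to replay headers copied from the browser is sketched below, using HttpRequestMessage rather than WebClient. The helper name BrowserHeaders.Apply is hypothetical, and skipping Host and Content-* headers is an assumption made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;

public static class BrowserHeaders
{
    // Copies header name/value pairs (for example, pasted from the browser's
    // network tab) onto a request. TryAddWithoutValidation skips .NET's format
    // checks, so the values are sent as they were copied.
    public static HttpRequestMessage Apply(HttpRequestMessage request, IDictionary<string, string> headers)
    {
        foreach (var pair in headers)
        {
            // Host is derived from the request URI, and Content-* headers belong
            // on request.Content, so both are skipped here.
            if (pair.Key.Equals("Host", StringComparison.OrdinalIgnoreCase) ||
                pair.Key.StartsWith("Content-", StringComparison.OrdinalIgnoreCase))
            {
                continue;
            }
            request.Headers.TryAddWithoutValidation(pair.Key, pair.Value);
        }
        return request;
    }
}
```

Note that if Accept-Encoding is copied over as well, the server may answer with a compressed body, so automatic decompression would have to be enabled on the handler for the response to be readable.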