I am trying to screen scrape a page of a web app that just contains text and is hosted by a third party. It's not a properly formed HTML page; however, the text that is displayed will tell us if the web app is up or down.
When I try to scrape the screen it returns an error when it tries the WebRequest. The error is "The remote server returned an error: (500) Internal Server Error."
public void ScrapeScreen()
{
    try
    {
        var url = textBox1.Text;
        var request = WebRequest.Create(url);

        // Read the whole response body as text and show it.
        using (var response = request.GetResponse())
        using (var stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream))
        {
            richTextBox1.Text = reader.ReadToEnd();
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}
Any ideas how I can get the text from the page?
Some sites don't like the default UserAgent. Consider changing it to something real, like:
((HttpWebRequest)request).UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4";
First, try this:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
However, if you're just looking for text and don't need to POST any data to the server, you may want to look at the WebClient class. It more closely resembles a real browser and takes care of a lot of HTTP header details that you may otherwise end up having to tweak if you stick with the HttpWebRequest class. A minimal sketch follows.
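For instance, here is a rough sketch using WebClient, assuming the same url and richTextBox1 as in the question above (the User-Agent string is just an example):
using (var client = new WebClient())
{
    // WebClient fills in reasonable defaults, but a browser-like
    // User-Agent still helps with picky servers.
    client.Headers[HttpRequestHeader.UserAgent] =
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36";
    richTextBox1.Text = client.DownloadString(url); // throws WebException on 4xx/5xx
}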
Related
I'd like to create a tool to check if a URL is valid (valid: it returns a 200). I have two examples of airline check-in pages, and both work correctly in the browser. However, the British Airways one always throws an exception because of a 500 response. What is wrong with my code?
static void Main(string[] args)
{
    var testUrl1 = new Program().UrlIsValid("https://www.klm.com/ams/checkin/web/kl/nl/nl");
    var testUrl2 = new Program().UrlIsValid("https://www.britishairways.com/travel/olcilandingpageauthreq/public/en_gb");
    Console.WriteLine(testUrl1 + "\t - https://www.klm.com/ams/checkin/web/kl/nl/nl");
    Console.WriteLine(testUrl2 + "\t - https://www.britishairways.com/travel/olcilandingpageauthreq/public/en_gb");
}
public bool UrlIsValid(string onlineCheckInUrl)
{
    try
    {
        var request = (HttpWebRequest)WebRequest.Create(onlineCheckInUrl);
        request.Method = "GET";
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            return response.StatusCode == HttpStatusCode.OK;
        }
    }
    catch (Exception)
    {
        // Any protocol error (4xx/5xx) surfaces as a WebException and lands here.
        return false;
    }
}
A lot of sites block obvious bot activity. The British Airways URL you show works for me if I set a valid User-Agent request header:
request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) Gecko/20100101 Firefox/87.0";
Keep in mind that 200 OK is not the only response that means the URL is valid, so this method of testing will always be unreliable at best. You may have to narrow your definition of what a valid URL means, or at least expect things to change on a site-by-site basis. A sketch of that idea follows.
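For example, a rough sketch that combines the User-Agent fix with a looser definition of "valid" (which status codes you accept is a judgment call, so adjust per site):
public bool UrlIsValid(string url)
{
    var request = (HttpWebRequest)WebRequest.Create(url);
    request.Method = "GET";
    request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) Gecko/20100101 Firefox/87.0";
    try
    {
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            // Treat the whole 2xx range as "valid" rather than only 200.
            int code = (int)response.StatusCode;
            return code >= 200 && code < 300;
        }
    }
    catch (WebException ex)
    {
        // 4xx/5xx responses arrive as exceptions; the response is still
        // available here, so you could decide that e.g. a 401 still means
        // "the URL exists" if that matches your definition of valid.
        var error = ex.Response as HttpWebResponse;
        if (error != null) error.Dispose();
        return false;
    }
}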
After searching for information about the iCloud API, I found some examples in NodeJS and Python, but unfortunately I'm not familiar with them. I want to know how to get the iCloud contact list in C#.
Example on python: https://github.com/mindcollapse/iCloud-API/blob/master/iCloud.py
Example on NodeJS: https://www.snip2code.com/Snippet/65033/Request-Contact-List-From-iCloud
I tried to port the login code to C#:
private void iCloudLogin()
{
    string guiid = Guid.NewGuid().ToString("N");
    //string url = "https://p12-setup.icloud.com/setup/ws/1/login?clientBuildNumber=1P24&clientId=" + guiid;
    string url = "https://setup.icloud.com/setup/ws/1/login?clientBuildNumber=1P24&clientId=" + guiid;
    using (var client = new WebClient())
    {
        client.Headers.Set("Origin", "https://www.icloud.com");
        client.Headers.Set("Referer", "https://www.icloud.com");
        client.Headers.Set("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36");
        var values = new NameValueCollection();
        values["apple_id"] = appleId;
        values["password"] = password;
        values["extended_login"] = "false";
        var response = client.UploadValues(url, values);
    }
}
I receive "400: Bad Request" with the above code. Please point out where I'm going wrong; I'd appreciate a code example if there is one.
Update:
Now I can log in and get a lot of information, including my contact server URL and dsid. This is the link I used:
https://p12-setup.icloud.com/setup/ws/1/login?clientBuildNumber=1P24&clientId=MyGuid
After that, I use the URL below to get the contact list:
https://p35-contactsws.icloud.com/co/startup?clientBuildNumber=1P24&clientId=MyGuid&clientVersion=2.1&dsid=MyDSID&locale=en-EN&order=last%2Cfirst
https://p35-contactsws.icloud.com is my contact server. It is actually https://p35-contactsws.icloud.com:443, but based on the examples I referred to, the :443 port needs to be removed.
But I still get 421: Client Error.
I found the answer myself.
Firstly, in this case the request should be WebRequest, not WebClient.
For the first API URL, https://setup.icloud.com/setup/ws/1/login?clientBuildNumber=WHATEVERNUMBER&clientId=RANDOM_GUID :
The WebRequest should be a POST that includes apple_id and password in the data, and the headers must include Origin=https://www.icloud.com :
private void iCloudLogin()
{
    // Note: the string values need quotes for the JSON to be valid.
    string data = "{\"apple_id\":\"" + appleId + "\", \"password\":\"" + password + "\", \"extended_login\":false}";
    byte[] dataStream = Encoding.UTF8.GetBytes(data);
    WebRequest webRequest = WebRequest.Create(url);
    webRequest.Method = "POST";
    webRequest.Headers.Set("Origin", "https://www.icloud.com");
    webRequest.ContentLength = dataStream.Length;
    // Attach the data.
    Stream newStream = webRequest.GetRequestStream();
    newStream.Write(dataStream, 0, dataStream.Length);
    newStream.Close();
    WebResponse webResponse = webRequest.GetResponse();
    // get contact server url, dsid, Cookie from webResponse
}
The iCloud server will respond with the contact server URL and dsid, as well as "X-APPLE-WEBAUTH-TOKEN" and "X-APPLE-WEBAUTH-USER" (these two values are in the "Set-Cookie" header of webResponse).
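For example, a rough sketch of where to find those values (the parsing here is naive; a real implementation should split the Set-Cookie header properly):
// The raw Set-Cookie header of the login response contains both
// X-APPLE-WEBAUTH-TOKEN=... and X-APPLE-WEBAUTH-USER=... entries,
// which get passed back as a Cookie header on the contacts request.
string setCookie = webResponse.Headers["Set-Cookie"];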
Once you have all of the above parameters, you can get the iCloud contact list as follows:
Make a GET request to this URL:
https://p35-contactsws.icloud.com/co/startup?clientBuildNumber=1P24&clientId=MyGuid&clientVersion=2.1&dsid=MyDSID&locale=en-EN&order=last%2Cfirst
+ https://p35-contactsws.icloud.com : my contact server URL; yours may be different.
+ clientVersion: just leave it at 2.1.
+ MyGuid: the GUID you used in the first request.
Important: the headers must include:
Origin:https://www.icloud.com
Cookie: X-APPLE-WEBAUTH-TOKEN=XXXXXX;X-APPLE-WEBAUTH-USER=YYYYYYYYY
After that, you will get the full iCloud contact list.
This approach is based on plain web services, so it should work in many languages; I hope it helps. A rough C# sketch of the GET request is shown below.
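Here is that GET request sketched in C#; myGuid, myDsid, webAuthToken, and webAuthUser are placeholders for the values returned by your own login response:
string contactsUrl = "https://p35-contactsws.icloud.com/co/startup?clientBuildNumber=1P24"
    + "&clientId=" + myGuid + "&clientVersion=2.1&dsid=" + myDsid
    + "&locale=en-EN&order=last%2Cfirst";
var contactsRequest = (HttpWebRequest)WebRequest.Create(contactsUrl);
contactsRequest.Method = "GET";
contactsRequest.Headers.Set("Origin", "https://www.icloud.com");
contactsRequest.Headers.Set("Cookie",
    "X-APPLE-WEBAUTH-TOKEN=" + webAuthToken + ";X-APPLE-WEBAUTH-USER=" + webAuthUser);
using (var contactsResponse = (HttpWebResponse)contactsRequest.GetResponse())
using (var reader = new StreamReader(contactsResponse.GetResponseStream()))
{
    string contactsJson = reader.ReadToEnd(); // JSON payload with the contact list
}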
I think the main problem is that I am trying to get the output of a PHP script on an SSL-protected website. Why doesn't the following code work?
string URL = "https://mtgox.com/api/0/data/ticker.php";
HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create(URL);
myRequest.Method = "GET";
WebResponse myResponse = myRequest.GetResponse();
StreamReader _sr = new StreamReader(myResponse.GetResponseStream(), System.Text.Encoding.UTF8);
string result = _sr.ReadToEnd();
//Console.WriteLine(result);
result = result.Replace('\n', ' ');
_sr.Close();
myResponse.Close();
Console.WriteLine(result);
It hangs with an unhandled WebException: "The operation has timed out".
You're hitting the wrong URL. SSL is https://, but you're hitting http:// (note the missing s). The site does redirect to the SSL version of the page, but your code is apparently not following that redirect.
After adding myRequest.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11"; everything started working.
The problem:
I want to scrape some data from a certain webpage (I have administrative access) and store some information in a database for later analysis.
Sounds easy, right?
I've decided to make simple console prototype and code look something like this:
string uri = @"http://s7.iqstreaming.com:8044/admin.cgi";
HttpWebRequest request = WebRequest.Create(uri) as HttpWebRequest;
if (request == null)
{
    Console.WriteLine(":( This shouldn't happen!");
    Console.ReadKey();
    return; // bail out instead of dereferencing a null request below
}
request.ContentType = @"text/html";
request.Accept = @"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
request.Credentials = new NetworkCredential("myID", "myPass");
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    StreamReader reader = new StreamReader(response.GetResponseStream());
    while (!reader.EndOfStream)
    {
        Console.WriteLine(reader.ReadLine());
    }
    reader.Close();
    response.Close();
}
This code works on most other sites, but here I get a 404 error (most of the time), a 502, or a timeout.
I've consulted Firebug (I took the Accept and compression info from there), but to no avail.
Using WinForms and the WebBrowser control as an alternative is not an option (at least for now).
P.S.
The same thing happens when I try to get the HTML from http://s7.iqstreaming.com:8044/index.html (which doesn't need credentials).
I think the problem is related to the User-Agent. This may solve it:
request.UserAgent="Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11";
I am trying to parse the HTML code of the page at http://odds.bestbetting.com/horse-racing/today in order to get a list of races, etc.
The problem is that I am not able to retrieve the HTML code of the page. Here is the C# code of the function:
public static string Http(string url) {
    Uri myUri = new Uri(url);
    // Create a 'HttpWebRequest' object for the specified url.
    HttpWebRequest myHttpWebRequest = (HttpWebRequest)WebRequest.Create(myUri);
    myHttpWebRequest.AllowAutoRedirect = true;
    // Send the request and wait for response.
    HttpWebResponse myHttpWebResponse = (HttpWebResponse)myHttpWebRequest.GetResponse();
    var stream = myHttpWebResponse.GetResponseStream();
    var reader = new StreamReader(stream);
    var html = reader.ReadToEnd();
    // Release resources of response object.
    myHttpWebResponse.Close();
    return html;
}
When I execute the program calling the function it throws an exception on
HttpWebResponse myHttpWebResponse = (HttpWebResponse)myHttpWebRequest.GetResponse();
which is:
Cannot handle redirect from HTTP/HTTPS protocols to other dissimilar ones.
I have read this question but I don't seem to have the same problem.
I've also tried figuring something out by sniffing the traffic with Fiddler, but I can't see anything about where it redirects or anything similar. I just extracted these two possible redirections: odds.bestbetting.com/horse-racing/2011-06-10/byCourse and odds.bestbetting.com/horse-racing/2011-06-10/byTime, but querying them produces the same result as above.
It's not the first time I've done something like this, but I'm really lost on this one. Any help?
Thanks!
I finally found the solution... it was indeed a problem with the headers, specifically the User-Agent one.
After lots of searching I found someone having the same problem as me with the same site. Although his code was different, the important bit was that he set the UserAgent attribute of the request manually to that of a browser. I think I had done this before, but I may have done it pretty badly... sorry.
The final code if it is of interest to any one is this:
public static string Http(string url) {
    if (url.Length > 0)
    {
        Uri myUri = new Uri(url);
        // Create a 'HttpWebRequest' object for the specified url.
        HttpWebRequest myHttpWebRequest = (HttpWebRequest)WebRequest.Create(myUri);
        // Set the user agent as if we were a web browser
        myHttpWebRequest.UserAgent = @"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4";
        HttpWebResponse myHttpWebResponse = (HttpWebResponse)myHttpWebRequest.GetResponse();
        var stream = myHttpWebResponse.GetResponseStream();
        var reader = new StreamReader(stream);
        var html = reader.ReadToEnd();
        // Release resources of response object.
        myHttpWebResponse.Close();
        return html;
    }
    else { return "NO URL"; }
}
Thank you very much for helping.
There can be a dozen possible causes for your problem.
One of them is that the redirect from the server points to an FTP site, or something like that.
It could also be that the server requires certain headers in the request that you're failing to provide.
Check what a browser would send to the site and try to replicate it; a sketch is shown below.
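For instance, a minimal sketch of replicating browser headers on an HttpWebRequest (the header values below are examples; copy the real ones from your browser's developer tools or Fiddler for the site in question):
var request = (HttpWebRequest)WebRequest.Create(url);
request.AllowAutoRedirect = true; // follow same-protocol redirects
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36";
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
request.Headers.Set("Accept-Language", "en-US,en;q=0.8");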