Grabbing HTML from URL doesn't work - any tips? - c#

I have tried several methods in C# using WebClient and WebResponse, and they all return
<html><head><meta http-equiv="REFRESH" content="0; URL=http://www.windowsphone.com/en-US/games?list=xbox"><script type="text/javascript">function OnBack(){}</script></head></html>
instead of the HTML a browser renders when you go to http://www.windowsphone.com/en-US/games?list=xbox
How would you go about grabbing the HTML from that location?
http://www.windowsphone.com/en-US/games?list=xbox
Thanks!
/edit: examples added:
Tried:
string inputUrl = "http://www.windowsphone.com/en-US/games?list=xbox";
string resultHTML = String.Empty;
Uri inputUri = new Uri(inputUrl);
WebRequest request = WebRequest.CreateDefault(inputUri);
request.Method = "GET";
WebResponse response;
try
{
    response = request.GetResponse();
    using (StreamReader reader = new StreamReader(response.GetResponseStream()))
    {
        resultHTML = reader.ReadToEnd();
    }
}
catch { }
Tried:
string inputUrl = "http://www.windowsphone.com/en-US/games?list=xbox";
string resultHTML = String.Empty;
WebClient webClient = new WebClient();
try
{
    resultHTML = webClient.DownloadString(inputUrl);
}
catch { }
Tried:
string inputUrl = "http://www.windowsphone.com/en-US/games?list=xbox";
string resultHTML = String.Empty;
WebResponse objResponse;
WebRequest objRequest = HttpWebRequest.Create(inputUrl);
try
{
    objResponse = objRequest.GetResponse();
    using (StreamReader sr = new StreamReader(objResponse.GetResponseStream()))
    {
        resultHTML = sr.ReadToEnd();
        sr.Close();
    }
}
catch { }

I checked this URL, and you need to handle the cookies.
When you try to access the page for the first time, you are redirected to an https URL on login.live.com and then redirected back to the original URL. The https page sets a cookie called MSPRequ for the domain login.live.com. If you do not have this cookie, you cannot access the site.
I tried disabling cookies in my browser and it ends up looping infinitely back to the URL https://login.live.com/login.srf?wa=wsignin1.0&rpsnv=11&checkda=1&ct=1328303901&rver=6.1.6195.0&wp=MBI&wreply=http:%2F%2Fwww.windowsphone.com%2Fen-US%2Fgames%3Flist%3Dxbox&lc=1033&id=268289. It's been going on for several minutes now and doesn't appear it will ever stop.
So you will have to grab the cookie from the https page when it is set, and persist that cookie for your subsequent requests.
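A minimal sketch of that idea, assuming HttpWebRequest: attach a single CookieContainer to the request so any Set-Cookie headers received along the redirect chain are stored and replayed. Note that HttpWebRequest only follows real HTTP redirects; if the site bounces you through a <meta> refresh (as in the HTML you got back), you would still have to re-request the target URL yourself with the same container.

using System;
using System.IO;
using System.Net;

class CookiePersistenceSketch
{
    static void Main()
    {
        string inputUrl = "http://www.windowsphone.com/en-US/games?list=xbox";

        // One container shared by every request, so cookies set along the
        // login.live.com round trip are sent back on later requests.
        CookieContainer cookies = new CookieContainer();

        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(inputUrl);
        request.CookieContainer = cookies;
        request.AllowAutoRedirect = true;   // follows HTTP 3xx redirects only
        request.UserAgent = "Mozilla/5.0";  // some hosts behave differently without one

        string resultHTML;
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            resultHTML = reader.ReadToEnd();
        }

        // If resultHTML is still the meta-refresh stub, issue a second request
        // to the same URL with the same CookieContainer: the cookies gathered
        // above will be sent automatically this time.
        Console.WriteLine(resultHTML);
    }
}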

This might be because the server you are requesting HTML from returns different HTML depending on the User Agent string. You might try something like this
webClient.Headers.Add ("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
That particular header may not work, but you could try others that would mimic standard browsers.
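For completeness, a minimal runnable sketch of that suggestion with WebClient (the user-agent string is just an example; swap in whichever browser string you want to mimic):

using System;
using System.Net;

class UserAgentSketch
{
    static void Main()
    {
        using (WebClient webClient = new WebClient())
        {
            // Pretend to be a browser; some servers return different HTML otherwise.
            webClient.Headers.Add("user-agent",
                "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");

            string resultHTML = webClient.DownloadString(
                "http://www.windowsphone.com/en-US/games?list=xbox");
            Console.WriteLine(resultHTML);
        }
    }
}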

Related

WebRequest loading HTMLDocument coming back with all special characters for SSL site

Pretty standard implementation of HttpWebRequest; whenever I pass a certain URL to get the HTML, it comes back as nothing but special characters (an example is in the image in the edit below, since it doesn't paste well as text).
Now this site is SSL, so I'm wondering if that has something to do with it, but I've never had this problem before and I've used this with other SSL sites.
// HtmlDocument here is presumably HtmlAgilityPack's; 'url' and 'html' are declared elsewhere
ServicePointManager.ServerCertificateValidationCallback = new System.Net.Security.RemoteCertificateValidationCallback(AcceptAllCertifications);
var request = (HttpWebRequest)WebRequest.Create(url);
using (var response = (HttpWebResponse)request.GetResponse())
{
    Stream data = response.GetResponseStream();
    HtmlDocument hDoc = new HtmlDocument();
    using (StreamReader readURLContent = new StreamReader(data))
    {
        html = readURLContent.ReadToEnd();
        hDoc.LoadHtml(html);
    }
}
I can't really find anything on this specific issue, so I'm kind of lost; if anybody could point me in the right direction, that would be awesome.
Edit: here's an image of what it looks like, since I can't copy and paste it.
My guess is that the response is compressed. If you use a web debugger like Charles or Fiddler, you can see how the requests are structured and what data they contain, which makes it a lot easier to replicate the HTTP requests later on when programming them. Try the following code.
try
{
    string webAddr = url;
    var httpWebRequest = (HttpWebRequest)WebRequest.Create(webAddr);
    httpWebRequest.ContentType = "text/html; charset=utf-8";
    httpWebRequest.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0";
    httpWebRequest.AllowAutoRedirect = true;
    httpWebRequest.Method = "GET";
    httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
    var httpResponse = (HttpWebResponse)httpWebRequest.GetResponse();
    using (var streamReader = new StreamReader(httpResponse.GetResponseStream(), Encoding.UTF8))
    {
        var responseText = streamReader.ReadToEnd();
        doc.LoadHtml(responseText);
    }
}
catch (WebException ex)
{
    Console.WriteLine(ex.Message);
}
The code sets the encoding on the request. You can also set the encoding on the StreamReader when reading the response, and automatic decompression is enabled.
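As a hedged variation on the "set the encoding at the StreamReader" point: instead of hard-coding UTF-8 you can read the charset the server reports via HttpWebResponse.CharacterSet and fall back to UTF-8 when it is missing or unknown. The helper name below is just illustrative.

using System;
using System.IO;
using System.Net;
using System.Text;

static class HtmlFetcher   // hypothetical helper, not part of the answer above
{
    public static string Download(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;

        using (var response = (HttpWebResponse)request.GetResponse())
        {
            // Prefer the charset the server declares; fall back to UTF-8.
            Encoding encoding = Encoding.UTF8;
            if (!string.IsNullOrEmpty(response.CharacterSet))
            {
                try { encoding = Encoding.GetEncoding(response.CharacterSet); }
                catch (ArgumentException) { /* unknown charset name, keep UTF-8 */ }
            }

            using (var reader = new StreamReader(response.GetResponseStream(), encoding))
            {
                return reader.ReadToEnd();
            }
        }
    }
}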

Get full url from shorten url in C#.net

I am developing an application where I need to capture basic details like the title, description and images of a website, based on a URL provided by the user.
But the user may enter www.google.com instead of http://www.google.com, and my C#.NET code fails to retrieve data for "www.google.com" with the code below:
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(new Uri(url));
request.Method = WebRequestMethods.Http.Get;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
StreamReader reader = new StreamReader(response.GetResponseStream());
String responseString = reader.ReadToEnd();
response.Close();
It fails with the error "Invalid URI: The format of the URI could not be determined."
So, does anyone know a technique to find the full URL based on a shortened URL?
For example: google.com or www.google.com
Expected output: http://www.google.com or https://www.google.com
PS: I found an online tool (http://urlex.org/) that returns the full URL based on a shortened URL.
Thanks in advance.
You can use UriBuilder to create a URL with HTTP as default scheme:
UriBuilder urb = new UriBuilder("www.google.com");
Uri uri = urb.Uri;
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(uri);
request.Method = WebRequestMethods.Http.Get;
string responseString;
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    using (StreamReader reader = new StreamReader(response.GetResponseStream()))
    {
        responseString = reader.ReadToEnd();
    }
}
If your URL contains a scheme, it will use that one instead of the default HTTP scheme. I have also wrapped the response and reader in using statements to release the unmanaged resources.
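A quick illustration of that default (the expected output is shown in the comments, assuming standard UriBuilder behaviour):

using System;

class UriBuilderDemo
{
    static void Main()
    {
        // No scheme supplied: UriBuilder defaults to http.
        Console.WriteLine(new UriBuilder("www.google.com").Uri);         // http://www.google.com/
        // Explicit scheme supplied: it is kept as-is.
        Console.WriteLine(new UriBuilder("https://www.google.com").Uri); // https://www.google.com/
    }
}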
So, does anyone know a technique to find the full URL based on a shortened URL?
I may have misunderstood your issue here, but can't you just append "http://" if it's missing?
string url = "www.google.com";
if (!url.StartsWith("http"))
    url = $"http://{url}";

HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(new Uri(url));
request.Method = WebRequestMethods.Http.Get;
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
    String responseString = reader.ReadToEnd();
}
This is basically what a web browser does when you don't specify any protocol.

C#: Post login data to form and process website's response

I am trying to write a C# program to prolong the deadline for my books in my university's library. What I want to do is the following:
1.) Login to library website via WebRequest & POST method, with username & password entered in C# program
2.) Get the url to "View borrowed books" site containing the encrypted password & plain text username as GET parameters
3.) Download content of named page to display to user in the C# program
4.) If the user presses the corresponding button in the program, submit the prolongation form to the website to prolong all media at once.
Right now I'm stuck between 1 and 2: I seem to be able to connect to the website and submit the user data, but the WebResponse I get is the login page again (which is not the case if you log in manually on the website).
This is the method I wrote to connect to the website:
// Login function, logs the user in, uses passed user number & password
public static Boolean userLogin(String unr, String pass)
{
    // Login
    // Cookie needed for maintaining php session
    // (note: cContainer is created here but never attached to the requests below)
    CookieContainer cContainer = new CookieContainer();
    Console.WriteLine(unr + "," + pass);
    String postUrl = "https://universitylibrary.com/loan/DB=4/LNG=DU/USERINFO_LOGIN";
    String formParams = String.Format("ACT={0}&HOST_NAME={1}&HOST_PORT={2}&HOST_SCRIPT={3}&LOGIN={4}&STATUS={5}&BOR_U={6}&BOR_PW={7}", "UI_DATA", "", "", "", "KNOWNUSER", "HML_OK", unr, pass);
    String cookieHeader;
    WebRequest wreq = WebRequest.Create(postUrl);
    wreq.ContentType = "application/x-www-form-urlencoded";
    wreq.Method = "POST";
    byte[] bytes = Encoding.ASCII.GetBytes(formParams);
    wreq.ContentLength = bytes.Length;
    using (Stream os = wreq.GetRequestStream())
    {
        os.Write(bytes, 0, bytes.Length);
    }
    WebResponse resp = wreq.GetResponse();
    cookieHeader = resp.Headers["Set-cookie"];

    // Authentication trial
    String PageSource;
    String getUrl = "https://universitylibrary.com:443/loan/DB=4/USERINFO";
    WebRequest getReq = WebRequest.Create(getUrl);
    getReq.Headers.Add("Cookie", cookieHeader);
    WebResponse getResp = getReq.GetResponse();
    using (StreamReader sr = new StreamReader(getResp.GetResponseStream()))
    {
        PageSource = sr.ReadToEnd();
    }
    Console.Write(PageSource);
    return true;
}
Can you see my mistake? I get the source code and the params (username, password) printed to the console, but the returned page is again the login page.
I would just look at the PHP page, but I don't have access to any of the internal system data; all I have is the HTML page.
Any suggestions would be highly appreciated!
EDIT:
I have rethought the whole thing and rebuilt the HTTP request headers completely, as recorded by Fiddler. That part of the function looks like this now:
// Login
// Cookie needed for maintaining php session
CookieContainer cContainer = new CookieContainer();
// (HttpCookie comes from System.Web; note that 'cookie' is never actually added to cookieCol below)
HttpCookie cookie = new HttpCookie("cookie", "PSC_4='xxxxxxx'; DB='n'");
CookieCollection cookieCol = new CookieCollection();
cookieCol.Add(cookieCol);
cContainer.Add(cookieCol);
Console.WriteLine(unr + "," + pass);
String postUrl = "https://universitylibrary.com:443/loan/DB=4/USERINFO";
String formParams = String.Format("ACT={0}&HOST_NAME={1}&HOST_PORT={2}&HOST_SCRIPT={3}&LOGIN={4}&STATUS={5}&BOR_U={6}&BOR_PW={7}", "UI_DATA", "", "", "", "KNOWNUSER", "HML_OK", unr, pass);
String cookieHeader;
HttpWebRequest wreq = (HttpWebRequest)WebRequest.Create(postUrl);
wreq.Referer = "https://universitylibrary.com/loan/DB=4/LNG=DU/USERINFO_LOGIN";
wreq.KeepAlive = true;
wreq.ContentLength = 119;
wreq.Host = "universitylibrary.com";
wreq.ContentType = "application/x-www-form-urlencoded";
wreq.Method = "POST";
wreq.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0";
wreq.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
wreq.SendChunked = true;
wreq.TransferEncoding = "gzip, deflate";
byte[] bytes = Encoding.ASCII.GetBytes(formParams);
wreq.ContentLength = bytes.Length;
using (Stream os = wreq.GetRequestStream())
{
    os.Write(bytes, 0, bytes.Length);
}
cookieHeader = "";
try
{
    HttpWebResponse resp = (HttpWebResponse)wreq.GetResponse();
    cookieHeader = resp.Headers["Set-cookie"];
}
catch (WebException ex)
{
    Console.WriteLine(ex.Status);
    Console.WriteLine(ex.Response);
}
The whole thing still doesn't work though, same issue as before.
Is it possible that HttpWebRequest can't handle HTTPS, or that something else is missing for HTTPS to work? (HTTP and HTTPS requests seem syntactically identical and the port is correctly set to 443; the real difference seems to lie in the additional SSL/TLS layer. Maybe I need to add this somewhere?)
When you are returned the login page, it probably means that you have not been correctly authenticated. There could be several reasons for this, but in the end it is because you are not correctly mirroring the HTTP communication of a manual login on the website.
What I usually do is to use a monitor such as Fiddler to capture the full request/response pattern from the manual login, which I can then subsequently mirror in my code.
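For what it's worth, a minimal sketch of the cookie side of that mirroring, attaching one CookieContainer to both the login POST and the follow-up GET. The URLs and form fields are copied from the question, the method name is just illustrative, and the exact headers still need to match what Fiddler shows:

using System;
using System.IO;
using System.Net;
using System.Text;

static class LibraryLoginSketch
{
    // Hypothetical variant of userLogin: one CookieContainer shared by the login
    // POST and the follow-up GET, so the session cookie the server sets is
    // replayed automatically on the second request.
    public static string FetchBorrowedBooks(string unr, string pass)
    {
        CookieContainer session = new CookieContainer();

        var login = (HttpWebRequest)WebRequest.Create(
            "https://universitylibrary.com/loan/DB=4/LNG=DU/USERINFO_LOGIN");
        login.Method = "POST";
        login.ContentType = "application/x-www-form-urlencoded";
        login.CookieContainer = session;   // the piece missing in the question's code

        byte[] body = Encoding.ASCII.GetBytes(
            "ACT=UI_DATA&HOST_NAME=&HOST_PORT=&HOST_SCRIPT=&LOGIN=KNOWNUSER&STATUS=HML_OK"
            + "&BOR_U=" + Uri.EscapeDataString(unr)
            + "&BOR_PW=" + Uri.EscapeDataString(pass));
        login.ContentLength = body.Length;
        using (Stream os = login.GetRequestStream())
        {
            os.Write(body, 0, body.Length);
        }
        using (var loginResp = (HttpWebResponse)login.GetResponse())
        {
            // Nothing to read here; the cookies now live in 'session'.
        }

        var getReq = (HttpWebRequest)WebRequest.Create(
            "https://universitylibrary.com/loan/DB=4/USERINFO");
        getReq.CookieContainer = session;  // same container, same PHP session
        using (var getResp = (HttpWebResponse)getReq.GetResponse())
        using (var sr = new StreamReader(getResp.GetResponseStream()))
        {
            return sr.ReadToEnd();
        }
    }
}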
If you control the login page, you could modify it so that, after checking the login data, its only output is a marker such as "SuccessfullySignIn".
If that is the data you receive, you can check for it:
if (PageSource == "SuccessfullySignIn")
{
    // Do something
}
You could also try to use redirect features in your web page.

Can't get HTML code through HttpWebRequest

I am trying to parse the HTML code of the page at http://odds.bestbetting.com/horse-racing/today in order to have a list of races, etc.
The problem is that I am not able to retrieve the HTML code of the page. Here is the C# code of the function:
public static string Http(string url)
{
    Uri myUri = new Uri(url);
    // Create a 'HttpWebRequest' object for the specified url.
    HttpWebRequest myHttpWebRequest = (HttpWebRequest)WebRequest.Create(myUri);
    myHttpWebRequest.AllowAutoRedirect = true;
    // Send the request and wait for response.
    HttpWebResponse myHttpWebResponse = (HttpWebResponse)myHttpWebRequest.GetResponse();
    var stream = myHttpWebResponse.GetResponseStream();
    var reader = new StreamReader(stream);
    var html = reader.ReadToEnd();
    // Release resources of response object.
    myHttpWebResponse.Close();
    return html;
}
When I execute the program calling the function it throws an exception on
HttpWebResponse myHttpWebResponse = (HttpWebResponse)myHttpWebRequest.GetResponse();
which is:
Cannot handle redirect from HTTP/HTTPS protocols to other dissimilar ones.
I have read this question but I don't seem to have the same problem.
I've also tried figuring something out by sniffing the traffic with Fiddler, but I can't see where it redirects to or anything similar. I just extracted these two possible redirections: odds.bestbetting.com/horse-racing/2011-06-10/byCourse and odds.bestbetting.com/horse-racing/2011-06-10/byTime, but querying them produces the same result as above.
It's not the first time I've done something like this, but I'm really lost on this one. Any help?
Thanks!
I finally found the solution... it was indeed a problem with the headers, specifically the User-Agent one.
After lots of searching I found someone having the same problem as me with the same site. Although his code was different, the important bit was that he manually set the UserAgent property of the request to that of a browser. I think I had tried this before, but I may have done it pretty badly... sorry.
The final code if it is of interest to any one is this:
public static string Http(string url)
{
    if (url.Length > 0)
    {
        Uri myUri = new Uri(url);
        // Create a 'HttpWebRequest' object for the specified url.
        HttpWebRequest myHttpWebRequest = (HttpWebRequest)WebRequest.Create(myUri);
        // Set the user agent as if we were a web browser
        myHttpWebRequest.UserAgent = @"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4";
        HttpWebResponse myHttpWebResponse = (HttpWebResponse)myHttpWebRequest.GetResponse();
        var stream = myHttpWebResponse.GetResponseStream();
        var reader = new StreamReader(stream);
        var html = reader.ReadToEnd();
        // Release resources of response object.
        myHttpWebResponse.Close();
        return html;
    }
    else
    {
        return "NO URL";
    }
}
Thank you very much for helping.
There could be a dozen possible causes for your problem.
One of them is that the redirect from the server is pointing to an FTP site, or something like that.
It could also be that the server requires some headers in the request that you're failing to provide.
Check what a browser would send to the site and try to replicate it.
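A short sketch of that last point, copying a few browser-like headers onto the request (the header values are illustrative; take the real ones from a capture of your browser):

using System;
using System.IO;
using System.Net;

class BrowserLikeRequest
{
    static void Main()
    {
        // Mimic what a browser sends, as observed in a debugger such as Fiddler.
        var request = (HttpWebRequest)WebRequest.Create(
            "http://odds.bestbetting.com/horse-racing/today");
        request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0";
        request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
        request.Headers.Add("Accept-Language", "en-US,en;q=0.5");
        request.AllowAutoRedirect = true;

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}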

performing http methods using windows application in c#

There are many sites which call a script upon form submit and pass in parameters using HTTP POST or GET; using a web debugger I have found the parameters being passed. Now I wish to do the same thing through my Windows application in C#. How can I achieve such functionality?
I am currently using the HttpWebRequest and HttpWebResponse classes in C#, but it is a pain, as I have to write explicit code for each page I try to load and work with. For example, I am passing a username and password to a PHP page and reading the response, which sends back a cookie and a page, based on which I identify whether the user has logged in or not.
HttpWebRequest loginreq = createreq("http://www.indyarocks.com/mobile/index.php");
String logintext = "username=" + TxtUsrname.Text + "&pass=" + TxtPasswd.Password + "&button.x=0&button.y=0";
loginreq.ContentLength = logintext.Length;
StreamWriter writerequest = new StreamWriter(loginreq.GetRequestStream());
writerequest.Write(logintext);
writerequest.Close();

HttpWebResponse getloginpageresponse = (HttpWebResponse)loginreq.GetResponse();
cookie = getloginpageresponse.Cookies[0];
BinaryFormatter bf1 = new BinaryFormatter();
Stream f1 = new FileStream("E:\\cookie.dat", FileMode.OpenOrCreate);
bf1.Serialize(f1, cookie);
f1.Close();

string nexturl = getloginpageresponse.Headers[HttpResponseHeader.Location];
StreamReader readresponse = new StreamReader(getloginpageresponse.GetResponseStream());
if (nexturl == "p_mprofile.php")
{
    MessageBox.Show("Login Successful");
    GrpMsg.IsEnabled = true;
}
else if (nexturl == "index.php?msg=1")
{
    MessageBox.Show("Invalid Credentials Login again");
}
This is my createreq method:
private HttpWebRequest createreq(string url)
{
    HttpWebRequest temp = (HttpWebRequest)WebRequest.Create(url);
    temp.Method = "POST";
    temp.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506; .NET CLR 3.5.21022; FDM)";
    temp.KeepAlive = true;
    temp.ContentType = "application/x-www-form-urlencoded";
    temp.CookieContainer = new CookieContainer();
    temp.AllowAutoRedirect = false;
    return temp;
}
Am i on the right track? Is there any better way to do it?
You should use System.Net.WebClient.
You can use it to make a request with any method and headers that you'd like, and get the resulting page with a simple stream read.
There's a simple example on the MSDN page, but some sample code for using it might look like:
WebClient webclient = new WebClient();
using (StreamReader reader = new StreamReader(webclient.OpenRead("http://www.google.com")))
{
    string result = reader.ReadToEnd();
    // Parse web page here
}
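Since the question is about POSTing form parameters, a hedged sketch using WebClient.UploadValues is below. The field names mirror the question's login form and the values are placeholders; note that a plain WebClient does not remember cookies between requests, so for cookie-based logins you would still need HttpWebRequest with a CookieContainer (or a WebClient subclass that attaches one).

using System;
using System.Collections.Specialized;
using System.Net;
using System.Text;

class WebClientPostSketch
{
    static void Main()
    {
        using (var webclient = new WebClient())
        {
            // Form fields as seen in the question's login POST (placeholder values).
            var form = new NameValueCollection
            {
                { "username", "myUser" },
                { "pass", "myPassword" },
                { "button.x", "0" },
                { "button.y", "0" }
            };

            // UploadValues sends an application/x-www-form-urlencoded POST
            // and returns the raw response body.
            byte[] responseBytes = webclient.UploadValues(
                "http://www.indyarocks.com/mobile/index.php", "POST", form);
            string responseHtml = Encoding.UTF8.GetString(responseBytes);
            Console.WriteLine(responseHtml);
        }
    }
}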
