I'm trying to scrape a web page from a C# application, but it keeps responding with
"The remote server returned an error: (404) Not Found."
The page is accessible through a browser, but the app keeps failing. Any help appreciated.
var d = DateTime.UtcNow.Date;
var AddressString = @"http://www.booking.com/searchresults.html?src=searchresults&si=ai%2Cco%2Cci%2Cre%2Cdi&ss={0}&checkin_monthday={1}&checkin_year_month={2}&checkout_monthday={3}&checkout_year_month={4}";
var URi = String.Format(AddressString, "Prague", d.Day, d.Year + "-" + d.Month, d.Day + 1, d.Year + "-" + d.Month);
var request = (HttpWebRequest)WebRequest.Create(URi);
request.Timeout = 5000;
request.UserAgent = "Fiddler"; // I tried setting these three properties so they wouldn't be null
request.Credentials = CredentialCache.DefaultCredentials;
request.Proxy = WebProxy.GetDefaultProxy();
try
{
var response = (HttpWebResponse)request.GetResponse();
}
catch(WebException e)
{
var response = (HttpWebResponse)e.Response; // e.Response contains the web page, but it is incomplete
StreamReader sr = new StreamReader(response.GetResponseStream());
HtmlDocument doc = new HtmlDocument();
doc.Load(sr);
var a = doc.DocumentNode.SelectNodes("//div[@class='resut-details']"); // fails, as not all of the desired nodes are in the response
}
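Unrelated to the 404, there is a latent bug in the snippet above: checkout_monthday is computed as d.Day + 1, which produces an invalid day (32) on the last day of a month. A minimal sketch of the safer variant, reusing the same format string:

```csharp
using System;

// Sketch: build the checkout fields with AddDays so the values stay valid
// across month/year boundaries ("d.Day + 1" yields 32 on 31 January).
var AddressString = @"http://www.booking.com/searchresults.html?src=searchresults&si=ai%2Cco%2Cci%2Cre%2Cdi&ss={0}&checkin_monthday={1}&checkin_year_month={2}&checkout_monthday={3}&checkout_year_month={4}";
var d = DateTime.UtcNow.Date;
var checkout = d.AddDays(1); // rolls over correctly at month and year ends
var URi = String.Format(AddressString, "Prague",
    d.Day, d.Year + "-" + d.Month,
    checkout.Day, checkout.Year + "-" + checkout.Month);
```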
EDIT:
Hi guys, thanks for the suggestions.
I added the header "Accept-Encoding: gzip,deflate,sdch" according to David Martin's reply, but it didn't help on its own.
I used Fiddler to try to get some information about the problem, but I was seeing that app for the first time and it didn't make me any smarter. On the other hand, I tried changing request.UserAgent to the one sent by my browser ("User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36") and voilà, I am no longer getting the 404 exception, but the document is not readable, as it is filled with characters like ¿½O~���G�. I tried setting request.TransferEncoding = "UTF-8", but to enable that property, request.SendChunked must be set to true, which ends in
ProtocolViolationException
Additional information: Content-Length or Chunked Encoding cannot be set for an operation that does not write data.
EDIT 2:
I was forgetting something and couldn't figure out what. I was getting a compressed response and needed to decompress it first to read it correctly. Even in Fiddler, when I wanted to see the response, I had to confirm decoding to inspect the result. After decoding it in Fiddler, I got just what I wanted to get into my application...
So, after trying the suggestions from Jon Skeet and David Martin I got somewhat further and found the relevant answer on a newer question in another topic. If anyone is ever looking for something similar, the answer is here:
.NET: Is it possible to get HttpWebRequest to automatically decompress gzip'd responses?
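For the record, the fix in that linked answer comes down to a single property on the request. A minimal sketch (the URL is just a placeholder) that makes HttpWebRequest send Accept-Encoding and decompress gzip/deflate bodies transparently:

```csharp
using System.Net;

var request = (HttpWebRequest)WebRequest.Create("http://www.booking.com/");
// The framework adds the Accept-Encoding header and inflates the
// compressed response stream before you read it:
request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
```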
I am new to dotnet-core.
I am writing some scraping code. My previous practice with AliExpress works fine for me with the same pattern.
Now I am stuck with requests to Walmart.
When I use the following code with any other website, it returns an OK response and the required data.
HttpWebRequest wRequest = (HttpWebRequest) WebRequest.Create(url);
// wRequest.Timeout = 10000
wRequest.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2";
using (HttpWebResponse httpResponse = (HttpWebResponse)wRequest.GetResponse())
{
if (httpResponse.StatusCode == HttpStatusCode.OK)
{
System.IO.StreamReader sr = new System.IO.StreamReader(httpResponse.GetResponseStream());
var responseString= sr.ReadToEnd();
Debug.Write(responseString);
}
}
but when I do the same with Walmart it returns a 404 Not Found error.
Stranger still, the following (other) code works for me against Walmart with C# on the .NET Core 2.1 framework in a console project.
When I import it into the main project it again returns a 404 error.
WebClient wReq = new WebClient();
HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(wReq.DownloadString(URL));
I have used all the headers found via Fiddler, and even a cookie container, but no luck. I can't understand what the issue is.
PS: I have tried the code above, which is rejected for Walmart, with another (random) marketplace URL and it works. But no luck with Walmart.
Just for information for anyone stuck on the same or a similar issue:
my URL for Walmart was not right.
I was checking the URL by splitting it into parts (authority, query string, etc.)
and then combining it back.
During recombination an extra "/" was appended to the end of the URL, which made it invalid,
so removing that did the trick.
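For anyone recombining URLs the same way, trimming the separator before putting the parts back together avoids the stray "/". A small sketch (the authority/path/query values here are made up):

```csharp
// Sketch: rebuild a URL from parts without the trailing "/" that made the
// Walmart URL invalid. All part values here are hypothetical.
string authority = "https://www.walmart.com";
string path = "/ip/some-product/12345/"; // picked up a trailing "/" during splitting
string query = "?athbdg=L1600";

string url = authority + path.TrimEnd('/') + query;
// url is now "https://www.walmart.com/ip/some-product/12345?athbdg=L1600"
```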
I'm writing an interface to scrape info from a service. The link is behind a login, so I keep a copy of the cookies and then attempt to loop through the pages to get stats for our users.
The URLs to hit are of the format: https://domain.com/groups/members/1234
for the first page, and each subsequent page appends ?page=X
string vUrl = "https://domain.com/groups/members/1234";
if (pageNumber > 1) vUrl += "?page=" + (pageNumber).ToString();
HttpWebRequest groupsRequest = (HttpWebRequest)WebRequest.Create(vUrl);
groupsRequest.CookieContainer = new CookieContainer();
groupsRequest.CookieContainer.Add(cookies); //recover cookies First request
groupsRequest.Method = WebRequestMethods.Http.Get;
groupsRequest.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36";
groupsRequest.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
groupsRequest.UseDefaultCredentials = true;
groupsRequest.AutomaticDecompression = DecompressionMethods.GZip;
groupsRequest.Headers.Add("Accept-Language", "en-US,en;q=0.8");
groupsRequest.Headers.Add("Cache-Control", "max-age=0");
HttpWebResponse getResponse = (HttpWebResponse)groupsRequest.GetResponse();
This works fine for the first page and I get the data back that I need, but on each subsequent pass the query string is ignored. Debugging at the last line shows that the request's RequestUri.Query is correct, but the response's ResponseUri.Query is blank. So it has the effect of always returning page 1 data.
I've tried to mimic the request headers that I see via Inspect in Chrome, but I'm stuck. Help?
When you put the failing URL into a browser, does it work? Because it is a GET, the browser should make the same request and tell you whether it works. If it does not work in the browser, then perhaps you are missing something aside from the query string.
If it does work, then use Fiddler to find out exactly what headers, cookies, and query-string values are being sent, to make 100% sure that you are sending the correct request. It could be that the query string is not enough information to get the data you need from the request you are sending.
If you still can't get it, then capture the browser's request in Fiddler and use the Fiddler extension that turns a request into code to see what's up.
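One more check worth coding (an assumption on my part, since we can't see the server): a login redirect can silently drop the query string. The query is demonstrably attached to the request, so after the call, comparing it against response.ResponseUri shows whether a redirect rewrote the URL:

```csharp
using System;
using System.Net;

var request = (HttpWebRequest)WebRequest.Create("https://domain.com/groups/members/1234?page=2");
Console.WriteLine(request.RequestUri.Query); // "?page=2" -- the query is on the request

// After the call, ResponseUri is the URI that actually answered; if a redirect
// stripped "?page=2", its Query comes back empty:
//   using (var response = (HttpWebResponse)request.GetResponse())
//       Console.WriteLine(response.ResponseUri.Query);
```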
I am working on an API-based app in Xamarin using the HttpWebRequest class. I have to send a request to the URL
http://example.com/APIRequest/Request?Parts=33333|N|2014|ABCD
But when I see this request in Fiddler it shows me URL like
http://example.com/APIRequest/Request?Parts=33333%7CN%7C2014%7CABCD
Now the problem is that the server does not understand this encoded URL and returns errors, which is beyond my control.
Earlier, in a .NET 2.0 C# application, I was using
Uri url = new Uri(rawurl, true);
But the second parameter has been deprecated in the .NET 4.0 MonoTouch available in Xamarin, so it gives an error or simply does nothing.
I have tried all the possible ways, like UrlDecode, HtmlDecode, double decoding, or even Java's URLDecoder, but nothing has worked; Fiddler always shows the encoded URL.
Please suggest how to overcome this problem, or an alternative to the old new Uri(url-string, true) constructor.
UPDATE:
After spending hours and hours, I have probably found the culprit. The problem is:
when I use new Uri(url, true), it passes the unescaped URL containing the | (pipe) to WebRequest.Create, but if I remove true it passes the encoded URL, which produces a result the server unfortunately doesn't understand, so I get an error.
Uri ourUri = new Uri(url, true);
myHttpWebResponse1 = (System.Net.HttpWebResponse)request.GetResponse();
But it may be a bug that request.GetResponse() stops working without throwing any exception and the process hangs if I use a | (pipe) in the URL.
Any possible solution to that?
My complete function is given below (modified with hardcoded URL)
public static string getURLCustom(string GETurl, string GETreferal)
{
GETurl = "http://example.com/?req=111111|wwww|N|2014|asdwer4";
GETreferal = "";
Uri ourUri = new Uri(GETurl.Trim(), true);
HttpWebRequest request = (HttpWebRequest)(WebRequest.Create(ourUri));
request.Method = "GET";
request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.3) Gecko/20100401 Firefox/4.0 (.NET CLR 3.5.30729)";
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
request.KeepAlive = true;
request.CookieContainer = loginCookie; //stored after login
request.ContentType = "application/x-www-form-urlencoded";
request.Referer = GETreferal;
request.AllowAutoRedirect = true;
HttpWebResponse myHttpWebResponse1 = default(HttpWebResponse);
myHttpWebResponse1 = (System.Net.HttpWebResponse)request.GetResponse();
StreamReader postreqreader1 = new StreamReader(myHttpWebResponse1.GetResponseStream());
return postreqreader1.ReadToEnd();
}
And yes, this code works perfectly in a .NET 2.0 Windows application, but not in the Xamarin MonoTouch app.
It seems the server you are connecting to does not support internationalized resource identifiers (IRI).
IRI is enabled by default since Mono 3.10 (see the Mono 3.10 release notes).
You can disable it on your client application by doing:
FieldInfo iriParsingField = typeof (Uri).GetField ("s_IriParsing",
BindingFlags.Static | BindingFlags.GetField | BindingFlags.NonPublic);
if (iriParsingField != null)
iriParsingField.SetValue (null, false);
You can also disable IRI parsing by setting the environment variable MONO_URI_IRIPARSING to false.
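If the environment variable is the easier route in your setup, it can also be set from the app itself; note (my assumption, given the cached s_IriParsing field above) that it has to run before the first Uri is constructed in the process:

```csharp
using System;

// Must execute before any System.Uri is created, since the IRI flag is
// read once at Uri's static initialization and then cached.
Environment.SetEnvironmentVariable("MONO_URI_IRIPARSING", "false");
```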
I have some code that fetches searches from Google, and I noticed recently that the HTML retrieved contains extra characters compared to a web browser's response. I noticed Google seems to be forcing HTTPS, which might be the issue. If someone could help me figure this out, I'd appreciate it.
string URL = "http://www.google.com/search?hl=en&safe=off&q=test";
HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create(URL);
myRequest.Proxy = null;
myRequest.Method = "GET";
myRequest.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0";
myRequest.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
myRequest.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");
myRequest.Headers.Add("Accept-Language", "en-us,en;q=0.5");
WebResponse myResponse = myRequest.GetResponse();
StreamReader sr = new StreamReader(myResponse.GetResponseStream(), System.Text.Encoding.UTF8);
string result = sr.ReadToEnd();
sr.Close();
myResponse.Close();
TextWriter tw2 = new StreamWriter(Directory.GetCurrentDirectory() + "\\google.html");
tw2.WriteLine(result);
tw2.Close();
Here is a comparison between the result from the code and my web browser. The first one is from the code; notice the ‎ near the end. (The other slight differences don't affect anything and are probably due to different headers or something.)
Speedtest.net by Ookla - The Global Broadband Speed <em>Test</em></a></h3><div class="s"><div><div class="f kv" style="white-space:nowrap"><cite class="_md"><cite class="visurl">speedtest.net</cite><cite class="visurl"></cite></cite>‎<div
Speedtest.net by Ookla - The Global Broadband Speed <em>Test</em></a></h3><div class="s"><div><div class="f kv _xu" style="white-space:nowrap"><cite class="_md">www.speed<b>test</b>.net/</cite><div
The problem is with your regex, not with the response. It's completely normal to have non-ASCII characters in a Unicode response, and you must expect them; we are living in the Unicode epoch now. They are there because they are present in Google's response, and that's not a bug, it's a feature. :)
There is absolutely nothing wrong with WebRequest.
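One defensive tweak on the reading side, separate from the point about the response content: instead of hard-coding UTF-8, honour the charset the server declares in Content-Type. A sketch, with a made-up helper name:

```csharp
using System.Text;

// Hypothetical helper: map the Content-Type charset to an Encoding,
// falling back to UTF-8 for a missing or unknown name.
static Encoding PickEncoding(string charset)
{
    if (string.IsNullOrEmpty(charset)) return Encoding.UTF8;
    try { return Encoding.GetEncoding(charset); }
    catch (ArgumentException) { return Encoding.UTF8; }
}

// Usage with the response from the question:
//   var enc = PickEncoding(((HttpWebResponse)myResponse).CharacterSet);
//   var sr = new StreamReader(myResponse.GetResponseStream(), enc);
```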
When I try to load HTML from a server over HTTPS, it returns error code 500, but when I open the same link in a browser it works fine. Is there any way to do this? I'm using WebClient and also sending user-agent information to the server:
HttpWebRequest req1 = (HttpWebRequest)WebRequest.Create("https://mobile.unibet.com/");
req1.UserAgent = @"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5";
var response1 = req1.GetResponse();
var responsestream1 = response1.GetResponseStream();
David is correct: this generally happens when the server expects some headers that are not passed through, in your case Accept.
This code works now:
string requestUrl = "https://mobile.unibet.com/unibet_index.t";
var request = (HttpWebRequest)WebRequest.Create(requestUrl);
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
request.UserAgent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";
using (var response = request.GetResponse() as HttpWebResponse)
{
using (var sr = new StreamReader(response.GetResponseStream()))
{
var responsestring = sr.ReadToEnd();
if (!string.IsNullOrEmpty(responsestring))
{
Console.WriteLine(responsestring);
}
}
}
This should probably be a comment, but there's not enough room in a comment for all the questions. I don't think the question has enough information to answer with any level of confidence.
A 500 error means a problem at the server. The short answer is that the browser is sending some content that the WebClient is not.
The WebClient may not be sending headers that are expected by the server. Does the server require authentication? Is this a page from a company that you've contracted with, which perhaps provided you with credentials or an API key? Do you need to add an HTTP Authorization header?
If this is something you're doing with a company you have a partnership with, you should be able to ask them to help trace why you're getting a 500 error. Otherwise, you may need to provide a code sample and more details so we can offer more suggestions.
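If basic authentication turns out to be what the server expects, wiring it in is short either way; the "user"/"pass" values below are placeholders, and only the provider can confirm which scheme they actually require:

```csharp
using System;
using System.Net;
using System.Text;

var client = new WebClient();
// Basic auth is just base64("user:pass") in the Authorization header
// ("user"/"pass" are placeholders):
string token = Convert.ToBase64String(Encoding.ASCII.GetBytes("user:pass"));
client.Headers[HttpRequestHeader.Authorization] = "Basic " + token;
// Or let the stack negotiate for you:
//   client.Credentials = new NetworkCredential("user", "pass");
```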