I am trying to download and save the favicon for various websites. For the majority the following code works. However, I have a problem with some URLs, for example:
https://www.bestbuy.com/favicon.ico (Best Buy)
https://www.macys.com/favicon.ico (Macy's)
I can open these URLs in my default browser (Firefox) without any problems.
This is the code I'm using to make the HttpWebRequest, and where I get the exception:
HttpWebRequest request = WebRequest.Create(uri) as HttpWebRequest;
request.Timeout = 10000;
request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
request.Headers.Add("Upgrade-Insecure-Requests", "1");
request.CookieContainer = new CookieContainer();
request.UserAgent = "Application name here";
HttpWebResponse response = request.GetResponse() as HttpWebResponse; // times out here for the URLs above
Any ideas why the example URLs time out (again, most work fine)?
You're getting blocked by your user agent. Send something a browser would send. I used this:
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36
HttpWebRequest request = WebRequest.Create(uri) as HttpWebRequest;
request.Timeout = 10000;
request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
request.Headers.Add("Upgrade-Insecure-Requests", "1");
request.CookieContainer = new CookieContainer();
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36";
HttpWebResponse response = request.GetResponse() as HttpWebResponse;
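Once the response comes back, saving the icon is just a stream copy. A minimal sketch (the local file name favicon.ico is an assumption, use whatever path you need):
// Copy the response body to disk; "favicon.ico" is a placeholder path.
using (Stream body = response.GetResponseStream())
using (FileStream file = File.Create("favicon.ico"))
{
    body.CopyTo(file);
}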
Related
I wrote a batch job that parses HTML pages of gearbest.com to extract item data (example link link).
It worked until 2-3 weeks ago, when the site was updated.
Since then I can't download the pages to parse, and I don't understand why.
Before the update I made the request with the following HtmlAgilityPack code.
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url); // this is the point where the exception is thrown
I also tried without the library, adding some data to the request:
HttpWebRequest request = (HttpWebRequest) WebRequest.Create("https://it.gearbest.com/tv-box/pp_009940949913.html");
request.Credentials = CredentialCache.DefaultCredentials;
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36";
request.ContentType = "text/html; charset=UTF-8";
request.CookieContainer = new CookieContainer();
request.Headers.Add("accept-language", "it-IT,it;q=0.9,en-US;q=0.8,en;q=0.7");
request.Headers.Add("accept-encoding", "gzip, deflate, br");
request.Headers.Add("upgrade-insecure-requests", "1");
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8";
WebResponse response = request.GetResponse(); // exception is thrown here
The exception is:
IOException: Unable to read data from the transport connection
SocketException: The connection could not be established.
If I try to request the main page (https://it.gearbest.com) it works.
What's the problem in your opinion?
For some reason it doesn't like the provided user agent. If you omit setting UserAgent, everything works fine:
HttpWebRequest request = (HttpWebRequest) WebRequest.Create("https://it.gearbest.com/tv-box/pp_009940949913.html");
request.Credentials = CredentialCache.DefaultCredentials;
//request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36";
request.ContentType = "text/html; charset=UTF-8";
Another solution would be setting request.Connection to a random string (but not keep-alive or close):
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36";
request.Connection = "random value";
It also works but I cannot explain why.
Might be worth a try...
request.KeepAlive = false;
request.ProtocolVersion = HttpVersion.Version10;
https://stackoverflow.com/a/16140621/1302730
If I use a regular URL it works fine, but if I use the Google Domains update URL I get a 401 error. This is my first try at a C# application.
HttpWebRequest request = WebRequest.Create("https://UUUUUUUUUUUUU:PPPPPPPPPPPPP#domains.google.com/nic/update?hostname=subdomain.example.com") as HttpWebRequest;
//request.Accept = "application/xrds+xml";
request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.102 Safari/537.36 Viv/1.97.1246.7";
request.UseDefaultCredentials = true;
request.PreAuthenticate = true;
request.Credentials = CredentialCache.DefaultCredentials;
HttpWebResponse response = (HttpWebResponse)request.GetResponse(); // the 401 is thrown here
WebHeaderCollection header = response.Headers;
var encoding = Encoding.ASCII;
using (var reader = new System.IO.StreamReader(response.GetResponseStream(), encoding))
{
string responseText = reader.ReadToEnd();
//responseddns = responseText;
MessageBox.Show(responseText);
}
If I use http://example.com/getip.php it works fine; I can see the output.
You cannot use `CredentialCache.DefaultCredentials` here: those are your Windows credentials, while domains.google.com expects your Google Domains credentials. HttpWebRequest also ignores the user:password part embedded in the URL. You need to supply your Google credentials explicitly, or else keep using http://example.com/getip.php as you did before.
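A minimal sketch of passing them explicitly as HTTP Basic authentication (UUUUUUUUUUUUU and PPPPPPPPPPPPP stand for your Google Domains username and password, as in the question):
// Keep the credentials out of the URL and send them as a Basic auth header instead.
HttpWebRequest request = WebRequest.Create("https://domains.google.com/nic/update?hostname=subdomain.example.com") as HttpWebRequest;
string token = Convert.ToBase64String(Encoding.ASCII.GetBytes("UUUUUUUUUUUUU:PPPPPPPPPPPPP"));
request.Headers.Add("Authorization", "Basic " + token);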
I'm trying to send a simple GET request to a Varnish server, and the code below only works if an HTTP sniffer is on; otherwise I get a 430 Unauthorized response. If you visit https://identity.ticketmaster.com/v1/me in a browser you should get a 400 status code, because no credentials are specified, but with the code below we get 430 Unauthorized instead.
string userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36";
Uri requestUri = new Uri("https://identity.ticketmaster.com/v1/me");
ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls11 | SecurityProtocolType.Tls12 | SecurityProtocolType.Ssl3;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(requestUri);
request.Method = "GET";
request.Host = "identity.ticketmaster.com";
request.KeepAlive = true;
request.Headers.Add("Cache-Control", "max-age=0");
request.Headers.Add("Upgrade-Insecure-Requests", "1");
request.UserAgent = userAgent;
request.Accept = "*/*";
request.Headers.Add("Accept-Language", "en-US,en;q=0.9");
string str = "";
try{
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
str = new StreamReader(response.GetResponseStream()).ReadToEnd();
Console.WriteLine(str);
response.Close();
request.Abort();
}
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
}
With any browser it works fine, and if I build the request with any other request builder it also works. It looks like Varnish is somehow able to detect that the request is being sent by a .NET app and fires the 430 Unauthorized message.
I tried Wireshark, and the only difference I can see is in the TLS 1.2 handshake.
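One thing that might be worth trying, since the handshake is the only visible difference (a sketch, not a confirmed fix): stop offering SSL3 and TLS 1.1, so the ClientHello advertises only TLS 1.2 the way a current browser does.
// Offer only TLS 1.2 instead of SSL3/TLS1.1/TLS1.2; this changes the ClientHello the server sees.
ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;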
First of all, I think my issue is different from the other topics on Stack Overflow, since I've already tried the solutions in them.
I'm using .NET 4.5:
HttpWebRequest MainRequest = (HttpWebRequest)WebRequest.Create(url);
WebHeaderCollection myWebHeaderCollection = MainRequest.Headers;
MainRequest.Method = "GET";
MainRequest.Host = "aro.example.com";
MainRequest.Timeout = 20000;
MainRequest.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
myWebHeaderCollection.Add("Upgrade-Insecure-Requests", "1");
MainRequest.UserAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36";
myWebHeaderCollection.Add("Accept-Encoding", "gzip, deflate, sdch");
myWebHeaderCollection.Add("Accept-Language", "en-US,en;q=0.8");
MainRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
MainRequest.ServicePoint.Expect100Continue = false;
MainRequest.CookieContainer = new CookieContainer();
HttpWebResponse MainResponse = (HttpWebResponse)MainRequest.GetResponse();
This throws the "the underlying ..." exception.
I added:
ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
I get no exception now, but also no response until the timeout occurs.
Is it possible to tell a web request to only get text-based data from a site? And if so, how should I do this?
The only thing I can imagine is to search the response string and remove all the image tags, but that is a very bad way to do it...
EDIT: this is my code snippet:
string baseUrl = kvPair.Value[0];
string loginUrl = kvPair.Value[1];
string notifyUrl = kvPair.Value[2];
CookieContainer cc = new CookieContainer();
string loginDetails = DataCollector.GetLoginDetails(baseUrl, ref cc);
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(loginUrl);
request.Method = "POST";
request.Accept = "text/*";
request.ContentType = "application/x-www-form-urlencoded; charset=UTF-8";
request.CookieContainer = cc;
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36";
Byte[] data = Encoding.ASCII.GetBytes(loginDetails);
request.ContentLength = data.Length;
using (Stream s = request.GetRequestStream())
{
s.Write(data, 0, data.Length);
}
HttpWebResponse res = (HttpWebResponse)request.GetResponse();
request = (HttpWebRequest)WebRequest.Create(notifyUrl);
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36";
request.CookieContainer = cc;
res = (HttpWebResponse)request.GetResponse();
Stream streamResponse = res.GetResponseStream();
using (StreamReader sr = new StreamReader(streamResponse))
{
ViewData["data"] += "<div style=\"float: left; margin-bottom: 50px;\">" + sr.ReadToEnd() + "</div>";
}
I found a good coding solution myself:
public static string StripImages(string input)
{
    // Remove every <img ...> tag from the HTML (non-greedy match).
    return Regex.Replace(input, "<img.*?>", String.Empty);
}
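Usage (assuming html holds the string returned by sr.ReadToEnd() in the snippet above):
// Strip the <img> tags from the downloaded page before rendering it.
string htmlWithoutImages = StripImages(html);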
This kills all the images, but only after the full page has already been downloaded, so there are no savings in transferred data with this solution...
Section 14.1 of the HTTP/1.1 Header Field Definitions (RFC 2616) contains the Accept header definition. It states the following:
... If an Accept header field is present, and if the server cannot send a response which is acceptable according to the combined Accept field value, then the server SHOULD send a 406 (not acceptable) response.
So it is up to the server whether it respects the client's request.
I have found that most servers ignore the Accept header. So far I have found only one exception: GitHub. I requested the GitHub homepage with audio as the Accept value, and it responded appropriately with response code 406.
Try the following snippet for a demo; you should get System.Net.WebException: The remote server returned an error: (406) Not Acceptable.
HttpWebRequest request = (HttpWebRequest) WebRequest.Create("https://github.com/");
request.Method = "GET";
request.Accept = "audio/*";
var response = request.GetResponse();