WPF C# WebCrawler

I'm trying to develop a web crawler to extract some information from my company's web site, but I'm getting the error below:
An exception of type System.Net.WebException occurred in System.dll but was not handled in user code
Additional information: The remote server returned an error: (500) Internal Server Error.
Here are the request headers the browser sends to the website:
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8
Connection:keep-alive
Cookie:WASReqURL=https://:9446/ProcessPortal/jsp/index.jsp; com_ibm_bpm_process_portal_hash=null
Host:ca8webp.itau:9446
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17
Here is my code:
public bool Acessar(CredencialWAO credencial)
{
    bool logado = true;

    #region INITIAL PAGE REQUEST
    ParametrosCrawler parametrosRequest = new ParametrosCrawler("https://ca8webp.itau:9446/ProcessPortal/login.jsp");
    parametrosRequest.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    parametrosRequest.AcceptEncoding = "gzip, deflate, sdch";
    parametrosRequest.AcceptLanguage = "en-US,en;q=0.8";
    parametrosRequest.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17";
    parametrosRequest.CacheControl = "max-age=0";
    #endregion

    return logado;
}
protected override WebResponse GetWebResponse(WebRequest request)
{
    var response = base.GetWebResponse(request);
    AtualizarCookies(((HttpWebResponse)response).Cookies);
    return response;
}
The error occurs when calling "base.GetWebResponse(request)"
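For comparison, here is a minimal sketch using plain HttpWebRequest rather than the ParametrosCrawler wrapper from the question, showing how the captured browser headers and a CookieContainer could be sent; portals like this frequently answer 500 when a cookie or header they expect is missing. The class and method names below are illustrative, not part of the original code.

using System.IO;
using System.Net;

class CrawlerSketch
{
    public static string GetLoginPage()
    {
        var cookies = new CookieContainer();
        var request = (HttpWebRequest)WebRequest.Create("https://ca8webp.itau:9446/ProcessPortal/login.jsp");
        request.Method = "GET";
        request.CookieContainer = cookies;   // keeps WASReqURL and the portal hash cookie between calls
        request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
        request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17";
        request.Headers[HttpRequestHeader.AcceptLanguage] = "en-US,en;q=0.8";
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}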

Related

Received an invalid header name: '#<FCGI' in C# when I try to get data from a server

I'm trying to connect from a C# app to a web server so I can read values from it and save them to a database. The server has a login, so I'm trying to replicate the browser behaviour. When I send the login request, I get an exception from System.Net.Requests with the message: Received an invalid header name: '#<FCGI'.
I've tried to add the header with the statement request.Headers.Add("#<FCGI:");, but then I get a System.Net.WebHeaderCollection exception with the following message: Specified value has invalid HTTP Header characters. (Parameter 'name').
The code I'm using is the following:
HttpWebRequest request = (HttpWebRequest)WebRequest.CreateHttp($"http://{this.ipaddr}/app/app.app?cmd=authority_login&username=user&userpass=password");
request.KeepAlive = true;
request.ContentType = "application/json";
request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0";
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
request.Method = "GET";
request.Headers.Add("#<FCGI:");
WebResponse response1 = request.GetResponse();
I captured the traffic to the server with Wireshark (picture here), and the full header is:
#<FCGI: :CGI:0x405b5068 #output_hidden=nil, #cookies={"sid"=>["ad7ee82e6731416ef7f1d0e80b266906"], "dispzone"=>["client"], "lang"=>["en"]}, #args=nil, #multipart=false, #output_cookies=nil, #params={"username"=>["User"], "cmd"=>["authority_login"], "userpass"=>["espec"]}, #request=#<FCGI::Request:0x405b5230 #id=1, #out=#<StringIO:0x405b50e0>, #in=#<StringIO:0x405b5170>, #data=#<StringIO:0x405b5140>, #env={"SERVER_NAME"=>"192.168.30.146", "HTTP_HOST"=>"192.168.30.146", "HTTP_ACCEPT"=>"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", "HTTP_USER_AGENT"=>"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0", "SERVER_PROTOCOL"=>"HTTP/1.1", "HTTP_ACCEPT_LANGUAGE"=>"es-ES,es;q=0.8,en-US;q=0.5,en;q=0.3", "SERVER_SOFTWARE"=>"lighttpd/1.4.13", "REMOTE_ADDR"=>"192.168.30.29", "PATH_INFO"=>"", "SERVER_ADDR"=>"192.168.30.146", "SCRIPT_NAME"=>"/app/app.app", "HTTP_UPGRADE_INSECURE_REQUESTS"=>"1", "HTTP_COOKIE"=>"lang=en; dispzone=client; sid=ad7ee82e6731416ef7f1d0e80b266906", "REMOTE_PORT"=>"51580", "REQUEST_URI"=>"/app/app.app?cmd=authority_login&username=User&userpass=password", "SERVER_PORT"=>"80", "PATH"=>"/bin:/usr/bin:/sbin:/usr/sbin", "DOCUMENT_ROOT"=>"/espec/html/", "REQUEST_METHOD"=>"GET", "SCRIPT_FILENAME"=>"/espec/html/app/app.app", "GATEWAY_INTERFACE"=>"CGI/1.1", "QUERY_STRING"=>"cmd=authority_login&username=User&userpass=password", "REDIRECT_STATUS"=>"200", "HTTP_ACCEPT_ENCODING"=>"gzip, deflate", "HTTP_CONNECTION"=>"keep-alive"}, #err=#<StringIO:0x405b50b0>>>
I get a warning from Wireshark saying "Illegal characters found in header name". Is there a way to "skip" this header so my app can read the content of the page?
Thank you in advance!
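On .NET Framework there is a classic app.config switch for tolerant header parsing (<httpWebRequest useUnsafeHeaderParsing="true" /> under <system.net><settings>), but it may not accept a header name this malformed and it has no effect on .NET Core / .NET 5+. A hedged sketch of another route, assuming the server speaks plain HTTP on port 80 as in the Wireshark capture: bypass header validation entirely by issuing the request over a raw TcpClient and treating the whole response as text. RawHttpSketch and GetRaw are illustrative names, not an existing API.

using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

class RawHttpSketch
{
    public static string GetRaw(string host, string pathAndQuery)
    {
        using (var client = new TcpClient(host, 80))
        using (var stream = client.GetStream())
        {
            // HTTP/1.0 keeps the response un-chunked, so "read to end" is enough.
            var request =
                $"GET {pathAndQuery} HTTP/1.0\r\n" +
                $"Host: {host}\r\n" +
                "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n" +
                "Connection: close\r\n" +
                "\r\n";
            var bytes = Encoding.ASCII.GetBytes(request);
            stream.Write(bytes, 0, bytes.Length);

            // The malformed "#<FCGI" header is just text here, so nothing throws.
            // Split headers from body on the first blank line.
            using (var reader = new StreamReader(stream, Encoding.UTF8))
            {
                var full = reader.ReadToEnd();
                var bodyStart = full.IndexOf("\r\n\r\n", StringComparison.Ordinal);
                return bodyStart >= 0 ? full.Substring(bodyStart + 4) : full;
            }
        }
    }
}

You would call this with the values from the question, for example GetRaw("192.168.30.146", "/app/app.app?cmd=authority_login&username=user&userpass=password").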

C# GetResponse() times out, but works in a browser

I'm trying to read the response I get from nyc.gov. I used Fiddler to construct the WebRequest, and it keeps timing out.
Important: this works if the URL is https://www.google.com, so it's got to be something about the nyc.gov server. But how can it tell the difference between my code and Chrome?
I tried setting KeepAlive to true/false/none.
I tried using HTTP/1.0.
I tried setting request.ServicePoint.Expect100Continue to false.
I tried setting request.ContentLength = 0;
I tried enclosing the request in a "using" block.
I added this to app.config:
<system.net>
  <connectionManagement>
    <add address="*" maxconnection="1000" />
  </connectionManagement>
</system.net>
Here is my code:
HttpWebResponse response = null;
try
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://a810-bisweb.nyc.gov/bisweb/bispi00.jsp");
    request.KeepAlive = true;
    request.Headers.Add("Upgrade-Insecure-Requests", "1");
    request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36";
    request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3";
    request.Headers.Set(HttpRequestHeader.AcceptEncoding, "gzip, deflate");
    request.Headers.Set(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.9");
    response = (HttpWebResponse)request.GetResponse();
}
catch (WebException e)
{
    if (e.Status == WebExceptionStatus.ProtocolError) response = (HttpWebResponse)e.Response;
    else return false;
}
catch (Exception)
{
    if (response != null) response.Close();
    return false;
}
Here is the RAW request (provided by Fiddler) from Chrome - WORKS:
GET http://a810-bisweb.nyc.gov/bisweb/bispi00.jsp HTTP/1.1
Host: a810-bisweb.nyc.gov
Connection: keep-alive
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9
And this is the RAW request from my code - HANGS (and eventually times out)
GET http://a810-bisweb.nyc.gov/bisweb/bispi00.jsp HTTP/1.1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9
Host: a810-bisweb.nyc.gov
Connection: Keep-Alive
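One thing worth trying (a hedged sketch, not a guaranteed fix, since the nyc.gov server may simply be filtering non-browser clients): issue the same request through HttpClient, which assembles the request (header order, connection handling) differently from HttpWebRequest, and compare the new Fiddler capture against the Chrome one. NycSketch and FetchAsync are illustrative names.

using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class NycSketch
{
    public static async Task<string> FetchAsync()
    {
        var handler = new HttpClientHandler
        {
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
        };
        using (var client = new HttpClient(handler))
        {
            client.Timeout = TimeSpan.FromSeconds(30);
            client.DefaultRequestHeaders.TryAddWithoutValidation("Upgrade-Insecure-Requests", "1");
            client.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36");
            client.DefaultRequestHeaders.TryAddWithoutValidation("Accept",
                "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3");
            client.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Language", "en-US,en;q=0.9");

            // Capture this request in Fiddler and diff it against the working Chrome request above.
            return await client.GetStringAsync("http://a810-bisweb.nyc.gov/bisweb/bispi00.jsp");
        }
    }
}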

Validating Large Number of URLs and Adding Results to ListBox Freezes UI

I'm validating a set of URLs. When the URL count reaches 30,000 the program's performance deteriorates.
Parallel.ForEach(urilist, line =>
{
    if (!IsHttpStatusOk(line.ToString()))
    {
        listBox1.Items.Add(line.ToString());
    }
});

public static bool IsHttpStatusOk(string url)
{
    try
    {
        var request = WebRequest.Create(url) as HttpWebRequest;
        request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36";
        request.AllowAutoRedirect = false;
        using (var response = request.GetResponse())
            return (response as HttpWebResponse).StatusCode == HttpStatusCode.OK;
    }
    catch (Exception)
    {
        return false;
    }
}
I need to display the invalid URLs to the user, and I use a ListBox for that. This works to an extent, but once the number of invalid URLs exceeds some limit, the UI freezes, CPU usage spikes, and the program becomes non-responsive.
How can I solve this issue? The same problem occurs with a DataGridView, and with even fewer URLs.
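The freeze most likely comes from touching listBox1 on Parallel.ForEach worker threads and from adding items one at a time. A hedged sketch of one common pattern, assuming WinForms, the urilist/listBox1/IsHttpStatusOk members from the question, and the usual using System.Collections.Concurrent and System.Threading.Tasks directives: collect the failures off the UI thread, then add them to the ListBox in a single batch.

private async Task ValidateUrlsAsync()
{
    // Collect failures on worker threads; ConcurrentBag is safe to share between them.
    var invalid = new ConcurrentBag<string>();

    await Task.Run(() =>
        Parallel.ForEach(urilist, line =>
        {
            if (!IsHttpStatusOk(line.ToString()))
                invalid.Add(line.ToString());
        }));

    // Back on the UI thread after the await: one batched update instead of
    // thousands of individual Add calls, each of which repaints the control.
    listBox1.BeginUpdate();
    listBox1.Items.AddRange(invalid.ToArray());
    listBox1.EndUpdate();
}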

I am getting a different page with HttpWebRequest in C#

I am making an HttpWebRequest to retrieve data from americanapparel.net using this code:
var request = (HttpWebRequest)WebRequest.Create("http://store.americanapparel.net/en/sports-bra_rsaak301?c=White");
request.UserAgent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/49.0.2623.108 Chrome/49.0.2623.108 Safari/537.36";
var response = request.GetResponse();
//cli.Headers.Add ("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
using (var reader = new StreamReader(response.GetResponseStream()))
{
    var data = reader.ReadToEnd();
    return data;
}
I am receiving data from this URL:
http://store.americanapparel.net/en/sports-bra_rsaak301?c=White
But the live page is different from the data my HttpWebRequest receives.
How can I get the exact page data in C#?
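A hedged diagnostic sketch (not a fix): save the fetched HTML to a file, assuming a writable path such as C:\temp, and diff it against the browser's "View Source" for the same URL. If the two sources match but the rendered page still differs, the extra content is being added by JavaScript after load, which HttpWebRequest never executes.

// Requires using System.IO; and using System.Net;
var request = (HttpWebRequest)WebRequest.Create("http://store.americanapparel.net/en/sports-bra_rsaak301?c=White");
request.UserAgent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/49.0.2623.108 Chrome/49.0.2623.108 Safari/537.36";
request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;

using (var response = request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    // Dump the HTML exactly as the server sent it, for comparison with "View Source".
    File.WriteAllText(@"C:\temp\fetched.html", reader.ReadToEnd());
}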

Can't download utf-8 web content

I have simple code for getting a response from a Vietnamese website, http://vnexpress.net, but there is a small problem. The first time it downloads fine, but after that the content contains unknown symbols like this: �\b\0\0\0\0\0\0�\a`I�%&/m... What is the problem?
string address = "http://vnexpress.net";
WebClient webClient = new WebClient();
webClient.Headers.Add("user-agent", "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11 AlexaToolbar/alxg-3.1");
webClient.Encoding = System.Text.Encoding.UTF8;
return webClient.DownloadString(address);
You'll find that the response is GZipped. There doesn't appear to be a way to download that with WebClient, unless you create a derived class and modify the underlying HttpWebRequest to allow automatic decompression.
Here's how you'd do that:
public class MyWebClient : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        var req = base.GetWebRequest(address) as HttpWebRequest;
        req.AutomaticDecompression = DecompressionMethods.GZip;
        return req;
    }
}
And to use it:
string address = "http://vnexpress.net";
MyWebClient webClient = new MyWebClient();
webClient.Headers.Add("user-agent", "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11 AlexaToolbar/alxg-3.1");
webClient.Encoding = System.Text.Encoding.UTF8;
return webClient.DownloadString(address);
Try this code and you'll be fine:
string address = "http://vnexpress.net";
WebClient webClient = new WebClient();
webClient.Headers.Add("user-agent", "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11 AlexaToolbar/alxg-3.1");
return Encoding.UTF8.GetString(Encoding.Default.GetBytes(webClient.DownloadString(address)));
DownloadString requires that the server correctly indicate the charset in the Content-Type response header. If you watch in Fiddler, you'll see that the server instead sends the charset inside a META Tag in the HTML response body:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
If you need to handle responses like this, you need to either parse the HTML yourself or use a library like FiddlerCore to do this for you.
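A minimal sketch of the "parse the HTML yourself" route, reusing the gzip-aware MyWebClient from above; the regex and the CharsetSniffSketch name are illustrative and only cover the common charset= patterns in a Content-Type META tag.

using System;
using System.Text;
using System.Text.RegularExpressions;

class CharsetSniffSketch
{
    public static string DownloadWithMetaCharset(string address)
    {
        using (var webClient = new MyWebClient())   // the gzip-aware WebClient shown above
        {
            byte[] raw = webClient.DownloadData(address);

            // First pass: decode permissively just to locate the META charset declaration.
            string probe = Encoding.ASCII.GetString(raw);
            var match = Regex.Match(probe,
                @"charset\s*=\s*[""']?([A-Za-z0-9\-_]+)",
                RegexOptions.IgnoreCase);

            Encoding encoding = Encoding.UTF8;   // reasonable default for the web
            if (match.Success)
            {
                try { encoding = Encoding.GetEncoding(match.Groups[1].Value); }
                catch (ArgumentException) { /* unknown charset name: keep UTF-8 */ }
            }

            // Second pass: decode the same bytes with the declared charset.
            return encoding.GetString(raw);
        }
    }
}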
