I'm trying to scrape a web page using C#; however, after the page loads, it executes some JavaScript that loads more elements into the DOM, and I need to scrape those as well. A standard scraper simply grabs the HTML of the page on load and doesn't pick up the DOM changes made via JavaScript. How do I add some functionality that waits a second or two and then grabs the source?
Here is my current code:
private string ScrapeWebpage(string url, DateTime? updateDate)
{
HttpWebRequest request = null;
HttpWebResponse response = null;
Stream responseStream = null;
StreamReader reader = null;
string html = null;
try
{
//create request (which supports http compression)
request = (HttpWebRequest)WebRequest.Create(url);
request.Pipelined = true;
request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");
if (updateDate != null)
request.IfModifiedSince = updateDate.Value;
//get response.
response = (HttpWebResponse)request.GetResponse();
responseStream = response.GetResponseStream();
if (response.ContentEncoding.ToLower().Contains("gzip"))
responseStream = new GZipStream(responseStream,
CompressionMode.Decompress);
else if (response.ContentEncoding.ToLower().Contains("deflate"))
responseStream = new DeflateStream(responseStream,
CompressionMode.Decompress);
//read html.
reader = new StreamReader(responseStream, Encoding.Default);
html = reader.ReadToEnd();
}
catch
{
throw;
}
finally
{
//dispose of objects.
request = null;
if (response != null)
{
response.Close();
response = null;
}
if (responseStream != null)
{
responseStream.Close();
responseStream.Dispose();
}
if (reader != null)
{
reader.Close();
reader.Dispose();
}
}
return html;
}
Here's a sample URL:
http://www.realtor.com/realestateandhomes-search/geneva_ny#listingType-any/pg-4
You'll see when the page first loads it says 134 listings found, then after a second it says 187 properties found.
To execute the JavaScript I use WebKit to render the page; it's the engine used by Chrome and Safari. Here is an example using its Python bindings.
WebKit also has .NET bindings, but I haven't used them.
The approach you have will not work regardless of how long you wait; you need a browser to execute the JavaScript (or something that understands JavaScript).
Try this question:
What's a good tool to screen-scrape with Javascript support?
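If you want to stay in C#, one option (my suggestion, not something from the linked question) is to drive a real browser with the Selenium WebDriver package and read the DOM after the scripts have run. A rough sketch, assuming the Selenium WebDriver and ChromeDriver packages are installed:

using System;
using System.Threading;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class RenderedScrape
{
    static void Main()
    {
        // Requires the chromedriver executable alongside an installed Chrome browser.
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl("http://www.realtor.com/realestateandhomes-search/geneva_ny#listingType-any/pg-4");

            // Crude "wait a second or two" so the page's JavaScript can add the extra listings.
            Thread.Sleep(2000);

            // PageSource reflects the DOM as it stands now, not the raw server response.
            string renderedHtml = driver.PageSource;
            Console.WriteLine(renderedHtml.Length);
        }
    }
}

A fixed sleep is the crude version of "wait a second or two"; Selenium's WebDriverWait can wait for a specific element to appear instead.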
You would need to execute the JavaScript yourself to get this functionality. Currently, your code only receives whatever the server replies with at the URL you request. The rest of the listings are "showing up" because the browser downloads, parses, and executes the accompanying JavaScript.
The answer to this similar question says to use a web browser control to load the page and process it before scraping it, perhaps with some kind of timer delay to give the JavaScript time to execute and return results.
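A minimal sketch of that idea with the WinForms WebBrowser control (assumptions: a project referencing System.Windows.Forms, an STA entry point, two seconds being enough for the scripts to finish, and frame-level DocumentCompleted events ignored for brevity):

using System;
using System.Windows.Forms;

static class BrowserControlScrape
{
    [STAThread]
    static void Main()
    {
        string html = null;
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };
        browser.DocumentCompleted += (s, e) =>
        {
            // Give the page's JavaScript a couple of seconds to add the extra
            // listings, then read the live DOM (not the original download).
            var timer = new Timer { Interval = 2000 };
            timer.Tick += (ts, te) =>
            {
                timer.Stop();
                html = browser.Document.GetElementsByTagName("html")[0].OuterHtml;
                Application.ExitThread();
            };
            timer.Start();
        };
        browser.Navigate("http://www.realtor.com/realestateandhomes-search/geneva_ny#listingType-any/pg-4");
        Application.Run(); // the control needs a message loop to navigate and run scripts

        Console.WriteLine(html == null ? 0 : html.Length);
    }
}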
I made a program using WinForms that reproduces the execution of reports.
Meaning, I input dates (from... to...) and the code reruns the reports.
I used:
System.Diagnostics.Process.Start(Url.ToString());
and it works well, but it opens IE.
Now I want to run the URL behind the scenes without displaying it in a browser.
I tried:
try
{
WebRequest myRequest = WebRequest.Create(Url.ToString());
myRequest.UseDefaultCredentials = true;
HttpWebResponse response = (HttpWebResponse)myRequest.GetResponse();
if (response.StatusCode == HttpStatusCode.OK)
{
var statusResponse = response.StatusDescription;
Stream dataStream = response.GetResponseStream();
StreamReader readerr = new StreamReader(dataStream);
string responseFromServer = readerr.ReadToEnd();
var responseServer = responseFromServer;
}
response.Close();
}
catch
{
    // The exception is swallowed here, which is what hid the 500 error described below.
}
It just doesn't work!
What did I do wrong?
Thanks
OK, I found the problem.
I had an error:
"The remote server returned an error: (500) Internal Server Error."
I just did not display it.
Thanks
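For reference, the 500 becomes visible as soon as the WebException is caught and inspected. A sketch (reusing the Url variable from the question, not the original code):

try
{
    WebRequest myRequest = WebRequest.Create(Url.ToString());
    myRequest.UseDefaultCredentials = true;
    using (var response = (HttpWebResponse)myRequest.GetResponse())
    using (var readerr = new StreamReader(response.GetResponseStream()))
    {
        string responseFromServer = readerr.ReadToEnd();
    }
}
catch (WebException ex)
{
    // The 500 from the report URL lands here; ex.Response carries the server's error page.
    MessageBox.Show(ex.Message);
    if (ex.Response != null)
    {
        using (var errorReader = new StreamReader(ex.Response.GetResponseStream()))
        {
            string errorBody = errorReader.ReadToEnd();
        }
    }
}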
The recommended approach for triggering report renderings from code is via the SOAP API: https://msdn.microsoft.com/en-us/library/ms154052.aspx
I am writing a tool that allows the user to input a URL, to which the program responds by attempting to show that website's favicon. I have this working for many sites, but one site that is giving me trouble is my self-hosted Trac site. It seems that Trac's normal behaviour, until the end user is authenticated, is to show a custom 403 page (Forbidden) inviting the user to log in. Accessing Trac from a web browser, the favicon displays in the browser's tab even though I'm not logged in (and Firebug, for instance, shows a 403 for the page content). If I view source in the browser, the favicon's location is right there in the source. However, from my application, requesting the Trac website with request.GetResponse() throws a WebException containing a 403, giving me no opportunity to read the response stream that contains the vital information required to find the favicon.
I already have code to download a website's HTML and extract the location of its favicon. What I am stuck with is downloading a site's HTML even when it responds with a 403.
I played with various UserAgent, Accept and AcceptLanguage properties of the HttpWebRequest object but it didn't help. I also tried following any redirects myself as I read somewhere that .NET doesn't do them well. Still no luck.
Here's what I have:
public static MemoryStream DownloadHtml(
string urlParam,
int timeoutMs = DefaultHttpRequestTimeoutMs,
string userAgent = "",
bool silent = false
)
{
MemoryStream result = null;
HttpWebRequest request = null;
HttpWebResponse response = null;
try
{
Func<string, HttpWebRequest> createRequest = (urlForFunc) =>
{
var requestForAction = (HttpWebRequest)HttpWebRequest.Create(urlForFunc);
// This step is now required by Wikipedia (and others?) to prevent periodic or
// even constant 403's (Forbidden).
requestForAction.UserAgent = userAgent;
requestForAction.Accept = "text/html";
requestForAction.AllowAutoRedirect = false;
requestForAction.Timeout = timeoutMs;
return requestForAction;
};
string urlFromResponse = "";
string urlForRequest = "";
do
{
if(response == null)
{
urlForRequest = urlParam;
}
else
{
urlForRequest = urlFromResponse;
response.Close();
}
request = createRequest(urlForRequest);
response = (HttpWebResponse)request.GetResponse();
urlFromResponse = response.Headers[HttpResponseHeader.Location];
}
while(urlFromResponse != null
&& urlFromResponse.Length > 0
&& urlFromResponse != urlForRequest);
using(var stream = response.GetResponseStream())
{
result = new MemoryStream();
stream.CopyTo(result);
}
}
catch(WebException ex)
{
// Things like 404 and, well, all other web-type exceptions.
Debug.WriteLine(ex.Message);
if(ex.InnerException != null) Debug.WriteLine(ex.InnerException.Message);
}
catch(System.Threading.ThreadAbortException)
{
// Let ac.Thread handle some cleanup.
throw;
}
catch(Exception)
{
if(!silent) throw;
}
finally
{
if(response != null) response.Close();
}
return result;
}
The stream content is stored in the WebException object:
var resp = new StreamReader(ex.Response.GetResponseStream()).ReadToEnd();
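Folded into the approach from the question, the idea looks roughly like this (a standalone sketch, assuming the server actually sends a body with the 403; the method name is illustrative, not from the original code):

using System;
using System.IO;
using System.Net;

public static class ErrorBodyReader
{
    public static string DownloadHtmlEvenOnError(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                return reader.ReadToEnd();
            }
        }
        catch (WebException ex)
        {
            // Protocol errors (403, 404, ...) still carry the server's body, including
            // the HTML that references the favicon. ex.Response is null for
            // connection-level failures, so check it before use.
            if (ex.Response == null) throw;
            using (var errorResponse = (HttpWebResponse)ex.Response)
            using (var reader = new StreamReader(errorResponse.GetResponseStream()))
            {
                return reader.ReadToEnd();
            }
        }
    }
}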
I want to download web pages' HTML code, but I have problems with several links. For example: http://www.business-top.info/, http://azerizv.az/
I receive no HTML at all using either of these:
1. WebClient:
using (var client = new WebClient())
{
client.Encoding = System.Text.Encoding.UTF8;
string result = client.DownloadString(resultUrl);
Console.WriteLine(result);
Console.ReadLine();
}
2. HTTP request/response:
var request = (HttpWebRequest)WebRequest.Create(resultUrl);
request.Method = "POST";
using (var response = (HttpWebResponse)request.GetResponse())
{
using (var stream = response.GetResponseStream())
{
StreamReader sr = new StreamReader(stream, Encoding.UTF8);
string data = sr.ReadToEnd();
Console.WriteLine(data);
Console.ReadLine();
}
}
There are many such links, so I can't just grab the HTML manually by viewing the page source in a browser.
Some pages load in stages: first they load the core of the page, and only then do they evaluate the JavaScript inside, which loads further content via AJAX. To scrape these pages you will need something more advanced than a simple HTTP request sender.
EDIT:
Here is a question in SO about the same problem that you are having now:
Jquery Ajax Web page scraping using c#
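Another option, when the extra content comes from an AJAX call, is to find that call in the browser's developer tools (network tab) and request it directly. A rough sketch with a purely hypothetical endpoint URL:

using System;
using System.Net;

class AjaxEndpointExample
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            client.Encoding = System.Text.Encoding.UTF8;
            // Hypothetical endpoint; the real URL and parameters depend entirely on the site.
            string ajaxUrl = "http://www.example.com/api/listings?page=1";
            client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0";
            string data = client.DownloadString(ajaxUrl);
            Console.WriteLine(data);
        }
    }
}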
I am trying to pass HttpWebResponse data from a method that checks whether the web address entered by the user exists to another method that uses a StreamReader to get the HTML source code and work with it later. Even though it doesn't show any error, I am not getting the source code in the prepared listbox. There is also a button click event, which I am not including because it shouldn't have any impact on the problem.
protected bool ZkontrolujExistenciStranky(string WebovaStranka)
{
try
{
var pozadavek = WebRequest.Create(WebovaStranka) as HttpWebRequest;
pozadavek.Method = "HEAD";
using (var odezva = (HttpWebResponse)pozadavek.GetResponse())
{
GetData(odezva);
return odezva.StatusCode == HttpStatusCode.OK;
}
}
catch
{
return false;
}
}
protected void GetData(HttpWebResponse ziskanaOdezva)
{
using (Stream strm = ziskanaOdezva.GetResponseStream())
{
StreamReader reader = new StreamReader(strm);
string prochazec;
while ((prochazec = reader.ReadLine()) != null)
{
listBox1.Items.Add(prochazec);
}
}
}
You are using the HEAD method, whose whole point is not to return a body; only headers are returned. Use GET if you want the body.
HTTP HEAD method:
The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request.
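Applied to the code in the question, that just means requesting with GET before reading the body. A sketch reusing the question's method names (one possible arrangement, not the only one):

protected bool ZkontrolujExistenciStranky(string WebovaStranka)
{
    try
    {
        var pozadavek = WebRequest.Create(WebovaStranka) as HttpWebRequest;
        pozadavek.Method = "GET"; // GET returns the body; HEAD deliberately does not
        using (var odezva = (HttpWebResponse)pozadavek.GetResponse())
        {
            if (odezva.StatusCode != HttpStatusCode.OK)
                return false;
            GetData(odezva); // the response stream now actually contains the HTML
            return true;
        }
    }
    catch
    {
        return false;
    }
}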
I'm trying to access the last.fm APIs via C#. As a first test I'm querying similar artists if that matters.
I get an XML response when I pass a correct artist name, e.g. "Nirvana". My problem is that when I pass an invalid name (e.g. "Nirvana23") I don't receive XML but an error code (403 or 400) and a WebException.
Interesting thing: if I enter the URL in a browser (tested with Firefox and Chrome), I receive the XML file I want (containing a last.fm-specific error message).
I tried both XmlReader and XDocument:
XDocument doc = XDocument.Load(requestUrl);
and HttpWebRequest:
string httpResponse = "";
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(requestUrl);
HttpWebResponse response = null;
StreamReader reader = null;
try
{
response = (HttpWebResponse)request.GetResponse();
reader = new StreamReader(response.GetResponseStream());
httpResponse = reader.ReadToEnd();
}
The URL is something like "http://ws.audioscrobbler.com/2.0/?method=artist.getsimilar&artist=Nirvana23" (and a specific key given by lastfm, but even without it - it should return XML). A link to give it a try: link (this is the error file I cannot access via C#).
What I also tried (without success): comparing the requests sent by the browser and by my program with the help of Wireshark. Then I added some headers to the request, but that didn't help either.
In .NET, WebRequest converts HTTP error status codes into exceptions, while your browser just ignores them since the response is not empty. If you catch the exception, GetResponseStream on the exception's Response should still return the error XML you are expecting.
Edit:
Try this:
string httpResponse = "";
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(requestUrl);
WebResponse response = null;
StreamReader reader = null;
try
{
response = request.GetResponse();
}
catch (WebException ex)
{
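// A 400/403 from last.fm still carries the XML error body. For connection-level
// failures (DNS, timeout) ex.Response is null, so a null check is wise here.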
response = ex.Response;
}
reader = new StreamReader(response.GetResponseStream());
httpResponse = reader.ReadToEnd();
Why don't you catch the exception and then process it accordingly? If you want to display a custom error, you can do that in your catch block as well.