Retrieving webpage data after some delay (web scraping) - C#

The aim is to retrieve data from a website after it has finished its Ajax calls.
Currently the data is retrieved when the page first loads, but the required data sits inside a div that is only populated after an Ajax call.
To summarize, the scenario is as follows:
A webpage is requested with some parameters from C# code (currently using CsQuery). When the request is sent, the page opens, a "Loading" image is shown, and after a few seconds the required data appears. The CsQuery code, however, only retrieves the initial page contents with the "Loading" image.
The code is as follows:
// Build the request URL with the flight-search parameters.
UrlBuilder ub = new UrlBuilder("<url>")
    .AddQuery("departure", "KHI")
    .AddQuery("arrival", "DXB")
    .AddQuery("queryDate", "2013-03-28")
    .AddQuery("queryType", "D");

// CsQuery downloads and parses the response, but only the initial (pre-Ajax) HTML.
CQ dom = CQ.CreateFromUrl(ub.ToString());
CQ availableFlights = dom.Select("div#availFlightsDiv");
string RenderedDiv = availableFlights["#availFlightsDiv"].RenderSelection();

When you "scrape" a site you are making a call to the web server, and you get whatever it serves up. If the DOM of the target site is modified by JavaScript (Ajax or otherwise), you are never going to get that content unless you load the page into some kind of browser engine on the machine doing the scraping, one that is capable of executing those JavaScript calls.
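For illustration, here is a minimal sketch of that approach using a headless browser. It assumes Selenium WebDriver with headless Chrome (neither is named in the answer above), reuses the availFlightsDiv selector from the question, and hands the rendered markup back to CsQuery:

// Sketch only: assumes the Selenium.WebDriver and Selenium.Support NuGet packages.
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;
using CsQuery;

class AjaxScrapeSketch
{
    static void Main()
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless");   // run Chrome without a visible window

        using (var driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl("<url>?departure=KHI&arrival=DXB&queryDate=2013-03-28&queryType=D");

            // Wait (up to 30 s) until the Ajax call has populated the target div.
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(30));
            wait.Until(d => d.FindElements(By.Id("availFlightsDiv")).Count > 0
                            && !string.IsNullOrWhiteSpace(d.FindElement(By.Id("availFlightsDiv")).Text));

            // Hand the fully rendered markup to CsQuery so the existing parsing code still works.
            CQ dom = CQ.Create(driver.PageSource);
            string renderedDiv = dom["#availFlightsDiv"].RenderSelection();
            Console.WriteLine(renderedDiv);
        }
    }
}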

This question is almost a year old, so you might already have your answer, but I would like to mention an excellent project here: SimpleBrowser.
https://github.com/axefrog/SimpleBrowser
It keeps your DOM updated.

Related

C# HtmlAgilityPack: class nested inside many classes, need to check if a class exists

I have been working on this for days without finding a solution.
For example, I have this link: https://www.aliexpress.com/item/32682673712.html
I am trying to check whether the Buy Now button is disabled,
i.e. whether this line is inside the DOM: Buy Now
The problem is that this element is nested inside a class, which is inside another class, and so on.
I know there is an option to get a specific node with HtmlAgilityPack, but I didn't succeed:
var nodes = doc.DocumentNode.SelectNodes("//div[@class='next-btn next-large next-btn-primary buynow disable']/p");
but I don't get anything back.
I also tried to get the entire DOM and then search inside it, but that didn't work either:
var getHtmlWeb = new HtmlWeb();
var document = getHtmlWeb.Load(url);
I just got the raw HTML, not the JavaScript-rendered DOM.
Another thing I tried:
var Driver = new FirefoxDriver();
Driver.Navigate().GoToUrl(url);
string pagesource = Driver.PageSource;
and it did work! But this solution opens a browser window, and I don't want that (I am running over many links).
Please help a frustrated guy :)
Thanks.
This is happening because the Buy Now button is loaded via JavaScript.
If you open the Network tab in Chrome dev tools, you will notice that the page makes a call to an API to load the product information.
The URL with the JSON data for the product looks like this:
https://www.aliexpress.com/aeglodetailweb/api/store/header?itemId=32682673712&categoryId=100007324&sellerAdminSeq=202460364&storeNum=720855&minPrice=4.99&maxPrice=4.99&priceCurrency=USD
You will most probably have to send the same headers as Chrome sends for that request in order to load the API endpoint in your app.
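For illustration, a rough sketch of calling that endpoint with HttpClient; the header values and the JSON field name checked at the end are placeholders to be copied or confirmed from the Network tab and the actual response, not details given in this answer:

// Sketch only: header values and the "buyNowDisable" field name are placeholders;
// copy the real ones from Chrome's Network tab and from the actual JSON response.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class BuyNowCheck
{
    static async Task Main()
    {
        var apiUrl = "https://www.aliexpress.com/aeglodetailweb/api/store/header" +
                     "?itemId=32682673712&categoryId=100007324&sellerAdminSeq=202460364" +
                     "&storeNum=720855&minPrice=4.99&maxPrice=4.99&priceCurrency=USD";

        using (var client = new HttpClient())
        {
            // Mirror the headers Chrome sends for this request.
            client.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
            client.DefaultRequestHeaders.TryAddWithoutValidation("Accept", "application/json");
            client.DefaultRequestHeaders.TryAddWithoutValidation("Referer",
                "https://www.aliexpress.com/item/32682673712.html");

            string json = await client.GetStringAsync(apiUrl);

            // Crude string check; a JSON library such as Newtonsoft.Json would be more robust.
            bool buyNowDisabled = json.Contains("\"buyNowDisable\":true");
            Console.WriteLine(buyNowDisabled ? "Buy Now is disabled" : "Buy Now looks enabled");
        }
    }
}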

How to Upload Multiple Files Along with Other Form Fields, ASP.NET MVC

So I have a single form on a page. There are several text input fields and such. Right now there is also a jQuery file upload control wherein you can select several files. The problem I have right now is that I'm requiring that the user upload the files first (using the jQuery control) and then I save those files in Session state until the regular form posts the rest of the form fields. Then I pull the previously uploaded files from Session and do what I need to do.
So basically to fill out the form requires two separate POST operations back to the server. The files, then the remaining form fields.
I'm thinking there must be a better way to let a user select his/her files yet not post anything until the user submits the main form to post all the other fields. I've read several posts on this site, but I can't find one that addresses this particular issue.
Any suggestions/assistance is greatly appreciated.
I believe you can do this using Uploadify. There are two options you'd want to look at. First, set auto to false to prevent selected files from being uploaded immediately. Second, use the formData option to send your other form fields along with the file payload.
You'd then call the upload method when the user submits the form, uploading each file in the queue and sending the form data all at once.
Server Side Part:
You'll probably be submitting the form to an ASPX file or an ASHX handler. I prefer using an ASHX handler since they're more lightweight. Both will give you access to the HttpContext or the HttpRequest object. First, you'll need to check context.Request.Files.Count to make sure files were posted:
if (context.Request.Files.Count > 0) // We have files uploaded
{
    var file = context.Request.Files[0]; // First file, but there could be others
    // You can call file.SaveAs() to save the file, or file.InputStream to access a stream
}
Obtaining the other form fields should be just as easy:
var formfield = context.Request["MyFormField"]; // Some form field
You can also write results back to the client, such as a JSON encoded description of any resulting errors:
context.Response.Write(response); // Anything you write here gets passed in to onUploadSuccess
I think that should get you started anyway!
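Putting those pieces together, a minimal ASHX handler might look something like the sketch below; the field name, save path and JSON response are illustrative, not part of the answer above:

// Minimal sketch of the handler described above; names and paths are placeholders.
using System.IO;
using System.Web;

public class UploadHandler : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        // Files posted by Uploadify arrive in the same request as the form fields.
        for (int i = 0; i < context.Request.Files.Count; i++)
        {
            HttpPostedFile file = context.Request.Files[i];
            file.SaveAs(context.Server.MapPath("~/App_Data/" + Path.GetFileName(file.FileName)));
        }

        // The other form fields are available as usual.
        string formField = context.Request["MyFormField"];

        // Anything written here is passed to the client-side onUploadSuccess callback.
        context.Response.ContentType = "application/json";
        context.Response.Write("{\"status\":\"ok\"}");
    }

    public bool IsReusable { get { return false; } }
}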

Display two pdf documents on form submit in separate windows/tabs

I have a .NET MVC web application. On my page there is a form to choose which PDF docs to display. I want to open the PDF files in a new window or tab. The user can choose to display one or two PDF files. My form posts the data to the controller, but I don't know how to return two PDFs from my controller and display them in separate windows/tabs.
Does anyone have an idea how this can be done?
You can let the model write the URLs of the documents into a JavaScript code block:
@if (Model.ShowPDFs)
{
    <script>
        function ShowPDF()
        {
            window.open('@Model.PdfUrl1');
            @if (Model.Open2Pdf)
            {
                @:window.open('@Model.PdfUrl2');
            }
        }
        // opens the documents 3 seconds after the page has loaded
        setTimeout(ShowPDF, 3000);
    </script>
}
I made something similar (but I build the PDF server-side using ReportViewer) in this way:
my form posts its data to the controller action (with Ajax);
the controller action reads the posted data, queries the database accordingly, and decides how many PDFs have to be returned;
the controller action saves in the session, with a different key for every PDF (determined by my logic), the data to pass to ReportViewer;
the controller action returns (to the callback of the Ajax call) an array with all the keys used to store data in the session;
client side, the JS callback loops over the returned array and, for every item, calls a different controller (whose only responsibility is to send the PDF in the response), opening the link in a different tab and passing the key for that PDF in the query string;
the PrintController reads the data from the session (using the key it received), builds the report and sends it in the response.
I think you could do something similar. I don't know how your PDFs are built (are they data-dependent or do they already exist on the server?), but you could save the PDF stream, or the PDF path, in the session instead of the data as I do.
Hope this helps; if you think my solution can work for you and you need some code, I can try to extract some from my codebase (in my case there are other issues and I would have to rewrite the code if you need it ...).
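For illustration, here is a rough sketch of the PrintController step described above; the action name, route and the assumption that a finished PDF byte array is stored in the session are mine, not the answerer's:

// Rough sketch only: assumes the posting action stored a finished PDF as byte[] in the session.
using System.Web.Mvc;

public class PrintController : Controller
{
    // Called once per key returned to the Ajax callback, e.g. /Print/Show?key=pdf_1
    public ActionResult Show(string key)
    {
        byte[] pdfBytes = Session[key] as byte[];
        if (pdfBytes == null)
            return HttpNotFound();

        // Returning it inline (no download file name) lets the browser display it in the new tab.
        return File(pdfBytes, "application/pdf");
    }
}

The client-side callback would then call something like window.open('/Print/Show?key=' + key) for each key in the returned array.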

How to retrieve the site root URL?

I need to get the URL of the site so that I render a user control only on the main page. I need to check for http://foo.com, http://www.foo.com, and foo.com. I am a bit stumped as to how to check for all three. I tried the following, which does not work:
string domainName = Request.Url.Host.ToString();
if (domainName == "http://nomorecocktails.com" | Request.Url.Host.Contains("default.aspx"))
{ //code to push user control to page
I also tried:
var url = HttpContext.Current.Request.Url.GetLeftPart(UriPartial.Authority) + "/";
Any thoughts?
You need to check if the Request.Path property is equal to / or /Default.aspx or whatever your "main page" is. The domain name is completely irrelevant. What if I accessed your site via http://192.56.17.205/, and similarly, what if your server switched IP addresses? Your domain check would fail.
If you utilize the QueryString to display different content, you'll also need to check Request.QueryString.
Documentation for Request.Path:
http://msdn.microsoft.com/en-us/library/system.web.httprequest.path.aspx
Documentation for Request.QueryString:
http://msdn.microsoft.com/en-us/library/system.web.httprequest.querystring.aspx
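As a minimal sketch of that check (assuming the home page is Default.aspx; MyUserControl is a placeholder ID):

// Sketch: show the user control only on the home page, regardless of host name.
string path = Request.Path;
bool isHomePage = path == "/" ||
                  path.Equals("/Default.aspx", System.StringComparison.OrdinalIgnoreCase);

if (isHomePage && Request.QueryString.Count == 0)
{
    MyUserControl.Visible = true;   // code to push the user control to the page
}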
If you need the user control to only appear on the main page (I'm assuming you mean home page), then add the code to call the user control to the code behind of that file.
If this code is stored in the master page, then you can reference it like:
Master.FindControl("UserControlID");
If you are only using the one web form (i.e. just Default.aspx), then you can check that no relevant query strings are included in the URL, and display the control only if this is the case:
if (Request.QueryString["q"] == null){
//user control code
}
However, if you are using this technique, I would recommend moving to multiple web forms with master pages in the future to structure your application better.
The ASP.NET website has some good tutorials on how to do this:
http://www.asp.net/web-forms/tutorials/master-pages
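For completeness, a rough sketch of wiring the second suggestion into the page's code-behind; UserControlID and the q query-string key come from the snippets above, everything else is illustrative:

// Sketch: toggle a control that lives on the master page from Default.aspx's code-behind.
protected void Page_Load(object sender, System.EventArgs e)
{
    var userControl = Master.FindControl("UserControlID");
    if (userControl != null)
    {
        userControl.Visible = (Request.QueryString["q"] == null);
    }
}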

C# Open web page in default browser with post data

I am sure this must have been answered before, but I cannot find a solution, so I figure I am likely misunderstanding other people's solutions or trying to do something daft. But here we go.
I am writing an add-in for Outlook 2010 in C# where a user can click a button in the ribbon and submit the email contents to a website. When they click the button, the website should open in the default browser, allowing them to review what has just been submitted and interact with it on the website. I am able to do this using query strings in the URL:
System.Diagnostics.Process.Start("http://www.test.com?something=value");
but the limit on the amount of data that can be submitted and the messy URLs are preventing me from following through with this approach. I would like to use an HTTP POST for this, as it is obviously more suitable. However, the methods I have found for doing this do not seem to open the page in the browser after submitting the POST data:
http://msdn.microsoft.com/en-us/library/debx8sh9.aspx
To summarise: the user needs to be able to click the button in the Outlook ribbon, have the web browser open, and see the contents of the email, which have been submitted via POST.
EDIT:
Right, I found a way to do it. It's pretty fugly, but it works! Simply create a temporary .html file (which is then launched as above) containing a form with hidden fields for all the data, and have it submitted on page load with JavaScript.
I don't really like this solution, as it relies on JavaScript (I have a <noscript> submit button just in case) and seems like a bit of a bodge, so I am still hoping someone on here will come up with something better.
This is eight years late, but here's some code that illustrates the process pretty well:
string tempHTMLLocation = "some_arbitrary_location" + "/temp.html";
string url = "https://your_desired_url.com";

// create the temporary html file containing a self-submitting form
// (post_data1 and post_data2 hold the values to be POSTed)
using (FileStream fs = new FileStream(tempHTMLLocation, FileMode.Create)) {
    using (StreamWriter w = new StreamWriter(fs, Encoding.UTF8)) {
        w.WriteLine("<body onload=\"goToLink()\">");
        w.WriteLine("<form id=\"form\" method=\"POST\" action=\"" + url + "\">");
        w.WriteLine("<input type=\"hidden\" name=\"post1\" value=\"" + post_data1 + "\">");
        w.WriteLine("<input type=\"hidden\" name=\"post2\" value=\"" + post_data2 + "\">");
        w.WriteLine("</form>");
        w.WriteLine("<script> function goToLink() { document.getElementById(\"form\").submit(); } </script>");
        w.WriteLine("</body>");
    }
}

// launch the temp html file in the default browser
var launchProcess = new ProcessStartInfo {
    FileName = tempHTMLLocation,
    UseShellExecute = true
};
Process.Start(launchProcess);

// delete the temp file, but add a delay so the browser has time to open it first
Task.Delay(1500).ContinueWith(t => File.Delete(tempHTMLLocation));
Upon opening the page, the onload handler immediately submits the form, which posts the data to the URL and opens it in the default browser.
The Dropbox client does it the same way as you mentioned in your EDIT, but it also does some obfuscation, i.e. it XORs the data with the hash submitted via the URL.
Here are the steps Dropbox follows:
1. In-app: Create a token that can be used to authorize at dropbox.com.
2. In-app: Convert the token to a hex string (A).
3. In-app: Create a secure random hex string (B) of the same length.
4. In-app: Calculate C = A XOR B.
5. In-app: Create a temporary HTML file with the following functionality:
   5.1. A hidden input field which contains value B.
   5.2. A submit form with the hidden input fields necessary to log in to dropbox.com.
   5.3. A JS function that reads the hash from the URI, XORs it with B and writes the result into the submit form's hidden fields.
   5.4. Delete the hash from the URI.
   5.5. Submit the form.
6. In-app: Open the temporary HTML file with the default browser and append C as the hash at the end of the URI.
Now when your browser opens the HTML file, it reconstructs the auth token from the hidden input field and the hash in the URI and opens dropbox.com. And because of point 5.4 you are not able to hit the back button in your browser to log in again, because the hash is gone.
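For illustration, a rough C# sketch of steps 2 to 4 above (this is not Dropbox's actual code; all names are made up):

// Sketch of the XOR-pad step: returns C (token XOR pad) as hex and outputs the pad B as hex.
using System;
using System.Security.Cryptography;

static class TokenObfuscation
{
    public static string XorWithRandomPad(byte[] tokenBytes, out string padHex)
    {
        // B: a secure random pad of the same length as the token (A).
        byte[] pad = new byte[tokenBytes.Length];
        using (var rng = RandomNumberGenerator.Create())
            rng.GetBytes(pad);

        // C = A XOR B; C goes into the URI hash, B into the hidden input field.
        byte[] xored = new byte[tokenBytes.Length];
        for (int i = 0; i < tokenBytes.Length; i++)
            xored[i] = (byte)(tokenBytes[i] ^ pad[i]);

        padHex = BitConverter.ToString(pad).Replace("-", "");
        return BitConverter.ToString(xored).Replace("-", "");
    }
}

The JavaScript in the temporary HTML file then performs the reverse XOR to recover the token before submitting the form.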
I'm not sure I would have constructed the solution that way. Instead, I would post all the data to a web service (using HttpWebRequest, as #Loci described, or just importing the service using Visual Studio), which would store the data in a database (perhaps with a pending status). Then direct the user (using your Process.Start approach) to a page that would display the pending help ticket, which would allow them to either approve or discard the ticket.
It sounds like a bit more work, but it should clean up the architecture of what you are trying to do. Plus you have the added benefit of not worrying about how to trigger a form post from the client side.
Edit:
A plain ASMX web service should at least get you started. You can right-click on your project and select Add Service Reference to generate the proxy code for calling the service.
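A rough sketch of that flow, using HttpClient rather than HttpWebRequest for brevity; the endpoint, field names and review URL are all made up for illustration:

// Sketch only: post the email contents to a (hypothetical) web service, get back a ticket id,
// then open the pending ticket in the default browser for review.
using System.Collections.Generic;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;

static class TicketSubmitter
{
    public static async Task SubmitAndReviewAsync(string subject, string body)
    {
        using (var client = new HttpClient())
        {
            var form = new FormUrlEncodedContent(new Dictionary<string, string>
            {
                { "subject", subject },
                { "body", body }
            });

            HttpResponseMessage response = await client.PostAsync("https://www.test.com/api/tickets", form);
            string ticketId = (await response.Content.ReadAsStringAsync()).Trim('"');

            // Hand off to the default browser, as in the Process.Start approach from the question.
            Process.Start("https://www.test.com/tickets/" + ticketId);
        }
    }
}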
