I'm creating a simple web scraper. I want it to download data from a specific webpage. However, the data I want only appears after clicking on a div. I'm trying to find that div, invoke a click event on it, and then download the page source (after the hidden data is shown). The data probably appears on the page after some JavaScript runs. I had to set WebBrowser.ScriptErrorsSuppressed to true, because too many errors would pop up. Currently I'm using the following code:
WebBrowser browser = new WebBrowser();
//Navigate etc...
foreach (HtmlElement el in browser.Document.GetElementsByTagName("div"))
{
if (el.GetAttribute("className").ToString().Equals(className))
{
el.InvokeMember("click");
foreach(HtmlElement child in el.Children)
{
child.InvokeMember("click");
}
}
}
browser.Document.GetElementsByTagName("body")[0].InvokeMember("click");
while (browser.ReadyState != WebBrowserReadyState.Complete)
{
Application.DoEvents();
Debug.WriteLine("State: " + browser.ReadyState);
System.Threading.Thread.Sleep(50);
}
string source = browser.DocumentText;
This doesn't work. The hidden data isn't shown. I've tried using RaiseEvent instead of InvokeMember, and changing click to onclick. Nothing worked.
Btw. the code invokes click on every child, because I'm not sure which one makes the data appear.
Does anyone know what goes wrong?
I am currently working on an app for WP7 for my university, and need a temporary solution to a problem. The solution is to load a webpage using the web browser control for WP7. For example: http://m.iastate.edu/laundry/
Now as you see on the webpage, there are certain elements I want to hide, for example the back button. For now, what I have done to handle the back button is something like this:
private void webBrowser1_Navigating(object sender, NavigatingEventArgs e)
{
// Handle loading animations
// Handle what happens when the "back" button is pressed
Uri home = new Uri("http://m.iastate.edu/");
// If the address being loaded is home,
// cancel the navigation and go back to the
// app's home page.
if (e.Uri.Equals(home))
{
e.Cancel = true;
NavigationService.Navigate(new Uri("/MainPage.xaml", UriKind.Relative));
}
}
Now that works beautifully, except for the part that there is a back button on the hardware.
So my second option is to completely hide the back button ONLY on that page, and not its children. So not on http://m.iastate.edu/laundry/l/0
I am still debating on just parsing the data and displaying it in my own style, but I'm not sure if that's completely needed seeing how the data needs constant internet service and is already in a well-put format. Plus, I feel like that would be a waste of resources? Throw in your opinions on that too :)
Thanks!
You should inject a script in the page with InvokeScript.
Here is the kind of Javascript code you need to remove the back button:
// get the first child element of the header
var backButton = document.getElementsByTagName("header")[0].firstChild;
// check if it looks like a back button
if(backButton && backButton.innerText == "Back") {
// it looks like a back button, remove it
document.getElementsByTagName("header")[0].removeChild(backButton);
}
Call this script with InvokeScript:
webBrowser1.InvokeScript("eval", "(function() { " + script + " })()");
Warning: IsScriptEnabled must be set to true on the web browser control.
If removing the back button depends on the page, just test the navigating URI in C# and inject the script if needed.
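As a variant of the same idea, the page check could also live inside the injected script instead of in C#. The following is only a sketch: removeBackButton and its onlyOnPath parameter are hypothetical names, and it assumes the back button is the first child of the header, as in the script above. The document and location objects are passed in as arguments so the function can be tried outside a browser.

```javascript
// Hypothetical helper: removes the header's leading "Back" element, but only
// on the page whose path equals `onlyOnPath` (an assumed parameter).
// Returns true if a button was removed, false otherwise.
function removeBackButton(doc, loc, onlyOnPath) {
  if (loc.pathname !== onlyOnPath) {
    return false; // child pages keep their back button
  }
  var header = doc.getElementsByTagName("header")[0];
  if (!header) {
    return false; // page has no header element
  }
  // assumption: the back button is the first child of the header
  var backButton = header.firstChild;
  if (backButton && backButton.innerText === "Back") {
    header.removeChild(backButton);
    return true;
  }
  return false;
}
```

Inside the page this would be called as removeBackButton(document, window.location, "/laundry/"), so that http://m.iastate.edu/laundry/l/0 and other child pages are left alone.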
So I've been working on this project, and we have a WebBrowser object on the form. Its purpose is to load HTML forms so they can be viewed. At the moment, though, you are able to edit the contents of the HTML form, which is not desired.
I want to simply display this HTML form of information to the user, but not allow them to alter the textboxes or checkboxes or anything of that nature on the form.
I tried using the Navigating event and setting e.Cancel = true;. This halted the control from even loading the page. And if I set it to only execute e.Cancel = true; after the form had loaded, I could still change text boxes and such on the form, as the Navigating event only seemed to be raised at random.
Does anyone know of a way to get a WebBrowser object to be read only?
Cheers!
You can apply the contentEditable attribute to the body tag of the document.
Document.Body.SetAttribute("contentEditable", "false");
This will make your document read-only for the user.
You could try accessing all form elements on the page and set the readonly attribute on the tag. Something like:
var inputs = webBrowser1.Document.GetElementsByTagName("input");
foreach (HtmlElement element in inputs)
{
element.SetAttribute("readonly", "readonly");
}
You'd obviously have to repeat the process for all form elements (select etc.), but it should work.
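One caveat when repeating this for other elements: the readonly attribute only affects text-like inputs and textareas, while checkboxes, selects and buttons ignore it and need disabled instead. Below is a rough sketch of a script that could be run in the page to cover both cases. lockFormControls is a hypothetical name, and the document is passed in as an argument so the function is not tied to a particular host.

```javascript
// Hypothetical helper: makes every form control in `doc` non-editable.
// Text-like fields become read-only; everything else is disabled.
// Returns the number of controls that were locked.
function lockFormControls(doc) {
  var tags = ["input", "textarea", "select", "button"];
  var locked = 0;
  for (var t = 0; t < tags.length; t++) {
    var elements = doc.getElementsByTagName(tags[t]);
    for (var i = 0; i < elements.length; i++) {
      var el = elements[i];
      var type = (el.type || "").toLowerCase();
      var isTextLike = tags[t] === "textarea" ||
        (tags[t] === "input" &&
          (type === "" || type === "text" || type === "password"));
      if (isTextLike) {
        el.readOnly = true; // text stays selectable but cannot be edited
      } else {
        el.disabled = true; // readonly has no effect on these controls
      }
      locked++;
    }
  }
  return locked;
}
```

Called as lockFormControls(document) once the page has finished loading. Note that disabled controls are greyed out and are not submitted with the form, which should be fine for a display-only view.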
I have been running into this issue as well. Thanks to steavy I have been able to come up with a solution:
Hook up to the DocumentCompleted event (you can do this in the designer):
myWebBrowser.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(myWebBrowser_DocumentCompleted);
Then make it read-only in the event handler:
private void myWebBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
myWebBrowser.Document.Body.SetAttribute("contentEditable", "false");
}
I do this in the event that fires once the document is fully loaded because I sometimes ran into a NullReferenceException: the body wasn't loaded yet, so the line would throw.
Consider the following simple WinForms form with a textbox and a webbrowser control. Whenever the textbox content changes, the text is pushed to the browser:
public class MainForm : Form
{
public MainForm()
{
var browser = new WebBrowser() { Dock = DockStyle.Fill };
var textbox = new TextBox() { Dock = DockStyle.Fill, Multiline = true };
var splitter = new SplitContainer() { Dock = DockStyle.Fill };
splitter.Panel1.Controls.Add(textbox);
splitter.Panel2.Controls.Add(browser);
this.Controls.Add(splitter);
textbox.TextChanged += delegate { browser.DocumentText = textbox.Text; };
textbox.Text = "<b>hello world</b>";
}
}
(I am doing something like this in my DownMarker code to build a Markdown editor with Stackoverflow's MarkdownSharp library.)
This works fine, except that the WebBrowser control insists on showing the wait cursor whenever DocumentText is set - even if updating the browser content takes only a few milliseconds. This results in mouse cursor flicker when typing in the textbox.
Is there any way to suppress these mouse cursor changes? I already considered rate-limiting the DocumentText updates, but I find that the occasional flicker during an update is still annoying and I would prefer instant updates.
edit: Hans' answer pointed me in the right direction. Changing the TextChanged event handler to this seems to work without cursor flicker:
textbox.TextChanged +=
delegate
{
if (browser.Document == null)
{
browser.DocumentText = "<html><body></body></html>";
}
while ((browser.Document == null)
|| (browser.Document.Body == null))
{
Application.DoEvents();
}
browser.Document.Body.InnerHtml = textbox.Text;
};
edit2: the above still shows the wait cursor when the page is made heavier, e.g. by adding images. This might be fixable by doing more fine-grained updates of just the HTML elements that change, but that is obviously much more complex.
Assigning the DocumentText property is a Big Deal, WebBrowser treats it like a navigation command. It can't tell the difference. Which normally takes time, hundreds of milliseconds, enough for it to justify displaying the wait cursor.
A very different approach would be to load a dummy document and alter the DOM through the Document property. That's pretty common in web pages, Ajax and JavaScript and what-not. No wait cursor for those. Not so sure if that will still fit your editing model; I'd guess you want to load a dummy HTML document with an empty <body> and change the body content.
Should work. Back-up plan is an Update! button. Which would also avoid trying to render half-finished and thus broken HTML.
I'm having a problem screenscraping some data from this website using the MSHTML COM component. I have a WebBrowser control on my WPF form.
The code that retrieves the HTML elements is in the WebBrowser LoadCompleted event. After I set the value on the HTMLInputElement and call the click method on the HTMLInputButtonElement, it refuses to submit the request and display the next page.
I analysed the HTML for the onclick attribute on the button: it calls a JavaScript function that processes my request, so I'm not sure whether calling that JavaScript function is causing the problem. Funnily enough, when I take my code out of the LoadCompleted method and put it inside a button click event, it does take me to the next page, whereas the LoadCompleted method didn't. Doing it that way defeats the point of trying to screenscrape the page automatically.
On another thought: when the code was inside the LoadCompleted method, perhaps the HTMLInputButtonElement was not fully rendered on the page, which resulted in the click event not firing. But when I inspected the object at run time, it did hold the submit button element and the state said it was complete, which baffles me even more.
Here is the code I used inside the LoadCompleted method and the click method on the button:
private void browser_LoadCompleted(object sender, NavigationEventArgs e)
{
HTMLDocument dom = (HTMLDocument)browser.Document;
IHTMLElementCollection elementCollection = dom.getElementsByName("PCL_NO_FROM.PARCEL_RANGE.XTRACKING.1-1-1.");
HTMLInputElement inputBox = null;
if (elementCollection.length > 0)
{
foreach (HTMLInputElement element in elementCollection)
{
if (element.name.Equals("PCL_NO_FROM.PARCEL_RANGE.XTRACKING.1-1-1."))
{
inputBox = element;
}
}
}
inputBox.value = "Test";
elementCollection = dom.getElementsByName("SUBMIT.DUM_CONTROLS.XTRACKING.1-1.");
HTMLInputButtonElement submitButton = null;
if (elementCollection.length > 0)
{
foreach (HTMLInputButtonElement element in elementCollection)
{
if (element.name.Equals("SUBMIT.DUM_CONTROLS.XTRACKING.1-1."))
{
submitButton = element;
}
}
}
submitButton.click();
}
FYI: This is the URL of the web page I'm trying to access using MSHTML,
http://track.dhl.co.uk/tracking/wrd/run/wt_xtrack_pw.entrypoint.
There are many possibilities:
You may try to put your code at other events, such as on Navigation Completed, or on Download Completed.
You may need to explicitly evaluate the OnClick event after the click() function.
Using the MS WebBrowser control is easier than using the MSHTML COM.
To make life easier, you may just use a webscraping library such as the IRobotSoft ActiveX control to automate your entire process.
Delay in OnBeforeNavigate can cause click actions to fail.
We have noticed that with some submit actions OnBeforeNavigate is called twice, especially where onClick is used. The first call is before the onClick action is performed, the second is after it is complete.
Turn off your BHO, put a breakpoint on onClick, step over the submit action return jsSubmit() and then wait a bit and you should be able to cause the same issue without your automation.
Any delay >150ms on the second call to OnBeforeNavigate causes some failure in page load/navigation to the result.
Edit:
Having tried our own automation of this DHL page we don't currently have an issue with the timing described above.
I am trying to create a popup that is used to select a month/year for a textbox. I have mostly got it working; however, when I read from the textbox after submitting the form, it returns an empty string. Visually, though, I can see the result in the textbox when I click the Done button, as shown in the screenshot.
http://i27.tinypic.com/2eduttx.png - is a screenshot of the popup
I have wrapped the whole textbox/popup inside a Web User Control
Here is the code of the control
Code Behind
ASP Page
and then read from the Textbox on the button click event with the following
((TextBox)puymcStartDate.FindControl("txtDate")).Text
Any suggestions of how to fix the problem?
You may need to read the form posted value rather than the value from the view state. I have the following methods in my code to handle this.
The code below just grabs the posted form values from the request (on postback) and sets/updates the controls. The problem is that when using the ASP.NET Ajax controls, it doesn't register an update on the control, so the viewstate isn't modified (I think). Anyways, this works for me.
protected void btnDone_Click(object sender, EventArgs e)
{
LoadPostBackData();
// do your other stuff
}
// loads the values posted to the page via form postback to the actual controls
private void LoadPostBackData()
{
LoadPostBackDataItem(this.txtYear);
LoadPostBackDataItem(this.txtDate);
// put other items here if needed
}
// loads a single posted form value into the given control
private void LoadPostBackDataItem(TextBox control)
{
// UniqueID is the name the control's value is posted under
string controlId = control.UniqueID;
string postedValue = Request.Params[controlId];
if (!string.IsNullOrEmpty(postedValue))
{
control.Text = postedValue;
}
else
{
control.Text = null; // string.Empty;
}
}