I am currently using a WebBrowser control in my Windows Forms application to navigate to a URL. Once I am at that URL, I use the FirstChild in conjunction with NextSibling methods of the HtmlElement class to walk the document tree from the WebBrowser.Document object.
The reason I do this is to get information from a page and store this information into a database.
Here is the crux of my question: Do I really need to use the WebBrowser class? I currently do not need to display the web page to the user, only some of the information found in the page.
Is there a better way to do this without relying on this class? Something solid which can do DOM traversal would be required, but as mentioned above, I do not need to display the web page.
Regards
Crouz
You can use a WebClient to download the HTML without displaying the page. You can then use something like HTML Agility Pack to build an HtmlDocument from the string.
Example:
using (WebClient wc = new WebClient())
{
    string html = wc.DownloadString("http://www.foo.bar/"); // Change as required.
    HtmlAgilityPack.HtmlDocument h = new HtmlAgilityPack.HtmlDocument();
    h.LoadHtml(html);
}
Reason to use HTML Agility Pack:
The HtmlDocument class is a wrapper around the native IHtmlDocument2 COM interface. You cannot easily create it from a string, and thus not without using the WebBrowser.
From https://stackoverflow.com/a/4935482/4546874.
However, you can hide the WebBrowser.
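Since the question asks for DOM traversal, here is a minimal sketch of a FirstChild/NextSibling walk over an HTML Agility Pack document, mirroring the style used with the WebBrowser's HtmlElement. The class and method names (DomWalker, CollectElementNames) are hypothetical, and the HtmlAgilityPack NuGet package is assumed to be referenced.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class DomWalker
{
    // Parses the markup and returns element names in depth-first document order.
    public static List<string> CollectElementNames(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var names = new List<string>();
        Visit(doc.DocumentNode, names);
        return names;
    }

    // Recursive walk: record element nodes, then descend via FirstChild
    // and move across via NextSibling (both are null at the end of a chain).
    private static void Visit(HtmlNode node, List<string> names)
    {
        if (node.NodeType == HtmlNodeType.Element)
        {
            names.Add(node.Name);
        }

        for (HtmlNode child = node.FirstChild; child != null; child = child.NextSibling)
        {
            Visit(child, names);
        }
    }
}
```

Text and comment nodes are skipped here; a real scraper would read InnerText or attributes at the point where this sketch records the element name.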
Related
I'm creating a universal app and need to be able to pull plain text from an HTML page. I know that in WPF you can use the IHTMLDocument2 interface to achieve this.
IHTMLDocument2 document = webBrowser1.Document as IHTMLDocument2;
string data = document.body.innerText;
Is there something similar for Windows Runtime?
Thanks,
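I would use something like HtmlAgilityPack. The HTML then becomes queryable through LINQ. HtmlAgilityPack's HtmlDocument cannot be obtained by casting the WebBrowser's Document, so load the markup into a new instance instead (html here is assumed to hold the page source as a string):
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
string innerText = htmlDoc.DocumentNode.Descendants("body").Single().InnerText;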
You can also load the HTML as a string or stream through LoadHtml and Load respectively.
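As a rough sketch of the plain-text extraction described above, the helper below returns the body text of a markup string and decodes HTML entities along the way. The names PlainText and FromHtml are hypothetical; how the markup is downloaded in a universal app is left out, and the HtmlAgilityPack package is assumed to be available for the target platform.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class PlainText
{
    // Returns the decoded text of the <body> element, or an empty string
    // when the markup has no <body>.
    public static string FromHtml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode body = doc.DocumentNode.Descendants("body").FirstOrDefault();
        if (body == null)
        {
            return string.Empty;
        }

        // InnerText keeps entities such as &amp; as written; DeEntitize decodes them.
        return HtmlEntity.DeEntitize(body.InnerText);
    }
}
```

FirstOrDefault is used instead of Single so that markup without a body element yields an empty string rather than an exception.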
I need to set up communication between a Java applet and the HTML page hosted in a C# WebBrowser control, and I want to do it without refreshing the HTML page. I know I can communicate with the applet using applet parameters, but in that case I have to refresh the applet page every time to pick up the updated parameters. I could also use a cookie, but I don't want to send all those unnecessary cookies to the server with each request. So I was wondering whether there is a way to create a JavaScript array variable using the DOM and then read it from the applet. I don't know if that is possible, or whether there are other ways to do it. Any suggestions will be highly appreciated.
Thanks
You can achieve this by injecting HTML code through the WebBrowser control's DocumentText property:
string htmlCode = "<html><head></head><body>";
htmlCode += "<applet code=\"Example.class\" width=\"350\" height=\"350\"></applet>";
htmlCode += "</body></html>";
webBrowser1.DocumentText = htmlCode;
I have a C# Form with WebBrowser object.
This object contains HTML Document.
There is a link in that document that has no markers (no id and no name).
How can I access this element?
I tried to use this:
webBrowser1.Document.GetElementsByTagName("a")[n]
But it is not very useful, because if a new link is added to the page, I'll need to rework the whole program.
I also cannot loop through the document, or take a substring of Document.ToString(), because then I cannot click the link.
Would be great if you could give me some advice.
In this kind of situation the best idea is always to find an "anchor", meaning a place in the document that never changes.
Let's say that
<a>dada</a>
doesn't have an ID or name, so the closest you can get is to check whether the parent of the element you're looking for has an ID.
<div id="parentDiv">
    Some text
    Some other stuff
    <a>The link you're looking for</a>
</div>
That way you could get the parentDiv, which you know doesn't change, and then the A tag inside that parent (which should stay stable unless the website completely changes its structure, which is one of the problems with parsing external HTML pages).
Shai.
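The parent-anchor idea could look roughly like this with the WinForms WebBrowser. It is only a sketch: the helper name ClickFirstLinkInside is hypothetical, and "parentDiv" is the example id used above.

```csharp
using System.Windows.Forms;

public static class LinkClicker
{
    // Finds the stable parent by id, then clicks the first <a> inside it.
    // Returns false when the document, the parent, or the link is missing.
    public static bool ClickFirstLinkInside(WebBrowser browser, string parentId)
    {
        if (browser.Document == null)
        {
            return false;
        }

        HtmlElement parent = browser.Document.GetElementById(parentId);
        if (parent == null)
        {
            return false;
        }

        // GetElementsByTagName on an element only searches its descendants.
        HtmlElementCollection links = parent.GetElementsByTagName("a");
        if (links.Count == 0)
        {
            return false;
        }

        links[0].InvokeMember("click");
        return true;
    }
}
```

A call such as LinkClicker.ClickFirstLinkInside(webBrowser1, "parentDiv") would typically go in the DocumentCompleted handler, once the page has finished loading.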
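You can use Html Agility Pack and select links by XPath:
HtmlWeb htmlWeb = new HtmlWeb();
HtmlDocument doc = htmlWeb.Load(/* url */);
// SelectNodes returns null when nothing matches, so guard before iterating.
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        // do stuff
    }
}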
You need some way to identify the link: it may be an id, a name, or the text. If the text is always the same, check the inner text of each link.
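Matching on the link text might be sketched as follows, assuming the visible text of the link is stable. LinkFinder and FindLinkByText are hypothetical names, not part of any library.

```csharp
using System;
using System.Windows.Forms;

public static class LinkFinder
{
    // Returns the first <a> element whose trimmed text equals linkText, or null.
    public static HtmlElement FindLinkByText(WebBrowser browser, string linkText)
    {
        if (browser.Document == null)
        {
            return null;
        }

        foreach (HtmlElement link in browser.Document.GetElementsByTagName("a"))
        {
            // InnerText can be null for links that contain only an image.
            string text = link.InnerText;
            if (text != null && string.Equals(text.Trim(), linkText, StringComparison.Ordinal))
            {
                return link;
            }
        }

        return null;
    }
}
```

Because the result is a live HtmlElement from the control's own document, it can still be clicked, for example with link.InvokeMember("click") after a null check.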
I have a form with an embedded web browser control on it. I am currently using WebBrowser and use it like so:
webBrowser1.Navigate("about:blank");
HtmlDocument doc = this.webBrowser1.Document;
doc.Write(string.Empty);
String htmlContent = GetHTML();
doc.Write(htmlContent);
This writes the HTML correctly to the web browser control BUT it never clears the existing data and it just appends, so I end up with N web pages stacked on top of each other.
Is this the best control to use? If so why is it not clearing existing data?
You need to use:
HtmlDocument doc = this.webBrowser1.Document.OpenNew(true);
Now the contents of the document will be cleared before writing.
All calls to Write should be preceded by a call to OpenNew, which will clear the current document and all of its variables. Your calls to Write will create a new HTML document in its place. To change only a specific portion of the document, obtain the appropriate HtmlElement and set its InnerHtml property.
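The two options in the quoted documentation can be sketched side by side: replacing the whole document through OpenNew and Write, or replacing one element through InnerHtml. The helper names are hypothetical, and both methods assume a document is already loaded in the control (for example after navigating to about:blank).

```csharp
using System.Windows.Forms;

public static class BrowserContent
{
    // Clears the current document and writes a fresh one in its place.
    public static void ReplaceDocument(WebBrowser browser, string html)
    {
        // true: replace the current entry in the navigation history.
        HtmlDocument doc = browser.Document.OpenNew(true);
        doc.Write(html);
    }

    // Replaces the markup inside a single element; returns false if the id is not found.
    public static bool ReplaceFragment(WebBrowser browser, string elementId, string html)
    {
        HtmlElement target = browser.Document.GetElementById(elementId);
        if (target == null)
        {
            return false;
        }

        target.InnerHtml = html;
        return true;
    }
}
```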
Yes, it is.
You should be able to call the Clear method if you need to clear contents.
Check this article for in-depth details and sample code:
http://www.codeproject.com/KB/miscctrl/simplebrowserformfc.aspx
Call HtmlDocument.OpenNew between pages:
OpenNew will clear the previous loaded document, including any associated state, such as variables. It will not cause navigation events in WebBrowser to be raised.
How to read IFRAME HTML code using WebBrowser?
I have a site with an iframe, and after a few clicks a new URL opens inside this iframe with some portion of HTML code. Is there a possibility to read this? When I try to Navigate() to this URL, I am redirected to the main page of the site (it is not possible to open this link twice).
Uri IFRAME_URL = webBrowser1.Document.Window.Frames[0].Url;
Maybe there is something similar to:
Uri IFRAME_URL = webBrowser1.Document.Window.Frames[0]. ... DOCUMENTTEXT;
Try:
string content = webBrowser1.Document.Window.Frames[0].WindowFrameElement.InnerText;
You can also acquire various items via mshtml types:
Set a reference to the "Microsoft HTML Object Library" under COM references.
Set your using statement:
using mshtml;
Then tap into the mshtml API to snatch the source:
HTMLFrameBase frame = yourWebBrowserControl.Document.GetElementById( "yourFrameId" ).DomElement as HTMLFrameBase;
If "frame" isn't null after that line, it has a lot of items hanging off it for your use.
Try:
string content = webBrowser1.Document.Window.Frames[0].Document.Body.InnerText;
A WebBrowser control window can contain more than one iframe, and .NET exposes them as a frame collection, so why not use something like this:
// Set up a string variable...
string html = string.Empty;
// webBrowser1.Document.Window.Frames gets the collection of iframes contained in the current document...
// HtmlWindow is the element type of the collection...
foreach (HtmlWindow frame in webBrowser1.Document.Window.Frames)
{
    html += frame.Document.Body.OuterHtml;
}
This way, maybe with a little adjustment, you can get everything you need from the iframe containers.