Retrieve Inner Text from WebView HTML in Windows/Windows Phone 8.1 - c#

I'm creating a universal app and need to be able to pull plain text from an HTML page. I know that in WPF you can use the IHTMLDocument2 interface to achieve this.
IHTMLDocument2 document = webBrowser1.Document as IHTMLDocument2;
string data = document.body.innerText;
Is there something similar for Windows Runtime?
Thanks,

I would use something like HtmlAgilityPack. The HTML then becomes queryable through LINQ. Once you have the page markup as a string (html below), you can do something like this:
var htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml(html); // html: the page markup as a string
string innerText = htmlDoc.DocumentNode.Descendants("body").Single().InnerText;
LoadHtml takes the markup as a string; Load reads it from a stream instead.
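On Windows Runtime the WebView control has no Document property, so the text or markup has to be pulled out through script. A minimal sketch, assuming a WebView that has already finished navigating; the class and method names are made up for illustration, and passing "eval" to InvokeScriptAsync is a commonly used way to evaluate a script expression and get its string result back:

```csharp
using System;
using System.Threading.Tasks;
using Windows.UI.Xaml.Controls;

public static class WebViewText
{
    // Returns the rendered text of the page currently loaded in the WebView.
    // Call this only after navigation has completed (e.g. from NavigationCompleted).
    public static async Task<string> GetInnerTextAsync(WebView webView)
    {
        // "eval" evaluates the expression inside the page and returns its string result.
        return await webView.InvokeScriptAsync("eval", new[] { "document.body.innerText" });
    }

    // Returns the full markup, for example to pass to HtmlAgilityPack's LoadHtml.
    public static async Task<string> GetOuterHtmlAsync(WebView webView)
    {
        return await webView.InvokeScriptAsync("eval", new[] { "document.documentElement.outerHTML" });
    }
}
```

If only the plain text is needed, the first method may be enough on its own and no HTML parser is required.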

Related

c# - Select specific text boxes on WebBrowser and write in them

How would I select a specific text box in a web browser (the default C# WebBrowser control) and write a string into it? I have been thinking of using web requests, but that may be illegal due to the number of packets you would send to the host server.
If you just want to fill in a field, you need to use the DOM. Here are some examples that assume your browser control is named browser and has navigated to google.com.
WPF
mshtml.IHTMLDocument2 htmlDoc = browser.Document as mshtml.IHTMLDocument2;
((mshtml.HTMLInputElement)htmlDoc.all.item("lst-ib")).value = "wpf web browser access dom";
WinForms
HtmlDocument htmlDoc = browser.Document;
htmlDoc.All["lst-ib"].InnerText = "c# web browser access dom";

How can I extract the HTML inside all the script tags that appear before a certain pattern in a web page?

I have a variable inside a script tag in a web page. I want to use the HTML Agility Pack to extract content (e.g. all the CSS code) that appears before this variable is initialized. I'm able to extract all the style tags on the page using the following code.
string source = new WebClient().DownloadString(url);
var document = new HtmlDocument();
document.LoadHtml(source);
HtmlNode[] styleNodes = document.DocumentNode.SelectNodes("//style").ToArray();
Is there a way I can extract the style tags appearing before and after the variable separately?
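One possible approach, sketched here with plain string and regex handling rather than HTML Agility Pack: find where the variable is initialized in the source, then sort each style block by whether it starts before or after that position. The names StyleSplitter, Split and marker are hypothetical, and the regex is only a rough stand-in for a real HTML parser:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class StyleSplitter
{
    // Matches a <style> element and captures its body.
    // Good enough for a sketch; not a substitute for a real HTML parser.
    private static readonly Regex StyleBlock = new Regex(
        @"<style\b[^>]*>(.*?)</style>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Splits the style bodies in source into those that start before the
    // first occurrence of marker and those that start after it.
    public static void Split(string source, string marker,
                             out List<string> before, out List<string> after)
    {
        before = new List<string>();
        after = new List<string>();
        int markerIndex = source.IndexOf(marker, StringComparison.Ordinal);

        foreach (Match m in StyleBlock.Matches(source))
        {
            // If the marker is missing, every block counts as "before".
            if (markerIndex < 0 || m.Index < markerIndex)
                before.Add(m.Groups[1].Value);
            else
                after.Add(m.Groups[1].Value);
        }
    }
}
```

Here marker would be the text that initializes the variable, for example "var myVariable =". The same position comparison could presumably be done on parsed nodes instead of regex matches if the parser exposes each node's offset in the source.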

Is there an alternative to WebBrowser control for DOM traversal?

I am currently using a WebBrowser control in my Windows Forms application to navigate to a URL. Once I am at that URL, I use the FirstChild and NextSibling properties of the HtmlElement class to walk the document tree from the WebBrowser.Document object.
The reason I do this is to get information from a page and store this information into a database.
Here is the crux of my question: Do I really need to use the WebBrowser class? I currently do not need to display the web page to the user, only some of the information found in the page.
Is there a better way to do this without relying on this class? Something solid which can do DOM traversal would be required, but as mentioned above, I do not need to display the web page.
Regards
Crouz
You can use a WebClient to download the HTML without displaying the page. You can then use something like HTML Agility Pack to create an HtmlDocument from the string.
Example:
using (WebClient wc = new WebClient())
{
    string html = wc.DownloadString("http://www.foo.bar/"); // Change as required.
    HtmlAgilityPack.HtmlDocument h = new HtmlAgilityPack.HtmlDocument();
    h.LoadHtml(html);
}
Reason to use HTML Agility Pack rather than the Windows Forms HtmlDocument class:
The [Windows Forms] HtmlDocument class is a wrapper around the native IHtmlDocument2 COM interface.
You cannot easily create it from a string...
and thus not without using the WebBrowser.
From https://stackoverflow.com/a/4935482/4546874.
However, if you stay with the WebBrowser, you can hide it.

identify html in windows form

I use a web service in my Windows application. The web service returns a string like:
<b>sdfsdf</b>
<img alt="*" src="df"/>
Is there any way in Windows Forms to interpret the HTML tags? For example, text inside <b> should be bold, and <img/> should appear as an image, not as text.
Easiest way would be to use a WebBrowser control, I suppose.
http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser.aspx
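If the string is well-formed XML, load it into an XmlDocument. A fragment with several top-level elements, like the one above, has to be wrapped in a single root element first:
XmlDocument doc = new XmlDocument();
doc.LoadXml("<root>" + webservice_output_string + "</root>");
Then you can use XPath against the document:
string bold = (doc.SelectSingleNode("//b") as XmlElement).InnerText;
string src = (doc.SelectSingleNode("//img/@src") as XmlAttribute).Value;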
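To go from the parsed fragment to something a form can render, one option is to walk the top-level nodes and record what each one is. A minimal sketch, assuming the web service output is well-formed XML apart from lacking a single root; FragmentReader, Read and the "bold"/"image"/"text" labels are hypothetical names chosen for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class FragmentReader
{
    // Turns a small, well-formed markup fragment into (kind, value) pairs
    // that a form could map to a bold label, a PictureBox, and so on.
    public static List<KeyValuePair<string, string>> Read(string fragment)
    {
        var doc = new XmlDocument();
        // XML allows only one top-level element, so wrap the fragment in a
        // placeholder <root> element before parsing.
        doc.LoadXml("<root>" + fragment + "</root>");

        var items = new List<KeyValuePair<string, string>>();
        foreach (XmlNode node in doc.DocumentElement.ChildNodes)
        {
            if (node.NodeType == XmlNodeType.Text)
                items.Add(new KeyValuePair<string, string>("text", node.Value));
            else if (node.Name == "b")
                items.Add(new KeyValuePair<string, string>("bold", node.InnerText));
            else if (node.Name == "img" && node.Attributes["src"] != null)
                items.Add(new KeyValuePair<string, string>("image", node.Attributes["src"].Value));
        }
        return items;
    }
}
```

This only works while the input stays valid XML. Ordinary HTML with unclosed tags such as <br> would make LoadXml throw, in which case the WebBrowser control or an HTML parser is the safer route.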

Read HTML code from iframe using Webbrowser C#

How can I read the HTML code of an IFRAME using WebBrowser?
I have a site with an iframe, and after a few clicks a new URL opens inside this IFRAME with some HTML code. Is it possible to read it? When I try to Navigate() to this URL, I am redirected to the main page of the site (it is not possible to open this link twice).
Uri IFRAME_URL = webBrowser1.Document.Window.Frames[0].Url;
Maybe there is something similar to:
Uri IFRAME_URL = webBrowser1.Document.Window.Frames[0]. ... DOCUMENTTEXT;
Try:
string content = webBrowser1.Document.Window.Frames[0].WindowFrameElement.InnerText;
You can also acquire various items via mshtml types:
Set a reference to the "Microsoft HTML Object Library" under COM references.
Set your using statement:
using mshtml;
Then tap into the mshtml API to snatch the source:
HTMLFrameBase frame = yourWebBrowserControl.Document.GetElementById( "yourFrameId" ).DomElement as HTMLFrameBase;
If "frame" isn't null after that line, it has a lot of items hanging off it for your use.
Try:
string content = webBrowser1.Document.Window.Frames[0].Document.Body.InnerText;
A WebBrowser control window can contain more than one iframe, and .NET exposes them as a frame collection, so why not use something like this:
// Set up a string variable...
string html = string.Empty;
// webBrowser1.Document.Window.Frames gets the collection of frames contained in the current document...
// Each item in the collection is an HtmlWindow...
foreach (HtmlWindow frame in webBrowser1.Document.Window.Frames)
{
    html += frame.Document.Body.OuterHtml;
}
This way, perhaps with a little adjustment, you can get everything you need from the iframes you are interested in.
