I need to create a complete IHTMLDocument2 document, and I ended up with the snippet below, which works. However, the url property always seems to be ignored.
string page = "my HTML code in string";
IHTMLDocument2 doc2 = (IHTMLDocument2)new HTMLDocument();
doc2.url = "www.stackoverflow.com";
doc2.write(new object[] { page });
doc2.close();
while (doc2.body == null)
    Application.DoEvents();
Now doc2.url is always "about:blank". How can I set this URL property?
Thank you in advance,
Related
I wanted to use HTMLDocument object from mshtml library. I was trying to assign HTML to document:
var doc = new mshtml.HTMLDocument();
var html = File.ReadAllText(@"path_to_html_file");
doc.body.innerHTML = html; // <-- this line throws error
However, I get an error on the third line:
System.NullReferenceException: 'Object reference not set to an
instance of an object.'
mshtml.DispHTMLDocument.body.get returned null.
I was trying to use dynamic code, but it didn't work either:
dynamic doc = Activator.CreateInstance(Type.GetTypeFromProgID("htmlfile"));
In this case I get the following error:
Microsoft.CSharp.RuntimeBinder.RuntimeBinderException:
'Cannot perform runtime binding on a null reference'
Is there some solution to overcome this problem? Thanks!
UPDATE: VBA code
Sub GetData()
    Dim doc As MSHTML.HTMLDocument
    Dim fso As FileSystemObject, txt As TextStream
    Set doc = New MSHTML.HTMLDocument
    Set fso = New FileSystemObject
    Set txt = fso.OpenTextFile("path_to_html_file")
    doc.body.innerHTML = txt.ReadAll() '// <-- No error here
    txt.Close
End Sub
You could cast the mshtml.HTMLDocument to the IHTMLDocument2 interface, so that the main object's properties and methods are available:
var doc = (IHTMLDocument2)new mshtml.HTMLDocument();
Or create an HTMLDocumentClass instance using Activator.CreateInstance() with the type GUID, then cast it to the IHTMLDocument2 interface.
IHTMLDocument2 doc =
(IHTMLDocument2)Activator.CreateInstance(
Type.GetTypeFromCLSID(new Guid("25336920-03F9-11CF-8FD0-00AA00686F13")));
It's more or less the same thing. I'd prefer the first one, mainly for this reason
Then you can write to the HtmlDocument whatever you want. For example:
doc.write(File.ReadAllText(@"[Some Html Page]"));
Console.WriteLine(doc.body.innerText);
To create a HtmlDocument, a skeleton HTML Page is enough, something like this:
string html = "<!DOCTYPE html><html><head></head><Body><p></body></html>";
doc.write(html);
Note: before a Document is created, all elements in the page will be null.
After, you can set the Body.InnerHtml to something else:
doc.body.innerHTML = "<P>Some Text</P>";
Console.WriteLine(doc.body.innerText);
Note that if you need to work with the HTML document more extensively, you'll have to cast to a higher-level interface: IHTMLDocument3 to IHTMLDocument8 (as of now), depending on the system version.
The classic getElementById, getElementsByName, getElementsByTagName methods are available in the IHTMLDocument3 interface.
For example, use getElementsByTagName() to retrieve the innerText of an HTML element using its tag name:
string innerText =
(doc as IHTMLDocument3).getElementsByTagName("body")
.OfType<IHTMLElement>().First().innerText;
Note:
If you can't find the IHTMLDocument6, IHTMLDocument7 and IHTMLDocument8 interfaces (and possibly other interfaces referenced in the MSDN docs), then you probably have an old type library in the \Windows\Assembly\ GAC. Follow Hans Passant's advice to create a new Interop.mshtml library:
How to get mshtml.IHTMLDocument6 or mshtml.IHTMLDocument7?
I ran into the System.NullReferenceException too, because doc.body was null. In the end, I resolved the problem this way:
public void SetWebBrowserHtml(WebBrowser webBrowser, string html)
{
    if (!(webBrowser.Document is MSHTML.IHTMLDocument2))
    {
        webBrowser.Navigate("about:blank");
    }
    if (webBrowser.Document is MSHTML.IHTMLDocument2 doc)
    {
        if (doc.body == null)
        {
            doc.write(html);
        }
        else
        {
            doc.body.innerHTML = html;
        }
    }
}
When I instantiate an IE object and navigate to a url, I don't know how to obtain the source HTML code from that address.
This is the code I'm using:
SHDocVw.InternetExplorer IE = new SHDocVw.InternetExplorer();
IE.Visible = false;
IE.Navigate("www.testsite.com");
I want something like:
string source = IE.ToSource();
So I can inspect the content of it.
Can I achieve this? Thanks.
Try this:
SHDocVw.InternetExplorer IE = new SHDocVw.InternetExplorer();
IE.Visible = false;
IE.Navigate("www.testsite.com");
mshtml.IHTMLDocument2 htmlDoc
= IE.Document as mshtml.IHTMLDocument2;
string content = htmlDoc.body.outerHTML;
You can access the whole HTML string from the body.parentElement property:
string content = htmlDoc.body.parentElement.outerHTML;
You can see a nice example here (the example is in C++)
I'm looking to get the *.aspx page name from the parent of an IHTMLElement. I started looking through the attributes on an IHTMLElement, and the document property looked promising.
Do I just need to cast as follows?
IHTMLElement elem;
elem = getElement(args);
IHTMLElement2 dom = (IHTMLElement2)elem.document;
string aspx = dom.<something?>;
That doesn't appear to work, but I feel like I'm on the right track. Ideas?
HTMLDocument doc = somedoc;
Regex pullASPX = new Regex(@"(?<=\/)[^//]*?(?=\.aspx)");
if (elem != null && !doc.url.Contains("default.aspx"))
{
    EchoAbstraction.page = pullASPX.Match(doc.url).Value;
    EchoAbstraction.tag = tagName;
    EchoAbstraction.id = elem.id;
}
This is how I ended up doing it. I had found the ID in the dom already, so I just pulled the current doc page and parsed the URL.
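As an alternative to the regex, the page name can also be read from the parsed URL. This is only a minimal sketch, assuming the URL is absolute; the helper name GetPageName and the sample URL are made up for illustration and are not part of the original code:

```csharp
using System;
using System.IO;

static class PageNameExample
{
    // Hypothetical helper: returns the .aspx page name without its extension,
    // or null when the last path segment is not an .aspx page.
    public static string GetPageName(string url)
    {
        var uri = new Uri(url, UriKind.Absolute);
        // AbsolutePath excludes the query string and fragment.
        string file = Path.GetFileName(uri.AbsolutePath);
        return file.EndsWith(".aspx", StringComparison.OrdinalIgnoreCase)
            ? Path.GetFileNameWithoutExtension(file)
            : null;
    }

    static void Main()
    {
        // Sample URL is an assumption for the demo.
        Console.WriteLine(GetPageName("http://example.com/shop/orders.aspx?id=5")); // prints "orders"
    }
}
```

Going through Uri keeps the query string out of the match, which the regex approach does not guarantee when a query value itself contains ".aspx".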
The task is simple, but I couldn't find the answer.
Removing tags (nodes) is easy with Node.Remove()... But how to replace them?
There's a ReplaceChild() method, but it requires creating a new tag. How do I set the contents of a tag? InnerHtml and OuterHtml are read-only properties.
See this code snippet:
public string ReplaceTextBoxByLabel(string htmlContent)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(htmlContent);
    foreach (HtmlNode tb in doc.DocumentNode.SelectNodes("//input[@type='text']"))
    {
        string value = tb.Attributes.Contains("value") ? tb.Attributes["value"].Value : " ";
        HtmlNode lbl = doc.CreateElement("span");
        lbl.InnerHtml = value;
        tb.ParentNode.ReplaceChild(lbl, tb);
    }
    return doc.DocumentNode.OuterHtml;
}
Are you sure InnerHtml is a read only property?
The HTML Agility Pack's documentation says otherwise: (Cut & Paste)
Gets or Sets the HTML between the start and end tags of the object.
Namespace: HtmlAgilityPack
Assembly: HtmlAgilityPack (in HtmlAgilityPack.dll) Version: 1.4.0.0 (1.4.0.0)
Syntax
C#
public virtual string InnerHtml { get; set; }
If it is read-only, could you post some code?
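To illustrate the setter quoted from the documentation, here is a minimal sketch that assigns InnerHtml on an existing node. It assumes the HtmlAgilityPack NuGet package is referenced; the helper name ReplaceContents and the sample markup are made up for the demo:

```csharp
using System;
using HtmlAgilityPack;

static class InnerHtmlSetterExample
{
    // Hypothetical helper: replaces the contents of the first node
    // matching the XPath and returns the resulting document markup.
    public static string ReplaceContents(string html, string xpath, string newInnerHtml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        if (node != null) // SelectSingleNode returns null when nothing matches
        {
            node.InnerHtml = newInnerHtml;
        }
        return doc.DocumentNode.OuterHtml;
    }

    static void Main()
    {
        // Sample markup is an assumption for the demo.
        Console.WriteLine(
            ReplaceContents("<div id=\"box\">old</div>", "//div[@id='box']", "<b>new</b>"));
    }
}
```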
I'm trying to parse this field, but can't get it to work. Current attempt:
var name = doc.DocumentNode.SelectSingleNode("//*[@id='my_name']").InnerHtml;
<h1 class="bla" id="my_name">namehere</h1>
Error: Object reference not set to an instance of an object.
Appreciate any help.
@John - I can assure you that the HTML is loaded correctly. I am trying to read my Facebook name for learning purposes. Here is a screenshot from the Firebug plugin. The version I am using is 1.4.0.
http://i54.tinypic.com/kn3wo.jpg
I guess the problem is that profile_name is a child node or something, that's why I'm not able to read it?
The reason your code doesn't work is because there is JavaScript on the page that is actually writing out the <h1 id='profile_name'> tag, so if you're requesting the page from a User Agent (or via AJAX) that doesn't execute JavaScript then you won't find the element.
I was able to get my own name using the following selector:
string name =
    doc.DocumentNode.SelectSingleNode("//a[@id='navAccountName']").InnerText;
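Because SelectSingleNode returns null when nothing matches (for example, when the element is only added later by JavaScript), it helps to check the result before reading InnerText. A minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the helper name GetTextById is made up for illustration, and the id is assumed to contain no quote characters:

```csharp
using System;
using HtmlAgilityPack;

static class NullSafeLookup
{
    // Hypothetical helper: returns the text of the element with the given id,
    // or null when the loaded markup contains no such element.
    public static string GetTextById(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // Assumption: id has no quote characters, so it is safe to inline.
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//*[@id='" + id + "']");
        return node?.InnerText;
    }

    static void Main()
    {
        string html = "<h1 class=\"bla\" id=\"my_name\">namehere</h1>";
        Console.WriteLine(GetTextById(html, "my_name") ?? "(not found)");
        Console.WriteLine(GetTextById(html, "profile_name") ?? "(not found)");
    }
}
```

A null result here is a hint to inspect the HTML that was actually downloaded rather than the DOM shown in the browser.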
Try this:
var name = doc.DocumentNode.SelectSingleNode("//@id='my_name'").InnerHtml;
string name = doc.DocumentNode.SelectSingleNode("//h1[@id='my_name']").InnerText;
public async Task<List<string>> GetAllTagLinkContent(string content)
{
    string html = string.Format("<html><head></head><body>{0}</body></html>", content);
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    // "//*[@id='my_name']" matches any element whose id is my_name.
    // Note: SelectNodes returns null when nothing matches.
    var nodes = doc.DocumentNode.SelectNodes("//*[@id='my_name']");
    return nodes.ToList().ConvertAll(r => r.InnerText).Select(j => j).ToList();
}
It also works with ("//a[@href]"). You can try it as above. Hope this helps.