Can't assign HTML to HTMLDocument.body - c#

I want to use the HTMLDocument object from the mshtml library. I am trying to assign HTML to the document:
var doc = new mshtml.HTMLDocument();
var html = File.ReadAllText(@"path_to_html_file");
doc.body.innerHTML = html; // <-- this line throws error
However, I get an error on the third line:
System.NullReferenceException: 'Object reference not set to an
instance of an object.'
mshtml.DispHTMLDocument.body.get returned null.
I also tried creating the object dynamically, but that didn't work either:
dynamic doc = Activator.CreateInstance(Type.GetTypeFromProgID("htmlfile"));
In this case I get the following error:
Microsoft.CSharp.RuntimeBinder.RuntimeBinderException:
'Cannot perform runtime binding on a null reference'
Is there a way to overcome this problem? Thanks!
UPDATE: VBA code
Sub GetData()
Dim doc As MSHTML.HTMLDocument
Dim fso As FileSystemObject, txt As TextStream
Set doc = New MSHTML.HTMLDocument
Set fso = New FileSystemObject
Set txt = fso.OpenTextFile("path_to_html_file")
doc.body.innerHTML = txt.ReadAll() '// <-- No error here
txt.Close
End Sub

You could cast the mshtml.HTMLDocument to the IHTMLDocument2 interface, so that the main object's properties and methods are available:
var doc = (IHTMLDocument2)new mshtml.HTMLDocument();
Or create an HtmlDocumentClass instance using Activator.CreateInstance() with the type's GUID, then cast it to the IHTMLDocument2 interface:
IHTMLDocument2 doc =
(IHTMLDocument2)Activator.CreateInstance(
Type.GetTypeFromCLSID(new Guid("25336920-03F9-11CF-8FD0-00AA00686F13")));
It's more or less the same thing. I'd prefer the first one, mainly for this reason.
Then you can write to the HtmlDocument whatever you want. For example:
doc.write(File.ReadAllText(@"[Some Html Page]"));
Console.WriteLine(doc.body.innerText);
To create a HtmlDocument, a skeleton HTML Page is enough, something like this:
string html = "<!DOCTYPE html><html><head></head><Body><p></body></html>";
doc.write(html);
Note: before a Document is created, all elements in the page will be null.
Afterwards, you can set the body's innerHTML to something else:
doc.body.innerHTML = "<P>Some Text</P>";
Console.WriteLine(doc.body.innerText);
Note that if you need to work with the HTML document more extensively, you'll have to cast to a higher-level interface: IHTMLDocument3 to IHTMLDocument8 (as of now), depending on the system version.
The classic getElementById, getElementsByName and getElementsByTagName methods are available in the IHTMLDocument3 interface.
For example, use getElementsByTagName() to retrieve the innerText of an HTML element using its tag name:
string innerText =
    (doc as IHTMLDocument3).getElementsByTagName("body")
        .OfType<IHTMLElement>().First().innerText;
Note:
If you can't find the IHTMLDocument6, IHTMLDocument7 and IHTMLDocument8 interfaces (and possibly other interfaces referenced in the MSDN docs), then you probably have an old type library in the \Windows\Assembly\ GAC. Follow Hans Passant's advice to create a new Interop.mshtml library:
How to get mshtml.IHTMLDocument6 or mshtml.IHTMLDocument7?

I ran into the System.NullReferenceException too, because doc.body was null. In the end, I resolved the problem this way:
public void SetWebBrowserHtml(WebBrowser webBrowser, string html)
{
    if (!(webBrowser.Document is MSHTML.IHTMLDocument2))
    {
        webBrowser.Navigate("about:blank");
    }
    if (webBrowser.Document is MSHTML.IHTMLDocument2 doc)
    {
        if (doc.body == null)
        {
            doc.write(html);
        }
        else
        {
            doc.body.innerHTML = html;
        }
    }
}

Related

Scrape data from div in Windows.Form

I am new to C# programming. I am trying to scrape data from a div (I want to display the temperature from a web page in a Windows Forms application).
This is my code:
private void btnOnet_Click(object sender, EventArgs e)
{
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
HtmlWeb web = new HtmlWeb();
doc = web.Load("https://pogoda.onet.pl/");
var temperatura = doc.DocumentNode.SelectSingleNode("/html/body/div[1]/div[3]/div/section/div/div[1]/div[2]/div[1]/div[1]/div[2]/div[1]/div[1]/div[1]");
onet.Text = temperatura.InnerText;
}
This is the exception:
System.NullReferenceException:
temperatura was null.
You can use this:
public static bool TryGetTemperature(HtmlAgilityPack.HtmlDocument doc, out int temperature)
{
    temperature = 0;
    var temp = doc.DocumentNode.SelectSingleNode(
        "//div[contains(@class, 'temperature')]/div[contains(@class, 'temp')]");
    if (temp == null)
    {
        return false;
    }
    // The degree sign arrives HTML-encoded as "&deg;" (5 characters), so strip it before parsing.
    var text = temp.InnerText.EndsWith("&deg;") ?
        temp.InnerText.Substring(0, temp.InnerText.Length - 5) :
        temp.InnerText;
    return int.TryParse(text, out temperature);
}
With a relative XPath you can select your target more precisely. With your absolute query, any small change in the HTML structure will break your application. Some points:
// searches anywhere in the document
The query looks for any div whose class contains "temperature" and, inside that node:
it looks for a child div whose class contains "temp"
If that node is found (!= null), the method tries to convert the degrees to an int (removing the trailing degree sign first)
Then call it like this:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
HtmlWeb web = new HtmlWeb();
doc = web.Load("https://pogoda.onet.pl/");
if (TryGetTemperature(doc, out int temperature))
{
onet.Text = temperature.ToString();
}
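As a rough illustration of why the relative query holds up better, the sketch below runs the same XPath against a small in-memory fragment. It uses the framework's XmlDocument instead of HtmlAgilityPack so that it needs no extra package, and the markup in Sample is invented for the example; the real page's structure and class names may differ.

```csharp
using System;
using System.Xml;

public static class XPathDemo
{
    // Hypothetical, well-formed markup standing in for the real page.
    public const string Sample =
        "<html><body><div class='wrapper'><section>" +
        "<div class='temperature now'><div class='temp big'>21</div></div>" +
        "</section></div></body></html>";

    // Returns the text of the first matching node, or null when nothing matches.
    public static string FindTemperature(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml);
        XmlNode node = doc.SelectSingleNode(
            "//div[contains(@class, 'temperature')]/div[contains(@class, 'temp')]");
        return node == null ? null : node.InnerText;
    }
}
```

Because the query starts with //, it keeps matching if wrapper elements are added or removed above the target div, which is where an absolute path such as /html/body/div[1]/div[3]/... stops working.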
UPDATE
I updated TryGetTemperature slightly because the degree sign is HTML-encoded. The main problem is the HTML itself. When you request the page source, you get HTML that the browser later updates dynamically, so the HTML you receive is of no use to you: it doesn't contain the temperature.
So, I see two alternatives:
You can use a browser control (under Common Controls -> WebBrowser in the form designer toolbox, next to Button, Label...), put it on your form and navigate to the page. It's not difficult, but you need to learn a few things: wait for the event that signals the page has finished loading, then get the source code from the control. I suppose you'll also want to hide the browser control. Be careful: sometimes the browser doesn't work correctly when it is hidden. In that case, you can use a visible form positioned off the desktop and handle the activation events to keep that window from being activated. Also hide it from the task switcher (Alt+Tab). Things get harder this way, but sometimes it is the only way.
The simpler way is to search for the location you want (e.g. Madryt) and look in DevTools at the request that is made (e.g. https://pogoda.onet.pl/prognoza-pogody/madryt-396099). Use this URL and you get valid HTML.
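On the encoded degree sign mentioned in the update: an alternative to cutting a fixed number of characters is to decode the entity first and then trim the sign. This is only a sketch; the class and method names are made up for illustration, and it assumes the text is a whole number followed by an optional degree sign.

```csharp
using System;
using System.Globalization;
using System.Net;

public static class DegreeText
{
    // Accepts "21&deg;", "21°" or "21" and extracts the integer part.
    public static bool TryParseDegrees(string innerText, out int temperature)
    {
        // "&deg;" becomes "°"; text that is already decoded passes through unchanged.
        string decoded = WebUtility.HtmlDecode(innerText ?? string.Empty);
        string digits = decoded.Trim().TrimEnd('°').Trim();
        return int.TryParse(digits, NumberStyles.Integer,
            CultureInfo.InvariantCulture, out temperature);
    }
}
```

Decoding first means the same code handles both the encoded and the decoded form of the text.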

c# IHTMLDocument2 set the URL property

I need to create a complete IHTMLDocument2 document, and I ended up with this snippet, which works. However, the URL property always seems to be ignored.
string page = "my HTML code in string";
IHTMLDocument2 doc2 = (IHTMLDocument2)new HTMLDocument();
doc2.url = "www.stackoverflow.com";
doc2.write(new object[] { page });
doc2.close();
while (doc2.body == null)
Application.DoEvents();
Now doc2.url is always "about:blank". How can I set this URL property?
Thank you in advance,

How do I open XML from link in razor?

The task is quite simple: connect to another web service using XML.
In the current pages (classic ASP) we use the following code:
zoekpcode=UCASE(Request.Querystring("zoekpcode")) ' postal code
zoeknr=Request.Querystring("zoeknr") ' house number
PC=Trim(Replace(zoekpcode," ",""))
NR=Trim(Replace(zoeknr," ",""))
strGetAddress="https://ws1.webservices.nl/rpc/get-simplexml/addressReeksPostcodeSearch/*~*/*~*/" & PC & NR
set xml = Server.CreateObject("Microsoft.XMLHTTP")
xml.open "GET", strGetAddress , false
xml.send ""
strStatus = xml.Status
If Len(PC)>5 and Len(NR)>0 Then
strRetval = Trim(xml.responseText)
End If
set xml = nothing
'Do something with the result string
One of the possible links could be: https://ws1.webservices.nl/rpc/get-simplexml/addressReeksPostcodeSearch/~/~/1097ZD49
Currently I'm looking for a way to do this in Razor (C#), but all I seem to be able to find on Google is how to do it in JavaScript.
I've tried (most combinations of) the following terms:
razor
xmlhttp
comobject
XML from url
-javascript
Results were mostly about JavaScript or razor blades.
Based on other results (such as those for the search "comobjects in razor"), it seems that COM objects aren't available in Razor.
I did find this question (How to use XML with WebMatrix razor (C#)) on Stack Overflow that seems to answer my question (partially), but is it also possible with a link to an external system (the mentioned web service)?
I have covered the consumption of Web Services in Razor web pages here: http://www.mikesdotnetting.com/Article/209/Consuming-Feeds-And-Web-Services-In-Razor-Web-Pages.
If your web service is a SOAP one, you are best off using Visual Studio (the free Express edition is fine) to add a service reference and then work from there. Otherwise you can use LINQ to XML to load the XML directly into an XDocument, as in the ATOM example in the article:
var xml = XDocument.Load("https://ws1.webservices.nl/rpc/get-simplexml/blah/blah");
Then use the System.Xml.Linq APIs to query the document.
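A minimal sketch of such a query follows. The response shape in Sample (entry, street, city) and the names AddressXml and ReadValues are assumptions made up for the example, since the real service's element names aren't shown here; the sketch parses an in-memory string so it runs without network access.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class AddressXml
{
    // Hypothetical response; the real service's elements may be named differently.
    public const string Sample =
        "<response><results>" +
        "<entry><street>Examplestraat</street><city>Amsterdam</city></entry>" +
        "<entry><street>Voorbeeldweg</street><city>Amsterdam</city></entry>" +
        "</results></response>";

    // Collects the text of every element with the given local name, in document order.
    public static string[] ReadValues(string xml, string elementName)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc.Descendants(elementName).Select(e => e.Value).ToArray();
    }
}
```

Against the live service, XDocument.Load(url) would take the place of XDocument.Parse. If the response declares an XML namespace, the element name has to be qualified with an XNamespace before Descendants will match it.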
With the help of Ralf I came to the following code:
public static XmlDocument getaddress(string pcode, string number)
{
    // Build the URL from the arguments instead of reading Request.QueryString here.
    string getlocation = "https://ws1.webservices.nl/rpc/get-simplexml/addressReeksPostcodeSearch/*~*/*~*/" + pcode + number;
    string serverresponse;
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(getlocation);
    using (var r = req.GetResponse())
    using (var s = new StreamReader(r.GetResponseStream()))
    {
        serverresponse = s.ReadToEnd();
    }
    XmlDocument loader = new XmlDocument();
    loader.LoadXml(serverresponse);
    return loader;
}
public static string getvalue(XmlDocument document, string node)
{
    string returnval = "";
    foreach (XmlNode aNode in document.SelectNodes(node))
    {
        returnval = returnval + "," + aNode.InnerText;
    }
    // Drop the leading comma; an empty result stays empty instead of throwing.
    return returnval.Length > 0 ? returnval.Substring(1) : returnval;
}

How to get XML-code of webpage that is opened in IE (without using WebRequest)?

I'm trying to get the XML text from a web page that is already open in IE. Web requests are not allowed because of the target page's security (a long, boring story with certificates etc.). I use a method that walks through all open pages and, if I find a match with a page's URI, I need to get its XML.
Some time ago I needed to get an HTML-code between body tags. I used method with IHTMLDocument2 like this:
private string GetSourceHTML()
{
Regex reg = new Regex(patternURL);
Match match;
string result;
foreach (SHDocVw.InternetExplorer ie in shellWindows)
{
match = reg.Match(ie.LocationURL.ToString());
if (!string.IsNullOrEmpty(match.Value))
{
mshtml.IHTMLDocument2 doc = (mshtml.IHTMLDocument2)ie.Document;
result = doc.body.innerHTML.ToString();
return result;
}
}
result = string.Empty;
return result;
}
So now I need to get a whole XML-code of a target page. I've googled a lot, but didn't find anything useful. Any ideas? Thanks.
Have you tried this? It should get the HTML, which hopefully you could parse to XML?
Retrieving the HTML source code

C# Html Agility Pack ( SelectSingleNode )

I'm trying to parse this field, but can't get it to work. Current attempt:
var name = doc.DocumentNode.SelectSingleNode("//*[@id='my_name']").InnerHtml;
<h1 class="bla" id="my_name">namehere</h1>
Error: Object reference not set to an instance of an object.
Appreciate any help.
@John - I can assure you that the HTML is loaded correctly. I am trying to read my Facebook name for learning purposes. Here is a screenshot from the Firebug plugin. The version I am using is 1.4.0.
http://i54.tinypic.com/kn3wo.jpg
I guess the problem is that profile_name is a child node or something, that's why I'm not able to read it?
The reason your code doesn't work is because there is JavaScript on the page that is actually writing out the <h1 id='profile_name'> tag, so if you're requesting the page from a User Agent (or via AJAX) that doesn't execute JavaScript then you won't find the element.
I was able to get my own name using the following selector:
string name =
    doc.DocumentNode.SelectSingleNode("//a[@id='navAccountName']").InnerText;
Try this:
var name = doc.DocumentNode.SelectSingleNode("//@id='my_name'").InnerHtml;
string name = doc.DocumentNode.SelectSingleNode("//h1[@id='my_name']").InnerText;
public async Task<List<string>> GetAllTagLinkContent(string content)
{
    string html = string.Format("<html><head></head><body>{0}</body></html>", content);
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    // SelectNodes returns null when nothing matches, so guard before enumerating.
    var nodes = doc.DocumentNode.SelectNodes("//*[@id='my_name']");
    return nodes == null ? new List<string>() : nodes.Select(r => r.InnerText).ToList();
}
It also works with ("//a[@href]"). You can try it as above. Hope this helps.
