Error while using XPath to parse text from HTML


The HTML content I need to parse is the text in the marquee element shown below. I'm using C# with HTML Agility Pack to parse it, but a NullReferenceException is thrown.
The C# code is:
var ht1 = ht.DocumentNode.SelectSingleNode("html/body/table/tbody/tr/td[2]/div[2]/marquee/text()").InnerText;
Part of HTML:
<html>
-<body ...
-<table id=..
-<tbody>
-<tr>
+<td.........
-<td
+<div ......
-<div style="width:100%;padding:0;margin:0;border
-style:solid;border-width:0;border-color:darkred;">
<marquee width="100%" height="20" bgcolor="" style="color:
darkorchid; font-size: 14" loop="3" behavior="scroll"
scrolldelay="90 scrollamount="5" align="middle" border="0">
your scrolling text - these are some samples - think of
possibilities</marquee>
<div>

Did you look at the raw source of the HTML file? If you only look at the HTML as shown by browser tools such as Firebug in Firefox, you will see additional tbody tags that are not actually in the file.
Therefore use:
var ht1 = ht.DocumentNode.SelectSingleNode("html/body/table/tr/td[2]/div[2]/marquee/text()").InnerText;
You usually do not need text(), because a node's InnerText already is its text content. Also, text() selects the individual text nodes, and SelectSingleNode returns only the first of them, not the concatenated text.
Therefore use:
var ht1 = ht.DocumentNode.SelectSingleNode("html/body/table/tr/td[2]/div[2]/marquee").InnerText;
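A minimal sketch of the difference, assuming the HtmlAgilityPack NuGet package is referenced. The sample markup and the XPathDemo and ReadText names are made up for illustration. SelectSingleNode returns null when nothing matches, so calling InnerText on the result of the tbody path throws, while a null check makes the failure visible instead.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathDemo
{
    // Hypothetical sample: the raw source has no <tbody>, unlike the browser's DOM view.
    public const string SampleHtml =
        "<html><body><table><tr><td>first</td>" +
        "<td><div>one</div><div><marquee>your scrolling text</marquee></div></td>" +
        "</tr></table></body></html>";

    // Returns the trimmed text of the first node matching xpath, or null if nothing matches.
    public static string ReadText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText.Trim();
    }

    public static void Main()
    {
        // The first path assumes a <tbody> that the file does not contain.
        string withTbody = ReadText(SampleHtml, "html/body/table/tbody/tr/td[2]/div[2]/marquee");
        string withoutTbody = ReadText(SampleHtml, "html/body/table/tr/td[2]/div[2]/marquee");
        Console.WriteLine(withTbody ?? "(no match)");
        Console.WriteLine(withoutTbody ?? "(no match)");
    }
}
```

If the exact nesting is not important, a shorter path such as "//marquee" avoids depending on the table structure at all.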

That page does not seem to be well formed HTML.
This worked for me though:
ht.DocumentNode.SelectSingleNode(@"html/head/table[1]/tbody/tr/td[1]/td/div[2]/marquee").InnerText;

Related

Why label control does not render DIV as HTML (AllowHtmlString=true)

I want to center some of the strings.
I found this documentation page:
https://documentation.devexpress.com/WindowsForms/9536/Controls-and-Libraries/Editors-and-Simple-Controls/Simple-Editors/Examples/How-to-Format-Text-in-LabelControl-Using-HTML-Tags
So, I wrote this code.
labelControl1.Text = "<div style=\"text-align:center;\">center</div><br>" +
"<size=14>Size = 14<br>" +
"Bold <i>Italic</i> <u>Underline</u><br>" +
"<color=255, 0, 0>Sample Text</color></size>";
labelControl1.AllowHtmlString = true;
labelControl1.Appearance.TextOptions.WordWrap = WordWrap.Wrap;
labelControl1.Appearance.Options.UseTextOptions = true;
labelControl1.AutoSizeMode = LabelAutoSizeMode.Vertical;
But, it didn't work.
What is the problem with it?

According to the HTML Text Formatting documentation, the LabelControl.AllowHtmlString property supports these tags and "pseudotags" (tags which do not exist in the current HTML standard but can be used for rendering in the label control):
Normal HTML tags
<b> - bold text
<i> - italic text
<s> - strikethrough
<u> - underline
<br> (current HTML equivalent is <br />)
Pseudotags
<color> (equivalent to CSS color)
<backcolor> (equivalent to CSS background-color)
<size> (equivalent to CSS font-size)
<image=value> (equivalent to HTML <img src="value">)
<href=url> (equivalent to HTML <a href="url">)
<nbsp> (equivalent to HTML &nbsp;)
The HTML <div> tag is not among the supported tags listed above, hence it will be rendered as plain text instead.

According to the documentation, only specific HTML tags are supported, and div is not in the list.
Depending on your requirements, you might split the text into two labels, one centered (AutoSize=False, TextAlign=MiddleCenter) and one with HTML.

How can I wrap a <span> around matched words in HTML without breaking the HTML

Using C# - WinForms
I have a valid HTML string which may or may not contain various HTML elements such as <a>.
I need to search this HTML and highlight certain keywords - the highlighting is done by adding a <span> around the text with inline styling. I should not be doing this for <a> tags, or any other HTML tag that isn't actually visible to the user.
e.g. currently I am doing this:
html = html.Replace(phraseToCount, "<span style=\"background: #FF0000; color: #FFFFFF; font-weight: bold;\">" + phraseToCount + "</span>");
This kind of works but it breaks <a> tags. So in the example below only the 1st instance of the word cereal should end up with a <span> around it:
<p>To view more types of cereal click here.</p>
How could I do this?
EDIT - more info.
This will be running in a Winforms app as the best way to get the HTML is using the WebBrowser control - I will be scraping web pages and highlighting various words.

You're handling HTML as plain text. You don't want that. You only want to search through the "InnerText" of your HTML elements, as in <p attribute="value">innertext</p>. Not through tags, comments, styles and script and whatever else can be included in your document.
In order to do that properly, you need to parse the HTML, and then obtain all elements' InnerTexts and do your logic on that.
In fact, InnerText is a simplification: when you have an element like <p>FooBar<span>BarBaz</span></p> where "Baz" is to be replaced, then you need to actually recursively iterate all the nodes in the DOM, and only replace text nodes, because writing into the InnerText property will remove all child nodes.
For how to do that, you'd want to use a library. You don't want to build an HTML parser on your own. See for example C#: HtmlAgilityPack extract inner text, Extracting Inner text from HTML BODY node with Html Agility Pack, How can i parse InnerText of <option> tag with HtmlAgilityPack?, Parsing HTML with CSQuery, HtmlAgilityPack - get all nodes in a document and so on.
The most relevant one seems to be How can I retrieve all the text nodes of a HTMLDocument in the fastest way in C#?:
HtmlNodeCollection coll = htmlDoc.DocumentNode.SelectNodes("//text()");
// SelectNodes returns null (not an empty collection) when nothing matches
if (coll != null)
{
    foreach (HtmlTextNode node in coll.Cast<HtmlTextNode>())
    {
        node.Text = node.Text.Replace(...);
    }
}
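To also leave link text alone, as the question asks, the XPath can exclude text nodes that sit inside certain elements. The following is only a sketch, assuming the HtmlAgilityPack NuGet package. The Highlighter class and Highlight method are hypothetical names. It relies on HAP writing a text node's Text verbatim when serializing, so the inserted span only becomes a real element once the output is parsed again. The match is case-sensitive.

```csharp
using System;
using HtmlAgilityPack;

public static class Highlighter
{
    // Wraps each occurrence of phrase in a <span>, touching only text nodes
    // that are not inside <a>, <script> or <style>.
    public static string Highlight(string html, string phrase)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection textNodes = doc.DocumentNode.SelectNodes(
            "//text()[not(ancestor::a) and not(ancestor::script) and not(ancestor::style)]");
        if (textNodes == null)
        {
            return html; // nothing matched, leave the input unchanged
        }

        string replacement =
            "<span style=\"background: #FF0000; color: #FFFFFF; font-weight: bold;\">" +
            phrase + "</span>";

        foreach (HtmlNode node in textNodes)
        {
            var textNode = node as HtmlTextNode;
            if (textNode != null)
            {
                textNode.Text = textNode.Text.Replace(phrase, replacement);
            }
        }

        // Serialize the modified tree back to a string.
        return doc.DocumentNode.WriteTo();
    }
}
```

For example, with the made-up input <p>cereal <a href="cereal.html">cereal list</a></p>, calling Highlighter.Highlight(input, "cereal") should wrap only the first "cereal", leaving both the href attribute and the link text untouched.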

Here's how you would do what @CodeCaster suggested in CsQuery:
string str = "<p>To view more types of cereal click here cereal.</p>";
var cq = CQ.Create(str);
foreach (IDomElement node in cq.Elements)
{
    PerformActionOnTextNodeRecursively(node, domNode => domNode.NodeValue = domNode.NodeValue.Replace("cereal", "<span>cereal</span>"));
}
Console.WriteLine(cq.Render());

private static void PerformActionOnTextNodeRecursively(IDomNode node, Action<IDomNode> action)
{
    foreach (var childNode in node.ChildNodes)
    {
        if (childNode.NodeType == NodeType.TEXT_NODE)
        {
            action(childNode);
        }
        else
        {
            PerformActionOnTextNodeRecursively(childNode, action);
        }
    }
}
Hope it helps.

How to write rich text to word document generated from htm file in C#

I am trying to generate a word doc from saved HTML file using an Open XML library.
If the HTML file does not contain an image I can simply use the code below and write text content to word doc.
HtmlDocument doc = new HtmlDocument();
doc.Load(fileName); //fileName is the Htm file
string Detail = string.Empty;
string webData = string.Empty;
HtmlNode hcollection = doc.DocumentNode.SelectSingleNode("//body");
Detail = hcollection.InnerText;
But if the HTML file contains an embedded image I am struggling to include that image in the word doc.
Using hcollection.InnerText only writes the text part and excludes the image.
When I use
HtmlNode hcollection = doc.DocumentNode.SelectSingleNode("//body");
Detail = hcollection.InnerHtml;
All the HTML tags get written to the word doc along with path of Image in the tag
<table border='0' width='100%' cellpadding='0' cellspacing='0' align='center'>
<tr><td valign='top' align="left">
<div style='width:100%'><div id="div_img">
<div>
<img src="http://www.myweb.com/web/img/2013/07/18/img_1.jpg">
<span>Sample Text</span></div></div><br><br>Sample Text Content here<br><br> </div></td></tr></table>
How can I remove the HTML tags and, instead of the path being shown like
<img src="http://www.myweb.com/web/img/2013/07/18/img_1.jpg">
have the corresponding picture loaded into the document?
Please help.

You'll need to look at the HTML and translate it to OpenXML somehow.
I've used HtmlToOpenXml open-source library (license), and that works well enough. It should handle images (inline, local or remote) and correctly insert them into the OpenXML document. I recently submitted a patch which was accepted, so the project is still somewhat active.
There are some limitations with the library though:
JavaScript (<script>), CSS (<style>), <meta> and other unsupported tags do not generate an error but are ignored.
It does handle inline style information, but it entirely ignores other CSS, which was something I needed. I ended up integrating some simple parsing of a single <style> element from another open-source project (jsonfx, using MIT license).
Note: handling multiple <style> elements, downloading CSS files, sorting out which style rules have precedence -- these are all problems which I did not address.

Converting an HTML document to MS Word is actually a very complex task, and there are many cases besides image tags that need to be solved. The Open XML and HTML formats differ fundamentally.
If I were you, I would look for third-party tools for that. It would be cheaper to pay for them than to spend weeks investigating and learning all aspects of the task, writing the code, and then fixing multiple bugs.
Personally, I used the Aspose.Words library for that. It worked perfectly fine, but maybe you want to try another one.

format string containing html

I have a simple string variable that contains a portion of HTML inside. For example:
string contents = "<div><p>Hi how are you. Click here if you want to know more";
I want to include this HTML in page:
<div class="description">
@contents
</div>
However, it messes up the rest of the page because of unclosed tags.
Is there a function (or a helper) that reads the HTML inside and fixes it up, for example by closing the open tags so that it renders without errors? Something like:
@Html.DisplayProperHTML(contents)
This will render as:
<div><p>Hi how are you. Click here if you want to know more</p></div>

There is no such functionality built-in.
You can use the HTML Agility Pack to parse and fix broken HTML.
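A rough sketch of that idea, assuming the HtmlAgilityPack NuGet package. HtmlRepair and Close are hypothetical names. The fragment is parsed and then serialized again from the parsed tree, which is what gives unclosed elements their end tags. How particular elements (notably <p>) come out has varied between HAP versions, so the result is worth checking against your own input.

```csharp
using HtmlAgilityPack;

public static class HtmlRepair
{
    // Parses a possibly broken fragment and re-serializes it from the parsed tree,
    // so that elements left open in the input are closed in the output.
    public static string Close(string fragment)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(fragment);
        return doc.DocumentNode.OuterHtml;
    }
}
```

In a Razor view the repaired markup could then be emitted with @Html.Raw(HtmlRepair.Close(contents)). Html.Raw skips HTML encoding, so this is only appropriate when the contents are trusted.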

Page.FindControl?

I am working on a page that inherits from a base page. The aspx page includes a control that uses XSLT to transform an XML document into HTML markup. Within that document I am using the following:
<xsl:template match="Headline">
<h1 runat="server" id="h1" class="article-heading">
<xsl:value-of select="text()"/>
</h1>
</xsl:template>
I am trying to get the value of the h1 so I can set it as Page.Title. Can this be done with Page.FindControl?

XSLT within a browser tends to be interpreted on the client-side, not the server side. Using Page.FindControl to find the content of the H1 won't get you too far, as all that will return is the literal <xsl:value-of...> statement.
The best approach is to also open the XML document within the codebehind on the server and set the Page Title from there.

You can use JavaScript to find the h1 on the client side and then assign its text to document.title.

Unfortunately no, because your <h1> element never gets added to control tree on the server. Even though you have runat="server" ASP.NET doesn't parse HTML resulting from XSLT transformation.
You would have to resort to parsing your XML to get the heading. With XPath it should be easy.
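As a sketch of that XPath approach using only the .NET base class library: the HeadlineReader class and the sample document shape are assumptions, and only the Headline element name is taken from the XSLT template in the question.

```csharp
using System;
using System.Xml.Linq;
using System.Xml.XPath;

public static class HeadlineReader
{
    // Returns the text of the first <Headline> element, or null if there is none.
    public static string GetHeadline(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        XElement headline = doc.XPathSelectElement("//Headline");
        return headline == null ? null : headline.Value.Trim();
    }

    public static void Main()
    {
        // Hypothetical document; the real XML presumably has more structure.
        string xml = "<Article><Headline>Sample heading</Headline><Body>Text</Body></Article>";
        Console.WriteLine(GetHeadline(xml)); // prints "Sample heading"
    }
}
```

In the code-behind, the result could then be assigned in Page_Load, for example Page.Title = HeadlineReader.GetHeadline(xml) ?? Page.Title, where xml is the same document that is fed to the XSLT control.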
