Show HTML from XML on a page both as rendered HTML and as a literal string - C#

I have a problem displaying HTML code that is stored in XML.
Here is a sample of the data:
<item>
<title>
My HTML code
</title>
<description>
<![CDATA[Here some example
<ul style="list-style: disc;" type="disc">
<li>Text1</li>
<li>Text2</li>
</ul>]]>
</description>
</item>
I'd like to show this code in two ways: first rendered as an HTML list, and second as a string with all the tags visible.
I tried to read the value in two ways, and both give the same result:
string DescriptionCurent = item.Element("description").Value.ToString();
HtmlString html2 = new HtmlString(item.Element("description").Value.ToString());
And at the end I show this on the page:
<p><%= DescriptionCurent %></p>
<p><%: html2 %></p>
In both cases the text before the list ends up inside the p tag, and the list is rendered as a regular list outside the p tag.
I use ASP.NET 4.5 Web Forms.
If something is unclear, please ask and I will try to explain better.
UPDATE:
I can't add images because I need 10 reputation, so I put them on a free host.
Here is an image of how it looks:
http://tinypic.com/view.php?pic=2eam7bb&s=8#.VQLYLY7F91A
And here is how it looks in Inspect Element:
http://tinypic.com/view.php?pic=zv205h&s=8
So I just need everything to end up inside the tag, not outside it.

I assume that item is an XElement from LINQ to XML. To get the XML markup of an XElement, call ToString() on the element itself (not on its Value property):
HtmlString html2 = new HtmlString(item.Element("description").ToString());
Or, if you specifically want the content of the CDATA section (excluding the <![CDATA[ ... ]]> wrapper itself):
XCData cdata = (XCData)item.Element("description").FirstNode;
HtmlString html2 = new HtmlString(cdata.Value);
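The two displays asked for in the question (rendered list versus visible tags) come down to whether the markup is HTML-encoded before it is written to the page. A minimal sketch, assuming a trimmed-down version of the question's item element; the class name DescriptionViews and its method names are made up for illustration:

```csharp
using System;
using System.Net;
using System.Xml.Linq;

// Hypothetical helper, not part of the original question or answer.
static class DescriptionViews
{
    // Returns the CDATA content unchanged; a browser renders it as a real list.
    public static string Raw(XElement item)
    {
        return item.Element("description").Value;
    }

    // Returns the same content HTML-encoded; a browser shows the tags as text.
    public static string Encoded(XElement item)
    {
        return WebUtility.HtmlEncode(Raw(item));
    }
}

static class Demo
{
    static void Main()
    {
        // Assumed sample data, shortened from the question's XML.
        XElement item = XElement.Parse(
            "<item><description><![CDATA[Here some example<ul><li>Text1</li></ul>]]></description></item>");

        Console.WriteLine(DescriptionViews.Raw(item));
        Console.WriteLine(DescriptionViews.Encoded(item));
    }
}
```

In Web Forms, <%= value %> writes a string as-is, and <%: value %> HTML-encodes a plain string but leaves an HtmlString untouched, which is why both lines in the question produce the same output. Also note that a ul cannot be nested inside a p in HTML, so the browser closes the p before the list; wrapping the output in a div instead keeps the list inside the container.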

Extract content within a div tag ignoring other tags inside

Below is the sample HTML source:
<div id="page2" dir="ltr">
<p>This text I dont want to extract</p>
This is the text which I want to extract
</div>
Regardless of the div tag's attributes, I want to extract only the div's own text, ignoring the text of any tags nested inside it.
In the example above I do not want the text inside the <p></p> tag, only the text directly inside the <div></div> tag, i.e. "This is the text which I want to extract".
XmlNodeList DivNodeList = xDoc.GetElementsByTagName("div");
string DivInnerText;
for (int i = 0; i < DivNodeList.Count; i++)
{
if (!DivNodeList[i].InnerXml.Contains("p"))
{
DivInnerText = DivNodeList[i].InnerText.Trim();
Div_List.Add(DivInnerText);
}
}
But the above code does not work as expected: it only extracts text when no p tag is present, so a div that contains a p tag is skipped entirely. Moreover, the InnerText of a div combines the text of all the tags inside it.
Any help on this is greatly appreciated.
For HTML processing, you should try the HtmlAgilityPack library.
Your requirement should be easy to do.
Take a look : http://www.c-sharpcorner.com/UploadFile/9b86d4/getting-started-with-html-agility-pack/
Using jQuery you can achieve this with:
$("#page2").clone().children().remove().end().text();
The credit should go to "DotNetWala"; check his answer.
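The same idea (keep only the text nodes that are direct children of the div) can also be sketched in C# with System.Xml, which the question already uses. This assumes the markup is well-formed XML; for real-world HTML a parser such as HtmlAgilityPack is the safer choice. The class name DivText and method name OwnText are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml;

// Hypothetical helper for illustration only.
static class DivText
{
    // Returns, for each div, the text that sits directly inside it,
    // skipping text that belongs to nested elements such as <p>.
    public static List<string> OwnText(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // assumption: input is well-formed XML

        var result = new List<string>();
        foreach (XmlNode div in doc.GetElementsByTagName("div"))
        {
            var sb = new StringBuilder();
            foreach (XmlNode child in div.ChildNodes)
            {
                // Only direct text children count; element children are ignored.
                if (child.NodeType == XmlNodeType.Text)
                {
                    sb.Append(child.Value);
                }
            }
            result.Add(sb.ToString().Trim());
        }
        return result;
    }
}
```

Unlike the Contains("p") check in the question, this does not skip a div just because it has a p child; it simply leaves the p's text out.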

How can I wrap a <span> around matched words in HTML without breaking the HTML

Using C# - WinForms
I have a valid HTML string which may or may not contain various HTML elements such as <a>.
I need to search this HTML and highlight certain keywords - the highlighting is done by adding a <span> around the text with inline styling. I should not be doing this for <a> tags, or any other HTML tag that isn't actually visible to the user.
e.g. currently I am doing this:
html = html.Replace(phraseToCount, "<span style=\"background: #FF0000; color: #FFFFFF; font-weight: bold;\">" + phraseToCount + "</span>");
This kind of works but it breaks <a> tags. So in the example below only the 1st instance of the word cereal should end up with a <span> around it:
<p>To view more types of cereal click here.</p>
How could I do this?
EDIT - more info.
This will be running in a WinForms app, since the best way to get the HTML is the WebBrowser control. I will be scraping web pages and highlighting various words.
You're handling HTML as plain text. You don't want that. You only want to search through the "InnerText" of your HTML elements, as in <p attribute="value">innertext</p>. Not through tags, comments, styles and script and whatever else can be included in your document.
In order to do that properly, you need to parse the HTML, and then obtain all elements' InnerTexts and do your logic on that.
In fact, InnerText is a simplification: when you have an element like <p>FooBar<span>BarBaz</span></p> where "Baz" is to be replaced, then you need to actually recursively iterate all the nodes in the DOM, and only replace text nodes, because writing into the InnerText property will remove all child nodes.
For how to do that, you'd want to use a library. You don't want to build an HTML parser on your own. See for example C#: HtmlAgilityPack extract inner text, Extracting Inner text from HTML BODY node with Html Agility Pack, How can i parse InnerText of <option> tag with HtmlAgilityPack?, Parsing HTML with CSQuery, HtmlAgilityPack - get all nodes in a document and so on.
The most important one seems to be How can I retrieve all the text nodes of a HTMLDocument in the fastest way in C#?:
HtmlNodeCollection coll = htmlDoc.DocumentNode.SelectNodes("//text()");
foreach (HtmlTextNode node in coll.Cast<HtmlTextNode>())
{
node.Text = node.Text.Replace(...);
}
Here's how you would do what @CodeCaster suggested in CsQuery:
string str = "<p>To view more types of cereal click here cereal.</p>";
var cq = CQ.Create(str);
foreach (IDomElement node in cq.Elements)
{
PerformActionOnTextNodeRecursively(node, domNode => domNode.NodeValue = domNode.NodeValue.Replace("cereal", "<span>cereal</span>"));
}
Console.WriteLine(cq.Render());
private static void PerformActionOnTextNodeRecursively(IDomNode node, Action<IDomNode> action)
{
foreach (var childNode in node.ChildNodes)
{
if (childNode.NodeType == NodeType.TEXT_NODE)
{
action(childNode);
}
else
{
PerformActionOnTextNodeRecursively(childNode, action);
}
}
}
Hope it helps.
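For markup that happens to be well-formed XML, the "only touch text nodes, and skip anything inside an a element" rule can also be sketched with LINQ to XML from the standard library. This is an illustration under that assumption, not a replacement for a real HTML parser; the class name Highlighter and the shortened inline style are made up:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration only.
static class Highlighter
{
    public static string Highlight(string xhtml, string phrase)
    {
        // assumption: input is well-formed XML
        XElement root = XElement.Parse(xhtml, LoadOptions.PreserveWhitespace);

        // Snapshot the text nodes first so the tree can be changed while looping.
        // Text nodes with an <a> ancestor are left alone.
        List<XText> textNodes = root.DescendantNodes()
            .OfType<XText>()
            .Where(t => !t.Ancestors("a").Any())
            .ToList();

        foreach (XText text in textNodes)
        {
            string[] parts = text.Value.Split(new[] { phrase }, StringSplitOptions.None);
            if (parts.Length == 1)
            {
                continue; // phrase not present in this text node
            }

            // Rebuild the node as: text, <span>phrase</span>, text, ...
            var replacement = new List<object>();
            for (int i = 0; i < parts.Length; i++)
            {
                if (i > 0)
                {
                    replacement.Add(new XElement("span",
                        new XAttribute("style", "background: #FF0000;"), phrase));
                }
                if (parts[i].Length > 0)
                {
                    replacement.Add(new XText(parts[i]));
                }
            }
            text.ReplaceWith(replacement.ToArray());
        }
        return root.ToString(SaveOptions.DisableFormatting);
    }
}
```

Because the span is inserted as an element rather than as a string, it is never escaped, and attribute values and link text that contain the phrase stay untouched.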

How to get first-level elements from HTML file with HTML Agility Pack & c#

I want to get the first-level elements by parsing an HTML file with HTML Agility Pack. For example, the result would look like this:
<html>
<body>
<div class="header">....</div>
<div class="main">.....</div>
<div class="right">...</div>
<div class="left">....</div>
<div class="footer">...</div>
</body>
</html>
Each of these contains other tags.
I want to extract all the text on the website, but separately: for example the right side, the left side, the footer, and so on.
Can anyone help me?
Thanks.
Use HtmlAgilityPack to load the webpage from the given URL, then parse it by selecting the correct corresponding tags.
HtmlWeb page = new HtmlWeb();
HtmlDocument doc = page.Load("http://www.google.com");
If you want to select a specific div with the class name 'main', you do so through the DocumentNode property of your document object.
string mainText = doc.DocumentNode.SelectSingleNode("//div[@class=\"main\"]").InnerText;
Chances are, though, that several tags in your HTML are members of the 'main' class, so you either have to select them all and iterate over the collection, or be more precise when you select your single node.
To get a collection of all tags in, for example, the class 'main', use the DocumentNode.SelectNodes method instead.
I suggest you take a look at this question at SO where some of the basics and links to tutorials are available.
How to use HTML Agility pack
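The first-level elements themselves can be selected with the XPath /html/body/*. The sketch below uses System.Xml as a stand-in so that it runs without extra packages, and therefore assumes well-formed markup; HtmlAgilityPack's DocumentNode.SelectNodes accepts the same XPath string. The class name FirstLevel and method name TextBySection are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper for illustration only.
static class FirstLevel
{
    // Maps each direct child of <body> (keyed by its class attribute,
    // or by its tag name when it has no class) to its combined text.
    public static Dictionary<string, string> TextBySection(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // assumption: input is well-formed XML

        var sections = new Dictionary<string, string>();
        foreach (XmlNode child in doc.SelectNodes("/html/body/*"))
        {
            XmlAttribute cls = child.Attributes["class"];
            string key = cls != null ? cls.Value : child.Name;
            sections[key] = child.InnerText.Trim();
        }
        return sections;
    }
}
```

If two first-level elements share a class name, the later one overwrites the earlier one in this sketch; a list of pairs would avoid that.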

How to remove nodes using HTML agility pack and XPath so as to clean the HTML page

I need to extract text from web pages, mostly related to business news.
Say the HTML page is as follows:
<html>
<body>
<div>
<p> <span>Desired Content - 1</span></p>
<p> <span>Desired Content - 2</span></p>
<p> <span>Desired Content - 3</span></p>
</div>
</body>
</html>
I have a sample stored in a string that takes me directly to Desired Content - 1, so I can collect that content. But I also need to collect Desired Content - 2 and 3.
What I tried: from the current location, i.e. from within the span node of Desired Content - 1, I moved up to the enclosing paragraph node and got its content. But I actually need the entire desired content of the div. How can I do that? You might suggest going to the div directly via two parent steps from the span, but that would be specific to this example; I need a general approach.
Most news articles have the desired content in a div, and my search lands directly on some nested inner node of that div. I need to move up out of those inner nodes only until I reach a div, and then get its InnerText.
I am using XPath and HtmlAgilityPack.
The XPath I am using is:
variable = doc.DocumentNode.SelectSingleNode("//*[contains(text(),'" + searchData + "')]").ParentNode.ParentNode.InnerText;
Here searchData is a variable holding a sample of Desired Content - 1, used to search the entire body of the web page for the node that contains the news.
What I am considering is cleaning up the web pages so that only the main tags remain (html, body, tables, divs and paragraphs) with no spans or other formatting elements. But some websites might use only spans instead of divs, so I am not sure how to implement this.
The basic requirement is to extract the news content from different web pages (almost 250 different websites), so I cannot write code specific to each page. I need a generic method.
Any ideas appreciated. Thank you.
This XPath expression selects the innermost div element with $searchData variable reference value as part of its string value.
//div[contains(.,$searchData)]
[not(.//div[contains(.,$searchData)])]
Found out the answer myself.
Using a while loop to move up until I reach a div ancestor, and then reading its InnerText, works.
// Select the desired node, move up until a div is found, then get its inner text.
node = hd.DocumentNode.SelectSingleNode("//*[contains(text(),'" + searchData + "')]"); // Find the desired node.
while (node != null && node.Name != "div") // Move up until the enclosing div node is reached.
{
node = node.ParentNode;
}
if (node != null) // null means no match or no enclosing div was found.
{
Body = node.InnerText;
}
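The climb up to the nearest div can also be expressed in XPath itself with ancestor::div[1], which selects the closest enclosing div of the matched node. The sketch below uses System.Xml so that it runs on its own, which assumes well-formed markup; the same XPath string can be passed to HtmlAgilityPack's SelectSingleNode. The class and method names are hypothetical:

```csharp
using System;
using System.Xml;

// Hypothetical helper for illustration only.
static class NearestDiv
{
    // Returns the text of the closest div that encloses the first node
    // whose text contains searchData, or null when there is none.
    public static string TextOfEnclosingDiv(string xhtml, string searchData)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // assumption: input is well-formed XML

        // assumption: searchData contains no single quote, since it is
        // concatenated into the XPath string literal.
        XmlNode div = doc.SelectSingleNode(
            "//*[contains(text(),'" + searchData + "')]/ancestor::div[1]");

        return div == null ? null : div.InnerText;
    }
}
```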

Check Empty XML data validation before displaying

I want to check XML data before displaying it. I am using XPath, not XSL, for this. For example:
<title></title>
<url></url>
<submit></submit>
I want to check whether the XML data for an element is missing and, if so, not display it, because I am putting these values into <a href=<%#container.dataitem,url%>>new link</a>.
So if url is empty, "new link" should not be displayed; otherwise it should. Similarly, title should be displayed only if it is not empty.
The main problem is that I can do a check like this in the ascx.cs file:
if(iterator.current.value="") don't display it
but in the ascx file itself I am writing the "new link" anchor shown above, and I want that anchor not to appear if url is empty.
Any idea how to check this condition?
I've seen this handled using an asp:Literal control.
In the web form, you'd have <asp:Literal id='literal' runat='server' text='<%# GetAnchorTag(Container.DataItem) %>' />
And in the code behind, you'd have:
protected string GetAnchorTag(object dataItem) {
if(dataItem != null) {
string url = Convert.ToString(DataBinder.Eval(dataItem, "url"));
if(!string.IsNullOrEmpty(url)) {
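// Build your anchor tag; "new link" is the link text used in the question.
string anchor = "<a href=\"" + System.Web.HttpUtility.HtmlAttributeEncode(url) + "\">new link</a>";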
return anchor;
}
}
return string.Empty;
}
This way, you either output a full anchor tag or an empty string. I don't know how this would fit in with your title and submit nodes, but it solves the anchor display issue.
Personally, I don't like this approach, but I've seen it quite a bit.
Use XPath. Assuming that the elements are enclosed in an element named link:
link[title != '' and url !='']
will find the link elements whose title and url child elements are both non-empty. To make it a little more bulletproof,
link[normalize-space(title) != '' and normalize-space(url) !='']
will keep the expression from matching link elements whose title or url children contain only whitespace.
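As a rough sketch of how that expression could be applied from C#, the following uses XmlDocument.SelectNodes. The links wrapper element, the class name LinkFilter and the method name NonEmptyUrls are assumptions made for the example; only the link, title and url names come from the answer above:

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper for illustration only.
static class LinkFilter
{
    // Returns the url of every link whose title and url both contain
    // something other than whitespace.
    public static List<string> NonEmptyUrls(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var urls = new List<string>();
        foreach (XmlNode link in doc.SelectNodes(
            "//link[normalize-space(title) != '' and normalize-space(url) != '']"))
        {
            urls.Add(link.SelectSingleNode("url").InnerText.Trim());
        }
        return urls;
    }
}
```

Binding the repeater to only the nodes returned here means the markup never has to decide whether to hide an anchor.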
If you don't have access to the .cs file for this, you can still embed the code right in the .ascx file. Remember, you don't have to put all your code in the code-behind file; it can go inline in the .ascx file.
<%
if(iterator.current.value!="") {
%>
<a href=<%#container.dataitem,url%>>new link</a>
<%
}
%>
What about //a[not(./@href) or not(text()='')]
