Regex removes too much text


In our CMS we are using some tags which should be replaced on exporting for other systems.
The code for replacing is stated below:
var rxStr = "<div[^<]+class=([\"'])related-document-content\\1.*</div>";
var rx = new System.Text.RegularExpressions.Regex(rxStr,
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
bodyText = rx.Replace(bodyText, "");
Our problem occurs when there are two instances of the tag matched by rxStr:
<p>First paragraph</p>
<div class='related-document-content' id='457'>First related text</div>
<p>Second paragraph</p>
<div class='related-document-content' id='458'>Second related text</div>
<p>Third paragraph</p>
When the code runs it also removes the second paragraph, and the output is
<p>First paragraph</p>
<p>Third paragraph</p>
Can anyone help me adjust the code so that only the div tags get removed?

Besides the obvious "use an HTML parser/writer instead":
Two things in your pattern let the match run too far.
First, your rxStr looks for "anything but the next open tag", <div[^<]+, where it should look for "anything but the current tag's end", <div[^>]+.
Second, .* is greedy: when both divs are on the same line, it runs from the first related div all the way to the last </div>, swallowing the second paragraph in between.
So stay inside the opening tag with [^>], add the closing > to your regular expression, and make the content match lazy. See below:
// Added [^>]*> towards the end, and made .* lazy (.*?) so it stops at the first </div>.
// Also adding () around the div content so you can debug better which matches were found.
var rxStr = "<div[^>]+class=([\"'])related-document-content\\1[^>]*>(.*?)</div>";
If the innerHTML of your div is actually text-only, use [^<]* instead of .*?:
var rxStr = "<div[^>]+class=([\"'])related-document-content\\1[^>]*>([^<]*)</div>";
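To check the text-only variant against the sample from the question, here is a minimal sketch. The class and method names (RelatedDivStripper, Strip) are made up for illustration, and the sample is assumed to sit on a single line, which is the case where a greedy pattern over-matches.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RelatedDivStripper
{
    // Text-only variant: [^>] keeps the match inside the opening tag,
    // [^<]* keeps it inside one div's text content.
    private static readonly Regex RelatedDiv = new Regex(
        "<div[^>]+class=([\"'])related-document-content\\1[^>]*>[^<]*</div>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: removes every related-document-content div.
    public static string Strip(string bodyText) => RelatedDiv.Replace(bodyText, "");

    public static void Main()
    {
        string bodyText =
            "<p>First paragraph</p>" +
            "<div class='related-document-content' id='457'>First related text</div>" +
            "<p>Second paragraph</p>" +
            "<div class='related-document-content' id='458'>Second related text</div>" +
            "<p>Third paragraph</p>";

        // prints <p>First paragraph</p><p>Second paragraph</p><p>Third paragraph</p>
        Console.WriteLine(Strip(bodyText));
    }
}
```

This only holds while the divs contain plain text. Nested markup inside the div is where a real HTML parser becomes the safer choice.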

Related

Stop Characters from Encoding in HTML from C# Code Behind

Hello and thank you in advance. I am generating controls from C# code-behind and want to apply an onclick attribute that sets a modal's display to block. When the code executes, apostrophes in the onclick function string are encoded to &#39; and I want to stop that from happening and pass the raw text.
I have tried the following to solve the problem with no resolution:
\'
''
<%= myString %>
HtmlString(myString).ToHtmlString();
There are probably other iterations that I tried, but I lost track of them all.
This is the code section creating the control:
HtmlGenericControl viewMoreInfo = new HtmlGenericControl("span");
//HtmlString onclick1 = new HtmlString($"document.getElementById('{i.il_itemcode}Modal').style.display = 'block'");
string onclick1 = $"document.getElementById('{i.il_itemcode}Modal').style.display = 'block'";
viewMoreInfo.Attributes["onclick"] = onclick1;
viewMoreInfo.Attributes["class"] = "w3-bar-item w3-button w3-white w3-xlarge w3-right";
viewMoreInfo.InnerText = "+";
Here is an example of what it renders as:
<span onclick="document.getElementById(&#39;TEST1Modal&#39;).style.display = &#39;block&#39;" class="w3-bar-item w3-button w3-white w3-xlarge w3-right">+</span>
I need the apostrophe to render correctly for the function to work.
Any help would be greatly appreciated.

Your question is misleading, because "I need the apostrophe to render correctly for the function to work" is incorrect. The browser decodes &#39; back to an apostrophe before it runs the handler. The following is an exact copy of your rendered markup, and the click event works just fine as it is currently rendered.
<p id="TEST1Modal" style="display:none">This is hidden until clicked</p>
<span onclick="document.getElementById(&#39;TEST1Modal&#39;).style.display = &#39;block&#39;" class="w3-bar-item w3-button w3-white w3-xlarge w3-right">+</span>
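To see on the server side that the encoded attribute still carries the same script, a small sketch can decode it with System.Net.WebUtility.HtmlDecode, which does roughly what the browser does to attribute values before running them. The class name AttributeDecodingDemo and its helper are made up for this example.

```csharp
using System;
using System.Net;

public static class AttributeDecodingDemo
{
    // Hypothetical helper: decodes HTML character references in an attribute value.
    public static string DecodeAttribute(string encoded) => WebUtility.HtmlDecode(encoded);

    public static void Main()
    {
        // The attribute value as ASP.NET wrote it into the page source.
        string rendered =
            "document.getElementById(&#39;TEST1Modal&#39;).style.display = &#39;block&#39;";

        // prints document.getElementById('TEST1Modal').style.display = 'block'
        Console.WriteLine(DecodeAttribute(rendered));
    }
}
```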

Why label control does not render DIV as HTML (AllowHtmlString=true)

I want to center some of the strings.
I found this example:
https://documentation.devexpress.com/WindowsForms/9536/Controls-and-Libraries/Editors-and-Simple-Controls/Simple-Editors/Examples/How-to-Format-Text-in-LabelControl-Using-HTML-Tags
So, I wrote this code.
labelControl1.Text = "<div style=\"text-align:center;\">center</div><br>" +
"<size=14>Size = 14<br>" +
"Bold <i>Italic</i> <u>Underline</u><br>" +
"<color=255, 0, 0>Sample Text</color></size>";
labelControl1.AllowHtmlString = true;
labelControl1.Appearance.TextOptions.WordWrap = WordWrap.Wrap;
labelControl1.Appearance.Options.UseTextOptions = true;
labelControl1.AutoSizeMode = LabelAutoSizeMode.Vertical;
But, it didn't work.
What is the problem with it?

According to the HTML Text Formatting documentation, the LabelControl.AllowHtmlString property supports these tags and "pseudotags" (tags which do not exist in the current HTML standard but can be used for rendering in the label control):
Normal HTML tags
<b> - bold text
<i> - italic text
<s> - strikethrough
<u> - underline
<br> (current HTML equivalent is <br />)
Pseudotags
<color> (equivalent to CSS color)
<backcolor> (equivalent to CSS background-color)
<size> (equivalent to CSS font-size)
<image=value> (equivalent to HTML <img src="value">)
<href=url> (equivalent to HTML <a href="url">)
<nbsp> (equivalent to HTML &nbsp;)
The HTML <div> tag is not among the supported tags listed above, so it is rendered as plain text instead.

According to the documentation, only specific HTML tags are supported, and div is not in the list.
Depending on your requirements, you might split the text into two labels, one centered (AutoSize=False, TextAlign=MiddleCenter) and one with HTML.

Extract content within a div tag ignoring other tags inside

Below is the sample html source
<div id="page2" dir="ltr">
<p>This text I dont want to extract</p>
This is the text which I want to extract
</div>
Irrespective of the div tag's attributes, I want to extract only the div's own text, ignoring the text of any other tags nested inside it.
In the above example I do not want the text within the <p></p> tag, only the text directly within <div></div>, i.e. "This is the text which I want to extract".
XmlNodeList DivNodeList = xDoc.GetElementsByTagName("div");
string DivInnerText;
for (int i = 0; i < DivNodeList.Count; i++)
{
if (!DivNodeList[i].InnerXml.Contains("p"))
{
DivInnerText = DivNodeList[i].InnerText.Trim();
Div_List.Add(DivInnerText);
}
}
But the above code does not work as expected: because I only extract the text when no p tag is present, a div that contains a p tag is skipped entirely. Moreover, the InnerText of the div combines the text of all the tags inside it.
Any help on this is greatly appreciated.

For HTML processing, you should try the HtmlAgilityPack library.
Your requirement should be easy to do.
Take a look : http://www.c-sharpcorner.com/UploadFile/9b86d4/getting-started-with-html-agility-pack/
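If the markup is well-formed enough to load into the XmlDocument from the question, another option is to collect only the div's direct text nodes and skip child elements such as <p>. This is a minimal sketch under that assumption; DivTextExtractor and DirectDivText are names invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml;

public static class DivTextExtractor
{
    // Hypothetical helper: returns, for each div, only the text that is a
    // direct child of the div, ignoring text inside nested elements.
    public static List<string> DirectDivText(XmlDocument xDoc)
    {
        var result = new List<string>();
        foreach (XmlNode div in xDoc.GetElementsByTagName("div"))
        {
            var sb = new StringBuilder();
            foreach (XmlNode child in div.ChildNodes)
            {
                // Element children (such as <p>) are skipped; only text nodes count.
                if (child.NodeType == XmlNodeType.Text)
                    sb.Append(child.Value);
            }
            string text = sb.ToString().Trim();
            if (text.Length > 0)
                result.Add(text);
        }
        return result;
    }

    public static void Main()
    {
        var xDoc = new XmlDocument();
        xDoc.LoadXml("<div id=\"page2\" dir=\"ltr\">\n<p>This text I dont want to extract</p>\nThis is the text which I want to extract\n</div>");

        // prints This is the text which I want to extract
        foreach (string text in DirectDivText(xDoc))
            Console.WriteLine(text);
    }
}
```

Real-world HTML often fails to load as XML, which is where HtmlAgilityPack is the better fit.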

Using jQuery you can achieve this with:
$("#page2").clone().children().remove().end().text();
The credit should go to "DotNetWala", who posted this approach in another answer.

Error while using XPath to parse text from HTML

The HTML content I need to parse is the text in the marquee element as given below. I'm using C# with HTML Agility Pack to parse it, but a NullReferenceException is thrown.
C# code is
var ht1 = ht.DocumentNode.SelectSingleNode("html/body/table/tbody/tr/td[2]/div[2]/marquee/text()").InnerText;
Part of HTML:
<html>
-<body ...
-<table id=..
-<tbody>
-<tr>
+<td.........
-<td
+<div ......
-<div style="width:100%;padding:0;margin:0;border
-style:solid;border-width:0;border-color:darkred;">
<marquee width="100%" height="20" bgcolor="" style="color:
darkorchid; font-size: 14" loop="3" behavior="scroll"
scrolldelay="90 scrollamount="5" align="middle" border="0">
your scrolling text - these are some samples - think of
possibilities</marquee>
<div>

Did you look at the raw source of the HTML file? If you only look at the HTML shown by a browser tool such as Firebug in Firefox, it shows additional tbody tags that are not actually in the file.
Therefore use:
var ht1 = ht.DocumentNode.SelectSingleNode("html/body/table/tr/td[2]/div[2]/marquee/text()").InnerText;
You usually do not want text() here, because InnerText already gives you the node's text content, and text() selects a set of text nodes, not the concatenated text.
Therefore use:
var ht1 = ht.DocumentNode.SelectSingleNode("html/body/table/tr/td[2]/div[2]/marquee").InnerText;
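The exception itself comes from calling .InnerText on the null that SelectSingleNode returns when the path matches nothing. The sketch below shows that with the built-in System.Xml classes as a stand-in, since they run without extra packages; HTML Agility Pack's SelectSingleNode likewise returns null for no match. The class name, the TextAt helper, and the simplified markup are all assumptions made for the example.

```csharp
using System;
using System.Xml;

public static class XPathNullCheckDemo
{
    // Hypothetical helper: returns the node's text, or null when the XPath matches nothing.
    public static string TextAt(XmlDocument doc, string xpath)
    {
        XmlNode node = doc.SelectSingleNode(xpath);
        return node?.InnerText;
    }

    public static void Main()
    {
        // Simplified stand-in for the page: note there is no tbody in the source.
        var doc = new XmlDocument();
        doc.LoadXml(
            "<html><body><table><tr><td>menu</td><td><div>first</div>" +
            "<div><marquee>your scrolling text</marquee></div></td></tr></table></body></html>");

        // prints (no match)
        Console.WriteLine(TextAt(doc, "/html/body/table/tbody/tr/td[2]/div[2]/marquee") ?? "(no match)");

        // prints your scrolling text
        Console.WriteLine(TextAt(doc, "/html/body/table/tr/td[2]/div[2]/marquee"));
    }
}
```

Checking for null before reading InnerText turns a crash into a result you can log while you adjust the path.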

That page does not seem to be well-formed HTML.
This worked for me though:
ht.DocumentNode.SelectSingleNode(@"html/head/table[1]/tbody/tr/td[1]/td/div[2]/marquee").InnerText;

Regular expression to find anchor link with special href?

I just need to find a regular expression for the following:
I have some content in a div tag that includes a lot of anchor links. My task is to find the anchor links whose href has the format "components/showdoc.aspx?docid=" and add an onclick event to those links only, leaving the rest of the anchor links alone.
<div id="content" runat="server">
test doc
</div>
This expression matches every anchor and adds a target to it:
RegEx.Replace(inputString, "<(a)([^>]+)>", "<$1 target=""_blank""$2>")
Thanks

If you are looking to make permanent changes to your HTML file, first manage your HTML parsing by loading it into a System.Windows.Forms.WebBrowser control. From there you can perform DOM-like modifications to the HTML without the dangerous repercussions of parsing corruption that can be caused by performing Regex.Replace on the raw file. (Apparently RegEx + HTML is a serious issue for some).
So first in your code you would:
WebBrowser myBrowser = new WebBrowser();
// Url is a System.Uri; the document is only available once DocumentCompleted has fired.
myBrowser.Url = new Uri(@"C:\MyPath\MyFile.HTML");
HtmlElement myDocBody = myBrowser.Document.Body;
Then you can navigate through your document body, seeking out your div tag and looking for your anchor tags by using the HtmlElement.Id property and HtmlElement.GetAttribute method.
Note: feel free to still use RegEx matching on the URL strings but only after extracting them from a GetAttribute("href") method.
To add the onclick handler, simply call the HtmlElement.SetAttribute method.
When you have finished all your modifications, save the changes by writing the WebBrowser.DocumentText to file.
Here is a reference:
http://msdn.microsoft.com/en-us/library/system.windows.forms.htmlelement.aspx

Don't use regex to parse html, it's evil.
You could use the HTML Agility Pack, it even has a nice NuGet Package.
Alternatively, you could do this on the client side with a single line of jQuery:
$('a[href*="components/showdoc.aspx?docid="]').on('click', myClickFunction);
This is making use of the Attribute Contains Selector.
If you want to find the docid in your click function, you could write something like this in your click function:
function myClickFunction(e){
var href = $(this).attr('href');
var docId = href.split('=')[1];
alert(docId);
}
Note that this assumes there's only ever one query string value. If you want to make this more robust, you could do something like in this answer: https://stackoverflow.com/a/1171731/21200
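If a server-side regex is still wanted for a quick one-off, despite the caveats above, a rough sketch could look like the following. It assumes quoted href values and anchors that do not already have an onclick attribute; ShowDocLinkTagger, AddOnClick, and the showDoc(this) handler are hypothetical names, not part of the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ShowDocLinkTagger
{
    // Matches an opening <a ...> tag whose quoted href contains
    // components/showdoc.aspx?docid= and captures its attributes in group 1.
    private static readonly Regex ShowDocAnchor = new Regex(
        "<a\\b([^>]*\\bhref=([\"'])[^\"'>]*components/showdoc\\.aspx\\?docid=[^\"'>]*\\2[^>]*)>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: appends an onclick attribute to matching anchors only.
    public static string AddOnClick(string html) =>
        ShowDocAnchor.Replace(html, "<a$1 onclick=\"showDoc(this)\">");

    public static void Main()
    {
        string html =
            "<a href=\"components/showdoc.aspx?docid=12\">doc</a> " +
            "<a href=\"other.aspx\">other</a>";

        // prints <a href="components/showdoc.aspx?docid=12" onclick="showDoc(this)">doc</a> <a href="other.aspx">other</a>
        Console.WriteLine(AddOnClick(html));
    }
}
```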
