I need to write some code that will search and replace whole words in a string that are outside HTML tags. So if I have this string:
string content = "the brown fox jumped over <b>the</b> lazy dog over there";
string keyword = "the";
I need to do something like:
if (content.ToLower().Contains(keyword.ToLower()))
content = content.Replace(keyword, String.Format("<span style=\"background-color:yellow;\">{0}</span>", keyword));
but I don't want to replace the "the" in the bold tags or the "the" in "there", just the first "the".
You can use this library to parse your HTML and replace only the words that are not inside any HTML tag. To replace only the word "the" and not "there", use Regex.Replace("the\s+"...) instead of String.Replace (a word-boundary pattern such as \bthe\b is stricter still).
Try this:
content = Regex.Replace(content, "(?<!>)"
+ keyword
+ @"(?!(<|\w))", "<span blah...>" + keyword + "</span>");
Edit: I fixed the "these" case, but not the case where more than the keyword is wrapped in HTML, e.g., "<b>fox jumped over the lazy dog</b>".
What you're asking for is going to be nearly impossible with RegEx and normal, everyday HTML, because to know if you're "inside" a tag, you would have to "pair" each start and end tag, and ignore tags that are intended to be self-closing (BR and IMG, for instance).
If this is merely eye candy for a web site, I suggest going the other route: fix your CSS so the SPAN you are adding only impacts the HTML outside of a tag.
For example:
content = content.Replace("the", "<span class=\"highlight\">the</span>");
Then, in your CSS:
span.highlight { background-color: yellow; }
b span.highlight,
i span.highlight,
em span.highlight,
strong span.highlight,
p span.highlight,
blockquote span.highlight { background: none; }
Just add an exclusion for each HTML tag whose contents should not be highlighted.
I like the suggestion to use an HTML parser, but let me propose a way to enumerate the top-level text (no enclosing tags) regions, which you can transform and recombine at your leisure.
Essentially, you can treat each top-level open tag as a {, and track the nesting of only that tag. This might be simple enough compared to regular parsing that you want to do it yourself.
Here are some potential gotchas:
If it's not XHTML, you need a list of tags which are always empty:
<hr> , <br> and <img> (are there more?).
For all opening tags, if it ends in />, it's immediately closed - {} rather than {.
Case insensitivity - I believe you'll want to match tag names insensitively (just lc them all).
Super-permissive generous browser interpretations like
"<p> <p>" = "<p> </p><p>" = {}{
Quoted entities are NOT allowed to contain <> (they need to use &lt;), but maybe browsers are super permissive there as well.
Essentially, if you want to parse correct HTML markup, there's no problem.
So, the algorithm:
"end of previous tag" = start of string
1. Repeatedly search for the next open tag (case insensitive), or end of string:
< *([^ >/]+)[^/>]*(/?) *>|$
2. Handle (end of previous tag, start of match) as a region outside all tags.
3. Set tagname=lc($1). If there was a / ($2 isn't empty), then update end and continue at start. Else, with depth=1, while depth > 0:
   1. Scan for the next (also case insensitive):
   < *(/?) *$tagname *(/?) *>
   2. If $1, then it's a close tag (depth-=1). Else if not $2, it's another open tag; depth+=1. In any case, keep looping (back to 1.)
4. Back to start (you're at top level again). Note that I said at the top "scan for next start of top-level open tag, or end of string", i.e. make sure you process the top-level text hanging off the last closing tag.
That's it. Essentially, you get to ignore all other tags than the current topmost one you're monitoring, on the assumption that the input markup is properly nested (it will still work properly against some types of mis-nesting).
Also, wherever I wrote a space above, should probably be any whitespace (between < > / and tag name you're allowed any whitespace you like).
As you can see, just because the problem is slightly easier than full HTML parsing, doesn't necessarily mean you shouldn't use a real HTML parser :) There's a lot you could screw up.
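The steps above could be sketched in C# roughly as follows. This is only an illustration under the same assumption of properly nested markup; the class name TopLevelHighlighter, the method Highlight, the "highlight" CSS class and the list of empty tags are all made up for the example, and comments, CDATA and stray closing tags are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class TopLevelHighlighter
{
    // Tags that never have content (an assumed, incomplete list).
    private static readonly HashSet<string> VoidTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "br", "hr", "img", "input", "meta", "link" };

    // Any opening tag: group 1 = tag name, group 2 = "/" when self-closed.
    private static readonly Regex OpenTag =
        new Regex(@"<\s*([A-Za-z][^\s>/]*)[^>]*?(/?)\s*>", RegexOptions.Compiled);

    public static string Highlight(string html, string keyword)
    {
        // Whole-word, case-insensitive match of the keyword.
        var word = new Regex(@"\b" + Regex.Escape(keyword) + @"\b", RegexOptions.IgnoreCase);
        var result = new StringBuilder();
        int pos = 0;
        while (pos < html.Length)
        {
            Match open = OpenTag.Match(html, pos);
            int regionEnd = open.Success ? open.Index : html.Length;

            // Text between pos and regionEnd is outside every tag: transform it.
            result.Append(word.Replace(html.Substring(pos, regionEnd - pos),
                m => "<span class=\"highlight\">" + m.Value + "</span>"));
            if (!open.Success) break;

            int end = open.Index + open.Length;
            string name = open.Groups[1].Value;
            if (open.Groups[2].Length == 0 && !VoidTags.Contains(name))
            {
                // Track nesting of this one tag name until it is closed again.
                var sameTag = new Regex(
                    @"<\s*(/?)\s*" + Regex.Escape(name) + @"\b[^>]*?(/?)\s*>",
                    RegexOptions.IgnoreCase);
                int depth = 1;
                while (depth > 0)
                {
                    Match t = sameTag.Match(html, end);
                    if (!t.Success) { end = html.Length; break; } // never closed: treat the rest as inside
                    end = t.Index + t.Length;
                    if (t.Groups[1].Length > 0) depth--;          // close tag
                    else if (t.Groups[2].Length == 0) depth++;    // nested open tag
                }
            }

            // Copy the whole element (or the single empty tag) unchanged.
            result.Append(html, open.Index, end - open.Index);
            pos = end;
        }
        return result.ToString();
    }
}
```

Calling TopLevelHighlighter.Highlight(content, "the") on the string from the question wraps only the first "the"; the one inside the bold tags and the one in "there" are left alone.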
You'll need to give more details.
For example:
<p>the brown fox</p>
is technically inside HTML tags.
There is a tough nut to crack.
I have a HTML which needs to be stripped of some tags, attributes AND properties.
Basically there are three different approaches which are to be considered:
String Operations: Iterate through the HTML string and strip it via string operations 'manually'
Regex: Parsing HTML with RegEx is evil. Is stripping HTML evil too?
Using a library to strip it (e.g. HTML Agility Pack)
My wish is that I have lists for:
acceptedTags (e.g. SPAN, DIV, OL, LI)
acceptedAttributes (e.g. STYLE, SRC)
acceptedProperties (e.g. TEXT-ALIGN, FONT-WEIGHT, COLOR, BACKGROUND-COLOR)
Which I can pass to this function which strips the HTML.
Example Input:
<BODY STYLE="font-family:Tahoma;font-size:11;"> <DIV STYLE="margin:0 0 0 0;text-align:Left;font-family:Tahoma;font-size:16;"> <SPAN STYLE="font-weight:bold;color:#000000;background-color:#FF0000;font-family:tahoma;font-size:11;">Hello</SPAN></BODY>
Example Output (with parameter lists from above):
<DIV STYLE="text-align:Left;"> <SPAN STYLE="font-weight:bold;color:#000000;background-color:#FF0000;">Hello</SPAN>
the entire tag Body is stripped (not accepted tag)
properties margin, font-family and font-size are stripped from DIV-Tag
properties font-family and font-size are stripped from SPAN-Tag.
What have I tried?
Regex seemed to be the best approach at the first glance. But I couldn't get it working properly.
Articles on Stackoverflow I had a look at:
Regular expression to remove HTML tags
How to clean HTML tags using C#
...and many more.
I tried the following regex:
Dim AcceptableTags As String = "font|span|html|i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote"
Dim WhiteListPattern As String = "</?(?(?=" & AcceptableTags & _
")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>"
Dim Html as String = Regex.Replace(b.HTML, WhiteListPattern, "", RegexOptions.Compiled)
However, this is only removing tags and no attributes or properties!
I'm definitely not looking for someone who's doing the whole job. Rather for someone, who points me to the right direction.
I'm happy with either C# or VB.NET as answers.
Definitely use a library! (See this)
With HTMLAgilityPack you can do pretty much everything you want:
Remove tags you don't want:
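string[] allowedTags = {"SPAN", "DIV", "OL", "LI"};
// "//*" selects elements only; "//node()" would also return text nodes, whose name "#text" is not in the list
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//*"))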
{
if (!allowedTags.Contains(node.Name.ToUpper()))
{
HtmlNode parent = node.ParentNode;
parent.RemoveChild(node,true);
}
}
Remove attributes you don't want & remove properties
string[] allowedAttributes = { "STYLE", "SRC" };
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//node()"))
{
List<HtmlAttribute> attributesToRemove = new List<HtmlAttribute>();
foreach (HtmlAttribute att in node.Attributes)
{
if (!allowedAttributes.Contains(att.Name.ToUpper())) attributesToRemove.Add(att);
else
{
string newAttrib = string.Empty;
//do string manipulation based on your checking accepted properties
//one way would be to split the attribute.Value by a semicolon and do a
//String.Contains() on each one, not appending those that don't match. Maybe
//use a StringBuilder instead too
att.Value = newAttrib;
}
}
foreach (HtmlAttribute attribute in attributesToRemove)
{
node.Attributes.Remove(attribute);
}
}
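The string manipulation hinted at in the comments above could look something like this. It is a sketch using plain string operations only; the names StyleFilter and Filter are invented for the example, and it assumes that no property value contains a semicolon of its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StyleFilter
{
    // Keeps only the "name:value" declarations whose name is in allowedProperties.
    public static string Filter(string style, ISet<string> allowedProperties)
    {
        var kept = style
            .Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(d => d.Trim())
            .Where(d =>
            {
                int colon = d.IndexOf(':');
                return colon > 0 && allowedProperties.Contains(d.Substring(0, colon).Trim());
            });

        // Re-join the surviving declarations, keeping the trailing semicolon.
        string joined = string.Join(";", kept);
        return joined.Length == 0 ? joined : joined + ";";
    }
}
```

With a case-insensitive set such as new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "TEXT-ALIGN", "FONT-WEIGHT", "COLOR", "BACKGROUND-COLOR" }, the result could be assigned to att.Value in the loop above in place of the empty newAttrib.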
I would probably actually just write this myself as a multi-step process:
1) Exclude all rules for removing properties from tags that are listed as tags to be removed (the tags won't be there anyway!)
2) Walk the document, taking a copy of the document without excluded tags (i.e. in your example, copy everything up until "<div", then wait until I see ">" before continuing to copy). If I'm in copy mode and I see "ExcludedTag=", then stop copying until I see the closing quotation mark.
You'll probably want to do some pre-work validation on the html and getting the formatting the same, etc. before running this process to avoid broken output.
Oh, and copy in chunks, i.e. just keep the index of copy start until you reach copy end, then copy the whole chunk, not individual characters!
Hopefully that helps as a starting point.
I am trying to get the data between the html (span) provided (in this case 31)
Here is the original code (from inspect elements in chrome)
<span id="point_total" class="tooltip" oldtitle="Note: If the number is black, your points are actually a little bit negative. Don't worry, this just means you need to start subbing again." aria-describedby="ui-tooltip-0">31</span>
I have a rich textbox which contains the source of the page, here is the same code but in line 51 of the rich textbox:
<DIV id=point_display>You have<BR><SPAN id=point_total class=tooltip jQuery16207621750175125325="23" oldtitle="Note: If the number is black, your points are actually a little bit negative. Don't worry, this just means you need to start subbing again.">17</SPAN><BR>Points </DIV><IMG style="FLOAT: right" title="Gain subscribers" border=0 alt="When people subscribe to you, you lose a point" src="http://static.subxcess.com/images/page/decoration/remove-1-point.png"> </DIV>
How would I go about doing this? I have tried several methods and none of them seem to work for me.
I am trying to retrieve the point value from this page: http://www.subxcess.com/sub4sub.php
The number changes depending on who subs you.
You could be incredibly specific about it:
var regex = new Regex(@"<span id=""point_total"" class=""tooltip"" oldtitle="".*?"" aria-describedby=""ui-tooltip-0"">(.*?)</span>");
var match = regex.Match(@"<span id=""point_total"" class=""tooltip"" oldtitle=""Note: If the number is black, your points are actually a little bit negative. Don't worry, this just means you need to start subbing again."" aria-describedby=""ui-tooltip-0"">31</span>");
var result = match.Groups[1].Value;
You'll want to use HtmlAgilityPack to do this, it's pretty simple:
HtmlDocument doc = new HtmlDocument();
doc.Load("filepath");
HtmlNode node = doc.DocumentNode.SelectSingleNode("//span"); //Here, you can also do something like ("//span[@id='point_total']"); to select specific spans, etc...
string value = node.InnerText; //this string will contain the value of span, i.e. <span>***value***</span>
Regex, while a viable option, is something you generally would want to avoid if at all possible for parsing html (see Here)
In terms of sustainability, you'll want to make sure that you understand the page source (i.e., refresh it a few times and see if your target span is nested within the same parents after every refresh, make sure the page is in the same general format, etc..., then navigate to the span using the above principle).
There are multiple possibilities.
Regex
Let HTML be parsed as XML and get the value via XPath
Iterate through all elements. If you get on a span tag, skip all characters until you find the closing '>'. Then the value you need is everything before the next opening '<'
Also look at System.Windows.Forms.HtmlDocument
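For the regex option, a less brittle pattern than one spelling out every attribute might key on the id alone. The following is a sketch, not a robust HTML parser; the names PointReader and ReadSpan are made up, and it assumes the span has no nested span inside it. It tolerates both the quoted form Chrome shows and the unquoted, upper-case form from the rich textbox.

```csharp
using System.Text.RegularExpressions;

public static class PointReader
{
    // Returns the inner text of the first <span> whose id is spanId, or null if none is found.
    public static string ReadSpan(string html, string spanId)
    {
        // id may be quoted or unquoted; the lookahead stops "point_total2" from matching.
        string pattern = @"<span\b[^>]*\bid\s*=\s*[""']?" + Regex.Escape(spanId)
                       + @"[""']?(?=[\s>])[^>]*>(.*?)</span>";
        Match m = Regex.Match(html, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value.Trim() : null;
    }
}
```

PointReader.ReadSpan(pageSource, "point_total") would then give the number as a string, which can be passed to int.TryParse.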
I have following string:
<div> text0 </div> prefix <div> text1 <strong>text2</strong> text3 </div> text4
and want to know whether it contains text3 inside divs that go after prefix:
prefix<div>...text3...</div>
but I don't know how to make a regex for that, since I can't use [^<]+ because divs can contain a strong tag inside.
Please help
EDIT:
Div tags after prefix are guaranteed to be not nested
Language is C#
Text4 is very long, so regex must not look after closing div
EDIT2: I don't want to use html parser, it can be easily (and MUCH faster) achieved with Regex. HTML there is simple: no attributes in tags; no nesting div's. And even some % of wrong answers are acceptable in my case.
If you turn off the "greedy" option, you should be able to just use something like prefix<div>.*text3.*</div>. (If the <div> is allowed to have attributes, use prefix<div[^>]*>.*text3.*</div> instead.)
Numerous improvements could be made to this in order to take account of unusual spacing, >s within quotes, </div> within quotes, etc.
Patterns like prefix<div>...<div></div>text3</div> would be more difficult. You might have to capture all of the occurrences of the div tag so that you could count how many div tags were open at a given time.
EDIT: Oops, turning off the greedy option won't always give the right result, even in examples other than the one above. Probably better just to capture all occurrences of the div tag and go from there. As noted above by Peter, HTML is not a regular language and so you can't use regular expressions to do everything you might want with it.
this is my new regex:
prefix<div>([^<]*<(?!/div>))*[^<]*text3([^<]*<(?!/div>))*[^<]*</div>
seems to work ok.
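For what it's worth, here is how that pattern might be wrapped in C#. The names PrefixDivCheck and ContainsText3 are invented for the example, the capturing groups are made non-capturing, and a \s* is added after "prefix" on the assumption that whitespace may sit between the prefix and the div, as it does in the sample string.

```csharp
using System.Text.RegularExpressions;

public static class PrefixDivCheck
{
    // "<(?!/div>)" lets the match step over any tag except the closing </div>,
    // so the search never runs past the end of the first div after the prefix.
    private static readonly Regex Pattern = new Regex(
        @"prefix\s*<div>(?:[^<]*<(?!/div>))*[^<]*text3(?:[^<]*<(?!/div>))*[^<]*</div>",
        RegexOptions.Compiled);

    public static bool ContainsText3(string input)
    {
        return Pattern.IsMatch(input);
    }
}
```

On the sample string from the question this returns true; if text3 only appears after the closing div, or only in a div before the prefix, it returns false.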
For C# + HtmlAgilityPack you can do something like:
InputString = Regex.Replace(InputString,"^(?:[^<]+?|<[^>]*>)*?prefix","");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(InputString);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[contains(., 'text3')]");
The prefix removal is still not a good way of dealing with it. Ideally you'd do something like using HtmlAgilityPack to find where prefix occurs in the DOM, translate that to provide the position in the string, then do a substring(pos,len) (or equivalent) to look at only the relevant text (you can also avoid looking at text4 using similar method).
I'm afraid I can't translate all that into code right now; hopefully someone else can help there.
(original answer, before extra details provided)
Here is a JavaScript + jQuery solution:
var InputString = '<div>text0 </div> prefix <div>text1 <strong>text2</strong> text3 </div> text4';
InputString = InputString.replace(/^.*?prefix/,'');
var MatchingDivs = jQuery('div:contains(text3)','<div>'+InputString+'</div>')
console.log(MatchingDivs.get());
This makes use of jQuery's ability to accept a context as second argument (though it appears this needs to be wrapped in div tags to actually work).
I've got a varchar() field in SQL Server that has some carriage return/linefeeds between paragraph marks.
I'd like to turn it into properly formatted HTML.
For instance:
---------- before ----------
The quick brown fox jumped over the lazy dog. Then he got bored and went to bed. After that, he played with his friends.
The next day, he and his friends had a big party.
---------- after -----------
<p>The quick brown fox jumped over the lazy dog. Then he got bored and went to bed. After that, he played with his friends.</p>
<p>The next day, he and his friends had a big party.</p>
What's the right way to do this? Obviously regular expressions would be a good way to go, but I can't figure out how to trap the beginning of field along with the crlf (carriage return/linefeed) combo in a sane way.
Any regex geniuses out there? Would love some help. Thanks if so!
A regular expression is not required for something like this. Plain string operations can do it. (Example in C#):
text = "<p>" + text.Replace("\r\n", "</p><p>") + "</p>";
(Depending on if the line breaks are system dependent or not you should use either a specific string like "\r\n" or the property Environment.NewLine.)
If the string initially comes from user input so that you don't have total control over it, you have to properly html encode it before putting the paragraph tags in, to prevent cross site scripting attacks.
And do not forget that adding <p> tags is not enough, you have to escape characters that have special meaning in HTML ( < becomes &lt; and so on), otherwise you can end up with a broken page or even script injection.
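Putting the two points together (encode first, then wrap), a minimal sketch might look like this. The names ParagraphFormatter and ToParagraphs are made up; System.Net.WebUtility.HtmlEncode is used for the escaping, and blank lines between paragraphs are assumed to be noise and dropped.

```csharp
using System;
using System.Linq;
using System.Net;

public static class ParagraphFormatter
{
    public static string ToParagraphs(string text)
    {
        var paragraphs = text
            // "\r\n" is listed first so a CRLF pair counts as one separator.
            .Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.RemoveEmptyEntries)
            .Select(p => p.Trim())
            .Where(p => p.Length > 0)
            // Encode before adding the tags so user input cannot inject markup.
            .Select(p => "<p>" + WebUtility.HtmlEncode(p) + "</p>");
        return string.Join(Environment.NewLine, paragraphs);
    }
}
```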
If the text is already broken up into paragraphs with newlines, it could be as simple as
text = Regex.Replace(text, ".+", "<p>$0</p>");
This assumes there are no HTML special characters (as Thilo mentioned) or extra whitespace characters between paragraphs, like this: "text\n \nmore text". You would want to deal with anything like that before you add the tags.
If the string initially comes from user input so that you don't have total control over it, you have to properly html encode it before putting the paragraph tags in
yourString = "<p>" + text.Replace("\r\n", "</p><p>") + "</p>";
I'm trying to convert all instances of the > character to its HTML entity equivalent, &gt;, within a string of HTML that contains HTML tags. The furthest I've been able to get with a solution for this is using a regex.
Here's what I have so far:
public static readonly Regex HtmlAngleBracketNotPartOfTag = new Regex("(?:<[^>]*(?:>|$))(>)", RegexOptions.Compiled | RegexOptions.Singleline);
The main issue I'm having is isolating the single > characters that are not part of an HTML tag. I don't want to convert any existing tags, because I need to preserve the HTML for rendering. If I don't convert the > characters, I get malformed HTML, which causes rendering issues in the browser.
This is an example of a test string to parse:
"Ok, now I've got the correct setting.<br/><br/>On 12/22/2008 3:45 PM, jproot@somedomain.com wrote:<br/><div class"quotedReply">> Ok, got it, hope the angle bracket quotes are there.<br/>><br/>> On 12/22/2008 3:45 PM, > sbartfast@somedomain.com wrote:<br/>>> Please someone, reply to this.<br/>>><br/>><br/></div>"
In the above string, none of the > characters that are part of HTML tags should be converted to &gt;. So, this:
<div class"quotedReply">>
should become this:
<div class"quotedReply">&gt;
Another issue is that the expression above uses a non-capturing group, which is fine except for the fact that the match is in group 1. I'm not quite sure how to do a replace only on group 1 and preserve the rest of the match. It appears that a MatchEvaluator doesn't really do the trick, or perhaps I just can't envision it right now.
I suspect my regex could do with some lovin'.
Anyone have any bright ideas?
Why do you want to do this? What harm are the > doing? Most parsers I've come across are quite happy with a > on its own without it needing to be escaped to an entity.
Additionally, it would be more appropriate to properly encode the content strings with HttpUtility.HtmlEncode before concatenating them with strings containing HTML markup, hence if this is under your control you should consider dealing with it there.
The trick is to capture everything that isn't the target, then plug it back in along with the changed text, like this:
Regex.Replace(str, @"\G((?>[^<>]+|<[^>]*>)*)>", "$1&gt;");
But Anthony's right: right angle brackets in text nodes shouldn't cause any problems. And matching HTML with regexes is tricky; for example, comments and CDATA can contain practically anything, so a robust regex would have to match them specifically.
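Since the question mentions MatchEvaluator, here is one way it could be made to work, as an alternative to the \G approach: match either a whole tag or a lone >, and only rewrite the latter. This is a sketch with made-up names (AngleBracketEscaper, EscapeStrayGreaterThan), and like any regex over HTML it does not account for comments, CDATA, or > inside quoted attribute values.

```csharp
using System.Text.RegularExpressions;

public static class AngleBracketEscaper
{
    // Alternation order matters: a complete tag is consumed first, so only
    // '>' characters left over outside tags ever reach the second branch.
    private static readonly Regex TagOrStrayBracket =
        new Regex(@"<[^<>]*>|>", RegexOptions.Compiled);

    public static string EscapeStrayGreaterThan(string html)
    {
        // Tags are put back unchanged; a lone '>' becomes the entity.
        return TagOrStrayBracket.Replace(html, m => m.Value == ">" ? "&gt;" : m.Value);
    }
}
```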
Maybe read your HTML into an XML parser which should take care of the conversions for you.
Are you talking about the > chars inside of an HTML tag (like in JavaScript's innerText), or in the arguments list of an HTML tag?
If you want to just sanitize the text between the opening and closing tag, that should be rather simple. Just locate any > char and replace it with &gt;. (I'd also do the same with the < character.) But the HTML render engine SHOULD take care of this for you...
Give an example of what you are trying to sanitize, and maybe we can find the best solution for it.
Larry
Could you read the string into an XML document, look at the values, and replace the > with &gt; in the values? This would require recursively going into each node in the document, but that shouldn't be too hard to do.
Steve_C, you may try this regex. It captures the tag name in reference 1, and the text between the opening and closing tags in capture 2. I didn't fully test this, just throwing it out there in case it might help.
<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>