Convert > to HTML entity equivalent within HTML string - c#

I'm trying to convert all instances of the > character to its HTML entity equivalent, &gt;, within a string of HTML that contains HTML tags. The furthest I've been able to get with a solution for this is using a regex.
Here's what I have so far:
public static readonly Regex HtmlAngleBracketNotPartOfTag = new Regex("(?:<[^>]*(?:>|$))(>)", RegexOptions.Compiled | RegexOptions.Singleline);
The main issue I'm having is isolating the single > characters that are not part of an HTML tag. I don't want to convert any existing tags, because I need to preserve the HTML for rendering. If I don't convert the > characters, I get malformed HTML, which causes rendering issues in the browser.
This is an example of a test string to parse:
"Ok, now I've got the correct setting.<br/><br/>On 12/22/2008 3:45 PM, jproot#somedomain.com wrote:<br/><div class"quotedReply">> Ok, got it, hope the angle bracket quotes are there.<br/>><br/>> On 12/22/2008 3:45 PM, > sbartfast#somedomain.com wrote:<br/>>> Please someone, reply to this.<br/>>><br/>><br/></div>"
In the above string, none of the > characters that are part of HTML tags should be converted to &gt;. So, this:
<div class"quotedReply">>
should become this:
<div class"quotedReply">&gt;
Another issue is that the expression above uses a non-capturing group, which is fine except for the fact that the match is in group 1. I'm not quite sure how to do a replace only on group 1 and preserve the rest of the match. It appears that a MatchEvaluator doesn't really do the trick, or perhaps I just can't envision it right now.
I suspect my regex could do with some lovin'.
Anyone have any bright ideas?

Why do you want to do this? What harm are the > doing? Most parsers I've come across are quite happy with a > on its own without it needing to be escaped to an entity.
Additionally, it would be more appropriate to encode the content strings with HttpUtility.HtmlEncode before concatenating them with strings containing HTML markup. If that step is under your control, you should consider dealing with it there.
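A minimal sketch of encoding before concatenation follows. It swaps in System.Net.WebUtility.HtmlEncode so the example does not depend on System.Web; the class and method names (ReplyBuilder, WrapQuotedReply) are hypothetical, and it assumes the plain-text body is still available separately from the markup.

```csharp
using System;
using System.Net;

public static class ReplyBuilder
{
    // Encodes the untrusted plain text first, then wraps it in trusted markup,
    // so only the content (never the tags) is entity-encoded.
    public static string WrapQuotedReply(string plainText)
    {
        return "<div class=\"quotedReply\">" + WebUtility.HtmlEncode(plainText) + "</div>";
    }

    public static void Main()
    {
        Console.WriteLine(WrapQuotedReply("> Ok, got it"));
    }
}
```

This only helps when the text and the markup are still separate; once they have been concatenated, you are back to picking stray > characters out of an HTML string.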

The trick is to capture everything that isn't the target, then plug it back in along with the changed text, like this:
Regex.Replace(str, @"\G((?>[^<>]+|<[^>]*>)*)>", "$1&gt;");
But Anthony's right: right angle brackets in text nodes shouldn't cause any problems. And matching HTML with regexes is tricky; for example, comments and CDATA can contain practically anything, so a robust regex would have to match them specifically.
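Here is a small, self-contained sketch of that capture-and-reinsert replacement. The class and method names (GtEscaper, EscapeStrayGt) are made up for illustration, and the sample string is a shortened version of the one in the question. It assumes well-formed tags and, as noted above, does nothing special for comments or CDATA.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GtEscaper
{
    // \G anchors each match where the previous one ended. Group 1 swallows
    // text and complete tags; the trailing > is therefore a stray one.
    private static readonly Regex StrayGt =
        new Regex(@"\G((?>[^<>]+|<[^>]*>)*)>", RegexOptions.Compiled);

    public static string EscapeStrayGt(string html)
    {
        // $1 puts the untouched text and tags back; only the stray > changes.
        return StrayGt.Replace(html, "$1&gt;");
    }

    public static void Main()
    {
        string input = "wrote:<br/><div class=\"quotedReply\">> Ok, got it.<br/>><br/>>> Please reply.<br/></div>";
        Console.WriteLine(EscapeStrayGt(input));
    }
}
```

Because the replacement string reuses group 1, no MatchEvaluator is needed to "replace only the last character" of each match.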

Maybe read your HTML into an XML parser which should take care of the conversions for you.

Are you talking about the > characters inside an HTML tag (like in JavaScript's innerText), or in the argument list of an HTML tag?
If you just want to sanitize the text between the opening and closing tags, that should be rather simple: locate any > character and replace it with &gt; (I'd do the same for < with &lt;). That said, the HTML rendering engine SHOULD take care of this for you...
Give an example of what you are trying to sanitize, and maybe we can find the best solution for it.
Larry

Could you read the string into an XML document, look at the values, and replace each > with &gt; in those values? This would require recursively visiting each node in the document, but that shouldn't be too hard to do.

Steve_C, you may try this regex. It captures the tag name in group 1 and the text between the opening and closing tags in group 2. I didn't fully test this; I'm just throwing it out there in case it might help.
<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>

Related

Remove everything except src in Image Tag using Regex

I want to remove everything except src in an image tag using regex. I am using C#, but I don't want to use HTMLAgilityPack; I want to do it with regex only.
How can I do this?
If String is <img id="image" class="header" src="test.png"> then it returns as <img src="test.png">
Image tag may contain many other extra properties.
To clarify my comments: Normally I wouldn't recommend parsing HTML using regex. However, this is one of the few times when it's possible without ending up with a disastrously complicated regex string, because here you have a single node with one pair of matching angle brackets. In addition, the OP only needs a single tag from this string. If he needed to do anything more complicated, I'd agree that he should use HTMLAgilityPack, but this is perfectly doable.
What you do is you extract the tag from the string using this regex: (src=['\"].+?['\"]). Then you take what you extracted from the string and paste it into a new string:
String newImgTag = String.Format("<img {0}>", srcMatch);
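Put together, the two steps might look like the sketch below. The names ImgTagReducer and KeepOnlySrc are hypothetical, and returning the input unchanged when no src is found is an assumption, not something the question specifies.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImgTagReducer
{
    // Same pattern as above: src= followed by a quoted value (lazy match).
    private static readonly Regex SrcAttribute = new Regex("(src=['\"].+?['\"])");

    public static string KeepOnlySrc(string imgTag)
    {
        Match srcMatch = SrcAttribute.Match(imgTag);
        if (!srcMatch.Success)
        {
            return imgTag; // assumption: leave tags without a src untouched
        }
        return String.Format("<img {0}>", srcMatch.Groups[1].Value);
    }

    public static void Main()
    {
        Console.WriteLine(KeepOnlySrc("<img id=\"image\" class=\"header\" src=\"test.png\">"));
    }
}
```

Note that the pattern would also match inside an attribute such as data-src, so this is only safe for the simple tags described in the question.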
Again, if this were any more complicated (or if I had to do other HTML manipulation), I would just skip the regex and go for the established solutions like the aforementioned HTMLAgilityPack, because it offers far more support for HTML manipulation.
However, I don't view this as HTML manipulation, because you got a single tag without even a matching closing tag. This is more like basic string manipulation. It's similar to calculating a number to the second power: I doubt anyone would import the entire math library just for that, they'd just do N * N.
I fully expect and accept that people will downvote me for even considering to use Regex for this. Before you do so, however, read the post and think about it. This is one of those borderline cases where HTMLAgilityPack would make the project far more complicated without actually adding anything except that you're not using Regex. Regex has its uses, it's only when you abuse it that it becomes a monster to work with.

Matching a term that contains nested HTML

I have been having trouble finding a solution to this problem.
I am parsing the content of a number of ebooks, finding specific terms and characters, marking the locations and lengths of each term.
A normal case would be something like this (excerpts from A Game of Thrones):
"When he paused to look down, his head swam dizzily and he felt his fingers slipping. Bran cried out and clung for dear life."
If we are searching for the character "Bran", its location is 85 and length is 4. Easy enough.
My issue arises when there is a paragraph like this:
<span height="-0em"><font size="7">D</font></span>aenerys Targaryen wed Khal Drogo
We need to match "Daenerys Targaryen". It is easy enough to strip the HTML and match the string, but in this example the result needs to include the HTML. Thus the expected result here would be location = 0, length = 67.
Another situation, caused by random anchor tags scattered throughout:
Did anyone outside the Vale even suspect where Catelyn <a></a>Stark had taken him?
Again, searching for "Catelyn Stark" needs to include the HTML, so location = 47, length = 20.
I have been able to get around it temporarily by adding those specific cases (searching for "Catelyn <a></a>Stark" specifically), but clearly I should have a more robust solution, which I cannot seem to get my head around. My attempts have been using RegEx but with limited success.
I have found various questions regarding HTML matching/stripping (and whether or not to use RegEx =)), but this case seems to be somewhat unique.
Stripping the tags isn't an option as the content must be preserved.
This is within a stand-alone C# application.
Any ideas, steps in the right direction, or similar examples should your search go better than mine would be greatly appreciated!
One possible approach would be to insert the following between each letter in your search string:
(?:<[^>]*>)*
So when searching for the character "Bran" your regex would become the following:
(?:<[^>]*>)*B(?:<[^>]*>)*r(?:<[^>]*>)*a(?:<[^>]*>)*n
This will allow your regex to match any number of HTML tags anywhere within the search string. Note that this will only work if your search strings are always something simple like a character's name, and not regular expressions (this method will fail if there is repetition like a* in your search string).
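As a rough sketch, the pattern can be generated from a plain search term like this. TagTolerantSearch and BuildPattern are invented names; each character is passed through Regex.Escape so that the term is treated literally, and the tag pattern is also placed before the first letter, as in the "Bran" example above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagTolerantSearch
{
    // Zero or more complete tags, allowed before every character of the term.
    private const string OptionalTags = "(?:<[^>]*>)*";

    public static Regex BuildPattern(string term)
    {
        var escapedChars = term.Select(c => Regex.Escape(c.ToString()));
        return new Regex(OptionalTags + string.Join(OptionalTags, escapedChars));
    }

    public static void Main()
    {
        string text = "Did anyone outside the Vale even suspect where Catelyn <a></a>Stark had taken him?";
        Match m = BuildPattern("Catelyn Stark").Match(text);
        Console.WriteLine("location = {0}, length = {1}", m.Index, m.Length);
    }
}
```

One consequence of the leading tag pattern is that tags immediately before the term are included in the match, which is what the "Daenerys Targaryen" case needs but may be more than you want elsewhere.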
I would create a function that would take "Daenerys Targaryen" as a parameter and then strip the first letter. Then, it would only search for "aenerys Targaryen," and if found, it would search for ">D<" or the first variable letter. Does that make sense?
Example:
public static string searchFor(string str)
{
    string result = null; // placeholder until the steps below are implemented
    // strip first letter of search string (in this case "D")
    // search for the rest of the string ("aenerys Targaryen")
    // if found, search for ">D<"
    // if found, search for HTML tags with "D" inside (using regex)
    // if found, search for HTML tags with the previous HTML tag in them (using regex)
    return result;
}
Well using Javascript or Php you can get the text of elements and the text of documents and search there and then do a regex to return the closest match (containing the html):
Another option would be to index the books first using something like the Lucene search engine, which happens to let you index different formats (HTML being one of them).
You can then use the Lucene api to search your documents a little easier.
In php we have Zend_Search_Lucene which works perfectly for this kind of thing.
Lucene Search can be found at:
http://lucene.apache.org/core/
Have fun!

Is it possible to use Regex to extract text from attributes repeated in a text file - c# .NET

I am working on something at the moment and need to extract an attribute from a big list of tags. They are formatted like this:
<appid="928" appname="extractapp" supportemail="me#mydomain.com" /><appid="928" appname="extractapp" supportemail="me#mydomain.com" />
The tags are repeated one after another and all have different appid, appname, supportemail.
I need to just extract all of the support emails, just the email address, without the supportemail=
Will I need to use two regex statements, one to separate each individual tag, then loop through the result and pull out the emails?
I would then go through and Add the emails to a list, then loop through the list and write each one to a txt file, with a comma after it.
I've never really used Regex too much, so don't know if it's suitable for the above?
I would spend more time trying it myself but it's quite urgent. So hopefully somebody can help.
Have you considered Linq to XML?
http://www.hookedonlinq.com/LINQtoXML5MinuteOverview.ashx
Using XML is better, perhaps, but here's the regular expression you'd use (in case there's a particular reason you need/want to use regular expressions to read XML):
(appid="(?<AppID>[^"]+)" appname="(?<AppName>[^"]+)" supportemail="(?<SupportEmail>[^"]+)")
You can just take the last bit there for the support email but this will extract all of the attributes you mentioned and they will be "grouped" within each tag.
What about modifying the string so that it is proper XML, then loading it as XML to extract all the values of the supportemail attribute?
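A possible sketch of that idea, using LINQ to XML. It assumes every tag starts with "<appid=", so inserting a made-up element name ("app") and wrapping everything in a made-up root ("apps") yields well-formed XML; SupportEmailExtractor is a hypothetical name as well.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class SupportEmailExtractor
{
    public static List<string> Extract(string raw)
    {
        // Assumption: "<appid=" only ever appears at the start of a tag.
        string xml = "<apps>" + raw.Replace("<appid=", "<app appid=") + "</apps>";
        return XDocument.Parse(xml)
            .Root
            .Elements("app")
            .Select(e => (string)e.Attribute("supportemail"))
            .Where(value => value != null)
            .ToList();
    }

    public static void Main()
    {
        string raw =
            "<appid=\"928\" appname=\"extractapp\" supportemail=\"me#mydomain.com\" />" +
            "<appid='929' appname='otherapp' supportemail='you#mydomain.com' />";
        // Comma-separated, ready to be written to a text file.
        Console.WriteLine(string.Join(",", Extract(raw)));
    }
}
```

The parser copes with single-quoted values, attribute order, and entity references without any extra work.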
Use
string pattern = "supportemail=\"([^\"]+)";
MatchCollection matches = Regex.Matches(inputString, pattern);
foreach(Match m in matches)
Console.WriteLine(m.Groups[1].Value);
Problems you'll encounter by using regular expressions instead of an XML DOM:
All of the example regexes posted thus far will fail in the extremely common case that the attribute values are delimited by single quotes.
Any regex that depends on the attributes appearing in a specific order (e.g. appId before appName) will fail in the event that attributes - whose ordering is insignificant to XML - appear in an order different from what the regex expects.
A DOM will resolve entity references for you and a regex will not; if you use regex, you must check the returned values for (at least) the XML character entities &amp;, &apos;, &gt;, &lt;, and &quot;.
There's a well-known edge case where using regular expressions to parse XML and XHTML unleashes the Great Old Ones. This will complicate your task considerably, as you will be reduced to gibbering madness and then the Earth will be eaten.

How do I find a HTML div contains specific text after a text prefix?

I have following string:
<div> text0 </div> prefix <div> text1 <strong>text2</strong> text3 </div> text4
and want to know whether it contains text3 inside divs that come after prefix:
prefix<div>...text3...</div>
but I don't know how to make a regex for that, since I can't use [^<]+ because divs can contain a strong tag inside.
Please help
EDIT:
Div tags after prefix are guaranteed not to be nested
Language is C#
Text4 is very long, so regex must not look after closing div
EDIT2: I don't want to use html parser, it can be easily (and MUCH faster) achieved with Regex. HTML there is simple: no attributes in tags; no nesting div's. And even some % of wrong answers are acceptable in my case.
If you turn off the "greedy" option, you should be able to just use something like prefix<div>.*text3.*</div>. (If the <div> is allowed to have attributes, use prefix<div[^>]*>.*text3.*</div> instead.)
Numerous improvements could be made to this in order to take account of unusual spacing, >s within quotes, </div> within quotes, etc.
Patterns like prefix<div>...<div></div>text3</div> would be more difficult. You might have to capture all of the occurrences of the div tag so that you could count how many div tags were open at a given time.
EDIT: Oops, turning off the greedy option won't always give the right result, even in examples other than the one above. Probably better just to capture all occurrences of the div tag and go from there. As noted above by Peter, HTML is not a regular language and so you can't use regular expressions to do everything you might want with it.
this is my new regex:
prefix<div>([^<]*<(?!/div>))*[^<]*text3([^<]*<(?!/div>))*[^<]*</div>
seems to work ok.
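For reference, here is a small sketch that runs this pattern against the sample string from the question. It adds \s* after prefix, because the sample has a space between prefix and the div; that tweak, and the names PrefixDivSearch and ContainsText3AfterPrefix, are assumptions for illustration rather than part of the original pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrefixDivSearch
{
    // ([^<]*<(?!/div>))* steps over any tag except </div>, so the match
    // cannot run past the end of the first div after the prefix.
    private static readonly Regex Pattern = new Regex(
        @"prefix\s*<div>([^<]*<(?!/div>))*[^<]*text3([^<]*<(?!/div>))*[^<]*</div>");

    public static bool ContainsText3AfterPrefix(string html)
    {
        return Pattern.IsMatch(html);
    }

    public static void Main()
    {
        string sample = "<div> text0 </div> prefix <div> text1 <strong>text2</strong> text3 </div> text4";
        Console.WriteLine(ContainsText3AfterPrefix(sample));
    }
}
```

Since the pattern ends at the first </div>, the regex engine never needs to scan the long text4 part once a match is found.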
For C# + HtmlAgilityPack you can do something like:
InputString = Regex.Replace(InputString,"^(?:[^<]+?|<[^>]*>)*?prefix","");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(InputString);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[contains(., 'text3')]");
The prefix removal is still not a good way of dealing with it. Ideally you'd do something like using HtmlAgilityPack to find where prefix occurs in the DOM, translate that to provide the position in the string, then do a substring(pos,len) (or equivalent) to look at only the relevant text (you can also avoid looking at text4 using similar method).
I'm afraid I can't translate all that into code right now; hopefully someone else can help there.
(original answer, before extra details provided)
Here is a JavaScript + jQuery solution:
var InputString = '<div>text0 </div> prefix <div>text1 <strong>text2</strong> text3 </div> text4';
InputString = InputString.replace(/^.*?prefix/,'');
var MatchingDivs = jQuery('div:contains(text3)','<div>'+InputString+'</div>')
console.log(MatchingDivs.get());
This makes use of jQuery's ability to accept a context as second argument (though it appears this needs to be wrapped in div tags to actually work).

How to find a repeated string and the value between them using regexes?

How would you find the value of string that is repeated and the data between it using regexes? For example, take this piece of XML:
<tagName>Data between the tag</tagName>
What would be the correct regex to find these values? (Note that tagName could be anything).
I have found a way that works: find all the tagNames that are in between a set of < >, search for the first instance of each tagName from the opening tag to the end of the string, then find the closing </tagName> and work out the data between them. However, this is extremely inefficient and complex. There must be an easier way!
EDIT: Please don't tell me to use XMLReader; I doubt I will ever use my custom class for reading XML, I am trying to learn the best way to do it (and the wrong ways) through attempting to make my own.
Thanks in advance.
You can use: <(\w+)>(.*?)<\/\1>
Group #1 is the tag, Group #2 is the content.
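A short sketch of that pattern in C#, for simple, non-nested elements without attributes only. TagPairFinder and Find are hypothetical names, and RegexOptions.Singleline is added on the assumption that the content may span several lines.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagPairFinder
{
    // \1 forces the closing tag to repeat whatever name group 1 captured.
    private static readonly Regex Pair =
        new Regex(@"<(\w+)>(.*?)</\1>", RegexOptions.Singleline);

    public static List<KeyValuePair<string, string>> Find(string input)
    {
        var found = new List<KeyValuePair<string, string>>();
        foreach (Match m in Pair.Matches(input))
        {
            found.Add(new KeyValuePair<string, string>(m.Groups[1].Value, m.Groups[2].Value));
        }
        return found;
    }

    public static void Main()
    {
        foreach (var pair in Find("<tagName>Data between the tag</tagName><other>second</other>"))
        {
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value);
        }
    }
}
```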
Using regular expressions to parse XML is a terrible error.
This is efficient (it doesn't parse the XML into a DOM) and simple enough:
using System;
using System.IO;
using System.Xml;

string s = "<tagName>Data between the tag</tagName>";
using (XmlReader xr = XmlReader.Create(new StringReader(s)))
{
    xr.Read(); // advance to the <tagName> start tag
    Console.WriteLine(xr.ReadElementContentAsString());
}
Edit:
Since the actual goal here is to learn something by doing, and not to just get the job done, here's why using regular expressions doesn't work:
Consider this fairly trivial test case:
<a><b><a>text1<b><![CDATA[<a>text2</a>]]></b></a></b>text3</a>
There are two elements with a tag name of "a" in that XML. The first has one text-node child with a value of "text1", and the second has one text-node child with a value of "text3". Also, there's a "b" element that contains a string of text that looks like an "a" element but isn't because it's enclosed in a CDATA section.
You can't parse that with simple pattern-matching. Finding <a> and looking ahead to find </a> doesn't begin to do what you need. You have to put start tags on a stack as you find them, and pop them off the stack as you reach the matching end tag. You have to stop putting anything on the stack when you encounter the start of a CDATA section, and not start again until you encounter the end.
And that's without introducing whitespace, empty elements, attributes, processing instructions, comments, or Unicode into the problem.
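For the learning exercise, the stack idea described above can be sketched roughly as follows. This is a toy scanner, not a real parser: TinyXmlScanner is an invented name, and it assumes tags carry no attributes and that there are no comments, processing instructions, entities, or self-closing elements.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class TinyXmlScanner
{
    // Returns (tag name, direct text) pairs in the order the elements close.
    public static List<(string Name, string Text)> Scan(string xml)
    {
        var results = new List<(string Name, string Text)>();
        var open = new Stack<(string Name, StringBuilder Text)>();
        int i = 0;
        while (i < xml.Length)
        {
            if (string.CompareOrdinal(xml, i, "<![CDATA[", 0, 9) == 0)
            {
                // Inside CDATA nothing is markup: copy it verbatim as text.
                int end = xml.IndexOf("]]>", i + 9, StringComparison.Ordinal);
                if (end < 0) throw new FormatException("Unterminated CDATA section");
                if (open.Count > 0) open.Peek().Text.Append(xml, i + 9, end - (i + 9));
                i = end + 3;
            }
            else if (xml[i] == '<')
            {
                int close = xml.IndexOf('>', i);
                if (close < 0) throw new FormatException("Unterminated tag");
                string body = xml.Substring(i + 1, close - i - 1);
                if (body.StartsWith("/", StringComparison.Ordinal))
                {
                    // End tag: it must match the most recently opened element.
                    if (open.Count == 0 || open.Peek().Name != body.Substring(1))
                        throw new FormatException("Mismatched end tag: " + body);
                    var top = open.Pop();
                    results.Add((top.Name, top.Text.ToString()));
                }
                else
                {
                    open.Push((body, new StringBuilder())); // start tag
                }
                i = close + 1;
            }
            else
            {
                // Plain text belongs to the innermost open element.
                int next = xml.IndexOf('<', i);
                if (next < 0) next = xml.Length;
                if (open.Count > 0) open.Peek().Text.Append(xml, i, next - i);
                i = next;
            }
        }
        if (open.Count > 0) throw new FormatException("Unclosed element: " + open.Peek().Name);
        return results;
    }

    public static void Main()
    {
        string xml = "<a><b><a>text1<b><![CDATA[<a>text2</a>]]></b></a></b>text3</a>";
        foreach (var element in Scan(xml))
        {
            Console.WriteLine("{0}: \"{1}\"", element.Name, element.Text);
        }
    }
}
```

Even this toy version needs explicit state (the stack) and a special case for CDATA, which is exactly what a single pattern match cannot express.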
You can use a backreference like \1 to refer to an earlier match:
@"<([^>]*)>(.*)</\1>"
The \1 will match what was captured by the first parenthesized group.
with Perl:
my $tagName = 'some tag';
my $i; # some line of XML
$i =~ /\<$tagName\>(.+)\<\/$tagName\>/;
where $1 is now filled with the data you captured
Going forward, if you get stuck, check out regexlib.com.
It's the first place I go when I get stuck on regex.
