I want to extract all the text between specified opening and closing tags, including the tags themselves.
For example:
Input: I am <NAME>Kai</NAME>
Text extracted: <NAME>Kai</NAME>
It should extract the text based on the tag.
What regex would do this?
If the tag in question can't be nested (and assuming case insensitivity):
Regex regexObj = new Regex("<NAME>(?:(?!</NAME>).)*</NAME>", RegexOptions.Singleline | RegexOptions.IgnoreCase);
Be advised that this is a quick-and-dirty solution which might work fine for your needs, but might also blow up in your face (for example if tags occur within comments, if there is whitespace inside the tags, if there are any attributes inside the tags etc.). If any of these might be a problem for you, please edit your question with the exact specifications you need the regex to comply with.
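As a minimal sketch of how the pattern above could be used, the following wraps it in a helper that returns every matching element. The class name NameTagExtractor and the method Extract are hypothetical names chosen for illustration; only the pattern and the options come from the answer.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the name is an assumption made for this sketch.
public static class NameTagExtractor
{
    // (?:(?!</NAME>).)* consumes one character at a time,
    // but refuses to step onto the start of "</NAME>".
    private static readonly Regex NameTag = new Regex(
        @"<NAME>(?:(?!</NAME>).)*</NAME>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns every <NAME>...</NAME> element in the input, tags included.
    public static List<string> Extract(string input)
    {
        var results = new List<string>();
        foreach (Match m in NameTag.Matches(input))
        {
            results.Add(m.Value);
        }
        return results;
    }
}
```

Calling NameTagExtractor.Extract("I am <NAME>Kai</NAME>") should return a list with the single entry <NAME>Kai</NAME>. Because of RegexOptions.Singleline, content that spans several lines is matched as well.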
Here is a regex which accepts any tag name: <(\w+)>.*?</\1>
\1 is back-referencing the group (\w+) and ensures that the closing tag must have the same name as the opening tag.
If you want to search for the special tag NAME then you could use this regex: <NAME>.*?</NAME>
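A short sketch of the back-reference pattern in C#, assuming the tags carry no attributes. The class name AnyTagExtractor is hypothetical, and RegexOptions.Singleline is an addition of this sketch (not part of the answer) so that .*? can cross line breaks.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the name is an assumption made for this sketch.
public static class AnyTagExtractor
{
    // Group 1 captures the tag name; \1 requires the same name in the closing tag.
    private static readonly Regex AnyTag = new Regex(
        @"<(\w+)>.*?</\1>",
        RegexOptions.Singleline); // assumption: content may span lines

    // Returns each element whose opening and closing tag names agree.
    public static List<string> Extract(string input)
    {
        var results = new List<string>();
        foreach (Match m in AnyTag.Matches(input))
        {
            results.Add(m.Value);
        }
        return results;
    }
}
```

For the input "I am <NAME>Kai</NAME> from <CITY>Oslo</CITY>" this should return both elements, while "<A>x</B>" yields nothing because the closing name differs from the captured one.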
You might find http://www.regular-expressions.info/reference.html useful; it has a lot of material, including examples specifically for tags. Combine the examples to meet your requirements.
Versions of this have been asked several times on here, and using those I was able to get two different regex patterns.
One that strips all HTML
1. <[^>]*>
And one that strips everything but the anchor tags
2. <a[^>]*>([^<]+)<\/a>
I have no hope of combining those to get a regex that strips all HTML but keeps the anchors (1 + !2). Therefore I currently go through my HTML once with the first regex, and if I encounter a certain keyword that usually lives inside the anchors, I go through the body with the second regex and combine both.
That clearly is not ideal and will most likely miss many anchors.
What would a single regex that matches all HTML tags except the anchors look like? Something like /1?!2/.
Test data: https://www.regextester.com/?fam=105725 I need everything that is ALL CAPS and the anchor around it.
Disregarding my own comment ;) - Is this what you're after?
Replace
<((?!a|\/a)[^>]*)>\s*
with empty string.
The negative look-ahead after the opening < makes sure it ignores anchors.
Here at regex101.
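A minimal C# sketch of that replacement. The class and method names are hypothetical. StripLoose uses the pattern exactly as given above; note that its look-ahead also spares any tag whose name merely starts with "a" (such as <abbr>). StripStrict is a variation that is not part of the answer: it adds \b so that only the tag named "a" is kept.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; names are assumptions made for this sketch.
public static class TagStripper
{
    // Pattern from the answer: skip tags starting with "a" or "/a".
    private static readonly Regex NonAnchorLoose =
        new Regex(@"<((?!a|\/a)[^>]*)>\s*");

    // Assumed stricter variant: \b limits the exclusion to the tag name "a".
    private static readonly Regex NonAnchorStrict =
        new Regex(@"<(?!\/?a\b)[^>]*>\s*");

    // Removes every tag the answer's pattern matches.
    public static string StripLoose(string html)
    {
        return NonAnchorLoose.Replace(html, string.Empty);
    }

    // Removes every tag except <a ...> and </a>.
    public static string StripStrict(string html)
    {
        return NonAnchorStrict.Replace(html, string.Empty);
    }
}
```

With the input <div><p>Hello <a href="x">LINK</a> world</p></div>, both methods should leave Hello <a href="x">LINK</a> world. They differ on <abbr>x</abbr><a>Y</a>: the loose pattern leaves it untouched, the strict one reduces it to x<a>Y</a>.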
I am using C# and writing a mail merge application. My users store the templates as follows
Dear [[-UserName-]],
You have been subscribed for [[-SubscriptionName-]]...
and so on. There will be a lot of custom fields between [[-xxxxx-]] placeholders. I am merging them fine, but sometimes users don't pass values for some placeholders. I would like to find those with a regular expression and replace them with empty strings.
Technically, I want a regular expression that finds [[-whatever is in between-]] and replaces it, including the [[--]] delimiters, with an empty string.
You can use:
\[{2}(-.*?-)\]{2}
# look for [[-, -]] and anything in between.
This is called a lazy dot star, see a demo on regex101.com.
Since there are no dashes inside the tag names, you can use this regex:
\[\[-([^-]+)-\]\]
The tag name will be in capture group 1. Obviously, I recommend using the @"" type of string (a verbatim string) for the regex.
You will want to use the replacement string:
[[--]]
Essentially the regex finds [[-, then captures a bunch of non - characters, then finds -]].
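As a rough sketch, the pattern could be applied like this in C#. The class and method names are hypothetical. RemoveUnmerged replaces with an empty string, which is what the question asks for; that choice is an assumption of this sketch rather than the replacement string shown above.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; names are assumptions made for this sketch.
public static class PlaceholderCleaner
{
    // [[- then one or more non-dash characters (group 1) then -]].
    private static readonly Regex Placeholder =
        new Regex(@"\[\[-([^-]+)-\]\]");

    // Removes every leftover placeholder, delimiters included.
    public static string RemoveUnmerged(string template)
    {
        return Placeholder.Replace(template, string.Empty);
    }

    // Lists the names of the placeholders still present in the template.
    public static List<string> ListNames(string template)
    {
        var names = new List<string>();
        foreach (Match m in Placeholder.Matches(template))
        {
            names.Add(m.Groups[1].Value);
        }
        return names;
    }
}
```

For "Dear [[-UserName-]], welcome" the cleaned text should be "Dear , welcome", and ListNames should report UserName, which can help with logging which fields were never supplied.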
How can I replace
<a href="page">Text</a>
with
<a href="page.html">Text</a>
where page and Text can be any set of characters?
This will work. Note that I only capture whatever is inside href.
resultString = Regex.Replace(subjectString, @"(?<=<a[^>]*?\bhref\s*=\s*(['""]))(.*)(?=\1.*?>)", "$2.html");
And append the .html to it. You may wish to change it to your needs.
Edit: before the flame wars begin: yes, it will work for your specific example, but not for all possible HTML on the internet.
You shouldn't parse HTML with regular expressions. See the answer to this question for details.
UPD: As TrueWill has pointed out, you might want to do the replacement with Html Agility Pack. But in some special cases the regex proposed by FailedDev will do, although I would slightly modify it to look like this: @"(?<=<a\b[^>]*?\bhref\s*=\s*(['""]))(.*)(?=\1.*?>)" (put a \b after the <a to exclude other tags starting with "a").
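For the special cases where a regex is acceptable, here is a small sketch built on the \b variant. It relies on .NET's support for variable-length look-behind. The class and method names are hypothetical, and one detail differs from the patterns quoted above: the capture is lazy, (.*?), which is an assumption of this sketch so that several links or extra attributes on one line are not swallowed by a greedy match.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; names are assumptions made for this sketch.
public static class HrefRewriter
{
    // Look-behind: an <a ...> tag up to href= and its opening quote (group 1).
    // Group 2: the href value, matched lazily (a change made for this sketch).
    // Look-ahead: the same quote character, then the rest of the tag up to '>'.
    private static readonly Regex HrefValue = new Regex(
        @"(?<=<a\b[^>]*?\bhref\s*=\s*(['""]))(.*?)(?=\1.*?>)");

    // Appends ".html" to the href value of each anchor tag.
    public static string AppendHtml(string html)
    {
        return HrefValue.Replace(html, "$2.html");
    }
}
```

HrefRewriter.AppendHtml("<a href=\"page\">Text</a>") should produce <a href="page.html">Text</a>, while a tag such as <abbr href="x"> is left alone because of the \b after <a.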
If I use this
string showPattern = @"return new_lightox\(this\);"">[a-zA-Z0-9(\s),!\?\-:'&%]+</a>";
MatchCollection showMatches = Regex.Matches(pageSource, showPattern);
I get some matches, but I want to get rid of [a-zA-Z0-9(\s),!\?\-:'&%]+ and use any char, .+, instead.
But if I do this, I get no match at all.
What am I doing wrong?
By default "." does not match newlines, but the class \s does.
To let . match newlines, turn on Singleline (DOTALL) mode, either with a flag in the function call (as Abel's answer shows) or with the inline modifier (?s), like this for the whole expression:
@"(?s)return new_lightox\(this\);"">.+</a>"
Or for just the specific part of it:
@"return new_lightox\(this\);"">(?s:.+)</a>"
It might be better to take that a step further and do this:
@"return new_lightox\(this\);"">(?s:(?:(?!</?a).)+)</a>"
Which should prevent the closing </a> from belonging to a different link.
However, you need to be very wary here. It's not clear what you're doing overall, but regex is not a good tool for parsing HTML and can cause all sorts of problems. Look at using an HTML DOM parser instead, such as HtmlAgilityPack.
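To make the effect of the flag concrete, here is a small sketch. The class and method names are hypothetical, and the sample markup in the usage note is invented for illustration. The last method checks the look-ahead before every character so the match cannot run past the nearest </a>.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; names are assumptions made for this sketch.
public static class DotNewlineDemo
{
    private const string DotPattern = @"return new_lightox\(this\);"">.+</a>";

    // Default mode: '.' does not match '\n'.
    public static bool MatchesDefault(string html)
    {
        return Regex.IsMatch(html, DotPattern);
    }

    // Singleline mode: '.' matches '\n' as well.
    public static bool MatchesSingleline(string html)
    {
        return Regex.IsMatch(html, DotPattern, RegexOptions.Singleline);
    }

    // Captures each link text; the look-ahead stops at the next <a or </a.
    public static List<string> LinkTexts(string html)
    {
        var texts = new List<string>();
        var rx = new Regex(@"return new_lightox\(this\);"">((?s:(?:(?!</?a).)+))</a>");
        foreach (Match m in rx.Matches(html))
        {
            texts.Add(m.Groups[1].Value);
        }
        return texts;
    }
}
```

For a link whose text is split over two lines, such as <a onclick="return new_lightox(this);">Show followed by a newline and Title</a>, MatchesDefault should return false and MatchesSingleline true.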
You're matching a tag, so you probably want something along these lines, instead of .+:
string showPattern = @"return new_lightox\(this\);"">[^<]+</a>";
The reason that the match doesn't hit is possibly that you are missing the Singleline flag and the closing tag is on the next line. In other words, this should work too:
// The Singleline option changes the dot (.) to match newlines too
MatchCollection showMatches = Regex.Matches(
    pageSource,
    showPattern,
    RegexOptions.Singleline);
I am writing a regex in which I only want to match malformed tags like: <p> *some text here, some other tags may be here as well but no ending 'p' tag* </p>
<P>Affectionately Inscribed </P><P>TO </P><P>HENRY BULLAR, </P><P>(of the western circuit)<P>PREFACE</P>
In the sample text above, I want the result to be <P>(of the western circuit)<P>, and nothing else should be captured. I'm using this, but it's not working:
<P>[^\(</P>\)]*<P>
Please help.
Regex is not always a good choice for xml/html type data. In particular, attributes, case-sensitivity, comments, etc all have a big impact.
For xhtml, I'd use XmlDocument/XDocument and an xpath query.
For "non-x" html, I'd look at the HTML Agility Pack and the same.
Match group one of:
(?:<p>(?:(?!<\/?p>).?)+)(<p>)
matches the second <p> in:
<P>(of the western circuit)<P>PREFACE</P>
Note: I'm usually one of those that say: "Don't do HTML with regex, use a parser instead". But I don't think the specific problem can be solved with a parser, which would probably just ignore/transparently deal with the invalid markup.
I know this isn't likely (or even html-legal?) to happen in this case, but a generic unclosed xml-tag solution would be pretty difficult as you need to consider what would happen with nested tags like
<p>OUTER BEFORE<p>INNER</p>OUTER AFTER</p>
I'm pretty sure the regular expressions given so far would match the second <p> there, even though it is not actually an unclosed <p>.
Rather than using * for maximal match, use *? for minimal.
Should be able to make a start with
<P>((?!</P>).)*?<P>
This uses a negative lookahead assertion to ensure the end tag is not matched at each point between the "<P>" matches.
EDIT: Corrected the placement of the assertion (thanks to the commenter).
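A brief sketch of this pattern applied to the sample text from the question. The class and method names are hypothetical, and RegexOptions.IgnoreCase is an assumption added so that both <p> and <P> are handled.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; names are assumptions made for this sketch.
public static class UnclosedParagraphFinder
{
    // From one <P> to the next <P>, never stepping onto a </P> in between.
    private static readonly Regex Unclosed = new Regex(
        @"<P>((?!</P>).)*?<P>",
        RegexOptions.IgnoreCase); // assumption: case should not matter

    // Returns each span that runs from an unclosed <P> to the following <P>.
    public static List<string> Find(string html)
    {
        var hits = new List<string>();
        foreach (Match m in Unclosed.Matches(html))
        {
            hits.Add(m.Value);
        }
        return hits;
    }
}
```

On the question's sample line, Find should return the single entry <P>(of the western circuit)<P>. Each match consumes the second <P>, so that tag cannot serve as the start of a further match.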
All of the solutions offered so far match the second <P>, but that's wrong. What if there are two consecutive <P> elements without closing tags? The second one won't be matched because the first match ate its opening tag. You can avoid that problem by using a lookahead as I did here:
@"<p\b(?>(?:[^<]+|<(?!/?p>))*)(?=<p\b|$)"
As for the rest of it, I used a "not the initial or not the rest" technique along with an atomic group to guide the regex to a match as efficiently as possible (and, more importantly, to fail as quickly as possible if it's going to).
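A sketch of the lookahead-based pattern in use. The class and method names are hypothetical, and RegexOptions.IgnoreCase is an assumption of this sketch, needed because the pattern is written with a lowercase p while the sample text uses <P>.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; names are assumptions made for this sketch.
public static class UnclosedTagScanner
{
    // The atomic group consumes text and any '<' that does not begin <p> or </p>.
    // The final look-ahead requires the next <p (or end of input) without
    // consuming it, so the following tag remains available for the next match.
    private static readonly Regex Unclosed = new Regex(
        @"<p\b(?>(?:[^<]+|<(?!/?p>))*)(?=<p\b|$)",
        RegexOptions.IgnoreCase); // assumption: case should not matter

    // Returns each unclosed <p> element together with its content.
    public static List<string> Find(string html)
    {
        var hits = new List<string>();
        foreach (Match m in Unclosed.Matches(html))
        {
            hits.Add(m.Value);
        }
        return hits;
    }
}
```

On the question's sample line this should return <P>(of the western circuit) only. For <p>one<p>two<p>three</p> it should return both <p>one and <p>two, since the lookahead leaves each following tag unconsumed.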