Single Regex to strip all HTML but the anchors - c#

Versions of this have been asked several times on here, and using those I was able to put together two different regex statements.
One that strips all HTML
1. <[^>]*>
And one that strips everything but the anchor tags
2. <a[^>]*>([^<]+)<\/a>
I can't work out how to combine those into one regex that strips all HTML but keeps the anchors, i.e. (1 + !2). So I currently run the first regex over my HTML once, and if I encounter a certain keyword that usually lives inside the anchors, I run the second regex over the body and combine both results.
That clearly is not ideal and will most likely miss many anchors.
What would a single regex that matches all HTML but the anchors look like? Something like /1?!2/.
Test data: https://www.regextester.com/?fam=105725 I need everything that is ALL CAPS and the anchor around it.

Disregarding my own comment ;) - Is this what you're after?
Replace
<((?!a|\/a)[^>]*)>\s*
with empty string.
The negative look-ahead after the opening < makes sure it ignores anchors.
Here at regex101.

Related

Regex excluding matches contained within a HTML tag

I'm trying to create a regex to match content within an HTML document, but I wish to exclude matches contained within a tag itself. Consider the following:
<p>Here is some sample text for my widgets</p>
Click here to view my widgets
I would like to match 'widgets' so that I can replace it with a different string, say 'green box', without replacing the match within the url.
Matching 'widgets' is, well, easy as anything, but I'm struggling to add the exclude to check for 'widgets' when it appears within the opening and closing tag '<>'.
My current workings: As a first step I have started to match 'widgets' contained within '<>'. (I can then move on to make this an exclude later) However the below string seems to match the whole document, even though I have placed an exclude on the closing > to make sure widgets appears within a tag.
<.*[^>]widgets.*[^<]>+
It's probably down to lazy / greedy, but I can't quite work it out!

Overview
By no means is this a great answer since it's parsing HTML with regex, but it does work for the test case given by the OP.
See "RegEx match open tags except XHTML self-contained tags" for more information.
Code
See regex in use here
(?<!<[^>]*)widgets
Explanation
(?<!<[^>]*) Negative lookbehind ensuring what precedes is not < followed by any character except > (any number of times)
widgets Match this literally
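The lookbehind above relies on variable-length lookbehind, which .NET supports but Python's built-in re module does not. As a rough sketch of the same idea in Python, the check can be flipped into a look-ahead: match "widgets" only when it is not followed by a > before the next < or >. This is a swapped-in technique, not the answer's pattern, and the href in the sample is a made-up value.

```python
import re

# "widgets" that is NOT followed by a closing ">" before any other
# angle bracket, i.e. not sitting inside a tag.
OUTSIDE_TAG = re.compile(r"widgets(?![^<>]*>)")

html = (
    "<p>Here is some sample text for my widgets</p>"
    '<a href="/widgets.html">Click here to view my widgets</a>'  # hypothetical URL
)

result = OUTSIDE_TAG.sub("green box", html)
print(result)
```

The occurrence inside the href is left alone, while both occurrences in the visible text are replaced.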

This may partially work:
(?:^|>)[^<]*widgets
This will start looking from the start of a line (if the /m flag is used) or the end of a tag (so we know we are not in one), and advance over as many characters as possible that are not <, meaning you can't open another tag, before looking for widgets.
The issues with this are that it may give weird results if you have a > inside a tag (e.g. in JavaScript) or if a single tag spans multiple lines, and it won't find several instances of "widgets" in the same substring. To solve those issues, you'd be better off using an actual XML parser, as advised by ctwheels.
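For comparison, here is a rough sketch of the parser route using Python's standard html.parser module. The class and function names are hypothetical, and the sketch makes simplifying assumptions: comments and doctype declarations are dropped, and text inside script or style elements is treated like any other text.

```python
from html.parser import HTMLParser


class TextReplacer(HTMLParser):
    """Rebuilds the markup, replacing a word only inside text nodes."""

    def __init__(self, old, new):
        # Keep entities such as &amp; as they are instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.old = old
        self.new = new
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Emit the tag exactly as written, attributes included.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Only text between tags reaches this method.
        self.parts.append(data.replace(self.old, self.new))

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def replace_in_text(html, old, new):
    parser = TextReplacer(old, new)
    parser.feed(html)
    parser.close()  # flush any trailing text
    return "".join(parser.parts)


html = (
    "<p>Here is some sample text for my widgets</p>"
    '<a href="/widgets.html">Click here to view my widgets</a>'  # hypothetical URL
)
print(replace_in_text(html, "widgets", "green box"))
```

Because attribute values never reach handle_data, a > inside an attribute or a tag spread over several lines no longer matters.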

Regular expression not capturing matches in the middle of a string

The regular expression I'm starting with is:
^(((http|ftp|https|www)://)?([\w+?.\w+])+([a-zA-Z0-9\~!\##\$\%\^\&*()_-\=+\/\?.\:\;\'\,]*)?)$
I'm using this to find URLs in the middle of user-supplied text and replace it with a hyperlink. This works fine and matches the following:
http://www.google.com
www.google.com
google.com
www.google.com?id=5
etc...
However, it doesn't find a match if there is any text on either side of it (kind of defeats the purpose of what I'm doing). :)
No match:
Go to www.google.com
www.google.com is the best.
I go to www.google.com all the time.
etc...
How can I change this so that it will match no matter where in the string it appears? I'm terrible with regular expressions...

You have a bug in your original regex. The square brackets make \w+?\.\w+ a character class:
(((http|ftp|https|www)://)?([\w+?\.\w+])+([a-zA-Z0-9\~\!\#\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?)
                            ^         ^
After removing them (and the anchors ^ and $), your regex will not match obvious non-URLs.
I suggest using http://regexpal.com/ for testing regexes, as it has syntax highlighting within the regex.
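To illustrate the effect of dropping the ^ and $ anchors, here is a small Python sketch. The pattern is a deliberately simplified stand-in, not the asker's full regex, and it will miss or over-match plenty of real-world URLs; the variable names are made up for the example.

```python
import re

# Simplified, hypothetical URL pattern: optional scheme, dotted host name,
# optional path or query string. No ^ or $, so it can match mid-string.
URL = re.compile(r"\b(?:(?:https?|ftp)://)?(?:\w+\.)+[a-zA-Z]{2,}(?:[/?]\S*)?")

samples = [
    "Go to www.google.com",
    "www.google.com is the best.",
    "I go to www.google.com all the time.",
    "Visit http://www.google.com?id=5 now",
]

found = [URL.findall(s) for s in samples]
for s, urls in zip(samples, found):
    print(s, "->", urls)

# Anchoring the same pattern to the whole string reproduces the original problem.
anchored_ok = URL.fullmatch("Go to www.google.com") is not None
```

With search or findall the URL is picked up wherever it appears, whereas fullmatch (the equivalent of wrapping the pattern in ^...$) rejects any surrounding text.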

I think you should use a positive lookahead that searches for the given URL and checks two possibilities: it is either at the beginning or in the middle of the whole string.
You could use something like ^((?=url)?|.?(?=url).*?$))
That is just a starting point; I am not giving you an answer, just an idea.
I would work it out myself, but at the moment I am lazy and your regex looks like it needs a 20-minute analysis.
Stack Overflow erased some parts of my example.

Designing Regular Expressions in C#

I've run into a bit of an issue designing RegEx in C#. I have to parse a text document that has multiple urls embedded in it, and I have to extract those
...url=http://www.cnn.com?id=abc,def&system=2&mode=2&quality=ade,url=http://www.bbc.com...
(^ I've added ellipses to show that it's part of the content; the ... won't actually be in the text)
The beginning part is easy, as I can start the regex with 'url='. However, I can't come up with a way of ending the match.
RegEx = (?<IgnoreFirst>[,]url=)(?<Url>[^,]+)
This regex stops at the first comma, so just after 'abc', and doesn't return the entire url.
RegEx = (?<IgnoreFirst>[,]url=)(?<Url>[^,]+)(?<IgnoreSecond>url)
This doesn't work either, because the match stops at the first comma and then looks for 'url', which it can't find. From some of the reading I've done, it seems like it's an issue of backtracking etc., so if anyone can help me out with the correct regex, that'd be great!
PS. while we're at it, if I wanted to extract url just before &quality, how would I do that?

How about using something like this:
RegEx = url=(?<Url>.+?)(?=,url|$)
The lookahead at the end will force matching to stop either at the next ",url" or at the end of the string or line.
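A quick sketch of that pattern against the sample text from the question, written in Python for illustration. Python spells named groups (?P<Url>...) where .NET uses (?<Url>...); the second pattern, which also stops at &quality, is my own guess at what the PS is asking for.

```python
import re

text = (
    "url=http://www.cnn.com?id=abc,def&system=2&mode=2&quality=ade,"
    "url=http://www.bbc.com"
)

# Lazy match that stops right before the next ",url" or at the end of the string.
URL_FIELD = re.compile(r"url=(?P<Url>.+?)(?=,url|$)")
urls = [m.group("Url") for m in URL_FIELD.finditer(text)]

# Variant for the PS (an assumption about the intent): also stop before "&quality".
URL_BEFORE_QUALITY = re.compile(r"url=(?P<Url>.+?)(?=&quality|,url|$)")
trimmed = [m.group("Url") for m in URL_BEFORE_QUALITY.finditer(text)]

print(urls)
print(trimmed)
```

Because the look-ahead does not consume the ",url" it finds, the next match can start from it, so consecutive URLs are all picked up.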

Regex decoding html

I have the following regex and would like it to match the following two lines. It appears to match the first end tag it finds rather than the last. How can it be modified to find the last one rather than the first?
Regex: <div(?<Attr>.*?)>(?<Content>.*?)</div>
Currently matches: <div class="test">Test Div</div>
Needs to match: <div class="test">Test Div<div>Another Test</div></div>

Not really an answer, but an observation based on experience. In general, regex-based approaches to pattern-matching HTML will give you endless grief and ultimately cannot work properly since HTML is not a regular language. Instead, I would recommend looking at DOM-based mechanisms. I've used, with considerably improved success, both jQuery and phpQuery to deal with hunting for stuff in HTML documents.

You're using the non-greedy quantifier *?, which expands to as few repetitions as possible. If you want to match as much as possible, use the greedy version without the ?.
But in general, regular expressions are not suitable for non-regular languages like HTML. You would be better off using an HTML parser.
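The difference is easy to see on the nested sample from the question. This is a small Python sketch (Python writes named groups as (?P<name>...) rather than .NET's (?<name>...)); only the quantifier on the Content group changes between the two patterns.

```python
import re

html = '<div class="test">Test Div<div>Another Test</div></div>'

# Lazy content: stops at the first </div> it can reach.
LAZY = re.compile(r"<div(?P<Attr>.*?)>(?P<Content>.*?)</div>")
# Greedy content: runs to the last </div> on the line.
GREEDY = re.compile(r"<div(?P<Attr>.*?)>(?P<Content>.*)</div>")

lazy_match = LAZY.search(html).group(0)
greedy_match = GREEDY.search(html).group(0)

print(lazy_match)
print(greedy_match)
```

The greedy version has the opposite problem when two sibling divs sit on the same line: it will swallow both, which is part of why a parser is the safer choice.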

Regex is typically greedy, meaning it will try to find the last match. For what you need to do, you can tell it to match </div> twice, or just include the unique </div></div> sequence in the pattern.

Regex - I only want to match the start tags in regex

I am making a regex expression in which I only want to match wrong tags like: <p> *some text here, some other tags may be here as well but no ending 'p' tag* </p>
<P>Affectionately Inscribed </P><P>TO </P><P>HENRY BULLAR, </P><P>(of the western circuit)<P>PREFACE</P>
In the above sample text I want to get the result <P>(of the western circuit)<P> and nothing else should be captured. I'm using this but it's not working:
<P>[^\(</P>\)]*<P>
Please help.

Regex is not always a good choice for xml/html type data. In particular, attributes, case-sensitivity, comments, etc all have a big impact.
For xhtml, I'd use XmlDocument/XDocument and an xpath query.
For "non-x" html, I'd look at the HTML Agility Pack and the same.

Match group one of:
(?:<p>(?:(?!<\/?p>).?)+)(<p>)
matches the second <p> in:
<P>(of the western circuit)<P>PREFACE</P>
Note: I'm usually one of those that say: "Don't do HTML with regex, use a parser instead". But I don't think the specific problem can be solved with a parser, which would probably just ignore/transparently deal with the invalid markup.

I know this isn't likely (or even html-legal?) to happen in this case, but a generic unclosed xml-tag solution would be pretty difficult as you need to consider what would happen with nested tags like
<p>OUTER BEFORE<p>INNER</p>OUTER AFTER</p>
I'm pretty sure the regular expressions given so-far would match the second <p> there, even though it is not actually an unclosed <p>.

Rather than using * for maximal match, use *? for minimal.
Should be able to make a start with
<P>((?!</P>).)*?<P>
This uses a negative lookahead assertion to ensure the end tag is not matched at each point between the "<P>" matches.
EDIT: Corrected to put assertion (thanks to commenter).
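Here is that pattern run against the sample text from the question, as a short Python sketch. The variable names are made up; the pattern is copied unchanged.

```python
import re

text = (
    "<P>Affectionately Inscribed </P><P>TO </P><P>HENRY BULLAR, </P>"
    "<P>(of the western circuit)<P>PREFACE</P>"
)

# After an opening <P>, consume one character at a time, but only while the
# next characters are not "</P>"; stop as soon as another <P> is reached.
UNCLOSED_P = re.compile(r"<P>((?!</P>).)*?<P>")

match = UNCLOSED_P.search(text)
unclosed = match.group(0) if match else None
print(unclosed)
```

Each properly closed paragraph fails at its </P>, so the only match is the one where a second <P> appears before any closing tag.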

All of the solutions offered so far match the second <P>, but that's wrong. What if there are two consecutive <P> elements without closing tags? The second one won't be matched because the first match ate its opening tag. You can avoid that problem by using a lookahead as I did here:
@"<p\b(?>(?:[^<]+|<(?!/?p>))*)(?=<p\b|$)"
As for the rest of it, I used a "not the initial or not the rest" technique along with an atomic group to guide the regex to a match as efficiently as possible (and, more importantly, to fail as quickly as possible if it's going to).
