Regex decoding html - c#

I have the following regex and would like it to match the following two lines. It appears to stop at the first end tag it finds rather than the last. How can it be modified to find the last one rather than the first?
Regex: <div(?<Attr>.*?)>(?<Content>.*?)</div>
Currently matches: <div class="test">Test Div</div>
Needs to match: <div class="test">Test Div<div>Another Test</div></div>

Not really an answer, but an observation based on experience. In general, regex-based approaches to pattern-matching HTML will give you endless grief and ultimately cannot work properly since HTML is not a regular language. Instead, I would recommend looking at DOM-based mechanisms. I've used, with considerably improved success, both jQuery and phpQuery to deal with hunting for stuff in HTML documents.

You’re using the non-greedy quantifier *? that will be expanded to as few as possible repetitions. If you want to match as much as possible, use the greedy version without the ?.
But in general, regular expressions are not suitable for non-regular languages like HTML. You would be better off using an HTML parser.
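To see the difference between the two quantifiers on the line from the question, here is a minimal sketch. The class and method names are made up for illustration; only the two patterns come from the discussion above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Pattern from the question: the lazy Content group stops at the first </div>.
    public const string LazyPattern = @"<div(?<Attr>.*?)>(?<Content>.*?)</div>";

    // Same pattern with a greedy Content group: it backtracks from the end
    // of the line, so it stops at the last </div>.
    public const string GreedyPattern = @"<div(?<Attr>.*?)>(?<Content>.*)</div>";

    public static string FirstMatch(string pattern, string input)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        string html = "<div class=\"test\">Test Div<div>Another Test</div></div>";
        Console.WriteLine(FirstMatch(LazyPattern, html));   // ends after the inner </div>
        Console.WriteLine(FirstMatch(GreedyPattern, html)); // the whole line
    }
}
```

Keep in mind that the greedy version will also swallow sibling elements: on <div>a</div><div>b</div> it matches the whole string as one div, which is one more reason to prefer a parser.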

Regex quantifiers are typically greedy, meaning they try to match as much as possible and so end at the last match. For what you need, you can tell the pattern to match </div> twice, or just include the unique </div> before that.

Related

Reusable Non-Capture Groups [duplicate]

I can't seem to find an answer to this problem, and I'm wondering if one exists. Simplified example:
Consider a string "nnnn", where I want to find all matches of "nn" - but also those that overlap with each other. So the regex would provide the following 3 matches (matched characters in brackets):
[nn]nn
n[nn]n
nn[nn]
I realize this is not exactly what regexes are meant for, but walking the string and parsing this manually seems like an awful lot of code, considering that in reality the matches would have to be done using a pattern, not a literal string.
Update 2016:
To get nn, nn, nn, SDJMcHattie proposes in the comments (?=(nn)) (see regex101).
(?=(nn))
Original answer (2008)
A possible solution could be to use a positive look behind:
(?<=n)n
It would give you the end position of each of these pairs (shown in brackets):
[nn]nn
n[nn]n
nn[nn]
As mentioned by Timothy Khouri, a positive lookahead is more intuitive (see example)
Rather than his proposal (?=nn)n, I would prefer the simpler form:
(n)(?=(n))
That would reference the first position of the strings you want and would capture the second n in group(2).
That is so because:
Any valid regular expression can be used inside the lookahead.
If it contains capturing parentheses, the backreferences will be saved.
So group(1) and group(2) will capture whatever 'n' represents (even if it is a complicated regex).
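As a rough illustration of the capturing-lookahead idea in .NET, the sketch below wraps an arbitrary inner pattern in (?=(...)) and reads group 1 of every match. The helper name Find and the class name are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OverlappingLookahead
{
    // Wraps `inner` in a capturing group inside a lookahead. The lookahead is
    // zero-width, so every match is empty and the engine retries at the next
    // character, which is what lets overlapping occurrences show up.
    public static List<Match> Find(string input, string inner)
    {
        var regex = new Regex("(?=(" + inner + "))");
        var found = new List<Match>();
        foreach (Match m in regex.Matches(input))
        {
            found.Add(m);
        }
        return found;
    }

    public static void Main()
    {
        foreach (Match m in Find("nnnn", "nn"))
        {
            // m.Value is empty; the text of interest is in group 1.
            Console.WriteLine(m.Index + ": " + m.Groups[1].Value);
        }
        // Prints "0: nn", "1: nn" and "2: nn" on separate lines.
    }
}
```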
Using a lookahead with a capturing group works, at the expense of making your regex slower and more complicated. An alternative solution is to tell the Regex.Match() method where the next match attempt should begin. Try this:
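Regex regexObj = new Regex("nn");
Match matchObj = regexObj.Match(subjectString);
while (matchObj.Success) {
    // Use matchObj.Index and matchObj.Value here, then restart the search
    // one character after the start of the current match.
    matchObj = regexObj.Match(subjectString, matchObj.Index + 1);
}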
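Fleshed out into something that compiles, the loop might look like the sketch below. It collects the start index of every match; the method name FindStarts and the guard against running past the end of the string are additions for this example, not part of the snippet above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OverlappingByStartAt
{
    public static List<int> FindStarts(string subject, string pattern)
    {
        var starts = new List<int>();
        var regex = new Regex(pattern);
        Match m = regex.Match(subject);
        while (m.Success)
        {
            starts.Add(m.Index);
            // Regex.Match(string, int) throws if the start position exceeds the
            // string length, which could happen after an empty match at the end.
            if (m.Index >= subject.Length)
            {
                break;
            }
            // Resume one character after the start of the current match,
            // not after its end, so overlapping matches are found too.
            m = regex.Match(subject, m.Index + 1);
        }
        return starts;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", FindStarts("nnnn", "nn"))); // 0, 1, 2
    }
}
```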
AFAIK, there is no pure regex way to do that at once (i.e. returning the three captures you request without a loop).
Now, you can find a pattern once, and loop on the search starting with an offset (found position + 1). You need to combine regex use with simple code.
[EDIT] Great, I am downvoted when I basically said what Jan showed...
[EDIT 2] To be clear: Jan's answer is better. Not more precise, but certainly more detailed, it deserves to be chosen. I just don't understand why mine is downvoted, since I still see nothing incorrect in it. Not a big deal, just annoying.

Single Regex to strip all HTML but the anchors

Versions of this have been asked several times on here, and using those I was able to get two different regex statements.
One that strips all HTML
1. <[^>]*>
And one that strips everything but the anchor tags
2. <a[^>]*>([^<]+)<\/a>
I have no hope of combining those to get a regex that strips all HTML but keeps the anchors, so (1+!2). Therefore I'm currently going once through my HTML with the first regex, and if I encounter a certain keyword that usually lives inside the anchors then I go through the body with the 2nd regex and combine both.
That clearly is not ideal and will most likely miss many anchors.
What would a single regex that matches all HTML but the anchors look like? /1?!2/
Test data: https://www.regextester.com/?fam=105725 I need everything that is ALL CAPS and the anchor around it.
Disregarding my own comment ;) - Is this what you're after?
Replace
<((?!a|\/a)[^>]*)>\s*
with empty string.
The negative look-ahead after the opening < makes sure it ignores anchors.
Here at regex101.
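For a quick check in C#, the replacement can be sketched as below. FromAnswer is the pattern given above. Stricter is a hypothetical variant (not from the answer) that adds \b, on the assumption that tags such as <abbr> should be stripped as well, since the original lookahead exempts every tag whose name merely starts with "a".

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorKeeper
{
    // Pattern from the answer: skips any tag whose name starts with "a".
    public const string FromAnswer = @"<((?!a|\/a)[^>]*)>\s*";

    // Hypothetical stricter variant: \b limits the exemption to the tag
    // name "a" itself, so <abbr>, <article> and similar are stripped too.
    public const string Stricter = @"<(?!/?a\b)[^>]*>\s*";

    public static string Strip(string html, string pattern)
    {
        return Regex.Replace(html, pattern, "");
    }

    public static void Main()
    {
        string html = "<p>Hello <a href=\"x\">LINK</a> <b>world</b></p>";
        Console.WriteLine(Strip(html, FromAnswer)); // Hello <a href="x">LINK</a> world
        Console.WriteLine(Strip("<abbr>x</abbr><a>y</a>", FromAnswer)); // unchanged
        Console.WriteLine(Strip("<abbr>x</abbr><a>y</a>", Stricter));   // x<a>y</a>
    }
}
```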

How to Find All Matches in Regular Expressions when one Overlaps OR Contains the Other?

The question of how to find every match when they might overlap was asked in Overlapping matches in Regex. However, as far as I can see, the answers there do not cover a more general case.
How can we find all substrings that begin with "a" and end with "z"? For example, given "akzzaz", it should find "akz", "akzz", "az" and "akzzaz".
Since there may be more than one match starting at the same position, ("akz" and "akzz") and also there may be more than one match ending at the same position ("az" and "akzzaz") I cannot see how using a lookahead or lookbehind helps as in the mentioned link. (Also, please bear in mind that in the general case "a" and "z" might be more complex regular expressions)
I use C#, so, in case it matters, having any feature specific to .Net Regular Expressions is OK.
Regular expressions are designed to find one match at a time. Even a global match operation is simply repeated applications of the same regex, each starting at the end of the previous match in the target string. So no, regexes are not able to find all matches in this way.
I will stick my neck out and say that I don't believe you can even find "all strings beginning with 'a' in 'akzzaz'" with a regex. /(a.*)/g will find the entire string, while /(a.*?)/g will find just 'a' twice.
The way I would code this would be to locate all 'a's, and search each of the substrings from there to the end of the string for all 'z's. So search 'akzzaz' and 'az' for 'z', giving 'akz', 'akzz', 'akzzaz', and 'az'. That is a fairly simple thing to do, but not a job for a regex unless the actual 'a' and 'z' tokens are complex.
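The approach just described could be sketched as follows for the case where the start and end tokens are themselves patterns. The names are invented for this example, and it assumes that the non-overlapping matches returned by Regex.Matches are enough to locate every start and end token, which holds for single-character tokens such as "a" and "z".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SpanFinder
{
    // Pairs every occurrence of startPattern with every occurrence of
    // endPattern that begins at or after the end of that start token.
    public static List<string> FindAll(string input, string startPattern, string endPattern)
    {
        var results = new List<string>();
        MatchCollection starts = Regex.Matches(input, startPattern);
        MatchCollection ends = Regex.Matches(input, endPattern);
        foreach (Match s in starts)
        {
            foreach (Match e in ends)
            {
                if (e.Index >= s.Index + s.Length)
                {
                    results.Add(input.Substring(s.Index, e.Index + e.Length - s.Index));
                }
            }
        }
        return results;
    }

    public static void Main()
    {
        // Prints: akz, akzz, akzzaz, az
        Console.WriteLine(string.Join(", ", FindAll("akzzaz", "a", "z")));
    }
}
```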
For your current problem, String.StartsWith and String.EndsWith would do a better job. Regular expressions are not necessarily faster in all cases.
Try this regular expression
a[akz]+z - in case a, k and z are the only characters
a[a-z]+z - in case of any alphabet
I think it's worth noting that there is actually a way for a regex to return more than one match at the same time. Although this doesn't answer your question, I think this would be a good place to mention this for others who may run into a similar situation.
The regex below, for example, returns all the right substrings (suffixes) of a string, one per match, each held in a capturing group:
(?=(\w+)).
This regex uses a capturing group inside a zero-width assertion; for each match at position i (one per character), the capturing group holds a substring of length n-i.
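A small sketch of how that pattern behaves in .NET, assuming the input consists only of word characters (the class and method names are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SuffixDemo
{
    // Each match consumes one character via the trailing dot, while the
    // lookahead captures everything from that position to the end of the word.
    public static List<string> Suffixes(string word)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(word, @"(?=(\w+))."))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }

    public static void Main()
    {
        // Prints: abcd, bcd, cd, d
        Console.WriteLine(string.Join(", ", Suffixes("abcd")));
    }
}
```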
Doing anything that would require the regex engine to stay in the same place after a match is probably overkill for a regular expression approach.

C# regular expression for finding forms with input tags in HTML?

I have a simple problem: I want to construct a regex that matches a form in HTML, but only if the form has any input tags. Example:
The following should be matched (ignoring attributes):
..
<form>
..
<input/>
..
</form>
..
But the following should not (ignoring attributes):
..
<form>
..
</form>
..
I have tried everything from look-arounds to capture groups but it quickly gets complicated. I want to believe there is a simple regex to capture the problem. Please note that it is important that the regex pairs the opening and closing tags according to the HTML code which means the following does not work:
<form>.+<input/>.+</form>
because it matches wrongly like this:
..
<form> <--- This is wrongly matched as the opening tag
..
</form>
<form> <-- This is the correct opening tag of the correct form
..
<input/>
..
</form> <--- This is matched as the closing tag
..
EDIT:
I already made a regex that matches what I want; my question is not how to do it, but how to do it simply and elegantly.
To me this is not simple or elegant at all:
<form>
(.(?<!</form>))+
<input/>
(.(?<!</form>))+
</form>
"I want to believe there is a simple regex to capture the problem"
Wishing does not make it so. There is no evidence for the proposition that every problem can be solved with regular expressions, and plenty of evidence against. Your faith is not well placed.
The set of languages which are recognizable by regular expressions is called -- unsurprisingly -- the regular languages. A nice property of all regular languages is that they can be recognized by a device with finitely many states. Therefore, you can quickly figure out if a language is not regular by asking yourself the question "would I require an unbounded number of states to recognize this language?"
Consider the language of matching parens: (), ()(), (()), ()(()), and so on. To recognize this language you have to keep track of how many open parens there are waiting to be closed, and therefore you need an unbounded number of states. Therefore this language is not a regular language, and therefore it cannot be matched by a regular expression.
HTML is clearly the paren language but even more complicated, because now there are an infinite number of different "kinds of parens". Each tag is like an open paren that must be matched by its corresponding closing tag. Since this is an even more complex and difficult version of a non-regular language, clearly it cannot be a regular language. And therefore it cannot be matched correctly with regular expressions.
The right tool to recognize patterns in HTML is an HTML parser.
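To make the "unbounded number of states" point concrete, here is a minimal sketch of a recognizer for the matching-parens language. The single integer counter is the piece of state that has no fixed upper bound, which is exactly what a finite-state device lacks. The names are made up for this illustration.

```csharp
using System;

public static class ParenLanguage
{
    // Accepts strings such as "", "()", "()()" and "(())". The depth counter
    // can grow without limit as the input nests deeper.
    public static bool IsBalanced(string s)
    {
        int depth = 0;
        foreach (char c in s)
        {
            if (c == '(')
            {
                depth++;
            }
            else if (c == ')')
            {
                depth--;
                if (depth < 0)
                {
                    return false; // a close paren with nothing open
                }
            }
            else
            {
                return false; // the language contains only parens
            }
        }
        return depth == 0; // every open paren must have been closed
    }

    public static void Main()
    {
        Console.WriteLine(IsBalanced("()(())")); // True
        Console.WriteLine(IsBalanced("(()"));    // False
    }
}
```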
You really don't want to parse HTML using regex. See this answer if you need more convincing.
Regular expressions are the wrong tool for trying to parse HTML - especially when it's HTML that is not guaranteed to be well formed.
You should really get an HTML/XHTML parsing library and use that to match HTML content. Take a look at the HTML Agility Pack, it's probably sufficient for what you need.
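As a rough illustration of the parser-based approach, the sketch below uses XDocument from the .NET base class library rather than the HTML Agility Pack, so that it stays self-contained. That is a substitution made for this example: it assumes the markup is well-formed XHTML with no XML namespace, which real-world HTML often is not, and that is where a lenient parser such as the HTML Agility Pack comes in.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class FormFinder
{
    // Returns every <form> element that contains at least one <input>
    // somewhere beneath it. Throws if the markup is not well-formed XML.
    public static List<XElement> FormsWithInputs(string xhtml)
    {
        XDocument doc = XDocument.Parse(xhtml);
        return doc.Descendants("form")
                  .Where(form => form.Descendants("input").Any())
                  .ToList();
    }

    public static void Main()
    {
        string page =
            "<html><body>" +
            "<form id=\"a\"><p>no inputs here</p></form>" +
            "<form id=\"b\"><p><input/></p></form>" +
            "</body></html>";
        foreach (XElement form in FormsWithInputs(page))
        {
            Console.WriteLine((string)form.Attribute("id")); // b
        }
    }
}
```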
Don't parse HTML with regular expressions.
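You should not parse HTML with regular expressions, but if you must, then what about something as simple as the pattern below? Note that a character class such as [^</form>] will not work here, because it excludes the individual characters <, /, f, o, r, m and > rather than the closing tag, so a negative lookahead is used instead:
<form>(?:(?!</form>).)*<input/>(?:(?!</form>).)*</form>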
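If a regex really has to be used, a minimal sketch of a tempered pattern for this task is shown below: the negative lookahead (?!</form>) is checked before every character, so a match cannot run past the end of one form into the next. It assumes attribute-free tags exactly as in the question, and the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FormRegex
{
    public const string Pattern =
        @"<form>(?:(?!</form>).)*<input/>(?:(?!</form>).)*</form>";

    // Returns the first form that contains <input/>, or "" if there is none.
    // Singleline lets the dot match newlines, since forms span several lines.
    public static string FirstFormWithInput(string html)
    {
        return Regex.Match(html, Pattern, RegexOptions.Singleline).Value;
    }

    public static void Main()
    {
        string html = "..\n<form>\n..\n</form>\n<form>\n..\n<input/>\n..\n</form>\n..";
        // Prints only the second form, the one that contains <input/>.
        Console.WriteLine(FirstFormWithInput(html));
    }
}
```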

Regex - I only want to match the start tags in regex

I am making a regex expression in which I only want to match wrong tags like: <p> *some text here, some other tags may be here as well but no ending 'p' tag* </p>
<P>Affectionately Inscribed </P><P>TO </P><P>HENRY BULLAR, </P><P>(of the western circuit)<P>PREFACE</P>
In the above sample text I want to get the result <P>(of the western circuit)<P> and nothing else should be captured. I'm using this but it's not working:
<P>[^\(</P>\)]*<P>
Please help.
Regex is not always a good choice for xml/html type data. In particular, attributes, case-sensitivity, comments, etc all have a big impact.
For xhtml, I'd use XmlDocument/XDocument and an xpath query.
For "non-x" html, I'd look at the HTML Agility Pack and the same.
Match group one of:
(?:<p>(?:(?!<\/?p>).?)+)(<p>)
matches the second <p> in:
<P>(of the western circuit)<P>PREFACE</P>
Note: I'm usually one of those that say: "Don't do HTML with regex, use a parser instead". But I don't think the specific problem can be solved with a parser, which would probably just ignore/transparently deal with the invalid markup.
I know this isn't likely (or even html-legal?) to happen in this case, but a generic unclosed xml-tag solution would be pretty difficult as you need to consider what would happen with nested tags like
<p>OUTER BEFORE<p>INNER</p>OUTER AFTER</p>
I'm pretty sure the regular expressions given so far would match the second <p> there, even though it is not actually an unclosed <p>.
Rather than using * for maximal match, use *? for minimal.
Should be able to make a start with
<P>((?!</P>).)*?<P>
This uses a negative lookahead assertion to ensure the end tag is not matched at each point between the "<P>" matches.
EDIT: Corrected to put assertion (thanks to commenter).
All of the solutions offered so far match the second <P>, but that's wrong. What if there are two consecutive <P> elements without closing tags? The second one won't be matched because the first match ate its opening tag. You can avoid that problem by using a lookahead as I did here:
@"<p\b(?>(?:[^<]+|<(?!/?p>))*)(?=<p\b|$)"
As for the rest of it, I used a "not the initial or not the rest" technique along with an atomic group to guide the regex to a match as efficiently as possible (and, more importantly, to fail as quickly as possible if it's going to).
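Here is a small sketch of that pattern run against the sample text from the question. It assumes RegexOptions.IgnoreCase, since the pattern is written with a lowercase p while the sample uses <P>; the class and method names are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UnclosedParagraphs
{
    // Matches a <p> start tag plus its text when the next thing that follows
    // is another <p> start tag (or the end of input) rather than a </p>.
    public const string Pattern = @"<p\b(?>(?:[^<]+|<(?!/?p>))*)(?=<p\b|$)";

    public static List<string> Find(string html)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(html, Pattern, RegexOptions.IgnoreCase))
        {
            found.Add(m.Value);
        }
        return found;
    }

    public static void Main()
    {
        string sample = "<P>Affectionately Inscribed </P><P>TO </P><P>HENRY BULLAR, </P>" +
                        "<P>(of the western circuit)<P>PREFACE</P>";
        foreach (string hit in Find(sample))
        {
            Console.WriteLine(hit); // <P>(of the western circuit)
        }
    }
}
```

Because the lookahead does not consume the following <p>, two consecutive unclosed paragraphs are both reported: on <p>one<p>two<p>three</p> the matches are <p>one and <p>two.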
