How can I replace
<a href="page">Text</a>
with
<a href="page.html">Text</a>
where page and Text can be any set of characters?
This will work. Note that I only capture whatever is inside href.
resultString = Regex.Replace(subjectString, @"(?<=<a[^>]*?\bhref\s*=\s*(['""]))(.*)(?=\1.*?>)", "$2.html");
It then appends .html to the captured value. You may wish to adapt it to your needs.
Edit: before the flame wars begin: yes, it will work for your specific example, but not for all possible HTML on the internet.
You shouldn't parse HTML with regular expressions. See the answer to this question for details.
UPD: As TrueWill has pointed out, you might want to do the replacement with Html Agility Pack. But in some special cases the regexp proposed by FailedDev will do, although I would modify it slightly to look like this: @"(?<=<a\b[^>]*?\bhref\s*=\s*(['""]))(.*)(?=\1.*?>)" (put a \b after the <a to exclude other tags starting with "a").
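For what it's worth, here is a minimal sketch of the modified pattern in use. The names HrefRewriter and AppendHtml are made up for illustration, and the sample assumes the anchor carries a single quoted href attribute. Because (.*) is greedy, a tag with further quoted attributes after href would probably need a tighter pattern for the value.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefRewriter
{
    // Pattern from the answer above; \b after <a skips tags such as <abbr>.
    private static readonly Regex HrefValue = new Regex(
        @"(?<=<a\b[^>]*?\bhref\s*=\s*(['""]))(.*)(?=\1.*?>)");

    // Group 1 is the opening quote, group 2 the href value between the quotes.
    public static string AppendHtml(string html)
    {
        return HrefValue.Replace(html, "$2.html");
    }

    public static void Main()
    {
        Console.WriteLine(AppendHtml("<a href=\"page\">Text</a>"));
        // prints: <a href="page.html">Text</a>
    }
}
```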
Related
I have a small problem. I'm trying to get the text which is outside of HTML elements.
Example input:
I want this text I want this text I want this text <I don't want this text/>
I want this text I wan this text <I don't>want this</text>
Does anybody know how to do this with a regex? I thought I could do it by deleting the element text. Does anybody know another solution for this problem? Please help me.
Instead of regex, which is not suitable for parsing HTML in general (especially malformed HTML), use an HTML parser like the HTML Agility Pack.
What is exactly the Html Agility Pack (HAP)?
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
I agree that anything non-trivial should be done with an HTML parser (Agility Pack is excellent if you use .NET), but for a small requirement like this it's more than likely overkill.
Then again, an HTML parser knows more about the quirks and edge cases that HTML is full of. Be sure to test well before using a regex.
Here you go
<.*?>.*?<.*?>|<.*?/>
It also correctly ignores
<I don't>want this</text>
and not just the tags
In C# this becomes
string resultString = null;
resultString = Regex.Replace(subjectString, "<.*?>.*?<.*?>|<.*?/>", "");
Try this
(?<!<.*?)([^<>]+)
Explanation
@"
(?<! # Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind)
< # Match the character “<” literally
. # Match any single character that is not a line break character
*? # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
)
( # Match the regular expression below and capture its match into backreference number 1
[^<>] # Match a single character NOT present in the list “<>”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
"
How to remove all text between BBCode Quotation (including BBCode itself):
[quote date=2011-07-02 14:43:53 user=test link=1]blabla[/quote]
I must add that the text between the tags can contain HTML tags for formatting.
My current attempt looks like:
Regex regex = new Regex(@"[quote+].+?[/\+quote]");
Well it's almost working.
You may try the following regex:
@"\[quote.*\].*?\[/quote\]"
Note that you have to escape square brackets in a regex.
Since your BBCode blocks contain attributes, a simple + won't suffice to cover everything. + repeats the token that precedes it, in this case e.
Off the top of my head, I'd try something like this:
\[quote([^\[]*)\](.*?)\[\/quote\]
Please bear in mind that I have not tested this for C#, where the syntax might be different depending on the interpreter. Also note that I've added capture groups so that you'd be able to examine the result of each expression. As @Howard answered, [ and ] are reserved symbols and consequently need to be escaped.
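A rough C# sketch of that last pattern, in case it helps. The QuoteRemover and RemoveQuotes names are invented here, RegexOptions.Singleline is my assumption (so that a quote body may span several lines), and nested [quote] blocks are not handled.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteRemover
{
    // Group 1: the attributes after "quote"; group 2: the quoted body.
    // Singleline lets "." match line breaks inside the quote.
    private static readonly Regex Quote = new Regex(
        @"\[quote([^\[]*)\](.*?)\[/quote\]",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Removes each quote block, including the BBCode tags themselves.
    public static string RemoveQuotes(string text)
    {
        return Quote.Replace(text, "");
    }

    public static void Main()
    {
        string input = "before [quote date=2011-07-02 14:43:53 user=test link=1]bla<b>bla</b>[/quote] after";
        Console.WriteLine(RemoveQuotes(input));
        // prints "before  after" (two spaces remain where the quote was)
    }
}
```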
I want to extract all the text between the specified opening and closing tags, including the tags.
For example:
Input : I am <NAME>Kai</NAME>
Text Extracted: <NAME>Kai</NAME>
It should extract the text based on the tag.
What is the regex for the above?
If the tag in question can't be nested (and assuming case insensitivity):
Regex regexObj = new Regex("<NAME>(?:(?!</NAME>).)*</NAME>", RegexOptions.Singleline | RegexOptions.IgnoreCase);
Be advised that this is a quick-and-dirty solution which might work fine for your needs, but might also blow up in your face (for example if tags occur within comments, if there is whitespace inside the tags, if there are any attributes inside the tags etc.). If any of these might be a problem for you, please edit your question with the exact specifications you need the regex to comply with.
Here is a regex which accepts any tag name: <(\w+)>.*?</\1>
\1 is back-referencing the group (\w+) and ensures that the closing tag must have the same name as the opening tag.
If you want to search for the special tag NAME then you could use this regex: <NAME>.*?</NAME>
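Here is a small sketch of the back-referencing pattern applied to the question's input. TagExtractor and FirstElement are illustrative names only, and RegexOptions.Singleline is an assumption on my part so that the content may contain line breaks.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagExtractor
{
    // \1 must repeat whatever name (\w+) captured in the opening tag.
    private static readonly Regex AnyTag = new Regex(
        @"<(\w+)>.*?</\1>", RegexOptions.Singleline);

    // Returns the first element including its tags, or null when none is found.
    public static string FirstElement(string input)
    {
        Match m = AnyTag.Match(input);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(FirstElement("I am <NAME>Kai</NAME>"));
        // prints: <NAME>Kai</NAME>
    }
}
```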
http://www.regular-expressions.info/reference.html You might find something useful here; they have a lot of material, especially for tags. Combine the examples to meet your requirements.
I have a simple problem: I want to construct a regex that matches a form in HTML, but only if the form has any input tags. Example:
The following should be matched (ignoring attributes):
..
<form>
..
<input/>
..
</form>
..
But the following should not (ignoring attributes):
..
<form>
..
</form>
..
I have tried everything from look-arounds to capture groups but it quickly gets complicated. I want to believe there is a simple regex to capture the problem. Please note that it is important that the regex pairs the opening and closing tags according to the HTML code which means the following does not work:
<form>.+<input/>.+</form>
because it matches wrongly like this:
..
<form> <--- This is wrongly matched as the opening tag
..
</form>
<form> <-- This is the correct opening tag of the correct form
..
<input/>
..
</form> <--- This is matched as the closing tag
..
EDIT:
I already made a regex that matches what I want; my question is not how to do it, but how to do it simply and elegantly.
To me this is not simple or elegant at all:
<form>
(.(?<!</form>))+
<input/>
(.(?<!</form>))+
</form>
I want to believe there is a simple regex to capture the problem
Wishing does not make it so. There is no evidence for the proposition that every problem can be solved with regular expressions, and plenty of evidence against. Your faith is not well placed.
The set of languages which are recognizable by regular expressions is called -- unsurprisingly -- the regular languages. A nice property of all regular languages is that they can be recognized by a device with finitely many states. Therefore, you can quickly figure out if a language is not regular by asking yourself the question "would I require an unbounded number of states to recognize this language?"
Consider the language of matching parens: (), ()(), (()), ()(()), and so on. To recognize this language you have to keep track of how many open parens there are waiting to be closed, and therefore you need an unbounded number of states. Therefore this language is not a regular language, and therefore it cannot be matched by a regular expression.
HTML is clearly the paren language but even more complicated, because now there are an infinite number of different "kinds of parens". Each tag is like an open paren that must be matched by its corresponding closing tag. Since this is an even more complex and difficult version of a non-regular language, clearly it cannot be a regular language. And therefore it cannot be matched correctly with regular expressions.
The right tool to recognize patterns in HTML is an HTML parser.
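To make the paren argument concrete, here is a minimal sketch of a recognizer for the matching-parens language. ParenChecker and IsBalanced are illustrative names. The point is the counter: the nesting depth it has to track has no fixed upper limit, which is exactly what a machine with finitely many states cannot do.

```csharp
using System;

public static class ParenChecker
{
    // Returns true when every "(" is closed by a later ")" and nothing closes early.
    public static bool IsBalanced(string s)
    {
        int open = 0; // number of parens still waiting to be closed
        foreach (char c in s)
        {
            if (c == '(')
            {
                open++;
            }
            else if (c == ')')
            {
                if (open == 0) return false; // a ")" with nothing to close
                open--;
            }
        }
        return open == 0;
    }

    public static void Main()
    {
        Console.WriteLine(IsBalanced("()(())")); // True
        Console.WriteLine(IsBalanced("(()"));    // False
    }
}
```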
You really don't want to parse HTML using RegEx. See this answer if you need more convincing.
Regular expressions are the wrong tool for trying to parse HTML - especially when it's HTML that is not guaranteed to be well formed.
You should really get an HTML/XHTML parsing library and use that to match HTML content. Take a look at the HTML Agility Pack, it's probably sufficient for what you need.
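As a sketch of the parser route, and only under the assumption that the page is well-formed XHTML without a default namespace, the built-in System.Xml.Linq classes can stand in for HTML Agility Pack here. This is a swapped-in technique, not the Agility Pack API. FormFinder and CountFormsWithInput are made-up names.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class FormFinder
{
    // Counts <form> elements that contain at least one <input> descendant.
    public static int CountFormsWithInput(string xhtml)
    {
        // XDocument.Parse throws XmlException on malformed markup,
        // so this only suits input that is known to be well formed.
        XDocument doc = XDocument.Parse(xhtml);
        return doc.XPathSelectElements("//form[.//input]").Count();
    }

    public static void Main()
    {
        string page = "<body><form><p>no fields</p></form><form><p><input/></p></form></body>";
        Console.WriteLine(CountFormsWithInput(page)); // 1
    }
}
```

The XPath predicate [.//input] does the pairing of opening and closing tags implicitly, because the parser has already built the tree.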
Don't parse HTML with regular expressions.
You should not parse HTML with regular expressions, but if you must, then what about something as simple as:
<form>[^</form>]+<input/>.+</form>
I am making a regex expression in which I only want to match wrong tags like: <p> *some text here, some other tags may be here as well but no ending 'p' tag* </p>
<P>Affectionately Inscribed </P><P>TO </P><P>HENRY BULLAR, </P><P>(of the western circuit)<P>PREFACE</P>
In the sample text above I want to get the result <P>(of the western circuit)<P> and nothing else should be captured. I'm using this but it's not working:
<P>[^\(</P>\)]*<P>
Please help.
Regex is not always a good choice for xml/html type data. In particular, attributes, case-sensitivity, comments, etc all have a big impact.
For xhtml, I'd use XmlDocument/XDocument and an xpath query.
For "non-x" html, I'd look at the HTML Agility Pack and the same.
Match group one of:
(?:<p>(?:(?!<\/?p>).?)+)(<p>)
matches the second <p> in:
<P>(of the western circuit)<P>PREFACE</P>
Note: I'm usually one of those that say: "Don't do HTML with regex, use a parser instead". But I don't think the specific problem can be solved with a parser, which would probably just ignore/transparently deal with the invalid markup.
I know this isn't likely (or even html-legal?) to happen in this case, but a generic unclosed xml-tag solution would be pretty difficult as you need to consider what would happen with nested tags like
<p>OUTER BEFORE<p>INNER</p>OUTER AFTER</p>
I'm pretty sure the regular expressions given so far would match the second <p> there, even though it is not actually an unclosed <p>.
Rather than using * for maximal match, use *? for minimal.
Should be able to make a start with
<P>((?!</P>).)*?<P>
This uses a negative lookahead assertion to ensure the end tag is not matched at each point between the "<P>" matches.
EDIT: Corrected the placement of the assertion (thanks to the commenter).
All of the solutions offered so far match the second <P>, but that's wrong. What if there are two consecutive <P> elements without closing tags? The second one won't be matched because the first match ate its opening tag. You can avoid that problem by using a lookahead as I did here:
@"<p\b(?>(?:[^<]+|<(?!/?p>))*)(?=<p\b|$)"
As for the rest of it, I used a "not the initial or not the rest" technique along with an atomic group to guide the regex to a match as efficiently as possible (and, more importantly, to fail as quickly as possible if it's going to).
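A small sketch comparing the two approaches on consecutive unclosed elements. UnclosedP and its methods are illustrative names, and RegexOptions.IgnoreCase is assumed so that both <p> and <P> are covered. The consuming pattern is the <P>((?!</P>).)*?<P> variant from the earlier answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnclosedP
{
    // Consuming version: the match swallows the next <P>.
    private static readonly Regex Consuming = new Regex(
        @"<P>((?!</P>).)*?<P>", RegexOptions.IgnoreCase);

    // Lookahead version: stops in front of the next <p, so that tag can start its own match.
    private static readonly Regex Lookahead = new Regex(
        @"<p\b(?>(?:[^<]+|<(?!/?p>))*)(?=<p\b|$)", RegexOptions.IgnoreCase);

    public static int CountConsuming(string html)
    {
        return Consuming.Matches(html).Count;
    }

    public static int CountLookahead(string html)
    {
        return Lookahead.Matches(html).Count;
    }

    public static void Main()
    {
        string html = "<P>one<P>two<P>three</P>";
        Console.WriteLine(CountConsuming(html)); // 1: "<P>one<P>"
        Console.WriteLine(CountLookahead(html)); // 2: "<P>one" and "<P>two"
    }
}
```

Note that the lookahead version matches <P>(of the western circuit) in the question's sample without the trailing <P>, since that tag is only asserted, not consumed.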