I have a small problem. I'm trying to get the text which is outside of HTML elements.
Example input:
I want this text I want this text I want this text <I don't want this text/>
I want this text I wan this text <I don't>want this</text>
Does anybody know whether this is possible with a regex? I thought I could do it by deleting the element text. Or does anybody know another solution to this problem? Please help me.
Instead of regex, which is not suitable for parsing HTML in general (especially malformed HTML), use an HTML parser like the HTML Agility Pack.
What is exactly the Html Agility Pack (HAP)?
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
I agree that anything non-trivial should be done with an HTML parser (the Agility Pack is excellent if you use .NET), but for a small requirement like this it's more than likely overkill.
Then again, an HTML parser knows more about the quirks and edge cases that HTML is full of. Be sure to test well before using a regex.
Here you go
<.*?>.*?<.*?>|<.*?/>
It also correctly ignores
<I don't>want this</text>
and not just the tags
In C# this becomes
string resultString = null;
resultString = Regex.Replace(subjectString, "<.*?>.*?<.*?>|<.*?/>", "");
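To try the pattern against the sample input from the question, a minimal self-contained sketch could look like this. The class and method names (`StripElementsDemo`, `Strip`) are made up for illustration; only the pattern comes from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripElementsDemo
{
    // Hypothetical wrapper around the Regex.Replace call shown above.
    public static string Strip(string input)
    {
        return Regex.Replace(input, "<.*?>.*?<.*?>|<.*?/>", "");
    }

    public static void Main()
    {
        string input =
            "I want this text <I don't want this text/>\n" +
            "I want this text <I don't>want this</text>";
        Console.WriteLine(Strip(input));
    }
}
```

Keep in mind that the first alternative pairs any tag with the next tag on the same line, so text sitting between two unrelated tags, as in `<br/> keep <p>`, is removed as well.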
Try this
(?<!<.*?)([^<>]+)
Explanation
@"
(?<! # Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind)
< # Match the character “<” literally
. # Match any single character that is not a line break character
*? # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
)
( # Match the regular expression below and capture its match into backreference number 1
[^<>] # Match a single character NOT present in the list “<>”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
"
This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 9 years ago.
I have an HTML file and I am trying to retrieve the valid inner text from each tag. I am using a regex for this with the following pattern:
(?<=>).*?(?=<)
It works fine for simple inner text. But I recently encountered the following HTML fragments:
<div id="mainDiv"> << Generate Report>> </div>
<input id="name" type="text">Your Name->></input>
I am not sure how to retrieve these inner texts with regular expressions. Can someone please help?
Thanks
I'd use a parser, but this is possible with RegEx using something like:
<([a-zA-Z0-9]+)(?:\s+[^>]+)?>(.+?)<\/\1>
Then you can grab the inner text with capture group 2.
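As a rough sketch of how that could be wired up in C#, the snippet below runs the pattern against the two fragments from the question. `InnerTextDemo` and `FirstInnerText` are illustrative names, not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InnerTextDemo
{
    // Same pattern as above: group 1 is the tag name, group 2 the inner text.
    private static readonly Regex Element =
        new Regex(@"<([a-zA-Z0-9]+)(?:\s+[^>]+)?>(.+?)<\/\1>");

    // Hypothetical helper: inner text of the first element found, or null.
    public static string FirstInnerText(string html)
    {
        Match m = Element.Match(html);
        return m.Success ? m.Groups[2].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(FirstInnerText("<div id=\"mainDiv\"> << Generate Report>> </div>"));
        Console.WriteLine(FirstInnerText("<input id=\"name\" type=\"text\">Your Name->></input>"));
    }
}
```

Because `(.+?)` is lazy, nested elements with the same name stop at the first closing tag, so this only suits flat markup.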
You can always strip out the HTML tags themselves: a single tag can be described by a regular grammar, while HTML as a whole cannot. Replace "<[a-zA-Z][a-zA-Z0-9]*\s*([a-zA-Z]+\s*=\s*("|')(?("|')(?<=).|.)("|')\s*)*/?>" with string.Empty.
That regex should match any valid HTML tag.
EDIT:
If you do not want a concatenated result, you can replace with "<" instead of string.Empty and then split on '<', since '<' in HTML always starts a tag and should never be displayed. Or you can use the overload of Regex.Replace that takes a delegate and work with the match index and match length (that may turn out to be more efficient). Or, even better, use Regex.Match and go from matched tag to matched tag: Substring(PreviousMatchIndex + PreviousMatchLength, CurrentMatchIndex - (PreviousMatchIndex + PreviousMatchLength)) should give the inner text.
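The tag-to-tag walk described in the edit might be sketched as follows. This is an assumption-laden illustration: the tag pattern `</?[a-zA-Z][^>]*>` is a simplified stand-in rather than the regex from this answer, and `TagWalker` and `TextSegments` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagWalker
{
    // Hypothetical helper: collects the text found between consecutive tags.
    public static List<string> TextSegments(string html)
    {
        var segments = new List<string>();
        int previousEnd = 0; // previous match index + previous match length

        // Simplified tag pattern (an assumption, not the pattern from the answer).
        foreach (Match tag in Regex.Matches(html, @"</?[a-zA-Z][^>]*>"))
        {
            if (tag.Index > previousEnd)
            {
                segments.Add(html.Substring(previousEnd, tag.Index - previousEnd));
            }
            previousEnd = tag.Index + tag.Length;
        }

        // Text after the last tag, if any.
        if (previousEnd < html.Length)
        {
            segments.Add(html.Substring(previousEnd));
        }
        return segments;
    }

    public static void Main()
    {
        foreach (string s in TextSegments("<p>Hello</p> world<br/>!"))
        {
            Console.WriteLine("[" + s + "]");
        }
    }
}
```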
That's exactly why you don't use regex for parsing HTML. You can work around this particular problem by using a backreference in the regex:
(?<=<(\w+)[^<>]*>).*?(?=</\1>)
That won't always work, though, because
tags won't always have a closing tag
tag attributes can contain <>
there can be arbitrary spaces around the tag's name
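Use an HTML parser such as the Html Agility Pack.
Your code would be as simple as this:
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// InnerText of all divs (DocumentNode is the root node of the parsed document)
List<string> divs = doc.DocumentNode
    .SelectNodes("//div")
    .Select(x => x.InnerText)
    .ToList();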
How can I replace
<a href="page">Text</a>
with
<a href="page.html">Text</a>
where page and Text can be any set of characters?
This will work. Note that I only capture whatever is inside href.
resultString = Regex.Replace(subjectString, @"(?<=<a[^>]*?\bhref\s*=\s*(['""]))(.*)(?=\1.*?>)", "$2.html");
It appends .html to the captured value. You may wish to adapt it to your needs.
Edit: before the flame wars begin: yes, it will work for your specific example, not for all possible HTML on the internet.
You shouldn't parse HTML with regular expressions. See the answer to this question for details.
UPD: As TrueWill has pointed out, you might want to do the replace with Html Agility Pack. But in some special cases the regexp proposed by FailedDev will do, although I would slightly modify it to look like this: @"(?<=<a\b[^>]*?\bhref\s*=\s*(['""]))(.*)(?=\1.*?>)" (put a \b after the <a to exclude other tags starting with "a").
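A small sketch of the modified pattern in use, to show what the extra \b buys. `HrefSuffixDemo` and `AppendHtml` are illustrative names only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefSuffixDemo
{
    // Hypothetical helper wrapping the pattern with \b after "<a".
    public static string AppendHtml(string html)
    {
        return Regex.Replace(
            html,
            @"(?<=<a\b[^>]*?\bhref\s*=\s*(['""]))(.*)(?=\1.*?>)",
            "$2.html");
    }

    public static void Main()
    {
        Console.WriteLine(AppendHtml("<a href=\"page\">Text</a>"));
        // <area> also starts with "a", but \b keeps it from matching.
        Console.WriteLine(AppendHtml("<area href=\"map\">"));
    }
}
```

The greedy `(.*)` is still fragile: if another quoted attribute follows href on the same line, the capture can run past the href value, which is one more reason to prefer a parser for anything beyond the simple case.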
I have a simple problem: I want to construct a regex that matches a form in HTML, but only if the form has any input tags. Example:
The following should be matched (ignoring attributes):
..
<form>
..
<input/>
..
</form>
..
But the following should not (ignoring attributes):
..
<form>
..
</form>
..
I have tried everything from look-arounds to capture groups but it quickly gets complicated. I want to believe there is a simple regex to capture the problem. Please note that it is important that the regex pairs the opening and closing tags according to the HTML code which means the following does not work:
<form>.+<input/>.+</form>
because it matches wrongly like this:
..
<form> <--- This is wrongly matched as the opening tag
..
</form>
<form> <-- This is the correct opening tag of the correct form
..
<input/>
..
</form> <--- This is matched as the closing tag
..
EDIT:
I already made a regex that matches what I want; my question is not how to do it, but how to do it simply and elegantly.
To me this is not simple or elegant at all:
<form>
(.(?<!</form>))+
<input/>
(.(?<!</form>))+
</form>
I want to believe there is a simple regex to capture the problem
Wishing does not make it so. There is no evidence for the proposition that every problem can be solved with regular expressions, and plenty of evidence against. Your faith is not well placed.
The set of languages which are recognizable by regular expressions is called -- unsurprisingly -- the regular languages. A nice property of all regular languages is that they can be recognized by a device with finitely many states. Therefore, you can quickly figure out if a language is not regular by asking yourself the question "would I require an unbounded number of states to recognize this language?"
Consider the language of matching parens: (), ()(), (()), ()(()), and so on. To recognize this language you have to keep track of how many open parens there are waiting to be closed, and therefore you need an unbounded number of states. Therefore this language is not a regular language, and therefore it cannot be matched by a regular expression.
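To make the "unbounded number of states" point concrete, here is a minimal sketch of a recognizer for the paren language. The counter `open` is exactly the piece of state that has no upper bound, which a finite-state device cannot provide. The names `ParenCheck` and `IsBalanced` are illustrative.

```csharp
using System;

public static class ParenCheck
{
    // Returns true when every '(' is closed by a later ')'.
    public static bool IsBalanced(string s)
    {
        int open = 0; // number of parens still waiting to be closed
        foreach (char c in s)
        {
            if (c == '(')
            {
                open++;
            }
            else if (c == ')' && --open < 0)
            {
                return false; // a ')' arrived with nothing left to close
            }
        }
        return open == 0;
    }

    public static void Main()
    {
        Console.WriteLine(IsBalanced("()(())"));
        Console.WriteLine(IsBalanced("(()"));
    }
}
```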
HTML is clearly the paren language but even more complicated, because now there are an infinite number of different "kinds of parens". Each tag is like an open paren that must be matched by its corresponding closing tag. Since this is an even more complex and difficult version of a non-regular language, clearly it cannot be a regular language. And therefore it cannot be matched correctly with regular expressions.
The right tool to recognize patterns in HTML is an HTML parser.
You really don't want to parse HTML using RegEx. See this answer if you need more convincing.
Regular expressions are the wrong tool for trying to parse HTML, especially when it's HTML that is not guaranteed to be well formed.
You should really get an HTML/XHTML parsing library and use that to match HTML content. Take a look at the HTML Agility Pack, it's probably sufficient for what you need.
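For the special case where the markup happens to be well-formed XML, even the framework's own System.Xml.Linq is enough to answer the "form with an input" question. The sketch below assumes well-formed input without an XML namespace; real-world HTML would need a tolerant parser such as the Html Agility Pack instead. `FormFinder` and `FormsWithInput` are made-up names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class FormFinder
{
    // Hypothetical helper: forms that contain at least one <input> descendant.
    // Assumes well-formed XML with no namespace.
    public static List<XElement> FormsWithInput(string xhtml)
    {
        return XDocument.Parse(xhtml)
            .Descendants("form")
            .Where(form => form.Descendants("input").Any())
            .ToList();
    }

    public static void Main()
    {
        string page =
            "<body><form id=\"a\"><p/></form><form id=\"b\"><input/></form></body>";
        foreach (XElement form in FormsWithInput(page))
        {
            Console.WriteLine((string)form.Attribute("id"));
        }
    }
}
```

The parser pairs opening and closing tags itself, so the empty first form cannot be mistaken for the start of the second one.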
Don't parse HTML with regular expressions.
You should not parse HTML with regular expressions, but if you must, then what about something as simple as:
<form>[^</form>]+<input/>.+</form>
I have a string like:
[a b="c" d="e"]Some multi line text[/a]
Now the part d="e" is optional. I want to convert strings of this kind into:
<a b="c" d="e">Some multi line text</a>
The names a, b and d are constant, so I don't need to capture them. I just need the values c and e and the text between the tags, to create an equivalent XML-based expression. How can I do that, given that part of the input is optional?
For HTML tags, please use an HTML parser.
For [a][/a], you can do the following:
// Singleline lets "." match newlines, so the text may span several lines
Match m = Regex.Match(@"[a b=""c"" d=""e""]Some multi line text[/a]",
    @"\[a b=""([^""]+)"" d=""([^""]+)""\](.*?)\[/a\]",
    RegexOptions.Singleline);
Console.WriteLine(m.Groups[1].Value); // c
Console.WriteLine(m.Groups[2].Value); // e
Console.WriteLine(m.Groups[3].Value); // Some multi line text
Here is a Regex.Replace version (not the one I prefer, though); it treats the d part as optional:
string inputStr = @"[a b=""[[[[c]]]]"" d=""e[]""]Some multi line text[/a]";
string resultStr = Regex.Replace(inputStr,
    @"\[a( b=""[^""]+"")( d=""[^""]+"")?\](.*?)\[/a\]",
    @"<a$1$2>$3</a>",
    RegexOptions.Singleline);
If you are actually thinking of processing (pseudo)-HTML using regexes,
don't
SO is filled with posts where regexes are proposed for HTML/XML and answers pointing out why this is a bad idea.
Suppose your multiline text ("which can be anything") contains
[a b="foo" [a b="bar"]]
a regex cannot detect this.
See the classic answer in:
RegEx match open tags except XHTML self-contained tags
which has:
I think it's time for me to quit the
post of Assistant Don't Parse HTML
With Regex Officer. No matter how many
times we say it, they won't stop
coming every day... every hour even.
It is a lost cause, which someone else
can fight for a bit. So go on, parse
HTML with regex, if you must. It's
only broken code, not life and death.
– bobince
Seriously. Find an XML or HTML DOM and populate it with your data. Then serialize it. That will take care of all the problems you don't even know you have got.
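As a hedged illustration of "populate a DOM, then serialize it", the sketch below builds the element with System.Xml.Linq once the values have been extracted by whatever means. `TagBuilder` and `Build` are hypothetical names, and passing null for d stands in for the optional attribute being absent.

```csharp
using System;
using System.Xml.Linq;

public static class TagBuilder
{
    // Hypothetical helper: d is optional, matching the question.
    public static string Build(string b, string d, string text)
    {
        var a = new XElement("a", new XAttribute("b", b));
        if (d != null)
        {
            a.Add(new XAttribute("d", d));
        }
        a.Add(text); // added as a text node, escaped on output
        return a.ToString(SaveOptions.DisableFormatting);
    }

    public static void Main()
    {
        Console.WriteLine(Build("c", "e", "Some multi line text"));
        Console.WriteLine(Build("c", null, "1 < 2"));
    }
}
```

The serializer takes care of escaping characters such as `<` in the text, which a plain string replacement would silently get wrong.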
Would the multi-line text ever include [ and ]? If not, you can just replace [ with < and ] with > using string.Replace; no regex is needed.
Update:
If it can be anything but [/a], you can replace
^\[a([^\]]+)](.*?)\[/a]$
with
<a$1>$2</a>
I haven't escaped ] and / in the regex - escape them if necessary to get
^\[a([^\]]+)\](.*?)\[\/a\]$
I am writing a regex that should match only malformed tags like: <p> *some text here, some other tags may be here as well but no ending 'p' tag* </p>
<P>Affectionately Inscribed </P><P>TO </P><P>HENRY BULLAR, </P><P>(of the western circuit)<P>PREFACE</P>
In the sample text above I want to get <P>(of the western circuit)<P> as the result, and nothing else should be captured. I'm using this, but it's not working:
<P>[^\(</P>\)]*<P>
Please help.
Regex is not always a good choice for xml/html type data. In particular, attributes, case-sensitivity, comments, etc all have a big impact.
For xhtml, I'd use XmlDocument/XDocument and an xpath query.
For "non-x" html, I'd look at the HTML Agility Pack and the same.
Match group one of:
(?:<p>(?:(?!<\/?p>).?)+)(<p>)
matches the second <p> in:
<P>(of the western circuit)<P>PREFACE</P>
Note: I'm usually one of those that say: "Don't do HTML with regex, use a parser instead". But I don't think the specific problem can be solved with a parser, which would probably just ignore/transparently deal with the invalid markup.
I know this isn't likely (or even html-legal?) to happen in this case, but a generic unclosed xml-tag solution would be pretty difficult as you need to consider what would happen with nested tags like
<p>OUTER BEFORE<p>INNER</p>OUTER AFTER</p>
I'm pretty sure the regular expressions given so far would match the second <p> there, even though it is not actually an unclosed <p>.
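That suspicion is easy to check. The sketch below runs one of the patterns posted in this thread, `(?:<p>(?:(?!<\/?p>).?)+)(<p>)`, against the nested example; the case-insensitive option and the names `NestedParagraphCheck` and `SecondOpenTagIndex` are my own assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedParagraphCheck
{
    // Index of the <p> captured by group 1, or -1 when nothing matches.
    public static int SecondOpenTagIndex(string html)
    {
        Match m = Regex.Match(
            html,
            @"(?:<p>(?:(?!<\/?p>).?)+)(<p>)",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Index : -1;
    }

    public static void Main()
    {
        // The inner <p> is properly nested, yet the pattern still reports it.
        Console.WriteLine(SecondOpenTagIndex("<p>OUTER BEFORE<p>INNER</p>OUTER AFTER</p>"));
        Console.WriteLine(SecondOpenTagIndex("<p>closed</p>"));
    }
}
```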
Rather than using * for maximal match, use *? for minimal.
Should be able to make a start with
<P>((?!</P>).)*?<P>
This uses a negative lookahead assertion to ensure the end tag is not matched at each point between the "<P>" matches.
EDIT: Corrected the placement of the assertion (thanks to the commenter).
All of the solutions offered so far match the second <P>, but that's wrong. What if there are two consecutive <P> elements without closing tags? The second one won't be matched because the first match ate its opening tag. You can avoid that problem by using a lookahead as I did here:
@"<p\b(?>(?:[^<]+|<(?!/?p>))*)(?=<p\b|$)"
As for the rest of it, I used a "not the initial or not the rest" technique along with an atomic group to guide the regex to a match as efficiently as possible (and, more importantly, to fail as quickly as possible if it's going to).
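A quick way to see the effect of the lookahead is to feed the pattern two consecutive unclosed paragraphs. This is only a sketch: `RegexOptions.IgnoreCase` is my assumption (the sample data uses upper-case tags), and `UnclosedParagraphs` and `Find` are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UnclosedParagraphs
{
    // Hypothetical helper: returns each <p>... run that is followed by
    // another <p> (or the end of input) before any closing </p>.
    public static List<string> Find(string html)
    {
        var found = new List<string>();
        MatchCollection matches = Regex.Matches(
            html,
            @"<p\b(?>(?:[^<]+|<(?!/?p>))*)(?=<p\b|$)",
            RegexOptions.IgnoreCase);
        foreach (Match m in matches)
        {
            found.Add(m.Value);
        }
        return found;
    }

    public static void Main()
    {
        foreach (string p in Find("<P>A </P><P>B<P>C<P>D</P>"))
        {
            Console.WriteLine(p);
        }
    }
}
```

Because the next `<P>` is only looked at and never consumed, both unclosed paragraphs in the sample are reported rather than just the first.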