I have this regex to extract paragraphs that are outside of a table
((?<=<\/w:tbl>)<w:p [^>]*>.*?<\/w:p>(?=<w:tbl>)|(?<=<\/w:tbl>)<w:p [^>]*>.*?<\/w:p>(?=<w:sectPr .*>))
The problem is that it reads all paragraphs as if they are one paragraph (from the first opening tag until the last closing tag without the intermediate paragraphs).
Below is an example of the text. In this case it matches one paragraph instead of three:
</w:tr></w:tbl><w:p w:rsidR="00F24C60" w:rsidRDefault="00F24C60" w:rsidP="009D46A1"><w:pPr><w:spacing w:before="240" w:after="240"/></w:pPr><w:r><w:t></w:t></w:r></w:p><w:p w:rsidR="00F24C60" w:rsidRDefault="00F24C60" w:rsidP="009D46A1"><w:pPr><w:spacing w:before="240" w:after="240"/></w:pPr><w:r><w:t></w:t></w:r></w:p><w:p w:rsidR="00346D4D" w:rsidRPr="00AC7B53" w:rsidRDefault="00F24C60" w:rsidP="009D46A1"><w:pPr><w:spacing w:before="240" w:after="240"/></w:pPr><w:r><w:t></w:t></w:r></w:p><w:tbl><w:tblPr>
How can I make it match each paragraph on its own (three paragraphs)?
Thanks.
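I don't think you can, at least not easily with a single pattern. You want repeated groups that sit between two other tags, but a regex knows nothing about document structure; it just scans the string from beginning to end. Your lookbehind only allows a match to start right after </w:tbl> and your lookahead only allows it to end right before <w:tbl>, so the lazy .*? has to stretch over every paragraph in between. Take the string eabcabce: getting every abc is easy with (abc), but I can't say "every abc, but only those between the two e's" in the same way.
You can use an XML parser instead.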
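A minimal sketch of the parser route, assuming the XML is available as a complete, well-formed element. The w:body wrapper and the sample text here are made up for illustration; only the w namespace URI is the standard WordprocessingML one. It uses LINQ to XML (System.Xml.Linq) and keeps every w:p that has no w:tbl ancestor.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Standard WordprocessingML namespace; the document content below is a made-up sample.
XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
string xml =
    "<w:body xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">" +
    "<w:tbl><w:tr><w:tc><w:p><w:r><w:t>inside</w:t></w:r></w:p></w:tc></w:tr></w:tbl>" +
    "<w:p><w:r><w:t>first</w:t></w:r></w:p>" +
    "<w:p><w:r><w:t>second</w:t></w:r></w:p>" +
    "<w:p><w:r><w:t>third</w:t></w:r></w:p>" +
    "<w:tbl><w:tr><w:tc><w:p><w:r><w:t>inside too</w:t></w:r></w:p></w:tc></w:tr></w:tbl>" +
    "</w:body>";

XElement body = XElement.Parse(xml);

// Keep only paragraphs that are not nested inside any table.
var outside = body.Descendants(w + "p")
                  .Where(p => !p.Ancestors(w + "tbl").Any())
                  .ToList();

foreach (XElement p in outside)
    Console.WriteLine(p.Value);
```

Because the parser understands nesting, paragraphs inside table cells are excluded without any lookaround tricks.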
You can try two regexes for this particular case:
Get the whole block of paragraphs between the tables with your regex
Get the individual paragraphs from that block with this regex (<w:p [^>]*>.*?<\/w:p>)
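The two steps above can be sketched in C# roughly as follows. The sample string is a shortened version of the text in the question (fewer attributes), and the variable names are only illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Shortened sample: three paragraphs between the end of one table and the start of the next.
string p1 = "<w:p w:rsidR=\"00F24C60\"><w:pPr><w:spacing w:before=\"240\" w:after=\"240\"/></w:pPr><w:r><w:t></w:t></w:r></w:p>";
string p2 = "<w:p w:rsidR=\"00346D4D\"><w:pPr><w:spacing w:before=\"240\" w:after=\"240\"/></w:pPr><w:r><w:t></w:t></w:r></w:p>";
string xml = "</w:tr></w:tbl>" + p1 + p1 + p2 + "<w:tbl><w:tblPr>";

// Step 1: the regex from the question matches the whole run of paragraphs as one block.
Match block = Regex.Match(
    xml,
    @"((?<=<\/w:tbl>)<w:p [^>]*>.*?<\/w:p>(?=<w:tbl>)|(?<=<\/w:tbl>)<w:p [^>]*>.*?<\/w:p>(?=<w:sectPr .*>))",
    RegexOptions.Singleline);

// Step 2: split that block into individual paragraphs.
string[] paragraphs = Regex.Matches(block.Value, @"<w:p [^>]*>.*?<\/w:p>", RegexOptions.Singleline)
                           .Cast<Match>()
                           .Select(m => m.Value)
                           .ToArray();

Console.WriteLine(paragraphs.Length);
```

The space in `<w:p ` matters: it keeps the pattern from treating `<w:pPr>` as the start of a paragraph.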
Some links:
Why not to parse HTML with regex (I think your XML is close enough to HTML :)): RegEx match open tags except XHTML self-contained tags
https://www.regextester.com/
I'm trying to create a Regex expression to match content within a HTML document, but I wish to exclude matches contained within a tag itself. Consider the following:
<p>Here is some sample text for my widgets</p>
Click here to view my widgets
I would like to match 'widgets' so that I can replace it with a different string, say 'green box', without replacing the match within the url.
Matching 'widgets' is, well, easy as anything, but I'm struggling to add the exclude to check for 'widgets' when it appears within the opening and closing tag '<>'.
My current workings: As a first step I have started to match 'widgets' contained within '<>'. (I can then move on to make this an exclude later) However the below string seems to match the whole document, even though I have placed an exclude on the closing > to make sure widgets appears within a tag.
<.*[^>]widgets.*[^<]>+
It's probably down to lazy / greedy, but I can't quite work it out!
Overview
By no means is this a great answer since it's parsing HTML with regex, but it does work for the test case given by the OP.
See RegEx match open tags except XHTML self-contained tags for more information.
Code
(?<!<[^>]*)widgets
Explanation
(?<!<[^>]*) Negative lookbehind ensuring what precedes is not < followed by any character except > (any number of times)
widgets Match this literally
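A short sketch of this lookbehind used for the replacement the question asks for. The input is a hypothetical reconstruction (the markup of the question's link is an assumption). Note that the pattern relies on a variable-length lookbehind, which .NET supports but many other regex engines do not.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input: "widgets" appears in the text and inside the href attribute.
string html = "<p>Here is some sample text for my widgets</p>" +
              "<a href=\"/widgets\">Click here to view my widgets</a>";

// Replace "widgets" only where it is not inside an open "<...".
string result = Regex.Replace(html, @"(?<!<[^>]*)widgets", "green box");

Console.WriteLine(result);
```

The occurrence inside the href is left alone because the text before it, back to the nearest <, contains no closing >.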
This may partially work:
(?:^|>)[^<]*widgets
This will start looking from the start of a line (if the /m flag is used) or the end of a tag (so we know we are not in one), and advance as many characters possible that are not <, meaning you can't open another tag, before looking for widgets.
The issues with this are that it may give odd results if you have a > inside a tag (e.g. in JavaScript) or if a single tag spans multiple lines, and that it won't find several instances of "widgets" in the same substring. To solve those issues, you'd be better off using an actual XML parser, as advised by ctwheels.
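To make the last limitation concrete, here is a small check in C# on a made-up input: two occurrences of "widgets" in the same text run produce a single match, because the greedy [^<]* swallows the first one.

```csharp
using System;
using System.Text.RegularExpressions;

// Made-up input with two occurrences in the same text run.
string html = "<p>widgets and more widgets</p>";

MatchCollection guarded = Regex.Matches(html, @"(?:^|>)[^<]*widgets");
MatchCollection plain = Regex.Matches(html, "widgets");

Console.WriteLine(guarded.Count); // one match that spans both occurrences
Console.WriteLine(plain.Count);   // two separate occurrences
```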
I have a HTML file and I am trying to retrieve valid innertext from each tag. I am using Regex for this with the following pattern:
(?<=>).*?(?=<)
It works fine for simple innertext. But, I recently encountered following HTML pieces:
<div id="mainDiv"> << Generate Report>> </div>
<input id="name" type="text">Your Name->></input>
I am not sure how to retrieve these inner texts with regular expressions. Can someone please help?
Thanks
I'd use a parser, but this is possible with RegEx using something like:
<([a-zA-Z0-9]+)(?:\s+[^>]+)?>(.+?)<\/\1>
Then you can grab the inner text with capture group 2.
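As a rough illustration, this is how the pattern behaves in C# on the two snippets from the question; the surrounding code is only a sketch.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string html = "<div id=\"mainDiv\"> << Generate Report>> </div>\n" +
              "<input id=\"name\" type=\"text\">Your Name->></input>";

// Group 1 is the tag name (reused by \1 for the closing tag), group 2 is the inner text.
var innerTexts = Regex.Matches(html, @"<([a-zA-Z0-9]+)(?:\s+[^>]+)?>(.+?)<\/\1>")
                      .Cast<Match>()
                      .Select(m => m.Groups[2].Value)
                      .ToList();

foreach (string text in innerTexts)
    Console.WriteLine("[" + text + "]");
```

Since . does not match a newline by default, each tag pair has to sit on one line unless RegexOptions.Singleline is added.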
You can always strip out the tags themselves: a single HTML tag can be described by a regular grammar, even though HTML as a whole cannot. Replace "<[a-zA-Z][a-zA-Z0-9]*\s*([a-zA-Z]+\s*=\s*("|')(?("|')(?<=).|.)("|')\s*)*/?>" with string.Empty.
That regex should match any valid HTML tag.
EDIT:
If you do not want a concatenated result, you can replace with "<" instead of string.Empty and then split on '<', since '<' in HTML always starts a tag and should never be displayed. Or you can use the overload of Regex.Replace that takes a delegate and use the match index and match length (that may turn out to be more efficient). Or, even better, use Regex.Match and go from matched tag to matched tag: Substring(PreviousMatchIndex + PreviousMatchLength, CurrentMatchIndex - (PreviousMatchIndex + PreviousMatchLength)) should give the inner text.
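A minimal sketch of the tag-to-tag walk. It uses a simplified tag pattern of my own (an assumption, not the regex from this answer): "<" or "</", a letter, then everything up to the next ">". The helper name InnerTexts is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Simplified stand-in for a tag pattern (an assumption for this sketch).
var tagPattern = new Regex(@"</?[a-zA-Z][^>]*>");

// Collect the text that lies between consecutive tag matches.
List<string> InnerTexts(string html)
{
    var texts = new List<string>();
    int previousEnd = 0; // index just past the previous tag
    foreach (Match tag in tagPattern.Matches(html))
    {
        string between = html.Substring(previousEnd, tag.Index - previousEnd);
        if (!string.IsNullOrWhiteSpace(between))
            texts.Add(between);
        previousEnd = tag.Index + tag.Length;
    }
    return texts;
}

string sample = "<div id=\"mainDiv\"> << Generate Report>> </div>\n" +
                "<input id=\"name\" type=\"text\">Your Name->></input>";
List<string> found = InnerTexts(sample);
foreach (string text in found)
    Console.WriteLine("[" + text + "]");
```

Requiring a letter after the "<" is what keeps "<< Generate Report>>" from being treated as a tag; text after the last tag is ignored in this sketch.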
That's exactly why you don't use regex for parsing HTML. You can get around this particular problem by using a backreference in the regex:
(?<=<(\w+)[^>]*>).*?(?=</\1>)
That won't always work, though, because
tags won't always have a closing tag
tag attributes can contain < and >
there can be arbitrary spaces around the tag's name
Use an HTML parser like HtmlAgilityPack.
Your code would be as simple as this:
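using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// InnerText of all divs (SelectNodes returns null when nothing matches)
List<string> divs = doc.DocumentNode
    .SelectNodes("//div")
    .Select(x => x.InnerText).ToList();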
I have a small problem. I'm trying to get the text which is outside of HTML elements.
Example input:
I want this text I want this text I want this text <I don't want this text/>
I want this text I wan this text <I don't>want this</text>
Does anybody know whether this is possible with regex? I thought I could do it by deleting the element text. Does anybody know another solution to this problem? Please help me.
Instead of regex, which is not suitable for parsing HTML in general (especially malformed HTML), use an HTML parser like the HTML Agility Pack.
What is exactly the Html Agility Pack (HAP)?
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
I agree that anything non-trivial should be done with an HTML parser (the Agility Pack is excellent if you use .NET), but for a small requirement like this it's more than likely overkill.
Then again, an HTML parser knows more about the quirks and edge cases that HTML is full of. Be sure to test well before using a regex.
Here you go
<.*?>.*?<.*?>|<.*?/>
It also correctly ignores
<I don't>want this</text>
and not just the tags
In C# this becomes
// Remove <tag>text</tag> pairs first, then self-closing <tag/> elements
string resultString = Regex.Replace(subjectString, "<.*?>.*?<.*?>|<.*?/>", "");
Try this
(?<!<.*?)([^<>]+)
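A quick sketch of how this pattern behaves in C# on the example input from the question, plus one made-up line. One thing worth knowing: because the lookbehind rejects any position that has a < earlier on the same line, text that follows a tag on that line is dropped as well.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var outsideText = new Regex(@"(?<!<.*?)([^<>]+)");

string input = "I want this text <I don't want this text/>\n" +
               "I want this text <I don't>want this</text>";
var kept = outsideText.Matches(input).Cast<Match>().Select(m => m.Value).ToList();

// Made-up line: " after" follows a tag on the same line, so it is not matched.
var keptTrailing = outsideText.Matches("before <b>bold</b> after")
                              .Cast<Match>().Select(m => m.Value).ToList();

foreach (string text in kept.Concat(keptTrailing))
    Console.WriteLine("[" + text + "]");
```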
Explanation
@"
(?<! # Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind)
< # Match the character “<” literally
. # Match any single character that is not a line break character
*? # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
)
( # Match the regular expression below and capture its match into backreference number 1
[^<>] # Match a single character NOT present in the list “<>”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
"
I want to extract all the texts between the specified opening and closing tags including the tags.
For example:
Input : I am <NAME>Kai</NAME>
Text Extracted: <NAME>Kai</NAME>
It should extract the text based on the tag.
What is the regex for the above?
If the tag in question can't be nested (and assuming case insensitivity):
Regex regexObj = new Regex("<NAME>(?:(?!</NAME>).)*</NAME>", RegexOptions.Singleline | RegexOptions.IgnoreCase);
Be advised that this is a quick-and-dirty solution which might work fine for your needs, but might also blow up in your face (for example if tags occur within comments, if there is whitespace inside the tags, if there are any attributes inside the tags etc.). If any of these might be a problem for you, please edit your question with the exact specifications you need the regex to comply with.
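For illustration, a short usage sketch of that regex object. The input string is made up; it includes a lower-case tag and a line break to show the effect of the two options.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

Regex regexObj = new Regex("<NAME>(?:(?!</NAME>).)*</NAME>",
                           RegexOptions.Singleline | RegexOptions.IgnoreCase);

// Made-up input: the second tag pair is lower case and spans two lines.
string input = "I am <NAME>Kai</NAME> and she is <name>Mia\nLee</name>.";

var extracted = regexObj.Matches(input).Cast<Match>().Select(m => m.Value).ToList();

foreach (string item in extracted)
    Console.WriteLine(item);
```

The (?!</NAME>). part consumes one character at a time, but only while a closing tag does not start at that position, so a match can never run past the first closing tag.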
Here is a regex which accepts any tag name: <(\w+)>.*?</\1>
\1 is back-referencing the group (\w+) and ensures that the closing tag must have the same name as the opening tag.
If you want to search for the special tag NAME then you could use this regex: <NAME>.*?</NAME>
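A small C# sketch of the back-referencing variant on made-up input, showing that a closing tag with a different name does not complete a match:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var anyTag = new Regex(@"<(\w+)>.*?</\1>");

// Made-up input with two different tag names.
string input = "I am <NAME>Kai</NAME> from <CITY>Oslo</CITY>";
var tagged = anyTag.Matches(input).Cast<Match>().ToList();

foreach (Match m in tagged)
    Console.WriteLine(m.Groups[1].Value + ": " + m.Value);

// Opening and closing names differ, so there is no match.
bool mismatched = anyTag.IsMatch("<NAME>Kai</CITY>");
Console.WriteLine(mismatched);
```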
http://www.regular-expressions.info/reference.html You might find something useful here; they have a lot of material, especially for tags. Combine the examples to meet your requirements.
I have a string like:
[a b="c" d="e"]Some multi line text[/a]
Now the part d="e" is optional. I want to convert such type of string into:
<a b="c" d="e">Some multi line text</a>
The names a, b and d are constant, so I don't need to capture them. I just need the values c and e and the text between the tags, so that I can build the equivalent XML expression. How can I do that, given that one part is optional?
For HTML tags, please use an HTML parser.
For [a][/a], you can do the following:
Match m = Regex.Match(@"[a b=""c"" d=""e""]Some multi line text[/a]",
@"\[a b=""([^""]+)"" d=""([^""]+)""\](.*?)\[/a\]",
RegexOptions.Singleline); // Singleline (not Multiline) is what lets . match line breaks in the text
m.Groups[1].Value
"c"
m.Groups[2].Value
"e"
m.Groups[3].Value
"Some multi line text"
Here is a Regex.Replace version (not my preferred approach, though); the d attribute is optional here:
string inputStr = @"[a b=""[[[[c]]]]"" d=""e[]""]Some multi line text[/a]";
string resultStr = Regex.Replace(inputStr,
@"\[a( b=""[^""]+"")( d=""[^""]+"")?\](.*?)\[/a\]",
@"<a$1$2>$3</a>",
RegexOptions.Singleline); // Singleline (not Multiline) is what lets . match line breaks in the text
If you are actually thinking of processing (pseudo)-HTML using regexes,
don't
SO is filled with posts where regexes are proposed for HTML/XML and answers pointing out why this is a bad idea.
Suppose your multiline text ("which can be anything") contains
[a b="foo" [a b="bar"]]
a regex cannot detect this.
See the classic answer in:
RegEx match open tags except XHTML self-contained tags
which has:
I think it's time for me to quit the
post of Assistant Don't Parse HTML
With Regex Officer. No matter how many
times we say it, they won't stop
coming every day... every hour even.
It is a lost cause, which someone else
can fight for a bit. So go on, parse
HTML with regex, if you must. It's
only broken code, not life and death.
– bobince
Seriously. Find an XML or HTML DOM and populate it with your data. Then serialize it. That will take care of all the problems you don't even know you have got.
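A minimal sketch of that idea using LINQ to XML, assuming the values c, e and the text have already been extracted somehow. The variable names and the sample text are made up; the point is that the serializer takes care of escaping.

```csharp
using System;
using System.Xml.Linq;

// Assumed to have been extracted already; d may be null because the attribute is optional.
string b = "c";
string d = "e";
string text = "1 < 2 & 3";

var element = new XElement("a",
    new XAttribute("b", b),
    d == null ? null : new XAttribute("d", d), // null content is simply skipped
    text);

string xml = element.ToString();
Console.WriteLine(xml);
```

Characters such as < and & in the text are escaped on serialization, which is exactly the kind of problem a hand-built string would silently get wrong.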
Would the multi-line text ever include [ or ]? If not, you can just replace [ with < and ] with > using string.Replace; there is no need for a regex.
Update:
If it can be anything but [/a], you can replace
^\[a([^\]]+)](.*?)\[/a]$
with
<a$1>$2</a>
I haven't escaped ] and / in the regex - escape them if necessary to get
^\[a([^\]]+)\](.*?)\[\/a\]$
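In C#, the replacement could look roughly like this. The input strings are made up, and RegexOptions.Singleline is my addition so that . also matches the line breaks in multi-line text.

```csharp
using System;
using System.Text.RegularExpressions;

string pattern = @"^\[a([^\]]+)\](.*?)\[/a\]$";

// Made-up inputs: one with both attributes and a line break, one without the optional d.
string withBoth = "[a b=\"c\" d=\"e\"]Some multi\nline text[/a]";
string withoutD = "[a b=\"c\"]Some text[/a]";

string first = Regex.Replace(withBoth, pattern, "<a$1>$2</a>", RegexOptions.Singleline);
string second = Regex.Replace(withoutD, pattern, "<a$1>$2</a>", RegexOptions.Singleline);

Console.WriteLine(first);
Console.WriteLine(second);
```

Because group 1 captures everything up to the first ], the optional d attribute needs no special handling here, but an attribute value that itself contains ] would break it.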