Regex C# optional group - should act greedy?

I have a regex roughly like this:
blablabla.+?(?:<a href="(http://.+?)" target="_blank">)?
I want to capture a URL if there is one. The regex matches, but I never get the link (the capture is always empty). Now if I remove the question mark at the end, like this:
blablabla.+?(?:<a href="(http://.+?)" target="_blank">)
This will only match stuff that has the link at the end... it's 2.40 am... and I've got no ideas...
--Edit--
sample input:
blablabla asd 1234t535 <a href="http://google.com" target="_blank">
expected output:
match 0:
group 1: <a href="http://google.com" target="_blank">
group 2: http://google.com
I just want "http://google.com" or ""

Are you doing a whole-string match? If so, try adding .* to the end of the first regex and see what it matches. The problem with the first regex is that it can match anything after blablabla because of the .+? (leading to an empty capture), but the parenthesized part still won't match an a tag unless it's at the end of the string. By the way, looking at your expected output, capture 1 will be the URL; the parentheses around the whole HTML tag are non-capturing because of the ?: at the beginning.

You shouldn't need .+? at the start; the regex is going to search the whole input anyway.
You also have the closing '>' right after "_blank", which will limit your matches:
(?:<a href="(http://.+?)" target="_blank".*?>)

It's the trailing ? that's doing you in. Reason: By marking it as optional, you're allowing the .+? to grab it.
blablabla.*(?:<a href="((http://)?.*)".+target="_blank".*>)
I modified it slightly... .+? matches much the same text as .* here (it is just lazy and needs at least one character), and if you may have nothing in your href (you indicated you wanted ""), you need to make the http optional as well as the trailing text. Also, .+ in front of target means you have at least one space or character, but may have more (multiple blanks or other attributes). .* before the > means you can have blanks or other attributes trailing after.
This will not match a line at all if there's no <a href...>, but that's what you want, right?
The (?: ... ) can be dropped completely, if you don't need to capture the whole <a href...> portion.
This will fail if the attributes are not listed in the order specified... which is one of the reasons regex can't really be used to parse html. But if you're certain the href will always come before the target, this should do what you need.
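To make the behaviour discussed in this thread concrete, here is a minimal C# sketch run against the sample input from the question. The first pattern is the asker's original. The second, which moves the lazy filler inside the optional group, is only one possible rewrite and is not taken from any of the answers above; all variable names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "blablabla asd 1234t535 <a href=\"http://google.com\" target=\"_blank\">";

// Original pattern: the lazy .+? is satisfied by a single character, and the
// optional group then matches nothing at that position, so group 1 stays empty.
string original = @"blablabla.+?(?:<a href=""(http://.+?)"" target=""_blank"">)?";

// Assumed alternative: put the filler inside the optional group. The group is
// attempted before it is skipped, so it has to reach the link to succeed.
string reworked = @"blablabla(?:.*?<a href=""(http://.+?)"" target=""_blank"">)?";

string urlOriginal = Regex.Match(input, original).Groups[1].Value;
string urlReworked = Regex.Match(input, reworked).Groups[1].Value;
string urlNoLink = Regex.Match("blablabla asd 1234t535", reworked).Groups[1].Value;

Console.WriteLine($"original: \"{urlOriginal}\"");
Console.WriteLine($"reworked: \"{urlReworked}\"");
Console.WriteLine($"no link:  \"{urlNoLink}\"");
```

With the reworked pattern, input without a link still matches and group 1 is simply empty, which is the "URL or empty string" behaviour the question asks for.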

Related

RegEx replace all slashes in capture group

I have a list of html data that is structured as shown below
<div>lots of other data...
Test1
</div>
<div>lots of other data...
<a href="http://localserver1/OpenFile?path=Test1/Subfolder/file2.pdf&OtherParam=2
</div>
<div>lots of other data...
<a href="http://localserver1/OpenFile?path=Test2%2FSubfolder%2Ffile3.pdf&OtherParam=3
</div>
As you can see in the second url, there is no encoding in the slashes. These links interface with a content management system (an admittedly bad one), and very frequently we get paths that are not encoded. I wanted to write a small block of code in C# that would check whether or not the blocks of html code here would have slashes and just replace them with the %2F encoding. I have been able to locate all instances of where the OpenFile link occurs like this:
OpenFile\?path=(.*)&
However I can't seem to find an easy way to look through the path's capture group and replace only slashes that are in there. How would I go about doing this?
Since your example uses "&" as the end of the pattern, I will assume it is consistent for all cases.
You can use this expression:
\/(?!.*OpenFile\?path=)(?=.*&)
https://regex101.com/r/hZ3Oja/1
This uses a negative lookahead on "OpenFile?path=" and a positive lookahead on "&" so that it only replaces slashes that are a part of your inner path.
Your C# call will look like Regex.Replace(input, pattern, replacement);
In C# you can use lookarounds to match the forward slash:
(?<=OpenFile\?path=[^\s&]*)/(?=[^\s&]*&)
Explanation
(?<=OpenFile\?path=[^\s&]*) Positive lookbehind, assert the openfile part to the left followed by optional non whitespace chars excluding &
/ Match the forward slash
(?=[^\s&]*&) Positive lookahead, assert an ampersand to the right
Regex demo
If there can also be a match without an ampersand at the right, you can omit the last positive lookahead in the pattern.
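As a rough illustration of the lookaround pattern above, the following C# sketch runs it over two of the sample links from the question. The variable names are made up for the example, and the replacement string "%2F" is the encoding the question asks for.

```csharp
using System;
using System.Text.RegularExpressions;

// Lookbehind/lookahead pattern from the answer above; .NET allows the
// variable-length lookbehind it relies on.
string pattern = @"(?<=OpenFile\?path=[^\s&]*)/(?=[^\s&]*&)";

string unencoded = "<a href=\"http://localserver1/OpenFile?path=Test1/Subfolder/file2.pdf&OtherParam=2";
string encoded = "<a href=\"http://localserver1/OpenFile?path=Test2%2FSubfolder%2Ffile3.pdf&OtherParam=3";

// Only slashes between "OpenFile?path=" and the next "&" are replaced;
// the slashes in "http://localserver1/" are left alone.
string fixedLink = Regex.Replace(unencoded, pattern, "%2F");
string untouched = Regex.Replace(encoded, pattern, "%2F");

Console.WriteLine(fixedLink);
Console.WriteLine(untouched);
```

Links whose path is already encoded pass through unchanged, so the replacement should be safe to run over the whole HTML string.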

Regex - find matches not contained within pattern

I would like to use a regular expression to match all occurrences of a phrase where it's not contained within some delimiting characters. I tried putting one together but had some difficulty with the negative lookaheads.
My search phrase is "my phrase". The start delimiter tag is [[ and the end delimiter tag is ]]. The string I'd like to search is:
Here is a sentence with my phrase, here's another part which I don't want to match on [[my phrase]]. I would like to find this occurrence of my phrase.
From this string I would expect to find all occurrences of "my phrase" except the one contained within [[ ]].
I hope that makes sense, thanks in advance for any guidance.
[^#]my phrase[^#]
I have knocked up a RegEx that will do what you ask, this can be seen here.
Literally just escaping out # as a character and allowing any other character to be returned. You can return the index of these results but remember to strip off the first and last character of the string.
Note: This will not pick up a "my phrase" at the very start or end of the string, because each character class has to match an actual character.
Edit - Seeing as you changed the scope while I was writing this answer,
here is the RegEx for the other delimiter:
[^[[]my phrase[^\]\]]
(?<=[^\[])my phrase(?=[^\]]*)
This will also eliminate the trailing punctuation marks.
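For comparison, here is a different technique from the ones posted above, sketched in C#: let the first alternative of the regex consume whole [[...]] spans, capture the phrase in the second alternative, and keep only the matches where the capture group took part. It assumes the delimiters are balanced and not nested; the names are illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string text = "Here is a sentence with my phrase, here's another part which I don't " +
              "want to match on [[my phrase]]. I would like to find this occurrence of my phrase.";

// First alternative swallows a complete [[...]] span; the second captures
// the phrase wherever it appears outside such a span.
string pattern = @"\[\[.*?\]\]|(my phrase)";

int[] positions = Regex.Matches(text, pattern)
    .Cast<Match>()
    .Where(m => m.Groups[1].Success)   // drop the matches that were delimited spans
    .Select(m => m.Index)
    .ToArray();

Console.WriteLine($"found {positions.Length} occurrence(s) outside [[ ]]");
```

Because the indexes come straight from the matches, there is no need to strip a leading or trailing character afterwards.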

RegEx - Text between two markers with one of them optional

Using RegEx, Is there a way to extract all the text between 2 marker where the 2nd marker is optional?
For example:
MARK1 allthetext I need t0 extr4ct i$ here unt.l I_will-find (MARK2 | MARK3 | ANYENDMARK)
or
MARK1 allthetext I need t0 extr4ct i$ here unt.l I_will-find nothing else
I tried to use
(?<=(MARK1 ))([[:ascii:]]*)(MARK2|MARK3|$)?
and
(?<=(MARK1 ))([[:ascii:]]*)(?=(MARK2|MARK3|$))?
without success.
PS: I need to evaluate the regex in C#. I'm using regex101.com as a test environment.
You can use
(?<=\bMARK1\b)(.*?)(?=(?:\bMARK2\b|\bMARK3\b|$))
See demo
Note I am using singleline mode so that . could match a newline as well.
The \b is a word boundary that enables matching whole words. Thus, \bMARK1\b will not match inside ANYMARK1.
If you have MARKn at the end, you may use a bit different look-ahead: (?<=\bMARK1\b)(.*?)(?=(?:\bMARK\d+\b|$)). See demo
Now, the regex explanation:
(?<=\bMARK1\b) - a look-behind that makes sure there is a whole word MARK1 right before...
(.*?) - any 0 or more (but as few as possible) characters (even including newline due to RegexOptions.Singleline flag used)
(?=(?:\bMARK2\b|\bMARK3\b|$)) - match the characters above only if they are followed with a whole word MARK2 or MARK3 or end of string.
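The pattern explained above can be tried out with a short C# sketch like the one below. The helper name Extract and the shortened sample strings are assumptions for the example; the Trim call is an addition, since the lookarounds leave the spaces next to the markers inside the capture.

```csharp
using System;
using System.Text.RegularExpressions;

string pattern = @"(?<=\bMARK1\b)(.*?)(?=(?:\bMARK2\b|\bMARK3\b|$))";

// Hypothetical helper: returns the text after MARK1 up to MARK2/MARK3 or the
// end of the input. Singleline lets . cross line breaks.
string Extract(string s) =>
    Regex.Match(s, pattern, RegexOptions.Singleline).Groups[1].Value.Trim();

string withMarker = Extract("MARK1 allthetext I need t0 extr4ct MARK2 trailing text");
string withoutMarker = Extract("MARK1 allthetext I need t0 extr4ct nothing else");

Console.WriteLine(withMarker);
Console.WriteLine(withoutMarker);
```

When no end marker is present, the $ alternative in the lookahead takes over and the capture runs to the end of the input.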
You're almost there. Let's start from your second expression:
(?<=(MARK1 ))([[:ascii:]]*)(?=(MARK2|MARK3|$))?
Remove the question mark at the end:
(?<=(MARK1 ))([[:ascii:]]*)(?=(MARK2|MARK3|$))
You don't need it: The string is ended by either MARK2, MARK3 or the end of the line. That's not optional.
Make the * of [[:ascii:]]* non-greedy by replacing it with *?:
(?<=(MARK1 ))([[:ascii:]]*?)(?=(MARK2|MARK3|$))
Otherwise, it will prefer the line end over MARK2 or MARK3, because it can do a longer match. *? will try to make the shortest match possible.
You also probably want to add a space in front of MARK2 and MARK3, to avoid matching words ending with MARK2/3.
(?<=(MARK1 ))([[:ascii:]]*?)(?=( MARK2| MARK3|$))
(?<=MARK\d+).*?(?=MARK\d+|$)
You can use this. See demo.

Trying to ascertain the Regex to remove newline from start and end of content in <PRE> tag, using Regex.Replace from .NET

I have a "pre" which is getting newlines added before the content and after the content ie:
<pre>
My Content
</pre>
The above seems to be equivalent to 2 newlines before and 1 after.
I would like to parse my HTML string for all "pre" tags and to remove these before and after newlines.
I would use ASP.NET code to do the replacing:
Regex.Replace(myHtmlString, @"Regex Pattern", String.Empty);
The result should be:
<pre>My Content</pre>
So what would the "Regex Pattern" look like please?
Thanks in advance.
EDIT
Answer so far:
strCleanXhtmlDoc = Regex.Replace(strCleanXhtmlDoc, @"<pre>[\r\n]*(.*?)[\r\n]*</pre>", "<pre>$1</pre>");
The replace bit is $1.
EDIT:
Struggling to get the Regex to work with:
<pre style="color: #a11f98;font-family: calibri;font-size: 14pt;font-style: normal;font-weight: normal;">
L1
L11
L111
</pre>
Which does need matching, to produce:
<pre style="color: #a11f98;font-family: calibri;font-size: 14pt;font-style: normal;font-weight: normal;">L1
L11
L111</pre>
The regex you need is this: (<pre[^>]*>)\s*([\w\W]*?)\s*(</pre>)
To break it down
(<pre[^>]*>) matches the start pre tag including any attributes. [^>]* this bit does most of the work and means all chars that aren't >
\s* then we match all the whitespace we can
([\w\W]*?) this grabs the content. [\w\W] means any character and is more inclusive than . (which skips newlines by default). The ? is present so that this doesn't also grab the whitespace that the next bit is meant to grab; it's a non-greedy modifier.
\s* match the whitespace at the end of the content before the end tag
(</pre>) match the end tag nothing special here
The replacement is $1$2$3 to grab the 3 parenthesized sections and put them back together without the whitespace.
Hope that makes some sense and helps you write your next one.
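Put together in C#, the pattern and replacement described above might be used as in this sketch. The input strings are stand-ins for the question's examples, and the shortened style attribute is made up for brevity.

```csharp
using System;
using System.Text.RegularExpressions;

string pattern = @"(<pre[^>]*>)\s*([\w\W]*?)\s*(</pre>)";

string plain = "<pre>\r\n\r\nMy Content\r\n</pre>";
// Hypothetical, shortened style attribute; the lines inside keep their own indentation.
string styled = "<pre style=\"color: #a11f98;\">\nL1\n L11\n  L111\n</pre>";

// $1 = opening tag, $2 = content without the surrounding whitespace, $3 = closing tag.
string plainClean = Regex.Replace(plain, pattern, "$1$2$3");
string styledClean = Regex.Replace(styled, pattern, "$1$2$3");

Console.WriteLine(plainClean);
Console.WriteLine(styledClean);
```

Only the whitespace directly after the opening tag and directly before the closing tag is removed; line breaks and indentation inside the content are kept.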

Why does changing this regex class to .+ not provide any match?

If I use this
string showPattern = @"return new_lightox\(this\);"">[a-zA-Z0-9(\s),!\?\-:'&%]+</a>";
MatchCollection showMatches = Regex.Matches(pageSource, showPattern);
I get some matches, but I want to get rid of [a-zA-Z0-9(\s),!\?\-:'&%]+ and use any char (.+),
but if I do this I get no match at all.
What am I doing wrong?
By default "." does not match newlines, but the class \s does.
To let . match newline, turn on Singleline (DOTALL) mode - either using a flag in the function call (as Abel's answer shows), or using the inline modifier (?s), like this for the whole expression:
@"(?s)return new_lightox\(this\);"">.+</a>"
Or for just the specific part of it:
@"return new_lightox\(this\);"">(?s:.+)</a>"
It might be better to take that a step further and check the lookahead before every character:
@"return new_lightox\(this\);"">(?s:(?:(?!</?a).)+)</a>"
Which should prevent the closing </a> from belonging to a different link.
However, you need to be very wary here - it's not clear what you're doing overall, but regex is not a good tool for parsing HTML with, and can cause all sorts of problems. Look at using a HTML DOM parser instead, such as HtmlAgilityPack.
You're matching a tag, so you probably want something along these lines, instead of .+:
string showPattern = @"return new_lightox\(this\);"">[^<]+</a>";
The reason that the match doesn't hit is possibly that you are missing the Singleline flag and the closing tag is on the next line. In other words, this should work too:
// The Singleline option changes the dot (.) to match newlines too
MatchCollection showMatches = Regex.Matches(
    pageSource,
    showPattern,
    RegexOptions.Singleline);
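The difference can be seen in a small self-contained C# sketch. The page fragment below is invented for the example (the real pageSource is not shown in the question) and simply puts a line break inside the link text.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical page fragment with a newline inside the link text.
string pageSource = "<a href=\"#\" onclick=\"return new_lightox(this);\">Show\nTitle</a>";

string dotPattern = @"return new_lightox\(this\);"">.+</a>";
string classPattern = @"return new_lightox\(this\);"">[^<]+</a>";

// By default . stops at the newline, so </a> is never reached.
bool dotDefault = Regex.IsMatch(pageSource, dotPattern);
// With Singleline, . also matches the newline.
bool dotSingleline = Regex.IsMatch(pageSource, dotPattern, RegexOptions.Singleline);
// A negated class matches newlines without needing any option.
bool negatedClass = Regex.IsMatch(pageSource, classPattern);

Console.WriteLine($"{dotDefault} {dotSingleline} {negatedClass}");
```

This also suggests why the original character class worked: it contains \s, which covers line breaks.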
