Regular expression: a captured group with 1 to 5 words - C#

I have a sentence like 'This is [[a captured group]].' The number of words between the brackets can be 1 to 5.
I want to pick out everything between the two brackets (including the brackets). I know I could use something like @"^.*(?<identifier>\[\[\.*\]\]).*$" but I want to try to be more precise so I thought this would work: @"^.*(?<identifier>\[\[\w*(\b\w*){0,4}\]\]).*$"
Can anyone see why this doesn't work? It captures if there's one word in between the brackets but not multiple. I thought the (\b\w*){0,4} would allow for 0 to 4 more words.
Thanks, Bill N

I think you forgot about word delimiters (\s):
^.*(?<identifier>\[\[\w+(\s+\b\w+){0,4}\]\]).*$

Your problem is here:
(\b\w*){0,4}
This would not work since you have not allowed for spaces. Change it to:
(\s+\b\w*){0,4}
This will capture spaces but you can easily post-process (using Trim()).

You create more than one capture group, one per pair of parentheses. Try this:
@"^.*(?<identifier>\[\[\w*(?:\s\w*){0,4}\]\]).*$"
(?:) is a non-capturing group: it does not create a separate capture, so your result is still in the named group.
Update: And of course, as the two other answers pointed out, your main problem is the missing \s. I added this to my solution as well.
Update2: The \b is not needed once the \s is added, so I removed it.

My preference would be something like this (untested):
^[^\[]*(?<identifier>\[\[\s*(\w+(?:\s+|(?=\]))){1,5}\]\])[\S\s]*$
^ # begin of string
[^\[]* # some optional not '[' chars
(?<identifier> # <ID> begin
\[\[ # '[['
\s* # some optional whitespace
(?:\w+ (?:\s+|(?=\])) ){1,5} # 1-5 words separated by spaces
\]\] # ']]'
) # end <ID>
[\S\s]* # some optional any chars
$ # end of string
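The fix the answers above agree on (allow whitespace between the words) can be checked with a short script. This is only a sketch in Python rather than C#: Python spells named groups (?P<name>...), and the helper name find_bracketed is made up for illustration. The pattern itself is the \w+(\s+\w+){0,4} idea from the answers, with the inner group made non-capturing.

```python
import re

# 1 word, then 0 to 4 more words, each preceded by whitespace.
# Python writes named groups as (?P<name>...) instead of .NET's (?<name>...).
BRACKETED = re.compile(r"(?P<identifier>\[\[\w+(?:\s+\w+){0,4}\]\])")


def find_bracketed(sentence):
    """Return the [[...]] span holding 1 to 5 words, or None (hypothetical helper)."""
    match = BRACKETED.search(sentence)
    return match.group("identifier") if match else None


print(find_bracketed("This is [[a captured group]]."))
```

With six words between the brackets the search finds nothing, because the closing ]] is never reached within the {0,4} limit.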

Related

Regex stop Quantifer on True possible?

I want to stop the quantifier as soon as the condition is true. Does anyone know how?
XXXXXX, 20. September 2017 XXX XXXXXXXXX XX
MwSt. Nummer: CHE-XXX.XXX.XXX p.A. XXXXX XXXXXX XXXXX
Rechnungs Nr.321 XX XXXXX 32
XXXXXX, (?<Date>\d{2}.\s{1,}[A-z]{1,}\s{1,}\d{4})\s{1,}(?<CompanyName>.*)\n(?(?=Rechnungs Nr\.)Rechnungs Nr\.(?<BillNumber>\d{1,})|.*\n){1,}
My target is that:
XXXXXX, (?<Date>\d{2}.\s{1,}[A-z]{1,}\s{1,}\d{4})\s{1,}(?<CompanyName>.*)\n(?(?=Rechnungs Nr\.)Rechnungs Nr\.(?<BillNumber>\d{1,})|.*\n){2}
As you can see, this is not dynamic, and that is the problem. I want the group to repeat as many times as needed; in some cases {2} isn't enough, so I picked {1,}. The problem is that the following text is then matched too, which is bad for me, because after that loop I want to run more loops for other text sequences. I only want to match the digits (in this example 321) and then stop the conditional.
Thank you in advance.
You can get Output here: Regular Expression
As per my comment (see the demo on regex101.com):
XXXXXX,\s*
(?<Date>\d{2}.\s+[A-Za-z]+\s+\d{4})\s+
(?<CompanyName>.*)(?s:.*?)
Rechnungs\ Nr\.(?<BillNumber>\d+)
Broken down this says:
XXXXXX,\s* # XXXXXX, followed by spaces
(?<Date>\d{2}.\s+[A-Za-z]+\s+\d{4})\s+ # your original expression
# followed by at least one space
(?<CompanyName>.*) # rest of the line goes into
# group CompanyName
(?s:.*?) # DOTALL, lazily
Rechnungs\ Nr\.(?<BillNumber>\d+) # Rechnungs Nr.
# followed by digits
Leaving aside some potential optimizations, the main idea was to use
(?s:.*?)
which turns on DOTALL mode for a group, meaning that inside that group the dot matches every character (including newline characters). With the lazy quantifier (.*?) it expands as needed, even across multiple lines.
As an alternative, you could use [\s\S]*?, which combines whitespace and non-whitespace characters and therefore matches any character.
Side note: \s{1,} is the same as \s+, \d{1,} is the same as \d+, and [A-z] includes more characters than [A-Za-z] (the punctuation between Z and a in ASCII: [ \ ] ^ _ `).
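To see the lazy DOTALL group at work, here is a rough Python port run against the sample text from the question. Some things here are assumptions rather than part of the answer: Python needs (?P<name>...) for named groups, the scoped (?s:...) flag requires Python 3.6 or later, the dot after the day number is escaped, and parse_invoice and SAMPLE are illustrative names.

```python
import re

INVOICE = re.compile(
    r"XXXXXX,\s*"
    r"(?P<Date>\d{2}\.\s+[A-Za-z]+\s+\d{4})\s+"  # e.g. "20. September 2017"
    r"(?P<CompanyName>.*)"                        # rest of the first line
    r"(?s:.*?)"                                   # lazily skip lines, newlines included
    r"Rechnungs Nr\.(?P<BillNumber>\d+)"          # stop at the first bill number
)

SAMPLE = (
    "XXXXXX, 20. September 2017 XXX XXXXXXXXX XX\n"
    "MwSt. Nummer: CHE-XXX.XXX.XXX p.A. XXXXX XXXXXX XXXXX\n"
    "Rechnungs Nr.321 XX XXXXX 32\n"
)


def parse_invoice(text):
    """Return the named groups as a dict, or None if nothing matches."""
    match = INVOICE.search(text)
    return match.groupdict() if match else None


print(parse_invoice(SAMPLE))
```

Because the skipping group is lazy, the match ends right after the digits of the bill number, so the rest of the text stays available for further patterns.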
I found fast and good way:
XXXXXX, (?<Date>\d{2}.\s+[A-z]+\s+\d{4})\s{1,}(?<CompanyName>.*)\n(?(?!Rechnungs Nr\.).*\n)Rechnungs Nr\.(?<BillNumber>\d+)
You can get Output here: Regular Expression

Regex pattern in C# with empty space

I am having issue with a reg ex expression and can't find the answer to my question.
I am trying to build a reg ex pattern that will pull in any matches that have # around them. for example #match# or #mt# would both come back.
This works fine for that. #.*?#
However I don't want matches on ## to show up. Basically if there is nothing between the pound signs don't match.
Hope this makes sense.
Thanks.
Please use + to match 1 or more symbols:
#+.+#+
UPDATE:
If you want to only match substrings that are enclosed with single hash symbols, use:
(?<!#)#(?!#)[^#]+#(?!#)
See regex demo
Explanation:
(?<!#)#(?!#) - a # symbol that is not preceded with a # (due to the negative lookbehind (?<!#)) and not followed by a # (due to the negative lookahead (?!#))
[^#]+ - one or more symbols other than # (due to the negated character class [^#])
#(?!#) - a # symbol not followed with another # symbol.
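The lookaround pattern explained above behaves the same way in Python's re module, so it can be tried quickly there. This is a small sketch; the function name hash_tokens is invented for the example.

```python
import re

# A lone '#', one or more non-'#' characters, then a '#' not followed by '#'.
SINGLE_HASH = re.compile(r"(?<!#)#(?!#)[^#]+#(?!#)")


def hash_tokens(text):
    """Return every #...# token with non-empty content (hypothetical helper)."""
    return SINGLE_HASH.findall(text)


print(hash_tokens("a #match# b ## c #mt#"))
```

The empty pair ## is skipped because its first # is followed by a # and its second # is preceded by one.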
Instead of using * to match between zero and unlimited characters, replace it with +, which will only match if there is at least one character between the #'s. The edited regex should look like this: #.+?#. Hope this helps!
Edit
Sorry for the incorrect regex, I had not expected multiple hash signs. This should work for your sentence: #+.+?#+
Edit 2
I am pretty sure I got it. Try this: (?<!#)#[^#].*?#. It might not work as expected with triple hashes though.
Try:
[^#]?#.+#[^#]?
The [^character_group] construction matches any single character not included in the character group. Using the ? after it will let you match at the beginning/end of a string (since it matches the preceding character zero or one time). Check out the documentation here

Regex to clean repetitions of characters

I have a pattern in the string like this:
T T and I want to T
And It can be any character from [a-z].
I have tried this Regex Example but not able to replace it.
EDIT
For example, if I have A Aa ar r, it should become Aar. In other words, remove either the 1st or the 2nd copy of any repeated character, no matter which character it is.
You can use the backreferences for this.
/([a-z])\s*\1\s?/gi
Example
Some more explanation:
( begin matching group 1
[a-z] match any character from a to z
) end matching group 1
\s* match any amount of space characters
\1 match the result of matching group 1
exactly as it was again
this allows for the repetition
\s? match none or one space character
this will allow to remove multiple
spaces when replacing
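The answer shows only the pattern, not the replacement. A minimal sketch in Python, assuming the replacement is simply the captured character (\1), might look like this; the name collapse is made up for the example, and re.IGNORECASE stands in for the i flag.

```python
import re

# A letter, optional spaces, the same letter again, and at most one trailing space.
DOUBLED = re.compile(r"([a-z])\s*\1\s?", re.IGNORECASE)


def collapse(text):
    """Replace each doubled letter (and the spaces around it) with one copy."""
    return DOUBLED.sub(r"\1", text)


print(collapse("T T"))
print(collapse("A Aa ar r"))
```

Note that the optional trailing \s? also swallows the space after a doubled letter, which is what joins "A Aa ar r" into one word.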

.NET regex matching

Broadly: how do I match a word with regex rules for a)the beginning, b)the whole word, and c)the end?
More specifically: How do I match an expression of length >= 1 that has the following rules:
It cannot have any of: ! @ #
It cannot begin with a space or =
It cannot end with a space
I tried:
^[^\s=][^!@#]*[^\s]$
But the ^[^\s=] matching moves past the first character in the word. Hence this also matches words that begin with '!' or '@' or '#' (eg: '#ab' or '#aa'). This also forces the word to have at least 2 characters (one beginning character that is not space or = -and- one non-space character in the end).
I got to:
^[^\s=(!@#)]\1*$
for a regex matching the first two rules. But how do I match no trailing spaces in the word with allowing words of length 1?
Cameron's solution is both accurate and efficient (and should be used for any production code where speed needs to be optimized). The answer presented here is less efficient, but demonstrates a general approach for applying logic using regular expressions.
You can use multiple positive and negative lookahead regex assertions (all applied at one location in the target string - typically the beginning), to apply multiple logical constraints for a match. The commented regex below demonstrates how easy this is to do for this example case. You do need to understand how the regex engine actually matches (and doesn't match), to come up with the correct expressions, but it's not hard once you get the hang of it.
foundMatch = Regex.IsMatch(subjectString, @"
# Match 'word' meeting multiple logical constraints.
^ # Anchor to start of string.
(?=[^!@#]*$) # It cannot have any of: ! @ #, AND
(?![ =]) # It cannot begin with a space or =, AND
(?!.*\s$) # It cannot end with a space, AND
.{1,} # length >= 1 (ok to match special 'word')
\z # Anchor to end of string.
",
RegexOptions.IgnorePatternWhitespace);
This application of "regex-logic" is frequently used for complex password validation.
Your first attempt was very close. You only need to exclude more characters for the first and last parts, and make the last two parts optional:
^[^\s=!@#](?:[^!@#]*[^\s!@#])?$
This ensures that all three sections will not include any of !@#. Then, if the word is more than one character long, it will need to end with a not-space, with only select characters filling the space in-between. This is all enforced properly because of the ^ and $ anchors.
I'm not quite sure what your second example matched, since the () should be taken as literal characters when embedded within a character class, not as a capturing group.
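The three rules can be spot-checked with a few inputs. The sketch below is Python, not .NET, and assumes the forbidden characters are !, @ and #. It uses re.fullmatch in place of the ^ and $ anchors, since Python's $ would also accept a trailing newline; is_valid_word is an illustrative name.

```python
import re

# First character: not whitespace, '=', '!', '@' or '#'.
# Optional tail: any allowed characters, ending in a non-whitespace one.
WORD = re.compile(r"[^\s=!@#](?:[^!@#]*[^\s!@#])?")


def is_valid_word(word):
    """True if the whole string satisfies all three rules."""
    return WORD.fullmatch(word) is not None


for candidate in ["a", "ab c", " a", "=a", "a ", "a#b", "@ab", ""]:
    print(repr(candidate), is_valid_word(candidate))
```

Making the tail optional is what lets a single-character word pass while still rejecting a trailing space.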

Can Regex be used for this particular string manipulation?

I need to replace character (say) x with character (say) P in a string, but only if it is contained in a quoted substring.
An example makes it clearer:
axbx'cxdxe'fxgh'ixj'k -> axbx'cPdPe'fxgh'iPj'k
Let's assume, for the sake of simplicity, that quotes always come in pairs.
The obvious way is to just process the string one character at a time (a simple state machine approach);
however, I'm wondering if regular expressions can be used to do all the processing in one go.
My target language is C#, but I guess my question pertains to any language having builtin or library support for regular expressions.
I converted Greg Hewgill's python code to C# and it worked!
[Test]
public void ReplaceTextInQuotes()
{
Assert.AreEqual("axbx'cPdPe'fxgh'iPj'k",
Regex.Replace("axbx'cxdxe'fxgh'ixj'k",
@"x(?=[^']*'([^']|'[^']*')*$)", "P"));
}
That test passed.
I was able to do this with Python:
>>> import re
>>> re.sub(r"x(?=[^']*'([^']|'[^']*')*$)", "P", "axbx'cxdxe'fxgh'ixj'k")
"axbx'cPdPe'fxgh'iPj'k"
What this does is use the lookahead assertion (?=...) to check that the character x is within a quoted string, without consuming any of the text after it. It looks for some nonquote characters up to the next quote, then looks for a sequence of either single characters or quoted groups of characters, until the end of the string.
This relies on your assumption that the quotes are always balanced. This is also not very efficient.
A more general (and simpler) solution which allows non-paired quotes.
Find quoted string
Replace 'x' by 'P' in the string
#!/usr/bin/env python
import re
text = "axbx'cxdxe'fxgh'ixj'k"
s = re.sub("'.*?'", lambda m: re.sub("x", "P", m.group(0)), text)
print(s == "axbx'cPdPe'fxgh'iPj'k", s)
# -> True axbx'cPdPe'fxgh'iPj'k
The trick is to use a lookahead group to match the part of the string following the match (character x) we are searching for.
Trying to match the string up to x will only find either the first or the last occurrence, depending on whether non-greedy quantifiers are used.
Here's Greg's idea transposed to Tcl, with comments.
set strIn {axbx'cxdxe'fxgh'ixj'k}
set regex {(?x) # enable expanded syntax
# - allows comments, ignores whitespace
x # the actual match
(?= # non-matching group
[^']*' # match to end of current quoted substring
##
## assuming quotes are in pairs,
## make sure we actually were
## inside a quoted substring
## by making sure the rest of the string
## is what we expect it to be
##
(
[^']* # match any non-quoted substring
| # ...or...
'[^']*' # any quoted substring, including the quotes
)* # any number of times
$ # until we run out of string :)
) # end of non-matching group
}
#the same regular expression without the comments
set regexCondensed {(?x)x(?=[^']*'([^']|'[^']*')*$)}
set replRegex {P}
set nMatches [regsub -all -- $regex $strIn $replRegex strOut]
puts "$nMatches replacements. "
if {$nMatches > 0} {
puts "Original: |$strIn|"
puts "Result: |$strOut|"
}
exit
This prints:
3 replacements.
Original: |axbx'cxdxe'fxgh'ixj'k|
Result: |axbx'cPdPe'fxgh'iPj'k|
#!/usr/bin/perl -w
use strict;
# Break up the string.
# The splitting uses quotes
# as the delimiter.
# Put every broken substring
# into the @fields array.
my @fields;
while (<>) {
@fields = split /'/, $_, -1;
}
# For every substring indexed with an odd
# number, search for x and replace it
# with P.
my $count;
my $end = $#fields;
for ($count=0; $count < $end; $count++) {
if ($count % 2 == 1) {
$fields[$count] =~ s/x/P/g;
}
}
# Put the pieces back together and print the result.
print join("'", @fields);
Wouldn't this chunk do the job?
Not with plain regexp. Regular expressions have no "memory" so they cannot distinguish between being "inside" or "outside" quotes.
You need something more powerful, for example using gema it would be straightforward:
'<repl>'=$0
repl:x=P
Similar discussion about balanced text replaces: Can regular expressions be used to match nested patterns?
You can try this in Vim, but it works well only if the string is on one line and there's only one pair of 's.
:%s:\('[^']*\)x\([^']*'\):\1P\2:gci
If there's one more pair or even an unbalanced ', then it could fail. That's why I included the c (a.k.a. confirm) flag on the ex command.
The same can be done with sed, without the interaction, or with awk so you can add some interaction.
One possible solution is to break the lines on pairs of 's; then you can apply the Vim solution.
Pattern: (?s)\G((?:^[^']*'|(?<=.))(?:'[^']*'|[^'x]+)*+)x
Replacement: \1P
\G — Anchor each match at the end of the previous one, or the start of the string.
(?:^[^']*'|(?<=.)) — If it is at the beginning of the string, match up to the first quote.
(?:'[^']*'|[^'x]+)*+ — Match any block of unquoted characters, or any (non-quote) characters up to an 'x'.
One sweep through the source string, except for a single character look-behind.
Sorry to dash your hopes, but you need a pushdown automaton to do that. There is more info here:
Pushdown Automaton
In short, regular expressions, which are finite state machines, can only read and have no memory, while a pushdown automaton has a stack and the ability to manipulate it.
Edit: spelling...
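For comparison with the regex answers, the character-at-a-time state machine that the question calls the obvious way could be sketched like this in Python. The function name and its parameters are illustrative, not taken from any answer. A single inside/outside flag is all the state it keeps.

```python
def replace_in_quotes(text, old="x", new="P", quote="'"):
    """Replace `old` with `new` only between pairs of `quote` characters."""
    inside = False  # the only state: are we currently inside a quoted run?
    out = []
    for ch in text:
        if ch == quote:
            inside = not inside  # every quote toggles the state
            out.append(ch)
        elif inside and ch == old:
            out.append(new)
        else:
            out.append(ch)
    return "".join(out)


print(replace_in_quotes("axbx'cxdxe'fxgh'ixj'k"))
```

Like the lookahead regex, this relies on the quotes not being nested; an unbalanced trailing quote simply leaves the rest of the string treated as quoted.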
