Problem with RegEx OR operator in C#

I want to match a pattern [0-9][0-9]KK[a-z][a-z] which is not preceded by either of these words
http://
example
I have a RegEx that takes care of the first criterion, but not the second.
Without OR operator
var body = Regex.Replace(body, "(?<!http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?)([0-9][0-9]KK[a-z][a-z])(?!</a>)","replaced");
With OR operator
var body = Regex.Replace(body, "(?example)|(?<!http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?)([0-9][0-9]KK[a-z][a-z])(?!</a>)","replaced");
The second one with OR operator throws an exception. How can I fix this?
It should not match either of these:
example99KKas
http://stack.com/99KKas

Here is one way to do it. Start at the beginning of the string and check that each character is not the start of 'http://' or 'example'. Do this lazily, and one character at a time so that we can spot the magic word once we reach it. Also, capture everything up to the magic word so that we can put it back in the replacement string. Here it is in commented free-spacing mode so that it can be comprehended by mere mortals:
var body = Regex.Replace(body,
@"# Match special word not preceded by 'http://' or 'example'
^ # Anchor to beginning of string
(?i) # Set case-insensitive mode.
( # $1: Capture everything up to special word.
(?: # Non-capture group for applying * quantifier.
(?!http://) # Assert this char is not start of 'http://'
(?!example) # Assert this char is not start of 'example'
. # Safe to match this one acceptable char.
)*? # Lazily match zero or more preceding chars.
) # End $1: Everything up to special word.
(?-i) # Set back to case-sensitive mode.
([0-9][0-9]KK[a-z][a-z]) # $2: Match our special word.
(?!</a>) # Assert not end of Anchor tag contents.
",
"$1replaced",
RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
Note that this is case sensitive for the magic word but not for http:// and example. Note also that this is untested (I don't know C# - just its regex engine). The "var" in "var body = ..." looks kinda suspicious to me. ??

I wasn't able to get the second example working, it gave an ArgumentException of "Unrecognized grouping construct".
But I replaced the url matching and moved the first alternative group a bit and came up with this:
body = Regex.Replace(body, "(?<!http\\://[a-zA-Z0-9\\-\\.]+\\.[a-zA-Z]{2,3}(/\\S*)?|example)([0-9][0-9]KK[a-z][a-z])(?!</a>)","replaced");
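For what it's worth, the exception comes from (?example): after (? the engine expects a recognized group type such as ?: or ?<!, and "e" is not one. Below is a minimal sketch of the same idea with a simpler URL test. It relies on .NET allowing variable-length lookbehind, and it assumes "inside a URL" means "no whitespace since http://". The names MagicWordReplacer and ReplaceMagicWord are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class MagicWordReplacer
{
    // One negative lookbehind with two alternatives:
    //   http://\S*  the word sits inside a URL (assumption: no whitespace since "http://")
    //   example     the word directly follows "example"
    private static readonly Regex MagicWord = new Regex(
        @"(?<!http://\S*|example)[0-9][0-9]KK[a-z][a-z](?!</a>)");

    public static string ReplaceMagicWord(string body)
    {
        return MagicWord.Replace(body, "replaced");
    }
}
```

Called as body = MagicWordReplacer.ReplaceMagicWord(body); this should leave example99KKas and http://stack.com/99KKas alone.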

You could use something like this:
body = Regex.Replace(body, @"(?<!\S)(?!(?i:http://|example))\S*\d\dKK[a-z]{2}\b", "replaced");

Related

C# Regex - How to remove multiple paired parentheses from string

I am trying to figure out how to use C# regular expressions to remove all instances of paired parentheses from a string. The parentheses and all text between them should be removed. The parentheses aren't always on the same line. Also, there might be nested parentheses. An example of the string would be
This is a (string). I would like all of the (parentheses
to be removed). This (is) a string. Nested ((parentheses) should) also
be removed. (Thanks) for your help.
The desired output should be as follows:
This is a . I would like all of the . This a string. Nested also
be removed. for your help.

Fortunately, .NET supports balancing groups, which let a regex keep track of nesting depth (see Balancing Group Definitions):
Regex regexObj = new Regex(
@"\( # Match an opening parenthesis.
(?> # Then either match (possessively):
[^()]+ # any characters except parentheses
| # or
\( (?<Depth>) # an opening paren (and increase the parens counter)
| # or
\) (?<-Depth>) # a closing paren (and decrease the parens counter).
)* # Repeat as needed.
(?(Depth)(?!)) # Assert that the parens counter is at zero.
\) # Then match a closing parenthesis.",
RegexOptions.IgnorePatternWhitespace);
In case anyone is wondering: The "parens counter" may never go below zero ((?<-Depth>) will fail otherwise), so even if the parentheses are "balanced" but aren't correctly matched (like ()))((()), this regex will not be fooled.
For more information, read Jeffrey Friedl's excellent book "Mastering Regular Expressions" (p. 436)
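A usage sketch, in case it helps: the same pattern dropped into a Replace call. The wrapper names (BalancedParens, Strip) are my own and not part of any library.

```csharp
using System.Text.RegularExpressions;

public static class BalancedParens
{
    private static readonly Regex Balanced = new Regex(
        @"\(                  # Match an opening parenthesis.
          (?>                 # Then match atomically, any number of times:
              [^()]+          #   a run of non-parenthesis characters,
            |
              \( (?<Depth>)   #   an opening paren, pushed onto the Depth stack,
            |
              \) (?<-Depth>)  #   or a closing paren, popped from it. Fails if the stack is empty.
          )*
          (?(Depth)(?!))      # Fail if any opening paren is still unclosed.
          \)                  # Finally match the closing parenthesis.",
        RegexOptions.IgnorePatternWhitespace);

    // Removes every balanced (...) group, including nested and multi-line ones.
    public static string Strip(string input)
    {
        return Balanced.Replace(input, string.Empty);
    }
}
```

Note that the surrounding spaces are left in place, so "a (b) c" becomes "a  c" with two spaces.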

You can repetitively replace /\([^\)\(]*\)/g with the empty string till no more matches are found, though.
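In C# that loop could look roughly like this. It is only a sketch: ParenStripper and RemoveParenthesized are hypothetical names, and each pass removes the innermost pairs until nothing changes.

```csharp
using System.Text.RegularExpressions;

public static class ParenStripper
{
    // Matches a pair of parentheses that contains no other parentheses.
    private static readonly Regex Innermost = new Regex(@"\([^()]*\)");

    public static string RemoveParenthesized(string input)
    {
        string previous;
        do
        {
            previous = input;
            // Strip the innermost pairs; outer pairs become innermost on the next pass.
            input = Innermost.Replace(input, string.Empty);
        } while (input != previous);
        return input;
    }
}
```

Because [^()] also matches line breaks, pairs that span several lines are handled without any extra options.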

How about this: Regex Replace seems to do the trick.
string Remove(string s, char begin, char end)
{
Regex regex = new Regex(string.Format("\\{0}.*?\\{1}", begin, end));
return regex.Replace(s, string.Empty);
}
string s = "Hello (my name) is (brian)";
s = Remove(s, '(', ')');
Output would be (the spaces around the removed text stay):
"Hello  is "

Normally, it is not an option. However, Microsoft does have some extensions to standard regular expressions. You may be able to achieve this with Grouping Constructs even if it is faster to code as an algorithm than to read and understand Microsoft's explanation of their extension.

Seeking some C# RegEx help

I am trying to create a RegEx expression that will successfully parse the following line:
"57" "testing123" 82 16 # 13 26 blah blah
What I want to be able to do is identify the numbers in the line. Currently, what I'm using is this:
[0-9]+
which parses fine. However, where it gets tricky is if the number is in quotes, like "57" is or like "testing123" is, I do not want it to match.
In addition to that, I do not want to match anything at all after the hash sign (the '#').
So in this example, the matches I should be getting are "82" and "16". Nothing else should match.
Any help on this would be appreciated.

It should be easier for you to build 3 different regexes, and then create the logic that combines them:
Check, whether the string has #, and ignore everything after it.
Check, for all the matches of "\d+", and ignore all of them
Check everything that's left, whether it matches [0-9]+
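The three steps above can be sketched as follows. This is one possible reading, not a definitive implementation: step 2 is widened to drop any quoted string (so "testing123" goes too), a '#' inside quotes is assumed not to occur, and the names NumberExtractor and ExtractNumbers are invented for the example.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NumberExtractor
{
    public static List<string> ExtractNumbers(string line)
    {
        // Step 1: ignore everything from the first '#' onwards.
        int hash = line.IndexOf('#');
        if (hash >= 0)
        {
            line = line.Substring(0, hash);
        }

        // Step 2: blank out every quoted string.
        line = Regex.Replace(line, "\"[^\"]*\"", " ");

        // Step 3: whatever digits are left are the numbers we want.
        var numbers = new List<string>();
        foreach (Match m in Regex.Matches(line, "[0-9]+"))
        {
            numbers.Add(m.Value);
        }
        return numbers;
    }
}
```

For the sample line in the question this should yield 82 and 16.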

.NET regular expressions can parse this string rather easily. The following pattern should match everything until the comment:
\A # Start of the string
(?>
(?<Quoted> # A quoted string
"" # Open quotes
[^""\\]* # non quotes or backslashes
(?:\\.[^""\\]*)* # but allow escaped characters
"" # Close quotes
)
|
(?<Number> # A number
\d+ # some digits
)
|
\s+ # Whitespace separator
)*
If you also want to match the comment, add:
(?<Comment>
\# .*
)?
\z
You can get your numbers in a single Match, using all captures of the "Number" group:
Match parsed = Regex.Match(s, pattern, RegexOptions.IgnorePatternWhitespace);
CaptureCollection numbers = parsed.Groups["Number"].Captures;
Missing from this pattern is mainly unquoted string tokens, such as 4 8 this 15that, which can add some complexity, depending on how we'd want it to work.
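Put together, the pieces might look like the sketch below. It assumes the input is a single line with no trailing line break (otherwise .* stops before \z can match), and QuotedLineParser / GetNumbers are illustrative names only.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuotedLineParser
{
    private const string Pattern = @"
        \A                        # Start of the string
        (?>
            (?<Quoted>            # A quoted string
                ""                # Open quotes
                [^""\\]*          # non quotes or backslashes
                (?:\\.[^""\\]*)*  # but allow escaped characters
                ""                # Close quotes
            )
          |
            (?<Number> \d+ )      # A number
          |
            \s+                   # Whitespace separator
        )*
        (?<Comment> \# .* )?      # Optional trailing comment
        \z                        # End of the string";

    // Returns every capture of the Number group, or an empty list if the line does not parse.
    public static List<string> GetNumbers(string line)
    {
        var numbers = new List<string>();
        Match parsed = Regex.Match(line, Pattern, RegexOptions.IgnorePatternWhitespace);
        if (!parsed.Success)
        {
            return numbers;
        }
        foreach (Capture c in parsed.Groups["Number"].Captures)
        {
            numbers.Add(c.Value);
        }
        return numbers;
    }
}
```

A line containing unquoted words makes the whole match fail, so the method returns an empty list in that case rather than partial results.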

Regex that matches a newline (\n) in C#

OK, this one is driving me nuts....
I have a string that is formed thus:
var newContent = string.Format("({0})\n{1}", stripped_content, reply);
newContent will display like:
(old text)
new text
I need a regular expression that strips away the text between parentheses with the parenthesis included AND the newline character.
The best I can come up with is:
const string regex = @"^(\(.*\)\s)?(?<capture>.*)";
var match= Regex.Match(original_content, regex);
var stripped_content = match.Groups["capture"].Value;
This works, but I want specifically to match the newline (\n), not any whitespace (\s)
Replacing \s with \n \\n or \\\n does NOT work.
Please help me hold on to my sanity!
EDIT: an example:
public string Reply(string old,string neww)
{
const string regex = @"^(\(.*\)\s)?(?<capture>.*)";
var match= Regex.Match(old, regex);
var stripped_content = match.Groups["capture"].Value;
var result= string.Format("({0})\n{1}", stripped_content, neww);
return result;
}
Reply("(messageOne)\nmessageTwo","messageThree") returns :
(messageTwo)
messageThree

If you specify RegexOptions.Multiline then you can use ^ and $ to match the start and end of a line, respectively.
If you don't wish to use this option, remember that a new line may be any one of the following: \n, \r, \r\n, so instead of looking only for \n, you should perhaps use something like: [\n\r]+, or more exactly: (\r\n|\n|\r). The \r\n alternative has to come first, otherwise \r alone matches and the \n is left behind.

Actually it works but with opposite option i.e.
RegexOptions.Singleline

You are probably going to have a \r before your \n. Try replacing the \s with (\r\n).
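A small sketch of that suggestion, with the \r made optional so that both \n and \r\n are accepted. ReplyStripper and StripQuoted are hypothetical names, and the pattern is otherwise the one from the question.

```csharp
using System.Text.RegularExpressions;

public static class ReplyStripper
{
    // \r?\n accepts both Unix (\n) and Windows (\r\n) line breaks.
    // The leading "(...)" line is optional, so plain text passes through unchanged.
    private static readonly Regex Prefix = new Regex(
        @"^(?:\(.*\)\r?\n)?(?<capture>.*)");

    public static string StripQuoted(string content)
    {
        return Prefix.Match(content).Groups["capture"].Value;
    }
}
```

Since the pattern is in a verbatim string, \n reaches the regex engine as the two characters backslash and n, which the engine itself interprets as a newline.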

Think I may be a bit late to the party, but still hope this helps.
I needed to get multiple tokens between two hash signs.
Example input:
## token1 ##
## token2 ##
## token3_a
token3_b
token3_c ##
This seemed to work in my case:
var matches = Regex.Matches (mytext, "##(.*?)##", RegexOptions.Singleline);
Of course, you may want to replace the double hash signs at both ends with your own chars.
HTH.

Counter-intuitive as it is, you can use the Multiline and Singleline options together.
Regex.Match(input, @"(.+)^(.*)", RegexOptions.Multiline | RegexOptions.Singleline)
For a two-line input, the first capturing group will contain the first line (including \r and \n) and the second group will contain the second line.
Why:
First of all, RegexOptions is a flags enum, so its values can be combined with bitwise operators. Then:
Multiline:
^ and $ match the beginning and end of each line (instead of the beginning and end of the input string).
Singleline:
The period (.) matches every character (instead of every character except \n)
see docs

C# regex to replace a delimiter by another one

I'm working on PL/SQL code where I want to replace every ';' that appears inside a comment with '~'.
e.g.
If I have code such as:
--comment 1 with;
select id from t_id;
--comment 2 with ;
select name from t_id;
/*comment 3
with ;*/
Then I want my result text to be:
--comment 1 with~
select id from t_id;
--comment 2 with ~
select name from t_id;
/*comment 3
with ~*/
Can it be done using regex in C#?

Regular expression:
((?:--|/\*)[^~]*)~(\*/)?
C# code to use it:
string code = "all that text of yours";
Regex regex = new Regex(@"((?:--|/\*)[^~]*)~(\*/)?", RegexOptions.Multiline);
string result = regex.Replace(code, "$1;$2");
Not tested with C#, but the regular expression and the replacement works in RegexBuddy with your text =)
Note: I am not a very brilliant regular expression writer, so it could probably have been written better. But it works. And handles both your cases with one-liner-comments starting with -- and also the multiline ones with /* */
Edit: Read your comment to the other answer, so removed the ^ anchor, so that it takes care of comments not starting on a new line as well.
Edit 2: Figured it could be simplified a bit. Also found it works fine without the ending $ anchor as well.
Explanation:
// ((?:--|/\*)[^~]*)~(\*/)?
//
// Options: ^ and $ match at line breaks
//
// Match the regular expression below and capture its match into backreference number 1 «((?:--|/\*)[^~]*)»
// Match the regular expression below «(?:--|/\*)»
// Match either the regular expression below (attempting the next alternative only if this one fails) «--»
// Match the characters “--” literally «--»
// Or match regular expression number 2 below (the entire group fails if this one fails to match) «/\*»
// Match the character “/” literally «/»
// Match the character “*” literally «\*»
// Match any character that is NOT a “~” «[^~]*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
// Match the character “~” literally «~»
// Match the regular expression below and capture its match into backreference number 2 «(\*/)?»
// Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
// Match the character “*” literally «\*»
// Match the character “/” literally «/»

A regex is not really needed - you can iterate on lines, locate the lines starting with "--" and replace ";" with "~" on them.
String.StartsWith("--") - Determines whether the beginning of an instance of String matches a specified string.
String.Replace(";", "~") - Returns a new string in which all occurrences of a specified Unicode character or String in this instance are replaced with another specified Unicode character or String.
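If you do want a regex for the direction asked in the question (';' to '~', inside comments only), one option is to match each comment as a whole and do the character swap in a match evaluator. This is a sketch under a loud assumption: it does not know about string literals, so a '--' or '/*' inside a quoted SQL string would be treated as a comment. CommentSemicolons and Mask are made-up names.

```csharp
using System.Text.RegularExpressions;

public static class CommentSemicolons
{
    // --[^\r\n]*   a line comment, up to the end of the line
    // /\*.*?\*/    a block comment; Singleline lets . cross line breaks
    private static readonly Regex Comment = new Regex(
        @"--[^\r\n]*|/\*.*?\*/", RegexOptions.Singleline);

    // Replaces ';' with '~' inside comments and leaves all other text untouched.
    public static string Mask(string code)
    {
        return Comment.Replace(code, m => m.Value.Replace(';', '~'));
    }
}
```

This handles both the single-line and the multi-line comment style in one pass, and statement terminators outside comments keep their ';'.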

Can Regex be used for this particular string manipulation?

I need to replace character (say) x with character (say) P in a string, but only if it is contained in a quoted substring.
An example makes it clearer:
axbx'cxdxe'fxgh'ixj'k -> axbx'cPdPe'fxgh'iPj'k
Let's assume, for the sake of simplicity, that quotes always come in pairs.
The obvious way is to just process the string one character at a time (a simple state machine approach);
however, I'm wondering if regular expressions can be used to do all the processing in one go.
My target language is C#, but I guess my question pertains to any language having builtin or library support for regular expressions.

I converted Greg Hewgill's python code to C# and it worked!
[Test]
public void ReplaceTextInQuotes()
{
Assert.AreEqual("axbx'cPdPe'fxgh'iPj'k",
Regex.Replace("axbx'cxdxe'fxgh'ixj'k",
@"x(?=[^']*'([^']|'[^']*')*$)", "P"));
}
That test passed.

I was able to do this with Python:
>>> import re
>>> re.sub(r"x(?=[^']*'([^']|'[^']*')*$)", "P", "axbx'cxdxe'fxgh'ixj'k")
"axbx'cPdPe'fxgh'iPj'k"
What this does is use the lookahead assertion (?=...) to check that the character x is within a quoted string without consuming anything after it. It looks for some nonquote characters up to the next quote, then looks for a sequence of either single characters or quoted groups of characters, until the end of the string.
This relies on your assumption that the quotes are always balanced. This is also not very efficient.

A more general (and simpler) solution which allows non-paired quotes.
Find quoted string
Replace 'x' by 'P' in the string
#!/usr/bin/env python
import re
text = "axbx'cxdxe'fxgh'ixj'k"
s = re.sub("'.*?'", lambda m: re.sub("x", "P", m.group(0)), text)
print(s == "axbx'cPdPe'fxgh'iPj'k", s)
# -> True axbx'cPdPe'fxgh'iPj'k
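Since the target language is C#, here is roughly the same two-step idea using a match evaluator. Treat it as a sketch; QuotedReplacer and ReplaceInQuotes are names I picked for the example.

```csharp
using System.Text.RegularExpressions;

public static class QuotedReplacer
{
    // Step 1: find each quoted substring.
    // Step 2: replace 'x' with 'P' inside that substring only.
    public static string ReplaceInQuotes(string text)
    {
        return Regex.Replace(text, "'[^']*'", m => m.Value.Replace('x', 'P'));
    }
}
```

An unpaired trailing quote simply never matches, so the text after it is left as it is.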

The trick is to use a lookahead assertion to match the part of the string following the match (character x) we are searching for.
Trying to match the string up to x will only find either the first or the last occurrence, depending on whether non-greedy quantifiers are used.
Here's Greg's idea transposed to Tcl, with comments.
set strIn {axbx'cxdxe'fxgh'ixj'k}
set regex {(?x) # enable expanded syntax
# - allows comments, ignores whitespace
x # the actual match
(?= # non-matching group
[^']*' # match to end of current quoted substring
##
## assuming quotes are in pairs,
## make sure we actually were
## inside a quoted substring
## by making sure the rest of the string
## is what we expect it to be
##
(
[^']* # match any non-quoted substring
| # ...or...
'[^']*' # any quoted substring, including the quotes
)* # any number of times
$ # until we run out of string :)
) # end of non-matching group
}
#the same regular expression without the comments
set regexCondensed {(?x)x(?=[^']*'([^']|'[^']*')*$)}
set replRegex {P}
set nMatches [regsub -all -- $regex $strIn $replRegex strOut]
puts "$nMatches replacements. "
if {$nMatches > 0} {
puts "Original: |$strIn|"
puts "Result: |$strOut|"
}
exit
This prints:
3 replacements.
Original: |axbx'cxdxe'fxgh'ixj'k|
Result: |axbx'cPdPe'fxgh'iPj'k|

#!/usr/bin/perl -w
use strict;
# Break up each input line.
# The splitting uses quotes
# as the delimiter.
# Put every broken substring
# into the @fields array.
while (<>) {
    # The -1 limit keeps trailing empty fields.
    my @fields = split /'/, $_, -1;
    # Every substring indexed with an odd
    # number was inside quotes: search for x
    # and replace it with P.
    for (my $count = 1; $count <= $#fields; $count += 2) {
        $fields[$count] =~ s/x/P/g;
    }
    # Put the quotes back and print the result.
    print join("'", @fields);
}
Wouldn't this chunk do the job?

Not with plain regexp. Regular expressions have no "memory" so they cannot distinguish between being "inside" or "outside" quotes.
You need something more powerful; for example, using gema it would be straightforward:
'<repl>'=$0
repl:x=P

Similar discussion about balanced text replaces: Can regular expressions be used to match nested patterns?
You can try this in Vim, but it works well only if the string is on one line and there is only one pair of 's.
:%s:\('[^']*\)x\([^']*'\):\1P\2:gci
If there is one more pair, or even an unbalanced ', it could fail. That's why I included the c (confirm) flag on the ex command.
The same can be done with sed, without the interaction, or with awk so you can add some interaction.
One possible solution is to break the lines on pairs of 's; then you can apply the Vim solution.

Pattern: (?s)\G((?:^[^']*'|(?<=.))(?:'[^']*'|[^'x]+)*+)x
Replacement: \1P
\G — Anchor each match at the end of the previous one, or the start of the string.
(?:^[^']*'|(?<=.)) — If it is at the beginning of the string, match up to the first quote.
(?:'[^']*'|[^'x]+)*+ — Match any block of unquoted characters, or any (non-quote) characters up to an 'x'.
One sweep through the source string, except for a single character look-behind.

Sorry to dash your hopes, but you need a pushdown automaton to do that. There is more info here:
Pushdown Automaton
In short, regular expressions are finite state machines: they can only read their input and have no memory, while a pushdown automaton has a stack and the ability to manipulate it.
Edit: spelling...
