regex grouping fails with more than one group - c#

I have this regex, for C#. The alert portion works fine, but when I add the msg group,
it just hangs with cursor blinking on command line.
What have I missed, they both work by themselves, but not in full group map.
string pattern = #"(?<action>alert\s+(?:tcp|udp|icmp)\s+(.*?)*[(])\s+" +
#"(?<msg>msg[:](.*?)\[;\s*])";
Regex rgx = new Regex(pattern);
Match res = rgx.Match(rule);
I'm trying to match a string like #alert tcp $EXTERNAL_NET any -> $HOME_NET 12345:12346 (msg:"MALWARE-BACKDOOR netbus getinfo"; flow:to_server,established;

The problem is with (.*?)* in your first group. Try (.*?) instead.
When matching without the second group, that just matches till the end of the line. However, when adding the second group, it needs to back off to allow the second group to match. Since you've got two quantifiers interacting, there are a bazillion ways to match until it has backed off sufficiently to allow the second group to match.
An example. Let's say you're matching the string abc with (.*?)*. The ways for that to match are:
(a)(b)(c)
(a)(bc)
(ab)(c)
(abc)
And that's not counting the possible empty strings that regex might match in between (because .* will match an empty string as well).
Trying to match one character more, say abcd, yields as possible matches:
(a)(b)(c)(d)
(a)(b)(cd)
(a)(bc)(d)
(a)(bcd)
(ab)(c)(d)
(ab)(cd)
(abc)(d)
(abcd)
So the number of possible matches doubles for every character added.

Related

Matching a sequence of characters splitted by spaces after a prefix

I have the following strings:
-prefix <#141222969505480701> where the second part e.g. <#141222969505480701> can be repeated unlimited times (only the numbers change).
-prefix 141222969505480701 which should behave the same as above.
-prefix 141222969505480701 <#141222969505480702> which would still be able to repeat itself forever.
The last one should have groups containing 141222969505480701 and 141222969505480702.
So a few bits of information:
The digit chains are always 18 in total so I use \d{18} in my regex
I would like to have the numbers in groups for me to use them afterwards.
What I tried
First of I tried to match the first of my example strings.
-prefix(\s<#\d{18}>)\1* which would match the entire string, but I would like to have the digits itself in its own group. Also this method only matches the same parts e.g. <#141222969505480701> <#141222969505480701> <#141222969505480701> would match, but any other number in between wouldn't match.
What would sound logical in my head
-prefix (\d{18})+ but it would only match the first one of the 'digit parts'.
While I was testing it on regex101 it told me the following:
A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data.
I tried to adjust the regex to the following -prefix ((\d{18})+), but with the same result.
With the help to of #madreflection in the comments I was able to come up with this solution:
-prefix([\s]*(<#|)(?<digits>[0-9]{18})>?)+
Which is exactly what I needed, which even ignores spaces in between. Also with the use of match.Groups["digits"].Captures it made the whole story a lot easier.
You could use an alternation to list the 3 different allowed formats. In .NET it is supported to reuse the group name.
-prefix\s*(?:(?<digits>[0-9]{18})\s*<#(?<digits>[0-9]{18})>|(?<digits>[0-9]{18})|<#(?<digits>[0-9]{18}))
Pattern parts
-prefix\s* Match literally followed by 0+ whitespace characters
(?: Non capturing group
(?<digits>[0-9]{18})\s*<#(?<digits>[0-9]{18})> 2 named capturing groups which will match the digits
| Or
(?<digits>[0-9]{18}) Named capturing group, match digits only
| Or
<#(?<digits>[0-9]{18}) Named capturing group, match digits between brackets only
)
Regex demo
You could also use 2 named capturing groups, 1 for each format. For example:
-prefix\s*(?:(?<digits>[0-9]{18})\s*<#(?<digitsBrackets>[0-9]{18})>|(?<digits>[0-9]{18})|<#(?<digitsBrackets>[0-9]{18}))
Regex demo

C# Regular Expression: Search the first 3 letters of each name

Does anyone know how to say I can get a regex (C#) search of the first 3 letters of a full name?
Without the use of (.*)
I used (.**)but it scrolls the text far beyond the requested name, or
if it finds the first condition and after 100 words find the second condition he return a text that is not the look, so I have to limit in number of words.
Example: \s*(?:\s+\S+){0,2}\s*
I would like to ignore names with less than 3 characters if they exist in name.
Search any name that contains the first 3 characters that start with:
'Mar Jac Rey' (regex that performs search)
Should match:
Marck Jacobs L. S. Reynolds
Marcus Jacobine Reys
Maroon Jacqueline by Reyils
Can anyone help me?
The zero or more quantifier (*) is 'greedy' by default—that is, it will consume as many characters as possible in order to finding the remainder of the pattern. This is why Mar.*Jac will match the first Mar in the input and the last Jac and everything in between.
One potential solution is just to make your pattern 'non-greedy' (*?). This will make it consume as few characters as possible in order to match the remainder of the pattern.
Mar.*?Jac.*?Rey
However, this is not a great solution because it would still match the various name parts regardless of what other text appears in between—e.g. Marcus Jacobine Should Not Match Reys would be a valid match.
To allow only whitespace or at most 2 consecutive non-whitespace characters to appear between each name part, you'd have to get more fancy:
\bMar\w*(\s+\S{0,2})*\s+Jac\w*(\s+\S{0,2})*\s+Rey\w*
The pattern (\s+\S{0,2})*\s+ will match any number of non-whitespace characters containing at most two characters, each surrounded by whitespace. The \w* after each name part ensures that the entire name is included in that part of the match (you might want to use \S* instead here, but that's not entirely clear from your question). And I threw in a word boundary (\b) at the beginning to ensure that the match does not start in the middle of a 'word' (e.g. OMar would not match).
I think what you want is this regular expression to check if it is true and is case insensitive
#"^[Mar|Jac|Rey]{3}"
Less specific:
#"^[\w]{3}"
If you want to capture the first three letters of every words of at least three characters words you could use something like :
((?<name>[\w]{3})\w+)+
And enable ExplicitCapture when initializing your Regex.
It will return you a serie of Match named "name", each one of them is a result.
Code sample :
Regex regex = new Regex(#"((?<name>[\w]{3})\w+)+", RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);
var match = regex.Matches("Marck Jacobs L. S. Reynolds");
If you want capture also 3 characters words, you can replace the last "\w" by a space. In this case think to handle the last word of the phrase.

C# Regular expressions, retrieving two words separated by a comma, parenthesis operator

I've been playing around with retrieving data from a string using regular expression, mostly as an exercise for myself. The pattern that I'm trying to match looks like this:
"(SomeWord,OtherWord)"
After reading some documentation and looking at a cheat sheet I came to the conclusion that the following regex should give me 2 matches:
"\((\w),(\w)\)"
Because according to the documentation the parenthesis should do the following:
(pattern) Matches pattern and remembers the match. The matched
substring can be retrieved from the resulting Matches collection,
using Item [0]...[n]. To match parentheses characters ( ), use "\ (" or
"\ )".
However using the following code (removed error checking for conciseness) matches quite something different:
string line = "(A,B)";
string pattern = #"\((\w),(\w)\)";
MatchCollection matches = Regex.Matches(line, pattern);
string left = matches[0].Value;
string right = matches[1].Value;
Now I would expect left to become "A" and right to become "B". However left becomes "(A,B)" and there is no second match at all. What am I missing here?
(I know this example is trivial to solve without regexes but to learn how to properly use regexes I should be able to make something simple as this work)
You want the Groups member of the first match. In your example case there is only 1 match, which is the whole string. In the Groups collection you will have 3 items. Try this sample code, left should be A, and right should be B. If you look at the group[0] value it will be the whole string.
string line = "(A,B)";
string pattern = #"\((\w),(\w)\)";
MatchCollection matches = Regex.Matches(line, pattern);
GroupCollection groups = matches[0].Groups;
string left = groups[1].Value;
string right = groups[2].Value;
\w matches only one word character. If words have to contain at least one character, the expression should be:
string pattern = #"\((\w+),(\w+)\)";
if words may be empty:
string pattern = #"\((\w*),(\w*)\)";
+: means one or more repetitions.
*: means zero, one or more repetitions.
In any case, you will get one match with three groups, the first containing the whole string including the left and right parentheses, the two others the two words.
I think the problem is that you're confusing the concept of a match and a group.
A MatchCollection contains a list of strings that matched your entire regex, not just the parenthetical groups inside that Regex. For example, if the string you searched looked like this...
(A,B)(C,D)
...then you would have two matches: (A,B) and (C,D).
However, there's good news: you can get the groups from each match very easily, like so:
string line = "(A,B)";
string pattern = #"\((\w),(\w)\)";
MatchCollection matches = Regex.Matches(line, pattern);
string left = matches[0].Groups[1].Value;
string right = matches[0].Groups[2].Value;
That Groups variable is a collection of parenthetical groups from a single match.
Edit:
Olivier Jacot-Descombes made a very good point: we all got so hung up explaining match vs. group that we forgot to notice a second problem: \w will only match a SINGLE character. You need to add a quantifier (such as +) in order to grab more than one character at a time. Olivier's answer should explain that part clearly.
First off, it's one "match", with 2 "groups"...
I would recommend you name the groups anyway...
string pattern = #"\((?<FirstWord>\w+),(?<SecondWord>\w+)\)";
Then you could do...
Match m = Regex.Match(line, pattern);
string firstWord = m.Groups["FirstWord"].Value;
Since all you are looking for are the characters separated by a comma, you can simply use \w as your pattern. The matches will be A and B.
A handy site for testing your Regex is http://gskinner.com/RegExr/

C# Regex Replace weird behavior with multiple captures and matching at the end of string?

I'm trying to write something that format Brazilian phone numbers, but I want it to do it matching from the end of the string, and not the beginning, so it would turn input strings according to the following pattern:
"5135554444" -> "(51) 3555-4444"
"35554444" -> "3555-4444"
"5554444" -> "555-4444"
Since the begining portion is what usually changes, I thought of building the match using the $ sign so it would start at the end, and then capture backwards (so I thought), replacing then by the desired end format, and after, just getting rid of the parentesis "()" in front if they were empty.
This is the C# code:
s = "5135554444";
string str = Regex.Replace(s, #"\D", ""); //Get rid of non digits, if any
str = Regex.Replace(str, #"(\d{0,2})(\d{0,4})(\d{1,4})$", "($1) $2-$3");
return Regex.Replace(str, #"^\(\) ", ""); //Get rid of empty () at the beginning
The return value was as expected for a 10 digit number. But for anything less than that, it ended up showing some strange behavior. These were my results:
"5135554444" -> "(51) 3555-4444"
"35554444" -> "(35) 5544-44"
"5554444" -> "(55) 5444-4"
It seems that it ignores the $ at the end to do the match, except that if I test with something less than 7 digits it goes like this:
"554444" -> "(55) 444-4"
"54444" -> "(54) 44-4"
"4444" -> "(44) 4-4"
Notice that it keeps the "minimum" {n} number of times of the third capture group always capturing it from the end, but then, the first two groups are capturing from the beginning as if the last group was non greedy from the end, just getting the minimum... weird or it's me?
Now, if I change the pattern, so instead of {1,4} on the third capture I use {4} these are the results:
str = Regex.Replace(str, #"(\d{0,2})(\d{0,4})(\d{4})$", "($1) $2-$3");
"5135554444" -> "(51) 3555-4444" //As expected
"35554444" -> "(35) 55-4444" //The last four are as expected, but "35" as $1?
"54444" -> "(5) -4444" //Again "4444" in $3, why nothing in $2 and "5" in $1?
I know this is probably some stupidity of mine, but wouldn't it be more reasonable if I want to capture at the end of the string, that all previous capture groups would be captured in reverse order?
I would think that "54444" would turn into "5-4444" in this last example... then it does not...
How would one accomplish this?
(I know maybe there's a better way to accomplish the very same thing using different approaches... but what I'm really curious is to find out why this particular behavior of the Regex seems odd. So, the answer tho this question should focus on explaining why the last capture is anchored at the end of the string, and why the others are not, as demonstrated in this example. So I'm not particularly interested in the actual phone # formatting problem, but to understand the Regex sintax)...
Thanks...
So you want the third part to always have four digits, the second part zero to four digits, and the first part zero to two digits, but only if the second part contains four digits?
Use
^(\d{0,2}?)(\d{0,4})(\d{4})$
As a C# snippet, commented:
resultString = Regex.Replace(subjectString,
#"^ # anchor the search at the start of the string
(\d{0,2}?) # match as few digits as possible, maximum 2
(\d{0,4}) # match up to four digits, as many as possible
(\d{4}) # match exactly four digits
$ # anchor the search at the end of the string",
"($1) $2-$3", RegexOptions.IgnorePatternWhitespace);
By adding a ? to a quantifier (??, *?, +?, {a,b}?) you make it lazy, i. e. tell it to match as few characters as possible while still allowing an overall match to be found.
Without the ? in the first group, what would happen when trying to match 123456?
First, the \d{0,2} matches 12.
Then, the \d{0,4} matches 3456.
Then, the \d{4} doesn't have anything left to match, so the regex engine backtracks until that's possible again. After four steps, the \d{4} can match 3456. The \d{0,4} gives up everything it had matched greedily for this.
Now, an overall match has been found - no need to try any more combinations. Therefore, the first and third groups will contain parts of the match.
You have to tell it that it's OK if the first matching groups aren't there, but not the last one:
(\d{0,2}?)(\d{0,4}?)(\d{1,4})$
Matches your examples properly in my testing.

How to prevent regex from stopping at the first match of alternatives?

If I have the string hello world , how can I modify the regex world|wo|w so that it will match all of "world", "wo" and "w" rather than just the single first match of "world" that it comes to ?
If this is not possible directly, is there a good workaround ? I'm using C# if it makes a difference:
Regex testRegex = new Regex("world|wo|w");
MatchCollection theMatches = testRegex.Matches("hello world");
foreach (Match thisMatch in theMatches)
{
...
}
I think you're going to need to use three separate regexs and match on each of them. When you specify alternatives it considers each one a successful match and stops looking after matching one of them. The only way I can see to do it is to repeat the search with each of your alternatives in a separate regex. You can create an array or list of Match items and have each search add to the list if you want to be able to iterate through them later.
If you're trying to match (the beginning of) the word world three times, you'll need to use three separate Regex objects; a single Regex cannot match the same character twice.
As SLaks wrote, a regex can't match the same text more than once.
You could "fake it" like this:
\b(w)((?<=w)o)?((?<=wo)rld)?
will match the w, the o only if preceded by w*, and rld only if preceded by wo.
Of course, only parts of the word will actually be matched, but you'll see whether only the first one, the first two or all the parts did match by looking at the captured groups.
So in the word want, the w will match (the rest is optional, so the regex reports overall success.
In work, the wo will match; \1 will contain w, and \2 will contain o. The rld will fail, but since it's optional, the regex still reports success.
I have added a word boundary anchor \b to the start of the regex to avoid matches in the middle of words like reword; if don't want to exclude those matches, drop the \b.
* The (?<=w) is not actually needed here, but I kept it in for consistency.

Categories