How to prevent regex from stopping at the first match of alternatives? - c#

If I have the string hello world, how can I modify the regex world|wo|w so that it will match all of "world", "wo" and "w" rather than just the single first match of "world" that it comes to?
If this is not possible directly, is there a good workaround? I'm using C# if it makes a difference:
Regex testRegex = new Regex("world|wo|w");
MatchCollection theMatches = testRegex.Matches("hello world");
foreach (Match thisMatch in theMatches)
{
...
}

I think you're going to need to use three separate regexes and match on each of them. When you specify alternatives, the regex treats any one of them as a successful match and stops looking at that position once one has matched. The only way I can see to do it is to repeat the search with each of your alternatives in a separate regex. You can create an array or list of Match items and have each search add to the list if you want to be able to iterate through them later.
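A minimal sketch of that workaround, assuming each alternative is simply run as its own pattern and the results are collected in one list. The class and method names (AlternativeMatcher, MatchEach) are made up for illustration:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AlternativeMatcher
{
    // Runs every pattern separately against the same input, so matches
    // that overlap each other are all reported instead of only the first.
    public static List<Match> MatchEach(string input, params string[] patterns)
    {
        var results = new List<Match>();
        foreach (string pattern in patterns)
        {
            foreach (Match m in Regex.Matches(input, pattern))
            {
                results.Add(m);
            }
        }
        return results;
    }
}
```

Calling AlternativeMatcher.MatchEach("hello world", "world", "wo", "w") should give three matches ("world", "wo" and "w"), all starting at index 6.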

If you're trying to match (the beginning of) the word world three times, you'll need to use three separate Regex objects; a single Regex cannot match the same character twice.

As SLaks wrote, a regex can't match the same text more than once.
You could "fake it" like this:
\b(w)((?<=w)o)?((?<=wo)rld)?
will match the w, the o only if preceded by w*, and rld only if preceded by wo.
Of course, only parts of the word will actually be matched, but you'll see whether only the first one, the first two or all the parts did match by looking at the captured groups.
So in the word want, the w will match (the rest is optional, so the regex reports overall success).
In work, the wo will match; \1 will contain w, and \2 will contain o. The rld will fail, but since it's optional, the regex still reports success.
I have added a word boundary anchor \b to the start of the regex to avoid matches in the middle of words like reword; if you don't want to exclude those matches, drop the \b.
* The (?<=w) is not actually needed here, but I kept it in for consistency.
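Here is a rough sketch of how the captured groups could be inspected in C# to tell which of "w", "wo" or "world" was found. The helper name LongestPrefix and the choice to return an empty string when nothing matches are assumptions for illustration:

```csharp
using System.Text.RegularExpressions;

public static class PrefixProbe
{
    private static readonly Regex Probe =
        new Regex(@"\b(w)((?<=w)o)?((?<=wo)rld)?");

    // Reports the longest of "w", "wo", "world" found at the first word
    // that starts with w, or an empty string when no such word exists.
    public static string LongestPrefix(string input)
    {
        Match m = Probe.Match(input);
        if (!m.Success) return "";
        if (m.Groups[3].Success) return "world"; // w, o and rld all matched
        if (m.Groups[2].Success) return "wo";    // only w and o matched
        return "w";                              // only the w matched
    }
}
```

With this, "hello world" yields "world", "work" yields "wo", "want" yields "w", and "reword" yields an empty string because of the \b.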

Related

Regex to match when not inside single quotes [duplicate]

I'm looking to write a regex (C#) that will match words that aren't surrounded by quotes. An example input string would be:
dbo.test line_length "quoted words" notquoted
And this needs to match
dbo.test
line_length
notquoted
So 3 separate matches and "quoted words" is not matched. The quoted phrase could be anywhere in the input...beginning, middle, end, etc.
I haven't been able to come up with a regex that matches words not in quotes where there could be a space in the quotes...I've been able to match something like: hello "world" and only get hello.
Is there a way to write the regex I'm trying to?
There are two ways to tackle this, depending on what you want to do with the output.
The first is to match (but not capture) any text within quotation marks. (This specifically matches the stuff that you DON'T want.)
Then, after the | pipe, use a capture group to select everything that you DO want to keep.
Example:
".*?"|(\b\S+\b)
The other option, using look-arounds, is to look backward from the beginning of each word and forward from its end to ensure that a " doesn't appear on either side:
(?<!")(\b\S+\b)(?!")
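The look-around version only inspects the characters directly next to each word, so it may have a problem with longer quoted phrases: a word in the middle of a quoted phrase of three or more words would still match. Still, this should get you on the right track, and you can indicate whether one of these methods works better for you than the other.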
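A small sketch of the first approach in C#, assuming you only keep the matches where the capture group took part. The class and method names are hypothetical:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UnquotedWords
{
    // The first alternative swallows quoted text without capturing it;
    // the second alternative captures everything else in group 1.
    private static readonly Regex Pattern =
        new Regex(@""".*?""|(\b\S+\b)");

    public static List<string> Extract(string input)
    {
        var words = new List<string>();
        foreach (Match m in Pattern.Matches(input))
        {
            // Group 1 is only successful for text outside quotes.
            if (m.Groups[1].Success)
            {
                words.Add(m.Groups[1].Value);
            }
        }
        return words;
    }
}
```

For the input dbo.test line_length "quoted words" notquoted this should return dbo.test, line_length and notquoted.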

C# Regular Expression: Search the first 3 letters of each name

Does anyone know how I can write a regex (C#) that searches on the first 3 letters of each part of a full name, without using (.*)?
I used (.*), but it runs far past the requested name: if it finds the first condition and only finds the second condition 100 words later, it returns text that is not what I am looking for, so I have to limit the number of words in between.
Example: \s*(?:\s+\S+){0,2}\s*
I would like to ignore name parts with fewer than 3 characters if they occur in the name.
The search should find any name whose parts start with these 3-character prefixes:
'Mar Jac Rey' (regex that performs search)
Should match:
Marck Jacobs L. S. Reynolds
Marcus Jacobine Reys
Maroon Jacqueline by Reyils
Can anyone help me?
The zero-or-more quantifier (*) is 'greedy' by default: it consumes as many characters as possible while still letting the remainder of the pattern match. This is why Mar.*Jac will match from the first Mar in the input to the last Jac, including everything in between.
One potential solution is just to make your pattern 'non-greedy' (*?). This will make it consume as few characters as possible in order to match the remainder of the pattern.
Mar.*?Jac.*?Rey
However, this is not a great solution because it would still match the various name parts regardless of what other text appears in between—e.g. Marcus Jacobine Should Not Match Reys would be a valid match.
To allow only whitespace or at most 2 consecutive non-whitespace characters to appear between each name part, you'd have to get more fancy:
\bMar\w*(\s+\S{0,2})*\s+Jac\w*(\s+\S{0,2})*\s+Rey\w*
The pattern (\s+\S{0,2})*\s+ will match any number of whitespace-separated tokens of at most two non-whitespace characters each, followed by whitespace. The \w* after each name part ensures that the entire name is included in that part of the match (you might want to use \S* instead here, but that's not entirely clear from your question). And I threw in a word boundary (\b) at the beginning to ensure that the match does not start in the middle of a 'word' (e.g. OMar would not match).
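As a sketch, the same pattern could be assembled from a list of prefixes in C#. The helper name Build is made up for illustration, and it uses a non-capturing group (?:...) for the gap, which is a small deviation from the pattern above:

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class NamePrefixSearch
{
    // Builds \bP1\w*(gap)P2\w*(gap)P3\w* where the gap allows only
    // whitespace and tokens of at most two non-whitespace characters.
    public static Regex Build(params string[] prefixes)
    {
        const string gap = @"(?:\s+\S{0,2})*\s+";
        string body = string.Join(
            gap,
            prefixes.Select(p => Regex.Escape(p) + @"\w*"));
        return new Regex(@"\b" + body);
    }
}
```

NamePrefixSearch.Build("Mar", "Jac", "Rey") should match all three sample names from the question, but not Marcus Jacobine Should Not Match Reys.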
I think what you want is this regular expression to check whether the text starts with one of the prefixes (pass RegexOptions.IgnoreCase to make it case-insensitive):
@"^(Mar|Jac|Rey)"
Less specific:
@"^\w{3}"
If you want to capture the first three letters of every word that has more than three characters, you could use something like:
((?<name>[\w]{3})\w+)+
And enable ExplicitCapture when initializing your Regex.
It will return a series of matches, each with a group named "name" that holds one result.
Code sample:
Regex regex = new Regex(@"((?<name>[\w]{3})\w+)+", RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);
var match = regex.Matches("Marck Jacobs L. S. Reynolds");
If you also want to capture 3-character words, you can replace the last "\w" with a space. In that case, remember to handle the last word of the phrase.

regex grouping fails with more than one group

I have this regex, for C#. The alert portion works fine, but when I add the msg group,
it just hangs with the cursor blinking on the command line.
What have I missed? Both groups work by themselves, but not combined in the full pattern.
string pattern = @"(?<action>alert\s+(?:tcp|udp|icmp)\s+(.*?)*[(])\s+" +
@"(?<msg>msg[:](.*?)\[;\s*])";
Regex rgx = new Regex(pattern);
Match res = rgx.Match(rule);
I'm trying to match a string like #alert tcp $EXTERNAL_NET any -> $HOME_NET 12345:12346 (msg:"MALWARE-BACKDOOR netbus getinfo"; flow:to_server,established;
The problem is with (.*?)* in your first group. Try (.*?) instead.
When matching without the second group, that just matches till the end of the line. However, when adding the second group, it needs to back off to allow the second group to match. Since you've got two quantifiers interacting, there are a bazillion ways to match until it has backed off sufficiently to allow the second group to match.
An example. Let's say you're matching the string abc with (.*?)*. The ways for that to match are:
(a)(b)(c)
(a)(bc)
(ab)(c)
(abc)
And that's not counting the possible empty strings that regex might match in between (because .* will match an empty string as well).
Trying to match one character more, say abcd, yields as possible matches:
(a)(b)(c)(d)
(a)(b)(cd)
(a)(bc)(d)
(a)(bcd)
(ab)(c)(d)
(ab)(cd)
(abc)(d)
(abcd)
So the number of possible matches doubles for every character added.
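A hedged sketch of one way to avoid the blow-up in C#: a single lazy quantifier replaces the nested (.*?)*, and a match timeout acts as a safety net. The pattern below is a simplified variant written for illustration, not the asker's exact pattern. It assumes that no whitespace is required between the "(" and msg (the sample rule has none) and that the message ends at the first ";". The names RuleParser, ExtractMessage and the text group are hypothetical:

```csharp
using System;
using System.Text.RegularExpressions;

public static class RuleParser
{
    // One lazy .*? instead of (.*?)*, so there is only one way to split
    // the header; the timeout guards against any remaining backtracking.
    private static readonly Regex RulePattern = new Regex(
        @"(?<action>alert\s+(?:tcp|udp|icmp)\s+.*?\()\s*(?<msg>msg:(?<text>.*?);)",
        RegexOptions.None,
        TimeSpan.FromSeconds(1));

    // Returns the text after "msg:" up to the first ';', or an empty
    // string when the rule does not match or matching times out.
    public static string ExtractMessage(string rule)
    {
        try
        {
            Match m = RulePattern.Match(rule);
            return m.Success ? m.Groups["text"].Value : "";
        }
        catch (RegexMatchTimeoutException)
        {
            return "";
        }
    }
}
```

For the sample rule from the question, ExtractMessage should return "MALWARE-BACKDOOR netbus getinfo" including the surrounding quotation marks.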

Regular Expression boundary

Please help me find the match of:
<ptext>Any Sentence/tags goes here</ptext>
My current regex is:
\<ptext\>\b.+\b\</ptext\>
But if I double it, for example:
<ptext>Any Sentence/tags goes here</ptext> <ptext>Any Sentence/tags goes here</ptext>
my regex matches from the first <ptext> up to the last </ptext>.
How can I separate them so that I get two (2) matches in the example I gave? Thanks for all your help.
This is where a single pair of ( ) and a .+? will come in handy. Try...
\<ptext\>\b(.+?)\b\</ptext\>
This does two things. First, the parentheses, used here as a capture group rather than to group alternatives (like|this), let you retrieve just the text between the tags instead of the whole match. Second, the "lazy" .+? matches one or more characters but stops at the FIRST position where the rest of the pattern can match, not the last. That way, each match covers a single <ptext> element rather than everything up to the final closing tag.
Also not sure if the \b are right in your case, FYI. Thus, I would recommend...
\<ptext\>(.+?)\</ptext\>
For example, this code should return...
Array[0] = "This is a sentence"
Array[1] = "Here's another one."
Use a non-greedy quantifier:
<ptext>.+?</ptext>
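A minimal C# sketch of the non-greedy approach, with a capture group added so that only the inner text is returned. The class and method names are made up, and it assumes each element sits on one line, since . does not match newlines unless RegexOptions.Singleline is set:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PTextExtractor
{
    // Lazy .+? stops at the first closing tag, so each element
    // produces its own match; group 1 holds the text between the tags.
    public static List<string> Extract(string input)
    {
        var contents = new List<string>();
        foreach (Match m in Regex.Matches(input, @"<ptext>(.+?)</ptext>"))
        {
            contents.Add(m.Groups[1].Value);
        }
        return contents;
    }
}
```

Given two adjacent <ptext> elements on one line, Extract should return two separate strings rather than one long match.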
