Regex - find matches not contained within pattern - c#

I would like to use a regular expression to match all occurrences of a phrase where it's not contained within some delimiting characters. I tried putting one together but had some difficulty with the negative lookaheads.
My search phrase is "my phrase". The start delimiter tag is [[ and the end delimiter tag is ]]. The string I'd like to search is:
Here is a sentence with my phrase, here's another part which I don't want to match on [[my phrase]]. I would like to find this occurrence of my phrase.
From this string I would expect to find all occurrences of "my phrase" except the one contained within [[ ]].
I hope that makes sense, thanks in advance for any guidance.

[^#]my phrase[^#]
I have put together a regex that will do what you ask.
It simply excludes # immediately before and after the phrase and allows any other character there. You can return the index of these results, but remember to strip off the first and last character of each match.
Note: This will not pick up any "my phrase" that starts or ends the string, because the pattern requires a character on each side.
Edit - Seeing as you changed the scope while I was writing this answer,
here is the RegEx for the other delimiter:
[^[[]my phrase[^\]\]]

(?<=[^\[])my phrase(?=[^\]]*)
Because it uses lookarounds, the match contains only the phrase itself, which also eliminates the trailing punctuation marks.
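A minimal C# sketch of one other way to approach this, assuming the delimiters are never nested and every ]] closes an earlier [[. It is not the pattern from either answer above: it uses a negative lookahead that rejects an occurrence when a ]] can be reached without passing a [ first. The class and method names are illustrative only.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PhraseFinder
{
    // Hypothetical helper: returns the start index of each "my phrase"
    // that is not inside a [[ ... ]] pair (assumes no nesting).
    public static List<int> FindOutsideBrackets(string input)
    {
        // (?![^\[]*\]\]) rejects a candidate when "]]" follows with no "["
        // in between, i.e. when the phrase sits inside an open [[ ... ]] block.
        var regex = new Regex(@"my phrase(?![^\[]*\]\])");
        var indexes = new List<int>();
        foreach (Match m in regex.Matches(input))
        {
            indexes.Add(m.Index);
        }
        return indexes;
    }
}
```

Run against the sample sentence from the question, this should report the first and last occurrences and skip the one inside [[ ]].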


Extract string from a pattern preceded by any length

I'm looking for a regular expression to extract a string from a file name
eg if filename format is "anythingatallanylength_123_TESTNAME.docx", I'm interested in extracting "TESTNAME" ... probably fixed length of 8. (btw, 123 can be any three digit number)
I think I can use regex match ...
".*_[0-9][0-9][0-9]_[A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z].docx$"
However this matches the whole thing. How can I just get "TESTNAME"?
Thanks
Use parentheses to capture a specific piece of the whole regex.
You can also use curly braces to specify counts of matching characters, and \d for [0-9].
In C#:
var myRegex = new Regex(@".*_\d{3}_([A-Za-z]{8})\.docx$");
Now "TESTNAME", or whatever your 8-letter piece is, will be available as Groups[1] of the Match returned by the regex.
Also note, there will be a performance overhead for look-ahead and look-behind, as presented in some other solutions.
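A minimal sketch of the capture-group approach above. The class and method names are hypothetical, and the leading .* is dropped here because Regex.Match already scans for the first position where the rest of the pattern fits.

```csharp
using System.Text.RegularExpressions;

public static class FileNameParser
{
    // Group 1 captures the eight letters between "_ddd_" and ".docx".
    private static readonly Regex Pattern =
        new Regex(@"_\d{3}_([A-Za-z]{8})\.docx$");

    // Returns the captured name, or null when the file name does not fit.
    public static string ExtractName(string fileName)
    {
        Match m = Pattern.Match(fileName);
        // Groups[0] is the whole match; Groups[1] is the first parenthesized group.
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Calling FileNameParser.ExtractName("anythingatallanylength_123_TESTNAME.docx") should return "TESTNAME".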
You can use a look-behind and a look-ahead to check parts without matching them:
(?<=_[0-9]{3}_)[A-Z]{8}(?=\.docx$)
Note that this is case-sensitive, you may want to use other character classes and/or quantifiers to fit your exact pattern.
In your file name format "anythingatallanylength_123_TESTNAME.docx", the part you are trying to match is the string between the last underscore _ and .docx. Keeping in mind that text before any earlier _ should not be matched, I came up with the following solution.
Regex: (?<=_)[A-Za-z]*(?=\.docx$)
Flags used:
g global search
m multi-line search.
Explanation:
(?<=_) checks if there is an underscore before the file name.
(?=\.docx$) checks for extension at the end.
[A-Za-z]* checks the required match.
Regex101 Demo
Thanks to @Lucero @noob @JamesFaix I came up with ...
@"(?<=.*_[0-9]{3}_)[A-Z]{8}(?=.docx$)"
So a look-behind (in parentheses, starting with ?<=) for anything (i.e. zero or more of any character, denoted by ".*") followed by an underscore, followed by three digits, followed by an underscore. That's the end of the look-behind. Next comes what I need to match (eight letters). Finally, the look-ahead (in parentheses, starting with ?=), which is the .docx.
Nice work, fellas. Thunderbirds are go.

C# Regex Captures Extra Parameters

I'm using .NET's regex as part of my university assignment (writing a compiler). I found an interesting caveat that's driving me nuts.
I have this regex pattern: \A(?:(func)[^\w\d]*|(func)\z)
When I try to match string "func sum(a, b)\n..., the resulting Match object has one item in CaptureCollection containing the string "func ".
Why am I getting the whitespace along with my keyword?
You're talking about item #0. The item at index 0 is always the whole match. The following items are the captured groups.
You got a match from the (func)[^\w\d]* part, and [^\w\d]* captured the whitespace you're seeing in the result.
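A short sketch of the difference between the whole match and the captured group, using the pattern from the question. The class and method names are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Returns { whole match, first captured group } for the question's pattern.
    public static string[] WholeAndGroup(string input)
    {
        Match m = Regex.Match(input, @"\A(?:(func)[^\w\d]*|(func)\z)");
        // Groups[0] is everything the pattern consumed, including the
        // characters matched by [^\w\d]*; Groups[1] is only "(func)".
        return new[] { m.Groups[0].Value, m.Groups[1].Value };
    }
}
```

For the input "func sum(a, b)\n", the first element should be "func " (with the trailing space) and the second just "func".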
Because the [^\w\d]* part matches a space character; without it, the match is only func.
You're trying to negate a character group of either a word or digit to come immediately after "func" with [^\w\d]*, a whitespace qualifies.
You also specify any number of non-words and non-digits with the *, explaining the several whitespaces captured alongside "func".
I hope that answers your question as to why you're capturing the whitespaces.
I am uncertain what your exact goal is, so here are some examples:
This statement matches only "func" with any word immediately after it: \A(?:(func)[\w\d]*|(func)\z)
This statement matches "func" at the beginning of EACH line (when RegexOptions.Multiline is set) and the end of the ENTIRE string: ^func|func\z
This statement matches "func" at the beginning of the entire string and the end of the ENTIRE string: \Afunc|func\z
You can find a quick reference page here: Regular Expression Language - Quick Reference

C# Regular Expression: Search the first 3 letters of each name

Does anyone know how I can write a regex (C#) that searches on the first 3 letters of each part of a full name?
Without the use of (.*)
I used (.*), but it runs through the text far beyond the requested name, or
if it finds the first condition and then, 100 words later, finds the second condition, it returns text that is not what I am looking for, so I have to limit the number of words.
Example: \s*(?:\s+\S+){0,2}\s*
I would like to ignore names with less than 3 characters if they exist in name.
Search for any name whose parts start with these 3-character prefixes:
'Mar Jac Rey' (regex that performs search)
Should match:
Marck Jacobs L. S. Reynolds
Marcus Jacobine Reys
Maroon Jacqueline by Reyils
Can anyone help me?
The zero-or-more quantifier (*) is 'greedy' by default; that is, it will consume as many characters as possible while still allowing the remainder of the pattern to match. This is why Mar.*Jac will match from the first Mar in the input to the last Jac, including everything in between.
One potential solution is just to make your pattern 'non-greedy' (*?). This will make it consume as few characters as possible in order to match the remainder of the pattern.
Mar.*?Jac.*?Rey
However, this is not a great solution because it would still match the various name parts regardless of what other text appears in between—e.g. Marcus Jacobine Should Not Match Reys would be a valid match.
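The greedy versus non-greedy behaviour described above can be seen with a small sketch. The class name and the sample input are illustrative, not from the question.

```csharp
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Greedy: .* runs to the end of the input, then backs up to the LAST "Jac".
    public static string Greedy(string input) =>
        Regex.Match(input, @"Mar.*Jac").Value;

    // Lazy: .*? grows one character at a time and stops at the FIRST "Jac".
    public static string Lazy(string input) =>
        Regex.Match(input, @"Mar.*?Jac").Value;
}
```

With the input "Marcus Jacobine and Jack", Greedy should return "Marcus Jacobine and Jac" while Lazy returns "Marcus Jac".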
To allow only whitespace or at most 2 consecutive non-whitespace characters to appear between each name part, you'd have to get more fancy:
\bMar\w*(\s+\S{0,2})*\s+Jac\w*(\s+\S{0,2})*\s+Rey\w*
The pattern (\s+\S{0,2})*\s+ will match any number of runs of at most two non-whitespace characters, each surrounded by whitespace. The \w* after each name part ensures that the entire name is included in that part of the match (you might want to use \S* instead here, but that's not entirely clear from your question). And I threw in a word boundary (\b) at the beginning to ensure that the match does not start in the middle of a 'word' (e.g. OMar would not match).
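A minimal sketch that runs the pattern above against the sample names from the question, plus the counter-example mentioned earlier. The class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class NameMatcher
{
    // Name parts starting with Mar, Jac, Rey, separated only by whitespace
    // and optional "words" of at most two non-whitespace characters.
    private static readonly Regex Pattern = new Regex(
        @"\bMar\w*(\s+\S{0,2})*\s+Jac\w*(\s+\S{0,2})*\s+Rey\w*");

    public static bool Matches(string text) => Pattern.IsMatch(text);
}
```

Under these assumptions, "Marck Jacobs L. S. Reynolds", "Marcus Jacobine Reys" and "Maroon Jacqueline by Reyils" should all match, while "Marcus Jacobine Should Not Match Reys" should not, because "Should" is longer than two characters.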
I think what you want is this regular expression, used with case-insensitive matching, to check whether the text starts with one of the prefixes:
@"^(Mar|Jac|Rey)"
Less specific:
@"^\w{3}"
If you want to capture the first three letters of every word longer than three characters, you could use something like:
((?<name>[\w]{3})\w+)+
And enable ExplicitCapture when initializing your Regex.
It will return a series of Match objects, each with a group named "name" that holds one result.
Code sample:
Regex regex = new Regex(@"((?<name>[\w]{3})\w+)+", RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);
var matches = regex.Matches("Marck Jacobs L. S. Reynolds");
If you also want to capture 3-character words, you can replace the last "\w" with a space. In that case, remember to handle the last word of the phrase.

Why doesn't this regular expression using word separator (\b) match the example in .Net?

Should be simple enough, but this thing not working is baffling me, any insight into why is greatly appreciated.
I'm trying to match any instances of an abbreviated word with any number of trailing '.','/' or '-'. Notice I'm using a '\b' to try to grab the whole 'word' including the trailing characters mentioned above but not any following characters (it also has the advantage of matching against the end of the line or string).
I'm using the following expression:
(?<target>\bLLC[\./\-]+\b)
As an example, I'm trying to make it match this:
Ace Charter High School LLC. East Liberty
I want the expression to select 'LLC.', but instead it finds no matches and I don't know why.
I've tried debugging the expression using RegexBuddy, and it works if I remove the trailing '\b', but that's not what I want, as I explained before.
Does anyone have any idea why this isn't working?
There is no word boundary that matches the last \b.
The closest word boundaries are after LLC and before East, and your pattern doesn't allow for the last \b to be at either of those places.
Try
(?<target>\bLLC[\./\-]+)\s*\b
This allows the whitespace preceding the word boundary (which is between the space and E as Guffa points out) without including those spaces in the match group "target".
On the other hand, matching a word boundary after the . isn't gaining you much, since punctuation is going to cause a word boundary unless it's followed by other punctuation.
I've had good responses that pointed me in the right direction but none really proposed an alternative to using '\b' that had the same effect in terms of what is being targeted and that will match separator characters as well as the end of the string.
As Guffa pointed out, the issue is that I was using '\b' as a way to select any separator character or the end of the string at the position before that separator, when in reality it performs as what it represents: a word separator. Since my selector was already in a position outside a word, it doesn't match, as this position (after the '.') is neither the beginning of a word nor the end of one. Hence there are no matches in the whole string, because a '\b' after the target is still required for the match.
I've finally settled for using the following expression:
(?<target>\bLLC[\./\-]+)([^a-zA-Z0-9]|$)
This matches any non alphanumeric character as well as the end of string and will match the 'target' group without any of the separating characters before or after producing the same effect I wanted in the first place.
Thanks again for the responses and hopefully this will help others in a similar problem
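A small sketch comparing the original expression with the one finally settled on above. The class, constant and method names are illustrative.

```csharp
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    // Original attempt: requires a word boundary right after the punctuation.
    public const string WithTrailingBoundary = @"(?<target>\bLLC[\./\-]+\b)";

    // Final version: requires a non-alphanumeric character or end of string.
    public const string WithSeparator = @"(?<target>\bLLC[\./\-]+)([^a-zA-Z0-9]|$)";

    // Returns the "target" group, or null when the pattern does not match.
    public static string Target(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups["target"].Value : null;
    }
}
```

For "Ace Charter High School LLC. East Liberty", the first pattern should give no match (there is no word boundary between '.' and the space), while the second should yield "LLC." both there and when "LLC." ends the string.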

Simple C# regex

I have a regex I need to match against a path like so: "C:\Documents and Settings\User\My Documents\ScanSnap\382893.pd~". I need a regex that matches all paths except those ending in '~' or '.dat'. The problem I am having is that I don't understand how to match and negate the exact string '.dat' and only at the end of the path. i.e. I don't want to match {d,a,t} elsewhere in the path.
I have built the regex, but need to not match .dat
[\w\s:\.\\]*[^~]$[^\.dat]
[\w\s:\.\\]* This matches all word characters, whitespace, the colon, periods, and backslashes.
[^~]$[^\.dat]$ This causes matches ending in '~' to fail. It seems that I should be able to follow up with a negated match for '.dat', but the match fails in my regex tester.
I think my answer lies in grouping judging from what I've read, would someone point me in the right direction? I should add, I am using a file watching program that allows regex matching, I have only one line to specify the regex.
This entry seems similar: Regex to match multiple strings
You want to use a negative look-ahead:
^((?!\.dat$)[\w\s:\.\\])*$
By the way, your character group ([\w\s:\.\\]) doesn't allow a tilde (~) in it. Did you intend to allow a tilde in the filename if it wasn't at the end? If so:
^((?!~$|\.dat$)[\w\s:\.\\~])*$
The following regex:
^.*(?<!\.dat|~)$
matches any string that does NOT end with a '~' or with '.dat'.
^ # the start of the string
.* # gobble up the entire string (without line terminators!)
(?<!\.dat|~) # looking back, there should not be '.dat' or '~'
$ # the end of the string
In plain English: match a string only when looking behind from the end of the string, there is no sub-string '.dat' or '~'.
Edit: the reason why your attempt failed is because a negated character class, [^...] will just negate a single character. A character class always matches a single character. So when you do [^.dat], you're not negating the string ".dat" but you're matching a single character other than '.', 'd', 'a' or 't'.
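A minimal sketch of the negative look-behind pattern above applied to file paths. The class and method names are hypothetical; the file watcher from the question is not modelled here.

```csharp
using System.Text.RegularExpressions;

public static class PathFilter
{
    // Matches any single-line string that does NOT end in ".dat" or "~".
    private static readonly Regex NotDatOrTilde =
        new Regex(@"^.*(?<!\.dat|~)$");

    public static bool ShouldWatch(string path) => NotDatOrTilde.IsMatch(path);
}
```

Under this sketch, the path from the question ending in "382893.pd~" and any path ending in ".dat" should be rejected, while a path such as @"C:\dat\notes.txt" should be accepted, since the letters d, a, t are only excluded as the exact suffix ".dat".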
^((?!\.dat$)[\w\s:\.\\])*$
This is just a comment on an earlier answer suggestion:
. within a character class, [], is a literal . and does not need escaping.
^((?!\.dat$)[\w\s:.\\])*$
I'm sorry to post this as a new solution, but I apparently don't have enough credibility to simply comment on an answer yet.
I believe you are looking for this:
[\w\s:\.\\]*([^~]|[^\.dat])$
which finds, like before, all word chars, white space, periods (.), back slashes. Then matches for either tilde (~) or '.dat' at the end of the string. You may also want to add a caret (^) at the very beginning if you know that the string should be at the beginning of a new line.
^[\w\s:\.\\]*([^~]|[^\.dat])$
