Skip text between quotes in Regex - c#

I have the following Regex using IgnoreCaseand Multilinein .NET:
^.*?\WtextToFind\W.*?$
Given a multiline input like:
1 Some Random Text textToFind
2 Some more "textToFind" random text
3 Another textToFinddd random text
The current regular expression matches with the lines 1 and 2. However I need to skip all the lines which textToFindis inside quotes and double quotes.
Any ideas how to achieve this?
Thanks!
EDIT:
Explanation: My purpose is to find some method calls inside VBScript code. I thought this would be irrelevant for my question, but after reading the comments I realised I should explain this.
So basically I want to skip text that is between quotes or single quotes and all the text that is between a quote and the end of line since that would be a comment in VBScript:
If I'm looking for myFunc
Call myFunc("parameter") // should match
Call anotherFunc("myFunc") //should not match
Call someFunc("parameter") 'Comment myFunc //should not match
If(myFunc("parameter") And someFunc("myFunc")) //should match

With all of the possible cases involving mixed sets of quotes, a regex may not be your best option here. What you could do instead (after using your current regex to filter for everything but quotes), is count the number of quotes before and after the occurrence of textToFind. If both counts are odd, then you have quotes around your keyword and should scrap the line. If both are even, you've got matched quotes elsewhere (or no quotes at all), and should keep the line. Then repeat the process for double quotes. You could do all this only walking through the string once.
Edit to address the update that you're searching through code:
There are some additional considerations to take into account.
Escaped quotes (skip over the character after an escape character, and it won't be counted).
Commented quotes, e.g. /* " */ in the middle of a line. When you hit a /*, just jump to the next occurrence of */ and then continue inspecting characters. You may also want to check whether the occurrence of textToFind is in a comment.
End-of-line ' quotes - if it occurs (outside a literal string) before the keyword, it's not a valid method call.
The bottom line is still that regexes aren't the droids you're looking for, here. You're better off walking through lines and parsing them.

It seems like this should work for your actual implementation in all the examples you've given:
/\bmyFunc\(/
Demonstration - view console.
as long as you don't have something like "i'm going to call myFunc()", but if you start trying to deal with quotes, multiple quotes, nested quotes, etc... it will get very messy (like trying to parse dom with regex).
Also, it appears that you are checking within vbscript code. Comments in vbscript code start with an ', right? You could check this as well, as it looks like you are doing this on a line by line basis, this should work for those type of comments:
/^\s*[^'].*\bmyFunc\(/
Demo

Related

Get what is in string from one quotation mark to other [duplicate]

I have a value like this:
"Foo Bar" "Another Value" something else
What regex will return the values enclosed in the quotation marks (e.g. Foo Bar and Another Value)?
In general, the following regular expression fragment is what you are looking for:
"(.*?)"
This uses the non-greedy *? operator to capture everything up to but not including the next double quote. Then, you use a language-specific mechanism to extract the matched text.
In Python, you could do:
>>> import re
>>> string = '"Foo Bar" "Another Value"'
>>> print re.findall(r'"(.*?)"', string)
['Foo Bar', 'Another Value']
I've been using the following with great success:
(["'])(?:(?=(\\?))\2.)*?\1
It supports nested quotes as well.
For those who want a deeper explanation of how this works, here's an explanation from user ephemient:
([""']) match a quote; ((?=(\\?))\2.) if backslash exists, gobble it, and whether or not that happens, match a character; *? match many times (non-greedily, as to not eat the closing quote); \1 match the same quote that was use for opening.
I would go for:
"([^"]*)"
The [^"] is regex for any character except '"'
The reason I use this over the non greedy many operator is that I have to keep looking that up just to make sure I get it correct.
Lets see two efficient ways that deal with escaped quotes. These patterns are not designed to be concise nor aesthetic, but to be efficient.
These ways use the first character discrimination to quickly find quotes in the string without the cost of an alternation. (The idea is to discard quickly characters that are not quotes without to test the two branches of the alternation.)
Content between quotes is described with an unrolled loop (instead of a repeated alternation) to be more efficient too: [^"\\]*(?:\\.[^"\\]*)*
Obviously to deal with strings that haven't balanced quotes, you can use possessive quantifiers instead: [^"\\]*+(?:\\.[^"\\]*)*+ or a workaround to emulate them, to prevent too much backtracking. You can choose too that a quoted part can be an opening quote until the next (non-escaped) quote or the end of the string. In this case there is no need to use possessive quantifiers, you only need to make the last quote optional.
Notice: sometimes quotes are not escaped with a backslash but by repeating the quote. In this case the content subpattern looks like this: [^"]*(?:""[^"]*)*
The patterns avoid the use of a capture group and a backreference (I mean something like (["']).....\1) and use a simple alternation but with ["'] at the beginning, in factor.
Perl like:
["'](?:(?<=")[^"\\]*(?s:\\.[^"\\]*)*"|(?<=')[^'\\]*(?s:\\.[^'\\]*)*')
(note that (?s:...) is a syntactic sugar to switch on the dotall/singleline mode inside the non-capturing group. If this syntax is not supported you can easily switch this mode on for all the pattern or replace the dot with [\s\S])
(The way this pattern is written is totally "hand-driven" and doesn't take account of eventual engine internal optimizations)
ECMA script:
(?=["'])(?:"[^"\\]*(?:\\[\s\S][^"\\]*)*"|'[^'\\]*(?:\\[\s\S][^'\\]*)*')
POSIX extended:
"[^"\\]*(\\(.|\n)[^"\\]*)*"|'[^'\\]*(\\(.|\n)[^'\\]*)*'
or simply:
"([^"\\]|\\.|\\\n)*"|'([^'\\]|\\.|\\\n)*'
Peculiarly, none of these answers produce a regex where the returned match is the text inside the quotes, which is what is asked for. MA-Madden tries but only gets the inside match as a captured group rather than the whole match. One way to actually do it would be :
(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)
Examples for this can be seen in this demo https://regex101.com/r/Hbj8aP/1
The key here is the the positive lookbehind at the start (the ?<= ) and the positive lookahead at the end (the ?=). The lookbehind is looking behind the current character to check for a quote, if found then start from there and then the lookahead is checking the character ahead for a quote and if found stop on that character. The lookbehind group (the ["']) is wrapped in brackets to create a group for whichever quote was found at the start, this is then used at the end lookahead (?=\1) to make sure it only stops when it finds the corresponding quote.
The only other complication is that because the lookahead doesn't actually consume the end quote, it will be found again by the starting lookbehind which causes text between ending and starting quotes on the same line to be matched. Putting a word boundary on the opening quote (["']\b) helps with this, though ideally I'd like to move past the lookahead but I don't think that is possible. The bit allowing escaped characters in the middle I've taken directly from Adam's answer.
The RegEx of accepted answer returns the values including their sourrounding quotation marks: "Foo Bar" and "Another Value" as matches.
Here are RegEx which return only the values between quotation marks (as the questioner was asking for):
Double quotes only (use value of capture group #1):
"(.*?[^\\])"
Single quotes only (use value of capture group #1):
'(.*?[^\\])'
Both (use value of capture group #2):
(["'])(.*?[^\\])\1
-
All support escaped and nested quotes.
I liked Eugen Mihailescu's solution to match the content between quotes whilst allowing to escape quotes. However, I discovered some problems with escaping and came up with the following regex to fix them:
(['"])(?:(?!\1|\\).|\\.)*\1
It does the trick and is still pretty simple and easy to maintain.
Demo (with some more test-cases; feel free to use it and expand on it).
PS: If you just want the content between quotes in the full match ($0), and are not afraid of the performance penalty use:
(?<=(['"])\b)(?:(?!\1|\\).|\\.)*(?=\1)
Unfortunately, without the quotes as anchors, I had to add a boundary \b which does not play well with spaces and non-word boundary characters after the starting quote.
Alternatively, modify the initial version by simply adding a group and extract the string form $2:
(['"])((?:(?!\1|\\).|\\.)*)\1
PPS: If your focus is solely on efficiency, go with Casimir et Hippolyte's solution; it's a good one.
A very late answer, but like to answer
(\"[\w\s]+\")
http://regex101.com/r/cB0kB8/1
The pattern (["'])(?:(?=(\\?))\2.)*?\1 above does the job but I am concerned of its performances (it's not bad but could be better). Mine below it's ~20% faster.
The pattern "(.*?)" is just incomplete. My advice for everyone reading this is just DON'T USE IT!!!
For instance it cannot capture many strings (if needed I can provide an exhaustive test-case) like the one below:
$string = 'How are you? I\'m fine, thank you';
The rest of them are just as "good" as the one above.
If you really care both about performance and precision then start with the one below:
/(['"])((\\\1|.)*?)\1/gm
In my tests it covered every string I met but if you find something that doesn't work I would gladly update it for you.
Check my pattern in an online regex tester.
This version
accounts for escaped quotes
controls backtracking
/(["'])((?:(?!\1)[^\\]|(?:\\\\)*\\[^\\])*)\1/
MORE ANSWERS! Here is the solution i used
\"([^\"]*?icon[^\"]*?)\"
TLDR;
replace the word icon with what your looking for in said quotes and voila!
The way this works is it looks for the keyword and doesn't care what else in between the quotes.
EG:
id="fb-icon"
id="icon-close"
id="large-icon-close"
the regex looks for a quote mark "
then it looks for any possible group of letters thats not "
until it finds icon
and any possible group of letters that is not "
it then looks for a closing "
I liked Axeman's more expansive version, but had some trouble with it (it didn't match for example
foo "string \\ string" bar
or
foo "string1" bar "string2"
correctly, so I tried to fix it:
# opening quote
(["'])
(
# repeat (non-greedy, so we don't span multiple strings)
(?:
# anything, except not the opening quote, and not
# a backslash, which are handled separately.
(?!\1)[^\\]
|
# consume any double backslash (unnecessary?)
(?:\\\\)*
|
# Allow backslash to escape characters
\\.
)*?
)
# same character as opening quote
\1
string = "\" foo bar\" \"loloo\""
print re.findall(r'"(.*?)"',string)
just try this out , works like a charm !!!
\ indicates skip character
Unlike Adam's answer, I have a simple but worked one:
(["'])(?:\\\1|.)*?\1
And just add parenthesis if you want to get content in quotes like this:
(["'])((?:\\\1|.)*?)\1
Then $1 matches quote char and $2 matches content string.
All the answer above are good.... except they DOES NOT support all the unicode characters! at ECMA Script (Javascript)
If you are a Node users, you might want the the modified version of accepted answer that support all unicode characters :
/(?<=((?<=[\s,.:;"']|^)["']))(?:(?=(\\?))\2.)*?(?=\1)/gmu
Try here.
My solution to this is below
(["']).*\1(?![^\s])
Demo link : https://regex101.com/r/jlhQhV/1
Explanation:
(["'])-> Matches to either ' or " and store it in the backreference \1 once the match found
.* -> Greedy approach to continue matching everything zero or more times until it encounters ' or " at end of the string. After encountering such state, regex engine backtrack to previous matching character and here regex is over and will move to next regex.
\1 -> Matches to the character or string that have been matched earlier with the first capture group.
(?![^\s]) -> Negative lookahead to ensure there should not any non space character after the previous match
echo 'junk "Foo Bar" not empty one "" this "but this" and this neither' | sed 's/[^\"]*\"\([^\"]*\)\"[^\"]*/>\1</g'
This will result in: >Foo Bar<><>but this<
Here I showed the result string between ><'s for clarity, also using the non-greedy version with this sed command we first throw out the junk before and after that ""'s and then replace this with the part between the ""'s and surround this by ><'s.
From Greg H. I was able to create this regex to suit my needs.
I needed to match a specific value that was qualified by being inside quotes. It must be a full match, no partial matching could should trigger a hit
e.g. "test" could not match for "test2".
reg = r"""(['"])(%s)\1"""
if re.search(reg%(needle), haystack, re.IGNORECASE):
print "winning..."
Hunter
If you're trying to find strings that only have a certain suffix, such as dot syntax, you can try this:
\"([^\"]*?[^\"]*?)\".localized
Where .localized is the suffix.
Example:
print("this is something I need to return".localized + "so is this".localized + "but this is not")
It will capture "this is something I need to return".localized and "so is this".localized but not "but this is not".
A supplementary answer for the subset of Microsoft VBA coders only one uses the library Microsoft VBScript Regular Expressions 5.5 and this gives the following code
Sub TestRegularExpression()
Dim oRE As VBScript_RegExp_55.RegExp '* Tools->References: Microsoft VBScript Regular Expressions 5.5
Set oRE = New VBScript_RegExp_55.RegExp
oRE.Pattern = """([^""]*)"""
oRE.Global = True
Dim sTest As String
sTest = """Foo Bar"" ""Another Value"" something else"
Debug.Assert oRE.test(sTest)
Dim oMatchCol As VBScript_RegExp_55.MatchCollection
Set oMatchCol = oRE.Execute(sTest)
Debug.Assert oMatchCol.Count = 2
Dim oMatch As Match
For Each oMatch In oMatchCol
Debug.Print oMatch.SubMatches(0)
Next oMatch
End Sub

Remove all Trailing <br> Using Regex, Substitute Group not Returning Full Match

Here is the problem. I have a block of pasted html text. I need to remove trailing line breaks and white space from the text. Even ones proceeded by closing tags. The below text is simply an example, and actually closely represents the real text I'm dealing with.
EG:
This:
<span>Here is some<br></span><br>
<span><span>Here is some text</span><br><span><br> </span></span><br><br>
Becomes this:
<span>Here is some<br></span><br>
<span><span>Here is some text<span></span></span>
My first pass. I use this: Regex.Replace(htmlString, #"(?:\<br\s*?\>)*$", "") to get rid of the trailing line breaks. Now all I have left is the line breaks stuck behind closing tags and white space.
I'm attempting to use this:
While(Regex.IsMatch(#"(<br>|\s| )*(<[^>]*>)*$")
{
Regex.Replace(htmlString, #"(<br>|\s| )*(<[^>]*>)*$", $2)
}
The regex pattern is actually working great, the problem is that the substitute by matched group 2 is only giving back a single closing span. So that I end up with the below:
<span>Here is some<br></span><br>
<span><span>Here is some text</span></span>
The regular expression is in #"(<br>|\s| )*(<[^>]*>)*$". The second group is followed by a * meaning the group is repeated and so the $2 only yields one repetition of the group.
Putting the repetition in a group will capture the whole repetition. Change the regular expression to be #"(<br>|\s| )*((<[^>]*>)*)$".
Note that repeating the first group with a * may make the code spin on some input strings as there no guarantee that the Replace will change the text to a different string. As the first group is optional (ie zero or more repeats) the Replace might replace one string with exactly the same string. So I suggest changing the regular expression to be #"(<br>|\s| )+((<[^>]*>)*)$" meaning that one or more occurrences of the first group are required.
I guess you can use:
resultString = Regex.Replace(subjectString, #"<br>| |\n", "");
Regex Demo

Regular Expression to ignore all text between 2 points in .NET

Start code is "abc123" now, this is irrelevant text and characters,
possibly new lines and line breaks, quotes etc. Finish.
I would like my regular expression to find a match where there is text that begins with 'Start code is', retrieves the value that follows within the quotes (in this case abc123), but then ignores all that immediately follows that closing quote UP UNTIL the word 'Finish'.
So if I had the following text:
This is some dummy text. Start code is "abc123" now, this is
irrelevant text and characters, possibly new lines and line
breaks, quotes etc. Finish. And this is some more dummy text.
... the match would be successful, and I would be able to retrieve the value abc123. That is the only part that I actually need to use.
var matches = Regex.Matches(str, ".*Start code.[^\"]*\"([^\"]*)\"[^\"]*.*Finish\\..*");
if (matches.Count > 0)
{
result = matches[0].Groups[1];
}
Basically, you just need to check the results of Regex.Matches. The only "trick" used here, is the fact that I'm limiting myself for looking for non-quote characters, both before, between and after the quotes.
You can use this pattern with lookarounds:
#"(?<=Start code is "")[^""]+(?="")"
or a simple capturing group:
#"Start code is ""([^""]+)"""

Regex matching only when condition has not been met

I have kind of a weird problem that I am attempting to resolve with some elegant regular expressions.
The system I am working on was initially designed to accept an incoming string and through a pattern matching method, alter the string which it then returns. A very simplistic example is:
Incoming string:
The dog & I went to the park and had a great time...
Outgoing string:
The dog {&} I went to the park and had a great time {...}
The punctuation mapper wraps key characters or phrases and wraps them in curly braces. The original implementation was a one way street and was never meant for how it is currently being applied and as a result, if it is called incorrectly, it is very easy for the system to "double" wrap a string as it is just doing a simple string replace.
I spun up Regex Hero this morning and started working on some pattern matches and having not written a regular expression in nearly a year, quickly hit a wall.
My first idea was to match a character (i.e. &) but only if it wasn't wrapped in braces and came up with [^\{]&[^\}], which is great but of course catches any instance of the ampersand so long as it is not preceded by a curly brace, including white spaces and would not work in a situation where there were two ampersands back to back (i.e. && would need to be {&}{&} in the outgoing string. To make matters more complicated, it is not always a single character as ellipsis (...) is also one of the mapped values.
Every solution I noodle over either hits a barrier because there is an unknown number of occurrences of a particular value in the string or that the capture groups will either be too greedy or finally, cannot compensate for multiple values back to back (i.e. a single period . vs ellipsis ...) which the original dev handled by processing ellipsis first which covered the period in the string replace implementation.
Are there any regex gurus out there that have any ideas on how I can detect the undecorated (unwrapped) values in a string and then perform their replacements in an ungreedy fashion that can also handle multiple repeated characters?
My datasource that I am working against is a simple key value pair that contains the value to be searched for and the value to replace it with.
Updated with example strings:
Undecorated:
Show Details...
Default Server:
"Smart" 2-Way
Show Lender's Information
Black & White
Decorated:
Show Details{...}
Default Server{:}
{"}Smart{"} 2-Way
Show Lender{'}s Information
Black {&} White
Updated With More Concrete Examples and Datasource
Datasource (SQL table, can grow at any time):
TaggedValue UntaggedValue
{:} :
{&} &
{<} <
{$} $
{'} '
{} \
{>} >
{"} "
{%} %
{...} ...
{...} …
{:} :
{"} “
{"} ”
{'} `
{'} ’
Broken String: This is a string that already has stuff {&} other stuff{!} and {...} with {_} and {#} as well{.} and here are the same characters without it & follow by ! and ... _ & . &&&
String that needs decoration: Show Details... Default Server: "Smart" 2-Way Show Lender's Information Black & White
String that would pass through the method untouched (because it was already decorated): The dog {&} I went to the park and had a great time {...}
The other "gotcha" in moving to regex is the need to handle escaping, especially of backslashes elegantly due to their function in regular expressions.
Updated with output from #Ethan Brown
#Ethan Brown,
I am starting think that regex, while elegant might not be the way to go here. The updated code you provided, while closer still does not yield correct results and the number of variables involved may exceed the regex logics capability.
Using my example above:
'This is a string that already has stuff {&} other stuff{!} and {...} with {_} and {#} as well{.} and here are the same characters without it & follow by ! and ... _ & . &&&'
yields
This is a string that already has stuff {&} other stuff{!} and {...} with {_} and {#} as well{.} and here are the same characters without it {&} follow by {!} and {...} {_} {&} . {&&}&
Where the last group of ampersands which should come out as {&}{&}{&} actually comes out as {&&}&.
There is so much variability here (i.e. need to handle ellipsis and wide ellipsis from far east languages) and the need to utilize a database as the datasource is paramount.
I think I am just going to write a custom evaluator which I can easily enough write to perform this type of validation and shelve the regex route for now. I will grant you credit for your answer and work as soon as I get in front of a desktop browser.
This kind of problem can be really tough, but let me give you some ideas that might help out. One thing that's really going to give you headaches is handling the case where the punctuation appears at the beginning or end of the string. Certainly that's possible to handle in a regex with a construct like (^|[^{])&($|[^}]), but in addition to that being painfully hard to read, it also has efficiency issues. However, there's a simple way to "cheat" and get around this problem: just pad your input string with a space on either end:
var input = " " + originalInput + " ";
When you're done you can just trim. Of course if you care about preserving input at the beginning or end, you'll have to be more clever, but I'm going to assume for argument's sake that you don't.
So now on to the meat of the problem. Certainly, we can come up with some elaborate regular expressions to do what we're looking for, but often the answer is much much simpler if you use more than one regular expression.
Since you've updated your answer with more characters, and more problem inputs, I've updated this answer to be more flexible: hopefully it will meet your needs better as more characters get added.
Looking over your input space, and the expressions you need quoted, there are really three cases:
Single-character replacements (! becomes {!}, for example).
Multi-character replacements (... becomes {...}).
Slash replacement (\ becomes {})
Since the period is included in the single-character replacements, order matters: if you replace all the periods first, then you will miss ellipses.
Because I find the C# regex library a little clunky, I use the following extension method to make this more "fluent":
public static class StringExtensions {
public static string RegexReplace( this string s, string regex, string replacement ) {
return Regex.Replace( s, regex, replacement );
}
}
Now I can cover all of the cases:
// putting this into a const will make it easier to add new
// characters in the future
const string normalQuotedChars = #"\!_\\:&<\$'>""%:`";
var output = s
.RegexReplace( "(?<=[^{])\\.\\.\\.(?=[^}])", "{$&}" )
.RegexReplace( "(?<=[^{])[" + normalQuotedChars + "](?=[^}])", "{$&}" )
.RegexReplace( "\\\\", "{}" );
So let's break this solution down:
First we handle the ellipses (which will keep us from getting in trouble with periods later). Note that we use a zero-width assertions at the beginning and end of the expression to exclude expressions that are already quoted. The zero-width assertions are necessary, because without them, we'd get into trouble with quoted characters right next to each other. For example, if you have the regex ([^{])!([^}]), and your input string is foo !! bar, the match would include the space before the first exclamation point and the second exclamation point. A naive replacement of $1!$2 would therefore yield foo {!}! bar because the second exclamation point would have been consumed as part of the match. You'd have to end up doing an exhaustive match, and it's much easier to just use zero-width assertions, which are not consumed.
Then we handle all of the normal quoted characters. Note that we use zero-width assertions here for the same reasons as above.
Finally, we can find lone slashes (note we have to escape it twice: once for C# strings and again for regex metacharacters) and replace that with empty curly brackets.
I ran all of your test cases (and a few of my own invention) through this series of matches, and it all worked as expected.
I'm no regex god, so one simple way:
Get / construct the final replacement string(s) - ex. "{...}", "{&}"
Replace all occurrences of these in the input with a reserved char (unicode to the rescue)
Run your matching regex(es) and put "{" or whatever desired marker(s).
Replace reserved char(s) with the original string.
Ignoring the case where your original input string has a { or } character, a common way to avoid re-applying a regex to an already-escaped string is to look for the escape sequence and remove it from the string before applying your regex to the remainders. Here's an example regex to find things that are already escaped:
Regex escapedPattern = new Regex(#"\{[^{}]*\}"); // consider adding RegexOptions.Compiled
The basic idea of this negative-character class pattern comes from regular-expressions.info, a very helpful site for all thing regex. The pattern works because for any inner-most pair of braces, there must be a { followed by non {}'s followed by a }
Run the escapedPattern on the input string, find for each Match get the start and end indices in the original string and substring them out, then with the final cleaned string run your original pattern match again or use something like the following:
Regex punctPattern = new Regex(#"[^\w\d\s]+"); // this assumes all non-word,
// digit or space chars are punctuation, which may not be a correct
//assumption
And replace Match.Groups[1].Value for each match (groups are a 0 based array where 0 is the whole match, 1 is the first set of parentheses, 2 is the next etc.) with "{" + Match.Groups[1].Value + "}"

Regex - Lookahead for pattern through newlines

Still learning Regex, and am having trouble getting my head wrapped around the lookahead concept. Similar data to my question here - Matching multiple lines up until a sepertor line? , say I have the following lines handed to me by the user:
0000AA.The horizontal coordinates are valid at the epoch date displayed above.
0000AA.The epoch date for horizontal control is a decimal equivalence
0000AA.of Year/Month/Day.
0000AA
[..]
So a really simple Regex is #^[0-9]{4}[A-Z]{2}\.(?<noteline>.*), where would give me every line. Fantastic. :) However, I'd like a lookahead (or a condition?) that would look at the next line and tell me if the line has the code WITHOUT a '.'. (i.e. If the NEXT line would match #^[0-9]{4}[A-Z]{2}[^\.]
Trying the lookahead, I get hits on the first two lines (because the following line has '.' after the code) but not on the last.
Edit: Using the regex above, or the one offered below gives me all lines, but I'd like to know IF a blank line (line with AA0000 code, but no '.' afterwards) follows. For example, when I get to the match on the line of Year/Month/Day, I'd like to know IF that line is followed by a blank line (or not). (Like with a grouping name that's not spaces or empty, for high-level example.)
Edit 2: I may be mis-using the 'lookahead' term. Going back over .NET's regex, I see something referred to as a Alternation Construct, but not sure if that could be used here.
Thanks! Mike.
Apply the option RegexOptions.Multiline. It changes the meaning of ^ and $ making them match the beginning and the end of ervery line instead the beginning and end of the entire string.
var matches = Regex.Matches(input,
#"^[0-9]{4}[A-Z]{2}\..*$?(?!^[0-9]{4}[A-Z]{2}[^.])",
RegexOptions.Multiline);
The negative look ahead is
find(?!suffix)
It matches a position not preceeding a suffix. Don't escape the dot within the brackets [ ]. The bracket disables the special meaning of most characters anyway.
I also added .*$? making the pattern match until the end of the current line. The ? is required in order make * lazy. Otherwise it is greedy, meaning that will try to get as many characters as possible and possibly match several lines at a time.
If you need only the number part, you can capture it in a group by enclosing it within parentheses.
(^[0-9]{4}[A-Z]{2})\..*$?(?!^[0-9]{4}[A-Z]{2}[^.])
You can then get the group like this
string number = match.Groups[1].Value;
Note: Group #0 represents the entire match.
After doing a lot of research, and hit and misses, I'm certain now that it can't be done - or, rather - it CAN be but would be prohibitively difficult - easier to do it in code.
To refrain, I was looking at a multiline string (document), where every line was preceded by a 6-digit code. Some lines - the lines I'm interested in - have a '.' after the 6-digit code, and then open text. I was hoping there would be a way to get me each line in a group, along with a flag letting me know if the next line has no free-text entry. (No '.' after the 6-digit code.) I.e. Two line data entry would give me two matches on the document. First match would have the line's text in the group called 'notetext', and the group 'lastline' would be empty. The second line would have the second part of the entered note in 'notetext', and the group 'lastline' would have something (anything, content wouldn't matter.)
From what I understand, lookaheads are zero-width assertions, so that if it matches, the returnable value is still empty. Without using lookahead, the match for 'lastline' would consume the next line's code, making the 'notetext' skip that line (giving me every other line of text.) So, I would need to have some back-reference to revert back to.
By this time, it'd be easier (code-wise) to simply get all the lines, and add up text until I get to the end of their notes. (Looping over then entire document, which can't be more than 200 lines as opposed to looping through the regex-matched lines, and the ease of reading the code for future modifications would out-weigh any slight speed advantage the regex could get me.
Thanks guys -
-Mike.

Categories