Regex - Lookahead for pattern through newlines

Still learning Regex, and am having trouble getting my head wrapped around the lookahead concept. Similar data to my question here - Matching multiple lines up until a sepertor line? , say I have the following lines handed to me by the user:
0000AA.The horizontal coordinates are valid at the epoch date displayed above.
0000AA.The epoch date for horizontal control is a decimal equivalence
0000AA.of Year/Month/Day.
0000AA
[..]
So a really simple Regex is @"^[0-9]{4}[A-Z]{2}\.(?<noteline>.*)", which would give me every line. Fantastic. :) However, I'd like a lookahead (or a condition?) that would look at the next line and tell me if the line has the code WITHOUT a '.' (i.e. if the NEXT line would match @"^[0-9]{4}[A-Z]{2}[^\.]").
Trying the lookahead, I get hits on the first two lines (because the following line has '.' after the code) but not on the last.
Edit: Using the regex above, or the one offered below gives me all lines, but I'd like to know IF a blank line (line with AA0000 code, but no '.' afterwards) follows. For example, when I get to the match on the line of Year/Month/Day, I'd like to know IF that line is followed by a blank line (or not). (Like with a grouping name that's not spaces or empty, for high-level example.)
Edit 2: I may be misusing the 'lookahead' term. Going back over .NET's regex, I see something referred to as an Alternation Construct, but I'm not sure if that could be used here.
Thanks! Mike.

Apply the option RegexOptions.Multiline. It changes the meaning of ^ and $, making them match the beginning and the end of every line instead of the beginning and end of the entire string.
var matches = Regex.Matches(input,
@"^[0-9]{4}[A-Z]{2}\..*?$(?!\r?\n[0-9]{4}[A-Z]{2}[^.])",
RegexOptions.Multiline);
The negative lookahead is
find(?!suffix)
It matches a position not preceding a suffix. Because $ leaves the position in front of the line break, the lookahead has to step over \r?\n before it can test the next line. Don't escape the dot within the brackets [ ]. The brackets disable the special meaning of most characters anyway.
I also added .*?$, making the pattern match until the end of the current line. The ? makes the * lazy; it has to follow the * directly, because written after the $ it would make the anchor optional instead. Without it the * is greedy and tries to take as many characters as possible, although . still stops at the line feed unless RegexOptions.Singleline is set.
If you need only the number part, you can capture it in a group by enclosing it within parentheses.
(^[0-9]{4}[A-Z]{2})\..*?$(?!\r?\n[0-9]{4}[A-Z]{2}[^.])
You can then get the group like this
string number = match.Groups[1].Value;
Note: Group #0 represents the entire match.
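A minimal sketch of the negative lookahead reaching across the line break, run against the sample lines from the question. It assumes the input ends with a line break, and it is written with top-level statements (C# 9 or later); the variable names are illustrative. Only note lines that are not followed by a bare code line are returned.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample input taken from the question; "\n" line endings are an assumption.
string input =
    "0000AA.The horizontal coordinates are valid at the epoch date displayed above.\n" +
    "0000AA.The epoch date for horizontal control is a decimal equivalence\n" +
    "0000AA.of Year/Month/Day.\n" +
    "0000AA\n";

// $ pins the match to the end of the current line. The lookahead then steps
// over the line break and rejects the match when the next line holds a code
// that is not followed by '.'.
string pattern =
    @"^[0-9]{4}[A-Z]{2}\.(?<noteline>.*)$(?!\r?\n[0-9]{4}[A-Z]{2}[^.])";

var notes = Regex.Matches(input, pattern, RegexOptions.Multiline)
    .Cast<Match>()
    // With "\r\n" input the '.' also matches '\r', so strip it from the value.
    .Select(m => m.Groups["noteline"].Value.TrimEnd('\r'))
    .ToList();

foreach (string note in notes)
    Console.WriteLine(note);
```

The third note line is skipped here because the line after it carries only the code.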

After doing a lot of research, and hit and misses, I'm certain now that it can't be done - or, rather - it CAN be but would be prohibitively difficult - easier to do it in code.
To restate, I was looking at a multiline string (document), where every line was preceded by a 6-digit code. Some lines - the lines I'm interested in - have a '.' after the 6-digit code, and then open text. I was hoping there would be a way to get me each line in a group, along with a flag letting me know if the next line has no free-text entry. (No '.' after the 6-digit code.) I.e. Two line data entry would give me two matches on the document. First match would have the line's text in the group called 'notetext', and the group 'lastline' would be empty. The second line would have the second part of the entered note in 'notetext', and the group 'lastline' would have something (anything, content wouldn't matter.)
From what I understand, lookaheads are zero-width assertions, so that if it matches, the returnable value is still empty. Without using lookahead, the match for 'lastline' would consume the next line's code, making the 'notetext' skip that line (giving me every other line of text.) So, I would need to have some back-reference to revert back to.
By this time, it'd be easier (code-wise) to simply get all the lines, and add up text until I get to the end of their notes. (Looping over the entire document, which can't be more than 200 lines, as opposed to looping through the regex-matched lines, and the ease of reading the code for future modifications would outweigh any slight speed advantage the regex could get me.)
Thanks guys -
-Mike.
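The line-by-line approach described above could look roughly like this. It is only a sketch: ReadNotes is a hypothetical helper, the tuple field names mirror the 'notetext' and 'lastline' groups mentioned earlier, and the code assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: returns each note line together with a flag that is
// true when the following line carries only the code (no '.' and no text).
static List<(string NoteText, bool LastLine)> ReadNotes(string document)
{
    var noteLine = new Regex(@"^[0-9]{4}[A-Z]{2}\.(?<notetext>.*)$");
    var bareCode = new Regex(@"^[0-9]{4}[A-Z]{2}$");
    string[] lines = document.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);

    var result = new List<(string NoteText, bool LastLine)>();
    for (int i = 0; i < lines.Length; i++)
    {
        Match m = noteLine.Match(lines[i]);
        if (!m.Success) continue; // not a note line

        // Peek at the next line instead of asking the regex to do it.
        bool lastLine = i + 1 < lines.Length && bareCode.IsMatch(lines[i + 1]);
        result.Add((m.Groups["notetext"].Value, lastLine));
    }
    return result;
}

string document =
    "0000AA.The epoch date for horizontal control is a decimal equivalence\n" +
    "0000AA.of Year/Month/Day.\n" +
    "0000AA\n";

var notes = ReadNotes(document);
foreach (var (text, last) in notes)
    Console.WriteLine($"{last}: {text}");
```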

Related

Conditional match without false force a match?

I'm using the following regex in c# to match some input cases:
^
(?<entry>[#])?
(?(entry)(?<id>\w+))
(?<value>.*)
$
The options are ignoring pattern whitespaces.
My input looks as follows:
hello
#world
[xxx]
This all can be tested here: DEMO
My problem is that this regex will not match the last line. Why?
What I'm trying to do is to check for an entry character. If it's there I force an identifier by \w+. The rest of the input should be captured in the last group.
This is a simplified regex and simplified input.
The problem can be fixed if I change the id regex to something like (?(entry)(?<id>\w+)|), (?(entry)(?<id>\w+))? or (?(entry)(?<id>\w+)?).
I try to understand why the conditional group doesn't match as stated in original regex.
I'm firm in regex and know that the regex can be simplified to ^(\#(?<id>\w+))?(?<value>.*)$ to match my needs. But the real regex contains two more optional groups:
^
(?<entry>[#])?
(\?\:)?
(\(\?(?:\w+(?:-\w+)?|-\w+)\))?
(?(entry)(?<id>\w+))
(?<value>.*)
$
That's the reason why I'm trying to use a conditional match.
UPDATE 10/12/2018
I tested a little around it. I found the following regex that should match on every input, even an empty one - but it doesn't:
(?(a)a).*
I'm of the opinion that this is a bug in .net regex and reported it to microsoft: See here for more information

There is no error in the regex parser, but in one's usage of the . wildcard specifier. The . specifier will consume all characters, wait for it, except the linefeed character \n. (See "the any character" in Character Classes in Regular Expressions.)
If you want your regex to work you need to consume all characters including the linefeed, and that can be done by specifying the option Singleline. To paraphrase what the documentation says:
Singleline tells the parser to let . match all characters including the \n.
Why does it still fail when not in Singleline mode, given that the other lines are consumed? That is because the final match actually places the current position at the \n, and the only construct left to consume it (as the pattern is written) is the .*, which as we mentioned cannot consume it, hence the parser stops. Also the $ will lock in the operations at this point.
Let me demonstrate what is happening with a tool I have created which illustrates the issue. In the tool the upper left corner is what we see of the example text. Below that is what the parser sees, with \r\n characters represented by ↵¶ respectively. Included in that pane is what happens to be matched at the time, in yellow boxes enclosing the match. The middle box is the actual pattern, and the final right side box shows the match results in detail by listing out the return structures and also showing the white space as mentioned.
Notice the second match (as index 1) has world in group capture id and value as ↵.
I surmise your token processor isn't getting what you want in the proper groups and because one doesn't actually see the successful match of value as the \r, it is overlooked.
Let us turn on Singleline and see what happens.
Now everything is consumed, but there is a different problem. :-)
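The behaviour of . with and without Singleline can be checked with a few lines. This is a small sketch using the input from the question and assuming Windows line endings (\r\n); the variable names are made up for the illustration, and the code uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Sample input from the question; "\r\n" line endings are an assumption.
string input = "hello\r\n#world\r\n[xxx]";

// Default options: '.' stops at '\n' but does match '\r', so the carriage
// return ends up inside the first match.
string firstDefault = Regex.Match(input, ".*").Value;

// Singleline: '.' also matches '\n', so one match spans every line.
string firstSingleline = Regex.Match(input, ".*", RegexOptions.Singleline).Value;

// A character class that excludes both line-break characters keeps the
// carriage return out of the match.
string firstClean = Regex.Match(input, @"[^\r\n]*").Value;

Console.WriteLine(firstDefault.Length);    // length includes the trailing '\r'
Console.WriteLine(firstSingleline.Length); // length of the whole input
Console.WriteLine(firstClean);
```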

Remove all Trailing <br> Using Regex, Substitute Group not Returning Full Match

Here is the problem. I have a block of pasted html text. I need to remove trailing line breaks and white space from the text. Even ones proceeded by closing tags. The below text is simply an example, and actually closely represents the real text I'm dealing with.
EG:
This:
<span>Here is some<br></span><br>
<span><span>Here is some text</span><br><span><br> </span></span><br><br>
Becomes this:
<span>Here is some<br></span><br>
<span><span>Here is some text<span></span></span>
My first pass. I use this: Regex.Replace(htmlString, @"(?:\<br\s*?\>)*$", "") to get rid of the trailing line breaks. Now all I have left is the line breaks stuck behind closing tags and white space.
I'm attempting to use this:
while (Regex.IsMatch(htmlString, @"(<br>|\s| )*(<[^>]*>)*$"))
{
htmlString = Regex.Replace(htmlString, @"(<br>|\s| )*(<[^>]*>)*$", "$2");
}
The regex pattern is actually working great, the problem is that the substitute by matched group 2 is only giving back a single closing span. So that I end up with the below:
<span>Here is some<br></span><br>
<span><span>Here is some text</span></span>

The regular expression is @"(<br>|\s| )*(<[^>]*>)*$". The second group is followed by a *, meaning the group is repeated, and so $2 only yields the last repetition of the group.
Putting the repetition in a group will capture the whole repetition. Change the regular expression to be @"(<br>|\s| )*((<[^>]*>)*)$".
Note that repeating the first group with a * may make the code spin on some input strings, as there is no guarantee that the Replace will change the text to a different string. As the first group is optional (i.e. zero or more repeats), the Replace might replace one string with exactly the same string. So I suggest changing the regular expression to be @"(<br>|\s| )+((<[^>]*>)*)$", meaning that one or more occurrences of the first group are required.
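Put together, the suggestion could be sketched as follows, using the second sample line from the question. The third alternative of the original first group is left out here because \s already covers a plain space; add it back if the real text contains some other whitespace entity. The variable names are illustrative and the code uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Second sample line from the question.
string html =
    "<span><span>Here is some text</span><br><span><br> </span></span><br><br>";

// Group 2 wraps the whole run of trailing tags, so $2 restores all of them.
// The '+' on the first group means every successful replacement removes at
// least one <br> or whitespace character, so the loop must terminate.
var trailing = new Regex(@"(<br>|\s)+((<[^>]*>)*)$");

while (trailing.IsMatch(html))
{
    html = trailing.Replace(html, "$2");
}

Console.WriteLine(html);
```

Note that <[^>]*> also matches <br> itself, which is why the loop is still needed: a <br> swallowed by the tag group in one pass is picked up by the first group in a later pass.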

I guess you can use:
resultString = Regex.Replace(subjectString, @"<br>| |\n", "");

Skip text between quotes in Regex

I have the following Regex using IgnoreCase and Multiline in .NET:
^.*?\WtextToFind\W.*?$
Given a multiline input like:
1 Some Random Text textToFind
2 Some more "textToFind" random text
3 Another textToFinddd random text
The current regular expression matches lines 1 and 2. However, I need to skip all the lines in which textToFind is inside single or double quotes.
Any ideas how to achieve this?
Thanks!
EDIT:
Explanation: My purpose is to find some method calls inside VBScript code. I thought this would be irrelevant for my question, but after reading the comments I realised I should explain this.
So basically I want to skip text that is between quotes or single quotes and all the text that is between a quote and the end of line since that would be a comment in VBScript:
If I'm looking for myFunc
Call myFunc("parameter") // should match
Call anotherFunc("myFunc") //should not match
Call someFunc("parameter") 'Comment myFunc //should not match
If(myFunc("parameter") And someFunc("myFunc")) //should match

With all of the possible cases involving mixed sets of quotes, a regex may not be your best option here. What you could do instead (after using your current regex to filter for everything but quotes), is count the number of quotes before and after the occurrence of textToFind. If both counts are odd, then you have quotes around your keyword and should scrap the line. If both are even, you've got matched quotes elsewhere (or no quotes at all), and should keep the line. Then repeat the process for double quotes. You could do all this only walking through the string once.
Edit to address the update that you're searching through code:
There are some additional considerations to take into account.
Escaped quotes (skip over the character after an escape character, and it won't be counted).
Commented quotes, e.g. /* " */ in the middle of a line. When you hit a /*, just jump to the next occurrence of */ and then continue inspecting characters. You may also want to check whether the occurrence of textToFind is in a comment.
End-of-line ' quotes - if it occurs (outside a literal string) before the keyword, it's not a valid method call.
The bottom line is still that regexes aren't the droids you're looking for, here. You're better off walking through lines and parsing them.
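A rough sketch of such a single pass over one line of VBScript. HasCallOutsideQuotes is a hypothetical helper, and the sketch makes simplifying assumptions: only double quotes delimit strings, a ' outside a string starts a comment, Rem comments are ignored, and a call is recognised as the name followed directly by an opening parenthesis. It uses top-level statements (C# 9 or later).

```csharp
using System;

// Hypothetical helper: true when `name(` occurs on the line outside a
// double-quoted string and before any ' comment.
static bool HasCallOutsideQuotes(string line, string name)
{
    bool inString = false;
    string call = name + "(";
    for (int i = 0; i < line.Length; i++)
    {
        char c = line[i];
        if (c == '"')
        {
            // A doubled "" inside a VBScript string toggles twice, which
            // leaves the state unchanged, so no special case is needed.
            inString = !inString;
        }
        else if (!inString)
        {
            if (c == '\'') return false; // the rest of the line is a comment

            // Only accept the name at a word boundary, so that
            // "notmyFunc(" does not count as a call to myFunc.
            bool wordStart = i == 0 ||
                !(char.IsLetterOrDigit(line[i - 1]) || line[i - 1] == '_');
            if (wordStart && string.Compare(line, i, call, 0, call.Length,
                    StringComparison.OrdinalIgnoreCase) == 0)
                return true;
        }
    }
    return false;
}

// The four sample lines from the question, without the trailing annotations.
Console.WriteLine(HasCallOutsideQuotes("Call myFunc(\"parameter\")", "myFunc"));
Console.WriteLine(HasCallOutsideQuotes("Call anotherFunc(\"myFunc\")", "myFunc"));
Console.WriteLine(HasCallOutsideQuotes("Call someFunc(\"parameter\") 'Comment myFunc", "myFunc"));
Console.WriteLine(HasCallOutsideQuotes("If(myFunc(\"parameter\") And someFunc(\"myFunc\"))", "myFunc"));
```

The comparison is case-insensitive because VBScript identifiers are.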

It seems like this should work for your actual implementation in all the examples you've given:
/\bmyFunc\(/
Demonstration - view console.
as long as you don't have something like "i'm going to call myFunc()", but if you start trying to deal with quotes, multiple quotes, nested quotes, etc... it will get very messy (like trying to parse dom with regex).
Also, it appears that you are checking within vbscript code. Comments in vbscript code start with an ', right? You could check this as well, as it looks like you are doing this on a line by line basis, this should work for those type of comments:
/^\s*[^'].*\bmyFunc\(/

Optimizing a regex expression

I'm having issues where it's taking very long to run a match against this query. I'm trying to match up content that looks like the following:
One or more content paragraph of any length
Here is an optional paragraph
A single line or list item
A single line or list item
Here is my pattern. While it works for short expressions, it fails for longer ones.
^((.+[\r\n]?)+)\r\n\r\n([* -]*(.+)[\r\n]?)+$
My goal really is to separate out the first piece of content into a paragraph, and collect the last items into a list object using the matching pattern. I'm assuming two line breaks separate the paragraph(s) and a set of single-line items (only one line break).
Hope this isn't confusing. How can I optimize this regex? Thanks.

Time-consuming, inefficient backtracking can often be avoided by adding the ? modifier to the * and + quantifiers to make them match lazily or reluctantly, i.e. as few times as possible.
This can be particularly important when the quantifiers follow the . wildcard meta-character.
Try
(.+?)\r\n\r\n(?:[* -]*(.+?)(?:\r\n|$))+
with RegexOptions.Singleline so . matches any character including newlines.
(Alternatively use [\s\S] in place of the first .).
The first capture group will capture all that comes before the consecutive newlines, and then the next capture groups will capture each single line that follows. As in your regex, any leading *, - or space characters in the single lines will not be captured.
The paragraph/s will be match.Groups[1].Value, the first captured single line will be match.Groups[2].Captures[0].Value and the second match.Groups[2].Captures[1].Value, etc.
If the line-endings may be simply \n, change \r\n to \r?\n.
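As a sketch of how the pattern and the Captures collection fit together, assuming \r\n line endings and a made-up sample text (the variable names are illustrative, and the code uses top-level statements, C# 9 or later):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Made-up sample: two paragraph lines, a blank line, then two list items.
string text =
    "First paragraph of any length.\r\n" +
    "Here is an optional paragraph.\r\n" +
    "\r\n" +
    "* A single line or list item\r\n" +
    "- Another single line";

var pattern = new Regex(
    @"(.+?)\r\n\r\n(?:[* -]*(.+?)(?:\r\n|$))+",
    RegexOptions.Singleline);

Match match = pattern.Match(text);

// Group 1: everything before the blank line.
string paragraphs = match.Groups[1].Value;

// Group 2 is repeated, so each list item is one entry in its Captures.
string[] items = match.Groups[2].Captures
    .Cast<Capture>()
    .Select(c => c.Value)
    .ToArray();

Console.WriteLine(paragraphs);
foreach (string item in items)
    Console.WriteLine("item: " + item);
```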

I'm not that good at regex but yours seems quite optimized to me. But to make it faster, use Split instead to separate the paragraph from the list
res = yourstring.Split(new[] { "\r\n\r\n" }, StringSplitOptions.None);
paragraph = res[0];
list=res[1];
then you can use regex or Split again to separate the list items from each other

RegEx Performance Issue

We are having problem with the following regular expression:
(.*?)\|\*\|([0-9]+)\*\|\*(.*?)
It should match things like: |*25 *|
We are using the .NET Framework 4 Regex class; the code is the following:
string expression = "(.*?)" +
Regex.Escape(Constants.FIELD_START_DELIMITER_BACK_END) +
"([0-9]+)" +
Regex.Escape(Constants.FIELD_END_DELIMITER_BACK_END) +
"(.*?)";
Regex r = new Regex(expression);
r.Matches(contentText)
It is taking too long (like 60 seconds) with a 40.000 character text.
But with a text of 180.000 characters the speed is very acceptable (3 sec or less).
The only difference between the texts is that the first one (the slow one) is all contained in a single line, with no line breaks. Can this be the issue? Is that affecting the performance?
Thanks

@David Gorsline's solution (from the comment) is correct:
string expression =
Regex.Escape(Constants.FIELD_START_DELIMITER_BACK_END) +
"([0-9]+)" +
Regex.Escape(Constants.FIELD_END_DELIMITER_BACK_END);
Specifically, it's the (.*?) at the beginning that's doing you in. What that does is take over doing what the regex engine should be doing itself--scan for the next place where the regex can match--and doing it much, much less efficiently. At each position, the (.*?) effectively performs a lookahead to determine whether the next part of the regex can match, and only if that fails does it go ahead and consume the next character.
But even if you used something more efficient, like [^|]*, you would still be slowing it down. Leave that part off, though, and the regex engine can instead scan for the first constant portion of the regex, probably using an algorithm like Boyer-Moore or Knuth-Morris-Pratt. So don't worry about what's around the bits you want to match; just tell the regex engine what you're looking for and get out of its way.
On the other hand, the trailing (.*?) has virtually no effect, because it never really does anything. The ? turns the .* reluctant, so what does it take to make it go ahead and consume the next character? It will only do so if there's something following it in the regex that forces it to. For example, foo.*?bar consumes everything from the next "foo" to the next "bar" after that, but foo.*? stops as soon as it's consumed "foo". It never makes sense to have a reluctant quantifier as the last thing in a regex.

You've answered your question: the problem is that . fails to match new-lines (it doesn't by default), which results in many failed attempts - almost one for every position on your 40000 character string.
On the long but single lined file, the engine can match the pattern in a single pass over the file (assuming a successful match exists - if it doesn't, I suspect it will take a long time to fail...).
On the shorter file, with many lines, the engine tries to match from the first character. It matches .*? until the end of the first line (this is a lazy match, so a lot more is happening, but let's ignore that), and fails. Now, it starts again from the second character, not the second line! This results in n² complexity even before matching the number.
A simple solution is to make . match newlines:
Regex r = new Regex(expression, RegexOptions.Singleline);
You can also make sure to match from start to end using the absolute start and end anchors, \A and \z:
string expression = "\\A(.*?)" +
Regex.Escape(Constants.FIELD_START_DELIMITER_BACK_END) +
"([0-9]+)" +
Regex.Escape(Constants.FIELD_END_DELIMITER_BACK_END) +
"(.*?)\\z";
Another note:
As David suggests in the comments, \|\*\|([0-9]+)\*\|\* should work well enough. Even if you need to "capture" all text before and after the match, you can easily get it using the position of the match.
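A sketch of recovering the surrounding text from match positions rather than capturing it. The delimiter values "|*|" and "*|*" are assumptions read off the escaped pattern in the question (the real values live in Constants), and the sample text and variable names are made up. The code uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Assumed delimiter values; the question only shows them as constants.
const string StartDelimiter = "|*|";
const string EndDelimiter = "*|*";

string contentText = "abc|*|25*|*def|*|7*|*xyz";

// No leading or trailing (.*?): the engine only looks for the field itself.
var field = new Regex(
    Regex.Escape(StartDelimiter) + "([0-9]+)" + Regex.Escape(EndDelimiter));

var pieces = new List<string>();
int previousEnd = 0;
foreach (Match m in field.Matches(contentText))
{
    // Text between the previous field and this one, taken from positions.
    pieces.Add(contentText.Substring(previousEnd, m.Index - previousEnd));
    // The captured number, bracketed only to make it visible in the output.
    pieces.Add("[" + m.Groups[1].Value + "]");
    previousEnd = m.Index + m.Length;
}
// Whatever follows the last field.
pieces.Add(contentText.Substring(previousEnd));

string rebuilt = string.Join("", pieces);
Console.WriteLine(rebuilt);
```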
