Regex to parse formatter string - c#

I am writing a string.Format-like method. In order to do this, I am adopting Regex to determine commands and parameters: e.g. Format(@"\m{0,1,2}", byteArr0, byteArr1, byteArr2)
For the first Regex, return 2 groups:
'\m'
'{0,1,2}'
Another Regex takes the value of '{0,1,2}' and has 3 matches:
0
1
2
These values are the indexes corresponding to the byteArr params.
This command structure is likely to grow, so I'm really trying to figure this out and learn enough to be able to modify the Regex for future requirements. I would think that a single Regex would do all of the above, but there is value in having 2 separate Regex(es/ices ???) expressions.
Anyway, to get the first group '\m' the Regex is:
"(\\)(\w{1,1})" // I want the '{0,1,2}' group also
To get the integer matches '{0,1,2}' I was trying:
"(?<=\{)([^}]*)(?=\})"
I am having difficulty in achieving: (1) 2 groups on the first expression and (2) 3 matches on the integers within the braces delimited by a comma in the second expression.

Your first regex (\\)(\w{1,1}) can be greatly simplified.
You don't want to capture the \ separately from the m, so there is no need to wrap them in their own sets of parentheses.
\w{1,1} is the same as just \w.
So we have \\\w to match the first part \m.
Now to deal with the second part, really we can ignore everything other than the 0,1,2 in the example since there are no numbers elsewhere so you'd just use: \d+ and iterate through the matches.
But let's assume the example could actually be \9{1,2,3}.
Now \d+ would match the 9 so to avoid this we could use [{,](\d+)[,}]. This says capture a number that has either a , or { on the left of it and a , or } on the right.
You're right in saying that we can match the whole string with a single regex, something like this would do it:
(\\\w){((\d+),?)+}
However the problem with this is when you examine the contents of the capture groups afterwards, the last number caught by the (\d+) overwrites all the other values that were caught in there. So you'd be left with group 1: \m and group 2: 2 for your example.
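A note for .NET specifically: Groups[n].Value does expose only the last value of a repeated group, but the engine also keeps the earlier ones in Group.Captures. The following is a minimal sketch of that, with the braces escaped; the class and method names are hypothetical and only for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class CaptureHistoryDemo
{
    // The single-pass pattern from the text, with { and } escaped.
    private static readonly Regex Spec = new Regex(@"(\\\w)\{((\d+),?)+\}");

    // Returns every value the repeated (\d+) group captured, in order.
    public static string[] Indexes(string spec)
    {
        Match m = Spec.Match(spec);
        if (!m.Success) return Array.Empty<string>();

        // m.Groups[3].Value holds only the last capture;
        // m.Groups[3].Captures holds all of them.
        return m.Groups[3].Captures.Cast<Capture>()
                .Select(c => c.Value)
                .ToArray();
    }
}
```

Calling CaptureHistoryDemo.Indexes(@"\m{0,1,2}") yields the three index strings rather than just the last one. This only helps in .NET; most other engines discard the earlier captures.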
With that in mind I recommend using 2 regexes:
For the 1st part: \\\w
For the numbers: I'd forget about the [{,](\d+)[,}] (and the many other ways you could do it), the cleanest way might just be to grab whatever is inside the {...} and then match with a simple \d+.
So to do this first use (\\\w)\{([^}]+)\} to grab the \m into group 1 and the 1,2,3 into group 2, then just use \d+ on that.
FYI, your (?<=\{)([^}]*)(?=\}) works fine, but you can't put anything before the lookbehind, i.e. the \\\w. In the vast majority of cases where a lookbehind can be used, you can do what you want by just using capture groups and ignoring everything else:
My regex \{([^}]+)\} is pretty much the same as your (?<=\{)([^}]*)(?=\}) except rather than looking ahead and looking behind for the { and } I just leave them outside the capture groups that are going to be used.
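The two-regex approach could be sketched in C# roughly as follows. This is a sketch under assumptions: the class and method names are hypothetical, and the brace contents are matched with [^}]+.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class FormatSpecParser
{
    // Group 1: the command (e.g. \m). Group 2: whatever sits inside the braces.
    private static readonly Regex Command = new Regex(@"(\\\w)\{([^}]+)\}");

    // Applied to group 2 only, so a digit in the command itself is never picked up.
    private static readonly Regex Number = new Regex(@"\d+");

    public static bool TryParse(string spec, out string command, out List<int> indexes)
    {
        command = string.Empty;
        indexes = new List<int>();

        Match m = Command.Match(spec);
        if (!m.Success) return false;

        command = m.Groups[1].Value;
        foreach (Match n in Number.Matches(m.Groups[2].Value))
        {
            indexes.Add(int.Parse(n.Value));
        }
        return true;
    }
}
```

For the input \m{0,1,2} this gives the command \m and the indexes 0, 1 and 2; for \9{1,2,3} the 9 stays in the command and is not reported as an index.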

Consider the following Regexes...
(^.*?)(?={.*})
\d+
Good Luck!

Related

Match exactly one occurrence with regex

Consider M,T,W,TH,F,S,SU are days of week.
I have regex which is working well except for one scenario when there is no sequence of weekdays, i.e. there is no M, T , W , TH , F , S , SU at the expected location inside the string.
For example, q10MT is valid but q10HT is invalid.
Below is my expression:
string expression = "q(\\d*)(M)?(T(?!H))?(W)?(TH)?(F)?(S(?!U))?(SU)?";
In case of q10MT, the output is q10MT which is correct, but in case of q10HT the output is q10 which is incorrect, my regex should return no value or empty string when there is no match.
What changes do I need to make in order to achieve this?
You can achieve it with a positive look-ahead:
q(\\d*)(?=(?:M|T(?!H)|W|TH|F|S(?!U)|SU))(M)?(T(?!H))?(W)?(TH)?(F)?(S(?!U))?(SU)?
Or, as @Taemyr noted, a shorter equivalent
q(\\d*)(?=(?:M|TH?|W|TH|F|SU?))(M)?(T(?!H))?(W)?(TH)?(F)?(S(?!U))?(SU)?
Here is a demo
The (?=(?:M|TH?|W|F|SU?)) look-ahead makes sure there is at least one required value from the alternation list you have after the look-ahead.
C# regex usage:
var rx = new Regex(@"q(\d*)(?=(?:M|TH?|W|TH|F|SU?))(M)?(T(?!H))?(W)?(TH)?(F)?(S(?!U))?(SU)?");
var result = rx.Match("q10MSUT").Value;
Result: q10MSU
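To check the two inputs from the question against the look-ahead approach, a small sketch like the following could be used. It uses the short look-ahead list M|TH?|W|F|SU?; the class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class WeekdayLookahead
{
    // The look-ahead requires at least one weekday token right after the digits;
    // the optional groups then consume the weekdays in their fixed order.
    private static readonly Regex Rx = new Regex(
        @"q(\d*)(?=(?:M|TH?|W|F|SU?))(M)?(T(?!H))?(W)?(TH)?(F)?(S(?!U))?(SU)?");

    // Returns the matched text, or an empty string when nothing matches.
    public static string Extract(string input)
    {
        Match m = Rx.Match(input);
        return m.Success ? m.Value : string.Empty;
    }
}
```

WeekdayLookahead.Extract("q10MT") returns "q10MT", while WeekdayLookahead.Extract("q10HT") returns an empty string because the look-ahead finds no weekday after the digits.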
What about the following:
q(\d*)(M|TH?|W|F|SU?)+
See demo with some examples on matches and no-matches. The key change in this regexp is that this one uses the + to require at least one match.
Be aware that this solution doesn't require the days to be in order and allows days to be skipped; the comments indicate that neither matters.
Edit: OP says in comments that he requires only one match for each day, which this solution doesn't account for.
If order does not matter you need to do something like this:
q(?<number>\d+)((?<monday>(?<!M\D*)M)|(?<tuesday>(?<!T(?!H)\D*)T(?!H))|(?<wednesday>(?<!W\D*)W)|(?<thursday>(?<!TH\D*)TH)|(?<friday>(?<!F\D*)F)|(?<saturday>(?<!S(?!U)\D*)S(?!U))|(?<sunday>(?<!SU\D*)SU))+
This matches if q is followed by some number, and then followed by one or more weekdays. Order of weekdays does not matter, and the negative lookbehind ensures that no weekday can occur more than once.
Each weekday is captured in its own capturing group and that group is named so that it can be extracted later. "q10MTsomething" will capture "q10MT" with 10 in the "number" capturing group, M in the "monday" capturing group and T in the "tuesday" capturing group, other capturing groups will be empty. "q10TFMother" will capture "q10TFM" with capturing as in the previous example, plus F in the "friday" capturing group. "q10TFMT" will capture "q10TFM" with capturing groups as in the previous example. "q10HT" will not match.
Note that this is the regexp string. If entered in code you might need to escape the backslashes to produce the correct string.
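In C#, a verbatim string (@"...") avoids the escaping problem. A minimal sketch of using the named groups, with hypothetical class and method names:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class WeekdaySet
{
    // Verbatim string, so the backslashes reach the regex engine unchanged.
    private static readonly Regex Rx = new Regex(
        @"q(?<number>\d+)(" +
        @"(?<monday>(?<!M\D*)M)|" +
        @"(?<tuesday>(?<!T(?!H)\D*)T(?!H))|" +
        @"(?<wednesday>(?<!W\D*)W)|" +
        @"(?<thursday>(?<!TH\D*)TH)|" +
        @"(?<friday>(?<!F\D*)F)|" +
        @"(?<saturday>(?<!S(?!U)\D*)S(?!U))|" +
        @"(?<sunday>(?<!SU\D*)SU))+");

    public static Match Parse(string input)
    {
        return Rx.Match(input);
    }

    // True when the named weekday group took part in the match.
    public static bool Has(Match m, string day)
    {
        return m.Groups[day].Success;
    }
}
```

For example, WeekdaySet.Parse("q10MTsomething") matches "q10MT", and WeekdaySet.Has(m, "monday") is true while WeekdaySet.Has(m, "friday") is false.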
The question is answered already. Even so I want to point to another idea using a variable length lookbehind for maintaining the sequence, which should be fine with .NET
q(\d*)[MTWFSUH]+(?<=q\d*(M)?(T)?(W)?(TH)?(F)?(S)?(SU)?)
[MTWFSUH] is the list of valid characters. At least one is required
Using a lookbehind for matching as long as the sequence is maintained
Test at your test tool

How to match a string, but only if the same string has not already been matched with or without dashes?

I have a case I'm trying to match using regular expressions.
My current expression will match a string in a certain format with or without dashes. I would like to add it to match only if the string has not been matched before, with or without the dashes. For example, take the following cases:
1. 1234-56-789-5555
2. 1234567895555
3. 0000-99-888-3333
4. 1111223334444
If the four examples above appeared in this same order in a list, document, whatever, I would want to only capture (1, 3, 4). I want to skip #2 since it was already captured by #1, but with the dashes. If #2 had come first, I would have wanted to similarly skip #1.
Here's the current expression I'm using:
\d\d\d\d-*\d\d-*\d\d\d-*\d\d\d\d
I tried to read up on look behinds (I'm fairly inexperienced with Regex) but I only really understand that a look behind only checks if certain text is matched previously. I'm not sure if what I want can be combined with this; I only see how to check for specific text, not for the current value with/without dashes.
I'm currently doing this with C# logic, but am trying to see if it can be done purely in Regex. If it can't be done, that's fine; I'm just trying to beef up my Regex knowledge in this case.
Is this possible -- how can I accomplish this?
If you want to obtain just the first occurrence of each number (answering "I want to skip #2 since it was already captured by #1, but with the dashes"), you need a negative look-behind with the RegexOptions.RightToLeft and RegexOptions.Singleline options:
(?<!\b\1-?\2-?\3-?\4\b.*)\b(\d{4})-?(\d{2})-?(\d{3})-?(\d{4})\b
The \b(\d{4})-?(\d{2})-?(\d{3})-?(\d{4})\b subpattern is the number with capture groups to check for their presence regardless of the hyphens earlier in the string.
The (?<!\b\1-?\2-?\3-?\4\b.*) subpattern look-behind is checking if we have no other occurrences of the same string.
Tested at regexhero.net and in Expresso.
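A sketch of how this pattern might be used from C#. The class and method names are hypothetical; the options are the two named above. Because the search runs right to left, the matches come back last-first, so the sketch reverses them.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class FirstOccurrences
{
    // RightToLeft: the number is matched (and its groups captured) before the
    // look-behind runs, so \1..\4 are already set when it checks earlier text.
    // Singleline: lets .* inside the look-behind cross line breaks.
    private static readonly Regex Rx = new Regex(
        @"(?<!\b\1-?\2-?\3-?\4\b.*)\b(\d{4})-?(\d{2})-?(\d{3})-?(\d{4})\b",
        RegexOptions.RightToLeft | RegexOptions.Singleline);

    public static List<string> Find(string text)
    {
        var found = new List<string>();
        foreach (Match m in Rx.Matches(text))
        {
            found.Add(m.Value);
        }
        found.Reverse(); // restore document order
        return found;
    }
}
```

With the four sample lines from the question joined by newlines, Find returns entries 1, 3 and 4 and skips entry 2.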
You can easily do this without using regex, but if you still want to use regex for this purpose, you can use the following to match:
(?<=((\d{4})-(\d{2})-(\d{3})-(\d{4})).*?)\2\3\4\5
And replace with '' (empty string)
Explanation:
This will match all those digits without dashes which are already captured by digits with dashes
So, of your examples 1, 2, 3 and 4, instead of matching types 1, 3 and 4 it matches type 2; replace that with '' (nothing) and you are left with 1, 3 and 4.
See demo here
You can use the following regex to do exactly what you want:
((?<=((\d{4})-(\d{2})-(\d{3})-(\d{4})).*?)(?!\3\4\5\6)\d{13})|(((?<=((\d{4})(\d{2})(\d{3})(\d{4})).*?)(?!\10-\11-\12-\13)((\d{4})-(\d{2})-(\d{3})-(\d{4}))))
Explanation:
((?<=((\d{4})-(\d{2})-(\d{3})-(\d{4})).*?)(?!\3\4\5\6)\d{13}) match all those \d{13} that have not previously occurred with dashes between them (this excludes strings of type 2 in your case)
((\d{4})-(\d{2})-(\d{3})-(\d{4})) and match all of this pattern
Matches 1, 3 and 4 in your case.
See DEMO

UB: C#'s Regex.Match returns whole string instead of part when matching

Attention! This is NOT related to Regex problem, matches the whole string instead of a part
Hi all.
I try to do
Match y = Regex.Match(someHebrewContainingLine, @"^.{0,9} - \[(.*)?\s\d{1,3}");
Aside from the other VS hebrew quirks (how do you like replacing ] for [ when editing the string?), it occasionally returns the crazy results:
Match.Captures.Count = 1;
Match.Captures[0] = whole string! (not expected)
Match.Groups.Count = 2; (not expected)
Match.Groups[0] = whole string again! (not expected)
Match.Groups[1] = (.*)? value (expected).
Regex.Matches() is acting same way.
What can be a general reason for such behaviour? Note: it's not acting this way on simple test strings like Regex.Match("-היי45--", "-(.{1,5})-") (sample is displayed incorrectly!, please look to the page's source code), there must be something with the regex which makes it greedy. The matched string contains [ .... ], but simply adding them to the test string doesn't cause the same effect.
I hit this problem when I first started using the .NET regex, too. The way to understand this is to understand that the Group member of Match is the nesting member. You have to traverse Groups in order to get down to lower captures. Groups also have Capture members. The Match is kind of like the top "Group" in that it represents the successful "match" of the whole string against your expression. The single input string can have multiple matches. The Captures member represents the match of your full expression.
Whenever you have a single capture group as you have, Groups[1] will always be the data you are interested in. Look at this page. The source code in examples 2 and 3 is hardcoded to print out Groups[1].
Remember that a single capture can capture multiple substrings in a single match operation. If this were the case then you would see Match.Groups[1].Captures.Count be greater than 1. Also, I think if you passed in multiple matching lines of text to the single Match call, then you would see Match.Captures.Count be greater than 1, but each top-level Match.Captures would be the full string matched by your full expression.
There is one capture group in the pattern; that is group 1.
There is always group 0, which is the entire match.
Therefore there are a total of 2 groups.
My test regex was different from any others in the project's scope (that's what happens when a Perl guy comes to C#), as it had no lookaheads/lookbehinds. So this discovery took some time.
Now, why we should call Regex behaviour undocumented, not undefined:
let's do some matches against "1.234567890".
PCRE-like syntax: (.)\.2345678
lookahead syntax: (.)(?=\.\d)
When you're doing a normal match, the result is copied from the whole matched part of the line, no matter where you've put the parentheses; when lookaheads are present, anything that does not belong to them is copied.
So, the matches will return:
PCRE: 1.2345678 (at 2300, this looks like original string and I start yelling here at SO)
lookahead: 1
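The difference between the two patterns can be seen side by side with a short sketch (class and method names are hypothetical). In both cases Groups[1] is the parenthesized part; only Match.Value, the whole match, changes, because text inside a lookahead is checked but not consumed.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class WholeMatchVsGroup
{
    // Returns the whole match (same as Groups[0]) and the first capture group.
    public static (string Whole, string Group1) Run(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return (m.Value, m.Groups[1].Value);
    }
}
```

Run(@"(.)\.2345678", "1.234567890") gives a whole match of "1.2345678" with group 1 equal to "1", whereas Run(@"(.)(?=\.\d)", "1.234567890") gives "1" for both.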

Shall this Regex do what I expect from it, that is, matching against "A1:B10,C3,D4:E1000"?

I'm currently writing a library where I wish to allow the user to be able to specify spreadsheet cell(s) under four possible alternatives:
A single cell: "A1";
Multiple contiguous cells: "A1:B10"
Multiple separate cells: "A1,B6,I60,AA2"
A mix of 2 and 3: "B2:B12,C13:C18,D4,E11000"
Then, to validate whether the input respects these formats, I intended to use a regular expression to match against. I have consulted this article on Wikipedia:
Regular Expression (Wikipedia)
And I also found this related SO question:
regex matching alpha character followed by 4 alphanumerics.
Based on the information provided within the above-linked articles, I would try with this Regex:
Default Readonly Property Cells(ByVal cellsAddresses As String) As ReadOnlyDictionary(Of String, ICell)
    Get
        Dim validAddresses As Regex = New Regex("A-Za-z0-9:,A-Za-z0-9")
        If (Not validAddresses.IsMatch(cellsAddresses)) Then _
            Throw New FormatException("cellsAddresses")
        ' Proceed with getting the cells from the Interop here...
    End Get
End Property
Questions
1. Is my regular expression correct? If not, please help me understand what expression I could use.
2. Which exception is likely to be more meaningful, a FormatException or an InvalidExpressionException? I hesitate here, since the error relates to the format in which the property expects the cells to be input, yet I'm also using a (regular) expression to match against.
Thank you kindly for your help and support! =)
I would try this one:
[A-Za-z]+[0-9]+([:,][A-Za-z]+[0-9]+)*
Explanation:
Between [] is a possible group of characters for a single position
[A-Za-z] means characters (letters) from 'A' to 'Z' and from 'a' to 'z'
[0-9] means characters (digits) from 0 to 9
A "+" appended to a part of a regex means: repeat that one or more times
A "*" means: repeat the previous part zero or more times.
( ) can be used to define a group
So [A-Za-z]+[0-9]+ matches one or more letters followed by one or more digits for a single cell-address.
Then that same block is repeated zero or more times, with a ',' or ':' separating the addresses.
Assuming that the column for the spreadsheet is any 1- or 2-letter value and the row is any positive number, a more complex but tighter answer still would be:
^[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?(,[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?)*$
"[A-Z]{1,2}[1-9]\d*" is the expression for a single cell reference. If you replace "[A-Z]{1,2}[1-9]\d*" in the above with <cell>, then the complex expression becomes
^<cell>(:<cell>)?(,<cell>(:<cell>)?)*$
which more clearly shows that it is a cell or a range followed by zero or more "cell or range" entries, each preceded by a comma.
The row and column indicators could be further refined to give a tighter still, yet more complex expression. I suspect that the above could be simplified with look-ahead or look-behind assertions, but I admit those are not (yet) my strong suit.
I'd go with this one, I think:
(([A-Z]+[1-9]\d*:)?[A-Z]+[1-9]\d*,)*([A-Z]+[1-9]\d*:)?[A-Z]+[1-9]\d*
This only allows capital letters as the prefix. If you want case insensitivity, use RegexOptions.IgnoreCase.
You could simplify this by replacing [A-Z]+[1-9]\d* with plain old [A-Z]\d+, but that will only allow a one-letter prefix, and it also allows stuff like A0 and B01. Up to you.
EDIT:
Having thought hard about DocMax's mention of lookarounds, and using Hans Kesting's answer as inspiration, it occurs to me that this should work:
^[A-Z]+\d+((,|(?<!:\w*):)[A-Z]+\d+)*$
Or if you want something really twisted:
^([A-Z]+\d+(,|$|(?<!:\w*):))*(?<!,|:)
As in the previous example, replace \d+ with [1-9]\d* if you want to prevent leading zeros.
The idea behind the ,|(?<!:\w*): is that if a group is delimited by a comma, you want to let it through; but if it's a colon, it's only allowed if the previous delimiter wasn't a colon. The (,|$|...) version is madness, but it allows you to do it all with only one [A-Z]+\d+ block.
However! Even though this is shorter, and I'll admit I feel a teeny bit clever about it, I pity the poor fellow who has to come along and maintain it six months from now. It's fun from a code-golf standpoint, but I think it's best for practical purposes to go with the earlier version, which is a lot easier to read.
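For anyone who wants to try the lookbehind variant, here is a small C# sketch (hypothetical class and method names). It relies on .NET allowing variable-length lookbehind such as (?<!:\w*), which many other regex engines reject.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class CellListLookbehind
{
    // A colon is accepted only if the delimiter before the previous cell
    // was not also a colon, so A1:B2:C3 is rejected.
    private static readonly Regex Rx =
        new Regex(@"^[A-Z]+\d+((,|(?<!:\w*):)[A-Z]+\d+)*$");

    public static bool IsValid(string addresses)
    {
        return Rx.IsMatch(addresses);
    }
}
```

IsValid("A1:B10,C3,D4:E1000") is true and IsValid("A1:B2:C3") is false.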
I think your regex is incorrect; try (([A-Za-z0-9]*)[:,]?)*
Edit: to correct the bug pointed out by Baud: (([A-Za-z0-9]*)[:,]?)*([A-Za-z0-9]+)
and finally, the best version: (([A-Za-z]+[0-9]+)[:,]?)*([A-Za-z]+[0-9]+)
// ah ok, this probably won't work... but to answer 1: no, I don't think your regex is correct
( ) form a group
[ ] form a charclass (you can use A-Z a-d 0-9 etc or just single characters)
? means 0 or 1 occurrences
* means 0 or more occurrences
I'd suggest reading http://www.regular-expressions.info/reference.html .
That's where I learned regexes some time ago ;)
And for building expressions I use Rad Software Regular Expression Designer
Let's build this step by step.
If you are following an Excel addressing format, to match a single-cell entry in your CSL, you would use the regular expression:
[A-Z]{1,2}[1-9]\d*
This matches the following in sequence:
Any character in A to Z once or twice
Any digit in 1 to 9
Any digit zero or more times
The digit expression will prevent inputting a cell address with leading zeros.
To build the expression that allows for a cell address pair, repeat the expression preceded by a colon as optional.
[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?
Now allow for repeating the pattern preceded by a comma zero or more times and add start and end string delimiters.
^[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?(,[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?)*$
Kind of long and obnoxious, I admit, but after trying enough variants, I can't find a way of shortening it.
Hope this is helpful.
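One way to keep the long expression readable in code is to assemble it from the single-cell fragment, following the same three steps. A minimal C# sketch, with hypothetical class and member names:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class CellListPattern
{
    // Step 1: a single cell, e.g. A1 or AA20 (no leading zeros in the row).
    private const string Cell = @"[A-Z]{1,2}[1-9]\d*";

    // Step 2: a cell, optionally followed by ":" and a second cell.
    private const string CellOrRange = Cell + "(:" + Cell + ")?";

    // Step 3: comma-separated repetitions, anchored at both ends.
    private static readonly Regex Rx =
        new Regex("^" + CellOrRange + "(," + CellOrRange + ")*$");

    public static bool IsValid(string addresses)
    {
        return Rx.IsMatch(addresses);
    }
}
```

The assembled pattern is the same string as the final expression above, so all four input styles from the question validate, while inputs such as "A0" or "A1:B2:C3" are rejected.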

In C# regular expression why does the initial match show up in the groups?

So if I write a regex and it matches, I can get the match or I can access its groups. This seems counterintuitive, since the groups are defined in the expression with parentheses "(" and ")". It seems like it is not only wrong but redundant. Anyone know why?
Regex quickCheck = new Regex(@"(\D+)\d+");
string source = "abc123";
Match m = quickCheck.Match(source);
m.Value // Equals source
m.Groups.Count // Equals 2
m.Groups[0] // Equals source
m.Groups[1] // Equals "abc"
I agree - it is a little strange, however I think there are good reasons for it.
A Regex Match is itself a Group, which in turn is a Capture.
But the Match.Value (or Capture.Value as it actually is) is only valid when one match is present in the string - if you're matching multiple instances of a pattern, then by definition it can't return everything. In effect - the Value property on the Match is a convenience for when there is only one match.
But to clarify where this behaviour of passing the whole match into Groups[0] makes sense - consider this (contrived) example of a naive code unminifier:
[TestMethod]
public void UnMinifyExample()
{
    string toUnMinify = "{int somevalue = 0; /*init the value*/} /* end */";
    string result = Regex.Replace(toUnMinify, @"(;|})\s*(/\*[^*]*?\*/)?\s*", "$0\n");
    Assert.AreEqual("{int somevalue = 0; /*init the value*/\n} /* end */\n", result);
}
The regex match will preserve /* */ comments at the end of a statement, placing a newline afterwards - but works for either ; or } line-endings.
Okay - you might wonder why you'd bother doing this with a regex - but humour me :)
If Groups[0] generated by the matches for this regex was not the whole capture - then a single-call replace would not be possible - and your question would probably be asking why doesn't the whole match get put into Groups[0] instead of the other way round!
The documentation for Match says that the first group is always the entire match so it's not an implementation detail.
It's historical is all. In Perl 5, the contents of capture groups are stored in the special variables $1, $2, etc., but C#, Java, and others instead store them in an array (or array-like structure). To preserve compatibility with Perl's naming convention (which has been copied by several other languages), the first group is stored in element number one, the second in element two, etc. That leaves element zero free, so why not store the full match there?
FYI, Perl 6 has adopted a new convention, in which the first capturing group is numbered zero instead of one. I'm sure it wasn't done just to piss us off. ;)
Most likely so that you can use "$0" to represent the match in a substitution expression, and "$1" for the first group match, etc.
I don't think there's really an answer other than that the person who wrote this chose it as an implementation detail. As long as you remember that the first group (index 0) will always equal the entire matched text you should be ok :-)
Not sure why either, but if you use named groups you can then set the option RegexOptions.ExplicitCapture so that unnamed parentheses stop capturing; group 0 (the whole match) is still included.
It might be redundant, however it has some nice properties.
For example, it means the capture groups work the same way as other regex engines - the first capture group corresponds to "1", and so on.
Backreferences are one-based, e.g., \1 or $1 is the first parenthesized subexpression, and so on. As laid out, one maps to the other without any thought.
Also of note: m.Groups["0"] gives you the entire matched substring, so be sure to skip "0" if you're iterating over regex.GetGroupNames().
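A sketch of iterating over the group names while skipping "0" might look like this (the class and method names are hypothetical):

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class GroupNameListing
{
    // Lists "name=value" for every capture group except group 0.
    public static List<string> CapturedGroups(Regex regex, string input)
    {
        var result = new List<string>();
        Match m = regex.Match(input);
        foreach (string name in regex.GetGroupNames())
        {
            if (name == "0") continue; // "0" is the entire match, not a capture group
            result.Add(name + "=" + m.Groups[name].Value);
        }
        return result;
    }
}
```

For new Regex(@"(\D+)\d+") and the input "abc123" this lists only "1=abc"; unnamed groups appear under their number as a string, and named groups under their name.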
