Regex non-greedy match - c#

If I have the following simplified string:
string331-itemR-icon253,string131-itemA-icon453,
string12131-itemB-icon4535,string22-itemC-icon443
How do I get only the following, using only regex?
string12131-itemB-icon4535,
All numbers are unknown. The only known parts are
itemA, itemB, itemC, string and icon
I've tried string.+?itemB.+?, but it also picks up from the first occurrence of string rather than the one adjacent to itemB
I've also tried using [^icon] preceding the itemB in various positions but couldn't get it to work.

Try this:
string[^,]+itemB[^,]+,

Try this regex
string\d+-itemB-icon\d+,

The given solutions that use a restricted set of characters instead of a wildcard are simplest, but to get more at the general question: You got the non-greedy quantifier part right, but being non-greedy doesn't prevent the matcher from taking as many characters as it needs to find a match. You might be looking for the atomic group operator, (?>group). Once the group matches something, it will be treated atomically if the matcher needs to backtrack.
(?>string.+?item)B.+?,
In your example, the group matches string331-item, but the B doesn't match R so the whole group is tossed and the search moves to the next string.

You don't mention the commas separating items as a known part, but you use one in the example regex, so I assume it can be used in a solution. Try a character class that excludes the comma, [^,], instead of matching against ".".
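To see why the lazy quantifier over-matches while the comma-excluding class does not, here is a minimal sketch. It is written in Python purely for illustration (the question is about C#); both patterns use syntax that the two engines share, and the sample string is the one from the question joined onto one line.

```python
import re

# Sample data from the question, joined into a single line.
text = ("string331-itemR-icon253,string131-itemA-icon453,"
        "string12131-itemB-icon4535,string22-itemC-icon443")

# Lazy dot: the match still starts at the FIRST "string", because the
# engine tries start positions left to right and "." may cross commas.
lazy = re.search(r"string.+?itemB.+?,", text).group(0)

# Negated class: [^,] cannot cross a comma, so the match is confined
# to the single comma-delimited item that contains "itemB".
bounded = re.search(r"string[^,]+itemB[^,]+,", text).group(0)

print(lazy)
print(bounded)
```

The first print shows the over-long match described in the question; the second shows only the itemB entry.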


Regex to match for a string but only bringing a part of it by grouping

I have a string of the following form (for a fiddle, see here).
{[(0.3;0.43)(10;8.2)(0;0)(0.7888;12.345)]:8.56/13.9}
I would like to filter out just the coordinates in the parentheses. The pattern I've set up matches a pair of literal parentheses (two backslashes in the C# string to escape each one). Inside them, it creates a group that holds the minimal run of characters, as long as it contains a semicolon. According to the debugger I have four matches, which suggests that it's correct. However, when I access Groups, I see two elements: one containing the pair of parentheses and one without.
Regex regex = new Regex("\\((.*?;.*?)\\)");
string full = regex.Matches(figure.ToString())[0].Value;
string with = regex.Matches(figure.ToString())[0].Groups[0].Value;
string sans = regex.Matches(figure.ToString())[0].Groups[1].Value;
I'm not entirely sure where the first group (Groups[0]) gets its information from. I suspect that I haven't phrased the regular expression well enough and that it actually reacts to the escaped parentheses as if they were a grouping as well. Am I right in my suspicion? How should I reformulate the expression?
From https://msdn.microsoft.com/en-us/library/system.text.regularexpressions.match.groups(v=vs.110).aspx:
If the regular expression engine can find a match, the first element of the GroupCollection object (the element at index 0) returned by the Groups property contains a string that matches the entire regular expression pattern.
So Groups[0] has the entire value you matched (e.g. (1;2)), while Groups[1] is the first matched subgroup (e.g. 1;2).
Anything in unescaped parentheses is treated as a capturing group. If you don't want a group to be captured, prefix its contents with "?:":
(?:REGEX)
The group is still matched against, but it no longer appears in the results.
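As a rough illustration of group 0 versus group 1, here is a small Python sketch (Python's m.group(0) and m.group(1) play the role of Groups[0] and Groups[1] in C#). The variable names are made up for this example; the input string and pattern come from the question.

```python
import re

figure = "{[(0.3;0.43)(10;8.2)(0;0)(0.7888;12.345)]:8.56/13.9}"

# Escaped parentheses match literally; the unescaped pair captures.
pattern = re.compile(r"\((.*?;.*?)\)")

# Group 0 is always the entire match, literal parentheses included.
with_parens = [m.group(0) for m in pattern.finditer(figure)]

# Group 1 is the first capturing group: just the coordinate pair.
without_parens = [m.group(1) for m in pattern.finditer(figure)]

# With (?:...) the group still matches but is not captured at all.
non_capturing = re.compile(r"\((?:.*?;.*?)\)")

print(with_parens)
print(without_parens)
print(pattern.groups, non_capturing.groups)
```

So the escaped parentheses are not creating an extra group; index 0 is simply reserved for the whole match.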

Regex match what did not match

In words, I would describe the problem as "match A or B, match foo, and then match A or B that did NOT match previously".
I can do it with the following regex:
AfooB|BfooA
I'm wondering if there is a more efficient way to do this. I know how to reference a captured group using "\" and the group number. In this case I would like to say something like "not the option that matched in the captured group" (while still being restricted to the other possible matches for that group).
The reason I'm looking for something more efficient than simply "AfooB|BfooA" is that in my case "foo" is a very long pattern, and I would prefer to reduce duplication if possible.
You may use a negative lookahead with a backreference restriction when matching the second A or B:
(A|B)foo(?!\1)(A|B)
Basically, (A|B) matches and captures the value into Group 1, then foo matches foo, (?!\1) makes sure that the text that follows is not the same as the one captured into the first group, and then it can only match the opposite value with (A|B).
NOTE: if A and B are single characters, use a character class: ([AB])foo(?!\1)([AB])
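A quick way to convince yourself that the lookahead-plus-backreference pattern accepts only the "mixed" combinations is to run it over all four candidates. This is a sketch in Python, with literal A, B and foo standing in for the real (longer) sub-patterns.

```python
import re

# (?!\1) rejects a second token identical to whatever group 1 captured.
pattern = re.compile(r"(A|B)foo(?!\1)(A|B)")

candidates = ["AfooB", "BfooA", "AfooA", "BfooB"]
results = {s: bool(pattern.fullmatch(s)) for s in candidates}

for s, ok in results.items():
    print(s, "matches" if ok else "does not match")
```

Only "foo" appears once in the pattern, which is the duplication the question wanted to avoid.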

Extract string from a pattern preceded by any length

I'm looking for a regular expression to extract a string from a file name
eg if filename format is "anythingatallanylength_123_TESTNAME.docx", I'm interested in extracting "TESTNAME" ... probably fixed length of 8. (btw, 123 can be any three digit number)
I think I can use regex match ...
".*_[0-9][0-9][0-9]_[A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z].docx$"
However this matches the whole thing. How can I just get "TESTNAME"?
Thanks
Use parentheses to capture a specific piece of the whole regex.
You can also use the curly braces to specify counts of matching characters, and \d for [0-9].
In C#:
var myRegex = new Regex(@".*_\d{3}_([A-Za-z]{8})\.docx$");
Now "TESTNAME", or whatever your 8-letter piece is, will be available as Groups[1] of the Match returned by the regex.
Also note, there will be a performance overhead for look-ahead and look-behind, as presented in some other solutions.
You can use a look-behind and a look-ahead to check parts without matching them:
(?<=_[0-9]{3}_)[A-Z]{8}(?=\.docx$)
Note that this is case-sensitive, you may want to use other character classes and/or quantifiers to fit your exact pattern.
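For comparison, here are the two approaches so far (a capturing group versus look-behind/look-ahead) side by side, sketched in Python with the file name from the question. Note that Python's re module requires a fixed-width look-behind, which _[0-9]{3}_ satisfies.

```python
import re

name = "anythingatallanylength_123_TESTNAME.docx"

# Approach 1: match the whole name, read the part you want from group 1.
m = re.search(r".*_\d{3}_([A-Za-z]{8})\.docx$", name)
from_group = m.group(1) if m else None

# Approach 2: look-behind and look-ahead check the surroundings without
# consuming them, so the overall match is only the 8 letters.
m = re.search(r"(?<=_[0-9]{3}_)[A-Z]{8}(?=\.docx$)", name)
from_lookaround = m.group(0) if m else None

print(from_group, from_lookaround)
```

Both variables end up holding the same eight letters; which style to prefer is mostly a matter of whether you want the value in the whole match or in a group.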
In your file name format "anythingatallanylength_123_TESTNAME.docx", the part you are trying to match is the string between the last underscore and .docx. Keeping in mind that any earlier underscore must not start a match, I came up with the following solution.
Regex: (?<=_)[A-Za-z]*(?=\.docx$)
Flags used:
g global search
m multi-line search.
Explanation:
(?<=_) checks if there is an underscore before the file name.
(?=\.docx$) checks for extension at the end.
[A-Za-z]* checks the required match.
Thanks to @Lucero @noob @JamesFaix I came up with ...
@"(?<=.*_[0-9]{3}_)[A-Z]{8}(?=\.docx$)"
So: a look-behind (in parentheses, starting with ?<=) for anything (i.e. zero or more of any character, denoted by ".*"), followed by an underscore, followed by three digits, followed by an underscore. That's the end of the look-behind. Next comes what I need to match (eight letters). Finally, the look-ahead (in parentheses, starting with ?=), which is the .docx extension.
Nice work, fellas. Thunderbirds are go.

Using regex to match any character until a substring is reached?

I'd like to be able to match a specific sequence of characters, starting with a particular substring and ending with a particular substring. My positive lookahead regex works if there is only one instance to match on a line, but not if there should be multiple matches on a line. I understand this is because (.+) captures up everything until the last positive lookahead expression is found. It'd be nice if it would capture everything until the first expression is found.
Here is my regex attempt:
##FOO\[(.*)(?=~~)~~(.*)(?=\]##)\]##
Sample input:
##FOO[abc~~hi]## ##FOO[def~~hey]##
Desired output: 2 matches, with 2 matching groups each (abc, hi) and (def, hey).
Actual output: 1 match with 2 groups (abc~~hi]## ##FOO[def, hey)
Is there a way to get the desired output?
Thanks in advance!
Add a question mark after the quantifier; it will then match as few times as possible.
##FOO\[(.*?)(?=~~)~~(.*?)(?=\]##)\]##
This one also works and is easier to read; the lookaheads are redundant here because each one is immediately followed by the same literal text:
##FOO\[(.*?)~~(.*?)\]##
The * operator is greedy by default, meaning it eats up as much of the string as possible while still leaving enough to match the remaining regex. You can make it not greedy by appending a ? to it. Make sure to read about the differences at the link.
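The difference between the greedy and the non-greedy version can be checked directly on the sample input from the question. This is a minimal Python sketch using the simplified pattern without the lookaheads.

```python
import re

line = "##FOO[abc~~hi]## ##FOO[def~~hey]##"

# Greedy: (.*) grabs as much as possible, so the first group runs all
# the way to the LAST "~~" on the line and only one match is found.
greedy = re.findall(r"##FOO\[(.*)~~(.*)\]##", line)

# Lazy: (.*?) stops at the first place the rest of the pattern can
# match, giving one match per ##FOO[...]## block.
lazy = re.findall(r"##FOO\[(.*?)~~(.*?)\]##", line)

print(greedy)
print(lazy)
```

The greedy list reproduces the "actual output" from the question, and the lazy list is the desired two matches.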
You could use the String.IndexOf() method instead to find the first occurrence of your substring.

Regex search and replace where the replacement is a mod of the search term

I'm having a hard time finding a solution to this and am pretty sure that regex supports it. I just can't recall the name of the concept in the world of regex.
I need to search a string for a specific pattern and replace it, but the patterns can be different and the replacement needs to "remember" what it's replacing.
For example, say I have an arbitrary string: 134kshflskj9809hkj
and I want to surround the numbers with parentheses,
so the result would be: (134)kshflskj(9809)hkj
Finding numbers is simple enough, but how to surround them?
Can anyone provide a sample or point me in the right direction?
In various languages:
// C#:
string result = Regex.Replace(input, @"(\d+)", "($1)");
// JavaScript:
thestring.replace(/(\d+)/g, '($1)');
// Perl:
s/(\d+)/($1)/g;
// PHP:
$result = preg_replace("/(\d+)/", '($1)', $input);
The parentheses around (\d+) make it a "group", specifically the first (and in this case only) group, which can be backreferenced in the replacement string. The g flag is required in some implementations to make it match multiple times in a single string. The replacement string is fairly similar across languages, although some use \1 instead of $1 and some allow both.
Most regex replacement functions allow you to reference capture groups specified in the regex (a.k.a. backreferences), when defining your replacement string. For instance, using preg_replace() from PHP:
$var = "134kshflskj9809hkj";
$result = preg_replace('/(\d+)/', '(\1)', $var);
// $result now equals "(134)kshflskj(9809)hkj"
where \1 means "the first capture group in the regex".
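The same idea carries over to other languages. As one more data point, here is a sketch in Python, where re.sub replaces every occurrence by default and accepts \1 in the replacement string (the count argument is shown only to illustrate limiting the replacement).

```python
import re

var = "134kshflskj9809hkj"

# \1 in the replacement refers to the first capture group, (\d+).
result = re.sub(r"(\d+)", r"(\1)", var)

# count=1 stops after the first replacement, similar to omitting the
# g flag in languages that need it for global replacement.
first_only = re.sub(r"(\d+)", r"(\1)", var, count=1)

print(result)
print(first_only)
```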
Another somewhat generic solution is this:
search : /([\d]+)([^\d]*)/g
replace: ($1)$2
([\d]+): match a set of one or more digits and retain them in a group
([^\d]*): match a set of non-digits, and retain them as well. \D could work here, too.
g: indicate this is a global expression, to work multiple times on the input.
($1): in the replace block, parens have no special meaning, so output the first group, surrounding it with parens.
$2: output the second group
I used a pretty good online regex tool to test out my expression. The next step would be to apply it to the language that you are using, as each has its own implementation nuances.
Backreferences (grouping) are not necessary if you're just looking to search for numbers and replace with the found regex surrounded by parens. It is simpler to use the whole regex match in the replacement string.
E.g., for Perl:
$text =~ s/\d+/($&)/g;
This searches for 1 or more digits and replaces with parens surrounding the match (specified by $&), with trailing g to find and replace all occurrences.
see http://www.regular-expressions.info/refreplace.html for the correct syntax for your regex language.
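The whole-match token differs between languages. As an example of that variation, Python has no $& but writes the whole match as \g<0>, or lets you pass a function instead of a replacement string. A minimal sketch:

```python
import re

text = "134kshflskj9809hkj"

# \g<0> stands for the entire match, so no capturing group is needed.
whole = re.sub(r"\d+", r"(\g<0>)", text)

# Equivalent form: a function receives each match object and returns
# the replacement text for it.
via_function = re.sub(r"\d+", lambda m: "(" + m.group(0) + ")", text)

print(whole)
print(via_function)
```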
Depending on your language, you're looking to match groups.
So typically you'll make a pattern in the form of
([0-9]{1,})|([a-zA-Z]{1,})
Then you'll iterate over the resulting groups in a way that is specific to your language.
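One possible reading of this approach, sketched in Python: walk over the matches, check which of the two groups took part, and rebuild the string piece by piece. The names token and pieces are made up for this example, and characters that are neither digits nor letters would simply be skipped by this pattern.

```python
import re

text = "134kshflskj9809hkj"

# Group 1 captures a run of digits, group 2 a run of letters.
token = re.compile(r"([0-9]{1,})|([a-zA-Z]{1,})")

pieces = []
for m in token.finditer(text):
    digits, letters = m.group(1), m.group(2)
    if digits is not None:
        # Only the digit runs get wrapped in parentheses.
        pieces.append("(" + digits + ")")
    else:
        pieces.append(letters)

result = "".join(pieces)
print(result)
```

This is more work than a plain replace with a backreference, but it is handy when each kind of token needs different handling.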
