C# Regex capturing repeated keyword values

I'm trying to capture the value of a keyword that is delimited by either another keyword or the end of the line. The keywords may be repeated, may appear in any order, and may have no data to capture:
Keywords:
K1,K2
Input data:
somedatahereornotk1capturethis1k2capturethis2k2capturethis3k1k2
I want the captured data to be
1. capturethis1
2. capturethis2
3. capturethis3
4.
5.
I've tried k1|k2(?<Data>.*?)k1|k2, but the captured data is always empty.
Thanks!

First, be aware that the alternation operator | has low precedence, so
k1|k2(?<Data>.*?)k1|k2
is actually looking for k1 or k2(?<Data>.*?)k1 or k2. Use grouping:
(?:k1|k2)(?<Data>.*?)(?:k1|k2)
Second, consider using the zero-width lookahead and lookbehind assertions:
(?<=k1|k2)(?<Data>.*?)(?=k1|k2)
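To see why the lookarounds matter, here is a minimal sketch comparing the grouped pattern with the lookaround pattern on the sample input from the question. The class and helper names are made up for illustration. The grouped version consumes the closing keyword, so that keyword cannot open the next match and every other value is skipped.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class DelimiterDemo
{
    // Hypothetical helper: returns the "Data" group of every match.
    public static string[] DataValues(string input, string pattern) =>
        Regex.Matches(input, pattern)
             .Cast<Match>()
             .Select(m => m.Groups["Data"].Value)
             .ToArray();

    public static void Main()
    {
        const string input =
            "somedatahereornotk1capturethis1k2capturethis2k2capturethis3k1k2";

        // Delimiters are consumed: the closing keyword is part of the match.
        var consumed = DataValues(input, @"(?:k1|k2)(?<Data>.*?)(?:k1|k2)");

        // Delimiters are only asserted: they stay available for the next match.
        var asserted = DataValues(input, @"(?<=k1|k2)(?<Data>.*?)(?=k1|k2)");

        Console.WriteLine(consumed.Length); // 2 (capturethis1, capturethis3)
        Console.WriteLine(asserted.Length); // 4 (three values plus one empty match)
    }
}
```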

You are on the right track with the alternations. The missing piece is to use look-behind and look-ahead to assert that something must be preceded and followed by the delimiters.
(?<=k1|k2)(?<Data>.*?)(?=k1|k2)
Lookbehind (?<=…) and lookahead (?=…) are zero-width assertions, so they must be satisfied but do not become part of the match.
Your desire to capture instances of consecutive delimiters is a bit trickier, because you can't really capture "nothing" -- the space between two characters. One approach would be to capture the lookbehind (or lookahead):
(?<=(?<Delimiter>k1|k2))(?<Data>.*?)(?=k1|k2)
This will yield 4 results instead of 3, because it will include the consecutive k1k2 at the end of your sample data. You'll just have to ignore the extra data for each match (k1,k2,k2,k1).
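A rough sketch of that idea follows. It also adds `$` to the lookahead, which is an assumption on my part and not something from the answers above: it lets the final keyword at the end of the line produce an empty value, which gives the five entries the question asks for. The class and method names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordCapture
{
    // Returns one "delimiter=data" string per match (hypothetical format).
    public static string[] Capture(string input)
    {
        // The lookbehind captures which keyword precedes the data;
        // the lookahead stops at the next keyword or at the end of the input.
        const string pattern = @"(?<=(?<Delimiter>k1|k2))(?<Data>.*?)(?=k1|k2|$)";

        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups["Delimiter"].Value + "=" + m.Groups["Data"].Value)
                    .ToArray();
    }

    public static void Main()
    {
        const string input =
            "somedatahereornotk1capturethis1k2capturethis2k2capturethis3k1k2";

        foreach (var entry in Capture(input))
            Console.WriteLine(entry);
        // k1=capturethis1
        // k2=capturethis2
        // k2=capturethis3
        // k1=
        // k2=
    }
}
```

Regex.Matches reports zero-length matches too, so the empty values show up as matches whose Data group is the empty string.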

string s = "somedatahereornotk1capturethis1k2capturethis2k2capturethis3k1k2";
Regex r = new Regex("(?<=k1|k2).*?(?=k1|k2)");
foreach (Match m in r.Matches(s))
    Console.WriteLine(m.Value);

Related

Convert regex pattern from php to C# [duplicate]

I have a regex expression that I tested on http://gskinner.com/RegExr/ and it worked, but when I used it in my C# application it failed.
My regex expression: (?<!\d)\d{6}\K\d+(?=\d{4}(?!\d))
Text: 4000751111115425
Result: 111111
What is wrong with my regex expression?
The issue you are having is that .NET regular expressions do not support \K, "discard what has been matched so far".
I believe your regex translates as "match any string of more than ten \d digits, to as many digits as possible, and discard the first 6 and the last 4".
I believe that the .NET-compliant regex
(?<=\d{6})\d+(?=\d{4})
achieves the same thing. Note that the negative lookbehind/lookahead guarding against extra digits is not necessary here because \d+ is greedy: the engine already tries to match as many digits as possible.
In general, the \K operator (which discards all text matched so far from the match memory buffer) can be emulated with two techniques:
Lookarounds
Capturing groups (with and without lookarounds).
For example,
PCRE a+b+c+=\K\d+ (demo) = .NET (?<=a+b+c+=)\d+ or a+b+c+=(\d+) (and grab Group 1 value)
PCRE ^[^][]+\K.* (demo) = .NET (?<=^[^][]+)(?:\[.*)?$ (demo) or (better here) ^[^][]+(.*) (demo).
The problem with the second example is that [^][]+ can match the same text as .* (these patterns overlap) and since there is no clear boundary between the two patterns, just using a lookbehind is not actually working and needs additional tricks to make it work.
Capturing group approach is universal here and should work in all situations.
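As a small illustration of the two techniques, the sketch below applies the .NET equivalents of the first PCRE example to a made-up input string. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeepOutEmulation
{
    // Technique 1: move the left-hand context into a lookbehind
    // (.NET allows quantifiers such as + inside lookbehinds).
    public static string ViaLookbehind(string input) =>
        Regex.Match(input, @"(?<=a+b+c+=)\d+").Value;

    // Technique 2: match the context normally and read Group 1.
    public static string ViaGroup(string input) =>
        Regex.Match(input, @"a+b+c+=(\d+)").Groups[1].Value;

    public static void Main()
    {
        const string input = "aabbcc=123"; // illustrative input, not from the thread

        Console.WriteLine(ViaLookbehind(input)); // 123
        Console.WriteLine(ViaGroup(input));      // 123
    }
}
```

Both helpers return an empty string when nothing matches, because a failed Regex.Match yields an empty match object.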
Since \K makes the regex engine "forget" the part of a match consumed so far, the best approach here is to use a capturing group to grab the part of a match you need to obtain after the left-hand context:
using System;
using System.Text.RegularExpressions;

public class Test
{
    public static void Main()
    {
        var text = "Text 4000751111115425";
        // Group 1 holds the digits between the first six and the last four.
        var result = Regex.Match(text, @"(?<!\d)\d{6}(\d+)(?=\d{4}(?!\d))")?.Groups[1].Value;
        Console.WriteLine($"Result: '{result}'");
    }
}
See the online C# demo and the regex demo (see Table tab for the proper result table). Details:
(?<!\d) - a left-hand digit boundary
\d{6} - six digits
(\d+) - Capturing group 1: one or more digits
(?=\d{4}(?!\d)) - a positive lookahead that matches a location that is immediately followed with four digits not immediately followed with another digit.

Removing commas from numbers with .NET regex

So I'm processing a report that (brilliantly, really) spits out number values with commas in them, in a .csv output. Super useful.
So, I'm using (C#)regex lookahead positive and lookbehind positive expressions to remove commas that have digits on both sides.
If I use only the lookahead, it seems to work. However when I add the lookbehind as well, the expression breaks down and removes nothing. Both ends of the comma can have arbitrary numbers of digits around them, so I just want to remove the comma if the pattern has one or more digits around it.
Here's the expression that works with the lookahead only:
str = Regex.Replace(str, @"[,](?=(\d+))", "");
Here's the expression that doesn't work as I intend it:
str = Regex.Replace(str, @"[,](?=(\d+)?<=(\d+))", "");
What's wrong with my regex! If I had to guess, there's something I'm misunderstanding about how lookbehind works. Any ideas?
You may use any of the solutions below:
var s = "abc,def,2,100,xyz!,:))))";
Console.WriteLine(Regex.Replace(s, @"(\d),(\d)", "$1$2")); // Does not handle 1,2,3,4 cases
Console.WriteLine(Regex.Replace(s, @"(\d),(?=\d)", "$1")); // Handles consecutive matches with capturing group+backreference/lookahead
Console.WriteLine(Regex.Replace(s, @"(?<=\d),(?=\d)", "")); // Handles consecutive matches with lookbehind/lookahead, the most efficient way
Console.WriteLine(Regex.Replace(s, @",(?<=\d,)(?=\d)", "")); // Also handles all cases
See the C# demo.
Explanations:
(\d),(\d) - matches and captures single digits on both sides of , and $1$2 are replacement backreferences that insert captured texts back into the result
(\d),(?=\d) - matches and captures a digit before ,, then a comma is matched and then a positive lookahead (?=\d) requires a digit after ,, but since it is not consumed, only $1 is required in the replacement pattern
(?<=\d),(?=\d) - only such a comma is matched that is enclosed with digits without consuming the digits ((?<=\d) is a positive lookbehind that requires its pattern match immediately to the left of the current location)
,(?<=\d,)(?=\d) - matches a comma and only after matching it, the regex engine checks if there is a digit and a comma immediately before the location (that is after the comma), and if the check is true, the next char is checked for a digit. If it is a digit, a match is returned.
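The difference between consuming the digits and only asserting them shows up on chained values. This is a minimal sketch with a made-up input ("1,2,3,4") and hypothetical helper names: when both digits are consumed, the digit after the first comma is no longer available to start the next match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommaRemoval
{
    // Consumes the digit on both sides of the comma.
    public static string BothCaptured(string s) =>
        Regex.Replace(s, @"(\d),(\d)", "$1$2");

    // Consumes only the comma; the digits are asserted by lookarounds.
    public static string Lookarounds(string s) =>
        Regex.Replace(s, @"(?<=\d),(?=\d)", "");

    public static void Main()
    {
        const string chained = "1,2,3,4"; // illustrative input

        Console.WriteLine(BothCaptured(chained)); // 12,34
        Console.WriteLine(Lookarounds(chained));  // 1234
    }
}
```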
Bonus:
You may just match a pattern like yours with \d,\d and pass the match to the MatchEvaluator method where you may manipulate the match further:
Console.WriteLine(Regex.Replace(s, @"\d,\d", m => m.Value.Replace(",",string.Empty))); // Callback method
Here, m is the match object and m.Value holds the whole match value. With .Replace(",",string.Empty), you remove all commas from the match value.
You can always check a website that evaluates regex expressions.
I think this code might be able to help you:
str = Regex.Replace(str, @"(?<=(\d))[,](?=(\d+))", "");

Multiple RegEx negation matching

I have the following RegEx patterns:
"[0-9]{4,5}\.FU|[0-9]{4,5}\.NG|[0-9]{4,5}\.SP|[0-9]{4,5}\.T|JGB[A-Z][0-9]|JNI[A-Z][0-9]|JN4F[A-Z][0-9]|JNM[A-Z][0-9]|JTI[A-Z][0-9]|JTM[A-Z][0-9]|NIY[A-Z][0-9]|SSI[A-Z][0-9]|JNI[A-Z][0-9]-[A-Z][0-9]|JTI[A-Z][0-9]-[A-Z][0-9]" ===> matches 8411.T or JNID8
"[0-9]{4,5}\.HK|HSI[A-Z][0-9]|HMH[A-Z][0-9]|HCEI[A-Z][0-9]|HCEI[A-Z][0-9]-[A-Z][0-9]" ==> matches 9345.HK or HCEIU9-A9
".*\.SI|SFC[A-Z][0-9]" ==> matches 8345.SI or SFCX8
How can I obtain a RegEx from the negation of these patterns?
I want to match strings that match none of these 3 patterns:
e.g. I want to match 8411.ABC, but not any of the aforementioned strings (8411.T, HCEIU9-A9, 8345.SI, etc.).
I've tried (just to exclude 2 and 3 for instance, but it doesn't work):
^(?!((.*\.SI|SFC[A-Z][0-9])|([0-9]{4,5}\.HK|HSI[A-Z][0-9]|HMH[A-Z][0-9]|HCEI[A-Z][0-9]|HCEI[A-Z][0-9]-[A-Z][0-9])))
The main idea here is to place the patterns into (?!.*<pattern>) negative lookaheads anchored at the start of the string (^). The difficulty is that your patterns contain unanchored alternations, and if they are not grouped, the .* before the patterns will only apply to the first alternative (i.e. all the subsequent alternatives will only be negated at the start of the string).
Thus, your pattern formula is ^(?!.*(?:<PATTERN1>))(?!.*(?:<PATTERN2>))(?!.*(?:<PATTERN3>)). Note that the .+ or .* at the end is optional if you just need a boolean result. Also, in the last pattern you need to remove the .* in the first alternative, since .*.* makes no sense.
Use
^(?!.*(?:[0-9]{4,5}\.FU|[0-9]{4,5}\.NG|[0-9]{4,5}\.SP|[0-9]{4,5}\.T|JGB[A-Z][0-9]|JNI[A-Z][0-9]|JN4F[A-Z][0-9]|JNM[A-Z][0-9]|JTI[A-Z][0-9]|JTM[A-Z][0-9]|NIY[A-Z][0-9]|SSI[A-Z][0-9]|JNI[A-Z][0-9]-[A-Z][0-9]|JTI[A-Z][0-9]-[A-Z][0-9]))(?!.*(?:[0-9]{4,5}\.HK|HSI[A-Z][0-9]|HMH[A-Z][0-9]|HCEI[A-Z][0-9]|HCEI[A-Z][0-9]-[A-Z][0-9]))(?!.*(?:\.SI|SFC[A-Z][0-9])).+
See the regex demo.
You may also contract the formula to ^(?!.*(?:<PATTERN1>|<PATTERN2>|<PATTERN3>)):
^(?!.*(?:[0-9]{4,5}\.FU|[0-9]{4,5}\.NG|[0-9]{4,5}\.SP|[0-9]{4,5}\.T|JGB[A-Z][0-9]|JNI[A-Z][0-9]|JN4F[A-Z][0-9]|JNM[A-Z][0-9]|JTI[A-Z][0-9]|JTM[A-Z][0-9]|NIY[A-Z][0-9]|SSI[A-Z][0-9]|JNI[A-Z][0-9]-[A-Z][0-9]|JTI[A-Z][0-9]-[A-Z][0-9]|[0-9]{4,5}\.HK|HSI[A-Z][0-9]|HMH[A-Z][0-9]|HCEI[A-Z][0-9]|HCEI[A-Z][0-9]-[A-Z][0-9]|\.SI|SFC[A-Z][0-9])).+
See another regex demo.
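If the three patterns live in code anyway, the contracted formula can be assembled from them instead of being typed out by hand. The following is only a sketch under that assumption; the class, field, and method names are hypothetical, and the third pattern is used with its leading .* removed as described above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NegatedPatterns
{
    // The three original patterns (the third without its leading ".*").
    private static readonly string[] Patterns =
    {
        @"[0-9]{4,5}\.FU|[0-9]{4,5}\.NG|[0-9]{4,5}\.SP|[0-9]{4,5}\.T|JGB[A-Z][0-9]|JNI[A-Z][0-9]|JN4F[A-Z][0-9]|JNM[A-Z][0-9]|JTI[A-Z][0-9]|JTM[A-Z][0-9]|NIY[A-Z][0-9]|SSI[A-Z][0-9]|JNI[A-Z][0-9]-[A-Z][0-9]|JTI[A-Z][0-9]-[A-Z][0-9]",
        @"[0-9]{4,5}\.HK|HSI[A-Z][0-9]|HMH[A-Z][0-9]|HCEI[A-Z][0-9]|HCEI[A-Z][0-9]-[A-Z][0-9]",
        @"\.SI|SFC[A-Z][0-9]"
    };

    // ^(?!.*(?:P1|P2|P3)).+  built from the array above.
    private static readonly Regex Negation =
        new Regex("^(?!.*(?:" + string.Join("|", Patterns) + ")).+");

    // True when the input contains none of the original patterns.
    public static bool MatchesNone(string input) => Negation.IsMatch(input);

    public static void Main()
    {
        Console.WriteLine(MatchesNone("8411.ABC"));  // True
        Console.WriteLine(MatchesNone("8411.T"));    // False
        Console.WriteLine(MatchesNone("HCEIU9-A9")); // False
        Console.WriteLine(MatchesNone("8345.SI"));   // False
    }
}
```

Because the alternatives are unanchored, a string is rejected as soon as any of them occurs anywhere in it, not only when it matches in full.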

C# Regular expression to match on a character not following pairs of the same character

Objective: Regex Matching
For this example I'm interested in matching a "|" pipe character.
I need to match it if it's alone: "aaa|aaa"
I need to match it (the last pipe) only if it's preceded by pairs of pipe: (2,4,6,8...any even number)
Another way: I want to ignore ALL pipe pairs "||" (right to left)
or I want to select bachelor bars only (the odd man out)
string twomatches = "aaaaaaaaa|||||aaaaaa|||aaaaaa"; // last pipe of each odd run
string onematch = "aaaaaaaaa|||aaaaaaa||aaaaaaaa";   // last pipe of the first run only
string noMatch = "||";
string noMatch2 = "||||";
I'm trying to select the last "|" only when it is preceded by an even number of "|" characters (whole pairs), or when a single bar exists by itself, regardless of the total number of "|" in the string.
You may use the following regex to select just odd one pipe out:
(?<=(?<!\|)(?:\|{2})*)\|(?!\|)
See regex demo.
The regex breakdown:
(?<=(?<!\|)(?:\|{2})*) - if a pipe is preceded with an even number of pipes ((?:\|{2})* - 0 or more sequences of exactly 2 pipes) from a position that has no preceding pipe ((?<!\|))
\| - match an odd pipe on the right
(?!\|) - if it is not followed by another pipe.
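A quick way to check the breakdown is to count matches on a few pipe runs of different lengths. This is a minimal sketch; the inputs are illustrative strings with runs of one, two, three, four, and five pipes, and the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OddPipeFinder
{
    // The variable-width lookbehind pattern from the answer above.
    private static readonly Regex OddPipe =
        new Regex(@"(?<=(?<!\|)(?:\|{2})*)\|(?!\|)");

    // Number of "odd one out" pipes in the input.
    public static int Count(string input) => OddPipe.Matches(input).Count;

    public static void Main()
    {
        Console.WriteLine(Count("aaa|aaa"));                       // 1
        Console.WriteLine(Count("aaaaaaaaa|||||aaaaaa|||aaaaaa")); // 2
        Console.WriteLine(Count("aaaaaaaaa|||aaaaaaa||aaaaaaaa")); // 1
        Console.WriteLine(Count("||"));                            // 0
        Console.WriteLine(Count("||||"));                          // 0
    }
}
```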
Please note that this regex uses a variable-width look-behind and is very resource-consuming. I'd rather use a capturing group mechanism here, but it all depends on the actual purpose of matching that odd pipe.
Here is a modified version of the regex for removing the odd one out:
var s = "1|2||3|||4||||5|||||6||||||7|||||||";
var data = Regex.Replace(s, @"(?<!\|)(?<even_pipes>(?:\|{2})*)\|(?!\|)", "${even_pipes}");
Console.WriteLine(data);
See IDEONE demo. Here, the quantified part is moved from lookbehind to an even_pipes named capturing group, so that it could be restored with the backreference in the replaced string. Regexhero.net shows 129,046 iterations per second for the version with a capturing group and 69,206 with the original version with variable-width lookbehind.
Only use variable-width look-behind if it is absolutely necessary!
Oh, it's reopened! If you need better performance, also try this negative improved version.
\|(?!\|)(?<!(?:[^|]|^)(?:\|\|)*)
The idea here is to first match the last literal | at right side of a sequence or single | and execute a negated version of the lookbehind just after the match. This should perform considerably better.
\|(?!\|) matches literal | IF NOT followed by another pipe character (right most if sequence).
(?<!(?:[^|]|^)(?:\|\|)*) IF the position right after the matched | IS NOT preceded by (?:\|\|)* any number of literal || reaching back to a non-| character or the ^ start. In other words: if this position is not preceded by an even number of pipe characters.
Btw, there is no performance gain in using \|{2} over \|\|, though it might be more readable.
See demo at regexstorm
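Since the two patterns are meant to be interchangeable, one sanity check is to confirm they report the same match positions. The sketch below does that on the test string from the replacement example; it says nothing about speed, and the class and method names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class OddPipeComparison
{
    // Pattern with the positive variable-width lookbehind in front of the pipe.
    public const string LookbehindFirst = @"(?<=(?<!\|)(?:\|{2})*)\|(?!\|)";

    // Pattern that matches the pipe first and checks the negated lookbehind afterwards.
    public const string MatchFirst = @"\|(?!\|)(?<!(?:[^|]|^)(?:\|\|)*)";

    // Start index of every match of the pattern in the input.
    public static int[] Indexes(string input, string pattern) =>
        Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Index).ToArray();

    public static void Main()
    {
        const string s = "1|2||3|||4||||5|||||6||||||7|||||||";

        var first = Indexes(s, LookbehindFirst);
        var second = Indexes(s, MatchFirst);

        // Both patterns should pick the last pipe of every odd-length run.
        Console.WriteLine(first.SequenceEqual(second)); // True
        Console.WriteLine(first.Length);                // 4 (runs of 1, 3, 5 and 7 pipes)
    }
}
```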

Using regex to match any character until a substring is reached?

I'd like to be able to match a specific sequence of characters, starting with a particular substring and ending with a particular substring. My positive lookahead regex works if there is only one instance to match on a line, but not if there should be multiple matches on a line. I understand this is because (.+) captures up everything until the last positive lookahead expression is found. It'd be nice if it would capture everything until the first expression is found.
Here is my regex attempt:
##FOO\[(.*)(?=~~)~~(.*)(?=\]##)\]##
Sample input:
##FOO[abc~~hi]## ##FOO[def~~hey]##
Desired output: 2 matches, with 2 matching groups each (abc, hi) and (def, hey).
Actual output: 1 match with 2 groups (abc~~hi]## ##FOO[def, hey)
Is there a way to get the desired output?
Thanks in advance!
Use the question mark: it makes the quantifier lazy, so it matches as few times as possible.
##FOO\[(.*?)(?=~~)~~(.*?)(?=\]##)\]##
This one also works and is easier to read; the lookaheads are redundant because the literal ~~ and \]## that follow them already require the same text:
##FOO\[(.*?)~~(.*?)\]##
The * operator is greedy by default, meaning it eats up as much of the string as possible while still leaving enough to match the remaining regex. You can make it not greedy (lazy) by appending a ? to it.
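Here is a small sketch of the greedy and lazy variants side by side on the sample input from the question. The class and helper names are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Returns "group1,group2" for every match of the pattern.
    public static string[] Pairs(string input, string pattern) =>
        Regex.Matches(input, pattern)
             .Cast<Match>()
             .Select(m => m.Groups[1].Value + "," + m.Groups[2].Value)
             .ToArray();

    public static void Main()
    {
        const string input = "##FOO[abc~~hi]## ##FOO[def~~hey]##";

        // Greedy: the first group runs to the last "~~" on the line.
        var greedy = Pairs(input, @"##FOO\[(.*)~~(.*)\]##");

        // Lazy: each group stops at the first place the rest can match.
        var lazy = Pairs(input, @"##FOO\[(.*?)~~(.*?)\]##");

        Console.WriteLine(greedy.Length);           // 1
        Console.WriteLine(string.Join(" ; ", lazy)); // abc,hi ; def,hey
    }
}
```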
You could use the String.IndexOf() method instead to find the first occurrence of your substring.
