Match words in string using negative lookbehind

I am trying to get the words that don't start with "un" using a pattern with a negative lookbehind. This is the code:
using Regexp = System.Text.RegularExpressions.Regex;
using RegexpOptions = System.Text.RegularExpressions.RegexOptions;
string quote = "Underground; round; unstable; unique; queue";
Regexp negativeViewBackward = new Regexp(@"(?<!un)\w+\b", RegexpOptions.IgnoreCase);
MatchCollection finds = negativeViewBackward.Matches(quote);
Console.WriteLine(String.Join(", ", finds));
It always returns the full set of words, but it should return only round and queue.

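The (?<!un)\w+\b pattern first matches a location that is not immediately preceded with un (the negative lookbehind), then matches 1 or more word chars followed with a word boundary position. The start of every word in your string is preceded with a space or with nothing, never with un, so the lookbehind always succeeds and every word is matched.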
You need to use a negative lookahead after a leading word boundary:
\b(?!un)\w+\b
See the regex demo.
Details
\b - leading word boundary
(?!un) - a negative lookahead that fails the match if the next two word chars are un
\w+ - 1+ word chars
\b - a trailing word boundary.
C# demo:
string quote = "Underground; round; unstable; unique; queue";
Regex negativeViewBackward = new Regex(@"\b(?!un)\w+\b", RegexOptions.IgnoreCase);
List<string> result = negativeViewBackward.Matches(quote).Cast<Match>().Select(x => x.Value).ToList();
foreach (string s in result)
    Console.WriteLine(s);
Output:
round
queue
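To see why the lookbehind version returns every word, the two patterns can be run side by side. This is only a minimal sketch: the LookaroundDemo class and the Words helper are hypothetical names that are not part of the original answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class LookaroundDemo
{
    // Hypothetical helper: returns every match of the pattern, ignoring case.
    public static string[] Words(string input, string pattern) =>
        Regex.Matches(input, pattern, RegexOptions.IgnoreCase)
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    static void Main()
    {
        string quote = "Underground; round; unstable; unique; queue";

        // Lookbehind: only inspects the text BEFORE the match start,
        // so a match that begins at the "U" of "Underground" is accepted.
        Console.WriteLine(string.Join(", ", Words(quote, @"(?<!un)\w+\b")));
        // Underground, round, unstable, unique, queue

        // Lookahead after \b: inspects the first two letters of the word itself.
        Console.WriteLine(string.Join(", ", Words(quote, @"\b(?!un)\w+\b")));
        // round, queue
    }
}
```

The lookahead has to sit right after the leading \b; without that boundary the engine could still start a match in the middle of a word such as "unstable".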

Related

Regex match all words enclosed by parentheses and separated by a pipe

I think an image is sometimes better than words.
My problem, as you can see, is that it only matches the words two by two. How can I match all of the words?
My current regex (PCRE) : ([^\|\(\)\|]+)\|([^\|\(\)\|]+)
The goal: retrieve all the words, each in a separate group.

You can use an infinite length lookbehind in C# (with a lookahead):
(?<=\([^()]*)\w+(?=[^()]*\))
To match any kind of strings inside parentheses, that do not consist of (, ) and |, you will need to replace \w+ with [^()|]+:
(?<=\([^()]*)[^()|]+(?=[^()]*\))
//           ^^^^^^^
See the regex demo (and regex demo #2). Details:
(?<=\([^()]*) - a positive lookbehind that matches a location that is immediately preceded with ( and then zero or more chars other than ( and )
\w+ - one or more word chars
(?=[^()]*\)) - a positive lookahead that matches a location that is immediately followed with zero or more chars other than ( and ) and then a ) char.
Another way to capture these words is by using
(?:\G(?!^)\||\()(\w+)(?=[^()]*\)) // words as units consisting of letters/digits/diacritics/connector punctuation
(?:\G(?!^)\||\()([^()|]+)(?=[^()]*\)) // "words" that consist of any chars other than (, ) and |
See this regex demo. The words you need are now in Group 1. Details:
(?:\G(?!^)\||\() - a position after the previous match (\G(?!^)) and a | char (\|), or (|) a ( char (\()
(\w+) - Group 1: one or more word chars
(?=[^()]*\)) - a positive lookahead that makes sure there is a ) char after any zero or more chars other than ( and ) to the right of the current position.
Extracting the matches in C# can be done with
var matches = Regex.Matches(text, @"(?<=\([^()]*)\w+(?=[^()]*\))")
    .Cast<Match>()
    .Select(x => x.Value);
// Or
var matches = Regex.Matches(text, @"(?:\G(?!^)\||\()(\w+)(?=[^()]*\))")
    .Cast<Match>()
    .Select(x => x.Groups[1].Value);
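As a rough check that both patterns return the same words, they can be run against one sample string. The sketch below is an illustration only: the PipeWordsDemo class, its two helper methods and the sample input are assumptions, not part of the original answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class PipeWordsDemo
{
    // Variant 1: unbounded lookbehind plus lookahead, the words are the match values.
    public static string[] ViaLookbehind(string text) =>
        Regex.Matches(text, @"(?<=\([^()]*)\w+(?=[^()]*\))")
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    // Variant 2: \G-chained matches, the words are in Group 1.
    public static string[] ViaContiguousMatches(string text) =>
        Regex.Matches(text, @"(?:\G(?!^)\||\()(\w+)(?=[^()]*\))")
             .Cast<Match>()
             .Select(m => m.Groups[1].Value)
             .ToArray();

    static void Main()
    {
        string text = "I love (chocolate|fish|honey|more)"; // assumed sample input

        Console.WriteLine(string.Join(", ", ViaLookbehind(text)));
        // chocolate, fish, honey, more
        Console.WriteLine(string.Join(", ", ViaContiguousMatches(text)));
        // chocolate, fish, honey, more
    }
}
```

Words outside the parentheses ("I", "love") are skipped by both variants, because neither the lookbehind nor the \G chain can reach them from an opening parenthesis.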

In C# you can also use a repeated named capture group and read all of its captures.
The matches are in the named group word:
\((?<word>\w+)(?:\|(?<word>\w+))*\)
\( Match (
(?<word>\w+) Match 1+ word chars in group word
(?: Non capture group
\| Match |
(?<word>\w+) Match 1+ word chars
)* Close the non capture group and optionally repeat to get all occurrences
\) Match the closing parenthesis
Code example provided by Wiktor Stribiżew in the comments:
var line = "I love (chocolate|fish|honey|more)";
var output = Regex.Matches(line, @"\((?<word>\w+)(?:\|(?<word>\w+))*\)")
    .Cast<Match>()
    .SelectMany(x => x.Groups["word"].Captures);
foreach (var s in output)
    Console.WriteLine(s);
Output
chocolate
fish
honey
more
Regex demo

Regex - Get digits after a colon

I have a regex:
var topPayMatch = Regex.Match(result, @"(?<=Top Pay)(\D*)(\d+(?:\.\d+)?)", RegexOptions.IgnoreCase);
I also have to convert the result to an int, which I did like this:
topPayMatch = Convert.ToInt32(topPayMatchString.Groups[2].Value);
So now, if the input is Top Pay: 1,000,000, it currently grabs only the first digit, which is 1. I want the whole number, 1000000.
If the input is Top Pay: 888,888, I want 888888.
What should I add to my regex?

You can use something as simple as @"(?<=Top Pay: )([0-9,]+)". Note that decimals will be ignored with this regex.
This matches the digits and commas that follow Top Pay: , after which you can parse the value to an integer.
Example:
Regex rgx = new Regex(@"(?<=Top Pay: )([0-9,]+)");
string str = "Top Pay: 1,000,000";
Match match = rgx.Match(str);
if (match.Success)
{
    string val = match.Value;
    int num = int.Parse(val, System.Globalization.NumberStyles.AllowThousands);
    Console.WriteLine(num);
}
Console.WriteLine("Ended");
Source:
Convert int from string with commas

If you use the lookbehind, you don't need the capture groups and you can move the \D* into the lookbehind.
To get the values, you can match 1+ digits followed by optional repetitions of , and 1+ digits.
Note that your example data contains commas and no dots, and that ? as a quantifier means 0 or 1 time, so \d+(?:\.\d+)? stops at the first comma.
(?<=Top Pay\D*)\d+(?:,\d+)*
The pattern matches:
(?<=Top Pay\D*) Positive lookbehind, assert what is to the left is Top Pay and optional non digits
\d+ Match 1+ digits
(?:,\d+)* Optionally repeat a , and 1+ digits
See a .NET regex demo and a C# demo
string pattern = @"(?<=Top Pay\D*)\d+(?:,\d+)*";
string input = @"Top Pay: 1,000,000
Top Pay: 888,888";
RegexOptions options = RegexOptions.IgnoreCase;
foreach (Match m in Regex.Matches(input, pattern, options))
{
    var topPayMatch = int.Parse(m.Value, System.Globalization.NumberStyles.AllowThousands);
    Console.WriteLine(topPayMatch);
}
Output
1000000
888888
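One caveat with int.Parse and NumberStyles.AllowThousands is that, without an explicit format provider, the thousands separator comes from the current culture, so the parse may fail on machines where the group separator is not a comma. A possible variant that pins the culture is sketched below; the TopPayDemo class and the ExtractTopPay helper are hypothetical names used for illustration.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class TopPayDemo
{
    // Hypothetical helper: returns the first "Top Pay" amount, or null when there is none.
    public static int? ExtractTopPay(string input)
    {
        Match m = Regex.Match(input, @"(?<=Top Pay\D*)\d+(?:,\d+)*", RegexOptions.IgnoreCase);
        if (!m.Success)
            return null;

        // InvariantCulture uses ',' as the group separator regardless of machine settings.
        return int.Parse(m.Value, NumberStyles.AllowThousands, CultureInfo.InvariantCulture);
    }

    static void Main()
    {
        Console.WriteLine(ExtractTopPay("Top Pay: 1,000,000"));     // 1000000
        Console.WriteLine(ExtractTopPay("Top Pay: 888,888"));       // 888888
        Console.WriteLine(ExtractTopPay("No amount here") is null); // True
    }
}
```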

Regex match with Arabic

I have a text in Arabic and I want to use Regex to extract numbers from it. Here is my attempt.
String:
"ما المجموع:
1+2"
Match match = Regex.Match(text, "المجموع: ([^\\r\\n]+)", RegexOptions.IgnoreCase);
It never matches: match.Success is always false and match.Groups[1].Value is always empty.
expected output:
match.Groups[1].Value //returns (1+2)

The regex you wrote matches the word, then a colon, then a single space, and then 1 or more chars other than CR and LF. In your text the colon is followed by a line break rather than a space, so the match fails.
You want to match the whole line after the word, colon and any amount of whitespace chars:
var text = "ما المجموع:\n1+2";
var result = Regex.Match(text, @"المجموع:\s*(.+)")?.Groups[1].Value;
Console.WriteLine(result); // => 1+2
See the C# demo
Other possible patterns:
@"المجموع:\r?\n(.+)" // To match a CRLF or LF line ending only
@"المجموع:\n(.+)" // To match an LF line ending only
Also, if you run the regex against a long multiline text with CRLF endings, it makes sense to replace .+ with [^\r\n]+, since . in a .NET regex matches any char but a newline (LF), and thus also matches the CR symbol.
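The effect of a CRLF line ending on .+ can be checked with a small experiment. The following is a sketch under assumptions: the LineEndingDemo class, the After helper and the extra "next line" text are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class LineEndingDemo
{
    // Hypothetical helper: returns Group 1 of the first match (empty string if no match).
    public static string After(string text, string pattern) =>
        Regex.Match(text, pattern).Groups[1].Value;

    static void Main()
    {
        // Same text as in the question, but with CRLF line endings and one more line.
        string text = "ما المجموع:\r\n1+2\r\nnext line";

        string withDot = After(text, @"المجموع:\s*(.+)");
        string withClass = After(text, @"المجموع:\s*([^\r\n]+)");

        // The dot stops at LF only, so the captured value keeps a trailing CR.
        Console.WriteLine(withDot.Length);          // 4
        Console.WriteLine(withDot.EndsWith("\r"));  // True

        // The negated class stops at CR as well.
        Console.WriteLine(withClass);               // 1+2
    }
}
```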

Regex to extract substrings in C#

I have a string as:
string subjectString = @"(((43*('\\uth\Hgh.Green.two.190ITY.PCV')*9.8)/100000+('VBNJK.PVI.10JK.PCV'))*('ASFGED.Height Density.1JKHB01.PCV')/476)";
My expected output is:
Hgh.Green.two.190ITY.PCV
VBNJK.PVI.10JK.PCV
ASFGED.Height Density.1JKHB01.PCV
Here's what I have tried:
Regex regexObj = new Regex(@"'[^\\]*.PCV");
Match matchResults = regexObj.Match(subjectString);
string val = matchResults.Value;
This works when the input string is @"(((43*('\\uth\Hgh.Green.two.190ITY.PCV')*9.8)/100000+"; but when the string grows and there is more than one substring to extract, I get undesired results.
How do I extract three substrings from the original string?

It seems you want to match word and . chars before .PCV.
Use
[\w\s.]*\.PCV
See the regex demo
To force at least 1 word char at the start use
\w[\w\s.]*\.PCV
Optionally, if needed, add a word boundary at the start: @"\b\w[\w\s.]*\.PCV".
To force \w match only ASCII letters and digits (and _) compile the regex object with RegexOptions.ECMAScript option.
Here,
\w - matches any letter, digit or _
[\w\s.]* - matches 0+ whitespace, word or/and . chars
\. - a literal .
PCV - a PCV substring.
Sample usage:
var results = Regex.Matches(str, @"\w[\w\s.]*\.PCV")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToList();
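Applied to the full string from the question, the pattern can be tried as follows. This is only a sketch: the PcvDemo class and the ExtractPcv helper are hypothetical names introduced for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class PcvDemo
{
    // Hypothetical helper: collects every substring that starts with a word char
    // and ends with ".PCV", allowing word chars, whitespace and dots in between.
    public static List<string> ExtractPcv(string input) =>
        Regex.Matches(input, @"\w[\w\s.]*\.PCV")
             .Cast<Match>()
             .Select(m => m.Value)
             .ToList();

    static void Main()
    {
        string subjectString = @"(((43*('\\uth\Hgh.Green.two.190ITY.PCV')*9.8)/100000+('VBNJK.PVI.10JK.PCV'))*('ASFGED.Height Density.1JKHB01.PCV')/476)";

        foreach (string s in ExtractPcv(subjectString))
            Console.WriteLine(s);
        // Hgh.Green.two.190ITY.PCV
        // VBNJK.PVI.10JK.PCV
        // ASFGED.Height Density.1JKHB01.PCV
    }
}
```

The backslash before Hgh is not matched by [\w\s.], which is why the first result starts at Hgh rather than at uth.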

How can I use lookbehind in a C# Regex in order to skip matches of repeated prefix patterns?

Example - I'm trying to have the expression match all the b characters following any number of a characters:
Regex expression = new Regex("(?<=a).*");
foreach (Match result in expression.Matches("aaabbbb"))
MessageBox.Show(result.Value);
This returns aabbbb, since the lookbehind matches only a single a. How can I make it match all the a characters at the beginning?
I've tried
Regex expression = new Regex("(?<=a+).*");
and
Regex expression = new Regex("(?<=a)+.*");
with no results...
What I'm expecting is bbbb.

Are you looking for a repeated capturing group?
(.)\1*
This will return two matches.
Given:
aaabbbb
This will result in:
aaa
bbbb
This:
(?<=(.))(?!\1).*
This uses the above principle: the lookbehind captures the previous character into a backreference, and the lookahead then asserts that the next character is not that same character.
That matches:
bbbb

I figured it out eventually:
Regex expression = new Regex("(?<=a+)[^a]+");
foreach (Match result in expression.Matches(@"aaabbbb"))
    MessageBox.Show(result.Value);
I must not allow the a characters to be matched by the part of the pattern outside the lookbehind. This way, the expression will only match those b repetitions that follow a repetitions.
Matching aaabbbb yields bbbb and matching aaabbbbcccbbbbaaaaaabbzzabbb results in bbbbcccbbbb, bbzz and bbb.
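The difference between the attempts in the question and the working pattern can be reproduced in a console program. The sketch below is an illustration only: the PrefixDemo class and the All helper are hypothetical names, and Console.WriteLine stands in for MessageBox.Show.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class PrefixDemo
{
    // Hypothetical helper: returns the values of all matches.
    public static string[] All(string input, string pattern) =>
        Regex.Matches(input, pattern)
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    static void Main()
    {
        // Both lookbehinds are already satisfied right after the first "a",
        // and .* then takes the remaining "a" characters too.
        Console.WriteLine(Regex.Match("aaabbbb", @"(?<=a).*").Value);  // aabbbb
        Console.WriteLine(Regex.Match("aaabbbb", @"(?<=a+).*").Value); // aabbbb

        // [^a]+ cannot start on an "a", so the match begins after the whole run.
        Console.WriteLine(string.Join(", ", All("aaabbbb", @"(?<=a+)[^a]+")));
        // bbbb
        Console.WriteLine(string.Join(", ", All("aaabbbbcccbbbbaaaaaabbzzabbb", @"(?<=a+)[^a]+")));
        // bbbbcccbbbb, bbzz, bbb
    }
}
```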

The look-behind skips only one "a" because it just requires a single "a" immediately before the match start (without consuming or capturing it), so the match begins right after the first "a" and .* takes the rest.
Would this pattern work for you instead? New pattern: \ba+(.+)\b
It uses a word boundary \b to anchor both ends of the word. It matches at least one "a" followed by the rest of the characters up to the closing word boundary. The remaining characters are captured in a group so you can reference them easily.
string pattern = @"\ba+(.+)\b";
foreach (Match m in Regex.Matches("aaabbbb", pattern))
{
Console.WriteLine("Match: " + m.Value);
Console.WriteLine("Group capture: " + m.Groups[1].Value);
}
UPDATE: If you want to skip the leading run of any repeated letter and then match the rest of the string, you could do this:
string pattern = @"\b(.)(\1)*(?<Content>.+)\b";
foreach (Match m in Regex.Matches("aaabbbb", pattern))
{
Console.WriteLine("Match: " + m.Value);
Console.WriteLine("Group capture: " + m.Groups["Content"].Value);
}
