Regex c# obtain subgroup of a captured group - c#

It seems a simple question, but I don't think it is so easy.
From the example string AAACARACBBBBBDZAAAAEE, I want to extract the first 8 characters (= AAACARAC) and from this resulting 8-char long string, I want to extract everything except the leading 'A' characters (= CARAC).
I tried with this regex (?^[A]<WORD>\w{8}), but I don't know how to apply another regex to the captured group named WORD.

This is the regex you want:
(?=^.{8}(.*)$)A*(?<WORD>.*?)\1$
See a demo here (click then on "Table" for looking at the specific matches).
The regex first matches the first eight characters inside a lookahead and captures whatever comes after them (the "tail") in the first capturing group. It then restarts from the beginning of the string, skips all the leading As, and matches as few characters as possible such that they are followed by the same content as the first capturing group.
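As a minimal sketch of how the pattern above could be used from C#, the snippet below reads the WORD group for the example string. The helper name ExtractWord is hypothetical and only serves the illustration; it returns an empty string when nothing matches.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of the original answer): returns the WORD
// group, or an empty string when the input does not match.
static string ExtractWord(string input)
{
    // The lookahead captures everything after the first 8 chars into group 1.
    // WORD then grows lazily until only that tail (\1) remains before the end.
    Match m = Regex.Match(input, @"(?=^.{8}(.*)$)A*(?<WORD>.*?)\1$");
    return m.Success ? m.Groups["WORD"].Value : string.Empty;
}

Console.WriteLine(ExtractWord("AAACARACBBBBBDZAAAAEE")); // CARAC
```

In .NET, unnamed groups are numbered before named ones, which is why \1 refers to the tail group even though WORD appears in the same pattern.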

Using C#, you might also use a positive lookbehind to assert 8 chars to the left, matching optional A's and capture the chars that follow in a group.
^A*(?<WORD>[^\sA].*)(?<=^.{8})
^ Start of string
A* match optional repetitions of A
(?<WORD> Named group WORD
[^\sA].* Match a non-whitespace char other than A, followed by the rest of the line
) Close named group WORD
(?<=^.{8}) Assert 8 chars to the left of the current position
.NET regex demo
If you only want to match word characters:
^A*(?<WORD>[^\WA]\w*)(?<=^\w{8})
.NET Regex demo
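The two lookbehind patterns above can be tried with a short sketch like the following. The helper ExtractWord and its parameter order are assumptions made for the example, not something from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: applies a pattern and returns its WORD group,
// or an empty string when the pattern does not match.
static string ExtractWord(string pattern, string input)
{
    Match m = Regex.Match(input, pattern);
    return m.Success ? m.Groups["WORD"].Value : string.Empty;
}

const string AnyChars = @"^A*(?<WORD>[^\sA].*)(?<=^.{8})";
const string WordCharsOnly = @"^A*(?<WORD>[^\WA]\w*)(?<=^\w{8})";

// .* first runs to the end of the line, then backtracks until the
// lookbehind finds exactly 8 chars between the start and the current position.
Console.WriteLine(ExtractWord(AnyChars, "AAACARACBBBBBDZAAAAEE"));      // CARAC
Console.WriteLine(ExtractWord(WordCharsOnly, "AAACARACBBBBBDZAAAAEE")); // CARAC
```

This relies on .NET allowing unbounded patterns inside a lookbehind; many other regex engines reject them.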

Related

Regex pattern for search first letters of the first and last name

I have a problem with regex pattern. Every day I get names and surnames. Example:
Darkholme Van Tadashi
Herrington Billy Aniki
Johny
Walker Sam Cooler
etc..
The fact is that they are specific and do not consist of just one last name and first name.
From this list, I need to select one person (whose last name and first name I know). To do this, I found pattern:
"Darkholme|\b[vt]"
As I said, I know the person's data in advance (before the list arrives). But I only know his last name. The second and third names (Van Tadashi) are unknown to me, I only know the first letters of these names ("V" and "T"). I ran into this problem: when regex analyzes incoming data (I use regex.ismatch), it returns true if the input string is "Van Dungeonmaster". How do I create a pattern that will only return true if the surname=Darkholme, first letters of the second and third names match (=V and T)?
Perhaps I'm not making myself clear.. But in the end, it should turn out that I passed only the last name and the first letters of the first name and patronymic to pattern, and regex gave a match for input string.
If there is a comma present and the names can start with either V or T where the third name can be optional, you could use an optional group matching any non whitespace char except a comma.
\bDarkholme\s+[VT][^\s,]+(?:\s+[VT][^\s,]+)?
\b Word boundary, to prevent Darkholme being part of a larger word
Darkholme Match literally
\s+[VT] Match 1+ whitespace chars followed by either V or T
[^\s,]+ Match 1+ times any char except a whitespace char or comma
(?: Non capture group
\s+[VT] Match 1+ whitespace chars followed by either V or T
[^\s,]+ Match 1+ times any char except a whitespace char or comma
)? Close the group to make the 3rd part optional
.NET regex demo
If you know that the name starts with V for the second and T for the third:
\bDarkholme\s+V[^\s,]+(?:\s+T[^\s,]+)?
.NET regex demo
If the name can also be a single V or T, the quantifier could be an asterisk: [^\s,]*
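A small sketch of how the V/T pattern behaves with Regex.IsMatch on a few incoming names. The sample list and the StrictPattern variant (which simply drops the trailing ? to make the third name mandatory) are assumptions added for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from the answer: surname, a name starting with V,
// and optionally a name starting with T.
const string Pattern = @"\bDarkholme\s+V[^\s,]+(?:\s+T[^\s,]+)?";

// Assumed variation: the same pattern with the third name required.
const string StrictPattern = @"\bDarkholme\s+V[^\s,]+\s+T[^\s,]+";

string[] incoming =
{
    "Darkholme Van Tadashi",
    "Herrington Billy Aniki",
    "Van Dungeonmaster",
    "Darkholme Van Dungeonmaster",
};

foreach (string name in incoming)
{
    Console.WriteLine(
        $"{name}: optional third={Regex.IsMatch(name, Pattern)}, required third={Regex.IsMatch(name, StrictPattern)}");
}
```

Because the third part is optional in the answer's pattern, "Darkholme Van Dungeonmaster" still counts as a match there; only the stricter variant rejects it.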
Your pattern as is means "match any string that contains Darkholme, or any string where any word starts with a v or a t", which isn't quite what you want.
Perhaps
Darkholme\s+V\S*\s+T
would suit you better. It means "Darkholme followed by at least one whitespace character, then V, followed by any number of non-whitespace characters, then at least one whitespace character followed by T".

Difference between the Regex expressions in dotnet

What is the difference between these two regexes:
new Regex(@"(([[[{""]))", RegexOptions.Compiled)
and
new Regex(@"(^[[[{""])", RegexOptions.Compiled)
I've used both regexes but can't find the difference. They seem to match almost the same things.
The regex patterns are not well written because
There are duplicate characters in character classes (thus redundant)
The first regex contains duplicate capture group on the whole pattern.
The first regex - (([[[{""])) - matches 1 character, either a [, a {, or a ", and captures it into Group 1 and Group 2. See demo. It is equal to
[[{"]
Demo
The second regex - (^[[[{""]) - only matches the same characters as the pattern above, but at the beginning of a string (if RegexOptions.Multiline is not set), or the beginning of a line (if that option is set). See demo. It is equal to
^[[{"]
See demo
You will access the matched characters using Regex.Match(s).Value.
More about anchors
Also see Caret ^: Beginning of String (or Line)
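To make the difference concrete, here is a minimal sketch that counts matches for the simplified forms of both patterns. The two sample inputs are made up for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Simplified equivalents of the two patterns discussed above.
var anywhere = new Regex(@"[[{""]");   // a [, { or " anywhere in the input
var atStart = new Regex(@"^[[{""]");   // the same characters, only at the start

string json = @"{""k"": [1]}";   // the text {"k": [1]}
string text = @"a[""x""]";       // the text a["x"]

Console.WriteLine(anywhere.Matches(json).Count); // 4
Console.WriteLine(atStart.Matches(json).Count);  // 1
Console.WriteLine(anywhere.Matches(text).Count); // 3
Console.WriteLine(atStart.IsMatch(text));        // False
```

The anchored version can match at most once per string unless RegexOptions.Multiline is set, in which case it can match once per line.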

C# Regular Expression: Search the first 3 letters of each name

Does anyone know how I can write a regex (C#) that searches for the first 3 letters of each part of a full name,
without the use of (.*)?
I used (.*), but it runs through the text far beyond the requested name, or,
if it finds the first condition and then finds the second condition 100 words later, it returns text that is not what I am looking for, so I have to limit the number of words.
Example: \s*(?:\s+\S+){0,2}\s*
I would like to ignore names with less than 3 characters if they exist in name.
Search any name that contains the first 3 characters that start with:
'Mar Jac Rey' (regex that performs search)
Should match:
Marck Jacobs L. S. Reynolds
Marcus Jacobine Reys
Maroon Jacqueline by Reyils
Can anyone help me?
The zero or more quantifier (*) is 'greedy' by default; that is, it will consume as many characters as possible while still finding the remainder of the pattern. This is why Mar.*Jac will match the first Mar in the input and the last Jac and everything in between.
One potential solution is just to make your pattern 'non-greedy' (*?). This will make it consume as few characters as possible in order to match the remainder of the pattern.
Mar.*?Jac.*?Rey
However, this is not a great solution because it would still match the various name parts regardless of what other text appears in between—e.g. Marcus Jacobine Should Not Match Reys would be a valid match.
To allow only whitespace or at most 2 consecutive non-whitespace characters to appear between each name part, you'd have to get more fancy:
\bMar\w*(\s+\S{0,2})*\s+Jac\w*(\s+\S{0,2})*\s+Rey\w*
The pattern (\s+\S{0,2})*\s+ will match any number of tokens of at most two non-whitespace characters, each surrounded by whitespace. The \w* after each name part ensures that the entire name is included in that part of the match (you might want to use \S* instead here, but that's not entirely clear from your question). And I threw in a word boundary (\b) at the beginning to ensure that the match does not start in the middle of a 'word' (e.g. OMar would not match).
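The difference between the lazy pattern and the stricter one can be checked with a sketch like this, using the sample names from the question plus the counterexample mentioned above. The constant names are assumptions for the example.

```csharp
using System;
using System.Text.RegularExpressions;

const string Lazy = @"Mar.*?Jac.*?Rey";
const string Strict = @"\bMar\w*(\s+\S{0,2})*\s+Jac\w*(\s+\S{0,2})*\s+Rey\w*";

string[] samples =
{
    "Marck Jacobs L. S. Reynolds",
    "Marcus Jacobine Reys",
    "Maroon Jacqueline by Reyils",
    "Marcus Jacobine Should Not Match Reys",
};

foreach (string s in samples)
{
    // The lazy pattern accepts any text between the name parts;
    // the strict one only tolerates tokens of up to two characters.
    Console.WriteLine($"{s}: lazy={Regex.IsMatch(s, Lazy)}, strict={Regex.IsMatch(s, Strict)}");
}
```

Only the last sample should differ: the lazy pattern accepts it, while the strict pattern rejects it because "Should", "Not" and "Match" are longer than two characters.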
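I think what you want is this regular expression, which checks whether the input starts with one of the three prefixes (pass RegexOptions.IgnoreCase to make it case insensitive):
@"^(?:Mar|Jac|Rey)"
Use a group rather than square brackets here: [Mar|Jac|Rey] would be a character class that matches any single one of those characters.
Less specific (any three word characters at the start):
@"^\w{3}"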
If you want to capture the first three letters of every word that is longer than three characters, you could use something like:
((?<name>[\w]{3})\w+)+
And enable ExplicitCapture when initializing your Regex, so that only the named group is captured.
Regex.Matches will return a series of Match objects; in each one, the group named "name" holds one result.
Code sample:
Regex regex = new Regex(@"((?<name>[\w]{3})\w+)+", RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);
var matches = regex.Matches("Marck Jacobs L. S. Reynolds");
If you also want to capture 3-character words, you can replace the last "\w" with a space. In that case, remember to handle the last word of the phrase.

Regex for catching word with special characters between letters

I am new to regex, I'm programming an advanced profanity filter for a commenting feature (in C#). Just to save time, I know that all filters can be fooled, no matter how good they are, you don't have to tell me that. I'm just trying to make it a bit more advanced than basic word replacement. I've split the task into several separate approaches and this is one of them.
What I need is a specific piece of regex, that catches strings such as these:
s_h_i_t
s h i t
S<>H<>I<>T
s_/h_/i_/t
s***h***i***t
you get the idea.
I guess what I'm looking for is a regex that says "one or more characters that are not alphanumeric". This should include both spaces and all special characters that you can type on a standard (western) keyboard. If possible, it should also include line breaks, so it would catch things like
s
h
i
t
There should always be at least one of the characters present, to avoid likely false positives such as in
Finish it.
This will of course mean that things like
sh_it
will not be caught, but as I said, it doesn't matter, it doesn't have to be perfect. All I need is the regex, I can do the splitting of words and inserting the regex myself. I have the RegexOptions.IgnoreCase option set in my C# code, so character case in the actual word is not an issue. Also, this regex shouldn't worry about "leetspeek", i.e. some of the actual letters of the word being replaced by other characters:
sh1t
I have a different approach that deals with that.
Thank you in advance for your help.
Let's see if this regex works for you:
/\w(?:_|\W)+/
Alright, HamZa's answer worked. However, I ran into a programmatic problem while working on the solution. When I was replacing just the words, I always knew the length of the word, so I knew exactly how many asterisks to replace it with. If I'm matching shit, I know I need to put 4 asterisks. But if I'm matching s[^a-z0-9]+h[^a-z0-9]+i[^a-z0-9]+t, I might catch s#h#i#t or I may catch s------h------i--------t. In both cases the length of the matched text will differ wildly from that of the pattern. How can I get the actual length of the matched string?
\bs[\W_]*h[\W_]*i[\W_]*t[\W_]*(?!\w)
[\W_]* matches any run of characters between the letters that are non-word characters or _ (this includes whitespace and line breaks)
\b (word boundary) ensures that Finish it won't match
(?!\w) ensures that sh ituuu won't match; you may want to remove or modify it, as s_hittt will not match either. \bs[\W_]*h[\W_]*i[\W_]*t+[\W_]*(?!\w) will match the word with a repeated last character
The modification \bs[\W_]*h[\W_]*i[\W_]*t[\W_]*?(?!\w) makes the last character class lazy, so in sh it&&& only sh it will match
\bs[\W\d_]*h[\W\d_]*i[\W\d_]*t+[\W\d_]*?(?!\w) will match sh1i444t (digits between characters)
EDIT:
(?!\w) is a negative lookahead. It basically checks that your match is not followed by a word character (word characters are roughly [A-Za-z0-9_]). It has a length of 0, which means it won't be included in the match. If you want to catch words like "shi*tface" you'll have to remove it.
( http://www.regular-expressions.info/lookaround.html )
A word boundary (\b) matches a place where a word starts or ends. Its length is 0, which means that it matches between characters
\W is a negated character class; I think it's equal to [^a-zA-Z0-9_] or [^\w]
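For the follow-up question about the length of the matched text, one possible approach is to let a match evaluator build the replacement from Match.Length. This is only a sketch: the helper name Mask is made up, and it uses the lazy variant of the pattern from the list above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: replaces every match with as many asterisks
// as the matched text actually contains.
static string Mask(string text)
{
    const string pattern = @"\bs[\W_]*h[\W_]*i[\W_]*t[\W_]*?(?!\w)";
    return Regex.Replace(
        text,
        pattern,
        m => new string('*', m.Length), // m.Length is the length of the matched text
        RegexOptions.IgnoreCase);
}

Console.WriteLine(Mask("oh s_h_i_t!")); // oh *******!
Console.WriteLine(Mask("Finish it."));  // Finish it.
```

Since the evaluator runs once per match, the same call handles short and long separators without knowing the pattern's own length.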
You want to match words where each letter is separated with the identical non-word char(s).
You can use
\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b
See the regex demo. (I added (?!\n) to make the regex work for each line as if it were a separate string.) Details:
\b - word boundary
\p{L} - a letter
(?=([\W_]+)) - a positive lookahead that matches a location that is immediately followed with any non-word or _ char (captured into Group 1)
(?:\1\p{L})+ - one or more repetitions of a sequence of the same char captured into Group 1 and a letter
\b - word boundary.
To check if there is such a pattern in a string, you can use
var HasSpamWords = Regex.IsMatch(text, @"\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b");
To return all occurrences in a string, you can use
var results = Regex.Matches(text, @"\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b")
    .Cast<Match>()
    .Select(x => x.Value)
    .ToList();
See the C# demo.
Getting the length of each string is easy if you get Match.Length and use .Select(x => x.Length). If you need to get the length of the string with all special chars removed, simply use .Select(x => x.Value.Count(c => char.IsLetter(c))) (see this C# demo).

regular expression start (^) does not work correctly

I want to match a pattern like 091\d{8} in a content.
I want to extract strings that start with 091, I try this:
^(091)\d{8}
This pattern only matches when the string begins on a new line. What pattern must I use?
You should match on a word boundary (\b).
^ will only match the number if the string starts with 091, not in between.
You should match word boundaries in your regular expression,
or else it will also fetch numbers that start with 091 but have more than 8 digits after that.
See this regex \b((091)\d{8})\b working at: http://regexr.com?310ra
The captured group in parentheses will give you the required number.
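A short sketch of the word-boundary version in C#. The sample content string is invented for the example, and it contains one valid number, one number with too many digits, and one that is glued to a letter.

```csharp
using System;
using System.Text.RegularExpressions;

string content = "call 09123456789 or 091234567890, ref x09112345678";

// \b on both sides keeps the match from starting or ending inside a longer word.
foreach (Match m in Regex.Matches(content, @"\b((091)\d{8})\b"))
{
    Console.WriteLine(m.Groups[1].Value); // 09123456789
}

// The anchored pattern from the question finds nothing here,
// because the content does not start with 091.
Console.WriteLine(Regex.IsMatch(content, @"^(091)\d{8}")); // False
```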
