C# Regular Expression: Search the first 3 letters of each name

Does anyone know how I can write a regex (C#) that searches by the first 3 letters of each part of a full name?
Without the use of (.*)
I used (.*), but it runs through the text far beyond the requested name:
if it finds the first condition and then, 100 words later, the second condition, it returns text that is not what I am looking for, so I have to limit the number of words.
Example: \s*(?:\s+\S+){0,2}\s*
I would like to ignore name parts with fewer than 3 characters if they occur in the name.
Search for any full name whose parts start with these 3-character prefixes:
'Mar Jac Rey' (regex that performs search)
Should match:
Marck Jacobs L. S. Reynolds
Marcus Jacobine Reys
Maroon Jacqueline by Reyils
Can anyone help me?

The zero-or-more quantifier (*) is 'greedy' by default: it consumes as many characters as possible while still allowing the remainder of the pattern to match. This is why Mar.*Jac will match from the first Mar in the input to the last Jac, including everything in between.
One potential solution is just to make your pattern 'non-greedy' (*?). This will make it consume as few characters as possible in order to match the remainder of the pattern.
Mar.*?Jac.*?Rey
However, this is not a great solution because it would still match the various name parts regardless of what other text appears in between—e.g. Marcus Jacobine Should Not Match Reys would be a valid match.
To allow only whitespace or at most 2 consecutive non-whitespace characters to appear between each name part, you'd have to get more fancy:
\bMar\w*(\s+\S{0,2})*\s+Jac\w*(\s+\S{0,2})*\s+Rey\w*
The pattern (\s+\S{0,2})*\s+ matches any number of tokens of at most two non-whitespace characters, each preceded by whitespace, followed by the whitespace before the next name part. The \w* after each name part ensures that the entire name is included in that part of the match (you might want to use \S* instead here, but that's not entirely clear from your question). I also added a word boundary (\b) at the beginning to ensure that the match does not start in the middle of a 'word' (e.g. OMar would not match).
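As a rough check of the pattern above, the following C# sketch runs it against the sample names from the question. The class and method names (`NamePrefixDemo`, `IsFullNameMatch`) are hypothetical and exist only for illustration; the prefixes are hard-coded exactly as in the answer.

```csharp
using System.Text.RegularExpressions;

public static class NamePrefixDemo
{
    // Pattern from the answer: each name part starts with a fixed prefix, and only
    // whitespace or tokens of at most two non-whitespace characters may sit between parts.
    private static readonly Regex FullName = new Regex(
        @"\bMar\w*(\s+\S{0,2})*\s+Jac\w*(\s+\S{0,2})*\s+Rey\w*");

    // Hypothetical helper: true when the input contains a matching full name.
    public static bool IsFullNameMatch(string input) => FullName.IsMatch(input);
}
```

Calling `NamePrefixDemo.IsFullNameMatch("Maroon Jacqueline by Reyils")` should accept the name, while an input with longer words between the parts, such as "Marcus Jacobine Should Not Match Reys", should be rejected.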

I think what you want is this regular expression, which checks whether the text starts with one of the prefixes (combine it with RegexOptions.IgnoreCase to make it case insensitive):
@"^(Mar|Jac|Rey)"
Less specific:
@"^\w{3}"

If you want to capture the first three letters of every word longer than three characters, you could use something like:
((?<name>[\w]{3})\w+)+
And enable ExplicitCapture when initializing your Regex.
It will return a series of Match objects, each with a group named "name" that holds one result.
Code sample:
Regex regex = new Regex(@"((?<name>[\w]{3})\w+)+", RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);
var matches = regex.Matches("Marck Jacobs L. S. Reynolds");
If you also want to capture three-character words, you can replace the last "\w" with a space. In that case, remember to handle the last word of the phrase.
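To make the idea of reading the "name" group concrete, here is a minimal sketch that collects the captured prefixes into a list. The class and method names (`NameInitialsDemo`, `FirstThreeLetters`) are assumptions made for this example, not part of the answer.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NameInitialsDemo
{
    // Hypothetical helper: returns the first three letters of every word
    // that has more than three word characters.
    public static List<string> FirstThreeLetters(string input)
    {
        var regex = new Regex(@"((?<name>\w{3})\w+)+",
            RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);

        var result = new List<string>();
        foreach (Match match in regex.Matches(input))
        {
            // With ExplicitCapture only the named group is captured.
            result.Add(match.Groups["name"].Value);
        }
        return result;
    }
}
```

For the input "Marck Jacobs L. S. Reynolds" this yields Mar, Jac and Rey; the short tokens "L." and "S." are skipped because they have fewer than four word characters.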

Related

Matching a sequence of characters split by spaces after a prefix

I have the following strings:
-prefix <#141222969505480701> where the second part e.g. <#141222969505480701> can be repeated unlimited times (only the numbers change).
-prefix 141222969505480701 which should behave the same as above.
-prefix 141222969505480701 <#141222969505480702> which would still be able to repeat itself forever.
The last one should have groups containing 141222969505480701 and 141222969505480702.
So a few bits of information:
The digit chains are always 18 in total so I use \d{18} in my regex
I would like to have the numbers in groups for me to use them afterwards.
What I tried
First off, I tried to match the first of my example strings.
-prefix(\s<#\d{18}>)\1* which would match the entire string, but I would like to have the digits itself in its own group. Also this method only matches the same parts e.g. <#141222969505480701> <#141222969505480701> <#141222969505480701> would match, but any other number in between wouldn't match.
What would sound logical in my head
-prefix (\d{18})+ but it would only match the first one of the 'digit parts'.
While I was testing it on regex101 it told me the following:
A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data.
I tried to adjust the regex to the following -prefix ((\d{18})+), but with the same result.
With the help of @madreflection in the comments I was able to come up with this solution:
-prefix([\s]*(<#|)(?<digits>[0-9]{18})>?)+
Which is exactly what I needed, which even ignores spaces in between. Also with the use of match.Groups["digits"].Captures it made the whole story a lot easier.
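Reading the repeated group through `Captures`, as described above, might look roughly like the sketch below. The wrapper class and method (`PrefixDigitsDemo`, `Extract`) are hypothetical names for this example; the pattern is the one from the solution, with `[\s]*` written as `\s*`.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PrefixDigitsDemo
{
    private static readonly Regex Pattern = new Regex(
        @"-prefix(\s*(<#|)(?<digits>[0-9]{18})>?)+");

    // Hypothetical helper: returns every 18-digit chain that follows "-prefix".
    public static string[] Extract(string input)
    {
        Match match = Pattern.Match(input);
        if (!match.Success)
        {
            return Array.Empty<string>();
        }

        // A repeated group keeps only its last value in Group.Value,
        // but .NET records every iteration in Group.Captures.
        return match.Groups["digits"].Captures
            .Cast<Capture>()
            .Select(capture => capture.Value)
            .ToArray();
    }
}
```

`PrefixDigitsDemo.Extract("-prefix 141222969505480701 <#141222969505480702>")` should return both digit chains in the order they appear.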
You could use an alternation to list the 3 different allowed formats. .NET allows the same group name to be reused.
-prefix\s*(?:(?<digits>[0-9]{18})\s*<#(?<digits>[0-9]{18})>|(?<digits>[0-9]{18})|<#(?<digits>[0-9]{18}))
Pattern parts
-prefix\s* Match literally followed by 0+ whitespace characters
(?: Non capturing group
(?<digits>[0-9]{18})\s*<#(?<digits>[0-9]{18})> 2 named capturing groups which will match the digits
| Or
(?<digits>[0-9]{18}) Named capturing group, match digits only
| Or
<#(?<digits>[0-9]{18}) Named capturing group, match digits between brackets only
)
You could also use 2 named capturing groups, 1 for each format. For example:
-prefix\s*(?:(?<digits>[0-9]{18})\s*<#(?<digitsBrackets>[0-9]{18})>|(?<digits>[0-9]{18})|<#(?<digitsBrackets>[0-9]{18}))

Regex to find a string that does not contain a space while matching other conditions

I have a bunch of strings that may contain certain patterns. Specifically, the following 3.
Starts with (- followed by 10 digits followed by ).
E.g.:
(-1234567890)
Starts with (, ends with ), and may contain 1 or more characters, but NO spaces.
E.g.:
(ABC) or (AF33) or (2345)
Starts with (, ends with ), and may contain 1 or more characters, INCLUDING spaces.
E.g.:
(Some string)
The strings I work with may contain zero or more of the patterns above. My requirement is to match ONLY the second one from above in a given string, and I'd like to be able to use Regex class in C#.
For example, let's say following are five different strings I have.
This is some random text.
This is some (ABC) random (-1234567890) text.
This is some (XY12) random (-1234567890) text.
This is some (Contains space) random (-1234567890) text.
This is some () random text.
My Regex should match only the 2nd and 3rd strings from the above list.
So far, I've managed to write this following Regex, which excludes strings 1 and 5.
.*\((?!\-).+\).*
This matches the 2nd, 3rd, AND 4th strings above. Now I'm not sure how I can get it to exclude the 4th one, which contains spaces inside the parentheses. I know that \s matches whitespace, but how can I tell it to accept only strings without spaces inside parentheses that don't have a - after the opening (?
EDIT 1:
There will never be nested parentheses in my strings.
EDIT 2:
Here's a Regex Tester.
.*\(\w+\).*
With the regex above, only the second and third strings match.
.* any characters
\( opening parenthesis
\w+ word characters (at least one)
\) closing parenthesis
.* any characters
\(([^- ]+[^ ]*)\)
should work
Explanation:
[^- ]+ first matches one or more characters that are neither - nor a space. This makes sure the parentheses contain at least one character and that the content does not start with -.
Then [^ ]* matches zero or more non-space characters.
This will work for any character set.
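A small sketch to try the pattern \(([^- ]+[^ ]*)\) against the five sample strings from the question. The class and method names (`ParenTokenDemo`, `HasSpaceFreeToken`) are made up for this example.

```csharp
using System.Text.RegularExpressions;

public static class ParenTokenDemo
{
    // Parenthesised content that does not start with '-' and contains no spaces.
    private static readonly Regex Token = new Regex(@"\(([^- ]+[^ ]*)\)");

    // Hypothetical helper: true when the input contains such a token.
    public static bool HasSpaceFreeToken(string input) => Token.IsMatch(input);
}
```

Under this sketch only the strings containing (ABC) and (XY12) are accepted; the plain text, the (Contains space) string and the empty () string are rejected.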

C# Regular Expression - How to identify full name from the first 3 letters of each name

I'm using regex Mar(.*)Ant(.*)Ara[^\s]*\s to locate names in PDF file pages, an example of the result is:
Marck SC. Antony L. Aragon
Marcus Anthilope Burton Chase Araujo
But the use of (.*) can return any text on the page, including text that is not a full name.
How can I prevent this regex from matching very long text that is not a name but still satisfies the pattern? Example:
Wrong result:
And the Cat Marked this on looking at Anti-terror panel in the city of Aragon
Following the same idea, I thought of improving this regex by limiting the search to 10 words (from the first name) rather than the entire page.
How to do this?
Use a range quantifier {min,max}
@"\b(Mar\S*)(?:\s+\S+){0,3}\s*\b(Ant\S*)(?:\s+\S+){0,3}\s*\b(Ara\S*)"
(?:\s+\S+){0,3} - \s+ matches one or more whitespace characters and \S+ matches one or more non-whitespace characters; the whole group is then repeated from zero up to three times.
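The effect of the {0,3} limit can be checked with a short sketch like the one below, which uses the sample lines from the question. The names `BoundedNameDemo` and `LooksLikeName` are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class BoundedNameDemo
{
    // At most three extra words are allowed between consecutive name parts.
    private static readonly Regex Name = new Regex(
        @"\b(Mar\S*)(?:\s+\S+){0,3}\s*\b(Ant\S*)(?:\s+\S+){0,3}\s*\b(Ara\S*)");

    // Hypothetical helper: true when the text contains a bounded full name.
    public static bool LooksLikeName(string text) => Name.IsMatch(text);
}
```

With this limit, "Marck SC. Antony L. Aragon" is accepted, while the long sentence about the Anti-terror panel is rejected because four words separate "Marked" from "Anti-terror".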
Mar[^ ]*(?:[ ][A-Z][^\s]*)* Ant[^ ]*(?:[ ][A-Z][^\s]*)* Ara[^\s]*
Make use of the fact that names start with a capital letter. See demo.
https://regex101.com/r/qH1uG3/3

Regex help, words that contain numbers without a space

I have been trying for hours now to figure out this regex and it will not work correctly.
What I need is a regex that matches IF any word in the entire description is a character followed by a number, without a space,
for example:
MC15 this is a test description - MATCH
MC 15 this is another description - NO MATCH
another test MC55 description - MATCH
another test MC 55 description - NO MATCH
I greatly appreciate any help!
Thanks for your time!
Can you use "find" instead of "match"? (I.E., the regular expression method you call on your pattern - in order to search for a substring instead of matching against the entire input?) If so, this will do nicely:
([a-zA-Z]+\d+)
Otherwise, it can be expanded to work with "match", using something like:
.*\b([a-zA-Z]+\d+)\b.*
(?<aMatch>[a-zA-Z]+[0-9]+)
This matches any a-z or A-Z character one or more times, followed by any digit 0-9 one or more times, with no spaces or other characters allowed between them.
In Perl, a regexp like /[a-zA-Z]+\d+/ would work:
[a-zA-Z] denotes any letter
\d denotes any decimal digit
the + after each character class says it has to be present one or more times
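In C#, the same "find" style search can be sketched with `Regex.IsMatch`, which looks for the pattern anywhere in the input. The class and method names (`LetterDigitDemo`, `ContainsLetterDigitWord`) are assumptions made for this example.

```csharp
using System.Text.RegularExpressions;

public static class LetterDigitDemo
{
    // One or more letters immediately followed by one or more digits.
    private static readonly Regex Word = new Regex(@"[a-zA-Z]+\d+");

    // Hypothetical helper: IsMatch searches for a substring, so no .* is needed.
    public static bool ContainsLetterDigitWord(string description) =>
        Word.IsMatch(description);
}
```

This accepts "MC15 this is a test description" and rejects "MC 15 this is another description", in line with the examples in the question.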

C# Regex Replace weird behavior with multiple captures and matching at the end of string?

I'm trying to write something that formats Brazilian phone numbers, but I want it to match from the end of the string, not the beginning, so it would turn input strings according to the following pattern:
"5135554444" -> "(51) 3555-4444"
"35554444" -> "3555-4444"
"5554444" -> "555-4444"
Since the beginning portion is what usually changes, I thought of building the match using the $ sign so it would start at the end and then capture backwards (so I thought), replacing with the desired end format, and afterwards just getting rid of the parentheses "()" in front if they were empty.
This is the C# code:
s = "5135554444";
string str = Regex.Replace(s, @"\D", ""); //Get rid of non digits, if any
str = Regex.Replace(str, @"(\d{0,2})(\d{0,4})(\d{1,4})$", "($1) $2-$3");
return Regex.Replace(str, @"^\(\) ", ""); //Get rid of empty () at the beginning
The return value was as expected for a 10 digit number. But for anything less than that, it ended up showing some strange behavior. These were my results:
"5135554444" -> "(51) 3555-4444"
"35554444" -> "(35) 5544-44"
"5554444" -> "(55) 5444-4"
It seems that it ignores the $ at the end to do the match, except that if I test with something less than 7 digits it goes like this:
"554444" -> "(55) 444-4"
"54444" -> "(54) 44-4"
"4444" -> "(44) 4-4"
Notice that it keeps the "minimum" {n} number of times of the third capture group always capturing it from the end, but then, the first two groups are capturing from the beginning as if the last group was non greedy from the end, just getting the minimum... weird or it's me?
Now, if I change the pattern, so instead of {1,4} on the third capture I use {4} these are the results:
str = Regex.Replace(str, @"(\d{0,2})(\d{0,4})(\d{4})$", "($1) $2-$3");
"5135554444" -> "(51) 3555-4444" //As expected
"35554444" -> "(35) 55-4444" //The last four are as expected, but "35" as $1?
"54444" -> "(5) -4444" //Again "4444" in $3, why nothing in $2 and "5" in $1?
I know this is probably some stupidity of mine, but wouldn't it be more reasonable if I want to capture at the end of the string, that all previous capture groups would be captured in reverse order?
I would think that "54444" would turn into "5-4444" in this last example... then it does not...
How would one accomplish this?
(I know maybe there's a better way to accomplish the very same thing using different approaches... but what I'm really curious about is why this particular behavior of the Regex seems odd. So, the answer to this question should focus on explaining why the last capture is anchored at the end of the string and why the others are not, as demonstrated in this example. I'm not particularly interested in the actual phone number formatting problem, but in understanding the Regex syntax)...
Thanks...
So you want the third part to always have four digits, the second part zero to four digits, and the first part zero to two digits, but only if the second part contains four digits?
Use
^(\d{0,2}?)(\d{0,4})(\d{4})$
As a C# snippet, commented:
resultString = Regex.Replace(subjectString,
@"^ # anchor the search at the start of the string
(\d{0,2}?) # match as few digits as possible, maximum 2
(\d{0,4}) # match up to four digits, as many as possible
(\d{4}) # match exactly four digits
$ # anchor the search at the end of the string",
"($1) $2-$3", RegexOptions.IgnorePatternWhitespace);
By adding a ? to a quantifier (??, *?, +?, {a,b}?) you make it lazy, i.e. you tell it to match as few characters as possible while still allowing an overall match to be found.
Without the ? in the first group, what would happen when trying to match 123456?
First, the \d{0,2} matches 12.
Then, the \d{0,4} matches 3456.
Then, the \d{4} doesn't have anything left to match, so the regex engine backtracks until that's possible again. After four steps, the \d{4} can match 3456. The \d{0,4} gives up everything it had matched greedily for this.
Now, an overall match has been found - no need to try any more combinations. Therefore, the first and third groups will contain parts of the match.
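The backtracking walk-through above can be reproduced with a small sketch that prints nothing and just returns the three groups for a given pattern. The helper names (`LazyGroupDemo`, `Split`) are hypothetical; both patterns are anchored with ^ and $ as in the answer.

```csharp
using System.Text.RegularExpressions;

public static class LazyGroupDemo
{
    // Hypothetical helper: returns the values of groups 1 to 3 for the given pattern.
    // Groups of a failed match have an empty Value.
    public static string[] Split(string pattern, string digits)
    {
        Match match = Regex.Match(digits, pattern);
        return new[]
        {
            match.Groups[1].Value,
            match.Groups[2].Value,
            match.Groups[3].Value
        };
    }
}
```

For the input 123456, the greedy pattern `^(\d{0,2})(\d{0,4})(\d{4})$` leaves the groups as "12", "" and "3456", whereas the lazy variant `^(\d{0,2}?)(\d{0,4})(\d{4})$` gives "", "12" and "3456".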
You have to tell it that it's OK if the first matching groups aren't there, but not the last one:
(\d{0,2}?)(\d{0,4}?)(\d{1,4})$
Matches your examples properly in my testing.
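Plugged into the three-step code from the question, that pattern could be tried out as follows. This is only a sketch: the class and method names (`PhoneFormatDemo`, `Format`) are made up, and inputs longer than 10 digits are not handled specially.

```csharp
using System.Text.RegularExpressions;

public static class PhoneFormatDemo
{
    // Hypothetical helper following the steps from the question.
    public static string Format(string input)
    {
        // Get rid of non-digits, if any.
        string digits = Regex.Replace(input, @"\D", "");

        // Lazy first and second groups let the last group claim its digits first.
        string formatted = Regex.Replace(
            digits, @"(\d{0,2}?)(\d{0,4}?)(\d{1,4})$", "($1) $2-$3");

        // Get rid of empty () at the beginning.
        return Regex.Replace(formatted, @"^\(\) ", "");
    }
}
```

With this sketch, "5135554444" becomes "(51) 3555-4444", "35554444" becomes "3555-4444" and "5554444" becomes "555-4444".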
