REGEX - Having trouble getting the basics - c#

I am very new to Regular expression so I apologise for the 'noobyness' of the question...
I need to match a pattern for an ID we use at work.
So far the only specification for the pattern is that it will be 9 characters long, and comprised of Capital letters and digits. The ID can contain 1 or any number of Capital letters or digits, so long as the total length of the string is 9 characters long.
So far i have the following... [A-Z][0-9]{9}
this does not makes sure the string has atleast one letter or digit (so a 9 character long string would pass)... Alos, Im sure it matched a 9 letter word made of non capitals.
I have done a fair bit of googling, but i have not found anything dumbed down enough for me to understand.
Any help very apppreciated :)
Thanks
EDIT: Just to recap the requirements - The id has to be 9 characters long, no more no less. It will be comprised of capital letters and digits. There can be any amount of either letter or digit, so long as the id contains atleast one of each (so BH98T6YUO or R3DBLUEEE or 1234R6789
I will also post my code to make sure that bits not wrong... ??
string myRegex = "A ton of different combinations that i have tried";
Regex re = new Regex(myRegex);
// stringCombos is a List<string> containing all my strings
// The strings contain within them, my id
// I am attempting to pull out this id
// the below is just to print out all found matches for each string in the list
foreach (string s in stringCombos)
{
MatchCollection mc = re.Matches(s);
Console.WriteLine("-------------------------");
Console.Write(s);
Console.WriteLine(" --- was split into the following:");
foreach (Match mt in mc)
{
Console.WriteLine(mt.ToString());
}
}

You actually have to learn regular expressions as a language. The curve is kind of steep, but there are a ton of excellent tutorials for the basics. Also, you might get this in a chat situation (SO has a chat functionality) - that is how I originally learnt them...
I think this might work for your case:
[A-Z0-9]{1,9}
According to your update, for exactly 9 elements, use:
[A-Z0-9]{9}
Note, though, that the requirement to include at least one letter and at least one digit is not expressed in this solution. An easy way to do that would be to apply a second and third match to the first one:
[0-9]*[A-Z][0-9]*
[A-Z]*[0-9][A-Z]*
Thereby matching three times. You might be able to get this result with the fancy forward and backward reference stuff, but you cannot really capture that requirement with a regular grammar.

You need to match the start and then end of the string using ^ and $, this means that it will match 9 characters and not 10
^[0-9A-Z]$
You aren't exact clear on the requirements the above match will match 9 characters either capital or numeric.
You may find Expresso useful for trying out your expressions.
EDIT (With new requirements) if you are requiring a minimum of 1 Uppercase character you could use the following.
\b[0-9A-Z]{8}(?:(?<=.*[A-Z].*)[0-9]|(?<=.*[0-9].*)[A-Z])\b
Breakdown
\b Match a word boundry
[0-9A-Z]{8} 8 Chars that are either uppercase or numbers
(?: Begins a non capturing group, this is to enclose the or condition
(?<=.*[A-Z].*)[0-9] This basically matches [0-9] aslong as there is an A-Z somewhere before it in the first [0-9A-Z]{8} capture
| OR
(?<=.*[0-9].*)[A-Z] This basically matches [A-Z] aslong as there is an 0-9 somewhere before it in the first [0-9A-Z]{8} capture
) close non capturing group
\b match a word boundry
Basically do a match on the first 8 chars and then if the 9th char is a digit then there has to be an uppercase in the first 8, if the 9th is an A-Z then there has to be a digit in the first 8
The new edited version will now find ID's that appear within the string rather than requiring the string exactly match them.

Related

Regex c# obtain subgroup of a captured group

It seems a simple question, but I don't think it is so easy.
From the example string AAACARACBBBBBDZAAAAEE, I want to extract the first 8 characters (= AAACARAC) and from this resulting 8-char long string, I want to extract everything except the leading 'A' characters (= CARAC).
I tried with this regex (?^[A]<WORD>\w{8}), but I dont know how to apply another regex on the captured group named WORD?
This is the regex you want:
(?=^.{8}(.*)$)A*(?<WORD>.*?)\1$
See a demo here (click then on "Table" for looking at the specific matches).
The regex firs will match the first eight characters looking for what comes next (matching this "tail" in the first capturing group), then will restart from the beginning of the string excluding all the trailing As and matching for as less character as possible such that these characters are followed by the same content of the first capturing group.
Using C#, you might also use a positive lookbehind to assert 8 chars to the left, matching optional A's and capture the chars that follow in a group.
^A*(?<WORD>[^\sA].*)(?<=^.{8})
^ Start of string
A* match optional repetitions of A
(?<WORD> Named group WORD
[^\sA].* Match any non whitespace char except A
) Close named group WORD
(?<=^.{8}) Assert 8 chars to the left of the current position
.NET regex demo
If you only want to match word characters:
^A*(?<WORD>[^\WA]\w*)(?<=^\w{8})
.NET Regex demo

How to validate Regex

Im having a hard time with grouping parts of a Regex. I want to validate a few things in a string that follows this format: I-XXXXXX.XX.XX.XX
Validate that the first set of 6 X's (I-xxxxxx.XX.XX.XX) does not contain characters and its length is no more than 6.
Validate that the third set of X's (I-XXXXXX.XX.xx.XX) does not contain characters and is only 1 or 2.
Now, I have already validation on the last set of XX's to make sure the numbers are 1-8 using
string pattern1 = #"^.+\.(0?[1-8])$";
Match match = Regex.Match(TxtWBS.Text, pattern1);
if (match.Success)
;
else
{ errMessage += "WBS invalid"; errMessage +=
Environment.NewLine; }
I just cant figure out how to target specific parts of the string. Any help would be greatly appreciated and thank you in advance!
You're having some trouble adding new validation to this string because it's very generic. Let's take a look at what you're doing:
^.+\.(0?[1-8])$
This finds the following:
^ the start of the string
.+ everything it can, other than a newline, basically jumping the engine's cursor to the end of your line
\. the last period in the string, because of the greedy quantifier in the .+ that comes before it
0? a zero, if it can
[1-8] a number between 1 and 8
()$ stores the two previous things in a group, and if the end of the string doesn't come after this, it may even backtrace and try the same thing from the second to last period instead, which we know isn't a great strategy.
This ends up matching a lot of weird stuff, like for example the string The number 0.1
Let's try patterning something more specific, if we can:
^I-(\d{6})\.(\d{2})\.(\d{1,2})\.([1-8]{2})$
This will match:
^I- an I and a hyphen at the start of the string
(\d{6}) six digits, which it stores in a capture group
\. a period. By now, if there was any other number of digits than six, the match fails instead of trying to backtrace all over the place.
(\d{2})\. Same thing, but two digits instead of six.
(\d{1,2})\. Same thing, the comma here meaning it can match between one and two digits.
([1-8]{2}) Two digits that are each between 1 and 8.
$ The end of the string.
I hope I understood what exactly you're trying to match here. Let me know if this isn't what you had in mind.
This regex:
^.-[0-9]{6}(\.[1-8]{1,2}){3}$
will validate the following:
The first character can be any character, but is of length 1
It is followed by a dash
The dash is followed by exactly 6 numbers 0 - 9. (If this could be less than 6 characters - for example, between 3 and 6 characters - just replace {6} with {3,6}).
This is followed by 3 groups of characters. Each of this groups are proceeded by a period, are of length 1 or 2, and can be any number 1 - 8.
An example of a valid string is:
I-587954.12.34.56
This is also valid:
I-587954.1.3.5
But this isn't:
I-587954.12.80.356
because the second-to-last group contains a 0, and because the last group is of length 3.
Pleas let me know if I have misunderstood any of the rules.
^I-([0-9]{1,6})\.(.{1,2})\.(0[1-2])\.(.{1,2})$
groups delimited by . (\.) :
([0-9]{1,6}) - 1-6 digits
(.{1,2}) - 1-2 any single character
(0[1-2]) - 01 or 02
(.{1,2}) - 1-2 any single character
you can write and easy test regex on your input data, just google "regex online"

C# Regular Expression: Search the first 3 letters of each name

Does anyone know how to say I can get a regex (C#) search of the first 3 letters of a full name?
Without the use of (.*)
I used (.**)but it scrolls the text far beyond the requested name, or
if it finds the first condition and after 100 words find the second condition he return a text that is not the look, so I have to limit in number of words.
Example: \s*(?:\s+\S+){0,2}\s*
I would like to ignore names with less than 3 characters if they exist in name.
Search any name that contains the first 3 characters that start with:
'Mar Jac Rey' (regex that performs search)
Should match:
Marck Jacobs L. S. Reynolds
Marcus Jacobine Reys
Maroon Jacqueline by Reyils
Can anyone help me?
The zero or more quantifier (*) is 'greedy' by default—that is, it will consume as many characters as possible in order to finding the remainder of the pattern. This is why Mar.*Jac will match the first Mar in the input and the last Jac and everything in between.
One potential solution is just to make your pattern 'non-greedy' (*?). This will make it consume as few characters as possible in order to match the remainder of the pattern.
Mar.*?Jac.*?Rey
However, this is not a great solution because it would still match the various name parts regardless of what other text appears in between—e.g. Marcus Jacobine Should Not Match Reys would be a valid match.
To allow only whitespace or at most 2 consecutive non-whitespace characters to appear between each name part, you'd have to get more fancy:
\bMar\w*(\s+\S{0,2})*\s+Jac\w*(\s+\S{0,2})*\s+Rey\w*
The pattern (\s+\S{0,2})*\s+ will match any number of non-whitespace characters containing at most two characters, each surrounded by whitespace. The \w* after each name part ensures that the entire name is included in that part of the match (you might want to use \S* instead here, but that's not entirely clear from your question). And I threw in a word boundary (\b) at the beginning to ensure that the match does not start in the middle of a 'word' (e.g. OMar would not match).
I think what you want is this regular expression to check if it is true and is case insensitive
#"^[Mar|Jac|Rey]{3}"
Less specific:
#"^[\w]{3}"
If you want to capture the first three letters of every words of at least three characters words you could use something like :
((?<name>[\w]{3})\w+)+
And enable ExplicitCapture when initializing your Regex.
It will return you a serie of Match named "name", each one of them is a result.
Code sample :
Regex regex = new Regex(#"((?<name>[\w]{3})\w+)+", RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);
var match = regex.Matches("Marck Jacobs L. S. Reynolds");
If you want capture also 3 characters words, you can replace the last "\w" by a space. In this case think to handle the last word of the phrase.

Regex for catching word with special characters between letters

I am new to regex, I'm programming an advanced profanity filter for a commenting feature (in C#). Just to save time, I know that all filters can be fooled, no matter how good they are, you don't have to tell me that. I'm just trying to make it a bit more advanced than basic word replacement. I've split the task into several separate approaches and this is one of them.
What I need is a specific piece of regex, that catches strings such as these:
s_h_i_t
s h i t
S<>H<>I<>T
s_/h_/i_/t
s***h***i***t
you get the idea.
I guess what I'm looking for is a regex that says "one or more characters that are not alphanumeric". This should include both spaces and all special characters that you can type on a standard (western) keyboard. If possible, it should also include line breaks, so it would catch things like
s
h
i
t
There should always be at least one of the characters present, to avoid likely false positives such as in
Finish it.
This will of course mean that things like
sh_it
will not be caught, but as I said, it doesn't matter, it doesn't have to be perfect. All I need is the regex, I can do the splitting of words and inserting the regex myself. I have the RegexOptions.IgnoreCase option set in my C# code, so character case in the actual word is not an issue. Also, this regex shouldn't worry about "leetspeek", i.e. some of the actual letters of the word being replaced by other characters:
sh1t
I have a different approach that deals with that.
Thank you in advance for your help.
Lets see if this regex works for you:
/\w(?:_|\W)+/
Alright, HamZa's answer worked. However I ran into a programmatic problem while working on the solution. When I was replacing just the words, I always knew the length of the word. So I knew exactly how many asterisks to replace it with. If I'm matching shit, I know I need to put 4 asterisks. But if I'm matching s[^a-z0-9]+h[^a-z0-9]+[^a-z0-9]+i[^a-z0-9]+t, I might catch s#h#i#t or I may catch s------h------i--------t. In both cases the length of the matched text will differ wildly from that of the pattern. How can I get the actual length of the matched string?
\bs[\W_]*h[\W_]*i[\W_]*t[\W_]*(?!\w)
matches characters between letters that aren't word characters or character _ or whitespace characters (also new line breaks)
\b (word boundrary) ensures that Finish it won't match
(?!\w) ensures that sh ituuu wont match, you may want to remove/modify that, as s_hittt will not match as well. \bs[\W_]*h[\W_]*i[\W_]*t+[\W_]*(?!\w) will match the word with repeated last character
modification \bs[\W_]*h[\W_]*i[\W_]*t[\W_]*?(?!\w) will make the match of last character class not greedy and in sh it&&& only sh it will match
\bs[\W\d_]*h[\W\d_]*i[\W\d_]*t+[\W\d_]*?(?!\w) will match sh1i444t (digits between characters)
EDIT:
(?!\w) is a negative lookahead. It basicly checks if your match is followed by a word character (word characters are [A-z09_]). It has a length of 0, which means it won't be included in the match. If you want to catch words like "shi*tface" you'll have to remove it.
( http://www.regular-expressions.info/lookaround.html )
A word booundrary [/b] matches a place where word starts or ends, it's length is 0, which means that it matches between characters
[\W] is a negative character class, I think it's equal to [^a-zA-Z0-9_] or [^\w]
You want to match words where each letter is separated with the identical non-word char(s).
You can use
\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b
See the regex demo. (I added (?!\n) to make the regex work for each line as if it were a separate string.) Details:
\b - word boundary
\p{L} - a letter
(?=([\W_]+)) - a positive lookahead that matches a location that is immediately followed with any non-word or _ char (captured into Group 1)
(?:\1\p{L})+ - one or more repetitions of a sequence of the same char captured into Group 1 and a letter
\b - word boundary.
To check if there is such a pattern in a string, you can use
var HasSpamWords = Regex.IsMatch(text, #"\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b");
To return all occurrences in a string, you can use
var results = Regex.Matches(text, #"\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b")
.Cast<Match>()
.Select(x => x.Value)
.ToList();
See the C# demo.
Getting the length of each string is easy if you get Match.Length and use .Select(x => x.Length). If you need to get the length of the string with all special chars removed, simply use .Select(x => x.Value.Count(c => char.IsLetter(c))) (see this C# demo).

C# Regex Validation

Can someone please validate this for me (newbie of regex match cons).
Rather than asking the question, I am writing this:
Regex rgx = new Regex (#"^{3}[a-zA-Z0-9](\d{5})|{3}[a-zA-Z0-9](\d{9})$"
Can someone telll me if it's OK...
The accounts I am trying to match are either of:
1. BAA89345 (8 chars)
2. 12345678 (8 chars)
3. 123456789112 (12 chars)
Thanks in advance.
You can use a Regex tester. Plenty of free ones online. My Regex Tester is my current favorite.
Is the value with 3 characters then followed by digits always starting with three... can it start with less than or more than three. What are these mins and max chars prior to the digits if they can be.
You need to place your quantifiers after the characters they are supposed to quantify. Also, character classes need to be wrapped in square brackets. This should work:
#"^(?:[a-zA-Z0-9]{3}|\d{3}\d{4})\d{5}$"
There are several good, automated regex testers out there. You may want to check out regexpal.
Although that may be a perfectly valid match, I would suggest rewriting it as:
^([a-zA-Z]{3}\d{5}|\d{8}|\d{12})$
which requires the string to match one of:
[a-zA-Z]{3}\d{5} three alpha and five numbers
\d{8} 8 digits or
\d{12} twelve digits.
Makes it easier to read, too...
I'm not 100% on your objective, but there are a few problems I can see right off the bat.
When you list the acceptable characters to match, like with a-zA-Z0-9, you need to put it inside brackets, like [a-zA-Z0-9] Using a ^ at the beginning will negate the contained characters, e.g. `[^a-zA-Z0-9]
Word characters can be matched like \w, which is equivalent to [a-zA-Z0-9_].
Quantifiers need to appear at the end of the match expression. So, instead of {3}[a-zA-Z0-9], you would need to write [a-zA-Z0-9]{3} (assuming you want to match three instances of a character that matches [a-zA-Z0-9]

Categories