How to match camel case identifiers with a Regular Expression? - c#

I have the need to match camel case variables. I am ignoring variables with numbers in the name.
private const String characters = @"\-:;*+=\[\{\(\/?\s^""'\<\]\}\.\)$\>";
private const String start = @"(?<=[" + characters +"])[_a-z]+";
private const String capsWord = "[_A-Z]{1}[_a-z]+";
private const String end = @"(?=[" + characters + "])";
var regex = new Regex($"{start}{capsWord}{end}",
RegexOptions.Compiled | RegexOptions.CultureInvariant);
This works for single-hump variables, but not for identifiers with multiple humps, nor for one that ends at the end of the line. I thought including $ or ^ in my characters would allow those to match.
abcDef // match
notToday<end of line> // no match
<start of line>intheBeginning // no match
whatIf // match
"howFar" // match
(whatsNext) // match
ohMyGod // two humps don't match
I have also tried wrapping my capsWord like this
"(capsWord)+" but it also doesn't work.
WARNING! Online regex testers do match with "(capsWord)+", so please don't verify an answer only by testing there.
It seems that my deployment wasn't getting the updates when I was making changes so there may not have been an issue after all.
The following almost works, save for the start-of-line problem. Note that I didn't need the suffix part, because the match already ends with [a-z] content.
private const String characters = @"\-:;*+=\[\{\(\/?\s^""'\<\]\}\.\)$\>";
private const String pattern = "(?<=[" + characters + "])[_a-z]+([A-Z][a-z]+)+";
abcDef // match
notToday<end of line> // match
<start of line>intheBeginning // no match
whatIf // match
"howFar" // match
(whatsNext) // match
ohMyGod // match
So, if anyone can solve it let me know.
I have also simplified the character list to a more concise expression, but it still fails to match at the beginning of the line.
private const String pattern = "(?<=[^a-zA-Z])[_a-z]+([A-Z][a-z]+)+";

You can match an empty position between a prefix and a suffix to split the camelCase identifiers
(?<=[_a-z])(?=[_A-Z])
The prefix contains the lower case letters, the suffix the upper case letters.
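As a minimal sketch of how that zero-width pattern can be used, the following passes it to Regex.Split. The class and method names (CamelSplitDemo, SplitHumps) are made up for illustration and are not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CamelSplitDemo
{
    // Splits at every empty position that has a lower-case letter or underscore
    // behind it and an upper-case letter or underscore ahead of it.
    public static string[] SplitHumps(string identifier) =>
        Regex.Split(identifier, "(?<=[_a-z])(?=[_A-Z])");

    public static void Main()
    {
        Console.WriteLine(string.Join(" | ", SplitHumps("ohMyGod"))); // oh | My | God
        Console.WriteLine(string.Join(" | ", SplitHumps("whatIf")));  // what | If
    }
}
```

Because the pattern consumes no characters, no letters are lost at the split points.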
If you want to match camelCase identifiers, you can use
(?<=^|[^_a-zA-Z])_*[a-z]+[_a-zA-Z]*
How it works:
(?<= Match any position pos following a prefix exp (?<=exp)pos
^ Beginning of line
| OR
[^_a-zA-Z] Not an identifier character
)
_* Any number of underlines
[a-z]+ At least one lower case letter
[_a-zA-Z]* Any number of underlines and lower or upper case letters
So, it basically says: match a sequence optionally starting with underscores, followed by at least one lower case letter, optionally followed by underscores and letters (upper and lower), where the whole thing must be preceded by either the beginning of the line or a non-identifier character. The lookbehind is necessary to make sure that we do not match just the tail of an identifier that starts with an upper case letter (or with underscores and an upper case letter).
var camelCaseExpr = new Regex("(?<=^|[^_a-zA-Z])_*[a-z]+[_a-zA-Z]*");
MatchCollection matches = camelCaseExpr.Matches("whatIf _Abc _abc howFar");
foreach (Match m in matches) {
Console.WriteLine(m.Value);
}
prints
whatIf
_abc
howFar
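If only identifiers with at least one hump should match, one option (an assumption of this sketch, not something the answer proposes) is to keep the asker's last pattern and add the same `^|` alternative to its lookbehind. RegexOptions.Multiline is used so that ^ also matches at the start of each line. The names CamelCaseMatchDemo and FindAll are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CamelCaseMatchDemo
{
    // Asker's pattern plus "^|" in the lookbehind (combination assumed for this sketch).
    private static readonly Regex CamelCase = new Regex(
        "(?<=^|[^a-zA-Z])[_a-z]+([A-Z][a-z]+)+", RegexOptions.Multiline);

    public static string[] FindAll(string text) =>
        CamelCase.Matches(text).Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        string text = "intheBeginning\n\"howFar\" (whatsNext) ohMyGod\nPascalCase plain notToday";
        foreach (string s in FindAll(text))
        {
            Console.WriteLine(s);
        }
        // prints intheBeginning, howFar, whatsNext, ohMyGod, notToday (one per line)
    }
}
```

PascalCase and plain are skipped: the first starts with an upper case letter, the second has no hump.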

Had the same problem today, what worked for me:
\b([a-z][a-z0-9]+[A-Z])+[a-z0-9]+\b
Note: this is for PCRE regexes
Explanation:
`(` group begin
`[a-z]` start with a lower-case letter
`[a-z0-9]+` match a string of all lowercase/numbers
`[A-Z]` an upper-case letter
`)+` group end; match one or more of such groups.
Ends with some more lower-case/numbers.
\b for word boundary.
In my case, the camelCaseIdents had only one upper-case letter between words.
So, this worked for me, but if you can have (or want to match) more than one
upper-case letter in between, you could do something like [A-Z]{1,2}
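The pattern above is also valid .NET syntax, so it can be tried from C#. One thing the sketch below makes visible: each group needs at least two lower-case/digit characters before an upper-case letter, so an identifier such as ohMyGod (only "y" between M and G) is rejected. The Relaxed variant, which swaps the inner + for *, is a hypothetical tweak and not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PcreStyleCamelDemo
{
    // Pattern exactly as given in the answer.
    public const string Original = @"\b([a-z][a-z0-9]+[A-Z])+[a-z0-9]+\b";

    // Hypothetical relaxation: a single lower-case character may precede an upper-case letter.
    public const string Relaxed = @"\b([a-z][a-z0-9]*[A-Z])+[a-z0-9]+\b";

    public static void Main()
    {
        Console.WriteLine(Regex.IsMatch("whatIf", Original));  // True
        Console.WriteLine(Regex.IsMatch("ohMyGod", Original)); // False
        Console.WriteLine(Regex.IsMatch("ohMyGod", Relaxed));  // True
    }
}
```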

Related

Separate title string with no spaces into words

I want to find and separate words in a title that has no spaces.
Before:
ThisIsAnExampleTitleHELLO-WORLD2019T.E.S.T.(Test)"Test"'Test'[Test]
After:
This Is An Example Title HELLO-WORLD 2019 T.E.S.T. (Test) [Test] "Test" 'Test'
I'm looking for a regular expression rule that can do the following.
I thought I'd identify each word if it starts with an uppercase letter.
But also preserve all uppercase words as not to space them into A L L U P P E R C A S E.
Additional rules:
Space a letter if it touches a number: Hello2019World → Hello 2019 World
Ignore spacing initials that contain periods, hyphens, or underscores T.E.S.T.
Ignore spacing if between brackets, parentheses, or quotes [Test] (Test) "Test" 'Test'
Preserve hyphens Hello-World
C#
https://rextester.com/GAZJS38767
// Title without spaces
string title = "ThisIsAnExampleTitleHELLO-WORLD2019T.E.S.T.(Test)[Test]\"Test\"'Test'";
// Detect where to space words
string[] split = Regex.Split(title, "(?<!^)(?=(?<![.\\-'\"([{])[A-Z][\\d+]?)");
// Trim each word of extra spaces before joining
split = (from e in split
select e.Trim()).ToArray();
// Join into new title
string newtitle = string.Join(" ", split);
// Display
Console.WriteLine(newtitle);
Regular expression
I'm having trouble with spacing before the numbers, brackets, parentheses, and quotes.
https://regex101.com/r/9IIYGX/1
(?<!^)(?=(?<![.\-'"([{])(?<![A-Z])[A-Z][\d+?]?)
(?<!^) // Negative look behind
(?= // Positive look ahead
(?<![.\-'"([{]) // Ignore if starts with punctuation
(?<![A-Z]) // Ignore if starts with double Uppercase letter
[A-Z] // Space after each Uppercase letter
[\d+]? // Space after number
)
Solution
Thanks for all your combined effort in the answers. Here's a Regex example. I'm applying this to file names and have excluded the special characters \/:*?"<>|.
https://rextester.com/FYEVE73725
https://regex101.com/r/xi8L4z/1
Here is a regex which seems to work well, at least for your sample input:
(?<=[a-z])(?=[A-Z])|(?<=[0-9])(?=[A-Za-z])|(?<=[A-Za-z])(?=[0-9])|(?<=\W)(?=\W)
This pattern says to split at a boundary where one of the following conditions holds:
what precedes is a lowercase letter and what follows is an uppercase letter
what precedes is a digit and what follows is a letter (or vice-versa)
what precedes and what follows are both non-word characters (e.g. quote, parenthesis, etc.)
string title = "ThisIsAnExampleTitleHELLO-WORLD2019T.E.S.T.(Test)[Test]\"Test\"'Test'";
string[] split = Regex.Split(title, "(?<=[a-z])(?=[A-Z])|(?<=[0-9])(?=[A-Za-z])|(?<=[A-Za-z])(?=[0-9])|(?<=\\W)(?=\\W)");
split = (from e in split select e.Trim()).ToArray();
string newtitle = string.Join(" ", split);
This Is An Example Title HELLO-WORLD 2019 T.E.S.T. (Test) [Test] "Test" 'Test'
Note: You might also want to add this assertion to the regex alternation:
(?<=\W)(?=\w)|(?<=\w)(?=\W)
We got away with this here, because this boundary condition never happened. But you might need it with other inputs.
The first few parts are similar to @revo's answer: (?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}. Additionally, I added the following regex to insert a space between a number and a letter: (?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])|(?<=[A-Z])(?=\d)|(?<=\d)(?=[A-Z]). To handle input such as OTPIsADevice, I also added lookahead and lookbehind alternatives that find an uppercase letter next to a lowercase one: (((?<!^)[A-Z](?=[a-z]))|((?<=[a-z])[A-Z]))
Note that | is the OR (alternation) operator, which lets all of these patterns be combined into one.
Regex: (?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}|(?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])|(?<=[A-Z])(?=\d)|(?<=\d)(?=[A-Z])|(((?<!^)[A-Z](?=[a-z]))|((?<=[a-z])[A-Z]))
Update
Improved a bit:
From: (?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}|(?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])|(?<=[A-Z])(?=\d)|(?<=\d)(?=[A-Z])
into: (?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}|(?<=\p{L})\d which does the same thing.
(((?<!^)(?<!\p{P})[A-Z](?=[a-z]))|((?<=[a-z])[A-Z]))|(?<!^)(?=[[({&])|(?<=[)\]}!&}]) was adapted from the OP's comment, which adds exceptions for some punctuation: (((?<!^)(?<!['([{])[A-Z](?=[a-z]))|((?<=[a-z])[A-Z]))|(?<!^)(?=[[({&])|(?<=[)\\]}!&}])
Final regex:
(?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}|(?<=\p{L})\d|(((?<!^)(?<!\p{P})[A-Z](?=[a-z]))|((?<=[a-z])[A-Z]))|(?<!^)(?=[[({&])|(?<=[)\]}!&}])
Aiming for simplicity rather than huge regex, I would recommend this code with small simple patterns (comments with explanation are in code):
string str = "ThisIsAnExampleTitleHELLO-WORLD2019T.E.S.T.(Test)\"Test\"'Test'[Test]";
// insert a space where a lowercase letter is followed by an uppercase letter
str = Regex.Replace(str, "(?<=[a-z])(?=[A-Z])", " ");
// insert a space where a digit is followed by a letter
str = Regex.Replace(str, @"(?<=\d)(?=[A-Za-z])", " ");
// insert a space where a letter is followed by a digit
str = Regex.Replace(str, @"(?<=[A-Za-z])(?=\d)", " ");
// insert a space before one of the characters ("'[ when it is followed by a letter or digit
str = Regex.Replace(str, @"(?=[(\[""'][a-zA-Z0-9])", " ");
// insert a space after one of the characters ])"'
str = Regex.Replace(str, @"(?<=[)\]""'])", " ");
You could reduce the requirements to shorten the steps of a regular expression using a different interpretation of them. For example, the first requirement would be the same as to say, preserve capital letters if they are not preceded by punctuation marks or capital letters.
The following regex works almost for all of the mentioned requirements and may be extended to include or exclude other situations:
(?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}
You have to use the Replace() method with " $0" (a space followed by the whole match) as the substitution string.
.NET:
string input = @"ThisIsAnExample.TitleHELLO-WORLD2019T.E.S.T.(Test)""Test""'Test'[Test]";
Regex regex = new Regex(@"(?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}", RegexOptions.Multiline);
Console.WriteLine(regex.Replace(input, @" $0"));

Merging 3 Regular Expressions to make a Slug/URL validation check

I am trying to merge a few working RegEx patterns together (AND them). I don't think I am doing this properly, further, the first RegEx might be getting in the way of the next two.
Slug example (no special characters except for - and _):
(^[a-z0-9-_]+$)
Then I would like to ensure the first character is NOT - or _:
(^[^-_])
Then I would like to ensure the last character is NOT - or _:
([^-_]$)
Match (good Alias):
my-new_page
pagename
Not-Match (bad Alias)
-my-new-page
my-new-page_
!@#$%^&*()
If this RegExp can be simplified, I am more than happy to use the simpler version. I am trying to create validation on a page URL that the user can provide. I am looking for the user's input to:
Not start or end with a special character
Start and end with a number or letter
In the middle (not the start or end), be allowed to include - and _
Once I get that working, I can tweak it for other characters as needed.
In the end I am applying as an Annotation to my model like so:
[RegularExpression(
@"(^[a-z0-9-_]+$)?(^[^-_])?([^-_]$)",
ErrorMessage = "Alias is not valid")
]
Thank you, and let me know if I should provide more information.
See regex in use here
^[a-z\d](?:[a-z\d_-]*[a-z\d])?$
^ Assert position at the start of the line
[a-z\d] Match any lowercase ASCII letter or digit
(?:[a-z\d_-]*[a-z\d])? Optionally match the following
[a-z\d_-]* Match any character in the set any number of times
[a-z\d] Match any lowercase ASCII letter or digit
$ Assert position at the end of the line
See code in use here
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
Regex regex = new Regex(@"^[a-z\d](?:[a-z\d_-]*[a-z\d])?$");
string[] strings = {"my-new_page", "pagename", "-my-new-page", "my-new-page_", "!@#$%^&*()"};
foreach(string s in strings) {
if (regex.IsMatch(s))
{
Console.WriteLine(s);
}
}
}
}
Result (only positive matches):
my-new_page
pagename
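To tie the pattern back to the annotation from the question, here is a minimal sketch that puts it on a model property and runs the standard System.ComponentModel.DataAnnotations validator. PageModel, Alias, and IsValidAlias are hypothetical names chosen for the example; the real model is not shown in the question.

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel.DataAnnotations;

// Hypothetical model standing in for the asker's class.
public class PageModel
{
    [RegularExpression(@"^[a-z\d](?:[a-z\d_-]*[a-z\d])?$",
        ErrorMessage = "Alias is not valid")]
    public string Alias { get; set; }
}

public static class AliasValidationDemo
{
    // Runs every validation attribute on the model and reports overall success.
    public static bool IsValidAlias(string alias)
    {
        var model = new PageModel { Alias = alias };
        var results = new List<ValidationResult>();
        return Validator.TryValidateObject(
            model, new ValidationContext(model), results, validateAllProperties: true);
    }

    public static void Main()
    {
        foreach (string s in new[] { "my-new_page", "pagename", "-my-new-page", "my-new-page_" })
        {
            Console.WriteLine($"{s}: {IsValidAlias(s)}");
        }
    }
}
```

Note that RegularExpressionAttribute treats a null or empty value as valid, so a [Required] attribute would likely be needed as well if the alias is mandatory.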

C# Regular Expression Match Failing

Here's the regular expression pattern:
string testerpattern = @"\s+\d+:\s+\w\w\w\w\w\w\s+..:..:..:..:..:..:..:..\s+\d+.\d+.\d+.\d+\s+\d+.\d+.\d+.\d+\s+""\w +""";
Here are some lines of text I want to match. There will be 1 or more spaces at the beginning of each line. When I get it working I will modify it to use named matches. Basically, I want to capture most of the line without running multiple matches on a line, one for each pattern.
2: fffc02 10:00:00:05:1e:36:5f:82 172.31.3.93 0.0.0.0 "SAN002A"
3: fffc03 10:00:00:05:1e:e2:a7:00 172.31.3.168 0.0.0.0 "SAN003A"
4: fffc04 50:00:51:e8:cc:2f:ae:01 0.0.0.0 0.0.0.0 "fcr_fd_4"
Here's the static class I wrote to do the matches. It works elsewhere in my program, so I'm assuming the pattern is the problem. The pattern matches successfully on Regexr.com.
public static class RegexExtensions
{
public static bool TryMatch(out Match match, string input, string pattern)
{
match = Regex.Match(input, pattern);
return (match.Success);
}
public static bool TryMatch(out MatchCollection match, string input, string pattern)
{
match = Regex.Matches(input, pattern);
return (match.Count > 0);
}
}
First of all, surely remove the space between \w and + if you intend to match one or more word characters.
Next, if you need to match a literal dot, you must either escape it - \., or put into a character class - [.].
Also, you can make use of limiting quantifiers to shorten the pattern if you do not need captures. See how your pattern can be written:
string pat = @"\s+\d+:\s+\w{6}\s+(?:..:){7}..(?:\s+\d+(?:\.\d+){3}){2}\s+""\w+""";
Here, \w{6} matches 6 "word" chars, (?:..:){7} matches 7 sequences of any 2 chars other than a newline followed by :, and so on.
If you need to capture, still, you can use the ideas I outlined above:
\s+(\d+):\s+(\w{6})\s+(..(?::..){3}):((?:..:){3}..)\s+(\d+(?:\.\d+){3})\s+(\d+(?:\.\d+){3})\s+"(\w+)"
See the regex demo
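Since the question mentions moving to named matches later, the capturing idea above could be sketched with .NET named groups as follows. The group names (index, id, address, ip1, ip2, name) and the class name are hypothetical labels, since the question does not say what the columns mean.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SwitchLineParser
{
    // Same structure as the capturing pattern above, with hypothetical group names.
    private static readonly Regex Line = new Regex(
        @"\s+(?<index>\d+):\s+(?<id>\w{6})\s+(?<address>..(?::..){7})" +
        @"\s+(?<ip1>\d+(?:\.\d+){3})\s+(?<ip2>\d+(?:\.\d+){3})\s+""(?<name>\w+)""");

    public static bool TryParse(string line, out Match match)
    {
        match = Line.Match(line);
        return match.Success;
    }

    public static void Main()
    {
        string line = "  2: fffc02 10:00:00:05:1e:36:5f:82 172.31.3.93 0.0.0.0 \"SAN002A\"";
        if (TryParse(line, out Match m))
        {
            Console.WriteLine(m.Groups["address"].Value); // 10:00:00:05:1e:36:5f:82
            Console.WriteLine(m.Groups["ip1"].Value);     // 172.31.3.93
            Console.WriteLine(m.Groups["name"].Value);    // SAN002A
        }
    }
}
```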

Regex to find special pattern

I have a string to parse. First I have to check if string contains special pattern:
I wanted to know if there is substrings which starts with "$(",
and end with ")",
and between those start and end special strings,there should not be
any white-empty space,
it should not include "$" character inside it.
I have a little regex for it in C#
string input = "$(abc)";
string pattern = @"\$\(([^$][^\s]*)\)";
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection matches = rgx.Matches(input);
foreach (var match in matches)
{
Console.WriteLine("value = " + match);
}
It works for many cases but fails for the input $(a$(), where the inner expression is empty. I do NOT want a match when the input is $() (nothing between the start and end identifiers).
What is wrong with my regex?
Note: [^$] matches any single character except $
Use the below regex if you want to match $()
\$\(([^\s$]*)\)
Use the below regex if you don't want to match $(),
\$\(([^\s$]+)\)
* repeats the preceding token zero or more times.
+ Repeats the preceding token one or more times.
Your regex \$\(([^$][^\s]*)\) is wrong. It won't allow $ as the first character inside (), but it allows it as the second, third, etc. You need to combine the negated classes in your regex in order to match any character other than a space or $.
Your current regex does not match $() because the [^$] matches at least 1 character. The only way I can think of where you would have this match would be when you have an input containing more than one parens, like:
$()(something)
In those cases, you will also need to exclude at least the closing paren:
string pattern = @"\$\(([^$\s)]+)\)";
The above matches for example:
abc in $(abc) and
abc and def in $(def)$()$(abc)(something).
Simply replace the * with a + and merge the options.
string pattern = @"\$\(([^$\s]+)\)";
+ means 1 or more
* means 0 or more
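The difference between the two suggested classes, [^$\s]+ and [^$\s)]+, can be seen in a small sketch like the one below. The names DollarParenDemo, Loose, Strict, and Extract are made up for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class DollarParenDemo
{
    public const string Loose = @"\$\(([^$\s]+)\)";   // excludes $ and whitespace
    public const string Strict = @"\$\(([^$\s)]+)\)"; // also excludes the closing paren

    // Returns the text captured by group 1 for every match.
    public static string[] Extract(string input, string pattern) =>
        Regex.Matches(input, pattern).Cast<Match>()
             .Select(m => m.Groups[1].Value).ToArray();

    public static void Main()
    {
        Console.WriteLine(Extract("$()", Loose).Length);                                   // 0
        Console.WriteLine(string.Join(",", Extract("$()(something)", Loose)));             // )(something
        Console.WriteLine(Extract("$()(something)", Strict).Length);                       // 0
        Console.WriteLine(string.Join(",", Extract("$(def)$()$(abc)(something)", Strict))); // def,abc
    }
}
```

Both variants reject a bare $(), but only the one that excludes ) avoids running past an empty pair into a later parenthesised group.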

How to ignore regex matches in C#?

An input string:
string datar = "aag, afg, agg, arg";
I am trying to get matches: "aag" and "arg", but following won't work:
string regr = "a[a-z&&[^fg]]g";
string regr = "a[a-z[^fg]]g";
What is the correct way of ignoring regex matches in C#?
The obvious way is to use a[a-eh-z]g, but you could also try with a negative lookbehind like this :
string regr = "a[a-z](?<!f|g)g";
Explanation :
a Match the character "a"
[a-z] Match a single character in the range between "a" and "z"
(?<!XXX) Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind)
f|g Match the character "f" or match the character "g"
g Match the character "g"
Character classes aren't quite that fancy. The simple solution is:
a[a-eh-z]g
If you really want to explicitly list out the letters that don't belong, you could try something like:
a[^\W\d_A-Zfg]g
This character class matches everything except:
\W excludes non-word characters, i.e. punctuation, whitespace, and other special characters. What's left are letters, digits, and the underscore _.
\d removes digits so now we have letters and the underscore _.
_ removes the underscore so now we only match letters.
A-Z removes uppercase letters so now we only match lowercase letters.
Finally at this point we can list the individual lowercase letters we don't want to match.
All in all way more complicated than we'd likely ever want. That's regular expressions for ya!
What you're using is Java's set intersection syntax:
a[a-z&&[^fg]]g
..meaning the intersection of the two sets ('a' THROUGH 'z') and (ANYTHING EXCEPT 'f' OR 'g'). No other regex flavor that I know of uses that notation. The .NET flavor uses the simpler set subtraction syntax:
a[a-z-[fg]]g
...that is, the set ('a' THROUGH 'z') minus the set ('f', 'g').
Java demo:
String s = "aag, afg, agg, arg, a%g";
Matcher m = Pattern.compile("a[a-z&&[^fg]]g").matcher(s);
while (m.find())
{
System.out.println(m.group());
}
C# demo:
string s = @"aag, afg, agg, arg, a%g";
foreach (Match m in Regex.Matches(s, @"a[a-z-[fg]]g"))
{
Console.WriteLine(m.Value);
}
Output of both is
aag
arg
Try this if you want to match arg and aag:
a[ar]g
If you want to match everything except afg and agg, you need this regex:
a[^fg]g
It seems like you're trying to match any three alphabetic characters, with the condition that the second character cannot be f or g. If this is the case, why not use the following regular expression:
string regr = "a[a-eh-z]g";
Regex: a[a-eh-z]g.
Then use Regex.Matches to get the matched substrings.
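For comparison, the patterns suggested in the answers above can be run side by side in a short sketch. The class and method names are hypothetical. The first three patterns give the same result on this input, while a[^fg]g also accepts non-letters such as %.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ExcludeLettersDemo
{
    // Returns every matched substring as a comma-separated string.
    public static string Find(string input, string pattern) =>
        string.Join(",", Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value));

    public static void Main()
    {
        string s = "aag, afg, agg, arg, a%g";
        Console.WriteLine(Find(s, "a[a-eh-z]g"));      // aag,arg  (explicit ranges)
        Console.WriteLine(Find(s, "a[a-z-[fg]]g"));    // aag,arg  (.NET class subtraction)
        Console.WriteLine(Find(s, "a[a-z](?<!f|g)g")); // aag,arg  (negative lookbehind)
        Console.WriteLine(Find(s, "a[^fg]g"));         // aag,arg,a%g
    }
}
```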
