I'm trying to select all the tokens that consist only of letters, optionally followed by a single trailing dot.
Example of valid words : "abc", "abc."
Invalid "a.b" "a2"
I've tried this:
string[] tokens = text.Split(' ');
var words = from token in tokens
where Regex.IsMatch(token,"^[a-zA-Z]+.?$")
select token;
^[a-zA-Z]+ - only letters one or more times and start with letter
.?$ = ends with 0 or 1 dot ?? not sure about this
In regex, an unescaped . pattern matches any character (including digits). Thus, your regex would undesirably match tokens such as "a2".
You need to escape your dot character as \..
string[] tokens = text.Split(' ');
var words = from token in tokens
where Regex.IsMatch(token, @"^[a-zA-Z]+\.?$")
select token;
Edit: Furthermore, you can amalgamate your Split(' ') logic into your regex by using lookbehind and lookahead. This might improve efficiency, although it does reduce legibility a bit.
var words = Regex.Matches(text, @"(?<=\ |^)[a-zA-Z]+\.?(?=\ |$)")
.OfType<Match>()
.Select(m => m.Value);
The (?<=\ |^) lookbehind means that the match must be preceded by a space or start-of-string.
The (?=\ |$) lookahead means that the match must be succeeded by a space or end-of-string.
You need to escape .
^[a-zA-Z]+\.?$
Otherwise, . is a special character that matches (almost) all characters--not just periods.
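To see the difference between the two patterns, here is a minimal sketch that runs both against the tokens from the question. The class and method names (DotDemo, IsWordUnescaped, IsWordEscaped) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class DotDemo
{
    // Unescaped dot: "." matches any character except a newline.
    public static bool IsWordUnescaped(string token) =>
        Regex.IsMatch(token, @"^[a-zA-Z]+.?$");

    // Escaped dot: "\." matches only a literal period.
    public static bool IsWordEscaped(string token) =>
        Regex.IsMatch(token, @"^[a-zA-Z]+\.?$");

    static void Main()
    {
        foreach (var token in new[] { "abc", "abc.", "a.b", "a2" })
        {
            Console.WriteLine(
                $"{token}: unescaped={IsWordUnescaped(token)}, escaped={IsWordEscaped(token)}");
        }
    }
}
```

Only "a2" differs between the two: the unescaped pattern accepts it because the dot consumes the digit, while the escaped pattern rejects it.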
Related
I would like to get the words between underscores after the second occurrence of an underscore.
This is my string:
ABC_BC_BE08_C1000004_0124
I've assembled this expression:
(?<=_)[^_]+
It matches what I need, but it only skips the first word, since there is no underscore before it. I would like it to skip ABC and BC and return just the last three strings. I've tried adjusting it, but I'm stuck and can't make it work. Thanks!
You can use a non-regex approach here with Split and Skip:
var text = "ABC_BC_BE08_C1000004_0124";
var result = text.Split('_').Skip(2);
foreach (var s in result)
Console.WriteLine(s);
Output:
BE08
C1000004
0124
See the C# demo.
With regex, you can use
var result = Regex.Matches(text, @"(?<=^(?:[^_]*_){2,})[^_]+").Cast<Match>().Select(x => x.Value);
See the regex demo and the C# demo. The regex matches
(?<=^(?:[^_]*_){2,}) - a positive lookbehind that matches a location that matches the following patterns immediately to the left of the current location:
^ - start of string
(?:[^_]*_){2,} - two or more ({2,}) sequences of any zero or more chars other than _ ([^_]*) and then a _ char
[^_]+ - one or more chars other than _
In .NET, there is also a captures collection that you can use with a regex containing a repeated capture group.
^[^_]*_[^_]*(?:_([^_]+))+
The pattern matches:
^ Start of string
[^_]*_[^_]* Match any char except an _, match _ and again any char except _
(?: Non capture group
_([^_]+) Match _ and capture 1 or more times any char except _ in group 1
)+ Close the non capture group and repeat 1 or more times
.NET regex demo | C# demo
For example:
var pattern = @"^[^_]*_[^_]*(?:_([^_]+))+";
var str = "ABC_BC_BE08_C1000004_0124";
var strings = Regex.Match(str, pattern).Groups[1].Captures.Select(c => c.Value);
foreach (String s in strings)
{
Console.WriteLine(s);
}
Output
BE08
C1000004
0124
If you want to match only word characters in between the underscores, another option for a pattern could be using a negated character class [^\W_] excluding the underscore from the word characters in between:
^[^\W_]*_[^\W_]*(?:_([^\W_]+))+
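A minimal sketch of how this variant might be used with the captures collection. The class and method names (CaptureDemo, FieldsAfterSecondUnderscore) are hypothetical, and returning an empty array when nothing matches is an assumption about the desired behavior.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class CaptureDemo
{
    // Returns every field after the second underscore, or an empty
    // array when the input does not match the pattern at all.
    public static string[] FieldsAfterSecondUnderscore(string input)
    {
        var match = Regex.Match(input, @"^[^\W_]*_[^\W_]*(?:_([^\W_]+))+");
        if (!match.Success)
        {
            return Array.Empty<string>();
        }

        // Group 1 is repeated, so each repetition is kept in Captures.
        return match.Groups[1].Captures
            .Cast<Capture>()
            .Select(c => c.Value)
            .ToArray();
    }

    static void Main()
    {
        foreach (var field in FieldsAfterSecondUnderscore("ABC_BC_BE08_C1000004_0124"))
        {
            Console.WriteLine(field);
        }
    }
}
```

The explicit Cast<Capture>() keeps the sketch usable on older frameworks where CaptureCollection is not a generic enumerable.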
Using the C# Regex.Split method, I would like to split strings that will always start with RepXYZ, where the XYZ part is a number that always has either 3 or 4 digits.
Examples
"Rep1007$chkCheckBox"
"Rep127_Group_Text"
The results should be:
{"Rep1007","$chkCheckBox"}
{"Rep127","_Group_Text"}
So far I have tried (Rep)[\d]{3,4} and ((Rep)[\d]{3,4})+ but both of those are giving me unwanted results
Using Regex.Split often results in empty or unwanted items in the resulting array. Using (Rep)[\d]{3,4} in Regex.Split, will put Rep without the numbers into the resulting array. (Rep[\d]{3,4}) will put the Rep and the numbers into the result, but since the match is at the start, there will be an empty item in the array.
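The following sketch shows the two behaviors just described, using the first example string from the question. The names SplitDemo and Show are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class SplitDemo
{
    // Splits the input and renders each item in quotes so that
    // empty items are visible in the output.
    public static string Show(string input, string pattern) =>
        string.Join(", ", Regex.Split(input, pattern).Select(p => $"\"{p}\""));

    static void Main()
    {
        var input = "Rep1007$chkCheckBox";

        // Only "Rep" is captured, so the digits are lost.
        Console.WriteLine(Show(input, @"(Rep)[\d]{3,4}"));
        // prints: "", "Rep", "$chkCheckBox"

        // The whole prefix is captured, but a leading empty item remains.
        Console.WriteLine(Show(input, @"(Rep[\d]{3,4})"));
        // prints: "", "Rep1007", "$chkCheckBox"
    }
}
```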
I suggest using Regex.Match here:
var match = Regex.Match(text, @"^(Rep\d+)(.*)$");
if (match.Success)
{
Console.WriteLine(match.Groups[1].Value);
Console.WriteLine(match.Groups[2].Value);
}
See the regex demo
Details:
^ - start of string
(Rep\d+) - capturing group 1: Rep and any one or more digits
(.*) - capturing group 2: any zero or more chars other than a newline, as many as possible
$ - end of string.
A splitting approach is better implemented with a lookaround-based regex:
var results = Regex.Split(text, @"(?<=^Rep\d+)(?=[$_])");
See this regex demo.
(?<=^Rep\d+)(?=[$_]) splits a string at the location that is immediately preceded with Rep and one or more digits at the start of the string, and immediately followed with $ or _.
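As a quick check, this sketch applies the lookaround-based split to both example strings from the question. The names LookaroundSplitDemo and SplitPrefix are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookaroundSplitDemo
{
    // Splits at the zero-width position between the leading
    // "Rep<digits>" prefix and a following '$' or '_'.
    public static string[] SplitPrefix(string input) =>
        Regex.Split(input, @"(?<=^Rep\d+)(?=[$_])");

    static void Main()
    {
        foreach (var input in new[] { "Rep1007$chkCheckBox", "Rep127_Group_Text" })
        {
            Console.WriteLine(string.Join(" | ", SplitPrefix(input)));
        }
        // prints:
        // Rep1007 | $chkCheckBox
        // Rep127 | _Group_Text
    }
}
```

Because the split position is zero-width, no characters are consumed and no empty items appear. Later underscores are not split on, since the lookbehind is anchored to the start of the string.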
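Alternatively, skip regex and split on the first $ or _ with String.Split, limiting the result to two parts. Note that this drops the delimiter, so the second part is Group_Text rather than _Group_Text: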
string input = "Rep127_Group_Text";
string[] parts = input.Split(new[] { '$', '_' }, 2);
foreach (string part in parts)
{
Console.WriteLine(part);
}
This prints:
Rep127
Group_Text
I have the following problem.
This is what the regex looks like:
var regexTest = new Regex(@"'\d.*\d#");
This is what the string looks like:
var text = "dsadsadsadsa('1.222222#dsadsa'";
That is the result of what I would like to have:
1.222222
That's the result I'm getting right now ...:
'1.222222#
You want to extract the float number between ' and #. Use
var text = "dsadsadsadsa('1.222222#dsadsa'";
var regexTest = new Regex(@"'(\d+\.\d+)#");
var m = regexTest.Match(text);
if (m.Success)
{
Console.WriteLine(m.Groups[1].Value);
}
Here, (\d+\.\d+) captures one or more digits, a ., and then one or more digits into Group 1, which you can read with m.Groups[1].Value. Check m.Success first (as in the snippet above): when there is no match, the group value is just an empty string, so a failed match is otherwise easy to miss.
See the regex demo:
Just enclose the part you want to get in parentheses, so that you can get it as a group:
var regexTest = new Regex(@"'(\d.*\d)#");
-----------------------------^------^----
In '\d.*\d# you are matching ' followed by a digit, then any character 0+ times, followed by a digit and #. That would match '1.222222# but also, for example, '1.A2# because of the .*
To avoid matching the ' and the #, you could use a positive lookbehind and a positive lookahead to assert that they are there. If you only want to match digits, the .* can be replaced with a literal dot between two runs of digits:
(?<=')\d+\.\d+(?=#)
Regex demo
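A minimal sketch of the lookaround pattern in C#, using the string from the question. The names LookaroundNumberDemo and TryExtractNumber are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookaroundNumberDemo
{
    // Extracts a decimal number that sits directly between ' and #.
    // The delimiters are asserted by lookarounds, so they are not
    // part of the returned value.
    public static bool TryExtractNumber(string text, out string number)
    {
        var m = Regex.Match(text, @"(?<=')\d+\.\d+(?=#)");
        number = m.Success ? m.Value : string.Empty;
        return m.Success;
    }

    static void Main()
    {
        var text = "dsadsadsadsa('1.222222#dsadsa'";
        if (TryExtractNumber(text, out var number))
        {
            Console.WriteLine(number); // prints 1.222222
        }
    }
}
```

With this pattern the whole match is the number, so no capturing group is needed.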
I try to get words which don't start with "un" using pattern with negative lookbehind. This is the code:
using Regexp = System.Text.RegularExpressions.Regex;
using RegexpOptions = System.Text.RegularExpressions.RegexOptions;
string quote = "Underground; round; unstable; unique; queue";
Regexp negativeViewBackward = new Regexp(@"(?<!un)\w+\b", RegexpOptions.IgnoreCase);
MatchCollection finds = negativeViewBackward.Matches(quote);
Console.WriteLine(String.Join(", ", finds));
It always returns full set of words, but should return only round, queue.
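The (?<!un)\w+\b first matches a location that is not preceded with un (with the negative lookbehind), then matches 1 or more word chars followed with a word boundary position. The lookbehind is only checked at the position where the match starts, and the start of every word is not preceded by un, so \w+ then consumes the whole word, including Underground, unstable and unique.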
You need to use a negative lookahead after a leading word boundary:
\b(?!un)\w+\b
See the regex demo.
Details
\b - leading word boundary
(?!un) - a negative lookahead that fails the match if the next two word chars are un
\w+ - 1+ word chars
\b - a trailing word boundary.
C# demo:
string quote = "Underground; round; unstable; unique; queue";
Regex negativeViewBackward = new Regex(@"\b(?!un)\w+\b", RegexOptions.IgnoreCase);
List<string> result = negativeViewBackward.Matches(quote).Cast<Match>().Select(x => x.Value).ToList();
foreach (string s in result)
Console.WriteLine(s);
Output:
round
queue
I have a string as:
string subjectString = @"(((43*('\\uth\Hgh.Green.two.190ITY.PCV')*9.8)/100000+('VBNJK.PVI.10JK.PCV'))*('ASFGED.Height Density.1JKHB01.PCV')/476)";
My expected output is:
Hgh.Green.two.190ITY.PCV
VBNJK.PVI.10JK.PCV
ASFGED.Height Density.1JKHB01.PCV
Here's what I have tried:
Regex regexObj = new Regex(@"'[^\\]*.PCV");
Match matchResults = regexObj.Match(subjectString);
string val = matchResults.Value;
This works when the input string is @"(((43*('\\uth\Hgh.Green.two.190ITY.PCV')*9.8)/100000+", but when the string grows and there is more than one substring to extract, I get undesired results.
How do I extract three substrings from the original string?
It seems you want to match word and . chars before .PCV.
Use
[\w\s.]*\.PCV
See the regex demo
To force at least 1 word char at the start use
\w[\w\s.]*\.PCV
Optionally, if needed, add a word boundary at the start: @"\b\w[\w\s.]*\.PCV".
To force \w match only ASCII letters and digits (and _) compile the regex object with RegexOptions.ECMAScript option.
Here,
\w - matches any letter, digit or _
[\w\s.]* - matches 0+ whitespace, word or/and . chars
\. - a literal .
PCV - a PCV substring.
Sample usage:
var results = Regex.Matches(str, @"\w[\w\s.]*\.PCV")
.Cast<Match>()
.Select(m=>m.Value)
.ToList();
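The remark above about RegexOptions.ECMAScript can be illustrated with a small sketch that tests a single character against \w under both settings. The names EcmaDemo and IsWordChar are hypothetical, and the accented letter is just an example of a non-ASCII letter.

```csharp
using System;
using System.Text.RegularExpressions;

static class EcmaDemo
{
    // True when the whole string is exactly one \w character
    // under the given options.
    public static bool IsWordChar(string s, RegexOptions options) =>
        Regex.IsMatch(s, @"^\w$", options);

    static void Main()
    {
        var accented = "\u00E9"; // the letter e with an acute accent

        // By default \w is Unicode-aware, so the accented letter matches.
        Console.WriteLine(IsWordChar(accented, RegexOptions.None));       // True

        // With ECMAScript, \w is limited to ASCII letters, digits and _.
        Console.WriteLine(IsWordChar(accented, RegexOptions.ECMAScript)); // False

        // Plain ASCII letters match either way.
        Console.WriteLine(IsWordChar("a", RegexOptions.ECMAScript));      // True
    }
}
```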