Get the nth word of a line

Get the nth word of a line - c#

With this code:
var regex = new Regex(@"^(?:\S+\s){2}(\S+)");
var match = regex.Match("one two three four five");
if (match.Success)
{
    Console.WriteLine(match.Value);
}
I want to retrieve the third word of the line --> "three".
But instead, I get "one two three".
Edit:
I know that I could do it with s.Split(' ')[2] but I want to do it with regex.

If you want to use only the Match method, without referring to groups, you have to use a lookbehind. Basically you say: find a word that is preceded by two words. Your current regex says "find 2 words + 1 word", so you just have to change the "find 2 words" part to "preceded by 2 words", i.e. ^(?:\S+\s){2} becomes (?<=^(\S+\s){2}):
(?<=^(\S+\s){2})\S+

match.Value returns the entire matched substring, which includes the non-capturing parts of your regex. You should instead use match.Groups[1].Value to get the value of the first capturing group.
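Both fixes can be sketched side by side. This is a minimal illustration, not the only way to do it: the class name NthWordDemo, the method names, and the choice to return an empty string when the line has too few words are assumptions made for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NthWordDemo
{
    // Returns the n-th (1-based) whitespace-separated word via capture group 1,
    // or an empty string if the line has fewer than n words.
    public static string GetNthWord(string line, int n)
    {
        // Skip n-1 words without capturing them, then capture the next word.
        string pattern = @"^(?:\S+\s+){" + (n - 1) + @"}(\S+)";
        Match match = Regex.Match(line, pattern);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }

    // Same result for the third word, but using a lookbehind so that
    // match.Value itself is the word and no group lookup is needed.
    public static string GetThirdWordWithLookbehind(string line)
    {
        Match match = Regex.Match(line, @"(?<=^(\S+\s){2})\S+");
        return match.Success ? match.Value : string.Empty;
    }
}
```

Calling NthWordDemo.GetNthWord("one two three four five", 3) or NthWordDemo.GetThirdWordWithLookbehind("one two three four five") returns "three" in both cases.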

Related

Regular expression "[ ]" work to weed out white spaces, but why and how?

To extract the text between the pattern >>Digit<<, I have successfully used regex "(?<=\>>[0-9]+?<<)[ ].+?(?=\>>[0-9]+?<<)". Regex option is set to single line because the to-be-extracted text may be multiline.
>>1<< First Option For Third Variable Reply1 >>1<<
>>2<< Second Option For Third Variable Reply 1 >>2<<
>>3<< Third Option For Third Variable Reply 1
>>3<<
If I remove the [ ] portion of the regex "(?<=\>>[0-9]+?<<).+?(?=\>>[0-9]+?<<)", matches using the regex will actually extract white spaces (e.g. between >>1<< and >>2<<) which is not my intent. I don't understand why adding [ ] excludes those white spaces.
I understand that square brackets in regex generally signify character classes that are to be included. But here, by inserting square brackets with a space, I manage to exclude the white spaces (e.g. between >>1<< and >>2<<). So I am trying to understand how it worked in my case.
Thank you.

The point is that there are whitespaces between >>2<< and >>3<< and they are matched with .+? when the singleline mode is on.
You may try to use a capturing group around the first digit pattern and use a backreference to match the same number on the right:
(?<=>>([0-9]+)<<).*?(?=>>\1<<)
Details
(?<=>>([0-9]+)<<) - a positive lookbehind making sure there is >>, 1+ digits (Group 1), << immediately to the left of the current location
.*? - any 0+ chars, as few as possible
(?=>>\1<<) - a positive lookahead making sure there is >>, same number as in Group 1, << immediately to the right of the current location.
C# demo:
// Requires: using System; using System.Linq; using System.Text.RegularExpressions;
var s = ">>1<< First Option For Third Variable Reply1 >>1<<\n\n>>2<< Second Option For Third Variable Reply 1 >>2<<\n\n>>3<< Third Option For Third Variable Reply 1 \n>>3<<";
var rx = @"(?<=>>([0-9]+)<<).*?(?=>>\1<<)";
var results = Regex.Matches(s, rx, RegexOptions.Singleline)
    .Cast<Match>()
    .Select(m => m.Value);
Console.WriteLine(string.Join("\n", results));
Result:
First Option For Third Variable Reply1
Second Option For Third Variable Reply 1
Third Option For Third Variable Reply 1
Another idea is to reject matches that would start with nothing but whitespace between two >>...<< markers:
(?<=>>[0-9]+<<)(?!\s+>>[0-9]+<<).*?(?=>>[0-9]+<<)
               ^^^^^^^^^^^^^^^^^
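The negative-lookahead variant can be tried in the same way as the backreference version. The following is a small sketch; the class name MarkerDemo, the method name, and the Trim() call on each result are assumptions added for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class MarkerDemo
{
    // Extracts the text between >>N<< markers. The negative lookahead rejects
    // positions where only whitespace separates two markers, so the gaps
    // between one entry's closing marker and the next entry's opening marker
    // are not returned as matches.
    public static string[] ExtractBetweenMarkers(string input)
    {
        string pattern = @"(?<=>>[0-9]+<<)(?!\s+>>[0-9]+<<).*?(?=>>[0-9]+<<)";
        return Regex.Matches(input, pattern, RegexOptions.Singleline)
            .Cast<Match>()
            .Select(m => m.Value.Trim()) // drop the padding around each entry
            .ToArray();
    }
}
```

For the input ">>1<< First >>1<<\n\n>>2<< Second\nline >>2<<" this returns two entries, "First" and "Second\nline"; the blank lines between the entries do not show up as a match.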

Find a string pattern using Regular expression

How can I use regular expressions to find if the string matches a pattern like [sometextornumber] is a [sometextornumber].
For instance, if the input is This is a test, the output should be this and test.
I was thinking something like ([a-zA-Z0-9]) is a([a-zA-Z0-9]) but looks like I am way off the correct path.

Your question is geared towards grabbing the first and last word of a sentence. If that is all you are interested in, this pattern will suffice:
"^(\\w+)|(\\w+)$"
Pattern breakdown:
^ indicates the beginning of a line
^(\\w+) is a capture group for a word at the beginning of the line. \w is roughly equivalent to [a-zA-Z0-9_] (in .NET it also matches other Unicode letters and digits), and the + says you want one or more of those characters.
| acts as an OR operator in regex
$ indicates the end of a line
(\\w+)$ is a capture group for a word at the end of the line, built the same way as the first group.
This pattern lets you ignore what is between the first and last word, so it doesn't care about "is a". Each match is a single word, available as the whole match (Groups[0]).
Usage:
string data = "This is going to be a test";
Match m = Regex.Match(data, "^(\\w+)|(\\w+)$");
while (m.Success)
{
    Console.WriteLine(m.Groups[0]);
    m = m.NextMatch();
}
Results:
This
test
If you're really only interested in the first and last word of a sentence, you also don't need to bother with Regex. Just split the sentence by a space and grab the first and last element of the array.
string[] dataPieces = data.Split(' ');
Console.WriteLine(dataPieces[0]);
Console.WriteLine(dataPieces[dataPieces.Length - 1]);
And the results are the same.
References:
https://msdn.microsoft.com/en-us/library/hs600312(v=vs.110).aspx
https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx

Try this:
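([a-zA-Z0-9]+) is a ([a-zA-Z0-9]+)
Edit:
You need a space after the a, since it is a separate word. Without the + the pattern only matches from the last letter of the first word to the first letter of the last word. The + matches one or more of the preceding character class, so in this case the whole word. Note that the + belongs inside the parentheses: with ([a-zA-Z0-9])+ the whole word is still matched, but the group captures only its last character.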
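A minimal sketch of this approach, with the quantifier placed inside each capturing group so that each group holds a whole word. The class name IsAPattern, the method name, and the ^ and $ anchors (which require the whole input to have the "X is a Y" shape) are assumptions added for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IsAPattern
{
    // Returns true if the input has the form "<word> is a <word>" and hands
    // back both words; otherwise returns false and two empty strings.
    public static bool TryParse(string input, out string first, out string second)
    {
        Match m = Regex.Match(input, @"^([a-zA-Z0-9]+) is a ([a-zA-Z0-9]+)$");
        first = m.Success ? m.Groups[1].Value : string.Empty;
        second = m.Success ? m.Groups[2].Value : string.Empty;
        return m.Success;
    }
}
```

For "This is a test", TryParse returns true with first set to "This" and second set to "test"; for "This is going to be a test" it returns false.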

If you're looking to match a specific pattern such as "This" or "Test" you can simply do a case insensitive string compare.
From your question, I'm not sure that you necessarily need a regular expression here.

Here is a quick LINQpad:
var r = new Regex("(.*) is a (.*)");
var match = r.Match("This is a test");
match.Groups.OfType<Group>().Skip(1).Select(g=>g.Value).Dump();
That outputs:
IEnumerable<String> (2 items)
This
test

C# Regular Expression: Search the first 3 letters of each name

Does anyone know how to write a C# regex that searches by the first 3 letters of each part of a full name, without using (.*)?
I tried (.*), but it matches text far beyond the requested name: if it finds the first condition and then, 100 words later, the second condition, it returns text that is not what I am looking for.
So I have to limit the number of words in between.
Example: \s*(?:\s+\S+){0,2}\s*
I would like to ignore name parts with fewer than 3 characters if they exist in the name.
The search should find any name whose parts start with these 3-character prefixes:
'Mar Jac Rey' (regex that performs search)
Should match:
Marck Jacobs L. S. Reynolds
Marcus Jacobine Reys
Maroon Jacqueline by Reyils
Can anyone help me?

The zero-or-more quantifier (*) is 'greedy' by default; that is, it consumes as many characters as possible while still letting the remainder of the pattern match. This is why Mar.*Jac will match from the first Mar in the input to the last Jac, including everything in between.
One potential solution is just to make your pattern 'non-greedy' (*?). This will make it consume as few characters as possible in order to match the remainder of the pattern.
Mar.*?Jac.*?Rey
However, this is not a great solution because it would still match the name parts regardless of what other text appears in between, e.g. Marcus Jacobine Should Not Match Reys would be a valid match.
To allow only whitespace and tokens of at most 2 non-whitespace characters to appear between the name parts, you'd have to get fancier:
\bMar\w*(\s+\S{0,2})*\s+Jac\w*(\s+\S{0,2})*\s+Rey\w*
The pattern (\s+\S{0,2})*\s+ matches any number of tokens of at most two non-whitespace characters, each surrounded by whitespace. The \w* after each name part ensures that the entire name is included in that part of the match (you might want to use \S* instead here, but that's not entirely clear from your question). And I threw in a word boundary (\b) at the beginning to ensure that the match does not start in the middle of a 'word' (e.g. OMar would not match).
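To check the pattern against the sample names from the question, a short sketch like the following can be used. The class name NameSearch and the method name are assumptions made for the example; the pattern is the one given above, unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameSearch
{
    // Prefixes Mar, Jac, Rey in that order, separated only by whitespace
    // and by tokens of at most two non-whitespace characters.
    private static readonly Regex Pattern = new Regex(
        @"\bMar\w*(\s+\S{0,2})*\s+Jac\w*(\s+\S{0,2})*\s+Rey\w*");

    public static bool IsMatch(string text)
    {
        return Pattern.IsMatch(text);
    }
}
```

NameSearch.IsMatch returns true for "Marck Jacobs L. S. Reynolds", "Marcus Jacobine Reys" and "Maroon Jacqueline by Reyils", and false for "Marcus Jacobine Should Not Match Reys", because "Should" is longer than two characters.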

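I think what you want is this regular expression, which checks whether the text starts with one of the prefixes (pass RegexOptions.IgnoreCase to make it case insensitive):
@"^(?:Mar|Jac|Rey)"
Note the parentheses: square brackets, as in [Mar|Jac|Rey]{3}, would define a character class that matches any three of the listed characters, not one of the three prefixes.
Less specific:
@"^\w{3}"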

If you want to capture the first three letters of every word that is longer than three characters, you could use something like:
((?<name>[\w]{3})\w+)+
And enable ExplicitCapture when initializing your Regex.
Matches then returns a series of Match objects, each with a group called "name" that holds one result.
Code sample:
Regex regex = new Regex(@"((?<name>[\w]{3})\w+)+", RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);
var matches = regex.Matches("Marck Jacobs L. S. Reynolds");
If you also want to capture words of exactly 3 characters, you can replace the last "\w" with a space. In that case, remember to handle the last word of the phrase, which has no trailing space.
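Reading the "name" group from each match might look like the sketch below. Note that it swaps in a slightly different pattern, \b(?<name>\w{3})\w*, which is an assumption and not the pattern from this answer: the \b anchors each match at the start of a word, and \w* also accepts words of exactly three characters. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class Prefixes
{
    // Returns the first three characters of every word that has at least
    // three word characters; shorter tokens such as "L." are skipped.
    public static List<string> FirstThreeLetters(string fullName)
    {
        return Regex.Matches(fullName, @"\b(?<name>\w{3})\w*", RegexOptions.ExplicitCapture)
            .Cast<Match>()
            .Select(m => m.Groups["name"].Value)
            .ToList();
    }
}
```

For "Marck Jacobs L. S. Reynolds" this returns Mar, Jac, Rey; the result can then be compared against the requested prefixes.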

Odd regexp behaviour - matches only first and last capture group

I am trying to write a regexp which would match a comma-separated list of words and capture all the words. The line "   apple , banana ,orange,peanut  " should be matched, and the captures should be apple, banana, orange, peanut. To do that I use the following regexp:
^\s*([a-z_]\w*)(?:\s*,\s*([a-z_]\w*))*\s*$
It successfully matches the string but all of a sudden only apple and peanut are captured. This behaviour is seen in both C# and Perl. Thus I assume I am missing something about how regexp matching works. Any ideas? :)

The value given by match.Groups[2].Value is just the last value captured by the second group.
To find all the values, look at match.Groups[2].Captures[i].Value where in this case i ranges from 0 to 2. (As well as match.Groups[1].Value for the first group.)
(+1 for question, I learned something today!)
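A short sketch of reading every capture of the repeated group in .NET, using the regex from the question unchanged. The class name CaptureDemo and the method name are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureDemo
{
    // Returns all words of a comma-separated list, or an empty list if the
    // line does not have the expected shape.
    public static List<string> ParseWords(string line)
    {
        var words = new List<string>();
        Match match = Regex.Match(line, @"^\s*([a-z_]\w*)(?:\s*,\s*([a-z_]\w*))*\s*$");
        if (!match.Success)
        {
            return words;
        }

        // Group 1 matched once: the first word.
        words.Add(match.Groups[1].Value);

        // Group 2 sits inside a repeated group; Groups[2].Value only holds the
        // last repetition, but Captures keeps every one of them in order.
        foreach (Capture capture in match.Groups[2].Captures)
        {
            words.Add(capture.Value);
        }
        return words;
    }
}
```

For "   apple , banana ,orange,peanut  " this yields apple, banana, orange, peanut. The Captures collection is specific to .NET; engines such as Perl's only keep the last repetition.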

Try this:
string text = " apple , banana ,orange,peanut";
var matches = Regex.Matches(text, @"\s*(?<word>\w+)\s*,?")
    .Cast<Match>()
    .Select(x => x.Groups["word"].Value)
    .ToList();

You are repeating your capturing group, at every repeated match the previous content is overwritten. So only the last match of your second capturing group is available at the end.
You can change your second capturing group to
^\s*([a-z_]\w*)((?:\s*,\s*(?:[a-z_]\w*))*)\s*$
Then the result would be " , banana ,orange,peanut" in your second group. I am not sure, if you want this.
If you want to check that the string has that pattern and extract each word, I would do it in two steps:
Check the pattern with your regex.
If the pattern is correct, remove leading and trailing whitespace and split on \s*,\s*.
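The two steps can be sketched as follows. The class name ListParser, the method name, and the choice to return an empty array for invalid input are assumptions made for the example; the validation regex is the one from the question without capturing groups.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ListParser
{
    public static string[] ValidateAndSplit(string line)
    {
        // Step 1: check that the whole line is a comma-separated word list.
        if (!Regex.IsMatch(line, @"^\s*[a-z_]\w*(?:\s*,\s*[a-z_]\w*)*\s*$"))
        {
            return new string[0];
        }

        // Step 2: strip the outer whitespace, then split on commas together
        // with any whitespace around them.
        return Regex.Split(line.Trim(), @"\s*,\s*");
    }
}
```

ListParser.ValidateAndSplit("   apple , banana ,orange,peanut  ") returns the four words apple, banana, orange and peanut.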

Simple regexp:
(?:^| *)(.+?)(?:,|$)
Explanation:
(?:^| *)  # Non-capturing group: start of line or any number of spaces
(.+?)     # Capture the word in the list, lazily
(?:,|$)   # Non-capturing group: comma or end of line
The captured text can still include surrounding spaces (for example the space before a comma), so trim each capture.
Note: Rubular is a nice website for testing this kind of thing.

How to prevent regex from stopping at the first match of alternatives?

If I have the string hello world, how can I modify the regex world|wo|w so that it will match all of "world", "wo" and "w" rather than just the single first match of "world" that it comes to?
If this is not possible directly, is there a good workaround? I'm using C# if it makes a difference:
Regex testRegex = new Regex("world|wo|w");
MatchCollection theMatches = testRegex.Matches("hello world");
foreach (Match thisMatch in theMatches)
{
    ...
}

I think you're going to need to use three separate regexes and match on each of them. When you specify alternatives, the engine treats the first alternative that matches as a successful match and stops looking at the others. The only way I can see to do it is to repeat the search with each of your alternatives in a separate regex. You can create an array or list of Match items and have each search add to the list if you want to be able to iterate through them later.

If you're trying to match (the beginning of) the word world three times, you'll need to use three separate Regex objects; a single Regex cannot match the same character twice.

As SLaks wrote, a regex can't match the same text more than once.
You could "fake it" like this:
\b(w)((?<=w)o)?((?<=wo)rld)?
will match the w, the o only if preceded by w (see the note below), and rld only if preceded by wo.
Of course, only parts of the word will actually be matched, but you can see whether only the first one, the first two, or all the parts matched by looking at the captured groups.
So in the word want, the w will match (the rest is optional, so the regex reports overall success).
In work, the wo will match; \1 will contain w, and \2 will contain o. The rld will fail, but since it's optional, the regex still reports success.
I have added a word boundary anchor \b to the start of the regex to avoid matches in the middle of words like reword; if you don't want to exclude those matches, drop the \b.
Note: the (?<=w) is not actually needed here, but I kept it in for consistency.
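Inspecting the groups in C# could look like the sketch below. The class name PrefixMatch, the method name, and the output format are assumptions made for the example; the regex is the one from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrefixMatch
{
    private static readonly Regex Pattern =
        new Regex(@"\b(w)((?<=w)o)?((?<=wo)rld)?");

    // Reports which of the prefixes "w", "wo" and "world" were found at the
    // start of the first word beginning with w.
    public static string Describe(string input)
    {
        Match m = Pattern.Match(input);
        if (!m.Success)
        {
            return "no match";
        }
        // Group.Success tells whether an optional group took part in the match.
        return string.Format("w={0} wo={1} world={2}",
            m.Groups[1].Success, m.Groups[2].Success, m.Groups[3].Success);
    }
}
```

PrefixMatch.Describe("hello world") returns "w=True wo=True world=True", "work" gives "w=True wo=True world=False", "want" gives "w=True wo=False world=False", and "hello" gives "no match".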

