Find a string pattern using Regular expression - c#

How can I use regular expressions to find if the string matches a pattern like [sometextornumber] is a [sometextornumber].
For instance, if the input is This is a test, the output should be this and test.
I was thinking something like ([a-zA-Z0-9]) is a([a-zA-Z0-9]) but looks like I am way off the correct path.

Your question is geared towards grabbing the first and last word of a sentence. If this is all you're going to be interested in, this pattern will suffice:
"^(\\w+)|(\\w+)$"
Pattern breakdown:
^ indicates the beginning of a line
^(\\w+) capture group for a word at the beginning of the line. In .NET, \w matches letters, digits, and the underscore (roughly [a-zA-Z0-9_], plus other Unicode letters and digits by default), and the + says you want one or more of those characters.
| acts as an OR operator in Regex
$ indicates the end of a line
(\\w+)$ capture group for a word at the end of the line, using the same \w+ as above.
This pattern lets you ignore whatever sits between the first and last word, so it doesn't care about "is a". Each match is a single word, which you can read from Groups[0] (the whole match).
Usage:
string data = "This is going to be a test";
Match m = Regex.Match(data, "^(\\w+)|(\\w+)$");
while (m.Success)
{
Console.WriteLine(m.Groups[0]);
m = m.NextMatch();
}
Results:
This
test
If you're really only interested in the first and last word of a sentence, you also don't need to bother with Regex. Just split the sentence by a space and grab the first and last element of the array.
string[] dataPieces = data.Split(' ');
Console.WriteLine(dataPieces[0]);
Console.WriteLine(dataPieces[dataPieces.Length - 1]);
And the results are the same.
References:
https://msdn.microsoft.com/en-us/library/hs600312(v=vs.110).aspx
https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx

Try this:
([a-zA-Z0-9]+) is a ([a-zA-Z0-9]+)
Edit:
You need a space after the a, since the next word is separate. Without the +, each group matches a single character, so the pattern only matches from the last letter of the first word to the first letter of the last word. The + matches one or more of whatever precedes it; keeping it inside the parentheses makes each group capture the whole word rather than only its last character.
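As a minimal sketch of how the groups can be read back in C#, the snippet below puts the + inside each group so that the group holds the whole word. The class and method names are made up for illustration, and the ^ and $ anchors are an added assumption so that the whole input has to fit the pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IsAPatternDemo
{
    // Returns the two words around " is a ", or null when the input does not fit the pattern.
    public static string[] Extract(string input)
    {
        // Assumption: ^ and $ are added so that the entire input must match.
        Match m = Regex.Match(input, @"^([a-zA-Z0-9]+) is a ([a-zA-Z0-9]+)$");
        if (!m.Success)
        {
            return null;
        }
        // Groups[0] is the whole match; Groups[1] and Groups[2] are the captured words.
        return new[] { m.Groups[1].Value, m.Groups[2].Value };
    }

    public static void Main()
    {
        string[] words = Extract("This is a test");
        Console.WriteLine(words == null ? "no match" : string.Join(", ", words)); // This, test
    }
}
```

An input such as "This is going to be a test" does not fit this stricter pattern, so Extract returns null for it.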

If you're looking to match a specific pattern such as "This" or "Test" you can simply do a case insensitive string compare.
From your question, I'm not sure that you necessarily need a regular expression here.

Here is a quick LINQpad:
var r = new Regex("(.*) is a (.*)");
var match = r.Match("This is a test");
match.Groups.OfType<Group>().Skip(1).Select(g=>g.Value).Dump();
That outputs:
IEnumerable<String> (2 items)
This
test

Related

How to match any repeated chunks of characters?

I've seen many questions similar to this but none quite like it.
I have strings like this:
HF-01-HF-01-01
FBC-FBC-04
OZYA-03A-OZYA-03A-03
QC-QC-02
and want them to be returned like so:
HF-01-01
FBC-04
OZYA-03A-03
QC-02
I can't figure this out and the other questions I've seen don't apply because 1) the repeated chunk is more than one character, 2) There are no spaces between the repetition.
Or is regex not the best way to do this?
EDIT:
Rules
Alpha chunks are never repeated more than one time.
Some chunks can be alphanumeric but also never repeated more than one time.
The part that can be repeated would be from the start of the string and any additional chunks by hyphen.
So you would never have something like HF-HF-01-01. But in this case using the above rules, it would become HF-01-01 since HF is the only part repeated from the beginning of the string.
Perhaps something like this would work:
Scan string to first hyphen, see if that matches anywhere else after first hyphen, if so scan to second hyphen, see if that matches anywhere else, if not, take the first scan and remove one instance of it from the string, if so, scan to third, etc.
But I don't know how to do that in regex.
I'm not sure if RegExp is the right tool here.
Using MoreLinq's RunLengthEncode method (which implements run-length encoding) you can achieve it like this:
string RemoveDuplicate(string input)
{
var chunks = input.Split('-') // cut at -
.RunLengthEncode() // group and count adjacent equal chunks
.Select(kvp => kvp.Key);// just take the chunk value
return string.Join("-", chunks); // reglue with -
}
Edit
Doesn't work when the repeated part spans more than one chunk, for example:
OZYA-03A-OZYA-03A-03
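For comparison, here is a rough non-regex sketch that also handles repeats spanning several chunks. It is loosely based on the scan described in the question, under the assumption that the repeated part starts at the beginning of the string and is repeated immediately. RemoveRepeatedPrefix is a hypothetical helper name, not part of MoreLinq.

```csharp
using System;
using System.Linq;

public static class ChunkDedupDemo
{
    // Hypothetical helper: drops the leading run of chunks when it is immediately repeated.
    public static string RemoveRepeatedPrefix(string input)
    {
        string[] chunks = input.Split('-');
        // Try prefixes of 1, 2, ... chunks; the prefix and its repeat must both fit in the string.
        for (int k = 1; 2 * k <= chunks.Length; k++)
        {
            bool repeated = chunks.Take(k).SequenceEqual(chunks.Skip(k).Take(k));
            if (repeated)
            {
                // Keep everything from the second copy onward.
                return string.Join("-", chunks.Skip(k));
            }
        }
        return input; // nothing repeated: leave the string alone
    }

    public static void Main()
    {
        foreach (string s in new[] { "HF-01-HF-01-01", "FBC-FBC-04", "OZYA-03A-OZYA-03A-03", "QC-QC-02" })
        {
            Console.WriteLine(RemoveRepeatedPrefix(s));
        }
    }
}
```

The shortest repeated prefix wins, which is consistent with the rule that HF-HF-01-01 becomes HF-01-01.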
I guess,
([^-\r\n]+-|[^-\r\n]+-[^-\r\n]+-)(\1.*)
or with start/end anchors,
^([^-\r\n]+-|[^-\r\n]+-[^-\r\n]+-)(\1.*)$
might work to some extent and the desired output is in the last capturing group:
(\1.*)
Test
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string pattern = @"([^-\r\n]+-|[^-\r\n]+-[^-\r\n]+-)(\1.*)";
string input = @"HF-01-HF-01-01
FBC-FBC-04
OZYA-03A-OZYA-03A-03
QC-QC-02
and want them to be returned like so:
HF-01-01
FBC-04
OZYA-03A-03
QC-02";
RegexOptions options = RegexOptions.Multiline;
foreach (Match m in Regex.Matches(input, pattern, options))
{
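// Group 2 holds the de-duplicated text; m.Value would be the whole original line.
Console.WriteLine("'{0}' found at index {1}.", m.Groups[2].Value, m.Groups[2].Index);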
}
}
}
If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com, where you can also watch how it matches against some sample inputs.
I'm not sure if regex is the right tool here, but at least it can more or less be done with this short pattern:
^([A-Z0-9]+)-.*(\1.*)$
Explanation:
^ start of string
( group 1 start
[A-Z0-9]+ one or more capital letters or digits
) end group 1
- literal
.* any number of any chars
( group 2 start
\1 anything that was matched in group 1
.* any number of any chars
) end group 2 (this group will be used as the result)
$ end of string
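A small sketch of how this pattern could be applied in C#, reading the result from group 2 as the breakdown describes. RemoveRepeat is a made-up name, and returning the input unchanged when nothing matches is an assumption about the desired behavior.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackrefDedupDemo
{
    // Returns group 2 (the text from the last repeat of the first chunk onward),
    // or the original input when the pattern does not match.
    public static string RemoveRepeat(string input)
    {
        Match m = Regex.Match(input, @"^([A-Z0-9]+)-.*(\1.*)$");
        return m.Success ? m.Groups[2].Value : input;
    }

    public static void Main()
    {
        foreach (string s in new[] { "HF-01-HF-01-01", "FBC-FBC-04", "OZYA-03A-OZYA-03A-03", "QC-QC-02" })
        {
            Console.WriteLine(RemoveRepeat(s));
        }
        // Prints:
        // HF-01-01
        // FBC-04
        // OZYA-03A-03
        // QC-02
    }
}
```

Note that \1 is not tied to a hyphen boundary here, so the first chunk can also be found inside a longer chunk (A-BA-01 would come back as A-01).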

Get the nth word of a line

With this code:
Regex regex = new Regex(@"^(?:\S+\s){2}(\S+)");
Match match = regex.Match("one two three four five");
if (match.Success)
{
Console.WriteLine(match.Value);
}
I want to retrieve the third word of the line --> "three".
But instead, I get "one two three".
Edit:
I know that I could do it with s.Split(' ')[2] but I want to do it with regex.
If you want to use only the Match method, without referring to groups, then you have to use a look-behind. Basically you say: find a word that is preceded by two words. In your current regex you say: find 2 words + 1 word. So you just have to change the "find 2 words" part to "preceded by 2 words", i.e. ^(?:\S+\s){2} becomes (?<=^(\S+\s){2}):
(?<=^(\S+\s){2})\S+
match.Value returns the entire matched substring, which includes the non-capturing parts of your regex. You should instead use match.Groups[1].Value to get the value of the first capturing group.
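Both suggestions can be sketched side by side; the class and method names below are illustrative only. The first reads the capturing group from the original pattern, the second relies on .NET's support for variable-length look-behind so that match.Value is just the word.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ThirdWordDemo
{
    // Keeps the original pattern and reads the first capturing group.
    public static string ViaGroup(string line)
    {
        Match m = Regex.Match(line, @"^(?:\S+\s){2}(\S+)");
        return m.Success ? m.Groups[1].Value : null;
    }

    // Moves the first two words into a look-behind, so the match itself is only the third word.
    public static string ViaLookbehind(string line)
    {
        Match m = Regex.Match(line, @"(?<=^(\S+\s){2})\S+");
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(ViaGroup("one two three four five"));      // three
        Console.WriteLine(ViaLookbehind("one two three four five")); // three
    }
}
```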

Limit regex expression by character in c#

I have the following pattern, (\s\w+), and I need it to match every word in my string that is separated by a space.
For example
When I have this string
many word in the textarea must be happy
I get
many
word
in
the
textarea
must
be
happy
That is correct, but when the string contains another character, for example
many word in the textarea , must be happy
I get
many
word
in
the
textarea
must
be
happy
But must be happy should be ignored, because I want matching to stop when another character appears in the string.
Edit:
Example 2
all cats { in } the world are nice
It should return
all
cats
Because { is another separator for me
Example 3
My 3 cats are ... funny
It should return
My
3
cats
are
Because 3 is alphanumeric and . is separator for me
What can I do?
To do that, you need the \G anchor, which matches the position at the start of the string or at the end of the previous match. You can do it with this pattern:
@"(?<=\G\s*)\w+"
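A short sketch of how this \G pattern might be used with Regex.Matches. LeadingWords is a hypothetical helper name. Because every match has to begin right after the previous one, allowing only whitespace in between, the first other character ends the sequence.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LeadingWordsDemo
{
    // Collects the words at the start of the input, stopping at the first
    // character that is neither a word character nor whitespace.
    public static List<string> LeadingWords(string input)
    {
        var words = new List<string>();
        foreach (Match m in Regex.Matches(input, @"(?<=\G\s*)\w+"))
        {
            words.Add(m.Value);
        }
        return words;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join("|", LeadingWords("many word in the textarea , must be happy")));
        Console.WriteLine(string.Join("|", LeadingWords("all cats { in } the world are nice")));
        Console.WriteLine(string.Join("|", LeadingWords("My 3 cats are ... funny")));
    }
}
```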
[^\w\s\n].*$|(\w+\s+)
Try this. Grab the captures or matches. Set the m flag for multiline mode.
See demo:
http://regex101.com/r/kP4pZ2/12
I think Sam I Am's comment is correct: you'll require two regular expressions.
Capture the text up to a non-word character.
Capture all the words with a space on one side.
Here's the corresponding code:
"^(\\w+\\s+)+"
"(\\w+\\s+)"
You can combine these two to capture just the individual words pretty easily - like so
"^(\\w+\\s+)+"
Here's a complete piece of code demonstrating the pattern:
string input = "many word in the textarea , must be happy";
string pattern = "^(\\w+\\s+)+";
Match match = Regex.Match(input, pattern);
// Never throws a NullReferenceException: the GroupCollection indexer returns an empty, unsuccessful Group when nothing matched.
foreach(Capture capture in match.Groups[1].Captures)
{
Console.WriteLine(capture.Value);
}
EDIT
Check out Casimir et Hippolyte's answer for a really clean solution.
All in one regex :-) Result is in list
Regex regex = new Regex(@"^((\w+)\s*)+([^\w\s]|$).*");
Match m = regex.Match(inputString);
if(m.Success)
{
List<string> list =
m.Groups[2].Captures.Cast<Capture>().
Select(c=>c.Value).ToList();
}

Regex replace all matching words that do not contain a certain string

How can I use regex to replace matching strings that do not include a specific string?
input string
Keepword mywordsecond mythirdword myfourthwordKeep
string to replace
word
exclude string
Keep
Desired output
Keepword mysecond mythird myfourthKeep
Will there ever be more than one occurrence of word within a single word? If there is more than one, do you want to replace all of them? If not, this should sort you out:
Regex r = new Regex(@"\b((?:(?!Keep|word)\w)*)word((?:(?!Keep)\w)*)\b");
s1 = r.Replace(s0, "$1$2");
to explain:
First, \b((?:(?!Keep|word)\w)*) captures whatever text precedes the first occurrence of word or Keep.
The next thing it sees must be word. If it sees Keep or the end of the string instead, the match attempt immediately fails.
Then ((?:(?!Keep)\w)*)\b captures the remainder of the text in order to ensure it doesn't contain Keep.
When faced with a problem like this, most users' first impulse is to match (in the sense of consuming) only the part of the string they're interested in, using lookarounds to establish the context. It's usually much easier to write the regex so that it always moves forward through the string as it matches. You capture the parts you want to retain so you can plug them back into the result string by means of group references ($1, $2, etc.).
Given that you're using C#, you could use the lookaround approach:
Regex r = new Regex(@"(?<!Keep\w*)word(?!\w*Keep)");
s1 = r.Replace(s0, "");
But please don't. There are very few regex flavors that support unrestricted lookbehinds like .NET does, and most problems don't work so neatly as this one anyway.
string str = "Keepword mywordsecond mythirdword myfourthwordKeep";
str = Regex.Replace(str, "(?<!Keep)word", "");
And I'm going to link you to a good regular expressions cheat sheet here.
This works in notepad++:
(?<!Keep)word(?!Keep)
It uses a negative look-behind and a negative look-ahead.
You can use a negative look-behind assertion if you want to remove every "word" that is not preceded by "Keep":
String input = "Keepword mywordsecond mythirdword myfourthwordKeep";
String pattern = "(?<!Keep)word";
String output = Regex.Replace(input, pattern, "");
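To make the difference between the look-behind-only pattern and the stricter look-around pattern concrete, here is a small sketch that runs both on the question's input. The class and method names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RemoveWordDemo
{
    // Removes "word" unless it is directly preceded by "Keep".
    public static string UnlessPrecededByKeep(string input)
    {
        return Regex.Replace(input, "(?<!Keep)word", "");
    }

    // Removes "word" only when the surrounding run of word characters
    // has no "Keep" before or after it.
    public static string UnlessKeepInSameToken(string input)
    {
        return Regex.Replace(input, @"(?<!Keep\w*)word(?!\w*Keep)", "");
    }

    public static void Main()
    {
        string input = "Keepword mywordsecond mythirdword myfourthwordKeep";
        Console.WriteLine(UnlessPrecededByKeep(input));   // Keepword mysecond mythird myfourthKeep
        Console.WriteLine(UnlessKeepInSameToken(input));  // Keepword mysecond mythird myfourthwordKeep
    }
}
```

The two differ only on myfourthwordKeep, where Keep follows word instead of preceding it.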

C# Regular expressions, retrieving two words separated by a comma, parenthesis operator

I've been playing around with retrieving data from a string using regular expression, mostly as an exercise for myself. The pattern that I'm trying to match looks like this:
"(SomeWord,OtherWord)"
After reading some documentation and looking at a cheat sheet I came to the conclusion that the following regex should give me 2 matches:
"\((\w),(\w)\)"
Because according to the documentation the parenthesis should do the following:
(pattern) Matches pattern and remembers the match. The matched
substring can be retrieved from the resulting Matches collection,
using Item [0]...[n]. To match parentheses characters ( ), use "\ (" or
"\ )".
However using the following code (removed error checking for conciseness) matches quite something different:
string line = "(A,B)";
string pattern = @"\((\w),(\w)\)";
MatchCollection matches = Regex.Matches(line, pattern);
string left = matches[0].Value;
string right = matches[1].Value;
Now I would expect left to become "A" and right to become "B". However left becomes "(A,B)" and there is no second match at all. What am I missing here?
(I know this example is trivial to solve without regexes but to learn how to properly use regexes I should be able to make something simple as this work)
You want the Groups member of the first match. In your example case there is only 1 match, which is the whole string. In the Groups collection you will have 3 items. Try this sample code, left should be A, and right should be B. If you look at the group[0] value it will be the whole string.
string line = "(A,B)";
string pattern = @"\((\w),(\w)\)";
MatchCollection matches = Regex.Matches(line, pattern);
GroupCollection groups = matches[0].Groups;
string left = groups[1].Value;
string right = groups[2].Value;
\w matches only one word character. If words have to contain at least one character, the expression should be:
string pattern = @"\((\w+),(\w+)\)";
if words may be empty:
string pattern = @"\((\w*),(\w*)\)";
+: means one or more repetitions.
*: means zero or more repetitions.
In any case, you will get one match with three groups, the first containing the whole string including the left and right parentheses, the two others the two words.
I think the problem is that you're confusing the concept of a match and a group.
A MatchCollection contains a Match object for every place where your entire regex matched, not just the parenthetical groups inside that regex. For example, if the string you searched looked like this...
(A,B)(C,D)
...then you would have two matches: (A,B) and (C,D).
However, there's good news: you can get the groups from each match very easily, like so:
string line = "(A,B)";
string pattern = @"\((\w),(\w)\)";
MatchCollection matches = Regex.Matches(line, pattern);
string left = matches[0].Groups[1].Value;
string right = matches[0].Groups[2].Value;
That Groups variable is a collection of parenthetical groups from a single match.
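The match-versus-group distinction can be sketched on the (A,B)(C,D) example. Pairs is a made-up helper name, and the pattern uses \w+ rather than \w (an assumption, so that longer words such as SomeWord are captured too).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchVsGroupDemo
{
    // One entry per match; each entry holds that match's two captured groups.
    public static List<string[]> Pairs(string input)
    {
        var pairs = new List<string[]>();
        foreach (Match m in Regex.Matches(input, @"\((\w+),(\w+)\)"))
        {
            // m.Value (same as Groups[0]) is the whole "(x,y)" text; Groups[1] and Groups[2] are the words.
            pairs.Add(new[] { m.Groups[1].Value, m.Groups[2].Value });
        }
        return pairs;
    }

    public static void Main()
    {
        foreach (string[] pair in Pairs("(A,B)(C,D)"))
        {
            Console.WriteLine(pair[0] + " / " + pair[1]);
        }
        // Prints:
        // A / B
        // C / D
    }
}
```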
Edit:
Olivier Jacot-Descombes made a very good point: we all got so hung up explaining match vs. group that we forgot to notice a second problem: \w will only match a SINGLE character. You need to add a quantifier (such as +) in order to grab more than one character at a time. Olivier's answer should explain that part clearly.
First off, it's one "match", with 2 "groups"...
I would recommend you name the groups anyway...
string pattern = @"\((?<FirstWord>\w+),(?<SecondWord>\w+)\)";
Then you could do...
Match m = Regex.Match(line, pattern);
string firstWord = m.Groups["FirstWord"].Value;
Since all you are looking for are the characters separated by a comma, you can simply use \w as your pattern with Regex.Matches (or \w+ for words longer than one character). The matches will be A and B.
A handy site for testing your Regex is http://gskinner.com/RegExr/
