Limit regex expression by character in c# - c#

I get the following pattern (\s\w+) I need matches every words in my string with a space.
For example
When i have this string
many word in the textarea must be happy
I get
many
word
in
the
textarea
must
be
happy
It is correct, but when i have another character, for example
many word in the textarea , must be happy
I get
many
word
in
the
textarea
must
be
happy
But must be happy should be ignored, because i want it to break when another character is in the string
Edit:
Example 2
all cats { in } the world are nice
Should be return
all
cats
Because { is another separator for me
Example 3
My 3 cats are ... funny
Should be return
My
3
cats
are
Because 3 is alphanumeric and . is separator for me
What can I do?

To do that you need to use the \G anchors that matches the positions at the start of the string or after the last match. so you can do it with this pattern:
#"(?<=\G\s*)\w+"

[^\w\s\n].*$|(\w+\s+)
Try this.Grab the captures or matches.See demo.Set flag m for multiline mode.
See demo.
http://regex101.com/r/kP4pZ2/12

I think Sam I Am's comment is correct: you'll require two regular expressions.
Capture the text up to a non-word character.
Capture all the words with a space on one side.
Here's the corresponding code:
"^(\\w+\\s+)+"
"(\\w+\\s+)"
You can combine these two to capture just the individual words pretty easily - like so
"^(\\w+\\s+)+"
Here's a complete piece of code demonstrating the pattern:
string input = "many word in the textarea , must be happy";
string pattern = "^(\\w+\\s+)+";
Match match = Regex.Match(input , pattern);
// Never returns a NullReferenceException because of GroupsCollection array indexer - check it out!
foreach(Capture capture in match.Groups[1].Captures)
{
Console.WriteLine(capture.Value);
}
EDIT
Check out Casimir et Hippolyte for a really clean answer.

All in one regex :-) Result is in list
Regex regex = new Regex(#"^((\w+)\s*)+([^\w\s]|$).*");
Match m = regex.Match(inputString);
if(m.Success)
{
List<string> list =
m.Groups[2].Captures.Cast<Capture>().
Select(c=>c.Value).ToList();
}

Related

I want only matching string using regex

I have a string "myname 18-may 1234" and I want only "myname" from whole string using a regex.
I tried using the \b(^[a-zA-Z]*)\b regex and that gave me "myname" as a result.
But when the string changes to "1234 myname 18-may" the regex does not return "myname". Please suggest the correct way to select only "myname" whole word.
Is it also possible - given the string in
"1234 myname 18-may" format - to get myname only, not may?
UPDATE
Judging by your feedback to your other question you might need
(?<!\p{L})\p{L}+(?!\p{L})
ORIGINAL ANSWER
I have come up with a lighter regex that relies on the specific nature of your data (just a couple of words in the string, only one is whole word):
\b(?<!-)\p{L}+\b
See demo
Or even a more restrictive regex that finds a match only between (white)spaces and string start/end:
(?<=^|\s)\p{L}+(?=\s|$)
The following regex is context-dependent:
\p{L}+(?=\s+\d{1,2}-\p{L}{3}\b)
See demo
This will match only the word myname.
The regex means:
\p{L}+ - Match 1 or more Unicode letters...
(?=\s+\d{1,2}-\p{L}{3}\b) - until it finds 1 or more whitespaces (\s+) followed with 1 or 2 digits, followed with a hyphen and 3 Unicode letters (\p{L}{3}) which is a whole word (\b). This construction is a positive look-ahead that only checks if something can be found after the current position in the string, but it does not "consume" text.
Since the date may come before the string, you can add an alternation:
\p{L}+(?=[ ]+\d{1,2}-\p{L}{3}\b)|(?<=\d{1,2}-\p{L}{3}[ ]+)\p{L}+
See another demo
The (?<=\d{1,2}-\p{L}{3}\s+) is a look-behind that checks for the same thing (almost) as the look-ahead, but before the myname.
here is a solution without RegEx
string input = "myname 18-may 1234";
string result = input.Split(' ').Where(x => x.All(y => char.IsLetter(y))).FirstOrDefault();
Do a replace using this regex:
(\s*\d+\-.{3}\s*|\s*.{3}\-\d+\s*)|(\s*\d+\s*)
you will end up with just your name.
Demo

Regex in a string

I need some help on a problem.
In fact I search to check for an image type by the hexadecimal code.
string JpgHex = "FF-D8-FF-E0-xx-xx-4A-46-49-46-00";
Then I have a condition on
string.StartsWith(pngHex).
The problem is that the "x" characters presents in my "JpgHex" string can be whatever I want.
I think I need a regex to check that but I don't know how!!
Thanks a lot!
I'm not quite clear what exactly you want to do, but the dot '.' character represents any character in Regex.
So the regex "^FF-D8-FF-E0-..-..-4A-46-49-46-00" will probably do the trick. '^' = Start of input.
If you want to allow only hex chars you can use "^FF-D8-FF-E0-[0-9A-F]{2}-[0-9A-F]{2}-4A-46-49-46-00".
Like I said, I'd need a better idea of what pattern you need to match.
Here are some examples:
Regex rgx =
new Regex(#"^FF-D8-FF-E0-[a-zA-Z0-9]{2}-[a-zA-Z0-9]{2}-4A-46-49-46-00$");
rgx.IsMatch(pngHex); // is match will return a bool.
I use [a-zA-Z0-9]{2} to denote two instances of a character, caps or small or a number. So the above regex would match :
FF-D8-FF-E0-aa-zZ-4A-46-49-46-00
FF-D8-FF-E0-11-22-4A-46-49-46-00
.. etc
Based on your need change the regex accordingly so for capitals and numbers only you change to [A-Z0-9]. The {2} denotes two occurrences.
The ^ denotes the string should start with FF and $ means the string should end with 00.
Lets say you wanted to only match two numbers, so you would use \d{2}, the whole thing would look like this:
Regex rgx = new Regex(#"^FF-D8-FF-E0-\d{2}-\d{2}-4A-46-49-46-00$");
rgx.IsMatch(pngHex);
How do I know of these magical characters? Simple, there are docs everywhere. See this MSDN page for some basic regex patterns. This page shows some quantifiers, those are things like match one or more or match only one.
Cheat-sheets also come in handy.
A regex would help you; you can use the following tool to help you test and learn: -
http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
I recommend you have a play because then you'll learn!
To simply match any character in place of the x, the following should work: -
"^FF-D8-FF-E0-..-..-4A-46-49-46-00$"
In C#, it would be something like this: -
var test = "FF-D8-FF-E0-AB-CD-4A-46-49-46-00";
var foo = new Regex("^FF-D8-FF-E0-..-..-4A-46-49-46-00$");
if (foo.IsMatch(test))
{
// Do magic
}
You will need to read up on regular expressions to understand some of the characters that may not look familiar, i.e. ^ and $. See http://www.regular-expressions.info/

C# Regular expressions, retrieving two words separated by a comma, parenthesis operator

I've been playing around with retrieving data from a string using regular expression, mostly as an exercise for myself. The pattern that I'm trying to match looks like this:
"(SomeWord,OtherWord)"
After reading some documentation and looking at a cheat sheet I came to the conclusion that the following regex should give me 2 matches:
"\((\w),(\w)\)"
Because according to the documentation the parenthesis should do the following:
(pattern) Matches pattern and remembers the match. The matched
substring can be retrieved from the resulting Matches collection,
using Item [0]...[n]. To match parentheses characters ( ), use "\ (" or
"\ )".
However using the following code (removed error checking for conciseness) matches quite something different:
string line = "(A,B)";
string pattern = #"\((\w),(\w)\)";
MatchCollection matches = Regex.Matches(line, pattern);
string left = matches[0].Value;
string right = matches[1].Value;
Now I would expect left to become "A" and right to become "B". However left becomes "(A,B)" and there is no second match at all. What am I missing here?
(I know this example is trivial to solve without regexes but to learn how to properly use regexes I should be able to make something simple as this work)
You want the Groups member of the first match. In your example case there is only 1 match, which is the whole string. In the Groups collection you will have 3 items. Try this sample code, left should be A, and right should be B. If you look at the group[0] value it will be the whole string.
string line = "(A,B)";
string pattern = #"\((\w),(\w)\)";
MatchCollection matches = Regex.Matches(line, pattern);
GroupCollection groups = matches[0].Groups;
string left = groups[1].Value;
string right = groups[2].Value;
\w matches only one word character. If words have to contain at least one character, the expression should be:
string pattern = #"\((\w+),(\w+)\)";
if words may be empty:
string pattern = #"\((\w*),(\w*)\)";
+: means one or more repetitions.
*: means zero, one or more repetitions.
In any case, you will get one match with three groups, the first containing the whole string including the left and right parentheses, the two others the two words.
I think the problem is that you're confusing the concept of a match and a group.
A MatchCollection contains a list of strings that matched your entire regex, not just the parenthetical groups inside that Regex. For example, if the string you searched looked like this...
(A,B)(C,D)
...then you would have two matches: (A,B) and (C,D).
However, there's good news: you can get the groups from each match very easily, like so:
string line = "(A,B)";
string pattern = #"\((\w),(\w)\)";
MatchCollection matches = Regex.Matches(line, pattern);
string left = matches[0].Groups[1].Value;
string right = matches[0].Groups[2].Value;
That Groups variable is a collection of parenthetical groups from a single match.
Edit:
Olivier Jacot-Descombes made a very good point: we all got so hung up explaining match vs. group that we forgot to notice a second problem: \w will only match a SINGLE character. You need to add a quantifier (such as +) in order to grab more than one character at a time. Olivier's answer should explain that part clearly.
First off, it's one "match", with 2 "groups"...
I would recommend you name the groups anyway...
string pattern = #"\((?<FirstWord>\w+),(?<SecondWord>\w+)\)";
Then you could do...
Match m = Regex.Match(line, pattern);
string firstWord = m.Groups["FirstWord"].Value;
Since all you are looking for are the characters separated by a comma, you can simply use \w as your pattern. The matches will be A and B.
A handy site for testing your Regex is http://gskinner.com/RegExr/

Regex Word splitting in C#

I know similar questions have been asked before, but I can't find one that is like mine, or enough like mine to help me out :). So essentially I want to split up a string which contains a bunch of words, and I don't want to return any characters that are not words (this is the key problem I am struggling with, ignoring characters). This is how I define the problem:
What constitutes a word is a string of any character a-zA-Z only
(no numbers or anything else)
In between any word, there can be any number of random other characters
I want to get back a string[] containing only the words
eg: text: "apple^&**^orange1247pear"
I want to return: apple, orange, pear in an array.
The closest I have found I suppose is this:
Regex.Split("apple^orange7pear",#"([a-zA-Z]*)")
Which splits out the apple/orange/pear, but also returns a bunch of other junk and blank strings.
Anyone know how to stop the split function from returning certain parts of the string, or is that not possible?
Thanks in advance for any help you give me :)
Split should match the tokens between your words. In your regex you've added a group around the word, so it is included in the result, but that isn't desired in this case. Note that this regex matches anything besides valid words - anything that isn't an ASCII letter:
string[] words = Regex.Split(str, "[^a-zA-Z]+");
Another option is to match the words directly:
MatchCollection matches = Regex.Matches(str, "[a-zA-Z]+");
string[] words2 = matches.Cast<Match>().Select(m => m.Value).ToArray();
The second option is probably clearer, and will not include blank elements on the start or end of the array.
var splits = Regex.Split("aaa $$$bbb ccc", #"[^A-Za-z]+");
But to include non-latin letters, I would use this:
var splits = Regex.Split("aaa $$$bbb ccc", #"\P{L}+");
Try this:
Regex.Matches("kalle kula(/()&//()nisse8978971", #"[A-Za-z]+")
Using Matches() will collect only the words, Split() will divide the string which is not what you want.
The second option Kobi listed is better and easier to control. I use the following regular expression to locate common entities such as words, numbers, email addresses in a string it will.
var regex = new Regex(#"[\p{L}\p{N}\p{M}]+(?:[-.'ยด_#][\p{L}|\p{N}|\p{M}]+)*", RegexOptions.Compiled);

C# regex need characters after \player_n\

I need a regex pattern which will accommodate for the following.
I get a response from a UDP server, it's a very long string and each word is separated by \, for example:
\g79g97\g879o\wot87gord\player_0\name0\g6868o\g78og89\g79g79\player_1\name1\gyuvui\yivyil\player_2\name2\g7g87\g67og9o\v78v9i7
I need the strings after \player_n\, so in the above example I would need name0, name1 and name3,
I know this is the second regex question of the day but I have the book (Mastering Regular Expressions) on order! Thank you.
UPDATE. elusive's regex pattern will suffice, and I can add the match(0) to a textbox. However, what if I want to add all the matches to the text box ?
textBox1.Text += match.Captures[0].ToString(); //this works fine.
How do I add "all" match.captures to the text box? :s sorry for being so lame, this Regex class is brand new to me .
Try this one:
\\player_\d+\\([^\\]+)
i think that this test sample can help you
string inp = #"\g79g97\g879o\wot87gord\player_0\name0\g6868o\g78og89\g79g79\player_1\name1\gyuvui\yivyil\player_2\name2\g7g87\g67og9o\v78v9i7";
string rex = #"[\w]*[\\]player_[0-9]+[\\](?<name>[A-Za-z0-9]*)\b";
Regex re = new Regex(rex);
Match mat = re.Match(inp);
for (Match m = re.Match(inp); m.Success; m = m.NextMatch())
{
Console.WriteLine(m.Groups["name"]);
}
you can take the name of the player from the m.Groups["name"]
To get only the player name, you could use:
(?<=\\player_\d+\\)[^\\]+
This (?<=\\player_\d+\\) is something called a positive look-behind. It makes sure that the actual match [^\\]+ is preceded by the expression in the parentheses.
In this case, it's even specific to only a few regex engines (.NET being among them, luckily), in that it contains a variable length expression (due to \d+). Most regex engines only support fixed-length look-behind.
In any case, look-behind is not necessarily the best approach to this problem, match groups are simpler easier to read.

Categories