I want only matching string using regex - c#

I have a string "myname 18-may 1234" and I want only "myname" from whole string using a regex.
I tried using the \b(^[a-zA-Z]*)\b regex and that gave me "myname" as a result.
But when the string changes to "1234 myname 18-may" the regex does not return "myname". Please suggest the correct way to select only "myname" whole word.
Is it also possible, given the string in "1234 myname 18-may" format, to get myname only, not may?

UPDATE
Judging by your feedback to your other question you might need
(?<!\p{L})\p{L}+(?!\p{L})
ORIGINAL ANSWER
I have come up with a lighter regex that relies on the specific nature of your data (just a couple of words in the string, only one is whole word):
\b(?<!-)\p{L}+\b
See demo
Or even a more restrictive regex that finds a match only between (white)spaces and string start/end:
(?<=^|\s)\p{L}+(?=\s|$)
The following regex is context-dependent:
\p{L}+(?=\s+\d{1,2}-\p{L}{3}\b)
See demo
This will match only the word myname.
The regex means:
\p{L}+ - Match 1 or more Unicode letters...
(?=\s+\d{1,2}-\p{L}{3}\b) - until it finds 1 or more whitespace characters (\s+) followed by 1 or 2 digits, a hyphen, and 3 Unicode letters (\p{L}{3}) that end at a word boundary (\b). This construct is a positive look-ahead: it only checks whether something can be found after the current position in the string, and it does not "consume" text.
Since the date may come before the string, you can add an alternation:
\p{L}+(?=[ ]+\d{1,2}-\p{L}{3}\b)|(?<=\d{1,2}-\p{L}{3}[ ]+)\p{L}+
See another demo
The (?<=\d{1,2}-\p{L}{3}[ ]+) part is a look-behind that checks for (almost) the same thing as the look-ahead, but before myname.
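To see how the whitespace-delimited pattern and the date-context alternation above behave on different orderings, here is a minimal sketch. The class and method names (NameExtractor, ByWhitespace, ByDateContext) and the third test string are my own, purely for illustration; only the two patterns come from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameExtractor
{
    // A run of letters delimited by whitespace or the string boundaries.
    private static readonly Regex WholeWord =
        new Regex(@"(?<=^|\s)\p{L}+(?=\s|$)");

    // A run of letters directly before or directly after a d-mmm style date.
    private static readonly Regex NextToDate =
        new Regex(@"\p{L}+(?=[ ]+\d{1,2}-\p{L}{3}\b)|(?<=\d{1,2}-\p{L}{3}[ ]+)\p{L}+");

    public static string ByWhitespace(string input) => FirstOrEmpty(WholeWord, input);

    public static string ByDateContext(string input) => FirstOrEmpty(NextToDate, input);

    // Returns the first match, or an empty string when nothing matches.
    private static string FirstOrEmpty(Regex regex, string input)
    {
        Match m = regex.Match(input);
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        // The third string is a hypothetical ordering with the date first.
        foreach (string s in new[] { "myname 18-may 1234", "1234 myname 18-may", "18-may myname 1234" })
        {
            Console.WriteLine($"{s} -> {ByWhitespace(s)} / {ByDateContext(s)}");
        }
    }
}
```

Both approaches skip "may" because it is glued to the hyphen: the first rejects it for not being preceded by whitespace, the second because no date follows or precedes it.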

Here is a solution without regex:
// requires: using System.Linq;
string input = "myname 18-may 1234";
string result = input.Split(' ').Where(x => x.All(y => char.IsLetter(y))).FirstOrDefault();

Do a replace using this regex:
(\s*\d+\-.{3}\s*|\s*.{3}\-\d+\s*)|(\s*\d+\s*)
you will end up with just your name.
Demo
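A minimal sketch of the replace approach, assuming the input only ever contains the name, a d-mmm date and a number, as in the question. The class and method names (NoiseStripper, Strip) are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NoiseStripper
{
    // Matches a date-like token (digits-xxx or xxx-digits) or a bare number,
    // together with any surrounding whitespace.
    private static readonly Regex Noise =
        new Regex(@"(\s*\d+\-.{3}\s*|\s*.{3}\-\d+\s*)|(\s*\d+\s*)");

    // Removes every match, leaving whatever is not a date or a number.
    public static string Strip(string input) => Noise.Replace(input, string.Empty);

    public static void Main()
    {
        Console.WriteLine(Strip("myname 18-may 1234"));
        Console.WriteLine(Strip("1234 myname 18-may"));
    }
}
```

Since this removes what you do not want instead of matching what you do, anything unexpected in the input (a second word, a name containing digits) is left in or cut up, so it is only safe when the format is fixed.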

Related

Regex - find matches not contained within pattern

I would like to use a regular expression to match all occurrences of a phrase where it's not contained within some delimiting characters. I tried putting one together but had some difficulty with the negative lookaheads.
My search phrase is "my phrase". The start delimiter tag is [[ and the end delimiter tag is ]]. The string I'd like to search is:
Here is a sentence with my phrase, here's another part which I don't want to match on [[my phrase]]. I would like to find this occurrence of my phrase.
From this string I would expect to find all occurrences of "my phrase" except the one contained within [[ ]].
I hope that makes sense, thanks in advance for any guidance.
[^#]my phrase[^#]
I have knocked up a RegEx that will do what you ask, this can be seen here.
Literally just escaping out # as a character and allowing any other character to be returned. You can return the index of these results but remember to strip off the first and last character of the string.
Note: This will not pick up a "my phrase" at the very start or end of the string, because the pattern requires one character on each side of it
Edit - Seeing as you changed the scope while I was writing this answer,
here is the RegEx for the other delimiter:
[^[[]my phrase[^\]\]]
(?<=[^\[])my phrase(?=[^\]]*)
This will also eliminate the trailing punctuation marks.
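For comparison, here is a sketch using negative lookarounds, which is a different technique from the character-class patterns above. It assumes the delimiters sit directly around the phrase, as in [[my phrase]]; it would not skip a phrase buried deeper inside a delimited block. PhraseFinder and FindOutsideBrackets are made-up names.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PhraseFinder
{
    // Returns the start index of every occurrence of the phrase that is not
    // immediately preceded by "[[" and not immediately followed by "]]".
    public static int[] FindOutsideBrackets(string text, string phrase)
    {
        string pattern = @"(?<!\[\[)" + Regex.Escape(phrase) + @"(?!\]\])";
        return Regex.Matches(text, pattern)
                    .Cast<Match>()
                    .Select(m => m.Index)
                    .ToArray();
    }

    public static void Main()
    {
        string text = "Here is a sentence with my phrase, here's another part which I don't want to match on [[my phrase]]. I would like to find this occurrence of my phrase.";
        foreach (int index in FindOutsideBrackets(text, "my phrase"))
        {
            Console.WriteLine(index);
        }
    }
}
```

Because lookarounds do not consume characters, the match is exactly the phrase (no need to strip a leading and trailing character) and occurrences at the start or end of the string are still found.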

Extract string from a pattern preceded by any length

I'm looking for a regular expression to extract a string from a file name
eg if filename format is "anythingatallanylength_123_TESTNAME.docx", I'm interested in extracting "TESTNAME" ... probably fixed length of 8. (btw, 123 can be any three digit number)
I think I can use regex match ...
".*_[0-9][0-9][0-9]_[A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z].docx$"
However this matches the whole thing. How can I just get "TESTNAME"?
Thanks
Use parenthesis to match a specific piece of the whole regex.
You can also use the curly braces to specify counts of matching characters, and \d for [0-9].
In C#:
var myRegex = new Regex(@".*_\d{3}_([A-Za-z]{8})\.docx$");
Now "TESTNAME", or whatever your 8-letter piece is, will be found in Groups[1] of the resulting Match after using the regex.
Also note, there will be a performance overhead for look-ahead and look-behind, as presented in some other solutions.
You can use a look-behind and a look-ahead to check parts without matching them:
(?<=_[0-9]{3}_)[A-Z]{8}(?=\.docx$)
Note that this is case-sensitive, you may want to use other character classes and/or quantifiers to fit your exact pattern.
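A small sketch putting the two approaches side by side, the capturing group and the look-behind/look-ahead pair. The class and method names (FileNameParser, ViaGroup, ViaLookaround) are hypothetical, and returning an empty string on no match is my own choice.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FileNameParser
{
    // Capturing group: the whole pattern matches, but only group 1 is returned.
    public static string ViaGroup(string fileName)
    {
        Match m = Regex.Match(fileName, @".*_\d{3}_([A-Za-z]{8})\.docx$");
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    // Lookarounds: the surrounding parts are checked but never become part of the match.
    public static string ViaLookaround(string fileName)
    {
        Match m = Regex.Match(fileName, @"(?<=_[0-9]{3}_)[A-Z]{8}(?=\.docx$)");
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        string name = "anythingatallanylength_123_TESTNAME.docx";
        Console.WriteLine(ViaGroup(name));
        Console.WriteLine(ViaLookaround(name));
    }
}
```

The group version accepts lower-case letters as written, while the lookaround version is upper-case only; adjust the character class to whichever rule your file names follow.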
In your file name format "anythingatallanylength_123_TESTNAME.docx", the part you want is the run of letters between the last underscore _ and .docx. Because the match must consist of letters only and sit directly before .docx, earlier underscores do not produce a match. With that in mind, I came up with the following solution.
Regex: (?<=_)[A-Za-z]*(?=\.docx$)
Flags used:
g global search
m multi-line search.
Explanation:
(?<=_) checks if there is an underscore before the file name.
(?=\.docx$) checks for extension at the end.
[A-Za-z]* checks the required match.
Regex101 Demo
Thanks to @Lucero @noob @JamesFaix I came up with ...
@"(?<=.*_[0-9]{3}_)[A-Z]{8}(?=\.docx$)"
So a look-behind (in brackets, starting with ?<=) for anything (i.e. zero or more of any character, denoted by ".*"), followed by an underscore, followed by three digits, followed by an underscore. That's the end of the look-behind. Then comes the part I need to match (eight letters). Finally, the look-ahead (in brackets, starting with ?=), which is the .docx extension at the end of the string.
Nice work, fellas. Thunderbirds are go.

Limit regex expression by character in c#

I have the pattern (\s\w+), which I need to match every space-separated word in my string.
For example
When i have this string
many word in the textarea must be happy
I get
many
word
in
the
textarea
must
be
happy
It is correct, but when i have another character, for example
many word in the textarea , must be happy
I get
many
word
in
the
textarea
must
be
happy
But "must be happy" should be ignored, because I want matching to stop as soon as another character appears in the string
Edit:
Example 2
all cats { in } the world are nice
Should return
all
cats
Because { is another separator for me
Example 3
My 3 cats are ... funny
Should return
My
3
cats
are
Because 3 is alphanumeric and . is separator for me
What can I do?
To do that you need the \G anchor, which matches either at the start of the string or at the position where the previous match ended. So you can do it with this pattern:
@"(?<=\G\s*)\w+"
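Here is a minimal sketch of how the \G pattern could be used with Regex.Matches. The wrapper class and method (LeadingWords, Extract) are hypothetical; the pattern is the one from this answer, and the inputs are the three examples from the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LeadingWords
{
    // Each word must start where the previous match ended, allowing only
    // whitespace in between. The first other character breaks the chain,
    // so nothing after it can match.
    public static string[] Extract(string input) =>
        Regex.Matches(input, @"(?<=\G\s*)\w+")
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    public static void Main()
    {
        Console.WriteLine(string.Join("|", Extract("many word in the textarea , must be happy")));
        Console.WriteLine(string.Join("|", Extract("all cats { in } the world are nice")));
        Console.WriteLine(string.Join("|", Extract("My 3 cats are ... funny")));
    }
}
```

Note that this relies on .NET allowing a variable-length look-behind (the \s* inside it), which many other regex flavors do not support.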
[^\w\s\n].*$|(\w+\s+)
Try this. Grab the captures or matches, and set the m flag for multiline mode. See demo.
http://regex101.com/r/kP4pZ2/12
I think Sam I Am's comment is correct: you'll require two regular expressions.
Capture the text up to a non-word character.
Capture all the words with a space on one side.
Here's the corresponding code:
"^(\\w+\\s+)+"
"(\\w+\\s+)"
You can combine these two to capture just the individual words pretty easily - like so
"^(\\w+\\s+)+"
Here's a complete piece of code demonstrating the pattern:
string input = "many word in the textarea , must be happy";
string pattern = "^(\\w+\\s+)+";
Match match = Regex.Match(input , pattern);
// The GroupCollection indexer never throws here: a group that did not match is returned with Success == false
foreach (Capture capture in match.Groups[1].Captures)
{
    Console.WriteLine(capture.Value);
}
EDIT
Check out Casimir et Hippolyte for a really clean answer.
All in one regex :-) Result is in list
Regex regex = new Regex(@"^((\w+)\s*)+([^\w\s]|$).*");
Match m = regex.Match(inputString);
if (m.Success)
{
    List<string> list =
        m.Groups[2].Captures.Cast<Capture>()
            .Select(c => c.Value).ToList();
}

Tidy up a string

I'm looking for the best solution, performance-wise, to rebuild a string by removing words that are not complete words. An acceptable word in this instance is a whole word that contains no numbers and doesn't start with a forward slash or a backslash. So letters only, although hyphens and apostrophes are allowed.
For example:
String str ="\DR1234 this is a word, 123456, frank's place DA123 SW1 :50:/"
Using the above I'd need a new string that returns the following:
Str = "this is a word, frank's place"
I've done some research on Regex, but I can't find anything that would do what I need.
Final Code Snippet
var resultSet = Regex.Matches(item.ToLower(), @"(?:^|\s)(?![\\\/])(?!-+(?:\s|$))(?!'+(?:\s|$))(?!(?:[a-z'-]*?-){3,})(?!(?:[a-z'-]*?'){2,})[a-z'-]+[,.]?(?=\s|$)")
.Cast<Match>()
.Select(m => m.Value).ToArray();
Thanks for all your input guys - proves what a great site this is
Description
Based on your comments: A word in this instance is:
a whole word without numbers
doesn't start with a forward slash, or a back slash
just letters only
can include hyphen and apostrophes
The character class to cover all the word characters by your definition would be [a-z'-]+ and that group could be surrounded by whitespace, or the start/end of a string. Your sample also shows a comma, so I'm presuming a word may be followed by a comma or dot, as long as that is in turn followed by whitespace.
This regex will:
collect all substrings defined as words [a-z'-]+
allow a comma or dot after a word, but not inside or at the start of a word
reject substrings that consist entirely of hyphens
reject substrings that consist entirely of apostrophes
prevent words from having 3 or more hyphens
prevent words from having 2 or more apostrophes
(?:^|\s)(?![\\\/])(?!-+(?:\s|$))(?!'+(?:\s|$))(?!(?:[a-z'-]*?-){3,})(?!(?:[a-z'-]*?'){2,})[a-z'-]+[,.]?(?=\s|$)
Expanded explanation
(?:^|\s) match the start of the string or a white space. This eliminates the need to test for word boundary which is problematic for strings like "abdc-egfh"
(?![\\\/]) prevent the word from starting with a \ or /, however this is overkill as the character class doesn't allow it either
(?!-+(?:\s|$)) prevent strings which are all hyphens
(?!'+(?:\s|$)) prevent strings which are all apostrophes
(?!(?:[a-z'-]*?-){3,}) prevent strings which have 3 or more hyphens
(?!(?:[a-z'-]*?'){2,}) prevent strings which have 2 or more apostrophes
[a-z'-]+[,.]?(?=\s|$) match the word followed by some optional punctuation, and ensure this is followed by either a space or the end of a string
Examples
I'm not a C# programmer, but an array of matches returned from a code block like the one covered in the question "Return a array/list using regex", combined with this regular expression, will probably work for you. Note that this expression assumes you'll use the case-insensitive option.
Sample Text
\DR1234 - this is a word, 123456, frank's place DA123 SW1 :50:/ one-hyphen two-hyphens-here I-have-three-hyphens
Matches
[0] => this
[1] => is
[2] => a
[3] => word,
[4] => frank's
[5] => place
[6] => one-hyphen
[7] => two-hyphens-here
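Since the answer above stops short of C# code, here is a minimal sketch of how the expression might be applied to rebuild the string. It assumes RegexOptions.IgnoreCase in place of ToLower(), and trims each match because the leading (?:^|\s) consumes the whitespace before a word. WordFilter and Tidy are made-up names.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordFilter
{
    private static readonly Regex Word = new Regex(
        @"(?:^|\s)(?![\\\/])(?!-+(?:\s|$))(?!'+(?:\s|$))(?!(?:[a-z'-]*?-){3,})(?!(?:[a-z'-]*?'){2,})[a-z'-]+[,.]?(?=\s|$)",
        RegexOptions.IgnoreCase);

    // Keeps only the tokens the regex accepts and joins them with single spaces.
    public static string Tidy(string input) =>
        string.Join(" ", Word.Matches(input)
                             .Cast<Match>()
                             .Select(m => m.Value.Trim()));

    public static void Main()
    {
        string str = @"\DR1234 this is a word, 123456, frank's place DA123 SW1 :50:/";
        Console.WriteLine(Tidy(str));
    }
}
```

Keeping the Regex in a static readonly field avoids re-parsing the pattern on every call, which matters if, as the question says, performance is a concern.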
the regex: \b\w+\b will match words or if you're more picky, than \b[a-zA-Z]+\b won't include numbers or _s
http://rubular.com/r/uOVvPTb5nh
It looks like you want to allow 's and ,s, so the regex: \b[a-zA-Z,']+\b will do an okay job at that, but it will also let slip through any number of things that you might not want (such as
,','hello''',World)
or, in c#,
string str = @"\DR1234 this is a word, 123456, frank's place DA123 SW1 :50:/";
Regex r = new Regex(@"\b[a-zA-Z,']+\b");
string newStr = string.Join(" ", r.Matches(str).Cast<Match>().Select(m => m.Value).ToArray());
Regex.Match with the pattern [a-z\s,']+ is what you're looking for. So here is the code example:
string pattern = @"[a-z\s,']+";
string input = #"\DR1234 this is a word, 123456, frank's place DA123 SW1 :50:/";
Match match = Regex.Match(input, pattern);
while (match.Success){
Console.WriteLine(match.Value);
match = match.NextMatch();
}

Regex - Get all words that are not wrapped with a "/"

I'm really trying to learn regex so here it goes.
I would really like to get all words in a string which do not have a "/" on either side.
For example, I need to do this to:
"Hello Great /World/"
I need to have the results:
"Hello"
"Great"
Is this possible in regex? If so, how do I do it? I think I would like the results to be stored in a string array :)
Thank you
Just use this regular expression \b(?<!/)\w+(?!/)\b:
var str = "Hello Great /World/ /I/ am great too";
var words = Regex.Matches(str, #"\b(?<!/)\w+(?!/)\b")
.Cast<Match>()
.Select(m=>m.Value)
.ToArray();
This will get you:
Hello
Great
am
great
too
var newstr = Regex.Replace("Hello Great /World/", #"/(\w+?)/", "");
If you really want an array of strings
var words = Regex.Matches(newstr, #"\w+")
.Cast<Match>()
.Select(m => m.Value)
.ToArray();
I would first split the string into the array, then filter out matching words. This solution might also be cleaner than a big regexp, because you can spot the requirements for "word" and the filter better.
The big regexp solution would be something like word boundary - not a slash - many no-whitespaces - not a slash - word boundary.
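The split-then-filter idea can be sketched as follows. This assumes words are separated by spaces and that "wrapped" means a token starts or ends with a slash; both are my reading of the question, and the names (SlashFilter, UnwrappedWords) are hypothetical.

```csharp
using System;
using System.Linq;

public static class SlashFilter
{
    // Splits on spaces, then drops every token that has a slash on either side.
    public static string[] UnwrappedWords(string input) =>
        input.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
             .Where(w => !w.StartsWith("/", StringComparison.Ordinal)
                      && !w.EndsWith("/", StringComparison.Ordinal))
             .ToArray();

    public static void Main()
    {
        foreach (string word in UnwrappedWords("Hello Great /World/"))
        {
            Console.WriteLine(word);
        }
    }
}
```

The filter condition is a plain predicate, so changing the definition of a "word" (for example, stripping trailing punctuation first) does not require touching a regex at all.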
I would use a regex replace to replace all /[a-zA-Z]+/ with '' (nothing), then get all words
Try this one : (Click here for a demo)
(\s(?<!/)([A-Za-z]+)(?!/))|((?<!/)([A-Za-z]+)(?!/)\s)
Using this example excerpt:
The /character/ "_" (underscore/under-strike) can be /used/ in /variable/ names /in/ many /programming/ /languages/, while the /character/ "/" (slash/stroke/solidus) is typically not allowed.
...this expression matches any string of letters, numbers, underscores, or apostrophes (fairly typical idea of a "word" in English) that does not have a / character both before and after it - wrapped with a "/"
\b([\w']+)\b(?<=(?<!/)\1|\1(?!/))
...and is the purest form, using only one character class to define "word" characters. It matches the example as follows:
Matched        Not Matched
-------------  -------------
The            character
_              used
underscore     variable
under          in
strike         programming
can            languages
be             character
in             stroke
names
many
while
the
slash
solidus
is
typically
not
allowed
If excluding /stroke/ is not desired, then adding a bit to the end limitation will allow it, depending upon how you want to define the beginning of a "next" word:
\b([\w']+)\b(?<=(?<!/)\1|\1(?!/([^\w])))
This changes (?!/) to (?!/([^\w])), which allows /something/ if it has a letter, number, or underscore immediately after it. This would move stroke from the "Not Matched" to the "Matched" list, above.
note: \w matches uppercase or lowercase letters, numbers and the underscore character
If you want to alter your concept for "word" from the above, simply exchange the characters and shorthand character classes contained in the [\w'] part of the expression to something like [a-zA-Z'] to exclude digits or [\w'-] to include hyphens, which would capture under-strike as a single match, rather than two separate matches:
\b([\w'-]+)\b(?<=(?<!/)\1|\1(?!/([^\w])))
IMPORTANT ALTERNATIVE!!! (I think)
I just thought of an alternative to matching any words that are not wrapped with / symbols: simply consume all of these symbols and the words that are surrounded by them (splitting). This has a few benefits: no lookaround means this could be used in more contexts (older JavaScript engines do not support lookbehind, and some flavors of regex don't support lookaround at all) while increasing efficiency; also, using a split expression means a direct result of a String array:
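string input = "The /character/ \"_\" (underscore/under-strike) can be..."; //etc...
// Non-capturing groups: Regex.Split would otherwise add the captured text to the result array
string[] resultsArray = Regex.Split(input, @"(?:[^\w'-]+?(?:/[\w]+/)?)+");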
voila!
