In C#, I want to use a regular expression to match any of these words:
string keywords = "(shoes|shirt|pants)";
I want to find the whole words in the content string. I thought this regex would do that:
if (Regex.Match(content, keywords + "\\s+",
RegexOptions.Singleline | RegexOptions.IgnoreCase).Success)
{
//matched
}
but it returns true for words like participants, even though I only want the whole word pants.
How do I match only those literal words?
You should add the word delimiter to your regex:
\b(shoes|shirt|pants)\b
In code:
Regex.Match(content, @"\b(shoes|shirt|pants)\b");
Try
Regex.Match(content, @"\b" + keywords + @"\b", RegexOptions.Singleline | RegexOptions.IgnoreCase)
\b matches on word boundaries. See here for more details.
You need a zero-width assertion on each side that the characters before and after the word are not part of a word:
(?<=\W|^)(shoes|shirt|pants)(?=\W|$)
As others suggested, \b will work instead of (?<=\W|^) and (?=\W|$), including when the word is at the beginning or end of the input string.
Put a word boundary on each side of it using the \b metasequence.
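To make the difference concrete, here is a minimal sketch comparing the pattern from the question with the \b version. The helper names and sample sentences are made up for illustration; only the patterns come from the posts above.

```csharp
using System;
using System.Text.RegularExpressions;

// Keyword alternation from the question.
string keywords = "(shoes|shirt|pants)";

// Original attempt: keyword followed by whitespace, nothing guarding the left side.
bool MatchesLoosely(string content) =>
    Regex.IsMatch(content, keywords + @"\s+", RegexOptions.IgnoreCase);

// Whole-word version: \b on both sides of the alternation.
bool MatchesWholeWord(string content) =>
    Regex.IsMatch(content, @"\b" + keywords + @"\b", RegexOptions.IgnoreCase);

Console.WriteLine(MatchesLoosely("All participants arrived"));   // True (matches the "pants " inside "participants ")
Console.WriteLine(MatchesWholeWord("All participants arrived")); // False
Console.WriteLine(MatchesWholeWord("Clean PANTS, half price"));  // True
```

Note that \b also accepts punctuation after the word (the comma above), which the \s+ version would miss.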
Related
I have a regex that detects URLs:
@"((http|ftp|https)\:\/\/)?([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?";
I am using it with Regex.Replace to remove URLs from text.
I do not want it to replace any word that starts with /images.
For example, if the text is "this is my text here is a link http://dfdf.com and my is /images/dd.gif",
I need http://dfdf.com to be replaced, but not /images/dd.gif.
My regex replaces the dd.gif part,
so I want to exclude anything that comes right after images/.
Any idea how I can fix this?
You may start matching after a word boundary, and fail the match if it is immediately preceded with a whole "word" images/ using
\b(?<!\bimages/)(?:(?:http|ftp)s?://)?([\w-]+(?:\.[\w-]+)+)([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?
See the regex demo. Details:
\b - a word boundary
(?<!\bimages/) - no images/ as a whole word is allowed immediately on the left
(?:(?:http|ftp)s?://)? - an optional sequence of either http or ftp followed with an optional s and then :// substring
([\w-]+(?:\.[\w-]+)+) - Group 1: one or more word or hyphen chars followed with one or more sequences of a . and then one or more word or hyphen chars
([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])? - an optional Group 2: zero or more word chars or chars from the .,@?^=%&:/~+#- set and then a word char or a char from the @?^=%&/~+#- set.
As an alternative solution, you could match what you don't want to remove and capture what you do want to remove.
You can use a callback with Replace and test for the existence of group 1. If it is there, return an empty string. If it is not there, return the match to leave it unchanged.
\S*/images\S*|(?<!\S)((?:(?:https?|ftp)://)?[\w-]+(?:(?:\.[\w-]+)+)(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?)
Explanation
\S*/images\S* Match /images preceded and followed by optional non-whitespace chars that you want to keep
| Or
(?<!\S) Assert a whitespace boundary to the left
((?:(?:https?|ftp)://)?[\w-]+(?:(?:\.[\w-]+)+)(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?) The pattern that you tried with some minor changes to make it a bit shorter
Regex demo (Click on the Table tab to see the matches)
For example
var s = @"this is my text here is a link http://dfdf.com and my is /images/dd.gif";
var regex = new Regex(@"\S*/images\S*|(?<!\S)((?:(?:https?|ftp)://)?[\w-]+(?:(?:\.[\w-]+)+)(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?)");
var result = regex.Replace(s, match => match.Groups[1].Success ? "" : match.Value);
Console.WriteLine(result);
See a C# demo
I have a Grid filled with Tamil words and a search string. I need to implement a full-word search through the Grid records. I'm using .NET Regex class for that approach. It sounds pretty simple, what I used to do is:
string pattern = @"\b" + searchText + @"\b";
It works as expected for Latin-script languages, but for Tamil this expression returns strange results. I have read about Unicode characters in regular expressions, but that doesn't seem to help me much. What I probably need is to determine where the word boundary is found and why.
As an example:
For the "\bஅம்மா\b" pattern Regex found matches in
அம்மாவிடம் and அம்மாக்கள் records but not in the original அம்மா record.
The last char in "அம்மா" word is 0BBE TAMIL VOWEL SIGN AA and it is a combining mark (in regex, it can be matched with \p{M}).
As \b only matches between the start/end of the string and a word char, or between a word char and a non-word char, it won't match between this combining mark, which \w does not treat as a word char, and a following non-word char or the end of the string.
Use the usual workaround in this case:
var pattern = $@"(?<!\w){searchText}(?!\w)";
See this regex demo.
Here, (?<!\w) fails the match if there is a word char before searchText and (?!\w) fails the match if there is a word char after the text to find. Note you may also use Regex.Escape(searchText) if the text can contain special regex chars.
Or, if you want to avoid matching when inside base letters/diacritics, use
var pattern = $@"(?<![\p{{L}}\p{{M}}]){searchText}(?![\p{{L}}\p{{M}}])";
See this regex demo.
The (?<![\p{L}\p{M}]) and (?![\p{L}\p{M}]) lookarounds work like the ones above, except that they fail the match if there is a letter or a combining mark on either side of the search phrase.
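As a rough sketch of that last variant applied to the records from the question: the helper name ContainsWholeWord is made up here, and wrapping the search text in Regex.Escape is an extra precaution on top of the pattern shown above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: whole-word search that treats letters and combining
// marks on either side as "still inside a word".
bool ContainsWholeWord(string record, string searchText)
{
    var pattern = $@"(?<![\p{{L}}\p{{M}}]){Regex.Escape(searchText)}(?![\p{{L}}\p{{M}}])";
    return Regex.IsMatch(record, pattern);
}

var search = "அம்மா";
foreach (var record in new[] { "அம்மா", "அம்மாவிடம்", "அம்மாக்கள்" })
{
    // Only the first record is the bare word; the other two continue with a letter.
    Console.WriteLine($"{record}: {ContainsWholeWord(record, search)}");
}
```

The first record should report True and the two longer words False, since each of them has a letter right after the search text.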
I'm still learning how to write a regex, but this one I can't solve on my own.
I have a string that contains a word looking like this: ##companyname##
I have tried the following, but it doesn't work:
content = Regex.Replace(content, @"\b##companyname##\b", setup.Company, RegexOptions.IgnoreCase);
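\b matches a word boundary, and there is no word boundary between a # and another non-word character (or the edge of the string), so the pattern does not match.
Use \B instead to match a non-word boundary.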
content = Regex.Replace(content, @"\B##companyname##\B", setup.Company, RegexOptions.IgnoreCase);
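That is because \b matches only at a word boundary position, that is, between a word character and a non-word character, or at the beginning or end of the string next to a word character.
But your pattern itself starts and ends with #, which is not a word character. Do this:
"##companyname##"
In the original text, the positions around ##companyname## were not word boundaries.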
Problem is with the meaning of the \b specifier:
Specifies that the match must occur on a boundary between \w (alphanumeric) and \W (nonalphanumeric) characters. The match must occur on word boundaries — that is, at the first or last characters in words separated by any nonalphanumeric characters.
In your case it is not a real boundary between words, because # (as well as < and >) is not a word character.
In my opinion, simply replacing ##companyname## will be enough.
Difference between \b and \B in regex
\b matches the empty string at the beginning or end of a word. \B matches the empty string not at the beginning or end of a word.
content = Regex.Replace(content,
@"\B##companyname##\B",
setup.Company,
RegexOptions.IgnoreCase
);
You can test this regex \B##companyname##\B here - http://regexr.com/38p8i
P.S: Started learning regex today :)
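For anyone comparing the three suggestions, here is a small sketch that runs them side by side. The sample strings and the "Acme" value (standing in for setup.Company) are invented for the demo.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample data; "Acme" stands in for setup.Company.
var content = "Welcome to ##CompanyName##!";
var glued = "id##companyname##";
var company = "Acme";

// \b needs a word character on exactly one side, so nothing is replaced here.
var withWordBoundary = Regex.Replace(content, @"\b##companyname##\b", company, RegexOptions.IgnoreCase);

// \B succeeds because the characters on both sides of each position are non-word characters.
var withNonBoundary = Regex.Replace(content, @"\B##companyname##\B", company, RegexOptions.IgnoreCase);

// No anchors at all: the placeholder is replaced wherever it appears.
var plain = Regex.Replace(content, "##companyname##", company, RegexOptions.IgnoreCase);

// When a word character touches the placeholder, \B fails but the plain pattern still works.
var gluedNonBoundary = Regex.Replace(glued, @"\B##companyname##\B", company, RegexOptions.IgnoreCase);
var gluedPlain = Regex.Replace(glued, "##companyname##", company, RegexOptions.IgnoreCase);

Console.WriteLine(withWordBoundary);  // Welcome to ##CompanyName##!
Console.WriteLine(withNonBoundary);   // Welcome to Acme!
Console.WriteLine(plain);             // Welcome to Acme!
Console.WriteLine(gluedNonBoundary);  // id##companyname##
Console.WriteLine(gluedPlain);        // idAcme
```

So \B fixes the case from the question, but the unanchored pattern is the one that does not depend on what surrounds the placeholder.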
I want to find the word in a sentence that contains my search word, that is, everything from the first space before the search word to the first space after it.
"This is anexampleof what I want" should return "anexampleof" if my search word is "example".
I now have this regex "(?:^|\S*\s*)\S*" + searchword + "\S*(?:$|\s*\S*)" but this gives me an extra word in the beginning and the end.
'This is anexampleof what I want' --> returns 'is anexampleof what'
I tried to change the regex, but I'm not good at it at all.
I'm using C#. Thanks for the help.
Full C# code:
MatchCollection m1 = Regex.Matches(content, @"(?:^|\S*\s*)\S*" + searchword + @"\S*(?:$|\s*\S*)",
RegexOptions.IgnoreCase | RegexOptions.Multiline);
You can simply leave out the non-capturing groups at the end:
@"\S*" + searchword + @"\S*";
Due to greediness you will get as many non-space characters on each side as possible.
Also, non-capturing groups do not exclude their content from the match. All they do is avoid producing captures of sub-matches. If you want to check that something is there, but don't want to include it in the match, you want lookarounds:
@"(?<=^|\S*\s*)\S*" + searchword + @"\S*(?=$|\s*\S*)"
However, these lookarounds don't really do anything in this case, because \s*\S* is satisfied with an empty string (the * makes both \s and \S optional). But just for further reference: if you want to make assertions at the boundary of your match that should not be part of the match, lookarounds are the way to go.
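A quick sketch of the simplified pattern on the sentence from the question. Regex.Escape is an addition not in the original answer; it only matters if the search word can contain regex metacharacters.

```csharp
using System;
using System.Text.RegularExpressions;

var content = "This is anexampleof what I want";
var searchword = "example";

// Greedy \S* on each side grabs every non-space character around the search word.
var pattern = @"\S*" + Regex.Escape(searchword) + @"\S*";

foreach (Match m in Regex.Matches(content, pattern, RegexOptions.IgnoreCase))
{
    Console.WriteLine(m.Value); // anexampleof
}
```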
I got the following regex call -
MatchCollection matches = Regex.Matches(text, @"( And )|( Or )|( Not )");
I got a problem with a string like this - " And Or Not "
Only "And" will be matched but "Or" and "Not" will not be, just because they are not the first word.
The reason, as far as I understand, is that the first match is " And " including the trailing white space. Because that space was already part of the first match, the Regex does not consider it as the leading white space of the next match.
So if for example this was my string instead - " And  Or  Not " (two spaces between the words) - every word would have been matched.
Is there a way to somehow instruct the Regex to share the matched white-spaces between the matches?
Thanks!
Instead of looking explicitly for whitespace, you should look for word boundaries.
Just had to go look it up, but apparently, \b would be what you're looking for, e.g.:
@"(\bAnd\b)|(\bOr\b)|(\bNot\b)"
(Or, as @stema points out):
@"\b(And|Or|Not)\b"
The problem is that once you have matched a whitespace character, the regex continues after the end of the last match, so that whitespace is, so to say, "gone": it has already been consumed.
What you can do is to use a lookahead, like this:
MatchCollection matches = Regex.Matches(s, @" (?:And|Or|Not)(?= )");
The lookahead does not consume the space; it only checks that a space follows. The expression will not match if there is no space.
But the result in your MatchCollection will not have this space at the end!
I would simplify the expression a little and use a look-ahead assertion (match something but don't make it part of the capture):
string text = " And Or Not ";
foreach (Match m in Regex.Matches(text, @"\s(And|Or|Not)(?=\s)")) {
Console.WriteLine(m.Value);
}
(note: I'm using \s instead of spaces)
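To see the three approaches side by side, here is a small sketch on a string with single spaces between the words. The variable names are made up for the demo; the patterns are the ones from the question and the answers.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var text = " And Or Not ";

// Original pattern: each match consumes its trailing space,
// so " Or " has no leading space left to match against.
var consuming = Regex.Matches(text, @"( And )|( Or )|( Not )")
    .Cast<Match>().Select(m => m.Value).ToArray();

// Lookahead: the trailing whitespace is checked but not consumed.
var lookahead = Regex.Matches(text, @"\s(And|Or|Not)(?=\s)")
    .Cast<Match>().Select(m => m.Groups[1].Value).ToArray();

// Word boundaries: no whitespace is consumed at all.
var boundaries = Regex.Matches(text, @"\b(And|Or|Not)\b")
    .Cast<Match>().Select(m => m.Value).ToArray();

Console.WriteLine(consuming.Length);             // 2
Console.WriteLine(string.Join(",", lookahead));  // And,Or,Not
Console.WriteLine(string.Join(",", boundaries)); // And,Or,Not
```

With the lookahead version, m.Value still contains the leading whitespace, which is why the sketch reads Groups[1] instead.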