Regular expressions \b but with not just alphanumeric characters in C#

I want the same functionality as \b but with other characters.
In C#, I want to have something like
string str = "\\b" + Regex.Escape(string) + "\\b";
However, my search strings can contain non-word characters, so, for example, Regex.Escape("#(Something")
should be found in the string Typing #(Something to you.

The problem you experience is related to the fact that \b word boundary is context dependent, and \b\(abc\b will match (abc in x(abc) but not in :(abc) (\b\( means there must be a word char before ().
To match any string that is not enclosed with word chars use
var pattern = $@"(?<!\w){Regex.Escape(string)}(?!\w)";
See the regex demo.
Here, (?<!\w) is a negative lookbehind that will make sure there is no word char immediately to the left of the current location, and (?!\w) negative lookahead will make sure there is no word char immediately to the right of the current location.
Other custom "word" boundaries:
Whitespace word boundary: var pattern = $@"(?<!\S){Regex.Escape(string)}(?!\S)"; // Match when enclosed with whitespaces
Word and symbol boundary (if you do not want to find c in c++): var pattern = $@"(?<![\w\p{{S}}]){Regex.Escape(string)}(?![\w\p{{S}}])"; // braces are doubled because this is an interpolated string
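A minimal sketch of the first lookaround pattern above applied to the string from the question. The class and method names (CustomBoundaryDemo, NotInsideWord) are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CustomBoundaryDemo
{
    // Hypothetical helper: wraps the escaped literal in lookarounds that
    // forbid a word character directly before and directly after the match.
    public static string NotInsideWord(string literal) =>
        $@"(?<!\w){Regex.Escape(literal)}(?!\w)";

    public static void Main()
    {
        string pattern = NotInsideWord("#(Something");

        // Surrounded by spaces: found.
        Console.WriteLine(Regex.IsMatch("Typing #(Something to you.", pattern));   // True
        // Followed by a word character ("s"): rejected by (?!\w).
        Console.WriteLine(Regex.IsMatch("Typing #(Somethings to you.", pattern));  // False
        // Preceded by a word character ("x"): rejected by (?<!\w).
        Console.WriteLine(Regex.IsMatch("Typing x#(Something to you.", pattern));  // False
    }
}
```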

For this you'd need a conditional word boundary at each end.
It just guards the beginning and end of the string: if the string starts or ends with a word character, that end must be at a word boundary.
If it does not, nothing is required at that end, as it should be.
(?(?= \w )
\b
)
(?: #\(Something )
(?(?<= \w )
\b
)
So, it ends up looking like
string str = "(?(?=\\w)\\b)" + Regex.Escape(string) + "(?(?<=\\w)\\b)";
Regexstorm.net demo
This takes the guesswork out of it.
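As a rough illustration of how the conditional boundaries behave, here is a small sketch. The names ConditionalBoundaryDemo and Build are hypothetical. The last case shows where this approach differs from plain lookarounds: because the search string starts with a non-word character, nothing is required on its left.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConditionalBoundaryDemo
{
    // Hypothetical helper: a word boundary is required at an end only if
    // the literal has a word character at that end.
    public static string Build(string literal) =>
        @"(?(?=\w)\b)" + Regex.Escape(literal) + @"(?(?<=\w)\b)";

    public static void Main()
    {
        string pattern = Build("#(Something");

        Console.WriteLine(Regex.IsMatch("Typing #(Something to you.", pattern));   // True
        // The literal ends with a word char, so \b is required after it.
        Console.WriteLine(Regex.IsMatch("Typing #(Somethings to you.", pattern));  // False
        // The literal starts with '#', so no boundary is required before it.
        Console.WriteLine(Regex.IsMatch("Typing x#(Something to you.", pattern));  // True
    }
}
```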

Related

Alternate regex with -SDR?

I have the following regex in my c#:
(?<!\w)M20A\w+
Actual code:
string regex = $@"(?<!\w){prefix}\w+";
Notice the prefix var matches strings such as M20A and X50G.
It perfectly matches the following cases:
M20A0820
M20A1234
M20A7U8V
But now I got a new requirement from the business to match, for example:
M20A-SDR
It will be the prefix followed by the exact string "-SDR". Not just a dash followed by 3 alphanumerics, but literally "-SDR". The existing matches need to still work, but prefix + "-SDR" must also be matched.
What would be the regex that would match the following:
M20A0820
M20A1234
M20A7U8V
M20A-SDR
You may use
string regex = $@"(?<!\w){prefix}\w*(?:-SDR)?";
See the regex demo.
Or, to match as a whole word, you may use word boundaries:
string regex = $@"\b{prefix}\w*(?:-SDR)?\b";
See this regex demo
The \b word boundary at the start will work if all the values in prefix start with a word char, a letter, digit or _. The word boundary at the end will make sense if after -SDR, there can be no more word chars.
The (?:-SDR)? will match a -SDR string optionally.
Details
\b - word boundary
M20A - a literal string
\w* - 0+ word chars
(?:-SDR)? - a non-capturing group that matches 1 or 0 times (as there is a ? after it) an -SDR substring
\b - a word boundary.
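The breakdown above can be checked with a short sketch. SdrDemo and FirstMatch are illustrative names, and the extra inputs (M20A-SDX, XM20A1234) are assumptions added to show the edges of the pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SdrDemo
{
    // Returns the first match of the word-boundary variant, or "" if none.
    // Assumes prefix contains no regex metacharacters (as with M20A, X50G).
    public static string FirstMatch(string input, string prefix)
    {
        string pattern = $@"\b{prefix}\w*(?:-SDR)?\b";
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        Console.WriteLine(FirstMatch("M20A0820", "M20A"));   // M20A0820
        Console.WriteLine(FirstMatch("M20A-SDR", "M20A"));   // M20A-SDR
        // \w* may match nothing, so only the bare prefix is matched here.
        Console.WriteLine(FirstMatch("M20A-SDX", "M20A"));   // M20A
        // No word boundary before M, so nothing matches.
        Console.WriteLine(FirstMatch("XM20A1234", "M20A") == ""); // True
    }
}
```

Since \w* can match zero characters, this pattern also accepts the prefix on its own. If that is not wanted, one possible variation (not from the answer above) is to replace \w*(?:-SDR)? with (?:\w+|-SDR).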

Remove everything that doesn't match

string line = "Rok rok irrelevant text irrelevant;text.irrelevant,text";
string NewLine = Regex.Replace(line, @"\b[rR]\w*", "");
Right now it replaces every word starting with r/R with a blank space, but I want to make everything a blank space EXCEPT words starting with r/R.
Edit
It seems all you want is to extract words starting with r or R and join them with a space. In this case, use a mere \b[rR]\w* regex and the following code:
var result = string.Join(" ", Regex.Matches(line, @"\b[rR]\w*").Cast<Match>().Select(x => x.Value));
See the C# demo.
Original answer
You may use a negative lookahead after a word boundary:
\b(?![rR])\w+
^^^^^^^^
Note that the + quantifier is better here since you want to remove at least 1 char found.
Or, in case you also want to remove all non-word chars after the found word, use
\b(?![rR])\w+\W*
See the regex demo #1 and regex demo #2.
If you want to remove any non-word chars before and after a qualifying word, use
var result = Regex.Replace(line, @"\W*\b(?![rR])\w+\W*", " ").Trim();
It will remove all non-word chars before a word not starting with r and R and after it.
Details
\b - a word boundary
(?![rR]) - a negative lookahead that will fail the match if, immediately to the right of the current location, there is r or R
\w+ - 1+ word chars
\W* - 0+ non-word chars.
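A small sketch comparing the two approaches from this answer on the sample line from the question. The class and method names are invented for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeepRWordsDemo
{
    // Approach from the edit: extract the r/R words and join them.
    public static string ExtractRWords(string line) =>
        string.Join(" ", Regex.Matches(line, @"\b[rR]\w*")
                              .Cast<Match>()
                              .Select(m => m.Value));

    // Approach from the original answer: replace every other word,
    // together with the non-word chars around it, with a single space.
    public static string RemoveOtherWords(string line) =>
        Regex.Replace(line, @"\W*\b(?![rR])\w+\W*", " ").Trim();

    public static void Main()
    {
        string line = "Rok rok irrelevant text irrelevant;text.irrelevant,text";
        Console.WriteLine(ExtractRWords(line));    // Rok rok
        Console.WriteLine(RemoveOtherWords(line)); // Rok rok
    }
}
```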

How to insert spaces between characters using Regex?

Trying to learn a little more about using Regex (Regular expressions). Using Microsoft's version of Regex in C# (VS 2010), how could I take a simple string like:
"Hello"
and change it to
"H e l l o"
This could be a string of any letter or symbol, capitals, lowercase, etc., and there are no other letters or symbols following or leading this word. (The string consists of only the one word).
(I have read the other posts, but I can't seem to grasp Regex. Please be kind :) ).
Thanks for any help with this. (an explanation would be most useful).
You could do this through regex only, no need for inbuilt c# functions.
Use the below regexes and then replace the matched boundaries with space.
(?<=.)(?!$)
DEMO
string result = Regex.Replace(yourString, @"(?<=.)(?!$)", " ");
Explanation:
(?<=.) Positive lookbehind asserts that the match must be preceded by a character.
(?!$) Negative lookahead which asserts that the match won't be followed by an end of the line anchor. So the boundaries next to all the characters would be matched but not the one which was next to the last character.
OR
You could also use word boundaries.
(?<!^)(\B|\b)(?!$)
DEMO
string result = Regex.Replace(yourString, @"(?<!^)(\B|\b)(?!$)", " ");
Explanation:
(?<!^) Negative lookbehind which asserts that the match won't be at the start.
(\B|\b) Matches the position between two word characters or between two non-word characters (\B), or the position between a word character and a non-word character (\b). Together they match at every position.
(?!$) Negative lookahead asserts that the match won't be followed by an end of the line anchor.
Regex.Replace("Hello", "(.)", "$1 ").TrimEnd();
Explanation
The dot character class matches every character of your string "Hello".
The parentheses around the dot are required so that we can refer to the captured character through the $n notation.
Each captured character is replaced by the replacement string. Our replacement string is "$1 " (notice the space at the end). Here $1 represents the first captured group in the input, therefore our replacement string will replace each character by that character plus one space.
This technique will add one space after the final character "o" as well, so we call TrimEnd() to remove that.
A demo can be seen here.
For the enthusiast, the same effect can be achieved through LINQ using this one-liner:
String.Join(" ", YourString.AsEnumerable())
or if you don't want to use the extension method:
String.Join(" ", YourString.ToCharArray())
It's very simple. To match any character, use the . (dot) and then replace it with that character followed by one extra space.
Here the parentheses (...) are used for grouping, and the group can be accessed by $index.
Find what : "(.)"
Replace with "$1 "
DEMO
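The three regex approaches discussed in these answers can be put side by side in a short sketch. SpaceOutDemo and its method names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpaceOutDemo
{
    // Insert a space at every position that has a character before it
    // and is not the end of the string.
    public static string WithLookarounds(string s) =>
        Regex.Replace(s, @"(?<=.)(?!$)", " ");

    // Same idea, phrased with \B|\b (which together match every position).
    public static string WithBoundaries(string s) =>
        Regex.Replace(s, @"(?<!^)(\B|\b)(?!$)", " ");

    // Append a space to every character, then trim the trailing one.
    public static string WithCapture(string s) =>
        Regex.Replace(s, "(.)", "$1 ").TrimEnd();

    public static void Main()
    {
        Console.WriteLine(WithLookarounds("Hello")); // H e l l o
        Console.WriteLine(WithBoundaries("Hello"));  // H e l l o
        Console.WriteLine(WithCapture("Hello"));     // H e l l o
    }
}
```

The first two variants only insert text, while the third also strips any whitespace that was already at the end of the input, which may or may not matter for a given use.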

C# regex with a certain word and hashtags

I'm still learning how to write a regex, but this I can't solve on my own.
I have a string that contains a word looking like this : ##companyname##
I have tried the following, but it doesn't work
content = Regex.Replace(content, @"\b##companyname##\b", setup.Company, RegexOptions.IgnoreCase);
\b matches a word boundary, so it won't match # character.
Use \B instead to match a non-word boundary.
content = Regex.Replace(content, @"\B##companyname##\B", setup.Company, RegexOptions.IgnoreCase);
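That is because \b only matches at a word boundary position, that is, between a word character and a non-word character (such as whitespace) or at the beginning or end of the string next to a word character.
But your regex itself starts and ends with #, which is not a word character. Do this:
"##companyname##"
The positions around the placeholder in the original regex were not word boundaries.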
Problem is with the meaning of the \b specifier:
Specifies that the match must occur on a boundary between \w (alphanumeric) and \W (nonalphanumeric) characters. The match must occur on word boundaries — that is, at the first or last characters in words separated by any nonalphanumeric characters.
In your case there is no real boundary between words, because #, < and > are all non-word characters.
In my opinion, simply replacing ##companyname## will be enough.
Difference between \b and \B in regex
\b matches the empty string at the beginning or end of a word. \B matches the empty string not at the beginning or end of a word.
content = Regex.Replace(content,
    @"\B##companyname##\B",
    setup.Company,
    RegexOptions.IgnoreCase
);
You can test this regex \B##companyname##\B here - http://regexr.com/38p8i
P.S: Started learning regex today :)
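To make the difference between the \B version and the plain literal replacement concrete, here is a small sketch. The names PlaceholderDemo, Fill and FillLiteral, the company name Acme and the sample inputs are all assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PlaceholderDemo
{
    // Replaces the placeholder only where a non-word char (or the string
    // start/end) sits on both sides of it.
    public static string Fill(string content, string company) =>
        Regex.Replace(content, @"\B##companyname##\B", company, RegexOptions.IgnoreCase);

    // Replaces the placeholder wherever it occurs.
    public static string FillLiteral(string content, string company) =>
        Regex.Replace(content, "##companyname##", company, RegexOptions.IgnoreCase);

    public static void Main()
    {
        Console.WriteLine(Fill("Dear ##CompanyName##, hello", "Acme"));     // Dear Acme, hello
        // Word chars touch the placeholder, so \B fails and nothing changes.
        Console.WriteLine(Fill("id=x##companyname##y", "Acme"));            // id=x##companyname##y
        Console.WriteLine(FillLiteral("id=x##companyname##y", "Acme"));     // id=xAcmey
    }
}
```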

Regex word boundary expressions

Say for example I have the following string "one two(three) (three) four five" and I want to replace "(three)" with "(four)" but not within words. How would I do it?
Basically I want to do a regex replace and end up with the following string:
"one two(three) (four) four five"
I have tried the following regex but it doesn't work:
@"\b\(three\)\b"
Basically I am writing some search and replace code and am giving the user the usual options to match case, match whole word etc. In this instance the user has chosen to match whole words but I don't know what the text being searched for will be.
Your problem stems from a misunderstanding of what \b actually means. Admittedly, it is not obvious.
The reason \b\(three\)\b doesn’t match the threes in your input string is the following:
\b means: the boundary between a word character and a non-word character.
Letters (e.g. a-z) are considered word characters.
Punctuation marks such as ( are considered non-word characters.
Here is your input string again, stretched out a bit, and I’ve marked the places where \b matches:
 o n e   t w o ( t h r e e )   ( t h r e e )   f o u r   f i v e
↑     ↑ ↑     ↑ ↑         ↑     ↑         ↑   ↑       ↑ ↑       ↑
As you can see here, there is a \b between “two” and “(three)”, but not before the second “(three)”.
The moral of the story? “Whole-word search” doesn’t really make much sense if what you’re searching for is not just a word (a string of letters). Since you have punctuation characters (parentheses) in your search string, it is not as such a “word”. If you searched for a word consisting only of word characters, then \b would do what you expect.
You can, of course, use a different Regex to match the string only if it is surrounded by spaces or occurs at the beginning or end of the string:
(^|\s)\(three\)(\s|$)
However, the problem with this is, of course, that if you search for “three” (without the parentheses), it won’t find the one in “(three)” because it doesn’t have spaces around it, even though it is actually a whole word.
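A sketch of the whitespace-delimited pattern applied to the string from the question, including the limitation just described. ReplaceDelimited is a hypothetical helper; it uses a match evaluator so that the whitespace captured on either side is put back.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceDelimitedDemo
{
    // Replaces 'find' only where it is delimited by whitespace or by the
    // start/end of the string, keeping the captured delimiters.
    public static string ReplaceDelimited(string input, string find, string replacement)
    {
        string pattern = @"(^|\s)" + Regex.Escape(find) + @"(\s|$)";
        return Regex.Replace(input, pattern,
            m => m.Groups[1].Value + replacement + m.Groups[2].Value);
    }

    public static void Main()
    {
        // Only the stand-alone "(three)" is replaced.
        Console.WriteLine(ReplaceDelimited("one two(three) (three) four five", "(three)", "(four)"));
        // prints: one two(three) (four) four five

        // The limitation: "three" inside parentheses is not found.
        Console.WriteLine(ReplaceDelimited("one (three) four", "three", "four"));
        // prints: one (three) four
    }
}
```

Because the delimiters are consumed by the match, two occurrences separated by a single space would not both be replaced; lookarounds such as (?<!\S) and (?!\S) avoid that.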
I think most text editors (including Visual Studio) will use \b only if your search string actually starts and/or ends with a word character:
var pattern = Regex.Escape(searchString);
if (Regex.IsMatch(searchString, @"^\w"))
    pattern = @"\b" + pattern;
if (Regex.IsMatch(searchString, @"\w$"))
    pattern = pattern + @"\b";
That way they will find “(three)” even if you select “whole words only”.
Here is a simple piece of code you may be interested in:
string pattern = @"\b" + find + @"\b";
Regex.Replace(stringToSearch, pattern, replace, RegexOptions.IgnoreCase);
Source code: snip2code - C#: Replace an exact word in a sentence
See what a word boundary matches:
A word boundary can occur in one of three positions:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
So, your \b\(three\)\b regex DOES work, but NOT the way you expected. It does not match (three) in In (three) years, In(three) years and In (three)years, but it matches in In(three)years because there are word boundaries between n and ( and between ) and y.
What you can do in these situations is use dynamic adaptive word boundaries that are constructs that ensure whole word matching where they are expected only (see my "Dynamic adaptive word boundaries" YT video for better visual understanding of these constructs).
In C#, it can be written as
@"(?!\B\w)\(three\)(?<!\w\B)"
In short:
(?!\B\w) - only require a word boundary on the left if the char that follows the word boundary is a word char
\(three\)
(?<!\w\B) - only require a word boundary on the right if the char that precedes the word boundary is a word char.
In case your search phrases can contain whitespaces and you need to match the longer alternatives first you can build the pattern dynamically from a list like
var phrases = new List<string> { @"(one)", @".two.", "[three]" };
phrases = phrases.OrderByDescending(x => x.Length).ToList();
var pattern = $@"(?!\B\w)(?:{string.Join("|", phrases.Select(z => Regex.Escape(z)))})(?<!\w\B)";
with the resulting pattern like (?!\B\w)(?:\[three]|\(one\)|\.two\.)(?<!\w\B) that matches what you'd expect, see the C# demo and the regex demo.
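To see the adaptive boundaries at work with a mix of word-like and punctuation-wrapped phrases, consider the following sketch. The helper names (Build, FindAll) and the sample phrases and input are assumptions chosen for illustration, not taken from the answer's demo.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class AdaptiveBoundaryDemo
{
    // Builds one pattern from all phrases, longest first, wrapped in the
    // adaptive boundary lookarounds.
    public static string Build(IEnumerable<string> phrases)
    {
        string body = string.Join("|",
            phrases.OrderByDescending(p => p.Length)
                   .Select(p => Regex.Escape(p)));
        return @"(?!\B\w)(?:" + body + @")(?<!\w\B)";
    }

    public static List<string> FindAll(string input, IEnumerable<string> phrases) =>
        Regex.Matches(input, Build(phrases))
             .Cast<Match>()
             .Select(m => m.Value)
             .ToList();

    public static void Main()
    {
        var phrases = new List<string> { "three", "(three)" };
        var found = FindAll("three threes (three) two(three)", phrases);

        // "threes" is skipped because "three" ends with a word char and is
        // followed by another one; both "(three)" occurrences are found
        // because the phrase starts and ends with non-word chars.
        Console.WriteLine(string.Join(", ", found)); // three, (three), (three)
    }
}
```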
I recently came across a similar issue in javascript trying to match terms with a leading '$' character only as separate words, e.g. if $hot = 'FUZZ', then:
"some $hot $hotel bird$hot pellets" ---> "some FUZZ $hotel bird$hot pellets"
The regex /\b\$hot\b/g (my first guess) did not work for the same reason the parens did not match in the original question — as non word characters, there is no word/non-word boundary preceding them with whitespace or a string start.
However the regex /\B\$hot\b/g does match, which shows that the positions not marked in @timwi's excellent example match the \B term. This was not intuitive to me because ") (" is not made of regex word characters. But I guess since \B is an inversion of the \b class, it doesn't have to be word characters, it just has to be not- not- word characters :)
As Gopi said, but (theoretically) catching only (three) not two(three):
string input = "one two(three) (three) four five";
string output = input.Replace(" (three) ", " (four) ");
When I test that, I get: "one two(three) (four) four five" Just remember that white-space is a string character, too, so it can also be replaced. If I did this:
//use same input
string output = input.Replace(" ", ";");
I'd get "one;two(three);(three);four;five"
