Custom word boundaries in regular expression


I am trying to match whole words with a regular expression, but the word boundary anchor (\b) only treats letters, digits and underscores as word characters, which is too narrow for my case. I want to extend it with more characters (here, the "+" character).
Here is what I used to have (it is C#, but that is not very relevant):
string expression = Regex.Escape(word);
Regex regExp = new Regex(@"\b" + expression + @"\b", RegexOptions.IgnoreCase);
This regex did not match "C++", because \b does not match after the final "+" unless a word character follows it. So I tried putting \w in a character class together with the + character:
string expression = Regex.Escape(word);
Regex regExp = new Regex(@"(?![\w\+])" + expression + @"(?![\w\+])", RegexOptions.IgnoreCase);
But now nothing gets matched. Is there something I am missing?

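The problem is that the first assertion is a negative lookahead where it should be a negative lookbehind. At the start of the pattern, (?![\w\+]) requires that the next character is neither a word character nor "+", but the next character is the first character of the word itself, so the match fails whenever the word starts with a word character. Use a lookbehind there instead (and note that + needs no escaping inside a character class):
@"(?<![\w+])" + expression + @"(?![\w+])"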
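As a rough illustration of the corrected pattern, here is a minimal sketch. The class name CustomBoundary, the helper ContainsWord and the sample sentences are made up for this example and are not part of the original question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CustomBoundary
{
    // Hypothetical helper: true when `word` occurs in `text` and is not
    // directly preceded or followed by a word character or '+'.
    public static bool ContainsWord(string text, string word)
    {
        string expression = Regex.Escape(word);
        var regExp = new Regex(
            @"(?<![\w+])" + expression + @"(?![\w+])",
            RegexOptions.IgnoreCase);
        return regExp.IsMatch(text);
    }

    public static void Main()
    {
        // "C++" is surrounded by spaces, so both assertions hold.
        Console.WriteLine(ContainsWord("I write C++ daily", "C++")); // True
        // The only "C" is followed by '+', so the lookahead rejects it.
        Console.WriteLine(ContainsWord("I write C++ daily", "C"));   // False
    }
}
```

Treating "+" as part of a word cuts both ways: searching for "C" no longer finds the "C" inside "C++", which is usually what you want for this kind of matching.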

Related

Detecting a word followed by a dot or whitespace using regex

I am using regex and C# to find occurrences of a particular word using
Regex regex = new Regex(@"\b" + word + @"\b");
How can I modify my Regex to only detect the word if it is either preceded with a whitespace, followed with a whitespace or followed with a dot?
Examples:
this.Button.Value - should match
this.value - should match
document.thisButton.Value - should not match

You may use lookarounds and alternation to check for the 2 possibilities when a keyword is enclosed with spaces or is just followed with a dot:
var line = "this.Button.Value\nthis.value\ndocument.thisButton.Value";
var word = "this";
var rx = new Regex(string.Format(@"(?<=\s)\b{0}\b(?=\s)|\b{0}\b(?=\.)", word));
var result = rx.Replace(line, "NEW_WORD");
Console.WriteLine(result);
The pattern matches:
(?<=\s)\bthis\b(?=\s) - whole word "this" that is preceded with whitespace (?<=\s) and that is followed with whitespace (?=\s)
| - or
\bthis\b(?=\.) - whole word "this" that is followed with a literal . ((?=\.))
Since lookarounds are not consuming characters (the regex index remains where it was) the characters matched with them are not placed in the match value, and are thus untouched during the replacement.

If I am understanding you correctly, you want the word only when a space or a dot follows it:
Regex regex = new Regex(@"\b" + word + @"\b(?=[ .])");

Regex regex = new Regex(@"((?<=( \.))" + word + @"\b)" + "|" + @"(\b" + word + @"[ .])");
However, note that this could cause trouble if word contains characters that have special meanings in Regular Expressions. I'm assuming that word contains alpha-numeric characters only.

The lookbehind (?<=...) checks what precedes the current position and the lookahead (?=...) checks what follows it, both without including those characters in the match.
Regex regex = new Regex(@"(?<=\s)\b" + word + @"\b|\b" + word + @"\b(?=[\s\.])");
EDIT: Pattern updated.
EDIT 2: Online test: http://ideone.com/RXRQM5

Why does my C# regular expression that asks for 7 characters accept 8 characters?

I have a C# regular expression that I'd like to match 7 characters:
string digits4 = "\\d{4}";
string allowable3 = "[a-zA-z0-9 $%&#?+=!]{3}";
Regex regex1 = new Regex(digits4 + allowable3);
allowable3 is meant to match three letters, numbers, or any of the subsequent characters. However, the following returns true:
regex1.IsMatch("1234abc^")
This confuses me for two reasons:
The matched pattern has 8 characters.
The allowable3 doesn't include "^".
I must have some additional, unexpected wild card matching going on within my "positive character group" (the part within the square brackets), but I'm not seeing it.

Use the anchors ^ and $ around the pattern to require a full-string match. IsMatch looks for a match anywhere in the input when the pattern is not anchored, so "1234abc" is found inside "1234abc^".
Your A-z range matches more than just letters: it spans every code point from "A" to "z", which includes [, \, ], ^, _ and `. Change it to A-Z:
string allowable3 = "[a-zA-Z0-9 $%&#?+=!]{3}";
Regex regex1 = new Regex("^" + digits4 + allowable3 + "$");
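To see the two effects separately, the following sketch compares the original pattern with the corrected one. The class name RangeDemo, the field names Loose and Strict, and the extra test strings are assumptions added for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RangeDemo
{
    // Unanchored, with the accidental A-z range from the question.
    public static readonly Regex Loose =
        new Regex("\\d{4}[a-zA-z0-9 $%&#?+=!]{3}");

    // Anchored, with the corrected A-Z range.
    public static readonly Regex Strict =
        new Regex("^\\d{4}[a-zA-Z0-9 $%&#?+=!]{3}$");

    public static void Main()
    {
        // Partial match: "1234abc" is found inside the 8-character input.
        Console.WriteLine(Loose.IsMatch("1234abc^"));  // True
        // '^' (U+005E) lies between 'Z' and 'a', so A-z accepts it.
        Console.WriteLine(Loose.IsMatch("1234ab^"));   // True
        Console.WriteLine(Strict.IsMatch("1234abc^")); // False
        Console.WriteLine(Strict.IsMatch("1234ab^"));  // False
        Console.WriteLine(Strict.IsMatch("1234ab!"));  // True
    }
}
```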

C# RegEx to find a specific string or all words in a string

Looking it up, I thought I understood how to look up a string of multiple words in a sentence, but it does not find a match. Can someone tell me what I am doing wrong? I need to be able to find a single or multiple word match. I passed in "to find" to the method and it did not find the match. Also, if the user does not enclose their search phrase in quotes, I also need it to search on each word entered.
var pattern = @"\b\" + searchString + @"\b"; //searchString is passed in.
Regex rgx = new Regex(pattern);
var sentence = "I need to find a string in this sentence!";
Match match = rgx.Match(sentence);
if (match.Success)
{
// Do something with the match.
}

Just remove the second \ in the first @"\b\". With it, the pattern for "to find" becomes \b\to find\b, where \t is read as a tab character:
var pattern = @"\b" + searchString + @"\b";
Note that in case you have special regex metacharacters (like (, ), [, +, *, etc.) in your searchStrings, you can use Regex.Escape() to escape them:
var pattern = @"\b" + Regex.Escape(searchString) + @"\b";
And if those characters may appear at the edges of the search string, use lookarounds rather than word boundaries:
var pattern = @"(?<!\w)" + Regex.Escape(searchString) + @"(?!\w)";
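A small sketch of the lookaround variant, using the sentence from the question. The class name PhraseSearch and the helper ContainsPhrase are hypothetical names chosen for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhraseSearch
{
    // Hypothetical helper: finds searchString as a whole phrase, even when
    // it starts or ends with a non-word character such as '!'.
    public static bool ContainsPhrase(string sentence, string searchString)
    {
        var pattern = @"(?<!\w)" + Regex.Escape(searchString) + @"(?!\w)";
        return Regex.IsMatch(sentence, pattern);
    }

    public static void Main()
    {
        var sentence = "I need to find a string in this sentence!";
        Console.WriteLine(ContainsPhrase(sentence, "to find"));   // True
        // \b after '!' would fail at the end of the input; (?!\w) does not.
        Console.WriteLine(ContainsPhrase(sentence, "sentence!")); // True
        Console.WriteLine(ContainsPhrase(sentence, "sent"));      // False
        // The original pattern: \t is a tab, which the sentence lacks.
        Console.WriteLine(Regex.IsMatch(sentence, @"\b\" + "to find" + @"\b")); // False
    }
}
```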

Regex to find special pattern

I have a string to parse. First I have to check whether the string contains a special pattern. I want to know whether it has a substring that
starts with "$(",
ends with ")",
and, between those start and end markers,
contains no whitespace and
contains no "$" character.
I have a small regex for it in C#:
string input = "$(abc)";
string pattern = @"\$\(([^$][^\s]*)\)";
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection matches = rgx.Matches(input);
foreach (var match in matches)
{
Console.WriteLine("value = " + match);
}
It works for many cases but fails on input = $(a$(), where the inner expression is empty. I want it NOT to match when the input is $() (there is nothing between the start and end markers).
What is wrong with my regex?

Note: [^$] matches exactly one character, which can be anything except $.
Use the regex below if you want to match $():
\$\(([^\s$]*)\)
Use the regex below if you don't want to match $():
\$\(([^\s$]+)\)
* repeats the preceding token zero or more times.
+ repeats the preceding token one or more times.
Your regex \$\(([^$][^\s]*)\) is wrong: it rejects $ as the first character inside the parentheses but allows it as the second, third and so on. You need to combine the two negated classes into one in order to match any character that is neither whitespace nor $.

Your current regex does not match $() on its own, because [^$] has to match at least 1 character. The only way I can think of where you would get this match is when the input contains more than one pair of parentheses, like:
$()(something)
In those cases, you will also need to exclude at least the closing parenthesis:
string pattern = @"\$\(([^$\s)]+)\)";
The above matches for example:
abc in $(abc) and
abc and def in $(def)$()$(abc)(something).
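The behaviour described above can be checked with a short sketch like the following. The class name PlaceholderScan and the helper Extract are invented for this example; it returns the captured group rather than the whole match.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PlaceholderScan
{
    // Hypothetical helper: returns the text between "$(" and ")" for each
    // non-empty placeholder that contains no whitespace, '$' or ')'.
    public static List<string> Extract(string input)
    {
        var rgx = new Regex(@"\$\(([^$\s)]+)\)");
        var values = new List<string>();
        foreach (Match match in rgx.Matches(input))
        {
            values.Add(match.Groups[1].Value);
        }
        return values;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", Extract("$(abc)")));                     // abc
        Console.WriteLine(string.Join(",", Extract("$(def)$()$(abc)(something)"))); // def,abc
        Console.WriteLine(Extract("$()").Count);                                    // 0
        Console.WriteLine(Extract("$(a$()").Count);                                 // 0
    }
}
```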

Simply replace the * with a + and merge the options.
string pattern = @"\$\(([^$\s]+)\)";
+ means 1 or more
* means 0 or more

Regular expression query

I am looking for a regular expression that allows the characters below.
0-9 a-z A-Z - / '

The C# version of this pattern is:
@"[0-9a-zA-Z/'-]"
Used in code:
var regex = new Regex(@"[0-9a-zA-Z/'-]");
or
var regex = new Regex(@"[0-9a-z/'-]", RegexOptions.IgnoreCase);
Note that the - is at the very end of the character class (the part in the brackets). For - to mean a literal hyphen inside a character class, it must be at the beginning or end of the class (i.e. [-blah] or [blah-]), or escaped with a backslash: [ab\-c] will match a, b, c, or -.
Note also the @ at the beginning of the quoted string. This isn't important for this pattern, but it's a good habit to get into with C# regex. Regular expressions often contain backslashes, and the @"..." form will allow you to use backslashes in your pattern without having to escape them.
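The hyphen rule is easy to verify with a few one-character tests. In this sketch the class name HyphenDemo, the field Allowed and the sample inputs are assumptions, and the ^...+$ wrapper is added here only to validate a whole string; it is not part of the pattern given above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HyphenDemo
{
    // Assumption: the whole input must consist of the allowed characters.
    public static readonly Regex Allowed = new Regex(@"^[0-9a-zA-Z/'-]+$");

    public static void Main()
    {
        Console.WriteLine(Allowed.IsMatch("O'Neil-Smith/42")); // True
        Console.WriteLine(Allowed.IsMatch("a b"));             // False: space is not allowed

        // Between two characters, '-' forms a range: a, b or c.
        Console.WriteLine(Regex.IsMatch("b", @"^[a-c]$"));     // True
        // At the end of the class, '-' is a literal hyphen: a, c or -.
        Console.WriteLine(Regex.IsMatch("b", @"^[ac-]$"));     // False
        Console.WriteLine(Regex.IsMatch("-", @"^[ac-]$"));     // True
        // Escaped, '-' is literal anywhere in the class.
        Console.WriteLine(Regex.IsMatch("-", @"^[a\-c]$"));    // True
    }
}
```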

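Use the code below to validate that a string consists only of letters, digits and underscores. Regex.Match returns a Match object rather than a bool, so use Regex.IsMatch, and anchor the pattern so the whole string is checked:
String name="123ABCabc";
if(System.Text.RegularExpressions.Regex.IsMatch(name, @"^[0-9a-zA-Z_]+$"))
{
return true;
}
else
{
return false;
}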

If you want to match digits, lower- and upper-case Latin letters, "-", "/" and "'", then I would suggest the following (the hyphen goes last so it is read literally, and / and ' need no escaping):
[0-9a-zA-Z/'-]
