Regex vs, versus, vs., v. matching in lines for C# code - c#

I need to write a regex that matches versus, verse, vs., v., and v, but v must not match when it is part of another word.
@"\b((.*?)" + Regex.Unescape(xz) + @"[.,:/s]?)\b", RegexOptions.IgnoreCase))
Here I'll pass the array and test it.

This will get you close:
\b(?:(?:vers(?:e|us)|v)\b|vs\.|v\.)
One difficulty is in word boundaries vs. (heh) words that end in a period. See Regex using word boundary but word ends with a . (period) for other options.
Note that "verse" can also mean "poetry" so there could be false positives.
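A quick way to see what the pattern above actually captures is to run it over a sample line. This is only a sketch: the class name VersusMatcher, the method FindVersus and the sample sentence are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class VersusMatcher
{
    // Pattern from the answer above; IgnoreCase also accepts "VS." or "Versus".
    private static readonly Regex Versus = new Regex(
        @"\b(?:(?:vers(?:e|us)|v)\b|vs\.|v\.)",
        RegexOptions.IgnoreCase);

    // Returns every match found in the line, in order of appearance.
    public static List<string> FindVersus(string line)
    {
        return Versus.Matches(line)
                     .Cast<Match>()
                     .Select(m => m.Value)
                     .ToList();
    }
}
```

For "Roe vs. Wade, Smith v. Jones, A versus B, over" this returns vs., v and versus. The v inside over is skipped. Note that v. comes back without its period, because the v\b branch is tried first, and that a bare vs with no period is not matched at all.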

Related

Regex Remove pair of double quotes around word, but not single instances of double quotes

I need to be able to remove a pair of double quotes around words, without removing single instances of double quotes.
I.e., in the examples below, the regex should only match the quotes around "hello" and "bounce", without removing the word itself.
3.5" hdd
"hello"
"cool
"bounce"
single sentence with out quotes.
The closest regex I've found so far is the one below, but it highlights the entire "bounce" word, which is not acceptable because I need to keep the word.
"([^\\"]|\\")*"
Other close regexes I've found in my research:
1. \"*\", but this highlights the single quotes.
2. Unsuccessful Method 2
This needs to be usable in C# code.
I've been using RegexStorm to test my regex: http://regexstorm.net/reference
Your first regex seems fine but lacks an outer capturing group. It would be better if we transform this into a linear regex, avoiding alternation.
"([^\\"\r\n]*(?:\\.[^\\"\r\n]*)*)"
I included \r and \n in the character classes to keep a match from spanning more than one line; you may not need them, however. You then replace the whole match with $1 (a back-reference to the text saved by the first capturing group). To escape a " inside a C# verbatim string, double it: "".
Live demo
C# code:
string pattern = @"""([^\\""\r\n]*(?:\\.[^\\""\r\n]*)*)""";
string input = @"3.5"" hdd
""hello""
""cool
""bounce""
single sentence with out quotes.";
Regex regex = new Regex(pattern);
Console.WriteLine(regex.Replace(input, @"$1"));

How to find abbreviations as words in a C# regular expression

I have been given a list of strings to find as whole "words" in my string. Generally, using the \b anchor works for most things except when I'm trying to find the & character as a word or if the abbreviation has a dot after it since the \b doesn't match between the space and the & character, or after a period and space.
For instance to find these strings:
&
b&w
bpi
p.
I'm trying to write something like:
\b((&)|(b&w)|(bpi)|(p\.))\b
In a test string:
my b&w and & and p. test.
I've also tried using \s to check for whitespace but I don't want to capture the whitespace and I haven't been able to figure out how not to. It would also then need to check for beginning and ending of the string as well I believe.
Instead of using word boundaries (\b), you could use lookaround assertions that check for whitespace or for the beginning (^) or end ($) of the line, like so:
(?<=^|\s)([^\s]*)(?=\s|$)
Working regex example:
http://regex101.com/r/rJ0wU4
Test string:
my b&w and & and p. test.
Matches:
"my", "b&w", "and", "&", "and", "p.", "test."
Try putting all the abbreviations in one group, like:
(^|\s+)(&|b&w|bpi|p\.)(\s+|$)
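The two suggestions above can be combined: keep the abbreviation list in one group, but test the surroundings with lookarounds so the whitespace is not part of the match. The sketch below assumes that "whole word" means "delimited by whitespace or by the start or end of the input"; AbbreviationFinder, Build and Find are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class AbbreviationFinder
{
    // Builds (?<!\S)(?:alt1|alt2|...)(?!\S): each term must be preceded and
    // followed by whitespace or by the start/end of the input.
    public static Regex Build(IEnumerable<string> terms)
    {
        var alternatives = terms
            .OrderByDescending(t => t.Length)   // try longer terms first
            .Select(t => Regex.Escape(t));      // treat "." and friends literally
        return new Regex(@"(?<!\S)(?:" + string.Join("|", alternatives) + @")(?!\S)");
    }

    // Returns the matched terms without any surrounding whitespace.
    public static List<string> Find(string text, IEnumerable<string> terms)
    {
        return Build(terms).Matches(text)
                           .Cast<Match>()
                           .Select(m => m.Value)
                           .ToList();
    }
}
```

With the terms &, b&w, bpi and p. and the test string "my b&w and & and p. test.", Find returns b&w, & and p. with no whitespace attached.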

Regex for catching word with special characters between letters

I am new to regex, I'm programming an advanced profanity filter for a commenting feature (in C#). Just to save time, I know that all filters can be fooled, no matter how good they are, you don't have to tell me that. I'm just trying to make it a bit more advanced than basic word replacement. I've split the task into several separate approaches and this is one of them.
What I need is a specific piece of regex, that catches strings such as these:
s_h_i_t
s h i t
S<>H<>I<>T
s_/h_/i_/t
s***h***i***t
you get the idea.
I guess what I'm looking for is a regex that says "one or more characters that are not alphanumeric". This should include both spaces and all special characters that you can type on a standard (western) keyboard. If possible, it should also include line breaks, so it would catch things like
s
h
i
t
There should always be at least one of the characters present, to avoid likely false positives such as in
Finish it.
This will of course mean that things like
sh_it
will not be caught, but as I said, it doesn't matter, it doesn't have to be perfect. All I need is the regex, I can do the splitting of words and inserting the regex myself. I have the RegexOptions.IgnoreCase option set in my C# code, so character case in the actual word is not an issue. Also, this regex shouldn't worry about "leetspeek", i.e. some of the actual letters of the word being replaced by other characters:
sh1t
I have a different approach that deals with that.
Thank you in advance for your help.
Let's see if this regex works for you:
/\w(?:_|\W)+/
Alright, HamZa's answer worked. However, I ran into a programmatic problem while working on the solution. When I was replacing just the words, I always knew the length of the word, so I knew exactly how many asterisks to replace it with. If I'm matching shit, I know I need to put 4 asterisks. But if I'm matching s[^a-z0-9]+h[^a-z0-9]+i[^a-z0-9]+t, I might catch s#h#i#t or I may catch s------h------i--------t. In both cases the length of the matched text will differ wildly from that of the pattern. How can I get the actual length of the matched string?
\bs[\W_]*h[\W_]*i[\W_]*t[\W_]*(?!\w)
[\W_]* matches any run of non-word characters or underscores between the letters, including whitespace and line breaks.
\b (word boundary) ensures that Finish it won't match.
(?!\w) ensures that sh ituuu won't match. You may want to remove or modify it, because s_hittt will not match either. \bs[\W_]*h[\W_]*i[\W_]*t+[\W_]*(?!\w) will match the word with a repeated last character.
The modification \bs[\W_]*h[\W_]*i[\W_]*t[\W_]*?(?!\w) makes the last character class lazy, so in sh it&&& only sh it will match.
\bs[\W\d_]*h[\W\d_]*i[\W\d_]*t+[\W\d_]*?(?!\w) will match sh1i444t (digits between characters)
EDIT:
(?!\w) is a negative lookahead. It checks that your match is not followed by a word character (for ASCII text, word characters are [A-Za-z0-9_]). It has a length of 0, which means it won't be included in the match. If you want to catch words like "shi*tface" you'll have to remove it.
( http://www.regular-expressions.info/lookaround.html )
A word boundary (\b) matches a position where a word starts or ends. Its length is 0, which means that it matches between characters.
[\W] is a negated character class, equivalent to [^\w]; for ASCII text that is [^a-zA-Z0-9_].
You want to match words where each letter is separated with the identical non-word char(s).
You can use
\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b
See the regex demo. (I added (?!\n) to make the regex work for each line as if it were a separate string.) Details:
\b - word boundary
\p{L} - a letter
(?=([\W_]+)) - a positive lookahead that matches a location that is immediately followed with any non-word or _ char (captured into Group 1)
(?:\1\p{L})+ - one or more repetitions of a sequence of the same char captured into Group 1 and a letter
\b - word boundary.
To check if there is such a pattern in a string, you can use
var HasSpamWords = Regex.IsMatch(text, @"\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b");
To return all occurrences in a string, you can use
var results = Regex.Matches(text, @"\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b")
    .Cast<Match>()
    .Select(x => x.Value)
    .ToList();
See the C# demo.
Getting the length of each string is easy if you get Match.Length and use .Select(x => x.Length). If you need to get the length of the string with all special chars removed, simply use .Select(x => x.Value.Count(c => char.IsLetter(c))) (see this C# demo).
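To tie Match.Length back to the masking question, here is a minimal sketch that replaces each obfuscated hit with as many asterisks as the matched text is long. It assumes the per-word pattern style from the earlier answer (letters joined by [\W_]*), without the trailing [\W_]* so punctuation after the word is kept. ObfuscationMask, BuildPattern and Mask are hypothetical names, and "spam" is just a neutral stand-in word.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class ObfuscationMask
{
    // Joins the letters of `word` with [\W_]* so that any run of non-word
    // characters or underscores may sit between them.
    public static string BuildPattern(string word)
    {
        var letters = word.Select(c => Regex.Escape(c.ToString()));
        return @"\b" + string.Join(@"[\W_]*", letters) + @"(?!\w)";
    }

    // Replaces each match with as many asterisks as the matched text is long.
    public static string Mask(string text, string word)
    {
        return Regex.Replace(
            text,
            BuildPattern(word),
            m => new string('*', m.Length),
            RegexOptions.IgnoreCase);
    }
}
```

For example, Mask("buy s_p_a_m now", "spam") gives "buy ******* now", because the matched text s_p_a_m is seven characters long. The evaluator sees each Match, so the replacement length follows the input rather than the pattern.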

regex to find a word or words between spaces

I want to find the words in a sentence between spaces, i.e. everything from the first space before the search word to the first space after it.
"This is anexampleof what I want" should return "anexampleof" if my search word is "example".
I now have this regex "(?:^|\S*\s*)\S*" + searchword + "\S*(?:$|\s*\S*)" but this gives me an extra word in the beginning and the end.
'This is anexampleof what I want' --> returns 'is anexampleof what'
I tried to change the regex, but I'm not good at it at all.
I'm using C#. Thanks for the help.
Full C# code:
MatchCollection m1 = Regex.Matches(content, @"(?:^|\S*\s*)\S*" + searchword + @"\S*(?:$|\s*\S*)",
    RegexOptions.IgnoreCase | RegexOptions.Multiline);
You can simply leave out the non-capturing groups at the end:
@"\S*" + searchword + @"\S*";
Due to greediness you will get as many non-space characters on each side as possible.
Also, a non-capturing group does not mean its contents are left out of the match. All it does is avoid producing a capture for the sub-match. If you want to check that something is there without including it in the match, you want lookarounds:
@"(?<=^|\S*\s*)\S*" + searchword + @"\S*(?=$|\s*\S*)"
However these lookarounds don't really do anything in this case, because \s*\S* is satisfied with an empty string (because * makes both characters optional). But just for further reference... if you want to make assertions at the boundary of your match, which should not be part of the match... lookarounds are the way to go.
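As a small sketch of the simplified pattern in use: the helper below returns every whitespace-delimited token that contains the search word. ContainingWordFinder and Find are hypothetical names, and the Regex.Escape call is an addition not present in the original code, included on the assumption that the search word should be taken literally.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ContainingWordFinder
{
    // Returns every whitespace-delimited token that contains searchWord.
    public static List<string> Find(string content, string searchWord)
    {
        // Regex.Escape keeps characters such as "." or "+" in the search word literal.
        string pattern = @"\S*" + Regex.Escape(searchWord) + @"\S*";
        return Regex.Matches(content, pattern, RegexOptions.IgnoreCase)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }
}
```

Find("This is anexampleof what I want", "example") returns the single token anexampleof.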

Regex word boundary expressions

Say for example I have the following string "one two(three) (three) four five" and I want to replace "(three)" with "(four)" but not within words. How would I do it?
Basically I want to do a regex replace and end up with the following string:
"one two(three) (four) four five"
I have tried the following regex but it doesn't work:
@"\b\(three\)\b"
Basically I am writing some search and replace code and am giving the user the usual options to match case, match whole word etc. In this instance the user has chosen to match whole words but I don't know what the text being searched for will be.
Your problem stems from a misunderstanding of what \b actually means. Admittedly, it is not obvious.
The reason \b\(three\)\b doesn’t match the threes in your input string is the following:
\b means: the boundary between a word character and a non-word character.
Letters (e.g. a-z) are considered word characters.
Punctuation marks such as ( are considered non-word characters.
Here is your input string again, stretched out a bit, and I’ve marked the places where \b matches:
 o n e   t w o ( t h r e e )   ( t h r e e )   f o u r   f i v e
↑     ↑ ↑     ↑ ↑         ↑     ↑         ↑   ↑       ↑ ↑       ↑
As you can see here, there is a \b between “two” and “(three)”, but not before the second “(three)”.
The moral of the story? “Whole-word search” doesn’t really make much sense if what you’re searching for is not just a word (a string of letters). Since you have punctuation characters (parentheses) in your search string, it is not as such a “word”. If you searched for a word consisting only of word characters, then \b would do what you expect.
You can, of course, use a different Regex to match the string only if it is surrounded by spaces or occurs at the beginning or end of the string:
(^|\s)\(three\)(\s|$)
However, the problem with this is, of course, that if you search for “three” (without the parentheses), it won’t find the one in “(three)” because it doesn’t have spaces around it, even though it is actually a whole word.
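The whitespace-delimited pattern above also consumes the neighbouring spaces, so a replacement would have to put them back. A sketch that avoids this with lookarounds, under the same assumption that "whole word" means "bounded by whitespace or by the start or end of the string", might look like the following. WhitespaceDelimitedReplace is a hypothetical name, and the same caveat applies: searching for three will not find the one inside (three).

```csharp
using System.Text.RegularExpressions;

public static class WhitespaceDelimitedReplace
{
    // Replaces `find` only where it is bounded by whitespace or by the
    // start/end of the input; the surrounding whitespace is left untouched.
    public static string Replace(string input, string find, string replacement)
    {
        string pattern = @"(?<!\S)" + Regex.Escape(find) + @"(?!\S)";
        // A MatchEvaluator keeps "$" in the replacement from being read as a group reference.
        return Regex.Replace(input, pattern, m => replacement);
    }
}
```

Replace("one two(three) (three) four five", "(three)", "(four)") yields "one two(three) (four) four five", which is the result asked for in the question.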
I think most text editors (including Visual Studio) will use \b only if your search string actually starts and/or ends with a word character:
var pattern = Regex.Escape(searchString);
if (Regex.IsMatch(searchString, @"^\w"))
    pattern = @"\b" + pattern;
if (Regex.IsMatch(searchString, @"\w$"))
    pattern = pattern + @"\b";
That way they will find “(three)” even if you select “whole words only”.
Here is a simple piece of code you may be interested in:
string pattern = @"\b" + Regex.Escape(find) + @"\b";
string result = Regex.Replace(stringToSearch, pattern, replace, RegexOptions.IgnoreCase);
Source code: snip2code - C#: Replace an exact word in a sentence
See what a word boundary matches:
A word boundary can occur in one of three positions:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
So, your \b\(three\)\b regex DOES work, but NOT the way you expected. It does not match (three) in In (three) years, In(three) years and In (three)years, but it matches in In(three)years because there are word boundaries between n and ( and between ) and y.
What you can do in these situations is use dynamic adaptive word boundaries that are constructs that ensure whole word matching where they are expected only (see my "Dynamic adaptive word boundaries" YT video for better visual understanding of these constructs).
In C#, it can be written as
@"(?!\B\w)\(three\)(?<!\w\B)"
In short:
(?!\B\w) - only require a word boundary on the left if the char that follows the word boundary is a word char
\(three\)
(?<!\w\B) - only require a word boundary on the right if the char that precedes the word boundary is a word char.
In case your search phrases can contain whitespaces and you need to match the longer alternatives first you can build the pattern dynamically from a list like
var phrases = new List<string> { @"(one)", @".two.", "[three]" };
phrases = phrases.OrderByDescending(x => x.Length).ToList();
var pattern = $@"(?!\B\w)(?:{string.Join("|", phrases.Select(z => Regex.Escape(z)))})(?<!\w\B)";
with the resulting pattern like (?!\B\w)(?:\[three]|\(one\)|\.two\.)(?<!\w\B) that matches what you'd expect, see the C# demo and the regex demo.
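To see how these adaptive boundaries behave on mixed input, the following sketch wraps the same construction in a helper and lists what it matches. AdaptiveBoundaryDemo, BuildPattern and Find are hypothetical names, and the phrases and sample text are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class AdaptiveBoundaryDemo
{
    // Builds (?!\B\w)(?:p1|p2|...)(?<!\w\B) with longer phrases tried first.
    public static string BuildPattern(IEnumerable<string> phrases)
    {
        var alternatives = phrases
            .OrderByDescending(p => p.Length)
            .Select(p => Regex.Escape(p));
        return @"(?!\B\w)(?:" + string.Join("|", alternatives) + @")(?<!\w\B)";
    }

    // Returns all matches of any phrase in the text, in order of appearance.
    public static List<string> Find(string text, IEnumerable<string> phrases)
    {
        return Regex.Matches(text, BuildPattern(phrases))
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }
}
```

With the phrases three, (three) and C#, the text "C# threesome three (three)" yields C#, three and (three); the three inside threesome is rejected because a word character follows it. A phrase that starts with punctuation gets no boundary check on that side, so (three) is also found in two(three). If that case must be excluded, a whitespace-based check is the stricter choice.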
I recently came across a similar issue in JavaScript, trying to match terms with a leading '$' character only as separate words, e.g. if $hot = 'FUZZ', then:
"some $hot $hotel bird$hot pellets" ---> "some FUZZ $hotel bird$hot pellets"
The regex /\b\$hot\b/g (my first guess) did not work for the same reason the parens did not match in the original question: $ is a non-word character, so there is no word/non-word boundary in front of it when it follows whitespace or the start of the string.
However, the regex /\B\$hot\b/g does match, which shows that the positions not marked in @timwi's excellent example match \B. This was not intuitive to me because ") (" is not made of regex word characters. But I guess since \B is the inverse of \b, it doesn't have to sit between word characters; it just has to be a position that is not a word boundary :)
As Gopi said, but (theoretically) catching only (three) not two(three):
string input = "one two(three) (three) four five";
string output = input.Replace(" (three) ", " (four) ");
When I test that, I get: "one two(three) (four) four five" Just remember that white-space is a string character, too, so it can also be replaced. If I did this:
//use same input
string output = input.Replace(" ", ";");
I'd get "one;two(three);(three);four;five"
