How can you match words with more than one character? - c#

I would like to use a regular expression to match all words with more than one distinct character, as opposed to words made entirely of the same character.
This should not match: ttttt, rrrrr, ggggggggggggg
This should match: rttttttt, word, wwwwwwwwwu

The following expression will do the trick.
^(?<FIRST>[a-zA-Z])[a-zA-Z]*?(?!\k<FIRST>)[a-zA-Z]+$
capture the first character into the group FIRST
capture some more characters (lazily to avoid backtracking)
ensure that the next character is different from FIRST using a negative lookahead assertion
capture all (at least one due to the assertion) remaining characters
Note that it is sufficient to look for a character that is different from the first one, because if no character is different from the first one, all characters are equal.
You can shorten the expression to the following.
^(\w)\w*?(?!\1)\w+$
This will also match word characters other than [a-zA-Z], such as digits and the underscore.
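As a quick check, the anchored pattern above can be run against the sample words from the question. This is only a minimal sketch; the variable names `mixed` and `samples` are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Anchored pattern from the answer: it succeeds only if some later letter
// differs from the first letter (captured in the named group FIRST).
var mixed = new Regex(@"^(?<FIRST>[a-zA-Z])[a-zA-Z]*?(?!\k<FIRST>)[a-zA-Z]+$");

// Sample words taken from the question.
string[] samples = { "ttttt", "rrrrr", "ggggggggggggg", "rttttttt", "word", "wwwwwwwwwu" };

foreach (var s in samples)
{
    // Prints the word followed by True or False.
    Console.WriteLine($"{s}: {mixed.IsMatch(s)}");
}
```

The first three words consist of a single repeated letter and should be rejected; the last three should be accepted.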

I would add all unique words to a list and then use this regex
\b(\w)\1+\b
to grab all words made of one repeated character and get rid of them.
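One way this could look in code, as a rough sketch: the word list is invented sample data, and `sameChar` and `kept` are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Matches a word that is one character repeated two or more times.
var sameChar = new Regex(@"\b(\w)\1+\b");

// Invented sample data; in practice this would be the extracted words.
var words = new List<string> { "ttttt", "word", "rttttttt", "ttttt", "gg" };

// Keep each unique word once, then drop the repeated-character words.
var kept = words.Distinct()
                .Where(w => !sameChar.IsMatch(w))
                .ToList();

Console.WriteLine(string.Join(", ", kept));
```

Because \1+ needs at least one repetition, a one-letter word such as "a" is not removed by this pattern.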

This doesn't use a regular expression, but I believe it will do what you require:
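// Requires: using System.Linq;
public bool Match(string str)
{
    // A null or empty string has no characters, so it cannot contain two different ones.
    return !string.IsNullOrEmpty(str)
        && str.Skip(1)
              .Any(c => c != str[0]); // true as soon as one character differs from the first
}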

The following RE will do the opposite of what you're asking for: match where a word is composed of the same character. It may still be useful to you though.
\b(\w)\1*\b

\b\w*?(\w)\1*(?:(?!\1)\w)\w*\b
or
\b(\w)(?!\1*\b)\w*\b
This assumes you're plucking the words out of some larger text; that's why it needs the word boundaries and the padding. If you have a list of words and you're just trying to validate the ones that meet the criteria, a much simpler regex would probably do:
(.)(?:(?!\1).)
...because you already know each word contains only word characters. On the other hand, depending on your definition of "word" you might need to replace \w in the first two regexes with something more specific, like [A-Za-z].
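A small sketch of both situations described above: pulling qualifying words out of a longer text, and validating a single word. The sample text and variable names are assumptions made for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Invented sample text mixing both kinds of words.
var text = "ttttt word rrrrr rttttttt a wwwwwwwwwu";

// Extraction: the lookahead rejects a word whose remaining characters
// all repeat the first one up to the closing word boundary.
var mixedWords = Regex.Matches(text, @"\b(\w)(?!\1*\b)\w*\b")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

Console.WriteLine(string.Join(", ", mixedWords));

// Validation of a single known word: succeeds if any two adjacent characters differ.
var anyTwoDiffer = new Regex(@"(.)(?:(?!\1).)");
Console.WriteLine(anyTwoDiffer.IsMatch("word"));
Console.WriteLine(anyTwoDiffer.IsMatch("ttttt"));
```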

Related

C# Regex boundary with special characters

I want to have a Regex that finds "Attributable".
I tried @"\bAttributable\b" but the \b boundary doesn't work with special characters.
For example, it wouldn't differentiate Attributable and Non-Attributable. Is there any way to regex for Attributable and not its negative?
Do a negative look-behind?
(?<!-)\bAttributable\b
Obviously this only checks for -s. If you want to check for other characters, put them in a character class in the negative look-behind:
(?<![-^])\bAttributable\b
Alternatively, if you just want to not match Non-Attributable but do match SomethingElse-Attributable, then put Non- in the look-behind:
(?<!Non-)\bAttributable\b
There are several ways to fix this, but the right one depends on your actual requirements. It is sometimes necessary to specify exactly which kind of "word boundary" you need in each case, since the \b word boundary is 1) context dependent, and 2) matches specific places in the string that you should be aware of:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Now, here are several approaches that you may follow:
When you only care about compound words usually joined with hyphens (similar to @Sweeper's answer): (?<!-)\bAttributable\b(?!-)
Only match between whitespaces or start/end of string: (?<!\S)Attributable(?!\S). NOTE: Actually, if it is what you want, you may do without a regex by using s.Split().Contains("Attributable")
Only match if not preceded with punctuation and there is no letter/digit/underscore right after: (?<!\p{P})Attributable\b
Only match if not preceded by punctuation symbols other than some specific ones (say, you want to match the word after a comma or a semicolon): (?<![^\P{P},;])Attributable\b.
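To see how these variants differ, here is a minimal sketch that counts matches for a few of the patterns on one invented sample string. `Count` is a hypothetical helper, not part of any answer above.

```csharp
using System;
using System.Text.RegularExpressions;

// Invented sample text with one plain and two hyphenated occurrences.
var text = "Attributable items, Non-Attributable items, Semi-Attributable items";

// Hypothetical helper: number of matches of a pattern in the sample text.
int Count(string pattern) => Regex.Matches(text, pattern).Count;

// Plain \b treats the hyphen as a boundary, so all three occurrences match.
Console.WriteLine(Count(@"\bAttributable\b"));

// Rejects any occurrence directly preceded by a hyphen.
Console.WriteLine(Count(@"(?<!-)\bAttributable\b"));

// Rejects only the "Non-" compound, so "Semi-Attributable" still matches.
Console.WriteLine(Count(@"(?<!Non-)\bAttributable\b"));

// Requires whitespace (or the string start/end) on both sides.
Console.WriteLine(Count(@"(?<!\S)Attributable(?!\S)"));
```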

C# Regular Expression for String matching

I am looking for a regular expression that returns success only if the input string contains following characters:
a-zA-Z0-9~!#$^ ()_-+’:.?
Is this regular expression correct?
^[a-zA-Z0-9~!#$^ ()_-+’:.?]+$
I have understood what ^ means here but not sure about +$. Also are there any alternatives to this? By the way the above regular expression also includes a space character between ^ and (
it only contains the characters listed above
bool invalidCharsExist =
    Regex.Replace(input, @"[a-zA-Z0-9~!#\$\^\ \(\)_\-\+’:\.\?]", "").Length != 0;
BTW: This is not fully equivalent to your regex (It will also include non-ascii letters and digits) but I think it is a better way to check
var specialChars = new HashSet<char>("~!#$^ ()_-+’:.?");
var allValid = input.All(c => char.IsLetterOrDigit(c) || specialChars.Contains(c));
Close, but get rid of that dash in the middle of your character class and put it at the beginning:
^[-a-zA-Z0-9~!#$^ ()_+’:.?]+$
And make sure when you put it in a string that you use the proper string qualifier (I forget what it's called):
@"^[-a-zA-Z0-9~!#$^ ()_+’:.?]+$"
As to whether or not you can do it in other ways, sure, for example a negative look-ahead that doesn't actually match anything. I don't think a proper regex optimizer would leave one better than the other, it's just a matter of preference. Do you want something that looks to succeed (selects the entire string if valid), or something that looks to fail (negative look-ahead).
Honestly if performance is at all important, you should write a good old for and loop over the characters (or the equivalent LINQ implementation). Regex won't even be in the ballpark.
the regular expression would be: ^[a-zA-Z0-9~!#$^ ()_\-+’:.?]+$
I personally recommend using https://regex101.com to check regular expressions. Note that they don't have C# support, but in general JavaScript's RegExp has similar syntax to C#. What it does give you is a particularly useful explanation of what your expression is doing; here is this expression's explanation from there:
^ assert position at start of the string
[a-zA-Z0-9~!#$^ ()_\-\+’:.?]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
a-z a single character in the range between a and z (case sensitive)
A-Z a single character in the range between A and Z (case sensitive)
0-9 a single character in the range between 0 and 9
~!#$^ ()_ a single character in the list ~!#$^ ()_ literally
\- matches the character - literally
+’:.? a single character in the list +’:.? literally
$ assert position at end of the string
The issue with what you put in the OP was literally only forgetting to escape the -, as it is reserved in the regular expression pattern for special purposes (i.e. in the [] notation the - is reserved to declare a character range like a-z).
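The effect of the hyphen position can be checked directly. The sketch below assumes the variant with the hyphen moved to the front of the class; `IsValid` and `reversedRangeThrows` are names invented for the example. An unescaped _-+ is read as a range from _ to +, which is in reverse order, so .NET rejects the pattern when the Regex is constructed.

```csharp
using System;
using System.Text.RegularExpressions;

// Hyphen first in the class, so it is a literal and not a range operator.
var allowed = new Regex(@"^[-a-zA-Z0-9~!#$^ ()_+’:.?]+$");

// Hypothetical helper: true if the input consists only of allowed characters.
bool IsValid(string input) => allowed.IsMatch(input);

Console.WriteLine(IsValid("Hello World!"));
Console.WriteLine(IsValid("a*b"));

// The unescaped "_-+" is parsed as a reversed range and fails to compile.
bool reversedRangeThrows;
try
{
    _ = new Regex(@"[_-+]");
    reversedRangeThrows = false;
}
catch (ArgumentException)
{
    reversedRangeThrows = true;
}
Console.WriteLine(reversedRangeThrows);
```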

Regex for catching word with special characters between letters

I am new to regex, I'm programming an advanced profanity filter for a commenting feature (in C#). Just to save time, I know that all filters can be fooled, no matter how good they are, you don't have to tell me that. I'm just trying to make it a bit more advanced than basic word replacement. I've split the task into several separate approaches and this is one of them.
What I need is a specific piece of regex, that catches strings such as these:
s_h_i_t
s h i t
S<>H<>I<>T
s_/h_/i_/t
s***h***i***t
you get the idea.
I guess what I'm looking for is a regex that says "one or more characters that are not alphanumeric". This should include both spaces and all special characters that you can type on a standard (western) keyboard. If possible, it should also include line breaks, so it would catch things like
s
h
i
t
There should always be at least one of the characters present, to avoid likely false positives such as in
Finish it.
This will of course mean that things like
sh_it
will not be caught, but as I said, it doesn't matter, it doesn't have to be perfect. All I need is the regex, I can do the splitting of words and inserting the regex myself. I have the RegexOptions.IgnoreCase option set in my C# code, so character case in the actual word is not an issue. Also, this regex shouldn't worry about "leetspeek", i.e. some of the actual letters of the word being replaced by other characters:
sh1t
I have a different approach that deals with that.
Thank you in advance for your help.
Lets see if this regex works for you:
/\w(?:_|\W)+/
Alright, HamZa's answer worked. However I ran into a programmatic problem while working on the solution. When I was replacing just the words, I always knew the length of the word. So I knew exactly how many asterisks to replace it with. If I'm matching shit, I know I need to put 4 asterisks. But if I'm matching s[^a-z0-9]+h[^a-z0-9]+i[^a-z0-9]+t, I might catch s#h#i#t or I may catch s------h------i--------t. In both cases the length of the matched text will differ wildly from that of the pattern. How can I get the actual length of the matched string?
\bs[\W_]*h[\W_]*i[\W_]*t[\W_]*(?!\w)
matches characters between letters that aren't word characters or character _ or whitespace characters (also new line breaks)
\b (word boundary) ensures that Finish it won't match
(?!\w) ensures that sh ituuu won't match; you may want to remove or modify that, as s_hittt will not match either. \bs[\W_]*h[\W_]*i[\W_]*t+[\W_]*(?!\w) will match the word with a repeated last character
modification \bs[\W_]*h[\W_]*i[\W_]*t[\W_]*?(?!\w) will make the match of last character class not greedy and in sh it&&& only sh it will match
\bs[\W\d_]*h[\W\d_]*i[\W\d_]*t+[\W\d_]*?(?!\w) will match sh1i444t (digits between characters)
EDIT:
(?!\w) is a negative lookahead. It basically checks that your match is not followed by a word character (word characters are [A-Za-z0-9_]). It has a length of 0, which means it won't be included in the match. If you want to catch words like "shi*tface" you'll have to remove it.
( http://www.regular-expressions.info/lookaround.html )
A word boundary (\b) matches a place where a word starts or ends; its length is 0, which means that it matches between characters.
[\W] is a negated character class; I think it's equal to [^a-zA-Z0-9_] or [^\w].
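On the follow-up question about the length of the matched text: one possible approach, sketched here under assumptions, is to let a MatchEvaluator build the replacement from Match.Length. `BuildPattern` and `Mask` are hypothetical helpers, and the neutral word "darn" stands in for a real list entry. The trailing separator class is left out so that only the word itself is masked.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: turns "darn" into \bd[\W_]*a[\W_]*r[\W_]*n(?!\w).
string BuildPattern(string word) =>
    @"\b" + string.Join(@"[\W_]*", word.Select(c => Regex.Escape(c.ToString()))) + @"(?!\w)";

// Hypothetical helper: replaces every match with one asterisk per matched character.
string Mask(string text, string word) =>
    Regex.Replace(text, BuildPattern(word), m => new string('*', m.Length), RegexOptions.IgnoreCase);

Console.WriteLine(Mask("well d_a_r_n it", "darn"));
Console.WriteLine(Mask("D<>A<>R<>N!", "darn"));
Console.WriteLine(Mask("The garden is nice", "darn"));
```

Since the evaluator receives the actual Match, the number of asterisks follows the matched text and not the pattern, however many separator characters were used.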
You want to match words where each letter is separated with the identical non-word char(s).
You can use
\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b
See the regex demo. (I added (?!\n) to make the regex work for each line as if it were a separate string.) Details:
\b - word boundary
\p{L} - a letter
(?=([\W_]+)) - a positive lookahead that matches a location that is immediately followed with any non-word or _ char (captured into Group 1)
(?:\1\p{L})+ - one or more repetitions of a sequence of the same char captured into Group 1 and a letter
\b - word boundary.
To check if there is such a pattern in a string, you can use
var HasSpamWords = Regex.IsMatch(text, @"\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b");
To return all occurrences in a string, you can use
var results = Regex.Matches(text, @"\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b")
    .Cast<Match>()
    .Select(x => x.Value)
    .ToList();
See the C# demo.
Getting the length of each string is easy if you get Match.Length and use .Select(x => x.Length). If you need to get the length of the string with all special chars removed, simply use .Select(x => x.Value.Count(c => char.IsLetter(c))) (see this C# demo).

how to create regular expression based on some condition

I want to create a regular expression to find and replace uppercase characters based on some conditions.
Find the starting uppercase character of a group of uppercase characters in a string, and replace the group with lowercase, with a * before the starting uppercase character.
If there is any lowercase character following the uppercase one, replace the uppercase character with lowercase, with a * before it.
input string : stackOVERFlow
expected output : stack*over*flow
i tried but could not get it working perfectly.
Any idea on how to create a regular expression ?
Thanks
Well the expected inputs and outputs are slightly illogical: you're lower-casing the "f" in "flow" but not including it in the asterisk.
Anyway, the regex you want is pretty simple: @"[A-Z]+". This matches a string of one or more uppercase alpha characters, greedily. (A lazy [A-Z]+? would match each uppercase letter on its own, so every letter would get its own pair of asterisks.)
Now, to do the find/replace, you would do something like the following:
Regex.Replace(inputString, @"([A-Z]+)", "*$1*").ToLower();
This simply finds all occurrences of one or more uppercase alpha characters, and wherever it finds a match it replaces it with itself surrounded by asterisks. This does the surrounding but not the lowercasing; a replacement pattern string in .NET Regex doesn't provide for that kind of string modification. However, since the end result of the operation should be a string with all lowercase chars, just do exactly that with a ToLower() and you'll get the result (with the "f" inside the asterisks, as noted above).
KeithS's solution can be simplified a bit
Regex.Replace("stackOVERFlow","[A-Z]+","*$0*").ToLower()
However, this will yield stack*overf*low including the f between the stars. If you want to exclude the last upper case letter, use the following expression
Regex.Replace("stackOVERFlow","[A-Z]+(?=[A-Z])","*$0*").ToLower()
It will yield stack*over*flow
This uses the pattern find(?=suffix), which matches find only where it is immediately followed by suffix, without including suffix in the match.
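The variants discussed in this thread can be compared side by side. This is a sketch using the question's sample input; the variable names are illustrative, and the last line shows a MatchEvaluator as one assumed alternative when only the matched part should be lowercased.

```csharp
using System;
using System.Text.RegularExpressions;

var input = "stackOVERFlow";

// Greedy run of uppercase letters: the trailing F ends up inside the asterisks.
var greedy = Regex.Replace(input, "[A-Z]+", "*$0*").ToLower();

// Lookahead requires another uppercase letter after the match, so F is left out.
var trimmed = Regex.Replace(input, "[A-Z]+(?=[A-Z])", "*$0*").ToLower();

// Lazy quantifier: each uppercase letter is matched and wrapped on its own.
var lazy = Regex.Replace(input, "[A-Z]+?", "*$0*").ToLower();

// MatchEvaluator variant: lowercases only the matched text, leaving the rest as is.
var evaluated = Regex.Replace(input, "[A-Z]+(?=[A-Z])",
    m => "*" + m.Value.ToLowerInvariant() + "*");

Console.WriteLine(greedy);
Console.WriteLine(trimmed);
Console.WriteLine(lazy);
Console.WriteLine(evaluated);
```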

Regex - Get all words that are not wrapped with a "/"

I'm really trying to learn regex, so here goes.
I would really like to get all words in a string which do not have a "/" on either side.
For example, I need to do this to:
"Hello Great /World/"
I need to have the results:
"Hello"
"Great"
is this possible in regex, if so, how do I do it? I think i would like the results to be stored in a string array :)
Thank you
Just use this regular expression \b(?<!/)\w+(?!/)\b:
var str = "Hello Great /World/ /I/ am great too";
var words = Regex.Matches(str, @"\b(?<!/)\w+(?!/)\b")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();
This will get you:
Hello
Great
am
great
too
var newstr = Regex.Replace("Hello Great /World/", @"/(\w+?)/", "");
If you really want an array of strings:
var words = Regex.Matches(newstr, @"\w+")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();
I would first split the string into the array, then filter out matching words. This solution might also be cleaner than a big regexp, because you can spot the requirements for "word" and the filter better.
The big regexp solution would be something like word boundary - not a slash - many no-whitespaces - not a slash - word boundary.
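A minimal sketch of the split-then-filter idea, assuming words are separated by spaces and that a word is dropped when it has a slash on either side; the variable names are invented.

```csharp
using System;
using System.Linq;

var input = "Hello Great /World/";

// Split on spaces, then keep only the words that neither start nor end with a slash.
var words = input
    .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
    .Where(w => !w.StartsWith("/", StringComparison.Ordinal)
             && !w.EndsWith("/", StringComparison.Ordinal))
    .ToArray();

Console.WriteLine(string.Join(", ", words));
```

The filter condition states the requirement directly, which makes it easy to change if the definition of "wrapped" changes, for example to require a slash on both sides.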
I would use a regex replace to replace all /[a-zA-Z]+/ with '' (nothing), then get all words.
Try this one:
(\s(?<!/)([A-Za-z]+)(?!/))|((?<!/)([A-Za-z]+)(?!/)\s)
Using this example excerpt:
The /character/ "_" (underscore/under-strike) can be /used/ in /variable/ names /in/ many /programming/ /languages/, while the /character/ "/" (slash/stroke/solidus) is typically not allowed.
...this expression matches any string of letters, numbers, underscores, or apostrophes (a fairly typical idea of a "word" in English) that does not have a / character both before and after it, i.e. is not wrapped with "/":
\b([\w']+)\b(?<=(?<!/)\1|\1(?!/))
...and is the purest form, using only one character class to define "word" characters. It matches the example as follows:
Matched        Not Matched
-------------  -------------
The            character
_              used
underscore     variable
under          in
strike         programming
can            languages
be             character
in             stroke
names
many
while
the
slash
solidus
is
typically
not
allowed
If excluding /stroke/ is not desired, then adding a bit to the end limitation will allow it, depending upon how you want to define the beginning of a "next" word:
\b([\w']+)\b(?<=(?<!/)\1|\1(?!/([^\w]))).
changes (?!/) to (?!/([^\w])), which allows /something/ if it does have a letter, number, or underscore immediately after it. This would move stroke from the "Not Matched" to the "Matched" list, above.
note: \w matches uppercase or lowercase letters, numbers and the underscore character
If you want to alter your concept for "word" from the above, simply exchange the characters and shorthand character classes contained in the [\w'] part of the expression to something like [a-zA-Z'] to exclude digits or [\w'-] to include hyphens, which would capture under-strike as a single match, rather than two separate matches:
\b([\w'-]+)\b(?<=(?<!/)\1|\1(?!/([^\w])))
IMPORTANT ALTERNATIVE!!! (I think)
I just thought of an alternative to matching any words that are not wrapped with / symbols: simply consume all of these symbols and the words that are surrounded by them (splitting). This has a few benefits: no lookaround means this could be used in more contexts (older JavaScript engines do not support lookbehind and some flavors of regex don't support lookaround at all) while increasing efficiency; also, using a split expression means a direct result of a string array:
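string input = "The /character/ \"_\" (underscore/under-strike) can be..."; //etc...
// Non-capturing groups: Regex.Split would otherwise add the captured text to the result array.
string[] resultsArray = Regex.Split(input, @"(?:[^\w'-]+?(?:/[\w]+/)?)+");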
voila!
