Regular Expression Match for a Title - c#

I need to write a regular expression in C# to validate a title. Here are the requirements:
1. Title is required (length > 0);
2. Maximum 256 characters (length <= 256);
3. No character is forbidden, but a title consisting ONLY of whitespace is illegal;
4. No leading or trailing whitespace.
I already have this:
^.{1,256}$
So how can I meet rule 3?
EDIT:
Explained rule 3 more clearly;
I added rule 4 from Mario's answer.

I'd skip regular expressions completely, because you can hardcode the string cleanup and validation in a few simple steps:
Use String.Trim(null) to remove all leading/trailing whitespace.
Check the length of the remaining string (it must be between 1 and 256).
Uppercase the first character (if you want to).
This works because a title consisting only of whitespace is trimmed to length 0.
It also avoids titles such as " Let's go!".
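The trim-then-check steps above can be sketched roughly as follows. This is only a minimal illustration: the class name TitleCleaner, the method Clean and the maxLength parameter are hypothetical names, not part of the question, and the optional uppercasing step is left out.

```csharp
using System;

public static class TitleCleaner
{
    // Hypothetical helper: returns the trimmed title, or null if the
    // title is missing, whitespace-only, or longer than maxLength.
    public static string Clean(string raw, int maxLength = 256)
    {
        if (raw == null) return null;

        // Trim() removes all leading and trailing Unicode whitespace.
        string trimmed = raw.Trim();

        if (trimmed.Length == 0 || trimmed.Length > maxLength) return null;
        return trimmed;
    }
}
```

Note that this cleans the input up rather than rejecting it, so " Let's go!" comes back as "Let's go!". If leading or trailing whitespace must be reported as an error instead, compare the trimmed result with the original string.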

You need to use a zero-width assertion:
@"^(?=.*\S).{1,256}$"
(?=.*\S) is a lookahead: it succeeds only if a non-whitespace character occurs somewhere ahead, but it consumes no characters, so it does not affect the rest of the match.

Use the (?=pattern) lookahead:
@"^(?=.*\S).{1,256}$"
The (?=pattern) asserts that the specified pattern can be matched starting at this location, without consuming any characters.
So the regex matches if and only if, from the beginning of the string, .*\S can be matched and the whole string matches ^.{1,256}$
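As a quick check of how the lookahead version behaves, here is a small sketch (the class and method names are made up for illustration). It covers rules 1 to 3, but it still accepts leading and trailing whitespace, so it does not cover rule 4.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // The pattern from the answers above, written as a verbatim string.
    private static readonly Regex Title = new Regex(@"^(?=.*\S).{1,256}$");

    public static bool IsMatch(string s) => Title.IsMatch(s);
}
```

For example, IsMatch("abc") and IsMatch(" abc ") both succeed, while IsMatch(""), IsMatch("   ") and a 257-character string all fail.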

Although my own answer fits my question, the credit should still go to the other answerers (I upvoted them and accepted one as the correct answer), because I edited my question after they answered.
=====================
I finally came up with a pure regex solution (without any extra steps):
^(\S|\S.{0,254}\S)$
(though I don't understand why the parentheses () are important)
The following test cases pass:
[TestMethod]
public void CheckTitleTest()
{
    // Empty
    Assert.IsFalse(CheckTitle(@""));
    // A whitespace
    Assert.IsFalse(CheckTitle(@" "));
    // Multiple whitespace only
    // http://msdn.microsoft.com/en-us/library/t809ektx.aspx
    Assert.IsFalse(CheckTitle(" \t \n \u1680"));
    // Leading whitespaces
    Assert.IsFalse(CheckTitle(" \tabc"));
    // Trailing whitespaces
    Assert.IsFalse(CheckTitle("abc\t "));
    // Leading and trailing whitespaces
    Assert.IsFalse(CheckTitle(" \tabc\t "));
    // Too long: 257 characters
    Assert.IsFalse(CheckTitle(@"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/*"));
    // A normal title
    Assert.IsTrue(CheckTitle(@"This is a normal title"));
    Assert.IsTrue(CheckTitle(@"This is a normal title."));
    // 256 characters
    Assert.IsTrue(CheckTitle(@"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"));
    // A very simple title
    Assert.IsTrue(CheckTitle(@"A"));
    Assert.IsTrue(CheckTitle(@"!"));
    Assert.IsTrue(CheckTitle(@"\"));
}
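On why the parentheses matter: alternation (|) has the lowest precedence in a regex, so without the group the pattern is read as "^\S" OR "\S.{0,254}\S$", and each anchor applies to only one alternative. The sketch below (class and method names are hypothetical) compares the two forms.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupingDemo
{
    // With the group, both ^ and $ apply to whichever alternative matches.
    private static readonly Regex Grouped = new Regex(@"^(\S|\S.{0,254}\S)$");

    // Without the group this means: (^\S) or (\S.{0,254}\S$).
    private static readonly Regex Ungrouped = new Regex(@"^\S|\S.{0,254}\S$");

    public static bool IsValidGrouped(string s) => Grouped.IsMatch(s);
    public static bool IsValidUngrouped(string s) => Ungrouped.IsMatch(s);
}
```

With the ungrouped pattern, "abc " is accepted because "^\S" alone matches the leading "a", and " abc" is accepted because "\S.{0,254}\S$" matches "abc" at the end. The grouped pattern rejects both.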

Related

.NET Regular Expression white space special characters

This pattern is not working sometimes (it works only for the 3rd instance). The pattern is ^\s*flood\s{55}\s+\w+
I am new to regular expression and I am trying to write a regular expression that captures all the following conditions:
Example 1: flood a)
Example 2: flood As respects
Example 3: flood USD100,000
(it's in a tabular format and there's a lot of space between flood and the next word)
Your expression is saying:
^\s* The start of the string may have zero or more whitespace characters
flood followed by the string flood
\s{55} followed by exactly 55 whitespace characters
\s+\w+ followed by one or more whitespace characters and then one or more word characters.
If you want a minimum number of whitespace characters, say at least 30, followed by one or more word characters, then you could do this:
^\s*flood\s{30,}\w+
Try this:
string input =
@" flood a)
flood As respects
flood USD100,000";
string pattern = @"^\s*flood\s+.+$";
MatchCollection matches = Regex.Matches(input, pattern, RegexOptions.Multiline);
If there are a lot of spaces between flood and the next word you could omit \s{55} which is a quantifier that matches a whitespace character 55 times.
That would leave you with ^\s*flood\s+\w+ which does not yet match all the values at the end because \w matches a word character but not a whitespace or any of ),.
To match your values you might use a character class and add the characters that you allow to match:
^\s*flood\s+[\w,) ]+
Or if you want to match any character you could use a dot instead of a character class.
According to your comment, you might use a positive lookbehind:
(?<=\(13\. Deductible\))\s*(\s*flood\s+[\w,) ]+)+
Demo

C# Regular expression to match on a character not following pairs of the same character

Objective: Regex Matching
For this example I'm interested in matching a "|" pipe character.
I need to match it if it's alone: "aaa|aaa"
I need to match it (the last pipe) only if it's preceded by pairs of pipe: (2,4,6,8...any even number)
Another way: I want to ignore ALL pipe pairs "||" (right to left)
or I want to select bachelor bars only (the odd man out)
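string twomatches = "aaaaaaaaa|||||aaaaaa|||aaaaaa"; // the last pipe of each run (5 pipes, 3 pipes) should match
string onematch = "aaaaaaaaa|||aaaaaaa||aaaaaaaa"; // only the last pipe of the 3-pipe run should match
string noMatch1 = "||";
string noMatch2 = "||||";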
I'm trying to select the last "|" only when preceded by an even sequence of "|" pairs or in a string when a single bar exists by itself.
Regardless of the number of "|"
You may use the following regex to select just odd one pipe out:
(?<=(?<!\|)(?:\|{2})*)\|(?!\|)
See regex demo.
The regex breakdown:
(?<=(?<!\|)(?:\|{2})*) - if a pipe is preceded with an even number of pipes ((?:\|{2})* - 0 or more sequences of exactly 2 pipes) from a position that has no preceding pipe ((?<!\|))
\| - match an odd pipe on the right
(?!\|) - if it is not followed by another pipe.
Please note that this regex uses a variable-width look-behind and is very resource-consuming. I'd rather use a capturing group mechanism here, but it all depends on the actual purpose of matching that odd pipe.
Here is a modified version of the regex for removing the odd one out:
var s = "1|2||3|||4||||5|||||6||||||7|||||||";
var data = Regex.Replace(s, @"(?<!\|)(?<even_pipes>(?:\|{2})*)\|(?!\|)", "${even_pipes}");
Console.WriteLine(data);
See IDEONE demo. Here, the quantified part is moved from lookbehind to an even_pipes named capturing group, so that it could be restored with the backreference in the replaced string. Regexhero.net shows 129,046 iterations per second for the version with a capturing group and 69,206 with the original version with variable-width lookbehind.
Only use variable-width look-behind if it is absolutely necessary!
Oh, it's reopened! If you need better performance, also try this negative improved version.
\|(?!\|)(?<!(?:[^|]|^)(?:\|\|)*)
The idea here is to first match the last literal | at right side of a sequence or single | and execute a negated version of the lookbehind just after the match. This should perform considerably better.
\|(?!\|) matches literal | IF NOT followed by another pipe character (right most if sequence).
(?<!(?:[^|]|^)(?:\|\|)*) IF the position right after the matched | IS NOT preceded by (?:\|\|)* any number of literal || back to a non-| character or the ^ start. In other words: if this position is not preceded by an even number of pipe characters.
Btw, there is no performance gain in using \|{2} over \|\|, though it might be more readable.
See demo at regexstorm
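To compare the two patterns from the answers above, here is a small sketch that returns the index of every "odd one out" pipe. The class and method names are hypothetical. Both patterns rely on .NET's support for variable-width lookbehind.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class OddPipeDemo
{
    // Lookbehind first: an even run of pipes, starting where no pipe precedes it.
    private static readonly Regex LookbehindFirst =
        new Regex(@"(?<=(?<!\|)(?:\|{2})*)\|(?!\|)");

    // Match the last pipe of a run first, then reject it if an even
    // number of pipes ends at the position right after it.
    private static readonly Regex NegatedAfter =
        new Regex(@"\|(?!\|)(?<!(?:[^|]|^)(?:\|\|)*)");

    public static int[] Indexes(string s) => Collect(LookbehindFirst, s);
    public static int[] IndexesNegated(string s) => Collect(NegatedAfter, s);

    private static int[] Collect(Regex regex, string s) =>
        regex.Matches(s).Cast<Match>().Select(m => m.Index).ToArray();
}
```

For "aaa|aaa" both methods return the single index 3, and for "||" or "||||" both return an empty array.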

Regex for catching word with special characters between letters

I am new to regex, I'm programming an advanced profanity filter for a commenting feature (in C#). Just to save time, I know that all filters can be fooled, no matter how good they are, you don't have to tell me that. I'm just trying to make it a bit more advanced than basic word replacement. I've split the task into several separate approaches and this is one of them.
What I need is a specific piece of regex, that catches strings such as these:
s_h_i_t
s h i t
S<>H<>I<>T
s_/h_/i_/t
s***h***i***t
you get the idea.
I guess what I'm looking for is a regex that says "one or more characters that are not alphanumeric". This should include both spaces and all special characters that you can type on a standard (western) keyboard. If possible, it should also include line breaks, so it would catch things like
s
h
i
t
There should always be at least one of the characters present, to avoid likely false positives such as in
Finish it.
This will of course mean that things like
sh_it
will not be caught, but as I said, it doesn't matter, it doesn't have to be perfect. All I need is the regex, I can do the splitting of words and inserting the regex myself. I have the RegexOptions.IgnoreCase option set in my C# code, so character case in the actual word is not an issue. Also, this regex shouldn't worry about "leetspeak", i.e. some of the actual letters of the word being replaced by other characters:
sh1t
I have a different approach that deals with that.
Thank you in advance for your help.
Let's see if this regex works for you:
/\w(?:_|\W)+/
Alright, HamZa's answer worked. However I ran into a programmatic problem while working on the solution. When I was replacing just the words, I always knew the length of the word. So I knew exactly how many asterisks to replace it with. If I'm matching shit, I know I need to put 4 asterisks. But if I'm matching s[^a-z0-9]+h[^a-z0-9]+i[^a-z0-9]+t, I might catch s#h#i#t or I may catch s------h------i--------t. In both cases the length of the matched text will differ wildly from that of the pattern. How can I get the actual length of the matched string?
\bs[\W_]*h[\W_]*i[\W_]*t[\W_]*(?!\w)
[\W_]* matches any characters between the letters that are not letters or digits: non-word characters (including whitespace and line breaks) and the underscore
\b (word boundary) ensures that Finish it won't match
(?!\w) ensures that sh ituuu won't match, you may want to remove/modify that, as s_hittt will not match as well. \bs[\W_]*h[\W_]*i[\W_]*t+[\W_]*(?!\w) will match the word with repeated last character
modification \bs[\W_]*h[\W_]*i[\W_]*t[\W_]*?(?!\w) will make the match of last character class not greedy and in sh it&&& only sh it will match
\bs[\W\d_]*h[\W\d_]*i[\W\d_]*t+[\W\d_]*?(?!\w) will match sh1i444t (digits between characters)
EDIT:
(?!\w) is a negative lookahead. It checks that your match is not followed by a word character (word characters are roughly [A-Za-z0-9_]). It has a length of 0, which means it won't be included in the match. If you want to catch words like "shi*tface" you'll have to remove it.
( http://www.regular-expressions.info/lookaround.html )
A word boundary (\b) matches a place where a word starts or ends; its length is 0, which means that it matches between characters
[\W] is a negated character class; for ASCII-only input it is equivalent to [^a-zA-Z0-9_] or [^\w] (in .NET, \w also covers Unicode letters and digits unless RegexOptions.ECMAScript is used)
You want to match words where each letter is separated with the identical non-word char(s).
You can use
\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b
See the regex demo. (I added (?!\n) to make the regex work for each line as if it were a separate string.) Details:
\b - word boundary
\p{L} - a letter
(?=([\W_]+)) - a positive lookahead that matches a location that is immediately followed with one or more non-word or _ chars (captured into Group 1)
(?:\1\p{L})+ - one or more repetitions of the same text as captured into Group 1, followed by a letter
\b - word boundary.
To check if there is such a pattern in a string, you can use
var HasSpamWords = Regex.IsMatch(text, @"\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b");
To return all occurrences in a string, you can use
var results = Regex.Matches(text, @"\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b")
.Cast<Match>()
.Select(x => x.Value)
.ToList();
See the C# demo.
Getting the length of each string is easy if you get Match.Length and use .Select(x => x.Length). If you need to get the length of the string with all special chars removed, simply use .Select(x => x.Value.Count(c => char.IsLetter(c))) (see this C# demo).
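For the follow-up question about how many asterisks to use, one option is to let a MatchEvaluator build the replacement from Match.Length. The sketch below is only an illustration under some assumptions: the names ObfuscatedWordMasker and Mask are made up, "spam" stands in for a filtered word, and the pattern is built per word by joining its letters with [\W_]+ (at least one separator, as the question requires).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ObfuscatedWordMasker
{
    // Hypothetical helper: masks "word" when its letters are separated by
    // one or more non-alphanumeric characters, keeping the matched length.
    public static string Mask(string text, string word)
    {
        string separator = @"[\W_]+";
        string pattern = @"\b"
            + string.Join(separator, word.Select(c => Regex.Escape(c.ToString())))
            + @"(?!\w)";

        // m.Length is the length of the text actually matched,
        // so the asterisks cover exactly that span.
        return Regex.Replace(text, pattern,
            m => new string('*', m.Length), RegexOptions.IgnoreCase);
    }
}
```

For example, Mask("s_p_a_m here", "spam") gives "******* here", while the plain word "spam" is left alone because the pattern requires a separator between letters.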

C# Regex match on special characters

I know this stuff has been talked about a lot, but I'm having a problem trying to match the following...
Example input: "test test 310-315"
I need a regex expression that recognizes a number followed by a dash, and returns 310. How do I include the dash in the regex expression though. So the final match result would be: "310".
Thanks a lot - kcross
EDIT: Also, how would I do the same thing but with the dash preceding, but also take into account that the number following the dash could be a negative number... didnt think of this one when I wrote the question immediately. for example: "test test 310--315" returns -315 and "test 310-315" returns 315.
Regex regex = new Regex(@"\d+(?=\-)");
\d+ - Looks for one or more digits
(?=\-) - Makes sure it is followed by a dash
The @ just eliminates the need to escape the backslashes to keep the compiler happy.
Also, you may want this instead:
\d+(?=\-\d+)
This will check for a one or more numbers, followed by a dash, followed by one or more numbers, but only match the first set.
In response to your comment, here's a regex that will check for a number following a -, while accounting for potential negative (-) numbers:
Regex regex = new Regex(@"(?<=\-)\-?\d+");
(?<=\-) - Positive lookbehind which will check and make sure there is a preceding -
\-? - Checks for either zero or one dashes
\d+ - One or more digits
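Both lookaround patterns can be tried against the inputs from the question with a short sketch like this one. The class and method names are hypothetical, and the dash is left unescaped because it has no special meaning outside a character class.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DashNumbers
{
    // Digits that are immediately followed by a dash.
    public static string Before(string s) => Regex.Match(s, @"\d+(?=-)").Value;

    // An optionally negative number that is immediately preceded by a dash.
    public static string After(string s) => Regex.Match(s, @"(?<=-)-?\d+").Value;
}
```

With "test test 310-315", Before gives "310" and After gives "315". With "test test 310--315", After gives "-315".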
(?'number'\d+)- will work ( no need to escape ). In this example the group containing the single number is the named group 'number'.
If you want to match both groups with an optional sign, try:
@"(?'first'-?\d+)-(?'second'-?\d+)"
See it working here.
Just to describe, nothing complicated, just using -? to match an optional - and \d+ to match one or more digit. a literal - match itself.
here's some documentation that I use:
http://www.mikesdotnetting.com/Article/46/CSharp-Regular-Expressions-Cheat-Sheet
in the comments section of that page, it suggests escaping the dash with '\-'
make sure you escape your escape character \
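The hyphen has a special meaning (range) only inside a character class such as [a-z]; elsewhere it matches itself, although escaping it with a backslash (\-) is harmless. Since the backslash also has a special meaning in regular C# string literals (to escape quotes or form certain characters), you need to escape it with another backslash there. So in a regular string literal the pattern would be "\\d+\\-", or @"\d+\-" as a verbatim string.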
\b\d*(?=\-) you will want to look ahead for the dash
\b = start at a word boundary
\d = match any decimal digit
* = match the previous as many times as needed
(?=\-) = look ahead for the dash
Edited for Formatting issue with the slash not showing after posting

regular expression ".*[^a-zA-Z0-9_].*"

As I am trying to read more about regular expressions in C#, I just want to make sure of my conclusion that I made.
for the following expression ".*[^a-zA-Z0-9_].* ", the " .* " at the beginning and end are useless, is that right ? because as I understood, that ".*" means zero or more occurrence of any character, but being followed by "[^a-zA-Z0-9_]" which means any character other than any combination of letters and digits case insensitive, makes ".*" useless to be added before and after "[^a-zA-Z0-9_]", is that right ?
Here is the code I am using to check if the expressions matches
// Here we call Regex.Match.
Match match = Regex.Match("anytest#", ".*[^a-z A-Z0-9_].*");
//Match match = Regex.Match("anytest#", "[^a-z A-Z0-9_]");
// Here we check the Match instance.
if (match.Success)
Console.WriteLine("error");
else
Console.WriteLine("no error");
.*[^a-zA-Z0-9_].* will match the entire input (up to any line break) as long as there is a non-alphanumeric/underscore character somewhere in the input. [^a-zA-Z0-9_] on its own will match only a single such character (the first one, since the engine scans from left to right) if one is present. Which one you want depends on the input and on what you want to do once you know that such a character exists.
The only difference would be whether the "margin characters" will be included in the result or not.
For:
ab41--_71j
It will match the whole input:
ab41--_71j
And without the .* at beginning and end it will match only the first such character:
-
Any string will match the .*[^a-zA-Z0-9_].* regex at least once as long as it has at least one character that isn't a-zA-Z0-9_
From your most recent comment, I understand that you actually use:
^[a-zA-Z0-9]*$
This will match only if all characters are digit/letters.
If it doesn't match, then the string is invalid.
If you also want to allow the _ character, then use:
^[a-zA-Z0-9_]*$
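Which can even be shortened to the following (note that in .NET, \w also matches non-ASCII letters and digits unless RegexOptions.ECMAScript is set):
^\w*$
In general, it is better to write a regex that validates strings rather than one that invalidates them. It just makes more sense and is more intuitive.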
So my validation would look like:
if (Regex.IsMatch("anytest#", "^\\w*$"))
{
Console.WriteLine("Success");
}
else
{
Console.WriteLine("Error");
}
Another option that is probably faster:
if ("anytest#".ToCharArray().All(c => char.IsLetterOrDigit(c) || c == '_'))
{
Console.WriteLine("Success");
}
else
{
Console.WriteLine("Error");
}
And if you don't want '_' to be included, it can even look nicer;
if ("anytest#".ToCharArray().All(char.IsLetterOrDigit))
{
Console.WriteLine("Success");
}
else
{
Console.WriteLine("Error");
}
No, because there are characters other than a-z, A-Z and 0-9.
That regex matches any string that contains at least one character other than a-zA-Z0-9_, together with whatever surrounds that character on the same line.
If you leave out the .* then you have a regex that matches just a single character that is not one of a-zA-Z0-9_.
.*[^a-zA-Z0-9_].* matches for instance: ABC_ß_ABC
[^a-zA-Z0-9_] matches for instance: ß (and this regex just matches 1 character)
Input 1 : ABC_ß_ABC
Input 2 : ß
Regex 1: .*[^a-zA-Z0-9_].*
Regex 2: [^a-zA-Z0-9_]
Both the inputs match both the regex,
For input 1
Regex 1 matches 9 characters
Regex 2 matches only 1 character
Only include those tokens in the Regex that you are actually looking for. In your case you didn't actually care whether there are any other characters before or after the excluding character class you specified. Adding .* before and after that doesn't change the success of the match, but makes matching more complicated. A Regex matches anywhere already, unless you specifically anchor it somehow, e.g. using ^ at the start.
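The difference in what gets matched (as opposed to whether there is a match) can be checked with a small sketch like this. The class and method names are hypothetical, and \u00DF is the ß character from the example above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchExtentDemo
{
    // Value is the matched text, or an empty string when there is no match.
    public static string WithDotStar(string input) =>
        Regex.Match(input, ".*[^a-zA-Z0-9_].*").Value;

    public static string ClassOnly(string input) =>
        Regex.Match(input, "[^a-zA-Z0-9_]").Value;
}
```

For "ABC_\u00DF_ABC", WithDotStar returns all 9 characters while ClassOnly returns just "\u00DF". Both succeed or fail on exactly the same inputs, so for a pure yes/no check the shorter pattern is enough.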
