Regex dealing with dots (.) within words - c#

I'm having a hard time with a regular expression.
Its only requirement is that if there is a dot (.) in the word, there must be a letter on either side of the dot. There can be any number of dots in the word and any number of letters in between the dots. There just has to be a letter on either side of a dot.
I have mostly figured it out, but I am having an issue with dots that are separated by only one letter (see the example below).
Currently I have this expression:
^(\s*[0-9A-Za-z]{1,}[.]{0,1}[0-9A-Za-z]{1,}\s*)+$
this works for the following:
dot.InWord
Multiple.dots.In.Word
d.ot.s
t.wo.Le.tt.er.sB.et.we.en.do.ts
However, this does not work for words in which the dots are separated by only one letter, as follows:
d.o.t.s.O.n.l.y.S.e.p.e.r.a.t.e.d.B.y.O.n.e.L.e.t.t.e.r
Anyone know how I could solve this?
EDIT:
BHustus's solution below is the better solution.
However, I did take what BHustus has shown and combined it with what I had before to come up with a less "confusing" pattern, just in case anyone else is interested.
^(\s*[\d\w]+([.]?[\d\w]+)+\s*)+$
The key was to put the dot and the word after it in their own group and repeat that group: ([.]?[\d\w]+)+
Thanks.

([\w]+\.)+[\w]+(?=[\s]|$)
To explain:
The first group, in the parentheses, matches one or more word characters (\w is shorthand for a letter, digit, or underscore, roughly [A-Za-z0-9_]; + means "match the preceding one or more times", shorthand for {1,}), followed by one period. After it has matched one or more cycles of [\w]+\., the final [\w]+ ensures that there is at least one word character at the end and consumes all characters until it reaches a non-word character. Finally, (?=[\s]|$) is a lookahead assertion that ensures there is either whitespace immediately ahead ([\s]) or the end of the string ($), with | being the regex "OR" operator. If the lookahead fails, the pattern doesn't match.
Online demo, showing all your test cases
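To try the pattern from C#, something like the following sketch could be used. The class and method names (DottedWords, FindDottedWord) are made up for illustration; only the pattern comes from the answer. Note that the pattern is not anchored at the start, so it finds a dotted word inside a longer string rather than validating the whole string, and it requires at least one dot.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DottedWords
{
    // Pattern from the answer above; the names in this class are hypothetical.
    private static readonly Regex Pattern =
        new Regex(@"([\w]+\.)+[\w]+(?=[\s]|$)");

    // Returns the first dotted word found in the input, or null if there is none.
    public static string FindDottedWord(string input)
    {
        Match m = Pattern.Match(input);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        string[] samples =
        {
            "dot.InWord",
            "d.o.t.s.O.n.l.y",
            "trailing.",   // dot with no letter after it
            "noDots"       // the pattern requires at least one dot
        };
        foreach (string s in samples)
            Console.WriteLine("{0} -> {1}", s, FindDottedWord(s) ?? "(no match)");
    }
}
```

If whole-string validation is wanted, wrapping the pattern in ^ and $ (and dropping the lookahead) would be one way to adapt it.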

Do you have to use a Regex? The accepted answer's Regex is pretty difficult to read. How about a simple loop?
// The method name is illustrative; the body is the loop itself.
static bool IsValidDottedWord(string str)
{
    for (int i = 0; i < str.Length; i++)
    {
        char ch = str[i];
        if (ch == '.')
        {
            if (i == 0) return false; // no dots at start of string
            if (i == str.Length - 1) return false; // no dots at end of string
            if (str[i + 1] == '.') return false; // no consecutive dots
        }
        else if (!char.IsLetter(ch) && !char.IsNumber(ch))
        {
            return false; // allow only letters and numbers
        }
    }
    return true;
}

Related

How to remove multiple, repeating & unnecessary punctuation from string in C#?

Considering strings like this:
"This is a string....!"
"This is another...!!"
"What is this..!?!?"
...
// There are LOTS of examples of weird/angry sentence-endings like the ones above.
I want to replace the unnecessary punctuation at the end to make it look like this:
"This is a string!"
"This is another!"
"What is this?"
What I basically do is:
- split by space
- check if the last char in the string is a punctuation mark
- start replacing with the patterns below
I have tried a very big ".Replace(string, string)" function, but it does not work - there has to be a simpler regex I guess.
Documentation:
Returns a new string in which all occurrences of a specified string in the current instance are replaced with another specified string.
As well as:
Because this method returns the modified string, you can chain together successive calls to the Replace method to perform multiple replacements on the original string.
Something is wrong here.
EDIT: ALL the proposed solutions work fine! Thank you very much!
This one was the best suited solution for my project:
Regex re = new Regex("[.?!]*(?=[.?!]$)");
string output = re.Replace(input, "");
Your solution almost works (demo); the only issue arises when the same sequence could be matched starting at different spots. For example, ..!?!? from your last line is not part of the substitution list, so ..!? and !? get replaced by two separate matches, producing ?? in the output.
It looks like your strategy is pretty straightforward: in a chain of multiple punctuation characters the last character wins. You can use regular expressions to do the replacement:
[!?.]*([!?.])
and replace it with $1, i.e. the capturing group that has the last character:
string s;
while ((s = Console.ReadLine()) != null) {
s = Regex.Replace(s, "[!?.]*([!?.])", "$1");
Console.WriteLine(s);
}
Demo
Simply
[.?!]*(?=[.?!]$)
should do it for you. Like
Regex re = new Regex("[.?!]*(?=[.?!]$)");
Console.WriteLine(re.Replace("This is a string....!", ""));
This replaces all trailing punctuation characters except the last one with nothing.
[.?!]* matches any number of consecutive punctuation characters, and the (?=[.?!]$) is a positive lookahead making sure it leaves one at the end of the string.
See it here at ideone.
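The two regex answers differ in scope, which the sketch below tries to show side by side. The class and method names (PunctuationCleanup, CollapseRuns, TrimTrailing) are made up for illustration; the patterns are the ones quoted above. The first collapses every run of punctuation anywhere in the string, while the second only touches the run at the very end.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PunctuationCleanup
{
    // Collapses every run of . ? ! to its last character, anywhere in the string.
    public static string CollapseRuns(string input)
    {
        return Regex.Replace(input, "[!?.]*([!?.])", "$1");
    }

    // Removes all but the last punctuation character, only at the end of the string.
    public static string TrimTrailing(string input)
    {
        return Regex.Replace(input, "[.?!]*(?=[.?!]$)", "");
    }

    public static void Main()
    {
        string[] samples =
        {
            "This is a string....!",
            "What is this..!?!?",
            "Wait... what?!"   // has a run in the middle as well as at the end
        };
        foreach (string s in samples)
            Console.WriteLine("{0} | {1} | {2}", s, CollapseRuns(s), TrimTrailing(s));
    }
}
```

Which behaviour is preferable depends on whether an ellipsis in the middle of a sentence should survive; the question only shows punctuation at the end, so both give the same result on its examples.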
Or you can do it without regular expressions:
string TrimPuncMarks(string str)
{
    HashSet<char> punctMarks = new HashSet<char>() {'.', '!', '?'};
    int i = str.Length - 1;
    for (; i >= 0; i--)
    {
        if (!punctMarks.Contains(str[i]))
            break;
    }
    // the very last punctuation mark, or null if the string does not end with one
    char? suffix = i < str.Length - 1 ? str[str.Length - 1] : (char?)null;
    return str.Substring(0, i + 1) + suffix;
}
Debug.Assert("What is this?" == TrimPuncMarks("What is this..!?!?"));
Debug.Assert("What is this" == TrimPuncMarks("What is this"));
Debug.Assert("What is this." == TrimPuncMarks("What is this."));

How to select first sentence in a piece of text using regular expression?

My task is to select first sentence from a text (I'm writing in C#). I suppose that the most appropriate way would be using regex but some troubles occurred. What regex pattern should I use to select the first sentence?
Several examples:
Input: "I am a lion and I want to be free. Do you see a lion when you look inside of me?" Expected result: "I am a lion and I want to be free."
Input: "I drink so much they call me Charlie 4.0 hands. Any text." Expected result: "I drink so much they call me Charlie 4.0 hands."
Input: "So take out your hands and throw the H.U. up. 'Now wave it around like you don't give a fake!'" Expected result: "So take out your hands and throw the H.U. up."
The third is really confusing me.
Since you already provided some assumptions:
- sentences are divided by whitespace
- the task is to select the first sentence
You can use the following regex:
^.*?[.?!](?=\s+(?:$|\p{P}*\p{Lu}))
See RegexStorm demo
Regex breakdown:
^ - start of string (thus, only the first sentence will be matched)
.*? - any number of characters, as few as possible (use RegexOptions.Singleline to also match a newline with .)
[.?!] - a final punctuation symbol
(?=\s+(?:$|\p{P}*\p{Lu})) - a look-ahead making sure the punctuation is followed by 1 or more whitespace symbols (\s+) and then either the end of string ($) or optional punctuation (\p{P}*) and a capital letter (\p{Lu}).
UPDATE:
Since it turns out you can have single sentence input, and your sentences can start with any letter or digit, you can use
^.*?[.?!](?=\s+\p{P}*[\p{Lu}\p{N}]|\s*$)
See another demo
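A minimal sketch of using the updated pattern from C# against the three inputs in the question. The class and method names (FirstSentenceDemo, Extract) are made up, and returning the whole input when nothing matches is an assumption rather than part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirstSentenceDemo
{
    // Updated pattern from the answer above; Singleline lets . match newlines too.
    private static readonly Regex FirstSentence = new Regex(
        @"^.*?[.?!](?=\s+\p{P}*[\p{Lu}\p{N}]|\s*$)",
        RegexOptions.Singleline);

    // Returns the first sentence, or the whole input when no boundary is found
    // (that fallback is an assumption made for this sketch).
    public static string Extract(string text)
    {
        Match m = FirstSentence.Match(text);
        return m.Success ? m.Value : text;
    }

    public static void Main()
    {
        string[] inputs =
        {
            "I am a lion and I want to be free. Do you see a lion when you look inside of me?",
            "I drink so much they call me Charlie 4.0 hands. Any text.",
            "So take out your hands and throw the H.U. up. 'Now wave it around like you don't give a fake!'"
        };
        foreach (string input in inputs)
            Console.WriteLine(Extract(input));
    }
}
```

The dots in "4.0" and "H.U." are skipped because they are not followed by whitespace plus an uppercase letter or digit, which is exactly the assumption the pattern encodes.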
I came up with a regular expression that uses lots of negative look-aheads to exclude certain cases, e.g. a punctuation mark must not be followed by a lowercase character, or a dot before a capital letter does not close a sentence. This splits the text into its separate sentences. If you are given a text, just take the first match.
[\s\S]*?(?![A-Z]+)(?:\.|\?|\!)(?!(?:\d|[A-Z]))(?! [a-z])/gm
Sentence separators can be found with the following scanner:
- If the character is a sentence finisher (one of [.!?]),
  - it must be followed by a space, or by an allowed sequence of characters and then a space:
    - a sequence of '.' after '.' (A sentence...)
    - a sequence of '!' and/or '?' after '!' or '?' (Exclamation here!?)
  - the space must then be followed by one of:
    - a capital letter (ignoring quotes, if any)
    - a digit, which must itself be followed by a lowercase letter or another sentence finisher
    - a dialog-starter character (Blah blah blah... - And what next, Elric?)
Tip: don't forget to append an extra space character to the input string.
Upd:
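A rough C# sketch of that scanner (method names are illustrative):
using System;
using System.Collections.Generic;

static string FirstSentence(string inputString)
{
    char[] finishers = { '.', '!', '?' };
    // Punctuation runs that may directly follow each finisher
    var allowedSequences = new Dictionary<char, string[]>
    {
        ['.'] = new[] { ".." },
        ['!'] = new[] { "!!", "?" },
        ['?'] = new[] { "??", "!" }
    };
    string input = inputString + " "; // the extra space from the tip above
    int start = 0;
    while (start < input.Length)
    {
        int pos = input.IndexOfAny(finishers, start);
        if (pos < 0)
            break; // no finisher left
        int p = pos + 1; // safe: the appended space follows any finisher
        if (input[p] != ' ')
        {
            // Finisher followed by an allowed run, e.g. "..." or "!?"
            string match = TestSequence(input.Substring(p), allowedSequences[input[pos]]);
            if (match != null)
                return input.Substring(0, p + match.Length);
            start = p; // e.g. the dot in "4.0": keep scanning
            continue;
        }
        p++; // first character after the space
        while (p < input.Length && (input[p] == '\'' || input[p] == '"'))
            p++; // ignore quotes
        if (p < input.Length)
        {
            char next = input[p];
            // a digit only starts a sentence if a lowercase letter or finisher follows it
            bool digitStart = char.IsDigit(next) && p + 1 < input.Length
                && (char.IsLower(input[p + 1]) || Array.IndexOf(finishers, input[p + 1]) >= 0);
            if (char.IsUpper(next) || digitStart || next == '-')
                return input.Substring(0, pos + 1);
        }
        start = pos + 1;
    }
    return inputString; // no sentence boundary found
}

static string TestSequence(string str, string[] sequences)
{
    foreach (string sequence in sequences)
        if (str.StartsWith(sequence, StringComparison.Ordinal))
            return sequence;
    return null;
}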

Regex Lookahead and lookbehind at most one digit

I'm looking to create a regex pattern with these requirements:
- 8 characters from [a-zA-Z] plus digits
- must contain exactly one digit, at any position in the string
I created this pattern:
^(?=.*[0-9].*[0-9])[0-9a-zA-Z]{8}$
This pattern works, but I want only one digit to be allowed. Examples:
aaaaaaa6 match
aaa7aaaa match
aaa88aaa don't match
aaa884aa don't match
aaawwaaa don't match
You could instead use:
^(?=[0-9a-zA-Z]{8}$)[^\d]*\d[^\d]*$
The first part asserts that the whole string consists of exactly 8 letters or digits. Once this is ensured, the second part ensures that there is exactly one digit in the match.
EDIT: Explanation:
The anchors ^ and $ denote the start and end of the string.
(?=[0-9a-zA-Z]{8}$) asserts that the string consists of exactly 8 letters or digits; the $ inside the lookahead stops longer inputs, or inputs with other trailing characters, from passing.
[^\d]*\d[^\d]* implies that there is exactly one digit character and that the remaining characters are non-digits. Since the lookahead has already asserted that the input contains only digits and letters, the non-digit characters here are letters.
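A minimal C# sketch of this check, run against the examples from the question. It assumes the intent is "exactly 8 letters or digits, exactly one of them a digit", so the lookahead is closed with $. The class and method names (OneDigitCheck, IsValid) are made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OneDigitCheck
{
    // [0-9] is used instead of \d because \d in .NET also matches non-ASCII digits
    // unless RegexOptions.ECMAScript is set.
    private static readonly Regex Pattern =
        new Regex(@"^(?=[0-9a-zA-Z]{8}$)[^0-9]*[0-9][^0-9]*$");

    public static bool IsValid(string s)
    {
        return Pattern.IsMatch(s);
    }

    public static void Main()
    {
        string[] samples = { "aaaaaaa6", "aaa7aaaa", "aaa88aaa", "aaa884aa", "aaawwaaa", "aaaaaaa6x" };
        foreach (string s in samples)
            Console.WriteLine("{0}: {1}", s, IsValid(s));
    }
}
```

The last sample is nine characters long; it is there to show why the end anchor inside the lookahead matters.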
If you want a non regex solution, I wrote this for a small project :
public static bool ContainsOneDigit(string s)
{
    if (String.IsNullOrWhiteSpace(s) || s.Length != 8)
        return false;
    int nb = 0;
    foreach (char c in s)
    {
        if (!Char.IsLetterOrDigit(c))
            return false;
        if (c >= '0' && c <= '9') // just thought, I could use Char.IsDigit() here ...
            nb++;
    }
    return nb == 1;
}

Regex for alphanumeric, at least 1 number and special chars

I am trying to find a regex which will give me the following validation:
The string should contain at least 1 digit and at least 1 special character. Alphanumeric characters are allowed.
I tried the following but this fails:
@"^[a-zA-Z0-9@#$%&*+\-_(),+':;?.,!\[\]\s\\/]+$]"
I tried "password1$" but that failed.
I also tried "Password1!" but that also failed.
Any ideas?
UPDATE
Need the solution to work with C# - currently the suggestions posted as of Oct 22 2013 do not appear to work.
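Try this:
Regex rxPassword = new Regex( @"
    ^                  # start-of-line, followed by
    (?=.*[0-9])        # a lookahead requiring at least one decimal digit
    (?=.*[!@\#])       # a lookahead requiring at least one of the special characters
    (?=.*[a-zA-Z])     # a lookahead requiring at least one upper- or lower-case letter
    [a-zA-Z0-9!@\#]+   # a sequence of one or more characters drawn from the set consisting of ASCII letters, digits or the punctuation characters ! @ and #
    $                  # followed by end-of-line
    " , RegexOptions.IgnorePatternWhitespace ) ;
The construct (?=regular-expression) is a zero-width positive lookahead assertion. (The look-behind form is written (?<=regular-expression); placed after the character class it would only test the characters at the end of the string, so lookaheads at the start are used here.) The # inside the character classes is escaped to be safe, because RegexOptions.IgnorePatternWhitespace treats an unescaped # as the start of a comment.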
Sometimes it's a lot simpler to do things one step at a time. The static constructor builds the escaped character class characters from a simple list of allowed special characters. The built-in Regex.Escape method doesn't work here.
public static class PasswordValidator {
    private const string ALLOWED_SPECIAL_CHARS = @"@#$%&*+_()':;?.,![]\-";
    private static string ESCAPED_SPECIAL_CHARS;
    static PasswordValidator() {
        var escapedChars = new List<char>();
        foreach (char c in ALLOWED_SPECIAL_CHARS) {
            if (c == '[' || c == ']' || c == '\\' || c == '-')
                escapedChars.AddRange(new[] { '\\', c });
            else
                escapedChars.Add(c);
        }
        ESCAPED_SPECIAL_CHARS = new string(escapedChars.ToArray());
    }
    public static bool IsValidPassword(string input) {
        // Length requirement?
        if (input.Length < 8) return false;
        // First just check for a digit
        if (!Regex.IsMatch(input, @"\d")) return false;
        // Then check for special character
        if (!Regex.IsMatch(input, "[" + ESCAPED_SPECIAL_CHARS + "]")) return false;
        // Require a letter?
        if (!Regex.IsMatch(input, "[a-zA-Z]")) return false;
        // DON'T allow anything else:
        if (Regex.IsMatch(input, @"[^a-zA-Z\d" + ESCAPED_SPECIAL_CHARS + "]")) return false;
        return true;
    }
}
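On the remark that Regex.Escape doesn't work here: as far as I know, Regex.Escape leaves ] and - untouched, and both are significant inside a character class. The sketch below illustrates what goes wrong if a class is built that way; the class and method names (EscapeDemo, NaiveClass) are made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Builds a character class the naive way, relying on Regex.Escape alone.
    public static string NaiveClass(string allowed)
    {
        return "[" + Regex.Escape(allowed) + "]";
    }

    public static void Main()
    {
        Console.WriteLine(Regex.Escape("["));  // the opening bracket gets a backslash
        Console.WriteLine(Regex.Escape("]"));  // the closing bracket is left as-is
        Console.WriteLine(Regex.Escape("-"));  // the hyphen is left as-is

        // "[" + Escape("[]") + "]" is a class containing only '[' followed by a literal ']'
        string brackets = NaiveClass("[]");
        Console.WriteLine(Regex.IsMatch("]", brackets));

        // "[" + Escape("a-z") + "]" is the range a..z, not the three characters a, -, z
        string range = NaiveClass("a-z");
        Console.WriteLine(Regex.IsMatch("-", range));
        Console.WriteLine(Regex.IsMatch("m", range));
    }
}
```

That is why the static constructor above escapes [, ], \ and - by hand.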
This may work. There are two possibilities: the digit comes before the special character, or the digit comes after it. You should use DOTALL (so that the dot matches every character); in .NET this is RegexOptions.Singleline.
^((.*?[0-9].*?[@#$%&*+\-_(),+':;?.,!\[\]\s\\/].*)|(.*?[@#$%&*+\-_(),+':;?.,!\[\]\s\\/].*?[0-9].*))$
This worked for me:
@"(?=^[!@#$%\^&*()_-+=[{]};:<>|./?a-zA-Z\d]{8,}$)(?=([!@#$%\^&*()_-+=[{]};:<>|./?a-zA-Z\d]\W+){1,})(?=[^0-9][0-9])[!@#$%\^&*()_-+=[{]};:<>|./?a-zA-Z\d]*$"
alphanumeric, at least 1 numeric, and special character with a min length of 8
This should do the job:
(?:(?=.*[0-9]+)(?=.*[a-zA-Z]+)(?=.*[@#$%&*+\-_(),+':;?.,!\[\]\s\\/]+))+
Tested with JavaScript; I'm not sure about C#, so it may need some small adjustments.
What it does is use positive lookaheads to find the required elements of the password.
EDIT
The regular expression is only meant to test whether there is a match. Since all the patterns are lookaheads, no characters are consumed and the match is empty, but if the expression matches, the password is valid.
But since the question is about C# (sorry, I don't know C#, I'm just improvising and adapting samples):
string input = "password1!";
string pattern = @"^(?:(?=.*[0-9]+)(?=.*[a-zA-Z]+)(?=.*[@#$%&*+\-_(),+':;?.,!\[\]\s\\/]+))+.*$";
Regex rgx = new Regex(pattern, RegexOptions.None);
MatchCollection matches = rgx.Matches(input);
if (matches.Count > 0) {
Console.WriteLine("{0} ({1} matches):", input, matches.Count);
foreach (Match match in matches)
Console.WriteLine(" " + match.Value);
}
With the start-of-line anchor added and .*$ appended, the expression will match if the password is valid, and the match value will be the password (I guess).

Do not match opening and closing parenthesis when a character sequence appears in middle

Got an interesting problem here for everyone to consider:
I am trying to parse and tokenize strings delimited by a "/" character, but only when the "/" is not between parentheses.
For instance:
Root/Branch1/branch2/leaf
Should be tokenized as: "Root", "Branch1", "branch2", "leaf"
Root/Branch1(subbranch1/subbranch2)/leaf
Should be tokenized as: "Root", "Branch1(subbranch1/subbranch2)", "leaf"
Root(branch1/branch2) text (branch3/branch4) text/Root(branch1/branch2)/Leaf
Should be tokenized as: "Root(branch1/branch2) text (branch3/branch4) text", "Root(branch1/branch2)", "Leaf".
I came up with the following expression which works great for all cases except ONE!
([^/()]*\((?<=\().*(?=\))\)[^/()]*)|([^/()]+)
The only case where this does not work is the following test condition:
Root(branch1/branch2)/SubRoot/SubRoot(branch3/branch4)/Leaf
This should be tokenized as: "Root(branch1/branch2)", "SubRoot", "SubRoot(branch3/branch4)", "Leaf"
The result I get instead consists of only one group that matches the whole line so it is not tokenizing it at all:
"Root(branch1/branch2)/SubRoot/SubRoot(branch3/branch4)/Leaf"
What is happening here is that because Regex is greedy it is matching the left most opening parenthesis "(" with the last closing parenthesis ")" instead of just knowing to stop at its appropriate delimiter.
Any of you Regex gurus out there can help me figure out how to add a small Regex piece to my existing expression to handle this additional case?
Root(branch1/branch2) Test (branch3/branch4)/SubRoot/SubRoot(branch5/branch6)/Leaf
Should be tokenized into groups as:
"Root(branch1/branch2) Test (branch3/branch4)"
"SubRoot"
"SubRoot(branch5/branch6)"
"Leaf"
List<string> Tokenize(string strInput)
{
    var sb = new StringBuilder();
    var tokens = new List<string>();
    bool inParens = false;
    foreach (char c in strInput)
    {
        if (c == '/' && !inParens)
        {
            // a slash outside parentheses ends the current token
            tokens.Add(sb.ToString());
            sb.Length = 0;
            continue;
        }
        if (c == '(')
            inParens = true;
        else if (c == ')')
            inParens = false;
        sb.Append(c); // keep the parentheses and their contents in the token
    }
    if (sb.Length > 0)
        tokens.Add(sb.ToString());
    return tokens;
}
That's untested but it should work. (and will almost certainly be much faster than the regex)
Different approach, trying to avoid costly look-around assertions...
/(\(.+?\)|[^\/(]+)+/
With some comments...
/
( # group things to be captured
\(.+?\) # 1 or more of anything in (escaped) brackets, un-greedily
| # or ...
[^\/(]+ # 1 or more, not slash, and not open bracket characters
)+ # repeat until done...
/
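A sketch of how this pattern might be applied from C# with Regex.Matches. The surrounding / delimiters are dropped and the slash is left unescaped, since .NET patterns are plain strings; the class and method names (SlashTokenizer, Tokenize) are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SlashTokenizer
{
    // Same pattern as above, without the JavaScript-style delimiters.
    private static readonly Regex Token = new Regex(@"(\(.+?\)|[^/(]+)+");

    // Each match is one token; slashes outside parentheses are skipped between matches.
    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        foreach (Match m in Token.Matches(input))
            tokens.Add(m.Value);
        return tokens;
    }

    public static void Main()
    {
        string input = "Root(branch1/branch2)/SubRoot/SubRoot(branch3/branch4)/Leaf";
        foreach (string token in Tokenize(input))
            Console.WriteLine(token);
    }
}
```

Because \(.+?\) stops at the first closing parenthesis, this handles several parenthesised groups per token but not nested ones.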
The following uses balanced groups to capture each matching item with Regex.Matches, ensuring the closing / isn't matched when the brackets before it haven't balanced:
(?<=^|/)((?<br>\()|(?<-br>\))|[^()])*?(?(br)(?!))(?=$|/)
Bizarrely, this seems to perform similarly to Billy Moon's much simpler answer, even though this is overengineered (supporting multiple, possibly nested sets of brackets per token).
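For anyone who wants to experiment with the match-based pattern above, a minimal sketch follows. The class and method names (BalancedTokenizer, Tokenize) are made up, and the nested input in Main is my own example rather than one from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BalancedTokenizer
{
    // Balancing-group pattern from above: "br" is pushed on '(' and popped on ')',
    // and (?(br)(?!)) rejects a token end while any '(' is still open.
    private static readonly Regex Token = new Regex(
        @"(?<=^|/)((?<br>\()|(?<-br>\))|[^()])*?(?(br)(?!))(?=$|/)");

    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        foreach (Match m in Token.Matches(input))
            tokens.Add(m.Value);
        return tokens;
    }

    public static void Main()
    {
        string[] inputs =
        {
            "Root(branch1/branch2)/SubRoot/SubRoot(branch3/branch4)/Leaf",
            "A(b(c/d)/e)/F"   // nested parentheses
        };
        foreach (string input in inputs)
            Console.WriteLine(string.Join(" | ", Tokenize(input)));
    }
}
```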
The following does something similar, but splits the string with Regex.Split (linebreaks added for clarity):
(?<=^(?(brb)(?!))(?:(?<-brb>\()|(?<brb>\))|[^()])*)
/
(?=(?:(?<bra>\()|(?<-bra>\))|[^()])*(?(bra)(?!))$)
This matches "any / where any brackets between the start of the string and the / are balanced, and any bracket between the / and the end of the string are balanced".
Note that in the lookbehind, the brb captures appear in reverse order from before - this is because a lookbehind apparently works right-to-left. (Thanks to Kobi for the answer that taught me this.)
This is much slower than the match version, but I wanted to work out how to do it anyway.
