Regex and the colon (:) - c#

I have the following code. The idea is to detect whole words.
bool contains = Regex.IsMatch("Hello1 Hello2", @"\bHello\b"); // yields false
bool contains = Regex.IsMatch("Hello Hello2", @"\bHello\b"); // yields true
bool contains = Regex.IsMatch("Hello: Hello2", @"\bHello\b"); // yields true, but should yield false
Seems that Regex is ignoring the colon. How can I modify the code such that the last line will return false?

\b means "word boundary". : is not part of any word, so the expression is true.
Maybe you want an expression like this:
(^|\s)Hello(\s|$)
Which means: the string "Hello", preceded by either the start of the expression or a whitespace, and followed by either the end of the expression or a whitespace.
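As a minimal sketch, the whitespace-delimited pattern above can be checked against the three inputs from the question. The class and method names here are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WholeWordDemo
{
    // Hypothetical helper: true when "Hello" is delimited on both sides
    // by whitespace or by the start/end of the input.
    public static bool ContainsStandaloneHello(string input)
    {
        return Regex.IsMatch(input, @"(^|\s)Hello(\s|$)");
    }

    public static void Main()
    {
        Console.WriteLine(ContainsStandaloneHello("Hello1 Hello2")); // False
        Console.WriteLine(ContainsStandaloneHello("Hello Hello2"));  // True
        Console.WriteLine(ContainsStandaloneHello("Hello: Hello2")); // False
    }
}
```

Note that this pattern also rejects "Hello," and "Hello.", because any non-whitespace neighbour disqualifies the match.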

The Regex isn't ignoring the colon. The position before the colon is where \b matches, because \b matches word boundaries, that is, the position between a word character and a non-word character.
If you want whitespace to follow your word 'Hello', then use "\bHello\s".

To match a whole word that is not directly followed by a colon, use either of these patterns:
\bHello\b(?!:)
\bHello(?![:\w])
See the regex demo. Details:
\b - a word boundary
Hello - a word
(?![:\w]) - a negative lookahead that fails the match if there is : or a word char immediately to the right of the current location.
See the C# code demo:
bool contains = Regex.IsMatch("Hello: Hello2", @"\bHello\b(?!:)");
Console.WriteLine(contains); // => False
Console.WriteLine(Regex.IsMatch("Hello: Hello2", @"\bHello(?![:\w])"));
// => False

Related

Regex find all matches EXCEPT those surrounded by characters

I have the following regular expression to find all of the instances of {word} in my string. In the following string, this (correctly) matches {eeid} and {catalog}:
Expression
{([^:]*?)}
String being searched
{?:participants::lookup(.,{eeid},{catalog})}
Now - I need to "escape" one of those values, so it is NOT matched/replaced. I'm trying to use double square brackets to do so:
{?:participants::lookup(.,{eeid},[[{catalog}]])}
How can I adjust my regular expression so it ignores {catalog} (enclosed in [[ ]]) but finds {eeid}?
You can use
(?<!\[\[)\{([^:{}]*)}(?!]])
See the .NET regex demo.
Details
(?<!\[\[) - a negative lookbehind that fails the match if there is [[ immediately to the left of the current location
\{ - a { char
([^:{}]*) - Group 1: any zero or more chars other than :, { and }
} - a } char
(?!]]) - a negative lookahead that fails the match if there is ]] immediately to the right of the current location.
See the C# demo:
var s = "{?:participants::lookup(.,{eeid},[[{catalog}]])}";
var rx = @"(?<!\[\[)\{([^:{}]*)}(?!]])";
var res = Regex.Matches(s, rx).Cast<Match>().Select(x => x.Groups[1].Value);
foreach (var t in res)
    Console.WriteLine(t);
// => eeid

Regex match with Arabic

I have a text in Arabic and I want to use Regex to extract numbers from it. Here is my attempt.
String:
"ما المجموع:
1+2"
Match match = Regex.Match(text, "المجموع: ([^\\r\\n]+)", RegexOptions.IgnoreCase);
It will always return false, and groups.value will always return null.
expected output:
match.Groups[1].Value //returns (1+2)
The regex you wrote matches a word, then a colon, then a space and then 1 or more chars other than backslash, r and n.
You want to match the whole line after the word, colon and any amount of whitespace chars:
var text = "ما المجموع:\n1+2";
var result = Regex.Match(text, @"المجموع:\s*(.+)")?.Groups[1].Value;
Console.WriteLine(result); // => 1+2
See the C# demo
Other possible patterns:
@"المجموع:\r?\n(.+)" // To match a CRLF or LF line ending only
@"المجموع:\n(.+)" // To match an LF line ending only
Also, if you run the regex against a long multiline text with CRLF endings, it makes sense to replace .+ with [^\r\n]+, since . in a .NET regex matches any char except a newline (LF) and thus also matches the CR symbol.
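The CRLF caveat can be seen in a short sketch. The sample text with CRLF endings and the helper name ValueAfterLabel are assumptions made for illustration; they are not part of the original question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineCaptureDemo
{
    // Hypothetical helper: captures the rest of the first non-empty line
    // after the label, excluding both CR and LF.
    public static string ValueAfterLabel(string text)
    {
        Match m = Regex.Match(text, @"المجموع:\s*([^\r\n]+)");
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Assumed sample input with Windows (CRLF) line endings.
        string text = "ما المجموع:\r\n1+2\r\nnext line";

        // With .+ the capture stops at LF only, so the CR is included.
        string withDot = Regex.Match(text, @"المجموع:\s*(.+)").Groups[1].Value;
        Console.WriteLine(withDot.Length);                // 4 ("1+2" plus a trailing CR)

        Console.WriteLine(ValueAfterLabel(text));         // 1+2
        Console.WriteLine(ValueAfterLabel(text).Length);  // 3
    }
}
```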

Regex.IsMatch is not working when text including "$"

Regex.IsMatch method returns the wrong result while checking the following condition,
string text = "$0.00";
Regex compareValue = new Regex(text);
bool result = compareValue.IsMatch(text);
The above code returns false. Please let me know if I missed anything.
The Regex class has a special method for escaping characters in a pattern: Regex.Escape()
Change your code like this:
string text = "$0.00";
Regex compareValue = new Regex(Regex.Escape(text)); // Escape characters in text
bool result = compareValue.IsMatch(text);
"$" is a special character in C# regex. Escape it first.
Regex compareValue = new Regex(@"\$0\.00");
bool result = compareValue.IsMatch("$0.00");
Regex expressions: https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx
Both '.' and '$' are special characters, so you need to escape them if you want to match the characters themselves. '.' matches any character and '$' matches the end of a string.
see: https://regex101.com/r/pK2uY6/1
You have to escape $ since it is a special (reserved) character which means "end of string". In case . means just dot (say, decimal separator) you have to escape it as well (when not escaped, . means "any symbol"):
string pattern = @"\$0\.00";
bool result = Regex.IsMatch(text, pattern);
As for your original pattern, it has no chance to match any string, since $0.00 means
$ end of string, followed by
0 zero
. any character
0 zero
0 zero
but end of string can't be followed by...
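The answers above can be compared side by side in a small sketch. ContainsLiteral is a hypothetical helper name; Regex.Escape and Regex.IsMatch are the standard .NET methods already mentioned in the answers.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Hypothetical helper: treats `literal` as plain text, not as a pattern.
    public static bool ContainsLiteral(string input, string literal)
    {
        return Regex.IsMatch(input, Regex.Escape(literal));
    }

    public static void Main()
    {
        // Regex.Escape puts a backslash in front of each metacharacter.
        Console.WriteLine(Regex.Escape("$0.00"));                     // \$0\.00

        // Unescaped: $ asserts end of string, so nothing can follow it.
        Console.WriteLine(Regex.IsMatch("$0.00", "$0.00"));           // False

        // Escaped: the text is matched literally.
        Console.WriteLine(ContainsLiteral("Total: $0.00", "$0.00"));  // True

        // Escaping only the $ leaves . free to match any character.
        Console.WriteLine(Regex.IsMatch("$0x00", @"\$0.00"));         // True
    }
}
```

Regex.Escape is usually the safer choice when the text to match comes from user input or data rather than from a hand-written pattern.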

How to select first sentence in a piece of text using regular expression?

My task is to select first sentence from a text (I'm writing in C#). I suppose that the most appropriate way would be using regex but some troubles occurred. What regex pattern should I use to select the first sentence?
Several examples:
Input: "I am a lion and I want to be free. Do you see a lion when you look inside of me?" Expected result: "I am a lion and I want to be free."
Input: "I drink so much they call me Charlie 4.0 hands. Any text." Expected result: "I drink so much they call me Charlie 4.0 hands."
Input: "So take out your hands and throw the H.U. up. 'Now wave it around like you don't give a fake!'" Expected result: "So take out your hands and throw the H.U. up."
The third is really confusing me.
Since you already provided some assumptions:
sentences are divided by a whitespace
task is to select first sentence
You can use the following regex:
^.*?[.?!](?=\s+(?:$|\p{P}*\p{Lu}))
See RegexStorm demo
Regex breakdown:
^ - start of string (thus, only the first sentence will be matched)
.*? - any number of characters, as few as possible (use RegexOptions.Singleline to also match a newline with .)
[.?!] - a final punctuation symbol
(?=\s+(?:$|\p{P}*\p{Lu})) - a look-ahead making sure the punctuation is followed by 1 or more whitespace symbols (\s+) and then either the end of string ($) or optional punctuation (\p{P}*) and a capital letter (\p{Lu}).
UPDATE:
Since it turns out you can have single sentence input, and your sentences can start with any letter or digit, you can use
^.*?[.?!](?=\s+\p{P}*[\p{Lu}\p{N}]|\s*$)
See another demo
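A minimal sketch of the updated pattern applied to the three inputs from the question. The class and method names are hypothetical, and returning the whole input when nothing matches is an assumption, not something the answer specifies.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirstSentenceDemo
{
    // Pattern from the update above; Singleline lets . match newlines too.
    private static readonly Regex FirstSentence = new Regex(
        @"^.*?[.?!](?=\s+\p{P}*[\p{Lu}\p{N}]|\s*$)",
        RegexOptions.Singleline);

    // Hypothetical helper; falls back to the whole text when no sentence
    // end is found (an assumption made for this sketch).
    public static string Extract(string text)
    {
        Match m = FirstSentence.Match(text);
        return m.Success ? m.Value : text;
    }

    public static void Main()
    {
        Console.WriteLine(Extract("I am a lion and I want to be free. Do you see a lion when you look inside of me?"));
        // I am a lion and I want to be free.
        Console.WriteLine(Extract("I drink so much they call me Charlie 4.0 hands. Any text."));
        // I drink so much they call me Charlie 4.0 hands.
        Console.WriteLine(Extract("So take out your hands and throw the H.U. up. 'Now wave it around like you don't give a fake!'"));
        // So take out your hands and throw the H.U. up.
    }
}
```

The dot in "4.0" and the dots in "H.U." are skipped because they are not followed by whitespace plus an uppercase letter or digit.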
I came up with a regular expression that uses lots of negative look-aheads to exclude certain cases, e.g. a punctuation mark must not be followed by a lowercase character, or a dot before a capital letter does not close a sentence. This splits up the whole text into its separate sentences. If you are given a text, just take the first match.
[\s\S]*?(?![A-Z]+)(?:\.|\?|\!)(?!(?:\d|[A-Z]))(?! [a-z])/gm
Sentence separators should be searched with following scanner:
if it's sentence-finisher character (like [.!?])
it must be followed by space or allowed sequence of characters and then space:
like sequence of '.' for '.' (A sentence...)
...or sequence of '!' and/or '?' for '!' and '?' (Exclamation here!?)
then it must be followed by either:
capital character (ignore quotes, if any)
numeric
which must be followed by lowercase or another sentence-finisher
dialog-starter character (Blah blah blah... - And what next, Elric?)
Tip: don't forget to add extra space character to input source string.
Upd:
Some wild pseudocode xD:
func sentence(inputString) {
    finishers = ['.', '!', '?']
    allowedSequences = ['.' => ['..'], '!' => ['!!', '?'], '?' => ['??', '!']]
    input = inputString
    result = ''
    found = false
    while input != '' {
        finisherPos = min(pos(input, finishers))
        if finisherPos is not found
            return inputString
        finisher = input[finisherPos]
        // keep everything up to and including the finisher
        result += substr(input, 0, finisherPos + 1)
        // continue scanning right after the finisher
        input = substr(input, finisherPos + 1)
        p = 0
        if input[p] != ' ' {
            if match = testSequence(input, allowedSequences[finisher]) {
                result += match
                found = true
                break
            } else {
                continue
            }
        } else {
            p++
            if input[p] in [A-Z] {
                found = true
                break
            }
            if input[p] in [0-9] {
                p++
                if input[p] in [a-z] or input[p] in finishers {
                    found = true
                    break
                }
                p--
            }
            if input[p] in ['-'] {
                found = true
                break
            }
        }
    }
    if !found
        return inputString
    return result
}
func testSequence(str, sequences) {
    foreach (sequence: sequences)
        if startsWith(str, sequence)
            return sequence
    return false
}

Replace any character before <usernameredacted#example.com> with an empty string

I have this string
AnyText: "jonathon" <usernameredacted#example.com>
Desired Output Using Regex
AnyText: <usernameredacted#example.com>
Omit anything in between !
I am still a rookie at regular expressions. Could anyone out there help me with the matching & replacing expression for the above scenario?
Try this:
string input = "jonathon <usernameredacted#example.com>";
string output = Regex.Match(input, @"<[^>]+>").Groups[0].Value;
Console.WriteLine(output); //<usernameredacted#example.com>
You could use the following regex to match all the characters that you want to replace with an empty string:
^[^<]*
The first ^ is an anchor to the beginning of the string. The ^ inside the character class means that the character class is a negation. ie. any character that isn't an < will match. The * is a greedy quantifier. So in summary, this regex will swallow up all characters from the beginning of the string until the first <.
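A short sketch of that replacement in C#. Note that ^[^<]* also removes the "AnyText: " label; the second method, which keeps everything up to the first colon, is an assumption about what the asker wants and uses a pattern that does not appear in the answers. Both method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrefixStripDemo
{
    // Removes everything from the start of the string up to the first '<'.
    public static string StripBeforeBracket(string input)
    {
        return Regex.Replace(input, @"^[^<]*", "");
    }

    // Assumed variant: keep the label (text up to the first colon plus
    // following whitespace) and drop the rest before the first '<'.
    public static string KeepLabel(string input)
    {
        return Regex.Replace(input, @"^([^:]*:\s*)[^<]*", "$1");
    }

    public static void Main()
    {
        string input = "AnyText: \"jonathon\" <usernameredacted#example.com>";
        Console.WriteLine(StripBeforeBracket(input)); // <usernameredacted#example.com>
        Console.WriteLine(KeepLabel(input));          // AnyText: <usernameredacted#example.com>
    }
}
```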
Here is the way to do it in the VBA flavor: replace "^[^""]*" with "".
^ marks the start of the string.
[^""]* matches anything other than a quote sign.
UPDATE:
Since in your additional comment you mentioned you wanted to grab the "From:" and the email address, but none of the junk in between or after, I figure instead of replace, extract would be better. Here is a VBA function written for Excel that will give you back all the subgroup matches (everything you put in parenthesis) and nothing else.
Function RegexExtract(ByVal text As String, _
                      ByVal extract_what As String) As String
    Application.ScreenUpdating = False
    Dim i As Long
    Dim result As String
    Dim allMatches As Object
    Dim RE As Object
    Set RE = CreateObject("vbscript.regexp")
    RE.Pattern = extract_what
    RE.Global = True
    Set allMatches = RE.Execute(text)
    ' Concatenate the submatches of the first match; skip when nothing matched
    If allMatches.Count > 0 Then
        For i = 0 To allMatches.Item(0).submatches.count - 1
            result = result & allMatches.Item(0).submatches.Item(i)
        Next
    End If
    RegexExtract = result
    Application.ScreenUpdating = True
End Function
Using this code, your regex pattern would be: "^(.+: ).+(<.+>).*"
^ denotes the start of the string
(.+: ) denotes the first match group: .+ is one or more characters, followed by : and a space
.+ denotes one or more characters
(<.+>) denotes the second match group: < is a literal <, then .+ for one or more characters, then the final >
.* denotes zero or more characters
So in excel you'd use (assuming cell is A1):
=RegexExtract(A1, "^(.+: ).+(<.+>).*")
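For readers working in C# rather than Excel, a rough counterpart of the VBA function might look like the sketch below. It is an illustrative port under the same idea (concatenate all capture groups of the first match); returning an empty string when nothing matches is an assumption.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class RegexExtractDemo
{
    // Concatenates every capture group of the first match.
    // Returns an empty string when the pattern does not match (assumption).
    public static string RegexExtract(string text, string pattern)
    {
        Match m = Regex.Match(text, pattern);
        if (!m.Success)
            return string.Empty;

        var sb = new StringBuilder();
        // Group 0 is the whole match, so start at 1.
        for (int i = 1; i < m.Groups.Count; i++)
            sb.Append(m.Groups[i].Value);
        return sb.ToString();
    }

    public static void Main()
    {
        string input = "AnyText: \"jonathon\" <usernameredacted#example.com>";
        Console.WriteLine(RegexExtract(input, @"^(.+: ).+(<.+>).*"));
        // AnyText: <usernameredacted#example.com>
    }
}
```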
