Regex and the colon (:) - c#

I have the following code. The idea is to detect whole words.
bool contains = Regex.IsMatch("Hello1 Hello2", @"\bHello\b"); // yields false
contains = Regex.IsMatch("Hello Hello2", @"\bHello\b"); // yields true
contains = Regex.IsMatch("Hello: Hello2", @"\bHello\b"); // yields true, but should yield false
It seems that Regex is ignoring the colon. How can I modify the code so that the last line returns false?

\b means "word boundary". A colon is not a word character, so there is a word boundary between "Hello" and ":", and the expression matches.
Maybe you want an expression like this:
(^|\s)Hello(\s|$)
Which means: the string "Hello", preceded by either the start of the string or a whitespace character, and followed by either a whitespace character or the end of the string.
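A minimal sketch of the whitespace-delimited approach above, generalized to an arbitrary word. The names WholeWord and IsStandalone are hypothetical, and the call to Regex.Escape is an addition (not part of the answer) so that a word containing regex metacharacters is treated literally.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WholeWord
{
    // True when `word` occurs delimited by whitespace or by the string edges.
    public static bool IsStandalone(string input, string word)
    {
        // Regex.Escape is an assumption added for safety with arbitrary words.
        string pattern = @"(^|\s)" + Regex.Escape(word) + @"(\s|$)";
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        Console.WriteLine(IsStandalone("Hello1 Hello2", "Hello")); // False
        Console.WriteLine(IsStandalone("Hello Hello2", "Hello"));  // True
        Console.WriteLine(IsStandalone("Hello: Hello2", "Hello")); // False
    }
}
```

Note that this treats any non-whitespace neighbour (a comma, a period, a quote) as part of the word, which may be stricter than you need.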

The Regex isn't ignoring the colon. The position before the colon is where \b matches, because \b matches word boundaries, that is, positions between a word character and a non-word character.
If you want whitespace to follow your word 'Hello', then use "\bHello\s".

To match a whole word that is not directly followed by a colon, use either of
\bHello\b(?!:)
\bHello(?![:\w])
See the regex demo. Details of the second pattern:
\b - a word boundary
Hello - a word
(?![:\w]) - a negative lookahead that fails the match if there is a : or a word char immediately to the right of the current location.
See the C# code demo:
bool contains = Regex.IsMatch("Hello: Hello2", @"\bHello\b(?!:)");
Console.WriteLine(contains); // => False
Console.WriteLine(Regex.IsMatch("Hello: Hello2", @"\bHello(?![:\w])"));
// => False

Related

Regex find all matches EXCEPT those surrounded by characters

I have the following regular expression to find all of the instances of {word} in my string. In the following string, this (correctly) matches {eeid} and {catalog}:
Expression
{([^:]*?)}
String being searched
{?:participants::lookup(.,{eeid},{catalog})}
Now - I need to "escape" one of those values, so it is NOT matched/replaced. I'm trying to use double square brackets to do so:
{?:participants::lookup(.,{eeid},[[{catalog}]])}
How can I adjust my regular expression so it ignores {catalog} (enclosed in [[ ]]) but finds {eeid}?

You can use
(?<!\[\[)\{([^:{}]*)}(?!]])
See the .NET regex demo.
Details
(?<!\[\[) - a negative lookbehind that fails the match if there is [[ immediately to the left of the current location
\{ - a { char
([^:{}]*) - Group 1: any zero or more chars other than :, { and }
} - a } char
(?!]]) - a negative lookahead that fails the match if there is ]] immediately to the right of the current location.
See the C# demo:
var s = "{?:participants::lookup(.,{eeid},[[{catalog}]])}";
var rx = @"(?<!\[\[)\{([^:{}]*)}(?!]])";
var res = Regex.Matches(s, rx).Cast<Match>().Select(x => x.Groups[1].Value);
foreach (var t in res)
    Console.WriteLine(t);
// => eeid
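Since the question mentions replacing the matched values, here is a hedged sketch of how the same pattern could drive a replacement. PlaceholderFill, Fill and the sample values are hypothetical names and data, and leaving unknown keys untouched is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PlaceholderFill
{
    // Same pattern as above: {word} not wrapped in [[ ]].
    private static readonly Regex Placeholder =
        new Regex(@"(?<!\[\[)\{([^:{}]*)}(?!]])");

    // Replaces each unescaped {key} with its value; unknown keys stay as they are.
    public static string Fill(string template, IDictionary<string, string> values)
    {
        return Placeholder.Replace(template, m =>
            values.TryGetValue(m.Groups[1].Value, out var v) ? v : m.Value);
    }

    public static void Main()
    {
        var values = new Dictionary<string, string> { ["eeid"] = "42", ["catalog"] = "C1" };
        var s = "{?:participants::lookup(.,{eeid},[[{catalog}]])}";
        Console.WriteLine(Fill(s, values));
        // => {?:participants::lookup(.,42,[[{catalog}]])}
    }
}
```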

Regex match with Arabic

I have a text in Arabic and I want to use Regex to extract numbers from it. Here is my attempt.
String:
"ما المجموع:
1+2"
Match match = Regex.Match(text, "المجموع: ([^\\r\\n]+)", RegexOptions.IgnoreCase);
The match always fails, and match.Groups[1].Value is always empty.
Expected output:
match.Groups[1].Value // returns (1+2)

The regex you wrote requires a literal space after the colon, followed by one or more characters other than CR and LF (in a verbatim string, [^\\r\\n] would instead mean any character other than a backslash, r and n). In your text the colon is followed by a line break, not a space, so the pattern cannot match.
You want to match the whole line after the word, the colon and any amount of whitespace chars:
var text = "ما المجموع:\n1+2";
var result = Regex.Match(text, @"المجموع:\s*(.+)").Groups[1].Value;
Console.WriteLine(result); // => 1+2
See the C# demo
Other possible patterns:
@"المجموع:\r?\n(.+)" // To match a CRLF or LF line ending only
@"المجموع:\n(.+)" // To match an LF line ending only
Also, if you run the regex against a long multiline text with CRLF endings, it makes sense to replace .+ with [^\r\n]+, since . in a .NET regex matches any char except LF and thus also matches the CR symbol.
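The CRLF remark can be checked with a small sketch. LineCapture and AfterLabel are hypothetical names, and the input text with CRLF endings is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineCapture
{
    // Returns the contents of capture group 1, or "" when there is no match.
    public static string AfterLabel(string text, string pattern)
    {
        return Regex.Match(text, pattern).Groups[1].Value;
    }

    public static void Main()
    {
        var text = "ما المجموع:\r\n1+2\r\nnext line";
        var withDot = AfterLabel(text, @"المجموع:\s*(.+)");
        var withClass = AfterLabel(text, @"المجموع:\s*([^\r\n]+)");
        Console.WriteLine(withDot.Length);   // 4: "1+2" plus the trailing CR
        Console.WriteLine(withClass.Length); // 3: "1+2"
    }
}
```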

Regex.IsMatch is not working when text including "$"

Regex.IsMatch method returns the wrong result while checking the following condition,
string text = "$0.00";
Regex compareValue = new Regex(text);
bool result = compareValue.IsMatch(text);
The above code returns "False". Please let me know if I missed anything.

The Regex class has a special method for escaping characters in a pattern: Regex.Escape()
Change your code like this:
string text = "$0.00";
Regex compareValue = new Regex(Regex.Escape(text)); // Escape characters in text
bool result = compareValue.IsMatch(text);
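A short sketch of what Regex.Escape does with this input. LiteralMatch and ContainsLiteral are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LiteralMatch
{
    // Treats `literal` as plain text rather than as a pattern.
    public static bool ContainsLiteral(string input, string literal)
    {
        return Regex.IsMatch(input, Regex.Escape(literal));
    }

    public static void Main()
    {
        Console.WriteLine(Regex.Escape("$0.00"));                   // \$0\.00
        Console.WriteLine(ContainsLiteral("Total: $0.00", "$0.00")); // True
        Console.WriteLine(Regex.IsMatch("$0.00", "$0.00"));          // False: $ is an anchor here
    }
}
```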

"$" is a special character in C# regex. Escape it first.
Regex compareValue = new Regex(@"\$0\.00");
bool result = compareValue.IsMatch("$0.00");
Regex expressions: https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx

Both '.' and '$' are special characters and thus you need to escape them if you want to match the character itself. '.' matches any character and '$' matches the end of a string
see: https://regex101.com/r/pK2uY6/1

You have to escape $ since it is a special (reserved) character which means "end of string". If . is meant as a literal dot (say, a decimal separator), you have to escape it as well (when not escaped, . means "any symbol"):
string pattern = @"\$0\.00";
bool result = Regex.IsMatch(text, pattern);
As for your original pattern, it has no chance to match any string, since $0.00 means
$ - end of string, followed by
0 - zero
. - any character
0 - zero
0 - zero
but end of string can't be followed by...

How to select first sentence in a piece of text using regular expression?

My task is to select first sentence from a text (I'm writing in C#). I suppose that the most appropriate way would be using regex but some troubles occurred. What regex pattern should I use to select the first sentence?
Several examples:
Input: "I am a lion and I want to be free. Do you see a lion when you look inside of me?" Expected result: "I am a lion and I want to be free."
Input: "I drink so much they call me Charlie 4.0 hands. Any text." Expected result: "I drink so much they call me Charlie 4.0 hands."
Input: "So take out your hands and throw the H.U. up. 'Now wave it around like you don't give a fake!'" Expected result: "So take out your hands and throw the H.U. up."
The third is really confusing me.

Since you already provided some assumptions:
sentences are divided by a whitespace
task is to select first sentence
You can use the following regex:
^.*?[.?!](?=\s+(?:$|\p{P}*\p{Lu}))
See RegexStorm demo
Regex breakdown:
^ - start of string (thus, only the first sentence will be matched)
.*? - any number of characters, as few as possible (use RegexOptions.Singleline to also match a newline with .)
[.?!] - a final punctuation symbol
(?=\s+(?:$|\p{P}*\p{Lu})) - a look-ahead making sure there is 1 or more whitespace symbols (\s+) right after the final punctuation, followed by either the end of string ($) or optional punctuation (\p{P}*) and a capital letter (\p{Lu}).
UPDATE:
Since it turns out you can have single sentence input, and your sentences can start with any letter or digit, you can use
^.*?[.?!](?=\s+\p{P}*[\p{Lu}\p{N}]|\s*$)
See another demo
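A sketch of the updated pattern applied to the three inputs from the question. FirstSentence and Extract are hypothetical names, and returning the whole text when nothing matches is an assumption, not something the answer specifies.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirstSentence
{
    // Pattern from the UPDATE above; Singleline lets . cross line breaks.
    private static readonly Regex Pattern = new Regex(
        @"^.*?[.?!](?=\s+\p{P}*[\p{Lu}\p{N}]|\s*$)", RegexOptions.Singleline);

    public static string Extract(string text)
    {
        var m = Pattern.Match(text);
        return m.Success ? m.Value : text; // fallback is an assumption
    }

    public static void Main()
    {
        Console.WriteLine(Extract("I am a lion and I want to be free. Do you see a lion when you look inside of me?"));
        // => I am a lion and I want to be free.
        Console.WriteLine(Extract("I drink so much they call me Charlie 4.0 hands. Any text."));
        // => I drink so much they call me Charlie 4.0 hands.
        Console.WriteLine(Extract("So take out your hands and throw the H.U. up. 'Now wave it around like you don't give a fake!'"));
        // => So take out your hands and throw the H.U. up.
    }
}
```

The pattern is a heuristic: an abbreviation followed by a capitalized word (for example "Mr. Smith") would still end the match early.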

I came up with a regular expression that uses several negative look-aheads to exclude certain cases, e.g. a punctuation mark must not be followed by a lowercase character, and a dot before a capital letter does not close a sentence. This splits the text up into its separate sentences. If you are given a text, just take the first match.
/[\s\S]*?(?![A-Z]+)(?:\.|\?|\!)(?!(?:\d|[A-Z]))(?! [a-z])/gm

Sentence separators should be searched for with the following scanner:
if it's a sentence-finisher character (like [.!?])
    it must be followed by a space, or by an allowed sequence of characters and then a space:
        like a sequence of '.' for '.' (A sentence...)
        ...or a sequence of '!' and/or '?' for '!' and '?' (Exclamation here!?)
    then it must be followed by either:
        a capital character (ignore quotes, if any)
        a numeric
            which must be followed by a lowercase character or another sentence-finisher
        a dialog-starter character (Blah blah blah... - And what next, Elric?)
Tip: don't forget to add an extra space character to the input source string.
Upd:
Some wild pseudocode xD:
func sentence(inputString) {
    finishers = ['.', '!', '?']
    allowedSequences = ['.' => ['..'], '!' => ['!!', '?'], '?' => ['??', '!']]
    input = inputString
    result = ''
    found = false
    while input != '' {
        finisherPos = min(pos(input, finishers))
        if !finisherPos
            return inputString
        result += substr(input, 0, finisherPos + 1)
        input = substr(input, finisherPos)
        p = finisherPos
        finisher = input[p]
        p++
        if input[p] != ' '
            if match = testSequence(substr(input, p), allowedSequences[finisher]) {
                result += match
                found = true
                break
            } else {
                continue
            }
        else {
            p++
            if input[p] in [A-Z] {
                found = true
                break
            }
            if input[p] in [0-9] {
                p++
                if input[p] in [a-z] or input[p] in finishers {
                    found = true
                    break
                }
                p--
            }
            if input[p] in ['-'] {
                found = true
                break
            }
        }
    }
    if !found
        return inputString
    return result
}
func testSequence(str, sequences) {
    foreach (sequence: sequences)
        if startsWith(str, sequence)
            return sequence
    return false
}

Replace any character before <usernameredacted#example.com> with an empty string

I have this string
AnyText: "jonathon" <usernameredacted#example.com>
Desired Output Using Regex
AnyText: <usernameredacted#example.com>
Omit anything in between !
I am still a rookie at regular expressions. Could anyone out there help me with the matching & replacing expression for the above scenario?

Try this:
string input = "jonathon <usernameredacted#example.com>";
string output = Regex.Match(input, @"<[^>]+>").Groups[0].Value;
Console.WriteLine(output); //<usernameredacted#example.com>

You could use the following regex to match all the characters that you want to replace with an empty string:
^[^<]*
The first ^ is an anchor to the beginning of the string. The ^ inside the character class means that the character class is a negation. ie. any character that isn't an < will match. The * is a greedy quantifier. So in summary, this regex will swallow up all characters from the beginning of the string until the first <.

Here is the way to do it in VBA flavor: Replace "^[^""]*" with "".
^ marks the start of the string.
[^""]* matches anything other than a quote sign.
UPDATE:
Since in your additional comment you mentioned you wanted to grab the "From:" and the email address, but none of the junk in between or after, I figure that extracting would be better than replacing. Here is a VBA function written for Excel that will give you back all the subgroup matches (everything you put in parentheses) and nothing else.
Function RegexExtract(ByVal text As String, _
                      ByVal extract_what As String) As String
    Application.ScreenUpdating = False
    Dim i As Long
    Dim result As String
    Dim allMatches As Object
    Dim RE As Object
    Set RE = CreateObject("vbscript.regexp")
    RE.Pattern = extract_what
    RE.Global = True
    Set allMatches = RE.Execute(text)
    For i = 0 To allMatches.Item(0).submatches.count - 1
        result = result & allMatches.Item(0).submatches.Item(i)
    Next
    RegexExtract = result
    Application.ScreenUpdating = True
End Function
Using this code, your regex pattern would be: "^(.+: ).+(<.+>).*"
^ denotes the start of the string
(.+: ) denotes the first match group: .+ is one or more characters, followed by : and a space
.+ denotes one or more characters
(<.+>) denotes the second match group: < is a literal <, then .+ for one or more characters, then the final >
.* denotes zero or more characters.
So in Excel you'd use (assuming the cell is A1):
=RegexExtract(A1, "^(.+: ).+(<.+>).*")
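For comparison, the same two-group pattern could be used from C# with a replacement string instead of a helper function. This is only a sketch: HeaderCleaner and KeepLabelAndAddress are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeaderCleaner
{
    // Keeps group 1 (the label up to ": ") and group 2 (the <...> part), dropping the rest.
    public static string KeepLabelAndAddress(string input)
    {
        return Regex.Replace(input, @"^(.+: ).+(<.+>).*", "$1$2");
    }

    public static void Main()
    {
        var input = "AnyText: \"jonathon\" <usernameredacted#example.com>";
        Console.WriteLine(KeepLabelAndAddress(input));
        // => AnyText: <usernameredacted#example.com>
    }
}
```

If the input does not match the pattern, Regex.Replace returns it unchanged.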
