How to find a word in one sentence using Regex - C#

I have this sample data:
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Re: Krishna P Mohan (31231231 / NA0031212301)
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
This is what I expect and what I currently get:
Expected output - Krishna P Mohan
Actual output - Krishna P Mohan (31231231 / NA0031212301)
I need to find the name that comes after Re: and runs up to the (. I'm getting the complete line instead of only the name before the opening bracket.
Code:
var regex = new Regex(@"[\n\r].*Re:\s*([^\n\r]*)");
var fullNameText = regex.Match(extractedDocContent).Value;

If you want a match only, you can use a lookbehind assertion:
(?<=\bRe:\s*)[^\s()]+(?:[^\n\r()]*[^\s()])?
Explanation
(?<=\bRe:\s*) Positive lookbehind, assert the word Re: followed by optional whitespace chars to the left
[^\s()]+ Match 1 or more non whitespace chars except for ( and )
(?: Non capture group
[^\n\r()]* Optionally repeat matching any char except newlines and ( or )
[^\s()] Match a non whitespace character except for ( and )
)? Close the non capture group
If you want the capture group value, and you are matching only word characters:
\bRe:\s*([^\n\r(]+)\b
Else you can use:
\bRe:\s*([^\s()]+(?:[^\n\r()]*[^\s()])?)
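As a minimal sketch of the capture-group variant above, the following C# helper returns only the name. The class and method names (NameExtractor, ExtractName) are made up for this illustration, and returning an empty string when nothing matches is an assumption, not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameExtractor
{
    // Group 1 holds the text after "Re:" up to (but not including) the
    // first bracket, without trailing whitespace.
    private static readonly Regex ReLine =
        new Regex(@"\bRe:\s*([^\s()]+(?:[^\n\r()]*[^\s()])?)");

    public static string ExtractName(string text)
    {
        Match m = ReLine.Match(text);
        // Read the capture group, not Match.Value, which would include "Re:".
        return m.Success ? m.Groups[1].Value : "";
    }
}
```

With the sample data from the question, NameExtractor.ExtractName(extractedDocContent) should yield Krishna P Mohan, because the value is read from Groups[1] rather than from Match.Value.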

Related

Remove part of string between 2 brackets containing a specific word

I need to write a function called RemoveError that checks whether a string contains the word "Error" inside a pair of brackets together with other text. If so, I need to remove the brackets surrounding "Error" and everything inside them.
Example:
var Result = RemoveError("Lorem Ipsum (Status: Hello) (Error: 14) (Comment: Some text)");
Result will return:
"Lorem Ipsum (Status: Hello) (Comment: Some text)"
Hope someone can help

You could try this Regex pattern:
public string RemoveError(string input) {
    return Regex.Replace(input, @"\(Error\:\s[0-9]{1,3}\)\s", "");
}
I am assuming that your error code is numeric and between 1 and 3 digits long. If that is not the case, you need to adapt that part of the expression. I am additionally removing one extra whitespace character after the error part, because otherwise you would end up with two spaces in between.
\( - opening parenthesis
Error - match the word Error
\: - match the colon
\s - match a whitespace character
[0-9]{1,3} - match 1 to 3 characters in the range 0-9
\) - closing parenthesis
\s - match a whitespace character
Output:
Lorem Ipsum (Status: Hello) (Comment: Some text)
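If the error text is not always a 1 to 3 digit number, a looser pattern may be easier to maintain. The sketch below is a variation on the answer above, not the same pattern: it removes any bracket pair whose content contains the whole word Error, plus the whitespace in front of it. The class name ErrorStripper is hypothetical, and nested brackets are assumed not to occur.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ErrorStripper
{
    // \s*                     whitespace before the group, so no double space remains
    // \( ... \)               one bracket pair without nested brackets
    // [^()]*\bError\b[^()]*   bracket content that contains the whole word "Error"
    private static readonly Regex ErrorGroup =
        new Regex(@"\s*\([^()]*\bError\b[^()]*\)");

    public static string RemoveError(string input)
    {
        return ErrorGroup.Replace(input, "");
    }
}
```

Because the whitespace is removed in front of the group rather than after it, a group at the very start of the string leaves the following space in place, so a Trim() may still be needed in that case.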

Regex to clean repetitions of characters

I have a pattern in the string like this:
T T and I want to turn it into T
It can be any character from [a-z].
I have tried this Regex example but was not able to get the replacement to work.
EDIT
For example, if I have A Aa ar r, it should become Aar. In other words, remove either the first or the second occurrence of a repeated character, no matter which character it is.

You can use backreferences for this.
/([a-z])\s*\1\s?/gi
Some more explanation:
( begin matching group 1
[a-z] match any character from a to z
) end matching group 1
\s* match any number of whitespace characters
\1 match the result of matching group 1 exactly as it was again; this allows for the repetition
\s? match zero or one whitespace character; this allows multiple spaces to be removed when replacing
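In C#, the same pattern could be applied roughly as follows. The answer does not state the replacement string, so "$1" (put back the first occurrence only) is an assumption here, as is the class name RepeatCleaner. Regex.Replace already replaces every match, so no equivalent of the g flag is needed, and RegexOptions.IgnoreCase stands in for the i flag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RepeatCleaner
{
    private static readonly Regex Repeated =
        new Regex(@"([a-z])\s*\1\s?", RegexOptions.IgnoreCase);

    public static string Collapse(string input)
    {
        // "$1" keeps only the first occurrence of the repeated letter.
        return Repeated.Replace(input, "$1");
    }
}
```

Note that \s* also matches zero whitespace characters, so genuine double letters are collapsed as well: Collapse("hello") gives "helo".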

Regex match if a string has length 2 and contains 1 letter and 1 number

Guys, I hate Regex and I suck at writing them.
I have a string that is space separated and contains several codes that I need to pull out. Each code is marked by beginning with a capital letter and ending with a number. Each code is only two characters long.
I'm trying to create an array of strings from the initial string and I can't get the regular expression right.
Here is what I have
String[] test = Regex.Split(originalText, "([a-zA-Z0-9]{2})");
I also tried:
String[] test = Regex.Split(originalText, "([A-Z]{1}[0-9]{1})");
I don't have any experience with Regex as I try to avoid writing them whenever possible.
Anyone have any suggestions?
Example input:
AA2410 F7 A4 Y7 B7 A 0715 0836 E0.M80
I need to pull out F7, A4, B7. E0 should be ignored.

You want to collect the results, not split on them, right?
Regex regexObj = new Regex(@"\b[A-Z][0-9]\b");
MatchCollection allMatchResults = regexObj.Matches(subjectString);
should do this. The \bs are word boundaries, making sure that only entire strings (like A1) are extracted, not substrings (like the A1 in TWA101).
If you also need to exclude "words" with non-word characters in them (like E0.M80 in your comment), you need to define your own word boundary, for example:
Regex regexObj = new Regex(@"(?<=^|\s)[A-Z][0-9](?=\s|$)");
Now A1 only matches when surrounded by whitespace (or start/end-of-string positions).
Explanation:
(?<= # Assert that we can match the following before the current position:
^ # Start of string
| # or
\s # whitespace.
)
[A-Z] # Match an uppercase ASCII letter
[0-9] # Match an ASCII digit
(?= # Assert that we can match the following after the current position:
\s # Whitespace
| # or
$ # end of string.
)
If you also need to find non-ASCII letters/digits, you can use
\p{Lu}\p{N}
instead of [A-Z][0-9]. This finds all uppercase Unicode letters and Unicode digits (like Ä٣), but I guess that's not really what you're after, is it?
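Putting the whitespace-delimited pattern to work on the example input might look like this. It is only a sketch: the names CodeFinder and FindCodes are invented for illustration, and LINQ is used to turn the MatchCollection into the string array the question asks for.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CodeFinder
{
    // One uppercase ASCII letter plus one ASCII digit, delimited by
    // whitespace or by the start/end of the string.
    private static readonly Regex CodePattern =
        new Regex(@"(?<=^|\s)[A-Z][0-9](?=\s|$)");

    public static string[] FindCodes(string text)
    {
        return CodePattern.Matches(text)
                          .Cast<Match>()
                          .Select(m => m.Value)
                          .ToArray();
    }
}
```

For the input AA2410 F7 A4 Y7 B7 A 0715 0836 E0.M80 this returns F7, A4, Y7 and B7. Y7 is included because it has the same shape as the other codes, while E0 is skipped because it is followed by a dot rather than whitespace.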

Do you mean that each code looks like "A00"?
Then this is the regex:
"[A-Z][0-9][0-9]"
Very simple... By the way, there's no point writing {1} in a regex. [0-9]{1} means "match exactly one digit", which is exactly like writing [0-9].
Don't give up, simple regexes make perfect sense.

This should be OK:
string[] all_codes = Regex.Matches(originalText, @"\b[A-Z]\d\b").Cast<Match>().Select(m => m.Value).ToArray();
It gives you an array of all codes that start with a capital letter followed by a digit and are delimited by any kind of word boundary (white space etc.). It needs using System.Linq.

Regex word boundary expressions

Say for example I have the following string "one two(three) (three) four five" and I want to replace "(three)" with "(four)" but not within words. How would I do it?
Basically I want to do a regex replace and end up with the following string:
"one two(three) (four) four five"
I have tried the following regex but it doesn't work:
@"\b\(three\)\b"
Basically I am writing some search and replace code and am giving the user the usual options to match case, match whole word etc. In this instance the user has chosen to match whole words but I don't know what the text being searched for will be.

Your problem stems from a misunderstanding of what \b actually means. Admittedly, it is not obvious.
The reason \b\(three\)\b doesn’t match the threes in your input string is the following:
\b means: the boundary between a word character and a non-word character.
Letters (e.g. a-z) are considered word characters.
Punctuation marks such as ( are considered non-word characters.
Here is your input string again, stretched out a bit, and I’ve marked the places where \b matches:
 o n e   t w o ( t h r e e )   ( t h r e e )   f o u r   f i v e
↑     ↑ ↑     ↑ ↑         ↑     ↑         ↑   ↑       ↑ ↑       ↑
As you can see here, there is a \b between “two” and “(three)”, but not before the second “(three)”.
The moral of the story? “Whole-word search” doesn’t really make much sense if what you’re searching for is not just a word (a string of letters). Since you have punctuation characters (parentheses) in your search string, it is not as such a “word”. If you searched for a word consisting only of word characters, then \b would do what you expect.
You can, of course, use a different Regex to match the string only if it is surrounded by spaces or occurs at the beginning or end of the string:
(^|\s)\(three\)(\s|$)
However, the problem with this is, of course, that if you search for “three” (without the parentheses), it won’t find the one in “(three)” because it doesn’t have spaces around it, even though it is actually a whole word.
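For the replacement the question asks for, the whitespace-delimited idea can also be written with lookarounds, so that the surrounding spaces are not consumed and do not have to be put back. This is a swapped-in variant of the pattern above, not the same expression, and the helper name ReplaceDelimited is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceDelimited
{
    public static string ReplaceDelimited(string text, string search, string replacement)
    {
        // (?<!\S) : no non-whitespace character directly before the match
        // (?!\S)  : no non-whitespace character directly after the match
        string pattern = @"(?<!\S)" + Regex.Escape(search) + @"(?!\S)";
        // The evaluator returns the replacement verbatim, so a "$" in it
        // is not interpreted as a substitution token.
        return Regex.Replace(text, pattern, m => replacement);
    }
}
```

Called with "one two(three) (three) four five", "(three)" and "(four)", this gives "one two(three) (four) four five". The limitation described above still applies: searching for three alone will not find the one inside (three).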
I think most text editors (including Visual Studio) will use \b only if your search string actually starts and/or ends with a word character:
var pattern = Regex.Escape(searchString);
if (Regex.IsMatch(searchString, @"^\w"))
    pattern = @"\b" + pattern;
if (Regex.IsMatch(searchString, @"\w$"))
    pattern = pattern + @"\b";
That way they will find “(three)” even if you select “whole words only”.
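To see what this does to the input from the question, the snippet can be wrapped in a small helper. The class and method names (WholeWordPattern, Build) are made up for the example; the logic is the conditional \b approach described above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WholeWordPattern
{
    public static string Build(string searchString)
    {
        string pattern = Regex.Escape(searchString);
        // Add \b only on the sides where the search string has a word character.
        if (Regex.IsMatch(searchString, @"^\w"))
            pattern = @"\b" + pattern;
        if (Regex.IsMatch(searchString, @"\w$"))
            pattern = pattern + @"\b";
        return pattern;
    }
}
```

Build("three") produces \bthree\b, while Build("(three)") produces \(three\) with no boundaries at all. Replacing with the latter therefore changes both occurrences in "one two(three) (three) four five", including the one attached to "two".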

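Here is a simple snippet you may be interested in:
// Regex.Escape keeps special characters in the search term from being read as regex syntax.
string pattern = @"\b" + Regex.Escape(find) + @"\b";
string result = Regex.Replace(stringToSearch, pattern, replace, RegexOptions.IgnoreCase);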
Source code: snip2code - C#: Replace an exact word in a sentence

See what a word boundary matches:
A word boundary can occur in one of three positions:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
So, your \b\(three\)\b regex DOES work, but NOT the way you expected. It does not match (three) in In (three) years, In(three) years and In (three)years, but it matches in In(three)years because there are word boundaries between n and ( and between ) and y.
What you can do in these situations is use dynamic adaptive word boundaries that are constructs that ensure whole word matching where they are expected only (see my "Dynamic adaptive word boundaries" YT video for better visual understanding of these constructs).
In C#, it can be written as
@"(?!\B\w)\(three\)(?<!\w\B)"
In short:
(?!\B\w) - only require a word boundary on the left if the char that follows the word boundary is a word char
\(three\)
(?<!\w\B) - only require a word boundary on the right if the char that precedes the word boundary is a word char.
In case your search phrases can contain whitespace and you need to match the longer alternatives first, you can build the pattern dynamically from a list like
var phrases = new List<string> { @"(one)", @".two.", "[three]" };
phrases = phrases.OrderByDescending(x => x.Length).ToList();
var pattern = $@"(?!\B\w)(?:{string.Join("|", phrases.Select(z => Regex.Escape(z)))})(?<!\w\B)";
with the resulting pattern like (?!\B\w)(?:\[three]|\(one\)|\.two\.)(?<!\w\B) that matches what you'd expect.
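A quick way to check how the adaptive boundaries behave is a small test helper like the one below. It is a sketch for a single phrase only, and the names AdaptiveBoundary and ContainsWhole are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AdaptiveBoundary
{
    public static bool ContainsWhole(string text, string phrase)
    {
        // A boundary is enforced on a side only when the phrase has a
        // word character on that side.
        string pattern = @"(?!\B\w)" + Regex.Escape(phrase) + @"(?<!\w\B)";
        return Regex.IsMatch(text, pattern);
    }
}
```

With a plain word the result is the same as with \b: three is found in "one (three) four" but not in "one threesome". A phrase with non-word characters at its edges, such as (three), is matched wherever it occurs, including in "two(three) x".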

I recently came across a similar issue in JavaScript when trying to match terms with a leading '$' character only as separate words, e.g. if $hot = 'FUZZ', then:
"some $hot $hotel bird$hot pellets" ---> "some FUZZ $hotel bird$hot pellets"
The regex /\b\$hot\b/g (my first guess) did not work for the same reason the parens did not match in the original question: since they are non-word characters, there is no word/non-word boundary between them and preceding whitespace or the start of the string.
However, the regex /\B\$hot\b/g does match, which shows that the positions not marked in @timwi's excellent example match the \B term. This was not intuitive to me because ") (" is not made of regex word characters. But I guess since \B is an inversion of the \b class, it doesn't have to be word characters, it just has to be not- not- word characters :)

As Gopi said, but (theoretically) catching only (three) not two(three):
string input = "one two(three) (three) four five";
string output = input.Replace(" (three) ", " (four) ");
When I test that, I get: "one two(three) (four) four five". Just remember that white space is a string character too, so it can also be replaced. If I did this:
// use same input
string output = input.Replace(" ", ";");
I'd get "one;two(three);(three);four;five"

Regex to find words that start with a specific character

I am trying to find words that start with a specific character, like:
Lorem ipsum #text Second lorem ipsum.
How #are You. It's ok. Done.
Something #else now.
I need to get all words that start with "#", so my expected results are #text, #are, #else
Any ideas?

Search for:
something that is not a word character then
#
some word characters
So try this:
/(?<!\w)#\w+/
Or in C# it would look like this:
string s = "Lorem ipsum #text Second lorem ipsum. How #are You. It's ok. Done. Something #else now.";
foreach (Match match in Regex.Matches(s, @"(?<!\w)#\w+"))
{
    Console.WriteLine(match.Value);
}
Output:
#text
#are
#else

Try this #(\S+)\s?

Match a word starting with # after a white space or at the beginning of a line. The last word boundary is not necessary, depending on your usage.
/(?:^|\s)\#(\w+)\b/
The parentheses will capture your word in a group. Now, it depends on the language how you apply this regex.
The (?:...) is a non-capturing group.

The code below should solve the case.
/\$(\w)+/g searches for words that start with $
/#(\w)+/g searches for words that start with #
The answer /(?<!\w)#\w+/ given by Mark Bayers triggers a warning like the one below on the RegExr.com website:
"(?<!" The "negative lookbehind" feature may not be supported in all browsers.
The warning goes away if you change it to (?!\w)#\w+ by removing the <. Note, however, that this turns the lookbehind into a lookahead, so the character before # is no longer checked.

To accommodate different languages I have this (PCRE/PHP):
'~(?<!\p{Latin})#(\p{Latin}+)~u'
or
$language = 'ex. get form value';
'~(?<!\p{' . $language . '})#(\p{' . $language . '}+)~u'
or to cycle through multiple scripts
$languages = $languageArray;
$replacePattern = [];
foreach ($languages as $language) {
$replacePattern[] = '~(?<!\p{' . $language . '})#(\p{' . $language . '}+)~u';
}
$replacement = '<html>$1</html>';
$replaceText = preg_replace($replacePattern, $replacement, $text);
\w works great, but as far as I've seen is only for Latin script.
Switch Latin for Cyrillic or Phoenician in the above example.
The above example does not work for 'RTL' scripts.
