Regex to find words that start with a specific character - c#

I am trying to find words that start with a specific character, for example:
Lorem ipsum #text Second lorem ipsum.
How #are You. It's ok. Done.
Something #else now.
I need to get all words that start with "#", so my expected results are #text, #are, #else.
Any ideas?

Search for:
something that is not a word character then
#
some word characters
So try this:
/(?<!\w)#\w+/
Or in C# it would look like this:
string s = "Lorem ipsum #text Second lorem ipsum. How #are You. It's ok. Done. Something #else now.";
foreach (Match match in Regex.Matches(s, @"(?<!\w)#\w+"))
{
Console.WriteLine(match.Value);
}
Output:
#text
#are
#else
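For comparison, the same lookbehind pattern can be tried outside C#. This is only a minimal sketch in Python (the choice of Python is an assumption for illustration; the question itself is about C#), using the sample text from the question:

```python
import re

text = ("Lorem ipsum #text Second lorem ipsum. "
        "How #are You. It's ok. Done. Something #else now.")

# (?<!\w) rejects a '#' that directly follows a word character,
# so something like "foo#bar" would not be reported.
tags = re.findall(r"(?<!\w)#\w+", text)
print(tags)
```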

Try this #(\S+)\s?

Match a word starting with # after whitespace or at the beginning of a line. The final word boundary is not necessary, depending on your usage.
/(?:^|\s)\#(\w+)\b/
The parentheses will capture your word in a group. Now, it depends on the language how you apply this regex.
The (?:...) is a non-capturing group.
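How the group is read depends on the language. As one hedged illustration, in Python (an assumption; the multi-line sample string is made up for the example) re.findall returns only the captured group when the pattern has exactly one:

```python
import re

text = "Lorem ipsum #text Second lorem ipsum.\nHow #are You.\n#else starts this line."

# With exactly one capturing group, findall returns just that group,
# i.e. the word without the leading '#'.
# re.MULTILINE lets ^ match at the start of every line.
words = re.findall(r"(?:^|\s)#(\w+)\b", text, flags=re.MULTILINE)
print(words)
```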

Code below should solve the case.
/\$(\w)+/g searches for words that start with $
/#(\w)+/g searches for words that start with #
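The answer /(?<!\w)#\w+/ given by Mark Bayers triggers the following warning on the RegExr.com website:
"(?<!" The "negative lookbehind" feature may not be supported in all browsers.
The warning goes away if you change the pattern to (?!\w)#\w+ by removing the <, but note that this turns the negative lookbehind into a negative lookahead. Since # is never a word character, (?!\w) always succeeds at that position, so the pattern then behaves like plain #\w+ and also matches the #bar in foo#bar.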
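The lookbehind and lookahead variants are not interchangeable, which a short sketch can show. Python and the sample string are assumptions made for illustration:

```python
import re

lookbehind = re.compile(r"(?<!\w)#\w+")  # '#' must not follow a word character
lookahead = re.compile(r"(?!\w)#\w+")    # '#' is never a word character, so this check always passes
plain = re.compile(r"#\w+")

sample = "foo#bar #baz"

behind_hits = lookbehind.findall(sample)
ahead_hits = lookahead.findall(sample)
plain_hits = plain.findall(sample)
print(behind_hits, ahead_hits, plain_hits)
```

Only the lookbehind version skips the # that sits in the middle of foo#bar.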

To accommodate different languages I have this (PCRE/PHP):
'~(?<!\p{Latin})#(\p{Latin}+)~u'
or
$language = 'ex. get form value';
'~(?<!\p{' . $language . '})#(\p{' . $language . '}+)~u'
or to cycle through multiple scripts
$languages = $languageArray;
$replacePattern = [];
foreach ($languages as $language) {
$replacePattern[] = '~(?<!\p{' . $language . '})#(\p{' . $language . '}+)~u';
}
$replacement = '<html>$1</html>';
$replaceText = preg_replace($replacePattern, $replacement, $text);
\w works great, but as far as I've seen it only covers Latin script.
Switch Latin for Cyrillic or Phoenician in the above example.
The above example does not work for 'RTL' scripts.
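Whether \w covers other scripts depends on the regex flavor. As a point of comparison only (Python is an assumption here, the answer above is about PCRE/PHP), Python 3 treats \w as Unicode-aware for str patterns unless re.ASCII is set:

```python
import re

text = "привет #мир and #text"

# Default for str patterns in Python 3: \w matches Unicode word characters.
unicode_tags = re.findall(r"(?<!\w)#(\w+)", text)

# With re.ASCII, \w is limited to [A-Za-z0-9_].
ascii_tags = re.findall(r"(?<!\w)#(\w+)", text, flags=re.ASCII)

print(unicode_tags, ascii_tags)
```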


how to find a word in one sentence using Regex

I have this sample data:
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Re: Krishna P Mohan (31231231 / NA0031212301)
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
This is what I expect and currently get:
expected op - Krishna P Mohan
output - Krishna P Mohan (31231231 / NA0031212301)
I need to find the name that comes after Re: and runs up to the (. I'm getting the complete line instead of only the name up to the opening bracket.
code
var regex = new Regex(@"[\n\r].*Re:\s*([^\n\r]*)");
var fullNameText = regex.Match(extractedDocContent).Value;
If you want a match only, you can use a lookbehind assertion:
(?<=\bRe:\s*)[^\s()]+(?:[^\n\r()]*[^\s()])?
Explanation
(?<=\bRe:\s*) Positive lookbehind, assert the word Re: followed by optional whitespace chars to the left
[^\s()]+ Match 1 or more non whitespace chars except for ( and )
(?: Non capture group
[^\n\r()]* Optionally repeat matching any char except newlines and ( or )
[^\s()] Match a non whitespace character except for ( and )
)? Close the non capture group
If you want the capture group value, and you are matching only word characters:
\bRe:\s*([^\n\r(]+)\b
Regex demo
Else you can use:
\bRe:\s*([^\s()]+(?:[^\n\r()]*[^\s()])?)
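The capture-group variant can be checked against the sample data from the question. This is a minimal sketch in Python (an assumption; the question uses C#), and the variable names are made up for the example:

```python
import re

doc = (
    "Lorem Ipsum is simply dummy text of the printing and typesetting industry.\n"
    "Re: Krishna P Mohan (31231231 / NA0031212301)\n"
    "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,\n"
)

# Group 1 starts and ends on a character that is neither whitespace nor a
# parenthesis, so the space before "(" is not part of the result.
pattern = re.compile(r"\bRe:\s*([^\s()]+(?:[^\n\r()]*[^\s()])?)")

match = pattern.search(doc)
name = match.group(1) if match else None
print(name)
```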

Regex to match and replace whole word or part of word [duplicate]

I'm trying to use regexes to match space-separated numbers.
I can't find a precise definition of \b ("word boundary").
I had assumed that -12 would be an "integer word" (matched by \b\-?\d+\b) but it appears that this does not work. I'd be grateful to know of ways of .
[I am using Java regexes in Java 1.6]
Example:
Pattern pattern = Pattern.compile("\\s*\\b\\-?\\d+\\s*");
String plus = " 12 ";
System.out.println(""+pattern.matcher(plus).matches());
String minus = " -12 ";
System.out.println(""+pattern.matcher(minus).matches());
pattern = Pattern.compile("\\s*\\-?\\d+\\s*");
System.out.println(""+pattern.matcher(minus).matches());
This returns:
true
false
true
A word boundary, in most regex dialects, is a position between \w and \W (non-word char), or at the beginning or end of a string if it begins or ends (respectively) with a word character ([0-9A-Za-z_]).
So, in the string "-12", it would match before the 1 or after the 2. The dash is not a word character.
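The behaviour from the question can be reproduced in other flavors too. Here is a small Python sketch of the same three checks (Python is an assumption; the question uses Java, and fullmatch stands in for Matcher.matches()):

```python
import re

with_boundary = re.compile(r"\s*\b-?\d+\s*")
without_boundary = re.compile(r"\s*-?\d+\s*")

# fullmatch requires the whole string to match, like Java's matches().
plus_ok = with_boundary.fullmatch(" 12 ") is not None
# Fails: there is no word boundary between the space and the dash.
minus_ok = with_boundary.fullmatch(" -12 ") is not None
minus_ok_no_boundary = without_boundary.fullmatch(" -12 ") is not None

print(plus_ok, minus_ok, minus_ok_no_boundary)
```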
While learning regular expressions, I was really stuck on the \b metacharacter. I did not understand its meaning at all and kept asking myself "what is it, what is it". After some experimenting on the website, I noticed the pink vertical dashes at the beginning and at the end of every word, and at that point its meaning became clear to me: it is exactly a word (\w) boundary.
My explanation is only meant to build intuition. For the logic behind it, see the other answers.
A word boundary can occur in one of three positions:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Word characters are alpha-numeric; a minus sign is not.
Taken from Regex Tutorial.
I would like to explain Alan Moore's answer
A word boundary is a position that is either preceded by a word character and not followed by one or followed by a word character and not preceded by one.
Suppose I have a string "This is a cat, and she's awesome", and I want to replace all occurrences of the letter 'a' only if this letter ('a') exists at the "Boundary of a word",
In other words: the letter a inside 'cat' should not be replaced.
So I'll perform regex (in Python) as
re.sub(r"\ba", "e", myString.strip())  # replace a with e
Therefore,
Input: This is a cat and she's awesome
Output: This is e cat end she's ewesome
A word boundary is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.
I talk about what \b-style regex boundaries actually are here.
The short story is that they’re conditional. Their behavior depends on what they’re next to.
# same as using a \b before:
(?(?=\w) (?<!\w) | (?<!\W) )
# same as using a \b after:
(?(?<=\w) (?!\w) | (?!\W) )
Sometimes that isn’t what you want. See my other answer for elaboration.
I ran into an even worse problem when searching text for words like .NET, C++, C#, and C. You would think that computer programmers would know better than to name a language something that is hard to write regular expressions for.
Anyway, this is what I found out (summarized mostly from http://www.regular-expressions.info, which is a great site): In most flavors of regex, characters that are matched by the short-hand character class \w are the characters that are treated as word characters by word boundaries. Java is an exception. Java supports Unicode for \b but not for \w. (I'm sure there was a good reason for it at the time).
The \w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits (but not dash!). In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren't digits may or may not be included. XML Schema and XPath even include all symbols in \w. But Java, JavaScript, and PCRE match only ASCII characters with \w.
Which is why Java-based regex searches for C++, C# or .NET (even when you remember to escape the period and pluses) are screwed by the \b.
Note: I'm not sure what to do about mistakes in text, like when someone doesn't put a space after a period at the end of a sentence. I allowed for it, but I'm not sure that it's necessarily the right thing to do.
Anyway, in Java, if you're searching text for those weird-named languages, you need to replace the \b with before and after whitespace and punctuation designators. For example:
public static String grep(String regexp, String multiLineStringToSearch) {
String result = "";
String[] lines = multiLineStringToSearch.split("\\n");
Pattern pattern = Pattern.compile(regexp);
for (String line : lines) {
Matcher matcher = pattern.matcher(line);
if (matcher.find()) {
result = result + "\n" + line;
}
}
return result.trim();
}
Then in your test or main function:
String beforeWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|^)";
String afterWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|$)";
String text = "Programming in C, (C++) C#, Java, and .NET.";
System.out.println("text="+text);
// Here is where Java word boundaries do not work correctly on "cutesy" computer language names.
System.out.println("Bad word boundary can't find because of Java: grep with word boundary for .NET="+ grep("\\b\\.NET\\b", text));
System.out.println("Should find: grep exactly for .NET="+ grep(beforeWord+"\\.NET"+afterWord, text));
System.out.println("Bad word boundary can't find because of Java: grep with word boundary for C#="+ grep("\\bC#\\b", text));
System.out.println("Should find: grep exactly for C#="+ grep("C#"+afterWord, text));
System.out.println("Bad word boundary can't find because of Java:grep with word boundary for C++="+ grep("\\bC\\+\\+\\b", text));
System.out.println("Should find: grep exactly for C++="+ grep(beforeWord+"C\\+\\+"+afterWord, text));
System.out.println("Should find: grep with word boundary for Java="+ grep("\\bJava\\b", text));
System.out.println("Should find: grep for case-insensitive java="+ grep("(?i)\\bjava\\b", text));
System.out.println("Should find: grep with word boundary for C="+ grep("\\bC\\b", text)); // Works Ok for this example, but see below
// Because of the stupid too-short cutesy name, searches find stuff they shouldn't.
text = "Worked on C&O (Chesapeake and Ohio) Canal when I was younger; more recently developed in Lisp.";
System.out.println("text="+text);
System.out.println("Bad word boundary because of C name: grep with word boundary for C="+ grep("\\bC\\b", text));
System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
// Make sure the first and last cases work OK.
text = "C is a language that should have been named differently.";
System.out.println("text="+text);
System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
text = "One language that should have been named differently is C";
System.out.println("text="+text);
System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
//Make sure we don't get false positives
text = "The letter 'c' can be hard as in Cat, or soft as in Cindy. Computer languages should not require disambiguation (e.g. Ruby, Python vs. Fortran, Hadoop)";
System.out.println("text="+text);
System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
P.S. My thanks to http://regexpal.com/ without whom the regex world would be very miserable!
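The same idea (explicit delimiters instead of \b) is not specific to Java, because #, + and . are non-word characters in most flavors. A rough Python sketch follows. Python is an assumption, and BEFORE, AFTER and contains_term are hypothetical names modelled on beforeWord and afterWord above:

```python
import re

# Hypothetical delimiter sets, modelled on beforeWord/afterWord above.
BEFORE = r"(?:^|[\s.,!?()'\"])"
AFTER = r"(?:$|[\s.,!?()'\"])"

def contains_term(term, text):
    """Return True if `term` occurs delimited by whitespace, punctuation, or the ends of the string."""
    return re.search(BEFORE + re.escape(term) + AFTER, text) is not None

text = "Programming in C, (C++) C#, Java, and .NET."
canal = "Worked on C&O (Chesapeake and Ohio) Canal when I was younger."

# \b needs a word character on exactly one side, so it fails next to '#', '+' or '.'.
boundary_finds_csharp = re.search(r"\bC#\b", text) is not None
boundary_finds_dotnet = re.search(r"\b\.NET\b", text) is not None

print(boundary_finds_csharp, boundary_finds_dotnet)
print(contains_term("C#", text), contains_term("C++", text), contains_term(".NET", text))
print(contains_term("C", canal))
```

re.escape takes care of escaping the period and the pluses, so the search terms can be passed in as plain text.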
Check out the documentation on boundary conditions:
http://java.sun.com/docs/books/tutorial/essential/regex/bounds.html
Check out this sample:
public static void main(final String[] args)
{
String x = "I found the value -12 in my string.";
System.err.println(Arrays.toString(x.split("\\b-?\\d+\\b")));
}
When you print it out, notice that the output is this:
[I found the value -, in my string.]
This means that the "-" character is not being picked up as being on the boundary of a word because it's not considered a word character. Looks like @brianary kinda beat me to the punch, so he gets an up-vote.
Reference: Mastering Regular Expressions (Jeffrey E.F. Friedl) - O'Reilly
\b is equivalent to (?<!\w)(?=\w)|(?<=\w)(?!\w)
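The equivalence can be spot-checked by comparing where both patterns match. A minimal sketch in Python (an assumption; the helper name and the sample string are made up, and a recent Python 3 is assumed for the handling of empty matches):

```python
import re

sample = "I found the value -12 in my_string."

def positions(pattern, text):
    """Return the start offsets of all (zero-width) matches of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

builtin = positions(r"\b", sample)
spelled_out = positions(r"(?<!\w)(?=\w)|(?<=\w)(?!\w)", sample)

print(builtin == spelled_out)
```

Offset 18 (just before the dash) is in neither list, while offset 19 (between the dash and the 1) is in both.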
A word boundary \b matches where the character on one side is a word character and the character on the other side is a non-word character.
Regular Expression for negative number should be
--?\b\d+\b
check working DEMO
I believe that your problem is due to the fact that - is not a word character. Thus, the word boundary will match after the -, and so will not capture it. Word boundaries match before the first and after the last word character in a string, as well as at any position that has a word character on one side and a non-word character on the other. Also note that a word boundary is a zero-width match.
One possible alternative is
(?:(?:^|\s)-?)\d+\b
This will match any numbers starting with a space character and an optional dash, and ending at a word boundary. It will also match a number starting at the beginning of the string.
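A quick way to see what this alternative picks up, sketched in Python (an assumption; the sample string is made up). Note that the match includes the leading whitespace, so it is stripped afterwards:

```python
import re

pattern = re.compile(r"(?:(?:^|\s)-?)\d+\b")

sample = "12 apples, -12 degrees, and x-3"

# "x-3" is skipped: the dash is not preceded by whitespace or the string start.
numbers = [m.group().strip() for m in pattern.finditer(sample)]
print(numbers)
```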
When you use \\b(\\w+)+\\b, that means an exact match with a word containing only word characters ([a-zA-Z0-9_]).
In your case, for example, putting \\b at the beginning of the regex will accept -12 (with space) but it still won't accept -12 (without space).
for reference to support my words: https://docs.oracle.com/javase/tutorial/essential/regex/bounds.html
I think it's the boundary (i.e. character following) of the last match or the beginning or end of the string.

Regex.Split on "non-quoted whitespace" returning duplicate strings

I am trying to write a regex that splits a single-line string in command-line-argument style. It should split on:
1. Whitespace (\s+)
2. that is not quoted by "
2-a. ignoring escaped quotes (\")
According to regexr.com and regex101.com (link to test code), where I created this pattern, it works without problems as far as I can tell.
Given the code:
Regex.Split("lorem \"ipsum door?!\" sit?! amet!!!", @"(?<=^(\\""|[^""])*((?<!\\)""(\\""|[^""])*((?<!\\)""))*(\\""|[^""])*)\s+")
//(?<=^(\\"|[^"])*((?<!\\)"(\\"|[^"])*((?<!\\)"))*(\\"|[^"])*)\s+
Expected:
lorem
"ipsum door?!"
sit?!
amet!!!
Returns:
lorem
l
"ipsum door?!"
l
"ipsum door?!"
i
"
sit?!
l
"ipsum door?!"
i
"
amet!!!
More info: Before adding condition 2-a(ignore \"), I came up with this and had a similar issue. Code/result: https://pastebin.com/76eKp1wb
I was able to get the wanted result with the option RegexOptions.ExplicitCapture,
like so:
Regex splitter = new Regex(@"(?<=^(\\""|[^""])*((?<!\\)""(\\""|[^""])*((?<!\\)""))*(\\""|[^""])*)\s+", RegexOptions.ExplicitCapture);
var lineBreak = splitter.Split("lorem \\\"ipsum \\\" door?!\" sit?! \" amet!!!");
Apparently, when matching, it makes no difference whether the option is set. The problem only shows up when splitting.
My best guess is that the library captures something in the capture groups and tries to split there.
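That guess fits how split functions in several regex libraries treat capturing groups: text captured by a group is added to the result. The .NET documentation describes this for Regex.Split, and ExplicitCapture avoids it by turning unnamed groups into non-capturing ones. Python's re.split behaves the same way, so a small Python sketch can illustrate the effect. Python and the simplified pattern are assumptions here, since the variable-width lookbehind from the question is not supported by Python's re module:

```python
import re

line = "lorem ipsum  sit"

# Capturing group: the captured text is returned alongside the pieces.
with_group = re.split(r"(\s)+", line)

# Non-capturing group: only the pieces are returned.
without_group = re.split(r"(?:\s)+", line)

print(with_group)
print(without_group)
```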

Limit regex expression by character in c#

I have the following pattern: (\s\w+). I need it to match every word in my string that is preceded by a space.
For example
When i have this string
many word in the textarea must be happy
I get
many
word
in
the
textarea
must
be
happy
It is correct, but when i have another character, for example
many word in the textarea , must be happy
I get
many
word
in
the
textarea
must
be
happy
But "must be happy" should be ignored, because I want the matching to stop when another character appears in the string.
Edit:
Example 2
all cats { in } the world are nice
Should return
all
cats
Because { is another separator for me
Example 3
My 3 cats are ... funny
Should return
My
3
cats
are
Because 3 is alphanumeric and . is a separator for me
What can I do?
To do that you need to use the \G anchor, which matches the position at the start of the string or right after the previous match. So you can do it with this pattern:
@"(?<=\G\s*)\w+"
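For readers in a flavor without \G, the same "continue only from where the last match ended" behaviour can be approximated by hand. Python's standard re module does not support \G, so this sketch (Python and the helper name leading_words are assumptions) swaps in an anchored match at a moving position:

```python
import re

WORD = re.compile(r"\s*(\w+)")

def leading_words(text):
    """Collect words from the start of text until something other than a word or whitespace appears."""
    words = []
    pos = 0
    while True:
        m = WORD.match(text, pos)  # match() is anchored at pos, much like \G
        if m is None:
            break
        words.append(m.group(1))
        pos = m.end()
    return words

print(leading_words("many word in the textarea , must be happy"))
print(leading_words("all cats { in } the world are nice"))
print(leading_words("My 3 cats are ... funny"))
```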
[^\w\s\n].*$|(\w+\s+)
Try this. Grab the captures or matches. Set flag m for multiline mode. See demo.
http://regex101.com/r/kP4pZ2/12
I think Sam I Am's comment is correct: you'll require two regular expressions.
Capture the text up to a non-word character.
Capture all the words with a space on one side.
Here's the corresponding code:
"^(\\w+\\s+)+"
"(\\w+\\s+)"
You can combine these two to capture just the individual words pretty easily - like so
"^(\\w+\\s+)+"
Here's a complete piece of code demonstrating the pattern:
string input = "many word in the textarea , must be happy";
string pattern = "^(\\w+\\s+)+";
Match match = Regex.Match(input , pattern);
// Never throws a NullReferenceException because of the GroupCollection indexer - check it out!
foreach(Capture capture in match.Groups[1].Captures)
{
Console.WriteLine(capture.Value);
}
EDIT
Check out Casimir et Hippolyte for a really clean answer.
All in one regex :-) Result is in list
Regex regex = new Regex(@"^((\w+)\s*)+([^\w\s]|$).*");
Match m = regex.Match(inputString);
if(m.Success)
{
List<string> list =
m.Groups[2].Captures.Cast<Capture>().
Select(c=>c.Value).ToList();
}

Match everything before a specific word in a multiline string

I'm trying to filter out some garbage text from a string with regex but can't seem to get it to work. I'm not a regex expert (not even close) and I've searched for similar examples but none that seems to solve my problem.
I need a regex that matches everything from the start of a string to a specific word in that string but not the word itself.
here's an example:
<p>This is the string I want to process with as you can see also contains HTML tags like <i>this</i> and <strong>this</strong></p>
<p>I want to remove everything in the string BEFORE the word "giraffe" (but not "giraffe" itself and keep everything after it.</p>
So, how do I match everything in the string before the word "giraffe"?
Thanks!
resultString = Regex.Replace(subjectString,
@"\A # Start of string
(?: # Match...
(?!""giraffe"") # (unless we're at the start of the string ""giraffe"")
. # any character (including newlines)
)* # zero or more times",
"", RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
should work.
Why regex?
String s = "blagiraffe";
s = s.Substring(s.IndexOf("giraffe"));
Try this:
var s =
@"<p>This is the string I want to process with as you can see also contains HTML tags like <i>this</i> and <strong>this</strong></p>
<p>I want to remove everything in the string BEFORE the word ""giraffe"" (but not ""giraffe"" itself and keep everything after it.</p>";
var ex = new Regex("giraffe.*$", RegexOptions.Multiline);
Console.WriteLine(ex.Match(s).Value);
This code snippet produces the following output:
giraffe" (but not "giraffe" itself and keep everything after it.</p>
A look-ahead would do the trick:
^.*(?=\s+giraffe)
You could use a pattern with a lookahead like this:
^.*?(?=giraffe)
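For a multi-line input, the lazy lookahead needs the dot to cross line breaks. A minimal sketch in Python (an assumption; the question uses C#, where RegexOptions.Singleline plays the role of re.DOTALL), with a shortened made-up sample and the plain string search for comparison:

```python
import re

html = (
    "<p>This is the string I want to process</p>\n"
    "<p>Remove everything BEFORE the word giraffe but keep giraffe and the rest.</p>"
)

# \A anchors at the very start; DOTALL lets '.' cross line breaks;
# the lazy .*? stops at the first position that is followed by "giraffe".
trimmed = re.sub(r"\A.*?(?=giraffe)", "", html, count=1, flags=re.DOTALL)

# Plain string search gives the same result without a regex.
index = html.find("giraffe")
sliced = html[index:] if index != -1 else html

print(trimmed)
```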
