Why doesn't $ always match the end of a line? - C#

Below is a simple code snippet that demonstrates the seemingly buggy behavior of end-of-line matching ("$") in .NET regular expressions. Am I missing something obvious?
string input = "Hello\nWorld\n";
string regex = @"^Hello\n^World\n"; //Match
//regex = @"^Hello\nWorld\n"; //Match
//regex = @"^Hello$"; //Match
//regex = @"^Hello$World$"; //No match!!!
//regex = @"^Hello$^World$"; //No match!!!
Match m = Regex.Match(input, regex, RegexOptions.Multiline | RegexOptions.CultureInvariant);
Console.WriteLine(m.Success);

$ does not consume the newline character(s). @"^Hello$\s+^World$" should match.

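The $ doesn't match a newline; it is a zero-width anchor and consumes nothing. Without multiline mode it matches at the end of the string to which the pattern is applied (or just before a final newline); with RegexOptions.Multiline it also matches just before every \n. Either way, the \n itself still has to be matched by something else in the pattern, which is why ^Hello$World$ fails.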
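A minimal sketch of the point made in these answers, assuming C# 9 or later for top-level statements. The helper name IsMultilineMatch is made up for illustration; the input string is the one from the question.

```csharp
using System;
using System.Text.RegularExpressions;

const string input = "Hello\nWorld\n";

// Hypothetical helper: tests a pattern with the same options as the question.
static bool IsMultilineMatch(string text, string pattern) =>
    Regex.IsMatch(text, pattern, RegexOptions.Multiline | RegexOptions.CultureInvariant);

// "$" is zero-width, so "World" would have to start where the "\n" is.
Console.WriteLine(IsMultilineMatch(input, @"^Hello$World$"));      // False

// Consuming the newline explicitly lets the rest of the pattern line up.
Console.WriteLine(IsMultilineMatch(input, @"^Hello$\nWorld$"));    // True
Console.WriteLine(IsMultilineMatch(input, @"^Hello$\s+^World$"));  // True
```

The anchors only assert a position; it is the \n or \s+ between them that actually moves the match past the line break.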

Related

Find hashtags in string

I am working on a Xamarin.Forms PCL project in C# and would like to detect all the hashtags.
I tried splitting at spaces and checking whether each word begins with a #, but the problem is that if the post contains two spaces, like "Hello #World Test", the double space is lost.
string body = "Example string with a #hashtag in it";
string newbody = "";
foreach (var word in body.Split(' '))
{
if (word.StartsWith("#"))
newbody += "[" + word + "]";
newbody += word;
}
Goal output:
Example string with a [#hashtag] in it
I also only want it to have A-Z a-z 0-9 and _ stopping at any other character
Test #H3ll0_W0rld$%Test => Test [#H3ll0_W0rld]$%Test
Other Stack Overflow questions try to detect the hashtag and extract it. I would like to work with it and put it back in the string without losing anything that methods such as splitting on certain characters would lose.
You can use Regex with #\w+ and $&
Explanation
# matches the character # literally (case sensitive)
\w+ matches any word character (roughly [a-zA-Z0-9_]; in .NET, \w also matches Unicode letters and digits unless RegexOptions.ECMAScript is set)
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$& Includes a copy of the entire match in the replacement string.
Example
var input = "asdads sdfdsf #burgers, #rabbits dsfsdfds #sdf #dfgdfg";
var regex = new Regex(@"#\w+");
var matches = regex.Matches(input);
foreach (var match in matches)
{
Console.WriteLine(match);
}
or
var result = regex.Replace(input, "[$&]" );
Console.WriteLine(result);
Output
#burgers
#rabbits
#sdf
#dfgdfg
asdads sdfdsf [#burgers], [#rabbits] dsfsdfds [#sdf] [#dfgdfg]
Another Example
Use a regular expression: \#\w*
string pattern = @"\#\w*"; // verbatim string, so the backslashes reach the regex engine
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection matches = rgx.Matches(input);
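As a sketch of how the replacement approach covers the question's two requirements (keep all other text, including double spaces, and stop at anything outside A-Z, a-z, 0-9 and _), the character class can be spelled out instead of relying on \w. The helper name BracketHashtags is hypothetical, and the snippet assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: wraps each hashtag in square brackets and
// leaves every other character of the input untouched.
static string BracketHashtags(string body) =>
    Regex.Replace(body, @"#[A-Za-z0-9_]+", "[$&]"); // $& = the whole match

Console.WriteLine(BracketHashtags("Hello  #World Test"));       // Hello  [#World] Test
Console.WriteLine(BracketHashtags("Test #H3ll0_W0rld$%Test"));  // Test [#H3ll0_W0rld]$%Test
```

Because Regex.Replace only rewrites the matched spans, nothing needs to be split and rejoined.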

C# Regex match multiple words in a string

How can I find all the matches in a string using a regular expression run in C#?
I want to find all matches in the below example string.
Example:
inputString: Hello (mail) byebye (time) how are you (mail) how are you (time)
I want to match (mail) and (time) from the example. Including parentheses( and ).
In attempting to solve this, I've written the following code.
string testString = @"(mail)|(time)";
Regex regx = new Regex(Regex.Escape(testString), RegexOptions.IgnoreCase);
List<string> matches = regx.Matches(inputString).OfType<Match>().Select(m => m.Value).Distinct().ToList();
foreach (string match in matches)
{
//Do something
}
Is the pipe(|) used for the logical OR condition?
Using Regex.Escape(testString) is going to escape your pipe character, turning
@"(mail)|(time)"
effectively into
@"\(mail\)\|\(time\)".
Thus, your regex is looking for the literal "(mail)|(time)".
If all of your matches are as simple as words surrounded by parens, I would build the regex like this:
List<string> words = new List<string> { "(mail)", "(time)", ... };
string pattern = string.Join("|", words.Select(w => Regex.Escape(w)));
Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
Escape the parentheses in your test string:
string testString = @"\(mail\)|\(time\)";
Remove Regex.Escape:
Regex regx = new Regex(testString, RegexOptions.IgnoreCase);
Output (includes parentheses):
(mail)
(time)
The reason Regex.Escape isn't working in your case is that it escapes the | character as well:
Escapes a minimal set of metacharacters (\, *, +, ?, |, {, [, (, ), ^, $, ., #, and whitespace) by replacing them with their \ codes.
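The two answers can be combined into one small sketch: escape each literal on its own, then join the escaped pieces with an unescaped |. The helper name FindDistinct is made up for illustration, and the snippet assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the distinct literal matches found in the input.
static string[] FindDistinct(string input, params string[] literals)
{
    // Escape each literal separately so the joining "|" stays an alternation operator.
    string pattern = string.Join("|", literals.Select(w => Regex.Escape(w)));
    return Regex.Matches(input, pattern, RegexOptions.IgnoreCase)
                .Cast<Match>()
                .Select(m => m.Value)
                .Distinct()
                .ToArray();
}

string inputString = "Hello (mail) byebye (time) how are you (mail) how are you (time)";
Console.WriteLine(string.Join(" ", FindDistinct(inputString, "(mail)", "(time)")));
// prints: (mail) (time)
```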

Dot word pattern matching

I want to create a regular expression to match a word that begins with a period. The word(s) can exist N times in a string. I want to ensure that the word comes up whether it's at the beginning of a line, the end of a line or somewhere in the middle. The latter part is what I'm having difficulty with.
Here is where I am at so far.
const string pattern = @"(^|(.* ))(?<slickText>\.[a-zA-Z0-9]*)( .*|$)";
public static MatchCollection Find(string input)
{
Regex regex = new Regex(pattern,RegexOptions.IgnoreCase | RegexOptions.Multiline);
MatchCollection collection = regex.Matches(input);
return collection;
}
My test pattern finds .lee and .good. My test pattern fails to find .bruce:
static void Main()
{
MatchCollection results = ClassName.Find("a short stump .bruce\r\nand .lee a small tree\r\n.good roots");
foreach (Match item in results)
{
GroupCollection groups = item.Groups;
Console.WriteLine("{0} ", groups["slickText"].Value);
}
System.Diagnostics.Debug.Assert(results.Count > 0);
}
Maybe all you need is the pattern \.\w+
Test:
var s = "a short stump .bruce\r\nand .lee a small tree\r\n.good roots";
Regex.Matches(s, @"\.\w+").Dump(); // Dump() is a LINQPad extension method
Result: three matches (.bruce, .lee and .good).
Note:
If you don't want to find foo in some.foo (because there's no whitespace between some and .foo), you can use (?<=\W|^)\.\w+ instead.
Bizarrely enough, it seems that with RegexOptions.Multiline, ^ and $ additionally match only at \n, not at \r\n.
Thus you get .good because it is preceded by \n, after which ^ matches, but you don't get .bruce because it is followed by \r, and $ does not match before \r.
You could do a .Replace("\r", "") on the input, or rewrite your expression to take individual lines of input.
Edit: Or replace $ with \r?$ in your pattern to explicitly include the \r; thanks to SvenS for the suggestion.
In your RegEx, a word has to be terminated by a space, but bruce is terminated by \r instead.
I would give this regex a go:
(?:.*?(\.[A-Za-z]+(?:\b|.\s)).*?)+
And change the RegexOptions from Multiline to Singleline - in this mode dot matches all characters including newline.
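A small sketch of the \r\n behavior described above, using the test string from the question. CountMatches is a hypothetical helper, and the two patterns are simplified stand-ins for the original one; the snippet assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

const string text = "a short stump .bruce\r\nand .lee a small tree\r\n.good roots";

// Hypothetical helper: counts matches of a pattern in multiline mode.
static int CountMatches(string input, string pattern) =>
    Regex.Matches(input, pattern, RegexOptions.Multiline).Count;

// "$" matches before "\n" but not before "\r", so ".bruce" is not at a line end here.
Console.WriteLine(CountMatches(text, @"\.\w+$"));     // 0

// Allowing an optional "\r" in front of "$" finds ".bruce".
Console.WriteLine(CountMatches(text, @"\.\w+\r?$"));  // 1
```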

How to find a string that is delimited by a certain start and end character

I want to make an array of strings based on start and end characters using regular expression.
An example will help me explain.
Considering '$' as my starting identifier and '|' as my ending identifier from the below string
stack $over| flow $stack| exchange
Regular expression should find over and stack in the above string.
[Edited to include code snippets in OP's comments...]
string testingString = "stack $over| flow $stack| exchange";
var pattern = @"(?$.*?|)"; // also tried @"\$[^|]\|"
foreach (var m in System.Text.RegularExpressions.Regex.Split(testingString, pattern)) {
Response.Write(m );
}
// output == stack $over| flow $stack| exchange
I would use a look-behind and a look-ahead to exclude the start and end delimiters from the match.
string testingString = @"stack $over| flow $stack| exchange";
MatchCollection result = Regex.Matches
(testingString,
@"
(?<=\$) # This is a lookbehind, it ensure there is a $ before the string
[^|]* # Match any character that is not a |
(?=\|) # This is a lookahead,it ensures that a | is ahead the pattern
"
, RegexOptions.IgnorePatternWhitespace);
foreach (Match item in result) {
Console.WriteLine(item.ToString());
}
RegexOptions.IgnorePatternWhitespace is a useful option: it lets you write readable regexes and put comments inside them.
In regular expressions $ is a special character meaning "match the end of the string".
For a literal $ you need to escape it, try \$.
Similarly | is a special character in regex and needs to be escaped.
Try \$.*?\| or \$[^|]+\|.
It is worth reading up on regular expressions.
[UPDATE]
In response to your comment, you want to extract text delimited by $ and |, not split on it. Try Regex.Matches instead of Regex.Split.
Regex t = new Regex(@"\$([^|]+)\|");
MatchCollection allMatches = t.Matches("stack $over| flow $stack| exchange");
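The last snippet stops at the MatchCollection. One way to get from there to the strings the question asks for is to read capture group 1 of each match, sketched below. ExtractDelimited is a hypothetical helper name, and the snippet assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the text between each "$" and the following "|".
static List<string> ExtractDelimited(string input)
{
    var results = new List<string>();
    foreach (Match m in Regex.Matches(input, @"\$([^|]+)\|"))
    {
        // Groups[0] is the whole match ("$over|"); Groups[1] is the part in parentheses.
        results.Add(m.Groups[1].Value);
    }
    return results;
}

Console.WriteLine(string.Join(", ", ExtractDelimited("stack $over| flow $stack| exchange")));
// prints: over, stack
```

With the look-behind/look-ahead version from the other answer, the whole match already excludes the delimiters, so m.Value would be used instead of m.Groups[1].Value.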

C# - Regex Match whole words

I need to match all the whole words containing a given string.
string s = @"ABC.MYTESTING
XYZ.YOUTESTED
ANY.TESTING";
Regex r = new Regex("(?<TM>[!\..]*TEST.*)", ...);
MatchCollection mc = r.Matches(s);
I need the result to be:
MYTESTING
YOUTESTED
TESTING
But I get:
TESTING
TESTED
.TESTING
How do I achieve this with Regular expressions.
Edit: Extended sample string.
If you were looking for all words including 'TEST', you should use
@"(?<TM>\w*TEST\w*)"
\w includes word characters and is short for [A-Za-z0-9_]
Keep it simple: why not just try \w*TEST\w* as the match pattern.
I get the results you are expecting with the following:
string s = @"ABC.MYTESTING
XYZ.YOUTESTED
ANY.TESTING";
var m = Regex.Matches(s, @"(\w*TEST\w*)", RegexOptions.IgnoreCase);
Try using \b. It's the regex anchor for a word boundary (the position between a word character and a non-word character). If you wanted to match whole words you could use:
/\b[a-z]+\b/i
BTW, .net doesn't need the surrounding /, and the i is just a case-insensitive match flag.
.NET Alternative:
var re = new Regex(@"\b[a-z]+\b", RegexOptions.IgnoreCase);
Using Groups I think you can achieve it.
string s = @"ABC.TESTING
XYZ.TESTED";
Regex r = new Regex(@"(?<TM>[!\..]*(?<test>TEST.*))", RegexOptions.Multiline);
var mc= r.Matches(s);
foreach (Match match in mc)
{
Console.WriteLine(match.Groups["test"]);
}
Works exactly like you want.
BTW, your regular expression pattern should be a verbatim string (@"")
Regex r = new Regex(@"(?<TM>[^.]*TEST.*)", RegexOptions.IgnoreCase);
First, as @manojlds said, you should use verbatim strings for regexes whenever possible. Otherwise you'll have to use two backslashes in most of your regex escape sequences, not just one (e.g. [!\\..]*).
Second, if you want to match anything but a dot, that part of the regex should be [^.]*. ^ is the metacharacter that inverts the character class, not !, and . has no special meaning in that context, so it doesn't need to be escaped. As written, [!\..] matches either ! or a literal dot. But you should probably use \w* instead, or even [A-Z]*, depending on what exactly you mean by "word".
Regex r = new Regex(@"(?<TM>[A-Z]*TEST[A-Z]*)", RegexOptions.IgnoreCase);
That way you don't need to bother with word boundaries, though they don't hurt:
Regex r = new Regex(@"(?<TM>\b[A-Z]*TEST[A-Z]*\b)", RegexOptions.IgnoreCase);
Finally, if you're always taking the whole match anyway, you don't need to use a capturing group:
Regex r = new Regex(@"\b[A-Z]*TEST[A-Z]*\b", RegexOptions.IgnoreCase);
The matched text will be available via Match's Value property.
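Put together, the final pattern can be run against the sample from the question roughly like this. WordsContainingTest is a hypothetical helper name, the line breaks in the sample are written as \n, and the snippet assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: returns every run of letters that contains "TEST".
static string[] WordsContainingTest(string text) =>
    Regex.Matches(text, @"\b[A-Z]*TEST[A-Z]*\b", RegexOptions.IgnoreCase)
         .Cast<Match>()
         .Select(m => m.Value) // the whole match; no capturing group needed
         .ToArray();

string s = "ABC.MYTESTING\nXYZ.YOUTESTED\nANY.TESTING";
foreach (string word in WordsContainingTest(s))
{
    Console.WriteLine(word);
}
// prints:
// MYTESTING
// YOUTESTED
// TESTING
```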
