Regex matches unspecified ampersand character in C#.NET


I'm trying to match a set of characters with a pattern, but the ampersand is matched even though I did not include it in the pattern. Could you please explain why Regex behaves like this?
string input = "<font face=\"Verdana\">É-øá-É-</font><font face=\"Arial\"> ;&: ant ;ghj\n</font>";
Regex Matcher = new Regex("</font><font face=\"[\\w\\s-_]+\">[ -,:;\\.\\r\\n\\/\\]\\)]+");
string output = Matcher.Match(input).Value;
I need the output as
"</font><font face=\"Arial\"> ;"
since the characters allowed after the font start tag don't include the & character.
But the actual output I'm getting is
"</font><font face=\"Arial\"> ;&: "
Why does this regex match the & character too?

You should escape the dash -.
[ -,
means match every character in the range between the space and the comma.
SPACE => 32
COMMA => 44
AMPERSAND => 38 (matches)

You have forgotten to escape the dash '-'. Change it to this:
Regex Matcher = new Regex("</font><font face=\"[\\w\\s-_]+\">[ \\-,:;\\.\\r\\n\\/\\]\\)]+");
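To see the effect of the unescaped hyphen, both patterns can be run against the input from the question. This is only a minimal sketch: the names rangePattern, literalPattern, withRange and withLiteral are made up for illustration and do not appear in the original code.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "<font face=\"Verdana\">É-øá-É-</font><font face=\"Arial\"> ;&: ant ;ghj\n</font>";

// Unescaped: " -," is a range from space (32) to comma (44), which contains '&' (38).
string rangePattern = "</font><font face=\"[\\w\\s-_]+\">[ -,:;\\.\\r\\n\\/\\]\\)]+";

// Escaped: "\\-" is a literal hyphen, so only the listed characters are allowed.
string literalPattern = "</font><font face=\"[\\w\\s-_]+\">[ \\-,:;\\.\\r\\n\\/\\]\\)]+";

string withRange = Regex.Match(input, rangePattern).Value;
string withLiteral = Regex.Match(input, literalPattern).Value;

Console.WriteLine(withRange);   // </font><font face="Arial"> ;&: 
Console.WriteLine(withLiteral); // </font><font face="Arial"> ;
```

Placing the hyphen first or last in the character class would also make it literal without a backslash.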

Related

Regex match with Arabic

I have a text in Arabic and I want to use Regex to extract numbers from it. Here is my attempt.
String:
"ما المجموع:
1+2"
Match match = Regex.Match(text, "المجموع: ([^\\r\\n]+)", RegexOptions.IgnoreCase);
The match always fails, and Groups[1].Value is always empty.
Expected output:
match.Groups[1].Value //returns (1+2)

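The regex you wrote matches the word, then a colon, then a single literal space, and then one or more characters other than CR and LF (if the pattern were a verbatim string, [^\\r\\n] would instead mean any character other than a backslash, r and n). In your text the colon is followed by a line break, not a space, so nothing matches.
You want to match the whole line after the word, the colon and any amount of whitespace chars: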
var text = "ما المجموع:\n1+2";
var result = Regex.Match(text, @"المجموع:\s*(.+)")?.Groups[1].Value;
Console.WriteLine(result); // => 1+2
Other possible patterns:
@"المجموع:\r?\n(.+)" // To match a CRLF or LF line ending only
@"المجموع:\n(.+)" // To match an LF ending only
Also, if you run the regex against a long multiline text with CRLF endings, it makes sense to replace .+ with [^\r\n]+, since . in a .NET regex matches any char except LF (\n) and therefore also matches the CR (\r) symbol.
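The CRLF remark can be checked with a small sketch. The sample text with CRLF endings and the names withDot and withClass are assumptions made for this illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample text with CRLF line endings (an assumption for this sketch).
string text = "ما المجموع:\r\n1+2\r\nnext line";

// '.' stops only at LF, so the captured value keeps the trailing CR.
string withDot = Regex.Match(text, @"المجموع:\s*(.+)").Groups[1].Value;

// [^\r\n]+ stops at CR as well, so the captured value is just "1+2".
string withClass = Regex.Match(text, @"المجموع:\s*([^\r\n]+)").Groups[1].Value;

Console.WriteLine(withDot.Length);   // 4 ("1+2" plus CR)
Console.WriteLine(withClass.Length); // 3
```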

Regex.IsMatch is not working when text including "$"

Regex.IsMatch method returns the wrong result while checking the following condition,
string text = "$0.00";
Regex compareValue = new Regex(text);
bool result = compareValue.IsMatch(text);
The above code returns "False". Please let me know if I missed anything.

The Regex class has a special method for escaping characters in a pattern: Regex.Escape()
Change your code like this:
string text = "$0.00";
Regex compareValue = new Regex(Regex.Escape(text)); // Escape characters in text
bool result = compareValue.IsMatch(text);
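As a quick check of what Regex.Escape does here, the following sketch prints the escaped pattern and compares the two IsMatch results. The variable names escaped, rawResult and escapedResult are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

string text = "$0.00";

// Regex.Escape puts a backslash in front of regex metacharacters such as '$' and '.'.
string escaped = Regex.Escape(text);

// Using the raw text as a pattern: '$' is an end-of-string anchor, so nothing can follow it.
bool rawResult = Regex.IsMatch(text, text);

// Using the escaped text as a pattern: every character is matched literally.
bool escapedResult = Regex.IsMatch(text, escaped);

Console.WriteLine(escaped);       // \$0\.00
Console.WriteLine(rawResult);     // False
Console.WriteLine(escapedResult); // True
```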

"$" is a special character in C# regex. Escape it first.
Regex compareValue = new Regex(@"\$0\.00");
bool result = compareValue.IsMatch("$0.00");
Regex expressions: https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx

Both '.' and '$' are special characters and thus you need to escape them if you want to match the character itself. '.' matches any character and '$' matches the end of a string
see: https://regex101.com/r/pK2uY6/1

You have to escape $ since it is a special (reserved) character which means "end of string". If . is meant as a literal dot (say, a decimal separator), you have to escape it as well (when not escaped, . means "any symbol"):
string pattern = @"\$0\.00";
bool result = Regex.IsMatch(text, pattern);
As for your original pattern, it has no chance to match any string, since $0.00 means
$ end of string, followed by
0 zero
. any character
0 zero
0 zero
but end of string can't be followed by...

C# Regex to match any character next to a specified substring but ignoring newline/tab character

I want to match the character next to the word "Test", but if that next character is a newline \n character, I need to get the character after the newline instead. In the following input string my desired output is the characters C and w. But I'm getting \n and w instead:
string str = "This abcTest\nCde and qrvTestwest is an input";
foreach (Match mt in Regex.Matches(str, @"(?<=Test)(.)", RegexOptions.Singleline))
    Console.WriteLine(mt.Groups[1].Value);

Try this:
Test[\n]*(.)
It will skip over any number of newlines.

The regex for what you're asking is:
Test[\n](.)|Test(.)
You need to check both cases Test\n and Test(.).
Check a working regex of this: https://regex101.com/r/mI2tE5/1
Following the comments, this works better:
Test[\n]*(.)
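Applied to the input from the question, the pattern can be tried with the sketch below. The list name found is an illustrative choice, and the pattern is written as Test\n*(.), which is equivalent to Test[\n]*(.).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string str = "This abcTest\nCde and qrvTestwest is an input";
var found = new List<string>();

// \n* consumes any newlines right after "Test"; group 1 is the next character.
foreach (Match mt in Regex.Matches(str, @"Test\n*(.)"))
{
    found.Add(mt.Groups[1].Value);
}

Console.WriteLine(string.Join(" ", found)); // C w
```

If tabs or carriage returns should be skipped as well, the class could be widened to something like [\r\n\t]*, though that goes beyond what the answers tested.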

Regex for special characters?

string Val = Regex.Replace(TextBox1.Text, @"[^a-z, A-z, 0-9]", string.Empty);
This expression does not match the characters ^ and _. What should I do to match those values?
One more thing: if the TextBox1.Text string is longer than 10 characters, the last character (the 11th) should be matched.

Note that the ^ has a special meaning when it is the first character inside square brackets: it negates the character class, so the class matches everything except the characters listed.
If you want to match "^" and "_" themselves, put the caret (^) somewhere other than directly after the opening bracket, or escape it with a backslash ("\^"). A shorter option is a class that matches every non-word character plus the underscore:
[\W_]
Note that this class does not restrict the length of the string; it only selects the characters to remove.
string Val = Regex.Replace(TextBox1.Text, @"[\W_]", string.Empty);

Your problem is A-z.
This matches all ASCII letters A through Z, then the characters that lie between Z and a (which contain, among others, ^ and _), then all ASCII letters between a and z. This means that ^ and _ won't be matched by your regex (as well as the comma and space which you included in your regex as well).
To clarify, your regex could also have been written as
[^a-zA-Z0-9\[\\\]^_` ,]
You probably wanted
string Val = Regex.Replace(TextBox1.Text, @"[^a-zA-Z0-9]", string.Empty);
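The difference between A-z and A-Z can be seen with a short sketch. The sample string and the names loose and strict are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample containing characters that lie between 'Z' and 'a' in ASCII.
string sample = "ab^c_D[9]";

// A-z spans 'A'..'z', which also covers [ \ ] ^ _ and `, so those characters are kept.
string loose = Regex.Replace(sample, @"[^a-z, A-z, 0-9]", string.Empty);

// a-zA-Z0-9 keeps only ASCII letters and digits.
string strict = Regex.Replace(sample, @"[^a-zA-Z0-9]", string.Empty);

Console.WriteLine(loose);  // ab^c_D[9]
Console.WriteLine(strict); // abcD9
```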

How to ignore regex matches in C#?

An input string:
string datar = "aag, afg, agg, arg";
I am trying to get the matches "aag" and "arg", but the following patterns don't work:
string regr = "a[a-z&&[^fg]]g";
string regr = "a[a-z[^fg]]g";
What is the correct way of ignoring regex matches in C#?

The obvious way is to use a[a-eh-z]g, but you could also try a negative lookbehind like this:
string regr = "a[a-z](?<!f|g)g";
Explanation :
a Match the character "a"
[a-z] Match a single character in the range between "a" and "z"
(?<!XXX) Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind)
f|g Match the character "f" or match the character "g"
g Match the character "g"

Character classes aren't quite that fancy. The simple solution is:
a[a-eh-z]g
If you really want to explicitly list out the letters that don't belong, you could try something like:
a[^\W\d_A-Zfg]g
This character class matches everything except:
\W excludes non-word characters, i.e. punctuation, whitespace, and other special characters. What's left are letters, digits, and the underscore _.
\d removes digits so now we have letters and the underscore _.
_ removes the underscore so now we only match letters.
A-Z removes uppercase letters so now we only match lowercase letters.
Finally at this point we can list the individual lowercase letters we don't want to match.
All in all way more complicated than we'd likely ever want. That's regular expressions for ya!

What you're using is Java's set intersection syntax:
a[a-z&&[^fg]]g
..meaning the intersection of the two sets ('a' THROUGH 'z') and (ANYTHING EXCEPT 'f' OR 'g'). No other regex flavor that I know of uses that notation. The .NET flavor uses the simpler set subtraction syntax:
a[a-z-[fg]]g
...that is, the set ('a' THROUGH 'z') minus the set ('f', 'g').
Java demo:
String s = "aag, afg, agg, arg, a%g";
Matcher m = Pattern.compile("a[a-z&&[^fg]]g").matcher(s);
while (m.find())
{
    System.out.println(m.group());
}
C# demo:
string s = @"aag, afg, agg, arg, a%g";
foreach (Match m in Regex.Matches(s, @"a[a-z-[fg]]g"))
{
    Console.WriteLine(m.Value);
}
Output of both is
aag
arg

Try this if you want to match arg and aag:
a[ar]g
If you want to match everything except afg and agg, you need this regex:
a[^fg]g

It seems like you're trying to match any three alphabetic characters, with the condition that the second character cannot be f or g. If this is the case, why not use the following regular expression:
string regr = "a[a-eh-z]g";

Regex: a[a-eh-z]g.
Then use Regex.Matches to get the matched substrings.
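For comparison, the sketch below runs several of the patterns suggested in the answers above against the input from the question, using Regex.Matches. The array name patterns and the list name results are illustrative choices, not part of any answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

string datar = "aag, afg, agg, arg";

// Patterns taken from the answers: explicit ranges, negative lookbehind,
// .NET character class subtraction, and the negated class.
string[] patterns =
{
    @"a[a-eh-z]g",
    @"a[a-z](?<!f|g)g",
    @"a[a-z-[fg]]g",
    @"a[^\W\d_A-Zfg]g",
};

var results = new List<string>();
foreach (string pattern in patterns)
{
    // Collect the matched substrings for this pattern into one comma-separated line.
    string joined = string.Join(",", Regex.Matches(datar, pattern).Cast<Match>().Select(m => m.Value));
    results.Add(joined);
    Console.WriteLine(pattern + " => " + joined);
}
```

On this input each of the four patterns finds the same two substrings, aag and arg, so the choice mostly comes down to readability.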
