Regex issue with reserved characters in C#

I've got a working regex that scans a chunk of text for a list of keywords defined in a db. I dynamically create my regex from the db to get this:
\b(?:keywords|from|database|with|esc\#ped|characters|\#ss|gr\#ss)\b
Notice that special characters are escaped. This works for the vast majority of cases, EXCEPT where the first character of the keyword is a regex special character like # or $. So in the above example, #ss will not be matched, but gr#ss and esc#ped will.
Any ideas how to get this regex to work for these special cases? I've tried both with and without escaping the special characters in the regex string, but to no avail.
Thanks in advance,
David

new Regex(@"(?<=^|\W)(?:keywords|from|database|with|esc#ped|characters|#ss|gr#ss)(?=\W|$)")
will match. It checks whether there is a non-word character (or beginning/end of string) before/after the keyword to be matched. I chose \W over \s because of punctuation and other non-word characters that might constitute a word boundary.
Edit: Even better (thanks to Alan Moore! - both versions will produce the same results):
new Regex(@"(?<!\w)(?:keywords|from|database|with|esc#ped|characters|#ss|gr#ss)(?!\w)")
Both will fail to match #ss in l#ss, which is probably what you want.

When you get the keywords from the database, escape them with Regex.Escape before creating the Regex string.
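The two suggestions above (escaping each keyword and replacing \b with lookarounds) can be combined. The following is only a sketch: KeywordMatcher, Build and FindAll are hypothetical names, and the keyword list stands in for whatever the database returns.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordMatcher
{
    // Builds one alternation from raw keywords. Regex.Escape neutralises any
    // metacharacters; the lookarounds stand in for \b so that keywords which
    // start or end with a non-word character can still match.
    public static Regex Build(IEnumerable<string> keywords)
    {
        var alternatives = keywords
            .OrderByDescending(k => k.Length)   // try longer keywords first
            .Select(k => Regex.Escape(k));
        return new Regex(@"(?<!\w)(?:" + string.Join("|", alternatives) + @")(?!\w)");
    }

    // Returns the matched keywords in the order they appear in the text.
    public static List<string> FindAll(IEnumerable<string> keywords, string text)
    {
        return Build(keywords).Matches(text).Cast<Match>().Select(m => m.Value).ToList();
    }
}

public static class KeywordMatcherDemo
{
    public static void Main()
    {
        var keywords = new[] { "keywords", "#ss", "gr#ss" };   // assumed sample data
        foreach (var hit in KeywordMatcher.FindAll(keywords, "#ss is gr#ss but not l#ss"))
            Console.WriteLine(hit);   // prints #ss, then gr#ss
    }
}
```

Sorting by length is an extra precaution that the answers do not mention: with lookarounds instead of \b, a shorter keyword could otherwise win over a longer one that shares its prefix.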

The # does not denote a word boundary.
Use: (\s|^)(?:keywords|from|database|with|esc#ped|characters|#ss|gr#ss)(\s|$)
Tested with the following program:
static void Main(string[] args)
{
    string pattern = "(\\s|^)(?:keywords|from|database|with|esc#ped|characters|#ss|gr#ss)(\\s|$)";
    var matches = Regex.Matches("#ss is gr#ss is esc#ped keywordsnospace keywords", pattern);
    foreach (Match match in matches)
    {
        // The keyword alternation is non-capturing, so print the whole match
        // minus the surrounding whitespace captured by groups 1 and 2.
        Console.WriteLine(match.Value.Trim());
    }
}
Giving the result:
#ss
gr#ss
esc#ped
keywords

Related

Tamil language full-word search with .NET Regex

I have a Grid filled with Tamil words and a search string. I need to implement a full-word search through the Grid records. I'm using .NET Regex class for that approach. It sounds pretty simple, what I used to do is:
string pattern = @"\b" + searchText + @"\b";
It works as expected in Latin languages, but for Tamil this expression returns strange results. I have read about Unicode characters in regular expressions, but that doesn't seem quite helpful to me. What I probably need is to determine where the word boundary is found and why.
As an example:
For the "\bஅம்மா\b" pattern Regex found matches in
அம்மாவிடம் and அம்மாக்கள் records but not in the original அம்மா record.
The last char in the "அம்மா" word is U+0BBE TAMIL VOWEL SIGN AA, which is a combining mark (in regex, it can be matched with \p{M}).
\b only matches between a word char and a non-word char, or between a word char and the start/end of the string. This combining mark is not treated as a word char, so \b does not match between it and a following non-word char or the end of the string.
The usual workaround in this case is to use lookarounds instead of \b:
var pattern = $@"(?<!\w){searchText}(?!\w)";
Here, (?<!\w) fails the match if there is a word char before searchText, and (?!\w) fails the match if there is a word char after the text to find. Note that you may also use Regex.Escape(searchText) if the text can contain special regex chars.
Or, if you want to avoid matching when inside base letters/diacritics, use
var pattern = $@"(?<![\p{{L}}\p{{M}}]){searchText}(?![\p{{L}}\p{{M}}])";
The (?<![\p{L}\p{M}]) and (?![\p{L}\p{M}]) lookarounds work like the ones above, except that they fail the match if there is a letter or a combining mark on either side of the search phrase.
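As a rough illustration of the letter/mark lookarounds, the sketch below wraps them in a helper and runs it against the three records from the question. WholeWordSearch and ContainsWord are hypothetical names, and Regex.Escape is added on the assumption that the search text may contain regex metacharacters.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WholeWordSearch
{
    // True when searchText occurs in text with no letter or combining mark
    // immediately before or after it.
    public static bool ContainsWord(string text, string searchText)
    {
        string pattern = @"(?<![\p{L}\p{M}])" + Regex.Escape(searchText) + @"(?![\p{L}\p{M}])";
        return Regex.IsMatch(text, pattern);
    }
}

public static class WholeWordSearchDemo
{
    public static void Main()
    {
        string word = "அம்மா";
        foreach (string record in new[] { "அம்மா", "அம்மாவிடம்", "அம்மாக்கள்" })
            Console.WriteLine(record + ": " + WholeWordSearch.ContainsWord(record, word));
    }
}
```

Only the first record should be reported as containing the whole word, because the other two continue with a letter directly after the search text.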

C# Regex for validating words with no space, no special character

I have written the following Regex for matching only those words with no space and no special character. But it is matching with words containing space too. What is wrong in it?
Regex rgx = new Regex("[a-zA-Z0-9]+");
if (!rgx.IsMatch(TextBox_EntityType.Text))
{
}
You can change the logic of your check so it does the opposite, and you take the appropriate action:
Regex rgx = new Regex("[^a-zA-Z0-9]");
// Match if there is something that is not alphanumeric
if (rgx.IsMatch(TextBox_EntityType.Text))
{
    // Do what should be done if the text contains non-alphanumeric
}
This works just as well. .IsMatch() looks for a match anywhere in the string, so either you force it to match the whole string with anchors, as Nikhil suggested, or you invert the logic as I did (which I believe should be slightly more efficient, though I have not benchmarked it).
It should be ^[a-zA-Z0-9]+$
Added ^ and $.
The ^ matches the start of the string and $ matches the end.
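To make the difference concrete, here is a small sketch that puts the original unanchored pattern next to the two fixes proposed above. EntityTypeValidator and its method names are made up for illustration, and plain strings stand in for TextBox_EntityType.Text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EntityTypeValidator
{
    private static readonly Regex Unanchored = new Regex("[a-zA-Z0-9]+");
    private static readonly Regex Anchored = new Regex("^[a-zA-Z0-9]+$");
    private static readonly Regex Forbidden = new Regex("[^a-zA-Z0-9]");

    // Original check: true as soon as any alphanumeric run exists anywhere.
    public static bool UnanchoredMatch(string s) => Unanchored.IsMatch(s);

    // Anchored fix: the whole string must be alphanumeric.
    public static bool IsValidAnchored(string s) => Anchored.IsMatch(s);

    // Inverted fix: valid when no forbidden character is found. The length
    // check is an addition, since an empty string has no forbidden characters.
    public static bool IsValidInverted(string s) => s.Length > 0 && !Forbidden.IsMatch(s);
}

public static class EntityTypeValidatorDemo
{
    public static void Main()
    {
        foreach (string input in new[] { "Entity42", "two words" })
        {
            Console.WriteLine(input + ": unanchored=" + EntityTypeValidator.UnanchoredMatch(input)
                + " anchored=" + EntityTypeValidator.IsValidAnchored(input)
                + " inverted=" + EntityTypeValidator.IsValidInverted(input));
        }
    }
}
```

For "two words" the unanchored check still reports a match, which is the behaviour the question describes, while both fixes reject it.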

Can you construct a RegEx to replace unwanted characters with the underscore?

I'm trying to write a string 'clean-up' function that allows only alphanumeric characters, plus a few others, such as the underscore, period and the minus (dash) character.
Currently our function uses straight char iteration of the source string, but I'm trying to convert it to RegEx because from what I've been reading, it is much cleaner and more performant (which seems backwards to me over a straight iteration, but I can't profile it until I get a working RegEx.)
The problem is two-fold for me. One, I know the following regex...
[a-zA-Z0-9]
...matches a range of alphanumeric characters, but how do I also include the underscore, period and the minus character? Do you simply escape them with the '\' character and put them between the brackets with the rest?
Second, for any character that isn't part of the match (i.e. other punctuation like '?') we would like it replaced with an underscore.
My thinking is that instead of matching on a range of desired characters, we match on a single character that's not in the desired range, then replace that. I think the RegEx for that is to include the caret as the first character between the brackets like this...
[^a-zA-Z0-9]
Is that the correct approach?
Probably the most efficient way to do this is to set up a static Regex that describes the characters that you want to replace.
public static class StringCleaner
{
    public static Regex invalidChars = new Regex(@"[^A-Z0-9._\-]", RegexOptions.Compiled | RegexOptions.IgnoreCase);
    public static string ReplaceInvalidChars(string input)
    {
        return invalidChars.Replace(input, "_");
    }
}
However, if you don't want the Regex to replace line ends and whitespace (like spaces and tabs) you'll need to use a slightly different expression.
public static Regex invalidChars = new Regex(@"[^A-Z0-9._\-\s]", RegexOptions.Compiled | RegexOptions.IgnoreCase);
Also, here are the rules for what you must escape to match the literal character:
Inside a set denoted by square brackets you must escape these characters -#]\ anywhere they occur and ^ only if it appears in the first position of the set to match the literal characters. Outside of a set you must escape these characters: .$^|{}[]()+?# to match the literal character.
See the following documentation for more information:
.NET Framework Regular Expressions
Regex Class
RegexOptions Enumeration
If you are trying to remove characters that you don't want, you'd be better served by Regex.Replace:
string cleaned = Regex.Replace(input, "[^a-zA-Z0-9_.]|-", "_");
To include the '-' character you can just use the Regex OR to include that character, although there probably is a way to include it in the character class, it's escaping me at the moment.
Edit: You don't actually need to explicitly include the hyphen, because it doesn't match the class anyway. That is, if you want to replace hyphen with underscore, just use [^a-zA-Z0-9_.] as your class... anything that doesn't match those classes will get replaced. But the correct way to include a hyphen in a class is to escape it with a backslash (\-), or you can put it at the beginning of the class list: [^-a-zA-Z0-9_.].
I think it would be perfect to use the Replace method of the string.
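public string StringClean(string source, char replacement, char[] targets)
{
    foreach (char c in targets)
    {
        // Replace every occurrence of this unwanted character.
        source = source.Replace(c, replacement);
    }
    return source;
}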
(Not in VS so maybe not perfect code)
If you need to replace all characters that are not on your described pattern with an underscore do this:
string result = Regex.Replace(YourOriginalString, "[^a-zA-Z0-9_.-]", "_");
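A short sketch of that last pattern in use may help. FileNameCleaner and Clean are hypothetical names and the sample inputs are invented; the hyphen is placed last in the class so that it is read as a literal character rather than a range.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FileNameCleaner
{
    // Anything that is not a letter, digit, underscore, period or hyphen.
    private static readonly Regex Disallowed = new Regex("[^a-zA-Z0-9_.-]");

    public static string Clean(string input) => Disallowed.Replace(input, "_");
}

public static class FileNameCleanerDemo
{
    public static void Main()
    {
        Console.WriteLine(FileNameCleaner.Clean("hello world?.txt"));   // hello_world_.txt
        Console.WriteLine(FileNameCleaner.Clean("a-b_c.d"));            // a-b_c.d
    }
}
```

Holding the Regex in a static field follows the earlier answer's suggestion and avoids re-parsing the pattern on every call. Whether this beats plain character iteration is something only profiling can show.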

Regex: Match any punctuation character except . and _

Is there an easy way to match all punctuation except period and underscore, in a C# regex? Hoping to do it without enumerating every single punctuation mark.
Use Regex Subtraction
[\p{P}-[._]]
See the .NET Regex documentation. I'm not sure if other flavors support it.
C# example
string pattern = @"[\p{P}\p{S}-[._]]"; // added \p{S} to get ^,~ and ` (among others)
string test = @"_""'a:;%^&*~`bc!@#.,?";
MatchCollection mx = Regex.Matches(test, pattern);
foreach (Match m in mx)
{
    Console.WriteLine("{0}: {1} {2}", m.Value, m.Index, m.Length);
}
Explanation
The pattern is a Character Class Subtraction. It starts with a standard character class like [\p{P}] and then adds a Subtraction Character Class like -[._], which says to remove the . and _. The subtraction is placed inside the [ ] after the standard class guts.
The answers so far do not respect ALL punctuation. This should work:
(?![\._])\p{P}
(Explanation: Negative lookahead to ensure that neither . nor _ are matched, then match any unicode punctuation character.)
Here is something a little simpler. Not words or white-space (where words include A-Za-z0-9 AND underscore).
[^\w\s.]
You could possibly use a negated character class like this:
[^0-9A-Za-z._\s]
This includes every character except those listed. You may need to exclude more characters (such as control characters), depending on your ultimate requirements.

Regex battle between maximum and minimum munge

Greetings, I have file with the following strings:
string.Format("{0},{1}", "Having \"Two\" On The Same Line".Localize(), "Is Tricky For regex".Localize());
my goal is to get a match set with the two strings:
Having \"Two\" On The Same Line
Is Tricky For regex
My current regex looks like this:
private Regex CSharpShortRegex = new Regex("\"(?<constant>[^\"]+?)\".Localize\\(\\)");
My problem is with the escaped quotes in the first line I end up stopping at the quote and I get:
On The Same Line
Is Tricky For This Style Too
however attempting to ignore the escaped quotes is not working out because it makes the Regex greedy and I get
Having \"Two\" On The Same Line".Localize(), "Is Tricky For regex"
We seem to be caught between maximum and minimum munge. Is there any hope? I have some backup plans. Can you Regex backwards? that would make it easier because I can start with the "()ezilacoL."
EDIT:
To clarify. This is my lone edge case. Most of the time the string sits alone like:
var myString = "Hot Patootie".Localize()
This one works for me:
\"((?:[^\\"]|(?:\\\"))*)\"\.Localize\(\)
Tested on http://www.regexplanet.com/simple/index.html against a number of strings with various escaped quotes.
Looks like most of us who answered this one had the same rough idea, so let me explain the approach (comments after #s):
\" # We're looking for a string delimited by quotation marks
( # Capture the contents of the quotation marks
(?: # Start a non-capturing group
[^\\"] # Either read a character that isn't a quote or a slash
|(?:\\\") # Or read in a slash followed by a quote.
)* # Keep reading
) # End the capturing group
\" # The string literal ends in a quotation mark
\.Localize\(\) # and ends with the literal '.Localize()', escaping ., ( and )
For C# you'll need to escape the slashes twice (messy):
\"((?:[^\\\\\"]|(?:\\\\\"))*)\"\\.Localize\\(\\)
Mark correctly points out that this one doesn't match escaped characters other than quotation marks. So here's a better version:
\"((?:[^\\"]|(?:\\")|(?:\\.))*)\"\.Localize\(\)
And its slashed-up equivalent:
\"((?:[^\\\\\"]|(?:\\\\\")|(?:\\\\.))*)\"\\.Localize\\(\\)
Works the same way, except it has a special case that if encounters a slash but it can't match \", it just consumes the slash and the following character and moves on.
Thinking about it, it's better to just consume two characters at every slash, which is effectively Mark's answer so I won't repeat it.
Here's the regular expression you need:
@"""(?<constant>(\\.|[^""])*)""\.Localize\(\)"
A test program:
using System;
using System.Text.RegularExpressions;
using System.IO;
class Program
{
    static void Main()
    {
        Regex CSharpShortRegex =
            new Regex(@"""(?<constant>(\\.|[^""])*)""\.Localize\(\)");
        foreach (string line in File.ReadAllLines("input.txt"))
            foreach (Match match in CSharpShortRegex.Matches(line))
                Console.WriteLine(match.Groups["constant"].Value);
    }
}
Output:
Having \"Two\" On The Same Line
Is Tricky For regex
Hot Patootie
Notice that I have used @"..." to avoid having to escape backslashes inside the regular expression. I think this makes it easier to read.
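For readers who want to try the pattern without an input file, here is a sketch that applies the same expression to the two lines from the question held in memory. LocalizeScanner and Extract are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LocalizeScanner
{
    // Same pattern as above: a quoted string whose body is made of escape
    // pairs (backslash plus any character) or non-quote characters,
    // followed by .Localize().
    private static readonly Regex Pattern =
        new Regex(@"""(?<constant>(\\.|[^""])*)""\.Localize\(\)");

    public static List<string> Extract(string line)
    {
        var results = new List<string>();
        foreach (Match match in Pattern.Matches(line))
            results.Add(match.Groups["constant"].Value);
        return results;
    }
}

public static class LocalizeScannerDemo
{
    public static void Main()
    {
        string[] lines =
        {
            @"string.Format(""{0},{1}"", ""Having \""Two\"" On The Same Line"".Localize(), ""Is Tricky For regex"".Localize());",
            @"var myString = ""Hot Patootie"".Localize()"
        };
        foreach (string line in lines)
            foreach (string constant in LocalizeScanner.Extract(line))
                Console.WriteLine(constant);
    }
}
```

Trying the escape pair before the non-quote class means a backslash is always consumed together with the character after it, so an escaped quote cannot end the string early.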
Update:
My original answer (below the horizontal rule) has a bug: regular-expression matchers attempt alternatives in left-to-right order. Having [^"] as the first alternative allows it to consume the backslash, but then the next character to be matched is a quote, which prevents the match from proceeding.
Incompatibility note: Given the pattern below, perl backtracks to the other alternative (the escaped quote) and successfully finds a match for the Having \"Two\" On The Same Line case.
The fix is to try an escaped quote first and then a non-quote:
var CSharpShortRegex =
new Regex("\"(?<constant>(\\\\\"|[^\"])*)\"\\.Localize\\(\\)");
or if you prefer the at-string form:
var CSharpShortRegex =
    new Regex(@"""(?<constant>(\\""|[^""])*)""\.Localize\(\)");
Allow for escapes:
private Regex CSharpShortRegex =
new Regex("\"(?<constant>([^\"]|\\\\\")*)\"\\.Localize\\(\\)");
Applying one level of escaping to make the pattern easier to read, we get
"(?<constant>([^"]|\\")*)"\.Localize\(\)
That is, a string starts and ends with " characters, and everything between is either a non-quote or an escaped quote.
Looks like you're trying to parse code so one approach might be to evaluate the code on the fly:
var cr = new CSharpCodeProvider().CompileAssemblyFromSource(
new CompilerParameters { GenerateInMemory = true },
"class x { public static string e() { return " + input + "}}");
var result = cr.CompiledAssembly.GetType("x")
.GetMethod("e").Invoke(null, null) as string;
This way you could handle all kinds of other special cases (e.g. concatenated or verbatim strings) that would be extremely difficult to handle with regex.
new Regex(@"((([^@]|^|\n)""(?<constant>((\\.)|[^""])*)"")|(@""(?<constant>(""""|[^""])*)""))\s*\.\s*Localize\s*\(\s*\)", RegexOptions.Compiled);
takes care of both simple and @"" strings. It also takes into account escape sequences.
