C# regex does not allow special characters correctly?

For example I have the following string:
thats a\n\ntest\nwith multiline \n\nthings...
I tried the following code, but it does not work correctly and still does not include all the characters:
string text = "thats a\n\ntest\nwith multiline \n\nthings and so on";
var res = Regex.IsMatch(text, @"^([a-zA-Z0-9äöüÄÖÜß\-|()[\]/%'<>_?!=,*. ':;#+\\])+$");
Console.WriteLine(res);
I want the regex to return true only when the string consists of the following chars (it does not have to contain all of them, but at least one of them and no others):
a-z, A-Z, 0-9, äüöÄÖÜß and !#'-.:,; ^"§$%&/()=?\}][{³²°*+~'_<>|.
This is a list of known keyboard characters that I thought would be nice to use inside a message.

If you specify all the chars you want to allow, the regex declaration in C# will look like
@"^[a-zA-Z0-9äüöÄÖÜß!#'\-.:,; ^""§$%&/()=?\\}\][{³²°*+~'_<>|]+$"
However, the test string you supplied contains line feed (LF, \n, \x0A) chars, so you need to either test on a string with no newlines or add \n to the character class:
@"^[a-zA-Z0-9äüöÄÖÜß!#'\-.:,; ^""§$%&/()=?\\}\][{³²°*+~'_<>|\n]+$"
Note that the " char is doubled because that is the only way to put a double quote into a verbatim string literal.
Also, the capturing parentheses in your pattern add needless overhead; you should remove them.
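As a quick check of the two patterns above, here is a minimal sketch that builds both from one shared character list and runs them against the question's test string. The names Allowed, singleLine and multiLine are made up for this illustration, and the snippet assumes C# 9 top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Verbatim literal: "" yields one double quote; \- \\ \] are regex escapes.
const string Allowed = @"a-zA-Z0-9äüöÄÖÜß!#'\-.:,; ^""§$%&/()=?\\}\][{³²°*+~'_<>|";

var singleLine = new Regex("^[" + Allowed + "]+$");    // line feeds not allowed
var multiLine = new Regex("^[" + Allowed + @"\n]+$");  // line feeds allowed

string text = "thats a\n\ntest\nwith multiline \n\nthings and so on";

Console.WriteLine(singleLine.IsMatch(text)); // False: \n is not in the class
Console.WriteLine(multiLine.IsMatch(text));  // True
```

Keeping the character list in one constant avoids the two patterns drifting apart when a character is added later.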

Related

C# regex to remove non-printable characters and control characters in a text that has a mix of many different languages and Unicode letters

I would appreciate your help on this, since I do not know which range of characters to use, or whether there is a character class like the [[:cntrl:]] that I have found in Ruby.
By non-printable, I mean all characters that are not shown in the output when one prints the input string; I want to delete them. Please note that I am looking for a C# regex; I do not have a problem with my code.

You may remove all control and other non-printable characters with
s = Regex.Replace(s, @"\p{C}+", string.Empty);
The \p{C} Unicode category class matches all control characters, even those outside the ASCII table because in .NET, Unicode category classes are Unicode-aware by default.
Breaking it down into subcategories
To match only basic control characters, you may use \p{Cc}+; there are 65 chars in the Other, Control Unicode category. It is equal to a [\u0000-\u001F\u007F-\u009F]+ regex.
To match only the 161 other format chars, including the well-known soft hyphen (\u00AD), zero-width space (\u200B), zero-width non-joiner (\u200C), zero-width joiner (\u200D), left-to-right mark (\u200E) and right-to-left mark (\u200F), use \p{Cf}+. The equivalent including astral plane code points is a (?:[\xAD\u0600-\u0605\u061C\u06DD\u070F\u08E2\u180E\u200B-\u200F\u202A-\u202E\u2060-\u2064\u2066-\u206F\uFEFF\uFFF9-\uFFFB]|\uD804[\uDCBD\uDCCD]|\uD80D[\uDC30-\uDC38]|\uD82F[\uDCA0-\uDCA3]|\uD834[\uDD73-\uDD7A]|\uDB40[\uDC01\uDC20-\uDC7F])+ regex.
To match the 137,468 Other, Private Use code points, you may use \p{Co}+, or its equivalent including astral plane code points, (?:[\uE000-\uF8FF]|[\uDB80-\uDBBE\uDBC0-\uDBFE][\uDC00-\uDFFF]|[\uDBBF\uDBFF][\uDC00-\uDFFD])+.
To match the 2,048 Other, Surrogate code points, you may use \p{Cs}+ or a [\uD800-\uDFFF]+ regex. Because .NET strings are UTF-16, characters outside the BMP, including many emojis, are stored as surrogate pairs, so this also removes them.
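The subcategories can be compared side by side. The following is a small sketch, assuming C# 9 top-level statements and a made-up sample string that holds one control character, one format character and one emoji.

```csharp
using System;
using System.Text.RegularExpressions;

// BEL (Cc), zero-width space (Cf), and an emoji stored as a surrogate pair (Cs).
string s = "a\u0007b\u200Bc\U0001F60Ed";

string noCc = Regex.Replace(s, @"\p{Cc}+", string.Empty); // drops only \u0007
string noCf = Regex.Replace(s, @"\p{Cf}+", string.Empty); // drops only \u200B
string noCs = Regex.Replace(s, @"\p{Cs}+", string.Empty); // drops only the emoji's two surrogates
string noC = Regex.Replace(s, @"\p{C}+", string.Empty);   // drops all of the above

Console.WriteLine(noC); // abcd
```

If emojis must survive, \p{C} is too broad for the reason shown on the last line; combining only the subcategories you want to strip, for example [\p{Cc}\p{Cf}]+, is one option.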

You can try the following, which removes every non-ASCII character:
string s = "Täkörgåsmrgås";
s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);
Updated answer after comments:
Documentation about non-printable character:
https://en.wikipedia.org/wiki/Control_character
Char.IsControl Method:
https://msdn.microsoft.com/en-us/library/system.char.iscontrol.aspx
Maybe you can try:
// Requires: using System.Linq;
string input = "a\u0007b"; // sample value; replace with your input string
string output = new string(input.Where(c => !char.IsControl(c)).ToArray());
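Note that char.IsControl covers only the Control (Cc) category, so this LINQ filter behaves like \p{Cc} rather than \p{C}. Here is a small sketch with a made-up sample string, assuming C# 9 top-level statements:

```csharp
using System;
using System.Linq;

// A tab and NUL (both Cc) plus a zero-width space (Cf).
string input = "tab\there\u0000 zero\u200Bwidth";
string output = new string(input.Where(c => !char.IsControl(c)).ToArray());

Console.WriteLine(output.Length);                 // 18: the tab and NUL are gone
Console.WriteLine(output.IndexOf('\u200B') >= 0); // True: the format character survives
```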

To remove all control and other non-printable characters:
Regex.Replace(s, @"\p{C}+", String.Empty);
To remove the control characters only (if you don't want to remove the emojis 😎):
Regex.Replace(s, @"\p{Cc}+", String.Empty);

You can try this:
public static class StringExtensions
{
    // Removes every character outside the printable ASCII range (space through ~).
    public static string TrimNonAscii(this string value)
    {
        Regex nonAscii = new Regex("[^ -~]+");
        return nonAscii.Replace(value, "");
    }
}

Regex to match comma separated string with no comma at the end of the line

I am trying to write a regex that will allow input of all characters on the keyboard (even space) but will reject a comma at the end of the line. I have tried the following, which includes all the possible characters, but it still does not give me the correct output:
[RegularExpression("^([a-zA-Z0-9\t\n ./<>?;:\"'!@#$%^&*()[]{}_+=|\\-]+,)*[a-zA-Z0-9\t\n ./<>?;:\"'!@#$%^&*()[]{}_+=|\\-]+$", ErrorMessage = "Comma is not allowed at the end of {0} ")]

^.*[^,]$
.* matches any characters (except a newline), so there is no need for such a long character class.

^([a-zA-Z0-9\t\n ./<>?;:\"'!@#$%^&*()[]{}_+=|\\-]+,)*[a-zA-Z0-9\t\n ./<>?;:\"'!@#$%^&*()[]{}_+=|\\-]+(?<!,)$
Just add the negative lookbehind (?<!,) before the final $.

a regex that will allow input of all characters on the keyboard(even space) but will restrict the input of comma at the end of the line.
Mind that you can type much more than what you typed using a keyboard. Basically, you want to allow any character but a comma at the end of the line.
So,
(?!,).(?=\r\n|\z)
This regex checks each line (because of the (?=\r\n|\z) look-ahead), and the (?!,) look-ahead makes sure the last character (which we match using .) is not a comma. \z is an unambiguous string end anchor.
This will work even on a client side.
To also get the full line match, you can just add .* at the beginning of the pattern (as we are not using singleline flag, . does not match newline symbols):
.*(?!,).(?=\r\n|\z)
Or, to make it faster, use an atomic group, or an inline multiline option with a ^ start-of-line anchor (neither will work on the client side):
(?>.*)(?!,).(?=\r\n|\z)
(?m)^.*?(?!,).(?=\r\n|\z) // The fastest of the last three
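To see which lines the multiline variant accepts, here is a minimal sketch, assuming C# 9 top-level statements, CRLF line endings and a made-up three-line input. The variable names are illustrative only.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "a,b\r\nc,d,\r\ne";

// (?m) lets ^ match at each line start; the look-ahead stops at CRLF or at the string end.
var lineWithoutTrailingComma = new Regex(@"(?m)^.*?(?!,).(?=\r\n|\z)");

string[] accepted = lineWithoutTrailingComma.Matches(input)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

Console.WriteLine(string.Join(" | ", accepted)); // a,b | e
```

The middle line, c,d, produces no match because its last character before the line break is a comma.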

Can you construct a RegEx to replace unwanted characters with the underscore?

I'm trying to write a string 'clean-up' function that allows only alphanumeric characters, plus a few others, such as the underscore, period and the minus (dash) character.
Currently our function uses straight char iteration of the source string, but I'm trying to convert it to RegEx because from what I've been reading, it is much cleaner and more performant (which seems backwards to me over a straight iteration, but I can't profile it until I get a working RegEx.)
The problem is two-fold for me. One, I know the following regex...
[a-zA-Z0-9]
...matches a range of alphanumeric characters, but how do I also include the underscore, period and the minus character? Do you simply escape them with the '\' character and put them between the brackets with the rest?
Second, for any character that isn't part of the match (i.e. other punctuation like '?') we would like it replaced with an underscore.
My thinking is that, instead of matching on a range of desired characters, we match on a single character that's not in the desired range, then replace that. I think the RegEx for that is to include the caret as the first character between the brackets like this...
[^a-zA-Z0-9]
Is that the correct approach?

Probably the most efficient way to do this is to set up a static Regex that describes the characters that you want to replace.
public static class StringCleaner
{
    public static Regex invalidChars = new Regex(@"[^A-Z0-9._\-]", RegexOptions.Compiled | RegexOptions.IgnoreCase);
    public static string ReplaceInvalidChars(string input)
    {
        return invalidChars.Replace(input, "_");
    }
}
However, if you don't want the Regex to replace line ends and whitespace (like spaces and tabs) you'll need to use a slightly different expression.
public static Regex invalidChars = new Regex(@"[^A-Z0-9._\-\s]", RegexOptions.Compiled | RegexOptions.IgnoreCase);
Also, here are the rules for what you must escape to match the literal character:
Inside a set denoted by square brackets, you must escape the characters - ] \ wherever they occur (a - placed first or last in the set may be left unescaped), and ^ only if it appears in the first position of the set. Outside of a set, you must escape these characters to match them literally: . $ ^ | { } [ ] ( ) * + ? \ and, when RegexOptions.IgnorePatternWhitespace is set, #.
See the following documentation for more information:
.NET Framework Regular Expressions
Regex Class
RegexOptions Enumeration
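The replacement and the escaping rules above can be tried out with a short sketch like the following. It assumes C# 9 top-level statements; the sample strings and the literalSet name are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Anything except letters, digits, '.', '_' and '-' becomes '_'.
var invalidChars = new Regex(@"[^A-Z0-9._\-]", RegexOptions.Compiled | RegexOptions.IgnoreCase);
Console.WriteLine(invalidChars.Replace("my file?.txt", "_")); // my_file_.txt

// Inside a set, '^' (when first), '-', ']' and '\' are escaped to be matched literally.
var literalSet = new Regex(@"[\^\-\]\\]");
Console.WriteLine(literalSet.Replace(@"a^b-c]d\e", "_")); // a_b_c_d_e

// Outside a set, Regex.Escape adds the required backslashes for you.
Console.WriteLine(Regex.Escape("1+1=2?")); // 1\+1=2\?
```

Regex.Escape is handy when a literal string from user input has to be embedded in a larger pattern.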

If you are trying to remove characters that you don't want, you'd be better served by Regex.Replace:
string cleaned = Regex.Replace(input, "[^a-zA-Z0-9_.]|-", "_");
To include the '-' character, you can just use the regex OR (|) to add it as an alternative; there is probably a way to include it in the character class, but it's escaping me at the moment.
Edit: You don't actually need to explicitly include the hyphen, because it isn't in the allowed set and so is matched by the negated class anyway. That is, if you want to replace hyphens with underscores, just use [^a-zA-Z0-9_.] as your class; anything that isn't in that set will get replaced. But the correct way to include a hyphen in a class is to escape it with a backslash (\-), or you can put it at the beginning of the class list: [^-a-zA-Z0-9_.].
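The hyphen placements mentioned here can be compared directly. This is a small sketch with a made-up input string, assuming C# 9 top-level statements:

```csharp
using System;
using System.Text.RegularExpressions;

string input = "a-b c?d";

// Three ways to keep the hyphen in the allowed set: escaped, first, or last.
string escaped = Regex.Replace(input, @"[^a-zA-Z0-9_.\-]", "_");
string first = Regex.Replace(input, "[^-a-zA-Z0-9_.]", "_");
string last = Regex.Replace(input, "[^a-zA-Z0-9_.-]", "_");

// Without the hyphen in the set, it is replaced like any other character.
string hyphenReplaced = Regex.Replace(input, "[^a-zA-Z0-9_.]", "_");

Console.WriteLine(escaped);        // a-b_c_d
Console.WriteLine(hyphenReplaced); // a_b_c_d
```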

I think the Replace method of the string would be a good fit here.
public string StringClean(string source, char replacement, char[] targets)
{
    foreach (char c in targets)
    {
        // Replace every occurrence of the unwanted character.
        source = source.Replace(c, replacement);
    }
    return source;
}
(Not in VS so maybe not perfect code)

If you need to replace all characters that are not in your described pattern with an underscore, do this:
string result = Regex.Replace(YourOriginalString, "[^a-zA-Z0-9_.-]", "_");

why do these regex tests let certain characters pass?

I am checking a string with the following regexes:
[a-zA-Z0-9]+
[A-Za-z]+
For some reason, the characters:
.
-
_
are allowed to pass, why is that?

If you want to check that the complete string consists of only the wanted characters, you need to anchor your regex as follows:
^[a-zA-Z0-9]+$
Otherwise every string will pass that contains a string of the allowed characters somewhere. The anchors essentially tell the regular expression engine to start looking for those characters at the start of the string and stop looking at the end of the string.
To clarify: If you just use [a-zA-Z0-9]+ as your regex, then the regex engine would rightfully reject the string -__-- as the regex doesn't match against that. There is no single character from the character class you defined.
However, with the string a-b it's different. The regular expression engine will match the first a here since that matches the expression you entered (at least one of the given characters) and won't care about the - or the b. It has done its job and successfully matched a substring according to your regular expression.
Similarly, with _-abcdef- the regex will match the substring abcdef just fine, because you didn't tell it to match only at the start or end of the string, so it ignores the other characters.
So when using ^[a-zA-Z0-9]+$ as your regex you are telling the regex engine definitely that you are looking for one or more letters or digits, starting at the very beginning of the string right until the end of the string. There is no room for other characters to squeeze in or hide so this will do what you apparently want. But without the anchors, the match can be anywhere in your search string. For validation purposes you always want to use those anchors.
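The difference is easy to reproduce with the strings discussed above. Here is a minimal sketch, assuming C# 9 top-level statements; the constant names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

const string Unanchored = "[a-zA-Z0-9]+";
const string Anchored = "^[a-zA-Z0-9]+$";

Console.WriteLine(Regex.IsMatch("-__--", Unanchored));         // False: nothing to match
Console.WriteLine(Regex.IsMatch("a-b", Unanchored));           // True: matches the "a"
Console.WriteLine(Regex.Match("_-abcdef-", Unanchored).Value); // abcdef
Console.WriteLine(Regex.IsMatch("a-b", Anchored));             // False
Console.WriteLine(Regex.IsMatch("abc123", Anchored));          // True
```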

In regular expressions, the + tells the engine to match the preceding element one or more times.
So the expression [A-Za-z]+ passes if the string contains a sequence of one or more alphabetic characters anywhere. The only strings that wouldn't pass are strings that contain no alphabetic characters at all.
The ^ symbol anchors the match to the beginning of the string and the $ symbol anchors it to the end of the string.
So ^[A-Za-z0-9]+ means 'match a string that begins with a sequence of one or more alphanumeric characters'. It would still allow strings that include non-alphanumerics, so long as those characters were not at the beginning of the string.
^[A-Za-z0-9]+$, on the other hand, means 'match a string that consists entirely of one or more alphanumeric characters'. This is the only way to completely exclude non-alphanumerics from a string.

Simple regex pattern

I'm using C# and I'm trying to allow only alphabetical letters and spaces. My expression at the moment is:
string regex = "^[A-Za-z\s]{1,40}$";
My IDE says that \s is an "Unrecognized escape sequence".
What am I missing?

"\" is a C# escape character as well as a regex escape character. Try:
string regex = @"^[A-Za-z\s]{1,40}$";

You need to put an @ in front of your string to turn it into a verbatim string literal:
string regex = @"^[A-Za-z\s]{1,40}$";
Right now, the \ in your regex is being interpreted as trying to escape the following s, which the compiler doesn't understand.
Alternatively, you can just escape the backslash with another one:
string regex = "^[A-Za-z\\s]{1,40}$";
but in general, prefer the first approach to the second.
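Both spellings compile to the same pattern string, which a short check confirms. This sketch assumes C# 9 top-level statements and uses made-up variable names and sample inputs.

```csharp
using System;
using System.Text.RegularExpressions;

string verbatim = @"^[A-Za-z\s]{1,40}$"; // backslash kept as-is
string escaped = "^[A-Za-z\\s]{1,40}$";  // \\ becomes a single backslash

Console.WriteLine(verbatim == escaped);                       // True
Console.WriteLine(Regex.IsMatch("Hello World", verbatim));    // True
Console.WriteLine(Regex.IsMatch("Hello World 2", verbatim));  // False: digits are not allowed
```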

An additional note: your regex doesn't do what you describe. You say a max of 1 space in between words. In order to do that, you need to move the "\s" out of the character class. The pattern you're currently using allows "any letter or whitespace character from 1 to 40 times", which allows for multiple successive spaces. You'll need something more like the following:
string regex = @"^(?:[A-Za-z]+\s?)+$";
This means "any letter 1 or more times followed by an optional space, this whole thing one or more times". I don't know how to limit the whole string to 40 characters when you don't know the size of the first expression in advance. Maybe this can be achieved with a "look behind" expression, but I'm not sure. You might have to do it in two steps.
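One possible way to add the length limit in a single step is a look-ahead (rather than a look-behind) at the start of the pattern. The sketch below is a variation, not the pattern from this answer: it rewrites the body as letters followed by optional space-plus-letters groups, which also rejects a trailing space and avoids nesting an optional element inside a repeated group. It assumes C# 9 top-level statements, and the name words is made up.

```csharp
using System;
using System.Text.RegularExpressions;

// (?=.{1,40}$) caps the total length; the rest allows single whitespace chars between words only.
var words = new Regex(@"^(?=.{1,40}$)[A-Za-z]+(?:\s[A-Za-z]+)*$");

Console.WriteLine(words.IsMatch("hello world"));       // True
Console.WriteLine(words.IsMatch("hello  world"));      // False: two spaces in a row
Console.WriteLine(words.IsMatch(new string('a', 40))); // True
Console.WriteLine(words.IsMatch(new string('a', 41))); // False: too long
```

Checking text.Length <= 40 separately in code, as the two-step suggestion implies, is an equally reasonable choice and keeps the pattern simpler.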
