Regex to replace specific control characters except a few special cases C#?


I have the following requirement:
I have a string str that contains control characters, and I want to replace these control characters with a specific value. I am using the following Regex:
str = Regex.Replace(str, @"\p{C}+","\r\n");
The above replaces ALL control characters with \r\n.
However, I want to do the same thing but exclude the following characters:
SPACE , `\u000D`, `\u000A`
How can I modify the Regex above to accomplish this?
Any ideas? Thanks!

Use a character class subtraction:
str = Regex.Replace(str, @"[\p{C}-[ \u000D\u000A]]+","\r\n");
The [\p{C}-[ \u000D\u000A]]+ pattern matches 1 or more chars from the \p{C} Unicode category except a space, \u000D and \u000A.

Here you go: [^\P{C}\r\n]+
Negative class [^
Negative property \P{C} (negative class + negative property = \p{C})
Carriage return \r
Line feed \n
Result: All control codes excluding CRLF.
(btw: SPACE is not matched by \p{C})
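The two patterns above can be compared side by side. The following is a minimal sketch, not a definitive implementation: the class name ControlCharDemo, the method names and the sample input are made up for illustration. The space is left out of the subtraction because, as noted above, \p{C} does not match it anyway.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ControlCharDemo
{
    // Character class subtraction: everything in \p{C} except CR and LF.
    public static string ReplaceWithSubtraction(string s) =>
        Regex.Replace(s, @"[\p{C}-[\r\n]]+", "\r\n");

    // Negated class: "not (non-\p{C}), not CR, not LF" selects the same characters.
    public static string ReplaceWithNegation(string s) =>
        Regex.Replace(s, @"[^\P{C}\r\n]+", "\r\n");

    public static void Main()
    {
        // Hypothetical input: a space, a tab, two control chars, and a real CRLF.
        string input = "a b\tc\u0001\u0002d\r\ne";

        string viaSubtraction = ReplaceWithSubtraction(input);
        string viaNegation = ReplaceWithNegation(input);

        // Both approaches give the same result; the space and the CRLF survive.
        Console.WriteLine(viaSubtraction == viaNegation);            // True
        Console.WriteLine(viaSubtraction == "a b\r\nc\r\nd\r\ne");   // True
    }
}
```

Note that the tab is a control character too, so it is replaced along with \u0001 and \u0002; add \t to the excluded set if tabs should be kept.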

Related

C# regex does not allow special characters correctly?

For example I have the following string:
thats a\n\ntest\nwith multiline \n\nthings...
I tried the following code, which does not work correctly and still does not include all the characters:
string text = "thats a\n\ntest\nwith multiline \n\nthings and so on";
var res = Regex.IsMatch(text, @"^([a-zA-Z0-9äöüÄÖÜß\-|()[\]/%'<>_?!=,*. ':;#+\\])+$");
Console.WriteLine(res);
I want the regex to return true when the string contains only the following chars (it does not have to contain all of them, but at least one of them and no others):
a-z, A-Z, 0-9, äüöÄÖÜß and !#'-.:,; ^"§$%&/()=?\}][{³²°*+~'_<>|.
This is a list of known keyboard characters that I thought would be nice to allow inside a message.

If you specify all the chars you want to allow, the regex declaration in C# will look like
@"^[a-zA-Z0-9äüöÄÖÜß!#'\-.:,; ^""§$%&/()=?\\}\][{³²°*+~'_<>|]+$"
However, the test string you supplied contains line feed (LF, \n, \x0A) chars, so you need to either test on a string with no newlines, or add \n to the character class:
@"^[a-zA-Z0-9äüöÄÖÜß!#'\-.:,; ^""§$%&/()=?\\}\][{³²°*+~'_<>|\n]+$"
Note that the " char is doubled since this is the only way to put a double quote into a verbatim string literal.
Also, the capturing parentheses in your pattern create redundant overhead, so you should remove them.
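As a rough sketch of how the second pattern (the one that also allows \n) could be wrapped for reuse: the class name MessageValidator and the method IsValid are hypothetical names chosen for this example, and the sample strings are made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MessageValidator
{
    // Whitelist from the answer above, with \n added to the character class.
    // The doubled "" is how a double quote is written in a verbatim string.
    private static readonly Regex Allowed = new Regex(
        @"^[a-zA-Z0-9äüöÄÖÜß!#'\-.:,; ^""§$%&/()=?\\}\][{³²°*+~'_<>|\n]+$");

    // True when the text is non-empty and consists only of allowed characters.
    public static bool IsValid(string text) => Allowed.IsMatch(text);

    public static void Main()
    {
        Console.WriteLine(IsValid("thats a\n\ntest\nwith multiline \n\nthings and so on")); // True
        Console.WriteLine(IsValid("tab\there"));                                            // False
    }
}
```

The second call returns False because the tab character is not part of the whitelist.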

C# regex to remove non - printable characters, and control characters, in a text that has a mix of many different languages, unicode letters

I would appreciate your help on this, since I do not know which range of characters to use, or whether there is a character class like the [[:cntrl:]] that I have found in Ruby.
By non-printable, I mean: delete all characters that are not shown in, for example, the output when one prints the input string. Please note that I am looking for a C# regex; I do not have a problem with my code.

You may remove all control and other non-printable characters with
s = Regex.Replace(s, @"\p{C}+", string.Empty);
The \p{C} Unicode category class matches every character in the Unicode "Other" category (control, format, private use, surrogate and unassigned code points), even those outside the ASCII table, because in .NET, Unicode category classes are Unicode-aware by default.
Breaking it down into subcategories
To only match basic control characters you may use \p{Cc}+, see the 65 chars in the Other, Control Unicode category. It is equal to a [\u0000-\u001F\u007F-\u009F]+ regex.
To only match 161 other format chars including the well-known soft hyphen (\u00AD), zero-width space (\u200B), zero-width non-joiner (\u200C), zero-width joiner (\u200D), left-to-right mark (\u200E) and right-to-left mark (\u200F) use \p{Cf}+. The equivalent including astral plane code points is a (?:[\xAD\u0600-\u0605\u061C\u06DD\u070F\u08E2\u180E\u200B-\u200F\u202A-\u202E\u2060-\u2064\u2066-\u206F\uFEFF\uFFF9-\uFFFB]|\uD804[\uDCBD\uDCCD]|\uD80D[\uDC30-\uDC38]|\uD82F[\uDCA0-\uDCA3]|\uD834[\uDD73-\uDD7A]|\uDB40[\uDC01\uDC20-\uDC7F])+ regex.
To match 137,468 Other, Private Use code points you may use \p{Co}+, or its equivalent including astral plane code points, (?:[\uE000-\uF8FF]|[\uDB80-\uDBBE\uDBC0-\uDBFE][\uDC00-\uDFFF]|[\uDBBF\uDBFF][\uDC00-\uDFFD])+.
To match 2,048 Other, Surrogate code points you may use \p{Cs}+, or a [\uD800-\uDFFF]+ regex. Because .NET strings are UTF-16, this also matches the surrogate halves that encode astral characters such as many emojis.
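The difference between the subcategories can be seen on a small sample. This is only an illustrative sketch: the class name CategoryDemo, its method names and the input string are assumptions made for this example. The input contains one control character (\u0001), one format character (the zero-width joiner \u200D) and one emoji stored as a surrogate pair.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CategoryDemo
{
    // Removes only Other, Control (Cc) characters.
    public static string RemoveControl(string s) =>
        Regex.Replace(s, @"\p{Cc}+", string.Empty);

    // Removes only Other, Format (Cf) characters.
    public static string RemoveFormat(string s) =>
        Regex.Replace(s, @"\p{Cf}+", string.Empty);

    // Removes the whole Other (C) category, surrogates included.
    public static string RemoveAllOther(string s) =>
        Regex.Replace(s, @"\p{C}+", string.Empty);

    public static void Main()
    {
        // "\U0001F60E" is the emoji U+1F60E, stored as two surrogate chars.
        string input = "A\u0001B\u200DC\U0001F60ED";

        Console.WriteLine(RemoveControl(input) == "AB\u200DC\U0001F60ED");  // True
        Console.WriteLine(RemoveFormat(input) == "A\u0001BC\U0001F60ED");   // True
        Console.WriteLine(RemoveAllOther(input) == "ABCD");                 // True
    }
}
```

The last line shows why \p{C} is too broad when emojis should be kept: both surrogate halves fall into \p{Cs}, so the emoji disappears together with the control and format characters.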

You can try this, which removes every non-ASCII character:
string s = "Täkörgåsmrgås";
s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);
Updated answer after comments:
Documentation about non-printable character:
https://en.wikipedia.org/wiki/Control_character
Char.IsControl Method:
https://msdn.microsoft.com/en-us/library/system.char.iscontrol.aspx
Maybe you can try:
string input; // this is your input string
string output = new string(input.Where(c => !char.IsControl(c)).ToArray());

To remove all control and other non-printable characters:
Regex.Replace(s, @"\p{C}+", String.Empty);
To remove the control characters only (if you don't want to remove the emojis 😎):
Regex.Replace(s, @"\p{Cc}+", String.Empty);

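You can try this:
public static class StringExtensions
{
    // Extension methods must be declared in a static class.
    public static string TrimNonAscii(this string value)
    {
        // Remove every run of characters outside printable ASCII (space through tilde).
        return Regex.Replace(value, "[^ -~]+", "");
    }
}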

Regex.Replace removes '\r' character in "\r\n"

Here is a simple example:
string text = "parameter=120\r\n";
int newValue = 250;
text = Regex.Replace(text, @"(?<=parameter\s*=).*", newValue.ToString());
text will be "parameter=250\n" after the replacement, so the Replace() method removes the '\r'. Does it use Unix-style line feeds by default? Adding \b to my regex, (?<=parameter\s*=).*\b, solves the problem, but I suppose there should be a better way to parse lines with Windows-style line endings.

Take a look at this answer. In short, the period (.) matches every character except \n in pretty much all regex implementations. Nothing to do with Replace in particular - you told it to remove any number of ., and that will slurp up \r as well.
Can't test now, but you might be able to rewrite it as (?<=parameter\s*=)[^\r\n]* to explicitly state which characters you want disallowed.
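The suggested rewrite can be checked against the example from the question. This is a small sketch; the class name LineEndingDemo and its method names are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineEndingDemo
{
    // Original pattern: '.' matches '\r', so the carriage return is consumed.
    public static string ReplaceWithDot(string text, int newValue) =>
        Regex.Replace(text, @"(?<=parameter\s*=).*", newValue.ToString());

    // Rewritten pattern: stop at the first '\r' or '\n'.
    public static string ReplaceUntilLineEnd(string text, int newValue) =>
        Regex.Replace(text, @"(?<=parameter\s*=)[^\r\n]*", newValue.ToString());

    public static void Main()
    {
        string text = "parameter=120\r\n";

        Console.WriteLine(ReplaceWithDot(text, 250) == "parameter=250\n");        // True
        Console.WriteLine(ReplaceUntilLineEnd(text, 250) == "parameter=250\r\n"); // True
    }
}
```

With [^\r\n]* the match ends right before the line terminator, so the Windows-style \r\n is left untouched.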

. does not match \n by default. If you want it to match \n as well, you have to use single-line mode:
(?s)(?<=parameter\s*=).*
The leading (?s) turns single-line mode on.

Try this:
string text = "parameter=120\r\n";
int newValue = 250;
text = Regex.Replace(text, @"(parameter\s*=).*\r\n", "${1}" + newValue.ToString() + "\n");
Final value of text:
parameter=250\n
Match carriage return and newline explicitly. Will only match lines ending in \r\n.

How to correctly represent a whitespace character

I wanted to know how to represent a whitespace character in C#. I found the empty string representation string.Empty. Is there anything like that that represents a whitespace character?
I would like to do something like this:
test.ToLower().Split(string.Whitespace)
//test.ToLower().Split(Char.Whitespace)

Which whitespace character? The empty string is pretty unambiguous - it's a sequence of 0 characters. However, " ", "\t" and "\n" are all strings containing a single character which is characterized as whitespace.
If you just mean a space, use a space. If you mean some other whitespace character, there may well be a custom escape sequence for it (e.g. "\t" for tab) or you can use a Unicode escape sequence ("\uxxxx"). I would discourage you from including non-ASCII characters in your source code, particularly whitespace ones.
EDIT: Now that you've explained what you want to do (which should have been in your question to start with) you'd be better off using Regex.Split with a regular expression of \s which represents whitespace:
Regex regex = new Regex(@"\s");
string[] bits = regex.Split(text.ToLower());
See the Regex Character Classes documentation for more information on other character classes.
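A short sketch of whitespace splitting, assuming the goal is to break a lower-cased string into words. The class name WhitespaceSplitDemo, the method names and the sample text are hypothetical. It uses \s+ rather than \s so that a run of several whitespace characters does not produce empty entries, and it also shows string.Split with a null separator array, which is documented to split on whitespace.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceSplitDemo
{
    // Splits on runs of any whitespace using a regular expression.
    public static string[] SplitWithRegex(string text) =>
        Regex.Split(text.ToLower(), @"\s+");

    // Splits on whitespace without a regex: a null separator array
    // makes string.Split treat white-space characters as delimiters.
    public static string[] SplitWithoutRegex(string text) =>
        text.ToLower().Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

    public static void Main()
    {
        string test = "Hello  World\tFoo\nBar";

        Console.WriteLine(string.Join(",", SplitWithRegex(test)));    // hello,world,foo,bar
        Console.WriteLine(string.Join(",", SplitWithoutRegex(test))); // hello,world,foo,bar
    }
}
```

If the input can start or end with whitespace, Regex.Split returns an empty first or last element, so trim the input first or filter out empty entries.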

No, there is no such constant.

The whitespace char can be referenced by its ASCII code.
Character #32 represents a space, therefore:
char space = (char)32;
For example, you can use this approach to produce the desired number of spaces anywhere you want:
int _length = 4; // desired number of spaces (4 is just an example value)
string spaces = string.Empty.PadRight(_length, (char)32);

I had the same problem, so what I did was create a string containing a space and just index the character.
string text = "Hello Morning Good Night";
char space = text[5];
Now whenever I need a space character I pull it from my reference in memory.

Which whitespace character? The most common is the normal space, which is between each word in my sentences. This is just " ".

Using regular expressions, you can represent any whitespace character with the metacharacter "\s"
MSDN Reference

You can always use a Unicode escape; for me personally this is the clearest solution:
var space = "\u0020";

regex check for white space in middle of string

I want to validate that the characters are alpha numeric:
Regex aNum = Regex("[a-z][A-Z][0-9]");
I want to add the option that there might be a white space, so it would be a two word expression:
Regex aNum = Regex("[a-z][A-Z][0-9]["\\s]");
but couldn't find the correct syntax.
I'd appreciate any insight.

[A-Za-z0-9\s]{1,} should work for you. It matches any string which contains alphanumeric or whitespace characters and is at least one char long. If you accept underscores too, you can shorten it to [\w\s]{1,}.
You should add ^ and $ to verify that the whole string matches and not only a part of it:
^[A-Za-z0-9\s]{1,}$ or ^[\w\s]{1,}$.
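The effect of the anchors can be illustrated with a small sketch. The class name AlphaNumValidator, the method names and the sample strings are assumptions made for this example, and + is used as a shorthand for {1,}.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlphaNumValidator
{
    // Unanchored: true as soon as any part of the input matches.
    public static bool ContainsAllowedRun(string s) =>
        Regex.IsMatch(s, @"[A-Za-z0-9\s]+");

    // Anchored: true only when the whole input consists of allowed characters.
    public static bool IsAlphaNumOrSpace(string s) =>
        Regex.IsMatch(s, @"^[A-Za-z0-9\s]+$");

    // Anchored variant for two words separated by one whitespace character.
    public static bool IsTwoWords(string s) =>
        Regex.IsMatch(s, @"^[A-Za-z0-9]+\s[A-Za-z0-9]+$");

    public static void Main()
    {
        Console.WriteLine(ContainsAllowedRun("hello world!")); // True (partial match)
        Console.WriteLine(IsAlphaNumOrSpace("hello world!"));  // False ('!' is not allowed)
        Console.WriteLine(IsAlphaNumOrSpace("hello world"));   // True
        Console.WriteLine(IsTwoWords("hello world"));          // True
        Console.WriteLine(IsTwoWords("hello"));                // False
    }
}
```

The patterns are written as verbatim strings (@"...") so that \s reaches the regex engine unchanged; in a regular C# string literal "\s" is not a valid escape sequence.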

Exactly two words with a single space:
Regex aNum = new Regex(@"^[a-zA-Z0-9]+\s[a-zA-Z0-9]+$");
OR any number of words with any number of spaces:
Regex aNum = new Regex(@"^[a-zA-Z0-9\s]+$");

"[A-Za-z0-9\s]*"
matches alphanumeric characters and whitespace. If you want a word that can contain whitespace but want to ensure it starts and ends with an alphanumeric character you could try
"[A-Za-z0-9][A-Za-z0-9\s]*[A-Za-z0-9]|[A-Za-z0-9]"

To disallow empty strings:
Regex.IsMatch(s ?? "",@"^[\w\s]+$");
and to allow empty strings:
Regex.IsMatch(s ?? "",@"^[\w\s]*$");
I added the ?? "" because IsMatch does not accept null arguments.

If you want to check for white space in the middle of a string you can use these patterns:
"(\w\s)+" : matches one or more pairs of a word character followed by a whitespace character.
"(\w\s)+$" : the same, but the match must reach the end of the string, so the string has to finish with white space.
"[\w\s]+" : matches a run of word characters, whitespace characters, or both.
