Replace special characters and spaces with empty string with REGEX - c#

I am looking to take a string from an input text box and remove all special characters and spaces.
e.g.
Input: ##HH 4444 5656 3333 2AB##
Output: HH4444565633332AB

When dealing with Unicode, to remove one or more characters that are neither a letter nor a number, use:
[^\p{L}\p{N}]+
See this demo at regexstorm or a C# replace demo at tio.run
\p{L} matches any kind of letter from any language
\p{N} matches any kind of numeric character in any script
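As a minimal sketch of how the pattern above could be applied in C# (the class and method names here are hypothetical, chosen only for illustration):

```csharp
using System.Text.RegularExpressions;

public static class UnicodeCleaner
{
    // Hypothetical helper: removes every run of characters that are
    // neither Unicode letters (\p{L}) nor Unicode numbers (\p{N}).
    public static string KeepLettersAndNumbers(string input)
    {
        return Regex.Replace(input, @"[^\p{L}\p{N}]+", "");
    }
}
```

With the input from the question, UnicodeCleaner.KeepLettersAndNumbers("##HH 4444 5656 3333 2AB##") should return HH4444565633332AB, and non-Latin letters are kept as well.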

Let's define what we are going to keep, not what to remove. If we keep Latin letters and digits only we can put
string result = Regex.Replace(input, "[^A-Za-z0-9]", "");
or (Linq alternative)
string result = string.Concat(input
.Where(c => c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z' || c >= '0' && c <= '9'));

Related

How to capitalize 1st letter (ignoring non a-z) with regex in c#?

There are tons of posts regarding how to capitalize the first letter with C#, but I specifically am struggling how to do this when ignoring prefixed non-letter characters and tags inside them. Eg,
<style=blah>capitalize the word, 'capitalize'</style>
How to ignore potential <> tags (or non-letter chars before it, like asterisk *) and the contents within them, THEN capitalize "capitalize"?
I tried:
public static string CapitalizeFirstCharToUpperRegex(string str)
{
// Check for empty string.
if (string.IsNullOrEmpty(str))
return string.Empty;
// Return char and concat substring.
// Start at first char, no matter what (avoid <tags>, etc)
string pattern = @"(^.*?)([a-z])(.+)";
// Extract middle, then upper 1st char
string middleUpperFirst = Regex.Replace(str, pattern, "$2");
middleUpperFirst = CapitalizeFirstCharToUpper(str); // Works
// Inject the middle back in
string final = $"$1{middleUpperFirst}$3";
return Regex.Replace(str, pattern, final);
}
EDIT:
Input: <style=foo>first non-tagged word 1st char upper</style>
Expected output: <style=foo>First non-tagged word 1st char upper</style>
You may use
<[^<>]*>|(?<!\p{L})(\p{L})(\p{L}*)
The regex does the following:
<[^<>]*> - matches <, any 0+ chars other than < and > and then >
| - or
(?<!\p{L}) - finds a position not immediately preceded with a letter
(\p{L}) - captures into Group 1 any letter
(\p{L}*) - captures into Group 2 any 0+ letters (that is necessary if you want to lowercase the rest of the word).
Then, check if Group 2 matched, and if yes, capitalize the first group value and lowercase the second one, else, return the whole value:
var result = Regex.Replace(s, @"<[^<>]*>|(?<!\p{L})(\p{L})(\p{L}*)", m =>
m.Groups[1].Success ?
m.Groups[1].Value.ToUpper() + m.Groups[2].Value.ToLower() :
m.Value);
If you do not need to lowercase the rest of the word, remove the second group and the code related to it:
var result = Regex.Replace(s, @"<[^<>]*>|(?<!\p{L})(\p{L})", m =>
m.Groups[1].Success ?
m.Groups[1].Value.ToUpper() : m.Value);
To only replace the first occurrence using this approach, you need to set a flag and reverse it once the first match is found:
var s = "<style=foo>first non-tagged word 1st char upper</style>";
var found = false;
var result = Regex.Replace(s, @"<[^<>]*>|(?<!\p{L})(\p{L})", m => {
if (m.Groups[1].Success && !found) {
found = !found;
return m.Groups[1].Value.ToUpper();
} else {
return m.Value;
}
});
Console.WriteLine(result); // => <style=foo>First non-tagged word 1st char upper</style>
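Using the look-behind regex feature, you can match the first 'capitalize' (the word that follows the closing > of the tag) and then capitalize the matched text.
The regex is the following:
(?<=<.*>)\w+
The pattern matches any word preceded by a <...> tag, so taking only the first match (for example with Regex.Match) gives the first word after the > angle bracket.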
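A rough sketch of how the look-behind pattern might be combined with a replacement limited to the first match. The class and method names are hypothetical, and the sketch assumes the text to capitalize directly follows a tag; input without any tag is left unchanged.

```csharp
using System.Text.RegularExpressions;

public static class TagAwareCapitalizer
{
    // Matches a word that directly follows a <...> tag. This relies on
    // .NET supporting variable-length look-behind.
    private static readonly Regex WordAfterTag = new Regex(@"(?<=<.*>)\w+");

    // Hypothetical helper: upper-cases the first character of the first match only.
    public static string CapitalizeFirstWordAfterTag(string input)
    {
        return WordAfterTag.Replace(
            input,
            m => char.ToUpper(m.Value[0]) + m.Value.Substring(1),
            1); // count = 1: only the first match is replaced
    }
}
```

Note that \w also matches digits and underscores, so a first word such as "1st" would be matched and left effectively unchanged.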

Regex Lookahead and lookbehind at most one digit

I'm looking to create a regex pattern:
8 characters [a-zA-Z]
must contain only one digit, anywhere in the string
I created this pattern:
^(?=.*[0-9].*[0-9])[0-9a-zA-Z]{8}$
This pattern works fine, but I want only one digit allowed. Example:
aaaaaaa6 match
aaa7aaaa match
aaa88aaa don't match
aaa884aa don't match
aaawwaaa don't match
You could instead use:
^(?=[0-9a-zA-Z]{8}$)[^\d]*\d[^\d]*$
The first part asserts that the string consists of exactly 8 letters or digits. Once this is ensured, the second part ensures that there is exactly one digit in the match.
EDIT: Explanation:
The anchors ^ and $ denote the start and end of the string.
(?=[0-9a-zA-Z]{8}$) asserts that the string consists of exactly 8 letters or digits. Without the $ inside the lookahead, only the first 8 characters would be checked, so longer strings could also pass.
[^\d]*\d[^\d]* requires exactly one digit character, with all remaining characters being non-digits. Since the lookahead has already asserted that the input contains only digits or letters, the non-digit characters here are letters.
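To check the approach against the examples from the question, here is a small sketch. The class name is hypothetical, and the lookahead is written with a $ inside it so that the total length is limited to exactly 8 characters.

```csharp
using System.Text.RegularExpressions;

public static class OneDigitValidator
{
    // Lookahead: exactly 8 ASCII letters or digits.
    // Main part: non-digits, one digit, non-digits.
    private static readonly Regex Pattern =
        new Regex(@"^(?=[0-9a-zA-Z]{8}$)[^\d]*\d[^\d]*$");

    public static bool IsValid(string s) => Pattern.IsMatch(s);
}
```

Under these assumptions, aaaaaaa6 and aaa7aaaa are accepted, while aaa88aaa, aaa884aa and aaawwaaa are rejected.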
If you want a non-regex solution, I wrote this for a small project:
public static bool ContainsOneDigit(string s)
{
if (String.IsNullOrWhiteSpace(s) || s.Length != 8)
return false;
int nb = 0;
foreach (char c in s)
{
if (!Char.IsLetterOrDigit(c))
return false;
if (c >= '0' && c <= '9') // just thought, I could use Char.IsDigit() here ...
nb++;
}
return nb == 1;
}

.Net Regex for Comma Separated string with a strict format

I've got a string that I need to verify for validity, the latter being the case if:
It is completely empty
Or contains a comma-separated string that MUST look like this: 'abc,def,ghi,jkl'.
It doesn't matter how many of these comma-separated values there are, but if the string is not empty, it must adhere to the comma (and only comma) separated format with no whitespace around the commas, and each value may only contain ASCII a-z/A-Z, with no special characters or anything else.
How would I verify whether strings adhere to the rules, or not?
You can use this regex
^([a-zA-Z]+(,[a-zA-Z]+)*)?$
or
^(?!,)(,?[a-zA-Z])*$
^ is start of string
[a-zA-Z] is a character class that matches a single uppercase or lowercase letter
+ is a quantifier that matches the preceding character or group one or more times
* is a quantifier that matches the preceding character or group zero or more times
? is a quantifier that matches the preceding character or group zero or one time
$ is end of string
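A short sketch showing how the first pattern could be wrapped for validation (the class name is hypothetical):

```csharp
using System.Text.RegularExpressions;

public static class LetterListValidator
{
    // Either empty, or one or more letter-only values separated by single commas.
    private static readonly Regex Pattern =
        new Regex(@"^([a-zA-Z]+(,[a-zA-Z]+)*)?$");

    public static bool IsValid(string s) => Pattern.IsMatch(s);
}
```

Here LetterListValidator.IsValid("") and LetterListValidator.IsValid("abc,def,ghi,jkl") return true, while inputs such as "abc, def", ",abc", "abc," or "abc,,def" return false. One caveat: in .NET, $ also matches just before a final newline, so replacing it with \z is an option if a trailing newline must be rejected.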
Consider not using regex:
bool isOK = str == "" || str.Split(',').All(part => part != "" && part.All(c=> (c>= 'a' && c<='z') || (c>= 'A' && c<='Z')));

switch statement - validate substrings

The field data has 4 acceptable types of values:
j
47d (a one- or two-digit number between 0 and 80, followed by d)
9u (a one- or two-digit number between 0 and 80, followed by u)
3v (a single digit between 1 and 4, followed by v).
Otherwise the data should be deemed invalid.
string data = readconsole();
what is the best way of validating this input?
I was considering a combination of .Length and Switch substring checks.
ie.
if (data == "j")
else if (data.substring(1) == "v" && data.substring(0,1) >=1 && data.substring(0,1) <=4)
....
else
writeline("fail");
You can use a regular expression that matches the different kinds of values:
^(j|(\d|[1-7]\d|80)[du]|[1-4]v)$
Example:
if (Regex.IsMatch(data, @"^(j|(\d|[1-7]\d|80)[du]|[1-4]v)$")) ...
Explanation of the regular expression:
^ matches the beginning of string
j matches the literal value "j"
| is the "or" operator
\d matches one digit
[1-7]\d matches "10" - "79"
80 matches "80"
[du] matches either "d" or "u"
[1-4] matches "1" - "4"
v matches "v"
$ matches the end of the string
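The pieces explained above can be put together in a small validator, sketched here with a hypothetical class name:

```csharp
using System.Text.RegularExpressions;

public static class FieldDataValidator
{
    // j | 0-80 followed by d or u | 1-4 followed by v
    private static readonly Regex Pattern =
        new Regex(@"^(j|(\d|[1-7]\d|80)[du]|[1-4]v)$");

    public static bool IsValid(string data) => Pattern.IsMatch(data);
}
```

With this sketch, "j", "47d", "9u", "80u" and "3v" are accepted, while "81d", "5v" and "100d" are rejected.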
A regular expression will be the most succinct way to validate such rules.
You can use the regular expression:
^(?:j|(?:[0-7]?[0-9]|80)[du]|[1-4]v)$
Another option is to split the value into its number and its letter, and check each part. This is somewhat longer, but probably easier to maintain in the long run:
public bool IsValid(string s)
{
if (s == "j")
return true;
Match m = Regex.Match(s, @"^(\d+)(\p{L})$");
if (!m.Success)
return false;
char c = m.Groups[2].Value[0];
int number;
if (!Int32.TryParse(m.Groups[1].Value, NumberStyles.Integer,
CultureInfo.CurrentCulture, out number)) //todo: choose culture
return false;
return ((c == 'u' || c == 'd') && number >= 0 && number <= 80) ||
(c == 'v' && number >= 1 && number <= 4);
}

Regex to find out if the sequence has any special characters

I am looking for a regex to find out the given word sequence has any special characters.
For example.
In this input string
"test?test";
I would like to detect that the words have the form
"test(any special char(s) including space)test"
You can just use [^A-Za-z0-9], which will match anything that is not alphanumeric, but of course it depends on what you consider a "special character." If underscore is not special, [\W] can be a shortcut for anything that is not a word (A-Za-z0-9_) character. Note that in .NET, word characters also include Unicode letters and digits unless RegexOptions.ECMAScript is set.
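To make the difference between the two patterns concrete, here is a brief sketch (class and method names are hypothetical):

```csharp
using System.Text.RegularExpressions;

public static class SpecialCharDetector
{
    // True if the input contains anything other than ASCII letters and digits.
    public static bool HasNonAsciiAlphanumeric(string input) =>
        Regex.IsMatch(input, "[^A-Za-z0-9]");

    // True if the input contains a non-word character; underscore counts as a word character.
    public static bool HasNonWordChar(string input) =>
        Regex.IsMatch(input, @"\W");
}
```

Both methods flag "test?test" and "test test" and accept "testtest". They differ on "test_test": the first method treats the underscore as special, the second does not.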
You don't really need a regex here. If you want to test for alphanumeric characters, you can use LINQ, for example (or just iterate over the letters):
string input = "test test";
bool valid = input.All(Char.IsLetterOrDigit);
Char.IsLetterOrDigit checks for all Unicode alphanumeric characters. If you only want the English ones, you can write:
public static bool IsEnglishAlphanumeric(char c)
{
return ((c >= 'a') && (c <= 'z'))
|| ((c >= 'A') && (c <= 'Z'))
|| ((c >= '0') && (c <= '9'));
}
and use it similarly:
bool valid = input.All(IsEnglishAlphanumeric);
