Regex to remove all special characters from string? - c#

I'm completely incapable of regular expressions, and so I need some help with a problem that I think would best be solved by using regular expressions.
I have list of strings in C#:
List<string> lstNames = new List<string>();
lstNames.Add("TRA-94:23");
lstNames.Add("TRA-42:101");
lstNames.Add("TRA-109:AD");
foreach (string n in lstNames) {
// logic goes here that somehow uses regex to remove all special characters
string regExp = "NO_IDEA";
string tmp = Regex.Replace(n, regExp, "");
}
I need to be able to loop over the list and return each item without any special characters. For example, item one would be "TRA9423", item two would be "TRA42101" and item three would be "TRA109AD".
Is there a regular expression that can accomplish this for me?
Also, the list contains more than 4000 items, so I need the search and replace to be efficient and quick if possible.
EDIT:
I should have specified that any character besides a-z, A-Z and 0-9 is special in my circumstance.

It really depends on your definition of special characters. I find that a whitelist rather than a blacklist is the best approach in most situations:
tmp = Regex.Replace(n, "[^0-9a-zA-Z]+", "");
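As a rough sketch of how this pattern could be applied to the list from the question: the variable names `nonAlnum` and `cleaned` are my own, and building the Regex once outside the loop is just one common way to keep the per-item work small, not a measured optimisation.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var lstNames = new List<string> { "TRA-94:23", "TRA-42:101", "TRA-109:AD" };

// Built once and reused for every item (the name nonAlnum is illustrative).
var nonAlnum = new Regex("[^0-9a-zA-Z]+");

var cleaned = new List<string>();
foreach (string n in lstNames)
{
    // Remove every run of characters that is not an ASCII letter or digit.
    cleaned.Add(nonAlnum.Replace(n, ""));
}

Console.WriteLine(string.Join(", ", cleaned)); // TRA9423, TRA42101, TRA109AD
```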
You should be careful with your current approach because the following two items will be converted to the same string and will therefore be indistinguishable:
"TRA-12:123"
"TRA-121:23"

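[^a-zA-Z0-9] is a character class that matches any non-alphanumeric character.
[^\w\d] is similar but not identical: \w already includes digits and the underscore, and in .NET it also matches non-ASCII letters by default, so underscores and accented letters are left in place.
Usage:
string regExp = @"[^\w\d]";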
string tmp = Regex.Replace(n, regExp, "");

This should do it:
[^a-zA-Z0-9]
Basically it matches all non-alphanumeric characters.

You can use:
string regExp = "\\W";
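This is close to Daniel's "[^a-zA-Z0-9]", but not identical: \W does not match underscores or non-ASCII letters and digits, so those stay in the string.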
\W matches any nonword character. Equivalent to the Unicode categories [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}].
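To see how \W compares with an explicit ASCII class, here is a small sketch. The sample string is invented for illustration; RegexOptions.ECMAScript is included because it narrows \w to [a-zA-Z_0-9].

```csharp
using System;
using System.Text.RegularExpressions;

// Invented sample: contains an underscore, a hyphen, an accented letter and a colon.
string sample = "a_b-\u00E9:1";

// Explicit ASCII whitelist: drops '_', '-', the accented letter and ':'.
string ascii = Regex.Replace(sample, "[^a-zA-Z0-9]", "");
Console.WriteLine(ascii);   // ab1

// \W with default options: '_' and the accented letter are word characters and survive.
string word = Regex.Replace(sample, @"\W", "");
Console.WriteLine(word);    // a_bé1

// With RegexOptions.ECMAScript, \w is limited to [a-zA-Z_0-9], so the accented letter goes too.
string ecma = Regex.Replace(sample, @"\W", "", RegexOptions.ECMAScript);
Console.WriteLine(ecma);    // a_b1
```

If underscores must also go, the explicit class is the simpler choice.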

For my purposes I wanted all English ASCII chars, so this worked.
html = Regex.Replace(html, @"[^\x00-\x7F]+", "");

Depending on your definition of "special character", I think "[^a-zA-Z0-9]" would probably do the trick. That would find anything that is not a small letter, a capital letter, or a digit.

tmp = Regex.Replace(n, @"\W+", "");
\w matches letters, digits, and underscores; \W is the negated version.

If you don't want to use Regex then another option is to use
char.IsLetterOrDigit
You can use this to loop through each char of the string and only return if true.
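A minimal sketch of that loop, assuming a helper I have called `KeepLettersAndDigits` (the name is hypothetical). Note that char.IsLetterOrDigit is Unicode-aware, so non-ASCII letters and digits are kept as well.

```csharp
using System;
using System.Text;

// Hypothetical helper: copy only letters and digits into a new string.
static string KeepLettersAndDigits(string input)
{
    var sb = new StringBuilder(input.Length);
    foreach (char c in input)
    {
        // True for any Unicode letter or decimal digit, not just a-z, A-Z and 0-9.
        if (char.IsLetterOrDigit(c))
        {
            sb.Append(c);
        }
    }
    return sb.ToString();
}

Console.WriteLine(KeepLettersAndDigits("TRA-42:101")); // TRA42101
```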

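public static string Letters(this string input)
{
    // Keeps letters only (digits are dropped). The IsSymbol and IsWhiteSpace checks are
    // redundant, since char.IsLetter is already false for symbols and whitespace.
    return string.Concat(input.Where(x => char.IsLetter(x) && !char.IsSymbol(x) && !char.IsWhiteSpace(x)));
}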

Can anyone improve this 'list of IP addresses' regex? [duplicate]

What is the regular expression to validate a comma delimited list like this one:
12365, 45236, 458, 1, 99996332, ......
I suggest doing it the following way:
(\d+)(,\s*\d+)*
which would work for a list containing 1 or more elements.
This regex extracts an element from a comma separated list, regardless of contents:
(.+?)(?:,|$)
If you just replace the comma with something else, it should work for any delimiter.
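A short sketch of how the extraction pattern (.+?)(?:,|$) could be used from C#; the input string and the `items` list are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string csv = "12365,45236,458";
var items = new List<string>();

// Group 1 lazily captures everything up to the next comma or the end of the input.
foreach (Match m in Regex.Matches(csv, @"(.+?)(?:,|$)"))
{
    items.Add(m.Groups[1].Value);
}

Console.WriteLine(string.Join("|", items)); // 12365|45236|458
```

Because .+? needs at least one character, empty elements (two commas in a row) are not returned as separate items.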
It depends a bit on your exact requirements. I'm assuming: all numbers, any length; numbers cannot have leading zeros or contain commas or decimal points; individual numbers are always separated by a comma then a space; and the last number does NOT have a comma and space after it. Any of these being wrong would simplify the solution.
([1-9][0-9]*,[ ])*[1-9][0-9]*
Here's how I built that mentally:
[0-9] any digit.
[1-9][0-9]* leading non-zero digit followed by any number of digits
[1-9][0-9]*, as above, followed by a comma
[1-9][0-9]*,[ ] as above, followed by a space
([1-9][0-9]*,[ ])* as above, repeated 0 or more times
([1-9][0-9]*,[ ])*[1-9][0-9]* as above, with a final number that doesn't have a comma.
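To check a whole input against the pattern ([1-9][0-9]*,[ ])*[1-9][0-9]* it needs anchors at both ends. A small sketch, with the variable name `listPattern` and the sample inputs being my own:

```csharp
using System;
using System.Text.RegularExpressions;

// Same pattern, wrapped in ^...$ so the entire input has to match.
var listPattern = new Regex(@"^([1-9][0-9]*,[ ])*[1-9][0-9]*$");

Console.WriteLine(listPattern.IsMatch("12365, 45236, 458, 1")); // True
Console.WriteLine(listPattern.IsMatch("12365,45236"));          // False: no space after the comma
Console.WriteLine(listPattern.IsMatch("12, 034"));              // False: leading zero
Console.WriteLine(listPattern.IsMatch("12, 34, "));             // False: trailing separator
```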
Match duplicate comma-delimited items:
(?<=,|^)([^,]*)(,\1)+(?=,|$)
This regex can be used to split the values of a comma-delimited list. List elements may be quoted, unquoted or empty. Commas inside a pair of quotation marks are not matched.
,(?!(?<=(?:^|,)\s*"(?:[^"]|""|\\")*,)(?:[^"]|""|\\")*"\s*(?:,|$))
/^\d+(?:, ?\d+)*$/
I used this for a list of items that had to be alphanumeric, with no underscore at the start of each item.
^(([0-9a-zA-Z][0-9a-zA-Z_]*)([,][0-9a-zA-Z][0-9a-zA-Z_]*)*)$
You might want to specify language just to be safe, but
(\d+, ?)+(\d+)?
ought to work
I had a slightly different requirement, to parse an encoded dictionary/hashtable with escaped commas, like this:
"1=This is something, 2=This is something,,with an escaped comma, 3=This is something else"
I think this is an elegant solution, with a trick that avoids a lot of regex complexity:
if (string.IsNullOrEmpty(encodedValues))
{
    return null;
}
else
{
    var retVal = new Dictionary<int, string>();
    // Each field is "<id>=<value>," where the value may contain doubled (escaped) commas.
    var reFields = new Regex(@"([0-9]+)\=(([A-Za-z0-9\s]|(,,))+),");
    // Appending a comma lets the last field match the same pattern.
    foreach (Match match in reFields.Matches(encodedValues + ","))
    {
        var id = match.Groups[1].Value;
        var value = match.Groups[2].Value;
        retVal[int.Parse(id)] = value.Replace(",,", ",");
    }
    return retVal;
}
I think it can be adapted to the original question with an expression like @"([0-9]+),\s?" and parse on Groups[0].
I hope it's helpful to somebody and thanks for the tips on getting it close to there, especially Asaph!
In JavaScript, use split to help out, and catch any negative digits as well:
'-1,2,-3'.match(/(-?\d+)(,\s*-?\d+)*/)[0].split(',');
// ["-1", "2", "-3"]
// may need trimming if digits are space-separated
The following will match any comma delimited word/digit/space combination
(((.)*,)*)(.)*
Why don't you work with groups:
^(\d+(, )?)+$
If you had a more complicated regex, e.g. for valid URLs rather than just numbers, you could loop through the elements and test each one individually against your regex:
const validRelativeUrlRegex = /^(^$|(?!.*(\W\W))\/[a-zA-Z0-9\/-]+[^\W_]$)/;
const relativeUrls = "/url1,/url-2,url3";
const startsWithComma = relativeUrls.startsWith(",");
const endsWithComma = relativeUrls.endsWith(",");
const areAllURLsValid = relativeUrls
.split(",")
.every(url => validRelativeUrlRegex.test(url));
const isValid = areAllURLsValid && !endsWithComma && !startsWithComma

Check for special characters are not allowed in C#

I have to validate a text box from a list of special characters that are not allowed.
These are all the characters that are not allowed:
"&";"\";"/";"!";"%";"#";"^";"(";")";"?";"|";"~";"+";" ";
"{";"}";"*";",";"[";"]";"$";";";":";"=";"
The semicolons are only used to separate the characters. I tried to write a regex for a few of the characters first, planning to extend it if it worked, but it is not working.
What am I doing wrong here?
Regex.IsMatch(textBox1.Text, @"^[\%\/\\\&\?\,\'\;\:\!\-]+$")
^[\%\/\\\&\?\,\'\;\:\!\-]+$
matches the strings that consist entirely of special characters. You need to invert the character class to match the strings that do not contain a special character:
^[^\%\/\\\&\?\,\'\;\:\!\-]+$
  ^--- added
Alternatively, you can use this regex to match any string containing only alphanumeric characters, hyphens, underscores and apostrophes.
^[a-zA-Z0-9\-'_]+$
The regex you mention in the comments
[^a-zA-Z0-9-'_]
matches a string that contains any character except those that are allowed (you might need to escape the hyphen, though). This works as well, assuming you reverse the condition correctly (accept the strings that do not match).
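A small sketch of the "reverse the condition" idea: search for one disallowed character and accept the input only when nothing is found. The names `notAllowed` and `IsValid` and the sample inputs are illustrative, and the hyphen is escaped to be on the safe side.

```csharp
using System;
using System.Text.RegularExpressions;

// Matches any single character outside the allowed set.
var notAllowed = new Regex(@"[^a-zA-Z0-9\-'_]");

// Hypothetical helper: a string is valid when no disallowed character is found.
static bool IsValid(Regex disallowed, string text) => !disallowed.IsMatch(text);

Console.WriteLine(IsValid(notAllowed, "O'Neil-Smith_2")); // True
Console.WriteLine(IsValid(notAllowed, "50% off"));        // False
```

Note that an empty string passes this check; add a length test if empty input should be rejected.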
If you are just looking for any of a list of characters then a regular expression is the more complicated option. String.IndexOfAny will return the first index of any of an array of characters or -1. So the check:
if (input.IndexOfAny(theCharacters) != -1) {
    // Found one of them.
}
where theCharacters has previously been set up at class scope:
private readonly char[] theCharacters = new [] {'&','\\','/','!','%','#','^',... };
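A runnable sketch of the IndexOfAny check. The array holds only a subset of the characters from the question, chosen for illustration, and the helper name `ContainsAny` is my own.

```csharp
using System;

// Illustrative subset of the disallowed characters, not the full list.
char[] disallowed = { '&', '\\', '/', '!', '%', '^', '(', ')', '?', '|', '~', '+', ' ' };

// Hypothetical helper: true when the input contains at least one of the given characters.
static bool ContainsAny(string input, char[] chars) => input.IndexOfAny(chars) != -1;

Console.WriteLine(ContainsAny("hello world", disallowed)); // True: contains a space
Console.WriteLine(ContainsAny("hello_world", disallowed)); // False
```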
You need to remove ^ from the beginning and $ from the end of the pattern; otherwise it only matches when the whole string consists of special characters.
So, instead of
@"^[\%\/\\\&\?\,\'\;\:\!\-]+$"
it should be
@"[\%\/\\\&\?\,\'\;\:\!\-]+"
You can read more about start of string and end of string anchors here
Your regex means "a string consisting only of special characters" (since you have the begin/end markers ^ and $).
You probably just want to check that the string does not contain any of the characters; @"[\%\/\\\&\?\,\'\;\:\!\-]" would be enough.
Also String.IndexOfAny may be better fit if you just need to see if any of the characters is present in the source string.
Please use this in the TextChanged event:
//Regex regex = new Regex("([a-zA-Z0-9 ._#]+)");
Regex regex = new Regex("^[a-zA-Z0-9_#(+).,-]+$");
string alltxt = txtOthers.Text; // txtOthers is the text box's name
int k = alltxt.Length;
for (int i = 0; i <= k - 1; i++)
{
    string lastch = alltxt.Substring(i, 1);
    if (!regex.IsMatch(lastch))
    {
        // Remove the disallowed character and re-check the same index.
        txtOthers.Text = alltxt.Remove(i, 1);
        i = i - 1;
        alltxt = txtOthers.Text;
        k = alltxt.Length;
    }
    // Keep the caret at the end of the text.
    txtOthers.Select(txtOthers.TextLength, 0);
}
BY Sharafu Hameed

C# Why i can not split the string?

string myNumber = "3.44";
Regex regex1 = new Regex(".");
string[] substrings = regex1.Split(myNumber);
foreach (var substring in substrings)
{
Console.WriteLine("The string is : {0} and the length is {1}",substring, substring.Length);
}
Console.ReadLine();
I tried to split the string by ".", but the split returns only empty strings. Why?
. means "any character" in regular expressions. So don't split using a regex - split using String.Split:
string[] substrings = myNumber.Split('.');
If you really want to use a regex, you could use:
Regex regex1 = new Regex(@"\.");
The @ makes it a verbatim string literal, so you don't have to escape the backslash. The backslash within the string itself is an escape for the dot within the regex parser.
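A quick sketch comparing the unescaped and escaped dot with Regex.Split, using the "3.44" input from the question (the variable names are mine):

```csharp
using System;
using System.Text.RegularExpressions;

string myNumber = "3.44";

// Unescaped dot: every character is a separator, so only empty strings remain.
string[] everyChar = Regex.Split(myNumber, ".");
Console.WriteLine(everyChar.Length);            // 5
Console.WriteLine(string.Join("|", everyChar)); // ||||

// Escaped dot: splits on the literal period only.
string[] byPeriod = Regex.Split(myNumber, @"\.");
Console.WriteLine(string.Join("|", byPeriod));  // 3|44
```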
the easiest solution would be: string[] val = myNumber.Split('.');
. is a reserved character in regex. If you literally want to match a period, try:
Regex regex1 = new Regex(@"\.");
However, you're better off simply using myNumber.Split('.');
The dot matches a single character, without caring what that character
is. The only exception are newline characters.
Source: http://www.regular-expressions.info/dot.html
Therefore your code is asking to split the string at every character.
Use this instead.
string[] substr = myNumber.Split('.');
Keep it simple, use String.Split() method;
string[] substrings = myNumber.Split('.');
It has another overload that allows specifying split options:
public string[] Split(
char[] separator,
StringSplitOptions options
)
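A brief sketch of what the options overload changes. The input "3..44." is invented to produce empty entries.

```csharp
using System;

string version = "3..44.";

// Without options, consecutive and trailing separators yield empty entries.
string[] all = version.Split('.');
Console.WriteLine(all.Length); // 4: "3", "", "44", ""

// StringSplitOptions.RemoveEmptyEntries drops them.
string[] nonEmpty = version.Split(new[] { '.' }, StringSplitOptions.RemoveEmptyEntries);
Console.WriteLine(string.Join("|", nonEmpty)); // 3|44
```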
You don't need a regex; you can do that with the Split method of the string object:
string myNumber = "3.44";
string[] substrings = myNumber.Split('.');
foreach (var substring in substrings)
{
Console.WriteLine("The string is : {0} and the length is {1}",substring, substring.Length);
}
Console.ReadLine();
The period "." is being interpreted as any single character instead of a literal period.
Instead of using regular expressions you could just do:
string[] substrings = myNumber.Split('.');
In Regex patterns, the period character matches any single character. If you want the Regex to match the actual period character, you must escape it in the pattern, like so:
@"\."
Now, this case is somewhat simple for Regex matching; you could instead use String.Split() which will split based on the occurrence of one or more static strings or characters:
string[] substrings = myNumber.Split('.');
try
Regex regex1 = new Regex(@"\.");
EDIT: Er... I guess under a minute after Jon Skeet is not too bad, anyway...
You'll want to place an escape character before the "." - like this "\\."
"." in a regex matches any character, so if you split a four-character string on ".", every character is treated as a separator and only empty strings are returned. Check out this page for common operators.
Try
Regex regex1 = new Regex("[.]");

extract last match from string in c#

i have strings in the form [abc].[some other string].[can.also.contain.periods].[our match]
i now want to match the string "our match" (i.e. without the brackets), so i played around with lookarounds and whatnot. i now get the correct match, but i don't think this is a clean solution.
(?<=\.?\[)     starts with '[' or '.['
([^\[]*)       our match, i couldn't find a way to not use a negated character group
               `.*?` non-greedy did not work as expected with lookarounds,
               it would still match from the first match
               (matches might contain escaped brackets)
(?=\]$)        string ends with an ]
language is .net/c#. if there is an easier solution not involving a regex i'd be also happy to know
what really irritates me is the fact that i cannot use (.*?) to capture the string, as it seems non-greedy does not work with lookbehinds.
i also tried: Regex.Split(str, @"\]\.\[").Last().TrimEnd(']');, but i'm not really proud of this solution either
The following should do the trick. Assuming the string ends after the last match.
string input = "[abc].[some other string].[can.also.contain.periods].[our match]";
var search = new Regex("\\.\\[(.*?)\\]$", RegexOptions.RightToLeft);
string ourMatch = search.Match(input).Groups[1].Value;
Assuming you can guarantee the input format, and it's just the last entry you want, LastIndexOf could be used:
string input = "[abc].[some other string].[can.also.contain.periods].[our match]";
int lastBracket = input.LastIndexOf("[");
string result = input.Substring(lastBracket + 1, input.Length - lastBracket - 2);
With String.Split():
string input = "[abc].[some other string].[can.also.contain.periods].[our match]";
char[] seps = {'[',']','\\'};
string[] splitted = input.Split(seps,StringSplitOptions.RemoveEmptyEntries);
you get "our match" in splitted[6] and can.also.contain.periods is left as one string (splitted[4])
Edit: the array will have the string inside [] and then . and so on, so if you have a variable number of groups, you can use that to get the value you want (or remove the strings that are just '.')
Edited to add the backslash to the separator to treat cases like '\[abc\]'
Edit2: for nested []:
string input = @"[abc].[some other string].[can.also.contain.periods].[our [the] match]";
string[] seps2 = { "].["};
string[] splitted = input.Split(seps2, StringSplitOptions.RemoveEmptyEntries);
you get "our [the] match]" in the last element (index 3) and you'd have to remove the extra ]
You have several options:
RegexOptions.RightToLeft - yes, .NET regex can do this! Use it!
Match the whole thing with greedy prefix, use brackets to capture the suffix that you're interested in
So generally, pattern becomes .*(pattern)
In this case, .*\[([^\]]*)\], then extract what \1 captures (see this on rubular.com)
References
regular-expressions.info/Grouping with brackets
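Both options can be sketched in C# as follows. This assumes the sample input from the question and no nested or escaped brackets; the variable names are mine.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "[abc].[some other string].[can.also.contain.periods].[our match]";

// Greedy prefix: .* swallows as much as possible, so the group captures the last [...] pair.
Match greedy = Regex.Match(input, @".*\[([^\]]*)\]");
Console.WriteLine(greedy.Groups[1].Value); // our match

// RightToLeft: the engine starts at the end of the input, so the last pair is found first.
Match rtl = Regex.Match(input, @"\[([^\]]*)\]", RegexOptions.RightToLeft);
Console.WriteLine(rtl.Groups[1].Value);    // our match
```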

Regex to match alphanumeric and spaces

What am I doing wrong here?
string q = "john s!";
string clean = Regex.Replace(q, @"([^a-zA-Z0-9]|^\s)", string.Empty);
// clean == "johns". I want "john s";
just a FYI
string clean = Regex.Replace(q, @"[^a-zA-Z0-9\s]", string.Empty);
would actually be better like
string clean = Regex.Replace(q, @"[^\w\s]", string.Empty);
This:
string clean = Regex.Replace(dirty, "[^a-zA-Z0-9\x20]", String.Empty);
\x20 is ascii hex for 'space' character
you can add more individual characters that you want to be allowed.
If you want for example "?" to be ok in the return string add \x3f.
I got it:
string clean = Regex.Replace(q, @"[^a-zA-Z0-9\s]", string.Empty);
Didn't know you could put \s in the brackets
The following regex is for space inclusion in textbox.
Regex r = new Regex("^[a-zA-Z\\s]+");
r.IsMatch(textbox1.text);
This works fine for me.
I suspect ^ doesn't work the way you think it does outside of a character class.
What you're telling it to do is replace everything that isn't an alphanumeric with an empty string, OR any leading space. I think what you mean to say is that spaces are ok to not replace - try moving the \s into the [] class.
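A small sketch of the difference, using the "john s!" input from the question (the variable names `original` and `corrected` are mine):

```csharp
using System;
using System.Text.RegularExpressions;

string q = "john s!";

// Original pattern: the first alternative already removes the space,
// because a space is not alphanumeric; ^\s only adds "a whitespace at the start".
string original = Regex.Replace(q, @"([^a-zA-Z0-9]|^\s)", string.Empty);
Console.WriteLine(original);  // johns

// With \s inside the negated class, whitespace becomes part of the allowed set.
string corrected = Regex.Replace(q, @"[^a-zA-Z0-9\s]", string.Empty);
Console.WriteLine(corrected); // john s
```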
There appear to be two problems.
You're using the ^ outside a [] which matches the start of the line
You're not using a * or + which means you will only match a single character.
I think you want the following regex @"([^a-zA-Z0-9\s])+"
The regex below keeps spaces and supports letters from different cultures:
string input = "78-selim güzel667.,?";
Regex regex = new Regex(@"[^\w\x20]|[\d]");
var result= regex.Replace(input,"");
//selim güzel
The circumflex inside the square brackets means all characters except the ones listed after it. Outside the brackets it anchors the match to the start of the string, so the \s belongs inside the negated class.
This regex will help you to filter if there is at least one alphanumeric character and zero or more special characters i.e. _ (underscore), \s whitespace, -(hyphen)
string comparer = "string you want to compare";
Regex r = new Regex(@"^([a-zA-Z0-9]+[_\s-]*)+$");
if (!r.IsMatch(comparer))
{
return false;
}
return true;
Create a set using [a-zA-Z0-9]+ for alphanumeric characters, "+" sign (a quantifier) at the end of the set will make sure that there will be at least one alphanumeric character within the comparer.
Create another set [_\s-]* for special characters, "*" quantifier is to validate that there can be special characters within comparer string.
Pack these sets into a capture group ([a-zA-Z0-9]+[_\s-]*)+ to say that the comparer string should occupy these features.
[RegularExpression(@"^[A-Z]+[a-zA-Z""'\s-]*$")]
Above syntax also accepts space
