Cannot remove a set of chars in a string - c#

I have a set of characters I want to remove from a string : "/\[]:|<>+=;,?*'#
I'm trying with :
private const string CHARS_TO_REPLACE = @"""/\[]:|<>+=;,?*'#";
private string Clean(string stringToClean)
{
    return Regex.Replace(stringToClean, "[" + Regex.Escape(CHARS_TO_REPLACE) + "]", "");
}
However, the result is identical to the input for something like "Foo, bar and other".
What is wrong with my code?
This looks a lot like this question, but with a blacklist instead of a whitelist of characters, so I removed the negating ^ character.

You didn't escape the closing square bracket in CHARS_TO_REPLACE

The problem is a misunderstanding of how Regex.Escape works. From MSDN:
Escapes a minimal set of characters (\, *, +, ?, |, {, [, (, ), ^, $, ., #, and white space) by replacing them with their escape codes.
It works as expected, but you need to think of Regex.Escape as escaping metacharacters outside of a character class. Inside a character class, the characters that need escaping are different. For example, inside a character class a - should be escaped to be literal; otherwise it could define a range of characters (e.g., [A-Z]).
In your case, as others have mentioned, the ] was not escaped. For any character that holds a special meaning within the character class, you will need to handle them separately after calling Regex.Escape. This should do what you need:
string CHARS_TO_REPLACE = @"""/\[]:|<>+=;,?*'#";
string pattern = "[" + Regex.Escape(CHARS_TO_REPLACE).Replace("]", @"\]") + "]";
string input = "hi\" there\\ [i love regex];#";
string result = Regex.Replace(input, pattern, "");
Console.WriteLine(result);
Otherwise, you end up with ["/\\\[]:\|<>\+=;,\?\*'\#], in which ] is not escaped. The engine reads ["/\\\[] as the character class and :\|<>\+=;,\?\*'\#] as the rest of the pattern, so it only matches when one of the class characters is immediately followed by exactly those remaining characters.
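As an alternative to patching the output of Regex.Escape, the character class can be built by escaping only the characters that are special inside a class. This is a minimal sketch, not the original poster's code: CharClassCleaner, BuildCharClass and Clean are hypothetical names introduced here, and the sketch assumes the list of characters is not empty.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class CharClassCleaner
{
    // Hypothetical helper: wraps the given characters in [...] and
    // backslash-escapes the ones that are special inside a character
    // class (\ ] [ ^ -). Escaping [ is not always required, but harmless.
    public static string BuildCharClass(string chars)
    {
        var sb = new StringBuilder("[");
        foreach (char c in chars)
        {
            if (c == '\\' || c == ']' || c == '[' || c == '^' || c == '-')
            {
                sb.Append('\\');
            }
            sb.Append(c);
        }
        return sb.Append(']').ToString();
    }

    // Removes every occurrence of the listed characters from the input.
    public static string Clean(string input, string charsToRemove)
    {
        return Regex.Replace(input, BuildCharClass(charsToRemove), "");
    }
}
```

With the character set from the question, CharClassCleaner.Clean("Foo, bar and other", CHARS_TO_REPLACE) should drop the comma and leave the rest of the text alone.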

As already mentioned (but the answer has suddenly disappeared), Regex.Escape does not escape ], so you need to tweak your code:
return Regex.Replace(stringToClean, "[" + Regex.Escape(CHARS_TO_REPLACE)
    .Replace("]", @"\]") + "]", " ");

There are a number of characters within CHARS_TO_REPLACE which are special in regular expressions and need to be escaped with a backslash \.
This should work:
"/\[]:\|<>\+=;,\?\*'#

Why not just do:
private static string Clean(string stringToClean)
{
    string[] disallowedChars = new string[] { /* YOUR CHARS HERE */ };
    for (int i = 0; i < disallowedChars.Length; i++)
    {
        stringToClean = stringToClean.Replace(disallowedChars[i], "");
    }
    return stringToClean;
}

Single-statement linq solution:
private const string CHARS_TO_REPLACE = @"""/\[]:|<>+=;,?*'#";
private string Clean(string stringToClean) {
return CHARS_TO_REPLACE
.Aggregate(stringToClean, (str, l) => str.Replace(""+l, ""));
}

For the sake of knowledge, here is a variant suited to very large strings (or even streams). No regex here, simply a loop over each character with a StringBuilder for storing the result:
class Program
{
private const string CHARS_TO_REPLACE = @"""/\[]:|<>+=;,?*'#";
static void Main(string[] args)
{
var wc = new WebClient();
var veryLargeString = wc.DownloadString("http://msdn.microsoft.com");
using (var sr = new StringReader(veryLargeString))
{
var sb = new StringBuilder();
int readVal;
while ((readVal = sr.Read()) != -1)
{
var c = (char)readVal;
if (!CHARS_TO_REPLACE.Contains(c))
{
sb.Append(c);
}
}
Console.WriteLine(sb.ToString());
}
}
}
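The program above still downloads the whole page into one string before filtering it. For a real stream, the same loop can read from a TextReader and write to a TextWriter. The following is only a sketch under that assumption; StreamingCleaner and Filter are hypothetical names, and a HashSet is used for the lookup instead of string.Contains.

```csharp
using System.Collections.Generic;
using System.IO;

public static class StreamingCleaner
{
    // Copies characters from reader to writer, skipping any character
    // contained in charsToRemove. Only one character is handled at a
    // time, so the input never has to be loaded as a single string.
    public static void Filter(TextReader reader, TextWriter writer, string charsToRemove)
    {
        var blocked = new HashSet<char>(charsToRemove);
        int readVal;
        while ((readVal = reader.Read()) != -1)
        {
            var c = (char)readVal;
            if (!blocked.Contains(c))
            {
                writer.Write(c);
            }
        }
    }
}
```

The reader could be a StreamReader over a file or network stream, and the writer a StreamWriter, so neither the input nor the output has to fit in memory.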

Related

Using regex or string manipulation when creating permalinks

I have the following method (which also looks expensive) for creating permalinks, but it lacks a few things that are quite important for a nice permalink:
public string createPermalink(string text)
{
text = text.ToLower().TrimStart().TrimEnd();
foreach (char c in text.ToCharArray())
{
if (!char.IsLetterOrDigit(c) && !char.IsWhiteSpace(c))
{
text = text.Replace(c.ToString(), "");
}
if (char.IsWhiteSpace(c))
{
text = text.Replace(c, '-');
}
}
if (text.Length > 200)
{
text = text.Remove(200);
}
return text;
}
A few things it is lacking:
If someone enters text like "My choices are:foo,bar", it is returned as "my-choices-arefoobar",
when it should be "my-choices-are-foo-bar".
If someone enters multiple whitespace characters, they are returned as "---", which is not nice to have in a URL.
Is there a better way to do this with regex (I have only used it a few times)?
UPDATE:
The requirements were:
Any non-letter, non-digit characters at the beginning or end are not allowed
Any non-letter, non-digit characters should be replaced by "-"
Replacement "-" characters should not repeat, as in "---"
Finally, the string is truncated at 200 characters to ensure it's not too long
Change it to:
public string createPermalink(string text)
{
text = text.ToLower();
StringBuilder sb = new StringBuilder(text.Length);
// We want to skip the first hyphenable characters and go to the "meat" of the string
bool lastHyphen = true;
// You can enumerate directly a string
foreach (char c in text)
{
if (char.IsLetterOrDigit(c))
{
sb.Append(c);
lastHyphen = false;
}
else if (!lastHyphen)
{
// We use lastHyphen to not put two hyphens consecutively
sb.Append('-');
lastHyphen = true;
}
if (sb.Length == 200)
{
break;
}
}
// Remove the last hyphen
if (sb.Length > 0 && sb[sb.Length - 1] == '-')
{
sb.Length--;
}
return sb.ToString();
}
If you really want to use regexes, you can do something like this (based on the code of Justin)
Regex rgx = new Regex(@"^\W+|\W+$");
Regex rgx2 = new Regex(@"\W+");
return rgx2.Replace(rgx.Replace(text.ToLower(), string.Empty), "-");
The first regex searches for non-word characters (1 or more) at the beginning (^) or at the end of the string ($) and removes them. The second one replaces one or more non-word characters with -.
This should solve the problem that you have explained. Please let me know if it needs any further explanation.
Just as an FYI, this regex uses lookarounds to get it done in one pass:
// This finds any run of non-word characters, grouping consecutive ones into a single match
// It ignores runs of non-word characters at the beginning or end of the string
Regex rgx = new Regex(@"(?!\W+$)\W+(?<!^\W+)");
// This then replaces those matches with a -
string result = rgx.Replace(input, "-");
To keep the string from going beyond 200 characters, you will have to use substring. If you do this before the regex, then you will be ok, but if you do it after, then you run the risk of having a trailing dash again, FYI.
example:
myString.Substring(0,200)
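Note that Substring(0, 200) throws an ArgumentOutOfRangeException when the string is shorter than 200 characters, so the length has to be checked first. A minimal sketch of a truncation step that also removes a dash left dangling by the cut could look like this (SlugLength and Truncate are hypothetical names, not part of the answers above):

```csharp
public static class SlugLength
{
    // Hypothetical helper: cuts slug to at most maxLength characters and
    // removes any dash left at the end by the cut.
    public static string Truncate(string slug, int maxLength)
    {
        if (slug.Length <= maxLength)
        {
            return slug;
        }
        return slug.Substring(0, maxLength).TrimEnd('-');
    }
}
```

Applied after the regex replacement, this keeps the result within the limit without reintroducing a trailing dash.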
I use an iterative approach for this - because in some cases you might want certain characters to be turned into words instead of having them turned into '-' characters - e.g. '&' -> 'and'.
But when you're done you'll also end up with a string that potentially contains multiple '-' - so you have a final regex that collapses all multiple '-' characters into one.
So I would suggest using an ordered list of regexes, and then run them all in order. This code is written to go in a static class that is then exposed as a single extension method for System.String - and is probably best merged into the System namespace.
I've hacked it from code I use, which had extensibility points (e.g. you could pass in a MatchEvaluator on construction of the replacement object for more intelligent replacements; and you could pass in your own IEnumerable of replacements, as the class was public), and therefore it might seem unnecessarily complicated - judging by the other answers I'm guessing everybody will think so (but I have specific requirements for the SEO of the strings that are created).
The list of replacements I use might not be exactly correct for your uses - if not, you can just add more.
private class SEOSymbolReplacement
{
private Regex _rx;
private string _replacementString;
public SEOSymbolReplacement(Regex r, string replacement)
{
//null-checks required.
_rx = r;
_replacementString = replacement;
}
public string Execute(string input)
{
// null-check required
return _rx.Replace(input, _replacementString);
}
}
private static readonly SEOSymbolReplacement[] Replacements = {
    new SEOSymbolReplacement(new Regex(@"#", RegexOptions.Compiled), "Sharp"),
    new SEOSymbolReplacement(new Regex(@"\+", RegexOptions.Compiled), "Plus"),
    new SEOSymbolReplacement(new Regex(@"&", RegexOptions.Compiled), " And "),
    new SEOSymbolReplacement(new Regex(@"[|:'\\/,_]", RegexOptions.Compiled), "-"),
    new SEOSymbolReplacement(new Regex(@"\s+", RegexOptions.Compiled), "-"),
    new SEOSymbolReplacement(new Regex(@"[^\p{L}\d-]",
        RegexOptions.IgnoreCase | RegexOptions.Compiled), ""),
    new SEOSymbolReplacement(new Regex(@"-{2,}", RegexOptions.Compiled), "-")};
/// <summary>
/// Transforms the string into an SEO-friendly string.
/// </summary>
/// <param name="str"></param>
public static string ToSEOPathString(this string str)
{
if (str == null)
return null;
string toReturn = str;
foreach (var replacement in Replacements)
{
toReturn = replacement.Execute(toReturn);
}
return toReturn;
}
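To see why the order of the rules matters, here is a condensed sketch of the same pipeline that uses pattern/replacement pairs instead of the SEOSymbolReplacement class. SeoPathDemo, Rules and ToSeoPath are hypothetical names used only for this illustration; the patterns are the ones listed above.

```csharp
using System.Text.RegularExpressions;

public static class SeoPathDemo
{
    // Same ordered rules as above, condensed into pattern/replacement pairs.
    private static readonly (Regex Pattern, string Replacement)[] Rules =
    {
        (new Regex(@"#"), "Sharp"),
        (new Regex(@"\+"), "Plus"),
        (new Regex(@"&"), " And "),
        (new Regex(@"[|:'\\/,_]"), "-"),
        (new Regex(@"\s+"), "-"),
        (new Regex(@"[^\p{L}\d-]"), ""),
        (new Regex(@"-{2,}"), "-"),
    };

    // Applies every rule in order; later rules clean up after earlier ones.
    public static string ToSeoPath(string input)
    {
        if (input == null) return null;
        foreach (var rule in Rules)
        {
            input = rule.Pattern.Replace(input, rule.Replacement);
        }
        return input;
    }
}
```

For the input "C# & F#: tips, tricks", the symbol rules first produce words and dashes, the whitespace rule then turns each run of spaces into a dash, and the final rule collapses the resulting "--" runs, giving "CSharp-And-FSharp-tips-tricks".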

Is there a method for removing whitespace characters from a string?

Is there a string class member function (or something else) for removing all spaces from a string? Something like Python's str.strip() ?
You could simply do:
myString = myString.Replace(" ", "");
If you want to remove all white space characters you could use Linq, even if the syntax is not very appealing for this use case:
myString = new string(myString.Where(c => !char.IsWhiteSpace(c)).ToArray());
String.Trim method removes trailing and leading white spaces. It is the functional equivalent of Python's strip method.
LINQ feels like overkill here: converting a string to a list, filtering the list, then turning it back into a string. To remove all whitespace, I would go for a regular expression: Regex.Replace(s, @"\s", ""). This is a common idiom and has probably been optimized.
If you want to remove the spaces at the start or end of the string, you might want to have a look at TrimStart(), TrimEnd() and Trim().
If you're looking to replace all whitespace in a string (not just leading and trailing whitespace) based on .NET's determination of what's whitespace or not, you could use a pretty simple LINQ query to make it work.
string whitespaceStripped = new string((from char c in someString
where !char.IsWhiteSpace(c)
select c).ToArray());
Yes, Trim.
String a = "blabla ";
var b = a.Trim(); // or TrimEnd or TrimStart
Yes, String.Trim().
var result = " a b ".Trim();
gives "a b" in result. By default all whitespace is trimmed. If you want to remove only space you need to type
var result = " a b ".Trim(' ');
If you want to remove all spaces in a string you can use string.Replace().
var result = " a b ".Replace(" ", "");
gives "ab" in result. But that is not equivalent to str.strip() in Python.
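The difference between these calls is easiest to see on a string that contains tabs and newlines as well as spaces. The following sketch wraps each call in a small method; TrimDemo and its method names are hypothetical and exist only for the comparison.

```csharp
public static class TrimDemo
{
    // Removes all leading and trailing whitespace (spaces, tabs, newlines).
    public static string StripEnds(string s)
    {
        return s.Trim();
    }

    // Removes only leading and trailing space characters; a tab or
    // newline at either end stops the trimming.
    public static string StripSpacesAtEnds(string s)
    {
        return s.Trim(' ');
    }

    // Removes every space character anywhere in the string; tabs and
    // newlines are kept.
    public static string RemoveSpaces(string s)
    {
        return s.Replace(" ", "");
    }
}
```

For the input "\t a b \n", StripEnds returns "a b", StripSpacesAtEnds returns the input unchanged (it starts with a tab and ends with a newline), and RemoveSpaces returns "\tab\n".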
I don't know much about Python...
IF str.strip() just removes whitespace at the start and the end, then you could use str = str.Trim() in .NET; otherwise you could just use str = str.Replace(" ", "") to remove all spaces.
IF it removes all whitespace, then use
str = new string((from c in str where !char.IsWhiteSpace(c) select c).ToArray());
There are many different ways, some faster than others:
public static string StripTabsAndNewlines(this string s) {
    // StringBuilder (fast)
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.Length; i++) {
        if (!Char.IsWhiteSpace(s[i])) {
            sb.Append(s[i]);
        }
    }
    return sb.ToString();
    // LINQ (faster?)
    // return new string(s.Where(c => !Char.IsWhiteSpace(c)).ToArray());
    // Regex (slow)
    // return Regex.Replace(s, @"\s+", "");
}
you could use
StringVariable.Replace(" ","")
I'm surprised no one mentioned this:
String.Join("", " all manner\tof\ndifferent\twhite spaces!\n".Split())
string.Split by default splits along the characters that are char.IsWhiteSpace so this is a very similar solution to filtering those characters out by the direct use of char.IsWhiteSpace and it's a one-liner that works in pre-LINQ environments as well.
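As a quick check of that claim, the two approaches can be put side by side. This is only a sketch; WhitespaceRemoval, ViaSplitJoin and ViaFilter are hypothetical names.

```csharp
using System.Linq;

public static class WhitespaceRemoval
{
    // Split() with no arguments splits on whitespace characters;
    // joining the pieces with "" drops the separators.
    public static string ViaSplitJoin(string s)
    {
        return string.Join("", s.Split());
    }

    // Filters out whitespace characters directly.
    public static string ViaFilter(string s)
    {
        return new string(s.Where(c => !char.IsWhiteSpace(c)).ToArray());
    }
}
```

Both methods return "allmannerofdifferentwhitespaces!" for the sample string above.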
Strip spaces? Strip whitespaces? Why should it matter? It only matters if we're searching for an existing implementation, but let's not forget how fun it is to program the solution rather than search MSDN (boring).
You should be able to strip any chars from any string by using 1 of the 2 functions below.
You can remove any chars like this
static string RemoveCharsFromString(string textChars, string removeChars)
{
string tempResult = "";
foreach (char c in textChars)
{
if (!removeChars.Contains(c))
{
tempResult = tempResult + c;
}
}
return tempResult;
}
or you can enforce a character set (so to speak) like this
static string EnforceCharLimitation(string textChars, string allowChars)
{
string tempResult = "";
foreach (char c in textChars)
{
if (allowChars.Contains(c))
{
tempResult = tempResult + c;
}
}
return tempResult;
}

Using Regular Expressions for Pattern Finding with Replace

I have a string in the following format in a comma delimited file:
someText, "Text with, delimiter", moreText, "Text Again"
What I need to do is create a method that will look through the string, and will replace any commas inside of quoted text with a dollar sign ($).
After the method, the string will be:
someText, "Text with$ delimiter", moreText, "Text Again"
I'm not entirely good with RegEx, but would like to know how I can use regular expressions to search for a pattern (finding a comma in between quotes), and then replace that comma with the dollar sign.
Personally, I'd avoid regexes here - assuming that there aren't nested quote marks, this is quite simple to write up as a for-loop, which I think will be more efficient:
var inQuotes = false;
var sb = new StringBuilder(someText.Length);
for (var i = 0; i < someText.Length; ++i)
{
if (someText[i] == '"')
{
inQuotes = !inQuotes;
}
if (inQuotes && someText[i] == ',')
{
sb.Append('$');
}
else
{
sb.Append(someText[i]);
}
}
This type of problem is where Regex fails, do this instead:
var sb = new StringBuilder(str);
var insideQuotes = false;
for (var i = 0; i < sb.Length; i++)
{
switch (sb[i])
{
case '"':
insideQuotes = !insideQuotes;
break;
case ',':
if (insideQuotes)
sb.Replace(',', '$', i, 1);
break;
}
}
str = sb.ToString();
You can also use a CSV parser to parse the string and write it again with replaced columns.
Here's how to do it with Regex.Replace:
string output = Regex.Replace(
input,
"\".*?\"",
m => m.ToString().Replace(',', '$'));
Of course, if you want to ignore escaped double quotes it gets more complicated. Especially when the escape character can itself be escaped.
Assuming the escape character is \, then when trying to match the double quotes, you'll want to match only quotation marks which are preceded by an even number of escape characters (including zero). The following pattern will do that for you:
string pattern = @"(?<=((^|[^\\])(\\\\){0,}))"".*?(?<=([^\\](\\\\){0,}))""";
At this point, you might prefer to abandon regular expressions ;)
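If you do abandon the regex, the escape handling fits into the loop-based approach with one extra flag. The following is a sketch under the same assumption as above (the escape character is \); QuotedDelimiterReplacer is a hypothetical name, and the sketch assumes the delimiter is neither a quote nor a backslash.

```csharp
using System.Text;

public static class QuotedDelimiterReplacer
{
    // Replaces the delimiter with the placeholder inside double-quoted
    // runs. A quote preceded by an unescaped backslash does not open or
    // close a run.
    public static string Replace(string input, char delimiter = ',', char placeholder = '$')
    {
        var sb = new StringBuilder(input.Length);
        bool inQuotes = false;
        bool escaped = false;
        foreach (char c in input)
        {
            if (escaped)
            {
                // The previous character was an unescaped backslash, so
                // this character is taken literally.
                escaped = false;
            }
            else if (c == '\\')
            {
                escaped = true;
            }
            else if (c == '"')
            {
                inQuotes = !inQuotes;
            }
            sb.Append(inQuotes && c == delimiter ? placeholder : c);
        }
        return sb.ToString();
    }
}
```

Because the flag is cleared after exactly one character, a doubled backslash before a quote leaves the quote active, which is the "even number of escape characters" rule described above.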
UPDATE:
In reply to your comment, it is easy to make the operation configurable for different quotation marks, delimiters and placeholders.
string quote = "\"";
string delimiter = ",";
string placeholder = "$";
string output = Regex.Replace(
input,
quote + ".*?" + quote,
m => m.ToString().Replace(delimiter, placeholder));
If you'd like to go the regex route here's what you're looking for:
var result = Regex.Replace( text, "(\"[^,]*),([^,]*\")", "$1$$$2" );
The problem with regex in this case is that it won't catch "this, has, two commas".
See it working at http://refiddle.com/1ab
Can you give this a try: "[\w ],[\w ]" (double quotes included)?
And be careful with the replacement because direct replacement will remove the whole string enclosed in the double quotes.

C# Capitalizing string, but only after certain punctuation marks

I'm trying to find an efficient way to take an input string and capitalize the first letter after every punctuation mark (. : ? !) which is followed by a white space.
Input:
"I ate something. but I didn't:
instead, no. what do you think? i
think not! excuse me.moi"
Output:
"I ate something. But I didn't:
Instead, no. What do you think? I
think not! Excuse me.moi"
The obvious would be to split it and then capitalize the first char of every group, then concatenate everything. But it's uber ugly. What's the best way to do this? (I'm thinking Regex.Replace using a MatchEvaluator that capitalizes the first letter but would like to get more ideas)
Thanks!
Fast and easy:
static class Ext
{
public static string CapitalizeAfter(this string s, IEnumerable<char> chars)
{
var charsHash = new HashSet<char>(chars);
StringBuilder sb = new StringBuilder(s);
for (int i = 0; i < sb.Length - 2; i++)
{
if (charsHash.Contains(sb[i]) && sb[i + 1] == ' ')
sb[i + 2] = char.ToUpper(sb[i + 2]);
}
return sb.ToString();
}
}
Usage:
string capitalized = s.CapitalizeAfter(new[] { '.', ':', '?', '!' });
Try this:
string expression = @"[\.\?\!,]\s+([a-z])";
string input = "I ate something. but I didn't: instead, no. what do you think? i think not! excuse me.moi";
char[] charArray = input.ToCharArray();
foreach (Match match in Regex.Matches(input, expression,RegexOptions.Singleline))
{
charArray[match.Groups[1].Index] = Char.ToUpper(charArray[match.Groups[1].Index]);
}
string output = new string(charArray);
// "I ate something. But I didn't: instead, No. What do you think? I think not! Excuse me.moi"
I use an extension method.
public static string CorrectTextCasing(this string text)
{
// /[.:?!]\\s[a-z]/ matches letters following a space and punctuation,
// /^(?:\\s+)?[a-z]/ matches the first letter in a string (with optional leading spaces)
Regex regexCasing = new Regex("(?:[.:?!]\\s[a-z]|^(?:\\s+)?[a-z])", RegexOptions.Multiline);
// First ensure all characters are lower case.
// (In my case it comes all in caps; this line may be omitted depending upon your needs)
text = text.ToLower();
// Capitalize each match in the regular expression, using a lambda expression
text = regexCasing.Replace(text, s => s.Value.ToUpper());
// Return the new string.
return text;
}
Then I can do the following:
string mangled = "i'm A little teapot, short AND stout. here IS my Handle.";
string corrected = mangled.CorrectTextCasing();
// returns "I'm a little teapot, short and stout. Here is my handle."
Using the Regex / MatchEvaluator route, you could match on
"[.:?!]\s[a-z]"
and capitalize the entire match.
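A minimal sketch of that suggestion, using a lambda as the MatchEvaluator (SentenceCasing and CapitalizeAfterPunctuation are hypothetical names). Uppercasing the whole match is safe because the punctuation mark and the whitespace character are unaffected by it.

```csharp
using System.Text.RegularExpressions;

public static class SentenceCasing
{
    // Matches one of . : ? ! followed by a single whitespace character
    // and a lowercase ASCII letter, then uppercases the whole match.
    public static string CapitalizeAfterPunctuation(string text)
    {
        return Regex.Replace(text, @"[.:?!]\s[a-z]", m => m.Value.ToUpperInvariant());
    }
}
```

Since \s also matches a line break, the letter after "didn't:" at the end of a line is capitalized too, while "me.moi" is left alone because no whitespace follows the dot.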
Where the text variable contains the string
string text = "I ate something. but I didn't: instead, no. what do you think? i think not! excuse me.moi";
string[] punctuators = { "?", "!", ",", "-", ":", ";", "." };
for (int i = 0; i< 7;i++)
{
int pos = text.IndexOf(punctuators[i]);
while(pos!=-1)
{
text = text.Insert(pos+2, char.ToUpper(text[pos + 2]).ToString());
text = text.Remove(pos + 3, 1);
pos = text.IndexOf(punctuators[i],pos+1);
}
}

Replace char in a string

How can I change
XXX#YYY.ZZZ into XXX_YYY_ZZZ?
One way I know is to use the string.Replace(char, char) method,
but I want to replace both "#" and "."; that method replaces just one character.
A second case: what if I have XX.X#YYY.ZZZ?
I still want the output to look like XX.X_YYY_ZZZ.
Is this possible? Any suggestions are welcome, thanks.
So, if I'm understanding correctly, you want to replace # with _, and . with _, but only if . comes after #? If there is a guaranteed # (assuming you're dealing with e-mail addresses?):
string e = "XX.X#YYY.ZZZ";
e = e.Substring(0, e.IndexOf('#')) + "_" + e.Substring(e.IndexOf('#')+1).Replace('.', '_');
Here's a complete regex solution that covers both your cases. The key to your second case is to match dots after the # symbol by using a positive look-behind.
string[] inputs = { "XXX#YYY.ZZZ", "XX.X#YYY.ZZZ" };
string pattern = @"#|(?<=#.*?)\.";
foreach (var input in inputs)
{
string result = Regex.Replace(input, pattern, "_");
Console.WriteLine("Original: " + input);
Console.WriteLine("Modified: " + result);
Console.WriteLine();
}
That said, this is simple enough to accomplish with a couple of string Replace calls. Efficiency is something you will need to test, depending on the text size and the number of replacements the code will make.
You can use the Regex.Replace method:
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.replace(v=VS.90).aspx
You can use the following extension method to do your replacement without creating too many temporary strings (as occurs with Substring and Replace) or incurring regex overhead. It skips to the # symbol, and then iterates through the remaining characters to perform the replacement.
public static string CustomReplace(this string s)
{
var sb = new StringBuilder(s);
for (int i = Math.Max(0, s.IndexOf('#')); i < sb.Length; i++)
if (sb[i] == '#' || sb[i] == '.')
sb[i] = '_';
return sb.ToString();
}
you can chain replace
var newstring = "XX.X#YYY.ZZZ".Replace("#","_").Replace(".","_");
Create an array with characters you want to have replaced, loop through array and do the replace based off the index.
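A minimal sketch of that idea follows; MultiCharReplace and ReplaceAll are hypothetical names. Note that it replaces every occurrence of each listed character, so it covers the first case from the question but not the second one, where dots before the # must be kept.

```csharp
public static class MultiCharReplace
{
    // Replaces every occurrence of each target character with the
    // replacement character, one target at a time.
    public static string ReplaceAll(string s, char[] targets, char replacement)
    {
        foreach (char target in targets)
        {
            s = s.Replace(target, replacement);
        }
        return s;
    }
}
```

ReplaceAll("XXX#YYY.ZZZ", new[] { '#', '.' }, '_') gives "XXX_YYY_ZZZ", while the same call on "XX.X#YYY.ZZZ" gives "XX_X_YYY_ZZZ".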
Assuming the data format is like XX.X#YYY.ZZZ, here is another alternative with String.Split(char separator):
string[] tmp = "XX.X#YYY.ZZZ".Split('#');
string newstr = tmp[0] + "_" + tmp[1].Replace(".", "_");
