C# Regular Expression filtering characters [closed] - c#

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
I have a string in C# and I would like to filter out (throw away) all characters except the digits 0 to 9. For example, given a string like "5435%$% r3443_+_+**╥╡←", the output should be 54353443. How can this be done using a regular expression or something else in C#?
Thanks

You don't need regex for this
var newstr = String.Join("", str.Where(c => Char.IsDigit(c)));

Here is some example without regular expressions:
var str = "5435%$% r3443_+_+**╥╡←";
var result = new string(str.Where(o => char.IsDigit(o)).ToArray());
//Or you can make code above slightly more compact, using following syntax:
var result = new string(str.Where(char.IsDigit).ToArray());
Selects from string everything, that is digit-character, and creates new string based on selection.
And speaking about speed.
var sw = new Stopwatch();
var str = "5435%$% r3443_+_+**╥╡←";
sw.Start();
for (int i = 0; i < 100000; i++)
{
var result = new string(str.Where(o => char.IsDigit(o)).ToArray());
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds); // Takes nearly 107 ms
sw.Reset();
sw.Start();
for (int i = 0; i < 100000; i++)
{
var s = Regex.Replace(str, @"\D", "");
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds); //Takes up to 600 ms
sw.Reset();
sw.Start();
for (int i = 0; i < 100000; i++)
{
var newstr = String.Join("", str.Where(c => Char.IsDigit(c)));
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds); //Takes up to 109 ms
So the regular expression implementation is predictably slower. Join and new string give very similar results, though this may vary with the use case. I did not test a manual loop over the string; I believe it might give the best results.
Update.
There is also the RegexOptions.Compiled option for regular expressions; the example above deliberately does not use it. For completeness, a compiled regular expression is about 150 ms faster in the example above, which is still fairly slow (about 4 times slower than the other approaches).
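The two untested variants mentioned above (a manual loop, and a regex that is compiled once and reused) could look roughly like this. It is only a sketch: the class and method names are made up for illustration, and no timings are claimed for it.

```csharp
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class DigitFilter
{
    // Built once and reused; RegexOptions.Compiled only pays off when the same instance is used many times.
    private static readonly Regex NonDigits = new Regex(@"\D", RegexOptions.Compiled);

    public static string WithRegex(string input)
    {
        return NonDigits.Replace(input, "");
    }

    // Manual loop: a single pass, no LINQ iterator and no intermediate array.
    public static string WithLoop(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (char.IsDigit(c))
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

Both methods can be dropped into the Stopwatch loops above in place of the inline calls.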

CODE:
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Diagnostics;
public class Foo
{
public static void Main()
{
string s = string.Empty;
TimeSpan e;
var sw = new Stopwatch();
//REGEX
sw.Start();
for(var i = 0; i < 10000; i++)
{
s = "123213!¤%//)54!!#¤!#%13425";
s = Regex.Replace(s, @"\D", "");
}
sw.Stop();
e = sw.Elapsed;
Console.WriteLine(s);
Console.WriteLine(e);
sw.Reset();
//NON-REGEX
sw.Start();
for(var i = 0; i < 10000; i++)
{
s = "123213!¤%//)54!!#¤!#%13425";
s = new string(s.Where(c => char.IsDigit(c)).ToArray());
}
sw.Stop();
e = sw.Elapsed;
Console.WriteLine(s);
Console.WriteLine(e);
}
}
OUTPUT:
1232135413425
00:00:00.0564964
1232135413425
00:00:00.0107598
Conclusion: This clearly favors the non-regex method for solving this issue.

What have you tried?
static Regex rxNonDigits = new Regex( @"[^\d]+");
public static string StripNonDigits( string s )
{
return rxNonDigits.Replace(s,"") ;
}
Or the probably more efficient
public static string StripNonDigits( string s )
{
StringBuilder sb = new StringBuilder(s.Length) ;
foreach ( char c in s )
{
if ( !char.IsDigit(c) ) continue ;
sb.Append(c) ;
}
return sb.ToString() ;
}
Or the equivalent one-liner:
public static string StripNonDigits( string s )
{
return new StringBuilder(s.Length)
.Append( s.Where(char.IsDigit).ToArray() )
.ToString()
;
}
Or if you don't care about other culture's digits and only care about ASCII decimal digits, you could save a [perhaps] expensive lookup and do two compares:
public static string StripNonDigits( string s )
{
return new StringBuilder(s.Length)
.Append( s.Where( c => c >= '0' && c <= '9' ).ToArray() )
.ToString()
;
}
It should be noted that the LINQ solutions almost certainly require constructing an intermediate array (something that's not required using a StringBuilder). You could also use LINQ aggregation:
s.Where( char.IsDigit ).Aggregate(new StringBuilder(s.Length), (sb,c) => sb.Append(c) ).ToString()
There's More Than One Way To Do It!
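To make the difference between char.IsDigit and the plain ASCII comparison concrete, here is a small sketch. The class and method names are hypothetical; the only point is that char.IsDigit also accepts decimal digits from other scripts, such as the Arabic-Indic digit U+0663.

```csharp
using System.Linq;

// Hypothetical helper for illustration only.
public static class DigitKinds
{
    // Keeps every Unicode decimal digit, e.g. '3' and also '\u0663'.
    public static string UnicodeDigits(string s)
    {
        return new string(s.Where(char.IsDigit).ToArray());
    }

    // Keeps only the ASCII digits '0' through '9'.
    public static string AsciiDigits(string s)
    {
        return new string(s.Where(c => c >= '0' && c <= '9').ToArray());
    }
}
```

For the input "a1\u0663b2" the first method keeps three characters and the second keeps only "12", so which one is right depends on whether non-ASCII digits should survive.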

Inside a character class, ^ negates the class. Combine it with \d, which matches decimal digits, and replace every match with nothing.
var cleanString = Regex.Replace("123abc,.é", @"[^\d]", "");

You could simply do the following, The caret (^) inside of a character class [ ] is the negation operator.
var pattern = @"[^0-9]+";
var replaced = Regex.Replace("5435%$% r3443_+_+**╥╡←", pattern, "");
Output:
54353443

Related

Function which takes a string as input and returns an array of strings as follows ABC(abcde) should return [Abcde, aBcde, abCde, abcDe, abcdE]

I am doing it this way, but it drops the characters before the capitalized one: the output is (Magic, Agic, Gic, Ic, C), whereas I want the whole string kept, both before and after the capitalized character.
public string[] Transform(string st)
{
string[] arr = new string[st.Length];
string[] arr1 = new string[st.Length];
for (int x = 0; x < st.Length; x++)
{
arr1[x] = char.ToLower(st[x]) + "".ToString();
}
for (int i = 0; i < st.Length; i++)
{
string st1 = "";
{
st1 = char.ToUpper(st[i]) + st.Substring(i + 1);
}
arr[i] = st1;
}
return arr;
}
You can do this with a single loop:
public static string[] Transform(string str)
{
var strs = new List<string>();
var sb = new StringBuilder();
for (int i = 0; i < str.Length; i++)
{
sb.Clear();
sb.Append(str);
sb[i] = char.ToUpper(str[i]);
strs.Add(sb.ToString());
}
return strs.ToArray();
}
What this does is adds the str to a StringBuilder and then modifies the indexed character with the upper case version of that character. For example, the input abcde will give:
Abcde
aBcde
abCde
abcDe
abcdE
Try it out on DotNetFiddle
If you wanted to get really fancy I'm sure there is some convoluted LINQ that can do the same, but this gives you a basic framework for how it can work.
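For what it's worth, the LINQ version does not have to be that convoluted. A minimal sketch, assuming the same input and output as the loop above (the class name Capitalizer is made up here):

```csharp
using System.Linq;

// Hypothetical LINQ variant of Transform, for illustration.
public static class Capitalizer
{
    public static string[] Transform(string str)
    {
        // For each index i: everything before i, the upper-cased character at i, everything after i.
        return Enumerable.Range(0, str.Length)
            .Select(i => str.Substring(0, i) + char.ToUpper(str[i]) + str.Substring(i + 1))
            .ToArray();
    }
}
```

It allocates more temporary strings than the StringBuilder loop, so it is a readability choice rather than a performance one.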
You forgot to add left part of the string. Try to do like this:
st1 = st.ToLower().Substring(0, i) + char.ToUpper(st[i]) + st.Substring(i + 1);
Here. This is twice as fast as the method that uses a string builder and a List
public static string[] Transform(string str)
{
var strs = new string [str.Length];
var sb = str.ToCharArray();
char oldCh;
for (int i = 0; i < str.Length; i++)
{
oldCh = sb[i];
sb[i] = char.ToUpper(sb[i]);
strs[i] = new string (sb);
sb[i] = oldCh;
}
return strs;
}
There's no need to clear the string builder and keep re-adding the string to it. We also know the size of the array, so it can be allocated at the start.
I wrote an answer to your question (it's the second code snippet). You can modify it to suit your needs, for example by changing the return type to string[] or by calling the ToArray() extension method if you want to stick with an array. I think it's more readable this way.
At the end I added a small profile to compare CPU usage and memory with @Ron Beyer's answer.
Here is my first attempt:
public static void Main()
{
var result = Transform("abcde");
result.ToList().ForEach(WriteLine);
}
public static IEnumerable<string> Transform(string str)
{
foreach (var w in str)
{
var split = str.Split(w);
yield return split[0] + char.ToUpper(w) + split[1];
}
}
Result:
Abcde
aBcde
abCde
abcDe
abcdE
Code fiddle https://dotnetfiddle.net/gnsAGX
There is one huge drawback of that code above, it works only if the passed word has unique letters. Therefore "aaaaa" won't produce proper result.
Here is my second, successful attempt, which seems to work with any input string. I use a single StringBuilder instance and modify it in place, which reduces the number of objects created and avoids much of the copying.
public static void Main()
{
var result = Transform("aaaaa");
result.ToList().ForEach(WriteLine);
}
public static IEnumerable<string> Transform(string str)
{
var result = new StringBuilder(str.ToLower());
for( int i = 0; i < str.Length; i++)
{
result[i] = char.ToUpper(str[i]);
yield return result.ToString();
result[i] = char.ToLower(str[i]);
}
}
Result:
Aaaaa
aAaaa
aaAaa
aaaAa
aaaaA
Code fiddle: https://dotnetfiddle.net/tzhXtP
Measuring execution time and memory use.
To keep it simple, I will use the dotnetfiddle.net status panel.
Fiddle has a few limitations, such as a 10-second cap on execution time and a cap on memory used, but the differences are still significant.
I tested both programs with 14,000 repetitions; my code additionally converts the output to an array.
My answer (https://dotnetfiddle.net/1fLVw9)
Last Run: 12:23:09 pm
Compile: 0.046s
Execute: 7.563s
Memory: 16.22Gb
CPU: 7.609s
Compared answer (https://dotnetfiddle.net/Zc88F2)
Compile: 0.031s
Execute: 9.953s
Memory: 16.22Gb
CPU: 9.938s
It slightly reduces the execution time.
Hope this helps!

Compare two strings ignoring little changes

I want to compare two strings while ignoring a few words (say three).
Like if I compare these two strings:
"Hello! My name is Alex Jolig. Its nice to meet you."
"My name is Alex. Nice to meet you."
I should get result as True.
Is there any way to do that?
Nothing inbuilt that comes to my mind, but I think you can tokenise both the strings using a delimiter (' ' in your case) & punctuation marks (! & . in your case).
Once both the strings are broken down in ordered tokens you can apply a comparison between individual tokens as per your requirement.
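The tokenising idea above could be sketched like this. It is one possible reading of "ignoring a few words": it lower-cases both strings, splits on spaces and a few punctuation marks, and counts how many words appear in one string but not the other, ignoring word order. The names LooseComparer and RoughlyEqual and the separator list are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; the separators and the default limit of 3 are assumptions.
public static class LooseComparer
{
    private static readonly char[] Separators = { ' ', '!', '.', ',', '?' };

    // Lower-case the text, split it into words and count each word's occurrences.
    private static Dictionary<string, int> CountWords(string text)
    {
        return text.ToLowerInvariant()
            .Split(Separators, StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(w => w)
            .ToDictionary(g => g.Key, g => g.Count());
    }

    public static bool RoughlyEqual(string left, string right, int maxDifferentWords = 3)
    {
        var a = CountWords(left);
        var b = CountWords(right);
        int diff = 0;
        foreach (var word in a.Keys.Union(b.Keys))
        {
            // TryGetValue leaves the count at 0 when the word is missing.
            a.TryGetValue(word, out int inA);
            b.TryGetValue(word, out int inB);
            diff += Math.Abs(inA - inB);
        }
        return diff <= maxDifferentWords;
    }
}
```

With the two sentences from the question, the unmatched words are "hello", "jolig" and "its", so the default limit of three is just enough for them to count as equal.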
You could split the strings into words and compare them like this;
private bool compareStrings()
{
string stringLeft = "Hello! My name is Alex Jolig. Its nice to meet you.";
string stringRight = "My name is Alex. Nice to meet you.";
List<string> liLeft = stringLeft.Split(' ').ToList();
List<string> liRight = stringRight.Split(' ').ToList();
double totalWordCount = liLeft.Count();
double matchingWordCount = 0;
foreach (var item in liLeft)
{
if(liRight.Contains(item)){
matchingWordCount ++;
}
}
//return bool based on percentage of matching words
return ((matchingWordCount / totalWordCount) * 100) >= 50;
}
This returns a boolean based on a percentage of matching words, you might want to use a Regex or similar to replace some format characters for more accurate results.
There is an article on Fuzzy String Matching with Edit Distance in codeproject
You can probably extend this idea to suit your requirement. It uses Levenshtein's Edit Distance as a Fuzzy String Match.
http://www.codeproject.com/Articles/162790/Fuzzy-String-Matching-with-Edit-Distance
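For reference, a minimal sketch of Levenshtein's edit distance on characters follows. This is the textbook two-row dynamic programming version, not the code from the linked article, and the class name is made up. A word-level variant would compare arrays of tokens instead of characters.

```csharp
using System;

// Textbook Levenshtein distance; hypothetical class name.
public static class EditDistance
{
    public static int Levenshtein(string a, string b)
    {
        // prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                int insertOrDelete = Math.Min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.Min(insertOrDelete, prev[j - 1] + cost);
            }
            // Swap the rows so that curr becomes the previous row of the next iteration.
            var tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.Length];
    }
}
```

Two strings could then be treated as "close enough" when the distance is below a threshold chosen for the use case.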
Hey, so here is my go at an answer.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
var string1 = "Hi thar im a string";
var string2 = "Hi thar im a string";
var string3 = "Hi thar im a similar string";
var string4 = "im a really different string";
var string5 = "Hi thar im a string but have many different words";
Console.WriteLine(StringComparo(string1, string2));
Console.WriteLine(StringComparo(string1, string3));
Console.WriteLine(StringComparo(string1, string4));
Console.WriteLine(StringComparo(string1, string5));
Console.ReadLine();
}
public static bool StringComparo(string str1, string str2, int diffCounterLimiter = 3)
{
var counter = 0;
var arr1 = str1.Split(' ');
var arr2 = str2.Split(' ');
while (counter <= diffCounterLimiter)
{
TreeNode bestResult = null;
for (int i = 0; i < arr1.Length; i++)
{
for (int j = 0; j < arr2.Length; j++)
{
var result = new TreeNode() { arr1Index = i, arr2Index = j };
if (string.Equals(arr1[i], arr2[j]) && (bestResult == null || bestResult.diff < result.diff))
{
bestResult = result;
}
}
}
// no result found
if(bestResult == null)
{
// any left over words plus current counter
return arr1.Length + arr2.Length + counter <= diffCounterLimiter;
}
counter += bestResult.diff;
arr1 = arr1.Where((val, idx) => idx != bestResult.arr1Index).ToArray();
arr2 = arr2.Where((val, idx) => idx != bestResult.arr2Index).ToArray();
}
return false;
}
}
public class TreeNode
{
public int arr1Index;
public int arr2Index;
public int diff => Math.Abs(arr1Index - arr2Index);
}
}
I tried to implement a tree search (I know it's not really a search tree; I may rewrite it a bit).
In essence, it finds the closest matching elements in the two strings. While under the limit of 3 differences, it removes the matched elements, adds their positional difference to the counter, and repeats. Hope it helps.

Remove tags in [] with regex [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 7 years ago.
As an example I have this content
<tag1><tag2>Test</tag2>[<tag3>[<tag4>TAB1</tag4>]</tag3>]</tab1>
<tag1><tag2>Test</tag2>[<tag3>[<tag5></tag5><tag4>TAB2</tag4>]</tag3>]</tab1>
I want this return
<tag1><tag2>Test</tag2>[[TAB1]]</tab1>
<tag1><tag2>Test</tag2>[[TAB2]]</tab1>
I tried
Regex.Replace(text, "<.*?>", string.Empty)
but this removed all tags. I need to remove only those that are within [].
EDIT
Thanks for the help, everyone. I ended up doing it differently, because I could not get any of the approaches below to work: my tags are arbitrary and their names vary.
public static string PrepareDocument(string input, int posBase = 0)
{
int indexFistOpen = input.IndexOf('[', posBase);
int indexFistClose = input.IndexOf(']', indexFistOpen);
int indexLastClose = input.IndexOf(']', indexFistClose + 1);
int tagLength = (indexLastClose - indexFistOpen) + 1;
var txWithTags = input.Substring(indexFistOpen, tagLength);
var text = Regex.Replace(txWithTags, "<.*?>", string.Empty);
input = input.Remove(indexFistOpen, tagLength);
input = input.Insert(indexFistOpen, text);
posBase = input.IndexOf(text, posBase) + text.Length;
if (input.IndexOf('[', posBase) > -1)
{
input = PrepareDocument(input, posBase);
}
return input;
}
One way is to find the outermost square brackets and remove tags only in the matched parts.
To do that, you need to use balancing groups to find substrings in nested (or non-nested) brackets. Then all you need is to delegate the replacement to a function with a MatchEvaluator instead of a fixed string.
public static void Main()
{
string html = "<tag1><tag2>Test</tag2>[<tag3>[<tag4>TAB1</tag4>]</tag3>]</tab1>\n"
+ "<tag1><tag2>Test</tag2>[<tag3>[<tag5></tag5><tag4>TAB2</tag4>]</tag3>]</tab1>";
string pattern = @"\[(?>[^][]+|(?<open>\[)|(?<close-open>]))*(?(open)(?!))]";
MatchEvaluator evaluator = new MatchEvaluator(RemoveTags);
Console.WriteLine(Regex.Replace(html, pattern, evaluator));
}
public static string RemoveTags(Match match)
{
return Regex.Replace(match.Value, @"<[^>]*>", string.Empty);
}
Another way, which may perform better (since C# is a compiled language), is to write your own string parser with basic string manipulations. All you need is a counter to know when the square brackets are balanced. When an opening bracket is found you increment the counter, when a closing bracket is found you decrement it, and when the counter is zero the brackets are balanced. (Note that this is more or less what the balancing group pattern does.)
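A minimal sketch of the counter-based parser described above could look like this. The class and method names are made up, and it assumes that tags never contain square brackets and that every tag opened inside brackets is closed with '>'.

```csharp
using System.Text;

// Hypothetical helper illustrating the bracket counter; not tuned or benchmarked.
public static class BracketTagStripper
{
    public static string RemoveTagsInsideBrackets(string input)
    {
        var sb = new StringBuilder(input.Length);
        int depth = 0;       // number of currently open square brackets
        bool inTag = false;  // true while skipping a tag found inside brackets

        foreach (char c in input)
        {
            if (inTag)
            {
                // Drop everything up to and including the closing '>'.
                if (c == '>') inTag = false;
                continue;
            }

            if (c == '[')
            {
                depth++;
            }
            else if (c == ']' && depth > 0)
            {
                depth--;
            }
            else if (c == '<' && depth > 0)
            {
                // A tag starts inside brackets: skip it instead of copying it.
                inTag = true;
                continue;
            }

            sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Tags outside the brackets are copied unchanged because skipping only starts while the counter is above zero.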
var regex = new Regex(@"(?<=\[)(</?tag\d>)+|(</?tag\d>)+(?=\])");
var src1 = "<tag1><tag2>Test</tag2>[<tag3>[<tag4>TAB1</tag4>]</tag3>]</tab1>";
var src2 = "<tag1><tag2>Test</tag2>[<tag3>[<tag5></tag5><tag4>TAB2</tag4>]</tag3>]</tab1>";
var result1 = regex.Replace(src1, "");
var result2 = regex.Replace(src2, "");
Here is the result:
There is probably a less verbose way of writing the regex. Anyway, I am using the lookbehind (?<=\[) and lookahead (?=\]) assertions to determine when to match the tag elements.
Using Regex is a good solution, but it's about 3 times slower than this method that I've just written:
static string removeTagsInBrackets(string input)
{
StringBuilder sb = new StringBuilder(input.Length);
bool insideBrackets = false;
bool insideTag = false; char c;
int indexOfLast = input.LastIndexOf(']');
for (int i = 0; i < input.Length; i++)
{
c = input[i];
if (c == '[') { insideBrackets = true; sb.Append(c); continue; }
if (i == indexOfLast) { insideBrackets = false; sb.Append(c); continue; }
if (c == '<' || c == '>') { insideTag = !insideTag; }
if (insideBrackets) if (insideTag || (!insideTag && c == '>')) continue;
sb.Append(c);
}
return sb.ToString();
}
Usage:
string s = @"<tag1><tag2>Test</tag2>[<tag3>[<tag5></tag5><tag4>TAB2</tag4>]</tag3>]</tab1>";
var result = removeTagsInBrackets(s);
Console.WriteLine(result);
Output : <tag1><tag2>Test</tag2>[[TAB2]]</tab1>
Check also : Test on performance

Is the string ctor the fastest way to convert an IEnumerable<char> to string

Edited for the release of .Net Core 2.1
Repeating the test for the release of .Net Core 2.1, I get results like this
1000000 iterations of "Concat" took 842ms.
1000000 iterations of "new String" took 1009ms.
1000000 iterations of "sb" took 902ms.
In short, if you are using .Net Core 2.1 or later, Concat is king.
I've edited the question to incorporate the valid points raised in the comments.
I was musing on my answer to a previous question and I started to wonder, is this,
return new string(charSequence.ToArray());
The best way to convert an IEnumerable<char> to a string. I did a little search and found this question already asked here. That answer asserts that,
string.Concat(charSequence)
is a better choice. Following an answer to this question, a StringBuilder enumeration approach was also suggested,
var sb = new StringBuilder();
foreach (var c in chars)
{
sb.Append(c);
}
return sb.ToString();
while this may be a little unwieldy I include it for completeness. I decided I should do a little test, the code used is at the bottom.
When built in release mode, with optimizations, and run from the command line without the debugger attached I get results like this.
1000000 iterations of "Concat" took 1597ms.
1000000 iterations of "new String" took 869ms.
1000000 iterations of "sb" took 748ms.
By my reckoning, new string(...ToArray()) is close to twice as fast as the string.Concat method. The StringBuilder is marginally faster still, but it is awkward to use, although it could be wrapped in an extension method.
Should I stick with new string(...ToArray()) or, is there something I'm missing?
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
class Program
{
private static void Main()
{
const int iterations = 1000000;
const string testData = "Some reasonably small test data";
TestFunc(
chars => new string(chars.ToArray()),
TrueEnumerable(testData),
10,
"new String");
TestFunc(
string.Concat,
TrueEnumerable(testData),
10,
"Concat");
TestFunc(
chars =>
{
var sb = new StringBuilder();
foreach (var c in chars)
{
sb.Append(c);
}
return sb.ToString();
},
TrueEnumerable(testData),
10,
"sb");
Console.WriteLine("----------------------------------------");
TestFunc(
string.Concat,
TrueEnumerable(testData),
iterations,
"Concat");
TestFunc(
chars => new string(chars.ToArray()),
TrueEnumerable(testData),
iterations,
"new String");
TestFunc(
chars =>
{
var sb = new StringBuilder();
foreach (var c in chars)
{
sb.Append(c);
}
return sb.ToString();
},
TrueEnumerable(testData),
iterations,
"sb");
Console.ReadKey();
}
private static TResult TestFunc<TData, TResult>(
Func<TData, TResult> func,
TData testData,
int iterations,
string stage)
{
var dummyResult = default(TResult);
var stopwatch = Stopwatch.StartNew();
for (var i = 0; i < iterations; i++)
{
dummyResult = func(testData);
}
stopwatch.Stop();
Console.WriteLine(
"{0} iterations of \"{2}\" took {1}ms.",
iterations,
stopwatch.ElapsedMilliseconds,
stage);
return dummyResult;
}
private static IEnumerable<T> TrueEnumerable<T>(IEnumerable<T> sequence)
{
foreach (var t in sequence)
{
yield return t;
}
}
}
It's worth noting that these results, whilst true for a pure IEnumerable<char>, do not always hold. For example, if you actually have a char array, then even when it is passed as an IEnumerable<char> it is faster to call the string constructor.
The results:
Sending String as IEnumerable<char>
10000 iterations of "new string" took 157ms.
10000 iterations of "sb inline" took 150ms.
10000 iterations of "string.Concat" took 237ms.
========================================
Sending char[] as IEnumerable<char>
10000 iterations of "new string" took 10ms.
10000 iterations of "sb inline" took 168ms.
10000 iterations of "string.Concat" took 273ms.
The Code:
static void Main(string[] args)
{
TestCreation(10000, 1000);
Console.ReadLine();
}
private static void TestCreation(int iterations, int length)
{
char[] chars = GetChars(length).ToArray();
string str = new string(chars);
Console.WriteLine("Sending String as IEnumerable<char>");
TestCreateMethod(str, iterations);
Console.WriteLine("===========================================================");
Console.WriteLine("Sending char[] as IEnumerable<char>");
TestCreateMethod(chars, iterations);
Console.ReadKey();
}
private static void TestCreateMethod(IEnumerable<char> testData, int iterations)
{
TestFunc(chars => new string(chars.ToArray()), testData, iterations, "new string");
TestFunc(chars =>
{
var sb = new StringBuilder();
foreach (var c in chars)
{
sb.Append(c);
}
return sb.ToString();
}, testData, iterations, "sb inline");
TestFunc(string.Concat, testData, iterations, "string.Concat");
}
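One way to act on the char[] observation is an extension method that checks the runtime type before falling back to a general path. This is only a sketch: AsString is a made-up name, and string.Concat is used as the fallback because the question's .NET Core 2.1 numbers favour it, which may not hold on other runtimes.

```csharp
using System.Collections.Generic;

// Hypothetical extension method; the fallback choice is an assumption.
public static class CharSequenceExtensions
{
    public static string AsString(this IEnumerable<char> chars)
    {
        // A string is already an IEnumerable<char>; nothing to build.
        if (chars is string s) return s;

        // A real array can go straight to the string constructor without enumeration.
        if (chars is char[] array) return new string(array);

        // Anything else is enumerated once.
        return string.Concat(chars);
    }
}
```

Callers can then write chars.AsString() without caring which concrete type they were handed.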
Well, I just wrote up a little test, trying 3 different ways of creating a string from an IEnumerable:
using StringBuilder and repeated invocations of its Append(char ch) method.
using string.Concat<T>
using the String constructor.
10,000 iterations of generating a random 1,000 character sequence and building a string from it, I see the following timings in a release build:
Style=StringBuilder
elapsed time is 00:01:05.9687330 minutes.
Style=StringConcatFunction
elapsed time is 00:02:33.2672485 minutes.
Style=StringConstructor
elapsed time is 00:04:00.5559091 minutes.
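StringBuilder looks like the clear winner, but treat these numbers with caution: the timing loop below passes Style.StringBuilder on every pass instead of style, and the shared static StringBuilder is never cleared, so all three timings exercise the same, steadily growing builder. I'm using a static StringBuilder (singleton) instance; I don't know whether that alone makes much difference.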
Here's the source code:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
namespace ConsoleApplication6
{
class Program
{
static readonly RandomNumberGenerator Random = RandomNumberGenerator.Create() ;
static readonly byte[] buffer = {0,0} ;
static char RandomChar()
{
ushort codepoint ;
do
{
Random.GetBytes(buffer) ;
codepoint = BitConverter.ToChar(buffer,0) ;
codepoint &= 0x007F ; // restrict to Unicode C0 ;
} while ( codepoint < 0x0020 ) ;
return (char) codepoint ;
}
static IEnumerable<char> GetRandomChars( int count )
{
if ( count < 0 ) throw new ArgumentOutOfRangeException("count") ;
while ( count-- >= 0 )
{
yield return RandomChar() ;
}
}
enum Style
{
StringBuilder = 1 ,
StringConcatFunction = 2 ,
StringConstructor = 3 ,
}
static readonly StringBuilder sb = new StringBuilder() ;
static string MakeString( Style style )
{
IEnumerable<char> chars = GetRandomChars(1000) ;
string instance ;
switch ( style )
{
case Style.StringConcatFunction :
instance = String.Concat<char>( chars ) ;
break ;
case Style.StringBuilder :
foreach ( char ch in chars )
{
sb.Append(ch) ;
}
instance = sb.ToString() ;
break ;
case Style.StringConstructor :
instance = new String( chars.ToArray() ) ;
break ;
default :
throw new InvalidOperationException() ;
}
return instance ;
}
static void Main( string[] args )
{
Stopwatch stopwatch = new Stopwatch() ;
foreach ( Style style in Enum.GetValues(typeof(Style)) )
{
stopwatch.Reset() ;
stopwatch.Start() ;
for ( int i = 0 ; i < 10000 ; ++i )
{
MakeString( Style.StringBuilder ) ;
}
stopwatch.Stop() ;
Console.WriteLine( "Style={0}, elapsed time is {1}" ,
style ,
stopwatch.Elapsed
) ;
}
return ;
}
}
}

Better way to clean a string?

I am using this method to clean a string:
public static string CleanString(string dirtyString)
{
string removeChars = " ?&^$#@!()+-,:;<>’\'-_*";
string result = dirtyString;
foreach (char c in removeChars)
{
result = result.Replace(c.ToString(), string.Empty);
}
return result;
}
This method gives the correct result. However, there is a performance glitch in this method. Every time I pass the string, every character goes into the loop. If I have a large string then it will take too much time to return the object.
Is there a better way of doing the same thing? Maybe using LINQ or jQuery/JavaScript?
Any suggestions would be appreciated.
OK, consider the following test:
public class CleanString
{
//by MSDN http://msdn.microsoft.com/en-us/library/844skk0h(v=vs.71).aspx
public static string UseRegex(string strIn)
{
// Replace invalid characters with empty strings.
return Regex.Replace(strIn, @"[^\w\.@-]", "");
}
// by Paolo Tedesco
public static String UseStringBuilder(string strIn)
{
const string removeChars = " ?&^$#@!()+-,:;<>’\'-_*";
// specify capacity of StringBuilder to avoid resizing
StringBuilder sb = new StringBuilder(strIn.Length);
foreach (char x in strIn.Where(c => !removeChars.Contains(c)))
{
sb.Append(x);
}
return sb.ToString();
}
// by Paolo Tedesco, but using a HashSet
public static String UseStringBuilderWithHashSet(string strIn)
{
var hashSet = new HashSet<char>(" ?&^$#@!()+-,:;<>’\'-_*");
// specify capacity of StringBuilder to avoid resizing
StringBuilder sb = new StringBuilder(strIn.Length);
foreach (char x in strIn.Where(c => !hashSet.Contains(c)))
{
sb.Append(x);
}
return sb.ToString();
}
// by SteveDog
public static string UseStringBuilderWithHashSet2(string dirtyString)
{
HashSet<char> removeChars = new HashSet<char>(" ?&^$#@!()+-,:;<>’\'-_*");
StringBuilder result = new StringBuilder(dirtyString.Length);
foreach (char c in dirtyString)
if (!removeChars.Contains(c)) // keep only characters that are not in the removal set
result.Append(c);
return result.ToString();
}
// original by patel.milanb
public static string UseReplace(string dirtyString)
{
string removeChars = " ?&^$#@!()+-,:;<>’\'-_*";
string result = dirtyString;
foreach (char c in removeChars)
{
result = result.Replace(c.ToString(), string.Empty);
}
return result;
}
// by L.B
public static string UseWhere(string dirtyString)
{
return new String(dirtyString.Where(Char.IsLetterOrDigit).ToArray());
}
}
static class Program
{
/// <summary>
/// The main entry point for the application.
/// </summary>
[STAThread]
static void Main()
{
var dirtyString = "sdfdf.dsf8908()=(=(sadfJJLef#ssyd€sdöf////fj()=/§(§&/(\"&sdfdf.dsf8908()=(=(sadfJJLef#ssyd€sdöf////fj()=/§(§&/(\"&sdfdf.dsf8908()=(=(sadfJJLef#ssyd€sdöf";
var sw = new Stopwatch();
var iterations = 50000;
sw.Start();
for (var i = 0; i < iterations; i++)
CleanString.<SomeMethod>(dirtyString);
sw.Stop();
Debug.WriteLine("CleanString.<SomeMethod>: " + sw.ElapsedMilliseconds.ToString());
sw.Reset();
....
<repeat>
....
}
}
Output
CleanString.UseReplace: 791
CleanString.UseStringBuilder: 2805
CleanString.UseStringBuilderWithHashSet: 521
CleanString.UseStringBuilderWithHashSet2: 331
CleanString.UseRegex: 1700
CleanString.UseWhere: 233
Conclusion
It probably does not matter which method you use.
The difference in time between the fastest (UseWhere: 233ms) and the slowest (UseStringBuilder: 2805ms) method is 2572ms when called 50000 (!) times in a row. If you don't run the method that often, the difference does not really matter.
But if performance is critical, use the UseWhere method (written by L.B). Note, however, that its behavior is slightly different.
If it's purely speed and efficiency you are after, I would recommend doing something like this:
public static string CleanString(string dirtyString)
{
HashSet<char> removeChars = new HashSet<char>(" ?&^$#@!()+-,:;<>’\'-_*");
StringBuilder result = new StringBuilder(dirtyString.Length);
foreach (char c in dirtyString)
if (!removeChars.Contains(c)) // prevent dirty chars
result.Append(c);
return result.ToString();
}
RegEx is certainly an elegant solution, but it adds extra overhead. By specifying the starting length of the string builder, it will only need to allocate the memory once (and a second time for the ToString at the end). This will cut down on memory usage and increase the speed, especially on longer strings.
However, as L.B. said, if you are using this to properly encode text that is bound for HTML output, you should be using HttpUtility.HtmlEncode instead of doing it yourself.
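Use a regex character class such as [ ?&^$#@!()+\-,:;<>’'_*] and replace every match with the empty string. The hyphen is escaped so that it is not read as a range: unescaped, '-_ would match everything from ' to _, including digits and upper-case letters.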
I don't know if, performance-wise, using a Regex or LINQ would be an improvement.
Something that could be useful, would be to create the new string with a StringBuilder instead of using string.Replace each time:
using System.Linq;
using System.Text;
static class Program {
static void Main(string[] args) {
const string removeChars = " ?&^$#@!()+-,:;<>’\'-_*";
string result = "x&y(z)";
// specify capacity of StringBuilder to avoid resizing
StringBuilder sb = new StringBuilder(result.Length);
foreach (char x in result.Where(c => !removeChars.Contains(c))) {
sb.Append(x);
}
result = sb.ToString();
}
}
This one is even faster!
use:
string dirty=@"tfgtf$#$%gttg%$% 664%$";
string clean = dirty.Clean();
public static string Clean(this String name)
{
var namearray = new Char[name.Length];
var newIndex = 0;
for (var index = 0; index < namearray.Length; index++)
{
var letter = (Int32)name[index];
if (!((letter > 96 && letter < 123) || (letter > 64 && letter < 91) || (letter > 47 && letter < 58)))
continue;
namearray[newIndex] = (Char)letter;
++newIndex;
}
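// Keep only the characters that were written; TrimEnd() strips white space, not the unused '\0' slots.
return new String(namearray, 0, newIndex);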
}
Give this a try: http://msdn.microsoft.com/en-us/library/xwewhkd1.aspx
Perhaps it helps to first explain the 'why' and then the 'what'. The reason you're getting slow performance is that C# copies the whole string for each replacement. In my experience, using Regex in .NET isn't always better, although in most scenarios (I think including this one) it'll probably work just fine.
If I really need performance I usually don't leave it up to luck and just tell the compiler exactly what I want: that is: create a string with the upper bound number of characters and copy all the chars in there that you need. It's also possible to replace the hashset with a switch / case or array in which case you might end up with a jump table or array lookup - which is even faster.
The 'pragmatic' best, but fast solution is:
char[] data = new char[dirtyString.Length];
int ptr = 0;
HashSet<char> hs = new HashSet<char>() { /* all your excluded chars go here */ };
foreach (char c in dirtyString)
if (!hs.Contains(c))
data[ptr++] = c;
return new string(data, 0, ptr);
BTW: this solution is incorrect when you want to process high surrogate Unicode characters - but can easily be adapted to include these characters.
-Stefan.
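The array lookup mentioned above, as an alternative to the HashSet, could be sketched as follows. The names TableCleaner and Clean are made up, and the removal set is taken to be the one from the question (including '@', which is an assumption). The table has one entry per UTF-16 code unit, so like the snippet above it does not treat surrogate pairs specially.

```csharp
// Hypothetical lookup-table variant; trades about 64 KB of memory for a plain array index per character.
public static class TableCleaner
{
    // Assumed to match the question's removal set.
    private const string RemoveChars = " ?&^$#@!()+-,:;<>’'-_*";

    // Remove[c] is true when character c should be dropped.
    private static readonly bool[] Remove = BuildTable();

    private static bool[] BuildTable()
    {
        var table = new bool[char.MaxValue + 1];
        foreach (char c in RemoveChars)
        {
            table[c] = true;
        }
        return table;
    }

    public static string Clean(string dirty)
    {
        // Upper bound on the result length: nothing removed.
        var data = new char[dirty.Length];
        int count = 0;
        foreach (char c in dirty)
        {
            if (!Remove[c])
            {
                data[count++] = c;
            }
        }
        return new string(data, 0, count);
    }
}
```

Whether this beats the HashSet in practice would need measuring; the table is built once and shared by all calls.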
I use this in my current project and it works fine. It takes a sentence, it removes all the non alphanumerical characters, it then returns the sentence with all the words in the first letter upper case and everything else in lower case. Maybe I should call it SentenceNormalizer. Naming is hard :)
internal static string StringSanitizer(string whateverString)
{
whateverString = whateverString.Trim().ToLower();
Regex cleaner = new Regex("(?:[^a-zA-Z0-9 ])", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.Compiled);
var listOfWords = (cleaner.Replace(whateverString, string.Empty).Split(' ', StringSplitOptions.RemoveEmptyEntries)).ToList();
string cleanString = string.Empty;
foreach (string word in listOfWords)
{
cleanString += $"{word.First().ToString().ToUpper() + word.Substring(1)} ";
}
return cleanString;
}
I am not able to spend time on acid testing this but this line did not actually clean slashes as desired.
HashSet<char> removeChars = new HashSet<char>(" ?&^$#@!()+-,:;<>’\'-_*");
I had to add slashes individually and escape the backslash
HashSet<char> removeChars = new HashSet<char>(" ?&^$#@!()+-,:;<>’'-_*");
removeChars.Add('/');
removeChars.Add('\\');
