Fast Word Counting - C#

I'm implementing a word count feature for my ASP.NET server, and I was wondering what would be the fastest method of doing so, as I'm not sure using a simple
text.AsParallel().Count(Char.IsWhiteSpace);
is the fastest possible method. Since this feature might be used quite a bit on relatively long walls of text, I want it to be as fast as possible, even if it means using unsafe methods.
Edit: Some benchmarking with Rufus L's code as well as my own unsafe method:
public static unsafe int CountWords(string s)
{
int count = 0;
fixed (char* ps = s)
{
int len = s.Length;
char* pc = ps;
while (len-- > 0)
{
if (char.IsWhiteSpace(*pc++))
{
count++;
}
}
}
return count;
}
Split(null): 681979 words in 415867 ticks.
Count(WhiteSpace): 681978 words in 147860 ticks.
AsParallel: 681978 words in 401077 ticks.
Unsafe: 681978 words in 98139 ticks.
I'm still open to any better ideas :)
EDIT2:
Rewrote the function, taking care of multiple white spaces too:
public static unsafe int CountWords(string s)
{
    int count = 0;
    fixed (char* ps = s)
    {
        int len = s.Length;
        bool inWord = false;
        char* pc = ps;
        while (len-- > 0)
        {
            if (char.IsWhiteSpace(*pc++))
            {
                // Whitespace ends the current word, if any.
                inWord = false;
            }
            else if (!inWord)
            {
                // First non-whitespace character after whitespace (or at the start): a new word begins.
                inWord = true;
                count++;
            }
        }
    }
    return count;
}
Split(null): 681979 words in 517055 ticks.
Count(WhiteSpace): 681978 words in 148952 ticks.
AsParallel: 681978 words in 410289 ticks.
Unsafe: 660000 words in 114833 ticks.

According to my tests, this is much faster, by a factor of 4 (but see the update below for different results):
wordCount = text.Split(null).Length;
Here's the test, in case you want to try it out. Note that adding AsParallel() slows the process down on my machine, due to the cost of task switching:
public static void Main()
{
var text = File.ReadAllText("d:\\public\\temp\\temp.txt");
int wordCount;
var sw = new Stopwatch();
sw.Start();
wordCount = text.Split(null).Length;
sw.Stop();
Console.WriteLine("Split(null): {0} words in {1} ticks.", wordCount,
sw.ElapsedTicks);
sw.Restart();
wordCount = text.Count(Char.IsWhiteSpace);
sw.Stop();
Console.WriteLine("Count(WhiteSpace): {0} words in {1} ticks.", wordCount,
sw.ElapsedTicks);
sw.Restart();
wordCount = text.AsParallel().Count(Char.IsWhiteSpace);
sw.Stop();
Console.WriteLine("AsParallel: {0} words in {1} ticks.", wordCount,
sw.ElapsedTicks);
}
Output:
Split(null): 964 words in 629 ticks.
Count(WhiteSpace): 963 words in 2377 ticks.
AsParallel: 963 words in 208983 ticks.
Update
After making the string MUCH longer (OP mentioned hundreds of thousands of words), the results became much more similar, and the Count(WhiteSpace) method became faster than the Split(null) method:
Code change:
var text = File.ReadAllText("d:\\public\\temp\\temp.txt");
var textToSearch = new StringBuilder();
for (int i = 0; i < 500; i++) textToSearch.Append(text);
text = textToSearch.ToString();
Output:
Split(null): 481501 words in 185135 ticks.
Count(WhiteSpace): 481500 words in 101373 ticks.
AsParallel: 481500 words in 336117 ticks.
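A single Stopwatch pass like the ones above also times JIT compilation and cache warm-up, which can distort short runs. A minimal sketch of a slightly more careful harness follows. The class BenchmarkSketch and its Measure helper are names invented for this example, not part of the original test. It runs each candidate once untimed and then keeps the best of several timed runs:

```csharp
using System;
using System.Diagnostics;
using System.Linq;

public static class BenchmarkSketch
{
    // Hypothetical helper: runs 'candidate' once to warm up (JIT, caches),
    // then returns the smallest elapsed tick count over 'runs' timed runs.
    public static long Measure(Func<int> candidate, int runs)
    {
        candidate(); // warm-up run, not timed
        long best = long.MaxValue;
        for (int i = 0; i < runs; i++)
        {
            Stopwatch sw = Stopwatch.StartNew();
            candidate();
            sw.Stop();
            best = Math.Min(best, sw.ElapsedTicks);
        }
        return best;
    }

    // Same idea as text.Split(null).Length; the cast avoids overload ambiguity on newer runtimes.
    public static int CountBySplit(string text)
    {
        return text.Split((char[])null).Length;
    }

    // Same as text.Count(Char.IsWhiteSpace) in the test above.
    public static int CountByWhiteSpace(string text)
    {
        return text.Count(char.IsWhiteSpace);
    }
}
```

It could be called as, for example, long ticks = BenchmarkSketch.Measure(() => BenchmarkSketch.CountByWhiteSpace(text), 10);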

After some benchmarking, the following unsafe code yielded the fastest result in every case with 500+ words:
public static unsafe int CountWords(string s)
{
    int count = 0;
    fixed (char* ps = s)
    {
        int len = s.Length;
        bool inWord = false;
        char* pc = ps;
        while (len-- > 0)
        {
            if (char.IsWhiteSpace(*pc++))
            {
                // Whitespace ends the current word, if any.
                inWord = false;
            }
            else if (!inWord)
            {
                // First non-whitespace character after whitespace (or at the start): a new word begins.
                inWord = true;
                count++;
            }
        }
    }
    return count;
}
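A word counter of this kind can also be written without unsafe code, counting each run of non-whitespace characters as one word. This is only a sketch for comparison. Whether the pointer version is actually faster depends on the runtime and JIT, so both are worth benchmarking. The class name WordCounter is an assumption made for this example:

```csharp
using System;

public static class WordCounter
{
    // Counts maximal runs of non-whitespace characters.
    public static int CountWords(string s)
    {
        if (s == null) throw new ArgumentNullException("s");
        int count = 0;
        bool inWord = false;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsWhiteSpace(s[i]))
            {
                inWord = false;          // whitespace closes the current word
            }
            else if (!inWord)
            {
                inWord = true;           // first character of a new word
                count++;
            }
        }
        return count;
    }
}
```

Leading, trailing, and repeated whitespace do not change the result, so "  a   b  " counts as two words.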

To answer your question: the AsParallel() method is very fast, but there are other options. For example:
Using Regex:
string input = "text text text text";
string pattern = @"\s+";
string[] substrings = Regex.Split(input, pattern); // Split on runs of whitespace
Console.WriteLine("Words: {0}", substrings.Length);
But I reiterate, the AsParallel() method is very fast. You can run a proof of concept to find out which is better: start a Stopwatch at the beginning of the code, stop it at the end, and compare the AsParallel() run time with the Regex run time. That will give you a more exact answer.
Update
Using Parallel.For:
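// Requires: using System.Diagnostics; using System.Text; using System.Threading; using System.Threading.Tasks;
static void Main(string[] args)
{
    Console.WriteLine("Generating words, wait...");
    // Build the text sequentially: appending to a shared string inside Parallel.For is a race condition.
    var sb = new StringBuilder();
    for (int i = 0; i < 100001; i++)
    {
        sb.Append("text text text text text text text text text text ");
    }
    string text = sb.ToString();
    int count = 0;
    var sw = new Stopwatch();
    sw.Start();
    Parallel.For(0, text.Length, i =>
    {
        if (char.IsWhiteSpace(text[i]))
            Interlocked.Increment(ref count); // count++ is not atomic across threads
    });
    sw.Stop();
    Console.WriteLine("Words: {0} in {1} ticks", count, sw.ElapsedTicks);
    Console.ReadLine();
}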
Results:
PS: Note that the Parallel.For used is not managed
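A counter shared between the threads of a Parallel.For needs some form of synchronization, and synchronizing on every character is costly. One common way around this is the Parallel.For overload that keeps a per-thread subtotal and merges it once at the end. A minimal sketch, with the class name ParallelWhiteSpaceCounter invented for this example:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelWhiteSpaceCounter
{
    public static int Count(string text)
    {
        int total = 0;
        Parallel.For(
            0, text.Length,
            () => 0,                                  // each worker starts with a subtotal of zero
            (i, loopState, subtotal) =>
                char.IsWhiteSpace(text[i]) ? subtotal + 1 : subtotal,
            subtotal => Interlocked.Add(ref total, subtotal)); // merge once per worker
        return total;
    }
}
```

Even with this change, the per-character work is so small that a plain sequential loop may still win. Measuring on the real input is the only way to know.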

Related

Removing a character from a character array

Is there a way to remove characters from an existing character array and then save the result to a new character array? Here is the code so far:
string s1 = "move";
string s2 = "remove";
char[] c1 = s1.ToCharArray();
char[] c2 = s2.ToCharArray();
for (int i = 0; i < s2.Length; i++)
{
    for (int p = 0; p < s1.Length; p++)
    {
        if (c2[i] == c1[p])
        {
            // REMOVE LETTER FROM C2
        }
    }
}
// IN THE END I SHOULD JUST HAVE c3 = re (ALL THE MATCHING CHARACTERS M-O-V-E SHOULD BE DELETED)
Would appreciate your help
You can create a third array, c3, where you will add characters from c2 that are not to be removed.
You may also use Replace.
string s3 = s2.Replace(s1,"");
The original O(N^2) approach is wasteful. And I don't see how the other two answers actually perform the work you seem to be trying to accomplish. I hope this example, which has O(N) performance, works better for you:
string s1 = "move";
string s2 = "remove";
HashSet<char> excludeCharacters = new HashSet<char>(s1);
StringBuilder sb = new StringBuilder();
// Copy every character from the original string, except those to be excluded
foreach (char ch in s2)
{
if (!excludeCharacters.Contains(ch))
{
sb.Append(ch);
}
}
return sb.ToString();
Granted, for short strings the performance isn't likely to matter. But IMHO this is also easier to comprehend than the alternatives.
EDIT:
It is still not entirely clear to me what the OP is trying to do here. The most obvious task would be to remove whole words, but neither of his descriptions seem to say that's what he really wants. So, on the assumption that the above is not addressing his needs, but that he also does not want to remove whole words, here are a couple of other options...
1) O(N), the best approach for strings of non-trivial length, but is somewhat more complicated:
string s1 = "move";
string s2 = "remove";
Dictionary<char, int> excludeCharacters = new Dictionary<char, int>();
foreach (char ch in s1)
{
int count;
excludeCharacters.TryGetValue(ch, out count);
excludeCharacters[ch] = ++count;
}
StringBuilder sb = new StringBuilder();
foreach (char ch in s2)
{
int count;
if (!excludeCharacters.TryGetValue(ch, out count) || count == 0)
{
sb.Append(ch);
}
else
{
excludeCharacters[ch] = --count;
}
}
return sb.ToString();
2) An O(N^2) implementation which at least minimizes other unnecessary inefficiencies and which would suffice if all the inputs are relatively short:
StringBuilder sb = new StringBuilder(s2);
foreach (char ch in s1)
{
for (int i = 0; i < sb.Length; i++)
{
if (sb[i] == ch)
{
sb.Remove(i, 1);
break;
}
}
}
return sb.ToString();
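The set-based and count-based approaches give different answers on the question's own input, which may help in deciding which one is wanted. The sketch below wraps both in methods so they can be compared side by side. The class and method names are made up for this example:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CharRemoval
{
    // Removes every occurrence of any character that appears in 'toRemove'.
    public static string RemoveAll(string source, string toRemove)
    {
        var exclude = new HashSet<char>(toRemove);
        var sb = new StringBuilder(source.Length);
        foreach (char ch in source)
        {
            if (!exclude.Contains(ch)) sb.Append(ch);
        }
        return sb.ToString();
    }

    // Removes one occurrence from 'source' for each character in 'toRemove'.
    public static string RemoveOncePerCharacter(string source, string toRemove)
    {
        var remaining = new Dictionary<char, int>();
        foreach (char ch in toRemove)
        {
            int n;
            remaining.TryGetValue(ch, out n);
            remaining[ch] = n + 1;
        }
        var sb = new StringBuilder(source.Length);
        foreach (char ch in source)
        {
            int n;
            if (remaining.TryGetValue(ch, out n) && n > 0)
                remaining[ch] = n - 1;   // consume one pending removal
            else
                sb.Append(ch);
        }
        return sb.ToString();
    }
}
```

With source "remove" and toRemove "move", RemoveAll returns "r" (both 'e' characters are dropped), while RemoveOncePerCharacter returns "re", the result the question asks for.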
This isn't particularly efficient, but it will probably be fast enough for short strings:
string s1 = "move";
string s2 = "remove";
foreach (char charToRemove in s1)
{
int index = s2.IndexOf(charToRemove);
if (index >= 0)
s2 = s2.Remove(index, 1);
}
// Result is now in s2.
Console.WriteLine(s2);
This avoids converting to a char array.
However, just to emphasize: This will be VERY slow for large strings.
[EDIT]
I have done some testing, and it turns out that this code is in fact pretty fast.
Here I'm comparing the code with the optimized code from another answer. However, note that we're not comparing entirely fairly, since the code here correctly implements the OP's requirement while the other code doesn't. However, it does demonstrate that the use of HashSet doesn't help as much as one might think.
I tested this code on a release build, not run inside a debugger (if you run it in a debugger, it does a debug build not a release build which will give incorrect timings).
This test uses target strings of length 1024 and chars to remove == "SKFPBPENAALDKOWJKFPOSKLW".
My results, where test1() is the incorrect but supposedly optimized solution from another answer, and test2() is my unoptimized but correct solution:
test1() took 00:00:00.2891665
test2() took 00:00:00.1004743
test1() took 00:00:00.2720192
test2() took 00:00:00.0993898
test1() took 00:00:00.2753971
test2() took 00:00:00.0997268
test1() took 00:00:00.2754325
test2() took 00:00:00.1026486
test1() took 00:00:00.2785548
test2() took 00:00:00.1039417
test1() took 00:00:00.2818029
test2() took 00:00:00.1029695
test1() took 00:00:00.2727377
test2() took 00:00:00.0995654
test1() took 00:00:00.2711982
test2() took 00:00:00.1009849
As you can see, test2() consistently outperforms test1(). This remains true even if the strings are increased to length 8192.
The test code:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;
namespace Demo
{
public static class Program
{
private static void Main(string[] args)
{
var sw = new Stopwatch();
string text = randomString(8192, 27367);
string charsToRemove = "SKFPBPENAALDKOWJKFPOSKLW";
int dummyLength = 0;
int iters = 10000;
for (int trial = 0; trial < 8; ++trial)
{
sw.Restart();
for (int i = 0; i < iters; ++i)
dummyLength += test1(text, charsToRemove).Length;
Console.WriteLine("test1() took " + sw.Elapsed);
sw.Restart();
for (int i = 0; i < iters; ++i)
dummyLength += test2(text, charsToRemove).Length;
Console.WriteLine("test2() took " + sw.Elapsed);
Console.WriteLine();
}
}
private static string randomString(int length, int seed)
{
var rng = new Random(seed);
var sb = new StringBuilder(length);
for (int i = 0; i < length; ++i)
sb.Append((char) rng.Next(65, 65 + 26*2));
return sb.ToString();
}
private static string test1(string text, string charsToRemove)
{
HashSet<char> excludeCharacters = new HashSet<char>(charsToRemove);
StringBuilder sb = new StringBuilder();
foreach (char ch in text)
{
if (!excludeCharacters.Contains(ch))
{
sb.Append(ch);
}
}
return sb.ToString();
}
private static string test2(string text, string charsToRemove)
{
foreach (char charToRemove in charsToRemove)
{
int index = text.IndexOf(charToRemove);
if (index >= 0)
text = text.Remove(index, 1);
}
return text;
}
}
}
[EDIT 2]
Here's a much more optimized solution:
public static string RemoveChars(string text, string charsToRemove)
{
char[] result = new char[text.Length];
char[] targets = charsToRemove.ToCharArray();
int n = 0;
int m = targets.Length;
foreach (char ch in text)
{
if (m == 0)
{
result[n++] = ch;
}
else
{
int index = findFirst(targets, ch, m);
if (index < 0)
{
result[n++] = ch;
}
else
{
if (m > 1)
{
--m;
targets[index] = targets[m];
}
else
{
m = 0;
}
}
}
}
return new string(result, 0, n);
}
private static int findFirst(char[] chars, char target, int n)
{
for (int i = 0; i < n; ++i)
if (chars[i] == target)
return i;
return -1;
}
Plugging that into my test program above shows that it runs around 3 times faster than test2().

C#: how to check whether a long string contains any letters or not?

Okay, so I have a long string of digits, but I need to make sure it hasn't got any letters in it. What's the easiest way to do this?
You could use char.IsDigit:
var containsOnlyDigits = "007".All(char.IsDigit); // true
Scan the string, test the char... s.All(c => Char.IsDigit(c))
Enumerable.All will exit as soon as it finds a non-digit character. IsDigit is very fast at checking the char. The cost is O(N) (as good as it can get); for sure, it is better than trying to parse the string (which will fail if the string is really long) or using a regex...
If you try this solution and find it too slow for you, you can always go back to good old loops to scan the string:
foreach (char c in s) {
if (!Char.IsDigit(c))
return false;
}
return true;
or even better:
for (int i = 0; i < s.Length; i++){
if (!Char.IsDigit(s[i]))
return false;
}
return true;
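One caveat with these scans: Char.IsDigit returns true for any Unicode decimal digit, not only '0' to '9' (for example the Arabic-Indic digit U+0663). If only ASCII digits should pass, the loop can use an explicit range check instead. A minimal sketch, with the name IsAsciiDigitsOnly made up for this example:

```csharp
using System;

public static class DigitCheck
{
    // Returns true only if every character is one of '0'..'9'.
    // An empty string returns true, matching the behaviour of Enumerable.All.
    public static bool IsAsciiDigitsOnly(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            char c = s[i];
            if (c < '0' || c > '9')
                return false;
        }
        return true;
    }
}
```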
EDIT: Benchmarks, at last!
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;
namespace FindTest
{
class Program
{
const int Iterations = 1000;
static string TestData;
static Regex regex;
static bool ValidResult = false;
static void Test(Func<string, bool> function)
{
Console.Write("{0}... ", function.Method.Name);
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < Iterations; i++)
{
bool result = function(TestData);
if (result != ValidResult)
{
throw new Exception("Bad result: " + result);
}
}
sw.Stop();
Console.WriteLine(" {0}ms", sw.ElapsedMilliseconds);
GC.Collect();
}
static void InitializeTestDataEnd(int length)
{
TestData = new string(Enumerable.Repeat('1', length - 1).ToArray()) + "A";
}
static void InitializeTestDataStart(int length)
{
TestData = "A" + new string(Enumerable.Repeat('1', length - 1).ToArray());
}
static void InitializeTestDataMid(int length)
{
TestData = new string(Enumerable.Repeat('1', length / 2).ToArray()) + "A" + new string(Enumerable.Repeat('1', length / 2 - 1).ToArray());
}
static void InitializeTestDataPositive(int length)
{
TestData = new string(Enumerable.Repeat('1', length).ToArray());
}
static bool LinqScan(string s)
{
return s.All(Char.IsDigit);
}
static bool ForeachScan(string s)
{
foreach (char c in s)
{
if (!Char.IsDigit(c))
return false;
}
return true;
}
static bool ForScan(string s)
{
for (int i = 0; i < s.Length; i++)
{
if (!Char.IsDigit(s[i]))
return false;
}
return true;
}
static bool Regexp(string s)
{
// String contains numbers
return regex.IsMatch(s);
// String contains letters
//return Regex.IsMatch(s, "\\w", RegexOptions.Compiled);
}
static void Main(string[] args)
{
regex = new Regex(@"^\d+$", RegexOptions.Compiled);
Console.WriteLine("Positive (all digitis)");
InitializeTestDataPositive(100000);
ValidResult = true;
Test(LinqScan);
Test(ForeachScan);
Test(ForScan);
Test(Regexp);
Console.WriteLine("Negative (char at beginning)");
InitializeTestDataStart(100000);
ValidResult = false;
Test(LinqScan);
Test(ForeachScan);
Test(ForScan);
Test(Regexp);
Console.WriteLine("Negative (char at end)");
InitializeTestDataEnd(100000);
ValidResult = false;
Test(LinqScan);
Test(ForeachScan);
Test(ForScan);
Test(Regexp);
Console.WriteLine("Negative (char in middle)");
InitializeTestDataMid(100000);
ValidResult = false;
Test(LinqScan);
Test(ForeachScan);
Test(ForScan);
Test(Regexp);
Console.WriteLine("Done");
}
}
}
I tested one positive case and three negative cases, to 1) check which regex is the correct one, and 2) look for confirmation of a suspicion I had...
My opinion was that Regexp.IsMatch had to scan the string as well, and so it seems to be:
Times are consistent with scans, only 3x worse:
Positive (all digitis)
LinqScan... 952ms
ForeachScan... 1043ms
ForScan... 869ms
Regexp... 3074ms
Negative (char at beginning)
LinqScan... 0ms
ForeachScan... 0ms
ForScan... 0ms
Regexp... 0ms
Negative (char at end)
LinqScan... 921ms
ForeachScan... 958ms
ForScan... 867ms
Regexp... 3986ms
Negative (char in middle)
LinqScan... 455ms
ForeachScan... 476ms
ForScan... 430ms
Regexp... 1982ms
Credits: I borrowed the Test function from Jon Skeet
Conclusions: s.All(Char.IsDigit) is efficient, and really easy (which was the original question, after all). Personally, I find it easier than regular expressions (I had to look on SO which was the correct one, as I am not familiar with C# regexp syntax - which is standard, but I didn't know - and the proposed solution was wrong).
So.. measure, and don't believe in myths like "LINQ is slow" or "RegExp are slow".
After all, they are both OK for the task (it really depends on what you need it for), pick the one you prefer.
Try using TryParse
long number;
bool longStringisInt = Int64.TryParse(longString, out number);
If the string (i.e. longString) cannot be converted to an Int64 (e.g. it has letters in it, or it has too many digits to fit in 64 bits), then the bool is false; otherwise it is true.
EDIT: Changed to Int64 to ensure wider coverage
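A caveat worth checking before relying on TryParse for this: a string that consists only of digits still fails to parse once the value exceeds Int64.MaxValue. With the default settings, TryParse also accepts surrounding whitespace and a leading sign. The sketch below contrasts it with a plain digit scan. The class and method names are invented for this example:

```csharp
using System;
using System.Linq;

public static class TryParseLimit
{
    // True if the string parses as a 64-bit integer with the default number style.
    public static bool ParsesAsInt64(string s)
    {
        long number;
        return Int64.TryParse(s, out number);
    }

    // True if every character is a digit, regardless of length.
    public static bool IsAllDigits(string s)
    {
        return s.All(char.IsDigit);
    }
}
```

For "9223372036854775808" (Int64.MaxValue plus one), ParsesAsInt64 returns false while IsAllDigits returns true.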
You can use IndexOfAny
bool containsDigits = mystring.IndexOfAny("0123456789".ToCharArray()) != -1;
In Linq you'd have to do:
bool containsDigits = mystring.Any(char.IsDigit);
Edit:
I timed this and it turns out the Linq solution is slower.
For string length 1,000,000 the execution time is ~13ms for the linq solution and 1ms for IndexOfAny.
For string length 10,000,000 the execution time for Linq is still ~122ms, whereas IndexOfAny is ~18ms.
Use regular expressions:
using System.Text.RegularExpressions;
string myString = "some long characters...";
if(Regex.IsMatch(myString, @"^\d+$", RegexOptions.Compiled))
{
// String contains only digits
}
if(Regex.IsMatch(myString, @"^\w+$", RegexOptions.Compiled))
{
// String contains only word characters (letters, digits, and underscores)
}

Best approach of word censoring - C# 4.0

For my custom-made chat screen I am using the code below to check for censored words, but I wonder whether its performance can be improved. Thank you.
if (srMessageTemp.IndexOf(" censored1 ") != -1)
return;
if (srMessageTemp.IndexOf(" censored2 ") != -1)
return;
if (srMessageTemp.IndexOf(" censored3 ") != -1)
return;
This is C# 4.0. The actual list is a lot longer, but I haven't put it all here.
I would use LINQ or regular expression for this:
LINQ: How to: Query for Sentences that Contain a Specified Set of Words (LINQ)
Regular Expression: Highlight a list of words using a regular expression in c#
You can simplify it. Here listOfCensoredWords contains all the censored words:
if (listOfCensoredWords.Any(item => srMessageTemp.Contains(item)))
return;
If you want to make it really fast, you can use Aho-Corasick automaton. This is how antivirus software checks thousands of viruses at once. But I don't know where you can get the implementation done, so it will require much more work from you compared to using just simple slow methods like regular expressions.
See the theory here: http://en.wikipedia.org/wiki/Aho-Corasick
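To give an idea of the amount of work involved, here is a minimal sketch of an Aho-Corasick matcher that only answers "does the text contain any of the patterns?". It is an illustration under assumptions, not a tuned implementation: the class name CensorAutomaton is invented, matching is ordinal and case-sensitive, and it reports substring hits with no notion of word boundaries.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal Aho-Corasick matcher; all names are invented for this sketch.
public sealed class CensorAutomaton
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;     // node for the longest proper suffix that is also a trie path
        public bool IsMatch;  // a pattern ends at this node or at one of its suffixes
    }

    private readonly Node root = new Node();

    public CensorAutomaton(IEnumerable<string> patterns)
    {
        // Step 1: insert every pattern into a trie.
        foreach (string pattern in patterns)
        {
            if (string.IsNullOrEmpty(pattern)) continue;
            Node node = root;
            foreach (char c in pattern)
            {
                Node child;
                if (!node.Next.TryGetValue(c, out child))
                {
                    child = new Node();
                    node.Next[c] = child;
                }
                node = child;
            }
            node.IsMatch = true;
        }

        // Step 2: compute failure links breadth-first, so shallower nodes are done first.
        var queue = new Queue<Node>();
        foreach (Node child in root.Next.Values)
        {
            child.Fail = root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            Node node = queue.Dequeue();
            foreach (KeyValuePair<char, Node> edge in node.Next)
            {
                Node fallback = node.Fail;
                while (fallback != root && !fallback.Next.ContainsKey(edge.Key))
                    fallback = fallback.Fail;
                Node target;
                edge.Value.Fail = fallback.Next.TryGetValue(edge.Key, out target) ? target : root;
                edge.Value.IsMatch |= edge.Value.Fail.IsMatch;
                queue.Enqueue(edge.Value);
            }
        }
    }

    // Scans the text once; returns true as soon as any pattern occurs.
    public bool ContainsAny(string text)
    {
        Node node = root;
        foreach (char c in text)
        {
            while (node != root && !node.Next.ContainsKey(c))
                node = node.Fail;
            Node next;
            if (node.Next.TryGetValue(c, out next))
                node = next;
            if (node.IsMatch) return true;
        }
        return false;
    }
}
```

The automaton is built once from the censored list and then reused for every message, e.g. if (automaton.ContainsAny(srMessageTemp)) return;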
First, I hope you aren't really "tokenizing" the words as written. You know, just because someone doesn't put a space before a bad word, it doesn't make the word less bad :-) Example ,badword,
I'll say that I would use a Regex here :-) I'm not sure if a Regex or a hand-written parser would be faster, but at least a Regex would be a good starting point. As others wrote, you begin by splitting the text into words and then checking each one against a HashSet<string>.
I'm adding a second version of the code, based on ArraySegment<char>. I discuss it below.
class Program
{
class ArraySegmentComparer : IEqualityComparer<ArraySegment<char>>
{
public bool Equals(ArraySegment<char> x, ArraySegment<char> y)
{
if (x.Count != y.Count)
{
return false;
}
int end = x.Offset + x.Count;
for (int i = x.Offset, j = y.Offset; i < end; i++, j++)
{
if (!x.Array[i].ToString().Equals(y.Array[j].ToString(), StringComparison.InvariantCultureIgnoreCase))
{
return false;
}
}
return true;
}
public int GetHashCode(ArraySegment<char> obj)
{
unchecked
{
int hash = 17;
int end = obj.Offset + obj.Count;
int i;
for (i = obj.Offset; i < end; i++)
{
hash *= 23;
hash += Char.ToUpperInvariant(obj.Array[i]);
}
return hash;
}
}
}
static void Main()
{
var rx = new Regex(@"\b\w+\b", RegexOptions.Compiled);
var sampleText = @"For my custom made chat screen i am using the code below for checking censored words. But i wonder can this code performance improved. Thank you.
if (srMessageTemp.IndexOf("" censored1 "") != -1)
return;
if (srMessageTemp.IndexOf("" censored2 "") != -1)
return;
if (srMessageTemp.IndexOf("" censored3 "") != -1)
return;
C# 4.0 . actually list is a lot more long but i don't put here as it goes away.
And now some accented letters àèéìòù and now some letters with unicode combinable diacritics àèéìòù";
//sampleText += sampleText;
//sampleText += sampleText;
//sampleText += sampleText;
//sampleText += sampleText;
//sampleText += sampleText;
//sampleText += sampleText;
//sampleText += sampleText;
HashSet<string> prohibitedWords = new HashSet<string>(StringComparer.InvariantCultureIgnoreCase) { "For", "custom", "combinable", "away" };
Stopwatch sw1 = Stopwatch.StartNew();
var words = rx.Matches(sampleText);
foreach (Match word in words)
{
string str = word.Value;
if (prohibitedWords.Contains(str))
{
Console.Write(str);
Console.Write(" ");
}
else
{
//Console.WriteLine(word);
}
}
sw1.Stop();
Console.WriteLine();
Console.WriteLine();
HashSet<ArraySegment<char>> prohibitedWords2 = new HashSet<ArraySegment<char>>(
prohibitedWords.Select(p => new ArraySegment<char>(p.ToCharArray())),
new ArraySegmentComparer());
var sampleText2 = sampleText.ToCharArray();
Stopwatch sw2 = Stopwatch.StartNew();
int startWord = -1;
for (int i = 0; i < sampleText2.Length; i++)
{
if (Char.IsLetter(sampleText2[i]) || Char.IsDigit(sampleText2[i]))
{
if (startWord == -1)
{
startWord = i;
}
}
else
{
if (startWord != -1)
{
int length = i - startWord;
if (length != 0)
{
var wordSegment = new ArraySegment<char>(sampleText2, startWord, length);
if (prohibitedWords2.Contains(wordSegment))
{
Console.Write(sampleText2, startWord, length);
Console.Write(" ");
}
else
{
//Console.WriteLine(sampleText2, startWord, length);
}
}
startWord = -1;
}
}
}
if (startWord != -1)
{
int length = sampleText2.Length - startWord;
if (length != 0)
{
var wordSegment = new ArraySegment<char>(sampleText2, startWord, length);
if (prohibitedWords2.Contains(wordSegment))
{
Console.Write(sampleText2, startWord, length);
Console.Write(" ");
}
else
{
//Console.WriteLine(sampleText2, startWord, length);
}
}
}
sw2.Stop();
Console.WriteLine();
Console.WriteLine();
Console.WriteLine(sw1.ElapsedTicks);
Console.WriteLine(sw2.ElapsedTicks);
}
}
I'll note that you could go faster by doing the parsing "in" the original string. What does this mean? If you subdivide the "document" into words and put each word in a string, you are clearly creating n strings, one for each word of your document. But what if you skipped this step and operated directly on the document, simply keeping the current index and the length of the current word? Then it would be faster! Clearly, you would then need to create a special comparer for the HashSet<>.
But wait! C# has something similar... It's called ArraySegment. So your document would be a char[] instead of a string and each word would be an ArraySegment<char>. Clearly this is much more complex! You can't simply use regexes; you have to build a parser "by hand" (but I think converting the \b\w+\b expression would be quite easy). And creating a comparer for the HashSet<> would be a little complex (hint: you would use HashSet<ArraySegment<char>>, and the words to be censored would be ArraySegments "pointing" to the char[] of a word, with a count equal to the char[].Length, like var word = new ArraySegment<char>("tobecensored".ToCharArray());).
After a simple benchmark, I can see that an unoptimized version of the program using ArraySegment<char> is as fast as the Regex version for shorter texts. This is probably because, if a word is 4-6 chars long, copying it around is about as "slow" as copying around an ArraySegment<char> (an ArraySegment<char> is 12 bytes, a word of 6 characters is 12 bytes. On top of both of these we have to add a little overhead... But in the end the numbers are comparable). But for longer texts (try uncommenting the //sampleText += sampleText; lines) it becomes a little faster (10%) in Release -> Start Without Debugging (CTRL-F5).
I'll note that comparing strings character by character is wrong. You should always use the methods given to you by the string class (or by the OS). They know how to handle "strange" cases much better than you do (and in Unicode there isn't any "normal" case :-) ).
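As one concrete example of such a "strange" case: an accented letter can be stored either as a single precomposed character or as a base letter followed by a combining mark, and a character-by-character comparison treats the two as different. A small sketch using only string class methods (the helper name EqualsNormalized is made up for this example):

```csharp
using System;
using System.Text;

public static class UnicodeCompare
{
    // Compares two strings after canonical normalization (form C),
    // ignoring case with ordinal rules.
    public static bool EqualsNormalized(string a, string b)
    {
        return string.Equals(
            a.Normalize(NormalizationForm.FormC),
            b.Normalize(NormalizationForm.FormC),
            StringComparison.OrdinalIgnoreCase);
    }
}
```

Here "\u00E0" (precomposed à) and "a\u0300" (a plus combining grave accent) differ under ==, but EqualsNormalized treats them as equal.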
You can use LINQ for this, but it's not required if you use a list to hold your censored values. The solution below uses the built-in list functions and allows you to do your searches case-insensitively.
private static List<string> _censoredWords = new List<string>()
{
"badwordone1",
"badwordone2",
"badwordone3",
"badwordone4",
};
static void Main(string[] args)
{
string badword1 = "BadWordOne2";
bool censored = ShouldCensorWord(badword1);
}
private static bool ShouldCensorWord(string word)
{
return _censoredWords.Contains(word.ToLower());
}
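A variation on the same idea, as a sketch: a HashSet<string> built with a case-insensitive comparer avoids both the linear scan of List.Contains and the temporary string created by ToLower(). The class name CensorList is an assumption. The word list is copied from the answer above:

```csharp
using System;
using System.Collections.Generic;

public static class CensorList
{
    // The comparer makes lookups case-insensitive, so no ToLower() call is needed.
    private static readonly HashSet<string> CensoredWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "badwordone1",
            "badwordone2",
            "badwordone3",
            "badwordone4",
        };

    public static bool ShouldCensorWord(string word)
    {
        return CensoredWords.Contains(word);
    }
}
```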
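What do you think about this:
string[] censoredWords = new[] { " censored1 ", " censored2 ", " censored3 " };
// Requires using System.Linq; checks whether the message contains any of the censored words.
if (censoredWords.Any(word => srMessageTemp.Contains(word)))
return;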

Efficient way to remove ALL whitespace from String?

I'm calling a REST API and am receiving an XML response back. It returns a list of workspace names, and I'm writing a quick IsExistingWorkspace() method. Since all workspaces consist of contiguous characters with no whitespace, I'm assuming the easiest way to find out if a particular workspace is in the list is to remove all whitespace (including newlines) and do this (XML is the string received from the web request):
XML.Contains("<name>" + workspaceName + "</name>");
I know it's case-sensitive, and I'm relying on that. I just need a way to remove all whitespace in a string efficiently. I know RegEx and LINQ can do it, but I'm open to other ideas. I am mostly just concerned about speed.
This is the fastest way I know of, even though you said you didn't want to use regular expressions:
Regex.Replace(XML, @"\s+", "");
Crediting @hypehuman in the comments: if you plan to do this more than once, create and store a Regex instance. This will save the overhead of constructing it every time, which is more expensive than you might think.
private static readonly Regex sWhitespace = new Regex(@"\s+");
public static string ReplaceWhitespace(string input, string replacement)
{
return sWhitespace.Replace(input, replacement);
}
I have an alternative way without regexp, and it seems to perform pretty well. It is a continuation of Brandon Moretz's answer:
public static string RemoveWhitespace(this string input)
{
return new string(input.ToCharArray()
.Where(c => !Char.IsWhiteSpace(c))
.ToArray());
}
I tested it in a simple unit test:
[Test]
[TestCase("123 123 1adc \n 222", "1231231adc222")]
public void RemoveWhiteSpace1(string input, string expected)
{
string s = null;
for (int i = 0; i < 1000000; i++)
{
s = input.RemoveWhitespace();
}
Assert.AreEqual(expected, s);
}
[Test]
[TestCase("123 123 1adc \n 222", "1231231adc222")]
public void RemoveWhiteSpace2(string input, string expected)
{
string s = null;
for (int i = 0; i < 1000000; i++)
{
s = Regex.Replace(input, @"\s+", "");
}
Assert.AreEqual(expected, s);
}
For 1,000,000 attempts the first option (without regexp) runs in less than a second (700 ms on my machine), and the second takes 3.5 seconds.
Try the replace method of the string in C#.
XML.Replace(" ", string.Empty);
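Note that Replace(" ", string.Empty) removes only the space character. Tabs and newlines, which the question explicitly wants gone, are left in place. A short sketch contrasting the two behaviours follows. The class and method names are made up for this example:

```csharp
using System;
using System.Linq;

public static class ReplaceDemo
{
    // Removes U+0020 only; tabs, newlines, and other whitespace survive.
    public static string RemoveSpacesOnly(string input)
    {
        return input.Replace(" ", string.Empty);
    }

    // Removes every character for which char.IsWhiteSpace is true.
    public static string RemoveAllWhitespace(string input)
    {
        return new string(input.Where(c => !char.IsWhiteSpace(c)).ToArray());
    }
}
```

For the input "a b\tc\nd", RemoveSpacesOnly returns "ab\tc\nd" and RemoveAllWhitespace returns "abcd".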
My solution is to use Split and Join and it is surprisingly fast, in fact the fastest of the top answers here.
str = string.Join("", str.Split(default(string[]), StringSplitOptions.RemoveEmptyEntries));
Timings for 10,000 loops on a simple string with whitespace, including new lines and tabs:
split/join = 60 milliseconds
linq chararray = 94 milliseconds
regex = 437 milliseconds
Improve this by wrapping it up in method to give it meaning, and also make it an extension method while we are at it ...
public static string RemoveWhitespace(this string str) {
return string.Join("", str.Split(default(string[]), StringSplitOptions.RemoveEmptyEntries));
}
Building on Henk's answer, I have created some test methods with his answer plus some additional, more optimized methods. I found that the results differ based on the size of the input string, so I have tested with two inputs. The linked source of the fastest method has an even faster way, but since it is characterized as unsafe I have left it out.
Long input string results:
InPlaceCharArray: 2021 ms (Sunsetquest's answer) - (Original source)
String split then join: 4277 ms (Kernowcode's answer)
String reader: 6082 ms
LINQ using native char.IsWhitespace: 7357 ms
LINQ: 7746 ms (Henk's answer)
ForLoop: 32320 ms
RegexCompiled: 37157 ms
Regex: 42940 ms
Short input string results:
InPlaceCharArray: 108 ms (Sunsetquest's answer) - (Original source)
String split then join: 294 ms (Kernowcode's answer)
String reader: 327 ms
ForLoop: 343 ms
LINQ using native char.IsWhitespace: 624 ms
LINQ: 645 ms (Henk's answer)
RegexCompiled: 1671 ms
Regex: 2599 ms
Code:
public class RemoveWhitespace
{
public static string RemoveStringReader(string input)
{
var s = new StringBuilder(input.Length); // (input.Length);
using (var reader = new StringReader(input))
{
int i = 0;
char c;
for (; i < input.Length; i++)
{
c = (char)reader.Read();
if (!char.IsWhiteSpace(c))
{
s.Append(c);
}
}
}
return s.ToString();
}
public static string RemoveLinqNativeCharIsWhitespace(string input)
{
return new string(input.ToCharArray()
.Where(c => !char.IsWhiteSpace(c))
.ToArray());
}
public static string RemoveLinq(string input)
{
return new string(input.ToCharArray()
.Where(c => !Char.IsWhiteSpace(c))
.ToArray());
}
public static string RemoveRegex(string input)
{
return Regex.Replace(input, @"\s+", "");
}
private static Regex compiled = new Regex(@"\s+", RegexOptions.Compiled);
public static string RemoveRegexCompiled(string input)
{
return compiled.Replace(input, "");
}
public static string RemoveForLoop(string input)
{
for (int i = input.Length - 1; i >= 0; i--)
{
if (char.IsWhiteSpace(input[i]))
{
input = input.Remove(i, 1);
}
}
return input;
}
public static string StringSplitThenJoin(string str) // not an extension method: the containing class is not static
{
return string.Join("", str.Split(default(string[]), StringSplitOptions.RemoveEmptyEntries));
}
public static string RemoveInPlaceCharArray(string input)
{
var len = input.Length;
var src = input.ToCharArray();
int dstIdx = 0;
for (int i = 0; i < len; i++)
{
var ch = src[i];
switch (ch)
{
case '\u0020':
case '\u00A0':
case '\u1680':
case '\u2000':
case '\u2001':
case '\u2002':
case '\u2003':
case '\u2004':
case '\u2005':
case '\u2006':
case '\u2007':
case '\u2008':
case '\u2009':
case '\u200A':
case '\u202F':
case '\u205F':
case '\u3000':
case '\u2028':
case '\u2029':
case '\u0009':
case '\u000A':
case '\u000B':
case '\u000C':
case '\u000D':
case '\u0085':
continue;
default:
src[dstIdx++] = ch;
break;
}
}
return new string(src, 0, dstIdx);
}
}
Tests:
[TestFixture]
public class Test
{
// Short input
//private const string input = "123 123 \t 1adc \n 222";
//private const string expected = "1231231adc222";
// Long input
private const string input = "123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222";
private const string expected = "1231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc222";
private const int iterations = 1000000;
[Test]
public void RemoveInPlaceCharArray()
{
string s = null;
var stopwatch = Stopwatch.StartNew();
for (int i = 0; i < iterations; i++)
{
s = RemoveWhitespace.RemoveInPlaceCharArray(input);
}
stopwatch.Stop();
Console.WriteLine("InPlaceCharArray: " + stopwatch.ElapsedMilliseconds + " ms");
Assert.AreEqual(expected, s);
}
[Test]
public void RemoveStringReader()
{
string s = null;
var stopwatch = Stopwatch.StartNew();
for (int i = 0; i < iterations; i++)
{
s = RemoveWhitespace.RemoveStringReader(input);
}
stopwatch.Stop();
Console.WriteLine("String reader: " + stopwatch.ElapsedMilliseconds + " ms");
Assert.AreEqual(expected, s);
}
[Test]
public void RemoveLinqNativeCharIsWhitespace()
{
string s = null;
var stopwatch = Stopwatch.StartNew();
for (int i = 0; i < iterations; i++)
{
s = RemoveWhitespace.RemoveLinqNativeCharIsWhitespace(input);
}
stopwatch.Stop();
Console.WriteLine("LINQ using native char.IsWhitespace: " + stopwatch.ElapsedMilliseconds + " ms");
Assert.AreEqual(expected, s);
}
[Test]
public void RemoveLinq()
{
string s = null;
var stopwatch = Stopwatch.StartNew();
for (int i = 0; i < iterations; i++)
{
s = RemoveWhitespace.RemoveLinq(input);
}
stopwatch.Stop();
Console.WriteLine("LINQ: " + stopwatch.ElapsedMilliseconds + " ms");
Assert.AreEqual(expected, s);
}
[Test]
public void RemoveRegex()
{
string s = null;
var stopwatch = Stopwatch.StartNew();
for (int i = 0; i < iterations; i++)
{
s = RemoveWhitespace.RemoveRegex(input);
}
stopwatch.Stop();
Console.WriteLine("Regex: " + stopwatch.ElapsedMilliseconds + " ms");
Assert.AreEqual(expected, s);
}
[Test]
public void RemoveRegexCompiled()
{
string s = null;
var stopwatch = Stopwatch.StartNew();
for (int i = 0; i < iterations; i++)
{
s = RemoveWhitespace.RemoveRegexCompiled(input);
}
stopwatch.Stop();
Console.WriteLine("RegexCompiled: " + stopwatch.ElapsedMilliseconds + " ms");
Assert.AreEqual(expected, s);
}
[Test]
public void RemoveForLoop()
{
string s = null;
var stopwatch = Stopwatch.StartNew();
for (int i = 0; i < iterations; i++)
{
s = RemoveWhitespace.RemoveForLoop(input);
}
stopwatch.Stop();
Console.WriteLine("ForLoop: " + stopwatch.ElapsedMilliseconds + " ms");
Assert.AreEqual(expected, s);
}
[Test]
public void StringSplitThenJoin()
{
string s = null;
var stopwatch = Stopwatch.StartNew();
for (int i = 0; i < iterations; i++)
{
s = RemoveWhitespace.StringSplitThenJoin(input);
}
stopwatch.Stop();
Console.WriteLine("StringSplitThenJoin: " + stopwatch.ElapsedMilliseconds + " ms");
Assert.AreEqual(expected, s);
}
}
Edit: Tested a nice one liner from Kernowcode.
Just an alternative because it looks quite nice :) - NOTE: Henk's answer is the quickest of these.
input.ToCharArray()
.Where(c => !Char.IsWhiteSpace(c))
.Select(c => c.ToString())
.Aggregate((a, b) => a + b);
Testing 1,000,000 loops on "This is a simple Test"
This method = 1.74 seconds
Regex = 2.58 seconds
new String (Henk's) = 0.82 seconds
I found a nice write-up on this on CodeProject by Felipe Machado (with help by Richard Robertson).
He tested ten different methods. This one is the fastest safe version...
public static string TrimAllWithInplaceCharArray(string str) {
var len = str.Length;
var src = str.ToCharArray();
int dstIdx = 0;
for (int i = 0; i < len; i++) {
var ch = src[i];
switch (ch) {
case '\u0020': case '\u00A0': case '\u1680': case '\u2000': case '\u2001':
case '\u2002': case '\u2003': case '\u2004': case '\u2005': case '\u2006':
case '\u2007': case '\u2008': case '\u2009': case '\u200A': case '\u202F':
case '\u205F': case '\u3000': case '\u2028': case '\u2029': case '\u0009':
case '\u000A': case '\u000B': case '\u000C': case '\u000D': case '\u0085':
continue;
default:
src[dstIdx++] = ch;
break;
}
}
return new string(src, 0, dstIdx);
}
And the fastest unsafe version... (some improvements by Sunsetquest 5/26/2021)
public static unsafe void RemoveAllWhitespace(ref string str)
{
fixed (char* pfixed = str)
{
char* dst = pfixed;
for (char* p = pfixed; *p != 0; p++)
{
switch (*p)
{
case '\u0020': case '\u00A0': case '\u1680': case '\u2000': case '\u2001':
case '\u2002': case '\u2003': case '\u2004': case '\u2005': case '\u2006':
case '\u2007': case '\u2008': case '\u2009': case '\u200A': case '\u202F':
case '\u205F': case '\u3000': case '\u2028': case '\u2029': case '\u0009':
case '\u000A': case '\u000B': case '\u000C': case '\u000D': case '\u0085':
continue;
default:
*dst++ = *p;
break;
}
}
uint* pi = (uint*)pfixed;
ulong len = ((ulong)dst - (ulong)pfixed) >> 1;
pi[-1] = (uint)len;
pfixed[len] = '\0';
}
}
There are also some nice independent benchmarks on Stack Overflow by Stian Standahl that also show how Felipe's function is about 300% faster than the next fastest function. Also, for the one I modified, I used this trick.
If you need superb performance, you should avoid LINQ and regular expressions in this case. I did some performance benchmarking, and it seems that if you want to strip white space from beginning and end of the string, string.Trim() is your ultimate function.
If you need to strip all white space from a string, the following method is the fastest of all that have been posted here:
public static string RemoveWhitespace(this string input)
{
int j = 0, inputlen = input.Length;
char[] newarr = new char[inputlen];
for (int i = 0; i < inputlen; ++i)
{
char tmp = input[i];
if (!char.IsWhiteSpace(tmp))
{
newarr[j] = tmp;
++j;
}
}
return new String(newarr, 0, j);
}
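To make the difference between the two cases concrete, here is a minimal sketch that runs string.Trim() and a loop-based remover on the same input. The class name TrimDemo, the method name RemoveAllWhitespace, and the sample text are my own choices for illustration; the loop follows the same idea as the method above.

```csharp
using System;

public static class TrimDemo
{
    // Same approach as the loop above: copy every non-whitespace char into a buffer.
    public static string RemoveAllWhitespace(string input)
    {
        var buffer = new char[input.Length];
        int j = 0;
        foreach (char c in input)
        {
            if (!char.IsWhiteSpace(c))
            {
                buffer[j++] = c;
            }
        }
        return new string(buffer, 0, j);
    }

    public static void Main()
    {
        string text = "  my string \t is nice \r\n";
        // Trim() only strips the ends; interior whitespace is kept.
        Console.WriteLine("[" + text.Trim() + "]");
        // The loop strips every whitespace character.
        Console.WriteLine("[" + RemoveAllWhitespace(text) + "]"); // prints "[mystringisnice]"
    }
}
```

Trim() is the right tool when only the ends matter; the loop is for stripping whitespace everywhere.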
Regex is overkill; just use an extension method on string (thanks Henk). This is trivial and should have been part of the framework. Anyhow, here's my implementation:
public static partial class Extension
{
public static string RemoveWhiteSpace(this string self)
{
return new string(self.Where(c => !Char.IsWhiteSpace(c)).ToArray());
}
}
I think a lot of people come here just to remove spaces:
string s = "my string is nice";
s = s.Replace(" ", "");
Here is a simple linear alternative to the RegEx solution. I am not sure which is faster; you'd have to benchmark it.
static string RemoveWhitespace(string input)
{
StringBuilder output = new StringBuilder(input.Length);
for (int index = 0; index < input.Length; index++)
{
if (!Char.IsWhiteSpace(input, index))
{
output.Append(input[index]);
}
}
return output.ToString();
}
I needed to replace white space in a string with spaces, but not duplicate spaces. e.g., I needed to convert something like the following:
"a b c\r\n d\t\t\t e"
to
"a b c d e"
I used the following method
private static string RemoveWhiteSpace(string value)
{
if (value == null) { return null; }
var sb = new StringBuilder();
var lastCharWs = false;
foreach (var c in value)
{
if (char.IsWhiteSpace(c))
{
if (lastCharWs) { continue; }
sb.Append(' ');
lastCharWs = true;
}
else
{
sb.Append(c);
lastCharWs = false;
}
}
return sb.ToString();
}
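As a quick check of the method above, here is a minimal, self-contained sketch. The class name WhitespaceDemo and the method name CollapseWhitespace are hypothetical names chosen for illustration; the logic mirrors the loop above, with the StringBuilder pre-sized to the input length. The sample input uses doubled spaces, a line break, and tabs.

```csharp
using System;
using System.Text;

public static class WhitespaceDemo
{
    // Emits a single ' ' for each run of whitespace characters.
    public static string CollapseWhitespace(string value)
    {
        if (value == null) { return null; }
        var sb = new StringBuilder(value.Length);
        var lastCharWs = false;
        foreach (var c in value)
        {
            if (char.IsWhiteSpace(c))
            {
                if (!lastCharWs) { sb.Append(' '); }
                lastCharWs = true;
            }
            else
            {
                sb.Append(c);
                lastCharWs = false;
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // prints "a b c d e"
        Console.WriteLine(CollapseWhitespace("a b  c\r\n d\t\t\t e"));
    }
}
```

Note that leading or trailing whitespace is reduced to a single space rather than removed; call Trim() on the result if that is not wanted.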
I assume your XML response looks like this:
var xml = @"<names>
<name>
foo
</name>
<name>
bar
</name>
</names>";
The best way to process XML is to use an XML parser, such as LINQ to XML:
var doc = XDocument.Parse(xml);
var containsFoo = doc.Root
.Elements("name")
.Any(e => ((string)e).Trim() == "foo");
We can use:
public static string RemoveWhitespace(this string input)
{
if (input == null)
return null;
return new string(input.ToCharArray()
.Where(c => !Char.IsWhiteSpace(c))
.ToArray());
}
Using LINQ, you can write a readable method this way:
public static string RemoveAllWhitespaces(this string source)
{
return string.IsNullOrEmpty(source) ? source : new string(source.Where(x => !char.IsWhiteSpace(x)).ToArray());
}
Here is yet another variant:
public static string RemoveAllWhitespace(string aString)
{
return String.Join(String.Empty, aString.Where(aChar => !Char.IsWhiteSpace(aChar)));
}
As with most of the other solutions, I haven't performed exhaustive benchmark tests, but this works well enough for my purposes.
I have found different results to be true. I am trying to replace all whitespace with a single space and the regex was extremely slow.
return( Regex::Replace( text, L"\\s+", L" " ) );
What worked best for me (in C++/CLI) was:
String^ ReduceWhitespace( String^ text )
{
String^ newText;
bool inWhitespace = false;
Int32 posStart = 0;
Int32 pos = 0;
for( pos = 0; pos < text->Length; ++pos )
{
wchar_t cc = text[pos];
if( Char::IsWhiteSpace( cc ) )
{
if( !inWhitespace )
{
if( pos > posStart ) newText += text->Substring( posStart, pos - posStart );
inWhitespace = true;
newText += L' ';
}
posStart = pos + 1;
}
else
{
if( inWhitespace )
{
inWhitespace = false;
posStart = pos;
}
}
}
if( pos > posStart ) newText += text->Substring( posStart, pos - posStart );
return( newText );
}
I tried the above routine first by replacing each character separately, but had to switch to doing substrings for the non-space sections. When applying to a 1,200,000 character string:
the above routine gets it done in 25 seconds
the above routine + separate character replacement in 95 seconds
the regex aborted after 15 minutes.
The straightforward way to remove all whitespace from a string, where "example" is your initial string:
String.Concat(example.Where(c => !Char.IsWhiteSpace(c)))
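For completeness, a small runnable sketch of that one-liner. The wrapper class ConcatDemo and the method StripWhitespace are hypothetical names, and the sample input is made up; the only library calls are Enumerable.Where and the generic String.Concat overload that accepts a sequence.

```csharp
using System;
using System.Linq;

public static class ConcatDemo
{
    public static string StripWhitespace(string example)
    {
        // Where filters out whitespace chars; Concat appends the rest in order.
        return string.Concat(example.Where(c => !char.IsWhiteSpace(c)));
    }

    public static void Main()
    {
        // prints "mystringisnice"
        Console.WriteLine(StripWhitespace(" my string\tis\r\nnice "));
    }
}
```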

Measuring Performance of an Extension Method. Is this right?

I'm testing the efficiency of an extension method to see which permutation would be the fastest in terms of processing time. Memory consumption isn't an issue at this point.
I've created a small console app to generate an array of random strings, which then has the extension methods applied to it. I'm currently using the Stopwatch class to measure the time taken to run the extension methods. I then average the total time of each method over a number of iterations.
I'm not excluding highest or lowest results at this point.
Extension Methods being tested:
public static String ToString1(this String[] s) {
StringBuilder sb = new StringBuilder();
foreach (String item in s) {
sb.AppendLine(item);
}
return sb.ToString();
}
public static String ToString2(this String[] s) {
return String.Join("\n", s);
}
Program.cs
static void Main(string[] args)
{
long s1Total = 0;
long s2Total = 0;
double s1Avg = 0;
double s2Avg = 0;
int iteration = 1;
int size = 100000;
while (iteration <= 25)
{
Console.WriteLine("Iteration: {0}", iteration);
Test(ref s1Total, ref s2Total, ref iteration, size);
}
s1Avg = s1Total / iteration;
s2Avg = s2Total / iteration;
Console.WriteLine("Version\t\tTotal\t\tAvg");
Console.WriteLine("StringBuilder\t\t{0}\t\t{1}",s1Total, s1Avg);
Console.WriteLine("String.Join:\t\t{0}\t\t{1}",s2Total, s2Avg);
Console.WriteLine("Press any key..");
Console.ReadKey();
}
private static void Test(ref long s1Total, ref long s2Total, ref int iteration, int size)
{
String[] data = new String[size];
Random r = new Random();
for (int i = 0; i < size; i++)
{
data[i] = r.NextString(50);
}
Stopwatch s = new Stopwatch();
s.Start();
data.ToString1();
s.Stop();
s1Total += s.ElapsedTicks;
s.Reset();
s.Start();
data.ToString2();
s.Stop();
s2Total += s.ElapsedTicks;
iteration++;
}
Other extension methods used in the above code, for completeness:
Random extension:
public static String NextString(this Random r,int size)
{
return NextString(r,size,false);
}
public static String NextString(this Random r,int size, bool lowerCase)
{
StringBuilder sb = new StringBuilder();
char c;
for (int i = 0; i < size; i++)
{
c = Convert.ToChar(Convert.ToInt32(Math.Floor(26*r.NextDouble() + 65)));
sb.Append(c);
}
if (lowerCase) {
return sb.ToString().ToLower();
}
return sb.ToString();
}
Running the above code, my results indicate that the StringBuilder-based method is faster than the String.Join-based method.
My Questions:
Is this the right way to be performing this type of measurement?
Is there a better way of doing this?
Are my results in this instance correct, and if so is using a StringBuilder actually faster than String.Join in this situation?
Thanks.
Next time you want to compare performance, you can also look at the source code with Reflector. There you can easily see that string.Join uses a StringBuilder to construct the string, so the two should differ only slightly in performance.
I got
StringBuilder 3428567 131867
String.Join: 1245078 47887
Note that ToString1 adds an extra newline.
Also, you can improve it by setting the StringBuilder's capacity.
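A rough sketch of how the measurement could be tightened, under a few assumptions of my own. In the posted Program.cs, iteration is 26 by the time the averages are computed, and s1Total / iteration is an integer division, so the averages are slightly off. The version below generates the data once, runs each method once untimed so JIT compilation is not measured, divides by the real run count as a double, and pre-sizes the StringBuilder. All names here (JoinBenchmark, JoinWithBuilder, AverageTicks) are hypothetical, and the code assumes the array contains no null entries.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class JoinBenchmark
{
    // StringBuilder variant with a capacity hint; adds no trailing newline.
    public static string JoinWithBuilder(string[] items)
    {
        if (items.Length == 0) { return string.Empty; }
        int capacity = items.Length - 1; // one '\n' between neighbours
        foreach (string item in items) { capacity += item.Length; }
        var sb = new StringBuilder(capacity);
        sb.Append(items[0]);
        for (int i = 1; i < items.Length; i++)
        {
            sb.Append('\n').Append(items[i]);
        }
        return sb.ToString();
    }

    public static string JoinWithStringJoin(string[] items)
    {
        return string.Join("\n", items);
    }

    // One untimed warm-up call, then the average ticks per timed run.
    public static double AverageTicks(Func<string[], string> func, string[] data, int runs)
    {
        func(data);
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < runs; i++) { func(data); }
        sw.Stop();
        return (double)sw.ElapsedTicks / runs;
    }

    public static void Main()
    {
        var random = new Random(12345); // fixed seed so runs are repeatable
        var data = new string[100000];
        for (int i = 0; i < data.Length; i++)
        {
            var chars = new char[50];
            for (int j = 0; j < chars.Length; j++)
            {
                chars[j] = (char)('A' + random.Next(26));
            }
            data[i] = new string(chars);
        }
        const int runs = 25;
        Console.WriteLine("StringBuilder: {0:F0} ticks/run", AverageTicks(JoinWithBuilder, data, runs));
        Console.WriteLine("String.Join:   {0:F0} ticks/run", AverageTicks(JoinWithStringJoin, data, runs));
    }
}
```

Both methods now produce identical output, so the comparison is like for like. Timings will still vary with the machine, runtime, and build configuration, so a release build run outside the debugger is the safer basis for conclusions.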
