C# Extension method slower than chained Replace unless in tight loop. Why? - c#

I have an extension method to remove certain characters from a string (a phone number) which is performing much slower than I think it should vs chained Replace calls. The weird bit, is that in a loop it overtakes the Replace thing if the loop runs for around 3000 iterations, and after that it's faster. Lower than that and chaining Replace is faster. It's like there's a fixed overhead to my code which Replace doesn't have. What could this be!?
Quick look. When only testing 10 numbers, mine takes about 0.3ms, while Replace takes only 0.01ms. A massive difference! But when running 5 million, mine takes around 1700ms while Replace takes about 2500ms.
Phone numbers will only have 0-9, +, -, (, )
Here's the relevant code:
Building test cases, I'm playing with testNums.
int testNums = 5_000_000;
Console.WriteLine("Building " + testNums + " tests");
Random rand = new Random();
string[] tests = new string[testNums];
char[] letters =
{
'0','1','2','3','4','5','6','7','8','9',
'+','-','(',')'
};
for(int t = 0; t < tests.Length; t++)
{
int length = rand.Next(5, 20);
char[] word = new char[length];
for(int c = 0; c < word.Length; c++)
{
word[c] = letters[rand.Next(letters.Length)];
}
tests[t] = new string(word);
}
Console.WriteLine("Tests built");
string[] stripped = new string[tests.Length];
Using my extension method:
Stopwatch stopwatch = Stopwatch.StartNew();
for (int i = 0; i < stripped.Length; i++)
{
stripped[i] = tests[i].CleanNumberString();
}
stopwatch.Stop();
Console.WriteLine("Clean: " + stopwatch.Elapsed.TotalMilliseconds + "ms");
Using chained Replace:
stripped = new string[tests.Length];
stopwatch = Stopwatch.StartNew();
for (int i = 0; i < stripped.Length; i++)
{
stripped[i] = tests[i].Replace(" ", string.Empty)
.Replace("-", string.Empty)
.Replace("(", string.Empty)
.Replace(")", string.Empty)
.Replace("+", string.Empty);
}
stopwatch.Stop();
Console.WriteLine("Replace: " + stopwatch.Elapsed.TotalMilliseconds + "ms");
Extension method in question:
public static string CleanNumberString(this string s)
{
Span<char> letters = stackalloc char[s.Length];
int count = 0;
for (int i = 0; i < s.Length; i++)
{
if (s[i] >= '0' && s[i] <= '9')
letters[count++] = s[i];
}
return new string(letters.Slice(0, count));
}
What I've tried:
I've run them around the other way. Makes a tiny difference, but not enough.
Make it a normal static method, which was significantly slower than extension. As a ref parameter was slightly slower, and in parameter was about the same as extension method.
Aggressive Inlining. Doesn't make any real difference. I'm in release mode, so I suspect the compiler inlines it anyway. Either way, not much change.
I have also looked at memory allocations, and that's as I expect. My one allocates on the managed heap only one string per iteration (the new string at the end) which Replace allocates a new object for each Replace. So the memory used by the Replace one is much, higher. But it's still faster!
Is it calling native C code and doing something crafty there? Is the higher memory usage triggering the GC and slowing it down (still doesn't explane the insanely fast time on only one or two iterations)
Any ideas?
(Yes, I know not to bother optimising things like this, it's just bugging me because I don't know why it's doing this)

After doing some benchmarks, I think can safely assert that your initial statement is wrong for the exact reason you mentionned in your deleted answer: the loading time of the method is the only thing that misguided you.
Here's the full benchmark on a simplified version of the problem:
static void Main(string[] args)
{
// Build string of n consecutive "ab"
int n = 1000;
Console.WriteLine("N: " + n);
char[] c = new char[n];
for (int i = 0; i < n; i+=2)
c[i] = 'a';
for (int i = 1; i < n; i += 2)
c[i] = 'b';
string s = new string(c);
Stopwatch stopwatch;
// Make sure everything is loaded
s.CleanNumberString();
s.Replace("a", "");
s.UnsafeRemove();
// Tests to remove all 'a' from the string
// Unsafe remove
stopwatch = Stopwatch.StartNew();
string a1 = s.UnsafeRemove();
stopwatch.Stop();
Console.WriteLine("Unsafe remove:\t" + stopwatch.Elapsed.TotalMilliseconds + "ms");
// Extension method
stopwatch = Stopwatch.StartNew();
string a2 = s.CleanNumberString();
stopwatch.Stop();
Console.WriteLine("Clean method:\t" + stopwatch.Elapsed.TotalMilliseconds + "ms");
// String replace
stopwatch = Stopwatch.StartNew();
string a3 = s.Replace("a", "");
stopwatch.Stop();
Console.WriteLine("String.Replace:\t" + stopwatch.Elapsed.TotalMilliseconds + "ms");
// Make sure the returned strings are identical
Console.WriteLine(a1.Equals(a2) && a2.Equals(a3));
Console.ReadKey();
}
public static string CleanNumberString(this string s)
{
char[] letters = new char[s.Length];
int count = 0;
for (int i = 0; i < s.Length; i++)
if (s[i] == 'b')
letters[count++] = 'b';
return new string(letters.SubArray(0, count));
}
public static T[] SubArray<T>(this T[] data, int index, int length)
{
T[] result = new T[length];
Array.Copy(data, index, result, 0, length);
return result;
}
// Taken from https://stackoverflow.com/a/2183442/6923568
public static unsafe string UnsafeRemove(this string s)
{
int len = s.Length;
char* newChars = stackalloc char[len];
char* currentChar = newChars;
for (int i = 0; i < len; ++i)
{
char c = s[i];
switch (c)
{
case 'a':
continue;
default:
*currentChar++ = c;
break;
}
}
return new string(newChars, 0, (int)(currentChar - newChars));
}
When ran with different values of n, it is clear that your extension method (or at least my somewhat equivalent version of it) has a logic that makes it faster than String.Replace(). In fact, it is more performant on either small or big strings:
N: 100
Unsafe remove: 0,0024ms
Clean method: 0,0015ms
String.Replace: 0,0021ms
True
N: 100000
Unsafe remove: 0,3889ms
Clean method: 0,5308ms
String.Replace: 1,3993ms
True
I highly suspect optimizations for the replacement of strings (not to be compared to removal) in String.Replace() to be the culprit here. I also added a method from this answer to have another comparison on removal of characters. That one's times behave similarly to your method but gets faster on higher values (80k+ on my tests) of n.
With all that being said, since your question is based on an assumption that we found was false, if you need more explanation on why the opposite is true (i.e. "Why is String.Replace() slower than my method"), plenty of in-depth benchmarks about string manipulation already do so.

I ran the clean method a couple more. interestingly, it is a lot faster than the Replace. Only the first time run was slower. Sorry that I couldn't explain why it's slower the first time but I ran more of the method then the result was expected.
Building 100 tests
Tests built
Replace: 0.0528ms
Clean: 0.4526ms
Clean: 0.0413ms
Clean: 0.0294ms
Replace: 0.0679ms
Replace: 0.0523ms
used dotnet core 2.1

So I've found with help from daehee Kim and Mat below that it's only the first iteration, but it's for the whole first loop. Every loop after there is ok.
I use the following line to force the JIT to do its thing and initialise this method:
RuntimeHelpers.PrepareMethod(typeof(CleanExtension).GetMethod("CleanNumberString", BindingFlags.Public | BindingFlags.Static).MethodHandle);
I find the JIT usually takes about 2-3ms to do its thing here (including Reflection time of about 0.1ms). Note that you should probably not be doing this because you're now getting the Reflection cost as well, and the JIT will be called right after this anyway, but it's probably a good idea for benchmarks to fairly compare.
The more you know!
My benchmark for a loop of 5000 iterations, repeated 5000 times with random strings and averaged is:
Clean: 0.41078ms
Replace: 1.4974ms

Related

If string contains 2 period in c#

I'm trying to build a Reg Expression where if the textbox string contains two periods anywhere it will execute my code. This is what I've got so far:
Regex word = new Regex("(\\.){2,}");
if (word.IsMatch(textBoxSearch.Text))
{
//my code here to execute
}
However, it only executes when there are two periods together and not anywhere within the string...
There is no need for regex here, just use LINQ!
myString.Count(x => x == '.') == 2
Or for 2 or more:
myString.Where(x => x == '.').Skip(1).Any()
If performance is crucial, you should use a loop. Here is a comparison of the three approaches (LINQ, loop, regex):
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
namespace Experiment
{
public static class Program
{
static bool hasTwoPeriodsLinq(string text)
{
return text.Count(x => x == '.') == 2;
}
static bool hasTwoPeriodsLoop(string text)
{
int count = 0;
for (int i = 0; i < text.Length; i++)
{
if (text[i] == '.')
{
// This early break makes the loop faster than regex
if (count == 2)
{
return false;
}
count++;
}
}
return count == 2;
}
static Regex twoPeriodsRegex = new Regex(#"^.*\..*\..*$", RegexOptions.Compiled);
static bool hasTwoPeriodsRegex(string text)
{
return twoPeriodsRegex.IsMatch(text);
}
public static void Main(string[] args)
{
var text = #"The young Princess Bolk6nskaya had
brought some work in a gold-embroidered vel-
vet bag. Her pretty little upper lip, on which
a delicate dark down was just perceptible, was
too short for her teeth, but it lifted all the more
sweetly, and was especially charming when she
occasionally drew it down to meet the lower
lip. As is always the case with a thoroughly at-
tractive woman, her defectthe shortness of
her upperlip and her half-open mouth seemed
to be her own special and peculiar form of
beauty. Everyone brightened at the sight of
this pretty young woman, so soon to become
a mother, so full of life and health, and carry-
ing her burden so lightly. Old men and dull
dispirited young ones who looked at her, after
being in her company and talking to her a
litttle while, felt as if they too were becoming,
like her, full of life and health. All who talked
to her, and at each word saw her bright smile
and the constant gleam of her white teeth,
thought that they were in a specially amiable
mood that day. ";
const int iterations = 100000;
// Warm up...
for (int i = 0; i < iterations; i++)
{
hasTwoPeriodsLinq(text);
hasTwoPeriodsLoop(text);
hasTwoPeriodsRegex(text);
}
var watch = System.Diagnostics.Stopwatch.StartNew();
// hasTwoPeriodsLinq
watch.Restart();
for (int i = 0; i < iterations; i++)
{
hasTwoPeriodsLinq(text);
}
watch.Stop();
Console.WriteLine("hasTwoPeriodsLinq " + watch.ElapsedMilliseconds);
// hasTwoPeriodsLoop
watch.Restart();
for (int i = 0; i < iterations; i++)
{
hasTwoPeriodsLoop(text);
}
watch.Stop();
Console.WriteLine("hasTwoPeriodsLoop " + watch.ElapsedMilliseconds);
// hasTwoPeriodsRegex
watch.Restart();
for (int i = 0; i < iterations; i++)
{
hasTwoPeriodsRegex(text);
}
watch.Stop();
Console.WriteLine("hasTwoPeriodsRegex " + watch.ElapsedMilliseconds);
}
}
}
Try it here.
And the results:
hasTwoPeriodsLinq 1280
hasTwoPeriodsLoop 54
hasTwoPeriodsRegex 74
Try this:
int count = source.Count(f => f == '.');
If count == 2, you're all good.
You should declare two periods and anything except period around and between them:
[^\.]*\.[^\.]*\.[^\.]*
This works according to my tests:
^.*\..*\..*$
Any character zero or more times followed by a period followed by any character zero or more times followed by a period followed by any character zero or more times.
Of course, as others have pointed out, using Regex here is not the most efficient or readable way of doing it. Regex has a learning curve and future programmers may not appreciate a less-than-straightforward approach given that there are simpler alternatives.
if you want to use regex, then you can use the Regex.Matches to check the count.
if(Regex.Matches(stringinput, #"\.").Count == 2 )
{
//perform your operation
}
Several people have given examples to test for exactly 2 but here's an example to test for at least 2 periods. You could actually easily modify this to test for exactly 2 if you wanted to too.
(.*\..*){2}

C# Type of String Index

I need to access a very large number in the index of the string which int and long can't handle. I had to use ulong but the problem is that the indexer can only handle the type int.
This is my code and I have marked the line where the error is located. Any ideas how to solve this?
string s = Console.ReadLine();
long n = Convert.ToInt64(Console.ReadLine());
var cont = s.Count(x => x == 'a');
Console.WriteLine(cont);
Console.ReadKey();
The main idea of the code is to identify how many 'a's there are in the string. What are some other ways I can do this?
EDIT:
i didn't know that is the string index Capicity cant exceed the int type. and i fixed my for loop by replacing it with this linq line
var cont = s.Count(x => x == 'a');
now since my string can't exceed certain amount. so how i can repeat my string to append its char for 1,000,000,000,000 times rather than using this code
for (int i = 0; i < 20; i++)
{
s += s;
}
since this code is generating random char numbers in the string and if i raised the 20 may cause to overflow so i need to adjust it to repeat itself to make the string[index] = n // the long i declared above.
so for example if my string input is "aba" and n is 10 so the string will be "abaabaabaa" // total chars 10
PS: I Edited the original code
I assume you got a programming assignment or online coding challenge, where the requirement was "Count all instances of the letter 'a' in this > 2 GB file". You solution is to read the file in memory at once, and loop over it with a variable type that allows values over 2GB.
This causes an XY problem. You cannot have an array that large in memory in the first place, so you're not going to reach the point where you need a uint, long or ulong to index into it.
Instead, use a StreamReader to read the file in chunks, as explained in for example Reading large file in chunks c#.
You can repeat your string using an infinite sequence. I haven't added any check for valid arguments, etc.
static void Main(string[] args)
{
long count = countCharacters("aba", 'a', 10);
Console.WriteLine("Count is {0}", count);
Console.WriteLine("Press ENTER to exit...");
Console.ReadLine();
}
private static long countCharacters(string baseString, char c, long limit)
{
long result = 0;
if (baseString.Length == 1)
{
result = baseString[0] == c ? limit : 0;
}
else
{
long n = 0;
foreach (var ch in getInfiniteSequence(baseString))
{
if (n >= limit)
break;
if (ch == c)
{
result++;
}
n++;
}
}
return result;
}
//This method iterates through a base string infinitely
private static IEnumerable<char> getInfiniteSequence(string baseString)
{
int stringIndex = 0;
while (true)
{
yield return baseString[stringIndex++ % baseString.Length];
}
}
For the given inputs, the result is 7
I highly recommend you rethink the way you are doing this, but a quick fix would be to use a foreach loop instead:
foreach(char c in s)
{
if (c == 'a')
cont++;
}
Alternative using Linq:
cont = s.Count(c => c == 'a');
I'm not sure about what n is supposed to do. According to your code it limits the string length but your question never mentions why or to what end.
i need to access a very large number in the index of the string which
int, long can't handle
this statement is not true
c# string's max length is int.Max since string.Length is an integer and it is limited by that. You should be able to do
for (int i = 0; i <= n; i++)
The maximum length of a string cannot exceed the size of an int so there really is no point in using ulong or long to index into the string.
Simply put, you're trying to solve the wrong problem.
If we disregard the fact that the program is likely to cause an out of memory exception when building such a long string, you can simply fix your code by switching to an int instead of a ulong:
for (int i = 0; i <= n; i++)
Having said that you can also use LINQ to do this:
int cont = s.Take(n + 1).Count(c => c == 'a');
Now, in the first sentence of your question you state this:
I need to access a very large number in the index of the string which int and long can't handle.
This is wholly unnecessary because any legal index of a string will fit inside an int.
If you need to do this on some input that's longer than the maximum length of a string in .NET, you'll need to change your approach; use a Stream instead trying to read all input into a string.
char seeking = 'a';
ulong count = 0;
char[] buffer = new char[4096];
using (var reader = new StreamReader(inStream))
{
int length;
while ((length = reader.Read(buffer, 0, buffer.Length)) > 0)
{
count += (ulong)buffer.Count(c => c == seeking);
}
}

Fastest way to get the first charachter from a string as a string

I need to get the first letter in a string, and was trying to use stringValue[0]. This gave compilation error when I tried to pass it as a parameter to a method, because the method only took string as type for that parameter. So I have several options to convert it to string:
myMethod(stringValue[0].ToString());
myMethod(new String(stringValue[0], 1));
myMethod(stringValue.Substring(0,1));
myMethod("" + stringValue[0]);
Which method is the best and fastest regarding performance (thoughts on best practice is also welcome)?
Proposed Solution
string input = "This is a string";
string firstChar = new String(input[0], 1);
This is the least number of operations you can do.
extract the first char by index
pass the char to the constructor of String
String Constructor (Char, Int32)
Performance Code
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace StringPerf
{
class Program
{
static string firstChar;
static void Main(string[] args)
{
string input = "This is a sample string";
int count = 100000;
RunPerf("Warmup", 1, () =>
{
PerfContructor(input);
PerfSubstring(input);
PerfAppend(input);
});
Console.WriteLine();
Console.WriteLine();
RunPerf("Constructor", count, () => PerfContructor(input));
Console.WriteLine();
RunPerf("ToString", count, () => PerfToString(input));
Console.WriteLine();
RunPerf("Substring", count, () => PerfSubstring(input));
Console.WriteLine();
RunPerf("Append", count, () => PerfAppend(input));
Console.ReadLine();
}
static void PerfContructor(string input)
{
firstChar = new String(input[0], 1);
}
static void PerfToString(string input)
{
firstChar = input[0].ToString();
}
static void PerfSubstring(string input)
{
firstChar = input.Substring(0, 1);
}
static void PerfAppend(string input)
{
firstChar = "" + input[0];
}
static void RunPerf(string name, int count, Action action)
{
var sw = new System.Diagnostics.Stopwatch();
Console.WriteLine(string.Format("Starting perf for {0}. {1} times", name, count));
sw.Start();
for (int i = 0; i < count; i++)
{
action();
}
sw.Stop();
Console.WriteLine(string.Format("{0} completed in {1}", name, sw.Elapsed));
}
}
}
Results
Starting perf for Warmup. 1 times
Warmup completed in 00:00:00.0003153
Starting perf for Constructor. 9999999 times
Constructor completed in 00:00:00.1961569
Starting perf for ToString. 9999999 times
ToString completed in 00:00:00.2890530
Starting perf for Substring. 9999999 times
Substring completed in 00:00:00.2412256
Starting perf for Append. 9999999 times
Append completed in 00:00:00.3271857
Use String.Substring:
string firstChar = stringValue.Substring(0, 1);
According to the performance part of your question.
I wouldn't take performance into consideration here unless it was
actually becoming a problem for you - in which case the only way you'd
know would be to have test cases, and then it's easy to just run those
test cases for each option and compare the results. I'd expect
Substring to probably be the fastest here, simply because Substring
always ends up creating a string from a single chunk of the original
input, whereas Remove has to at least potentially glue together a
start chunk and an end chunk.
Fastest way to remove first char in a string
However, if you want to micro-optimize and you really need just the first character, it seems that the string indexer + constructor is a little bit faster than others:
string stringValue = "testtesttesttesttesttesttesttesttesttesttesttesttesttesttesttest";
var sw = new System.Diagnostics.Stopwatch();
string firstChar;
sw.Start();
for(int i=0; i<100000000; i++)
firstChar = stringValue.Substring(0, 1);
sw.Stop();
Console.WriteLine("Substring Elapsed: " + sw.Elapsed);
sw.Restart();
for (int i = 0; i < 100000000; i++)
firstChar = stringValue[0].ToString();
sw.Stop();
Console.WriteLine("Char[]-Indexer-ToString Elapsed: " + sw.Elapsed);
sw.Restart();
for (int i = 0; i < 100000000; i++)
firstChar = new string(stringValue[0],1);
sw.Stop();
Console.WriteLine("Char[]-Indexer-Constructor Elapsed: " + sw.Elapsed);
Result:
Substring Elapsed: 00:00:03.0214131
Char[]-Indexer-ToString Elapsed: 00:00:02.1274226
Char[]-Indexer-Constructor Elapsed: 00:00:01.7042839
But note that this is really micro-optimization. Readability is more important in most cases. Consider also that you might need to take the first two characters sometime which is easy with String.Substring, an array indexer cannot return multiple characters.
Based on the answers by Tim Schmelter and Gusdor I did some testing on my own, and the results made it clear that the fastest methods were
stringValue[0].ToString()
and
new String(stringValue[0], 1)
Which of the two is fastest is unclear, as my tests were inconclusive regarding those two. I tried to run them in different order, and it varied which one was the fastest.
I ran it 2E9 times, and these two fastest methods took about 30 seconds.
In comparison, the Substring method was 43 seconds, and the append method was 50 seconds.

C# Char permutation with repetition on large set of chars

Hello Im trying to get all possible combinations with repetitions of given char array.
Char array consists of alphabet letters(only lower) and I need to generate strings with length of 30 or more chars.
I tried with method of many for-loops,but when I try to get all combinations of char in char array with length of string more then 5 I get out of Memory Exception.
So I created similar Method that takes only first 200000 strings,then next 2000000 and so on this was proven sucessfull but only with smaller length strings.
This was my method with length of 7 chars:
public static int Progress = 0;
public static ArrayList CreateRngUrl7()
{
ArrayList AllCombos = new ArrayList();
int passed = 0;
int Too = Progress + 200000;
char[] alpha = "ABCDEFGHIJKLMNOPQRSTUVWXYZ".ToLower().ToCharArray();
for (int i = 0; i < alpha.Length; i++)
for (int i1 = 0; i1 < alpha.Length; i1++)
for (int i2 = 0; i2 < alpha.Length; i2++)
for (int i3 = 0; i3 < alpha.Length; i3++)
for (int i4 = 0; i4 < alpha.Length; i4++)
for (int i5 = 0; i5 < alpha.Length; i5++)
for (int i6 = 0; i6 < alpha.Length; i6++)
{
if (passed > (Too - 200000) && passed < Too)
{
string word = new string(new char[] { alpha[i], alpha[i1], alpha[i2], alpha[i3], alpha[i4], alpha[i5],alpha[i6] });
AllCombos.Add(word);
}
passed++;
}
if (Too >= passed)
{
MessageBox.Show("All combinations of RNG7 were returned");
}
Progress = Too;
return AllCombos;
}
I tried adding 30 for-loops with in way described above so i Would get strings with lenghts of 30 but application just hangs.Is there any better way to do this? All answers would be much appreciated. Thank you in advance!
Can someone please just post method how it is done with larger legth strings I just want to see an example? I don't have to store that data,I just need to compare it with something and release it from memory. I used alphabet for example I don't need whole alphabet.Question was not how long it would take or how much combinations would it be!!!!!
You get an OutOfMemoryException because inside the loop you allocate a string and store it in an ArrayList. The strings have to stay in memory until the ArrayList is garbage collected and your loop creates more strings than you will be able to store.
If you simply want to check the string for a condition you should put the check inside the loop:
for ( ... some crazy loop ...) {
var word = ... create word ...
if (!WordPassesTest(word)) {
Console.WriteLine(word + " failed test.");
return false;
}
}
return true;
Then you only need storage for a single word. Of course, if the loop is crazy enough, it will not terminate before the end of the universe as we know it.
If you need to execute many nested but similar loops you can use recursion to simplify the code. Here is an example that is not incredible efficient, but at least it is simple:
Char[] chars = "ABCD".ToCharArray();
IEnumerable<String> GenerateStrings(Int32 length) {
if (length == 0) {
yield return String.Empty;
yield break;
}
var strings = chars.SelectMany(c => GenerateStrings(length - 1), (c, s) => c + s);
foreach (var str in strings)
yield return str;
}
Calling GenerateStrings(3) will generate all strings of length 3 using lazy evaluation (so no additional storage is required for the strings).
Building on top of an IEnumerable generating your strings you can create primites to buffer and process buffers of strings. An easy solution is to using Reactive Extensions for .NET. Here you already have a Buffer primitive:
GenerateStrings(3)
.ToObservable()
.Buffer(10)
.Subscribe(list => ... ship the list to another computer and process it ...);
The lambda in Subscribe will be called with a List<String> with at most 10 strings (the parameter provided in the call to Buffer).
Unless you have an infinte number of computers you will still have to pull the computers from a pool and only recycle them back to the pool when they have finished the computation.
It should be obvious from the comments on this question that you will not be able to process 26^30 strings even if you have multiple computers at your disposal.
I don't have time right now to write some code but essentially if you are running out of RAM use disk. I'm thinking along the lines of one thread running an algorithm to find the combinations and another persisting the results to disk and releasing the RAM.

Fast string suffix checking in C# (.NET 4.0)?

What is the fastest method of checking string suffixes in C#?
I need to check each string in a large list (anywhere from 5000 to 100000 items) for a particular term. The term is guaranteed never to be embedded within the string. In other words, if the string contains the term, it will be at the end of the string. The string is also guaranteed to be longer than the suffix. Cultural information is not important.
These are how different methods performed against 100000 strings (half of them have the suffix):
1. Substring Comparison - 13.60ms
2. String.Contains - 22.33ms
3. CompareInfo.IsSuffix - 24.60ms
4. String.EndsWith - 29.08ms
5. String.LastIndexOf - 30.68ms
These are average times. [Edit] Forgot to mention that the strings also get put into separate lists, but this is not important. It does add to the running time though.
On my system substring comparison (extracting the end of the string using the String.Substring method and comparing it to the suffix term) is consistently the fastest when tested against 100000 strings. The problem with using substring comparison though is that Garbage Collection can slow it down considerably (more than the other methods) because String.Substring creates new strings. The effect is not as bad in .NET 4.0 as it was in 3.5 and below, but it is still noticeable. In my tests, String.Substring performed consistently slower on sets of 12000-13000 strings. This will obviously differ between systems and implementations.
[EDIT]
Benchmark code:
http://pastebin.com/smEtYNYN
[EDIT]
FlyingStreudel's code runs fast, but Jon Skeet's recommendation of using EndsWith in conjunction with StringComparison.Ordinal appears to be the best option.
If that's the time taken to check 100,000 strings, does it really matter?
Personally I'd use string.EndsWith on the grounds that it's the most descriptive: it says exactly what you're trying to test.
I'm somewhat suspicious of the fact that it appears to be performing worst though... if you could post your benchmark code, that would be very useful. (In particular, it really shouldn't have to do as much work as string.Contains.)
Have you tried specifying an ordinal match? That may well make it significantly faster:
if (x.EndsWith(y, StringComparison.Ordinal))
Of course, you shouldn't do that unless you want an ordinal comparison - are you expecting culturally-sensitive matches? (Developers tend not to consider this sort of thing, and I very firmly include myself in that category.)
Jon is absolutely right; this is potentially not an apples-to-apples comparison because different string methods have different defaults for culteral sensitivity. Be very sure that you are getting the comparison semantics you intend to in each one.
In addition to Jon's answer, I'd add that the relevant question is not "which is fastest?" but rather "which is too slow?" What's your performance goal for this code? The slowest method still finds the result in less time than it takes a movie projector to advance to the next frame, and obviously that is not noticable by humans. If your goal is that the search appears instantaneous to the user then you're done; any of those methods work. If your goal is that the search take less than a millisecond then none of those methods work; they are all orders of magnitude too slow. What's the budget?
I took a look at your benchmark code and frankly, it looks dodgy.
You are measuring all kinds of extraneous things along with what it is you want to measure; you're measuring the cost of the foreach and the adding to a list, both of which might have costs of the same order of magnitude as the thing you are attempting to test.
Also, you are not throwing out the first run; remember, the JIT compiler is going to jit the code that you call the first time through the loop, and it is going to be hot and ready to go the second time, so your results will therefore be skewed; you are averaging one potentially very large thing with many small things. In the past when I have done this I have discovered situations where the jit time actually dominated the time of everything else. Is that realistic? Do you mean to measure the jit time, or should it be not considered as part of the average?
I dunno how fast this is, but this is what I would do?
static bool HasSuffix(string check, string suffix)
{
int offset = check.Length - suffix.Length;
for (int i = 0; i < suffix.Length; i++)
{
if (check[offset + i] != suffix[i])
{
return false;
}
}
return true;
}
edit: OOPS x2
edit: So I wrote my own little benchmark... does this count? It runs 25 trials of evaluating one million strings and takes the average of the difference in performance. The handful of times I ran it it was consistently outputting that CharCompare was faster by ~10-40ms over one million records. So that is a hugely unimportant increase in efficiency (.000000001s/call) :) All in all I doubt it will matter which method you implement.
class Program
{
volatile static List<string> strings;
static double[] results = new double[25];
static void Main(string[] args)
{
strings = new List<string>();
Random r = new Random();
for (int rep = 0; rep < 25; rep++)
{
Console.WriteLine("Run " + rep);
strings.Clear();
for (int i = 0; i < 1000000; i++)
{
string temp = "";
for (int j = 0; j < r.Next(3, 101); j++)
{
temp += Convert.ToChar(
Convert.ToInt32(
Math.Floor(26 * r.NextDouble() + 65)));
}
if (i % 4 == 0)
{
temp += "abc";
}
strings.Add(temp);
}
OrdinalWorker ow = new OrdinalWorker(strings);
CharWorker cw = new CharWorker(strings);
if (rep % 2 == 0)
{
cw.Run();
ow.Run();
}
else
{
ow.Run();
cw.Run();
}
Thread.Sleep(1000);
results[rep] = ow.finish.Subtract(cw.finish).Milliseconds;
}
double tDiff = 0;
for (int i = 0; i < 25; i++)
{
tDiff += results[i];
}
double average = tDiff / 25;
if (average < 0)
{
average = average * -1;
Console.WriteLine("Char compare faster by {0}ms average",
average.ToString().Substring(0, 4));
}
else
{
Console.WriteLine("EndsWith faster by {0}ms average",
average.ToString().Substring(0, 4));
}
}
}
class OrdinalWorker
{
List<string> list;
int count;
public Thread t;
public DateTime finish;
public OrdinalWorker(List<string> l)
{
list = l;
}
public void Run()
{
t = new Thread(() => {
string suffix = "abc";
for (int i = 0; i < list.Count; i++)
{
count = (list[i].EndsWith(suffix, StringComparison.Ordinal)) ?
count + 1 : count;
}
finish = DateTime.Now;
});
t.Start();
}
}
class CharWorker
{
List<string> list;
int count;
public Thread t;
public DateTime finish;
public CharWorker(List<string> l)
{
list = l;
}
public void Run()
{
t = new Thread(() =>
{
string suffix = "abc";
for (int i = 0; i < list.Count; i++)
{
count = (HasSuffix(list[i], suffix)) ? count + 1 : count;
}
finish = DateTime.Now;
});
t.Start();
}
static bool HasSuffix(string check, string suffix)
{
int offset = check.Length - suffix.Length;
for (int i = 0; i < suffix.Length; i++)
{
if (check[offset + i] != suffix[i])
{
return false;
}
}
return true;
}
}
Did you try direct access ?
I mean, you can make a loop watching for similar string, it could be faster than make a substring and having the same behaviour.
int i,j;
foreach(String testing in lists){
i=0;
j=0;
int ok=1;
while(ok){
i = testing.lenght - PATTERN.lenght;
if(i>0 && i<testing.lenght && testing[i] != PATTERN[j])
ok = 0;
i++;
j++;
}
if(ok) return testing;
}
Moreover if it's big strings, you could try using hashs.
I don't profess to be an expert in this area, however I felt compelled to at least profile this to some extent (knowing full well that my fictitious scenario will differ substantially from your own) and here is what I came up with:
It seems, at least for me, EndsWith takes the lead with LastIndexOf consistently coming in second, some timings are:
SubString: 00:00:00.0191877
Contains: 00:00:00.0201980
CompareInfo: 00:00:00.0255181
EndsWith: 00:00:00.0120296
LastIndexOf: 00:00:00.0133181
These were gleaned from processing 100,000 strings where the desired suffix appeared in all strings and so to me simply echoes Jon's answer (where the benefit is both speed and descriptiveness). And the code used to come to these results:
class Program
{
class Profiler
{
private Stopwatch Stopwatch = new Stopwatch();
public TimeSpan Elapsed { get { return Stopwatch.Elapsed; } }
public void Start()
{
Reset();
Stopwatch.Start();
}
public void Stop()
{
Stopwatch.Stop();
}
public void Reset()
{
Stopwatch.Reset();
}
}
static string suffix = "_sfx";
static Profiler profiler = new Profiler();
static List<string> input = new List<string>();
static List<string> output = new List<string>();
static void Main(string[] args)
{
GenerateSuffixedStrings();
FindStringsWithSuffix_UsingSubString(input, suffix);
Console.WriteLine("SubString: {0}", profiler.Elapsed);
FindStringsWithSuffix_UsingContains(input, suffix);
Console.WriteLine("Contains: {0}", profiler.Elapsed);
FindStringsWithSuffix_UsingCompareInfo(input, suffix);
Console.WriteLine("CompareInfo: {0}", profiler.Elapsed);
FindStringsWithSuffix_UsingEndsWith(input, suffix);
Console.WriteLine("EndsWith: {0}", profiler.Elapsed);
FindStringsWithSuffix_UsingLastIndexOf(input, suffix);
Console.WriteLine("LastIndexOf: {0}", profiler.Elapsed);
Console.WriteLine();
Console.WriteLine("Press any key to exit...");
Console.ReadKey();
}
static void GenerateSuffixedStrings()
{
for (var i = 0; i < 100000; i++)
{
input.Add(Guid.NewGuid().ToString() + suffix);
}
}
static void FindStringsWithSuffix_UsingSubString(IEnumerable<string> strings, string suffix)
{
output.Clear();
profiler.Start();
foreach (var s in strings)
{
if(s.Substring(s.Length - 4) == suffix)
output.Add(s);
}
profiler.Stop();
}
static void FindStringsWithSuffix_UsingContains(IEnumerable<string> strings, string suffix)
{
output.Clear();
profiler.Start();
foreach (var s in strings)
{
if (s.Contains(suffix))
output.Add(s);
}
profiler.Stop();
}
static void FindStringsWithSuffix_UsingCompareInfo(IEnumerable<string> strings, string suffix)
{
var ci = CompareInfo.GetCompareInfo("en-GB");
output.Clear();
profiler.Start();
foreach (var s in strings)
{
if (ci.IsSuffix(s, suffix))
output.Add(s);
}
profiler.Stop();
}
static void FindStringsWithSuffix_UsingEndsWith(IEnumerable<string> strings, string suffix)
{
output.Clear();
profiler.Start();
foreach (var s in strings)
{
if (s.EndsWith(suffix))
output.Add(s);
}
profiler.Stop();
}
static void FindStringsWithSuffix_UsingLastIndexOf(IEnumerable<string> strings, string suffix)
{
output.Clear();
profiler.Start();
foreach (var s in strings)
{
if (s.LastIndexOf(suffix) == s.Length - 4)
output.Add(s);
}
profiler.Stop();
}
}
EDIT:
As commented, I attempted this again with only some of the strings having a suffix applied and these are the results:
SubString: 00:00:00.0079731
Contains: 00:00:00.0243696
CompareInfo: 00:00:00.0334056
EndsWith: 00:00:00.0196668
LastIndexOf: 00:00:00.0229599
The string generator method was updated as follows, to produce the strings:
static void GenerateSuffixedStrings()
{
var nxt = false;
var rnd = new Random();
for (var i = 0; i < 100000; i++)
{
input.Add(Guid.NewGuid().ToString() +
(rnd.Next(0, 2) == 0 ? suffix : string.Empty));
}
}
Further, this trend continues if none of the string have a suffix:
SubString: 00:00:00.0055584
Contains: 00:00:00.0187089
CompareInfo: 00:00:00.0228983
EndsWith: 00:00:00.0114227
LastIndexOf: 00:00:00.0199328
However, this gap shortens again when assigning a quarter of the inputs a suffix (the first quarter, then sorting to randomise the coverage):
SubString: 00:00:00.0302997
Contains: 00:00:00.0305685
CompareInfo: 00:00:00.0306335
EndsWith: 00:00:00.0351229
LastIndexOf: 00:00:00.0322899
Conclusion? IMO, and agreeing with Jon, EndsWith seems the way to go (based on this limited test, anyway).
Further Edit:
To cure Jon's curiosity I ran a few more tests on EndsWith, with and without Ordinal string comparison...
On 100,000 strings with a quarter of them suffixed:
EndsWith: 00:00:00.0795617
OrdinalEndsWith: 00:00:00.0240631
On 1,000,000 strings with a quarter of them suffixed:
EndsWith: 00:00:00.5460591
OrdinalEndsWith: 00:00:00.2807860
On 10,000,000 strings with a quarter of them suffixed:
EndsWith: 00:00:07.5889581
OrdinalEndsWith: 00:00:03.3248628
Note that I only ran the last test once as generating the strings proved this laptop is in need of a replacement
There's a lot of good information here. I wanted to note that if your suffix is short, it could be even faster to look at the last few characters individually. My modified version of the benchmark code in question is here: http://pastebin.com/6nNdbEvW. It gives theses results:
Last char equality: 1.52 ms (50000)
Last 2 char equality: 1.56 ms (50000)
EndsWith using StringComparison.Ordinal: 3.75 ms (50000)
Contains: 11.10 ms (50000)
LastIndexOf: 14.85 ms (50000)
IsSuffix: 11.30 ms (50000)
Substring compare: 17.69 ms (50000)

Categories