What is the fastest method of checking string suffixes in C#?
I need to check each string in a large list (anywhere from 5000 to 100000 items) for a particular term. The term is guaranteed never to be embedded within the string. In other words, if the string contains the term, it will be at the end of the string. The string is also guaranteed to be longer than the suffix. Cultural information is not important.
These are how different methods performed against 100000 strings (half of them have the suffix):
1. Substring Comparison - 13.60ms
2. String.Contains - 22.33ms
3. CompareInfo.IsSuffix - 24.60ms
4. String.EndsWith - 29.08ms
5. String.LastIndexOf - 30.68ms
These are average times. [Edit] Forgot to mention that the strings also get put into separate lists, but this is not important. It does add to the running time though.
On my system substring comparison (extracting the end of the string using the String.Substring method and comparing it to the suffix term) is consistently the fastest when tested against 100000 strings. The problem with using substring comparison though is that Garbage Collection can slow it down considerably (more than the other methods) because String.Substring creates new strings. The effect is not as bad in .NET 4.0 as it was in 3.5 and below, but it is still noticeable. In my tests, String.Substring performed consistently slower on sets of 12000-13000 strings. This will obviously differ between systems and implementations.
[EDIT]
Benchmark code:
http://pastebin.com/smEtYNYN
[EDIT]
FlyingStreudel's code runs fast, but Jon Skeet's recommendation of using EndsWith in conjunction with StringComparison.Ordinal appears to be the best option.
If that's the time taken to check 100,000 strings, does it really matter?
Personally I'd use string.EndsWith on the grounds that it's the most descriptive: it says exactly what you're trying to test.
I'm somewhat suspicious of the fact that it appears to be performing worst though... if you could post your benchmark code, that would be very useful. (In particular, it really shouldn't have to do as much work as string.Contains.)
Have you tried specifying an ordinal match? That may well make it significantly faster:
if (x.EndsWith(y, StringComparison.Ordinal))
Of course, you shouldn't do that unless you want an ordinal comparison - are you expecting culturally-sensitive matches? (Developers tend not to consider this sort of thing, and I very firmly include myself in that category.)
Jon is absolutely right; this is potentially not an apples-to-apples comparison because different string methods have different defaults for culteral sensitivity. Be very sure that you are getting the comparison semantics you intend to in each one.
In addition to Jon's answer, I'd add that the relevant question is not "which is fastest?" but rather "which is too slow?" What's your performance goal for this code? The slowest method still finds the result in less time than it takes a movie projector to advance to the next frame, and obviously that is not noticable by humans. If your goal is that the search appears instantaneous to the user then you're done; any of those methods work. If your goal is that the search take less than a millisecond then none of those methods work; they are all orders of magnitude too slow. What's the budget?
I took a look at your benchmark code and frankly, it looks dodgy.
You are measuring all kinds of extraneous things along with what it is you want to measure; you're measuring the cost of the foreach and the adding to a list, both of which might have costs of the same order of magnitude as the thing you are attempting to test.
Also, you are not throwing out the first run; remember, the JIT compiler is going to jit the code that you call the first time through the loop, and it is going to be hot and ready to go the second time, so your results will therefore be skewed; you are averaging one potentially very large thing with many small things. In the past when I have done this I have discovered situations where the jit time actually dominated the time of everything else. Is that realistic? Do you mean to measure the jit time, or should it be not considered as part of the average?
I dunno how fast this is, but this is what I would do?
static bool HasSuffix(string check, string suffix)
{
int offset = check.Length - suffix.Length;
for (int i = 0; i < suffix.Length; i++)
{
if (check[offset + i] != suffix[i])
{
return false;
}
}
return true;
}
edit: OOPS x2
edit: So I wrote my own little benchmark... does this count? It runs 25 trials of evaluating one million strings and takes the average of the difference in performance. The handful of times I ran it it was consistently outputting that CharCompare was faster by ~10-40ms over one million records. So that is a hugely unimportant increase in efficiency (.000000001s/call) :) All in all I doubt it will matter which method you implement.
class Program
{
volatile static List<string> strings;
static double[] results = new double[25];
static void Main(string[] args)
{
strings = new List<string>();
Random r = new Random();
for (int rep = 0; rep < 25; rep++)
{
Console.WriteLine("Run " + rep);
strings.Clear();
for (int i = 0; i < 1000000; i++)
{
string temp = "";
for (int j = 0; j < r.Next(3, 101); j++)
{
temp += Convert.ToChar(
Convert.ToInt32(
Math.Floor(26 * r.NextDouble() + 65)));
}
if (i % 4 == 0)
{
temp += "abc";
}
strings.Add(temp);
}
OrdinalWorker ow = new OrdinalWorker(strings);
CharWorker cw = new CharWorker(strings);
if (rep % 2 == 0)
{
cw.Run();
ow.Run();
}
else
{
ow.Run();
cw.Run();
}
Thread.Sleep(1000);
results[rep] = ow.finish.Subtract(cw.finish).Milliseconds;
}
double tDiff = 0;
for (int i = 0; i < 25; i++)
{
tDiff += results[i];
}
double average = tDiff / 25;
if (average < 0)
{
average = average * -1;
Console.WriteLine("Char compare faster by {0}ms average",
average.ToString().Substring(0, 4));
}
else
{
Console.WriteLine("EndsWith faster by {0}ms average",
average.ToString().Substring(0, 4));
}
}
}
class OrdinalWorker
{
List<string> list;
int count;
public Thread t;
public DateTime finish;
public OrdinalWorker(List<string> l)
{
list = l;
}
public void Run()
{
t = new Thread(() => {
string suffix = "abc";
for (int i = 0; i < list.Count; i++)
{
count = (list[i].EndsWith(suffix, StringComparison.Ordinal)) ?
count + 1 : count;
}
finish = DateTime.Now;
});
t.Start();
}
}
class CharWorker
{
List<string> list;
int count;
public Thread t;
public DateTime finish;
public CharWorker(List<string> l)
{
list = l;
}
public void Run()
{
t = new Thread(() =>
{
string suffix = "abc";
for (int i = 0; i < list.Count; i++)
{
count = (HasSuffix(list[i], suffix)) ? count + 1 : count;
}
finish = DateTime.Now;
});
t.Start();
}
static bool HasSuffix(string check, string suffix)
{
int offset = check.Length - suffix.Length;
for (int i = 0; i < suffix.Length; i++)
{
if (check[offset + i] != suffix[i])
{
return false;
}
}
return true;
}
}
Did you try direct access ?
I mean, you can make a loop watching for similar string, it could be faster than make a substring and having the same behaviour.
int i,j;
foreach(String testing in lists){
i=0;
j=0;
int ok=1;
while(ok){
i = testing.lenght - PATTERN.lenght;
if(i>0 && i<testing.lenght && testing[i] != PATTERN[j])
ok = 0;
i++;
j++;
}
if(ok) return testing;
}
Moreover if it's big strings, you could try using hashs.
I don't profess to be an expert in this area, however I felt compelled to at least profile this to some extent (knowing full well that my fictitious scenario will differ substantially from your own) and here is what I came up with:
It seems, at least for me, EndsWith takes the lead with LastIndexOf consistently coming in second, some timings are:
SubString: 00:00:00.0191877
Contains: 00:00:00.0201980
CompareInfo: 00:00:00.0255181
EndsWith: 00:00:00.0120296
LastIndexOf: 00:00:00.0133181
These were gleaned from processing 100,000 strings where the desired suffix appeared in all strings and so to me simply echoes Jon's answer (where the benefit is both speed and descriptiveness). And the code used to come to these results:
class Program
{
class Profiler
{
private Stopwatch Stopwatch = new Stopwatch();
public TimeSpan Elapsed { get { return Stopwatch.Elapsed; } }
public void Start()
{
Reset();
Stopwatch.Start();
}
public void Stop()
{
Stopwatch.Stop();
}
public void Reset()
{
Stopwatch.Reset();
}
}
static string suffix = "_sfx";
static Profiler profiler = new Profiler();
static List<string> input = new List<string>();
static List<string> output = new List<string>();
static void Main(string[] args)
{
GenerateSuffixedStrings();
FindStringsWithSuffix_UsingSubString(input, suffix);
Console.WriteLine("SubString: {0}", profiler.Elapsed);
FindStringsWithSuffix_UsingContains(input, suffix);
Console.WriteLine("Contains: {0}", profiler.Elapsed);
FindStringsWithSuffix_UsingCompareInfo(input, suffix);
Console.WriteLine("CompareInfo: {0}", profiler.Elapsed);
FindStringsWithSuffix_UsingEndsWith(input, suffix);
Console.WriteLine("EndsWith: {0}", profiler.Elapsed);
FindStringsWithSuffix_UsingLastIndexOf(input, suffix);
Console.WriteLine("LastIndexOf: {0}", profiler.Elapsed);
Console.WriteLine();
Console.WriteLine("Press any key to exit...");
Console.ReadKey();
}
static void GenerateSuffixedStrings()
{
for (var i = 0; i < 100000; i++)
{
input.Add(Guid.NewGuid().ToString() + suffix);
}
}
static void FindStringsWithSuffix_UsingSubString(IEnumerable<string> strings, string suffix)
{
output.Clear();
profiler.Start();
foreach (var s in strings)
{
if(s.Substring(s.Length - 4) == suffix)
output.Add(s);
}
profiler.Stop();
}
static void FindStringsWithSuffix_UsingContains(IEnumerable<string> strings, string suffix)
{
output.Clear();
profiler.Start();
foreach (var s in strings)
{
if (s.Contains(suffix))
output.Add(s);
}
profiler.Stop();
}
static void FindStringsWithSuffix_UsingCompareInfo(IEnumerable<string> strings, string suffix)
{
var ci = CompareInfo.GetCompareInfo("en-GB");
output.Clear();
profiler.Start();
foreach (var s in strings)
{
if (ci.IsSuffix(s, suffix))
output.Add(s);
}
profiler.Stop();
}
static void FindStringsWithSuffix_UsingEndsWith(IEnumerable<string> strings, string suffix)
{
output.Clear();
profiler.Start();
foreach (var s in strings)
{
if (s.EndsWith(suffix))
output.Add(s);
}
profiler.Stop();
}
static void FindStringsWithSuffix_UsingLastIndexOf(IEnumerable<string> strings, string suffix)
{
output.Clear();
profiler.Start();
foreach (var s in strings)
{
if (s.LastIndexOf(suffix) == s.Length - 4)
output.Add(s);
}
profiler.Stop();
}
}
EDIT:
As commented, I attempted this again with only some of the strings having a suffix applied and these are the results:
SubString: 00:00:00.0079731
Contains: 00:00:00.0243696
CompareInfo: 00:00:00.0334056
EndsWith: 00:00:00.0196668
LastIndexOf: 00:00:00.0229599
The string generator method was updated as follows, to produce the strings:
static void GenerateSuffixedStrings()
{
var nxt = false;
var rnd = new Random();
for (var i = 0; i < 100000; i++)
{
input.Add(Guid.NewGuid().ToString() +
(rnd.Next(0, 2) == 0 ? suffix : string.Empty));
}
}
Further, this trend continues if none of the string have a suffix:
SubString: 00:00:00.0055584
Contains: 00:00:00.0187089
CompareInfo: 00:00:00.0228983
EndsWith: 00:00:00.0114227
LastIndexOf: 00:00:00.0199328
However, this gap shortens again when assigning a quarter of the inputs a suffix (the first quarter, then sorting to randomise the coverage):
SubString: 00:00:00.0302997
Contains: 00:00:00.0305685
CompareInfo: 00:00:00.0306335
EndsWith: 00:00:00.0351229
LastIndexOf: 00:00:00.0322899
Conclusion? IMO, and agreeing with Jon, EndsWith seems the way to go (based on this limited test, anyway).
Further Edit:
To cure Jon's curiosity I ran a few more tests on EndsWith, with and without Ordinal string comparison...
On 100,000 strings with a quarter of them suffixed:
EndsWith: 00:00:00.0795617
OrdinalEndsWith: 00:00:00.0240631
On 1,000,000 strings with a quarter of them suffixed:
EndsWith: 00:00:00.5460591
OrdinalEndsWith: 00:00:00.2807860
On 10,000,000 strings with a quarter of them suffixed:
EndsWith: 00:00:07.5889581
OrdinalEndsWith: 00:00:03.3248628
Note that I only ran the last test once as generating the strings proved this laptop is in need of a replacement
There's a lot of good information here. I wanted to note that if your suffix is short, it could be even faster to look at the last few characters individually. My modified version of the benchmark code in question is here: http://pastebin.com/6nNdbEvW. It gives theses results:
Last char equality: 1.52 ms (50000)
Last 2 char equality: 1.56 ms (50000)
EndsWith using StringComparison.Ordinal: 3.75 ms (50000)
Contains: 11.10 ms (50000)
LastIndexOf: 14.85 ms (50000)
IsSuffix: 11.30 ms (50000)
Substring compare: 17.69 ms (50000)
Related
I've tried changing my sort into a recursive function where the method calls itself. At least that's my understanding of recursion to forego for loops and the method calls itself to repeat the necessary iterations.
Below is my iterative verion:
for (int i = 0; i < Integers.Count; i++) //loops through all the numbers
{
min = i; //setting the current index number to be the minimum
for (int index = i + 1; index < Integers.Count; index++) //finds and checks pos of min value
{ //and switches it with last element using the swap method
if ((int) Integers[index] > (int) Integers[min]) {
min = index;
}
comparisons++;
}
if (i != min) //if statement for if swop is done
{
Swap(i, min, Integers, ref swops); //swap method called
}
//Swap method called
}
I've tried making it recursive. I read online that it was OK to still have for loops in a recursive funtion which I guess is not true. I just havent been able to develop a working sort. Am I going to need to split the method into 2 where one method traverses a list and the other does the sort?
Here's my selection sort recursive method attempt below:
static void DoSelectionSortRecursive(ArrayList Integers, int i, int swops, int comparisons) {
int min;
min = i;
for (int index = i + 1; index < Integers.Count; index++) //read online that the use of arraylists are deprecated, and i shoudlve rather used List<int> in order to remedy the code. is it necassary
{
if ((int) Integers[index] > (int) Integers[min]) {
min = index;
}
comparisons++;
}
if (i != min) {
Swap(i, min, Integers, ref swops);
}
DoSelectionSortRecursive(Integers, (i + 1), comparisons, swops); //DoSelectionSortRecursive method called
}
This is my imporved attempt including performance measures and everything. The original list of integers in the unsorted lists. 84,24,13,10,37.
and im getting 84,24,13,37,10. clearly not in a sorted descending order.
below is the improved code
static void DoSelectionSortRecursive(ArrayList Integers)
{
Stopwatch timer = new Stopwatch();
timer.Start();
int shifts = 0;
int swops = 0;
int comparisons = 0;
Sort(Integers, 1,ref swops,ref comparisons);
timer.Stop();
Console.WriteLine("Selection Sort Recursive: ");
Console.WriteLine(timer.ElapsedMilliseconds);
Console.WriteLine(swops);
Console.WriteLine(comparisons);
Console.WriteLine(shifts); //not needed in this sort
Console.WriteLine("-------------------------------------");
foreach (int i in Integers)
{
Console.WriteLine(i);
}
}
static void Sort(ArrayList Integers, int i, ref int swops, ref int comparisons)
{
int min = i;
int index = i + 1;
if (index < Integers.Count) //read online that the use of arraylists are deprecated, and i shoudlve rather used List<int> in order to remedy the code. is it necassary
{
if ((int)Integers[index] > (int)Integers[min])
{
min = index;
}
comparisons++;
index++;
}
if (i != min)
{
Swap(i, min, Integers, ref swops);
}
if (i < Integers.Count - 1)
{
Sort(Integers, (i + 1), ref comparisons, ref swops); //DoSelectionSortRecursive method called
}
}
static void Swap(int x, int y, ArrayList Integers, ref int swap) //swap method, swaps the position of 2 elements
{
swap++;
int temporary = (int)Integers[x]; //essentially will swap the min with the current position
Integers[x] = Integers[y];
Integers[y] = temporary;
}
There are no "rules" about recursion that say you cannot use loops in the recursive method body. The only rule in recursion is that the function has to call itself, which your second code snippet does, so DoSelectionSortRecursive is legitimately recursive.
For example, merge sort uses recursion for splitting the array and loops for merging the sorted subarrays. It'd be wrong to call it anything but a recursive function, and it'd be somewhat silly to implement the merging stage (an implementation detail of merge sort) recursively -- it'd be slower and harder to reason about, so loops are the natural choice.
On the other hand, the splitting part of merge sort makes sense to write recursively because it chops the problem space down by a logarithmic factor and has multiple branches. The repeated halving means it won't need to make more than a few or a dozen recursive calls on a typical array. These calls don't incur much overhead and fit well within the call stack.
On the other hand, the call stack can easily blow for linear recursive algorithms in languages without tail-call optimization like C# where each index in the linear structure requires a whole stack frame.
Rules prohibiting loops are concoted by educators who are trying to teach recursion by forcing you to use a specific approach in your solution. It's up to your instructor to determine whether one or both loops need to be converted to recursion for it to "count" as far as the course is concerned. (apologies if my assumptions about your educational arrangement are incorrect)
All that is to say that this requirement to write a nested-loop sort recursively is basically a misapplication of recursion for pedagogical purposes. In the "real world", you'd just write it iteratively and be done with it, as Google does in the V8 JavaScript engine, which uses insertion sort on small arrays. I suspect there are many other cases, but this is the one I'm most readily familiar with.
The point with using simple, nested loop sorts in performance-sensitive production code is that they're not recursive. These sorts' advantage is that they avoid allocating stack frames and incurring function call overhead to sort small arrays of a dozen numbers where the quadratic time complexity isn't a significant factor. When the array is mostly sorted, insertion sort in particular doesn't have to do much work and is mostly a linear walk over the array (sometimes a drawback in certain real-time applications that need predictable performance, in which case selection sort might be preferable -- see Wikipedia).
Regarding ArrayLists, the docs say: "We don't recommend that you use the ArrayList class for new development. Instead, we recommend that you use the generic List<T> class." So you can basically forget about ArrayList unless you're doing legacy code (Note: Java does use ArrayLists which are more akin to the C# List. std::list isn't an array in C++, so it can be confusing to keep all of this straight).
It's commendable that you've written your sort iteratively first, then translated to recursion on the outer loop only. It's good to start with what you know and get something working, then gradually transform it to meet the new requirements.
Zooming out a bit, we can isolate the role this inner loop plays when we pull it out as a function, then write and test it independent of the selection sort we hope to use it in. After the subroutine works on its own, then selection sort can use it as a black box and the overall design is verifiable and modular.
More specifically, the role of this inner loop is to find the minimum value beginning at an index: int IndexOfMin(List<int> lst, int i = 0). The contract is that it'll throw an ArgumentOutOfRangeException error if the precondition 0 <= i < lst.Count is violated.
I skipped the metrics variables for simplicity but added a random test harness that gives a pretty reasonable validation against the built-in sort.
using System;
using System.Collections.Generic;
using System.Linq;
class Sorter
{
private static void Swap(List<int> lst, int i, int j)
{
int temp = lst[i];
lst[i] = lst[j];
lst[j] = temp;
}
private static int IndexOfMin(List<int> lst, int i = 0)
{
if (i < 0 || i >= lst.Count)
{
throw new ArgumentOutOfRangeException();
}
else if (i == lst.Count - 1)
{
return i;
}
int bestIndex = IndexOfMin(lst, i + 1);
return lst[bestIndex] < lst[i] ? bestIndex : i;
}
public static void SelectionSort(List<int> lst, int i = 0)
{
if (i < lst.Count)
{
Swap(lst, i, IndexOfMin(lst, i));
SelectionSort(lst, i + 1);
}
}
public static void Main(string[] args)
{
var rand = new Random();
int tests = 1000;
int lstSize = 100;
int randMax = 1000;
for (int i = 0; i < tests; i++)
{
var lst = new List<int>();
for (int j = 0; j < lstSize; j++)
{
lst.Add(rand.Next(randMax));
}
var actual = new List<int>(lst);
SelectionSort(actual);
lst.Sort();
if (!lst.SequenceEqual(actual))
{
Console.WriteLine("FAIL:");
Console.WriteLine($"Expected => {String.Join(",", lst)}");
Console.WriteLine($"Actual => {String.Join(",", actual)}\n");
}
}
}
}
Here's a more generalized solution that uses generics and CompareTo so that you can sort any list of objects that implement the IComparable interface. This functionality is more akin to the built-in sort.
using System;
using System.Collections.Generic;
using System.Linq;
class Sorter
{
public static void Swap<T>(List<T> lst, int i, int j)
{
T temp = lst[i];
lst[i] = lst[j];
lst[j] = temp;
}
public static int IndexOfMin<T>(List<T> lst, int i = 0)
where T : IComparable<T>
{
if (i < 0 || i >= lst.Count)
{
throw new ArgumentOutOfRangeException();
}
else if (i == lst.Count - 1)
{
return i;
}
int bestIndex = IndexOfMin(lst, i + 1);
return lst[bestIndex].CompareTo(lst[i]) < 0 ? bestIndex : i;
}
public static void SelectionSort<T>(List<T> lst, int i = 0)
where T : IComparable<T>
{
if (i < lst.Count)
{
Swap(lst, i, IndexOfMin(lst, i));
SelectionSort(lst, i + 1);
}
}
public static void Main(string[] args)
{
// same as above
}
}
Since you asked how to smush both of the recursive functions into one, it's possible by keeping track of both i and j indices in the parameter list and adding a branch to figure out whether to deal with the inner or outer loop on a frame. For example:
public static void SelectionSort<T>(
List<T> lst,
int i = 0,
int j = 0,
int minJ = 0
) where T : IComparable<T>
{
if (i >= lst.Count)
{
return;
}
else if (j < lst.Count)
{
minJ = lst[minJ].CompareTo(lst[j]) < 0 ? minJ : j;
SelectionSort(lst, i, j + 1, minJ);
}
else
{
Swap(lst, i, minJ);
SelectionSort(lst, i + 1, i + 1, i + 1);
}
}
All of the code shown in this post is not suitable for production -- the point is to illustrate what not to do.
I have an extension method to remove certain characters from a string (a phone number) which is performing much slower than I think it should vs chained Replace calls. The weird bit, is that in a loop it overtakes the Replace thing if the loop runs for around 3000 iterations, and after that it's faster. Lower than that and chaining Replace is faster. It's like there's a fixed overhead to my code which Replace doesn't have. What could this be!?
Quick look. When only testing 10 numbers, mine takes about 0.3ms, while Replace takes only 0.01ms. A massive difference! But when running 5 million, mine takes around 1700ms while Replace takes about 2500ms.
Phone numbers will only have 0-9, +, -, (, )
Here's the relevant code:
Building test cases, I'm playing with testNums.
int testNums = 5_000_000;
Console.WriteLine("Building " + testNums + " tests");
Random rand = new Random();
string[] tests = new string[testNums];
char[] letters =
{
'0','1','2','3','4','5','6','7','8','9',
'+','-','(',')'
};
for(int t = 0; t < tests.Length; t++)
{
int length = rand.Next(5, 20);
char[] word = new char[length];
for(int c = 0; c < word.Length; c++)
{
word[c] = letters[rand.Next(letters.Length)];
}
tests[t] = new string(word);
}
Console.WriteLine("Tests built");
string[] stripped = new string[tests.Length];
Using my extension method:
Stopwatch stopwatch = Stopwatch.StartNew();
for (int i = 0; i < stripped.Length; i++)
{
stripped[i] = tests[i].CleanNumberString();
}
stopwatch.Stop();
Console.WriteLine("Clean: " + stopwatch.Elapsed.TotalMilliseconds + "ms");
Using chained Replace:
stripped = new string[tests.Length];
stopwatch = Stopwatch.StartNew();
for (int i = 0; i < stripped.Length; i++)
{
stripped[i] = tests[i].Replace(" ", string.Empty)
.Replace("-", string.Empty)
.Replace("(", string.Empty)
.Replace(")", string.Empty)
.Replace("+", string.Empty);
}
stopwatch.Stop();
Console.WriteLine("Replace: " + stopwatch.Elapsed.TotalMilliseconds + "ms");
Extension method in question:
public static string CleanNumberString(this string s)
{
Span<char> letters = stackalloc char[s.Length];
int count = 0;
for (int i = 0; i < s.Length; i++)
{
if (s[i] >= '0' && s[i] <= '9')
letters[count++] = s[i];
}
return new string(letters.Slice(0, count));
}
What I've tried:
I've run them around the other way. Makes a tiny difference, but not enough.
Make it a normal static method, which was significantly slower than extension. As a ref parameter was slightly slower, and in parameter was about the same as extension method.
Aggressive Inlining. Doesn't make any real difference. I'm in release mode, so I suspect the compiler inlines it anyway. Either way, not much change.
I have also looked at memory allocations, and that's as I expect. My one allocates on the managed heap only one string per iteration (the new string at the end) which Replace allocates a new object for each Replace. So the memory used by the Replace one is much, higher. But it's still faster!
Is it calling native C code and doing something crafty there? Is the higher memory usage triggering the GC and slowing it down (still doesn't explane the insanely fast time on only one or two iterations)
Any ideas?
(Yes, I know not to bother optimising things like this, it's just bugging me because I don't know why it's doing this)
After doing some benchmarks, I think can safely assert that your initial statement is wrong for the exact reason you mentionned in your deleted answer: the loading time of the method is the only thing that misguided you.
Here's the full benchmark on a simplified version of the problem:
static void Main(string[] args)
{
// Build string of n consecutive "ab"
int n = 1000;
Console.WriteLine("N: " + n);
char[] c = new char[n];
for (int i = 0; i < n; i+=2)
c[i] = 'a';
for (int i = 1; i < n; i += 2)
c[i] = 'b';
string s = new string(c);
Stopwatch stopwatch;
// Make sure everything is loaded
s.CleanNumberString();
s.Replace("a", "");
s.UnsafeRemove();
// Tests to remove all 'a' from the string
// Unsafe remove
stopwatch = Stopwatch.StartNew();
string a1 = s.UnsafeRemove();
stopwatch.Stop();
Console.WriteLine("Unsafe remove:\t" + stopwatch.Elapsed.TotalMilliseconds + "ms");
// Extension method
stopwatch = Stopwatch.StartNew();
string a2 = s.CleanNumberString();
stopwatch.Stop();
Console.WriteLine("Clean method:\t" + stopwatch.Elapsed.TotalMilliseconds + "ms");
// String replace
stopwatch = Stopwatch.StartNew();
string a3 = s.Replace("a", "");
stopwatch.Stop();
Console.WriteLine("String.Replace:\t" + stopwatch.Elapsed.TotalMilliseconds + "ms");
// Make sure the returned strings are identical
Console.WriteLine(a1.Equals(a2) && a2.Equals(a3));
Console.ReadKey();
}
public static string CleanNumberString(this string s)
{
char[] letters = new char[s.Length];
int count = 0;
for (int i = 0; i < s.Length; i++)
if (s[i] == 'b')
letters[count++] = 'b';
return new string(letters.SubArray(0, count));
}
public static T[] SubArray<T>(this T[] data, int index, int length)
{
T[] result = new T[length];
Array.Copy(data, index, result, 0, length);
return result;
}
// Taken from https://stackoverflow.com/a/2183442/6923568
public static unsafe string UnsafeRemove(this string s)
{
int len = s.Length;
char* newChars = stackalloc char[len];
char* currentChar = newChars;
for (int i = 0; i < len; ++i)
{
char c = s[i];
switch (c)
{
case 'a':
continue;
default:
*currentChar++ = c;
break;
}
}
return new string(newChars, 0, (int)(currentChar - newChars));
}
When ran with different values of n, it is clear that your extension method (or at least my somewhat equivalent version of it) has a logic that makes it faster than String.Replace(). In fact, it is more performant on either small or big strings:
N: 100
Unsafe remove: 0,0024ms
Clean method: 0,0015ms
String.Replace: 0,0021ms
True
N: 100000
Unsafe remove: 0,3889ms
Clean method: 0,5308ms
String.Replace: 1,3993ms
True
I highly suspect optimizations for the replacement of strings (not to be compared to removal) in String.Replace() to be the culprit here. I also added a method from this answer to have another comparison on removal of characters. That one's times behave similarly to your method but gets faster on higher values (80k+ on my tests) of n.
With all that being said, since your question is based on an assumption that we found was false, if you need more explanation on why the opposite is true (i.e. "Why is String.Replace() slower than my method"), plenty of in-depth benchmarks about string manipulation already do so.
I ran the clean method a couple more. interestingly, it is a lot faster than the Replace. Only the first time run was slower. Sorry that I couldn't explain why it's slower the first time but I ran more of the method then the result was expected.
Building 100 tests
Tests built
Replace: 0.0528ms
Clean: 0.4526ms
Clean: 0.0413ms
Clean: 0.0294ms
Replace: 0.0679ms
Replace: 0.0523ms
used dotnet core 2.1
So I've found with help from daehee Kim and Mat below that it's only the first iteration, but it's for the whole first loop. Every loop after there is ok.
I use the following line to force the JIT to do its thing and initialise this method:
RuntimeHelpers.PrepareMethod(typeof(CleanExtension).GetMethod("CleanNumberString", BindingFlags.Public | BindingFlags.Static).MethodHandle);
I find the JIT usually takes about 2-3ms to do its thing here (including Reflection time of about 0.1ms). Note that you should probably not be doing this because you're now getting the Reflection cost as well, and the JIT will be called right after this anyway, but it's probably a good idea for benchmarks to fairly compare.
The more you know!
My benchmark for a loop of 5000 iterations, repeated 5000 times with random strings and averaged is:
Clean: 0.41078ms
Replace: 1.4974ms
I'm trying to build a Reg Expression where if the textbox string contains two periods anywhere it will execute my code. This is what I've got so far:
Regex word = new Regex("(\\.){2,}");
if (word.IsMatch(textBoxSearch.Text))
{
//my code here to execute
}
However, it only executes when there are two periods together and not anywhere within the string...
There is no need for regex here, just use LINQ!
myString.Count(x => x == '.') == 2
Or for 2 or more:
myString.Where(x => x == '.').Skip(1).Any()
If performance is crucial, you should use a loop. Here is a comparison of the three approaches (LINQ, loop, regex):
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
namespace Experiment
{
public static class Program
{
static bool hasTwoPeriodsLinq(string text)
{
return text.Count(x => x == '.') == 2;
}
static bool hasTwoPeriodsLoop(string text)
{
int count = 0;
for (int i = 0; i < text.Length; i++)
{
if (text[i] == '.')
{
// This early break makes the loop faster than regex
if (count == 2)
{
return false;
}
count++;
}
}
return count == 2;
}
static Regex twoPeriodsRegex = new Regex(#"^.*\..*\..*$", RegexOptions.Compiled);
static bool hasTwoPeriodsRegex(string text)
{
return twoPeriodsRegex.IsMatch(text);
}
public static void Main(string[] args)
{
var text = #"The young Princess Bolk6nskaya had
brought some work in a gold-embroidered vel-
vet bag. Her pretty little upper lip, on which
a delicate dark down was just perceptible, was
too short for her teeth, but it lifted all the more
sweetly, and was especially charming when she
occasionally drew it down to meet the lower
lip. As is always the case with a thoroughly at-
tractive woman, her defectthe shortness of
her upperlip and her half-open mouth seemed
to be her own special and peculiar form of
beauty. Everyone brightened at the sight of
this pretty young woman, so soon to become
a mother, so full of life and health, and carry-
ing her burden so lightly. Old men and dull
dispirited young ones who looked at her, after
being in her company and talking to her a
litttle while, felt as if they too were becoming,
like her, full of life and health. All who talked
to her, and at each word saw her bright smile
and the constant gleam of her white teeth,
thought that they were in a specially amiable
mood that day. ";
const int iterations = 100000;
// Warm up...
for (int i = 0; i < iterations; i++)
{
hasTwoPeriodsLinq(text);
hasTwoPeriodsLoop(text);
hasTwoPeriodsRegex(text);
}
var watch = System.Diagnostics.Stopwatch.StartNew();
// hasTwoPeriodsLinq
watch.Restart();
for (int i = 0; i < iterations; i++)
{
hasTwoPeriodsLinq(text);
}
watch.Stop();
Console.WriteLine("hasTwoPeriodsLinq " + watch.ElapsedMilliseconds);
// hasTwoPeriodsLoop
watch.Restart();
for (int i = 0; i < iterations; i++)
{
hasTwoPeriodsLoop(text);
}
watch.Stop();
Console.WriteLine("hasTwoPeriodsLoop " + watch.ElapsedMilliseconds);
// hasTwoPeriodsRegex
watch.Restart();
for (int i = 0; i < iterations; i++)
{
hasTwoPeriodsRegex(text);
}
watch.Stop();
Console.WriteLine("hasTwoPeriodsRegex " + watch.ElapsedMilliseconds);
}
}
}
Try it here.
And the results:
hasTwoPeriodsLinq 1280
hasTwoPeriodsLoop 54
hasTwoPeriodsRegex 74
Try this:
int count = source.Count(f => f == '.');
If count == 2, you're all good.
You should declare two periods and anything except period around and between them:
[^\.]*\.[^\.]*\.[^\.]*
This works according to my tests:
^.*\..*\..*$
Any character zero or more times followed by a period followed by any character zero or more times followed by a period followed by any character zero or more times.
Of course, as others have pointed out, using Regex here is not the most efficient or readable way of doing it. Regex has a learning curve and future programmers may not appreciate a less-than-straightforward approach given that there are simpler alternatives.
if you want to use regex, then you can use the Regex.Matches to check the count.
if(Regex.Matches(stringinput, #"\.").Count == 2 )
{
//perform your operation
}
Several people have given examples to test for exactly 2 but here's an example to test for at least 2 periods. You could actually easily modify this to test for exactly 2 if you wanted to too.
(.*\..*){2}
I just did a simple test in .NET Fiddle of sorting 100 random integer arrays of length 1000 and seeing whether doing so with a Paralell.ForEach loop is faster than a plain old foreach loop.
Here is my code (I put this together fast, so please ignore the repetition and overall bad look of the code)
using System;
using System.Net;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using System.Linq;
public class Program
{
public static int[] RandomArray(int minval, int maxval, int arrsize)
{
Random randNum = new Random();
int[] rand = Enumerable
.Repeat(0, arrsize)
.Select(i => randNum.Next(minval, maxval))
.ToArray();
return rand;
}
public static void SortOneThousandArraysSync()
{
var arrs = new List<int[]>(100);
for(int i = 0; i < 100; ++i)
arrs.Add(RandomArray(Int32.MinValue,Int32.MaxValue,1000));
Parallel.ForEach(arrs, (arr) =>
{
Array.Sort(arr);
});
}
public static void SortOneThousandArraysAsync()
{
var arrs = new List<int[]>(100);
for(int i = 0; i < 100; ++i)
arrs.Add(RandomArray(Int32.MinValue,Int32.MaxValue,1000));
foreach(var arr in arrs)
{
Array.Sort(arr);
};
}
public static void Main()
{
var start = DateTime.Now;
SortOneThousandArraysSync();
var end = DateTime.Now;
Console.WriteLine("t1 = " + (end - start).ToString());
start = DateTime.Now;
SortOneThousandArraysAsync();
end = DateTime.Now;
Console.WriteLine("t2 = " + (end - start).ToString());
}
}
and here are the results after hitting Run twice:
t1 = 00:00:00.0156244
t2 = 00:00:00.0156243
...
t1 = 00:00:00.0467854
t2 = 00:00:00.0156246
...
So, sometimes it's faster and sometimes it's about the same.
Possible explanations:
The random arrays were "more unsorted" for the sync one versus the async one in the 2nd test I ran
It has something to do with the processes running on .NET Fiddle. In the first case the parallel one basically ran like a non-parallel operation because there weren't any threads for my fiddle to take over. (Or something like that)
Thoughts?
You should only use Parallel.ForEach() if the code within the loop takes a significant amount of time to execute. In this case, it takes more time to create multiple threads, sort the array, and then combine the result onto one thread than it is to simply sort it on a single thread. For example, the Parallel.ForEach() in the following code snippet takes less time to execute than the normal ForEach loop:
public static void Main(string[] args)
{
var numbers = Enumerable.Range(1, 10000);
Parallel.ForEach(numbers, n => Factorial(n));
foreach (var number in numbers)
{
Factorial(number);
}
}
private static int Factorial(int number)
{
if (number == 1 || number == 0)
return 1;
return number * Factorial(number - 1);
}
However, if I change var numbers = Enumerable.Range(1, 10000); to var numbers = Enumerable.Range(1, 1000);, the ForEach loop is faster than Parallel.ForEach().
When working with small tasks (which don't take a significant amount of time to execute) have a look at Partitioner class; in your case:
public static void SortOneThousandArraysAsyncWithPart() {
var arrs = new List<int[]>(100);
for (int i = 0; i < 100; ++i)
arrs.Add(RandomArray(Int32.MinValue, Int32.MaxValue, 1000));
// Let's spread the tasks between threads manually with a help of Partitioner.
// We don't want task stealing and other optimizations: just split the
// list between 8 (on my workstation) threads and run them
Parallel.ForEach(Partitioner.Create(0, 100), part => {
for (int i = part.Item1; i < part.Item2; ++i)
Array.Sort(arrs[i]);
});
}
I get the following results (i7 3.2GHz 4 cores HT, .Net 4.6 IA-64) - averaged by 100 runs:
0.0081 Async (foreach)
0.0119 Parallel.ForEach
0.0084 Parallel.ForEach + Partitioner
as you can see, foreach is still on the top, but Parallel.ForEach + Partitioner is very close to the winner
Checking performance of algorithms is a tricky business, and performance at small scale can easily be affected by a variety of factors external to your code. Please see my answer to an almost-duplicate question here for an in-depth explanation, plus some links to benchmarking templates that you can adapt to better measure your algorithm's performance.
I'm doing some work with strings, and I have a scenario where I need to determine if a string (usually a small one < 10 characters) contains repeated characters.
`ABCDE` // does not contain repeats
`AABCD` // does contain repeats, ie A is repeated
I can loop through the string.ToCharArray() and test each character against every other character in the char[], but I feel like I am missing something obvious.... maybe I just need coffee. Can anyone help?
EDIT:
The string will be sorted, so order is not important so ABCDA => AABCD
The frequency of repeats is also important, so I need to know if the repeat is pair or triplet etc.
If the string is sorted, you could just remember each character in turn and check to make sure the next character is never identical to the last character.
Other than that, for strings under ten characters, just testing each character against all the rest is probably as fast or faster than most other things. A bit vector, as suggested by another commenter, may be faster (helps if you have a small set of legal characters.)
Bonus: here's a slick LINQ solution to implement Jon's functionality:
int longestRun =
s.Select((c, i) => s.Substring(i).TakeWhile(x => x == c).Count()).Max();
So, OK, it's not very fast! You got a problem with that?!
:-)
If the string is short, then just looping and testing may well be the simplest and most efficient way. I mean you could create a hash set (in whatever platform you're using) and iterate through the characters, failing if the character is already in the set and adding it to the set otherwise - but that's only likely to provide any benefit when the strings are longer.
EDIT: Now that we know it's sorted, mquander's answer is the best one IMO. Here's an implementation:
public static bool IsSortedNoRepeats(string text)
{
if (text.Length == 0)
{
return true;
}
char current = text[0];
for (int i=1; i < text.Length; i++)
{
char next = text[i];
if (next <= current)
{
return false;
}
current = next;
}
return true;
}
A shorter alternative if you don't mind repeating the indexer use:
public static bool IsSortedNoRepeats(string text)
{
for (int i=1; i < text.Length; i++)
{
if (text[i] <= text[i-1])
{
return false;
}
}
return true;
}
EDIT: Okay, with the "frequency" side, I'll turn the problem round a bit. I'm still going to assume that the string is sorted, so what we want to know is the length of the longest run. When there are no repeats, the longest run length will be 0 (for an empty string) or 1 (for a non-empty string). Otherwise, it'll be 2 or more.
First a string-specific version:
public static int LongestRun(string text)
{
if (text.Length == 0)
{
return 0;
}
char current = text[0];
int currentRun = 1;
int bestRun = 0;
for (int i=1; i < text.Length; i++)
{
if (current != text[i])
{
bestRun = Math.Max(currentRun, bestRun);
currentRun = 0;
current = text[i];
}
currentRun++;
}
// It's possible that the final run is the best one
return Math.Max(currentRun, bestRun);
}
Now we can also do this as a general extension method on IEnumerable<T>:
public static int LongestRun(this IEnumerable<T> source)
{
bool first = true;
T current = default(T);
int currentRun = 0;
int bestRun = 0;
foreach (T element in source)
{
if (first || !EqualityComparer<T>.Default(element, current))
{
first = false;
bestRun = Math.Max(currentRun, bestRun);
currentRun = 0;
current = element;
}
}
// It's possible that the final run is the best one
return Math.Max(currentRun, bestRun);
}
Then you can call "AABCD".LongestRun() for example.
This will tell you very quickly if a string contains duplicates:
bool containsDups = "ABCDEA".Length != s.Distinct().Count();
It just checks the number of distinct characters against the original length. If they're different, you've got duplicates...
Edit: I guess this doesn't take care of the frequency of dups you noted in your edit though... but some other suggestions here already take care of that, so I won't post the code as I note a number of them already give you a reasonably elegant solution. I particularly like Joe's implementation using LINQ extensions.
Since you're using 3.5, you could do this in one LINQ query:
var results = stringInput
.ToCharArray() // not actually needed, I've left it here to show what's actually happening
.GroupBy(c=>c)
.Where(g=>g.Count()>1)
.Select(g=>new {Letter=g.First(),Count=g.Count()})
;
For each character that appears more than once in the input, this will give you the character and the count of occurances.
I think the easiest way to achieve that is to use this simple regex
bool foundMatch = false;
foundMatch = Regex.IsMatch(yourString, #"(\w)\1");
If you need more information about the match (start, length etc)
Match match = null;
string testString = "ABCDE AABCD";
match = Regex.Match(testString, #"(\w)\1+?");
if (match.Success)
{
string matchText = match.Value; // AA
int matchIndnex = match.Index; // 6
int matchLength = match.Length; // 2
}
How about something like:
string strString = "AA BRA KA DABRA";
var grp = from c in strString.ToCharArray()
group c by c into m
select new { Key = m.Key, Count = m.Count() };
foreach (var item in grp)
{
Console.WriteLine(
string.Format("Character:{0} Appears {1} times",
item.Key.ToString(), item.Count));
}
Update Now, you'd need an array of counters to maintain a count.
Keep a bit array, with one bit representing a unique character. Turn the bit on when you encounter a character, and run over the string once. A mapping of the bit array index and the character set is upto you to decide. Break if you see that a particular bit is on already.
/(.).*\1/
(or whatever the equivalent is in your regex library's syntax)
Not the most efficient, since it will probably backtrack to every character in the string and then scan forward again. And I don't usually advocate regular expressions. But if you want brevity...
I started looking for some info on the net and I got to the following solution.
string input = "aaaaabbcbbbcccddefgg";
char[] chars = input.ToCharArray();
Dictionary<char, int> dictionary = new Dictionary<char,int>();
foreach (char c in chars)
{
if (!dictionary.ContainsKey(c))
{
dictionary[c] = 1; //
}
else
{
dictionary[c]++;
}
}
foreach (KeyValuePair<char, int> combo in dictionary)
{
if (combo.Value > 1) //If the vale of the key is greater than 1 it means the letter is repeated
{
Console.WriteLine("Letter " + combo.Key + " " + "is repeated " + combo.Value.ToString() + " times");
}
}
I hope it helps, I had a job interview in which the interviewer asked me to solve this and I understand it is a common question.
When there is no order to work on you could use a dictionary to keep the counts:
String input = "AABCD";
var result = new Dictionary<Char, int>(26);
var chars = input.ToCharArray();
foreach (var c in chars)
{
if (!result.ContainsKey(c))
{
result[c] = 0; // initialize the counter in the result
}
result[c]++;
}
foreach (var charCombo in result)
{
Console.WriteLine("{0}: {1}",charCombo.Key, charCombo.Value);
}
The hash solution Jon was describing is probably the best. You could use a HybridDictionary since that works well with small and large data sets. Where the letter is the key and the value is the frequency. (Update the frequency every time the add fails or the HybridDictionary returns true for .Contains(key))