I'm trying to speed up the following:
string s; //--> s is never null
if (s.Length != 0)
{
<do something>
}
Problem is, it appears the .Length actually counts the characters in the string, and this is way more work than I need. Anybody have an idea on how to speed this up?
Or, is there a way to determine if s[0] exists, w/out checking the rest of the string?
EDIT: Now that you've provided some more context:
Trying to reproduce this, I failed to find a bottleneck in string.Length at all. The only way of making it faster was to comment out both the test and the body of the if block - which isn't really fair. Just commenting out the condition slowed things down, i.e. unconditionally copying the reference was slower than checking the condition.
As has been pointed out, using the overload of string.Split which removes empty entries for you is the real killer optimization.
You can go further, by avoiding creating a new char array with just a space in every time. You're always going to pass the same thing effectively, so why not take advantage of that?
Empty arrays are effectively immutable. You can optimize the null/empty case by always returning the same thing.
The optimized code becomes:
private static readonly char[] Delimiters = " ".ToCharArray();
private static readonly string[] EmptyArray = new string[0];
public static string[] SplitOnMultiSpaces(string text)
{
if (string.IsNullOrEmpty(text))
{
return EmptyArray;
}
return text.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries);
}
String.Length absolutely does not count the letters in the string. The value is stored as a field - although I seem to remember that the top bit of that field is used to remember whether or not all characters are ASCII (or used to be, anyway) to enable other optimisations. So the property access may need to do a bitmask, but it'll still be O(1) and I'd expect the JIT to inline it, too. (It's implemented as an extern, but hopefully that wouldn't affect the JIT in this case - I suspect it's a common enough operation to potentially have special support.)
If you already know that the string isn't null, then your existing test of
if (s.Length != 0)
is the best way to go if you're looking for raw performance IMO. Personally in most cases I'd write:
if (s != "")
to make it clearer that we're not so much interested in the length as a value as whether or not this is the empty string. That will be slightly slower than the length test, but I believe it's clearer. As ever, I'd go for the clearest code until you have benchmark/profiling data to indicate that this really is a bottleneck. I know your question is explicitly about finding the most efficient test, but I thought I'd mention this anyway. Do you have evidence that this is a bottleneck?
EDIT: Just to give clearer reasons for my suggestion of not using string.IsNullOrEmpty: a call to that method suggests to me that the caller is explicitly trying to deal with the case where the variable is null, otherwise they wouldn't have mentioned it. If at this point of the code it counts as a bug if the variable is null, then you shouldn't be trying to handle it as a normal case.
In this situation, the Length check is actually better in one way than the inequality test I've suggested: it acts as an implicit assertion that the variable isn't null. If you have a bug and it is null, the test will throw an exception and the bug will be detected early. If you use the equality test it will treat null as being different to the empty string, so it will go into your "if" statement's body. If you use string.IsNullOrEmpty it will treat null as being the same as empty, so it won't go into the block.
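The difference between the three tests when the variable is unexpectedly null can be sketched as follows. This is only an illustration; the class and method names (EmptyChecks, ByLength, ByInequality, ByIsNullOrEmpty) are made up for this example.

```csharp
using System;

public static class EmptyChecks
{
    // Length test: throws NullReferenceException when s is null,
    // so a null bug surfaces immediately.
    public static bool ByLength(string s)
    {
        return s.Length != 0;
    }

    // Inequality test: null is "not the empty string",
    // so this returns true for null and the if-body would run.
    public static bool ByInequality(string s)
    {
        return s != "";
    }

    // IsNullOrEmpty: null and "" are treated alike,
    // so this returns false for null and the if-body would be skipped.
    public static bool ByIsNullOrEmpty(string s)
    {
        return !string.IsNullOrEmpty(s);
    }
}
```

For a non-null string all three agree; they only differ in what happens to a null that should never have been there.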
String.IsNullOrEmpty is the preferred method for checking for null or zero length strings.
Internally, it will use Length. The Length property for a string should not be calculated on the fly though.
If you're absolutely certain that the string will never be null and you have some strong objection to String.IsNullOrEmpty, the most efficient code I can think of would be:
if(s.Length > 0)
{
// Do Something
}
Or, possibly even better:
if(s != "")
{
// Do Something
}
Accessing the Length property shouldn't do a count -- .NET strings store a count inside the object.
The SSCLI/Rotor source code contains an interesting comment which suggests that String.Length is (a) efficient and (b) magic:
// Gets the length of this string
//
/// This is a EE implemented function so that the JIT can recognise is specially
/// and eliminate checks on character fetchs in a loop like:
/// for(int I = 0; I < str.Length; i++) str[i]
/// The actually code generated for this will be one instruction and will be inlined.
//
public extern int Length {
[MethodImplAttribute(MethodImplOptions.InternalCall)]
get;
}
Here is the function String.IsNullOrEmpty -
if (!String.IsNullOrEmpty(yourstring))
{
// your code
}
String.IsNullOrWhiteSpace(s);
true if s is null or Empty, or if s consists exclusively of white-space characters.
As always with performance: benchmark.
Using .NET 3.5 or earlier, you'll want to test yourString.Length vs String.IsNullOrEmpty(yourString)
Using .NET 4 or later, do both of the above and add String.IsNullOrWhiteSpace(yourString)
Of course, if you know your string will never be empty, you could just attempt to access s[0] and handle the exception when it's not there. That's not normally good practice, but it may be closer to what you need (if s should always have a non-blank value).
for (int i = 0; i < 100; i++)
{
System.Diagnostics.Stopwatch timer = new System.Diagnostics.Stopwatch();
string s = "dsfasdfsdafasd";
timer.Start();
if (s.Length > 0)
{
}
timer.Stop();
System.Diagnostics.Debug.Write(String.Format("s.Length != 0 {0} ticks ", timer.ElapsedTicks));
timer.Reset();
timer.Start();
if (s == String.Empty)
{
}
timer.Stop();
System.Diagnostics.Debug.WriteLine(String.Format("s== String.Empty {0} ticks", timer.ElapsedTicks));
}
After fixing the code, the Stopwatch shows that s.Length != 0 takes fewer ticks than s == String.Empty.
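A single if statement is likely to be too short for a Stopwatch to measure reliably, so a benchmark of this kind usually repeats the check many times inside one timed region. The following is a rough sketch of that idea, not a rigorous benchmark; the class and method names are hypothetical, and the hit counter is there only to make it less likely that the JIT discards the loop body.

```csharp
using System.Diagnostics;

public static class EmptyCheckBenchmark
{
    // Times `iterations` repetitions of the Length test and returns elapsed ticks.
    public static long TimeLengthCheck(string s, int iterations, out int hits)
    {
        int count = 0;
        Stopwatch timer = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            if (s.Length != 0) count++;
        }
        timer.Stop();
        hits = count;
        return timer.ElapsedTicks;
    }

    // Times `iterations` repetitions of the inequality test and returns elapsed ticks.
    public static long TimeInequalityCheck(string s, int iterations, out int hits)
    {
        int count = 0;
        Stopwatch timer = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            if (s != string.Empty) count++;
        }
        timer.Stop();
        hits = count;
        return timer.ElapsedTicks;
    }
}
```

Running each method once before measuring (to get JIT compilation out of the way) and using a release build without a debugger attached should give more representative numbers.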
Based on your intent described in your answer, why don't you just try using this built-in option on Split:
s.Split(new[]{" "}, StringSplitOptions.RemoveEmptyEntries);
Just use String.Split(new char[]{' '}, StringSplitOptions.RemoveEmptyEntries) and it will do it all for you.
Related
I am writing a custom string split. It will split on a dot(.) that is not preceded by an odd number of backslashes (\).
«string» -> «IEnumerable<string>»
"hello.world" -> "hello", "world"
"abc\.123" -> "abc\.123"
"aoeui\\.dhtns" -> "aoeui\\","dhtns"
I would like to know if there is a substring that will reuse the original string (for speed), or is there an existing split that can do this fast?
This is what I have, but it is 2 to 3 times slower than input.Split('.'), where input is a string. (I know it is a slightly more complex problem, but not by that much.)
public IEnumerable<string> HandMadeSplit(string input)
{
var Result = new LinkedList<string>();
var word = new StringBuilder();
foreach (var ch in input)
{
if (ch == '.')
{
Result.AddLast(word.ToString());
word.Length = 0;
}
else
{
word.Append(ch);
}
}
Result.AddLast(word.ToString());
return Result;
}
It now uses List instead of LinkedList, records the beginning and end of each substring, and uses string.Substring to create the new substrings. This helps a lot and is nearly as fast as string.Split, even with my adjustments added. (will add code)
The loop that you show is the right approach if you need performance. (Regex wouldn't be).
Switch to an index-based for-loop. Remember the index of the start of the match. Don't append individual chars. Instead, remember the range of characters to copy out and do that with a single Substring call per item.
Also, don't use a LinkedList. It is slower than a List for almost all cases except random-access mutations.
You might also switch from List to a normal array that you resize with Array.Resize. This results in slightly tedious code (because you have inlined a part of the List class into your method) but it cuts out some small overheads.
Next, don't return an IEnumerable because that forces the caller through indirection when accessing its items. Return a List or an array.
This is the one I eventually settled on. It is not as fast as string.Split, but it is good enough and can be modified to do what I want.
private IEnumerable<string> HandMadeSplit2b(string input)
{
//this one is marginally better than the second best (2), but makes the resolver (its client) much faster, nearly as fast as the original.
var Result = new List<string>();
var begining = 0;
var len = input.Length;
for (var index=0;index<len;index++)
{
if (input[index] == '.')
{
Result.Add(input.Substring(begining,index-begining));
begining = index+1;
}
}
Result.Add(input.Substring(begining));
return Result;
}
You shouldn't try to use string.Split for that.
If you need help to implement it, a simple way to solve this is to have a loop that scans the string, keeping track of the last place where you found a qualifying dot. When you find a new qualifying dot (or reach the end of the input string), just yield return the current substring.
Edit: about returning a list or an array vs. using yield
If, in your application, the most important thing is the time spent by the caller on iterating the substrings, then you should populate a list or an array and return that, as suggested in the accepted answer. I would not use a resizable array while collecting the substrings because this would be slow.
On the other hand, if you care about the overall performance and about memory, and if sometimes the caller doesn't have to iterate over the entire list, you should use yield return. When you use yield return, you have the advantage that no code at all is executing until the caller has called MoveNext (directly or indirectly through a foreach). This means that you save the memory for allocating the array/list, and you save the time spent on allocating/resizing/populating the list. You will be spending time almost only on the logic of finding the substrings, and this will be done lazily, that is - only when actually needed because the caller continues to iterate the substrings.
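The scanning loop with yield return described above could look roughly like this. It is a minimal sketch, assuming the rule from the question (a dot splits unless it is preceded by an odd number of backslashes); the class name EscapedDotSplitter is invented for the example.

```csharp
using System.Collections.Generic;

public static class EscapedDotSplitter
{
    // Lazily yields the pieces of input, splitting on every '.' that is not
    // preceded by an odd number of backslashes.
    public static IEnumerable<string> Split(string input)
    {
        int start = 0;        // index where the current piece begins
        int backslashes = 0;  // length of the backslash run just before the current char

        for (int i = 0; i < input.Length; i++)
        {
            char ch = input[i];
            if (ch == '\\')
            {
                backslashes++;
                continue;
            }
            if (ch == '.' && backslashes % 2 == 0)
            {
                // One Substring call per piece; nothing is copied char by char.
                yield return input.Substring(start, i - start);
                start = i + 1;
            }
            backslashes = 0;
        }
        // The last piece runs to the end of the input.
        yield return input.Substring(start);
    }
}
```

Because this is an iterator, no work happens until the caller starts enumerating, and a caller that stops early never pays for the remaining pieces. A caller that needs indexed access can wrap the result in new List<string>(...).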
I'm trying to parse a large text string. I need to split the original string in blocks of 15 characters(and the next block might contain white spaces, so the trim function is used). I'm using two strings, the original and a temporary one. This temp string is used to store each 15 length block.
I wonder if I could fall into a performance issue because strings are immutable. This is the code:
string original = "THIS IS SUPPOSE TO BE A LONG STRING AN I NEED TO SPLIT IT IN BLOCKS OF 15 CHARACTERS.SO";
string temp = string.Empty;
while (original.Length != 0)
{
temp = original.Substring(0, 14).Trim();
original = original.Substring(14, (original.Length -14)).Trim();
}
I appreciate your feedback in order to find a best way to achieve this functionality.
You'll get slightly better performance like this (but whether the performance gain will be significant is another matter entirely):
for (var startIndex = 0; startIndex < original.Length; startIndex += 15)
{
temp = original.Substring(startIndex, Math.Min(original.Length - startIndex, 15)).Trim();
}
This performs better because you're not copying the last all-but-15-characters of the original string with each loop iteration.
EDIT
To advance the index to the next non-whitespace character, you can do something like this:
for (var startIndex = 0; startIndex < original.Length; )
{
if (char.IsWhiteSpace(original, startIndex))
{
startIndex++;
continue;
}
temp = original.Substring(startIndex, Math.Min(original.Length - startIndex, 15)).Trim();
startIndex += 15;
}
I think you are right about the immutable issue - recreating 'original' each time is probably not the fastest way.
How about passing 'original' into a StringReader class?
If your original string is longer than a few thousand chars, you'll have noticeable (>0.1s) processing time and a lot of GC pressure. The first Substring call is fine, and I don't think you can avoid it unless you go deep inside System.String and mess around with m_FirstChar. The second Substring can be avoided completely by going char-by-char and iterating over an int index.
In general, if you would run this on bigger data such code might be problematic, it of course depends on your needs.
In general, it might be a good idea to use the StringBuilder class, which will allow you to operate on strings in a "more mutable" way without a performance hit, like removing from its beginning without reallocating the whole string.
In your example, however, I would consider throwing out the line that takes a substring from original and substituting it with some code that updates an index pointing to where you should take the next substring from. Then the while condition would just check whether your index is at the end of the string, and your temp assignment would take the substring not from 0 to 14 but from i, where i is this index.
However - don't optimize code if you don't have to, I'm assuming here that you need more performance and you want to sacrifice some time and/or write a bit less understandable code for more efficiency.
I had an unexpected behaviour from C#/WPF
private void ButtonUp_Click(object sender, RoutedEventArgs e)
{
int quant;
if( int.TryParse(Qnt.Text, out quant))
{
string s = ((quant++).ToString());
Qnt.Text = s;
}
}
So, if I get quant as 1, quant will be incremented to 2. But the s string will be 1. Is this a question of precedence?
EDIT:
I re-wrote this as:
quant++;
Qnt.Text = quant.ToString();
and now this works as I expected.
You are using the post-increment operator. This evaluates to the original value, and then increments. To do what you want in a one-liner you can use the pre-increment operator instead.
(++quant).ToString();
But even better would be to avoid all such pitfalls and do it like this:
quant++;
string s = quant.ToString();
With the first version you have to think about the order in which things happen. In the second version no thought is required. Always value code clarity more highly than conciseness.
It's easy to believe that the one-line version is somehow faster, but that's not true. It might have been true back in the day in 1970s C systems, but even then I doubt it.
The problem is that you're using a post-increment instead of a pre-increment... but why would you want to write this convoluted code? Just separate out the side-effect (incrementing) and the ToString call:
if (int.TryParse(Qnt.Text, out quant))
{
quant++;
Qnt.Text = quant.ToString();
}
Or even forego the actual increment given that you're not going to read the value again:
if (int.TryParse(Qnt.Text, out quant))
{
Qnt.Text = (quant + 1).ToString();
}
Where possible, avoid using compound assignment in the middle of other expressions. It generally leads to pain.
Additionally, it feels like all this parsing and formatting is hiding the real model, which is that there should be an int property somewhere, which might be reflected in the UI. For example:
private void ButtonUp_Click(object sender, RoutedEventArgs e)
{
// This is an int property
Quantity++;
// Now reflect the change in the UI. Ideally, do this through binding
// instead.
Qnt.Text = Quantity.ToString();
}
Now I'll do something that shouldn't be done... I'll try to simplify what Eric Lippert wrote here What is the difference between i++ and ++i? I hope I'm not writing anything too much wrong :-)
Now... What do the pre-increment and post-increment operators do? Simplifying and ignoring all the copies that are made in-between (and remembering that they aren't atomic operators in multi-threaded environments):
both of them are expressions (like i + 1) that return a result (like i + 1) but that have a side-effect (unlike i + 1). The side-effect is that they increment the variable i. The big question is "in which order everything happens?" The answer is quite simple:
pre increment ++i: increments i and returns the new value of i
post increment i++: increments i and returns the old value of i
Now... The important part is that the increment of i always happens first. Then a value (the old or the new) is returned.
Let's make an example. (Lippert's example is quite complex. I'll make a different, simpler example that isn't as complete but is enough to check whether the order I described is right or not.) Technically I'll make two examples.
Example 1:
unchecked
{
int i = Int32.MaxValue;
Console.WriteLine("Hello! I'm trying to do my work here {0}", i++);
Console.WriteLine("Work done {0}", i);
}
Example 2:
checked
{
int i = Int32.MaxValue;
Console.WriteLine("Hello! I'm trying to do my work here {0}", i++);
Console.WriteLine("Work done {0}", i);
}
checked means that if there is an overflow an exception (OverflowException) will be thrown. unchecked means that the same operation won't throw an exception. Int32.MaxValue + 1 surely will overflow. With checked there will be an exception, with unchecked i will wrap around to Int32.MinValue (-2147483648).
Let's try running the first code piece. Result:
Hello! I'm trying to do my work here 2147483647
Work done -2147483648
Ok... The i was incremented but the Console.WriteLine received the old value (Int32.MaxValue == 2147483647). From this example we can't determine the order of the post-increment and of the calling of Console.WriteLine.
Let's try running the second code piece. Result:
System.OverflowException: Arithmetic operation resulted in an overflow.
Ok... It's quite clear that first the post-increment was executed, caused an exception, and then clearly the Console.WriteLine wasn't executed (because the program ended).
So we know that the order I said is the right one.
Now. What should you learn from this example? The same thing I learned many years ago. Pre and post increments in C and C# are good for obfuscated code contests. They aren't good for many other things (but note that C++ is different!). From that lesson I learned that there are exactly two places where you can use post-increment freely, and there are exactly zero places where you can use pre-increment freely.
"Safe" post-increment
for (int i = 0; i < x; i++)
and
i++; // Written alone. Nothing else on the same line but a comment if necessary.
"Safe" pre-increment
(nothing)
In this case, ToString() is called on the value quant had before the increment, so s receives the old value even though quant itself has been incremented.
If you write ((++quant).ToString()) the first step will be incrementing quant and then quant.ToString() will be called.
string s = ((quant++).ToString());
can be broken down as
save the current value of quant, then
increment quant, then
call ToString() on the saved (old) value and assign the result to s.
Try ++quant instead.
I'm aiming for speed; this must be ultra fast.
string s = something;
for (int j = 0; j < s.Length; j++)
{
if (s[j] == 'ь')
if(s.Length>(j+1))
if(s[j+1] != 'о')
s[j] = 'ъ';
}
It gives me the error "Property or indexer 'string.this[int]' cannot be assigned to -- it is read only"
How do I do it the fastest way?
Fast way? Use a StringBuilder.
Fastest way? Always pass around a char* and a length instead of a string so you can modify the buffer in-place, but make sure you don't ever modify any string object.
There are at least two options:
Use a StringBuilder and keep track of the previous character.
You could just use a regular expression "ь(?!о)" or a simple string replacement of "ьо" depending on what your needs are (your question seems self-contradictory).
I tested the performance of a StringBuilder approach versus regular expressions and there is very little difference - at most a factor of 2:
Method                Iterations per second
StringBuilder                    153480.094
Regex (uncompiled)                90021.978
Regex (compiled)                 136355.787
string.Replace                  1427605.174
If performance is critical for you I would strongly recommend making some performance measurements before jumping to conclusions about what the fastest approach is.
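For reference, a StringBuilder version might look something like the sketch below. It mirrors the loop from the question (replace 'ь' with 'ъ' only when a following character exists and is not 'о') and looks ahead one character rather than tracking the previous one; the class name SoftSignReplacer is made up, and this is not the code that was benchmarked above.

```csharp
using System.Text;

public static class SoftSignReplacer
{
    // Returns a copy of s in which every 'ь' that is followed by a character
    // other than 'о' is replaced by 'ъ'. A trailing 'ь' is left unchanged,
    // as in the question's loop.
    public static string Replace(string s)
    {
        // The StringBuilder indexer is writable, unlike the string indexer.
        StringBuilder sb = new StringBuilder(s);
        for (int j = 0; j + 1 < sb.Length; j++)
        {
            if (sb[j] == 'ь' && sb[j + 1] != 'о')
            {
                sb[j] = 'ъ';
            }
        }
        return sb.ToString();
    }
}
```

Copying into a char[] with s.ToCharArray() and building the result with new string(chars) is a similar alternative; which one is faster is something to measure rather than assume.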
Strings in .Net are read-only. You could use StringBuilder.
A while back a post by Jon Skeet planted the idea in my head of building a CompiledFormatter class, for using in a loop instead of String.Format().
The idea is the portion of a call to String.Format() spent parsing the format string is overhead; we should be able to improve performance by moving that code outside of the loop. The trick, of course, is the new code should exactly match the String.Format() behavior.
This week I finally did it. I went through using the .Net framework source provided by Microsoft to do a direct adaption of their parser (it turns out String.Format() actually farms the work to StringBuilder.AppendFormat()). The code I came up with works, in that my results are accurate within my (admittedly limited) test data.
Unfortunately, I still have one problem: performance. In my initial tests the performance of my code closely matches that of the normal String.Format(). There's no improvement at all; it's even consistently a few milliseconds slower. At least it's still in the same order (ie: the amount slower doesn't increase; it stays within a few milliseconds even as the test set grows), but I was hoping for something better.
It's possible that the internal calls to StringBuilder.Append() are what actually drive the performance, but I'd like to see if the smart people here can help improve things.
Here is the relevant portion:
private class FormatItem
{
public int index; //index of item in the argument list. -1 means it's a literal from the original format string
public char[] value; //literal data from original format string
public string format; //simple format to use with supplied argument (ie: {0:X} for Hex)
// for fixed-width format (examples below)
public int width; // {0,7} means it should be at least 7 characters
public bool justify; // {0,-7} would use opposite alignment
}
//this data is all populated by the constructor
private List<FormatItem> parts = new List<FormatItem>();
private int baseSize = 0;
private string format;
private IFormatProvider formatProvider = null;
private ICustomFormatter customFormatter = null;
// the code in here very closely matches the code in the String.Format/StringBuilder.AppendFormat methods.
// Could it be faster?
public String Format(params Object[] args)
{
if (format == null || args == null)
throw new ArgumentNullException((format == null) ? "format" : "args");
var sb = new StringBuilder(baseSize);
foreach (FormatItem fi in parts)
{
if (fi.index < 0)
sb.Append(fi.value);
else
{
//if (fi.index >= args.Length) throw new FormatException(Environment.GetResourceString("Format_IndexOutOfRange"));
if (fi.index >= args.Length) throw new FormatException("Format_IndexOutOfRange");
object arg = args[fi.index];
string s = null;
if (customFormatter != null)
{
s = customFormatter.Format(fi.format, arg, formatProvider);
}
if (s == null)
{
if (arg is IFormattable)
{
s = ((IFormattable)arg).ToString(fi.format, formatProvider);
}
else if (arg != null)
{
s = arg.ToString();
}
}
if (s == null) s = String.Empty;
int pad = fi.width - s.Length;
if (!fi.justify && pad > 0) sb.Append(' ', pad);
sb.Append(s);
if (fi.justify && pad > 0) sb.Append(' ', pad);
}
}
return sb.ToString();
}
//alternate implementation (for comparative testing)
// my own tests call String.Format() separately: I don't use this. But it's useful to see
// how my format method fits.
public string OriginalFormat(params Object[] args)
{
return String.Format(formatProvider, format, args);
}
Additional notes:
I'm wary of providing the source code for my constructor, because I'm not sure of the licensing implications from my reliance on the original .Net implementation. However, anyone who wants to test this can just make the relevant private data public and assign values that mimic a particular format string.
Also, I'm very open to changing the FormatItem class and even the parts List if anyone has a suggestion that could improve the build time. Since my primary concern is sequential iteration time from front to end, maybe a LinkedList would fare better?
[Update]:
Hmm... something else I can try is adjusting my tests. My benchmarks were fairly simple: composing names to a "{lastname}, {firstname}" format and composing formatted phone numbers from the area code, prefix, number, and extension components. Neither of those have much in the way of literal segments within the string. As I think about how the original state machine parser worked, I think those literal segments are exactly where my code has the best chance to do well, because I no longer have to examine each character in the string.
Another thought:
This class is still useful, even if I can't make it go faster. As long as performance is no worse than the base String.Format(), I've still created a strongly-typed interface which allows a program to assemble its own "format string" at run time. All I need to do is provide public access to the parts list.
Here's the final result:
I changed the format string in a benchmark trial to something that should favor my code a little more:
The quick brown {0} jumped over the lazy {1}.
As I expected, this fares much better compared to the original: 2 million iterations in 5.3 seconds for this code vs 6.1 seconds for String.Format. This is an undeniable improvement. You might even be tempted to start using this as a no-brainer replacement for many String.Format situations. After all, you'll do no worse and you might even get a small performance boost: as much as 14%, and that's nothing to sneeze at.
Except that it is. Keep in mind, we're still talking less than half a second difference for 2 million attempts, under a situation specifically designed to favor this code. Not even busy ASP.Net pages are likely to create that much load, unless you're lucky enough to work on a top 100 web site.
Most of all, this omits one important alternative: you can create a new StringBuilder each time and manually handle your own formatting using raw Append() calls. With that technique my benchmark finished in only 3.9 seconds. That's a much greater improvement.
In summary, if performance doesn't matter as much, you should stick with the clarity and simplicity of the built-in option. But when in a situation where profiling shows this really is driving your performance, there is a better alternative available via StringBuilder.Append().
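To make the comparison concrete, hand-written formatting with raw Append() calls for the benchmark string might look something like this. It is only a sketch of the technique, not the code used for the timings above; the class and method names and the initial capacity of 64 are arbitrary choices for the example.

```csharp
using System.Text;

public static class ManualFormat
{
    // Hand-written equivalent of
    // String.Format("The quick brown {0} jumped over the lazy {1}.", animal, target)
    public static string QuickBrown(string animal, string target)
    {
        // No format string is parsed at run time: the literal segments and
        // the arguments are appended directly, in order.
        StringBuilder sb = new StringBuilder(64);
        sb.Append("The quick brown ");
        sb.Append(animal);
        sb.Append(" jumped over the lazy ");
        sb.Append(target);
        sb.Append('.');
        return sb.ToString();
    }
}
```

The cost is that the "format" is now spread across code instead of living in one readable string, which is the clarity trade-off the summary above refers to.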
Don't stop now!
Your custom formatter might only be slightly more efficient than the built-in API, but you can add more features to your own implementation that would make it more useful.
I did a similar thing in Java, and here are some of the features I added (besides just pre-compiled format strings):
1) The format() method accepts either a varargs array or a Map (in .NET, it'd be a dictionary). So my format strings can look like this:
StringFormatter f = StringFormatter.parse(
"the quick brown {animal} jumped over the {attitude} dog"
);
Then, if I already have my objects in a map (which is pretty common), I can call the format method like this:
String s = f.format(myMap);
2) I have a special syntax for performing regular expression replacements on strings during the formatting process:
// After calling obj.toString(), all space characters in the formatted
// object string are converted to underscores.
StringFormatter f = StringFormatter.parse(
"blah blah blah {0:/\\s+/_/} blah blah blah"
);
3) I have a special syntax that allows the formatter to check the argument for null-ness, applying a different formatter depending on whether the object is null or non-null.
StringFormatter f = StringFormatter.parse(
"blah blah blah {0:?'NULL'|'NOT NULL'} blah blah blah"
);
There are a zillion other things you can do. One of the tasks on my todo list is to add a new syntax where you can automatically format Lists, Sets, and other Collections by specifying a formatter to apply to each element as well as a string to insert between all elements. Something like this...
// Wraps each element in single-quote chars, separating
// adjacent elements with a comma.
StringFormatter f = StringFormatter.parse(
"blah blah blah {0:#['$'][,]} blah blah blah"
);
But the syntax is a little awkward and I'm not in love with it yet.
Anyhow, the point is that your existing class might not be much more efficient than the framework API, but if you extend it to satisfy all of your personal string-formatting needs, you might end up with a very convenient library in the end. Personally, I use my own version of this library for dynamically constructing all SQL strings, error messages, and localization strings. It's enormously useful.
It seems to me that in order to get actual performance improvement, you'd need to factor out any format analysis done by your customFormatter and formattable arguments into a function that returns some data structure that tells a later formatting call what to do. Then you pull those data structures in your constructor and store them for later use. Presumably this would involve extending ICustomFormatter and IFormattable. Seems kinda unlikely.
Have you accounted for the time to do the JIT compile as well? After all, the framework will be ngen'd which could account for the differences?
The framework provides explicit overloads of the format methods that take fixed-size parameter lists instead of the params object[] approach, to remove the overhead of allocating and collecting all of the temporary object arrays. You might want to consider that for your code as well. Also, providing strongly-typed overloads for common value types would reduce boxing overhead.
I gotta believe that spending as much time optimizing data IO would earn exponentially bigger returns!
This is surely a kissin' cousin to YAGNI for this. Avoid Premature Optimization. APO.