In my case, I'm using C#, but the concept of the question would apply to Java as well. Hopefully the answer would be generic enough to cover both languages. Otherwise it's better to split the question into two.
I've always wondered which one is the better practice.
Does the compiler take care of enhancing the 'second' code so its performance would be as good as the 'first' code?
Could it be worked around to get a 'better performance' and 'optimized' code at the same time?
Redundant but faster code:
string name = GetName(); // returned string could be empty
List<string> myListOfStrings = GetListOfStrings();
if(string.IsNullOrWhiteSpace(name))
{
foreach(string s in myListOfStrings)
Console.WriteLine(s);
}
else
{
foreach(string s in myListOfStrings)
Console.WriteLine(s + " (Name is: " + name);
}
Concise (DRY) but slower code:
string name = GetName(); // returned string could be empty
List<string> myListOfStrings = GetListOfStrings();
foreach(string s in myListOfStrings)
Console.WriteLine(string.IsNullOrWhiteSpace(name) ? s : s + " (Name is: " + name + ")");
Obviously the execution time of the 'first' code is less because it evaluates the condition 'string.IsNullOrWhiteSpace(name)' only once, before the loop, whereas the 'second' code (which is nicer) evaluates the condition on every iteration.
Please consider a long-running loop, not a short one, because I know that when it is short the performance won't differ.
Does the compiler take care of enhancing the 'second' code so its performance would be as good as the 'first' code?
No, it cannot.
It doesn't know that the boolean expression won't change between iterations of the loop. Since the code might not return the same value each time, the compiler is forced to perform the check in each iteration.
It's also possible that the boolean expression could have side effects. In this case it doesn't, but there's no way for the compiler to know that. Such side effects must be performed in order to meet the language specification, so the check needs to execute in each iteration.
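For instance, here is a hedged sketch (hypothetical names) of a condition the compiler cannot hoist out of the loop, because the loop body itself changes the state the condition reads:
using System;
using System.Collections.Generic;

var flags = new List<string>();
bool HasItems() => flags.Count > 0;   // reads mutable state

foreach (var s in new[] { "a", "b", "c" })
{
    if (HasItems())          // must be re-evaluated each iteration...
        Console.WriteLine(s);
    else
        flags.Add(s);        // ...because this changes its next result
}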
So, the next question you need to ask is: in a case such as this, is it important to perform the optimization you've mentioned? In any situation I can imagine for the exact code you showed, probably not. The check is simply going to be so fast that it's almost certainly not going to be a bottleneck. If there are performance problems, there are almost certainly bigger fish to fry.
That said, with only a few changes to the example it can be made to matter. If the boolean expression itself is computationally expensive (e.g. the result of a database call, a web service call, or some expensive CPU computation), then it could be a performance optimization that matters. Another case to consider is what would happen if the boolean expression had side effects. What if it were a MoveNext call on an IEnumerator? If it's important that it be executed exactly once because you don't want the side effects to happen N times, then that makes this a very important issue.
There are several possible solutions in such a case.
The easiest is most likely to just compute the boolean expression once and then store it in a variable:
bool someValue = ComputeComplexBooleanValue();
foreach(var item in collection)
{
if(someValue)
doStuff(item);
else
doOtherStuff(item);
}
If you want the boolean expression to be evaluated at most once (i.e. to avoid calling it even once in the event that the collection is empty), we can use Lazy to compute the value lazily while still ensuring it's computed at most one time:
var someValue = new Lazy<bool>(() => ComputeComplexBooleanValue());
foreach (var item in collection)
{
if (someValue.Value)
doStuff(item);
else
doOtherStuff(item);
}
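As a side note, Lazy<T> is thread-safe by default (LazyThreadSafetyMode.ExecutionAndPublication), so even if this loop were later parallelized, the expensive computation would still run at most once.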
You should always go the way that is easier to understand and maintain first. This means reducing duplicate code to an absolute minimum (DRY). In addition, this kind of micro-optimization is not that important for many systems. Also note that shorter code is not always better.
I think I would go with something like this:
string name = GetName(); // returned string could be empty
bool nameIsEmpty = string.IsNullOrWhiteSpace(name);
foreach (string s in GetListOfStrings()) {
string messageAddition = "";
if (!nameIsEmpty) {
messageAddition = " (Name is: " + name + ")";
}
Console.WriteLine(s + messageAddition);
// more code which uses the computed value...
// otherwise the condition can be moved out of the loop
}
I find an extra if statement easier to read than the ?: operator within a method call but this might be a personal taste.
If you want to improve performance later, you should profile your application and start optimizing the slowest sections of code first. Maybe your GetListOfStrings() method is so slow that the performance of the other code is totally irrelevant. If you measure that duplicating the loop improves performance by a significant amount, then you can think about changing it.
Related
I'm doing a bit of coding, where I have to write this sort of code:
if( array[i]==false )
array[i]=true;
I wonder if it should be re-written as
array[i]=true;
This raises the question: are comparisons faster than assignments?
What about differences from language to language? (contrast between java & cpp, eg.)
NOTE: I've heard that "premature optimization is the root of all evil." I don't think that applies here :)
This isn't just premature optimization, this is micro-optimization, which is an irrelevant distraction.
Assuming your array is of boolean type then your comparison is unnecessary, which is the only relevant observation.
Well, since you say you're sure that this matters you should just write a test program and measure to find the difference.
Comparison can be faster if this code is executed on multiple variables allocated at scattered addresses in memory. With a comparison you will only read data from memory into the processor cache, and if you don't change the variable's value, then when the cache decides to flush the line it will see that the line was not changed and there's no need to write it back to memory. This can speed up execution.
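In C#, that argument amounts to a pattern like this (a minimal sketch; whether it actually helps depends entirely on the hardware and the access pattern, so measure before relying on it):
static void SetAll(bool[] flags, bool value)
{
    for (int i = 0; i < flags.Length; i++)
    {
        if (flags[i] != value)   // read only: unchanged cache lines stay clean
            flags[i] = value;    // write (dirtying the line) only when needed
    }
}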
Edit: I wrote a script in PHP. I just noticed that there was a glaring error in it, meaning the best-case runtime was being calculated incorrectly (scary that nobody else noticed!).
Best case just beats outright assignment but worst case is a lot worse than plain assignment. Assignment is likely fastest in terms of real-world data.
Output:
assignment in 0.0119960308075 seconds
worst case comparison in 0.0188510417938 seconds
best case comparison in 0.0116770267487 seconds
Code:
<?php
$arr = array();
$mtime = explode(" ", microtime());
$starttime = $mtime[1] + $mtime[0];
reset_arr($arr);
for ($i=0;$i<10000;$i++)
$arr[i] = true;
$mtime = explode(" ", microtime());
$firsttime = $mtime[1] + $mtime[0];
$totaltime = ($firsttime - $starttime);
echo "assignment in ".$totaltime." seconds<br />";
reset_arr($arr);
for ($i=0;$i<10000;$i++)
if ($arr[i])
$arr[i] = true;
$mtime = explode(" ", microtime());
$secondtime = $mtime[1] + $mtime[0];
$totaltime = ($secondtime - $firsttime);
echo "worst case comparison in ".$totaltime." seconds<br />";
reset_arr($arr);
for ($i=0;$i<10000;$i++)
if (!$arr[i])
$arr[i] = false;
$mtime = explode(" ", microtime());
$thirdtime = $mtime[1] + $mtime[0];
$totaltime = ($thirdtime - $secondtime);
echo "best case comparison in ".$totaltime." seconds<br />";
function reset_arr(&$arr) {   // take the array by reference so the reset actually sticks
    for ($i = 0; $i < 10000; $i++)
        $arr[$i] = false;
}
I believe that if comparison and assignment statements are both atomic (i.e. one processor instruction each) and the loop executes n times, then comparing-then-assigning requires up to 2n instructions in the worst case (a comparison on every iteration plus the assignment), whereas unconditionally assigning the bool requires only n. Therefore the second one is more efficient.
Depends on the language. However, looping through arrays can be costly as well. If the array is in consecutive memory, the fastest approach is to write 1 bits (255s) across the entire array with memset, assuming your language/compiler can do this.
That performs essentially 0 reads and 1 bulk write, with no reading/writing of the loop variable/array variable (2 reads/2 writes each loop) several hundred times.
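The closest analogue in C# is a bulk fill rather than a raw memset (a hedged sketch; Array.Fill requires .NET Core 2.0 or later):
using System;

bool[] array = new bool[1_000_000];

// One bulk pass instead of a compare/assign per element.
Array.Fill(array, true);

// Or fill just a slice via Span<T>:
array.AsSpan(0, 1000).Fill(true);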
I really wouldn't expect any kind of noticeable performance difference for something as trivial as this, so surely it comes down to what gives you clear, more readable code. In my opinion that would be always assigning true.
Might give this a try:
if(!array[i])
array[i]=true;
But really, the only way to know for sure is to profile; I'm sure pretty much any compiler would see the comparison to false as unnecessary and optimize it out.
It all depends on the data type. Assigning booleans is faster than first comparing them. But that may not be true for larger value-based datatypes.
As others have noted, this is micro-optimization.
(In politics or journalism, this is known as navel-gazing ;-)
Is the program large enough to have more than a couple layers of function/method/subroutine calls?
If so, it probably has some avoidable calls, and those can waste hundreds of times as much time as low-level inefficiencies.
On the assumption that you have removed those (which few people do), then by all means run it 10^9 times under a stopwatch, and see which is faster.
Why would you even write the first version? What's the benefit of checking to see if something is false before setting it true. If you always are going to set it true, then always set it true.
When you have a performance bottleneck that you've traced back to setting a single boolean value unnecessarily, come back and talk to us.
I remember that one book about assembly language claimed an if condition should be avoided if possible.
It is much slower if the condition is false and execution has to jump to another line, considerably slowing down performance. Also, since programs are ultimately executed as machine code, I think 'if' is slower in every (compiled) language, unless its condition is true almost all the time.
If you just want to flip the values, then do:
array[i] = !array[i];
Performance using this is actually worse, though: instead of a single unconditional write, it has to read the current value and then write its negation back.
If you declare a 1,000,000-element array in a true, false, true, false pattern, the comparison approach is slower; (var b = !b) essentially does a read plus a write instead of just a write.
I'm writing code that scans large sections of text and performs some basic statistics on it, such as number of upper and lower case characters, punctuation characters etc.
Originally my code looked like this:
foreach (var character in stringToCount)
{
if (char.IsControl(character))
{
controlCount++;
}
if (char.IsDigit(character))
{
digitCount++;
}
if (char.IsLetter(character))
{
letterCount++;
} //etc.
}
And then from there I was creating a new object like this, which simply reads the local variables and passes them to the constructor:
var result = new CharacterCountResult(controlCount, highSurrogatecount, lowSurrogateCount, whiteSpaceCount,
symbolCount, punctuationCount, separatorCount, letterCount, digitCount, numberCount, letterAndDigitCount,
lowercaseCount, upperCaseCount, tempDictionary);
However a user over on Code Review Stack Exchange pointed out that I can just do the following. Great, I've saved myself a load of code which is good.
var result = new CharacterCountResult(stringToCount.Count(char.IsControl),
stringToCount.Count(char.IsHighSurrogate), stringToCount.Count(char.IsLowSurrogate),
stringToCount.Count(char.IsWhiteSpace), stringToCount.Count(char.IsSymbol),
stringToCount.Count(char.IsPunctuation), stringToCount.Count(char.IsSeparator),
stringToCount.Count(char.IsLetter), stringToCount.Count(char.IsDigit),
stringToCount.Count(char.IsNumber), stringToCount.Count(char.IsLetterOrDigit),
stringToCount.Count(char.IsLower), stringToCount.Count(char.IsUpper), tempDictionary);
However, creating the object the second way takes (on my machine) an extra ~200ms.
How can this be? While it might not seem a significant amount of extra time, it soon adds up when I leave it running to process text.
What should I be doing differently?
You are using method groups (syntactic sugar hiding a lambda or delegate) and iterating over the characters many times, whereas you could get it done with one pass (as in your original code).
I remember your previous question, and I recall seeing the recommendation to use the method group and string.Count(char.IsLetterOrDigit) and thinking "yeh that looks pretty but won't perform well", so it was amusing to actually see that you found exactly that.
If performance is important, I would just do it without delegates, period: one giant loop making a single pass, the traditional way. I would even tune it further by organizing the logic so that any case that excludes other cases short-circuits the rest ("lazy evaluation"). For example, if you know a character is whitespace, don't check for digit or alpha; or if you know it is a letter or digit, nest the digit and letter checks inside that condition.
Something like:
foreach(var ch in stringToCount) {
if(char.IsWhiteSpace(ch)) {
...
}
else {
if(char.IsLetterOrDigit(ch)) {
letterOrDigit++;
if(char.IsDigit(ch)) digit++;
if(char.IsLetter(ch)) letter++;
}
}
}
If you REALLY want to micro-optimize, write a program to pre-calculate all of the options and emit a huge switch statement which does table lookups.
switch(ch) {
case 'A':
isLetter++;
isUpper++;
isLetterOrDigit++;
break;
case 'a':
isLetter++;
isLower++;
isLetterOrDigit++;
break;
case '!':
isPunctuation++;
...
}
Now if you want to get REALLY crazy, organize the switch statement according to the real-life frequency of occurrence, and put the most common letters at the top of the "tree", and so forth. Of course, if you care that much about speed, it might be a job for plain C.
But I've wandered a bit far afield from your original question. :)
The old way, you walked through the text once, increasing all of your counters as you went. The new way, you walk through the text 13 times (once for each call to stringToCount.Count()) and only update one counter per pass.
However, this kind of problem is the perfect situation for Parallel.ForEach. You can walk through the text with multiple threads (being sure your increments are thread safe) and get your totals faster.
Parallel.ForEach(stringToCount, character =>
{
if (char.IsControl(character))
{
//Interlocked.Increment gives you a thread safe ++
Interlocked.Increment(ref controlCount);
}
if (char.IsDigit(character))
{
Interlocked.Increment(ref digitCount);
}
if (char.IsLetter(character))
{
Interlocked.Increment(ref letterCount);
} //etc.
});
var result = new CharacterCountResult(controlCount, highSurrogatecount, lowSurrogateCount, whiteSpaceCount,
symbolCount, punctuationCount, separatorCount, letterCount, digitCount, numberCount, letterAndDigitCount,
lowercaseCount, upperCaseCount, tempDictionary);
It still walks through the text once, but many workers will be walking through various parts of the text at the same time.
Hey, is it more performant to write this:
Method_That_Is_Getting_Repeated()
{
if(var1 <0)
{
var2 = 'l';
}
else if(var1 >0)
{
var2 = 'r';
}
}
Or this:
Method_That_Is_Getting_Repeated()
{
if(var1 <0)
{
if(var2 != 'l')
var2 = 'l';
}
else if(var1 >0)
{
if(var2 != 'r')
var2 = 'r';
}
}
The var2 will already have the right value in many cases, so it wouldn't have to be set again.
In other words: does an if statement cost more or less time/performance to execute than assigning to a variable like float, int, char, double, or bool?
There is no danger in resetting a char field to the same value. It's an atomic, simple operation that won't be observable in a meaningful way. The if statement, though, is a branching operation, which is generally speaking slower than a direct set because of the possibility of a bad branch prediction. Now, this is nothing I would design my program around, but given that there is no real downside to setting the field to the same value, why bother with the if at all?
These operations will end up on the stack and be lightning fast. The difference in performance will be negligible (in the magnitude of milliseconds difference over MILLIONS of operations I'd bet) compared to other, larger parts of your app. Make it as maintainable and readable as possible, and return to re-evaluate once these lines become the largest offenders of your performance profiling reports (which I assure you they will never be unless this is literally the only thing that your application does).
Your version 2 of Method_That_Is_Getting_Repeated is poor compared to version 1 in terms of readability and also performance. I believe assigning a char is faster than comparing it.
I did some benchmarks some time ago which showed me that direct assignment is better than comparison. So measure the performance yourself and see which is better.
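If you want to reproduce such a measurement, a minimal Stopwatch harness looks something like this (a sketch only; the JIT can elide dead stores, so treat the numbers as indicative rather than authoritative):
using System;
using System.Diagnostics;

const int N = 100_000_000;
char c = ' ';

var sw = Stopwatch.StartNew();
for (int i = 0; i < N; i++)
    c = 'r';                        // unconditional assignment
sw.Stop();
Console.WriteLine($"assign:         {sw.ElapsedMilliseconds} ms (c={c})");

sw.Restart();
for (int i = 0; i < N; i++)
    if (c != 'r') c = 'r';          // compare, then assign only if different
sw.Stop();
Console.WriteLine($"compare+assign: {sw.ElapsedMilliseconds} ms (c={c})");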
The second option has a higher complexity, so is likely to be less performant. It's also less readable. I would not worry much about it unless it is shown to be a cause of slow performance though.
http://en.wikipedia.org/wiki/Cyclomatic_complexity
Hi, I have this code below and am looking for a prettier/faster way to do this.
Thanks!
string value = "HelloGoodByeSeeYouLater";
string[] y = new string[]{"Hello", "You"};
foreach(string x in y)
{
value = value.Replace(x, "");
}
You could do:
y.ToList().ForEach(x => value = value.Replace(x, ""));
Although I think your variant is more readable.
Forgive me, but someone's gotta say it,
value = Regex.Replace( value, string.Join("|", y.Select(Regex.Escape)), "" );
Possibly faster, since it creates fewer strings.
EDIT: Credit to Gabe and lasseespeholt for Escape and Select.
While not any prettier, there are other ways to express the same thing.
In LINQ:
value = y.Aggregate(value, (acc, x) => acc.Replace(x, ""));
With String methods:
value = String.Join("", value.Split(y, StringSplitOptions.None));
I don't think anything is going to be faster in managed code than a simple Replace in a foreach though.
It depends on the size of the string you are searching. The foreach example is perfectly fine for small operations, but it creates a new instance of the string each time it operates, because strings are immutable. It also requires searching the whole string over and over again in a linear fashion.
The basic solutions have all been proposed. The Linq examples provided are good if you are comfortable with that syntax; I also liked the suggestion of an extension method, although that is probably the slowest of the proposed solutions. I would avoid a Regex unless you have an extremely specific need.
So let's explore more elaborate solutions and assume you needed to handle a string that was thousands of characters in length and had many possible words to be replaced. If this doesn't apply to the OP's need, maybe it will help someone else.
Method #1 is geared towards large strings with few possible matches.
Method #2 is geared towards short strings with numerous matches.
Method #1
I have handled large-scale parsing in C# using char arrays and pointer math with intelligent seek operations that are optimized for the length and potential frequency of the term being searched for. It follows this methodology:
Extremely cheap Peeks one character at a time
Only investigate potential matches
Modify output when match is found
For example, you might read through the whole source array and only add words to the output when they are NOT found. This would remove the need to keep redimensioning strings.
A simple example of this technique is looking for a closing HTML tag in a DOM parser. For example, I may read an opening STYLE tag and want to skip through (or buffer) thousands of characters until I find a closing STYLE tag.
This approach provides incredibly high performance, but it's also incredibly complicated if you don't need it (plus you need to be well-versed in memory manipulation/management or you will create all sorts of bugs and instability).
I should note that the .Net string libraries are already incredibly efficient but you can optimize this approach for your own specific needs and achieve better performance (and I have validated this firsthand).
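As a rough illustration of the peek-and-seek idea (a simplified sketch with hypothetical helper names, not the validated implementation described above):
using System.Text;

static string SkipBetween(char[] src, string open, string close)
{
    var sb = new StringBuilder(src.Length);
    int i = 0;
    while (i < src.Length)
    {
        // cheap peek: only investigate further when the first char matches
        if (src[i] == open[0] && MatchesAt(src, i, open))
        {
            i += open.Length;
            while (i < src.Length && !MatchesAt(src, i, close))
                i++;                 // seek past the skipped region
            i += close.Length;
        }
        else
        {
            sb.Append(src[i++]);     // modify output only outside matches
        }
    }
    return sb.ToString();
}

static bool MatchesAt(char[] src, int pos, string term)
{
    if (pos + term.Length > src.Length) return false;
    for (int j = 0; j < term.Length; j++)
        if (src[pos + j] != term[j]) return false;
    return true;
}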
Method #2
Another alternative involves storing search terms in a Dictionary containing Lists of strings. Basically, you decide how long your search prefix needs to be, and read characters from the source string into a buffer until you reach that length. Then you search your dictionary for all terms that match that string. If a match is found, you explore further by iterating through that List; if not, you know you can discard the buffer and continue.
Because the Dictionary matches strings based on hash, the search is non-linear and ideal for handling a large number of possible matches.
I'm using this methodology to allow instantaneous (<1ms) searching of every airfield in the US by name, state, city, FAA code, etc. There are 13K airfields in the US, and I've created a map of about 300K permutations (again, a Dictionary with prefixes of varying lengths, each corresponding to a list of matches).
For example, Phoenix, Arizona's main airfield is called Sky Harbor with the short ID of KPHX. I store:
KP
KPH
KPHX
Ph
Pho
Phoe
Ar
Ari
Ariz
Sk
Sky
Ha
Har
Harb
There is a cost in terms of memory usage, but string interning probably reduces this somewhat and the resulting speed justifies the memory usage on data sets of this size. Searching happens as the user types and is so fast that I have actually introduced an artificial delay to smooth out the experience.
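A hedged sketch of that structure (hypothetical names; the real map would point at airfield records rather than plain strings):
using System;
using System.Collections.Generic;

class PrefixIndex
{
    private readonly Dictionary<string, List<string>> map =
        new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);

    public void Add(string term, string record)
    {
        // store every prefix of the term, from 2 chars up to its full length
        for (int len = 2; len <= term.Length; len++)
        {
            string prefix = term.Substring(0, len);
            if (!map.TryGetValue(prefix, out var list))
                map[prefix] = list = new List<string>();
            list.Add(record);
        }
    }

    // hash lookup: cost does not grow linearly with the number of stored terms
    public IReadOnlyList<string> Search(string buffer) =>
        map.TryGetValue(buffer, out var list) ? (IReadOnlyList<string>)list
                                              : Array.Empty<string>();
}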
Send me a message if you have the need to dig into these methodologies.
Extension method for elegance
(arguably "prettier" at the call level)
I'll implement an extension method that allows you to call your implementation directly on the original string as seen here.
value = value.Remove(y);
// or
value = value.Remove("Hello", "You");
// effectively
string value = "HelloGoodByeSeeYouLater".Remove("Hello", "You");
The extension method is callable on any string value in fact, and therefore easily reusable.
Implementation of Extension method:
I'm going to wrap your own implementation (shown in your question) in an extension method for pretty or elegant points, and also employ the params keyword to provide some flexibility when passing the arguments. You can substitute somebody else's faster implementation body into this method.
static class StringExtensions {
    public static string Remove(this string thisString, params string[] arrItems) {
        // Whatever implementation you like:
        if (thisString == null)
            return null;
        var temp = thisString;
        foreach (string x in arrItems)
            temp = temp.Replace(x, "");
        return temp;
    }
}
That's the brightest idea I can come up with right now that nobody else has touched on.
I have this code:
var options = GetOptions(From, Value, SelectedValue);
var stopWatch = System.Diagnostics.Stopwatch.StartNew();
foreach (Option option in options)
{
stringBuilder.Append("<option");
stringBuilder.Append(" value=\"");
stringBuilder.Append(option.Value);
stringBuilder.Append("\"");
if (option.Selected)
stringBuilder.Append(" selected=\"selected\"");
stringBuilder.Append('>');
stringBuilder.Append(option.Text);
stringBuilder.Append("</option>");
}
HttpContext.Current.Response.Write("<b>" + stopWatch.Elapsed.ToString() + "</b><br>");
It is writing:
00:00:00.0004255 in the first try (not in debug)
00:00:00.0004260 in the second try and
00:00:00.0004281 in the third try.
Now, if I change the code so the measure will be inside the foreach loop:
var options = GetOptions(From, Value, SelectedValue);
foreach (Option option in options)
{
var stopWatch = System.Diagnostics.Stopwatch.StartNew();
stringBuilder.Append("<option");
stringBuilder.Append(" value=\"");
stringBuilder.Append(option.Value);
stringBuilder.Append("\"");
if (option.Selected)
stringBuilder.Append(" selected=\"selected\"");
stringBuilder.Append('>');
stringBuilder.Append(option.Text);
stringBuilder.Append("</option>");
HttpContext.Current.Response.Write("<b>" + stopWatch.Elapsed.ToString() + "</b><br>");
}
...I get
[00:00:00.0000014, 00:00:00.0000011] = 00:00:00.0000025 in the first try (not in debug),
[00:00:00.0000016, 00:00:00.0000011] = 00:00:00.0000027 in the second try and
[00:00:00.0000013, 00:00:00.0000011] = 00:00:00.0000024 in the third try.
?!
It makes no sense given the first results... I've heard that the foreach loop is slow, but I never imagined it was this slow... Is that what's going on?
options has 2 options.
Here's the option class, if it is needed:
public class Option
{
public Option(string text, string value, bool selected)
{
Text = text;
Value = value;
Selected = selected;
}
public string Text
{
get;
set;
}
public string Value
{
get;
set;
}
public bool Selected
{
get;
set;
}
}
Thanks.
The foreach loop itself has nothing to do with the time difference.
What is the GetOptions method returning? My guess is that it's not returning a collection of options, but rather an enumerator that is capable of getting the options. That means that actually fetching the options is not done until you start to iterate over them.
In the first case you are starting the clock before starting iterating the options, which means that the time for fetching the options is included in the time.
In the second case you are starting the clock after starting iterating the options, which means that the time for fetching the options is not included in the time.
So, the time difference that you see is not due to the foreach loop itself; it's the time it takes to fetch the options.
You can make sure that the options are fetched immediately by reading them into a collection:
var options = GetOptions(From, Value, SelectedValue).ToList();
Now measure the performance, and you will see very little difference.
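To see why, here is a minimal illustration of deferred execution (a hypothetical iterator; this is the usual behavior of methods built with yield return or LINQ):
using System;
using System.Collections.Generic;

static IEnumerable<int> GetNumbers()
{
    Console.WriteLine("expensive setup runs now");  // deferred until the first MoveNext
    yield return 1;
    yield return 2;
}

var numbers = GetNumbers();      // nothing has executed yet
foreach (var n in numbers)       // the setup cost lands on the first iteration
    Console.WriteLine(n);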
If you measure the time taken to do something 160 times, it will usually take on the order of 160 times longer than doing it once. Are you suggesting that the contents of the loop are only executed once, or are you trying to compare chalk and cheese?
In the first case, try changing the last line of your code from using
stopWatch.Elapsed.ToString()
to
TimeSpan.FromTicks(stopWatch.Elapsed.Ticks / options.Count).ToString()
That will at least mean you are comparing one iteration with one iteration.
However, your results will still be useless. Timing a very short operation once gives poor results; you have to repeat such a thing tens of thousands of times to get a statistically meaningful average time. Otherwise the inaccuracy of the system clock and the overhead involved in starting and stopping your timer will swamp your results.
Also, what is the PC doing while all this is happening? If there are other processes loading the CPU, they could easily interfere with your timings. If you're running this on a busy server, you may get completely random results.
Lastly, how you execute the tests can alter things. If you always run test 1 followed by test 2, it's possible that running the first test affects CPU caches (e.g. of the data in the options list) so that the following code is able to execute faster. If garbage collection occurs during one of your tests, it will skew the results.
You need to eliminate all these factors before you have numbers that are worth comparing. Only then should you ask "why is test 1 running so much slower than test 2"?
The first code example doesn't output anything until all the options have been iterated, while the second one outputs a time after the first option has been processed. If there are multiple options, you would expect to see such a difference.
Just pause it a few times in the IDE and you'll see where the time goes.
There's a very natural and strong temptation to think that the time things take is proportional to how much code they are. For example, which do you think is faster?
for (MyClass x in y)
for (MyClass theParticularInstanceOfClass in MyCollectionOfInstances)
It is natural to think that the first is faster, when in fact the code size is irrelevant and could be hiding a multitude of expensive operations.