My application reads an Excel file using VSTO and adds the data it reads to a StringDictionary. It adds only data that are numbers with a few digits (1000 1000,2 1000,34 - the comma is the decimal separator under Russian standards).
What is the better way to check whether the current string is an appropriate number?
object regionData; string key; // the data that was read
try
{
Convert.ToDouble(regionData, CultureInfo.CurrentCulture);
dic.Add(key, regionData.ToString());
}
catch (FormatException)
{
// is not a number
}
or
double d;
string str = regionData.ToString();
if (Double.TryParse(str, out d)) // if it succeeds, it's a number
{
dic.Add(key, str);
}
I have to use a StringDictionary instead of Dictionary&lt;string, double&gt; because of issues elsewhere in the parsing algorithm.
My questions: Which way is faster? Which is safer?
And is it better to call Convert.ToDouble(object) or Convert.ToDouble(string)?
I did a quick non-scientific test in Release mode. I ran two inputs, "2.34523" and "badinput", through both methods, iterating 1,000,000 times.
Valid input:
Double.TryParse = 646ms
Convert.ToDouble = 662ms
Not much different, as expected. For all intents and purposes, for valid input, these are the same.
Invalid input:
Double.TryParse = 612ms
Convert.ToDouble = ..
Well.. it was running for a long time. I reran the entire thing using 1,000 iterations, and Convert.ToDouble with bad input took 8.3 seconds. Extrapolating, the full run would take over 2 hours. I don't care how basic the test is: in the invalid-input case, Convert.ToDouble's exception raising will ruin your performance.
So, here's another vote for TryParse with some numbers to back it up.
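For anyone who wants to reproduce this, here is a minimal sketch of the kind of loop I used (the harness below is illustrative, not my exact code):
using System;
using System.Diagnostics;
using System.Globalization;

class ParseBenchmark
{
    static void Main()
    {
        const int iterations = 1000000;
        string input = "badinput"; // swap in "2.34523" for the valid-input run

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            double d;
            Double.TryParse(input, out d); // false on bad input, no exception thrown
        }
        sw.Stop();
        Console.WriteLine("Double.TryParse: " + sw.ElapsedMilliseconds + "ms");

        sw.Restart();
        for (int i = 0; i < iterations; i++)
        {
            try
            {
                Convert.ToDouble(input, CultureInfo.InvariantCulture);
            }
            catch (FormatException)
            {
                // every bad input pays the full cost of throwing and catching
            }
        }
        sw.Stop();
        Console.WriteLine("Convert.ToDouble: " + sw.ElapsedMilliseconds + "ms");
    }
}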
To start with, I'd use double.Parse rather than Convert.ToDouble in the first place.
As to whether you should use Parse or TryParse: can you proceed if there's bad input data, or is that a really exceptional condition? If it's exceptional, use Parse and let it blow up if the input is bad. If it's expected and can be cleanly handled, use TryParse.
The .NET Framework design guidelines recommend using the Try methods. Avoiding exceptions is usually a good idea.
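A rough sketch of the two styles against the asker's scenario (the method names here are mine, just for illustration):
using System;
using System.Collections.Specialized;
using System.Globalization;

static class ParseExamples
{
    // Expected, recoverable bad input: TryParse and handle the failure locally.
    static void AddIfNumber(StringDictionary dic, string key, string cellText)
    {
        double value;
        if (double.TryParse(cellText, NumberStyles.Float, CultureInfo.CurrentCulture, out value))
            dic.Add(key, cellText);
        // else: skip the cell, log it, substitute a default, etc.
    }

    // Truly exceptional bad input: Parse and let the FormatException propagate.
    static double ReadNumber(string cellText)
    {
        return double.Parse(cellText, CultureInfo.CurrentCulture);
    }
}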
Convert.ToDouble(object) will do ((IConvertible) object).ToDouble(null);
Which will call Convert.ToDouble(string, null)
So it's faster to call the string version.
However, the string version just does this:
if (value == null)
{
return 0.0;
}
return double.Parse(value, NumberStyles.Float | NumberStyles.AllowThousands, provider);
So it's faster to do the double.Parse directly.
If you aren't going to handle the exception, go with TryParse. TryParse is faster because it doesn't have to create, throw, and catch an exception (with its stack trace).
I generally try to avoid the Convert class (meaning: I don't use it) because I find it very confusing: the code gives too few hints about what exactly is happening, since Convert allows a lot of semantically very different conversions through the same code. This makes it hard for the programmer to control what exactly is going on.
My advice, therefore, is never to use this class. It isn't really necessary either (except for binary formatting of a number, because the normal ToString methods of the numeric types don't offer a way to do this).
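For reference, the binary-formatting case I mean is Convert.ToString with a base argument, which the numeric ToString overloads don't offer:
string bits = Convert.ToString(42, 2);   // "101010"
string hex  = Convert.ToString(255, 16); // "ff"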
Unless you are 100% certain of your inputs, which is rarely the case, you should use Double.TryParse.
Convert.ToDouble will throw an exception on non-numbers
Double.Parse will throw an exception on non-numbers or null
Double.TryParse will return false (and set the out parameter to 0) in any of the above cases, without throwing an exception.
The speed of the parse becomes secondary once you throw an exception, because not much is slower than an exception.
Lots of hate for the Convert class here... Just to balance a little bit, there is one advantage for Convert - if you are handed an object,
Convert.ToDouble(o);
can just return the value easily if o is already a Double (or an int or anything readily castable).
Using Double.Parse or Double.TryParse is great if you already have the value in a string, but
Double.Parse(o.ToString());
has to create the string to parse first, and depending on your input that could be more expensive.
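A rough sketch of that difference (the method names are mine, just for illustration): Convert takes the object as-is, while Parse has to round-trip through a string first.
using System;
using System.Globalization;

static class ObjectToDouble
{
    // Convert inspects the runtime type: if o is already a double (or an int,
    // a decimal, or anything else implementing IConvertible), no string is created.
    static double ViaConvert(object o)
    {
        return Convert.ToDouble(o, CultureInfo.CurrentCulture);
    }

    // Parse always needs a string, so a boxed double gets formatted and re-parsed.
    static double ViaParse(object o)
    {
        return double.Parse(o.ToString(), CultureInfo.CurrentCulture);
    }
}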
Double.TryParse IMO.
It is easier for you to handle, and you'll know exactly where the error occurred.
Then you can deal with it however you see fit if it returns false (i.e. the value could not be converted).
I have always preferred using the TryParse() methods because it is going to spit back success or failure to convert without having to worry about exceptions.
This is an interesting old question. I'm adding an answer because nobody noticed a couple of things with the original question.
Which is faster: Convert.ToDouble or Double.TryParse?
Which is safer: Convert.ToDouble or Double.TryParse?
I'm going to answer both these questions (I'll update the answer later), in detail, but first:
For safety, the thing every programmer missed in this question is the line (emphasis mine):
It adds only data that are numbers with a few digits (1000 1000,2 1000,34 - the comma is the decimal separator under Russian standards).
Followed by this code example:
Convert.ToDouble(regionData, CultureInfo.CurrentCulture);
What's interesting here is that if the spreadsheets are in Russian number format but Excel has not correctly typed the cell fields, what is the correct interpretation of the values coming in from Excel?
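To make the culture point concrete, here is a sketch of how the same cell text behaves under different cultures (assuming the value arrives as the string "1000,34"):
using System;
using System.Globalization;

class CultureDemo
{
    static void Main()
    {
        string cell = "1000,34"; // comma is the decimal separator in ru-RU

        // Parsed with the Russian culture this is the single number 1000.34.
        double ru = double.Parse(cell, CultureInfo.GetCultureInfo("ru-RU"));
        Console.WriteLine(ru);

        // Parsed with the invariant culture and no thousands separators allowed,
        // the comma is simply invalid and TryParse reports failure.
        double inv;
        bool ok = double.TryParse(cell, NumberStyles.Float, CultureInfo.InvariantCulture, out inv);
        Console.WriteLine(ok); // False
    }
}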
Here is another interesting thing about the two examples, regarding speed:
catch (FormatException)
{
// is not a number
}
This is likely going to generate MSIL that looks like this:
catch [mscorlib]System.FormatException
{
IL_0023: stloc.0
IL_0024: nop
IL_0025: ldloc.0
IL_0026: nop
IL_002b: nop
IL_002c: nop
IL_002d: leave.s IL_002f
} // end handler
IL_002f: nop
IL_0030: ret
In this sense, we can probably compare the total number of MSIL instructions carried out by each program - more on that later as I update this post.
I believe code should be Correct, Clear, and Fast... In that order!
Personally, I find the TryParse method easier to read. Which one you'll actually want to use depends on your use case: if you expect errors and can handle them locally, the bool from TryParse is good; otherwise you might want to just let the exceptions fly.
I would expect TryParse to be faster too, since it avoids the overhead of exception handling. But use a benchmarking tool, like Jon Skeet's MiniBench, to compare the various possibilities.
Related
I'm looking for the fastest generic approach to converting strings into various data types on the fly.
I am parsing large text data files (several megabytes in size). This particular function reads lines from the text file, parses each line into columns based on delimiters, and places the parsed values into a .NET DataTable. This is later inserted into a database. My bottleneck by FAR is the string conversions (Convert and TypeConverter).
I have to go with a dynamic approach (i.e. staying away from "Convert.ToInt32" etc.) because I never know what types are going to be in the files. The type is determined by configuration earlier at runtime.
So far I have tried the following, and both take several minutes to parse a file. Note that if I comment out this one line, it runs in only a few hundred milliseconds.
row[i] = Convert.ChangeType(columnString, dataType);
AND
TypeConverter typeConverter = TypeDescriptor.GetConverter(type);
row[i] = typeConverter.ConvertFromString(null, cultureInfo, columnString);
If anyone knows of a faster way that is as generic as this, I would like to know about it. Or if my whole approach just sucks for some reason, I'm open to suggestions. But please don't point me to non-generic approaches using hard-coded types; that is simply not an option here.
UPDATE - Multi-threading to Improve Performance Test
In order to improve performance I looked into splitting the parsing work across multiple threads. I found that the speed increased somewhat, but still not as much as I had hoped. However, here are my results for those who are interested.
System:
Intel Xeon 3.3GHz Quad Core E3-1245
Memory: 12.0 GB
Windows 7 Enterprise x64
Test:
The test function is this:
(1) Receive an array of strings. (2) Split the string by delimiters. (3) Parse strings into data types and store them in a row. (4) Add the row to the data table. (5) Repeat (2)-(4) until finished.
The test included 1000 strings, each string being parsed into 16 columns, so that is 16000 string conversions total. I tested single thread, 4 threads (because of quad core), and 8 threads (because of hyper-threading). Since I'm only crunching data here I doubt adding more threads than this would do any good. So for the single thread it parses 1000 strings, 4 threads parse 250 strings each, and 8 threads parse 125 strings each. Also I tested a few different ways of using threads: thread creation, thread pool, tasks, and function objects.
Results:
Result times are in Milliseconds.
Single Thread:
Method Call: 17720
4 Threads
Parameterized Thread Start: 13836
ThreadPool.QueueUserWorkItem: 14075
Task.Factory.StartNew: 16798
Func BeginInvoke EndInvoke: 16733
8 Threads
Parameterized Thread Start: 12591
ThreadPool.QueueUserWorkItem: 13832
Task.Factory.StartNew: 15877
Func BeginInvoke EndInvoke: 16395
As you can see, the fastest is Parameterized Thread Start with 8 threads (the number of my logical cores). However, it does not beat 4 threads by much, and is only about 29% faster than a single thread. Of course, results will vary by machine. Also, I stuck with a
Dictionary<Type, TypeConverter>
cache for string parsing, as using arrays of type converters did not offer a noticeable performance increase, and one shared converter cache is more maintainable than creating arrays all over the place whenever I need them.
ANOTHER UPDATE:
Ok, so I ran some more tests to see if I could squeeze out some more performance, and I found some interesting things. I decided to stick with 8 threads, all started from the Parameterized Thread Start method (the fastest of my previous tests). The same test as above was run, just with different parsing algorithms.
I noticed that
Convert.ChangeType and TypeConverter
take about the same amount of time. Type specific converters like
int.TryParse
are slightly faster, but not an option for me since my types are dynamic. ricovox had some good advice about exception handling. My data does indeed have invalid values: some integer columns contain a dash '-' for empty numbers, so the type converters blow up on that, meaning every row I parse has at least one exception - that's 1000 exceptions! Very time consuming.
Btw, this is how I do my conversions with TypeConverter. Extensions is just a static class, and GetTypeConverter just returns a cached TypeConverter. If an exception is thrown during the conversion, a default value is used.
public static Object ConvertTo(this String arg, CultureInfo cultureInfo, Type type, Object defaultValue)
{
Object value;
TypeConverter typeConverter = Extensions.GetTypeConverter(type);
try
{
// Try converting the string.
value = typeConverter.ConvertFromString(null, cultureInfo, arg);
}
catch
{
// If the conversion fails then use the default value.
value = defaultValue;
}
return value;
}
Results:
Same test on 8 threads - parse 1000 lines, 16 columns each, 250 lines per thread.
So I did 3 new things.
1 - Run the test: check for known invalid types before parsing to minimize exceptions.
i.e. if(!Char.IsDigit(c)) value = 0; OR columnString.Contains('-') etc...
Runtime: 29ms
2 - Run the test: use custom parsing algorithms that have try catch blocks.
Runtime: 12424ms
3 - Run the test: use custom parsing algorithms checking for invalid types before parsing to minimize exceptions.
Runtime 15ms
Wow! As you can see, eliminating the exceptions made a world of difference. I never realized how expensive exceptions really are! If I limit exceptions to TRULY unknown cases, the parsing algorithm runs three orders of magnitude faster. I'm considering this absolutely solved. I believe I will keep the dynamic type conversion with TypeConverter; it is only a few milliseconds slower. Checking for known invalid types before converting avoids the exceptions, and that speeds things up incredibly! Thanks to ricovox for pointing that out, which made me test this further.
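For reference, a minimal sketch of that pre-check idea (the '-' placeholder rule comes from my data; the helper name is made up for the example):
using System;
using System.ComponentModel;
using System.Globalization;

static class PrecheckedParsing
{
    // Hypothetical helper: screen out values we already know are not parseable
    // (empty cells and the '-' placeholder) so the exception path is rarely taken.
    public static object ConvertColumn(string columnString, Type type,
                                       CultureInfo cultureInfo, object defaultValue)
    {
        if (string.IsNullOrEmpty(columnString) || columnString == "-")
            return defaultValue; // known-invalid marker: no parse attempt, no exception

        try
        {
            // Fall back to the generic TypeConverter path for everything else.
            TypeConverter converter = TypeDescriptor.GetConverter(type);
            return converter.ConvertFromString(null, cultureInfo, columnString);
        }
        catch
        {
            // Truly unexpected garbage still lands here, but now it is rare.
            return defaultValue;
        }
    }
}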
if you are primarily going to be converting the strings to the native data types (string, int, bool, DateTime, etc.), you could use something like the code below, which caches the TypeCodes and TypeConverters (for non-native types) and uses a fast switch statement to jump quickly to the appropriate parsing routine. This should save some time over Convert.ChangeType because the source type (string) is already known and you can call the right parse method directly.
/* Get an array of Types for each of your columns.
* Open the data file for reading.
* Create your DataTable and add the columns.
* (You have already done all of these in your earlier processing.)
*
* Note: For the sake of generality, I've used an IEnumerable<string>
* to represent the lines in the file, although for large files,
* you would use a FileStream or TextReader etc.
*/
IList<Type> columnTypes; //array or list of the Type to use for each column
IEnumerable<string> fileLines; //the lines to parse from the file.
DataTable table; //the table you'll add the rows to
int colCount = columnTypes.Count;
var typeCodes = new TypeCode[colCount];
var converters = new TypeConverter[colCount];
//Fill up the typeCodes array with the Type.GetTypeCode() of each column type.
//If the TypeCode is Object, then get a custom converter for that column.
for(int i = 0; i < colCount; i++) {
typeCodes[i] = Type.GetTypeCode(columnTypes[i]);
if (typeCodes[i] == TypeCode.Object)
converters[i] = TypeDescriptor.GetConverter(columnTypes[i]);
}
//Probably faster to build up an array of objects and insert them into the row all at once.
object[] vals = new object[colCount];
object val;
foreach(string line in fileLines) {
//delineate the line into columns, however you see fit. I'll assume a tab character.
var columns = line.Split('\t');
for(int i = 0; i < colCount; i++) {
switch(typeCodes[i]) {
case TypeCode.String:
val = columns[i]; break;
case TypeCode.Int32:
val = int.Parse(columns[i]); break;
case TypeCode.DateTime:
val = DateTime.Parse(columns[i]); break;
//...list types that you expect to encounter often.
//finally, deal with other objects
case TypeCode.Object:
default:
val = converters[i].ConvertFromString(columns[i]);
break;
}
vals[i] = val;
}
//Add all values to the row at one time.
//This might be faster than adding each column one at a time.
//There are two ways to do this:
var row = table.Rows.Add(vals); //create new row on the fly.
// OR
row.ItemArray = vals; //(e.g. allows setting existing row, created previously)
}
There really ISN'T any other way that would be faster, because we're basically just using the raw string-parsing methods defined by the types themselves. You could re-write your own parsing code for each output type, making optimizations for the exact formats you'll encounter, but I assume that is overkill for your project. It would probably be better and faster simply to tailor the FormatProvider or NumberStyles in each case.
For example let's say that whenever you parse Double values, you know, based on your proprietary file format, that you won't encounter any strings that contain exponents etc, and you know that there won't be any leading or trailing space, etc. So you can clue the parser in to these things with the NumberStyles argument as follows:
//NOTE: using System.Globalization;
var styles = NumberStyles.AllowDecimalPoint | NumberStyles.AllowLeadingSign;
var d = double.Parse(text, styles);
I don't know for a fact how the parsing is implemented, but I would think that the NumberStyles argument allows the parsing routine to work faster by excluding various formatting possibilities. Of course, if you can't make any assumptions about the format of the data, then you won't be able to make these types of optimizations.
Of course, there's always the possibility that your code is slow simply because it takes time to parse a string into a certain data type. Use a performance analyzer (like the one in VS2010) to try to see where your actual bottleneck is. Then you'll be able to optimize better, or simply give up, e.g. in the case that there is nothing else to do short of writing the parsing routines in assembly :-)
Here is a quick piece of code to try:
Dictionary<Type, TypeConverter> _ConverterCache = new Dictionary<Type, TypeConverter>();
TypeConverter GetCachedTypeConverter(Type type)
{
if (!_ConverterCache.ContainsKey(type))
_ConverterCache.Add(type, TypeDescriptor.GetConverter(type));
return _ConverterCache[type];
}
Then use the code below instead:
TypeConverter typeConverter = GetCachedTypeConverter(type);
Is it a little faster?
A technique I commonly use is:
var parserLookup = new Dictionary<Type, Func<string, dynamic>>();
parserLookup.Add(typeof(Int32), s => Int32.Parse(s));
parserLookup.Add(typeof(Int64), s => Int64.Parse(s));
parserLookup.Add(typeof(Decimal), s => Decimal.Parse(s, NumberStyles.Number | NumberStyles.Currency, CultureInfo.CurrentCulture));
parserLookup.Add(typeof(DateTime), s => DateTime.Parse(s, CultureInfo.CurrentCulture, DateTimeStyles.AssumeLocal));
// and so on for any other type you want to handle.
This assumes you can figure out what Type your data represents. The use of dynamic also implies .NET 4 or higher, but you can change it to object in most cases.
Cache your parser lookup for each file (or for your entire app) and you should get pretty good performance.
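Usage is then one dictionary lookup and one delegate call per cell. A sketch, assuming the column Types and the split column strings come from your existing configuration and parsing steps:
using System;
using System.Collections.Generic;

static class ParserLookupUsage
{
    // Hypothetical helper: parserLookup is the dictionary built above.
    static object[] ParseRow(string[] columns, IList&lt;Type&gt; columnTypes,
                             IDictionary&lt;Type, Func&lt;string, dynamic&gt;&gt; parserLookup)
    {
        var row = new object[columns.Length];
        for (int i = 0; i &lt; columns.Length; i++)
            row[i] = parserLookup[columnTypes[i]](columns[i]); // one lookup + one call per cell
        return row;
    }
}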
Using string.CompareTo(string) I can get around this somewhat, but it is not easy to read, and I have read that locale settings might influence the result.
Is there a way to simply use < or > on two strings in a more straightforward way?
You can overload operators, but you seldom should. To me "stringA" > "stringB" wouldn't mean a damn thing; it's not helping readability, IMO. That's why the operator overloading guidelines advise against overloading operators when the meaning is not obvious.
EDIT: Operator Overloading Usage Guidelines
Also, in the case of String I'm afraid you can't do it anyway, seeing as operator overloads can only be defined inside the type they apply to, and String isn't yours to modify.
If the syntax of CompareTo bothers you, maybe wrapping it in an extension method would solve your problem?
Like that:
public static bool IsLessThan(this string str, string str2) {
return str.CompareTo(str2) < 0;
}
I still find it confusing for the reader, though.
The bottom line is, you can't overload operators for String. Usually you could do something like declaring a partial class and stuffing your overloads there, but String is a sealed class, so not this time. I think that an extension method with a reasonable name is your best bet. You can put CompareTo or some custom logic inside it.
CompareTo is the proper way in my opinion, you can use the overloads to specify culture specific parameters...
You mention in a comment that you're comparing two strings with values of the form "A100" and "B001". This works in your legacy VB 6 code with the < and > operators because of the way that VB 6 implements string comparison.
The algorithm is quite simple. It walks through the string, one character at a time, and compares the ASCII values of each character. As soon as a character from one string is found to have a lower ASCII code than the corresponding character in the other string, the comparison stops and the first string is declared to be "less than" the second. (VB 6 can be forced to perform a case-insensitive comparison based on the system's current locale by placing the Option Compare Text statement at the top of the relevant code module, but this is not the default setting.)
Simple, of course, but not entirely logical. Comparing ASCII values skips over all sorts of interesting things you might find in strings nowadays; namely non-ASCII characters. Since you appear to be dealing with strings whose contents have pre-defined limits, this may not be a problem in your particular case. But more generally, writing code like strA < strB is going to look like complete nonsense to anyone else who has to maintain your code (it seems like you're already having this experience), and I encourage you to do the "right thing" even when you're dealing with a fixed set of possible inputs.
There is nothing "straightforward" about using < or > on string values. If you need this functionality, you're going to have to implement it yourself. Following the algorithm that I described VB 6 as using above, you could write your own comparison function and call that in your code instead. Walk through each character in the string, determine whether it is a letter or a number, and convert it to the appropriate data type. From there, you can compare the two parsed values, and either move on to the next index in the string or return an "equality" value.
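If all you actually need is the old VB 6 behaviour (comparing raw character values left to right), an ordinal comparison wrapped in a descriptive extension method is a reasonable sketch (the method name is mine, not a framework member):
public static class LegacyStringComparison
{
    // Mimics VB 6's default Option Compare Binary: character-by-character,
    // by raw character value, with no culture-specific rules applied.
    public static bool IsLessThanOrdinal(this string left, string right)
    {
        return string.CompareOrdinal(left, right) < 0;
    }
}
// Usage: "A100".IsLessThanOrdinal("B001") is true, because 'A' < 'B'.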
There is another problem with that, I think:
Assert.IsFalse(10 < 2);
Assert.IsTrue("10" < "2");
(The second Assert assumes you did an overload for the < operator on the string class.)
But the operator suggests otherwise!!
I agree with Dyppl: you shouldn't do it!
I'm interested in both style and performance considerations. My choice is to do either of the following (sorry for the poor formatting, but the interface for this site is not WYSIWYG):
One:
string value = "ALPHA";
switch ( value.ToUpper() )
{
case "ALPHA":
// do something
break;
case "BETA":
// do something else
break;
default:
break;
}
Two:
public enum GreekLetters
{
UNKNOWN= 0,
ALPHA= 1,
BETA = 2,
etc...
}
string value = "Alpha";
GreekLetters letter = (GreekLetters)Enum.Parse( typeof( GreekLetters ), value.ToUpper() );
switch( letter )
{
case GreekLetters.ALPHA:
// do something
break;
case GreekLetters.BETA:
// do something else
break;
default:
break;
}
Personally, I prefer option two, but I don't have any real reason other than basic style. However, I'm not even sure there really is a style reason. Thanks for your input.
The second option is marginally faster, as the first option may require a full string comparison. The difference will be too small to measure in most circumstances, though.
The real advantage of the second option is that you've made it explicit that the valid values for value fall into a narrow range. In fact, it will throw an exception at Enum.Parse if the string value isn't in the expected range, which is often exactly what you want.
Option #1 is faster because if you look at the code for Enum.Parse, you'll see that it goes through each item one by one, looking for a match. In addition, there is less code to maintain and keep consistent.
One word of caution is that you shouldn't use ToUpper, but rather ToUpperInvariant() because of Turkey Test issues.
If you insist on Option #2, at least use the overload that allows you to specify to ignore case. This will be faster than converting to uppercase yourself. In addition, be advised that the Framework Design Guidelines encourage that all enum values be PascalCase instead of SCREAMING_CAPS.
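A sketch of option two using that overload and PascalCase names (the enum here is just illustrative):
using System;

public enum GreekLetter
{
    Unknown = 0,
    Alpha = 1,
    Beta = 2
}

static class EnumParseExample
{
    static GreekLetter FromString(string value)
    {
        // The third argument tells Enum.Parse to ignore case,
        // so no ToUpper/ToUpperInvariant call is needed at all.
        return (GreekLetter)Enum.Parse(typeof(GreekLetter), value, true);
    }
}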
I can't comment on the performance part of the question but as for style I prefer option #2. Whenever I have a known set of values and the set is reasonably small (less than a couple of dozen or so) I prefer to use an enum. I find an enum is a lot easier to work with than a collection of string values and anyone looking at the code can quickly see what the set of allowed values is.
This actually depends on the number of items in the enum, and you would have to test it for each specific scenario - not that it is likely to make a big difference. But it is a great question.
With very few values, the Enum.Parse is going to take more time than anything else in either example, so the second should be slower.
With enough values, the switch statement will be implemented as a hashtable, which should work the same speed with strings and enums, so again, Enum.Parse will probably make the second solution slower, but not by relatively as much.
Somewhere in the middle, I would expect the cost of comparing strings being higher than comparing enums would make the first solution faster.
I wouldn't even be surprised if it were different on different compiler versions or different options.
I would definitely say #1. Enum.Parse() uses reflection, which is relatively expensive. Plus, Enum.Parse() will throw an exception if the value isn't defined, and since there's no TryParse() (at least prior to .NET 4's Enum.TryParse) you'd need to wrap it in a try/catch block.
Not sure if there is a performance difference when switching on a string value versus an enum.
One thing to consider is would you need the values used for the case statements elsewhere in your code. If so, then using an enum would make more sense as you have a singular definition of the values. Const strings could also be used.
A while back, a post by Jon Skeet planted the idea in my head of building a CompiledFormatter class, to use in a loop instead of String.Format().
The idea is that the portion of a call to String.Format() spent parsing the format string is overhead; we should be able to improve performance by moving that work outside the loop. The trick, of course, is that the new code should exactly match the String.Format() behavior.
This week I finally did it. I went through the .NET Framework source provided by Microsoft to do a direct adaptation of their parser (it turns out String.Format() actually farms the work out to StringBuilder.AppendFormat()). The code I came up with works, in that my results are accurate for my (admittedly limited) test data.
Unfortunately, I still have one problem: performance. In my initial tests the performance of my code closely matches that of the normal String.Format(). There's no improvement at all; it's even consistently a few milliseconds slower. At least it's still in the same order (ie: the amount slower doesn't increase; it stays within a few milliseconds even as the test set grows), but I was hoping for something better.
It's possible that the internal calls to StringBuilder.Append() are what actually drive the performance, but I'd like to see if the smart people here can help improve things.
Here is the relevant portion:
private class FormatItem
{
public int index; //index of item in the argument list. -1 means it's a literal from the original format string
public char[] value; //literal data from original format string
public string format; //simple format to use with supplied argument (ie: {0:X} for hex)
// for fixed-width format (examples below)
public int width; // {0,7} means it should be at least 7 characters
public bool justify; // {0,-7} would use opposite alignment
}
//this data is all populated by the constructor
private List<FormatItem> parts = new List<FormatItem>();
private int baseSize = 0;
private string format;
private IFormatProvider formatProvider = null;
private ICustomFormatter customFormatter = null;
// the code in here very closely matches the code in the String.Format/StringBuilder.AppendFormat methods.
// Could it be faster?
public String Format(params Object[] args)
{
if (format == null || args == null)
throw new ArgumentNullException((format == null) ? "format" : "args");
var sb = new StringBuilder(baseSize);
foreach (FormatItem fi in parts)
{
if (fi.index < 0)
sb.Append(fi.value);
else
{
//if (fi.index >= args.Length) throw new FormatException(Environment.GetResourceString("Format_IndexOutOfRange"));
if (fi.index >= args.Length) throw new FormatException("Format_IndexOutOfRange");
object arg = args[fi.index];
string s = null;
if (customFormatter != null)
{
s = customFormatter.Format(fi.format, arg, formatProvider);
}
if (s == null)
{
if (arg is IFormattable)
{
s = ((IFormattable)arg).ToString(fi.format, formatProvider);
}
else if (arg != null)
{
s = arg.ToString();
}
}
if (s == null) s = String.Empty;
int pad = fi.width - s.Length;
if (!fi.justify && pad > 0) sb.Append(' ', pad);
sb.Append(s);
if (fi.justify && pad > 0) sb.Append(' ', pad);
}
}
return sb.ToString();
}
//alternate implementation (for comparative testing)
// my own tests call String.Format() separately: I don't use this. But it's useful to see
// how my format method fits.
public string OriginalFormat(params Object[] args)
{
return String.Format(formatProvider, format, args);
}
Additional notes:
I'm wary of providing the source code for my constructor, because I'm not sure of the licensing implications of my reliance on the original .NET implementation. However, anyone who wants to test this can just make the relevant private data public and assign values that mimic a particular format string.
Also, I'm very open to changing the FormatItem class and even the parts List if anyone has a suggestion that could improve the build time. Since my primary concern is sequential iteration time from front to back, maybe a LinkedList would fare better?
[Update]:
Hmm... something else I can try is adjusting my tests. My benchmarks were fairly simple: composing names into a "{lastname}, {firstname}" format, and composing formatted phone numbers from the area code, prefix, number, and extension components. Neither of those has much in the way of literal segments within the string. As I think about how the original state-machine parser worked, I think those literal segments are exactly where my code has the best chance to do well, because I no longer have to examine each character in the string.
Another thought:
This class is still useful, even if I can't make it go faster. As long as performance is no worse than the base String.Format(), I've still created a strongly-typed interface which allows a program to assemble its own "format string" at run time. All I need to do is provide public access to the parts list.
Here's the final result:
I changed the format string in a benchmark trial to something that should favor my code a little more:
The quick brown {0} jumped over the lazy {1}.
As I expected, this fares much better compared to the original: 2 million iterations in 5.3 seconds for this code vs 6.1 seconds for String.Format. This is an undeniable improvement. You might even be tempted to start using this as a no-brainer replacement for many String.Format situations. After all, you'll do no worse and you might even get a small performance boost: as much as 14%, and that's nothing to sneeze at.
Except that it is. Keep in mind, we're still talking about less than half a second of difference over 2 million attempts, in a situation specifically designed to favor this code. Not even busy ASP.NET pages are likely to create that much load, unless you're lucky enough to work on a top-100 web site.
Most of all, this omits one important alternative: you can create a new StringBuilder each time and manually handle your own formatting using raw Append() calls. With that technique my benchmark finished in only 3.9 seconds. That's a much greater improvement.
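For reference, the hand-rolled alternative looks roughly like this (a sketch, not my exact benchmark code), using the benchmark format string from above:
using System.Text;

static class ManualFormat
{
    // The "raw Append" approach: no format string to parse at all.
    static string Compose(string animal, string target)
    {
        var sb = new StringBuilder(64);
        sb.Append("The quick brown ");
        sb.Append(animal);
        sb.Append(" jumped over the lazy ");
        sb.Append(target);
        sb.Append('.');
        return sb.ToString();
    }
}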
In summary, if performance doesn't matter much, you should stick with the clarity and simplicity of the built-in option. But in a situation where profiling shows this really is driving your performance, there is a better alternative available via StringBuilder.Append().
Don't stop now!
Your custom formatter might only be slightly more efficient than the built-in API, but you can add more features to your own implementation that would make it more useful.
I did a similar thing in Java, and here are some of the features I added (besides just pre-compiled format strings):
1) The format() method accepts either a varargs array or a Map (in .NET, it'd be a dictionary). So my format strings can look like this:
StringFormatter f = StringFormatter.parse(
"the quick brown {animal} jumped over the {attitude} dog"
);
Then, if I already have my objects in a map (which is pretty common), I can call the format method like this:
String s = f.format(myMap);
2) I have a special syntax for performing regular expression replacements on strings during the formatting process:
// After calling obj.toString(), all space characters in the formatted
// object string are converted to underscores.
StringFormatter f = StringFormatter.parse(
"blah blah blah {0:/\\s+/_/} blah blah blah"
);
3) I have a special syntax that allows the formatter to check the argument for null-ness, applying a different format depending on whether the object is null or non-null.
StringFormatter f = StringFormatter.parse(
"blah blah blah {0:?'NULL'|'NOT NULL'} blah blah blah"
);
There are a zillion other things you can do. One of the tasks on my todo list is to add a new syntax where you can automatically format Lists, Sets, and other Collections by specifying a formatter to apply to each element as well as a string to insert between all elements. Something like this...
// Wraps each element in single-quote characters, separating
// adjacent elements with a comma.
StringFormatter f = StringFormatter.parse(
"blah blah blah {0:#['$'][,]} blah blah blah"
);
But the syntax is a little awkward and I'm not in love with it yet.
Anyhow, the point is that your existing class might not be much more efficient than the framework API, but if you extend it to satisfy all of your personal string-formatting needs, you might end up with a very convenient library in the end. Personally, I use my own version of this library for dynamically constructing all SQL strings, error messages, and localization strings. It's enormously useful.
It seems to me that in order to get actual performance improvement, you'd need to factor out any format analysis done by your customFormatter and formattable arguments into a function that returns some data structure that tells a later formatting call what to do. Then you pull those data structures in your constructor and store them for later use. Presumably this would involve extending ICustomFormatter and IFormattable. Seems kinda unlikely.
Have you accounted for the time to do the JIT compile as well? After all, the framework will be ngen'd which could account for the differences?
The framework provides explicit overloads of the format methods that take fixed-size parameter lists instead of the params object[] approach, to remove the overhead of allocating and collecting all of the temporary object arrays. You might want to consider that for your code as well. Also, providing strongly-typed overloads for common value types would reduce boxing overhead.
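A sketch of what one such fixed-arity overload could look like on the CompiledFormatter class above (it reuses that class's fields and FormatItem list, omits the customFormatter branch for brevity, and is not code from the original class):
// Hypothetical addition to CompiledFormatter: a two-argument overload that
// skips the params object[] allocation for the common two-argument case.
public string Format(object arg0, object arg1)
{
    var sb = new StringBuilder(baseSize);
    foreach (FormatItem fi in parts)
    {
        if (fi.index < 0)
        {
            sb.Append(fi.value);
            continue;
        }
        if (fi.index > 1) throw new FormatException("Format_IndexOutOfRange");
        object arg = (fi.index == 0) ? arg0 : arg1;
        IFormattable formattable = arg as IFormattable;
        string s = (formattable != null)
            ? formattable.ToString(fi.format, formatProvider)
            : (arg != null ? arg.ToString() : String.Empty);
        int pad = fi.width - s.Length;
        if (!fi.justify && pad > 0) sb.Append(' ', pad);
        sb.Append(s);
        if (fi.justify && pad > 0) sb.Append(' ', pad);
    }
    return sb.ToString();
}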
I gotta believe that spending as much time optimizing data IO would earn exponentially bigger returns!
This is surely a kissin' cousin to YAGNI: Avoid Premature Optimization. APO.
I have seen something like the following a couple of times... and I hate it. Is this basically 'cheating' the language? Or would you consider this to be 'ok' because IsNullOrEmpty is always evaluated first?
(We could argue whether or not a string should be NULL when it comes out of a function, but that isn't really the question.)
string someString;
someString = MagicFunction();
if (!string.IsNullOrEmpty(someString) && someString.Length > 3)
{
// normal string, do whatever
}
else
{
// On a NULL string, it drops to here, because the first condition (!string.IsNullOrEmpty) evaluates to false.
// However, the Length check, if used by itself, would throw an exception.
}
EDIT:
Thanks again to everyone for reminding me of this language fundamental. While I knew "why" it worked, I can't believe I didn't know/remember the name of the concept.
(In case anyone wants any background: I came upon this while troubleshooting exceptions generated by NULL strings and .Length > x checks... in different places in the code. So when I saw the above code, in addition to everything else, my frustration took over from there.)
You're taking advantage of a language feature known as short circuiting. This is not cheating the language but in fact using a feature exactly how it was designed to be used.
If you are asking whether it's OK to depend on the short-circuiting conditional operators && and ||, then yes, that's totally fine.
There is nothing wrong with this, as you just want to make certain you won't get a NullReferenceException.
I think it is reasonable to do.
With extension methods you can make it cleaner, but the basic concept would still be valid.
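For example, a minimal sketch of such an extension (the method name is made up):
public static class StringGuards
{
    // Reads as a single question at the call site and hides the null check.
    public static bool HasMinLength(this string value, int minLength)
    {
        return !string.IsNullOrEmpty(value) && value.Length >= minLength;
    }
}
// Usage: if (someString.HasMinLength(4)) { /* normal string, do whatever */ }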
This code is totally valid, but I like to use the null-coalescing operator to avoid the null checks.
string someString = MagicFunction() ?? string.Empty;
if (someString.Length > 3)
{
// normal string, do whatever
}
else
{
// NULL strings will be converted to Length = 0 and will end up here.
}
There's nothing wrong with this.
if() conditions are evaluated from left to right, so it's perfectly fine to stack them like this.
This is valid code, in my opinion (although declaring a variable and assigning it on the next line is pretty annoying), but you should probably realize that you can also enter the else block when the length of the string is less than 3.
That looks to me like a perfectly reasonable use of logical short-circuiting; if anything, you're cheating with the language, not against it. I've only recently come from VB6, which never short-circuited, and that really annoyed me.
One problem to watch out for is that you might need to test for null again in that else clause, since, as written, you're winding up there with both null strings and strings shorter than three characters.
This is perfectly valid and there is nothing wrong with using it that way. If you are following the documented behaviour of the language, then all is well. In C#, the syntax you are using is the conditional logical operators, and their documented behaviour can be found on MSDN.
For me it's the same as not using parentheses when doing multiplication and addition in the same statement, because the language documents that the multiplication is carried out first.
Relying on short-circuiting is the "right thing" to do in most cases. It leads to terser code with fewer moving parts. Which generally means easier to maintain. This is especially true in C and C++.
I would seriously reconsider hiring someone who is not familiar with (and does not know how to use) short-circuiting operations.
I find it OK :) You're just making sure that you don't access a NULL variable.
Actually, I always do such checking before doing any operation on my variables (also when indexing collections and so on) - it's safer, and a best practice, that's all.
It makes sense because C# short-circuits the conditions by default, so I think it's fine to use that to your advantage. In VB there may be issues if the developer uses And instead of AndAlso.
I don't think it's any different than something like this:
int* pNumber = GetAddressOfNumber();
if ((pNumber != NULL) && (*pNumber > 0))
{
// valid number, do whatever
}
else
{
// On a null pointer, it drops to here, because (pNumber != NULL) fails
// However, (*pNumber > 0), if used by itself, would throw an exception when dereferencing NULL
}
It's just taking advantage of a feature of the language. This kind of idiom has been in common use, I think, since C started evaluating Boolean expressions in this manner (or whatever language did it first).
If it were C code that you compiled into assembly, not only is short-circuiting the right behavior, it's faster: in machine language the parts of the if statement are evaluated one after another, and not short-circuiting is slower.
Writing code costs a company a lot of money. But maintaining it costs more!
So I'm OK with your point: chances are that this line of code will not be understood immediately by the person who has to read it and correct it in two years.
Of course, he will be asked to fix a critical production bug. He will search here and there and may not notice this.
We should always code for the next person, and he may be less clever than we are. To me, this is the only thing to remember.
And this implies that we use obvious language features and avoid the others.
All the best, Sylvain.
A bit off topic, but if you ran the same example in VB.NET like this:
dim someString as string
someString = MagicFunction()
if not string.IsNullOrEmpty(someString) and someString.Length > 3 then
' normal string, do whatever
else
' do something else
end if
this would blow up on a null (Nothing) string, but in VB.NET you would code it as follows to do the same as the C#:
dim someString as string
someString = MagicFunction()
if not string.IsNullOrEmpty(someString) andalso someString.Length > 3 then
' normal string, do whatever
else
' do something else
end if
Adding the AndAlso makes it behave the same way, and it also reads better. As someone who does both VB and C# development, the second VB version shows that the logic is slightly different, and it is therefore easier to explain to someone that there is a difference.
Drux