Performance issue: comparing to String.Format - c#

A while back a post by Jon Skeet planted the idea in my head of building a CompiledFormatter class, for using in a loop instead of String.Format().
The idea is the portion of a call to String.Format() spent parsing the format string is overhead; we should be able to improve performance by moving that code outside of the loop. The trick, of course, is the new code should exactly match the String.Format() behavior.
This week I finally did it. I went through the .Net framework source provided by Microsoft and did a direct adaptation of their parser (it turns out String.Format() actually farms the work out to StringBuilder.AppendFormat()). The code I came up with works, in that my results are accurate within my (admittedly limited) test data.
Unfortunately, I still have one problem: performance. In my initial tests the performance of my code closely matches that of the normal String.Format(). There's no improvement at all; it's even consistently a few milliseconds slower. At least it's still in the same order (ie: the amount slower doesn't increase; it stays within a few milliseconds even as the test set grows), but I was hoping for something better.
It's possible that the internal calls to StringBuilder.Append() are what actually drive the performance, but I'd like to see if the smart people here can help improve things.
Here is the relevant portion:
private class FormatItem
{
    public int index;     // index of the item in the argument list; -1 means it's a literal from the original format string
    public char[] value;  // literal data from the original format string
    public string format; // simple format to use with the supplied argument (ie: {0:X} for hex)
    // for fixed-width format (examples below)
    public int width;     // {0,7} means it should be at least 7 characters
    public bool justify;  // {0,-7} would use the opposite alignment
}
//this data is all populated by the constructor
private List<FormatItem> parts = new List<FormatItem>();
private int baseSize = 0;
private string format;
private IFormatProvider formatProvider = null;
private ICustomFormatter customFormatter = null;
// the code in here very closely matches the code in the String.Format/StringBuilder.AppendFormat methods.
// Could it be faster?
public String Format(params Object[] args)
{
    if (format == null || args == null)
        throw new ArgumentNullException((format == null) ? "format" : "args");

    var sb = new StringBuilder(baseSize);
    foreach (FormatItem fi in parts)
    {
        if (fi.index < 0)
            sb.Append(fi.value);
        else
        {
            //if (fi.index >= args.Length) throw new FormatException(Environment.GetResourceString("Format_IndexOutOfRange"));
            if (fi.index >= args.Length) throw new FormatException("Format_IndexOutOfRange");

            object arg = args[fi.index];
            string s = null;
            if (customFormatter != null)
            {
                s = customFormatter.Format(fi.format, arg, formatProvider);
            }
            if (s == null)
            {
                if (arg is IFormattable)
                {
                    s = ((IFormattable)arg).ToString(fi.format, formatProvider);
                }
                else if (arg != null)
                {
                    s = arg.ToString();
                }
            }
            if (s == null) s = String.Empty;

            int pad = fi.width - s.Length;
            if (!fi.justify && pad > 0) sb.Append(' ', pad);
            sb.Append(s);
            if (fi.justify && pad > 0) sb.Append(' ', pad);
        }
    }
    return sb.ToString();
}
//alternate implementation (for comparative testing)
// my own tests call String.Format() separately; I don't use this. But it's useful to see
// how my format method fits.
public string OriginalFormat(params Object[] args)
{
    return String.Format(formatProvider, format, args);
}
Additional notes:
I'm wary of providing the source code for my constructor, because I'm not sure of the licensing implications from my reliance on the original .Net implementation. However, anyone who wants to test this can just make the relevant private data public and assign values that mimic a particular format string.
Also, I'm very open to changing the FormatItem class and even the parts List if anyone has a suggestion that could improve the build time. Since my primary concern is sequential iteration time from front to end, maybe a LinkedList would fare better?
[Update]:
Hmm... something else I can try is adjusting my tests. My benchmarks were fairly simple: composing names to a "{lastname}, {firstname}" format and composing formatted phone numbers from the area code, prefix, number, and extension components. Neither of those has much in the way of literal segments within the string. As I think about how the original state machine parser worked, I think those literal segments are exactly where my code has the best chance to do well, because I no longer have to examine each character in the string.
Another thought:
This class is still useful, even if I can't make it go faster. As long as performance is no worse than the base String.Format(), I've still created a strongly-typed interface which allows a program to assemble its own "format string" at run time. All I need to do is provide public access to the parts list.

Here's the final result:
I changed the format string in a benchmark trial to something that should favor my code a little more:
The quick brown {0} jumped over the lazy {1}.
As I expected, this fares much better compared to the original: 2 million iterations in 5.3 seconds for this code vs 6.1 seconds for String.Format. This is an undeniable improvement. You might even be tempted to start using this as a no-brainer replacement for many String.Format situations. After all, you'll do no worse and you might even get a small performance boost: as much as 14%, and that's nothing to sneeze at.
Except that it is. Keep in mind, we're still talking less than half a second difference for 2 million attempts, under a situation specifically designed to favor this code. Not even busy ASP.Net pages are likely to create that much load, unless you're lucky enough to work on a top 100 web site.
Most of all, this omits one important alternative: you can create a new StringBuilder each time and manually handle your own formatting using raw Append() calls. With that technique my benchmark finished in only 3.9 seconds. That's a much greater improvement.
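For reference, the manual approach for the benchmark string above would look something like this (a sketch; the actual benchmark code isn't shown in the post):

// No format string at all: raw Append() calls, one StringBuilder per iteration.
string arg0 = "fox", arg1 = "dog"; // stand-ins for the two arguments
var sb = new StringBuilder(64);
sb.Append("The quick brown ");
sb.Append(arg0);
sb.Append(" jumped over the lazy ");
sb.Append(arg1);
sb.Append('.');
string result = sb.ToString();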
In summary, if performance doesn't matter as much, you should stick with the clarity and simplicity of the built-in option. But when in a situation where profiling shows this really is driving your performance, there is a better alternative available via StringBuilder.Append().

Don't stop now!
Your custom formatter might only be slightly more efficient than the built-in API, but you can add more features to your own implementation that would make it more useful.
I did a similar thing in Java, and here are some of the features I added (besides just pre-compiled format strings):
1) The format() method accepts either a varargs array or a Map (in .NET, it'd be a dictionary). So my format strings can look like this:
StringFormatter f = StringFormatter.parse(
"the quick brown {animal} jumped over the {attitude} dog"
);
Then, if I already have my objects in a map (which is pretty common), I can call the format method like this:
String s = f.format(myMap);
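In C#, the analogous overload on the question's CompiledFormatter might look like this (a sketch; the name field on FormatItem is an assumed addition, not in the posted code):

// Hypothetical named-placeholder overload for the C# class.
public string Format(IDictionary<string, object> args)
{
    var sb = new StringBuilder(baseSize);
    foreach (FormatItem fi in parts)
    {
        if (fi.name == null)      // literal segment (assumes a new 'name' field)
            sb.Append(fi.value);
        else
            sb.Append(Convert.ToString(args[fi.name], formatProvider));
    }
    return sb.ToString();
}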
2) I have a special syntax for performing regular expression replacements on strings during the formatting process:
// After calling obj.toString(), all space characters in the formatted
// object string are converted to underscores.
StringFormatter f = StringFormatter.parse(
"blah blah blah {0:/\\s+/_/} blah blah blah"
);
3) I have a special syntax that allows the formatter to check the argument for null-ness, applying a different formatter depending on whether the object is null or non-null.
StringFormatter f = StringFormatter.parse(
"blah blah blah {0:?'NULL'|'NOT NULL'} blah blah blah"
);
There are a zillion other things you can do. One of the tasks on my todo list is to add a new syntax where you can automatically format Lists, Sets, and other Collections by specifying a formatter to apply to each element as well as a string to insert between all elements. Something like this...
// Wraps each element in single-quote chars, separating
// adjacent elements with a comma.
StringFormatter f = StringFormatter.parse(
"blah blah blah {0:#['$'][,]} blah blah blah"
);
But the syntax is a little awkward and I'm not in love with it yet.
Anyhow, the point is that your existing class might not be much more efficient than the framework API, but if you extend it to satisfy all of your personal string-formatting needs, you might end up with a very convenient library in the end. Personally, I use my own version of this library for dynamically constructing all SQL strings, error messages, and localization strings. It's enormously useful.

It seems to me that in order to get actual performance improvement, you'd need to factor out any format analysis done by your customFormatter and formattable arguments into a function that returns some data structure that tells a later formatting call what to do. Then you pull those data structures in your constructor and store them for later use. Presumably this would involve extending ICustomFormatter and IFormattable. Seems kinda unlikely.
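To illustrate the idea (a purely hypothetical interface; nothing like it exists in the framework, which is why this seems unlikely):

// Hypothetical extension of the formatting interfaces: an argument type
// analyzes the format string once and returns a delegate that can be
// invoked cheaply on each iteration of the loop.
public interface IPrecompiledFormattable
{
    Func<object, string> CompileFormat(string format, IFormatProvider provider);
}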

Have you accounted for the time to do the JIT compile as well? After all, the framework will be ngen'd, which could account for the difference.

The framework provides explicit overrides to the format methods that take fixed-sized parameter lists instead of the params object[] approach to remove the overhead of allocating and collecting all of the temporary object arrays. You might want to consider that for your code as well. Also, providing strongly-typed overloads for common value types would reduce boxing overhead.
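A sketch of what those overloads might look like on the CompiledFormatter class (not in the posted code; the per-argument formatting logic is elided here):

// Two-argument overload: the call site allocates no params object[].
public string Format(object arg0, object arg1)
{
    var sb = new StringBuilder(baseSize);
    foreach (FormatItem fi in parts)
    {
        if (fi.index < 0)
        {
            sb.Append(fi.value);
            continue;
        }
        if (fi.index > 1) throw new FormatException("Format_IndexOutOfRange");
        object arg = (fi.index == 0) ? arg0 : arg1;
        // ...apply the same custom-formatter/IFormattable/padding logic
        //    as the params-array Format() above...
        sb.Append(arg == null ? String.Empty : arg.ToString());
    }
    return sb.ToString();
}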

I gotta believe that spending as much time optimizing data IO would earn exponentially bigger returns!
This is surely a kissin' cousin to YAGNI: Avoid Premature Optimization. APO.

Related

how to convert part of the string to int/float/vector3 etc. without creating a temp string?

in C#, I have a string like this:
"1 3.14 (23, 23.2, 43,88) 8.27"
I need to convert this string to other types according to the value, like int/float/vector3. Now I have some code like this:
public static int ReadInt(this string s, ref string op)
{
    s = s.Trim();
    string ss = "";
    int idx = s.IndexOf(" ");
    if (idx > 0)
    {
        ss = s.Substring(0, idx);
        op = s.Substring(idx);
    }
    else
    {
        ss = s;
        op = "";
    }
    return Convert.ToInt32(ss);
}
This will read the first int value out; I have some similar functions to read float, vector3, etc. The problem is: in my application I have to do this a lot, because I receive the string from some plugin and need to parse it every single frame. That creates a lot of strings, which causes a lot of GC pressure and hurts performance. Is there a way I can do similar stuff without creating temp strings?
Generation 0 objects such as those created here may well not impact performance too much, as they are relatively cheap to collect. I would change from using Convert to calling int.Parse() with the invariant culture before I started worrying about the GC overhead of the extra strings.
Also, you don't really need to create a new string to accomplish the Trim() behavior. After all, you're scanning and indexing the string anyway. Just do your initial scan for whitespace, and then for the space delimiter between ss and op, so you get just the substrings you need. Right now you're creating 50% more string instances than you really need.
All that said, no...there's not anything built into the basic .NET framework that would parse a substring without actually creating a new string instance. You would have to write your own parsing routines to accomplish that.
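A minimal sketch of such a routine for the int case (names are illustrative; it assumes ASCII digits with no sign, and replaces the ref string op with a position index so nothing is allocated):

// Parses an int directly from s starting at pos, advancing pos past the
// digits. No Substring calls, no temporary strings.
public static int ReadInt(string s, ref int pos)
{
    while (pos < s.Length && s[pos] == ' ') pos++;   // skip leading spaces
    int value = 0;
    while (pos < s.Length && s[pos] >= '0' && s[pos] <= '9')
    {
        value = value * 10 + (s[pos] - '0');
        pos++;
    }
    return value;
}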
You should measure the actual real-world performance impact first, to make sure these substrings really are a significant issue.
I don't know what the "some plugin" is or how you have to handle the input from it, but I would not be surprised to hear that the overhead in acquiring the original input string(s) for this scenario swamps the overhead of the substrings for parsing.

Fastest, Efficient, Elegant way of Parsing Strings to Dynamic types?

I'm looking for the fastest generic approach to converting strings into various data types on the go.
I am parsing large text data files generated by something (files are several megabytes in size). This particular function reads lines in the text file, parses each line into columns based on delimiters, and places the parsed values into a .NET DataTable. This is later inserted into a database. My bottleneck by FAR is the string conversions (Convert and TypeConverter).
I have to go with a dynamic way (i.e. staying away from "Convert.ToInt32" etc...) because I never know what types are going to be in the files. The type is determined by earlier configuration during runtime.
So far I have tried the following, and both take several minutes to parse a file. Note that if I comment out this one line it runs in only a few hundred milliseconds.
row[i] = Convert.ChangeType(columnString, dataType);
AND
TypeConverter typeConverter = TypeDescriptor.GetConverter(type);
row[i] = typeConverter.ConvertFromString(null, cultureInfo, columnString);
If anyone knows of a faster way that is generic like this I would like to know about it. Or if my whole approach just sucks for some reason I'm open to suggestions. But please don't point me to non-generic approaches using hard coded types; that is simply not an option here.
UPDATE - Multi-threading to Improve Performance Test
In order to improve performance I have looked into splitting up parsing tasks to multiple threads. I found that the speed increased somewhat but still not as much as I had hoped. However, here are my results for those who are interested.
System:
Intel Xeon E3-1245 3.3GHz Quad Core
Memory: 12.0 GB
Windows 7 Enterprise x64
Test:
The test function is this:
(1) Receive an array of strings. (2) Split the string by delimiters. (3) Parse strings into data types and store them in a row. (4) Add row to data table. (5) Repeat (2)-(4) until finished.
The test included 1000 strings, each string being parsed into 16 columns, so that is 16000 string conversions total. I tested single thread, 4 threads (because of quad core), and 8 threads (because of hyper-threading). Since I'm only crunching data here I doubt adding more threads than this would do any good. So for the single thread it parses 1000 strings, 4 threads parse 250 strings each, and 8 threads parse 125 strings each. Also I tested a few different ways of using threads: thread creation, thread pool, tasks, and function objects.
Results:
Result times are in Milliseconds.
Single Thread:
Method Call: 17720
4 Threads
Parameterized Thread Start: 13836
ThreadPool.QueueUserWorkItem: 14075
Task.Factory.StartNew: 16798
Func BeginInvoke EndInvoke: 16733
8 Threads
Parameterized Thread Start: 12591
ThreadPool.QueueUserWorkItem: 13832
Task.Factory.StartNew: 15877
Func BeginInvoke EndInvoke: 16395
As you can see, the fastest is using Parameterized Thread Start with 8 threads (the number of my logical cores). However, it does not beat 4 threads by much and is only about 29% faster than using a single thread. Of course results will vary by machine. Also, I stuck with a
Dictionary<Type, TypeConverter>
cache for string parsing, as using arrays of type converters did not offer a noticeable performance increase, and one shared type-converter cache is more maintainable than creating arrays all over the place when I need them.
ANOTHER UPDATE:
Ok so I ran some more tests to see if I could squeeze some more performance out and I found some interesting things. I decided to stick with 8 threads, all started from the Parameterized Thread Start method (which was the fastest of my previous tests). The same test as above was run, just with different parsing algorithms.
I noticed that
Convert.ChangeType and TypeConverter
take about the same amount of time. Type specific converters like
int.TryParse
are slightly faster but not an option for me since my types are dynamic. ricovox had some good advice about exception handling. My data does indeed contain invalid values; some integer columns put a dash '-' for empty numbers, so the type converters blow up on it, meaning every row I parse produces at least one exception. That's 1000 exceptions! Very time consuming.
Btw, this is how I do my conversions with TypeConverter. Extensions is just a static class, and GetTypeConverter just returns a cached TypeConverter. If an exception is thrown during the conversion, a default value is used.
public static Object ConvertTo(this String arg, CultureInfo cultureInfo, Type type, Object defaultValue)
{
    Object value;
    TypeConverter typeConverter = Extensions.GetTypeConverter(type);
    try
    {
        // Try converting the string.
        value = typeConverter.ConvertFromString(null, cultureInfo, arg);
    }
    catch
    {
        // If the conversion fails then use the default value.
        value = defaultValue;
    }
    return value;
}
Results:
Same test on 8 threads - parse 1000 lines, 16 columns each, 250 lines per thread.
So I did 3 new things.
1 - Run the test: check for known invalid types before parsing to minimize exceptions.
i.e. if(!Char.IsDigit(c)) value = 0; OR columnString.Contains('-') etc...
Runtime: 29ms
2 - Run the test: use custom parsing algorithms that have try catch blocks.
Runtime: 12424ms
3 - Run the test: use custom parsing algorithms checking for invalid types before parsing to minimize exceptions.
Runtime 15ms
Wow! As you can see, eliminating the exceptions made a world of difference. I never realized how expensive exceptions really were! So if I minimize my exceptions to TRULY unknown cases, then the parsing algorithm runs three orders of magnitude faster. I'm considering this absolutely solved. I believe I will keep the dynamic type conversion with TypeConverter; it is only a few milliseconds slower. Checking for known invalid types before converting avoids exceptions, and that speeds things up incredibly! Thanks to ricovox for pointing that out, which made me test this further.
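For illustration, the pre-check for an integer column might look something like this (a sketch; the exact guards depend on which invalid markers your data can contain):

// Known bad values (empty or a lone dash) short-circuit to the default
// instead of letting the converter throw.
object ParseIntColumn(string columnString, CultureInfo cultureInfo, object defaultValue)
{
    if (String.IsNullOrEmpty(columnString) || columnString == "-")
        return defaultValue;
    return columnString.ConvertTo(cultureInfo, typeof(int), defaultValue);
}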
if you are primarily going to be converting the strings to the native data types (string, int, bool, DateTime etc) you could use something like the code below, which caches the TypeCodes and TypeConverters (for non-native types) and uses a fast switch statement to quickly jump to the appropriate parsing routine. This should save some time over Convert.ChangeType because the source type (string) is already known, and you can directly call the right parse method.
/* Get an array of Types for each of your columns.
 * Open the data file for reading.
 * Create your DataTable and add the columns.
 * (You have already done all of these in your earlier processing.)
 *
 * Note: For the sake of generality, I've used an IEnumerable<string>
 * to represent the lines in the file, although for large files,
 * you would use a FileStream or TextReader etc.
 */
IList<Type> columnTypes;        //array or list of the Type to use for each column
IEnumerable<string> fileLines;  //the lines to parse from the file.
DataTable table;                //the table you'll add the rows to

int colCount = columnTypes.Count;
var typeCodes = new TypeCode[colCount];
var converters = new TypeConverter[colCount];

//Fill up the typeCodes array with the Type.GetTypeCode() of each column type.
//If the TypeCode is Object, then get a custom converter for that column.
for (int i = 0; i < colCount; i++) {
    typeCodes[i] = Type.GetTypeCode(columnTypes[i]);
    if (typeCodes[i] == TypeCode.Object)
        converters[i] = TypeDescriptor.GetConverter(columnTypes[i]);
}

//Probably faster to build up an array of objects and insert them into the row all at once.
object[] vals = new object[colCount];
object val;
foreach (string line in fileLines) {
    //split the line into columns, however you see fit. I'll assume a tab character.
    var columns = line.Split('\t');
    for (int i = 0; i < colCount; i++) {
        switch (typeCodes[i]) {
            case TypeCode.String:
                val = columns[i]; break;
            case TypeCode.Int32:
                val = int.Parse(columns[i]); break;
            case TypeCode.DateTime:
                val = DateTime.Parse(columns[i]); break;
            //...list types that you expect to encounter often.
            //finally, deal with other objects
            case TypeCode.Object:
            default:
                val = converters[i].ConvertFromString(columns[i]);
                break;
        }
        vals[i] = val;
    }
    //Add all values to the row at one time.
    //This might be faster than adding each column one at a time.
    //There are two ways to do this:
    var row = table.Rows.Add(vals); //create new row on the fly.
    // OR
    //row.ItemArray = vals; //(e.g. allows setting an existing row, created previously)
}
There really ISN'T any other way that would be faster, because we're basically just using the raw string parsing methods defined by the types themselves. You could re-write your own parsing code for each output type yourself, making optimizations for the exact formats you'll encounter. But I assume that is overkill for your project. It would probably be better and faster to simply tailor the FormatProvider or NumberStyles in each case.
For example let's say that whenever you parse Double values, you know, based on your proprietary file format, that you won't encounter any strings that contain exponents etc, and you know that there won't be any leading or trailing space, etc. So you can clue the parser in to these things with the NumberStyles argument as follows:
//NOTE: using System.Globalization;
var styles = NumberStyles.AllowDecimalPoint | NumberStyles.AllowLeadingSign;
var d = double.Parse(text, styles);
I don't know for a fact how the parsing is implemented, but I would think that the NumberStyles argument allows the parsing routine to work faster by excluding various formatting possibilities. Of course, if you can't make any assumptions about the format of the data, then you won't be able to make these types of optimizations.
Of course, there's always the possibility that your code is slow simply because it takes time to parse a string into a certain data type. Use a performance analyzer (like in VS2010) to try to see where your actual bottleneck is. Then you'll be able to optimize better, or simply give up, e.g. in the case that there is nothing else to do short of writing the parsing routines in assembly :-)
Here is a quick piece of code to try:
Dictionary<Type, TypeConverter> _ConverterCache = new Dictionary<Type, TypeConverter>();

TypeConverter GetCachedTypeConverter(Type type)
{
    if (!_ConverterCache.ContainsKey(type))
        _ConverterCache.Add(type, TypeDescriptor.GetConverter(type));
    return _ConverterCache[type];
}
Then use the code below instead:
TypeConverter typeConverter = GetCachedTypeConverter(type);
Is it a little faster?
A technique I commonly use is:
var parserLookup = new Dictionary<Type, Func<string, dynamic>>();
parserLookup.Add(typeof(Int32), s => Int32.Parse(s));
parserLookup.Add(typeof(Int64), s => Int64.Parse(s));
parserLookup.Add(typeof(Decimal), s => Decimal.Parse(s, NumberStyles.Number | NumberStyles.Currency, CultureInfo.CurrentCulture));
parserLookup.Add(typeof(DateTime), s => DateTime.Parse(s, CultureInfo.CurrentCulture, DateTimeStyles.AssumeLocal));
// and so on for any other type you want to handle.
This assumes you can figure out what Type your data represents. The use of dynamic also implies .net 4 or higher, but you can change that to object in most cases.
Cache your parser lookup for each file (or for your entire app) and you should get pretty good performance.
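Per-value usage is then a lookup plus a delegate call (a sketch; dataType and columnString are assumed to come from your existing column configuration and split logic):

// Use the cached delegate when one is registered; otherwise fall back.
Func<string, dynamic> parser;
if (parserLookup.TryGetValue(dataType, out parser))
    row[i] = parser(columnString);
else
    row[i] = Convert.ChangeType(columnString, dataType);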

Is there a way to use less than on Strings?

Using string.CompareTo(string) I can get around this somewhat, but it is not easy to read, and I have read that locale settings might influence the result.
Is there a way to just simply use < or > on 2 Strings in a more straightforward way?
You can overload operators but you seldom should. To me "stringA" > "stringB" wouldn't mean a damn thing, it's not helping readability IMO. That's why operator overloading guidelines advise not to overload operators if the meaning is not obvious.
EDIT: Operator Overloading Usage Guidelines
Also, in the case of String I'm afraid you can't do it, seeing as operator-overloading methods can only be defined inside the class they operate on.
If the syntax of CompareTo bothers you, maybe wrapping it in extension method will solve your problem?
Like that:
public static bool IsLessThan(this string str, string str2)
{
    return string.Compare(str, str2) < 0;
}
I still find it confusing for reader though.
The bottom line is, you can't overload operators for String. Usually you can do something like declaring a partial and stuffing your overloads there, but String is a sealed class, so not this time. I think that the extension method with reasonable name is your best bet. You can put CompareTo or some custom logic inside it.
CompareTo is the proper way in my opinion, you can use the overloads to specify culture specific parameters...
You mention in a comment that you're comparing two strings with values of the form "A100" and "B001". This works in your legacy VB 6 code with the < and > operators because of the way that VB 6 implements string comparison.
The algorithm is quite simple. It walks through the string, one character at a time, and compares the ASCII values of each character. As soon as a character from one string is found to have a lower ASCII code than the corresponding character in the other string, the comparison stops and the first string is declared to be "less than" the second. (VB 6 can be forced to perform a case-insensitive comparison based on the system's current locale by placing the Option Compare Text statement at the top of the relevant code module, but this is not the default setting.)
Simple, of course, but not entirely logical. Comparing ASCII values skips over all sorts of interesting things you might find in strings nowadays; namely non-ASCII characters. Since you appear to be dealing with strings whose contents have pre-defined limits, this may not be a problem in your particular case. But more generally, writing code like strA < strB is going to look like complete nonsense to anyone else who has to maintain your code (it seems like you're already having this experience), and I encourage you to do the "right thing" even when you're dealing with a fixed set of possible inputs.
There is nothing "straightforward" about using < or > on string values. If you need to implement this functionality, you're going to have to do it yourself. Following the algorithm that I described VB 6 as using above, you could write your own comparison function and call that in your code, instead. Walk through each character in the string, determine if it is a character or a number, and convert it to the appropriate data type. From there, you can compare the two parsed values, and either move on to the next index in the string or return an "equality" value.
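A sketch of such a comparison for values shaped like "A100" and "B001" (it assumes one letter followed by digits; a general-purpose version needs more cases):

// Letter prefix compared first; the numeric tail compared as a number.
public static int CompareCode(string a, string b)
{
    int byLetter = a[0].CompareTo(b[0]);
    if (byLetter != 0) return byLetter;
    return int.Parse(a.Substring(1)).CompareTo(int.Parse(b.Substring(1)));
}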
There is another problem with that, I think:
Assert.IsFalse(10 < 2);
Assert.IsTrue("10" < "2");
(The second Assert assumes you did an overload for the < operator on the string class.)
But the operator suggests otherwise!!
I agree with Dyppl: you shouldn't do it!

C# better way to do this?

Hi I have this code below and am looking for a prettier/faster way to do this.
Thanks!
string value = "HelloGoodByeSeeYouLater";
string[] y = new string[]{"Hello", "You"};
foreach(string x in y)
{
value = value.Replace(x, "");
}
You could do:
y.ToList().ForEach(x => value = value.Replace(x, ""));
Although I think your variant is more readable.
Forgive me, but someone's gotta say it,
value = Regex.Replace( value, string.Join("|", y.Select(Regex.Escape)), "" );
Possibly faster, since it creates fewer strings.
EDIT: Credit to Gabe and lasseespeholt for Escape and Select.
While not any prettier, there are other ways to express the same thing.
In LINQ:
value = y.Aggregate(value, (acc, x) => acc.Replace(x, ""));
With String methods:
value = String.Join("", value.Split(y, StringSplitOptions.None));
I don't think anything is going to be faster in managed code than a simple Replace in a foreach though.
It depends on the size of the string you are searching. The foreach example is perfectly fine for small operations but creates a new instance of the string each time it operates because the string is immutable. It also requires searching the whole string over and over again in a linear fashion.
The basic solutions have all been proposed. The Linq examples provided are good if you are comfortable with that syntax; I also liked the suggestion of an extension method, although that is probably the slowest of the proposed solutions. I would avoid a Regex unless you have an extremely specific need.
So let's explore more elaborate solutions and assume you needed to handle a string that was thousands of characters in length and had many possible words to be replaced. If this doesn't apply to the OP's need, maybe it will help someone else.
Method #1 is geared towards large strings with few possible matches.
Method #2 is geared towards short strings with numerous matches.
Method #1
I have handled large-scale parsing in c# using char arrays and pointer math with intelligent seek operations that are optimized for the length and potential frequency of the term being searched for. It follows the methodology of:
Extremely cheap peeks, one character at a time
Only investigate potential matches
Modify output when match is found
For example, you might read through the whole source array and only add words to the output when they are NOT found. This would remove the need to keep redimensioning strings.
A simple example of this technique is looking for a closing HTML tag in a DOM parser. For example, I may read an opening STYLE tag and want to skip through (or buffer) thousands of characters until I find a closing STYLE tag.
This approach provides incredibly high performance, but it's also incredibly complicated if you don't need it (plus you need to be well-versed in memory manipulation/management or you will create all sorts of bugs and instability).
I should note that the .Net string libraries are already incredibly efficient but you can optimize this approach for your own specific needs and achieve better performance (and I have validated this firsthand).
Method #2
Another alternative involves storing search terms in a Dictionary containing Lists of strings. Basically, you decide how long your search prefix needs to be, and read characters from the source string into a buffer until you meet that length. Then, you search your dictionary for all terms that match that string. If a match is found, you explore further by iterating through that List, if not, you know that you can discard the buffer and continue.
Because the Dictionary matches strings based on hash, the search is non-linear and ideal for handling a large number of possible matches.
I'm using this methodology to allow instantaneous (<1ms) searching of every airfield in the US by name, state, city, FAA code, etc. There are 13K airfields in the US, and I've created a map of about 300K permutations (again, a Dictionary with prefixes of varying lengths, each corresponding to a list of matches).
For example, Phoenix, Arizona's main airfield is called Sky Harbor with the short ID of KPHX. I store:
KP
KPH
KPHX
Ph
Pho
Phoe
Ar
Ari
Ariz
Sk
Sky
Ha
Har
Harb
There is a cost in terms of memory usage, but string interning probably reduces this somewhat and the resulting speed justifies the memory usage on data sets of this size. Searching happens as the user types and is so fast that I have actually introduced an artificial delay to smooth out the experience.
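A sketch of building and probing such a prefix map (types and names here are illustrative, not the actual airfield code):

// Every prefix of each search term maps to the records it can match.
static Dictionary<string, List<string>> prefixMap =
    new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);

static void AddPrefixes(string term, string record, int maxLen)
{
    for (int len = 2; len <= Math.Min(maxLen, term.Length); len++)
    {
        string prefix = term.Substring(0, len);
        List<string> matches;
        if (!prefixMap.TryGetValue(prefix, out matches))
            prefixMap[prefix] = matches = new List<string>();
        matches.Add(record);
    }
}

// Lookup is then a single hash probe on whatever the user has typed so far:
// prefixMap.TryGetValue("Pho", out matches); // -> records for Phoenix, etc.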
Send me a message if you have the need to dig into these methodologies.
Extension method for elegance
(arguably "prettier" at the call level)
I'll implement an extension method that allows you to call your implementation directly on the original string as seen here.
value = value.Remove(y);
// or
value = value.Remove("Hello", "You");
// effectively
string value = "HelloGoodByeSeeYouLater".Remove("Hello", "You");
The extension method is callable on any string value in fact, and therefore easily reusable.
Implementation of Extension method:
I'm going to wrap your own implementation (shown in your question) in an extension method for pretty or elegant points and also employ the params keyword to provide some flexibility in passing the arguments. You can substitute somebody else's faster implementation body into this method.
static class EXTENSIONS
{
    static public string Remove(this string thisString, params string[] arrItems)
    {
        // Whatever implementation you like:
        if (thisString == null)
            return null;

        var temp = thisString;
        foreach (string x in arrItems)
            temp = temp.Replace(x, "");
        return temp;
    }
}
That's the brightest idea I can come up with right now that nobody else has touched on.

Most efficient way to determine if a string length != 0?

I'm trying to speed up the following:
string s; //--> s is never null
if (s.Length != 0)
{
    <do something>
}
Problem is, it appears the .Length actually counts the characters in the string, and this is way more work than I need. Anybody have an idea on how to speed this up?
Or, is there a way to determine if s[0] exists, w/out checking the rest of the string?
EDIT: Now that you've provided some more context:
Trying to reproduce this, I failed to find a bottleneck in string.Length at all. The only way of making it faster was to comment out both the test and the body of the if block - which isn't really fair. Just commenting out the condition slowed things down, i.e. unconditionally copying the reference was slower than checking the condition.
As has been pointed out, using the overload of string.Split which removes empty entries for you is the real killer optimization.
You can go further, by avoiding creating a new char array (containing just a space) every time. You're always going to pass the same thing, so why not take advantage of that?
Empty arrays are effectively immutable. You can optimize the null/empty case by always returning the same thing.
The optimized code becomes:
private static readonly char[] Delimiters = " ".ToCharArray();
private static readonly string[] EmptyArray = new string[0];

public static string[] SplitOnMultiSpaces(string text)
{
    if (string.IsNullOrEmpty(text))
    {
        return EmptyArray;
    }
    return text.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries);
}
String.Length absolutely does not count the letters in the string. The value is stored as a field - although I seem to remember that the top bit of that field is used to remember whether or not all characters are ASCII (or used to be, anyway) to enable other optimisations. So the property access may need to do a bitmask, but it'll still be O(1) and I'd expect the JIT to inline it, too. (It's implemented as an extern, but hopefully that wouldn't affect the JIT in this case - I suspect it's a common enough operation to potentially have special support.)
If you already know that the string isn't null, then your existing test of
if (s.Length != 0)
is the best way to go if you're looking for raw performance IMO. Personally in most cases I'd write:
if (s != "")
to make it clearer that we're not so much interested in the length as a value as whether or not this is the empty string. That will be slightly slower than the length test, but I believe it's clearer. As ever, I'd go for the clearest code until you have benchmark/profiling data to indicate that this really is a bottleneck. I know your question is explicitly about finding the most efficient test, but I thought I'd mention this anyway. Do you have evidence that this is a bottleneck?
EDIT: Just to give clearer reasons for my suggestion of not using string.IsNullOrEmpty: a call to that method suggests to me that the caller is explicitly trying to deal with the case where the variable is null, otherwise they wouldn't have mentioned it. If at this point of the code it counts as a bug if the variable is null, then you shouldn't be trying to handle it as a normal case.
In this situation, the Length check is actually better in one way than the inequality test I've suggested: it acts as an implicit assertion that the variable isn't null. If you have a bug and it is null, the test will throw an exception and the bug will be detected early. If you use the equality test it will treat null as being different to the empty string, so it will go into your "if" statement's body. If you use string.IsNullOrEmpty it will treat null as being the same as empty, so it won't go into the block.
String.IsNullOrEmpty is the preferred method for checking for null or zero length strings.
Internally, it will use Length. The Length property for a string should not be calculated on the fly though.
If you're absolutely certain that the string will never be null and you have some strong objection to String.IsNullOrEmpty, the most efficient code I can think of would be:
if (s.Length > 0)
{
    // Do Something
}
Or, possibly even better:
if (s != "")
{
    // Do Something
}
Accessing the Length property shouldn't do a count -- .NET strings store a count inside the object.
The SSCLI/Rotor source code contains an interesting comment which suggests that String.Length is (a) efficient and (b) magic:
// Gets the length of this string
//
/// This is a EE implemented function so that the JIT can recognise is specially
/// and eliminate checks on character fetchs in a loop like:
/// for(int I = 0; I < str.Length; i++) str[i]
/// The actually code generated for this will be one instruction and will be inlined.
//
public extern int Length {
[MethodImplAttribute(MethodImplOptions.InternalCall)]
get;
}
Here is how to use String.IsNullOrEmpty:
if (!String.IsNullOrEmpty(yourstring))
{
    // your code
}
String.IsNullOrWhiteSpace(s);
true if s is null or Empty, or if s consists exclusively of white-space characters.
As always with performance: benchmark.
Using .NET 3.5 or earlier, you'll want to test yourString.Length vs String.IsNullOrEmpty(yourString).
Using .NET 4, do both of the above and add String.IsNullOrWhiteSpace(yourString).
Of course, if you know your string will never be empty, you could just attempt to access s[0] and handle the exception when it's not there. That's not normally good practice, but it may be closer to what you need (if s should always have a non-blank value).
for (int i = 0; i < 100; i++)
{
    System.Diagnostics.Stopwatch timer = new System.Diagnostics.Stopwatch();
    string s = "dsfasdfsdafasd";

    timer.Start();
    if (s.Length > 0)
    {
    }
    timer.Stop();
    System.Diagnostics.Debug.Write(String.Format("s.Length != 0 {0} ticks ", timer.ElapsedTicks));

    timer.Reset();
    timer.Start();
    if (s == String.Empty)
    {
    }
    timer.Stop();
    System.Diagnostics.Debug.WriteLine(String.Format("s == String.Empty {0} ticks", timer.ElapsedTicks));
}
Using the stopwatch, s.Length != 0 takes fewer ticks than s == String.Empty (after I fixed the code).
Based on your intent described in your answer, why don't you just try using this built-in option on Split:
s.Split(new[]{" "}, StringSplitOptions.RemoveEmptyEntries);
Just use String.Split(new char[]{' '}, StringSplitOptions.RemoveEmptyEntries) and it will do it all for you.
