Which Regular Expressions and toUpper combination is faster?


I have two text boxes, one for the input and another for the output. I need to filter only hexadecimal characters from the input and output them in uppercase. I have checked that using regular expressions (Regex) is much faster than using a loop.
My current code uppercases first and then filters the hex digits, as follows:
string strOut = Regex.Replace(inputTextBox.Text.ToUpper(), "[^0-9^A-F]", "");
outputTextBox.Text = strOut;
Alternatively:
string strOut = Regex.Replace(inputTextBox.Text, "[^0-9^A-F^a-f]", "");
outputTextBox.Text = strOut.ToUpper();
The input may contain up to 32k characters, therefore speed is important here. I have used TimeSpan to measure but the results are not consistent.
My question is: which code has better speed performance and why?

This is definitely a case of premature optimization: 32K characters is not a big deal for finely tuned regex engines running on modern computers, so this optimization task is mostly theoretical.
Before discussing the performance, it's worth pointing out that the expressions are probably not doing what you want, because they allow ^ characters into the output. You need to use [^0-9A-F] and [^0-9A-Fa-f] instead.
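To make the caret problem concrete, here is a minimal sketch that runs both character classes over the same input. The class and method names (CaretDemo, Original, Corrected) are hypothetical and exist only for this illustration.

```csharp
using System.Text.RegularExpressions;

public static class CaretDemo
{
    // Pattern from the question: the second ^ is a literal caret inside the
    // negated class, so caret characters are NOT removed.
    public static string Original(string input) =>
        Regex.Replace(input.ToUpper(), "[^0-9^A-F]", "");

    // Corrected pattern: only the leading ^ negates the class.
    public static string Corrected(string input) =>
        Regex.Replace(input.ToUpper(), "[^0-9A-F]", "");
}
```

With the input "1^a-g", Original returns "1^A" while Corrected returns "1A".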
The speed of the two regexes will be nearly identical, because the number of characters in a character class hardly makes a difference. However, in the second combination ToUpper is called on a potentially shorter string, because all invalid characters have already been removed. Therefore, the second option is potentially slightly faster.
However, if you must optimize this to the last CPU cycle, you can rewrite this without regular expressions to avoid a memory allocation in the ToUpper: walk through the input string in a loop, and add all valid characters to StringBuilder as you go. When you see a lowercase character, convert it to upper case.
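The loop described above could look roughly like the following sketch. It assumes only ASCII hex digits count as valid; HexFilter and ToUpperHex are hypothetical names chosen for this example.

```csharp
using System.Text;

public static class HexFilter
{
    // Keeps only hexadecimal digits from the input and returns them in uppercase.
    public static string ToUpperHex(string input)
    {
        // Pre-size the builder: the output can never be longer than the input.
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'F'))
            {
                sb.Append(c);
            }
            else if (c >= 'a' && c <= 'f')
            {
                // Convert a lowercase hex letter to uppercase on the fly.
                sb.Append((char)(c - 'a' + 'A'));
            }
            // Every other character is skipped.
        }
        return sb.ToString();
    }
}
```

A call such as HexFilter.ToUpperHex(inputTextBox.Text) would then replace the Regex.Replace and ToUpper pair. Whether it is actually faster for your input should be measured rather than assumed.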

It's simple to test:
using System;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            string letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
            Random random = new Random();
            // Build 5000 random strings of 32000 letters each.
            // Random.Next excludes its upper bound, so pass letters.Length to reach the last letter.
            string[] strings = Enumerable.Range(0, 5000)
                .Select(i1 => string.Join("", Enumerable.Range(0, 32000)
                    .Select(i2 => letters[random.Next(0, letters.Length)])))
                .ToArray();
            // Approach A: uppercase first, then filter.
            Stopwatch stopwatchA = new Stopwatch();
            stopwatchA.Start();
            foreach (string s in strings)
                Regex.Replace(s.ToUpper(), "[^0-9^A-F]", "");
            stopwatchA.Stop();
            // Approach B: filter first, then uppercase the shorter result.
            Stopwatch stopwatchB = new Stopwatch();
            stopwatchB.Start();
            foreach (string s in strings)
                Regex.Replace(s, "[^0-9^A-F^a-f]", "").ToUpper();
            stopwatchB.Stop();
            Debug.WriteLine("stopwatchA: {0}", stopwatchA.Elapsed);
            Debug.WriteLine("stopwatchB: {0}", stopwatchB.Elapsed);
        }
    }
}
Run 1:
stopwatchA: 00:00:39.6552012
stopwatchB: 00:00:40.6757048
Run 2:
stopwatchA: 00:00:39.7022437
stopwatchB: 00:00:41.3477625
In those two runs, the first approach is slightly faster.

On the theoretical side, string.ToUpper() can potentially allocate a new string (remember that .NET strings are semantically immutable), whereas the regular expression conserves memory, so it should be faster in the general case (for large strings). Also, the input string is looped through twice when ToUpper() is called.

Related

How to match condition IF in C#

I have a string like this.
string strex = "Insert|Update|Delete";
I am retrieving another string as string strex1 = "Insert" (It may retrieve Update or Delete)
I need to match strex1 with strex in "IF" condition in C#.
Do I need to split strex and match with strex1?

The string you posted is a regular expression pattern that matches the words Insert, Update or Delete. Regular expressions are a very common way of specifying validation rules in web applications.
Regular expressions can express far more complex rules than a simple comparison. They're also far faster (think 10x) in validation scenarios than splitting. In a web application, that translates to using fewer servers to serve the same traffic.
You can use .NET's Regex to match strings with that pattern, eg :
var strex = "Insert|Update|Delete";
if (Regex.IsMatch(input,strex))
{
....
}
This will create a new regular expression object each time. You can avoid this by creating a static Regex instance and reusing it. Regex is thread-safe, which means there's no problem using the same instance from multiple threads:
static Regex _cmdRegex = new Regex("Insert|Update|Delete");
...
void MyMethod(string input)
{
if(_cmdRegex.IsMatch(input))
{
...
}
}
The Regex class methods will match if the pattern appears anywhere in the input. Regex.IsMatch("Insert1",strex) will return True. If you want an exact match, you have to specify that the pattern starts at the beginning of the input with ^ and ends at the end with $:
static Regex _cmdRegex = new Regex("^(Insert|Update|Delete)$");
With this change, _cmdRegex.IsMatch("Insert1") will return false but _cmdRegex.IsMatch("Insert") will return true.
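The anchored pattern can be wrapped in a small helper, sketched below. CommandMatcher and IsCommand are hypothetical names used only for this illustration.

```csharp
using System.Text.RegularExpressions;

public static class CommandMatcher
{
    // One shared instance; Regex instances are safe to use from multiple threads.
    // Note: in .NET, $ also matches just before a final trailing newline;
    // use \z instead if the input must end exactly after the word.
    private static readonly Regex CmdRegex =
        new Regex("^(Insert|Update|Delete)$");

    // Returns true only when the whole input is one of the three commands.
    public static bool IsCommand(string input) => CmdRegex.IsMatch(input);
}
```

CommandMatcher.IsCommand("Insert") returns true, while CommandMatcher.IsCommand("Insert1") returns false.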
Performance
In this case a regular expression is a lot faster than splitting and trying exact matches. Think 10-100x over time. There are two reasons for this:
Strings are immutable, so every string modification operation like Split() will generate new temporary strings that have to be allocated and garbage collected. In a busy web application this adds up, eventually using up a lot of RAM and CPU for little or no benefit. One of the reasons ASP.NET Core is 10x faster than the old ASP.NET is eliminating such substring operations wherever possible.
A regular expression is compiled into a program that performs matching in the most efficient way. When you use Split().Any() the program will compare the input with all the substrings even if it's obvious there's no possible match, e.g. because the first letter is Z. A Regex program, on the other hand, would only proceed if the first character was I, U or D.

An efficient way I can think of is using string.Contains():
if(strex.Contains($"{strex1}|") || strex.Contains($"|{strex1}"))
{
//Your code goes here
}
A solution using LINQ: split strex by '|' and check whether strex1 is present in the resulting array.
An issue with the solution below was pointed out by @PanagiotisKanavos in the comments.
Using .Any(),
if(strex.Split('|').Any(x => x.Equals(strex1)))
{
//Your code goes here
}
or using Contains(),
if(strex.Split('|').Contains(strex1))
{
//Your code goes here
}
If you want to ignore case while comparing strings, you can use StringComparison.OrdinalIgnoreCase:
if(strex.Split('|').Any(x => x.Equals(strex1, StringComparison.OrdinalIgnoreCase)))
{
//Your code goes here
}

c# Remove elements from List containing string [duplicate]

What would be the fastest way to check if a string contains any matches in a string array in C#? I can do it using a loop, but I think that would be too slow.

Using LINQ:
return array.Any(s => s.Equals(myString));
Granted, you might want to take culture and case into account, but that's the general idea.
Also, if equality is not what you meant by "matches", you can always use whatever function you need for the "match".

I really couldn't tell you if this is absolutely the fastest way, but one of the ways I have commonly done this is:
This will check if the string contains any of the strings from the array:
string[] myStrings = { "a", "b", "c" };
string checkThis = "abc";
if (myStrings.Any(checkThis.Contains))
{
MessageBox.Show("checkThis contains a string from string array myStrings.");
}
To check if the string contains all the strings (elements) of the array, simply change myStrings.Any in the if statement to myStrings.All.
I don't know what kind of application this is, but I often need to use:
if (myStrings.Any(checkThis.ToLowerInvariant().Contains))
So if you are checking user input, it won't matter whether the user enters the string in CAPITAL letters, because the input is normalized to lowercase with ToLowerInvariant() before the comparison.
Hope this helped!
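As a variation on the case-insensitive check, the sketch below uses String.IndexOf with StringComparison.OrdinalIgnoreCase, which avoids creating a lowercased copy of the input. SubstringCheck, ContainsAny and ContainsAll are hypothetical names for this example.

```csharp
using System;
using System.Linq;

public static class SubstringCheck
{
    // True if text contains at least one of the candidates, ignoring case.
    public static bool ContainsAny(string text, string[] candidates) =>
        candidates.Any(c => text.IndexOf(c, StringComparison.OrdinalIgnoreCase) >= 0);

    // True if text contains every candidate, ignoring case.
    public static bool ContainsAll(string text, string[] candidates) =>
        candidates.All(c => text.IndexOf(c, StringComparison.OrdinalIgnoreCase) >= 0);
}
```

For example, SubstringCheck.ContainsAny("Hello World", new[] { "WORLD", "xyz" }) is true, while ContainsAll with the same arguments is false.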

That works fine for me:
string[] characters = new string[] { ".", ",", "'" };
bool contains = characters.Any(c => word.Contains(c));

You could combine the strings with regex "or" (alternation) statements, and then "do it in one pass," but technically the regex would still be performing a loop internally. Ultimately, looping is necessary.

If the "array" will never change (or change only infrequently), and you'll have many input strings that you're testing against it, then you could build a HashSet<string> from the array. HashSet<T>.Contains is an O(1) operation, as opposed to a loop which is O(N).
But it would take some (small) amount of time to build the HashSet. If the array will change frequently, then a loop is the only realistic way to do it.
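A minimal sketch of the HashSet idea follows. Note the assumption that "match" means whole-string equality, not substring containment; KnownWords, IsKnown and the sample words are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class KnownWords
{
    // Built once and reused for every lookup; the comparer makes lookups ignore case.
    private static readonly HashSet<string> Words = new HashSet<string>(
        new[] { "Insert", "Update", "Delete" },
        StringComparer.OrdinalIgnoreCase);

    // Average O(1) lookup, compared with O(N) for scanning an array.
    public static bool IsKnown(string candidate) => Words.Contains(candidate);
}
```

KnownWords.IsKnown("update") returns true and KnownWords.IsKnown("Upsert") returns false.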

Declaring long strings that use string interpolation in C# 6

I usually wrap long strings by concatenating them:
Log.Debug("I am a long string. So long that I must " +
"be on multiple lines to be feasible.");
This is perfectly efficient, since the compiler handles concatenation of string literals. I also consider it the cleanest way to handle this problem (the options are weighed here).
This approach worked well with String.Format:
Log.Debug(String.Format("Must resize {0} x {1} image " +
"to {2} x {3} for reasons.", image.Width, image.Height,
resizedImage.Width, resizedImage.Height));
However, I now wish to never use String.Format again in these situations, since C# 6's string interpolation is much more readable. My concern is that I no longer have an efficient, yet clean way to format long strings.
My question is if the compiler can somehow optimize something like
Log.Debug($"Must resize {image.Width} x {image.Height} image " +
$"to {resizedImage.Width} x {resizedImage.Height} for reasons.");
into the above String.Format equivalent or if there's an alternative approach that I can use that won't be less efficient (due to the unnecessary concatenation) while also keeping my code cleanly structured (as per the points raised in the link above).

This program:
var name = "Bobby Tables";
var age = 8;
String msg = $"I'm {name} and" +
$" I'm {age} years old";
is compiled as if you had written:
var name = "Bobby Tables";
var age = 8;
String msg = String.Concat(String.Format("I'm {0} and", name),
String.Format(" I'm {0} years old", age));
You see the difficulty in getting rid of the Concat - the compiler has re-written our interpolation literals to use the indexed formatters that String.Format expects, but each string has to number its parameters from 0. Naively concatenating them would cause them both to insert name. To get this to work out correctly, there would have to be state maintained between invocations of the $ parser so that the second string is reformatted as " I'm {1} years old". Alternatively, the compiler could try to apply the same kind of analysis it does for concatenation of string literals. I think this would be a legal optimization even though string interpolation can have side effects, but I wouldn't be surprised if it turned out there was a corner case under which interpolated string concatenation changed program behavior. Neither sounds impossible, especially given the logic is already there to detect a similar condition for string literals, but I can see why this feature didn't make it into the first release.
I would write the code in the way that you feel is cleanest and most readable, and not worry about micro-inefficiencies unless they prove to be a problem. The old saying about code being primarily for humans to understand holds here.
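For a sense of what is at stake, the sketch below shows that the split interpolated form and a single String.Format call produce the same text, so the choice is about readability and a small runtime cost, not about output. InterpolationDemo and its parameter names are hypothetical.

```csharp
public static class InterpolationDemo
{
    // Two interpolated strings joined with +, as in the question.
    public static string Split(int w, int h, int rw, int rh) =>
        $"Must resize {w} x {h} image " +
        $"to {rw} x {rh} for reasons.";

    // One String.Format call; the two literals are merged at compile time.
    public static string Formatted(int w, int h, int rw, int rh) =>
        string.Format("Must resize {0} x {1} image " +
                      "to {2} x {3} for reasons.", w, h, rw, rh);
}
```

Both InterpolationDemo.Split(1, 2, 3, 4) and InterpolationDemo.Formatted(1, 2, 3, 4) return "Must resize 1 x 2 image to 3 x 4 for reasons."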

Maybe it would be not as readable as with + but by all means, it is possible. You just have to break line between { and }:
Log.Debug($@"Must resize {image.Width} x {image.Height} image to {
resizedImage.Width} x {resizedImage.Height} for reasons.");
SO's colouring script does not handle this syntax too well but C# compiler does ;-)

In the specialized case of using this string in HTML (or parsing it with any parser where repeated whitespace does not matter), I would recommend using $@"" strings (verbatim interpolated strings), e.g.:
$@"some veeeeeeeeeeery long string {foo}
whatever {bar}"

In c# 6.0:
var planetName = "Bob";
var myName = "Ford";
var formattedStr = $"Hello planet {planetName}, my name is {myName}!";
// formattedStr should be "Hello planet Bob, my name is Ford!"
Then concatenate with StringBuilder:
StringBuilder stringBuilder = new StringBuilder();
stringBuilder.Append(formattedStr);
// Then add the strings you need:
// append more strings to stringBuilder...

Most efficient way of adding/removing a character to beginning of string?

I was doing a small 'scalable' C# MVC project, with quite a bit of read/write to a database.
From this, I would need to add/remove the first letter of the input string.
'Removing' the first character is quite easy (using a Substring method) - using something like:
String test = "HHello world";
test = test.Substring(1,test.Length-1);
'Adding' a character efficiently seems to be messy/awkward:
String test = "ello World";
test = "H" + test;
Seeing as this will be done for a lot of records, would this be the most efficient way of doing these operations?
I am also testing if a string starts with the letter 'T' by using, and adding 'T' if it doesn't by:
String test = "Hello World";
if(test[0]!='T')
{
test = "T" + test;
}
and would like to know if this would be suitable for this.

If you have several records and need to prepend a character to a field in each of them, you can use String.Insert with an index of 0 http://msdn.microsoft.com/it-it/library/system.string.insert(v=vs.110).aspx
yourString = yourString.Insert(0, "C");
This will do pretty much the same as what you wrote in your original post, but it seems you prefer to use a method rather than an operator...
If you have to add a character several times to a single string, then you're better off using a StringBuilder http://msdn.microsoft.com/it-it/library/system.text.stringbuilder(v=vs.110).aspx

I think both are equally efficient, since both require a new string to be created, because strings are immutable.
When doing this on the same string multiple times, a StringBuilder might come in handy. It will perform better than repeated concatenation.
You could also opt to move this operation to the database side if possible. That might increase performance too.

For removing, I would use the Remove method, as it doesn't require you to know the length of the string:
test = test.Remove(0, 1);
For adding, you could use Insert at index 0:
test = test.Insert(0, "H");
If you are always removing and then adding a character, you can replace the first character in a single step. Strings are immutable, so this still creates a new string:
test = "H" + test.Substring(1);
When doing lots of operations to the same string I would use a StringBuilder though, more expensive to create but faster operations on the string.
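The add, remove and "add only if missing" operations discussed in this thread can be collected into small helpers, sketched below. PrefixHelper, EnsurePrefix and RemoveFirst are hypothetical names, and the empty-string handling is an assumption about the desired behavior.

```csharp
public static class PrefixHelper
{
    // Adds the prefix only when the value does not already start with it.
    // The length check avoids an IndexOutOfRangeException on empty strings.
    public static string EnsurePrefix(string value, char prefix)
    {
        if (value.Length > 0 && value[0] == prefix)
        {
            return value;
        }
        return prefix + value;
    }

    // Removes the first character; an empty string is returned unchanged.
    public static string RemoveFirst(string value) =>
        value.Length == 0 ? value : value.Substring(1);
}
```

PrefixHelper.EnsurePrefix("Hello World", 'T') returns "THello World", and PrefixHelper.RemoveFirst("HHello world") returns "Hello world". Each call allocates one new string, which is unavoidable because strings are immutable.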

StringBuilder: how to get the final String?

Someone told me that it's faster to concatenate strings with StringBuilder. I have changed my code, but I do not see any properties or methods to get the final built string.
How can I get the string?

You can use .ToString() to get the String from the StringBuilder.

Once you have completed the processing using the StringBuilder, use the ToString method to return the final result.
From MSDN:
using System;
using System.Text;
public sealed class App
{
    static void Main()
    {
        // Create a StringBuilder that expects to hold 50 characters.
        // Initialize the StringBuilder with "ABC".
        StringBuilder sb = new StringBuilder("ABC", 50);
        // Append three characters (D, E, and F) to the end of the StringBuilder.
        sb.Append(new char[] { 'D', 'E', 'F' });
        // Append a format string to the end of the StringBuilder.
        sb.AppendFormat("GHI{0}{1}", 'J', 'k');
        // Display the number of characters in the StringBuilder and its string.
        Console.WriteLine("{0} chars: {1}", sb.Length, sb.ToString());
        // Insert a string at the beginning of the StringBuilder.
        sb.Insert(0, "Alphabet: ");
        // Replace all lowercase k's with uppercase K's.
        sb.Replace('k', 'K');
        // Display the number of characters in the StringBuilder and its string.
        Console.WriteLine("{0} chars: {1}", sb.Length, sb.ToString());
    }
}
// This code produces the following output.
//
// 11 chars: ABCDEFGHIJk
// 21 chars: Alphabet: ABCDEFGHIJK

When you say "it's faster to concatenate strings with a StringBuilder", this is only true if you are repeatedly (I repeat - repeatedly) concatenating to the same object.
If you're just concatenating 2 strings and doing something with the result immediately as a string, there's no point to using StringBuilder.
I just stumbled on Jon Skeet's nice write up of this:
https://jonskeet.uk/csharp/stringbuilder.html
If you are using StringBuilder, then to get the resulting string, it's just a matter of calling ToString() (unsurprisingly).

I would just like to point out that while it may not necessarily be faster, it will definitely have a better memory footprint. This is because strings are immutable in .NET, and every time you change a string you create a new one.

About it being faster/better memory:
I looked into this issue with Java; I assume .NET would be as smart about it.
The implementation for String is pretty impressive.
The String object tracks "length" and "shared" (independent of the length of the array that holds the string)
So something like
String a = "abc" + "def" + "ghi";
can be implemented (by the compiler/runtime) as:
- Extend the array holding "abc" by 6 additional spaces.
- Copy def in right after abc
- copy ghi in after def.
- give a pointer to the "abc" string to a
- leave abc's length at 3, set a's length to 9
- set the shared flag in both.
Since most strings are short-lived, this makes for some VERY efficient code in many cases.
The case where it's absolutely NOT efficient is when you are adding to a string within a loop, or when your code is like this:
a = "abc";
a = a + "def";
a += "ghi";
In this case, you are much better off using a StringBuilder construct.
My point is that you should be careful whenever you optimize. Unless you are ABSOLUTELY sure that you know what you are doing, AND you are absolutely sure it's necessary, AND you test to confirm that the optimized code actually improves your use case, just code it in the most readable way possible and don't try to out-think the compiler.
I wasted 3 days messing with strings, caching/reusing string-builders and testing speed before I looked at the string source code and figured out that the compiler was already doing it better than I possibly could for my use case. Then I had to explain how I didn't REALLY know what I was doing, I only thought I did...

It's not faster to concat - As smaclell pointed out, the issue is the immutable string forcing an extra allocation and recopying of existing data.
"a"+"b"+"c" is no faster to do with a StringBuilder, but for repeated concatenations onto an intermediate string, StringBuilder's advantage grows as the number of concatenations increases, for example:
x = "a"; x+="b"; x+="c"; ...
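The repeated-concatenation case can be sketched as follows: append inside the loop, then call ToString() once at the end to get the final string. JoinDemo and Build are hypothetical names for this example.

```csharp
using System.Text;

public static class JoinDemo
{
    // Appends every part to one StringBuilder instead of creating
    // a new intermediate string on each += step.
    public static string Build(string[] parts)
    {
        var sb = new StringBuilder();
        foreach (string part in parts)
        {
            sb.Append(part);
        }
        // ToString() produces the final string exactly once.
        return sb.ToString();
    }
}
```

JoinDemo.Build(new[] { "a", "b", "c" }) returns "abc". For a handful of parts a plain + is just as good; the builder pays off as the number of appends grows.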
