C# - fastest way to compare two strings using wildcards - c#

Is there a fastest way to compare two strings (using the space for a wildcard) than this function?
public static bool CustomCompare(this string word, string mask)
{
for (int index = 0; index < mask.Length; index++)
{
if (mask[index] != ' ') && (mask[index]!= word[index]))
{
return false;
}
}
return true;
}
Example: "S nt nce" comparing with "Sentence" will return true. (The two being compared would need to be the same length)

If mask.length is less than word.length, this function will stop comparing at the end of mask. A word/mask length compare in the beginning would prevent that, also it would quick-eliminate some obvious mismatches.

The loop is pretty simple and I'm not sure you can do much better. You might be able to micro optimize the order of the expression in the if statement. For example due to short circuiting of the && it might be faster to order the if statement this way
if (mask[index]!= word[index])) && (mask[index] != ' ')
Assuming that matching characters is more common that matching the wildcard. Of course this is just theory I wouldn't believe it made a difference without benchmarking it.
And as others have pointed out the routine fails if the mask and string are not the same length.

That looks like a pretty good implementation - I don't think you will get much faster than that.
Have you profiled this code and found it to be a bottleneck in your application? I think this should be fine for most purposes.

If you used . instead of , you could do a simple regex match.

Variable Length comparison:
I used your code as a starting place for my own application which assumes the mask length is shorter or equal to the comparison text length. allowing for a variable length wildcard spot in the mask. ie: "concat" would match a mask of "c ncat" or "c t" or even "c nc t"
private bool CustomCompare(string word, string mask)
{
int lengthDifference = word.Length - mask.Length;
int wordOffset = 0;
for (int index = 0; index < mask.Length; index++)
{
if ((mask[index] != ' ') && (mask[index]!= word[index+wordOffset]))
{
if (lengthDifference <= 0)
{
return false;
}
else
{
lengthDifference += -1;
wordOffset += 1;
}
}
}
return true;
}

Not sure if this is any faster but it looks neat:
public static bool CustomCompare(this string word, string mask)
{
return !mask.Where((c, index) => c != word[index] && c != ' ').Any();
}

I think you're doing a little injustice by not giving a little bit of context to your code. Sure, if you want to search only one string of characters of the same length as your pattern, then yes this is fine.
However, if you are using this as the heart of a pattern matcher where there are several other patterns you will be looking for, this is a poor method. There are other known methods, the best of which depends on your exact application. The phrase "inexact pattern matching" is the phrase you are concerned with.

Related

How do I validate character type and position within a string using a loop?

I'm currently attempting to validate a string for an assignment so it's imperative that I'm not simply given the answer, if you provide an answer please give suitable explanation so that I can learn from it.
Suppose I have a string
(1234)-1234 ABCD
I'd like to create a loop that will go through that string and validate the position of the "()" as well as the "-" and " ". In addition to the validation of those characters their position must also be the same as well as the data type. Finally, it must be inside a method.
CANNOT USE REGEX
TLDR;
Validate the position of characters and digits in a string, while using a loop inside of a method. I cannot use REGEX and need to do this manually.
Here's what I have so far. But I feel like the loop would be more efficient and look nicer.
public static string PhoneChecker(string phoneStr)
{
if (phoneStr[0] == '(' && phoneStr[4] == ')' && phoneStr[5] == ' ' && phoneStr[9] == '-' && phoneStr.Length == 14)
{
phoneStr = phoneStr.Remove(0, 1);
phoneStr = phoneStr.Remove(3, 1);
phoneStr = phoneStr.Remove(3, 1);
phoneStr = phoneStr.Remove(6, 1);
Console.WriteLine(phoneStr);
if (int.TryParse(phoneStr, out int phoneInt) == false)
{
Console.WriteLine("Invalid");
}
else
{
Console.WriteLine("Valid");
}
}
else
{
Console.WriteLine("Invalid");
}
return phoneStr;
}
It is still unmaintaible, but still a little better... Note that your code didn't work with your example string (the indexes were off by one).
public static bool PhoneChecker(string phoneStr)
{
if (phoneStr.Length != 16 || phoneStr[0] != '(' || phoneStr[5] != ')' || phoneStr[6] != '-' || phoneStr[11] != ' ')
{
return false;
}
if (!uint.TryParse(phoneStr.Substring(1, 4), out uint phoneInt))
{
return false;
}
if (!uint.TryParse(phoneStr.Substring(7, 4), out phoneInt))
{
return false;
}
// No checks for phoneStr.Substring(12, 4)
return true;
}
Some differences:
The Length check is the first one. Otherwise a short string would make the program crash (because if you try to do a phoneStr[6] on a phoneStr that has a length of 3 you'll get an exception)
Instead of int.Parse I used uint.Parse, otherwise -500 would be acceptable.
I've splitted the uint.Parse for the two subsections of numbers in two different check
The method returns true or false. It is the caller's work to write the error message.
There are various school of thought about early return in code: I think that the earlier you can abort your code with a return false the better it is. The other advantage is that all the remaining code is at low nesting level (your whole method was inside a big if () {, so nesting +1 compared to mine)
Technically you tagged the question as C#-4.0, but out int is C#-6.0
The main problem here is that stupid constraints produce stupid code. It is rare that Regex are really usefull. This is one of the rare cases. So now you have two possibilities: produce hard-coded unmodifiable code that does exactly what was requested (like the code I wrote), or create a "library" that accepts variable patterns (like the ones used in masked edits, where you can tell the masked edit "accept only (0000)-0000 AAAA") and validates the string based on this pattern... But this will be a poor-man's regex, only worse, because you'll have to maintain and test it. This problem will become clear when one month from the release of the code they'll ask you to accept even the (12345)-1234 ABCD pattern... and then the (1234)-12345 ABCD pattern... and a new pattern every two months (until around one and half years later they'll tell you to remove the validator, because the persons that use the program hate them and it slow their work)

C# - efficiently check if string contains string at specific position (something like regionMatches)

For example, I might have the string "Hello world!", and I want to check if a substring starting at position 6 (0-based) is "world" - in this case true.
Something like "Hello world!".Substring(6).StartsWith("world", StringComparison.Ordinal) would do it, but it involves a heap allocation which ought to be unnecessary for something like this.
(In my case, I don't want a bounds error if the string starting at position 6 is too short for the comparison - I just want false. However, that's easy to code around, so solutions that would give a bounds error are also welcome.)
In Java, 'regionMatches' can be used to achieve this effect (with the bounds error), but I can't find an equivalent in C#.
Just to pre-empt - obviously Contains and IndexOf are bad solutions because they do an unnecessary search. (You know someone will post this!)
If all else fails, it's quick to code my own function for this - mainly I'm wondering if there is a built-in one that I've missed.
obviously Contains and IndexOf are bad solutions because they do an unnecessary search
Actually, that's not true: there is an overload of IndexOf that keeps you in control of how far it should go in search of the match. If you tell it to stay at one specific index, it would do exactly what you want to achieve.
Here is the three-argument overload of IndexOf that you could use. Passing the length of the target for the count parameter would prevent IndexOf from considering any other positions:
var big = "Hello world!";
var small = "world";
if (big.IndexOf(small, 6, small.Length) == 6) {
...
}
Demo.
Or manually
int i = 0;
if (str.Length >= 6 + toFind.Length) {
for (i = 0; i < toFind.Length; i++)
if (str[i + 6] != toFind[i])
break;
}
bool ok = i == toFind.Length;
here you are
static void Main(string[] args)
{
string word = "Hello my friend how are you ?";
if (word.Substring(0).Contains("Hello"))
{
Console.WriteLine("Match !");
}
}

is String.Contains() faster than walking through whole array of char in string?

I have a function that is walking through the string looking for pattern and changing parts of it. I could optimize it by inserting
if (!text.Contains(pattern)) return;
But, I am actually walking through the whole string and comparing parts of it with the pattern, so the question is, how String.Contains() actually works? I know there was such a question - How does String.Contains work? but answer is rather unclear. So, if String.Contains() walks through the whole array of chars as well and compare them to pattern I am looking for as well, it wouldn't really make my function faster, but slower.
So, is it a good idea to attempt such an optimizations? And - is it possible for String.Contains() to be even faster than function that just walk through the whole array and compare every single character with some constant one?
Here is the code:
public static char colorchar = (char)3;
public static Client.RichTBox.ContentText color(string text, Client.RichTBox SBAB)
{
if (text.Contains(colorchar.ToString()))
{
int color = 0;
bool closed = false;
int position = 0;
while (text.Length > position)
{
if (text[position] == colorchar)
{
if (closed)
{
text = text.Substring(position, text.Length - position);
Client.RichTBox.ContentText Link = new Client.RichTBox.ContentText(ProtocolIrc.decode_text(text), SBAB, Configuration.CurrentSkin.mrcl[color]);
return Link;
}
if (!closed)
{
if (!int.TryParse(text[position + 1].ToString() + text[position + 2].ToString(), out color))
{
if (!int.TryParse(text[position + 1].ToString(), out color))
{
color = 0;
}
}
if (color > 9)
{
text = text.Remove(position, 3);
}
else
{
text = text.Remove(position, 2);
}
closed = true;
if (color < 16)
{
text = text.Substring(position);
break;
}
}
}
position++;
}
}
return null;
}
Short answer is that your optimization is no optimization at all.
Basically, String.Contains(...) just returns String.IndexOf(..) >= 0
You could improve your alogrithm to:
int position = text.IndexOf(colorchar.ToString()...);
if (-1 < position)
{ /* Do it */ }
Yes.
And doesn't have a bug (ahhm...).
There are better ways of looking for multiple substrings in very long texts, but for most common usages String.Contains (or IndexOf) is the best.
Also IIRC the source of String.Contains is available in the .Net shared sources
Oh, and if you want a performance comparison you can just measure for your exact use-case
Check this similar post How does string.contains work
I think that you will not be able to simply do anything faster than String.Contains, unless you want to use standard CRT function wcsstr, available in msvcrt.dll, which is not so easy
Unless you have profiled your application and determined that the line with String.Contains is a bottle-neck, you should not do any such premature optimizations. It is way more important to keep your code's intention clear.
Ans while there are many ways to implement the methods in the .NET base classes, you should assume the default implementations are optimal enough for most people's use cases. For example, any (future) implementation of .NET might use the x86-specific instructions for string comparisons. That would then always be faster than what you can do in C#.
If you really want to be sure whether your custom string comparison code is faster than String.Contains, you need to measure them both using many iterations, each with a different string. For example using the Stopwatch class to measure the time.
If you now the details which you can use for optimizations (not just simple contains check) sure you can make your method faster than string.Contains, otherwise - not.

Fastest way possible to validate a string to be only Alphas, or a given string set

I need the absoulute fastest way possible to validate an input string against a given rule. In this case lest say Alpha only characters.
I can think of a number of ways both verbose and non verbose. However speed in execution is of the essence. So If anybody can offer their pearls of wisdom I would be massivly grateful.
I'm avoiding regex to get away from the overhead of creating the expression object. However am open to revisit this if people think this is the FASTEST option.
Current Ideas include:
1)
internal static bool Rule_AlphaOnly(string Value)
{
char[] charList = Value.ToCharArray();
for (int i = 0; i < charList.Length; i++)
{
if (!((charList[i] >= 65 && charList[i] <= 90) || (charList[i] >= 97 && charList[i] <= 122)))
{
return false;
}
}
return true;
}
2)
char[] charList = Value.ToCharArray();
return charList.All(t => ((t >= 65 && t <= 90) || (t >= 97 && t <= 122)));
Thought about also using the "Contains" methods.
Any ideas welcomed.
Many thanks
3)
for (int i = 0; i < Value.Length; i++)
{
if(!char.IsLetter(Value, i))
{
return false;
}
}
Not sure on the speed of this one but what about...
foreach(char c in Value)
{
if(!char.IsLetter(c))
return false;
}
Both codes can be made more efficient by removing the ToCharArray call and thus avoiding a copy: you can access the individual characters of a string directly.
Of those two ways I would strongly opt for the second unless you can show that it’s too slow. Only then would I switch to the first, but replace the for loop with a foreach loop.
Oh, and neither of your codes treats Unicode correctly. If you had used a regular expression with a proper character class, then this wouldn’t be an issue. Correctness first, performance second.
After C# compiler and JIT are done optimizing it, I doubt this will be much slower than manual for loop:
return Value.All(char.IsLetter);
If you need to check for arbitrary set of characters, do something like this:
var set = new HashSet<char>(new[] { 'a', 'b', 'c' /* Etc... */ });
return Value.All(set.Contains);
Unless the set is trivial and can be "emulated" efficiently via few ifs, the hashtable lookup is bound to be as fast a solution as it gets.

Fastest way to trim a string and convert it to lower case

I've written a class for processing strings and I have the following problem: the string passed in can come with spaces at the beginning and at the end of the string.
I need to trim the spaces from the strings and convert them to lower case letters. My code so far:
var searchStr = wordToSearchReplacemntsFor.ToLower();
searchStr = searchStr.Trim();
I couldn't find any function to help me in StringBuilder. The problem is that this class is supposed to process a lot of strings as quickly as possible. So I don't want to be creating 2 new strings for each string the class processes.
If this isn't possible, I'll go deeper into the processing algorithm.
Try method chaining.
Ex:
var s = " YoUr StRiNg".Trim().ToLower();
Cyberdrew has the right idea. With string being immutable, you'll be allocating memory during both of those calls regardless. One thing I'd like to suggest, if you're going to call string.Trim().ToLower() in many locations in your code, is to simplify your calls with extension methods. For example:
public static class MyExtensions
{
public static string TrimAndLower(this String str)
{
return str.Trim().ToLower();
}
}
Here's my attempt. But before I would check this in, I would ask two very important questions.
Are sequential "String.Trim" and "String.ToLower" calls really impacting the performance of my app? Would anyone notice if this algorithm was twice as slow or twice as fast? The only way to know is to measure the performance of my code and compare against pre-set performance goals. Otherwise, micro-optimizations will generate micro-performance gains.
Just because I wrote an implementation that appears faster, doesn't mean that it really is. The compiler and run-time may have optimizations around common operations that I don't know about. I should compare the running time of my code to what already exists.
static public string TrimAndLower(string str)
{
if (str == null)
{
return null;
}
int i = 0;
int j = str.Length - 1;
StringBuilder sb;
while (i < str.Length)
{
if (Char.IsWhiteSpace(str[i])) // or say "if (str[i] == ' ')" if you only care about spaces
{
i++;
}
else
{
break;
}
}
while (j > i)
{
if (Char.IsWhiteSpace(str[j])) // or say "if (str[j] == ' ')" if you only care about spaces
{
j--;
}
else
{
break;
}
}
if (i > j)
{
return "";
}
sb = new StringBuilder(j - i + 1);
while (i <= j)
{
// I was originally check for IsUpper before calling ToLower, probably not needed
sb.Append(Char.ToLower(str[i]));
i++;
}
return sb.ToString();
}
If the strings use only ASCII characters, you can look at the C# ToLower Optimization. You could also try a lookup table if you know the character set ahead of time
So first of all, trim first and replace second, so you have to iterate over a smaller string with your ToLower()
other than that, i think your best algorithm would look like this:
Iterate over the string once, and check
whether there's any upper case characters
whether there's whitespace in beginning and end (and count how many chars you're talking about)
if none of the above, return the original string
if upper case but no whitespace: do ToLower and return
if whitespace:
allocate a new string with the right size (original length - number of white chars)
fill it in while doing the ToLower
You can try this:
public static void Main (string[] args) {
var str = "fr, En, gB";
Console.WriteLine(str.Replace(" ","").ToLower());
}

Categories