Regex to find 'good enough' sequences

I'm looking to implement some algorithm to help me match imperfect sequences.
Say I have a stored sequence of ABBABABBA and I want to find something that 'looks like' that in a large stream of characters.
If I allow my algorithm 2 wildcards (differences), how can I use Regex to match something like the following, where ( and ) mark the differences:
A(A)BABAB(A)A
or
(B)BBA(A)ABBA
My dilemma is that I need to find these potential target matches (with imperfections) in a big string of characters.
So in something like:
ABBDBABDBCBDBABDB(A(A)BABAB(A)A)DBDBABDBCBDBAB
ADBDBABDBDBDBCBDBABCBDBABCBDBABCBDBABABBBDBABABBCD
DBABCBDABDBABCBCBDBABABDABDBABCBDBABABDDABCBDBABAB
I must be able to search for these 'near enough' matches.
Where brackets denote: (The Good enough Match with the (Differences))
Edit: To be more formal in this example, A match of Length N can be accepted if N-2 characters are the same as the original (2 Differences)
I've used Regex before, but only to find perfect sequences - not for something that 'looks like' one.
Hope this is clear enough to get some advice on.
Thanks for reading and any help!

You could use LINQ to be nice and expressive.
In order to use this make sure you have a using System.Linq at the top of your code.
Assuming that
source is the stored target pattern
test is the string to test.
Then you can do
public static bool IsValid(string source, string test)
{
    return test != null
           && source != null
           && test.Length == source.Length
           && test.Where((x, i) => source[i] != x).Count() <= 2;
}
There is also a shortcut version that exits false the moment it fails, saving iterating the rest of the string.
public static bool IsValid(string source, string test) 
{
  return test != null  
         && source != null 
          && test.Length == source.Length 
          && !test.Where((x,i) => source[i] != x).Skip(2).Any();
}
As requested in the comments, a little explanation of how this works.
In C# a string can be treated as a sequence of characters, which means that the LINQ methods can be used on it.
test.Where((x,i) => source[i] != x)
This uses the overload of Where that passes both the element and its index: for each character in test, x is the character and i is its index. If the character at position i in source is not equal to x, the character is output into the result.
Skip(2)
This skips the first 2 results.
Any()
This returns true if there are any results left, or false if not. Because LINQ defers execution, Any() stops enumerating as soon as a third difference turns up, rather than evaluating the rest of the string.
The entire test is then negated by prefixing it with a '!' to indicate we want to know whether there are no more results.
Now, in order to match a substring, you need to try every starting position in test, similar to how a regex engine backtracks...
public static IEnumerable<int> GetMatches(string source, string test)
{
    // Range takes (start, count); the +1 includes the last possible start index.
    return from i in Enumerable.Range(0, Math.Max(0, test.Length - source.Length + 1))
           where IsValid(source, test.Skip(i).Take(source.Length))
           select i;
}
public static bool IsValid(string source, IEnumerable<char> test)
{
    // True when there are at most 2 differences.
    return !test.Where((x, i) => source[i] != x).Skip(2).Any();
}
UPDATE Explained
Enumerable.Range(0, Math.Max(0, test.Length - source.Length + 1))
This creates the sequence of candidate start indexes, from 0 to test.Length - source.Length inclusive. There is no need to check later start positions in test, because fewer than source.Length characters remain there and the answer would be invalid. Math.Max (from the System namespace) guards against a negative count when test is shorter than source.
from i in ....
Basically iterate over the collection, assigning i to be the current value each time.
where IsValid(source, test.Skip(i).Take(source.Length))
Filter the results to only include the ones where there is a match in test starting at index i (hence the Skip) and going on for source.Length chars (hence the Take).
select i
Return i.
This returns an enumerable over the indexes in test where there is a match. You could extract the matched substrings with
GetMatches(source,test).Select(i =>
new string(test.Skip(i).Take(source.Length).ToArray()));

I don't think this can be done with regexes (if it can, I'm unfamiliar with the syntax). However, you can use the dynamic programming algorithm for Levenshtein distance.
Edit: If you don't need to handle letters that have switched positions, a much easier approach is to just compare each pair of characters from the two strings, and just count the number of differences.
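Both ideas in this answer can be sketched in a few lines. The class and method names below (EditDistance, Levenshtein, CountDifferences) are made up for illustration and are not part of any library. Note that Levenshtein distance also counts insertions and deletions, which is more general than the "same length, at most 2 substitutions" rule in the question; the second method is the simpler per-position count.

```csharp
using System;

static class EditDistance
{
    // Dynamic-programming Levenshtein distance, keeping only two rows of the table.
    public static int Levenshtein(string a, string b)
    {
        if (a == null) throw new ArgumentNullException(nameof(a));
        if (b == null) throw new ArgumentNullException(nameof(b));

        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.Length; j++)
            previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int substitution = previous[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.Min(substitution, Math.Min(deletion, insertion));
            }
            // Swap the rows so "previous" holds the row just computed.
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.Length];
    }

    // Number of positions at which two equal-length strings differ.
    public static int CountDifferences(string a, string b)
    {
        if (a == null) throw new ArgumentNullException(nameof(a));
        if (b == null) throw new ArgumentNullException(nameof(b));
        if (a.Length != b.Length)
            throw new ArgumentException("Strings must have equal length.");

        int differences = 0;
        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] != b[i])
                differences++;
        }
        return differences;
    }
}
```

For the question's example, CountDifferences("ABBABABBA", "AABABABAA") counts the two bracketed positions.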

I can't think how you'd do it with regex but it should be pretty simple to code.
I'd probably just split the strings up and compare them character by character. If you get a difference count it and move to the next character. If you exceed 2 differences then move on to the next full string.
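A minimal sketch of that approach, applied to every window of the large string. NearMatchScanner and FindNearMatches are hypothetical names chosen for this example, and the tolerance is passed in rather than fixed at 2.

```csharp
using System.Collections.Generic;

static class NearMatchScanner
{
    // Returns the start index of every window of text that differs from
    // pattern in at most maxDifferences positions.
    public static List<int> FindNearMatches(string text, string pattern, int maxDifferences)
    {
        var starts = new List<int>();
        if (text == null || pattern == null || pattern.Length == 0)
            return starts;

        for (int start = 0; start + pattern.Length <= text.Length; start++)
        {
            int differences = 0;
            for (int offset = 0; offset < pattern.Length; offset++)
            {
                // Count a mismatch and give up on this window once the limit is exceeded.
                if (text[start + offset] != pattern[offset] && ++differences > maxDifferences)
                    break;
            }
            if (differences <= maxDifferences)
                starts.Add(start);
        }
        return starts;
    }
}
```

Each returned index can be turned back into the matched text with text.Substring(index, pattern.Length).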

I don't think there's a good regular expression to handle this case. (Or at least, there isn't one that won't take up a good three lines of text and cause multiple bullets in your feet.) However, that doesn't mean you can't solve this problem.
Depending on how large your strings are (I'm assuming they won't be millions of characters each) I don't see anything stopping you from using a single loop to compare individual characters in order, while keeping a tally of differences:
// tolerance: limit of discrepancies you'll allow (for example 7)
static bool CheckStrings(string stringA, string stringB, int tolerance)
{
    if (stringA.Length != stringB.Length)
        return false;
    int differences = 0; // Count of discrepancies you've detected
    for (int i = 0; i < stringA.Length; i++)
    {
        if (stringA[i] != stringB[i])
        {
            differences++;
            if (differences > tolerance)
                return false;
        }
    }
    return true;
}
Most of the time, don't be concerned about your strings being too long to put into a loop. Behind-the-scenes, any code that assesses every character of a string will loop in some form or another. Until you literally have millions of characters to deal with, a loop should do the trick just fine.

I'll bypass the 'regex' part and focus on:
Is there a better way than doing nested loops to wildcard every position?
It sounds like there's a programmatic way that might help you. See this post about iterating over two IEnumerables. By iterating over both strings at the same time, you can complete the task in O(n) time. Even better, if you know your tolerance(maximum of 2 errors), you can sometimes finish faster than O(n).
Here's a simple example that I wrote up. It probably needs tweaking for your own case, but it might be a good starting point.
static void imperfectMatch(String original, String testCase, int tolerance)
{
int mistakes = 0;
if (original.Length == testCase.Length)
{
using (CharEnumerator enumerator1 = original.GetEnumerator())
using (CharEnumerator enumerator2 = testCase.GetEnumerator())
{
while (enumerator1.MoveNext() && enumerator2.MoveNext())
{
if (mistakes > tolerance) // stop once the tolerance has been exceeded
break;
if (enumerator1.Current != enumerator2.Current)
mistakes++;
}
}
}
else
mistakes = -1;
Console.WriteLine(String.Format("Original String: {0}", original));
Console.WriteLine(String.Format("Test Case String: {0}", testCase));
Console.WriteLine(String.Format("Number of errors: {0}", mistakes));
Console.WriteLine();
}

Does any combination of A, B, ( and ) work?
bool isMatch = Regex.IsMatch(inputString, "^[AB()]+$");

For sufficiently small patterns (ABCD), you could generate a regexp:
..CD|.B.D|.BC.|A..D|A.C.|AB..
You could also code a custom comparison loop
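Writing those alternatives by hand gets tedious quickly (a 9-character pattern with 2 wildcards has 36 of them), so here is a rough sketch of generating the expression. WildcardRegexBuilder and Build are names invented for this example; it assumes 0 <= wildcards <= pattern.Length and that '.' is an acceptable wildcard (by default it does not match a newline).

```csharp
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

static class WildcardRegexBuilder
{
    // Builds an alternation with one branch per way of replacing exactly
    // `wildcards` positions of the pattern with '.'.
    public static string Build(string pattern, int wildcards)
    {
        var alternatives = new List<string>();
        Collect(pattern, wildcards, 0, new bool[pattern.Length], alternatives);
        return string.Join("|", alternatives);
    }

    private static void Collect(string pattern, int remaining, int index, bool[] wild, List<string> output)
    {
        if (index == pattern.Length)
        {
            if (remaining != 0)
                return;
            var branch = new StringBuilder();
            for (int i = 0; i < pattern.Length; i++)
                branch.Append(wild[i] ? "." : Regex.Escape(pattern[i].ToString()));
            output.Add(branch.ToString());
            return;
        }

        // Option 1: make this position a wildcard.
        if (remaining > 0)
        {
            wild[index] = true;
            Collect(pattern, remaining - 1, index + 1, wild, output);
            wild[index] = false;
        }

        // Option 2: keep this position literal, if enough positions remain
        // after it to place the wildcards still owed.
        if (pattern.Length - index - 1 >= remaining)
            Collect(pattern, remaining, index + 1, wild, output);
    }
}
```

Because '.' also matches the original character, a branch with 2 wildcards accepts candidates with 0, 1 or 2 differences. The generated string can be passed straight to Regex.Matches to scan the large input.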

Related

How do I validate character type and position within a string using a loop?

I'm currently attempting to validate a string for an assignment so it's imperative that I'm not simply given the answer, if you provide an answer please give suitable explanation so that I can learn from it.
Suppose I have a string
(1234)-1234 ABCD
I'd like to create a loop that will go through that string and validate the position of the "()" as well as the "-" and " ". In addition to the validation of those characters their position must also be the same as well as the data type. Finally, it must be inside a method.
CANNOT USE REGEX
TLDR;
Validate the position of characters and digits in a string, while using a loop inside of a method. I cannot use REGEX and need to do this manually.
Here's what I have so far. But I feel like the loop would be more efficient and look nicer.
public static string PhoneChecker(string phoneStr)
{
if (phoneStr[0] == '(' && phoneStr[4] == ')' && phoneStr[5] == ' ' && phoneStr[9] == '-' && phoneStr.Length == 14)
{
phoneStr = phoneStr.Remove(0, 1);
phoneStr = phoneStr.Remove(3, 1);
phoneStr = phoneStr.Remove(3, 1);
phoneStr = phoneStr.Remove(6, 1);
Console.WriteLine(phoneStr);
if (int.TryParse(phoneStr, out int phoneInt) == false)
{
Console.WriteLine("Invalid");
}
else
{
Console.WriteLine("Valid");
}
}
else
{
Console.WriteLine("Invalid");
}
return phoneStr;
}

It is still unmaintainable, but a little better... Note that your code didn't work with your example string (the indexes were off by one).
public static bool PhoneChecker(string phoneStr)
{
if (phoneStr.Length != 16 || phoneStr[0] != '(' || phoneStr[5] != ')' || phoneStr[6] != '-' || phoneStr[11] != ' ')
{
return false;
}
if (!uint.TryParse(phoneStr.Substring(1, 4), out uint phoneInt))
{
return false;
}
if (!uint.TryParse(phoneStr.Substring(7, 4), out phoneInt))
{
return false;
}
// No checks for phoneStr.Substring(12, 4)
return true;
}
Some differences:
The Length check is the first one. Otherwise a short string would make the program crash (because if you try to do a phoneStr[6] on a phoneStr that has a length of 3 you'll get an exception).
Instead of int.TryParse I used uint.TryParse, otherwise -500 would be acceptable.
I split the uint.TryParse into two separate checks, one for each group of digits.
The method returns true or false. It is the caller's job to write the error message.
There are various schools of thought about early return in code: I think that the earlier you can abort your code with a return false, the better. The other advantage is that all the remaining code is at a low nesting level (your whole method was inside a big if () {, so nesting +1 compared to mine).
Technically you tagged the question as C#-4.0, but out int is C# 7.0.
The main problem here is that stupid constraints produce stupid code. It is rare that Regex is really useful. This is one of the rare cases. So now you have two possibilities: produce hard-coded unmodifiable code that does exactly what was requested (like the code I wrote), or create a "library" that accepts variable patterns (like the ones used in masked edits, where you can tell the masked edit "accept only (0000)-0000 AAAA") and validates the string based on this pattern... But this will be a poor man's regex, only worse, because you'll have to maintain and test it. This problem will become clear when, one month after the release of the code, they ask you to accept the (12345)-1234 ABCD pattern too... and then the (1234)-12345 ABCD pattern... and a new pattern every two months (until around one and a half years later they tell you to remove the validator, because the people who use the program hate it and it slows their work).
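For what it's worth, the "library" option can be sketched with a single loop, which is also what the assignment asks for. Everything here is an assumption made for illustration: the MaskValidator name, and the convention that '0' in the mask means an ASCII digit, 'A' means an ASCII letter, and any other mask character must match literally.

```csharp
static class MaskValidator
{
    // Checks input against a mask such as "(0000)-0000 AAAA".
    // Assumed convention: '0' = ASCII digit, 'A' = ASCII letter,
    // anything else = that exact character.
    public static bool Matches(string input, string mask)
    {
        // Length first, so the indexing below can never go out of range.
        if (input == null || mask == null || input.Length != mask.Length)
            return false;

        for (int i = 0; i < mask.Length; i++)
        {
            char expected = mask[i];
            char actual = input[i];
            bool ok;

            if (expected == '0')
                ok = actual >= '0' && actual <= '9';
            else if (expected == 'A')
                ok = (actual >= 'A' && actual <= 'Z') || (actual >= 'a' && actual <= 'z');
            else
                ok = actual == expected;

            if (!ok)
                return false; // early return on the first bad position
        }
        return true;
    }
}
```

With this, MaskValidator.Matches("(1234)-1234 ABCD", "(0000)-0000 AAAA") accepts the example string, and a new format only needs a new mask. The maintenance concern above still applies: this convention cannot express a literal '0' or 'A' in the mask.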

is String.Contains() faster than walking through whole array of char in string?

I have a function that is walking through the string looking for pattern and changing parts of it. I could optimize it by inserting
if (!text.Contains(pattern)) return;
But I am already walking through the whole string and comparing parts of it with the pattern, so the question is: how does String.Contains() actually work? I know there was such a question - How does String.Contains work? - but the answer is rather unclear. If String.Contains() also walks through the whole array of chars and compares them to the pattern I am looking for, it wouldn't make my function faster, but slower.
So, is it a good idea to attempt such an optimization? And is it possible for String.Contains() to be even faster than a function that just walks through the whole array and compares every single character with a constant one?
Here is the code:
public static char colorchar = (char)3;
public static Client.RichTBox.ContentText color(string text, Client.RichTBox SBAB)
{
if (text.Contains(colorchar.ToString()))
{
int color = 0;
bool closed = false;
int position = 0;
while (text.Length > position)
{
if (text[position] == colorchar)
{
if (closed)
{
text = text.Substring(position, text.Length - position);
Client.RichTBox.ContentText Link = new Client.RichTBox.ContentText(ProtocolIrc.decode_text(text), SBAB, Configuration.CurrentSkin.mrcl[color]);
return Link;
}
if (!closed)
{
if (!int.TryParse(text[position + 1].ToString() + text[position + 2].ToString(), out color))
{
if (!int.TryParse(text[position + 1].ToString(), out color))
{
color = 0;
}
}
if (color > 9)
{
text = text.Remove(position, 3);
}
else
{
text = text.Remove(position, 2);
}
closed = true;
if (color < 16)
{
text = text.Substring(position);
break;
}
}
}
position++;
}
}
return null;
}

Short answer is that your optimization is no optimization at all.
Basically, String.Contains(...) just returns String.IndexOf(..) >= 0
You could improve your algorithm to:
int position = text.IndexOf(colorchar.ToString()...);
if (-1 < position)
{ /* Do it */ }

Yes.
And doesn't have a bug (ahhm...).
There are better ways of looking for multiple substrings in very long texts, but for most common usages String.Contains (or IndexOf) is the best.
Also IIRC the source of String.Contains is available in the .Net shared sources
Oh, and if you want a performance comparison you can just measure for your exact use-case

Check this similar post How does string.contains work
I think that you will not be able to simply do anything faster than String.Contains, unless you want to use standard CRT function wcsstr, available in msvcrt.dll, which is not so easy

Unless you have profiled your application and determined that the line with String.Contains is a bottle-neck, you should not do any such premature optimizations. It is way more important to keep your code's intention clear.
And while there are many ways to implement the methods in the .NET base classes, you should assume the default implementations are optimal enough for most people's use cases. For example, any (future) implementation of .NET might use the x86-specific instructions for string comparisons. That would then always be faster than what you can do in C#.
If you really want to be sure whether your custom string comparison code is faster than String.Contains, you need to measure them both using many iterations, each with a different string. For example using the Stopwatch class to measure the time.
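A rough sketch of such a measurement follows. ContainsBenchmark, ManualContains and Measure are hypothetical helpers written for this example, and a Stopwatch loop like this is only indicative: JIT warm-up, input size and build configuration all affect the numbers, so run a release build on realistic inputs.

```csharp
using System;
using System.Diagnostics;

static class ContainsBenchmark
{
    // Hand-written scan for a single character, the kind of loop the question describes.
    public static bool ManualContains(string text, char target)
    {
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] == target)
                return true;
        }
        return false;
    }

    // Runs `search` over every input `iterations` times and returns the elapsed time.
    // `hits` is reported so the work has an observable result.
    public static TimeSpan Measure(Func<string, bool> search, string[] inputs, int iterations, out int hits)
    {
        hits = 0;
        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int n = 0; n < iterations; n++)
        {
            foreach (string input in inputs)
            {
                if (search(input))
                    hits++;
            }
        }
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

To compare the two approaches, call Measure once with s => s.Contains("\u0003") and once with s => ContainsBenchmark.ManualContains(s, (char)3) on the same inputs, and repeat the runs a few times before drawing conclusions.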

If you know details about your data that you can exploit for optimization (beyond a simple contains check), you can certainly make your method faster than string.Contains; otherwise, no.

Fastest way to trim a string and convert it to lower case

I've written a class for processing strings and I have the following problem: the string passed in can come with spaces at the beginning and at the end of the string.
I need to trim the spaces from the strings and convert them to lower case letters. My code so far:
var searchStr = wordToSearchReplacemntsFor.ToLower();
searchStr = searchStr.Trim();
I couldn't find any function to help me in StringBuilder. The problem is that this class is supposed to process a lot of strings as quickly as possible. So I don't want to be creating 2 new strings for each string the class processes.
If this isn't possible, I'll go deeper into the processing algorithm.

Try method chaining.
Ex:
var s = " YoUr StRiNg".Trim().ToLower();

Cyberdrew has the right idea. With string being immutable, you'll be allocating memory during both of those calls regardless. One thing I'd like to suggest, if you're going to call string.Trim().ToLower() in many locations in your code, is to simplify your calls with extension methods. For example:
public static class MyExtensions
{
public static string TrimAndLower(this String str)
{
return str.Trim().ToLower();
}
}

Here's my attempt. But before I would check this in, I would ask two very important questions.
Are sequential "String.Trim" and "String.ToLower" calls really impacting the performance of my app? Would anyone notice if this algorithm was twice as slow or twice as fast? The only way to know is to measure the performance of my code and compare against pre-set performance goals. Otherwise, micro-optimizations will generate micro-performance gains.
Just because I wrote an implementation that appears faster, doesn't mean that it really is. The compiler and run-time may have optimizations around common operations that I don't know about. I should compare the running time of my code to what already exists.
static public string TrimAndLower(string str)
{
if (str == null)
{
return null;
}
int i = 0;
int j = str.Length - 1;
StringBuilder sb;
while (i < str.Length)
{
if (Char.IsWhiteSpace(str[i])) // or say "if (str[i] == ' ')" if you only care about spaces
{
i++;
}
else
{
break;
}
}
while (j > i)
{
if (Char.IsWhiteSpace(str[j])) // or say "if (str[j] == ' ')" if you only care about spaces
{
j--;
}
else
{
break;
}
}
if (i > j)
{
return "";
}
sb = new StringBuilder(j - i + 1);
while (i <= j)
{
// I originally checked IsUpper before calling ToLower; probably not needed
sb.Append(Char.ToLower(str[i]));
i++;
}
return sb.ToString();
}

If the strings use only ASCII characters, you can look at the C# ToLower Optimization. You could also try a lookup table if you know the character set ahead of time

So first of all, trim first and lowercase second, so that ToLower() has to iterate over a smaller string.
Other than that, I think your best algorithm would look like this:
Iterate over the string once, and check
whether there's any upper case characters
whether there's whitespace in beginning and end (and count how many chars you're talking about)
if none of the above, return the original string
if upper case but no whitespace: do ToLower and return
if whitespace:
allocate a new string with the right size (original length - number of white chars)
fill it in while doing the ToLower
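The steps above could be sketched roughly as follows. StringNormalizer is a made-up name, and two choices here are assumptions rather than part of the answer: char.ToLowerInvariant is used instead of the culture-sensitive ToLower, and the bounds scan and the upper-case scan are done as separate short loops instead of a literal single pass.

```csharp
static class StringNormalizer
{
    // Trims whitespace and lower-cases with at most one new string allocated.
    public static string TrimAndLower(string value)
    {
        if (string.IsNullOrEmpty(value))
            return value;

        // Find the first and last non-whitespace characters.
        int start = 0;
        int end = value.Length - 1;
        while (start <= end && char.IsWhiteSpace(value[start])) start++;
        while (end >= start && char.IsWhiteSpace(value[end])) end--;
        if (start > end)
            return string.Empty;

        // Check whether any character in the kept range would change.
        bool needsLowering = false;
        for (int i = start; i <= end; i++)
        {
            if (char.ToLowerInvariant(value[i]) != value[i])
            {
                needsLowering = true;
                break;
            }
        }

        if (!needsLowering)
        {
            // Nothing to lower: return the original, or just the trimmed slice.
            bool untrimmed = start == 0 && end == value.Length - 1;
            return untrimmed ? value : value.Substring(start, end - start + 1);
        }

        // Allocate the result at its final size and lower-case while copying.
        char[] buffer = new char[end - start + 1];
        for (int i = start; i <= end; i++)
            buffer[i - start] = char.ToLowerInvariant(value[i]);
        return new string(buffer);
    }
}
```

Whether this beats Trim().ToLower() in practice is something to measure rather than assume.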

You can try this:
public static void Main (string[] args) {
var str = "fr, En, gB";
Console.WriteLine(str.Replace(" ","").ToLower());
}

String cannot contain any part of another string .NET 2.0

I'm looking for a simple way to discern if a string contains any part of another string (be that regex, built in function I don't know about, etc...). For Example:
string a = "unicorn";
string b = "cornholio";
string c = "ornament";
string d = "elephant";
if (a <comparison> b)
{
// match found ("corn" from 'unicorn' matched "corn" from 'cornholio')
}
if (a <comparison> c)
{
// match found ("orn" from 'unicorn' matched "orn" from 'ornament')
}
if (a <comparison> d)
{
// this will not match
}
something like if (a.ContainsAnyPartOf(b)) would be too much to hope for.
Also, I only have access to .NET 2.0.
Thanks in advance!

This method should work. You'll want to specify a minimum length for the "part" that might match. I'd assume you'd want to look for something of at least 2, but with this you can set it as high or low as you want. Note: error checking not included.
public static bool ContainsPartOf(string s1, string s2, int minsize)
{
for (int i = 0; i <= s2.Length - minsize; i++)
{
if (s1.Contains(s2.Substring(i, minsize)))
return true;
}
return false;
}

I think you're looking for this implementation of longest common substring?

Your best bet, according to my understanding of the question, is to compute the Levenshtein (or related values) distance and compare that against a threshold.

Your requirements are a little vague.
You need to define a minimum length for the match...but implementing an algorithm shouldn't be too difficult when you figure that part out.
I'd suggest breaking down the string into character arrays and then using tail recursion to find matches for the parts.

C# - fastest way to compare two strings using wildcards

Is there a faster way to compare two strings (using the space as a wildcard) than this function?
public static bool CustomCompare(this string word, string mask)
{
for (int index = 0; index < mask.Length; index++)
{
if ((mask[index] != ' ') && (mask[index] != word[index]))
{
return false;
}
}
return true;
}
Example: "S nt nce" comparing with "Sentence" will return true. (The two being compared would need to be the same length)

If mask.Length is less than word.Length, this function will stop comparing at the end of mask. A word/mask length comparison at the beginning would prevent that, and it would also quickly eliminate some obvious mismatches.

The loop is pretty simple and I'm not sure you can do much better. You might be able to micro optimize the order of the expression in the if statement. For example due to short circuiting of the && it might be faster to order the if statement this way
if ((mask[index] != word[index]) && (mask[index] != ' '))
This assumes that matching characters is more common than matching the wildcard. Of course this is just theory; I wouldn't believe it made a difference without benchmarking it.
And as others have pointed out, the routine fails if the mask and string are not the same length.

That looks like a pretty good implementation - I don't think you will get much faster than that.
Have you profiled this code and found it to be a bottleneck in your application? I think this should be fine for most purposes.

If you used . instead of a space as the wildcard, you could do a simple regex match.
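A small sketch of that idea, which keeps the question's space-as-wildcard masks and converts them to a regex on the fly. MaskRegex is a hypothetical name; the anchors \A and \z are used so that the whole word must match, and RegexOptions.Singleline lets the wildcard match a newline as well.

```csharp
using System.Text;
using System.Text.RegularExpressions;

static class MaskRegex
{
    // Turns a mask like "S nt nce" into the pattern \AS.nt.nce\z and tests the word.
    public static bool CustomCompare(string word, string mask)
    {
        var pattern = new StringBuilder(@"\A");
        foreach (char c in mask)
        {
            // A space becomes the regex wildcard; every other character is
            // escaped so it can only match itself.
            pattern.Append(c == ' ' ? "." : Regex.Escape(c.ToString()));
        }
        pattern.Append(@"\z");
        return Regex.IsMatch(word, pattern.ToString(), RegexOptions.Singleline);
    }
}
```

Unlike the original loop, this version also rejects words whose length differs from the mask. Building and interpreting a regex per call is unlikely to be faster than the plain loop, so treat it as a convenience rather than an optimization.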

Variable Length comparison:
I used your code as a starting place for my own application which assumes the mask length is shorter or equal to the comparison text length. allowing for a variable length wildcard spot in the mask. ie: "concat" would match a mask of "c ncat" or "c t" or even "c nc t"
private bool CustomCompare(string word, string mask)
{
int lengthDifference = word.Length - mask.Length;
int wordOffset = 0;
for (int index = 0; index < mask.Length; index++)
{
if ((mask[index] != ' ') && (mask[index]!= word[index+wordOffset]))
{
if (lengthDifference <= 0)
{
return false;
}
else
{
lengthDifference += -1;
wordOffset += 1;
}
}
}
return true;
}

Not sure if this is any faster but it looks neat:
public static bool CustomCompare(this string word, string mask)
{
return !mask.Where((c, index) => c != word[index] && c != ' ').Any();
}

I think you're doing a little injustice by not giving a little bit of context to your code. Sure, if you want to search only one string of characters of the same length as your pattern, then yes this is fine.
However, if you are using this as the heart of a pattern matcher where there are several other patterns you will be looking for, this is a poor method. There are other known methods, the best of which depends on your exact application. The phrase "inexact pattern matching" is the phrase you are concerned with.
