Replacing overlapping matches in a string (regex or string operations)

I have been trying to find all occurrences of a substring in a given string, and replace a specific occurrence with another substring (the condition is not important for the question).
What I need is to find all occurrences (even overlapping ones) and to be able to easily replace a specific one I choose.
The issue is that if I don't use lookahead I can't find overlapping occurrences (e.g. find "aa" in "aaa" will only find the first "aa" sequence because the second one overlaps with the first one):
var regex = new Regex(Regex.Escape("aa"));
regex.Matches("aaa").Count;
Value of the second line: 1
Expected: 2
If I use a lookahead I find all of the occurrences but the replacement doesn't work (e.g. replace "a" in "a" with "b", will result in "ba" instead of "b"):
var regex = new Regex("(?=" + Regex.Escape("a") + ")");
regex.Replace("a", "b");
Replace result: ba
Expected: b
Those are, of course, simple examples that showcase the issues in an easy way, but I need this to work on any example.
I know that I can easily do a search for both, or manually go over the word, but this code snippet is going to run many times and needs to both be efficient and readable.
Any ideas / tips on finding overlapping occurrences while still being able to replace properly? Should I even be using regex?

I think I would forgo regex and write a simple loop as below (there is room for improvement), because I think it would be quicker and more understandable.
public IEnumerable<int> FindStartingOccurrences(string input, string pattern)
{
    var occurrences = new List<int>();
    for (int i = 0; i < input.Length; i++)
    {
        // Only compare while the pattern still fits into the rest of the input.
        if (i + pattern.Length <= input.Length
            && input.Substring(i, pattern.Length) == pattern)
        {
            occurrences.Add(i);
        }
    }
    return occurrences;
}
and then call like:
var occurrences = FindStartingOccurrences("aaabbaaaaaccaadaaa", "aa");
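Once the start positions are known, replacing one chosen occurrence is plain string slicing. The following is a minimal sketch, not part of the answer above: the class name OverlapReplace and both method names are hypothetical, and it assumes ordinal (case-sensitive) matching is what you want.

```csharp
using System;
using System.Collections.Generic;

public static class OverlapReplace
{
    // Returns the start index of every occurrence of pattern,
    // including occurrences that overlap each other.
    public static List<int> FindAll(string input, string pattern)
    {
        var starts = new List<int>();
        if (string.IsNullOrEmpty(pattern))
        {
            return starts;
        }
        int i = input.IndexOf(pattern, StringComparison.Ordinal);
        while (i >= 0)
        {
            starts.Add(i);
            // Resume one character later (not pattern.Length later),
            // so overlapping occurrences are not skipped.
            i = input.IndexOf(pattern, i + 1, StringComparison.Ordinal);
        }
        return starts;
    }

    // Replaces only the occurrence of the given length that starts at 'start'.
    public static string ReplaceAt(string input, int start, int length, string replacement)
    {
        return input.Substring(0, start) + replacement + input.Substring(start + length);
    }
}
```

For "aaa" and "aa", FindAll returns the indexes 0 and 1, and ReplaceAt("aaa", 1, 2, "b") yields "ab". Because each replacement changes the string, positions found earlier are only valid for the first replacement; search again if you need to replace more than one occurrence.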

To get overlapping results you have to shift your search pattern by one char for as many times as your search string is long.
Let's say for a text containing aaaaaa and a search string of aaa (4 expected matches), three regex searches will be done, each with a longer lookbehind, using the search patterns:
aaa (2 Matches)
(?<=a)aaa (1 Match)
(?<=aa)aaa (1 Match)
Same works for more complex searches like aba in abababa.
private static IEnumerable<Match> GetOverlappingMatches(string text, string searchstring)
{
    IEnumerable<Match> combinedMatches = Enumerable.Empty<Match>();
    for (int i = 0; i < searchstring.Length; i++)
    {
        combinedMatches = combinedMatches.Concat(GetMatches(text, searchstring, i));
    }
    return combinedMatches.Distinct(new MatchComparer());
}
private static IEnumerable<Match> GetMatches(string text, string searchstring, int shifts)
{
    // Escape both parts so regex metacharacters in the search string are matched literally.
    string lookbehind = $"(?<={Regex.Escape(searchstring.Substring(0, shifts))})";
    string pattern = $"{lookbehind}{Regex.Escape(searchstring)}";
    // Cast is needed on .NET Framework, where MatchCollection is only a non-generic IEnumerable.
    return Regex.Matches(text, pattern).Cast<Match>();
}
You also want to add a MatchComparer to filter double matches.
public class MatchComparer : IEqualityComparer<Match>
{
    public bool Equals(Match x, Match y)
    {
        return x.Index == y.Index
            && x.Length == y.Length;
    }
    public int GetHashCode([DisallowNull] Match obj)
    {
        return obj.Index ^ obj.Length;
    }
}
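A single pass can also report overlapping matches if the search string is captured inside a lookahead. This is a swapped-in technique rather than the shifted-lookbehind method above, shown as a sketch; the class name LookaheadMatches and the method FindAll are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaheadMatches
{
    // The pattern (?=(literal)) is zero-width: the engine consumes nothing,
    // so it retries at every position and reports each start index,
    // whether or not the occurrences overlap.
    public static List<int> FindAll(string text, string search)
    {
        var regex = new Regex("(?=(" + Regex.Escape(search) + "))");
        var starts = new List<int>();
        foreach (Match m in regex.Matches(text))
        {
            // Group 1 holds the text the lookahead saw; its Index is the start position.
            starts.Add(m.Groups[1].Index);
        }
        return starts;
    }
}
```

The matches themselves are empty, which is why Regex.Replace with such a pattern inserts text instead of replacing it. Use the indexes from group 1 and do the replacement with Substring or a StringBuilder.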

Related

Remove list of words from string

I have a list of words that I want to remove from a string. I use the following method:
string stringToClean = "The.Flash.2014.S07E06.720p.WEB-DL.HEVC.x265.RMTeam";
string[] BAD_WORDS = {
"720p", "web-dl", "hevc", "x265", "Rmteam", "."
};
var cleaned = string.Join(" ", stringToClean.Split(' ').Where(w => !BAD_WORDS.Contains(w, StringComparer.OrdinalIgnoreCase)));
but it is not working, and the following text is output:
The.Flash.2014.S07E06.720p.WEB-DL.HEVC.x265.RMTeam

For this it would be a good idea to create a reusable method that splits a string into words. I'll do this as an extension method of string. If you are not familiar with extension methods, read extension methods demystified
public static IEnumerable<string> ToWords(this string text)
{
// TODO implement
}
Usage will be as follows:
string text = "This is some wild text!";
List<string> words = text.ToWords().ToList();
var first3Words = text.ToWords().Take(3);
var lastWord = text.ToWords().LastOrDefault();
Once you've got this method, the solution to your problem will be easy:
IEnumerable<string> badWords = ...
string inputText = ...
IEnumerable<string> validWords = inputText.ToWords().Except(badWords);
Or maybe you want to use Except(badWords, StringComparer.OrdinalIgnoreCase);
The implementation of ToWords depends on what you would call a word: everything delimited by a dot? or do you want to support whitespaces? or maybe even new-lines?
The implementation for your problem: A word is any sequence of characters delimited by a dot.
public static IEnumerable<string> ToWords(this string text)
{
    const char dot = '.';
    int startIndex = 0;
    // find the next dot:
    int dotIndex = text.IndexOf(dot, startIndex);
    while (dotIndex != -1)
    {
        // found a dot, return the substring until the dot:
        int wordLength = dotIndex - startIndex;
        yield return text.Substring(startIndex, wordLength);
        // find the next dot
        startIndex = dotIndex + 1;
        dotIndex = text.IndexOf(dot, startIndex);
    }
    // read until the end of the text. Return everything after the last dot:
    yield return text.Substring(startIndex);
}
TODO:
Decide what you want to return if text starts with a dot ".ABC.DEF".
Decide what you want to return if the text ends with a dot: "ABC.DEF."
Check if the return value is what you want if text is empty.

Your split/join don't match up with your input.
That said, here's a quick one-liner:
string clean = BAD_WORDS.Aggregate(stringToClean, (acc, word) => acc.Replace(word, string.Empty));
This is basically a "reduce". Not fantastically performant but over strings that are known to be decently small I'd consider it acceptable. If you have to use a really large string or a really large number of "words" you might look at another option but it should work for the example case you've given us.
Edit: The downside of this approach is that it also replaces partial matches. For example, your token array contains "720p", and the code I suggested here will also match inside "720px". There are ways around that: instead of using string's implementation of Replace, you could use a regex that also matches your delimiters, something like Regex.Replace(acc, $"[. ]{word}([. ])", "$1") (regex not confirmed, but it should be close; I added a capture for the trailing delimiter in order to put it back for the next pass). A token that contains regex metacharacters, such as ".", would have to go through Regex.Escape first.
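The partial-match problem can also be avoided by comparing whole tokens instead of replacing text. A minimal sketch, assuming the dot is the only delimiter and the cleaned words should be joined with spaces; WordCleaner and RemoveWords are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordCleaner
{
    public static string RemoveWords(string input, IEnumerable<string> badWords, char delimiter = '.')
    {
        // Case-insensitive set, so "web-dl" also removes "WEB-DL".
        var bad = new HashSet<string>(badWords, StringComparer.OrdinalIgnoreCase);
        // Whole tokens are compared, so "720p" does not touch "720px".
        var kept = input.Split(delimiter)
                        .Where(word => word.Length > 0 && !bad.Contains(word));
        return string.Join(" ", kept);
    }
}
```

With the input and BAD_WORDS from the question, this returns "The Flash 2014 S07E06".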

Regex Puzzle Find all Valid String Combinations

I am trying to find all possible substrings of a string that satisfy all of the given conditions:
The first letter is a lowercase English letter.
Next, it contains a sequence of zero or more of the following characters:
lowercase English letters, digits, and colons.
Next, it contains a forward slash '/'.
Next, it contains a sequence of one or more of the following characters:
lowercase English letters and digits.
Next, it contains a backward slash '\'.
Next, it contains a sequence of one or more lowercase English letters.
Given some string, s, we define the following:
s[i..j] is a substring consisting of all the characters in the inclusive range between index i and index j.
Two substrings, s[i1..j1] and s[i2..j2], are said to be distinct if either i1 ≠ i2 or j1 ≠ j2.
For example, your command line is abc:/b1c\xy. Valid command substrings are:
abc:/b1c\xy
bc:/b1c\xy
c:/b1c\xy
abc:/b1c\x
bc:/b1c\x
c:/b1c\x
I solved this with ^([a-z])([a-z0-9:]*)(/)([a-z0-9]+)([\\])([a-z]*)
but this doesn't satisfy the second condition. I then tried ^([a-z])([a-z0-9:]*)(/)([a-z0-9]+)([\\])([a-z]+[a-z]*), but for w:/a\bc there should be 2 substrings [w:/a\b, w:/a\bc], while the regex finds only 1, which is obvious. What am I doing wrong?
Regex Tool: Check
Edit: w:/a\bc should yield two substrings [w:/a\b, w:/a\bc] because both satisfy all 6 constraints, and they are distinct because w:/a\bc is a superset of w:/a\b.

You have to perform sub string operations after matching the strings.
For Example:
your string is "abc:/b1c\xy", you matched it using your regex, now it's time to get the required data.
string st = @"abc:/b1c\xy";
// Split the command into its three parts.
string prefix = Regex.Match(st, @"[a-z0-9:]*(?=/)").Value;   // "abc:"
string center = Regex.Match(st, @"/[a-z0-9]+\\").Value;      // "/b1c\"
string postfix = Regex.Match(st, @"(?<=\\)[a-z]+").Value;    // "xy"
for (int i = 0; i < prefix.Length; i++)
{
    // A valid substring must start with a lowercase letter, so skip digits and colons.
    if (prefix[i] < 'a' || prefix[i] > 'z')
    {
        continue;
    }
    // It may end after any letter of the postfix, as long as at least one letter is kept.
    for (int len = 1; len <= postfix.Length; len++)
    {
        Console.WriteLine(prefix.Substring(i) + center + postfix.Substring(0, len));
    }
}
I hope this will give you some idea.

You may be able to create a regex that helps you in separating all relevant result parts, but as far as I know, you can't create a regex that gives you all result sets with a single search.
The tricky part is the first two conditions, since there can be many possible starting points when there is a mix of letters, digits and colons.
In order to find possible starting points, I suggest the following pattern for the part before the forward slash: (?:([a-z]+)(?:[a-z0-9:]*?))+
This will match potentially multiple captures where every letter within the capture could be a starting point to the substring.
Whole regex: (?:([a-z]+)(?:[a-z0-9:]*?))+/[a-z0-9]+\\([a-z]*)
Create your results by combining all postfix sub-lengths from all captures of group 1 and all prefix sub-lengths from group 2.
Example code:
var testString = @"a:ab2c:/b1c\xy";
var reg = new Regex(@"(?:([a-z]+)(?:[a-z0-9:]*?))+/[a-z0-9]+\\([a-z]*)");
var matches = reg.Matches(testString);
foreach (Match match in matches)
{
    var prefixGroup = match.Groups[1];
    var postfixGroup = match.Groups[2];
    foreach (Capture prefixCapture in prefixGroup.Captures)
    {
        for (int i = 0; i < prefixCapture.Length; i++)
        {
            for (int j = 0; j < postfixGroup.Length; j++)
            {
                var start = prefixCapture.Index + i;
                var end = postfixGroup.Index + postfixGroup.Length - j;
                Console.WriteLine(testString.Substring(start, end - start));
            }
        }
    }
}
Output:
a:ab2c:/b1c\xy
a:ab2c:/b1c\x
ab2c:/b1c\xy
ab2c:/b1c\x
b2c:/b1c\xy
b2c:/b1c\x
c:/b1c\xy
c:/b1c\x

An intuitive way, which might not be correct:
var regex = new Regex(@"^[a-z][a-z0-9:]*/[a-z0-9]+\\([a-z]+)");
var counter = 0;
for (var c = 0; c < command.Length; c++)
{
    // Try to match a valid command that starts exactly at position c.
    var match = regex.Match(command.Substring(c));
    if (match.Success)
    {
        // Every letter of the trailing run is a valid end, so each one adds a substring.
        counter += match.Groups[1].Length;
    }
}
return counter;

check for a substring(of a string) in the dictionary and return the key's(substring) value

I have a dictionary like below,
PropStreetSuffixDict.Add("ROAD", "RD");
PropStreetSuffixDict.Add("STREET","ST"); and many more.
Now my requirement says when a string contains a substring of either ROAD or STREET i want to return the related value for that substring.
For example..CHURCH ACROSS ROAD should return RD
This is what i tried, which only works if the input string is exactly same as key of the dict.
private string GetSuffix(string input)
{
string suffix=string.Empty;
suffix = PropStreetSuffixDict.Where(x => x.Key.ToUpper().Trim() ==
input.ToUpper().Trim()).FirstOrDefault().Value;
return suffix;
}
Note:
In case a string contains more than one such substring, it should return the value of the first occurrence of any of the substrings.
i.e. if STREET CHURCH ACROSS ROAD is the input, it should return ST not RD

You can try something like this
private string GetSuffix(string input)
{
    string[] words = input.ToUpper().Split(' ');
    // Drive the join from the input words so that results come back in input order.
    return (from word in words
            join entry in PropStreetSuffixDict on word equals entry.Key
            select entry.Value).FirstOrDefault() ?? string.Empty;
}
Split the input and then use LINQ; FirstOrDefault returns the value for the first matching word.

If you want it to return the value for the first occurrence in the input string (GetSuffix("CHURCH STREET ACROSS ROAD") ==> "ST"), it becomes a little tricky.
Code below will find where in the input string all keys occur, and return value of first found position.
private string GetSuffix(string input)
{
    var suffix = PropStreetSuffixDict
        .Select(kvp => new
        {
            Position = input.IndexOf(kvp.Key.Trim(), StringComparison.CurrentCultureIgnoreCase),
            Value = kvp.Value
        })
        .OrderBy(x => x.Position)
        .FirstOrDefault(x => x.Position > -1)?.Value;
    return suffix ?? string.Empty;
}
If you didn't care about the order of occurrence in input string you could simplify it to this:
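private string GetSuffix(string input)
{
    // KeyValuePair is a struct, so project to the value before calling FirstOrDefault.
    var suffix = PropStreetSuffixDict
        .Where(kvp => input.IndexOf(kvp.Key.Trim(), StringComparison.CurrentCultureIgnoreCase) >= 0)
        .Select(kvp => kvp.Value)
        .FirstOrDefault();
    return suffix ?? string.Empty;
}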
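Note that IndexOf also finds a key inside a longer word, for example ROAD inside BROADWAY. If only whole words should count, one option is to walk the input words in order and look each one up. A sketch under that assumption; the Suffixes dictionary is a hypothetical stand-in for PropStreetSuffixDict, and the separator list is a guess at what the input may contain.

```csharp
using System;
using System.Collections.Generic;

public static class SuffixLookup
{
    // Hypothetical stand-in for the question's PropStreetSuffixDict.
    // The comparer makes key lookups case-insensitive.
    private static readonly Dictionary<string, string> Suffixes =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "ROAD", "RD" },
            { "STREET", "ST" },
        };

    public static string GetSuffix(string input)
    {
        var separators = new[] { ' ', ',', '.', ';' };
        // Words are visited in input order, so the first hit is the first occurrence.
        foreach (string word in input.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            if (Suffixes.TryGetValue(word, out string value))
            {
                return value;
            }
        }
        return string.Empty;
    }
}
```

Each word costs one dictionary lookup, so the work grows with the length of the input and not with the number of dictionary entries.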

I would recommend using a regex to split apart your words; that way you can efficiently split on multiple characters, not just spaces, if required. This solution also allows replacing the individual words very easily, without having to track the position and length of the matched word versus the length of the replacement value.
You could use a function like this:
public string ReplaceWords(string input, Dictionary<string,string> dictionary)
{
    var result = Regex.Replace(input, @"\w*", (match) =>
    {
        if (dictionary.TryGetValue(match.Value, out var replacement))
        {
            return replacement;
        }
        return match.Value;
    });
    return result;
}
It will take an input string, split it up, and replace the individual words with those in the supplied dictionary. The particular RegEx of \w* will match any continuous run of "word" characters, so it will break on spaces, commas, dashes, and anything else that isn't part of a "word".
This code does use some newer C# language features that you may not have access to (inline out parameters). Just let me know if you can't use those and I'll update it to work without them.
You can use it like this:
Console.WriteLine(ReplaceWords("CHURCH ACROSS ROAD", PropStreetSuffixDict));
Console.WriteLine(ReplaceWords("CHURCH ACROSS STREET", PropStreetSuffixDict));
Console.WriteLine(ReplaceWords("CHURCH ACROSS ROAD, LEFT AT THE OTHER STREET", PropStreetSuffixDict));
For the following results:
CHURCH ACROSS RD
CHURCH ACROSS ST
CHURCH ACROSS RD, LEFT AT THE OTHER ST

Find all substrings between two strings

I need to get all substrings that lie between two given strings.
For example:
StringParser.GetSubstrings("[start]aaaaaa[end] wwwww [start]cccccc[end]", "[start]", "[end]");
should return 2 strings, "aaaaaa" and "cccccc".
Suppose we have only one level of nesting.
I'm not sure about regexp, but I think it could be useful.

private IEnumerable<string> GetSubStrings(string input, string start, string end)
{
    Regex r = new Regex(Regex.Escape(start) + "(.*?)" + Regex.Escape(end));
    MatchCollection matches = r.Matches(input);
    foreach (Match match in matches)
        yield return match.Groups[1].Value;
}

Here's a solution that doesn't use regular expressions and doesn't take nesting into consideration.
public static IEnumerable<string> EnclosedStrings(
    this string s,
    string begin,
    string end)
{
    int beginPos = s.IndexOf(begin, 0);
    while (beginPos >= 0)
    {
        int start = beginPos + begin.Length;
        int stop = s.IndexOf(end, start);
        if (stop < 0)
            yield break;
        yield return s.Substring(start, stop - start);
        beginPos = s.IndexOf(begin, stop + end.Length);
    }
}

You can use a regular expression, but remember to call Regex.Escape on your arguments:
public static IEnumerable<string> GetSubStrings(
    string text,
    string start,
    string end)
{
    string regex = string.Format("{0}(.*?){1}",
        Regex.Escape(start),
        Regex.Escape(end));
    return Regex.Matches(text, regex, RegexOptions.Singleline)
        .Cast<Match>()
        .Select(match => match.Groups[1].Value);
}
I also added the SingleLine option so that it will match even if there are new-lines in your text.

You're going to need to better define the rules that govern your matching needs. When building any kind of matching or search code you need to be very clear about what inputs you anticipate and what outputs you need to produce. It's very easy to produce buggy code if you don't take these questions into close consideration. That said...
You should be able to use regular expressions. Nesting may make it slightly more complicated but still doable (depending on what you expect to match in nested scenarios). Something like this should get you started:
var start = "[start]";
var end = "[end]";
// (.*?) is lazy, so each match stops at the nearest end marker.
var regEx = new Regex(String.Format("{0}(.*?){1}", Regex.Escape(start), Regex.Escape(end)));
var source = "[start]aaaaaa[end] wwwww [start]cccccc[end]";
var matches = regEx.Matches( source );
It should be trivial to wrap the code above into a function appropriate for your needs.
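The choice of quantifier matters for this pattern. A greedy (.*) runs from the first start marker to the last end marker, while a lazy (.*?) stops at the nearest end marker. The sketch below makes the difference visible; QuantifierDemo and its lazy flag are hypothetical and exist only for the comparison.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Returns the text captured between the markers,
    // using either a lazy or a greedy quantifier.
    public static string[] Extract(string source, string start, string end, bool lazy)
    {
        string middle = lazy ? "(.*?)" : "(.*)";
        string pattern = Regex.Escape(start) + middle + Regex.Escape(end);
        return Regex.Matches(source, pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToArray();
    }
}
```

For "[start]aaaaaa[end] wwwww [start]cccccc[end]" the lazy variant returns "aaaaaa" and "cccccc". The greedy variant returns the single string "aaaaaa[end] wwwww [start]cccccc".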

I was bored, and thus I made a useless micro benchmark which "proves" (on my dataset, which has strings up to 7k of characters and <b> tags for start/end parameters) my suspicion that juharr's solution is the fastest of the three overall.
Results (1000000 iterations * 20 test cases):
juharr: 6371ms
Jake: 6825ms
Mark Byers: 82063ms
NOTE: Compiled regex didn't speed things up much on my dataset.

How to get number from a string

I would like to get number from a string eg:
My123number gives 123
Similarly varchar(32) gives 32 etc
Thanks in Advance.

If there is going to be only one number buried in the string, and it is going to be an integer, then something like this:
int n;
string s = "My123Number";
if (int.TryParse (new string (s.Where (a => Char.IsDigit (a)).ToArray ()), out n)) {
Console.WriteLine ("The number is {0}", n);
}
To explain: s.Where (a => Char.IsDigit (a)).ToArray () extracts only the digits from the original string into an array of char. Then, new string converts that to a string and finally int.TryParse converts that to an integer.

You could go the regular expression way, which is concise, though not necessarily faster than looping through the string:
public int GetNumber(string text)
{
    var exp = new Regex(@"\d+"); // find a sequence of digits
    var matches = exp.Matches(text);
    if (matches.Count == 1) // if there's exactly one number, return it
    {
        return int.Parse(matches[0].Value);
    }
    else if (matches.Count > 1)
    {
        throw new InvalidOperationException("only one number allowed");
    }
    else
    {
        return 0;
    }
}

Loop through each char in the string and test whether it is a digit. Remove all non-digits and you have a simple integer as a string. Then you can just use int.Parse.
string numString = string.Empty;
foreach (char c in inputString)
{
    if (Char.IsDigit(c)) numString += c;
}
int realNum = int.Parse(numString);

You could do something like this; then it will work with more than one number as well:
public IEnumerable<string> GetNumbers(string indata)
{
    MatchCollection matches = Regex.Matches(indata, @"\d+");
    foreach (Match match in matches)
    {
        yield return match.Value;
    }
}

First write a specification of what you mean by a "number" (integer? long? decimal? double?) and by "get a number from a string". Including all the cases you want to be able to handle (leading/trailing signs? culture-invariant thousands/decimal separators, culture-sensitive thousands/decimal separators, very large values, strings that don't contain a valid number, ...).
Then write some unit tests for each case you need to be able to handle.
Then code the method (should be easy - basically extract the numeric bit from the string, and try to parse it. Some of the answers provided so far will work for integers provided the string doesn't contain a value larger than Int32.MaxValue).
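As an illustration of one such specification, the sketch below assumes that "number" means the first run of ASCII digits, read as a non-negative Int32, and that overflow and missing digits are reported as failure instead of throwing. NumberExtractor and TryGetFirstInt are hypothetical names.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class NumberExtractor
{
    // Returns true and the value of the first digit run, or false when
    // the text has no digits or the value does not fit into an Int32.
    public static bool TryGetFirstInt(string text, out int number)
    {
        number = 0;
        // [0-9] restricts the match to ASCII digits; \d would also accept other Unicode digits.
        Match m = Regex.Match(text, "[0-9]+");
        return m.Success
            && int.TryParse(m.Value, NumberStyles.None, CultureInfo.InvariantCulture, out number);
    }
}
```

With this definition, "My123number" gives 123 and "varchar(32)" gives 32. A string such as "id99999999999" returns false because the value exceeds Int32.MaxValue.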
