Finding and replacing multiple matches in a single line with regex - c#

I have a query regarding making multiple replacements within a string using a regular expression.
The platform is C#, so .NET's System.Text.RegularExpressions implementation.
Let's say I have a string -- in this case, an XML fragment, but it could be any text at all, so no assumptions on the syntax:
<key val1="C:\SomeDir\SomePath\FOLDER1" val2="C:\SomeDir\SomePath\FOLDER2" />
I want to replace the last part of both of these paths -- let's say, change it to FOLDER3.
I currently have the expression (C:\\SomeDir\\SomePath)(\\\w*\\) which gives me two groups -- the first part of the path and the bit I want to replace.
I can use the replacement string ${1}\FOLDER3\ which properly replaces the part of the path I want to change.
However: this only works for the first match in the string. So, FOLDER1 will be replaced with FOLDER3 but FOLDER2 remains unchanged.
I thought I could apply the match/replace operation in a loop until the line no longer changed, but of course this doesn't work as the match regex always stops on the first match.
Any help greatly appreciated!

Use the Replace method of the Regex class. It replaces all matches, not just the first:
string s = "<key val1=\"C:\\SomeDir\\SomePath\\FOLDER1\" val2=\"C:\\SomeDir\\SomePath\\FOLDER2\" />";
Regex regex = new Regex(@"(C:\\SomeDir\\SomePath)(\\\w*)");
string result = regex.Replace(s, x => x.Groups[1] + @"\FOLDER3");
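The same result can also be had with a replacement string like the one in the question, rather than a lambda. The following is a minimal sketch; the class and method names (ReplaceAllDemo, RetargetFolders) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplaceAllDemo
{
    // Hypothetical helper: points every matching path at FOLDER3.
    public static string RetargetFolders(string input)
    {
        // Group 1 is the fixed prefix; group 2 is the last path segment.
        var regex = new Regex(@"(C:\\SomeDir\\SomePath)(\\\w*)");
        // ${1} re-inserts the prefix. A backslash is an ordinary character
        // in a .NET replacement string, so \FOLDER3 is taken literally.
        return regex.Replace(input, @"${1}\FOLDER3");
    }

    public static void Main()
    {
        string s = "<key val1=\"C:\\SomeDir\\SomePath\\FOLDER1\" val2=\"C:\\SomeDir\\SomePath\\FOLDER2\" />";
        // Both FOLDER1 and FOLDER2 become FOLDER3.
        Console.WriteLine(RetargetFolders(s));
    }
}
```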

Related

Getting exact substring that satisfies regex's match

I want to get the indices of the regular expression match below:
input : ab
regex: a(?=b)
The Match object contains information on the part of the string that was actually matched (a) and does not include the text covered by the zero-width assertions that were required for the match to succeed. I want to be able to capture the exact substring that satisfies this match, without having to expand the string manually. It seems to me there should be a method for this somewhere in the FCL.
Edit:
Just to make things more clear as there are recommendations as to not using lookaheads. I am well aware that I shouldn't be using lookaheads when I want to actually match a part of the string. However, the application I am working on receives a series of regular expressions to be used in a preprocessing stage. These regular expressions are out of my control. I cannot guarantee that they properly match the zero-width assertions. In this stage the matched regular expressions are replaced with a piece of text. In order for the following regular expression replace procedure to work, I need to be able to capture the substring in the string that satisfies the regular expression. Consider the code below:
string input = "abcdefg";
Regex regex = new Regex("a(?=b)");
Match m = regex.Match(input);
regex.Replace(m.Value, "z").Dump();
First notice that I want the replacement to happen only in the portion of the input that the match occurred and not the entire input. This is very important as I don't want all the matches to be replaced just yet. The code above's output is 'a' and not 'z'. The reason for that is that m.Value is a and the regex wouldn't replace a single a with z. It would replace the a found in 'ab' with 'z'. I want to be able to pass 'ab' to the Replace function.
Hope this clears things up.
You are using the wrong API for controlling the replacement. Rather than passing the match back to the regex, use the four-argument overload of Replace, which gives you tighter control over how many replacements are made in the original string and where in the string the search starts:
string input = "abcdefg";
Regex regex = new Regex("a(?=b)");
regex.Replace(input, "z", 1, 0).Dump();
Only the first match will be replaced, starting at the index zero. If you would like to continue replacing additional matches, change the last parameter to the new starting index. Keep the third parameter at 1, so as to make at most one replacement.
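The "continue from a new starting index" idea could be sketched as follows. The helper name ReplaceNext is hypothetical, and computing the resume position from replacement.Length assumes a literal, non-empty replacement with no substitutions such as $1.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StepwiseReplace
{
    // Hypothetical helper: replaces only the first match found at or after
    // startAt. nextStart receives the index just past the inserted text,
    // or -1 when nothing matched (the input is then returned unchanged).
    public static string ReplaceNext(Regex regex, string input, string replacement,
                                     int startAt, out int nextStart)
    {
        Match m = regex.Match(input, startAt);
        if (!m.Success)
        {
            nextStart = -1;
            return input;
        }
        // The lookahead is evaluated against the full input, so a(?=b)
        // still sees the b that follows the a.
        string result = regex.Replace(input, replacement, 1, startAt);
        nextStart = m.Index + replacement.Length;
        return result;
    }

    public static void Main()
    {
        Regex regex = new Regex("a(?=b)");
        int next;
        string once = ReplaceNext(regex, "abcabc", "z", 0, out next);
        Console.WriteLine(once);   // zbcabc
        string twice = ReplaceNext(regex, once, "z", next, out next);
        Console.WriteLine(twice);  // zbczbc
    }
}
```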

Regex replace all matching words that do not contain a certain string

How can I use regex to replace matching strings that do not include a specific string?
input string
Keepword mywordsecond mythirdword myfourthwordKeep
string to replace
word
exclude string
Keep
Desired output
Keepword mysecond mythird myfourthKeep
Will there ever be more than one occurrence of "word" in a single word? If so, do you want to replace all of them? If not, this should sort you out:
Regex r = new Regex(@"\b((?:(?!Keep|word)\w)*)word((?:(?!Keep)\w)*)\b");
s1 = r.Replace(s0, "$1$2");
To explain:
First, \b((?:(?!Keep|word)\w)*) captures whatever text precedes the first occurrence of word or Keep.
The next thing it sees must be word. If it sees Keep or the end of the string instead, the match attempt immediately fails.
Then ((?:(?!Keep)\w)*)\b captures the remainder of the text in order to ensure it doesn't contain Keep.
When faced with a problem like this, most users' first impulse is to match (in the sense of consuming) only the part of the string they're interested in, using lookarounds to establish the context. It's usually much easier to write the regex so that it always moves forward through the string as it matches. You capture the parts you want to retain so you can plug them back into the result string by means of group references ($1, $2, etc.).
Given that you're using C#, you could use the lookaround approach:
Regex r = new Regex(@"(?<!Keep\w*)word(?!\w*Keep)");
s1 = r.Replace(s0, "");
But please don't. There are very few regex flavors that support unrestricted lookbehinds like .NET does, and most problems don't work so neatly as this one anyway.
string str = "Keepword mywordsecond mythirdword myfourthwordKeep";
str = Regex.Replace(str, "(?<!Keep)word", "");
And I'm going to link you to a good regular expressions cheat sheet here
This works in notepad++:
(?<!Keep)word(?!Keep)
It uses a negative lookbehind and a negative lookahead.
You can use a negative lookbehind assertion if you want to remove every "word" that is not preceded by "Keep":
String input = "Keepword mywordsecond mythirdword myfourthwordKeep";
String pattern = "(?<!Keep)word";
String output = Regex.Replace(input, pattern, "");

C# Trouble with Regex.Replace

Been scratching my head all day about this one!
Ok, so I have a string which contains the following:
?\"width=\"1\"height=\"1\"border=\"0\"style=\"display:none;\">');
I want to convert that string to the following:
?\"width=1height=1border=0style=\"display:none;\">');
I could theoretically just do a String.Replace on "\"1\"" etc. But this isn't really a viable option as the string could theoretically have any number within the expression.
I also thought about removing the string "\"", however there are other occurrences of this which I don't want to be replaced.
I have been attempting to use the Regex.Replace method as I believe this exists to solve problems along my lines. Here's what I've got:
chunkContents = Regex.Replace(chunkContents, "\".\"", ".");
Now that really messes things up (it replaces the correct elements, but with a full stop), but I think you can see what I am attempting to do with it. I am also worried that this will only work for single digits (\"1\" rather than \"11\"). That led me to think about using "*" or "+" rather than ".", but I foresaw the problem of this picking up all of the text in between the desired characters (which are dotted all over the place), whereas I only want to replace the ones with numeric characters between them.
Hope I've explained that clearly enough, will be happy to provide any extra info if needed :)
Try this
var str = "?\"width=\"1\"height=\"1234\"border=\"0\"style=\"display:none;\">');";
str = Regex.Replace(str , "\"(\\d+)\"", "$1");
(\\d+) is a capturing group that looks for one or more digits and $1 references what the group captured.
This works
String input = @"?\""width=\""1\""height=\""1\""border=\""0\""style=\""display:none;\"">');";
//replace the entire match of the regex with only what's captured (the number)
String result = Regex.Replace(input, @"\\""(\d+)\\""", match => match.Result("$1"));
//control string for expected result
String shouldBe = @"?\""width=1height=1border=0style=\""display:none;\"">');";
//prints true
Console.WriteLine(result.Equals(shouldBe).ToString());

Regular expression with backreferences

I am attempting to write a regular expression in a C# application to find "{value}", along with a backreference to the text before it up to "[[", and another backreference to the text after it up to "]]". For example:
This is some text [[backreference one {value}
backreference two]]
Would match "[[backreference one ", "{value}", and "\r\nbackreference two]]".
I have tried modified versions of the following with no luck. I believe I am missing word boundaries, and may be having trouble because of "{" in the text I am trying to find.
\[\[(^[\{value\}]+)\{value\}(^\]\]+)\]\]
I'm not sure if it would be possible with regular expressions, but it would be ideal if it could find the matching closing bracket, for example the following would find "[[backreferenc[[e]] one ", "{value}", and "ba[[ckref[[e]]rence t]]wo]]":
This is some text [[backreferenc[[e]] one {value}
ba[[ckref[[e]]rence t]]wo]]
You need to use a MatchEvaluator with Regex.Replace. It would also make your life easier to break the match up into named capture groups, which simplifies the processing inside the match evaluator. Let me explain.
A MatchEvaluator lets you intercede in the replacement process with a C# delegate: it examines each match as it is captured and returns the text that should replace it. That way you can do your text processing as needed.
Here is a basic example where it handles the sections in a basic way, but the structure is there to add your business logic:
string text = @"This is some text [[Name: {name}]] at [[Address: {address}]].";
Regex.Replace(text,
    @"(?:\[\[)(?<Section>[^\:]+)(?:\:)(?<Data>[^\]]+)(?:\]\])",
    new MatchEvaluator((mtch) =>
    {
        if (mtch.Groups["Section"].Value == "Name")
            return "Jabberwocky";
        return "120 Main";
    }));
The result of Regex Replace is:
This is some text Jabberwocky at 120 Main.
For the first part of your question, try this:
\[\[(.*)({value})(.*)\]\]

Converting wildcard pattern to regular expression

I am new to regular expressions. Recently I was presented with a task to convert a wildcard pattern to regular expression. This will be used to check if a file path matches the regex.
For example if my pattern is *.jpg;*.png;*.bmp
I was able to generate the regex by splitting on semicolons, escaping the string and replacing the escaped * with .*
String regex = "((?i)" + Regex.Escape(extension).Replace("\\*", ".*") + "$)";
So my resulting regex for jpg will be ((?i).*\.jpg$)
Then I combine all my extensions using the OR operator.
Thus my final expression for this example will be:
((?i).*\.jpg$)|((?i).*\.png$)|((?i).*\.bmp$)
I have tested it and it worked, yet I am not sure whether I should add or remove anything to cover other cases, or whether there is a better way to format the whole thing.
Also bear in mind that I can encounter a wildcard like *myfile.jpg where it should match all files whose names end with myfile.jpg
I can encounter patterns like *myfile.jpg;*.png;*.bmp
There's a lot of grouping going on there which isn't really needed... well unless there's something you haven't mentioned this regex would do the same for less:
/.*\.(jpg|png|bmp)$/i
That's in regex notation, in C# that would be:
Regex regex = new Regex(@".*\.(jpg|png|bmp)$", RegexOptions.IgnoreCase);
If you have to programmatically translate between the two, you've started on the right track: split by semicolon and group your extensions into the set (without the preceding dot). If your wildcard patterns can be more complicated (extensions with wildcards, multi-wildcard starting matches) it might need a bit more work ;)
Edit: (For your update)
If the wild cards can be more complicated, then you're almost there. There's an optimization in my above code that pulls the dot out (for extension) which has to be put back in so you'd end up with:
/.*(myfile\.jpg|\.png|\.bmp)$/i
Basically '*' -> '.*', '.' -> '\.'(gets escaped), rest goes into the set. Basically it says match anything ending (the dollar sign anchors to the end) in myfile.jpg, .png or .bmp.
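Put together, the translation described above might look like the sketch below. The names WildcardPatterns and ToRegex are hypothetical; the sketch handles only the * wildcard and, as an added assumption, anchors each alternative at both ends so the whole path has to match.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WildcardPatterns
{
    // Hypothetical helper: turns a list such as "*myfile.jpg;*.png;*.bmp"
    // into a single case-insensitive regex.
    public static Regex ToRegex(string wildcardList)
    {
        var alternatives = wildcardList
            .Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
            // Escape everything first, then turn the escaped * back into .*
            .Select(p => Regex.Escape(p.Trim()).Replace(@"\*", ".*"));

        // ^ and $ anchor the alternation to the whole input string.
        string pattern = "^(?:" + string.Join("|", alternatives) + ")$";
        return new Regex(pattern, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Regex regex = ToRegex("*myfile.jpg;*.png;*.bmp");
        Console.WriteLine(regex.IsMatch(@"C:\pics\holiday.PNG"));    // True
        Console.WriteLine(regex.IsMatch(@"C:\pics\old_myfile.jpg")); // True
        Console.WriteLine(regex.IsMatch(@"C:\pics\notes.txt"));      // False
    }
}
```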
