Regex.Replace doesn't seem to work with back-reference - c#

I made an application designed to prepare files for translation using lists of regexes.
It runs each regex on the file using Regex.Replace. There is also an inspector module which allows the user to see the matches for each regex on the list.
It works well, except when a regex contains a back-reference, Regex.Replace does not replace anything, yet the inspector shows the matches properly (so I know the regex is valid and matches what it should).
sSrcRtf = Regex.Replace(sSrcRtf, sTag, sTaggedTag,
RegexOptions.Compiled | RegexOptions.Singleline);
sSrcRtf contains the RTF code of the page. sTag contains the regular expression wrapped in parentheses. sTaggedTag contains $1 surrounded by the tag formatting code.
To give an example:
sSrcRtf = Regex.Replace("the little dog", @"((e).*?\1)", "$1",
RegexOptions.Compiled | RegexOptions.Singleline);
doesn't work. But
sSrcRtf = Regex.Replace("the little dog", "((e).*?e)", "$1",
RegexOptions.Compiled | RegexOptions.Singleline);
does. (of course, there is some RTF code around $1)
Any idea why this is?

You technically have two match groups there, the outer and the inner parentheses. Why don't you try addressing the inner set as the second capture, e.g.:
((e).*?\2)
Your parser probably thinks the outer capture is \1, and it doesn't make much sense to backreference it from inside itself.
Also note that your replacement won't do anything, since you are asking to replace the portion that you match with itself. I'm not sure what your intended behavior is, but if you are trying to extract just the match and discard the rest of the string, you want something like:
.*((e).*?\2).*

You're using a reference to a group inside the group you're referencing.
"((e).*?\1)" // first capturing group
"(e)" // second capturing group
I'm not 100% certain, but I don't think you can reference a group from within that group. For starters, what would you expect the backreference to match, since it's not even complete yet?

As others have mentioned, there are some additional groups being captured. Your replacement isn't referencing the correct one.
Your current regex should be rewritten as (options elided):
Regex.Replace("the little dog", @"((e).*?\2)", "$2")
// or
Regex.Replace("the little dog", @"(e).*?\1", "$1")
Here's another example that matches doubled words and indicates which backreferences work:
Regex.Replace("the the little dog", @"\b(\w+)\s+\1\b", "$1") // good
Regex.Replace("the the little dog", @"\b((\w+)\s+\2)\b", "$1") // no good
Regex.Replace("the the little dog", @"\b((\w+)\s+\2)\b", "$2") // good
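To see the group numbering in action, here is a minimal sketch that runs the working variants as a small console program. The class name BackrefDemo and the helper Apply are made up for illustration; the inputs and patterns are the ones from the answers in this thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackrefDemo
{
    // Hypothetical helper: a thin wrapper around Regex.Replace so the calls read uniformly.
    public static string Apply(string input, string pattern, string replacement)
    {
        return Regex.Replace(input, pattern, replacement, RegexOptions.Singleline);
    }

    public static void Main()
    {
        // Group 1 is the outer pair, group 2 is the inner (e).
        // \2 refers back to the inner group; $2 keeps only that "e".
        Console.WriteLine(Apply("the little dog", @"((e).*?\2)", "$2"));

        // Same idea with a single group: \1 and $1 now both mean (e).
        Console.WriteLine(Apply("the little dog", @"(e).*?\1", "$1"));

        // Doubled word collapsed to one word.
        Console.WriteLine(Apply("the the little dog", @"\b(\w+)\s+\1\b", "$1"));

        // $1 is the outer group, i.e. the whole match, so the text is unchanged.
        Console.WriteLine(Apply("the the little dog", @"\b((\w+)\s+\2)\b", "$1"));
    }
}
```

The first two calls should print "the dog", the third "the little dog", and the last one leaves "the the little dog" as it was.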

Related

Regex Expression for Pseudocode

I am trying to figure out a regex expression for a project, and struggling here.
Here's my sample string:
[link="http://www.cnn.com"]CNN Webpage[/link]
I want to regex.replace the above example to this:
CNN Webpage
I know there is a Regex way to do this. Can anyone help?
I personally prefer using named groups when I can. As you'll see, they make the regex and the code more readable and easier to maintain, because the captured groups are no longer referenced by index. As you probably know, group indexes shift if you change any preceding capturing group within the regex.
The named groups will stay consistent through the lifetime of the regex unless you specifically change it.
Regex
\[link=["\u201C](?<href>[^"\u201D]+)["\u201D]\](?<title>[^\[]+)\[/link\]
Regex Demo - Note that the demo's regex looks different because it runs on a different regex engine, but it is equivalent to the one presented here.
Code
var str = "[link=\"http://www.cnn.com\"]CNN Webpage[/link] OR [link=“http://www.cnn.com”]CNN Webpage[/link]";
var regex = new Regex(@"\[link=[""\u201C](?<href>[^""\u201D]+)[""\u201D]\](?<title>[^\[]+)\[/link\]");
//The ${name} refers to a named capture group in the regex. Makes it a little more readable, and maintainable.
str = regex.Replace(str, "${title}");
Console.WriteLine(str);
Please note that the regex only supports "smart quotes" if the quotes are used properly. To handle cases where the quotes might be reversed, you'd need something like this:
\[link=["\u201C\u201D](?<href>[^"\u201D\u201C]+)["\u201D\u201C]\](?<title>[^\[]+)\[/link\]
Just for clarity, the example below shows where this regex would be useful. Notice that the last link has the Unicode characters mixed up: it uses the Unicode right quote (\u201D ”) on both sides of the text. This regex will parse the data, but the one at the beginning of the post will not.
var str = "[link=\"http://www.cnn.com\"]CNN Webpage[/link] OR [link=“http://www.cnn.com”]CNN Webpage[/link] OR [link=”http://www.cnn.com”]CNN Webpage[/link]";
var regex = new Regex(@"\[link=[""\u201C\u201D](?<href>[^""\u201D\u201C]+)[""\u201D\u201C]\](?<title>[^\[]+)\[/link\]");
//The ${name} refers to a named capture group in the regex. Makes it a little more readable, and maintainable.
str = regex.Replace(str, "${title}");
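If the goal is an HTML anchor rather than just the title (an assumption, since the desired output in the question is not entirely clear), the same named groups can be reused in the replacement string. A rough sketch, where the class name LinkTagDemo and the methods TitleOnly and ToAnchor are invented for this example:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkTagDemo
{
    // Same pattern as above: straight or curly double quotes, in either direction.
    private static readonly Regex LinkTag = new Regex(
        @"\[link=[""\u201C\u201D](?<href>[^""\u201D\u201C]+)[""\u201D\u201C]\](?<title>[^\[]+)\[/link\]");

    // Keeps only the link text.
    public static string TitleOnly(string input)
    {
        return LinkTag.Replace(input, "${title}");
    }

    // Assumed target format: a plain HTML anchor built from both named groups.
    public static string ToAnchor(string input)
    {
        return LinkTag.Replace(input, "<a href=\"${href}\">${title}</a>");
    }

    public static void Main()
    {
        var sample = "[link=\"http://www.cnn.com\"]CNN Webpage[/link]";
        Console.WriteLine(TitleOnly(sample));
        Console.WriteLine(ToAnchor(sample));
    }
}
```

Because the groups are addressed by name, adding another capturing group in front of them later would not change what ${href} and ${title} refer to. Note that this sketch does not HTML-encode the captured values.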
Use capturing groups to capture the http link and the content of [link] tag.
Regex:
\[link="([^"]*)"\]([^\[\]]*)\[\/link]
Replacement string:
$2
\[link(="[^"]+")\]([^\[]+)\[\/link\]
Try this. Replace with <a href$1 target="_blank">$2</a>. See demo.
http://regex101.com/r/kP8uF5/18

Regex For Anchor Tag Pegging CPU

I have broken my regex in half and found that this half is causing the CPU to peg under load. What could I do to improve its performance? I can't really find any spots to shave off.
private string pattern =
@"(<a\s+href\s*=\s*(\""|')http(s)?://(www.)?([a-z0-9]+\-?[a-z0-9]+\.)?wordpress.org(\""|'))";
Regex wordPressPattern = new Regex(regexPattern, RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);
The only thing that jumps out at me as a performance sink is this part:
[a-z0-9]+\-?[a-z0-9]+
The idea is to match hyphenated words like luxury-yacht or THX-1138, while still allowing words without hyphens. Trouble is, if there's no hyphen, the regex engine still has to choose how to distribute the characters between the first [a-z0-9]+ and the second one. If it tries to match word as w-o-r-(no hyphen)-d, and something later in the regex fails to match, it has to come back and try w-o-(no hyphen)-r-d, and so on. These efforts are pointless, but the regex engine has no way to know that. You need to give it a little help, like so:
[a-z0-9]+(-[a-z0-9]+)?
Now you're saying, "If you run out of alphanumerics, and the next character is a hyphen, try to match some more alphanumerics. Otherwise, go on to the next part." But you don't need to be so specific in this case; you're trying to find the URLs, not validate them. I recommend you replace that part with:
[a-z0-9-]+
This also allows it to match words with more than one hyphen (e.g., james-bond, but also james-bond-007).
You also have a lot of unnecessary capturing groups. You don't seem to be using the captures, so you might as well use the ExplicitCapture option to improve performance a little more. But most of the groups don't seem to be needed even for pure grouping purposes. I suggest you try this regex:
@"<a\s+href\s*=\s*[""']https?://([a-z0-9-]+\.)+wordpress\.org[""']"
...with these options:
RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture
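Put together, the suggested pattern and options might be used like this. This is only a sketch: the class name WordPressLinkDemo, the method HasWordPressLink and the sample HTML fragments are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordPressLinkDemo
{
    // Built once and reused, so the cost of RegexOptions.Compiled is paid a single time.
    private static readonly Regex WordPressLink = new Regex(
        @"<a\s+href\s*=\s*[""']https?://([a-z0-9-]+\.)+wordpress\.org[""']",
        RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    public static bool HasWordPressLink(string html)
    {
        return WordPressLink.IsMatch(html);
    }

    public static void Main()
    {
        Console.WriteLine(HasWordPressLink("<a href=\"http://www.wordpress.org\">WP</a>"));
        Console.WriteLine(HasWordPressLink("<a href='https://en.blog.wordpress.org'>Blog</a>"));
        Console.WriteLine(HasWordPressLink("<a href=\"http://example.com\">Other</a>"));
    }
}
```

One thing to be aware of: as written, ([a-z0-9-]+\.)+ requires at least one label before wordpress.org, so a bare http://wordpress.org link is not matched. Changing the trailing + to * would make the subdomain part optional, as it was in the original pattern.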

Regex extremely slow on large documents

When running the following code the CPU load goes way up and it takes a long time on larger documents:
string pattern = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";
Regex regex = new Regex(
pattern,
RegexOptions.None | RegexOptions.Multiline | RegexOptions.IgnoreCase);
MatchCollection matches = regex.Matches(input); // Here is where it takes time
MessageBox.Show(matches.Count.ToString());
foreach (Match match in matches)
{
...
}
Any idea how to speed it up?
Changing RegexOptions.None | RegexOptions.Multiline | RegexOptions.IgnoreCase to RegexOptions.Compiled yields the same results (since your pattern does not include any literal letters or ^/$).
On my machine this reduces the time taken on the sample document you linked from 46 seconds to 21 seconds (which still seems slow to me, but might be good enough for you).
EDIT: So I looked into this some more and have discovered the real issue.
The problem is with the first half of your regex: \w+([-+.]\w+)*@. This works fine when matching sections of the input that actually contain the @ symbol, but for sections that match just \w+([-+.]\w+)* and are not followed by @, the regex engine wastes a lot of time backtracking and retrying from each position in the sequence (and failing again because there is still no @!).
You can fix this by forcing the match to start at a word boundary using \b:
string pattern = @"\b\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";
On your sample document, this produces the same 10 results in under 1 second.
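For reference, a small sketch of the anchored pattern in use. The class name EmailScanDemo, the method FindEmails and the sample text are made up; the timings quoted above are from the answerer's machine and are not reproduced here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EmailScanDemo
{
    // \b makes every attempt start at a word boundary, so the engine does not
    // retry from each character inside a long run that has no @ after it.
    private static readonly Regex Email = new Regex(
        @"\b\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*",
        RegexOptions.Compiled);

    public static List<string> FindEmails(string input)
    {
        var found = new List<string>();
        foreach (Match m in Email.Matches(input))
        {
            found.Add(m.Value);
        }
        return found;
    }

    public static void Main()
    {
        var text = "Contact jane.doe@example.com or bob+tag@mail.example.org today.";
        foreach (var address in FindEmails(text))
        {
            Console.WriteLine(address);
        }
    }
}
```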
Try running the regex over a stream. The Mono project's regex implementation and this article may be useful for .NET:
Building a Regular Expression Stream with the .NET Framework
Also try to improve the performance of the regex itself.
To answer how to change it, you need to tell us, what it should match.
The problem is probably in the last part, @\w+([-.]\w+)*\.\w+([-.]\w+)*. On a string like "bla@a.b.c.d.e-f.g.h" it will have to try many possibilities until it finds a match.
This could be a mild case of catastrophic backtracking.
So you need to define your pattern in a better, more "unique" way. Do you really need "dash/dot - dot - dash/dot"?

How to prevent regex from stopping at the first match of alternatives?

If I have the string hello world, how can I modify the regex world|wo|w so that it will match all of "world", "wo" and "w" rather than just the single first match of "world" that it comes to?
If this is not possible directly, is there a good workaround? I'm using C# if it makes a difference:
Regex testRegex = new Regex("world|wo|w");
MatchCollection theMatches = testRegex.Matches("hello world");
foreach (Match thisMatch in theMatches)
{
...
}
I think you're going to need to use three separate regexes and match on each of them. When you specify alternatives, the engine considers each one a successful match and stops looking after matching one of them. The only way I can see to do it is to repeat the search with each of your alternatives in a separate regex. You can create an array or list of Match items and have each search add to the list if you want to be able to iterate through them later.
If you're trying to match (the beginning of) the word world three times, you'll need to use three separate Regex objects; a single Regex cannot match the same character twice.
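A minimal sketch of the separate-regex approach described in these two answers. The class name AlternativesDemo and the method MatchEach are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AlternativesDemo
{
    // Runs each alternative as its own regex and collects every match in one list.
    public static List<string> MatchEach(string input, params string[] patterns)
    {
        var results = new List<string>();
        foreach (var pattern in patterns)
        {
            foreach (Match m in Regex.Matches(input, pattern))
            {
                results.Add(m.Value);
            }
        }
        return results;
    }

    public static void Main()
    {
        foreach (var hit in MatchEach("hello world", "world", "wo", "w"))
        {
            Console.WriteLine(hit);
        }
    }
}
```

For "hello world" this collects "world", "wo" and "w", one per pattern, because each regex scans the input independently of the others.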
As SLaks wrote, a regex can't match the same text more than once.
You could "fake it" like this:
\b(w)((?<=w)o)?((?<=wo)rld)?
will match the w, the o only if preceded by w*, and rld only if preceded by wo.
Of course, only parts of the word will actually be matched, but you'll see whether only the first one, the first two or all the parts did match by looking at the captured groups.
So in the word want, the w will match (the rest is optional, so the regex reports overall success).
In work, the wo will match; \1 will contain w, and \2 will contain o. The rld will fail, but since it's optional, the regex still reports success.
I have added a word boundary anchor \b to the start of the regex to avoid matches in the middle of words like reword; if you don't want to exclude those matches, drop the \b.
* The (?<=w) is not actually needed here, but I kept it in for consistency.
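In C#, reading the groups back could look roughly like this. The class name PrefixMatchDemo and the method MatchedAlternatives are invented for the sketch, and it only inspects the first match in the input.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PrefixMatchDemo
{
    private static readonly Regex Prefixes =
        new Regex(@"\b(w)((?<=w)o)?((?<=wo)rld)?");

    // Rebuilds "w", "wo" and "world" from whichever groups took part in the match.
    public static List<string> MatchedAlternatives(string input)
    {
        var result = new List<string>();
        Match m = Prefixes.Match(input);
        if (!m.Success)
        {
            return result;
        }

        string prefix = m.Groups[1].Value;   // always "w" when the match succeeds
        result.Add(prefix);
        if (m.Groups[2].Success)             // the optional "o"
        {
            prefix += m.Groups[2].Value;
            result.Add(prefix);
        }
        if (m.Groups[3].Success)             // the optional "rld"
        {
            prefix += m.Groups[3].Value;
            result.Add(prefix);
        }
        return result;
    }

    public static void Main()
    {
        foreach (var word in new[] { "hello world", "work", "want", "reword" })
        {
            Console.WriteLine(word + " -> " + string.Join(", ", MatchedAlternatives(word)));
        }
    }
}
```

With these inputs, "hello world" yields all three alternatives, "work" yields w and wo, "want" yields only w, and "reword" yields nothing because of the \b.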

Regex search and replace where the replacement is a mod of the search term

I'm having a hard time finding a solution to this and am pretty sure that regex supports it. I just can't recall the name of the concept in the world of regex.
I need to search a string for a specific pattern and replace it, but the matched text can differ and the replacement needs to "remember" what it's replacing.
For example, say I have an arbitrary string: 134kshflskj9809hkj
and I want to surround the numbers with parentheses,
so the result would be: (134)kshflskj(9809)hkj
Finding numbers is simple enough, but how to surround them?
Can anyone provide a sample or point me in the right direction?
In various languages:
// C#:
string result = Regex.Replace(input, @"(\d+)", "($1)");
// JavaScript:
thestring.replace(/(\d+)/g, '($1)');
// Perl:
s/(\d+)/($1)/g;
// PHP:
$result = preg_replace("/(\d+)/", '($1)', $input);
The parentheses around (\d+) make it a "group", specifically the first (and, in this case, only) group, which can be backreferenced in the replacement string. The g flag is required in some implementations to make it match multiple times in a single string. The replacement string is fairly similar across languages, although some will use \1 instead of $1 and some will allow both.
Most regex replacement functions allow you to reference capture groups specified in the regex (a.k.a. backreferences), when defining your replacement string. For instance, using preg_replace() from PHP:
$var = "134kshflskj9809hkj";
$result = preg_replace('/(\d+)/', '(\1)', $var);
// $result now equals "(134)kshflskj(9809)hkj"
where \1 means "the first capture group in the regex".
Another somewhat generic solution is this:
search : /([\d]+)([^\d]*)/g
replace: ($1)$2
([\d]+): match a set of one or more digits and retain them in a group
([^\d]*): match a set of non-digits, and retain them as well. \D could work here, too.
g: indicate this is a global expression, to work multiple times on the input.
($1): in the replace block, parens have no special meaning, so output the first group, surrounding it with parens.
$2: output the second group
I used a pretty good online regex tool to test out my expression. The next step would be to apply it to the language that you are using, as each has its own implementation nuances.
Backreferences (grouping) are not necessary if you're just looking to search for numbers and replace with the found regex surrounded by parens. It is simpler to use the whole regex match in the replacement string.
E.g., for Perl:
$text =~ s/\d+/($&)/g;
This searches for 1 or more digits and replaces with parens surrounding the match (specified by $&), with trailing g to find and replace all occurrences.
See http://www.regular-expressions.info/refreplace.html for the correct syntax for your regex language.
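The same whole-match idea carries over to C#, where $& in the replacement string stands for the entire match. A short sketch with invented names (WrapDigitsDemo, WrapWithWholeMatch, WrapWithGroup) that puts it side by side with the capture group version:

```csharp
using System;
using System.Text.RegularExpressions;

public static class WrapDigitsDemo
{
    // No capture group needed: $& is the whole match.
    public static string WrapWithWholeMatch(string input)
    {
        return Regex.Replace(input, @"\d+", "($&)");
    }

    // Equivalent version using an explicit group and $1.
    public static string WrapWithGroup(string input)
    {
        return Regex.Replace(input, @"(\d+)", "($1)");
    }

    public static void Main()
    {
        Console.WriteLine(WrapWithWholeMatch("134kshflskj9809hkj"));
        Console.WriteLine(WrapWithGroup("134kshflskj9809hkj"));
    }
}
```

Both calls should print (134)kshflskj(9809)hkj. Regex.Replace replaces every match by default, so there is no equivalent of the g flag to set.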
Depending on your language, you're looking to match groups.
So typically you'll make a pattern in the form of
([0-9]{1,})|([a-zA-Z]{1,})
Then you'll iterate over the resulting groups in whatever way your language provides.
