I am trying to write a regular expression to rewrite URLs to point to a proxy server.
bodystring = Regex.Replace(bodystring, "(src='/+)", "$1" + proxyStr);
The idea of this expression is pretty simple, basically find instances of "src='/" or "src='//" and insert a PROXY url at that point. This works in general but occasionally I have found cases where a literal "$1" will end up in the result string.
This makes no sense to me because if there was no match, then why would it replace anything at all?
Unfortunately I can't give a simple example of this, as it has only happened with very large strings so far, but I'd like to know conceptually what could make this sort of thing happen.
As an aside, I tried rewriting this expression using a positive lookbehind as follows:
bodystring = Regex.Replace(bodystring, "(?<=src='/+)", proxyStr);
But this ends up with proxyStr TWICE in the output if the input string contains "src='//". This also doesn't make much sense to me because I thought that "src=" would have to be present in the input twice in order to get proxyStr to end up twice in the output.
When proxyStr = "10.15.15.15:8008/proxy?url=http://", the replacement string becomes "$110.15.15.15:8008/proxy?url=http://". It contains a reference to group number 110, which certainly does not exist.
You need to make sure that your proxy string does not start with a digit. In your case you can do it by not capturing the last slash, and changing the replacement string to "$1/"+proxyStr, like this:
bodystring = Regex.Replace(bodystring, "(src='/*)/", "$1/" + proxyStr);
Edit:
Rawling pointed out that .NET's regexp library addresses this issue: you can enclose 1 in curly braces to avoid false aliasing, like this:
bodystring = Regex.Replace(bodystring, "(src='/+)", "${1}" + proxyStr);
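To see the difference between the two replacement strings side by side, here is a minimal sketch. The sample HTML and the class and method names are made up for illustration; the proxy value is the one quoted above.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupReferenceDemo
{
    const string Pattern = "(src='/+)";

    // "$1" + proxy: the leading digits of proxy run into the group number,
    // so the replacement pattern asks for a group that does not exist.
    public static string Ambiguous(string html, string proxy)
    {
        return Regex.Replace(html, Pattern, "$1" + proxy);
    }

    // "${1}" + proxy: the braces end the group number explicitly.
    public static string Braced(string html, string proxy)
    {
        return Regex.Replace(html, Pattern, "${1}" + proxy);
    }

    static void Main()
    {
        string html = "<img src='/logo.png'>";                  // hypothetical input
        string proxy = "10.15.15.15:8008/proxy?url=http://";

        Console.WriteLine(Ambiguous(html, proxy));
        Console.WriteLine(Braced(html, proxy));
    }
}
```

The first call should lose the src='/ prefix and leave a literal $1 in the output, while the second keeps the prefix and appends the proxy after it.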
What you are doing can't work as written, because .NET parses the concatenated replacement string as a whole. Your problem is that your proxy string starts with a digit: proxyStr = "10.15.15.15:8008/proxy?url=http://"
When you combine this with your $1, the regex engine has to look for a group reference $110, which doesn't exist.
You can remedy this by matching something else, or by matching and constructing the replacement string manually etc. Use what suits you best.
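One way to construct the replacement manually is the Regex.Replace overload that takes a MatchEvaluator. This is only a sketch: the class and method names are invented, and the sample input is hypothetical. The string returned by the delegate is inserted as-is, so nothing in proxyStr can be mistaken for a group reference.

```csharp
using System;
using System.Text.RegularExpressions;

static class EvaluatorDemo
{
    // Appends proxy after every "src='/" or "src='//" without using a replacement pattern.
    public static string InsertProxy(string html, string proxy)
    {
        // The delegate's return value is used verbatim: "$" and digits in proxy
        // are never parsed as substitutions.
        return Regex.Replace(html, "src='/+", m => m.Value + proxy);
    }

    static void Main()
    {
        string proxy = "10.15.15.15:8008/proxy?url=http://";
        Console.WriteLine(InsertProxy("<img src='//cdn/a.png'>", proxy));
    }
}
```

This also covers the case where the proxy string itself happens to contain a $ character.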
Based on dasblinkenlight's answer (already +1) the solution is this:
bodystring = Regex.Replace(bodystring, "(src='/+)", "${1}" + proxyStr);
This ensures that group 1 is referenced, instead of the following digits being read as part of a longer group number.
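In the second version, proxyStr appears twice because the lookbehind is zero-width and succeeds at every position that is preceded by src=' and one or more slashes. With "src='//" there are two such positions, one after each slash, so there are two matches and two insertions. Adding a negative lookahead leaves only the position after the last slash. Try
string s2 = Regex.Replace(s, "(?<=src='/+)(?!/)", proxyStr);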
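The double insertion with the lookbehind can be reproduced with a short sketch. The class name, the method names and the "P:" stand-in for proxyStr are all invented here. The guarded variant adds a negative lookahead so that only the position after the last slash matches.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookbehindDemo
{
    // Matches the empty string at every position preceded by src=' and one or more slashes.
    public static string Unguarded(string html, string proxy)
    {
        return Regex.Replace(html, "(?<=src='/+)", proxy);
    }

    // Same lookbehind, but the position must not be followed by another slash.
    public static string Guarded(string html, string proxy)
    {
        return Regex.Replace(html, "(?<=src='/+)(?!/)", proxy);
    }

    static void Main()
    {
        string html = "src='//x'";   // hypothetical input with two slashes
        Console.WriteLine(Unguarded(html, "P:"));
        Console.WriteLine(Guarded(html, "P:"));
    }
}
```

With two slashes in the input, the unguarded pattern has two zero-width matches (one after each slash), which is where the second copy comes from.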
Been scratching my head all day about this one!
Ok, so I have a string which contains the following:
?\"width=\"1\"height=\"1\"border=\"0\"style=\"display:none;\">');
I want to convert that string to the following:
?\"width=1height=1border=0style=\"display:none;\">');
I could theoretically just do a String.Replace on "\"1\"" etc. But this isn't really a viable option as the string could theoretically have any number within the expression.
I also thought about removing the string "\"", however there are other occurrences of this which I don't want to be replaced.
I have been attempting to use the Regex.Replace method as I believe this exists to solve problems along my lines. Here's what I've got:
chunkContents = Regex.Replace(chunkContents, "\".\"", ".");
Now that really messes things up (it replaces the correct elements, but with a full stop), but I think you can see what I am attempting to do with it. I am also worried that this will only work for single digits (\"1\" rather than \"11\"). That led me to think about using "*" or "+" rather than a bare ".", but I foresaw the problem of this picking up all of the text in between the quote characters (which are dotted all over the place), whereas I only want to replace the ones with numeric characters between them.
Hope I've explained that clearly enough, will be happy to provide any extra info if needed :)
Try this
var str = "?\"width=\"1\"height=\"1234\"border=\"0\"style=\"display:none;\">');";
str = Regex.Replace(str, "\"(\\d+)\"", "$1");
(\\d+) is a capturing group that looks for one or more digits and $1 references what the group captured.
This works
String input = @"?\""width=\""1\""height=\""1\""border=\""0\""style=\""display:none;\"">');";
//replace the entire match of the regex with only what's captured (the number)
String result = Regex.Replace(input, @"\\""(\d+)\\""", match => match.Result("$1"));
//control string for expected result
String shouldBe = @"?\""width=1height=1border=0style=\""display:none;\"">');";
//prints true
Console.WriteLine(result.Equals(shouldBe).ToString());
I have the following string (from a large HTML string):
href="/cgi-bin/pin.cgi?pin=94841&sid=9548.1386389012.v1"><
And here is my code:
var sids = Regex.Matches( htmlCode, "sid=(.)\">" );
I'm not pulling back any results. Is my Regex correct?
This is what it should be:
var str = @"href=""/cgi-bin/pin.cgi?pin=94841&sid=9548.1386389012.v1"">";
var sid = Regex.Match(str, @"sid=([^""]*)");
Console.WriteLine(sid.Groups[1].Value);
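What you originally posted doesn't match because "." is a wildcard for exactly one character, so (.) can only capture a single character and the closing quote and > would have to follow it immediately. Putting a quantifier on the wildcard is not ideal either: a greedy .* runs to the end of the line and then has to backtrack, so prefer a negated character class such as [^"]* when you know which character ends the value.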
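As a quick check of the above, the sketch below runs the original pattern and the negated-class pattern against the sample string from the question. The class and method names are invented; returning null when nothing matches is just one possible convention.

```csharp
using System;
using System.Text.RegularExpressions;

static class SidDemo
{
    // Captures everything after "sid=" up to (not including) the next double quote.
    public static string ExtractSid(string html)
    {
        Match m = Regex.Match(html, @"sid=([^""]*)""");
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        string html = @"href=""/cgi-bin/pin.cgi?pin=94841&sid=9548.1386389012.v1""><";

        // Original pattern: exactly one character between "sid=" and the closing quote.
        Console.WriteLine(Regex.IsMatch(html, "sid=(.)\">"));

        Console.WriteLine(ExtractSid(html));
    }
}
```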
. matches only a single character. To match multiple characters you should use the * or + quantifier: (.+); or, preferably, the non-greedy version: (.+?)
Use a @"verbatim string literal" if possible for regular expressions.
var sids = Regex.Matches(htmlCode, @"sid=(.+?)""");
I think you are pretty close. Consider the following minor change to your regex...
sid=.*?\">
Good Luck!
I'm trying to extract information out of rc-files. In these files, "-chars in strings are escaped by doubling them (""), analogous to C# verbatim strings. Is there a way to extract the string?
For example, if I have the following string "this is a ""test""" I would like to obtain this is a ""test"". It also must be non-greedy (very important).
I've tried to use the following regular expression;
"(?<text>[^""]*(""(.|""|[^"])*)*)"
However the performance was awful.
I've based it on the explanation here: http://ad.hominem.org/log/2005/05/quoted_strings.php
Has anybody any idea to cope with this using a regular expression?
You've got some nested repetition quantifiers there. That can be catastrophic for the performance.
Try something like this:
(?<=")(?:[^"]|"")*(?=")
That can now only consume either two quotes at once or non-quote characters. The lookbehind and lookahead assert that the actual match is preceded and followed by a quote.
This also gets you around having to capture anything. Your desired result will simply be the full string you want (without the outer quotes).
I do not assert that the outer quotes are not doubled. Because if they were, there would be no way to distinguish them from an empty string anyway.
This turns out to be a lot simpler than you'd expect. A string literal with escaped quotes looks exactly like a bunch of simple string literals run together:
"Some ""escaped"" quotes"
"Some " + "escaped" + " quotes"
So this is all you need to match it:
(?:"[^"]*")+
You'll have to strip off the leading and trailing quotes in a separate step, but that's not a big deal. You would need a separate step anyway, to unescape the escaped quotes (\" or "").
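The match-then-strip approach could look like the sketch below. The class name, method name and the sample resource line are invented for illustration. The unescaping step is optional and can be dropped if the doubled quotes should be kept.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RcStringDemo
{
    // Finds each quoted string (with "" as an escaped quote) and returns its contents.
    public static List<string> ExtractStrings(string line)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(line, @"(?:""[^""]*"")+"))
        {
            // Separate step 1: strip the outer quotes.
            string inner = m.Value.Substring(1, m.Value.Length - 2);
            // Separate step 2 (optional): turn "" back into ".
            results.Add(inner.Replace("\"\"", "\""));
        }
        return results;
    }

    static void Main()
    {
        string line = "CAPTION \"this is a \"\"test\"\"\", \"second\"";   // hypothetical rc line
        foreach (string s in ExtractStrings(line))
        {
            Console.WriteLine(s);
        }
    }
}
```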
Don't know if this is better or worse than m.buettner's (guessing not - he seems to know his stuff) but I thought I'd throw it out there for critique.
"(([^"]+(""[^"]+"")*)*)"
Try this: (?<=^")(.*?"{2}.*?"{2})(?="$)
It may be faster than the two previous ones,
and without any bugs.
Match a " beginning the string
Multiple times match a non-" or two "
Match a " ending the string
"([^"]|(""))*?"
We are having problem with the following regular expression:
(.*?)\|\*\|([0-9]+)\*\|\*(.*?)
It should match things like: |*25 *|
We are using .Net Framework 4 RegEx Class the code is the following:
string expression = "(.*?)" +
Regex.Escape(Constants.FIELD_START_DELIMITER_BACK_END) +
"([0-9]+)" +
Regex.Escape(Constants.FIELD_END_DELIMITER_BACK_END) +
"(.*?)";
Regex r = new Regex(expression);
r.Matches(contentText)
It takes too long (around 60 seconds) with a text of 40.000 characters.
But with a text of 180.000 characters the speed is very acceptable (3 seconds or less).
The only difference between the texts is that the first one (the slow one) is all contained in a single line, with no line breaks. Can this be the issue? Is that what is affecting the performance?
Thanks
@David Gorsline's solution (from the comment) is correct:
string expression =
Regex.Escape(Constants.FIELD_START_DELIMITER_BACK_END) +
"([0-9]+)" +
Regex.Escape(Constants.FIELD_END_DELIMITER_BACK_END);
Specifically, it's the (.*?) at the beginning that's doing you in. What that does is take over doing what the regex engine should be doing itself--scan for the next place where the regex can match--and doing it much, much less efficiently. At each position, the (.*?) effectively performs a lookahead to determine whether the next part of the regex can match, and only if that fails does it go ahead and consume the next character.
But even if you used something more efficient, like [^|]*, you would still be slowing it down. Leave that part off, though, and the regex engine can instead scan for the first constant portion of the regex, probably using an algorithm like Boyer-Moore or Knuth-Morris-Pratt. So don't worry about what's around the bits you want to match; just tell the regex engine what you're looking for and get out of its way.
On the other hand, the trailing (.*?) has virtually no effect, because it never really does anything. The ? turns the .* reluctant, so what does it take to make it go ahead and consume the next character? It will only do so if there's something following it in the regex that forces it to. For example, foo.*?bar consumes everything from the next "foo" to the next "bar" after that, but foo.*? stops as soon as it's consumed "foo". It never makes sense to have a reluctant quantifier as the last thing in a regex.
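The behaviour of a trailing reluctant quantifier is easy to check in isolation. The following is only a sketch with an invented helper and sample input.

```csharp
using System;
using System.Text.RegularExpressions;

static class ReluctantDemo
{
    // Returns the text of the first match, or an empty string if there is none.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    static void Main()
    {
        string input = "xfoo123bar";   // hypothetical input

        // Nothing follows .*?, so it never has a reason to consume a character.
        Console.WriteLine(FirstMatch(input, "foo.*?"));

        // "bar" forces .*? to keep consuming until it can match.
        Console.WriteLine(FirstMatch(input, "foo.*?bar"));
    }
}
```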
You've answered your question: the problem is that . fails to match new-lines (it doesn't by default), which results in many failed attempts - almost one for every position on your 40000 character string.
On the long but single lined file, the engine can match the pattern in a single pass over the file (assuming a successful match exists - if it doesn't, I suspect it will take a long time to fail...).
On the shorter file, with many lines, the engine tries to match from the first character. It matches .*? until the end of the first line (this is a lazy match, so a lot more is happening, but let's ignore that), and fails. Now, it starts again from the second character, not the second line! This results in n² complexity even before matching the number.
A simple solution is to make . match newlines:
Regex r = new Regex(expression, RegexOptions.Singleline);
You can also make sure to match from start to end using the absolute start and end anchors, \A and \z:
string expression = "\\A(.*?)" +
Regex.Escape(Constants.FIELD_START_DELIMITER_BACK_END) +
"([0-9]+)" +
Regex.Escape(Constants.FIELD_END_DELIMITER_BACK_END) +
"(.*?)\\z";
Another note:
As David suggests in the comments, \|\*\|([0-9]+)\*\|\* should work well enough. Even if you need to "capture" all text before and after the match, you can easily get it using the position of the match.
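Recovering the surrounding text from the match positions might look like the sketch below. The delimiter values are an assumption inferred from the escaped pattern in the question (the real Constants values are not shown), and the class, method and sample text are invented.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class SurroundingTextDemo
{
    // Assumed delimiter values, inferred from the pattern \|\*\|([0-9]+)\*\|\*.
    const string StartDelimiter = "|*|";
    const string EndDelimiter = "*|*";

    // Returns (text before the field, field number) pairs;
    // tail receives whatever follows the last field.
    public static List<KeyValuePair<string, string>> SplitFields(string text, out string tail)
    {
        var regex = new Regex(Regex.Escape(StartDelimiter) + "([0-9]+)" + Regex.Escape(EndDelimiter));
        var fields = new List<KeyValuePair<string, string>>();
        int previousEnd = 0;

        foreach (Match m in regex.Matches(text))
        {
            // Everything between the previous match and this one.
            string before = text.Substring(previousEnd, m.Index - previousEnd);
            fields.Add(new KeyValuePair<string, string>(before, m.Groups[1].Value));
            previousEnd = m.Index + m.Length;
        }

        tail = text.Substring(previousEnd);
        return fields;
    }

    static void Main()
    {
        string tail;
        var fields = SplitFields("Dear |*|12*|*, total |*|7*|* due", out tail);
        foreach (var field in fields)
        {
            Console.WriteLine("[" + field.Key + "] field " + field.Value);
        }
        Console.WriteLine("[" + tail + "]");
    }
}
```

The regex only ever looks for the delimiters and the digits; the text in between is sliced out of the original string with Substring.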
I'm in the process of updating a program that fixes subtitles.
Till now I got away without using regular expressions, but the last problem that has come up might benefit by their use. (I've already solved it without regular expressions, but it's a very unoptimized method that slows my program significantly).
TL;DR;
I'm trying to make the following work:
I want all instances of:
"! ." , "!." and "! . " to become: "!"
unless the dot is followed by another dot, in which case I want all instances of:
"!.." , "! .." , "! . . " and "!. ." to become: "!..."
I've tried this code:
the_str = Regex.Replace(the_str, "\\! \\. [^.]", "\\! [^.]");
that comes close to the first part of what I want to do, but I can't make the [^.] character of the replacement string to be the same character as the one in the original string... Please help!
I'm interested in both C# and PHP implementations...
$str = preg_replace('/!(?:\s*\.){2,3}/', '!...', $str);
$str = preg_replace('/!\s*\.(?!\s*\.)/', '!', $str);
This does the work in two PCREs. You could probably do some magic to merge them into one, but it wouldn't be readable anymore. The first PCRE is for !..., the second one for !. They are quite straightforward.
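Since the question also asks for C#, here is a rough translation of the same two passes. It is a sketch that assumes the patterns carry over unchanged to .NET; the class and method names are invented.

```csharp
using System;
using System.Text.RegularExpressions;

static class TwoPassDemo
{
    public static string Fix(string s)
    {
        // Pass 1: "!" followed by two or three dots (optionally spaced) becomes "!...".
        s = Regex.Replace(s, @"!(?:\s*\.){2,3}", "!...");
        // Pass 2: "!" followed by a single dot that is not followed by another dot becomes "!".
        s = Regex.Replace(s, @"!\s*\.(?!\s*\.)", "!");
        return s;
    }

    static void Main()
    {
        Console.WriteLine(Fix("Stop! ."));
        Console.WriteLine(Fix("Stop!. ."));
    }
}
```

The order of the passes matters: the second pattern leaves the "!..." produced by the first one alone because of the negative lookahead.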
C#
s = Regex.Replace(s, @"!\s?\.\s?(\.?)\s?", "!$1$1$1");
PHP
$s = preg_replace('/!\s?\.\s?(\.?)\s?/', '!$1$1$1', $s);
The first dot is consumed but not captured; you're effectively throwing that one away. Group #1 captures the second dot if there is one, or an empty string if not. In either case, plugging it into the replacement string three times yields the desired result.
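To check the single-pass version against the inputs from the question, a small sketch (the wrapper class and method names are invented):

```csharp
using System;
using System.Text.RegularExpressions;

static class SinglePassDemo
{
    public static string FixPunctuation(string s)
    {
        // Group 1 is either "." or empty; repeating it three times
        // yields "..." or nothing after the "!".
        return Regex.Replace(s, @"!\s?\.\s?(\.?)\s?", "!$1$1$1");
    }

    static void Main()
    {
        string[] samples = { "Wow! .", "Wow!.", "Wow!..", "Wow!. ." };
        foreach (string sample in samples)
        {
            Console.WriteLine(FixPunctuation(sample));
        }
    }
}
```

Note that the optional trailing \s? also swallows one whitespace character after the dots, so text that follows ends up directly after the "!" or "!...".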
I used \s instead of literal spaces to make it more obvious what I was doing, and added the ? quantifier to make the spaces optional. If you really need to restrict it to actual space characters (not tabs, newlines, etc.) you can change them back to spaces. If you want to allow more than one space at a time, you can change ? to * where appropriate--e.g.:
@"!\s*\.\s*(\.?)\s*"
Also, notice the use of C#'s verbatim string literals--the antidote for backslashitis. ;)