Why the Garbled Output when Replacing /* Comments? - c#

I want to replace /* in selected text with //.
I used regex to do this. When I used any other strings it worked. But when I used:
String result = System.Text.RegularExpressions.Regex.Replace(seltext,"/*","//");
It shows:
/* int a,b; // sample input
///*i//n//t//a//,//b//; // sample output
Instead I want:
// int a,b;

* has a special meaning in regular expressions - it means "match 0 or more of the preceding character/group".
It sounds like you don't want a regex at all - you just want
string result = seltext.Replace("/*", "//");
If you really want to use regular expressions, you need to escape the * (and various other characters, if you use them):
string result = Regex.Replace(seltext, @"/\*", "//");
Note the use of a verbatim string literal (indicated by the @ at the start of the string) to avoid having to escape the \ as well for C# string literal reasons. Otherwise you'd need to use "/\\*", which isn't as clear. Verbatim string literals are very handy for regular expressions.
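To put the options above side by side, here is a minimal sketch. The class and method names are made up for illustration; Regex.Escape is the standard .NET helper that escapes metacharacters for you, which can be handy when the text to find is not a hard-coded literal.

```csharp
using System;
using System.Text.RegularExpressions;

static class CommentReplaceSketch
{
    // Plain string replacement: no regex metacharacters are involved.
    public static string WithStringReplace(string text) =>
        text.Replace("/*", "//");

    // Regex replacement with the * escaped by hand in a verbatim string.
    public static string WithEscapedRegex(string text) =>
        Regex.Replace(text, @"/\*", "//");

    // Regex replacement where Regex.Escape produces the escaped pattern.
    public static string WithRegexEscape(string text) =>
        Regex.Replace(text, Regex.Escape("/*"), "//");

    static void Main()
    {
        const string sample = "/* int a,b;";
        Console.WriteLine(WithStringReplace(sample));
        Console.WriteLine(WithEscapedRegex(sample));
        Console.WriteLine(WithRegexEscape(sample));
    }
}
```

All three calls should turn the sample into // int a,b; and the first one is the simplest when the search text is fixed.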
I would suggest caution when trying to use simple text operations (including regular expressions) on source code though. For example, imagine applying the replacement to the first of the code snippets above...

Your Question is WHY.... wrong output?
Let's start with the WHY, then we'll look at the fix.
The core of the problem is that /* is able to match the Empty String. Therefore, at Each Position, you insert //
You Need to Escape the Quantifier *
In regex, * means "match what precedes zero or more times".
Therefore, /* does not match /*, but rather, the empty string (zero slashes) or a series of slashes: ////
To match a literal *, escape it with a backslash: \*. Therefore your regex becomes /\*
/* Matches at Every Single Position in the String
Because /* can match the empty string, it matches at every single position.
Therefore, at each position, you insert //, hence your result
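If you want to see the empty matches for yourself, a small sketch along these lines should do it. The class and helper names are hypothetical; it lists where the unescaped pattern matches and how long each match is, then compares the number of zero-length matches with the escaped pattern.

```csharp
using System;
using System.Text.RegularExpressions;

static class EmptyMatchSketch
{
    // Counts the zero-length matches the pattern produces in the text.
    public static int CountEmptyMatches(string text, string pattern)
    {
        int empty = 0;
        foreach (Match m in Regex.Matches(text, pattern))
        {
            if (m.Length == 0) empty++;
        }
        return empty;
    }

    static void Main()
    {
        const string sample = "/* int a,b;";

        // Unescaped pattern: "zero or more slashes".
        foreach (Match m in Regex.Matches(sample, "/*"))
        {
            Console.WriteLine($"index {m.Index}, length {m.Length}");
        }

        Console.WriteLine(CountEmptyMatches(sample, "/*"));    // unescaped
        Console.WriteLine(CountEmptyMatches(sample, @"/\*"));  // escaped
    }
}
```

Every zero-length match is a spot where Regex.Replace inserts the replacement text, which is where the extra // come from.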
In C# Code: Replace not only /* but also /******
There is no need to use regex for a fixed literal /*, so to make it more interesting, we will replace not only /* but also /*****. To do so, we add a + quantifier after the \*. One line is enough:
string resultString = Regex.Replace(s1, @"/\*+", "//");
See this demo to observe how we match at each position.
See this demo to see how to do the replacement.

Related

C# regex does not allow special characters correctly?

For example I have the following string:
thats a\n\ntest\nwith multiline \n\nthings...
I tried to use the following code which does not work correctly and still hasn't all chars included:
string text = "thats a\n\ntest\nwith multiline \n\nthings and so on";
var res = Regex.IsMatch(text, @"^([a-zA-Z0-9äöüÄÖÜß\-|()[\]/%'<>_?!=,*. ':;#+\\])+$");
Console.WriteLine(res);
I want the regex returning true when only the following chars are included (do not have to contain all of them but at least one of the following and no others):
a-z, A-Z, 0-9, äüöÄÖÜß and !#'-.:,; ^"§$%&/()=?\}][{³²°*+~'_<>|.
This is a list of keyboard characters that I thought would be nice to allow inside a message.
If you specified all the chars you want to allow, the regex declaration in C# will look like
@"^[a-zA-Z0-9äüöÄÖÜß!#'\-.:,; ^""§$%&/()=?\\}\][{³²°*+~'_<>|]+$"
However, the test string you supplied contains line feed (LF, \n, \x0A) chars, so you need to either test on a string with no newlines, or add \n to the character class:
@"^[a-zA-Z0-9äüöÄÖÜß!#'\-.:,; ^""§$%&/()=?\\}\][{³²°*+~'_<>|\n]+$"
Note that the " char is doubled since this is the only way to put a double quote into a verbatim string literal.
Also, the capturing parentheses in your pattern create redundant overhead, so you should remove them.
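As a rough illustration of the newline point, here is a sketch with a deliberately reduced whitelist. The character set, class name and method names are assumptions made for the example, not the full list from the question.

```csharp
using System;
using System.Text.RegularExpressions;

static class AllowedCharsSketch
{
    // Hypothetical reduced whitelist: letters, digits, umlauts, space, some punctuation.
    const string AllowedChars = @"a-zA-Z0-9äöüÄÖÜß .,:;!?\-";

    // Same whitelist, once without and once with \n in the character class.
    static readonly Regex WithoutNewline = new Regex("^[" + AllowedChars + "]+$");
    static readonly Regex WithNewline = new Regex("^[" + AllowedChars + @"\n]+$");

    public static bool IsValidSingleLine(string text) => WithoutNewline.IsMatch(text);

    public static bool IsValidMultiLine(string text) => WithNewline.IsMatch(text);

    static void Main()
    {
        const string text = "thats a\n\ntest\nwith multiline \n\nthings and so on";
        Console.WriteLine(IsValidSingleLine(text)); // newlines are not in the class
        Console.WriteLine(IsValidMultiLine(text));  // newlines are allowed here
    }
}
```

One thing to keep in mind: in .NET, $ also matches just before a single trailing \n, so if a trailing newline must be rejected you may prefer \z as the end anchor.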

Whitespace in a current regex

I'm currently using the following regex: ^[^&<>\"'/]*$ and I would like to also add a validation where a user can't just enter a space in the beginning of my textbox. Any ideas pls ? Note: Spaces are allowed but not as the first element
A negative look-ahead is the way to go here: ^(?! )[^&<>\"'\/]*$
(?! ) means match only if the next character isn't a space. Since that is right after the ^ anchor, that essentially means match only if the first character isn't a space.
Here is another option: you may allow typing any space (hard or regular one) by adding \p{Zs} Unicode category class to your regex.
To make sure you still can match an empty string, you need to use a look-ahead anchored at the beginning of the string (that is, you must use it right after ^ start-of-string anchor):
^(?!\p{Zs})[^&<>"'/]*$
//^^^^^^^^^
I strongly suggest using verbatim string literals in C# to declare regexes as you won't have to think about how many backslashes you need to use:
var rx = new Regex(@"^(?!\p{Zs})[^&<>""'/]*$");
Note that you need to double the quotation marks in these literals to declare a single double quote.
C# demo:
var rx = new Regex(@"^(?!\p{Zs})[^&<>""'/]*$");
Console.WriteLine(rx.IsMatch("")); // true - empty string
Console.WriteLine(rx.IsMatch(" spaceAtStart")); // false - space at start
Console.WriteLine(rx.IsMatch("\u00A0spaceAtStart")); // false - hard (no-break) space at start
Console.WriteLine(rx.IsMatch("space not at start")); // true - space not at start

How to use regex to match anything from A to B, where B is not preceded by C

I'm having a hard time with this one. First off, here is the difficult part of the string I'm matching against:
"a \"b\" c"
What I want to extract from this is the following:
a \"b\" c
Of course, this is just a substring from a larger string, but everything else works as expected. The problem is making the regex ignore the quotes that are escaped with a backslash.
I've looked into various ways of doing it, but nothing has gotten me the correct results. My most recent attempt looks like this:
"((\"|[^"])+?)"
In various tests online, this works the way it should - but when I build my ASP.NET page, it cuts off at the first ", leaving me with just the a-letter, white space and a backslash.
The logic behind the pattern above is to capture all instances of \" or something that is not ". I was hoping this would search for \", making sure to find those first - but I got the feeling that this is overridden by the second part of the expression, which is only 1 single character. A single backslash does not match 2 characters (\"), but it will match as a non-". And from there, the next character will be a single ", and the matching is completed. (This is just my hypothesis on why my pattern is failing.)
Any pointers on this one? I have tried various combinations with "look"-methods in regex, but I didn't really get anywhere. I also get the feeling that is what I need.
ORIGINAL ANSWER
To match a string like a \"b\" c, you need to use following regex declaration:
(?:\\"|[^"])+
var rx = new Regex(@"(?:\\""|[^""])+");
See RegexStorm demo
Here is an IDEONE demo:
var str = "a \\\"b\\\" c";
Console.WriteLine(str);
var rx = new Regex(@"(?:\\""|[^""])+");
Console.WriteLine(rx.Match(str).Value);
Please note the @ in front of the string literal, which makes it a verbatim string literal: literal quotes are doubled, and a single backslash is enough where a regular string literal would need two. This makes regexes easier to read and maintain.
If you want to match any escaped entities in your input string, you can use:
var rx = new Regex(@"[^""\\]*(?:\\.[^""\\]*)*");
See demo on RegexStorm
UPDATE
To match the quoted strings, just add quotes around the pattern:
var rx = new Regex(@"""(?<res>[^""\\]*(?:\\.[^""\\]*)*)""");
This pattern yields much better performance than Tim Long's suggested regex, according to RegexHero test results.
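For completeness, a minimal sketch of how the quoted pattern with the res group might be used. The class name, method name and sample input are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuotedStringSketch
{
    // Quote, then runs of non-quote/non-backslash chars interleaved
    // with backslash escapes, then the closing quote.
    static readonly Regex Quoted =
        new Regex(@"""(?<res>[^""\\]*(?:\\.[^""\\]*)*)""");

    // Returns the content of the first quoted string, or null if there is none.
    public static string ExtractFirst(string input)
    {
        Match m = Quoted.Match(input);
        return m.Success ? m.Groups["res"].Value : null;
    }

    static void Main()
    {
        // The input text is: say "a \"b\" c" now
        const string input = "say \"a \\\"b\\\" c\" now";
        Console.WriteLine(ExtractFirst(input));
    }
}
```

Because each backslash always consumes the character after it, an escaped quote can never be mistaken for the closing quote.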
The following expression worked for me:
"(?<Result>(\\"|.)*)"
The expression matches as follows:
An opening quote (literal ")
A named capture (?<name>pattern) consisting of:
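Zero or more occurrences * of literal \" or (|) any single character (.)
A final closing quote (literal ")
Note that the * (zero or more) quantifier is greedy: it first consumes everything up to the end of the line, and the regex engine then backtracks until the literal " can match the final quote, so the final quote is matched by the literal " and not the "any single character" . part. As a consequence, if a line contains several quoted strings, the match runs from the first quote to the last one.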
I used ReSharper 9's built-in Regular Expression validator to develop the expression and verify the results:
I have used the "Explicit Capture" option to reduce cruft in the output (RegexOptions.ExplicitCapture).
One thing to note is that I am matching the whole string, but I am only capturing the substring, using a named capture. Using named captures is a really useful way to get at the results you want. In code, it might look something like this:
static string MatchQuotedString(string input)
{
    const string pattern = @"""(?<Result>(\\""|.)*)""";
    const RegexOptions options = RegexOptions.ExplicitCapture;
    Regex regex = new Regex(pattern, options);
    var matches = regex.Match(input);
    var substring = matches.Groups["Result"].Value;
    return substring;
}
Optimization: If you are planning on using the regex a lot, you could factor it out into a field and use the RegexOptions.Compiled option. This pre-compiles the expression and gives you faster throughput at the expense of longer initialization.
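A sketch of what that optimization might look like, assuming the same pattern as above. The class, field and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuotedStringMatcher
{
    // Built once and reused; Compiled trades slower start-up for faster matching.
    private static readonly Regex QuotedString = new Regex(
        @"""(?<Result>(\\""|.)*)""",
        RegexOptions.ExplicitCapture | RegexOptions.Compiled);

    // Returns the text captured by the named group "Result".
    public static string Extract(string input) =>
        QuotedString.Match(input).Groups["Result"].Value;

    static void Main()
    {
        // The input text is: "a \"b\" c"
        Console.WriteLine(Extract("\"a \\\"b\\\" c\""));
    }
}
```

Whether Compiled pays off depends on how often the expression runs, so it is worth measuring before committing to it.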

How can i make my regular expression work?

I am new to both .NET (C#) and regular expressions.
I need a regular expression to match against a url:
If url string contains "/id/Whatever_COMES_HERE_EVERY_CHAR_ACCEPTED/" : return true
If url string contains only "/id/" : return false
I have tried the following but it only returns true if url is http:// localhost/id/
This is my script:
string thisUrl = HttpContext.Current.Request.Url.AbsolutePath;
Match match = Regex.Match(thisUrl, @"/id/*$");
What am i doing wrong?
You have this:
/id/*$
This matches the literal string /id, followed by a / quantified by *, which means zero or more slashes, followed by $, which means the end of the string.
So you are asking for a repetition of the literal /, which is not what you want. (Your original regex would also match this: http:// localhost/id///////////////////)
What you need is something like this:
/id/.+$
This will match the literal /id/ followed by the . which in regex means any character which is quantified with the + which means 1 or more.
You could tighten it up and use \S instead of . which means non-whitespace characters (since a URL shouldn't have whitespace)
Also note: there are a variety of online regex tools which are really useful when trying to figure out and test a regex. A couple of examples:
http://rubular.com/
http://regex101.com/
http://www.regxlib.com/
And even extension for visual studio you can use:
https://visualstudiogallery.msdn.microsoft.com/bf883ae3-188b-43bc-bd29-6235c4195d1f
When you use the star, it signals that 0 or more of the preceding char shall be present. You will want to use
"/id/.+" to signal that at least one more char must come after the /
If you're just looking for a true/false solution, you should use the IsMatch() function. The other issue is that * (zero or more) and + (one or more) are quantifiers and must be preceded by a character, character class or group. Dot (.) is a character class that represents ANY character (except a newline, by default). So the correct solution for your problem would be:
Regex.IsMatch(thisUrl, @"/id/.+$");
Considering that the input is a URL, this regex can be improved upon by restricting character classes to valid URL characters only, but for your purpose the above should be sufficient.
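A quick sketch to check the behaviour against the two cases from the question. The class name, method name and sample URLs are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class IdUrlSketch
{
    // True only when at least one character follows "/id/".
    public static bool HasIdSegment(string url) =>
        Regex.IsMatch(url, @"/id/.+$");

    static void Main()
    {
        Console.WriteLine(HasIdSegment("http://localhost/id/"));            // nothing after /id/
        Console.WriteLine(HasIdSegment("http://localhost/id/Whatever_42/")); // something after /id/
    }
}
```

The first call should print False and the second True, which is the behaviour the question asks for.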

Regex battle between maximum and minimum munge

Greetings, I have file with the following strings:
string.Format("{0},{1}", "Having \"Two\" On The Same Line".Localize(), "Is Tricky For regex".Localize());
my goal is to get a match set with the two strings:
Having \"Two\" On The Same Line
Is Tricky For regex
My current regex looks like this:
private Regex CSharpShortRegex = new Regex("\"(?<constant>[^\"]+?)\".Localize\\(\\)");
My problem is with the escaped quotes in the first line: I end up stopping at the quote and I get:
On The Same Line
Is Tricky For This Style Too
however attempting to ignore the escaped quotes is not working out because it makes the Regex greedy and I get
Having \"Two\" On The Same Line".Localize(), "Is Tricky For regex"
We seem to be caught between maximum and minimum munge. Is there any hope? I have some backup plans. Can you Regex backwards? that would make it easier because I can start with the "()ezilacoL."
EDIT:
To clarify. This is my lone edge case. Most of the time the string sits alone like:
var myString = "Hot Patootie".Localize()
This one works for me:
\"((?:[^\\"]|(?:\\\"))*)\"\.Localize\(\)
Tested on http://www.regexplanet.com/simple/index.html against a number of strings with various escaped quotes.
Looks like most of us who answered this one had the same rough idea, so let me explain the approach (comments after #s):
\" # We're looking for a string delimited by quotation marks
( # Capture the contents of the quotation marks
(?: # Start a non-capturing group
[^\\"] # Either read a character that isn't a quote or a slash
|(?:\\\") # Or read in a slash followed by a quote.
)* # Keep reading
) # End the capturing group
\" # The string literal ends in a quotation mark
\.Localize\(\) # and ends with the literal '.Localize()', escaping ., ( and )
For C# you'll need to escape the slashes twice (messy):
\"((?:[^\\\\\"]|(?:\\\\\"))*)\"\\.Localize\\(\\)
Mark correctly points out that this one doesn't match escaped characters other than quotation marks. So here's a better version:
\"((?:[^\\"]|(?:\\")|(?:\\.))*)\"\.Localize\(\)
And its slashed-up equivalent:
\"((?:[^\\\\\"]|(?:\\\\\")|(?:\\\\.))*)\"\\.Localize\\(\\)
Works the same way, except it has a special case: if it encounters a slash but can't match \", it just consumes the slash and the following character and moves on.
Thinking about it, it's better to just consume two characters at every slash, which is effectively Mark's answer so I won't repeat it.
Here's the regular expression you need:
@"""(?<constant>(\\.|[^""])*)""\.Localize\(\)"
A test program:
using System;
using System.Text.RegularExpressions;
using System.IO;
class Program
{
    static void Main()
    {
        Regex CSharpShortRegex =
            new Regex(@"""(?<constant>(\\.|[^""])*)""\.Localize\(\)");
        foreach (string line in File.ReadAllLines("input.txt"))
            foreach (Match match in CSharpShortRegex.Matches(line))
                Console.WriteLine(match.Groups["constant"].Value);
    }
}
Output:
Having \"Two\" On The Same Line
Is Tricky For regex
Hot Patootie
Notice that I have used @"..." to avoid having to escape backslashes inside the regular expression. I think this makes it easier to read.
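If you would rather try the expression without an input.txt file, a variant along these lines applies it to the sample line held in memory. The class and method names are hypothetical; the pattern is the one shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LocalizeSketch
{
    // A backslash always consumes the next character, so \" never ends the string.
    static readonly Regex LocalizedString =
        new Regex(@"""(?<constant>(\\.|[^""])*)""\.Localize\(\)");

    // Collects the content of every string literal followed by .Localize().
    public static List<string> Extract(string line)
    {
        var results = new List<string>();
        foreach (Match m in LocalizedString.Matches(line))
        {
            results.Add(m.Groups["constant"].Value);
        }
        return results;
    }

    static void Main()
    {
        const string line =
            @"string.Format(""{0},{1}"", ""Having \""Two\"" On The Same Line"".Localize(), ""Is Tricky For regex"".Localize());";
        foreach (string constant in Extract(line))
        {
            Console.WriteLine(constant);
        }
    }
}
```

The format string "{0},{1}" is skipped because it is not followed by .Localize(), so only the two localized strings are returned.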
Update:
My original answer (below the horizontal rule) has a bug: regular-expression matchers attempt alternatives in left-to-right order. Having [^"] as the first alternative allows it to consume the backslash, but then the next character to be matched is a quote, which prevents the match from proceeding.
Incompatibility note: Given the pattern below, perl backtracks to the other alternative (the escaped quote) and successfully finds a match for the Having \"Two\" On The Same Line case.
The fix is to try an escaped quote first and then a non-quote:
var CSharpShortRegex =
new Regex("\"(?<constant>(\\\\\"|[^\"])*)\"\\.Localize\\(\\)");
or if you prefer the at-string form:
var CSharpShortRegex =
new Regex(@"""(?<constant>(\\""|[^""])*)""\.Localize\(\)");
Allow for escapes:
private Regex CSharpShortRegex =
new Regex("\"(?<constant>([^\"]|\\\\\")*)\"\\.Localize\\(\\)");
Applying one level of escaping to make the pattern easier to read, we get
"(?<constant>([^"]|\\")*)"\.Localize\(\)
That is, a string starts and ends with " characters, and everything between is either a non-quote or an escaped quote.
Looks like you're trying to parse code so one approach might be to evaluate the code on the fly:
// Needs the Microsoft.CSharp and System.CodeDom.Compiler namespaces.
var cr = new CSharpCodeProvider().CompileAssemblyFromSource(
new CompilerParameters { GenerateInMemory = true },
"class x { public static string e() { return " + input + ";}}");
var result = cr.CompiledAssembly.GetType("x")
.GetMethod("e").Invoke(null, null) as string;
This way you could handle all kinds of other special cases (e.g. concatenated or verbatim strings) that would be extremely difficult to handle with regex.
new Regex(@"((([^@]|^|\n)""(?<constant>((\\.)|[^""])*)"")|(@""(?<constant>(""""|[^""])*)""))\s*\.\s*Localize\s*\(\s*\)", RegexOptions.Compiled);
takes care of both simple and @"" strings. It also takes into account escape sequences.
