C# file gettext-string regex parser

SHORT QUESTION
Take a regex that reads a string inside double quotes. The string is valid only if it has NO double quotes inside.
("([^"]+)")
How would one write a regex with the same functionality that also accepts a string containing double quotes WITH a preceding backslash?
"Valid string" //VALID
"Valid \"string\"" //VALID
"Invalid " + "string" //INVALID
"Invalid " + "\"string\"" //INVALID
LONG QUESTION
I'm building my own gettext implementation - I found out that the official gettext apps ( http://www.gnu.org/s/gettext/ ) are not sufficient for my needs.
That means I need to find all strings inside each C# code file myself, but only those which are passed to a particular function as the only parameter.
I built a regex which gets most of the strings. The function Translate is public, static and is situated in the namespace GetTextLocalization and in the class Localization.
(GetTextLocalization\.)?(Localization\.Translate)\("([^"]+)"\)
Of course, this will ONLY find strings that stand alone, and it won't find any strings with a verbatim prefix. If a string parameter is passed as an operation ("string a" + "string b") or starts with a verbatim marker (@"Verbatim string"), it will not parse, but that is not the problem.
The regex definition:
([^"]+)
says that there must be no double quotes inside the string, and I know that no one in the company concatenates strings while passing them as the parameter. Still, I need to have this construction as a safety "what if" measure.
But that also causes the problem. The double quotes actually can be there.
Localization.Translate("Perfectly valid String with \"double quotes\"")
I need to change the regex so it also accepts strings containing double quotes, but only quotes that are preceded by a backslash (so I still skip anything like Translate("a" + "b"), which would mess with the translation catalog).
I thought I might need to use this (?!) grouping construct somehow but I have no idea where to place it.

Since you probably want to allow doubled backslashes before a quote, I suggest
"(?:\\.|[^"\\])*"
Explanation:
" # Match "
(?: # Either match
\\. # an escaped character
| # or
[^"\\] # any character except " or \
)* # any number of times.
" # Match "
This matches "hello", "hello\"there" or "hello\\" but fails on "hello" there" or "hello\\" there".
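As a rough sketch of how the pattern above might be dropped into the Translate regex from the question: the sample source lines and variable names below are hypothetical, and the pattern is written as a verbatim literal, so every " is doubled.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// The answer's string pattern, wrapped in the Translate(...) call from the question.
// Inside a verbatim literal, "" stands for a single ".
var translateCall = new Regex(
    @"(?:GetTextLocalization\.)?Localization\.Translate\(""((?:\\.|[^""\\])*)""\)");

// Hypothetical sample source lines (not taken from the original post).
string source = string.Join("\n",
    @"Localization.Translate(""Valid string"")",
    @"Localization.Translate(""Valid \""string\"""")",
    @"Localization.Translate(""Invalid "" + ""string"")");

// Collect the captured string contents, still in their source-code (escaped) form.
var found = new List<string>();
foreach (Match m in translateCall.Matches(source))
    found.Add(m.Groups[1].Value);

foreach (string s in found)
    Console.WriteLine(s);
```

The concatenated call is skipped because the first closing quote is not followed by a closing parenthesis, which is the "what if" safety behaviour the question asks for.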

Related

C# regex can't find text with whitespace if input is escaped

While trying to find a bit of text with a single whitespace between two words, I encountered something that seems like a bug. I'm using a pattern like (abc)\s(abc), to find two specific words. Now I'm escaping my input using Regex.Escape, but then my regex doesn't match anymore because spaces are escaped (to \space), and then not matched. Is this intended?
My text comes from user input, so as far as I know it should be escaped.
To clarify my question, the following code:
Console.WriteLine("Original text: " + text);
Console.WriteLine("Escaped text: " + Regex.Escape(text));
Console.WriteLine("Matches non-escaped text: " + Regex.IsMatch(text, @"(abc)\s(abc)", RegexOptions.IgnoreCase));
Console.WriteLine("Matches escaped text: " + Regex.IsMatch(Regex.Escape(text), @"(abc)\s(abc)", RegexOptions.IgnoreCase));
Gives the following result for input abc abc
Original text: abc abc
Escaped text: abc\ abc
Matches non-escaped text: True
Matches escaped text: False
While I would expect it to still match on spaces
My text comes from user input, so as far as I know it should be escaped.
This is a faulty premise. If you assume this, then every time someone uses any of your apps to create a record for an employee named Shamus A. O'Leary, they'll probably end up being inserted into the db as Shamus A\. O\'Leary, Shamus A. O'Leary, Shamus+A%2E+O'Leary etc depending on where the data came from and how you decided it needed to be escaped
Just because a user provides text doesn't mean it needs to be escaped. Escaping has to be applied contextually rather than as a blanket rule based on where the text comes from. Generally, escaping is used to make sure data can survive being put through some transport channel that doesn't support all the characters, or that would treat some of the characters as having a special meaning when they should not. So instead of looking at escaping as something that must be done depending on the source of the data, look at it as something that must be done to ensure the data reaches its destination unharmed.
Regex-wise, (abc)\s(abc) does not match a string of abc\ abc, because of the backslash. You've transformed your string from matching X to something else (Y), and then asked the regex parser whether Y matches the regex. It's no more a match than abc+abc is a match, going off an assumption that "when URLs are escaped, spaces become pluses, so a plus and a space must mean the same thing to a regex" - the regex engine will just look at the data and say "plus is not a whitespace character; no match". The regex engine won't look at your data and think "hey, if I just unescape this before I run it through the pattern matcher..." and it won't look at your data and think "it's a regex pattern" - a regex pattern expression and the data passed to a regex matcher working from that pattern are very different things, and if you want your data to match a described pattern, don't alter the data after you've decided on the pattern.
Thus the fault is in transforming the string by running a character replacement (escaping) before asking for the match.
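To make the distinction between pattern and data concrete, here is a small sketch. The variable names and the "a.c" search term are made up for illustration; the point it assumes is that Regex.Escape is meant for text that becomes part of a pattern, while the text being searched is left alone.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical values for illustration.
string text = "abc abc";   // data to be searched: left untouched
string userTerm = "a.c";   // user-supplied text that should be found literally

// The data is matched as-is against the pattern.
bool dataMatches = Regex.IsMatch(text, @"(abc)\s(abc)", RegexOptions.IgnoreCase);

// Escaping is applied to text that is turned into a pattern.
string literalPattern = Regex.Escape(userTerm);                  // the dot is escaped
bool unescapedHitsAbc = Regex.IsMatch("abc", userTerm);          // '.' matches any character
bool escapedHitsAbc = Regex.IsMatch("abc", literalPattern);      // needs a literal dot
bool escapedHitsLiteral = Regex.IsMatch("a.c", literalPattern);  // literal "a.c" is found

Console.WriteLine(dataMatches);
Console.WriteLine(unescapedHitsAbc);
Console.WriteLine(escapedHitsAbc);
Console.WriteLine(escapedHitsLiteral);
```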

Escape single and double back slashes

I am trying to replace a string using regex, however I can't seem to find a way to escape the single backslashes for the regex with the double backslashes of the string.
My string literal (this is being read from a text file, as is)
-s \"t\"
and I want to replace it with (again, as a string literal)
-s \"n\"
The best I have been able to come up with is
schedule = Regex.Replace(schedule, "-s\s\\\"\w\\\"", "-s \\\"n\\\"");
The middle argument doesn't compile though, because of the single and double backslashes. It will accept one or the other, but not both (regardless of whether I use @).
I don't use regexes that much so it may be a simple one but I'm pretty stuck!
Thanks
The problem happens because \ has a special meaning both in strings and in regular expressions, so it normally needs to be escaped twice unless you use @. On top of that, the string contains " itself, which also needs to be escaped.
Try the following:
schedule = Regex.Replace(schedule, @"-s\s\\""\w\\""", @"-s \""n\""");
After using @, \ doesn't have a special meaning inside strings, but it still has a special meaning inside a regular expression, so it needs to be escaped once if it is needed literally.
Also, you now need to use "" to escape " inside the string (how would you escape it otherwise, since \ doesn't have a special meaning anymore?).
This works:
var schedule = @"-s \""t\""";
// value is -s \"t\"
schedule = Regex.Replace(schedule, @"-s\s\\""\w\\""", @"-s \""n\""");
// value is -s \"n\"
The escaping is complicated because \ and " have special meanings both in how you encode strings in C# and in regex. The search pattern is (without any escaping) the C# string -s\s\\"\w\\", which tells regex to look for a literal \ and a literal ". The replacement string is -s \"n\", because you don't need to escape the backslashes in a replacement string.
You could, of course, write this with normal strings ("...") instead of verbatim strings (@"..."), but it'd get way messier.
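For comparison, this is roughly what the "messier" normal-string version looks like next to the verbatim one. The variable names are illustrative only; both forms are meant to produce identical strings at run time.

```csharp
using System;
using System.Text.RegularExpressions;

// The same pattern and replacement, written as verbatim and as regular literals.
string verbatimPattern = @"-s\s\\""\w\\""";
string regularPattern = "-s\\s\\\\\"\\w\\\\\"";
string verbatimReplacement = @"-s \""n\""";
string regularReplacement = "-s \\\"n\\\"";

// Both spellings denote the same characters.
bool samePattern = verbatimPattern == regularPattern;
bool sameReplacement = verbatimReplacement == regularReplacement;

string schedule = @"-s \""t\""";   // value is -s \"t\"
string result = Regex.Replace(schedule, regularPattern, regularReplacement);

Console.WriteLine(samePattern);
Console.WriteLine(sameReplacement);
Console.WriteLine(result);          // value is -s \"n\"
```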

Regex, MVS does not like my Regex strings, how do I make it comply

So in Microsoft Visual Studio I have a string that is compiled into a regex. My string is "#(\d+(.\d+)?)=(\d+(.\d+)?)". I cannot compile my program because I get an error saying that \d is an unrecognized escape character. How do I tell it to shut up and let me regex like a pro?
Begin your string with @, which causes the compiler to leave (almost) all characters alone, unescaped (the exception is ", which can be escaped as ""):
@"#(\d+(.\d+)?)=(\d+(.\d+)?)"
The problem is that C# does not like the \d inside the string. Use a verbatim string instead:
string pattern = @"#(\d+(.\d+)?)=(\d+(.\d+)?)";
The "@" denotes it. C# will not look for escape sequences in the string. If you have to escape a ", use two: "".
Of course you can use normal strings, but then you will have to escape the backslashes:
string pattern = "#(\\d+(.\\d+)?)=(\\d+(.\\d+)?)";
If you're using a normal string, you need to escape your backslashes, like so:
"#(\\d+(.\\d+)?)=(\\d+(.\\d+)?)"
Basically, you're putting a literal string into C#; the C# compiler sees the string first, and tries to interpret \d as an escape sequence (which doesn't exist, hence error). Therefore, you use \\d to get the C# compiler to see the string as \d, which then gets passed to the regex engine (which does recognize \d as something meaningful). (yes, if you want to match a literal backslash in your regex pattern, you need to use \\\\)
But in C#, you have the alternative of just prepending the string with @ to get the compiler to leave the string alone (though " still needs escaping), so that would be like this:
@"#(\d+(.\d+)?)=(\d+(.\d+)?)"
You could also use a verbatim string literal (I prefer to use these because of readability).
Use @"(#\d+(.\d+)?)=(\d+(.\d+)?)"
The @ sign indicates that the string shouldn't interpret escaped characters (a character prefixed by a \) until the closing " is reached.
Note: You can match a single " in your search pattern by doubling it: "". For instance, you can match "Hello" by using the pattern @"""\w+"""
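A short sketch that checks the two spellings against each other. It uses only the numeric part of the pattern, and the escaped dot (\.) is an assumption that a decimal point was intended; the input "1.5=20" is made up.

```csharp
using System;
using System.Text.RegularExpressions;

// Assumption: the dot is meant as a decimal point, so it is escaped here.
string verbatim = @"(\d+(\.\d+)?)=(\d+(\.\d+)?)";
string regular = "(\\d+(\\.\\d+)?)=(\\d+(\\.\\d+)?)";

// Both literals yield the same pattern text for the regex engine.
bool identical = verbatim == regular;

Match m = Regex.Match("1.5=20", verbatim);   // hypothetical input
string left = m.Groups[1].Value;
string right = m.Groups[3].Value;

// Doubled quotes inside a verbatim literal each stand for one quote.
bool quoted = Regex.IsMatch("\"Hello\"", @"""\w+""");

Console.WriteLine(identical);
Console.WriteLine(left + " = " + right);
Console.WriteLine(quoted);
```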

What is the significance of the @ symbol in C#? [duplicate]

I understand that using the @ symbol is like an escape character for a string.
However, I have the following line as a path to store a file to a mapped network drive:
String location = @"\\192.168.2.10\datastore\\" + id + "\\";
The above works fine, but now I would like to get a string from the command line, so I have done this:
String location = @args[0] + id + "\\";
The above doesn't work, and it seems my slashes aren't ignored. This is my command:
MyProgram.exe "\\192.168.2.10\datastore\\"
How can I get the effect of the @ symbol back?
It is used for two things:
create "verbatim" strings (ignores the escape character): string path = @"C:\Windows"
escape language keywords to use them as identifiers: string @class = "foo"
In your case you need to do this:
String location = args[0] + id + @"\\";
The @ symbol in front of a string literal tells the compiler to ignore any escape sequences in the string (i.e. things that begin with a backslash) and just create the string "as-is".
It can also be used to create variables whose name is a reserved word. For example:
int @class=10;
If you don't prefix the @, you get a compile-time error.
You can also prefix it to variables that are not reserved words:
int @foo=22;
Note that you can refer to the variable as foo or @foo in your code.
The @ prefix means the string is a verbatim string literal and the compiler does not process escape characters, so:
@"\n"
is not translated to a newline character. Without it, you'd have:
String location = "\\\\192.168.2.10\\datastore\\\\" + id + "\\\\";
which looks a bit messy. The '@' tidies things up a bit. The '@' can only be prefixed to string constants, that is, things inside a pair of double quotes ("). Since it is handled by the compiler, it is only applied at compile time, so the string must be known at compile time; hence,
@some_string_var
doesn't work the way you think. However, since all the '@' does is stop the compiler from processing escaped characters, a string in a variable already has the character values in it (10 for '\n', 13 for '\r', etc.). If you want to convert a literal '\n' to 10, for example, at run time, you'll need to parse it yourself and do the required substitutions (but I'm sure someone knows a better way).
To get what you want, do:
String location = args[0] + id + "\\";
The @ symbol has two uses in C#.
To quote a string verbatim instead of escaping. "\\windows" can be represented as @"\windows". "\"John!\"" can be represented as @"""John!""".
To escape variable names (for example, to use a keyword as a parameter name):
private static void InsertSafe(string item, object @lock)
{
    lock (@lock)
    {
        mylist.Insert(0, item);
    }
}
@-quoted string literals start with @ and are enclosed in double quotation marks. For example:
@"good morning" // a string literal
The advantage of @-quoting is that escape sequences are not processed, which makes it easy to write, for example, a fully qualified file name:
@"c:\Docs\Source\a.txt" // rather than "c:\\Docs\\Source\\a.txt"
To include a double quotation mark in an @-quoted string, double it:
@"""Ahoy!"" cried the captain." // "Ahoy!" cried the captain.
Another use of the @ symbol is to use referenced (/reference) identifiers that happen to be C# keywords. For more information, see 2.4.2 Identifiers.
http://msdn.microsoft.com/en-us/library/362314fe(v=vs.71).aspx
In this case you may not need to use @; just make it
String location = args[0] + id + "\\";
The @ symbol is only relevant for string literals in code. Variables should never modify the contents of a string.
The @ symbol goes right before the quotes. It only works on string literals, and it simply changes the way the string is understood by the compiler. The main thing it does is cause \ to be interpreted as a literal backslash, rather than escaping the next character. So you want:
String location = args[0] + id + @"\\";
By default the '\' character is an escape character for strings in C#. That means that if you want to have a backslash in your string you need two backslashes, the first to escape the second, as follows:
string escaped = "This is a backslash \\";
//The value of escaped is - This is a backslash \
An easier example to follow is with the use of quotes:
string escaped = "To put a \" in a string you need to escape it";
//The value of escaped is - To put a " in a string you need to escape it
The @ symbol is the equivalent of "ignore all escape characters in this string" and declares it verbatim. Without it your first declaration would look like this:
"\\\\192.168.2.10\\datastore\\\\" + id + "\\";
Note that you already didn't have the @ on your second string, so that string hasn't changed and still only contains a single backslash.
You only need to use the @ symbol when you are declaring string literals. Since your argument is already a string at run time, it is not needed. So your new line can be:
String location = args[0] + id + "\\";
or
String location = args[0] + id + @"\";
If you load from the command line, it will already be escaped for you. This is why your escapes are "ignored" from your perspective. Note that the same is true when you load from config, so don't do this:
<add key="pathToFile" value="C:\\myDirectory\\myFile.txt"/>
If you do, you end up with doubled backslashes, as .NET is smart enough to escape things for you when you load them in this manner.
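The answers above agree that @ only changes how the compiler reads a literal, and that a string arriving at run time needs no such treatment. A minimal sketch of that idea follows; fromCommandLine is a hypothetical stand-in for args[0], and id is a made-up value.

```csharp
using System;

// Two spellings of the same literal: verbatim and regular.
string verbatim = @"\\192.168.2.10\datastore\";
string regular = "\\\\192.168.2.10\\datastore\\";
bool sameText = verbatim == regular;

// Hypothetical stand-in for args[0]: a run-time string already holds its
// final characters, so there is nothing left for @ to do.
string fromCommandLine = verbatim;
string id = "42";   // hypothetical id
string location = fromCommandLine + id + @"\";

// The other use of @: a keyword as an identifier.
int @class = 10;
int sum = @class + 1;

Console.WriteLine(sameText);
Console.WriteLine(location);
Console.WriteLine(sum);
```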

Regex battle between maximum and minimum munge

Greetings, I have a file with the following strings:
string.Format("{0},{1}", "Having \"Two\" On The Same Line".Localize(), "Is Tricky For regex".Localize());
my goal is to get a match set with the two strings:
Having \"Two\" On The Same Line
Is Tricky For regex
My current regex looks like this:
private Regex CSharpShortRegex = new Regex("\"(?<constant>[^\"]+?)\".Localize\\(\\)");
My problem is with the escaped quotes in the first line: I end up stopping at the quote and I get:
On The Same Line
Is Tricky For This Style Too
however attempting to ignore the escaped quotes is not working out because it makes the Regex greedy and I get
Having \"Two\" On The Same Line".Localize(), "Is Tricky For regex"
We seem to be caught between maximum and minimum munge. Is there any hope? I have some backup plans. Can you Regex backwards? that would make it easier because I can start with the "()ezilacoL."
EDIT:
To clarify. This is my lone edge case. Most of the time the string sits alone like:
var myString = "Hot Patootie".Localize()
This one works for me:
\"((?:[^\\"]|(?:\\\"))*)\"\.Localize\(\)
Tested on http://www.regexplanet.com/simple/index.html against a number of strings with various escaped quotes.
Looks like most of us who answered this one had the same rough idea, so let me explain the approach (comments after #s):
\" # We're looking for a string delimited by quotation marks
( # Capture the contents of the quotation marks
(?: # Start a non-capturing group
[^\\"] # Either read a character that isn't a quote or a slash
|(?:\\\") # Or read in a slash followed by a quote.
)* # Keep reading
) # End the capturing group
\" # The string literal ends in a quotation mark
\.Localize\(\) # and ends with the literal '.Localize()', escaping ., ( and )
For C# you'll need to escape the slashes twice (messy):
\"((?:[^\\\\\"]|(?:\\\\\"))*)\"\\.Localize\\(\\)
Mark correctly points out that this one doesn't match escaped characters other than quotation marks. So here's a better version:
\"((?:[^\\"]|(?:\\")|(?:\\.))*)\"\.Localize\(\)
And its slashed-up equivalent:
\"((?:[^\\\\\"]|(?:\\\\\")|(?:\\\\.))*)\"\\.Localize\\(\\)
Works the same way, except it has a special case: if it encounters a backslash but can't match \", it just consumes the backslash and the following character and moves on.
Thinking about it, it's better to just consume two characters at every slash, which is effectively Mark's answer so I won't repeat it.
Here's the regular expression you need:
@"""(?<constant>(\\.|[^""])*)""\.Localize\(\)"
A test program:
using System;
using System.Text.RegularExpressions;
using System.IO;
class Program
{
    static void Main()
    {
        Regex CSharpShortRegex =
            new Regex(@"""(?<constant>(\\.|[^""])*)""\.Localize\(\)");
        foreach (string line in File.ReadAllLines("input.txt"))
            foreach (Match match in CSharpShortRegex.Matches(line))
                Console.WriteLine(match.Groups["constant"].Value);
    }
}
Output:
Having \"Two\" On The Same Line
Is Tricky For regex
Hot Patootie
Notice that I have used @"..." to avoid having to escape backslashes inside the regular expression. I think this makes it easier to read.
Update:
My original answer (below the horizontal rule) has a bug: regular-expression matchers attempt alternatives in left-to-right order. Having [^"] as the first alternative allows it to consume the backslash, but then the next character to be matched is a quote, which prevents the match from proceeding.
Incompatibility note: Given the pattern below, perl backtracks to the other alternative (the escaped quote) and successfully finds a match for the Having \"Two\" On The Same Line case.
The fix is to try an escaped quote first and then a non-quote:
var CSharpShortRegex =
new Regex("\"(?<constant>(\\\\\"|[^\"])*)\"\\.Localize\\(\\)");
or if you prefer the at-string form:
var CSharpShortRegex =
new Regex(@"""(?<constant>(\\""|[^""])*)""\.Localize\(\)");
Allow for escapes:
private Regex CSharpShortRegex =
new Regex("\"(?<constant>([^\"]|\\\\\")*)\"\\.Localize\\(\\)");
Applying one level of escaping to make the pattern easier to read, we get
"(?<constant>([^"]|\\")*)"\.Localize\(\)
That is, a string starts and ends with " characters, and everything between is either a non-quote or an escaped quote.
Looks like you're trying to parse code so one approach might be to evaluate the code on the fly:
var cr = new CSharpCodeProvider().CompileAssemblyFromSource(
new CompilerParameters { GenerateInMemory = true },
"class x { public static string e() { return " + input + "}}");
var result = cr.CompiledAssembly.GetType("x")
.GetMethod("e").Invoke(null, null) as string;
This way you could handle all kinds of other special cases (e.g. concatenated or verbatim strings) that would be extremely difficult to handle with regex.
new Regex(@"((([^@]|^|\n)""(?<constant>((\\.)|[^""])*)"")|(@""(?<constant>(""""|[^""])*)""))\s*\.\s*Localize\s*\(\s*\)", RegexOptions.Compiled);
takes care of both simple and @"" strings. It also takes into account escape sequences.
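A quick way to try the combined pattern is sketched below. The sample input and the variable names are hypothetical, and the pattern is copied from the answer above without further hardening.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Combined pattern from the answer above: regular and verbatim string literals.
var localize = new Regex(
    @"((([^@]|^|\n)""(?<constant>((\\.)|[^""])*)"")|(@""(?<constant>(""""|[^""])*)""))\s*\.\s*Localize\s*\(\s*\)",
    RegexOptions.Compiled);

// Hypothetical source line with one regular and one verbatim literal.
string source = @"var a = ""Hot Patootie"".Localize(); var b = @""C:\dir"".Localize();";

// Both alternatives capture into the same named group.
var constants = new List<string>();
foreach (Match m in localize.Matches(source))
    constants.Add(m.Groups["constant"].Value);

foreach (string c in constants)
    Console.WriteLine(c);
```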
