Finding C#-style unescaped strings using regular expressions

I'm trying to write a regular expression that finds C#-style unescaped strings, such as
string x = @"hello
world";
The problem I'm having is how to write a rule that handles double quotes within the string correctly, like in this example:
string x = @"before quote ""junk"" after quote";
This should be an easy one, right?

Try this one:
@".*?(""|[^"])"([^"]|$)
The first parentheses mean 'If there is a " before the finishing quote, it had better be two of them'; the second parentheses mean 'After the finishing quote, there should be either a non-quote character or the end of the line'.

How 'bout the regex @\"([^\"]|\"\")*\"(?=[^\"])
Due to greedy matching, the final lookahead clause is likely not to be needed in your regex engine, although it is more specific.

If I remember correctly, you have to use \"" - the double-double quotes to escape it for C# and the backslash to escape it for regex.

Try this:
@"[^"]*?(""[^"]*?)*";
It looks for the starting characters @", for the ending characters "; (you can leave the semicolon out if you need to) and in between it can have any characters except quotes, or if there are quotes they have to be doubled.

@"(?:""|[^"])*"(?!")
is the right regex for this job. It matches the @, a quote, then either two quotes in a row or any non-quote character, repeating this up to the next quote (that isn't doubled).

^@"(""|[^"])*"$ is the regex you want, looking for first an at-sign and a double-quote, then a sequence of any characters (except double-quotes) or double double-quotes, and finally a double-quote.
As a string literal in C#, you'd have to write it string regex = "^@\"(\"\"|[^\"])*\"$"; or string regex = @"^@""(""""|[^""])*""$";. Choose your poison.
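For anyone who wants to try the answers out, here is a minimal sketch that wraps the @"(?:""|[^"])*"(?!") pattern in a helper. The class and method names (VerbatimStringFinder, FindAll) are made up for illustration, and the pattern is written as a verbatim literal, so every quote in it is doubled.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VerbatimStringFinder
{
    // Plain regex: @"(?:""|[^"])*"(?!")
    // Inside a verbatim literal every " is doubled, hence the runs of quotes.
    private static readonly Regex Pattern =
        new Regex(@"@""(?:""""|[^""])*""(?!"")");

    // Returns every verbatim string literal found in the given source text.
    // [^"] also matches line breaks, so multi-line literals are covered.
    public static string[] FindAll(string source)
    {
        MatchCollection matches = Pattern.Matches(source);
        var result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            result[i] = matches[i].Value;
        }
        return result;
    }
}
```

Calling FindAll on the line string x = @"before quote ""junk"" after quote"; returns the single literal @"before quote ""junk"" after quote". Note that this is only a text search: it has no idea whether the @" it finds sits inside a comment or a regular string.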

Related

Regex Remove pair of double quotes around word, but not single instances of double quotes

I need to be able to remove a pair of double quotes around words, without removing single instances of double quotes.
I.e., in the examples below, the regex should only match around "hello" and "bounce", without removing the word itself.
3.5" hdd
"hello"
"cool
"bounce"
single sentence with out quotes.
Closest regex i've found so far is this one below, but this highlights the entire "bounce" word which is not acceptable as I need to retain the word.
"([^\\"]|\\")*"
Other close regex I've found in my research:
1.
\"*\"
but this highlights the single quotes.
and Unsuccessful Method 2
This needs to be usable in C# code.
I've been using RegexStorm to test my regex: http://regexstorm.net/reference

Your first regex seems fine but lacks an outer capturing group. It would be better if we transform this into a linear regex, avoiding alternation.
"([^\\"\r\n]*(?:\\.[^\\"\r\n]*)*)"
I included \r and \n in the character class to keep a match from spanning more than one line; you may not need them, however. You then replace the whole match with $1 (a back-reference to the text saved by the first capturing group). To escape a " in a C# verbatim string, double it: "".
C# code:
string pattern = @"""([^\\""\r\n]*(?:\\.[^\\""\r\n]*)*)""";
string input = @"3.5"" hdd
""hello""
""cool
""bounce""
single sentence with out quotes.";
Regex regex = new Regex(pattern);
Console.WriteLine(regex.Replace(input, @"$1"));

Replace certain combination of characters with another

In C# I'm trying to replace characters in a string. To be more precise, wherever there is a double quote which is NOT followed, nor preceded, by a comma, I'd like to replace that double quote with a single quote. So, for example:
John",123
and
123,"John
are both fine, because there is a comma either before or after the double quote, but:
John"Marks
is not fine because there is a double quote which is neither preceded nor followed by a comma, so it should be replaced with a single quote. I.e. it should become:
John'Marks
I'm struggling to figure this one out... any ideas anyone? Thanks...

You can use look arounds for your search regex:
(?<!,)"(?!,)
RegEx Breakup:
(?<!,) - Negative Lookbehind to assert previous character is not a comma
" - Match a double quote
(?!,) - Negative Lookahead to assert next character is not a comma
Replacement string would be just a single quote "'"
Code:
string repl = Regex.Replace(str, @"(?<!,)""(?!,)", "'");
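A small sketch of the lookaround replacement applied to the three inputs from the question. The names QuoteFixer and ReplaceLoneQuotes are hypothetical; the regex itself is the (?<!,)"(?!,) pattern from this answer, written as a verbatim literal with the quote doubled.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteFixer
{
    // (?<!,)"(?!,) : a double quote with no comma directly before it
    // and no comma directly after it.
    private static readonly Regex LoneQuote = new Regex(@"(?<!,)""(?!,)");

    // Replaces every such quote with a single quote; other quotes are kept.
    public static string ReplaceLoneQuotes(string input)
    {
        return LoneQuote.Replace(input, "'");
    }
}
```

With this, John"Marks becomes John'Marks, while John",123 and 123,"John come back unchanged because one of the two lookarounds fails on each.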

regex syntax stop search

How do I make Regex stop the search after "Target This"?
HeaderText="Target This" AnotherAttribute="Getting Picked Up"
This is what i've tried
var match = Regex.Match(string1, @"(?<=HeaderText=\").*(?=\")");

The quantifier * is greedy, which means it will consume as many characters as it can while still getting a match. You want the lazy quantifier, *?.
As an aside, rather than using look-around expressions as you have done here, you may find it in general easier to use capturing groups:
var match = Regex.Match(string1, "HeaderText=\"(.*?)\"");
The parentheses around .*? make a capturing group.
Now the match matches the whole thing, but match.Groups[1] is just the value in the quotes.

Plain regex pattern
(?<=HeaderText=").*?(?=")
or as string
string pattern = "(?<=HeaderText=\").*?(?=\")";
or using a verbatim string
string pattern = @"(?<=HeaderText="").*?(?="")";
The trick is the question mark after .*. It means "as few as possible", making it stop after the first end-quotes it encounters.
Note that verbatim strings (introduced with @) do not recognize the backslash \ as escape character. Escape the double quotes by doubling them.
Note for others interested in regex: The search pattern used finds a position between a prefix and a suffix:
(?<=prefix)find(?=suffix)
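To see the difference between the two quantifiers side by side, here is a short sketch run against the line from the question. HeaderTextDemo, Greedy, Lazy and Sample are illustrative names only; the two patterns differ in nothing but the question mark.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeaderTextDemo
{
    // The input line from the question.
    public const string Sample =
        "HeaderText=\"Target This\" AnotherAttribute=\"Getting Picked Up\"";

    // Greedy: .* runs on to the last quote in the input.
    public static string Greedy(string input)
    {
        return Regex.Match(input, @"(?<=HeaderText="").*(?="")").Value;
    }

    // Lazy: .*? stops at the first quote after the prefix.
    public static string Lazy(string input)
    {
        return Regex.Match(input, @"(?<=HeaderText="").*?(?="")").Value;
    }
}
```

HeaderTextDemo.Greedy(HeaderTextDemo.Sample) gives Target This" AnotherAttribute="Getting Picked Up, whereas HeaderTextDemo.Lazy(HeaderTextDemo.Sample) gives just Target This.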

Try this:
var match = Regex.Match(string1, "HeaderText=\"([^\"]+)");
var val = match.Groups[1].Value; //Target This
UPDATE
If the target may itself contain double quotes, change the regex to:
HeaderText=\"(.+?)\"\\s+\\w
Note: regex is not the right way to do this. If it's XML, check out System.Xml; otherwise, HtmlAgilityPack (see How to use HTML Agility pack).

Regex battle between maximum and minimum munge

Greetings, I have file with the following strings:
string.Format("{0},{1}", "Having \"Two\" On The Same Line".Localize(), "Is Tricky For regex".Localize());
my goal is to get a match set with the two strings:
Having \"Two\" On The Same Line
Is Tricky For regex
My current regex looks like this:
private Regex CSharpShortRegex = new Regex("\"(?<constant>[^\"]+?)\".Localize\\(\\)");
My problem is with the escaped quotes in the first line I end up stopping at the quote and I get:
On The Same Line
Is Tricky For This Style Too
however attempting to ignore the escaped quotes is not working out because it makes the Regex greedy and I get
Having \"Two\" On The Same Line".Localize(), "Is Tricky For regex"
We seem to be caught between maximum and minimum munge. Is there any hope? I have some backup plans. Can you Regex backwards? that would make it easier because I can start with the "()ezilacoL."
EDIT:
To clarify. This is my lone edge case. Most of the time the string sits alone like:
var myString = "Hot Patootie".Localize()

This one works for me:
\"((?:[^\\"]|(?:\\\"))*)\"\.Localize\(\)
Tested on http://www.regexplanet.com/simple/index.html against a number of strings with various escaped quotes.
Looks like most of us who answered this one had the same rough idea, so let me explain the approach (comments after #s):
\"                # We're looking for a string delimited by quotation marks
(                 # Capture the contents of the quotation marks
  (?:             # Start a non-capturing group
    [^\\"]        # Either read a character that isn't a quote or a backslash
    |(?:\\\")     # Or read in a backslash followed by a quote.
  )*              # Keep reading
)                 # End the capturing group
\"                # The string literal ends in a quotation mark
\.Localize\(\)    # and ends with the literal '.Localize()', escaping ., ( and )
For C# you'll need to escape the slashes twice (messy):
\"((?:[^\\\\\"]|(?:\\\\\"))*)\"\\.Localize\\(\\)
Mark correctly points out that this one doesn't match escaped characters other than quotation marks. So here's a better version:
\"((?:[^\\"]|(?:\\")|(?:\\.))*)\"\.Localize\(\)
And its slashed-up equivalent:
\"((?:[^\\\\\"]|(?:\\\\\")|(?:\\\\.))*)\"\\.Localize\\(\\)
Works the same way, except it has a special case: if it encounters a backslash but can't match \", it just consumes the backslash and the following character and moves on.
Thinking about it, it's better to just consume two characters at every backslash, which is effectively Mark's answer so I won't repeat it.

Here's the regular expression you need:
@"""(?<constant>(\\.|[^""])*)""\.Localize\(\)"
A test program:
using System;
using System.Text.RegularExpressions;
using System.IO;
class Program
{
    static void Main()
    {
        Regex CSharpShortRegex =
            new Regex(@"""(?<constant>(\\.|[^""])*)""\.Localize\(\)");
        foreach (string line in File.ReadAllLines("input.txt"))
            foreach (Match match in CSharpShortRegex.Matches(line))
                Console.WriteLine(match.Groups["constant"].Value);
    }
}
Output:
Having \"Two\" On The Same Line
Is Tricky For regex
Hot Patootie
Notice that I have used @"..." to avoid having to escape backslashes inside the regular expression. I think this makes it easier to read.

Update:
My original answer (below the horizontal rule) has a bug: regular-expression matchers attempt alternatives in left-to-right order. Having [^"] as the first alternative allows it to consume the backslash, but then the next character to be matched is a quote, which prevents the match from proceeding.
Incompatibility note: Given the pattern below, perl backtracks to the other alternative (the escaped quote) and successfully finds a match for the Having \"Two\" On The Same Line case.
The fix is to try an escaped quote first and then a non-quote:
var CSharpShortRegex =
    new Regex("\"(?<constant>(\\\\\"|[^\"])*)\"\\.Localize\\(\\)");
or if you prefer the at-string form:
var CSharpShortRegex =
    new Regex(@"""(?<constant>(\\""|[^""])*)""\.Localize\(\)");
Allow for escapes:
private Regex CSharpShortRegex =
    new Regex("\"(?<constant>([^\"]|\\\\\")*)\"\\.Localize\\(\\)");
Applying one level of escaping to make the pattern easier to read, we get
"(?<constant>([^"]|\\")*)"\.Localize\(\)
That is, a string starts and ends with " characters, and everything between is either a non-quote or an escaped quote.

Looks like you're trying to parse code so one approach might be to evaluate the code on the fly:
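// Requires: using Microsoft.CSharp; using System.CodeDom.Compiler;
var cr = new CSharpCodeProvider().CompileAssemblyFromSource(
    new CompilerParameters { GenerateInMemory = true },
    // input is the string-literal expression; the ';' completes the return statement
    "class x { public static string e() { return " + input + "; }}");
var result = cr.CompiledAssembly.GetType("x")
    .GetMethod("e").Invoke(null, null) as string;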
This way you could handle all kinds of other special cases (e.g. concatenated or verbatim strings) that would be extremely difficult to handle with regex.

new Regex(@"((([^@]|^|\n)""(?<constant>((\\.)|[^""])*)"")|(@""(?<constant>(""""|[^""])*)""))\s*\.\s*Localize\s*\(\s*\)", RegexOptions.Compiled);
takes care of both simple and @"" strings. It also takes into account escape sequences.

Simple regex pattern

i'm using C# and i'm trying to allow only alphabetical letters and spaces. my expression at the moment is:
string regex = "^[A-Za-z\s]{1,40}$";
my IDE says that \s is an "Unrecognized escape sequence"
what am i missing?

"\" is a c# escape character as well as a regex escape character. Try:
string regex = @"^[A-Za-z\s]{1,40}$";

You need to put an @ in front of your string to turn it into a verbatim string literal:
string regex = @"^[A-Za-z\s]{1,40}$";
Right now, the \ in your regex is being interpreted as trying to escape the following s, which the compiler doesn't understand.
Alternatively, you can just escape the backslash with another one:
string regex = "^[A-Za-z\\s]{1,40}$";
but in general, prefer the first approach to the second.
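A quick sketch showing that the two spellings produce exactly the same pattern text at run time. LiteralForms, SamePattern and IsValid are names invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LiteralForms
{
    // Verbatim literal: the backslash reaches the regex engine untouched.
    public const string Verbatim = @"^[A-Za-z\s]{1,40}$";

    // Regular literal: the C# compiler turns \\ into a single backslash.
    public const string Escaped = "^[A-Za-z\\s]{1,40}$";

    // True when both literals yield the same pattern string.
    public static bool SamePattern()
    {
        return Verbatim == Escaped;
    }

    // Checks an input against the pattern from the question.
    public static bool IsValid(string input)
    {
        return Regex.IsMatch(input, Verbatim);
    }
}
```

SamePattern() returns true, so the choice between the two forms is purely about readability of the source code.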

An additional note, your regex doesn't do what you describe. You say a max of 1 space in between words. In order to do that, you need to move the "\s" out of the character list. The pattern you're currently using allows "any letter or space from 1 to 40 times" which allows for multiple successive spaces. You'll need something more like the following:
string regex = @"^(?:[A-Za-z]+\s?)+$";
This means "any letter 1 or more times followed by an optional space, this whole thing one or more times". I don't know how to limit the whole string to 40 characters when you don't know the size of the first expression in advance. Maybe this can be achieved with a "look behind" expression, but I'm not sure. You might have to do it in two steps.
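One possible way to get the 40-character limit in a single pattern is a lookahead at the start rather than a look-behind. This is a sketch under my own assumptions, not something from the answer above: the names NameValidator and IsValid are hypothetical, and the word part is a slightly stricter variant that requires a letter after every space, so a trailing space is rejected.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameValidator
{
    // (?=.{1,40}$)      lookahead: the whole input is 1 to 40 characters long
    // [A-Za-z]+         first word
    // (?:\s[A-Za-z]+)*  further words, each preceded by exactly one space
    private static readonly Regex Pattern =
        new Regex(@"^(?=.{1,40}$)[A-Za-z]+(?:\s[A-Za-z]+)*$");

    public static bool IsValid(string input)
    {
        return Pattern.IsMatch(input);
    }
}
```

The lookahead checks the length without consuming anything, so the rest of the pattern still starts matching at the first character. Requiring a letter after each space also avoids the nested optional quantifier, which can backtrack heavily on inputs that almost match.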

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.
