regular expression for c# verbatim like strings (processing ""-like escapes)

regular expression for c# verbatim like strings (processing ""-like escapes) - c#

I'm trying to extract information out of rc-files. In these files, "-chars in strings are escaped by doubling them ("") analog to c# verbatim strings. is ther a way to extract the string?
For example, if I have the following string "this is a ""test""" I would like to obtain this is a ""test"". It also must be non-greedy (very important).
I've tried to use the following regular expression;
"(?<text>[^""]*(""(.|""|[^"])*)*)"
However the performance was awful.
I'v based it on the explanation here: http://ad.hominem.org/log/2005/05/quoted_strings.php
Has anybody any idea to cope with this using a regular expression?

You've got some nested repetition quantifiers there. That can be catastrophic for the performance.
Try something like this:
(?<=")(?:[^"]|"")*(?=")
That can now only consume either two quotes at once... or non-quote characters. The lookbehind and lookahead assert, that the actual match is preceded and followed by a quote.
This also gets you around having to capture anything. Your desired result will simply be the full string you want (without the outer quotes).
I do not assert that the outer quotes are not doubled. Because if they were, there would be no way to distinguish them from an empty string anyway.

This turns out to be a lot simpler than you'd expect. A string literal with escaped quotes looks exactly like a bunch of simple string literals run together:
"Some ""escaped"" quotes"
"Some " + "escaped" + " quotes"
So this is all you need to match it:
(?:"[^"]*")+
You'll have to strip off the leading and trailing quotes in a separate step, but that's not a big deal. You would need a separate step anyway, to unescape the escaped quotes (\" or "").

Don't if this is better or worse than m.buettner's (guessing not - he seems to know his stuff) but I thought I'd throw it out there for critique.
"(([^"]+(""[^"]+"")*)*)"

Try this (?<=^")(.*?"{2}.*?"{2})(?="$)
it will be maybe more faster, than two previous
and without any bugs.

Match a " beginning the string
Multiple times match a non-" or two "
Match a " ending the string
"([^"]|(""))*?"

Related

Get what is in string from one quotation mark to other [duplicate]

I have a value like this:
"Foo Bar" "Another Value" something else
What regex will return the values enclosed in the quotation marks (e.g. Foo Bar and Another Value)?

In general, the following regular expression fragment is what you are looking for:
"(.*?)"
This uses the non-greedy *? operator to capture everything up to but not including the next double quote. Then, you use a language-specific mechanism to extract the matched text.
In Python, you could do:
>>> import re
>>> string = '"Foo Bar" "Another Value"'
>>> print re.findall(r'"(.*?)"', string)
['Foo Bar', 'Another Value']

I've been using the following with great success:
(["'])(?:(?=(\\?))\2.)*?\1
It supports nested quotes as well.
For those who want a deeper explanation of how this works, here's an explanation from user ephemient:
([""']) match a quote; ((?=(\\?))\2.) if backslash exists, gobble it, and whether or not that happens, match a character; *? match many times (non-greedily, as to not eat the closing quote); \1 match the same quote that was use for opening.

I would go for:
"([^"]*)"
The [^"] is regex for any character except '"'
The reason I use this over the non greedy many operator is that I have to keep looking that up just to make sure I get it correct.

Lets see two efficient ways that deal with escaped quotes. These patterns are not designed to be concise nor aesthetic, but to be efficient.
These ways use the first character discrimination to quickly find quotes in the string without the cost of an alternation. (The idea is to discard quickly characters that are not quotes without to test the two branches of the alternation.)
Content between quotes is described with an unrolled loop (instead of a repeated alternation) to be more efficient too: [^"\\]*(?:\\.[^"\\]*)*
Obviously to deal with strings that haven't balanced quotes, you can use possessive quantifiers instead: [^"\\]*+(?:\\.[^"\\]*)*+ or a workaround to emulate them, to prevent too much backtracking. You can choose too that a quoted part can be an opening quote until the next (non-escaped) quote or the end of the string. In this case there is no need to use possessive quantifiers, you only need to make the last quote optional.
Notice: sometimes quotes are not escaped with a backslash but by repeating the quote. In this case the content subpattern looks like this: [^"]*(?:""[^"]*)*
The patterns avoid the use of a capture group and a backreference (I mean something like (["']).....\1) and use a simple alternation but with ["'] at the beginning, in factor.
Perl like:
["'](?:(?<=")[^"\\]*(?s:\\.[^"\\]*)*"|(?<=')[^'\\]*(?s:\\.[^'\\]*)*')
(note that (?s:...) is a syntactic sugar to switch on the dotall/singleline mode inside the non-capturing group. If this syntax is not supported you can easily switch this mode on for all the pattern or replace the dot with [\s\S])
(The way this pattern is written is totally "hand-driven" and doesn't take account of eventual engine internal optimizations)
ECMA script:
(?=["'])(?:"[^"\\]*(?:\\[\s\S][^"\\]*)*"|'[^'\\]*(?:\\[\s\S][^'\\]*)*')
POSIX extended:
"[^"\\]*(\\(.|\n)[^"\\]*)*"|'[^'\\]*(\\(.|\n)[^'\\]*)*'
or simply:
"([^"\\]|\\.|\\\n)*"|'([^'\\]|\\.|\\\n)*'

Peculiarly, none of these answers produce a regex where the returned match is the text inside the quotes, which is what is asked for. MA-Madden tries but only gets the inside match as a captured group rather than the whole match. One way to actually do it would be :
(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)
Examples for this can be seen in this demo https://regex101.com/r/Hbj8aP/1
The key here is the the positive lookbehind at the start (the ?<= ) and the positive lookahead at the end (the ?=). The lookbehind is looking behind the current character to check for a quote, if found then start from there and then the lookahead is checking the character ahead for a quote and if found stop on that character. The lookbehind group (the ["']) is wrapped in brackets to create a group for whichever quote was found at the start, this is then used at the end lookahead (?=\1) to make sure it only stops when it finds the corresponding quote.
The only other complication is that because the lookahead doesn't actually consume the end quote, it will be found again by the starting lookbehind which causes text between ending and starting quotes on the same line to be matched. Putting a word boundary on the opening quote (["']\b) helps with this, though ideally I'd like to move past the lookahead but I don't think that is possible. The bit allowing escaped characters in the middle I've taken directly from Adam's answer.

The RegEx of accepted answer returns the values including their sourrounding quotation marks: "Foo Bar" and "Another Value" as matches.
Here are RegEx which return only the values between quotation marks (as the questioner was asking for):
Double quotes only (use value of capture group #1):
"(.*?[^\\])"
Single quotes only (use value of capture group #1):
'(.*?[^\\])'
Both (use value of capture group #2):
(["'])(.*?[^\\])\1
-
All support escaped and nested quotes.

I liked Eugen Mihailescu's solution to match the content between quotes whilst allowing to escape quotes. However, I discovered some problems with escaping and came up with the following regex to fix them:
(['"])(?:(?!\1|\\).|\\.)*\1
It does the trick and is still pretty simple and easy to maintain.
Demo (with some more test-cases; feel free to use it and expand on it).
PS: If you just want the content between quotes in the full match ($0), and are not afraid of the performance penalty use:
(?<=(['"])\b)(?:(?!\1|\\).|\\.)*(?=\1)
Unfortunately, without the quotes as anchors, I had to add a boundary \b which does not play well with spaces and non-word boundary characters after the starting quote.
Alternatively, modify the initial version by simply adding a group and extract the string form $2:
(['"])((?:(?!\1|\\).|\\.)*)\1
PPS: If your focus is solely on efficiency, go with Casimir et Hippolyte's solution; it's a good one.

A very late answer, but like to answer
(\"[\w\s]+\")
http://regex101.com/r/cB0kB8/1

The pattern (["'])(?:(?=(\\?))\2.)*?\1 above does the job but I am concerned of its performances (it's not bad but could be better). Mine below it's ~20% faster.
The pattern "(.*?)" is just incomplete. My advice for everyone reading this is just DON'T USE IT!!!
For instance it cannot capture many strings (if needed I can provide an exhaustive test-case) like the one below:
$string = 'How are you? I\'m fine, thank you';
The rest of them are just as "good" as the one above.
If you really care both about performance and precision then start with the one below:
/(['"])((\\\1|.)*?)\1/gm
In my tests it covered every string I met but if you find something that doesn't work I would gladly update it for you.
Check my pattern in an online regex tester.

This version
accounts for escaped quotes
controls backtracking
/(["'])((?:(?!\1)[^\\]|(?:\\\\)*\\[^\\])*)\1/

MORE ANSWERS! Here is the solution i used
\"([^\"]*?icon[^\"]*?)\"
TLDR;
replace the word icon with what your looking for in said quotes and voila!
The way this works is it looks for the keyword and doesn't care what else in between the quotes.
EG:
id="fb-icon"
id="icon-close"
id="large-icon-close"
the regex looks for a quote mark "
then it looks for any possible group of letters thats not "
until it finds icon
and any possible group of letters that is not "
it then looks for a closing "

I liked Axeman's more expansive version, but had some trouble with it (it didn't match for example
foo "string \\ string" bar
or
foo "string1" bar "string2"
correctly, so I tried to fix it:
# opening quote
(["'])
(
# repeat (non-greedy, so we don't span multiple strings)
(?:
# anything, except not the opening quote, and not
# a backslash, which are handled separately.
(?!\1)[^\\]
|
# consume any double backslash (unnecessary?)
(?:\\\\)*
|
# Allow backslash to escape characters
\\.
)*?
)
# same character as opening quote
\1

string = "\" foo bar\" \"loloo\""
print re.findall(r'"(.*?)"',string)
just try this out , works like a charm !!!
\ indicates skip character

Unlike Adam's answer, I have a simple but worked one:
(["'])(?:\\\1|.)*?\1
And just add parenthesis if you want to get content in quotes like this:
(["'])((?:\\\1|.)*?)\1
Then $1 matches quote char and $2 matches content string.

All the answer above are good.... except they DOES NOT support all the unicode characters! at ECMA Script (Javascript)
If you are a Node users, you might want the the modified version of accepted answer that support all unicode characters :
/(?<=((?<=[\s,.:;"']|^)["']))(?:(?=(\\?))\2.)*?(?=\1)/gmu
Try here.

My solution to this is below
(["']).*\1(?![^\s])
Demo link : https://regex101.com/r/jlhQhV/1
Explanation:
(["'])-> Matches to either ' or " and store it in the backreference \1 once the match found
.* -> Greedy approach to continue matching everything zero or more times until it encounters ' or " at end of the string. After encountering such state, regex engine backtrack to previous matching character and here regex is over and will move to next regex.
\1 -> Matches to the character or string that have been matched earlier with the first capture group.
(?![^\s]) -> Negative lookahead to ensure there should not any non space character after the previous match

echo 'junk "Foo Bar" not empty one "" this "but this" and this neither' | sed 's/[^\"]*\"\([^\"]*\)\"[^\"]*/>\1</g'
This will result in: >Foo Bar<><>but this<
Here I showed the result string between ><'s for clarity, also using the non-greedy version with this sed command we first throw out the junk before and after that ""'s and then replace this with the part between the ""'s and surround this by ><'s.

From Greg H. I was able to create this regex to suit my needs.
I needed to match a specific value that was qualified by being inside quotes. It must be a full match, no partial matching could should trigger a hit
e.g. "test" could not match for "test2".
reg = r"""(['"])(%s)\1"""
if re.search(reg%(needle), haystack, re.IGNORECASE):
print "winning..."
Hunter

If you're trying to find strings that only have a certain suffix, such as dot syntax, you can try this:
\"([^\"]*?[^\"]*?)\".localized
Where .localized is the suffix.
Example:
print("this is something I need to return".localized + "so is this".localized + "but this is not")
It will capture "this is something I need to return".localized and "so is this".localized but not "but this is not".

A supplementary answer for the subset of Microsoft VBA coders only one uses the library Microsoft VBScript Regular Expressions 5.5 and this gives the following code
Sub TestRegularExpression()
Dim oRE As VBScript_RegExp_55.RegExp '* Tools->References: Microsoft VBScript Regular Expressions 5.5
Set oRE = New VBScript_RegExp_55.RegExp
oRE.Pattern = """([^""]*)"""
oRE.Global = True
Dim sTest As String
sTest = """Foo Bar"" ""Another Value"" something else"
Debug.Assert oRE.test(sTest)
Dim oMatchCol As VBScript_RegExp_55.MatchCollection
Set oMatchCol = oRE.Execute(sTest)
Debug.Assert oMatchCol.Count = 2
Dim oMatch As Match
For Each oMatch In oMatchCol
Debug.Print oMatch.SubMatches(0)
Next oMatch
End Sub

How to use regex to match anything from A to B, where B is not preceeded by C

I'm having a hard time with this one. First off, here is the difficult part of the string I'm matching against:
"a \"b\" c"
What I want to extract from this is the following:
a \"b\" c
Of course, this is just a substring from a larger string, but everything else works as expected. The problem is making the regex ignore the quotes that are escaped with a backslash.
I've looked into various ways of doing it, but nothing has gotten me the correct results. My most recent attempt looks like this:
"((\"|[^"])+?)"
In various test online, this works the way it should - but when I build my ASP.NET page, it cuts off at the first ", leaving me with just the a-letter, white space and a backslash.
The logic behind the pattern above is to capture all instances of \" or something that is not ". I was hoping this would search for \", making sure to find those first - but I got the feeling that this is overridden by the second part of the expression, which is only 1 single character. A single backslash does not match 2 characters (\"), but it will match as a non-". And from there, the next character will be a single ", and the matching is completed. (This is just my hypothesis on why my pattern is failing.)
Any pointers on this one? I have tried various combinations with "look"-methods in regex, but I didn't really get anywhere. I also get the feeling that is what I need.

ORIGINAL ANSWER
To match a string like a \"b\" c, you need to use following regex declaration:
(?:\\"|[^"])+
var rx = Regex(#"(?:\\""|[^""])+");
See RegexStorm demo
Here is an IDEONE demo:
var str = "a \\\"b\\\" c";
Console.WriteLine(str);
var rx = new Regex(#"(?:\\""|[^""])+");
Console.WriteLine(rx.Match(str).Value);
Please note the # in front of the string literal that lets us use verbatim string literals where we have to double quotes to match literal quotes and use single escape slashes instead of double. This makes regexps easier to read and maintain.
If you want to match any escaped entities in your input string, you can use:
var rx = new Regex(#"[^""\\]*(?:\\.[^""\\]*)*");
See demo on RegexStorm
UPDATE
To match the quoted strings, just add quotes around the pattern:
var rx = new Regex(#"""(?<res>[^""\\]*(?:\\.[^""\\]*)*)""");
This pattern yields much better performance than Tim Long's suggested regex, see RegexHero test resuls:

The following expression worked for me:
"(?<Result>(\\"|.)*)"
The expression matches as follows:
An opening quote (literal ")
A named capture (?<name>pattern) consisting of:
Zero or more occurences * of literal \" or (|) any single character (.)
A final closing quote (literal ")
Note that the * (zero or more) quantifier is non-greedy so the final quote is matched by the literal " and not the "any single character" . part.
I used ReSharper 9's built-in Regular Expression validator to develop the expression and verify the results:
I have used the "Explicit Capture" option to reduce cruft in the output (RegexOptions.ExplicitCapture).
One thing to note is that I am matching the whole string, but I am only capturing the substring, using a named capture. Using named captures is a really useful way to get at the results you want. In code, it might look something like this:
static string MatchQuotedString(string input)
{
const string pattern = #"""(?<Result>(\\""|.)*)""";
const RegexOptions options = RegexOptions.ExplicitCapture;
Regex regex = new Regex(pattern, options);
var matches = regex.Match(input);
var substring = matches.Groups["Result"].Value;
return substring;
}
Optimization: If you are planning on using the regex a lot, you could factor it out into a field and use the RegexOptions.Compiled option, this pre-compiles the expression and gives you faster throughput at the expense of longer initialization.

C# .NET Regex remove all quotes of quotes excluding one instance in a sentance

I have description field which is:
16" Alloy Upgrade
In CSV format it appears like this:
"16"" Alloy Upgrade "
What would be the best use of regex to maintain the original format? As I'm learning I would appreciate it being broke down for my understanding.
I'm already using Regex to split some text separating 2 fields which are: code, description. I'm using this:
,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))
My thoughts are to remove the quotes, then remove the delimiter excluding use in sentences.
Thanks in advance.

If you don't want to/can't use a standard CSV parser (which I'd recommend), you can strip all non-doubled quotes using a regex like this:
Regex.Replace(text, #"(?!="")""(?!"")",string.Empty)
That regex will match every " character not preceded or followed by another ".

I wouldn't use regex since they are usually confusing and totally unclear what they do (like the one in your question for example). Instead this method should do the trick:
public string CleanField(string input)
{
if (input.StartsWith("\"") && input.EndsWith("\""))
{
string output = input.Substring(1,input.Length-2);
output = output.Replace("\"\"","\"");
return output;
}
else
{
//If it doesn't start and end with quotes then it doesn't look like its been escaped so just hand it back
return input;
}
}
It may need tweaking but in essence it checks if the string starts and ends with a quote (which it should if it is an escaped field) and then if so takes the inside part (with the substring) and then replaces double quotes with single quotes. The code is a bit ugly due to all the escaping but there is no avoiding that.
The nice thing is this can be used easily with a bit of Linq to take an existing array and convert it.
processedFieldArray = inputfieldArray.Select(CleanField).ToArray();
I'm using arrays here purely because your linked page seems to use them where you are wanting this solution.

Regex-How to remove comma which is between " and "?

How to remove ,(comma) which is between "(double inverted comma) and "(double inverted comma). Like there is "a","b","c","d,d","e","f" and then from this, between " and " there is one comma which should be removed and after removing that comma it should be "a","b","c","dd","e","f" with the help of the regex in C# ?
EDIT: I forgot to specify that there may be double comma between quotes like "a","b","c","d,d,d","e","f" for it that regex does not work. and there can be any number of comma between quotes.
And there can be string like a,b,c,"d,d",e,f then there should be result like a,b,c,dd,e,f and if string like a,b,c,"d,d,d",e,f then result should be like a,b,c,ddd,e,f.

Assuming the input is as simple as your examples (i.e., not full-fledged CSV data), this should do it:
string input = #"a,b,c,""d,d,d"",e,f,""g,g"",h";
Console.WriteLine(input);
string result = Regex.Replace(input,
#",(?=[^""]*""(?:[^""]*""[^""]*"")*[^""]*$)",
String.Empty);
Console.WriteLine(result);
output: a,b,c,"d,d,d",e,f,"g,g",h
a,b,c,"ddd",e,f,"gg",h
The regex matches any comma that is followed by an odd number of quotation marks.
EDIT: If fields are quoted with apostrophes (') instead of quotation marks ("), the technique is exactly the same--except you don't have to escape the quotes:
string input = #"a,b,c,'d,d,d',e,f,'g,g',h";
Console.WriteLine(input);
string result = Regex.Replace(input,
#",(?=[^']*'(?:[^']*'[^']*')*[^']*$)",
String.Empty);
Console.WriteLine(result);
If some fields were quoted with apostrophes while others were quoted with quotation marks, a different approach would be needed.
EDIT: Probably should have mentioned this in the previous edit, but you can combine those two regexes into one regex that will handle either apostrophes or quotation marks (but not both):
#",(?=[^']*'(?:[^']*'[^']*')*[^']*$|[^""]*""(?:[^""]*""[^""]*"")*[^""]*$)"
Actually, it will handle simple strings like 'a,a',"b,b". The problem is that there would be nothing to stop you from using one of the quote characters in a quoted field of the other type, like '9" Nails' (sic) or "Kelly's Heroes". That's taking us into full-fledged CSV territory (if not beyond), and we've already established that we're not going there. :D

They're called regular expressions for a reason — they are used to process strings that meet a very specific and academic definition for what is "regular". It looks like you have some fairly typical csv data here, and it happens that csv strings are outside of that specific definition: csv data is not formally "regular".
In spite of this, it can be possible to use regular expressions to handle csv data. However, to do so you must either use certain extensions to normal regular expressions to make them Turing complete, know certain constraints about your specific csv data that is not promised in the general case, or both. Either way, the expressions required to do this are unwieldly and difficult to manage. It's often just not a good idea, even when it's possible.
A much better (and usually faster) solution is to use a dedicated CSV parser. There are two good ones hosted at code project (FastCSV and Linq-to-CSV), there is one (actually several) built into the .Net Framework (Microsoft.VisualBasic.TextFieldParser), and I have one here on Stack Overflow. Any of these will perform better and just plain work better than a solution based on regular expressions.
Note here that I'm not arguing it can't be done. Most regular expression engines today have the necessary extensions to make this possible, and most people parsing csv data know enough about the data they're handling to constrain it appropriately. I am arguing that it's slower to execute, harder to implement, harder to maintain, and more error-prone compared to a dedicated parser alternative, which is likely built into whichever platform you're using, and is therefore not in your best interests.

var input = "\"a\",\"b\",\"c\",\"d,d\",\"e\",\"f\"";
var regex = new Regex("(\"\\w+),(\\w+\")");
var output = regex.Replace(input,"$1$2");
Console.WriteLine(output);
You'd need to evaluate whether or not \w is what you want to use.

You can use this:
var result = Regex.Replace(yourString, "([a-z]),", "$1");
Sorry, after seeing your edits, regular expressions are not appropriate for this.

This should be very simple using Regex.Replace and a callback:
string pattern = #"
"" # open quotes
[^""]* # some not quotes
"" # closing quotes
";
data = Regex.Replace(data, pattern, m => m.Value.Replace(",", ""),
RegexOptions.IgnorePatternWhitespace);
You can even make a slight modification to allow escaped quotes (here I have \", and the comments explain how to use "":
string pattern = #"
\\. # escaped character (alternative is be """")
|
(?<Quotes>
"" # open quotes
(?:\\.|[^""])* # some not quotes or escaped characters
# the alternative is (?:""""|[^""])*
"" # closing quotes
)
";
data = Regex.Replace(data, pattern,
m => m.Groups["Quotes"].Success ? m.Value.Replace(",", "") : m.Value,
RegexOptions.IgnorePatternWhitespace);
If you need a single quote replace all "" in the pattern with a single '.

Something like the following, perhaps?
"(,)"

Finding C#-style unescaped strings using regular expressions

I'm trying to write a regular expression that finds C#-style unescaped strings, such as
string x = #"hello
world";
The problem I'm having is how to write a rule that handles double quotes within the string correctly, like in this example
string x = #"before quote ""junk"" after quote";
This should be an easy one, right?

Try this one:
#".*?(""|[^"])"([^"]|$)
The first parantheses mean 'If there is an " before the finishing quote, it better be two of them', the second parantheses mean 'After the finishing quote, there sould ether be not a quote, or the end of the line'.

How 'bout the regex #\"([^\"]|\"\")*\"(?=[^\"])
Due to greedy matching, the final lookahead clause is likely not to be needed in your regex engine, although it is more specific.

If I remember correctly, you have to use \"" - the double-double quotes to hash it for C# and the backslash to hash it for regex.

Try this:
#"[^"]*?(""[^"]*?)*";
It looks for the starting characters #", for the ending characters "; (you can leave the semicolon out if you need to) and in between it can have any characters except quotes, or if there are quotes they have to be doubled.

#"(?:""|[^"])*"(?!")
is the right regex for this job. It matches the #, a quote, then either two quotes in a row or any non-quote character, repeating this up unto the next quote (that isn't doubled).

"^#(""|[^"])*$" is the regex you want, looking for first an at-sign and a double-quote, then a sequence of any characters (except double-quotes) or double double-quotes, and finally a double-quote.
As a string literal in C#, you'd have to write it string regex = "^#\"(\"\"|[^\"])*\"$"; or string regex = #"^#""(""""|[^""])*""$";. Choose your poison.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.