Regex Failing "Unrecognized Escape sequence" - c#

module107 should be matching the sample text Module ID="107"
Can you help me understand where I am going wrong in the code?
var module107 = Regex("\A*Module\sID=\"107\"");
ERROR: Unrecognized escape sequence

The problem here is, you want to escape for two different levels. The \A is a escape sequence for the regex. But the issue is, there is at first the string that tries to interpret escape sequences and the string does not know the escape sequence \A or \s (I don't know).
Two solutions are possible:
if you are escaping for the regex, double the \. So
var module107 = Regex("\\A*Module\\sID=\"107\"");
is the string and after the string is processed, the regex is \A*Module\sID="107"
Use verbatim strings. If you add a # before the string, escape sequences are not evaluated by the string. So Regex(#"\A*Module\sID=") would end as regex \A*Module\sID=
But now you are getting problems with the " you want to have in the regex. You can add a " to a verbatim string by doubling it:
var module107 = Regex(#"\A*Module\sID=""107""");

Description
This will match the module id="107" where the number is any quantity of digits surrounded by double quotes. I've changed your escaped quotes with [""] so they can be nested into a string. I'm using \b which will look for the word break and will allow the string to appear anywhere in the input. But if you're looking to validate a specific text, then you can do the \A or ^ to denote the beginning of the string instead.
\b(Module\s+ID=[""](\d{1,})[""])
Groups
Group 0 will capture the entire string
will get the have from Module to the second quote
will get the value inside the quotes
C# Code Example:
using System;
using System.Text.RegularExpressions;
namespace myapp
{
class Class1
{
static void Main(string[] args)
{
String sourcestring = "for Module ID=""107"" Can you h";
Regex re = new Regex(#"\b(Module\s+ID=[""](\d{1,})[""])",RegexOptions.IgnoreCase);
MatchCollection mc = re.Matches(sourcestring);
int mIdx=0;
foreach (Match m in mc)
{
for (int gIdx = 0; gIdx < m.Groups.Count; gIdx++)
{
Console.WriteLine("[{0}][{1}] = {2}", mIdx, re.GetGroupNames()[gIdx], m.Groups[gIdx].Value);
}
mIdx++;
}
}
}
}
$matches Array:
(
[0] => Array
(
[0] => Module ID="107"
)
[1] => Array
(
[0] => Module ID="107"
)
[2] => Array
(
[0] => 107
)
)

The key thing is that the text you've typed is interpreted as a string first, then as a Regex. The string interpretation also looks at '\'s and uses them in its interpretation.
As Tyanna says, this means you need to escape those '\'s so that they don't get 'used up' as the string is read or confuse the string parser.
An alternative approach you might like to try is to use a string literal. This can be a bit cleaner when working with Regexes, as you don't end up with lots of slashes (just more double-quotes sometimes):
var module107 = new Regex(#"\A*Module\sID=""107""");

Related

Removing special characters using Regex in C#

I have one problem in this code. I want to remove all special characters but the square brackets are not getting removed.
string regExp = "[\\\"]";
string tmp = Regex.Replace(str, regExp," ");
string[] strArray = tmp.Split(',');
obj.amcid = db.Execute("select MAX(amcid)+1 from sca_amcmaster");
foreach (string i in strArray)
{
// int myInts = int.Parse(i);
db.Execute(";EXEC insertitems1 #0,#1", i, obj.invoiceno);
}
Square Brackets are metacharacters in Regular Expressions, which allow us to define list of things. So if you want to match then using Regex you need to change your expression to:
string regExp = "\[\\\"\]";
Therefore, you simply need to include the backslashes before the square brackets to match then too.
If none of them are required in the expression, you can group then using brackets, and the character ? (zero or more matches):
string regExp = "(\[)?(\\)?(\")?(\])?";

Regex pattern for splitting a delimited string in curly braces

I have the following string
{token1;token2;token3#somewhere.com;...;tokenn}
I need a Regex pattern, that would give a result in array of strings such as
token1
token2
token3#somewhere.com
...
...
...
tokenn
Would also appreciate a suggestion if can use the same pattern to confirm the format of the string, means string should start and end in curly braces and at least 2 values exist within the anchors.
You may use an anchored regex with named repeated capturing groups:
\A{(?<val>[^;]*)(?:;(?<val>[^;]*))+}\z
See the regex demo
\A - start of string
{ - a {
(?<val>[^;]*) - Group "val" capturing 0+ (due to * quantifier, if the value cannot be empty, use +) chars other than ;
(?:;(?<val>[^;]*))+ - 1 or more occurrences (thus, requiring at least 2 values inside {...}) of the sequence:
; - a semi-colon
(?<val>[^;]*) - Group "val" capturing 0+ chars other than ;
} - a literal }
\z - end of string.
.NET regex keeps each capture in a CaptureCollection stack, that is why all the values captured into "num" group can be accessed after a match is found.
C# demo:
var s = "{token1;token2;token3;...;tokenn}";
var pat = #"\A{(?<val>[^;]*)(?:;(?<val>[^;]*))+}\z";
var caps = new List<string>();
var result = Regex.Match(s, pat);
if (result.Success)
{
caps = result.Groups["val"].Captures.Cast<Capture>().Select(t=>t.Value).ToList();
}
Read it(similar to your problem): How to keep the delimiters of Regex.Split?.
For your RegEx testing use this: http://www.regexlib.com/RETester.aspx?AspxAutoDetectCookieSupport=1.
But RegEx is a very resource-intensive, slow operation.
In your case will be better to use the Split method of string class, for example : "token1;token2;token3;...;tokenn".Split(';');. It will return to you a collection of strings, that you want to obtain.

Regular expression to remove whitespace around a comma, except when quoted

I have a CSV file that has rows resembling this:
1, 4, 2, "PUBLIC, JOHN Q" ,ACTIVE , 1332
I am looking for a regular expression replacement that will match against these rows and spit out something resembling this:
1,4,2,"PUBLIC, JOHN Q",ACTIVE,1332
I thought this would be rather easy: I made the expression ([ \t]+,) and replaced it with ,. I made a complement expression (,[ \t]+) with a replacement of , and I thought I had achieved a good means of right-trimming and left-trimming strings.
...but then I noticed that my "PUBLIC, JOHN Q" was now "PUBLIC,JOHN Q" which isn't what I wanted. (Note the space following the comma is now gone).
What would be the appropriate expression to trim the white space before and after a comma, but leave quoted text untouched?
UPDATE
To clarify, I am using an application to handle the file. This application allows me to define multiple regular expression replacements; it does not provide a parsing capability. While this may not be the ideal mechanism for this, it would sure beat making another application for this one file.
If the engine used by your tool is the C# regular expression engine, then you can try the following expression:
(?<!,\s*"(?:[^\\"]|\\")*)\s+(?!(?:[^\\"]|\\")*"\s*,)
replace with empty string.
The guys answers assumed the quotes are balanced and used counting to determine if the space is part of a quoted value or not.
My expression looks for all spaces that are not part of a quoted value.
RegexHero Demo
Something like this might do the job:
(?<!(^[^"]*"[^"]*(("[^"]*){2})*))[\t ]*,[ \t]*
Which matches [\t ]*,[ \t]*, only when not preceded by an odd number of quotes.
Going with some CSV library or parsing the file yourself would be much more easier, and IMO should be preferable option here.
But if you really insist on a regex, you can use this one:
"\s+(?=([^\"]*\"[^\"]*\")*[^\"]*$)"
And replace it with empty string - ""
This regex matches one or more whitespaces, followed by an even number of quotes. This will of course work only if you have balanced quote.
(?x) # Ignore Whitespace
\s+ # One or more whitespace characters
(?= # Followed by
( # A group - This group captures even number of quotes
[^\"]* # Zero or more non-quote characters
\" # A quote
[^\"]* # Zero or more non-quote characters
\" # A quote
)* # Zero or more repetition of previous group
[^\"]* # Zero or more non-quote characters
$ # Till the end
) # Look-ahead end
string format(string val)
{
if (val.StartsWith("\"")) val = " " + val;
string[] vals = val.Split('\"');
for (int i = 0; i < vals.Length; i += 2) vals[i] = vals[i].Replace(" ", "").Replace("\t", "");
return string.Join("\t", vals);
}
This will work if you have properly closed quoted strings in between
Forget the regex (See Bart's comment on the question, regular expressions aren't suitable for CSV).
public static string ReduceSpaces( string input )
{
char[] a = input.ToCharArray();
int placeComma = 0, placeOther = 0;
bool inQuotes = false;
bool followedComma = true;
foreach( char c in a ) {
inQuotes ^= (c == '\"');
if (c == ' ') {
if (!followedComma)
a[placeOther++] = c;
}
else if (c == ',') {
a[placeComma++] = c;
placeOther = placeComma;
followedComma = true;
}
else {
a[placeOther++] = c;
placeComma = placeOther;
followedComma = false;
}
}
return new String(a, 0, placeComma);
}
Demo: http://ideone.com/NEKm09

regex how to extract strings between delimiters and not double delimiters

I'm sure this is a duplicate but cannot find the right search criteria.
Basically I have a user supplied string with keywords enclosed in braces, I want a regex that will find the keywords but will ignore double sets of the delimiters.
example: "A cat in a {{hat}} doesn't bite {back}."
I need regex that will return {back} but not {hat}.
This is for C#.
This is what you are looking for
(?<!\{)\{\w+\}(?!\})
The answer will vary slightly depending on which regex parser you are using, but something like the following is probably what you want:
(?:[^{]){([^}]*)}|{([^}]*)}(?:[^}])|^{([^}]*)}$
Non "{" (not part of the match) followed by "{" (capturing) all the non "}" chars (end capturing) followed by "}", or...
"{" followed by "{" (capturing) all the non "}" chars (end capturing) followed by non "}" (not part of the match), or...
Start-of-line followed by "{" (capturing) all the non "}" chars (end capturing) followed by end-of-line
Note that some parsers may not recognize the "?:" operator and some parsers may require that some or all of the following chars (when not inside of "[]") be backslash escaped: { } ( ) |
Description
Try this out, this will require the open and close brackets to be single. Double brackets will be ignored.
See also this permlink example
[^{][{]([^}]*)[}][^}]
c# example
using System;
using System.Text.RegularExpressions;
namespace myapp
{
class Class1
{
static void Main(string[] args)
{
String sourcestring = "A cat in a {{hat}} doesn't bite {back}.";
Regex re = new Regex(#"[^{][{]([^}]*)[}][^}]");
MatchCollection mc = re.Matches(sourcestring);
int mIdx=0;
foreach (Match m in mc)
{
for (int gIdx = 0; gIdx < m.Groups.Count; gIdx++)
{
Console.WriteLine("[{0}][{1}] = {2}", mIdx, re.GetGroupNames()[gIdx], m.Groups[gIdx].Value);
}
mIdx++;
}
}
}
}
$matches Array:
(
[0] => Array
(
[0] => {back}.
)
[1] => Array
(
[0] => back
)
)
Not too hard just used a Regex helper :
(?:[^{]){([^}]*)}|{([^}]*)}(?:[^}])|^{([^}]*)}$

How to grab specific elements out of a string

I need to be able to grab specific elements out of a string that start and end with curly brackets. If I had a string:
"asjfaieprnv{1}oiuwehern{0}oaiwefn"
How could I grab just the 1 followed by the 0.
Regex is very useful for this.
What you want to match is:
\{ # a curly bracket
# - we need to escape this with \ as it is a special character in regex
[^}] # then anything that is not a curly bracket
# - this is a 'negated character class'
+ # (at least one time)
\} # then a closing curly bracket
# - this also needs to be escaped as it is special
We can collapse this to one line:
\{[^}]+\}
Next, you can capture and extract the inner contents by surrounding the part you want to extract with parentheses to form a group:
\{([^}]+)\}
In C# you'd do:
var matches = Regex.Matches(input, #"\{([^}]+)\}");
foreach (Match match in matches)
{
var groupContents = match.Groups[1].Value;
}
Group 0 is the whole match (in this case including the { and }), group 1 the first parenthesized part, and so on.
A full example:
var input = "asjfaieprnv{1}oiuwehern{0}oaiwef";
var matches = Regex.Matches(input, #"\{([^}]+)\}");
foreach (Match match in matches)
{
var groupContents = match.Groups[1].Value;
Console.WriteLine(groupContents);
}
Outputs:
1
0
Use the Indexof method:
int openBracePos = yourstring.Indexof ("{");
int closeBracePos = yourstring.Indexof ("}");
string stringIWant = yourstring.Substring(openBracePos, yourstring.Len() - closeBracePos + 1);
That will get your first occurrence. You need to slice your string so that the first occurrence is no longer there, then repeat the above procedure to find your 2nd occurrence:
yourstring = yourstring.Substring(closeBracePos + 1);
Note: You MAY need to escape the curly braces: "{" - not sure about this; have never dealt with them in C#
This looks like a job for regular expressions
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
string str = "asjfaieprnv{1}oiuwe{}hern{0}oaiwefn";
Regex regex = new Regex(#"\{(.*?)\}");
foreach( Match match in regex.Matches(str))
{
Console.WriteLine(match.Groups[1].Value);
}
}
}
}

Categories