Catching a pattern, but ignoring it within quotes - c#

So, what I need to do in c# regex is basically split a string whenever I find a certain pattern, but ignore that pattern if it is surrounded by double quotes in the string.
Example:
string text = "abc , def , a\" , \"d , oioi";
string pattern = "[ \t]*,[ \t]*";
string[] result = Regex.Split(text, pattern, RegexOptions.ECMAScript);
Wanted result after split (3 splits, 4 strings):
{"abc",
"def",
"a\" , \"d",
"oioi"}
Actual result (4 splits, 5 strings):
{"abc",
"def",
"a\"",
"\"d",
"oioi"}
Another example:
string text = "a%2% 6y % \"ad%t6%&\" %(7y) %";
string pattern = "%";
string[] result = Regex.Split(text, pattern, RegexOptions.ECMAScript);
Wanted result after split (5 splits, 6 strings):
{"a",
"2",
" 6y ",
" \"ad%t6%&\" ",
"(7y) ",
""}
Actual result (7 splits, 8 strings):
{"a",
"2",
" 6y ",
"\"ad",
"t6",
"&\" ",
"(7y) ",
""}
A 3rd example, to exemplify a tricky split where only the first case should be ignored:
string text = "!!\"!!\"!!\"";
string pattern = "!!";
string[] result = Regex.Split(text, pattern, RegexOptions.ECMAScript);
Wanted result after split (2 splits, 3 strings):
{"",
"\"!!\"",
"\""}
Actual result (3 splits, 4 strings):
{"",
"\"",
"\"",
"\""}
So, how do I move from pattern to a new pattern that achieves the desired result?
Sidenote: If you're going to mark someone's question as duplicate (and I have nothing against that), at least point them to the right answer, not to some random post (yes, I'm looking at you, Mr. Avinash Raj)...

The rules are more or less like in a csv line except that:
the delimiter can be a single character, but it can be a string or a pattern too (in these last cases items must be trimmed if they start or end with the last or first possible tokens of the pattern delimiter),
an orphan quote is allowed for the last item.
First, when you want to separate items according to slightly more advanced rules, the split method is no longer a good choice. It is only handy for simple situations, not for your case. (Even without orphan quotes, using split with ,(?=(?:[^"]*"[^"]*")*[^"]*$) is a very bad idea, since the number of steps needed to parse the string grows exponentially with the string size.)
The other approach consists of capturing the items. That is simpler and faster. (Bonus: it checks the format of the whole string at the same time.)
Here is a general way to do it:
^
(?>
(?:delimiter | start_of_the_string)
(
simple_part
(?>
(?: quotes | delim_first_letter_1 | delim_first_letter_2 | etc. )
simple_part
)*
)
)+
$
Example with \s*,\s* as delimiter:
^
# non-capturing group for one delimiter and one item
(?>
(?: \s*,\s* | ^ ) # delimiter or start of the string
# (optionally change "^" to "^ \s*" to trim the first item)
# capture group 1 for the item
( # simple part of the item (maybe empty):
[^\s,"]* # all that is not the quote character or one of the possible first
# character of the delimiter
# edge case followed by a simple part
(?>
(?: # edge cases
" [^"]* (?:"|$) # a quoted part or an orphan quote in the last item (*)
| # OR
(?> \s+ ) # start of the delimiter
(?!,) # but not the delimiter
)
[^\s,"]* # simple part
)*
)
)+
$
The pattern is designed for the Regex.Match method, since it describes the whole string. All items are available in group 1, because the .NET regex flavor stores every capture of a repeated group (in Group.Captures).
This example can be easily adapted to all cases.
(*) if you want to allow escaped quotes inside quoted parts, you can use one more time simple_part (?: edge_case simple_part)* instead of " [^"]* (?:"|$), i.e: "[^\\"]* (?: \\. [^\\"]*)* (?:"|$)
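As a minimal sketch of how the \s*,\s* example above could be wired up in C#: the class name QuotedSplit and the method Items are hypothetical, and the pattern is the one shown above with its comments stripped. RegexOptions.IgnorePatternWhitespace is needed because the pattern is written with free spacing, and each double quote is doubled because it sits inside a verbatim string.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuotedSplit
{
    // Same pattern as above; "" is a literal double quote in a verbatim string.
    private const string Pattern = @"
        ^
        (?>
            (?: \s*,\s* | ^ )
            (
                [^\s,""]*
                (?>
                    (?: "" [^""]* (?:""|$) | (?> \s+ ) (?!,) )
                    [^\s,""]*
                )*
            )
        )+
        $";

    // Hypothetical helper: returns every capture of group 1,
    // or an empty array if the whole string does not fit the format.
    public static string[] Items(string text)
    {
        Match m = Regex.Match(text, Pattern, RegexOptions.IgnorePatternWhitespace);
        if (!m.Success)
        {
            return Array.Empty<string>();
        }
        return m.Groups[1].Captures.Cast<Capture>().Select(c => c.Value).ToArray();
    }
}
```

Calling QuotedSplit.Items("abc , def , a\" , \"d , oioi") should give the four items from the first example in the question, with the quoted comma left inside the third item.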

I think this is a two step process and it has been overthought trying to make it a one step regex.
Steps
Simply remove any quotes from a string.
Split on the target character(s).
Example of Process
I will split on the , for step 2.
var data = string.Format("abc , def , a{0}, {0}d , oioi", "\"");
// `\x22` is the hex escape for a double quote ("), which is easier to read in C# source.
var stage1 = Regex.Replace(data, @"\x22", string.Empty);
// abc , def , a", "d , oioi
// becomes
// abc , def , a, d , oioi
var items = Regex.Matches(stage1, @"([^\s,]+)[\s,]*")
.OfType<Match>()
.Select(mt => mt.Groups[1].Value)
.ToList();

Related

How to extract string two, one in between brackets and two not in brackets?

I'm trying to figure out a way to use regular expressions to extract a string into two values. An example string is
"Regional Store 1 - Madison [RSM1]"
"Regional Store 2 [SS2]"
and I would like to have them extracted to "Regional Store 1 - Madison", "RSM1" and "Regional Store 2", "SS2".
I've tried using the regular expression (?<=[)(.*?)(?=]) but it gives me "Regional Store 2 [", "SS2", "]". Other regular expressions I've tried give me "Regional Store 2", "[SS2]".
Since the strings will always follow the format of "{Name} [{code}]" I'm wondering if I should just be using string.split instead.
You may solve the problem without regex by trimming off the trailing ] and splitting with "[" or " [":
var s = "Regional Store 1 - Madison [RSM1]";
var chunks = s.TrimEnd(']').Split(" [");
Console.WriteLine("Name={0}, Code={1}", chunks[0], chunks[1]);
// => Name=Regional Store 1 - Madison, Code=RSM1
Or, with a regex:
var pattern = @"^(.*?)\s*\[([^][]*)]$";
chunks = Regex.Match(s, pattern)?.Groups.Cast<Group>().Skip(1).Select(x => x.Value).ToArray();
Console.WriteLine("Name={0}, Code={1}", chunks[0], chunks[1]);
// => Name=Regional Store 1 - Madison, Code=RSM1
See the C# demo and the regex demo.
Pattern details
^ - start of string
(.*?) - Group 1: any zero or more chars other than a newline char, as few as possible
\s* - zero or more whitespace chars
\[ - a [ char
([^][]*) - Group 2: any zero or more chars other than [ and ]
]$ - a ] char and end of string.
Please try this pattern:
MatchCollection mcCollection = Regex.Matches(sIdentifire, @"(Regional\s+Store\s+.+?\[.+?\])");

Regex split by same character within brackets

I have a long string, like so:
(A) name1, name2, name3, name3 (B) name4, name5, name7 (via name7) ..... (AA) name47, name47 (via name 46) (BB) name48, name49
Currently I split by " (", but that also picks up the "(via ...)" parts as new lines:
string[] lines = routesRaw.Split(new[] { " (" }, StringSplitOptions.RemoveEmptyEntries);
How can I split the information within the first brackets only? There is no AB, AC, AD, etc. the characters are always the same within the brackets.
Thanks.
You may use a matching approach here since the pattern you need will contain a capturing group in order to be able to match the same char 0 or more amount of times, and Regex.Split outputs all captured substrings together with non-matches.
I suggest
(?s)(.*?)(?:\(([A-Z])\2*\)|\z)
Grab all non-empty Group 1 values. See the regex demo.
Details
(?s) - a dotall, RegexOptions.Singleline option that makes . match newlines, too
(.*?) - Group 1: any 0 or more chars, but as few as possible
(?:\(([A-Z])\2*\)|\z) - a non-capturing group that matches:
\(([A-Z])\2*\) - (, then Group 2 capturing any uppercase ASCII letter, then any 0 or more repetitions of this captured letter and then )
| - or
\z - the very end of the string.
In C#, use
var results = Regex.Matches(text, @"(?s)(.*?)(?:\(([A-Z])\2*\)|\z)")
.Cast<Match>()
.Select(x => x.Groups[1].Value)
.Where(z => !string.IsNullOrEmpty(z))
.ToList();
See the C# demo online.

check is valid my string in custom format? the number of brackets

I need a regex to check this format:
[some digits][some digits][some digits][some digits][some digits][some digits]#
"some digits" means any number (0, 1, 2, 3, ...), with one, two, three, or more digits,
but it's important that each open bracket is closed before another one is opened.
Actually, I want to check the format and also get the number of [] pairs.
I tried this code to get the number of [] pairs:
Regex.Matches( input, "[]" ).Count
but it didn't work.
Thanks for helping.
This is the regex you're looking for:
^(\[\d+\])+#$
See the demo.
Sample Code for the Count
var myRegex = new Regex(@"^(\[\d+\])+#$");
int bracketCount = myRegex.Match(yourString).Groups[1].Captures.Count;
Explanation
The ^ anchor asserts that we are at the beginning of the string
( starts capture Group 1
\[ opens a bracket
\d+ matches one or more digits
\] matches the closing bracket
) closes Group 1
+ matches this 1 or more times
# the hash
The $ anchor asserts that we are at the end of the string
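A small self-contained sketch of the validation plus count, assuming the count is read from the captures of the repeated Group 1. The names BracketFormat and CountGroups, and the choice of -1 for invalid input, are hypothetical and only there for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BracketFormat
{
    private static readonly Regex Format = new Regex(@"^(\[\d+\])+#$");

    // Hypothetical helper: returns the number of [digits] groups,
    // or -1 when the input does not match the format at all.
    public static int CountGroups(string input)
    {
        Match m = Format.Match(input);
        // Group 1 is repeated by +, so every repetition is kept in Captures.
        return m.Success ? m.Groups[1].Captures.Count : -1;
    }
}
```

For instance, BracketFormat.CountGroups("[1][22][333]#") should return 3, while an input with a nested or unclosed bracket such as "[1[2]]#" fails the match and returns -1.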

Semi-Fancy Regex.Replace() function

Words placed after these punctuation marks must be capitalized (note that there may be spaces or special characters on either side of these when used):
dash ( - ), slash ( / ), colon ( : ), period ( . ), question mark ( ? ), exclamation
point ( ! ), ellipsis (... OR …) (they are different)
I am sort of bogged down on this puzzle because of all of the special regex characters that I am trying to literally look for in my search. I believe I can use Regex.Escape although I cannot get it working for me right now in this case.
A few examples of starting strings to change to might be:
Change this:
This is a dash - example
To this:
This is a dash - Example <--capitalize "Example" with Regex
This is another dash -example
This is another dash -Example
This is an ellipsis ... example
This is an ellipsis ... Example
This is another ellipsis …example
This is another ellipsis …Example
This is a slash / example
This is a slash / Example
This is a question mark ? example
This is a question mark ? Example
Here is the code I have so far:
private static string[] postCaps = { "-", "/", ":", "?", "!", "...", "…"};
private static string ReplacePostCaps(string strString)
{
foreach (string postCap in postCaps)
{
strString = Regex.Replace(strString, Regex.Escape(postCap), "/(?<=(" + Regex.Escape(postCap) + "))./", RegexOptions.IgnoreCase);
}
return strString;
}
Thank you very much!
You shouldn't need to iterate over a list of punctuation but instead could just add a character set in a single regex:
(?:[/:?!…-]|\.\.\.)\s*([a-z])
To use it with Regex.Replace():
strString = Regex.Replace(
strString,
@"(?:[/:?!…-]|\.\.\.)\s*([a-z])",
m => m.ToString().ToUpper()
);
Regex Explained:
(?: # non-capture set
[/:?!…-] # match any of these characters
| \.\.\. # *or* match three `.` characters in a row
)
\s* # allow any whitespace between matched character and letter
([a-z]) # match, and capture, a single lowercase character
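To try the pattern against the examples from the question, something like the following sketch could be used. The class PostCaps and the method Capitalize are hypothetical names. Note that the pattern as written treats the three-dot ellipsis as a trigger but not a single period.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PostCaps
{
    private static readonly Regex AfterPunctuation =
        new Regex(@"(?:[/:?!…-]|\.\.\.)\s*([a-z])");

    // Hypothetical helper: capitalizes the first lowercase ASCII letter
    // that follows one of the punctuation marks in the pattern.
    public static string Capitalize(string input)
    {
        // Uppercasing the whole match only changes the letter, because the
        // punctuation and whitespace in the match have no uppercase form.
        return AfterPunctuation.Replace(input, m => m.Value.ToUpperInvariant());
    }
}
```

With this, PostCaps.Capitalize("This is a dash - example") should produce "This is a dash - Example", and the ellipsis and slash examples from the question behave the same way.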
Maybe this works for you:
var phrase = "This is another dash ... example";
var rx = new System.Text.RegularExpressions.Regex(@"(?<=[\-./:?!]) *\w");
var newString = rx.Replace(phrase, new System.Text.RegularExpressions.MatchEvaluator(m => m.Value.ToUpperInvariant()));

Basic regex for 16 digit numbers

I currently have a regex that pulls up a 16 digit number from a file e.g.:
Regex:
Regex.Match(l, @"\d{16}")
This would work well for a number as follows:
1234567891234567
Although how could I also include numbers in the regex such as:
1234 5678 9123 4567
and
1234-5678-9123-4567
If all groups are always 4 digit long:
\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b
to be sure the delimiter is the same between groups:
\b\d{4}(| |-)\d{4}\1\d{4}\1\d{4}\b
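A quick sketch of how the backreference version behaves, assuming it is used with Regex.IsMatch. The class SixteenDigits and the method IsValid are hypothetical names. Group 1 captures the first delimiter (nothing, a space, or a dash), and \1 then forces the same delimiter between the remaining groups.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SixteenDigits
{
    private static readonly Regex SameDelimiter =
        new Regex(@"\b\d{4}(| |-)\d{4}\1\d{4}\1\d{4}\b");

    // Hypothetical helper: true when the text contains 16 digits in groups
    // of four separated consistently by nothing, spaces, or dashes.
    public static bool IsValid(string text)
    {
        return SameDelimiter.IsMatch(text);
    }
}
```

All three formats from the question are accepted, while a mixed form such as "1234-5678 9123 4567" is rejected because the second delimiter does not repeat the first.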
If it's always all together or groups of fours, then one way to do this with a single regex is something like:
Regex.Match(l, @"\d{16}|\d{4}[- ]\d{4}[- ]\d{4}[- ]\d{4}")
You could try something like:
^([0-9]{4}[\s-]?){3}([0-9]{4})$
That should do the trick.
Please note:
This also allows
1234-5678 9123 4567
It's not strict on only dashes or only spaces.
Another option is to just use the regex you currently have, and strip all offending characters out of the string before you run the regex:
var input = fileValue.Replace("-",string.Empty).Replace(" ",string.Empty);
Regex.Match(input, @"\d{16}");
Here is a pattern that captures all the digits and strips out the dashes or spaces. It also validates that there are exactly 16 digits. The RegexOptions.IgnorePatternWhitespace option is only there so that the pattern can be commented; it doesn't affect the match processing.
string value = "1234-5678-9123-4567";
string pattern = @"
^ # Beginning of line
( # Place into capture groups for 1 match
(?<Number>\d{4}) # Place into named group capture
(?:[\s-]?) # Allow for a space or dash optional
){4} # Get 4 groups
(?!\d) # 17th number, do not match! abort
$ # End constraint to keep int in 16 digits
";
var result = Regex.Match(value, pattern, RegexOptions.IgnorePatternWhitespace)
.Groups["Number"].Captures
.OfType<Capture>()
.Aggregate (string.Empty, (seed, current) => seed + current);
Console.WriteLine ( result ); // 1234567891234567
// Shows False due to 17 numbers!
Console.WriteLine ( Regex.IsMatch("1234-5678-9123-45678", pattern, RegexOptions.IgnorePatternWhitespace));
