Regex split preserving strings and escape character - c#

I need to split a string in C#, using a space as the delimiter while keeping quoted substrings intact. That part works.
Additionally, I want to allow the escape sequence \" so that quotes can be included inside a quoted string.
Example of what I need:
One Two "Three Four" "Five \"Six\""
To:
One
Two
Three Four
Five "Six"
This is the regex I am currently using. It works for all cases except "Five \"Six\"":
//Split on spaces unless in quotes
List<string> matches = Regex.Matches(input, @"[\""].+?[\""]|[^ ]+")
.Cast<Match>()
.Select(x => x.Value.Trim('"'))
.ToList();
I'm looking for any regex that would do the trick.

You can use
var input = "One Two \"Three Four\" \"Five \\\"Six\\\"\"";
// Console.WriteLine(input); // => One Two "Three Four" "Five \"Six\""
List<string> matches = Regex.Matches(input, @"(?s)""(?<r>[^""\\]*(?:\\.[^""\\]*)*)""|(?<r>\S+)")
.Cast<Match>()
.Select(x => Regex.Replace(x.Groups["r"].Value, @"\\(.)", "$1"))
.ToList();
foreach (var s in matches)
Console.WriteLine(s);
See the C# demo.
The result is
One
Two
Three Four
Five "Six"
The (?s)"(?<r>[^"\\]*(?:\\.[^"\\]*)*)"|(?<r>\S+) regex matches
(?s) - a RegexOptions.Singleline equivalent to make . match newlines, too
"(?<r>[^"\\]*(?:\\.[^"\\]*)*)" - ", then Group "r" capturing any zero or more chars other than " and \ and then zero or more sequences of any escaped char and zero or more chars other than " and \, and then a " is matched
| - or
(?<r>\S+) - Group "r": one or more non-whitespace chars.
The .Select(x => Regex.Replace(x.Groups["r"].Value, @"\\(.)", "$1")) takes the Group "r" value and unescapes (deletes a \ before) all escaped chars.
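To see how the pattern above behaves on a few edge cases, here is a minimal sketch that wraps it in a helper. The name SplitArgs and the sample input are illustrative assumptions, not part of the original answer; the regex and the unescaping step are the ones shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (the name is an assumption): splits on whitespace,
// keeps quoted runs together, and drops the backslash before escaped chars.
static List<string> SplitArgs(string input)
{
    const string pattern = @"(?s)""(?<r>[^""\\]*(?:\\.[^""\\]*)*)""|(?<r>\S+)";
    return Regex.Matches(input, pattern)
        .Cast<Match>()
        .Select(m => Regex.Replace(m.Groups["r"].Value, @"\\(.)", "$1"))
        .ToList();
}

// The verbatim literal below is the text: One "" "a\\b" "Five \"Six\""
foreach (var token in SplitArgs(@"One """" ""a\\b"" ""Five \""Six\"""""))
    Console.WriteLine($"[{token}]");
// [One]
// []
// [a\b]
// [Five "Six"]
```

Note that an empty quoted string yields an empty token, and an escaped backslash (\\) collapses to a single backslash.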

Related

Regex to match words between underscores after second occurrence of underscore

I would like to get the words between underscores after the second occurrence of an underscore.
This is my string:
ABC_BC_BE08_C1000004_0124
I've assembled this expression:
(?<=_)[^_]+
It matches what I need, but it only skips the first word, since there is no underscore before it. I would like it to skip ABC and BC and return just the last three strings. I've tried messing around, but I am stuck and can't make it work. Thanks!
You can use a non-regex approach here with Split and Skip:
var text = "ABC_BC_BE08_C1000004_0124";
var result = text.Split('_').Skip(2);
foreach (var s in result)
Console.WriteLine(s);
Output:
BE08
C1000004
0124
See the C# demo.
With regex, you can use
var result = Regex.Matches(text, @"(?<=^(?:[^_]*_){2,})[^_]+").Cast<Match>().Select(x => x.Value);
See the regex demo and the C# demo. The regex matches
(?<=^(?:[^_]*_){2,}) - a positive lookbehind that requires the following patterns to match immediately to the left of the current location:
^ - start of string
(?:[^_]*_){2,} - two or more ({2,}) sequences of any zero or more chars other than _ ([^_]*) and then a _ char
[^_]+ - one or more chars other than _
In .NET, there is also a Captures collection that you can use with a regex and a repeated capture group.
^[^_]*_[^_]*(?:_([^_]+))+
The pattern matches:
^ Start of string
[^_]*_[^_]* Match zero or more chars other than _, then match _, and again zero or more chars other than _
(?: Non capture group
_([^_]+) Match _ and capture 1 or more times any char except _ in group 1
)+ Close the non capture group and repeat 1 or more times
.NET regex demo | C# demo
For example:
var pattern = @"^[^_]*_[^_]*(?:_([^_]+))+";
var str = "ABC_BC_BE08_C1000004_0124";
var strings = Regex.Match(str, pattern).Groups[1].Captures.Select(c => c.Value);
foreach (String s in strings)
{
Console.WriteLine(s);
}
Output
BE08
C1000004
0124
If you want to match only word characters in between the underscores, another option for a pattern could be using a negated character class [^\W_] excluding the underscore from the word characters in between:
^[^\W_]*_[^\W_]*(?:_([^\W_]+))+
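As a rough sketch, the word-character variant can be used the same way as the pattern above, reading the repeated captures of group 1. The helper name PartsAfterSecondUnderscore is an assumption made for illustration, and Cast<Capture>() is used so the snippet also compiles where CaptureCollection is not generic.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (the name is an assumption): returns every part that
// follows the second underscore, using the captures of group 1.
static string[] PartsAfterSecondUnderscore(string s)
{
    var m = Regex.Match(s, @"^[^\W_]*_[^\W_]*(?:_([^\W_]+))+");
    if (!m.Success)
        return new string[0]; // fewer than three underscore-separated parts
    return m.Groups[1].Captures.Cast<Capture>().Select(c => c.Value).ToArray();
}

Console.WriteLine(string.Join(",", PartsAfterSecondUnderscore("ABC_BC_BE08_C1000004_0124")));
// BE08,C1000004,0124
Console.WriteLine(PartsAfterSecondUnderscore("ABC_BC").Length);
// 0
```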

Regex split by same character within brackets

I have a long string, like so:
(A) name1, name2, name3, name3 (B) name4, name5, name7 (via name7) ..... (AA) name47, name47 (via name 46) (BB) name48, name49
Currently I split by " (", but that also treats each (via ...) part as a new line:
string[] lines = routesRaw.Split(new[] { " (" }, StringSplitOptions.RemoveEmptyEntries);
How can I split the information within the first brackets only? There is no AB, AC, AD, etc. the characters are always the same within the brackets.
Thanks.
You may use a matching approach here, since the pattern you need contains a capturing group (to match the same char zero or more times), and Regex.Split outputs all captured substrings together with the non-matches.
I suggest
(?s)(.*?)(?:\(([A-Z])\2*\)|\z)
Grab all non-empty Group 1 values. See the regex demo.
Details
(?s) - a dotall, RegexOptions.Singleline option that makes . match newlines, too
(.*?) - Group 1: any 0 or more chars, but as few as possible
(?:\(([A-Z])\2*\)|\z) - a non-capturing group that matches:
\(([A-Z])\2*\) - (, then Group 2 capturing any uppercase ASCII letter, then any 0 or more repetitions of this captured letter and then )
| - or
\z - the very end of the string.
In C#, use
var results = Regex.Matches(text, @"(?s)(.*?)(?:\(([A-Z])\2*\)|\z)")
.Cast<Match>()
.Select(x => x.Groups[1].Value)
.Where(z => !string.IsNullOrEmpty(z))
.ToList();
See the C# demo online.
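For illustration, here is a small self-contained sketch of that approach run on a shortened version of the sample string. The helper name SplitOnBracketLabels and the added Trim() call are assumptions that go beyond the answer; without trimming, the Group 1 values keep the spaces around each chunk.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (the name is an assumption): collects the text between
// labels such as (A), (BB), (CCC); "(via ...)" parts are not treated as labels.
static List<string> SplitOnBracketLabels(string text)
{
    return Regex.Matches(text, @"(?s)(.*?)(?:\(([A-Z])\2*\)|\z)")
        .Cast<Match>()
        .Select(m => m.Groups[1].Value.Trim()) // Trim() is an addition to the answer
        .Where(s => s.Length > 0)
        .ToList();
}

var routes = "(A) name1, name2 (B) name4, name5 (via name7) (AA) name47 (via name 46) (BB) name48";
foreach (var line in SplitOnBracketLabels(routes))
    Console.WriteLine(line);
// name1, name2
// name4, name5 (via name7)
// name47 (via name 46)
// name48
```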

Excluding duplicates from a string in results

I am trying to amend this regex so that it does not match duplicates.
Current regex:
[\""].+?[\""]|[^ ]+
Sample string:
".doc" "test.xls", ".doc","me.pdf", "test file.doc"
Expected results:
".doc"
"test.xls"
"me.pdf"
But not
".doc"
"test.xls"
".doc"
"me.pdf"
Note:
Filenames could potentially have spaces e.g. test file.doc
items could be separated by a space or a comma or both
strings could have quotes around or NOT have quotes around e.g. .doc or ".doc".
In C#, you may use a simple regex to extract all valid matches and use .Distinct() to only keep unique values.
The regex is simple:
"(?<ext>[^"]+)"|(?<ext>[^\s,]+)
See the regex demo, you only need Group "ext" values.
Details
"(?<ext>[^"]+)" - ", (group "ext") any 1+ chars other than " and then "
| - or
(?<ext>[^\s,]+) - (group "ext") 1+ chars other than whitespace and comma
The C# code snippet:
var text = "\".doc\" \"test.xls\", \".doc\",\"me.pdf\", \"test file.doc\".doc \".doc\"";
Console.WriteLine(text); // => ".doc" "test.xls", ".doc","me.pdf", "test file.doc".doc ".doc"
var pattern = "\"(?<ext>[^\"]+)\"|(?<ext>[^\\s,]+)";
var results = Regex.Matches(text, pattern)
.Cast<Match>()
.Select(x => x.Groups["ext"].Value)
.Distinct();
Console.WriteLine(string.Join("\n", results));
Output:
.doc
test.xls
me.pdf
test file.doc

Splitting a String with conditions

Given a string of:
"S1 =F A1 =T A2 =T F3 =F"
How can I split it so that the result is an array of 4 strings, where the individual strings look like this:
"S1=F"
"A1=T"
"A2=T"
"F3=F"
Thank you
You can try matching all Name = (T|F) conditions with a regular expression and then remove the white space from each match with the help of LINQ:
using System.Linq;
using System.Text.RegularExpressions;
..
string source = "S1 \t = F A1 = T A2 = T F3 = F";
string[] result = Regex
.Matches(source, @"[A-Za-z][A-Za-z0-9]*\s*=\s*[TF]")
.OfType<Match>()
.Select(match => string.Concat(match.Value.Where(c => !char.IsWhiteSpace(c))))
.ToArray();
Console.WriteLine(string.Join(Environment.NewLine, result));
Outcome:
S1=F
A1=T
A2=T
F3=F
Edit: What's going on. The first part is the regular expression matching:
... Regex
.Matches(source, @"[A-Za-z][A-Za-z0-9]*\s*=\s*[TF]")
.OfType<Match>() ...
We are trying to find fragments that match the pattern
[A-Za-z] - Letter A..Z or a..z
[A-Za-z0-9]* - followed by zero or many letters or digits
\s* - zero or more white spaces (spaces, tabulations etc.)
= - =
\s* - zero or more white spaces (spaces, tabulations etc.)
[TF] - either T or F
The second part is match cleanup: for each match found, e.g. S1 \t = F, we want to obtain the "S1=F" string:
...
.Select(match => string.Concat(match.Value.Where(c => !char.IsWhiteSpace(c))))
.ToArray();
We use Linq here: for each character in the match we filter out all white spaces (take character c if and only if it's not a white space):
match.Value.Where(c => !char.IsWhiteSpace(c))
then combine (Concat) filtered characters (IEnumerable<char>) of each match back to string and organize these strings as an array (materialization):
.Select(match => string.Concat(...))
.ToArray();

Regex match word followed by decimal from text

I want to be able to match the following examples and return array of matches
given text:
some word
another 50.00
some-more 10.10 text
another word
Matches should be (word, followed by space then decimal number (Optionally followed by another word):
another 50.00
some-more 10.10 text
I have the following so far:
string pat = @"\r\n[A-Za-z ]+\d+\.\d{1,2}([A-Za-z])?";
Regex r = new Regex(pat, RegexOptions.IgnoreCase);
Match m = r.Match(input);
but it only matches first item: another 50.00
You do not account for - with [A-Za-z ] and only match some text after a newline.
You can use the following regex:
[\p{L}-]+\p{Zs}*\d*\.?\d{1,2}(?:\p{Zs}*[\p{L}-]+)?
See the regex demo
The [\p{L}-]+ matches 1 or more letters and hyphens, \p{Zs}* matches 0 or more horizontal whitespace symbols, \d*\.?\d{1,2} matches a number made of optional leading digits, an optional dot, and 1 to 2 digits (since the dot is optional, plain integers such as 50 match as well), and (?:\p{Zs}*[\p{L}-]+)? matches an optional word after the number.
Here is a C# snippet matching all occurrences based on Regex.Matches method:
var res = Regex.Matches(str, @"[\p{L}-]+\p{Zs}*\d*\.?\d{1,2}(?:\p{Zs}*[\p{L}-]+)?")
.Cast<Match>()
.Select(p => p.Value)
.ToList();
Just FYI: if you need to match whole words, you can also use word boundaries \b:
\b[\p{L}-]+\p{Zs}*\d*\.?\d{1,2}(?:\p{Zs}*[\p{L}-]+)?\b
And just another note: if you need to match diacritics, too, you may add \p{M} to the character class containing \p{L}:
[\p{L}\p{M}-]+\p{Zs}*\d*\.?\d{1,2}(?:\p{Zs}*[\p{L}\p{M}-]+)?\b
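As a quick check, the basic pattern can be run against the sample text from the question. This is only a sketch: the helper name FindWordDecimalMatches is an assumption, and the lines are joined with \n here for simplicity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (the name is an assumption): returns every
// "word number [word]" fragment found in the text.
static List<string> FindWordDecimalMatches(string text)
{
    return Regex.Matches(text, @"[\p{L}-]+\p{Zs}*\d*\.?\d{1,2}(?:\p{Zs}*[\p{L}-]+)?")
        .Cast<Match>()
        .Select(m => m.Value)
        .ToList();
}

var sample = "some word\nanother 50.00\nsome-more 10.10 text\nanother word";
Console.WriteLine(string.Join(" | ", FindWordDecimalMatches(sample)));
// another 50.00 | some-more 10.10 text
```

Because \p{Zs} does not include line breaks, a match never runs across two lines of the input.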
