I am trying to write a regex to validate and extract the values from a colon-separated string that can have 1-4 values. I have found examples with a fixed number of values and tried to adapt one, but it only picks up the first and last values; I need to extract all of them. The current regex also includes the : in the match, and I simply want the value if possible.
I am currently using this:
^([01ab])+(\:[01ab])*
but it only pulls the first and last values, not those in between if they exist.
Valid values:
0
0:a
0:a:1
0:1:a:b
Not valid:
0:a:
0:a:1:b:
I suggest a two-step approach: validate the format with the regex and then split the string with : if it qualifies:
if (Regex.IsMatch(text, @"^[01ab](?::[01ab])*$"))
{
result = text.Split(':');
}
The ^[01ab](?::[01ab])*$ regex matches start of a string with ^, a 0, 1, a or b, and then 0 or more repetitions of : followed with a 0, 1, a or b and then end of string ($).
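Since the question limits the input to 1-4 values, the same two-step idea can be sketched with a bounded quantifier. This is only a sketch: the `ColonList` class and `TryParse` method are names made up for illustration, and using `{0,3}` to cap the count (plus `\z` instead of `$` so that a trailing newline is rejected) is my assumption about the requirement, not part of the regex above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ColonList
{
    // One value, then at most three more ":value" pairs (1-4 values in total).
    // \z anchors at the very end of the input, so "0:a\n" is rejected as well.
    private static readonly Regex Format = new Regex(@"^[01ab](?::[01ab]){0,3}\z");

    // Hypothetical helper: returns true and the values when the input is valid.
    public static bool TryParse(string text, out string[] values)
    {
        if (text != null && Format.IsMatch(text))
        {
            values = text.Split(':');
            return true;
        }
        values = new string[0];
        return false;
    }

    public static void Main()
    {
        foreach (var input in new[] { "0", "0:1:a:b", "0:a:", "0:1:a:b:0" })
        {
            string[] values;
            bool ok = TryParse(input, out values);
            Console.WriteLine("{0} -> {1}", input, ok ? string.Join(", ", values) : "not valid");
        }
    }
}
```

Validation and extraction stay separate here, so the regex never has to capture anything.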
If you want to play with the regex a bit you will see that C# allows you to access all capture group values via CaptureCollection:
var text = "0:1:a:b";
var results = Regex.Match(text, @"^(?:([01ab])(?::\b|$))+$")
?.Groups[1].Captures.Cast<Capture>().Select(c => c.Value);
Console.WriteLine(string.Join(", ", results)); // => 0, 1, a, b
Regex details
^ - start of string
(?:([01ab])(?::\b|$))+ - 1 or more repetitions of:
([01ab]) - Group 1: 0, 1, a or b
(?::\b|$) - either : followed by a word character (a letter or digit here; \b would also allow _ to follow, but _ is not in the character class, so it can never match) or end of string
$ - end of string.
A non-regex approach (and why would you use regex unless you really have to) is this:
bool Validate(string s)
{
string[] valid = {"0", "1", "a", "b"};
var splitArray = s.Split(':');
if (splitArray.Length < 1 || splitArray.Length > 4)
return false;
return splitArray.All(a => valid.Contains(a));
}
It is more efficient to use a string method than a regex, so try the following:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace ConsoleApplication137
{
class Program
{
static void Main(string[] args)
{
string[] inputs = { "0", "0:a", "0:a:1", "0:1:a:b", "0:a:", "0:a:1:b:" };
string[] allowed = { "0", "1", "a", "b" };
foreach (string input in inputs)
{
// Keep empty entries so that a trailing ':' shows up as an invalid empty value.
string[] splitArray = input.Split(':');
if (splitArray.Length > 4 || !splitArray.All(s => allowed.Contains(s)))
{
Console.WriteLine("Input: '{0}' Not Valid", input);
}
else
{
Console.WriteLine("Input: '{0}' Values : '{1}'", input, string.Join("', '", splitArray));
}
}
Console.ReadLine();
}
}
}
Related
Here is my code to find a string between { }:
var text = "Hello this is a {Testvar}...";
int tagFrom = text.IndexOf("{") + "{".Length;
int tagTo = text.LastIndexOf("}");
String tagResult = text.Substring(tagFrom, tagTo - tagFrom);
tagResult Output: Testvar
This only works for a single tag.
How can I apply this to several tags (e.g. in a while loop)?
For example:
var text = "Hello this is a {Testvar}... and we have more {Tagvar} in this string {Endvar}.";
tagResult[] Output (eg Array): Testvar, Tagvar, Endvar
IndexOf() has another overload that takes the index at which to start searching the given string. If you omit it, the search always starts from the beginning and always finds the first occurrence.
var text = "Hello this is a {Testvar}...";
int start = 0, end = -1;
List<string> results = new List<string>();
while(true)
{
start = text.IndexOf("{", start) + 1;
if(start != 0)
end = text.IndexOf("}", start);
else
break;
if(end==-1) break;
results.Add(text.Substring(start, end - start));
start = end + 1;
}
I strongly recommend using regular expressions for the task.
using System;
using System.Text.RegularExpressions;
namespace ConsoleApp1
{
class Program
{
static void Main(string[] args)
{
var regex = new Regex(@"(\{(?<var>\w*)\})+", RegexOptions.IgnoreCase);
var text = "Hello this is a {Testvar}... and we have more {Tagvar} in this string {Endvar}.";
var matches = regex.Matches(text);
foreach (Match match in matches)
{
var variable = match.Groups["var"];
Console.WriteLine($"Found {variable.Value} from position {variable.Index} to {variable.Index + variable.Length}");
}
}
}
}
Output:
Found Testvar from position 17 to 24
Found Tagvar from position 47 to 53
Found Endvar from position 71 to 77
For more information about regular expressions, visit the MSDN reference page:
https://learn.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference
and this tool may be great to start testing your own expressions:
http://regexstorm.net/tester
Hope this helps!
I would use Regex pattern {(\\w+)} to get the value.
Regex reg = new Regex("{(\\w+)}");
var text = "Hello this is a {Testvar}... and we have more {Tagvar} in this string {Endvar}.";
string[] tagResult = reg.Matches(text)
.Cast<Match>()
.Select(match => match.Groups[1].Value).ToArray();
foreach (var item in tagResult)
{
Console.WriteLine(item);
}
Result
Testvar
Tagvar
Endvar
Many ways to skin this cat, here are a few:
Split it on { then loop through, splitting each result on } and taking element 0 each time
Split on { or } then loop through taking only odd numbered elements
Adjust your existing logic so you use IndexOf twice (instead of LastIndexOf). When you're looking for a }, pass the index of the { as the start index of the search
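The second suggestion above (split on { or } and keep the odd-numbered elements) could look roughly like this. It is a minimal sketch under the assumption that braces are balanced and never nested; `TagSplitter` and `ExtractTags` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class TagSplitter
{
    // Hypothetical helper: assumes braces are balanced and not nested.
    public static List<string> ExtractTags(string text)
    {
        var results = new List<string>();
        string[] parts = text.Split('{', '}');
        // Even indexes hold the text outside braces, odd indexes hold the tag names.
        for (int i = 1; i < parts.Length; i += 2)
        {
            results.Add(parts[i]);
        }
        return results;
    }

    public static void Main()
    {
        var text = "Hello this is a {Testvar}... and we have more {Tagvar} in this string {Endvar}.";
        Console.WriteLine(string.Join(", ", ExtractTags(text)));
    }
}
```

A stray or unmatched brace would shift the odd/even positions, so the IndexOf-based loop is the safer choice for untrusted input.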
This is easy with regular expressions, using a simple pattern like {([\d\w]+)} (the \d is redundant, since \w already covers digits).
See the example below:
using System.Text.RegularExpressions;
...
MatchCollection matches = Regex.Matches("Hello this is a {Testvar}... and we have more {Tagvar} in this string {Endvar}.", @"{([\d\w]+)}");
foreach(Match match in matches){
Console.WriteLine("match : {0}, index : {1}", match.Groups[1], match.Index);
}
It finds each series of letters, digits or underscores inside the braces, one match at a time.
I have a string like the following:
[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)
You can look at it as this tree:
- [Testing.User]
- Info
- [Testing.Info]
- Name
- [System.String]
- Matt
- Age
- [System.Int32]
- 21
- Description
- [System.String]
- This is some description
As you can see, it's a string serialization / representation of a class Testing.User
I want to be able to do a split and get the following elements in the resulting array:
[0] = [Testing.User]
[1] = Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))
[2] = Description:([System.String]|This is some description)
I can't split by | because that would result in:
[0] = [Testing.User]
[1] = Info:([Testing.Info]
[2] = Name:([System.String]
[3] = Matt)
[4] = Age:([System.Int32]
[5] = 21))
[6] = Description:([System.String]
[7] = This is some description)
How can I get my expected result?
I'm not very good with regular expressions, but I am aware that they are a likely solution for this case.
Using regex lookahead
You can use a regex like this:
(\[.*?])|(\w+:.*?)\|(?=Description:)|(Description:.*)
The idea behind this regex is to capture in groups 1,2 and 3 what you want.
Match information
MATCH 1
1. [0-14] `[Testing.User]`
MATCH 2
2. [15-88] `Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))`
MATCH 3
3. [89-143] `Description:([System.String]|This is some description)`
Regular regex
On the other hand, if you don't like the regex above, you can use another one like this:
(\[.*?])\|(.*)\|(Description:.*)
Or even forcing one character at least:
(\[.+?])\|(.+)\|(Description:.+)
There are more than enough splitting answers already, so here is another approach. If your input represents a tree structure, why not parse it to a tree?
The following code was automatically translated from VB.NET, but it should work as far as I tested it.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Treeparse
{
class Program
{
static void Main(string[] args)
{
var input = "[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
var t = StringTree.Parse(input);
Console.WriteLine(t.ToString());
Console.ReadKey();
}
}
public class StringTree
{
//Branching constants
const string BranchOff = "(";
const string BranchBack = ")";
const string NextTwig = "|";
//Content of this twig
public string Text;
//List of Sub-Twigs
public List<StringTree> Twigs;
[System.Diagnostics.DebuggerStepThrough()]
public StringTree()
{
Text = "";
Twigs = new List<StringTree>();
}
private static void ParseRecursive(StringTree Tree, string InputStr, ref int Position)
{
do {
StringTree NewTwig = new StringTree();
do {
NewTwig.Text = NewTwig.Text + InputStr[Position];
Position += 1;
} while (!(Position == InputStr.Length || (new String[] { BranchBack, BranchOff, NextTwig }.ToList().Contains(InputStr[Position].ToString()))));
Tree.Twigs.Add(NewTwig);
if (Position < InputStr.Length && InputStr[Position].ToString() == BranchOff) { Position += 1; ParseRecursive(NewTwig, InputStr, ref Position); Position += 1; }
if (Position < InputStr.Length && InputStr[Position].ToString() == BranchBack)
break; // TODO: might not be correct. Was : Exit Do
Position += 1;
} while (!(Position >= InputStr.Length || InputStr[Position].ToString() == BranchBack));
}
/// <summary>
/// Call this to parse the input into a StringTree objects using recursion
/// </summary>
public static StringTree Parse(string Input)
{
StringTree t = new StringTree();
t.Text = "Root";
int Start = 0;
ParseRecursive(t, Input, ref Start);
return t;
}
private void ToStringRecursive(ref StringBuilder sb, StringTree tree, int Level)
{
for (int i = 1; i <= Level; i++)
{
sb.Append(" ");
}
sb.AppendLine(tree.Text);
int NextLevel = Level + 1;
foreach (StringTree NextTree in tree.Twigs)
{
ToStringRecursive(ref sb, NextTree, NextLevel);
}
}
public override string ToString()
{
var sb = new System.Text.StringBuilder();
ToStringRecursive(ref sb, this, 0);
return sb.ToString();
}
}
}
You get the values of each node with its associated subvalues in a treelike structure and you can then do with it whatever you like, for example easily show the structure in a TreeView control.
Assuming your groups can be marked as
[Anything.Anything]
Anything:ReallyAnything (Letters & Numbers only:Then any amount of characters) after the first pipe
Anything:ReallyAnything (Letters & Numbers only:Then any amount of characters) after the last pipe
Then you have a pattern like:
"(\\[\\w+\\.\\w+\\])\\|(\\w+:.+)\\|(\\w+:.+)";
(\\[\\w+\\.\\w+\\]) This capture group will get the "[Testing.User]" but is not restricted to it only being "[Testing.User]"
\\|(\\w+:.+) This capture group will get the data after the first pipe and stop before the last pipe. In this case, "Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))" but is not restricted to it beginning with "Info:"
\\|(\\w+:.+) Same capture group as previous, but captures whatever is after the last pipe, in this case "Description:([System.String]|This is some description)" but is not restricted to beginning with "Description:"
Now if you were to add another pipe followed by more data (|Anything:SomeData), then Description: will be part of group 2 and group 3 would now be "Anything:SomeData".
Code looks like:
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
String text = "[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
String pattern = "(\\[\\w+\\.\\w+\\])\\|(\\w+:.+)\\|(\\w+:.+)";
Match match = Regex.Match(text, pattern);
if (match.Success)
{
Console.WriteLine(match.Groups[1]);
Console.WriteLine(match.Groups[2]);
Console.WriteLine(match.Groups[3]);
}
}
}
Results:
[Testing.User]
Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))
Description:([System.String]|This is some description)
See working sample here... https://dotnetfiddle.net/DYcZuY
See working sample if I add another field following the pattern format here... https://dotnetfiddle.net/Mtc1CD
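The greedy behaviour described above (an extra trailing field pushing Description: into group 2) can be reproduced locally with a short sketch. The `Extra:` field is invented purely for this demonstration, and `GreedyGroups` is a hypothetical class name.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyGroups
{
    // Same pattern as in the answer, written as a verbatim string.
    public const string Pattern = @"(\[\w+\.\w+\])\|(\w+:.+)\|(\w+:.+)";

    public static void Main()
    {
        string text = "[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
        // "Extra" is a made-up fourth field, appended only for this demonstration.
        string extended = text + "|Extra:([System.String]|more)";

        Match match = Regex.Match(extended, Pattern);
        // Group 2 is greedy, so it now swallows both the Info and Description fields.
        Console.WriteLine(match.Groups[2].Value);
        // Group 3 holds whatever follows the last pipe that starts a "name:" field.
        Console.WriteLine(match.Groups[3].Value);
    }
}
```

In other words, the pattern only yields one item per group while there are exactly three top-level fields.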
To do that you need to use balancing groups, a regex feature exclusive to the .NET regex engine. It is a counter system: when an opening parenthesis is found the counter is incremented, when a closing one is found the counter is decremented, and then you only have to test whether the counter is zero to know if the parentheses are balanced.
With a regex, this is the only way to be sure you are inside or outside of the parentheses:
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string input = @"[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
string pattern = @"(?:[^|()]+|\((?>[^()]+|(?<Open>[(])|(?<-Open>[)]))*(?(Open)(?!))\))+";
foreach (Match m in Regex.Matches(input, pattern))
Console.WriteLine(m.Value);
}
}
pattern details:
(?:
[^|()]+ # all that is not a parenthesis or a pipe
| # OR
# content between parenthesis (eventually nested)
\( # opening parenthesis
# here is the way to obtain balanced parens
(?> # content between parens
[^()]+ # all that is not parenthesis
| # OR
(?<Open>[(]) # an opening parenthesis (increment the counter)
|
(?<-Open>[)]) # a closing parenthesis (decrement the counter)
)* # repeat as needed
(?(Open)(?!)) # make the pattern fail if the counter is not zero
\)
)+
(?(Open)(?!)) is a conditional statement: if the Open group still holds a capture, the (?!) branch is tried.
(?!) is an always-false subpattern (an empty negative lookahead), which means: not followed by nothing.
This pattern matches everything that is not a pipe, together with strings enclosed in parentheses.
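To see the counter mechanism in isolation, here is a reduced sketch that only answers "are the parentheses in this string balanced?". It reuses the same `(?<Open>...)`, `(?<-Open>...)` and `(?(Open)(?!))` constructs; the `ParenCheck` class and `IsBalanced` method are hypothetical names, and anchoring with `^`/`$` is my choice for this demonstration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ParenCheck
{
    // Push on '(' , pop on ')' , and fail at the end if anything is still on the stack.
    private static readonly Regex Balanced =
        new Regex(@"^(?>[^()]+|(?<Open>\()|(?<-Open>\)))*(?(Open)(?!))$");

    // Hypothetical helper wrapping the pattern.
    public static bool IsBalanced(string s)
    {
        return Balanced.IsMatch(s);
    }

    public static void Main()
    {
        foreach (var s in new[] { "Info:(a|(b))", "Info:(a|(b)", ")(" })
        {
            Console.WriteLine("{0} -> {1}", s, IsBalanced(s));
        }
    }
}
```

A closing parenthesis with nothing to pop makes `(?<-Open>\))` fail on its own, which is why `)(` is rejected even though the counts are equal.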
Regex is not the best approach for this kind of problem; you may need to write some code to parse your data. I wrote a simple example that handles this simple case of yours. The basic idea is that you want to split only if the | is not inside parentheses, so I keep track of the parenthesis count. You will need some extra work to treat cases where a parenthesis is part of the description section, for instance, but as I say, this is just a starting point:
static IEnumerable<String> splitSpecial(string input)
{
StringBuilder builder = new StringBuilder();
int openParenthesisCount = 0;
foreach (char c in input)
{
if (openParenthesisCount == 0 && c == '|')
{
yield return builder.ToString();
builder.Clear();
}
else
{
if (c == '(')
openParenthesisCount++;
if (c == ')')
openParenthesisCount--;
builder.Append(c);
}
}
yield return builder.ToString();
}
static void Main(string[] args)
{
string input = "[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
foreach (String split in splitSpecial(input))
{
Console.WriteLine(split);
}
Console.ReadLine();
}
Outputs:
[Testing.User]
Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))
Description:([System.String]|This is some description)
This isn't a great/robust solution, but if you know your three top level items are fixed then you can hard code those into your regular expression.
(\[Testing\.User\])\|(Info:.*)\|(Description:.*)
This regular expression will create one match with three groups within it as you were expecting. You can test it here:
http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
Edit: Here's a full working C# example
using System;
using System.Text.RegularExpressions;
namespace ConsoleApplication3
{
internal class Program
{
private static void Main(string[] args)
{
const string input = @"[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
const string pattern = @"(\[Testing\.User\])\|(Info:.*)\|(Description:.*)";
var match = Regex.Match(input, pattern);
if (match.Success)
{
for (int i = 1; i < match.Groups.Count; i++)
{
Console.WriteLine("[" + i + "] = " + match.Groups[i]);
}
}
Console.ReadLine();
}
}
}
I want to use Regex to find matches in a string. There are other ways to find the pattern I am looking for, but I am interested in the Regex solution.
Consider these strings
"ABC123"
"ABC245"
"ABC435"
"ABC Oh say can You see"
I want to match "ABC" followed by ANYTHING BUT "123". What is the correct regular expression?
Using a negative lookahead:
/ABC(?!123)/
You can check if there are matches in a string str with:
Regex.IsMatch(str, "ABC(?!123)")
Full example:
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string[] strings = {
"ABC123",
"ABC245",
"ABC435",
"ABC Oh say can You see"
};
string pattern = "ABC(?!123)";
foreach (string str in strings)
{
Console.WriteLine(
"\"{0}\" {1} match.",
str, Regex.IsMatch(str, pattern) ? "does" : "does not"
);
}
}
}
Note that my regex above will match ABC as long as it is not followed by 123, including when ABC is the last thing in the string. If you need at least one character after ABC (that is, ABC on its own at the end of the string should not match), you can use ABC(?!123). where the dot ensures that at least one character follows ABC.
I believe the first Regex is what you're looking for though (as long as "nothing" can be considered "anything" :P).
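The difference between the two patterns only shows up when ABC is the last thing in the string. A small sketch makes that visible; `LookaheadDemo` and its method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // True when some "ABC" is not immediately followed by "123" (end of string counts).
    public static bool AbcNotBefore123(string s)
    {
        return Regex.IsMatch(s, "ABC(?!123)");
    }

    // Same check, but the trailing dot also demands one more character after "ABC".
    public static bool AbcThenSomethingElse(string s)
    {
        return Regex.IsMatch(s, "ABC(?!123).");
    }

    public static void Main()
    {
        foreach (var s in new[] { "ABC123", "ABC245", "ABC" })
        {
            Console.WriteLine("{0}: {1} / {2}", s, AbcNotBefore123(s), AbcThenSomethingElse(s));
        }
    }
}
```

Both methods reject "ABC123" and accept "ABC245"; only the first accepts a bare "ABC". Note that `.` does not match a newline unless RegexOptions.Singleline is set.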
Try the following test code. This should do what you require
string s1 = "ABC123";
string s2 = "we ABC123 weew";
string s3 = "ABC435";
string s4 = "Can ABC Oh say can You see";
List<string> list = new List<string>() { s1, s2, s3, s4 };
Regex regex = new Regex(@".*(?<=.*ABC(?!.*123.*)).*");
Match m = null;
foreach (string s in list)
{
m = regex.Match(s);
if (m.Success) // Regex.Match never returns null; a failed match has Success == false
Console.WriteLine(m.ToString());
}
The output is:
ABC435
Can ABC Oh say can You see
This uses both a 'Negative Lookahead' and a 'Positive Lookbehind'.
I hope this helps.
An alternative to regex, should you find this easier to use. Only a suggestion.
List<string> strs = new List<string>() { "ABC123",
"ABC245",
"ABC435",
"NOTABC",
"ABC Oh say can You see"
};
//Iterate backwards so that RemoveAt does not skip the element after a removed one
for (int i = strs.Count - 1; i >= 0; i--)
{
//Set the current string variable
string str = strs[i];
//Get the index of "ABC"
int index = str.IndexOf("ABC");
//Do you want to remove if ABC doesn't exist?
if (index == -1)
continue;
//Set the index to be the next character from ABC
index += 3;
//If the index is within the length with 3 extra characters (123)
if (index <= str.Length && (index + 3) <= str.Length)
if (str.Substring(index, 3) == "123")
strs.RemoveAt(i);
}
I am very new to C#.
I want a program that behaves like this for the following input:
input : There are 4 numbers in this string 40, 30, and 10
output :
there = string
are = string
4 = number
numbers = string
in = string
this = string
40 = number
, = symbol
30 = number
, = symbol
and = string
10 = number
I tried this:
{
class Program
{
static void Main(string[] args)
{
string input = "There are 4 numbers in this string 40, 30, and 10.";
// Split on one or more non-digit characters.
string[] numbers = Regex.Split(input, @"(\D+)(\s+)");
foreach (string value in numbers)
{
Console.WriteLine(value);
}
}
}
}
but the output is different from what I want. Please help me, I am stuck :((
The regex parser has an if conditional and the ability to group items into named capture groups, both of which I will demonstrate.
Here is an example where the pattern looks for symbols first (only a comma; add more symbols to the set [,]), then numbers, and drops the rest into words.
string text = @"There are 4 numbers in this string 40, 30, and 10";
string pattern = @"
(?([,]) # If a comma (or other then add it) is found its a symbol
(?<Symbol>[,]) # Then match the symbol
| # else its not a symbol
(?(\d+) # If a number
(?<Number>\d+) # Then match the numbers
| # else its not a number
(?<Word>[^\s]+) # So it must be a word.
)
)
";
// Ignore pattern white space allows us to comment the pattern only, does not affect
// the processing of the text!
Regex.Matches(text, pattern, RegexOptions.IgnorePatternWhitespace)
.OfType<Match>()
.Select (mt =>
{
if (mt.Groups["Symbol"].Success)
return "Symbol found: " + mt.Groups["Symbol"].Value;
if (mt.Groups["Number"].Success)
return "Number found: " + mt.Groups["Number"].Value;
return "Word found: " + mt.Groups["Word"].Value;
}
)
.ToList() // To show the result only remove
.ForEach(rs => Console.WriteLine (rs));
/* Result
Word found: There
Word found: are
Number found: 4
Word found: numbers
Word found: in
Word found: this
Word found: string
Number found: 40
Symbol found: ,
Number found: 30
Symbol found: ,
Word found: and
Number found: 10
*/
Once the regex has tokenized the input into matches, we use LINQ to extract those tokens by identifying which named capture group succeeded. In this example we take the successful capture group and project it into a string to print out for viewing.
I discuss the regex if conditional on my blog Regular Expressions and the If Conditional for more information.
You could split using this pattern: @"(,)\s?|\s"
This splits on a comma, but preserves it since it is within a group. The \s? serves to match an optional space but excludes it from the result. Without it, the split would include the space that occurred after a comma. Next, there's an alternation to split on whitespace in general.
To categorize the values, we can take the first character of the string and check for the type using the static Char methods.
string input = "There are 4 numbers in this string 40, 30, and 10";
var query = Regex.Split(input, @"(,)\s?|\s")
.Select(s => new
{
Value = s,
Type = Char.IsLetter(s[0]) ?
"String" : Char.IsDigit(s[0]) ?
"Number" : "Symbol"
});
foreach (var item in query)
{
Console.WriteLine("{0} : {1}", item.Value, item.Type);
}
To use the Regex.Matches method instead, this pattern can be used: @"\w+|,"
var query = Regex.Matches(input, @"\w+|,").Cast<Match>()
.Select(m => new
{
Value = m.Value,
Type = Char.IsLetter(m.Value[0]) ?
"String" : Char.IsDigit(m.Value[0]) ?
"Number" : "Symbol"
});
Well to match all numbers you could do:
[\d]+
For the strings:
[a-zA-Z]+
And for some of the symbols for example
[,.?\[\]\\\/;:!\*]+
You can very easily do this like so:
string[] tokens = input.Split(' ');
foreach (string token in tokens)
{
int number;
// TryParse needs an out parameter to receive the parsed value
if (Int32.TryParse(token, out number))
{
Console.WriteLine(token + " = number");
}
// Char.IsLetter and Char.IsDigit take a single char, not a string
else if (token.Length == 1 && !Char.IsLetter(token[0]) && !Char.IsDigit(token[0]))
{
Console.WriteLine(token + " = symbol");
}
else
{
Console.WriteLine(token + " = string");
}
}
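I do not have an IDE handy to test this. Essentially, what you are doing is splitting the input on spaces and then performing some comparisons to determine whether each token is a symbol, string, or number. Note that a token with punctuation attached, such as "40,", is not separated and is therefore reported as a string.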
If you want to get the numbers
var reg = new Regex(@"\d+");
var matches = reg.Matches(input );
var numbers = matches
.Cast<Match>()
.Select(m=>Int32.Parse(m.Groups[0].Value));
To get your output:
var regSymbols = new Regex(@"(?<number>\d+)|(?<string>\w+)|(?<symbol>(,))");
var sMatches = regSymbols.Matches(input );
var symbols = sMatches
.Cast<Match>()
.Select(m=> new
{
Number = m.Groups["number"].Value,
String = m.Groups["string"].Value,
Symbol = m.Groups["symbol"].Value
})
.Select(
m => new
{
Match = !String.IsNullOrEmpty(m.Number) ?
m.Number : !String.IsNullOrEmpty(m.String)
? m.String : m.Symbol,
MatchType = !String.IsNullOrEmpty(m.Number) ?
"Number" : !String.IsNullOrEmpty(m.String)
? "String" : "Symbol"
}
);
edit
If there are more symbols than a comma you can group them in a class, like @Bogdan Emil Mariesan did, and the regex will be:
@"(?<number>\d+)|(?<string>\w+)|(?<symbol>[,.\?!])"
edit2
To get the strings with =
var outputLines = symbols.Select(m=>
String.Format("{0} = {1}", m.Match, m.MatchType));
I want to split a string into a list or array.
Input: green,"yellow,green",white,orange,"blue,black"
The split character is the comma (,), but it must ignore commas inside quotes.
The output should be:
green
yellow,green
white
orange
blue,black
Thanks.
Actually, this is easy enough that you can just use Match:
string subjectString = @"green,""yellow,green"",white,orange,""blue,black""";
try
{
Regex regexObj = new Regex(@"(?<="")\b[a-z,]+\b(?="")|[a-z]+", RegexOptions.IgnoreCase);
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success)
{
Console.WriteLine("{0}", matchResults.Value);
// matched text: matchResults.Value
// match start: matchResults.Index
// match length: matchResults.Length
matchResults = matchResults.NextMatch();
}
}
catch (ArgumentException ex)
{
// Syntax error in the regular expression
Console.WriteLine(ex.Message);
}
Output :
green
yellow,green
white
orange
blue,black
Explanation :
@"
# Match either the regular expression below (attempting the next alternative only if this one fails)
(?<= # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
"" # Match the character “""” literally
)
\b # Assert position at a word boundary
[a-z,] # Match a single character present in the list below
# A character in the range between “a” and “z”
# The character “,”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\b # Assert position at a word boundary
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
"" # Match the character “""” literally
)
| # Or match regular expression number 2 below (the entire match attempt fails if this one fails to match)
[a-z] # Match a single character in the range between “a” and “z”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
"
What you have there is an irregular language. In other words, the meaning of a character depends upon the sequence of characters before or after it. As the name implies Regular Expressions are for parsing Regular languages.
What you need here is a Tokenizer and Parser, a good internet search engine should guide you to examples. In fact as the tokens are just characters you probably don't even need the Tokenizer.
While you can do this simple case using a regular expression, it is likely to be very slow. It could also cause issues if the quotes are ever unbalanced, as a regular expression would not detect this error, whereas a parser would.
If you are importing a CSV file you may want to have a look at the Microsoft.VisualBasic.FileIO.TextFieldParser class (Simply add a reference to Microsoft.VisualBasic.dll in a C# project) which parses CSV files.
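As a rough illustration of the TextFieldParser route, the sketch below feeds a single in-memory line through the parser via a StringReader. The `CsvLine` class and its `Split` method are hypothetical names, and it assumes the Microsoft.VisualBasic assembly is available to the project.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class CsvLine
{
    // Hypothetical helper: parses one CSV line held in memory.
    public static string[] Split(string line)
    {
        using (var reader = new StringReader(line))
        using (var parser = new TextFieldParser(reader))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            // Treat "..." as one field, so commas inside quotes are kept.
            parser.HasFieldsEnclosedInQuotes = true;
            // ReadFields returns null when there is nothing left to read.
            return parser.ReadFields() ?? new string[0];
        }
    }

    public static void Main()
    {
        string input = "green,\"yellow,green\",white,orange,\"blue,black\"";
        foreach (string field in Split(input))
        {
            Console.WriteLine(field);
        }
    }
}
```

For a real file you would pass the file path or a stream to the parser and call ReadFields in a loop until EndOfData is true.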
Another way to do this is to write your own state machine (example below) though this still does not solve the issue of a quote in the middle of a value:
using System;
using System.Text;
namespace Example
{
class Program
{
static void Main(string[] args)
{
string subjectString = @"green,""yellow,green"",white,orange,""blue,black""";
bool inQuote = false;
StringBuilder currentResult = new StringBuilder();
foreach (char c in subjectString)
{
switch (c)
{
case '\"':
inQuote = !inQuote;
break;
case ',':
if (inQuote)
{
currentResult.Append(c);
}
else
{
Console.WriteLine(currentResult);
currentResult.Clear();
}
break;
default:
currentResult.Append(c);
break;
}
}
if (inQuote)
{
throw new FormatException("Input string does not have balanced Quote Characters");
}
Console.WriteLine(currentResult);
}
}
}
Someone will shortly come up with an answer that does this with a single regex. I'm not that clever, but just for the sake of balance, here's a suggestion that doesn't use a regex entirely. Based on the old adage that when you try to solve a problem with a regex, you then have two problems. :)
Personally given my lack of regex-fu, I'd do one of the following:
Use a simple regex-based Replace to escape any commas inside quotes with something else (e.g. a placeholder such as &comma;). Then you can do a simple string.Split() on the result and unescape each item in the resulting array before you use it. This is yucky. Partly because it's double-handling everything, and partly because it also uses regexes. Boooo!
Parse it by hand, char by char. Convert the string to a char array, then iterate through it, keeping note of whether you're "inside quotes" or not, and build the resulting array a char at a time.
Same as the previous suggestion, but using a csv-parser from someone on the internet. The example one I create below doesn't exactly pass all tests from the csv specification, so it's only really a guide to illustrate my point.
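For completeness, the first suggestion (escape the quoted commas, split, then unescape) might be sketched as follows. The placeholder character is an assumption: it must be something that never appears in the real data. `PlaceholderSplit` is a hypothetical name.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PlaceholderSplit
{
    // Assumption: this control character never occurs in the real data.
    private const string Placeholder = "\u0001";

    public static string[] Split(string input)
    {
        // Step 1: hide the commas that sit inside a quoted section.
        string escaped = Regex.Replace(input, "\"[^\"]*\"",
            m => m.Value.Replace(",", Placeholder));

        // Step 2: plain split, then strip the quotes and restore the commas.
        return escaped.Split(',')
            .Select(item => item.Trim('"').Replace(Placeholder, ","))
            .ToArray();
    }

    public static void Main()
    {
        string input = "green,\"yellow,green\",white,orange,\"blue,black\"";
        foreach (string item in Split(input))
        {
            Console.WriteLine(item);
        }
    }
}
```

It does handle everything twice, as noted above, and it does not cope with escaped quotes inside a quoted value.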
There's a good chance non-regex options would perform better if well-written, because regexes can be a little expensive as they scan strings internally looking for patterns.
Really, I just wanted to point out that you don't have to use a regex. :)
Here's a fairly naive implementation of my second suggestion. On my PC it's happy parsing 1 million 15-column strings in a little over 4.5 seconds.
public class ManualParser : IParser
{
public IEnumerable<string> Parse(string line)
{
if (string.IsNullOrWhiteSpace(line)) return new List<string>();
line = line.Trim();
if (line.Contains(",") == false) return new[] { line.Trim('"') };
if (line.Contains("\"") == false) return line.Split(',').Select(c => c.Trim());
bool withinQuotes = false;
var builder = new List<string>();
var trimChars = new[] { ' ', '"' };
int left = 0;
int right = 0;
for (right = 0; right < line.Length; right++)
{
char c = line[right];
if (c == '"')
{
withinQuotes = !withinQuotes;
continue;
}
if (c == ',' && !withinQuotes)
{
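builder.Add(line.Substring(left, right - left).Trim(trimChars));
// The next field starts right after the comma. The loop's own right++ moves past the comma;
// incrementing here as well would skip a character (e.g. an opening quote directly after the comma).
left = right + 1;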
}
}
builder.Add(line.Substring(left, right - left).Trim(trimChars));
return builder;
}
}
Here are some unit tests for it:
[TestFixture]
public class ManualParserTests
{
[Test]
public void Parse_GivenStringWithNoQuotesAndNoCommas_ShouldReturnThatString()
{
// Arrange
var parser = new ManualParser();
// Act
string[] result = parser.Parse("This is my data").ToArray();
// Assert
Assert.AreEqual(1, result.Length, "Should only be one column returned");
Assert.AreEqual("This is my data", result[0], "Incorrect value is returned");
}
[Test]
public void Parse_GivenStringWithNoQuotesAndOneComma_ShouldReturnTwoColumns()
{
// Arrange
var parser = new ManualParser();
// Act
string[] result = parser.Parse("This is, my data").ToArray();
// Assert
Assert.AreEqual(2, result.Length, "Should be 2 columns returned");
Assert.AreEqual("This is", result[0], "First value is incorrect");
Assert.AreEqual("my data", result[1], "Second value is incorrect");
}
[Test]
public void Parse_GivenStringWithQuotesAndNoCommas_ShouldReturnColumnWithoutQuotes()
{
// Arrange
var parser = new ManualParser();
// Act
string[] result = parser.Parse("\"This is my data\"").ToArray();
// Assert
Assert.AreEqual(1, result.Length, "Should be 1 column returned");
Assert.AreEqual("This is my data", result[0], "Value is incorrect");
}
[Test]
public void Parse_GivenStringWithQuotesAndCommas_ShouldReturnColumnsWithoutQuotes()
{
// Arrange
var parser = new ManualParser();
// Act
string[] result = parser.Parse("\"This is\", my data").ToArray();
// Assert
Assert.AreEqual(2, result.Length, "Should be 2 columns returned");
Assert.AreEqual("This is", result[0], "First value is incorrect");
Assert.AreEqual("my data", result[1], "Second value is incorrect");
}
[Test]
public void Parse_GivenStringWithQuotesContainingCommasAndCommas_ShouldReturnColumnsWithoutQuotes()
{
// Arrange
var parser = new ManualParser();
// Act
string[] result = parser.Parse("\"This, is\", my data").ToArray();
// Assert
Assert.AreEqual(2, result.Length, "Should be 2 columns returned");
Assert.AreEqual("This, is", result[0], "First value is incorrect");
Assert.AreEqual("my data", result[1], "Second value is incorrect");
}
}
And here's a sample app that I tested the throughput with:
class Program
{
static void Main(string[] args)
{
RunTest();
}
private static void RunTest()
{
var parser = new ManualParser();
string csv = Properties.Resources.Csv;
var result = new StringBuilder();
var s = new Stopwatch();
for (int test = 0; test < 3; test++)
{
int lineCount = 0;
s.Start();
for (int i = 0; i < 1000000 / 50; i++)
{
foreach (var line in csv.Split(new[] { Environment.NewLine }, StringSplitOptions.None))
{
string cur = line + s.ElapsedTicks.ToString();
result.AppendLine(parser.Parse(cur).ToString());
lineCount++;
}
}
s.Stop();
Console.WriteLine("Completed {0} lines in {1}ms", lineCount, s.ElapsedMilliseconds);
s.Reset();
result = new StringBuilder();
}
}
}
The format of the string you are trying to split appears to be standard CSV. Using a CSV parser would likely be easier/faster.
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
string input = @"green,""yellow,green"",white,orange,""blue,black""";
string splitOn = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";
string[] words = Regex.Split(input, splitOn);
foreach (var word in words)
{
Console.WriteLine(word);
}
}
}
OUTPUT:
green
"yellow,green"
white
orange
"blue,black"
Enclosing the regex in '(' and ')' (a capture group) and then splitting on it should solve this, because the split then keeps the captured text in the result.
eg: /("[^"]+")/g
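In .NET terms, that idea could look something like the sketch below. Regex.Split includes captured text in its result, so the quoted sections come back as their own items and only the unquoted runs need a second, ordinary split. `CaptureSplit` is a hypothetical name, and dropping empty entries is my simplification: it means genuinely empty fields are lost.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureSplit
{
    public static List<string> Split(string input)
    {
        var result = new List<string>();
        // The capture group makes Regex.Split return the quoted sections as items too.
        foreach (string piece in Regex.Split(input, "(\"[^\"]+\")"))
        {
            if (piece.Length > 0 && piece[0] == '"')
            {
                // Quoted section: keep the inner commas, drop the quotes.
                result.Add(piece.Trim('"'));
            }
            else
            {
                // Unquoted run such as ",white,orange,": split it normally.
                result.AddRange(piece.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries));
            }
        }
        return result;
    }

    public static void Main()
    {
        string input = "green,\"yellow,green\",white,orange,\"blue,black\"";
        foreach (string item in Split(input))
        {
            Console.WriteLine(item);
        }
    }
}
```

The /g flag in the pattern above is JavaScript notation; .NET needs no equivalent here because Regex.Split already processes every match.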