Complex string splitting - c#

I have a string like the following:
[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)
You can look at it as this tree:
- [Testing.User]
- Info
- [Testing.Info]
- Name
- [System.String]
- Matt
- Age
- [System.Int32]
- 21
- Description
- [System.String]
- This is some description
As you can see, it's a string serialization / representation of a class Testing.User
I want to be able to do a split and get the following elements in the resulting array:
[0] = [Testing.User]
[1] = Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))
[2] = Description:([System.String]|This is some description)
I can't split by | because that would result in:
[0] = [Testing.User]
[1] = Info:([Testing.Info]
[2] = Name:([System.String]
[3] = Matt)
[4] = Age:([System.Int32]
[5] = 21))
[6] = Description:([System.String]
[7] = This is some description)
How can I get my expected result?
I'm not very good with regular expressions, but I am aware it is a very possible solution for this case.

Using regex lookahead
You can use a regex like this:
(\[.*?])|(\w+:.*?)\|(?=Description:)|(Description:.*)
Working demo
The idea behind this regex is to capture in groups 1,2 and 3 what you want.
You can see it easily with this diagram:
Match information
MATCH 1
1. [0-14] `[Testing.User]`
MATCH 2
2. [15-88] `Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))`
MATCH 3
3. [89-143] `Description:([System.String]|This is some description)`
Regular regex
On the other hand, if you don't like above regex, you can use another one like this:
(\[.*?])\|(.*)\|(Description:.*)
Working demo
Or even forcing one character at least:
(\[.+?])\|(.+)\|(Description:.+)

There are more than enough splitting answers already, so here is another approach. If your input represents a tree structure, why not parse it to a tree?
The following code was automatically translated from VB.NET, but it should work as far as I tested it.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Treeparse
{
class Program
{
static void Main(string[] args)
{
var input = "[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
var t = StringTree.Parse(input);
Console.WriteLine(t.ToString());
Console.ReadKey();
}
}
public class StringTree
{
//Branching constants
const string BranchOff = "(";
const string BranchBack = ")";
const string NextTwig = "|";
//Content of this twig
public string Text;
//List of Sub-Twigs
public List<StringTree> Twigs;
[System.Diagnostics.DebuggerStepThrough()]
public StringTree()
{
Text = "";
Twigs = new List<StringTree>();
}
private static void ParseRecursive(StringTree Tree, string InputStr, ref int Position)
{
do {
StringTree NewTwig = new StringTree();
do {
NewTwig.Text = NewTwig.Text + InputStr[Position];
Position += 1;
} while (!(Position == InputStr.Length || (new String[] { BranchBack, BranchOff, NextTwig }.ToList().Contains(InputStr[Position].ToString()))));
Tree.Twigs.Add(NewTwig);
if (Position < InputStr.Length && InputStr[Position].ToString() == BranchOff) { Position += 1; ParseRecursive(NewTwig, InputStr, ref Position); Position += 1; }
if (Position < InputStr.Length && InputStr[Position].ToString() == BranchBack)
break; // TODO: might not be correct. Was : Exit Do
Position += 1;
} while (!(Position >= InputStr.Length || InputStr[Position].ToString() == BranchBack));
}
/// <summary>
/// Call this to parse the input into a StringTree objects using recursion
/// </summary>
public static StringTree Parse(string Input)
{
StringTree t = new StringTree();
t.Text = "Root";
int Start = 0;
ParseRecursive(t, Input, ref Start);
return t;
}
private void ToStringRecursive(ref StringBuilder sb, StringTree tree, int Level)
{
for (int i = 1; i <= Level; i++)
{
sb.Append(" ");
}
sb.AppendLine(tree.Text);
int NextLevel = Level + 1;
foreach (StringTree NextTree in tree.Twigs)
{
ToStringRecursive(ref sb, NextTree, NextLevel);
}
}
public override string ToString()
{
var sb = new System.Text.StringBuilder();
ToStringRecursive(ref sb, this, 0);
return sb.ToString();
}
}
}
Result (click):
You get the values of each node with its associated subvalues in a treelike structure and you can then do with it whatever you like, for example easily show the structure in a TreeView control:

Assuming your groups can be marked as
[Anything.Anything]
Anything:ReallyAnything (Letters & Numbers only:Then any amount of characters) after the first pipe
Anything:ReallyAnything (Letters & Numbers only:Then any mount of characters) after the last pipe
Then you have a pattern like:
"(\\[\\w+\\.\\w+\\])\\|(\\w+:.+)\\|(\\w+:.+)";
(\\[\\w+\\.\\w+\\]) This capture group will get the "[Testing.User]" but is not restricted to it only being "[Testing.User]"
\\|(\\w+:.+) This capture group will get the data after the first pipe and stop before the last pipe. In this case, "Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))" but is not restricted to it beginning with "Info:"
\\|(\\w+:.+) Same capture group as previous, but captures whatever is after the last pipe, in this case "Description:([System.String]|This is some description)" but is not restricted to beginning with Description:"
Now if you were to add another pipe followed by more data (|Anything:SomeData), then Description: will be part of group 2 and group 3 would now be "Anything:SomeData".
Code looks like:
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
String text = "[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
String pattern = "(\\[\\w+\\.\\w+\\])\\|(\\w+:.+)\\|(\\w+:.+)";
Match match = Regex.Match(text, pattern);
if (match.Success)
{
Console.WriteLine(match.Groups[1]);
Console.WriteLine(match.Groups[2]);
Console.WriteLine(match.Groups[3]);
}
}
}
Results:
[Testing.User]
Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))
Description:([System.String]|This is some description)
See working sample here... https://dotnetfiddle.net/DYcZuY
See working sample if I add another field following the pattern format here... https://dotnetfiddle.net/Mtc1CD

To do that you need to use balancing groups that is a regex feature exclusive the .net regex engine. It is a counter system, when an opening parenthesis is found the counter is incremented, when a closing is found the counter is decremented, then you only have to test if the counter is null to know if the parenthesis are balanced.
This is the only way to be sure you are inside or outside of the parenthesis:
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string input = #"[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
string pattern = #"(?:[^|()]+|\((?>[^()]+|(?<Open>[(])|(?<-Open>[)]))*(?(Open)(?!))\))+";
foreach (Match m in Regex.Matches(input, pattern))
Console.WriteLine(m.Value);
}
}
demo
pattern details:
(?:
[^|()]+ # all that is not a parenthesis or a pipe
| # OR
# content between parenthesis (eventually nested)
\( # opening parenthesis
# here is the way to obtain balanced parens
(?> # content between parens
[^()]+ # all that is not parenthesis
| # OR
(?<Open>[(]) # an opening parenthesis (increment the counter)
|
(?<-Open>[)]) # a closing parenthesis (decrement the counter)
)* # repeat as needed
(?(Open)(?!)) # make the pattern fail if the counter is not zero
\)
)+
(?(open) (?!) ) is a conditional statement.
(?!) is an always false subpattern (an empty negative lookahead) that means : not followed by nothing
This pattern matches all that is not a pipe and strings enclosed between parenthesis.

Regex is not the best approach for this kind of problem, you may need to write some code to parse your data, I did a simple example that achieve this simple case of yours. The basic idea here is that you want to split only if the | is not inside parenthesis, so i keep track of the parenthesis count. You will need to do some work around to threat cases where parenthesis is part of the description section for instance, but as I say, this is just a start point:
static IEnumerable<String> splitSpecial(string input)
{
StringBuilder builder = new StringBuilder();
int openParenthesisCount = 0;
foreach (char c in input)
{
if (openParenthesisCount == 0 && c == '|')
{
yield return builder.ToString();
builder.Clear();
}
else
{
if (c == '(')
openParenthesisCount++;
if (c == ')')
openParenthesisCount--;
builder.Append(c);
}
}
yield return builder.ToString();
}
static void Main(string[] args)
{
string input = "[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
foreach (String split in splitSpecial(input))
{
Console.WriteLine(split);
}
Console.ReadLine();
}
Ouputs:
[Testing.User]
Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))
Description:([System.String]|This is some description)

This isn't a great/robust solution, but if you know your three top level items are fixed then you can hard code those into your regular expression.
(\[Testing\.User\])\|(Info:.*)\|(Description:.*)
This regular expression will create one match with three groups within it as you were expecting. You can test it here:
http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
Edit: Here's a full working C# example
using System;
using System.Text.RegularExpressions;
namespace ConsoleApplication3
{
internal class Program
{
private static void Main(string[] args)
{
const string input = #"[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
const string pattern = #"(\[Testing\.User\])\|(Info:.*)\|(Description:.*)";
var match = Regex.Match(input, pattern);
if (match.Success)
{
for (int i = 1; i < match.Groups.Count; i++)
{
Console.WriteLine("[" + i + "] = " + match.Groups[i]);
}
}
Console.ReadLine();
}
}
}

Related

c# regrex for a string repeated multiple times

I would like to have a regualr expression for the string where output would be like:
CP_RENOUNCEABLE
CP_RIGHTS_OFFER_TYP
CP_SELLER_FEED_SOURCE
CP_SELLER_ID_BB_GLOBAL
CP_PX
CP_RATIO
CP_RECLASS_TYP
I tried using regex with
string pattern = #"ISNULL(*)";
string strSearch = #"
LTRIM(RTRIM(ISNULL(CP_RENOUNCEABLE,'x2x'))), ISNULL(CP_RIGHTS_OFFER_TYP,-1), LTRIM(RTRIM(ISNULL(CP_SELLER_FEED_SOURCE,'x2x'))),
LTRIM(RTRIM(ISNULL(CP_SELLER_ID_BB_GLOBAL,'x2x'))),ISNULL(CP_PX,-1), ISNULL(CP_RATIO,-1), ISNULL(CP_RECLASS_TYP,-1);
string pattern = #"ISNULL(*\)";
foreach (Match match in Regex.Matches(strSearch, pattern))
{
if (match.Success && match.Groups.Count > 0)
{
var text = match.Groups[1].Value;
}
}
My guess is that we'd be having a comma after our desired outputs listed in the question, which then this simple expression might suffice,
(CP_[A-Z_]+),
Demo 1
If my guess wasn't right, and we would have other chars after that such as an space, we can add a char class on the right side of our capturing group, such as this:
(CP_[A-Z_]+)[,\s]
and we would add any char that might occur after our desired strings in [,\s].
Demo 2
Test
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string pattern = #"(CP_[A-Z_]+),";
string input = #"LTRIM(RTRIM(ISNULL(CP_RENOUNCEABLE,'x2x'))), ISNULL(CP_RIGHTS_OFFER_TYP,-1), LTRIM(RTRIM(ISNULL(CP_SELLER_FEED_SOURCE,'x2x'))),
LTRIM(RTRIM(ISNULL(CP_SELLER_ID_BB_GLOBAL,'x2x'))),ISNULL(CP_PX,-1), ISNULL(CP_RATIO,-1), ISNULL(CP_RECLASS_TYP,-1);";
RegexOptions options = RegexOptions.Multiline;
foreach (Match m in Regex.Matches(input, pattern, options))
{
Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
}
}
}
Edit:
For capturing what is in between ISNULL and the first comma, this might work:
ISNULL\((.+?),
Demo 3

C# "between strings" run several times

Here is my code to find a string between { }:
var text = "Hello this is a {Testvar}...";
int tagFrom = text.IndexOf("{") + "{".Length;
int tagTo = text.LastIndexOf("}");
String tagResult = text.Substring(tagFrom, tagTo - tagFrom);
tagResult Output: Testvar
This only works for one time use.
How can I apply this for several Tags? (eg in a While loop)
For example:
var text = "Hello this is a {Testvar}... and we have more {Tagvar} in this string {Endvar}.";
tagResult[] Output (eg Array): Testvar, Tagvar, Endvar
IndexOf() has another overload that takes the start index of which starts to search the given string. if you omit it, it will always look from the beginning and will always find the first one.
var text = "Hello this is a {Testvar}...";
int start = 0, end = -1;
List<string> results = new List<string>();
while(true)
{
start = text.IndexOf("{", start) + 1;
if(start != 0)
end = text.IndexOf("}", start);
else
break;
if(end==-1) break;
results.Add(text.Substring(start, end - start));
start = end + 1;
}
I strongly recommend using regular expressions for the task.
using System;
using System.Text.RegularExpressions;
namespace ConsoleApp1
{
class Program
{
static void Main(string[] args)
{
var regex = new Regex(#"(\{(?<var>\w*)\})+", RegexOptions.IgnoreCase);
var text = "Hello this is a {Testvar}... and we have more {Tagvar} in this string {Endvar}.";
var matches = regex.Matches(text);
foreach (Match match in matches)
{
var variable = match.Groups["var"];
Console.WriteLine($"Found {variable.Value} from position {variable.Index} to {variable.Index + variable.Length}");
}
}
}
}
Output:
Found Testvar from position 17 to 24
Found Tagvar from position 47 to 53
Found Endvar from position 71 to 77
For more information about regular expression visit the MSDN reference page:
https://learn.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference
and this tool may be great to start testing your own expressions:
http://regexstorm.net/tester
Hope this help!
I would use Regex pattern {(\\w+)} to get the value.
Regex reg = new Regex("{(\\w+)}");
var text = "Hello this is a {Testvar}... and we have more {Tagvar} in this string {Endvar}.";
string[] tagResult = reg.Matches(text)
.Cast<Match>()
.Select(match => match.Groups[1].Value).ToArray();
foreach (var item in tagResult)
{
Console.WriteLine(item);
}
c# online
Result
Testvar
Tagvar
Endvar
Many ways to skin this cat, here are a few:
Split it on { then loop through, splitting each result on } and taking element 0 each time
Split on { or } then loop through taking only odd numbered elements
Adjust your existing logic so you use IndexOf twice (instead of lastindexof). When you’re looking for a } pass the index of the { as the start index of the search
This is so easy by using Regular Expressions just by using a simple pattern like {([\d\w]+)}.
See the example below:-
using System.Text.RegularExpressions;
...
MatchCollection matches = Regex.Matches("Hello this is a {Testvar}... and we have more {Tagvar} in this string {Endvar}.", #"{([\d\w]+)}");
foreach(Match match in matches){
Console.WriteLine("match : {0}, index : {1}", match.Groups[1], match.index);
}
It can find any series of letters or number in these brackets one by one.

c# regex match: length 5-24 with at most 1 underscore

I want to regex match a string that
Is alphanumeric only.
Letters may only be uppercase.
Total length of min 5 and max 24 chars.
May include min 0 max 1 occurrences of underscore in any position except the first or last.
I think I have to somehow nest the statements so that the total length is 5-24 but there may be up to one underscore. I have read a few regex tutorials , but can't understand a way to do this. Also have NO idea how to specify the acceptable position of the underscore (if present).
[A-Z0-9]{5,24}[_]{0,1}
If you are using this in C# code, it's better to check the length of the string outside the regex. (It's possible to cram it inside the regex, but I won't show it here).
private static bool Validate(string str) {
if (str.Length < 5 || str.Length > 24) {
return false;
}
return Regex.IsMatch(str, #"^[A-Z0-9]+(?:_[A-Z0-9]+)?\z");
}
The regex is:
^[A-Z0-9]+(?:_[A-Z0-9]+)?\z
If a string ends with new line, $ can match the empty string before new line, so \z is used here to assert end of string.
Test code
using System;
using System.Text.RegularExpressions;
public class Test
{
public static void Main()
{
string[] fail = {"ABCDacbd", "ACDE", "ABCDE\n", "_01234", "ABCDÉ", "ABCD́Ē", "ABCDEF_", "A_B_CDEF", "AB_C", "1234567890123456789012345", "123456_789012345678901234"};
string[] ok = {"ACBDEF", "01234", "ABC_DE1", "123456789012345678901234", "12345_789012345678901234"};
foreach (string s in fail) {
Console.WriteLine(s + " " + Validate(s));
}
Console.WriteLine();
foreach (string s in ok) {
Console.WriteLine(s + " " + Validate(s));
}
}
private static bool Validate(string str) {
if (str.Length < 5 || str.Length > 24) {
return false;
}
return Regex.IsMatch(str, #"^[A-Z0-9]+(?:_[A-Z0-9]+)?\z");
}
}

How to split strings using regular expressions

I want to split a string into a list or array.
Input: green,"yellow,green",white,orange,"blue,black"
The split character is the comma (,), but it must ignore commas inside quotes.
The output should be:
green
yellow,green
white
orange
blue,black
Thanks.
Actually this is easy enough to just use match :
string subjectString = #"green,""yellow,green"",white,orange,""blue,black""";
try
{
Regex regexObj = new Regex(#"(?<="")\b[a-z,]+\b(?="")|[a-z]+", RegexOptions.IgnoreCase);
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success)
{
Console.WriteLine("{0}", matchResults.Value);
// matched text: matchResults.Value
// match start: matchResults.Index
// match length: matchResults.Length
matchResults = matchResults.NextMatch();
}
}
Output :
green
yellow,green
white
orange
blue,black
Explanation :
#"
# Match either the regular expression below (attempting the next alternative only if this one fails)
(?<= # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
"" # Match the character “""” literally
)
\b # Assert position at a word boundary
[a-z,] # Match a single character present in the list below
# A character in the range between “a” and “z”
# The character “,”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\b # Assert position at a word boundary
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
"" # Match the character “""” literally
)
| # Or match regular expression number 2 below (the entire match attempt fails if this one fails to match)
[a-z] # Match a single character in the range between “a” and “z”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
"
What you have there is an irregular language. In other words, the meaning of a character depends upon the sequence of characters before or after it. As the name implies Regular Expressions are for parsing Regular languages.
What you need here is a Tokenizer and Parser, a good internet search engine should guide you to examples. In fact as the tokens are just characters you probably don't even need the Tokenizer.
While you can do this simple case using a Regular Expression, it is likly to be very slow. It could also cause issues if ever the quotes arn't balanced as a regular expression would not detect this error, where as a parser would.
If you are importing a CSV file you may want to have a look at the Microsoft.VisualBasic.FileIO.TextFieldParser class (Simply add a reference to Microsoft.VisualBasic.dll in a C# project) which parses CSV files.
Another way to do this is to write your own state machine (example below) though this still does not solve the issue of a quote in the middle of a value:
using System;
using System.Text;
namespace Example
{
class Program
{
static void Main(string[] args)
{
string subjectString = #"green,""yellow,green"",white,orange,""blue,black""";
bool inQuote = false;
StringBuilder currentResult = new StringBuilder();
foreach (char c in subjectString)
{
switch (c)
{
case '\"':
inQuote = !inQuote;
break;
case ',':
if (inQuote)
{
currentResult.Append(c);
}
else
{
Console.WriteLine(currentResult);
currentResult.Clear();
}
break;
default:
currentResult.Append(c);
break;
}
}
if (inQuote)
{
throw new FormatException("Input string does not have balanced Quote Characters");
}
Console.WriteLine(currentResult);
}
}
}
Someone will shortly come up with an answer that does this with a single regex. I'm not that clever, but just for the sake of balance, here's a suggestion that doesn't use a regex entirely. Based on the old adage that when you try to solve a problem with a regex, you then have two problems. :)
Personally given my lack of regex-fu, I'd do one of the following:
Use a simple regex-based Replace to escape any commas inside quotes with something else (i.e. "&comma;"). Then you can do a simple string.Split() on the result and unescape each item in the resulting array before you use it. This is yucky. Partly because it's double-handling everything, and partly because it also uses regexes. Boooo!
Parse it by hand, char by char. Convert the string to a char array, then iterate through it, keeping note of whether you're "inside quotes" or not, and build the resulting array a char at a time.
Same as the previous suggestion, but using a csv-parser from someone on the internet. The example one I create below doesn't exactly pass all tests from the csv specification, so it's only really a guide to illustrate my point.
There's a good chance non-regex options would perform better if well-written, because regexes can be a little expensive as they scan strings internally looking for patterns.
Really, I just wanted to point out that you don't have to use a regex. :)
Here's a fairly naive implementation of my second suggestion. On my PC it's happy parsing 1 million 15-column strings in a little over 4.5 seconds.
public class ManualParser : IParser
{
public IEnumerable<string> Parse(string line)
{
if (string.IsNullOrWhiteSpace(line)) return new List<string>();
line = line.Trim();
if (line.Contains(",") == false) return new[] { line.Trim('"') };
if (line.Contains("\"") == false) return line.Split(',').Select(c => c.Trim());
bool withinQuotes = false;
var builder = new List<string>();
var trimChars = new[] { ' ', '"' };
int left = 0;
int right = 0;
for (right = 0; right < line.Length; right++)
{
char c = line[right];
if (c == '"')
{
withinQuotes = !withinQuotes;
continue;
}
if (c == ',' && !withinQuotes)
{
builder.Add(line.Substring(left, right - left).Trim(trimChars));
right++; // Jump the comma
left = right;
}
}
builder.Add(line.Substring(left, right - left).Trim(trimChars));
return builder;
}
}
Here's some unit tests for it:
[TestFixture]
public class ManualParserTests
{
[Test]
public void Parse_GivenStringWithNoQuotesAndNoCommas_ShouldReturnThatString()
{
// Arrange
var parser = new ManualParser();
// Act
string[] result = parser.Parse("This is my data").ToArray();
// Assert
Assert.AreEqual(1, result.Length, "Should only be one column returned");
Assert.AreEqual("This is my data", result[0], "Incorrect value is returned");
}
[Test]
public void Parse_GivenStringWithNoQuotesAndOneComma_ShouldReturnTwoColumns()
{
// Arrange
var parser = new ManualParser();
// Act
string[] result = parser.Parse("This is, my data").ToArray();
// Assert
Assert.AreEqual(2, result.Length, "Should be 2 columns returned");
Assert.AreEqual("This is", result[0], "First value is incorrect");
Assert.AreEqual("my data", result[1], "Second value is incorrect");
}
[Test]
public void Parse_GivenStringWithQuotesAndNoCommas_ShouldReturnColumnWithoutQuotes()
{
// Arrange
var parser = new ManualParser();
// Act
string[] result = parser.Parse("\"This is my data\"").ToArray();
// Assert
Assert.AreEqual(1, result.Length, "Should be 1 column returned");
Assert.AreEqual("This is my data", result[0], "Value is incorrect");
}
[Test]
public void Parse_GivenStringWithQuotesAndCommas_ShouldReturnColumnsWithoutQuotes()
{
// Arrange
var parser = new ManualParser();
// Act
string[] result = parser.Parse("\"This is\", my data").ToArray();
// Assert
Assert.AreEqual(2, result.Length, "Should be 2 columns returned");
Assert.AreEqual("This is", result[0], "First value is incorrect");
Assert.AreEqual("my data", result[1], "Second value is incorrect");
}
[Test]
public void Parse_GivenStringWithQuotesContainingCommasAndCommas_ShouldReturnColumnsWithoutQuotes()
{
// Arrange
var parser = new ManualParser();
// Act
string[] result = parser.Parse("\"This, is\", my data").ToArray();
// Assert
Assert.AreEqual(2, result.Length, "Should be 2 columns returned");
Assert.AreEqual("This, is", result[0], "First value is incorrect");
Assert.AreEqual("my data", result[1], "Second value is incorrect");
}
}
And here's a sample app that I tested the throughput with:
class Program
{
static void Main(string[] args)
{
RunTest();
}
private static void RunTest()
{
var parser = new ManualParser();
string csv = Properties.Resources.Csv;
var result = new StringBuilder();
var s = new Stopwatch();
for (int test = 0; test < 3; test++)
{
int lineCount = 0;
s.Start();
for (int i = 0; i < 1000000 / 50; i++)
{
foreach (var line in csv.Split(new[] { Environment.NewLine }, StringSplitOptions.None))
{
string cur = line + s.ElapsedTicks.ToString();
result.AppendLine(parser.Parse(cur).ToString());
lineCount++;
}
}
s.Stop();
Console.WriteLine("Completed {0} lines in {1}ms", lineCount, s.ElapsedMilliseconds);
s.Reset();
result = new StringBuilder();
}
}
}
The format of the string you are trying to split appears to be standard CSV. Using a CSV parser would likely be easier/faster.
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
string input = #"green,""yellow,green"",white,orange,""blue,black""";
string splitOn = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";
string[] words = Regex.Split(input, splitOn);
foreach (var word in words)
{
Console.WriteLine(word);
}
}
}
OUTPUT:
green
"yellow,green"
white
orange
"blue,black"
enclosing the regex matching within '(' and ')' and then splitting on this regex should solve this.
eg: /("[^"]+")/g

Best way to parse Space Separated Text

I have string like this
/c SomeText\MoreText "Some Text\More Text\Lol" SomeText
I want to tokenize it, however I can't just split on the spaces. I've come up with somewhat ugly parser that works, but I'm wondering if anyone has a more elegant design.
This is in C# btw.
EDIT: My ugly version, while ugly, is O(N) and may actually be faster than using a RegEx.
private string[] tokenize(string input)
{
string[] tokens = input.Split(' ');
List<String> output = new List<String>();
for (int i = 0; i < tokens.Length; i++)
{
if (tokens[i].StartsWith("\""))
{
string temp = tokens[i];
int k = 0;
for (k = i + 1; k < tokens.Length; k++)
{
if (tokens[k].EndsWith("\""))
{
temp += " " + tokens[k];
break;
}
else
{
temp += " " + tokens[k];
}
}
output.Add(temp);
i = k + 1;
}
else
{
output.Add(tokens[i]);
}
}
return output.ToArray();
}
The computer term for what you're doing is lexical analysis; read that for a good summary of this common task.
Based on your example, I'm guessing that you want whitespace to separate your words, but stuff in quotation marks should be treated as a "word" without the quotes.
The simplest way to do this is to define a word as a regular expression:
([^"^\s]+)\s*|"([^"]+)"\s*
This expression states that a "word" is either (1) non-quote, non-whitespace text surrounded by whitespace, or (2) non-quote text surrounded by quotes (followed by some whitespace). Note the use of capturing parentheses to highlight the desired text.
Armed with that regex, your algorithm is simple: search your text for the next "word" as defined by the capturing parentheses, and return it. Repeat that until you run out of "words".
Here's the simplest bit of working code I could come up with, in VB.NET. Note that we have to check both groups for data since there are two sets of capturing parentheses.
Dim token As String
Dim r As Regex = New Regex("([^""^\s]+)\s*|""([^""]+)""\s*")
Dim m As Match = r.Match("this is a ""test string""")
While m.Success
token = m.Groups(1).ToString
If token.length = 0 And m.Groups.Count > 1 Then
token = m.Groups(2).ToString
End If
m = m.NextMatch
End While
Note 1: Will's answer, above, is the same idea as this one. Hopefully this answer explains the details behind the scene a little better :)
The Microsoft.VisualBasic.FileIO namespace (in Microsoft.VisualBasic.dll) has a TextFieldParser you can use to split on space delimeted text. It handles strings within quotes (i.e., "this is one token" thisistokentwo) well.
Note, just because the DLL says VisualBasic doesn't mean you can only use it in a VB project. Its part of the entire Framework.
There is the state machine approach.
private enum State
{
None = 0,
InTokin,
InQuote
}
private static IEnumerable<string> Tokinize(string input)
{
input += ' '; // ensure we end on whitespace
State state = State.None;
State? next = null; // setting the next state implies that we have found a tokin
StringBuilder sb = new StringBuilder();
foreach (char c in input)
{
switch (state)
{
default:
case State.None:
if (char.IsWhiteSpace(c))
continue;
else if (c == '"')
{
state = State.InQuote;
continue;
}
else
state = State.InTokin;
break;
case State.InTokin:
if (char.IsWhiteSpace(c))
next = State.None;
else if (c == '"')
next = State.InQuote;
break;
case State.InQuote:
if (c == '"')
next = State.None;
break;
}
if (next.HasValue)
{
yield return sb.ToString();
sb = new StringBuilder();
state = next.Value;
next = null;
}
else
sb.Append(c);
}
}
It can easily be extended for things like nested quotes and escaping. Returning as IEnumerable<string> allows your code to only parse as much as you need. There aren't any real downsides to that kind of lazy approach as strings are immutable so you know that input isn't going to change before you have parsed the whole thing.
See: http://en.wikipedia.org/wiki/Automata-Based_Programming
You also might want to look into regular expressions. That might help you out. Here is a sample ripped off from MSDN...
using System;
using System.Text.RegularExpressions;
public class Test
{
public static void Main ()
{
// Define a regular expression for repeated words.
Regex rx = new Regex(#"\b(?<word>\w+)\s+(\k<word>)\b",
RegexOptions.Compiled | RegexOptions.IgnoreCase);
// Define a test string.
string text = "The the quick brown fox fox jumped over the lazy dog dog.";
// Find matches.
MatchCollection matches = rx.Matches(text);
// Report the number of matches found.
Console.WriteLine("{0} matches found in:\n {1}",
matches.Count,
text);
// Report on each match.
foreach (Match match in matches)
{
GroupCollection groups = match.Groups;
Console.WriteLine("'{0}' repeated at positions {1} and {2}",
groups["word"].Value,
groups[0].Index,
groups[1].Index);
}
}
}
// The example produces the following output to the console:
// 3 matches found in:
// The the quick brown fox fox jumped over the lazy dog dog.
// 'The' repeated at positions 0 and 4
// 'fox' repeated at positions 20 and 25
// 'dog' repeated at positions 50 and 54
Craig is right — use regular expressions. Regex.Split may be more concise for your needs.
[^\t]+\t|"[^"]+"\t
using the Regex definitely looks like the best bet, however this one just returns the whole string. I'm trying to tweak it, but not much luck so far.
string[] tokens = System.Text.RegularExpressions.Regex.Split(this.BuildArgs, #"[^\t]+\t|""[^""]+""\t");

Categories