How to split strings using regular expressions

I want to split a string into a list or array.
Input: green,"yellow,green",white,orange,"blue,black"
The split character is the comma (,), but it must ignore commas inside quotes.
The output should be:
green
yellow,green
white
orange
blue,black
Thanks.

Actually, this is easy enough to do with Match:
string subjectString = @"green,""yellow,green"",white,orange,""blue,black""";
try
{
    Regex regexObj = new Regex(@"(?<="")\b[a-z,]+\b(?="")|[a-z]+", RegexOptions.IgnoreCase);
    Match matchResults = regexObj.Match(subjectString);
    while (matchResults.Success)
    {
        Console.WriteLine("{0}", matchResults.Value);
        // matched text: matchResults.Value
        // match start: matchResults.Index
        // match length: matchResults.Length
        matchResults = matchResults.NextMatch();
    }
}
catch (ArgumentException ex)
{
    // The Regex constructor throws this when the pattern has a syntax error
    Console.WriteLine(ex.Message);
}
Output :
green
yellow,green
white
orange
blue,black
Explanation :
@"
# Match either the regular expression below (attempting the next alternative only if this one fails)
(?<= # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
"" # Match the character “""” literally
)
\b # Assert position at a word boundary
[a-z,] # Match a single character present in the list below
# A character in the range between “a” and “z”
# The character “,”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\b # Assert position at a word boundary
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
"" # Match the character “""” literally
)
| # Or match regular expression number 2 below (the entire match attempt fails if this one fails to match)
[a-z] # Match a single character in the range between “a” and “z”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
"

What you have there is an irregular language. In other words, the meaning of a character depends upon the sequence of characters before or after it. As the name implies, Regular Expressions are for parsing Regular languages.
What you need here is a Tokenizer and Parser; a good internet search engine should guide you to examples. In fact, as the tokens are just characters, you probably don't even need the Tokenizer.
While you can do this simple case using a Regular Expression, it is likely to be very slow. It could also cause issues if the quotes are ever unbalanced, as a regular expression would not detect this error, whereas a parser would.
If you are importing a CSV file, you may want to have a look at the Microsoft.VisualBasic.FileIO.TextFieldParser class (simply add a reference to Microsoft.VisualBasic.dll in a C# project), which parses CSV files.
Another way to do this is to write your own state machine (example below), though this still does not solve the issue of a quote in the middle of a value:
using System;
using System.Text;
namespace Example
{
class Program
{
static void Main(string[] args)
{
string subjectString = @"green,""yellow,green"",white,orange,""blue,black""";
bool inQuote = false;
StringBuilder currentResult = new StringBuilder();
foreach (char c in subjectString)
{
switch (c)
{
case '\"':
inQuote = !inQuote;
break;
case ',':
if (inQuote)
{
currentResult.Append(c);
}
else
{
Console.WriteLine(currentResult);
currentResult.Clear();
}
break;
default:
currentResult.Append(c);
break;
}
}
if (inQuote)
{
throw new FormatException("Input string does not have balanced Quote Characters");
}
Console.WriteLine(currentResult);
}
}
}
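For the TextFieldParser route mentioned above, a minimal sketch might look like the following. The class and method names (CsvLineSplitter, SplitLine) are made up for illustration, and it assumes the project can reference the Microsoft.VisualBasic assembly.

```csharp
using System.IO;
using Microsoft.VisualBasic.FileIO;

// Hypothetical helper, not part of the original answer.
public static class CsvLineSplitter
{
    // Splits one comma-delimited line, treating double-quoted fields as a unit.
    public static string[] SplitLine(string line)
    {
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;
            // ReadFields returns null when there is nothing left to read.
            return parser.ReadFields() ?? new string[0];
        }
    }
}
```

Calling CsvLineSplitter.SplitLine on the question's input should give the five fields with the surrounding quotes removed. A malformed line makes the parser throw, which is the error detection a bare regex does not give you.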

Someone will shortly come up with an answer that does this with a single regex. I'm not that clever, but just for the sake of balance, here's a suggestion that doesn't rely entirely on a regex, based on the old adage that when you try to solve a problem with a regex, you then have two problems. :)
Personally, given my lack of regex-fu, I'd do one of the following:
1. Use a simple regex-based Replace to escape any commas inside quotes with something else (e.g. "&comma;"). Then you can do a simple string.Split() on the result and unescape each item in the resulting array before you use it. This is yucky. Partly because it's double-handling everything, and partly because it also uses regexes. Boooo!
2. Parse it by hand, char by char. Convert the string to a char array, then iterate through it, keeping note of whether you're "inside quotes" or not, and build the resulting array a char at a time.
3. Same as the previous suggestion, but using a CSV parser from someone on the internet. The example I created below doesn't pass all the tests from the CSV specification, so it's only really a guide to illustrate my point.
There's a good chance non-regex options would perform better if well-written, because regexes can be a little expensive as they scan strings internally looking for patterns.
Really, I just wanted to point out that you don't have to use a regex. :)
Here's a fairly naive implementation of my second suggestion. On my PC it's happy parsing 1 million 15-column strings in a little over 4.5 seconds.
public class ManualParser : IParser
{
public IEnumerable<string> Parse(string line)
{
if (string.IsNullOrWhiteSpace(line)) return new List<string>();
line = line.Trim();
if (line.Contains(",") == false) return new[] { line.Trim('"') };
if (line.Contains("\"") == false) return line.Split(',').Select(c => c.Trim());
bool withinQuotes = false;
var builder = new List<string>();
var trimChars = new[] { ' ', '"' };
int left = 0;
int right = 0;
for (right = 0; right < line.Length; right++)
{
char c = line[right];
if (c == '"')
{
withinQuotes = !withinQuotes;
continue;
}
if (c == ',' && !withinQuotes)
{
builder.Add(line.Substring(left, right - left).Trim(trimChars));
left = right + 1; // Next field starts after the comma; the loop's own right++ steps past it
}
}
builder.Add(line.Substring(left, right - left).Trim(trimChars));
return builder;
}
}
Here are some unit tests for it:
[TestFixture]
public class ManualParserTests
{
[Test]
public void Parse_GivenStringWithNoQuotesAndNoCommas_ShouldReturnThatString()
{
// Arrange
var parser = new ManualParser();
// Act
string[] result = parser.Parse("This is my data").ToArray();
// Assert
Assert.AreEqual(1, result.Length, "Should only be one column returned");
Assert.AreEqual("This is my data", result[0], "Incorrect value is returned");
}
[Test]
public void Parse_GivenStringWithNoQuotesAndOneComma_ShouldReturnTwoColumns()
{
// Arrange
var parser = new ManualParser();
// Act
string[] result = parser.Parse("This is, my data").ToArray();
// Assert
Assert.AreEqual(2, result.Length, "Should be 2 columns returned");
Assert.AreEqual("This is", result[0], "First value is incorrect");
Assert.AreEqual("my data", result[1], "Second value is incorrect");
}
[Test]
public void Parse_GivenStringWithQuotesAndNoCommas_ShouldReturnColumnWithoutQuotes()
{
// Arrange
var parser = new ManualParser();
// Act
string[] result = parser.Parse("\"This is my data\"").ToArray();
// Assert
Assert.AreEqual(1, result.Length, "Should be 1 column returned");
Assert.AreEqual("This is my data", result[0], "Value is incorrect");
}
[Test]
public void Parse_GivenStringWithQuotesAndCommas_ShouldReturnColumnsWithoutQuotes()
{
// Arrange
var parser = new ManualParser();
// Act
string[] result = parser.Parse("\"This is\", my data").ToArray();
// Assert
Assert.AreEqual(2, result.Length, "Should be 2 columns returned");
Assert.AreEqual("This is", result[0], "First value is incorrect");
Assert.AreEqual("my data", result[1], "Second value is incorrect");
}
[Test]
public void Parse_GivenStringWithQuotesContainingCommasAndCommas_ShouldReturnColumnsWithoutQuotes()
{
// Arrange
var parser = new ManualParser();
// Act
string[] result = parser.Parse("\"This, is\", my data").ToArray();
// Assert
Assert.AreEqual(2, result.Length, "Should be 2 columns returned");
Assert.AreEqual("This, is", result[0], "First value is incorrect");
Assert.AreEqual("my data", result[1], "Second value is incorrect");
}
}
And here's a sample app that I tested the throughput with:
class Program
{
static void Main(string[] args)
{
RunTest();
}
private static void RunTest()
{
var parser = new ManualParser();
string csv = Properties.Resources.Csv;
var result = new StringBuilder();
var s = new Stopwatch();
for (int test = 0; test < 3; test++)
{
int lineCount = 0;
s.Start();
for (int i = 0; i < 1000000 / 50; i++)
{
foreach (var line in csv.Split(new[] { Environment.NewLine }, StringSplitOptions.None))
{
string cur = line + s.ElapsedTicks.ToString();
result.AppendLine(parser.Parse(cur).ToString());
lineCount++;
}
}
s.Stop();
Console.WriteLine("Completed {0} lines in {1}ms", lineCount, s.ElapsedMilliseconds);
s.Reset();
result = new StringBuilder();
}
}
}
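For comparison, the first suggestion in this answer (escape the commas inside quotes, split, then unescape) could be sketched roughly as below. The class name EscapeThenSplit is hypothetical, and the sketch assumes the placeholder text "&comma;" never occurs in the real data and that quotes are balanced.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating the escape / split / unescape idea.
public static class EscapeThenSplit
{
    private const string Placeholder = "&comma;"; // assumed not to appear in the input

    public static string[] Split(string line)
    {
        // For each "..." run, drop the quotes and hide the commas inside it.
        string escaped = Regex.Replace(line, "\"([^\"]*)\"",
            m => m.Groups[1].Value.Replace(",", Placeholder));

        // A plain Split now sees only the separating commas.
        return escaped.Split(',')
                      .Select(field => field.Replace(Placeholder, ","))
                      .ToArray();
    }
}
```

As the answer says, this handles every field twice, so it is mainly useful as a quick fix rather than as a parser.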

The format of the string you are trying to split appears to be standard CSV. Using a CSV parser would likely be easier/faster.

using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
string input = @"green,""yellow,green"",white,orange,""blue,black""";
string splitOn = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";
string[] words = Regex.Split(input, splitOn);
foreach (var word in words)
{
Console.WriteLine(word);
}
}
}
OUTPUT:
green
"yellow,green"
white
orange
"blue,black"

enclosing the regex matching within '(' and ')' and then splitting on this regex should solve this.
eg: /("[^"]+")/g

Related

C# "between strings" run several times

Here is my code to find a string between { }:
var text = "Hello this is a {Testvar}...";
int tagFrom = text.IndexOf("{") + "{".Length;
int tagTo = text.LastIndexOf("}");
String tagResult = text.Substring(tagFrom, tagTo - tagFrom);
tagResult Output: Testvar
This only works for one time use.
How can I apply this for several Tags? (eg in a While loop)
For example:
var text = "Hello this is a {Testvar}... and we have more {Tagvar} in this string {Endvar}.";
tagResult[] Output (eg Array): Testvar, Tagvar, Endvar

IndexOf() has another overload that takes the index at which to start searching the given string. If you omit it, the search always starts from the beginning and always finds the first occurrence.
var text = "Hello this is a {Testvar}...";
int start = 0, end = -1;
List<string> results = new List<string>();
while(true)
{
start = text.IndexOf("{", start) + 1;
if(start != 0)
end = text.IndexOf("}", start);
else
break;
if(end==-1) break;
results.Add(text.Substring(start, end - start));
start = end + 1;
}

I strongly recommend using regular expressions for the task.
using System;
using System.Text.RegularExpressions;
namespace ConsoleApp1
{
class Program
{
static void Main(string[] args)
{
var regex = new Regex(@"(\{(?<var>\w*)\})+", RegexOptions.IgnoreCase);
var text = "Hello this is a {Testvar}... and we have more {Tagvar} in this string {Endvar}.";
var matches = regex.Matches(text);
foreach (Match match in matches)
{
var variable = match.Groups["var"];
Console.WriteLine($"Found {variable.Value} from position {variable.Index} to {variable.Index + variable.Length}");
}
}
}
}
Output:
Found Testvar from position 17 to 24
Found Tagvar from position 47 to 53
Found Endvar from position 71 to 77
For more information about regular expression visit the MSDN reference page:
https://learn.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference
and this tool may be great to start testing your own expressions:
http://regexstorm.net/tester
Hope this helps!

I would use Regex pattern {(\\w+)} to get the value.
Regex reg = new Regex("{(\\w+)}");
var text = "Hello this is a {Testvar}... and we have more {Tagvar} in this string {Endvar}.";
string[] tagResult = reg.Matches(text)
.Cast<Match>()
.Select(match => match.Groups[1].Value).ToArray();
foreach (var item in tagResult)
{
Console.WriteLine(item);
}
Result
Testvar
Tagvar
Endvar

There are many ways to skin this cat; here are a few:
- Split it on { then loop through, splitting each result on } and taking element 0 each time
- Split on { or } then loop through, taking only the odd-numbered elements
- Adjust your existing logic so you use IndexOf twice (instead of LastIndexOf). When you're looking for a }, pass the index of the { as the start index of the search
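As an illustration of the second option (split on both braces and keep the odd-numbered elements), here is a small sketch. The name TagExtractor is hypothetical, and it assumes every { has a matching } and that braces are never nested.

```csharp
using System.Collections.Generic;

// Hypothetical helper for the "split on { or }" idea.
public static class TagExtractor
{
    public static List<string> Extract(string text)
    {
        // Text outside braces lands at even indexes, tag names at odd indexes.
        string[] parts = text.Split('{', '}');
        var tags = new List<string>();
        for (int i = 1; i < parts.Length; i += 2)
        {
            tags.Add(parts[i]);
        }
        return tags;
    }
}
```

With the sample sentence from the question this yields Testvar, Tagvar and Endvar. Unbalanced braces would shift the odd/even pattern, so validate the input first if that can happen.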

This is easy with regular expressions, using a simple pattern like {([\d\w]+)}.
See the example below:
using System.Text.RegularExpressions;
...
MatchCollection matches = Regex.Matches("Hello this is a {Testvar}... and we have more {Tagvar} in this string {Endvar}.", @"{([\d\w]+)}");
foreach (Match match in matches)
{
Console.WriteLine("match : {0}, index : {1}", match.Groups[1], match.Index);
}
It finds each series of letters or numbers inside the braces, one by one.

Complex string splitting

I have a string like the following:
[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)
You can look at it as this tree:
- [Testing.User]
- Info
    - [Testing.Info]
    - Name
        - [System.String]
        - Matt
    - Age
        - [System.Int32]
        - 21
- Description
    - [System.String]
    - This is some description
As you can see, it's a string serialization / representation of a class Testing.User
I want to be able to do a split and get the following elements in the resulting array:
[0] = [Testing.User]
[1] = Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))
[2] = Description:([System.String]|This is some description)
I can't split by | because that would result in:
[0] = [Testing.User]
[1] = Info:([Testing.Info]
[2] = Name:([System.String]
[3] = Matt)
[4] = Age:([System.Int32]
[5] = 21))
[6] = Description:([System.String]
[7] = This is some description)
How can I get my expected result?
I'm not very good with regular expressions, but I am aware it is a very possible solution for this case.

Using regex lookahead
You can use a regex like this:
(\[.*?])|(\w+:.*?)\|(?=Description:)|(Description:.*)
The idea behind this regex is to capture in groups 1,2 and 3 what you want.
You can see it easily with this diagram:
Match information
MATCH 1
1. [0-14] `[Testing.User]`
MATCH 2
2. [15-88] `Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))`
MATCH 3
3. [89-143] `Description:([System.String]|This is some description)`
Regular regex
On the other hand, if you don't like above regex, you can use another one like this:
(\[.*?])\|(.*)\|(Description:.*)
Or even forcing one character at least:
(\[.+?])\|(.+)\|(Description:.+)

There are more than enough splitting answers already, so here is another approach. If your input represents a tree structure, why not parse it to a tree?
The following code was automatically translated from VB.NET, but it should work as far as I tested it.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Treeparse
{
class Program
{
static void Main(string[] args)
{
var input = "[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
var t = StringTree.Parse(input);
Console.WriteLine(t.ToString());
Console.ReadKey();
}
}
public class StringTree
{
//Branching constants
const string BranchOff = "(";
const string BranchBack = ")";
const string NextTwig = "|";
//Content of this twig
public string Text;
//List of Sub-Twigs
public List<StringTree> Twigs;
[System.Diagnostics.DebuggerStepThrough()]
public StringTree()
{
Text = "";
Twigs = new List<StringTree>();
}
private static void ParseRecursive(StringTree Tree, string InputStr, ref int Position)
{
do {
StringTree NewTwig = new StringTree();
do {
NewTwig.Text = NewTwig.Text + InputStr[Position];
Position += 1;
} while (!(Position == InputStr.Length || (new String[] { BranchBack, BranchOff, NextTwig }.ToList().Contains(InputStr[Position].ToString()))));
Tree.Twigs.Add(NewTwig);
if (Position < InputStr.Length && InputStr[Position].ToString() == BranchOff) { Position += 1; ParseRecursive(NewTwig, InputStr, ref Position); Position += 1; }
if (Position < InputStr.Length && InputStr[Position].ToString() == BranchBack)
break; // TODO: might not be correct. Was : Exit Do
Position += 1;
} while (!(Position >= InputStr.Length || InputStr[Position].ToString() == BranchBack));
}
/// <summary>
/// Call this to parse the input into a StringTree objects using recursion
/// </summary>
public static StringTree Parse(string Input)
{
StringTree t = new StringTree();
t.Text = "Root";
int Start = 0;
ParseRecursive(t, Input, ref Start);
return t;
}
private void ToStringRecursive(ref StringBuilder sb, StringTree tree, int Level)
{
for (int i = 1; i <= Level; i++)
{
sb.Append(" ");
}
sb.AppendLine(tree.Text);
int NextLevel = Level + 1;
foreach (StringTree NextTree in tree.Twigs)
{
ToStringRecursive(ref sb, NextTree, NextLevel);
}
}
public override string ToString()
{
var sb = new System.Text.StringBuilder();
ToStringRecursive(ref sb, this, 0);
return sb.ToString();
}
}
}
You get the values of each node with its associated subvalues in a tree-like structure, and you can then do whatever you like with it, for example show the structure in a TreeView control.

Assuming your groups can be marked as
[Anything.Anything]
Anything:ReallyAnything (Letters & Numbers only:Then any amount of characters) after the first pipe
Anything:ReallyAnything (Letters & Numbers only:Then any amount of characters) after the last pipe
Then you have a pattern like:
"(\\[\\w+\\.\\w+\\])\\|(\\w+:.+)\\|(\\w+:.+)";
(\\[\\w+\\.\\w+\\]) This capture group will get the "[Testing.User]" but is not restricted to it only being "[Testing.User]"
\\|(\\w+:.+) This capture group will get the data after the first pipe and stop before the last pipe. In this case, "Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))" but is not restricted to it beginning with "Info:"
\\|(\\w+:.+) Same capture group as previous, but captures whatever is after the last pipe, in this case "Description:([System.String]|This is some description)" but is not restricted to beginning with "Description:"
Now if you were to add another pipe followed by more data (|Anything:SomeData), then Description: will be part of group 2 and group 3 would now be "Anything:SomeData".
Code looks like:
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
String text = "[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
String pattern = "(\\[\\w+\\.\\w+\\])\\|(\\w+:.+)\\|(\\w+:.+)";
Match match = Regex.Match(text, pattern);
if (match.Success)
{
Console.WriteLine(match.Groups[1]);
Console.WriteLine(match.Groups[2]);
Console.WriteLine(match.Groups[3]);
}
}
}
Results:
[Testing.User]
Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))
Description:([System.String]|This is some description)
See working sample here... https://dotnetfiddle.net/DYcZuY
See working sample if I add another field following the pattern format here... https://dotnetfiddle.net/Mtc1CD

To do that you need to use balancing groups, a regex feature exclusive to the .NET regex engine. It is a counter system: when an opening parenthesis is found the counter is incremented, when a closing one is found the counter is decremented, and then you only have to test whether the counter is zero to know if the parentheses are balanced.
This is the only way to be sure you are inside or outside of the parentheses:
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string input = @"[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
string pattern = @"(?:[^|()]+|\((?>[^()]+|(?<Open>[(])|(?<-Open>[)]))*(?(Open)(?!))\))+";
foreach (Match m in Regex.Matches(input, pattern))
Console.WriteLine(m.Value);
}
}
pattern details:
(?:
[^|()]+ # all that is not a parenthesis or a pipe
| # OR
# content between parenthesis (eventually nested)
\( # opening parenthesis
# here is the way to obtain balanced parens
(?> # content between parens
[^()]+ # all that is not parenthesis
| # OR
(?<Open>[(]) # an opening parenthesis (increment the counter)
|
(?<-Open>[)]) # a closing parenthesis (decrement the counter)
)* # repeat as needed
(?(Open)(?!)) # make the pattern fail if the counter is not zero
\)
)+
(?(Open)(?!)) is a conditional statement: if the Open group still holds a capture, the always-failing (?!) is attempted.
(?!) is an always-false subpattern (an empty negative lookahead) that means: not followed by nothing.
This pattern matches everything that is not a pipe, together with strings enclosed in parentheses.
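To see the counter idea in isolation, here is a minimal sketch that only checks whether the parentheses in a string are balanced. ParenChecker and IsBalanced are hypothetical names; the pattern is a simplified variant of the one above, not the answer's own.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper showing a .NET balancing group used as a counter.
public static class ParenChecker
{
    private static readonly Regex Balanced = new Regex(
        @"^(?:[^()]"          // any character that is not a parenthesis
        + @"|(?<Open>\()"     // "(" pushes a capture onto the Open stack
        + @"|(?<-Open>\))"    // ")" pops one; fails if the stack is empty
        + @")*"
        + @"(?(Open)(?!))$"); // fail if anything is still on the stack

    public static bool IsBalanced(string s)
    {
        return Balanced.IsMatch(s);
    }
}
```

A closing parenthesis with nothing to pop makes the match fail, which is how the engine notices ")(" as well as a missing ")".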

Regex is not the best approach for this kind of problem; you may need to write some code to parse your data. I wrote a simple example that handles this simple case of yours. The basic idea is that you want to split only if the | is not inside parentheses, so I keep track of the parenthesis count. You will need some extra work to treat cases where a parenthesis is part of the description section, for instance, but as I say, this is just a starting point:
static IEnumerable<String> splitSpecial(string input)
{
StringBuilder builder = new StringBuilder();
int openParenthesisCount = 0;
foreach (char c in input)
{
if (openParenthesisCount == 0 && c == '|')
{
yield return builder.ToString();
builder.Clear();
}
else
{
if (c == '(')
openParenthesisCount++;
if (c == ')')
openParenthesisCount--;
builder.Append(c);
}
}
yield return builder.ToString();
}
static void Main(string[] args)
{
string input = "[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
foreach (String split in splitSpecial(input))
{
Console.WriteLine(split);
}
Console.ReadLine();
}
Outputs:
[Testing.User]
Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))
Description:([System.String]|This is some description)

This isn't a great/robust solution, but if you know your three top level items are fixed then you can hard code those into your regular expression.
(\[Testing\.User\])\|(Info:.*)\|(Description:.*)
This regular expression will create one match with three groups within it as you were expecting. You can test it here:
http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
Edit: Here's a full working C# example
using System;
using System.Text.RegularExpressions;
namespace ConsoleApplication3
{
internal class Program
{
private static void Main(string[] args)
{
const string input = @"[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
const string pattern = @"(\[Testing\.User\])\|(Info:.*)\|(Description:.*)";
var match = Regex.Match(input, pattern);
if (match.Success)
{
for (int i = 1; i < match.Groups.Count; i++)
{
Console.WriteLine("[" + i + "] = " + match.Groups[i]);
}
}
Console.ReadLine();
}
}
}

Matching any word enclosed in parentheses in a sentence

I am trying to find a regex to match any word enclosed in parentheses in a sentence.
Suppose, I have a sentence.
"Welcome, (Hello, All of you) to the Stack Over flow."
Say if my matching word is Hello,, All, of or you. It should return true.
Word could contain anything number , symbol but separated from other by white-space
I tried with this \(([^)]*)\). but this returns all words enclosed by parentheses
static void Main(string[] args)
{
string ss = "Welcome, (Hello, All of you) to the Stack Over flow.";
Regex _regex = new Regex(@"\(([^)]*)\)");
Match match = _regex.Match(ss.ToLower());
if (match.Success)
{
ss = match.Groups[0].Value;
}
}
Help and Guidance is very much appreciated.
Thanks.
Thanks, people, for your time and answers. I finally solved it by changing my code as suggested in Tim's reply.
For people with a similar problem, here is my final code:
static void Main(string[] args)
{
string ss = "Welcome, (Hello, All of you) to the Stack Over flow.";
Regex _regex = new Regex(@"[^\s()]+(?=[^()]*\))");
Match match = _regex.Match(ss.ToLower());
while (match.Success)
{
ss = match.Groups[0].Value;
Console.WriteLine(ss);
match = match.NextMatch();
}
}

OK, so it seems that a "word" is anything that's not whitespace and doesn't contain parentheses, and that you want to match a word if the next parenthesis character that follows is a closing parenthesis.
So you can use
[^\s()]+(?=[^()]*\))
Explanation:
[^\s()]+ matches a "word" (should be easy to understand), and
(?=[^()]*\)) makes sure that a closing parenthesis follows:
(?= # Look ahead to make sure the following regex matches here:
[^()]* # Any number of characters except parentheses
\) # followed by a closing parenthesis.
) # (End of lookahead assertion)
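Since the question asks for a true/false answer for a given word, the pattern above could be wrapped along these lines. ParenWords and Contains are hypothetical names, and the case-insensitive comparison is an assumption based on the ToLower() call in the question's code.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper built on the lookahead pattern explained above.
public static class ParenWords
{
    private static readonly Regex WordInParens = new Regex(@"[^\s()]+(?=[^()]*\))");

    // True if 'word' is one of the whitespace-separated words inside parentheses.
    public static bool Contains(string sentence, string word)
    {
        return WordInParens.Matches(sentence)
                           .Cast<Match>()
                           .Any(m => string.Equals(m.Value, word, StringComparison.OrdinalIgnoreCase));
    }
}
```

Note that punctuation stays attached to the word, so in the sample sentence the first word is "Hello," with its comma.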

I've developed a c# function for you, if you are interested.
public static class WordsHelper
{
public static List<string> GetWordsInsideParenthesis(string s)
{
List<int> StartIndices = new List<int>();
var rtn = new List<string>();
var numOfOpen = s.Where(m => m == '(').ToList().Count;
var numOfClose = s.Where(m => m == ')').ToList().Count;
if (numOfClose == numOfOpen)
{
for (int i = 0; i < numOfOpen; i++)
{
int ss = 0, sss = 0;
if (StartIndices.Count == 0)
{
ss = s.IndexOf('(') + 1; StartIndices.Add(ss);
sss = s.IndexOf(')');
}
else
{
ss = s.IndexOf('(', StartIndices.Last()) + 1; StartIndices.Add(ss); // remember this start so the next pass moves on
sss = s.IndexOf(')', ss);
}
var words = s.Substring(ss, sss - ss).Split(' ');
foreach (string ssss in words)
{
rtn.Add(ssss);
}
}
}
return rtn;
}
}
Just call it this way:
var text = "Welcome, (Hello, All of you) to the (Stack Over flow).";
var words = WordsHelper.GetWordsInsideParenthesis(text);
Now you'll have a list of words in words variable.
Generally, hand-written C# like this can be easier to read and debug than a regex, and may also perform better, though that is worth measuring for your own input.
But if you want to stick with Regex, keep the regex from Tim Pietzcker, [^\s()]+(?=[^()]*\)), and use it this way:
var text = "Welcome, (Hello, All of you) to the (Stack Over flow).";
var values = Regex.Matches(text, @"[^\s()]+(?=[^()]*\))");
Now values contains a MatchCollection.
You can access each match by index and read its Value property, something like this:
string word = values[0].Value;

(?<=[(])[^)]+(?=[)])
Matches everything between the parentheses as a single match
(?<=[(]) Checks for (
[^)]+ Matches everything up to but not including a )
(?=[)]) Checks for )

Using regex or string manipulation when creating permalinks

I have the following method (which looks expensive too) for creating permalinks, but it's lacking a few things that are quite important for a nice permalink:
public string createPermalink(string text)
{
text = text.ToLower().TrimStart().TrimEnd();
foreach (char c in text.ToCharArray())
{
if (!char.IsLetterOrDigit(c) && !char.IsWhiteSpace(c))
{
text = text.Replace(c.ToString(), "");
}
if (char.IsWhiteSpace(c))
{
text = text.Replace(c, '-');
}
}
if (text.Length > 200)
{
text = text.Remove(200);
}
return text;
}
A few things it is lacking:
If someone enters text like this:
"My choiches are:foo,bar" it is returned as "my-choiches-arefoobar"
when it should be "my-choiches-are-foo-bar",
and if someone enters multiple white spaces they are returned as "---", which is not nice to have in a URL.
Is there some better way to do this with a regex (I have really only used it a few times)?
UPDATE:
The requirements were:
Any non-digit or non-letter chars at the beginning or end are not allowed
Any non-digit or non-letter chars should be replaced by "-"
When replaced with "-", chars should not repeat like "---"
And finally, cut the string at index 200 to ensure it's not too long

Change to
public string createPermalink(string text)
{
text = text.ToLower();
StringBuilder sb = new StringBuilder(text.Length);
// We want to skip the first hyphenable characters and go to the "meat" of the string
bool lastHyphen = true;
// You can enumerate directly a string
foreach (char c in text)
{
if (char.IsLetterOrDigit(c))
{
sb.Append(c);
lastHyphen = false;
}
else if (!lastHyphen)
{
// We use lastHyphen to not put two hyphens consecutively
sb.Append('-');
lastHyphen = true;
}
if (sb.Length == 200)
{
break;
}
}
// Remove the last hyphen
if (sb.Length > 0 && sb[sb.Length - 1] == '-')
{
sb.Length--;
}
return sb.ToString();
}
If you really want to use regexes, you can do something like this (based on the code of Justin)
Regex rgx = new Regex(@"^\W+|\W+$");
Regex rgx2 = new Regex(@"\W+");
return rgx2.Replace(rgx.Replace(text.ToLower(), string.Empty), "-");
The first regex searches for non-word characters (1 or more) at the beginning (^) or at the end of the string ($) and removes them. The second one replaces one or more non-word characters with -.

This should solve the problem that you have explained. Please let me know if it needs any further explanation.
Just as an FYI, the regex makes use of lookarounds to get it done in one run
//This will find any run of non-word characters, lumping them into one match if there is more than one
//It will ignore non-word characters at the beginning or end of the string
Regex rgx = new Regex(@"(?!\W+$)\W+(?<!^\W+)");
//This will then replace those matches with a -
string result = rgx.Replace(input, "-");
To keep the string from going beyond 200 characters, you will have to use substring. If you do this before the regex, then you will be ok, but if you do it after, then you run the risk of having a trailing dash again, FYI.
example:
myString.Substring(0,200)
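One caveat: Substring(0, 200) throws an ArgumentOutOfRangeException when the string is shorter than 200 characters, so the length needs checking first. A sketch that follows the four stated requirements and truncates safely is below; Permalink.Create and the maxLength parameter are hypothetical names, and it uses Unicode letter and decimal-digit classes instead of \W so that underscores are also replaced.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; a sketch of the requirements listed in the question.
public static class Permalink
{
    public static string Create(string text, int maxLength = 200)
    {
        // Collapse every run of non-letter, non-digit characters into one hyphen,
        // then drop hyphens left at either end.
        string slug = Regex.Replace(text.ToLowerInvariant(), @"[^\p{L}\p{Nd}]+", "-").Trim('-');

        // Cut only when needed, and remove a hyphen the cut may have exposed.
        if (slug.Length > maxLength)
        {
            slug = slug.Substring(0, maxLength).TrimEnd('-');
        }
        return slug;
    }
}
```

Here the truncation happens after the replacement, and the extra TrimEnd deals with the trailing dash this answer warns about.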

I use an iterative approach for this - because in some cases you might want certain characters to be turned into words instead of having them turned into '-' characters - e.g. '&' -> 'and'.
But when you're done you'll also end up with a string that potentially contains multiple '-' - so you have a final regex that collapses all multiple '-' characters into one.
So I would suggest using an ordered list of regexes, and then run them all in order. This code is written to go in a static class that is then exposed as a single extension method for System.String - and is probably best merged into the System namespace.
I've hacked it from code I use, which had extensibility points (e.g. you could pass in a MatchEvaluator on construction of the replacement object for more intelligent replacements; and you could pass in your own IEnumerable of replacements, as the class was public), and therefore it might seem unnecessarily complicated - judging by the other answers I'm guessing everybody will think so (but I have specific requirements for the SEO of the strings that are created).
The list of replacements I use might not be exactly correct for your uses - if not, you can just add more.
private class SEOSymbolReplacement
{
private Regex _rx;
private string _replacementString;
public SEOSymbolReplacement(Regex r, string replacement)
{
//null-checks required.
_rx = r;
_replacementString = replacement;
}
public string Execute(string input)
{
//null-check required
return _rx.Replace(input, _replacementString);
}
}
private static readonly SEOSymbolReplacement[] Replacements = {
new SEOSymbolReplacement(new Regex(@"#", RegexOptions.Compiled), "Sharp"),
new SEOSymbolReplacement(new Regex(@"\+", RegexOptions.Compiled), "Plus"),
new SEOSymbolReplacement(new Regex(@"&", RegexOptions.Compiled), " And "),
new SEOSymbolReplacement(new Regex(@"[|:'\\/,_]", RegexOptions.Compiled), "-"),
new SEOSymbolReplacement(new Regex(@"\s+", RegexOptions.Compiled), "-"),
new SEOSymbolReplacement(new Regex(@"[^\p{L}\d-]",
RegexOptions.IgnoreCase | RegexOptions.Compiled), ""),
new SEOSymbolReplacement(new Regex(@"-{2,}", RegexOptions.Compiled), "-")};
/// <summary>
/// Transforms the string into an SEO-friendly string.
/// </summary>
/// <param name="str"></param>
public static string ToSEOPathString(this string str)
{
if (str == null)
return null;
string toReturn = str;
foreach (var replacement in Replacements)
{
toReturn = replacement.Execute(toReturn);
}
return toReturn;
}

Best way to parse Space Separated Text

I have string like this
/c SomeText\MoreText "Some Text\More Text\Lol" SomeText
I want to tokenize it, however I can't just split on the spaces. I've come up with somewhat ugly parser that works, but I'm wondering if anyone has a more elegant design.
This is in C# btw.
EDIT: My ugly version, while ugly, is O(N) and may actually be faster than using a RegEx.
private string[] tokenize(string input)
{
string[] tokens = input.Split(' ');
List<String> output = new List<String>();
for (int i = 0; i < tokens.Length; i++)
{
if (tokens[i].StartsWith("\""))
{
string temp = tokens[i];
int k = 0;
for (k = i + 1; k < tokens.Length; k++)
{
if (tokens[k].EndsWith("\""))
{
temp += " " + tokens[k];
break;
}
else
{
temp += " " + tokens[k];
}
}
output.Add(temp);
i = k; // the for loop's i++ then moves on to the token after the closing quote
}
else
{
output.Add(tokens[i]);
}
}
return output.ToArray();
}

The computer term for what you're doing is lexical analysis; read that for a good summary of this common task.
Based on your example, I'm guessing that you want whitespace to separate your words, but stuff in quotation marks should be treated as a "word" without the quotes.
The simplest way to do this is to define a word as a regular expression:
([^"\s]+)\s*|"([^"]+)"\s*
This expression states that a "word" is either (1) a run of non-quote, non-whitespace text followed by optional whitespace, or (2) non-quote text surrounded by quotes (again followed by optional whitespace). Note the use of capturing parentheses to highlight the desired text.
Armed with that regex, your algorithm is simple: search your text for the next "word" as defined by the capturing parentheses, and return it. Repeat that until you run out of "words".
Here's the simplest bit of working code I could come up with, in VB.NET. Note that we have to check both groups for data since there are two sets of capturing parentheses.
Dim token As String
Dim r As Regex = New Regex("([^""\s]+)\s*|""([^""]+)""\s*")
Dim m As Match = r.Match("this is a ""test string""")
While m.Success
' Group 1 holds a bare word; if it is empty, the match came from the quoted alternative.
token = m.Groups(1).ToString
If token.Length = 0 Then
token = m.Groups(2).ToString
End If
Console.WriteLine(token)
m = m.NextMatch
End While
Note 1: Will's answer, above, is the same idea as this one. Hopefully this answer explains the details behind the scene a little better :)
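
Since the question is about C#, here is a rough C# equivalent of the loop above, as a sketch only. The class and method names (WordLexerSketch, Tokenize) are made up for this example, and the bare-word character class is written as [^"\s] so that only quotes and whitespace are excluded.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordLexerSketch
{
    // Group 1: a bare word. Group 2: the text inside double quotes.
    private static readonly Regex Word =
        new Regex(@"([^""\s]+)\s*|""([^""]+)""\s*");

    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        for (Match m = Word.Match(input); m.Success; m = m.NextMatch())
        {
            // Exactly one of the two groups takes part in each match.
            tokens.Add(m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value);
        }
        return tokens;
    }

    public static void Main()
    {
        string input = @"/c SomeText\MoreText ""Some Text\More Text\Lol"" SomeText";
        foreach (string token in Tokenize(input))
            Console.WriteLine(token);
        // /c
        // SomeText\MoreText
        // Some Text\More Text\Lol
        // SomeText
    }
}
```

Note that text the pattern cannot match, such as an empty pair of quotes or a quote that is never closed, is skipped silently rather than reported.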

The Microsoft.VisualBasic.FileIO namespace (in Microsoft.VisualBasic.dll) has a TextFieldParser you can use to split space-delimited text. It handles strings within quotes (i.e., "this is one token" thisistokentwo) well.
Note that just because the DLL says VisualBasic doesn't mean you can only use it in a VB project. It's part of the entire Framework.
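
A minimal sketch of how that could look from C#, assuming the project references Microsoft.VisualBasic. The wrapper class and method names are hypothetical; TextFieldParser, SetDelimiters, HasFieldsEnclosedInQuotes and ReadFields are the parser's own members.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class TextFieldParserSketch
{
    public static string[] Tokenize(string line)
    {
        // TextFieldParser reads from a TextReader, so wrap the string.
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(" ");
            parser.HasFieldsEnclosedInQuotes = true;
            // ReadFields returns null when there is nothing left to read.
            return parser.ReadFields() ?? new string[0];
        }
    }

    public static void Main()
    {
        string input = @"/c SomeText\MoreText ""Some Text\More Text\Lol"" SomeText";
        foreach (string token in Tokenize(input))
            Console.WriteLine(token);
    }
}
```

One thing to watch for: with a single space as the delimiter, two spaces in a row produce an empty field, so the result may need filtering if the input is not tidy.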

There is the state machine approach.
private enum State
{
None = 0,
InTokin,
InQuote
}
private static IEnumerable<string> Tokinize(string input)
{
input += ' '; // ensure we end on whitespace
State state = State.None;
State? next = null; // setting the next state implies that we have found a tokin
StringBuilder sb = new StringBuilder();
foreach (char c in input)
{
switch (state)
{
default:
case State.None:
if (char.IsWhiteSpace(c))
continue;
else if (c == '"')
{
state = State.InQuote;
continue;
}
else
state = State.InTokin;
break;
case State.InTokin:
if (char.IsWhiteSpace(c))
next = State.None;
else if (c == '"')
next = State.InQuote;
break;
case State.InQuote:
if (c == '"')
next = State.None;
break;
}
if (next.HasValue)
{
yield return sb.ToString();
sb = new StringBuilder();
state = next.Value;
next = null;
}
else
sb.Append(c);
}
}
It can easily be extended for things like nested quotes and escaping. Returning as IEnumerable<string> allows your code to only parse as much as you need. There aren't any real downsides to that kind of lazy approach as strings are immutable so you know that input isn't going to change before you have parsed the whole thing.
See: http://en.wikipedia.org/wiki/Automata-Based_Programming

You also might want to look into regular expressions. That might help you out. Here is a sample ripped off from MSDN...
using System;
using System.Text.RegularExpressions;
public class Test
{
public static void Main ()
{
// Define a regular expression for repeated words.
Regex rx = new Regex(@"\b(?<word>\w+)\s+(\k<word>)\b",
RegexOptions.Compiled | RegexOptions.IgnoreCase);
// Define a test string.
string text = "The the quick brown fox fox jumped over the lazy dog dog.";
// Find matches.
MatchCollection matches = rx.Matches(text);
// Report the number of matches found.
Console.WriteLine("{0} matches found in:\n {1}",
matches.Count,
text);
// Report on each match.
foreach (Match match in matches)
{
GroupCollection groups = match.Groups;
Console.WriteLine("'{0}' repeated at positions {1} and {2}",
groups["word"].Value,
groups[0].Index,
groups[1].Index);
}
}
}
// The example produces the following output to the console:
// 3 matches found in:
// The the quick brown fox fox jumped over the lazy dog dog.
// 'The' repeated at positions 0 and 4
// 'fox' repeated at positions 20 and 25
// 'dog' repeated at positions 50 and 54

Craig is right — use regular expressions. Regex.Split may be more concise for your needs.

[^\t]+\t|"[^"]+"\t
Using the Regex definitely looks like the best bet; however, this one just returns the whole string. I'm trying to tweak it, but not much luck so far.
string[] tokens = System.Text.RegularExpressions.Regex.Split(this.BuildArgs, @"[^\t]+\t|""[^""]+""\t");
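
Part of the trouble is that Regex.Split treats the pattern as the separator, not as the token, and it returns the input unchanged when the pattern never matches (for instance when the text contains no tabs). One possible way to stay with Regex.Split is to describe the separator instead: whitespace that is followed by an even number of quotes. The sketch below assumes quotes are always balanced and never escaped; the class and method names are invented for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitOutsideQuotesSketch
{
    // Whitespace followed by an even number of double quotes up to the end
    // of the input, i.e. whitespace that is not inside a quoted section.
    private static readonly Regex Separator =
        new Regex(@"\s+(?=(?:[^""]*""[^""]*"")*[^""]*$)");

    public static string[] Tokenize(string input)
    {
        return Separator.Split(input.Trim())
                        .Where(t => t.Length > 0)
                        .Select(t => t.Trim('"')) // drop the surrounding quotes
                        .ToArray();
    }

    public static void Main()
    {
        string input = @"/c SomeText\MoreText ""Some Text\More Text\Lol"" SomeText";
        foreach (string token in Tokenize(input))
            Console.WriteLine(token);
        // /c
        // SomeText\MoreText
        // Some Text\More Text\Lol
        // SomeText
    }
}
```

The lookahead rescans the rest of the string at every run of whitespace, so this is quadratic in the worst case; for long inputs the hand-written tokenizer or the state machine is likely to be faster.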


How to split strings using regular expressions - c#

I want to split a string into a list or array.
Input: green,"yellow,green",white,orange,"blue,black"
The split character is the comma (,), but it must ignore commas inside quotes.
The output should be:
green
yellow,green
white
orange
blue,black
Thanks.

The format of the string you are trying to split appears to be standard CSV. Using a CSV parser would likely be easier/faster.

Enclosing the regex in '(' and ')' and then splitting on it should solve this, e.g.: /("[^"]+")/g
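
The /.../g notation is JavaScript-style; in C# the same idea would go through Regex.Split, which keeps the text of capturing groups in its result. On its own that leaves the unquoted stretches (such as ,white,orange,) unsplit, so the sketch below splits those on commas in a second step. The names are hypothetical, and it assumes quotes are balanced, never escaped, and that empty fields can be dropped.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuotedSplitSketch
{
    public static List<string> SplitFields(string input)
    {
        var fields = new List<string>();
        // Because the pattern is wrapped in a capturing group, Regex.Split
        // returns the quoted sections as items instead of discarding them.
        foreach (string part in Regex.Split(input, @"(""[^""]+"")"))
        {
            if (part.Length > 0 && part[0] == '"')
                fields.Add(part.Trim('"'));   // a quoted section: keep it whole
            else
                fields.AddRange(part.Split(new[] { ',' },
                    StringSplitOptions.RemoveEmptyEntries));
        }
        return fields;
    }

    public static void Main()
    {
        string input = @"green,""yellow,green"",white,orange,""blue,black""";
        foreach (string field in SplitFields(input))
            Console.WriteLine(field);
        // green
        // yellow,green
        // white
        // orange
        // blue,black
    }
}
```

For real CSV data, where empty fields and doubled quotes matter, a dedicated CSV parser remains the safer choice.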
