This question already has answers here:
Regex to validate string for having three non white-space characters
(2 answers)
Closed 3 years ago.
As said above, I want to find 3 or more whitespaces with regex in C#. Currently I tried:
\s{3,} and [ ]{3,} for Somestreet 155/ EG 47. Neither worked. What did I do wrong?
\s{3,} matches three or more whitespace characters in a row. To match a string with three whitespace characters anywhere, you need a pattern such as \s.*\s.*\s.
So this would match:
a b c d
a b c
a b
abc d e f
a
a b // ends in 1 space
// just 3 spaces
a // ends in 3 spaces
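To make the difference concrete, here is a minimal sketch that contrasts the two patterns on the address from the question. The class and method names are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespacePatterns
{
    // \s{3,}: three or more whitespace characters directly next to each other.
    public static bool HasThreeInARow(string s)
    {
        return Regex.IsMatch(s, @"\s{3,}");
    }

    // \s.*\s.*\s: three whitespace characters anywhere on a line,
    // with any (non-newline) characters between them.
    public static bool HasThreeAnywhere(string s)
    {
        return Regex.IsMatch(s, @"\s.*\s.*\s");
    }
}
```

For "Somestreet 155/ EG 47" the first method returns false, because the three spaces are not adjacent, while the second returns true.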
Linq is an alternative way to count spaces:
string source = "Somestreet 155/ EG 47";
bool result = source
.Where(c => c == ' ') // spaces only
.Skip(2) // skip 2 of them
.Any(); // is there at least one more (i.e. a 3rd space)?
Edit: If you want to count any whitespace, not just spaces, the Where clause should be
...
.Where(c => char.IsWhiteSpace(c))
...
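The Skip/Any idea generalizes to any minimum count. The following is a small sketch under that assumption; the WhitespaceCounter class and the HasAtLeast method are hypothetical names, not part of the answer above.

```csharp
using System;
using System.Linq;

public static class WhitespaceCounter
{
    // Returns true once 'minimum' whitespace characters have been seen.
    // Skip(minimum - 1).Any() stops enumerating at that point instead of
    // counting every whitespace character in the string.
    public static bool HasAtLeast(string source, int minimum)
    {
        if (minimum <= 0) return true;
        return source
            .Where(c => char.IsWhiteSpace(c))
            .Skip(minimum - 1)
            .Any();
    }
}
```

Because LINQ evaluates lazily, a long string is only scanned up to the character that satisfies the minimum.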
You could count the whitespace matches:
if (Regex.Matches(yourString, @"\s+").Count >= 3) {...}
The + makes sure that consecutive matches to \s only count once, so "Somestreet 155/ EG 47" has three matches but "Somestreet 155/ EG47" only has two.
If the string is long, then it could take more time than necessary to get all the matches then count them. An alternative is to get one match at a time and bail out early if the required number of matches has been met:
static bool MatchesAtLeast(string s, Regex re, int matchCount)
{
bool success = false;
int startPos = 0;
while (!success)
{
Match m = re.Match(s, startPos);
if (m.Success)
{
matchCount--;
success = (matchCount <= 0);
startPos = m.Index + m.Length;
if (startPos > s.Length - 2) { break; }
}
else { break; }
}
return success;
}
static void Main(string[] args)
{
Regex re = new Regex(@"\s+");
string s = "Somestreet 155/ EG\t47";
Console.WriteLine(MatchesAtLeast(s, re, 3)); // outputs True
Console.ReadLine();
}
Try ^\S*\s\S*\s\S*\s\S*$ instead.
\S matches non-whitespace characters, ^ matches the beginning of a string and $ matches the end of a string.
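Note that this anchored pattern accepts strings with exactly three whitespace characters, whereas the question asks for three or more. A short sketch of how it behaves, assuming input without a trailing newline (the class and method names are hypothetical):

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExactWhitespace
{
    // Whole string must be: non-whitespace runs separated by exactly
    // three single whitespace characters.
    public static bool HasExactlyThree(string s)
    {
        return Regex.IsMatch(s, @"^\S*\s\S*\s\S*\s\S*$");
    }
}
```

A string with two or four whitespace characters is rejected, so this variant only fits if "exactly three" is what you need.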
I want to regex match a string that
Is alphanumeric only.
Letters may only be uppercase.
Total length of min 5 and max 24 chars.
May include min 0 max 1 occurrences of underscore in any position except the first or last.
I think I have to somehow nest the statements so that the total length is 5-24 but there may be up to one underscore. I have read a few regex tutorials, but can't understand a way to do this. I also have no idea how to specify the acceptable position of the underscore (if present).
[A-Z0-9]{5,24}[_]{0,1}
If you are using this in C# code, it's better to check the length of the string outside the regex. (It's possible to cram it inside the regex, but I won't show it here).
private static bool Validate(string str) {
if (str.Length < 5 || str.Length > 24) {
return false;
}
return Regex.IsMatch(str, @"^[A-Z0-9]+(?:_[A-Z0-9]+)?\z");
}
The regex is:
^[A-Z0-9]+(?:_[A-Z0-9]+)?\z
If a string ends with new line, $ can match the empty string before new line, so \z is used here to assert end of string.
Test code
using System;
using System.Text.RegularExpressions;
public class Test
{
public static void Main()
{
string[] fail = {"ABCDacbd", "ACDE", "ABCDE\n", "_01234", "ABCDÉ", "ABCD́Ē", "ABCDEF_", "A_B_CDEF", "AB_C", "1234567890123456789012345", "123456_789012345678901234"};
string[] ok = {"ACBDEF", "01234", "ABC_DE1", "123456789012345678901234", "12345_789012345678901234"};
foreach (string s in fail) {
Console.WriteLine(s + " " + Validate(s));
}
Console.WriteLine();
foreach (string s in ok) {
Console.WriteLine(s + " " + Validate(s));
}
}
private static bool Validate(string str) {
if (str.Length < 5 || str.Length > 24) {
return false;
}
return Regex.IsMatch(str, @"^[A-Z0-9]+(?:_[A-Z0-9]+)?\z");
}
}
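The answer above mentions that the length check can be folded into the regex without showing how. One possible way, offered only as a sketch, is a lookahead at the start of the pattern; the CodeValidator class name is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CodeValidator
{
    // (?=.{5,24}\z) asserts the total length (5 to 24 characters, no newline)
    // before the rest of the pattern checks the character rules:
    // uppercase letters and digits, with at most one inner underscore.
    private static readonly Regex Pattern =
        new Regex(@"^(?=.{5,24}\z)[A-Z0-9]+(?:_[A-Z0-9]+)?\z");

    public static bool Validate(string str)
    {
        return Pattern.IsMatch(str);
    }
}
```

Checking str.Length in code, as the answer does, is arguably easier to read; the lookahead version is mainly useful when only a single pattern string can be supplied.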
I have a string like the following:
[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)
You can look at it as this tree:
- [Testing.User]
- Info
- [Testing.Info]
- Name
- [System.String]
- Matt
- Age
- [System.Int32]
- 21
- Description
- [System.String]
- This is some description
As you can see, it's a string serialization / representation of a class Testing.User
I want to be able to do a split and get the following elements in the resulting array:
[0] = [Testing.User]
[1] = Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))
[2] = Description:([System.String]|This is some description)
I can't split by | because that would result in:
[0] = [Testing.User]
[1] = Info:([Testing.Info]
[2] = Name:([System.String]
[3] = Matt)
[4] = Age:([System.Int32]
[5] = 21))
[6] = Description:([System.String]
[7] = This is some description)
How can I get my expected result?
I'm not very good with regular expressions, but I am aware it is a very possible solution for this case.
Using regex lookahead
You can use a regex like this:
(\[.*?])|(\w+:.*?)\|(?=Description:)|(Description:.*)
Working demo
The idea behind this regex is to capture in groups 1,2 and 3 what you want.
Match information
MATCH 1
1. [0-14] `[Testing.User]`
MATCH 2
2. [15-88] `Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))`
MATCH 3
3. [89-143] `Description:([System.String]|This is some description)`
Regular regex
On the other hand, if you don't like above regex, you can use another one like this:
(\[.*?])\|(.*)\|(Description:.*)
Working demo
Or even forcing one character at least:
(\[.+?])\|(.+)\|(Description:.+)
There are more than enough splitting answers already, so here is another approach. If your input represents a tree structure, why not parse it to a tree?
The following code was automatically translated from VB.NET, but it should work as far as I tested it.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Treeparse
{
class Program
{
static void Main(string[] args)
{
var input = "[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
var t = StringTree.Parse(input);
Console.WriteLine(t.ToString());
Console.ReadKey();
}
}
public class StringTree
{
//Branching constants
const string BranchOff = "(";
const string BranchBack = ")";
const string NextTwig = "|";
//Content of this twig
public string Text;
//List of Sub-Twigs
public List<StringTree> Twigs;
[System.Diagnostics.DebuggerStepThrough()]
public StringTree()
{
Text = "";
Twigs = new List<StringTree>();
}
private static void ParseRecursive(StringTree Tree, string InputStr, ref int Position)
{
do {
StringTree NewTwig = new StringTree();
do {
NewTwig.Text = NewTwig.Text + InputStr[Position];
Position += 1;
} while (!(Position == InputStr.Length || (new String[] { BranchBack, BranchOff, NextTwig }.ToList().Contains(InputStr[Position].ToString()))));
Tree.Twigs.Add(NewTwig);
if (Position < InputStr.Length && InputStr[Position].ToString() == BranchOff) { Position += 1; ParseRecursive(NewTwig, InputStr, ref Position); Position += 1; }
if (Position < InputStr.Length && InputStr[Position].ToString() == BranchBack)
break; // TODO: might not be correct. Was : Exit Do
Position += 1;
} while (!(Position >= InputStr.Length || InputStr[Position].ToString() == BranchBack));
}
/// <summary>
/// Call this to parse the input into a StringTree objects using recursion
/// </summary>
public static StringTree Parse(string Input)
{
StringTree t = new StringTree();
t.Text = "Root";
int Start = 0;
ParseRecursive(t, Input, ref Start);
return t;
}
private void ToStringRecursive(ref StringBuilder sb, StringTree tree, int Level)
{
for (int i = 1; i <= Level; i++)
{
sb.Append(" ");
}
sb.AppendLine(tree.Text);
int NextLevel = Level + 1;
foreach (StringTree NextTree in tree.Twigs)
{
ToStringRecursive(ref sb, NextTree, NextLevel);
}
}
public override string ToString()
{
var sb = new System.Text.StringBuilder();
ToStringRecursive(ref sb, this, 0);
return sb.ToString();
}
}
}
You get the values of each node with its associated subvalues in a treelike structure, and you can then do whatever you like with it, for example show the structure in a TreeView control.
Assuming your groups can be marked as
[Anything.Anything]
Anything:ReallyAnything (Letters & Numbers only:Then any amount of characters) after the first pipe
Anything:ReallyAnything (Letters & Numbers only:Then any amount of characters) after the last pipe
Then you have a pattern like:
"(\\[\\w+\\.\\w+\\])\\|(\\w+:.+)\\|(\\w+:.+)";
(\\[\\w+\\.\\w+\\]) This capture group will get the "[Testing.User]" but is not restricted to it only being "[Testing.User]"
\\|(\\w+:.+) This capture group will get the data after the first pipe and stop before the last pipe. In this case, "Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))" but is not restricted to it beginning with "Info:"
\\|(\\w+:.+) Same capture group as previous, but captures whatever is after the last pipe, in this case "Description:([System.String]|This is some description)" but is not restricted to beginning with Description:"
Now if you were to add another pipe followed by more data (|Anything:SomeData), then Description: will be part of group 2 and group 3 would now be "Anything:SomeData".
Code looks like:
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
String text = "[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
String pattern = "(\\[\\w+\\.\\w+\\])\\|(\\w+:.+)\\|(\\w+:.+)";
Match match = Regex.Match(text, pattern);
if (match.Success)
{
Console.WriteLine(match.Groups[1]);
Console.WriteLine(match.Groups[2]);
Console.WriteLine(match.Groups[3]);
}
}
}
Results:
[Testing.User]
Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))
Description:([System.String]|This is some description)
See working sample here... https://dotnetfiddle.net/DYcZuY
See working sample if I add another field following the pattern format here... https://dotnetfiddle.net/Mtc1CD
To do that you need balancing groups, a regex feature exclusive to the .NET regex engine. It works as a counter: when an opening parenthesis is found the counter is incremented, when a closing one is found the counter is decremented, and you then only have to test whether the counter is zero to know if the parentheses are balanced.
This is the only way to be sure you are inside or outside the parentheses:
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string input = @"[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
string pattern = @"(?:[^|()]+|\((?>[^()]+|(?<Open>[(])|(?<-Open>[)]))*(?(Open)(?!))\))+";
foreach (Match m in Regex.Matches(input, pattern))
Console.WriteLine(m.Value);
}
}
pattern details:
(?:
[^|()]+ # all that is not a parenthesis or a pipe
| # OR
# content between parentheses (possibly nested)
\( # opening parenthesis
# here is the way to obtain balanced parens
(?> # content between parens
[^()]+ # all that is not parenthesis
| # OR
(?<Open>[(]) # an opening parenthesis (increment the counter)
|
(?<-Open>[)]) # a closing parenthesis (decrement the counter)
)* # repeat as needed
(?(Open)(?!)) # make the pattern fail if the counter is not zero
\)
)+
(?(Open)(?!)) is a conditional statement: if the Open stack is not empty, it tries to match (?!).
(?!) is an always-false subpattern (an empty negative lookahead), which means: not followed by nothing.
This pattern matches everything that is not a pipe, together with strings enclosed in (possibly nested) parentheses.
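To see the counter mechanism in isolation, here is a minimal sketch that uses the same balancing-group construct only to test whether the parentheses in a string are balanced. The ParenCheck class and IsBalanced method are hypothetical names introduced for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ParenCheck
{
    // (?<Open>\()   pushes onto the Open stack for every "(".
    // (?<-Open>\))  pops for every ")" and fails if the stack is empty.
    // (?(Open)(?!)) fails at the end if anything is still on the stack.
    private static readonly Regex Balanced = new Regex(
        @"^(?:[^()]|(?<Open>\()|(?<-Open>\)))*(?(Open)(?!))$");

    public static bool IsBalanced(string s)
    {
        return Balanced.IsMatch(s);
    }
}
```

A closing parenthesis with no matching opener cannot be consumed at all, and an unclosed opener trips the final conditional, so both kinds of imbalance are rejected.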
Regex is not the best approach for this kind of problem; you may need to write some code to parse your data. Here is a simple example that handles your case. The basic idea is that you want to split only when the | is not inside parentheses, so I keep track of the parenthesis depth. You will need some extra work to handle cases where a parenthesis is part of the description text, for instance, but as I said, this is just a starting point:
static IEnumerable<String> splitSpecial(string input)
{
StringBuilder builder = new StringBuilder();
int openParenthesisCount = 0;
foreach (char c in input)
{
if (openParenthesisCount == 0 && c == '|')
{
yield return builder.ToString();
builder.Clear();
}
else
{
if (c == '(')
openParenthesisCount++;
if (c == ')')
openParenthesisCount--;
builder.Append(c);
}
}
yield return builder.ToString();
}
static void Main(string[] args)
{
string input = "[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
foreach (String split in splitSpecial(input))
{
Console.WriteLine(split);
}
Console.ReadLine();
}
Outputs:
[Testing.User]
Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))
Description:([System.String]|This is some description)
This isn't a great/robust solution, but if you know your three top level items are fixed then you can hard code those into your regular expression.
(\[Testing\.User\])\|(Info:.*)\|(Description:.*)
This regular expression will create one match with three groups within it as you were expecting. You can test it here:
http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
Edit: Here's a full working C# example
using System;
using System.Text.RegularExpressions;
namespace ConsoleApplication3
{
internal class Program
{
private static void Main(string[] args)
{
const string input = @"[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
const string pattern = @"(\[Testing\.User\])\|(Info:.*)\|(Description:.*)";
var match = Regex.Match(input, pattern);
if (match.Success)
{
for (int i = 1; i < match.Groups.Count; i++)
{
Console.WriteLine("[" + i + "] = " + match.Groups[i]);
}
}
Console.ReadLine();
}
}
}
This question already has answers here:
Using RegEx to balance match parenthesis
(4 answers)
Closed 9 years ago.
I want to select a part of a string, but the problem is that the last character I want to select can have multiple occurrences.
I want to select 'Aggregate(' and end at the matching ')', any () in between can be ignored.
Examples:
string: Substr(Aggregate(SubQuery, SUM, [Model].Remark * [Object].Shortname + 10), 0, 1)
should return: Aggregate(SubQuery, SUM, [Model].Remark * [Object].Shortname + 10)
string: Substr(Aggregate(SubQuery, SUM, [Model].Remark * ([Object].Shortname + 10)), 0, 1)
should return: Aggregate(SubQuery, SUM, [Model].Remark * ([Object].Shortname + 10))
string: Substr(Aggregate(SubQuery, SUM, ([Model].Remark) * ([Object].Shortname + 10) ), 0, 1)
should return: Aggregate(SubQuery, SUM, ([Model].Remark) * ([Object].Shortname + 10) )
Is there a way to solve this with a regular expression? I'm using C#.
This is a little ugly, but you could use something like
Aggregate\(([^()]+|\(.*?\))*\)
It passes all your tests, but it can only match one level of nested parentheses.
This solution works with any level of nested parenthesis by using .NETs balancing groups:
(?x) # allow comments and ignore whitespace
Aggregate\(
(?:
[^()] # anything but ( and )
| (?<open> \( ) # ( -> open++
| (?<-open> \) ) # ) -> open--
)*
(?(open) (?!) ) # fail if open > 0
\)
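For reference, a sketch of how the pattern above could be applied from C#. RegexOptions.IgnorePatternWhitespace stands in for the inline (?x) flag; the AggregateExtractor class and its Extract method are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AggregateExtractor
{
    private static readonly Regex AggregateCall = new Regex(@"
        Aggregate\(
        (?:
            [^()]              # anything but parentheses
          | (?<open> \( )      # opening parenthesis: push
          | (?<-open> \) )     # closing parenthesis: pop
        )*
        (?(open)(?!))          # fail if a parenthesis is still open
        \)",
        RegexOptions.IgnorePatternWhitespace);

    // Returns the Aggregate(...) call with its balanced argument list,
    // or null when the expression contains no such call.
    public static string Extract(string expression)
    {
        Match m = AggregateCall.Match(expression);
        return m.Success ? m.Value : null;
    }
}
```

The loop stops at the first closing parenthesis it cannot pop, which is exactly the one that closes Aggregate(, so the trailing ", 0, 1)" of Substr is left out.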
I'm not sure how much the input varies but for the string examples in the question something as simple as this would work:
Aggregate\(.*\)(?=,)
If you would eventually consider avoiding regular expressions, here's an alternative for parsing, which uses the System.Xml.Linq namespace:
class Program
{
static void Main()
{
var input = File.ReadAllLines("input.txt");
input.ToList().ForEach(item => {
Console.WriteLine(item.GetParameter("Aggregate"));
});
}
}
static class X
{
public static string GetParameter(this string expression, string element)
{
XDocument doc;
var input1 = "<root>" + expression
.Replace("(", "<n1>")
.Replace(")", "</n1>")
.Replace("[", "<n2>")
.Replace("]", "</n2>") +
"</root>";
try
{
doc = XDocument.Parse(input1);
}
catch
{
return null;
}
var agg=doc.Descendants()
.Where(d => d.FirstNode.ToString() == element)
.FirstOrDefault();
if (agg == null)
return null;
var param = agg
.Elements()
.FirstOrDefault();
if (param == null)
return null;
return element +
param
.ToString()
.Replace("<n1>", "(")
.Replace("</n1>", ")")
.Replace("<n2>", "[")
.Replace("</n2>", "]");
}
}
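This regex copes with any number of bracket pairs and with some nesting, although it does not truly balance the brackets, so arbitrarily deep nesting can still defeat it:
Aggregate\(([^(]*\([^)]*\))*[^()]*\)
For example, it will find the complete Aggregate(...) call here: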
Substr(Aggregate(SubQuery, SUM(foo(bar), baz()), ([Model].Remark) * ([Object].Shortname + 10) ), 0, 1)
Notice the SUM(foo(bar), baz()) in there.
See a live demo on rubular.
I have this string in C#
adj_con(CL2,1,3,0),adj_cont(CL1,1,3,0),NG, NG/CL, 5 value of CL(JK), HO
I want to use a RegEx to parse it to get the following:
adj_con(CL2,1,3,0)
adj_cont(CL1,1,3,0)
NG
NG/CL
5 value of CL(JK)
HO
In addition to the above example, I tested with the following, but am still unable to parse it correctly.
"%exc.uns: 8 hours let # = ABC, DEF", "exc_it = 1 day" , " summ=graffe ", " a,b,(c,d)"
The new text will be in one string
string mystr = @"""%exc.uns: 8 hours let # = ABC, DEF"", ""exc_it = 1 day"" , "" summ=graffe "", "" a,b,(c,d)""";
string str = "adj_con(CL2,1,3,0),adj_cont(CL1,1,3,0),NG, NG/CL, 5 value of CL(JK), HO";
var resultStrings = new List<string>();
int? firstIndex = null;
int scopeLevel = 0;
for (int i = 0; i < str.Length; i++)
{
if (str[i] == ',' && scopeLevel == 0)
{
resultStrings.Add(str.Substring(firstIndex.GetValueOrDefault(), i - firstIndex.GetValueOrDefault()));
firstIndex = i + 1;
}
else if (str[i] == '(') scopeLevel++;
else if (str[i] == ')') scopeLevel--;
}
resultStrings.Add(str.Substring(firstIndex.GetValueOrDefault()));
Even faster:
([^,]*\x28[^\x29]*\x29|[^,]+)
That should do the trick. Basically, look for either a "function thumbprint" or anything without a comma.
adj_con(CL2,1,3,0),adj_cont(CL1,1,3,0),NG, NG/CL, 5 value of CL(JK), HO
^ ^ ^ ^ ^
The Carets symbolize where the grouping stops.
Just this regex:
[^,()]+(\([^()]*\))?
A test example:
var s= "adj_con(CL2,1,3,0),adj_cont(CL1,1,3,0),NG, NG/CL, 5 value of CL(JK), HO";
Regex regex = new Regex(@"[^,()]+(\([^()]*\))?");
var matches = regex.Matches(s)
.Cast<Match>()
.Select(m => m.Value);
returns
adj_con(CL2,1,3,0)
adj_cont(CL1,1,3,0)
NG
NG/CL
5 value of CL(JK)
HO
If you simply must use Regex, then you can split the string on the following:
, # match a comma
(?= # that is followed by
(?: # either
[^\(\)]* # no parens at all
| # or
(?: #
[^\(\)]* # ...
\( # (
[^\(\)]* # stuff in parens
\) # )
[^\(\)]* # ...
)+ # any number of times
)$ # until the end of the string
)
It breaks your input into the following:
adj_con(CL2,1,3,0)
adj_cont(CL1,1,3,0)
NG
NG/CL
5 value of CL(JK)
HO
You can also use .NET's balanced grouping constructs to create a version that works with nested parens, but you're probably just as well off with one of the non-Regex solutions.
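As a rough sketch of what such a balancing-group version might look like, the following matches top-level tokens rather than splitting, and trims the surrounding whitespace. The TopLevelSplitter class is a hypothetical name, and the pattern is an adaptation for commas, not code taken from any of the answers.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TopLevelSplitter
{
    // A token is a run of: characters that are not commas or parentheses,
    // or a complete (possibly nested) parenthesized group.
    private static readonly Regex Token = new Regex(
        @"(?:[^,()]+|\((?>[^()]+|(?<Open>\()|(?<-Open>\)))*(?(Open)(?!))\))+");

    public static List<string> Split(string input)
    {
        return Token.Matches(input)
                    .Cast<Match>()
                    .Select(m => m.Value.Trim())
                    .ToList();
    }
}
```

Commas inside parentheses are consumed as part of the group, so only top-level commas separate tokens. Empty fields (two commas in a row) are skipped by this sketch rather than returned as empty strings.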
Another way to implement what Snowbear was doing:
public static string[] SplitNest(this string s, char src, string nest, string trg)
{
int scope = 0;
if (trg == null || nest == null) return null;
if (trg.Length == 0 || nest.Length < 2) return null;
if (trg.IndexOf(src) >= 0) return null;
if (nest.IndexOf(src) >= 0) return null;
for (int i = 0; i < s.Length; i++)
{
if (s[i] == src && scope == 0)
{
s = s.Remove(i, 1).Insert(i, trg);
}
else if (s[i] == nest[0]) scope++;
else if (s[i] == nest[1]) scope--;
}
return s.Split(new[] { trg }, StringSplitOptions.None);
}
The idea is to replace any non-nested delimiter with another delimiter that you can then use with an ordinary string.Split(). You can also choose what type of bracket to use - (), <>, [], or even something weird like \/, ][, or `'. For your purposes you would use
string str = "adj_con(CL2,1,3,0),adj_cont(CL1,1,3,0),NG, NG/CL, 5 value of CL(JK), HO";
string[] result = str.SplitNest(',',"()","~");
The function would first turn your string into
adj_con(CL2,1,3,0)~adj_cont(CL1,1,3,0)~NG~ NG/CL~ 5 value of CL(JK)~ HO
then split on the ~, ignoring the nested commas.
Assuming non nested, matching parentheses, you can easily match the tokens you want instead of splitting the string:
MatchCollection matches = Regex.Matches(data, @"(?:[^(),]|\([^)]*\))+");
var s = "adj_con(CL2,1,3,0),adj_cont(CL1,1,3,0),NG, NG/CL, 5 value of CL(JK), HO";
var result = string.Join("\n", Regex.Split(s, @"(?<=\)),|,\s"));
The pattern matches for ) and excludes it from the match then matches ,
or
matches , followed by a space.
result =
adj_con(CL2,1,3,0)
adj_cont(CL1,1,3,0)
NG
NG/CL
5 value of CL(JK)
HO
The TextFieldParser (msdn) class seems to have the functionality built-in:
TextFieldParser Class: - Provides methods and properties for parsing structured text files.
Parsing a text file with the TextFieldParser is similar to iterating over a text file, while the ReadFields method to extract fields of text is similar to splitting the strings.
The TextFieldParser can parse two types of files: delimited or fixed-width. Some properties, such as Delimiters and HasFieldsEnclosedInQuotes are meaningful only when working with delimited files, while the FieldWidths property is meaningful only when working with fixed-width files.
See the article which helped me find that
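A minimal sketch of using TextFieldParser on a single line of quoted, comma-delimited text. It assumes a reference to the Microsoft.VisualBasic assembly is available; the QuotedFieldSplitter class name is hypothetical. Note that the parser understands quotes, not parentheses, so it suits the quoted example from the question rather than the adj_con(...) one.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class QuotedFieldSplitter
{
    // Splits one line on commas, treating commas inside double quotes
    // as part of the field rather than as delimiters.
    public static string[] Split(string line)
    {
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;
            // ReadFields returns null when there is no data to read.
            return parser.ReadFields() ?? new string[0];
        }
    }
}
```

Wrapping the string in a StringReader avoids writing it to a file first, since TextFieldParser accepts any TextReader.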
Here's a stronger option, which parses the whole text, including nested parentheses:
string pattern = @"
\A
(?>
(?<Token>
(?:
[^,()] # Regular character
|
(?<Paren> \( ) # Opening paren - push to stack
|
(?<-Paren> \) ) # Closing paren - pop
|
(?(Paren),) # If inside parentheses, match comma.
)*?
)
(?(Paren)(?!)) # If we are not inside parentheses,
(?:,|\Z) # match a comma or the end
)*? # lazy just to avoid an extra empty match at the end,
# though it removes a last empty token.
\Z
";
Match match = Regex.Match(data, pattern, RegexOptions.IgnorePatternWhitespace);
You can get all matches by iterating over match.Groups["Token"].Captures.