Regex works on .NET test site but not in C# environment - c#

Referring to this stackoverflow question:
- Regex Pattern help: finding HTML pattern when nested ASP.NET Eval?
I received an answer to the problem here:
- regexstorm link
The .NET answer that works on the regex .NET testing site does NOT work in my C# Visual Studio environment. Here is the Unit Test for it:
[Test]
public void GetAllHtmlSubsectionsWorksAsExpected()
{
var regPattern = new Regex(@"(?'o'<)(.*)(?'-o'>)+");
var html =
"<%@ Page Language=\"C#\" %>" +
"<td class=\"c1 c2 c3\" colspan=\"2\">" +
"lorem ipsum" +
"<div class=\"d1\" id=\"div2\" attrid=\"<%# Eval(\"CategoryID\") %>\">" +
"testing 123" +
"</div>" +
"asdf" +
"</td>";
List<string> results = new List<string>();
MatchCollection matches = regPattern.Matches(html);
for (int mnum = 0; mnum < matches.Count; mnum++)
{
Match match = matches[mnum];
results.Add("Match #" + (mnum + 1) + " - Value: " + match.Value);
}
Assert.AreEqual(5, results.Count()); //Fails: results.Count() == 1
}
Why does this work on the regexstorm website but not in my unit test?

NOTE that parsing HTML with regex is not a best practice, you should use a dedicated parser.
Now, as for the question itself: the pattern you use only works on lines that contain a single substring starting with < and ending with the corresponding >. However, your input string has no newline characters! It looks like:
<%@ Page Language="C#" %><td class="c1 c2 c3" colspan="2">lorem ipsum<div class="d1" id="div2" attrid="<%# Eval("CategoryID") %>">testing 123</div>asdf</td>
The .* subpattern is called a greedy dot matching pattern, and it matches as many characters other than a newline as possible (because it grabs the whole line and then backtracks to see if the next subpattern (here, >) is found, thus you get the last possible >).
To fix that, you need a proper balanced construct matching pattern:
<((?>[^<>]+|<(?<c>)|>(?<-c>))*(?(c)(?!)))>
See regex demo
C#:
var r = new Regex(@"
< # First '<'
( # Capturing group 1
(?> # Atomic group start
[^<>]+ # Match 1+ characters other than `<` or `>`
|
< (?<c>) # Match '<', and add a capture into group 'c'
|
> (?<-c>) # Match '>', and delete 1 value from capture stack
)*
(?(c)(?!)) # Fails if 'c' stack isn't empty!
)
> # Last closing `>`
", RegexOptions.IgnorePatternWhitespace);
DISCLAIMER: Even this regex will fail if you have unpaired < or > in your element nodes, which is why you should not use regex to parse HTML.
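To check the balanced pattern against the question's input, here is a minimal sketch. The class name BalancedTagDemo, the FindTags helper and the SampleHtml field are hypothetical names introduced only for this illustration, and the sample string assumes the page directive reads <%@ Page ... %>.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only to illustrate the pattern from the answer.
public static class BalancedTagDemo
{
    // One-line form of the balanced pattern: the 'c' stack tracks nested '<' ... '>' pairs.
    private static readonly Regex TagPattern =
        new Regex(@"<((?>[^<>]+|<(?<c>)|>(?<-c>))*(?(c)(?!)))>");

    // Sample input modelled on the question's unit test (assumption: no newlines).
    public static readonly string SampleHtml =
        "<%@ Page Language=\"C#\" %>" +
        "<td class=\"c1 c2 c3\" colspan=\"2\">" +
        "lorem ipsum" +
        "<div class=\"d1\" id=\"div2\" attrid=\"<%# Eval(\"CategoryID\") %>\">" +
        "testing 123" +
        "</div>" +
        "asdf" +
        "</td>";

    // Returns the full text of every balanced <...> substring found in the input.
    public static List<string> FindTags(string html)
    {
        var results = new List<string>();
        foreach (Match m in TagPattern.Matches(html))
        {
            results.Add(m.Value);
        }
        return results;
    }
}
```

Calling BalancedTagDemo.FindTags(BalancedTagDemo.SampleHtml) should return five items, with the nested <%# Eval(...) %> kept inside the div match, which is the count the question's assertion expects.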

There are two different things in regex: Matching and capturing.
What you want here is the capturing group 1.
So you need to use this:
results.Add("Match #" + (mnum + 1) + " - Value: " + match.Groups[1].Value);
Also, as the other answer pointed out, your input has no newlines, so the regex consumes everything in the first match.

Related

Replacing multiple occurrences of string using string builder by regex pattern matching

We are trying to replace all matching patterns (regex) in a string builder with their respective "groups".
Firstly, we are trying to find the count of all occurrences of that pattern and loop through them (count - termination condition). For each match we are assigning the match object and replace them using their respective groups.
Here only the first occurrence is replaced and the other matches are never replaced.
*str* - contains the actual string
Regex - ('.*')\s*=\s*(.*)
To match pattern:
'nam_cd'=isnull(rtrim(x.nam_cd),''),
'Company'=isnull(rtrim(a.co_name),'')
Pattern : created using https://regex101.com/
*matches.Count* - gives the correct count (here 2)
String pattern = @"('.*')\s*=\s*(.*)";
MatchCollection matches = Regex.Matches(str, pattern);
StringBuilder sb = new StringBuilder(str);
Match match = Regex.Match(str, pattern);
for (int i = 0; i < matches.Count; i++)
{
String First = String.Empty;
Console.WriteLine(match.Groups[0].Value);
Console.WriteLine(match.Groups[1].Value);
First = match.Groups[2].Value.TrimEnd('\r');
First = First.Trim();
First = First.TrimEnd(',');
Console.WriteLine(First);
sb.Replace(match.Groups[0].Value, First + " as " + match.Groups[1].Value + " ,", match.Index, match.Groups[0].Value.Length);
match = match.NextMatch();
}
Current output:
SELECT DISTINCT
isnull(rtrim(f.fleet),'') as 'Fleet' ,
'cust_clnt_id' = isnull(rtrim(x.cust_clnt_id),'')
Expected output:
SELECT DISTINCT
isnull(rtrim(f.fleet),'') as 'Fleet' ,
isnull(rtrim(x.cust_clnt_id),'') as 'cust_clnt_id'
A regex solution like this is too fragile. If you need to parse any arbitrary SQL, you need a dedicated parser. There are examples on how to parse SQL properly in Parsing SQL code in C#.
If you are sure there are no "wild", unbalanced ( and ) in your input, you may use a regex as a workaround, for a one-off job:
var result = Regex.Replace(s, @"('[^']+')\s*=\s*(\w+\((?>[^()]+|(?<o>\()|(?<-o>\)))*\))", "\n $2 as $1");
See the regex demo.
Details
('[^']+') - Capturing group 1 ($1): ', 1 or more chars other than ' and then '
\s*=\s* - = enclosed with 0+ whitespaces
(\w+\((?>[^()]+|(?<o>\()|(?<-o>\)))*\)) - Capturing group 2 ($2):
\w+ - 1+ word chars
\((?>[^()]+|(?<o>\()|(?<-o>\)))*\) - a (...) substring with any amount of balanced (...)s inside (see my explanation of this pattern).
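As a rough check of the replacement above, the sketch below wraps it in a method and can be run against the two sample lines from the question. SqlAliasRewriter and Rewrite are hypothetical names for this illustration only, and the same caveat applies: it assumes the parentheses in the input are balanced.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the Regex.Replace call from the answer.
public static class SqlAliasRewriter
{
    // Group 1: the quoted alias. Group 2: a function call with balanced parentheses.
    private static readonly Regex AliasAssignment = new Regex(
        @"('[^']+')\s*=\s*(\w+\((?>[^()]+|(?<o>\()|(?<-o>\)))*\))");

    // Rewrites every  'alias' = expr(...)  into  expr(...) as 'alias'.
    public static string Rewrite(string sql)
    {
        return AliasAssignment.Replace(sql, "\n $2 as $1");
    }
}
```

For the input 'nam_cd'=isnull(rtrim(x.nam_cd),''), the rewritten text should contain isnull(rtrim(x.nam_cd),'') as 'nam_cd'. The trailing comma is outside the match and is left in place.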

Detecting a word followed by a dot or whitespace using regex

I am using regex and C# to find occurrences of a particular word using
Regex regex = new Regex(@"\b" + word + @"\b");
How can I modify my Regex to only detect the word if it is either preceded with a whitespace, followed with a whitespace or followed with a dot?
Examples:
this.Button.Value - should match
this.value - should match
document.thisButton.Value - should not match
You may use lookarounds and alternation to check for the 2 possibilities when a keyword is enclosed with spaces or is just followed with a dot:
var line = "this.Button.Value\nthis.value\ndocument.thisButton.Value";
var word = "this";
var rx = new Regex(string.Format(@"(?<=\s)\b{0}\b(?=\s)|\b{0}\b(?=\.)", word));
var result = rx.Replace(line, "NEW_WORD");
Console.WriteLine(result);
See IDEONE demo and a regex demo.
The pattern matches:
(?<=\s)\bthis\b(?=\s) - whole word "this" that is preceded with whitespace (?<=\s) and that is followed with whitespace (?=\s)
| - or
\bthis\b(?=\.) - whole word "this" that is followed with a literal . ((?=\.))
Since lookarounds are not consuming characters (the regex index remains where it was) the characters matched with them are not placed in the match value, and are thus untouched during the replacement.
If I am understanding you correctly:
Regex regex = new Regex(@"\b" + (word " " || ".") + @"\b");
Regex regex = new Regex(@"((?<=( \.))" + word + @"\b)" + "|" + @"(\b" + word + @"[ .])");
However, note that this could cause trouble if word contains characters that have special meanings in Regular Expressions. I'm assuming that word contains alpha-numeric characters only.
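One way to sidestep the special-character caveat is to pass the word through Regex.Escape before concatenating it. The sketch below reuses the lookaround pattern from the first answer; WordMatcher and Build are hypothetical names introduced for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: builds the lookaround pattern with the keyword escaped.
public static class WordMatcher
{
    // Matches the word when it is enclosed in whitespace, or when it is followed by a dot.
    public static Regex Build(string word)
    {
        // Regex.Escape turns metacharacters such as '.' or '+' into literals.
        string escaped = Regex.Escape(word);
        return new Regex(
            @"(?<=\s)\b" + escaped + @"\b(?=\s)|\b" + escaped + @"\b(?=\.)");
    }
}
```

With this, WordMatcher.Build("a.b") treats the dot in the keyword literally, so it should match "a.b.c" but not "axb.c". Note that \b still assumes the keyword starts and ends with a word character.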
The (?<=...) match group checks for preceding and (?=...) checks for following, both without including them in the match.
Regex regex = new Regex(@"(?<=\s)\b" + word + @"\b|\b" + word + @"\b(?=[\s\.])");
EDIT: Pattern updated.
EDIT 2: Online test: http://ideone.com/RXRQM5

Regex tokenize issue

I have strings input by the user and want to tokenize them. For that, I want to use regex and now have a problem with a special case.
An example string is
Test + "Hello" + "Good\"more" + "Escape\"This\"Test"
or the C# equivalent
@"Test + ""Hello"" + ""Good\""more"" + ""Escape\""This\""Test"""
I am able to match the Test and + tokens, but not the ones contained by the ". I use the " to let the user specify that this is literally a string and not a special token. Now if the user wants to use the " character in the string, I thought of allowing him to escape it with a \.
So the rule would be: Give me everything between two " ", but the character in front of the last " can not be a \.
The results I expect are: "Hello" "Good\"more" "Escape\"This\"Test"
I need the " " characters to be in the final match so I know that this is a string.
I currently have the regex @"""([\w]*)(?<!\\"")""" which gives me the following results: "Hello" "more" "Test"
So the look behind isn't working as I want it to be. Does anyone know the correct way to get the string like I want?
Here's an adaption of a regex I use to parse command lines:
(?!\+)((?:"(?:\\"|[^"])*"?|\S)+)
Example here at regex101
(adaption is the negative look-ahead to ignore + and checking for \" instead of "")
Hope this helps you.
Regards.
Edit:
If you aren't interested in surrounding quotes:
(?!\+)(?:"((?:\\"|[^"])*)"?|(\S+))
To make it safer, I'd suggest getting all the substrings within unescaped pairs of "..." with the following regex:
^(?:[^"\\]*(?:\\.[^"\\]*)*("[^"\\]*(?:\\.[^"\\]*)*"))+
It matches
^ - start of string (so that we could check each " and escape sequence)
(?: - Non-capturing group 1 serving as a container for the subsequent subpatterns
[^"\\]*(?:\\.[^"\\]*)* - matches 0+ characters other than " and \ followed with 0+ sequences of \\. (any escape sequence) followed with 0+ characters other than " and \ (thus, we avoid matching the first " that is escaped, and it can be preceded with any number of escape sequences)
("[^"\\]*(?:\\.[^"\\]*)*") - Capture group 1 matching "..." substrings that may contain any escape sequences inside
)+ - end of the first non-capturing group that is repeated 1 or more times
See the regex demo and here is a C# demo:
var rx = "^(?:[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"))+";
var s = @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f""";
var matches = Regex.Matches(s, rx)
.Cast<Match>()
.SelectMany(m => m.Groups[1].Captures.Cast<Capture>().Select(p => p.Value).ToArray())
.ToList();
Console.WriteLine(string.Join("\n", matches));
UPDATE
If you need to remove the tokens, just match and capture all outside of them with this code:
var keep = "[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*";
var rx = string.Format("^(?:(?<keep>{0})\"{0}\")+(?<keep>{0})$", keep);
var s = @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f""";
var matches = Regex.Matches(s, rx)
.Cast<Match>()
.SelectMany(m => m.Groups["keep"].Captures.Cast<Capture>().Select(p => p.Value).ToArray())
.ToList();
Console.WriteLine(string.Join("", matches));
See another demo
Output: Test + + + \"Escape\"This\"Test\" + for @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f""".

How can I specify the priority of a match pattern in a Regex?

I'm writing a function-parsing engine which uses regular expressions to separate the individual terms (defined as a constant or a variable followed (optionally) by an operator). It's working great, except when I have grouped terms within other grouped terms. Here's the code I'm using:
//This matches an opening delimiter
Regex openers = new Regex("[\\[\\{\\(]");
//This matches a closing delimiter
Regex closers = new Regex("[\\]\\}\\)]");
//This matches the name of a variable (\w+) or a constant numeric value (\d+(\.\d+)?)
Regex VariableOrConstant = new Regex("((\\d+(\\.\\d+)?)|\\w+)" + FunctionTerm.opRegex + "?");
//This matches the binary operators +, *, -, or /
Regex ops = new Regex("[\\*\\+\\-/]");
//This compound Regex finds a single variable or constant term (including a proceeding operator,
//if any) OR a group containing multiple terms (and their proceeding operators, if any)
//and a proceeding operator, if any.
//Matches that match this second pattern need to be added to the function as sub-functions,
//not as individual terms, to ensure the correct evaluation order with parentheses.
Regex splitter = new Regex(
openers +
"(" + VariableOrConstant + ")+" + closers + ops + "?" +
"|" +
"(" + VariableOrConstant + ")" + ops + "?");
When "splitter" is matched against the string "4/(2*X*[2+1])", the matches' values are "4/", "2*", "X*", "2+", and "1", completely ignoring all of the delimiting parentheses and braces. I believe this is because the second half of the "splitter" Regex (the part after the "|") is being matched and overriding the other part of the pattern. This is bad- I want grouped expressions to take precedence over single terms. Does anyone know how I can do this? I looked into using positive/negative lookaheads and lookbehinds, but I'm honestly not sure how to use those, or what they're even for, for that matter, and I can't find any relevant examples... Thanks in advance.
You didn't show us how you're applying the regex, so here's a demo I whipped up:
private static void ParseIt(string subject)
{
Console.WriteLine("subject : {0}\n", subject);
Regex openers = new Regex(@"[[{(]");
Regex closers = new Regex(@"[]})]");
Regex ops = new Regex(@"[*+/-]");
Regex VariableOrConstant = new Regex(@"((\d+(\.\d+)?)|\w+)" + ops + "?");
Regex splitter = new Regex(
openers + @"(?<FIRST>" + VariableOrConstant + @")+" + closers + ops + @"?" +
@"|" +
@"(?<SECOND>" + VariableOrConstant + @")" + ops + @"?",
RegexOptions.ExplicitCapture
);
foreach (Match m in splitter.Matches(subject))
{
foreach (string s in splitter.GetGroupNames())
{
Console.WriteLine("group {0,-8}: {1}", s, m.Groups[s]);
}
Console.WriteLine();
}
}
output:
subject : 4/(2*X*[2+1])
group 0 : 4/
group FIRST :
group SECOND : 4/
group 0 : 2*
group FIRST :
group SECOND : 2*
group 0 : X*
group FIRST :
group SECOND : X*
group 0 : [2+1]
group FIRST : 1
group SECOND :
As you can see, the term [2+1] is matched by the first part of the regex, as you intended. It can't do anything with the (, though, because the next bracketing character after that is another "opener" ([), and it's looking for a "closer".
You could use .NET's "balanced matching" feature to allow for grouped terms enclosed in other groups, but it's not worth the effort. Regexes are not designed for parsing--in fact, parsing and regex matching are fundamentally different kinds of operation. And this is a good example of the difference: a regex actively seeks out matches, skipping over anything it can't use (like the open-parenthesis in your example), but a parser has to examine every character (even if it's just to decide to ignore it).
About the demo: I tried to make the minimum functional changes necessary to get your code to work (which is why I didn't correct the error of putting the + outside the capturing group), but I also made several surface changes, and those represent active recommendations. To wit:
Always use verbatim string literals (@"...") when creating regexes in C# (I think the reason is obvious).
If you're using capturing groups, use named groups whenever possible, but don't use named groups and numbered groups in the same regex. Named groups save you the hassle of keeping track of what's captured where, and the ExplicitCapture option saves you having to clutter up the regex with (?:...) wherever you need a non-capturing group.
Finally, that whole scheme of building a large regex from a bunch of smaller regexes has very limited usefulness IMO. It's very difficult to keep track of the interactions between the parts, like which part's inside which group. Another advantage of C#'s verbatim strings is that they're multiline, so you can take advantage of free-spacing mode (a.k.a. /x or COMMENTS mode):
Regex r = new Regex(@"
(?<GROUPED>
[[{(] # opening bracket
( # group containing:
((\d+(\.\d+)?)|\w+) # number or variable
[*+/-]? # and proceeding operator
)+ # ...one or more times
[]})] # closing bracket
[*+/-]? # and proceeding operator
)
|
(?<UNGROUPED>
((\d+(\.\d+)?)|\w+) # number or variable
[*+/-]? # and proceeding operator
)
",
RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace
);
This is not intended as a solution to your problem; as I said, that's not a job for regexes. This is just a demonstration of some useful regex techniques.
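To illustrate the remark that a parser examines every character, here is a minimal sketch of a stack-based bracket scanner. It is not a full expression parser and it is not the answerer's code; BracketScanner and MaxDepth are hypothetical names, and the scanner only tracks nesting of (), [] and {}.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration: walks the input character by character, like a parser would.
public static class BracketScanner
{
    // Returns the deepest nesting level, or -1 if the brackets are mismatched or unclosed.
    public static int MaxDepth(string text)
    {
        var closerFor = new Dictionary<char, char>
        {
            { '(', ')' }, { '[', ']' }, { '{', '}' }
        };
        var expected = new Stack<char>(); // closers we still need to see, innermost on top
        int max = 0;

        foreach (char ch in text)
        {
            char closer;
            if (closerFor.TryGetValue(ch, out closer))
            {
                // Opening bracket: remember which closer must come next at this level.
                expected.Push(closer);
                max = Math.Max(max, expected.Count);
            }
            else if (ch == ')' || ch == ']' || ch == '}')
            {
                // Closing bracket: it must be the one on top of the stack.
                if (expected.Count == 0 || expected.Pop() != ch)
                {
                    return -1;
                }
            }
            // Any other character (digits, variables, operators) is inspected and skipped.
        }
        return expected.Count == 0 ? max : -1;
    }
}
```

For the question's input, BracketScanner.MaxDepth("4/(2*X*[2+1])") returns 2. A real function parser would build sub-functions at the points where this sketch pushes and pops.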
Try using different quantifiers:
greedy:
* + ?
possessive:
*+ ++ ?+
lazy:
*? +? ??
Try reading this and this
also maybe non-capturing group:
(?:your expr here)
Try, try, try! Practice makes perfect! :)

RegEx replace query to pick out wiki syntax

I've got a string of HTML that I need to grab the "[Title|http://www.test.com]" pattern out of e.g.
"dafasdfasdf, adfasd. [Test|http://www.test.com/] adf ddasfasdf [SDAF|http://www.madee.com/] assg ad"
I need to replace "[Title|http://www.test.com]" this with "http://www.test.com/'>Title".
What is the best way to approach this?
I was getting close with:
string test = "dafasdfasdf adfasd [Test|http://www.test.com/] adf ddasfasdf [SDAF|http://www.madee.com/] assg ad ";
string p18 = @"(\[.*?|.*?\])";
MatchCollection mc18 = Regex.Matches(test, p18, RegexOptions.Singleline | RegexOptions.IgnoreCase);
foreach (Match m in mc18)
{
string value = m.Groups[1].Value;
string fulltag = value.Substring(value.IndexOf("["), value.Length - value.IndexOf("["));
Console.WriteLine("text=" + fulltag);
}
There must be a cleaner way of getting the two values out e.g. the "Title" bit and the url itself.
Any suggestions?
Replace the pattern:
\[([^|]+)\|[^]]*]
with:
$1
A short explanation:
\[ # match the character '['
( # start capture group 1
[^|]+ # match any character except '|' and repeat it one or more times
) # end capture group 1
\| # match the character '|'
[^]]* # match any character except ']' and repeat it zero or more times
] # match the character ']'
A C# demo would look like:
string test = "dafasdfasdf adfasd [Test|http://www.test.com/] adf ddasfasdf [SDAF|http://www.madee.com/] assg ad ";
string adjusted = Regex.Replace(test, @"\[([^|]+)\|[^]]*]", "$1");
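Since the question also asks for a cleaner way to get both values out, one possible variation is to capture the title and the URL in named groups. This is a sketch and not part of the answer above: the pattern with the title and url groups, and the names WikiLinkExtractor and Extract, are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: extracts (title, url) pairs from [Title|url] wiki links.
public static class WikiLinkExtractor
{
    // 'title' stops at '|' or ']', and 'url' runs up to the closing ']'.
    private static readonly Regex WikiLink =
        new Regex(@"\[(?<title>[^|\]]+)\|(?<url>[^\]]+)\]");

    // Key = title, Value = url, in order of appearance.
    public static List<KeyValuePair<string, string>> Extract(string text)
    {
        var links = new List<KeyValuePair<string, string>>();
        foreach (Match m in WikiLink.Matches(text))
        {
            links.Add(new KeyValuePair<string, string>(
                m.Groups["title"].Value, m.Groups["url"].Value));
        }
        return links;
    }
}
```

On the test string from the question this should yield two pairs, Test with http://www.test.com/ and SDAF with http://www.madee.com/. The same named groups can be referenced in a Regex.Replace replacement string as ${title} and ${url}.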
