C# Use regex to sort out parentheses nested inside each other - c#

I'm using regex right now to sort out lines like:
string a = "mine(hello, this())";
string b = "mine(hello, me.he)";
string c = "mine(hello, this(this2())";
I'm trying to get this() by itself.
My regex keeps failing on these because it was written to get the text inside of (). How can I correct the pattern to fix this?
Code:
string result = Regex.Match(a, @"\(([^)]*)\)").Groups[1].Value;

To get everything inside the outermost set of parentheses, use a greedy quantifier to match everything up to the last closing parenthesis. The following pattern does not attempt to match balanced or nested sets of parentheses. (To do that is more involved and there are already many questions and online discussions about that.)
Regex.Match(a, @"\((.*)\)").Groups[1].Value;
That pattern will match the opening parenthesis, then everything it possibly can until the last closing parenthesis. This will require backtracking, but should be reasonable if matches are limited to single lines. Depending on the current regex options (for example RegexOptions.Singleline), it could match across multiple lines, and it has no other constraints, so it could match much more than desired. If that is a concern, add constraints to limit how far it matches. For example, to keep from matching across lines:
Regex.Match(a, @"\(([^\r\n]*)\)").Groups[1].Value;
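As a rough illustration of the difference between the question's pattern and the greedy one, here is a minimal sketch run against string a from the question. The helper names FirstClose and Outermost are made up for this example, and the top-level statement style assumes C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

// Question's pattern: [^)]* stops at the first ')', so the nested call is cut off.
string FirstClose(string s) => Regex.Match(s, @"\(([^)]*)\)").Groups[1].Value;

// Greedy pattern from the answer: runs to the last ')' on the same line.
string Outermost(string s) => Regex.Match(s, @"\(([^\r\n]*)\)").Groups[1].Value;

string a = "mine(hello, this())";
Console.WriteLine(FirstClose(a)); // hello, this(
Console.WriteLine(Outermost(a));  // hello, this()
```

The greedy version returns the whole argument list, so this() still has to be separated from hello afterwards, for example by splitting on the comma.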

Related

Regex.Split returning whitespaces

I want to export a View as a HTML-Document to the User on my ASP.NET page. I want to give the option to only get a part of the view.
Because of that I want to split the output with Regex.Split(). I wrote a Regex that matches the part I want to cut out. After splitting I put the 2 output parts together again.
The problem is that I get a list of 3 parts, of which the second contains " ". How can I change the code that the output contains only 2 strings?
My Code:
textParts = Regex.Split(text, @"<!--Graphic2-->(.|\n)*<!--EndDiscarded-->");
text = textParts[0] + textParts[1];
text contains HTML, CSS and jQuery Code. I wrote comments like <!--Graphic2--> around the blocks I want to cut out.
EDIT
I got it working now by using the Regex.Replace() Method. But I still don't know why Split isn't working how I expected.
You should consider parsing HTML with the proper tools, like HtmlAgilityPack.
The current question is about why Regex.Split returned 3 values. That is due to the presence of a capturing group in your pattern. Regex.Split returns the chunks between start/end of string and the matched chunks, and all captured substrings:
If capturing parentheses are used in a Regex.Split expression, any captured text is included in the resulting string array. For example, if you split the string "plum-pear" on a hyphen placed within capturing parentheses, the returned array includes a string element that contains the hyphen.
So, Regex.Split(text, @"<!--Graphic2-->(.|\n)*<!--EndDiscarded-->") matches the <!--Graphic2--> substring, then matches and captures into Group 1 any 0+ occurrences of any char, as many as possible, and then matches <!--EndDiscarded-->. These matches are removed and the substrings that are not matched are returned, but the last char captured into the repeated capturing group is also returned.
So, if you plan to use regex for this task, you should consider rewriting it as @"(?s)<!--Graphic2-->.*?<!--EndDiscarded-->" or @"<!--Graphic2-->[^<]*(?:<(?!!--EndDiscarded)[^<]*)*<!--EndDiscarded-->", which will be much more efficient, or even @"<!--Graphic2-->[^<]*(?:<(?!!--(?:EndDiscarded|Graphic2))[^<]*)*<!--EndDiscarded-->", which will ensure no nested Graphic2 comments are matched.
As you can see, the complexity of the patterns rises when you want them to be more efficient and safer. However, even these longer versions do not guarantee 100% safety.
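The effect of the capturing group on Regex.Split can be seen in a small sketch. The sample HTML string is invented for illustration (only the two marker comments come from the question), and the top-level statement style assumes C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input: one discarded block between two kept fragments.
string text = "<p>keep</p><!--Graphic2--><img/>\n<b>x</b><!--EndDiscarded--><p>tail</p>";

// With the capturing group, the last captured character becomes an extra array element.
string[] withGroup = Regex.Split(text, @"<!--Graphic2-->(.|\n)*<!--EndDiscarded-->");

// Without a capturing group, only the text outside the match is returned.
string[] withoutGroup = Regex.Split(text, @"(?s)<!--Graphic2-->.*?<!--EndDiscarded-->");

Console.WriteLine(withGroup.Length);                  // 3
Console.WriteLine(withoutGroup.Length);               // 2
Console.WriteLine(withoutGroup[0] + withoutGroup[1]); // <p>keep</p><p>tail</p>
```

If the grouping is still needed for alternation, a non-capturing group (?:...) should avoid the extra element as well.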

C#: How to use regex to match string "#A=1, #B=2, #C=3, ..."

I have a string something like this.
#A1=1,#B1=2,#C1=3,#D2=4123,#D3='asdsd',...
The string length is undefined.
How to use Regex to get the values each variable name?
Expect result is
A1 is 1
B1 is 2
C1 is 3
Here is my initial code, but that seems not working.
Regex exp = new Regex(@"((#[\w]+=('.+'|[\d\.]+),?)+?)+");
GroupCollection gc = exp.Match(strArg).Groups;
string txt = @"#A1=1,#B1=2,#C1=3,#D2=4123,#D3='asdsd'";
// Group 1: the name after '#'; group 2: everything up to the next comma.
string pattern = @"#([^=]*)=([^,]*)";
foreach (Match m in Regex.Matches(txt, pattern))
{
    Console.WriteLine(string.Format("{0} is {1}", m.Groups[1].Value, m.Groups[2].Value));
}
This is an example
A Group is only created once for each grouping (..) construct - this is independent of how many times a group is actually 'repeated' (eg. with +) to match the input in the Regex#Match call.
To solve this directly with regular expression either -
Use a "global match" (Regex#Matches), remove the repeat-group modifiers in the regular expression, and let the re-application of the regex find each group (this yields a collection of match results, each with the same set of groups); or
Create a hard-coded group for each of the #letter.. constructs manually. (This only works for a fixed set of fields and can quickly become unwieldy, however.)
Alternatively, if possible -
Simplify the problem with a String#Split (or Regex#Split) first and then process resulting part sequence.
Regarding "seems not working", there are several immediate issues beyond the aforementioned grouping problem, some of them brought up by Joel in a comment -
The regular expression does not take spaces after a , into account (maybe this is just an error in the title?)
Multiple quoted strings will trivially break it because '.+' is greedy and will "eat up" some of the parameter pairs
Without the actual input and the actual "doesn't work" data it is hard to reason about or explain it further; in any case these two issues can be addressed, to some extent, at the same time as implementing the "global match" approach.
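The split-first alternative mentioned above could look roughly like the following sketch. ParsePairs is a hypothetical helper name, the input is the sample string from the question, and the top-level statement style assumes C# 9 or later. Like the simple patterns, it does not handle commas inside quoted values.

```csharp
using System;
using System.Collections.Generic;

// Split on commas first, then split each part once on '='.
List<KeyValuePair<string, string>> ParsePairs(string input)
{
    var pairs = new List<KeyValuePair<string, string>>();
    foreach (string part in input.Split(','))
    {
        // Trim() tolerates spaces after a comma; TrimStart('#') drops the name prefix.
        string[] kv = part.Trim().TrimStart('#').Split(new[] { '=' }, 2);
        if (kv.Length == 2)
        {
            pairs.Add(new KeyValuePair<string, string>(kv[0], kv[1]));
        }
    }
    return pairs;
}

foreach (var p in ParsePairs("#A1=1,#B1=2,#C1=3,#D2=4123,#D3='asdsd'"))
{
    Console.WriteLine(p.Key + " is " + p.Value);
}
```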
Here's the regular expression I quickly whipped up. It won't support commas in quotes though.
#(\w+)=(.*?)(?:,|$)
Regex101
#([^=]*)\s*=\s*(.*?)(?=,|$)
You can simply try this. See the demo and grab the groups.
https://regex101.com/r/uF4oY4/19

Large string regex replace performance in C#

I'm facing a problem with Regex performance in C#.
I need to run a replace on a very large string (270k characters, don't ask why..). The regex matches about 3k times.
private static Regex emptyCSSRulesetRegex = new Regex(@"[^\};\{]+\{\s*\}", RegexOptions.Compiled | RegexOptions.Singleline);
public string ReplaceEmptyCSSRulesets(string css) {
return emptyCSSRulesetRegex.Replace(css, string.Empty);
}
The string I pass to the method looks something like this:
.selector-with-statements{border:none;}.selector-without-statements{}.etc{}
Currently the replace process takes up 1500ms in C#, but when I do exactly the same in Javascript it only takes 100ms.
The Javascript code I used for timing:
console.time('reg replace');
myLargeString.replace(/[^\};\{]+\{\s*\}/g,'');
console.timeEnd('reg replace');
I also tried to do the replacing by looping over the matches in reverse order and replace the string in a StringBuilder. That was not helping.
I'm surprised by the performance difference between C# and Javascript in this case, and I think I'm doing something wrong, but I cannot think of what.
I can't really explain the difference of time between Javascript and C#(*). But you can try to improve the performance of your pattern (that produces a lot of backtracking):
private static Regex emptyCSSRulesetRegex = new Regex(@"(?<keep>[^};{]+)(?:{\s*}(?<keep>))?", RegexOptions.Compiled);
public string ReplaceEmptyCSSRulesets(string css) {
return emptyCSSRulesetRegex.Replace(css, @"${keep}");
}
One of the problems with your original pattern is that when the curly brackets are not empty (or contain something other than whitespace), the regex engine goes on to test each position before the opening curly bracket (always with the same result). Example: with the string abcd{1234}, your pattern will be tested starting at a, then b, and so on.
The pattern I suggest will consume abcd even if it is not followed by empty curly brackets, so the positions of bcd are not tested.
abcd is captured in the group named keep but when empty curly brackets are found, the capture group is overwritten by an empty capture group.
You can get an idea of the number of steps needed for the two patterns (check the debugger):
original pattern
new pattern
Note: your original pattern can be improved if you enclose [^}{;]+ in an atomic group. This change will divide the number of steps needed by 2 (compared to the original), but even with that, the number of steps stays high for the previously explained reason.
(*) it's possible that the javascript regex engine is smart enough to not retry all these positions, but it's only an assumption.
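As a quick sanity check that the rewritten pattern produces the same output as the original on the sample string from the question, here is a minimal sketch. It only compares results and makes no claim about timings; the variable names are invented and the top-level statement style assumes C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

var original = new Regex(@"[^\};\{]+\{\s*\}", RegexOptions.Compiled | RegexOptions.Singleline);
var rewritten = new Regex(@"(?<keep>[^};{]+)(?:{\s*}(?<keep>))?", RegexOptions.Compiled);

string css = ".selector-with-statements{border:none;}.selector-without-statements{}.etc{}";

// Original: delete every "selector{}" match outright.
string viaOriginal = original.Replace(css, string.Empty);

// Rewritten: put back the captured text, which is empty when "{}" followed the selector.
string viaRewritten = rewritten.Replace(css, "${keep}");

Console.WriteLine(viaOriginal);  // .selector-with-statements{border:none;}
Console.WriteLine(viaRewritten); // .selector-with-statements{border:none;}
```

Wrapping each Replace call in a System.Diagnostics.Stopwatch on the real 270k-character input would be one way to measure whether the rewrite actually helps.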

Regex non-greedy match

If I have the following simplified string:
string331-itemR-icon253,string131-itemA-icon453,
string12131-itemB-icon4535,string22-itemC-icon443
How do I get only the following, using only regex?
string12131-itemB-icon4535,
All numbers are unknown. The only known parts are
itemA, itemB, itemC, string and icon
I've tried string.+?itemB.+?, but it also picks up from the first occurrence of string rather than the one adjacent to itemB
I've also tried using [^icon] preceding the itemB in various positions but couldn't get it to work.
Try this:
string[^,]+itemB[^,]+,
Try this regex
string\d+-itemB-icon\d+,
The given solutions that use a restricted set of characters instead of a wildcard are simplest, but to get more at the general question: You got the non-greedy quantifier part right, but being non-greedy doesn't prevent the matcher from taking as many characters as it needs to find a match. You might be looking for the atomic group operator, (?>group). Once the group matches something, it will be treated atomically if the matcher needs to backtrack.
(?>string.+?item)B.+?,
In your example, the group matches string331-item, but the B doesn't match R so the whole group is tossed and the search moves to the next string.
You don't mention the commas separating items as a known part but use it in the example regex so I assume it can be used in a solution. Try excluding the comma as a character set instead of matching against ".".
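To compare the approaches from the answers above, the following sketch runs the question's lazy pattern, the negated-character-class pattern, and the atomic-group pattern against the sample input, joined onto one line here for simplicity. Variable names are invented and the top-level statement style assumes C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "string331-itemR-icon253,string131-itemA-icon453," +
               "string12131-itemB-icon4535,string22-itemC-icon443";

// Lazy quantifier alone: the match still starts at the first "string".
string lazy = Regex.Match(input, @"string.+?itemB.+?").Value;

// Negated class: the match cannot cross a comma, so it stays within one item.
string negated = Regex.Match(input, @"string[^,]+itemB[^,]+,").Value;

// Atomic group: once "string...item" has matched, it is not extended on backtracking.
string atomic = Regex.Match(input, @"(?>string.+?item)B.+?,").Value;

Console.WriteLine(lazy.StartsWith("string331", StringComparison.Ordinal)); // True
Console.WriteLine(negated); // string12131-itemB-icon4535,
Console.WriteLine(atomic);  // string12131-itemB-icon4535,
```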

Does this regex expression allow "*"?

I really know very little about regexes.
I'm trying to test a password validation.
Here's the regex that describes it (I didn't write it, and don't know what it means):
private static string passwordField = "[^A-Za-z0-9_.\\-!##$%^&*()=+;:'\"|~`<>?\\/{}]";
I've tried a password like "dfgbrk*", and my code, using the above regex, allowed it.
Is this consistent with what the regex defines as acceptable, or is it a problem with my code?
Can you give me an example of a string that validation using the above regex isn't supposed to allow?
Added: Here's how the original code uses this regex (and it works there):
public static bool ValidateTextExp(string regexp, string sText)
{
    if (sText == null)
    {
        Log.WriteWarning("ValidateTextExp got null text to validate against regExp {0} . returning false", regexp);
        return false;
    }
    return (!Regex.IsMatch(sText, regexp));
}
It seems I'm doing something wrong..
Thanks.
Your regex matches a value that contains any single character which is not in that list.
Your test value matches because it has spaces in it, which do not appear to be in your expression.
It matches characters outside the list because your character class starts with ^, which negates it. It matches any value that contains even one such character because you did not specify the beginning or end of the string, or any quantifiers.
The above assumes I'm not missing the importance of any of the characters in the middle of the character soup :)
This answer is also dependent on how you actually use the Regex in code.
If your intention was for that Regex string to represent the only characters that are actually allowed in a password, you would change the regex like so:
string pattern = "^[A-Z0-9...etc...]+$";
The important parts there are:
The ^ has been removed from inside the bracket, to outside; where it signifies the start of the whole string.
The $ has been added to the end, where it signifies the end of the whole string.
Those are needed because otherwise, your pattern will match anything that contains the valid values anywhere inside - even if invalid values are also present.
Finally, I've added the + quantifier, which means you want to find any one of those valid characters, one or more times. (This regex would not permit a 0-length password.)
If you wanted to permit the ^ character also as part of the password, you would add it back in between the brackets, but not as the first character right after the opening bracket [. So for example:
string pattern = "^[A-Z0-9^...etc...]+$";
The ^ has special meaning in different places at different times in Regexes.
[^A-Za-z0-9_.\-!##$%^&*()=+;:'\"|~`?\/{}]
----------------------^
Looks fine to me, at least in regards to your question title. I'm not clear yet on why the spaces in your sample don't trip it up.
Note that I'm assuming the purpose of this expression is to find invalid characters. Thus, if the expression is a positive match, you have a bad password that you must reject. Since there appears to be some confusion about this, perhaps I can clear it up with a little pseudo-code:
bool isGoodPassword = !Regex.IsMatch(requestedPassword, @"[^A-Za-z0-9_.\-!...]");
You could rewrite this for a positive match (without the negation) like so:
bool isGoodPassword = Regex.IsMatch(requestedPassword, @"^[A-Za-z0-9_.\-!...]+$");
The new expression matches only a string that consists, from beginning to end, of one or more characters from the list. Any character not in the list causes the match to fail.
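The two formulations can be tried side by side in a small sketch. The allowed set below is a shortened, hypothetical stand-in for the full list in the question, the helper names are invented, and the top-level statement style assumes C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

// Negated form: the password is good if NO character outside the allowed set is found.
bool IsValidNegated(string pw) => !Regex.IsMatch(pw, @"[^A-Za-z0-9_.\-!*]");

// Positive form: the password is good if it consists ONLY of allowed characters.
bool IsValidPositive(string pw) => Regex.IsMatch(pw, @"^[A-Za-z0-9_.\-!*]+$");

Console.WriteLine(IsValidNegated("dfgbrk*"));  // True: '*' is in the allowed set
Console.WriteLine(IsValidPositive("dfgbrk*")); // True
Console.WriteLine(IsValidNegated("dfg brk"));  // False: the space is not allowed
Console.WriteLine(IsValidPositive("dfg brk")); // False
Console.WriteLine(IsValidNegated(""));         // True: nothing invalid to find
Console.WriteLine(IsValidPositive(""));        // False: + requires at least one character
```

The empty string is the one case shown here where the two forms disagree, so the negated form would need a separate length check.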
Your regular expression is just an inverted character class and describes just one single character (but that can't be *). So it depends on how you use that character class.
Depends on how you apply it. It describes exactly one character; however, the ^ at the beginning bugs me a little, as it negates the class so that it matches every other character, so there is probably something terribly fishy there.
Edit: as pointed out in other answers, the reason for your string to match is the space, not the explanation that was replaced by this line.
