Regex to match multiple strings with positive look behind - c#

So I have been trying to combine the answers of these two questions:
C# split string but keep split chars\seperators
Regex to match multiple strings
Essentially I'd like to be able to split a string around certain strings and have the splitting strings in the output array of Regex.Split() as well. Here is what I have tried so far:
// ** I'd also like to have UNION ALL but not sure how to add that
private const string CompoundSelectRegEx = @"(?<=[\b(UNION|INTERSECT|EXCEPT)\b])";
string sql = "SELECT TOP 5 * FROM Persons UNION SELECT TOP 5 * FROM Persons INTERSECT SELECT TOP 5 * FROM Persons EXCEPT SELECT TOP 5 * FROM Persons";
string[] strings = Regex.Split(sql, CompoundSelectRegEx);
The problem is that it starts matching individual characters like E and U so I get an incorrect array of strings.
I'd also like to match around UNION ALL, but since that's a phrase rather than a single word, I wasn't sure how to add it to the above regex. If someone could point me in the right direction there as well, that would be great!
Thanks!

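If you want to split on those words and include them in the results, simply alternate on them and place them in a group. There's no need for look-arounds. (In your attempt, the square brackets turn the whole expression into a character class, which is why it matches after individual characters such as E and U.) This pattern should fit your needs: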
string pattern = @"\b(UNION(?:\sALL)?|INTERSECT|EXCEPT)\b";
The (?:\sALL)? makes the word ALL optionally matched. The (?:...) part means match but don't capture the specified pattern. The trailing ? at the end of the group makes it optional. If you want to trim the results you could add a \s* at the end of the pattern.
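A minimal sketch of how that pattern behaves with Regex.Split, assuming a shortened, made-up query (the table names t1, t2 and t3 are placeholders) and the optional \s* added on both sides so the pieces come back trimmed:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample input; \s* on both sides of the group trims the pieces.
string pattern = @"\s*\b(UNION(?:\sALL)?|INTERSECT|EXCEPT)\b\s*";
string sql = "SELECT a FROM t1 UNION ALL SELECT a FROM t2 EXCEPT SELECT a FROM t3";

// The keywords sit in a capturing group, so Regex.Split keeps them in the output.
string[] parts = Regex.Split(sql, pattern);

foreach (string part in parts)
{
    Console.WriteLine($"[{part}]");
}
// [SELECT a FROM t1]
// [UNION ALL]
// [SELECT a FROM t2]
// [EXCEPT]
// [SELECT a FROM t3]
```

Matching is case-sensitive here; passing RegexOptions.IgnoreCase would be one way to accept lowercase keywords as well.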
Be aware that this might work for simple SQL statements, but once you start dealing with nested queries the above approach will probably break down. At that point a regex might not be the best solution and you should develop a parser instead.

Related

C# Use REGEX to sort out Parentheses inside eachother

I'm using regex right now to sort out lines like:
string a = "mine(hello, this())";
string b = "mine(hello, me.he)";
string c = "mine(hello, this(this2())";
I'm trying to get this() by itself.
Regex keeps messing up on this because the pattern was written to get the text inside of (). How can I correct my pattern to fix this?
Code:
string result = Regex.Match(a, @"\(([^)]*)\)").Groups[1].Value;
To get everything inside the outermost set of parentheses, use a greedy operator to get everything until the last parenthesis. The following pattern does not attempt to match sets of parentheses or nested parentheses. (To do that is more involved and there are already many questions and online discussions about that.)
Regex.Match(a, @"\((.*)\)").Groups[1].Value;
That pattern will match the opening parenthesis, then everything it possibly can until the closing parenthesis. This will require backtracking, but should be reasonable if matches are limited to single lines. Depending on current regex options, it could match across multiple lines and it has no other constraints, so it could match much more than desired. If that is a concern, add additional operators to limit the extent that it matches. For example, to keep from matching across lines:
Regex.Match(a, @"\(([^\r\n]*)\)").Groups[1].Value;
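To make the difference concrete, here is a small sketch that runs the question's pattern and the greedy pattern against the first sample string. The variable names firstClose and outermost are only illustrative:

```csharp
using System;
using System.Text.RegularExpressions;

string a = "mine(hello, this())";

// Question's pattern: [^)]* stops at the first ')', so the inner call is cut off.
string firstClose = Regex.Match(a, @"\(([^)]*)\)").Groups[1].Value;

// Greedy pattern: .* runs up to the last ')' on the line.
string outermost = Regex.Match(a, @"\((.*)\)").Groups[1].Value;

Console.WriteLine(firstClose);  // hello, this(
Console.WriteLine(outermost);   // hello, this()
```

Neither pattern checks that the parentheses are balanced; they differ only in which closing parenthesis ends the match.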

Regex.Split returning whitespaces

I want to export a View as a HTML-Document to the User on my ASP.NET page. I want to give the option to only get a part of the view.
Because of that I want to split the output with Regex.Split(). I wrote a Regex that matches the part I want to cut out. After splitting I put the 2 output parts together again.
The problem is that I get a list of 3 parts, of which the second contains " ". How can I change the code that the output contains only 2 strings?
My Code:
textParts = Regex.Split(text, @"<!--Graphic2-->(.|\n)*<!--EndDiscarded-->");
text = textParts[0] + textParts[1];
text contains HTML, CSS and jQuery Code. I wrote comments like <!--Graphic2--> around the blocks I want to cut out.
EDIT
I got it working now by using the Regex.Replace() Method. But I still don't know why Split isn't working how I expected.
You should consider parsing HTML with the proper tools, like HtmlAgilityPack.
The current question is about why Regex.Split returned 3 values. That is due to the presence of a capturing group in your pattern. Regex.Split returns the chunks between start/end of string and the matched chunks, and all captured substrings:
If capturing parentheses are used in a Regex.Split expression, any captured text is included in the resulting string array. For example, if you split the string "plum-pear" on a hyphen placed within capturing parentheses, the returned array includes a string element that contains the hyphen.
So, Regex.Split(text, @"<!--Graphic2-->(.|\n)*<!--EndDiscarded-->") matches the <!--Graphic2--> substring, then matches and captures into Group 1 any 0+ occurrences of any char, as many as possible, and then matches <!--EndDiscarded-->. These matches are removed and the substrings that are not matched are returned, but the last char captured into the repeated capturing group is also returned.
So, if you plan to use regex for this task, you should consider re-writing it to @"(?s)<!--Graphic2-->.*?<!--EndDiscarded-->" or @"<!--Graphic2-->[^<]*(?:<(?!!--EndDiscarded)[^<]*)*<!--EndDiscarded-->" that will be much more efficient, or even @"<!--Graphic2-->[^<]*(?:<(?!!--(?:EndDiscarded|Graphic2))[^<]*)*<!--EndDiscarded-->" that will ensure no nested Graphic2 comments are matched.
As you can see, the complexity of the patterns rises when you want them to work more efficiently and safely. However, even these longer versions do not guarantee 100% safety.
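The effect of the capturing group can be reproduced on a tiny input. In this sketch the string is a made-up stand-in for the page ("A" and "B" play the role of the HTML to keep), and the variable names are illustrative:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical miniature of the page: "A" and "B" stand in for the HTML to keep.
string text = "A<!--Graphic2-->x y<!--EndDiscarded-->B";

// Capturing group: the last captured character ("y") is added to the result.
string[] withCapture = Regex.Split(text, @"<!--Graphic2-->(.|\n)*<!--EndDiscarded-->");

// No capturing group: only the text around the match is returned.
string[] withoutCapture = Regex.Split(text, @"(?s)<!--Graphic2-->.*?<!--EndDiscarded-->");

Console.WriteLine(string.Join("|", withCapture));     // A|y|B
Console.WriteLine(string.Join("|", withoutCapture));  // A|B
```

Changing (.|\n)* to the non-capturing (?:.|\n)* would also give two parts, although the (?s) version avoids the alternation entirely.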

Regex to parse formatter string

I am writing a string.Format-like method. In order to do this, I am adopting Regex to determine commands and parameters: e.g. Format(@"\m{0,1,2}", byteArr0, byteArr1, byteArr2)
For the first Regex, return 2 groups:
'\m'
'{0,1,2}'
Another Regex takes the value of '{0,1,2}' and has 3 matches:
0
1
2
These values are the indexes corresponding to the byteArr params.
This command structure is likely to grow, so I'm really trying to figure this out and learn enough to be able to modify the Regex for future requirements. I would think that a single Regex would do all of the above, but there is value in having 2 separate Regex(es/ices ???) expressions.
Any way, to get the first group '\m' the Regex is:
"(\\)(\w{1,1})" // I want the '{0,1,2}' group also
To get the integer matches '{0,1,2}' I was trying:
"(?<=\{)([^}]*)(?=\})"
I am having difficulty in achieving: (1) 2 groups on the first expression and (2) 3 matches on the integers within the braces delimited by a comma in the second expression.
Your first regex (\\)(\w{1,1}) can be greatly simplified.
You don't want to capture the \ separately from the m, so there is no need to wrap them in their own sets of parentheses.
\w{1,1} is the same as just \w.
So we have \\\w to match the first part \m.
Now to deal with the second part: really, we can ignore everything other than the 0,1,2 in the example since there are no numbers elsewhere, so you'd just use \d+ and iterate through the matches.
But let's assume the example could actually be \9{1,2,3}.
Now \d+ would match the 9 so to avoid this we could use [{,](\d+)[,}]. This says capture a number that has either a , or { on the left of it and a , or } on the right.
You're right in saying that we can match the whole string with a single regex, something like this would do it:
(\\\w){((\d+),?)+}
However, the problem with this is that when you examine the contents of the capture groups afterwards, the last number caught by the (\d+) overwrites all the other values that were caught in there. So you'd be left with group 1: \m and group 2: 2 for your example.
With that in mind I recommend using 2 regexes:
For the 1st part: \\\w
For the numbers: I'd forget about the [{,](\d+)[,}] (and the many other ways you could do it); the cleanest way might just be to grab whatever is inside the {...} and then match with a simple \d+.
So to do this, first use (\\\w)\{([^}]+)\} to grab the \m into group 1 and the 0,1,2 into group 2, then just use \d+ on that.
FYI, your (?<=\{)([^}]*)(?=\}) works fine, but you can't put anything before the lookbehind, i.e. the \\\w. In the vast majority of cases where a lookbehind can be used, you can do what you want by just using capture groups and ignoring everything else:
My regex \{([^}]+)\} is pretty much the same as your (?<=\{)([^}]*)(?=\}), except that rather than looking ahead and looking behind for the { and }, I just leave them outside the capture group that is going to be used.
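A short sketch of the two-step approach, using the format string from the question. The names command, inside and indexes are illustrative, and the brace contents are matched with [^}]+:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string format = @"\m{0,1,2}";

// Step 1: group 1 is the command (\m), group 2 is the text between the braces.
Match m = Regex.Match(format, @"(\\\w)\{([^}]+)\}");
string command = m.Groups[1].Value;
string inside = m.Groups[2].Value;

// Step 2: pull each integer out of the brace contents.
int[] indexes = Regex.Matches(inside, @"\d+")
    .Cast<Match>()
    .Select(x => int.Parse(x.Value))
    .ToArray();

Console.WriteLine(command);                    // \m
Console.WriteLine(string.Join(" ", indexes));  // 0 1 2
```

Because the second regex only ever sees the brace contents, a digit in the command itself (as in \9{1,2,3}) cannot leak into the index list.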
Consider the following Regexes...
(^.*?)(?={.*})
\d+
Good Luck!

regex approach for extracting strings surrounded with double quotes

I have a search string that is getting passed
Eg: "a+b",a, b, "C","d+e",a-b,d
I want to filter out all sub strings surrounded by double quotes("").
In above sample Output should contain:
"a+b","C","d+e"
Is there a way to do this without looping?
Also I then need to extract a string without above values to do further processing
Eg: a,b,a-b,d
Any suggestions on how to do this with minimal performance impact?
Thank you in advance for all your comments and suggestions
Since you didn't say exactly how you want your output (do you need to keep the commas and extra whitespace? Is it comma delimited to begin with?), let's assume that it is NOT comma delimited and you are just trying to remove the occurrences of the "xyz":
string strRegex = @"""([^""])+""";
string strTargetString = @" ""a+b"",a, b, ""C"",""d+e"",a-b,d";
string strOutput = Regex.Replace(strTargetString, strRegex, x => "");
Will remove all of the items (leaving the extra commas and whitespace).
If you are trying to do something where you need each individual match then you might want to try:
var y = (from Match m in Regex.Matches(strTargetString, strRegex) select m.Value).ToList<string>();
y.ForEach(s => Console.WriteLine(s));
To get the list of items without the surrounding quotes, you could either reverse the regex pattern OR use the replace method in the first code sample and then split on the commas, trimming white space (again, assuming you are splitting on commas which it sounds like you are)
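Putting the two steps together might look like the sketch below. It assumes the input really is comma delimited, and the names quoted and rest are illustrative. LINQ still iterates internally, so this hides the loop rather than removing it:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "\"a+b\",a, b, \"C\",\"d+e\",a-b,d";
string quotedPattern = "\"[^\"]+\"";

// Every substring wrapped in double quotes, quotes included.
string[] quoted = Regex.Matches(input, quotedPattern)
    .Cast<Match>()
    .Select(x => x.Value)
    .ToArray();

// Remove the quoted items, then split on commas and drop the empty leftovers.
string[] rest = Regex.Replace(input, quotedPattern, "")
    .Split(',')
    .Select(s => s.Trim())
    .Where(s => s.Length > 0)
    .ToArray();

Console.WriteLine(string.Join(",", quoted));  // "a+b","C","d+e"
Console.WriteLine(string.Join(",", rest));    // a,b,a-b,d
```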
First, add a comma to the end of your input:
"a+b",a, b, "C","d+e",a-b,d,
Then, use this regular expression:
((?<quoted>\".+?\")|(?<unquoted>.+?)),\s*
Now you have 2 problems. Kidding!
You'll have to find a way of extracting the matches without using a loop, but at least they are separated into quoted and unquoted strings by using the groups. You could use a lambda expression to pull the data out and join it, one each for quoted and unquoted, but it's just doing a loop behind the scenes, and may add more overhead than a simple for loop. It sounds like you're trying to eke out performance here, so time and test each method to see what gives the best results.
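One way the lambda-based extraction could look, as a sketch rather than a tuned solution. The pattern is the same one written as a C# verbatim string, and the names matches, quoted and unquoted are illustrative:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// The trailing comma lets every item, including the last one, end with ",".
string input = "\"a+b\",a, b, \"C\",\"d+e\",a-b,d" + ",";
string pattern = @"((?<quoted>"".+?"")|(?<unquoted>.+?)),\s*";

var matches = Regex.Matches(input, pattern).Cast<Match>().ToList();

// Each match fills exactly one of the two named groups.
string[] quoted = matches
    .Where(x => x.Groups["quoted"].Success)
    .Select(x => x.Groups["quoted"].Value)
    .ToArray();
string[] unquoted = matches
    .Where(x => x.Groups["unquoted"].Success)
    .Select(x => x.Groups["unquoted"].Value)
    .ToArray();

Console.WriteLine(string.Join(",", quoted));    // "a+b","C","d+e"
Console.WriteLine(string.Join(",", unquoted));  // a,b,a-b,d
```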

Get substring from string in C# using Regular Expression

I have a string like:
Brief Exercise 1-1 Types of Businesses Brief Exercise 1-2 Forms of Organization Brief Exercise 1-3 Business Activities.
I want to break above string using regular expression so that it can be like:
Types of Businesses
Forms of Organization
Business Activities.
Please don't say that I can break it using 1-1, 1-2 and 1-3 because it will bring the word "Brief Exercise" in between the sentences. Later on I can have Exercise 1-1 or Problem 1-1 also. So I want some general Regular expression.
Any efficient regular expression for this scenario ?
var regex=new Regex(@"Brief (?:Exercise|Problem) \d+-\d+\s");
var result=string.Join("\n",regex.Split(x).Where(a=>!string.IsNullOrEmpty(a)));
The regex will match "Brief " followed by either "Exercise" or "Problem" (the ?: makes the group non capturing), followed by a space, then 1 or more digits then a "-", then one or more digits then a space.
The second statement uses the Split method to split the string into an array, then Where to skip all the empty entries (otherwise the result would include the empty string at the beginning; you could use Skip(1) instead of Where(a=>!string.IsNullOrEmpty(a))), and finally uses string.Join to combine the array back into a string with \n as the separator.
You could use regex.Replace to convert directly to \n, but you will end up with a \n at the beginning that you would have to strip.
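For reference, a self-contained version of the split approach run against the sample string. The Trim() call is an addition that is not in the two-line snippet: without it, each title except the last keeps the space that preceded the next "Brief". The name titles is illustrative:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string x = "Brief Exercise 1-1 Types of Businesses Brief Exercise 1-2 Forms of Organization Brief Exercise 1-3 Business Activities.";
var regex = new Regex(@"Brief (?:Exercise|Problem) \d+-\d+\s");

// Split on the headings, trim the leftover spaces, and drop the empty first entry.
string[] titles = regex.Split(x)
    .Select(s => s.Trim())
    .Where(s => s.Length > 0)
    .ToArray();

string result = string.Join("\n", titles);
Console.WriteLine(result);
// Types of Businesses
// Forms of Organization
// Business Activities.
```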
--EDIT---
If the first number is always 1 and the second number is 1-50ish, you could use the following regex to support 0-59:
var regex=new Regex(@"Brief (?:Exercise|Problem) 1-[1-5]?\d\s");
This regular expression will match on "Brief Exercise 1-" followed by a digit and an optional second digit:
@"Brief Exercise 1-\d\d?"
Update:
Since you might have "Problem" as well, an alternation between Exercise and Problem is also needed (using non-capturing parentheses):
@"Brief (?:Exercise|Problem) 1-\d\d?"
Why don't you do it the easy way? I mean, if the regular part is "Brief Exercise #-#", replace it with some split character and then split the resulting string to obtain what you want.
If you do it otherwise you will always have to take care of special cases.
string pattern = @"Brief Exercise \d+-\d+";
Regex reg = new Regex(pattern);
// "out" is a reserved word in C#, so the intermediate string needs another name
string replaced = reg.Replace(yourstring, "|");
string[] results = replaced.Split('|');
