Negative lookahead in Regex to exclude two words

I have the following regex:
(?!SELECT|FROM|WHERE|AND|OR|AS|[0-9])(?<= |^|\()([a-zA-Z0-9_]+)
that I'm matching against a string like this:
SELECT Static AS My_alias FROM Table WHERE Id = 400 AND Name = 'Something';
This already does 90% of what I want. What I'd also like to do is exclude AS My_alias, where the alias can be any word.
I tried to add this to my regex, but this didn't work:
(?!SELECT|FROM|WHERE|AND|OR|AS [a-zA-Z0-9_]+|[0-9])(?<= |^|\()([a-zA-Z0-9_]+)
The new part is AS [a-zA-Z0-9_]+ (in place of the plain AS alternative).
How can I exclude this part of the string using my regex?
Demo of the regex can be found here

This excludes the AS clause and gets the tokens you seek. It also handles multiple select values, along with zero to many WHERE conditions.
The idea is to use explicitly named captures and tell the regex engine to disregard any unnamed groups (a "match but don't capture" feature).
We will also put all the wanted "tokens" into a single named group, (?<Token> ... ), so every token ends up in one capture collection.
var data = "SELECT Static AS My_alias FROM Table WHERE Id = 400 AND Name = 'Something';";
var pattern = @"
^
SELECT\s+
(
    (?<Token>[^\s]+)
    (\sAS\s[^\s]+)?
    [\s,]+
)+                        # One to many statements
FROM\s+
(?<Token>[^\s]+)          # Table name
(
    \s+WHERE\s+
    (
        (?<Token>[^\s]+)
        (.+?AND\s+)?
    )+                    # One to many conditions
)?                        # Optional Where
";
var tokens =
    Regex.Matches(data, pattern,
                  RegexOptions.IgnorePatternWhitespace  // Lets us space out/comment pattern
                  | RegexOptions.ExplicitCapture)       // Only capture named groups.
         .OfType<Match>()
         .SelectMany(mt => mt.Groups["Token"].Captures  // Get the captures inserted into `Token`
                             .OfType<Capture>()
                             .Select(cp => cp.Value))
    ;
tokens is a sequence of these strings: { "Static", "Table", "Id", "Name" }
This should cover most of the cases you will run into. Use similar logic if you need to process selects with joins; either way, this is a good base to build on.
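If you would rather stay close to the lookahead pattern from the question, one option is a negative lookbehind that rejects any word directly preceded by AS. The following is only a minimal sketch of that idea, not the approach of the answer above. The \b after the keyword list is an added assumption, so that identifiers such as ORDERS are not rejected merely for starting with OR.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var data = "SELECT Static AS My_alias FROM Table WHERE Id = 400 AND Name = 'Something';";

// (?!...)       : the word must not be a keyword and must not start with a digit
// (?<= |^|\()   : the word must follow a space, the start of the string, or "("
// (?<!\bAS\s+)  : the word must not directly follow "AS", i.e. it is not an alias
var pattern =
    @"(?!(?:SELECT|FROM|WHERE|AND|OR|AS)\b|[0-9])(?<= |^|\()(?<!\bAS\s+)([a-zA-Z0-9_]+)";

var tokens = Regex.Matches(data, pattern)
                  .Cast<Match>()
                  .Select(m => m.Value)
                  .ToList();

Console.WriteLine(string.Join(", ", tokens)); // Static, Table, Id, Name
```

This relies on .NET allowing variable-length lookbehind (\s+ inside (?<!...)); many other regex engines do not.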

Related

How to Split OData multi-level expand query string in C#?

I have a URL: Expand=User($select=Firstname,Lastname),Organisation,Contract($Expand=MyOrganisation($select=Name,Status),Organisation),List
I need to split this string in the below format:
User($select=Firstname,Lastname)
Organisation
Contract($Expand=MyOrganisation($select=Name,Status),Organisation)
List
How to achieve this functionality in C#?

You not only need to split the string but also keep track of the parentheses while splitting. This is not possible with just plain old regex. See this post.
However, the splitting can be achieved with some advanced RegEx; .NET fortunately supports balancing groups using which you can keep track of the parentheses. This answer was quite helpful in coming up with a solution. For readability, I have split the regex into multiple lines and used RegexOptions.IgnorePatternWhitespace:
string url = "User($select=Firstname,Lastname),Organisation,Contract($Expand=MyOrganisation($select=Name,Status),Organisation),List";
Regex rgx = new Regex(
    @"(.+?)
      (
        (
          \(
            (?:
              [^()]
              |
              (?'open'\()
              |
              (?'close-open'\))
            )+
            (?(open)(?!))
          \)
        )
        ,
        |
        ,
        |
        \b$
      )",
    RegexOptions.IgnorePatternWhitespace);
// Declare the loop variable as Match so that Groups is accessible on every framework version.
foreach (Match match in rgx.Matches(url))
{
    Console.WriteLine($"{match.Groups[1]} {match.Groups[3]}");
}
The field will be available as match.Groups[1], and the parameters, if any, as match.Groups[3] (an empty string if there are no parameters). match.Groups[0] holds the entire match.
Regex Breakdown

Regex                                       Description
(.+?)                                       Non-greedily match one or more characters
\( and \)                                   Match a literal ( and )
[^()]                                       Match any character that is not a ( or )
(?'open'\()                                 Create a named group "open" and match a ( character
(?'close-open'\))                           Balancing group: match a ) character, store the text between the last "open" capture and this ) in "close", and remove that "open" capture
(?(open)(?!))                               Fail the match if the "open" group still holds a capture, i.e. a ( was never closed
(?:[^()]|(?'open'\()|(?'close-open'\)))+    Non-capturing group: match, one or more times, any of the alternatives separated by |
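For comparison, the same "keep track of the parentheses while splitting" idea can be sketched without any regex, using a simple depth counter. This is a swapped-in technique rather than the balancing-group solution above; SplitTopLevel is a hypothetical helper name, and the sketch assumes the parentheses in the input are balanced.

```csharp
using System;
using System.Collections.Generic;

string url = "User($select=Firstname,Lastname),Organisation,Contract($Expand=MyOrganisation($select=Name,Status),Organisation),List";

foreach (var part in SplitTopLevel(url))
{
    Console.WriteLine(part);
}
// Prints:
// User($select=Firstname,Lastname)
// Organisation
// Contract($Expand=MyOrganisation($select=Name,Status),Organisation)
// List

// Hypothetical helper: split on commas that are not nested inside parentheses.
static List<string> SplitTopLevel(string input)
{
    var parts = new List<string>();
    int depth = 0;   // current parenthesis nesting level
    int start = 0;   // start index of the current part

    for (int i = 0; i < input.Length; i++)
    {
        char c = input[i];
        if (c == '(') depth++;
        else if (c == ')') depth--;
        else if (c == ',' && depth == 0)
        {
            // A comma at depth 0 separates two top-level items.
            parts.Add(input.Substring(start, i - start));
            start = i + 1;
        }
    }

    parts.Add(input.Substring(start)); // the last item has no trailing comma
    return parts;
}
```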

More likely you will want to use ODataLib, which has a built-in URI parser:
Uri requestUri = new Uri("Products?$select=ID&$expand=ProductDetail" +
"&$filter=Categories/any(d:d/ID%20gt%201)&$orderby=ID%20desc" +
"&$top=1&$count=true&$search=tom",
UriKind.Relative);
ODataUriParser parser = new ODataUriParser(model, serviceRoot, requestUri);
SelectExpandClause expand = parser.ParseSelectAndExpand(); // parse $select, $expand
FilterClause filter = parser.ParseFilter(); // parse $filter
OrderByClause orderby = parser.ParseOrderBy(); // parse $orderby
SearchClause search = parser.ParseSearch(); // parse $search
long? top = parser.ParseTop(); // parse $top
long? skip = parser.ParseSkip(); // parse $skip
bool? count = parser.ParseCount(); // parse $count
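Adding the regex option as well (a fixed version of what Amal has provided). Note that the lazy \(.*?\) stops at the first "),", so this variant only splits correctly when the parentheses are not nested: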
string url = "User($select=Firstname,Lastname),Organisation,Contract($Expand=MyOrganisation($select=Name,Status),Organisation),List";
Regex rgx = new Regex(@"(.+?)(?:(\(.*?\)),|,)");
foreach (var match in rgx.Matches($"{url},"))
{
Console.WriteLine(match.ToString()[..^1]);
}

Find multiple groups matching in specific substring

I would like to capture the values on the line below that starts with the word "need", while the values on the lines that start with "skip" and "ignore" must be ignored. I tried the pattern
need.+?(:"(?'index'\w+)"[,}])
but it found only the first value. How can I get the needed result using regex only?
"skip" : {"A":"ABCD123","B":"ABCD1234","C":"ABCD1235"}
"need" : {"A":"ZABCD123","B":"ZABCD1234","C":"ZABCD1235"}
"ignore" : {"A":"SABCD123","B":"SABCD1234","C":"SABCD1235"}

We are going to find "need" and collect what follows it into the captures of named match groups. There will be two groups: one named Index, which holds the A | B | C keys, and one named Data, which holds the values.
The single match will hold all of our data in those two capture collections, and from there we join them into a dictionary.
Here is the code to do that:
string data =
@"""skip"" : {""A"":""ABCD123"",""B"":""ABCD1234"",""C"":""ABCD1235""}
""need"" : {""A"":""ZABCD123"",""B"":""ZABCD1234"",""C"":""ZABCD1235""}
""ignore"" : {""A"":""SABCD123"",""B"":""SABCD1234"",""C"":""SABCD1235""}";
string pattern = @"
\x22need\x22\s*:\s*{      # Find need
(                          # Beginning of Captures
   \x22                    # Quote is \x22
   (?<Index>[^\x22]+)      # A into index.
   \x22\:\x22              # ':'
   (?<Data>[^\x22]+)       # 'Z...' Data
   \x22,?                  # ',(maybe)
)+                         # End of 1 to many Captures";
var mt = Regex.Match(data,
                     pattern,
                     RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);
// Get the data capture into a List<string>.
var captureData = mt.Groups["Data"].Captures.OfType<Capture>()
                    .Select(c => c.Value).ToList();
// Join the index capture data and project it into a dictionary.
var asDictionary = mt.Groups["Index"]
                     .Captures.OfType<Capture>()
                     .Select((cp, iIndex) => new KeyValuePair<string,string>
                                                 (cp.Value, captureData[iIndex]))
                     .ToDictionary(kvp => kvp.Key, kvp => kvp.Value);

If the number of fields is fixed, you can code it like this:
^"need"\s*:\s*{"A":"(\w+)","B":"(\w+)","C":"(\w+)"}
If the tags came after the values, like this:
{"A":"ABCD123","B":"ABCD1234","C":"ABCD1235"} : "skip"
{"A":"ZABCD123","B":"ZABCD1234","C":"ZABCD1235"} : "need"
{"A":"SABCD123","B":"SABCD1234","C":"SABCD1235"} : "ignore"
Then you could use an unbounded positive lookahead:
"\w+?":"(\w+?)"(?=.*"need")
Unbounded lookbehind, however, is prohibited in PCRE (the * and + quantifiers are not allowed inside a lookbehind), so this approach is not very useful in your situation, where the tag comes first.
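If the regex runs in .NET rather than PCRE, variable-length lookbehind is available, so the tag-first layout from the question can be handled with a single pattern. The following is a minimal sketch under that assumption; the group name index is taken from the question, everything else is illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var data =
    "\"skip\" : {\"A\":\"ABCD123\",\"B\":\"ABCD1234\",\"C\":\"ABCD1235\"}\n" +
    "\"need\" : {\"A\":\"ZABCD123\",\"B\":\"ZABCD1234\",\"C\":\"ZABCD1235\"}\n" +
    "\"ignore\" : {\"A\":\"SABCD123\",\"B\":\"SABCD1234\",\"C\":\"SABCD1235\"}";

// (?<="need".*) requires "need" earlier on the same line:
// "." does not match "\n", so the lookbehind cannot reach back into another line.
var pattern = @"(?<=""need"".*):""(?<index>\w+)""";

var values = Regex.Matches(data, pattern)
                  .Cast<Match>()
                  .Select(m => m.Groups["index"].Value)
                  .ToList();

Console.WriteLine(string.Join(", ", values)); // ZABCD123, ZABCD1234, ZABCD1235
```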

You can't capture a dynamically set number of groups, so I'd run something like this regex
"need".*{.*,?".*?":(".+?").*}
with a 'match_all' function, or use Agnius' suggestion.

C# RegEx - get only first match in string

I've got an input string that looks like this:
level=<device[195].level>&name=<device[195].name>
I want to create a RegEx that will parse out each of the <device> tags, for example, I'd expect two items to be matched from my input string: <device[195].level> and <device[195].name>.
So far I've had some luck with this pattern and code, but it always finds both of the device tags as a single match:
var pattern = "<device\\[[0-9]*\\]\\.\\S*>";
Regex rgx = new Regex(pattern);
var matches = rgx.Matches(httpData);
The result is that matches will contain a single result with the value <device[195].level>&name=<device[195].name>
I'm guessing there must be a way to 'terminate' the pattern, but I'm not sure what it is.

Use non-greedy quantifiers:
<device\[\d+\]\.\S+?>
Also, use verbatim strings for escaping regexes, it makes them much more readable:
var pattern = @"<device\[\d+\]\.\S+?>";
As a side note, I guess in your case using \w instead of \S would be more in line with what you intended, but I left the \S because I can't know that.
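To see the difference the quantifier makes, here is a small sketch that runs the original greedy pattern and the non-greedy one against the input from the question and compares the match counts.

```csharp
using System;
using System.Text.RegularExpressions;

var input = "level=<device[195].level>&name=<device[195].name>";

// Greedy \S* runs to the end of the input and backtracks only to the LAST ">".
var greedy = Regex.Matches(input, @"<device\[[0-9]*\]\.\S*>");

// Lazy \S+? stops at the FIRST ">" it can reach.
var lazy = Regex.Matches(input, @"<device\[\d+\]\.\S+?>");

Console.WriteLine(greedy.Count); // 1
Console.WriteLine(lazy.Count);   // 2

foreach (Match m in lazy)
{
    Console.WriteLine(m.Value);  // <device[195].level> then <device[195].name>
}
```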

It depends on how much of the structure of the angle blocks you need to match, but you can do
"\\<device.+?\\>"

I want to create a RegEx that will parse out each of the <device> tags
I'd expect two items to be matched from my input string:
1. <device[195].level>
2. <device[195].name>
This should work. Get the matched group from index 1
(<device[^>]*>)
String literal for use in programs:
@"(<device[^>]*>)"

Change your repetition operator and use \w instead of \S:
var pattern = @"<device\[[0-9]+\]\.\w+>";
String s = @"level=<device[195].level>&name=<device[195].name>";
foreach (Match m in Regex.Matches(s, @"<device\[[0-9]+\]\.\w+>"))
    Console.WriteLine(m.Value);
Output
<device[195].level>
<device[195].name>

Use named match groups and create a linq entity projection. There will be two matches, thus separating the individual items:
string data = "level=<device[195].level>&name=<device[195].name>";
string pattern = @"
(?<variable>[^=]+)     # get the variable name
(?:=<device\[)         # static '=<device['
(?<index>[^\]]+)       # device number index
(?:]\.)                # static ].
(?<sub>[^>]+)          # Get the sub command
(?:>&?)                # Match but don't capture the > and possible &
";
// Ignore pattern whitespace is to document the pattern, does not affect processing.
var items = Regex.Matches(data, pattern, RegexOptions.IgnorePatternWhitespace)
                 .OfType<Match>()
                 .Select(mt => new
                 {
                     Variable = mt.Groups["variable"].Value,
                     Index = mt.Groups["index"].Value,
                     Sub = mt.Groups["sub"].Value
                 })
                 .ToList();
items.ForEach(itm => Console.WriteLine("{0}:{1}:{2}", itm.Variable, itm.Index, itm.Sub));
/* Output
level:195:level
name:195:name
*/

Regular Expression Groups in C#

I've inherited a code block that contains the following regex and I'm trying to understand how it's getting its results.
var pattern = @"\[(.*?)\]";
var matches = Regex.Matches(user, pattern);
if (matches.Count > 0 && matches[0].Groups.Count > 1)
...
For the input user == "Josh Smith [jsmith]":
matches.Count == 1
matches[0].Value == "[jsmith]"
... which I understand. But then:
matches[0].Groups.Count == 2
matches[0].Groups[0].Value == "[jsmith]"
matches[0].Groups[1].Value == "jsmith" <=== how?
Looking at this question, from what I understand the Groups collection stores the entire match as well as the previous match. But doesn't the regexp above match only [open square bracket] [text] [close square bracket]? So why would "jsmith" match?
Also, is it always the case that the Groups collection will store exactly 2 groups: the entire match and the last match?

match.Groups[0] is always the same as match.Value, which is the entire match.
match.Groups[1] is the first capturing group in your regular expression.
Consider this example:
var pattern = @"\[(.*?)\](.*)";
var match = Regex.Match("ignored [john] John Johnson", pattern);
In this case,
match.Value is "[john] John Johnson"
match.Groups[0] is always the same as match.Value, "[john] John Johnson".
match.Groups[1] is the group of captures from the (.*?).
match.Groups[2] is the group of captures from the (.*).
match.Groups[1].Captures is yet another dimension.
Consider another example:
var pattern = @"(\[.*?\])+";
var match = Regex.Match("[john][johnny]", pattern);
Note that we are looking for one or more bracketed names in a row. You need to be able to get each name separately. Enter Captures!
match.Groups[0] is always the same as match.Value, "[john][johnny]".
match.Groups[1] is the group of captures from the (\[.*?\])+. Its Value is the last capture, "[johnny]".
match.Groups[1].Captures[0] is [john]
match.Groups[1].Captures[1] is [johnny], the same as match.Groups[1].Value
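A quick way to check what each of these properties holds is to print them. The sketch below uses the second pattern and input from this answer; in .NET, a repeated group's Value is its last capture, while Captures keeps every iteration.

```csharp
using System;
using System.Text.RegularExpressions;

var match = Regex.Match("[john][johnny]", @"(\[.*?\])+");

Console.WriteLine(match.Value);                    // [john][johnny]
Console.WriteLine(match.Groups[0].Value);          // [john][johnny]
Console.WriteLine(match.Groups[1].Value);          // [johnny]  (last capture of the group)
Console.WriteLine(match.Groups[1].Captures.Count); // 2

foreach (Capture capture in match.Groups[1].Captures)
{
    Console.WriteLine(capture.Value);              // [john] then [johnny]
}
```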

The ( ) acts as a capture group. So the matches collection holds all of the matches that C# finds in your string, and each match's Groups collection holds the values of the capture groups inside that match. If you don't want that extra level of capture, just remove the ( ).

Groups[0] is your entire match (not the entire input string).
Groups[1] is your group captured by parentheses (.*?). You can configure Regex to capture Explicit groups only (there is an option for that when you create a regex), or use (?:.*?) to create a non-capturing group.

The parentheses identify a group as well, so the first group (index 0) is the entire match, and the second group (index 1) is the contents of what was found between the square brackets.

How? The answer is here:
(.*?)
That is a subgroup of @"\[(.*?)\]".

Regex to extract a string between two delimiters WITHOUT also returning the delimiters?

I want to just extract the text between the brackets -- NOT the brackets, too!
My code looks currently like this:
var source = "Harley, J. Jesse Dead Game (2009) [Guard]";
// Extract role with regex
var m = Regex.Match(source, @"\[(.*)\]");
var role = m.Groups[0].Value;
// role is now "[Guard]"
role = role.Substring(1, role.Length-2);
// role is now "Guard"
Can you help me to simplify this to just a single regex, instead of the regex, then the substring?

Use a different group number. Every time you wrap something in ( ) it creates a new group out of it. Group zero is the entire found expression, group 1 is the first group of ( ), group 2 is the second, and so on. Since you're using group 0, it's returning the entire string that matches the expression.
Try changing Groups[0] to Groups[1] and see what it gives you.
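As a minimal sketch of that suggestion, using the source string and pattern from the question, printing both groups side by side shows the difference:

```csharp
using System;
using System.Text.RegularExpressions;

var source = "Harley, J. Jesse Dead Game (2009) [Guard]";
var m = Regex.Match(source, @"\[(.*)\]");

Console.WriteLine(m.Groups[0].Value); // [Guard]  (the entire match, brackets included)
Console.WriteLine(m.Groups[1].Value); // Guard    (only what the parentheses captured)
```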

You can use zero-width lookahead (?=) and lookbehind (?<=) assertions:
m = Regex.Match(source, @"(?<=\[).*(?=\])");
var role = m.Value;
Zero-width positive lookahead assertion: matches a suffix but excludes it from the capture
Zero-width positive lookbehind assertion: matches a prefix but excludes it from the capture
See Grouping Constructs on MSDN for details.
