.net System.Text.RegularExpressions nesting regexps in regexp - c#

I would like to ask any skilled .net developer, if there is a possibility to define regular expression (using the .net RegularExpressions namespace cpabilities), which would include references to another regexp(s). I would like to describe grammar rules, each rule as a single regexp. The final regexp would be the grammar's start symbol.
Of course I can perform the expansion to single line regular expression, but the readability would suffer. I also would not like to try each option included in start symbol programatically (like foreach(regexp r in line.regexps) {check if r.matches(input)}).
For example having following ini-like file grammar in regexp-like form (does not follow microsoft regexp rules, just general ones):
sp = \s*
allowed_char = [a-zA-Z0-9_]
key = <allowed_char>+
value = <allowed_char>((<allowed_char>|[ ])*<allowed_char>)?
comment = (;|(//)|#)(.*)
empty_line = ^<sp>$
line_comment = ^<sp><comment>$
section = ^<sp>\[<sp><value><sp>\]<sp>(<comment>)?$
item = ^<sp><key><sp>=<sp><value><sp>(<comment>)?$
line = <empty_line>|<line_comment>|<section>|<item>
I would like to:
Check if a sentence is part of the language (true/false) - seems trivial: matches the <line> start symbol.
Access the terminal-like symbol values (e.g. <section>, <key>, <value>, ...) - I suppose this could be achieved via named matching groups (or whatever exactly is it called - still nedd to read some details at msdn).
I do not expect you to write the code, just if you could give me some hints, whether it is possible (and how) or not, because I have not found this info yet. All examples are for single regexp matching.
Thank you.

This is what I came up with when I was doing my own regex based mathematical expression parser:
private static class Regexes {
// omitted...
private static readonly string
strFunctionNames = "sin|ln|cos|tg|tan|abs",
strReal = #"([\+-]?\d+([,\.]\d+)?(E[\+-]?\d+)?)|[\+-]Infinit(y|o)",
strFunction = string.Format( #"(?<function>{0})(?<argument>{1})",
strFuncitonNames, strReal );
// omitted...
public static readonly Regex
FunzioniLowerCase = new Regex( strFunctionNames ),
RealNumber = new Regex( strReal ),
Function = new Regex( strFunction );
}
This has the obvious disadvantage that there's some sort of repetition in the code, but you could use reflection to compile (and perhaps even create) those regexes in a static constructor.

Related

How to match condition IF in C#

I have a string like this.
string strex = "Insert|Update|Delete"
I am retrieving another string as string strex1 = "Insert" (It may retrieve Update or Delete)
I need to match strex1 with strex in "IF" condition in C#.
Do I need to split strex and match with strex1?
The string you posted is a regular expression pattern that matches the words Insert, Update or Delete. Regular expressions are a very common way of specifying validation rules in web applications.
Regular expressions can express far more complex rules than a simple comparison. They're also far faster (think 10x) in validation scenarios than splitting. In a web application, that translates to using fewer servers to serve the same traffic.
You can use .NET's Regex to match strings with that pattern, eg :
var strex = "Insert|Update|Delete";
if (Regex.IsMatch(input,strex))
{
....
}
This will create a new regular expression object each time. You can avoid this by creating a static Regex instance and reuse it. Regex is thread-safe which means there's no problem using the same instance from multiple threads :
static Regex _cmdRegex = new Regex("Insert|Update|Delete");
...
void MyMethod(string input)
{
if(_cmdRegex.IsMatch(input))
{
...
}
}
The Regex class methods will match if the pattern appears anywhere in the pattern. Regex.IsMatch("Insert1",strex) will return True. If you want an exact match, you have to specify that the pattern starts at the beginning of the input with ^ and ends at the end with $ :
static Regex _cmdRegex = new Regex("^(Insert|Update|Delete)$");
With this change, _cmdRegex.IsMatch("Insert1") will return false but _cmdRegex.IsMatch("Insert") will return true.
Performance
In this case a regular expression is a lot faster than splitting and trying exact matches. Think 10-100x over time. There are two reasons for this:
Strings are immutable, so every string modification operation like Split() will generate new temporary strings that have to be allocated and garbage collected. In a busy web application this adds up, eventually using up a lot of RAM and CPU for little or no benefit. One of the reasons ASP.NET Core is 10x times faster than the old ASP.NET is eliminating such substring operations wherever possible.
A regular expression is compiled into a program that performs matching in the most efficient way. When you use Split().Any() the program will compare the input with all the substrings even if it's obvious there's no possible match, eg because the first letter is Z. A Regex program on the other hand would only proceed if the first character was I, U or D
Efficient way I can think of is using string.Contains()
if(strex.Contains($"{strex1}|") || strex.Contains($"|{strex1}"))
{
//Your code goes here
}
Solution using Linq, Split string strex by '|' and check strex1 is present in an array or not, like
Issue with below solution is pointed out by #PanagiotisKanavos in the
comment.
Using .Any(),
if(strex.Split('|').Any(x => x.Equals(strex1)))
{
//Your code goes here
}
or using Contains(),
if(strex.Split('|').Contains(strex1))
{
//Your code goes here
}
if you want to ignore case while comparing string then you can use StringComparison.OrdinalIgnoreCase.
if(strex.Split('|').Any(x => x.Equals(strex1, StringComparison.OrdinalIgnoreCase))
{
//Your code goes here
}
.NETFIDDLE

C# grammar using ANTLR 3

I'm now writing C# grammar using Antlr 3 based on this grammar file.
But, I found some definitions I can't understand.
NUMBER:
Decimal_digits INTEGER_TYPE_SUFFIX? ;
// For the rare case where 0.ToString() etc is used.
GooBall
#after
{
CommonToken int_literal = new CommonToken(NUMBER, $dil.text);
CommonToken dot = new CommonToken(DOT, ".");
CommonToken iden = new CommonToken(IDENTIFIER, $s.text);
Emit(int_literal);
Emit(dot);
Emit(iden);
Console.Error.WriteLine("\tFound GooBall {0}", $text);
}
:
dil = Decimal_integer_literal d = '.' s=GooBallIdentifier
;
fragment GooBallIdentifier
: IdentifierStart IdentifierPart* ;
The above fragments contain the definition of 'GooBall'.
I have some questions about this definition.
Why is GooBall needed?
Why does this grammar define lexer rules to parse '0.ToString()' instead of parser rules?
It's because that's a valid expression that's not handled by any of the other rules - I guess you'd call it something like an anonymous object, for lack of a better term. Similar to "hello world".ToUpper(). Normally method calls are only valid on variable identifiers or return values ala GetThing().Method(), or otherwise bare.
Sorry. I found the reason from the official FAQ pages.
Now if you want to add '..' range operator so 1..10 makes sense, ANTLR has trouble distinguishing 1. (start of the range) from 1. the float without backtracking. So, match '1..' in NUM_FLOAT and just emit two non-float tokens:

RegEx for a Glossary Function

I'm working on a web-based help system that will auto-insert links into the explanatory text, taking users to other topics in help. I have hundreds of terms that should be linked, i.e.
"Manuals and labels" (describes these concepts in general)
"Delete Manuals and Labels" (describes this specific action)
"Learn more about adding manuals and labels" (again, more specific action)
I have a RegEx to find / replace whole words (good ol' \b), which works great, except for linked terms found inside other linked terms. Instead of:
Learn more about manuals and labels
I end up with
Learn more about <a href="#">manuals and labels</a>
Which makes everyone cry a little. Changing the order in which the terms are replaced (going shortest to longest) means that I''d get:
Learn more about manuals and labels
Without the outer link I really need.
The further complication is that the capitalization of the search terms can vary, and I need to retain the original capitalization. If I could do something like this, I'd be all set:
Regex _regex = new Regex("\\b" + termToFind + "(|s)" + "\\b", RegexOptions.IgnoreCase);
string resultingText = _regex.Replace(textThatNeedsLinksInserted, "<a>" + "$&".Replace(" ", "_") + "</a>));
And then after all the terms are done, remove the "_", that would be perfect. "Learn_more_about_manuals_and_labels" wouldn't match "manuals and labels," and all is well.
It would be hard to have the help authors delimit the terms that need to be replaced when writing the text -- they're not used to coding. Also, this would limit the flexibility to add new terms later, since we'd have to go back and add delimiters to all the previously written text.
Is there a RegEx that would let me replace whitespace with "_" in the original match? Or is there a different solution that's eluding me?
From your examples with nested links it sounds like you're making individual passes over the terms and performing multiple Regex.Replace calls. Since you're using a regex you should let it do the heavy lifting and put a nice pattern together that makes use of alternation.
In other words, you likely want a pattern like this: \b(term1|term2|termN)\b
var input = "Having trouble with your manuals and labels? Learn more about adding manuals and labels. Need to get rid of them? Try to delete manuals and labels.";
var terms = new[]
{
"Learn more about adding manuals and labels",
"Delete Manuals and Labels",
"manuals and labels"
};
var pattern = #"\b(" + String.Join("|", terms) + #")\b";
var replacement = #"$1";
var result = Regex.Replace(input, pattern, replacement, RegexOptions.IgnoreCase);
Console.WriteLine(result);
Now, to address the issue of a corresponding href value for each term, you can use a dictionary and change the regex to use a MatchEvaluator that will return the custom format and look up the value from the dictionary. The dictionary also ignores case by passing in StringComparer.OrdinalIgnoreCase. I tweaked the pattern slightly by adding ?: at the start of the group to make it a non-capturing group since I am no longer referring to the captured item as I did in the first example.
var terms = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
{
{ "Learn more about adding manuals and labels", "2.html" },
{ "Delete Manuals and Labels", "3.html" },
{ "manuals and labels", "1.html" }
};
var pattern = #"\b(?:" + String.Join("|", terms.Select(t => t.Key)) + #")\b";
var result = Regex.Replace(input, pattern,
m => String.Format(#"{1}", terms[m.Value], m.Value),
RegexOptions.IgnoreCase);
Console.WriteLine(result);
I would use an ordered dictionary like this, making sure the smallest term is last:
using System;
using System.Text.RegularExpressions;
using System.Collections.Specialized;
public class Test
{
public static void Main()
{
OrderedDictionary Links = new OrderedDictionary();
Links.Add("Learn more about adding manuals and labels", "2");
Links.Add("Delete Manuals and Labels", "3");
Links.Add("manuals and labels", "1");
string text = "Having trouble with your manuals and labels? Learn more about adding manuals and labels. Need to get rid of them? Try to delete manuals and labels.";
foreach (string termToFind in Links.Keys)
{
Regex _regex = new Regex(#"\b" + termToFind + #"s?\b(?![^<>]*</)", RegexOptions.IgnoreCase);
text = _regex.Replace(text, #"$&");
}
Console.WriteLine(text);
}
}
ideone demo
The negative lookahead ((?![^<>]*</)) I added prevents the replace of a part you already replaced before which is between anchor tags.
First, you can prevent your Regex for manuals and labels from finding Learn more about manuals and labels by using a lookbehind. Modified your regex looks like this:
(?<!Learn more about )(manuals and labels)
But for your specific request i would suggest a different solution. You should define a rule or priority list for your regexs or both. A possible rule could be "always search for the regex first that matches the most characters". This however requires that your regexs are always fixed length. And it does not prevent one regex from consuming and replacing characters that would have been matched by a different regex (maybe even of the same size).
Of course you will need to add an additional lookbehind and lookahead to each of your regexs to prevent replacing strings that are inside of your replacing elements

How to find « («) in a string

I need to find &#171 in a string with a regex. How would you add that into the following:
String RegExPattern = #"^[0-9a-df-su-z]+\.\s&#171";
Regex PatternRegex = new Regex(RegExPattern);
return (PatternRegex.Match(Source).Value);
You should be able to simply use it directly:
var pattern = new Regex("&#171");
Of course, if used alone you can also use String.IndexOf instead. If you want to use it in another pattern, as in your question, go ahead. The usage is correct.
If, on the other hand, you also want to allow the named entity, use an alternation:
var pattern = new Regex("(?:&#171|«)");
Once again, the same can be done in a more complex expression. The ?: at the beginning of the group isn’t necessary; it just prevents that a capture group will be created for this alternation.

Is it possible to negate a regular expression search?

I'm building a lexical analysis engine in c#. For the most part it is done and works quite well. One of the features of my lexer is that it allows any user to input their own regular expressions. This allows the engine to lex all sort of fun and interesting things and output a tokenised file.
One of the issues im having is I want the user to have everything contained in this tokenised file. I.E the parts they are looking for and the parts they are not (Partial Highlighting would be a good example of this).
Based on the way my lexer highlights I found the best way to do this would be to negate the regular expressions given by the user.
So if the user wanted to lex a string for every occurrence of "T" the negated version would find everything except "T".
Now the above is easy to do but what if a user supplies 8 different expressions of a complex nature, is there a way to put all these expressions into one and negate the lot?
You could combine several RegEx's into 1 by using (pattern1)|(pattern1)|...
To negate it you just check for !IsMatch
var matches = Regex.Matches("aa bb cc dd", #"(?<token>a{2})|(?<token>d{2})");
would return in fact 2 tokens (note that I've used the same name twice.. that's ok)
Also explore Regex.Split. For instance:
var split = Regex.Split("aa bb cc dd", #"(?<token>aa bb)|(?:\s+)");
returns the words as tokens, except for "aa bb" which is returned as one token because I defined it as so with (?...).
You can also use the Index and Length properties to calculate the middle parts that have not been recognized by the Regex:
var matches = Regex.Matches("aa bb cc dd", #"(?<token>a{2})|(?<token>d{2})");
for (int i = 0; i < matches.Count; i++)
{
var group = matches[i].Groups["token"];
Console.WriteLine("Token={0}, Index={1}, Length={2}", group.Value, group.Index, group.Length);
}

Categories