I'm building a lexical analysis engine in C#. For the most part it is done and works quite well. One of its features is that any user can supply their own regular expressions, which lets the engine lex all sorts of fun and interesting things and output a tokenised file.
One issue I'm having is that I want the tokenised file to contain everything, i.e. both the parts the user is looking for and the parts they are not (partial highlighting would be a good example of this).
Based on the way my lexer highlights I found the best way to do this would be to negate the regular expressions given by the user.
So if the user wanted to lex a string for every occurrence of "T" the negated version would find everything except "T".
Now the above is easy to do but what if a user supplies 8 different expressions of a complex nature, is there a way to put all these expressions into one and negate the lot?
You could combine several regexes into one by using (pattern1)|(pattern2)|...
To negate it, you just check for !IsMatch.
var matches = Regex.Matches("aa bb cc dd", @"(?<token>a{2})|(?<token>d{2})");
would in fact return 2 tokens (note that I've used the same group name twice; that's OK).
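A minimal sketch of combining several user-supplied patterns into one alternation. The class and method names (PatternCombiner, Combine) are made up for illustration. It assumes the user patterns do not rely on numbered backreferences, because group numbers shift once patterns are joined.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PatternCombiner
{
    // Wraps each user pattern in a non-capturing group so that any
    // alternation inside it stays local, then joins the groups with '|'.
    public static Regex Combine(IEnumerable<string> userPatterns)
    {
        string combined = string.Join("|", userPatterns.Select(p => "(?:" + p + ")"));
        return new Regex(combined);
    }
}
```

For example, PatternCombiner.Combine(new[] { "a{2}", "d{2}" }) builds the pattern (?:a{2})|(?:d{2}), which finds two matches in "aa bb cc dd".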
Also explore Regex.Split. For instance:
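var split = Regex.Split("aa bb cc dd", @"(?<token>aa bb)|(?:\s+)");
returns the words as tokens, except for "aa bb", which is returned as one token because I defined it so with (?<token>...). Note that the result also contains empty strings where a match sits at the start of the input or two matches are adjacent.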
You can also use the Index and Length properties to calculate the middle parts that have not been recognized by the Regex:
var matches = Regex.Matches("aa bb cc dd", @"(?<token>a{2})|(?<token>d{2})");
for (int i = 0; i < matches.Count; i++)
{
    var group = matches[i].Groups["token"];
    Console.WriteLine("Token={0}, Index={1}, Length={2}", group.Value, group.Index, group.Length);
}
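The Index/Length idea can be taken one step further to produce the full tokenised output, recognised and unrecognised parts alike, without negating any regex. This is only a sketch; GapTokenizer and Tokenize are hypothetical names, and empty matches are simply skipped.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GapTokenizer
{
    // Returns every part of the input in order. IsToken is false for
    // the text that lies between matches (the "negated" part).
    public static List<(string Text, bool IsToken)> Tokenize(string input, Regex regex)
    {
        var parts = new List<(string Text, bool IsToken)>();
        int position = 0;
        foreach (Match match in regex.Matches(input))
        {
            if (match.Length == 0) continue; // ignore empty matches
            if (match.Index > position)
                parts.Add((input.Substring(position, match.Index - position), false));
            parts.Add((match.Value, true));
            position = match.Index + match.Length;
        }
        // Whatever is left after the last match is unrecognised text too.
        if (position < input.Length)
            parts.Add((input.Substring(position), false));
        return parts;
    }
}
```

With the pattern a{2}|d{2} and the input "aa bb cc dd" this yields three parts: "aa" (token), " bb cc " (not a token) and "dd" (token).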
I want to take an expression like
(123456789..value > 2000) && (987654321.Value < 12)
extract the 123456789 and 987654321 (could be anything here)
and replace it with
ClassName.GetInfo("%s").GetValue() (as an example)
putting the 2 values in the place of the %s...
to be a resulting
(ClassName.GetInfo("123456789").GetValue() > 2000) && (ClassName.GetInfo("987654321").GetValue() < 12)
Can anyone give me a clue as to how to accomplish this?
A rather oversimplified example, but this should work.
Note that the following will only allow alphanumeric characters, '-' or '_' in the place where you say "could be anything here". This is by necessity if you intend to be able to recognize it with any form of parser, regex or otherwise. You need to either limit the characters that can be used as an identifier, or delimit them and allow for escaping the delimiter characters.
private static void Main()
{
    Regex pattern = new Regex(@"(?<Name>[\w\-_]+)\.+(?<Value>[\w\-_]+)");
    string sample = @"(123456789..value > 2000) && (987654321.Value < 12)";
    string result = pattern.Replace(sample,
        m =>
            String.Format(
                "ClassName.GetInfo(\"{0}\").Get{1}{2}()",
                m.Groups["Name"].Value,
                Char.ToUpper(m.Groups["Value"].Value[0]),
                m.Groups["Value"].Value.Substring(1))
        );
    Console.WriteLine(result);
}
The program outputs:
(ClassName.GetInfo("123456789").GetValue() > 2000) && (ClassName.GetInfo("987654321").GetValue() < 12)
There are two other rather odd behaviors in your example that are addressed above. The first is the use of multiple delimiters '..' in your example "(123456789..value". This looks like a possible mistake; if it is, just remove the '+' from this part of the expression: ")\.+(".
The second oddity is that your example just auto-magically corrects the character-case of the first property from "value" to "Value". Although I mimic this magical behavior by ensuring the first character is upper-case this is not a great solution. A better answer would be to use a case-insensitive dictionary and lookup the proper case.
Hopefully that will get you started, but I have to be honest and say you have a VERY long road ahead of you. Parsing an expression language is never a trivial thing and should generally be avoided. If this is for internal use, just make them type in the full version. If this is for external use... well, I would rethink your objective. Perhaps building a graphical expression tree like SQL's QBE would be a better expenditure of your time and energy.
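The case-insensitive dictionary lookup suggested above could be sketched roughly as follows. The names ExpressionRewriter, Rewrite and KnownProperties, and the "lastupdated" entry, are invented for illustration; the real list of property names would come from your own model. Unknown properties are left untouched rather than guessed at.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ExpressionRewriter
{
    // Hypothetical table of property names in their proper case.
    // The comparer makes lookups ignore the case of the key.
    static readonly Dictionary<string, string> KnownProperties =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "value", "Value" },
            { "lastupdated", "LastUpdated" },
        };

    static readonly Regex Pattern = new Regex(@"(?<Name>[\w\-]+)\.+(?<Value>[\w\-]+)");

    public static string Rewrite(string expression)
    {
        return Pattern.Replace(expression, m =>
        {
            string property;
            // Leave the original text alone when the property is unknown.
            if (!KnownProperties.TryGetValue(m.Groups["Value"].Value, out property))
                return m.Value;
            return string.Format("ClassName.GetInfo(\"{0}\").Get{1}()",
                m.Groups["Name"].Value, property);
        });
    }
}
```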
I'm looking at UK postcodes and trying to work out how I can take data from a database (the first part of a UK postcode) and dynamically create a regex for it using C#. For example:
AB44-56
I know what I want as an output:
AB([4][4-9]|[5][0-6])+
However, I can't work out how to do this with logic. Perhaps I need to split the letters from the numbers first, but I can't do that using Split.
I have other combinations too - single range:
AB31 would be AB[3][1]+
Some with just letters:
BT would be BT+
Some with a single letter and 1 or two numbers:
G83 Would be G[8][3]
Any suggestions or guidance on how this might be coded would be very much appreciated.
From Wikipedia, on UK postal codes:
This can be generalised as: (one or two letters)(number between 0 and 99)(zero or one letter)(space)(single digit)(two letters)
so
^[A-Za-z]{1,2}\d{1,2}[A-Za-z]?\s\d[A-Za-z]{2}$
might work.
EDIT: Also, if you are trying to restrict the postal codes to, say, those with the same prefix as the ones in the database, you could do this:
var source = "BTasdfweasdf"; // from the database
var input = "BT1A 1BB"; // from somewhere else
// Keep the leading letters of the database value and append the rest of the postcode pattern.
var regex = Regex.Replace(source, @"(^[A-Za-z]{1,2})(.*)", @"$1\d{1,2}[A-Za-z]?\s\d[A-Za-z]{2}$");
var match = Regex.Match(input, regex);
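For database entries in the shapes given in the question (AB44-56, AB31, G83, BT), one simple option is to split the letters from the numbers with a regex and list every number in the range as an alternative, instead of deriving character classes such as [4][4-9]|[5][0-6]. This is a sketch under assumptions: PostcodePattern and Build are hypothetical names, the entry format is inferred from the examples, and the generated pattern only checks the start of a postcode.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PostcodePattern
{
    // Turns "AB44-56", "AB31", "G83" or "BT" into a regex string that
    // matches postcodes beginning with that area and district range.
    public static string Build(string entry)
    {
        Match m = Regex.Match(entry, @"^([A-Za-z]{1,2})(?:(\d{1,2})(?:-(\d{1,2}))?)?$");
        if (!m.Success)
            throw new ArgumentException("Unrecognised entry: " + entry);

        string letters = m.Groups[1].Value.ToUpperInvariant();
        if (!m.Groups[2].Success)
            return "^" + letters + @"(?=\d)"; // letters only: any district

        int from = int.Parse(m.Groups[2].Value);
        int to = m.Groups[3].Success ? int.Parse(m.Groups[3].Value) : from;
        if (to < from)
            throw new ArgumentException("Range is reversed: " + entry);

        // List each district number explicitly, e.g. 44|45|...|56.
        string numbers = string.Join("|", Enumerable.Range(from, to - from + 1));
        return "^" + letters + "(?:" + numbers + @")(?!\d)";
    }
}
```

With this, Regex.IsMatch("AB50 1XX", PostcodePattern.Build("AB44-56")) is true and the same check for "AB57 1XX" is false. Matching is case-sensitive unless RegexOptions.IgnoreCase is passed.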
I'm parsing CSS3 selectors using a regex. For example, the selector a>b,c+d is broken down into:
Selector:
a>b
c+d
SOSS:
a
b
c
d
TypeSelector:
a
b
c
d
Identifier:
a
b
c
d
Combinator:
>
+
The problem is, for example, I don't know which selector the > combinator belongs to. The Selector Group has 2 captures (as shown above), each containing 1 combinator. I want to know what that combinator is for that capture.
Groups have lists of Captures, but Captures don't have lists of Groups found in that Capture. Is there a way around this, or should I just re-parse each selector?
Edit: Each capture does give you the index of where the match occurred though... maybe I could use that information to determine what belongs to what?
So you don't think I'm insane, the syntax is actually quite simple, using my special dict class:
var flex = new FlexDict
{
    {"GOS"/*Group of Selectors*/, @"^\s*{Selector}(\s*,\s*{Selector})*\s*$"},
    {"Selector", @"{SOSS}(\s*{Combinator}\s*{SOSS})*{PseudoElement}?"},
    {"SOSS"/*Sequence of Simple Selectors*/, @"({TypeSelector}|{UniversalSelector}){SimpleSelector}*|{SimpleSelector}+"},
    {"SimpleSelector", @"{AttributeSelector}|{ClassSelector}|{IDSelector}|{PseudoSelector}"},
    {"TypeSelector", @"{Identifier}"},
    {"UniversalSelector", @"\*"},
    {"AttributeSelector", @"\[\s*{Identifier}(\s*{ComparisonOperator}\s*{AttributeValue})?\s*\]"},
    {"ClassSelector", @"\.{Identifier}"},
    {"IDSelector", @"#{Identifier}"},
    {"PseudoSelector", @":{Identifier}{PseudoArgs}?"},
    {"PseudoElement", @"::{Identifier}"},
    {"PseudoArgs", @"\([^)]*\)"},
    {"ComparisonOperator", @"[~^$*|]?="},
    {"Combinator", @"[ >+~]"},
    {"Identifier", @"-?[a-zA-Z\u00A0-\uFFFF_][a-zA-Z\u00A0-\uFFFF_0-9-]*"},
    {"AttributeValue", @"{Identifier}|{String}"},
    {"String", @""".*?(?<!\\)""|'.*?(?<!\\)'"},
};
You shouldn't write one regex to parse the whole thing. But first get the selectors and then get the combinator for each of them. (At least that's how you would parse your example, real CSS is going to be more complicated.)
Each capture does give you the index of where the match occurred though... maybe I could use that information to determine what belongs to what?
Just thinking aloud here; you could pick out each match in the Selector group, get its starting and ending indices relative to the entire match and see if the index of each combinator falls within the start and end index range. If the combinator's index falls within the range, it occurs in that selector.
I'm not sure how this would fare in terms of performance though. But I think you could make it work.
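Here is a rough sketch of that index-range idea. It uses a heavily simplified stand-in for the real grammar (identifiers are \w+, no whitespace, only the >, + and ~ combinators), and the names CombinatorLookup and Parse are made up; the point is only how each Combinator capture is assigned to the Selector capture whose span contains it.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CombinatorLookup
{
    // Simplified selector pattern, used twice so that every selector
    // in the group ends up as a capture of the same named group.
    const string SelectorPattern =
        @"(?<Selector>(?<SOSS>\w+)(?:(?<Combinator>[>+~])(?<SOSS>\w+))*)";

    static readonly Regex GroupOfSelectors =
        new Regex("^" + SelectorPattern + "(?:," + SelectorPattern + ")*$");

    public static List<(string Selector, List<string> Combinators)> Parse(string input)
    {
        var result = new List<(string Selector, List<string> Combinators)>();
        Match match = GroupOfSelectors.Match(input);
        if (!match.Success) return result;

        foreach (Capture selector in match.Groups["Selector"].Captures)
        {
            int end = selector.Index + selector.Length;
            var combinators = new List<string>();
            foreach (Capture combinator in match.Groups["Combinator"].Captures)
            {
                // A combinator belongs to the selector whose span contains it.
                if (combinator.Index >= selector.Index && combinator.Index < end)
                    combinators.Add(combinator.Value);
            }
            result.Add((selector.Value, combinators));
        }
        return result;
    }
}
```

For "a>b,c+d" this pairs "a>b" with ">" and "c+d" with "+". The nested loop is quadratic in the number of captures, which should be fine for typical selector lengths.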
I wouldn't recommend using regex for parsing anything. Except for very simple cases, parsers are almost always a better choice. Take a look at this question: "Is there a CSS parser for C#?"
I need some help extracting the following bits of information using regular expressions.
Here is my input string "C:\Yes"
******** Missing character at start of string and in between but not at the end =
a weird superscript looking L.***
I need to extract "C:\" into one string and "Yes" into another.
Thanks In Advance.
I wouldn't bother with regular expressions for that. Too much work, and I'd be too likely to screw it up.
var x = @"C:\Yes";
var root = Path.GetPathRoot(x); // => @"C:\"
var file = Path.GetFileName(x); // => "Yes"
The following regular expression returns C:\ in the first capture group and the rest in the second:
^(\w:\\)(.*)$
This is looking for: a full string (^…$) starting with a letter (\w, although [a-zA-Z] would probably be more accurate for Windows drive letters, since \w also matches digits and underscores), followed by :\. All the rest (.*) is captured in the second group.
Notice that this won’t work with UNC paths. If you’re working with paths, your best bet is not to use strings and regular expressions but rather the API found in System.IO. The classes found there already offer the functionality that you want.
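// A verbatim string (@"...") is needed so that \\ reaches the regex engine as an escaped backslash.
Regex r = new Regex(@"([A-Z]:\\)([A-Za-z]+)");
Match m = r.Match(@"C:\Yes");
// Groups[0] is the whole match; the captured parts start at index 1.
string val1 = m.Groups[1].Value; // => @"C:\"
string val2 = m.Groups[2].Value; // => "Yes"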
Imagine that users are inserting strings in several computers.
On one computer, the pattern in the configuration will extract some characters of that string, let's say positions 4 to 5.
On another computer, the extract pattern will return other characters, for instance, last 3 positions of the string.
These configurations (the Regex patterns) are different for each computer, and should be available for change by the administrator, without having to change the source code.
Some examples:
        Original_String   Return_Value
User1 - abcd78defg123     78
User2 - abcd78defg123     78g1
User3 - mm127788abcd      12
User4 - 123456pp12asd     ppsd
Can it be done with Regex?
Thanks.
Why do you want to use regex for this? What is wrong with:
string foo = s.Substring(4,2);
string bar = s.Substring(s.Length-3,3);
(you can wrap those up to do a bit of bounds-checking on the length easily enough)
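The bounds-checking wrappers mentioned above might look something like this. SafeExtract, Middle and Last are hypothetical names, and returning an empty string (rather than throwing) for out-of-range requests is just one possible choice.

```csharp
using System;

public static class SafeExtract
{
    // Returns up to 'length' characters starting at 'start', or an
    // empty string when 'start' lies outside the input.
    public static string Middle(string s, int start, int length)
    {
        if (s == null || start < 0 || length <= 0 || start >= s.Length)
            return string.Empty;
        return s.Substring(start, Math.Min(length, s.Length - start));
    }

    // Returns the last 'count' characters, or the whole string when
    // it is shorter than that.
    public static string Last(string s, int count)
    {
        if (s == null || count <= 0) return string.Empty;
        return s.Length <= count ? s : s.Substring(s.Length - count);
    }
}
```

For "abcd78defg123", SafeExtract.Middle(value, 4, 2) gives "78" and SafeExtract.Last(value, 3) gives "123".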
If you really want, you could wrap it up in a Func<string,string> to put somewhere - not sure I'd bother, though:
Func<string, string> get4and5 = s => s.Substring(4, 2);
Func<string,string> getLast3 = s => s.Substring(s.Length - 3, 3);
string value = "abcd78defg123";
string foo = getLast3(value);
string bar = get4and5(value);
If you really want to use regex:
^....(..)
And:
.*(...)$
To have a regex capture values for further use you typically use parentheses, (). Some regex flavors (POSIX basic regular expressions, for example) want them escaped as \( \); .NET uses plain ().
Example
User4 - 123456pp12asd ppsd
is the most interesting, in that you have two separate capture areas here. Is there some default rule on how to join them together, or would you want to be able to specify how to build the result?
Perhaps something like
r/......(..)...(..)/\1\2/ for ppsd
r/......(..)...(..)/\2-\1/ for sd-pp
Do you want to run a regex to get the captures and handle them yourself, or do you want to run more advanced manipulation commands?
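In .NET the pattern-plus-template idea above can be approximated with Match.Result, which expands a replacement string such as "$1$2" against a single match. The sketch below assumes each computer's configuration stores two strings, a pattern and a template; ConfiguredExtractor and Extract are invented names.

```csharp
using System.Text.RegularExpressions;

public static class ConfiguredExtractor
{
    // 'pattern' and 'template' would come from the per-computer
    // configuration. Returns null when the input does not match.
    public static string Extract(string input, string pattern, string template)
    {
        Match m = Regex.Match(input, pattern);
        // Match.Result throws on a failed match, so check Success first.
        return m.Success ? m.Result(template) : null;
    }
}
```

For example, Extract("123456pp12asd", @"^.{6}(..).{3}(..)", "$1$2") gives "ppsd", and the template "$2-$1" gives "sd-pp" for the same input and pattern.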
I'm not sure what you are hoping to get by using RegEx. RegEx is used for pattern matching. If you want to extract based on position, just use substring.
It seems to me that Regex really isn't the solution here. To return a section of a string beginning at position pos (starting at 0) and of length length, you simply call the Substring function as such:
string section = str.Substring(pos, length);
Grouping. You could match on /^.{3}(.{2})/ and then look at group $1 for example.
The question is why? Normal string handling i.e. actual substring methods are going to be faster and clearer in intent.