How to capture white spaces in a string in ANTLR? - c#

This should be simple, but I am pulling my hair out trying to solve it.
I am trying to grab a quoted string, including the whitespace (spaces and tabs) inside it, and record the value in a C# string. However, I would also like the lexer to ignore that same whitespace when it appears outside the quoted string. I have the typical WS lexer rule, but it strips the whitespace even where I want to keep it (when it's inside a quoted string). If I remove the { $channel = HIDDEN; } action from the WS rule, whitespace is no longer ignored anywhere else and I have to manually add WS between the tokens everywhere. Any help would be greatly appreciated! Here is my grammar:
program returns [KeyValuePair<string, string> kvp]
:
ident=IDENT {kvp.Key = ident.Text;}
'='
quote=quoted_ident {kvp.Value = quote.ret;}
;
quoted_ident returns [string ret]
:
'"'
(
(ident=IDENT|ident=DOUBLE) {$ret += ident.Text;}
|
ws=WS {$ret += ws.Text;}
)+
'"'
;
WS :
(
' '
|
'\t'
)
{ $channel = HIDDEN; }
;
fragment DIGIT: '0'..'9';
fragment LETTER: ('a'..'z' | 'A'..'Z');
fragment DOT:'.';
DOUBLE : ((DIGIT)+(DOT(DIGIT)+)?)|(DOT(DIGIT)+);
IDENT : (LETTER|DIGIT|DOT|':'|'\''|'/'|'\\'|'_'|'#'|';'|'?'|'-'|'#'|'$'|'%'|'^'|'&'|'*')+;
Examples:
input: ' Name = " My Name " '
Expected Value for kvp.Value ' My Name '
Actual Value 'MyName'
I want to ignore all spaces and tabs outside of the quotes, but capture them within the quotes.

Related

How to avoid the spaces between two texts while parsing using antlr

Please help me with the following case. I have a line with multiple words in it, and I need to parse each word according to some rules. Below is my example input line:
## KEYWORD = MyName MyAliasName
Below is my parsing rule sets.
rule1:
Keyword name = identifier{ $name.str;} (' '* diffName = identifierTest { $diffName.str; })?
;
identifier:
returns [string str]
@init{$str="";}:
i=Word{$str+=$i.text;} (i=(Number | Word ) {$str+=$i.text;})*
;
Keyword: SPACE* START SPACE* 'KEYWORD' SPACE* EQUAL SPACE*;
Number:DIGIT+;
Word:LETTER+;
fragment LETTER: 'A'..'Z' | 'a'..'z' | '_';
fragment DIGIT: [0-9];
fragment SPACE: ' ' | '\t';
fragment START: '##';
fragment EQUAL: '=';
The "rule1" rule defines that the MyName text is mandatory and MyAliasName is optional.
The "identifier" rule defines that a name can start only with a letter or an underscore.
The Problem
If I put exactly one space between MyName and MyAliasName, the rules above work fine. However, if there is more than one space between MyName and MyAliasName, the first identifier rule reads both texts together as MyNameMyAliasName (the spaces are removed automatically). Why? I don't know what I'm doing wrong!
Whenever the optional text is present, I have to overwrite the name with the alias name. Please help, and thanks in advance.
This grammar should solve your problem
grammar TestGrammar;
rule1:
keyword name=IDENT{ System.out.println($name.text);} ( diffName = IDENT { System.out.println($diffName.text); })?
;
keyword: START KEYWORD EQUAL;
KEYWORD : 'KEYWORD' ;
fragment LETTER: 'A'..'Z' | 'a'..'z' | '_';
fragment DIGIT: '0'..'9';
IDENT : LETTER (LETTER|DIGIT)*;
START : '##';
EQUAL : '=';
SPACE : [ \t]+ -> skip;

regex for removing characters at end of string

I would like to match, recursively, all text that ends with : or / or ; or , and remove all of these characters, along with any spaces left behind, at the end of the text.
Example:
some text : ; , /
should become:
some text
What I have tried removes only the first occurrence of any of these special characters. How can I do this recursively, so that all matching characters are deleted?
The regex I use:
find: [ ,;:/]*
replace with nothing
[ ,;:/]*$ should be what you need. This is the same as your current regex except with the $ on the end. The $ tells it that the match must happen at the end of the string.
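As a quick way to check the anchored pattern, here is a minimal sketch in Python (chosen only because the pattern is portable across regex engines; the helper name `strip_trailing` is made up for illustration). It also shows a plain-string equivalent that needs no regex at all.

```python
import re

# Same character class as in the question, anchored to the end of the string.
TRAILING = re.compile(r"[ ,;:/]*$")

def strip_trailing(text):
    """Remove every trailing space, comma, semicolon, colon, or slash.

    Hypothetical helper, not part of any library.
    """
    return TRAILING.sub("", text, count=1)

cleaned = strip_trailing("some text : ; , /")

# Plain-string alternative: rstrip takes the set of characters to drop.
cleaned_plain = "some text : ; , /".rstrip(" ,;:/")
```

Because of the `$` anchor, separators in the middle of the text (such as the space between the two words) are left alone.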
You can use C#'s TrimEnd() like so
string line = "some text : ; , / ";
char[] charsToTrim = {',', ':', ';', ' ', '/'};
string trimmedLine = line.TrimEnd(charsToTrim);

Regex to split "&" in URL parameters only if they are followed by content ending with "="

I have a dilemma that I have been trying to resolve with malformed URLs, where specific parameters can have values containing characters that conflict with parsing the URL.
if( remaining.Contains( "?" ) || remaining.Contains( "#" ) )
{
if( remaining.Contains( "?" ) )
{
Path = remaining.Substring( 0, temp = remaining.IndexOf( "?" ) );
remaining = remaining.Substring( temp + 1 );
// Re-encode for URLs
if( remaining.Contains( "?" ) )
{
remaining = URL.Substring( URL.IndexOf( "?" ) + 1 );
}
if( remaining.IndexOf("=") >= 0 )
{
string[] qsps = Regex.Split( remaining, @"[&]\b" );// Original Method: remaining.Split( '&' );
qsps.ToList().ForEach( qsp =>
{
string[] vals = qsp.Split( '=' );
if( vals.Length == 2 )
{
Parameters.Add( vals[0], vals[1] );
}
else
{
string key = (string) vals[0].Clone();
vals[0] = "";
Parameters.Add( key, String.Join( "=", vals ).Substring( 1 ) );
}
} );
}
}
I added the line "Regex.Split( remaining, @"[&]\b" );" to split only on "&" characters that are followed by a word character, which seems useful.
I am just trying to see whether there is a better approach to splitting only on the "&"s that actually separate parameters.
Example to test against (which prompted this update):
www.myURL.com/shop/product?utm_src=bm23&utm_med=email&utm_term=apparel&utm_content=02/15/2016&utm_campaign=Last Chance! Presidents' Day Sales Event: Free Shipping & More!
A working regex should only split on the &'s that yield the following:
utm_src=bm23
utm_med=email
utm_term=apparel
utm_content=02/15/2016
utm_campaign=Last Chance! Presidents' Day Sales Event: Free Shipping & More!
It should NOT count the "& More" as a match, since that section is not followed by content ending with "=".
I would suggest a regex using a look-ahead:
/&(?=[^&=]+=)/
You can see this in effect here: version1. It looks first for the & character, and then "peeks" forward to ensure that a = follows, but only if it does not contain another & or a = in between.
You can also ensure that there are no whitespace characters (like newlines, etc.) which aren't valid in URLs anyway (version 2):
&(?=[^\s&=]+=)
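To see the look-ahead at work on the example from the question, here is a minimal sketch in Python (used only for illustration, since the pattern itself is engine-neutral; `split_query` is a hypothetical helper name, and the query string is assumed to have already been separated from the path at the `?`).

```python
import re

# Split only on '&' that is followed by "key=", where the key contains
# no whitespace, '&', or '='.
PARAM_SEPARATOR = re.compile(r"&(?=[^\s&=]+=)")

def split_query(query):
    """Split a raw query string into a dict of parameters (hypothetical helper)."""
    pairs = {}
    for part in PARAM_SEPARATOR.split(query):
        # Split on the first '=' only, so values may contain '=' themselves.
        key, _, value = part.partition("=")
        pairs[key] = value
    return pairs

query = (
    "utm_src=bm23&utm_med=email&utm_term=apparel&utm_content=02/15/2016"
    "&utm_campaign=Last Chance! Presidents' Day Sales Event: Free Shipping & More!"
)
params = split_query(query)
```

The "& More!" ampersand is followed by a space, so the look-ahead fails there and the campaign value stays in one piece.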
I would use this regex:
Regex.Split(url, @"(?<=(?:=\S+?))&",
RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
If you pass in your test string as url:
www.myURL.com/shop/product?utm_src=bm23&utm_med=email&utm_term=apparel&utm_content=02/15/2016&utm_campaign=Last Chance! Presidents' Day Sales Event: Free Shipping & More!
the output should be:
www.myURL.com/shop/product?utm_src=bm23
utm_med=email
utm_term=apparel
utm_content=02/15/2016
utm_campaign=Last Chance! Presidents' Day Sales Event: Free Shipping & More!
Please note the first line of the output,
www.myURL.com/shop/product?utm_src=bm23
which still contains the path part of the URL, but it can easily be split off at the ?.
Not sure what you're trying to do, but if you want to find errant ampersands, this is a good regex for that.
&(?=[^&=]*(?:&|$))
You could either replace with a %26 or split with it.
If you split with it, just recombine and the errant ampersand will be gone.
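The replace variant can be sketched as follows in Python (for illustration only; the function name `encode_errant_ampersands` and the short test strings are assumptions, not part of the original answer).

```python
import re

# An '&' is treated as errant when no '=' appears between it and
# the next '&' or the end of the string.
ERRANT_AMPERSAND = re.compile(r"&(?=[^&=]*(?:&|$))")

def encode_errant_ampersands(query):
    """Percent-encode ampersands that do not start a key=value pair."""
    return ERRANT_AMPERSAND.sub("%26", query)

fixed = encode_errant_ampersands("utm_campaign=Free Shipping & More!")
mixed = encode_errant_ampersands("a=1&b=x & y&c=3")
```

After this step the string can be split on every remaining `&`, since only real parameter separators are left unencoded.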
(?<=[?&])([^&]*)(?=.*[&=])
Explanation:
(?<=[?&]) positive lookbehind for either '&' or '?'
([^&]*) capture as many characters as possible that aren't '&'
(?=.*[&=]) positive lookahead for either an '&' or '='
Output:
utm_src=bm23
utm_med=email
utm_term=apparel
utm_content=02/15/2016
utm_campaign=Last Chance! Presidents' Day Sales Event: Free Shipping
Demo
So to get the matches:
string str = "www.myURL.com/...";
Regex reg = new Regex("(?<=[?&])([^&]*)(?=.*[&=])");
List<string> result = reg.Matches(str).Cast<Match>().Select(m => m.Value).ToList();
Edit for the question edit:
(?<=[?&])\S.*?(?=&\S)|(?<=[?&])\S.*(?=\s)

match optional special characters

I have a question that was asked before at this link, but none of the answers there is right. I have some SQL query text and I want to get the names of all functions (the whole name, including the schema) that are created in it.
My string may look like this:
create function [SN].[FunctionName] test1 test1 ...
create function SN.FunctionName test2 test2 ...
create function functionName test3 test3 ...
and I want to get both [SN].[FunctionName] and SN.FunctionName.
I tried this regex:
create function (.*?\]\.\[.*?\])
but it matches only the first statement. How can I make those brackets optional in the regex?
This one works for me:
create function\s+\[?\w+\]?\.\[?\w+\]?
val regExp = "create function" + // required string literal
"\\s+" + // allow several spaces before the function name
"\\[?" + // '[' is a special character, so we escape it and make it optional with '?'
"\\w+" + // only letters, digits, or underscores for the schema name
"\\]?" + // optional closing bracket
"\\." + // require a dot; escape it with '\' because it is a special character
"\\[?" + // the same as before for the function name
"\\w+" +
"\\]?"
See test example: http://regexr.com/3bo0e
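A quick check of that pattern against the three sample statements, sketched in Python for illustration (the capture group around the name is an addition so that `findall` returns only the qualified name; it is not part of the answer's pattern).

```python
import re

FUNCTION_NAME = re.compile(
    r"create function\s+"      # literal prefix plus one or more spaces
    r"(\[?\w+\]?\.\[?\w+\]?)"  # schema.name, each part optionally in brackets
)

sql = """create function [SN].[FunctionName] test1 test1 ...
create function SN.FunctionName test2 test2 ...
create function functionName test3 test3 ..."""

# One entry per statement whose function name is schema-qualified.
names = FUNCTION_NAME.findall(sql)
```

The third statement is skipped because the pattern requires a dot between schema and function name.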
You can use lookarounds:
(?<=create function )(\s*\S+\..*?)(?=\s)
Demo on regex101.com
It captures everything between create function literal followed by one or more spaces and another space assuming the matched string contains at least one dot char.
To make some subpattern optional, you need to use the ? quantifier that matches 1 or 0 occurrences of the preceding subpattern.
In your case, you can use
create[ ]function[ ](?<name>\[?[^\]\s.]*\]?\.\[?[^\]\s.]*\]?)
                              ^           ^    ^           ^
The regex matches a string starting with create function and then matching:
var rx = new Regex(@"create[ ]function[ ]
(?<name>\[? # optional opening square bracket
[^\]\s.]* # 0 or more characters other than `.`, whitespace, or `]`
\]? # optional closing square bracket
\. # a literal `.`
\[? # optional opening square bracket
[^\]\s.]* # 0 or more characters other than `.`, whitespace, or `]`
\]? # optional closing square bracket
)", RegexOptions.IgnorePatternWhitespace);
See demo

Replacing a string with strings that include parenthesis issue

I am currently having a problem with Regex.Replace. I have an item in a CheckedListBox that contains a string with parentheses "()":
regx2[4] = new Regex( "->" + checkedListBox1.SelectedItem.ToString());
An example sentence inside the selected item is
hello how are you (today)
I use it in a regex like this:
if (e.NewValue == CheckState.Checked)
{
//replaces the string without parenthesis with the one with parenthesis
//ex:<reason1> ----> hello, how are you (today) (worked fine)
richTextBox1.Text = regx2[selected].Replace(richTextBox1.Text,"->"+checkedListBox1.Items[selected].ToString());
}
else if (e.NewValue == CheckState.Unchecked)
{
//replaces the string with parenthesis with the one without parenthesis
//hello, how are you (today)----><reason1> (problem)
richTextBox1.Text = regx2[4].Replace(richTextBox1.Text, "<reason" + (selected + 1).ToString() + ">");
}
It is able to replace the string in the first case but unable to replace the sentence back in the second, because it contains parentheses "()". Do you know how to solve this problem?
Thanks for the response :)
Instead of:
regx2[4] = new Regex( "->" + checkedListBox1.SelectedItem.ToString());
Try:
regx2[4] = new Regex(Regex.Escape("->" + checkedListBox1.SelectedItem));
To use any of the special characters as a literal in a regex, you need to escape them with a backslash. If you want to match 1+1=2, the correct regex is 1\+1=2. Otherwise, the plus sign has a special meaning.
http://www.regular-expressions.info/characters.html
special characters:
backslash \,
caret ^,
dollar sign $,
period or dot .,
vertical bar or pipe symbol |,
question mark ?,
asterisk or star *,
plus sign +,
opening parenthesis (,
closing parenthesis ),
opening square bracket [,
opening curly brace {
To fix it you could probably do this:
regx2[4] = new Regex("->" + checkedListBox1.SelectedItem.ToString().Replace("(", @"\(").Replace(")", @"\)"));
But I would just use string.Replace() since you aren't doing any parsing. I can't tell what you're transforming from/to and why you use selected as an index on the regex array in the if and 4 as the index in the else.
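The effect of escaping can be reproduced in a few lines. The sketch below uses Python, whose `re.escape` plays the same role as .NET's `Regex.Escape`; the variable names and the `<reason1>` placeholder are taken from or modeled on the question and are otherwise assumptions.

```python
import re

item = "hello how are you (today)"
text = "->hello how are you (today)"

# Unescaped, "(today)" is a capturing group that matches "today"
# without the parentheses, so the literal text is not found.
unescaped = re.search("->" + item, text)

# Escaped, every metacharacter in the item is matched literally.
escaped = re.compile(re.escape("->" + item))
restored = escaped.sub("<reason1>", text)

# Plain string replacement gives the same result without any regex.
restored_plain = text.replace("->" + item, "<reason1>")
```

Here `unescaped` is `None`, while both the escaped pattern and the plain replacement swap the whole sentence for the placeholder.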
