Parsing a regex - c#

I am having trouble writing a regular expression in C#; its purpose is to extract all words that start with '#' from a given string so they can be stored in some type of data structure.
If the string is "The quick #brown fox jumps over the lazy #dog", I'd like to get an array that contains two elements: brown and dog. It needs to handle the edge cases properly. For example, if it's ##brown, it should still produce 'brown' not '#brown'.

Something like this, in C#:
string quick = "The quick #brown fox jumps over the lazy #dog ##dog";
MatchCollection results = Regex.Matches(quick, "#\\w+");
foreach (Match m in results)
{
    // Strip the leading '#' from each match before appending it.
    Literal1.Text += m.Value.Replace("#", "");
}
This takes care of your edge case too (##dog => dog).

#[\w\d]+ should work for you.
Tested using http://www.regextester.com/.
This works by matching the #, followed by one or more word characters. \w matches any "word character" (letters, digits, and the underscore), \d matches any digit, and + means one or more repetitions. Wrapping \w and \d in square brackets allows either of them at each position; since \w already includes digits, #\w+ behaves the same way.
To exclude the # you could use str.Substring(1) to ignore the first character, or use the regex #([\w\d]+) and extract the first group.
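The capture-group variant could look roughly like the sketch below. The class and method names (HashWordExtractor, Extract) are made up for illustration, and the pattern is the simplified #(\w+) form rather than anything the answer prescribes.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HashWordExtractor
{
    // Returns the text that follows each '#', without the '#' itself.
    // Hypothetical helper; the pattern uses group 1 to skip the marker.
    public static List<string> Extract(string input)
    {
        var words = new List<string>();
        foreach (Match m in Regex.Matches(input, @"#(\w+)"))
        {
            // Groups[0] is the whole match ("#brown"); Groups[1] is "brown".
            words.Add(m.Groups[1].Value);
        }
        return words;
    }
}
```

Calling HashWordExtractor.Extract on the sample sentence should return brown and dog, and ##dog yields dog because the match can only start at the # that is directly followed by a word character.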

Depending on your definition of "word" (\w is closer to the C-language definition of a character valid in an identifier or keyword: [A-Za-z0-9_]), you might try the following. I'm defining "word" here as a sequence of non-whitespace characters:
(^|\s)(#+(?<atword>[^\s]+))(\s|$)
The above has been tested here, and matches the following:
Match start-of-string or a whitespace character, followed by
1 or more # characters, followed by
1 or more non-whitespace characters, in group named 'atword', followed by
a whitespace character or end-of-string.
For successful matches, the named group atword will contain the text following the lead-in # sign(s).
So:
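This ## foo is a corner case: #+ backtracks to a single # and atword captures the second #, so filter out such results if you don't want them.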
This #foo bar will match
###foobarbat is kind of silly will match.
###foobar#bazabat will match.
silly.#rabbit, tricks are for kids won't match, but
silly #rabbit, tricks are for kids will match and you'll get "rabbit," (with the trailing comma) rather than "rabbit". Like I said, you need to think about how you define "word".
etc.
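As a rough illustration of reading the named group, here is a minimal sketch that applies the pattern above to a single input. AtWordFinder and FirstAtWord are hypothetical names, and returning null for "no match" is just one possible convention.

```csharp
using System.Text.RegularExpressions;

public static class AtWordFinder
{
    // Pattern from the answer above; "atword" holds the text after the leading '#' run.
    private static readonly Regex Pattern =
        new Regex(@"(^|\s)(#+(?<atword>[^\s]+))(\s|$)");

    // Returns the first tagged word in the input, or null when nothing matches.
    public static string FirstAtWord(string input)
    {
        Match m = Pattern.Match(input);
        return m.Success ? m.Groups["atword"].Value : null;
    }
}
```

One thing to keep in mind with this pattern: it consumes the whitespace after the word, so with Regex.Matches a second tagged word that immediately follows the first (as in "#a #b") is not found. Lookarounds such as (?<=^|\s) and (?=\s|$) are one way around that.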

Related

Cannot match parentheses in regex group

This is a regular expression, evaluated in .NET
I have the following input:
${guid->newguid()}
And I want to produce two matching groups, a character sequence after the ${ and before }, which are split by -> :
guid
newguid()
The pattern I am using is the following:
([^(?<=\${)(.*?)(?=})->]+)
But this doesn't match the parentheses, I am getting only the following matches:
guid
newguid
How can I modify the regex so I get the desired groups?
Your regex - ([^(?<=\${)(.*?)(?=})->]+) - matches 1+ characters other than those defined in the negated character class (that is, 1 or more chars other than (, ?, <, etc).
I suggest using a matching regex like this:
\${([^}]*?)->([^}]*)}
See the regex demo
The results you need are in match.Groups[1] and match.Groups[2].
Pattern details:
\${ - match ${ literal character sequence
([^}]*?) - Group 1 capturing 0+ chars other than } as few as possible
-> - a literal char sequence ->
([^}]*) - Group 2 capturing 0+ chars other than } as many as possible
} - a literal }.
If you know that you only have word chars inside, you may simplify the regex to a mere
\${(\w+)->(\w+\(\))}
See the regex demo. However, it is much less generic.
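To show how the two groups are read in .NET, here is a small sketch built on the generic pattern above (written with the braces escaped, which is equivalent). PlaceholderParser and Parse are illustrative names only.

```csharp
using System.Text.RegularExpressions;

public static class PlaceholderParser
{
    // Splits "${left->right}" into its two parts.
    // Returns an empty array when the input does not contain such a placeholder.
    public static string[] Parse(string input)
    {
        Match m = Regex.Match(input, @"\$\{([^}]*?)->([^}]*)\}");
        if (!m.Success)
        {
            return new string[0];
        }
        // Group 1 is the text before "->", group 2 the text after it.
        return new[] { m.Groups[1].Value, m.Groups[2].Value };
    }
}
```

For the input ${guid->newguid()} this should give guid and newguid(), parentheses included, because nothing in either group excludes them.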
Is your input structure always ${identifier->identifier()}? If so, you can use ^\$\{([^-]+)->([^}]+)\}$.
Otherwise, you can modify your regex to ([^?<=\${.*??=}\->]+): with it you should match the input and get the desired groups, guid and newguid(). The key change is escaping the - character: unescaped, it is interpreted as a range operator inside the character class. The parentheses are also removed from the class, because [^......(....)....] excludes parentheses from the match.
I hope that helps!
EDIT: testing with https://regex101.com helped me a lot; it showed me that - was being treated as a range operator.

C# Regular Expression: Search the first 3 letters of each name

Does anyone know how to write a regex (C#) that searches by the first 3 letters of each part of a full name,
without using (.*)?
I used (.*), but it runs far past the requested name:
if it finds the first condition and then finds the second condition 100 words later, it returns text that is not what I am looking for, so I have to limit the number of words.
Example: \s*(?:\s+\S+){0,2}\s*
I would like to ignore name parts with fewer than 3 characters if they exist in the name.
Search for any name whose parts start with these 3-character prefixes:
'Mar Jac Rey' (the regex that performs the search)
Should match:
Marck Jacobs L. S. Reynolds
Marcus Jacobine Reys
Maroon Jacqueline by Reyils
Can anyone help me?
The zero-or-more quantifier (*) is 'greedy' by default: it consumes as many characters as possible while still allowing the remainder of the pattern to match. This is why Mar.*Jac will match from the first Mar in the input to the last Jac, including everything in between.
One potential solution is just to make your pattern 'non-greedy' (*?). This will make it consume as few characters as possible in order to match the remainder of the pattern.
Mar.*?Jac.*?Rey
However, this is not a great solution because it would still match the various name parts regardless of what other text appears in between, e.g. Marcus Jacobine Should Not Match Reys would be a valid match.
To allow only whitespace or at most 2 consecutive non-whitespace characters to appear between each name part, you'd have to get a bit fancier:
\bMar\w*(\s+\S{0,2})*\s+Jac\w*(\s+\S{0,2})*\s+Rey\w*
The pattern (\s+\S{0,2})*\s+ will match any number of whitespace-separated tokens of at most two non-whitespace characters each, followed by whitespace. The \w* after each name part ensures that the entire name is included in that part of the match (you might want to use \S* instead here, but that's not entirely clear from your question). And I threw in a word boundary (\b) at the beginning to ensure that the match does not start in the middle of a 'word' (e.g. OMar would not match).
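A quick way to check this pattern against the sample names from the question might look like the sketch below. NamePrefixMatcher and IsNameMatch are hypothetical names, and the pattern is hard-coded for the Mar/Jac/Rey prefixes.

```csharp
using System.Text.RegularExpressions;

public static class NamePrefixMatcher
{
    // Pattern from the answer above: three name parts, separated by whitespace
    // and optionally by short tokens of at most two non-whitespace characters.
    private static readonly Regex Pattern = new Regex(
        @"\bMar\w*(\s+\S{0,2})*\s+Jac\w*(\s+\S{0,2})*\s+Rey\w*");

    // True when the text contains the three name parts in order.
    public static bool IsNameMatch(string text)
    {
        return Pattern.IsMatch(text);
    }
}
```

With this, the three names from the question should match, while Marcus Jacobine Should Not Match Reys should not, since Should is longer than two characters.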
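I think what you want is this regular expression, which checks whether the text starts with one of the three prefixes (use RegexOptions.IgnoreCase to make it case insensitive). Note that the alternatives go in a group; square brackets would define a character class instead:
@"^(Mar|Jac|Rey)"
Less specific:
@"^\w{3}"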
If you want to capture the first three letters of every word longer than three characters, you could use something like:
((?<name>[\w]{3})\w+)+
And enable ExplicitCapture when initializing your Regex.
It will return a series of matches, each with a group named "name" that holds one result.
Code sample:
Regex regex = new Regex(@"((?<name>[\w]{3})\w+)+", RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);
var match = regex.Matches("Marck Jacobs L. S. Reynolds");
If you also want to capture 3-character words, you can replace the last "\w" with a space. In that case, remember to handle the last word of the phrase.
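Reading the results out of the match collection could be sketched as follows. NamePrefixes and FirstThreeLetters are made-up names; the pattern and options are the ones from the code sample above.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NamePrefixes
{
    // Collects the "name" group (first three word characters) of every match.
    public static List<string> FirstThreeLetters(string fullName)
    {
        var regex = new Regex(@"((?<name>[\w]{3})\w+)+",
            RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);

        var prefixes = new List<string>();
        foreach (Match m in regex.Matches(fullName))
        {
            // With ExplicitCapture only the named group is captured.
            prefixes.Add(m.Groups["name"].Value);
        }
        return prefixes;
    }
}
```

For Marck Jacobs L. S. Reynolds this should produce Mar, Jac and Rey; the single-letter initials are skipped because the pattern needs at least four word characters in a row.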

Odd regexp behaviour - matches only first and last capture group

I am trying to write a regexp which would match a comma separated list of words and capture all words. This line should be matched: "   apple , banana ,orange,peanut  " and the captures should be apple, banana, orange, peanut. To do that I use the following regexp:
^\s*([a-z_]\w*)(?:\s*,\s*([a-z_]\w*))*\s*$
It successfully matches the string but all of a sudden only apple and peanut are captured. This behaviour is seen in both C# and Perl. Thus I assume I am missing something about how regexp matching works. Any ideas? :)
The value given by match.Groups[2].Value is just the last value captured by the second group.
To find all the values, look at match.Groups[2].Captures[i].Value where in this case i ranges from 0 to 2. (As well as match.Groups[1].Value for the first group.)
(+1 for question, I learned something today!)
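Walking the Captures collection, as described in the answer above, might look like this sketch. CsvWordCaptures and Words are illustrative names, and the pattern is the one from the question.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CsvWordCaptures
{
    // Returns every word of a comma-separated list, or an empty list
    // when the line does not have the expected shape.
    public static List<string> Words(string line)
    {
        var words = new List<string>();
        Match m = Regex.Match(line, @"^\s*([a-z_]\w*)(?:\s*,\s*([a-z_]\w*))*\s*$");
        if (!m.Success)
        {
            return words;
        }

        // Group 1 matches once: the first word.
        words.Add(m.Groups[1].Value);

        // Group 2 is repeated; .Value would only give the last repetition,
        // while Captures keeps every one of them in order.
        foreach (Capture c in m.Groups[2].Captures)
        {
            words.Add(c.Value);
        }
        return words;
    }
}
```

Note that the Captures collection is a .NET feature; in Perl, as far as this thread suggests, only the last repetition of a group is available.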
Try this:
string text = " apple , banana ,orange,peanut";
// Requires: using System.Linq; using System.Text.RegularExpressions;
var matches = Regex.Matches(text, @"\s*(?<word>\w+)\s*,?")
    .Cast<Match>()
    .Select(x => x.Groups["word"].Value)
    .ToList();
You are repeating your capturing group; at every repeated match the previous content is overwritten. So only the last match of your second capturing group is available at the end.
You can change your second capturing group to
^\s*([a-z_]\w*)((?:\s*,\s*(?:[a-z_]\w*))*)\s*$
Then the result would be " , banana ,orange,peanut" in your second group. I am not sure if you want this.
If you want to check that the string has that pattern and extract each word, I would do it in two steps:
1. Check the pattern with your regex.
2. If the pattern is correct, remove leading and trailing whitespace and split on \s*,\s*.
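The two steps could be sketched like this. ListSplitter and Split are hypothetical names, and the validation pattern is the one from the question.

```csharp
using System.Text.RegularExpressions;

public static class ListSplitter
{
    // Step 1: validate the whole line. Step 2: trim it and split on commas
    // with optional surrounding whitespace.
    // Returns an empty array when the line fails validation.
    public static string[] Split(string line)
    {
        bool valid = Regex.IsMatch(
            line, @"^\s*([a-z_]\w*)(?:\s*,\s*([a-z_]\w*))*\s*$");
        if (!valid)
        {
            return new string[0];
        }
        return Regex.Split(line.Trim(), @"\s*,\s*");
    }
}
```

This keeps the validation regex free of any need to capture repeated groups, which is the part that behaves differently between regex engines.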
Simple regexp:
(?:^| *)(.+?)(?:,|$)
Explanation:
(?:    # Non-capturing group
^| *   # Match start of line or any number of spaces
(.+?)  # Capture the word in the list, lazily
(?:    # Non-capturing group
,|$    # Match comma or end of line
Note: Rubular is a nice website for testing this kind of thing.

Problem with regex, how do I get all with \S up until a special character?

I've got the text:
192.168.20.31 Url=/flash/56553550_hi.mp4?token=(uniquePlayerReference=81781956||videoId=1)
And I'm trying to get the uniquePlayerReference and the videoId.
I've tried this regular expression:
(?<=uniquePlayerReference=)\S*
but it matches:
81781956||videoId=1)
And then I try and get the video id with this:
(?<=videoId=)\S*
But it matches the ) after the videoId.
My question is two fold:
1) How do I use \S and get it to stop at a given character? (Essentially, what is the regex to do what I want?) I can't get it to stop at a defined character; I think I need to use a positive lookahead to match, but not include, the double pipe.
2) When should I use brackets?
The problem is the multiplicity operator you have here, the *, which means "as many as possible". If you have an explicit number in mind you can use the operator {a,b}, where a is the minimum and b the maximum number of matches, but if the number is unknown, \S is too generic to stop where you want.
As for brackets, if you mean (), you use them to capture part of a match for backreferencing. It is a bit complicated; I think you need to look at a reference for that.
I think you want something like this:
/uniquePlayerReference=(\d+)\|\|videoId=(\d+)/i
and then backreference to \1 and \2 respectively.
Given that both id's are numeric you are probably better off using \d instead of \S. \d only matches numeric digits whereas \S matches any non-whitespace character.
What you might also do is a non-greedy match up to the character you do not want to match, like so:
uniquePlayerReference=(.*?)\|\|videoId=(.*?)\)
Note that I have escaped both the | and ) characters because otherwise they would have a special meaning inside a regex.
In C# you would use this like so (which also answers your question about what the brackets are for: they capture parts of the matched result):
Regex regex = new Regex(@"uniquePlayerReference=(.*?)\|\|videoId=(.*?)\)");
Match match = regex.Match(
    "192.168.20.31 Url=/flash/56553550_hi.mp4?token=(uniquePlayerReference=81781956||videoId=1)");
if (match.Success)
{
    // Group 1 is the player reference, group 2 the video id.
    string playerReference = match.Groups[1].Value;
    string videoId = match.Groups[2].Value;
    // Etc.
}
If the ID isn't just digits then you could use [^|] instead of \S, i.e.
(?<=uniquePlayerReference=)[^|]*
Then, for the video ID, you can use
(?<=videoId=)[^)]*
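Applied in C#, the two lookbehind patterns might be used as in this sketch. TokenFields and its method names are invented for the example; an empty string is returned when a field is missing, because Match.Value is empty for a failed match.

```csharp
using System.Text.RegularExpressions;

public static class TokenFields
{
    // Everything after "uniquePlayerReference=" up to (not including) the first '|'.
    public static string PlayerReference(string line)
    {
        return Regex.Match(line, @"(?<=uniquePlayerReference=)[^|]*").Value;
    }

    // Everything after "videoId=" up to (not including) the first ')'.
    public static string VideoId(string line)
    {
        return Regex.Match(line, @"(?<=videoId=)[^)]*").Value;
    }
}
```

The negated character classes do the job that \S could not: they name the one character at which the match has to stop.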
The \S means it matches any non-whitespace character, including the closing parenthesis. So if you had to use \S, you would have to explicitly say stop at the closing parenthesis, like this:
videoId=(\S+)\)
Therefore, you are better off using \d, since what you are looking for is numeric:
uniquePlayerReference=(\d+)
videoId=(\d+)

Regex for string with spaces and special characters - C#

I have been using Regex to match strings embedded in square brackets [*] as:
new Regex(#"\[(?<name>\S+)\]", RegexOptions.IgnoreCase);
I also need to match some codes that look like:
[TESTTABLE: A, B, C, D]
It contains spaces, commas, and a colon.
Can you please guide me on how to modify my Regex above to include such codes?
P.S. Other codes have no spaces or special characters but are always enclosed in [...].
Regex myregex = new Regex(@"\[([^\]]*)]");
will match all characters that are not closing brackets and that are enclosed between brackets. Capture group 1 will contain the content between brackets.
Explanation (courtesy of RegexBuddy):
Match the character “[” literally «\[»
Match the regular expression below and capture its match into backreference number 1 «([^\]]*)»
Match any character that is NOT a ] character «[^\]]*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character “]” literally «]»
This will also work if you have more than one pair of matching brackets in the string you're looking at. It will not work if brackets can be nested, e.g. [Blah [Blah] Blah].
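To illustrate the multiple-pairs case, here is a short sketch that collects every bracketed code in a string with the pattern \[([^\]]*)]. BracketCodes and All are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BracketCodes
{
    // Returns the content of every [...] pair found in the text, in order.
    public static List<string> All(string text)
    {
        var codes = new List<string>();
        foreach (Match m in Regex.Matches(text, @"\[([^\]]*)]"))
        {
            // Group 1 is the text between the brackets.
            codes.Add(m.Groups[1].Value);
        }
        return codes;
    }
}
```

Given "[ABC] and [TESTTABLE: A, B, C, D]" this should return ABC and TESTTABLE: A, B, C, D, spaces and punctuation included.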
/\[([^\]:]*)(?::([^\]]*))?\]/
Capture group 1 will contain the entire tag if it doesn't have a colon, or the part before the colon if it does.
Capture group 2 will contain the part after the colon. You can then split on ',' and trim each entry to get the individual parts.
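Splitting a tag into its name and parts could be sketched as below. TagParser and Parse are made-up names, and the pattern keeps the * inside the first group so that group 1 holds the whole name rather than only its last character.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagParser
{
    // Parses "[NAME]" or "[NAME: a, b, c]" into the name and the trimmed parts.
    // Returns an empty name and an empty list when the input has no tag.
    public static (string Name, List<string> Parts) Parse(string tag)
    {
        Match m = Regex.Match(tag, @"\[([^\]:]*)(?::([^\]]*))?\]");
        if (!m.Success)
        {
            return (string.Empty, new List<string>());
        }

        string name = m.Groups[1].Value;

        // Group 2 only participates when the tag contains a colon.
        List<string> parts = m.Groups[2].Success
            ? m.Groups[2].Value.Split(',').Select(p => p.Trim()).ToList()
            : new List<string>();

        return (name, parts);
    }
}
```

For [TESTTABLE: A, B, C, D] this should give the name TESTTABLE and the parts A, B, C, D; a plain [ABC] gives the name ABC and no parts.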
