Regex which matches URN by rfc8141 - c#

I am struggling to find a Regex which could match a URN as described in rfc8141.
I have tried this one:
\A(?i:urn:(?!urn:)(?<nid>[a-z0-9][a-z0-9-]{1,31}):(?<nss>(?:[a-z0-9()+,-.:=@;$_!*']|%[0-9a-f]{2})+))\z
but this one only matches the first part of the URN without the components.
For example, given the URN urn:example:a123,0%7C00~&z456/789?+abc?=xyz#12/3 we should match the following groups:
NID - example
NSS - a123,0%7C00~&z456/789 (from the last ':' until we match '?+' or '?=' or '#')
r-component - abc (from '?+' until '?=' or '#')
f-component - 12/3 (from '#' until the end)

I haven't read all the specifications, so there may be other rules to implement, but it should put you on the way for the optional components:
\A(?i:urn:(?!urn:)(?<nid>[a-z0-9][a-z0-9-]{1,31}):(?<nss>(?:[-a-z0-9()+,.:=@;$_!*'&~\/]|%[0-9a-f]{2})+)(?:\?\+(?<rcomponent>.*?))?(?:\?=(?<qcomponent>.*?))?(?:#(?<fcomponent>.*?))?)\z
Explanations:
(?<nss>(?:[-a-z0-9()+,.:=@;$_!*'&~\/]|%[0-9a-f]{2})+) : The - has been moved to the beginning of the list so that it counts as an allowed char; otherwise ,-. means "range from , to .". The characters &, ~ and / (escaped with "\") have also been added to the list, or else it won't match your example.
Optional components: (?:\?\+(?<rcomponent>.*?))? : each one sits inside an optional non-capturing group (?:)? to avoid capturing the identifier (the ?+, ?= and # part). The chars ? and + have to be escaped with "\". The component captures anything (.) but in lazy mode (*?), or else the first component found would capture everything until the end of the string.
See working example in Regex101
Hope that helps
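As a rough C# sketch of how the named groups could be read back, here is the pattern above applied to the URN from the question. It is written as top-level statements (C# 9 or later). The variable names are made up for illustration, and the NSS class is written with '@', which RFC 8141 allows in the NSS; treat that as an assumption about the intended pattern.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from the answer, split over several literals for readability.
// Assumption: the NSS character class contains '@' (allowed by RFC 8141).
var urnPattern = new Regex(
    @"\A(?i:urn:(?!urn:)(?<nid>[a-z0-9][a-z0-9-]{1,31}):" +
    @"(?<nss>(?:[-a-z0-9()+,.:=@;$_!*'&~\/]|%[0-9a-f]{2})+)" +
    @"(?:\?\+(?<rcomponent>.*?))?(?:\?=(?<qcomponent>.*?))?(?:#(?<fcomponent>.*?))?)\z");

var urn = "urn:example:a123,0%7C00~&z456/789?+abc?=xyz#12/3";
var urnMatch = urnPattern.Match(urn);

if (urnMatch.Success)
{
    foreach (var name in new[] { "nid", "nss", "rcomponent", "qcomponent", "fcomponent" })
    {
        // A component that is absent from the input gives an unsuccessful
        // group whose Value is the empty string.
        Console.WriteLine($"{name}: {urnMatch.Groups[name].Value}");
    }
}
```

For the sample URN this should report example, a123,0%7C00~&z456/789, abc, xyz and 12/3 for the five groups.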

If you want to validate a string as a Uniform Resource Name (URN) per RFC 8141, you can refer to URN8141Test.java and URN8141.java.
It has been used in our team for a few years.

Related

Use OR in Regex Expression

I have a regex to match the following:
somedomain.com/services/something
Basically I need to ensure that /services is present.
The regex I am using and which is working is:
\/services*
But I need to match /services OR /servicos. I tried the following:
(\/services|\/servicos)*
But this shows 24 matches?! https://regex101.com/r/jvB1lr/1
How to create this regex?
The (\/services|\/servicos)* matches 0+ occurrences of /services or /servicos, and that means it can match an empty string anywhere inside the input string.
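The empty-match effect is easy to reproduce in .NET. The following is a minimal sketch with a made-up input string that contains neither alternative, written as top-level statements (C# 9 or later):

```csharp
using System;
using System.Text.RegularExpressions;

// "abc" contains neither alternative, yet the starred group still succeeds:
// it produces an empty match at every position, including the end of the string.
var starred = Regex.Matches("abc", @"(/services|/servicos)*");
Console.WriteLine(starred.Count); // 4 empty matches: before a, b, c and at the end

// Without the quantifier there is nothing to match at all.
var plain = Regex.Matches("abc", @"(/services|/servicos)");
Console.WriteLine(plain.Count); // 0
```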
You can group the alternatives like /(services|servicos) and remove the * quantifier, but for this case, it is much better to use a character class [oe] as the strings only differ in 1 char.
You want to use the following pattern:
/servic[eo]s
See the regex demo
To make sure you match a whole subpart, you may append (?:/|$) at the pattern end, /servic[eo]s(?:/|$).
In C#, you may use Regex.IsMatch with the pattern to see if there is a match in a string:
var isFound = Regex.IsMatch(s, @"/servic[eo]s(?:/|$)");
Note that you do not need to escape / in a .NET regex as it is not a special regex metacharacter.
Pattern details
/ - a /
servic[eo]s - services or servicos
(?:/|$) - / or end of string.
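A small sketch of that check against a few sample strings may help. Only the first input comes from the question; the others are invented here to show what the trailing (?:/|$) rejects. It is written as top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

var servicePattern = @"/servic[eo]s(?:/|$)";

// Only the first sample comes from the question; the rest are hypothetical.
string[] samples =
{
    "somedomain.com/services/something", // /services followed by /
    "somedomain.com/servicos",           // /servicos at the end of the string
    "somedomain.com/servicesx/other",    // longer segment, should not match
    "somedomain.com/service"             // missing the final s, should not match
};

foreach (var sample in samples)
{
    Console.WriteLine($"{sample} -> {Regex.IsMatch(sample, servicePattern)}");
}
```

The first two inputs should report True and the last two False.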
Well the * quantifier means zero or more, so that is the problem. Remove that and it should work fine:
(\/services|\/servicos)
Keep in mind that in your example, you have a typo in the URL so it will correctly not match anything as it stands.
Here is an example with the typo in the URL fixed, so it shows 1 match as expected.
First off, you specify C# in this post (strictly speaking, regex support lives in the .NET library, not in the language), but the regex101 example is set to PHP. That gives you misleading information, such as the need to escape a forward slash / as \/, which is unnecessary in .NET regular expressions. The regex syntax is largely the same, but different engines behave differently, and PHP's is not the same as .NET's.
Secondly, the star * on the ( ) says that the group may match nothing at all, so you get an empty match at every position where neither word is found.
Thirdly, there is no need to spell out both whole words. I would just pull the differing character into a set [ ]. That gives you the "or-ness" you need to match either services or servicos, such as
(/servic[oe]s)
This will tell you whether either word is found. Nothing else is needed.

Cannot match parentheses in regex group

This is a regular expression, evaluated in .NET
I have the following input:
${guid->newguid()}
And I want to produce two matching groups, a character sequence after the ${ and before }, which are split by -> :
guid
newguid()
The pattern I am using is the following:
([^(?<=\${)(.*?)(?=})->]+)
But this doesn't match the parentheses, I am getting only the following matches:
guid
newguid
How can I modify the regex so I get the desired groups?
Your regex - ([^(?<=\${)(.*?)(?=})->]+) - matches 1+ characters other than those defined in the negated character class (that is, 1 or more chars other than (, ?, <, etc.).
I suggest using a matching regex like this:
\${([^}]*?)->([^}]*)}
See the regex demo
The results you need are in match.Groups[1] and match.Groups[2].
Pattern details:
\${ - match ${ literal character sequence
([^}]*?) - Group 1 capturing 0+ chars other than } as few as possible
-> - a literal char sequence ->
([^}]*) - Group 2 capturing 0+ chars other than } as many as possible
} - a literal }.
If you know that you only have word chars inside, you may simplify the regex to a mere
\${(\w+)->(\w+\(\))}
See the regex demo. However, it is much less generic.
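For reference, a minimal C# sketch of reading the two groups with the more generic pattern; the variable names are illustrative only, and the code is written as top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

var input = "${guid->newguid()}";

// Group 1: everything between "${" and "->"; group 2: everything up to the closing "}".
var arrowMatch = Regex.Match(input, @"\${([^}]*?)->([^}]*)}");

if (arrowMatch.Success)
{
    Console.WriteLine(arrowMatch.Groups[1].Value); // guid
    Console.WriteLine(arrowMatch.Groups[2].Value); // newguid()
}
```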
Is your input structure always ${identifier->identifier()}? If this is the case, you can use ^\$\{([^-]+)->([^}]+)\}$.
Otherwise, you can modify your regex to ([^?<=\${.*??=}\->]+): using this regex you should match the input and get the desired groups: guid and newguid(). The key change is the escaping of the - char, which is interpreted as a range operator when unescaped and forces you to insert parentheses in your pattern, but [^......(....)....] excludes parentheses from the match.
I hope that helps!
EDIT: testing with https://regex101.com helped me a lot, showing me that - was interpreted as a range operator.

Regex c#: Last Name in an email to include only one hyphen

Friends,
I have a textbox which takes firstname.Lastname for my organisation.
The last name may or may not include a hyphen. If it does, the hyphen should appear:
1) Only once in the last name
2) Not at the beginning of the last name
3) Not at the end of the last name
I have this regex figured out:
^(?!.{51})[a-zA-Z]+(?:[.][a-zA-Z-]+)?$
This allows "-" in the last name, but it does not enforce the above conditions.
I'm still learning regex, and it is taking time to figure this out.
Please help
-Thank you
You need to add one more nested group inside the last name part:
^(?!.{51})[a-zA-Z]+(?:\.[a-zA-Z]+(?:-[a-zA-Z]+)?)?$
                                 ^^^^^^^^^^^^^^^
See the regex demo
Details:
^ - start of string
(?!.{51}) - no more than 50 chars in the string requirement
[a-zA-Z]+ - 1+ ASCII letters
(?:\.[a-zA-Z]+(?:-[a-zA-Z]+)?)? - an optional sequence of:
    \. - a dot
    [a-zA-Z]+ - 1 or more ASCII letters
    (?:-[a-zA-Z]+)? - an optional sequence of:
        - - a hyphen
        [a-zA-Z]+ - 1+ ASCII letters
$ - end of string.
To declare this pattern, use a verbatim string literal:
var pattern = @"^(?!.{51})[a-zA-Z]+(?:\.[a-zA-Z]+(?:-[a-zA-Z]+)?)?$";
To match any Unicode letters, use \p{L} instead of [a-zA-Z].
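To see the three hyphen rules in action, here is a short sketch that runs the ASCII version of the pattern over a handful of invented names, one per rule. It is written as top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

var namePattern = new Regex(@"^(?!.{51})[a-zA-Z]+(?:\.[a-zA-Z]+(?:-[a-zA-Z]+)?)?$");

// Hypothetical sample inputs, chosen to exercise each rule.
string[] candidates =
{
    "john.smith",           // no hyphen
    "john.smith-jones",     // one hyphen inside the last name
    "john.-smith",          // hyphen at the beginning of the last name
    "john.smith-",          // hyphen at the end of the last name
    "john.smith-jones-lee"  // more than one hyphen
};

foreach (var candidate in candidates)
{
    Console.WriteLine($"{candidate}: {namePattern.IsMatch(candidate)}");
}
```

The first two names should be accepted and the last three rejected.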
This is a somewhat quick and dirty solution.
^(?!.{51})[a-zA-Z]+.{1}[a-zA-Z]{1,}-?[a-zA-Z]+$
The first part of the pattern is kept from yours. It seems to match just fine.
My addition .{1}[a-zA-Z]{1,}-?[a-zA-Z]+ simply describes what happens next. Note that .{1} matches any single character, not only a dot; write \. if only a literal dot should be accepted.
[a-zA-Z]{1,} makes sure that there is at least one character after the dot, so no "-" can be written at the beginning.
-? just says that a "-" may appear once after the first characters have been written.
[a-zA-Z]+ just says that as many characters as the user wants may follow, but not a "-" anymore.

getting the correct regex to print out in c#

Below is a regex statement I have been working on for quite some time:
Match parsedRequestData = Regex.Match(requestData, @"^.*\[(.*)\]$");
What this is supposed to do is take the email out of the line below:
2.3|[0246303@up.com]
For clarification, this email comes from a table in SQL Server. There are many emails that are formatted like this in there, and the regex is supposed to get everything from inside the brackets. However, it is matching the entirety of this line instead of what's inside of it. So my question is, is there something wrong with my regex statement or do I have something in my code I need to add?
Your regex is storing the email address in capture group 1. Try referencing group 1 like this:
parsedRequestData.Groups[1];
Code Sample:
string requestData = "2.3|[0246303@up.com]";
Match parsedRequestData = Regex.Match(requestData, @"^.*\[(.*)\]$");
if (parsedRequestData.Success)
{
    Console.WriteLine(parsedRequestData.Groups[1]);
}
Results:
0246303@up.com
Your regex is OK. All you need is to use Groups[1]:
var email = Regex.Match("2.3|[0246303@up.com]", @"^.*\[(.*)\]$").Groups[1].Value;
Regarding "However, it is matching the entirety of this line instead of what's inside of it":
Unless one uses named match captures, the match capture groups are indexed.
Match.Groups[0].Value is the whole match: all the text the pattern matched, captured or not.
Match.Groups[1..N].Value are the match captures, numbered in the order in which their ( ) sets open in the pattern. If there is only one ( ), there will be two indexed groups: 0, as mentioned above, and 1, which holds the captured text.
You only have one ( ) set, so the data you want is in match capture group 1. Group 0 has the non-captured text along with the captured data.
If one names the match capture such as (?<MyNameHere> ) one can also access the match via Match.Groups["MyNameHere"].Value.
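A brief sketch of the named-capture variant, assuming the sample line holds an ordinary address (0246303@up.com). The group name email is chosen here for illustration. It is written as top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

var line = "2.3|[0246303@up.com]";

// Same pattern as in the question, but the capture is named "email" (a hypothetical name).
var named = Regex.Match(line, @"^.*\[(?<email>.*)\]$");

if (named.Success)
{
    Console.WriteLine(named.Groups[0].Value);       // whole match: 2.3|[0246303@up.com]
    Console.WriteLine(named.Groups["email"].Value); // named capture: 0246303@up.com
}
```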
Suggestion on your pattern away from the answer
Usage of * (zero or more) in patterns can be problematic in that it can significantly increase the time the parser takes, because of backtracking into scenarios that can never match.
If one knows there is text to be found, don't tell the parser that zero items may occur when that is impossible; change it to + (one or more). That slight change can greatly affect the parsing, both in time and in number of operations.
Change ^.*\[(.*)\]$ to ^.+\[(.+)\]$.
To further increase the efficiency of the pattern, focus on the known characters [ and ] as anchors.
Pattern Restructure To Use Anchors
^[^[]+\[([^\]]+)[\s\]]+$
Why is this pattern better? Because we will look for "[" and "]" as anchors.
Let us break it down
^ - Beginning of the pattern (a hard anchor)
[^ ]+ This is set notation where the leading ^ says NOT.
[^[]+ So we want to match all text + (one or more) that is NOT a [. This tells the pattern to match up to our anchor [ in the text. Note that we don't have to escape it, because the regex parser treats most characters in a set [ ] as literals, so [^[] is valid. (To be clear, this text is matched but not captured, so it will not appear in any group above index 0; only in 0.)
\[ Our literal anchor, the "[" character.
([^\]]+) This is our match capture, which says match this set where any character is valid but not a "]". Here we have to escape the ] because otherwise it would signify the end of our set.
[\s\]]+ We know that the end of our text will have the "]" character, possibly with spaces, so let us match (but not capture) any combination of spaces and a ] before the end.
$ Our final anchor, the end of the input (or the end of a line if RegexOptions.Multiline is set).
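The breakdown above can be tried out with a short sketch like the following. The second input, with a trailing space, is invented here to show what the [\s\]]+ part tolerates, and both inputs assume the address is an ordinary one with an @ sign. It is written as top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

var anchored = new Regex(@"^[^[]+\[([^\]]+)[\s\]]+$");

// First input from the question; second one is hypothetical (note the trailing space).
var first = anchored.Match("2.3|[0246303@up.com]");
var second = anchored.Match("2.3|[0246303@up.com] ");

Console.WriteLine(first.Groups[1].Value);  // 0246303@up.com
Console.WriteLine(second.Groups[1].Value); // 0246303@up.com
```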

Improving my failing regex

My regex was working - until the form of the string it was capturing slightly changed. It used to always be of the form :
Word1 - Word2 - 01.2.3456.7890 - xx-xx - Word 3 [Word-inbracket]
Where I was interested in capturing the xx-xx.
For capturing this data, the following regex worked :
(.+\s*-\s*.+\s*-\s*.+)\s*-\s*(\w{1,3}\s*-\s*\w{1,3})\s*-\s*.+
Selecting groups[2] from it.
Now, however, the string has changed form so that sometimes there is another dash, and another set of letters between 1 and 4 characters after the xx-xx. (Remember, this only happens sometimes).
So, now I also need to capture the info where it is of the form :
Word1 - Word2 - 01.2.3456.7890 - xx-xx-XxxX - Word 3 [Word-inbracket]
Word1 - Word2 - 01.2.3456.7890 - xXX-XxX-xxxx - Word 3 [Word-inbracket]
Etc.
How can I edit my regex to capture this string in addition to the ones that were previously caught? What is the cleanest way to do this ?
A little hacky but that will do the trick:
(.+\s*-\s*.+\s*-\s*.+)\s*-\s*((\w{1,3}\s*-\s*\w{1,3})|(\w{1,4}\s*-\s*\w{1,4}))\s*-\s*.+
Based on the input lines, a simpler approach could be taken altogether.
The following regex matches both cases and should also keep working if the modified part changes again.
([^-]*-){3}\s*([^\s]+).*
The first group consumes "Word1 - Word2 - 01.2.3456.7890 -" in three repetitions (as a repeated group, its final value is only the last repetition), and the second group then captures "xx-xx-XxxX".
Also note, I'm going off the assumption that the second desired group does not contain spaces, as the example strings do not have them.
Explained:
([^-]*-){3} # matches the "word1 - word2 - word3.234.234 -" block, one dash-terminated segment per repetition
\s* # skips whitespace before the wanted part
([^\s]+) # captures the "xx-xx-xxx" block up to the first whitespace char.
.* # matches the rest of the line
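A quick C# sketch of this pattern on two of the sample lines from the question. It also prints what group 1 ends up holding, since a repeated group keeps only its last repetition in Value (all repetitions are available through Captures). It is written as top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

var segmentPattern = new Regex(@"([^-]*-){3}\s*([^\s]+).*");

var shortForm = segmentPattern.Match("Word1 - Word2 - 01.2.3456.7890 - xx-xx - Word 3 [Word-inbracket]");
var longForm = segmentPattern.Match("Word1 - Word2 - 01.2.3456.7890 - xx-xx-XxxX - Word 3 [Word-inbracket]");

Console.WriteLine(shortForm.Groups[2].Value); // xx-xx
Console.WriteLine(longForm.Groups[2].Value);  // xx-xx-XxxX

// Group 1 was repeated three times: Value is the last repetition only,
// while Captures keeps every repetition.
Console.WriteLine($"[{longForm.Groups[1].Value}]");
Console.WriteLine(longForm.Groups[1].Captures.Count); // 3
```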
I believe this should do it:
(.+?\s*-\s*.+?\s*-\s*.+?)\s*-\s*(\w{1,3}\s*-\s*\w{1,3})\s*(?:-(\w{1,4}))?\s*-\s*.+
The changes I've made are:
Made the any-character matches at the beginning non-greedy by adding '?' after them; this stops them gobbling up too much when the extra bit is present.
Added '(?:-(\w{1,4}))?', which matches the optional extra bit (1 to 4 characters, as in the question) if it is present, but without capturing the '-' prefix (the '?:' makes the outer group non-capturing).
This will give you an extra capturing group that holds the optional bit.
You can see it in action here (edited).
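As a hedged sketch of how the optional group behaves in C#: the pattern below allows up to four characters in the optional part ({1,4}), on the assumption that this is what the question's "between 1 and 4 characters" calls for. The variable names are illustrative. It is written as top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Assumption: the optional trailing part is 1 to 4 word characters, as the question describes.
var optionalPart = new Regex(
    @"(.+?\s*-\s*.+?\s*-\s*.+?)\s*-\s*(\w{1,3}\s*-\s*\w{1,3})\s*(?:-(\w{1,4}))?\s*-\s*.+");

var without = optionalPart.Match("Word1 - Word2 - 01.2.3456.7890 - xx-xx - Word 3 [Word-inbracket]");
var with = optionalPart.Match("Word1 - Word2 - 01.2.3456.7890 - xx-xx-XxxX - Word 3 [Word-inbracket]");

Console.WriteLine(without.Groups[2].Value);   // xx-xx
Console.WriteLine(without.Groups[3].Success); // False: the optional part is absent

Console.WriteLine(with.Groups[2].Value);      // xx-xx
Console.WriteLine(with.Groups[3].Value);      // XxxX
```

Checking Groups[3].Success is one way to tell the two input shapes apart before reading the extra part.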
This is clearer: .+\s-\s(.+)\s-\s.+$
