Regex extract numbers from Url (except port number) - c#

I have to extract the ID value of a product from a URL.
The URL is SEO-friendly (URL routing).
The URL can be
http://www.example.com/{param0}/{param1}/123/{param2}/{paramN}
Or
http://localhost:6847/{param0}/{param1}/123/{param2}/{paramN}
For the first URL there is no problem.
But for the second I want to extract ONLY the 123, i.e. the ID (it is an integer).
I know that if I want to extract only numbers I can use
[0-9]+
but how can I tell the regex engine to get all the numeric data from the URL except numbers that have a colon (:) before them?
I use:
((!:)[0-9]+)
but it is not correct.
Any advice is welcome :)
Thank you.

There needs to be more info on what delimits the 123 in your example.
On its face, (?<!:)[0-9]+ will find the first clump of digits NOT preceded by ':'.
Edit: Probably, for more accuracy, (?<!:\d+)[0-9]+ would be better.
Note this relies on variable-length look-behind, which .NET supports.
For fixed-length look-behind (PCRE), something like this might work: (?<![:\d])[0-9]+
Edit 2
@Sanosay - After thinking about .NET type lookbehinds, the above regex needs a slight change.
It should be (?<!:\d*)[0-9]+ . That's because in ':1234' the 1 still satisfies the (?<!:\d+) assertion, so 1234 would be matched.
Hope you figured this to be the case. I made a test case for the two regexes
@"(?<!:\d*)[0-9]+"
@"(?<![:\d])[0-9]+"
that satisfy the conditions.
The link to the ideone C# code is here: http://ideone.com/tLn2j
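Since the ideone link may not stay around, here is a minimal sketch of what such a test case could look like. The class name, method name and the path segments in the sample URLs are made up for illustration; only the two patterns come from the answer above. Note that either pattern returns the first run of digits that is not part of the port, so a digit inside an earlier path segment would be picked up first.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlIdDemo
{
    // Variable-length look-behind: relies on the .NET regex engine.
    public static readonly Regex VariableLookbehind = new Regex(@"(?<!:\d*)[0-9]+");

    // Fixed-length look-behind: also usable in engines such as PCRE.
    public static readonly Regex FixedLookbehind = new Regex(@"(?<![:\d])[0-9]+");

    // Returns the first run of digits that is not part of a ":port", or null if there is none.
    public static string FirstNonPortNumber(Regex regex, string url)
    {
        Match m = regex.Match(url);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        // Hypothetical URLs shaped like the ones in the question.
        string[] urls =
        {
            "http://www.example.com/shop/products/123/details/reviews",
            "http://localhost:6847/shop/products/123/details/reviews"
        };

        foreach (string url in urls)
        {
            // Both patterns print 123 for both URLs.
            Console.WriteLine(FirstNonPortNumber(VariableLookbehind, url));
            Console.WriteLine(FirstNonPortNumber(FixedLookbehind, url));
        }
    }
}
```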

Related

The simple regex expression [0-9]* does not work with {e|24} in C#

I've been working on my own simple Wikipedia parser in C#. So far I'm not getting very far because of this problem.
I extract the string {e|24} but it could contain any number. All I want to do is simply extract the number from this string.
This is the code I am using currently:
Match num = Regex.Match(exp.Value, "[0-9]*");
Console.WriteLine(num.Value);
However num.Value is blank.
Can someone please explain why this is not working and how I can fix it?
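You would want to use [0-9]+ to ensure at least one digit. [0-9]* is allowed to match zero digits, so it succeeds immediately at the start of the string (where the character is '{') with an empty match, which is why num.Value is blank.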
My suggestion, make the regexp: \d+
Works. Simpler. Shorter, uses no groups or ranges.
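A small sketch to see the difference for yourself. The class and method names are made up; the input string and the three patterns are the ones discussed above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StarVersusPlus
{
    // Returns the text of the first match, or an empty string when nothing matched.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        string input = "{e|24}";

        // [0-9]* succeeds at index 0 with zero digits, so the value is empty.
        Console.WriteLine("[" + FirstMatch(input, "[0-9]*") + "]");  // prints "[]"

        // [0-9]+ and \d+ need at least one digit, so the engine moves on to "24".
        Console.WriteLine("[" + FirstMatch(input, "[0-9]+") + "]");  // prints "[24]"
        Console.WriteLine("[" + FirstMatch(input, @"\d+") + "]");    // prints "[24]"
    }
}
```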

Regular Expression for UK postcodes

I have a list of post codes which should be excluded from my shipping methods.
Suppose I have to exclude Scilly Isles, Isle of Man and few others.
For the above 2 areas valid post codes are IM1-IM9, IM86, IM87, IM89. And if it is IM25 or IM85 it is invalid.
I have written the following expression, but it still matches even when the code is IM25 or IM85.
var regex = new Regex("(PO3[0-9]|PO4[0-1]|GY[1-9]|JE[1-5]|IM[1-9]|TR[1-9])");
If I pass IM85 to my expression it should return false. For IM1-IM9, IM86, IM87 and IM89 it should return true.
Same with TR post codes also. TR1-TR27 is a valid post code. If I give TR28, it should return false.
I am using '|' to separate multiple patterns. Is that the right way of including multiple patterns in one expression?
What do you expect? What should be matched and what not? And please give an example of the string you want to test.
If you match your pattern against "IM25" it will match because your pattern allows IM[1-9], so "IM2" is a valid partial match. If you want to avoid that (I am not sure what you want to achieve) and only allow the exact numbers after the letters, add a "word boundary" \b at the end and spell out exactly what you want to allow, something like this:
(PO3[0-9]|PO4[0-1]|GY[1-9]|JE[1-5]|IM([1-9]|8[679])|TR([1-9]|1[0-9]|2[0-7]))\b
See it here on Regexr
For the "IM" part this also allows 6, 7 or 9 as a second digit when there is an 8 before it, and for the "TR" part it allows 1 to 27.
Update
It is still not clear what the context of your task is. I assume you have a list of valid postcodes. It would probably be better to extract the postcode, or only the first part of it (for that you can use a regex), and then check whether it is in the list or not.
The actual validation is on the wikipedia site... Google has the answers ;) http://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom#Validation
(GIR 0AA)|(((A[BL]|B[ABDFHLNRSTX]?|C[ABFHMORTVW]|D[ADEGHLNTY]|E[HNX]?|F[KY]|G[LUY]?|H[ADGPRSUX]|I[GMPV]|JE|K[ATWY]|L[ADELNSU]?|M[EKL]?|N[EGNPRW]?|O[LX]|P[AEHLOR]|R[GHM]|S[AEGKLMNOPRSTY]?|T[ADFNQRSW]|UB|W[ADFNRSV]|YO|ZE)[1-9]?[0-9]|((E|N|NW|SE|SW|W)1|EC[1-4]|WC[12])[A-HJKMNPR-Y]|(SW|W)([2-9]|[1-9][0-9])|EC[1-9][0-9]) [0-9][ABD-HJLNP-UW-Z]{2})
I still think you need more clarification. As a huge Regex guy, I would like to point out that multi-digit ranges are better handled on the code side than on the Regex side, just for your sanity. But I personally like to play with Regex in this way. Regex reads one character at a time, so it only recognizes the digits zero through nine, not ten and not twenty-eight. Suppose you want to allow the following:
28 through 347
Then it becomes pretty complicated.
To put it into words, you want to allow:
If Two Digits, allow 2-9 for the first digit, and:
    If the first digit is a Two, then allow 8/9 for the second digit,
    ElseIf the first digit is 3-9, then allow 0-9 for the second digit.
ElseIf Three Digits, allow 1-3 for the first digit, and:
    If the first digit is a Three, then allow 0-4 for the second digit, and:
        If the second digit is a Four, then allow 0-7 for the third digit,
        ElseIf the second digit is 0-3, then allow 0-9 for the third digit.
    ElseIf the first digit is 1/2, then allow 0-9 for both the second and third digits.
Then with that, you can write a proper Regex like so, which looks for a word boundary or non-digit on either side of a two-digit or three-digit number. With this type of problem-solving, you should be able to figure out your Regex issue. Otherwise, let us know more about EXACTLY what you want to match and NOT match:
(\b|\D)((2[89]|[3-9][0-9])(\b|\D)|(3(4[0-7]|[0-3][0-9])|[12][0-9][0-9])(\b|\D))
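To illustrate the "put it on the code side" advice, here is a rough sketch that compares the regex above with a plain integer comparison. The class and method names are invented for the example, and the code-side check assumes the text contains a single run of ASCII digits.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RangeCheck
{
    // The regex from the answer above: accepts 28 through 347.
    public static readonly Regex RangeRegex = new Regex(
        @"(\b|\D)((2[89]|[3-9][0-9])(\b|\D)|(3(4[0-7]|[0-3][0-9])|[12][0-9][0-9])(\b|\D))");

    // Code-side alternative: let the regex find the digits, then compare as an integer.
    public static bool InRangeByCode(string text, int min, int max)
    {
        Match m = Regex.Match(text, "[0-9]+");
        int value;
        return m.Success
            && int.TryParse(m.Value, out value)
            && value >= min
            && value <= max;
    }

    public static void Main()
    {
        foreach (string s in new[] { "27", "28", "99", "100", "347", "348", "1000" })
        {
            // The two columns should agree for every sample.
            Console.WriteLine("{0}: regex={1} code={2}",
                s, RangeRegex.IsMatch(s), InRangeByCode(s, 28, 347));
        }
    }
}
```

The integer comparison stays readable when the bounds change, whereas the regex has to be rewritten digit by digit.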
I have changed my approach.
Instead of going for a regular expression, which was becoming more and more complex, I am saving all the excluded outward codes of UK postcodes.
If a postcode contains one of those outward codes, I exclude that postcode from the list.
Postcodes come in these formats:
XX-YYY
XXX-YYY
XXXX-YYY
In all of the above formats, the X part is the outward code of a UK postcode.
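A minimal sketch of that lookup approach, under a few assumptions: the excluded codes shown are placeholders rather than a real shipping list, the input is a full postcode, and the inward part is always the last three characters (as in the formats above). All names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class OutwardCodeFilter
{
    // Placeholder exclusion list; the real one would come from the shipping rules.
    public static readonly HashSet<string> ExcludedOutwardCodes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "IM1", "IM86", "TR21", "GY9" };

    // Strips separators and drops the three-character inward part.
    public static string OutwardCode(string postcode)
    {
        string compact = postcode.Replace(" ", "").Replace("-", "").ToUpperInvariant();
        return compact.Length > 3 ? compact.Substring(0, compact.Length - 3) : compact;
    }

    // True when the postcode's outward code is on the exclusion list.
    public static bool IsExcluded(string postcode)
    {
        return ExcludedOutwardCodes.Contains(OutwardCode(postcode));
    }

    public static void Main()
    {
        Console.WriteLine(IsExcluded("IM86 1AA")); // True
        Console.WriteLine(IsExcluded("IM85 1AA")); // False
        Console.WriteLine(IsExcluded("im1-1aa"));  // True
    }
}
```

Comparing the whole outward code, rather than testing whether the postcode merely contains it, avoids IM1 accidentally excluding IM12 as well.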

Shall this Regex do what I expect from it, that is, matching against "A1:B10,C3,D4:E1000"?

I'm currently writing a library where I wish to allow the user to be able to specify spreadsheet cell(s) under four possible alternatives:
A single cell: "A1";
Multiple contiguous cells: "A1:B10"
Multiple separate cells: "A1,B6,I60,AA2"
A mix of 2 and 3: "B2:B12,C13:C18,D4,E11000"
Then, to validate whether the input respects these formats, I intended to use a regular expression to match against. I have consulted this article on Wikipedia:
Regular Expression (Wikipedia)
And I also found this related SO question:
regex matching alpha character followed by 4 alphanumerics.
Based on the information provided within the above-linked articles, I would try with this Regex:
Default ReadOnly Property Cells(ByVal cellsAddresses As String) As ReadOnlyDictionary(Of String, ICell)
    Get
        Dim validAddresses As Regex = New Regex("A-Za-z0-9:,A-Za-z0-9")
        If (Not validAddresses.IsMatch(cellsAddresses)) Then _
            Throw New FormatException("cellsAddresses")
        ' Proceed with getting the cells from the Interop here...
    End Get
End Property
Questions
1. Is my regular expression correct? If not, please help me understand what expression I could use.
2. What exception is more likely to be the more meaningful between a FormatException and an InvalidExpressionException? I hesitate here, since it is related to the format under which the property expect the cells to be input, aside, I'm using an (regular) expression to match against.
Thank you kindly for your help and support! =)
I would try this one:
[A-Za-z]+[0-9]+([:,][A-Za-z]+[0-9]+)*
Explanation:
Between [] is a possible group of characters for a single position
[A-Za-z] means characters (letters) from 'A' to 'Z' and from 'a' to 'z'
[0-9] means characters (digits) from 0 to 9
A "+" appended to a part of a regex means: repeat that one or more times
A "*" means: repeat the previous part zero or more times.
( ) can be used to define a group
So [A-Za-z]+[0-9]+ matches one or more letters followed by one or more digits for a single cell-address.
Then that same block is repeated zero or more times, with a ',' or ':' separating the addresses.
Assuming that the column for the spreadsheet is any 1- or 2-letter value and the row is any positive number, a more complex but tighter answer still would be:
^[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?(,[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?)*$
"[A-Z]{1,2}[1-9]\d*" is the expression for a single cell reference. If you replace "[A-Z]{1,2}[1-9]\d*" in the above with <cell>, then the complex expression becomes
^<cell>(:<cell>)?(,<cell>(:<cell>)?)*$
which more clearly shows that it is a cell or a range followed by zero or more "cell or range" entries with commas in between.
The row and column indicators could be further refined to give a tighter still, yet more complex expression. I suspect that the above could be simplified with look-ahead or look-behind assertions, but I admit those are not (yet) my strong suit.
I'd go with this one, I think:
(([A-Z]+[1-9]\d*:)?[A-Z]+[1-9]\d*,)*([A-Z]+[1-9]\d*:)?[A-Z]+[1-9]\d*
This only allows capital letters as the prefix. If you want case insensitivity, use RegexOptions.IgnoreCase.
You could simplify this by replacing [A-Z]+[1-9]\d* with plain old [A-Z]\d+, but that will only allow a one-letter prefix, and it also allows stuff like A0 and B01. Up to you.
EDIT:
Having thought hard about DocMax's mention of lookarounds, and using Hans Kesting's answer as inspiration, it occurs to me that this should work:
^[A-Z]+\d+((,|(?<!:\w*):)[A-Z]+\d+)*$
Or if you want something really twisted:
^([A-Z]+\d+(,|$|(?<!:\w*):))*(?<!,|:)$
As in the previous example, replace \d+ with [1-9]\d* if you want to prevent leading zeros.
The idea behind the ,|(?<!:\w*): is that if a group is delimited by a comma, you want to let it through; but if it's a colon, it's only allowed if the previous delimiter wasn't a colon. The (,|$|...) version is madness, but it allows you to do it all with only one [A-Z]+\d+ block.
However! Even though this is shorter, and I'll admit I feel a teeny bit clever about it, I pity the poor fellow who has to come along and maintain it six months from now. It's fun from a code-golf standpoint, but I think it's best for practical purposes to go with the earlier version, which is a lot easier to read.
I think your regex is incorrect; try (([A-Za-z0-9]*)[:,]?)*
Edit: to correct the bug pointed out by Baud: (([A-Za-z0-9]*)[:,]?)*([A-Za-z0-9]+)
And finally, the best version: (([A-Za-z]+[0-9]+)[:,]?)*([A-Za-z]+[0-9]+)
// Ah OK, this probably won't work... but to answer 1: no, I don't think your regex is correct.
( ) form a group
[ ] form a character class (you can use ranges such as A-Z, a-d, 0-9, or just single characters)
? means zero or one
* means zero or more
I'd suggest reading http://www.regular-expressions.info/reference.html .
That's where I learned regexes some time ago ;)
And for building expressions I use Rad Software Regular Expression Designer.
Let's build this step by step.
If you are following an Excel addressing format, to match a single-cell entry in your CSL, you would use the regular expression:
[A-Z]{1,2}[1-9]\d*
This matches the following in sequence:
Any character in A to Z once or twice
Any digit in 1 to 9
Any digit zero or more times
The digit expression will prevent inputting a cell address with leading zeros.
To build the expression that allows for a cell address pair, repeat the expression preceded by a colon as optional.
[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?
Now allow for repeating the pattern preceded by a comma zero or more times and add start and end string delimiters.
^[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?(,[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?)*$
Kind of long and obnoxious, I admit, but after trying enough variants, I can't find a way of shortening it.
Hope this is helpful.
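For what it's worth, the repetition can at least be hidden by composing the pattern from named pieces. This is only a sketch in C# (the question's code is VB.NET, but the Regex class is the same); the class, constant and method names are made up, and the resulting pattern is the final expression built step by step above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CellAddressValidator
{
    // One cell: one or two capital letters, then a row number without leading zeros.
    private const string Cell = @"[A-Z]{1,2}[1-9]\d*";

    // A single cell or a "cell:cell" range.
    private const string CellOrRange = Cell + "(:" + Cell + ")?";

    // One or more cells/ranges separated by commas, anchored at both ends.
    public static readonly Regex Addresses =
        new Regex("^" + CellOrRange + "(," + CellOrRange + ")*$");

    public static bool IsValid(string cellsAddresses)
    {
        return Addresses.IsMatch(cellsAddresses);
    }

    public static void Main()
    {
        Console.WriteLine(IsValid("A1:B10,C3,D4:E1000")); // True
        Console.WriteLine(IsValid("A1,B6,I60,AA2"));      // True
        Console.WriteLine(IsValid("A1:B2:C3"));           // False
        Console.WriteLine(IsValid("A0"));                 // False
    }
}
```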

Help me build a regex

I'm, quite frankly, completely clueless about regular expressions, even more so about building them. I am reading in a string that could contain any sort of combination of characters and numbers. What I know for certain is, somewhere in the string, there will be a number followed by % (1%, 13% etc.), and I want to extract that number from the string.
Examples are:
[05:37:25] Completed 21% //want to extract 21
[05:32:34] Completed 18000000 out of 50000000 steps (36%). //want to extract 36
I'm guessing I should be using either regex.Replace or regex.Split, but beyond that, I'm not sure. Any help would be appreciated.
You should be able to use something like "(\d+)%". This will match any number of consecutive digit characters followed by a percent sign, and will capture the actual number so you can extract and parse it. Use this in Regex.Match(), and read the Groups collection of the resulting Match: the captured number is the second element, index 1 (index 0 is the whole match).
If you need a decimal point, use "(\d+(\.\d+)?)%", which will match a string of digits, optionally followed by a decimal point and another set of digits.
The regex you want is:
/(\d+)%/
This will capture any number of digits immediately preceding a percentage sign.
([\d]+)(%)
The parentheses will group the result.
The [\d]+ gives you any digit, repeated one or more times.
The "%" is just a literal.
You will need to make sure you extract only the first grouping. Also, you will need to be sure that there are no other instances of "<number>%" in the line.
I'm not entirely sure how to make this C# specific, but I'm sure you can figure that out. :-P
Most likely you will need to use double backslashes (\\) where I only had one, or a verbatim string literal (@"...").
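To make this C# specific, something along these lines should do. It is only a sketch: the class and method names are invented, and it returns null when no "number%" is found or the digits do not fit in an int.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PercentExtractor
{
    // A verbatim string (@"...") avoids doubling the backslash.
    private static readonly Regex Percent = new Regex(@"(\d+)%");

    // Returns the first number that is directly followed by '%', or null.
    public static int? ExtractPercent(string line)
    {
        Match m = Percent.Match(line);
        int value;
        // Groups[0] is the whole match ("36%"); Groups[1] is the captured digits ("36").
        if (m.Success && int.TryParse(m.Groups[1].Value, out value))
        {
            return value;
        }
        return null;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractPercent("[05:37:25] Completed 21%"));                                    // 21
        Console.WriteLine(ExtractPercent("[05:32:34] Completed 18000000 out of 50000000 steps (36%).")); // 36
    }
}
```

Regex.Match is enough here; Regex.Replace and Regex.Split are not needed for pulling out a single value.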

Validating email address with single character domain-names with a regex

I have a regex that I am using to validate email addresses. I like this regex because it is fairly relaxed and has proven to work quite well.
Here is the regex:
(['\"]{1,}.+['\"]{1,}\s+)?<?[\w\.\-]+@[^\.][\w\.\-]+\.[A-Za-z]{2,}>?
Ok great, basically all reasonably valid email addresses that you can throw at it will validate. I know that maybe even some invalid ones will fall through but that is ok for my specific use-case.
Now it happens to be the case that joe@x.com does not validate. And guess what: x.com is actually a domain name that exists (owned by PayPal).
Looking at the regex part that validates the domain name:
@[^\.][\w\.\-]+
It looks like this should be able to parse the x.com domain name, but it doesn't. The culprit is the part that checks that a domain name cannot begin with a dot (such as test@.test.com)
@[^\.]
If I remove the [^\.] part of my regex, the domain x.com validates, but now the regex allows domain names beginning with a dot, such as .test.com; this is a little bit too relaxed for me ;-)
So my question is: how can the negative character class affect my single-character check? The way I am reading the regex is "make sure this string does not start with a dot", but apparently it does more.
Any help would be appreciated.
Regards,
Waseem
As Luis suggested, you can use [^\.][\w\.\-]* to match the domain name; however, it will now also match addresses like john@x.....com and john@@.com. You might want to make sure that there is only one period at a time, and that the first character after the @ is more restricted than just not being a period.
Match the domain name and the period (and subdomains and their periods) using:
([\w\-]+\.)+
So your pattern would be:
(['\"]{1,}.+['\"]{1,}\s+)?<?[\w\.\-]+@([\w\-]+\.)+[A-Za-z]{2,}>?
If you change [^\.][\w\.\-]+ to [^\.][\w\.\-]*, it will work as you expect!
The reason is: [^\.] will match a single character which is not a dot (in your case, the "x" in "x.com"). Then [\w\.\-]+ has to match one or more characters, followed by a dot. It can only match the dot after the x, and then there are no more dots left for \. to match. The * will match zero or more characters after the first one, which is what you want.
Change the quantifier +, meaning one or more, to *, meaning zero or more.
Change @[^\.][\w\.\-]+ to @[^\.][\w\.\-]*
The reason you need this is that [^\.] matches a single character that is not a dot, here the x. The [\w\.\-]+ that follows must consume at least one more character before the literal \., but the only dot in x.com comes right after the x, so the match fails. Changing the plus to a star fixes this.
Look at the broader context in your pattern:
@[^\.][\w\.\-]+\.[A-Za-z]{2,}
So for joe@x.com,
[^.] matches x
[\w.-]+ matches .
\. needs a dot but finds c
Change this part to @[^.][\w-]*\.[A-Za-z]{2,}
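To compare the variants discussed in this thread, here is a small sketch. To keep it short it drops the optional display-name and angle-bracket parts of the original pattern and keeps only the local part, the @ and the domain part; the class and member names are made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailDomainDemo
{
    // Same local part in all three; only the domain part differs.
    public const string Original = @"[\w\.\-]+@[^\.][\w\.\-]+\.[A-Za-z]{2,}";
    public const string StarFix  = @"[\w\.\-]+@[^\.][\w\.\-]*\.[A-Za-z]{2,}";
    public const string Grouped  = @"[\w\.\-]+@([\w\-]+\.)+[A-Za-z]{2,}";

    public static bool Matches(string pattern, string address)
    {
        return Regex.IsMatch(address, pattern);
    }

    public static void Main()
    {
        string[] addresses = { "joe@x.com", "test@.test.com", "john@x.....com", "john@@.com" };
        foreach (string address in addresses)
        {
            Console.WriteLine("{0,-16} original={1} star={2} grouped={3}",
                address,
                Matches(Original, address),
                Matches(StarFix, address),
                Matches(Grouped, address));
        }
    }
}
```

With these inputs the original rejects joe@x.com, the star version accepts it but also lets john@x.....com and john@@.com through, and the grouped version accepts only joe@x.com.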
