Regular Expression for dialing prefix eg +44 etc

Regular Expression for dialing prefix eg +44 etc - c#

I realise making a regex for all internation numbers is futile. So as a compromise I need a regex that will check that a string starts with the international dialing code eg:
(+44) or (+33) etc.
I dont understand how to write regex's yet, so any help would be greatly appreciated.
P.s. It's for use in a c#.net project.
Thanks.

Try this
var regex=new Regex(#"^\(\+\d{1,4}\)");
I'm using the # bit before the string so you don't need to double escape \
This does
^ Match Start of string
\( Match (
\+ Match +
\d{1,4} Match 1-4 digits
\) Match bracket
Also as you are using c# you may wish to consider using Expresso as this helps you build and test regular expressions targeted at the .Net Regular expression library and can even generate c# code for you to use them.

This is a complex matter, since the phone number needs not to be formatted, so it may have a digit right after the country code. Since these codes can have a varying length, you will need a list of country codes to check for. America and Russia have 1 and 7 respectively, most European countries and larger other countries have a two digit code, but there are also codes with 3 digits (not more, I believe). So you'll need a list of 1 and 2 digit codes at least, and can assume 3 digits if your phone number does not start with any code in that list.
I'd get this clear before you even start thinking whether to build this using regex or not. It may be easier in a simple piece of code.

Related

C# Regex Validation

Can someone please validate this for me (newbie of regex match cons).
Rather than asking the question, I am writing this:
Regex rgx = new Regex (#"^{3}[a-zA-Z0-9](\d{5})|{3}[a-zA-Z0-9](\d{9})$"
Can someone telll me if it's OK...
The accounts I am trying to match are either of:
1. BAA89345 (8 chars)
2. 12345678 (8 chars)
3. 123456789112 (12 chars)
Thanks in advance.

You can use a Regex tester. Plenty of free ones online. My Regex Tester is my current favorite.

Is the value with 3 characters then followed by digits always starting with three... can it start with less than or more than three. What are these mins and max chars prior to the digits if they can be.

You need to place your quantifiers after the characters they are supposed to quantify. Also, character classes need to be wrapped in square brackets. This should work:
#"^(?:[a-zA-Z0-9]{3}|\d{3}\d{4})\d{5}$"
There are several good, automated regex testers out there. You may want to check out regexpal.

Although that may be a perfectly valid match, I would suggest rewriting it as:
^([a-zA-Z]{3}\d{5}|\d{8}|\d{12})$
which requires the string to match one of:
[a-zA-Z]{3}\d{5} three alpha and five numbers
\d{8} 8 digits or
\d{12} twelve digits.
Makes it easier to read, too...

I'm not 100% on your objective, but there are a few problems I can see right off the bat.
When you list the acceptable characters to match, like with a-zA-Z0-9, you need to put it inside brackets, like [a-zA-Z0-9] Using a ^ at the beginning will negate the contained characters, e.g. `[^a-zA-Z0-9]
Word characters can be matched like \w, which is equivalent to [a-zA-Z0-9_].
Quantifiers need to appear at the end of the match expression. So, instead of {3}[a-zA-Z0-9], you would need to write [a-zA-Z0-9]{3} (assuming you want to match three instances of a character that matches [a-zA-Z0-9]

Shall this Regex do what I expect from it, that is, matching against "A1:B10,C3,D4:E1000"?

I'm currently writing a library where I wish to allow the user to be able to specify spreadsheet cell(s) under four possible alternatives:
A single cell: "A1";
Multiple contiguous cells: "A1:B10"
Multiple separate cells: "A1,B6,I60,AA2"
A mix of 2 and 3: "B2:B12,C13:C18,D4,E11000"
Then, to validate whether the input respects these formats, I intended to use a regular expression to match against. I have consulted this article on Wikipedia:
Regular Expression (Wikipedia)
And I also found this related SO question:
regex matching alpha character followed by 4 alphanumerics.
Based on the information provided within the above-linked articles, I would try with this Regex:
Default Readonly Property Cells(ByVal cellsAddresses As String) As ReadOnlyDictionary(Of String, ICell)
Get
Dim validAddresses As Regex = New Regex("A-Za-z0-9:,A-Za-z0-9")
If (Not validAddresses.IsMatch(cellsAddresses)) then _
Throw New FormatException("cellsAddresses")
// Proceed with getting the cells from the Interop here...
End Get
End Property
Questions
1. Is my regular expression correct? If not, please help me understand what expression I could use.
2. What exception is more likely to be the more meaningful between a FormatException and an InvalidExpressionException? I hesitate here, since it is related to the format under which the property expect the cells to be input, aside, I'm using an (regular) expression to match against.
Thank you kindly for your help and support! =)

I would try this one:
[A-Za-z]+[0-9]+([:,][A-Za-z]+[0-9]+)*
Explanation:
Between [] is a possible group of characters for a single position
[A-Za-z] means characters (letters) from 'A' to 'Z' and from 'a' to 'z'
[0-9] means characters (digits) from 0 to 9
A "+" appended to a part of a regex means: repeat that one or more times
A "*" means: repeat the previous part zero or more times.
( ) can be used to define a group
So [A-Za-z]+[0-9]+ matches one or more letters followed by one or more digits for a single cell-address.
Then that same block is repeated zero or more times, with a ',' or ':' separating the addresses.

Assuming that the column for the spreadsheet is any 1- or 2-letter value and the row is any positive number, a more complex but tighter answer still would be:
^[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?(,[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?)*$
"[A-Z]{1,2}[1-9]\d*" is the expression for a single cell reference. If you replace "[A-Z]{1,2}[1-9]\d*" in the above with then the complex expression becomes
^<cell>(:<cell>)?(,<cell>(:<cell>*)?)*$
which more clearly shows that it is a cell or a range followed by one or more "cell or range" entries with commas in between.
The row and column indicators could be further refined to give a tighter still, yet more complex expression. I suspect that the above could be simplified with look-ahead or look-behind assertions, but I admit those are not (yet) my strong suit.

I'd go with this one, I think:
(([A-Z]+[1-9]\d*:)?[A-Z]+[1-9]\d*,)*([A-Z]+[1-9]\d*:)?[A-Z]+[1-9]\d*
This only allows capital letters as the prefix. If you want case insensitivity, use RegexOptions.IgnoreCase.
You could simplify this by replacing [A-Z]+[1-9]\d* with plain old [A-Z]\d+, but that will only allow a one-letter prefix, and it also allows stuff like A0 and B01. Up to you.
EDIT:
Having thought hard about DocMax's mention of lookarounds, and using Hans Kesting's answer as inspiration, it occurs to me that this should work:
^[A-Z]+\d+((,|(?<!:\w*):)[A-Z]+\d+)*$
Or if you want something really twisted:
^([A-Z]+\d+(,|$|(?<!:\w*):))*(?<!,|:)
As in the previous example, replace \d+ with [1-9]\d* if you want to prevent leading zeros.
The idea behind the ,|(?<!\w*:): is that if a group is delimited by a comma, you want to let it through; but if it's a colon, it's only allowed if the previous delimiter wasn't a colon. The (,|$|...) version is madness, but it allows you to do it all with only one [A-Z]+\d+ block.
However! Even though this is shorter, and I'll admit I feel a teeny bit clever about it, I pity the poor fellow who has to come along and maintain it six months from now. It's fun from a code-golf standpoint, but I think it's best for practical purposes to go with the earlier version, which is a lot easier to read.

i think your regex is incorrect, try (([A-Za-z0-9]*)[:,]?)*
Edit : to correct the bug pointed out by Baud : (([A-Za-z0-9]*)[:,]?)*([A-Za-z0-9]+)
and finally - best version : (([A-Za-z]+[0-9]+)[:,]?)*([A-Za-z]+[0-9]+)
// ah ok this wont work probably... but to answer 1. - no i dont think your regex is correct
( ) form a group
[ ] form a charclass (you can use A-Z a-d 0-9 etc or just single characters)
? means 1 or 0
* means 0 or any
id suggest reading http://www.regular-expressions.info/reference.html .
thats where i learned regexes some time ago ;)
and for building expressions i use Rad Software Regular Expression Designer

Let's build this step by step.
If you are following an Excel addressing format, to match a single-cell entry in your CSL, you would use the regular expression:
[A-Z]{1,2}[1-9]\d*
This matches the following in sequence:
Any character in A to Z once or twice
Any digit in 1 to 9
Any digit zero or more times
The digit expression will prevent inputting a cell address with leading zeros.
To build the expression that allows for a cell address pair, repeat the expression preceded by a colon as optional.
[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?
Now allow for repeating the pattern preceded by a comma zero or more times and add start and end string delimiters.
^[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?(,[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?)*$
Kind of long and obnoxious, I admit, but after trying enough variants, I can't find a way of shortening it.
Hope this is helpful.

What is the regular expression for the following strings and would the expression change if the number rolled over?

What would be the following regular expressions for the following strings?
56AAA71064D6
56AAA7105A25
Would the regular expression change if the numbers rolled over? What I mean by this is that the above numbers happen to contain hexadecimal values and I don't know how the value changes one it reaches F. Using the first one as an example: 56AAA71064D6, if this went up to
56AAA71064F6 and then the following one would become 56AAA7106406, this would create a different regular expression because where a letter was allowed, now their is a digit, so does this make the regular expression even more difficult. Suggestions?
A manufacturer is going to enter a range of serial numbers. The problems are that different manufacturers have different formats for serial numbers (some are just numbers, some are alpha numeric, some contain extra characters like dashes, some contain hexadacimal values which makes it more difficult because I don't know how the roll over to the next serial number). The roll over issue is the biggest problem because the serial numbers are entered as a range like 5A1B - 6F12 and without knowing how the roll over, it seems to me that storing them in the database is not as easy. I was going to have the option of giving the user the option to input the pattern (expression) and storing that in the databse, but if a character or characters changes from a digit to a letter or vice versa, then the regular expression is no longer valid for certain serial numbers.
Also, the above example I gave is with just one case. There are multitude of serial numbers that would contain different expressions.

There's no single regular expression which is "the" expression to match both of those strings. Instead, there are infinitely many which will do so. Here are two options at opposite ends of the spectrum:
(56AAA71064D6)|(56AAA7105A25)
.*
The first will only match those two strings. The second will match anything. Both satisfy all the criteria you've given.
Now, if you specify more criteria, then we'd be able to give a more reasonable idea of the regular expression to provide - and that will drive the answers to the other questions. (At the moment, the only answer that makes sense is "It depends on what regex you use.")

I think you could do it this way for 12 characters. This will search for a 12 character phrase where each of the characters must be a capital (A or B or C or D or E or F or 1 or 2 or 3 or 4 or 5 or 6 or 7 or 8 or 9 or 0)
[A-F0-9]{12}
If you're wanting to include the possibility of dashes then do this.
[A-F0-9\-]{12}
Or you're wanting to include the possibility of dashes plus the 12 characters then do this. But that would pick up any 12-15 character item that fit the criteria though.
[A-F0-9\-]{12,15}
Or if it's surrounded by spaces (AAAAHHHh...SO is stripping out my spaces!!!)
[A-F0-9\-]{12}
Or if it's surrounded by tabs
\t[A-F0-9\-]{12}\t

This match a string that contains 12 hexa
[0-9A-F]{12}

Assuming these are all 12-digit hexadecimal numbers, which it looks like they are, the following regex should work:
[0-9A-Fa-f]{12}
Here I'm using a character class to say that I want any digit, OR A-F, OR a-f. As a bonus I'm allowing lowercase letters; if you don't want those just get them out of the regex.
As Jon Skeet and others have said, you really didn't provide enough information, so if you don't like this answer please understand that I was doing the best I can with what information you provided.

So, how about this:
[0-9A-F]{12}

Well it sounds like you're describing a 12 digit hexadecimal number:
^[A-F0-9]{12}$

C# Regex Replace Pattern (Replace String) Return $1

I'm currently working with parsing some data from SQL Server and I'm in need of help with a Regex.
I have an assembly in Sql Server 2005 that helps me Replace strings using C# Regex.Replace() Method.
I need to parse the following.
Strings:
CAD 90890
(CAD 90892)
CAD G67859
CAD 34G56
CAD 3S56.
AX CAD 890990
CAD 783783 MX
Needed Results:
90890
90892
G67859
34G56
3S56
890990
783783
SELECT TOP 25 CADCODE, dbo.RegExReplace(CADCODE, '*pattern*', '$1')
FROM dbo.CADCODES
WHERE CADCODE LIKE '%CAD%'
I need to get the proceeding string after the CAD word until it hits a white-space or anything that not a number or digit. I managed to get the digits but it really fails on others. I'm trying to get it to work but I can't find a real solution.
Thanks in advance.
Updated to reflect new Strings
AX CAD 890990
CAD 783783 MX

Try this:
(\w+)\W*$
The pattern matches the last word - made of alphanumeric (and underscores).
Example: http://www.rubular.com/r/1zWQQVLZy1
Another option is to find a word with at least one digit - this one can match anywhere on the string, so you may need to handle multiple matches. In this case, you can add a capturing group around the whole pattern, or replace using $&.
[a-zA-Z_]*\d\w*
Example: http://www.rubular.com/r/XUrFNuPQUv
If you can't match (Regex.Match) and must use Regex.Replace, you can match the entire string start to end and replace it with the group you need:
RegExReplace(CADCODE, '^.*\b([a-zA-Z_]*\d\w*)\b.*$', '$1')

I think this is what you're after:
^\W*\w*CAD\w*\W*(\w+)\W*$
The regex has to match the whole string so RegExReplace can replace it with $1, effectively stripping off the unwanted parts.
EDIT: Let me back up and make sure I've got this right. Because of the
WHERE CADCODE LIKE '%CAD%'
in your query, you already know every string contains the sequence CAD. That being the case, there's no need to complicate the regex by matching that sequence again. This should be all you need:
^.*?(\w+)\W*$

Try this:
(?:\(CAD\)|CAD)\s+?([\dA-Z]+)
You can get the result from the capture group number 1.

The problem with regex is that it's always easy to get a good pattern if you have a limited sample set.
In your case, you use:
\w{4}\w*
which just says, 4 alphanumerics, followed by 0 or more alphanumerics, so all the CAD sections would not match, nor would spaces or ().

A beginner's guide to interpreting Regex?

Greetings.
I've been tasked with debugging part of an application that involves a Regex -- but, I have never dealt with Regex before. Two questions:
1) I know that the regexes are supposed to be testing whether or not two strings are equivalent, but what specifically do the two regex statements, below, mean in plain English?
2) Does anyone have a recommendation on websites / sources where I can learn more about Regexes? (preferably in C#)
if (Regex.IsMatch(testString, #"^(\s*?)(" + tag + #")(\s*?),", RegexOptions.IgnoreCase))
{
result = true;
}
else if (Regex.IsMatch(testString, #",(\s*?)(" + tag + #")(\s*?),", RegexOptions.IgnoreCase))
{
result = true;
}

It's going to be difficult to tell what that regex means, without knowing what's in tag. In fact, it looks like that regex is broken (or, at least, doesn't properly escape inputs).
Roughly speaking, for the first regex:
The ^ says to match at the beginning of the string.
The (...) sets up a capturing group (which is available, although this example apparently doesn't use it).
The \s matches any white space characters (spaces, tabs, etc.)
The *? matches zero or more of the previous character (in this case, whitespace), and because it has a question-mark, it matches the minimum number of characters needed to make the rest of the expression work.
The (" + tag + #") inserts the contents of the tag into the regex. As I mention, that's dangerous, without escaping.
The (\s*?) matches the same as the before (the minimum number of whitespace characters)
The , matches a trailing comma.
The second regex is very similar, but looks for a starting comma (rather than the beginning of the string).
I like the Python documentation for Regular Expressions, but it looks like this site
has a pretty good, basic introduction, with C# examples.

One word - Cribsheet (or is that two?) :)

I'm not c# savvy but I can recommend an awesome guide to regular expressions that I use for Bash and Java programming. It applies to pretty much all languages:
http://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124/ref=tmm_pap_title_0
It is totally worth $30 to own this book. It is VERY thorough and helped my fundamental understanding of Regex a lot.
-Ryan

Since you specifically tagged C#, I recommend the Regex Hero as a tool you can use to play around with them since it's running on .NET. It also lets you toggle the different RegexOptions flags as you would pass them into the constructor when creating a new Regex.
Also, if you're using a version of Visual Studio 2010 that supports extensions, I would take a look at the Regex Editor extension... it will popup whenever you type new Regex( and offer you some guidance and autocomplete for your regex pattern.

Using The Regex Coach
The regular expression is a sequence consisting of the expression '(\s*?)', the expression '(tag)', the expression '(\s*?)', and the character ','.
where (\s*?) is defined as The regular expression is a repetition which matches a whitespace character as often as necessary.
the second one matches a , at the start too
As for good learning websites, I like www.regular-expressions.info/
Super simple version:
At the start of a string 0 or more spaces, whatever Tag is, 0 or More spaces, a comma.
the second one is
a comma, 0 or more spaces, whatever Tag is, 0 or More spaces, a comma.

Once you have the very basic idea about regex (it's full of resources over there) I recommend you to use Expresso for creating your regular expressions.
Expresso editor is equally suitable as a teaching tool for the beginning user of regular expressions or as a full-featured development environment for the experienced programmer or web designer with an extensive knowledge of regular expressions.

Your premise is not correct. Regular expressions are not used to tell if two strings are equivalent, but rather if the input string matches a certain pattern.
The first test above looks for any text that does not contain "zero or more whitespace charaters" searching "non-greedy". Then matches the text of the variable "tag" in the middle, then "zero or more whitespace characters, non greedy" again.
The second one is very similar, except that it allows for beginning whitespace as long as it follows a comma.
It is hard to explain "non-greedy" in this context, especially involving whitespace characters, so look here for more information.

A regular expression is a way to describe a set of strings that have some particular characteristics.
They don't merely need just to compare two strings.. what you usually do it to test if a string matches a particular regular expression. They can also be used to do simple parsing of a string in tokens that respect some patterns..
The good thing about regexps is that they allow you to express certain constraints inside a string keeping it general and able to match a group of strings that respect those constraints.. then they follow a formal specification that doesn't leave ambiguities around..
Here you can find a comparison table of various regular expression languages in many different programming languages and a specific guide for C# if you follow its link.
Usually the implementations for the various languages are quite similar since the syntax is somewhat standardized from the theoretical topics regexps come from, so any tutorial about regexp will be fine, then you'll just need to get into C# API.

1) The first regex is trying to do a case-insensitive match starting at the beginning of the test string. It then matches optional whitespace, followed by whatever is in tag, followed by optional whitespace then finally a comma.
The second matches a string containing a comma, followed by optional whitespace, followed by whatever is in tag, followed by optional whitespace then finally a comma.
Thought it's for C# I recommend picking up the Perl Pocket Reference which has a great Regex syntax reference. It helped my out a lot when I was learning regexes 14 years ago.

http://www.myregextester.com/ is a decent regular expression tester that also has an explain option for C# regexps - For Instance check out this example:
The regular expression:
(?-imsx:^(\s*?)(tagtext)(\s*?),)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
\s*? whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the least amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
tagtext 'tagtext'
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
\s*? whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the least amount
possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
, ','
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------

A regular expression does not tell you if two strings match, but rather if a given string matches a pattern.
This site is my favorite for learning and testing regular expressions:
http://gskinner.com/RegExr/
It allows you to interactively test regular expressions as you write them, and provides a built-in tutorial.

Although it doesn't use C#, Rejex is a simple tool for testing and learning about regular expressions which includes a quick reference for the special characters

It looks like that they are trying to match some kind of list of words delimited by colons (UPDATE: commas).
The first one is probably matching first item and the second one some item after the first one excluding the last one. I hope you will understand :).
A good source of information about regular expressions is at http://www.regular-expressions.info/

also a great site to test your regular expressions with extra info: http://regex101.com/

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.