Regular expression for tokenizing connection string - c#

I have 2 different types of connection strings (because of legacy reasons that I can't fix everywhere for various reasons which are irrelevant here). I need to break them up into key/value pairs. Here are the sample connection strings:
1. Server=SomeServer;Database=SomeDatabase;Something=Hello
2. Server=SomeServer,Database=SomeDatabase;Something=Hello
3. Server=SomeServer,1111;Database=SomeDatabase;Something=Hello
For the first 2 cases, I can use the regex:
(?<Key>[0-9A-z\s]+)=(?<Val>[0-9A-z\s,]+?[0-9A-z\s]+)
For the third one, I can use the regex:
(?<Key>[0-9A-z\s]+)=(?<Val>[0-9A-z\s]+?[0-9A-z\s,]+)
How do I turn this into one regex that would work for all cases?

You can just use the below regex
(?<Key>[^=;]+)=(?<Val>[^;]+)
What the above uses is negated character class. [^;]+ will select everything till the first ; it encounters.
DEMO (I've removed the named groups for testing. It would work well in C#, however)

Here is my suggestion.
(?<key>[^=;,]+)=(?<val>[^;,]+(,\d+)?)
The semicolon is a delimiter as is the comma if it is not immediately followed by numbers.

Related

How to match a string, but only if the same string has not already been matched with or without dashes?

I have a case I'm trying to match using regular expressions.
My current expression will match a string in a certain format with or without dashes. I would like to add it to match only if the string has not been matched before, with or without the dashes. For example, take the following cases:
1. 1234-56-789-5555
2. 1234567895555
3. 0000-99-888-3333
4. 1111223334444
If the four examples above appeared in this same order in a list, document, whatever, I would want to only capture (1, 3, 4). I want to skip #2 since it was already captured by #1, but with the dashes. If #2 had of come first, I would have wanted to similarly skip #1.
Here's the current expression I'm using:
\d\d\d\d-*\d\d-*\d\d\d-*\d\d\d\d
I tried to read up on look behinds (I'm fairly inexperienced with Regex) but I only really understand that a look behind only checks if certain text is matched previously. I'm not sure if what I want can be combined with this; I only see how to check for specific text, not for the current value with/without dashes.
I'm currently doing this with C# logic, but am trying to see if it can be done purely in Regex. If it can't be done, that's fine; I'm just trying to beef up my Regex knowledge in this case.
Is this possible -- how can I accomplish this?
If you want to obtain just the first occurrence of each number (answering I want to skip #2 since it was already captured by #1, but with the dashes), you need a negative look-behind with a RegexOptions.RightToLeft and RegexOptions.Singleline options:
(?<!\b\1-?\2-?\3-?\4\b.*)\b(\d{4})-?(\d{2})-?(\d{3})-?(\d{4})\b
The \b(\d{4})-?(\d{2})-?(\d{3})-?(\d{4})\b subpattern is the number with capture groups to check for their presence regardless of the hyphens earlier in the string.
The (?<!\b\1-?\2-?\3-?\4\b.*) subpattern look-behind is checking if we have no other occurrences of the same string.
Tested at regexhero.net and in Expresso:
You can easily do this without using regex.. but if you still want to use regex for this purpose.. you can use the following to match:
(?<=((\d{4})-(\d{2})-(\d{3})-(\d{4})).*?)\2\3\4\5
And replace with '' (empty string)
Explanation:
This will match all those digits without dashes which are already captured by digits with dashes
So, in your 1,2,3 and 4.. instead of matching 1,3 and 4 types it matches type 2.. and you can replace it with '' (nothing) and you remain with 1,3, and 4
See demo here
You can use the following regex to do exactly what you want..
((?<=((\d{4})-(\d{2})-(\d{3})-(\d{4})).*?)(?!\3\4\5\6)\d{13})|(((?<=((\d{4})(\d{2})(\d{3})(\d{4})).*?)(?!\10-\11-\12-\13)((\d{4})-(\d{2})-(\d{3})-(\d{4}))))
Explanation:
((?<=((\d{4})-(\d{2})-(\d{3})-(\d{4})).*?)(?!\3\4\5\6)\d{13}) match all those \d{13} which are not previously occurred with dashes in between them (this excludes strings of type 2 in your case)
((\d{4})-(\d{2})-(\d{3})-(\d{4})) and match all of this pattern
Matches 1, 3 and 4 in your case.
See DEMO

Character 'e' is not recognized by simple regular expression - why?

I wrote a very simple regular expression that need to match the next pattern:
word.otherWord
- Word must have at least 2 characters and must not start with digit.
I wrote the next expression:
[a-zA-Z][a-zA-Z](.[a-zA-Z0-9])+
I tested it using Regex tester and it seems to be working at most of the cases but when I try some inputs that ends with 'e' it's not working.
for example:
Hardware.Make does not work but Hardware.Makee is works fine, why? How can I fix it?
That's because your regex looks for inputs which length is even.
You have two characters matched by [a-zA-Z][a-zA-Z] and then another two characters matched by (.[a-zA-Z0-9]) as a group which is repeated one or more times (because of +).
You can see it here: http://regex101.com/r/fW2bC1
I think you need that:
[a-zA-Z]+(\.[a-zA-Z0-9]+)+
Actually, the dot is a regex metacharacter, which stands for "any character". You'll need to escape the dot.
For your situation, I'd do this:
[a-zA-Z]{2,}\.[a-zA-Z0-9]+
The {2,} means, at least 2 characters from the previous range.
In regex, the dot period is one of the most commonly used metacharacters and unfortunately also commonly misused metacharacter. The dot matches a single character without caring what that character is...
So u would also re-write it like
[a-zA-Z]+(\.[a-zA-Z0-9]+)+

Verify that my Regex works as expected

It's my first regex for production code, until now I've always avoided to write them myself and now I'm a bit worried if it really works as it is expected to. I made a lot of attempts trying to break it, but I really don't want to rely on this, especially when I have zero experience.
My regex should match exactly this pattern
first character must be one of the letters (not case sensitive) - K,C,M,X,S,W
second character must be a digit from 0-9
a hyphen -
4 alphanumeric characters (A-Z or 0-9) (not case sensitive) and
one letter (A-Z) (not case sensitive).
And that's it. It can't be shorter, it can't be longer, it must match exactly this pattern. What I have for now is this:
string RegExPattern = #"^(K|C|M|X|S|W){1}[0-9]{1}[-]{1}[A-Z0-9]{4}[A-Z]{1}$";
if (!Regex.IsMatch(txtCode.Text, RegExPattern, RegexOptions.IgnoreCase))
{
MessageBox.Show("Fail");
return false;
}
Is there any tool, or some other way to verify the behavior of a regex and is this regex correct for the matching pattern I explained above?
Yes, that is correct.
However, all the {1} are redundant, you can make a set of the first character insted of using the | operator, and you don't need a set for the dash:
string RegExPattern = #"^[KCMXSW][0-9]-[A-Z0-9]{4}[A-Z]$";
There are tools for writing and testing regular expressions, but you can only use them to test any variations in the input that you can think of, and it seems that you have already done that.
Nice tool to verify and develop regular expressions: http://www.debuggex.com. Nevertheless I'd advise you to concrete your regular expressions with the bunch of unit tests.
The best tool is a suite of unit tests, or a single test that iterates over several dozen chunks of text.
Create a text file that has a whole bunch of lines of text that are similar to the data this pattern will be used against. Make sure that some lines match, and some lines that won't match different parts of the rule (eg: a pattern that matches everything but the first character, one that matches everything but the last character, one with only 2 or 3 characters rather than four, etc.
Then, write a small program that reads each line of text and runs your expression against it. Have it print the line numbers of the lines that match, and then compare that list of numbers against your expected results.

Regex to match Table of contents pattern

Considering
NN = number/digit
x = any single letter
I want to match these patterns:
1. NN
2. NNx
3. NN.NN
4. NN.NNx
5. NN.NN.NN
6. NN.NN.NNx
Example that needs to be match:
1. 20
2. 20a
3. 20.20
4. 20.20a
5. 20.20.20
6. 20.20.20a
Right now I am trying to use this regex:
\b\d+\.?\d+\.?\d+?[a-z]?\b
But if fails.
Any help would be greatly appreciate, thanks! XD
EDIT:
I am matching this:
<fn:footnote fr="10.23.20a"> (Just a sample)
Now I have a regex that will extract the '10.23.20a'
Now I will check if this value will be valid, the 6 examples above will be the only string that will be accepted.
This examples are invalid:
1. 20.a
2. 20a.20.20
3. etc.
Many thanks for your help men! :D
You always have \d+, which is one or more digits. So you require at least three digits. Try grouping the digits with their periods:
^\d+(?:[.]\d+){0,2}[a-z]?$
The ?: is just an optimization (and a good practice) that suppresses capturing. [.] and \. are completely interchangeable, but I prefer the readability of the former. Choose whatever you like best.
If you actually want to capture the numbers and the letter, there two options:
^(?<first>\d+)(?:[.](?<second>\d+))?(?:[.](?<third>\d+))?(?<letter>[a-z])?$
Note that the important point is to group a period and the digits together and make them optional together. You could as well use unnamed groups, it doesn't really matter. However, if you use my version, you can now access the parts through (for instance)
match.Groups["first"].Value
where match is a Match object returned by Regex.Match or Regex.Matches for example.
Alternatively, you can use .NET's feature of capturing multiple values with one group:
^(?<d>\d+)(?:[.](?<d>\d+){0,2}(?<letter>[a-z])?$
Now match.Groups["d"].Captures will contain a list of all captured numbers (1 to 3). And match.Groups["letter"].Value will still contain the letter if it was there.
Try this
^\d+(?:(?:\.\d+)*[a-z]?)$

Shall this Regex do what I expect from it, that is, matching against "A1:B10,C3,D4:E1000"?

I'm currently writing a library where I wish to allow the user to be able to specify spreadsheet cell(s) under four possible alternatives:
A single cell: "A1";
Multiple contiguous cells: "A1:B10"
Multiple separate cells: "A1,B6,I60,AA2"
A mix of 2 and 3: "B2:B12,C13:C18,D4,E11000"
Then, to validate whether the input respects these formats, I intended to use a regular expression to match against. I have consulted this article on Wikipedia:
Regular Expression (Wikipedia)
And I also found this related SO question:
regex matching alpha character followed by 4 alphanumerics.
Based on the information provided within the above-linked articles, I would try with this Regex:
Default Readonly Property Cells(ByVal cellsAddresses As String) As ReadOnlyDictionary(Of String, ICell)
Get
Dim validAddresses As Regex = New Regex("A-Za-z0-9:,A-Za-z0-9")
If (Not validAddresses.IsMatch(cellsAddresses)) then _
Throw New FormatException("cellsAddresses")
// Proceed with getting the cells from the Interop here...
End Get
End Property
Questions
1. Is my regular expression correct? If not, please help me understand what expression I could use.
2. What exception is more likely to be the more meaningful between a FormatException and an InvalidExpressionException? I hesitate here, since it is related to the format under which the property expect the cells to be input, aside, I'm using an (regular) expression to match against.
Thank you kindly for your help and support! =)
I would try this one:
[A-Za-z]+[0-9]+([:,][A-Za-z]+[0-9]+)*
Explanation:
Between [] is a possible group of characters for a single position
[A-Za-z] means characters (letters) from 'A' to 'Z' and from 'a' to 'z'
[0-9] means characters (digits) from 0 to 9
A "+" appended to a part of a regex means: repeat that one or more times
A "*" means: repeat the previous part zero or more times.
( ) can be used to define a group
So [A-Za-z]+[0-9]+ matches one or more letters followed by one or more digits for a single cell-address.
Then that same block is repeated zero or more times, with a ',' or ':' separating the addresses.
Assuming that the column for the spreadsheet is any 1- or 2-letter value and the row is any positive number, a more complex but tighter answer still would be:
^[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?(,[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?)*$
"[A-Z]{1,2}[1-9]\d*" is the expression for a single cell reference. If you replace "[A-Z]{1,2}[1-9]\d*" in the above with then the complex expression becomes
^<cell>(:<cell>)?(,<cell>(:<cell>*)?)*$
which more clearly shows that it is a cell or a range followed by one or more "cell or range" entries with commas in between.
The row and column indicators could be further refined to give a tighter still, yet more complex expression. I suspect that the above could be simplified with look-ahead or look-behind assertions, but I admit those are not (yet) my strong suit.
I'd go with this one, I think:
(([A-Z]+[1-9]\d*:)?[A-Z]+[1-9]\d*,)*([A-Z]+[1-9]\d*:)?[A-Z]+[1-9]\d*
This only allows capital letters as the prefix. If you want case insensitivity, use RegexOptions.IgnoreCase.
You could simplify this by replacing [A-Z]+[1-9]\d* with plain old [A-Z]\d+, but that will only allow a one-letter prefix, and it also allows stuff like A0 and B01. Up to you.
EDIT:
Having thought hard about DocMax's mention of lookarounds, and using Hans Kesting's answer as inspiration, it occurs to me that this should work:
^[A-Z]+\d+((,|(?<!:\w*):)[A-Z]+\d+)*$
Or if you want something really twisted:
^([A-Z]+\d+(,|$|(?<!:\w*):))*(?<!,|:)
As in the previous example, replace \d+ with [1-9]\d* if you want to prevent leading zeros.
The idea behind the ,|(?<!\w*:): is that if a group is delimited by a comma, you want to let it through; but if it's a colon, it's only allowed if the previous delimiter wasn't a colon. The (,|$|...) version is madness, but it allows you to do it all with only one [A-Z]+\d+ block.
However! Even though this is shorter, and I'll admit I feel a teeny bit clever about it, I pity the poor fellow who has to come along and maintain it six months from now. It's fun from a code-golf standpoint, but I think it's best for practical purposes to go with the earlier version, which is a lot easier to read.
i think your regex is incorrect, try (([A-Za-z0-9]*)[:,]?)*
Edit : to correct the bug pointed out by Baud : (([A-Za-z0-9]*)[:,]?)*([A-Za-z0-9]+)
and finally - best version : (([A-Za-z]+[0-9]+)[:,]?)*([A-Za-z]+[0-9]+)
// ah ok this wont work probably... but to answer 1. - no i dont think your regex is correct
( ) form a group
[ ] form a charclass (you can use A-Z a-d 0-9 etc or just single characters)
? means 1 or 0
* means 0 or any
id suggest reading http://www.regular-expressions.info/reference.html .
thats where i learned regexes some time ago ;)
and for building expressions i use Rad Software Regular Expression Designer
Let's build this step by step.
If you are following an Excel addressing format, to match a single-cell entry in your CSL, you would use the regular expression:
[A-Z]{1,2}[1-9]\d*
This matches the following in sequence:
Any character in A to Z once or twice
Any digit in 1 to 9
Any digit zero or more times
The digit expression will prevent inputting a cell address with leading zeros.
To build the expression that allows for a cell address pair, repeat the expression preceded by a colon as optional.
[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?
Now allow for repeating the pattern preceded by a comma zero or more times and add start and end string delimiters.
^[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?(,[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?)*$
Kind of long and obnoxious, I admit, but after trying enough variants, I can't find a way of shortening it.
Hope this is helpful.

Categories