Regex for key value pairs - c#

I am not great with regular expressions and I have a need to parse out key/value pairs from a string. An example of the string would be:
Event Name CallingNumber:+15555555555 CallID:12345 CallingName:Doe, John CallingTime:12-26-2013 14:27:41.645497
The result I'm looking for would be something like this:
CallingNumber=+15555555555
CallID=12345
CallingName=Doe, John
CallingTime=12-26-2013 14:27:41.645497
The key/value pairs are delimited by a space, but the value is allowed to have a space in it (ex: Doe, John). It would be nice if the values were surrounded by quotes or something, but they are not. Essentially I'm trying to match a word without a space followed by a colon and then any character after the colon until it reaches another word without a space followed by a colon.

An exact match is impossible: the fields are delimited with :, but you also have a date with : in it, and a regex can't easily tell those apart.
Still, this is what I came up with:
(.+?):(.+?)(?=(?:[^\s]+:)|(?:$))
Again, because of the date, this won't work perfectly.
Here's a fiddle to demonstrate: http://www.rexfiddle.net/Wm3NiK0
Edit: If your "keys" are only letters (not numbers), which avoids the time/date problem, then this will work:
([A-Za-z]+?):(.+?)\s?(?=(?:[A-Za-z]+:)|(?:$))
Here's another fiddle to demonstrate this: http://www.rexfiddle.net/sGQs7YV
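A minimal sketch of the letters-only pattern above, applied to the sample line from the question. The question targets .NET; Python's re module is used here only for illustration, since this particular pattern uses no .NET-specific syntax. The helper name parse_pairs is made up for the example.

```python
import re

# The letters-only pattern from the answer, unchanged.
PAIR = re.compile(r"([A-Za-z]+?):(.+?)\s?(?=(?:[A-Za-z]+:)|(?:$))")

def parse_pairs(line):
    """Return the key/value pairs found in line as a dict (hypothetical helper)."""
    return {key: value for key, value in PAIR.findall(line)}

sample = ("Event Name CallingNumber:+15555555555 CallID:12345 "
          "CallingName:Doe, John CallingTime:12-26-2013 14:27:41.645497")

for key, value in parse_pairs(sample).items():
    print(f"{key}={value}")
```

The digits in the time never look like a key, so the timestamp stays in one piece. A value that contains letters directly followed by a colon (a URL, for instance) would still be split at that point.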

You can apply the regex repeatedly, with a (.*) to return the "yet to be parsed" remainder.
In pseudocode form, this might be:
match string to "^(([^:]*\s)*[^:]*)\s+(.*)$"
    (should grab "Event Name" and leave the rest as $3)
loop:
    keep only $3 as new base string
    match new base string to "^(\w+)[:](.+?)\s+(\w+[:].*)$"
    key = $1, value = $2, new remainder = $3
    repeat until no $1, $2 values are returned
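The loop above could be sketched in Python roughly as follows. This is not a literal transcription: keys are assumed to consist of letters only (so the "14:" inside the time is not mistaken for a key), and the final pair is handled separately because the loop pattern needs a following key to stop at. The names HEAD, PAIR and parse_event are hypothetical.

```python
import re

# Everything before the first "key:" token is the free-text prefix.
HEAD = re.compile(r"^(.*?)\s+([A-Za-z]+:.*)$")
# One key, its value, and the not-yet-parsed remainder starting at the next key.
PAIR = re.compile(r"^([A-Za-z]+):(.+?)\s+([A-Za-z]+:.*)$")

def parse_event(line):
    """Split line into (prefix, {key: value}); assumes letters-only keys."""
    head = HEAD.match(line)
    if head is None:
        return line, {}          # no prefix followed by "key:" was found
    prefix, rest = head.group(1), head.group(2)
    pairs = {}
    while True:
        m = PAIR.match(rest)
        if m is None:
            # Last pair: nothing follows it, so split on the first colon.
            key, _, value = rest.partition(":")
            pairs[key] = value
            break
        pairs[m.group(1)] = m.group(2)
        rest = m.group(3)        # keep only the remainder as the new base string
    return prefix, pairs

sample = ("Event Name CallingNumber:+15555555555 CallID:12345 "
          "CallingName:Doe, John CallingTime:12-26-2013 14:27:41.645497")
prefix, pairs = parse_event(sample)
print(prefix)
print(pairs)
```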

"I'm suing .NET (c#)," good idea! :) Microsoft needs to be put in its place!
Do you have a fixed number of fields, or could they vary in number? Do you expect the same fields each time? In the same order? If a fixed number, you could hard code the number of fields in the regexp, but I still think that trying to do it with just one regexp is asking for a headache. Use some scripting code and break it down piece by piece, first of all splitting it on :\s+. The last word in a group is then stripped off as the name of the next group, and the remainder is the value of the previous group. The first and last groups have to have some special treatment. I think that would be a lot easier and more understandable than trying to do it in one ugly regexp. As a bonus, any number of fields in any order could be handled.


Matching strings between whitespaces without including them [duplicate]

I'm trying to come up with an example where positive look-around works but non-capture groups won't work, to further understand their usages. The examples I'm coming up with all work with non-capture groups as well, so I feel like I'm not fully grasping the usage of positive look-around.
Here is a string (taken from an SO example) that uses positive look-ahead in the answer. The user wanted to grab the second column value, only if the value of the first column started with ABC and the last column had the value 'active'.
string ='''ABC1 1.1.1.1 20151118 active
ABC2 2.2.2.2 20151118 inactive
xxx x.x.x.x xxxxxxxx active'''
The solution given used 'positive look ahead' but I noticed that I could use non-capture groups to arrive at the same answer.
So, I'm having trouble coming up with an example where positive look-around works, non-capturing group doesn't work.
pattern = re.compile(r'ABC\w\s+(\S+)\s+(?=\S+\s+active)')  # solution
pattern = re.compile(r'ABC\w\s+(\S+)\s+(?:\S+\s+active)')  # solution w/out lookaround
If anyone would be kind enough to provide an example, I would be grateful.
Thanks.
The fundamental difference is that non-capturing groups still consume the part of the string they match, thus moving the cursor forward.
One example where this makes a real difference is when you try to match strings that are surrounded by certain boundaries, and these boundaries can overlap. Sample task:
Match all a's in a given string that are surrounded by b's. The given string is bababaca. There should be two matches, at positions 2 and 4 (counting from 1).
Using lookarounds this is rather easy, you can use b(a)(?=b) or (?<=b)a(?=b) and match them. But (?:b)a(?:b) won't work - the first match will also consume the b at position 3, that is needed as boundary for the second match. (note: the non-capturing group isn't actually needed here)
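A short Python check of this example, using the same patterns as in the paragraph above. Note that Python reports 0-based indexes, so positions 2 and 4 show up as 1 and 3.

```python
import re

text = "bababaca"

# Lookarounds assert the surrounding b's without consuming them,
# so the b between the two a's can serve as a boundary twice.
lookaround = [m.start() for m in re.finditer(r"(?<=b)a(?=b)", text)]

# Here both b's are consumed; the search resumes after the first "bab",
# so the second a no longer has a b in front of it to match.
consuming = [m.start() for m in re.finditer(r"(?:b)a(?:b)", text)]

print(lookaround)  # 0-based indexes of the matched a's
print(consuming)   # 0-based index where the single "bab" match starts
```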
Another prominent example is password validation: check that the password contains uppercase letters, lowercase letters, numbers, whatever. You can use a bunch of alternations to match these, but lookaheads are much easier:
(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[!?.])
vs
(?:.*[a-z].*[A-Z].*[0-9].*[!?.])|(?:.*[A-Z].*[a-z].*[0-9].*[!?.])|(?:.*[0-9].*[a-z].*[A-Z].*[!?.])|(?:.*[!?.].*[a-z].*[A-Z].*[0-9])|(?:.*[A-Z].*[a-z].*[!?.].*[0-9])|...
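A small sketch of the lookahead version in Python. Each lookahead starts from the same position and checks one requirement independently, which is why the order of the characters in the password does not matter. The function name is_strong is made up for the example.

```python
import re

# Four independent checks, all anchored at the start of the string by match().
CHECKS = re.compile(r"(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[!?.])")

def is_strong(password):
    """True if password has a lowercase letter, an uppercase letter,
    a digit and one of ! ? . in any order (hypothetical helper)."""
    return CHECKS.match(password) is not None

for candidate in ["Passw0rd!", "!0aA", "password", "PASSW0RD!"]:
    print(candidate, is_strong(candidate))
```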

How to match a string, but only if the same string has not already been matched with or without dashes?

I have a case I'm trying to match using regular expressions.
My current expression will match a string in a certain format with or without dashes. I would like to add it to match only if the string has not been matched before, with or without the dashes. For example, take the following cases:
1. 1234-56-789-5555
2. 1234567895555
3. 0000-99-888-3333
4. 1111223334444
If the four examples above appeared in this same order in a list, document, whatever, I would want to only capture (1, 3, 4). I want to skip #2 since it was already captured by #1, but with the dashes. If #2 had come first, I would have wanted to similarly skip #1.
Here's the current expression I'm using:
\d\d\d\d-*\d\d-*\d\d\d-*\d\d\d\d
I tried to read up on look behinds (I'm fairly inexperienced with Regex) but I only really understand that a look behind only checks if certain text is matched previously. I'm not sure if what I want can be combined with this; I only see how to check for specific text, not for the current value with/without dashes.
I'm currently doing this with C# logic, but am trying to see if it can be done purely in Regex. If it can't be done, that's fine; I'm just trying to beef up my Regex knowledge in this case.
Is this possible -- how can I accomplish this?
If you want to obtain just the first occurrence of each number (answering "I want to skip #2 since it was already captured by #1, but with the dashes"), you need a negative look-behind with the RegexOptions.RightToLeft and RegexOptions.Singleline options:
(?<!\b\1-?\2-?\3-?\4\b.*)\b(\d{4})-?(\d{2})-?(\d{3})-?(\d{4})\b
The \b(\d{4})-?(\d{2})-?(\d{3})-?(\d{4})\b subpattern is the number with capture groups to check for their presence regardless of the hyphens earlier in the string.
The (?<!\b\1-?\2-?\3-?\4\b.*) subpattern look-behind is checking if we have no other occurrences of the same string.
Tested at regexhero.net and in Expresso.
You can easily do this without using regex, but if you still want to use regex for this purpose, you can use the following to match:
(?<=((\d{4})-(\d{2})-(\d{3})-(\d{4})).*?)\2\3\4\5
And replace with '' (empty string)
Explanation:
This will match all those digit strings without dashes that have already been captured in the form with dashes.
So, for your examples 1 to 4, instead of matching types 1, 3 and 4 it matches type 2; replace that with '' (nothing) and you are left with 1, 3 and 4.
See demo here
You can use the following regex to do exactly what you want:
((?<=((\d{4})-(\d{2})-(\d{3})-(\d{4})).*?)(?!\3\4\5\6)\d{13})|(((?<=((\d{4})(\d{2})(\d{3})(\d{4})).*?)(?!\10-\11-\12-\13)((\d{4})-(\d{2})-(\d{3})-(\d{4}))))
Explanation:
((?<=((\d{4})-(\d{2})-(\d{3})-(\d{4})).*?)(?!\3\4\5\6)\d{13}) matches all those \d{13} that have not previously occurred with dashes in between them (this excludes strings of type 2 in your case)
((\d{4})-(\d{2})-(\d{3})-(\d{4})) and match all of this pattern
Matches 1, 3 and 4 in your case.
See DEMO
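For comparison, the non-regex route mentioned above (and the "C# logic" the question already uses) can be sketched in a few lines. The sketch is in Python for illustration only; it uses a regex just to find candidate numbers and a set of dash-free forms to remember what has been seen. The name first_occurrences is hypothetical.

```python
import re

# Same shape as the number pattern in the first answer, without capture groups.
NUMBER = re.compile(r"\b\d{4}-?\d{2}-?\d{3}-?\d{4}\b")

def first_occurrences(text):
    """Return each number the first time it appears, with or without dashes."""
    seen = set()
    result = []
    for m in NUMBER.finditer(text):
        normalized = m.group().replace("-", "")  # compare on digits only
        if normalized not in seen:
            seen.add(normalized)
            result.append(m.group())
    return result

text = "1234-56-789-5555\n1234567895555\n0000-99-888-3333\n1111223334444"
print(first_occurrences(text))
```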

Regex to parse formatter string

I am writing a string.Format-like method. In order to do this, I am using Regex to determine commands and parameters: e.g. Format(@"\m{0,1,2}", byteArr0, byteArr1, byteArr2)
For the first Regex, return 2 groups:
'\m'
'{0,1,2}'
Another Regex takes the value of '{0,1,2}' and has 3 matches:
0
1
2
These values are the indexes corresponding to the byteArr params.
This command structure is likely to grow, so I'm really trying to figure this out and learn enough to be able to modify the Regex for future requirements. I would think that a single Regex would do all of the above, but there is value in having 2 separate Regex(es/ices ???) expressions.
Any way, to get the first group '\m' the Regex is:
"(\\)(\w{1,1})" // I want the '{0,1,2}' group also
To get the integer matches '{0,1,2}' I was trying:
"(?<=\{)([^}]*)(?=\})"
I am having difficulty in achieving: (1) 2 groups on the first expression and (2) 3 matches on the integers within the braces delimited by a comma in the second expression.
Your first regex (\\)(\w{1,1}) can be greatly simplified.
You don't want to capture the \ separately from the m, so there is no need to wrap them in their own sets of parentheses.
\w{1,1} is the same as just \w.
So we have \\\w to match the first part \m.
Now to deal with the second part, really we can ignore everything other than the 0,1,2 in the example since there are no numbers elsewhere so you'd just use: \d+ and iterate through the matches.
But let's assume the example could actually be \9{1,2,3}.
Now \d+ would match the 9 so to avoid this we could use [{,](\d+)[,}]. This says capture a number that has either a , or { on the left of it and a , or } on the right.
You're right in saying that we can match the whole string with a single regex, something like this would do it:
(\\\w){((\d+),?)+}
However the problem with this is when you examine the contents of the capture groups afterwards, the last number caught by the (\d+) overwrites all the other values that were caught in there. So you'd be left with group 1: \m and group 2: 2 for your example.
With that in mind I recommend using 2 regexs:
For the 1st part: \\\w
For the numbers: I'd forget about the [{,](\d+)[,}] (and the many other ways you could do it), the cleanest way might just be to grab whatever is inside the {...} and then match with a simple \d+.
So to do this first use (\\\w)\{([^/}]+)\} to grab the \m into group 1 and the 1,2,3 into group 2, then just use \d+ on that.
FYI, your (?<=\{)([^}]*)(?=\}) works fine, but you can't put anything before the lookbehind, i.e. the \\\w. In the vast majority of cases where a lookbehind can be used, you can do what you want by just using capture groups and ignoring everything else:
My regex \{([^/}]+)\} is pretty much the same as your (?<=\{)([^}]*)(?=\}) except that rather than looking ahead and looking behind for the { and }, I just leave them outside the capture groups that are going to be used.
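A rough Python illustration of both points: a repeated capture group reports only its last repetition, and the two-step approach recovers every index. Python's re module stands in for .NET here, the braces are escaped for clarity, and the name parse_command is made up for the example.

```python
import re

spec = r"\m{0,1,2}"

# Single regex: the (\d+) group is repeated, so only its last value is kept.
single = re.fullmatch(r"(\\\w)\{(?:(\d+),?)+\}", spec)
last_only = single.group(2)

# Two steps: capture the command and the brace contents, then scan for numbers.
COMMAND = re.compile(r"(\\\w)\{([^/}]+)\}")

def parse_command(text):
    """Return (command, [indexes]) or None if text has no command (sketch)."""
    m = COMMAND.search(text)
    if m is None:
        return None
    return m.group(1), [int(n) for n in re.findall(r"\d+", m.group(2))]

print(last_only)
print(parse_command(spec))
print(parse_command(r"\9{1,2,3}"))
```

Because the numbers are extracted only from the text inside the braces, a digit used as the command character (as in \9) does not end up in the index list.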
Consider the following Regexes...
(^.*?)(?={.*})
\d+
Good Luck!

Shall this Regex do what I expect from it, that is, matching against "A1:B10,C3,D4:E1000"?

I'm currently writing a library where I wish to allow the user to be able to specify spreadsheet cell(s) under four possible alternatives:
A single cell: "A1";
Multiple contiguous cells: "A1:B10"
Multiple separate cells: "A1,B6,I60,AA2"
A mix of 2 and 3: "B2:B12,C13:C18,D4,E11000"
Then, to validate whether the input respects these formats, I intended to use a regular expression to match against. I have consulted this article on Wikipedia:
Regular Expression (Wikipedia)
And I also found this related SO question:
regex matching alpha character followed by 4 alphanumerics.
Based on the information provided within the above-linked articles, I would try with this Regex:
Default Readonly Property Cells(ByVal cellsAddresses As String) As ReadOnlyDictionary(Of String, ICell)
    Get
        Dim validAddresses As Regex = New Regex("A-Za-z0-9:,A-Za-z0-9")
        If (Not validAddresses.IsMatch(cellsAddresses)) Then _
            Throw New FormatException("cellsAddresses")
        ' Proceed with getting the cells from the Interop here...
    End Get
End Property
Questions
1. Is my regular expression correct? If not, please help me understand what expression I could use.
2. Which exception is more meaningful here: a FormatException or an InvalidExpressionException? I hesitate because the problem relates to the format in which the property expects the cells to be input, but I'm also using a (regular) expression to match against it.
Thank you kindly for your help and support! =)
I would try this one:
[A-Za-z]+[0-9]+([:,][A-Za-z]+[0-9]+)*
Explanation:
Between [] is a possible group of characters for a single position
[A-Za-z] means characters (letters) from 'A' to 'Z' and from 'a' to 'z'
[0-9] means characters (digits) from 0 to 9
A "+" appended to a part of a regex means: repeat that one or more times
A "*" means: repeat the previous part zero or more times.
( ) can be used to define a group
So [A-Za-z]+[0-9]+ matches one or more letters followed by one or more digits for a single cell-address.
Then that same block is repeated zero or more times, with a ',' or ':' separating the addresses.
Assuming that the column for the spreadsheet is any 1- or 2-letter value and the row is any positive number, a more complex but tighter answer still would be:
^[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?(,[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?)*$
"[A-Z]{1,2}[1-9]\d*" is the expression for a single cell reference. If you replace "[A-Z]{1,2}[1-9]\d*" in the above with <cell>, then the complex expression becomes
^<cell>(:<cell>)?(,<cell>(:<cell>)?)*$
which more clearly shows that it is a cell or a range followed by zero or more "cell or range" entries, each preceded by a comma.
The row and column indicators could be further refined to give a tighter still, yet more complex expression. I suspect that the above could be simplified with look-ahead or look-behind assertions, but I admit those are not (yet) my strong suit.
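One way to keep the long expression readable is to assemble it from the single-cell fragment. The sketch below does this in Python for illustration (the question itself uses VB.NET); the names CELL, ENTRY and is_valid are hypothetical, and the 1- or 2-letter column limit is the assumption stated in this answer.

```python
import re

CELL = r"[A-Z]{1,2}[1-9]\d*"       # one cell reference, e.g. A1 or AA2
ENTRY = rf"{CELL}(?::{CELL})?"     # a cell, optionally extended to a range
ADDRESSES = re.compile(rf"{ENTRY}(?:,{ENTRY})*")

def is_valid(addresses):
    """True if the whole string is a comma-separated list of cells or ranges."""
    return ADDRESSES.fullmatch(addresses) is not None

for text in ["A1", "A1:B10", "A1,B6,I60,AA2", "B2:B12,C13:C18,D4,E11000",
             "A1:B2:C3", "A0", "A1,"]:
    print(text, is_valid(text))
```

fullmatch plays the role of the ^ and $ anchors. The last three inputs are rejected: a chained range, a row starting with zero, and a trailing comma.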
I'd go with this one, I think:
(([A-Z]+[1-9]\d*:)?[A-Z]+[1-9]\d*,)*([A-Z]+[1-9]\d*:)?[A-Z]+[1-9]\d*
This only allows capital letters as the prefix. If you want case insensitivity, use RegexOptions.IgnoreCase.
You could simplify this by replacing [A-Z]+[1-9]\d* with plain old [A-Z]\d+, but that will only allow a one-letter prefix, and it also allows stuff like A0 and B01. Up to you.
EDIT:
Having thought hard about DocMax's mention of lookarounds, and using Hans Kesting's answer as inspiration, it occurs to me that this should work:
^[A-Z]+\d+((,|(?<!:\w*):)[A-Z]+\d+)*$
Or if you want something really twisted:
^([A-Z]+\d+(,|$|(?<!:\w*):))*(?<!,|:)
As in the previous example, replace \d+ with [1-9]\d* if you want to prevent leading zeros.
The idea behind the ,|(?<!:\w*): is that if a group is delimited by a comma, you want to let it through; but if it's a colon, it's only allowed if the previous delimiter wasn't a colon. The (,|$|...) version is madness, but it allows you to do it all with only one [A-Z]+\d+ block.
However! Even though this is shorter, and I'll admit I feel a teeny bit clever about it, I pity the poor fellow who has to come along and maintain it six months from now. It's fun from a code-golf standpoint, but I think it's best for practical purposes to go with the earlier version, which is a lot easier to read.
I think your regex is incorrect; try (([A-Za-z0-9]*)[:,]?)*
Edit: to correct the bug pointed out by Baud: (([A-Za-z0-9]*)[:,]?)*([A-Za-z0-9]+)
And finally, the best version: (([A-Za-z]+[0-9]+)[:,]?)*([A-Za-z]+[0-9]+)
// Ah, OK, this probably won't work... but to answer 1: no, I don't think your regex is correct.
( ) form a group
[ ] form a character class (you can use ranges such as A-Z, a-d, 0-9, or just single characters)
? means zero or one
* means zero or more
I'd suggest reading http://www.regular-expressions.info/reference.html .
That's where I learned regexes some time ago ;)
For building expressions I use Rad Software Regular Expression Designer.
Let's build this step by step.
If you are following an Excel addressing format, to match a single-cell entry in your comma-separated list, you would use the regular expression:
[A-Z]{1,2}[1-9]\d*
This matches the following in sequence:
Any character in A to Z once or twice
Any digit in 1 to 9
Any digit zero or more times
The digit expression will prevent inputting a cell address with leading zeros.
To build the expression that allows for a cell address pair, repeat the expression preceded by a colon as optional.
[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?
Now allow for repeating the pattern preceded by a comma zero or more times and add start and end string delimiters.
^[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?(,[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?)*$
Kind of long and obnoxious, I admit, but after trying enough variants, I can't find a way of shortening it.
Hope this is helpful.

Regular Expression to check the spaces and minimum entries in C#

I am using c# for programming!
I want to write a regular expression in C# that rejects a leading or trailing space in a sentence but allows spaces in between. The field must contain a minimum of 2 characters, with no maximum, and no special characters (#, $, etc.) are allowed.
Please suggest!
It's not really clear exactly what you want. Your comment -- contradicting the question itself -- suggests something like this, perhaps...
^[A-Za-z0-9]+(?:\s*[A-Za-z0-9]+)+$
This means that the string must start and end with an alphanumeric, and all characters except the first and last must be either alphanumeric or whitespace.
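A quick way to see what this pattern accepts, sketched in Python for illustration (the pattern itself uses nothing specific to .NET). The name is_valid_entry is made up for the example; fullmatch is used so that a trailing newline is not silently accepted by $.

```python
import re

# The pattern from the answer, unchanged.
ENTRY = re.compile(r"^[A-Za-z0-9]+(?:\s*[A-Za-z0-9]+)+$")

def is_valid_entry(text):
    """True for 2+ characters, alphanumeric at both ends, spaces allowed inside."""
    return ENTRY.fullmatch(text) is not None

for text in ["ab", "hello world", "a", " ab", "ab ", "a#b"]:
    print(repr(text), is_valid_entry(text))
```

The two-character minimum falls out of the structure: the first class needs at least one character and the repeated group needs at least one more.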
