Regex Parsing Beginning of Line - c#

I have a string that I would like to parse using a regular expression. .. indicates the category name, and everything after : is the content for that category.
Below is the full string I'm trying to parse:
..NAME: JOHN
..BDAY: 1/1/2010
..NOTE: 1. some note 1
2. some note 2
3. some note 3
..DATE: 6/3/2014
I'm trying to parse it so that
(group 1)
..NAME: JOHN
(group 2)
..BDAY: 1/1/2010
(group 3)
..NOTE: 1. some note 1
2. some note 2
3. some note 3
(group 4)
..DATE: 6/3/2014 //a.k.a update date
The regular expression pattern I use is
\.\.[A-Z0-9]{2,4}:.*
which makes (group 3) ..NOTE: 1. some note 1, missing the content on the second and third lines.
How can I modify my pattern so I can get the correct grouping?

. matches any character except a newline by default. Use RegexOptions.Singleline in C# (or the s modifier in PCRE; Ruby calls the equivalent option m) to make it match newlines too.
You will need to make your .* lazy so that it stops at the next .. or at the end of the string ($); otherwise it matches everything the first time. Also, . doesn't have any special meaning in a character class, so your expression may end up looking cleaner like this:
[.]{2}[A-Z0-9]{2,4}:.*?(?=[.]{2}|$)
Demos: Regex and C#
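A minimal C# sketch of the pattern above, assuming the record from the question is held in a single string with \n line breaks. The class and method names (RecordSplitter, Split) are made up for illustration. Each match still carries the line break that precedes the next .., so the sketch trims it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RecordSplitter
{
    // Pattern from the answer above; Singleline lets '.' match '\n'.
    private static readonly Regex Section = new Regex(
        @"[.]{2}[A-Z0-9]{2,4}:.*?(?=[.]{2}|$)",
        RegexOptions.Singleline);

    public static List<string> Split(string input)
    {
        var sections = new List<string>();
        foreach (Match m in Section.Matches(input))
        {
            // Drop the trailing newline left before the next "..".
            sections.Add(m.Value.TrimEnd());
        }
        return sections;
    }

    public static void Main()
    {
        string input =
            "..NAME: JOHN\n..BDAY: 1/1/2010\n" +
            "..NOTE: 1. some note 1\n2. some note 2\n3. some note 3\n" +
            "..DATE: 6/3/2014";

        foreach (string section in Split(input))
        {
            Console.WriteLine("[" + section + "]");
        }
    }
}
```

One caveat with this approach: the lookahead stops at any two consecutive dots, so content that itself contains .. would end a section early.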

I managed to achieve it with the negative lookahead for [.]{2}:
[.]{2}[A-Z0-9]{2,4}:(.*\n?(?![.]{2}))*

Related

Regex which matches URN by rfc8141

I am struggling to find a Regex which could match a URN as described in rfc8141.
I have tried this one:
\A(?i:urn:(?!urn:)(?<nid>[a-z0-9][a-z0-9-]{1,31}):(?<nss>(?:[a-z0-9()+,-.:=#;$_!*']|%[0-9a-f]{2})+))\z
but this one only matches the first part of the URN without the components.
For example, let's say we have the corresponding URN: urn:example:a123,0%7C00~&z456/789?+abc?=xyz#12/3. We should match the following groups:
NID - example
NSS - a123,0%7C00~&z456/789 (from the last ':' till we match '?+' or '?=' or '#')
r-component - abc (from '?+' till '?=' or '#')
f-component - 12/3 (from '#' till end)
I haven't read all the specifications, so there may be other rules to implement, but it should put you on the way for the optional components:
\A(?i:urn:(?!urn:)(?<nid>[a-z0-9][a-z0-9-]{1,31}):(?<nss>(?:[-a-z0-9()+,.:=#;$_!*'&~\/]|%[0-9a-f]{2})+)(?:\?\+(?<rcomponent>.*?))?(?:\?=(?<qcomponent>.*?))?(?:#(?<fcomponent>.*?))?)\z
explanations:
(?<nss>(?:[-a-z0-9()+,.:=#;$_!*'&~\/]|%[0-9a-f]{2})+) : The - has been moved to the beginning of the list to be considered in the allowed chars, or else it means "range from , to .". The characters &, ~ and / (has to be escaped with "\") have also been added to the list, or else it won't match your example.
optional components: (?:\?\+(?<rcomponent>.*?))? : inside an optional non-capturing group (?:)? to prevent capturing the identifier (the ?+, ?= and # part). The chars ? and + have to be escaped with "\". Will capture anything (.) but in lazy mode (*?) or else the first component found would capture everything until the end of the string.
See working example in Regex101
Hope that helps
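As a rough C# sketch, the pattern above can be used with named groups like this. The class name UrnParser is hypothetical, and the pattern is copied from the answer as-is, so it is not a complete RFC 8141 validator.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrnParser
{
    // Pattern from the answer above, split across lines for readability.
    private static readonly Regex Urn = new Regex(
        @"\A(?i:urn:(?!urn:)(?<nid>[a-z0-9][a-z0-9-]{1,31}):" +
        @"(?<nss>(?:[-a-z0-9()+,.:=#;$_!*'&~\/]|%[0-9a-f]{2})+)" +
        @"(?:\?\+(?<rcomponent>.*?))?" +
        @"(?:\?=(?<qcomponent>.*?))?" +
        @"(?:#(?<fcomponent>.*?))?)\z");

    public static Match Parse(string urn)
    {
        return Urn.Match(urn);
    }

    public static void Main()
    {
        Match m = Parse("urn:example:a123,0%7C00~&z456/789?+abc?=xyz#12/3");
        Console.WriteLine("matched: " + m.Success);
        foreach (string name in new[] { "nid", "nss", "rcomponent", "qcomponent", "fcomponent" })
        {
            // Groups that did not participate in the match have an empty Value.
            Console.WriteLine(name + " = " + m.Groups[name].Value);
        }
    }
}
```

Note that # is still listed in the nss character class. For a URN that has an f-component but no r- or q-component, such as urn:example:foo#bar, the greedy nss group therefore takes foo#bar and fcomponent stays empty. Removing # from that class would be one way to avoid this.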
If you want to validate a string as a Uniform Resource Name (URN) per RFC 8141, you can refer to URN8141Test.java and URN8141.java.
It has been used in our team for a few years.

Regex Group Optional

I have the following regular expression that isn't working the way I thought it would.
("^\\d{2}(?:\\d{2})?\\.\\d{2}(\\.\\d{2-4})?$");
I am trying to match a string that starts with either 2 or 4 digits, followed by a period, followed by 2 digits and then optionally another period and either 2 or 4 digits.
I would expect 33.44.4444 to work, as would 33.33 but anytime I have a string that has a 2nd period, my expression fails.
What am I doing wrong ?
Your regex is correct for what you want to do except for the {2-4} part. If you use {2,4}, it will match the 2 to 4 digits you're looking for.
("^\\d{2}(?:\\d{2})?\\.\\d{2}(\\.\\d{2,4})?$");
Hope it helps.
As others have pointed out the syntax {2-4} is incorrect. Use {2,4} to specify a range of occurrences. But also if you only want 2 or 4 (not 3) I would use this regex:
@"^(\d{2}|\d{4})\.\d{2}(\.(\d{2}|\d{4}))?$"
The way you expressed "either two or four digits" in the first section of your expression is correct:
\\d{2}(?:\\d{2})?
The second part does it incorrectly:
(\\.\\d{2-4})?
Copy the first part into the second to fix the problem:
("^\\d{2}(?:\\d{2})?\\.\\d{2}(\\.\\d{2}(?:\\d{2})?)?$");
Demo.
You can use this regex:
^\d{2}(?:\d{2})?\.\d{2}(?:\.\d{2}(?:\d{2})?)?$
\d{2-4} will match {2-4} text literally.
RegEx Demo
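To see the difference between the {2,4} fix and the "2 or 4 only" variants side by side, here is a small C# sketch. The class name VersionFormat and the names Strict and Loose are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VersionFormat
{
    // 2 or 4 digits, '.', 2 digits, then optionally '.' and 2 or 4 digits.
    private static readonly Regex Strict = new Regex(
        @"^\d{2}(?:\d{2})?\.\d{2}(?:\.\d{2}(?:\d{2})?)?$");

    // Same shape, but {2,4} also lets 3 digits through in the last part.
    private static readonly Regex Loose = new Regex(
        @"^\d{2}(?:\d{2})?\.\d{2}(\.\d{2,4})?$");

    public static bool IsStrict(string s) { return Strict.IsMatch(s); }
    public static bool IsLoose(string s) { return Loose.IsMatch(s); }

    public static void Main()
    {
        foreach (string s in new[] { "33.33", "33.44.4444", "3333.33.22", "33.44.444" })
        {
            Console.WriteLine(s + ": strict=" + IsStrict(s) + ", loose=" + IsLoose(s));
        }
    }
}
```

Only the last sample, 33.44.444, separates the two patterns: the loose one accepts it and the strict one rejects it.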

How to match a string, but only if the same string has not already been matched with or without dashes?

I have a case I'm trying to match using regular expressions.
My current expression will match a string in a certain format with or without dashes. I would like to add it to match only if the string has not been matched before, with or without the dashes. For example, take the following cases:
1. 1234-56-789-5555
2. 1234567895555
3. 0000-99-888-3333
4. 1111223334444
If the four examples above appeared in this same order in a list, document, or whatever, I would want to capture only (1, 3, 4). I want to skip #2 since it was already captured by #1, but with the dashes. If #2 had come first, I would similarly have wanted to skip #1.
Here's the current expression I'm using:
\d\d\d\d-*\d\d-*\d\d\d-*\d\d\d\d
I tried to read up on look behinds (I'm fairly inexperienced with Regex) but I only really understand that a look behind only checks if certain text is matched previously. I'm not sure if what I want can be combined with this; I only see how to check for specific text, not for the current value with/without dashes.
I'm currently doing this with C# logic, but am trying to see if it can be done purely in Regex. If it can't be done, that's fine; I'm just trying to beef up my Regex knowledge in this case.
Is this possible -- how can I accomplish this?
If you want to obtain just the first occurrence of each number (answering "I want to skip #2 since it was already captured by #1, but with the dashes"), you need a negative look-behind with the RegexOptions.RightToLeft and RegexOptions.Singleline options:
(?<!\b\1-?\2-?\3-?\4\b.*)\b(\d{4})-?(\d{2})-?(\d{3})-?(\d{4})\b
The \b(\d{4})-?(\d{2})-?(\d{3})-?(\d{4})\b subpattern is the number with capture groups to check for their presence regardless of the hyphens earlier in the string.
The (?<!\b\1-?\2-?\3-?\4\b.*) subpattern look-behind is checking if we have no other occurrences of the same string.
Tested at regexhero.net and in Expresso.
You can easily do this without using regex, but if you still want to use regex for this purpose, you can use the following to match:
(?<=((\d{4})-(\d{2})-(\d{3})-(\d{4})).*?)\2\3\4\5
And replace with '' (empty string)
Explanation:
This will match all those digits without dashes which are already captured by digits with dashes
So, in your cases 1, 2, 3 and 4, instead of matching types 1, 3 and 4 it matches type 2; replace that with '' (nothing) and you are left with 1, 3 and 4.
See demo here
You can use the following regex to do exactly what you want:
((?<=((\d{4})-(\d{2})-(\d{3})-(\d{4})).*?)(?!\3\4\5\6)\d{13})|(((?<=((\d{4})(\d{2})(\d{3})(\d{4})).*?)(?!\10-\11-\12-\13)((\d{4})-(\d{2})-(\d{3})-(\d{4}))))
Explanation:
((?<=((\d{4})-(\d{2})-(\d{3})-(\d{4})).*?)(?!\3\4\5\6)\d{13}) matches all those \d{13} that have not previously occurred with dashes in between them (this excludes strings of type 2 in your case)
((\d{4})-(\d{2})-(\d{3})-(\d{4})) and match all of this pattern
Matches 1, 3 and 4 in your case.
See DEMO
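For comparison, here is a minimal sketch of the non-regex-only route mentioned in this thread: find every number with a simple pattern, strip the dashes, and keep a value only the first time its dash-free form is seen. The class name FirstOccurrences is hypothetical, and the pattern allows at most one dash per gap, which is slightly stricter than the -* in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FirstOccurrences
{
    // Same digit layout as the question's pattern, with optional single dashes.
    private static readonly Regex Number = new Regex(
        @"\b\d{4}-?\d{2}-?\d{3}-?\d{4}\b");

    public static List<string> Find(string text)
    {
        var seen = new HashSet<string>();
        var firsts = new List<string>();
        foreach (Match m in Number.Matches(text))
        {
            // Compare numbers by their dash-free form.
            string key = m.Value.Replace("-", "");
            if (seen.Add(key))
            {
                firsts.Add(m.Value);
            }
        }
        return firsts;
    }

    public static void Main()
    {
        string text = "1234-56-789-5555\n1234567895555\n0000-99-888-3333\n1111223334444";
        foreach (string number in Find(text))
        {
            Console.WriteLine(number);
        }
    }
}
```

HashSet<string>.Add returns false when the key is already present, which is what makes the "first occurrence wins" rule work regardless of whether the dashed or the plain form comes first.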

Regex to parse formatter string

I am writing a string.Format-like method. In order to do this, I am adopting Regex to determine commands and parameters: e.g. Format(@"\m{0,1,2}", byteArr0, byteArr1, byteArr2)
For the first Regex, return 2 groups:
'\m'
'{0,1,2}'
Another Regex takes the value of '{0,1,2}' and has 3 matches:
0
1
2
These values are the indexes corresponding to the byteArr params.
This command structure is likely to grow, so I'm really trying to figure this out and learn enough to be able to modify the Regex for future requirements. I would think that a single Regex would do all of the above, but there is value in having 2 separate Regex(es/ices ???) expressions.
Anyway, to get the first group '\m' the Regex is:
"(\\)(\w{1,1})" // I want the '{0,1,2}' group also
To get the integer matches '{0,1,2}' I was trying:
"(?<=\{)([^}]*)(?=\})"
I am having difficulty in achieving: (1) 2 groups on the first expression and (2) 3 matches on the integers within the braces delimited by a comma in the second expression.
Your first regex (\\)(\w{1,1}) can be greatly simplified.
You don't want to capture the \ separately from the m, so there's no need to wrap them in their own sets of parentheses.
\w{1,1} is the same as just \w.
So we have \\\w to match the first part \m.
Now to deal with the second part, really we can ignore everything other than the 0,1,2 in the example since there are no numbers elsewhere so you'd just use: \d+ and iterate through the matches.
But let's assume the example could actually be \9{1,2,3}.
Now \d+ would match the 9 so to avoid this we could use [{,](\d+)[,}]. This says capture a number that has either a , or { on the left of it and a , or } on the right.
You're right in saying that we can match the whole string with a single regex, something like this would do it:
(\\\w){((\d+),?)+}
However the problem with this is when you examine the contents of the capture groups afterwards, the last number caught by the (\d+) overwrites all the other values that were caught in there. So you'd be left with group 1: \m and group 2: 2 for your example.
With that in mind I recommend using 2 regexes:
For the 1st part: \\\w
For the numbers: I'd forget about the [{,](\d+)[,}] (and the many other ways you could do it), the cleanest way might just be to grab whatever is inside the {...} and then match with a simple \d+.
So to do this first use (\\\w)\{([^/}]+)\} to grab the \m into group 1 and the 1,2,3 into group 2, then just use \d+ on that.
FYI, your (?<=\{)([^}]*)(?=\}) works fine, but you can't put anything before the lookbehind, i.e. the \\\w. In the vast majority of cases where a lookbehind can be used, you can do what you want by just using capture groups and ignoring everything else:
My regex \{([^/}]+)\} is pretty much the same as your (?<=\{)([^}]*)(?=\}), except that rather than looking ahead and behind for the { and }, I just leave them outside the capture group that is going to be used.
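A note specific to .NET: Group.Value does hold only the last repetition of a repeated group, but the same Group also exposes a Captures collection that keeps every repetition. A single-regex version is therefore possible in C#. The sketch below assumes the command is one word character followed by a braced, comma-separated list of integers; FormatCommand and Parse are made-up names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FormatCommand
{
    // Group 1: the command such as \m. Group 2: one capture per index.
    private static readonly Regex Command = new Regex(
        @"(\\\w)\{(?:(\d+),?)+\}");

    public static (string Name, List<int> Indexes) Parse(string format)
    {
        Match m = Command.Match(format);
        if (!m.Success)
        {
            throw new FormatException("No command found in: " + format);
        }

        var indexes = new List<int>();
        // Captures lists every repetition of group 2, not just the last one.
        foreach (Capture c in m.Groups[2].Captures)
        {
            indexes.Add(int.Parse(c.Value));
        }
        return (m.Groups[1].Value, indexes);
    }

    public static void Main()
    {
        var (name, indexes) = Parse(@"\m{0,1,2}");
        Console.WriteLine(name + " -> " + string.Join(", ", indexes));
    }
}
```

The two-regex approach stays the more portable choice, since most other regex engines do not keep per-repetition captures.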
Consider the following Regexes...
(^.*?)(?={.*})
\d+
Good Luck!

Groups in a C# regular expression

I'm using the following tester to try and figure out this regex:
http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
My input:
123stringA 456 stringB
My pattern:
([0-9]{3})(.*?)
The pattern will eventually be a date but for this question's sake, I'll keep it simple and use my simplified input.
The way I understand this pattern, it's "give me 3 numbers [0-9]{3}, followed by any number of characters of any kind .*, until it reaches the next match ?".
What I want/expect out of this test is 2 matches with 2 groups each:
Match 1
Group 1 - 123
Group 2 - stringA
Match 2
Group 1 - 456
Group 2 - stringB
For some reason, the tester at the link I provided sees that there is a second group, but it's coming up blank. I have done this with PHP before and it seemed to work as I described, but in C# I'm seeing different results. Any help you can provide would be appreciated.
I should also note that this could span multiple lines...
EDIT
Here's the actual input:
2011-08-09 09:25:57,069 [9] Orchard.Environment.Extensions.ExtensionManager - Error loading extension 2011-08-09 09:25:57,493 [8] Orchard.Environment.Extensions.ExtensionManager
For match 1 I'm wanting to get:
2011-08-09 09:25:57 and
,069 [9] Orchard.Environment.Extensions.ExtensionManager - Error loading extension
and for match 2:
2011-08-09 09:25:57 and
,493 [8] Orchard.Environment.Extensions.ExtensionManager
I'm trying to find a good way to parse an error log file that's in one giant text file and keep the date each error happened together with the details that went along with it.
The first group matches 3 digits, and the second group matches an empty string, because nothing in the pattern after .*? forces it to match more than that.
.* means match any character zero or more times. The ? makes it lazy, meaning it matches as few characters as possible, so it settles on zero.
Try this pattern, ([0-9]{3})([a-zA-Z]*)
According to your comment, this is what you want to match
2011-08-09 09:25:57,069 [9]
Orchard.Environment.Extensions.ExtensionManager - Error loading
extension 2011-08-09 09:25:57,493 [8]
Orchard.Environment.Extensions.ExtensionManager - Error loading
extension
This expression will match the Date in the first capturing group and the rest till the next date OR till the end of the string in the second capturing group.
(\d{4}(?:-\d{2}){2})(.*?)(?=(?:\d{4}(?:-\d{2}){2}|$))
See it here on Regexr
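The question asks for the date and time together in group 1, while the expression above captures only the date. A possible C# adaptation is sketched below, assuming every entry starts with a yyyy-MM-dd HH:mm:ss stamp. LogSplitter and Stamp are hypothetical names, and RegexOptions.Singleline is added so that details spanning several lines stay in one entry.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LogSplitter
{
    // Assumed timestamp shape: 2011-08-09 09:25:57
    private const string Stamp = @"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}";

    // Group 1: timestamp. Group 2: everything up to the next timestamp or the end.
    private static readonly Regex Entry = new Regex(
        "(" + Stamp + ")(.*?)(?=" + Stamp + @"|\z)",
        RegexOptions.Singleline);

    public static List<(string When, string Details)> Split(string log)
    {
        var entries = new List<(string When, string Details)>();
        foreach (Match m in Entry.Matches(log))
        {
            entries.Add((m.Groups[1].Value, m.Groups[2].Value.Trim()));
        }
        return entries;
    }

    public static void Main()
    {
        string log =
            "2011-08-09 09:25:57,069 [9] Orchard.Environment.Extensions.ExtensionManager - Error loading extension " +
            "2011-08-09 09:25:57,493 [8] Orchard.Environment.Extensions.ExtensionManager";

        foreach (var entry in Split(log))
        {
            Console.WriteLine(entry.When + " | " + entry.Details);
        }
    }
}
```

A message that itself contains something shaped like a timestamp would be split early, so this only suits logs where that cannot happen.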
Not sure why the tool gives you that, but you can switch to this alternative pattern that works in .Net
([0-9]{3})([^0-9]*)
http://regexhero.net/tester/?id=155b8e2b-b851-46b9-8a84-b82f8d6963a1
Explanation:
In your previous pattern, the nongreedy version was matching 0 characters.
In the new one, [^0-9] says match any character other than the range 0-9 (note the negation ^ specifier).
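The difference between the lazy group and the negated character class can be checked with a short C# sketch using the simplified input from the question. LazyDemo and SecondGroups are made-up names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyDemo
{
    // Returns the text captured by group 2 for every match of the pattern.
    public static string[] SecondGroups(string pattern, string input)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        var result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            result[i] = matches[i].Groups[2].Value;
        }
        return result;
    }

    public static void Main()
    {
        string input = "123stringA 456 stringB";
        foreach (string pattern in new[] { @"([0-9]{3})(.*?)", @"([0-9]{3})([^0-9]*)" })
        {
            // Print the group 2 values between brackets, separated by '|'.
            Console.WriteLine(pattern + " -> [" + string.Join("|", SecondGroups(pattern, input)) + "]");
        }
    }
}
```

With the lazy pattern both matches have an empty second group. With the negated class the second group keeps the surrounding spaces ("stringA " and " stringB"), so a Trim() may be wanted afterwards.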
Update: Given the actual input string (in comments), the pattern changes to the following (it's a guess at what the OP wants to do):
,([0-9]{3})([^\n]*)
http://regexhero.net/tester/?id=155b8e2b-b851-46b9-8a84-b82f8d6963a1
