Groups in a C# regular expression - c#

I'm using the following tester to try and figure out this regex:
http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
My input:
123stringA 456 stringB
My pattern:
([0-9]{3})(.*?)
The pattern will eventually be a date but for this question's sake, I'll keep it simple and use my simplified input.
The way I understand this pattern, it says "give me 3 digits ([0-9]{3}), followed by any number of characters of any kind (.*), until it reaches the next match (?)".
What I want/expect out of this test is 2 matches with 2 groups each:
Match 1
Group 1 - 123
Group 2 - stringA
Match2
Group 1 - 456
Group 2 - stringB
For some reason, the tester at the link I provided sees that there is a second group, but it's coming up blank. I have done this with PHP before and it seemed to work as I described, but in C# I'm seeing different results. Any help you can provide would be appreciated.
I should also note that this could span multiple lines...
EDIT *
Here's the actual input:
2011-08-09 09:25:57,069 [9] Orchard.Environment.Extensions.ExtensionManager - Error loading extension 2011-08-09 09:25:57,493 [8] Orchard.Environment.Extensions.ExtensionManager
For match 1 I'm wanting to get:
2011-08-09 09:25:57 and
,069 [9] Orchard.Environment.Extensions.ExtensionManager - Error loading extension
and for match 2:
2011-08-09 09:25:57 and
,493 [8] Orchard.Environment.Extensions.ExtensionManager
I'm trying to find a good way to parse an error log that's in one giant text file, keeping the date each error happened together with the details that went along with it.

The first group matches 3 digits. The second group matches an empty string, because .*? is lazy and nothing after it in the pattern forces it to consume any characters.

.* means match any character zero or more times. The trailing ? makes the quantifier lazy, so it matches as few characters as possible, and here zero is enough.
Try this pattern, ([0-9]{3})([a-zA-Z]*)

According to your comment, this is what you want to match
2011-08-09 09:25:57,069 [9]
Orchard.Environment.Extensions.ExtensionManager - Error loading
extension 2011-08-09 09:25:57,493 [8]
Orchard.Environment.Extensions.ExtensionManager - Error loading
extension
This expression will match the Date in the first capturing group and the rest till the next date OR till the end of the string in the second capturing group.
(\d{4}(?:-\d{2}){2})(.*?)(?=(?:\d{4}(?:-\d{2}){2}|$))
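A minimal C# sketch of how this expression could be applied to the log text from the question. RegexOptions.Singleline is an assumption added here so that the second group can also run across line breaks; the variable names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample log text from the question, with both entries on one line.
string log =
    "2011-08-09 09:25:57,069 [9] Orchard.Environment.Extensions.ExtensionManager - Error loading extension " +
    "2011-08-09 09:25:57,493 [8] Orchard.Environment.Extensions.ExtensionManager";

// Group 1: the date. Group 2: everything up to the next date or the end of the input.
// Singleline (an addition, not part of the answer) lets '.' cross line breaks.
var entryPattern = new Regex(
    @"(\d{4}(?:-\d{2}){2})(.*?)(?=(?:\d{4}(?:-\d{2}){2}|$))",
    RegexOptions.Singleline);

MatchCollection entries = entryPattern.Matches(log);
foreach (Match entry in entries)
{
    Console.WriteLine($"date: {entry.Groups[1].Value} | details: {entry.Groups[2].Value.Trim()}");
}
```

Note that group 1 holds only the date, so the time of day ends up at the start of group 2.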

Not sure why the tool gives you that, but you can switch to this alternative pattern that works in .Net
([0-9]{3})([^0-9]*)
http://regexhero.net/tester/?id=155b8e2b-b851-46b9-8a84-b82f8d6963a1
Explanation:
In your previous pattern, the nongreedy version was matching 0 characters.
In the new one, [^0-9] says match any character other than the range 0-9 (note the negation ^ specifier).
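The difference between the two patterns can be checked with a short C# sketch, using the simplified input from the question (variable names are illustrative):

```csharp
using System;
using System.Text.RegularExpressions;

string input = "123stringA 456 stringB";

// Original pattern: the lazy .*? is satisfied with zero characters, so group 2 is empty.
MatchCollection lazy = Regex.Matches(input, @"([0-9]{3})(.*?)");

// Negated class: [^0-9]* is greedy and runs until the next digit (or the end of the input).
MatchCollection negated = Regex.Matches(input, @"([0-9]{3})([^0-9]*)");

foreach (Match m in lazy)
    Console.WriteLine($"lazy:    '{m.Groups[1].Value}' / '{m.Groups[2].Value}'");
foreach (Match m in negated)
    Console.WriteLine($"negated: '{m.Groups[1].Value}' / '{m.Groups[2].Value.Trim()}'");
```

The negated class also picks up the spaces around each word, which is why the sketch trims group 2 before printing it.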
Update: Given the actual input string (in the comments), the pattern changes to the following (a guess, assuming this is what the OP wants to do):
,([0-9]{3})([^\n]*)
http://regexhero.net/tester/?id=155b8e2b-b851-46b9-8a84-b82f8d6963a1


Regex stop Quantifier on True possible?

I want to stop the quantifier as soon as the condition is true. Does anyone know how?
XXXXXX, 20. September 2017 XXX XXXXXXXXX XX
MwSt. Nummer: CHE-XXX.XXX.XXX p.A. XXXXX XXXXXX XXXXX
Rechnungs Nr.321 XX XXXXX 32
XXXXXX, (?<Date>\d{2}.\s{1,}[A-z]{1,}\s{1,}\d{4})\s{1,}(?<CompanyName>.*)\n(?(?=Rechnungs Nr\.)Rechnungs Nr\.(?<BillNumber>\d{1,})|.*\n){1,}
My target is that:
XXXXXX, (?<Date>\d{2}.\s{1,}[A-z]{1,}\s{1,}\d{4})\s{1,}(?<CompanyName>.*)\n(?(?=Rechnungs Nr\.)Rechnungs Nr\.(?<BillNumber>\d{1,})|.*\n){2}
As you can see, this is not dynamic, and that is the problem. I want the group to repeat as many times as needed; in some cases {2} isn't enough, so I picked {1,}. The problem is that the text following the bill number is then matched too, which is bad for me, because I want to follow this loop with more loops for other text sequences. I only want to match the digits (321 in this example) and then stop evaluating the condition.
Thank you in advance.
As per my comment (see the demo on regex101.com):
XXXXXX,\s*
(?<Date>\d{2}.\s+[A-Za-z]+\s+\d{4})\s+
(?<CompanyName>.*)(?s:.*?)
Rechnungs\ Nr\.(?<BillNumber>\d+)
Broken down this says:
XXXXXX,\s* # XXXXXX, followed by spaces
(?<Date>\d{2}.\s+[A-Za-z]+\s+\d{4})\s+ # your original expression
# followed by at least one space
(?<CompanyName>.*) # rest of the line goes into
# group CompanyName
(?s:.*?) # DOTALL, lazily
Rechnungs\ Nr\.(?<BillNumber>\d+) # Rechnungs Nr.
# followed by digits
Leaving aside some potential optimizations, the main idea is to use
(?s:.*?)
This turns on DOTALL mode for a group, meaning that inside that group the dot matches every character (including newline characters). With the lazy quantifier (.*?) it expands only as far as needed, even across multiple lines.
As an alternative, you could use [\s\S]*?, which combines whitespace and non-whitespace characters and therefore matches any character.
Side note: \s{1,} is the same as \s+, \d{1,} is the same as \d+, and [A-z] includes more characters than [A-Za-z] (it also matches [, \, ], ^, _ and `).
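As a rough C# sketch, the expression above can be run against the sample text from the question. The escaped space in Rechnungs\ Nr suggests free-spacing mode, so RegexOptions.IgnorePatternWhitespace is assumed here; the variable names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

string text =
    "XXXXXX, 20. September 2017 XXX XXXXXXXXX XX\n" +
    "MwSt. Nummer: CHE-XXX.XXX.XXX p.A. XXXXX XXXXXX XXXXX\n" +
    "Rechnungs Nr.321 XX XXXXX 32";

// Free-spacing pattern: whitespace is ignored unless escaped, '#' starts a comment.
var invoice = new Regex(@"
    XXXXXX,\s*
    (?<Date>\d{2}.\s+[A-Za-z]+\s+\d{4})\s+   # date, e.g. 20. September 2017
    (?<CompanyName>.*)                        # rest of the first line
    (?s:.*?)                                  # lazily skip ahead, across lines
    Rechnungs\ Nr\.(?<BillNumber>\d+)         # bill number digits",
    RegexOptions.IgnorePatternWhitespace);

Match m = invoice.Match(text);
Console.WriteLine(m.Groups["Date"].Value);
Console.WriteLine(m.Groups["CompanyName"].Value);
Console.WriteLine(m.Groups["BillNumber"].Value);
```

Because the scoped (?s:.*?) part is lazy, the match stops at the first "Rechnungs Nr." it finds, however many lines lie in between.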
I found a fast and good way:
XXXXXX, (?<Date>\d{2}.\s+[A-z]+\s+\d{4})\s{1,}(?<CompanyName>.*)\n(?(?!Rechnungs Nr\.).*\n)Rechnungs Nr\.(?<BillNumber>\d+)

count a multiple set of string that all appeared in one line

Here is the C# code that tries to find the matching strings in a line of a text file using a regex:
MatchCollection collection = Regex.Matches(readedLine, @"/funcdesc=cls/ && /jobcat=VSO/");
countedChars = collection.Count;
This is the sample textfile content
2016-01-01 d;D;;D;funcdesc=cls&workcode=file&jobcat=VSO&jobcat=DSO;
2016-01-02 d;D;;D;funcdesc=cls&workcode=file&jobcat=DSO&jobcat=DSO;
The expected total count is 1, because line 1 meets the requirement (both "funcdesc=cls" and "jobcat=VSO" are present), whereas line 2 does not: it contains only the first string and no "jobcat=VSO".
Since the order of "funcdesc=cls" and "jobcat=VSO" isn't fixed (i.e., "funcdesc" can come after "jobcat"), you can use the following regex to capture a match for either case. There may be more efficient ways to do it; this is just off the top of my head. Note that .NET patterns are plain strings, so the pattern is written without surrounding / delimiters (they would be treated as literal slashes):
funcdesc=cls.+jobcat=VSO|jobcat=VSO.+funcdesc=cls
The | is a way to say "OR" in a regex, i.e., either "funcdesc=cls" followed by one or more characters (.+) followed by "jobcat=VSO", or "jobcat=VSO" followed by one or more characters followed by "funcdesc=cls".
This will match the following:
2016-01-01 d;D;;D;funcdesc=cls&workcode=file&jobcat=VSO&jobcat=DSO;
or
2016-01-01 d;D;;D;jobcat=VSO&workcode=file&funcdesc=cls&jobcat=DSO;
but will not match
2016-01-02 d;D;;D;funcdesc=cls&workcode=file&jobcat=DSO&jobcat=DSO;
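A small C# sketch of how the count from the question could be computed with this alternation. It tests each line once with IsMatch rather than counting matches, and it writes the pattern without slashes because .NET takes patterns as plain strings; the array stands in for the lines read from the file.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Stand-in for the lines read from the text file.
string[] lines =
{
    "2016-01-01 d;D;;D;funcdesc=cls&workcode=file&jobcat=VSO&jobcat=DSO;",
    "2016-01-02 d;D;;D;funcdesc=cls&workcode=file&jobcat=DSO&jobcat=DSO;",
};

// Either order of the two required strings on the same line.
var both = new Regex(@"funcdesc=cls.+jobcat=VSO|jobcat=VSO.+funcdesc=cls");

// Count the lines that contain both strings.
int countedLines = lines.Count(line => both.IsMatch(line));
Console.WriteLine(countedLines);
```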
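If funcdesc always comes before jobcat, the pattern funcdesc=cls.*jobcat=VSO is enough.
Also, $ in a regex anchors the end of the input (or the end of a line with RegexOptions.Multiline).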

Regex Parsing Beginning of Line

I have a string and I would like to parse it using regular expression. .. indicates the category name and everything after : is the content for that category.
Below is the full string I'm trying to parse:
..NAME: JOHN
..BDAY: 1/1/2010
..NOTE: 1. some note 1
2. some note 2
3. some note 3
..DATE: 6/3/2014
I'm trying to parse it so that
(group 1)
..NAME: JOHN
(group 2)
..BDAY: 1/1/2010
(group 3)
..NOTE: 1. some note 1
2. some note 2
3. some note 3
(group 4)
..DATE: 6/3/2014 //a.k.a update date
The regular expression pattern I use is
\.\.[A-Z0-9]{2,4}:.*
which gives me only ..NOTE: 1. some note 1 for (group 3), missing the content on the second and third lines.
How can I modify my pattern so I can get the correct grouping?
. matches any character except a newline by default. Use RegexOptions.Singleline in C# (or the s modifier in PCRE; Ruby calls the equivalent modifier m) to make it match newlines too.
You will need to make your .* lazy so that it stops at the next .. or at the end of the string ($), so that you don't match everything the first time. Also, . has no special meaning inside a character class, so your expression may end up looking cleaner like this:
[.]{2}[A-Z0-9]{2,4}:.*?(?=[.]{2}|$)
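A short C# sketch applying this pattern with RegexOptions.Singleline to the string from the question (variable names are illustrative):

```csharp
using System;
using System.Text.RegularExpressions;

string record =
    "..NAME: JOHN\n" +
    "..BDAY: 1/1/2010\n" +
    "..NOTE: 1. some note 1\n" +
    "2. some note 2\n" +
    "3. some note 3\n" +
    "..DATE: 6/3/2014";

// Singleline lets '.' match newlines; the lazy .*? stops before the next ".." or at the end.
MatchCollection fields = Regex.Matches(
    record,
    @"[.]{2}[A-Z0-9]{2,4}:.*?(?=[.]{2}|$)",
    RegexOptions.Singleline);

foreach (Match field in fields)
{
    // Each match keeps its trailing newline, so trim it before printing.
    Console.WriteLine($"[{field.Value.TrimEnd()}]");
}
```

Each category comes back as its own match here, rather than as a numbered group within one match.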
I managed to achieve it with the negative lookahead for [.]{2}:
[.]{2}[A-Z0-9]{2,4}:(.*\n?(?![.]{2}))*

Improving my failing regex

My regex was working - until the form of the string it was capturing slightly changed. It used to always be of the form :
Word1 - Word2 - 01.2.3456.7890 - xx-xx - Word 3 [Word-inbracket]
Where I was interested in capturing the xx-xx.
For capturing this data, the following regex worked :
(.+\s*-\s*.+\s*-\s*.+)\s*-\s*(\w{1,3}\s*-\s*\w{1,3})\s*-\s*.+
Selecting groups[2] from it.
Now, however, the string has changed form so that sometimes there is another dash, and another set of letters between 1 and 4 characters after the xx-xx. (Remember, this only happens sometimes).
So, now I also need to capture the info where it is of the form :
Word1 - Word2 - 01.2.3456.7890 - xx-xx-XxxX - Word 3 [Word-inbracket]
Word1 - Word2 - 01.2.3456.7890 - xXX-XxX-xxxx - Word 3 [Word-inbracket]
Etc.
How can I edit my regex to capture this string in addition to the ones that were previously caught? What is the cleanest way to do this ?
A little hacky but that will do the trick:
(.+\s*-\s*.+\s*-\s*.+)\s*-\s*((\w{1,3}\s*-\s*\w{1,3})|(\w{1,4}\s*-\s*\w{1,4}))\s*-\s*.+
Based on the input lines, a more simplified approach could be taken altogether.
The following regex matches both cases and should also keep working if the modified part changes again.
([^-]*-){3}\s*([^\s]+).*
The repeated first group consumes "Word1 - Word2 - 01.2.3456.7890 -" (its final value is only the last repetition, " 01.2.3456.7890 -"), and the second group then captures "xx-xx-XxxX".
Also note, I'm going off of the assumption that the second desired group does not contain spaces as the example strings do not have them.
Explained:
([^-]*-){3} # captures the "word1 - word2 - word3.234.234 -" block
\s*
([^\s]+) # captures the "xx-xx-xxx" block up to the first whitespace char.
.* # matches the rest of the line
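As a quick check, a C# sketch running this pattern over the three shapes of input from the question and reading the second group (variable names are illustrative):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string[] samples =
{
    "Word1 - Word2 - 01.2.3456.7890 - xx-xx - Word 3 [Word-inbracket]",
    "Word1 - Word2 - 01.2.3456.7890 - xx-xx-XxxX - Word 3 [Word-inbracket]",
    "Word1 - Word2 - 01.2.3456.7890 - xXX-XxX-xxxx - Word 3 [Word-inbracket]",
};

// Skip three dash-terminated blocks, then capture everything up to the next whitespace.
var codePattern = new Regex(@"([^-]*-){3}\s*([^\s]+).*");

string[] codes = samples
    .Select(sample => codePattern.Match(sample).Groups[2].Value)
    .ToArray();

foreach (string code in codes)
    Console.WriteLine(code);
```

This relies on the stated assumption that the wanted part contains no spaces and that the first three blocks contain no dashes of their own.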
I believe this should do it:
(.+?\s*-\s*.+?\s*-\s*.+?)\s*-\s*(\w{1,3}\s*-\s*\w{1,3})\s*(?:-(\w{1,4}))?\s*-\s*.+
The changes I've made are:
Made the "any characters" matches at the beginning non-greedy by adding '?' after them; this stops them gobbling up too much when the extra bit is present.
Added '(?:-(\w{1,4}))?', which matches the optional extra bit if it is present, but without capturing the '-' prefix (the '?:' makes the outer group non-capturing). The inner quantifier is {1,4} because the extra bit can be up to 4 characters long.
This will give you an extra capturing group that includes the optional bit.
This is clearer: .+\s-\s(.+)\s-\s.+$

C# Regex Replace weird behavior with multiple captures and matching at the end of string?

I'm trying to write something that formats Brazilian phone numbers, but I want it to match from the end of the string rather than the beginning, so that it would transform input strings according to the following pattern:
"5135554444" -> "(51) 3555-4444"
"35554444" -> "3555-4444"
"5554444" -> "555-4444"
Since the beginning portion is what usually changes, I thought of building the match using the $ sign so it would start at the end and then capture backwards (so I thought), replacing the match with the desired format, and afterwards just getting rid of the parentheses "()" in front if they were empty.
This is the C# code:
s = "5135554444";
string str = Regex.Replace(s, @"\D", ""); //Get rid of non digits, if any
str = Regex.Replace(str, @"(\d{0,2})(\d{0,4})(\d{1,4})$", "($1) $2-$3");
return Regex.Replace(str, @"^\(\) ", ""); //Get rid of empty () at the beginning
The return value was as expected for a 10 digit number. But for anything less than that, it ended up showing some strange behavior. These were my results:
"5135554444" -> "(51) 3555-4444"
"35554444" -> "(35) 5544-44"
"5554444" -> "(55) 5444-4"
It seems to ignore the $ at the end when matching, except that if I test with fewer than 7 digits it goes like this:
"554444" -> "(55) 444-4"
"54444" -> "(54) 44-4"
"4444" -> "(44) 4-4"
Notice that the third capture group always takes only its minimum number of digits from the end, while the first two groups capture from the beginning, as if the last group were non-greedy from the end and took just the minimum. Is that weird, or is it me?
Now, if I change the pattern so that the third capture uses {4} instead of {1,4}, these are the results:
str = Regex.Replace(str, @"(\d{0,2})(\d{0,4})(\d{4})$", "($1) $2-$3");
"5135554444" -> "(51) 3555-4444" //As expected
"35554444" -> "(35) 55-4444" //The last four are as expected, but "35" as $1?
"54444" -> "(5) -4444" //Again "4444" in $3, why nothing in $2 and "5" in $1?
I know this is probably some stupidity of mine, but if I want to capture at the end of the string, wouldn't it be more reasonable for all previous capture groups to be captured in reverse order?
I would think that "54444" would turn into "5-4444" in this last example, but it does not.
How would one accomplish this?
(I know there may be a better way to accomplish the very same thing using a different approach, but what I'm really curious about is why this particular behavior of the regex seems odd. So the answer to this question should focus on explaining why the last capture is anchored at the end of the string and the others are not, as demonstrated in this example. I'm not particularly interested in the actual phone number formatting problem, but in understanding the regex syntax.)
Thanks...
So you want the third part to always have four digits, the second part zero to four digits, and the first part zero to two digits, but only if the second part contains four digits?
Use
^(\d{0,2}?)(\d{0,4})(\d{4})$
As a C# snippet, commented:
resultString = Regex.Replace(subjectString,
    @"^          # anchor the search at the start of the string
    (\d{0,2}?)   # match as few digits as possible, maximum 2
    (\d{0,4})    # match up to four digits, as many as possible
    (\d{4})      # match exactly four digits
    $            # anchor the search at the end of the string",
    "($1) $2-$3", RegexOptions.IgnorePatternWhitespace);
By adding a ? to a quantifier (??, *?, +?, {a,b}?) you make it lazy, i.e. you tell it to match as few characters as possible while still allowing an overall match to be found.
Without the ? in the first group, what would happen when trying to match 123456?
First, the \d{0,2} matches 12.
Then, the \d{0,4} matches 3456.
Then, the \d{4} doesn't have anything left to match, so the regex engine backtracks until that's possible again. After four steps, the \d{4} can match 3456. The \d{0,4} gives up everything it had matched greedily for this.
Now, an overall match has been found - no need to try any more combinations. Therefore, the first and third groups will contain parts of the match.
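The effect of the lazy first group can be seen in a small C# sketch. FormatPhone is a hypothetical helper name; its three steps mirror the code in the question, with the anchored lazy pattern swapped in for the middle step.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: strip non-digits, format from the end, drop an empty "() " prefix.
static string FormatPhone(string s)
{
    string digits = Regex.Replace(s, @"\D", "");
    string formatted = Regex.Replace(digits, @"^(\d{0,2}?)(\d{0,4})(\d{4})$", "($1) $2-$3");
    return Regex.Replace(formatted, @"^\(\) ", "");
}

// For comparison: the greedy, unanchored pattern from the question fills group 1 first.
string greedy = Regex.Replace("35554444", @"(\d{0,2})(\d{0,4})(\d{4})$", "($1) $2-$3");

Console.WriteLine(greedy);
Console.WriteLine(FormatPhone("5135554444"));
Console.WriteLine(FormatPhone("35554444"));
Console.WriteLine(FormatPhone("5554444"));
```

With the lazy quantifier, the first group only takes digits when the other two groups cannot account for the whole string on their own.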
You have to tell it that it's OK if the first matching groups aren't there, but not the last one:
(\d{0,2}?)(\d{0,4}?)(\d{1,4})$
Matches your examples properly in my testing.
