Problem with backreferences in C#'s regex

The goal is to extract time and date strings from this:
<strong>Date</strong> - Thursday, June 2 2011 9:00PM<br>
Here's the code:
Match m = Regex.Match(line, "<strong>Date</strong> - (.*) (.*)<br>");
date = m.Captures[0].Value;
time = m.Captures[1].Value;
Thanks to the regex being greedy, it should match the first group all the way up to the last space. But it doesn't. Captures[0] is the whole line and Captures[1] is out of range. Why?

Use Groups, not Captures. Your results will be in Groups[1] and Groups[2].
And personally, I'd recommend naming the groups:
Match m = Regex.Match(line, "<strong>Date</strong> - (?<date>.*) (?<time>.*)<br>");
if (m.Success)
{
    date = m.Groups["date"].Value;
    time = m.Groups["time"].Value;
}
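To make the difference between Groups and Captures concrete, here is a minimal sketch. The class name DateTimeExtractor and both method names are hypothetical, chosen only for illustration. It relies on the fact that Match.Captures only lists the captures of group 0 (the whole match), which is why Captures[1] was out of range in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DateTimeExtractor
{
    // Same pattern as in the answer above, with named groups.
    private const string Pattern = @"<strong>Date</strong> - (?<date>.*) (?<time>.*)<br>";

    // Hypothetical helper: returns (date, time), or null when the line does not match.
    public static Tuple<string, string> Extract(string line)
    {
        Match m = Regex.Match(line, Pattern);
        if (!m.Success)
        {
            return null;
        }
        // Groups[0] is the whole match; the named groups hold the two parts.
        return Tuple.Create(m.Groups["date"].Value, m.Groups["time"].Value);
    }

    // Number of entries in Match.Captures: these are the captures of group 0 only.
    public static int CaptureCount(string line)
    {
        return Regex.Match(line, Pattern).Captures.Count;
    }
}
```

Calling DateTimeExtractor.Extract on the line from the question should give the date in Item1 and the time in Item2, while CaptureCount returns 1 for the same line.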

Related

Match only the nth occurrence using a regular expression

I have a string with 3 dates in it like this:
XXXXX_20160207_20180208_XXXXXXX_20190408T160742_xxxxx
I want to select the 2nd date in the string, the 20180208 one.
Is there a way to do this purely in the regex, without having to resort to pulling out the 2nd match in code? I'm using C# if that matters.
Thanks for any help.

You could use
^(?:[^_]+_){2}(\d+)
And take the first group, see a demo on regex101.com.
Broken down, this says
^ # start of the string
(?:[^_]+_){2} # not _ + _, twice
(\d+) # capture digits
C# demo:
var pattern = @"^(?:[^_]+_){2}(\d+)";
var text = "XXXXX_20160207_20180208_XXXXXXX_20190408T160742_xxxxx";
var result = Regex.Match(text, pattern)?.Groups[1].Value;
Console.WriteLine(result); // => 20180208

Try this one
MatchCollection matches = Regex.Matches(sInputLine, @"\d{8}");
string sSecond = matches[1].ToString();

You could use the regular expression
^(?:.*?\d{8}_){1}.*?(\d{8})
to save the 2nd date to capture group 1.
Naturally, for n > 2, replace {1} with {n-1} to obtain the nth date. To obtain the 1st date use
^(?:.*?\d{8}_){0}.*?(\d{8})
The regex engine performs the following operations.
^ # match the beginning of a line
(?: # begin a non-capture group
.*? # match 0+ chars lazily
\d{8} # match 8 digits
_ # match '_'
) # end non-capture group
{n} # execute non-capture group n (n >= 0) times
.*? # match 0+ chars lazily
(\d{8}) # match 8 digits in capture group 1
The important thing to note is that the first instance of .*?, followed by \d{8}_, because it is lazy, will consume as few characters as it can, stopping as soon as the next 8 characters are digits followed by '_'. When the digits must also be delimited by '_' on both sides, a longer run of digits is skipped. For example, in the string
_1234abcd_efghi_123456789_12345678_ABC
capture group 1 in (.*?)_\d{8}_ will contain "_1234abcd_efghi_123456789".
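The {n-1} substitution described above can be sketched as a small helper. NthDate and Find are hypothetical names, and the sketch assumes, as the pattern does, that every date before the one you want is followed by '_'.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NthDate
{
    // Returns the n-th (1-based) 8-digit date, or null if there is none.
    public static string Find(string text, int n)
    {
        if (n < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(n));
        }
        // Skip n-1 occurrences of "anything (lazily), 8 digits, underscore",
        // then capture the next run of 8 digits.
        string pattern = @"^(?:.*?\d{8}_){" + (n - 1) + @"}.*?(\d{8})";
        Match m = Regex.Match(text, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

With the string from the question, NthDate.Find(text, 2) should return the second date, and asking for a fourth date returns null because only two dates are followed by '_'.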

You can use System.Text.RegularExpressions.Regex
See the following example
Regex regex = new Regex(@"^(?:[^_]+_){2}(\d+)"); // Expression from Jan's answer, just showing how to use C# to achieve your goal
GroupCollection groups = regex.Match("XXXXX_20160207_20180208_XXXXXXX_20190408T160742_xxxxx").Groups;
// Groups always contains every group of the pattern, so test Success rather than Count
if (groups[1].Success)
{
    Console.WriteLine(groups[1].Value);
}

Matching a pattern in a string

I have a string
string str = "I am fine. How are you? You need exactly 4 pieces of sandwiches. Your ADAST Count is 5. Okay thank you ";
What I want is to get the ADAST count value. For the above example, it is 5.
The problem is the word after ADAST Count: it can be either is or =. The two words ADAST Count will always be there.
What I have tried is
var resultString = Regex.Match(str, @"ADAST\s+count\s+is\s+\d+", RegexOptions.IgnoreCase).Value;
var number = Regex.Match(resultString, @"\d+").Value;
How can I write a pattern that matches either is or =?

You may use
ADAST\s+count\s+(?:is|=)\s+(\d+)
See the regex demo
Note that (?:is|=) is a non-capturing group (i.e. it is used to only group alternations without pushing these submatches on to the capture stack for further retrieval) and | is an alternation operator.
Details:
ADAST - a literal string
\s+ - 1 or more whitespaces
count - a literal string
\s+ - 1 or more whitespaces
(?:is|=) - either is or =
\s+ - 1 or more whitespaces
(\d+) - Group 1 capturing one or more digits
C#:
var m = Regex.Match(str, @"ADAST\s+count\s+(?:is|=)\s+(\d+)", RegexOptions.IgnoreCase);
if (m.Success) {
    Console.Write(m.Groups[1].Value);
}
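If the number is needed as an integer rather than a string, the same pattern can be wrapped as in the sketch below. AdastCount and Parse are hypothetical names; returning null for "not found" is a design assumption, not something the question asks for.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AdastCount
{
    // Returns the number after "ADAST Count is" or "ADAST Count =", or null if absent.
    public static int? Parse(string text)
    {
        Match m = Regex.Match(text, @"ADAST\s+count\s+(?:is|=)\s+([0-9]+)", RegexOptions.IgnoreCase);
        if (!m.Success)
        {
            return null;
        }
        int value;
        // TryParse guards against digit runs too long for an int.
        return int.TryParse(m.Groups[1].Value, out value) ? value : (int?)null;
    }
}
```

Note that \s+ requires at least one whitespace character on each side of the separator, so "ADAST Count=5" does not match; replacing those two \s+ with \s* would accept it.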

RegEx string between N and (N+1)th Occurrence

I am attempting to find the nth occurrence of a substring between two special characters. For example:
one|two|three|four|five
Say I am looking for the string between the nth and (n+1)th occurrence of the '|' character, here the 2nd and 3rd, which turns out to be 'three'. I want to do it using a regex. Could someone guide me?
My Current Attempt is as follows.
string subtext = "zero|one|two|three|four";
Regex r = new Regex(@"(?:([^|]*)|){3}");
var m = r.Match(subtext).Value;

If you have full access to C# code, you should consider a mere splitting approach:
var idx = 2; // Might be user-defined
var subtext = "zero|one|two|three|four";
var result = subtext.Split('|').ElementAtOrDefault(idx);
Console.WriteLine(result);
// => two
A regex can be used if you have no access to code (if you use some tool that is powered with .NET regex):
^(?:[^|]*\|){2}([^|]*)
See the regex demo. It matches
^ - start of string
(?:[^|]*\|){2} - exactly 2 (adjust the number as you need) sequences of:
[^|]* - zero or more chars other than |
\| - a | symbol
([^|]*) - Group 1 (access via .Groups[1]): zero or more chars other than |
C# code to test:
var pat = $@"^(?:[^|]*\|){{{idx}}}([^|]*)";
var m = Regex.Match(subtext, pat);
if (m.Success) {
    Console.WriteLine(m.Groups[1].Value);
}
// => two
See the C# demo
If a tool does not let you access captured groups, turn the initial part into a non-consuming lookbehind pattern:
(?<=^(?:[^|]*\|){2})[^|]*
^^^^^^^^^^^^^^^^^^^^
See this regex demo. The (?<=...) positive lookbehind only checks for a pattern presence immediately to the left of the current location, and if the pattern is not matched, the match will fail.
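The lookbehind variant can be parameterized the same way as the capturing one. The sketch below uses hypothetical names (PipeField, Get) and takes the whole match value instead of a group; it relies on .NET allowing variable-length lookbehind.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PipeField
{
    // Returns the field at zero-based position idx in a '|'-separated string,
    // or null when there are not enough fields.
    public static string Get(string text, int idx)
    {
        if (idx < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(idx));
        }
        // The lookbehind requires exactly idx "field|" sequences between
        // the start of the string and the current position.
        string pattern = @"(?<=^(?:[^|]*\|){" + idx + @"})[^|]*";
        Match m = Regex.Match(text, pattern);
        return m.Success ? m.Value : null;
    }
}
```

For "zero|one|two|three|four", PipeField.Get(text, 2) should return the same field as the Split approach with idx = 2.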

Use this:
(?:.*?\|){n}(.[^|]*)
where n is the number of times you need to skip your special character. The first capturing group will contain the result.
Demo for n = 2

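Use this regex and then select the match you need from the Matches collection. The index is zero-based and the matches start after the first '|', so index 2 is the text after the third '|':
string subtext = "zero|one|two|three|four";
Regex r = new Regex(@"(?<=\|)[^|]*");
var m = r.Matches(subtext)[2]; // "three"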

Using RegEx to match Month-Day in C#

Let me preface this by saying I am new to Regex and C# so I am still trying to figure it out. I also realize that Regex is a deep subject that takes time to understand. I have done a little research to figure this out but I don't have the time needed to properly study the art of Regex syntax as I need this program finished tomorrow. (no this is not homework, it is for my job)
I am using C# to search through a text file line by line and I am trying to use a regular expression to check whether any lines contain any dates of the current month in the format MM-DD. The regular expression is used within a method that is passed each line of the file.
Here is the method I am currently using:
private bool CheckTransactionDates(string line)
{
    // in the actual code this is dynamically set based on other variables
    string month = "12";
    Regex regExPattern = new Regex(@"\s" + month + @"-\d(0[1-9]|[1-2][0-9]|3[0-1])\s");
    Match match = regExPattern.Match(line);
    return match.Success;
}
Essentially I need it to match only if it is preceded by a space and followed by a space, and only if it is the current month (in this case 12), a hyphen, and a day of the month (" 12-01 " should match but not " 12-99 "). It should always be 2 digits on either side of the hyphen.
This Regex (the only thing I can make match) will work, but also picks up items outside the necessary range:
Regex regExPattern = new Regex(@"\s" + month + @"-\d{2}\s");
I have also tried this without success:
Regex regExPattern = new Regex(@"\s" + month + @"-\d[01-30]{2}\s");
Can anyone tell me what I need to change to get the results I need?
Thanks in advance.

If you just need to find out if the line contains any valid match, something like this will work:
private bool CheckTransactionDates(string line)
{
    // in the actual code this is dynamically set based on other variables
    int month = DateTime.Now.Month;
    int daysInMonth = DateTime.DaysInMonth(DateTime.Today.Year, DateTime.Today.Month);
    Regex pattern = new Regex(string.Format(@"{0:00}-(?<DAY>[0123][0-9])", month));
    int day = 0;
    foreach (Match match in pattern.Matches(line))
    {
        if (int.TryParse(match.Groups["DAY"].Value, out day))
        {
            // day 00 also matches the pattern, so check the lower bound too
            if (day >= 1 && day <= daysInMonth)
            {
                return true;
            }
        }
    }
    return false;
}
Here's how it works:
You determine the month to search for (here, I use the current month), and the number of days in that month.
Next, the regex pattern is built using a string.Format function that puts the left-zero-padded month, followed by dash, followed by any two digit number 00 to 39 (the [0123] for the first digit, the [0-9] for the second digit). This narrows the regex matches, but not conclusively for a date. The (?<DAY>...) that surrounds it creates a regex group, which will make processing it later easier. Note that I didn't check for a whitespace, in case the line begins with a valid date. You could easily add a space to the pattern, or modify the pattern to your specific needs.
Next, we check all possible matches on that line (pattern.Matches) in a loop.
If a match is found, we then try to parse it as an integer (it should always work, based on the pattern we are matching). We use the DAY group of that match that we defined in the pattern.
After parsing that match into an integer day, we check to see if that day is a valid number for the month specified. If it is, we return true from the function, as we found a valid date.
Finally, if we found no matches, or if none of the matches is valid, we return false from the function (only if we hadn't returned true earlier).
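As a variation on the steps above, the sketch below passes the year and month in as parameters so the check does not depend on the current date, and it adds the whitespace requirement from the question. TransactionDates and ContainsDateInMonth are hypothetical names. Using the lookarounds (?<!\S) and (?!\S) instead of \s is my own substitution: they also accept a date at the very start or end of the line.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TransactionDates
{
    // True if the line contains a whitespace-delimited MM-DD date
    // that is a real day of the given month and year.
    public static bool ContainsDateInMonth(string line, int year, int month)
    {
        int daysInMonth = DateTime.DaysInMonth(year, month);
        // {0:00} is the zero-padded month; {{2}} becomes the quantifier {2}.
        string pattern = string.Format(@"(?<!\S){0:00}-(?<day>[0-9]{{2}})(?!\S)", month);
        foreach (Match match in Regex.Matches(line, pattern))
        {
            int day = int.Parse(match.Groups["day"].Value);
            if (day >= 1 && day <= daysInMonth)
            {
                return true;
            }
        }
        return false;
    }
}
```

Because the range check uses DateTime.DaysInMonth, a line containing 02-29 only passes in a leap year.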

One thing to note is that \s matches any white space character, not just a space:
\s match any white space character [\r\n\t\f ]
However, a regex that literally looks for a space would not, such as " (12-\d{2}) " (note the literal space on each side of the group). That said, I've got to go with the rest of the community a bit on what to do with the matches. You're going to need to go through every match and validate the date with a better approach:
// requires: using System.Globalization; (for DateTimeStyles)
var input = string.Format(
    " 11-20 2690 E 28.76 12-02 2468 E* 387.85{0}11-15 3610 E 29.34 12-87 2534 E",
    Environment.NewLine);
var pattern = string.Format(@" ({0}-\d{{2}}) ", DateTime.Now.ToString("MM"));
var lines = new List<string>();
foreach (var line in input.Split(new string[] { Environment.NewLine },
    StringSplitOptions.RemoveEmptyEntries))
{
    var m = Regex.Match(line, pattern);
    if (!m.Success)
    {
        continue;
    }
    // TryParseExact rejects impossible days such as 12-87
    DateTime dt;
    if (!DateTime.TryParseExact(m.Value.Trim(),
        "MM-dd",
        null,
        DateTimeStyles.None,
        out dt))
    {
        continue;
    }
    lines.Add(line);
}
The reason I went through the lines one at a time is because presumably you need to know what line is good and what line is bad. My logic may not exactly match what you need but you can easily modify it.

Regular Expressions and "groups"

I have some text like "item number - item description" eg "13-40 - Computer Keyboard" that I want to split into item number and item description.
Is this possible with 1 regular expression, or would I need 2 (one for item and one for description)?
I can't work out how to "group" it - like the item number can be this and the description can be this, without it thinking that everything is the item number. Eg:
(\w(\w|-|/)*\w)-.*
matches everything as 1 match.
This is the code I'm using:
Regex rx = new Regex(RegExString, RegexOptions.Compiled | RegexOptions.IgnoreCase);
MatchCollection matches = rx.Matches("13-40 - Computer Keyboard");
Assert.AreEqual("13-40", matches[0].Value);
Assert.AreEqual("Computer Keyboard", matches[1].Value);

From the code you posted, you are using regex wrong. You should be having one regex pattern to match the whole product and using the captures within the match to extract the number and description.
string RegExString = @"(?<number>[\d-]+)\s-\s(?<description>.*)";
Regex rx = new Regex(RegExString, RegexOptions.Compiled | RegexOptions.IgnoreCase);
Match match = rx.Match("13-40 - Computer Keyboard");
Debug.Assert("13-40" == match.Groups["number"].Value);
Debug.Assert("Computer Keyboard" == match.Groups["description"].Value);
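The same pattern can be wrapped in a TryParse-style helper so the caller also learns when the text is not in the expected shape. ItemParser and TryParse are hypothetical names, and anchoring the pattern with ^ and $ is an added assumption that the whole string is a single item.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ItemParser
{
    // Splits "item number - item description" into its two parts.
    // Returns false (and null outputs) when the text does not match.
    public static bool TryParse(string text, out string number, out string description)
    {
        Match m = Regex.Match(text, @"^(?<number>[\d-]+)\s-\s(?<description>.*)$");
        number = m.Success ? m.Groups["number"].Value : null;
        description = m.Success ? m.Groups["description"].Value : null;
        return m.Success;
    }
}
```

One match is produced for the whole item, and the two named groups carry the parts, which is the distinction between matches and groups that the question was missing.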

Here is a regexp that works in Ruby - not sure if there are any differences in c# regexp:
/^([\d\-]+) \- (.+)$/

([0-9-]+)\s-\s(.*)
Group 1 contains the item number, and group 2 contains the description.

CaffeineFueled's answer is correct for C#.
Match match = Regex.Match("13-40 - Computer Keyboard", @"^([\d\-]+) \- (.+)$");
Console.WriteLine(match.Groups[1]);
Console.WriteLine(match.Groups[2]);
Results:
13-40
Computer Keyboard

If your text is always divided by a dash and you don't have to handle dashes within the data, you don't have to use regex.
// text is the input string; requires: using System.Linq;
string[] itemProperties = text.Split(new[] { "-" }, StringSplitOptions.None)
    .Select(p => p.Trim())
    .ToArray();
Item item = new Item()
{
    Number = itemProperties[0],
    Name = itemProperties[1],
    Description = itemProperties[2]
};

You don't seem to want to match groups, but have multiple matches.
Maybe this will do what you want?
(?:^.+(?=( - ))|(?<=( - )).+$)
Split up:
(?: Non-capturing group used to provide two possible matches
^.+ Match item ID text
(?=( - )) Text must be before " - "
| OR
(?<=( - )) Text must be after " - "
.+$ Match description text
)

This isn't as elegant as CaffeineFueled's answer but maybe easier to read for a regex beginner.
String RegExString = @"(\d*-\d*)\s*-\s*(.*)";
Regex rx = new Regex(RegExString, RegexOptions.Compiled | RegexOptions.IgnoreCase);
Match match = rx.Match("13-40 - Computer Keyboard");
Assert.AreEqual("13-40", match.Groups[1].Value);
Assert.AreEqual("Computer Keyboard", match.Groups[2].Value);
or even more readable:
String RegExString = @"(\d*-\d*) - (.*)";

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.
