C#: regex doesn't work (match does not preserve the string)

Regex regOrg = new Regex(@"org(?:aniser)?\s+(\d\d):(\d\d)\s?(\d\d)?\.?(\d\d)?", RegexOptions.IgnoreCase);
MatchCollection mcOrg = regOrg.Matches(str);
Match mvOrg = regOrg.Match(str);
dayOrg = mvOrg.Value[4].ToString();
monthOrg = mvOrg.Value[5].ToString();
hourOrg = mvOrg.Value[2].ToString();
minuteOrg = mvOrg.Value[3].ToString();
This regular expression parses strings such as
"organiser 23:59" / "organiser 25:59 31.12"
or
"org 23:59" / "org 23:59 31.12"
The day and month are optional.
Accordingly, I expect the variables dayOrg, monthOrg, hourOrg and minuteOrg to hold this data, but I get this:
Query: org 23:59 31.12
The value mcOrg.Count: 1
The value dayOrg: 2
The value monthOrg: 3
The value hourOrg: g
The value minuteOrg: empty
What am I doing wrong? Tried a lot of options, but it's not working.

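You're not accessing the groups correctly: mvOrg.Value is the matched string, so Value[n] returns a single character of it. Use the Groups collection instead. Group 0 is the whole match, and your pattern has four capturing groups, numbered 1 to 4 from left to right (hour, minute, day, month):
dayOrg = mvOrg.Groups[3].Value;
monthOrg = mvOrg.Groups[4].Value;
hourOrg = mvOrg.Groups[1].Value;
minuteOrg = mvOrg.Groups[2].Value;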

You get that result because you are indexing Value on the mvOrg Match.
As described on MSDN, the Value property of the Match class is the matched substring, so Value[index] returns a single character of that string rather than a group. You need to use the Groups property of the Match class to get the captured groups.
For the optional day and month, check the group's Success property before using its value.

I added names to the groups in your pattern, so now it looks like this:
Regex regOrg = new Regex(@"org(?:aniser)?\s+(?<hourOrg>\d{2}):(?<minuteOrg>\d{2})\s?(?<dayOrg>\d{2})?\.?(?<monthOrg>\d{2})?", RegexOptions.IgnoreCase);
and you can access the result like this
Console.WriteLine(mvOrg.Groups["hourOrg"]);
Console.WriteLine(mvOrg.Groups["minuteOrg"]);
Console.WriteLine(mvOrg.Groups["dayOrg"]);
Console.WriteLine(mvOrg.Groups["monthOrg"]);
Using hard-coded indexes is not good practice: if you change the regex, you then have to update every index.
Is this what you wanted?
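To show how the named groups behave when the optional day and month are missing, here is a minimal sketch. The class name OrgParser, the helper ValueOrDash and the "-" placeholder are my own assumptions for illustration; the pattern is the one from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

static class OrgParser
{
    // Pattern from the answer above; day and month are optional groups.
    static readonly Regex RegOrg = new Regex(
        @"org(?:aniser)?\s+(?<hourOrg>\d{2}):(?<minuteOrg>\d{2})\s?(?<dayOrg>\d{2})?\.?(?<monthOrg>\d{2})?",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the group's text, or "-" when the
    // optional group did not take part in the match.
    static string ValueOrDash(Match m, string name) =>
        m.Groups[name].Success ? m.Groups[name].Value : "-";

    public static string Describe(string input)
    {
        Match m = RegOrg.Match(input);
        if (!m.Success) return "no match";
        return "hour=" + m.Groups["hourOrg"].Value
             + " minute=" + m.Groups["minuteOrg"].Value
             + " day=" + ValueOrDash(m, "dayOrg")
             + " month=" + ValueOrDash(m, "monthOrg");
    }

    static void Main()
    {
        Console.WriteLine(Describe("org 23:59 31.12")); // hour=23 minute=59 day=31 month=12
        Console.WriteLine(Describe("organiser 23:59")); // hour=23 minute=59 day=- month=-
    }
}
```

Checking Success rather than comparing against an empty string keeps "group did not match" distinct from "group matched nothing".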

Related

what is a good pattern to processes each individual regex match through a method

I'm trying to figure out a pattern where I run a regex match on a long string, and each time it finds a match, it runs a replace on it. The thing is, the replace will vary depending on the matched value. This new value will be determined by a method. For example:
var matches = Regex.Match(myString, myPattern);
while(matches.Success){
Regex.Replace(myString, matches.Value, GetNewValue(matches.Groups[1]));
matches = matches.NextMatch();
}
The problem (I think) is that once I run Regex.Replace, all of the match indexes are off, so the result comes out wrong. Any suggestions?
If you replace each match with a fixed string, Regex.Replace does that for you. You don't need to iterate over the matches:
Regex.Replace(myString, myPattern, "replacement");
Otherwise, if the replacement depends upon the matched value, use the MatchEvaluator delegate, as the 3rd argument to Regex.Replace. It receives an instance of Match and returns string. The return value is the replacement string. If you don't want to replace some matches, simply return match.Value:
string myString = "aa bb aa bb";
string myPattern = @"\w+";
string result = Regex.Replace(myString, myPattern,
match => match.Value == "aa" ? "0" : "1" );
Console.WriteLine(result);
// 0 1 0 1
If you really need to iterate the matches and replace them manually, you need to start replacement from the last match towards the first, so that the index of the string is not ruined for the upcoming matches. Here's an example:
var matches = Regex.Matches(myString, myPattern);
var matchesFromEndToStart = matches.Cast<Match>().OrderByDescending(m => m.Index);
var sb = new StringBuilder(myString);
foreach (var match in matchesFromEndToStart)
{
if (IsGood(match))
{
sb.Remove(match.Index, match.Length)
.Insert(match.Index, GetReplacementFor(match));
}
}
Console.WriteLine(sb.ToString());
Just be careful, that your matches do not contain nested instances. If so, you either need to remove matches which are inside another match, or rerun the regex pattern to generate new matches after each replacement. I still recommend the second approach, which uses the delegates.
If I understand your question correctly, you want to perform a replace based on a constant Regular Expression, but the replacement text you use will change based on the actual text that the regex matches on.
The Matches method of the Regex class (not the Match method) returns a collection of all the matches of your regex within the input string. Each Match contains information such as its position within the string, the matched value and the length of the match. If you iterate over this collection with a foreach loop, you can treat each match individually and perform string manipulations where you dynamically compute the replacement value.
I would use something like
Regex regEx = new Regex("some.*?pattern");
string input = "someBLAHpattern!";
foreach (Match match in regEx.Matches(input))
{
DoStuffWith(match.Value);
}
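Tying the answers back to the question's own code, a minimal sketch of Regex.Replace with a per-match method might look like this. GetNewValue is a hypothetical stand-in for the asker's method (here it just upper-cases the captured text), and the bracket pattern is only an illustrative assumption.

```csharp
using System;
using System.Text.RegularExpressions;

static class ReplaceWithMethod
{
    // Hypothetical stand-in for the question's GetNewValue.
    static string GetNewValue(Group g) => g.Value.ToUpperInvariant();

    public static string Rewrite(string input) =>
        // Illustrative pattern: captures the word between square brackets as group 1.
        // The lambda is converted to a MatchEvaluator and called once per match.
        Regex.Replace(input, @"\[(\w+)\]", m => GetNewValue(m.Groups[1]));

    static void Main()
    {
        Console.WriteLine(Rewrite("a [b] c [de]")); // a B c DE
    }
}
```

Because Regex.Replace builds the result in a single pass, no match index is invalidated by an earlier replacement.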

How to find repeatable characters

I can't understand how to solve the following problem:
I have input string "aaaabaa" and I'm trying to search for string "aa" (I'm looking for positions of characters)
Expected result is
0 1 2 5
aa aabaa
a aa abaa
aa aa baa
aaaab aa
This problem is already solved by me using another approach (non-RegEx).
But I need a RegEx. I'm new to RegEx, so a Google search doesn't really help me.
Any help appreciated! Thanks!
P.S.
I've tried to use (aa)* and "\b(\w+(aa))*\w+" but those expressions are wrong
You can solve this by using a lookahead
a(?=a)
will find every "a" that is followed by another "a".
If you want to do this more generally
(\p{L})(?=\1)
This will find every character that is followed by the same character. Each matched letter is stored in a capturing group (because of the parentheses around it), and the positive lookahead assertion (the (?=...)) then reuses that group through \1, which holds the matched character.
\p{L} matches any Unicode code point in the category "letter".
Code
String text = "aaaabaa";
Regex reg = new Regex(@"(\p{L})(?=\1)");
MatchCollection result = reg.Matches(text);
foreach (Match item in result) {
Console.WriteLine(item.Index);
}
Output
0
1
2
5
The following code should work with any regular expression without having to change the actual expression:
Regex rx = new Regex(@"(a)\1"); // or any other word you're looking for.
int position = 0;
string text = "aaaaabbbbccccaaa";
int textLength = text.Length;
Match m = rx.Match(text, position);
while (m != null && m.Success)
{
Console.WriteLine(m.Index);
if (m.Index <= textLength)
{
m = rx.Match(text, m.Index + 1);
}
else
{
m = null;
}
}
Console.ReadKey();
It uses the option to change the start index of a regex search for each consecutive search. The actual problem comes from the fact that the Regex engine, by default, will always continue searching after the previous match. So it will never find a possible match within another match, unless you instruct it to by using a Look ahead construction or by manually setting the start index.
Another, relatively easy, solution is to just stick the whole expression in a forward look ahead:
string expression = @"(a)\1";
Regex rx2 = new Regex("(?=" + expression + ")");
MatchCollection ms = rx2.Matches(text);
var indexes = ms.Cast<Match>().Select(match => match.Index);
That way the engine will automatically advance the index by one for every match it finds.
From the docs:
When a match attempt is repeated by calling the NextMatch method, the regular expression engine gives empty matches special treatment. Usually, NextMatch begins the search for the next match exactly where the previous match left off. However, after an empty match, the NextMatch method advances by one character before trying the next match. This behavior guarantees that the regular expression engine will progress through the string. Otherwise, because an empty match does not result in any forward movement, the next match would start in exactly the same place as the previous match, and it would match the same empty string repeatedly.
Try this:
How can I find repeated characters with a regex in Java?
It is in Java, but it shows both the regex and the non-regex way. C# regex syntax is very similar to Java's.
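For the original "aaaabaa" / "aa" case, the lookahead idea can be sketched as a small helper. The class and method names are my own; Regex.Escape is used on the assumption that the searched text should be treated literally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class OverlappingMatches
{
    // Returns every start index of 'literal' in 'text', including overlapping ones.
    public static List<int> Positions(string text, string literal)
    {
        // The lookahead makes each match zero-width, so the engine advances
        // one character at a time and overlapping occurrences are all found.
        var rx = new Regex("(?=" + Regex.Escape(literal) + ")");
        return rx.Matches(text).Cast<Match>().Select(m => m.Index).ToList();
    }

    static void Main()
    {
        Console.WriteLine(string.Join(" ", Positions("aaaabaa", "aa"))); // 0 1 2 5
    }
}
```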

RegEx pattern needed to return the same found pattern under two different group names

Currently I'm using some kind of tree that has a regular expression on each level to parse some arbitrary text file into a tree. Till now everything works quite fine and the regex result is given down to the child node to do further parsing of the text. To get a link between the node and a child node the node itself has also a name, which is used within a regex as the group name. So after parsing some text I'll get a regex containing some named groups and the node itself also contains of child nodes with the same names, which leads to a recursive structure to do some arbitrary parsing.
Now I'm running into trouble: to make processing this tree a little easier in the next step, I need the very same information from the text file under different nodes within my tree. Since this is probably a little hard to understand, here is a unit test that shows what I'd like to achieve:
string input = "Some identifier=Just a value like 123";
// ToDo: Change the pattern, that the new group 'anotherName' will contain the same text as 'key'.
string pattern = "^(?'key'.*?)=(?'value'.*)$";
Regex regex = new Regex(pattern);
Match match = regex.Match(input);
var key = match.Groups["key"];
var value = match.Groups["value"];
var sameAsKeyButWithOtherGroupName = match.Groups["anotherName"];
Assert.That(key, Is.EqualTo(sameAsKeyButWithOtherGroupName));
Any ideas how to get this working?
To use a named backreference in a .NET pattern, you have to use the \k<name_of_group> syntax. You may try this one:
bool foundMatch = false;
try {
foundMatch = Regex.IsMatch(subjectString, @"^(?<authorName>(?'key'.*?)=\k<key>)$", RegexOptions.IgnoreCase | RegexOptions.Multiline);
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
Explanation:
^(?<authorName>(?'key'.*?)=\k'key')$
Assert position at the beginning of the string «^»
Match the regular expression below and capture its match into backreference with name “authorName” «(?<authorName>(?'key'.*?)=\k'key')»
Match the regular expression below and capture its match into backreference with name “key” «(?'key'.*?)»
Match any single character that is not a line break character «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character “=” literally «=»
Match the same text as most recently matched by the named group “key” «\k'key'»
Assert position at the end of the string (or before the line break at the end of the string, if any) «$»
After reading Cylian's answer and commenting on it, I did a little more research on backreferences, and my test succeeds with this slightly changed regular expression:
string input = "Some identifier=Just a value like 123";
string pattern = @"^(?'key'.*?)(?'anotherName'\k<key>)=(?'value'.*)$";
Regex regex = new Regex(pattern);
Match match = regex.Match(input);
var key = match.Groups["key"];
var value = match.Groups["value"];
var sameAsKeyButWithOtherGroupName = match.Groups["anotherName"];
Assert.That(key, Is.EqualTo(sameAsKeyButWithOtherGroupName));
So the conclusion is quite simple: if you need the same group under another name, simply declare this group and use the content of the other group as its pattern.
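One thing worth keeping in mind: a backreference such as \k<key> matches a second occurrence of the captured text in the input; it does not copy the capture. A sketch of an alternative, assuming the goal is only to expose one span of text under two group names, is to nest one named group inside the other. The class name and the tuple return type are my own choices for illustration, and the test compares the Value strings rather than the Group objects.

```csharp
using System;
using System.Text.RegularExpressions;

static class SameTextTwoNames
{
    public static (string Key, string Other, string Value) Parse(string input)
    {
        // 'anotherName' wraps 'key', so both groups capture exactly the same span.
        Match m = Regex.Match(input, @"^(?'anotherName'(?'key'.*?))=(?'value'.*)$");
        if (!m.Success) throw new FormatException("Input does not contain '='.");
        return (m.Groups["key"].Value, m.Groups["anotherName"].Value, m.Groups["value"].Value);
    }

    static void Main()
    {
        var r = Parse("Some identifier=Just a value like 123");
        Console.WriteLine(r.Key);   // Some identifier
        Console.WriteLine(r.Other); // Some identifier
        Console.WriteLine(r.Value); // Just a value like 123
    }
}
```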

C# regex problem

"<div class=\"standings-rank\">([0-9]{1,2})</div>"
Here's my regex. I want to Match it but C# returns me something like
"<div class=\"standings-rank\">1</div>"
When I'd like to just get
"1"
How can I make C# return me the right thing?
Use Match.Groups[int] indexer.
Regex regex = new Regex("<div class=\"standings-rank\">([0-9]{1,2})</div>");
string str = "<div class=\"standings-rank\">1</div>";
string value = regex.Match(str).Groups[1].Value;
Console.WriteLine(value); // Writes "1"
Assuming you have a Regex declared as follows:
Regex pattern = new Regex("<div class=\"standings-rank\">([0-9]{1,2})</div>");
and are testing said regex via the Match method; then you must access the match starting at index 1 not index 0;
pattern.Match("<div class=\"standings-rank\">1</div>").Groups[1].Value
This will return the expected value; index 0 will return the whole matched string.
Specifically, see MSDN
The collection contains one or more System.Text.RegularExpressions.Group objects. If the match is successful, the first element in the collection contains the Group object that corresponds to the entire match. Each subsequent element represents a captured group, if the regular expression includes capturing groups. If the match is unsuccessful, the collection contains a single System.Text.RegularExpressions.Group object whose Success property is false and whose Value property equals String.Empty.

How can I get a regex match to only be added once to the matches collection?

I have a string which has several html comments in it. I need to count the unique matches of an expression.
For example, the string might be:
var teststring = "<!--X1-->Hi<!--X1-->there<!--X2-->";
I currently use this to get the matches:
var regex = new Regex("<!--X.-->");
var matches = regex.Matches(teststring);
The results of this is 3 matches. However, I would like to have this be only 2 matches since there are only two unique matches.
I know I can probably loop through the resulting MatchCollection and remove the extra Match, but I'm hoping there is a more elegant solution.
Clarification: The sample string is greatly simplified from what is actually being used. There can easily be an X8 or X9, and there are likely dozens of each in the string.
I would just use the Enumerable.Distinct Method for example like this:
string subjectString = "<!--X1-->Hi<!--X1-->there<!--X2--><!--X1-->Hi<!--X1-->there<!--X2-->";
var regex = new Regex(@"<!--X\d-->");
var matches = regex.Matches(subjectString);
var uniqueMatches = matches
.OfType<Match>()
.Select(m => m.Value)
.Distinct();
uniqueMatches.ToList().ForEach(Console.WriteLine);
Outputs this:
<!--X1-->
<!--X2-->
For regular expression, you could maybe use this one?
(<!--X\d-->)(?!.*\1.*)
Seems to work on your test string in RegexBuddy at least =)
// (<!--X\d-->)(?!.*\1.*)
//
// Options: dot matches newline
//
// Match the regular expression below and capture its match into backreference number 1 «(<!--X\d-->)»
// Match the characters “<!--X” literally «<!--X»
// Match a single digit 0..9 «\d»
// Match the characters “-->” literally «-->»
// Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!.*\1.*)»
// Match any single character «.*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
// Match the same text as most recently matched by capturing group number 1 «\1»
// Match any single character «.*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
It appears you're doing two different things:
Matching comments like /<!--X.-->/
Finding the set of unique comments
So it is fairly logical to handle these as two different steps:
var regex = new Regex("<!--X.-->");
var matches = regex.Matches(teststring);
var uniqueMatches = matches.Cast<Match>().Distinct(new MatchComparer());
class MatchComparer : IEqualityComparer<Match>
{
public bool Equals(Match a, Match b)
{
return a.Value == b.Value;
}
public int GetHashCode(Match match)
{
return match.Value.GetHashCode();
}
}
Extract the comments and store them in an array. Then you can filter out the unique values.
But I don’t know how to implement this in C#.
Depending on how many Xn's you have you might be able to use:
(\<!--X1--\>){1}.*(\<!--X2--\>){1}
That will only match each occurrence of the X1, X2 etc. once provided they are in order.
Capture the inner portion of the comment as a group. Then put those strings into a hashtable(dictionary). Then ask the dictionary for its count, since it will self weed out repeats.
var teststring = "<!--X1-->Hi<!--X1-->there<!--X2-->";
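var tokens = new Dictionary<string, string>();
// .*? is lazy, so each comment is matched separately instead of one match spanning the whole string
Regex.Replace(teststring, @"<!--(.*?)-->",
match => {
tokens[match.Groups[1].Value] = match.Groups[1].Value;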
return "";
});
var uniques = tokens.Keys.Count;
By using the Regex.Replace construct you get a lambda called on each match. Since you are not interested in the replaced string, you don't assign the result to anything.
You must use Groups[1] because Groups[0] is the entire match.
I'm only repeating the same value on both sides so that it's easier to put into the dictionary, which only stores unique keys.
If you want a distinct Match list from a MatchCollection without converting to string, you can use something like this:
var distinctMatches = matchList.OfType<Match>().GroupBy(x => x.Value).Select(x =>x.First()).ToList();
I know it has been 12 years but sometimes we need this kind of solutions, so I wanted to share. C# evolved, .NET evolved, so it's easier now.
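If only the number of unique matches is needed, a HashSet is another option. This is a minimal sketch under that assumption; the class and method names are hypothetical, and the pattern reuses the `<!--X\d-->` form from the earlier answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class UniqueMatches
{
    // Counts distinct matched strings; the HashSet silently ignores duplicates.
    public static int CountUnique(string text, string pattern)
    {
        var seen = new HashSet<string>();
        foreach (Match m in Regex.Matches(text, pattern))
        {
            seen.Add(m.Value);
        }
        return seen.Count;
    }

    static void Main()
    {
        string teststring = "<!--X1-->Hi<!--X1-->there<!--X2-->";
        Console.WriteLine(CountUnique(teststring, @"<!--X\d-->")); // 2
    }
}
```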
