Part of an app I'm creating in C# replaces certain substrings in a string with a value in square brackets like [11]. Often there can be the same value straight after - so I want to reduce the amount of text by combining them into one like [11,numberOfSame]
For example, if the string contains:
blahblah[122][122][122]blahblahblahblah[18][18][18][18]blahblahblah
The desired new string would be:
blahblah[122,3]blahblahblahblah[18,4]blahblahblah
Would anyone know how I would do this? Thanks! :)
Regex.Replace("blahblah[122][122][122]blahblahblahblah[18][18][18][18]blahblahblah",
@"(\[([^]]+)])(\1)+",
m => "[" + m.Groups[2].Value + "," + (m.Groups[3].Captures.Count + 1) + "]")
Returns:
blahblah[122,3]blahblahblahblah[18,4]blahblahblah
Explanation of regex:
( Starts group 1
\[ Matches [
( Starts group 2
[^]]+ Matches 1 or more of anything but ]
) Ends group 2
] Matches ]
) Ends group 1
( Starts group 3
\1 Matches whatever was in group 1
) Ends group 3
+ Matches one or more of group 3
Explanation of lambda:
m => Accepts a Match object
"[" + A [
m.Groups[2].Value + Whatever was in group 2
"," + A ,
(m.Groups[3].Captures.Count + 1) + The number of times group 3 matched + 1
"]" A ]
I am using this overload, which accepts a delegate to compute the replacement value.
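As a minimal sketch, the call above can be wrapped in a reusable helper. The class name RunCollapser and the method name Collapse are hypothetical and chosen only for illustration; the pattern and the lambda are the ones explained above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RunCollapser
{
    // Group 1 is the whole "[value]" token, group 2 is the value inside the
    // brackets, and group 3 captures every immediate repetition of group 1.
    private static readonly Regex Runs = new Regex(@"(\[([^]]+)])(\1)+");

    // Replaces each run of two or more identical tokens with "[value,count]".
    // Tokens that are not repeated are left untouched.
    public static string Collapse(string input)
    {
        return Runs.Replace(input, m =>
            "[" + m.Groups[2].Value + "," + (m.Groups[3].Captures.Count + 1) + "]");
    }
}
```

Because the pattern requires at least one repetition, a lone token such as [44] is not rewritten, which keeps the output as short as possible.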
string input = "[122][44][122]blah[18][18][18][18]blah[122][122]";
string output = Regex.Replace(input, @"((?<firstMatch>\[(.+?)\])(\k<firstMatch>)*)", m => "[" + m.Groups[2].Value + "," + (m.Groups[3].Captures.Count + 1) + "]");
Returns:
[122,1][44,1][122,1]blah[18,4]blah[122,2]
Explanation:
(?<firstMatch>\[(.+?)\]) Matches the [123] group, names group firstMatch
\k<firstMatch> Matches whatever text was matched by the firstMatch group; adding * matches it zero or more times, which gives us the count used in the lambda.
My reference for anything Regex: http://www.regular-expressions.info/
Example :
I want to get the "2" character that comes just before "- 60000 rupiah".
So i tried to code it with substring :
string s = "Ayam Bakar - 30000 x 2 - 60000 rupiah";
string qtyMenu = s.Substring(s.IndexOf("x") + 1, s.IndexOf("-") - 1);
But the end index of the substring didn't work properly, maybe because the sentence has multiple "-" characters. Is it possible to get the different indexes of the same character in that sentence?
This is a good situation for Regex
string s = "Ayam Bakar - 30000 x 2 - 60000 rupiah";
// "x" followed by (maybe) whitespaces followed by at least one digit followed by (maybe) whitespaces followed by "-".
// Capture the digits with the (...)
var match = Regex.Match(s, @"x\s*(\d+)\s*\-");
if (match.Success)
{
// Groups[1] is the captured group
string foo = match.Groups[1].Value;
}
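If the quantity is needed as a number, the same pattern can be wrapped in a small helper. This is only a sketch: OrderLine and ParseQuantity are hypothetical names, and it assumes the line always follows the "name - price x qty - subtotal" shape shown in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OrderLine
{
    // Returns the quantity that follows "x", or null when the line does not
    // contain "x <digits> -" or the digits do not fit in an int.
    public static int? ParseQuantity(string line)
    {
        Match match = Regex.Match(line, @"x\s*(\d+)\s*-");
        if (match.Success && int.TryParse(match.Groups[1].Value, out int quantity))
        {
            return quantity;
        }
        return null;
    }
}
```

Returning a nullable int lets the caller distinguish "no quantity found" from a real value without catching exceptions.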
You can easily achieve this by following technique:
string s = "Ayam Bakar - 30000 x 2 - 60000 rupiah";
string qtyMenu = s.Substring(s.IndexOf("x") + 1, s.LastIndexOf("-") - (s.IndexOf("x") + 1)).Trim();
The second parameter of Substring is a length, not an end index. Here the length is the index of the last - minus the index just after x; Trim() then removes the spaces around the 2.
From the message, I can derive the template format:
"{Product Name} - {Price x Qty} - {Subtotal}"
So you can implement this solution:
// Split message by '-'
var messages = s.Split('-');
// Result: messages[0] = "Ayam Bakar "
// Result: messages[1] = " 30000 x 2 "
// Result: messages[2] = " 60000 rupiah"
// Obtain {Price x Qty} in messages[1] and get the value after 'x'
var qtyMenu = messages[1].Substring(messages[1].IndexOf("x") + 1).Trim();
I have strings input by the user and want to tokenize them. For that, I want to use regex and now have a problem with a special case.
An example string is
Test + "Hello" + "Good\"more" + "Escape\"This\"Test"
or the C# equivalent
@"Test + ""Hello"" + ""Good\""more"" + ""Escape\""This\""Test"""
I am able to match the Test and + tokens, but not the ones contained by the ". I use the " to let the user specify that this is literally a string and not a special token. Now if the user wants to use the " character in the string, I thought of allowing him to escape it with a \.
So the rule would be: Give me everything between two " ", but the character in front of the last " can not be a \.
The results I expect are: "Hello" "Good\"more" "Escape\"This\"Test"
I need the " " characters to be in the final match so I know that this is a string.
I currently have the regex @"""([\w]*)(?<!\\"")""" which gives me the following results: "Hello" "more" "Test"
So the look behind isn't working as I want it to be. Does anyone know the correct way to get the string like I want?
Here's an adaption of a regex I use to parse command lines:
(?!\+)((?:"(?:\\"|[^"])*"?|\S)+)
Example here at regex101
(adaption is the negative look-ahead to ignore + and checking for \" instead of "")
Hope this helps you.
Regards.
Edit:
If you aren't interested in surrounding quotes:
(?!\+)(?:"((?:\\"|[^"])*)"?|(\S+))
To make it safer, I'd suggest getting all the substrings within unescaped pairs of "..." with the following regex:
^(?:[^"\\]*(?:\\.[^"\\]*)*("[^"\\]*(?:\\.[^"\\]*)*"))+
It matches
^ - start of string (so that we could check each " and escape sequence)
(?: - Non-capturing group 1 serving as a container for the subsequent subpatterns
[^"\\]*(?:\\.[^"\\]*)* - matches 0+ characters other than " and \ followed with 0+ sequences of \\. (any escape sequence) followed with 0+ characters other than " and \ (thus, we avoid matching the first " that is escaped, and it can be preceded with any number of escape sequences)
("[^"\\]*(?:\\.[^"\\]*)*") - Capture group 1 matching "..." substrings that may contain any escape sequences inside
)+ - end of the first non-capturing group that is repeated 1 or more times
See the regex demo and here is a C# demo:
var rx = "^(?:[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"))+";
var s = @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f""";
var matches = Regex.Matches(s, rx)
.Cast<Match>()
.SelectMany(m => m.Groups[1].Captures.Cast<Capture>().Select(p => p.Value).ToArray())
.ToList();
Console.WriteLine(string.Join("\n", matches));
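For the simpler case in the question, where a quote is only ever escaped inside a quoted string, a shorter unanchored pattern may be enough. The sketch below relies on that assumption; QuotedTokens and Extract are hypothetical names. If an escaped quote can appear outside a string, as in the test input above, this pattern would start a match at that quote, which is exactly what the anchored pattern avoids.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuotedTokens
{
    // An opening quote, then any mix of escape sequences (\ plus one character)
    // and characters other than quote or backslash, then a closing quote.
    private static readonly Regex Quoted = new Regex(@"""(?:\\.|[^""\\])*""");

    // Returns every quoted token, with the surrounding quotes included.
    public static List<string> Extract(string input)
    {
        return Quoted.Matches(input).Cast<Match>().Select(m => m.Value).ToList();
    }
}
```

Trying the escape alternative before the plain-character alternative means a backslash always consumes the character after it, so an escaped quote can never close the token.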
UPDATE
If you need to remove the tokens, just match and capture all outside of them with this code:
var keep = "[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*";
var rx = string.Format("^(?:(?<keep>{0})\"{0}\")+(?<keep>{0})$", keep);
var s = @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f""";
var matches = Regex.Matches(s, rx)
.Cast<Match>()
.SelectMany(m => m.Groups["keep"].Captures.Cast<Capture>().Select(p => p.Value).ToArray())
.ToList();
Console.WriteLine(string.Join("", matches));
See another demo
Output: Test + + + \"Escape\"This\"Test\" + for @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f""";.
Referring to this stackoverflow question:
- Regex Pattern help: finding HTML pattern when nested ASP.NET Eval?
I received an answer to the problem here:
- regexstorm link
The .NET answer that works on the regex .NET testing site does NOT work in my C# Visual Studio environment. Here is the Unit Test for it:
[Test]
public void GetAllHtmlSubsectionsWorksAsExpected()
{
var regPattern = new Regex(@"(?'o'<)(.*)(?'-o'>)+");
var html =
"<%@ Page Language=\"C#\" %>" +
"<td class=\"c1 c2 c3\" colspan=\"2\">" +
"lorem ipsum" +
"<div class=\"d1\" id=\"div2\" attrid=\"<%# Eval(\"CategoryID\") %>\">" +
"testing 123" +
"</div>" +
"asdf" +
"</td>";
List<string> results = new List<string>();
MatchCollection matches = regPattern.Matches(html);
for (int mnum = 0; mnum < matches.Count; mnum++)
{
Match match = matches[mnum];
results.Add("Match #" + (mnum + 1) + " - Value: " + match.Value);
}
Assert.AreEqual(5, results.Count()); //Fails: results.Count() == 1
}
Why does this work on the regexstorm website but not in my unit test?
NOTE that parsing HTML with regex is not a best practice, you should use a dedicated parser.
Now, as for the question itself: the pattern you use only works when each line contains a single substring starting with < and ending with the corresponding >. However, your input string has no newline characters! It looks like:
<%@ Page Language="C#" %><td class="c1 c2 c3" colspan="2">lorem ipsum<div class="d1" id="div2" attrid="<%# Eval("CategoryID") %>">testing 123</div>asdf</td>
The .* subpattern is greedy dot matching: it matches as many characters other than a newline as possible. It first grabs the whole line and then backtracks until the next subpattern (here, >) can match, so the match ends at the last possible >.
To fix that, you need a proper balanced construct matching pattern:
<((?>[^<>]+|<(?<c>)|>(?<-c>))*(?(c)(?!)))>
See regex demo
C#:
var r = new Regex(@"
< # First '<'
( # Capturing group 1
(?> # Atomic group start
[^<>] # Match any character other than `<` or `>`
|
< (?<c>) # Match '<', and add a capture into group 'c'
|
> (?<-c>) # Match '>', and delete 1 value from capture stack
)*
(?(c)(?!)) # Fails if 'c' stack isn't empty!
)
> # Last closing `>`
", RegexOptions.IgnorePatternWhitespace);
DISCLAIMER: Even this regex will fail if you have unpaired < or > in your element nodes, which is why you should not use regex to parse HTML.
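To check the balanced pattern against the HTML from the unit test, it can be wrapped as below. This is a sketch only: TagFinder and FindTags are hypothetical names, and the pattern is the single-line form given above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagFinder
{
    // Each inner '<' pushes a capture onto group 'c' and each inner '>' pops one.
    // The conditional (?(c)(?!)) rejects the match if any '<' is still open.
    private static readonly Regex BalancedTag =
        new Regex(@"<((?>[^<>]+|<(?<c>)|>(?<-c>))*(?(c)(?!)))>");

    // Returns the full text of every balanced <...> substring.
    public static List<string> FindTags(string html)
    {
        return BalancedTag.Matches(html).Cast<Match>().Select(m => m.Value).ToList();
    }
}
```

With the test input, the div tag that contains the nested Eval expression should come back as a single match, which is what brings the total to the five results the unit test expects.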
There are two different things in regex: Matching and capturing.
What you want here is the capturing group 1.
So you need to use this:
results.Add("Match #" + (mnum + 1) + " - Value: " + match.Groups[1].Value);
Also, as the other answer pointed out, your input has no newlines, so the regex captures everything in the first match.
I am trying to ensure that a list of phrases start on their own line by finding them and replacing them with \n + the phrase. eg
your name: joe your age: 28
becomes
your name: joe
your age: 28
I have a file of phrases that I read and loop through, doing the replace for each one. Because some phrases contain two words, I use \b to mark where each phrase starts and ends.
This doesn't seem to work, anybody know why?
example - String is 'Name: xxxxxx' does not get edited.
output = output.Replace('\b' + "Name" + '\b', "match");
Using regular expressions, accounts for any number of words with any number of spaces:
using System.Text.RegularExpressions;
Regex re = new Regex("(?<key>\\w+(\\b\\s+\\w+)*)\\s*:\\s*(?<value>\\w+)");
MatchCollection mc = re.Matches("your name: joe your age: 28 ");
foreach (Match m in mc) {
string key = m.Groups["key"].Value;
string value = m.Groups["value"].Value;
//accumulate into a list, but I'll just write to console
Console.WriteLine(key + " : " + value);
}
Here is some explanation:
Suppose what you want to the left of the colon (:) is called a key, and what is to the right - a value.
These key/value pairs are separated by at least one space. Because of this, the value has to be exactly one word (otherwise we'd have ambiguity).
The above regular expression uses named groups, to make code more readable.
got it
for (int headerNo=0; headerNo<headersArray.Length; headerNo++)
{
string searchPhrase = @"\b" + PhraseArray[headerNo] + @"\b";
string newPhrase = "match";
output = Regex.Replace(output, searchPhrase, newPhrase);
}
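The original attempt failed for two reasons: string.Replace searches for literal text, and '\b' in a C# character literal is a backspace character, not a regex word boundary. The sketch below puts the fix together with the newline prefix the question asks for. PhraseBreaker and BreakBefore are hypothetical names, and Regex.Escape is added on the assumption that phrases may contain regex metacharacters.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PhraseBreaker
{
    // Inserts a newline before every whole-word occurrence of each phrase.
    public static string BreakBefore(string text, IEnumerable<string> phrases)
    {
        foreach (string phrase in phrases)
        {
            // @"\b" is the two-character regex token for a word boundary.
            string pattern = @"\b" + Regex.Escape(phrase) + @"\b";
            text = Regex.Replace(text, pattern, m => "\n" + m.Value);
        }
        return text;
    }
}
```

Matching is case-sensitive here, so "Name" does not match "name"; passing RegexOptions.IgnoreCase to Regex.Replace would relax that.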
Following your example, you can do this:
output = output.Replace("your", "\nyour");
I'm taking a string like "4 + 5 + ( 7 - 9 ) + 8" and trying to split on the parentheses to get a list containing 4 + 5, (7-9), + 8. So I'm using the regex below, but it is giving me 4 + 5, (7-9), 7-9, + 8. Hoping it's just something easy. Thanks.
List<string> test = Regex.Split("4 + 5 + ( 7 - 9 ) + 8", @"(\(([^)]+)\))").ToList();
Remove the extra set of parentheses you have in your regex:
(\(([^)]+)\)) // your regex
( ) // outer parens
\( \) // literal parens match
( ) // extra parens you don't need
[^)]+ // one or more 'not right parens'
The extra parens create a match for 'inside the literal parens', which is the extra 7 - 9 you see.
So you should have:
@"(\([^)]+\))"
List<string> test = Regex.Split("4 + 5 + ( 7 - 9 ) + 8", @"(\([^)]+\))").ToList();
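Regex.Split includes the text of every capturing group in its result, which is why one group yields one extra entry per match and two groups yield two. A small sketch of the corrected call follows; ExpressionSplitter and SplitOnParens are hypothetical names, and the trimming step is an addition that the question did not ask for.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ExpressionSplitter
{
    // Splits on parenthesised groups and keeps each group as its own entry,
    // because the single capturing group is echoed into the result.
    public static List<string> SplitOnParens(string expression)
    {
        return Regex.Split(expression, @"(\([^)]+\))")
            .Select(part => part.Trim())
            .Where(part => part.Length > 0)
            .ToList();
    }
}
```

Dropping empty entries guards against an expression that starts or ends with a parenthesised group, where Split would otherwise return an empty string at that end.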