I have the following string that would require me to parse it via Regex in C#.
Format: rec_mnd.rate.current_rate.sum.QWD.RET : 214345
I would like to extract our the bold chars as group objects in a groupcollection.
QWD = 1 group
RET = 1 group
214345 = 1 group
what would the message pattern be like?
It would be something like this:
string s = "Format: rec_mnd.rate.current_rate.sum.QWD.RET : 214345";
Match m = Regex.Match(s, #"^Format: rec_mnd\.rate\.current_rate\.sum\.(.+?)\.(.+?) : (\d+)$");
if( m.Success )
{
Console.WriteLine(m.Groups[1].Value);
Console.WriteLine(m.Groups[2].Value);
Console.WriteLine(m.Groups[3].Value);
}
The question mark in the first two groups make that quantifier lazy: it will capture the least possible amount of characters. In other words, it captures until the first . it sees. Alternatively, you could use ([^.]+) in those groups, which explicitly captures everything except a period.
The last group explicitly only captures decimal digits. If your expression can have other values on the right side of the : you'd have to change that to .+ as well.
Please, make it a lot easier on yourself and label your groups to make it easier to understand what is going on in code.
RegEx myRegex = new Regex(#"rec_mnd\.rate\.current_rate\.sum\.(?<code>[A-Z]{3})\.(?<subCode>[A-Z]{3})\s*:\s*(?<number>\d+)");
var matches = myRegex.Matches(sourceString);
foreach(Match match in matches)
{
//do stuff
Console.WriteLine("Match");
Console.WriteLine("Code: " + match.Groups["code"].Value);
Console.WriteLine("SubCode: " + match.Groups["subCode"].Value);
Console.WriteLine("Number: " + match.Groups["number"].Value);
}
This should give you what you want regardless of what's between the .'s.
#"(?:.+\.){4}(.\w+)\.(\w+)\s?:\s?(\d+)"
Related
I currently have a string which looks like this when it is returned :
//This is the url string
// the-great-debate---toilet-paper-over-or-under-the-roll
string name = string.Format("{0}",url);
name = Regex.Replace(name, "-", " ");
And when I perform the following Regex operation it becomes like this :
the great debate toilet paper over or under the roll
However, like I mentioned in the question, I want to be able to apply regex to the url string so that I have the following output:-
the great debate - toilet paper over or under the roll
I would really appreciate any assistance.
[EDIT] However, not all the strings look like this, some of them just have a single hyphen so the above method work
world-water-day-2016
and it changes to
world water day 2016
but for this one:
the-great-debate---toilet-paper-over-or-under-the-roll
I need a way to check if the string has 3 hyphens than replace those 3 hyphens with [space][hyphen][space]. And than replace all the remaining single hyphens between the words with space.
First of all, there is always a very naive solution to this kind of problem: you replace your specific matches in context with some chars that are not usually used in the current environment and after replacing generic substrings you may replace the temporary substrings with the necessary exception.
var name = url.Replace("---", "[ \uFFFD ]").Replace("-", " ").Replace("[ \uFFFD ]", " - ");
You may also use a regex based replacement that matches either a 3-hyphen substring capturing it, or just match a single hyphen, and then check if Group 1 matched inside a match evaluator (the third parameter to Regex.Replace can be a Match evaluator method).
It will look like
var name = Regex.Replace(url, #"(---)|-", m => m.Groups[1].Success ? " - " : " ");
See the C# demo.
So, when (---) part matches, the 3 hyphens are put into Group 1 and the .Success property is set to true. Thus, m => m.Groups[1].Success ? " - " : " " replaces 3 hyphens with space+-+space and 1 hyphen (that may be actually 1 of the 2 consecutive hyphens) with a space.
Here's a solution using LINQ rather than Regex:
var str = "the-great-debate---toilet-paper-over-or-under-the-roll";
var result = str.Split(new string[] {"---"}, StringSplitOptions.None)
.Select(s => s.Replace("-", " "))
.Aggregate((c,n) => $"{c} - {n}");
// result = "the great debate - toilet paper over or under the roll"
Split the string up based on the ---, then remove hyphens from each substring, then join them back together.
The easy way:
name = Regex.Replace(name, "\b-|-\b", " ");
The show-off way:
name = Regex.Replace(name, "(\b)?-(?(1)|\b)", " ");
I've got an input string that looks like this:
level=<device[195].level>&name=<device[195].name>
I want to create a RegEx that will parse out each of the <device> tags, for example, I'd expect two items to be matched from my input string: <device[195].level> and <device[195].name>.
So far I've had some luck with this pattern and code, but it always finds both of the device tags as a single match:
var pattern = "<device\\[[0-9]*\\]\\.\\S*>";
Regex rgx = new Regex(pattern);
var matches = rgx.Matches(httpData);
The result is that matches will contain a single result with the value <device[195].level>&name=<device[195].name>
I'm guessing there must be a way to 'terminate' the pattern, but I'm not sure what it is.
Use non-greedy quantifiers:
<device\[\d+\]\.\S+?>
Also, use verbatim strings for escaping regexes, it makes them much more readable:
var pattern = #"<device\[\d+\]\.\S+?>";
As a side note, I guess in your case using \w instead of \S would be more in line with what you intended, but I left the \S because I can't know that.
depends how much of the structure of the angle blocks you need to match, but you can do
"\\<device.+?\\>"
I want to create a RegEx that will parse out each of the <device> tags
I'd expect two items to be matched from my input string:
1. <device[195].level>
2. <device[195].name>
This should work. Get the matched group from index 1
(<device[^>]*>)
Live demo
String literals for use in programs:
#"(<device[^>]*>)"
Change your repetition operator and use \w instead of \S
var pattern = #"<device\[[0-9]+\]\.\w+>";
String s = #"level=<device[195].level>&name=<device[195].name>";
foreach (Match m in Regex.Matches(s, #"<device\[[0-9]+\]\.\w+>"))
Console.WriteLine(m.Value);
Output
<device[195].level>
<device[195].name>
Use named match groups and create a linq entity projection. There will be two matches, thus separating the individual items:
string data = "level=<device[195].level>&name=<device[195].name>";
string pattern = #"
(?<variable>[^=]+) # get the variable name
(?:=<device\[) # static '=<device'
(?<index>[^\]]+) # device number index
(?:]\.) # static ].
(?<sub>[^>]+) # Get the sub command
(?:>&?) # Match but don't capture the > and possible &
";
// Ignore pattern whitespace is to document the pattern, does not affect processing.
var items = Regex.Matches(data, pattern, RegexOptions.IgnorePatternWhitespace)
.OfType<Match>()
.Select (mt => new
{
Variable = mt.Groups["variable"].Value,
Index = mt.Groups["index"].Value,
Sub = mt.Groups["sub"].Value
})
.ToList();
items.ForEach(itm => Console.WriteLine ("{0}:{1}:{2}", itm.Variable, itm.Index, itm.Sub));
/* Output
level:195:level
name:195:name
*/
I'm met problem with string parsing and want solve her by regular expression.
Always as input I'm get string the same like: %function_name%(IN: param1, ..., paramN; OUT: param1,..., paramN)
I'm wrote a pattern:
string pattern = #"[A-za-z][A-za-z0-9]*\(IN:\s*(([A-za-z][A-za-z0-9](,|;))+|;)\s*OUT:(\s*[A-za-z][A-za-z0-9],?)*\)";
This pattern detected my input strings, but in fact as output I'm want to have a two arrays of strings. One of this must contain INPUT params (after "IN:") IN: param1, ..., paramN and second array must have names of output params. Params can contains numbers and '_'.
Few examples of real input strings:
Add_func(IN: port_0, in_port_1; OUT: out_port99)
Some_func(IN:;OUT: abc_P1)
Some_func2(IN: input_portA;OUT:)
Please, tell me how to make a correct pattern.
You can use this pattern, that allows to catch all functions with separate params in one shot:
(?<funcName>\w+)\(IN: ?|OUT: ?|\G(?<inParam>[^,;()]+)?(?=[^)(;]*;)\s*[,;]\s*|\G(?<outParam>[^,()]+)(?=[^;]*\s*\))\s*[,)]\s*
Pattern details:
(?<funcName>\w+)\(IN: ? # capture the function name and match "(IN: "
| # OR
OUT: ? # match "OUT: "
| # OR
\G(?<inParam>[^,;()]+)? # contiguous match, that captures a IN param
(?=[^)(;]*;) # check that it is always followed by ";"
\s*[,;]\s* # match "," or ";" (to be always contiguous)
| # OR
\G(?<outParam>[^,()]+)? # contiguous match, that captures a OUT param
(?=[^;]*\s*\)) # check that it is always followed by ")"
\s*[,)]\s* # match "," (to be always contiguous) or ")"
(To obtain a cleaner result, you must walk to the match array (with a foreach) and remove empty entries)
example code:
static void Main(string[] args)
{
string subject = #"Add_func(IN: port_0, in_port_1; OUT: out_port99)
Some_func(IN:;OUT: abc_P1)
shift_data(IN:po1_p0;OUT: po1_p1, po1_p2)
Some_func2(IN: input_portA;OUT:)";
string pattern = #"(?<funcName>\w+)\(IN: ?|OUT: ?|\G(?<inParam>[^,;()]+)?(?=[^)(;]*;)\s*[,;]\s*|\G(?<outParam>[^,()]+)(?=[^;]*\s*\))\s*[,)]\s*";
Match m = Regex.Match(subject, pattern);
while (m.Success)
{
if (m.Groups["funcName"].ToString() != "")
{
Console.WriteLine("\nfunction name: " + m.Groups["funcName"]);
}
if (m.Groups["inParam"].ToString() != "")
{
Console.WriteLine("IN param: " + m.Groups["inParam"]);
}
if (m.Groups["outParam"].ToString() != "")
{
Console.WriteLine("OUT param: "+m.Groups["outParam"]);
}
m = m.NextMatch();
}
}
An other way consists to match all IN parameters and all OUT parameters in one string and then to split these strings with \s*,\s*
example:
string pattern = #"(?<funcName>\w+)\(\s*IN:\s*(?<inParams>[^;]*?)\s*;\s*OUT\s*:\s*(?<outParams>[^)]*?)\s*\)";
Match m = Regex.Match(subject, pattern);
while (m.Success)
{
string functionName = m.Groups["function name"].ToString();
string[] inParams = Regex.Split(m.Groups["inParams"].ToString(), #"\s*,\s*");
string[] outParams = Regex.Split(m.Groups["outParams"].ToString(), #"\s*,\s*");
// Why not construct a "function" object to store all these values
m = m.NextMatch();
}
The way to do this is with capturing groups. Named capturing groups are the easiest to work with:
// a regex surrounded by parens is a capturing group
// a regex surrounded by (?<name> ... ) is a named capturing group
// here I've tried to surround the relevant parts of the pattern with named groups
var pattern = #"[A-za-z][A-za-z0-9]*\(IN:\s*(((?<inValue>[A-za-z][A-za-z0-9])(,|;))+|;)\s*OUT:(\s*(?<outValue>[A-za-z][A-za-z0-9]),?)*\)";
// get all the matches. ExplicitCapture is just an optimization which tells the engine that it
// doesn't have to save state for non-named capturing groups
var matches = Regex.Matches(input: input, pattern: pattern, options: RegexOptions.ExplicitCapture)
// convert from IEnumerable to IEnumerable<Match>
.Cast<Match>()
// for each match, select out the captured values
.Select(m => new {
// m.Groups["inValue"] gets the named capturing group "inValue"
// for groups that match multiple times in a single match (as in this case, we access
// group.Captures, which records each capture of the group. .Cast converts to IEnumerable<T>,
// at which point we can select out capture.Value, which is the actual captured text
inValues = m.Groups["inValue"].Captures.Cast<Capture>().Select(c => c.Value).ToArray(),
outValues = m.Groups["outValue"].Captures.Cast<Capture>().Select(c => c.Value).ToArray()
})
.ToArray();
I think this is what you are looking for:
[A-za-z][A-za-z0-9_]*\(IN:((?:\s*(?:[A-za-z][A-za-z0-9_]*(?:[,;])))+|;)\s*OUT:(\s*[A-za-z][A-za-z0-9_]*,?)*\)
There were a few problems with grouping as well as you were missing the space between multiple IN parameters. You also were not allowing for an underscore which appeared in your examples.
The above will work with all of your examples above.
Add_func(IN: port_0, in_port_1; OUT: out_port99) will capture:
port_0, in_port_1 and out_port99
Some_func(IN:;OUT: abc_P1) will capture:
; and abc_P1
Some_func2(IN: input_portA; OUT:) will capture:
input_portA and empty.
After getting these capture groups, you can split them on commas to get your arrays.
I have a C# application that reads a word file and looks for words wrapped in < brackets >
It's currently using the following code and the regex shown.
private readonly Regex _regex = new Regex("([<])([^>]*)([>])", RegexOptions.Compiled);
I've used several online testing tools / friends to validate that the regex works, and my application proves this (For those playing at home, http://wordfiller.codeplex.com)!
My problem is however the regex will also pickup extra rubbish.
E.G
I'm walking on <sunshine>.
will return
sunshine>.
it should just return
<sunshine>
Anyone know why my application refuses to play by the rules?
I don't think the problem is your regex at all. It could be improved somewhat -- you don't need the ([]) around each bracket -- but that shouldn't affect the results. My strong suspicion is that the problem is in your C# implementation, not your regex.
Your regex should split <sunshine> into three separate groups: <, sunshine, and >. Having tested it with the code below, that's exactly what it does. My suspicion is that, somewhere in the C# code, you're appending Group 3 to Group 2 without realizing it. Some quick C# experimentation supports this:
private readonly Regex _regex = new Regex("([<])([^>]*)([>])", RegexOptions.Compiled);
private string sunshine()
{
string input = "I'm walking on <sunshine>.";
var match = _regex.Match(input);
var regex2 = new Regex("<[^>]*>", RegexOptions.Compiled); //A slightly simpler version
string result = "";
for (int i = 0; i < match.Groups.Count; i++)
{
result += string.Format("Group {0}: {1}\n", i, match.Groups[i].Value);
}
result += "\nWhat you're getting: " + match.Groups[2].Value + match.Groups[3].Value;
result += "\nWhat you want: " + match.Groups[0].Value + " or " + match.Value;
result += "\nBut you don't need all those brackets and groups: " + regex2.Match(input).Value;
return result;
}
Result:
Group 0: <sunshine>
Group 1: <
Group 2: sunshine
Group 3: >
What you're getting: sunshine>
What you want: <sunshine> or <sunshine>
But you don't need all those brackets and groups: <sunshine>
We will need to see more code to solve the problem. There is an off by one error somewhere in your code. It is impossible for that regular expression to return sunshine>.. Therefore the regular expression in question is not the problem. I would assume, without more details, that something is getting the index into the string containing your match and it is one character too far into the string.
If all you want is the text between < and > then you'd be better off using:
[<]([^>]*)[>] or simpler: <([^>]+)>
If you want to include < and > then you could use:
([<][^>]*[>]) or simpler: (<[^>]+>)
You're expression currently has 3 Group Matches - indicated by the brackets ().
In the case of < sunshine> this will currently return the following:
Group 1 : "<"
Group 2 : "sunshine"
Group 3 : ">"
So if you only looked at the 2nd group it should work!
The only explanation I can give for your observed behaviour is that where you pull the matches out, you are adding together Groups 2 + 3 and not Group 1.
What you posted works perfectly fine.
Regex _regex = new Regex("([<])([^>]*)([>])", RegexOptions.Compiled);
string test = "I'm walking on <sunshine>.";
var match = _regex.Match(test);
Match is <sunshine> i guess you need to provide more code.
Regex is eager by default. Teach it to be lazy!
What I mean is, the * operator considers as many repetitions as possible (it's said to be eager). Use the *? operator instead, this tells Regex to consider as few repetitions as possible (i.e. to be lazy):
<.*?>
Because you are using parenthesis, you are creating matching groups. This is causing the match collection to match the groups created by the regular expression to also be matched. You can reduce your regular expression to [<][^>]*[>] and it will match only on the <text> that you wish.
I'm writing a function-parsing engine which uses regular expressions to separate the individual terms (defined as a constant or a variable followed (optionally) by an operator). It's working great, except when I have grouped terms within other grouped terms. Here's the code I'm using:
//This matches an opening delimiter
Regex openers = new Regex("[\\[\\{\\(]");
//This matches a closing delimiter
Regex closers = new Regex("[\\]\\}\\)]");
//This matches the name of a variable (\w+) or a constant numeric value (\d+(\.\d+)?)
Regex VariableOrConstant = new Regex("((\\d+(\\.\\d+)?)|\\w+)" + FunctionTerm.opRegex + "?");
//This matches the binary operators +, *, -, or /
Regex ops = new Regex("[\\*\\+\\-/]");
//This compound Regex finds a single variable or constant term (including a proceeding operator,
//if any) OR a group containing multiple terms (and their proceeding operators, if any)
//and a proceeding operator, if any.
//Matches that match this second pattern need to be added to the function as sub-functions,
//not as individual terms, to ensure the correct evalutation order with parentheses.
Regex splitter = new Regex(
openers +
"(" + VariableOrConstant + ")+" + closers + ops + "?" +
"|" +
"(" + VariableOrConstant + ")" + ops + "?");
When "splitter" is matched against the string "4/(2*X*[2+1])", the matches' values are "4/", "2*", "X*", "2+", and "1", completely ignoring all of the delimiting parentheses and braces. I believe this is because the second half of the "splitter" Regex (the part after the "|") is being matched and overriding the other part of the pattern. This is bad- I want grouped expressions to take precedence over single terms. Does anyone know how I can do this? I looked into using positive/negative lookaheads and lookbehinds, but I'm honestly not sure how to use those, or what they're even for, for that matter, and I can't find any relevant examples... Thanks in advance.
You didn't show us how you're applying the regex, so here's a demo I whipped up:
private static void ParseIt(string subject)
{
Console.WriteLine("subject : {0}\n", subject);
Regex openers = new Regex(#"[[{(]");
Regex closers = new Regex(#"[]})]");
Regex ops = new Regex(#"[*+/-]");
Regex VariableOrConstant = new Regex(#"((\d+(\.\d+)?)|\w+)" + ops + "?");
Regex splitter = new Regex(
openers + #"(?<FIRST>" + VariableOrConstant + #")+" + closers + ops + #"?" +
#"|" +
#"(?<SECOND>" + VariableOrConstant + #")" + ops + #"?",
RegexOptions.ExplicitCapture
);
foreach (Match m in splitter.Matches(subject))
{
foreach (string s in splitter.GetGroupNames())
{
Console.WriteLine("group {0,-8}: {1}", s, m.Groups[s]);
}
Console.WriteLine();
}
}
output:
subject : 4/(2*X*[2+1])
group 0 : 4/
group FIRST :
group SECOND : 4/
group 0 : 2*
group FIRST :
group SECOND : 2*
group 0 : X*
group FIRST :
group SECOND : X*
group 0 : [2+1]
group FIRST : 1
group SECOND :
As you can see, the term [2+1] is matched by the first part of the regex, as you intended. It can't do anything with the (, though, because the next bracketing character after that is another "opener" ([), and it's looking for a "closer".
You could use .NET's "balanced matching" feature to allow for grouped terms enclosed in other groups, but it's not worth the effort. Regexes are not designed for parsing--in fact, parsing and regex matching are fundamentally different kinds of operation. And this is a good example of the difference: a regex actively seeks out matches, skipping over anything it can't use (like the open-parenthesis in your example), but a parser has to examine every character (even if it's just to decide to ignore it).
About the demo: I tried to make the minimum functional changes necessary to get your code to work (which is why I didn't correct the error of putting the + outside the capturing group), but I also made several surface changes, and those represent active recommendations. To wit:
Always use verbatim string literals (#"...") when creating regexes in C# (I think the reason is obvious).
If you're using capturing groups, use named groups whenever possible, but don't use named groups and numbered groups in the same regex. Named groups save you the hassle of keeping track of what's captured where, and the ExplicitCapture option saves you having to clutter up the regex with (?:...) wherever you need a non-capturing group.
Finally, that whole scheme of building a large regex from a bunch of smaller regexes has very limited usefulness IMO. It's very difficult to keep track of the interactions between the parts, like which part's inside which group. Another advantage of C#'s verbatim strings is that they're multiline, so you can take advantage of free-spacing mode (a.k.a. /x or COMMENTS mode):
Regex r = new Regex(#"
(?<GROUPED>
[[{(] # opening bracket
( # group containing:
((\d+(\.\d+)?)|\w+) # number or variable
[*+/-]? # and proceeding operator
)+ # ...one or more times
[]})] # closing bracket
[*+/-]? # and proceeding operator
)
|
(?<UNGROUPED>
((\d+(\.\d+)?)|\w+) # number or variable
[*+/-]? # and proceeding operator
)
",
RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace
);
This is not intended as a solution to your problem; as I said, that's not a job for regexes. This is just a demonstration of some useful regex techniques.
try using difrent quantifiers
greedy:
* + ?
possessive:
*+ ++ ?+
lazy:
*? +? ??
Try reading this and this
also maybe non-capturing group:
(?:your expr here)
try try try! practice make perfect! :)