All regex matches inside of one value [closed] - c#

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 3 years ago.
I pass a string such as var o="ok" into this function. This works, but when the input contains two or more declarations, the value comes out as something like b"var o="ok.
I have already tried every match method I know, but it doesn't work, and I can't find anything wrong with the pattern.
public List<Varible> GetVars(string code)
{
    List<Varible> vars = new List<Varible>();
    Regex dagu = new Regex("var\\s+\\w+=(\\s+|)\"(.+|)\"");
    Match reg = dagu.Match(code);
    while (reg.Success) {
        Match fef = reg;
        Varible v = new Varible();
        v.vartype = vartype.o_string;
        v.name = fef.Value.Substring(fef.Value.IndexOf("r") + 1, fef.Value.IndexOf("=") - fef.Value.IndexOf("r") - 1);
        int b = fef.Value.LastIndexOf("\"");
        int f = fef.Value.IndexOf("\"");
        v.value = fef.Value.Substring(f + 1, b - f - 1);
        vars.Add(v);
        reg = reg.NextMatch();
    }
    return vars;
}
There are no errors reported.

About your pattern:
"var\\s+\\w+=(\\s+|)\"(.+|)\""
A): The two capturing groups are not exactly wrong, but they're oddly written. (\\s+|) captures "one or more spaces, OR nothing at all", which can be expressed as "zero or more spaces". In regex, the "zero or more" quantifier is the star, so you can replace this group with (\\s*). Same for (.+|) which becomes (.*).
Then there's the issue with matching this pattern multiple times: .+ and .* are greedy and the dot also matches quotes, so with two or more declarations on one line the match runs from the first opening quote to the last closing quote. To avoid this, replace the dot with a negated character class that matches anything except a quote: [^\"].
So now your pattern should look like this:
"var\\s+\\w+=(\\s*)\"([^\"]*)\""
B): Then... You don't use any of this. You seem to only use the match as a whole and redo the pattern by hand to get the part you need. I understand you added the parentheses in your pattern to be able to use the | operator, but they also have the nifty effect of creating capturing groups. A capturing group, for short, is what you're asking your regex engine to specifically look for and point out for you when matching. You can access these groups with the Match.Groups property. In your case, because there are two pairs of parentheses in your pattern, you'll create two groups, the first one contains the spacing between the equal sign and the first quote of your input. You don't seem to need it, so let's remove it, and instead capture the name of your 'var':
"var\\s+(\\w+)=\\s*\"([^\"]*)\""
You can now access the var's name with reg.Groups[1].Value and its value with Groups[2], and before you ask, reg.Groups[0].Value does exist but it always stores the entire match, so for your purposes it's the same as reg.Value.
Now to overhaul all this code:
public List<Varible> GetVars(string code)
{
    List<Varible> vars = new List<Varible>();
    Regex dagu = new Regex("var\\s+(\\w+)=\\s*\"([^\"]*)\"");
    Match reg = dagu.Match(code);
    while (reg.Success)
    {
        Varible v = new Varible();
        v.vartype = vartype.o_string;
        v.name = reg.Groups[1].Value;
        v.value = reg.Groups[2].Value;
        vars.Add(v);
        reg = reg.NextMatch();
    }
    return vars;
}
and you should be good.
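To check the corrected pattern against an input that holds several declarations, a minimal sketch could look like the following. The VarDemo class and the tuple return type are assumptions made for illustration; they stand in for the Varible class from the question, whose definition is not shown.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original code.
public static class VarDemo
{
    // Same pattern as above, written as a verbatim string ("" is an escaped quote).
    private static readonly Regex VarPattern =
        new Regex(@"var\s+(\w+)=\s*""([^""]*)""");

    // Returns one (name, value) pair for every declaration found in the code.
    public static List<(string Name, string Value)> Parse(string code)
    {
        var result = new List<(string Name, string Value)>();
        foreach (Match m in VarPattern.Matches(code))
        {
            // Group 1 is the variable name, group 2 the text between the quotes.
            result.Add((m.Groups[1].Value, m.Groups[2].Value));
        }
        return result;
    }
}
```

Calling VarDemo.Parse("var o=\"ok\" var b=\"yes\"") should give two entries, because [^"]* can no longer run past the closing quote of the first declaration.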

Related

How to remove sequence of characters from string [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 5 years ago.
I want to remove an unknown number of character sequences B from a given string A.
The removing must start to the right of the position of a character sequence C. The removing must stop when the B character sequence ends.
Example for string A:
xxxxxxxxBxxxxxxxxxxxxxxxxxCBBBBBByyyyyyyyyByyyy
A ... sequence of characters from which B's that follow C must be removed
C ... a sequence of characters (example: 123)
B ... a sequence of characters (example: vbz)
x and y ... any characters
In this example all B's after C must be removed. All other B's must not be removed.
The result would be:
xxxxxxxxBxxxxxxxxxxxxxxxxxCyyyyyyyyyByyyy
I tried to use:
A = A.replace("vbz","");
but that removes every 'vbz' sequence from A.
How can I exclude the removal of those 'vbz' that are not preceded by C?
Regards, Manu
Why don't you try this?
var.Replace("x", "");
var.Replace("y", "");
Just replace x and y with the unknown string sequence
string A = "xxxxxxxxBxxxxxxxxxxxxxxxxxCBBBBBByyyyyyyyyByyyy";
string pattern = @"(?<=C)[B]*";
string B = Regex.Replace(A, pattern, "");
As per your requirement, 2 conditions need to be satisfied for removing from a string :
1. unknown number of string sequences B
2. The removing must start to the right of the position of a string C
It can be achieved using Regex class of System.Text.RegularExpressions namespace.
string A = "xxxxxxxxBxxxxxxxxxxxxxxxxxCBBBBBByyyyyyyyyByyyy";
string pattern = "(?<=C)[b]*";
string result = Regex.Replace(A, pattern,"",RegexOptions.IgnoreCase);
Note:
The pattern variable contains the regex pattern.
(?<=C) : a lookbehind; it matches a position that is immediately preceded by "C" without making "C" part of the match
[b] : the character to remove
* : zero or more repetitions of the preceding element
RegexOptions.IgnoreCase : makes matching case-insensitive, so both B and b are removed (and the prefix may be C or c)
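The patterns above treat B and C as single letters. If they are multi-character sequences, such as 123 and vbz in the question, one possible sketch builds the pattern from the two sequences. The SequenceRemover class and its RemoveAfter method are hypothetical names; Regex.Escape is used on the assumption that the sequences may contain regex metacharacters.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answers.
public static class SequenceRemover
{
    // Removes every run of `b` that starts immediately after an occurrence of `c`.
    public static string RemoveAfter(string input, string c, string b)
    {
        // (?<=c) anchors the match right after c; (?:b)+ consumes one or more copies of b.
        string pattern = "(?<=" + Regex.Escape(c) + ")(?:" + Regex.Escape(b) + ")+";
        return Regex.Replace(input, pattern, "");
    }
}
```

With this, SequenceRemover.RemoveAfter("xxvbzxx123vbzvbzyyvbzy", "123", "vbz") leaves the vbz occurrences that do not directly follow 123 untouched.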

How to find the capital substring of a string? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 9 years ago.
I am trying to find the capitalized portion of a string, to then insert two characters that represent the Double Capital sign in the Braille language. My intention for doing this is to design a translator that can translate from regular text to Braille.
I'll give an example below.
English String: My variable is of type IEnumerable.
Braille: ,My variable is of type ,,IE-numerable.
I also want the dash in IE-numerable to only break words that have upper and lower case, but not in front of punctuation marks, white spaces, numbers or other symbols.
Thanks a lot in advance for your answers.
I had never heard of a "Double Capital" sign, so I read up on it here. From what I can tell, this should suit your needs.
You can use this to find any sequence of two or more uppercase (majuscule) Latin letters or hyphens in your string:
var matches = Regex.Matches(input, "[A-Z-]{2,}");
You can use this to insert the double-capital sign:
var result = Regex.Replace(input, "[A-Z-]{2,}", ",,$0");
For example:
var input = "this is a TEST";
var result = Regex.Replace(input, "[A-Z-]{2,}", ",,$0"); // this is a ,,TEST
You can use this to handle single and double capitals:
var input = "McGRAW-HILL";
var result = Regex.Replace(input, "[A-Z-]([A-Z-]+)?",
m => (m.Groups[1].Success ? ",," : ",") + m.Value); // ,Mc,,GRAW-HILL
You can find them with a simple regex:
using System.Text.RegularExpressions;
// ..snip..
Regex r = new Regex("[A-Z]"); // This will capture only upper case characters
Match m = r.Match(input, 0);
The variable m of type System.Text.RegularExpressions.Match will contain a collection of captures. If only the first match matters, you can check its Index property directly.
Now you can insert the characters you want in that position, using String.Insert:
input = input.Insert(m.Index, doubleCapitalSign);
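This code can solve your problem:
string x = "abcdEFghijkl";
string capitalized = string.Empty;
for (int i = 0; i < x.Length; i++)
{
    // char.IsUpper is true only for uppercase letters; comparing against
    // ToUpper() would also accept digits, spaces and punctuation.
    if (char.IsUpper(x[i]))
        capitalized += x[i];
}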
Have you tried using the Char.IsUpper method?
http://msdn.microsoft.com/en-us/library/9s91f3by.aspx
This is another similar question that uses that method to solve a similar problem
Get the Index of Upper Case letter from a String
If you just want to find the first index of an uppercase letter:
var firstUpperCharIndex = text // <-- a string
.Select((chr, index) => new { chr, index })
.FirstOrDefault(x => Char.IsUpper(x.chr));
if (firstUpperCharIndex != null)
{
text = text.Insert(firstUpperCharIndex.index, ",,");
}
Not sure if this is what you are going for?
var inputString = string.Empty; //Your input string here
var output = new StringBuilder();
foreach (var c in inputString.ToCharArray())
{
if (char.IsUpper(c))
{
output.AppendFormat("_{0}_", c);
}
else
{
output.Append(c);
}
}
This loops through each character in inputString. If the character is uppercase, it inserts a _ before and after it (replace that with your desired Braille characters); otherwise it appends the character to the output unchanged.

Match occurrences of a character before a control character, match zero if control character not present

I am working on functionality to allow a user to specify a "wildcarded" path for items in a folder hierarchy and an associated action that will be performed when an item matches that path. e.g.:
Path Action
----------- -------
1. $/foo/*/baz include
2. $/foo/bar/* exclude
Now with the example above, an item at $/foo/bar/baz would match both actions. Given this I want to provide a crude score of specificity of the wildcarded path, which will be based on the "depth" the first wildcard character occurs at. The path with the most depth will win. Importantly, only * bounded by forward slashes (/*/) is allowed as a wildcard (except when at the end then /*) and any number could be specified at various points in the path.
TL;DR;
So, I think a regex to count the number of forward slashes prior to the first * is the way to go. However for a number of reasons, where there is no wildcard in the path the match of forward slashes will be zero. I have got to the following negative lookbehind:
(?<!\*.*)/
which works fine when there are wildcards (e.g. 2 forward slash matches for path #1 above and 3 for #2), but when there is no wildcard it naturally matches all forward slashes. I am sure it is a simple step to match none but due to rusty regex skills I am stuck.
Ideally from an academic point of view I'd like to see if a single regex could capture this, however bonus points offered for a more elegant solution to the problem!
This would be one way to do it:
match = Regex.Match(subject,
    @"^            # Start of string
    (              # Match and capture in group number 1...
      [^*/]*       # any number of characters except slashes or asterisks
      /            # followed by a slash
    )*             # zero or more times.
    [^*/]*         # Match any additional non-slash/non-asterisk characters.
    \*             # Then match an asterisk",
    RegexOptions.IgnorePatternWhitespace);
Now this regex fails to match if there is no asterisk in the subject string (score of 0). If the regex matches, you can be sure that there is at least one asterisk in it.
The clever thing now is that .NET regexes, unlike most other regex flavors, actually can count how many times a repeated capturing group has matched (most other regex engines simply discard that information), which allows us to determine the number of slashes before the first asterisk in the string.
That information can be found in
match.Groups[1].Captures.Count
(Of course this means that "no slashes before the first asterisk" and "no asterisk at all" would both get the score 0, which appears to be what you're asking for in your question, but I'm not sure why this would make sense)
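Putting the pattern and the Captures.Count lookup together, a compact sketch of the scoring step might look like this. The WildcardDepth class and Score method are hypothetical names, and the pattern is the one above with the comments and whitespace removed.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class WildcardDepth
{
    // Group 1 is captured once for every slash that precedes the first asterisk.
    private static readonly Regex DepthPattern =
        new Regex(@"^([^*/]*/)*[^*/]*\*");

    // Returns the number of slashes before the first '*', or 0 if the path has no '*'.
    public static int Score(string path)
    {
        Match match = DepthPattern.Match(path);
        return match.Success ? match.Groups[1].Captures.Count : 0;
    }
}
```

For the paths in the question, WildcardDepth.Score("$/foo/*/baz") gives 2 and WildcardDepth.Score("$/foo/bar/*") gives 3, while a path without a wildcard scores 0 because the regex does not match at all.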
A way that would approach the task:
Validate all test paths (make sure they are valid and either contain /*/ or end with /*).
Use a sorted collection to keep track of the test paths and associated actions.
Sort the collection based on the position of the wildcard in the string.
Test the item against each path in the sorted collection.
You can replace the * in the string by .*? to use it in a regex.
Stop at the first match and return the associated action, otherwise continue with the next test in the collection.
A quick test implementation of some of the above:
void Main()
{
// Define some actions to test and add them to a collection
var ActionPaths = new List<ActionPath>() {
new ActionPath() {TestPath = "/foo/*/baz", Action = "include"},
new ActionPath() {TestPath = "/foo/bar/*", Action = "exclude"},
new ActionPath() {TestPath = "/foo/doo/boo", Action = "exclude"},
};
// Sort the list of actions based on the depth of the wildcard
ActionPaths.Sort();
// the path for which we are trying to find the corresponding action
string PathToTest = "/foo/bar/baz";
// Test all ActionPaths from the top down until we find something
var found = default(ActionPath);
foreach (var ap in ActionPaths) {
if (ap.IsMatching(PathToTest)) {
found = ap;
break;
}
}
// At this point, we have either found an Action, or nothing at all
if (found != default(ActionPath)) {
// Found an Action!
} else {
// Found nothing at all :-(
}
}
// Holds a test path and its associated action
class ActionPath : IComparable<ActionPath>
{
public string TestPath;
public string Action;
// Returns true if the given path matches the TestPath
public bool IsMatching(string path) {
var t = TestPath.Replace("*",".*?");
return Regex.IsMatch(path, "^" + t + "$");
}
// Implements IComparable<T>
public int CompareTo(ActionPath other) {
if (other.TestPath == null) return 1;
var ia = TestPath.IndexOf("*");
var ib = other.TestPath.IndexOf("*");
if (ia < ib) return 1;
if (ia > ib) return -1;
return 0;
}
}
No need for regular expressions here.
With LINQ it's a 2-liner:
string s = "$/foo/bar/baz";
var asteriskPos = s.IndexOf('*'); // will be -1 if there is no asterisk
var slashCount = s.Where((c, i) => c == '/' && i < asteriskPos).Count();

Why isn't C# following my regex?

I have a C# application that reads a word file and looks for words wrapped in < brackets >
It's currently using the following code and the regex shown.
private readonly Regex _regex = new Regex("([<])([^>]*)([>])", RegexOptions.Compiled);
I've used several online testing tools / friends to validate that the regex works, and my application proves this (For those playing at home, http://wordfiller.codeplex.com)!
My problem is however the regex will also pickup extra rubbish.
E.G
I'm walking on <sunshine>.
will return
sunshine>.
it should just return
<sunshine>
Anyone know why my application refuses to play by the rules?
I don't think the problem is your regex at all. It could be improved somewhat -- you don't need the ([]) around each bracket -- but that shouldn't affect the results. My strong suspicion is that the problem is in your C# implementation, not your regex.
Your regex should split <sunshine> into three separate groups: <, sunshine, and >. Having tested it with the code below, that's exactly what it does. My suspicion is that, somewhere in the C# code, you're appending Group 3 to Group 2 without realizing it. Some quick C# experimentation supports this:
private readonly Regex _regex = new Regex("([<])([^>]*)([>])", RegexOptions.Compiled);
private string sunshine()
{
string input = "I'm walking on <sunshine>.";
var match = _regex.Match(input);
var regex2 = new Regex("<[^>]*>", RegexOptions.Compiled); //A slightly simpler version
string result = "";
for (int i = 0; i < match.Groups.Count; i++)
{
result += string.Format("Group {0}: {1}\n", i, match.Groups[i].Value);
}
result += "\nWhat you're getting: " + match.Groups[2].Value + match.Groups[3].Value;
result += "\nWhat you want: " + match.Groups[0].Value + " or " + match.Value;
result += "\nBut you don't need all those brackets and groups: " + regex2.Match(input).Value;
return result;
}
Result:
Group 0: <sunshine>
Group 1: <
Group 2: sunshine
Group 3: >
What you're getting: sunshine>
What you want: <sunshine> or <sunshine>
But you don't need all those brackets and groups: <sunshine>
We will need to see more code to solve the problem. There is an off by one error somewhere in your code. It is impossible for that regular expression to return sunshine>.. Therefore the regular expression in question is not the problem. I would assume, without more details, that something is getting the index into the string containing your match and it is one character too far into the string.
If all you want is the text between < and > then you'd be better off using:
[<]([^>]*)[>] or simpler: <([^>]+)>
If you want to include < and > then you could use:
([<][^>]*[>]) or simpler: (<[^>]+>)
Your expression currently has 3 capturing groups, indicated by the parentheses ().
In the case of <sunshine> this will currently return the following:
Group 1 : "<"
Group 2 : "sunshine"
Group 3 : ">"
So if you only looked at the 2nd group it should work!
The only explanation I can give for your observed behaviour is that where you pull the matches out, you are adding together Groups 2 + 3 and not Group 1.
What you posted works perfectly fine.
Regex _regex = new Regex("([<])([^>]*)([>])", RegexOptions.Compiled);
string test = "I'm walking on <sunshine>.";
var match = _regex.Match(test);
The match is <sunshine>, so I guess you need to provide more code.
Regex quantifiers are greedy by default. Teach them to be lazy!
What I mean is, the * operator matches as many repetitions as possible (it's said to be greedy). Use the *? operator instead; this tells the regex engine to match as few repetitions as possible (i.e. to be lazy):
<.*?>
Because you are using parentheses, you are creating capturing groups, so each match also exposes those groups alongside the full match. You can reduce your regular expression to [<][^>]*[>] and it will match only the <text> that you want.

How can I get a regex match to only be added once to the matches collection?

I have a string which has several html comments in it. I need to count the unique matches of an expression.
For example, the string might be:
var teststring = "<!--X1-->Hi<!--X1-->there<!--X2-->";
I currently use this to get the matches:
var regex = new Regex("<!--X.-->");
var matches = regex.Matches(teststring);
The results of this is 3 matches. However, I would like to have this be only 2 matches since there are only two unique matches.
I know I can probably loop through the resulting MatchCollection and remove the extra Match, but I'm hoping there is a more elegant solution.
Clarification: The sample string is greatly simplified from what is actually being used. There can easily be an X8 or X9, and there are likely dozens of each in the string.
I would just use the Enumerable.Distinct Method for example like this:
string subjectString = "<!--X1-->Hi<!--X1-->there<!--X2--><!--X1-->Hi<!--X1-->there<!--X2-->";
var regex = new Regex(@"<!--X\d-->");
var matches = regex.Matches(subjectString);
var uniqueMatches = matches
.OfType<Match>()
.Select(m => m.Value)
.Distinct();
uniqueMatches.ToList().ForEach(Console.WriteLine);
Outputs this:
<!--X1-->
<!--X2-->
For regular expression, you could maybe use this one?
(<!--X\d-->)(?!.*\1.*)
Seems to work on your test string in RegexBuddy at least =)
// (<!--X\d-->)(?!.*\1.*)
//
// Options: dot matches newline
//
// Match the regular expression below and capture its match into backreference number 1 «(<!--X\d-->)»
// Match the characters “<!--X” literally «<!--X»
// Match a single digit 0..9 «\d»
// Match the characters “-->” literally «-->»
// Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!.*\1.*)»
// Match any single character «.*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
// Match the same text as most recently matched by capturing group number 1 «\1»
// Match any single character «.*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
It appears you're doing two different things:
Matching comments like /<!--X.-->/
Finding the set of unique comments
So it is fairly logical to handle these as two different steps:
var regex = new Regex("<!--X.-->");
var matches = regex.Matches(teststring);
var uniqueMatches = matches.Cast<Match>().Distinct(new MatchComparer());
class MatchComparer : IEqualityComparer<Match>
{
public bool Equals(Match a, Match b)
{
return a.Value == b.Value;
}
public int GetHashCode(Match match)
{
return match.Value.GetHashCode();
}
}
Extract the comments and store them in an array. Then you can filter out the unique values.
But I don’t know how to implement this in C#.
Depending on how many Xn's you have you might be able to use:
(\<!--X1--\>){1}.*(\<!--X2--\>){1}
That will only match each occurrence of the X1, X2 etc. once provided they are in order.
Capture the inner portion of the comment as a group. Then put those strings into a hashtable(dictionary). Then ask the dictionary for its count, since it will self weed out repeats.
var teststring = "<!--X1-->Hi<!--X1-->there<!--X2-->";
var tokens = new Dictionary<string, string>();
// Lazy .*? so each comment is matched on its own instead of one match spanning the whole string
Regex.Replace(teststring, @"<!--(.*?)-->",
    match => {
        tokens[match.Groups[1].Value] = match.Groups[1].Value;
        return "";
    });
var uniques = tokens.Keys.Count;
By using the Regex.Replace construct you get to have a lambda called on each match. Since you are not interested in the replace, you don't set it equal to anything.
You must use Groups[1] because Groups[0] is the entire match.
I'm only repeating the same thing on both sides, so that its easier to put into the dictionary, which only stores unique keys.
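Since the dictionary is used only for its unique keys, the same idea can be sketched with a HashSet<string> and Regex.Matches, which avoids calling Replace purely for its side effect. This is a swapped-in variation rather than the answer's own code; the UniqueTokens class and Count method are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class UniqueTokens
{
    // Counts the distinct comment bodies (for example "X1" and "X2") in the input.
    public static int Count(string input)
    {
        var tokens = new HashSet<string>();
        // The lazy .*? stops at the first "-->", so each comment is matched separately.
        foreach (Match m in Regex.Matches(input, @"<!--(.*?)-->"))
        {
            tokens.Add(m.Groups[1].Value); // Add ignores values that are already present
        }
        return tokens.Count;
    }
}
```

For the test string from the question, UniqueTokens.Count("<!--X1-->Hi<!--X1-->there<!--X2-->") gives 2.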
If you want a distinct Match list from a MatchCollection without converting to string, you can use something like this:
var distinctMatches = matchList.OfType<Match>().GroupBy(x => x.Value).Select(x =>x.First()).ToList();
I know it has been 12 years but sometimes we need this kind of solutions, so I wanted to share. C# evolved, .NET evolved, so it's easier now.
