Using Visual Studio 2013 regular expression find - how to invalidate multiple prefixes - c#

I'm trying to find the "symbol" all in a bunch of C#, XML, and JS files. My project is huge and doing a naive search for "all" results in over 8,000 lines found so I'm trying to eliminate some of them.
For example, I don't want to match on "call" or "balloon" or "Balloon" (those are UX element styles).
Looking at the Using Regular Expressions MSDN page (http://msdn.microsoft.com/en-us/library/vstudio/2k3te2cs%28v=vs.110%29.aspx) I found out how to exclude one of those prefixes, but I can't figure out how to exclude several and make it case-insensitive.
I started off using:
(?!c)all
And that filtered out call and things like that, but I can't get a version that filters out multiple prefixes to work.
(?!b|c)all
Is the form I've been playing around with, trying to get it to ignore balloon. Ideally I could do something like (warning! - invalid regex below)
(?!b|c|B|C|)all
If anyone could point me in the right direction that would be great. The reason why I'm not looking for all surrounded by spaces is because I don't know if the reference I'm looking for is going to be:
.All
.all
("All")
(all)
and etc...

Have you tried: [aA][lL][lL]\b
This matches any casing of "all" (for example "all" or "ALL") that ends at a word/non-word boundary.
Here is another reference..
Regular Expression to match specific string

The following regex: (?<!(b|c))all (with the IgnoreCase flag)
With the following input: ball all stall .all( "ALL"
Has the following matches: ball [all] st[all] .[all]( "[ALL]"
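A minimal C# sketch of the lookbehind above, assuming the pattern is run through System.Text.RegularExpressions rather than the Visual Studio Find dialog. The sample input is the one from this answer, extended with "Call" and "Balloon" from the question; the variable names are only illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Negative lookbehind: "all" must not be directly preceded by b or c.
// RegexOptions.IgnoreCase applies to the lookbehind as well as to "all".
var pattern = new Regex(@"(?<!(b|c))all", RegexOptions.IgnoreCase);
string input = "ball all stall .all( \"ALL\" Call Balloon";

// Collect the start index of every match so the skipped words are easy to see.
int[] hits = pattern.Matches(input).Cast<Match>().Select(m => m.Index).ToArray();

Console.WriteLine(string.Join(", ", hits)); // 5, 11, 16, 22
```

The occurrences inside "ball", "Call" and "Balloon" are skipped, while "stall" still matches because only b and c are excluded.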

I think you are on the right track with lookarounds, but you want to use a character class with it. (More info: http://www.regular-expressions.info/charclass.html)
There are convenient shorthand character classes, like \w, to represent common classes. \w, for example, matches any alphanumeric character or underscore and is roughly shorthand for [A-Za-z0-9_].
\b represents a "word boundary": the start or end of a word, whether that falls at the edge of the string or between a word character and a non-word character. It is zero-length and won't consume any characters.
Here are some examples using a word boundary, positive lookaround, and negative lookaround respectively:
\b[aA][lL][lL]\b
(?<=[^\w])[aA][lL][lL](?=[^\w])
(?<!\w)[aA][lL][lL](?!\w)
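Basically, these will find case-insensitive matches of "all" that are not part of a longer word. Note that the positive-lookaround form needs an actual non-word character on each side, so it will not match "all" at the very start or end of the input, while the other two forms will. If you want to exclude certain surrounding characters, you can replace \w with your own character class (e.g. to exclude surrounding quote marks, use [A-Za-z0-9_"] instead of \w).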
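The three variants differ only at the edges of the input. The following sketch, assuming plain .NET Regex calls, runs each pattern against "all" on its own and against "(all)"; the test strings are chosen purely for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

string[] patterns =
{
    @"\b[aA][lL][lL]\b",                 // word boundaries
    @"(?<=[^\w])[aA][lL][lL](?=[^\w])",  // positive lookarounds: need a real character on each side
    @"(?<!\w)[aA][lL][lL](?!\w)",        // negative lookarounds: also succeed at the string edges
};

// "all" alone has no neighbouring characters; "(all)" has one on each side.
foreach (string p in patterns)
{
    bool alone = Regex.IsMatch("all", p);
    bool wrapped = Regex.IsMatch("(all)", p);
    Console.WriteLine($"{p}: alone={alone}, wrapped={wrapped}");
}
```

Only the positive-lookaround pattern reports alone=False, because [^\w] has to consume an existing character before and after the word.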

Related

ASP.net core RegularExpression attribute - multiple conditions

I have two regex that should be matched:
"^[a-z0-9\\!#\\$\\^&\\-\\+%\\=_\\(\\)\\{\\}\\<\\>'\";\\:/\\.,~`\\|\\\\]+$"
and
".*(g[o0]+gle).*"
The first one accepts any alphanumeric character (with a few extras), like helloworld123. The second one should reject any string that contains the word "google" (in different forms, like gooo0gle).
Allowed:
hello
helloworld
helloworld123
Disallowed:
hellogoogle
google
...
I want to use the RegularExpression to match this string. Thought about something like:
[RegularExpression("^[a-z0-9\\!#\\$\\^&\\-\\+%\\=_\\(\\)\\{\\}\\<\\>'\";\\:/\\.,~`\\|\\\\]+$|.*(g[o0]+gle).*")]
But it's not working since the second part (.*(g[o0]+gle).*) should be NOT.
How to do it right?
Thanks.
You can place your second regex in a negative lookahead, use the first regex as a character set, and combine the two into the following regex:
^(?!.*g[o0]+gle)[-a-z0-9!#$^&+%=_(){}<>'";:\/.,~`|]+$
Here, the negative lookahead (?!.*g[o0]+gle) rejects any string that contains google or any variation supported by your regex, and the character set [-a-z0-9!#$^&+%=_(){}<>'";:\/.,~`|]+ matches one or more of the allowed characters.
Also, you don't need to escape most special characters inside a character set, so I have unescaped all of them except /. Always place the hyphen - as either the very first or the very last character in the character set; otherwise, depending on the regex dialect, you may see weird behavior.
Regex Demo
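As a quick check, the combined pattern can be tried against the allowed and disallowed samples from the question. This is a sketch using Regex.IsMatch directly rather than the validation attribute; the constant name Pattern is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Combined pattern from the answer above. Inside a verbatim string (@"...")
// a literal double quote is written as "".
const string Pattern =
    @"^(?!.*g[o0]+gle)[-a-z0-9!#$^&+%=_(){}<>'"";:\/.,~`|]+$";

foreach (string candidate in new[] { "hello", "helloworld123", "hellogoogle", "gooo0gle" })
{
    // True for the allowed samples, False for anything containing a "google" variant.
    Console.WriteLine($"{candidate}: {Regex.IsMatch(candidate, Pattern)}");
}
```

Because the pattern is a compile-time constant, the same string could presumably be passed to [RegularExpression(...)] on a model property.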

Conditional match without false force a match?

I'm using the following regex in c# to match some input cases:
^
(?<entry>[#])?
(?(entry)(?<id>\w+))
(?<value>.*)
$
The IgnorePatternWhitespace option is set.
My input looks as follows:
hello
#world
[xxx]
This all can be tested here: DEMO
My problem is that this regex will not match the last line. Why?
What I'm trying to do is to check for an entry character. If it's there I force an identifier by \w+. The rest of the input should be captured in the last group.
This is a simplified regex and simplified input.
The problem can be fixed if I change the id regex to something like (?(entry)(?<id>\w+)|), (?(entry)(?<id>\w+))? or (?(entry)(?<id>\w+)?).
I try to understand why the conditional group doesn't match as stated in original regex.
I'm firm in regex and know that the regex can be simplified to ^(\#(?<id>\w+))?(?<value>.*)$ to match my needs. But the real regex contains two more optional groups:
^
(?<entry>[#])?
(\?\:)?
(\(\?(?:\w+(?:-\w+)?|-\w+)\))?
(?(entry)(?<id>\w+))
(?<value>.*)
$
That's the reason why I'm trying to use a conditional match.
UPDATE 10/12/2018
I experimented a little more and found the following regex, which should match every input, even an empty one, but it doesn't:
(?(a)a).*
DEMO
I'm of the opinion that this is a bug in .net regex and reported it to microsoft: See here for more information
There is no error in the regex parser; the problem is in the usage of the . wildcard specifier. The . specifier will consume all characters, wait for it, except the linefeed character \n. (See "the any character" in Character Classes in Regular Expressions.)
If you want your regex to work, you need to consume all characters including the linefeed, and that can be done by specifying the Singleline option. To paraphrase the documentation:
Singleline tells the parser to let . match all characters, including \n.
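The effect of the option can be seen in isolation with a two-line string. This is a minimal sketch; the sample text is made up for the illustration and is not the input from the question.

```csharp
using System;
using System.Text.RegularExpressions;

string text = "first\nsecond";

// By default "." stops at the line feed, so ".*" only covers the first line.
string withoutOption = Regex.Match(text, "^.*").Value;

// With RegexOptions.Singleline "." also matches \n and ".*" runs to the end.
string withOption = Regex.Match(text, "^.*", RegexOptions.Singleline).Value;

Console.WriteLine(withoutOption.Length); // 5
Console.WriteLine(withOption.Length);    // 12
```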
Why does it still fail outside Singleline mode, when the other lines are consumed? Because the final match leaves the current position at the \n, and the only construct left to consume it (as the pattern is written) is .*, which, as mentioned, cannot consume it, so the parser stops. The $ also locks in the operation at this point.
Let me demonstrate what is happening with a tool I have created which illustrates the issue. In the tool, the upper left corner shows the example text. Below that is what the parser sees, with the \r\n characters represented by ↵¶ respectively. That pane also shows each match enclosed in a yellow box. The middle box is the actual pattern, and the box on the right shows the match results in detail by listing out the returned structures, again with the white space made visible.
Notice the second match (as index 1) has world in group capture id and value as ↵.
I surmise your token processor isn't getting what you want in the proper groups and because one doesn't actually see the successful match of value as the \r, it is overlooked.
Let us turn on Singleline and see what happens.
Now everything is consumed, but there is a different problem. :-)

Regular expression that works on dots

I have this regular expression :
string[] values = Regex
    .Matches(mystring4, @"([\w-[\d]][\w\s-[\d]]+)|([0-9]+)")
    .OfType<Match>()
    .Select(match => match.Value.Trim())
    .ToArray();
This regular expression turns this string :
MY LIMITED COMPANY (52100000 / 58447000)
To these strings :
MY LIMITED COMPANY - 52100000 - 58447000
This also works on non-English characters.
But there is one problem, when I have this string : MY. LIMITED. COMPANY. , it splits that too. I don't want that. I don't want that regular expression to work on dots. How can I do that? Thanks.
You may add the dot after each \w in your pattern, and I also suggest removing unnecessary ( and ):
string[] values = Regex
    .Matches("MY. LIMITED. COMPANY. (52100000 / 58447000)", @"[\w.-[\d]][\w.\s-[\d]]+|[0-9]+")
    .OfType<Match>()
    .Select(match => match.Value.Trim())
    .ToArray();
// Print each extracted value on its own line
foreach (var s in values)
    Console.WriteLine(s);
See the C# demo
Pattern:
[\w.-[\d]] - one Unicode letter or underscore ([\w-[\d]]) or a dot (.)
[\w.\s-[\d]]+ - 1 or more (due to + quantifier at the end) characters that are either Unicode letters or underscore, ., or whitespace (\s)
| - or
[0-9]+ - one or more ASCII-only digits
I'd simplify the expression. What if the names in the front include numbers? Note that my solution doesn't exactly mimic the original expression: it will allow numbers in the name part.
Let's start from the beginning:
To match words all you need is a sequence of word characters:
\w+
This will match any alphanumerical characters including underscores (_).
Considering you want the possibility of the word ending with a dot, you can add it and make it optional (one or zero matches):
\w+\.?
Note the escape, which makes the dot a literal character rather than the "any character" wildcard.
To match another potential word following, we now simply duplicate this match, add a white space before it, wrap it in a group, and once again make it optional using the * quantifier:
\w+\.?(?: \w+\.?)*
In case you haven't seen it before: a group starting with ?: is a non-capturing group. In essence this works like a usual group, but it won't save its match as a group in your results.
And that's it already. This pattern will split your demo string as expected. Of course there could be other possible characters not being covered by this.
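A short C# sketch of that pattern applied to the demo string, assuming the group contains a literal space before the repeated word as described above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// A word with an optional trailing dot, followed by any number of
// " word." repetitions. The literal space inside (?: ...) is intentional.
string pattern = @"\w+\.?(?: \w+\.?)*";
string input = "MY. LIMITED. COMPANY. (52100000 / 58447000)";

string[] values = Regex.Matches(input, pattern)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

// Prints the company name (dots included) and then the two numbers.
foreach (string v in values)
    Console.WriteLine(v);
```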
You can see the results of this matching online here and also play around with it.
To test your regular expressions (and to learn them), I'd really recommend you using a tool such as http://regex101.com
It has an input mask allowing you to provide your pattern and your target string. On the right hand side it will first explain the pattern to you (to see if it's indeed what you had in mind) and below it will show all the groups matched. Just keep in mind it actually uses slightly different flavors of regular expressions, but this shouldn't matter for such simple patterns. (I'm not affiliated with that site, just consider it really useful.)
As an alternative, to directly use C#'s regex parser, you can also try this Regex Tester. This works in a similar way, although doesn't include any explanations, which might be not as ideal for someone just getting started.

Noncapturing along with capturing match

I am trying to capture the subdomain from huge lists of domain names. For example I want to capture "funstuff" from "funstuff.mysite.com". I do not want to capture ".mysite.com" in the match. These occurrences are in a sea of text so I cannot depend on them being at the start of a line. I know the subdomain will not include any special characters or numbers. So what I have is:
[a-z]{2,10}(?=\.mysite\.com)
The problem is this will work only if the subdomain is NOT preceded by a number or special character. For example, "asdfbasdasdfdfunstuff.mysite.com" will return "fdfunstuff" but "asdfasf23/funstuff.mysite.com" won't make a match.
I can not depend on there being a special character before the subdomain, like a "/" as in "http://funstuff.mysite.com" so that can not be used as part of the condition.
It is ok if the capture gets erroneous text before the subdomain, although 99% of the time it will be preceded by something other than a lowercase letter. I have tried,
(?<=[^a-z])[a-z]{2,10}(?=\.mysite\.com)
but for some reason this does not capture text in a situation like:
afb"asdfunstuff.mysite.com
Where the quotation mark prevents a match for [a-z]{2,10}. Basically what I would want to do in that case would be to capture asdfunstuff.mysite.com. How can this be accomplished?
So you've got two problems to solve: first, you want to match ".mysite.com" but not capture it; second, you want to grab up to 10 alphabetic characters in the "subdomain" position.
The first problem can be solved by using a capturing group. The regex
([a-z]{2,10})\.mysite\.com
will capture between 2 and 10 characters, and the returned match object exposes that capture through one of its properties (which one depends on the language). In C#, it is the Groups collection of the Match object: Groups[0] is the whole match and Groups[1] is the text captured by the parentheses.
The second problem can be solved by using the word-boundary anchor \b. In .NET, this matches where a word character (\w) is next to a non-word character (\W) or to the start or end of the input. Other languages (e.g. ECMAScript / JavaScript) work similarly.
So, I suggest the following regex to solve your problem:
\b([a-z]{2,10})\.mysite\.com
Note that numbers are legal in subdomain names, too, so the following might be generally correct (though perhaps not in your specific case):
\b(\w{2,10})\.mysite\.com
where the "word character" \w is equivalent to [a-zA-Z_0-9] in .NET's ECMAScript-compliant mode. (Further reading.)
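To make the capture concrete, here is a small sketch that applies the suggested \b pattern to two inputs taken from the question and reads the subdomain from the first capturing group. The variable names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

var subdomain = new Regex(@"\b([a-z]{2,10})\.mysite\.com");

foreach (string text in new[] { "http://funstuff.mysite.com", "asdfasf23/funstuff.mysite.com" })
{
    Match m = subdomain.Match(text);
    // Groups[0] is the whole match; Groups[1] is the first parenthesised group.
    Console.WriteLine(m.Success ? m.Groups[1].Value : "(no match)");
}
```

Both inputs yield funstuff, because the / before the subdomain creates a word boundary in each case.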

My regular expression isn't returning what I need

I have a block of text as such.
google.sbox.p50 && google.sbox.p50(["how to",[["how to tie a tie",0],["how to train your dragon 2 trailer",0],["how to do the cup song",0],["how to get a six pack in 3 minutes",0],["how to make a paper gun that shoots",0],["how to basic",0],["how to love lil wayne",0],["how to sing like your favorite artist",0],["how to be a heartbreaker marina and the diamonds",0],["how to tame a horse in minecraft",0]],{"q":"XJW--0IKH6sqOp0ME-x5B7b_5wY","j":"5","k":1}])
Using \\[([^]]+)\\] I am able to get everything I need, but with a little extra that I don't. I do not need the ["how to",[[. I only need the blocks that are formatted like,
["how to tie a tie",0]
Can someone please help me modify my expression to only get what I need? I've been at it for hours and I can't grasp the idea of RegEx.
Put both the opening and closing square brackets in the negated character class?
\\[([^][]+)\\]
\\[ matches a literal [
\\] matches a literal ]
[^][] is a negated class, which matches any character except ] and [. It might be a little difficult to see, but it's equivalent to [^\\]\\[]. Here the double escapes are not required because you are using a character class (just like \\. is equivalent to [.])
([^][]+) captures everything within square brackets, making sure there's no ] or [ inside.
In C#, you can use the @ symbol (a verbatim string) to avoid having to double escape every time, which makes the regex look like this:
var regex = new Regex(@"\[([^][]+)\]");
Note: This regex will capture everything within square brackets. If you wish to specifically get the format ["how to tie a tie",0], you can be more precise. After all, the regex will only match what you tell it to match (inside a verbatim string, a literal quote is written as ""):
var regex = new Regex(@"\[""[^""]+"",0\]");
Here, we have another negated character class: [^"]. This will match any character which is not a quote character.
This one assumes that the digit is always 0, as depicted in your sample text block. If other numbers are possible, you can use the character class [0-9]+:
var regex = new Regex(@"\[""[^""]+"",[0-9]+\]");
You can use \d+ as well, but in .NET \d also matches decimal digits from other scripts (unless RegexOptions.ECMAScript is set), which may or may not be what you want. If you want to be even more cautious by allowing possible spaces, tabs, newlines, or form feeds between the tokens, you can use this regex:
var regex = new Regex(@"\[\s*""[^""]+""\s*,\s*[0-9]+\s*\]");
In conclusion, there may be many regexes that suit what you need; just make sure you know how your data comes in so you can pick one with the right amount of leeway.
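The whitespace-tolerant pattern above can be exercised on a shortened version of the sample text. This is a sketch: the input is abbreviated for readability and the variable names are illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Shortened version of the sample text from the question.
string text = "google.sbox.p50([\"how to\",[[\"how to tie a tie\",0],[\"how to basic\",0]],{\"q\":\"x\",\"k\":1}])";

// Whitespace-tolerant pattern; inside a verbatim string each literal " is doubled.
var pair = new Regex(@"\[\s*""[^""]+""\s*,\s*[0-9]+\s*\]");

string[] pairs = pair.Matches(text).Cast<Match>().Select(m => m.Value).ToArray();

// The leading ["how to",[[ is skipped because no number follows its comma.
foreach (string p in pairs)
    Console.WriteLine(p);
```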
I think this is what you are looking for to match the format of ["how to tie a tie",0]:
(\["[^"]+",\d\])
( ) - around the whole thing so it all gets captured in this group
\[" - find ["
[^"]+ - find one or more of anything except "
", - find ",
\d - find a number, if you want more than just a single digit, do \d+
\] - match the ending ]
The only variable things in this regex are whatever is within the quotes ([^"]+) and the number (\d, or \d+ for more than one digit).
Demo
If you don't want the square brackets in the capture group, you can do it like this:
\[("[^"]+",\d+)\]
I assume you don't want to match if there are quotes within your quotes as it would probably break whatever purpose you are using it for, but if you do, this should work:
\[("[^[\]]+",\d+)\]
You must use this pattern
@"\[[^][]+\]"
More information about square brackets here.
I think you need this one: (\[[^\[\]]+?\])
What you missed is the ? (smallest match) and excluding any [ or ]
Seemingly the text in the outer brackets is a JSON representation of an object. Instead of a regular expression I'd just:
strip off the stuff before the bracket + first bracket (google.sbox.p50 && google.sbox.p50() plus strip off the trailing bracket ). There are more ways to do this, and it can be more efficient than regex.
JSON parse the remaining inner part.
From that point you have the object representation: you can leave out the first element of the array, which you don't need, and you have everything else in a traversable form.
There's the session information at the end along with parameters anyway (in {} brackets), so in the end you may end up parsing stuff anyway. Better not to reinvent the wheel (JSON parsing).
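The steps above could be sketched as follows, assuming System.Text.Json is available (.NET Core 3.0 or later) and that the payload always has the shape shown in the question. The input is shortened and the variable names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Shortened version of the sample text from the question.
string text = "google.sbox.p50 && google.sbox.p50([\"how to\",[[\"how to tie a tie\",0],[\"how to basic\",0]],{\"q\":\"x\",\"k\":1}])";

// Keep only what sits between the first "(" and the last ")".
int start = text.IndexOf('(') + 1;
int end = text.LastIndexOf(')');
string json = text.Substring(start, end - start);

var suggestions = new List<string>();
using (JsonDocument doc = JsonDocument.Parse(json))
{
    // Element 0 is the query, element 1 the list of [text, number] pairs.
    foreach (JsonElement entry in doc.RootElement[1].EnumerateArray())
        suggestions.Add(entry[0].GetString() ?? "");
}

Console.WriteLine(string.Join("; ", suggestions)); // how to tie a tie; how to basic
```

The trailing object with the session information stays available as element 2 of the root array if it is needed later.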
