Alright, so i need to be able to match strings in ways that give me more flexibility.
so, here is an example. for instance, if i had the string "This is my random string", i would want some way to make
" *random str* ",
" *is __ ran* ",
" *is* ",
" *this is * string ",
all match up with it, i think at this point a simple true or false would be okay to weather it match's or not, but id like basically * to be any length of any characters, also that _ would match any one character. i can't think of a way, although im sure there is, so if possible, could answers please contain code examples, and thanks in advance!
I can't quite figure out what you're trying to do, but in response to:
but id like basically * to be any length of any characters, also that _ would match any one character
In regex, you can use . to match any single character and .+ to match any number of characters (at least one), or .* to match any number of characters (or none if necessary).
So your *is __ ran* example might turn into the regex .+is .. ran.+, whilst this is * string could be this is .+ string.
If this doesn't answer your question then you might want to try re-wording it to make it clearer.
For learning more regex syntax, a popular site is regular-expressions.info, which provides pretty much everything you need to get started.
Use Regular Expressions.
In C#, you would use the Regex class.
For example:
var str = "This is my random string";
Console.WriteLine(Regex.IsMatch(str, ".*is .. ran.*")); //Prints "True"
Related
I have kind of a weird problem that I am attempting to resolve with some elegant regular expressions.
The system I am working on was initially designed to accept an incoming string and through a pattern matching method, alter the string which it then returns. A very simplistic example is:
Incoming string:
The dog & I went to the park and had a great time...
Outgoing string:
The dog {&} I went to the park and had a great time {...}
The punctuation mapper wraps key characters or phrases and wraps them in curly braces. The original implementation was a one way street and was never meant for how it is currently being applied and as a result, if it is called incorrectly, it is very easy for the system to "double" wrap a string as it is just doing a simple string replace.
I spun up Regex Hero this morning and started working on some pattern matches and having not written a regular expression in nearly a year, quickly hit a wall.
My first idea was to match a character (i.e. &) but only if it wasn't wrapped in braces and came up with [^\{]&[^\}], which is great but of course catches any instance of the ampersand so long as it is not preceded by a curly brace, including white spaces and would not work in a situation where there were two ampersands back to back (i.e. && would need to be {&}{&} in the outgoing string. To make matters more complicated, it is not always a single character as ellipsis (...) is also one of the mapped values.
Every solution I noodle over either hits a barrier because there is an unknown number of occurrences of a particular value in the string or that the capture groups will either be too greedy or finally, cannot compensate for multiple values back to back (i.e. a single period . vs ellipsis ...) which the original dev handled by processing ellipsis first which covered the period in the string replace implementation.
Are there any regex gurus out there that have any ideas on how I can detect the undecorated (unwrapped) values in a string and then perform their replacements in an ungreedy fashion that can also handle multiple repeated characters?
My datasource that I am working against is a simple key value pair that contains the value to be searched for and the value to replace it with.
Updated with example strings:
Undecorated:
Show Details...
Default Server:
"Smart" 2-Way
Show Lender's Information
Black & White
Decorated:
Show Details{...}
Default Server{:}
{"}Smart{"} 2-Way
Show Lender{'}s Information
Black {&} White
Updated With More Concrete Examples and Datasource
Datasource (SQL table, can grow at any time):
TaggedValue UntaggedValue
{:} :
{&} &
{<} <
{$} $
{'} '
{} \
{>} >
{"} "
{%} %
{...} ...
{...} …
{:} :
{"} “
{"} ”
{'} `
{'} ’
Broken String: This is a string that already has stuff {&} other stuff{!} and {...} with {_} and {#} as well{.} and here are the same characters without it & follow by ! and ... _ & . &&&
String that needs decoration: Show Details... Default Server: "Smart" 2-Way Show Lender's Information Black & White
String that would pass through the method untouched (because it was already decorated): The dog {&} I went to the park and had a great time {...}
The other "gotcha" in moving to regex is the need to handle escaping, especially of backslashes elegantly due to their function in regular expressions.
Updated with output from #Ethan Brown
#Ethan Brown,
I am starting think that regex, while elegant might not be the way to go here. The updated code you provided, while closer still does not yield correct results and the number of variables involved may exceed the regex logics capability.
Using my example above:
'This is a string that already has stuff {&} other stuff{!} and {...} with {_} and {#} as well{.} and here are the same characters without it & follow by ! and ... _ & . &&&'
yields
This is a string that already has stuff {&} other stuff{!} and {...} with {_} and {#} as well{.} and here are the same characters without it {&} follow by {!} and {...} {_} {&} . {&&}&
Where the last group of ampersands which should come out as {&}{&}{&} actually comes out as {&&}&.
There is so much variability here (i.e. need to handle ellipsis and wide ellipsis from far east languages) and the need to utilize a database as the datasource is paramount.
I think I am just going to write a custom evaluator which I can easily enough write to perform this type of validation and shelve the regex route for now. I will grant you credit for your answer and work as soon as I get in front of a desktop browser.
This kind of problem can be really tough, but let me give you some ideas that might help out. One thing that's really going to give you headaches is handling the case where the punctuation appears at the beginning or end of the string. Certainly that's possible to handle in a regex with a construct like (^|[^{])&($|[^}]), but in addition to that being painfully hard to read, it also has efficiency issues. However, there's a simple way to "cheat" and get around this problem: just pad your input string with a space on either end:
var input = " " + originalInput + " ";
When you're done you can just trim. Of course if you care about preserving input at the beginning or end, you'll have to be more clever, but I'm going to assume for argument's sake that you don't.
So now on to the meat of the problem. Certainly, we can come up with some elaborate regular expressions to do what we're looking for, but often the answer is much much simpler if you use more than one regular expression.
Since you've updated your answer with more characters, and more problem inputs, I've updated this answer to be more flexible: hopefully it will meet your needs better as more characters get added.
Looking over your input space, and the expressions you need quoted, there are really three cases:
Single-character replacements (! becomes {!}, for example).
Multi-character replacements (... becomes {...}).
Slash replacement (\ becomes {})
Since the period is included in the single-character replacements, order matters: if you replace all the periods first, then you will miss ellipses.
Because I find the C# regex library a little clunky, I use the following extension method to make this more "fluent":
public static class StringExtensions {
public static string RegexReplace( this string s, string regex, string replacement ) {
return Regex.Replace( s, regex, replacement );
}
}
Now I can cover all of the cases:
// putting this into a const will make it easier to add new
// characters in the future
const string normalQuotedChars = #"\!_\\:&<\$'>""%:`";
var output = s
.RegexReplace( "(?<=[^{])\\.\\.\\.(?=[^}])", "{$&}" )
.RegexReplace( "(?<=[^{])[" + normalQuotedChars + "](?=[^}])", "{$&}" )
.RegexReplace( "\\\\", "{}" );
So let's break this solution down:
First we handle the ellipses (which will keep us from getting in trouble with periods later). Note that we use a zero-width assertions at the beginning and end of the expression to exclude expressions that are already quoted. The zero-width assertions are necessary, because without them, we'd get into trouble with quoted characters right next to each other. For example, if you have the regex ([^{])!([^}]), and your input string is foo !! bar, the match would include the space before the first exclamation point and the second exclamation point. A naive replacement of $1!$2 would therefore yield foo {!}! bar because the second exclamation point would have been consumed as part of the match. You'd have to end up doing an exhaustive match, and it's much easier to just use zero-width assertions, which are not consumed.
Then we handle all of the normal quoted characters. Note that we use zero-width assertions here for the same reasons as above.
Finally, we can find lone slashes (note we have to escape it twice: once for C# strings and again for regex metacharacters) and replace that with empty curly brackets.
I ran all of your test cases (and a few of my own invention) through this series of matches, and it all worked as expected.
I'm no regex god, so one simple way:
Get / construct the final replacement string(s) - ex. "{...}", "{&}"
Replace all occurrences of these in the input with a reserved char (unicode to the rescue)
Run your matching regex(es) and put "{" or whatever desired marker(s).
Replace reserved char(s) with the original string.
Ignoring the case where your original input string has a { or } character, a common way to avoid re-applying a regex to an already-escaped string is to look for the escape sequence and remove it from the string before applying your regex to the remainders. Here's an example regex to find things that are already escaped:
Regex escapedPattern = new Regex(#"\{[^{}]*\}"); // consider adding RegexOptions.Compiled
The basic idea of this negative-character class pattern comes from regular-expressions.info, a very helpful site for all thing regex. The pattern works because for any inner-most pair of braces, there must be a { followed by non {}'s followed by a }
Run the escapedPattern on the input string, find for each Match get the start and end indices in the original string and substring them out, then with the final cleaned string run your original pattern match again or use something like the following:
Regex punctPattern = new Regex(#"[^\w\d\s]+"); // this assumes all non-word,
// digit or space chars are punctuation, which may not be a correct
//assumption
And replace Match.Groups[1].Value for each match (groups are a 0 based array where 0 is the whole match, 1 is the first set of parentheses, 2 is the next etc.) with "{" + Match.Groups[1].Value + "}"
Let me explain the actual problem I am facing.
When my SearchString is C+, the setup of regular expression below works fine.
However when my searchstring is C++, it throws me an error stating --
parsing "C++" - Nested quantifier +.
Can anyone let me know how to get past this error?
RegExp = new Regex(Search_Str.Replace(" ", "|").Trim(), RegexOptions.IgnoreCase);
First of all, I believe that you'll learn much more by looking at some regex tutorials instead of asking here at your current stage.
To answer your question, I would like to point out that + is a quantifier in regexp, and means 1 or more times the previous character (or group), so that C+ will match at least 1 C, meaning C will match, CC will match, CCC will match and so on. Your search with C+ is actually matching only C!
C++ in a regex will give you an error, at least in C#. You won't with other regex flavours, including JGsoft, Java and PCRE (++ is a possessive quantifier in those flavours).
So, what to do? You need to escape the + character so that your search will look for the literal + character. One easy way is to add a backslash before the +: \+. Another way is to put the + in square brackets.
This said, you can use:
C\+\+
Or...
C[+][+]
To look for C++. Now, since you have twice the same character, you can use {n} (where n is the number of occurrences) to denote the number of occurrences of the last character, whence:
C\+{2}
Or...
C[+]{2}
Which should give the same results.
Plus + is a special symbol in regular expression. You must escape it.
Regex regex = new Regex(#"C+\+");
I could not help but notice #ghost's answer is wrong.
This snippet is wrong:
Regex regex = new Regex(#"C+\+");
Because the first + is not escaped. So the meaning still is "one or more C characters followed by a plus sign". The correct form would be #"C\+\+"
I need some help on a problem.
In fact I search to check for an image type by the hexadecimal code.
string JpgHex = "FF-D8-FF-E0-xx-xx-4A-46-49-46-00";
Then I have a condition on
string.StartsWith(pngHex).
The problem is that the "x" characters presents in my "JpgHex" string can be whatever I want.
I think I need a regex to check that but I don't know how!!
Thanks a lot!
I'm not quite clear what exactly you want to do, but the dot '.' character represents any character in Regex.
So the regex "^FF-D8-FF-E0-..-..-4A-46-49-46-00" will probably do the trick. '^' = Start of input.
If you want to allow only hex chars you can use "^FF-D8-FF-E0-[0-9A-F]{2}-[0-9A-F]{2}-4A-46-49-46-00".
Like I said, I'd need a better idea of what pattern you need to match.
Here are some examples:
Regex rgx =
new Regex(#"^FF-D8-FF-E0-[a-zA-Z0-9]{2}-[a-zA-Z0-9]{2}-4A-46-49-46-00$");
rgx.IsMatch(pngHex); // is match will return a bool.
I use [a-zA-Z0-9]{2} to denote two instances of a character, caps or small or a number. So the above regex would match :
FF-D8-FF-E0-aa-zZ-4A-46-49-46-00
FF-D8-FF-E0-11-22-4A-46-49-46-00
.. etc
Based on your need change the regex accordingly so for capitals and numbers only you change to [A-Z0-9]. The {2} denotes two occurrences.
The ^ denotes the string should start with FF and $ means the string should end with 00.
Lets say you wanted to only match two numbers, so you would use \d{2}, the whole thing would look like this:
Regex rgx = new Regex(#"^FF-D8-FF-E0-\d{2}-\d{2}-4A-46-49-46-00$");
rgx.IsMatch(pngHex);
How do I know of these magical characters? Simple, there are docs everywhere. See this MSDN page for some basic regex patterns. This page shows some quantifiers, those are things like match one or more or match only one.
Cheat-sheets also come in handy.
A regex would help you; you can use the following tool to help you test and learn: -
http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
I recommend you have a play because then you'll learn!
To simply match any character in place of the x, the following should work: -
"^FF-D8-FF-E0-..-..-4A-46-49-46-00$"
In C#, it would be something like this: -
var test = "FF-D8-FF-E0-AB-CD-4A-46-49-46-00";
var foo = new Regex("^FF-D8-FF-E0-..-..-4A-46-49-46-00$");
if (foo.IsMatch(test))
{
// Do magic
}
You will need to read up on regular expressions to understand some of the characters that may not look familiar, i.e. ^ and $. See http://www.regular-expressions.info/
I have a written a Regex to check whether the string consists of 9 digits or not as follows but this always returning me false even if I have my string as 123456789.
if (Regex.IsMatch(strSSN, " ^\\d{9}$"))
Can any one tell what's wrong and also if any better one provide me .. What i am achieving is the string should have only numeric data and the length should be =9
Spaces are significant in regular expressions (and of course, you can't match a space character before the start of the string). Remove it and everything should be fine:
if (Regex.IsMatch(strSSN, "^\\d{9}$"))
Also, you generally want to use verbatim strings for regexes in C#, so you don't have to double your backslashes, but this is just a convenience, not the reason for your problem:
if (Regex.IsMatch(strSSN, #"^\d{9}$"))
this should work
^\d{9}$
hope that helps
You have a whitespace between " and ^. It should be:
if (Regex.IsMatch(strSSN, "^\\d{9}$"))
could it be the space at the beginning of your pattern? That seems wierd, a space then a number anchored to the start of the line.
This should give you what you want:
if (Regex.IsMatch(strSSN, "^\\d{9}$"))
I'm in the process of updating a program that fixes subtitles.
Till now I got away without using regular expressions, but the last problem that has come up might benefit by their use. (I've already solved it without regular expressions, but it's a very unoptimized method that slows my program significantly).
TL;DR;
I'm trying to make the following work:
I want all instances of:
"! ." , "!." and "! . " to become: "!"
unless the dot is followed by another dot, in which case I want all instances of:
"!.." , "! .." , "! . . " and "!. ." to become: "!..."
I've tried this code:
the_str = Regex.Replace(the_str, "\\! \\. [^.]", "\\! [^.]");
that comes close to the first part of what I want to do, but I can't make the [^.] character of the replacement string to be the same character as the one in the original string... Please help!
I'm interested in both C# and PHP implementations...
$str = preg_replace('/!(?:\s*\.){2,3}/', '!...', $str);
$str = preg_replace('/!\s*\.(?!\s*\.)/', '!', $str);
This does the work in to PCREs. You probably could do some magic to merge it to one, but it wouldn't be readable anymore. The first PCRE is for !..., the second one for !. They are quite straightforward.
C#
s = Regex.Replace(s, #"!\s?\.\s?(\.?)\s?", "!$1$1$1");
PHP
$s = preg_replace('/!\s?\.\s?(\.?)\s?/', '!$1$1$1', $s);
The first dot is consumed but not captured; you're effectively throwing that one away. Group #1 captures the second dot if there is one, or an empty string if not. In either case, plugging it into the replacement string three times yields the desired result.
I used \s instead of literal spaces to make it more obvious what I was doing, and added the ? quantifier to make the spaces optional. If you really need to restrict it to actual space characters (not tabs, newlines, etc.) you can change them back to spaces. If you want to allow more than one space at a time, you can change ? to * where appropriate--e.g.:
#"!\s*\.\s*(\.?)\s*"
Also, notice the use of C#'s verbatim string literals--the antidote for backslashitis. ;)