I'm having a hard time understanding regex. I have a scenario where the valid characters are a-z, A-Z, 0-9 and a space. So to match invalid characters I wrote this RegEx: [^a-zA-Z0-9 ].
Then I have strings that I want to search with the RegEx. When it finds an invalid character, it checks whether the character before it is also invalid.
For example, "test test +?test"
So what I want to happen is if there are two invalid characters, one after the other, do nothing otherwise insert a '£'. So the string above will be fine, no changes. However, the string, "test test £test", should be changed to "test test ££test".
This is my code..
public string HandleInvalidChars(string message)
{
    const string methodName = "HandleInvalidChars";
    Regex specialChars = new Regex("[^a-zA-Z0-9 ]");
    string strSpecialChars = specialChars.ToString();
    //prev character in string which we are going to check
    string prevChar;
    Match match = specialChars.Match(message);
    while (match.Success)
    {
        //get position of special character
        int position = match.Index;
        // get character before special character
        prevChar = message.Substring(position - 1, 1);
        //check if next character is a special character, if not insert ? escape character
        try
        {
            if (!Regex.IsMatch(prevChar, strSpecialChars))
            {
                message = message.Insert(position, "?");
            }
        }
        catch (Exception ex)
        {
            _logger.ErrorFormat("{0}: ApplicationException: {1}", methodName, ex);
            return message;
        }
        match = match.NextMatch();
        //loop through remainder of string until last character
    }
    return message;
}
When I test it on the first string it handles the first invalid char, '+', ok but it falls over when it reaches '£'.
Any help is really appreciated.
Thanks :)
What if you changed the RegEx to something like the one below, so that it only matches a single special character and not two in a row?
[a-zA-Z0-9 ]{0,1}[^a-zA-Z0-9 ][a-zA-Z0-9 ]{0,1}
Another thing: I would create a new variable for the return value. As far as I can see, you keep changing the original string while you are still looking for matches in it.
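To illustrate the idea of building the result in a separate variable, here is a minimal sketch. It assumes an invalid character should get a '£' prefix only when neither of its neighbours is invalid. The names InvalidCharEscaper, Escape and IsInvalid are made up for this example and are not from the question.

```csharp
using System;
using System.Text;

public static class InvalidCharEscaper
{
    // Hypothetical helper: true for anything outside a-z, A-Z, 0-9 and space.
    private static bool IsInvalid(char c) =>
        !((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9') || c == ' ');

    public static string Escape(string message)
    {
        // The input is only read; all output goes into a separate builder,
        // so indexes into message never shift while we scan it.
        var result = new StringBuilder(message.Length + 8);
        for (int i = 0; i < message.Length; i++)
        {
            char c = message[i];
            bool prevInvalid = i > 0 && IsInvalid(message[i - 1]);
            bool nextInvalid = i + 1 < message.Length && IsInvalid(message[i + 1]);

            // A lone invalid character gets a '£' in front of it.
            if (IsInvalid(c) && !prevInvalid && !nextInvalid)
            {
                result.Append('£');
            }
            result.Append(c);
        }
        return result.ToString();
    }
}
```

Because the loop checks the bounds before looking at a neighbour, it also copes with an invalid character at the very start or end of the string.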
I believe you have overthought it a bit. All you need is to find a forbidden char that is neither preceded nor followed by another forbidden char.
Declare
public string HandleInvalidChars(string message)
{
    var pat = @"(?<![^A-Za-z0-9 ])[^A-Za-z0-9 ](?![^A-Za-z0-9 ])";
    return Regex.Replace(message, pat, "£$&");
}
and use:
Console.WriteLine(HandleInvalidChars("test test £test"));
// => test test ££test
Console.WriteLine(HandleInvalidChars("test test +?test"));
// => test test +?test
Details
(?<![^A-Za-z0-9 ]) - a negative lookbehind that fails the match if there is a char other than an ASCII letter/digit or space immediately to the left of the current location
[^A-Za-z0-9 ] - a char other than an ASCII letter/digit or space
(?![^A-Za-z0-9 ]) - a negative lookahead that fails the match if there is a char other than an ASCII letter/digit or space immediately to the right of the current location.
The replacement string contains a $&, backreference to the whole match value. Thus, using "£$&" we insert a £ before the match.
I'm trying to understand how to match a specific string that's held within an array (This string will always be 3 characters long, ex: 123, 568, 458 etc) and I would match that string to a longer string of characters that could be in any order (9841273 for example). Is it possible to check that at least 2 of the 3 characters in the string match (in this example) strMoves? Please see my code below for clarification.
private readonly string[] strSolutions = new string[8] { "123", "159", "147", "258", "357", "369", "456", "789" };
private static string strMoves = "1823742";

foreach (string strResult in strSolutions)
{
    Regex rgxMain = new Regex("[" + strMoves + "]{2}");
    if (rgxMain.IsMatch(strResult))
    {
        MessageBox.Show(strResult);
    }
}
The portion where I have designated "{2}" in Regex is where I expected the result to check for at least 2 matching characters, but my logic is definitely flawed. It will return true IF the two characters are in consecutive order as compared to the string in strResult. If it's not in the correct order it will return false. I'm going to continue to research on this but if anyone has ideas on where to look in Microsoft's documentation, that would be greatly appreciated!
Correct order where it would return true: "144257" when matched to "123"
incorrect order: "35718" when matched to "123"
The 3 is before the 1, so it won't match.
You can use the following solution if you need to find at least two different not necessarily consecutive chars from a specified set in a longer string:
new Regex($@"([{strMoves}]).*(?!\1)[{strMoves}]", RegexOptions.Singleline)
It will look like
([1823742]).*(?!\1)[1823742]
Pattern details:
([1823742]) - Capturing group 1: one of the chars in the character class
.* - any zero or more chars as many as possible (due to RegexOptions.Singleline, . matches any char including newline chars)
(?!\1) - a negative lookahead that fails the match if the next char is a starting point of the value stored in the Group 1 memory buffer (since it is a single char here, the next char should not equal the text in Group 1, one of the specified digits)
[1823742] - one of the chars in the character class.
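Here is a small sketch of how that pattern could be plugged into the loop from the question. It assumes strMoves contains digits only, so it is safe to drop it into a character class unescaped. The class and method names (MoveMatcher, FindPartialSolutions) are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MoveMatcher
{
    // Returns the solutions that contain at least two different characters from moves.
    // Assumption: moves holds digits only, so no escaping is needed inside [...].
    public static List<string> FindPartialSolutions(string[] solutions, string moves)
    {
        var pattern = new Regex($@"([{moves}]).*(?!\1)[{moves}]", RegexOptions.Singleline);
        var hits = new List<string>();
        foreach (string solution in solutions)
        {
            if (pattern.IsMatch(solution))
            {
                hits.Add(solution);
            }
        }
        return hits;
    }
}
```

With the strSolutions array and strMoves = "1823742" from the question, this should return 123, 147, 258, 357 and 789. The other three solutions share only one digit with the moves.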
I am trying to merge a few working RegEx patterns together (AND them). I don't think I am doing this properly; furthermore, the first RegEx might be getting in the way of the next two.
Slug example (no special characters except for - and _):
(^[a-z0-9-_]+$)
Then I would like to ensure the first character is NOT - or _:
(^[^-_])
Then I would like to ensure the last character is NOT - or _:
([^-_]$)
Match (good Alias):
my-new_page
pagename
Not-Match (bad Alias)
-my-new-page
my-new-page_
!@#$%^&*()
If this RegExp can be simplified, I am more than happy to use the simpler version. I am trying to create validation on a page URL that the user can provide. The alias should:
Not start or end with a special character
Start and end with a number or letter
Be allowed to contain - and _ in the middle (not at the start or end)
Once I get that working, I can tweak it for other characters as needed.
In the end I am applying as an Annotation to my model like so:
[RegularExpression(
    @"(^[a-z0-9-_]+$)?(^[^-_])?([^-_]$)",
    ErrorMessage = "Alias is not valid")
]
Thank you, and let me know if I should provide more information.
You can use the following pattern:
^[a-z\d](?:[a-z\d_-]*[a-z\d])?$
^ Assert position at the start of the line
[a-z\d] Match any lowercase ASCII letter or digit
(?:[a-z\d_-]*[a-z\d])? Optionally match the following
[a-z\d_-]* Match any character in the set any number of times
[a-z\d] Match any lowercase ASCII letter or digit
$ Assert position at the end of the line
Example usage:
using System;
using System.Text.RegularExpressions;

public class Example
{
    public static void Main()
    {
        Regex regex = new Regex(@"^[a-z\d](?:[a-z\d_-]*[a-z\d])?$");
        string[] strings = {"my-new_page", "pagename", "-my-new-page", "my-new-page_", "!@#$%^&*()"};
        foreach(string s in strings) {
            if (regex.IsMatch(s))
            {
                Console.WriteLine(s);
            }
        }
    }
}
Result (only positive matches):
my-new_page
pagename
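Since the question applies the pattern through a data annotation, here is a rough sketch of how that might look. PageModel, Alias and AliasCheck are hypothetical names, not taken from the question. The sketch also writes [0-9] instead of \d, on the assumption that only ASCII digits are wanted (in .NET, \d also matches other Unicode decimal digits unless RegexOptions.ECMAScript is used).

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel.DataAnnotations;

// Hypothetical model; only the attribute usage matters here.
public class PageModel
{
    [RegularExpression(
        @"^[a-z0-9](?:[a-z0-9_-]*[a-z0-9])?$",
        ErrorMessage = "Alias is not valid")]
    public string Alias { get; set; }
}

public static class AliasCheck
{
    // Runs the data annotation validators against a model holding the given alias.
    public static bool IsValidAlias(string alias)
    {
        var model = new PageModel { Alias = alias };
        var context = new ValidationContext(model);
        var results = new List<ValidationResult>();
        return Validator.TryValidateObject(model, context, results, validateAllProperties: true);
    }
}
```

Keep in mind that RegularExpressionAttribute treats null and empty strings as valid, so a [Required] attribute would be needed as well if an empty alias must be rejected.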
I'm looking for a way to search a string for everything before a set of characters in C#. For Example, if this is my string value:
This is is a test.... 12345
I want to build a new string with all of the characters before "12345".
So my new string would equal "This is is a test.... "
Is there a way to do this?
I've found Regex examples where you can focus on one character but not a sequence of characters.
You don't need to use a Regex:
public string GetBitBefore(string text, string end)
{
    var index = text.IndexOf(end);
    if (index == -1) return text;
    return text.Substring(0, index);
}
You can use a lazy quantifier to match anything, followed by a lookahead:
var match = Regex.Match("This is is a test.... 12345", @".*?(?=\d{5})");
where:
.*? lazily matches everything (up to the lookahead)
(?=…) is a positive lookahead: the pattern must be matched, but is not included in the result
\d{5} matches exactly five digits. I'm assuming this is your lookahead; replace it with whatever marks the end of the text you want
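If the end marker is an arbitrary string rather than "five digits", the same lazy-match-plus-lookahead idea could be wrapped up roughly like this. TextBefore and GetBefore are names invented for the sketch, and returning the whole text when the marker is missing is an assumption about the desired behaviour.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TextBefore
{
    // Returns everything before the first occurrence of end,
    // or the whole text if end does not occur (assumed behaviour).
    public static string GetBefore(string text, string end)
    {
        // Regex.Escape makes the marker literal, so characters such as '.' or '(' in it are safe.
        string pattern = ".*?(?=" + Regex.Escape(end) + ")";
        Match match = Regex.Match(text, pattern, RegexOptions.Singleline);
        return match.Success ? match.Value : text;
    }
}
```

For example, TextBefore.GetBefore("This is is a test.... 12345", "12345") gives "This is is a test.... " including the trailing space.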
You can do this with the help of a regex lookahead.
.*(?=12345)
Example:
var data = "This is is a test.... 12345";
var rxStr = ".*(?=12345)";
var rx = new System.Text.RegularExpressions.Regex(rxStr,
    System.Text.RegularExpressions.RegexOptions.IgnoreCase);
var match = rx.Match(data);
if (match.Success) {
    Console.WriteLine(match.Value);
}
The code snippet above will print everything up to 12345:
This is is a test....
For more detail, see regex positive lookahead.
This should get you started:
var reg = new Regex("^(.+)12345$");
var match = reg.Match("This is is a test.... 12345");
var group = match.Groups[1]; // This is is a test....
Of course you'd want to do some additional validation, but this is the basic idea.
^ means start of string
$ means end of string
The asterisk tells the engine to attempt to match the preceding token zero or more times. The plus tells the engine to attempt to match the preceding token once or more
{min,max} indicate the minimum/maximum number of matches.
\d matches a single character that is a digit, \w matches a "word character" (alphanumeric characters plus underscore), and \s matches a whitespace character (includes tabs and line breaks).
[^a] is a negated character class: it matches any character except a
The dot matches a single character, except line break characters
In your case there are many ways to accomplish the task.
E.g. matching everything up to the first digit: ^[^\d]*
If you know the exact sequence of characters and it is not just digits, use IndexOf() instead of a regex. If you know the separator between the first and second part (e.g. "..."), you can use Split().
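As a quick illustration of the "everything up to the first digit" pattern, here is a minimal sketch. The wrapper names LeadingNonDigits and Get are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LeadingNonDigits
{
    // Returns the leading run of non-digit characters (possibly empty).
    public static string Get(string input)
    {
        return Regex.Match(input, @"^[^\d]*").Value;
    }
}
```

This only works when the text before the number contains no digits of its own: for an input like "Room 4 test 12345" it would stop at the 4.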
Take a look at this snippet:
class Program
{
    static void Main(string[] args)
    {
        string input = "This is is a test.... 12345";
        // Here we call Regex.Matches. The lazy group takes the text, the second group the trailing digits.
        MatchCollection matches = Regex.Matches(input, @"^(?<MySentence>.*?)(?<MyNumberPart>\d+)$");
        foreach (Match item in matches)
        {
            Console.WriteLine(item.Groups["MySentence"]);
            Console.WriteLine("******");
            Console.WriteLine(item.Groups["MyNumberPart"]);
        }
        Console.ReadKey();
    }
}
You could just split; it is not as efficient as the IndexOf solution:
string value = "oiasjdoiasj12345";
string end = "12345";
string result = value.Split(new string[] { end }, StringSplitOptions.None)[0]; // Take the first part of the result; not the quickest, but fairly simple
My task is to select first sentence from a text (I'm writing in C#). I suppose that the most appropriate way would be using regex but some troubles occurred. What regex pattern should I use to select the first sentence?
Several examples:
Input: "I am a lion and I want to be free. Do you see a lion when you look inside of me?" Expected result: "I am a lion and I want to be free."
Input: "I drink so much they call me Charlie 4.0 hands. Any text." Expected result: "I drink so much they call me Charlie 4.0 hands."
Input: "So take out your hands and throw the H.U. up. 'Now wave it around like you don't give a fake!'" Expected result: "So take out your hands and throw the H.U. up."
The third is really confusing me.
Since you already provided some assumptions:
sentences are divided by a whitespace
task is to select first sentence
You can use the following regex:
^.*?[.?!](?=\s+(?:$|\p{P}*\p{Lu}))
Regex breakdown:
^ - start of string (thus, only the first sentence will be matched)
.*? - any number of characters, as few as possible (use RegexOptions.Singleline to also match a newline with .)
[.?!] - a final punctuation symbol
(?=\s+(?:$|\p{P}*\p{Lu})) - a look-ahead making sure the punctuation is followed by 1 or more whitespace symbols (\s+) and then either the end of string ($) or optional punctuation (\p{P}*) and a capital letter (\p{Lu}).
UPDATE:
Since it turns out you can have single sentence input, and your sentences can start with any letter or digit, you can use
^.*?[.?!](?=\s+\p{P}*[\p{Lu}\p{N}]|\s*$)
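A short sketch of how the updated pattern might be used from C#. FirstSentence and Extract are names made up for the example, and falling back to the whole input when nothing matches is an assumption, not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirstSentence
{
    // Singleline lets '.' cross line breaks, as suggested in the breakdown above.
    private static readonly Regex Pattern = new Regex(
        @"^.*?[.?!](?=\s+\p{P}*[\p{Lu}\p{N}]|\s*$)",
        RegexOptions.Singleline);

    // Returns the first sentence, or the whole text if no sentence end is found (assumed fallback).
    public static string Extract(string text)
    {
        Match match = Pattern.Match(text);
        return match.Success ? match.Value : text;
    }
}
```

For the three inputs in the question this yields the expected results: the dots in "4.0" and "H.U." are skipped because they are not followed by whitespace plus a capital letter or digit.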
I came up with a regular expression that uses lots of negative look-aheads to exclude certain cases, e.g. a punctuation mark must not be followed by a lowercase character, and a dot before a capital letter does not close a sentence. This splits the text into its separate sentences. If you are given a text, just take the first match.
[\s\S]*?(?![A-Z]+)(?:\.|\?|\!)(?!(?:\d|[A-Z]))(?! [a-z])/gm
Sentence separators should be searched with following scanner:
if it's sentence-finisher character (like [.!?])
it must be followed by space or allowed sequence of characters and then space:
like sequence of '.' for '.' (A sentence...)
...or sequence of '!' and/or '?' for '!' and '?' (Exclamation here!?)
then it must be followed by either:
capital character (ignore quotes, if any)
numeric
which must be followed by lowercase or another sentence-finisher
dialog-starter character (Blah blah blah... - And what next, Elric?)
Tip: don't forget to add extra space character to input source string.
Upd:
Some wild pseudocode xD:
func sentence(inputString) {
    finishers = ['.', '!', '?']
    allowedSequences = ['.' => ['..'], '!' => ['!!', '?'], '?' => ['??', '!']]
    input = inputString
    result = ''
    found = false
    while input != '' {
        finisherPos = min(pos(input, finishers))
        if !finisherPos
            return inputString
        result += substr(input, 0, finisherPos + 1)
        input = substr(input, finisherPos)
        p = finisherPos
        finisher = input[p]
        p++
        if input[p] != ' '
            if match = testSequence(substr(input, p), allowedSequences[finisher]) {
                result += match
                found = true
                break
            } else {
                continue
            }
        else {
            p++
            if input[p] in [A-Z] {
                found = true
                break
            }
            if input[p] in [0-9] {
                p++
                if input[p] in [a-z] or input[p] in finishers {
                    found = true
                    break
                }
                p--
            }
            if input[p] in ['-'] {
                found = true
                break
            }
        }
    }
    if !found
        return inputString
    return result
}

func testSequence(str, sequences) {
    foreach (sequence: sequences)
        if startsWith(str, sequence)
            return sequence
    return false
}
I have a string, and I want to make sure that every letter in it is English.
The other characters, I don't care.
34556#%42%$23$%^*&sdfsfr - valid
34556#%42%$23$%^*&בלה בלה - not valid
Can I do that with Linq? RegEx?
Thanks
In a character class you can list either the characters, character ranges, Unicode properties or blocks that you want to allow, or those that you don't want to allow.
[abc] is a character class that allows a and b and c
[^abc] is a negated character class that matches everything but not a or b or c
Here in your case I would go this way, no need to define every character:
^[\P{L}A-Za-z]*$
Match from the start to the end of the string everything that is not a letter [^\p{L}] or A-Za-z.
\p{L} is a Unicode property and matches everything that has the property letter. \P{L} is the negated version: everything that is not a letter.
Test code:
string[] StrInputNumber = { "34556#%42%$23$%^*&sdfsfr", "asdf!\"§$%&/()=?*+~#'", "34556#%42%$23$%^*&בלה בלה", "öäü!\"§$%&/()=?*+~#'" };
Regex ASCIILettersOnly = new Regex(@"^[\P{L}A-Za-z]*$");
foreach (String item in StrInputNumber) {
    if (ASCIILettersOnly.IsMatch(item)) {
        Console.WriteLine(item + " ==> Contains only ASCII letters");
    }
    else {
        Console.WriteLine(item + " ==> Contains non ASCII letters");
    }
}
Some more basic regex explanations: What absolutely every Programmer should know about regular expressions
Maybe you could use
using System.Linq;
...
static bool IsValid(string str)
{
    return str.All(c => c <= sbyte.MaxValue);
}
This considers all ASCII chars to be "valid" (even control characters). But punctuation and other special characters outside ASCII are not "valid". If str is null, an exception is thrown.
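If non-letter characters outside ASCII (such as § or £) should still count as valid, as in the question, a LINQ variant could look roughly like the sketch below. EnglishLetterCheck and HasOnlyEnglishLetters are names invented here. Note that char.IsLetter works on single UTF-16 code units, so letters outside the Basic Multilingual Plane are not detected by this sketch.

```csharp
using System;
using System.Linq;

public static class EnglishLetterCheck
{
    // True when every letter in str is A-Z or a-z; non-letters are ignored.
    // Throws if str is null, just like the version above.
    public static bool HasOnlyEnglishLetters(string str)
    {
        return str.All(c =>
            !char.IsLetter(c) || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'));
    }
}
```

With this check, "34556#%42%$23$%^*&sdfsfr" is valid, while the Hebrew example and a string containing "öäü" are not.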
One thing you can try is to put the characters you want to allow into this regex:
bool IsValid(string input) {
    return !Regex.IsMatch(input, @"[^A-Za-z0-9'\.&@:?!()$#^]");
}
Any character other than those specified in the regex makes the method return false.