regular expression start (^) does not work correctly - c#

I want to match a pattern like 091\d{8} inside some content.
I want to extract the strings that start with 091. I tried this:
^(091)\d{8}
This pattern only matches when the string begins on a new line. What pattern must I use?

You should match on a word boundary (\b).

^ only matches the number if the string itself starts with 091, not when the number appears in the middle of the content.
You should use word boundaries in your regular expression; otherwise it will also pick up numbers that start with 091 but have more than 8 digits after that.
See the regex \b((091)\d{8})\b working at: http://regexr.com?310ra
The outer captured group (group 1) gives you the required number.
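
As a rough C# sketch of the word-boundary approach above: the class name, method name and sample text are made up for illustration, and the pattern drops the capture groups because the whole match is already the number.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PhoneExtractor
{
    // \b on both sides: the number must not be glued to other word characters.
    private static readonly Regex Pattern = new Regex(@"\b091\d{8}\b");

    // Returns every 091xxxxxxxx number found anywhere in the content.
    public static List<string> Extract(string content)
    {
        var results = new List<string>();
        foreach (Match m in Pattern.Matches(content))
        {
            results.Add(m.Value);
        }
        return results;
    }

    public static void Main()
    {
        // Hypothetical sample: one valid number, one with too many digits,
        // and one directly preceded by a letter.
        string content = "call 09112345678 or 091123456789, not x09112345678";
        foreach (string number in Extract(content))
        {
            Console.WriteLine(number); // prints only 09112345678
        }
    }
}
```

Without the trailing \b, the 12-digit number would also yield a match on its first 11 digits.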

Related

Regex c# obtain subgroup of a captured group

It seems a simple question, but I don't think it is so easy.
From the example string AAACARACBBBBBDZAAAAEE, I want to extract the first 8 characters (= AAACARAC) and from this resulting 8-char long string, I want to extract everything except the leading 'A' characters (= CARAC).
I tried this regex (?^[A]<WORD>\w{8}), but I don't know how to apply another regex to the captured group named WORD.
This is the regex you want:
(?=^.{8}(.*)$)A*(?<WORD>.*?)\1$
See a demo here (click then on "Table" for looking at the specific matches).
The regex first looks ahead past the first eight characters and captures everything that follows (the "tail") in the first capturing group. It then restarts from the beginning of the string, skips all the leading As, and matches as few characters as possible such that they are followed by the same content as the first capturing group.
Using C#, you might also use a positive lookbehind to assert 8 chars to the left, matching optional A's and capture the chars that follow in a group.
^A*(?<WORD>[^\sA].*)(?<=^.{8})
^ Start of string
A* match optional repetitions of A
(?<WORD> Named group WORD
[^\sA].* Match a non-whitespace char other than A, then the rest of the line
) Close named group WORD
(?<=^.{8}) Assert 8 chars to the left of the current position
.NET regex demo
If you only want to match word characters:
^A*(?<WORD>[^\WA]\w*)(?<=^\w{8})
.NET Regex demo
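
A minimal C# sketch of reading the named group from the lookbehind pattern above. The class and method names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordGroupDemo
{
    // Skips leading A's, then captures up to the end of the first 8 characters.
    public static string ExtractWord(string input)
    {
        Match m = Regex.Match(input, @"^A*(?<WORD>[^\sA].*)(?<=^.{8})");
        // Groups["WORD"].Value is an empty string when the match fails.
        return m.Groups["WORD"].Value;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractWord("AAACARACBBBBBDZAAAAEE")); // CARAC
    }
}
```

The greedy .* first runs to the end of the input and then backtracks until the lookbehind sees exactly eight characters to its left, which is why the capture stops at position 8.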

C# equivalent for this regex pattern

I have this regular expression pattern: .{2}\#.{2}\K|\..*(*SKIP)(?!)|.(?=.*\.)
It works perfectly for replacing the matches to get
trabc#abtrec.com.lo => ***bc#ab*****.com.lo
demomail#demodomain.com => ******il#de*********.com
But when I try to use it in C#, \K, (*SKIP) and (*F) are not supported.
What would the C# version of this pattern be? Or do you know a simpler way to mask the email without the unsupported constructs?
Demo
UPDATE:
(*SKIP): this verb causes the match to fail at the current starting position in the subject if the rest of the pattern does not match
(*F): forces a matching failure at the given position in the pattern (the same as (?!))
Try this regex:
\w(?=.{2,}#)|(?<=#[^\.]{2,})\w
Click for Demo
Explanation:
\w - matches a word character
(?=.{2,}#) - positive lookahead to find the position immediately followed by 2+ occurrences of any character followed by #
| - OR
(?<=#[^\.]{2,}) - positive lookbehind to find the position immediately preceded by # followed by 2+ occurrences of any character that is not a .
\w - matches a word character.
Replace each match with a *
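
A small sketch of how the pattern above might be applied with Regex.Replace in C#. The class and method names are made up, and the inputs are the two sample addresses from the question. Note that this variant produces one fewer * in the domain part than the sample output shown in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AddressMasker
{
    // Left alternative: word chars with at least 2 more chars before the #.
    // Right alternative: word chars preceded by # plus at least 2 non-dot chars.
    private const string Pattern = @"\w(?=.{2,}#)|(?<=#[^\.]{2,})\w";

    public static string Mask(string address)
    {
        // Every single-character match is replaced by one asterisk.
        return Regex.Replace(address, Pattern, "*");
    }

    public static void Main()
    {
        Console.WriteLine(Mask("trabc#abtrec.com.lo"));     // ***bc#ab****.com.lo
        Console.WriteLine(Mask("demomail#demodomain.com")); // ******il#de********.com
    }
}
```

The second alternative relies on a variable-length lookbehind, which the .NET regex engine supports but many other engines do not.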
You can achieve the same result with a regex that matches each masked part as one block, and a custom match evaluator:
var res = Regex.Replace(
    s,
    @"^.*(?=.{2}\#.{2})|(?<=.{2}\#.{2}).*(?=\.com.*$)",
    // Replace each matched block with the same number of asterisks
    match => new string('*', match.Length)
);
The regex has two parts:
The one on the left, ^.*(?=.{2}\#.{2}), matches the user name portion except the last two characters.
The one on the right, (?<=.{2}\#.{2}).*(?=\.com.*$), matches the domain after its first two characters, up to the ".com..." ending.
Demo.

Regular Expression doesn't Match with string

I am trying to use Regular Expressions to find a string sequence inside a string.
The pattern I am looking for is:
dd.dd.dddd dd:dd:dd //d is a digit from 0-9
My regex is:
Regex r = new Regex(@"(\d[0-9]{2}.\d[0-9]{2}.\d[0-9]{4}\s\d[0-9]{2}:\d[0-9]{2}:\d[0-9]{2})$");
I am now trying to check whether the string "27.11.2014 09:14:59" matches the regex, but sadly it doesn't.
string str = "27.11.2014 09:14:59";
Regex r = new Regex(@"(\d[0-9]{2}.\d[0-9]{2}.\d[0-9]{4}\s\d[0-9]{2}:\d[0-9]{2}:\d[0-9]{2})$");
bool test = r.IsMatch(str, 0);
//output: test=false
Does anyone know why the string does not match that regular expression?
\d[0-9]{2} matches three digits:
\d first digit
[0-9] second digit
{2} causes the previous expression ([0-9]) to match again
If you remove all occurrences of \d, your pattern should work. You should also escape the dots (\.), because right now they match any character, not just a literal dot.
As Rawing already said, the regular expression above tries to match three digits where it should match two. For anyone who wants to know what the regular expression should look like:
@"(\d{2}\.\d{2}\.\d{4}\s\d{2}:\d{2}:\d{2})$"
That works, at least for me.
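
To see the difference side by side, here is a short sketch comparing the original pattern with a corrected one that also escapes the dots. The class and member names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TimestampCheck
{
    // Original pattern: \d[0-9]{2} demands three digits per field.
    public const string Broken =
        @"(\d[0-9]{2}.\d[0-9]{2}.\d[0-9]{4}\s\d[0-9]{2}:\d[0-9]{2}:\d[0-9]{2})$";

    // Corrected pattern: two/four digits per field, dots escaped.
    public const string Fixed =
        @"(\d{2}\.\d{2}\.\d{4}\s\d{2}:\d{2}:\d{2})$";

    public static bool Matches(string pattern, string input)
    {
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        string str = "27.11.2014 09:14:59";
        Console.WriteLine(Matches(Broken, str)); // False
        Console.WriteLine(Matches(Fixed, str));  // True
        // With escaped dots, other separators are rejected.
        Console.WriteLine(Matches(Fixed, "27x11x2014 09:14:59")); // False
    }
}
```

The pattern only checks the shape of the text; it does not verify that the digits form a valid date or time.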

Problem with regex, how do I get all with \S up until a special character?

I've got the text:
192.168.20.31 Url=/flash/56553550_hi.mp4?token=(uniquePlayerReference=81781956||videoId=1)
And I'm trying to get the uniquePlayerReference and the videoId.
I've tried this regular expression:
(?<=uniquePlayerReference=)\S*
but it matches:
81781956||videoId=1)
Then I try to get the video ID with this:
(?<=videoId=)\S*
But it also matches the ) after the videoId.
My question is twofold:
1) How do I use \S and get it to stop at a given character? (Essentially, what is the regex that does what I want?) I can't get it to stop at a defined character; I think I need a positive lookahead to match up to, but not including, the double pipe.
2) When should I use brackets?
The problem is the multiplicity operator you have here, the *, which means "as many as possible". If you have an explicit number in mind you can use {a,b}, where a is the minimum and b the maximum number of matches; but if the number is unknown, \S is too generic to stop where you want.
As for brackets: if you mean (), you use them to capture a part of a match so that you can refer to it later. That is a bit more involved, so you may want to consult a regex reference for the details.
I think you want something like this:
/uniquePlayerReference=(\d+)\|\|videoId=(\d+)/i
and then refer to groups \1 and \2 respectively.
Given that both IDs are numeric, you are probably better off using \d instead of \S. \d only matches digits, whereas \S matches any non-whitespace character.
What you might also do is a non-greedy match up to the character you do not want to match, like so:
uniquePlayerReference=(.*?)\|\|videoId=(.*?)\)
Note that I have escaped both the | and ) characters because otherwise they would have a special meaning inside a regex.
In C# you would use it like this (which also answers your question about what the brackets are for: they capture parts of the matched result).
Regex regex = new Regex(@"uniquePlayerReference=(.*?)\|\|videoId=(.*?)\)");
Match match = regex.Match(
    "192.168.20.31 Url=/flash/56553550_hi.mp4?token=(uniquePlayerReference=81781956||videoId=1)");
if (match.Success)
{
    // Groups[0] is the whole match; the captures start at index 1.
    string playerReference = match.Groups[1].Value;
    string videoId = match.Groups[2].Value;
    // Etc.
}
If the ID isn't just digits then you could use [^|] instead of \S, i.e.
(?<=uniquePlayerReference=)[^|]*
Then you can use
(?<=videoId=)[^)]*
For the video ID
\S matches any non-whitespace character, including the closing parenthesis. So if you had to use \S, you would have to tell it explicitly to stop at the closing parenthesis, like this:
videoId=(\S+)\)
Since the values you are looking for are numeric, you are better off using \d:
uniquePlayerReference=(\d+)
videoId=(\d+)
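
Putting the \d suggestion together, a hedged sketch that pulls both values out in one pass. The group names "player" and "video" and the class name are illustrative and do not come from the posts above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TokenParser
{
    // Both pipes are escaped so they are literal characters, not alternation.
    private static readonly Regex Pattern = new Regex(
        @"uniquePlayerReference=(?<player>\d+)\|\|videoId=(?<video>\d+)");

    public static bool TryParse(string line, out string player, out string video)
    {
        Match m = Pattern.Match(line);
        // Group values are empty strings when the match fails.
        player = m.Groups["player"].Value;
        video = m.Groups["video"].Value;
        return m.Success;
    }

    public static void Main()
    {
        string line = "192.168.20.31 Url=/flash/56553550_hi.mp4?token=(uniquePlayerReference=81781956||videoId=1)";
        if (TryParse(line, out string player, out string video))
        {
            Console.WriteLine(player); // 81781956
            Console.WriteLine(video);  // 1
        }
    }
}
```

Because \d+ stops at the first non-digit, neither capture can run into the || or the closing parenthesis.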

why do these regex tests let certain characters pass?

I am checking a string with the following regexes:
[a-zA-Z0-9]+
[A-Za-z]+
For some reason, the characters:
.
-
_
are allowed to pass, why is that?
If you want to check that the complete string consists of only the wanted characters you need to anchor your regex like follows:
^[a-zA-Z0-9]+$
Otherwise every string will pass that contains a string of the allowed characters somewhere. The anchors essentially tell the regular expression engine to start looking for those characters at the start of the string and stop looking at the end of the string.
To clarify: If you just use [a-zA-Z0-9]+ as your regex, then the regex engine would rightfully reject the string -__-- as the regex doesn't match against that. There is no single character from the character class you defined.
However, with the string a-b it's different. The regular expression engine will match the first a here since that matches the expression you entered (at least one of the given characters) and won't care about the - or the b. It has done its job and successfully matched a substring according to your regular expression.
Similarly with _-abcdef- – the regex will match the substring abcdef just fine, because you didn't tell it to match only at the start or end of the string; and ignore the other characters.
So when using ^[a-zA-Z0-9]+$ as your regex you are telling the regex engine definitely that you are looking for one or more letters or digits, starting at the very beginning of the string right until the end of the string. There is no room for other characters to squeeze in or hide so this will do what you apparently want. But without the anchors, the match can be anywhere in your search string. For validation purposes you always want to use those anchors.
In regular expressions, + tells the engine to match one or more occurrences of the preceding element.
So the expression [A-Za-z]+ passes if the string contains a sequence of one or more alphabetic characters anywhere. The only strings that don't pass are those that contain no alphabetic characters at all.
The ^ symbol anchors the match to the beginning of the string, and the $ symbol anchors it to the end of the string.
So ^[A-Za-z0-9]+ means 'match a string that begins with a sequence of one or more alphanumeric characters'. It still allows strings that include non-alphanumerics, as long as those characters are not at the beginning of the string.
^[A-Za-z0-9]+$, by contrast, means 'match a string that consists entirely of one or more alphanumeric characters'. Anchoring both ends is what excludes non-alphanumerics from the string.
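
The anchored and unanchored behaviour can be checked with a few calls. This is a sketch assuming the .NET regex engine, with made-up class and method names. One .NET-specific detail worth knowing: without RegexOptions.Multiline, $ also matches just before a trailing newline, so \z is the stricter end anchor.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // True if the input contains at least one alphanumeric run anywhere.
    public static bool ContainsAlnum(string s)
    {
        return Regex.IsMatch(s, "[a-zA-Z0-9]+");
    }

    // True if the whole input is alphanumeric (a trailing "\n" is tolerated by $).
    public static bool IsAlnum(string s)
    {
        return Regex.IsMatch(s, "^[a-zA-Z0-9]+$");
    }

    // Same check, but \z only matches at the very end of the input.
    public static bool IsAlnumStrict(string s)
    {
        return Regex.IsMatch(s, @"^[a-zA-Z0-9]+\z");
    }

    public static void Main()
    {
        Console.WriteLine(ContainsAlnum("a-b"));     // True: "a" alone is a match
        Console.WriteLine(ContainsAlnum("-__--"));   // False: no alphanumeric at all
        Console.WriteLine(IsAlnum("a-b"));           // False
        Console.WriteLine(IsAlnum("abc123"));        // True
        Console.WriteLine(IsAlnum("abc\n"));         // True: $ allows a final newline
        Console.WriteLine(IsAlnumStrict("abc\n"));   // False
    }
}
```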
