I'm new to regular expressions, and I need to write a set of regular expressions that match different data packet formats.
My problem is, usually I only need to look for the start and ending parts of the packet to distinguish between them, the data in between is irrelevant.
What's the most efficient way to ignore the data between the start and end?
Here's a simple example.
The packet I'm looking for starts with $CH; and ends with #
Currently my regex is \$CH;.*?#
It's the .*? I'm worried about. Is there a better (or more efficient) way to accept any character between the packet header and ending character?
Also, some of the packets have \n chars in the data, so using . won't work at all if it means [^\n].
I've also considered [^\x00]*? to detect any characters since null is never used in the data.
Any suggestions?
\$CH;.*?# is fine and should be quite efficient. You can make it more explicit that there should be no backtracking by writing it as \$CH;[^#]*#, if you like.
You can use (.|\n) or [\w\W] to match truly any character--or even better, use the RegexOptions.Singleline option to change the behavior of .:
Specifies single-line mode. Changes the meaning of the dot (.) so it matches every character (instead of every character except \n).
Try this:
\$CH;[\s\S]*?#
To detect start of line/data use ^ anchor, to detect the end, use $ anchor:
^start.*?end$
Be aware that .*? may fail to match newlines, one option is to change it for [\s\S]*?
I would recommend checking the initial and terminal sequences separately using anchored regular expressions.
Related
My regex pattern looks something like
<xxxx location="file path/level1/level2" xxxx some="xxx">
I am only interested in the part in quotes assigned to location. Shouldn't it be as easy as below without the greedy switch?
/.*location="(.*)".*/
Does not seem to work.
You need to make your regular expression lazy/non-greedy, because by default, "(.*)" will match all of "file path/level1/level2" xxx some="xxx".
Instead you can make your dot-star non-greedy, which will make it match as few characters as possible:
/location="(.*?)"/
Adding a ? on a quantifier (?, * or +) makes it non-greedy.
Note: this is only available in regex engines which implement the Perl 5 extensions (Java, Ruby, Python, etc) but not in "traditional" regex engines (including Awk, sed, grep without -P, etc.).
location="(.*)" will match from the " after location= until the " after some="xxx unless you make it non-greedy.
So you either need .*? (i.e. make it non-greedy by adding ?) or better replace .* with [^"]*.
[^"] Matches any character except for a " <quotation-mark>
More generic: [^abc] - Matches any character except for an a, b or c
How about
.*location="([^"]*)".*
This avoids the unlimited search with .* and will match exactly to the first quote.
Use non-greedy matching, if your engine supports it. Add the ? inside the capture.
/location="(.*?)"/
Use of Lazy quantifiers ? with no global flag is the answer.
Eg,
If you had global flag /g then, it would have matched all the lowest length matches as below.
Here's another way.
Here's the one you want. This is lazy [\s\S]*?
The first item:
[\s\S]*?(?:location="[^"]*")[\s\S]* Replace with: $1
Explaination: https://regex101.com/r/ZcqcUm/2
For completeness, this gets the last one. This is greedy [\s\S]*
The last item:[\s\S]*(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explaination: https://regex101.com/r/LXSPDp/3
There's only 1 difference between these two regular expressions and that is the ?
The other answers here fail to spell out a full solution for regex versions which don't support non-greedy matching. The greedy quantifiers (.*?, .+? etc) are a Perl 5 extension which isn't supported in traditional regular expressions.
If your stopping condition is a single character, the solution is easy; instead of
a(.*?)b
you can match
a[^ab]*b
i.e specify a character class which excludes the starting and ending delimiiters.
In the more general case, you can painstakingly construct an expression like
start(|[^e]|e(|[^n]|n(|[^d])))end
to capture a match between start and the first occurrence of end. Notice how the subexpression with nested parentheses spells out a number of alternatives which between them allow e only if it isn't followed by nd and so forth, and also take care to cover the empty string as one alternative which doesn't match whatever is disallowed at that particular point.
Of course, the correct approach in most cases is to use a proper parser for the format you are trying to parse, but sometimes, maybe one isn't available, or maybe the specialized tool you are using is insisting on a regular expression and nothing else.
Because you are using quantified subpattern and as descried in Perl Doc,
By default, a quantified subpattern is "greedy", that is, it will
match as many times as possible (given a particular starting location)
while still allowing the rest of the pattern to match. If you want it
to match the minimum number of times possible, follow the quantifier
with a "?" . Note that the meanings don't change, just the
"greediness":
*? //Match 0 or more times, not greedily (minimum matches)
+? //Match 1 or more times, not greedily
Thus, to allow your quantified pattern to make minimum match, follow it by ? :
/location="(.*?)"/
import regex
text = 'ask her to call Mary back when she comes back'
p = r'(?i)(?s)call(.*?)back'
for match in regex.finditer(p, str(text)):
print (match.group(1))
Output:
Mary
I am to deal with the following problem.
I have to extract messages from a communication buffer. Sadly the communication protocol is lousy and not well-structured. The only way I came up with to distinguish packets in the buffer is an intermediate "ack" command that is transmitted by the server.
Example:
[Packet1][ACK][Packet2][ACK][Packet3]
I could have used String.Split(ACK), but the separator is also not consistent. Though, there are 3 rules to identify the ack packet.
Starts with "AK".
Ends with "0" or "1".
Total length is 5 characters.
Ack example:
"AKxxy" where:
xx: (01 to 99)
y: (0 or 1)
I hope that there may be a regular expression that can solve my problem, but I lack the needed knowledge and time.
Is there any RegEx "expert" that may possible help me? Feel free to suggest any solution.
Thank you.
Edit:
Example packet (I really had to remove the packet information):
AK010CONFIDENTIALPACKET1AK011CONFIDENTIALPACKET2AK020AK011CONFIDENTIALPACKET3AK021CONFIDENTIALPACKET4AK050
Sadly, each packet in the protocol does not start or end with a specific character so I cannot distinguish them. To identify each one I have to split them using the ack packet and then perform different checks on each one.
The direct translation would be
\bAK\d{2}[01]\b
That is
\b # a word boundary
AK # AK literally
\d{2} # two digits
[01] # one of 0 or 1
\b # another word boundary
The expression needs to be tested though (see a demo on regex101.com).
EDIT:
looking at the other answers, this has probably merely ornamental value.
The solution of #Jan and #ThymosK
var packets = Regex.Split(buffer, #"AK\d{2}[01]");
seems much more elegant.
But I think it might be nice to see how all the parsing can be moved inside the regex. Even if it is way too unreadable :P
I have designed a regex that can give you messages and delimiters as groups:
(?s)(AK[0-9][0-9][0,1])|((?:(?!AK[0-9][0-9][0,1]).)*)
It can analyze text like this:
You can test it here.
As usual, regexes are write only. I can hardly read this myself. But I'll try and go through it:
The first group is straightforward and simply catches your ack command:
(AK[0-9][0-9][0,1])
The second group contains a negative lookahead (?! ... ) which matches anything not followed by the regex specified by .... Here we insert your ack syntax, so anything not followed by ack is matched. Then we add a single character, to extend this to actually match anything up to ack.
Basically this second part asserts that we are currently not followed by ack and then adds a single character. This is repeated as long as possible, until we find the ack. I turn this into the second group.
Since I don't have C# currently, I can't wrap this in code with the C# regex engine. But python works nicely with it and offers a useful findall method, that gives you all those groups.
string interim = Regex.Replace(buffer, "AK\d{2}[01]", "|");
var commands = interim.Split('|');
Assuming that | is not a valid input char. You can pick something very exotic.
I'm using the following regex in c# to match some input cases:
^
(?<entry>[#])?
(?(entry)(?<id>\w+))
(?<value>.*)
$
The options are ignoring pattern whitespaces.
My input looks as follows:
hello
#world
[xxx]
This all can be tested here: DEMO
My problem is that this regex will not match the last line. Why?
What I'm trying to do is to check for an entry character. If it's there I force an identifier by \w+. The rest of the input should be captured in the last group.
This is a simplyfied regex and simplyfied input.
The problem can be fixed if I change the id regex to something like (?(entry)(?<id>\w+)|), (?(entry)(?<id>\w+))? or (?(entry)(?<id>\w+)?).
I try to understand why the conditional group doesn't match as stated in original regex.
I'm firm in regex and know that the regex can be simplyfied to ^(\#(?<id>\w+))?(?<value>.*)$ to match my needs. But the real regex contains two more optional groups:
^
(?<entry>[#])?
(\?\:)?
(\(\?(?:\w+(?:-\w+)?|-\w+)\))?
(?(entry)(?<id>\w+))
(?<value>.*)
$
That's the reason why I'm trying to use a conditional match.
UPDATE 10/12/2018
I tested a little arround it. I found the following regex that should match on every input, even an empty one - but it doesn't:
(?(a)a).*
DEMO
I'm of the opinion that this is a bug in .net regex and reported it to microsoft: See here for more information
There is no error in the regex parser, but in one's usage of the . wildcard specifier. The . specifier will consume all characters, wait for it, except the linefeed character \n. (See Character Classes in Regular Expressions "the any character" .])
If you want your regex to work you need to consume all characters including the linefeed and that can be done by specify the option SingleLine. Which to paraphrase what is said
Singline tells the parser to handle the . to match all characters including the \n.
Why does it still fail when not in singleline mode for the other lines are consumed? That is because the final match actually places the current position at the \n and the only option (as specified is use) is the [.*]; which as we mentioned cannot consume it, hence stops the parser. Also the $ will lock in the operations at this point.
Let me demonstrate what is happening by a tool I have created which illustrates the issue. In the tool the upper left corner is what we see of the example text. Below that is what the parser sees with \r\n characters represented by ↵¶ respectively. Included in that pane is what happens to be matched at the time in yellow boxes enclosing the match. The middle box is the actual pattern and the final right side box shows the match results in detail by listening out the return structures and also showing the white space as mentioned.
Notice the second match (as index 1) has world in group capture id and value as ↵.
I surmise your token processor isn't getting what you want in the proper groups and because one doesn't actually see the successful match of value as the \r, it is overlooked.
Let us turn on Singline and see what happens.
Now everything is consumed, but there is a different problem. :-)
I've got some regex that'll check incoming passwords for complexity requirements. However it's not robust enough for my needs.
((?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,20})
It ensures that a password meets minimum length, contains characters of both cases and includes a number.
However I need to modify this so that it can contain a number and/or an allowed special character. I've even been given a list of allowed special characters.
I have two problems, delimiting the special characters and making the first condition do an and/or match for number or special.
I'd really appreciate advice from one of the regex gods round these parts.
The allowed special characters are: #%+\/'!#$^?:.(){}[]~-_
If I understand your question correctly, you're looking for a possibility to require another special character. This could be done as follows (see the last lookahead):
((?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[!§$%&/(/)]).{8,20})
See a demo for this approach here on regex101.com.
However, you can make your expression even better with further approvements: the dot-star (.*) brings you down the line and backtracks afterwards. If you have a password of say 10 characters and you want to make sure, four lookaheads need to be fulfilled, you'll need at least 40 steps (even more as the engine needs to backtrack).
To optimize your expression, you could use the exact opposite of your required characters, thus making the engine come to an end faster. Additionally, as already pointed out in the comments, do not limit your maximum password length.
In the language of regular expressions, this would come down to:
((?=\D*\d)(?=[^a-z]*[a-z])(?=[^A-Z]*[A-Z])(?=.*[!§$%&/(/)]).{8,})
With the first approach, 63 steps are needed, while the optimized version only needs 29 steps (the half!). Regarding your second question, allowing a digit or a special character, you could simply use an alternation (|) like so:
((?:(?=\D*\d)|(?=.*[!§$%&/(/)]))(?=[^a-z]*[a-z])(?=[^A-Z]*[A-Z]).{8,})
Or put the \d in the brackets as well, like so:
((?=[^a-z]*[a-z])(?=[^A-Z]*[A-Z])(?=.*[\d!§$%&/(/)]).{8,})
This one would consider to be ConsideredAgoodPassw!rd and C0nsideredAgoodPassword a good password.
I am trying to capture the subdomain from huge lists of domain names. For example I want to capture "funstuff" from "funstuff.mysite.com". I do not want to capture, ".mysite.com" in the match. These occurances are in a sea of text so I can not depend on them being at the start of a line. I know the subdomain will not include any special characters or numbers. So what I have is:
[a-z]{2,10}(?=\.mysite\.com)
The problem is this will work only if the subdomain is NOT preceded by a number or special character. For example, "asdfbasdasdfdfunstuff.mysite.com" will return "fdfunstuff" but "asdfasf23/funstuff.mysite.com" won't make a match.
I can not depend on there being a special character before the subdomain, like a "/" as in "http://funstuff.mysite.com" so that can not be used as part of the condition.
It is ok if the capture gets erroneous text before the subdomain, although 99% of the time it will be preceded with something other that a lowercase letter. I have tried,
(?<=[^a-z])[a-z]{2,10}(?=\.mysite\.com)
but for some reason this does not capture text is a situation like:
afb"asdfunstuff.mysite.com
Where the quotation mark prevents a match for [a-z]{2-20}. Basically what I would want to do in that case would be to capture asdfunstuff.mysite.com. How can this be accomplished?
So you've got two problems to solve: first, you want to match ".mysite.com" but not capture it; second, you want to grab up to 10 alphabetic characters in the "subdomain" position.
First problem can be solved by using a capturing group. The regex
([a-z]{2,10})\.mysite\.com
will capture somewhere between 2 and 10 characters, and the returned match object will expose that in one of its properties (depends on the language). C# returns a collection of Match objects, so it'll be the only item.
Second problem can be solved by using the word-boundary character \b. In .NET, this matches where an alphanumeric (i.e. \w) is next to a non-alphanumeric (\W). Other languages (e.g. ECMAScript / Javascript) work simliarly.
So, I suggest the following regex to solve your problem:
\b([a-z]{2,10})\.mysite\.com
Note that numbers are legal in subdomain names, too, so the following might be generally correct (though perhaps not in your specific case):
\b(\w{2,10})\.mysite\.com
where the "word character" \w is equivalent to [a-zA-Z_0-9] in .NET's ECMAScript-compliant mode. (Further reading.)