Regex match a hash that has been split over multiple lines

I want to match a hash that has been word wrapped by an author, and received over multiple lines.
Example:
SHA256: AB76235776BC87DBAB76235776BC87DBAB76235776BC87
DBAB76235776BC87DB
Has been received. My usual regex to match a sha256 hash like this is of course: [0-9A-Fa-f]{64}
But this does not work. I would like to leave the file unmodified while searching for this match, any ideas on how to match the split hash without removing newlines?
I'd like to have a regex that basically says 'look for 64 sequential hexadecimal values, but allow for one or more newlines in the mix, kthx'
Thanks in advance. C# is the language.

Try this:
\b(?:[a-fA-F0-9]\s*){64}\b
It allows any kind of whitespace, not just line separators. If it really has to allow only line separators, you can use this:
\b(?:[a-fA-F0-9][\r\n]*){64}\b
This will also include the line separator following the number, if there is one, and if it's followed by a word character. You can prevent that like this:
\b(?:[a-fA-F0-9][\r\n]*){63}[a-fA-F0-9]\b
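
As a rough sketch of how the last pattern could be used from C# (written as a C# 9+ top-level program; the variable names are illustrative only), the input text stays untouched and only the extracted copy has its line breaks stripped:

```csharp
using System;
using System.Text.RegularExpressions;

// Sample input from the question: the hash wraps onto a second line.
string text = "SHA256: AB76235776BC87DBAB76235776BC87DBAB76235776BC87\r\nDBAB76235776BC87DB\r\n";

// 63 hex digits, each optionally followed by line breaks, then the final digit.
var wrappedHash = new Regex(@"\b(?:[a-fA-F0-9][\r\n]*){63}[a-fA-F0-9]\b");

Match hashMatch = wrappedHash.Match(text);

// The match still contains the embedded \r\n; clean up only the extracted value.
string hash = hashMatch.Success
    ? hashMatch.Value.Replace("\r", "").Replace("\n", "")
    : "";

Console.WriteLine(hash.Length); // 64
Console.WriteLine(hash);
```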

Change your regex to include newline characters:
[0-9A-Fa-f\r\n ]{64,}
You could modify the upper bound to restrict the number of line breaks.
In that case, keep in mind that a line break can be two characters long (\r\n on Windows, \n on Unix-like systems), depending on the OS.
1 linebreak --> 66 chars
2 linebreaks --> 68 chars
Continue as much as you like.
On a side note: parsing a file generally leaves the file itself untouched. All your modifications are made to the variables you read the file into. This is why I do not see the point of keeping the line breaks.

Related

Underscore in regex not validating

How do I add underscore as a part of my regex string.
Here is my string that checks for uppercase, lowercase, numbers and special characters. The rest of the special characters work. Validation isn't working for underscores.
@"^[^\s](?=(.*[A-Za-z]){1,})(?=(.*[\d]){1,})(?=(.*[\W]){1,})(?=(.*[!@#$%^&*()-+=\[{\]};:<>|_.\\/?,\-`'""~]{1,})).*[^\s]$"
Any ideas?
Thanks

This is the regex that AWS Cognito uses; it should apply to your situation:
@"^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[\^$*.\[\]{}\(\)?\-“!@#%&\/,><’:;|_~`])\S{8,99}$"
You can check regexes at http://regexstorm.net, which is faster than building your application every time.

I've approached it like this: I took your requirements and made them into separate positive lookaheads:
Check for:
uppercase (?=.*[A-Z])
lowercase (?=.*[a-z]) (note that I broke A-Z and a-z up into separate groups)
numbers (?=.*\d)
special characters (?=.*[!@#$%^&*()-+=\[{\]};:<>|_.\\/?,\-`'""~])
You can then combine them in any order and I've combined them in the same order as I listed them above and anchored it with the beginning of the line using ^. Don't add any extra matches before, in-between or after the groups in your requirement that could cause the regex to enforce a certain ordering of the groups:
The lookahead for any non-word character \W makes it impossible to match Underscore1_ since it will only match on "anything other than a letter, digit or underscore" - which is all Underscore1_ contains.
The starting [^\s] (and ending [^\s]) that consumes one character is likely destroying a lot of good matches. Underscore1_ or _1scoreUnder shouldn't matter, but if you start with _ and consume it with [^\s] like you do, the later lookahead for a special character will fail (unless you have a second special character in the password).
@"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[!@#$%^&*()-+=\[{\]};:<>|_.\\/?,\-`'""~])"
If you have a minimum length requirement of, say, 7 characters, you just have to add .{7,}$ to the end of the regex, making it:
@"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[!@#$%^&*()-+=\[{\]};:<>|_.\\/?,\-`'""~]).{7,}$"
Without a minimum length, a password of one character from each group will be enough, and since there are 4 groups, a password with only 4 characters will pass the filter.
I see no point in putting an upper length limit into the regex. If the user interface has accepted a string that is thousands of characters long, then why reject it for being too long later? The length of what you store is probably going to be much smaller anyway since you'll be storing the bcrypt/scrypt/argon2/... encoded password.
Suggestion: Also add space (or even whitespaces) to the list of special characters.
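
A minimal sketch of the combined lookahead pattern in use (C# 9+ top-level program; the 7-character minimum and the sample passwords are only the illustrative values from this answer):

```csharp
using System;
using System.Text.RegularExpressions;

// Verbatim string: "" inside it stands for a single double-quote character.
var password = new Regex(
    @"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[!@#$%^&*()-+=\[{\]};:<>|_.\\/?,\-`'""~]).{7,}$");

// The position of the underscore does not matter, because every lookahead
// starts scanning from the beginning of the string.
foreach (string candidate in new[] { "Underscore1_", "_1scoreUnder", "Underscore1", "Aa1_" })
{
    Console.WriteLine($"{candidate}: {password.IsMatch(candidate)}");
}
// Underscore1_: True
// _1scoreUnder: True
// Underscore1: False   (no special character)
// Aa1_: False          (shorter than 7 characters)
```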

In your regex, add an underscore to the third capturing group (regex101 demo):
@"^[^\s](?=(.*[A-Za-z]){1,})(?=(.*[\d]){1,})(?=(.*[\W_]){1,})(?=(.*[!@#$%^&*()-+=\[{\]};:<>|_.\\/?,\-`'""~]{1,})).*[^\s]$"

Conditional match without false force a match?

I'm using the following regex in c# to match some input cases:
^
(?<entry>[#])?
(?(entry)(?<id>\w+))
(?<value>.*)
$
The IgnorePatternWhitespace option is set.
My input looks as follows:
hello
#world
[xxx]
This all can be tested here: DEMO
My problem is that this regex will not match the last line. Why?
What I'm trying to do is to check for an entry character. If it's there I force an identifier by \w+. The rest of the input should be captured in the last group.
This is a simplified regex and simplified input.
The problem can be fixed if I change the id regex to something like (?(entry)(?<id>\w+)|), (?(entry)(?<id>\w+))? or (?(entry)(?<id>\w+)?).
I'm trying to understand why the conditional group doesn't match as written in the original regex.
I'm comfortable with regex and know that it can be simplified to ^(\#(?<id>\w+))?(?<value>.*)$ to match my needs. But the real regex contains two more optional groups:
^
(?<entry>[#])?
(\?\:)?
(\(\?(?:\w+(?:-\w+)?|-\w+)\))?
(?(entry)(?<id>\w+))
(?<value>.*)
$
That's the reason why I'm trying to use a conditional match.
UPDATE 10/12/2018
I tested around it a little. I found the following regex, which should match every input, even an empty one, but it doesn't:
(?(a)a).*
DEMO
I'm of the opinion that this is a bug in .NET regex and reported it to Microsoft: See here for more information
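
For reference, the first workaround mentioned above (an explicit empty "no" branch) can be sketched like this. It is a C# 9+ top-level program that tests each input line on its own, so line endings play no part; the `Describe` helper is hypothetical and exists only for this illustration:

```csharp
using System;
using System.Text.RegularExpressions;

// Same pattern as in the question, written on one line, with "|" adding an
// explicit empty alternative for the case where "entry" did not participate.
var line = new Regex(@"^(?<entry>[#])?(?(entry)(?<id>\w+)|)(?<value>.*)$");

// Hypothetical helper that formats the interesting groups of one match.
string Describe(string input)
{
    Match m = line.Match(input);
    return m.Success
        ? $"id='{m.Groups["id"].Value}' value='{m.Groups["value"].Value}'"
        : "no match";
}

Console.WriteLine(Describe("hello"));   // id='' value='hello'
Console.WriteLine(Describe("#world"));  // id='world' value=''
Console.WriteLine(Describe("[xxx]"));   // id='' value='[xxx]'
```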

There is no error in the regex parser; the issue is in how the . wildcard specifier is used. The . specifier will consume all characters except, wait for it, the linefeed character \n. (See "Any character: ." in Character Classes in Regular Expressions.)
If you want your regex to work, you need to consume all characters including the linefeed, and that can be done by specifying the RegexOptions.Singleline option. To paraphrase what the documentation says:
Singleline tells the parser to let . match all characters, including the \n.
Why does it still fail outside Singleline mode, when the other lines are consumed? Because the final match leaves the current position at the \n, and the only construct left to try (as specified) is the .*, which as mentioned cannot consume it, so the parser stops. The $ also locks in the operation at this point.
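
The difference in how . treats \n can be seen in isolation with a small sketch (C# 9+ top-level program; the input string is made up for the illustration and uses Windows line endings):

```csharp
using System;
using System.Text.RegularExpressions;

string input = "hello\r\n#world";

// Default: '.' matches every character except \n, so .* stops right before
// the line feed and still includes the carriage return.
string defaultMatch = Regex.Match(input, ".*").Value;

// Singleline: '.' also matches \n, so .* runs to the end of the input.
string singlelineMatch = Regex.Match(input, ".*", RegexOptions.Singleline).Value;

Console.WriteLine(defaultMatch.Length);     // 6  ("hello" plus \r)
Console.WriteLine(singlelineMatch.Length);  // 13 (the whole input)
```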
Let me demonstrate what is happening with a tool I have created that illustrates the issue. In the tool, the upper left corner shows the example text as we see it. Below that is what the parser sees, with the \r\n characters represented by ↵¶ respectively. That pane also shows what is matched at the time, in yellow boxes enclosing each match. The middle box is the actual pattern, and the box on the right shows the match results in detail by listing the returned structures, again with the white space made visible.
Notice that the second match (index 1) has world in the group capture id, and value as ↵.
I surmise your token processor isn't getting what you want in the proper groups, and because one doesn't actually see the successful match of value as the \r, it is overlooked.
Let us turn on Singleline and see what happens.
Now everything is consumed, but there is a different problem. :-)

Extract string from a pattern preceded by any length

I'm looking for a regular expression to extract a string from a file name
eg if filename format is "anythingatallanylength_123_TESTNAME.docx", I'm interested in extracting "TESTNAME" ... probably fixed length of 8. (btw, 123 can be any three digit number)
I think I can use regex match ...
".*_[0-9][0-9][0-9]_[A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z].docx$"
However this matches the whole thing. How can I just get "TESTNAME"?
Thanks

Use parentheses to capture a specific piece of the whole regex.
You can also use curly braces to specify counts of matching characters, and \d for [0-9].
In C#:
var myRegex = new Regex(@".*_\d{3}_([A-Za-z]{8})\.docx$");
Now "TESTNAME", or whatever your 8-letter piece is, will be available as Groups[1] of the Match returned by the regex.
Also note, there will be a performance overhead for look-ahead and look-behind, as presented in some other solutions.
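
A short sketch of reading the captured piece back out (C# 9+ top-level program; the file name is the example from the question and the variable names are illustrative):

```csharp
using System;
using System.Text.RegularExpressions;

var myRegex = new Regex(@".*_\d{3}_([A-Za-z]{8})\.docx$");

Match fileMatch = myRegex.Match("anythingatallanylength_123_TESTNAME.docx");

// Groups[0] is the whole match; Groups[1] is the first parenthesized group.
string name = fileMatch.Success ? fileMatch.Groups[1].Value : "";

Console.WriteLine(name); // TESTNAME
```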

You can use a look-behind and a look-ahead to check parts without matching them:
(?<=_[0-9]{3}_)[A-Z]{8}(?=\.docx$)
Note that this is case-sensitive, you may want to use other character classes and/or quantifiers to fit your exact pattern.
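
As a sketch of the lookaround version (C# 9+ top-level program; the sample file names are illustrative), the match value itself is the wanted piece, and RegexOptions.IgnoreCase is one possible way to relax the case sensitivity:

```csharp
using System;
using System.Text.RegularExpressions;

const string pattern = @"(?<=_[0-9]{3}_)[A-Z]{8}(?=\.docx$)";

// The look-behind and look-ahead are checked but not included in the match.
string upper = Regex.Match("anythingatallanylength_123_TESTNAME.docx", pattern).Value;

// A lowercase name is rejected by the case-sensitive pattern...
bool lowerMatches = Regex.IsMatch("anythingatallanylength_123_testname.docx", pattern);

// ...but accepted when case is ignored.
string lower = Regex.Match("anythingatallanylength_123_testname.docx", pattern,
                           RegexOptions.IgnoreCase).Value;

Console.WriteLine(upper);        // TESTNAME
Console.WriteLine(lowerMatches); // False
Console.WriteLine(lower);        // testname
```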

In your file name format "anythingatallanylength_123_TESTNAME.docx", the part you want is the string between the last underscore and .docx. Keeping in mind that text before any earlier _ should not be matched, I came up with the following solution.
Regex: (?<=_)[A-Za-z]*(?=\.docx$)
Flags used:
g global search
m multi-line search.
Explanation:
(?<=_) checks if there is an underscore before the file name.
(?=\.docx$) checks for extension at the end.
[A-Za-z]* checks the required match.
Regex101 Demo

Thanks to @Lucero @noob @JamesFaix I came up with ...
@"(?<=.*_[0-9]{3}_)[A-Z]{8}(?=\.docx$)"
So: a look-behind (in parentheses, starting with ?<=) for anything (i.e. zero or more of any character, denoted by .*), followed by an underscore, followed by three digits, followed by an underscore. That's the end of the look-behind. Next comes what I need to match (eight letters). Finally, the look-ahead (in parentheses, starting with ?=), which is the .docx.
Nice work, fellas. Thunderbirds are go.

c# Regex question: only letters, numbers and a dot (2 to 20 chars) allowed

i am wrestling with my regex.
I want to allow only letters and numbers and a dot in a username, and 2 to 20 chars long
I thought of something like this
[0-9a-zA-Z]{2,20}
but then 21 chars is also ok, and that's not what i want

I suggest that you make two checks -- one for length and one for content based on the fact that you probably only want one dot in the name, rather than any number of dots. I'll assume that names like username and user.name are the only formats allowed.
This should check the content (but it allows underscores as well):
^\w+(\.\w+)?$
If you don't want underscores, then you would use [0-9a-zA-Z]+ in place of \w+. To explain, it will match any string that consists of one or more word characters, followed by exactly 0 or 1 of a dot followed by one or more word characters. It must also match the beginning and end of the string, i.e., no other characters are allowed in the string.
Then you only need to get the length with a simple length check.
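
The two-check approach might look roughly like this (C# 9+ top-level program; `IsValidUserName` is a hypothetical helper name, and the underscore-free character class is used in place of \w):

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: length and content are checked separately.
static bool IsValidUserName(string name) =>
    name.Length >= 2 && name.Length <= 20 &&
    Regex.IsMatch(name, @"^[0-9a-zA-Z]+(\.[0-9a-zA-Z]+)?$");

Console.WriteLine(IsValidUserName("username"));           // True
Console.WriteLine(IsValidUserName("user.name"));          // True
Console.WriteLine(IsValidUserName("user..name"));         // False (two dots)
Console.WriteLine(IsValidUserName(".user"));              // False (leading dot)
Console.WriteLine(IsValidUserName(new string('a', 21)));  // False (too long)
```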

^[0-9a-zA-Z\.]{2,20}$

Try ^[\w\.]{2,20}$ instead.

You need to use start and end of string (^ and $), and escape the .:
^[0-9a-zA-Z\.]{2,20}$
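
A small sketch (C# 9+ top-level program; the sample strings are made up) of why the anchors matter: without them, the pattern is satisfied by any 2 to 20 character run inside a longer string.

```csharp
using System;
using System.Text.RegularExpressions;

string tooLong = new string('a', 21);

// Unanchored: the first 20 characters of the 21-character string match.
bool unanchored = Regex.IsMatch(tooLong, @"[0-9a-zA-Z]{2,20}");

// Anchored: the whole string has to fit the pattern.
bool anchored = Regex.IsMatch(tooLong, @"^[0-9a-zA-Z\.]{2,20}$");
bool valid = Regex.IsMatch("user.name", @"^[0-9a-zA-Z\.]{2,20}$");

Console.WriteLine(unanchored); // True
Console.WriteLine(anchored);   // False
Console.WriteLine(valid);      // True
```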

How do I ensure a text box is alphanumeric but without a leading digit?

My web application contains a text box whose input I would like to restrict. I would like to prevent the user from entering text that:
Starts with white space
Starts with a digit
Contains non-alphanumeric characters after the leading character.
Thank you for your suggestions!

Must not start with white space or a digit: [a-zA-Z]+
Followed by 0 or more alphanumeric: [a-zA-Z0-9]*
Final expression
^[a-zA-Z]+[a-zA-Z0-9]*$

For ASCII characters you could use:
^[a-zA-Z][a-zA-Z0-9]*$ // Note you don't need the "+" after the first character group.
// or...
(?i:^[a-z][a-z0-9]*$) // Slightly shorter, albeit more unreadable, syntax (?i: ... ) makes the expression case-insensitive
If you want to match empty string just wrap the expression in "( ... )?", like so:
^([a-zA-Z][a-zA-Z0-9]*)?$
If you want to work in Unicode you might want to use:
^\p{L}[\p{L}\p{Nd}]*$
Unicode w. empty string:
^(\p{L}[\p{L}\p{Nd}]*)?$
To read more about unicode possibilities in regex, see this page on Regular-Expressions.info.
Edit
Just collected all possibilities in one answer.
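
To compare the ASCII and Unicode variants side by side, here is a brief sketch (C# 9+ top-level program; the sample inputs are made up for the illustration):

```csharp
using System;
using System.Text.RegularExpressions;

var ascii = new Regex(@"^[a-zA-Z][a-zA-Z0-9]*$");
var unicode = new Regex(@"^\p{L}[\p{L}\p{Nd}]*$");

// "\u00E9mile1" starts with the letter é, which is outside the ASCII range.
foreach (string input in new[] { "abc123", "1abc", " abc", "ab c", "\u00E9mile1" })
{
    Console.WriteLine($"'{input}': ascii={ascii.IsMatch(input)}, unicode={unicode.IsMatch(input)}");
}
// 'abc123': ascii=True, unicode=True
// '1abc': ascii=False, unicode=False
// ' abc': ascii=False, unicode=False
// 'ab c': ascii=False, unicode=False
// 'émile1': ascii=False, unicode=True
```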
