Regex or Substring to Match Filename With Extension

I have a current situation where I can be given a filename with path that looks like:
C:\\Users\\testUser\\Documents\\testFile.txt.9043632d298f44ad88509c677a8249f8
or
C:\\Users\\testUser\\Documents\\testFile.txt.9043632d298f44ad88509c677a8249f8.enc
I need to extract everything up to the end of the extension (it can be any file extension; there will always be a GUID string preceded by a . after the extension).
So an example output would be:
C:\\Users\\testUser\\Documents\\testFile.txt
C:\\Users\\testUser\\Documents\\testFile.pdf
C:\\Users\\testUser\\Documents\\testFile.jpeg
I have tried substrings but cannot seem to get the proper input (though I assume it is a simple task). I can never seem to get the proper result.
An example I tried was along the lines of:
filename.Substring(0, filename.IndexOf('.', //what goes here??));
but keep getting stuck.
Any help would be lovely!

You might use:
new Regex(@".*(?=\.[a-f\d]{32})", RegexOptions.IgnoreCase).Match(yourString)
Explanation:
.* match zero or more of any char (as many as possible)
(?= ) look ahead, check if the following chars match, but don't include in match
\. match a dot
[a-f\d]{32} match any character a-f or digit exactly 32 times
RegexOptions.IgnoreCase ignores the case
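
A minimal sketch of how that pattern could be wrapped in a helper. The class name GuidSuffix, the method name Strip, and the choice to return the input unchanged when no GUID suffix is found are assumptions made for illustration, not part of the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GuidSuffix
{
    // Everything up to (but not including) a dot followed by 32 hex digits.
    private static readonly Regex BeforeGuid =
        new Regex(@".*(?=\.[a-f\d]{32})", RegexOptions.IgnoreCase);

    // Returns the path without the ".<guid>" (and anything after it).
    // Assumption: when no GUID suffix is present, the input is returned as-is.
    public static string Strip(string path)
    {
        Match m = BeforeGuid.Match(path);
        return m.Success ? m.Value : path;
    }
}
```

With this sketch, both testFile.txt.9043632d298f44ad88509c677a8249f8 and the same name with a trailing .enc come back as testFile.txt, because the look-ahead stops the match just before the dot that precedes the GUID.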

Related

Match text not surrounded by & and ;

I am currently using the following regular expression:
(?<!&)[^&;]*(?!;)
To match text like this:
match1&lt;match2&gt;
And extract:
match1
match2
However, this seems to match an extra five empty strings.
How can I only match the two listed above?
Note the existing pattern ((?<=^|;)[^&]+) by @xanatos will only match matches 1 to 3 in the following string and not match4:
match1&lte;match2<match;3+match&4

Try changing the * to a +:
(?<!&)[^&;]+(?!;)
More correct regex:
(?<=^|;)[^&]+
Test here
The basic idea here is that a "good" substring starts at the beginning of the string (^) or right after the ;, and ends when you encounter a & ([^&]+).
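
As a rough illustration of that idea, the sketch below collects every match of (?<=^|;)[^&]+ into a list. The class and method names are hypothetical, and the sample input assumes the text contains entity-style sequences such as &lt; and &gt;.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EntitySplitter
{
    // Collects the runs of text that start at the beginning of the string
    // or right after a ';' and stop before the next '&'.
    public static List<string> TextRuns(string input)
    {
        var runs = new List<string>();
        foreach (Match m in Regex.Matches(input, @"(?<=^|;)[^&]+"))
        {
            runs.Add(m.Value);
        }
        return runs;
    }
}
```

Because the quantifier is + rather than *, no empty matches are produced: for the input match1&lt;match2&gt; the list holds just match1 and match2.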
Third version... But here we are showing how if you have a problem, and you decide to use regexes, now you have two problems:
(?<=^|;)([^&]|&(?=[^&;]*(?:&|$)))+

I have managed it with:
(?<Text>.+?)(?:&[^&;]*?;|$)
This seems to match all of the corner cases but it might not work with a case I can't think of at the moment.
This won't work if the string starts with a &...; pattern or is only that.

Regex.Replace replaces more than bargained for

I'm writing some test cases for IIS Rewrite rules, but my tests are not matching the same way as IIS is, leading to some false negatives.
Can anyone tell me why the following two lines lead to the same result?
Regex.Replace("v1/bids/aedd3675-a0f2-4494-a2c0-32418cf2476a", ".*v[1-9]/bids/.*", "http://localhost:9900/$0")
Regex.Replace("v1/bids/aedd3675-a0f2-4494-a2c0-32418cf2476a", "v[1-9]/bids/", "http://localhost:9900/$0")
Both return:
http://localhost:9900/v1/bids/aedd3675-a0f2-4494-a2c0-32418cf2476a
But I would expect the last regex to return:
http://localhost:9900/v1/bids/
As the GUID is not matched.
On IIS, the pattern tester yields the result below. Is {R:0} not equivalent to $0?
What I am asking is:
Given the test input of v[1-9]/bids/, how can I match IIS' way of doing Regex replaces so that I get the result http://localhost:9900/v1/bids/, which appears to be what IIS will rewrite to.

The point here is that the pattern you have matches the test strings at the start.
The first .*v[1-9]/bids/.* regex matches 0+ any characters but a newline (as many as possible) up to the last v followed with a digit (other than 0) and followed with /bids/, and then 0+ characters other than a newline. Since the string is matched at the beginning the whole string is matched and placed into Group 0. In the replacement, you just pre-pend http://localhost:9900/ to that value.
The second regex replacement returns the same result because the regex matches v1/bids/, stores it in Group 0, and replaces it with http://localhost:9900/ + v1/bids/. What remains is just appended to the replacement result as it does not match.
You need to match that "tail" in order to remove it.
To only get the http://localhost:9900/v1/bids/, use a capturing group around the v[1-9]/bids/ and use the $1 backreference in the replacement part:
(v[1-9]/bids/).*
Replace with http://localhost:9900/$1. Result: http://localhost:9900/v1/bids/
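
The difference between the two replacements can be sketched side by side. The class name, method names, and the BaseUrl constant below are made up for the example; only Regex.Replace and the patterns come from the discussion above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RewriteDemo
{
    private const string BaseUrl = "http://localhost:9900/";

    // Replaces only the matched "v1/bids/" part; the unmatched tail
    // (the GUID) is left in place after the replacement.
    public static string ReplaceMatchOnly(string input) =>
        Regex.Replace(input, "v[1-9]/bids/", BaseUrl + "$0");

    // Also matches the tail with .* so it is consumed, and keeps only
    // the captured group 1 in the output.
    public static string ReplaceWithTail(string input) =>
        Regex.Replace(input, "(v[1-9]/bids/).*", BaseUrl + "$1");
}
```

For the input v1/bids/aedd3675-a0f2-4494-a2c0-32418cf2476a, ReplaceMatchOnly still ends with the GUID, while ReplaceWithTail returns http://localhost:9900/v1/bids/.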
Update
IIS keeps the base URL and then adds the parts you match with the regex. So, in your case, you have http://localhost:9900/ as the base URL and then you match v1/bids/ with the regex. To simulate this behavior, just use Regex.Match:
var rx = Regex.Match("v1/bids/aedd3675-a0f2-4494-a2c0-32418cf2476a", "v[1-9]/bids/");
var res = rx.Success ? string.Format("http://localhost:9900/{0}", rx.Value) : string.Empty;

Extract string from a pattern preceded by any length

I'm looking for a regular expression to extract a string from a file name
eg if filename format is "anythingatallanylength_123_TESTNAME.docx", I'm interested in extracting "TESTNAME" ... probably fixed length of 8. (btw, 123 can be any three digit number)
I think I can use regex match ...
".*_[0-9][0-9][0-9]_[A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z].docx$"
However this matches the whole thing. How can I just get "TESTNAME"?
Thanks

Use parentheses to capture a specific piece of the whole regex.
You can also use curly braces to specify counts of matching characters, and \d for [0-9].
In C#:
var myRegex = new Regex(@".*_\d{3}_([A-Za-z]{8})\.docx$");
Now "TESTNAME", or whatever your 8-letter piece is, will be found in Groups[1] of the Match returned by myRegex.Match(...).
Also note, there will be a performance overhead for look-ahead and look-behind, as presented in some other solutions.
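
A small sketch of reading the captured piece back out. The class name DocxName, the method name Extract, and returning an empty string when nothing matches are assumptions for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DocxName
{
    // Group 1 captures the eight letters between "_ddd_" and ".docx".
    private static readonly Regex Pattern =
        new Regex(@".*_\d{3}_([A-Za-z]{8})\.docx$");

    // Returns the captured eight-letter piece, or an empty string
    // when the file name does not fit the pattern (assumed fallback).
    public static string Extract(string fileName)
    {
        Match m = Pattern.Match(fileName);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

Groups[0] is always the whole match; the first pair of parentheses is Groups[1], which is why the example reads index 1.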

You can use a look-behind and a look-ahead to check parts without matching them:
(?<=_[0-9]{3}_)[A-Z]{8}(?=\.docx$)
Note that this is case-sensitive, you may want to use other character classes and/or quantifiers to fit your exact pattern.
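
With look-arounds the match value itself is the wanted text, so no group lookup is needed. A minimal sketch, with a hypothetical class name and an assumed empty-string fallback:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundName
{
    // The look-behind and look-ahead only assert their context;
    // the match value is just the eight upper-case letters.
    public static string Extract(string fileName)
    {
        Match m = Regex.Match(fileName, @"(?<=_[0-9]{3}_)[A-Z]{8}(?=\.docx$)");
        return m.Success ? m.Value : string.Empty;
    }
}
```

Since the pattern is case-sensitive, a lower-case name such as x_123_testname.docx yields no match here unless the character class is widened or RegexOptions.IgnoreCase is passed.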

In your file name format "anythingatallanylength_123_TESTNAME.docx", the part you are trying to match is the string between the last underscore _ and .docx. Keeping in mind that any earlier _ must not be matched, I came up with the following solution.
Regex: (?<=_)[A-Za-z]*(?=\.docx$)
Flags used:
g global search
m multi-line search.
Explanation:
(?<=_) checks if there is an underscore before the file name.
(?=\.docx$) checks for extension at the end.
[A-Za-z]* checks the required match.

Thanks to @Lucero @noob @JamesFaix I came up with ...
@"(?<=.*_[0-9]{3}_)[A-Z]{8}(?=\.docx$)"
So a look-behind (in brackets, starting with ?<=) for anything (i.e. zero or more of any char, denoted by ".*"), followed by an underscore, followed by three numerics, followed by an underscore. That's the end of the look-behind. Next comes what I need to match (eight letters). Finally, the look-ahead (in brackets, starting with ?=), which is the .docx at the end of the string.
Nice work, fellas. Thunderbirds are go.

Excluding certain patterns in a regex

I'm working on a Regex in C# to exclude certain patterns within a string.
The patterns I want to accept are: "%00" (hex 00-FF) and any other character without a starting '%'. The patterns I would like to exclude are: "%0" (a % followed by only one character) and/or the characters "&<>'/".
So far I have this
Regex correctStringRegex = new Regex(@"(%[0-9a-fA-F]{2})|[^%&<>'/]|(^(%.))",
RegexOptions.IgnoreCase);
Below are examples of what I'm trying to pass and reject.
Passing String %02This is%0A%0Da string%03
Reject String %0%0Z%A&<%0a%
If a string doesn't pass all the requirements I would like to reject the whole string completely.
Any Help would be greatly appreciated!

I suggest this:
^(?:%[0-9a-f]{2}|[^%&<>'/])*$
Explanation:
^ # Start of string
(?: # Match either
%[0-9a-f]{2} # %xx
| # or
[^%&<>'/] # any character except the forbidden ones
)* # any number of times
$ # until end of string.
This ensures that % is only matched when followed by two hexadecimals. Since you're already compiling the regex with the IgnoreCase flag set, you don't need a-fA-F, either.
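
One way this whole-string check might be wired up in C#. The class and method names are illustrative only; the pattern and the IgnoreCase option are the ones discussed above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapedStringValidator
{
    // The anchors force the whole string to consist of %xx escapes
    // and characters outside the forbidden set.
    private static readonly Regex Valid =
        new Regex(@"^(?:%[0-9a-f]{2}|[^%&<>'/])*$", RegexOptions.IgnoreCase);

    // True only when every part of the string is acceptable.
    public static bool IsValid(string s) => Valid.IsMatch(s);
}
```

The passing example from the question is accepted and the reject example is refused as a whole, since a single bad % or forbidden character prevents the anchored pattern from reaching the end of the string. Note that an empty string also passes because of the * quantifier.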

Hmm, given the comments so far, I think you need a different problem definition. You want to pass or fail a string, using regex, based on whether or not the string contains any invalid patterns. I'm assuming a string will fail if there is ANY invalid pattern, rather than the reverse of a string passing if there is any valid pattern.
As such, I would use this regex: %(?![0-9a-f]{2})|[&<>'/]
You would then run this in such a way that a string is invalid if you GET a match; a valid string will not have any matches in this set.
A quick explanation of a rather odd regex. The (?!...) construct is a negative look-ahead: it tells the regex "match the previous symbol only if the symbols in this set DON'T follow it", i.e. match if the suffix is not present. So what I'm telling it to look for is any instance of % that is not followed by 2 hex characters, or any other invalid character. The assumption is that anything that DOESN'T match this regex is a valid character entry.
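
A sketch of the inverted check, where a match means the string is invalid. The names are hypothetical, and RegexOptions.IgnoreCase is an assumption carried over from the question so that upper-case hex digits such as %0A are accepted.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InvalidPatternFinder
{
    // Matches a '%' NOT followed by two hex digits, or any forbidden character.
    // Assumption: IgnoreCase is used, as in the question, to allow A-F.
    private static readonly Regex Invalid =
        new Regex(@"%(?![0-9a-f]{2})|[&<>'/]", RegexOptions.IgnoreCase);

    // A string is valid when the "invalid" pattern finds nothing.
    public static bool IsValid(string s) => !Invalid.IsMatch(s);
}
```

This version does not need anchors: a single hit anywhere in the string is enough to reject it.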

Simple C# regex

I have a regex I need to match against a path like so: "C:\Documents and Settings\User\My Documents\ScanSnap\382893.pd~". I need a regex that matches all paths except those ending in '~' or '.dat'. The problem I am having is that I don't understand how to match and negate the exact string '.dat' and only at the end of the path. i.e. I don't want to match {d,a,t} elsewhere in the path.
I have built the regex, but need to not match .dat
[\w\s:\.\\]*[^~]$[^\.dat]
[\w\s:\.\\]* This matches all word characters, whitespace, the colon, periods, and backslashes.
[^~]$[^\.dat]$ This causes matches ending in '~' to fail. It seems that I should be able to follow up with a negated match for '.dat', but the match fails in my regex tester.
I think my answer lies in grouping judging from what I've read, would someone point me in the right direction? I should add, I am using a file watching program that allows regex matching, I have only one line to specify the regex.
This entry seems similar: Regex to match multiple strings

You want to use a negative look-ahead:
^((?!\.dat$)[\w\s:\.\\])*$
By the way, your character group ([\w\s:\.\\]) doesn't allow a tilde (~) in it. Did you intend to allow a tilde in the filename if it wasn't at the end? If so:
^((?!~$|\.dat$)[\w\s:\.\\~])*$

The following regex:
^.*(?<!\.dat|~)$
matches any string that does NOT end with a '~' or with '.dat'.
^ # the start of the string
.* # gobble up the entire string (without line terminators!)
(?<!\.dat|~) # looking back, there should not be '.dat' or '~'
$ # the end of the string
In plain English: match a string only when looking behind from the end of the string, there is no sub-string '.dat' or '~'.
Edit: the reason why your attempt failed is because a negated character class, [^...] will just negate a single character. A character class always matches a single character. So when you do [^.dat], you're not negating the string ".dat" but you're matching a single character other than '.', 'd', 'a' or 't'.
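
To make the look-behind version concrete, here is a small sketch. The class name PathFilter, the method name IsAccepted, and the sample paths in the note below are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PathFilter
{
    // Accepts any single-line string whose end is not preceded
    // by ".dat" or "~" (negative look-behind at the end anchor).
    private static readonly Regex Accept = new Regex(@"^.*(?<!\.dat|~)$");

    public static bool IsAccepted(string path) => Accept.IsMatch(path);
}
```

A path ending in .pd~ or .dat is rejected, while a path such as C:\dat\notes.txt is accepted: the letters d, a, t elsewhere in the path do not matter, because the look-behind tests the literal sequence ".dat" only at the end of the string.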

^((?!\.dat$)[\w\s:\.\\])*$
This is just a comment on an earlier answer suggestion:
. within a character class, [], is a literal . and does not need escaping.
^((?!\.dat$)[\w\s:.\\])*$
I'm sorry to post this as a new solution, but I apparently don't have enough reputation to simply comment on an answer yet.

I believe you are looking for this:
[\w\s:\.\\]*([^~]|[^\.dat])$
which finds, like before, all word chars, white space, periods (.), back slashes. Then matches for either tilde (~) or '.dat' at the end of the string. You may also want to add a caret (^) at the very beginning if you know that the string should be at the beginning of a new line.
^[\w\s:\.\\]*([^~]|[^\.dat])$
