Is it a drive path or another? Check with Regex

Is it a drive path or another? Check with Regex - c#

I would like to check whether it is a drive path or a "pol" path.
For this I have already written a small code, unfortunately, I always return true.
The regex expression may be incorrect \W?\w{1}:{1}[/]{1}. How do I do it right?The path names can always be different and do not have to agree with the pole path.
Thank you in advance.
public bool isPolPath(string path)
{
bool isPolPath= true;
// Pol-Path: /Buy/Toy/Special/Clue
// drive-Path: Q:\Buy/Special/Clue
Regex myRegex = new Regex(#"\W?\w{1}:{1}[/]{1}", RegexOptions.IgnoreCase);
Match matchSuccess = myRegex.Match(path);
if (matchSuccess.Success)
isPolPath= false;
return isPolPath;
}

You don't need regexes to achieve this. Use System.IO.Path.GetPathRoot. It returns X:\ (where X is the actual drive letter) if the given path contains drive letter and an empty string or slash otherwise.
new List<string> {
#"/Buy/Toy/Special/Clue",
#"q:\Buy/Special/Clue",
#"Buy",
#"/",
#"\",
#"q:",
#"q:/",
#"q:\",
//#"", // This throws an exception saying path is illegal
}.ForEach(
p => Console.WriteLine(Path.GetPathRoot(p))
);
/* This code outputs:
\
q:\
\
\
q:
q:\
q:\
*/
Therefore your check may look like this:
isPolPath = Path.GetPathRoot(path).Length < 2;
If you wish to make your code more foolproof and protect from exception when an empty string is passed, you need to decide if an empty (or null) string is a pol-path or drive path. Depending on the decision the check would be either
sPolPath = string.IsNullOrEmpty(path) || Path.GetPathRoot(path).Length < 2;
or
if (string.IsNullOrEmpty(path))
sPolPath = false;
else
sPolPath = Path.GetPathRoot(path).Length < 2;

Related

Does my code prevent directory traversal or is it overkill?

I want to make sure this is enough to prevent directory traversal and also any suggestions or tips would be appreciated. The directory "/wwwroot/Posts/" is the only directory which is allowed.
[HttpGet("/[controller]/[action]/{name}")]
public IActionResult Post(string name)
{
if(string.IsNullOrEmpty(name))
{
return View("Post", new BlogPostViewModel(true)); //error page
}
char[] InvalidFilenameChars = Path.GetInvalidFileNameChars();
if (name.IndexOfAny(InvalidFilenameChars) >= 0)
{
return View("Post", new BlogPostViewModel(true));
}
DirectoryInfo dir = new DirectoryInfo(Path.Combine(Directory.GetCurrentDirectory(), "wwwroot/Posts"));
var userpath = Path.GetFullPath(Path.Combine(Directory.GetCurrentDirectory(), "wwwroot/Posts", name));
if (Path.GetDirectoryName(userpath) != dir.FullName)
{
return View("Post", new BlogPostViewModel(true));
}
var temp = Path.Combine(dir.FullName, name + ".html");
if (!System.IO.File.Exists(temp))
{
return View("Post", new BlogPostViewModel(true));
}
BlogPostViewModel model = new BlogPostViewModel(Directory.GetCurrentDirectory(), name);
return View("Post", model);
}

Probably, but I wouldn't consider it bulletproof. Let's break this down:
First you are black-listing known invalid characters:
char[] InvalidFilenameChars = Path.GetInvalidFileNameChars();
if (name.IndexOfAny(InvalidFilenameChars) >= 0)
{
return View("Post", new BlogPostViewModel(true));
}
This is a good first step, but blacklisting input is rarely enough. It will prevent certain control characters, but the documentation does not explicitly state that directory separators ( e.g. / and \) are included. The documentation states:
The array returned from this method is not guaranteed to contain the
complete set of characters that are invalid in file and directory
names. The full set of invalid characters can vary by file system.
Next, you attempt to make sure that after path.combine you have the expected parent folder for your file:
DirectoryInfo dir = new DirectoryInfo(Path.Combine(Directory.GetCurrentDirectory(), "wwwroot/Posts"));
var userpath = Path.GetFullPath(Path.Combine(Directory.GetCurrentDirectory(), "wwwroot/Posts", name));
if (Path.GetDirectoryName(userpath) != dir.FullName)
{
return View("Post", new BlogPostViewModel(true));
}
In theory, if the attacker passed in ../foo (and perhaps that gets past the blacklisting attempt above if / isn't in the list of invalid characters), then Path.Combine should combine the paths and return /somerootpath/wwwroot/foo. GetParentFolder would return /somerootpath/wwwroot which would be a non-match and it would get rejected. However, suppose Path.Combine concatenates and returns /somerootpath/wwwroot/Posts/../foo. In this case GetParentFolder will return /somerootpath/wwwRoot/Posts which is a match and it proceeds. Seems unlikely, but there may be control characters which get past GetInvalidFileNameChars() based on the documentation stating that it is not exhaustive which trick Path.Combine into something along these lines.
Your approach will probably work. However, if it is at all possible, I would strongly recommend you whitelist the expected input rather than attempt to blacklist all possible invalid inputs. For example, if you can be certain that all valid filenames will be made up of letters, numbers, and underscores, build a regular expression that asserts that and check before continuing. Testing for ^[A-Za-z0-0_]+$ would assert that and be 100% bulletproof.

Sanitizing a file path in C# without compromising the drive letter

I need to process some file paths in C# that potentially contain illegal characters, for example:
C:\path\something\output_at_13:26:43.txt
in that path, the :s in the timestamp make the filename invalid, and I want to replace them with another safe character.
I've searched for solutions here on SO, but they seem to be all based around something like:
path = string.Join("_", path.Split(Path.GetInvalidFileNameChars()));
or similar solutions. These solutions however are not good, because they screw up the drive letter, and I obtain an output of:
C_\path\something\output_at_13_26_43.txt
I tried using Path.GetInvalidPathChars() but it still doesn't work, because it doesn't include the : in the illegal characters, so it doesn't replace the ones in the filename.
So, after figuring that out, I tried doing this:
string dir = Path.GetDirectoryName(path);
string file = Path.GetFileName(path);
file = string.Join(replacement, file.Split(Path.GetInvalidFileNameChars()));
dir = string.Join(replacement, dir.Split(Path.GetInvalidPathChars()));
path = Path.Combine(dir, file);
but this is not good either, because the :s in the filename seem to interfere with the Path.GetFilename() logic, and it only returns the last piece after the last :, so I'm losing pieces of the path.
How do I do this "properly" without hacky solutions?

You can write a simple sanitizer that iterates each character and knows when to expect the colon as a drive separator. This one will catch any combination of letter A-Z followed directly by a ":". It will also detect path separators and not escape them. It will not detect whitespace at the beginning of the input string, so in case your input data might come with them, you will have to trim it first or modify the sanitizer accordingly:
enum ParserState {
PossibleDriveLetter,
PossibleDriveLetterSeparator,
Path
}
static string SanitizeFileName(string input) {
StringBuilder output = new StringBuilder(input.Length);
ParserState state = ParserState.PossibleDriveLetter;
foreach(char current in input) {
if (((current >= 'a') && (current <= 'z')) || ((current >= 'A') && (current <= 'Z'))) {
output.Append(current);
if (state == ParserState.PossibleDriveLetter) {
state = ParserState.PossibleDriveLetterSeparator;
}
else {
state = ParserState.Path;
}
}
else if ((current == Path.DirectorySeparatorChar) ||
(current == Path.AltDirectorySeparatorChar) ||
((current == ':') && (state == ParserState.PossibleDriveLetterSeparator)) ||
!Path.GetInvalidFileNameChars().Contains(current)) {
output.Append(current);
state = ParserState.Path;
}
else {
output.Append('_');
state = ParserState.Path;
}
}
return output.ToString();
}
You can try it out here.

You definitely should make sure that you only receive valid filenames.
If you can't, and you're certain your directory names will be, you could split the path the last backslash (assuming Windows) and reassemble the string:
public static string SanitizePath(string path)
{
var lastBackslash = path.LastIndexOf('\\');
var dir = path.Substring(0, lastBackslash);
var file = path.Substring(lastBackslash, path.Length - lastBackslash);
foreach (var invalid in Path.GetInvalidFileNameChars())
{
file = file.Replace(invalid, '_');
}
return dir + file;
}

Find unbounded file paths in string

I have these error messages generated by a closed source third party software from which I need to extract file paths.
The said file paths are :
not bounded (i.e. not surrounded by quotation marks, parentheses, brackets, etc)
rooted (i.e. start with <letter>:\ such as C:\)
not guaranteed to have a file extension
representing files (only files, not directories) that are guaranteed to exist on the computer running the extraction code.
made of any valid characters, including spaces, making them hard to spot (e.g. C:\This\is a\path \but what is an existing file path here)
To be noted, there can be 0 or more file paths per message.
How can these file paths be found in the error messages?
I've suggested an answer below, but I have a feeling that there is a better way to go about this.

For each match, look forward for the next '\' character. So you might get "c:\mydir\". Check to see if that directory exists. Then find the next \, giving "c:\mydir\subdir`. Check for that path. Eventually you'll find a path that doesn't exist, or you'll get to the start of the next match.
At that point, you know what directory to look in. Then just call Directory.GetFiles and match the longest filename that matches the substring starting at the last path you found.
That should minimize backtracking.
Here's how this could be done:
static void FindFilenamesInMessage(string message) {
// Find all the "letter colon backslash", indicating filenames.
var matches = Regex.Matches(message, #"\w:\\", RegexOptions.Compiled);
// Go backwards. Useful if you need to replace stuff in the message
foreach (var idx in matches.Cast<Match>().Select(m => m.idx).Reverse()) {
int length = 3;
var potentialPath = message.Substring(idx, length);
var lastGoodPath = potentialPath;
// Eat "\" until we get an invalid path
while (Directory.Exists(potentialPath)) {
lastGoodPath = potentialPath;
while (idx+length < message.Length && message[idx+length] != '\\')
length++;
length++; // Include the trailing backslash
if (idx + length >= message.Length)
length = (message.Length - idx) - 1;
potentialPath = message.Substring(idx, length);
}
potentialPath = message.Substring(idx);
// Iterate over the files in directory we found until we get a match
foreach (var file in Directory.EnumerateFiles(lastGoodPath)
.OrderByDescending(s => s.Length)) {
if (!potentialPath.StartsWith(file))
continue;
// 'file' contains a valid file name
break;
}
}
}

This is how I would do it.
I don't think substringing the message over and over is a good idea however.
static void FindFilenamesInMessage(string message)
{
// Find all the "letter colon backslash", indicating filenames.
var matches = Regex.Matches(message, #"\w:\\", RegexOptions.Compiled);
int length = message.Length;
foreach (var index in matches.Cast<Match>().Select(m => m.Index).Reverse())
{
length = length - index;
while (length > 0)
{
var subString = message.Substring(index, length);
if (File.Exists(subString))
{
// subString contains a valid file name
///////////////////////
// Payload goes here
//////////////////////
length = index;
break;
}
length--;
}
}
}

How to find folder below given section of a path?

Given a path and a certain section, how can I find the name of the folder immediately below that section?
This is hard to explain, let me give some examples. Suppose I am looking for the name of the folder below 'Dev/Branches'. Below are example inputs, with the expected results in bold
C:\Code\Dev\Branches\ Latest \bin\abc.dll
C:\Dev\Branches\ 5.1
D:\My Documents\Branches\ 7.0 \Source\Tests\test.cs
I am using C#
Edit: I suppose I could use the regex /Dev/Branches/(.*?)/ capturing the first group, but is there a neater solution without regex? That regex would fail on the second case, anyway.

// starting path
string path = #"C:\Code\Dev\Branches\Latest\bin\abc.dll";
// search path
string search = #"Dev\Branches";
// find the index of the search criteria
int idx = path.IndexOf(search);
// determine whether to exit or not
if (idx == -1 || idx + search.Length >= path.Length) return;
// get the substring AFTER the search criteria, split it and take the first item
string found = path.Substring(idx + search.Length).Split("\\".ToCharArray(), StringSplitOptions.RemoveEmptyEntries).First();
Console.WriteLine(found);

Here's the code that will do exactly what you expect:
public static string GetSubdirectoryFromPath(string path, string parentDirectory, bool ignoreCase = true)
{
// 1. Standarize the path separators.
string safePath = path.Replace("/", #"\");
string safeParentDirectory = parentDirectory.Replace("/", #"\").TrimEnd('\\');
// 2. Prepare parentDirectory to use in Regex.
string directory = Regex.Escape(safeParentDirectory);
// 3. Find the immediate subdirectory to parentDirectory.
Regex match = new Regex(#"(?:|.+)" + directory + #"\\([^\\]+)(?:|.+)", ignoreCase ? RegexOptions.IgnoreCase : RegexOptions.None);
// 4. Return the match. If not found, it returns null.
string subDirectory = match.Match(safePath).Groups[1].ToString();
return subDirectory == "" ? null : subDirectory;
}
A test code:
void Test()
{
string path1 = #"C:\Code\Dev\Branches\Latest\bin\abc.dll";
string path2 = #"C:\Dev\Branches\5.1";
string path3 = #"D:\My Documents\Branches\7.0\Source\test.cs";
Console.WriteLine("Matches:");
Console.WriteLine(GetSubdirectoryFromPath(path1, "dev/branches/") ?? "Not found");
Console.WriteLine(GetSubdirectoryFromPath(path1, #"Dev\Branches") ?? "Not found");
Console.WriteLine(GetSubdirectoryFromPath(path3, "D:/My Documents/Branches") ?? "Not found");
// Incorrect parent directory.
Console.WriteLine(GetSubdirectoryFromPath(path2, "My Documents") ?? "Not found");
// Case sensitive checks.
Console.WriteLine(GetSubdirectoryFromPath(path3, #"My Documents\Branches", false) ?? "Not found");
Console.WriteLine(GetSubdirectoryFromPath(path3, #"my Documents\Branches", false) ?? "Not found");
// Output:
//
// Matches:
// Latest
// Latest
// 7.0
// Not found
// 7.0
// Not found
}

Break it down into smaller steps and you can solve this yourself:
(Optional, depending on further requirements): Get the directory name (file is irrelevant): Path.GetDirectoryName(string)
Get its parent directory Directory.GetParent(string).
This comes down to;
var directory = Path.GetDirectoryName(input);
var parentDirectory = Directory.GetParent(directory);
The supplied C:\Dev\Branches\5.1 -> 5.1 does not conform to your specification, that is the directory name of the input path itself. This will output Branches.

new Regex("\\?" + PathToMatchEscaped + "\\(\w+)\\?").Match()...

I went with this
public static string GetBranchName(string path, string prefix)
{
string folder = Path.GetDirectoryName(path);
// Walk up the path until it ends with Dev\Branches
while (!String.IsNullOrEmpty(folder) && folder.Contains(prefix))
{
string parent = Path.GetDirectoryName(folder);
if (parent != null && parent.EndsWith(prefix))
return Path.GetFileName(folder);
folder = parent;
}
return null;
}

glob pattern matching in .NET

Is there a built-in mechanism in .NET to match patterns other than Regular Expressions? I'd like to match using UNIX style (glob) wildcards (* = any number of any character).
I'd like to use this for a end-user facing control. I fear that permitting all RegEx capabilities will be very confusing.

I like my code a little more semantic, so I wrote this extension method:
using System.Text.RegularExpressions;
namespace Whatever
{
public static class StringExtensions
{
/// <summary>
/// Compares the string against a given pattern.
/// </summary>
/// <param name="str">The string.</param>
/// <param name="pattern">The pattern to match, where "*" means any sequence of characters, and "?" means any single character.</param>
/// <returns><c>true</c> if the string matches the given pattern; otherwise <c>false</c>.</returns>
public static bool Like(this string str, string pattern)
{
return new Regex(
"^" + Regex.Escape(pattern).Replace(#"\*", ".*").Replace(#"\?", ".") + "$",
RegexOptions.IgnoreCase | RegexOptions.Singleline
).IsMatch(str);
}
}
}
(change the namespace and/or copy the extension method to your own string extensions class)
Using this extension, you can write statements like this:
if (File.Name.Like("*.jpg"))
{
....
}
Just sugar to make your code a little more legible :-)

Just for the sake of completeness. Since 2016 in dotnet core there is a new nuget package called Microsoft.Extensions.FileSystemGlobbing that supports advanced globing paths. (Nuget Package)
some examples might be, searching for wildcard nested folder structures and files which is very common in web development scenarios.
wwwroot/app/**/*.module.js
wwwroot/app/**/*.js
This works somewhat similar with what .gitignore files use to determine which files to exclude from source control.

I found the actual code for you:
Regex.Escape( wildcardExpression ).Replace( #"\*", ".*" ).Replace( #"\?", "." );

The 2- and 3-argument variants of the listing methods like GetFiles() and EnumerateDirectories() take a search string as their second argument that supports filename globbing, with both * and ?.
class GlobTestMain
{
static void Main(string[] args)
{
string[] exes = Directory.GetFiles(Environment.CurrentDirectory, "*.exe");
foreach (string file in exes)
{
Console.WriteLine(Path.GetFileName(file));
}
}
}
would yield
GlobTest.exe
GlobTest.vshost.exe
The docs state that there are some caveats with matching extensions. It also states that 8.3 file names are matched (which may be generated automatically behind the scenes), which can result in "duplicate" matches in given some patterns.
The methods that support this are GetFiles(), GetDirectories(), and GetFileSystemEntries(). The Enumerate variants also support this.

If you want to avoid regular expressions this is a basic glob implementation:
public static class Globber
{
public static bool Glob(this string value, string pattern)
{
int pos = 0;
while (pattern.Length != pos)
{
switch (pattern[pos])
{
case '?':
break;
case '*':
for (int i = value.Length; i >= pos; i--)
{
if (Glob(value.Substring(i), pattern.Substring(pos + 1)))
{
return true;
}
}
return false;
default:
if (value.Length == pos || char.ToUpper(pattern[pos]) != char.ToUpper(value[pos]))
{
return false;
}
break;
}
pos++;
}
return value.Length == pos;
}
}
Use it like this:
Assert.IsTrue("text.txt".Glob("*.txt"));

If you use VB.Net, you can use the Like statement, which has Glob like syntax.
http://www.getdotnetcode.com/gdncstore/free/Articles/Intoduction%20to%20the%20VB%20NET%20Like%20Operator.htm

I have written a globbing library for .NETStandard, with tests and benchmarks. My goal was to produce a library for .NET, with minimal dependencies, that doesn't use Regex, and outperforms Regex.
You can find it here:
github.com/dazinator/DotNet.Glob
https://www.nuget.org/packages/DotNet.Glob/

I wrote a FileSelector class that does selection of files based on filenames. It also selects files based on time, size, and attributes. If you just want filename globbing then you express the name in forms like "*.txt" and similar. If you want the other parameters then you specify a boolean logic statement like "name = *.xls and ctime < 2009-01-01" - implying an .xls file created before January 1st 2009. You can also select based on the negative: "name != *.xls" means all files that are not xls.
Check it out.
Open source. Liberal license.
Free to use elsewhere.

Based on previous posts, I threw together a C# class:
using System;
using System.Text.RegularExpressions;
public class FileWildcard
{
Regex mRegex;
public FileWildcard(string wildcard)
{
string pattern = string.Format("^{0}$", Regex.Escape(wildcard)
.Replace(#"\*", ".*").Replace(#"\?", "."));
mRegex = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
}
public bool IsMatch(string filenameToCompare)
{
return mRegex.IsMatch(filenameToCompare);
}
}
Using it would go something like this:
FileWildcard w = new FileWildcard("*.txt");
if (w.IsMatch("Doug.Txt"))
Console.WriteLine("We have a match");
The matching is NOT the same as the System.IO.Directory.GetFiles() method, so don't use them together.

From C# you can use .NET's LikeOperator.LikeString method. That's the backing implementation for VB's LIKE operator. It supports patterns using *, ?, #, [charlist], and [!charlist].
You can use the LikeString method from C# by adding a reference to the Microsoft.VisualBasic.dll assembly, which is included with every version of the .NET Framework. Then you invoke the LikeString method just like any other static .NET method:
using Microsoft.VisualBasic;
using Microsoft.VisualBasic.CompilerServices;
...
bool isMatch = LikeOperator.LikeString("I love .NET!", "I love *", CompareMethod.Text);
// isMatch should be true.

https://www.nuget.org/packages/Glob.cs
https://github.com/mganss/Glob.cs
A GNU Glob for .NET.
You can get rid of the package reference after installing and just compile the single Glob.cs source file.
And as it's an implementation of GNU Glob it's cross platform and cross language once you find another similar implementation enjoy!

I don't know if the .NET framework has glob matching, but couldn't you replace the * with .*? and use regexes?

Just out of curiosity I've glanced into Microsoft.Extensions.FileSystemGlobbing - and it was dragging quite huge dependencies on quite many libraries - I've decided why I cannot try to write something similar?
Well - easy to say than done, I've quickly noticed that it was not so trivial function after all - for example "*.txt" should match for files only in current directly, while "**.txt" should also harvest sub folders.
Microsoft also tests some odd matching pattern sequences like "./*.txt" - I'm not sure who actually needs "./" kind of string - since they are removed anyway while processing.
(https://github.com/aspnet/FileSystem/blob/dev/test/Microsoft.Extensions.FileSystemGlobbing.Tests/PatternMatchingTests.cs)
Anyway, I've coded my own function - and there will be two copies of it - one in svn (I might bugfix it later on) - and I'll copy one sample here as well for demo purposes. I recommend to copy paste from svn link.
SVN Link:
https://sourceforge.net/p/syncproj/code/HEAD/tree/SolutionProjectBuilder.cs#l800
(Search for matchFiles function if not jumped correctly).
And here is also local function copy:
/// <summary>
/// Matches files from folder _dir using glob file pattern.
/// In glob file pattern matching * reflects to any file or folder name, ** refers to any path (including sub-folders).
/// ? refers to any character.
///
/// There exists also 3-rd party library for performing similar matching - 'Microsoft.Extensions.FileSystemGlobbing'
/// but it was dragging a lot of dependencies, I've decided to survive without it.
/// </summary>
/// <returns>List of files matches your selection</returns>
static public String[] matchFiles( String _dir, String filePattern )
{
if (filePattern.IndexOfAny(new char[] { '*', '?' }) == -1) // Speed up matching, if no asterisk / widlcard, then it can be simply file path.
{
String path = Path.Combine(_dir, filePattern);
if (File.Exists(path))
return new String[] { filePattern };
return new String[] { };
}
String dir = Path.GetFullPath(_dir); // Make it absolute, just so we can extract relative path'es later on.
String[] pattParts = filePattern.Replace("/", "\\").Split('\\');
List<String> scanDirs = new List<string>();
scanDirs.Add(dir);
//
// By default glob pattern matching specifies "*" to any file / folder name,
// which corresponds to any character except folder separator - in regex that's "[^\\]*"
// glob matching also allow double astrisk "**" which also recurses into subfolders.
// We split here each part of match pattern and match it separately.
//
for (int iPatt = 0; iPatt < pattParts.Length; iPatt++)
{
bool bIsLast = iPatt == (pattParts.Length - 1);
bool bRecurse = false;
String regex1 = Regex.Escape(pattParts[iPatt]); // Escape special regex control characters ("*" => "\*", "." => "\.")
String pattern = Regex.Replace(regex1, #"\\\*(\\\*)?", delegate (Match m)
{
if (m.ToString().Length == 4) // "**" => "\*\*" (escaped) - we need to recurse into sub-folders.
{
bRecurse = true;
return ".*";
}
else
return #"[^\\]*";
}).Replace(#"\?", ".");
if (pattParts[iPatt] == "..") // Special kind of control, just to scan upper folder.
{
for (int i = 0; i < scanDirs.Count; i++)
scanDirs[i] = scanDirs[i] + "\\..";
continue;
}
Regex re = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
int nScanItems = scanDirs.Count;
for (int i = 0; i < nScanItems; i++)
{
String[] items;
if (!bIsLast)
items = Directory.GetDirectories(scanDirs[i], "*", (bRecurse) ? SearchOption.AllDirectories : SearchOption.TopDirectoryOnly);
else
items = Directory.GetFiles(scanDirs[i], "*", (bRecurse) ? SearchOption.AllDirectories : SearchOption.TopDirectoryOnly);
foreach (String path in items)
{
String matchSubPath = path.Substring(scanDirs[i].Length + 1);
if (re.Match(matchSubPath).Success)
scanDirs.Add(path);
}
}
scanDirs.RemoveRange(0, nScanItems); // Remove items what we have just scanned.
} //for
// Make relative and return.
return scanDirs.Select( x => x.Substring(dir.Length + 1) ).ToArray();
} //matchFiles
If you find any bugs, I'll be grad to fix them.

I wrote a solution that does it. It does not depend on any library and it does not support "!" or "[]" operators. It supports the following search patterns:
C:\Logs\*.txt
C:\Logs\**\*P1?\**\asd*.pdf
/// <summary>
/// Finds files for the given glob path. It supports ** * and ? operators. It does not support !, [] or ![] operators
/// </summary>
/// <param name="path">the path</param>
/// <returns>The files that match de glob</returns>
private ICollection<FileInfo> FindFiles(string path)
{
List<FileInfo> result = new List<FileInfo>();
//The name of the file can be any but the following chars '<','>',':','/','\','|','?','*','"'
const string folderNameCharRegExp = #"[^\<\>:/\\\|\?\*" + "\"]";
const string folderNameRegExp = folderNameCharRegExp + "+";
//We obtain the file pattern
string filePattern = Path.GetFileName(path);
List<string> pathTokens = new List<string>(Path.GetDirectoryName(path).Split('\\', '/'));
//We obtain the root path from where the rest of files will obtained
string rootPath = null;
bool containsWildcardsInDirectories = false;
for (int i = 0; i < pathTokens.Count; i++)
{
if (!pathTokens[i].Contains("*")
&& !pathTokens[i].Contains("?"))
{
if (rootPath != null)
rootPath += "\\" + pathTokens[i];
else
rootPath = pathTokens[i];
pathTokens.RemoveAt(0);
i--;
}
else
{
containsWildcardsInDirectories = true;
break;
}
}
if (Directory.Exists(rootPath))
{
//We build the regular expression that the folders should match
string regularExpression = rootPath.Replace("\\", "\\\\").Replace(":", "\\:").Replace(" ", "\\s");
foreach (string pathToken in pathTokens)
{
if (pathToken == "**")
{
regularExpression += string.Format(CultureInfo.InvariantCulture, #"(\\{0})*", folderNameRegExp);
}
else
{
regularExpression += #"\\" + pathToken.Replace("*", folderNameCharRegExp + "*").Replace(" ", "\\s").Replace("?", folderNameCharRegExp);
}
}
Regex globRegEx = new Regex(regularExpression, RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);
string[] directories = Directory.GetDirectories(rootPath, "*", containsWildcardsInDirectories ? SearchOption.AllDirectories : SearchOption.TopDirectoryOnly);
foreach (string directory in directories)
{
if (globRegEx.Matches(directory).Count > 0)
{
DirectoryInfo directoryInfo = new DirectoryInfo(directory);
result.AddRange(directoryInfo.GetFiles(filePattern));
}
}
}
return result;
}

Unfortunately the accepted answer will not handle escaped input correctly, because string .Replace("\*", ".*") fails to distinguish between "*" and "\*" - it will happily replace "*" in both of these strings, leading to incorrect results.
Instead, a basic tokenizer can be used to convert the glob path into a regex pattern, which can then be matched against a filename using Regex.Match. This is a more robust and flexible solution.
Here is a method to do this. It handles ?, *, and **, and surrounds each of these globs with a capture group, so the values of each glob can be inspected after the Regex has been matched.
static string GlobbedPathToRegex(ReadOnlySpan<char> pattern, ReadOnlySpan<char> dirSeparatorChars)
{
StringBuilder builder = new StringBuilder();
builder.Append('^');
ReadOnlySpan<char> remainder = pattern;
while (remainder.Length > 0)
{
int specialCharIndex = remainder.IndexOfAny('*', '?');
if (specialCharIndex >= 0)
{
ReadOnlySpan<char> segment = remainder.Slice(0, specialCharIndex);
if (segment.Length > 0)
{
string escapedSegment = Regex.Escape(segment.ToString());
builder.Append(escapedSegment);
}
char currentCharacter = remainder[specialCharIndex];
char nextCharacter = specialCharIndex < remainder.Length - 1 ? remainder[specialCharIndex + 1] : '\0';
switch (currentCharacter)
{
case '*':
if (nextCharacter == '*')
{
// We have a ** glob expression
// Match any character, 0 or more times.
builder.Append("(.*)");
// Skip over **
remainder = remainder.Slice(specialCharIndex + 2);
}
else
{
// We have a * glob expression
// Match any character that isn't a dirSeparatorChar, 0 or more times.
if(dirSeparatorChars.Length > 0) {
builder.Append($"([^{Regex.Escape(dirSeparatorChars.ToString())}]*)");
}
else {
builder.Append("(.*)");
}
// Skip over *
remainder = remainder.Slice(specialCharIndex + 1);
}
break;
case '?':
builder.Append("(.)"); // Regex equivalent of ?
// Skip over ?
remainder = remainder.Slice(specialCharIndex + 1);
break;
}
}
else
{
// No more special characters, append the rest of the string
string escapedSegment = Regex.Escape(remainder.ToString());
builder.Append(escapedSegment);
remainder = ReadOnlySpan<char>.Empty;
}
}
builder.Append('$');
return builder.ToString();
}
The to use it:
string testGlobPathInput = "/Hello/Test/Blah/**/test*123.fil?";
string globPathRegex = GlobbedPathToRegex(testGlobPathInput, "/"); // Could use "\\/" directory separator chars on Windows
Console.WriteLine($"Globbed path: {testGlobPathInput}");
Console.WriteLine($"Regex conversion: {globPathRegex}");
string testPath = "/Hello/Test/Blah/All/Hail/The/Hypnotoad/test_somestuff_123.file";
Console.WriteLine($"Test Path: {testPath}");
var regexGlobPathMatch = Regex.Match(testPath, globPathRegex);
Console.WriteLine($"Match: {regexGlobPathMatch.Success}");
for(int i = 0; i < regexGlobPathMatch.Groups.Count; i++) {
Console.WriteLine($"Group [{i}]: {regexGlobPathMatch.Groups[i]}");
}
Output:
Globbed path: /Hello/Test/Blah/**/test*123.fil?
Regex conversion: ^/Hello/Test/Blah/(.*)/test([^/]*)123\.fil(.)$
Test Path: /Hello/Test/Blah/All/Hail/The/Hypnotoad/test_somestuff_123.file
Match: True
Group [0]: /Hello/Test/Blah/All/Hail/The/Hypnotoad/test_somestuff_123.file
Group [1]: All/Hail/The/Hypnotoad
Group [2]: _somestuff_
Group [3]: e
I have created a gist here as a canonical version of this method:
https://gist.github.com/crozone/9a10156a37c978e098e43d800c6141ad

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Is it a drive path or another? Check with Regex - c#

Related

Does my code prevent directory traversal or is it overkill?

Sanitizing a file path in C# without compromising the drive letter

Find unbounded file paths in string

How to find folder below given section of a path?

glob pattern matching in .NET

Categories

Resources