Sanitizing a file path in C# without compromising the drive letter - c#

I need to process some file paths in C# that potentially contain illegal characters, for example:
C:\path\something\output_at_13:26:43.txt
in that path, the :s in the timestamp make the filename invalid, and I want to replace them with another safe character.
I've searched for solutions here on SO, but they seem to be all based around something like:
path = string.Join("_", path.Split(Path.GetInvalidFileNameChars()));
or similar solutions. These solutions however are not good, because they screw up the drive letter, and I obtain an output of:
C_\path\something\output_at_13_26_43.txt
I tried using Path.GetInvalidPathChars() but it still doesn't work, because it doesn't include the : in the illegal characters, so it doesn't replace the ones in the filename.
So, after figuring that out, I tried doing this:
string dir = Path.GetDirectoryName(path);
string file = Path.GetFileName(path);
file = string.Join(replacement, file.Split(Path.GetInvalidFileNameChars()));
dir = string.Join(replacement, dir.Split(Path.GetInvalidPathChars()));
path = Path.Combine(dir, file);
but this is not good either, because the :s in the filename seem to interfere with the Path.GetFileName() logic, and it only returns the piece after the last :, so I'm losing pieces of the path.
How do I do this "properly" without hacky solutions?

You can write a simple sanitizer that iterates each character and knows when to expect the colon as a drive separator. This one will catch any combination of letter A-Z followed directly by a ":". It will also detect path separators and not escape them. It will not detect whitespace at the beginning of the input string, so in case your input data might come with them, you will have to trim it first or modify the sanitizer accordingly:
enum ParserState {
PossibleDriveLetter,
PossibleDriveLetterSeparator,
Path
}
static string SanitizeFileName(string input) {
StringBuilder output = new StringBuilder(input.Length);
ParserState state = ParserState.PossibleDriveLetter;
foreach(char current in input) {
if (((current >= 'a') && (current <= 'z')) || ((current >= 'A') && (current <= 'Z'))) {
output.Append(current);
if (state == ParserState.PossibleDriveLetter) {
state = ParserState.PossibleDriveLetterSeparator;
}
else {
state = ParserState.Path;
}
}
else if ((current == Path.DirectorySeparatorChar) ||
(current == Path.AltDirectorySeparatorChar) ||
((current == ':') && (state == ParserState.PossibleDriveLetterSeparator)) ||
!Path.GetInvalidFileNameChars().Contains(current)) {
output.Append(current);
state = ParserState.Path;
}
else {
output.Append('_');
state = ParserState.Path;
}
}
return output.ToString();
}

You definitely should make sure that you only receive valid filenames.
If you can't, and you're certain your directory names will be, you could split the path at the last backslash (assuming Windows) and reassemble the string:
public static string SanitizePath(string path)
{
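// Index of the last backslash, or -1 if the path has no directory part.
var lastBackslash = path.LastIndexOf('\\');
// Keep the separator with the directory: '\' is itself an invalid file-name character and would otherwise be replaced.
var dir = path.Substring(0, lastBackslash + 1);
var file = path.Substring(lastBackslash + 1);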
foreach (var invalid in Path.GetInvalidFileNameChars())
{
file = file.Replace(invalid, '_');
}
return dir + file;
}
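Both ideas above (allowing a colon only where a drive letter is expected, and leaving separators alone) can be combined in a shorter form. This is only a minimal sketch: PathSanitizer and Sanitize are hypothetical names, and the set of invalid characters is hard-coded to the usual Windows set as an assumption, because Path.GetInvalidFileNameChars() can return a different set on non-Windows platforms.

```csharp
using System;
using System.Text;

public static class PathSanitizer
{
    // Assumption: characters Windows rejects in file names, excluding the
    // separators '\' and '/', which are handled separately below.
    private static readonly char[] InvalidChars = { '<', '>', ':', '"', '|', '?', '*' };

    public static string Sanitize(string path, char replacement = '_')
    {
        if (string.IsNullOrEmpty(path))
            return path;

        var output = new StringBuilder(path.Length);
        int start = 0;

        // Keep a leading "X:" drive specifier untouched.
        bool hasDrive = path.Length >= 2
            && ((path[0] >= 'a' && path[0] <= 'z') || (path[0] >= 'A' && path[0] <= 'Z'))
            && path[1] == ':';
        if (hasDrive)
        {
            output.Append(path, 0, 2);
            start = 2;
        }

        for (int i = start; i < path.Length; i++)
        {
            char c = path[i];
            bool isSeparator = c == '\\' || c == '/';
            bool isInvalid = c < 32 || Array.IndexOf(InvalidChars, c) >= 0;
            // Separators pass through; every other invalid character is replaced.
            output.Append(!isSeparator && isInvalid ? replacement : c);
        }
        return output.ToString();
    }
}
```

With the path from the question, PathSanitizer.Sanitize(@"C:\path\something\output_at_13:26:43.txt") keeps the drive letter and only replaces the colons in the timestamp.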

Related

Is it a drive path or another? Check with Regex

I would like to check whether it is a drive path or a "pol" path.
For this I have already written a small piece of code, but unfortunately it always returns true.
The regex \W?\w{1}:{1}[/]{1} may be incorrect. How do I do it right? The path names can always be different and do not have to match the pol path.
Thank you in advance.
public bool isPolPath(string path)
{
bool isPolPath= true;
// Pol-Path: /Buy/Toy/Special/Clue
// drive-Path: Q:\Buy/Special/Clue
Regex myRegex = new Regex(@"\W?\w{1}:{1}[/]{1}", RegexOptions.IgnoreCase);
Match matchSuccess = myRegex.Match(path);
if (matchSuccess.Success)
isPolPath= false;
return isPolPath;
}
You don't need regexes to achieve this. Use System.IO.Path.GetPathRoot. It returns X:\ (where X is the actual drive letter) if the given path contains a drive letter, and an empty string or a slash otherwise.
new List<string> {
@"/Buy/Toy/Special/Clue",
@"q:\Buy/Special/Clue",
@"Buy",
@"/",
@"\",
@"q:",
@"q:/",
@"q:\",
//@"", // This throws an exception saying path is illegal
}.ForEach(
p => Console.WriteLine(Path.GetPathRoot(p))
);
/* This code outputs:
\
q:\
(empty line)
\
\
q:
q:\
q:\
*/
Therefore your check may look like this:
isPolPath = Path.GetPathRoot(path).Length < 2;
If you wish to make your code more foolproof and protect from exception when an empty string is passed, you need to decide if an empty (or null) string is a pol-path or drive path. Depending on the decision the check would be either
isPolPath = string.IsNullOrEmpty(path) || Path.GetPathRoot(path).Length < 2;
or
if (string.IsNullOrEmpty(path))
isPolPath = false;
else
isPolPath = Path.GetPathRoot(path).Length < 2;
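Put together as a complete method, the check might look like the sketch below. PathKind is a hypothetical class name, and treating a null or empty string as "not a pol path" is an assumption (the second of the two options above). The drive-letter detection relies on Path.GetPathRoot, so it is only meaningful when the code runs on Windows.

```csharp
using System;
using System.IO;

public static class PathKind
{
    // Returns true for paths without a drive root, e.g. "/Buy/Toy/Special/Clue".
    public static bool IsPolPath(string path)
    {
        // Assumption: null or empty input is not a pol path.
        if (string.IsNullOrEmpty(path))
            return false;

        // A drive root such as "Q:\" or "Q:" has at least two characters;
        // a bare separator or an empty root has fewer.
        string root = Path.GetPathRoot(path);
        return root == null || root.Length < 2;
    }
}
```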

Regex match up to the end of a standard pattern

I'm working on an application to manage filenames of downloaded TV Shows. Basically it will search the directory and clean up the filenames, removing things like full stops and replacing them with spaces and getting rid of the descriptions at the end of the filename after the easily recognizable pattern of, for eg., S01E13. (.1080p.BluRay.x264-ROVERS)
What I want to do is to make a regex expression for use in C# to just extract whatever is before the SnnEnn including itself (where n is any whole positive integer).
But I don't know enough regex to get started.
For example, if I had the filename TV.Show.S01E01.1080p.BluRay.x264-ROVERS, the query would only get TV.Show.S01E01, irrespective of how many words are before the pattern, so it could be TV.Show.On.ABC.S01E01 and it would still work.
Thanks for any help :)
Try this
string input = "TV.Show.S01E01.1080p.BluRay.x264-ROVERS";
string pattern = @"(?'pattern'^.*\d\d[A-Z]\d\d)";
string results = Regex.Match(input, pattern).Groups["pattern"].Value;
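The pattern above accepts any capital letter between the digit pairs and does not require the leading S. A stricter variant could be sketched as follows; EpisodeName and Extract are hypothetical names, and the two-digit season and episode numbers and the case-insensitive match are assumptions taken from the SnnEnn example in the question.

```csharp
using System.Text.RegularExpressions;

public static class EpisodeName
{
    // Lazily matches everything from the start up to and including the first SnnEnn marker.
    private static readonly Regex Pattern =
        new Regex(@"^.*?S\d{2}E\d{2}", RegexOptions.IgnoreCase);

    // Returns the name including the SnnEnn marker, or "" if no marker is found.
    public static string Extract(string fileName)
    {
        Match match = Pattern.Match(fileName);
        return match.Success ? match.Value : "";
    }
}
```

Because the prefix is matched lazily, the number of words before the marker does not matter.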
There is a more obvious way without regex:
string GetNameByPattern(string s)
{
const int pattern_length = 6; // SnnEnn
for (int i = 0; i <= s.Length - pattern_length; i++)
{
string part = s.Substring(i, pattern_length);
if (part[0] == 'S' && part[3] == 'E') // candidate
if (Char.IsDigit(part[1]) && Char.IsDigit(part[2]) && Char.IsDigit(part[4]) && Char.IsDigit(part[5]))
return s.Substring(0, i + pattern_length);
}
return "";
}

Find unbounded file paths in string

I have these error messages generated by a closed source third party software from which I need to extract file paths.
The said file paths are :
not bounded (i.e. not surrounded by quotation marks, parentheses, brackets, etc)
rooted (i.e. start with <letter>:\ such as C:\)
not guaranteed to have a file extension
representing files (only files, not directories) that are guaranteed to exist on the computer running the extraction code.
made of any valid characters, including spaces, making them hard to spot (e.g. C:\This\is a\path \but what is an existing file path here)
To be noted, there can be 0 or more file paths per message.
How can these file paths be found in the error messages?
I've suggested an answer below, but I have a feeling that there is a better way to go about this.
For each match, look forward for the next '\' character. So you might get "c:\mydir\". Check to see if that directory exists. Then find the next \, giving "c:\mydir\subdir\". Check for that path. Eventually you'll find a path that doesn't exist, or you'll get to the start of the next match.
At that point, you know what directory to look in. Then just call Directory.GetFiles and match the longest filename that matches the substring starting at the last path you found.
That should minimize backtracking.
Here's how this could be done:
static void FindFilenamesInMessage(string message) {
// Find all the "letter colon backslash", indicating filenames.
var matches = Regex.Matches(message, @"\w:\\", RegexOptions.Compiled);
// Go backwards. Useful if you need to replace stuff in the message
foreach (var idx in matches.Cast<Match>().Select(m => m.Index).Reverse()) {
int length = 3;
var potentialPath = message.Substring(idx, length);
var lastGoodPath = potentialPath;
// Eat "\" until we get an invalid path
while (Directory.Exists(potentialPath)) {
lastGoodPath = potentialPath;
while (idx+length < message.Length && message[idx+length] != '\\')
length++;
length++; // Include the trailing backslash
if (idx + length >= message.Length)
length = (message.Length - idx) - 1;
potentialPath = message.Substring(idx, length);
}
potentialPath = message.Substring(idx);
// Iterate over the files in directory we found until we get a match
foreach (var file in Directory.EnumerateFiles(lastGoodPath)
.OrderByDescending(s => s.Length)) {
if (!potentialPath.StartsWith(file))
continue;
// 'file' contains a valid file name
break;
}
}
}
This is how I would do it.
I don't think substringing the message over and over is a good idea however.
static void FindFilenamesInMessage(string message)
{
// Find all the "letter colon backslash", indicating filenames.
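var matches = Regex.Matches(message, @"\w:\\", RegexOptions.Compiled);
// End of the region still to be searched; it moves left as matches are processed.
int end = message.Length;
foreach (var index in matches.Cast<Match>().Select(m => m.Index).Reverse())
{
// Try the longest candidate first and shrink it one character at a time.
for (int length = end - index; length > 0; length--)
{
var subString = message.Substring(index, length);
if (File.Exists(subString))
{
// subString contains a valid file name
///////////////////////
// Payload goes here
//////////////////////
break;
}
}
// Earlier matches must not run into this one, whether or not a file was found.
end = index;
}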
}
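The same longest-candidate-first search can be written so that it returns the paths and can be exercised without touching the disk. This is a sketch under assumptions: PathFinder and FindFiles are hypothetical names, and the fileExists delegate is an illustrative parameter (pass File.Exists in real use).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PathFinder
{
    // Returns the file paths found in the message, in order of appearance.
    public static List<string> FindFiles(string message, Func<string, bool> fileExists)
    {
        var found = new List<string>();
        MatchCollection roots = Regex.Matches(message, @"[A-Za-z]:\\");
        int end = message.Length;

        // Walk the roots from last to first so each search is bounded by the next root.
        for (int m = roots.Count - 1; m >= 0; m--)
        {
            int start = roots[m].Index;

            // Longest candidate first; stop before the bare root ("C:\").
            for (int length = end - start; length > 3; length--)
            {
                string candidate = message.Substring(start, length);
                if (fileExists(candidate))
                {
                    found.Add(candidate);
                    break;
                }
            }
            end = start;
        }

        found.Reverse();
        return found;
    }
}
```

Taking the existence check as a parameter keeps the search logic separate from the file system, which makes the edge cases (paths with spaces, several paths per message) easier to check.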

Why does Path.Combine not properly concatenate filenames that start with Path.DirectorySeparatorChar?

From the Immediate Window in Visual Studio:
> Path.Combine(@"C:\x", "y")
"C:\\x\\y"
> Path.Combine(@"C:\x", @"\y")
"\\y"
It seems that they should both be the same.
The old FileSystemObject.BuildPath() didn't work this way...
This is kind of a philosophical question (which perhaps only Microsoft can truly answer), since it's doing exactly what the documentation says.
System.IO.Path.Combine
"If path2 contains an absolute path, this method returns path2."
Here's the actual Combine method from the .NET source. You can see that it calls CombineNoChecks, which then calls IsPathRooted on path2 and returns that path if so:
public static String Combine(String path1, String path2) {
if (path1==null || path2==null)
throw new ArgumentNullException((path1==null) ? "path1" : "path2");
Contract.EndContractBlock();
CheckInvalidPathChars(path1);
CheckInvalidPathChars(path2);
return CombineNoChecks(path1, path2);
}
internal static string CombineNoChecks(string path1, string path2)
{
if (path2.Length == 0)
return path1;
if (path1.Length == 0)
return path2;
if (IsPathRooted(path2))
return path2;
char ch = path1[path1.Length - 1];
if (ch != DirectorySeparatorChar && ch != AltDirectorySeparatorChar &&
ch != VolumeSeparatorChar)
return path1 + DirectorySeparatorCharAsString + path2;
return path1 + path2;
}
I don't know what the rationale is. I guess the solution is to strip off (or Trim) DirectorySeparatorChar from the beginning of the second path; maybe write your own Combine method that does that and then calls Path.Combine().
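Such a wrapper could be sketched as below. PathUtil and CombineRelative are hypothetical names. The sketch only strips leading separator characters, so a second path that carries its own drive letter (for example "D:\y") is still returned unchanged by Path.Combine.

```csharp
using System;
using System.IO;

public static class PathUtil
{
    // Treats 'relative' as relative to 'basePath' even if it starts with a separator.
    public static string CombineRelative(string basePath, string relative)
    {
        if (basePath == null) throw new ArgumentNullException(nameof(basePath));
        if (relative == null) throw new ArgumentNullException(nameof(relative));

        // Remove leading separators so Path.Combine no longer sees a rooted path.
        string trimmed = relative.TrimStart(
            Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar);
        return Path.Combine(basePath, trimmed);
    }
}
```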
I wanted to solve this problem:
string sample1 = "configuration/config.xml";
string sample2 = "/configuration/config.xml";
string sample3 = "\\configuration/config.xml";
string dir1 = "c:\\temp";
string dir2 = "c:\\temp\\";
string dir3 = "c:\\temp/";
string path1 = PathCombine(dir1, sample1);
string path2 = PathCombine(dir1, sample2);
string path3 = PathCombine(dir1, sample3);
string path4 = PathCombine(dir2, sample1);
string path5 = PathCombine(dir2, sample2);
string path6 = PathCombine(dir2, sample3);
string path7 = PathCombine(dir3, sample1);
string path8 = PathCombine(dir3, sample2);
string path9 = PathCombine(dir3, sample3);
Of course, all paths 1-9 should contain an equivalent string in the end. Here is the PathCombine method I came up with:
private string PathCombine(string path1, string path2)
{
if (Path.IsPathRooted(path2))
{
path2 = path2.TrimStart(Path.DirectorySeparatorChar);
path2 = path2.TrimStart(Path.AltDirectorySeparatorChar);
}
return Path.Combine(path1, path2);
}
I also think that it is quite annoying that this string handling has to be done manually, and I'd be interested in the reason behind this.
This is the disassembled code from .NET Reflector for Path.Combine method. Check IsPathRooted function. If the second path is rooted (starts with a DirectorySeparatorChar), return second path as it is.
public static string Combine(string path1, string path2)
{
if ((path1 == null) || (path2 == null))
{
throw new ArgumentNullException((path1 == null) ? "path1" : "path2");
}
CheckInvalidPathChars(path1);
CheckInvalidPathChars(path2);
if (path2.Length == 0)
{
return path1;
}
if (path1.Length == 0)
{
return path2;
}
if (IsPathRooted(path2))
{
return path2;
}
char ch = path1[path1.Length - 1];
if (((ch != DirectorySeparatorChar) &&
(ch != AltDirectorySeparatorChar)) &&
(ch != VolumeSeparatorChar))
{
return (path1 + DirectorySeparatorChar + path2);
}
return (path1 + path2);
}
public static bool IsPathRooted(string path)
{
if (path != null)
{
CheckInvalidPathChars(path);
int length = path.Length;
if (
(
(length >= 1) &&
(
(path[0] == DirectorySeparatorChar) ||
(path[0] == AltDirectorySeparatorChar)
)
)
||
((length >= 2) &&
(path[1] == VolumeSeparatorChar))
)
{
return true;
}
}
return false;
}
In my opinion this is a bug. The problem is that there are two different types of "absolute" paths. The path "d:\mydir\myfile.txt" is absolute, the path "\mydir\myfile.txt" is also considered to be "absolute" even though it is missing the drive letter. The correct behavior, in my opinion, would be to prepend the drive letter from the first path when the second path starts with the directory separator (and is not a UNC path). I would recommend writing your own helper wrapper function which has the behavior you desire if you need it.
Following Christian Graus' advice in his "Things I Hate about Microsoft" blog titled "Path.Combine is essentially useless.", here is my solution:
public static class Pathy
{
public static string Combine(string path1, string path2)
{
if (path1 == null) return path2;
else if (path2 == null) return path1;
else return path1.Trim().TrimEnd(System.IO.Path.DirectorySeparatorChar)
+ System.IO.Path.DirectorySeparatorChar
+ path2.Trim().TrimStart(System.IO.Path.DirectorySeparatorChar);
}
public static string Combine(string path1, string path2, string path3)
{
return Combine(Combine(path1, path2), path3);
}
}
Some advise that the namespaces should collide, ... I went with Pathy, as a slight, and to avoid namespace collision with System.IO.Path.
Edit: Added null parameter checks
From MSDN:
If one of the specified paths is a zero-length string, this method returns the other path. If path2 contains an absolute path, this method returns path2.
In your example, path2 is absolute.
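The documented rule can be observed directly by asking Path.IsPathRooted about the second argument before combining. This is a small illustrative sketch; CombineDemo and Explain are hypothetical names.

```csharp
using System.IO;

public static class CombineDemo
{
    // Reports whether path2 counts as rooted and what Path.Combine does with it.
    public static string Explain(string path1, string path2)
    {
        bool rooted = Path.IsPathRooted(path2);
        string combined = Path.Combine(path1, path2);
        return $"rooted={rooted}; result={combined}";
    }
}
```

For a second path with a leading separator, the result is the second path alone, matching the quoted documentation.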
This code should do the trick:
string strFinalPath = string.Empty;
string normalizedFirstPath = Path1.TrimEnd(new char[] { '\\' });
string normalizedSecondPath = Path2.TrimStart(new char[] { '\\' });
strFinalPath = Path.Combine(normalizedFirstPath, normalizedSecondPath);
return strFinalPath;
Reason:
Your second path is considered an absolute path, and the Combine method returns only the last path if it is absolute.
Solution:
Just remove the leading slash / from your second path (/SecondPath to SecondPath), and it will work as expected.
Not knowing the actual details, my guess is that it makes an attempt to join like you might join relative URIs. For example:
urljoin('/some/abs/path', '../other') = '/some/abs/other'
This means that when you join a path with a preceding slash, you are actually joining one base to another, in which case the second gets precedence.
This actually makes sense, in some way, considering how (relative) paths are treated usually:
string GetFullPath(string path)
{
string baseDir = @"C:\Users\Foo.Bar";
return Path.Combine(baseDir, path);
}
// Get full path for RELATIVE file path
GetFullPath("file.txt"); // = C:\Users\Foo.Bar\file.txt
// Get full path for ROOTED file path
GetFullPath(@"C:\Temp\file.txt"); // = C:\Temp\file.txt
The real question is: Why are paths, which start with "\", considered "rooted"? This was new to me too, but it works that way on Windows:
new FileInfo(@"\windows"); // FullName = C:\Windows, Exists = True
new FileInfo("windows"); // FullName = C:\Users\Foo.Bar\Windows, Exists = False
I used the Aggregate function to force the paths to combine, as below:
public class MyPath
{
public static string ForceCombine(params string[] paths)
{
return paths.Aggregate((x, y) => Path.Combine(x, y.TrimStart('\\')));
}
}
If you want to combine both paths without losing any path you can use this:
Path.Combine(@"C:\test", @"\test".Substring(0, 1) == @"\" ? @"\test".Substring(1, @"\test".Length - 1) : @"\test");
Or with variables:
string Path1 = @"C:\Test";
string Path2 = @"\test";
string FullPath = Path.Combine(Path1, Path2.StartsWith(@"\") ? Path2.Substring(1) : Path2);
Both cases return "C:\test\test".
First, I check whether Path2 starts with \ and, if it does, pass Path2 without its first character. Otherwise, I pass the full Path2.
Remove the starting slash ('\') in the second parameter (path2) of Path.Combine.
These two methods should save you from accidentally joining two strings that both have the delimiter in them.
public static string Combine(string x, string y, char delimiter) {
return $"{ x.TrimEnd(delimiter) }{ delimiter }{ y.TrimStart(delimiter) }";
}
public static string Combine(string[] xs, char delimiter) {
if (xs.Length < 1) return string.Empty;
if (xs.Length == 1) return xs[0];
var x = Combine(xs[0], xs[1], delimiter);
if (xs.Length == 2) return x;
var ys = new List<string>();
ys.Add(x);
ys.AddRange(xs.Skip(2).ToList());
return Combine(ys.ToArray(), delimiter);
}
This \ means "the root directory of the current drive". In your example it means the "test" folder in the current drive's root directory. So, this can be equal to "c:\test".
As mentioned by Ryan, it's doing exactly what the documentation says.
From DOS times, current disk, and current path are distinguished.
\ is the root path, but for the CURRENT DISK.
For every "disk" there is a separate "current path".
If you switch disks by typing D:, the current path does not become D:\, but: "D:\whatever\was\the\last\path\accessed\on\this\disk"...
So, in Windows, a literal @"\x" means: "CURRENTDISK:\x".
Hence Path.Combine(@"C:\x", @"\y") has as its second parameter a rooted path, not a relative one, though not on a known disk...
And since it is not known which might be the «current disk», Path.Combine returns "\\y".
>C:
>cd \mydironC\apath
>D:
>cd \mydironD\bpath
>C:
>cd
C:\mydironC\apath

Looking for Regex to find quoted newlines in a big string (for C#)

I have a big string (let's call it a CSV file, though it isn't actually one, it'll just be easier for now) that I have to parse in C# code.
The first step of the parsing process splits the file into individual lines by just using a StreamReader object and calling ReadLine until it's through the file. However, any given line might contain a quoted (in single quotes) literal with embedded newlines. I need to find those newlines and convert them temporarily into some other kind of token or escape sequence until I've split the file into an array of lines..then I can change them back.
Example input data:
1,2,10,99,'Some text without a newline', true, false, 90
2,1,11,98,'This text has an embedded newline
and continues here', true, true, 90
I could write all of the C# code needed to do this by using string.IndexOf to find the quoted sections and look within them for newlines, but I'm thinking a Regex might be a better choice (i.e. now I have two problems)
Since this isn't a true CSV file, does it have any sort of schema?
From your example, it looks like you have:
int, int, int, int, string , bool, bool, int
With that making up your record / object.
Assuming that your data is well formed (I don't know enough about your source to know how valid this assumption is); you could:
Read your line.
Use a state machine to parse your data.
If your line ends, and you're parsing a string, read the next line..and keep parsing.
I'd avoid using a regex if possible.
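For the specific step asked about (hiding newlines inside single-quoted literals before splitting into lines), a small state machine might look like the sketch below. QuotedNewlines, Escape and the "<NL>" token are hypothetical, and the sketch assumes that literals never contain an escaped single quote.

```csharp
using System.Text;

public static class QuotedNewlines
{
    // Replaces every line break inside a single-quoted literal with 'token'.
    public static string Escape(string text, string token = "<NL>")
    {
        var output = new StringBuilder(text.Length);
        bool insideQuotes = false;

        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c == '\'')
            {
                // Assumption: every quote toggles the state (no escaped quotes).
                insideQuotes = !insideQuotes;
                output.Append(c);
            }
            else if (insideQuotes && c == '\r' && i + 1 < text.Length && text[i + 1] == '\n')
            {
                output.Append(token);
                i++; // skip the LF of a CRLF pair
            }
            else if (insideQuotes && (c == '\n' || c == '\r'))
            {
                output.Append(token);
            }
            else
            {
                output.Append(c);
            }
        }
        return output.ToString();
    }
}
```

After splitting the escaped text into lines, each field can have the token replaced with a real newline again.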
State machines for doing such a job are made easy using C# 2.0 iterators. Here's hopefully the last CSV parser I'll ever write. The whole file is treated as an enumerable bunch of enumerable strings, i.e. rows/columns. IEnumerable is great because it can then be processed by LINQ operators.
public class CsvParser
{
public char FieldDelimiter { get; set; }
public CsvParser()
: this(',')
{
}
public CsvParser(char fieldDelimiter)
{
FieldDelimiter = fieldDelimiter;
}
public IEnumerable<IEnumerable<string>> Parse(string text)
{
return Parse(new StringReader(text));
}
public IEnumerable<IEnumerable<string>> Parse(TextReader reader)
{
while (reader.Peek() != -1)
yield return parseLine(reader);
}
IEnumerable<string> parseLine(TextReader reader)
{
bool insideQuotes = false;
StringBuilder item = new StringBuilder();
while (reader.Peek() != -1)
{
char ch = (char)reader.Read();
char? nextCh = reader.Peek() > -1 ? (char)reader.Peek() : (char?)null;
if (!insideQuotes && ch == FieldDelimiter)
{
yield return item.ToString();
item.Length = 0;
}
else if (!insideQuotes && ch == '\r' && nextCh == '\n') //CRLF
{
reader.Read(); // skip LF
break;
}
else if (!insideQuotes && ch == '\n') //LF for *nix-style line endings
break;
else if (ch == '"' && nextCh == '"') // escaped quotes ""
{
item.Append('"');
reader.Read(); // skip next "
}
else if (ch == '"')
insideQuotes = !insideQuotes;
else
item.Append(ch);
}
// last one
yield return item.ToString();
}
}
Note that the file is read character by character with the code deciding when newlines are to be treated as row delimiters or part of a quoted string.
What if you got the whole file into a variable then split that based on non-quoted newlines?
EDIT: Sorry, I've misinterpreted your post. If you're looking for a regex, then here is one:
content = Regex.Replace(content, "'([^']*)\n([^']*)'", "'$1TOKEN$2'");
There might be edge cases, and there is the "now you have two problems" issue, but I think it should be OK most of the time. The regex finds any pair of single quotes with a \n between them and replaces that \n with TOKEN, preserving the text on either side.
But still, I'd go with a state machine, as @bryansh explained below.
