Find unbounded file paths in string - c#

I have these error messages generated by a closed source third party software from which I need to extract file paths.
The said file paths are :
not bounded (i.e. not surrounded by quotation marks, parentheses, brackets, etc)
rooted (i.e. start with <letter>:\ such as C:\)
not guaranteed to have a file extension
representing files (only files, not directories) that are guaranteed to exist on the computer running the extraction code.
made of any valid characters, including spaces, making them hard to spot (e.g. C:\This\is a\path \but what is an existing file path here)
To be noted, there can be 0 or more file paths per message.
How can these file paths be found in the error messages?
I've suggested an answer below, but I have a feeling that there is a better way to go about this.

For each match, look forward for the next '\' character. So you might get "c:\mydir\". Check to see if that directory exists. Then find the next \, giving "c:\mydir\subdir\". Check for that path. Eventually you'll find a path that doesn't exist, or you'll get to the start of the next match.
At that point, you know what directory to look in. Then just call Directory.GetFiles and match the longest filename that matches the substring starting at the last path you found.
That should minimize backtracking.
Here's how this could be done:
using System;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

static void FindFilenamesInMessage(string message) {
    // Find every "letter colon backslash", which marks the start of a rooted path.
    var matches = Regex.Matches(message, @"[A-Za-z]:\\", RegexOptions.Compiled);
    // Go backwards. Useful if you need to replace stuff in the message
    foreach (var idx in matches.Cast<Match>().Select(m => m.Index).Reverse()) {
        int length = 3;
        var potentialPath = message.Substring(idx, length);
        var lastGoodPath = potentialPath;
        // Eat "\"-terminated segments until we get a directory that doesn't exist
        while (Directory.Exists(potentialPath)) {
            lastGoodPath = potentialPath;
            while (idx + length < message.Length && message[idx + length] != '\\')
                length++;
            if (idx + length >= message.Length)
                break; // No further backslash: the rest can only be a file name
            length++; // Include the trailing backslash
            potentialPath = message.Substring(idx, length);
        }
        // The drive itself may not exist (e.g. "Z:\"), so there is nothing to enumerate
        if (!Directory.Exists(lastGoodPath))
            continue;
        potentialPath = message.Substring(idx);
        // Iterate over the files in the directory we found, longest first, until we get a match
        foreach (var file in Directory.EnumerateFiles(lastGoodPath)
                                      .OrderByDescending(s => s.Length)) {
            if (!potentialPath.StartsWith(file, StringComparison.OrdinalIgnoreCase))
                continue;
            // 'file' contains a valid file name
            break;
        }
    }
}

This is how I would do it.
I don't think substringing the message over and over is a good idea however.
static void FindFilenamesInMessage(string message)
{
    // Find every "letter colon backslash", which marks the start of a rooted path.
    var matches = Regex.Matches(message, @"[A-Za-z]:\\", RegexOptions.Compiled);
    // A path cannot run past the start of the next match (or the end of the message).
    int end = message.Length;
    foreach (var index in matches.Cast<Match>().Select(m => m.Index).Reverse())
    {
        // Try the longest candidate first and shrink it one character at a time.
        for (int length = end - index; length > 0; length--)
        {
            var subString = message.Substring(index, length);
            if (File.Exists(subString))
            {
                // subString contains a valid file name
                ///////////////////////
                // Payload goes here
                //////////////////////
                break;
            }
        }
        // Reset the bound even when nothing was found for this match.
        end = index;
    }
}


Does my code prevent directory traversal or is it overkill?

I want to make sure this is enough to prevent directory traversal and also any suggestions or tips would be appreciated. The directory "/wwwroot/Posts/" is the only directory which is allowed.
[HttpGet("/[controller]/[action]/{name}")]
public IActionResult Post(string name)
{
if(string.IsNullOrEmpty(name))
{
return View("Post", new BlogPostViewModel(true)); //error page
}
char[] InvalidFilenameChars = Path.GetInvalidFileNameChars();
if (name.IndexOfAny(InvalidFilenameChars) >= 0)
{
return View("Post", new BlogPostViewModel(true));
}
DirectoryInfo dir = new DirectoryInfo(Path.Combine(Directory.GetCurrentDirectory(), "wwwroot/Posts"));
var userpath = Path.GetFullPath(Path.Combine(Directory.GetCurrentDirectory(), "wwwroot/Posts", name));
if (Path.GetDirectoryName(userpath) != dir.FullName)
{
return View("Post", new BlogPostViewModel(true));
}
var temp = Path.Combine(dir.FullName, name + ".html");
if (!System.IO.File.Exists(temp))
{
return View("Post", new BlogPostViewModel(true));
}
BlogPostViewModel model = new BlogPostViewModel(Directory.GetCurrentDirectory(), name);
return View("Post", model);
}
Probably, but I wouldn't consider it bulletproof. Let's break this down:
First you are black-listing known invalid characters:
char[] InvalidFilenameChars = Path.GetInvalidFileNameChars();
if (name.IndexOfAny(InvalidFilenameChars) >= 0)
{
return View("Post", new BlogPostViewModel(true));
}
This is a good first step, but blacklisting input is rarely enough. It will prevent certain control characters, but the documentation does not explicitly state that directory separators ( e.g. / and \) are included. The documentation states:
The array returned from this method is not guaranteed to contain the
complete set of characters that are invalid in file and directory
names. The full set of invalid characters can vary by file system.
Next, you attempt to make sure that after Path.Combine you have the expected parent folder for your file:
DirectoryInfo dir = new DirectoryInfo(Path.Combine(Directory.GetCurrentDirectory(), "wwwroot/Posts"));
var userpath = Path.GetFullPath(Path.Combine(Directory.GetCurrentDirectory(), "wwwroot/Posts", name));
if (Path.GetDirectoryName(userpath) != dir.FullName)
{
return View("Post", new BlogPostViewModel(true));
}
In theory, if the attacker passed in ../foo (and perhaps that gets past the blacklisting attempt above if / isn't in the list of invalid characters), then Path.Combine followed by Path.GetFullPath should resolve the path to /somerootpath/wwwroot/foo. Path.GetDirectoryName would return /somerootpath/wwwroot, which would not match, and the request would get rejected. However, suppose the result were left unresolved as /somerootpath/wwwroot/Posts/../foo. In this case Path.GetDirectoryName would return /somerootpath/wwwroot/Posts, which is a match, and the request would proceed. This seems unlikely, but because the documentation states that GetInvalidFileNameChars() is not exhaustive, there may be control characters that get past it and trick the path handling into something along these lines.
Your approach will probably work. However, if it is at all possible, I would strongly recommend you whitelist the expected input rather than attempt to blacklist all possible invalid inputs. For example, if you can be certain that all valid filenames will be made up of letters, numbers, and underscores, build a regular expression that asserts that and check before continuing. Testing for ^[A-Za-z0-9_]+$ would assert that and is far more robust than any blacklist.
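A minimal sketch of that whitelist check, assuming post names really are limited to ASCII letters, digits, and underscores. The class and method names (PostNameValidator, IsValidPostName) are hypothetical and not part of the original code. The sketch anchors with \A and \z instead of ^ and $, because in .NET $ also matches just before a trailing newline.

```csharp
using System.Text.RegularExpressions;

static class PostNameValidator
{
    // Whitelist: one or more ASCII letters, digits or underscores, and nothing else.
    // \A and \z anchor at the absolute start and end of the input.
    private static readonly Regex AllowedName =
        new Regex(@"\A[A-Za-z0-9_]+\z", RegexOptions.CultureInvariant);

    // Returns true only when the whole name consists of whitelisted characters.
    public static bool IsValidPostName(string name)
    {
        return !string.IsNullOrEmpty(name) && AllowedName.IsMatch(name);
    }
}
```

In the controller, this check would run before any Path.Combine call, so a name such as ../foo is rejected without ever touching the file system.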

Remove control characters sequence from string EOT comma ETX

I have some xml files where some control sequences are included in the text: EOT,ETX(anotherchar)
The other char following EOT comma ETX is not always present and not always the same.
Actual example:
<FatturaElettronicaHeader xmlns="">
</F<EOT>‚<ETX>èatturaElettronicaHeader>
Where <EOT> is the 04 char and <ETX> is 03. As I have to parse the xml this is actually a big issue.
Is this some kind of encoding I never heard about?
I have tried to remove all the control characters from my string but it will leave the comma that is still unwanted.
If I use Encoding.ASCII.GetString(file); the unwanted characters will be replaced with a '?' that is easy to remove but it will still leave some unwanted characters causing parse issues:
<BIC></WBIC> something like this.
string xml = Encoding.ASCII.GetString(file);
xml = new string(xml.Where(cc => !char.IsControl(cc)).ToArray());
I hence need to remove all this kind of control character sequences to be able to parse this kind of files and I'm unsure about how to programmatically check if a character is part of a control sequence or not.
I have found out that there are 2 wrong patterns in my files: the first is the one in the title and the second is EOT<.
In order to make it work I looked at this thread: Remove substring that starts with SOT and ends EOT, from string
and modified the code a little
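private static string RemoveInvalidCharacters(string input)
{
    while (true)
    {
        var start = input.IndexOf('\u0004');
        if (start == -1) break;
        if (start + 1 < input.Length && input[start + 1] == '<')
        {
            // Pattern "EOT<": remove both characters
            input = input.Remove(start, 2);
        }
        else if (start + 2 < input.Length && input[start + 2] == '\u0003')
        {
            // Pattern "EOT , ETX (char)": remove the sequence plus the following character, if there is one
            input = input.Remove(start, Math.Min(4, input.Length - start));
        }
        else
        {
            // Lone EOT: remove it so the loop always makes progress
            input = input.Remove(start, 1);
        }
    }
    return input;
}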
A further cleanup with this code:
static string StripExtended(string arg)
{
StringBuilder buffer = new StringBuilder(arg.Length); //Max length
foreach (char ch in arg)
{
UInt16 num = Convert.ToUInt16(ch);//In .NET, chars are UTF-16
//The basic characters have the same code points as ASCII, and the extended characters are bigger
if ((num >= 32u) && (num <= 126u)) buffer.Append(ch);
}
return buffer.ToString();
}
And now everything looks fine to parse.
sorry for the delay in responding,
but in my opinion the root of the problem might be an incorrect decoding of a p7m file.
I think originally the xml file you are trying to sanitize was a .xml.p7m file.
I believe the correct way to sanitize the file is by using a library such as BouncyCastle (available for Java and .NET) and the class CmsSignedData.
CmsSignedData cmsObj = new CmsSignedData(content);
if (cmsObj.SignedContent != null)
{
using (var stream = new MemoryStream())
{
cmsObj.SignedContent.Write(stream);
content = stream.ToArray();
}
}

Sanitizing a file path in C# without compromising the drive letter

I need to process some file paths in C# that potentially contain illegal characters, for example:
C:\path\something\output_at_13:26:43.txt
in that path, the :s in the timestamp make the filename invalid, and I want to replace them with another safe character.
I've searched for solutions here on SO, but they seem to be all based around something like:
path = string.Join("_", path.Split(Path.GetInvalidFileNameChars()));
or similar solutions. These solutions however are not good, because they screw up the drive letter, and I obtain an output of:
C_\path\something\output_at_13_26_43.txt
I tried using Path.GetInvalidPathChars() but it still doesn't work, because it doesn't include the : in the illegal characters, so it doesn't replace the ones in the filename.
So, after figuring that out, I tried doing this:
string dir = Path.GetDirectoryName(path);
string file = Path.GetFileName(path);
file = string.Join(replacement, file.Split(Path.GetInvalidFileNameChars()));
dir = string.Join(replacement, dir.Split(Path.GetInvalidPathChars()));
path = Path.Combine(dir, file);
but this is not good either, because the :s in the filename seem to interfere with the Path.GetFileName() logic, and it only returns the last piece after the last :, so I'm losing pieces of the path.
How do I do this "properly" without hacky solutions?
You can write a simple sanitizer that iterates each character and knows when to expect the colon as a drive separator. This one will catch any combination of letter A-Z followed directly by a ":". It will also detect path separators and not escape them. It will not detect whitespace at the beginning of the input string, so in case your input data might come with them, you will have to trim it first or modify the sanitizer accordingly:
enum ParserState {
PossibleDriveLetter,
PossibleDriveLetterSeparator,
Path
}
static string SanitizeFileName(string input) {
StringBuilder output = new StringBuilder(input.Length);
ParserState state = ParserState.PossibleDriveLetter;
foreach(char current in input) {
if (((current >= 'a') && (current <= 'z')) || ((current >= 'A') && (current <= 'Z'))) {
output.Append(current);
if (state == ParserState.PossibleDriveLetter) {
state = ParserState.PossibleDriveLetterSeparator;
}
else {
state = ParserState.Path;
}
}
else if ((current == Path.DirectorySeparatorChar) ||
(current == Path.AltDirectorySeparatorChar) ||
((current == ':') && (state == ParserState.PossibleDriveLetterSeparator)) ||
!Path.GetInvalidFileNameChars().Contains(current)) {
output.Append(current);
state = ParserState.Path;
}
else {
output.Append('_');
state = ParserState.Path;
}
}
return output.ToString();
}
You definitely should make sure that you only receive valid filenames.
If you can't, and you're certain your directory names will be, you could split the path at the last backslash (assuming Windows) and reassemble the string:
public static string SanitizePath(string path)
{
    // Split just after the last backslash so the separator stays with the directory part
    // (a path without any backslash is treated as a bare file name)
    var split = path.LastIndexOf('\\') + 1;
    var dir = path.Substring(0, split);
    var file = path.Substring(split);
    foreach (var invalid in Path.GetInvalidFileNameChars())
    {
        file = file.Replace(invalid, '_');
    }
    return dir + file;
}

How do I search a txt document for a word and write that word to a new document for reporting?

I am trying to look through a log .txt file and find all instances of an ERROR message, and then have the code paste that error line in a new document using C#, and essentially create a separate .txt file with only the error lines from the log file listed.
I understand how to search for the text in the document using C#, but what would be the best way to approach extracting those error messages (that are typically no longer than 1-2 lines each) without appending the entire rest of the document after the first error instance?
EDIT
The log file logs the events on each line, and the error lines are read as follows:
Running install update (3)
ERROR: Running install update(3) (ERROR x34293) The system was unable to update the application, check that correct version is accessible by the system.
Running install update (4)
etc.
Please let me know if this helps.
Something like:
foreach(var line in File.ReadLines(filename)
.Where(l => l.Contains("ERROR MESSAGE")))
{
// Log line
}
Additionally if you need specific information inside the line you can use a Regex to capture the information. I cannot provide a better example without more information.
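As a rough sketch of both ideas, assuming error lines always start with "ERROR:" and carry a code in the form "(ERROR x34293)" as in the sample line from the question. The class ErrorLogExtractor, its method names, and the regex are hypothetical and would need adjusting to the real log format.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

static class ErrorLogExtractor
{
    // Matches the "(ERROR x34293)" part of a line and captures the code after "ERROR".
    private static readonly Regex ErrorCode = new Regex(@"\(ERROR\s+(?<code>\w+)\)");

    // Keeps only the lines that start with "ERROR:".
    public static IEnumerable<string> ErrorLines(IEnumerable<string> lines)
    {
        return lines.Where(l => l.StartsWith("ERROR:", StringComparison.Ordinal));
    }

    // Returns the captured error code, or null when the line does not contain one.
    public static string GetErrorCode(string line)
    {
        Match m = ErrorCode.Match(line);
        return m.Success ? m.Groups["code"].Value : null;
    }

    // Streams the log line by line and writes only the error lines to the report file.
    // logPath and reportPath must be different files.
    public static void WriteReport(string logPath, string reportPath)
    {
        File.WriteAllLines(reportPath, ErrorLines(File.ReadLines(logPath)));
    }
}
```

Because File.ReadLines is lazy, only the matching lines are held in memory at any time, and nothing after the first error is appended unless it is itself an error line.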
You can search the file with a RegEx pattern, which gives you the character position of each match (Match.Index) rather than a line number. Then you can grab the portion that the RegEx returned.
Here's a block of code from my own "Find And Replace" program I wrote that uses RegEx. There are some nested methods, but you get the point...
int foundInThisFile;
string regExPattern = FindText;
System.Text.RegularExpressions.Regex regSearch = null;
if (IgnoreCase)
regSearch = new System.Text.RegularExpressions.Regex(regExPattern, System.Text.RegularExpressions.RegexOptions.IgnoreCase | System.Text.RegularExpressions.RegexOptions.Multiline);
else
regSearch = new System.Text.RegularExpressions.Regex(regExPattern, System.Text.RegularExpressions.RegexOptions.Multiline);
System.Text.RegularExpressions.MatchCollection regExMatches = regSearch.Matches(reader.ReadToEnd());
if (reader != null)
{
reader.Dispose();
reader = null;
}
found += regExMatches.Count;
TotalMatches(new CountEventArgs(found));
foundInThisFile = regExMatches.Count;
MatchesInThisFile(new CountEventArgs(foundInThisFile));
if (regExMatches.Count > 0)
{
foreach (System.Text.RegularExpressions.Match match in regExMatches)
{
// The first "group" is going to be the entire regex match, any other "group" is going to be the %1, %2 values that are returned
// Index is the character position in the entire document
if (match.Groups.Count > 1)
{
// This means the user wants to see the grouping results
string[] groupsArray = new string[match.Groups.Count - 1];
for (int counter = 1; counter < match.Groups.Count; counter++)
groupsArray[counter - 1] = match.Groups[counter].Value;
int lineNumber = 0;
string actualLine = String.Empty;
GetLineNumberAndLine(localPath, match.Groups[0].Index, out lineNumber, out actualLine);
AddToSearchResults(localPath, lineNumber, actualLine, groupsArray);
NewSearchResult(new SearchResultArgs(new FindReplaceItem(localPath, lineNumber, actualLine, ConvertGroupsArrayToString(groupsArray))));
}
else
{
int lineNumber = 0;
string actualLine = String.Empty;
GetLineNumberAndLine(localPath, match.Groups[0].Index, out lineNumber, out actualLine);
AddToSearchResults(localPath, lineNumber, actualLine);
NewSearchResult(new SearchResultArgs(new FindReplaceItem(localPath, lineNumber, actualLine)));
}
}
}

IndexOf does not correctly identify if a line starts with a value

How can I remove a whole line from a text file if the first word matches to a variable I have?
What I'm currently trying is:
List<string> lineList = File.ReadAllLines(dir + "textFile.txt").ToList();
lineList = lineList.Where(x => x.IndexOf(user) <= 0).ToList();
File.WriteAllLines(dir + "textFile.txt", lineList.ToArray());
But I can't get it to remove.
The only mistake is that you are comparing the result of IndexOf with <= 0 instead of == 0.
-1 is returned when the string does not contain the searched-for string.
<= 0 means either starts with or does not contain
== 0 means starts with <- This is the test you want, so keep only the lines where it is false (IndexOf(user) != 0)
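A small sketch of that distinction, assuming an ordinal (case-sensitive) comparison is acceptable. LineFilter and RemoveLinesStartingWith are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class LineFilter
{
    // IndexOf returns 0 when the line starts with 'user', -1 when 'user' is absent,
    // and a positive index when 'user' appears later in the line.
    // Keeping the lines where the result is not 0 removes exactly the "starts with" lines.
    public static List<string> RemoveLinesStartingWith(IEnumerable<string> lines, string user)
    {
        return lines.Where(x => x.IndexOf(user, StringComparison.Ordinal) != 0).ToList();
    }
}
```

Note that an empty user string makes IndexOf return 0 for every line, so every line would be removed; a caller may want to guard against that.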
This method will read the file line-by-line instead of all at once. Also note that this implementation is case-sensitive.
It also assumes you aren't subjected to leading spaces.
using (var writer = new StreamWriter("temp.file"))
{
    // Here I only write back what doesn't match
    foreach (var line in File.ReadLines("file").Where(x => !x.StartsWith(user)))
        writer.WriteLine(line); // ReadLines strips the line endings, so this does not double-space
}
// File.Move throws if the destination already exists, so remove the original first
File.Delete("file");
File.Move("temp.file", "file");
You were pretty close, String.StartsWith handles that nicely:
// nb: if you are case SENSITIVE remove the second argument to ll.StartsWith
File.WriteAllLines(
    path,
    File.ReadAllLines(path)
        .Where(ll => !ll.StartsWith(user, StringComparison.OrdinalIgnoreCase))); // keep the non-matching lines
For really large files that may not perform well. Instead:
// Write our new data to a temp file and read the old file On The Fly
var temp = Path.GetTempFileName();
try
{
    File.WriteAllLines(
        temp,
        File.ReadLines(path)
            .Where(
                ll => !ll.StartsWith(user, StringComparison.OrdinalIgnoreCase)));
    File.Copy(temp, path, true);
}
finally
{
    File.Delete(temp);
}
Another issue noted was that both IndexOf and StartsWith will treat ABC and ABCDEF as matches if the user is ABC:
var matcher = new Regex(
    @"^" + Regex.Escape(user) + @"\b", // <-- matches the first "word"
    RegexOptions.CaseInsensitive);
File.WriteAllLines(
    path,
    File.ReadAllLines(path)
        .Where(ll => !matcher.IsMatch(ll))); // keep the lines whose first word is not the user
Use `== 0` instead of `<= 0` to test whether a line starts with the value.
