System.IO.Directory search pattern not working as expected - c#

I am attempting to retrieve jpeg and jpg files using the following statement:
string[] files = Directory.GetFiles(someDirectoryPath, "*.jp?g");
MSDN's docs for System.IO.Directory.GetFiles(string, string) state that ? represents "Exactly zero or one character.", however the above block selects jpeg files but omits jpg files.
I am currently using the overly-permissive search pattern "*.jp*g" to achieve my results, but it wrinkles my brain because it should work.

From the docs you linked to:
A searchPattern with a file extension of one, two, or more than three characters returns only files having extensions of exactly that length that match the file extension specified in the searchPattern.
I suspect that's the problem. To be honest, I'd probably fetch all the files and then postprocess them in code - it'll make for code which is simpler to reason about than relying on the Windows path-handling oddities.
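A minimal sketch of that "fetch everything, then filter in code" idea. The class and method names (ExtensionFilter, WithExtensions, GetJpegFiles) are made up for illustration, and comparing extensions case-insensitively is an assumption about what you want:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class ExtensionFilter
{
    // Hypothetical helper: keeps only paths whose extension is exactly
    // one of the given extensions (compared case-insensitively).
    public static IEnumerable<string> WithExtensions(
        IEnumerable<string> paths, params string[] extensions)
    {
        var allowed = new HashSet<string>(extensions, StringComparer.OrdinalIgnoreCase);
        // Path.GetExtension returns the extension including the leading dot,
        // or an empty string when the name has no extension.
        return paths.Where(p => allowed.Contains(Path.GetExtension(p)));
    }

    // Enumerate every file in the directory and filter afterwards,
    // so the result does not depend on how the search pattern is matched.
    public static string[] GetJpegFiles(string directory)
    {
        return WithExtensions(Directory.EnumerateFiles(directory), ".jpg", ".jpeg")
            .ToArray();
    }
}
```

Calling ExtensionFilter.GetJpegFiles(someDirectoryPath) would then stand in for the original GetFiles call. The filtering itself is a plain function over strings, so it can be checked without touching the disk.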

You could either use "*" as a pattern and process the result yourself OR use
string[] files = Directory.GetFiles(someDirectoryPath, "*.jpg").Union(Directory.GetFiles(someDirectoryPath, "*.jpeg")).ToArray();
According to the Docs the pattern you use would return only files with extensions which are 4 characters long.
MSDN reference on Union

Related

Regular Expression For File Name in C#

I am trying to find a regular expression to parse two sections out of the file name for the .resx files in my project. There is one main file called "UiText.resx" and then many translation .resx files with convention "UiText.ja-JP.resx". I need both the "UiText" and the "ja-JP" out of the latter string, as we do have other resx files that don't have to be for UiText (e.g. I have some files named "ExceptionText.resx").
The pattern I'm using right now (which works, it just requires a little extra coding after) is "(?<=\.)((.*?)(?=\.resx))". For the example above, "UiText.ja-JP.resx" gets me a match set in C# of "UiText.", "ja-JP.", "ja-JP.", ".resx"
Of course I am able to just take the first occurrence of "ja-JP." and "UiText." from this set and massage it to what I want, but I'd rather just have a cleaner "UiText" "ja-JP" and be done with it.
I figure I'll probably have to have at least two different patterns for this, so that is OK. Thank you in advance!
Since UiText seems to be constant, you can use this regex to extract just ja-JP into $1:
^UiText\.(.+?)\.resx$
https://regex101.com/r/XKvwHA/1/
If I'm understanding your needs correctly, then the main reason you need "UiText" is not because you have any value for the term itself, but rather because you need to filter your files. The real term you need to play around with is "ja-JP", which changes for the files you need.
If I'm correct, try this regex:
(?<=UiText\.).+(?=\.resx)
Used in C# as follows:
var fileName = "UiText.ja-JP.resx";
var result = new Regex(@"(?<=^UiText\.).+(?=\.resx$)").Match(fileName).Value;
A little explanation:
(?<=^UiText\.) Start of string must begin exactly with "UiText."
.+ Any number of characters (but at least one)
(?=\.resx$) End of string must end with ".resx"
Any file that doesn't meet your criteria will return an empty string for 'result'.
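If you do want both parts in one pass, named groups are one option. This is only a sketch: the ResxName class and TryParse method are hypothetical names, and it assumes that neither the base name nor the culture code contains a dot:

```csharp
using System.Text.RegularExpressions;

public static class ResxName
{
    // Assumed convention: "<base>.<culture>.resx", where neither part contains a dot.
    private static readonly Regex Pattern =
        new Regex(@"^(?<baseName>[^.]+)\.(?<culture>[^.]+)\.resx$", RegexOptions.IgnoreCase);

    // Returns true and fills both parts for names such as "UiText.ja-JP.resx".
    // Returns false (with empty strings) for names without a culture part,
    // such as "UiText.resx".
    public static bool TryParse(string fileName, out string baseName, out string culture)
    {
        Match m = Pattern.Match(fileName);
        baseName = m.Success ? m.Groups["baseName"].Value : string.Empty;
        culture = m.Success ? m.Groups["culture"].Value : string.Empty;
        return m.Success;
    }
}
```

For "UiText.ja-JP.resx" this yields "UiText" and "ja-JP". You can then compare the base name against "UiText" in code to skip files such as "ExceptionText.ja-JP.resx".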

How to filter Directory.EnumerateFiles with specific extension

I want a list of all xml files in a folder like this:
foreach (var file in Directory.EnumerateFiles(folderPath, "*.xml"))
{
// add file to a collection
}
However, if for some reason I have any files in folderPath that end with .xmlXXX, where XXX represents any characters, then they will be part of the enumerator.
I can solve it easily by doing something like
foreach (var file in Directory.EnumerateFiles(folderPath, "*.xml").Where(x => x.EndsWith(".xml")))
But it seems a bit odd to me, as I basically have to search for the same thing two times. Is there any way to get the right files directly or am I doing something wrong?
This is the documented/default behaviour of wildcard usage with file searching.
Directory.EnumerateFiles Method (String, String)
If the specified extension is exactly three characters long, the
method returns files with extensions that begin with the specified
extension. For example, "*.xls" returns both "book.xls" and
"book.xlsx".
Your current approach of filtering twice is the right way.
The only improvement you can make is to ignore case in EndsWith, like:
x.EndsWith(".xml", StringComparison.CurrentCultureIgnoreCase)
It seems like you can't do it using EnumerateFiles for a three-character extension, according to MSDN.
Quote from the article above
When you use the asterisk wildcard character in a searchPattern such as "*.txt", the number of characters in the specified extension affects the search as follows:
If the specified extension is exactly three characters long, the method returns files with extensions that begin with the specified extension. For example, "*.xls" returns both "book.xls" and "book.xlsx".
In all other cases, the method returns files that exactly match the specified extension. For example, "*.ai" returns "file.ai" but not "file.aif".
When you use the question mark wildcard character, this method returns only files that match the specified file extension. For example, given two files, "file1.txt" and "file1.txtother", in a directory, a search pattern of "file?.txt" returns just the first file, whereas a search pattern of "file*.txt" returns both files.
Therefore using the .Where extension seems like the best solution to your problem
Yes, and this design is stupid, stupid, stupid! It shouldn't do that. And it's annoying too!
That said, it appears this is what is happening: It actually searches both the long and short filenames. So files with longer extensions will have a short filename with the extension truncated to three characters.
And on newer versions of Windows, the short filenames may be disabled. So the behavior on newer systems will actually be what you would expect, and what it should've been in the first place.

How can I make GetFiles() exclude files with extensions that start with the search extension?

I am using the following line to return specific files...
FileInfo file in nodeDirInfo.GetFiles("*.sbs", option)
But there are other files in the directory with the extension .sbsar, and it is getting them, too. How can I differentiate between .sbs and .sbsar in the search pattern?
The issue you're experiencing is a limitation of the search pattern, in the Win32 API.
A searchPattern with a file extension (for example *.txt) of exactly
three characters returns files having an extension of three or more
characters, where the first three characters match the file extension
specified in the searchPattern.
My solution is to manually filter the results, using Linq:
nodeDirInfo.GetFiles("*.sbs", option).Where(f => f.Name.EndsWith(".sbs",
StringComparison.InvariantCultureIgnoreCase));
Try this, filtered using file extension.
FileInfo[] files = nodeDirInfo.GetFiles("*", SearchOption.TopDirectoryOnly).
Where(f=>f.Extension==".sbs").ToArray<FileInfo>();
That's the behaviour of the Win32 API (FindFirstFile) that is underneath GetFiles() being reflected on to you.
You'll need to do your own filtering if you must use GetFiles(). For instance:
GetFiles("*", searchOption).Where(s => s.EndsWith(".sbs",
StringComparison.InvariantCultureIgnoreCase));
Or more efficiently:
EnumerateFiles("*", searchOption).Where(s => s.EndsWith(".sbs",
StringComparison.InvariantCultureIgnoreCase));
Note that I use StringComparison.InvariantCultureIgnoreCase to deal with the fact that Windows file names are case-insensitive.
If performance is an issue, that is if the search has to process directories with large numbers of files, then it is more efficient to perform the filtering twice: once in the call to GetFiles or EnumerateFiles, and once to clean up the unwanted file names. For example:
GetFiles("*.sbs", searchOption).Where(s => s.EndsWith(".sbs",
StringComparison.InvariantCultureIgnoreCase));
EnumerateFiles("*.sbs", searchOption).Where(s => s.EndsWith(".sbs",
StringComparison.InvariantCultureIgnoreCase));
It's mentioned in the docs:
When using the asterisk wildcard character in a searchPattern, a
searchPattern with a file extension of exactly three characters
returns files having an extension of three or more characters. When
using the question mark wildcard character, this method returns only
files that match the specified file extension.

Get files of certain extension c#

I wish to get a list of all the files of a certain extension (recursive), but only the files ending with that extension.
For example, I wish to get all the files with the ".exe" extension, If I have the following files:
file1.exe , file2.txt.exe , file3.exe.txt , file4.txt.exe1 , file5.txt
I expect to get a list of 1 file, which is: file1.exe.
I'm trying to use the following line:
List<string> theList = Directory.GetFiles(@"C:\SearchDir", "*.exe", SearchOption.AllDirectories).ToList();
But what I get is a list of the following three files: file1.exe , file2.txt.exe , file4.txt.exe1
Any ideas?
Try this:
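var exeFiles = Directory.EnumerateFiles(sourceDirectory,
"*", SearchOption.AllDirectories)
// Count dots in the file name only, so dots in directory names are ignored;
// exactly one dot means a single extension (file1.exe, not file2.txt.exe).
.Where(s => s.EndsWith(".exe") && Path.GetFileName(s).Count(c => c == '.') == 1)
.ToList();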
This is a common issue. Take note of the MSDN documentation:
When using the asterisk wildcard character in a searchPattern, such as "*.txt", the matching behavior when the extension is exactly three characters long is different than when the extension is more or less than three characters long. A searchPattern with a file extension of exactly three characters returns files having an extension of three or more characters, where the first three characters match the file extension specified in the searchPattern.
You can't solve it by searching for the .exe extension; you'll need to filter your results one more time in the client code.
Now, one thing to note also is this. The following examples would in fact be considered executable files:
file1.exe
file2.txt.exe
whereas this one wouldn't technically be considered an executable file.
file4.txt.exe1
So the question then becomes, what algorithm do you want? It appears to me you want the following:
Files that have an extension of exe.
Files that don't have multiple extensions.
Have a look at Ahmed's answer for a fantastic approach to getting the algorithm you want.
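The two rules above can be sketched as a small predicate. SingleExtensionFilter and HasOnlyExtension are hypothetical names, and the case-insensitive comparison is an assumption based on Windows file names being case-insensitive:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class SingleExtensionFilter
{
    // Hypothetical helper: true when the file name has exactly the given
    // extension and no other dot in the rest of the name.
    public static bool HasOnlyExtension(string path, string extension)
    {
        string name = Path.GetFileName(path);
        bool extensionMatches = string.Equals(
            Path.GetExtension(name), extension, StringComparison.OrdinalIgnoreCase);
        bool singleExtension = Path.GetFileNameWithoutExtension(name).IndexOf('.') < 0;
        return extensionMatches && singleExtension;
    }

    // Enumerate everything recursively, then apply the predicate in code.
    public static List<string> FindExecutables(string root)
    {
        return Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories)
            .Where(p => HasOnlyExtension(p, ".exe"))
            .ToList();
    }
}
```

Of the five sample names in the question, only file1.exe satisfies HasOnlyExtension(name, ".exe").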

C#: Using Directory.GetFiles to get files with fixed length

The directory 'C:\temp' has two files named 'GZ96A7005.tif' and 'GZ96A7005001.tif'. They have different length with the same extension. Now I run below code:
string[] resultFileNames = Directory.GetFiles(@"C:\temp", "????????????.tif");
'resultFileNames' contains two items: 'c:\temp\GZ96A7005.tif' and 'c:\temp\GZ96A7005001.tif'.
But Windows Search returns only the file I expect. Why is this, and how do I get what I want?
For Directory.GetFiles, ? signifies "Exactly zero or one character." On the other hand, you could use DirectoryInfo.GetFiles, for which ? signifies "Exactly one character" (apparently what you want).
EDIT:
Full code:
string[] resultFileNames = (from fileInfo in new DirectoryInfo(@"C:\temp").GetFiles("????????????.tif") select fileInfo.Name).ToArray();
You can probably skip the ToArray and just let resultFileNames be an IEnumerable<string>.
People are reporting this doesn't work for them on MS .NET. The exact code below works for me on Mono on Ubuntu Hardy. I agree it doesn't really make sense to have two related classes use different conventions. However, that is what the documentation (linked above) says, and Mono complies with the docs. If Microsoft's implementation doesn't, they have a bug:
using System;
using System.IO;
using System.Linq;
public class GetFiles
{
    public static void Main()
    {
        string[] resultFileNames = (from fileInfo in new DirectoryInfo(@".").GetFiles("????????????.tif") select fileInfo.Name).ToArray();
        foreach (string fileName in resultFileNames)
        {
            Console.WriteLine(fileName);
        }
    }
}
I know I've read about this somewhere before, but the best I could find right now was this reference to it in Raymond Chen's blog post. The point is that Windows keeps a short (8.3) filename for every file with a long filename, for backward compatibility, and filename wildcards are matched against both the long and short filenames. You can see these short filenames by opening a command prompt and running "dir /x". Normally, getting a list of files which match ????????.tif (8) returns a list of files with 8 or fewer characters in their filename and a .tif extension. But every file with a long filename also has a short filename with 8.3 characters, so they all match this filter.
In your case both GZ96A7005.tif and GZ96A7005001.tif are long filenames, so they both have a 8.3 short filename which matches ????????.tif (anything with 8 or more ?'s).
UPDATE... from MSDN:
Because this method checks against
file names with both the 8.3 file name
format and the long file name format,
a search pattern similar to "*1*.txt"
may return unexpected file names. For
example, using a search pattern of
"*1*.txt" returns "longfilename.txt"
because the equivalent 8.3 file name
format is "LONGFI~1.TXT".
UPDATE: The MSDN docs specify different behavior for the "?" wildcard in Directory.GetFiles() and DirectoryInfo.GetFiles(). The documentation seems to be wrong, however. See Matthew Flaschen's answer.
The ? character matches "zero or one" characters... so from what you have I would imagine that your search pattern will match any file ending in ".tif" that is between zero and twelve characters long.
Try dropping another file in that is only three characters long with a ".tif" extension and see if the code picks that up as well. I have a sneaking suspicion that it will ;)
As far as Windows Search is concerned, it is most definitely not using the same algorithm under the hood. The ? character might have a very different meaning there than it does in the .NET search pattern specification for the Directory.GetFiles(string, string) method.
string path = "C:/";
var files = Directory.GetFiles(path)
.Where(f => f.Replace(path, "").Length == 8);
A little costly with the string replacement. You can add whatever extension you need.
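Another way to get "? means exactly one character" is to do the wildcard matching yourself against the long file name only. The sketch below translates the pattern into a regular expression. StrictWildcard, ToRegex and Filter are hypothetical names, and the case-insensitive match is an assumption:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public static class StrictWildcard
{
    // Translates a wildcard pattern into an anchored regex in which
    // '?' means exactly one character and '*' means any run of characters.
    public static Regex ToRegex(string pattern)
    {
        // Regex.Escape turns '*' into "\*" and '?' into "\?", so those
        // escaped forms can be swapped for their regex equivalents.
        string body = Regex.Escape(pattern)
            .Replace(@"\*", ".*")
            .Replace(@"\?", ".");
        return new Regex("^" + body + "$", RegexOptions.IgnoreCase);
    }

    // Keeps only the paths whose file name (long name only) matches the pattern.
    public static IEnumerable<string> Filter(IEnumerable<string> paths, string pattern)
    {
        Regex regex = ToRegex(pattern);
        return paths.Where(p => regex.IsMatch(Path.GetFileName(p)));
    }
}
```

With this, StrictWildcard.Filter(Directory.EnumerateFiles(path), "????????????.tif") keeps GZ96A7005001.tif and drops GZ96A7005.tif, because 8.3 short names never enter the comparison.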
