After reading many blogs and articles, I arrived at the following code for searching for a string in all files inside a folder. It works fine in my tests.
QUESTIONS
Is there a faster approach for this (using C#)?
Is there any scenario that will fail with this code?
Note: I tested only with very small files, and with very few of them.
CODE
static void Main()
{
    string sourceFolder = @"C:\Test";
    string searchWord = ".class1";

    List<string> allFiles = new List<string>();
    AddFileNamesToList(sourceFolder, allFiles);

    foreach (string fileName in allFiles)
    {
        string contents = File.ReadAllText(fileName);
        if (contents.Contains(searchWord))
        {
            Console.WriteLine(fileName);
        }
    }

    Console.WriteLine(" ");
    System.Console.ReadKey();
}
public static void AddFileNamesToList(string sourceDir, List<string> allFiles)
{
    string[] fileEntries = Directory.GetFiles(sourceDir);
    foreach (string fileName in fileEntries)
    {
        allFiles.Add(fileName);
    }

    //Recursion
    string[] subdirectoryEntries = Directory.GetDirectories(sourceDir);
    foreach (string item in subdirectoryEntries)
    {
        // Avoid "reparse points"
        if ((File.GetAttributes(item) & FileAttributes.ReparsePoint) != FileAttributes.ReparsePoint)
        {
            AddFileNamesToList(item, allFiles);
        }
    }
}
REFERENCE
Using StreamReader to check if a file contains a string
Splitting a String with two criteria
C# detect folder junctions in a path
Detect Symbolic Links, Junction Points, Mount Points and Hard Links
FolderBrowserDialog SelectedPath with reparse points
C# - High Quality Byte Array Conversion of Images
Instead of File.ReadAllText(), better use
File.ReadLines(@"C:\file.txt");
It returns an IEnumerable<string> (lazily yielded), so you will not have to read the whole file if your string is found before the last line of the text file is reached.
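A minimal sketch of that suggestion applied to the question's loop (assuming, as the question's code does, that the search word never spans a line break; note that unlike AddFileNamesToList above, SearchOption.AllDirectories does not skip reparse points):

```csharp
using System;
using System.IO;
using System.Linq;

class Program
{
    static void Main()
    {
        string sourceFolder = @"C:\Test";   // from the question
        string searchWord = ".class1";

        foreach (string fileName in Directory.EnumerateFiles(sourceFolder, "*", SearchOption.AllDirectories))
        {
            // ReadLines is lazy: Any() stops reading the file at the first matching line.
            if (File.ReadLines(fileName).Any(line => line.Contains(searchWord)))
                Console.WriteLine(fileName);
        }
    }
}
```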
I wrote something very similar; there are a couple of changes I would recommend.
Use Directory.EnumerateDirectories instead of GetDirectories, it returns immediately with a IEnumerable so you don't need to wait for it to finish reading all of the directories before processing.
Use ReadLines instead of ReadAllText, this will only load one line in at a time in memory, this will be a big deal if you hit a large file.
If you are using a new enough version of .NET use Parallel.ForEach, this will allow you to search multiple files at once.
You may not be able to open every file: you need to check for read permissions, or add to the manifest that your program requires administrative privileges (you should still check, though).
I was creating a binary search tool; here are some snippets of what I wrote, to give you a hand:
private void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
{
    Parallel.ForEach(Directory.EnumerateFiles(_folder, _filter, SearchOption.AllDirectories), Search);
}

//_array contains the binary pattern I am searching for.
private void Search(string filePath)
{
    if (Contains(filePath, _array))
    {
        //filePath points at a match.
    }
}

private static bool Contains(string path, byte[] search)
{
    //I am doing ReadAllBytes because I am doing a binary search, not a text search:
    //there are no "lines" to separate on.
    var file = File.ReadAllBytes(path);
    var result = Parallel.For(0, file.Length - search.Length, (i, loopState) =>
    {
        if (file[i] == search[0])
        {
            byte[] localCache = new byte[search.Length];
            Array.Copy(file, i, localCache, 0, search.Length);
            if (Enumerable.SequenceEqual(localCache, search))
                loopState.Stop();
        }
    });
    return result.IsCompleted == false;
}
This uses two nested parallel loops. This design is terribly inefficient and could be greatly improved by using the Boyer-Moore search algorithm, but I could not find a binary implementation and did not have the time, when I originally wrote it, to implement it myself.
The main problem here is that you are searching all the files in real time for every search. There is also the possibility of file access conflicts if two or more users are searching at the same time.
To dramatically improve performance, I would index the files ahead of time, and again as they are edited/saved. Store the index using something like Lucene.NET, then query the index (again using Lucene.NET) and return the file names to the user, so the user never queries the files directly.
If you follow the links in this SO post you may get a head start on implementing the indexing. I didn't follow the links, but it's worth a look.
Just a heads up: this will be an intense shift from your current approach and will require
a service to monitor/index the files
the UI project
I think your code will fail with an exception if you lack permission to open a file.
Compare it with the code here: http://bgrep.codeplex.com/releases/view/36186
That latter code supports
regular expression search and
filters for file extensions
-- things you should probably consider.
Instead of Contains, better to use the Boyer-Moore search algorithm.
Fail scenario: a file for which you do not have read permission.
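For reference, a minimal sketch of Boyer-Moore-Horspool (a simplified variant of Boyer-Moore) that finds a byte pattern in a byte array; the class and method names are my own:

```csharp
using System;

static class ByteSearch
{
    // Returns the first index of `needle` in `haystack`, or -1 if absent.
    public static int IndexOf(byte[] haystack, byte[] needle)
    {
        if (needle.Length == 0) return 0;

        // Bad-character table: how far we may safely shift the window when
        // the byte aligned with the window's last position mismatches.
        int[] shift = new int[256];
        for (int i = 0; i < 256; i++) shift[i] = needle.Length;
        for (int i = 0; i < needle.Length - 1; i++)
            shift[needle[i]] = needle.Length - 1 - i;

        int pos = 0;
        while (pos <= haystack.Length - needle.Length)
        {
            int j = needle.Length - 1;
            while (j >= 0 && haystack[pos + j] == needle[j]) j--;
            if (j < 0) return pos;  // full match
            pos += shift[haystack[pos + needle.Length - 1]];
        }
        return -1;
    }
}
```

For example, `ByteSearch.IndexOf(new byte[] { 1, 2, 3, 4 }, new byte[] { 3, 4 })` returns 2. The skip table lets long needles jump many bytes per mismatch, which is where the speed-up over a naive scan comes from.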
I've been writing a simple console application as part of a project exercise. The tasks are rather straightforward:
The 2nd method has to create a nested directory tree where every folder name is a Guid.
The 3rd method has to put an empty file into a chosen directory tree at a specific level.
My main problem lies with the 3rd method. While it works fine and creates files down to the third level of any directory tree, beyond that point it always throws a "System.IO.DirectoryNotFoundException", as it "can't find part of the path".
I use a string as a container for the path, but since it is several Guids joined together it gets pretty long. I had a similar problem when creating the directories, but there it worked once I simply put the @"\\?\" prefix in front of the path. So is there any way to make this work, or a way around it?
Here is the method that fails. Specifically it's
File.Create(PathToFile + @"\blank.txt").Dispose();
And part of code which makes string and invokes it:
string ChosenDirectoryPath = currDir.FullName + @"\";
for (int i = 0; i <= Position; i++)
{
    ChosenDirectoryPath += ListsList[WhichList][i];
}
if (!File.Exists(ChosenDirectoryPath + @"\blank.txt"))
{
    FileMaker(ref ChosenDirectoryPath);
}
Edit:
To be specific, the directories are made by this method:
public List<string> DirectoryList = new List<string>();
internal static List<List<string>> ListsList = new List<List<string>>();
private static DirectoryInfo currDir = new DirectoryInfo(".");
private string FolderName;
private static string DirectoryPath;

public void DeepDive(List<string> DirectoryList, int countdown)
{
    FolderName = GuidMaker();
    DirectoryList.Add(FolderName + @"\");
    if (countdown <= 1)
    {
        foreach (string element in DirectoryList)
        {
            DirectoryPath += element;
        }
        Directory.CreateDirectory(@"\\?\" + currDir.FullName + @"\" + DirectoryPath);
        Console.WriteLine("Folders were nested at directory {0} under folder {1}\n", currDir.FullName, DirectoryList[0]);
        ListsList.Add(DirectoryList);
        DirectoryPath = null;
        return;
    }
    DeepDive(DirectoryList, countdown - 1);
}
Which is pretty messy because of the recursion (iteration would be better, but I wanted to do it this way to learn something). The point is that the directories are created and stored in a list of lists.
Creating files works properly, but only for the first three nested folders. So the problem is that it somehow loses the path to the file at the 4th and 5th levels, and I can't even create those files manually. Could the path be too long? And how do I fix this?
Here is the exception that is thrown:
System.IO.DirectoryNotFoundException: "Can't find part of the path
'C:\Some\More\Folders\1b0c7715-ee01-4df8-9079-82ea7990030f\c6c806b0-b69d-4a3a-88d0-1bd8a0e31eb2\9671f2b3-3041-42d5-b631-4719d36c2ac5\6406f00f-7750-4b5a-a45d-cebcecb0b70e\bcacef2b-e391-4799-b84e-f2bc55605d40\blank.txt'."
So it prints the full path to the file and yet says that it can't find it.
Your problem is that File.Create doesn't create the corresponding directories for you; instead it throws a System.IO.DirectoryNotFoundException.
You have to create those directories yourself, using System.IO.Directory.CreateDirectory().
If this exception occurs because of a too-long path, you can still use the long-path syntax (\\?\) like you did when creating your directories.
See also this question: How to deal with files with a name longer than 259 characters? There is also a good article linked there.
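Putting both points together, a sketch using the question's own PathToFile variable (assumption: PathToFile already holds the full absolute directory path):

```csharp
// Create the missing directory levels first, then the file.
// The \\?\ prefix bypasses the legacy MAX_PATH (260 character) limit,
// just as it did for Directory.CreateDirectory in the question.
string dir = @"\\?\" + PathToFile;
Directory.CreateDirectory(dir);     // creates all missing levels at once
File.Create(dir + @"\blank.txt").Dispose();
```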
I have a program where I need to search an arbitrary number of nested zip-files. I was able to solve this in python 3 by taking the namelist of the archive at a given path, finding zip-files, opening them, converting the file to a byte string with BytesIO, and then calling the method again recursively on the bytestring. Like so:
def zip_dig(source_path, posts):
    try:
        with zipfile.ZipFile(source_path, 'r') as zip_ref:  # open zip file, list contents
            for name in zip_ref.namelist():
                if name.endswith('.zip'):
                    zfiledata = BytesIO(zip_ref.read(name))  # nested zip as an in-memory byte stream
                    zip_dig(zfiledata, posts)  # recurse into the nested archive
    except zipfile.BadZipFile:
        pass
    return posts
I now need to solve this in C#, but I can't seem to find an equivalent solution.
I have googled extensively and looked through the documentation of the ZipFile and ZipArchive classes, but I can't find a similar workaround for the fact that the nested file reference comes in the form of a Stream rather than a string:
internal static List<BsonDocument> ZipDig(string path, List<BsonDocument> posts)
{
    path = Path.GetFullPath(path);
    using (ZipArchive archive = ZipFile.OpenRead(path))
    {
        foreach (ZipArchiveEntry entry in archive.Entries)
        {
            if (entry.FullName.EndsWith(".zip", StringComparison.OrdinalIgnoreCase))
            {
                posts = ZipDig(entry, posts); // won't compile: ZipDig takes a string, not a ZipArchiveEntry
            }
        }
    }
    return posts;
}
Any help is appreciated!
EDIT: I should clarify: the zip files are often several gigabytes large, so extraction is not really an option from a time-consumption perspective. I'm just finding a particular type of txt file, reading them, and entering the contents into a database.
ZipArchive has a constructor which takes a stream.
Use that below the initial level of recursion.
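A sketch of that approach, mirroring the Python version above: recurse on a Stream instead of a path. Buffering each nested entry into a MemoryStream is my own workaround (ZipArchive needs a seekable stream, but entry.Open() returns a forward-only one), so for the multi-gigabyte archives mentioned in the question a temporary file may be preferable:

```csharp
using System;
using System.IO;
using System.IO.Compression;

internal static class NestedZip
{
    // Entry point: open the outermost archive from a file path.
    internal static void ZipDig(string path)
    {
        using (FileStream outer = File.OpenRead(path))
            ZipDig(outer);
    }

    // Recursive overload: works on any seekable Stream, nested or not.
    private static void ZipDig(Stream source)
    {
        using (var archive = new ZipArchive(source, ZipArchiveMode.Read, leaveOpen: true))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                if (entry.FullName.EndsWith(".zip", StringComparison.OrdinalIgnoreCase))
                {
                    // Copy the forward-only entry stream into a seekable buffer, then recurse.
                    using (var buffer = new MemoryStream())
                    {
                        using (Stream inner = entry.Open())
                            inner.CopyTo(buffer);
                        buffer.Position = 0;
                        ZipDig(buffer);
                    }
                }
                // else: process the entry, e.g. read the txt files mentioned in the question
            }
        }
    }
}
```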
Basically I want to access 1000s of text files, read their data, store it in SQLite databases, parse it, then show the output to users. So far I've developed a program that does this for only ONE text file.
What I want to do: there is a directory on our server which has about 15 folders. In each folder there are about 30-50 text files. I want to loop through EACH folder, and in each folder loop through EACH file. A nice user helped me with doing this for 1000s of text files, but I needed further clarification on his method. This was his approach:
private static void ReadAllFilesStartingFromDirectory(string topLevelDirectory)
{
    const string searchPattern = "*.txt";
    var subDirectories = Directory.EnumerateDirectories(topLevelDirectory);
    var filesInDirectory = Directory.EnumerateFiles(topLevelDirectory, searchPattern);

    foreach (var subDirectory in subDirectories)
    {
        ReadAllFilesStartingFromDirectory(subDirectory); //recursion
    }
    IterateFiles(filesInDirectory, topLevelDirectory);
}

private static void IterateFiles(IEnumerable<string> files, string directory)
{
    foreach (var file in files)
    {
        Console.WriteLine("{0}", Path.Combine(directory, file)); //for verification
        try
        {
            string[] lines = File.ReadAllLines(file);
            foreach (var line in lines)
            {
                //Console.WriteLine(line);
            }
        }
        catch (IOException ex)
        {
            //Handle: file may be in use...
        }
    }
}
My problems/questions:
1) topLevelDirectory - what exactly should I put there? The 15 folders are located on a server with a path something like \\servername\randomfile\random\locationoftopleveldirectory. But how can I put the double backslashes (at the beginning of the path) in C#? Is this possible? I thought we could only access local files (for example "C:\" - paths with single, not double, backslashes).
2) I don't understand what the purpose of the first foreach loop is. ReadAllFilesStartingFromDirectory(subDirectory) - yes, we are looping over the folders, but we aren't even doing anything with that loop. It's just reading the folders.
I'm not going to know your top level directory, but essentially if your files are in C:\tmp, then you would pass it @"C:\tmp". Prefix your string with the @ character to avoid escaping (or escape each backslash individually):
string example0 = @"\\some\network\path";
string example1 = "\\\\some\\network\\path";
With ReadAllFilesStartingFromDirectory you're recursively calling IterateFiles, and it's doing whatever IterateFiles does in each directory. With the code you pasted above, that happens to be nothing, since Console.WriteLine(line) is commented out.
Let's clarify topLevelDirectory: this is a folder which has items in it. It does not matter whether these are files or other directories. The contained "subfolders" can in turn contain folders themselves.
What topLevelDirectory means to you: take the folder which encapsulates all the files you need, at the lowest level possible.
Your top level folder is the directory which contains the 15 folders you want to crawl.
ReadAllFilesStartingFromDirectory(string topLevelDirectory)
You need to realise what recursion means. Recursion describes a method which calls itself.
Compare the name of the function (ReadAllFilesStartingFromDirectory) with the name of the function called in the foreach loop - they are the same.
In your case: the method gets all folders located in your top folder, then loops through all subfolders. Each subfolder in turn becomes the top level folder, which again can contain subfolders, which become top level folders in the next method call.
This is a nice way to loop through the whole file structure. When there are no more subfolders, there is no further recursion and the method ends.
Your path problem: you need to escape the backslashes. You escape them by adding a backslash in front of each one:
\path\randfolder\file.txt will become \\path\\randfolder\\file.txt
Or you put an @ before the string: var path = @"\path\randfolder\file.txt", which also does the trick. Both ways work.
1) Yes, it is possible in C#. If your program has permission to access the network location, you can use "\\\\servername\\randomfile\\random\\locationoftopleveldirectory" - a double backslash in a string is interpreted as a single backslash. Or you can put @ before the string, which means 'ignore the escape character' (the backslash); then your string will look like this: @"\\servername\randomfile\random\locationoftopleveldirectory".
2) ReadAllFilesStartingFromDirectory is a recursive function. The directory structure is hierarchical, so it is easy to traverse recursively. This function looks for files in the root directory, in its sub-directories, in all their sub-directories, and so on.
Try commenting out that loop and you will see that only the files of the root directory get parsed by the IterateFiles function.
Possible Duplicate:
How can I perform full recursive directory & file scan?
I can't find any information on how to create a full directory listing in C#, including all files inside the folder and all of its subfolders' files, i.e.:
c:\a.jpg
c:\b.php
c:\c\d.exe
c:\e\f.png
c:\g\h.mpg
You can use new DirectoryInfo(@"C:\").GetFiles("*.*", SearchOption.AllDirectories);
This will return an array of FileInfo objects, which include a Name and FullName property for the filename.
That being said, I would not do this on "C:\", as you'll return a huge array, but this technique will work correctly on an appropriate folder.
If you're using .NET 4, I'd recommend using EnumerateFiles instead, which will return an IEnumerable<T> instead of an array.
Note that the above code will most likely fail, however, as it requires full permissions to search the file system. A SecurityException will be raised if you can't access specific parts of the file system you're trying to search.
Also - if you're only interested in the file names and not the full FileInfo information, you can use Directory.EnumerateFiles instead, which returns an IEnumerable<string> with all of the relevant filenames.
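A short sketch of both variants discussed above (the folder path is an assumption):

```csharp
using System;
using System.IO;

class Listing
{
    static void Main()
    {
        // Lazily enumerated FileInfo objects (.NET 4+):
        var dir = new DirectoryInfo(@"C:\SomeFolder");
        foreach (FileInfo fi in dir.EnumerateFiles("*.*", SearchOption.AllDirectories))
            Console.WriteLine(fi.FullName);

        // Just the path strings:
        foreach (string path in Directory.EnumerateFiles(@"C:\SomeFolder", "*.*", SearchOption.AllDirectories))
            Console.WriteLine(path);
    }
}
```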
There is an MSDN Article with code examples to do just this.
This is the money:
void DirSearch(string sDir)
{
    try
    {
        foreach (string d in Directory.GetDirectories(sDir))
        {
            foreach (string f in Directory.GetFiles(d, txtFile.Text))
            {
                lstFilesFound.Items.Add(f);
            }
            DirSearch(d);
        }
    }
    catch (System.Exception excpt)
    {
        Console.WriteLine(excpt.Message);
    }
}
I wrote a program that looks for a specific file on the computer, but it runs slowly, with long delays, because of the many files on the computer.
This function gets all the files:
void Get_Files(DirectoryInfo D)
{
    FileInfo[] Files;
    try
    {
        Files = D.GetFiles("*.*");
        foreach (FileInfo File_Name in Files)
            listBox3.Items.Add(File_Name.FullName);
    }
    catch { } // ignore folders we cannot read

    DirectoryInfo[] Dirs;
    try
    {
        Dirs = D.GetDirectories();
        foreach (DirectoryInfo Dir in Dirs)
        {
            if (!(Dir.ToString().Equals("$RECYCLE.BIN")) && !(Dir.ToString().Equals("System Volume Information")))
                Get_Files(Dir);
        }
    }
    catch { } // ignore folders we cannot read
}
Is there a way to get all the computer's files a little faster?
Use a profiler to find out which operation is the slowest, then think about how to make it faster. Otherwise you may waste your time optimizing something that is not the bottleneck and will not bring the expected speed-up.
In your case you will probably find that, when you call this function for the first time (when the directory structure is not yet in the cache), most of the time is spent in the GetDirectories() and GetFiles() calls. You can pre-cache the list of all files in memory (or in a database) and use FileSystemWatcher to monitor filesystem changes and keep your file list up to date. Or you can use existing services, such as the Windows Indexing Service, though these may not be available on every computer.
The second bottleneck could be adding the files to the ListBox. If the number of added items is large, you can temporarily disable drawing of the listbox using ListBox.BeginUpdate and, when you finish, enable it again with ListBox.EndUpdate. This can sometimes lead to a huge speed-up.
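The ListBox batching suggestion as a sketch against the question's listBox3 (fileNames stands in for whatever collection the scan produced):

```csharp
listBox3.BeginUpdate();   // suspend repainting while adding many items
try
{
    foreach (string fileName in fileNames)
        listBox3.Items.Add(fileName);
}
finally
{
    listBox3.EndUpdate(); // resume repainting even if adding throws
}
```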
The answer will generally depend on your operating system. In any case you will want to build and maintain your own database of files; an explicit search like in your example will be too costly and slow.
A standard solution on Linux (and Mac OS X, if I'm not mistaken) is to maintain a locatedb file, which is updated by the system on a regular basis. If run on these systems, your program could make queries against this database.
Part of the problem is that the GetFiles method doesn't return until it has gotten all the files in the folder; if you are performing a recursive search, then each sub folder you recurse into takes longer and longer.
Look into using DirectoryInfo.EnumerateFiles or DirectoryInfo.EnumerateFileSystemInfos.
From the docs:
The EnumerateFiles and GetFiles methods differ as follows: When you
use EnumerateFiles, you can start enumerating the collection of
FileInfo objects before the whole collection is returned; when you use
GetFiles, you must wait for the whole array of FileInfo objects to be
returned before you can access the array. Therefore, when you are
working with many files and directories, EnumerateFiles can be more
efficient.
The same is true for EnumerateFileSystemInfos
You can also look into querying the Indexing Service (if it is installed and running). See this article on CodeProject:
http://www.codeproject.com/Articles/19540/Microsoft-Indexing-Service-How-To
I found this by Googling "How to query MS file system index"
You can enumerate all files once and store the list.
But if you can't do that, this is basically as good as it gets. You can do two small things:
Try using threads. This will help a lot on an SSD but might hurt on a rotating disk.
Use Directory.GetFileSystemEntries. This will return files and directories in one efficient call.
You will find much faster performance using Directory.GetFiles(), as the FileInfo and DirectoryInfo classes fetch extra information from the file system, which is much slower than simply returning the string-based file names.
Here is a code example that should yield much improved results and separates the action of retrieving files from the operation of displaying them in a list box.
static void Main(string[] args)
{
    var fileFinder = new FileFinder(@"c:\SomePath");
    foreach (var file in fileFinder.Files)
    {
        listBox3.Items.Add(file);
    }
}

/// <summary>
/// SOLID: This class is responsible for recursing a directory to return the list of files
/// that are not in a predefined set of excluded folders.
/// </summary>
internal class FileFinder
{
    private readonly string _rootPath;
    private List<string> _fileNames;
    private readonly IEnumerable<string> _doNotSearchFolders = new[] { "System Volume Information", "$RECYCLE.BIN" };

    internal FileFinder(string rootPath)
    {
        _rootPath = rootPath;
    }

    internal IEnumerable<string> Files
    {
        get
        {
            if (_fileNames == null)
            {
                _fileNames = new List<string>();
                GetFiles(_rootPath);
            }
            return _fileNames;
        }
    }

    private void GetFiles(string path)
    {
        _fileNames.AddRange(Directory.GetFiles(path));
        foreach (var recursivePath in Directory.GetDirectories(path)
            .Where(d => !_doNotSearchFolders.Contains(Path.GetFileName(d))))
        {
            GetFiles(recursivePath);
        }
    }
}
}