Is There a Faster Way Than This to Find All the Files in a Directory and All Sub Directories

Is there a faster way than this to find all the files in a directory and all sub directories?

Try this iterator-block version, which avoids recursion and the FileInfo/DirectoryInfo objects:

public static IEnumerable<string> GetFileList(string fileSearchPattern, string rootFolderPath)
{
    // Breadth-first traversal using an explicit queue instead of recursion.
    Queue<string> pending = new Queue<string>();
    pending.Enqueue(rootFolderPath);
    string[] tmp;
    while (pending.Count > 0)
    {
        rootFolderPath = pending.Dequeue();
        try
        {
            tmp = Directory.GetFiles(rootFolderPath, fileSearchPattern);
        }
        catch (UnauthorizedAccessException)
        {
            // Skip directories we aren't allowed to read.
            continue;
        }
        for (int i = 0; i < tmp.Length; i++)
        {
            yield return tmp[i];
        }
        tmp = Directory.GetDirectories(rootFolderPath);
        for (int i = 0; i < tmp.Length; i++)
        {
            pending.Enqueue(tmp[i]);
        }
    }
}

Note also that .NET 4.0 has built-in iterator-block versions (EnumerateFiles, EnumerateFileSystemEntries) that may be faster (more direct access to the file system, fewer arrays).
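For instance, a minimal sketch (the path and pattern here are placeholders; note that, unlike the hand-rolled version above, EnumerateFiles with SearchOption.AllDirectories throws on the first directory you lack access to):

foreach (string file in Directory.EnumerateFiles(
    @"C:\Temp", "*.log", SearchOption.AllDirectories))
{
    // Results are streamed one at a time; no string[] is materialized.
    Console.WriteLine(file);
}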

To reach a file faster, should I search through all subfolders, or just put all the files together in one folder and then do the search?

If you know the path of the file you want to open, finding it when it is nested in a series of subdirectories is typically faster than finding a file in one huge directory. Of course it all depends on your filesystem, so it won't hurt to test.

Clarification: if you have to search for the file in lots of different places, this could actually be slower. If you have that many files, the fastest solution is to make it easier on your filesystem: store the location of each file in a database that maps each (uniquely named) file to its full pathname. This way you can access each file with a single open call, and the filesystem will find it very quickly since the intermediate subdirectories are kept small(ish).
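A minimal in-memory sketch of that idea (a real implementation would persist the map in a database; the root path and file name here are hypothetical):

// Build a map from unique file name to full path once...
var index = new Dictionary<string, string>();
foreach (var path in Directory.EnumerateFiles(@"C:\data", "*", SearchOption.AllDirectories))
{
    index[Path.GetFileName(path)] = path;  // assumes file names are unique
}

// ...then each lookup is one dictionary hit plus a single open call.
string fullPath;
if (index.TryGetValue("report.pdf", out fullPath))
{
    using (var stream = File.OpenRead(fullPath))
    {
        // read the file
    }
}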

.net fastest way to find all files matching a pattern in all directories

http://msdn.microsoft.com/en-us/library/ms143316(v=vs.110).aspx

Directory.GetFiles(@"C:\Sketch", "*.ax5", SearchOption.AllDirectories);

Might be good enough for you?


As for performance, I doubt you will find a much faster way to scan directories since, as @Mathew Foscarini points out, your disks are the bottleneck here.

If the directory is indexed, then it would be faster to use the index, as @jaccus mentions.


I took the time to benchmark things a little, and it does actually seem like you're able to get about a 33% performance gain by collecting files asynchronously.

The test set I ran on might not match your situation; I don't know how nested your files are, etc. But what I did was create 5,000 random files in each directory on every level (I settled for a single level, though) and 100 directories, amounting to 505,000 files...

I tested three methods of collecting files...

The simplest approach.

public class SimpleFileCollector
{
    public List<string> CollectFiles(DirectoryInfo directory, string pattern)
    {
        return new List<string>(Directory.GetFiles(directory.FullName, pattern, SearchOption.AllDirectories));
    }
}

The "Dumb" approach, although this is only dumb if you know of the overload used in the Simple approach... Otherwise this is a perfectly fine solution.

public class DumbFileCollector
{
    public List<string> CollectFiles(DirectoryInfo directory, string pattern)
    {
        List<string> files = new List<string>(500000);
        files.AddRange(directory.GetFiles(pattern).Select(file => file.FullName));

        foreach (DirectoryInfo dir in directory.GetDirectories())
        {
            files.AddRange(CollectFiles(dir, pattern));
        }
        return files;
    }
}

The Task API Approach...

public class ThreadedFileCollector
{
    public List<string> CollectFiles(DirectoryInfo directory, string pattern)
    {
        ConcurrentQueue<string> queue = new ConcurrentQueue<string>();
        InternalCollectFiles(directory, pattern, queue);
        return queue.ToList();
    }

    private void InternalCollectFiles(DirectoryInfo directory, string pattern, ConcurrentQueue<string> queue)
    {
        foreach (string result in directory.GetFiles(pattern).Select(file => file.FullName))
        {
            queue.Enqueue(result);
        }

        // Recurse into each subdirectory on its own task and wait for them all.
        Task.WaitAll(directory
            .GetDirectories()
            .Select(dir => Task.Factory.StartNew(() => InternalCollectFiles(dir, pattern, queue)))
            .ToArray());
    }
}

This only tests collecting the files, not processing them; the processing would make sense to kick off to threads as well.
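For reference, a harness along these lines is enough to produce this kind of measurement (the original benchmark code isn't shown; this is a hypothetical reconstruction, and the test-set path is a placeholder):

var collector = new ThreadedFileCollector();  // or SimpleFileCollector / DumbFileCollector
for (int pass = 0; pass < 10; pass++)
{
    var sw = System.Diagnostics.Stopwatch.StartNew();
    List<string> files = collector.CollectFiles(new DirectoryInfo(@"C:\TestSet"), "*");
    sw.Stop();
    Console.WriteLine("- Pass {0}: found {1} files in {2} ms", pass, files.Count, sw.ElapsedMilliseconds);
}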

Here are the results on my system:

Simple Collector:
- Pass 0: found 505000 files in 2847 ms
- Pass 1: found 505000 files in 2865 ms
- Pass 2: found 505000 files in 2860 ms
- Pass 3: found 505000 files in 3061 ms
- Pass 4: found 505000 files in 3006 ms
- Pass 5: found 505000 files in 2807 ms
- Pass 6: found 505000 files in 2849 ms
- Pass 7: found 505000 files in 2789 ms
- Pass 8: found 505000 files in 2790 ms
- Pass 9: found 505000 files in 2788 ms
Average: 2866 ms

Dumb Collector:
- Pass 0: found 505000 files in 5190 ms
- Pass 1: found 505000 files in 5204 ms
- Pass 2: found 505000 files in 5453 ms
- Pass 3: found 505000 files in 5311 ms
- Pass 4: found 505000 files in 5339 ms
- Pass 5: found 505000 files in 5362 ms
- Pass 6: found 505000 files in 5316 ms
- Pass 7: found 505000 files in 5319 ms
- Pass 8: found 505000 files in 5583 ms
- Pass 9: found 505000 files in 5197 ms
Average: 5327 ms

Threaded Collector:
- Pass 0: found 505000 files in 2152 ms
- Pass 1: found 505000 files in 2102 ms
- Pass 2: found 505000 files in 2022 ms
- Pass 3: found 505000 files in 2030 ms
- Pass 4: found 505000 files in 2075 ms
- Pass 5: found 505000 files in 2120 ms
- Pass 6: found 505000 files in 2030 ms
- Pass 7: found 505000 files in 1980 ms
- Pass 8: found 505000 files in 1993 ms
- Pass 9: found 505000 files in 2120 ms
Average: 2062 ms

As a side note: @Konrad Kokosa suggested blocking per directory to ensure you don't kick off millions of threads. Don't do that...

There is no reason for you to manage how many threads will be active at a given time; let the Task framework's standard scheduler handle that. It will do a much better job at balancing the number of threads based on the number of cores you have...

And if you really do want to control it yourself, implementing a custom scheduler would be the better option: http://msdn.microsoft.com/en-us/library/system.threading.tasks.taskscheduler(v=vs.110).aspx
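That said, if all you need is a concurrency cap rather than a full custom scheduler, a minimal sketch (assuming .NET 4.5+, where ConcurrentExclusiveSchedulerPair is available; requires using System.Threading and System.Threading.Tasks):

// ConcurrentExclusiveSchedulerPair wraps the default scheduler with a hard
// cap on how many tasks run concurrently.
var pair = new ConcurrentExclusiveSchedulerPair(
    TaskScheduler.Default,
    maxConcurrencyLevel: Environment.ProcessorCount);

Task work = Task.Factory.StartNew(
    () => Console.WriteLine("runs on a concurrency-capped scheduler"),
    CancellationToken.None,
    TaskCreationOptions.None,
    pair.ConcurrentScheduler);  // the cap applies to tasks queued here

work.Wait();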

How can I recursively find all files in current and subfolders based on wildcard matching?

Use find:

find . -name "foo*"

find needs a starting point, so the . (dot) points to the current directory.

Bash - What is a good way to recursively find the type of all files in a directory and its subdirectories?

This may help: How to recursively list subdirectories in Bash without using "find" or "ls" commands?

That said, I modified it to accept user input as follows:

#!/bin/bash

# Recursively walk the tree rooted at $1, labelling directories and files.
recurse() {
    for i in "$1"/*; do
        if [ -d "$i" ]; then
            echo "dir: $i"
            recurse "$i"
        elif [ -f "$i" ]; then
            echo "file: $i"
        fi
    done
}

recurse "$1"

If you don't want the files portion (and it appears you don't), just remove the elif and the line below it. I left it in since the original post had it too. Hope this helps.

Method to get all files within folder and subfolders that will return a list

private List<string> DirSearch(string sDir)
{
    List<string> files = new List<string>();
    try
    {
        foreach (string f in Directory.GetFiles(sDir))
        {
            files.Add(f);
        }
        foreach (string d in Directory.GetDirectories(sDir))
        {
            files.AddRange(DirSearch(d));
        }
    }
    catch (System.Exception excpt)
    {
        MessageBox.Show(excpt.Message);
    }

    return files;
}

And if you don't want to load the entire list into memory and want to avoid blocking, you may take a look at the following answer.
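For comparison, a minimal lazy sketch (assuming .NET 4.0+; note it throws on the first inaccessible directory instead of showing a message box like the version above):

private IEnumerable<string> DirSearchLazy(string sDir)
{
    // Streams file paths one at a time; nothing is buffered in a list.
    return Directory.EnumerateFiles(sDir, "*", SearchOption.AllDirectories);
}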

Find all files in first-level subdirectories

Just don't use the SearchOption.AllDirectories option if you don't want all levels; enumerate the first-level directories yourself and then the files in each, simple as that:

var path = Path.Combine(
    Environment.GetFolderPath(Environment.SpecialFolder.MyDocuments),
    @"GameLauncher");
var includedExtensions = new HashSet<string> { ".exe", ".lnk", ".url" };
var files =
    from dir in Directory.EnumerateDirectories(path)
    from file in Directory.EnumerateFiles(dir)
    let extension = Path.GetExtension(file)
    where includedExtensions.Contains(extension)
    select file;
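One detail worth noting: HashSet<string>'s default comparer is ordinal and case-sensitive, so a shortcut named GAME.EXE would not match ".exe". If that matters, a small tweak (same names as above):

var includedExtensions = new HashSet<string>(
    new[] { ".exe", ".lnk", ".url" },
    StringComparer.OrdinalIgnoreCase);  // matches ".EXE", ".Lnk", etc.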

Fast way to enumerate all files including sub-folders

Short answer:

If this code is functionally correct for your project and you haven't proved it to be a problem with a profiler then don't change it. Continue to use a functionally correct solution until you prove it to be slow.

Long answer:

How fast or slow this particular piece of code is depends on a lot of factors, many of which depend on the specific machine you are running on (for instance, hard drive speed). Looking at code that involves the file system and nothing else, it's very difficult to say "x is faster than y" with any degree of certainty.

In this case, I can only really comment on one thing: the return type of this method is an array of FileInfo values. Arrays require contiguous memory, and very large arrays can cause fragmentation issues in your heap. If you are reading extremely large directories, this could lead to heap fragmentation and, indirectly, performance issues.

If that turns out to be a problem, then you can P/Invoke into FindFirstFile / FindNextFile and get the entries one at a time. The result will likely be slower in CPU cycles but will have less memory pressure.
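A minimal sketch of that approach (single directory, no recursion, error handling trimmed; requires using System.Runtime.InteropServices, and the struct layout follows the documented WIN32_FIND_DATA):

[StructLayout(LayoutKind.Sequential, CharSet = CharSet.Auto)]
struct WIN32_FIND_DATA
{
    public uint dwFileAttributes;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
    public uint nFileSizeHigh;
    public uint nFileSizeLow;
    public uint dwReserved0;
    public uint dwReserved1;
    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
    public string cFileName;
    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
    public string cAlternateFileName;
}

[DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA lpFindFileData);

[DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);

[DllImport("kernel32.dll", SetLastError = true)]
static extern bool FindClose(IntPtr hFindFile);

// Yields one file name at a time, so no large FileInfo[] is ever allocated.
static IEnumerable<string> EnumerateFileNames(string pattern)
{
    WIN32_FIND_DATA data;
    IntPtr handle = FindFirstFile(pattern, out data);
    if (handle == new IntPtr(-1)) yield break;  // INVALID_HANDLE_VALUE
    try
    {
        do
        {
            const uint FILE_ATTRIBUTE_DIRECTORY = 0x10;
            if ((data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) == 0)
                yield return data.cFileName;
        } while (FindNextFile(handle, out data));
    }
    finally
    {
        FindClose(handle);
    }
}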

But I must stress that you should prove these are problems before you fix them.


