Is There a Faster Way to Scan Through a Directory Recursively in .NET

Is there a faster way to scan through a directory recursively in .NET?

This implementation, which needs a bit of tweaking, is 5-10x faster.

    static List<Info> RecursiveScan2(string directory) {
        IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);
        WIN32_FIND_DATAW findData;
        IntPtr findHandle = INVALID_HANDLE_VALUE;

        var info = new List<Info>();
        try {
            findHandle = FindFirstFileW(directory + @"\*", out findData);
            if (findHandle != INVALID_HANDLE_VALUE) {
                do {
                    // Skip the "." and ".." pseudo-entries; continue jumps to the
                    // while condition, so FindNextFile still advances the handle.
                    if (findData.cFileName == "." || findData.cFileName == "..") continue;

                    string fullpath = directory + (directory.EndsWith("\\") ? "" : "\\") + findData.cFileName;

                    bool isDir = false;

                    if ((findData.dwFileAttributes & FileAttributes.Directory) != 0) {
                        isDir = true;
                        info.AddRange(RecursiveScan2(fullpath));
                    }

                    info.Add(new Info() {
                        CreatedDate = findData.ftCreationTime.ToDateTime(),
                        ModifiedDate = findData.ftLastWriteTime.ToDateTime(),
                        IsDirectory = isDir,
                        Path = fullpath
                    });
                }
                while (FindNextFile(findHandle, out findData));
            }
        } finally {
            if (findHandle != INVALID_HANDLE_VALUE) FindClose(findHandle);
        }
        return info;
    }

extension method:

    public static class FILETIMEExtensions {
        public static DateTime ToDateTime(this System.Runtime.InteropServices.ComTypes.FILETIME filetime) {
            long highBits = filetime.dwHighDateTime;
            highBits = highBits << 32;
            // Cast through uint first: dwLowDateTime is a signed int, and casting it
            // directly to long would sign-extend values with the high bit set.
            return DateTime.FromFileTimeUtc(highBits + (long)(uint)filetime.dwLowDateTime);
        }
    }

interop defs are:

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
public static extern IntPtr FindFirstFileW(string lpFileName, out WIN32_FIND_DATAW lpFindFileData);

[DllImport("kernel32.dll", CharSet = CharSet.Unicode)]
public static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATAW lpFindFileData);

[DllImport("kernel32.dll")]
public static extern bool FindClose(IntPtr hFindFile);

[StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
public struct WIN32_FIND_DATAW {
public FileAttributes dwFileAttributes;
internal System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
internal System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
internal System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
public int nFileSizeHigh;
public int nFileSizeLow;
public int dwReserved0;
public int dwReserved1;
[MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
public string cFileName;
[MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
public string cAlternateFileName;
}

Is there a faster way than this to find all the files in a directory and all subdirectories?

Try this iterator block version that avoids recursion and the Info objects:

    public static IEnumerable<string> GetFileList(string fileSearchPattern, string rootFolderPath)
    {
        Queue<string> pending = new Queue<string>();
        pending.Enqueue(rootFolderPath);
        string[] tmp;
        while (pending.Count > 0)
        {
            rootFolderPath = pending.Dequeue();
            try
            {
                tmp = Directory.GetFiles(rootFolderPath, fileSearchPattern);
            }
            catch (UnauthorizedAccessException)
            {
                continue;
            }
            for (int i = 0; i < tmp.Length; i++)
            {
                yield return tmp[i];
            }
            tmp = Directory.GetDirectories(rootFolderPath);
            for (int i = 0; i < tmp.Length; i++)
            {
                pending.Enqueue(tmp[i]);
            }
        }
    }

Note also that .NET 4.0 has built-in iterator-block versions (EnumerateFiles, EnumerateFileSystemEntries) that may be faster (more direct access to the file system; fewer arrays).
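
For example, the streaming version of the loop above collapses to a few lines (a minimal sketch; note that, unlike GetFileList above, this will stop with an UnauthorizedAccessException if it hits a directory it cannot read):

    // .NET 4.0+: streams results as the tree is walked instead of
    // buffering each directory's entries into an array first.
    foreach (string file in Directory.EnumerateFiles(rootFolderPath, fileSearchPattern,
                                                     SearchOption.AllDirectories))
    {
        // process file
    }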

How can I perform full recursive directory & file scan?

    private static void TreeScan(string sDir)
    {
        foreach (string f in Directory.GetFiles(sDir))
        {
            //Save f :)
        }
        foreach (string d in Directory.GetDirectories(sDir))
        {
            TreeScan(d);
        }
    }

Fast (low-level) method to recursively process files in folders

In .NET 4.0, there are built-in enumerable file-listing methods; since that release isn't far away, I would try using them. They can matter in particular if you have any massively populated folders, where GetFiles would otherwise require a large array allocation.

If depth is the issue, I would consider flattening your method to use a local stack/queue and a single iterator block. This will reduce the code path used to enumerate the deep folders:

    private static IEnumerable<string> WalkFiles(string path, string filter)
    {
        var pending = new Queue<string>();
        pending.Enqueue(path);
        string[] tmp;
        while (pending.Count > 0)
        {
            path = pending.Dequeue();
            tmp = Directory.GetFiles(path, filter);
            for (int i = 0; i < tmp.Length; i++)
            {
                yield return tmp[i];
            }
            tmp = Directory.GetDirectories(path);
            for (int i = 0; i < tmp.Length; i++)
            {
                pending.Enqueue(tmp[i]);
            }
        }
    }

Iterate over that, calling your ProcessFiles on the results.
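
For instance (a minimal sketch; ProcessFile stands in for whatever your per-file work is, and the path and filter are made up):

    // Hypothetical consumer: ProcessFile is your own per-file callback.
    foreach (string file in WalkFiles(@"c:\data", "*.log"))
    {
        ProcessFile(file);
    }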

Improve the performance for enumerating files and folders using .NET

This is (probably) as good as it's going to get:

    DateTime sixtyLess = DateTime.Now.AddDays(-60);
    DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
    FileInfo[] oldFiles =
        dirInfo.EnumerateFiles("*.*", SearchOption.AllDirectories)
               .AsParallel()
               .Where(fi => fi.CreationTime < sixtyLess)
               .ToArray();

Changes:

  • Made the 60-days-ago DateTime a constant computed once up front, rather than per file, and therefore less CPU load.
  • Used EnumerateFiles.
  • Made the query parallel.

Should run in less time (not sure how much less).

Here is another solution which might be faster or slower than the first, it depends on the data:

    DateTime sixtyLess = DateTime.Now.AddDays(-60);
    DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
    FileInfo[] oldFiles =
        dirInfo.EnumerateDirectories()
               .AsParallel()
               .SelectMany(di => di.EnumerateFiles("*.*", SearchOption.AllDirectories)
                                   .Where(fi => fi.CreationTime < sixtyLess))
               .ToArray();

This moves the parallelism to the top-level folder enumeration. Most of the changes from above apply here too.

Searching for file in directories recursively

You could use this overload of Directory.GetFiles which searches subdirectories for you, for example:

string[] files = Directory.GetFiles(sDir, "*.xml", SearchOption.AllDirectories);

Only one extension can be searched for like that, but you could use something like:

    var extensions = new List<string> { ".txt", ".xml" };
    string[] files = Directory.GetFiles(sDir, "*.*", SearchOption.AllDirectories)
                              .Where(f => extensions.IndexOf(Path.GetExtension(f)) >= 0)
                              .ToArray();

to select files with the required extensions (N.B. that is case-sensitive for the extension).
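
If you want the match to be case-insensitive as well (so ".XML" files are also found), one option, sketched here along the same lines as above, is a HashSet with an ordinal-ignore-case comparer:

    // Same query, but extension matching ignores case.
    var extensions = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".txt", ".xml" };
    string[] files = Directory.GetFiles(sDir, "*.*", SearchOption.AllDirectories)
                              .Where(f => extensions.Contains(Path.GetExtension(f)))
                              .ToArray();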


In some cases it can be desirable to enumerate over the files with the Directory.EnumerateFiles method:

    foreach (string f in Directory.EnumerateFiles(sDir, "*.xml", SearchOption.AllDirectories))
    {
        // do something
    }

Consult the documentation for exceptions which can be thrown, such as UnauthorizedAccessException if the code is running under an account which does not have appropriate access permissions.

If the UnauthorizedAccessException is a problem, then please see the fine answers at Directory.EnumerateFiles => UnauthorizedAccessException.
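
The usual workaround there is to walk the directories yourself and trap the exception per directory; a minimal sketch (SafeEnumerateFiles is a made-up helper name, not a framework method):

    public static IEnumerable<string> SafeEnumerateFiles(string root, string pattern)
    {
        var pending = new Stack<string>();
        pending.Push(root);
        while (pending.Count > 0)
        {
            string dir = pending.Pop();
            string[] files = null;
            try
            {
                files = Directory.GetFiles(dir, pattern);
                foreach (string sub in Directory.GetDirectories(dir))
                    pending.Push(sub);
            }
            catch (UnauthorizedAccessException)
            {
                // Skip directories this account cannot read.
            }
            if (files != null)
                foreach (string f in files)
                    yield return f;
        }
    }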

.net fastest way to find all files matching a pattern in all directories

http://msdn.microsoft.com/en-us/library/ms143316(v=vs.110).aspx

Directory.GetFiles(@"C:\Sketch", "*.ax5", SearchOption.AllDirectories);

Might be good enough for you?


As for performance, I doubt you will find any significantly faster way to scan directories, since, as @Mathew Foscarini points out, your disks are the bottleneck here.

If the directory is indexed, then it would be faster to use the index, as @jaccus mentions.


I took time to benchmark things a little, and it does actually seem like you're able to get about a 33% performance gain by collecting files in an async way.

The test set I ran on might not match your situation; I don't know how nested your files are, etc. What I did was create 5,000 random files in each directory on every level (I settled for a single level, though) and 100 directories, amounting to 505,000 files.

I tested 3 methods of collecting files...

The simplest approach.

    public class SimpleFileCollector
    {
        public List<string> CollectFiles(DirectoryInfo directory, string pattern)
        {
            return new List<string>(Directory.GetFiles(directory.FullName, pattern, SearchOption.AllDirectories));
        }
    }

The "Dumb" approach, although this is only dumb if you know of the overload used in the Simple approach... Otherwise this is a perfectly fine solution.

    public class DumbFileCollector
    {
        public List<string> CollectFiles(DirectoryInfo directory, string pattern)
        {
            List<string> files = new List<string>(500000);
            files.AddRange(directory.GetFiles(pattern).Select(file => file.FullName));

            foreach (DirectoryInfo dir in directory.GetDirectories())
            {
                files.AddRange(CollectFiles(dir, pattern));
            }
            return files;
        }
    }

The Task API Approach...

    public class ThreadedFileCollector
    {
        public List<string> CollectFiles(DirectoryInfo directory, string pattern)
        {
            ConcurrentQueue<string> queue = new ConcurrentQueue<string>();
            InternalCollectFiles(directory, pattern, queue);
            return queue.AsEnumerable().ToList();
        }

        private void InternalCollectFiles(DirectoryInfo directory, string pattern, ConcurrentQueue<string> queue)
        {
            foreach (string result in directory.GetFiles(pattern).Select(file => file.FullName))
            {
                queue.Enqueue(result);
            }

            Task.WaitAll(directory
                .GetDirectories()
                .Select(dir => Task.Factory.StartNew(() => InternalCollectFiles(dir, pattern, queue)))
                .ToArray());
        }
    }

This is only a test of collecting all the files, not processing them; the processing would make sense to kick off to threads.
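
If you did want to fold the processing in, one way (a sketch only; ProcessFile is a stand-in for your own per-file work, and the path is made up) is to hand the collected list to Parallel.ForEach:

    // Hypothetical follow-up step once collection has finished.
    List<string> files = new ThreadedFileCollector()
        .CollectFiles(new DirectoryInfo(@"c:\data"), "*.*");
    Parallel.ForEach(files, file => ProcessFile(file));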

Here are the results on my system:

Simple Collector:
- Pass 0: found 505000 files in 2847 ms
- Pass 1: found 505000 files in 2865 ms
- Pass 2: found 505000 files in 2860 ms
- Pass 3: found 505000 files in 3061 ms
- Pass 4: found 505000 files in 3006 ms
- Pass 5: found 505000 files in 2807 ms
- Pass 6: found 505000 files in 2849 ms
- Pass 7: found 505000 files in 2789 ms
- Pass 8: found 505000 files in 2790 ms
- Pass 9: found 505000 files in 2788 ms
Average: 2866 ms

Dumb Collector:
- Pass 0: found 505000 files in 5190 ms
- Pass 1: found 505000 files in 5204 ms
- Pass 2: found 505000 files in 5453 ms
- Pass 3: found 505000 files in 5311 ms
- Pass 4: found 505000 files in 5339 ms
- Pass 5: found 505000 files in 5362 ms
- Pass 6: found 505000 files in 5316 ms
- Pass 7: found 505000 files in 5319 ms
- Pass 8: found 505000 files in 5583 ms
- Pass 9: found 505000 files in 5197 ms
Average: 5327 ms

Threaded Collector:
- Pass 0: found 505000 files in 2152 ms
- Pass 1: found 505000 files in 2102 ms
- Pass 2: found 505000 files in 2022 ms
- Pass 3: found 505000 files in 2030 ms
- Pass 4: found 505000 files in 2075 ms
- Pass 5: found 505000 files in 2120 ms
- Pass 6: found 505000 files in 2030 ms
- Pass 7: found 505000 files in 1980 ms
- Pass 8: found 505000 files in 1993 ms
- Pass 9: found 505000 files in 2120 ms
Average: 2062 ms

As a side note, @Konrad Kokosa suggested blocking on each directory to make sure you don't kick off millions of threads; don't do that...

There is no reason for you to manage how many threads will be active at a given time. Let the Task framework's standard scheduler handle that; it will do a much better job of balancing the number of threads based on the number of cores you have...

And if you really want to control it yourself just because, implementing a custom scheduler would be a better option: http://msdn.microsoft.com/en-us/library/system.threading.tasks.taskscheduler(v=vs.110).aspx
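
As an aside (not from the original answer): on .NET 4.5 or later you may not even need to write the scheduler yourself, since ConcurrentExclusiveSchedulerPair can cap concurrency for you. A minimal sketch:

    // Limit to at most 4 tasks running concurrently, without custom scheduler code.
    var limited = new ConcurrentExclusiveSchedulerPair(TaskScheduler.Default, 4)
                      .ConcurrentScheduler;
    var factory = new TaskFactory(limited);
    Task work = factory.StartNew(() => Console.WriteLine("runs under the capped scheduler"));
    work.Wait();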

How to recursively list all the files in a directory in C#?

This article covers all you need; except, instead of searching the files and comparing names, just print out the names.

It can be modified like so:

    static void DirSearch(string sDir)
    {
        try
        {
            foreach (string d in Directory.GetDirectories(sDir))
            {
                foreach (string f in Directory.GetFiles(d))
                {
                    Console.WriteLine(f);
                }
                DirSearch(d);
            }
        }
        catch (System.Exception excpt)
        {
            Console.WriteLine(excpt.Message);
        }
    }

Added by barlop

GONeale mentions that the above doesn't list the files in the current directory and suggests putting the file-listing part outside the part that gets directories. The following would do that. It also includes a commented-out WriteLine that you can uncomment to trace where you are in the recursion, which helps show how the calls unfold.

            DirSearch_ex3("c:\\aaa");
static void DirSearch_ex3(string sDir)
{
//Console.WriteLine("DirSearch..(" + sDir + ")");
try
{
Console.WriteLine(sDir);

foreach (string f in Directory.GetFiles(sDir))
{
Console.WriteLine(f);
}

foreach (string d in Directory.GetDirectories(sDir))
{
DirSearch_ex3(d);
}
}
catch (System.Exception excpt)
{
Console.WriteLine(excpt.Message);
}
}

