Benchmarks: Does Python Have a Faster Way of Walking a Network Folder

The Ruby implementation of Dir is written in C (the file dir.c, according to the documentation). The Python equivalent, os.walk, however, is implemented in pure Python.

It's not surprising that the Python version is slower than the C one, but the Python approach gives you a little more flexibility: for example, you can skip entire subtrees named '.svn', '.git', or '.hg' while traversing a directory hierarchy.

Most of the time, the Python implementation is fast enough.

Update: Skipping files/subdirs doesn't affect the traversal rate itself, but the overall time taken to process a directory tree can certainly be reduced, because you avoid traversing potentially large subtrees of the main tree. The time saved is, of course, proportional to how much you skip. In your case, which looks like folders of images, it's unlikely you would save much time (unless the images are under revision control, in which case skipping the subtrees owned by the revision control system might have some impact).

Additional update: Skipping folders is done by modifying the dirs list in place:

import os

for root, dirs, files in os.walk(path):
    for skip in ('.hg', '.git', '.svn', '.bzr'):
        if skip in dirs:
            dirs.remove(skip)
    # Now process other stuff at this level, i.e.
    # in directory "root". The skipped folders
    # won't be recursed into.

A Faster way of Directory walking instead of os.listdir?

I was just trying to figure out how to speed up os.walk on a largish file system (350,000 files spread out within around 50,000 directories). I'm on a Linux box using an ext3 file system. I discovered that there is a way to speed this up for MY case.

Specifically, using a top-down walk, any time os.walk returns a list of more than one directory, I use os.stat to get the inode number of each directory and sort the directory list by inode number. This makes the walk visit the subdirectories mostly in inode order, which reduces disk seeks.

For my use case, it sped up my complete directory walk from 18 minutes down to 13 minutes...
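
Here is a minimal sketch of that idea, assuming a top-down os.walk; the wrapper name walk_inode_sorted is mine, since the original description didn't include code:

import os

def walk_inode_sorted(top):
    # Sketch: at each level, sort the subdirectories by inode number so that
    # os.walk recurses in roughly on-disk order, which reduces disk seeks on
    # ext3-style filesystems. Sorting dirs in place changes the walk order.
    for root, dirs, files in os.walk(top):
        dirs.sort(key=lambda d: os.stat(os.path.join(root, d)).st_ino)
        yield root, dirs, files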

os.walk/scandir slow on network drive

You are double-scanning every path: once implicitly via walk, and then again by explicitly calling scandir on the path walk just returned, for no benefit. walk has already returned the files, so the inner loop can avoid the double scan by just using what it was given:

import os
from scandir import walk  # assumption: plain os.walk works identically here

def subdirs(path):
    for path, folders, files in walk(path):
        for file in files:
            if '.xlsm' in file:
                yield os.path.join(path, file)

To address the updated question, you'll probably want to either copy the existing scandir.walk code and modify it to return lists of DirEntrys instead of lists of names, or write similar special-cased code for your specific needs; either way, this will let you avoid double-scanning while keeping scandir's low-overhead behavior. For example:

import scandir

def scanwalk(path, followlinks=False):
    '''Simplified scandir.walk; yields lists of DirEntries instead of lists of str'''
    dirs, nondirs = [], []
    for entry in scandir.scandir(path):
        if entry.is_dir(follow_symlinks=followlinks):
            dirs.append(entry)
        else:
            nondirs.append(entry)
    yield path, dirs, nondirs
    for dir in dirs:
        for res in scanwalk(dir.path, followlinks=followlinks):
            yield res

You can then replace your use of walk with it like this (I also added code that prunes directories with Test in their names, since all directories and files under them would have been rejected by your original code anyway, yet you'd still have traversed them unnecessarily):

def subdirs(path):
    # Full prune if the path already contains Test
    if 'Test' in path:
        return
    for path, folders, files in scanwalk(path):
        # Remove any directory with Test to prevent traversal
        folders[:] = [d for d in folders if 'Test' not in d.name]
        for file in files:
            if '.xlsm' in file.path:
                yield file.stat()  # Maybe just yield file to get raw DirEntry?

for i in subdirs('O:\\'):
    print i

BTW, you may want to double check that you've properly installed/built the C accelerator for scandir, _scandir. If _scandir isn't built, the scandir module provides fallback implementations using ctypes, but they're significantly slower, which could explain performance problems. Try running import _scandir in an interactive Python session; if it raises ImportError, then you don't have the accelerator, so you're using the slow fallback implementation.
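
If you'd rather check from a script than an interactive session, a quick sketch (the printed messages are just placeholders of mine):

try:
    import _scandir  # C accelerator shipped with the scandir package
    print("_scandir accelerator is available")
except ImportError:
    print("_scandir not built; scandir is using the slower ctypes fallback")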

Quicker to os.walk or glob?

I did some benchmarking on a small cache of web pages spread across 1000 directories. The task was to count the total number of files in those directories. The output is:

os.listdir: 0.7268s, 1326786 files found
os.walk: 3.6592s, 1326787 files found
glob.glob: 2.0133s, 1326786 files found

As you can see, os.listdir is the quickest of the three, and glob.glob is still quicker than os.walk for this task.

The source:

import os, time, glob

n, t = 0, time.time()
for i in range(1000):
    n += len(os.listdir("./%d" % i))
t = time.time() - t
print "os.listdir: %.4fs, %d files found" % (t, n)

n, t = 0, time.time()
for root, dirs, files in os.walk("./"):
    for file in files:
        n += 1
t = time.time() - t
print "os.walk: %.4fs, %d files found" % (t, n)

n, t = 0, time.time()
for i in range(1000):
    n += len(glob.glob("./%d/*" % i))
t = time.time() - t
print "glob.glob: %.4fs, %d files found" % (t, n)

Most efficient way to traverse file structure Python

Yes, using os.walk is indeed the best way to do that.
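
For reference, a minimal os.walk traversal looks something like this (root_dir is just a placeholder for whatever directory you want to start from):

import os

root_dir = '.'  # placeholder: start from the current directory
for root, dirs, files in os.walk(root_dir):
    for name in files:
        print(os.path.join(root, name))  # full path of every file in the tree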


