Benchmarks: Does Python have a faster way of walking a network folder?
Ruby's Dir is implemented in C (the file dir.c, according to this documentation), whereas the Python equivalent is implemented in Python.
It's not surprising that the Python version is slower than C, but the Python approach gives a little more flexibility: for example, you could skip entire subtrees named '.svn', '.git', or '.hg' while traversing a directory hierarchy.
Most of the time, the Python implementation is fast enough.
Update: The skipping of files/subdirs doesn't affect the traversal rate at all, but the overall time taken to process a directory tree could certainly be reduced because you avoid having to traverse potentially large subtrees of the main tree. The time saved is of course proportional to how much you skip. In your case, which looks like folders of images, it's unlikely you would save much time (unless the images were under revision control, when skipping subtrees owned by the revision control system might have some impact).
Additional update: Skipping folders is done by modifying the dirs list in place:
import os

for root, dirs, files in os.walk(path):
    for skip in ('.hg', '.git', '.svn', '.bzr'):
        if skip in dirs:
            dirs.remove(skip)
    # Now process other stuff at this level, i.e.
    # in directory "root". The skipped folders
    # won't be recursed into.
A Faster way of Directory walking instead of os.listdir?
I was just trying to figure out how to speed up os.walk on a largish file system (350,000 files spread across around 50,000 directories). I'm on a Linux box using an ext3 file system. I discovered that there is a way to speed this up for MY case.
Specifically, using a top-down walk, any time os.walk returns a list of more than one directory, I use os.stat to get the inode number of each directory and sort the directory list by inode number. This makes walk mostly visit the subdirectories in inode order, which reduces disk seeks.
For my use case, it sped up my complete directory walk from 18 minutes down to 13 minutes...
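The inode-ordering trick described above can be sketched roughly as follows. This is a minimal illustration, not the original poster's code: the names walk_in_inode_order and _inode are hypothetical, and whether it helps at all depends on the file system and storage (the gain was reported for ext3 on a local disk).

```python
import os


def _inode(path):
    """Return the inode number of path, or 0 if it cannot be stat'ed."""
    try:
        return os.stat(path).st_ino
    except OSError:
        return 0


def walk_in_inode_order(top):
    """Top-down os.walk that recurses into subdirectories in inode order."""
    for root, dirs, files in os.walk(top):
        if len(dirs) > 1:
            # Sorting in place changes the order in which os.walk
            # descends into the subdirectories of root.
            dirs.sort(key=lambda d: _inode(os.path.join(root, d)))
        yield root, dirs, files
```

It is a drop-in replacement for os.walk(top) in a top-down loop; timing both variants on your own tree is the only way to know whether the extra os.stat calls pay for themselves.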
os.walk/scandir slow on network drive
You are double-scanning every path: once implicitly via walk, then again by explicitly re-running scandir on each path walk returned, for no reason. walk already returned the files, so the inner loop can avoid the double scan by just using what it was given:
import os
from scandir import walk  # assumed import; os.walk behaves the same way

def subdirs(path):
    for path, folders, files in walk(path):
        for file in files:
            if '.xlsm' in file:
                yield os.path.join(path, file)
To address the updated question, you'll probably want to either copy the existing scandir.walk code and modify it to return lists of DirEntry objects instead of lists of names, or write similar special-cased code for your specific needs. Either way, this lets you avoid double-scanning while keeping scandir's low-overhead behavior. For example:
import scandir

def scanwalk(path, followlinks=False):
    '''Simplified scandir.walk; yields lists of DirEntries instead of lists of str'''
    dirs, nondirs = [], []
    for entry in scandir.scandir(path):
        if entry.is_dir(follow_symlinks=followlinks):
            dirs.append(entry)
        else:
            nondirs.append(entry)
    yield path, dirs, nondirs
    # The caller may prune dirs in place before we recurse into it
    for dir in dirs:
        for res in scanwalk(dir.path, followlinks=followlinks):
            yield res
You can then replace your use of walk with it like this (I also added code that prunes directories with Test in their names, since all directories and files under them would have been rejected by your original code, but you'd still traverse them unnecessarily):
def subdirs(path):
    # Full prune if the path already contains Test
    if 'Test' in path:
        return
    for path, folders, files in scanwalk(path):
        # Remove any directory with Test to prevent traversal
        folders[:] = [d for d in folders if 'Test' not in d.name]
        for file in files:
            if '.xlsm' in file.path:
                yield file.stat()  # Maybe just yield file to get raw DirEntry?

for i in subdirs('O:\\'):
    print(i)
BTW, you may want to double-check that you've properly installed/built the C accelerator for scandir, _scandir. If _scandir isn't built, the scandir module provides fallback implementations using ctypes, but they're significantly slower, which could explain the performance problems. Try running import _scandir in an interactive Python session; if it raises ImportError, you don't have the accelerator and are using the slow fallback implementation.
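The import check can also be scripted rather than typed into an interactive session. A minimal sketch, where has_scandir_accelerator is a hypothetical helper name; note that on Python 3.5 and later os.scandir is built in, so a missing _scandir module there does not by itself mean the walk is slow.

```python
def has_scandir_accelerator():
    """Return True if the compiled _scandir extension can be imported."""
    try:
        import _scandir  # noqa: F401  (only the import itself matters)
    except ImportError:
        return False
    return True


print("C accelerator available: %s" % has_scandir_accelerator())
```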
Quicker to os.walk or glob?
I ran a test on a small cache of web pages in 1000 directories. The task was to count the total number of files in those directories. The output is:
os.listdir: 0.7268s, 1326786 files found
os.walk: 3.6592s, 1326787 files found
glob.glob: 2.0133s, 1326786 files found
As you can see, os.listdir is the quickest of the three, and glob.glob is still quicker than os.walk for this task.
The source:
import os, time, glob

# os.listdir: list each of the 1000 directories directly
n, t = 0, time.time()
for i in range(1000):
    n += len(os.listdir("./%d" % i))
t = time.time() - t
print("os.listdir: %.4fs, %d files found" % (t, n))

# os.walk: recurse from the current directory
n, t = 0, time.time()
for root, dirs, files in os.walk("./"):
    for file in files:
        n += 1
t = time.time() - t
print("os.walk: %.4fs, %d files found" % (t, n))

# glob.glob: expand a wildcard in each of the 1000 directories
n, t = 0, time.time()
for i in range(1000):
    n += len(glob.glob("./%d/*" % i))
t = time.time() - t
print("glob.glob: %.4fs, %d files found" % (t, n))
Most efficient way to traverse file structure Python
Yes, using os.walk is indeed the best way to do that.
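For reference, a minimal sketch of such a traversal. The function name find_files and the suffix filter are illustrative assumptions, not part of the original answer.

```python
import os


def find_files(top, suffix):
    """Yield full paths of files under top whose names end with suffix."""
    for root, dirs, files in os.walk(top):
        for name in files:
            if name.endswith(suffix):
                yield os.path.join(root, name)
```

A call such as list(find_files('.', '.py')) would collect every matching path below the current directory.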