Calculating a directory's size using Python?
This walks all sub-directories, summing file sizes:
import os

def get_size(start_path='.'):
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            # skip if it is symbolic link
            if not os.path.islink(fp):
                total_size += os.path.getsize(fp)
    return total_size

print(get_size(), 'bytes')
And a one-liner for fun using os.listdir (does not include sub-directories):
import os
sum(os.path.getsize(f) for f in os.listdir('.') if os.path.isfile(f))
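A recursive equivalent of that one-liner can be built on top of os.walk (a sketch following the same symlink-skipping pattern as the function above):

```python
import os

# Sum the size of every regular file under the tree rooted at '.',
# skipping symbolic links just like get_size() above
total = sum(
    os.path.getsize(os.path.join(dirpath, f))
    for dirpath, dirnames, filenames in os.walk('.')
    for f in filenames
    if not os.path.islink(os.path.join(dirpath, f))
)
print(total, 'bytes')
```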
Reference:
- os.path.getsize - Gives the size in bytes
- os.walk
- os.path.islink
Updated
Updated to use os.path.getsize, which is clearer than the os.stat().st_size method. Thanks to ghostdog74 for pointing this out!
os.stat - st_size gives the size in bytes. It can also be used to get the file size and other file-related information.
import os
nbytes = sum(d.stat().st_size for d in os.scandir('.') if d.is_file())
Update 2018
If you use Python 3.4 or earlier, you may consider using the more efficient walk method provided by the third-party scandir package. In Python 3.5 and later, this package has been incorporated into the standard library, and os.walk has received the corresponding increase in performance.
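A drop-in import pattern for that is sketched below; it assumes the backport is installed (`pip install scandir`) on older interpreters and falls back to the stdlib otherwise:

```python
import os

try:
    from scandir import walk  # third-party backport for Python <= 3.4
except ImportError:
    walk = os.walk  # Python 3.5+: os.walk is already scandir-based

def get_size(start_path='.'):
    # Same summing logic as before; only the walk implementation differs
    return sum(
        os.path.getsize(os.path.join(dirpath, f))
        for dirpath, dirnames, filenames in walk(start_path)
        for f in filenames
        if not os.path.islink(os.path.join(dirpath, f))
    )

print(get_size(), 'bytes')
```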
Update 2019
Recently I've been using pathlib more and more, so here's a pathlib solution:
from pathlib import Path
root_directory = Path('.')
sum(f.stat().st_size for f in root_directory.glob('**/*') if f.is_file())
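Path.rglob('*') behaves the same as glob('**/*') and reads slightly cleaner (a sketch):

```python
from pathlib import Path

root_directory = Path('.')
# rglob('*') is shorthand for glob('**/*'): recurse and yield every entry
total = sum(f.stat().st_size for f in root_directory.rglob('*') if f.is_file())
print(total, 'bytes')
```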
Most efficient way to determine the size of a directory in Python
The following snippet should be optimal on all operating systems and handle any folder structure you throw at it. Memory usage will obviously grow with the number of folders you encounter, but to my knowledge there's nothing you can really do about that, as you somehow have to keep track of where you still need to go.
import os

def get_tree_size(path):
    total_size = 0
    dirs = [path]
    while dirs:
        next_dir = dirs.pop()
        with os.scandir(next_dir) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    dirs.append(entry.path)
                else:
                    total_size += entry.stat(follow_symlinks=False).st_size
    return total_size
It's possible that using a collections.deque may speed up operations versus the naive usage of a list here, but I suspect it would be hard to write a benchmark showing this with disk speeds what they are today.
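For comparison, the same traversal with collections.deque would look like this (a sketch; whether it is actually faster would need benchmarking, as noted above):

```python
import os
from collections import deque

def get_tree_size_deque(path):
    """Iterative os.scandir traversal using a deque instead of a list."""
    total_size = 0
    dirs = deque([path])
    while dirs:
        next_dir = dirs.popleft()  # FIFO order; use pop() for the original LIFO order
        with os.scandir(next_dir) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    dirs.append(entry.path)
                else:
                    total_size += entry.stat(follow_symlinks=False).st_size
    return total_size

print(get_tree_size_deque('.'), 'bytes')
```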
very quickly getting total size of folder
You are at a disadvantage.
Windows Explorer almost certainly uses FindFirstFile/FindNextFile to both traverse the directory structure and collect size information (through lpFindFileData) in one pass, making what is essentially a single system call per file.
Python is unfortunately not your friend in this case. Thus:
- os.walk first calls os.listdir (which internally calls FindFirstFile/FindNextFile) - any additional system calls made from this point onward can only make you slower than Windows Explorer
- os.walk then calls isdir for each file returned by os.listdir (which internally calls GetFileAttributesEx -- or, prior to Win2k, a GetFileAttributes + FindFirstFile combo) to redetermine whether to recurse or not
- os.walk and os.listdir will perform additional memory allocation, string and array operations etc. to fill out their return value
- you then call getsize for each file returned by os.walk (which again calls GetFileAttributesEx)
That is 3x more system calls per file than Windows Explorer, plus memory allocation and manipulation overhead.
You can either use Anurag's solution, or try to call FindFirstFile/FindNextFile directly and recursively (which should be comparable to the performance of a cygwin or other win32 port of du -s some_directory).
Refer to os.py for the implementation of os.walk, and to posixmodule.c for the implementation of listdir and win32_stat (invoked by both isdir and getsize).
Note that Python's os.walk is suboptimal on all platforms (Windows and *nixes), up to and including Python 3.1. On both Windows and *nixes, os.walk could achieve traversal in a single pass without calling isdir, since both FindFirst/FindNext (Windows) and opendir/readdir (*nix) already return the file type via lpFindFileData->dwFileAttributes (Windows) and dirent::d_type (*nix).
Perhaps counterintuitively, on most modern configurations (e.g. Win7 and NTFS, and even some SMB implementations) GetFileAttributesEx is twice as slow as FindFirstFile of a single file (and possibly even slower than iterating over a directory with FindNextFile).
Update: Python 3.5 includes the new PEP 471 os.scandir() function, which solves this problem by returning file attributes along with the filename. This new function is used to speed up the built-in os.walk() (on both Windows and Linux). You can use the scandir module on PyPI to get this behavior for older Python versions, including 2.x.
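To illustrate the point, DirEntry objects from os.scandir() cache the attribute data the OS already returned with the directory listing, so is_file() and stat() usually need no extra system call on Windows (a sketch):

```python
import os

def sizes_one_pass(path):
    # DirEntry.is_file() and DirEntry.stat() are typically answered from the
    # FindFirstFile/FindNextFile data on Windows, avoiding per-file
    # GetFileAttributesEx calls
    with os.scandir(path) as it:
        return [(entry.name, entry.stat(follow_symlinks=False).st_size)
                for entry in it
                if entry.is_file(follow_symlinks=False)]

print(sizes_one_pass('.'))
```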
How to generate directory size recursively in python, like du . does?
Have a look at os.walk. Specifically, the documentation has an example to find the size of a directory:
import os
from os.path import join, getsize

for root, dirs, files in os.walk('python/Lib/email'):
    print(root, "consumes", end=" ")
    print(sum(getsize(join(root, name)) for name in files), end=" ")
    print("bytes in", len(files), "non-directory files")
    if 'CVS' in dirs:
        dirs.remove('CVS')  # don't visit CVS directories
This should be easy enough to modify for your purposes.
Here's an untested version in response to your comment:
import os
from os.path import join, getsize

dirs_dict = {}
# We need to walk the tree from the bottom up so that a directory can have
# easy access to the size of its subdirectories.
for root, dirs, files in os.walk('python/Lib/email', topdown=False):
    # Loop through every non-directory file in this directory and sum their sizes
    size = sum(getsize(join(root, name)) for name in files)
    # Look at all of the subdirectories and add up their sizes from `dirs_dict`
    subdir_size = sum(dirs_dict[join(root, d)] for d in dirs)
    # Store the size of this directory (plus subdirectories) in a dict so we
    # can access it later
    my_size = dirs_dict[root] = size + subdir_size
    print('%s: %d' % (root, my_size))
How do I get the size of sub directory from a directory in python?
To print the size of each immediate subdirectory and the total size for the parent directory, similar to the du -bcs */ command:
#!/usr/bin/env python3.6
"""Usage: du-bcs <parent-dir>"""
import os
import sys

if len(sys.argv) != 2:
    sys.exit(__doc__)  # print usage
parent_dir = sys.argv[1]

total = 0
for entry in os.scandir(parent_dir):
    if entry.is_dir(follow_symlinks=False):  # directory
        size = get_tree_size_scandir(entry)
        # print the size of each immediate subdirectory
        print(size, entry.name, sep='\t')
    elif entry.is_file(follow_symlinks=False):  # regular file
        size = entry.stat(follow_symlinks=False).st_size
    else:
        continue
    total += size
print(total, parent_dir, sep='\t')  # print the total size for the parent dir
where get_tree_size_scandir() is defined in the linked reference [text in Russian, code in Python, C, C++, bash].
The size of a directory here is the apparent size of all regular files in it and its subdirectories, recursively. It doesn't count the size of the directory entries themselves or the actual disk usage for the files. Related: why is the output of du often so different from du -b?
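The distinction can be observed directly with os.stat: st_size is the apparent size (what du -b reports), while st_blocks, available on POSIX systems, counts the 512-byte blocks actually allocated (what plain du reports). A minimal sketch:

```python
import os
import tempfile

# Create a small example file to compare apparent size vs. disk allocation
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b'hello')
    name = f.name

st = os.stat(name)
apparent = st.st_size            # apparent size in bytes (du -b)
allocated = st.st_blocks * 512   # bytes actually allocated on disk (du; POSIX only)
print(apparent, allocated)
os.remove(name)
```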
How to check size of the files in a directory with python?
To get all of the files in a directory, you can use os.listdir.
>>> import os
>>> basedir = 'tmp/example'
>>> names = os.listdir(basedir)
>>> names
['a', 'b', 'c']
Then you need to add basedir on to the names:
>>> paths = [os.path.join(basedir, name) for name in names]
>>> paths
['tmp/example/a', 'tmp/example/b', 'tmp/example/c']
Then you can turn that into a list of pairs of (name, size) using os.stat(path).st_size (the example files I've created are empty):
>>> sizes = [(path, os.stat(path).st_size) for path in paths]
>>> sizes
[('tmp/example/a', 0), ('tmp/example/b', 0), ('tmp/example/c', 0)]
Then you can group the paths with the same size together by using a collections.defaultdict:
>>> import collections
>>> grouped = collections.defaultdict(list)
>>> for path, size in sizes:
... grouped[size].append(path)
...
>>> grouped
defaultdict(<type 'list'>, {0: ['tmp/example/a', 'tmp/example/b', 'tmp/example/c']})
Now you can get all of the files by size, and open them all (don't forget to close them afterwards!):
>>> open_files = [open(path) for path in grouped[0]]
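One way to guarantee the cleanup is contextlib.ExitStack, which closes every file in the group automatically (a sketch that rebuilds a small grouped dict so it stands alone):

```python
import collections
import contextlib
import os
import tempfile

# Build a tiny size -> paths grouping like the one above
basedir = tempfile.mkdtemp()
for name in ('a', 'b'):
    open(os.path.join(basedir, name), 'w').close()  # two empty files

grouped = collections.defaultdict(list)
for name in os.listdir(basedir):
    path = os.path.join(basedir, name)
    grouped[os.stat(path).st_size].append(path)

# ExitStack guarantees every file opened in the group is closed on exit
with contextlib.ExitStack() as stack:
    open_files = [stack.enter_context(open(p)) for p in grouped[0]]
    print(len(open_files), 'files open')
# here all of them have been closed again
```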
Measure directory size with Python
The last two lines in your program, where you call the function, are indented and are therefore considered part of the function, so they will not execute. Simply dedent them.
How to calculate a Directory size in ADLS using PySpark?
The dbutils.fs.ls doesn't have a recurse functionality like cp, mv or rm. Thus, you need to iterate yourself. Here is a snippet that will do the task for you. Run the code from a Databricks Notebook.
from dbutils import FileInfo
from typing import List

root_path = "/mnt/datalake/.../XYZ"

def discover_size(path: str, verbose: bool = True):
    def loop_path(paths: List[FileInfo], accum_size: float):
        if not paths:
            return accum_size
        else:
            head, tail = paths[0], paths[1:]
            if head.size > 0:
                if verbose:
                    print(f"{head.path}: {head.size / 1e6} MB")
                accum_size += head.size / 1e6
                return loop_path(tail, accum_size)
            else:
                extended_tail = dbutils.fs.ls(head.path) + tail
                return loop_path(extended_tail, accum_size)
    return loop_path(dbutils.fs.ls(path), 0.0)

discover_size(root_path, verbose=True)  # Total size in megabytes at the end
If the location is mounted in DBFS, then you could use the du -h approach (I have not tested it). If you are in the Notebook, create a new cell with:
%sh
du -h /mnt/datalake/.../XYZ
How to calculate size of immediate subfolders of a folder using os.walk()
I made this finally and it works fine - it prints the size of each immediate subfolder:
import os
from pathlib import Path

root = '/dbfs/mnt/datalake/.../'

for f in Path(root).iterdir():
    if f.is_dir():
        size = 0
        for path, subdirs, files in os.walk(f):
            for name in files:
                size += os.path.getsize(os.path.join(path, name))
        dirSize = size / 1048576  # bytes to MiB
        print(f, "--Size:", dirSize)