Calculating a Directory's Size Using Python

Calculating a directory's size using Python?

This walks all sub-directories and sums the file sizes:

import os

def get_size(start_path='.'):
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            # skip if it is a symbolic link
            if not os.path.islink(fp):
                total_size += os.path.getsize(fp)

    return total_size

print(get_size(), 'bytes')

And a one-liner for fun, using os.listdir (it does not include sub-directories):

import os
sum(os.path.getsize(f) for f in os.listdir('.') if os.path.isfile(f))

Reference:

  • os.path.getsize - Gives the size in bytes
  • os.walk
  • os.path.islink

Updated to use os.path.getsize; this is clearer than using the os.stat().st_size method.

Thanks to ghostdog74 for pointing this out!

os.stat - st_size gives the size in bytes, and can also be used to get the file size and other file-related information.

import os

nbytes = sum(d.stat().st_size for d in os.scandir('.') if d.is_file())
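
For comparison, here is roughly what the same top-level sum looks like when written with os.stat() directly (a small sketch, not from the original answer):

import os

nbytes = sum(
    os.stat(os.path.join('.', name)).st_size
    for name in os.listdir('.')
    if os.path.isfile(os.path.join('.', name))
)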

Update 2018

If you use Python 3.4 or earlier, you may consider using the more efficient walk method provided by the third-party scandir package. In Python 3.5 and later, this package has been incorporated into the standard library and os.walk has received the corresponding increase in performance.
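
For example, a minimal sketch of the drop-in swap for Python 3.4 or earlier, assuming the third-party package has been installed with pip install scandir:

import os

try:
    from scandir import walk   # third-party scandir package (Python <= 3.4)
except ImportError:
    from os import walk        # Python 3.5+: os.walk already uses os.scandir

def get_size(start_path='.'):
    total_size = 0
    for dirpath, dirnames, filenames in walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            if not os.path.islink(fp):
                total_size += os.path.getsize(fp)
    return total_size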

Update 2019

Recently I've been using pathlib more and more; here's a pathlib solution:

from pathlib import Path

root_directory = Path('.')
sum(f.stat().st_size for f in root_directory.glob('**/*') if f.is_file())

Most efficient way to determine the size of a directory in Python

The following snippet should be optimal on all operating systems and handle any folder structure you throw at it. Memory usage will obviously grow with the number of folders you encounter, but to my knowledge there's not much you can do about that, since you have to keep track of where you still need to go somehow.

import os

def get_tree_size(path):
    total_size = 0
    dirs = [path]
    while dirs:
        next_dir = dirs.pop()
        with os.scandir(next_dir) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    dirs.append(entry.path)
                else:
                    total_size += entry.stat(follow_symlinks=False).st_size
    return total_size

It's possible that using a collections.deque may speed up operations versus the naive use of a list here, but I suspect it would be hard to write a benchmark showing it, given how fast disks are today.
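
For what it's worth, a sketch of the same traversal with a deque instead of a list (any speedup is unverified):

import os
from collections import deque

def get_tree_size_deque(path):
    total_size = 0
    dirs = deque([path])
    while dirs:
        next_dir = dirs.popleft()  # FIFO order; pop() would give the same total
        with os.scandir(next_dir) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    dirs.append(entry.path)
                else:
                    total_size += entry.stat(follow_symlinks=False).st_size
    return total_size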

Very quickly getting total size of folder

You are at a disadvantage.

Windows Explorer almost certainly uses FindFirstFile/FindNextFile to both traverse the directory structure and collect size information (through lpFindFileData) in one pass, making what is essentially a single system call per file.

Python is unfortunately not your friend in this case. Thus,

  1. os.walk first calls os.listdir (which internally calls FindFirstFile/FindNextFile)

    • any additional system calls made from this point onward can only make you slower than Windows Explorer
  2. os.walk then calls isdir for each file returned by os.listdir (which internally calls GetFileAttributesEx -- or, prior to Win2k, a GetFileAttributes+FindFirstFile combo) to redetermine whether to recurse or not
  3. os.walk and os.listdir will perform additional memory allocation, string and array operations etc. to fill out their return value
  4. you then call getsize for each file returned by os.walk (which again calls GetFileAttributesEx)

That is 3x more system calls per file than Windows Explorer, plus memory allocation and manipulation overhead.

You can either use Anurag's solution, or try to call FindFirstFile/FindNextFile directly and recursively (which should be comparable in performance to a cygwin or other win32 port of du -s some_directory).

Refer to os.py for the implementation of os.walk, and to posixmodule.c for the implementation of listdir and win32_stat (invoked by both isdir and getsize).

Note that Python's os.walk is suboptimal on all platforms (Windows and *nixes), up to and including Python 3.1. On both Windows and *nixes os.walk could achieve traversal in a single pass without calling isdir, since both FindFirst/FindNext (Windows) and opendir/readdir (*nix) already return the file type via lpFindFileData->dwFileAttributes (Windows) and dirent::d_type (*nix).

Perhaps counterintuitively, on most modern configurations (e.g. Win7 and NTFS, and even some SMB implementations) GetFileAttributesEx is twice as slow as FindFirstFile of a single file (possibly even slower than iterating over a directory with FindNextFile).

Update: Python 3.5 includes the new PEP 471 os.scandir() function that solves this problem by returning file attributes along with the filename. This new function is used to speed up the built-in os.walk() (on both Windows and Linux). You can use the scandir module on PyPI to get this behavior for older Python versions, including 2.x.
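
As a quick illustration of what os.scandir() gives you (top level only; on Windows the type and size come back with the directory listing itself, while on POSIX the size still costs one stat call per file):

import os

total = 0
for entry in os.scandir('.'):
    # entry is an os.DirEntry; is_file()/is_dir() use attributes cached during listing
    if entry.is_file(follow_symlinks=False):
        total += entry.stat(follow_symlinks=False).st_size
print(total, 'bytes in the top-level directory')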

How to generate directory size recursively in Python, like du . does?

Have a look at os.walk. Specifically, the documentation has an example to find the size of a directory:

import os
from os.path import join, getsize

for root, dirs, files in os.walk('python/Lib/email'):
    print(root, "consumes", end=" ")
    print(sum(getsize(join(root, name)) for name in files), end=" ")
    print("bytes in", len(files), "non-directory files")
    if 'CVS' in dirs:
        dirs.remove('CVS')  # don't visit CVS directories

This should be easy enough to modify for your purposes.


Here's an untested version in response to your comment:

import os
from os.path import join, getsize

dirs_dict = {}

# We need to walk the tree from the bottom up so that a directory can have
# easy access to the sizes of its subdirectories.
for root, dirs, files in os.walk('python/Lib/email', topdown=False):

    # Loop through every non-directory file in this directory and sum their sizes
    size = sum(getsize(join(root, name)) for name in files)

    # Look at all of the subdirectories and add up their sizes from `dirs_dict`
    subdir_size = sum(dirs_dict[join(root, d)] for d in dirs)

    # Store the size of this directory (plus subdirectories) in a dict so we
    # can access it later
    my_size = dirs_dict[root] = size + subdir_size

    print('%s: %d' % (root, my_size))

How do I get the size of a subdirectory from a directory in Python?

To print the size of each immediate subdirectory and the total size for the parent directory, similar to the du -bcs */ command:

#!/usr/bin/env python3.6
"""Usage: du-bcs <parent-dir>"""
import os
import sys

if len(sys.argv) != 2:
    sys.exit(__doc__)  # print usage

parent_dir = sys.argv[1]
total = 0
for entry in os.scandir(parent_dir):
    if entry.is_dir(follow_symlinks=False):  # directory
        size = get_tree_size_scandir(entry)
        # print the size of each immediate subdirectory
        print(size, entry.name, sep='\t')
    elif entry.is_file(follow_symlinks=False):  # regular file
        size = entry.stat(follow_symlinks=False).st_size
    else:
        continue
    total += size
print(total, parent_dir, sep='\t')  # print the total size for the parent dir

where get_tree_size_scandir() is a recursive helper defined in a separate post (text in Russian; code in Python, C, C++ and bash); a possible implementation is sketched below.
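
A minimal recursive sketch of what get_tree_size_scandir() might look like (assumed, not the referenced code):

import os

def get_tree_size_scandir(path):
    # Apparent size of all regular files under path, recursively.
    # os.DirEntry objects are path-like on Python 3.6+, so passing an entry works.
    total = 0
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            total += get_tree_size_scandir(entry.path)
        elif entry.is_file(follow_symlinks=False):
            total += entry.stat(follow_symlinks=False).st_size
    return total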

The size of a directory here is the apparent size of all regular files in it and its subdirectories recursively. It doesn't count the size for the directory entries themselves or the actual disk usage for the files. Related: why is the output of du often so different from du -b.
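
A rough way to see that difference on POSIX systems, using a hypothetical file name (st_blocks is counted in 512-byte units):

import os

st = os.stat('some_file')        # hypothetical file name
apparent_size = st.st_size       # what this script (and du -b) counts
disk_usage = st.st_blocks * 512  # closer to what plain du reports (POSIX only)
print(apparent_size, disk_usage)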

How to check the size of the files in a directory with Python?

To get all of the files in a directory, you can use os.listdir.

>>> import os
>>> basedir = 'tmp/example'
>>> names = os.listdir(basedir)
>>> names
['a', 'b', 'c']

Then you need to add basedir on to the names:

>>> paths = [os.path.join(basedir, name) for name in names]
>>> paths
['tmp/example/a', 'tmp/example/b', 'tmp/example/c']

Then you can turn that into a list of (name, size) pairs using os.stat(path).st_size (the example files I've created are empty):

>>> sizes = [(path, os.stat(path).st_size) for path in paths]
>>> sizes
[('tmp/example/a', 0), ('tmp/example/b', 0), ('tmp/example/c', 0)]

Then you can group the paths with the same size together by using a collections.defaultdict:

>>> import collections
>>> grouped = collections.defaultdict(list)
>>> for path, size in sizes:
... grouped[size].append(path)
...
>>> grouped
defaultdict(<type 'list'>, {0: ['tmp/example/a', 'tmp/example/b', 'tmp/example/c']})

Now you can get all of the files by size, and open them all (don't forget to close them afterwards!):

>>> open_files = [open(path) for path in grouped[0]]

Measure directory size with Python

The last two lines of your program, where you call the function, are indented and are therefore considered part of the function body, so they never execute. Simply dedent them.
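
For illustration, with a hypothetical get_size() function, the difference looks like this:

def get_size(path='.'):
    total = 0
    # ... sum file sizes here ...
    return total

    # Wrong: still indented inside the function, so it never runs
    # print(get_size('.'))

# Right: dedented back to module level
print(get_size('.'))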

How to calculate a Directory size in ADLS using PySpark?

dbutils.fs.ls doesn't have a recursive option the way cp, mv or rm do, so you need to iterate yourself. Here is a snippet that will do the task for you; run the code from a Databricks Notebook.

from dbutils import FileInfo
from typing import List

root_path = "/mnt/datalake/.../XYZ"

def discover_size(path: str, verbose: bool = True):
    def loop_path(paths: List[FileInfo], accum_size: float):
        if not paths:
            return accum_size
        else:
            head, tail = paths[0], paths[1:]
            if head.size > 0:
                if verbose:
                    print(f"{head.path}: {head.size / 1e6} MB")
                accum_size += head.size / 1e6
                return loop_path(tail, accum_size)
            else:
                extended_tail = dbutils.fs.ls(head.path) + tail
                return loop_path(extended_tail, accum_size)

    return loop_path(dbutils.fs.ls(path), 0.0)

discover_size(root_path, verbose=True)  # Total size in megabytes at the end

If the location is mounted in DBFS, you could also use the du -h approach (I have not tested it). In a notebook, create a new cell with:

%sh
du -h /mnt/datalake/.../XYZ

How to calculate size of immediate subfolders of a folder using os.walk()

I finally made this and it works fine:

import os
from pathlib import Path

root = '/dbfs/mnt/datalake/.../'

# For each immediate subfolder of root, walk it and sum the sizes of its files
for f in Path(root).iterdir():
    if f.is_dir():
        size = 0
        for path, subdirs, files in os.walk(f):
            size += sum(os.path.getsize(os.path.join(path, name)) for name in files)
        dirSize = size / 1048576  # bytes -> MiB
        print(f, "--Size:", dirSize)

