Files.Walk(), Calculate Total Size

No, this exception cannot be avoided.

The exception occurs inside the lazy fetch of Files.walk(), which is why you do not see it early and why there is no way to circumvent it. Consider the following code:

long size = Files.walk(Paths.get("C://"))
.peek(System.out::println)
.mapToLong(this::count)
.sum();

On my computer this prints:

C:\
C:\$Recycle.Bin
Exception in thread "main" java.io.UncheckedIOException: java.nio.file.AccessDeniedException: C:\$Recycle.Bin\S-1-5-18

Because the exception is thrown on the (main) thread at the third file, all further execution on that thread stops.

I believe this is a design failure: as it stands, Files.walk is effectively unusable, because you can never guarantee that no errors will occur while walking a directory.

One important point to note is that the stack trace includes sum() and reduce() operations. This is because the paths are loaded lazily: at the point of reduce(), the bulk of the stream machinery is invoked (visible in the stack trace), and only then is the path fetched, at which point the UncheckedIOException occurs.

It could possibly be circumvented by letting every walking operation execute on its own thread, but that is not something you would want to do anyway.

Also, checking up front whether a file is actually accessible is largely worthless (though it may help to some extent), because you cannot guarantee that it will still be readable even 1 ms later.

Future extension

I believe it can still be fixed, though I do not know exactly how FileVisitOption works.

Currently there is a FileVisitOption.FOLLOW_LINKS; if it operates on a per-file basis, then I would suspect that a FileVisitOption.IGNORE_ON_IOEXCEPTION could also be added. However, we cannot inject that functionality there ourselves.

How to calculate size of immediate subfolders of a folder using os.walk()

I finally made this and it works fine:

import os
from pathlib import Path

root = '/dbfs/mnt/datalake/.../'

# Walk each immediate subfolder of root and sum the sizes of all files in it.
for folder in Path(root).iterdir():
    if not folder.is_dir():
        continue
    size = 0
    for path, subdirs, files in os.walk(folder):
        for name in files:
            size += os.path.getsize(os.path.join(path, name))
    dirSize = size / 1048576  # bytes -> MB
    print(folder, "--Size:", dirSize)

Get size of folder or file

java.io.File file = new java.io.File("myfile.txt");
file.length();

This returns the length of the file in bytes, or 0 if the file does not exist. There is no built-in way to get the size of a folder; you will have to walk the directory tree recursively (using the listFiles() method of a File object that represents a directory) and accumulate the size yourself:

public static long folderSize(File directory) {
    long length = 0;
    for (File file : directory.listFiles()) {
        if (file.isFile())
            length += file.length();
        else
            length += folderSize(file);
    }
    return length;
}

WARNING: This method is not sufficiently robust for production use. directory.listFiles() may return null and cause a NullPointerException. Also, it doesn't consider symlinks and may have other failure modes. Use a more robust approach, such as Files.walkFileTree, for production code.

Calculating a directory's size using Python?

This walks all sub-directories; summing file sizes:

import os

def get_size(start_path='.'):
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            # skip if it is symbolic link
            if not os.path.islink(fp):
                total_size += os.path.getsize(fp)

    return total_size

print(get_size(), 'bytes')

And a one-liner for fun using os.listdir (does not include sub-directories):

import os
sum(os.path.getsize(f) for f in os.listdir('.') if os.path.isfile(f))

Reference:

  • os.path.getsize - Gives the size in bytes
  • os.walk
  • os.path.islink

Updated
The answer now uses os.path.getsize, which is clearer than using the os.stat().st_size method.

Thanks to ghostdog74 for pointing this out!

os.stat - st_size gives the size in bytes, and can also be used to get file size and other file-related information.

import os

nbytes = sum(d.stat().st_size for d in os.scandir('.') if d.is_file())

Update 2018

If you use Python 3.4 or earlier, consider the more efficient walk method provided by the third-party scandir package. In Python 3.5 and later, that package has been incorporated into the standard library, and os.walk has received the corresponding increase in performance.
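
As a rough sketch of that drop-in switch (this assumes the third-party scandir package is installed on older interpreters; it is not part of the original answer):

import os

try:
    from scandir import walk   # third-party scandir package on Python <= 3.4
except ImportError:
    from os import walk        # Python >= 3.5: os.walk is already scandir-based

total = sum(
    os.path.getsize(os.path.join(dirpath, name))
    for dirpath, dirnames, filenames in walk('.')
    for name in filenames
)
print(total, 'bytes')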

Update 2019

Recently I've been using pathlib more and more, so here's a pathlib solution:

from pathlib import Path

root_directory = Path('.')
sum(f.stat().st_size for f in root_directory.glob('**/*') if f.is_file())

Using os.walk to find total size of FTP server

Use this function to fetch the size of a directory using an ftplib client.

import ftplib

def get_size_of_directory(ftp, directory):
    size = 0
    for name in ftp.nlst(directory):
        try:
            # If we can change into it, it is a directory: recurse into it.
            ftp.cwd(name)
            size += get_size_of_directory(ftp, name)
        except ftplib.error_perm:
            # Otherwise it is a file: switch to binary mode and read its size.
            ftp.voidcmd('TYPE I')
            size += ftp.size(name)
    return size

The function calls itself recursively for each directory it finds inside the given directory.
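
A minimal usage sketch (the host, credentials, and starting path below are placeholders for illustration only, not from the original answer):

from ftplib import FTP

ftp = FTP('ftp.example.com')       # hypothetical host
ftp.login('user', 'password')      # hypothetical credentials

total_bytes = get_size_of_directory(ftp, '/')
print(total_bytes / 1048576, 'MB')

ftp.quit()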
Hope this helps!

very quickly getting total size of folder

You are at a disadvantage.

Windows Explorer almost certainly uses FindFirstFile/FindNextFile to both traverse the directory structure and collect size information (through lpFindFileData) in one pass, making what is essentially a single system call per file.

Python is unfortunately not your friend in this case. Thus,

  1. os.walk first calls os.listdir (which internally calls FindFirstFile/FindNextFile)

    • any additional system calls made from this point onward can only make you slower than Windows Explorer
  2. os.walk then calls isdir for each file returned by os.listdir (which internally calls GetFileAttributesEx -- or, prior to Win2k, a GetFileAttributes+FindFirstFile combo) to redetermine whether to recurse or not
  3. os.walk and os.listdir will perform additional memory allocation, string and array operations etc. to fill out their return value
  4. you then call getsize for each file returned by os.walk (which again calls GetFileAttributesEx)

That is 3x more system calls per file than Windows Explorer, plus memory allocation and manipulation overhead.

You can either use Anurag's solution, or try to call FindFirstFile/FindNextFile directly and recursively (which should be comparable to the performance of a Cygwin or other Win32 port of du -s some_directory).

Refer to os.py for the implementation of os.walk, posixmodule.c for the implementation of listdir and win32_stat (invoked by both isdir and getsize.)

Note that Python's os.walk is suboptimal on all platforms (Windows and *nixes), up to and including Python 3.1. On both Windows and *nix, os.walk could achieve traversal in a single pass without calling isdir, since both FindFirst/FindNext (Windows) and opendir/readdir (*nix) already return the file type via lpFindFileData->dwFileAttributes (Windows) and dirent::d_type (*nix).

Perhaps counterintuitively, on most modern configurations (e.g. Win7 and NTFS, and even some SMB implementations) GetFileAttributesEx is twice as slow as FindFirstFile of a single file (possibly even slower than iterating over a directory with FindNextFile.)

Update: Python 3.5 includes the new PEP 471 os.scandir() function that solves this problem by returning file attributes along with the filename. This new function is used to speed up the built-in os.walk() (on both Windows and Linux). You can use the scandir module on PyPI to get this behavior for older Python versions, including 2.x.
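
To sketch what that buys you (assuming Python 3.5+; this example is not from the original answer), the total size can be summed with os.scandir directly, so the file type check and, on Windows, the size itself come from the data FindFirstFile/FindNextFile already returned:

import os

def tree_size(path='.'):
    """Recursively sum file sizes using the DirEntry metadata cached by os.scandir."""
    total = 0
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            total += tree_size(entry.path)
        elif entry.is_file(follow_symlinks=False):
            # On Windows this stat() needs no extra system call;
            # on POSIX it still costs one lstat per file for the size.
            total += entry.stat(follow_symlinks=False).st_size
    return total

print(tree_size('.'), 'bytes')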

Avoid Java 8 Files.walk(..) termination cause of ( java.nio.file.AccessDeniedException )

Answer

Here is a temporary solution, which can be improved to use Java 8 streams and lambdas.

int[] count = {0};
try {
    Files.walkFileTree(
            Paths.get(dir.getPath()),
            new HashSet<FileVisitOption>(Arrays.asList(FileVisitOption.FOLLOW_LINKS)),
            Integer.MAX_VALUE,
            new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                        throws IOException {
                    System.out.printf("Visiting file %s\n", file);
                    ++count[0];
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult visitFileFailed(Path file, IOException e)
                        throws IOException {
                    System.err.printf("Visiting failed for %s\n", file);
                    return FileVisitResult.SKIP_SUBTREE;
                }

                @Override
                public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs)
                        throws IOException {
                    System.out.printf("About to visit directory %s\n", dir);
                    return FileVisitResult.CONTINUE;
                }
            });
} catch (IOException e) {
    // handle exception
}

How to get directory total size?

Using a global like that at best is bad practice.
It's also a race if DirSizeMB is called concurrently.

The simple solution is to use a closure, e.g.:

func DirSize(path string) (int64, error) {
    var size int64
    err := filepath.Walk(path, func(_ string, info os.FileInfo, err error) error {
        if err != nil {
            return err
        }
        if !info.IsDir() {
            size += info.Size()
        }
        return err
    })
    return size, err
}


You could assign the closure to a variable if you think that looks better.


