Python: Complexity of os.path.exists with an ext4 Filesystem

Python: complexity of os.path.exists with an ext4 filesystem?

The underlying directory structure used by Ext4 (and Ext3) is exactly the same as in Ext2. Ext3 adds journaling and Ext4 improves that journaling, but journaling is irrelevant to your question.

Originally, Ext2 stored directory entries as a linear list, which of course was inefficient for large directories. It has since been changed to a tweaked version of a B-tree called an HTree. Unlike a standard B-tree, an HTree has constant depth and uses hashed filenames as keys, so its lookup complexity is O(1).

Ext2's scheme, which we dubbed "HTree", uses 32-bit hashes for keys, where each hash key references a range of entries stored in a leaf block. Since internal nodes are only 8 bytes, HTrees have a very high fanout factor (over 500 blocks can be referenced using a 4K index block), two levels of index nodes are sufficient to support over 16 million 52-character filenames. To further simplify the implementation, HTrees are constant depth (either one or two levels). The combination of the high fanout factor and the use of a hash of the filename, plus a filesystem-specific secret to serve as the search key for the HTree, avoids the need for the implementation to do balancing operations.

See: http://ext2.sourceforge.net/2005-ols/paper-html/node3.html
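To make the constant-depth idea concrete, here is a toy two-level hash index in Python. It is only a sketch of the concept described above, not ext4's actual on-disk format; the fanout, hash function, and class name are arbitrary choices for illustration.

    import hashlib

    # Toy two-level hash index with constant depth, loosely modelled on
    # the HTree idea above.  Not ext4's on-disk format.
    FANOUT = 64

    class ToyHTree:
        def __init__(self):
            # root bucket -> second-level bucket -> leaf (a small list of names)
            self.root = [[[] for _ in range(FANOUT)] for _ in range(FANOUT)]

        def _buckets(self, name):
            # 32-bit hash of the name, split into two bucket indices
            h = int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "big")
            return (h >> 16) % FANOUT, h % FANOUT

        def add(self, name):
            i, j = self._buckets(name)
            self.root[i][j].append(name)

        def exists(self, name):
            # Constant work regardless of total size: two index steps,
            # then a scan of one small leaf.
            i, j = self._buckets(name)
            return name in self.root[i][j]

    index = ToyHTree()
    for n in range(100_000):
        index.add(f"file{n}.txt")
    print(index.exists("file99999.txt"))   # True
    print(index.exists("missing.txt"))     # False

Because the depth never grows, a lookup always touches the same small number of buckets no matter how many names are stored, which is the property the quoted paper is describing.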

What is the time complexity of Python's os.path.exists()?

Given that you are asking the OS whether a single file exists, it does not need to do any algorithmic work beyond resolving your path... I don't see how it could be anything other than O(1).
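If you want to sanity-check that empirically, a rough sketch is to build directories of increasing size and time os.path.exists in each; on an HTree-indexed filesystem the per-call time should stay roughly flat. The directory sizes, file names, and use of a temporary directory below are all arbitrary choices.

    import os
    import tempfile
    import timeit

    # Quick-and-dirty experiment: does os.path.exists get slower as a
    # directory grows?
    for count in (1_000, 10_000, 100_000):
        with tempfile.TemporaryDirectory() as d:
            for i in range(count):
                open(os.path.join(d, f"f{i}"), "w").close()
            target = os.path.join(d, f"f{count - 1}")
            elapsed = timeit.timeit(lambda: os.path.exists(target), number=10_000)
            print(f"{count:>7} entries: {elapsed / 10_000 * 1e6:.2f} us per call")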

Data destruction using shred against an ext4 filesystem

This is a complex issue.

The only way that is 100% effective is physical destruction. The problem is that the drive firmware can mark sectors as bad and remap them to a pool of spares. These sectors are effectively no longer accessible to you but the old data may be recoverable from those sectors by other means (such as an alternate firmware or physically removing the platters).

That being said, running shred on the whole block device is not affected by the journaling issue.

The problem with journaling is that, for partial overwrites to be recoverable, the filesystem cannot simply overwrite the original data in place; the overwrite of the file takes place in a second physical location, leaving the first intact. Writing directly to the block device is not subject to journaling.
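As a sketch of that distinction, here is a minimal shred-style overwrite of a single regular file in Python (the function name, pass count, and chunk size are made up for illustration). As the answer explains, on a journaling filesystem like ext4 this gives no guarantee the old blocks are gone; overwriting the whole block device avoids the journal but destroys everything on it.

    import os

    def overwrite_file(path, passes=3, chunk=1024 * 1024):
        """Overwrite a regular file in place with random data, then unlink it.

        Hypothetical shred-like sketch.  On a journaling filesystem such as
        ext4 the old data may still survive elsewhere (journal, remapped
        sectors), so this is NOT a guarantee of destruction.
        """
        size = os.path.getsize(path)
        with open(path, "r+b") as f:
            for _ in range(passes):
                f.seek(0)
                remaining = size
                while remaining > 0:
                    n = min(chunk, remaining)
                    f.write(os.urandom(n))
                    remaining -= n
                f.flush()
                os.fsync(f.fileno())   # push this pass to disk before the next
        os.remove(path)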

A Faster way of Directory walking instead of os.listdir?

I was just trying to figure out how to speed up os.walk on a largish file system (350,000 files spread out within around 50,000 directories). I'm on a Linux box using an ext3 file system. I discovered that there is a way to speed this up for MY case.

Specifically, using a top-down walk, any time os.walk returns a list of more than one directory, I use os.stat to get the inode number of each directory and sort the directory list by inode number. This makes the walk visit the subdirectories mostly in inode order, which reduces disk seeks.

For my use case, it sped up my complete directory walk from 18 minutes down to 13 minutes...
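A minimal sketch of that trick (the helper name and example path are made up; it assumes st_ino roughly reflects on-disk ordering, as it does in practice on ext3/ext4):

    import os

    def walk_by_inode(top):
        """Like os.walk (top-down), but visit subdirectories in inode order."""
        for dirpath, dirnames, filenames in os.walk(top):
            # With a top-down walk, sorting dirnames in place changes the
            # order in which os.walk descends, so subdirectories are read
            # in roughly on-disk order, reducing seeks.
            dirnames.sort(key=lambda d: os.stat(os.path.join(dirpath, d)).st_ino)
            yield dirpath, dirnames, filenames

    # Example usage (path is hypothetical):
    total = sum(len(files) for _, _, files in walk_by_inode("/some/large/tree"))
    print(total, "files")

The in-place sort works because os.walk (with the default topdown=True) only descends after yielding, so it honours whatever order the caller leaves in dirnames.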

Slow file lookup for large directories

Performance usually begins to degrade when a single directory holds something on the order of tens of thousands of files, so yes, half a million files in one directory will probably bring your machine to a crawl - this seems like a bad idea.


